Word2vec
--------

Introduction
~~~~~~~~~~~~

The Word2vec algorithm takes a text `corpus `__ as input and produces word vectors as output. The algorithm first creates a vocabulary from the training text data and then learns vector representations of the words. The vector space can include hundreds of dimensions, with each unique word in the sample corpus being assigned a corresponding vector in the space. In addition, words that share similar contexts in the corpus are placed in close proximity to one another in the space. The result is an H2O Word2vec model that can be exported as a binary model or as a MOJO. The word vectors from this model can be used as features in many natural language processing and machine learning applications.

**Notes**:

-  Word2vec is not currently supported under Python.
-  A Word2vec demo in R using a Craigslist job titles dataset is available `here `__.

Defining a Word2vec Model
~~~~~~~~~~~~~~~~~~~~~~~~~

-  `model_id `__: (Optional) Specify a custom name for the model to use as a reference. By default, H2O automatically generates a destination key.
-  `training_frame `__: (Required) Specify the dataset used to build the model. **NOTE**: In Flow, if you click the **Build a model** button from the ``Parse`` cell, the training frame is entered automatically.
-  **min_word_freq**: Specify an integer for the minimum word frequency. Word2vec discards words that appear fewer than this number of times.
-  **word_model**: Specify "SkipGram" to use the Skip-Gram model when producing a distributed representation of words. When enabled, the model uses each word to predict the surrounding window of context words. The Skip-Gram architecture weights nearby context words more heavily than more distant ones. Using Skip-Gram can increase model build time, but it performs better for infrequently used words. **NOTE**: This option is specified by default and cannot be disabled. It is currently the only approach supported in H2O.
-  **norm_model**: Specify "HSM" to use Hierarchical Softmax. When enabled, Word2vec uses a `Huffman tree `__ to reduce calculations when approximating the conditional log-likelihood that the model is attempting to maximize. This option is useful for infrequent words, but it becomes less useful as the number of training epochs increases. **NOTE**: This option is specified by default and cannot be disabled. It is currently the only approach supported in H2O.
-  **vec_size**: Specify the size of word vectors.
-  **window_size**: Specify the size of the context window around a given word. For example, consider the following string: "Lorem ipsum (dolor sit amet, quot hendrerit) pri cu,..." For the target word "amet" and ``window_size=2``, the context consists of the words dolor, sit, quot, and hendrerit.
-  **sent_sample_rate**: Set the threshold for the occurrence of words. Words that appear with higher frequency in the training data are randomly down-sampled. An ideal range for this option is (0, 1e-5).
-  **init_learning_rate**: Set the starting learning rate.
-  **epochs**: Specify the number of training iterations to run.
-  **pre_trained**: Specify the ID of a data frame that contains a pre-trained (external) Word2vec model.

Interpreting a Word2vec Model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

By default, the following output displays:

-  Model parameters
-  Output (model category, model summary, cross validation metrics, validation metrics)
-  Column names
-  Domains (for categorical columns)

Transforming Words to Vectors
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A ``transform`` function is available for use with Word2vec. This function transforms words to vectors using an existing Word2vec model and has the following usage:

::

    h2o.transform(word2vec, words, aggregate_method)

-  ``word2vec``: A Word2vec model
-  ``words``: An H2O Frame made of a single column containing source words. You can also pass a subset of this frame.
-  ``aggregate_method``: Specifies how to aggregate sequences of words. If the method is ``NONE``, then no aggregation is performed, and each input word is mapped to a single word vector. If the method is ``AVERAGE``, then the input is treated as sequences of words delimited by NA. Each word of a sequence is internally mapped to a vector, and vectors belonging to the same sentence are averaged and returned in the result.

More information about this function can be found in the `H2O-3 GitHub repository `__.

**Example**

.. example-code::
   .. code-block:: r

      # Build a dummy word2vec model
      library(h2o)
      h2o.init(nthreads = -1)
      data <- as.character(as.h2o(c("a", "b", "a")))
      w2v.model <- h2o.word2vec(data, sent_sample_rate = 0, min_word_freq = 0, epochs = 1, vec_size = 2)

      # Transform words to vectors without aggregation
      sentences <- as.character(as.h2o(c("b", "c", "a", NA, "b")))
      h2o.transform(w2v.model, sentences) # -> 5 rows total, 2 rows NA ("c" is not in the vocabulary)

      # Transform words to vectors and return average vector for each sentence
      h2o.transform(w2v.model, sentences, aggregate_method = "AVERAGE") # -> 2 rows

      # For reference: the definition of h2o.transform.
      # The h2o package already provides this function, so you do not need to define it yourself.
      h2o.transform <- function(word2vec, words, aggregate_method = c("NONE", "AVERAGE")) {
          # Validate the model and the input frame
          if (!is(word2vec, "H2OModel")) stop("`word2vec` must be a word2vec model")
          if (missing(words)) stop("`words` must be specified")
          if (!is.H2OFrame(words)) stop("`words` must be an H2OFrame")
          if (ncol(words) != 1) stop("`words` frame must contain a single string column")
          # Fall back to the first method ("NONE") when none is chosen explicitly
          if (length(aggregate_method) > 1)
              aggregate_method <- aggregate_method[1]
          # Ask the H2O backend to transform the words, then fetch the resulting frame
          res <- .h2o.__remoteSend(method="GET", "Word2VecTransform", model = word2vec@model_id,
                                   words_frame = h2o.getId(words), aggregate_method = aggregate_method)
          key <- res$vectors_frame$name
          h2o.getFrame(key)
      }

References
~~~~~~~~~~

`Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. "Efficient Estimation of Word Representations in Vector Space." In Proceedings of Workshop at ICLR. (Sep 2013) `__

`Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. "Distributed Representations of Words and Phrases and their Compositionality." In Proceedings of NIPS. (Oct 2013) `__

`Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. "Linguistic Regularities in Continuous Space Word Representations." In Proceedings of NAACL HLT. (May 2013) `__

`Tomas Mikolov, Quoc V. Le and Ilya Sutskever. "Exploiting Similarities among Languages for Machine Translation." (Sep 2013) `__

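
Context Window Sketch
~~~~~~~~~~~~~~~~~~~~~

The **window_size** example from the parameter list can be reproduced with a few lines of base R. This is only an illustrative sketch: ``context_window`` is a hypothetical helper, not part of the h2o package, and it assumes a fixed, symmetric window over tokens that have already been split and stripped of punctuation. It does not attempt to reproduce how H2O selects contexts during training.

.. code-block:: r

   # Hypothetical helper (not an h2o function): return the words inside the
   # context window around position `target` in the character vector `tokens`.
   context_window <- function(tokens, target, window_size) {
     # Clamp the window to the bounds of the token vector
     lo <- max(1, target - window_size)
     hi <- min(length(tokens), target + window_size)
     # Keep every position in the window except the target itself
     tokens[setdiff(lo:hi, target)]
   }

   # Tokens from the example string, with punctuation removed
   tokens <- c("Lorem", "ipsum", "dolor", "sit", "amet",
               "quot", "hendrerit", "pri", "cu")

   # Context of "amet" with window_size = 2
   context <- context_window(tokens, which(tokens == "amet"), window_size = 2)
   print(context)  # "dolor" "sit" "quot" "hendrerit"

Near the start or end of the token vector the window is simply truncated, so the first word "Lorem" would have only "ipsum" and "dolor" as its context under the same setting.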