Word2vec¶
Introduction¶
The Word2vec algorithm takes a text corpus as input and produces word vectors as output. The algorithm first creates a vocabulary from the training text data and then learns vector representations of the words. The vector space can include hundreds of dimensions, with each unique word in the corpus assigned a corresponding vector in the space. Words that share similar contexts in the corpus are placed in close proximity to one another in the space. The result is an H2O Word2vec model that can be exported as a binary model or as a MOJO, and the learned word vectors can be used as features in many natural language processing and machine learning applications.
Notes:
- Word2vec is not currently supported under Python.
- A Word2vec demo in R using a Craigslist job titles dataset is available here.
Defining a Word2vec Model¶
model_id: (Optional) Specify a custom name for the model to use as a reference. By default, H2O automatically generates a destination key.
training_frame: (Required) Specify the dataset used to build the model. NOTE: In Flow, if you click the Build a model button from the Parse cell, the training frame is entered automatically.
min_word_freq: Specify an integer for the minimum word frequency. Word2vec will discard words that appear fewer than this number of times.
word_model: Specify “SkipGram” to use the Skip-Gram model when producing a distributed representation of words. When enabled, the model uses each word to predict the surrounding window of context words. The skip-gram architecture weighs close context words more heavily than more distant context words. Skip-Gram can increase model build time but performs better for infrequent words. NOTE: This option is specified by default and cannot be disabled. It is currently the only approach supported in H2O.
norm_model: Specify “HSM” to use Hierarchical Softmax. When enabled, Word2vec uses a Huffman tree to reduce calculations when approximating the conditional log-likelihood that the model is attempting to maximize. This option is useful for infrequent words but becomes less useful as the number of training epochs increases. NOTE: This option is specified by default and cannot be disabled. It is currently the only approach supported in H2O.
vec_size: Specify the size of word vectors.
window_size: This specifies the size of the context window around a given word. For example, consider the following string:
“Lorem ipsum (dolor sit amet, quot hendrerit) pri cu,...”
For the target word “amet” and window_size = 2, the context consists of the words: dolor, sit, quot, and hendrerit.
sent_sample_rate: Set the threshold for the occurrence of words. Words that appear with higher frequency in the training data will be randomly down-sampled. An ideal range for this option is (0, 1e-5).
init_learning_rate: Set the starting learning rate.
epochs: Specify the number of training iterations to run.
pre_trained: Specify the ID of a data frame that contains a pre-trained (external) Word2vec model.
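Below is a minimal sketch of defining a model with the parameters described above. The single-column H2OFrame of tokenized words (job.titles.tokens) is a hypothetical input, and the parameter values are illustrative rather than recommendations.

# Sketch: build a Word2vec model from a hypothetical tokenized frame
library(h2o)
h2o.init(nthreads = -1)
w2v.model <- h2o.word2vec(job.titles.tokens,         # hypothetical single-column H2OFrame of words
                          model_id = "w2v_demo",     # optional custom destination key
                          min_word_freq = 5,         # discard words seen fewer than 5 times
                          vec_size = 100,            # dimensionality of the word vectors
                          window_size = 5,           # context window around each target word
                          sent_sample_rate = 0.001,  # down-sample very frequent words
                          init_learning_rate = 0.025,
                          epochs = 5)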
Interpreting a Word2vec Model¶
By default, the following output displays:
- Model parameters
- Output (model category, model summary, cross validation metrics, validation metrics)
- Column names
- Domains (for categorical columns)
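Beyond this printed output, a trained model can be queried directly. The sketch below assumes a trained model named w2v.model (as built in the example later on this page); h2o.findSynonyms returns the words whose vectors lie closest to a given word, and the word used here is only illustrative.

# Sketch: inspect a trained Word2vec model
print(w2v.model)                                    # model parameters and summary
h2o.findSynonyms(w2v.model, "teacher", count = 5)   # nearest words to "teacher" in the vector space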
Transforming Words to Vectors¶
A transform function is available for use with Word2vec. This function transforms words to vectors using an existing Word2Vec model and has the following usage:
h2o.transform(word2vec, words, aggregate_method)
- word2vec: A Word2Vec model.
- words: An H2OFrame made of a single column containing the source words. Note that you can specify a subset of this frame.
- aggregate_method: Specifies how to aggregate sequences of words. If the method is NONE, then no aggregation is performed, and each input word is mapped to a single word vector. If the method is AVERAGE, then the input is treated as sequences of words delimited by NA. Each word of a sequence is internally mapped to a vector, and the vectors belonging to the same sequence are averaged and returned in the result.
More information about this function can be found in the H2O-3 GitHub repository.
Example
# Build a dummy word2vec model
library(h2o)
h2o.init(nthreads = -1)
data <- as.character(as.h2o(c("a", "b", "a")))
w2v.model <- h2o.word2vec(data, sent_sample_rate = 0, min_word_freq = 0, epochs = 1, vec_size = 2)
# Transform words to vectors without aggregation
sentences <- as.character(as.h2o(c("b", "c", "a", NA, "b")))
h2o.transform(w2v.model, sentences) # -> 5 rows total, 2 rows NA ("c" is not in the vocabulary)
# Transform words to vectors and return average vector for each sentence
h2o.transform(w2v.model, sentences, aggregate_method = "AVERAGE") # -> 2 rows
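For reference, the source of the h2o.transform function is shown below: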
h2o.transform <- function(word2vec, words, aggregate_method = c("NONE", "AVERAGE")) {
  if (!is(word2vec, "H2OModel")) stop("`word2vec` must be a word2vec model")
  if (missing(words)) stop("`words` must be specified")
  if (!is.H2OFrame(words)) stop("`words` must be an H2OFrame")
  if (ncol(words) != 1) stop("`words` frame must contain a single string column")
  if (length(aggregate_method) > 1)
    aggregate_method <- aggregate_method[1]
  res <- .h2o.__remoteSend(method = "GET", "Word2VecTransform", model = word2vec@model_id,
                           words_frame = h2o.getId(words), aggregate_method = aggregate_method)
  key <- res$vectors_frame$name
  h2o.getFrame(key)
}
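As noted in the introduction, the resulting model can be exported as a binary model or as a MOJO. A brief sketch (the destination path is illustrative):

# Sketch: export the trained model
h2o.saveModel(w2v.model, path = "/tmp")       # binary H2O model
h2o.download_mojo(w2v.model, path = "/tmp")   # MOJO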