Train Word2Vec Model in Sparkling Water

Sparkling Water provides API for H2O Word2Vec in Scala and Python. The following sections describe how to train the Word2Vec model in Sparkling Water in both languages. See also Parameters of H2OWord2Vec.

  • Scala
  • Python

First, let’s start Sparkling Shell as

./bin/sparkling-shell

Start H2O cluster inside the Spark environment

import ai.h2o.sparkling._
import java.net.URI
val hc = H2OContext.getOrCreate()

Parse the data using H2O and convert them to Spark Frame

import org.apache.spark.SparkFiles
spark.sparkContext.addFile("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/craigslistJobTitles.csv")
val sparkDF = spark.read.option("header", "true").option("inferSchema", "true").csv(SparkFiles.get("craigslistJobTitles.csv"))
val Array(trainingDF, testingDF) = sparkDF.randomSplit(Array(0.8, 0.2))

Create the pipeline with the H2O Word2Vec. You can configure all the available Word2Vec arguments using provided setters.

import ai.h2o.sparkling.ml.algos.H2OGBM
import ai.h2o.sparkling.ml.features.H2OWord2Vec
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{RegexTokenizer, StopWordsRemover}

val tokenizer = new RegexTokenizer()
  .setInputCol("jobtitle")
  .setMinTokenLength(2)

val stopWordsRemover = new StopWordsRemover()
  .setInputCol(tokenizer.getOutputCol)

val w2v = new H2OWord2Vec()
  .setSentSampleRate(0)
  .setEpochs(10)
  .setInputCol(stopWordsRemover.getOutputCol)

val gbm = new H2OGBM()
  .setLabelCol("category")
  .setFeaturesCols(w2v.getOutputCol)

val pipeline = new Pipeline().setStages(Array(tokenizer, stopWordsRemover, w2v, gbm))

Train the pipeline:

val model = pipeline.fit(trainingDF)

Run Predictions

model.transform(testingDF).show(false)