Train Word2Vec Model in Sparkling Water
Sparkling Water provides an API for H2O Word2Vec in Scala and Python. The following sections describe how to train a Word2Vec model in Sparkling Water in both languages. See also Parameters of H2OWord2Vec.
- Scala
- Python
First, start Sparkling Shell:
./bin/sparkling-shell
Start the H2O cluster inside the Spark environment:
import ai.h2o.sparkling._
import java.net.URI
val hc = H2OContext.getOrCreate()
Load the data into a Spark DataFrame and split it into training and testing sets:
import org.apache.spark.SparkFiles
spark.sparkContext.addFile("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/craigslistJobTitles.csv")
val sparkDF = spark.read.option("header", "true").option("inferSchema", "true").csv(SparkFiles.get("craigslistJobTitles.csv"))
val Array(trainingDF, testingDF) = sparkDF.randomSplit(Array(0.8, 0.2))
Create the pipeline with H2O Word2Vec. You can configure all available Word2Vec parameters using the provided setters:
import ai.h2o.sparkling.ml.algos.H2OGBM
import ai.h2o.sparkling.ml.features.H2OWord2Vec
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{RegexTokenizer, StopWordsRemover}
val tokenizer = new RegexTokenizer()
  .setInputCol("jobtitle")
  .setMinTokenLength(2)
val stopWordsRemover = new StopWordsRemover()
  .setInputCol(tokenizer.getOutputCol)
val w2v = new H2OWord2Vec()
  .setSentSampleRate(0)
  .setEpochs(10)
  .setInputCol(stopWordsRemover.getOutputCol)
val gbm = new H2OGBM()
  .setLabelCol("category")
  .setFeaturesCols(w2v.getOutputCol)
val pipeline = new Pipeline().setStages(Array(tokenizer, stopWordsRemover, w2v, gbm))
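To make the first two pipeline stages more concrete, the following plain-Scala sketch approximates what they do to a single job title before the text reaches Word2Vec. It needs no Spark and is only an illustration: the object name, helper names, and the tiny stop-word list are hypothetical (Spark's default English stop-word list is much longer), and the tokenizer is assumed to use its defaults of lowercasing and splitting on whitespace.

```scala
// Plain-Scala approximation of the two preprocessing stages (no Spark required).
object PreprocessSketch {
  // Hypothetical, tiny stop-word list used only for this illustration.
  val stopWords: Set[String] = Set("a", "an", "and", "for", "in", "of", "the", "to")

  // Approximates RegexTokenizer with default settings: lowercase the text,
  // split on whitespace, then drop tokens shorter than minTokenLength
  // (the pipeline above sets this to 2).
  def tokenize(text: String, minTokenLength: Int = 2): List[String] =
    text.toLowerCase.split("\\s+").toList.filter(_.length >= minTokenLength)

  // Approximates StopWordsRemover with case-insensitive matching.
  def removeStopWords(tokens: List[String]): List[String] =
    tokens.filterNot(t => stopWords.contains(t.toLowerCase))

  // Chains both steps in the same order as the pipeline stages.
  def preprocess(title: String): List[String] =
    removeStopWords(tokenize(title))
}

// "of" and "and" are removed as stop words; the remaining tokens are lowercased.
val tokens = PreprocessSketch.preprocess("Senior Manager of Sales and Marketing")
```

In this sketch the title is reduced to the tokens senior, manager, sales, and marketing, which is the kind of token list the Word2Vec stage receives as its input column.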
Train the pipeline:
val model = pipeline.fit(trainingDF)
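Because the GBM stage reads the Word2Vec output column as its features, each token list has to become a fixed-length numeric vector. The sketch below illustrates one common way to do this, averaging the vectors of the individual words. It assumes that the stage aggregates by averaging (H2O-3 offers an AVERAGE aggregation method for Word2Vec transforms); the embeddings, the names, and the choice to skip unknown words are hypothetical and serve only to show the arithmetic.

```scala
// Plain-Scala illustration of averaging word vectors into one sentence vector.
object SentenceVectorSketch {
  // Hypothetical 3-dimensional embeddings; a real model learns these during fit.
  val embeddings: Map[String, Array[Double]] = Map(
    "sales"   -> Array(1.0, 0.0, 2.0),
    "manager" -> Array(3.0, 2.0, 0.0)
  )

  // Averages the vectors of all tokens that have an embedding.
  // Tokens without an embedding are skipped (an assumption of this sketch).
  // Returns None when no token has an embedding.
  def average(tokens: Seq[String]): Option[Array[Double]] = {
    val known = tokens.flatMap(t => embeddings.get(t))
    if (known.isEmpty) None
    else {
      val dim = known.head.length
      // Element-wise sum of all known vectors.
      val sums = known.foldLeft(Array.fill(dim)(0.0)) { (acc, v) =>
        acc.zip(v).map { case (a, b) => a + b }
      }
      Some(sums.map(_ / known.size))
    }
  }
}

// Averages the vectors for "sales" and "manager"; "unknown" is ignored.
val sentenceVector = SentenceVectorSketch.average(Seq("sales", "manager", "unknown"))
```

Whatever the length of the job title, the result has the same dimensionality as the word vectors, which is what lets a tabular learner such as GBM consume it.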
Run predictions:
model.transform(testingDF).show(false)