Train Word2Vec Model in Sparkling Water
Sparkling Water provides an API for H2O Word2Vec in Scala and Python. The following sections describe how to train a Word2Vec model in Sparkling Water in both languages.
Scala
First, let’s start the Sparkling Shell:
./bin/sparkling-shell
Start an H2O cluster inside the Spark environment:
import ai.h2o.sparkling._
val hc = H2OContext.getOrCreate()
Download the data, load it into a Spark DataFrame, and split it into training and testing sets:
import org.apache.spark.SparkFiles
spark.sparkContext.addFile("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/craigslistJobTitles.csv")
val sparkDF = spark.read.option("header", "true").option("inferSchema", "true").csv(SparkFiles.get("craigslistJobTitles.csv"))
val Array(trainingDF, testingDF) = sparkDF.randomSplit(Array(0.8, 0.2))
Create the pipeline with H2OWord2Vec. You can configure all available Word2Vec arguments using the provided setters.
import ai.h2o.sparkling.ml.algos.H2OGBM
import ai.h2o.sparkling.ml.features.H2OWord2Vec
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{RegexTokenizer, StopWordsRemover}
val tokenizer = new RegexTokenizer()
.setInputCol("jobtitle")
.setMinTokenLength(2)
val stopWordsRemover = new StopWordsRemover()
.setInputCol(tokenizer.getOutputCol)
val w2v = new H2OWord2Vec()
.setSentSampleRate(0)
.setEpochs(10)
.setInputCol(stopWordsRemover.getOutputCol)
val gbm = new H2OGBM()
.setLabelCol("category")
.setFeaturesCols(w2v.getOutputCol)
val pipeline = new Pipeline().setStages(Array(tokenizer, stopWordsRemover, w2v, gbm))
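If you need more control over the embeddings, H2O Word2Vec exposes further arguments such as the vector size, context window size, and minimum word frequency. The snippet below is only a sketch: the setter names assume Sparkling Water's usual mapping of the underlying H2O parameters (vec_size, window_size, min_word_freq), so verify them against the parameter reference of your release.
// Sketch only: setVecSize, setWindowSize and setMinWordFreq are assumed to map
// to the H2O Word2Vec arguments vec_size, window_size and min_word_freq.
val tunedW2v = new H2OWord2Vec()
  .setInputCol(stopWordsRemover.getOutputCol)
  .setEpochs(10)
  .setVecSize(100)   // dimensionality of the learned word vectors
  .setWindowSize(5)  // size of the context window
  .setMinWordFreq(5) // ignore words occurring fewer than 5 times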
Train the pipeline:
val model = pipeline.fit(trainingDF)
Run predictions:
model.transform(testingDF).show(false)
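The fitted pipeline is a regular Spark ML PipelineModel, so it can be persisted with the standard Spark ML writer and loaded back later. This is a minimal sketch: the path is only a placeholder, and it assumes all stages in your Sparkling Water version support pipeline serialization.
import org.apache.spark.ml.PipelineModel
// Persist the fitted pipeline (placeholder path).
model.write.overwrite().save("/tmp/word2vec-gbm-pipeline")
// Load it back and score new data.
val loadedModel = PipelineModel.load("/tmp/word2vec-gbm-pipeline")
loadedModel.transform(testingDF).show(false)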
Python
First, let’s start the PySparkling Shell:
./bin/pysparkling
Start an H2O cluster inside the Spark environment:
from pysparkling import *
hc = H2OContext.getOrCreate()
Parse the data using H2O, convert it to a Spark DataFrame, and split it into training and testing sets:
import h2o
frame = h2o.import_file("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/craigslistJobTitles.csv")
sparkDF = hc.asSparkFrame(frame.set_names(['category', 'jobtitle']))
[trainingDF, testingDF] = sparkDF.randomSplit([0.8, 0.2])
Create the pipeline with H2OWord2Vec. You can configure all available Word2Vec arguments using named constructor arguments or the provided setters.
from pysparkling.ml import H2OGBM, H2OWord2Vec
from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover
tokenizer = RegexTokenizer(inputCol="jobtitle", minTokenLength=2)
stopWordsRemover = StopWordsRemover(inputCol=tokenizer.getOutputCol())
w2v = H2OWord2Vec(sentSampleRate=0, epochs=10, inputCol=stopWordsRemover.getOutputCol())
gbm = H2OGBM(labelCol="category", featuresCols=[w2v.getOutputCol()])
pipeline = Pipeline(stages=[tokenizer, stopWordsRemover, w2v, gbm])
Train the pipeline:
model = pipeline.fit(trainingDF)
Run predictions:
model.transform(testingDF).show(truncate=False)
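To make the output easier to read, you can keep just the label, the input text, and the prediction. This is a sketch: the column name "prediction" is assumed to be the default output column of the H2OGBM stage, so adjust it if your Sparkling Water version uses a different name.
# Sketch: "prediction" is the assumed default output column of H2OGBM.
predictions = model.transform(testingDF)
predictions.select("category", "jobtitle", "prediction").show(10, truncate=False)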