Deploying PySparkling Pipeline Models

This tutorial demonstrates how we can import PySparkling pipeline models for scoring.

Let’s first create and export the model as:

from pyspark.ml import Pipeline
from pysparkling import *
from pysparkling.ml import *

hc = H2OContext.getOrCreate(spark)

# Helper method to locate the data file
def locate(file_name):
    return "examples/smalldata/smsData.txt"

# Prepare the data
def load():
    row_rdd = spark.sparkContext.textFile(locate("smsData.txt")).map(lambda x: x.split("\t", 1)).filter(lambda r: r[0].strip())
    return spark.createDataFrame(row_rdd, ["label", "text"])

# load the data
data = load()

# Create the H2O GBM pipeline stage
gbm = H2OGBM(splitRatio=0.8, seed=1, labelCol="label")

# Create a pipeline with a single GBM step
pipeline = Pipeline(stages=[gbm])

# Fit and export the pipeline
model = pipeline.fit(data)
model.save("exported_model")

Once we have exported the model, let’s start a new ./pysparkling shell as we want to demonstrate that for scoring, H2OContext does not need to be created as the H2OGBM step is internally using MOJO which does not require run-time of H2O.

First, we need to ensure that all Java classes internally stored in the PySparkling distribution are distributed in the Spark cluster. For that, we use the following code:

from pysparkling.initializer import Initializer
Initializer.load_sparkling_jar(spark)

Once we initialized PySparkling, we can load the model as:

from pyspark.ml import PipelineModel
model = PipelineModel.load("exported_model")

And we can run the predictions on the model as:

df_for_predictions = ..
model.transform(df_for_predictions)

If we don’t initialize the PySparkling using the Initializer, we would get class not found exception during loading the model as Spark would not now about the required classes. But as we can see, we do not need to initialize H2OContext for scoring tasks.