Deploying PySparkling Pipeline Models¶
This tutorial demonstrates how we can import PySparkling pipeline models for scoring.
Let’s first create and export the model as:
from pyspark.ml import Pipeline from pysparkling import * from pysparkling.ml import * hc = H2OContext.getOrCreate(spark) # Helper method to locate the data file def locate(file_name): return "examples/smalldata/smsData.txt" # Prepare the data def load(): row_rdd = spark.sparkContext.textFile(locate("smsData.txt")).map(lambda x: x.split("\t", 1)).filter(lambda r: r.strip()) return spark.createDataFrame(row_rdd, ["label", "text"]) # load the data data = load() # Create the H2O GBM pipeline stage gbm = H2OGBM(ratio=0.8, seed=1, predictionCol="label") # Create a pipeline with a single GBM step pipeline = Pipeline(stages=[gbm]) # Fit and export the pipeline model = pipeline.fit(data) model.save("exported_model")
Once we have exported the model, let’s start a new
./pysparkling shell as we want to demonstrate that for scoring,
H2OContext does not need to be created as the
H2OGBM step is internally using MOJO which does not require run-time of H2O.
First, we need to ensure that all Java classes internally stored in the PySparkling distribution are distributed in the Spark cluster. For that, we use the following code:
from pysparkling.initializer import Initializer Initializer.load_sparkling_jar(spark)
Once we initialized PySparkling, we can load the model as:
from pyspark.ml import PipelineModel model = PipelineModel.load("exported_model")
And we can run the predictions on the model as:
df_for_predictions = .. model.transform(df_for_predictions)
If we don’t initialize the PySparkling using the
Initializer, we would get
class not found exception during loading
the model as Spark would not now about the required classes. But as we can see, we do not need to initialize
for scoring tasks.