Train GLM Model in Sparkling Water
----------------------------------

Introduction
~~~~~~~~~~~~

Generalized Linear Models (GLM) estimate regression models for outcomes following exponential-family distributions.
In addition to the Gaussian (i.e. normal) distribution, these include Poisson, binomial, and gamma distributions.
Each serves a different purpose and, depending on the choice of distribution and link function, can be used
either for prediction or classification.

For a more comprehensive description, see the H2O-3 GLM documentation.

Example
~~~~~~~

The following section describes how to train a GLM model in Sparkling Water in Scala & Python, following the same example as the H2O-3
documentation mentioned above. See also :ref:`parameters_H2OGLM`
and :ref:`model_details_H2OGLMMOJOModel`.

.. content-tabs::

    .. tab-container:: Scala
        :title: Scala

        First, start the Sparkling Shell:

        .. code:: shell

            ./bin/sparkling-shell

        Start the H2O cluster inside the Spark environment:

        .. code:: scala

            import ai.h2o.sparkling._
            import java.net.URI
            val hc = H2OContext.getOrCreate()

        Load the data into a Spark DataFrame, cast the categorical columns to strings, and split the data into training and testing sets:

        .. code:: scala

            import org.apache.spark.SparkFiles
            val datasetUrl = "https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate/prostate.csv"
            spark.sparkContext.addFile(datasetUrl) //for example purposes, on a real cluster it's better to load directly from distributed storage
            val rawSparkDF = spark.read.option("header", "true").option("inferSchema", "true").csv(SparkFiles.get("prostate.csv"))
            val sparkDF = rawSparkDF.withColumn("CAPSULE", $"CAPSULE" cast "string")
              .withColumn("RACE", $"RACE" cast "string")
              .withColumn("DCAPS", $"DCAPS" cast "string")
              .withColumn("DPROS", $"DPROS" cast "string")
            val Array(trainingDF, testingDF) = sparkDF.randomSplit(Array(0.8, 0.2))

        Train the model. You can configure all the available GLM arguments using the provided setters.

        .. code:: scala

            import ai.h2o.sparkling.ml.algos.H2OGLM
            val predictors = Array("AGE", "RACE", "VOL", "GLEASON")
            val response = "CAPSULE"
            val estimator = new H2OGLM()
              .setFamily("binomial")
              .setFeaturesCols(predictors)
              .setLabelCol(response)
              .setLambdaValue(Array(0))
              .setComputePValues(true)
            val model = estimator.fit(trainingDF)

        Note: When the family is not set, the ``H2OGLM`` algorithm by default distinguishes between a classification and a regression problem
        based on the type of the label column of the training dataset. If the label column is a string column, a classification model
        will be trained. If the label column is a numeric column, a regression model will be trained. If you don't want to depend on
        column data types, you can explicitly identify the problem by using ``ai.h2o.sparkling.ml.algos.classification.H2OGLMClassifier``
        or ``ai.h2o.sparkling.ml.algos.regression.H2OGLMRegressor`` instead.

        Print the coefficients table:

        .. code:: scala

            model.getCoefficients().show(truncate = false)

        Run predictions:

        .. code:: scala

            model.transform(testingDF).show(truncate = false)

        You can also get model details by calling the methods listed in :ref:`model_details_H2OGLMMOJOModel`.

    .. tab-container:: Python
        :title: Python

        First, start the PySparkling Shell:

        .. code:: shell

            ./bin/pysparkling

        Start the H2O cluster inside the Spark environment:

        .. code:: python

            from pysparkling import *
            hc = H2OContext.getOrCreate()

        Parse the data using H2O, convert it to a Spark DataFrame, cast the categorical columns to strings, and split the data into training and testing sets:

        .. code:: python

            import h2o
            frame = h2o.import_file("https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate/prostate.csv")
            sparkDF = hc.asSparkFrame(frame)
            sparkDF = sparkDF.withColumn("CAPSULE", sparkDF.CAPSULE.cast("string"))
            sparkDF = sparkDF.withColumn("RACE", sparkDF.RACE.cast("string"))
            sparkDF = sparkDF.withColumn("DCAPS", sparkDF.DCAPS.cast("string"))
            sparkDF = sparkDF.withColumn("DPROS", sparkDF.DPROS.cast("string"))
            [trainingDF, testingDF] = sparkDF.randomSplit([0.8, 0.2])

        Train the model. You can configure all the available GLM arguments using the provided setters or constructor parameters.

        .. code:: python

            from pysparkling.ml import H2OGLM
            predictors = ["AGE", "RACE", "VOL", "GLEASON"]
            response = "CAPSULE"
            estimator = H2OGLM(family="binomial",
                               featuresCols=predictors,
                               labelCol=response,
                               computePValues=True,
                               lambdaValue=[0])
            model = estimator.fit(trainingDF)

        Note: When the family is not set, the ``H2OGLM`` algorithm by default distinguishes between a classification and a regression problem
        based on the type of the label column of the training dataset. If the label column is a string column, a classification model
        will be trained. If the label column is a numeric column, a regression model will be trained. If you don't want to depend on
        column data types, you can explicitly identify the problem by using ``H2OGLMClassifier`` or ``H2OGLMRegressor`` instead.

        Print the coefficients table:

        .. code:: python

            model.getCoefficients().show(truncate = False)

        Run predictions:

        .. code:: python

            model.transform(testingDF).show(truncate = False)

        You can also get model details by calling the methods listed in :ref:`model_details_H2OGLMMOJOModel`.
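To make the link between the coefficients table and the predictions more concrete, here is a minimal sketch in plain Python (no Spark or H2O required) of how a binomial GLM turns coefficients into a predicted probability. It assumes the logit link, which is the canonical link for the binomial family. The helper ``predict_probability`` and all coefficient values are hypothetical and chosen only for illustration; they are not output of the model trained in this example. The categorical ``RACE`` predictor is left out to keep the arithmetic simple.

```python
import math


def predict_probability(coefficients, row):
    """Return P(label = 1) for one row under a binomial GLM with a logit link.

    `coefficients` maps feature names (plus "Intercept") to weights;
    `row` maps feature names to numeric values. Both are illustrative inputs.
    """
    # Linear predictor: intercept plus the weighted sum of the feature values.
    eta = coefficients["Intercept"] + sum(
        weight * row[name]
        for name, weight in coefficients.items()
        if name != "Intercept"
    )
    # The inverse logit (sigmoid) maps the linear predictor to a probability.
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical coefficient values, NOT taken from the trained model.
coefficients = {"Intercept": -7.0, "AGE": -0.02, "VOL": -0.01, "GLEASON": 1.2}

# One hypothetical observation with numeric predictors only.
row = {"AGE": 65.0, "VOL": 10.0, "GLEASON": 7.0}

probability = predict_probability(coefficients, row)
print(f"Predicted probability of the positive class: {probability:.3f}")
```

With a positive ``GLEASON`` coefficient, as assumed here, increasing the Gleason score while holding the other predictors fixed raises the predicted probability. In practice, the probabilities in the output of ``model.transform`` should be used; this sketch only illustrates the underlying formula.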