Train GAM Model in Sparkling Water ---------------------------------- **Note**: GAM models are currently experimental. Introduction ~~~~~~~~~~~~ A generalized additive model (GAM) is a Generalized Linear Model (GLM) in which the linear predictor depends linearly on predictor variables and smooth functions of predictor variables. For more more comprehensive description see `H2O-3 GAM documentation `__. Example ~~~~~~~ The following section describes how to train the GAM model in Sparkling Water in Scala & Python following the same example as H2O-3 documentation mentioned above. See also :ref:`parameters_H2OGAM` and :ref:`model_details_H2OGAMMOJOModel`. .. content-tabs:: .. tab-container:: Scala :title: Scala First, let's start Sparkling Shell as .. code:: shell ./bin/sparkling-shell Start H2O cluster inside the Spark environment .. code:: scala import ai.h2o.sparkling._ val hc = H2OContext.getOrCreate() Create the frame knots .. code:: scala val knots1 = Seq(-1.99905699, -0.98143075, 0.02599159, 1.00770987, 1.99942290).toDF() val frameKnots1 = hc.asH2OFrame(knots1) val knots2 = Seq(-1.999821861, -1.005257990, -0.006716042, 1.002197392, 1.999073589).toDF() val frameKnots2 = hc.asH2OFrame(knots2) val knots3 = Seq(-1.999675688, -0.979893796, 0.007573327, 1.011437347, 1.999611676).toDF() val frameKnots3 = hc.asH2OFrame(knots3) Import the dataset and split into train and validation sets .. code:: scala import org.apache.spark.SparkFiles val datasetUrl = "https://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/multinomial_10_classes_10_cols_10000_Rows_train.csv" spark.sparkContext.addFile(datasetUrl) //for example purposes, on a real cluster it's better to load directly from distributed storage val sparkDF = spark .read .option("header", "true") .option("inferSchema", "true").csv(SparkFiles.get("multinomial_10_classes_10_cols_10000_Rows_train.csv")) val Array(trainingDF, testingDF) = sparkDF.randomSplit(Array(0.8, 0.2)) Set the predictor and response columns, specify the knots array .. code:: scala val y = "C11" val x = Array("C1", "C2") val numKnots = Array(5, 5, 5) Train the model. You can configure all the available GAM arguments using provided setters, such as the label column and gam columns, which are mandatory. .. code:: scala val gam = new H2OGAM() .setFeaturesCols(x) .setLabelCol(y) .setFamily("multinomial") .setGamCols(Array("C6", "C7", "C8")) .setColumnsToCategorical("C1", "C2", "C11") .setScale(Array(1.0, 1.0, 1.0)) .setNumKnots(numKnots) .setKnotIds(Array(frameKnots1.frameId, frameKnots2.frameId, frameKnots3.frameId)) val gamModel = gam.fit(trainingDF) By default, the ``H2OGAM`` algorithm distinguishes between a classification and regression problem based on the type of the label column of the training dataset. If the label column is a string column, a classification model will be trained. If the label column is a numeric column, a regression model will be trained. If you don't want to be worried about column data types, you can explicitly identify the problem by using ``ai.h2o.sparkling.ml.algos.classification.H2OGAMClassifier`` or ``ai.h2o.sparkling.ml.algos.regression.H2OGAMRegressor`` instead. Run Predictions .. code:: scala val predictions = gamModel.transform(testingDF) predictions.show(truncate = false) You can also get model details via calling methods listed in :ref:`model_details_H2OGAMMOJOModel`. Clean up .. code:: scala frameKnots1.delete() frameKnots2.delete() frameKnots3.delete() .. tab-container:: Python :title: Python First, let's start PySparkling Shell as .. code:: shell ./bin/pysparkling Start H2O cluster inside the Spark environment .. code:: python from pysparkling import * hc = H2OContext.getOrCreate() Create the frame knots .. code:: python knots1 = [-1.99905699, -0.98143075, 0.02599159, 1.00770987, 1.99942290] frameKnots1 = h2o.H2OFrame(python_obj=knots1) knots2 = [-1.999821861, -1.005257990, -0.006716042, 1.002197392, 1.999073589] frameKnots2 = h2o.H2OFrame(python_obj=knots2) knots3 = [-1.999675688, -0.979893796, 0.007573327,1.011437347, 1.999611676] frameKnots3 = h2o.H2OFrame(python_obj=knots3) Import the dataset and split into train and validation sets .. code:: python import h2o frame = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/multinomial_10_classes_10_cols_10000_Rows_train.csv") sparkDF = hc.asSparkFrame(frame) [trainingDF, testingDF] = sparkDF.randomSplit([0.8, 0.2]) Set the predictor and response columns, specify the knots array .. code:: python y = "C11" x = ["C1","C2"] numKnots = [5,5,5] Train the model. You can configure all the available GAM arguments using provided setters or constructor parameters, such as the label column and gam columns, which are mandatory. .. code:: python from pysparkling.ml import H2OGAM estimator = H2OGAM( featuresCols = x, labelCol = y, family = "multinomial", gamCols = ["C6", "C7", "C8"], columnsToCategorical = ["C1", "C2", "C11"], scale = [1.0, 1.0, 1.0], numKnots = numKnots, knotIds = [frameKnots1.key, frameKnots2.key, frameKnots3.key]) model = estimator.fit(trainingDF) By default, the ``H2OGAM`` algorithm distinguishes between a classification and regression problem based on the type of the label column of the training dataset. If the label column is a string column, a classification model will be trained. If the label column is a numeric column, a regression model will be trained. If you don't want to be worried about column data types, you can explicitly identify the problem by using ``H2OGAMClassifier`` or ``H2OGAMRegressor`` instead. Run Predictions .. code:: python model.transform(testingDF).show(truncate = False) You can also get model details via calling methods listed in :ref:`model_details_H2OGAMMOJOModel`. Clean up .. code:: python h2o.remove(frameKnots1) h2o.remove(frameKnots2) h2o.remove(frameKnots3) h2o.remove(frame)