RSparkling

Join the chat at https://gitter.im/h2oai/rsparkling

The rsparkling R package is an extension package for sparklyr that creates an R front-end for the Sparkling Water package from H2O. This provides an interface to H2O’s high performance, distributed machine learning algorithms on Spark, using R.

This package implements basic functionality: creating an H2OContext, showing the H2O Flow interface, and converting between Spark DataFrames and H2O Frames. Its main purpose is to provide a connector between sparklyr and H2O’s machine learning algorithms.

The rsparkling package uses sparklyr for Spark job deployment and initialization of Sparkling Water. After that, the user can use the regular h2o R package for modeling. In the following sections, we show how to install each of these packages.

Installation of SparklyR & Spark

Install Sparklyr

We recommend the latest stable version of sparklyr.

install.packages("sparklyr")

Install Spark via Sparklyr

RSparkling 3.40.0.1-1-3.1 is built for Spark 3.1.

The following command will install Spark 3.1.3:

library(sparklyr)
spark_install(version = "3.1.3")

NOTE: The previous command requires internet access. If you are offline or behind a firewall, you can do the following instead:

  1. Download Spark (Pick any supported minor version for Spark 3.1)

  2. Unzip Spark files

  3. Set the SPARK_HOME environment variable to the location of the downloaded Spark folder in R as follows:

    Sys.setenv(SPARK_HOME="/path/to/spark")
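
Before connecting, it can help to confirm that SPARK_HOME really points at an unpacked Spark distribution. The following is a minimal sketch in base R. The helper name looks_like_spark_home is hypothetical, and the check assumes the usual distribution layout in which bin/spark-submit exists.

```r
# Hypothetical helper (not part of sparklyr or rsparkling).
# Assumption: an unpacked Spark distribution contains bin/spark-submit.
looks_like_spark_home <- function(path = Sys.getenv("SPARK_HOME")) {
  nzchar(path) &&
    dir.exists(path) &&
    file.exists(file.path(path, "bin", "spark-submit"))
}

if (!looks_like_spark_home()) {
  message("SPARK_HOME is unset or does not look like a Spark distribution")
}
```

If the check fails, re-run Sys.setenv with the folder that directly contains the bin directory.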
    

Install H2O

Prepare Environment for H2O Installation

RSparkling 3.40.0.1-1-3.1 requires H2O of version 3.40.0.1.

It is advised to remove previously installed H2O versions and install H2O dependencies. The commands below can be used for this.

# The following two commands remove any previously installed H2O packages for R.
if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) }
if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }

# Install packages H2O depends on
pkgs <- c("methods", "statmod", "stats", "graphics", "RCurl", "jsonlite", "tools", "utils")
for (pkg in pkgs) {
    if (! (pkg %in% rownames(installed.packages()))) { install.packages(pkg) }
}
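
The RSparkling version string used in this guide appears to encode both requirements stated above: the H2O version and the Spark line. The sketch below splits it in base R. The function name is hypothetical, and the reading of the three dash-separated parts (H2O version, a build or patch component, Spark major.minor) is an assumption inferred from the versions quoted in this document.

```r
# Hypothetical helper. Assumption: the version string has the form
# <H2O version>-<build component>-<Spark major.minor>.
parse_rsparkling_version <- function(version) {
  parts <- strsplit(version, "-", fixed = TRUE)[[1]]
  stopifnot(length(parts) == 3)
  list(h2o = parts[1], build = parts[2], spark = parts[3])
}

v <- parse_rsparkling_version("3.40.0.1-1-3.1")
v$h2o    # H2O version to install
v$spark  # Spark line to install

# Check that a concrete Spark version belongs to that line.
startsWith("3.1.3", paste0(v$spark, "."))
```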

Install H2O from CRAN

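To install from CRAN, use the typical install.packages("h2o") command. Note that install.packages has no version argument (its second positional argument is the library directory), so this installs whichever version CRAN currently serves, which may not be 3.40.0.1. The latest released version might also not be available on CRAN. In either case, please install H2O from S3.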
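
Whichever repository you install from, it is worth confirming that the installed h2o version is the one RSparkling expects. A minimal sketch using only base R follows. The helper name has_version is hypothetical.

```r
# Hypothetical helper: TRUE only if `pkg` is installed at exactly `expected`.
has_version <- function(pkg, expected) {
  if (!(pkg %in% rownames(utils::installed.packages()))) {
    return(FALSE)
  }
  utils::packageVersion(pkg) == expected
}

has_version("h2o", "3.40.0.1")
```

A FALSE result means h2o is either missing or at a different version, in which case installing from S3 as described in the next section is the safer route.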

Install H2O from S3

H2O can also be installed from the hosted R repository in H2O’s S3 buckets.

At present, you can install the h2o R package using a repository URL composed of the H2O version name and number. Example: http://h2o-release.s3.amazonaws.com/h2o/rel-zz_kurka/1/R

# Download and install the H2O package for R.
# In this case we are using rel-zz_kurka 1 (3.40.0.1)
install.packages("h2o", type = "source", repos = "http://h2o-release.s3.amazonaws.com/h2o/rel-zz_kurka/1/R")

Install RSparkling

RSparkling can be installed from the hosted R repository in Sparkling Water’s S3 buckets, available at http://h2o-release.s3.amazonaws.com/sparkling-water/spark-3.1/3.40.0.1-1-3.1/R, as follows:

# Download and install RSparkling
install.packages("rsparkling", type = "source", repos = "http://h2o-release.s3.amazonaws.com/sparkling-water/spark-3.1/3.40.0.1-1-3.1/R")
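
Both repository URLs in this guide follow a visible pattern, so they can be assembled from their parts instead of being typed by hand. The sketch below reproduces exactly the two URLs shown in this document. The function names are hypothetical, and whether other releases follow the same layout is an assumption.

```r
# Hypothetical helpers that rebuild the repository URLs used in this guide.
h2o_repo_url <- function(release_name, build) {
  sprintf("http://h2o-release.s3.amazonaws.com/h2o/%s/%s/R", release_name, build)
}

rsparkling_repo_url <- function(version, spark) {
  sprintf(
    "http://h2o-release.s3.amazonaws.com/sparkling-water/spark-%s/%s/R",
    spark, version
  )
}

h2o_url <- h2o_repo_url("rel-zz_kurka", 1)
rsparkling_url <- rsparkling_repo_url("3.40.0.1-1-3.1", "3.1")

# Either value can then be passed as `repos` to install.packages(), e.g.:
# install.packages("rsparkling", type = "source", repos = rsparkling_url)
```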

Enable RSparkling

The call to library(rsparkling) automatically registers the Sparkling Water extension. This needs to be called before the spark_connect method.

library(rsparkling)

Starting Spark

Once we’ve installed rsparkling and its dependencies, the first step is to create a Spark connection as follows:

sc <- spark_connect(master = "local", version = "3.1.3")

Note: If you are running on Databricks, please use the following code instead:

sc <- spark_connect(method = "databricks")

NOTE: Be sure to set version in spark_connect() to the Spark version used by your version of Sparkling Water.

The spark_connect method also has a spark_home argument, which defaults to the SPARK_HOME environment variable. If SPARK_HOME is defined, it is always used unless the version parameter is specified to force the use of a locally installed version. Therefore, to use an existing Spark installation, please run:

sc <- spark_connect(master = "local")
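
The choice between the two forms of the call can be sketched as a small helper that builds the argument list. This is an illustration of the rule described above, not part of sparklyr: the function name is hypothetical, and it only passes version when SPARK_HOME is not set.

```r
# Hypothetical helper: choose spark_connect() arguments based on SPARK_HOME.
spark_connect_args <- function(version, master = "local") {
  args <- list(master = master)
  if (!nzchar(Sys.getenv("SPARK_HOME"))) {
    # No existing Spark configured: ask for a sparklyr-installed version.
    args$version <- version
  }
  args
}

# Usage (requires sparklyr and a Spark installation):
# sc <- do.call(sparklyr::spark_connect, spark_connect_args("3.1.3"))
```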

Using RSparkling

Specify H2OConf

H2OConf contains all settings needed to start and run the H2O-3 cluster.

h2oConf <- H2OConf()

The newly created instance of H2OConf contains the Sparkling Water defaults, as modified by any property values specified in spark-defaults.conf. To change the value of a given property, use the appropriate setter listed in Sparkling Water Configuration Properties.

h2oConf$setBasePort(55555)

Alternatively, change the property value via the set method by specifying the property name:

h2oConf$set("spark.ext.h2o.cloud.name", "mycloud")
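
When several properties need to be changed, the set method shown above can be applied in a loop. The helper below is a hypothetical convenience wrapper, not part of rsparkling. It relies only on the set(name, value) call used in this guide and assumes all values are passed as strings.

```r
# Hypothetical helper: apply a named character vector of properties
# to any configuration object that exposes set(name, value).
apply_h2o_properties <- function(conf, properties) {
  stopifnot(is.character(properties), !is.null(names(properties)))
  for (name in names(properties)) {
    conf$set(name, properties[[name]])
  }
  invisible(conf)
}

# Usage (requires library(rsparkling) and an active Spark connection):
# h2oConf <- H2OConf()
# apply_h2o_properties(h2oConf, c("spark.ext.h2o.cloud.name" = "mycloud"))
```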

Create H2OContext

To create H2OContext, call:

hc <- H2OContext.getOrCreate(h2oConf)

Open H2O Flow

We can also view the H2O Flow web UI:

hc$openFlow()

H2O with Spark DataFrames

As an example, let’s copy the mtcars dataset to Spark so we can access it from H2O Sparkling Water:

library(dplyr)
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)
mtcars_tbl

   ## Source:   query [?? x 11]
   ## Database: spark connection master=local[8] app=sparklyr local=TRUE
   ##
   ##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   ##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
   ## 1   21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
   ## 2   21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
   ## 3   22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
   ## 4   21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
   ## 5   18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
   ## 6   18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1
   ## 7   14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4
   ## 8   24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2
   ## 9   22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2
   ## 10  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4
   ## ... with more rows

The use case we’d like to enable is calling the H2O algorithms and feature transformers directly on Spark DataFrames that we’ve manipulated with dplyr. This is indeed supported by the Sparkling Water package. Here is how you convert a Spark DataFrame into an H2O Frame:

mtcars_hf <- hc$asH2OFrame(mtcars_tbl)
mtcars_hf

   ## <jobj[103]>
   ##   class water.fvec.H2OFrame
   ##   Frame frame_rdd_39 (32 rows and 11 cols):
   ##                        mpg  cyl                disp   hp                drat                  wt                qsec  vs  am  gear  carb
   ##     min               10.4    4                71.1   52                2.76               1.513                14.5   0   0     3     1
   ##    mean          20.090625    6          230.721875  146           3.5965625             3.21725  17.848750000000003   0   0     3     2
   ##  stddev  6.026948052089104    1  123.93869383138194   68  0.5346787360709715  0.9784574429896966  1.7869432360968436   0   0     0     1
   ##     max               33.9    8               472.0  335                4.93               5.424                22.9   1   1     5     8
   ## missing                0.0    0                 0.0    0                 0.0                 0.0                 0.0   0   0     0     0
   ##       0               21.0    6               160.0  110                 3.9                2.62               16.46   0   1     4     4
   ##       1               21.0    6               160.0  110                 3.9               2.875               17.02   0   1     4     4
   ##       2               22.8    4               108.0   93                3.85                2.32               18.61   1   1     4     1
   ##       3               21.4    6               258.0  110                3.08               3.215               19.44   1   0     3     1
   ##       4               18.7    8               360.0  175                3.15                3.44               17.02   0   0     3     2
   ##       5               18.1    6               225.0  105                2.76                3.46               20.22   1   0     3     1
   ##       6               14.3    8               360.0  245                3.21                3.57               15.84   0   0     3     4
   ##       7               24.4    4               146.7   62                3.69                3.19                20.0   1   0     4     2
   ##       8               22.8    4               140.8   95                3.92                3.15                22.9   1   0     4     2
   ##       9               19.2    6               167.6  123                3.92                3.44                18.3   1   0     4     4
   ##      10               17.8    6               167.6  123                3.92                3.44                18.9   1   0     4     4
   ##      11               16.4    8               275.8  180                3.07                4.07                17.4   0   0     3     3
   ##      12               17.3    8               275.8  180                3.07                3.73                17.6   0   0     3     3
   ##      13               15.2    8               275.8  180                3.07                3.78                18.0   0   0     3     3
   ##      14               10.4    8               472.0  205                2.93                5.25               17.98   0   0     3     4
   ##      15               10.4    8               460.0  215                 3.0               5.424               17.82   0   0     3     4
   ##      16               14.7    8               440.0  230                3.23               5.345               17.42   0   0     3     4
   ##      17               32.4    4                78.7   66                4.08                 2.2               19.47   1   1     4     1
   ##      18               30.4    4                75.7   52                4.93               1.615               18.52   1   1     4     2
   ##      19               33.9    4                71.1   65                4.22               1.835                19.9   1   1     4     1

Obtaining Logs

Look at the Spark log from R:

spark_log(sc, n = 100)

Disconnect from Spark

Once you have finished working (the machine learning example in the next section still needs the connection), disconnect from Spark. This also stops the H2OContext, since it is owned by the Spark shell process used by our Spark connection:

spark_disconnect(sc)

Machine Learning with RSparkling & H2O

Using the same mtcars dataset, here is an example where we train a Gradient Boosting Machine (GBM) to predict “mpg”.

Initialize H2O

library(h2o)

Data Preparations

Define the response, y, and set of predictor variables, x:

y <- "mpg"
x <- setdiff(names(mtcars_hf), y)

Let’s split the data into a train and test set using H2O. The h2o.splitFrame function defaults to a 75-25 split (ratios = 0.75), but here we will make a 70-30 train-test split:

# Split the mtcars H2O Frame into train & test sets
splits <- h2o.splitFrame(mtcars_hf, ratios = 0.7, seed = 1)

Model Training

Now train an H2O GBM using the training H2OFrame.

fit <- h2o.gbm(x = x,
               y = y,
               training_frame = splits[[1]],
               min_rows = 1,
               seed = 1)
print(fit)

   ## H2ORegressionModel: gbm
   ## Model ID:  GBM_model_R_1474763476171_1
   ## Model Summary:
   ##  number_of_trees number_of_internal_trees model_size_in_bytes min_depth
   ##   1              50                       50               14807         5
   ##  max_depth mean_depth min_leaves max_leaves mean_leaves
   ##   1         5    5.00000         17         21    18.64000
   ##
   ##
   ## H2ORegressionMetrics: gbm
   ## ** Reported on training data. **
   ##
   ## MSE:  0.001211724
   ## RMSE:  0.03480983
   ## MAE:  0.02761402
   ## RMSLE:  0.001929304
   ## Mean Residual Deviance :  0.001211724

Model Performance

We can evaluate the GBM by measuring its performance on a test set.

perf <- h2o.performance(fit, newdata = splits[[2]])
print(perf)

   ## H2ORegressionMetrics: gbm
   ##
   ## MSE:  2.707001
   ## RMSE:  1.645297
   ## MAE:  1.455267
   ## RMSLE:  0.08579109
   ## Mean Residual Deviance :  2.707001
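
To make the reported metrics concrete, here are the standard textbook definitions written in base R. This is a sketch for illustration only, not H2O's implementation, and the sample vectors are made up.

```r
# Standard regression metrics (textbook definitions).
mse   <- function(actual, predicted) mean((actual - predicted)^2)
rmse  <- function(actual, predicted) sqrt(mse(actual, predicted))
mae   <- function(actual, predicted) mean(abs(actual - predicted))
rmsle <- function(actual, predicted) {
  sqrt(mean((log1p(predicted) - log1p(actual))^2))
}

# Made-up values, only to show the calls.
actual    <- c(21.0, 22.8, 18.7)
predicted <- c(20.0, 24.0, 18.0)
c(mse = mse(actual, predicted),
  rmse = rmse(actual, predicted),
  mae = mae(actual, predicted),
  rmsle = rmsle(actual, predicted))

# The reported RMSE is the square root of the reported MSE.
isTRUE(all.equal(sqrt(2.707001), 1.645297, tolerance = 1e-6))
```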

Predictions

To generate predictions on a test set, call h2o.predict as shown below. It returns an H2OFrame with one or more columns of predicted values: a single column for regression, 3 columns for binary classification, and C+1 columns for multi-class classification (where C is the number of classes).

pred_hf <- h2o.predict(fit, newdata = splits[[2]])
head(pred_hf)

   ##   predict
   ## 1 21.39512
   ## 2 16.92804
   ## 3 15.19558
   ## 4 20.47695
   ## 5 20.47695
   ## 6 15.24433
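
The column counts described above can be written as a small function. This is a hypothetical helper for illustration, and the problem-type labels are names chosen here, not H2O identifiers.

```r
# Hypothetical helper: expected number of columns in the prediction frame.
n_prediction_columns <- function(problem, n_classes = NULL) {
  switch(problem,
    regression = 1L,   # predicted value only
    binomial = 3L,     # C + 1 columns with C = 2
    multinomial = {
      stopifnot(!is.null(n_classes), n_classes >= 3)
      as.integer(n_classes) + 1L
    },
    stop("unknown problem type: ", problem)
  )
}

n_prediction_columns("regression")
n_prediction_columns("multinomial", n_classes = 5)
```

For the GBM above, which predicts the numeric mpg column, this corresponds to the single predict column in the output.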

Now let’s say you want to make this H2OFrame available to Spark. You can convert an H2OFrame into a Spark DataFrame using the asSparkFrame method of the H2OContext:

pred_sdf <- hc$asSparkFrame(pred_hf)
head(pred_sdf)

   ## Source:   query [?? x 1]
   ## Database: spark connection master=local[8] app=sparklyr local=TRUE

   ##   predict
   ##   <dbl>
   ## 1 21.39512
   ## 2 16.92804
   ## 3 15.19558
   ## 4 20.47695
   ## 5 20.47695
   ## 6 15.24433

Additional Resources

If you are new to H2O for machine learning, we recommend you start with:

There are also a number of other H2O R tutorials and demos available, as well as the Machine Learning with R and H2O Booklet (pdf).