RSparkling¶
The rsparkling R package is an extension package for sparklyr that creates an R front-end for the Sparkling Water package from H2O. This provides an interface to H2O’s high performance, distributed machine learning algorithms on Spark, using R.
This package implements basic functionality (creating an H2OContext, showing the H2O Flow interface, and converting between Spark DataFrames and H2O Frames). The main purpose of this package is to provide a connector between sparklyr and H2O’s machine learning algorithms.
The rsparkling package uses sparklyr for Spark job deployment and initialization of Sparkling Water. After that, the user can use the regular h2o R package for modeling. In the following sections we show how to install each of these packages.
Installation of SparklyR & Spark¶
Install Spark via sparklyr¶
The sparklyr package makes it easy to install any particular version of Spark. Prior to installing h2o and rsparkling, the user will need to decide which version of Spark they would like to work with, as the remaining installation steps revolve around a particular major version of Spark (2.1, 2.2, 2.3 or 2.4).
The following command will install Spark 2.1.3:
library(sparklyr)
spark_install(version = "2.1.3")
NOTE: The previous command requires internet access. If you are offline or behind a firewall, do the following instead:
- Download Spark (pick the major version that corresponds to Sparkling Water).
- Unzip the Spark files.
- Set the SPARK_HOME environment variable to the location of the downloaded Spark folder in R as follows:
Sys.setenv(SPARK_HOME="/path/to/spark")
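A quick sanity check after setting SPARK_HOME can catch a mistyped path before you try to connect. The sketch below uses only base R. The helper name check_spark_home is hypothetical, and the check for bin/spark-submit assumes the standard layout of an unpacked Apache Spark distribution.

```r
# Hypothetical helper: report whether SPARK_HOME points at something that
# looks like an unpacked Spark distribution.
check_spark_home <- function(path = Sys.getenv("SPARK_HOME")) {
  if (!nzchar(path)) {
    return("SPARK_HOME is not set")
  }
  if (!dir.exists(path)) {
    return(paste("SPARK_HOME does not exist:", path))
  }
  # Assumption: a standard Spark distribution ships bin/spark-submit.
  if (!file.exists(file.path(path, "bin", "spark-submit"))) {
    return(paste("No bin/spark-submit found under:", path))
  }
  "SPARK_HOME looks valid"
}

print(check_spark_home())
```

Calling check_spark_home() with no arguments inspects the current environment variable; a path can also be passed explicitly to test a folder before exporting it.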
Install H2O¶
H2O & Sparkling Water Versions Mapping¶
rsparkling requires a specific version of H2O, depending on the desired Sparkling Water and Spark versions. This is because each release of Sparkling Water is built from a specific version of H2O.
By default, rsparkling automatically uses the latest Sparkling Water release for the major Spark version provided and advises the user which H2O version to install.
Advanced users may want to choose a particular Sparkling Water / H2O version (specific Sparkling Water versions must match specific Spark and H2O versions).
Spark_Version | Sparkling_Water_Version | H2O_Version | H2O_Release_Name | H2O_Release_Patch_Number |
---|---|---|---|---|
2.4 | 2.4.5 | 3.22.1.3 | rel-xu | 3 |
2.4 | 2.4.4 | 3.22.1.2 | rel-xu | 2 |
2.4 | 2.4.3 | 3.22.0.5 | rel-xia | 5 |
2.4 | 2.4.2 | 3.22.0.4 | rel-xia | 4 |
2.4 | 2.4.1 | 3.22.0.3 | rel-xia | 3 |
2.3 | 2.3.23 | 3.22.1.3 | rel-xu | 3 |
2.3 | 2.3.22 | 3.22.1.2 | rel-xu | 2 |
2.3 | 2.3.21 | 3.22.0.5 | rel-xia | 5 |
2.3 | 2.3.20 | 3.22.0.4 | rel-xia | 4 |
2.3 | 2.3.19 | 3.22.0.3 | rel-xia | 3 |
2.3 | 2.3.18 | 3.22.0.2 | rel-xia | 2 |
2.3 | 2.3.17 | 3.22.0.1 | rel-xia | 1 |
2.3 | 2.3.16 | 3.20.0.10 | rel-wright | 10 |
2.3 | 2.3.15 | 3.20.0.9 | rel-wright | 9 |
2.3 | 2.3.14 | 3.20.0.8 | rel-wright | 8 |
2.3 | 2.3.13 | 3.20.0.7 | rel-wright | 7 |
2.3 | 2.3.12 | 3.20.0.6 | rel-wright | 6 |
2.3 | 2.3.11 | 3.20.0.5 | rel-wright | 5 |
2.3 | 2.3.10 | 3.20.0.4 | rel-wright | 4 |
2.3 | 2.3.9 | 3.20.0.3 | rel-wright | 3 |
2.3 | 2.3.8 | 3.20.0.2 | rel-wright | 2 |
2.3 | 2.3.7 | 3.20.0.1 | rel-wright | 1 |
2.3 | 2.3.6 | 3.18.0.11 | rel-wolpert | 11 |
2.3 | 2.3.5 | 3.18.0.10 | rel-wolpert | 10 |
2.3 | 2.3.4 | 3.18.0.9 | rel-wolpert | 9 |
2.3 | 2.3.3 | 3.18.0.9 | rel-wolpert | 9 |
2.3 | 2.3.2 | 3.18.0.8 | rel-wolpert | 8 |
2.3 | 2.3.1 | 3.18.0.7 | rel-wolpert | 7 |
2.3 | 2.3.0 | 3.18.0.5 | rel-wolpert | 5 |
2.2 | 2.2.34 | 3.22.1.3 | rel-xu | 3 |
2.2 | 2.2.33 | 3.22.1.2 | rel-xu | 2 |
2.2 | 2.2.32 | 3.22.0.5 | rel-xia | 5 |
2.2 | 2.2.31 | 3.22.0.4 | rel-xia | 4 |
2.2 | 2.2.30 | 3.22.0.3 | rel-xia | 3 |
2.2 | 2.2.29 | 3.22.0.2 | rel-xia | 2 |
2.2 | 2.2.28 | 3.22.0.1 | rel-xia | 1 |
2.2 | 2.2.27 | 3.20.0.10 | rel-wright | 10 |
2.2 | 2.2.26 | 3.20.0.9 | rel-wright | 9 |
2.2 | 2.2.25 | 3.20.0.8 | rel-wright | 8 |
2.2 | 2.2.24 | 3.20.0.7 | rel-wright | 7 |
2.2 | 2.2.23 | 3.20.0.6 | rel-wright | 6 |
2.2 | 2.2.22 | 3.20.0.5 | rel-wright | 5 |
2.2 | 2.2.21 | 3.20.0.4 | rel-wright | 4 |
2.2 | 2.2.20 | 3.20.0.3 | rel-wright | 3 |
2.2 | 2.2.19 | 3.20.0.2 | rel-wright | 2 |
2.2 | 2.2.18 | 3.20.0.1 | rel-wright | 1 |
2.2 | 2.2.17 | 3.18.0.11 | rel-wolpert | 11 |
2.2 | 2.2.16 | 3.18.0.10 | rel-wolpert | 10 |
2.2 | 2.2.15 | 3.18.0.9 | rel-wolpert | 9 |
2.2 | 2.2.14 | 3.18.0.9 | rel-wolpert | 9 |
2.2 | 2.2.13 | 3.18.0.8 | rel-wolpert | 8 |
2.2 | 2.2.12 | 3.18.0.7 | rel-wolpert | 7 |
2.2 | 2.2.11 | 3.18.0.5 | rel-wolpert | 5 |
2.2 | 2.2.10 | 3.18.0.4 | rel-wolpert | 4 |
2.2 | 2.2.9 | 3.18.0.2 | rel-wolpert | 2 |
2.2 | 2.2.8 | 3.18.0.1 | rel-wolpert | 1 |
2.2 | 2.2.7 | 3.16.0.4 | rel-wheeler | 4 |
2.2 | 2.2.6 | 3.16.0.2 | rel-wheeler | 2 |
2.2 | 2.2.5 | 3.16.0.2 | rel-wheeler | 2 |
2.2 | 2.2.4 | 3.16.0.2 | rel-wheeler | 2 |
2.2 | 2.2.3 | 3.16.0.1 | rel-wheeler | 1 |
2.2 | 2.2.2 | 3.14.0.7 | rel-weierstrass | 7 |
2.2 | 2.2.1 | 3.14.0.6 | rel-weierstrass | 6 |
2.2 | 2.2.0 | 3.14.0.2 | rel-weierstrass | 2 |
2.1 | 2.1.48 | 3.22.1.3 | rel-xu | 3 |
2.1 | 2.1.47 | 3.22.1.2 | rel-xu | 2 |
2.1 | 2.1.46 | 3.22.0.5 | rel-xia | 5 |
2.1 | 2.1.45 | 3.22.0.4 | rel-xia | 4 |
2.1 | 2.1.44 | 3.22.0.3 | rel-xia | 3 |
2.1 | 2.1.43 | 3.22.0.2 | rel-xia | 2 |
2.1 | 2.1.42 | 3.22.0.1 | rel-xia | 1 |
2.1 | 2.1.41 | 3.20.0.10 | rel-wright | 10 |
2.1 | 2.1.40 | 3.20.0.9 | rel-wright | 9 |
2.1 | 2.1.39 | 3.20.0.8 | rel-wright | 8 |
2.1 | 2.1.38 | 3.20.0.7 | rel-wright | 7 |
2.1 | 2.1.37 | 3.20.0.6 | rel-wright | 6 |
2.1 | 2.1.36 | 3.20.0.5 | rel-wright | 5 |
2.1 | 2.1.35 | 3.20.0.4 | rel-wright | 4 |
2.1 | 2.1.34 | 3.20.0.3 | rel-wright | 3 |
2.1 | 2.1.33 | 3.20.0.2 | rel-wright | 2 |
2.1 | 2.1.32 | 3.20.0.1 | rel-wright | 1 |
2.1 | 2.1.31 | 3.18.0.11 | rel-wolpert | 11 |
2.1 | 2.1.30 | 3.18.0.10 | rel-wolpert | 10 |
2.1 | 2.1.29 | 3.18.0.9 | rel-wolpert | 9 |
2.1 | 2.1.28 | 3.18.0.9 | rel-wolpert | 9 |
2.1 | 2.1.27 | 3.18.0.8 | rel-wolpert | 8 |
2.1 | 2.1.26 | 3.18.0.7 | rel-wolpert | 7 |
2.1 | 2.1.25 | 3.18.0.5 | rel-wolpert | 5 |
2.1 | 2.1.24 | 3.18.0.4 | rel-wolpert | 4 |
2.1 | 2.1.23 | 3.18.0.2 | rel-wolpert | 2 |
2.1 | 2.1.22 | 3.18.0.1 | rel-wolpert | 1 |
2.1 | 2.1.21 | 3.16.0.4 | rel-wheeler | 4 |
2.1 | 2.1.20 | 3.16.0.2 | rel-wheeler | 2 |
2.1 | 2.1.19 | 3.16.0.2 | rel-wheeler | 2 |
2.1 | 2.1.18 | 3.16.0.2 | rel-wheeler | 2 |
2.1 | 2.1.17 | 3.16.0.1 | rel-wheeler | 1 |
2.1 | 2.1.16 | 3.14.0.7 | rel-weierstrass | 7 |
2.1 | 2.1.15 | 3.14.0.6 | rel-weierstrass | 6 |
2.1 | 2.1.14 | 3.14.0.2 | rel-weierstrass | 2 |
2.1 | 2.1.13 | 3.10.5.4 | rel-vajda | 4 |
2.1 | 2.1.12 | 3.10.5.4 | rel-vajda | 4 |
2.1 | 2.1.11 | 3.10.5.3 | rel-vajda | 3 |
2.1 | 2.1.10 | 3.10.5.2 | rel-vajda | 2 |
2.1 | 2.1.9 | 3.10.5.1 | rel-vajda | 1 |
2.1 | 2.1.8 | 3.10.4.8 | rel-ueno | 8 |
2.1 | 2.1.7 | 3.10.4.7 | rel-ueno | 7 |
2.1 | 2.1.6 | 3.10.4.7 | rel-ueno | 7 |
2.1 | 2.1.5 | 3.10.4.6 | rel-ueno | 6 |
2.1 | 2.1.4 | 3.10.4.5 | rel-ueno | 5 |
2.1 | 2.1.3 | 3.10.4.3 | rel-ueno | 3 |
2.1 | 2.1.2 | 3.10.4.2 | rel-ueno | 2 |
2.1 | 2.1.1 | 3.10.4.2 | rel-ueno | 2 |
2.1 | 2.1.0 | 3.10.3.2 | rel-tverberg | 2 |
NOTE: A call to rsparkling::h2o_release_table() displays the release table in your R console and returns a data.frame containing this information.
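To illustrate how the mapping can be used, the sketch below looks up the H2O version required by one Sparkling Water release and compares it with the installed h2o package, if any. It runs on a small hand-copied excerpt of the table above so that it works without rsparkling; the helper name lookup_release is hypothetical. In practice the data.frame returned by rsparkling::h2o_release_table() could be passed instead, assuming its column names match the table headers shown here.

```r
# Excerpt of the mapping table above (latest row for each Spark version).
release_excerpt <- data.frame(
  Spark_Version = c("2.4", "2.3", "2.2", "2.1"),
  Sparkling_Water_Version = c("2.4.5", "2.3.23", "2.2.34", "2.1.48"),
  H2O_Version = c("3.22.1.3", "3.22.1.3", "3.22.1.3", "3.22.1.3"),
  H2O_Release_Name = c("rel-xu", "rel-xu", "rel-xu", "rel-xu"),
  H2O_Release_Patch_Number = c(3L, 3L, 3L, 3L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the table row for one Sparkling Water version.
lookup_release <- function(tbl, sw_version) {
  row <- tbl[tbl$Sparkling_Water_Version == sw_version, , drop = FALSE]
  if (nrow(row) != 1L) {
    stop("Sparkling Water version not found in table: ", sw_version)
  }
  row
}

needed <- lookup_release(release_excerpt, "2.1.48")
print(needed$H2O_Version)

# Compare against the installed h2o package, if there is one.
if ("h2o" %in% rownames(installed.packages())) {
  installed <- utils::packageVersion("h2o")
  print(installed == package_version(needed$H2O_Version))
}
```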
Prepare Environment for H2O Installation¶
We recommend removing any previously installed H2O versions and installing the packages H2O depends on. The commands below can be used for this.
# The following two commands remove any previously installed H2O packages for R.
if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) }
if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }
# Install packages H2O depends on
pkgs <- c("methods", "statmod", "stats", "graphics", "RCurl", "jsonlite", "tools", "utils")
for (pkg in pkgs) {
if (! (pkg %in% rownames(installed.packages()))) { install.packages(pkg) }
}
Install H2O from CRAN¶
When installing from CRAN, the typical install.packages("h2o") command can be used. Please note that install.packages installs the version currently published on CRAN, which might not be the latest release or the version required by your Sparkling Water release. In that case, please install H2O from S3.
Install H2O from S3¶
H2O can also be installed from the R repository hosted in H2O’s S3 buckets.
At present, you can install the h2o R package using a repository URL composed of the H2O release name and patch number (see the mapping table above). Example: http://h2o-release.s3.amazonaws.com/h2o/rel-xu/3/R
# Download and install the H2O package for R.
# In this case we are using rel-xu 3 (3.22.1.3)
install.packages("h2o", type = "source", repos = "http://h2o-release.s3.amazonaws.com/h2o/rel-xu/3/R")
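The repository URL follows a fixed pattern, so it can be assembled from the H2O_Release_Name and H2O_Release_Patch_Number columns of the mapping table. The following is a minimal sketch; the helper name h2o_repo_url is hypothetical, and the URL pattern is taken from the example above.

```r
# Hypothetical helper: build the S3 repository URL for an H2O release
# from its release name and patch number.
h2o_repo_url <- function(release_name, patch_number) {
  sprintf("http://h2o-release.s3.amazonaws.com/h2o/%s/%d/R",
          release_name, as.integer(patch_number))
}

repo <- h2o_repo_url("rel-xu", 3)
print(repo)

# To install from this repository (requires internet access):
# install.packages("h2o", type = "source", repos = repo)
```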
Install rsparkling¶
The latest stable version of rsparkling on CRAN can be installed as follows:
install.packages("rsparkling")
You can also install the latest version available on GitHub as follows:
devtools::install_github("h2oai/sparkling-water", ref="master", subdir="r/src")
Alternatively, you can install a nightly version of RSparkling. Please follow the instructions on the RSparkling tab of the Sparkling Water Nightly Download Page.
RSparkling & SparklyR Configuration¶
Configure Sparkling Water Version¶
With no configuration, the latest version of Sparkling Water will be used based on the version of Spark installed. Any additional options must be set before calling library(rsparkling) in order for them to take effect.
A particular version of Sparkling Water can be specified as:
options(rsparkling.sparklingwater.version = ...)
In both cases, internet access is required, because the correct Sparkling Water version is fetched from Maven Central. If you don’t have internet access or are behind a firewall, you can specify the Sparkling Water JAR directly as:
options(rsparkling.sparklingwater.location = "/path/to/sparkling_water.jar")
This JAR file can be obtained with the following steps:
- Download the Sparkling Water JAR of your choice based on the version mapping table above. To do this, go to the following link, where [SW Major Version] is the major version of Sparkling Water you wish to use (e.g., 2.1) and [SW Minor Version] is the minor version of Sparkling Water you wish to use (e.g., 48): http://h2o-release.s3.amazonaws.com/sparkling-water/rel-[SW Major Version]/[SW Minor Version]/index.html
- Click the DOWNLOAD SPARKLING WATER tab, which will download a .zip file of Sparkling Water.
- Run the following command to unzip the folder:
unzip sparkling-water-[SW Major Version].[SW Minor Version].zip
- The path to the Sparkling Water JAR file is: sparkling-water-[SW Major Version].[SW Minor Version]/assembly/build/libs/sparkling-water-assembly_*.jar
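The manual steps above can be sketched in base R. Both helper names (sw_download_page and find_sw_jar) are hypothetical; the URL pattern and the JAR location inside the unzipped folder are the ones given in the steps above.

```r
# Hypothetical helper: build the download page URL for a Sparkling Water release.
sw_download_page <- function(major, minor) {
  sprintf("http://h2o-release.s3.amazonaws.com/sparkling-water/rel-%s/%s/index.html",
          major, minor)
}

# Hypothetical helper: find the assembly JAR inside an unzipped
# Sparkling Water folder by expanding the wildcard in the path.
find_sw_jar <- function(unzipped_dir) {
  pattern <- file.path(unzipped_dir, "assembly", "build", "libs",
                       "sparkling-water-assembly_*.jar")
  jars <- Sys.glob(pattern)
  if (length(jars) == 0L) {
    stop("No Sparkling Water assembly JAR found under: ", unzipped_dir)
  }
  jars[[1]]
}

print(sw_download_page("2.1", "48"))

# After unzipping, and before calling library(rsparkling):
# options(rsparkling.sparklingwater.location = find_sw_jar("sparkling-water-2.1.48"))
```

Resolving the wildcard in R avoids hard-coding the Scala and release suffixes that appear in the JAR file name.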
Configure Spark Connection¶
Once we’ve installed rsparkling and its dependencies, the first step is to create a Spark connection as follows:
sc <- spark_connect(master = "local", version = "2.1.3")
NOTE: Please be sure to set version in spark_connect() to the Spark version used by your version of Sparkling Water.
NOTE: The previous command requires access to the internet. If you are not connected to the internet/behind a firewall, please first read the previous section about Spark installation.
The spark_connect method also has a spark_home argument, which defaults to the SPARK_HOME environment variable. If SPARK_HOME is defined, it will always be used unless the version parameter is specified to force the use of a locally installed version. Therefore, to use an existing Spark installation, please run:
sc <- spark_connect(master = "local")
Changing the Default H2O Client Port¶
RSparkling does not expose setters and getters for specifying configuration options. You must specify the Spark configuration options directly, for example:
config=spark_config()
config=c(config, list("spark.ext.h2o.node.port.base"="55555", "spark.ext.h2o.client.port.base"="44444"))
sc <- spark_connect(master="yarn-client", app_name = "demo", config = config)
In the above, spark.ext.h2o.node.port.base affects the worker nodes, and spark.ext.h2o.client.port.base affects the client.
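Because these options are passed as plain strings, a typo in a port number only surfaces when the cluster starts. The sketch below builds the same two settings with a basic range check. The helper name h2o_port_config is hypothetical; the option names are the ones used above.

```r
# Hypothetical helper: build the two H2O port settings as a named list,
# checking that each port is a valid TCP port number.
h2o_port_config <- function(node_port, client_port) {
  ports <- c(node_port, client_port)
  if (any(is.na(ports)) || any(ports < 1 | ports > 65535)) {
    stop("Ports must be between 1 and 65535")
  }
  list(
    "spark.ext.h2o.node.port.base" = sprintf("%d", as.integer(node_port)),
    "spark.ext.h2o.client.port.base" = sprintf("%d", as.integer(client_port))
  )
}

port_settings <- h2o_port_config(55555, 44444)
str(port_settings)

# These settings can then be appended to a sparklyr configuration,
# in the same way as in the example above:
# config <- c(spark_config(), port_settings)
```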
Using RSparkling¶
H2OContext & Flow¶
The call to library(rsparkling) automatically registers the Sparkling Water extension.
Let’s inspect the H2OContext for our Spark connection:
h2o_context(sc)
## <jobj[6]>
## class org.apache.spark.h2o.H2OContext
##
## Sparkling Water Context:
## * H2O name: sparkling-water-jjallaire_-1482215501
## * number of executors: 1
## * list of used executors:
## (executorId, host, port)
## ------------------------
## (driver,localhost,54323)
## ------------------------
##
## Open H2O Flow in browser: http://127.0.0.1:54323 (CMD + click in Mac OSX)
##
We can also view the H2O Flow web UI:
h2o_flow(sc)
H2O with Spark DataFrames¶
As an example, let’s copy the mtcars dataset to Spark so we can access it from H2O Sparkling Water:
library(dplyr)
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)
mtcars_tbl
## Source: query [?? x 11]
## Database: spark connection master=local[8] app=sparklyr local=TRUE
##
## mpg cyl disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## ... with more rows
The use case we’d like to enable is calling the H2O algorithms and feature transformers directly on Spark DataFrames that we’ve manipulated with dplyr. This is indeed supported by the Sparkling Water package. Here is how you convert a Spark DataFrame into an H2O Frame:
mtcars_hf <- as_h2o_frame(sc, mtcars_tbl)
mtcars_hf
## <jobj[103]>
## class water.fvec.H2OFrame
## Frame frame_rdd_39 (32 rows and 11 cols):
## mpg cyl disp hp drat wt qsec vs am gear carb
## min 10.4 4 71.1 52 2.76 1.513 14.5 0 0 3 1
## mean 20.090625 6 230.721875 146 3.5965625 3.21725 17.848750000000003 0 0 3 2
## stddev 6.026948052089104 1 123.93869383138194 68 0.5346787360709715 0.9784574429896966 1.7869432360968436 0 0 0 1
## max 33.9 8 472.0 335 4.93 5.424 22.9 1 1 5 8
## missing 0.0 0 0.0 0 0.0 0.0 0.0 0 0 0 0
## 0 21.0 6 160.0 110 3.9 2.62 16.46 0 1 4 4
## 1 21.0 6 160.0 110 3.9 2.875 17.02 0 1 4 4
## 2 22.8 4 108.0 93 3.85 2.32 18.61 1 1 4 1
## 3 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## 4 18.7 8 360.0 175 3.15 3.44 17.02 0 0 3 2
## 5 18.1 6 225.0 105 2.76 3.46 20.22 1 0 3 1
## 6 14.3 8 360.0 245 3.21 3.57 15.84 0 0 3 4
## 7 24.4 4 146.7 62 3.69 3.19 20.0 1 0 4 2
## 8 22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
## 9 19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
## 10 17.8 6 167.6 123 3.92 3.44 18.9 1 0 4 4
## 11 16.4 8 275.8 180 3.07 4.07 17.4 0 0 3 3
## 12 17.3 8 275.8 180 3.07 3.73 17.6 0 0 3 3
## 13 15.2 8 275.8 180 3.07 3.78 18.0 0 0 3 3
## 14 10.4 8 472.0 205 2.93 5.25 17.98 0 0 3 4
## 15 10.4 8 460.0 215 3.0 5.424 17.82 0 0 3 4
## 16 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## 17 32.4 4 78.7 66 4.08 2.2 19.47 1 1 4 1
## 18 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## 19 33.9 4 71.1 65 4.22 1.835 19.9 1 1 4 1
Disconnect from Spark¶
Now we disconnect from Spark. This also stops the H2OContext, since it’s owned by the spark shell process used by our Spark connection:
spark_disconnect(sc)
Machine Learning with RSparkling & H2O¶
Using the same mtcars dataset, here is an example where we train a Gradient Boosting Machine (GBM) to predict “mpg”.
Initialize H2O¶
library(h2o)
Data Preparations¶
Define the response, y, and set of predictor variables, x:
y <- "mpg"
x <- setdiff(names(mtcars_hf), y)
Let’s split the data into a train and test set using H2O. The h2o.splitFrame function defaults to a 75-25 split (ratios = 0.75), but here we will make a 70-30 train-test split:
# Split the mtcars H2O Frame into train & test sets
splits <- h2o.splitFrame(mtcars_hf, ratios = 0.7, seed = 1)
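Note that h2o.splitFrame splits probabilistically, so the resulting frames are only approximately 70% and 30% of the rows. The base R sketch below is a rough illustration of that idea on the local mtcars data; it is not H2O's implementation and will not reproduce the same rows as the call above.

```r
# Rough base-R illustration of a probabilistic 70-30 split.
# This is NOT H2O's implementation; it only shows why the split sizes
# are approximate rather than exact.
set.seed(1)
ratio <- 0.7

# Each row independently lands in the training set with probability `ratio`.
in_train <- runif(nrow(mtcars)) < ratio
train <- mtcars[in_train, ]
test <- mtcars[!in_train, ]

print(c(train = nrow(train), test = nrow(test)))
```

With only 32 rows, the observed proportions can differ noticeably from the requested ratio, which is worth keeping in mind when reading the test metrics below.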
Model Training¶
Now train an H2O GBM using the training H2OFrame.
fit <- h2o.gbm(x = x,
y = y,
training_frame = splits[[1]],
min_rows = 1,
seed = 1)
print(fit)
## H2ORegressionModel: gbm
## Model ID: GBM_model_R_1474763476171_1
## Model Summary:
## number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1 50 50 14807 5
## max_depth mean_depth min_leaves max_leaves mean_leaves
## 1 5 5.00000 17 21 18.64000
##
##
## H2ORegressionMetrics: gbm
## ** Reported on training data. **
##
## MSE: 0.001211724
## RMSE: 0.03480983
## MAE: 0.02761402
## RMSLE: 0.001929304
## Mean Residual Deviance : 0.001211724
Model Performance¶
We can evaluate the GBM by measuring its performance on a test set.
perf <- h2o.performance(fit, newdata = splits[[2]])
print(perf)
## H2ORegressionMetrics: gbm
##
## MSE: 2.707001
## RMSE: 1.645297
## MAE: 1.455267
## RMSLE: 0.08579109
## Mean Residual Deviance : 2.707001
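For reference, the first three metrics in this output follow the standard definitions below, written as plain R functions. The input vectors are made-up values for illustration only, not the model's actual predictions; the last line checks that the reported RMSE is the square root of the reported MSE.

```r
# Standard regression metrics as plain R functions.
mse  <- function(actual, predicted) mean((actual - predicted)^2)
rmse <- function(actual, predicted) sqrt(mse(actual, predicted))
mae  <- function(actual, predicted) mean(abs(actual - predicted))

# Made-up values, for illustration only.
actual    <- c(21.0, 22.8, 18.7)
predicted <- c(20.0, 24.8, 18.7)

print(c(MSE  = mse(actual, predicted),
        RMSE = rmse(actual, predicted),
        MAE  = mae(actual, predicted)))

# Consistency check on the test-set figures reported above:
# sqrt(MSE) should match RMSE up to rounding.
print(abs(sqrt(2.707001) - 1.645297) < 1e-6)
```

Comparing the test RMSE (about 1.65) with the training RMSE (about 0.03) shows a large gap, which is expected here given the tiny dataset and min_rows = 1.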
Predictions¶
To generate predictions on a test set, do the following. This returns an H2OFrame with one or more columns of predicted values: a single column for regression, 3 columns for binary classification, and C+1 columns for multi-class classification (where C is the number of classes).
pred_hf <- h2o.predict(fit, newdata = splits[[2]])
head(pred_hf)
## predict
## 1 21.39512
## 2 16.92804
## 3 15.19558
## 4 20.47695
## 5 20.47695
## 6 15.24433
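The column-count rule for prediction frames can be written down as a small function. This is only a sketch of the rule stated above; the helper name n_prediction_columns is hypothetical. For classification, the extra column is the predicted label, alongside one probability column per class.

```r
# Hypothetical helper: expected number of columns in a prediction frame.
# n_classes = NULL means regression; otherwise the number of classes C.
n_prediction_columns <- function(n_classes = NULL) {
  if (is.null(n_classes)) {
    return(1L)  # regression: a single "predict" column
  }
  if (n_classes < 2) {
    stop("Classification needs at least 2 classes")
  }
  # Predicted label plus one probability column per class.
  as.integer(n_classes) + 1L
}

print(n_prediction_columns())   # regression, as in the mpg example
print(n_prediction_columns(2))  # binary classification
print(n_prediction_columns(5))  # multi-class with C = 5
```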
Now let’s say you want to make this H2OFrame available to Spark. You can convert an H2OFrame into a Spark DataFrame using the as_spark_dataframe function:
pred_sdf <- as_spark_dataframe(sc, pred_hf)
head(pred_sdf)
## Source: query [?? x 1]
## Database: spark connection master=local[8] app=sparklyr local=TRUE
##
## predict
## <dbl>
## 1 21.39512
## 2 16.92804
## 3 15.19558
## 4 20.47695
## 5 20.47695
## 6 15.24433
Additional Resources¶
If you are new to H2O for machine learning, we recommend you start with:
There are also a number of other H2O R tutorials and demos available, as well as the Machine Learning with R and H2O Booklet (pdf).