R Tutorial¶
This tutorial provides a sample workflow for new users of H2O’s R API. Readers will learn the basic syntax of H2O, including importing and parsing files, specifying a model, and obtaining model output.
New H2O users should refer to the quick start guide for additional instructions on how to run H2O. The following tutorial assumes that H2O is installed in R.
Getting Started¶
R uses an REST API to send functions to H2O, so a reference object in R to the H2O instance is required. You can start H2O outside of R and connect to it. You can also launch directly from R, but if you close the R session, the H2O instance is closed as well. The client object is used to direct R to datasets and models located in H2O.
Launch From R
By default, if the argument max_mem_size is not specified when running h2o.init(), the heap size of the H2O running on 32-bit Java is 1g. On 64-bit Java, the heap size is 1/4 of the total memory available on the machine. For a 32-bit version, the function runs a check and suggests an upgrade.
> library(h2o)
> localH2O <- h2o.init(ip = 'localhost', port = 54321, max_mem_size = '4g')
Successfully connected to http://localhost:54321
R is connected to H2O cluster:
H2O cluster uptime: 11 minutes 35 seconds
H2O cluster version: 2.7.0.1497
H2O cluster name: H2O_started_from_R
H2O cluster total nodes: 1
H2O cluster total memory: 3.56 GB
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster healthy: TRUE
Launch From Command Line
- Follow one of the deployment tutorials to launch an instance from the command line:
- on your desktop
- on ec2 instances
- on Hadoop servers
After launching the H2O cluster, initialize the connection by taking one node in the cluster and run h2o.init with the node’s IP Address and port in the parentheses. Note that the IP Address must be on your local machine. For the following example, change 192.168.1.161 to your local host.
> library(h2o)
> localH2O <- h2o.init(ip = '192.168.1.161', port =54321)
Warning
If the version of the current H2O instance is not the same as the package version loaded in R, a “version mismatch” warning message displays. To fix this issue, update the R package or launch an H2O instance using the jar file from the installed package.
Error in h2o.init():
Version mismatch! H2O is running version # but R package is version #
Cluster Info
To check the status and health of the H2O cluster, use h2o.clusterInfo() to display an easy-to-read summary of information about the cluster.
> library(h2o)
> localH2O = h2o.init(ip = 'localhost', port = 54321)
> h2o.clusterInfo(localH2O)
R is connected to H2O cluster:
H2O cluster uptime: 43 minutes 43 seconds
H2O cluster version: 2.7.0.1497
H2O cluster name: H2O_started_from_R
H2O cluster total nodes: 1
H2O cluster total memory: 3.56 GB
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster healthy: TRUE
Importing Data¶
Import File
The H2O package consolidates all of the various supported import functions. Although h2o.importFolder and h2o.importHDFS will still work, these functions are deprecated and should be updated to h2o.importFile.
## To import small iris data file from H2O's package
> irisPath = system.file("extdata", "iris.csv", package="h2o")
> iris.hex = h2o.importFile(localH2O, path = irisPath, key = "iris.hex")
|=================================================| 100%
## To import an entire folder of files as one data object
> pathToFolder = "/Users/Amy/0xdata/data/airlines/"
> airlines.hex = h2o.importFile(localH2O, path = pathToFolder, key = "airlines.hex")
|=================================================| 100%
## To import from HDFS
> pathToData = "hdfs://mr-0xd6.0xdata.loc/datasets/airlines_all.csv"
> airlines.hex = h2o.importFile(localH2O, path = pathToData, key = "airlines.hex")
|=================================================| 100%
Upload File
To upload a file from your local disk, importFile is recommended. However, you can still run upload file.
> irisPath = system.file("extdata", "iris.csv", package="h2o")
> iris.hex = h2o.uploadFile(localH2O, path = irisPath, key = "iris.hex")
|====================================================| 100%
Data Manipulation and Description¶
Any Factor
Determine if any column in a data set is a factor.
> irisPath = system.file("extdata", "iris_wheader.csv", package="h2o")
> iris.hex = h2o.importFile(localH2O, path = irisPath)
|===================================================| 100%
> h2o.anyFactor(iris.hex)
[1] TRUE
As Data Frame
Convert an H2O parsed data object into an R data frame that can be manipulated using R calls. While this can be very useful, be careful with as.data.frame when converting H2O Parsed Data objects. Data sets that are easily and quickly handled by H2O are often too large to be treated equivalently well in R.
> prosPath <- system.file("extdata", "prostate.csv", package="h2o")
> prostate.hex = h2o.importFile(localH2O, path = prosPath)
|===================================================| 100%
> prostate.data.frame<- as.data.frame(prostate.hex)
> summary(prostate.data.frame)
ID CAPSULE AGE RACE
Min. : 1.00 Min. :0.0000 Min. :43.00 Min. :0.000
1st Qu.: 95.75 1st Qu.:0.0000 1st Qu.:62.00 1st Qu.:1.000
....
> head(prostate.data.frame)
ID CAPSULE AGE RACE DPROS DCAPS PSA VOL GLEASON
1 1 0 65 1 2 1 1.4 0.0 6
2 2 0 72 1 3 2 6.7 0.0 7
....
As Factor
Convert an integer into a non-ordered factor (also called an enum or categorical).
> prosPath = system.file("extdata", "prostate.csv", package="h2o")
> prostate.hex = h2o.importFile(localH2O, path = prosPath)
|===================================================| 100%
> prostate.hex[,4] = as.factor(prostate.hex[,4])
> summary(prostate.hex)
ID CAPSULE AGE RACE DPROS
Min. : 1.00 Min. :0.0000 Min. :43.00 1 :341 Min. :1.000
1st Qu.: 95.75 1st Qu.:0.0000 1st Qu.:62.00 2 : 36 1st Qu.:1.000
....
As H2O
Pass a data frame from inside the R environment to the H2O instance.
> data(iris)
> summary(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
....
> iris.r <- iris
> iris.h2o <- as.h2o(localH2O, iris.r, key="iris.h2o")
|===================================================| 100%
> class(iris.h2o)
[1] "H2OParsedData"
attr(,"package")
[1] "h2o"
Assign H2O
Create a hex key on the server running H2O for data sets manipulated in R. For instance, in the example below, the prostate data set was uploaded to the H2O instance and manipulated to remove outliers. To save the new data set on the H2O server so that it can be subsequently be analyzed with H2O without overwriting the original data set, use h2o.assign.
> prosPath = system.file("extdata", "prostate.csv", package="h2o")
> prostate.hex = h2o.importFile(localH2O, path = prosPath)
|===================================================| 100%
> prostate.qs = quantile(prostate.hex$PSA)
> PSA.outliers = prostate.hex[prostate.hex$PSA
<= prostate.qs[2] | prostate.hex$PSA >= prostate.qs[10],]
> PSA.outliers = h2o.assign(PSA.outliers, "PSA.outliers")
> nrow(prostate.hex)
[1] 380
> nrow(PSA.outliers)
[1] 380
Colnames
Obtain a list of the column names in a data set.
> irisPath = system.file("extdata", "iris.csv", package="h2o")
> iris.hex = h2o.importFile(localH2O, path = irisPath, key = "iris.hex")
|===================================================| 100%
> colnames(iris.hex)
[1] "C1" "C2" "C3" "C4" "C5"
Extremes
Obtain the maximum and minimum values in real-valued columns.
> ausPath = system.file("extdata", "australia.csv", package="h2o")
> australia.hex = h2o.importFile(localH2O, path = ausPath, key = "australia.hex")
|===================================================| 100%
> min(australia.hex)
[1] 0
> min(c(-1, 0.5, 0.2), FALSE, australia.hex[,1:4])
[1] -1
Quantile
Request quantiles for an H2O parsed data set. To request a quantile for a single numeric column, use the column name (for example, $AGE). When you request for a full parsed data set, quantile() returns a matrix that displays quantile information for all numeric columns in the data set.
> prosPath = system.file("extdata", "prostate.csv", package="h2o")
> prostate.hex = h2o.importFile(localH2O, path = prosPath)
|===================================================| 100%
> quantile(prostate.hex$AGE)
Summary
Generate an R-like summary for each of the columns in a data set. For continuous real functions, this produces a summary that includes information on quartiles, min, max, and mean. For factors, this produces information about counts of elements within each factor level. For information on the Summary algorithm, see Summary.
> prosPath = system.file("extdata", "prostate.csv", package="h2o")
> prostate.hex = h2o.importFile(localH2O, path = prosPath)
|===================================================| 100%
> summary(prostate.hex)
ID CAPSULE AGE RACE
Min. : 1.00 Min. :0.0000 Min. :43.00 Min. :0.000
1st Qu.: 95.75 1st Qu.:0.0000 1st Qu.:62.00 1st Qu.:1.000
....
> summary(prostate.hex$GLEASON)
GLEASON
Min. :0.000
1st Qu.:6.000
Median :6.000
Mean :6.384
3rd Qu.:7.000
Max. :9.000
> summary(prostate.hex[,4:6])
RACE DPROS DCAPS
Min. :0.000 Min. :1.000 Min. :1.000
1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
Median :1.000 Median :2.000 Median :1.000
Mean :1.087 Mean :2.271 Mean :1.108
3rd Qu.:1.000 3rd Qu.:3.000 3rd Qu.:1.000
Max. :2.000 Max. :4.000 Max. :2.000
H2O Table
Summarize information in data. Because H2O handles such large data sets, it is possible to generate tables that are larger than R’s capacity. To minimize this risk and enable uninterrupted work, h2o.table is called inside of a call for head() or tail(). Within head() and tail(), specify the number of rows in the table to return.
> head(h2o.table(prostate.hex[,3]))
row.names Count
1 43 1
2 47 1
3 50 2
4 51 3
5 52 2
6 53 4
> head(h2o.table(prostate.hex[,c(3,4)]))
row.names X0 X1 X2
1 43 1 0 0
2 47 0 1 0
3 50 0 2 0
4 51 0 3 0
5 52 0 2 0
6 53 0 3 1
Generate Random Uniformly Distributed Numbers
h2o.runif() appends a column of random numbers to an H2O data frame and facilitates creating testing/training data splits for analysis and validation in H2O.
> prosPath = system.file("extdata", "prostate.csv", package="h2o")
> prostate.hex = h2o.importFile(localH2O, path = prosPath, key = "prostate.hex")
|===================================================| 100%
> s = h2o.runif(prostate.hex)
> summary(s)
rnd
Min. :0.001434
1st Qu.:0.241275
Median :0.496995
Mean :0.489468
3rd Qu.:0.740592
Max. :0.994894
> prostate.train = prostate.hex[s <= 0.8,]
> prostate.train = h2o.assign(prostate.train, "prostate.train")
> prostate.test = prostate.hex[s > 0.8,]
> prostate.test = h2o.assign(prostate.test, "prostate.test")
> nrow(prostate.train) + nrow(prostate.test)
[1] 380
Split Frame
Generate two subsets from an existing H2O data set, according to user-specified ratios that can be used as testing/training sets. This is the preferred method of splitting a data frame because it’s faster and more stable than running runif across entire the data set. However, runif can be used for customized frame splitting.
> prosPath = system.file("extdata", "prostate.csv", package="h2o")
> prostate.hex = h2o.importFile(localH2O, path = prosPath, key = "prostate.hex")
|===================================================| 100%
> prostate.split = h2o.splitFrame(data = prostate.hex , ratios = 0.75)
> prostate.train = prostate.split[1]
> prostate.test = prostate.split[2]
> summary(prostate.train)
Length Class Mode
[1,] 9 H2OParsedData S4
> summary(prostate.test)
Length Class Mode
[1,] 9 H2OParsedData S4
Running Models¶
GBM
Generate Gradient Boosted Models (GBM), which are used to develop forward-learning ensembles. For information on the GBM algorithm, see Gradient Boosted Regression and Classification.
> ausPath = system.file("extdata", "australia.csv", package="h2o")
> australia.hex = h2o.importFile(localH2O, path = ausPath)
|===================================================| 100%
> independent <- c("premax", "salmax","minairtemp", "maxairtemp",
"maxsst", "maxsoilmoist", "Max_czcs")
> dependent <- "runoffnew"
> h2o.gbm(y = dependent, x = independent, data = australia.hex,
> n.trees = 10, interaction.depth = 3,
n.minobsinnode = 2, shrinkage = 0.2, distribution= "gaussian")
|======================================================| 100%
IP Address: 127.0.0.1
Port : 54321
Parsed Data Key: australia1.hex
GBM Model Key: GBM_a3ae2edf5dfadbd9ba5dc2e9560c405d
Mean-squared Error by tree:
[1] 230760.11 166957.80 124904.30 94031.17 72367.01 57180.17 47092.85
[8] 39168.05 34456.00 31095.86 28397.10
Run multinomial classification GBM on abalone data
To generate a classification model that uses labels, use a multinomial distribution.
> h2o.gbm(y = dependent, x = independent, data = australia.hex, n.trees
= 15, interaction.depth = 5,
n.minobsinnode = 2, shrinkage = 0.01, distribution= "multinomial")
IP Address: 127.0.0.1
Port : 54321
Parsed Data Key: australia1.hex
GBM Model Key: GBM_8e4591a9b413407b983d73fbd9eb44cf
Confusion matrix:
Reported on australia1.hex
Predicted
Actual 0 3 6 7 14 16 17 19 20 25 38 43 61 75 82 107 138 150 167 191 200
0 115 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
....
Totals 120 1 1 2 1 2 2 2 2 31 1 1 1 6 1 1 1 6 1 1 1
Predicted
Actual 210 245 300 343 396 400 462 480 514 533 545 600 750 764 840 933 960
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
14 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
....
Totals 1 1 20 1 1 1 1 1 1 1 1 8 1 1 1 1 1
Predicted
Actual 1154 1200 2000 2400 Error
0 0 0 0 0 0.000
3 0 0 0 0 0.000
6 0 0 0 0 0.000
7 0 0 0 0 0.000
....
Mean-squared Error by tree:
[1] 0.9529478 0.9337646 0.9157476 0.8985756 0.8818316 0.8654845 0.8497011
[8] 0.8341974 0.8187867 0.8036760 0.7887764 0.7741757 0.7594546 0.7452223
[15] 0.7309634 0.7168317
GLM
Generate Generalized Linear Models, which are used to develop linear models for exponential distributions. Regularization can be applied. For information on the GLM algorithm, see Generalized Linear Model.
> prostate.hex = h2o.importFile(localH2O, path =
"https://raw.github.com/0xdata/h2o/master/smalldata/logreg/prostate.csv",
key = "prostate.hex")
|===================================================| 100%
> h2o.glm(y = "CAPSULE", x = c("AGE","RACE","PSA","DCAPS"), data =
prostate.hex, family = "binomial", nfolds = 10, alpha = 0.5)
|=====================================================================| 100%
IP Address: 127.0.0.1
Port : 54321
Parsed Data Key: prostate.hex
GLM2 Model Key: GLMModel__a2fdb4e3fdd92e0325141cdbd1bd43e1
Coefficients:
AGE RACE DCAPS PSA Intercept
-0.01104 -0.63136 1.31888 0.04713 -1.10896
Normalized Coefficients:
AGE RACE DCAPS PSA Intercept
-0.07208 -0.19495 0.40972 0.94253 -0.33707
Degrees of Freedom: 379 Total (i.e. Null); 375 Residual
Null Deviance: 514.9
Residual Deviance: 461.3 AIC: 471.3
Deviance Explained: 0.10404
AUC: 0.68875 Best Threshold: 0.328
Confusion Matrix:
Predicted
Actual false true Error
false 127 100 0.441
true 51 102 0.333
Totals 178 202 0.397
Cross-Validation Models:
Nonzeros AUC Deviance Explained
Model 1 4 0.6532738 0.048419803
Model 2 4 0.6316527 -0.006414532
Model 3 4 0.7100840 0.087779178
Model 4 4 0.8268698 0.243020554
Model 5 4 0.6354167 0.153190735
Model 6 4 0.6888889 0.041892118
Model 7 4 0.7366071 0.164717509
Model 8 4 0.6711310 0.004897310
Model 9 4 0.7803571 0.200384622
Model 10 4 0.7435897 0.114548543
> myX = setdiff(colnames(prostate.hex), c("ID", "DPROS", "DCAPS", "VOL"))
> h2o.glm(y = "VOL", x = myX, data = prostate.hex, family = "gaussian", nfolds = 5,
alpha = 0.1)
|=========================================================| 100%
IP Address: 127.0.0.1
Port : 54321
Parsed Data Key: prostate.hex
GLM2 Model Key: GLMModel__b8339af00fbe8951ba0871611c9e42eb
Coefficients:
CAPSULE AGE RACE PSA GLEASON Intercept
-4.29014 0.29787 4.35557 0.04946 -0.51274 -4.35359
Normalized Coefficients:
CAPSULE AGE RACE PSA GLEASON Intercept
-2.10678 1.94424 1.34488 0.98908 -0.55989 15.81292
Degrees of Freedom: 379 Total (i.e. Null); 374 Residual
Null Deviance: 126623.9
Residual Deviance: 127402 AIC: 11059.1
Deviance Explained: -0.00615
Cross-Validation Models:
Nonzeros AIC Deviance Explained
Model 1 5 685.6101 -0.02827868
Model 2 5 660.3719 -0.15397511
Model 3 5 658.0768 0.05826293
Model 4 5 665.8665 0.05117173
Model 5 5 683.6276 0.01333543
K-Means
Generate a K-means model, which is a clustering algorithm that allows users to characterize data. This algorithm does not rely on a dependent variable. For information on the K-Means algorithm, see K-Means
> prosPath = system.file("extdata", "prostate.csv", package="h2o")
> prostate.hex = h2o.importFile(localH2O, path = prosPath)
|=========================================================| 100%
> prostate.km = h2o.kmeans(data = prostate.hex, centers = 10,
cols = c("AGE", "RACE", "VOL", "GLEASON"))
|=========================================================| 100%
print(prostate.km)
IP Address: 127.0.0.1
Port : 54321
Parsed Data Key: prostate6.hex
K-Means Model Key: KMeans2_99fea55be4a22f741df74532d7844bb4
K-means clustering with 10 clusters of sizes 41, 27, 59, 17, 21, 47, 26, 61, 47, 34
Cluster means:
AGE RACE VOL GLEASON
1 69.73171 1.024390 37.99756098 6.512195
2 54.48148 1.111111 0.32222222 6.518519
3 62.59322 1.067797 0.19322034 5.966102
.....
Principal Components Analysis
Map a set of variables onto a subspace using linear transformations. Principle Components Analysis (PCA) is the first step in Principal Components Regression. For more information on PCA, see Principal Components Analysis.
> ausPath = system.file("extdata", "australia.csv", package="h2o")
> australia.hex = h2o.importFile(localH2O, path = ausPath)
|=========================================================| 100%
> australia.pca = h2o.prcomp(data = australia.hex, standardize = TRUE)
|=========================================================| 100%
> print(australia.pca)
IP Address: 127.0.0.1
Port : 54321
Parsed Data Key: australia2.hex
PCA Model Key: PCA_90d7162c6d4855392ba1272c2f314bec
Standard deviations:
1.750703 1.512142 1.031181 0.8283127 0.6083786 0.5481364 0.4181621 0.2314953
....
summary(australia.pca)
Importance of components:
....
Principal Components Regression
Map a set of variables to a new set of linearly independent variables. The new set of variables are linearly independent linear combinations of the original variables and exist in a subspace of lower dimension. This transformation is then prepended to a regression model, often improving results. For more information on PCA, see Principal Components Analysis.
> prostate.hex = h2o.importFile(localH2O, path =
"https://raw.github.com/0xdata/h2o/master/smalldata/logreg/prostate.csv",
key = "prostate.hex")
|=========================================================| 100%
> h2o.pcr(x = c("AGE","RACE","PSA","DCAPS"), y = "CAPSULE", data =
prostate.hex, family = "binomial",
nfolds = 10, alpha = 0.5, ncomp = 3)
|==========================================================| 100%
IP Address: 127.0.0.1
Port : 54321
Parsed Data Key: PCAPredict_80069467adfe441c92282ac766f9de7e
GLM2 Model Key: GLMModel__a1454a5b8a212d1069376356543a4887
Coefficients:
PC0 PC1 PC2 Intercept
3.76219 1.26824 -1.35455 -0.36271
....
Obtaining Predictions¶
Predict
Apply an H2O model to a holdout set to obtain predictions based on model results. In the examples below, models are generated first, and then the predictions for that model are displayed.
> prostate.hex = h2o.importFile(localH2O, path =
"https://raw.github.com/0xdata/h2o/master/smalldata/logreg/prostate.csv",
key = "prostate.hex")
|==========================================================| 100%
> prostate.glm = h2o.glm(y = "CAPSULE", x =
c("AGE","RACE","PSA","DCAPS"), data = prostate.hex,
family = "binomial", nfolds = 10, alpha = 0.5)
|==========================================================| 100%
> prostate.fit = h2o.predict(object = prostate.glm, newdata = prostate.hex)
> (prostate.fit)
IP Address: 127.0.0.1
Port : 54321
Parsed Data Key: GLM2Predict_8b6890653fa743be9eb3ab1668c5a6e9
predict X0 X1
1 0 0.7452267 0.2547732
2 1 0.3969807 0.6030193
3 1 0.4120950 0.5879050
4 1 0.3726134 0.6273866
5 1 0.6465137 0.3534863
6 1 0.4331880 0.5668120
Other Useful Functions¶
Get Frame
For users that alternate between using the web interface and the R API, or for multiple users accessing the same H2O, this function gives the user the option to create a reference object for a data frame sitting in H2O (assuming there’s a prostate.hex in the KV store).
> prostate.hex = h2o.getFrame(h2o = localH2O, key = "prostate.hex")
Get Model
For users that alternate between using the web interface and the R API, this function gives the user the option to create a reference object for a data frame sitting in H2O (assuming there’s a GLMModel__ba724fe4f6d6d5b8b6370f776df94e47 model in the KV store).
> glm.model = h2o.getModel(h2o = localH2O, key = "GLMModel__ba724fe4f6d6d5b8b6370f776df94e47")
> glm.model
List all H2O Objects
Generate a list of all H2O objects generated during a work session, along with each object’s byte size.
> prostate.hex = h2o.importFile(localH2O, path = prosPath, key = "prostate.hex")
|==========================================================| 100%
> prostate.split = h2o.splitFrame(prostate.hex , ratio = 0.8)
> prostate.train = prostate.split[[1]]
> prostate.train = h2o.assign(prostate.train, "prostate.train")
> h2o.ls(localH2O)
Key Bytesize
1 GBM_8e4591a9b413407b983d73fbd9eb44cf 40617
2 GBM_a3ae2edf5dfadbd9ba5dc2e9560c405d 1516
Remove an H2O object from the server where H2O is running
To remove an H2O object on the server associated with an object in the R environment, we recommend also removing the object from the R environment.
> h2o.ls(localH2O)
Key Bytesize
1 Last.value.39 448
2 Last.value.42 73
3 prostate.hex 4874
4 prostate.train 4028
5 prostate_part0.hex 4028
6 prostate_part1.hex 1432
> h2o.rm(object= localH2O, keys= "prostate.train")
> h2o.ls(localH2O)
Key Bytesize
1 Last.value.39 448
2 Last.value.42 73
3 prostate.hex 4874
4 prostate_part0.hex 4028
5 prostate_part1.hex 1432