.. _R_Tutorial:

R Tutorial
==========

This tutorial provides a sample workflow for new users of H2O's R API. Readers will learn the basic syntax of H2O, including importing and parsing files, specifying a model, and obtaining model output. New H2O users should refer to the `quick start guide `_ for additional instructions on how to run H2O. This tutorial assumes that H2O is installed in R.

""""

Getting Started
"""""""""""""""

R uses a REST API to send functions to H2O, so a reference object in R to the H2O instance is required. You can start H2O outside of R and connect to it. You can also launch H2O directly from R, but if you close the R session, the H2O instance is closed as well. The client object is used to direct R to datasets and models located in H2O.

**Launch From R**

By default, if the argument **max_mem_size** is not specified when running **h2o.init()**, the heap size of an H2O instance running on 32-bit Java is 1g. On 64-bit Java, the heap size is 1/4 of the total memory available on the machine. For a 32-bit version, the function runs a check and suggests an upgrade.

::

  > library(h2o)
  > localH2O <- h2o.init(ip = 'localhost', port = 54321, max_mem_size = '4g')
  Successfully connected to http://localhost:54321
  R is connected to H2O cluster:
      H2O cluster uptime:        11 minutes 35 seconds
      H2O cluster version:       2.7.0.1497
      H2O cluster name:          H2O_started_from_R
      H2O cluster total nodes:   1
      H2O cluster total memory:  3.56 GB
      H2O cluster total cores:   8
      H2O cluster allowed cores: 8
      H2O cluster healthy:       TRUE

**Launch From Command Line**

Follow one of the `deployment tutorials `_ to launch an instance from the command line:

* on your desktop
* on ec2 instances
* on Hadoop servers

After launching the H2O cluster, initialize the connection by choosing one node in the cluster and running **h2o.init** with that node's IP address and port as arguments. Note that the IP address must be on your local machine. In the following example, change **192.168.1.161** to your local host.
::

  > library(h2o)
  > localH2O <- h2o.init(ip = '192.168.1.161', port = 54321)

.. WARNING::
  If the version of the current H2O instance is not the same as the package version loaded in R, a "version mismatch" error message displays. To fix this issue, update the R package or launch an H2O instance using the jar file from the installed package.

::

  Error in h2o.init(): Version mismatch! H2O is running version # but R package is version #

**Cluster Info**

To check the status and health of the H2O cluster, use **h2o.clusterInfo()** to display an easy-to-read summary of information about the cluster.

::

  > library(h2o)
  > localH2O = h2o.init(ip = 'localhost', port = 54321)
  > h2o.clusterInfo(localH2O)
  R is connected to H2O cluster:
      H2O cluster uptime:        43 minutes 43 seconds
      H2O cluster version:       2.7.0.1497
      H2O cluster name:          H2O_started_from_R
      H2O cluster total nodes:   1
      H2O cluster total memory:  3.56 GB
      H2O cluster total cores:   8
      H2O cluster allowed cores: 8
      H2O cluster healthy:       TRUE

""""

Importing Data
""""""""""""""

**Import File**

The H2O package consolidates all of the supported import functions into **h2o.importFile**. Although **h2o.importFolder** and **h2o.importHDFS** still work, they are deprecated, and calls to them should be updated to **h2o.importFile**.
::

  ## To import small iris data file from H2O's package
  > irisPath = system.file("extdata", "iris.csv", package="h2o")
  > iris.hex = h2o.importFile(localH2O, path = irisPath, key = "iris.hex")
  |=================================================| 100%

  ## To import an entire folder of files as one data object
  > pathToFolder = "/Users/Amy/0xdata/data/airlines/"
  > airlines.hex = h2o.importFile(localH2O, path = pathToFolder, key = "airlines.hex")
  |=================================================| 100%

  ## To import from HDFS
  > pathToData = "hdfs://mr-0xd6.0xdata.loc/datasets/airlines_all.csv"
  > airlines.hex = h2o.importFile(localH2O, path = pathToData, key = "airlines.hex")
  |=================================================| 100%

**Upload File**

To upload a file from your local disk, **h2o.importFile** is recommended. However, you can still run **h2o.uploadFile**.

::

  > irisPath = system.file("extdata", "iris.csv", package="h2o")
  > iris.hex = h2o.uploadFile(localH2O, path = irisPath, key = "iris.hex")
  |====================================================| 100%

""""

Data Manipulation and Description
"""""""""""""""""""""""""""""""""

**Any Factor**

Determine if any column in a data set is a factor.

::

  > irisPath = system.file("extdata", "iris_wheader.csv", package="h2o")
  > iris.hex = h2o.importFile(localH2O, path = irisPath)
  |===================================================| 100%
  > h2o.anyFactor(iris.hex)
  [1] TRUE

**As Data Frame**

Convert an H2O parsed data object into an R data frame that can be manipulated using R calls. While this can be very useful, be careful with **as.data.frame** when converting H2O parsed data objects. Data sets that are easily and quickly handled by H2O are often too large to be handled equally well in R.
::

  > prosPath <- system.file("extdata", "prostate.csv", package="h2o")
  > prostate.hex = h2o.importFile(localH2O, path = prosPath)
  |===================================================| 100%
  > prostate.data.frame <- as.data.frame(prostate.hex)
  > summary(prostate.data.frame)
         ID            CAPSULE            AGE             RACE
   Min.   :  1.00   Min.   :0.0000   Min.   :43.00   Min.   :0.000
   1st Qu.: 95.75   1st Qu.:0.0000   1st Qu.:62.00   1st Qu.:1.000
   ....
  > head(prostate.data.frame)
    ID CAPSULE AGE RACE DPROS DCAPS PSA VOL GLEASON
  1  1       0  65    1     2     1 1.4 0.0       6
  2  2       0  72    1     3     2 6.7 0.0       7
  ....

**As Factor**

Convert an integer column into a non-ordered factor (also called an enum or categorical).

::

  > prosPath = system.file("extdata", "prostate.csv", package="h2o")
  > prostate.hex = h2o.importFile(localH2O, path = prosPath)
  |===================================================| 100%
  > prostate.hex[,4] = as.factor(prostate.hex[,4])
  > summary(prostate.hex)
         ID            CAPSULE            AGE          RACE        DPROS
   Min.   :  1.00   Min.   :0.0000   Min.   :43.00   1 :341   Min.   :1.000
   1st Qu.: 95.75   1st Qu.:0.0000   1st Qu.:62.00   2 : 36   1st Qu.:1.000
   ....

**As H2O**

Pass a data frame from inside the R environment to the H2O instance.

::

  > data(iris)
  > summary(iris)
    Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
   Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
   1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
   ....
  > iris.r <- iris
  > iris.h2o <- as.h2o(localH2O, iris.r, key="iris.h2o")
  |===================================================| 100%
  > class(iris.h2o)
  [1] "H2OParsedData"
  attr(,"package")
  [1] "h2o"

**Assign H2O**

Create a hex key on the server running H2O for data sets manipulated in R. For instance, in the example below, the prostate data set is uploaded to the H2O instance and filtered on its PSA quantiles to isolate outliers. To save the new data set on the H2O server so that it can subsequently be analyzed with H2O without overwriting the original data set, use **h2o.assign**.
::

  > prosPath = system.file("extdata", "prostate.csv", package="h2o")
  > prostate.hex = h2o.importFile(localH2O, path = prosPath)
  |===================================================| 100%
  > prostate.qs = quantile(prostate.hex$PSA)
  > PSA.outliers = prostate.hex[prostate.hex$PSA <= prostate.qs[2] | prostate.hex$PSA >= prostate.qs[10],]
  > PSA.outliers = h2o.assign(PSA.outliers, "PSA.outliers")
  > nrow(prostate.hex)
  [1] 380
  > nrow(PSA.outliers)
  [1] 380

**Colnames**

Obtain a list of the column names in a data set.

::

  > irisPath = system.file("extdata", "iris.csv", package="h2o")
  > iris.hex = h2o.importFile(localH2O, path = irisPath, key = "iris.hex")
  |===================================================| 100%
  > colnames(iris.hex)
  [1] "C1" "C2" "C3" "C4" "C5"

**Extremes**

Obtain the maximum and minimum values in real-valued columns.

::

  > ausPath = system.file("extdata", "australia.csv", package="h2o")
  > australia.hex = h2o.importFile(localH2O, path = ausPath, key = "australia.hex")
  |===================================================| 100%
  > min(australia.hex)
  [1] 0
  > min(c(-1, 0.5, 0.2), FALSE, australia.hex[,1:4])
  [1] -1

**Quantile**

Request quantiles for an H2O parsed data set. To request quantiles for a single numeric column, select the column by name (for example, **$AGE**). When you request quantiles for a full parsed data set, **quantile()** returns a matrix that displays quantile information for all numeric columns in the data set.

::

  > prosPath = system.file("extdata", "prostate.csv", package="h2o")
  > prostate.hex = h2o.importFile(localH2O, path = prosPath)
  |===================================================| 100%
  > quantile(prostate.hex$AGE)

**Summary**

Generate an R-like summary for each of the columns in a data set. For continuous real-valued columns, this produces a summary that includes information on quartiles, min, max, and mean. For factors, this produces information about counts of elements within each factor level.
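
As a rough local illustration of these two kinds of output, the following base R sketch (not the H2O implementation) summarizes one numeric column and one factor column of R's built-in **iris** data frame. The variable names are chosen here for illustration only.

::

  # Base R only: no H2O cluster is needed for this sketch.
  data(iris)

  # Numeric column: quartiles, min, max, and mean.
  num.summary <- summary(iris$Sepal.Length)
  print(names(num.summary))

  # Factor column: one count per factor level.
  fac.summary <- summary(iris$Species)
  print(fac.summary)
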
For information on the Summary algorithm, see :ref:`SUMmath`.

::

  > prosPath = system.file("extdata", "prostate.csv", package="h2o")
  > prostate.hex = h2o.importFile(localH2O, path = prosPath)
  |===================================================| 100%
  > summary(prostate.hex)
         ID            CAPSULE            AGE             RACE
   Min.   :  1.00   Min.   :0.0000   Min.   :43.00   Min.   :0.000
   1st Qu.: 95.75   1st Qu.:0.0000   1st Qu.:62.00   1st Qu.:1.000
   ....
  > summary(prostate.hex$GLEASON)
      GLEASON
   Min.   :0.000
   1st Qu.:6.000
   Median :6.000
   Mean   :6.384
   3rd Qu.:7.000
   Max.   :9.000
  > summary(prostate.hex[,4:6])
        RACE            DPROS           DCAPS
   Min.   :0.000   Min.   :1.000   Min.   :1.000
   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000
   Median :1.000   Median :2.000   Median :1.000
   Mean   :1.087   Mean   :2.271   Mean   :1.108
   3rd Qu.:1.000   3rd Qu.:3.000   3rd Qu.:1.000
   Max.   :2.000   Max.   :4.000   Max.   :2.000

**H2O Table**

Summarize information in data. Because H2O handles such large data sets, it is possible to generate tables that are larger than R's capacity. To minimize this risk and enable uninterrupted work, call **h2o.table** inside a call to **head()** or **tail()**. Within **head()** and **tail()**, specify the number of rows in the table to return.

::

  > head(h2o.table(prostate.hex[,3]))
    row.names Count
  1        43     1
  2        47     1
  3        50     2
  4        51     3
  5        52     2
  6        53     4
  > head(h2o.table(prostate.hex[,c(3,4)]))
    row.names X0 X1 X2
  1        43  1  0  0
  2        47  0  1  0
  3        50  0  2  0
  4        51  0  3  0
  5        52  0  2  0
  6        53  0  3  1

**Generate Random Uniformly Distributed Numbers**

**h2o.runif()** appends a column of random numbers to an H2O data frame and facilitates creating testing/training data splits for analysis and validation in H2O.

::

  > prosPath = system.file("extdata", "prostate.csv", package="h2o")
  > prostate.hex = h2o.importFile(localH2O, path = prosPath, key = "prostate.hex")
  |===================================================| 100%
  > s = h2o.runif(prostate.hex)
  > summary(s)
        rnd
   Min.   :0.001434
   1st Qu.:0.241275
   Median :0.496995
   Mean   :0.489468
   3rd Qu.:0.740592
   Max.   :0.994894
  > prostate.train = prostate.hex[s <= 0.8,]
  > prostate.train = h2o.assign(prostate.train, "prostate.train")
  > prostate.test = prostate.hex[s > 0.8,]
  > prostate.test = h2o.assign(prostate.test, "prostate.test")
  > nrow(prostate.train) + nrow(prostate.test)
  [1] 380

**Split Frame**

Generate two subsets from an existing H2O data set according to user-specified ratios. The subsets can be used as testing/training sets. This is the preferred method of splitting a data frame because it is faster and more stable than running **h2o.runif** across the entire data set. However, **h2o.runif** can be used for customized frame splitting.

::

  > prosPath = system.file("extdata", "prostate.csv", package="h2o")
  > prostate.hex = h2o.importFile(localH2O, path = prosPath, key = "prostate.hex")
  |===================================================| 100%
  > prostate.split = h2o.splitFrame(data = prostate.hex, ratios = 0.75)
  > prostate.train = prostate.split[1]
  > prostate.test = prostate.split[2]
  > summary(prostate.train)
       Length Class         Mode
  [1,] 9      H2OParsedData S4
  > summary(prostate.test)
       Length Class         Mode
  [1,] 9      H2OParsedData S4

""""

Running Models
""""""""""""""

**GBM**

Generate Gradient Boosted Models (GBM), which are used to develop forward-learning ensembles. For information on the GBM algorithm, see :ref:`GBMmath`.
::

  > ausPath = system.file("extdata", "australia.csv", package="h2o")
  > australia.hex = h2o.importFile(localH2O, path = ausPath)
  |===================================================| 100%
  > independent <- c("premax", "salmax", "minairtemp", "maxairtemp", "maxsst", "maxsoilmoist", "Max_czcs")
  > dependent <- "runoffnew"
  > h2o.gbm(y = dependent, x = independent, data = australia.hex,
  +         n.trees = 10, interaction.depth = 3, n.minobsinnode = 2,
  +         shrinkage = 0.2, distribution = "gaussian")
  |======================================================| 100%
  IP Address: 127.0.0.1
  Port      : 54321
  Parsed Data Key: australia1.hex

  GBM Model Key: GBM_a3ae2edf5dfadbd9ba5dc2e9560c405d

  Mean-squared Error by tree:
   [1] 230760.11 166957.80 124904.30  94031.17  72367.01  57180.17  47092.85
   [8]  39168.05  34456.00  31095.86  28397.10

*Run multinomial classification GBM on the australia data*

To generate a classification model that uses labels, use a **multinomial** distribution.

::

  > h2o.gbm(y = dependent, x = independent, data = australia.hex,
  +         n.trees = 15, interaction.depth = 5, n.minobsinnode = 2,
  +         shrinkage = 0.01, distribution = "multinomial")
  IP Address: 127.0.0.1
  Port      : 54321
  Parsed Data Key: australia1.hex

  GBM Model Key: GBM_8e4591a9b413407b983d73fbd9eb44cf

  Confusion matrix:
  Reported on australia1.hex
          Predicted
  Actual    0   3   6   7  14  16  17  19  20  25  38  43  61  75  82 107 138 150 167 191 200
  0       115   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
  3         0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
  6         0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
  7         0   0   0   2   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
  ....
  Totals  120   1   1   2   1   2   2   2   2  31   1   1   1   6   1   1   1   6   1   1   1
          Predicted
  Actual  210 245 300 343 396 400 462 480 514 533 545 600 750 764 840 933 960
  0         0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
  3         0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
  6         0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
  7         0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
  14        0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
  16        0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
  ....
  Totals    1   1  20   1   1   1   1   1   1   1   1   8   1   1   1   1   1
          Predicted
  Actual  1154 1200 2000 2400 Error
  0          0    0    0    0 0.000
  3          0    0    0    0 0.000
  6          0    0    0    0 0.000
  7          0    0    0    0 0.000
  ....

  Mean-squared Error by tree:
   [1] 0.9529478 0.9337646 0.9157476 0.8985756 0.8818316 0.8654845 0.8497011
   [8] 0.8341974 0.8187867 0.8036760 0.7887764 0.7741757 0.7594546 0.7452223
  [15] 0.7309634 0.7168317

**GLM**

Generate Generalized Linear Models (GLM), which are used to develop linear models for exponential-family distributions. Regularization can be applied. For information on the GLM algorithm, see :ref:`GLMmath`.

::

  > prostate.hex = h2o.importFile(localH2O, path = "https://raw.github.com/0xdata/h2o/master/smalldata/logreg/prostate.csv", key = "prostate.hex")
  |===================================================| 100%
  > h2o.glm(y = "CAPSULE", x = c("AGE","RACE","PSA","DCAPS"), data = prostate.hex, family = "binomial", nfolds = 10, alpha = 0.5)
  |=====================================================================| 100%
  IP Address: 127.0.0.1
  Port      : 54321
  Parsed Data Key: prostate.hex

  GLM2 Model Key: GLMModel__a2fdb4e3fdd92e0325141cdbd1bd43e1

  Coefficients:
        AGE      RACE     DCAPS       PSA Intercept
   -0.01104  -0.63136   1.31888   0.04713  -1.10896

  Normalized Coefficients:
        AGE      RACE     DCAPS       PSA Intercept
   -0.07208  -0.19495   0.40972   0.94253  -0.33707

  Degrees of Freedom: 379 Total (i.e. Null); 375 Residual
  Null Deviance:     514.9
  Residual Deviance: 461.3  AIC: 471.3
  Deviance Explained: 0.10404
  AUC: 0.68875  Best Threshold: 0.328

  Confusion Matrix:
          Predicted
  Actual   false true Error
    false    127  100 0.441
    true      51  102 0.333
    Totals   178  202 0.397

  Cross-Validation Models:
           Nonzeros       AUC Deviance Explained
  Model 1         4 0.6532738        0.048419803
  Model 2         4 0.6316527       -0.006414532
  Model 3         4 0.7100840        0.087779178
  Model 4         4 0.8268698        0.243020554
  Model 5         4 0.6354167        0.153190735
  Model 6         4 0.6888889        0.041892118
  Model 7         4 0.7366071        0.164717509
  Model 8         4 0.6711310        0.004897310
  Model 9         4 0.7803571        0.200384622
  Model 10        4 0.7435897        0.114548543

::

  > myX = setdiff(colnames(prostate.hex), c("ID", "DPROS", "DCAPS", "VOL"))
  > h2o.glm(y = "VOL", x = myX, data = prostate.hex, family = "gaussian", nfolds = 5, alpha = 0.1)
  |=========================================================| 100%
  IP Address: 127.0.0.1
  Port      : 54321
  Parsed Data Key: prostate.hex

  GLM2 Model Key: GLMModel__b8339af00fbe8951ba0871611c9e42eb

  Coefficients:
    CAPSULE       AGE      RACE       PSA   GLEASON Intercept
   -4.29014   0.29787   4.35557   0.04946  -0.51274  -4.35359

  Normalized Coefficients:
    CAPSULE       AGE      RACE       PSA   GLEASON Intercept
   -2.10678   1.94424   1.34488   0.98908  -0.55989  15.81292

  Degrees of Freedom: 379 Total (i.e. Null); 374 Residual
  Null Deviance:     126623.9
  Residual Deviance: 127402  AIC: 11059.1
  Deviance Explained: -0.00615

  Cross-Validation Models:
          Nonzeros      AIC Deviance Explained
  Model 1        5 685.6101        -0.02827868
  Model 2        5 660.3719        -0.15397511
  Model 3        5 658.0768         0.05826293
  Model 4        5 665.8665         0.05117173
  Model 5        5 683.6276         0.01333543

**K-Means**

Generate a K-means model. K-means is a clustering algorithm that allows users to characterize data. This algorithm does not rely on a dependent variable.
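
To make the absence of a dependent variable concrete, here is a minimal base R sketch that uses **kmeans** from the stats package rather than H2O. The data set, the choice of three centers, and the seed are assumptions made for illustration. Only feature columns are passed in, and the label column is never shown to the algorithm.

::

  # Base R only: illustrates clustering without a response column.
  set.seed(42)                      # assumed seed, for repeatable cluster starts

  # Keep the four numeric measurements and drop the Species label.
  features <- iris[, 1:4]

  # Cluster on the features alone; no y argument exists or is needed.
  fit <- kmeans(features, centers = 3, nstart = 10)

  print(fit$size)                   # number of rows assigned to each cluster
  print(fit$centers)                # per-cluster means of each feature
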
For information on the K-Means algorithm, see :ref:`KMmath`.

::

  > prosPath = system.file("extdata", "prostate.csv", package="h2o")
  > prostate.hex = h2o.importFile(localH2O, path = prosPath)
  |=========================================================| 100%
  > prostate.km = h2o.kmeans(data = prostate.hex, centers = 10, cols = c("AGE", "RACE", "VOL", "GLEASON"))
  |=========================================================| 100%
  > print(prostate.km)
  IP Address: 127.0.0.1
  Port      : 54321
  Parsed Data Key: prostate6.hex

  K-Means Model Key: KMeans2_99fea55be4a22f741df74532d7844bb4

  K-means clustering with 10 clusters of sizes 41, 27, 59, 17, 21, 47, 26, 61, 47, 34

  Cluster means:
         AGE     RACE         VOL  GLEASON
  1 69.73171 1.024390 37.99756098 6.512195
  2 54.48148 1.111111  0.32222222 6.518519
  3 62.59322 1.067797  0.19322034 5.966102
  .....

**Principal Components Analysis**

Map a set of variables onto a subspace using linear transformations. Principal Components Analysis (PCA) is the first step in Principal Components Regression. For more information on PCA, see :ref:`PCAmath`.

::

  > ausPath = system.file("extdata", "australia.csv", package="h2o")
  > australia.hex = h2o.importFile(localH2O, path = ausPath)
  |=========================================================| 100%
  > australia.pca = h2o.prcomp(data = australia.hex, standardize = TRUE)
  |=========================================================| 100%
  > print(australia.pca)
  IP Address: 127.0.0.1
  Port      : 54321
  Parsed Data Key: australia2.hex

  PCA Model Key: PCA_90d7162c6d4855392ba1272c2f314bec

  Standard deviations:
   1.750703 1.512142 1.031181 0.8283127 0.6083786 0.5481364 0.4181621 0.2314953
  ....
  > summary(australia.pca)
  Importance of components:
  ....

**Principal Components Regression**

Map a set of variables to a new set of linearly independent variables. The new variables are linearly independent linear combinations of the original variables and exist in a subspace of lower dimension.
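
A minimal base R sketch of this mapping, followed by the regression step that Principal Components Regression adds, is shown here with **prcomp** and **lm**. This is not H2O's **h2o.pcr**; the **mtcars** columns and the choice of two components are assumptions made for illustration.

::

  # Base R only: PCA on standardized predictors, then regress on the scores.
  predictors <- mtcars[, c("disp", "hp", "wt", "drat")]
  pca <- prcomp(predictors, center = TRUE, scale. = TRUE)

  # Keep the first two component scores: a lower-dimensional subspace.
  scores <- as.data.frame(pca$x[, 1:2])      # columns PC1 and PC2
  scores$mpg <- mtcars$mpg

  # The component scores are uncorrelated with each other (up to rounding).
  print(cor(scores$PC1, scores$PC2))

  # Regression on the components instead of the original variables.
  pcr.fit <- lm(mpg ~ PC1 + PC2, data = scores)
  print(coef(pcr.fit))
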
This transformation is then prepended to a regression model, often improving results. For more information on PCA, see :ref:`PCAmath`.

::

  > prostate.hex = h2o.importFile(localH2O, path = "https://raw.github.com/0xdata/h2o/master/smalldata/logreg/prostate.csv", key = "prostate.hex")
  |=========================================================| 100%
  > h2o.pcr(x = c("AGE","RACE","PSA","DCAPS"), y = "CAPSULE", data = prostate.hex, family = "binomial", nfolds = 10, alpha = 0.5, ncomp = 3)
  |==========================================================| 100%
  IP Address: 127.0.0.1
  Port      : 54321
  Parsed Data Key: PCAPredict_80069467adfe441c92282ac766f9de7e

  GLM2 Model Key: GLMModel__a1454a5b8a212d1069376356543a4887

  Coefficients:
        PC0       PC1       PC2 Intercept
    3.76219   1.26824  -1.35455  -0.36271
  ....

""""

Obtaining Predictions
"""""""""""""""""""""

**Predict**

Apply an H2O model to a holdout set to obtain predictions based on model results. In the example below, a model is generated first, and then the predictions from that model are displayed.
::

  > prostate.hex = h2o.importFile(localH2O, path = "https://raw.github.com/0xdata/h2o/master/smalldata/logreg/prostate.csv", key = "prostate.hex")
  |==========================================================| 100%
  > prostate.glm = h2o.glm(y = "CAPSULE", x = c("AGE","RACE","PSA","DCAPS"), data = prostate.hex, family = "binomial", nfolds = 10, alpha = 0.5)
  |==========================================================| 100%
  > prostate.fit = h2o.predict(object = prostate.glm, newdata = prostate.hex)
  > (prostate.fit)
  IP Address: 127.0.0.1
  Port      : 54321
  Parsed Data Key: GLM2Predict_8b6890653fa743be9eb3ab1668c5a6e9

    predict        X0        X1
  1       0 0.7452267 0.2547732
  2       1 0.3969807 0.6030193
  3       1 0.4120950 0.5879050
  4       1 0.3726134 0.6273866
  5       1 0.6465137 0.3534863
  6       1 0.4331880 0.5668120

""""

Other Useful Functions
""""""""""""""""""""""

**Get Frame**

For users who alternate between the web interface and the R API, or for multiple users accessing the same H2O instance, this function creates a reference object for a data frame sitting in H2O (assuming there is a **prostate.hex** in the KV store).

::

  > prostate.hex = h2o.getFrame(h2o = localH2O, key = "prostate.hex")

**Get Model**

For users who alternate between the web interface and the R API, this function creates a reference object for a model sitting in H2O (assuming there is a **GLMModel__ba724fe4f6d6d5b8b6370f776df94e47** model in the KV store).

::

  > glm.model = h2o.getModel(h2o = localH2O, key = "GLMModel__ba724fe4f6d6d5b8b6370f776df94e47")
  > glm.model

**List all H2O Objects**

Generate a list of all H2O objects created during a work session, along with each object's byte size.
::

  > prosPath = system.file("extdata", "prostate.csv", package="h2o")
  > prostate.hex = h2o.importFile(localH2O, path = prosPath, key = "prostate.hex")
  |==========================================================| 100%
  > prostate.split = h2o.splitFrame(prostate.hex, ratios = 0.8)
  > prostate.train = prostate.split[[1]]
  > prostate.train = h2o.assign(prostate.train, "prostate.train")
  > h2o.ls(localH2O)
                                     Key Bytesize
  1 GBM_8e4591a9b413407b983d73fbd9eb44cf    40617
  2 GBM_a3ae2edf5dfadbd9ba5dc2e9560c405d     1516

**Remove an H2O object from the server where H2O is running**

When you remove an H2O object from the server that is associated with an object in the R environment, we recommend also removing the object from the R environment.

::

  > h2o.ls(localH2O)
                   Key Bytesize
  1      Last.value.39      448
  2      Last.value.42       73
  3       prostate.hex     4874
  4     prostate.train     4028
  5 prostate_part0.hex     4028
  6 prostate_part1.hex     1432
  > h2o.rm(object = localH2O, keys = "prostate.train")
  > h2o.ls(localH2O)
                   Key Bytesize
  1      Last.value.39      448
  2      Last.value.42       73
  3       prostate.hex     4874
  4 prostate_part0.hex     4028
  5 prostate_part1.hex     1432

""""
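
The reason for removing both sides can be illustrated without a running cluster. In the following sketch, a plain R environment stands in for the H2O key-value store. It is a hypothetical stand-in, not an H2O API, and the names **kv.store** and **dangling** are made up for this example. It shows that a local reference is left pointing at nothing once the server-side object is gone.

::

  # Hypothetical stand-in for the H2O key-value store: a plain R environment.
  kv.store <- new.env()
  assign("prostate.hex", 4874, envir = kv.store)
  assign("prostate.train", 4028, envir = kv.store)

  # Local reference that only records the key it points to.
  prostate.train <- list(key = "prostate.train")

  # Step 1: remove the object on the "server" side.
  rm(list = prostate.train$key, envir = kv.store)

  # The local reference is now dangling: its key no longer resolves.
  dangling <- !exists(prostate.train$key, envir = kv.store, inherits = FALSE)
  print(dangling)

  # Step 2: remove the reference from the R environment as well.
  rm(prostate.train)
  print(ls(kv.store))
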