R

Which versions of R are compatible with H2O?

Currently, the only version of R that is known to not work well with H2O is R version 3.1.0 (codename “Spring Dance”). If you are using this version, we recommend upgrading the R version before using H2O.
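A quick way to check whether the running session uses the flagged release is sketched below. It relies only on base R; the variable names are illustrative and not part of H2O.

```r
# Compare the running R version against the one release the text flags.
problem_version <- "3.1.0"
running_version <- getRversion()

if (running_version == problem_version) {
  warning("R ", problem_version,
          " is known to not work well with H2O; consider upgrading R.")
} else {
  message("R ", as.character(running_version), " is not the flagged release.")
}
```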

What R packages are required to use H2O?

The following packages are required:

methods
statmod
stats
graphics
RCurl
jsonlite
tools
utils

Some of these packages have dependencies; for example, bitops is required, but it is a dependency of the RCurl package, so bitops is automatically included when RCurl is installed.
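As a rough sketch, the following base R snippet reports which of the required packages listed above are not yet installed. The variable names are illustrative only.

```r
# Direct dependencies named in the list above.
required <- c("methods", "statmod", "stats", "graphics",
              "RCurl", "jsonlite", "tools", "utils")

# Names of every package visible in the current library paths.
installed <- rownames(installed.packages())

# Anything required but not installed.
missing_pkgs <- setdiff(required, installed)

if (length(missing_pkgs) == 0) {
  message("All direct dependencies are installed.")
} else {
  message("Missing: ", paste(missing_pkgs, collapse = ", "))
}
```

Packages such as methods, stats, graphics, tools, and utils ship with R itself, so in practice only statmod, RCurl, and jsonlite are likely to appear as missing.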

If you encounter errors related to missing R packages when using H2O, refer to the following complete list of R packages, including dependencies:

statmod
bitops
RCurl
jsonlite
methods
stats
graphics
tools
utils
stringi
magrittr
colorspace
stringr
RColorBrewer
dichromat
munsell
labeling
plyr
digest
gtable
reshape2
scales
proto
ggplot2
h2oEnsemble
gtools
gdata
caTools
gplots
chron
ROCR
data.table
cvAUC

Finally, if you are running R on Linux, then you must install libcurl, which allows H2O to communicate with R.

How can I install the H2O R package if I am having permissions problems?

This issue typically occurs for Linux users when the R software was installed by a root user. For more information, refer to the following link.

To specify the installation location for the R packages, create a file that contains the R_LIBS_USER environment variable:

echo R_LIBS_USER=\"~/.Rlibrary\" > ~/.Renviron

Confirm the file was created successfully using cat:

$ cat ~/.Renviron

You should see the following output:

R_LIBS_USER="~/.Rlibrary"

Create a new directory for the environment variable:

$ mkdir ~/.Rlibrary

Start R and enter the following:

.libPaths()

Look for the following output to confirm the changes:

[1] "<Your home directory>/.Rlibrary"
[2] "/Library/Frameworks/R.framework/Versions/3.1/Resources/library"
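The same confirmation can be scripted instead of read by eye. This is a minimal sketch that assumes the ~/.Rlibrary location used in the steps above; the variable names are illustrative.

```r
# Assumes the ~/.Rlibrary location used in the steps above.
user_lib <- path.expand("~/.Rlibrary")

# TRUE only if the directory exists and R has added it to its library paths.
lib_active <- file.exists(user_lib) &&
  normalizePath(user_lib) %in% normalizePath(.libPaths())

if (lib_active) {
  message("User library is active: ", user_lib)
} else {
  message("User library is not active; check ~/.Renviron and restart R.")
}
```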

I received the following error message after launching H2O in RStudio and using ``h2o.init`` - what should I do to resolve this error?

Error in h2o.init() :
Version mismatch! H2O is running version 3.2.0.9 but R package is version 3.2.0.3

This error is due to a version mismatch between the H2O R package and the running H2O instance. Make sure both are the same, latest version: download H2O from the downloads page and install it, then remove any previous versions of the H2O R package by running:

if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) }
if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }

Make sure to install the dependencies for the H2O R package as well:

if (! ("methods" %in% rownames(installed.packages()))) { install.packages("methods") }
if (! ("statmod" %in% rownames(installed.packages()))) { install.packages("statmod") }
if (! ("stats" %in% rownames(installed.packages()))) { install.packages("stats") }
if (! ("graphics" %in% rownames(installed.packages()))) { install.packages("graphics") }
if (! ("RCurl" %in% rownames(installed.packages()))) { install.packages("RCurl") }
if (! ("jsonlite" %in% rownames(installed.packages()))) { install.packages("jsonlite") }
if (! ("tools" %in% rownames(installed.packages()))) { install.packages("tools") }
if (! ("utils" %in% rownames(installed.packages()))) { install.packages("utils") }

Finally, install the latest stable version of the H2O package for R:

install.packages("h2o", type="source", repos=(c("http://h2o-release.s3.amazonaws.com/h2o/latest_stable_R")))
library(h2o)
localH2O = h2o.init()

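Note that update.packages(checkBuilt=TRUE, ask=FALSE) updates installed packages, including those built under an older R version; it does not upgrade R itself. If your R version is too old for the H2O R package, upgrade R first, then run this command to rebuild your packages.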

I received the following error message after launching H2O in RStudio and using ``h2o.init`` - what should I do to resolve this error?

Server error - server 127.0.0.1 is unreachable at this moment.
Please retry the request or contact your administrator.

This error occurs when a proxy is set in your R environment. Unset the proxy so that R can reach localhost by running the following:

Sys.unsetenv("http_proxy")
Sys.unsetenv("https_proxy")
Sys.unsetenv("http_proxy_user")
Sys.unsetenv("https_proxy_user")
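Before unsetting anything, it can help to see which of these variables are actually set in the current session. The following base R sketch does that; the variable names are illustrative.

```r
# Environment variables that the commands above unset.
proxy_vars <- c("http_proxy", "https_proxy",
                "http_proxy_user", "https_proxy_user")

# For a vector of names, Sys.getenv returns a named vector with "" for unset entries.
current <- Sys.getenv(proxy_vars)
set_vars <- names(current)[nzchar(current)]

if (length(set_vars) > 0) {
  message("Proxy variables currently set: ", paste(set_vars, collapse = ", "))
} else {
  message("No proxy variables are set in this R session.")
}
```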

I received the following error message after trying to run some code - what should I do?

> fit <- h2o.deeplearning(x=2:4, y=1, training_frame=train_hex)
  |=========================================================================================================| 100%
Error in model$training_metrics$MSE :
  $ operator not defined for this S4 class
In addition: Warning message:
Not all shim outputs are fully supported, please see ?h2o.shim for more information

Remove the h2o.shim(enable=TRUE) line and try running the code again. Note that the h2o.shim is only a way to notify users of previous versions of H2O about changes to the H2O R package - it will not revise your code, but provides suggested replacements for deprecated commands and parameters.

How do I extract the model weights from a model I’ve created using H2O in R? I’ve enabled ``extract_model_weights_and_biases``, but the output refers to a file I can’t open in R.

For an example of how to extract weights and biases from a model, refer to the following repo location on GitHub.

How do I extract the run time of my model as output?

For the following example:

out.h2o.rf = h2o.randomForest( x=c("x1", "x2", "x3", "w"), y="y", training_frame=h2o.df.train, seed=555, model_id= "my.model.1st.try.out.h2o.rf" )

Use out.h2o.rf@model$run_time to determine the value of the run_time variable.

What is the best way to do group summarizations? For example, getting sums of specific columns grouped by a categorical column.

We strongly recommend using h2o.group_by for this task instead of h2o.ddply, as shown in the following example:

newframe <- h2o.group_by(h2oframe, by="footwear_category",
    nrow("email_event_click_ct"), sum("email_event_click_ct"),
    mean("email_event_click_ct"), sd("email_event_click_ct"),
    gb.control = list( col.names=c("count", "total_email_event_click_ct",
        "avg_email_event_click_ct", "std_email_event_click_ct") ) )

Using gb.control is optional; here it is included so the column names are user-configurable.

To group by more than one column when computing the summary, pass a vector of column names to the by option, as shown in the following example:

newframe <- h2o.group_by(h2oframe, by=c("footwear_category","age_group"),
    nrow("email_event_click_ct"), sum("email_event_click_ct"),
    mean("email_event_click_ct"), sd("email_event_click_ct"),
    gb.control = list( col.names=c("count", "total_email_event_click_ct",
        "avg_email_event_click_ct", "std_email_event_click_ct") ) )

I’m using Linux and I want to run H2O in R - are there any dependencies I need to install?

Yes, make sure to install libcurl, which allows H2O to communicate with R. We also recommend disabling SELinux and any firewalls, at least initially, until you have confirmed that H2O can initialize.

On Ubuntu, run: apt-get install libcurl4-openssl-dev
On CentOS, run: yum install libcurl-devel

How do I change variable/header names on an H2O frame in R?

There are two ways to change header names. To specify the headers during parsing, import the headers in R and then specify the header as the column name when the actual data frame is imported:

header <- h2o.importFile(path = pathToHeader)
data   <- h2o.importFile(path = pathToData, col.names = header)
data

You can also use the names() function:

header <- c("user", "specified", "column", "names")
data   <- h2o.importFile(path = pathToData)
names(data) <- header

To replace specific column names, you can also use sub or gsub in R:

header <- c("user", "specified", "column", "names")
## Replace the "user" column name with "computer"
data   <- h2o.importFile(path = pathToData)
## header is an unnamed character vector, so pass it directly (names(header) would be NULL)
names(data) <- sub(pattern = "user", replacement = "computer", x = header)
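The substitution step can be tried on a plain character vector before applying it to a frame. The sketch below uses only base R; the "username" entry is a made-up name added to show how sub treats its pattern as a regular expression.

```r
header <- c("user", "specified", "column", "names")

# Anchoring with ^ and $ ensures only an exact "user" entry is renamed.
renamed <- sub(pattern = "^user$", replacement = "computer", x = header)
print(renamed)

# Without anchors, a name that merely contains "user" is also changed.
partial <- sub(pattern = "user", replacement = "computer",
               x = c("user", "username"))
print(partial)
```

Use gsub instead of sub when every occurrence of the pattern within a name should be replaced, not just the first.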

My R terminal crashed - how can I re-access my H2O frame?

Launch H2O and use your web browser to access the web UI, Flow, at localhost:54321. Click the Data menu, then click List All Frames. Copy the frame ID, then run h2o.ls() in R to list all the frames, or use the frame ID in the following code (replacing YOUR_FRAME_ID with the frame ID):

library(h2o)
localH2O = h2o.init(ip="localhost", port=54321, startH2O=FALSE, strict_version_check=TRUE)
data_frame <- h2o.getFrame(frame_id = "YOUR_FRAME_ID")

How do I remove rows containing NAs in an H2OFrame?

For example, consider the following frame, in which rows 1, 3, 4, and 5 contain NAs:

  a   b    c    d    e
1 0   NA   NA   NA   NA
2 0   2    2    2    2
3 0   NA   NA   NA   NA
4 0   NA   NA   1    2
5 0   NA   NA   NA   NA
6 0   1    2    3    2

Removing rows 1, 3, 4, and 5 gives:

  a   b    c    d    e
2 0   2    2    2    2
6 0   1    2    3    2

Use na.omit(myFrame), where myFrame represents the name of the frame you are editing.
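The row-dropping behavior can be previewed locally with a base R data.frame standing in for the H2O frame. This is only an illustration of what na.omit does to the example data, not H2O-specific code.

```r
# Local stand-in for the example frame (a plain data.frame, not an H2OFrame).
df <- data.frame(
  a = c(0, 0, 0, 0, 0, 0),
  b = c(NA, 2, NA, NA, NA, 1),
  c = c(NA, 2, NA, NA, NA, 2),
  d = c(NA, 2, NA, 1, NA, 3),
  e = c(NA, 2, NA, 2, NA, 2)
)

# Drop every row that contains at least one NA.
clean <- na.omit(df)
print(clean)
```

Only rows 2 and 6 remain, because row 4 is dropped even though just two of its values are missing.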

I installed H2O in R using OS X and updated all the dependencies, but the following error message displayed: ``Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, Unexpected CURL error: Empty reply from server`` - what should I do?

This error message displays if the JAVA_HOME environment variable is not set correctly. The JAVA_HOME variable likely points to Apple Java version 6 instead of Oracle Java version 8.

If you are running OS X 10.7 or earlier, enter the following in Terminal: export JAVA_HOME=/Library/Internet\ Plug-Ins/JavaAppletPlugin.plugin/Contents/Home

If you are running OS X 10.8 or later, modify the launchd.plist by entering the following in Terminal:

cat << EOF | sudo tee /Library/LaunchDaemons/setenv.JAVA_HOME.plist
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
  <plist version="1.0">
  <dict>
  <key>Label</key>
  <string>setenv.JAVA_HOME</string>
  <key>ProgramArguments</key>
  <array>
    <string>/bin/launchctl</string>
    <string>setenv</string>
    <string>JAVA_HOME</string>
    <string>/Library/Internet Plug-Ins/JavaAppletPlugin.plugin/Contents/Home</string>
  </array>
  <key>RunAtLoad</key>
  <true/>
  <key>ServiceIPC</key>
  <false/>
</dict>
</plist>
EOF

How does the ``col.names`` argument work in ``group_by``?

You need to add the col.names inside the gb.control list. Refer to the following example:

newframe <- h2o.group_by(dd, by="footwear_category", nrow("email_event_click_ct"), sum("email_event_click_ct"), mean("email_event_click_ct"),
    sd("email_event_click_ct"), gb.control = list( col.names=c("count", "total_email_event_click_ct", "avg_email_event_click_ct", "std_email_event_click_ct") ) )
newframe$avg_email_event_click_ct2 = newframe$total_email_event_click_ct / newframe$count

How are the results of ``h2o.predict`` displayed?

The order of the rows in the results for h2o.predict is the same as the order in which the data was loaded, even if some rows fail (for example, due to missing values or unseen factor levels). To bind a per-row identifier, use cbind.

How do I view all the variable importances for a model?

By default, H2O returns the top five and lowest five variable importances. To view all the variable importances, use the following:

model <- h2o.getModel(model_id = "my_H2O_modelID",conn=localH2O)

varimp<-as.data.frame(h2o.varimp(model))

How do I add random noise to a column in an H2O frame?

To add random noise to a column in an H2O frame, refer to the following example:

h2o.init()

fr <- as.h2o(iris)

  |======================================================================| 100%

random_column <- h2o.runif(fr)

new_fr <- h2o.cbind(fr,random_column)

new_fr