Welcome to H2O 3

New Users

If you’re just getting started with H2O, here are some links to help you learn more:

  • Recommended Systems: This one-page PDF provides a basic overview of the operating systems, languages and APIs, Hadoop resource manager versions, cloud computing environments, browsers, and other resources recommended to run H2O. At a minimum, we recommend the following for compatibility with H2O:

  • Operating Systems: Windows 7 or later; OS X 10.9 or later, Ubuntu 12.04, or RHEL/CentOS 6 or later

  • Languages: Java 7 or later; Scala v 2.10 or later; R v.3 or later; Python 2.7.x or 3.5.x (Scala, R, and Python are not required to use H2O unless you want to use H2O in those environments, but Java is always required)

  • Browsers: Latest version of Chrome, Firefox, Safari, or Internet Explorer (An internet browser is required to use H2O’s web UI, Flow)

  • Hadoop: Cloudera CDH 5.2 or later (5.3 is recommended); MapR v.3.1.1 or later; Hortonworks HDP 2.1 or later (Hadoop is not required to run H2O unless you want to deploy H2O on a Hadoop cluster)

  • Spark: v 1.4 or later (Spark is only required if you want to run Sparkling Water)

  • Downloads page: First things first - download a copy of H2O here by selecting a build under “Download H2O” (the “Bleeding Edge” build contains the latest changes, while the latest alpha release is a more stable build), then use the installation instruction tabs to install H2O on your client of choice (standalone, R, Python, Hadoop, or Maven).

    For first-time users, we recommend downloading the latest alpha release and the default standalone option (the first tab) as the installation method. Make sure to install Java if it is not already installed.

  • Tutorials: To see a step-by-step example of our algorithms in action, select a model type from the following list:

  • Using Flow - H2O’s Web UI: This document describes our new intuitive web interface, Flow. This interface is similar to IPython notebooks, and allows you to create a visual workflow to share with others.

  • Launch from the command line: This document describes some of the additional options that you can configure when launching H2O (for example, to specify a different directory for saved Flow data, to allocate more memory, or to use a flatfile for quick configuration of a cluster).

  • Data Science Algorithms: This document describes the science behind our algorithms and provides a detailed, per-algo view of each model type.

  • GitHub Help: The GitHub Help system is a useful resource for becoming familiar with Git.

New User Quick Start

New users can follow the steps below to quickly get up and running with H2O directly from the h2o-3 repository. These steps guide you through cloning the repository, starting H2O, and importing a dataset. Once you’re up and running, you’ll be better able to follow examples included within this user guide.

  1. In a terminal window, create a folder for the H2O repository. The example below creates a folder called “repos” on the desktop.
user$ mkdir ~/Desktop/repos
  1. Change directories to that new folder, and then clone the repository. Notice that the prompt changes when you change directories.
user$ cd ~/Desktop/repos
repos user$ git clone https://github.com/h2oai/h2o-3.git
  1. After the repo is cloned, change directories to the h2o folder.
repos user$ cd h2o-3
h2o-3 user$
  1. Run the following command to retrieve sample datasets. These datasets are used throughout this User Guide as well as within the Booklets.
h2o-3 user$ ./gradlew syncSmalldata

At this point, determine whether you want to complete this quick start in either R or Python, and run the corresponding commands below from either the R or Python tab.

# Download and install R:
# 1. Go to http://cran.r-project.org/mirrors.html.
# 2. Select your closest local mirror.
# 3. Select your operating system (Linux, OS X, or Windows).
# 4. Depending on your OS, download the appropriate file, along with any required packages.
# 5. When the download is complete, unzip the file and install.

# Start R
h2o-3 user$ r
...
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
>

# Copy and paste the following commands in R to download dependency packages.
> pkgs <- c("methods","statmod","stats","graphics","RCurl","jsonlite","tools","utils")
> for (pkg in pkgs) {if (! (pkg %in% rownames(installed.packages()))) { install.packages(pkg) }}

# Run the following command to load the H2O:
> library(h2o)

# Run the following command to initialize H2O on your local machine (single-node cluster) using all available CPUs.
> h2o.init(nthreads=-1)

# Import the Iris (with headers) dataset.
> path <- "smalldata/iris/iris_wheader.csv"
> iris <- h2o.importFile(path)

# View a summary of the imported dataset.
> print(iris)

  sepal_len    sepal_wid    petal_len    petal_wid        class
-----------  -----------  -----------  -----------  -----------
        5.1          3.5          1.4          0.2  Iris-setosa
        4.9          3            1.4          0.2  Iris-setosa
        4.7          3.2          1.3          0.2  Iris-setosa
        4.6          3.1          1.5          0.2  Iris-setosa
        5            3.6          1.4          0.2  Iris-setosa
        5.4          3.9          1.7          0.4  Iris-setosa
        4.6          3.4          1.4          0.3  Iris-setosa
        5            3.4          1.5          0.2  Iris-setosa
        4.4          2.9          1.4          0.2  Iris-setosa
        4.9          3.1          1.5          0.1  Iris-setosa
[150 rows x 5 columns]
>
# Before starting Python, run the following commands to install dependencies.
# Prepend these commands with `sudo` only if necessary.
h2o-3 user$ [sudo] pip install -U requests
h2o-3 user$ [sudo] pip install -U tabulate
h2o-3 user$ [sudo] pip install -U future
h2o-3 user$ [sudo] pip install -U six

# Start python
h2o-3 user$ python
>>>

# Run the following command to import the H2O module:
>>> import h2o

# Run the following command to initialize H2O on your local machine (single-node cluster).
>>> h2o.init()

# Import the Iris (with headers) dataset.
>>> path = "smalldata/iris/iris_wheader.csv"
>>> iris = h2o.import_file(path=path)

# View a summary of the imported dataset.
>>> iris.summary
  sepal_len    sepal_wid    petal_len    petal_wid        class
-----------  -----------  -----------  -----------  -----------
        5.1          3.5          1.4          0.2  Iris-setosa
        4.9          3            1.4          0.2  Iris-setosa
        4.7          3.2          1.3          0.2  Iris-setosa
        4.6          3.1          1.5          0.2  Iris-setosa
        5            3.6          1.4          0.2  Iris-setosa
        5.4          3.9          1.7          0.4  Iris-setosa
        4.6          3.4          1.4          0.3  Iris-setosa
        5            3.4          1.5          0.2  Iris-setosa
        4.4          2.9          1.4          0.2  Iris-setosa
        4.9          3.1          1.5          0.1  Iris-setosa

[150 rows x 5 columns]
<bound method H2OFrame.summary of >
>>>

Experienced Users

If you’ve used previous versions of H2O, the following links will help guide you through the process of upgrading to H2O-3.

  • Recommended Systems: This one-page PDF provides a basic overview of the operating systems, languages and APIs, Hadoop resource manager versions, cloud computing environments, browsers, and other resources recommended to run H2O.
  • Migrating to H2O 3: This document provides a comprehensive guide to assist users in upgrading to H2O 3.0. It gives an overview of the changes to the algorithms and the web UI introduced in this version and describes the benefits of upgrading for users of R, APIs, and Java.
  • Recent Changes: This document describes the most recent changes in the latest build of H2O. It lists new features, enhancements (including changed parameter default values), and bug fixes for each release, organized by sub-categories such as Python, R, and Web UI.
  • Contributing code: If you’re interested in contributing code to H2O, we appreciate your assistance! This document describes how to access our list of Jiras that are suggested tasks for contributors and how to contact us.

Sparkling Water Users

Sparkling Water is a gradle project with the following submodules:

  • Core: Implementation of H2OContext, H2ORDD, and all technical integration code
  • Examples: Application, demos, examples
  • ML: Implementation of MLlib pipelines for H2O algorithms
  • Assembly: Creates “fatJar” composed of all other modules
  • py: Implementation of (h2o) Python binding to Sparkling Water

The best way to get started is to modify the core module or create a new module, which extends a project.

Users of our Spark-compatible solution, Sparkling Water, should be aware that Sparkling Water is only supported with the latest version of H2O. For more information about Sparkling Water, refer to the following links.

Sparkling Water is versioned according to the Spark versioning, so make sure to use the Sparkling Water version that corresponds to the installed version of Spark.

Getting Started with Sparkling Water

PySparkling

Note: PySparkling requires Sparkling Water 1.5 or later.

H2O’s PySparkling package is not available through pip. (There is another similarly-named package.) H2O’s PySparkling package requires EasyInstall.

To install H2O’s PySparkling package, use the egg file included in the distribution.

  1. Download Spark 1.5.1.
  2. Set the SPARK_HOME and MASTER variables as described on the Downloads page.
  3. Download Sparkling Water 1.5
  4. In the unpacked Sparkling Water directory, run the following command: easy_install --upgrade sparkling-water-1.5.6/py/dist/pySparkling-1.5.6-py2.7.egg

Python Users

Pythonistas will be glad to know that H2O now provides support for this popular programming language. Python users can also use H2O with IPython notebooks. For more information, refer to the following links.

  • Click here to view instructions on how to use H2O with Python.
  • Python docs: This document represents the definitive guide to using Python with H2O.
  • Grid Search in Python: This notebook demonstrates the use of grid search in Python.

R Users

Currently, the only version of R that is known to be incompatible with H2O is R version 3.1.0 (codename “Spring Dance”). If you are using that version, we recommend upgrading the R version before using H2O.

To check which version of H2O is installed in R, use versions::installed.versions("h2o").

  • R User Documentation: This document contains all commands in the H2O package for R, including examples and arguments. It represents the definitive guide to using H2O in R.
  • Connecting RStudio to Sparkling Water: This illustrated tutorial describes how to use RStudio to connect to Sparkling Water.

Ensembles

Ensemble machine learning methods use multiple learning algorithms to obtain better predictive performance.

API Users

API users will be happy to know that the APIs have been more thoroughly documented in the latest release of H2O and additional capabilities (such as exporting weights and biases for Deep Learning models) have been added.

REST APIs are generated immediately out of the code, allowing users to implement machine learning in many ways. For example, REST APIs could be used to call a model created by sensor data and to set up auto-alerts if the sensor data falls below a specified threshold.

  • H2O 3 REST API Overview: This document describes how the REST API commands are used in H2O, versioning, experimental APIs, verbs, status codes, formats, schemas, payloads, metadata, and examples.
  • REST API Reference: This document represents the definitive guide to the H2O REST API.
  • REST API Schema Reference: This document represents the definitive guide to the H2O REST API schemas.

Java Users

For Java developers, the following resources will help you create your own custom app that uses H2O.

Developers

If you’re looking to use H2O to help you develop your own apps, the following links will provide helpful references.

For the latest version of IDEA IntelliJ, run ./gradlew idea, then click File > Open within IDEA. Select the .ipr file in the repository and click the Choose button.

For older versions of IDEA IntelliJ, run ./gradlew idea, then Import Project within IDEA and point it to the h2o-3 directory.

Note: This process will take longer, so we recommend using the first method if possible.

For JUnit tests to pass, you may need multiple H2O nodes. Create a “Run/Debug” configuration with the following parameters:

Type: Application
Main class: H2OApp
Use class path of module: h2o-app

After starting multiple “worker” node processes in addition to the JUnit test process, they will cloud up and run the multi-node JUnit tests.

  • Recommended Systems: This one-page PDF provides a basic overview of the operating systems, languages and APIs, Hadoop resource manager versions, cloud computing environments, browsers, and other resources recommended to run H2O.
  • Developer Documentation: Detailed instructions on how to build and launch H2O, including how to clone the repository, how to pull from the repository, and how to install required dependencies.
  • Click here to view instructions on how to use H2O with Maven.
  • Maven install: This page provides information on how to build a version of H2O that generates the correct IDE files.
  • apps.h2o.ai: Apps.h2o.ai is designed to support application developers via events, networking opportunities, and a new, dedicated website comprising developer kits and technical specs, news, and product spotlights.
  • H2O Droplet Project Templates: This page provides template info for projects created in Java, Scala, or Sparkling Water.
  • H2O Scala API Developer Documentation: The definitive Scala API guide for H2O.
  • Hacking Algos: This blog post by Cliff walks you through building a new algorithm, using K-Means, Quantiles, and Grep as examples.
  • KV Store Guide: Learn more about performance characteristics when implementing new algorithms.
  • Contributing code: If you’re interested in contributing code to H2O, we appreciate your assistance! This document describes how to access our list of Jiras that contributors can work on and how to contact us. Note: To access this link, you must have an Atlassian account.