Welcome to H2O 3.0

Welcome to the H2O documentation site! Depending on your area of interest, select a learning path from the links above.

We’re glad you’re interested in learning more about H2O - if you have any questions, please email them to support@h2o.ai or post them on our Google groups website, h2ostream.

Note: To join our Google group on h2ostream, you need a Google account (such as Gmail or Google+). On the h2ostream page, click the Join group button, then click the New Topic button to post a new message. You don’t need to request or leave a message to join - you should be added to the group automatically.

We welcome your feedback! Please let us know if you have any questions or comments about this site by emailing us at support@h2o.ai.


New Users

If you’re just getting started with H2O, here are some links to help you learn more:


Experienced Users

If you’ve used previous versions of H2O, the following links will help guide you through the process of upgrading to H2O 3.0.


Enterprise Users

If you’re considering using H2O in an enterprise environment, you’ll be happy to know that H2O supports many popular scalable computing solutions, such as Hadoop and EC2 (AWS). For more information, refer to the following links.


Sparkling Water Users

Sparkling Water is a gradle project with the following submodules:

The best way to get started is to modify the core module or create a new module, which extends a project.

Users of our Spark-compatible solution, Sparkling Water, should be aware that Sparkling Water is only supported with the latest version of H2O. For more information about Sparkling Water, refer to the following links.

Sparkling Water is versioned according to the Spark versioning:

Getting Started with Sparkling Water

The following video provides step-by-step instructions on how to start H2O using Sparkling Water:

Sparkling Water Blog Posts

Sparkling Water Meetup Slide Decks


Python Users

Pythonistas will be glad to know that H2O now provides support for this popular programming language. Python users can also use H2O with IPython notebooks. For more information, refer to the following links.

The following video provides step-by-step instructions on how to start H2O using Python:


R Users

Don’t worry, R users - we still provide R support in the latest version of H2O, just as before. The R components of H2O have been cleaned up, simplified, and standardized, so the command format is easier and more intuitive. Due to these improvements, be aware that any scripts created with previous versions of H2O will need some revision to be compatible with the latest version.

We have provided the following helpful resources to assist R users in upgrading to the latest version, including a document that outlines the differences between versions and a tool that reviews scripts for deprecated or renamed parameters.

The following video provides step-by-step instructions on how to start H2O in R:


API Users

API users will be happy to know that the APIs have been more thoroughly documented in the latest release of H2O and additional capabilities (such as exporting weights and biases for Deep Learning models) have been added.

REST APIs are generated directly from the code, allowing users to implement machine learning in many ways. For example, REST APIs could be used to call a model created from sensor data and to set up auto-alerts if the sensor data falls below a specified threshold.


Java Users

For Java developers, the following resources will help you create your own custom app that uses H2O.

SDK Information

The Java API is generated and accessible from the download page.


Developers

If you’re looking to use H2O to help you develop your own apps, the following links will provide helpful references.

For IntelliJ IDEA support, run gradle idea, then select Import Project within IDEA and point it to the h2o-3 directory.

For JUnit tests to pass, you may need multiple H2O nodes. Create a “Run/Debug” configuration with the following parameters:

Type: Application
Main class: H2OApp
Use class path of module: h2o-app

After starting multiple “worker” node processes in addition to the JUnit test process, they will cloud up and run the multi-node JUnit tests.


Downloading H2O

To download H2O, go to our downloads page. Select a build type (bleeding edge or latest alpha), then select an installation method (standalone, R, Python, Hadoop, or Maven) by clicking the tabs at the top of the page. Follow the instructions in the tab to install H2O.

Starting H2O …

There are a variety of ways to start H2O, depending on which client you would like to use.

… From R

To use H2O in R, follow the instructions on the download page.

… From Python

To use H2O in Python, follow the instructions on the download page.

… On Spark

To use H2O on Spark, follow the instructions on the Sparkling Water download page.

… From the Cmd Line

You can use Terminal (OS X) or the Command Prompt (Windows) to launch H2O 3.0. When you launch from the command line, you can pass additional instructions to H2O 3.0, such as how many nodes to launch, how much memory to allocate for each node, what names to assign to the nodes in the cloud, and more.

Note: H2O requires some space in the /tmp directory to launch. If you cannot launch H2O, try freeing up some space in the /tmp directory, then try launching H2O again.

For more detailed instructions on how to build and launch H2O, including how to clone the repository, how to pull from the repository, and how to install required dependencies, refer to the developer documentation.

There are two different argument types:

The arguments use the following format: java <JVM Options> -jar h2o.jar <H2O Options>.

JVM Options

Note: Do not try to launch H2O with more memory than you have available.
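
As a rough illustration of the note above, the following Python sketch converts a heap option such as -Xmx20g to bytes and compares it with the memory available on the host. The helper names are hypothetical and are not part of H2O, and physical_memory_bytes assumes a POSIX system where os.sysconf exposes SC_PAGE_SIZE and SC_PHYS_PAGES.

```python
import os
import re

# Multipliers for the unit suffixes the JVM accepts on -Xmx (no suffix means bytes).
_UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def xmx_to_bytes(option):
    """Convert a JVM heap option such as '-Xmx20g' to a byte count."""
    match = re.fullmatch(r"-Xmx(\d+)([kKmMgG]?)", option)
    if match is None:
        raise ValueError(f"not an -Xmx option: {option!r}")
    number, unit = match.groups()
    return int(number) * _UNITS.get(unit.lower(), 1)


def physical_memory_bytes():
    """Total physical memory (assumption: POSIX sysconf names are available)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def fits_in_memory(option, available_bytes):
    """Return True if the requested heap does not exceed the available memory."""
    return xmx_to_bytes(option) <= available_bytes
```

For example, fits_in_memory("-Xmx4g", physical_memory_bytes()) could be checked before launching a node.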

H2O Options

Cloud Formation Behavior

New H2O nodes join to form a cloud during launch. After a job has started on the cloud, it prevents new members from joining.

Wait for the INFO: Registered: # schemas in: #mS output before entering the above command again to add another node (the number for # will vary).
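
If you script the launch, the wait described above could be automated by scanning the node's output for the Registered message. This is only a sketch: the function name is hypothetical, and the pattern assumes the message appears exactly as quoted above, with digits in place of each #.

```python
import re

# Matches the message quoted above, e.g. "INFO: Registered: 82 schemas in: 150mS".
REGISTERED = re.compile(r"INFO: Registered: (\d+) schemas in: (\d+)mS")


def node_is_ready(log_lines):
    """Return True once any output line contains the Registered message."""
    return any(REGISTERED.search(line) for line in log_lines)
```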

Flatfile Configuration for Multi-Node Clusters

Running H2O on a multi-node cluster allows you to use more memory for large-scale tasks (for example, creating models from huge datasets) than would be possible on a single node.

If you are configuring many nodes, using the -flatfile option is fast and easy. The -flatfile option is used to define a list of potential cloud peers. However, it is not an alternative to -ip and -port, which should be used to bind the IP and port address of the node you are using to launch H2O.

To configure H2O on a multi-node cluster:

  1. Locate a set of hosts that will be used to create your cluster. A host can be a server, an EC2 instance, or your laptop.
  2. Download the appropriate version of H2O for your environment.
  3. Verify the same h2o.jar file is available on each host in the multi-node cluster.
  4. Create a flatfile.txt that contains an IP address and port number for each H2O instance. Use one entry per line. For example:

    192.168.1.163:54321
    192.168.1.164:54321
    
  5. Copy the flatfile.txt to each node in the cluster.
  6. Use the -Xmx option to specify the amount of memory for each node. The cluster’s memory capacity is the sum of all H2O nodes in the cluster.

    For example, if you create a cluster with four 20g nodes (by specifying -Xmx20g four times), H2O will have a total of 80 GB of memory available.

    For best performance, we recommend sizing your cluster to be about four times the size of your data. To avoid swapping, the -Xmx allocation must not exceed the physical memory on any node. Allocating the same amount of memory for all nodes is strongly recommended, as H2O works best with symmetric nodes.

    Note that the optional -ip and -port options specify the IP address and port to use. The -ip option is especially helpful for hosts with multiple network interfaces.

    java -Xmx20g -jar h2o.jar -flatfile flatfile.txt -port 54321

    The output will resemble the following:

     04-20 16:14:00.253 192.168.1.70:54321    2754   main      INFO:   1. Open a terminal and run 'ssh -L 55555:localhost:54321 H2O-3User@###.###.#.##'
     04-20 16:14:00.253 192.168.1.70:54321    2754   main      INFO:   2. Point your browser to http://localhost:55555
     04-20 16:14:00.437 192.168.1.70:54321    2754   main      INFO: Log dir: '/tmp/h2o-H2O-3User/h2ologs'
     04-20 16:14:00.437 192.168.1.70:54321    2754   main      INFO: Cur dir: '/Users/H2O-3User/h2o-3'
     04-20 16:14:00.459 192.168.1.70:54321    2754   main      INFO: HDFS subsystem successfully initialized
     04-20 16:14:00.460 192.168.1.70:54321    2754   main      INFO: S3 subsystem successfully initialized
     04-20 16:14:00.460 192.168.1.70:54321    2754   main      INFO: Flow dir: '/Users/H2O-3User/h2oflows'
     04-20 16:14:00.475 192.168.1.70:54321    2754   main      INFO: Cloud of size 1 formed [/192.168.1.70:54321]
    

    As you add more nodes to your cluster, the output is updated: INFO WATER: Cloud of size 2 formed [/...]...

  7. Access the H2O 3.0 web UI (Flow) with your browser. Point your browser to the HTTP address specified in the output Listening for HTTP and REST traffic on ....
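
The steps above can be sketched in Python as a pair of helpers that produce the flatfile contents and the launch command for each node. The function names are hypothetical; the options used (-Xmx, -flatfile, -port, and -ip) are the ones described in the steps.

```python
def flatfile_text(hosts, port=54321):
    """Return flatfile contents: one ip:port entry per line."""
    return "".join(f"{host}:{port}\n" for host in hosts)


def launch_command(xmx, flatfile="flatfile.txt", port=54321, ip=None):
    """Build the java command for one node as an argument list."""
    cmd = ["java", f"-Xmx{xmx}", "-jar", "h2o.jar",
           "-flatfile", flatfile, "-port", str(port)]
    if ip is not None:
        # Optional: bind to a specific interface on multi-homed hosts.
        cmd += ["-ip", ip]
    return cmd


def cluster_memory_gb(node_gb, node_count):
    """Cluster capacity is the sum of the memory of all nodes."""
    return node_gb * node_count
```

The list returned by launch_command can be passed to subprocess.run on each host after the flatfile has been copied there.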

… On EC2 and S3

Note: If you would like to try out H2O on an EC2 cluster, play.h2o.ai is the easiest way to get started. H2O Play provides access to a temporary cluster managed by H2O.

If you would still like to set up your own EC2 cluster, follow the instructions below.

On EC2

Tested on Redhat AMI, Amazon Linux AMI, and Ubuntu AMI

To use the Amazon Web Services (AWS) S3 storage solution, you will need to pass your S3 access credentials to H2O. This will allow you to access your data on S3 when importing data frames with path prefixes s3n://....

For security reasons, we recommend writing a script to read the access credentials that are stored in a separate file. This will not only keep your credentials from propagating to other locations, but it will also make it easier to change the credential information later.
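
One possible shape for such a script is sketched below in Python. It parses credentials from the text of a separate file and renders them as core-site.xml properties. The access_key=... and secret_key=... file layout and the function names are assumptions made for illustration; the property names fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey are the ones used elsewhere on this page.

```python
import xml.etree.ElementTree as ET


def parse_credentials(text):
    """Parse 'access_key=...' and 'secret_key=...' lines (hypothetical layout)."""
    creds = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            key, _, value = line.partition("=")
            creds[key.strip()] = value.strip()
    return creds["access_key"], creds["secret_key"]


def core_site_xml(access_key, secret_key):
    """Render the two S3 credential properties as a core-site.xml document."""
    root = ET.Element("configuration")
    for name, value in (("fs.s3n.awsAccessKeyId", access_key),
                        ("fs.s3n.awsSecretAccessKey", secret_key)):
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return '<?xml version="1.0"?>\n' + ET.tostring(root, encoding="unicode")
```

Keeping the parsing separate from the rendering means the credentials file can change format later without touching the code that writes the configuration.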

Standalone Instance

When running H2O in standalone mode using the simple Java launch command, we can pass in the S3 credentials in two ways.


Core-site.xml Example

The following is an example core-site.xml file:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

    <!--
    <property>
    <name>fs.default.name</name>
    <value>s3n://<your s3 bucket></value>
    </property>
    -->

    <property>
        <name>fs.s3n.awsAccessKeyId</name>
        <value>insert access key here</value>
    </property>

    <property>
        <name>fs.s3n.awsSecretAccessKey</name>
        <value>insert secret key here</value>
    </property>
    </configuration> 

Launching H2O

Note: Before launching H2O on an EC2 cluster, verify that ports 54321 and 54322 are both accessible by TCP and UDP.
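
A quick way to check the TCP side of this requirement from another machine is sketched below. The function names are hypothetical, and the check only covers TCP: UDP reachability cannot be confirmed with a simple connect and still has to be verified in the security group settings.

```python
import socket


def tcp_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_h2o_ports(host, ports=(54321, 54322)):
    """Map each port to whether it accepted a TCP connection."""
    return {port: tcp_port_open(host, port) for port in ports}
```

Note that a port only accepts connections while something is listening on it, so this check is most useful after H2O has been started on the instance.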

Selecting the Operating System and Virtualization Type

Select your operating system and the virtualization type of the prebuilt AMI on Amazon. If you are using Windows, you will need to use a hardware-assisted virtual machine (HVM). If you are using Linux, you can choose between para-virtualization (PV) and HVM. These selections determine the type of instances you can launch.

EC2 Systems

For more information about virtualization types, refer to Amazon.


Configuring the Instance

  1. Select the IAM role and policy to use to launch the instance. H2O detects the temporary access keys associated with the instance, so you don’t need to copy your AWS credentials to the instances.

    EC2 Configuration

  2. When launching the instance, select an accessible key pair.

    EC2 Key Pair


(Windows Users) Tunneling into the Instance

Windows users who cannot run ssh from the terminal can download either Cygwin or a Git Bash that is capable of running ssh:

ssh -i amy_account.pem ec2-user@54.165.25.98

Otherwise, download PuTTY and follow these instructions:

  1. Launch the PuTTY Key Generator.
  2. Load your downloaded AWS pem key file. Note: To see the file, change the browser file type to “All”.
  3. Save the private key as a .ppk file.

    Private Key

  4. Launch the PuTTY client.

  5. In the Session section, enter the host name or IP address. For Ubuntu users, the default host name is ubuntu@<ip-address>. For Amazon Linux and Red Hat users, the default host name is ec2-user@<ip-address>.

    Configuring Session

  6. Select SSH, then Auth in the sidebar, and click the Browse button to select the private key file for authentication.

    Configuring SSH

  7. Start a new session and click the Yes button to confirm caching of the server’s rsa2 key fingerprint and continue connecting.

    PuTTY Alert


Downloading Java and H2O

  1. Download Java (JDK 1.7 or later) if it is not already available on the instance.
  2. To download H2O, run the wget command with the link to the zip file available on our website by copying the link associated with the Download button for the selected H2O build.

     wget http://h2o-release.s3.amazonaws.com/h2o/rel-shannon/30/h2o-3.0.0.30.zip
     unzip h2o-3.0.0.30.zip
     cd h2o-3.0.0.30
     java -Xmx4g -jar h2o.jar
    
  3. From your browser, navigate to <Private_IP_Address>:54321 or <Public_DNS>:54321 to use H2O’s web interface.

… On Hadoop

Currently supported versions:

Important Points to Remember:

Prerequisite: Open Communication Paths

H2O communicates using two communication paths. Verify these are open and available for use by H2O.

Path 1: mapper to driver

Optionally specify this port using the -driverport option in the hadoop jar command (see “Hadoop Launch Parameters” below). This port is opened on the driver host (the host where you entered the hadoop jar command). By default, this port is chosen randomly by the operating system.

Path 2: mapper to mapper

Optionally specify this port using the -baseport option in the hadoop jar command (refer to “Hadoop Launch Parameters” below). This port and the next port are opened on the mapper hosts (the Hadoop worker nodes) where the H2O mapper nodes are placed by the Resource Manager. By default, ports 54321 (TCP) and 54322 (TCP & UDP) are used.

The mapper port is adaptive: if 54321 and 54322 are not available, H2O will try 54323 and 54324 and so on. The mapper port is designed to be adaptive because sometimes if the YARN cluster is low on resources, YARN will place two H2O mappers for the same H2O cluster request on the same physical host. For this reason, we recommend opening a range of more than two ports (20 ports should be sufficient).
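
To make the adaptive behavior concrete, the following Python sketch lists the port pairs that would be tried in order, along with a 20-port range to open. It models only what the paragraph above describes; the function names are hypothetical and this is not H2O's actual implementation.

```python
from itertools import count, islice


def candidate_port_pairs(baseport=54321):
    """Yield successive port pairs: (54321, 54322), (54323, 54324), and so on."""
    for offset in count(0, 2):
        yield (baseport + offset, baseport + offset + 1)


def first_pairs(n, baseport=54321):
    """Return the first n candidate pairs as a list."""
    return list(islice(candidate_port_pairs(baseport), n))


def ports_to_open(baseport=54321, total=20):
    """Return the contiguous range of ports to open in the firewall."""
    return list(range(baseport, baseport + total))
```

With the defaults, a 20-port range covers ten candidate pairs, which leaves room for several mappers on the same physical host.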


Tutorial

The following tutorial will walk the user through the download or build of H2O and the parameters involved in launching H2O from the command line.

  1. Download the latest H2O release for your version of Hadoop:

     wget http://h2o-release.s3.amazonaws.com/h2o/master/30/h2o-3.0.0.30-cdh5.2.zip
     wget http://h2o-release.s3.amazonaws.com/h2o/master/30/h2o-3.0.0.30-cdh5.3.zip
     wget http://h2o-release.s3.amazonaws.com/h2o/master/30/h2o-3.0.0.30-hdp2.1.zip
     wget http://h2o-release.s3.amazonaws.com/h2o/master/30/h2o-3.0.0.30-hdp2.2.zip
     wget http://h2o-release.s3.amazonaws.com/h2o/master/30/h2o-3.0.0.30-mapr3.1.1.zip
     wget http://h2o-release.s3.amazonaws.com/h2o/master/30/h2o-3.0.0.30-mapr4.0.1.zip
    

    Note: Enter only one of the above commands.

  2. Prepare the job input on the Hadoop node by unzipping the build file and changing to the directory that contains the Hadoop and H2O driver jar files.

     unzip h2o-3.0.0.30-*.zip
     cd h2o-3.0.0.30-*
    
  3. To launch H2O nodes and form a cluster on the Hadoop cluster, run:

     hadoop jar h2odriver.jar -nodes 1 -mapperXmx 6g -output hdfsOutputDirName
    
    • The above command launches a 6g node of H2O. We recommend you launch the cluster with at least four times the memory of your data file size.

    • mapperXmx is the mapper size or the amount of memory allocated to each node. Specify at least 6 GB.

    • nodes is the number of nodes requested to form the cluster.

    • output is the name of the directory that is created each time an H2O cloud is created, so the name must be unique each time H2O is launched.

  4. To monitor your job, direct your web browser to your standard job tracker Web UI. To access H2O’s Web UI, direct your web browser to one of the launched instances. If you are unsure where your JVM is launched, review the output from your command after the nodes have clouded up and formed a cluster. Any of the nodes’ IP addresses will work as there is no master node.

     Determining driver host interface for mapper->driver callback...
     [Possible callback IP address: 172.16.2.181]
     [Possible callback IP address: 127.0.0.1]
     ...
     Waiting for H2O cluster to come up...
     H2O node 172.16.2.184:54321 requested flatfile
     Sending flatfiles to nodes...
      [Sending flatfile to node 172.16.2.184:54321]
     H2O node 172.16.2.184:54321 reports H2O cluster size 1 
     H2O cluster (1 nodes) is up
     Blocking until the H2O cluster shuts down...
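
The launch parameters from step 3 can be assembled programmatically. The Python sketch below estimates a node count from the "four times the data size" guideline, generates a unique output directory name, and builds the hadoop jar command. The helper names and the random-suffix naming scheme are assumptions for illustration only.

```python
import math
import uuid


def nodes_needed(data_gb, mapper_xmx_gb=6, factor=4):
    """Nodes required so that total memory is at least factor times the data size."""
    return max(1, math.ceil(data_gb * factor / mapper_xmx_gb))


def unique_output_dir(prefix="hdfsOutputDirName"):
    """Append a random suffix so each launch gets a fresh output directory name."""
    return f"{prefix}_{uuid.uuid4().hex[:8]}"


def hadoop_launch_command(nodes, mapper_xmx_gb, output):
    """Build the hadoop jar command from step 3 as an argument list."""
    return ["hadoop", "jar", "h2odriver.jar",
            "-nodes", str(nodes),
            "-mapperXmx", f"{mapper_xmx_gb}g",
            "-output", output]
```

For a 10 GB data file and 6g mappers, nodes_needed(10) asks for seven nodes, because 40 GB of total memory does not fit in six.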
    

Hadoop Launch Parameters

Accessing S3 Data from Hadoop

H2O launched on Hadoop can access S3 data in addition to HDFS. To enable access, follow the instructions below.

Edit Hadoop’s core-site.xml, then set the HADOOP_CONF_DIR environment property to the directory containing the core-site.xml file. For an example core-site.xml file, refer to Core-site.xml. Typically, the configuration directory for most Hadoop distributions is /etc/hadoop/conf.

You can also pass the S3 credentials when launching H2O with the Hadoop jar command. Use the -D flag to pass the credentials:

    hadoop jar h2odriver.jar -Dfs.s3n.awsAccessKeyId="${AWS_ACCESS_KEY}" -Dfs.s3n.awsSecretAccessKey="${AWS_SECRET_KEY}" -n 3 -mapperXmx 10g -output outputDirectory

where AWS_ACCESS_KEY represents your user name and AWS_SECRET_KEY represents your password.
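
If you launch from a script rather than a shell, the same command can be built from environment variables, as in the Python sketch below. The function name is hypothetical, and both property names are assumed to use the fs.s3n prefix, matching the core-site.xml example on this page.

```python
import os


def s3_hadoop_command(nodes, mapper_xmx, output, env=os.environ):
    """Build the hadoop jar command with S3 credentials read from the environment."""
    access = env["AWS_ACCESS_KEY"]   # raises KeyError if the variable is unset
    secret = env["AWS_SECRET_KEY"]
    return ["hadoop", "jar", "h2odriver.jar",
            f"-Dfs.s3n.awsAccessKeyId={access}",
            f"-Dfs.s3n.awsSecretAccessKey={secret}",
            "-n", str(nodes),
            "-mapperXmx", mapper_xmx,
            "-output", output]
```

Passing the result to subprocess.run as a list avoids shell quoting problems if a key contains special characters.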

Then import the data with the S3 URL path:

… Using Docker

This walkthrough describes:

Walkthrough

Prerequisites

Note: Older Linux kernel versions are known to cause kernel panics and to break Docker; there are ways around it, but attempt at your own risk. You can check the version of your kernel by running uname -r in your terminal. The following walkthrough has been tested on Mac OS X 10.10.1.

Step 1 - Install and Launch Docker

Step 2 - Create or Download Dockerfile

Create a folder on the Host OS to host your Dockerfile by running:

mkdir -p /data/h2o-shannon

Then either download or create a Dockerfile. The Dockerfile is essentially a build recipe that will be used to build the container.

Download and use our Dockerfile template by running:

cd /data/h2o-shannon
wget http://h2o.ai/blog/2015_01_h2o-docker/Dockerfile

The Dockerfile will:

Step 3 - Build Docker image from Dockerfile

From the /data/h2o-shannon directory, run:

docker build -t="h2o.ai/shannon" .

This process can take a few minutes as it assembles all the necessary parts to the image.

Step 4 - Run Docker Build

On a Mac, you must use the argument -p 54321:54321 to expressly map the port 54321. This is redundant on Linux.

docker run -it -p 54321:54321 h2o.ai/shannon

Step 5 - Launch H2O

Step into the /opt directory and launch H2O. Change the value of -Xmx to the amount of memory you want to allocate to the H2O instance. By default, H2O launches on port 54321.

cd /opt
java -Xmx1g -jar h2o.jar

Step 6 - Access H2O from the web browser or R

03:58:25.963 main      INFO WATER: Cloud of size 1 formed [/172.17.0.5:54321 (00:00:00.000)]
$ boot2docker ip
192.168.59.103

Once you have the IP address, point your browser to the specified ip address and port. In R, you can access the instance by installing the latest version of the H2O R package and running:

library(h2o)
dockerH2O <- h2o.init(ip = "192.168.59.103", port = 54321)
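
When scripting this step, the container log line and the boot2docker address shown above can be combined to confirm that the cloud has formed and to build the address to open in the browser. This Python sketch uses hypothetical helper names and assumes the log line has the form shown in this step.

```python
import re

# Matches lines such as "... Cloud of size 1 formed [/172.17.0.5:54321 (00:00:00.000)]".
CLOUD_FORMED = re.compile(r"Cloud of size (\d+) formed")


def cloud_size(line):
    """Return the cloud size reported in a log line, or None if absent."""
    match = CLOUD_FORMED.search(line)
    return int(match.group(1)) if match else None


def flow_url(host_ip, port=54321):
    """Build the browser address from the boot2docker IP and the mapped port."""
    return f"http://{host_ip}:{port}"
```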

Flow Web UI …

H2O Flow is an open-source user interface for H2O. It is a web-based interactive environment that allows you to combine code execution, text, mathematics, plots, and rich media in a single document.

With H2O Flow, you can capture, rerun, annotate, present, and share your workflow. H2O Flow allows you to use H2O interactively to import files, build models, and iteratively improve them. Based on your models, you can make predictions and add rich text to create vignettes of your work - all within Flow’s browser-based environment.

Flow’s hybrid user interface seamlessly blends command-line computing with a modern graphical user interface. However, rather than displaying output as plain text, Flow provides a point-and-click user interface for every H2O operation. It allows you to access any H2O object in the form of well-organized tabular data.

H2O Flow sends commands to H2O as a sequence of executable cells. The cells can be modified, rearranged, or saved to a library. Each cell contains an input field that allows you to enter commands, define functions, call other functions, and access other cells or objects on the page. When you execute the cell, the output is a graphical object, which can be inspected to view additional details.

While H2O Flow supports REST API, R scripts, and CoffeeScript, no programming experience is required to run H2O Flow. You can click your way through any H2O operation without ever writing a single line of code. You can even disable the input cells to run H2O Flow using only the GUI. H2O Flow is designed to guide you every step of the way, by providing input prompts, interactive help, and example flows.

Introduction

This guide will walk you through how to use H2O’s web UI, H2O Flow. To view a demo video of H2O Flow, click here.


Getting Help


First, let’s go over the basics. Type h to view a list of helpful shortcuts.

The following help window displays:

help menu

To close this window, click the X in the upper-right corner, or click the Close button in the lower-right corner. You can also click behind the window to close it. You can also access this list of shortcuts by clicking the Help menu and selecting Keyboard Shortcuts.

For additional help, click Help > Assist Me or click the Assist Me! button in the row of buttons below the menus.

Assist Me

You can also type assist in a blank cell and press Ctrl+Enter. A list of common tasks displays to help you find the correct command.

Assist Me links

There are multiple resources to help you get started with Flow in the Help sidebar.

Note: To hide the sidebar, click the >> button above it.

To display the sidebar if it is hidden, click the << button.

To access this documentation, select the Getting Started with H2O Flow link below the Help Topics heading.

You can also explore the pre-configured flows available in H2O Flow for a demonstration of how to create a flow. To view the example flows:

If you have a flow currently open, a confirmation window appears asking if the current notebook should be replaced. To load the example flow, click the Load Notebook button.

To view the REST API documentation, click the Help tab in the sidebar and then select the type of REST API documentation (Routes or Schemas).

REST API documentation

Before getting started with H2O Flow, make sure you understand the different cell modes. Certain actions can only be performed when the cell is in a specific mode.


Understanding Cell Modes

There are two modes for cells: edit and command.

Using Edit Mode

In edit mode, the cell is yellow with a blinking bar to indicate where text can be entered and there is an orange flag to the left of the cell.

Edit Mode

Using Command Mode

In command mode, the flag is yellow. The flag also indicates the cell’s format:

NOTE: If there is an error in the cell, the flag is red.

Cell error

If the cell is executing commands, the flag is teal. The flag returns to yellow when the task is complete.

Cell executing

Changing Cell Formats

To change the cell’s format (for example, from code to Markdown), make sure you are in command (not edit) mode and that the cell you want to change is selected. The easiest way to do this is to click on the flag to the left of the cell. Enter the keyboard shortcut for the format you want to use. The flag’s text changes to display the current format.

Cell Mode Keyboard Shortcut
Code y
Markdown m
Raw text r
Heading 1 1
Heading 2 2
Heading 3 3
Heading 4 4
Heading 5 5
Heading 6 6

Running Cells

The series of buttons at the top of the page below the menus runs cells in a flow.

Flow - Run Buttons

Running Flows

When you run the flow, a progress bar indicates the current status of the flow. You can cancel the currently running flow by clicking the Stop button in the progress bar.

Flow Progress Bar

When the flow is complete, a message displays in the upper right.

Note: If there is an error in the flow, H2O Flow stops the flow at the cell that contains the error.

Flow - Completed Successfully Flow - Did Not Complete

Using Keyboard Shortcuts

Here are some important keyboard shortcuts to remember:

The following commands must be entered in command mode.

You can view these shortcuts by clicking Help > Keyboard Shortcuts or by clicking the Help tab in the sidebar.

Using Variables in Cells

Variables can be used to store information such as download locations. To use a variable in Flow:

  1. Define the variable in a code cell (for example, locA = "https://h2o-public-test-data.s3.amazonaws.com/bigdata/laptop/kdd2009/small-churn/kdd_train.csv").

    Flow variable definition

  2. Run the cell. H2O validates the variable.

    Flow variable validation

  3. Use the variable in another code cell (for example, importFiles [locA]).

    Flow variable example

To further simplify your workflow, you can save the cells containing the variables and definitions as clips.

Using Flow Buttons

There is also a series of buttons at the top of the page below the flow name that allow you to save the current flow, add a new cell, move cells up or down, run the current cell, and cut, copy, or paste the current cell. If you hover over a button, a description of the button’s function displays.

Flow buttons

You can also use the menus at the top of the screen to edit the order of the cells, toggle specific format types (such as input or output), create models, or score models. You can also access troubleshooting information or obtain help with Flow.

Flow menus

Note: To disable the code input and use H2O Flow strictly as a GUI, click the Cell menu, then Toggle Cell Input.

Now that you are familiar with the cell modes, let’s import some data.


… Importing Data

If you don’t have any data of your own to work with, you can find some example datasets here:

There are multiple ways to import data in H2O Flow:

After selecting the file to import, the file path displays in the “Search Results” section. To import a single file, click the plus sign next to the file. To import all files in the search results, click the Add all link. The files selected for import display in the “Selected Files” section.

Import Files

Note: If the file is compressed, it will only be read using a single thread. For best performance, we recommend uncompressing the file before importing, as this will allow use of the faster multithreaded distributed parallel reader during import. Please note that .zip files containing multiple files are not currently supported.
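
For gzip-compressed files, the uncompress step recommended above could be done with a few lines of Python before importing. This is a minimal sketch using only the standard library; the function name is hypothetical, and other compression formats would need a different module.

```python
import gzip
import shutil


def gunzip(src, dst=None):
    """Decompress a .gz file to dst (default: src without the .gz suffix)."""
    if dst is None:
        if not src.endswith(".gz"):
            raise ValueError("expected a .gz file or an explicit dst")
        dst = src[:-3]
    # Stream the data in chunks so large files are not loaded into memory at once.
    with gzip.open(src, "rb") as compressed, open(dst, "wb") as plain:
        shutil.copyfileobj(compressed, plain)
    return dst
```

The returned path is the uncompressed file, which can then be selected for import.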

After you click the Import button, the raw code for the current job displays. A summary displays the results of the file import, including the number of imported files and their Network File System (nfs) locations.

Import Files - Results

Uploading Data

To upload a local file, click the Data menu and select Upload File…. Click the Choose File button, select the file, click the Choose button, then click the Upload button.

File Upload Pop-Up

When the file has uploaded successfully, a message displays in the upper right and the Setup Parse cell displays.

File Upload Successful

Ok, now that your data is available in H2O Flow, let’s move on to the next step: parsing. Click the Parse these files button to continue.


Parsing Data

After you have imported your data, parse the data.

Select the parser type (if necessary) from the drop-down Parser list. For most data parsing, H2O automatically recognizes the data type, so the default settings typically do not need to be changed. The following options are available:

If a separator or delimiter is used, select it from the Separator list.

Select a column header option, if applicable:

Select any necessary additional options:

A preview of the data displays in the “Edit Column Names and Types” section.

Flow - Parse options

To change or add a column name, edit or enter the text in the column’s entry field. In the screenshot below, the entry field for column 16 is highlighted in red.

Flow - Column Name Entry Field

To change the column type, select the drop-down list to the right of the column name entry field and select the data type. The options are:

You can search for a column by entering it in the Search by column name… entry field above the first column name entry field. As you type, H2O displays the columns that match the specified search terms.

To navigate the data preview, click the <- Previous page or -> Next page buttons.

Flow - Pagination buttons

After making your selections, click the Parse button.

After you click the Parse button, the code for the current job displays.

Flow - Parse code

Since we’ve submitted a couple of jobs (data import & parse) to H2O now, let’s take a moment to learn more about jobs in H2O.


Viewing Jobs

Any command (such as importFiles) you enter in H2O is submitted as a job, which is associated with a key. The key identifies the job within H2O and is used as a reference.

Viewing All Jobs

To view all jobs, click the Admin menu, then click Jobs, or enter getJobs in a cell in CS mode.

View Jobs

The following information displays:

To refresh this information, click the Refresh button. To view the details of the job, click the View button.

Viewing Specific Jobs

To view a specific job, click the link in the “Destination” column.

View Job - Model

The following information displays:

NOTE: For a better understanding of how jobs work, make sure to review the Viewing Frames section as well.

Ok, now that you understand how to find jobs in H2O, let’s submit a new one by building a model.


… Building Models

To build a model:

The Build Model… button can be accessed from any page containing the .hex key for the parsed data (for example, getJobs > getFrame). The following image depicts the K-Means model type. Available options vary depending on model type.

Model Builder

In the Build a Model cell, select an algorithm from the drop-down menu:

The available options vary depending on the selected model. If an option is only available for a specific model type, the model type is listed. If no model type is specified, the option is applicable to all model types.

Advanced Options

Expert Options


Viewing Models

Click the Assist Me! button, then click the getModels link, or enter getModels in the cell in CS mode and press Ctrl+Enter. A list of available models displays.

Flow Models

To view all current models, you can also click the Model menu and click List All Models.

To inspect a model, check its checkbox then click the Inspect button, or click the Inspect button to the right of the model name.

Flow Model

A summary of the model’s parameters displays. To display more details, click the Show All Parameters button.

To delete a model, click the Delete button.

To generate a POJO to be able to use the model outside of H2O, click the Download POJO button.

To learn how to make predictions, continue to the next section.


… Making Predictions

After creating your model, click the key link for the model, then click the Predict button. Select the model to use in the prediction from the drop-down Model: menu and the data frame to use in the prediction from the drop-down Frame: menu, then click the Predict button.

Making Predictions


Viewing Predictions

Click the Assist Me! button, then click the getPredictions link, or enter getPredictions in the cell in CS mode and press Ctrl+Enter. A list of the stored predictions displays. To view a prediction, click the View button to the right of the model name.

Viewing Predictions

You can also view predictions by clicking the drop-down Score menu and selecting List All Predictions.


Viewing Frames

To view a specific frame, click the “Key” link for the specified frame, or enter getFrameSummary "FrameName" in a cell in CS mode (where FrameName is the name of a frame, such as allyears2k.hex).

Viewing specified frame

From the getFrameSummary cell, you can:

When you view a frame, you can “drill-down” to the necessary level of detail (such as a specific column or row) using the Inspect button or by clicking the links. The following screenshot displays the results of clicking the Inspect button for a frame.

Inspecting Frames

This screenshot displays the results of clicking the columns link.

Inspecting Columns

To view all frames, click the Assist Me! button, then click the getFrames link, or enter getFrames in the cell in CS mode and press Ctrl+Enter. You can also view all current frames by clicking the drop-down Data menu and selecting List All Frames.

A list of the current frames in H2O displays that includes the following information for each frame:

For parsed data, the following information displays:

To make a prediction, check the checkboxes for the frames you want to use to make the prediction, then click the Predict on Selected Frames button.


Splitting Frames

Datasets can be split within Flow for use in model training and testing.

splitFrame cell

  1. To split a frame, click the Assist Me! button, then click splitFrame.

    Note: You can also click the drop-down Data menu and select Split Frame….

  2. From the drop-down Frame: list, select the frame to split.
  3. In the second Ratio entry field, specify the fractional value to determine the split. The first Ratio field is automatically calculated based on the values entered in the second Ratio field.

    Note: Only fractional values between 0 and 1 are supported (for example, enter .5 to split the frame in half). The total sum of the ratio values must equal one. H2O automatically adjusts the ratio values to equal one; if unsupported values are entered, an error displays.

  4. In the Key entry field, specify a name for the new frame.
  5. (Optional) To add another split, click the Add a split link. To remove a split, click the X to the right of the Key entry field.
  6. Click the Create button.
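The ratio rule described in the steps above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not H2O's implementation; the function name `complete_split_ratios` is hypothetical. It takes the ratios a user would type into the second and later Ratio fields and derives the first ratio as the remainder, so that the total equals one.

```python
def complete_split_ratios(ratios):
    """Return all split ratios, with the first one derived as the remainder.

    `ratios` holds the user-entered values (hypothetical helper, not an H2O API).
    """
    if not ratios or any(not 0.0 < r < 1.0 for r in ratios):
        raise ValueError("each ratio must be a fraction between 0 and 1")
    remainder = 1.0 - sum(ratios)
    if remainder <= 0.0:
        raise ValueError("the entered ratios must sum to less than 1")
    # The first ratio is whatever is left after the entered ratios.
    return [remainder] + list(ratios)


print(complete_split_ratios([0.25]))
```

Entering .25 in the second field therefore corresponds to a 75/25 split, and an unsupported value such as 1.5 is rejected.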

Creating Frames

To create a frame with a large amount of random data (for example, to use for testing), click the drop-down Admin menu, then select Create Synthetic Frame. Customize the frame as needed, then click the Create button to create the frame.

Flow - Creating Frames


Plotting Frames

To create a plot from a frame, click the Inspect button, then click the Plot button.

Select the type of plot (point, path, or rect) from the drop-down Type menu, then select the x-axis and y-axis from the following options:

Select one of the above options from the drop-down Color menu to display the specified data in color, then click the Plot button to plot the data.

Flow - Plotting Frames

Note: Because H2O stores enums internally as numeric values, then maps the integers to an array of strings, any min, max, or mean values for categorical columns are not meaningful and should be ignored. Displays for categorical data will be modified in a future version of H2O.


… Using Flows

You can use and modify flows in a variety of ways:


Using Clips

Clips enable you to save cells containing your workflow for later reuse. To save a cell as a clip, click the paperclip icon to the right of the cell (highlighted in the red box in the following screenshot).

Paperclip icon

To use a clip in a workflow, click the “Clips” tab in the sidebar on the right.

Clips tab

All saved clips, including the default system clips (such as assist, importFiles, and predict), are listed. Clips you have created are listed under the “My Clips” heading. To select a clip to insert, click the circular button to the left of the clip name. To delete a clip, click the trashcan icon to the right of the clip name.

NOTE: The default clips listed under “System” cannot be deleted.

Deleted clips are stored in the trash. To permanently delete all clips in the trash, click the Empty Trash button.

NOTE: Saved data, including flows and clips, are persistent as long as the same IP address is used for the cluster. If a new IP is used, previously saved flows and clips are not available.


Viewing Outlines

The Outline tab in the sidebar displays a brief summary of the cells currently used in your flow; essentially, a command history.


Saving Flows

You can save your flow for later reuse. To save your flow as a notebook, click the “Save” button (the first button in the row of buttons below the flow name), or click the drop-down “Flow” menu and select “Save Flow.” To enter a custom name for the flow, click the default flow name (“Untitled Flow”) and type the desired flow name. A pencil icon indicates where to enter the desired name.

Renaming Flows

To confirm the name, click the checkmark to the right of the name field.

Confirm Name

To reuse a saved flow, click the “Flows” tab in the sidebar, then click the flow name. To delete a saved flow, click the trashcan icon to the right of the flow name.

Flows

Finding Saved Flows on your Disk

By default, flows are saved to the h2oflows directory underneath your home directory. The directory where flows are saved is printed to stdout:

03-20 14:54:20.945 172.16.2.39:54323     95667  main      INFO: Flow dir: '/Users/<UserName>/h2oflows'

To back up saved flows, copy this directory to your preferred backup location.

To specify a different location for saved flows, use the command-line argument -flow_dir when launching H2O:

java -jar h2o.jar -flow_dir /<New>/<Location>/<For>/<Saved>/<Flows>

where /<New>/<Location>/<For>/<Saved>/<Flows> represents the specified location. If the directory does not exist, it will be created the first time you save a flow.

Saving Flows on a Hadoop cluster

If you are running H2O Flow on a Hadoop cluster, H2O will try to find the HDFS home directory to use as the default directory for flows. If the HDFS home directory is not found, flows cannot be saved unless a directory is specified while launching using -flow_dir:

hadoop jar h2odriver.jar -nodes 1 -mapperXmx 6g -output hdfsOutputDirName -flow_dir hdfs:///<Saved>/<Flows>/<Location>

The location specified in flow_dir may be either an hdfs or regular filesystem directory. If the directory does not exist, it will be created the first time you save a flow.

Copying Flows

To create a copy of the current flow, select the Flow menu, then click Make a Copy. The name of the current flow changes to Copy of <FlowName> (where <FlowName> is the name of the flow). You can save the duplicated flow using this name by clicking Flow > Save Flow, or rename it before saving.

Downloading Flows

After saving a flow as a notebook, click the Flow menu, then select Download this Flow. A new window opens and the saved flow is downloaded to the default downloads folder on your computer. The file is exported as <filename>.flow, where <filename> is the name specified when the flow was saved.

Caution: You must have an active internet connection to download flows.

Loading Flows

To load a saved flow, click the Flows tab in the sidebar at the right, then click the name of the flow. In the pop-up confirmation window that appears, select Load Notebook, or click Cancel to return to the current flow.

Confirm Replace Flow

After clicking Load Notebook, the saved flow is loaded.

To load an exported flow, click the Flow menu and select Open Flow…. In the pop-up window that appears, click the Choose File button and select the exported flow, then click the Open button.

Open Flow

Notes:


…Troubleshooting Flow

To troubleshoot issues in Flow, use the Admin menu. The Admin menu allows you to check the status of the cluster, view a timeline of events, and view or download logs for issue analysis.

NOTE: To view the current H2O Flow version, click the Help menu, then click About.

Viewing Cluster Status

Click the Admin menu, then select Cluster Status. A summary of the status of the cluster (also known as a cloud) displays, which includes the following information:

The following information displays for each node:

To view more information, click the Show Advanced button.


Viewing CPU Status (Water Meter)

To view the current CPU usage, click the Admin menu, then click Water Meter (CPU Meter). A new window opens, displaying the current CPU use statistics.


Viewing Logs

To view the logs for troubleshooting, click the Admin menu, then click Inspect Log.

Inspect Log

To view the logs for a specific node, select it from the drop-down Select Node menu.


Downloading Logs

To download the logs for further analysis, click the Admin menu, then click Download Log. A new window opens and the logs download to your default download folder. You can close the new window after downloading the logs. Send the logs to support@h2o.ai for issue resolution.


Viewing Stack Trace Information

To view the stack trace information, click the Admin menu, then click Stack Trace.

Stack Trace

To view the stack trace information for a specific node, select it from the drop-down Select Node menu.


Viewing Network Test Results

To view network test results, click the Admin menu, then click Network Test.

Network Test Results


Accessing the Profiler

The Profiler looks across the cluster to see where the same stack trace occurs, and can be helpful for identifying what the CPUs are currently doing. To view the profiler, click the Admin menu, then click Profiler.

Profiler

To view the profiler information for a specific node, select it from the drop-down Select Node menu.


Viewing the Timeline

To view a timeline of events in Flow, click the Admin menu, then click Timeline. The following information displays for each event:

To obtain the most recent information, click the Refresh button.


Reporting Issues

If you experience an error with Flow, you can submit a JIRA ticket to notify our team.

  1. First, click the Admin menu, then click Download Logs. This downloads a file that contains information that will help our developers identify the cause of the issue.
  2. Click the Help menu, then click Report an issue. This will open our JIRA page where you can file your ticket.
  3. Click the Create button at the top of the JIRA page.
  4. Attach the log file from the first step, write a description of the error you experienced, then click the Create button at the bottom of the page. Our team will work to resolve the issue and you can track the progress of your ticket in JIRA.

Requesting Help

If you have a Google account, you can submit a request for assistance with H2O on our Google Groups page, H2Ostream.

To access H2Ostream from Flow:

  1. Click the Help menu.
  2. Click Forum/Ask a question.
  3. Click the red New topic button.
  4. Enter your question and click the red Post button. If you are requesting assistance for an error you experienced, be sure to include your logs.

You can also email your question to h2ostream@googlegroups.com.


Shutting Down H2O

To shut down H2O, click the Admin menu, then click Shut Down. A Shut down complete message displays in the upper right when the cluster has been shut down.

Data Science Algorithms

This document describes how to define and interpret each model, explains the algorithm itself, and provides an FAQ.

Commonalities

Missing Value Handling for Training

If missing values are found in the validation frame during model training or during the scoring process for creating predictions, the missing values are automatically imputed.

If the missing values are found during POJO scoring, the answer is converted to NaN.

K-Means

Introduction

K-Means falls in the general category of clustering algorithms.

Defining a K-Means Model

Interpreting a K-Means Model

By default, the following output displays:

K-Means randomly chooses starting points and converges to a local minimum of centroids. The number of clusters is arbitrary, and should be thought of as a tuning parameter. The output is a matrix of the cluster assignments and the coordinates of the cluster centers in terms of the originally chosen attributes. Your cluster centers may differ slightly from run to run as this problem is Non-deterministic Polynomial-time (NP)-hard.

FAQ

K-Means Algorithm

The number of clusters \(K\) is user-defined and is determined a priori.

  1. Choose \(K\) initial cluster centers \(m_{k}\) according to one of the following:

    • Randomization: Choose \(K\) clusters from the set of \(N\) observations at random so that each observation has an equal chance of being chosen.

    • Plus Plus

      a. Choose one center \(m_{1}\) at random.

      b. Calculate the difference between \(m_{1}\) and each of the remaining \(N-1\) observations \(x_{i}\). \(d(x_{i}, m_{1}) = ||(x_{i}-m_{1})||^2\)

      c. Let \(P(i)\) be the probability of choosing \(x_{i}\) as \(m_{2}\). Weight \(P(i)\) by \(d(x_{i}, m_{1})\) so that those \(x_{i}\) furthest from \(m_{1}\) have a higher probability of being selected than those \(x_{i}\) close to \(m_{1}\).

      d. Choose the next center \(m_{2}\) by drawing at random according to the weighted probability distribution.

      e. Repeat until \(K\) centers have been chosen.

    • Furthest

      a. Choose one center \(m_{1}\) at random.

      b. Calculate the difference between \(m_{1}\) and each of the remaining \(N-1\) observations \(x_{i}\). \(d(x_{i}, m_{1}) = ||(x_{i}-m_{1})||^2\)

      c. Choose \(m_{2}\) to be the \(x_{i}\) that maximizes \(d(x_{i}, m_{1})\).

      d. Repeat until \(K\) centers have been chosen.

  2. Once \(K\) initial centers have been chosen, calculate the difference between each observation \(x_{i}\) and each of the centers \(m_{1},...,m_{K}\), where difference is the squared Euclidean distance taken over \(p\) parameters.

    \(d(x_{i}, m_{k})=\) \(\sum_{j=1}^{p}(x_{ij}-m_{kj})^2=\) \(\lVert(x_{i}-m_{k})\rVert^2\)

  3. Assign \(x_{i}\) to the cluster \(k\) defined by \(m_{k}\) that minimizes \(d(x_{i}, m_{k})\).

  4. When all observations \(x_{i}\) are assigned to a cluster, calculate the mean of the points in the cluster.

    \(\bar{x}(k)=\lbrace\bar{x_{i1}},…,\bar{x_{ip}}\rbrace\)

  5. Set the \(\bar{x}(k)\) as the new cluster centers \(m_{k}\). Repeat steps 2 through 5 until the specified number of max iterations is reached or cluster assignments of the \(x_{i}\) are stable.
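The steps above can be sketched in plain Python as follows. This is a minimal single-machine illustration under stated assumptions, not H2O's distributed implementation: the function names are hypothetical, the Plus Plus initialization weights each candidate by its squared distance to the nearest center chosen so far (one common way to generalize the rule beyond the second center), and an empty cluster simply keeps its previous center.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points of equal length."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def init_plus_plus(points, k, rng):
    """Step 1 (Plus Plus): draw centers with distance-weighted probability."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Assumption: weight by squared distance to the nearest chosen center.
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


def kmeans(points, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    centers = init_plus_plus(points, k, rng)
    assignments = None
    for _ in range(max_iterations):
        # Steps 2 and 3: assign each observation to its closest center.
        new_assignments = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        if new_assignments == assignments:
            break  # cluster assignments are stable
        assignments = new_assignments
        # Steps 4 and 5: replace each center with the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignments


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, assignments = kmeans(points, k=2)
print(centers, assignments)
```

Because the first center is chosen at random, the cluster labels (and, on less clearly separated data, the centers themselves) can change with the seed, which matches the run-to-run variation noted earlier.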

References

Hastie, Trevor, Robert Tibshirani, and Jerome H. Friedman. The Elements of Statistical Learning. Vol. 1. New York: Springer, 2001.

Xiong, Hui, Junjie Wu, and Jian Chen. “K-means Clustering Versus Validation Measures: A Data- distribution Perspective.” Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on 39.2 (2009): 318-331.


GLM

Introduction

Generalized Linear Models (GLM) estimate regression models for outcomes following exponential-family distributions. In addition to the Gaussian (i.e. normal) distribution, these include Poisson, binomial, and gamma distributions. Each serves a different purpose, and depending on distribution and link function choice, can be used either for prediction or classification.

The GLM suite includes:

Defining a GLM Model

Interpreting a GLM Model

By default, the following output displays:

FAQ

GLM Algorithm

Following the definitive text by P. McCullagh and J.A. Nelder (1989) on the generalization of linear models to non-linear distributions of the response variable Y, H2O fits GLM models based on maximum likelihood estimation via iteratively reweighted least squares.

Let \(y_{1},…,y_{n}\) be n observations of the independent, random response variable \(Y_{i}\).

Assume that the observations are distributed according to a function from the exponential family and have a probability density function of the form:

\(f(y_{i})=\exp\left[\frac{y_{i}\theta_{i} - b(\theta_{i})}{a_{i}(\phi)} + c(y_{i}; \phi)\right]\) where \(\theta\) and \(\phi\) are location and scale parameters, and \(\: a_{i}(\phi), \:b(\theta_{i}),\: c(y_{i}; \phi)\) are known functions.

\(a_{i}\) is of the form \(\:a_{i}=\frac{\phi}{p_{i}}; p_{i}\) is a known prior weight.

When \(Y\) has a pdf from the exponential family:

\(E(Y_{i})=\mu_{i}=b^{\prime}(\theta_{i})\)

\(var(Y_{i})=\sigma_{i}^2=b^{\prime\prime}(\theta_{i})a_{i}(\phi)\)

Let \(g(\mu_{i})=\eta_{i}\) be a monotonic, differentiable transformation of the expected value of \(y_{i}\). The function \(g\) is the link function, and the linear predictor \(\eta_{i}\) follows a linear model.

\(g(\mu_{i})=\eta_{i}=\mathbf{x_{i}^{\prime}}\beta\)

When inverted: \(\mu_{i}=g^{-1}(\mathbf{x_{i}^{\prime}}\beta)\)

Maximum Likelihood Estimation

For an initial rough estimate of the parameters \(\hat{\beta}\), use the estimate to generate fitted values: \(\mu_{i}=g^{-1}(\hat{\eta_{i}})\)

Let \(z\) be a working dependent variable such that \(z_{i}=\hat{\eta_{i}}+(y_{i}-\hat{\mu_{i}})\frac{d\eta_{i}}{d\mu_{i}}\),

where \(\frac{d\eta_{i}}{d\mu_{i}}\) is the derivative of the link function evaluated at the trial estimate.

Calculate the iterative weights: \(w_{i}=\frac{p_{i}}{b^{\prime\prime}(\theta_{i})\left(\frac{d\eta_{i}}{d\mu_{i}}\right)^{2}}\)

Where \(b^{\prime\prime}\) is the second derivative of \(b(\theta_{i})\) evaluated at the trial estimate.

Assume \(a_{i}(\phi)\) is of the form \(\frac{\phi}{p_{i}}\). The weight \(w_{i}\) is inversely proportional to the variance of the working dependent variable \(z_{i}\) for current parameter estimates and proportionality factor \(\phi\).

Regress \(z_{i}\) on the predictors \(x_{i}\) using the weights \(w_{i}\) to obtain new estimates of \(\beta\). \(\hat{\beta}=(\mathbf{X}^{\prime}\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^{\prime}\mathbf{W}\mathbf{z}\)

Where \(\mathbf{X}\) is the model matrix, \(\mathbf{W}\) is a diagonal matrix of \(w_{i}\), and \(\mathbf{z}\) is a vector of the working response variable \(z_{i}\).

This process is repeated until the estimates \(\hat{\beta}\) change by less than the specified amount.
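To make the iteration concrete, the following sketch applies it to the binomial case with the logit link and a single predictor plus an intercept. It is an illustration under stated assumptions rather than H2O's solver: all prior weights \(p_{i}\) are taken to be 1, the function name `irls_logistic` is hypothetical, and the 2x2 weighted normal equations are solved in closed form. For the logit link, \(\frac{d\eta_{i}}{d\mu_{i}}=\frac{1}{\mu_{i}(1-\mu_{i})}\) and \(b^{\prime\prime}(\theta_{i})=\mu_{i}(1-\mu_{i})\), so the weight reduces to \(w_{i}=\mu_{i}(1-\mu_{i})\).

```python
import math


def irls_logistic(x, y, tol=1e-8, max_iter=50):
    """Fit intercept b0 and slope b1 of a logistic model by IRLS."""
    b0, b1 = 0.0, 0.0  # initial rough estimate of beta
    for _ in range(max_iter):
        s_w = s_wx = s_wxx = s_wz = s_wxz = 0.0
        for xi, yi in zip(x, y):
            eta = b0 + b1 * xi                 # linear predictor
            mu = 1.0 / (1.0 + math.exp(-eta))  # fitted value g^{-1}(eta)
            w = mu * (1.0 - mu)                # iterative weight, p_i = 1
            z = eta + (yi - mu) / w            # working dependent variable
            s_w += w
            s_wx += w * xi
            s_wxx += w * xi * xi
            s_wz += w * z
            s_wxz += w * xi * z
        # Solve the 2x2 weighted normal equations (X'WX) beta = X'Wz.
        det = s_w * s_wxx - s_wx * s_wx
        new_b0 = (s_wxx * s_wz - s_wx * s_wxz) / det
        new_b1 = (s_w * s_wxz - s_wx * s_wz) / det
        change = max(abs(new_b0 - b0), abs(new_b1 - b1))
        b0, b1 = new_b0, new_b1
        if change < tol:  # estimates changed by less than the tolerance
            break
    return b0, b1


b0, b1 = irls_logistic([0, 1, 2, 3, 4, 5], [0, 0, 1, 0, 1, 1])
print(b0, b1)
```

The sketch assumes the classes overlap; on perfectly separable data the estimates grow without bound and the weights approach zero.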

Cost of computation

H2O can process large data sets because it relies on parallel processes. Large data sets are divided into smaller data sets and processed simultaneously and the results are communicated between computers as needed throughout the process.

In GLM, data are split by rows but not by columns, because the predicted Y values depend on information in each of the predictor variable vectors. If O is a complexity function, N is the number of observations (or rows), and p is the number of predictors (or columns), then

    \(Runtime\propto p^3+\frac{(N*p^2)}{CPUs}\)

Distribution reduces the time it takes an algorithm to process because it decreases N.

The larger (N/CPUs) becomes relative to p, the less p contributes to the overall computational cost. However, when p is greater than (N/CPUs), O is dominated by p.

    \(Complexity = O(p^3 + N*p^2)\)
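As a rough illustration of the runtime relationship above, the sketch below evaluates \(p^3+\frac{N*p^2}{CPUs}\) for a given configuration. The function name is hypothetical and the result is a unitless relative cost, not a time estimate.

```python
def glm_relative_runtime(n_rows, n_predictors, cpus):
    """Relative cost p^3 + (N * p^2) / CPUs from the formula above."""
    return n_predictors ** 3 + (n_rows * n_predictors ** 2) / cpus


# Few predictors, many rows: the distributed N * p^2 term dominates.
print(glm_relative_runtime(1_000_000, 10, 4))
print(glm_relative_runtime(1_000_000, 10, 8))
# Many predictors relative to N / CPUs: the p^3 term dominates.
print(glm_relative_runtime(10_000, 5_000, 8))
```

Doubling the CPU count roughly halves the cost in the first case, but has little effect in the last one because the \(p^3\) term is not distributed.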

References

Breslow, N E. “Generalized Linear Models: Checking Assumptions and Strengthening Conclusions.” Statistica Applicata 8 (1996): 23-41.

Frome, E L. “The Analysis of Rates Using Poisson Regression Models.” Biometrics (1983): 665-674.

Goldberger, Arthur S. “Best Linear Unbiased Prediction in the Generalized Linear Regression Model.” Journal of the American Statistical Association 57.298 (1962): 369-375.

Guisan, Antoine, Thomas C Edwards Jr, and Trevor Hastie. “Generalized Linear and Generalized Additive Models in Studies of Species Distributions: Setting the Scene.” Ecological modeling 157.2 (2002): 89-100.

Nelder, John A, and Robert WM Wedderburn. “Generalized Linear Models.” Journal of the Royal Statistical Society. Series A (General) (1972): 370-384.

Niu, Feng, et al. “Hogwild!: A lock-free approach to parallelizing stochastic gradient descent.” Advances in Neural Information Processing Systems 24 (2011): 693-701. (Implemented algorithm on p. 5.)

Pearce, Jennie, and Simon Ferrier. “Evaluating the Predictive Performance of Habitat Models Developed Using Logistic Regression.” Ecological modeling 133.3 (2000): 225-245.

Press, S James, and Sandra Wilson. “Choosing Between Logistic Regression and Discriminant Analysis.” Journal of the American Statistical Association 73.364 (1978): 699–705.

Snee, Ronald D. “Validation of Regression Models: Methods and Examples.” Technometrics 19.4 (1977): 415-428.


DRF

Introduction

Distributed Random Forest (DRF) is a powerful classification tool. When given a set of data, DRF generates a forest of classification trees, rather than a single classification tree. Each of these trees is a weak learner built on a subset of rows and columns. More trees will reduce the variance. The classification from each H2O tree can be thought of as a vote; the most votes determines the classification.
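The voting idea can be illustrated with a few lines of Python. This is only a sketch of majority voting over per-tree class predictions; the function name is hypothetical and it does not reflect how H2O combines trees internally.

```python
from collections import Counter


def forest_vote(tree_predictions):
    """Return the class that received the most votes from the trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


print(forest_vote(["yes", "no", "yes", "yes", "no"]))
```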

Defining a DRF Model

Interpreting a DRF Model

By default, the following output displays:

FAQ

DRF Algorithm

Jan vitek distributedrandomforest_5-2-2013 from 0xdata

References


Naïve Bayes

Introduction

Naïve Bayes (NB) is a classification algorithm that relies on strong assumptions of the independence of covariates in applying Bayes Theorem. NB models are commonly used as an alternative to decision trees for classification problems.

Defining a Naïve Bayes Model

Interpreting a Naïve Bayes Model

The output from Naïve Bayes is a list of tables containing the a-priori and conditional probabilities of each class of the response. The a-priori probability is the estimated probability of a particular class before observing any of the predictors. Each conditional probability table corresponds to a predictor column. The row headers are the classes of the response and the column headers are the classes of the predictor, so each row sums to one. Thus, in the table below, the probability that a person is male (x) given that the person did not survive (y) is 0.91543624.

                Sex
Survived       Male     Female
     No  0.91543624 0.08456376
     Yes 0.51617440 0.48382560

When the predictor is numeric, Naïve Bayes assumes it is sampled from a Gaussian distribution given the class of the response. The first column contains the mean and the second column contains the standard deviation of the distribution.
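For a numeric predictor, the mean and standard deviation reported for a class are enough to evaluate the class-conditional likelihood of a new value. The sketch below shows the Gaussian density this implies; the function name is hypothetical and the snippet is an illustration, not H2O's scoring code.

```python
import math


def gaussian_density(x, mean, sd):
    """Density of a normal distribution with the given mean and sd at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


# Likelihood of observing x = 1.5 in a class with mean 1.0 and sd 0.5.
print(gaussian_density(1.5, mean=1.0, sd=0.5))
```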

By default, the following output displays:

FAQ

Naïve Bayes Algorithm

The algorithm is presented for the simplified binomial case without loss of generality.

Under the Naive Bayes assumption of independence, given a training set for a set of discrete valued features X \({(X^{(i)},\ y^{(i)};\ i=1,...m)}\)

The joint likelihood of the data can be expressed as:

\(\mathcal{L} \: (\phi(y),\: \phi_{i|y=1},\:\phi_{i|y=0})=\Pi_{i=1}^{m} p(X^{(i)},\: y^{(i)})\)

The model can be parameterized by:

\(\phi_{i|y=0}=\ p(x_{i}=1|\ y=0);\: \phi_{i|y=1}=\ p(x_{i}=1|y=1);\: \phi(y)\)

Where \(\phi_{i|y=0}=\ p(x_{i}=1|\ y=0)\) can be thought of as the fraction of the observed instances with outcome \(y=0\) in which feature \(x_{i}\) is observed, \(\phi_{i|y=1}=p(x_{i}=1|\ y=1)\) is the fraction of the observed instances with outcome \(y=1\) in which feature \(x_{i}\) is observed, and so on.

The objective of the algorithm is to maximize the joint likelihood with respect to \(\phi_{i|y=0}\), \(\phi_{i|y=1}\), and \(\phi(y)\).

Where the maximum likelihood estimates are:

\(\phi_{j|y=1}= \frac{\Sigma_{i=1}^{m} 1(x_{j}^{(i)}=1 \ \bigcap \ y^{(i)} = 1)}{\Sigma_{i=1}^{m} 1(y^{(i)}=1)}\)

\(\phi_{j|y=0}= \frac{\Sigma_{i=1}^{m} 1(x_{j}^{(i)}=1 \ \bigcap \ y^{(i)} = 0)}{\Sigma_{i=1}^{m} 1(y^{(i)}=0)}\)

\(\phi(y)= \frac{\Sigma_{i=1}^{m} 1(y^{(i)} = 1)}{m}\)

Once all parameters \(\phi_{j|y}\) are fitted, the model can be used to predict new examples with features \(X_{(i^*)}\).

This is carried out by calculating:

\(p(y=1|x)=\frac{\Pi p(x_i|y=1) p(y=1)}{\Pi p(x_i|y=1)p(y=1) \: +\: \Pi p(x_i|y=0)p(y=0)}\)

\(p(y=0|x)=\frac{\Pi p(x_i|y=0) p(y=0)}{\Pi p(x_i|y=1)p(y=1) \: +\: \Pi p(x_i|y=0)p(y=0)}\)

and predicting the class with the highest probability.

It is possible that prediction sets contain features not originally seen in the training set. If this occurs, the maximum likelihood estimates for these features predict a probability of 0 for all cases of y.

Laplace smoothing allows a model to predict on out of training data features by adjusting the maximum likelihood estimates to be:

\(\phi_{j|y=1}= \frac{\Sigma_{i=1}^{m} 1(x_{j}^{(i)}=1 \ \bigcap \ y^{(i)} = 1) \: + \: 1}{\Sigma_{i=1}^{m} 1(y^{(i)}=1) \: + \: 2}\)

\(\phi_{j|y=0}= \frac{\Sigma_{i=1}^{m} 1(x_{j}^{(i)}=1 \ \bigcap \ y^{(i)} = 0) \: + \: 1}{\Sigma_{i=1}^{m} 1(y^{(i)}=0) \: + \: 2}\)

Note that in the general case where y takes on k values, there are k+1 modified parameter estimates, and k is added to the denominator (rather than two, as in the two-level classifier shown here).

Laplace smoothing should be used with care; it is generally intended to allow for predictions in rare events. As prediction data becomes increasingly distinct from training data, train new models when possible to account for a broader set of possible X values.
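The estimates and the prediction rule for the binomial case can be sketched as follows. This is a minimal illustration for binary features and a binary response, not H2O's implementation; the function names and the `laplace` argument are hypothetical, and setting `laplace=0` gives the unsmoothed maximum likelihood estimates.

```python
def fit_bernoulli_nb(X, y, laplace=1):
    """Estimate phi(y) and phi_{j|y} with optional Laplace smoothing."""
    m = len(y)
    n_features = len(X[0])
    phi_y = sum(y) / m  # fraction of observations with y = 1
    phi = {}
    for label in (0, 1):
        rows = [xi for xi, yi in zip(X, y) if yi == label]
        # Add `laplace` to the count and 2 * `laplace` to the denominator.
        phi[label] = [
            (sum(r[j] for r in rows) + laplace) / (len(rows) + 2 * laplace)
            for j in range(n_features)
        ]
    return phi_y, phi


def predict_proba(x, phi_y, phi):
    """Return p(y = 1 | x) under the independence assumption."""
    def joint(label, prior):
        p = prior
        for xj, pj in zip(x, phi[label]):
            p *= pj if xj == 1 else 1.0 - pj
        return p

    j1 = joint(1, phi_y)
    j0 = joint(0, 1.0 - phi_y)
    return j1 / (j1 + j0)


X = [[1, 0], [1, 1], [0, 0], [0, 1]]
y = [1, 1, 0, 0]
phi_y, phi = fit_bernoulli_nb(X, y)
print(predict_proba([1, 0], phi_y, phi))
```

Without smoothing, the first feature is never observed when y = 0 in this toy data, so its estimate is exactly 0; with smoothing it becomes 1/4, and the prediction for `[1, 0]` is 0.75 rather than a hard 1.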

References

Hastie, Trevor, Robert Tibshirani, and Jerome H. Friedman. The Elements of Statistical Learning. Vol. 1. New York: Springer, 2001.

Ng, Andrew. “Generative Learning algorithms.” (2008).


PCA

Introduction

Principal Components Analysis (PCA) is closely related to Principal Components Regression. The algorithm is carried out on a set of possibly collinear features and performs a transformation to produce a new set of uncorrelated features.

PCA is commonly used to model without regularization or perform dimensionality reduction. It can also be useful to carry out as a preprocessing step before distance-based algorithms such as K-Means since PCA guarantees that all dimensions of a manifold are orthogonal.

Defining a PCA Model

Interpreting a PCA Model

PCA output returns a table displaying the number of components specified by the value for k.

Scree and cumulative variance plots for the components are returned as well. Users can access them by clicking on the black button labeled “Scree and Variance Plots” at the top left of the results page. A scree plot shows the variance of each component, while the cumulative variance plot shows the total variance accounted for by the set of components.

The output for PCA includes the following:

FAQ

For the GramSVD and Power methods, all rows containing missing values are ignored during training. For the GLRM method, missing values are excluded from the sum over the loss function in the objective. For more information, refer to section 4 Generalized Loss Functions, equation (13), in “Generalized Low Rank Models” by Boyd et al.

PCA Algorithm

Let \(X\) be an \(M\times N\) matrix where

The covariance matrix \(C_{x}\) is

\(C_{x}=\frac{1}{n}XX^{T}\)

where \(n\) is the number of observations.

\(C_{x}\) is a square, symmetric \(m\times m\) matrix, the diagonal entries of which are the variances of attributes, and the off-diagonal entries are covariances between attributes.

PCA convergence is based on the method described by Gockenbach: “The rate of convergence of the power method depends on the ratio \(|\lambda_2|/|\lambda_1|\). If this is small…then the power method converges rapidly. If the ratio is close to 1, then convergence is quite slow. The power method will fail if \(|\lambda_2| = |\lambda_1|\).” (567).

The objective of PCA is to maximize variance while minimizing covariance.

To accomplish this, for a new matrix \(C_{y}\) with off diagonal entries of 0, and each successive dimension of Y ranked according to variance, PCA finds an orthonormal matrix \(P\) such that \(Y=PX\) constrained by the requirement that \(C_{y}=\frac{1}{n}YY^{T}\) be a diagonal matrix.

The rows of \(P\) are the principal components of X.

\(C_{y}=\frac{1}{n}YY^{T}\)

\(=\frac{1}{n}(PX)(PX)^{T}\)

\(=PC_{x}P^{T}.\)

Because any symmetric matrix is diagonalized by an orthogonal matrix of its eigenvectors, solve matrix \(P\) to be a matrix where each row is an eigenvector of \(\frac{1}{n}XX^{T}=C_{x}\)

Then the principal components of \(X\) are the eigenvectors of \(C_{x}\), and the \(i^{th}\) diagonal value of \(C_{y}\) is the variance of \(X\) along \(p_{i}\).

Eigenvectors of \(C_{x}\) are found by first finding the eigenvalues \(\lambda\) of \(C_{x}\).

For each eigenvalue \(\lambda\), \((C_{x}-\lambda I)x =0\), where \(x\) is the eigenvector associated with \(\lambda\).

Solve for \(x\) by Gaussian elimination.
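The power method quoted above can be sketched in plain Python to find the first principal component. This is an illustration under stated assumptions, not H2O's implementation: the function names are hypothetical, \(X\) is given as a list of attribute rows whose columns are observations (matching \(C_{x}=\frac{1}{n}XX^{T}\)), and each attribute is mean-centered before the covariance is formed.

```python
import math


def covariance(X):
    """C_x = (1/n) X X^T for mean-centered attribute rows of X."""
    n = len(X[0])
    centered = []
    for row in X:
        mean = sum(row) / n
        centered.append([v - mean for v in row])
    return [
        [sum(a * b for a, b in zip(ri, rj)) / n for rj in centered]
        for ri in centered
    ]


def power_iteration(C, iterations=1000, tol=1e-12):
    """Return the dominant eigenvalue of C and a unit eigenvector."""
    m = len(C)
    v = [1.0 / math.sqrt(m)] * m  # arbitrary unit-length starting vector
    eigenvalue = 0.0
    for _ in range(iterations):
        w = [sum(C[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # v lies in the null space of C
        w = [x / norm for x in w]
        # Rayleigh quotient w^T C w estimates the eigenvalue.
        new_eigenvalue = sum(
            w[i] * sum(C[i][j] * w[j] for j in range(m)) for i in range(m)
        )
        converged = abs(new_eigenvalue - eigenvalue) < tol
        v, eigenvalue = w, new_eigenvalue
        if converged:
            break
    return eigenvalue, v


X = [[1, 2, 3, 4], [2, 4, 6, 8]]  # second attribute is twice the first
variance, component = power_iteration(covariance(X))
print(variance, component)
```

In this example the two attributes are perfectly collinear, so a single component carries all of the variance and the second eigenvalue is zero, which is the fast-converging case in Gockenbach's description.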

Recovering SVD from GLRM

GLRM gives \(x\) and \(y\), where \(x \in \rm \Bbb I \!\Bbb R ^{n * k}\) and \( y \in \rm \Bbb I \!\Bbb R ^{k*m} \)

   - \(n\)= number of rows (A)

   - \(m\)= number of columns (A)

   - \(k\)= user-specified rank

   - \(A\)= training matrix

It is assumed that the \(x\) and \(y\) columns are independent.

First, perform QR decomposition of \(x\) and \(y^T\):

   \(x = QR\)

    \(y^T = ZS\), where \(Q^TQ = I = Z^TZ\)

      Call JAMA QR Decomposition directly on \(y^T\) to get \( Z \in \rm \Bbb I \! \Bbb R\), \( S \in \Bbb I \! \Bbb R \)

      \( R \) from QR decomposition of \( x \) is the upper triangular factor of Cholesky of \(X^TX\) Gram

      \( X^TX = LL^T, X = QR \)

      \( X^TX= (R^TQ^T) QR = R^TR \), since \(Q^TQ=I \) => \(R=L^T\) (transpose lower triangular)

Note: In code, \(X^TX \over n\) = \( LL^T \)

   \( X^TX = (L \sqrt{n})(L \sqrt{n})^T =R^TR \)

   \( R = L^T \sqrt{n} \in \rm \Bbb I \! \Bbb R^{k * k} \) reduced QR decomposition.

For more information, refer to the Rectangular matrix section of “QR Decomposition” on Wikipedia.

\( XY = QR(ZS)^T = Q(RS^T)Z^T \)

Note: \( (RS^T) \in \rm \Bbb I \!\Bbb R \)

Find SVD (locally) of \( RS^T \)

\( RS^T = U \Sigma V^T, U^TU = I = V^TV \) orthogonal

\( XY = Q(RS^T)Z^T = (QU) \Sigma (V^T Z^T) \) SVD

   \( (QU)^T(QU) = U^T Q^TQU = U^TU = I\)

   \( (ZV)^T(ZV) = V^TZ^TZV = V^TV =I \)

Right singular vectors: \( ZV \in \rm \Bbb I \!\Bbb R^{m * k} \)

Singular values: \( \Sigma \in \rm \Bbb I \!\Bbb R^{k * k} \) diagonal

Left singular vectors: \( (QU) \in \rm \Bbb I \!\Bbb R^{n * k}\)

References

Gockenbach, Mark S. “Finite-Dimensional Linear Algebra (Discrete Mathematics and Its Applications).” (2010): 566-567.


GBM

Introduction

Gradient Boosted Regression and Gradient Boosted Classification are forward learning ensemble methods. The guiding heuristic is that good predictive results can be obtained through increasingly refined approximations. H2O’s GBM sequentially builds regression trees on all the features of the dataset in a fully distributed way - each tree is built in parallel.

Defining a GBM Model

Interpreting a GBM Model

The output for GBM includes the following:

FAQ

GBM Algorithm

H2O’s Gradient Boosting Algorithms follow the algorithm specified by Hastie et al (2001):

Initialize \(f_{k0} = 0,\: k=1,2,…,K\)

For \(m=1\) to \(M:\)

  (a) Set \(p_{k}(x)=\frac{e^{f_{k}(x)}}{\sum_{l=1}^{K}e^{f_{l}(x)}},\:k=1,2,…,K\)

  (b) For \(k=1\) to \(K\):

    i. Compute \(r_{ikm}=y_{ik}-p_{k}(x_{i}),\:i=1,2,…,N.\)

    ii. Fit a regression tree to the targets \(r_{ikm},\:i=1,2,…,N\), giving terminal regions \(R_{jkm},\:j=1,2,…,J_{m}.\)

    iii. Compute \(\gamma_{jkm}=\frac{K-1}{K}\:\frac{\sum_{x_{i}\in R_{jkm}}(r_{ikm})}{\sum_{x_{i}\in R_{jkm}}|r_{ikm}|(1-|r_{ikm}|)},\:j=1,2,…,J_{m}.\)

    iv. Update \(f_{km}(x)=f_{k,m-1}(x)+\sum_{j=1}^{J_{m}}\gamma_{jkm}I(x\in\:R_{jkm}).\)

Output \(\:\hat{f_{k}}(x)=f_{kM}(x),\:k=1,2,…,K.\)
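The per-observation arithmetic in steps (a), (b)i, and (b)iii can be sketched as follows. The tree-fitting step (b)ii is omitted, the function names are hypothetical, and the snippet is an illustration of the formulas rather than H2O's distributed implementation.

```python
import math


def class_probabilities(f):
    """Step (a): softmax of the per-class scores f_1(x), ..., f_K(x)."""
    shift = max(f)  # subtracting the max avoids overflow; result is unchanged
    exps = [math.exp(v - shift) for v in f]
    total = sum(exps)
    return [e / total for e in exps]


def residuals(y_onehot, f):
    """Step (b)i: r_k = y_k - p_k(x) for one observation."""
    return [yk - pk for yk, pk in zip(y_onehot, class_probabilities(f))]


def terminal_value(region_residuals, n_classes):
    """Step (b)iii: gamma for one terminal region of one class tree."""
    numerator = sum(region_residuals)
    denominator = sum(abs(r) * (1.0 - abs(r)) for r in region_residuals)
    return (n_classes - 1) / n_classes * numerator / denominator


# First iteration for K = 3: all scores start at 0, so each p_k is 1/3.
r = residuals([1, 0, 0], [0.0, 0.0, 0.0])
print(r)
# Two observations of the true class fall into the same terminal region.
print(terminal_value([r[0], r[0]], n_classes=3))
```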

Be aware that the column type affects how the histogram is created and the column type depends on whether rows are excluded or assigned a weight of 0. For example:

val  weight
1    1
0.5  0
5    1
3.5  0

The above vec has a real-valued type if passed as a whole, but if the zero-weighted rows are sliced away first, the integer weight is used. The resulting histogram is either kept at full nbins resolution or potentially shrunk to the discrete integer range, which affects the split points.

References

Dietterich, Thomas G, and Eun Bae Kong. “Machine Learning Bias, Statistical Bias, and Statistical Variance of Decision Tree Algorithms.” ML-95 255 (1995).

Elith, Jane, John R. Leathwick, and Trevor Hastie. “A Working Guide to Boosted Regression Trees.” Journal of Animal Ecology 77.4 (2008): 802-813.

Friedman, Jerome H. “Greedy Function Approximation: A Gradient Boosting Machine.” Annals of Statistics (2001): 1189-1232.

Friedman, Jerome, Trevor Hastie, Saharon Rosset, Robert Tibshirani, and Ji Zhu. “Discussion of Boosting Papers.” The Annals of Statistics 32 (2004): 102-107.

Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. “Additive Logistic Regression: A Statistical View of Boosting (With Discussion and a Rejoinder by the Authors).” The Annals of Statistics 28.2 (2000): 337-407.

Hastie, Trevor, Robert Tibshirani, and Jerome H. Friedman. The Elements of Statistical Learning. Vol. 1. New York: Springer, 2001, p. 339.


Deep Learning

Introduction

H2O’s Deep Learning is based on a multi-layer feed-forward artificial neural network that is trained with stochastic gradient descent using back-propagation. The network can contain a large number of hidden layers consisting of neurons with tanh, rectifier and maxout activation functions. Advanced features such as adaptive learning rate, rate annealing, momentum training, dropout, L1 or L2 regularization, checkpointing and grid search enable high predictive accuracy. Each compute node trains a copy of the global model parameters on its local data with multi-threading (asynchronously), and contributes periodically to the global model via model averaging across the network.
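The model averaging step mentioned above can be pictured with a small sketch. It assumes an unweighted element-wise mean of each node's weight vector; the exact averaging scheme that H2O uses is not described here, and the function name is hypothetical.

```python
def average_models(node_weights):
    """Element-wise mean of the weight vectors reported by each node."""
    n = len(node_weights)
    return [sum(ws) / n for ws in zip(*node_weights)]

# Hypothetical weights after two nodes trained on their local data.
node_a = [0.2, -0.4, 1.0]
node_b = [0.4, 0.0, 0.0]

# Roughly [0.3, -0.2, 0.5], up to floating-point rounding.
global_weights = average_models([node_a, node_b])
```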

Defining a Deep Learning Model

H2O Deep Learning models have many input parameters, many of which are only accessible via the expert mode. For most cases, use the default values. Please read the following instructions before building extensive Deep Learning models. The application of grid search and successive continuation of winning models via checkpoint restart is highly recommended, as model performance can vary greatly.

Interpreting a Deep Learning Model

To view the results, click the View button. The output for the Deep Learning model includes the following information for both the training and testing sets:

FAQ

This is something to look out for. Say you have three columns: zip code (70k levels), height, and income. The resulting number of internally one-hot encoded features will be 70,002 and only 3 of them will be activated (non-zero). If the first hidden layer has 200 neurons, then the resulting weight matrix will be of size 70,002 x 200, which can take a long time to train and converge. In this case, we recommend either reducing the number of categorical factor levels upfront (e.g., using h2o.interaction() from R), or specifying max_categorical_features to use feature hashing to reduce the dimensionality.
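The arithmetic in the zip code example can be checked with a few lines of Python. The helper name is made up for this illustration; the numbers come from the example above.

```python
def one_hot_input_size(categorical_levels, numeric_columns):
    """Number of input features after one-hot encoding the categoricals."""
    return sum(categorical_levels) + numeric_columns

# zip code (70k levels) plus two numeric columns (height, income)
inputs = one_hot_input_size([70000], 2)

hidden_neurons = 200
first_layer_weights = inputs * hidden_neurons

# Per row, one zip code level plus the two numeric inputs are non-zero.
active_inputs = 1 + 2
```

With these values, inputs is 70,002 and the first weight matrix holds 14,000,400 entries, although only 3 inputs are active for any given row.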

Deep Learning Algorithm

For more information about how the Deep Learning algorithm works, refer to the Deep Learning booklet.

References

“Deep Learning.” Wikipedia: The free encyclopedia. Wikimedia Foundation, Inc. 1 May 2015. Web. 4 May 2015.

“Artificial Neural Network.” Wikipedia: The free encyclopedia. Wikimedia Foundation, Inc. 22 April 2015. Web. 4 May 2015.

Zeiler, Matthew D. “ADADELTA: An Adaptive Learning Rate Method.” Arxiv.org. N.p., 2012. Web. 4 May 2015.

Sutskever, Ilya et al. “On the importance of initialization and momentum in deep learning.” JMLR: W&CP vol. 28. (2013).

Hinton, G.E. et al. “Improving neural networks by preventing co-adaptation of feature detectors.” University of Toronto. (2012).

Wager, Stefan et al. “Dropout Training as Adaptive Regularization.” Advances in Neural Information Processing Systems. (2013).

Gedeon, TD. “Data mining of inputs: analysing magnitude and functional measures.” University of New South Wales. (1997).

Candel, Arno and Parmar, Viraj. “Deep Learning with H2O.” H2O.ai, Inc. (2015).

Deep Learning Training

Slideshare slide decks

Youtube channel

Candel, Arno. “The Definitive Performance Tuning Guide for H2O Deep Learning.” H2O.ai, Inc. (2015).

Niu, Feng, et al. “Hogwild!: A lock-free approach to parallelizing stochastic gradient descent.” Advances in Neural Information Processing Systems 24 (2011): 693-701. (algorithm implemented is on p.5)

Hawkins, Simon et al. “Outlier Detection Using Replicator Neural Networks.” CSIRO Mathematical and Information Sciences

YARN Best Practices

YARN (Yet Another Resource Negotiator) is a resource management framework. H2O can be launched as an application on YARN. If you want to run H2O on Hadoop, you are essentially running H2O on YARN. If you are not currently using YARN to manage your cluster resources, we strongly recommend it.

Using H2O with YARN

When you launch H2O on Hadoop using the hadoop jar command, YARN allocates the necessary resources to launch the requested number of nodes. H2O launches as a MapReduce (V2) task, where each mapper is an H2O node of the specified size.

hadoop jar h2odriver.jar -nodes 1 -mapperXmx 6g -output hdfsOutputDirName

Occasionally, YARN may reject a job request. This usually occurs because either there is not enough memory to launch the job or because of an incorrect configuration.

If YARN rejects the job request, try launching the job with less memory to see if that is the cause of the failure. Specify smaller values for -mapperXmx (we recommend a minimum of 2g) and -nodes (start with 1) to confirm that H2O can launch successfully.

To resolve configuration issues, adjust the maximum memory that YARN will allow when launching each mapper. If the cluster manager settings are configured for the default maximum memory size but the memory required for the request exceeds that amount, YARN will not launch and H2O will time out. If you are using the default configuration, change the configuration settings in your cluster manager to specify memory allocation when launching mapper tasks. To calculate the amount of memory required for a successful launch, use the following formula:

YARN container size (mapreduce.map.memory.mb) = -mapperXmx value + (-mapperXmx * -extramempercent [default is 10%])

The mapreduce.map.memory.mb value must be less than the YARN memory configuration values for the launch to succeed.
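The container size formula can be worked through in a short Python sketch. The function names are hypothetical, and rounding down to whole megabytes is an assumption, chosen because it reproduces the mapreduce.map.memory.mb value of 1126 that the driver prints for a 1g heap with 10% extra memory.

```python
def parse_size_mb(size):
    """Convert a -mapperXmx style value such as "6g" or "512m" to megabytes."""
    unit = size[-1].lower()
    number = int(size[:-1])
    if unit == "g":
        return number * 1024
    if unit == "m":
        return number
    raise ValueError("expected a size ending in 'g' or 'm': " + size)

def container_size_mb(mapper_xmx, extra_mem_percent=10):
    """-mapperXmx value plus the -extramempercent overhead, in megabytes."""
    xmx_mb = parse_size_mb(mapper_xmx)
    # Integer arithmetic; the fractional megabyte is dropped (assumption).
    return xmx_mb + xmx_mb * extra_mem_percent // 100

six_gb = container_size_mb("6g")   # 6144 MB heap + 614 MB overhead
one_gb = container_size_mb("1g")   # 1024 MB heap + 102 MB overhead
```

For -mapperXmx 6g with the default 10% overhead this gives 6758 MB, so the YARN memory settings need to allow containers of at least that size.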

Configuring YARN

For Cloudera, configure the settings in Cloudera Manager. Depending on how the cluster is configured, you may need to change the settings for more than one role group.

  1. Click Configuration and enter the following search term in quotes: yarn.nodemanager.resource.memory-mb.

  2. Enter the amount of memory (in GB) to allocate in the Value field. If more than one group is listed, change the values for all listed groups.

    Cloudera Configuration

  3. Click the Save Changes button in the upper-right corner.

  4. Enter the following search term in quotes: yarn.scheduler.maximum-allocation-mb
  5. Change the value, click the Save Changes button in the upper-right corner, and redeploy.

    Cloudera Configuration

For Hortonworks, configure the settings in Ambari.

  1. Select YARN, then click the Configs tab.
  2. Select the group.
  3. In the Node Manager section, enter the amount of memory (in MB) to allocate in the yarn.nodemanager.resource.memory-mb entry field.

    Ambari Configuration

  4. In the Scheduler section, enter the amount of memory (in MB) to allocate in the yarn.scheduler.maximum-allocation-mb entry field.

    Ambari Configuration

  5. Click the Save button at the bottom of the page and redeploy the cluster.

For MapR:

  1. Edit the yarn-site.xml file for the node running the ResourceManager.
  2. Change the values for the yarn.nodemanager.resource.memory-mb and yarn.scheduler.maximum-allocation-mb properties.
  3. Restart the ResourceManager and redeploy the cluster.

To verify the values were changed, check the values for the following properties:

 - <name>yarn.nodemanager.resource.memory-mb</name>
 - <name>yarn.scheduler.maximum-allocation-mb</name>

Limiting CPU Usage

To limit the number of CPUs used by H2O, use the -nthreads option and specify the maximum number of CPUs for a single container to use. The following example limits the number of CPUs to four:

hadoop jar h2odriver.jar -nthreads 4 -nodes 1 -mapperXmx 6g -output hdfsOutputDirName

Note: The default is 4*the number of CPUs. You must specify at least four CPUs; otherwise, the following error message displays: ERROR: nthreads invalid (must be >= 4)

Specifying Queues

If you do not specify a queue when launching H2O, H2O jobs are submitted to the default queue. Jobs submitted to the default queue have a lower priority than jobs submitted to a specific queue.

To specify a queue with Hadoop, enter -Dmapreduce.job.queuename=<queue name> (where <queue name> is the name of the queue) when launching Hadoop.

For example,

hadoop jar h2odriver.jar -Dmapreduce.job.queuename=default -nodes 1 -mapperXmx 6g -output hdfsOutputDirName

Specifying Output Directories

To prevent overwriting multiple users’ files, each job must have a unique output directory name. Change the -output hdfsOutputDir argument (where hdfsOutputDir is the name of the directory).

Alternatively, you can delete the directory (manually or by using a script) instead of creating a unique directory each time you launch H2O.
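One way to guarantee a unique output directory is to generate the name when the launch command is assembled. The sketch below builds the argument list for the hadoop jar command using only the options shown in this section (-nodes, -mapperXmx, -output, -nthreads, and -Dmapreduce.job.queuename). The helper name and the idea of appending a random suffix are assumptions made for this example.

```python
import uuid

def h2o_launch_command(nodes, mapper_xmx, output_prefix, nthreads=None, queue=None):
    """Build the hadoop jar argument list with a unique -output directory."""
    # A random suffix keeps concurrent or repeated launches from colliding.
    output_dir = "{}_{}".format(output_prefix, uuid.uuid4().hex)
    cmd = ["hadoop", "jar", "h2odriver.jar"]
    if queue is not None:
        cmd.append("-Dmapreduce.job.queuename=" + queue)
    if nthreads is not None:
        cmd += ["-nthreads", str(nthreads)]
    cmd += ["-nodes", str(nodes), "-mapperXmx", mapper_xmx, "-output", output_dir]
    return cmd

first = h2o_launch_command(1, "6g", "hdfsOutputDirName", nthreads=4, queue="default")
second = h2o_launch_command(1, "6g", "hdfsOutputDirName")
print(" ".join(first))
```

The resulting list can be passed to subprocess.run on a machine where the hadoop command and h2odriver.jar are available.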

Customizing YARN

Most of the configurable YARN variables are stored in yarn-site.xml. To prevent settings from being overridden, you can mark a config as “final.” If you change any values in yarn-site.xml, you must restart YARN for the changes to take effect.

Accessing Logs

To learn how to access logs in YARN, refer to Downloading Logs.

Downloading Logs

Accessing Logs

Depending on whether you are using Hadoop with H2O and whether the job is currently running, there are different ways of obtaining the logs for H2O.

Copy and email the logs to support@h2o.ai or submit them to h2ostream@googlegroups.com with a brief description of your Hadoop environment, including the Hadoop distribution and version.

Without Running Jobs

    jessica@mr-0x8:~/h2o-3.1.0.3008-cdh5.2$ hadoop jar h2odriver.jar -nodes 1 -mapperXmx 6g -output hdfsOutputDirName
Determining driver host interface for mapper->driver callback...
    [Possible callback IP address: 172.16.2.178]
    [Possible callback IP address: 127.0.0.1]
Using mapper->driver callback IP address and port: 172.16.2.178:52030
(You can override these with -driverif and -driverport.)
Memory Settings:
    mapreduce.map.java.opts:     -Xms1g -Xmx1g -Dlog4j.defaultInitOverride=true
    Extra memory percent:        10
    mapreduce.map.memory.mb:     1126
15/05/06 17:11:50 INFO client.RMProxy: Connecting to ResourceManager at mr-0x10.0xdata.loc/172.16.2.180:8032
15/05/06 17:11:52 INFO mapreduce.JobSubmitter: number of splits:1
15/05/06 17:11:52 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1430127035640_0075
15/05/06 17:11:52 INFO impl.YarnClientImpl: Submitted application application_1430127035640_0075
15/05/06 17:11:52 INFO mapreduce.Job: The url to track the job: http://mr-0x10.0xdata.loc:8088/proxy/application_1430127035640_0075/
Job name 'H2O_29570' submitted
JobTracker job ID is 'job_1430127035640_0075'
For YARN users, logs command is 'yarn logs -applicationId application_1430127035640_0075'
Waiting for H2O cluster to come up...

In the above example, the command is specified in the next to last line (For YARN users, logs command is...). The command is unique for each instance. In Terminal, enter yarn logs -applicationId application_<UniqueID> to view the logs (where <UniqueID> is the number specified in the next to last line of the output that displayed when you created the cluster).
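If the driver output is captured by a script, the logs command can be pulled out with a regular expression. This is a minimal sketch; the function name is hypothetical, and the pattern assumes the application ID has the application_<digits>_<digits> form shown in the example output.

```python
import re

def yarn_logs_command(driver_output):
    """Return the 'yarn logs' command found in the driver output, or None."""
    match = re.search(r"yarn logs -applicationId application_\d+_\d+", driver_output)
    return match.group(0) if match else None

sample = (
    "JobTracker job ID is 'job_1430127035640_0075'\n"
    "For YARN users, logs command is "
    "'yarn logs -applicationId application_1430127035640_0075'\n"
    "Waiting for H2O cluster to come up...\n"
)

command = yarn_logs_command(sample)
```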


Use YARN to obtain the stdout and stderr logs that are used for troubleshooting. To learn how to access YARN based on management software, version, and job status, see Accessing YARN.

  1. Click the Applications link to view all jobs, then click the History link for the job.

    YARN - History

  2. Click the logs link.

    YARN - History

  3. Copy the information that displays and send it in an email to support@h2o.ai.

    YARN - History


With Running Jobs

If you are using Hadoop and the job is still running:


05-06 17:12:15.610 172.16.2.179:54321    26336  main      INFO: ----- H2O started  -----
05-06 17:12:15.731 172.16.2.179:54321    26336  main      INFO: Build git branch: master
05-06 17:12:15.731 172.16.2.179:54321    26336  main      INFO: Build git hash: 41d039196088df081ad77610d3e2d6550868f11b
05-06 17:12:15.731 172.16.2.179:54321    26336  main      INFO: Build git describe: jenkins-master-1187
05-06 17:12:15.732 172.16.2.179:54321    26336  main      INFO: Build project version: 0.3.0.1187
05-06 17:12:15.732 172.16.2.179:54321    26336  main      INFO: Built by: 'jenkins'
05-06 17:12:15.732 172.16.2.179:54321    26336  main      INFO: Built on: '2015-05-05 23:31:12'
05-06 17:12:15.732 172.16.2.179:54321    26336  main      INFO: Java availableProcessors: 8
05-06 17:12:15.732 172.16.2.179:54321    26336  main      INFO: Java heap totalMemory: 982.0 MB
05-06 17:12:15.732 172.16.2.179:54321    26336  main      INFO: Java heap maxMemory: 982.0 MB
05-06 17:12:15.732 172.16.2.179:54321    26336  main      INFO: Java version: Java 1.7.0_80 (from Oracle Corporation)
05-06 17:12:15.733 172.16.2.179:54321    26336  main      INFO: OS   version: Linux 3.13.0-51-generic (amd64)
05-06 17:12:15.733 172.16.2.179:54321    26336  main      INFO: Machine physical memory: 31.30 GB
05-06 17:12:15.733 172.16.2.179:54321    26336  main      INFO: X-h2o-cluster-id: 1430957535344
05-06 17:12:15.733 172.16.2.179:54321    26336  main      INFO: Possible IP Address: virbr0 (virbr0), 192.168.122.1
05-06 17:12:15.733 172.16.2.179:54321    26336  main      INFO: Possible IP Address: br0 (br0), 172.16.2.179
05-06 17:12:15.733 172.16.2.179:54321    26336  main      INFO: Possible IP Address: lo (lo), 127.0.0.1
05-06 17:12:15.733 172.16.2.179:54321    26336  main      INFO: Multiple local IPs detected:
05-06 17:12:15.733 172.16.2.179:54321    26336  main      INFO:   /192.168.122.1  /172.16.2.179
05-06 17:12:15.733 172.16.2.179:54321    26336  main      INFO: Attempting to determine correct address...
05-06 17:12:15.733 172.16.2.179:54321    26336  main      INFO: Using /172.16.2.179
05-06 17:12:15.734 172.16.2.179:54321    26336  main      INFO: Internal communication uses port: 54322
05-06 17:12:15.734 172.16.2.179:54321    26336  main      INFO: Listening for HTTP and REST traffic on  http://172.16.2.179:54321/
05-06 17:12:15.744 172.16.2.179:54321    26336  main      INFO: H2O cloud name: 'H2O_29570' on /172.16.2.179:54321, discovery address /237.61.246.13:60733
05-06 17:12:15.744 172.16.2.179:54321    26336  main      INFO: If you have trouble connecting, try SSH tunneling from your local machine (e.g., via port 55555):
05-06 17:12:15.744 172.16.2.179:54321    26336  main      INFO:   1. Open a terminal and run 'ssh -L 55555:localhost:54321 yarn@172.16.2.179'
05-06 17:12:15.744 172.16.2.179:54321    26336  main      INFO:   2. Point your browser to http://localhost:55555
05-06 17:12:15.979 172.16.2.179:54321    26336  main      INFO: Log dir: '/home2/yarn/nm/usercache/jessica/appcache/application_1430127035640_0075/h2ologs'



Accessing YARN

Methods for accessing YARN vary depending on the default management software and version, as well as job status.


Cloudera 5 & 5.2

  1. In Cloudera Manager, click the YARN link in the cluster section.

    Cloudera Manager

  2. In the Quick Links section, select ResourceManager Web UI if the job is running or select HistoryServer Web UI if the job is not running.

    Cloudera Manager


Ambari

  1. From the Ambari Dashboard, select YARN.

    Ambari

  2. From the Quick Links drop-down menu, select ResourceManager UI.

    Ambari


For Non-Hadoop Users

Without Current Jobs

If you are not using Hadoop and the job is not running:


With Current Jobs

If you are not using Hadoop and the job is still running:

05-06 17:12:15.610 172.16.2.179:54321    26336  main      INFO: ----- H2O started  -----
05-06 17:12:15.731 172.16.2.179:54321    26336  main      INFO: Build git branch: master
05-06 17:12:15.731 172.16.2.179:54321    26336  main      INFO: Build git hash: 41d039196088df081ad77610d3e2d6550868f11b
05-06 17:12:15.731 172.16.2.179:54321    26336  main      INFO: Build git describe: jenkins-master-1187
05-06 17:12:15.732 172.16.2.179:54321    26336  main      INFO: Build project version: 0.3.0.1187
05-06 17:12:15.732 172.16.2.179:54321    26336  main      INFO: Built by: 'jenkins'
05-06 17:12:15.732 172.16.2.179:54321    26336  main      INFO: Built on: '2015-05-05 23:31:12'
05-06 17:12:15.732 172.16.2.179:54321    26336  main      INFO: Java availableProcessors: 8
05-06 17:12:15.732 172.16.2.179:54321    26336  main      INFO: Java heap totalMemory: 982.0 MB
05-06 17:12:15.732 172.16.2.179:54321    26336  main      INFO: Java heap maxMemory: 982.0 MB
05-06 17:12:15.732 172.16.2.179:54321    26336  main      INFO: Java version: Java 1.7.0_80 (from Oracle Corporation)
05-06 17:12:15.733 172.16.2.179:54321    26336  main      INFO: OS   version: Linux 3.13.0-51-generic (amd64)
05-06 17:12:15.733 172.16.2.179:54321    26336  main      INFO: Machine physical memory: 31.30 GB
05-06 17:12:15.733 172.16.2.179:54321    26336  main      INFO: X-h2o-cluster-id: 1430957535344
05-06 17:12:15.733 172.16.2.179:54321    26336  main      INFO: Possible IP Address: virbr0 (virbr0), 192.168.122.1
05-06 17:12:15.733 172.16.2.179:54321    26336  main      INFO: Possible IP Address: br0 (br0), 172.16.2.179
05-06 17:12:15.733 172.16.2.179:54321    26336  main      INFO: Possible IP Address: lo (lo), 127.0.0.1
05-06 17:12:15.733 172.16.2.179:54321    26336  main      INFO: Multiple local IPs detected:
05-06 17:12:15.733 172.16.2.179:54321    26336  main      INFO:   /192.168.122.1  /172.16.2.179
05-06 17:12:15.733 172.16.2.179:54321    26336  main      INFO: Attempting to determine correct address...
05-06 17:12:15.733 172.16.2.179:54321    26336  main      INFO: Using /172.16.2.179
05-06 17:12:15.734 172.16.2.179:54321    26336  main      INFO: Internal communication uses port: 54322
05-06 17:12:15.734 172.16.2.179:54321    26336  main      INFO: Listening for HTTP and REST traffic on  http://172.16.2.179:54321/
05-06 17:12:15.744 172.16.2.179:54321    26336  main      INFO: H2O cloud name: 'H2O_29570' on /172.16.2.179:54321, discovery address /237.61.246.13:60733
05-06 17:12:15.744 172.16.2.179:54321    26336  main      INFO: If you have trouble connecting, try SSH tunneling from your local machine (e.g., via port 55555):
05-06 17:12:15.744 172.16.2.179:54321    26336  main      INFO:   1. Open a terminal and run 'ssh -L 55555:localhost:54321 yarn@172.16.2.179'
05-06 17:12:15.744 172.16.2.179:54321    26336  main      INFO:   2. Point your browser to http://localhost:55555
05-06 17:12:15.979 172.16.2.179:54321    26336  main      INFO: Log dir: '/home2/yarn/nm/usercache/jessica/appcache/application_1430127035640_0075/h2ologs'


        ------------------------------------------------------------

        Time:     2015-01-06 15:46:11.083

        GET       http://172.16.2.20:54321/3/Cloud.json
        postBody: 

        curlError:         FALSE
        curlErrorMessage:  
        httpStatusCode:    200
        httpStatusMessage: OK
        millis:            3

        {"__meta":{"schema_version":    1,"schema_name":"CloudV1","schema_type":"Iced"},"version":"0.1.17.1009","cloud_name":...[truncated]}
        -------------------------------------------------------------


Migrating to H2O 3.0

We’re excited about the upcoming release of the latest and greatest version of H2O, and we hope you are too! H2O 3.0 has lots of improvements, including:

and much more! Overall, H2O has been retooled for better accuracy and performance and to provide additional functionality. If you’re a current user of H2O, we strongly encourage you to upgrade to the latest version to take advantage of the latest features and capabilities.

Please be aware that H2O 3.0 will supersede all previous versions of H2O as the primary version as of May 15th, 2015. Support for previous versions will be offered for a limited time, but there will no longer be any significant updates to the previous version of H2O.

The following information and links will inform you about what’s new and different and help you prepare to upgrade to H2O 3.0.

Overall, H2O 3.0 is more stable, elegant, and simplified, with additional capabilities not available in previous versions of H2O.


Algorithm Changes

Most of the algorithms available in previous versions of H2O have been improved in terms of speed and accuracy. Currently available model types include Gradient Boosting Machine, Deep Learning, Generalized Linear Model, K-means, Distributed Random Forest, and Naïve Bayes.

There are a few algorithms that are still being refined to provide these same benefits and will be available in a future version of H2O.

Currently, the following algorithms and associated capabilities are still in development:

Check back for updates, as these algorithms will be re-introduced in an improved form in a future version of H2O.

Note: The SpeeDRF model has been removed, as it was originally intended as an optimization for small data only. This optimization will be added to the Distributed Random Forest model automatically for small data in a future version of H2O.


Parsing Changes

In H2O Classic, the parser reads all the data and tries to guess the column type. In H2O 3.0, the parser reads a subset and makes a type guess for each column. In Flow, you can view the preliminary parse results in the Data Preview area. To change the column type, select an option from the drop-down menu at the top of the column. H2O 3.0 can also automatically identify mixed-type columns; in H2O Classic, if a column mixes integers or real numbers with strings, the output is blank.
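The idea of guessing column types from a subset of rows can be sketched as follows. This is a simplified illustration with a hypothetical helper, not H2O's parser. It also shows why a guess made from a sample can differ from the full data, which is why the column type can be changed in the preview.

```python
def guess_column_type(sample_values):
    """Guess "int", "real", or "string" from a sample of raw text values."""
    def parses(value, caster):
        try:
            caster(value)
            return True
        except ValueError:
            return False

    if all(parses(v, int) for v in sample_values):
        return "int"
    if all(parses(v, float) for v in sample_values):
        return "real"
    return "string"

rows = [
    ["1", "2.5", "a"],
    ["2", "3", "b"],
    ["x", "4.0", "7"],   # not part of the sample below
]

# Guess from the first two rows only, one guess per column.
sample = rows[:2]
guesses = [guess_column_type(column) for column in zip(*sample)]
```

Here the first column is guessed as "int" from the sample even though a later row contains "x".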


Web UI Changes

Our web UI has been completely overhauled with a much more intuitive interface that is similar to IPython Notebook. Each point-and-click action is translated immediately into an individual workflow script that can be saved for later interactive and offline use. As a result, you can now revise and rerun your workflows easily, and can even add comments and rich media.

For more information, refer to our Getting Started with Flow guide, which comprehensively documents how to use Flow. You can also view this brief video, which provides an overview of Flow in action.


API Users

H2O’s new Python API allows Pythonistas to use H2O in their favorite environment. Using the Python command line or an integrated development environment like IPython Notebook, H2O users can control clusters and manage massive datasets quickly.

H2O’s REST API is the basis for the web UI (Flow), as well as the R and Python APIs, and is versioned for stability. It is also easier to understand and use, with full metadata available dynamically from the server, allowing for easier integration by developers.


Java Users

Generated Java REST classes ease REST API use by external programs running in a Java Virtual Machine (JVM).

As in previous versions of H2O, users can export trained models as Java objects for easy integration into JVM applications. H2O is currently the only ML tool that provides this capability, making it the data science tool of choice for enterprise developers.


R Users

If you use H2O primarily in R, be aware that as a result of the improvements to the R package for H2O, scripts created using previous versions (Nunes 2.8.6.2 or prior) will require minor revisions to work with H2O 3.0.

To assist our R users in upgrading to H2O 3.0, a “shim” tool has been developed. The shim reviews your script, identifies deprecated or revised parameters and arguments, and suggests replacements.

There is also an R Porting Guide that provides a side-by-side comparison of the algorithms in the previous version of H2O with H2O 3.0. It outlines the new, revised, and deprecated parameters for each algorithm, as well as the changes to the output.


Porting R Scripts

This document outlines how to port R scripts written in previous versions of H2O (Nunes 2.8.6.2 or prior, also known as “H2O Classic”) for compatibility with the new H2O 3.0 API. When upgrading from H2O to H2O 3.0, most functions are the same. However, there are some differences that will need to be resolved when porting any scripts that were originally created using H2O to H2O 3.0.

The original R script for H2O is listed first, followed by the updated script for H2O 3.0.

Some of the parameters have been renamed for consistency. For each algorithm, a table that describes the differences is provided.

For additional assistance within R, enter a question mark before the command (for example, ?h2o.glm).

There is also a “shim” available that will review R scripts created with previous versions of H2O, identify deprecated or renamed parameters, and suggest replacements. For more information, refer to the repo here.

Changes from H2O 2.8 to H2O 3.0

h2o.exec

The h2o.exec command is no longer supported. Any workflows using h2o.exec must be revised to remove this command. If the H2O 3.0 workflow contains any parameters or commands from H2O Classic, errors will result and the workflow will fail.

The purpose of h2o.exec was to wrap expressions so that they could be evaluated in a single Exec2 call. For example, h2o.exec(fr[,1] + 2/fr[,3]) and fr[,1] + 2/fr[,3] produced the same results in H2O. However, the first example makes a single REST call and uses a single temp object, while the second makes several REST calls and uses several temp objects.

Due to the improved architecture in H2O 3.0, the need to use h2o.exec has been eliminated, as the expression can be processed by R as an “unwrapped” typical R expression.

Currently, the only known exception is when factor is used in conjunction with h2o.exec. For example, h2o.exec(fr$myIntCol <- factor(fr$myIntCol)) would become fr$myIntCol <- as.factor(fr$myIntCol)

Note also that an array is not inside a string:

An int array is [1, 2, 3], not "[1, 2, 3]".

A String array is ["f00", "b4r"], not "[\"f00\", \"b4r\"]".

Only string values are enclosed in double quotation marks (").
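Assuming the values are being encoded as JSON, the difference between an array and an array wrapped in a string can be seen with Python's standard json module. The variable names are illustrative only.

```python
import json

int_array = [1, 2, 3]
string_array = ["f00", "b4r"]

# Correct: arrays are encoded as arrays; only the string elements get quotes.
encoded_ints = json.dumps(int_array)
encoded_strings = json.dumps(string_array)

# Incorrect: encoding an already-encoded array wraps it in a string
# and escapes the inner quotation marks.
double_encoded = json.dumps(encoded_strings)

print(encoded_ints)
print(encoded_strings)
print(double_encoded)
```

The first two lines print the array forms shown above, while the third prints the quoted, escaped form that should be avoided.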

h2o.performance

To access any exclusively binomial output, use h2o.performance, optionally with the corresponding accessor. The accessor can only use the model metrics object created by h2o.performance. Each accessor is named for its corresponding field (for example, h2o.AUC, h2o.gini, h2o.F1). h2o.performance supports all current algorithms except for K-Means.

If you specify a data frame as a second parameter, H2O will use the specified data frame for scoring. If you do not specify a second parameter, the training metrics for the model metrics object are used.

xval and validation slots

The xval slot has been removed, as nfolds is not currently supported.

The validation slot has been merged with the model slot.

Principal Components Regression (PCR)

Principal Components Regression (PCR) has also been deprecated. To obtain PCR values, create a Principal Components Analysis (PCA) model, then create a GLM model from the scored data from the PCA model.

Saving and Loading Models

Saving and loading a model from R is supported in version 3.0.0.18 and later. H2O 3.0 uses the same binary serialization method as previous versions of H2O, but saves the model and its dependencies into a directory, with each object as a separate file. The save_CV option available in previous versions of H2O has been deprecated, as h2o.saveAll and h2o.loadAll are not currently supported. The following commands are now supported:

Table of Contents

GBM

N-fold cross-validation and grid search will be supported in a future version of H2O 3.0.

Renamed GBM Parameters

The following parameters have been renamed, but retain the same functions:

H2O Classic Parameter Name | H2O 3.0 Parameter Name
data | training_frame
key | model_id
n.trees | ntrees
interaction.depth | max_depth
n.minobsinnode | min_rows
shrinkage | learn_rate
n.bins | nbins
validation | validation_frame
balance.classes | balance_classes
max.after.balance.size | max_after_balance_size
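When many scripts need updating, the renames listed above can be expressed as a lookup table. The following Python sketch is only an illustration of that mapping; it is not the shim tool mentioned earlier, and the function name is hypothetical.

```python
# H2O Classic GBM parameter name -> H2O 3.0 parameter name (from the list above)
GBM_RENAMES = {
    "data": "training_frame",
    "key": "model_id",
    "n.trees": "ntrees",
    "interaction.depth": "max_depth",
    "n.minobsinnode": "min_rows",
    "shrinkage": "learn_rate",
    "n.bins": "nbins",
    "validation": "validation_frame",
    "balance.classes": "balance_classes",
    "max.after.balance.size": "max_after_balance_size",
}

def rename_gbm_args(classic_args):
    """Return a copy of the arguments with Classic names replaced by 3.0 names."""
    return {GBM_RENAMES.get(name, name): value for name, value in classic_args.items()}

classic = {"n.trees": 10, "interaction.depth": 5, "shrinkage": 0.1, "x": [1, 2]}
ported = rename_gbm_args(classic)
```

Names that are unchanged, such as x and y, pass through as-is. Removed parameters are not handled by this sketch and would still need to be deleted by hand.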

Deprecated GBM Parameters

The following parameters have been removed:

New GBM Parameters

The following parameters have been added:

GBM Algorithm Comparison

H2O Classic | H2O 3.0
h2o.gbm <- function( | h2o.gbm <- function(
x, | x,
y, | y,
data, | training_frame,
key = "", | model_id,
distribution = 'multinomial', | distribution = c("bernoulli", "multinomial", "gaussian"),
n.trees = 10, | ntrees = 50,
interaction.depth = 5, | max_depth = 5,
n.minobsinnode = 10, | min_rows = 10,
shrinkage = 0.1, | learn_rate = 0.1,
n.bins = 20, | nbins = 20,
validation, | validation_frame = NULL,
balance.classes = FALSE, | balance_classes = FALSE,
max.after.balance.size = 5, | max_after_balance_size = 1,
 | seed,
 | build_tree_one_node = FALSE,
 | score_each_iteration)
group_split = TRUE, |
importance = FALSE, |
nfolds = 0, |
holdout.fraction = 0, |
class.sampling.factors = NULL, |
grid.parallelism = 1) |

Output

The following table provides the component name in H2O Classic, the corresponding component name in H2O 3.0 (if supported), and the model type (binomial, multinomial, or all). Many components are now included in h2o.performance; for more information, refer to h2o.performance.

H2O Classic | H2O 3.0 | Model Type
@model$priorDistribution | | all
@model$params | @allparameters | all
@model$err | @model$scoring_history | all
@model$classification | | all
@model$varimp | @model$variable_importances | all
@model$confusion | @model$training_metrics$cm$table | binomial and multinomial
@model$auc | @model$training_metrics$AUC | binomial
@model$gini | @model$training_metrics$Gini | binomial
@model$best_cutoff | | binomial
@model$F1 | @model$training_metrics$thresholds_and_metric_scores$f1 | binomial
@model$F2 | @model$training_metrics$thresholds_and_metric_scores$f2 | binomial
@model$accuracy | @model$training_metrics$thresholds_and_metric_scores$accuracy | binomial
@model$error | | binomial
@model$precision | @model$training_metrics$thresholds_and_metric_scores$precision | binomial
@model$recall | @model$training_metrics$thresholds_and_metric_scores$recall | binomial
@model$mcc | @model$training_metrics$thresholds_and_metric_scores$absolute_MCC | binomial
@model$max_per_class_err | currently replaced by @model$training_metrics$thresholds_and_metric_scores$min_per_class_correct | binomial

GLM

N-fold cross-validation and grid search will be supported in a future version of H2O 3.0.

Renamed GLM Parameters

The following parameters have been renamed, but retain the same functions:

H2O Classic Parameter Name | H2O 3.0 Parameter Name
data | training_frame
key | model_id
nlambda | nlambdas
lambda.min.ratio | lambda_min_ratio
iter.max | max_iterations
epsilon | beta_epsilon

Deprecated GLM Parameters

The following parameters have been removed:

New GLM Parameters

The following parameters have been added:

GLM Algorithm Comparison

H2O Classic | H2O 3.0
h2o.glm <- function( | h2o.startGLMJob <- function(
x, | x,
y, | y,
data, | training_frame,
key = "", | model_id,
 | validation_frame,
iter.max = 100, | max_iterations = 50,
epsilon = 1e-4, | beta_epsilon = 0,
strong_rules = TRUE, |
return_all_lambda = FALSE, |
intercept = TRUE, | intercept = TRUE,
non_negative = FALSE, |
 | solver = c("IRLSM", "L_BFGS"),
standardize = TRUE, | standardize = TRUE,
family, | family = c("gaussian", "binomial", "poisson", "gamma", "tweedie"),
link, | link = c("family_default", "identity", "logit", "log", "inverse", "tweedie"),
tweedie.p = ifelse(family == "tweedie",1.5, NA_real_), | tweedie_variance_power = NaN,
 | tweedie_link_power = NaN,
alpha = 0.5, | alpha = 0.5,
prior = NULL, | prior = 0.0,
lambda = 1e-5, | lambda = 1e-05,
lambda_search = FALSE, | lambda_search = FALSE,
nlambda = -1, | nlambdas = -1,
lambda.min.ratio = -1, | lambda_min_ratio = 1.0,
use_all_factor_levels = FALSE, | use_all_factor_levels = FALSE,
nfolds = 0, | nfolds = 0,
beta_constraints = NULL, | beta_constraint = NULL)
higher_accuracy = FALSE, |
variable_importances = FALSE, |
disable_line_search = FALSE, |
offset = NULL, |
max_predictors = -1) |

Output

The following table provides the component name in H2O Classic, the corresponding component name in H2O 3.0 (if supported), and the model type (binomial, multinomial, or all). Many components are now included in h2o.performance; for more information, refer to h2o.performance.

H2O Classic H2O 3.0 Model Type
@model$params @allparameters all
@model$coefficients @model$coefficients all
@model$normalized_coefficients @model$coefficients_table$norm_coefficients all
@model$rank @model$rank all
@model$iter @model$iter all
@model$lambda   all
@model$deviance @model$residual_deviance all
@model$null.deviance @model$null_deviance all
@model$df.residual @model$residual_degrees_of_freedom all
@model$df.null @model$null_degrees_of_freedom all
@model$aic @model$AIC all
@model$train.err   binomial
@model$prior   binomial
@model$thresholds @model$threshold binomial
@model$best_threshold   binomial
@model$auc @model$AUC binomial
@model$confusion   binomial

K-Means

Renamed K-Means Parameters

The following parameters have been renamed, but retain the same functions:

H2O Classic Parameter Name H2O 3.0 Parameter Name
data training_frame
key model_id
centers k
cols x
iter.max max_iterations
normalize standardize

Note: In H2O Classic, the normalize parameter was disabled by default. The standardize parameter is enabled by default in H2O 3.0 to provide more accurate results for datasets containing columns with large values.
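To illustrate why scaling matters for distance-based clustering, here is a minimal sketch in plain Python. It assumes that standardizing means rescaling each column to zero mean and unit variance; the helper names and the sample columns (income, age) are made up for this illustration and are not part of the H2O API.

```python
import math
import statistics

def standardize(column):
    # Assumed behavior: rescale to zero mean and unit (sample) variance
    mean = statistics.mean(column)
    sd = statistics.stdev(column)
    return [(v - mean) / sd for v in column]

def distance(a, b):
    # Euclidean distance between two rows
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical data: one column with large values, one with small values
income = [30000.0, 31000.0, 90000.0]
age = [25.0, 60.0, 26.0]

raw = list(zip(income, age))
scaled = list(zip(standardize(income), standardize(age)))

# Unscaled: the income column dominates, so the age gap between rows 0 and 1 is invisible
print(distance(raw[0], raw[1]), distance(raw[0], raw[2]))
# Scaled: both columns contribute comparably to the distance
print(distance(scaled[0], scaled[1]), distance(scaled[0], scaled[2]))
```

Without scaling, row 0 appears roughly sixty times closer to row 1 than to row 2, even though rows 0 and 1 differ greatly in age. After scaling, the two distances are nearly equal.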

New K-Means Parameters

The following parameters have been added:

K-Means Algorithm Comparison

H2O Classic H2O 3.0
h2o.kmeans <- function( h2o.kmeans <- function(
data, training_frame,
cols = '', x,
centers, k,
key = "", model_id,
iter.max = 10, max_iterations = 1000,
normalize = FALSE, standardize = TRUE,
init = "none", init = c("Furthest","Random", "PlusPlus"),
seed = 0, seed)

Output

The following table provides the component name in H2O Classic and the corresponding component name in H2O 3.0 (if supported).

H2O Classic H2O 3.0
@model$params @allparameters
@model$centers @model$centers
@model$tot.withinss @model$tot_withinss
@model$size @model$size
@model$iter @model$iterations
  @model$_scoring_history
  @model$_model_summary

Deep Learning

N-fold cross-validation and grid search will be supported in a future version of H2O 3.0.

Note: If the results in the confusion matrix look incorrect, verify that score_training_samples is set to 0 so that all rows are scored. By default, only the first 10,000 rows are included.

Renamed Deep Learning Parameters

The following parameters have been renamed, but retain the same functions:

H2O Classic Parameter Name H2O 3.0 Parameter Name
data training_frame
key model_id
validation validation_frame
class.sampling.factors class_sampling_factors
nfolds n_folds
override_with_best_model overwrite_with_best_model
dlmodel@model$valid_class_error @model$validation_metrics@$MSE

Deprecated DL Parameters

The following parameters have been removed:

New DL Parameters

The following parameters have been added:

The following options for the loss parameter have been added:

DL Algorithm Comparison

H2O Classic H2O 3.0
h2o.deeplearning <- function(x, h2o.deeplearning <- function(x,
y, y,
data, training_frame,
key = "", model_id = "",
override_with_best_model, overwrite_with_best_model = true,
classification = TRUE,
nfolds = 0, n_folds = 0
validation, validation_frame,
holdout_fraction = 0,
checkpoint = " " checkpoint,
autoencoder, autoencoder = false,
use_all_factor_levels, use_all_factor_levels = true
activation, _activation = c("Rectifier", "Tanh", "TanhWithDropout", "RectifierWithDropout", "Maxout", "MaxoutWithDropout"),
hidden, hidden= c(200, 200),
epochs, epochs = 10.0,
train_samples_per_iteration, train_samples_per_iteration = -2,
seed, _seed,
adaptive_rate, adaptive_rate = true,
rho, rho = 0.99,
epsilon, epsilon = 1e-8,
rate, rate = .005,
rate_annealing, rate_annealing = 1e-6,
rate_decay, rate_decay = 1.0,
momentum_start, momentum_start = 0,
momentum_ramp, momentum_ramp = 1e6,
momentum_stable, momentum_stable = 0,
nesterov_accelerated_gradient, nesterov_accelerated_gradient = true,
input_dropout_ratio, input_dropout_ratio = 0.0,
hidden_dropout_ratios, hidden_dropout_ratios,
l1, l1 = 0.0,
l2, l2 = 0.0,
max_w2, max_w2 = Inf,
initial_weight_distribution, initial_weight_distribution = c("UniformAdaptive","Uniform", "Normal"),
initial_weight_scale, initial_weight_scale = 1.0,
loss, loss = c("Automatic", "CrossEntropy", "MeanSquare", "Absolute", "Huber"),
score_interval, score_interval = 5,
score_training_samples, score_training_samples = 10000l,
score_validation_samples, score_validation_samples = 0l,
score_duty_cycle, score_duty_cycle = 0.1,
classification_stop, classification_stop = 0
regression_stop, regression_stop = 1e-6,
quiet_mode, quiet_mode = false,
max_confusion_matrix_size, max_confusion_matrix_size,
max_hit_ratio_k, max_hit_ratio_k,
balance_classes, balance_classes = false,
class_sampling_factors, class_sampling_factors,
max_after_balance_size, max_after_balance_size,
score_validation_sampling, score_validation_sampling,
diagnostics, diagnostics = true,
variable_importances, variable_importances = false,
fast_mode, fast_mode = true,
ignore_const_cols, ignore_const_cols = true,
force_load_balance, force_load_balance = true,
replicate_training_data, replicate_training_data = true,
single_node_mode, single_node_mode = false,
shuffle_training_data, shuffle_training_data = false,
sparse, sparse = false,
col_major, col_major = false,
max_categorical_features, max_categorical_features = Integer.MAX_VALUE,
reproducible) reproducible=FALSE,
average_activation average_activation = 0,
  sparsity_beta = 0
  export_weights_and_biases=FALSE)

Output

The following table provides the component name in H2O Classic, the corresponding component name in H2O 3.0 (if supported), and the model type (binomial, multinomial, or all). Many components are now included in h2o.performance; for more information, refer to h2o.performance.

H2O Classic H2O 3.0 Model Type
@model$priorDistribution   all
@model$params @allparameters all
@model$train_class_error @model$training_metrics@$MSE all
@model$valid_class_error @model$validation_metrics@$MSE all
@model$varimp @model$_variable_importances all
@model$confusion @model$training_metrics$cm$table binomial and multinomial
@model$train_auc @model$train_AUC binomial
  @model$_validation_metrics all
  @model$_model_summary all
  @model$_scoring_history all

Distributed Random Forest

Changes to DRF in H2O 3.0

Distributed Random Forest (DRF) was represented as h2o.randomForest(type="BigData", ...) in H2O Classic. In H2O Classic, SpeeDRF (type="fast") was not as accurate, especially for complex data with categoricals, and did not address regression problems. DRF (type="BigData") was at least as accurate as SpeeDRF (type="fast") and was the only algorithm that scaled to big data (data too large to fit on a single node). In H2O 3.0, our plan is to improve the performance of DRF for data that fits on a single node (ideally, in all cases), which will make SpeeDRF obsolete. Ultimately, the goal is to provide a single algorithm that offers the “best of both worlds” for all datasets and use cases. Please note that H2O does not currently support specifying the number of trees when using h2o.predict for a DRF model.

Note: H2O 3.0 only supports DRF. SpeeDRF is no longer supported. The functionality of DRF in H2O 3.0 is similar to DRF functionality in H2O Classic.

Renamed DRF Parameters

The following parameters have been renamed, but retain the same functions:

H2O Classic Parameter Name H2O 3.0 Parameter Name
data training_frame
key model_id
validation validation_frame
sample.rate sample_rate
ntree ntrees
depth max_depth
balance.classes balance_classes
score.each.iteration score_each_iteration
class.sampling.factors class_sampling_factors
nodesize min_rows

Deprecated DRF Parameters

The following parameters have been removed:

New DRF Parameters

The following parameter has been added:

DRF Algorithm Comparison

H2O Classic H2O 3.0
h2o.randomForest <- function( h2o.randomForest <- function(
x, x,
y, y,
data, training_frame,
key="", model_id,
validation, validation_frame,
mtries = -1, mtries = -1,
sample.rate=2/3, sample_rate = 0.632,
  build_tree_one_node = FALSE,
ntree=50, ntrees=50,
depth=20, max_depth = 20,
  min_rows = 1,
nbins=20, nbins = 20,
balance.classes = FALSE, balance_classes = FALSE,
score.each.iteration = FALSE, score_each_iteration = FALSE,
seed = -1, seed
nodesize = 1,
classification=TRUE,
importance=FALSE,
nfolds=0,
holdout.fraction = 0,
max.after.balance.size = 5, max_after_balance_size)
class.sampling.factors = NULL,  
doGrpSplit = TRUE,
verbose = FALSE,
oobee = TRUE,
stat.type = "ENTROPY",
type = "fast")

Output

The following table provides the component name in H2O Classic, the corresponding component name in H2O 3.0 (if supported), and the model type (binomial, multinomial, or all). Many components are now included in h2o.performance; for more information, refer to h2o.performance.

H2O Classic H2O 3.0 Model Type
@model$priorDistribution   all
@model$params @allparameters all
@model$mse @model$scoring_history all
@model$forest @model$model_summary all
@model$classification   all
@model$varimp @model$variable_importances all
@model$confusion @model$training_metrics$cm$table binomial and multinomial
@model$auc @model$training_metrics$AUC binomial
@model$gini @model$training_metrics$Gini binomial
@model$best_cutoff   binomial
@model$F1 @model$training_metrics$thresholds_and_metric_scores$f1 binomial
@model$F2 @model$training_metrics$thresholds_and_metric_scores$f2 binomial
@model$accuracy @model$training_metrics$thresholds_and_metric_scores$accuracy binomial
@model$Error @model$Error binomial
@model$precision @model$training_metrics$thresholds_and_metric_scores$precision binomial
@model$recall @model$training_metrics$thresholds_and_metric_scores$recall binomial
@model$mcc @model$training_metrics$thresholds_and_metric_scores$absolute_MCC binomial
@model$max_per_class_err currently replaced by @model$training_metrics$thresholds_and_metric_scores$min_per_class_correct binomial

GitHub Users

All users who pull directly from the H2O Classic repo on GitHub should be aware that this repo will be renamed. To retain access to the original H2O (2.8.6.2 and prior) repository:

The simple way

This is the easiest way to change your local repo and is recommended for most users.

  1. Enter git remote -v to view a list of your repositories.
  2. Copy the address of your H2O Classic repo (refer to the text in brackets below - your address will vary depending on your connection method):

    H2O_User-MBP:h2o H2O_User$ git remote -v
    origin    https://{H2O_User@github.com}/h2oai/h2o.git (fetch)
    origin    https://{H2O_User@github.com}/h2oai/h2o.git (push)
    
  3. Enter git remote set-url origin {H2O_User@github.com}:h2oai/h2o-2.git, where {H2O_User@github.com} represents the address copied in the previous step.

The more complicated way

This method involves editing the Git config file and should only be attempted by users who are confident in their knowledge of Git.

  1. Enter vim .git/config.
  2. Look for the [remote "origin"] section:

    [remote "origin"]
         url = https://H2O_User@github.com/h2oai/h2o.git
         fetch = +refs/heads/*:refs/remotes/origin/*
    
  3. In the url = line, change h2o.git to h2o-2.git.
  4. Save the changes.
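The URL change made in either method amounts to replacing the final path component of the remote. As a rough sketch, the helper below (renamed_remote, a name invented for this example) performs that substitution on a URL string; it does not touch your Git configuration.

```python
def renamed_remote(url, new_repo="h2o-2"):
    # Split off the last path component, which should be the old repo name
    prefix, sep, tail = url.rpartition("/")
    if tail != "h2o.git":
        raise ValueError("not an h2o.git remote: %s" % url)
    return prefix + sep + new_repo + ".git"

# Works for both HTTPS-style and SSH-style remote addresses
print(renamed_remote("https://H2O_User@github.com/h2oai/h2o.git"))
print(renamed_remote("git@github.com:h2oai/h2o.git"))
```

The returned value is what you would pass to git remote set-url origin or write into the url = line.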

The latest version of H2O is stored in the h2o-3 repository. All previous links to this repo will still work, but if you would like to manually update your Git configuration, follow the instructions above, replacing h2o-2 with h2o-3.

FAQ

General Troubleshooting Tips


The following error message displayed when I tried to launch H2O - what should I do?

Exception in thread "main" java.lang.UnsupportedClassVersionError: water/H2OApp : Unsupported major.minor version 51.0
        at java.lang.ClassLoader.defineClass1(Native Method)
        at java.lang.ClassLoader.defineClassCond(Unknown Source)
        at java.lang.ClassLoader.defineClass(Unknown Source)
        at java.security.SecureClassLoader.defineClass(Unknown Source)
        at java.net.URLClassLoader.defineClass(Unknown Source)
        at java.net.URLClassLoader.access$000(Unknown Source)
        at java.net.URLClassLoader$1.run(Unknown Source)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
Could not find the main class: water.H2OApp. Program will exit.

This error output indicates that your Java version is not supported: class file version 51.0 corresponds to Java 7, so an older JVM cannot load H2O. Upgrade to Java 7 or later and H2O should launch successfully.
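The number in an "Unsupported major.minor version" message identifies the class file format, and each major version corresponds to the Java release that introduced it. The sketch below shows the mapping; the function name java_release_for_major is made up for illustration.

```python
def java_release_for_major(major):
    # Map a class file major version to the Java release that introduced it
    if major < 45:
        raise ValueError("class file major versions start at 45")
    if major <= 48:
        # 45 -> 1.1, 46 -> 1.2, 47 -> 1.3, 48 -> 1.4
        return "1.%d" % (major - 44)
    # 49 -> 5, 50 -> 6, 51 -> 7, 52 -> 8, and so on
    return str(major - 44)

print(java_release_for_major(51))  # the version reported in the error above
```

A JVM can load class files up to its own major version, so a runtime that rejects version 51 is Java 6 or older.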


Algorithms

What does it mean if the r2 value in my model is negative?

The coefficient of determination (also known as r^2) can be negative if:

If your r2 value is negative after your model is complete, your model is likely incorrect. Make sure your data is suitable for the type of model, then try adding an intercept.
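As a worked example, the following sketch computes r2 using the standard definition, 1 - SS_res / SS_tot. It is not taken from the H2O source, and the sample values are invented. The result is negative whenever the predictions fit worse than simply predicting the mean of the response.

```python
def r_squared(actual, predicted):
    mean = sum(actual) / len(actual)
    # Residual sum of squares: error of the model's predictions
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: error of always predicting the mean
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

actual = [1.0, 2.0, 3.0, 4.0]
good = [1.1, 1.9, 3.2, 3.8]   # close to the actual values
bad = [4.0, 3.0, 2.0, 1.0]    # trend is reversed

print(r_squared(actual, good))  # close to 1
print(r_squared(actual, bad))   # negative: worse than predicting the mean
```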


What’s the process for implementing new algorithms in H2O?

This blog post by Cliff walks you through building a new algorithm, using K-Means, Quantiles, and Grep as examples.

To learn more about performance characteristics when implementing new algorithms, refer to Cliff’s KV Store Guide.


How do I find the standard errors of the parameter estimates (p-values)?

P-values are currently not supported. They are on our road map and will be added, depending on current customer demand and priorities. Generally, adding p-values involves significant engineering effort because p-values for regularized GLM are not straightforward and have been defined only recently (with no standard implementation available that we know of). P-values for a restricted set of GLM problems (no regularization, low number of predictors) are easier to do and may be added sooner, if there is sufficient demand.

For now, we recommend using a non-zero l1 penalty (alpha > 0) and considering all non-zero coefficients in the model as significant. The recommended use case is running GLM with lambda search enabled and alpha > 0 and picking the best lambda value based on cross-validation or hold-out set validation.


How do I specify regression or classification for Distributed Random Forest in the web UI?

If the response column is numeric, H2O generates a regression model. If the response column is enum, the model uses classification. To specify the column type, select it from the drop-down column heading list in the Data Preview section during parsing.


What’s the largest number of classes that H2O supports for multinomial prediction?

For tree-based algorithms, the maximum number of classes (or levels) for a response column is 1000.


How do I obtain a tree diagram of my DRF model?

Output the SVG code for the edges and nodes. A simple tree visitor is available here and the Java code generator is available here.


Is Word2Vec available? I can see the Java and R sources, but calling the API generates an error.

Word2Vec, along with other natural language processing (NLP) algorithms, is currently in development for H2O.


Building H2O

Using ./gradlew build doesn’t generate a build successfully - is there anything I can do to troubleshoot?

Use ./gradlew clean before running ./gradlew build.


I tried using ./gradlew build after using git pull to update my local H2O repo, but now I can’t get H2O to build successfully - what should I do?

Try using ./gradlew build -x test - the build may be failing tests if data is not synced correctly.


Clusters

When trying to launch H2O, I received the following error message: ERROR: Too many retries starting cloud. What should I do?

If you are trying to start a multi-node cluster where the nodes use multiple network interfaces, by default H2O will resort to using the default host (127.0.0.1).

To specify an IP address, launch H2O using the following command:

java -jar h2o.jar -ip <IP_Address> -port <PortNumber>

If this does not resolve the issue, try the following additional troubleshooting tips:


What should I do if I tried to start a cluster but the nodes started independent clouds that are not connected?

Because the default cloud name is the user name of the node, if the nodes are on different operating systems (for example, one node is using Windows and the other uses OS X), the different user names on each machine will prevent the nodes from recognizing that they belong to the same cloud. To resolve this issue, use -name to configure the same name for all nodes.


One of the nodes in my cluster is unavailable — what do I do?

H2O does not support high availability (HA). If a node in the cluster is unavailable, bring the cluster down and create a new healthy cluster.


How do I add new nodes to an existing cluster?

New nodes can only be added if H2O has not started any jobs. Once H2O starts a task, it locks the cluster to prevent new nodes from joining. If H2O has started a job, you must create a new cluster to include additional nodes.


How do I check if all the nodes in the cluster are healthy and communicating?

In the Flow web UI, click the Admin menu and select Cluster Status.


How do I create a cluster behind a firewall?

H2O uses two ports:

You can start the cluster behind the firewall, but to reach it, you must create a tunnel to the REST_API port. To use the cluster, the REST_API port of at least one node must be reachable.


I launched H2O instances on my nodes - why won’t they form a cloud?

If you launch H2O without using the -ip argument to specify the IP address:

$ java -Xmx20g -jar h2o.jar -flatfile flatfile.txt -port 54321

and multiple local IP addresses are detected, H2O uses the default localhost (127.0.0.1) as shown below:

  10:26:32.266 main      WARN WATER: Multiple local IPs detected:
  +                                    /198.168.1.161  /198.168.58.102
  +                                  Attempting to determine correct address...
  10:26:32.284 main      WARN WATER: Failed to determine IP, falling back to localhost.
  10:26:32.325 main      INFO WATER: Internal communication uses port: 54322
  +                                  Listening for HTTP and REST traffic
  +                                  on http://127.0.0.1:54321/
  10:26:32.378 main      WARN WATER: Flatfile configuration does not include self:
  /127.0.0.1:54321 but contains [/192.168.1.161:54321, /192.168.1.162:54321]

To avoid using 127.0.0.1 on servers with multiple local IP addresses, run the command with the -ip argument to force H2O to launch at the specified IP:

$ java -Xmx20g -jar h2o.jar -flatfile flatfile.txt -ip 192.168.1.161 -port 54321
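One way to decide which address to pass to -ip on each machine is to pick the local address that also appears in the flatfile. The sketch below assumes the flatfile contains one ip:port entry per line, as the log output above suggests. The function name pick_ip is hypothetical, and the list of local addresses is supplied by hand rather than detected.

```python
def pick_ip(local_ips, flatfile_lines):
    # Collect the host part of every non-empty "ip:port" entry
    listed = set()
    for line in flatfile_lines:
        line = line.strip()
        if not line:
            continue
        host, _, _port = line.partition(":")
        listed.add(host)
    # Only return an address when exactly one local address is listed
    matches = [ip for ip in local_ips if ip in listed]
    return matches[0] if len(matches) == 1 else None

local_ips = ["192.168.1.161", "192.168.58.102"]
flatfile = ["192.168.1.161:54321", "192.168.1.162:54321"]
print(pick_ip(local_ips, flatfile))
```

If the function returns None, either no local address is listed in the flatfile or more than one is, and the choice has to be made manually.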


Data

How should I format my SVMLight data before importing?

The data must be formatted as a sorted list of unique integers, the column indices must be >= 1, and the columns must be in ascending order.
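A quick pre-import check can catch lines that violate these rules. The sketch below assumes the usual SVMLight layout of a label followed by index:value pairs; the function name check_svmlight_line is invented for this example and is not part of H2O.

```python
def check_svmlight_line(line):
    """Return a list of problems found in one 'label index:value ...' line."""
    problems = []
    tokens = line.split()
    previous = 0
    for token in tokens[1:]:  # tokens[0] is the label
        index_text, _, _value = token.partition(":")
        index = int(index_text)
        if index < 1:
            problems.append("index %d is below 1" % index)
        elif index <= previous:
            problems.append("index %d repeats or is out of order" % index)
        previous = max(previous, index)
    return problems

print(check_svmlight_line("1 1:0.5 3:1.2 7:0.1"))  # valid line
print(check_svmlight_line("0 3:1.0 2:4.0"))        # indices not ascending
```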


General

How do I score using an exported JSON model?

Since JSON is just a representation format, it cannot be directly executed, so a JSON export can’t be used for scoring. However, you can score by:


How do I predict using multiple response variables?

Currently, H2O does not support multiple response variables. To predict different response variables, build multiple models.


How do I kill any running instances of H2O?

In Terminal, enter ps -efww | grep h2o, then kill any running PIDs. You can also find the running instance in Terminal and press Ctrl + C on your keyboard. To confirm no H2O sessions are still running, go to http://localhost:54321 and verify that the H2O web UI does not display.


Why is H2O not launching from the command line?

$ java -jar h2o.jar &
% Exception in thread "main" java.lang.ExceptionInInitializerError
at java.lang.Class.initializeClass(libgcj.so.10)
at water.Boot.getMD5(Boot.java:73)
at water.Boot.<init>(Boot.java:114)
at water.Boot.<clinit>(Boot.java:57)
at java.lang.Class.initializeClass(libgcj.so.10)
Caused by: java.lang.IllegalArgumentException
at java.util.regex.Pattern.compile(libgcj.so.10)
at water.util.Utils.<clinit>(Utils.java:1286)
at java.lang.Class.initializeClass(libgcj.so.10)
...4 more

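The only prerequisite for running H2O is a compatible version of Java. The libgcj entries in the stack trace indicate that the GNU Java runtime (GCJ) is being used, which is not compatible. We recommend Oracle’s Java 1.7.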


Why did I receive the following error when I tried to launch H2O?

[root@sandbox h2o-dev-0.3.0.1188-hdp2.2]# hadoop jar h2odriver.jar -nodes 2 -mapperXmx 1g -output hdfsOutputDirName
Determining driver host interface for mapper->driver callback...
   [Possible callback IP address: 10.0.2.15]
   [Possible callback IP address: 127.0.0.1]
Using mapper->driver callback IP address and port: 10.0.2.15:41188
(You can override these with -driverif and -driverport.)
Memory Settings:
   mapreduce.map.java.opts:     -Xms1g -Xmx1g -Dlog4j.defaultInitOverride=true
   Extra memory percent:        10
   mapreduce.map.memory.mb:     1126
15/05/08 02:33:40 INFO impl.TimelineClientImpl: Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
15/05/08 02:33:41 INFO client.RMProxy: Connecting to ResourceManager at sandbox.hortonworks.com/10.0.2.15:8050
15/05/08 02:33:47 INFO mapreduce.JobSubmitter: number of splits:2
15/05/08 02:33:48 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1431052132967_0001
15/05/08 02:33:51 INFO impl.YarnClientImpl: Submitted application application_1431052132967_0001
15/05/08 02:33:51 INFO mapreduce.Job: The url to track the job: http://sandbox.hortonworks.com:8088/proxy/application_1431052132967_0001/
Job name 'H2O_3889' submitted
JobTracker job ID is 'job_1431052132967_0001'
For YARN users, logs command is 'yarn logs -applicationId application_1431052132967_0001'
Waiting for H2O cluster to come up...
H2O node 10.0.2.15:54321 requested flatfile
ERROR: Timed out waiting for H2O cluster to come up (120 seconds)
ERROR: (Try specifying the -timeout option to increase the waiting time limit)
15/05/08 02:35:59 INFO impl.TimelineClientImpl: Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
15/05/08 02:35:59 INFO client.RMProxy: Connecting to ResourceManager at sandbox.hortonworks.com/10.0.2.15:8050

----- YARN cluster metrics -----
Number of YARN worker nodes: 1

----- Nodes -----
Node: http://sandbox.hortonworks.com:8042 Rack: /default-rack, RUNNING, 1 containers used, 0.2 / 2.2 GB used, 1 / 8 vcores used

----- Queues -----
Queue name:            default
   Queue state:       RUNNING
   Current capacity:  0.11
   Capacity:          1.00
   Maximum capacity:  1.00
   Application count: 1
   ----- Applications in this queue -----
   Application ID:                  application_1431052132967_0001 (H2O_3889)
       Started:                     root (Fri May 08 02:33:50 UTC 2015)
       Application state:           FINISHED
       Tracking URL:                http://sandbox.hortonworks.com:8088/proxy/application_1431052132967_0001/jobhistory/job/job_1431052132967_0001
       Queue name:                  default
       Used/Reserved containers:    1 / 0
       Needed/Used/Reserved memory: 0.2 GB / 0.2 GB / 0.0 GB
       Needed/Used/Reserved vcores: 1 / 1 / 0

Queue 'default' approximate utilization: 0.2 / 2.2 GB used, 1 / 8 vcores used

----------------------------------------------------------------------

ERROR:   Job memory request (2.2 GB) exceeds available YARN cluster memory (2.2 GB)
WARNING: Job memory request (2.2 GB) exceeds queue available memory capacity (2.0 GB)
ERROR:   Only 1 out of the requested 2 worker containers were started due to YARN cluster resource limitations

----------------------------------------------------------------------
Attempting to clean up hadoop job...
15/05/08 02:35:59 INFO impl.YarnClientImpl: Killed application application_1431052132967_0001
Killed.
[root@sandbox h2o-dev-0.3.0.1188-hdp2.2]#

The H2O launch failed because more memory was requested than was available. Make sure you are not trying to specify more memory in the launch parameters than you have available.
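The numbers in the output above can be reproduced with some simple arithmetic. The formula below is inferred from the logged values (-mapperXmx 1g, an extra memory percent of 10, and mapreduce.map.memory.mb of 1126), so treat the rounding as an assumption rather than the driver's exact implementation.

```python
def container_mb(mapper_xmx_mb, extra_percent=10):
    # Heap size plus the extra memory percentage, truncated to whole megabytes
    return int(mapper_xmx_mb * (100 + extra_percent) / 100)

per_node = container_mb(1024)   # -mapperXmx 1g
total_mb = per_node * 2         # -nodes 2
print(per_node, total_mb, round(total_mb / 1024, 1))
```

With two nodes, the job asks for about 2.2 GB in total. The output shows about 2.2 GB of memory on the single YARN worker node, with 0.2 GB already in use, so only one of the two containers could be started.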


How does the architecture of H2O work?

This PDF includes diagrams and slides depicting how H2O works in big data environments.


How does H2O work with Excel?

For more information on how H2O works with Excel, refer to this page.


I received the following error message when launching H2O - how do I resolve the error?

Invalid flow_dir illegal character at index 12...

This error message means that there is a space (or other unsupported character) in your H2O directory. To resolve this error:


Hadoop

How do I specify which nodes should run H2O in a Hadoop cluster?

This is not yet supported. To provide resource isolation (for example, to isolate H2O to the worker nodes, rather than the master nodes), use YARN Nodemanagers to specify the nodes to use.


How do I import data from HDFS in R and in Flow?

To import from HDFS in R:

h2o.importHDFS(path, conn = h2o.getConnection(), pattern = "",
destination_frame = "", parse = TRUE, header = NA, sep = "",
col.names = NULL, na.strings = NULL)

Here is another example:

# h is assumed to be the connection object returned by h2o.init()
pathToAirlines <- "hdfs://mr-0xd6.0xdata.loc/datasets/airlines_all.csv"
airlines.hex <- h2o.importFile(conn = h, path = pathToAirlines, destination_frame = "airlines.hex")

In Flow, the easiest way is to let the auto-suggestion feature in the Search: field complete the path for you. Just start typing the path to the file, starting with the top-level directory, and H2O provides a list of matching files.

Flow - Import Auto-Suggest

Click the file to add it to the Search: field.


Why do I receive the following error when I try to save my notebook in Flow?

Error saving notebook: Error calling POST /3/NodePersistentStorage/notebook/Test%201 with opts

When you are running H2O on Hadoop, H2O tries to determine the home HDFS directory so it can use that as the download location. If the default home HDFS directory is not found, manually set the download location from the command line using the -flow_dir parameter (for example, hadoop jar h2odriver.jar <...> -flow_dir hdfs:///user/yourname/yourflowdir). You can view the default download directory in the logs by clicking Admin > View logs… and looking for the line that begins Flow dir:.


Java

How do I use H2O with Java?

There are two ways to use H2O with Java. The simplest way, which should meet the needs of most users, is to call the REST API of a remote cluster from your Java program.

You can access the REST API documentation within Flow, or on our documentation site.

Flow, Python, and R all rely on the REST API to run H2O. For example, each action in Flow translates into one or more REST API calls. The script fragments in the cells in Flow are essentially the payloads for the REST API calls. Most R and Python API calls translate into a single REST API call.
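As a minimal sketch of what such a call looks like, the snippet below builds the URL for the /3/Cloud endpoint (the endpoint that appears in the R error message elsewhere in this FAQ) and defines a helper that fetches and decodes the JSON response. Python is used here for brevity; the same request can be made from any HTTP client, including a Java program. The helper names rest_url and get_json are made up for this example, and get_json requires a running H2O node, so it is defined but not called.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def rest_url(host, port, endpoint, **params):
    # Assemble http://host:port/endpoint?key=value
    url = "http://%s:%d%s" % (host, port, endpoint)
    if params:
        url += "?" + urlencode(params)
    return url

def get_json(url, timeout=10):
    # Issue a GET request and decode the JSON payload (needs a live H2O node)
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))

print(rest_url("127.0.0.1", 54321, "/3/Cloud", skip_ticks="true"))
```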

To see how the REST API is used with H2O:

The second way to use H2O with Java is to embed H2O within your Java application, similar to Sparkling Water.


How do I communicate with a remote cluster using the REST API?

To create a set of bare POJOs for the REST API payloads that can be used by JVM REST API clients:

  1. Clone the sources from GitHub.
  2. Start an H2O instance.
  3. Enter cd py.
  4. Enter python generate_java_binding.py.

This script connects to the server, gets all the metadata for the REST API schemas, and writes the Java POJOs to {sourcehome}/build/bindings/Java.


Python

How do I specify a value as an enum in Python? Is there a Python equivalent of as.factor() in R?

Use .asfactor() to specify a value as an enum.


I received the following error when I tried to install H2O using the Python instructions on the downloads page - what should I do to resolve it?

Downloading/unpacking http://h2o-release.s3.amazonaws.com/h2o/rel-shannon/12/Python/h2o-3.0.0.12-py2.py3-none-any.whl 
  Downloading h2o-3.0.0.12-py2.py3-none-any.whl (43.1Mb): 43.1Mb downloaded 
  Running setup.py egg_info for package from http://h2o-release.s3.amazonaws.com/h2o/rel-shannon/12/Python/h2o-3.0.0.12-py2.py3-none-any.whl 
    Traceback (most recent call last): 
      File "<string>", line 14, in <module> 
    IOError: [Errno 2] No such file or directory: '/tmp/pip-nTu3HK-build/setup.py' 
    Complete output from command python setup.py egg_info: 
    Traceback (most recent call last): 

  File "<string>", line 14, in <module> 

IOError: [Errno 2] No such file or directory: '/tmp/pip-nTu3HK-build/setup.py' 

---------------------------------------- 
Command python setup.py egg_info failed with error code 1 in /tmp/pip-nTu3HK-build

With Python, there is no automatic update of installed packages, so you must upgrade manually. Additionally, the package distribution method recently changed from distutils to wheel. If you are having trouble installing the H2O package, particularly if error messages related to bdist_wheel or eggs display, try the following procedure first.

# this gets the latest setuptools 
# see https://pip.pypa.io/en/latest/installing.html 
wget https://bootstrap.pypa.io/ez_setup.py -O - | sudo python 

# platform dependent ways of installing pip are at 
# https://pip.pypa.io/en/latest/installing.html 
# but the above should work on most linux platforms? 

# on ubuntu 
# if you already have some version of pip, you can skip this. 
sudo apt-get install python-pip 

# the package manager doesn't install the latest. upgrade to latest 
# we're not using easy_install any more, so don't care about checking that 
pip install pip --upgrade 

# I've seen pip not install to the final version ..i.e. it goes to an almost 
# final version first, then another upgrade gets it to the final version. 
# We'll cover that, and also double check the install. 

# after upgrading pip, the path name may change from /usr/bin to /usr/local/bin 
# start a new shell, just to make sure you see any path changes 

bash 

# Also: I like double checking that the install is bulletproof by reinstalling. 
# Sometimes it seems like things say they are installed, but have errors during the install. Check for no errors or stack traces. 

pip install pip --upgrade --force-reinstall 

# distribute should be at the most recent now. Just in case 
# don't do --force-reinstall here, it causes an issue. 

pip install distribute --upgrade 


# Now check the versions 
pip list | egrep '(distribute|pip|setuptools)' 
distribute (0.7.3) 
pip (7.0.3) 
setuptools (17.0) 


# Re-install wheel 
pip install wheel --upgrade --force-reinstall

After completing this procedure, go to Python and use h2o.init() to start H2O in Python.

Note:

If you use gradlew to build the jar yourself, you have to start the jar yourself before you do h2o.init().

If you download the jar and the H2O package, h2o.init() will work like R and you don’t have to start the jar yourself.


How should I specify the datatype during import in Python?

Refer to the following example:

fraw = h2o.import_file("smalldata/logreg/prostate.csv") 
fsetup = h2o.parse_setup(fraw) 
fsetup["column_types"][1] = "Enum" # change second column "CAPSULE" to categorical 
fr = h2o.parse_raw(fsetup) 
fr.describe()

R

How can I install the H2O R package if I am having permissions problems?

This issue typically occurs for Linux users when the R software was installed by a root user. For more information, refer to the following link.

To specify the installation location for the R packages, create a file that contains the R_LIBS_USER environment variable:

echo R_LIBS_USER=\"~/.Rlibrary\" > ~/.Renviron

Confirm the file was created successfully using cat:

$ cat ~/.Renviron

You should see the following output:

R_LIBS_USER="~/.Rlibrary"

Create a new directory for the environment variable:

$ mkdir ~/.Rlibrary

Start R and enter the following:

.libPaths()

Look for the following output to confirm the changes:

[1] "<Your home directory>/.Rlibrary"                                         
[2] "/Library/Frameworks/R.framework/Versions/3.1/Resources/library"

I received the following error message after launching H2O in RStudio and using h2o.init - what should I do to resolve this error?

> localH2O = h2o.init()
Successfully connected to http://127.0.0.1:54321/

ERROR: Unexpected HTTP Status code: 301 Moved Permanently (url = http://127.0.0.1:54321/3/Cloud?skip_ticks=true)

Error in fromJSON(rv$payload) : unexpected character '<'
Calls: h2o.init ... gsub -> .h2o.doSafeGET -> .h2o.doSafeREST -> fromJSON
Execution halted

This error is due to a version mismatch between the H2O R package and the running H2O instance. Make sure both are the latest version by downloading H2O from the downloads page and installing it. Before installing, remove any previous versions of the H2O R package by running:

if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) }
if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }

Make sure to install the dependencies for the H2O R package as well:

if (! ("methods" %in% rownames(installed.packages()))) { install.packages("methods") }
if (! ("statmod" %in% rownames(installed.packages()))) { install.packages("statmod") }
if (! ("stats" %in% rownames(installed.packages()))) { install.packages("stats") }
if (! ("graphics" %in% rownames(installed.packages()))) { install.packages("graphics") }
if (! ("RCurl" %in% rownames(installed.packages()))) { install.packages("RCurl") }
if (! ("rjson" %in% rownames(installed.packages()))) { install.packages("rjson") }
if (! ("tools" %in% rownames(installed.packages()))) { install.packages("tools") }
if (! ("utils" %in% rownames(installed.packages()))) { install.packages("utils") }

Finally, install the latest version of the H2O package for R:

install.packages("h2o", type="source", repos=(c("http://h2o-release.s3.amazonaws.com/h2o/master/30/R")))
library(h2o)
localH2O = h2o.init()

I received the following error message after trying to run some code - what should I do?

> fit <- h2o.deeplearning(x=2:4, y=1, training_frame=train_hex)
  |=========================================================================================================| 100%
Error in model$training_metrics$MSE :
  $ operator not defined for this S4 class
In addition: Warning message:
Not all shim outputs are fully supported, please see ?h2o.shim for more information

Remove the h2o.shim(enable=TRUE) line and try running the code again. Note that the h2o.shim is only a way to notify users of previous versions of H2O about changes to the H2O R package - it will not revise your code, but provides suggested replacements for deprecated commands and parameters.


How do I extract the model weights from a model I’ve created using H2O in R? I’ve enabled extract_model_weights_and_biases, but the output refers to a file I can’t open in R.

For an example of how to extract weights and biases from a model, refer to the following repo location on GitHub.


I’m using CentOS and I want to run H2O in R - are there any dependencies I need to install?

Yes, make sure to install libcurl, which allows H2O to communicate with R. We also recommend disabling SELinux and any firewalls, at least until you have confirmed that H2O can initialize.


Sparkling Water

How do I filter an H2OFrame using Sparkling Water?

Filtering columns is easy: just remove the unnecessary columns or create a new H2OFrame from the columns you want to include (Frame(String[] names, Vec[] vec)), then make the H2OFrame wrapper around it (new H2OFrame(frame)).

Filtering rows is a little bit harder. There are two ways:


How do I inspect H2O using Flow while a droplet is running?

If your droplet execution time is very short, add a simple sleep statement to your code:

Thread.sleep(...)


How do I change the memory size of the executors in a droplet?

There are two ways to do this:


I received the following error while running Sparkling Water using multiple nodes, but not when using a single node - what should I do?

onExCompletion for water.parser.ParseDataset$MultiFileParseTask@31cd4150
water.DException$DistributedException: from /10.23.36.177:54321; by class water.parser.ParseDataset$MultiFileParseTask; class water.DException$DistributedException: from /10.23.36.177:54325; by class water.parser.ParseDataset$MultiFileParseTask; class water.DException$DistributedException: from /10.23.36.178:54325; by class water.parser.ParseDataset$MultiFileParseTask$DistributedParse; class java.lang.NullPointerException: null
    at water.persist.PersistManager.load(PersistManager.java:141)
    at water.Value.loadPersist(Value.java:226)
    at water.Value.memOrLoad(Value.java:123)
    at water.Value.get(Value.java:137)
    at water.fvec.Vec.chunkForChunkIdx(Vec.java:794)
    at water.fvec.ByteVec.chunkForChunkIdx(ByteVec.java:18)
    at water.fvec.ByteVec.chunkForChunkIdx(ByteVec.java:14)
    at water.MRTask.compute2(MRTask.java:426)
    at water.MRTask.compute2(MRTask.java:398)

This error output displays if the input file is not present on all nodes. Because of the way that Sparkling Water distributes data, the input file is required on all nodes (including remote), not just the primary node. Make sure there is a copy of the input file on all the nodes, then try again.


Are there any drawbacks to using Sparkling Water compared to standalone H2O?

The version of H2O embedded in Sparkling Water is the same as the standalone version.


How do I use Sparkling Water from the Spark shell?

There are two methods:

The software distribution provides example scripts in the examples/scripts directory:

bin/sparkling-shell -i examples/scripts/chicagoCrimeSmallShell.script.scala

For either method, initialize H2O as shown in the following example:

import org.apache.spark.h2o._
val h2oContext = new H2OContext(sc).start()

After successfully launching H2O, the following output displays:

Sparkling Water Context:
 * number of executors: 3
 * list of used executors:
  (executorId, host, port)
  ------------------------
  (1,Michals-MBP.0xdata.loc,54325)
  (0,Michals-MBP.0xdata.loc,54321)
  (2,Michals-MBP.0xdata.loc,54323)
  ------------------------

  Open H2O Flow in browser: http://172.16.2.223:54327 (CMD + click in Mac OSX)

How do I use H2O with Spark Submit?

Spark Submit is for submitting self-contained applications. For more information, refer to the Spark documentation.

First, initialize H2O:

import org.apache.spark.h2o._
val h2oContext = new H2OContext(sc).start()

The Sparkling Water distribution provides several examples of self-contained applications built with Sparkling Water. To run the examples:

bin/run-example.sh ChicagoCrimeAppSmall

The “magic” behind run-example.sh is a regular Spark Submit:

$SPARK_HOME/bin/spark-submit ChicagoCrimeAppSmall --packages ai.h2o:sparkling-water-core_2.10:1.3.3 --packages ai.h2o:sparkling-water-examples_2.10:1.3.3


How do I use Sparkling Water with Databricks cloud?

Sparkling Water compatibility with Databricks cloud is still in development.


How do I develop applications with Sparkling Water?

For a regular Spark application (a self-contained application as described in the Spark documentation), the app needs to initialize H2OServices via H2OContext:

import org.apache.spark.h2o._
val h2oContext = new H2OContext(sc).start()

For more information, refer to the Sparkling Water development documentation.


How do I connect to Sparkling Water from R or Python?

After starting H2OServices by starting H2OContext, point your client to the IP address and port number specified in H2OContext.
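
As a minimal sketch of that step, the Python below pulls the IP address and port out of the "Open H2O Flow in browser" line that H2OContext prints on startup (the sample line is copied from the output shown earlier in this section). The helper name flow_address is hypothetical. The commented-out h2o.init call assumes the H2O Python package is installed and is not executed here.

```python
import re
from urllib.parse import urlsplit


def flow_address(startup_output):
    """Return (host, port) from the first http:// URL in the startup output."""
    match = re.search(r"http://\S+", startup_output)
    if match is None:
        raise ValueError("no H2O Flow URL found in output")
    parts = urlsplit(match.group(0))
    return parts.hostname, parts.port


line = "Open H2O Flow in browser: http://172.16.2.223:54327 (CMD + click in Mac OSX)"
host, port = flow_address(line)
print(host, port)

# With the H2O Python package installed, the client could then connect with:
# import h2o
# h2o.init(ip=host, port=port)
```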


Tunneling between servers with H2O

To tunnel between servers (for example, due to firewalls):

  1. Use ssh to log in to the machine where H2O will run.
  2. Start an instance of H2O by locating the working directory and calling a java command similar to the following example.

    The port number chosen here is arbitrary; yours may be different.

    $ java -jar h2o.jar -port 55599

    This returns output similar to the following:

     irene@mr-0x3:~/target$ java -jar h2o.jar -port 55599
     04:48:58.053 main      INFO WATER: ----- H2O started -----
     04:48:58.055 main      INFO WATER: Build git branch: master
     04:48:58.055 main      INFO WATER: Build git hash: 64fe68c59ced5875ac6bac26a784ce210ef9f7a0
     04:48:58.055 main      INFO WATER: Build git describe: 64fe68c
     04:48:58.055 main      INFO WATER: Build project version: 1.7.0.99999
     04:48:58.055 main      INFO WATER: Built by: 'Irene'
     04:48:58.055 main      INFO WATER: Built on: 'Wed Sep  4 07:30:45 PDT 2013'
     04:48:58.055 main      INFO WATER: Java availableProcessors: 4
     04:48:58.059 main      INFO WATER: Java heap totalMemory: 0.47 gb
     04:48:58.059 main      INFO WATER: Java heap maxMemory: 6.96 gb
     04:48:58.060 main      INFO WATER: ICE root: '/tmp'
     04:48:58.081 main      INFO WATER: Internal communication uses port: 55600
     +                                  Listening for HTTP and REST traffic on
     +                                  http://192.168.1.173:55599/
     04:48:58.109 main      INFO WATER: H2O cloud name: 'irene'
     04:48:58.109 main      INFO WATER: (v1.7.0.99999) 'irene' on
     /192.168.1.173:55599, discovery address /230.252.255.19:59132
     04:48:58.111 main      INFO WATER: Cloud of size 1 formed [/192.168.1.173:55599]
     04:48:58.247 main      INFO WATER: Log dir: '/tmp/h2ologs'
    
  3. From the machine where you want to access H2O, use ssh to log in to the machine running H2O and forward a local port to the H2O port, using a command similar to the following (your port numbers and IP address will be different):

    ssh -L 55577:localhost:55599 irene@192.168.1.173

  4. Check the cluster status.

You are now using H2O from localhost:55577, but the instance of H2O is running on the remote server (in this case, the server with the IP address 192.168.1.xxx) at port number 55599.

To see this in action, note that the web UI is pointed at localhost:55577, but the cluster status shows the cluster running on 192.168.1.173:55599.
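
The port forward in step 3 can be assembled programmatically. The sketch below only builds the command and the local URL. It does not open a connection. The function names tunnel_command and local_url are illustrative, and the ports, user name, and address are the example values from the steps above.

```python
def tunnel_command(local_port, remote_port, user, host):
    """Build the argument list for an ssh local port forward (ssh -L)."""
    forward = "{}:localhost:{}".format(local_port, remote_port)
    return ["ssh", "-L", forward, "{}@{}".format(user, host)]


def local_url(local_port):
    """URL to open in the browser once the tunnel is up."""
    return "http://localhost:{}/".format(local_port)


command = tunnel_command(55577, 55599, "irene", "192.168.1.173")
print(" ".join(command))
print(local_url(55577))
```

The argument list can be passed to subprocess.call to open the tunnel from a script. In the forward specification, "localhost" is resolved on the remote machine, which is why it points at the H2O instance and not at your own machine.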


Quick Start Videos

H2O Quick Start with Flow


H2O Quick Start with Python


H2O Quick Start on Hadoop


H2O Quick Start with Sparkling Water


H2O Quick Start with R


REST API Reference

GET /3/About

Return information about this H2O.

InputAboutV3
OutputAboutV3

GET /3/Cloud

Determine the status of the nodes in the H2O cloud.

InputCloudV3
OutputCloudV3

HEAD /3/Cloud

Determine the status of the nodes in the H2O cloud.

InputCloudV3
OutputCloudV3

POST /3/Configuration/ModelBuilders/visibility

Set Model Builders visibility level.

InputModelBuildersVisibilityV3
OutputModelBuildersVisibilityV3

GET /3/Configuration/ModelBuilders/visibility

Get Model Builders visibility level.

InputModelBuildersVisibilityV3
OutputModelBuildersVisibilityV3

POST /3/CreateFrame

Create a synthetic H2O Frame.

InputCreateFrameV3
OutputCreateFrameV3

DELETE /3/DKV

Remove all keys from the H2O distributed K/V store.

InputRemoveAllV3
OutputRemoveAllV3

DELETE /3/DKV/(?.*)

Remove an arbitrary key from the H2O distributed K/V store.

InputRemoveV3
OutputRemoveV3

GET /3/DownloadDataset

Download a dataset as a CSV file.

InputDownloadDataV3
OutputDownloadDataV3

GET /3/Find

Find a value within a Frame.

InputFindV3
OutputFindV3

GET /3/Frames

Return all Frames in the H2O distributed K/V store.

InputFramesV3
OutputFramesV3

DELETE /3/Frames

Delete all Frames from the H2O distributed K/V store.

InputFramesV3
OutputFramesV3

GET /3/Frames/(?.*)

Return the specified Frame.

InputFramesV3
OutputFramesV3

DELETE /3/Frames/(?.*)

Delete the specified Frame from the H2O distributed K/V store.

InputFramesV3
OutputFramesV3

GET /3/Frames/(?.*)/columns

Return all the columns from a Frame.

InputFramesV3
OutputFramesV3

GET /3/Frames/(?.*)/columns/(?.*)

Return the specified column from a Frame.

InputFramesV3
OutputFramesV3

GET /3/Frames/(?.*)/columns/(?.*)/domain

Return the domains for the specified column. “null” if the column is not an Enum.

InputFramesV3
OutputFramesV3

GET /3/Frames/(?.*)/columns/(?.*)/summary

Return the summary metrics for a column, e.g. mins, maxes, mean, sigma, percentiles, etc.

InputFramesV3
OutputFramesV3

GET /3/Frames/(?.*)/export/(?.*)/overwrite/(?.*)

Export a Frame to the given path with optional overwrite.

InputFramesV3
OutputFramesV3

GET /3/Frames/(?.*)/summary

Return a Frame, including the histograms, after forcing computation of rollups.

InputFramesV3
OutputFramesV3

POST /3/GarbageCollect

Explicitly call System.gc().

InputGarbageCollectV3
OutputGarbageCollectV3

GET /3/ImportFiles

Import raw data files into a single-column H2O Frame.

InputImportFilesV3
OutputImportFilesV3

GET /3/InitID

Issue a new session ID.

InputInitIDV3
OutputInitIDV3

POST /3/Interaction

Create interactions between categorical columns.

InputInteractionV3
OutputInteractionV3

GET /3/JStack

Report stack traces for all threads on all nodes.

InputJStackV3
OutputJStackV3

GET /3/Jobs

Get a list of all the H2O Jobs (long-running actions).

InputJobsV3
OutputSchema

GET /3/Jobs/(?.*)

Get the status of the given H2O Job (long-running action).

InputJobsV3
OutputSchema

POST /3/Jobs/(?.*)/cancel

Cancel a running job.

InputJobsV3
OutputSchema

GET /3/KillMinus3

Run kill -3 on this node.

InputKillMinus3V3
OutputKillMinus3V3

POST /3/LogAndEcho

Save a message to the H2O logfile.

InputLogAndEchoV3
OutputLogAndEchoV3

GET /3/Logs/nodes/(?.*)/files/(?.*)

Get named log file for a node.

InputLogsV3
OutputLogsV3

POST /3/MakeGLMModel

Make a new GLM model based on an existing one.

InputMakeGLMModelV3
OutputGLMModelV3

GET /3/Metadata/endpoints

Return a list of all the REST API endpoints.

InputMetadataV3
OutputMetadataV3

GET /3/Metadata/endpoints/(?[0-9]+)

Return the REST API endpoint metadata, including documentation, for the endpoint specified by number.

InputMetadataV3
OutputMetadataV3

GET /3/Metadata/endpoints/(?.*)

Return the REST API endpoint metadata, including documentation, for the endpoint specified by path.

InputMetadataV3
OutputMetadataV3

GET /3/Metadata/schemaclasses/(?.*)

Return the REST API schema metadata for specified schema class.

InputMetadataV3
OutputMetadataV3

GET /3/Metadata/schemas

Return list of all REST API schemas.

InputMetadataV3
OutputMetadataV3

GET /3/Metadata/schemas/(?.*)

Return the REST API schema metadata for specified schema.

InputMetadataV3
OutputMetadataV3

POST /3/MissingInserter

Insert missing values.

InputMissingInserterV3
OutputMissingInserterV3

GET /3/ModelBuilders

Return the Model Builder metadata for all available algorithms.

InputModelBuildersV3
OutputModelBuildersV3

GET /3/ModelBuilders/(?.*)

Return the Model Builder metadata for the specified algorithm.

InputModelBuildersV3
OutputModelBuildersV3

POST /3/ModelBuilders/deeplearning

Train a Deep Learning model on the specified Frame.

InputDeepLearningV3
OutputSchema

POST /3/ModelBuilders/deeplearning/parameters

Validate a set of Deep Learning model builder parameters.

InputDeepLearningV3
OutputDeepLearningV3

POST /3/ModelBuilders/drf

Train a DRF model on the specified Frame.

InputDRFV3
OutputSchema

POST /3/ModelBuilders/drf/parameters

Validate a set of DRF model builder parameters.

InputDRFV3
OutputDRFV3

POST /3/ModelBuilders/gbm

Train a GBM model on the specified Frame.

InputGBMV3
OutputSchema

POST /3/ModelBuilders/gbm/parameters

Validate a set of GBM model builder parameters.

InputGBMV3
OutputGBMV3

POST /3/ModelBuilders/glm

Train a GLM model on the specified Frame.

InputGLMV3
OutputSchema

POST /3/ModelBuilders/glm/parameters

Validate a set of GLM model builder parameters.

InputGLMV3
OutputGLMV3

POST /3/ModelBuilders/kmeans

Train a KMeans model on the specified Frame.

InputKMeansV3
OutputSchema

POST /3/ModelBuilders/kmeans/parameters

Validate a set of KMeans model builder parameters.

InputKMeansV3
OutputKMeansV3

POST /3/ModelBuilders/naivebayes

Train a Naive Bayes model on the specified Frame.

InputNaiveBayesV3
OutputSchema

POST /3/ModelBuilders/naivebayes/parameters

Validate a set of Naive Bayes model builder parameters.

InputNaiveBayesV3
OutputNaiveBayesV3

GET /3/ModelMetrics

Return all the saved scoring metrics.

InputModelMetricsListSchemaV3
OutputModelMetricsListSchemaV3

GET /3/ModelMetrics/frames/(?.*)

Return the saved scoring metrics for the specified Frame.

InputModelMetricsListSchemaV3
OutputModelMetricsListSchemaV3

GET /3/ModelMetrics/frames/(?.*)/models/(?.*)

Return the saved scoring metrics for the specified Model and Frame.

InputModelMetricsListSchemaV3
OutputModelMetricsListSchemaV3

DELETE /3/ModelMetrics/frames/(?.*)/models/(?.*)

Delete the saved scoring metrics for the specified Model and Frame.

InputModelMetricsListSchemaV3
OutputModelMetricsListSchemaV3

GET /3/ModelMetrics/models/(?.*)

Return the saved scoring metrics for the specified Model.

InputModelMetricsListSchemaV3
OutputModelMetricsListSchemaV3

GET /3/ModelMetrics/models/(?.*)/frames/(?.*)

Return the saved scoring metrics for the specified Model and Frame.

InputModelMetricsListSchemaV3
OutputModelMetricsListSchemaV3

DELETE /3/ModelMetrics/models/(?.*)/frames/(?.*)

Delete the saved scoring metrics for the specified Model and Frame.

InputModelMetricsListSchemaV3
OutputModelMetricsListSchemaV3

POST /3/ModelMetrics/models/(?.*)/frames/(?.*)

Return the scoring metrics for the specified Frame with the specified Model. If the Frame has already been scored with the Model then cached results will be returned; otherwise predictions for all rows in the Frame will be generated and the metrics will be returned.

InputModelMetricsListSchemaV3
OutputModelMetricsListSchemaV3

GET /3/Models

Return all Models from the H2O distributed K/V store.

InputModelsV3
OutputModelsV3

DELETE /3/Models

Delete all Models from the H2O distributed K/V store.

InputModelsV3
OutputModelsV3

GET /3/Models/(?.*)

Return the specified Model from the H2O distributed K/V store, optionally with the list of compatible Frames.

InputModelsV3
OutputModelsV3

DELETE /3/Models/(?.*)

Delete the specified Model from the H2O distributed K/V store.

InputModelsV3
OutputModelsV3

GET /3/Models/(?.*)/preview

Return potentially abridged model suitable for viewing in a browser (currently only used for java model code).

InputModelsV3
OutputModelsV3

GET /3/NetworkTest

Run a network test to measure the performance of the cluster interconnect.

InputNetworkTestV3
OutputNetworkTestV3

POST /3/NodePersistentStorage/(?.*)

Store a value.

InputNodePersistentStorageV3
OutputNodePersistentStorageV3

GET /3/NodePersistentStorage/(?.*)

Return all keys stored for a given category.

InputNodePersistentStorageV3
OutputNodePersistentStorageV3

POST /3/NodePersistentStorage/(?.*)/(?.*)

Store a named value.

InputNodePersistentStorageV3
OutputNodePersistentStorageV3

GET /3/NodePersistentStorage/(?.*)/(?.*)

Return value for a given name.

InputNodePersistentStorageV3
OutputNodePersistentStorageV3

DELETE /3/NodePersistentStorage/(?.*)/(?.*)

Delete a key.

InputNodePersistentStorageV3
OutputNodePersistentStorageV3

GET /3/NodePersistentStorage/categories/(?.*)/exists

Return true or false.

InputNodePersistentStorageV3
OutputNodePersistentStorageV3

GET /3/NodePersistentStorage/categories/(?.*)/names/(?.*)/exists

Return true or false.

InputNodePersistentStorageV3
OutputNodePersistentStorageV3

GET /3/NodePersistentStorage/configured

Return true or false.

InputNodePersistentStorageV3
OutputNodePersistentStorageV3

POST /3/Parse

Parse a raw byte-oriented Frame into a useful columnar data Frame.

InputParseV3
OutputParseV3

POST /3/ParseSetup

Guess the parameters for parsing raw byte-oriented data into an H2O Frame.

InputParseSetupV3
OutputParseSetupV3

POST /3/Predictions/models/(?.*)/frames/(?.*)

Score (generate predictions) for the specified Frame with the specified Model. Both the Frame of predictions and the metrics will be returned.

InputModelMetricsListSchemaV3
OutputModelMetricsListSchemaV3
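
The scoring route above takes two path parameters: a model ID followed by a frame ID. The following minimal sketch fills them in and percent-encodes each one so that characters such as spaces or slashes cannot break the path. The helper name predictions_path and both IDs are hypothetical.

```python
from urllib.parse import quote


def predictions_path(model_id, frame_id):
    """Fill the model and frame path parameters of the /3/Predictions route."""
    return "/3/Predictions/models/{}/frames/{}".format(
        quote(model_id, safe=""),  # safe="" also encodes "/" inside an ID
        quote(frame_id, safe=""),
    )


path = predictions_path("my gbm model", "prostate.hex")
print(path)
```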

GET /3/Profiler

Report real-time profiling information for all nodes (sorted, aggregated stack traces).

InputProfilerV3
OutputProfilerV3

POST /3/Shutdown

Shut down the cluster.

InputShutdownV3
OutputShutdownV3

POST /3/SplitFrame

Split an H2O Frame.

InputSplitFrameV3
OutputSplitFrameV3

GET /3/Timeline

Show a timeline of recent events in the cluster.

InputTimelineV3
OutputTimelineV3

GET /3/Tutorials

H2O tutorials.

InputTutorialsV3
OutputTutorialsV3

GET /3/Typeahead/files

Typeahead handler for filename completion.

InputTypeaheadV3
OutputSchema

POST /3/UnlockKeys

Unlock all keys in the H2O distributed K/V store, to attempt to recover from a crash.

InputUnlockKeysV3
OutputUnlockKeysV3

GET /3/WaterMeterCpuTicks/(?.*)

Return a CPU usage snapshot of all cores of all nodes in the H2O cluster.

InputWaterMeterCpuTicksV3
OutputWaterMeterCpuTicksV3

GET /3/WaterMeterIo

Return IO usage snapshot of all nodes in the H2O cluster.

InputWaterMeterIoV3
OutputWaterMeterIoV3

GET /3/WaterMeterIo/(?.*)

Return IO usage snapshot of one node in the H2O cluster.

InputWaterMeterIoV3
OutputWaterMeterIoV3

POST /99/Grid/drf

Run grid search for DRF model.

InputDRFGridSearchV99
OutputDRFGridSearchV99

POST /99/Grid/gbm

Run grid search for GBM model.

InputGBMGridSearchV99
OutputGBMGridSearchV99

POST /99/Grid/kmeans

Run grid search for KMeans model.

InputKMeansGridSearchV99
OutputKMeansGridSearchV99

POST /99/ModelBuilders/glrm

Train a GLRM model on the specified Frame.

InputGLRMV99
OutputSchema

POST /99/ModelBuilders/glrm/parameters

Validate a set of GLRM model builder parameters.

InputGLRMV99
OutputGLRMV99

POST /99/ModelBuilders/pca

Train a PCA model on the specified Frame.

InputPCAV99
OutputSchema

POST /99/ModelBuilders/pca/parameters

Validate a set of PCA model builder parameters.

InputPCAV99
OutputPCAV99

POST /99/ModelBuilders/svd

Train a SVD model on the specified Frame.

InputSVDV99
OutputSchema

POST /99/ModelBuilders/svd/parameters

Validate a set of SVD model builder parameters.

InputSVDV99
OutputSVDV99

POST /99/Models.bin/(?.*)

Import given binary model into H2O.

InputModelImportV3
OutputModelsV3

GET /99/Models.bin/(?.*)

Export given model.

InputModelExportV3
OutputModelExportV3

POST /99/Rapids

Execute a Rapids expression.

InputRapidsV99
OutputRapidsV99

GET /99/Rapids/isEval

something something r exec something.

InputRapidsV99
OutputRapidsV99

GET /99/Sample

Example of an experimental endpoint. Call via /EXPERIMENTAL/Sample. Experimental endpoints can change at any moment.

InputCloudV3
OutputCloudV3

REST API Schema Reference

AboutEntryV3

name
string
Property nameOut
value
string
Property valueOut

AboutV3

_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In
entries
Iced[]
List of properties about this running H2O instanceOut

CloudV3

skip_ticks
boolean
skip_ticksIn
_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In
version
string
versionOut
node_idx
int
Node index number cloud status is collected from (zero-based)Out
cloud_name
string
cloud_nameOut
cloud_size
int
cloud_sizeOut
cloud_uptime_millis
long
cloud_uptime_millisOut
cloud_healthy
boolean
cloud_healthyOut
bad_nodes
int
Nodes reporting unhealthyOut
consensus
boolean
Cloud voting is stableOut
locked
boolean
Cloud is accepting new members or notOut
nodes
Iced[]
nodesOut
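
The two input fields of CloudV3, skip_ticks and _exclude_fields, are passed as query parameters. The sketch below builds such URLs without sending a request. The helper name build_url is hypothetical, and the base address is the default local address that appears in the h2o.init output earlier in this document.

```python
from urllib.parse import urlencode


def build_url(base, path, params=None, exclude_fields=None):
    """Build a GET URL; booleans become true/false, exclusions are comma-joined."""
    query = {}
    for name, value in (params or {}).items():
        query[name] = str(value).lower() if isinstance(value, bool) else value
    if exclude_fields:
        query["_exclude_fields"] = ",".join(exclude_fields)
    url = base.rstrip("/") + path
    if query:
        # Keep "/" and "," unescaped, matching the example in the field description.
        url += "?" + urlencode(query, safe="/,")
    return url


base = "http://127.0.0.1:54321"
cloud_url = build_url(base, "/3/Cloud", {"skip_ticks": True})
frames_url = build_url(
    base, "/3/Frames", exclude_fields=["frames/frame_id/URL", "__meta"]
)
print(cloud_url)
print(frames_url)
```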

ClusteringModelBuilderSchema

parameters
Parameters
Model builder parameters.In
__http_status
int
HTTP status to return for this build.In
_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In
algo
string
The algo name for this ModelBuilder.Out
algo_full_name
string
The pretty algo name for this ModelBuilder (e.g., Generalized Linear Model, rather than GLM).Out
can_build
enum[]
Model categories this ModelBuilder can build.Out
visibility
enum
Should the builder always be visible, be marked as beta, or only visible if the user starts up with the experimental flag?Out
job
Job
Job KeyOut
messages
ValidationMessage[]
Parameter validation messagesOut
error_count
int
Count of parameter validation errorsOut

ClusteringModelParametersSchema

k
int
Number of clustersIn/Out
model_id
Key
Destination id for this model; auto-generated if not specifiedIn/Out
training_frame
Key
Training frameIn/Out
validation_frame
Key
Validation frameIn/Out
ignored_columns
string[]
Ignored columnsIn/Out
ignore_const_cols
boolean
Ignore constant columnsIn/Out
score_each_iteration
boolean
Whether to score during each iteration of model trainingIn/Out

ColSpecifierV3

column_name
string
Name of the columnIn/Out
is_member_of_frames
string[]
List of fields which specify columns that must contain this columnIn/Out

ColV3

label
string
labelOut
missing_count
long
missingOut
zero_count
long
zerosOut
positive_infinity_count
long
positive infinitiesOut
negative_infinity_count
long
negative infinitiesOut
mins
double[]
minsOut
maxs
double[]
maxsOut
mean
double
meanOut
sigma
double
sigmaOut
type
string
datatype: {enum, string, int, real, time, uuid}Out
domain
string[]
domain; not-null for enum columns onlyOut
domain_cardinality
int
cardinality of this column’s domain; not-null for enum columns onlyOut
data
double[]
dataOut
string_data
string[]
string dataOut
precision
byte
decimal precision, -1 for all digitsOut
histogram_bins
long[]
Histogram bins; null if not computedOut
histogram_base
double
Start of histogram bin zeroOut
histogram_stride
double
Stride per binOut
percentiles
double[]
Percentile values, matching the default percentilesOut
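
The three histogram fields fit together as follows: histogram_base is where bin zero starts, histogram_stride is the width of each bin, and histogram_bins holds one count per bin. The sketch below reconstructs each bin's range from those values. It assumes half-open bins of equal width, which the field descriptions suggest but do not state outright. The function name and the sample numbers are hypothetical.

```python
def histogram_ranges(base, stride, bins):
    """Return (start, end, count) per bin; bin i starts at base + i * stride."""
    return [
        (base + i * stride, base + (i + 1) * stride, count)
        for i, count in enumerate(bins)
    ]


# Hypothetical values standing in for histogram_base, histogram_stride,
# and histogram_bins from a ColV3 response.
ranges = histogram_ranges(0.0, 2.5, [3, 0, 5])
for start, end, count in ranges:
    print("[{}, {}): {}".format(start, end, count))
```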

ColumnSpecsBase

name
string
Column NameOut
type
string
Column TypeOut
format
string
Column Format (printf)Out
description
string
Column DescriptionOut

ConfusionMatrixBase

table
TwoDimTable
Annotated confusion matrixOut

ConfusionMatrixV3

table
TwoDimTable
Annotated confusion matrixOut

CoxPHModelOutputV3

names
string[]
Column names.Out
domains
string[][]
Domains for categorical (enum) columns.Out
model_category
enum
Category of the model (e.g., Binomial).Out
model_summary
TwoDimTable
Model summaryOut
scoring_history
TwoDimTable
Scoring historyOut
training_metrics
ModelMetrics
Training data model metricsOut
validation_metrics
ModelMetrics
Validation data model metricsOut
help
Map
Help information for output fieldsOut

CoxPHModelV3

model_id
Key
Model keyIn/Out
parameters
CoxPHParameters
The build parameters for the model (e.g. K for KMeans).Out
output
CoxPHOutput
The build output for the model (e.g. the cluster centers for KMeans).Out
compatible_frames
string[]
Compatible frames, if requestedOut
checksum
long
Checksum for all the things that go into building the Model.Out
algo
string
The algo name for this Model.Out
algo_full_name
string
The pretty algo name for this Model (e.g., Generalized Linear Model, rather than GLM).Out

CoxPHParametersV3

model_id
Key
Destination id for this model; auto-generated if not specifiedIn/Out
training_frame
Key
Training frameIn/Out
validation_frame
Key
Validation frameIn/Out
ignored_columns
string[]
Ignored columnsIn/Out
ignore_const_cols
boolean
Ignore constant columnsIn/Out
score_each_iteration
boolean
Whether to score during each iteration of model trainingIn/Out

CoxPHV3

parameters
CoxPHParameters
Model builder parameters.In
__http_status
int
HTTP status to return for this build.In
_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In
algo
string
The algo name for this ModelBuilder.Out
algo_full_name
string
The pretty algo name for this ModelBuilder (e.g., Generalized Linear Model, rather than GLM).Out
can_build
enum[]
Model categories this ModelBuilder can build.Out
visibility
enum
Should the builder always be visible, be marked as beta, or only visible if the user starts up with the experimental flag?Out
job
Job
Job KeyOut
messages
ValidationMessage[]
Parameter validation messagesOut
error_count
int
Count of parameter validation errorsOut

CreateFrameV3

rows
long
Number of rowsIn
cols
int
Number of data columns (in addition to the first response column)In
seed
long
Random number seedIn
randomize
boolean
Whether frame should be randomizedIn
value
long
Constant value (for randomize=false)In
real_range
long
Range for real variables (-range … range)In
categorical_fraction
double
Fraction of categorical columns (for randomize=true)In
factors
int
Factor levels for categorical variablesIn
integer_fraction
double
Fraction of integer columns (for randomize=true)In
integer_range
long
Range for integer variables (-range … range)In
binary_fraction
double
Fraction of binary columns (for randomize=true)In
binary_ones_fraction
double
Fraction of 1’s in binary columnsIn
missing_fraction
double
Fraction of missing valuesIn
response_factors
int
Number of factor levels of the first column (1=real, 2=binomial, N=multinomial)In
has_response
boolean
Whether an additional response column should be generatedIn
key
Key
Job KeyIn
description
string
Job descriptionIn
dest
Key
destination keyIn/Out
status
string
job statusOut
progress
float
progress, from 0 to 1Out
progress_msg
string
current progress status descriptionOut
start_time
long
Start timeOut
msec
long
runtimeOut
exception
string
exceptionOut
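
To illustrate how the CreateFrameV3 input fields combine, the sketch below assembles a request body for POST /3/CreateFrame. Sending the fields as a form-encoded body is an assumption here, and no request is actually made. The helper name create_frame_body and the destination key are hypothetical, while the field names come from the schema above.

```python
from urllib.parse import urlencode


def create_frame_body(**fields):
    """Form-encode CreateFrameV3 input fields; booleans become true/false."""
    encoded = {}
    for name, value in fields.items():
        encoded[name] = str(value).lower() if isinstance(value, bool) else str(value)
    return urlencode(encoded)


body = create_frame_body(
    dest="synthetic.hex",       # hypothetical destination key
    rows=1000,                  # number of rows
    cols=10,                    # data columns, in addition to the response
    seed=42,
    randomize=True,
    categorical_fraction=0.2,   # the fractions apply when randomize is true
    integer_fraction=0.2,
    binary_fraction=0.1,
    missing_fraction=0.05,
    has_response=True,
)
print(body)
```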

DRFGridSearchV99

parameters
DRFParameters
Basic model builder parameters.In
grid_parameters
Map
Grid search parameters.In
total_models
int
Number of all models generated by grid search.Out
job
Job
Job Key.Out

DRFModelOutputV3

variable_importances
TwoDimTable
Variable ImportancesOut
init_f
double
The Intercept term, the initial model function value to which trees make adjustmentsOut
names
string[]
Column names.Out
domains
string[][]
Domains for categorical (enum) columns.Out
model_category
enum
Category of the model (e.g., Binomial).Out
model_summary
TwoDimTable
Model summaryOut
scoring_history
TwoDimTable
Scoring historyOut
training_metrics
ModelMetrics
Training data model metricsOut
validation_metrics
ModelMetrics
Validation data model metricsOut
help
Map
Help information for output fieldsOut

DRFModelV3

model_id
Key
Model keyIn/Out
parameters
DRFParameters
The build parameters for the model (e.g. K for KMeans).Out
output
DRFOutput
The build output for the model (e.g. the cluster centers for KMeans).Out
compatible_frames
string[]
Compatible frames, if requestedOut
checksum
long
Checksum for all the things that go into building the Model.Out
algo
string
The algo name for this Model.Out
algo_full_name
string
The pretty algo name for this Model (e.g., Generalized Linear Model, rather than GLM).Out

DRFParametersV3

mtries
int
Number of variables randomly sampled as candidates at each split. If set to -1, defaults to sqrt(p) for classification and p/3 for regression (where p is the number of predictors).In
sample_rate
float
Sample rate, from 0.0 to 1.0In
binomial_double_trees
boolean
For binary classification: Build 2x as many trees (one per class) - can lead to higher accuracy.In
ntrees
int
Number of trees.In
max_depth
int
Maximum tree depth.In
min_rows
double
Fewest allowed (weighted) observations in a leaf (in R called ‘nodesize’).In
nbins
int
For numerical columns (real/int), build a histogram of this many bins, then split at the best pointIn
nbins_cats
int
For categorical columns (enum), build a histogram of this many bins, then split at the best point. Higher values can lead to more overfitting.In
r2_stopping
double
Stop making trees when the R^2 metric equals or exceeds thisIn
seed
long
Seed for pseudo random number generator (if applicable)In
build_tree_one_node
boolean
Run on one node only; no network overhead but fewer cpus used. Suitable for small datasets.In
response_column
VecSpecifier
Response columnIn/Out
weights_column
VecSpecifier
Column with observation weightsIn/Out
offset_column
VecSpecifier
Offset columnIn/Out
balance_classes
boolean
Balance training data class counts via over/under-sampling (for imbalanced data).In/Out
class_sampling_factors
float[]
Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes.In/Out
max_after_balance_size
float
Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes.In/Out
max_confusion_matrix_size
int
Maximum size (# classes) for confusion matrices to be printed in the LogsIn/Out
max_hit_ratio_k
int
Max. number (top K) of predictions to use for hit ratio computation (for multi-class only, 0 to disable)In/Out
model_id
Key
Destination id for this model; auto-generated if not specifiedIn/Out
training_frame
Key
Training frameIn/Out
validation_frame
Key
Validation frameIn/Out
ignored_columns
string[]
Ignored columnsIn/Out
ignore_const_cols
boolean
Ignore constant columnsIn/Out
score_each_iteration
boolean
Whether to score during each iteration of model trainingIn/Out
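
The mtries default described above can be sketched as follows. This is a minimal illustration, not H2O's implementation: the function name default_mtries is hypothetical, and rounding down with a floor of one column is an assumption, since the description gives only sqrt(p) and p/3.

```python
import math


def default_mtries(num_predictors: int, classification: bool) -> int:
    """Resolve mtries = -1 to a concrete column count.

    Rounding down and the minimum of 1 are assumptions for illustration.
    """
    if num_predictors < 1:
        raise ValueError("num_predictors must be at least 1")
    if classification:
        # sqrt(p) candidates per split for classification
        value = math.floor(math.sqrt(num_predictors))
    else:
        # p/3 candidates per split for regression
        value = num_predictors // 3
    # Always sample at least one candidate column.
    return max(1, value)
```

For example, with 16 predictors a classification forest would consider 4 columns at each split under these assumptions, and a regression forest with 30 predictors would consider 10.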

DRFV3

parameters
DRFParameters
Model builder parameters.In
__http_status
int
HTTP status to return for this build.In
_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In
algo
string
The algo name for this ModelBuilder.Out
algo_full_name
string
The pretty algo name for this ModelBuilder (e.g., Generalized Linear Model, rather than GLM).Out
can_build
enum[]
Model categories this ModelBuilder can build.Out
visibility
enum
Should the builder always be visible, be marked as beta, or only visible if the user starts up with the experimental flag?Out
job
Job
Job KeyOut
messages
ValidationMessage[]
Parameter validation messagesOut
error_count
int
Count of parameter validation errorsOut

DStackTraceV3

node
string
Node nameOut
time
long
Unix epoch timeOut
thread_traces
string[]
One trace per threadOut

DeepLearningModelOutputV3

weights
Key[]
Frame keys for weight matricesIn
biases
Key[]
Frame keys for bias vectorsIn
normmul
double[]
Normalization/Standardization multipliers for numeric predictorsOut
normsub
double[]
Normalization/Standardization offsets for numeric predictorsOut
normrespmul
double[]
Normalization/Standardization multipliers for numeric responseOut
normrespsub
double[]
Normalization/Standardization offsets for numeric responseOut
catoffsets
int[]
Categorical offsets for one-hot encodingOut
variable_importances
TwoDimTable
Variable ImportancesOut
names
string[]
Column names.Out
domains
string[][]
Domains for categorical (enum) columns.Out
model_category
enum
Category of the model (e.g., Binomial).Out
model_summary
TwoDimTable
Model summaryOut
scoring_history
TwoDimTable
Scoring historyOut
training_metrics
ModelMetrics
Training data model metricsOut
validation_metrics
ModelMetrics
Validation data model metricsOut
help
Map
Help information for output fieldsOut
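
One plausible reading of the normalization and one-hot fields above is sketched below. It assumes that a numeric value is standardized as (value - normsub) * normmul, and that catoffsets holds the running total of category counts so that each categorical column owns a contiguous block of input positions. Both are assumptions for illustration, and all function names are hypothetical.

```python
from itertools import accumulate


def standardize(values, normsub, normmul):
    """Apply per-column offset, then multiplier (assumed order)."""
    return [(v - s) * m for v, s, m in zip(values, normsub, normmul)]


def categorical_offsets(cardinalities):
    """Start position of each categorical column, plus the total width."""
    return [0] + list(accumulate(cardinalities))


def one_hot_index(offsets, column, level):
    """Position of a (column, level) pair in the flattened one-hot vector."""
    width = offsets[column + 1] - offsets[column]
    if not 0 <= level < width:
        raise ValueError("level out of range for this column")
    return offsets[column] + level
```

With three categorical columns of 3, 2, and 4 levels, the offsets would be 0, 3, 5, 9, so level 1 of the second column maps to position 4.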

DeepLearningModelV3

model_id
Key
Model keyIn/Out
parameters
DeepLearningParameters
The build parameters for the model (e.g. K for KMeans).Out
output
DeepLearningModelOutput
The build output for the model (e.g. the cluster centers for KMeans).Out
compatible_frames
string[]
Compatible frames, if requestedOut
checksum
long
Checksum for all the things that go into building the Model.Out
algo
string
The algo name for this Model.Out
algo_full_name
string
The pretty algo name for this Model (e.g., Generalized Linear Model, rather than GLM).Out

DeepLearningParametersV3

response_column
VecSpecifier
Response columnIn/Out
weights_column
VecSpecifier
Column with observation weightsIn/Out
offset_column
VecSpecifier
Offset columnIn/Out
balance_classes
boolean
Balance training data class counts via over/under-sampling (for imbalanced data).In/Out
class_sampling_factors
float[]
Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes.In/Out
max_after_balance_size
float
Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes.In/Out
max_confusion_matrix_size
int
Maximum size (# classes) for confusion matrices to be printed in the LogsIn/Out
max_hit_ratio_k
int
Max. number (top K) of predictions to use for hit ratio computation (for multi-class only, 0 to disable)In/Out
checkpoint
Key
Model checkpoint to resume training withIn/Out
overwrite_with_best_model
boolean
If enabled, override the final model with the best model found during trainingIn/Out
autoencoder
boolean
Auto-EncoderIn/Out
use_all_factor_levels
boolean
Use all factor levels of categorical variables. Otherwise, the first factor level is omitted (without loss of accuracy). Useful for variable importances and auto-enabled for autoencoder.In/Out
activation
enum
Activation functionIn/Out
hidden
int[]
Hidden layer sizes (e.g. 100,100).In/Out
epochs
double
How many times the dataset should be iterated (streamed), can be fractionalIn/Out
train_samples_per_iteration
long
Number of training samples (globally) per MapReduce iteration. Special values are 0: one epoch, -1: all available data (e.g., replicated training data), -2: automaticIn/Out
target_ratio_comm_to_comp
double
Target ratio of communication overhead to computation. Only for multi-node operation and train_samples_per_iteration=-2 (auto-tuning)In/Out
seed
long
Seed for random numbers (affects sampling) - Note: only reproducible when running single-threadedIn/Out
adaptive_rate
boolean
Adaptive learning rateIn/Out
rho
double
Adaptive learning rate time decay factor (similarity to prior updates)In/Out
epsilon
double
Adaptive learning rate smoothing factor (to avoid divisions by zero and allow progress)In/Out
rate
double
Learning rate (higher => less stable, lower => slower convergence)In/Out
rate_annealing
double
Learning rate annealing: rate / (1 + rate_annealing * samples)In/Out
rate_decay
double
Learning rate decay factor between layers (N-th layer: rate * rate_decay^(N-1))In/Out
momentum_start
double
Initial momentum at the beginning of training (try 0.5)In/Out
momentum_ramp
double
Number of training samples for which momentum increasesIn/Out
momentum_stable
double
Final momentum after the ramp is over (try 0.99)In/Out
nesterov_accelerated_gradient
boolean
Use Nesterov accelerated gradient (recommended)In/Out
input_dropout_ratio
double
Input layer dropout ratio (can improve generalization, try 0.1 or 0.2)In/Out
hidden_dropout_ratios
double[]
Hidden layer dropout ratios (can improve generalization), specify one value per hidden layer, defaults to 0.5In/Out
l1
double
L1 regularization (can add stability and improve generalization, causes many weights to become 0)In/Out
l2
double
L2 regularization (can add stability and improve generalization, causes many weights to be small)In/Out
max_w2
float
Constraint for squared sum of incoming weights per unit (e.g. for Rectifier)In/Out
initial_weight_distribution
enum
Initial Weight DistributionIn/Out
initial_weight_scale
double
Scale of the initial weight distribution (Uniform: -value…value, Normal: stddev)In/Out
loss
enum
Loss functionIn/Out
score_interval
double
Shortest time interval (in secs) between model scoringIn/Out
score_training_samples
long
Number of training set samples for scoring (0 for all)In/Out
score_validation_samples
long
Number of validation set samples for scoring (0 for all)In/Out
score_duty_cycle
double
Maximum duty cycle fraction for scoring (lower: more training, higher: more scoring).In/Out
classification_stop
double
Stopping criterion for classification error fraction on training data (-1 to disable)In/Out
regression_stop
double
Stopping criterion for regression error (MSE) on training data (-1 to disable)In/Out
quiet_mode
boolean
Enable quiet mode for less output to standard outputIn/Out
score_validation_sampling
enum
Method used to sample validation dataset for scoringIn/Out
diagnostics
boolean
Enable diagnostics for hidden layersIn/Out
variable_importances
boolean
Compute variable importances for input features (Gedeon method) - can be slow for large networksIn/Out
fast_mode
boolean
Enable fast mode (minor approximation in back-propagation)In/Out
force_load_balance
boolean
Force extra load balancing to increase training speed for small datasets (to keep all cores busy)In/Out
replicate_training_data
boolean
Replicate the entire training dataset onto every node for faster training on small datasetsIn/Out
single_node_mode
boolean
Run on a single node for fine-tuning of model parametersIn/Out
shuffle_training_data
boolean
Enable shuffling of training data (recommended if training data is replicated and train_samples_per_iteration is close to #nodes x #rows, or if using balance_classes)In/Out
missing_values_handling
enum
Handling of missing values. Either Skip or MeanImputation.In/Out
sparse
boolean
Sparse data handling (Experimental).In/Out
col_major
boolean
Use a column major weight matrix for input layer. Can speed up forward propagation, but might slow down backpropagation (Experimental).In/Out
average_activation
double
Average activation for sparse auto-encoder (Experimental)In/Out
sparsity_beta
double
Sparsity regularization (Experimental)In/Out
max_categorical_features
int
Max. number of categorical features, enforced via hashing (Experimental)In/Out
reproducible
boolean
Force reproducibility on small data (will be slow - only uses 1 thread)In/Out
export_weights_and_biases
boolean
Whether to export Neural Network weights and biases to H2O FramesIn/Out
model_id
Key
Destination id for this model; auto-generated if not specifiedIn/Out
training_frame
Key
Training frameIn/Out
validation_frame
Key
Validation frameIn/Out
ignored_columns
string[]
Ignored columnsIn/Out
ignore_const_cols
boolean
Ignore constant columnsIn/Out
score_each_iteration
boolean
Whether to score during each iteration of model trainingIn/Out
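
The learning-rate and momentum parameters above can be illustrated with a short sketch. The annealing formula is taken directly from the rate_annealing description. The per-layer decay is read here with rate_decay as the base of the exponent, and the momentum ramp is assumed to be linear in the number of training samples; neither is confirmed by the field descriptions. All function names are hypothetical.

```python
def annealed_rate(rate, rate_annealing, samples):
    """rate / (1 + rate_annealing * samples), as given for rate_annealing."""
    return rate / (1.0 + rate_annealing * samples)


def layer_rate(rate, rate_decay, layer):
    """Learning rate for the N-th layer (1-based), assuming rate_decay is the base."""
    if layer < 1:
        raise ValueError("layer is 1-based")
    return rate * rate_decay ** (layer - 1)


def momentum_at(samples, momentum_start, momentum_ramp, momentum_stable):
    """Momentum after a number of training samples (linear ramp is an assumption)."""
    if momentum_ramp <= 0 or samples >= momentum_ramp:
        return momentum_stable
    fraction = samples / momentum_ramp
    return momentum_start + fraction * (momentum_stable - momentum_start)
```

Under these assumptions, a rate of 0.005 with rate_annealing of 1e-6 is halved after one million samples, and momentum moves from momentum_start to momentum_stable over the first momentum_ramp samples and stays there afterwards.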

DeepLearningV3

parameters
DeepLearningParameters
Model builder parameters.In
__http_status
int
HTTP status to return for this build.In
_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In
algo
string
The algo name for this ModelBuilder.Out
algo_full_name
string
The pretty algo name for this ModelBuilder (e.g., Generalized Linear Model, rather than GLM).Out
can_build
enum[]
Model categories this ModelBuilder can build.Out
visibility
enum
Should the builder always be visible, be marked as beta, or only visible if the user starts up with the experimental flag?Out
job
Job
Job KeyOut
messages
ValidationMessage[]
Parameter validation messagesOut
error_count
int
Count of parameter validation errorsOut

DownloadDataV3

frame_id
Key
Frame to downloadIn
hex_string
boolean
Emit double values in a machine readable lossless format with Double.toHexString().In
_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In
csv
string
CSV StreamOut
filename
string
Suggested FilenameOut
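
The hex_string option matters because a decimal rendering of a double can lose bits, while a hexadecimal floating-point string does not. The sketch below shows the round trip on the client side in Python, whose float.fromhex accepts the same style of string that Java's Double.toHexString() produces (for example 0x1.0p0). The helper names are hypothetical, and the sketch does not contact an H2O server.

```python
def to_hex_string(value: float) -> str:
    """Lossless hexadecimal rendering of a double (Python's float.hex)."""
    return value.hex()


def from_hex_string(text: str) -> float:
    """Parse a hexadecimal floating-point string back into a double."""
    return float.fromhex(text)


def survives_decimal(value: float, digits: int = 6) -> bool:
    """Check whether a short decimal rendering reproduces the exact value."""
    return float(f"{value:.{digits}g}") == value
```

A value such as 1/3 does not survive a six-digit decimal rendering, but it is recovered exactly from its hexadecimal form.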

EventV3

date
string
Time when the event was recorded. Format is hh:mm:ss:msIn
nanos
long
Time in nanosIn
type
enum
type of recorded eventIn

ExampleModelOutputV3

iterations
int
Iterations executedIn
maxs
double[]
(No description available)In
names
string[]
Column names.Out
domains
string[][]
Domains for categorical (enum) columns.Out
model_category
enum
Category of the model (e.g., Binomial).Out
model_summary
TwoDimTable
Model summaryOut
scoring_history
TwoDimTable
Scoring historyOut
training_metrics
ModelMetrics
Training data model metricsOut
validation_metrics
ModelMetrics
Validation data model metricsOut
help
Map
Help information for output fieldsOut

ExampleModelV3

model_id
Key
Model keyIn/Out
parameters
ExampleParameters
The build parameters for the model (e.g. K for KMeans).Out
output
ExampleOutput
The build output for the model (e.g. the cluster centers for KMeans).Out
compatible_frames
string[]
Compatible frames, if requestedOut
checksum
long
Checksum for all the things that go into building the Model.Out
algo
string
The algo name for this Model.Out
algo_full_name
string
The pretty algo name for this Model (e.g., Generalized Linear Model, rather than GLM).Out

ExampleParametersV3

max_iterations
int
Maximum training iterations.In
model_id
Key
Destination id for this model; auto-generated if not specifiedIn/Out
training_frame
Key
Training frameIn/Out
validation_frame
Key
Validation frameIn/Out
ignored_columns
string[]
Ignored columnsIn/Out
ignore_const_cols
boolean
Ignore constant columnsIn/Out
score_each_iteration
boolean
Whether to score during each iteration of model trainingIn/Out

ExampleV3

parameters
ExampleParameters
Model builder parameters.In
__http_status
int
HTTP status to return for this build.In
_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In
algo
string
The algo name for this ModelBuilder.Out
algo_full_name
string
The pretty algo name for this ModelBuilder (e.g., Generalized Linear Model, rather than GLM).Out
can_build
enum[]
Model categories this ModelBuilder can build.Out
visibility
enum
Should the builder always be visible, be marked as beta, or only visible if the user starts up with the experimental flag?Out
job
Job
Job KeyOut
messages
ValidationMessage[]
Parameter validation messagesOut
error_count
int
Count of parameter validation errorsOut

FieldMetadataBase

schema_name
string
Schema name for this field, if it is_schema, or the name of the enum, if it’s an enum.In
name
string
Field name in the SchemaOut
type
string
Type for this fieldOut
is_schema
boolean
Type for this field is itself a Schema.Out
value
Polymorphic
Value for this fieldOut
help
string
A short help description to appear alongside the field in a UIOut
label
string
The label that should be displayed for the field if the name is insufficientOut
required
boolean
Is this field required, or is the default value generally sufficient?Out
level
enum
How important is this field? The web UI uses the level to do a slow reveal of the parametersOut
direction
enum
Is this field an input, output or inout?Out
values
string[]
For enum-type fields the allowed values are specified using the values annotation; this is used in UIs to tell the user the allowed values, and for validationOut
json
boolean
Should this field be rendered in the JSON representation?Out
is_member_of_frames
string[]
For Vec-type fields this is the set of Frame-type fields which must contain the named column; for example, for a SupervisedModel the response_column must be in both the training_frame and (if it’s set) the validation_frameOut
is_mutually_exclusive_with
string[]
For Vec-type fields this is the set of other Vec-type fields which must contain mutually exclusive values; for example, for a SupervisedModel the response_column must be mutually exclusive with the weights_columnOut

FieldMetadataV3

schema_name
string
Schema name for this field, if it is_schema, or the name of the enum, if it’s an enum.In
name
string
Field name in the SchemaOut
type
string
Type for this fieldOut
is_schema
boolean
Type for this field is itself a Schema.Out
value
Polymorphic
Value for this fieldOut
help
string
A short help description to appear alongside the field in a UIOut
label
string
The label that should be displayed for the field if the name is insufficientOut
required
boolean
Is this field required, or is the default value generally sufficient?Out
level
enum
How important is this field? The web UI uses the level to do a slow reveal of the parametersOut
direction
enum
Is this field an input, output or inout?Out
values
string[]
For enum-type fields the allowed values are specified using the values annotation; this is used in UIs to tell the user the allowed values, and for validationOut
json
boolean
Should this field be rendered in the JSON representation?Out
is_member_of_frames
string[]
For Vec-type fields this is the set of Frame-type fields which must contain the named column; for example, for a SupervisedModel the response_column must be in both the training_frame and (if it’s set) the validation_frameOut
is_mutually_exclusive_with
string[]
For Vec-type fields this is the set of other Vec-type fields which must contain mutually exclusive values; for example, for a SupervisedModel the response_column must be mutually exclusive with the weights_columnOut

FindV3

key
Frame
Frame to searchIn
column
string
Column, or null for allIn
row
long
Starting row for searchIn
match
string
Value to search for; leave blank for a search for missing valuesIn
_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In
prev
long
previous row with matching value, or -1Out
next
long
next row with matching value, or -1Out

FrameBase

frame_id
Key
Frame IDIn/Out
byte_size
long
Total data size in bytesOut
is_text
boolean
Is this Frame raw unparsed data?Out

FrameKeyV3

name
string
Name (string representation) for this Key.In/Out
type
string
Name (string representation) for the type of Keyed this Key points to.In/Out
URL
string
URL for the resource that this Key points to, if one exists.In/Out

FrameSynopsisV3

frame_id
Key
Frame IDIn/Out
rows
long
Number of rows in the FrameOut
columns
long
Number of columns in the FrameOut
byte_size
long
Total data size in bytesOut
is_text
boolean
Is this Frame raw unparsed data?Out

FrameV3

row_offset
long
Row offset to displayIn
row_count
int
Number of rows to displayIn/Out
column_offset
int
Column offset to returnIn/Out
column_count
int
Number of columns to returnIn/Out
total_column_count
int
Total number of columns in the FrameIn/Out
frame_id
Key
Frame IDIn/Out
checksum
long
checksumOut
rows
long
Number of rows in the FrameOut
default_percentiles
double[]
Default percentiles, from 0 to 1Out
columns
Vec[]
Columns in the FrameOut
compatible_models
string[]
Compatible models, if requestedOut
vec_ids
Key[]
The set of IDs of vectors in the FrameOut
chunk_summary
TwoDimTable
Chunk summaryOut
distribution_summary
TwoDimTable
Distribution summaryOut
byte_size
long
Total data size in bytesOut
is_text
boolean
Is this Frame raw unparsed data?Out
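
Because row_offset and row_count select a window of rows, a client that wants to walk a large Frame can request it in pages. The sketch below only computes the (row_offset, row_count) pairs; the function name row_windows is hypothetical, and how the pairs are sent to the server is left out.

```python
def row_windows(total_rows: int, page_size: int):
    """Yield (row_offset, row_count) pairs that cover total_rows exactly once."""
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    offset = 0
    while offset < total_rows:
        # The last window is shortened so it does not run past the end.
        yield offset, min(page_size, total_rows - offset)
        offset += page_size
```

For a Frame with 25 rows and a page size of 10, this yields the windows (0, 10), (10, 10), and (20, 5).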

FramesBase

frame_id
Key
Name of Frame of interestIn
column
string
Name of column of interestIn
find_compatible_models
boolean
Find and return compatible models?In
path
string
File output pathIn
force
boolean
Overwrite existing fileIn
_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In
row_offset
long
Row offset to returnIn/Out
row_count
int
Number of rows to returnIn/Out
column_offset
int
Column offset to returnIn/Out
column_count
int
Number of columns to returnIn/Out
frames
Iced[]
FramesOut
compatible_models
Model[]
Compatible modelsOut
domain
string[][]
DomainsOut

FramesV3

frame_id
Key
Name of Frame of interestIn
column
string
Name of column of interestIn
find_compatible_models
boolean
Find and return compatible models?In
path
string
File output pathIn
force
boolean
Overwrite existing fileIn
_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In
row_offset
long
Row offset to returnIn/Out
row_count
int
Number of rows to returnIn/Out
column_offset
int
Column offset to returnIn/Out
column_count
int
Number of columns to returnIn/Out
frames
Iced[]
FramesOut
compatible_models
Model[]
Compatible modelsOut
domain
string[][]
DomainsOut
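
The _exclude_fields description gives an example query against /3/Frames. A minimal sketch of building such a URL is shown below. The function name frames_url and the base address are hypothetical, and passing row_offset or row_count as query parameters is an assumption based on their In/Out direction.

```python
from urllib.parse import urlencode


def frames_url(base_url, exclude_fields=(), **params):
    """Build a /3/Frames URL with optional paging and _exclude_fields."""
    query = dict(params)
    if exclude_fields:
        # Field paths are joined with commas, as in the documented example.
        query["_exclude_fields"] = ",".join(exclude_fields)
    url = base_url.rstrip("/") + "/3/Frames"
    if query:
        # Keep "/" and "," readable instead of percent-encoding them.
        url += "?" + urlencode(query, safe="/,")
    return url
```

Calling frames_url("http://localhost:54321", ["frames/frame_id/URL", "__meta"]) reproduces the query string from the field description.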

GBMGridSearchV99

parameters
GBMParameters
Basic model builder parameters.In
grid_parameters
Map
Grid search parameters.In
total_models
int
Number of all models generated by grid search.Out
job
Job
Job Key.Out
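
The relationship between grid_parameters and total_models can be sketched by expanding the map into individual parameter sets. This assumes the grid is a full Cartesian product of the listed values, which the field descriptions do not state; the function name expand_grid is hypothetical.

```python
from itertools import product


def expand_grid(grid_parameters):
    """Expand {name: [values]} into one dict per model (Cartesian product assumed)."""
    names = sorted(grid_parameters)
    combinations = product(*(grid_parameters[name] for name in names))
    return [dict(zip(names, combo)) for combo in combinations]
```

Under this assumption, a grid of two ntrees values and three max_depth values expands to six parameter sets, which would be the expected total_models.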

GBMModelOutputV3

variable_importances
TwoDimTable
Variable ImportancesOut
init_f
double
The Intercept term, the initial model function value to which trees make adjustmentsOut
names
string[]
Column names.Out
domains
string[][]
Domains for categorical (enum) columns.Out
model_category
enum
Category of the model (e.g., Binomial).Out
model_summary
TwoDimTable
Model summaryOut
scoring_history
TwoDimTable
Scoring historyOut
training_metrics
ModelMetrics
Training data model metricsOut
validation_metrics
ModelMetrics
Validation data model metricsOut
help
Map
Help information for output fieldsOut

GBMModelV3

model_id
Key
Model keyIn/Out
parameters
GBMParameters
The build parameters for the model (e.g. K for KMeans).Out
output
GBMOutput
The build output for the model (e.g. the cluster centers for KMeans).Out
compatible_frames
string[]
Compatible frames, if requestedOut
checksum
long
Checksum for all the things that go into building the Model.Out
algo
string
The algo name for this Model.Out
algo_full_name
string
The pretty algo name for this Model (e.g., Generalized Linear Model, rather than GLM).Out

GBMParametersV3

learn_rate
float
Learning rate from 0.0 to 1.0In
distribution
enum
Distribution functionIn
ntrees
int
Number of trees.In
max_depth
int
Maximum tree depth.In
min_rows
double
Fewest allowed (weighted) observations in a leaf (in R called ‘nodesize’).In
nbins
int
For numerical columns (real/int), build a histogram of this many bins, then split at the best pointIn
nbins_cats
int
For categorical columns (enum), build a histogram of this many bins, then split at the best point. Higher values can lead to more overfitting.In
r2_stopping
double
Stop making trees when the R^2 metric equals or exceeds thisIn
seed
long
Seed for pseudo random number generator (if applicable)In
build_tree_one_node
boolean
Run on one node only; no network overhead, but fewer CPUs are used. Suitable for small datasets.In
response_column
VecSpecifier
Response columnIn/Out
weights_column
VecSpecifier
Column with observation weightsIn/Out
offset_column
VecSpecifier
Offset columnIn/Out
balance_classes
boolean
Balance training data class counts via over/under-sampling (for imbalanced data).In/Out
class_sampling_factors
float[]
Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes.In/Out
max_after_balance_size
float
Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes.In/Out
max_confusion_matrix_size
int
Maximum size (# classes) for confusion matrices to be printed in the LogsIn/Out
max_hit_ratio_k
int
Max. number (top K) of predictions to use for hit ratio computation (for multi-class only, 0 to disable)In/Out
model_id
Key
Destination id for this model; auto-generated if not specifiedIn/Out
training_frame
Key
Training frameIn/Out
validation_frame
Key
Validation frameIn/Out
ignored_columns
string[]
Ignored columnsIn/Out
ignore_const_cols
boolean
Ignore constant columnsIn/Out
score_each_iteration
boolean
Whether to score during each iteration of model trainingIn/Out
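
The hit ratio controlled by max_hit_ratio_k can be illustrated as follows: for each k up to the maximum, it is the fraction of rows whose actual class is among the k most probable predicted classes. This is a sketch of the general metric, not H2O's code, and the function name hit_ratios is hypothetical.

```python
def hit_ratios(probabilities, actual, max_k):
    """Return [hit ratio at k=1, ..., hit ratio at k=max_k].

    probabilities: one list of per-class probabilities per row
    actual: the true class index for each row
    """
    if not actual:
        raise ValueError("at least one row is required")
    hits = [0] * max_k
    for probs, truth in zip(probabilities, actual):
        # Class indices ordered from most to least probable.
        ranked = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
        rank = ranked.index(truth)
        # A hit at rank r counts for every k greater than r.
        for k in range(rank, max_k):
            hits[k] += 1
    return [h / len(actual) for h in hits]
```

With max_k set to 0 the function returns an empty list, which matches the "0 to disable" convention in the field description.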

GBMV3

parameters
GBMParameters
Model builder parameters.In
__http_status
int
HTTP status to return for this build.In
_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In
algo
string
The algo name for this ModelBuilder.Out
algo_full_name
string
The pretty algo name for this ModelBuilder (e.g., Generalized Linear Model, rather than GLM).Out
can_build
enum[]
Model categories this ModelBuilder can build.Out
visibility
enum
Should the builder always be visible, be marked as beta, or only visible if the user starts up with the experimental flag?Out
job
Job
Job KeyOut
messages
ValidationMessage[]
Parameter validation messagesOut
error_count
int
Count of parameter validation errorsOut

GLMModelOutputV3

coefficients_table
TwoDimTable
Table of CoefficientsIn
standardized_coefficients_magnitude
TwoDimTable
Standardized Coefficient MagnitudesIn
names
string[]
Column names.Out
domains
string[][]
Domains for categorical (enum) columns.Out
model_category
enum
Category of the model (e.g., Binomial).Out
model_summary
TwoDimTable
Model summaryOut
scoring_history
TwoDimTable
Scoring historyOut
training_metrics
ModelMetrics
Training data model metricsOut
validation_metrics
ModelMetrics
Validation data model metricsOut
help
Map
Help information for output fieldsOut

GLMModelV3

model_id
Key
Model keyIn/Out
parameters
GLMParameters
The build parameters for the model (e.g. K for KMeans).Out
output
GLMOutput
The build output for the model (e.g. the cluster centers for KMeans).Out
compatible_frames
string[]
Compatible frames, if requestedOut
checksum
long
Checksum for all the things that go into building the Model.Out
algo
string
The algo name for this Model.Out
algo_full_name
string
The pretty algo name for this Model (e.g., Generalized Linear Model, rather than GLM).Out

GLMParametersV3

family
enum
Family. Use binomial for classification with logistic regression, others are for regression problems.In
tweedie_variance_power
double
Tweedie variance powerIn
tweedie_link_power
double
Tweedie link powerIn
solver
enum
Auto picks the solver better suited to the given dataset; with lambda search, the solver may change during computation. IRLSM is fast on problems with a small number of predictors and for lambda search with an L1 penalty; L_BFGS scales better for datasets with many columns.In
alpha
double[]
distribution of regularization between L1 and L2.In
lambda
double[]
regularization strengthIn
lambda_search
boolean
Use lambda search starting at lambda max; the given lambda is then interpreted as lambda minIn
nlambdas
int
number of lambdas to be used in a searchIn
standardize
boolean
Standardize numeric columns to have zero mean and unit varianceIn
non_negative
boolean
Restrict coefficients (not intercept) to be non-negativeIn
max_iterations
int
Maximum number of iterationsIn
beta_epsilon
double
converge if beta changes less (using L-infinity norm) than beta epsilon, ONLY applies to IRLSM solverIn
objective_epsilon
double
converge if objective value changes less than thisIn
gradient_epsilon
double
converge if objective changes less (using L-infinity norm) than this, ONLY applies to L-BFGS solverIn
link
enum
(No description available)In
intercept
boolean
include constant term in the modelIn
prior
double
prior probability for y==1. To be used only for logistic regression iff the data has been sampled and the mean of response does not reflect reality.In
lambda_min_ratio
double
min lambda used in lambda search, specified as a ratio of lambda_maxIn
beta_constraints
Key
beta constraintsIn
max_active_predictors
int
Maximum number of active predictors during computation. Use as a stopping criterion to prevent expensive model building with many predictors.In
response_column
VecSpecifier
Response columnIn/Out
weights_column
VecSpecifier
Column with observation weightsIn/Out
offset_column
VecSpecifier
Offset columnIn/Out
balance_classes
boolean
Balance training data class counts via over/under-sampling (for imbalanced data).In/Out
class_sampling_factors
float[]
Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes.In/Out
max_after_balance_size
float
Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes.In/Out
max_confusion_matrix_size
int
Maximum size (# classes) for confusion matrices to be printed in the LogsIn/Out
max_hit_ratio_k
int
Max. number (top K) of predictions to use for hit ratio computation (for multi-class only, 0 to disable)In/Out
model_id
Key
Destination id for this model; auto-generated if not specifiedIn/Out
training_frame
Key
Training frameIn/Out
validation_frame
Key
Validation frameIn/Out
ignored_columns
string[]
Ignored columnsIn/Out
ignore_const_cols
boolean
Ignore constant columnsIn/Out
score_each_iteration
boolean
Whether to score during each iteration of model trainingIn/Out
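
The roles of alpha, lambda, lambda_min_ratio, and nlambdas can be sketched as below. The penalty uses the common elastic-net form lambda * (alpha * L1 + (1 - alpha) / 2 * L2 squared), and the lambda search values are spaced geometrically from lambda max down to lambda_max * lambda_min_ratio. The factor of one half and the geometric spacing are assumptions, and both function names are hypothetical.

```python
def elastic_net_penalty(beta, alpha, lam):
    """Penalty on the coefficients; alpha = 1 is pure L1, alpha = 0 is pure L2."""
    l1 = sum(abs(b) for b in beta)
    l2_squared = sum(b * b for b in beta)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_squared)


def lambda_path(lambda_max, lambda_min_ratio, nlambdas):
    """Decreasing lambda values for a lambda search (geometric spacing assumed)."""
    if nlambdas < 2:
        return [lambda_max]
    step = lambda_min_ratio ** (1.0 / (nlambdas - 1))
    return [lambda_max * step ** i for i in range(nlambdas)]
```

For example, lambda_path(1.0, 1e-4, 5) starts at 1.0 and ends near 1e-4, with each value roughly one tenth of the previous one.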

GLMV3

parameters
GLMParameters
Model builder parameters.In
__http_status
int
HTTP status to return for this build.In
_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In
algo
string
The algo name for this ModelBuilder.Out
algo_full_name
string
The pretty algo name for this ModelBuilder (e.g., Generalized Linear Model, rather than GLM).Out
can_build
enum[]
Model categories this ModelBuilder can build.Out
visibility
enum
Should the builder always be visible, be marked as beta, or only visible if the user starts up with the experimental flag?Out
job
Job
Job KeyOut
messages
ValidationMessage[]
Parameter validation messagesOut
error_count
int
Count of parameter validation errorsOut

GLRMModelOutputV99

iterations
int
Iterations executedIn
objective
double
Objective valueIn
avg_change_obj
double
Average change in objective value on final iterationIn
step_size
double
Final step sizeIn
archetypes
double[][]
Mapping from training data to lower dimensional k-spaceIn
singular_vals
double[]
Singular values of XY matrixIn
eigenvectors
double[][]
Eigenvectors of XY matrixIn
loading_key
Key
Frame key for X matrixIn
names
string[]
Column names.Out
domains
string[][]
Domains for categorical (enum) columns.Out
model_category
enum
Category of the model (e.g., Binomial).Out
model_summary
TwoDimTable
Model summaryOut
scoring_history
TwoDimTable
Scoring historyOut
training_metrics
ModelMetrics
Training data model metricsOut
validation_metrics
ModelMetrics
Validation data model metricsOut
help
Map
Help information for output fieldsOut

GLRMModelV99

model_id
Key
Model keyIn/Out
parameters
GLRMParameters
The build parameters for the model (e.g. K for KMeans).Out
output
GLRMOutput
The build output for the model (e.g. the cluster centers for KMeans).Out
compatible_frames
string[]
Compatible frames, if requestedOut
checksum
long
Checksum for all the things that go into building the Model.Out
algo
string
The algo name for this Model.Out
algo_full_name
string
The pretty algo name for this Model (e.g., Generalized Linear Model, rather than GLM).Out

GLRMParametersV99

transform
enum
Transformation of training dataIn
k
int
Rank of matrix approximationIn
loss
enum
Loss functionIn
regularization_x
enum
Regularization function for X matrixIn
regularization_y
enum
Regularization function for Y matrixIn
gamma_x
double
Regularization weight on X matrixIn
gamma_y
double
Regularization weight on Y matrixIn
max_iterations
int
Maximum number of iterationsIn
init_step_size
double
Initial step sizeIn
min_step_size
double
Minimum step sizeIn
seed
long
RNG seed for initializationIn
init
enum
Initialization modeIn
user_points
Key
User-specified initial YIn
loading_key
Key
Frame key to save resulting XIn
recover_svd
boolean
Recover singular values and eigenvectors of XYIn
model_id
Key
Destination id for this model; auto-generated if not specifiedIn/Out
training_frame
Key
Training frameIn/Out
validation_frame
Key
Validation frameIn/Out
ignored_columns
string[]
Ignored columnsIn/Out
ignore_const_cols
boolean
Ignore constant columnsIn/Out
score_each_iteration
boolean
Whether to score during each iteration of model trainingIn/Out

GLRMV99

parameters
GLRMParameters
Model builder parameters.In
__http_status
int
HTTP status to return for this build.In
_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In
algo
string
The algo name for this ModelBuilder.Out
algo_full_name
string
The pretty algo name for this ModelBuilder (e.g., Generalized Linear Model, rather than GLM).Out
can_build
enum[]
Model categories this ModelBuilder can build.Out
visibility
enum
Should the builder always be visible, be marked as beta, or only visible if the user starts up with the experimental flag?Out
job
Job
Job KeyOut
messages
ValidationMessage[]
Parameter validation messagesOut
error_count
int
Count of parameter validation errorsOut

GarbageCollectV3

(No fields)

GrepModelOutputV3

matches
string[]
Matching stringsIn
offsets
long[]
Byte offsets of matchesIn
names
string[]
Column names.Out
domains
string[][]
Domains for categorical (enum) columns.Out
model_category
enum
Category of the model (e.g., Binomial).Out
model_summary
TwoDimTable
Model summaryOut
scoring_history
TwoDimTable
Scoring historyOut
training_metrics
ModelMetrics
Training data model metricsOut
validation_metrics
ModelMetrics
Validation data model metricsOut
help
Map
Help information for output fieldsOut

GrepModelV3

model_id
Key
Model keyIn/Out
parameters
GrepParameters
The build parameters for the model (e.g. K for KMeans).Out
output
GrepOutput
The build output for the model (e.g. the cluster centers for KMeans).Out
compatible_frames
string[]
Compatible frames, if requestedOut
checksum
long
Checksum for all the things that go into building the Model.Out
algo
string
The algo name for this Model.Out
algo_full_name
string
The pretty algo name for this Model (e.g., Generalized Linear Model, rather than GLM).Out

GrepParametersV3

regex
string
Regular expression to matchIn
model_id
Key
Destination id for this model; auto-generated if not specifiedIn/Out
training_frame
Key
Training frameIn/Out
validation_frame
Key
Validation frameIn/Out
ignored_columns
string[]
Ignored columnsIn/Out
ignore_const_cols
boolean
Ignore constant columnsIn/Out
score_each_iteration
boolean
Whether to score during each iteration of model trainingIn/Out

GrepV3

parameters
GrepParameters
Model builder parameters.In
__http_status
int
HTTP status to return for this build.In
_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In
algo
string
The algo name for this ModelBuilder.Out
algo_full_name
string
The pretty algo name for this ModelBuilder (e.g., Generalized Linear Model, rather than GLM).Out
can_build
enum[]
Model categories this ModelBuilder can build.Out
visibility
enum
Should the builder always be visible, be marked as beta, or only visible if the user starts up with the experimental flag?Out
job
Job
Job KeyOut
messages
ValidationMessage[]
Parameter validation messagesOut
error_count
int
Count of parameter validation errorsOut

GridSearchSchema

parameters
Parameters
Basic model builder parameters.In
grid_parameters
Map
Grid search parameters.In
total_models
int
Number of all models generated by grid search.Out
job
Job
Job Key.Out
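
One way to picture `grid_parameters` and `total_models` is as a Cartesian product over per-parameter value lists. The sketch below assumes that interpretation (the schema does not say how the grid is enumerated), and `expand_grid` is a hypothetical helper, not part of H2O.

```python
from itertools import product


def expand_grid(grid_parameters):
    """Return one parameter dict per combination of the listed values."""
    names = sorted(grid_parameters)
    value_lists = [grid_parameters[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


# Hypothetical grid over two KMeans parameters.
grid = {"k": [2, 3, 4], "standardize": [True, False]}
combos = expand_grid(grid)
total_models = len(combos)
```

Under this assumption, three values of `k` and two values of `standardize` give six candidate models.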

H2OErrorV3

timestamp
long
Milliseconds since the epoch for the time that this H2OError instance was created. Generally this is shortly after the underlying error occurred.Out
error_url
string
Error urlOut
msg
string
Message intended for the end user (a data scientist).Out
dev_msg
string
Potentially more detailed message intended for a developer (e.g. a front end engineer or someone designing a language binding).Out
http_status
int
HTTP status code for this error.Out
values
Map
Any values that are relevant to reporting or handling this error. Examples are a key name if the error is on a key, or a field name and object name if it’s on a specific field.Out
exception_type
string
Exception type, if any.Out
exception_msg
string
Raw exception message, if any.Out
stacktrace
string[]
Stacktrace, if any.Out
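
The fields above can be consumed along these lines. This is a minimal sketch that assumes the error arrives as an already-decoded JSON dictionary; `summarize_error` and the sample payload are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def summarize_error(err, for_developer=False):
    """Format an H2OError-shaped dict as a single log line."""
    # `timestamp` is documented as milliseconds since the epoch.
    when = EPOCH + timedelta(milliseconds=err["timestamp"])
    # Prefer dev_msg for developers, falling back to the end-user msg.
    text = (err.get("dev_msg") if for_developer else None) or err.get("msg")
    return f"[{when.isoformat()}] HTTP {err['http_status']}: {text}"


# Hypothetical payload; the values are not taken from a real server.
sample = {
    "timestamp": 1430000000000,
    "msg": "Frame not found",
    "dev_msg": "Key 'my_frame' is not in the DKV",
    "http_status": 404,
    "values": {"key": "my_frame"},
}
line = summarize_error(sample)
```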

H2OModelBuilderErrorV3

parameters
Parameters
Model builder parameters.Out
messages
ValidationMessage[]
Parameter validation messagesOut
error_count
int
Count of parameter validation errorsOut
timestamp
long
Milliseconds since the epoch for the time that this H2OError instance was created. Generally this is shortly after the underlying error occurred.Out
error_url
string
Error urlOut
msg
string
Message intended for the end user (a data scientist).Out
dev_msg
string
Potentially more detailed message intended for a developer (e.g. a front end engineer or someone designing a language binding).Out
http_status
int
HTTP status code for this error.Out
values
Map
Any values that are relevant to reporting or handling this error. Examples are a key name if the error is on a key, or a field name and object name if it’s on a specific field.Out
exception_type
string
Exception type, if any.Out
exception_msg
string
Raw exception message, if any.Out
stacktrace
string[]
Stacktrace, if any.Out

HeartBeatEvent

sends
int
number of sent heartbeatsIn
recvs
int
number of received heartbeatsIn
date
string
Time when the event was recorded. Format is hh:mm:ss:msIn
nanos
long
Time in nanosIn
type
enum
type of recorded eventIn
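
The `date` field uses colons for every separator, including the one before the milliseconds, so standard time parsers may not accept it directly. The sketch below shows one way to read it, assuming all four components are always present; `parse_event_time` is a hypothetical helper.

```python
from datetime import timedelta


def parse_event_time(text):
    """Parse an 'hh:mm:ss:ms' string into a time-of-day offset."""
    hours, minutes, seconds, millis = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes,
                     seconds=seconds, milliseconds=millis)


offset = parse_event_time("12:34:56:789")
```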

IOEvent

io_flavor
string
flavor of the recorded io (ice/hdfs/…)In
node
string
node where this io event happenedIn
data
string
data infoIn
date
string
Time when the event was recorded. Format is hh:mm:ss:msIn
nanos
long
Time in nanosIn
type
enum
type of recorded eventIn

ImportFilesV3

path
string
pathIn
_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In
files
string[]
filesOut
destination_frames
string[]
Names of the destination framesOut
fails
string[]
failsOut
dels
string[]
delsOut

InitIDV3

_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In
session_key
string
Session IDOut
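
Many schemas accept `_exclude_fields`. The sketch below builds a request path in the same form as the example quoted in the field description; `with_excluded_fields` is a hypothetical helper, and only the path is assembled (no request is sent).

```python
from urllib.parse import urlencode


def with_excluded_fields(path, field_paths):
    """Append an _exclude_fields query parameter to an endpoint path."""
    # Keep '/' and ',' literal so the result matches the documented form.
    query = urlencode({"_exclude_fields": ",".join(field_paths)}, safe="/,")
    return f"{path}?{query}"


url_path = with_excluded_fields("/3/Frames", ["frames/frame_id/URL", "__meta"])
```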

InteractionV3

key
Key
Job KeyIn
description
string
Job descriptionIn
source_frame
Key
Input data frameIn/Out
factor_columns
string[]
Factor columnsIn/Out
pairwise
boolean
Whether to create pairwise quadratic interactions between factors (otherwise create one higher-order interaction). Only applicable if there are 3 or more factors.In/Out
max_factors
int
Maximum number of factor levels in pairwise interaction terms (if enforced, one extra catch-all factor level is created)In/Out
min_occurrence
int
Minimum occurrence threshold for factor levels in pairwise interaction termsIn/Out
dest
Key
destination keyIn/Out
status
string
job statusOut
progress
float
progress, from 0 to 1Out
progress_msg
string
current progress status descriptionOut
start_time
long
Start timeOut
msec
long
runtimeOut
exception
string
exceptionOut
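
To make `pairwise`, `max_factors`, and `min_occurrence` more concrete, here is a simplified sketch of building an interaction column from two factor columns. The level naming (`a_b`), the catch-all label `other`, and the tie-breaking order are assumptions; the actual server-side behavior may differ.

```python
from collections import Counter
from itertools import combinations


def interact(col_a, col_b, max_factors=100, min_occurrence=1, catch_all="other"):
    """Combine two factor columns into one interaction column."""
    combined = [f"{a}_{b}" for a, b in zip(col_a, col_b)]
    counts = Counter(combined)
    # Keep only levels seen often enough, most frequent first.
    frequent = [lvl for lvl, n in counts.most_common() if n >= min_occurrence]
    keep = set(frequent[:max_factors])
    # Everything else collapses into a single catch-all level.
    return [lvl if lvl in keep else catch_all for lvl in combined]


# With pairwise=True and three factor columns, one term per column pair.
pairs = list(combinations(["f1", "f2", "f3"], 2))

col_a = ["x", "x", "y", "y", "x"]
col_b = ["1", "1", "2", "1", "1"]
capped = interact(col_a, col_b, max_factors=1)
filtered = interact(col_a, col_b, min_occurrence=2)
```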

IoStatsEntry

backend
string
Back end typeOut
store_count
long
Number of store eventsOut
store_bytes
long
Cumulative stored bytesOut
delete_count
long
Number of delete eventsOut
load_count
long
Number of load eventsOut
load_bytes
long
Cumulative loaded bytesOut

JStackV3

_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In
traces
DStackTrace[]
StacktracesOut

JobKeyV3

name
string
Name (string representation) for this Key.In/Out
type
string
Name (string representation) for the type of Keyed this Key points to.In/Out
URL
string
URL for the resource that this Key points to, if one exists.In/Out

JobV3

key
Key
Job KeyIn
description
string
Job descriptionIn
dest
Key
destination keyIn/Out
status
string
job statusOut
progress
float
progress, from 0 to 1Out
progress_msg
string
current progress status descriptionOut
start_time
long
Start timeOut
msec
long
runtimeOut
exception
string
exceptionOut
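
A client typically polls a job until `progress` reaches 1 or `exception` is set. The sketch below simulates that loop without a server: `fetch_job` stands in for whatever call retrieves the job JSON, and the status strings `RUNNING` and `DONE` are placeholders, since the schema does not enumerate the valid values.

```python
import time


def wait_for_job(fetch_job, poll_seconds=0.0, max_polls=100):
    """Poll a job until it completes, fails, or max_polls is exhausted."""
    for _ in range(max_polls):
        job = fetch_job()
        if job.get("exception"):
            raise RuntimeError(job["exception"])
        if job["progress"] >= 1.0:  # progress runs from 0 to 1
            return job
        time.sleep(poll_seconds)
    raise TimeoutError("job did not finish within max_polls")


# Simulated sequence of responses for one job.
states = iter([
    {"status": "RUNNING", "progress": 0.25, "progress_msg": "Iteration 1", "exception": None},
    {"status": "RUNNING", "progress": 0.75, "progress_msg": "Iteration 3", "exception": None},
    {"status": "DONE", "progress": 1.0, "progress_msg": "Done", "exception": None},
])
final = wait_for_job(lambda: next(states))
```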

JobsV3

job_id
Key
Optional Job identifierIn
_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In
jobs
Job[]
jobsOut

KMeansGridSearchV99

parameters
KMeansParameters
Basic model builder parameters.In
grid_parameters
Map
Grid search parameters.In
total_models
int
Number of all models generated by grid search.Out
job
Job
Job Key.Out

KMeansModelOutputV3

centers
TwoDimTable
Cluster Centers[k][features]In
centers_std
TwoDimTable
Cluster Centers[k][features] on Standardized DataIn
names
string[]
Column names.Out
domains
string[][]
Domains for categorical (enum) columns.Out
model_category
enum
Category of the model (e.g., Binomial).Out
model_summary
TwoDimTable
Model summaryOut
scoring_history
TwoDimTable
Scoring historyOut
training_metrics
ModelMetrics
Training data model metricsOut
validation_metrics
ModelMetrics
Validation data model metricsOut
help
Map
Help information for output fieldsOut

KMeansModelV3

model_id
Key
Model keyIn/Out
parameters
KMeansParameters
The build parameters for the model (e.g. K for KMeans).Out
output
KMeansOutput
The build output for the model (e.g. the cluster centers for KMeans).Out
compatible_frames
string[]
Compatible frames, if requestedOut
checksum
long
Checksum for all the things that go into building the Model.Out
algo
string
The algo name for this Model.Out
algo_full_name
string
The pretty algo name for this Model (e.g., Generalized Linear Model, rather than GLM).Out

KMeansParametersV3

user_points
Key
User-specified pointsIn
max_iterations
int
Maximum training iterationsIn
standardize
boolean
Standardize columnsIn
seed
long
RNG SeedIn
init
enum
Initialization modeIn
k
int
Number of clustersIn/Out
model_id
Key
Destination id for this model; auto-generated if not specifiedIn/Out
training_frame
Key
Training frameIn/Out
validation_frame
Key
Validation frameIn/Out
ignored_columns
string[]
Ignored columnsIn/Out
ignore_const_cols
boolean
Ignore constant columnsIn/Out
score_each_iteration
boolean
Whether to score during each iteration of model trainingIn/Out

KMeansV3

parameters
KMeansParameters
Model builder parameters.In
__http_status
int
HTTP status to return for this build.In
_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In
algo
string
The algo name for this ModelBuilder.Out
algo_full_name
string
The pretty algo name for this ModelBuilder (e.g., Generalized Linear Model, rather than GLM).Out
can_build
enum[]
Model categories this ModelBuilder can build.Out
visibility
enum
Should the builder always be visible, be marked as beta, or only visible if the user starts up with the experimental flag?Out
job
Job
Job KeyOut
messages
ValidationMessage[]
Parameter validation messagesOut
error_count
int
Count of parameter validation errorsOut

KeyV3

name
string
Name (string representation) for this Key.In/Out
type
string
Name (string representation) for the type of Keyed this Key points to.In/Out
URL
string
URL for the resource that this Key points to, if one exists.In/Out

KillMinus3V3

_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In

LogAndEchoV3

message
string
Message to be Logged and EchoedIn
_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In

LogsV3

nodeidx
int
Index of the node to read logs from (0-based). -1 means the current node.In
name
string
Which specific log file to read from the log file directory. If left unspecified, the system chooses a default for you.In
_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In
log
string
Content of log fileOut

MakeGLMModelV3

model
Key
source modelIn
dest
Key
destination keyIn
names
string[]
coefficient namesIn
beta
double[]
new glm coefficientsIn
threshold
float
decision threshold for label-generationIn

MetadataBase

num
int
Number for specifying an endpointIn
http_method
string
HTTP method (GET, POST, DELETE) if fetching by pathIn
path
string
Path for specifying an endpointIn
classname
string
Class name, for fetching docs for a schema (DEPRECATED)In
schemaname
string
Schema name (e.g., DocsV1), for fetching docs for a schemaIn
_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In
routes
Route[]
List of endpoint routesOut
schemas
SchemaMetadata[]
List of schemasOut
markdown
string
Table of Contents MarkdownOut

MetadataV3

num
int
Number for specifying an endpointIn
http_method
string
HTTP method (GET, POST, DELETE) if fetching by pathIn
path
string
Path for specifying an endpointIn
classname
string
Class name, for fetching docs for a schema (DEPRECATED)In
schemaname
string
Schema name (e.g., DocsV1), for fetching docs for a schemaIn
_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In
routes
Route[]
List of endpoint routesOut
schemas
SchemaMetadata[]
List of schemasOut
markdown
string
Table of Contents MarkdownOut

MissingInserterV3

dataset
Key
datasetIn
fraction
double
Fraction of data to replace with a missing valueIn
seed
long
SeedIn
key
Key
Job KeyIn
description
string
Job descriptionIn
dest
Key
destination keyIn/Out
status
string
job statusOut
progress
float
progress, from 0 to 1Out
progress_msg
string
current progress status descriptionOut
start_time
long
Start timeOut
msec
long
runtimeOut
exception
string
exceptionOut
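
The effect of `fraction` and `seed` can be pictured with a small local stand-in. This sketch assumes each cell is independently replaced with probability `fraction`; `insert_missing` is a hypothetical function operating on plain lists, not the server-side implementation.

```python
import random


def insert_missing(rows, fraction, seed):
    """Return a copy of rows with roughly `fraction` of cells set to None."""
    rng = random.Random(seed)  # same seed -> same missing pattern
    return [[None if rng.random() < fraction else value for value in row]
            for row in rows]


data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
first = insert_missing(data, fraction=0.3, seed=42)
second = insert_missing(data, fraction=0.3, seed=42)
```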

ModelBuilderJobV3

key
Key
Job KeyIn
description
string
Job descriptionIn
dest
Key
destination keyIn/Out
parameters
Parameters
Model builder parameters.Out
messages
ValidationMessage[]
Parameter validation messagesOut
error_count
int
Count of parameter validation errorsOut
status
string
job statusOut
progress
float
progress, from 0 to 1Out
progress_msg
string
current progress status descriptionOut
start_time
long
Start timeOut
msec
long
runtimeOut
exception
string
exceptionOut

ModelBuilderSchema

parameters
Parameters
Model builder parameters.In
__http_status
int
HTTP status to return for this build.In
_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In
algo
string
The algo name for this ModelBuilder.Out
algo_full_name
string
The pretty algo name for this ModelBuilder (e.g., Generalized Linear Model, rather than GLM).Out
can_build
enum[]
Model categories this ModelBuilder can build.Out
visibility
enum
Should the builder always be visible, be marked as beta, or only visible if the user starts up with the experimental flag?Out
job
Job
Job KeyOut
messages
ValidationMessage[]
Parameter validation messagesOut
error_count
int
Count of parameter validation errorsOut

ModelBuildersBase

algo
string
Algo of ModelBuilder of interestIn
_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In
model_builders
Map
ModelBuildersOut

ModelBuildersV3

algo
string
Algo of ModelBuilder of interestIn
_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In
model_builders
Map
ModelBuildersOut

ModelBuildersVisibilityV3

value
string
Stable, Beta, ExperimentalIn/Out

ModelExportV3

model_id
Key
Name of Model of interestIn
dir
string
Destination directory (hdfs, s3, local)In
force
boolean
Whether to overwrite the destination directory if it already exists; if set to false, an exception is thrown instead.In
_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In

ModelImportV3

model_id
Key
Save imported model under given key into DKV.In
dir
string
Source directory (hdfs, s3, local) containing serialized modelIn
force
boolean
Whether to overwrite an existing model with the same key; if set to false, an exception is thrown insteadIn
_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In

ModelKeyV3

name
string
Name (string representation) for this Key.In/Out
type
string
Name (string representation) for the type of Keyed this Key points to.In/Out
URL
string
URL for the resource that this Key points to, if one exists.In/Out

ModelMetricsAutoEncoderV3

model
Key
The model used for this scoring run.In/Out
model_checksum
long
The checksum for the model used for this scoring run.In/Out
frame
Key
The frame used for this scoring run.In/Out
frame_checksum
long
The checksum for the frame used for this scoring run.In/Out
description
string
Optional description for this scoring run (to note out-of-bag, sampled data, etc.)Out
model_category
enum
The category (e.g., Clustering) for the model used for this scoring run.Out
scoring_time
long
The time in milliseconds since the epoch for the start of this scoring run.Out
predictions
Frame
Predictions Frame.Out
MSE
double
The Mean Squared Error of the prediction for this scoring run.Out

ModelMetricsBase

model
Key
The model used for this scoring run.In/Out
model_checksum
long
The checksum for the model used for this scoring run.In/Out
frame
Key
The frame used for this scoring run.In/Out
frame_checksum
long
The checksum for the frame used for this scoring run.In/Out
description
string
Optional description for this scoring run (to note out-of-bag, sampled data, etc.)Out
model_category
enum
The category (e.g., Clustering) for the model used for this scoring run.Out
scoring_time
long
The time in milliseconds since the epoch for the start of this scoring run.Out
predictions
Frame
Predictions Frame.Out
MSE
double
The Mean Squared Error of the prediction for this scoring run.Out

ModelMetricsBinomialGLMV3

model
Key
The model used for this scoring run.In/Out
model_checksum
long
The checksum for the model used for this scoring run.In/Out
frame
Key
The frame used for this scoring run.In/Out
frame_checksum
long
The checksum for the frame used for this scoring run.In/Out
residual_deviance
double
residual devianceOut
null_deviance
double
null devianceOut
AIC
double
AICOut
null_degrees_of_freedom
long
null DOFOut
residual_degrees_of_freedom
long
residual DOFOut
r2
double
The R^2 for this scoring run.Out
logloss
double
The logarithmic loss for this scoring run.Out
AUC
double
The AUC for this scoring run.Out
Gini
double
The Gini score for this scoring run.Out
domain
string[]
The class labels of the response.Out
thresholds_and_metric_scores
TwoDimTable
The Metrics for various thresholds.Out
max_criteria_and_metric_scores
TwoDimTable
The Metrics for various criteria.Out
description
string
Optional description for this scoring run (to note out-of-bag, sampled data, etc.)Out
model_category
enum
The category (e.g., Clustering) for the model used for this scoring run.Out
scoring_time
long
The time in milliseconds since the epoch for the start of this scoring run.Out
predictions
Frame
Predictions Frame.Out
MSE
double
The Mean Squared Error of the prediction for this scoring run.Out

ModelMetricsBinomialV3

model
Key
The model used for this scoring run.In/Out
model_checksum
long
The checksum for the model used for this scoring run.In/Out
frame
Key
The frame used for this scoring run.In/Out
frame_checksum
long
The checksum for the frame used for this scoring run.In/Out
r2
double
The R^2 for this scoring run.Out
logloss
double
The logarithmic loss for this scoring run.Out
AUC
double
The AUC for this scoring run.Out
Gini
double
The Gini score for this scoring run.Out
domain
string[]
The class labels of the response.Out
thresholds_and_metric_scores
TwoDimTable
The Metrics for various thresholds.Out
max_criteria_and_metric_scores
TwoDimTable
The Metrics for various criteria.Out
description
string
Optional description for this scoring run (to note out-of-bag, sampled data, etc.)Out
model_category
enum
The category (e.g., Clustering) for the model used for this scoring run.Out
scoring_time
long
The time in milliseconds since the epoch for the start of this scoring run.Out
predictions
Frame
Predictions Frame.Out
MSE
double
The Mean Squared Error of the prediction for this scoring run.Out
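
For readers unfamiliar with `logloss`, `AUC`, and `Gini`, the sketch below computes them from labels and predicted probabilities using the usual textbook definitions, including the common convention Gini = 2 * AUC - 1. Whether H2O computes these metrics in exactly this way is not stated in the schema, so treat this as an illustration only.

```python
import math


def logloss(actual, predicted, eps=1e-15):
    """Mean negative log-likelihood for binary labels (0 or 1)."""
    total = 0.0
    for y, p in zip(actual, predicted):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(actual)


def auc(actual, predicted):
    """Probability that a random positive outranks a random negative."""
    pos = [p for y, p in zip(actual, predicted) if y == 1]
    neg = [p for y, p in zip(actual, predicted) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gini_from_auc(auc_value):
    return 2 * auc_value - 1


labels = [1, 0, 1, 0]
scores = [0.9, 0.2, 0.4, 0.6]
auc_value = auc(labels, scores)
gini = gini_from_auc(auc_value)
```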

ModelMetricsClusteringV3

tot_withinss
double
Within Cluster Sum of Square ErrorIn
totss
double
Total Sum of Square Error to Grand MeanIn
betweenss
double
Between Cluster Sum of Square ErrorIn
centroid_stats
TwoDimTable
Centroid StatisticsIn
model
Key
The model used for this scoring run.In/Out
model_checksum
long
The checksum for the model used for this scoring run.In/Out
frame
Key
The frame used for this scoring run.In/Out
frame_checksum
long
The checksum for the frame used for this scoring run.In/Out
description
string
Optional description for this scoring run (to note out-of-bag, sampled data, etc.)Out
model_category
enum
The category (e.g., Clustering) for the model used for this scoring run.Out
scoring_time
long
The time in milliseconds since the epoch for the start of this scoring run.Out
predictions
Frame
Predictions Frame.Out
MSE
double
The Mean Squared Error of the prediction for this scoring run.Out
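
The three sums of squares are related by the standard identity `totss = tot_withinss + betweenss`. The sketch below checks this on a toy example using the usual definitions; the helper names are hypothetical and the data is invented.

```python
def mean(points):
    """Column-wise mean of a list of equal-length points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def sum_sq(points, center):
    """Sum of squared distances from each point to center."""
    return sum((x - c) ** 2 for p in points for x, c in zip(p, center))


def clustering_sums(clusters):
    all_points = [p for cluster in clusters for p in cluster]
    grand_mean = mean(all_points)
    tot_withinss = sum(sum_sq(c, mean(c)) for c in clusters)
    totss = sum_sq(all_points, grand_mean)
    # Each centroid's squared distance to the grand mean, weighted by size.
    betweenss = sum(len(c) * sum_sq([mean(c)], grand_mean) for c in clusters)
    return tot_withinss, totss, betweenss


clusters = [[[0.0], [2.0]], [[10.0], [12.0]]]
tot_withinss, totss, betweenss = clustering_sums(clusters)
```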

ModelMetricsListSchemaV3

model
Key
Key of Model of interest (optional)In
frame
Key
Key of Frame of interest (optional)In
reconstruction_error
boolean
Compute reconstruction error (optional, only for Deep Learning AutoEncoder models)In
deep_features_hidden_layer
int
Extract Deep Features for given hidden layer (optional, only for Deep Learning models)In
_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In
predictions_frame
Key
Key of predictions frame, if predictions are requested (optional)In/Out
model_metrics
ModelMetrics[]
ModelMetricsOut

ModelMetricsMultinomialV3

model
Key
The model used for this scoring run.In/Out
model_checksum
long
The checksum for the model used for this scoring run.In/Out
frame
Key
The frame used for this scoring run.In/Out
frame_checksum
long
The checksum for the frame used for this scoring run.In/Out
r2
double
The R^2 for this scoring run.Out
hit_ratio_table
TwoDimTable
The hit ratio table for this scoring run.Out
cm
ConfusionMatrix
The ConfusionMatrix object for this scoring run.Out
logloss
double
The logarithmic loss for this scoring run.Out
description
string
Optional description for this scoring run (to note out-of-bag, sampled data, etc.)Out
model_category
enum
The category (e.g., Clustering) for the model used for this scoring run.Out
scoring_time
long
The time in milliseconds since the epoch for the start of this scoring run.Out
predictions
Frame
Predictions Frame.Out
MSE
double
The Mean Squared Error of the prediction for this scoring run.Out
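
A hit ratio table is commonly understood as the fraction of rows whose true class appears among the top k predicted classes, for k = 1, 2, and so on. The sketch below computes that quantity under this assumption; `hit_ratios` is a hypothetical helper and the probabilities are invented.

```python
def hit_ratios(probabilities, actual, max_k):
    """Return [ratio for k=1, ratio for k=2, ...] up to max_k."""
    hits = [0] * max_k
    for probs, label in zip(probabilities, actual):
        # Class indices ordered from most to least probable.
        ranked = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
        rank = ranked.index(label)
        # A hit at position `rank` counts for every k > rank.
        for k in range(rank, max_k):
            hits[k] += 1
    return [h / len(actual) for h in hits]


probabilities = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.4, 0.5]]
actual = [0, 2, 1]
ratios = hit_ratios(probabilities, actual, max_k=3)
```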

ModelMetricsPCAV99

model
Key
The model used for this scoring run.In/Out
model_checksum
long
The checksum for the model used for this scoring run.In/Out
frame
Key
The frame used for this scoring run.In/Out
frame_checksum
long
The checksum for the frame used for this scoring run.In/Out
description
string
Optional description for this scoring run (to note out-of-bag, sampled data, etc.)Out
model_category
enum
The category (e.g., Clustering) for the model used for this scoring run.Out
scoring_time
long
The time in milliseconds since the epoch for the start of this scoring run.Out
predictions
Frame
Predictions Frame.Out
MSE
double
The Mean Squared Error of the prediction for this scoring run.Out

ModelMetricsRegressionGLMV3

model
Key
The model used for this scoring run.In/Out
model_checksum
long
The checksum for the model used for this scoring run.In/Out
frame
Key
The frame used for this scoring run.In/Out
frame_checksum
long
The checksum for the frame used for this scoring run.In/Out
residual_deviance
double
residual devianceOut
null_deviance
double
null devianceOut
AIC
double
AICOut
null_degrees_of_freedom
long
null DOFOut
residual_degrees_of_freedom
long
residual DOFOut
r2
double
The R^2 for this scoring run.Out
description
string
Optional description for this scoring run (to note out-of-bag, sampled data, etc.)Out
model_category
enum
The category (e.g., Clustering) for the model used for this scoring run.Out
scoring_time
long
The time in milliseconds since the epoch for the start of this scoring run.Out
predictions
Frame
Predictions Frame.Out
MSE
double
The Mean Squared Error of the prediction for this scoring run.Out

ModelMetricsRegressionV3

model
Key
The model used for this scoring run.In/Out
model_checksum
long
The checksum for the model used for this scoring run.In/Out
frame
Key
The frame used for this scoring run.In/Out
frame_checksum
long
The checksum for the frame used for this scoring run.In/Out
r2
double
The R^2 for this scoring run.Out
description
string
Optional description for this scoring run (to note out-of-bag, sampled data, etc.)Out
model_category
enum
The category (e.g., Clustering) for the model used for this scoring run.Out
scoring_time
long
The time in milliseconds since the epoch for the start of this scoring run.Out
predictions
Frame
Predictions Frame.Out
MSE
double
The Mean Squared Error of the prediction for this scoring run.Out
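
As a quick reference for `MSE` and `r2`, the sketch below uses the standard definitions (mean of squared residuals, and one minus the ratio of residual to total sum of squares). It is not confirmed by this schema that H2O uses exactly these formulas.

```python
def mse(actual, predicted):
    """Mean of the squared residuals."""
    return sum((y - p) ** 2 for y, p in zip(actual, predicted)) / len(actual)


def r2(actual, predicted):
    """One minus residual sum of squares over total sum of squares."""
    y_mean = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - y_mean) ** 2 for y in actual)
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 4.0]
error = mse(y_true, y_pred)
fit = r2(y_true, y_pred)
```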

ModelMetricsSVDV99

model
Key
The model used for this scoring run.In/Out
model_checksum
long
The checksum for the model used for this scoring run.In/Out
frame
Key
The frame used for this scoring run.In/Out
frame_checksum
long
The checksum for the frame used for this scoring run.In/Out
description
string
Optional description for this scoring run (to note out-of-bag, sampled data, etc.)Out
model_category
enum
The category (e.g., Clustering) for the model used for this scoring run.Out
scoring_time
long
The time in milliseconds since the epoch for the start of this scoring run.Out
predictions
Frame
Predictions Frame.Out
MSE
double
The Mean Squared Error of the prediction for this scoring run.Out

ModelOutputSchema

names
string[]
Column names.Out
domains
string[][]
Domains for categorical (enum) columns.Out
model_category
enum
Category of the model (e.g., Binomial).Out
model_summary
TwoDimTable
Model summaryOut
scoring_history
TwoDimTable
Scoring historyOut
training_metrics
ModelMetrics
Training data model metricsOut
validation_metrics
ModelMetrics
Validation data model metricsOut
help
Map
Help information for output fieldsOut

ModelParameterSchemaV3

is_member_of_frames
string[]
For Vec-type fields, the set of Frame-type fields which must contain the named column; for example, for a SupervisedModel the response_column must be in both the training_frame and (if it’s set) the validation_frameIn
is_mutually_exclusive_with
string[]
For Vec-type fields, the set of other Vec-type fields which must contain mutually exclusive values; for example, for a SupervisedModel the response_column must be mutually exclusive with the weights_columnIn
name
string
name in the JSON, e.g. “lambda”Out
label
string
label in the UI, e.g. “lambda”Out
help
string
help for the UI, e.g. “regularization multiplier, typically used for foo bar baz etc.”Out
required
boolean
the field is requiredOut
type
string
Java type, e.g. “double”Out
default_value
Polymorphic
default value, e.g. 1Out
actual_value
Polymorphic
actual value as set by the user and / or modified by the ModelBuilder, e.g., 10Out
level
string
the importance of the parameter, used by the UI, e.g. “critical”, “extended” or “expert”Out
values
string[]
list of valid values for use by the front-endOut
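
Parameter metadata of this shape lends itself to simple client-side filtering. The sketch below lists parameters whose `actual_value` differs from `default_value`, optionally restricted to one `level`. The sample entries and their default values are invented for illustration.

```python
def changed_parameters(params, level=None):
    """Map name -> actual_value for parameters the user has overridden."""
    return {
        p["name"]: p["actual_value"]
        for p in params
        if p["actual_value"] != p["default_value"]
        and (level is None or p["level"] == level)
    }


# Hypothetical metadata; real defaults may differ.
params = [
    {"name": "k", "level": "critical", "default_value": 1, "actual_value": 5},
    {"name": "max_iterations", "level": "critical", "default_value": 1000, "actual_value": 1000},
    {"name": "seed", "level": "expert", "default_value": -1, "actual_value": 42},
]
overridden = changed_parameters(params)
critical_only = changed_parameters(params, level="critical")
```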

ModelParametersSchema

model_id
Key
Destination id for this model; auto-generated if not specifiedIn/Out
training_frame
Key
Training frameIn/Out
validation_frame
Key
Validation frameIn/Out
ignored_columns
string[]
Ignored columnsIn/Out
ignore_const_cols
boolean
Ignore constant columnsIn/Out
score_each_iteration
boolean
Whether to score during each iteration of model trainingIn/Out

ModelSchema

model_id
Key
Model keyIn/Out
parameters
Parameters
The build parameters for the model (e.g. K for KMeans).Out
output
Output
The build output for the model (e.g. the cluster centers for KMeans).Out
compatible_frames
string[]
Compatible frames, if requestedOut
checksum
long
Checksum for all the things that go into building the Model.Out
algo
string
The algo name for this Model.Out
algo_full_name
string
The pretty algo name for this Model (e.g., Generalized Linear Model, rather than GLM).Out

ModelSchemaBase

model_id
Key
Model keyIn/Out
algo
string
The algo name for this Model.Out
algo_full_name
string
The pretty algo name for this Model (e.g., Generalized Linear Model, rather than GLM).Out

ModelSynopsisV3

model_id
Key
Model keyIn/Out
algo
string
The algo name for this Model.Out
algo_full_name
string
The pretty algo name for this Model (e.g., Generalized Linear Model, rather than GLM).Out

ModelsBase

model_id
Key
Name of Model of interestIn
preview
boolean
Return potentially abridged model suitable for viewing in a browserIn
find_compatible_frames
boolean
Find and return compatible frames?In
_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In
models
Iced[]
ModelsOut
compatible_frames
Frame[]
Compatible framesOut

ModelsV3

model_id
Key
Name of Model of interestIn
preview
boolean
Return potentially abridged model suitable for viewing in a browserIn
find_compatible_frames
boolean
Find and return compatible frames?In
_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In
models
Iced[]
ModelsOut
compatible_frames
Frame[]
Compatible framesOut

NaiveBayesModelOutputV3

levels
string[]
Categorical levels of the responseIn
apriori
TwoDimTable
A-priori probabilities of the responseIn
pcond
TwoDimTable[]
Conditional probabilities of the predictorsIn
names
string[]
Column names.Out
domains
string[][]
Domains for categorical (enum) columns.Out
model_category
enum
Category of the model (e.g., Binomial).Out
model_summary
TwoDimTable
Model summaryOut
scoring_history
TwoDimTable
Scoring historyOut
training_metrics
ModelMetrics
Training data model metricsOut
validation_metrics
ModelMetrics
Validation data model metricsOut
help
Map
Help information for output fieldsOut

NaiveBayesModelV3

model_id
Key
Model keyIn/Out
parameters
NaiveBayesParameters
The build parameters for the model (e.g. K for KMeans).Out
output
NaiveBayesOutput
The build output for the model (e.g. the cluster centers for KMeans).Out
compatible_frames
string[]
Compatible frames, if requestedOut
checksum
long
Checksum for all the things that go into building the Model.Out
algo
string
The algo name for this Model.Out
algo_full_name
string
The pretty algo name for this Model (e.g., Generalized Linear Model, rather than GLM).Out

NaiveBayesParametersV3

laplace
double
Laplace smoothing parameterIn
min_sdev
double
Min. standard deviation to use for observations with not enough dataIn
eps_sdev
double
Cutoff below which standard deviation is replaced with min_sdevIn
min_prob
double
Min. probability to use for observations with not enough dataIn
eps_prob
double
Cutoff below which probability is replaced with min_probIn
compute_metrics
boolean
Compute metrics on training dataIn
response_column
VecSpecifier
Response columnIn/Out
balance_classes
boolean
Balance training data class counts via over/under-sampling (for imbalanced data).In/Out
class_sampling_factors
float[]
Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes.In/Out
max_after_balance_size
float
Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes.In/Out
max_confusion_matrix_size
int
Maximum size (# classes) for confusion matrices to be printed in the LogsIn/Out
max_hit_ratio_k
int
Max. number (top K) of predictions to use for hit ratio computation (for multi-class only, 0 to disable)In/Out
model_id
Key
Destination id for this model; auto-generated if not specifiedIn/Out
training_frame
Key
Training frameIn/Out
validation_frame
Key
Validation frameIn/Out
ignored_columns
string[]
Ignored columnsIn/Out
ignore_const_cols
boolean
Ignore constant columnsIn/Out
score_each_iteration
boolean
Whether to score during each iteration of model trainingIn/Out
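
The smoothing and cutoff parameters above can be illustrated with a short Python sketch. The helper names (`smoothed_prob`, `floor_sdev`) are hypothetical and this is not H2O's implementation; it only mirrors the field descriptions for `laplace`, `eps_sdev`, and `min_sdev`, assuming standard additive (Laplace) smoothing.

```python
def smoothed_prob(count, total, n_levels, laplace=0.0):
    """Laplace-smoothed conditional probability of one categorical level.

    count    -- observations of this level within the class
    total    -- observations of the class
    n_levels -- number of levels of the categorical predictor
    """
    return (count + laplace) / (total + laplace * n_levels)


def floor_sdev(sdev, eps_sdev, min_sdev):
    """Replace a standard deviation that falls below eps_sdev with min_sdev."""
    return min_sdev if sdev < eps_sdev else sdev
```

With `laplace=1`, a level never seen in a class of 10 rows (2 levels) gets probability 1/12 instead of 0. The `min_prob` and `eps_prob` pair is described the same way for probabilities.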

NaiveBayesV3

parameters
NaiveBayesParameters
Model builder parameters.In
__http_status
int
HTTP status to return for this build.In
_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In
algo
string
The algo name for this ModelBuilder.Out
algo_full_name
string
The pretty algo name for this ModelBuilder (e.g., Generalized Linear Model, rather than GLM).Out
can_build
enum[]
Model categories this ModelBuilder can build.Out
visibility
enum
Should the builder always be visible, be marked as beta, or only visible if the user starts up with the experimental flag?Out
job
Job
Job KeyOut
messages
ValidationMessage[]
Parameter validation messagesOut
error_count
int
Count of parameter validation errorsOut

NetworkEvent

is_send
boolean
Boolean flag distinguishing between sends (true) and receives (false)In
protocol
string
network protocol (UDP/TCP)In
msg_type
string
UDP type (exec, ack, ackack, …)In
from
string
Sending nodeIn
to
string
Receiving nodeIn
data
string
Pretty print of the first few bytes of the msg payload. Contains class name for tasks.In
date
string
Time when the event was recorded. Format is hh:mm:ss:msIn
nanos
long
Time in nanosIn
type
enum
type of recorded eventIn

NetworkTestV3

_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In
microseconds_collective
double[]
Collective broadcast/reduce times in microseconds (for each message size)Out
bandwidths_collective
double[]
Collective bandwidths in Bytes/sec (for each message size, for each node)Out
microseconds
double[][]
Round-trip times in microseconds (for each message size, for each node)Out
bandwidths
double[][]
Bi-directional bandwidths in Bytes/sec (for each message size, for each node)Out
nodes
string[]
NodesOut
table
TwoDimTable
NetworkTestResultsOut

NodePersistentStorageEntryV3

category
string
Category nameOut
name
string
Key nameOut
size
long
Size in bytes of valueOut
timestamp_millis
long
Epoch time in milliseconds of when the value was writtenOut
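
Since `timestamp_millis` is an epoch time in milliseconds, a client would typically convert it before display. A minimal Python sketch follows; the function name `written_at` is hypothetical.

```python
from datetime import datetime, timezone


def written_at(timestamp_millis):
    """Convert an epoch-milliseconds value to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(timestamp_millis / 1000, tz=timezone.utc)
```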

NodePersistentStorageV3

_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In
category
string
Category nameIn/Out
name
string
Key nameIn/Out
value
string
ValueIn/Out
configured
boolean
ConfiguredOut
exists
boolean
ExistsOut
entries
Iced[]
List of entriesOut

NodeV3

h2o
string
IPOut
ip_port
string
IP address and port in the form a.b.c.d:eOut
healthy
boolean
True if (now - last_ping) < HeartbeatThread.TIMEOUTOut
last_ping
long
Time (in msec) of last pingOut
sys_load
float
System load; average #runnables/#coresOut
gflops
double
Linpack GFlopsOut
mem_bw
double
Memory BandwidthOut
total_value_size
long
Data on Node (memory or disk)Out
mem_value_size
long
Data on Node (memory only)Out
num_keys
int
Local keysOut
free_mem
long
Free heapOut
tot_mem
long
Total heapOut
max_mem
long
Max heapOut
free_disk
long
Free diskOut
max_disk
long
Max diskOut
rpcs_active
int
Active Remote Procedure CallsOut
fjthrds
short[]
F/J Thread count, by priorityOut
fjqueue
short[]
F/J Task count, by priorityOut
tcps_active
int
Open TCP connectionsOut
open_fds
int
Open File DescriptorsOut
num_cpus
int
Number of CPUsOut
cpus_allowed
int
Number of CPUs allowedOut
nthreads
int
Number of threadsOut
my_cpu_pct
int
System CPU percentage used by this H2O process in last intervalOut
sys_cpu_pct
int
System CPU percentage used by everything in last intervalOut
pid
string
PIDOut
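
The `healthy` flag is described as a comparison between the time since `last_ping` and a heartbeat timeout. The Python sketch below restates that rule on the client side. The timeout value is a placeholder assumption, since the actual value of `HeartbeatThread.TIMEOUT` is not given here, and `is_healthy` is a hypothetical helper.

```python
import time

# Placeholder only: the real HeartbeatThread.TIMEOUT is not stated in this reference.
ASSUMED_TIMEOUT_MS = 60_000


def is_healthy(last_ping_ms, now_ms=None, timeout_ms=ASSUMED_TIMEOUT_MS):
    """Return True if (now - last_ping) < timeout, with all times in milliseconds."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return (now_ms - last_ping_ms) < timeout_ms
```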

PCAModelOutputV99

std_deviation
double[]
Standard deviationsIn
pc_importance
TwoDimTable
Importance of each principal componentIn
eigenvectors
TwoDimTable
Principal components matrixIn
loading_key
Key
Frame key for loading matrix (Power method only)In
names
string[]
Column names.Out
domains
string[][]
Domains for categorical (enum) columns.Out
model_category
enum
Category of the model (e.g., Binomial).Out
model_summary
TwoDimTable
Model summaryOut
scoring_history
TwoDimTable
Scoring historyOut
training_metrics
ModelMetrics
Training data model metricsOut
validation_metrics
ModelMetrics
Validation data model metricsOut
help
Map
Help information for output fieldsOut
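
One common way to turn the `std_deviation` values into a per-component importance is the proportion of variance explained. The Python sketch below shows that calculation; it is an assumption about what the `pc_importance` table may contain, not a description of H2O's code, and `variance_proportions` is a hypothetical name.

```python
def variance_proportions(std_deviation):
    """Proportion of total variance attributed to each principal component."""
    variances = [s * s for s in std_deviation]
    total = sum(variances)
    return [v / total for v in variances]
```

For standard deviations `[2.0, 1.0, 1.0]`, the first component accounts for 4/6 of the variance.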

PCAModelV99

model_id
Key
Model keyIn/Out
parameters
PCAParameters
The build parameters for the model (e.g. K for KMeans).Out
output
PCAOutput
The build output for the model (e.g. the cluster centers for KMeans).Out
compatible_frames
string[]
Compatible frames, if requestedOut
checksum
long
Checksum for all the things that go into building the Model.Out
algo
string
The algo name for this Model.Out
algo_full_name
string
The pretty algo name for this Model (e.g., Generalized Linear Model, rather than GLM).Out

PCAParametersV99

transform
enum
Transformation of training dataIn
pca_method
enum
Method for computing PCAIn
loading_name
string
Frame key to save left singular vectors from SVDIn
k
int
Rank of matrix approximationIn/Out
max_iterations
int
Maximum training iterationsIn/Out
seed
long
RNG seed for initializationIn/Out
use_all_factor_levels
boolean
Whether first factor level is included in each categorical expansionIn/Out
model_id
Key
Destination id for this model; auto-generated if not specifiedIn/Out
training_frame
Key
Training frameIn/Out
validation_frame
Key
Validation frameIn/Out
ignored_columns
string[]
Ignored columnsIn/Out
ignore_const_cols
boolean
Ignore constant columnsIn/Out
score_each_iteration
boolean
Whether to score during each iteration of model trainingIn/Out

PCAV99

parameters
PCAParameters
Model builder parameters.In
__http_status
int
HTTP status to return for this build.In
_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In
algo
string
The algo name for this ModelBuilder.Out
algo_full_name
string
The pretty algo name for this ModelBuilder (e.g., Generalized Linear Model, rather than GLM).Out
can_build
enum[]
Model categories this ModelBuilder can build.Out
visibility
enum
Should the builder always be visible, be marked as beta, or only visible if the user starts up with the experimental flag?Out
job
Job
Job KeyOut
messages
ValidationMessage[]
Parameter validation messagesOut
error_count
int
Count of parameter validation errorsOut

ParseSetupV3

_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In
source_frames
Key[]
Source framesIn/Out
parse_type
enum
Parser typeIn/Out
separator
byte
Field separatorIn/Out
single_quotes
boolean
Single quotesIn/Out
check_header
int
Check header: 0 means guess, +1 means 1st line is header not data, -1 means 1st line is data not headerIn/Out
column_names
string[]
Column namesIn/Out
column_types
string[]
Value types for columnsIn/Out
na_strings
string[][]
NA strings for columnsIn/Out
column_name_filter
string
Regex for names of columns to returnIn/Out
column_offset
int
Column offset to returnIn/Out
column_count
int
Number of columns to returnIn/Out
total_filtered_column_count
int
Total number of columns we would return with no column paginationIn/Out
destination_frame
string
Suggested nameOut
header_lines
long
Number of header lines foundOut
number_columns
int
Number of columnsOut
data
string[][]
Sample dataOut
chunk_size
int
Size of individual parse tasksOut
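
The column pagination fields (`column_name_filter`, `column_offset`, `column_count`, `total_filtered_column_count`) can be pictured with the Python sketch below. It assumes the regex filter is applied first and the offset and count are applied to the filtered list; that ordering is an assumption, and `paginate_columns` is a hypothetical helper rather than part of the API.

```python
import re


def paginate_columns(column_names, column_name_filter=None,
                     column_offset=0, column_count=None):
    """Return (page_of_names, total_filtered_column_count)."""
    if column_name_filter:
        pattern = re.compile(column_name_filter)
        filtered = [name for name in column_names if pattern.search(name)]
    else:
        filtered = list(column_names)
    # A count of None means "everything from the offset onward".
    end = None if column_count is None else column_offset + column_count
    return filtered[column_offset:end], len(filtered)
```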

ParseV3

destination_frame
Key
Final frame nameIn
source_frames
Key[]
Source framesIn
parse_type
enum
Parser typeIn
separator
byte
Field separatorIn
single_quotes
boolean
Single QuotesIn
check_header
int
Check header: 0 means guess, +1 means 1st line is header not data, -1 means 1st line is data not headerIn
number_columns
int
Number of columnsIn
column_names
string[]
Column namesIn
column_types
string[]
Value types for columnsIn
domains
string[][]
Domains for categorical columnsIn
na_strings
string[][]
NA strings for columnsIn
chunk_size
int
Size of individual parse tasksIn
delete_on_done
boolean
Delete input key after parseIn
blocking
boolean
Block until the parse completes (as opposed to returning early and requiring polling)In
remove_frame
boolean
Remove frame after blocking parse, and return array of VecsIn
_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In
job
Job
Parse jobOut
rows
long
RowsOut
vec_ids
Key[]
Vec IDsOut
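
The three `check_header` values are easy to mix up, so a client might wrap them in a small lookup. The Python sketch below uses only the meanings listed above; rejecting any other value is an assumption, and the names are hypothetical.

```python
CHECK_HEADER_MEANINGS = {
    0: "guess whether the first line is a header",
    1: "first line is a header, not data",
    -1: "first line is data, not a header",
}


def describe_check_header(check_header):
    """Translate a check_header code into its documented meaning."""
    try:
        return CHECK_HEADER_MEANINGS[check_header]
    except KeyError:
        raise ValueError(f"unexpected check_header value: {check_header}") from None
```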

ProfilerNodeEntryV3

stacktrace
string
Stack traceOut
count
int
Profile CountOut

ProfilerNodeV3

node_name
string
Node namesOut
timestamp
long
Timestamp (millis since epoch)Out
entries
Iced[]
Profile entry listOut

ProfilerV3

depth
int
Stack trace depthIn
_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In
nodes
Iced[]
(No description available)Out

QuantileParametersV3

probs
double[]
Probabilities for quantilesIn
combine_method
enum
How to combine quantiles for even sample sizesIn
model_id
Key
Destination id for this model; auto-generated if not specifiedIn/Out
training_frame
Key
Training frameIn/Out
validation_frame
Key
Validation frameIn/Out
ignored_columns
string[]
Ignored columnsIn/Out
ignore_const_cols
boolean
Ignore constant columnsIn/Out
score_each_iteration
boolean
Whether to score during each iteration of model trainingIn/Out
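
The `combine_method` setting matters when a requested quantile falls between two observations, as the median does for an even sample size. The Python sketch below shows four plausible ways of combining the two neighbours. The method names used here (`low`, `high`, `average`, `interpolate`) are illustrative assumptions, not confirmed enum values.

```python
import math


def quantile(sorted_values, prob, combine_method="interpolate"):
    """Quantile of pre-sorted values; combine_method decides between neighbours."""
    position = prob * (len(sorted_values) - 1)
    lo, hi = math.floor(position), math.ceil(position)
    low_value, high_value = sorted_values[lo], sorted_values[hi]
    if lo == hi:
        # The quantile lands exactly on one observation; nothing to combine.
        return low_value
    if combine_method == "low":
        return low_value
    if combine_method == "high":
        return high_value
    if combine_method == "average":
        return (low_value + high_value) / 2
    if combine_method == "interpolate":
        return low_value + (position - lo) * (high_value - low_value)
    raise ValueError(f"unknown combine_method: {combine_method}")
```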

QuantileV3

parameters
QuantileParameters
Model builder parameters.In
__http_status
int
HTTP status to return for this build.In
_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In
algo
string
The algo name for this ModelBuilder.Out
algo_full_name
string
The pretty algo name for this ModelBuilder (e.g., Generalized Linear Model, rather than GLM).Out
can_build
enum[]
Model categories this ModelBuilder can build.Out
visibility
enum
Should the builder always be visible, be marked as beta, or only visible if the user starts up with the experimental flag?Out
job
Job
Job KeyOut
messages
ValidationMessage[]
Parameter validation messagesOut
error_count
int
Count of parameter validation errorsOut

RapidsV99

ast
string
An Abstract Syntax Tree.In
fun
string
An array of function definitions.In
ast_key
Key
A pointer to a FrameIn
_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In
error
string
Parsing error, if anyOut
key
Key
Result keyOut
num_rows
long
Rows in Frame resultOut
num_cols
int
Columns in Frame resultOut
scalar
double
Scalar resultOut
funstr
string
Function resultOut
col_names
string[]
Column NamesOut
string
string
String resultOut
result
string
resultOut
evaluated
boolean
Was evaluatedOut
head
string[][]
Head of a Frame resultOut
result_type
int
Result Type.Out
vec_ids
Key[]
Vec keys for key resultOut

RemoveAllV3

_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In

RemoveV3

key
Key
Object to be removed.In
_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In

RequestSchema

_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In
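
The `_exclude_fields` parameter appears on most request schemas in this reference. The Python sketch below builds a query string in the form shown in the description. The helper name `with_exclude_fields` is hypothetical, and only the standard library is used.

```python
from urllib.parse import urlencode


def with_exclude_fields(path, field_paths):
    """Append an _exclude_fields query parameter to a REST path."""
    # Keep "/" and "," unescaped so the result matches the documented form.
    query = urlencode({"_exclude_fields": ",".join(field_paths)}, safe="/,")
    return f"{path}?{query}"
```

Calling `with_exclude_fields("/3/Frames", ["frames/frame_id/URL", "__meta"])` reproduces the example URL from the field description.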

RouteBase

http_method
string
(No description available)Out
url_pattern
string
(No description available)Out
summary
string
(No description available)Out
handler_class
string
(No description available)Out
handler_method
string
(No description available)Out
input_schema
string
(No description available)Out
output_schema
string
(No description available)Out
doc_method
string
(No description available)Out
path_params
string[]
(No description available)Out
markdown
string
(No description available)Out

RouteV3

http_method
string
(No description available)Out
url_pattern
string
(No description available)Out
summary
string
(No description available)Out
handler_class
string
(No description available)Out
handler_method
string
(No description available)Out
input_schema
string
(No description available)Out
output_schema
string
(No description available)Out
doc_method
string
(No description available)Out
path_params
string[]
(No description available)Out
markdown
string
(No description available)Out

SVDModelOutputV99

v
double[][]
Right singular vectorsIn
d
double[]
Singular valuesIn
u_key
Key
Frame key of left singular vectorsIn
names
string[]
Column names.Out
domains
string[][]
Domains for categorical (enum) columns.Out
model_category
enum
Category of the model (e.g., Binomial).Out
model_summary
TwoDimTable
Model summaryOut
scoring_history
TwoDimTable
Scoring historyOut
training_metrics
ModelMetrics
Training data model metricsOut
validation_metrics
ModelMetrics
Validation data model metricsOut
help
Map
Help information for output fieldsOut

SVDModelV99

model_id
Key
Model keyIn/Out
parameters
SVDParameters
The build parameters for the model (e.g. K for KMeans).Out
output
SVDOutput
The build output for the model (e.g. the cluster centers for KMeans).Out
compatible_frames
string[]
Compatible frames, if requestedOut
checksum
long
Checksum for all the things that go into building the Model.Out
algo
string
The algo name for this Model.Out
algo_full_name
string
The pretty algo name for this Model (e.g., Generalized Linear Model, rather than GLM).Out

SVDParametersV99

transform
enum
Transformation of training dataIn
nv
int
Number of right singular vectorsIn
max_iterations
int
Maximum iterationsIn
seed
long
RNG seed for k-means++ initializationIn
keep_u
boolean
Save left singular vectors?In
u_name
string
Frame key to save left singular vectorsIn
use_all_factor_levels
boolean
Whether first factor level is included in each categorical expansionIn/Out
model_id
Key
Destination id for this model; auto-generated if not specifiedIn/Out
training_frame
Key
Training frameIn/Out
validation_frame
Key
Validation frameIn/Out
ignored_columns
string[]
Ignored columnsIn/Out
ignore_const_cols
boolean
Ignore constant columnsIn/Out
score_each_iteration
boolean
Whether to score during each iteration of model trainingIn/Out

SVDV99

parameters
SVDParameters
Model builder parameters.In
__http_status
int
HTTP status to return for this build.In
_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In
algo
string
The algo name for this ModelBuilder.Out
algo_full_name
string
The pretty algo name for this ModelBuilder (e.g., Generalized Linear Model, rather than GLM).Out
can_build
enum[]
Model categories this ModelBuilder can build.Out
visibility
enum
Should the builder always be visible, be marked as beta, or only visible if the user starts up with the experimental flag?Out
job
Job
Job KeyOut
messages
ValidationMessage[]
Parameter validation messagesOut
error_count
int
Count of parameter validation errorsOut

Schema

(No fields)

SchemaMetadataBase

version
int
Version number of the Schema.In
name
string
Simple name of the Schema. NOTE: the schema_names form a single namespace.In
superclass
string
Simple name of the superclass of the Schema. NOTE: the schema_names form a single namespace.In
type
string
Simple name of H2O type that this Schema represents. Must not be changed after creation (treat as final).In
fields
FieldMetadata[]
All the public fields of the schemaOut
markdown
string
Documentation for the schema in Markdown format with GitHub extensionsOut

SchemaMetadataV3

version
int
Version number of the Schema.In
name
string
Simple name of the Schema. NOTE: the schema_names form a single namespace.In
superclass
string
Simple name of the superclass of the Schema. NOTE: the schema_names form a single namespace.In
type
string
Simple name of H2O type that this Schema represents. Must not be changed after creation (treat as final).In
fields
FieldMetadata[]
All the public fields of the schemaOut
markdown
string
Documentation for the schema in Markdown format with GitHub extensionsOut

SharedTreeModelOutputV3

variable_importances
TwoDimTable
Variable ImportancesOut
init_f
double
The Intercept term, the initial model function value to which trees make adjustmentsOut
names
string[]
Column names.Out
domains
string[][]
Domains for categorical (enum) columns.Out
model_category
enum
Category of the model (e.g., Binomial).Out
model_summary
TwoDimTable
Model summaryOut
scoring_history
TwoDimTable
Scoring historyOut
training_metrics
ModelMetrics
Training data model metricsOut
validation_metrics
ModelMetrics
Validation data model metricsOut
help
Map
Help information for output fieldsOut

SharedTreeModelV3

model_id
Key
Model keyIn/Out
parameters
Parameters
The build parameters for the model (e.g. K for KMeans).Out
output
Output
The build output for the model (e.g. the cluster centers for KMeans).Out
compatible_frames
string[]
Compatible frames, if requestedOut
checksum
long
Checksum for all the things that go into building the Model.Out
algo
string
The algo name for this Model.Out
algo_full_name
string
The pretty algo name for this Model (e.g., Generalized Linear Model, rather than GLM).Out

SharedTreeParametersV3

ntrees
int
Number of trees.In
max_depth
int
Maximum tree depth.In
min_rows
double
Fewest allowed (weighted) observations in a leaf (in R called ‘nodesize’).In
nbins
int
For numerical columns (real/int), build a histogram of this many bins, then split at the best pointIn
nbins_cats
int
For categorical columns (enum), build a histogram of this many bins, then split at the best point. Higher values can lead to more overfitting.In
r2_stopping
double
Stop making trees when the R^2 metric equals or exceeds thisIn
seed
long
Seed for pseudo random number generator (if applicable)In
build_tree_one_node
boolean
Run on one node only; no network overhead but fewer cpus used. Suitable for small datasets.In
response_column
VecSpecifier
Response columnIn/Out
weights_column
VecSpecifier
Column with observation weightsIn/Out
offset_column
VecSpecifier
Offset columnIn/Out
balance_classes
boolean
Balance training data class counts via over/under-sampling (for imbalanced data).In/Out
class_sampling_factors
float[]
Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes.In/Out
max_after_balance_size
float
Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes.In/Out
max_confusion_matrix_size
int
Maximum size (# classes) for confusion matrices to be printed in the LogsIn/Out
max_hit_ratio_k
int
Max. number (top K) of predictions to use for hit ratio computation (for multi-class only, 0 to disable)In/Out
model_id
Key
Destination id for this model; auto-generated if not specifiedIn/Out
training_frame
Key
Training frameIn/Out
validation_frame
Key
Validation frameIn/Out
ignored_columns
string[]
Ignored columnsIn/Out
ignore_const_cols
boolean
Ignore constant columnsIn/Out
score_each_iteration
boolean
Whether to score during each iteration of model trainingIn/Out
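
The `r2_stopping` rule ("stop making trees when the R^2 metric equals or exceeds this") can be sketched in Python as follows. This uses the usual definition of R^2 as one minus the ratio of residual to total sum of squares. How and when H2O evaluates the metric during training is not covered here, and both function names are hypothetical.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_tot = sum((y - mean) ** 2 for y in actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return 1.0 - ss_res / ss_tot


def should_stop(r2, r2_stopping):
    """Stop adding trees once R^2 equals or exceeds the threshold."""
    return r2 >= r2_stopping
```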

SharedTreeV3

parameters
Parameters
Model builder parameters.In
__http_status
int
HTTP status to return for this build.In
_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In
algo
string
The algo name for this ModelBuilder.Out
algo_full_name
string
The pretty algo name for this ModelBuilder (e.g., Generalized Linear Model, rather than GLM).Out
can_build
enum[]
Model categories this ModelBuilder can build.Out
visibility
enum
Should the builder always be visible, be marked as beta, or only visible if the user starts up with the experimental flag?Out
job
Job
Job KeyOut
messages
ValidationMessage[]
Parameter validation messagesOut
error_count
int
Count of parameter validation errorsOut

ShutdownV3

_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In

SplitFrameV3

dataset
Key
DatasetIn
ratios
double[]
Split ratios; the resulting number of splits is ratios.length+1In
key
Key
Job KeyIn
description
string
Job descriptionIn
destination_frames
Key[]
Destination keys for each output frame split.In/Out
dest
Key
destination keyIn/Out
status
string
job statusOut
progress
float
progress, from 0 to 1Out
progress_msg
string
current progress status descriptionOut
start_time
long
Start timeOut
msec
long
runtimeOut
exception
string
exceptionOut
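
Because the number of splits is `ratios.length+1`, the last split is implied rather than listed. The Python sketch below assumes the final split receives whatever fraction remains after the listed ratios; that reading, the validation rules, and the name `full_split_ratios` are assumptions.

```python
def full_split_ratios(ratios):
    """Expand N listed ratios into N+1 fractions, the last being the remainder."""
    remainder = 1.0 - sum(ratios)
    if any(r <= 0 for r in ratios) or remainder <= 0:
        raise ValueError("ratios must be positive and sum to less than 1")
    return list(ratios) + [remainder]
```

For example, `ratios=[0.75]` yields two frames covering 75% and 25% of the rows, so `destination_frames` would need two keys.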

SynonymV3

key
Key
A word2vec model key.In
target
string
The target string to find synonyms.In
cnt
int
Find the top cnt synonyms of the target word.In
synonyms
string[]
The synonyms.Out
cos_sim
float[]
The cosine similarities.Out
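
The relationship between `target`, `cnt`, `synonyms`, and `cos_sim` can be sketched in Python: rank every other word by cosine similarity to the target vector and keep the top `cnt`. The in-memory `vectors` dictionary and both function names are hypothetical. A real word2vec model would supply the vectors.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_synonyms(target, vectors, cnt):
    """Return (synonyms, cos_sim) for the cnt words most similar to target."""
    target_vec = vectors[target]
    scored = [(word, cosine(target_vec, vec))
              for word, vec in vectors.items() if word != target]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    top = scored[:cnt]
    return [word for word, _ in top], [score for _, score in top]
```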

TimelineV3

_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In
now
long
Current time in millis.Out
self
string
This nodeOut
events
Iced[]
recorded timeline eventsOut

TreeStatsV3

min_depth
int
Minimum depthIn
max_depth
int
Maximum depthIn
mean_depth
float
Mean depthIn
min_leaves
int
Minimum number of leavesIn
max_leaves
int
Maximum number of leavesIn
mean_leaves
float
Mean number of leavesIn

TutorialsV3

_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In

TwoDimTableBase

name
string
Table NameOut
description
string
Table DescriptionOut
columns
Iced[]
Column SpecificationOut
rowcount
int
Number of RowsOut
data
Polymorphic[][]
Table Data (col-major)Out

TwoDimTableV3

name
string
Table NameOut
description
string
Table DescriptionOut
columns
Iced[]
Column SpecificationOut
rowcount
int
Number of RowsOut
data
Polymorphic[][]
Table Data (col-major)Out
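
Since `data` is column-major, clients that want to print or iterate by row need to transpose it. A minimal Python sketch, assuming `data[c][r]` holds the cell for column `c` and row `r`; the function name `table_rows` is hypothetical.

```python
def table_rows(data, rowcount):
    """Convert column-major table data into a list of rows."""
    return [[column[row] for column in data] for row in range(rowcount)]
```

Given two columns `["a", "b"]` and `[1, 2]`, this yields the rows `["a", 1]` and `["b", 2]`.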

TypeaheadV3

src
string
training_frameIn
limit
int
limitIn
_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In
matches
string[]
matchesOut

UnlockKeysV3

_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In

ValidationMessageBase

message_type
string
Type of validation message (ERROR, WARN, INFO, HIDE)Out
field_name
string
Field to which the message appliesOut
message
string
Message textOut

ValidationMessageV3

message_type
string
Type of validation message (ERROR, WARN, INFO, HIDE)Out
field_name
string
Field to which the message appliesOut
message
string
Message textOut
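
A client inspecting the `messages` array of a model builder response might tally the entries by `message_type`. The Python sketch below assumes each message has been decoded into a dictionary with the three fields above, and that `error_count` corresponds to the number of `ERROR` entries. Both points are assumptions, and `count_errors` is a hypothetical helper.

```python
def count_errors(messages):
    """Count validation messages whose message_type is ERROR."""
    return sum(1 for message in messages if message["message_type"] == "ERROR")
```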

VarImpBase

varimp
float[]
Variable importance of individual variablesOut
names
string[]
Names of variablesOut

VarImpV3

varimp
float[]
Variable importance of individual variablesOut
names
string[]
Names of variablesOut

VecKeyV3

name
string
Name (string representation) for this Key.In/Out
type
string
Name (string representation) for the type of Keyed this Key points to.In/Out
URL
string
URL for the resource that this Key points to, if one exists.In/Out

WaterMeterCpuTicksV3

nodeidx
int
Index of node to query ticks for (0-based)In
_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In
cpu_ticks
long[][]
array of tick counts per coreOut

WaterMeterIoV3

nodeidx
int
Index of node to query ticks for (0-based)In
_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In
persist_stats
Iced[]
array of IO infoOut

Word2VecModelOutputV3

names
string[]
Column names.Out
domains
string[][]
Domains for categorical (enum) columns.Out
model_category
enum
Category of the model (e.g., Binomial).Out
model_summary
TwoDimTable
Model summaryOut
scoring_history
TwoDimTable
Scoring historyOut
training_metrics
ModelMetrics
Training data model metricsOut
validation_metrics
ModelMetrics
Validation data model metricsOut
help
Map
Help information for output fieldsOut

Word2VecModelV3

model_id
Key
Model keyIn/Out
parameters
Word2VecParameters
The build parameters for the model (e.g. K for KMeans).Out
output
Word2VecOutput
The build output for the model (e.g. the cluster centers for KMeans).Out
compatible_frames
string[]
Compatible frames, if requestedOut
checksum
long
Checksum for all the things that go into building the Model.Out
algo
string
The algo name for this Model.Out
algo_full_name
string
The pretty algo name for this Model (e.g., Generalized Linear Model, rather than GLM).Out

Word2VecParametersV3

vecSize
int
Set size of word vectorsIn
windowSize
int
Set max skip length between wordsIn
sentSampleRate
float
Set threshold for occurrence of words. Those that appear with higher frequency in the training data will be randomly down-sampled; useful range is (0, 1e-5)In
normModel
enum
Use Hierarchical Softmax or Negative SamplingIn
negSampleCnt
int
Number of negative examples, common values are 3 - 10 (0 = not used)In
epochs
int
Number of training iterations to runIn
minWordFreq
int
Discards words that appear fewer than this many timesIn
initLearningRate
float
Set the starting learning rateIn
wordModel
enum
Use the continuous bag of words model or the Skip-Gram modelIn
model_id
Key
Destination id for this model; auto-generated if not specifiedIn/Out
training_frame
Key
Training frameIn/Out
validation_frame
Key
Validation frameIn/Out
ignored_columns
string[]
Ignored columnsIn/Out
ignore_const_cols
boolean
Ignore constant columnsIn/Out
score_each_iteration
boolean
Whether to score during each iteration of model trainingIn/Out

Word2VecV3

parameters
Word2VecParameters
Model builder parameters.In
__http_status
int
HTTP status to return for this build.In
_exclude_fields
string
Comma-separated list of JSON field paths to exclude from the result, used like: “/3/Frames?_exclude_fields=frames/frame_id/URL,__meta”In
algo
string
The algo name for this ModelBuilder.Out
algo_full_name
string
The pretty algo name for this ModelBuilder (e.g., Generalized Linear Model, rather than GLM).Out
can_build
enum[]
Model categories this ModelBuilder can build.Out
visibility
enum
Should the builder always be visible, be marked as beta, or only visible if the user starts up with the experimental flag?Out
job
Job
Job KeyOut
messages
ValidationMessage[]
Parameter validation messagesOut
error_count
int
Count of parameter validation errorsOut