Welcome to H2O
Welcome to the H2O documentation site! We’re glad you’re interested in learning more about H2O - if you have any questions, please email them to support@h2o.ai or post them on our Google groups website, h2ostream.
Note: To join our Google group on h2ostream, you need a Google account (such as Gmail or Google+). On the h2ostream page, click the Join group button, then click the New Topic button to post a new message.
We welcome your feedback! Please let us know if you have any questions or comments about this site by emailing us at support@h2o.ai.
Depending on your area of interest, select your learning path below:
- New Users
- Experienced Users
- Corporate Users
- Sparkling Water Users
- Python Users
- R Users
- API Users
- Developers
New Users
If you’re just getting started with H2O, here are some links to help you learn more:
Downloads page: First things first - download a copy of H2O here by selecting a build under “H2O-Dev” (the “Bleeding Edge” build contains the latest changes, while the latest alpha release represents a more stable build), then use the installation instruction tabs to install H2O on your client of choice (standalone, R, Python, Hadoop, or Maven).
For first-time users, we recommend downloading the latest alpha release and the default standalone option (the first tab) as the installation method.
Tutorials: We provide tutorials for each algorithm, so if you’d like to see a step-by-step example of our algorithms in action, this is a good place to start.
Getting Started with Flow: This document describes our new intuitive web interface, Flow. This interface is similar to iPython notebooks, and allows you to create a visual workflow to share with others.
Launch from the command line: This document describes some of the additional options that you can configure when launching H2O (for example, to specify a different directory for saved Flow data, allocate more memory, or use a flatfile for quick configuration of a cluster).
Data Science: This document describes the science behind our algorithms and provides a detailed, per-algo view of each model type.
Experienced Users
If you’ve used previous versions of H2O, the following links will help guide you through the process of upgrading to H2O 3.0.
Porting R Scripts: This document is designed to assist users who have created R scripts using previous versions of H2O. Due to the many improvements in R, scripts created using previous versions of H2O will not work. This document provides a side-by-side comparison of the changes in R for each algorithm, as well as overall structural enhancements R users should be aware of, and provides a link to a tool that assists users in upgrading their scripts.
Recent Changes: This document describes the most recent changes in the latest build of H2O. It lists new features, enhancements (including changed parameter default values), and bug fixes for each release, organized by sub-categories such as Python, R, and Web UI.
H2O vs H2O-dev: This document presents a side-by-side comparison of H2O 3.0 and the previous version of H2O. It compares and contrasts the features, capabilities, and supported algorithms between the versions. If you’d like to learn more about the benefits of upgrading, this is a great source of information.
Contributing code: If you’re interested in contributing code to H2O, we appreciate your assistance! This document describes how to access our list of Jiras that contributors can work on and how to contact us.
Corporate Users
If you’re considering using H2O in a corporate environment, you’ll be happy to know that H2O supports many popular scalable computing solutions, such as Hadoop and EC2 (AWS). For more information, refer to the following links.
How to Pass S3 Credentials to H2O: This document describes the necessary step of passing your S3 credentials to H2O so that H2O can be used with AWS.
H2O-Dev on EC2: This document describes how to launch H2O on an EC2 cluster.
Running H2O-Dev on Hadoop: This document describes how to run H2O on Hadoop.
Sparkling Water Users
Users of our Spark-compatible solution, Sparkling Water, should be aware that Sparkling Water is only supported with the latest version of H2O. For more information about Sparkling Water, refer to the following links.
Getting Started with Sparkling Water
Sparkling Water Blog Posts
Sparkling Water Meetup Slide Decks
Python Users
Pythonistas will be thrilled to know that H2O now provides support for this popular programming language! Python users can also use H2O with iPython notebooks. For more information, refer to the following links.
Python readme: This document describes how to set up and install the prerequisites for using Python with H2O.
Python docs: This document represents the definitive guide to using Python with H2O.
R Users
Don’t worry, R users - we still provide R support in the latest version of H2O, just as before. However, the R components of H2O have been cleaned up, simplified, and standardized, so the command format is easier and more intuitive. Due to these improvements, be aware that any scripts created with previous versions of H2O will not be compatible with the latest version. However, we have provided resources to assist R users in upgrading to the latest version in the form of a document that outlines the differences between versions and a tool that reviews scripts for deprecated or renamed parameters.
R User Documentation: This document contains all commands in the H2O package for R, including examples and arguments. It represents the definitive guide to using H2O in R.
Porting R Scripts: This document is designed to assist users who have created R scripts using previous versions of H2O. Due to the many improvements in R, scripts created using previous versions of H2O will not work. This document provides a side-by-side comparison of the changes in R for each algorithm, as well as overall structural enhancements R users should be aware of, and provides a link to a tool that assists users in upgrading their scripts.
API Users
API users will be happy to know that the APIs have been more thoroughly documented in the latest release of H2O and additional capabilities (such as exporting weights and biases for Deep Learning models) have been added.
REST APIs are generated directly from the code, allowing users to implement machine learning in many ways. For example, REST APIs could be used to call a model created from sensor data and to set up auto-alerts if the sensor data falls below a specified threshold.
REST API Reference: This document represents the definitive guide to the H2O REST API.
REST API Schema Reference: This document represents the definitive guide to the H2O REST API schemas.
Developers
If you’re looking to use H2O to help you develop your own apps, the following links will provide helpful references.
Maven install: This page provides information on how to build a version of H2O that generates the correct IDE files.
H2O Project Templates: This page provides template info for projects created in Java, Scala, or Sparkling Water.
Contributing code: If you’re interested in contributing code to H2O, we appreciate your assistance! This document describes how to access our list of Jiras that contributors can work on and how to contact us.
Introduction
This guide will walk you through how to use H2O-dev’s web UI, H2O Flow. To view a demo video of H2O Flow, click here.
About H2O Flow
H2O Flow is an open-source user interface for H2O. It is a web-based interactive computational environment that allows you to combine code execution, text, mathematics, plots, and rich media in a single document, similar to iPython Notebooks.
With H2O Flow, you can capture, rerun, annotate, present, and share your workflow. H2O Flow allows you to use H2O interactively to import files, build models, and iteratively improve them. Based on your models, you can make predictions and add rich text to create vignettes of your work - all within Flow’s browser-based environment.
Flow’s hybrid user interface seamlessly blends command-line computing with a modern graphical user interface. However, rather than displaying output as plain text, Flow provides a point-and-click user interface for every H2O operation. It allows you to access any H2O object in the form of well-organized tabular data.
H2O Flow sends commands to H2O as a sequence of executable cells. The cells can be modified, rearranged, or saved to a library. Each cell contains an input field that allows you to enter commands, define functions, call other functions, and access other cells or objects on the page. When you execute the cell, the output is a graphical object, which can be inspected to view additional details.
While H2O Flow supports the REST API, R scripts, and CoffeeScript, no programming experience is required to run H2O Flow. You can click your way through any H2O operation without ever writing a single line of code. You can even disable the input cells to run H2O Flow using only the GUI. H2O Flow is designed to guide you every step of the way, by providing input prompts, interactive help, and example flows.
Getting Help
First, let's go over the basics. Type `h` to view a list of helpful shortcuts.
The following help window displays:
To close this window, click the X in the upper-right corner or the Close button in the lower-right corner. You can also click behind the window to close it. To access this list of shortcuts again, click the Help menu and select Keyboard Shortcuts.
For additional help, select the Help sidebar to the right and click the Assist Me! button.
You can also type `assist` in a blank cell and press Ctrl+Enter. A list of common tasks displays to help you find the correct command.
There are multiple resources to help you get started with Flow in the Help sidebar. To access this document, select the Getting Started with H2O Flow link below the Help Topics heading.
You can also explore the pre-configured flows available in H2O Flow for a demonstration of how to create a flow. To view the example flows, click the Browse installed packs… link in the Packs subsection of the Help sidebar. Click the examples folder and select the example flow from the list.
If you have a flow currently open, a confirmation window appears asking if the current notebook should be replaced. To load the example flow, click the Load Notebook button.
To view the REST API documentation, click the Help tab in the sidebar and then select the type of REST API documentation (Routes or Schemas).
Before getting started with H2O Flow, make sure you understand the different cell modes.
Understanding Cell Modes
- Using Edit Mode
- Using Command Mode
- Changing Cell Formats
- Running Flows
- Using Keyboard Shortcuts
- Using Flow Buttons
There are two modes for cells: edit and command.
Using Edit Mode
In edit mode, the cell is yellow, with a blinking bar indicating where text can be entered and an orange flag to the left of the cell.
Using Command Mode
In command mode, the flag is yellow. The flag also indicates the cell’s format:
MD: Markdown
Note: Markdown formatting is not applied until you run the cell by clicking the Run button or clicking the Run menu and selecting Run.
CS: Code (default)
RAW: Raw format (for code comments)
H[1-6]: Heading level (where 1 is a first-level heading)
NOTE: If there is an error in the cell, the flag is red.
If the cell is executing commands, the flag is teal. The flag returns to yellow when the task is complete.
Changing Cell Formats
To change the cell's format (for example, from code to Markdown), make sure you are in command (not edit) mode and that the cell you want to change is selected. The easiest way to do this is to click on the flag to the left of the cell. Enter the keyboard shortcut for the format you want to use. The flag's text changes to display the current format.
Cell Mode | Keyboard Shortcut |
---|---|
Code | y |
Markdown | m |
Raw text | r |
Heading 1 | 1 |
Heading 2 | 2 |
Heading 3 | 3 |
Heading 4 | 4 |
Heading 5 | 5 |
Heading 6 | 6 |
Running Flows
When you run the flow, a progress bar displays the current status of the flow. You can cancel the currently running flow by clicking the Stop button in the progress bar.
When the flow is complete, a message displays in the upper right. Note: If there is an error in the flow, H2O Flow stops the flow at the cell that contains the error.
Using Keyboard Shortcuts
Here are some important keyboard shortcuts to remember:
- Click a cell and press Enter to enter edit mode, which allows you to change the contents of a cell.
- To exit edit mode, press Esc.
- To execute the contents of a cell, press the Ctrl and Enter keys at the same time.
The following commands must be entered in command mode.
- To add a new cell above the current cell, press a.
- To add a new cell below the current cell, press b.
- To delete the current cell, press the d key twice (dd).
You can view these shortcuts by clicking Help > Keyboard Shortcuts or by clicking the Help tab in the sidebar.
Using Flow Buttons
There are also a series of buttons at the top of the page below the flow name that allow you to save the current flow, add a new cell, move cells up or down, run the current cell, and cut, copy, or paste the current cell. If you hover over the button, a description of the button’s function displays.
You can also use the menus at the top of the screen to edit the order of the cells, toggle specific format types (such as input or output), create models, or score models. You can also access troubleshooting information or obtain help with Flow.
Note: To disable the code input and use H2O Flow strictly as a GUI, click the Cell menu, then Toggle Cell Input.
Now that you are familiar with the cell modes, let’s import some data.
Importing Data
If you don't have any of your own data to work with, you can find some example datasets here.
There are multiple ways to import data in H2O Flow:

- Click the Assist Me! button in the Help sidebar, then click the importFiles link. Enter the file path in the auto-completing Search entry field and press Enter, then select the file from the search results by clicking the Add All link.
- Drag and drop the file onto the Search field in the cell.
- In a blank cell, select the CS format, then enter `importFiles ["path/filename.format"]` (where `path/filename.format` represents the complete file path to the file, including the full file name). The file path can be a local file path or a website address.
After selecting the file to import, the file path displays in the “Search Results” section. To import a single file, click the plus sign next to the file. To import all files in the search results, click the Add all link. The files selected for import display in the “Selected Files” section.
To import the selected file(s), click the Import button.
To remove all files from the “Selected Files” list, click the Clear All link.
To remove a specific file, click the X next to the file path.
After you click the Import button, the raw code for the current job displays. A summary displays the results of the file import, including the number of imported files and their Network File System (nfs) locations.
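If you prefer to work from R rather than Flow, the same import is available through the `h2o` R package. A minimal sketch (the file path is a placeholder):

```r
library(h2o)
h2o.init()  # connect to a running H2O instance, or start one locally

# Import a file by local path or URL; H2O parses it into an H2OFrame
data <- h2o.importFile(path = "path/filename.format")
```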
Uploading Data
To upload a local file, click the Data menu and select Upload File…. Click the Choose File button, select the file, click the Choose button, then click the Upload button.
When the file has uploaded successfully, a message displays in the upper right and the Setup Parse cell displays.
Ok, now that your data is available in H2O Flow, let’s move on to the next step: parsing. Click the Parse these files button to continue.
Parsing Data
After you have imported your data, parse the data.
Select the parser type (if necessary) from the drop-down Parser list. For most data parsing, H2O automatically recognizes the data type, so the default settings typically do not need to be changed. The following options are available:
- Auto
- XLS
- CSV
- SVMLight
If a separator or delimiter is used, select it from the Separator list.
Select a column header option, if applicable:
- Auto: Automatically detect header types.
- First row contains column names: Specify heading as column names.
- First row contains data: Specify heading as data. This option is selected by default.
Select any necessary additional options:
- Enable single quotes as a field quotation character: Treat single quote marks (also known as apostrophes) in the data as a character, rather than an enum. This option is not selected by default.
- Delete on done: Check this checkbox to delete the imported data after parsing. This option is selected by default.
A preview of the data displays in the “Data Preview” section.
Note: To change the column type, select the drop-down list at the top of the column and select the data type. The options are:
- Unknown
- Numeric
- Enum
- Time
- UUID
- String
- Invalid
After making your selections, click the Parse button.
After you click the Parse button, the code for the current job displays.
Since we’ve submitted a couple of jobs (data import & parse) to H2O now, let’s take a moment to learn more about jobs in H2O.
Viewing Jobs
Any command (such as `importFiles`) you enter in H2O is submitted as a job, which is associated with a key. The key identifies the job within H2O and is used as a reference.
Viewing All Jobs
To view all jobs, click the Admin menu, then click Jobs, or enter `getJobs` in a cell in CS mode.
The following information displays:
- Type (for example, `Frame` or `Model`)
- Link to the object
- Description of the job type (for example, `Parse` or `GBM`)
- Start time
- End time
- Run time
To refresh this information, click the Refresh button. To view the details of the job, click the View button.
Viewing Specific Jobs
To view a specific job, click the link in the “Destination” column.
The following information displays:
- Type (for example, `Frame`)
- Link to object (key)
- Description (for example, `Parse`)
- Status
- Run time
- Progress
NOTE: For a better understanding of how jobs work, make sure to review the Viewing Frames section as well.
Ok, now that you understand how to find jobs in H2O, let’s submit a new one by building a model.
Building Models
To build a model:
- Click the Assist Me! button and select `buildModel`, or
- Click the Assist Me! button, select getFrames, then click the Build Model… button below the parsed .hex data set, or
- Click the View button after parsing data, then click the Build Model button, or
- Click the drop-down Model menu and select the model type from the list.

The Build Model… button can be accessed from any page containing the .hex key for the parsed data (for example, `getJobs` > `getFrame`).
In the Build a Model cell, select an algorithm from the drop-down menu:
- K-means: Create a K-Means model.
- Generalized Linear Model: Create a Generalized Linear model.
- Distributed RF: Create a distributed Random Forest model.
- Naive Bayes: Create a Naive Bayes model.
- Principal Component Analysis: Create a Principal Components Analysis model for modeling without regularization or performing dimensionality reduction.
- Gradient Boosting Machine: Create a Gradient Boosted model.
- Deep Learning: Create a Deep Learning model.
The available options vary depending on the selected model. If an option is only available for a specific model type, the model type is listed. If no model type is specified, the option is applicable to all model types.
Model_ID: (Optional) Enter a custom name for the model to use as a reference. By default, H2O automatically generates an ID containing the model type (for example, `gbm-6f6bdc8b-ccbc-474a-b590-4579eea44596`).

Training_frame: (Required) Select the dataset used to build the model. NOTE: If you click the Build a model button from the `Parse` cell, the training frame is entered automatically.

Validation_frame: (Optional) Select the dataset used to evaluate the accuracy of the model.
Ignored_columns: (Optional) Click the plus sign next to a column name to add it to the list of columns excluded from the model. To add all columns, click the Add all button. To remove a column from the list of ignored columns, click the X next to the column name. To remove all columns from the list of ignored columns, click the Clear all button.
Drop_na20_cols: (Optional) Check this checkbox to drop columns in which more than 20% of the values are missing (i.e., 0 or NA).
User_points: (K-Means, PCA) For K-Means, specify the number of initial cluster centers. For PCA, specify the initial Y matrix. Note: The PCA User_points parameter should only be used by advanced users for testing purposes.
Transform: (PCA) Select the transformation method for the training data: None, Standardize, Normalize, Demean, or Descale. The default is None.
Response_column: (Required for GLM, GBM, DL, DRF, NaiveBayes) Select the column to use as the dependent variable.
Solver: (GLM) Select the solver to use (IRLSM, L_BFGS, or auto). IRLSM is fast on problems with a small number of predictors and for lambda search with an L1 penalty, while L_BFGS scales better for datasets with many columns. The default is IRLSM.
Ntrees: (GBM, DRF) Specify the number of trees. The default value is 50.
Max_depth: (GBM, DRF) Specify the maximum tree depth. For GBM, the default value is 5. For DRF, the default value is 20.
Min_rows: (GBM, DRF) Specify the minimum number of observations for a leaf ("nodesize" in R). For Grid Search, use comma-separated values. The default value is 10.
Nbins: (GBM, DRF) Specify the number of bins for the histogram. The default value is 20.
Mtries: (DRF) Specify the columns to randomly select at each level. To use the square root of the number of columns, enter `-1`. The default value is -1.

Sample_rate: (DRF) Specify the sample rate. The range is 0 to 1.0 and the default value is 0.6666667.
Build_tree_one_node: (DRF) To run on a single node, check this checkbox. This is suitable for small datasets as there is no network overhead but fewer CPUs are used. The default setting is disabled.
Learn_rate: (GBM) Specify the learning rate. The range is 0.0 to 1.0 and the default is 0.1.
Distribution: (GBM) Select the distribution type from the drop-down list. The options are auto, bernoulli, multinomial, or gaussian and the default is auto.
Loss: (DL) Select the loss function. For DL, the options are Automatic, MeanSquare, CrossEntropy, Huber, or Absolute and the default value is Automatic. Absolute, MeanSquare, and Huber are applicable for regression or classification, while CrossEntropy is only applicable for classification. Huber can improve results for regression problems with outliers.
Score_each_iteration: (K-Means, DRF, NaiveBayes, PCA, GBM) To score during each iteration of the model training, check this checkbox.
K: (K-Means, PCA) For K-Means, specify the number of clusters. For PCA, specify the rank of matrix approximation. The default for K-Means and PCA is 1.
Gamma: (PCA) Specify the regularization weight for PCA. The default is 0.
Max_iterations: (K-Means, PCA, GLM) Specify the number of training iterations. For K-Means and PCA, the default is 1000. For GLM, the default is -1.
Beta_epsilon: (GLM) Specify the beta epsilon value. If the L1 normalization of the current beta change is below this threshold, the model is considered converged.
Init: (K-Means, PCA) Select the initialization mode. For K-Means, the options are Furthest, PlusPlus, Random, or User. For PCA, the options are PlusPlus, User, or None.
Note: If PlusPlus is selected, the initial Y matrix is chosen by the final cluster centers from the K-Means PlusPlus algorithm.
Family: (GLM) Select the model type (Gaussian, Binomial, Poisson, or Gamma).
Activation: (DL) Select the activation function (Tanh, TanhWithDropout, Rectifier, RectifierWithDropout, Maxout, MaxoutWithDropout). The default option is Rectifier.
Hidden: (DL) Specify the hidden layer sizes (e.g., 100,100). For Grid Search, use comma-separated values: (10,10),(20,20,20). The default value is [200,200]. The specified value(s) must be positive.
Epochs: (DL) Specify the number of times to iterate (stream) the dataset. The value can be a fraction. The default value for DL is 10.0.
Variable_importances: (DL) Check this checkbox to compute variable importance. This option is not selected by default.
Laplace: (NaiveBayes) Specify the Laplace smoothing parameter. The default value is 0.
Min_sdev: (NaiveBayes) Specify the minimum standard deviation to use for observations without enough data. The default value is 0.001.
Eps_sdev: (NaiveBayes) Specify the threshold for standard deviation. If this threshold is not met, the min_sdev value is used. The default value is 0.
Min_prob: (NaiveBayes) Specify the minimum probability to use for observations without enough data. The default value is 0.001.
Eps_prob: (NaiveBayes) Specify the threshold for probability. If this threshold is not met, the min_prob value is used. The default value is 0.
Standardize: (K-Means, GLM) To standardize the numeric columns to have mean of zero and unit variance, check this checkbox. Standardization is highly recommended; if you do not use standardization, the results can include components that are dominated by variables that appear to have larger variances relative to other attributes as a matter of scale, rather than true contribution. This option is selected by default.
Beta_constraints: (GLM) To use beta constraints, select a dataset from the drop-down menu. The selected frame is used to constrain the coefficient vector to provide upper and lower bounds.
Advanced Options
Checkpoint: (DL) Enter a model key associated with a previously-trained Deep Learning model. Use this option to build a new model as a continuation of a previously-generated model (e.g., by a grid search).
Use_all_factor_levels: (GLM, DL) Check this checkbox to use all factor levels in the possible set of predictors; if you enable this option, sufficient regularization is required. By default, the first factor level is skipped. For Deep Learning models, this option is useful for determining variable importances and is automatically enabled if the autoencoder is selected.
Train_samples_per_iteration: (DL) Specify the number of global training samples per MapReduce iteration. To specify one epoch, enter 0. To specify all available data (e.g., replicated training data), enter -1. To use the automatic values, enter -2. The default is -2.
Adaptive_rate: (DL) Check this checkbox to enable the adaptive learning rate (ADADELTA). This option is selected by default. If this option is enabled, the following parameters are ignored: `rate`, `rate_decay`, `rate_annealing`, `momentum_start`, `momentum_ramp`, `momentum_stable`, and `nesterov_accelerated_gradient`.

Input_dropout_ratio: (DL) Specify the input layer dropout ratio to improve generalization. Suggested values are 0.1 or 0.2. The range is 0 (inclusive) to 1 (exclusive) and the default value is 0.
L1: (DL) Specify the L1 regularization to add stability and improve generalization; sets the value of many weights to 0. The default value is 0.
L2: (DL) Specify the L2 regularization to add stability and improve generalization; sets the value of many weights to smaller values. The default value is 0.
Score_interval: (DL) Specify the shortest time interval (in seconds) to wait between model scoring. The default value is 5.
Score_training_samples: (DL) Specify the number of training set samples for scoring. To use all training samples, enter 0. The default value is 10000.
Score_validation_samples: (DL) (Requires selection from the Validation_Frame drop-down list) Specify the number of validation set samples for scoring. To use all validation set samples, enter 0. The default value is 0. This option is applicable to classification only.
Score_duty_cycle: (DL) Specify the maximum duty cycle fraction for scoring. A lower value results in more training and a higher value results in more scoring. The default value is 0.1.
Autoencoder: (DL) Check this checkbox to enable the Deep Learning autoencoder. This option is not selected by default. Note: This option requires a loss function other than CrossEntropy. If this option is enabled, use_all_factor_levels must be enabled.
Balance_classes: (GLM, GBM, DRF, DL, NaiveBayes) Oversample the minority classes to balance the class distribution. This option is not selected by default. This option is only applicable for classification. Majority classes can be undersampled to satisfy the Max_after_balance_size parameter.
Max_confusion_matrix_size: (DRF, NaiveBayes, GBM) Specify the maximum size (in number of classes) for confusion matrices to be printed in the Logs.
Max_hit_ratio_k: (DRF, NaiveBayes) Specify the maximum number (top K) of predictions to use for hit ratio computation. Applicable to multi-class only. To disable, enter 0.
Link: (GLM) Select a link function (Identity, Family_Default, Logit, Log, or Inverse).
Alpha: (GLM) Specify the regularization distribution between L1 and L2. The default value is 0.5.
Lambda: (GLM) Specify the regularization strength. There is no default value.
Lambda_search: (GLM) Check this checkbox to enable lambda search, starting with lambda max. The given lambda is then interpreted as lambda min.
Rate: (DL) Specify the learning rate. Higher rates result in less stable models and lower rates result in slower convergence. The default value is 0.005. Not applicable if adaptive_rate is enabled.
Rate_annealing: (DL) Specify the learning rate annealing. The formula is rate / (1 + rate_annealing * samples). The default value is 1e-06. Not applicable if adaptive_rate is enabled.
Momentum_start: (DL) Specify the initial momentum at the beginning of training. A suggested value is 0.5. The default value is 0. Not applicable if adaptive_rate is enabled.
Momentum_ramp: (DL) Specify the number of training samples for increasing the momentum. The default value is 1000000. Not applicable if adaptive_rate is enabled.
Momentum_stable: (DL) Specify the final momentum value reached after the momentum_ramp training samples. Not applicable if adaptive_rate is enabled.
Nesterov_accelerated_gradient: (DL) Check this checkbox to use the Nesterov accelerated gradient. This option is recommended and selected by default. Not applicable if adaptive_rate is enabled.
Hidden_dropout_ratios: (DL) Specify the hidden layer dropout ratios to improve generalization. Specify one value per hidden layer, each value between 0 and 1 (exclusive). There is no default value. This option is applicable only if TanhWithDropout, RectifierWithDropout, or MaxoutWithDropout is selected from the Activation drop-down list.
Expert Options
Keep_cross_validation_splits: (DL) Check this checkbox to keep the cross-validation frames. This option is not selected by default.
Overwrite_with_best_model: (DL) Check this checkbox to overwrite the final model with the best model found during training. This option is selected by default.
Target_ratio_comm_to_comp: (DL) Specify the target ratio of communication overhead to computation. This option is only enabled for multi-node operation and if train_samples_per_iteration equals -2 (auto-tuning). The default value is 0.02.
Rho: (DL) Specify the adaptive learning rate time decay factor. The default value is 0.99. This option is only applicable if adaptive_rate is enabled.
Epsilon: (DL) Specify the adaptive learning rate time smoothing factor to avoid dividing by zero. The default value is 1.0E-8. This option is only applicable if adaptive_rate is enabled.
Max_W2: (DL) Specify the constraint for the squared sum of the incoming weights per unit (e.g., for Rectifier). The default value is infinity.
Initial_weight_distribution: (DL) Select the initial weight distribution (Uniform Adaptive, Uniform, or Normal). The default is Uniform Adaptive. If Uniform Adaptive is used, the initial_weight_scale parameter is not applicable.
Initial_weight_scale: (DL) Specify the initial weight scale of the distribution function for Uniform or Normal distributions. For Uniform, the values are drawn uniformly from initial weight scale. For Normal, the values are drawn from a Normal distribution with the standard deviation of the initial weight scale. The default value is 1.0. If Uniform Adaptive is selected as the initial_weight_distribution, the initial_weight_scale parameter is not applicable.
Classification_stop: (DL) (Applicable to discrete/categorical datasets only) Specify the stopping criterion for classification error fractions on training data. To disable this option, enter -1. The default value is 0.0.
Max_hit_ratio_k: (DL, GLM) (Classification only) Specify the maximum number (top K) of predictions to use for hit ratio computation (for multi-class only). To disable this option, enter 0. The default value is 10.
Regression_stop: (DL) (Applicable to real value/continuous datasets only) Specify the stopping criterion for regression error (MSE) on the training data. To disable this option, enter -1. The default value is 0.000001.
Diagnostics: (DL) Check this checkbox to compute the variable importances for input features (using the Gedeon method). For large networks, selecting this option can reduce speed. This option is selected by default.
Fast_mode: (DL) Check this checkbox to enable fast mode, a minor approximation in back-propagation. This option is selected by default.
Ignore_const_cols: (DL) Check this checkbox to ignore constant training columns, since no information can be gained from them. This option is selected by default.
Force_load_balance: (DL) Check this checkbox to force extra load balancing to increase training speed for small datasets and use all cores. This option is selected by default.
Single_node_mode: (DL) Check this checkbox to force H2O to run on a single node for fine-tuning of model parameters. This option is not selected by default.
Replicate_training_data: (DL) Check this checkbox to replicate the entire training dataset on every node for faster training on small datasets. This option is not selected by default. This option is only applicable for clouds with more than one node.
Shuffle_training_data: (DL) Check this checkbox to shuffle the training data. This option is recommended if the training data is replicated and the value of train_samples_per_iteration is close to the number of nodes times the number of rows. This option is not selected by default.
Missing_values_handling: (DL) Select how to handle missing values (Skip or MeanImputation). The default value is MeanImputation.
Quiet_mode: (DL) Check this checkbox to display less output in the standard output. This option is not selected by default.
Sparse: (DL) Check this checkbox to use sparse iterators for the input layer. This option is not selected by default as it rarely improves performance.
Col_major: (DL) Check this checkbox to use a column major weight matrix for the input layer. This option can speed up forward propagation but may reduce the speed of backpropagation. This option is not selected by default.
Average_activation: (DL) Specify the average activation for the sparse autoencoder. The default value is 0. If Rectifier is selected as the Activation type, this value must be positive. For Tanh, the value must be in (-1,1).
Sparsity_beta: (DL) Specify the sparsity regularization. The default value is 0.
Max_categorical_features: (DL) Specify the maximum number of categorical features enforced via hashing. The default is unlimited.
Reproducible: (DL) To force reproducibility on small data, check this checkbox. If this option is enabled, the model takes more time to generate, since it uses only one thread.
Export_weights_and_biases: (DL) To export the neural network weights and biases as H2O frames, check this checkbox.
Class_sampling_factors: (GLM, DRF, NaiveBayes, GBM, DL) Specify the per-class (in lexicographical order) over/under-sampling ratios. By default, these ratios are automatically computed during training to obtain the class balance. There is no default value. This option is only applicable for classification problems and when Balance_Classes is enabled.
Seed: (K-Means, GBM, DL, DRF) Specify the random number generator (RNG) seed for algorithm components dependent on randomization. The seed is consistent for each H2O instance so that you can create models with the same starting conditions in alternative configurations.
Prior: (GLM) Specify the prior probability for y == 1. Use this parameter for logistic regression if the data has been sampled and the mean of the response does not reflect reality. The default value is -1.
Max_active_predictors: (GLM) Specify the maximum number of active predictors during computation. This value is used as a stopping criterion to prevent expensive model building with many predictors.
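For reference, here is how a handful of the options above map onto the R API. This is a minimal sketch, assuming a parsed training frame `train` with a categorical response column named `label` (both hypothetical):

```r
# Build a GBM using several of the options described above
model <- h2o.gbm(
  x = setdiff(colnames(train), "label"), # predictor columns (everything but the response)
  y = "label",                           # Response_column
  training_frame = train,                # Training_frame
  ntrees = 50,                           # Ntrees
  max_depth = 5,                         # Max_depth
  learn_rate = 0.1,                      # Learn_rate
  balance_classes = TRUE,                # Balance_classes
  seed = 1234                            # Seed
)
```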
Viewing Models
Click the Assist Me! button, then click the getModels link, or enter `getModels` in the cell in CS mode and press Ctrl+Enter. A list of available models displays.
To view all current models, you can also click the Model menu and click List All Models.
To inspect a model, check its checkbox then click the Inspect button, or click the Inspect button to the right of the model name.
A summary of the model’s parameters displays. To display more details, click the Show All Parameters button.
NOTE: The Clone this model… button will be supported in a future version.
To compare models, check the checkboxes for the models to use in the comparison and click the Compare selected models button. To select all models, check the checkbox at the top of the checkbox column (next to the KEY heading).
To delete a model, click the Delete button.
To generate a POJO to be able to use the model outside of H2O, click the Preview POJO button.
To learn how to make predictions, continue to the next section.
Making Predictions
After creating your model, click the key link for the model, then click the Predict button. Select the model to use in the prediction from the drop-down Model: menu and the data frame to use in the prediction from the drop-down Frame menu, then click the Predict button.
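From R, the equivalent is a single call. A sketch, assuming a trained `model` and a test frame `test` already exist in your session:

```r
# Score a frame with a trained model; returns an H2OFrame of predictions
predictions <- h2o.predict(model, test)
head(predictions)
```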
Viewing Predictions
Click the Assist Me! button, then click the getPredictions link, or enter `getPredictions` in the cell in CS mode and press Ctrl+Enter. A list of the stored predictions displays.
To view a prediction, click the View button to the right of the model name.
You can also view predictions by clicking the drop-down Score menu and selecting List All Predictions.
Viewing Frames
To view a specific frame, click the "Key" link for the specified frame, or enter `getFrame "FrameName"` in a cell in CS mode (where `FrameName` is the name of a frame, such as `allyears2k.hex`).

From the `getFrame` cell, you can:
- view a truncated list of the rows in the data frame by clicking the View Data button
- create a model by clicking the Build Model button
- make a prediction based on the data by clicking the Predict button
- download the data as a .csv file by clicking the Download button
- view the columns, data, and factors in more detail or plot a graph by clicking the Inspect button
- view the characteristics or domain of a specific column by clicking the Summary link
When you view a frame, you can “drill-down” to the necessary level of detail (such as a specific column or row) using the View Data and Inspect buttons. The following screenshot displays the results of clicking the Inspect button.
This screenshot displays the results of clicking the Summary link for the first column.
To view all frames, click the Assist Me! button, then click the getFrames link, or enter `getFrames` in the cell in CS mode and press Ctrl+Enter. You can also view all current frames by clicking the drop-down Data menu and selecting List All Frames.
A list of the current frames in H2O displays that includes the following information for each frame:
- Column headings
- Number of rows and columns
- Size
For parsed data, the following information displays:
- Link to the .hex file
- The Build Model, Predict, and Inspect buttons
To make a prediction, check the checkboxes for the frames you want to use to make the prediction, then click the Predict on Selected Frames button.
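From R, you can perform the same inspection with `h2o.ls` and `h2o.getFrame`. A sketch, using the example frame key mentioned above:

```r
h2o.ls()  # list the keys currently stored in the H2O cluster

# Fetch a frame by its key and summarize its columns
frame <- h2o.getFrame("allyears2k.hex")
summary(frame)
```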
Splitting Frames
In H2O Flow, you can split datasets within Flow for use in training and testing.
- To split a frame, click the Assist Me button, then click splitFrame. Note: You can also click the drop-down Data menu and select Split Frame….
- From the drop-down Frame: list, select the frame to split.
- In the second Ratio entry field, specify the fractional value to determine the split. The first Ratio field is automatically calculated based on the value entered in the second Ratio field. Note: Only fractional values between 0 and 1 are supported (for example, enter `.5` to split the frame in half). The total sum of the ratio values must equal one; H2O automatically adjusts the ratio values to equal one, and if unsupported values are entered, an error displays.
- In the Key entry field, specify a name for the new frame.
- (Optional) To add another split, click the Add a split link. To remove a split, click the `X` to the right of the Key entry field.
- Click the Create button.
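The same split is available in R as `h2o.splitFrame`. A sketch, assuming a parsed frame `data`:

```r
# Split a frame 75/25; ratios are the fractional sizes of the first n splits
splits <- h2o.splitFrame(data, ratios = 0.75)
train <- splits[[1]]  # 75% of the rows
test  <- splits[[2]]  # the remaining 25%
```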
Creating Frames
To create a frame with a large amount of random data (for example, to use for testing), click the drop-down Admin menu, then select Create Synthetic Frame. Customize the frame as needed, then click the Create button to create the frame.
Plotting Frames
To create a plot from a frame, click the Inspect button, then click the Plot button.
Select the type of plot (point, line, area, or interval) from the drop-down Type menu, then select the x-axis and y-axis from the following options:
- label
- missing
- zeros
- pinfs
- ninfs
- min
- max
- mean
- sigma
- type
- cardinality
Select one of the above options from the drop-down Color menu to display the specified data in color, then click the Plot button to plot the data.
Using Clips
Clips enable you to save cells containing your workflow for later reuse. To save a cell as a clip, click the paperclip icon to the right of the cell (highlighted in the red box in the following screenshot).
To use a clip in a workflow, click the “Clips” tab in the sidebar on the right.
All saved clips, including the default system clips (such as `assist`, `importFiles`, and `predict`), are listed. Clips you have created are listed under the "My Clips" heading. To select a clip to insert, click the circular button to the left of the clip name. To delete a clip, click the trashcan icon to the right of the clip name.
NOTE: The default clips listed under “System” cannot be deleted.
Deleted clips are stored in the trash. To permanently delete all clips in the trash, click the Empty Trash button.
NOTE: Saved data, including flows and clips, are persistent as long as the same IP address is used for the cluster. If a new IP is used, previously saved flows and clips are not available.
Viewing Outlines
The “Outline” tab in the sidebar displays a brief summary of the cells currently used in your flow; essentially, a command history.
- To jump to a specific cell, click the cell description.
- To delete a cell, select it and press the X key on your keyboard.
Saving Flows
- Finding Saved Flows on your Disk
- Saving Flows on a Hadoop cluster
- Duplicating Flows
- Downloading Flows
- Loading Flows
You can save your flow for later reuse. To save your flow as a notebook, click the “Save” button (the first button in the row of buttons below the flow name), or click the drop-down “Flow” menu and select “Save.” To enter a custom name for the flow, click the default flow name (“Untitled Flow”) and type the desired flow name. A pencil icon indicates where to enter the desired name.
To confirm the name, click the checkmark to the right of the name field.
To reuse a saved flow, click the “Flows” tab in the sidebar, then click the flow name. To delete a saved flow, click the trashcan icon to the right of the flow name.
Finding Saved Flows on your Disk
By default, flows are saved to the `h2oflows` directory underneath your home directory. The directory where flows are saved is printed to stdout:
03-20 14:54:20.945 172.16.2.39:54323 95667 main INFO: Flow dir: '/Users/<UserName>/h2oflows'
To back up saved flows, copy this directory to your preferred backup location.
To specify a different location for saved flows, use the command-line argument `-flow_dir` when launching H2O:
java -jar h2o.jar -flow_dir /<New>/<Location>/<For>/<Saved>/<Flows>
where `/<New>/<Location>/<For>/<Saved>/<Flows>` represents the specified location. If the directory does not exist, it will be created the first time you save a flow.
Saving Flows on a Hadoop cluster
Note: If you are running H2O Flow on a Hadoop cluster, H2O will try to find the HDFS home directory to use as the default directory for flows. If the HDFS home directory is not found, flows cannot be saved unless a directory is specified while launching using `-flow_dir`:
hadoop jar h2odriver.jar -nodes 1 -mapperXmx 1g -output hdfsOutputDirName -flow_dir hdfs:///<Saved>/<Flows>/<Location>
The location specified in `-flow_dir` may be either an HDFS or regular filesystem directory. If the directory does not exist, it will be created the first time you save a flow.
Duplicating Flows
To create a copy of the current flow, select the Flow menu, then click Make a Copy. The name of the current flow changes to "Copy of <FlowName>" (where <FlowName> is the name of the original flow).
Downloading Flows
After saving a flow as a notebook, click the Flow menu, then select Download…. A new window opens and the saved flow is downloaded to the default downloads folder on your computer. The file is exported as `<filename>.flow`, where `<filename>` is the name specified when the flow was saved.
Caution: You must have an active internet connection to export flows.
Loading Flows
To load a saved flow, click the Flows tab in the sidebar at the right, then click the flow name. In the pop-up confirmation window that appears, select Load Notebook, or click Cancel to return to the current flow.
After clicking Load Notebook, the saved flow is loaded.
To load an exported flow, click the Flow menu and select Open…. In the pop-up window that appears, click the Choose File button and select the exported flow, then click the Open button.
Notes:
- Only exported flows using the default .flow filetype are supported. Other filetypes will not open.
- If the current notebook has the same name as the selected file, a pop-up confirmation appears to confirm that the current notebook should be overwritten.
Troubleshooting
- Viewing Cluster Status
- Viewing CPU Status (Water Meter)
- Viewing Logs
- Downloading Logs
- Viewing Stack Trace Information
- Viewing Network Test Results
- Accessing the Profiler
- Viewing the Timeline
- Shutting Down H2O
To troubleshoot issues in Flow, use the Admin menu. The Admin menu allows you to check the status of the cluster, view a timeline of events, and view or download logs for issue analysis.
NOTE: To view the current version, click the Help menu, then click About.
Viewing Cluster Status
Click the Admin menu, then select Cluster Status. A summary of the status of the cluster (also known as a cloud) displays, which includes the following information:
- Cluster health
- Whether all nodes can communicate (consensus)
- Whether new nodes can join (locked/unlocked) Note: After you submit a job to H2O, the cluster does not accept new nodes.
- H2O version
- Number of used and available nodes
- When the cluster was created
The following information displays for each node:
- IP address (name)
- Time of last ping
- Number of cores
- Load
- Amount of data (used/total)
- Percentage of cached data
- GC (free/total/max)
- Amount of disk space in GB (free/max)
- Percentage of free disk space
To view more information, click the Show Advanced button.
Viewing CPU Status (Water Meter)
To view the current CPU usage, click the Admin menu, then click Water Meter (CPU Meter). A new window opens, displaying the current CPU use statistics.
Viewing Logs
To view the logs for troubleshooting, click the Admin menu, then click Inspect Log.
To view the logs for a specific node, select it from the drop-down Select Node menu.
Downloading Logs
To download the logs for further analysis, click the Admin menu, then click Download Log. A new window opens and the logs download to your default download folder. You can close the new window after downloading the logs. Send the logs to support@h2o.ai for issue resolution.
Viewing Stack Trace Information
To view the stack trace information, click the Admin menu, then click Stack Trace.
To view the stack trace information for a specific node, select it from the drop-down Select Node menu.
Viewing Network Test Results
To view network test results, click the Admin menu, then click Network Test.
Accessing the Profiler
The Profiler looks across the cluster to see where the same stack trace occurs, and can be helpful for identifying what the currently used CPU is doing. To view the profiler, click the Admin menu, then click Profiler.
To view the profiler information for a specific node, select it from the drop-down Select Node menu.
Viewing the Timeline
To view a timeline of events in Flow, click the Admin menu, then click Timeline. The following information displays for each event:
- Time of occurrence (HH:MM:SS:MS)
- Number of nanoseconds for duration
- Originator of event (“who”)
- I/O type
- Event type
- Number of bytes sent & received
To obtain the most recent information, click the Refresh button.
Shutting Down H2O
To shut down H2O, click the Admin menu, then click Shut Down. A Shut down complete message displays in the upper right when the cluster has been shut down.
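From R, the same action is available as `h2o.shutdown`. A sketch:

```r
# Shut down the H2O cluster; prompt = TRUE asks for confirmation first
h2o.shutdown(prompt = TRUE)
```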
Porting R Scripts from H2O to H2O-Dev
This document outlines how to port R scripts written for previous versions of H2O for compatibility with the new H2O-Dev API. When upgrading from H2O to H2O-Dev, most functions are the same. However, there are some differences that must be resolved when porting scripts originally created in H2O over to H2O-Dev.
The original R script for H2O is listed first, followed by the updated script for H2O-Dev.
Some of the parameters have been renamed for consistency. For each algorithm, a table that describes the differences is provided.
For additional assistance within R, enter a question mark before the command (for example, `?h2o.glm`).
There is also a “shim” available that will review R scripts created with previous versions of H2O, identify deprecated or renamed parameters, and suggest replacements. For more information, refer to the repo here.
Changes from H2O to H2O-Dev
h2o.exec
The `h2o.exec` command is no longer supported. Any workflows using `h2o.exec` must be revised to remove this command. If the H2O-Dev workflow contains any parameters or commands from H2O, errors will result and the workflow will fail.

The purpose of `h2o.exec` was to wrap expressions so that they could be evaluated in a single `Exec2` call. For example, `h2o.exec(fr[,1] + 2/fr[,3])` and `fr[,1] + 2/fr[,3]` produced the same results in H2O. However, the first example makes a single REST call and uses a single temp object, while the second makes several REST calls and uses several temp objects.
Due to the improved architecture in H2O-Dev, the need to use `h2o.exec` has been eliminated, as the expression can be processed by R as an "unwrapped" typical R expression. Currently, the only known exception is when `factor` is used in conjunction with `h2o.exec`. For example, `h2o.exec(fr$myIntCol <- factor(fr$myIntCol))` would become `fr$myIntCol <- as.factor(fr$myIntCol)`.
Note also that an array is not inside a string: an int array is [1, 2, 3], not "[1, 2, 3]", and a String array is ["f00", "b4r"], not "[\"f00\", \"b4r\"]". Only string values are enclosed in double quotation marks (`"`).
h2o.performance
To access any exclusively binomial output, use `h2o.performance`, optionally with the corresponding accessor. The accessor can only use the model metrics object created by `h2o.performance`. Each accessor is named for its corresponding field (for example, `h2o.AUC`, `h2o.gini`, `h2o.F1`). `h2o.performance` supports all current algorithms except for K-Means.

If you specify a data frame as a second parameter, H2O will use the specified data frame for scoring. If you do not specify a second parameter, the training metrics for the model metrics object are used.
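A minimal sketch of this pattern, using the accessor names listed above and assuming a binomial model `model` and a held-out frame `test` (both hypothetical):

```r
# Compute metrics on a held-out frame, then pull individual fields
perf <- h2o.performance(model, test)
h2o.AUC(perf)  # area under the ROC curve
h2o.F1(perf)   # F1 scores across thresholds
```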
xval and validation slots
The `xval` slot has been removed, as `nfolds` is not currently supported. The `validation` slot has been merged with the `model` slot.
Principal Components Regression (PCR)
Principal Components Regression (PCR) has also been deprecated. To obtain PCR values, create a Principal Components Analysis (PCA) model, then create a GLM model from the scored data from the PCA model.
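A minimal sketch of this workaround, assuming a training frame `train`, a vector of predictor names `predictors`, and a numeric response column named `response` (all hypothetical):

```r
# 1. Fit PCA on the predictors, keeping the first five components
pca <- h2o.prcomp(training_frame = train, x = predictors, k = 5)

# 2. Score the training data to obtain the principal component columns
pcs <- h2o.predict(pca, train)

# 3. Re-attach the response and fit a GLM on the component scores
pcr_data <- h2o.cbind(pcs, train[, "response"])
pcr_model <- h2o.glm(x = 1:5, y = "response", training_frame = pcr_data)
```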
GBM
N-fold cross-validation and grid search will be supported in a future version of H2O-Dev.
Renamed GBM Parameters
The following parameters have been renamed, but retain the same functions:
H2O Parameter Name | H2O-Dev Parameter Name |
---|---|
`data` | `training_frame` |
`key` | `model_id` |
`n.trees` | `ntrees` |
`interaction.depth` | `max_depth` |
`n.minobsinnode` | `min_rows` |
`shrinkage` | `learn_rate` |
`n.bins` | `nbins` |
`validation` | `validation_frame` |
`balance.classes` | `balance_classes` |
`max.after.balance.size` | `max_after_balance_size` |
Deprecated GBM Parameters
The following parameters have been removed:
- `group_split`: Bit-set group splitting of categorical variables is now the default.
- `importance`: Variable importances are now computed automatically and displayed in the model output.
- `holdout.fraction`: The fraction of the training data to hold out for validation is no longer supported.
- `grid.parallelism`: Specifying the number of parallel threads to run during a grid search is no longer supported. Grid search will be supported in a future version of H2O-Dev.
New GBM Parameters
The following parameters have been added:
- `seed`: A random number to control sampling and initialization when `balance_classes` is enabled.
- `score_each_iteration`: Display error rate information after each tree in the requested set is built.
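Putting the renames together, a classic call such as `h2o.gbm(y = y, x = x, data = hex, n.trees = 100, interaction.depth = 4, shrinkage = 0.05)` becomes (a sketch; `hex`, `x`, and `y` are assumed):

```r
model <- h2o.gbm(
  y = y, x = x,
  training_frame = hex, # was: data
  ntrees = 100,         # was: n.trees
  max_depth = 4,        # was: interaction.depth
  learn_rate = 0.05     # was: shrinkage
)
```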
GBM Algorithm Comparison
H2O | H2O-Dev |
---|---|
`h2o.gbm <- function(` | `h2o.gbm <- function(` |
`x,` | `x,` |
`y,` | `y,` |
`data,` | `training_frame,` |
`key = "",` | `model_id,` |
`distribution = 'multinomial',` | `distribution = c("bernoulli", "multinomial", "gaussian"),` |
`n.trees = 10,` | `ntrees = 50,` |
`interaction.depth = 5,` | `max_depth = 5,` |
`n.minobsinnode = 10,` | `min_rows = 10,` |
`shrinkage = 0.1,` | `learn_rate = 0.1,` |
`n.bins = 20,` | `nbins = 20,` |
`validation,` | `validation_frame = NULL,` |
`balance.classes = FALSE,` | `balance_classes = FALSE,` |
`max.after.balance.size = 5,` | `max_after_balance_size = 1,` |
 | `seed,` |
 | `score_each_iteration)` |
`group_split = TRUE,` | |
`importance = FALSE,` | |
`nfolds = 0,` | |
`holdout.fraction = 0,` | |
`class.sampling.factors = NULL,` | |
`grid.parallelism = 1)` | |
Output
The following table provides the component name in H2O, the corresponding component name in H2O-Dev (if supported), and the model type (binomial, multinomial, or all). Many components are now included in `h2o.performance`; for more information, refer to the `h2o.performance` section above.
H2O | H2O-Dev | Model Type |
---|---|---|
`@model$priorDistribution` | | `all` |
`@model$params` | `@allparameters` | `all` |
`@model$err` | `@model$scoring_history` | `all` |
`@model$classification` | | `all` |
`@model$varimp` | `@model$variable_importances` | `all` |
`@model$confusion` | `@model$training_metrics$cm$table` | `binomial` and `multinomial` |
`@model$auc` | `@model$training_metrics$AUC` | `binomial` |
`@model$gini` | `@model$training_metrics$Gini` | `binomial` |
`@model$best_cutoff` | | `binomial` |
`@model$F1` | `@model$training_metrics$thresholds_and_metric_scores$f1` | `binomial` |
`@model$F2` | `@model$training_metrics$thresholds_and_metric_scores$f2` | `binomial` |
`@model$accuracy` | `@model$training_metrics$thresholds_and_metric_scores$accuracy` | `binomial` |
`@model$error` | | `binomial` |
`@model$precision` | `@model$training_metrics$thresholds_and_metric_scores$precision` | `binomial` |
`@model$recall` | `@model$training_metrics$thresholds_and_metric_scores$recall` | `binomial` |
`@model$mcc` | `@model$training_metrics$thresholds_and_metric_scores$absolute_MCC` | `binomial` |
`@model$max_per_class_err` | currently replaced by `@model$training_metrics$thresholds_and_metric_scores$min_per_class_correct` | `binomial` |
GLM
N-fold cross-validation and grid search will be supported in a future version of H2O-Dev.
Renamed GLM Parameters
The following parameters have been renamed, but retain the same functions:
H2O Parameter Name | H2O-Dev Parameter Name |
---|---|
`data` | `training_frame` |
`key` | `model_id` |
`nlambda` | `nlambdas` |
`lambda.min.ratio` | `lambda_min_ratio` |
`iter.max` | `max_iterations` |
`epsilon` | `beta_epsilon` |
Deprecated GLM Parameters
The following parameters have been removed:
- `return_all_lambda`: A logical value indicating whether to return every model built during the lambda search. (may be re-added)
- `higher_accuracy`: For improved accuracy, adjust the `beta_epsilon` value.
- `strong_rules`: Discards predictors likely to have 0 coefficients prior to model building. (may be re-added as enabled by default)
- `intercept`: Defines factor columns in the model. (may be re-added)
- `non_negative`: Specify a non-negative response. (may be re-added)
- `variable_importances`: Variable importances are now computed automatically and displayed in the model output. They have been renamed to Normalized Coefficient Magnitudes.
- `disable_line_search`: This parameter has been deprecated, as it was mainly used for testing purposes.
- `offset`: Specify a column as an offset. (may be re-added)
- `max_predictors`: Stops training the algorithm if the number of predictors exceeds the specified value. (may be re-added)
New GLM Parameters
The following parameters have been added:
validation_frame
: Specify the validation dataset.
solver
: Select IRLSM or L_BFGS.
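As a minimal sketch of the new H2O-Dev signature shown in the table below (the train and valid frames and the column indices are hypothetical):
# `train` and `valid` are hypothetical H2OFrames
job <- h2o.startGLMJob(x = 1:4, y = 5,
                       training_frame = train,
                       validation_frame = valid,  # new parameter
                       family = "binomial",
                       solver = "IRLSM")          # new parameter: IRLSM or L_BFGS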
GLM Algorithm Comparison
H2O | H2O-Dev |
---|---|
h2o.glm <- function( | h2o.startGLMJob <- function( |
x, | x, |
y, | y, |
data, | training_frame, |
key = "", | model_id, |
 | validation_frame, |
iter.max = 100, | max_iterations = 50, |
epsilon = 1e-4, | beta_epsilon = 0, |
strong_rules = TRUE, | |
return_all_lambda = FALSE, | |
intercept = TRUE, | |
non_negative = FALSE, | |
 | solver = c("IRLSM", "L_BFGS"), |
standardize = TRUE, | standardize = TRUE, |
family, | family = c("gaussian", "binomial", "poisson", "gamma", "tweedie"), |
link, | link = c("family_default", "identity", "logit", "log", "inverse", "tweedie"), |
tweedie.p = ifelse(family == "tweedie", 1.5, NA_real_), | tweedie_variance_power = NaN, |
 | tweedie_link_power = NaN, |
alpha = 0.5, | alpha = 0.5, |
prior = NULL, | prior = 0.0, |
lambda = 1e-5, | lambda = 1e-05, |
lambda_search = FALSE, | lambda_search = FALSE, |
nlambda = -1, | nlambdas = -1, |
lambda.min.ratio = -1, | lambda_min_ratio = 1.0, |
use_all_factor_levels = FALSE, | use_all_factor_levels = FALSE, |
nfolds = 0, | nfolds = 0, |
beta_constraints = NULL, | beta_constraints = NULL) |
higher_accuracy = FALSE, | |
variable_importances = FALSE, | |
disable_line_search = FALSE, | |
offset = NULL, | |
max_predictors = -1) | |
Output
The following table provides the component name in H2O, the corresponding component name in H2O-Dev (if supported), and the model type (binomial, multinomial, or all). Many components are now included in h2o.performance; for more information, refer to h2o.performance.
H2O | H2O-Dev | Model Type |
---|---|---|
@model$params | @allparameters | all |
@model$coefficients | @model$coefficients | all |
@model$normalized_coefficients | @model$coefficients_table$norm_coefficients | all |
@model$rank | @model$rank | all |
@model$iter | @model$iter | all |
@model$lambda | | all |
@model$deviance | @model$residual_deviance | all |
@model$null.deviance | @model$null_deviance | all |
@model$df.residual | @model$residual_degrees_of_freedom | all |
@model$df.null | @model$null_degrees_of_freedom | all |
@model$aic | @model$AIC | all |
@model$train.err | | binomial |
@model$prior | | binomial |
@model$thresholds | @model$threshold | binomial |
@model$best_threshold | | binomial |
@model$auc | @model$AUC | binomial |
@model$confusion | | binomial |
K-Means
Renamed K-Means Parameters
The following parameters have been renamed, but retain the same functions:
H2O Parameter Name | H2O-Dev Parameter Name |
---|---|
data | training_frame |
key | model_id |
centers | k |
cols | x |
iter.max | max_iterations |
normalize | standardize |
Note: In H2O, the normalize parameter was disabled by default. The standardize parameter is enabled by default in H2O-Dev to provide more accurate results for datasets containing columns with large values.
New K-Means Parameters
The following parameters have been added:
user
: Has been added as an additional option for the init parameter. Using this option forces the K-Means algorithm to start at the user-specified points.
user_points
: Specify starting points for the K-Means algorithm.
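A minimal sketch of user-specified initialization (the frame and column names are hypothetical):
# `train` is a hypothetical H2OFrame with numeric columns c1 and c2
start_points <- train[1:2, c("c1", "c2")]       # two rows to use as initial centers
km <- h2o.kmeans(training_frame = train, x = c("c1", "c2"), k = 2,
                 init = "User",                 # new option for init
                 user_points = start_points)    # new parameter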
K-Means Algorithm Comparison
H2O | H2O-Dev |
---|---|
h2o.kmeans <- function( | h2o.kmeans <- function( |
data, | training_frame, |
cols = '', | x, |
centers, | k, |
key = "", | model_id, |
iter.max = 10, | max_iterations = 1000, |
normalize = FALSE, | standardize = TRUE, |
init = "none", | init = c("Furthest", "Random", "PlusPlus"), |
seed = 0, | seed) |
Output
The following table provides the component name in H2O and the corresponding component name in H2O-Dev (if supported).
H2O | H2O-Dev |
---|---|
@model$params | @allparameters |
@model$centers | @model$centers |
@model$tot.withinss | @model$tot_withinss |
@model$size | @model$size |
@model$iter | @model$iterations |
 | @model$_scoring_history |
 | @model$_model_summary |
Deep Learning
- Renamed Deep Learning Parameters
- Deprecated DL Parameters
- New DL Parameters
- DL Algorithm Comparison
- Output
N-fold cross-validation and grid search will be supported in a future version of H2O-Dev.
Note: If the results in the confusion matrix are incorrect, verify that score_training_samples is equal to 0. By default, only the first 10,000 rows are included.
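For example, a minimal sketch that scores on the full training data (the frame and column indices are hypothetical):
dl <- h2o.deeplearning(x = 1:4, y = 5, training_frame = train,  # `train` is hypothetical
                       score_training_samples = 0)  # 0 = score on all training rows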
Renamed Deep Learning Parameters
The following parameters have been renamed, but retain the same functions:
H2O Parameter Name | H2O-Dev Parameter Name |
---|---|
data | training_frame |
key | model_id |
validation | validation_frame |
class.sampling.factors | class_sampling_factors |
nfolds | n_folds |
override_with_best_model | overwrite_with_best_model |
Deprecated DL Parameters
The following parameters have been removed:
classification
: Classification is now inferred from the data type.
holdout_fraction
: Fraction of the training data to hold out for validation.
New DL Parameters
The following parameters have been added:
export_weights_and_biases
: An additional option allowing users to export the raw weights and biases as H2O frames.
The following options for the loss parameter have been added:
absolute
: Provides strong penalties for mispredictions
huber
: Can improve results for regression
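A minimal sketch using the new options (the frame and column indices are hypothetical):
dl <- h2o.deeplearning(x = 1:4, y = 5, training_frame = train,  # `train` is hypothetical
                       loss = "Huber",                    # new loss option
                       export_weights_and_biases = TRUE)  # export raw weights/biases as H2O frames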
DL Algorithm Comparison
H2O | H2O-Dev |
---|---|
h2o.deeplearning <- function(x, | h2o.deeplearning <- function(x, |
y, | y, |
data, | training_frame, |
key = "", | model_id = "", |
override_with_best_model, | overwrite_with_best_model = true, |
classification = TRUE, | |
nfolds = 0, | n_folds = 0, |
validation, | validation_frame, |
holdout_fraction = 0, | |
checkpoint = " ", | checkpoint, |
autoencoder, | autoencoder = false, |
use_all_factor_levels, | use_all_factor_levels = true, |
activation, | activation = c("Rectifier", "Tanh", "TanhWithDropout", "RectifierWithDropout", "Maxout", "MaxoutWithDropout"), |
hidden, | hidden = c(200, 200), |
epochs, | epochs = 10.0, |
train_samples_per_iteration, | train_samples_per_iteration = -2, |
seed, | seed, |
adaptive_rate, | adaptive_rate = true, |
rho, | rho = 0.99, |
epsilon, | epsilon = 1e-8, |
rate, | rate = 0.005, |
rate_annealing, | rate_annealing = 1e-6, |
rate_decay, | rate_decay = 1.0, |
momentum_start, | momentum_start = 0, |
momentum_ramp, | momentum_ramp = 1e6, |
momentum_stable, | momentum_stable = 0, |
nesterov_accelerated_gradient, | nesterov_accelerated_gradient = true, |
input_dropout_ratio, | input_dropout_ratio = 0.0, |
hidden_dropout_ratios, | hidden_dropout_ratios, |
l1, | l1 = 0.0, |
l2, | l2 = 0.0, |
max_w2, | max_w2 = Inf, |
initial_weight_distribution, | initial_weight_distribution = c("UniformAdaptive", "Uniform", "Normal"), |
initial_weight_scale, | initial_weight_scale = 1.0, |
loss, | loss = c("Automatic", "CrossEntropy", "MeanSquare", "Absolute", "Huber"), |
score_interval, | score_interval = 5, |
score_training_samples, | score_training_samples = 10000L, |
score_validation_samples, | score_validation_samples = 0L, |
score_duty_cycle, | score_duty_cycle = 0.1, |
classification_stop, | classification_stop = 0, |
regression_stop, | regression_stop = 1e-6, |
quiet_mode, | quiet_mode = false, |
max_confusion_matrix_size, | max_confusion_matrix_size, |
max_hit_ratio_k, | max_hit_ratio_k, |
balance_classes, | balance_classes = false, |
class_sampling_factors, | class_sampling_factors, |
max_after_balance_size, | max_after_balance_size, |
score_validation_sampling, | score_validation_sampling, |
diagnostics, | diagnostics = true, |
variable_importances, | variable_importances = false, |
fast_mode, | fast_mode = true, |
ignore_const_cols, | ignore_const_cols = true, |
force_load_balance, | force_load_balance = true, |
replicate_training_data, | replicate_training_data = true, |
single_node_mode, | single_node_mode = false, |
shuffle_training_data, | shuffle_training_data = false, |
sparse, | sparse = false, |
col_major, | col_major = false, |
max_categorical_features, | max_categorical_features = Integer.MAX_VALUE, |
reproducible) | reproducible = FALSE, |
average_activation | average_activation = 0, |
sparsity_beta = 0 | |
 | export_weights_and_biases = FALSE) |
Output
The following table provides the component name in H2O, the corresponding component name in H2O-Dev (if supported), and the model type (binomial, multinomial, or all). Many components are now included in h2o.performance; for more information, refer to h2o.performance.
H2O | H2O-Dev | Model Type |
---|---|---|
@model$priorDistribution | | all |
@model$params | @allparameters | all |
@model$train_class_error | @model$training_metrics$MSE | all |
@model$valid_class_error | @model$validation_metrics$MSE | all |
@model$varimp | @model$_variable_importances | all |
@model$confusion | @model$training_metrics$cm$table | binomial and multinomial |
@model$train_auc | @model$train_AUC | binomial |
 | @model$_validation_metrics | all |
 | @model$_model_summary | all |
 | @model$_scoring_history | all |
Distributed Random Forest
- Changes to DRF in H2O-Dev
- Renamed DRF Parameters
- Deprecated DRF Parameters
- New DRF Parameters
- DRF Algorithm Comparison
- Output
Changes to DRF in H2O-Dev
Distributed Random Forest (DRF) was represented as h2o.randomForest(type="BigData", ...) in H2O. In H2O, SpeeDRF (type="fast") was not as accurate, especially for complex data with categoricals, and did not address regression problems. DRF (type="BigData") was at least as accurate as SpeeDRF (type="fast") and was the only algorithm that scaled to big data (data too large to fit on a single node).
In H2O-Dev, our plan is to improve the performance of DRF so that the data fits on a single node (optimally, for all cases), which will make SpeeDRF obsolete. Ultimately, the goal is to provide a single algorithm that provides the "best of both worlds" for all datasets and use cases.
Note: H2O-Dev only supports DRF. SpeeDRF is no longer supported. The functionality of DRF in H2O-Dev is similar to DRF functionality in H2O.
Renamed DRF Parameters
The following parameters have been renamed, but retain the same functions:
H2O Parameter Name | H2O-Dev Parameter Name |
---|---|
data | training_frame |
key | model_id |
validation | validation_frame |
sample.rate | sample_rate |
ntree | ntrees |
depth | max_depth |
balance.classes | balance_classes |
score.each.iteration | score_each_iteration |
class.sampling.factors | class_sampling_factors |
nodesize | min_rows |
Deprecated DRF Parameters
The following parameters have been removed:
classification
: This is now automatically inferred from the response type. To achieve classification with a 0/1 response column, explicitly convert the response to a factor (as.factor()); see the sketch following the new DRF parameter below.
importance
: Variable importances are now computed automatically and displayed in the model output.
holdout.fraction
: Specifying the fraction of the training data to hold out for validation is no longer supported.
doGrpSplit
: The bit-set group splitting of categorical variables is now the default.
verbose
: Information about tree splits and extra statistics is now included automatically in the stdout.
oobee
: The out-of-bag error estimate is now computed automatically (if no validation set is specified).
stat.type
: This parameter was used for SpeeDRF, which is no longer supported.
type
: This parameter was used for SpeeDRF, which is no longer supported.
New DRF Parameters
The following parameter has been added:
build_tree_one_node
: Run on a single node to use fewer CPUs.
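A minimal sketch combining the factor conversion mentioned above with the new parameter (the frame and column names are hypothetical):
train$response <- as.factor(train$response)  # convert a hypothetical 0/1 column to a factor
rf <- h2o.randomForest(x = 1:4, y = "response",
                       training_frame = train,
                       build_tree_one_node = TRUE)  # new: build trees on a single node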
DRF Algorithm Comparison
H2O | H2O-Dev |
---|---|
h2o.randomForest <- function(x, | h2o.randomForest <- function( |
x, | x, |
y, | y, |
data, | training_frame, |
key = "", | model_id, |
validation, | validation_frame, |
mtries = -1, | mtries = -1, |
sample.rate = 2/3, | sample_rate = 0.6666667, |
 | build_tree_one_node = FALSE, |
ntree = 50, | ntrees = 50, |
depth = 20, | max_depth = 20, |
 | min_rows = 1, |
nbins = 20, | nbins = 20, |
balance.classes = FALSE, | balance_classes = FALSE, |
score.each.iteration = FALSE, | score_each_iteration = FALSE, |
seed = -1, | seed, |
nodesize = 1, | |
classification = TRUE, | |
importance = FALSE, | |
nfolds = 0, | |
holdout.fraction = 0, | |
max.after.balance.size = 5, | max_after_balance_size) |
class.sampling.factors = NULL, | |
doGrpSplit = TRUE, | |
verbose = FALSE, | |
oobee = TRUE, | |
stat.type = "ENTROPY", | |
type = "fast") | |
Output
The following table provides the component name in H2O, the corresponding component name in H2O-Dev (if supported), and the model type (binomial, multinomial, or all). Many components are now included in h2o.performance
; for more information, refer to (h2o.performance
).
H2O | H2O-Dev | Model Type |
---|---|---|
@model$priorDistribution | | all |
@model$params | @allparameters | all |
@model$mse | @model$scoring_history | all |
@model$forest | @model$model_summary | all |
@model$classification | | all |
@model$varimp | @model$variable_importances | all |
@model$confusion | @model$training_metrics$cm$table | binomial and multinomial |
@model$auc | @model$training_metrics$AUC | binomial |
@model$gini | @model$training_metrics$Gini | binomial |
@model$best_cutoff | | binomial |
@model$F1 | @model$training_metrics$thresholds_and_metric_scores$f1 | binomial |
@model$F2 | @model$training_metrics$thresholds_and_metric_scores$f2 | binomial |
@model$accuracy | @model$training_metrics$thresholds_and_metric_scores$accuracy | binomial |
@model$Error | @model$Error | binomial |
@model$precision | @model$training_metrics$thresholds_and_metric_scores$precision | binomial |
@model$recall | @model$training_metrics$thresholds_and_metric_scores$recall | binomial |
@model$mcc | @model$training_metrics$thresholds_and_metric_scores$absolute_MCC | binomial |
@model$max_per_class_err | currently replaced by @model$training_metrics$thresholds_and_metric_scores$min_per_class_correct | binomial |
How to Launch H2O-Dev from the Command Line
You can use Terminal (OS X) or the Command Prompt (Windows) to launch H2O-Dev. When you launch from the command line, you can include additional instructions to H2O-Dev, such as how many nodes to launch, how much memory to allocate for each node, and what names to assign to the nodes in the cloud.
There are two different argument types:
- JVM arguments
- H2O arguments
The arguments use the following format: java <JVM Options> -jar h2o.jar <H2O Options>.
JVM Options
-version
: Display Java version info.
-Xmx<Heap Size>
: To set the total heap size for an H2O node, configure the memory allocation option -Xmx. By default, this option is set to 1 Gb (-Xmx1g). When launching nodes, we recommend allocating a total of four times the memory of your data.
Note: Do not try to launch H2O with more memory than you have available.
H2O Options
-h or -help
: Display this information in the command line output.
-name <H2O-DevCloudName>
: Assign a name to the H2O instance in the cloud (where <H2O-DevCloudName> is the name of the cloud). Nodes with the same cloud name will form an H2O cloud (also known as an H2O cluster).
-flatfile <FileName>
: Specify a flatfile of IP addresses for faster cloud formation (where <FileName> is the name of the flatfile).
-ip <IPnodeAddress>
: Specify an IP address other than the default localhost for the node to use (where <IPnodeAddress> is the IP address).
-port <#>
: Specify a port number other than the default 54321 for the node to use (where <#> is the port number).
-network <IPv4NetworkSpecification1>[,<IPv4NetworkSpecification2> ...]
: Specify a range (where applicable) of IP addresses (where <IPv4NetworkSpecification1> represents the first interface, <IPv4NetworkSpecification2> represents the second, and so on). The IP address discovery code binds to the first interface that matches one of the networks in the comma-separated list. For example, 10.1.2.0/24 supports 256 possibilities.
-ice_root <fileSystemPath>
: Specify a directory for H2O to spill temporary data to disk (where <fileSystemPath> is the file path).
-flow_dir <server-side or HDFS directory>
: Specify a directory for saved flows. The default is /Users/h2o-<H2OUserName>/h2oflows (where <H2OUserName> is your user name).
-nthreads <#ofThreads>
: Specify the maximum number of threads in the low-priority batch work queue (where <#ofThreads> is the number of threads). The default is 99.
-client
: Launch H2O node in client mode. This is used mostly for running Sparkling Water.
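For example, several of these options can be combined in one launch command (all values here are illustrative):
java -Xmx2g -jar h2o.jar -name MyCloud -ip 192.168.1.163 -port 54321 -nthreads 4 -flow_dir /tmp/myflows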
Cloud Formation Behavior
New H2O nodes join to form a cloud during launch. After a job has started on the cloud, it prevents new members from joining.
To start an H2O node with 4GB of memory and a default cloud name:
java -Xmx4g -jar h2o.jar
To start an H2O node with 6GB of memory and a specific cloud name:
java -Xmx6g -jar h2o.jar -name MyCloud
To start an H2O cloud with three 2GB nodes using the default cloud names:
java -Xmx2g -jar h2o.jar &
java -Xmx2g -jar h2o.jar &
java -Xmx2g -jar h2o.jar &
Wait for the INFO: Registered: # schemas in: #mS
output before entering the above command again to add another node (the number for # will vary).
Flatfile Configuration
If you are configuring many nodes, it is faster and easier to use the -flatfile
option, rather than -ip
and -port
.
To configure H2O-Dev on a multi-node cluster:
- Locate a set of hosts.
- Download the appropriate version of H2O-Dev for your environment.
- Verify that the same h2o.jar file is available on all hosts.
Create a flatfile (a plain text file with the IP and port numbers of the hosts). Use one entry per line. For example:
192.168.1.163:54321
192.168.1.164:54321
- Copy the flatfile.txt to each node in the cluster.
Use the -Xmx option to specify the amount of memory for each node. The cluster's memory capacity is the sum of all H2O nodes in the cluster. For example, if you create a cluster with four 20g nodes (by specifying -Xmx20g four times), H2O will have a total of 80 gigs of memory available. For best performance, we recommend sizing your cluster to be about four times the size of your data. To avoid swapping, the -Xmx allocation must not exceed the physical memory on any node. Allocating the same amount of memory for all nodes is strongly recommended, as H2O-Dev works best with symmetric nodes.
Note that the optional -ip and -port options specify the IP address and ports to use. The -ip option is especially helpful for hosts with multiple network interfaces.
java -Xmx20g -jar h2o.jar -flatfile flatfile.txt -port 54321
The output will resemble the following:
04-20 16:14:00.253 192.168.1.70:54321 2754 main INFO: 1. Open a terminal and run 'ssh -L 55555:localhost:54321 H2O-DevUser@###.###.#.##'
04-20 16:14:00.253 192.168.1.70:54321 2754 main INFO: 2. Point your browser to http://localhost:55555
04-20 16:14:00.437 192.168.1.70:54321 2754 main INFO: Log dir: '/tmp/h2o-H2O-DevUser/h2ologs'
04-20 16:14:00.437 192.168.1.70:54321 2754 main INFO: Cur dir: '/Users/H2O-DevUser/h2o-dev'
04-20 16:14:00.459 192.168.1.70:54321 2754 main INFO: HDFS subsystem successfully initialized
04-20 16:14:00.460 192.168.1.70:54321 2754 main INFO: S3 subsystem successfully initialized
04-20 16:14:00.460 192.168.1.70:54321 2754 main INFO: Flow dir: '/Users/H2O-DevUser/h2oflows'
04-20 16:14:00.475 192.168.1.70:54321 2754 main INFO: Cloud of size 1 formed [/192.168.1.70:54321]
As you add more nodes to your cluster, the output is updated:
INFO WATER: Cloud of size 2 formed [/...]...
Access the H2O-Dev web UI (Flow) with your browser. Point your browser to the HTTP address specified in the output line Listening for HTTP and REST traffic on ....
To check if the cloud is available, point to the URL http://<ip>:<port>/Cloud.json (an example of the JSON response is provided below). Wait for cloud_size to be the expected value and the consensus field to be true:
{ ... "cloud_size": 2, "consensus": true, ... }
H2O-Dev on EC2
Tested on Redhat AMI, Amazon Linux AMI, and Ubuntu AMI
Launch H2O-Dev
- Selecting the Operating System and Virtualization Type
- Configuring the Instance
- Downloading Java and H2O
Note: Before launching H2O on an EC2 cluster, verify that ports 54321 and 54322 are both accessible by TCP and UDP.
Selecting the Operating System and Virtualization Type
Select your operating system and the virtualization type of the prebuilt AMI on Amazon. If you are using Windows, you will need to use a hardware-assisted virtual machine (HVM). If you are using Linux, you can choose between para-virtualization (PV) and HVM. These selections determine the type of instances you can launch.
For more information about virtualization types, refer to Amazon.
Configuring the Instance
Select the IAM role and policy to use to launch the instance. H2O detects the temporary access keys associated with the instance, so you don’t need to copy your AWS credentials to the instances.
When launching the instance, select an accessible key pair.
(Windows Users) Tunneling into the Instance
For Windows users who cannot use ssh from the terminal, download either Cygwin or Git Bash, both of which can run ssh:
ssh -i amy_account.pem ec2-user@54.165.25.98
Otherwise, download PuTTY and follow these instructions:
- Launch the PuTTY Key Generator.
- Load your downloaded AWS pem key file. Note: To see the file, change the browser file type to “All”.
Save the private key as a .ppk file.
Launch the PuTTY client.
In the Session section, enter the host name or IP address. For Ubuntu users, the default host name is ubuntu@<ip-address>. For Linux users, the default host name is ec2-user@<ip-address>.
Select SSH, then Auth in the sidebar, and click the Browse button to select the private key file for authentication.
Start a new session and click the Yes button to confirm caching of the server’s rsa2 key fingerprint and continue connecting.
Downloading Java and H2O
- Download Java (JDK 1.7 or later) if it is not already available on the instance.
To download H2O, run the wget command with the link to the zip file available on our website (copy the link associated with the Download button for the selected H2O-Dev build):
wget http://h2o-release.s3.amazonaws.com/h2o-dev/rel-serre/1/index.html
unzip h2o-dev-0.2.1.1.zip
cd h2o-dev-0.2.1.1
java -Xmx4g -jar h2o.jar
- From your browser, navigate to <Private_IP_Address>:54321 or <Public_DNS>:54321 to use H2O's web interface.
Launch H2O-Dev from the Command Line
Important Notes
Java is a pre-requisite for H2O; if you do not already have Java installed, make sure to install it before installing H2O. Java is available free on the web, and can be installed quickly. Although Java is required to run H2O, no programming is necessary. For users that only want to run H2O without compiling their own code, Java Runtime Environment (version 1.6 or later) is sufficient, but for users planning on compiling their own builds, we strongly recommend using Java Development Kit 1.7 or later.
After installation, launch H2O using the -Xmx argument, which sets the amount of memory given to H2O. If your data set is large, allocate more memory by using, for example, -Xmx4g instead of the default -Xmx1g, which allocates 4g rather than the default 1g to your instance. For best performance, the amount of memory allocated to H2O should be four times the size of your data, but never more than the total amount of memory on your computer.
Step-by-Step Walk-Through
Download the .zip file containing the latest release of H2O-Dev from the H2O downloads page.
From your terminal, change your working directory to the same directory as the location of the .zip file.
From your terminal, unzip the .zip file. For example: unzip h2o-dev-0.2.1.1.zip
At the prompt, enter the following commands:
cd h2o-dev-0.2.1.1    # change working directory to the unzipped folder
java -Xmx4g -jar h2o.jar    # run the basic java command to start H2O
After a few moments, output similar to the following appears in your terminal window:
03-23 14:57:52.930 172.16.2.39:54321 1932 main INFO: ----- H2O started -----
03-23 14:57:52.997 172.16.2.39:54321 1932 main INFO: Build git branch: rel-serre
03-23 14:57:52.998 172.16.2.39:54321 1932 main INFO: Build git hash: 9eaa5f0c4ca39144b1fd180aedb535b5ba08b2ce
03-23 14:57:52.998 172.16.2.39:54321 1932 main INFO: Build git describe: jenkins-rel-serre-1
03-23 14:57:52.998 172.16.2.39:54321 1932 main INFO: Build project version: 0.2.1.1
03-23 14:57:52.998 172.16.2.39:54321 1932 main INFO: Built by: 'jenkins'
03-23 14:57:52.998 172.16.2.39:54321 1932 main INFO: Built on: '2015-03-18 12:55:28'
03-23 14:57:52.998 172.16.2.39:54321 1932 main INFO: Java availableProcessors: 8
03-23 14:57:52.999 172.16.2.39:54321 1932 main INFO: Java heap totalMemory: 245.5 MB
03-23 14:57:52.999 172.16.2.39:54321 1932 main INFO: Java heap maxMemory: 3.56 GB
03-23 14:57:52.999 172.16.2.39:54321 1932 main INFO: Java version: Java 1.7.0_67 (from Oracle Corporation)
03-23 14:57:52.999 172.16.2.39:54321 1932 main INFO: OS version: Mac OS X 10.10.2 (x86_64)
03-23 14:57:52.999 172.16.2.39:54321 1932 main INFO: Machine physical memory: 16.00 GB
03-23 14:57:52.999 172.16.2.39:54321 1932 main INFO: Possible IP Address: en5 (en5), fe80:0:0:0:daeb:97ff:feb3:6d4b%10
03-23 14:57:52.999 172.16.2.39:54321 1932 main INFO: Possible IP Address: en5 (en5), 172.16.2.39
03-23 14:57:53.000 172.16.2.39:54321 1932 main INFO: Possible IP Address: lo0 (lo0), fe80:0:0:0:0:0:0:1%1
03-23 14:57:53.000 172.16.2.39:54321 1932 main INFO: Possible IP Address: lo0 (lo0), 0:0:0:0:0:0:0:1
03-23 14:57:53.000 172.16.2.39:54321 1932 main INFO: Possible IP Address: lo0 (lo0), 127.0.0.1
03-23 14:57:53.000 172.16.2.39:54321 1932 main INFO: Internal communication uses port: 54322
03-23 14:57:53.000 172.16.2.39:54321 1932 main INFO: Listening for HTTP and REST traffic on http://172.16.2.39:54321/
03-23 14:57:53.001 172.16.2.39:54321 1932 main INFO: H2O cloud name: 'H2O-Dev-User' on /172.16.2.39:54321, discovery address /238.222.48.136:61150
03-23 14:57:53.001 172.16.2.39:54321 1932 main INFO: If you have trouble connecting, try SSH tunneling from your local machine (e.g., via port 55555):
03-23 14:57:53.001 172.16.2.39:54321 1932 main INFO: 1. Open a terminal and run 'ssh -L 55555:localhost:54321 H2O-Dev-User@172.16.2.39'
03-23 14:57:53.001 172.16.2.39:54321 1932 main INFO: 2. Point your browser to http://localhost:55555
03-23 14:57:53.211 172.16.2.39:54321 1932 main INFO: Log dir: '/tmp/h2o-H2O-Dev-User/h2ologs'
03-23 14:57:53.211 172.16.2.39:54321 1932 main INFO: Cur dir: '/Users/H2O-Dev-User/Downloads/h2o-dev-0.2.1.1'
03-23 14:57:53.234 172.16.2.39:54321 1932 main INFO: HDFS subsystem successfully initialized
03-23 14:57:53.234 172.16.2.39:54321 1932 main INFO: S3 subsystem successfully initialized
03-23 14:57:53.235 172.16.2.39:54321 1932 main INFO: Flow dir: '/Users/H2O-Dev-User/h2oflows'
03-23 14:57:53.248 172.16.2.39:54321 1932 main INFO: Cloud of size 1 formed [/172.16.2.39:54321]
03-23 14:57:53.776 172.16.2.39:54321 1932 main WARN: Found schema field which violates the naming convention; name has mixed lowercase and uppercase characters: ModelParametersSchema.dropNA20Cols
03-23 14:57:53.935 172.16.2.39:54321 1932 main INFO: Registered: 142 schemas in: 605mS
Point your web browser to http://localhost:54321/. The user interface appears in your browser, and now H2O-Dev is ready to go.
WARNING: On Windows systems, Internet Explorer is frequently blocked due to security settings. If you cannot reach http://localhost:54321, try using a different web browser, such as Firefox or Chrome.
Running H2O-Dev on Hadoop
Currently supported versions:
- CDH 5.2
- CDH 5.3
- HDP 2.1
- HDP 2.2
- MapR 3.1.1
- MapR 4.0.1
Important Points to Remember:
- Each H2O node runs as a mapper
- Run only one mapper per host
- There are no combiners or reducers
- Each H2O cluster must have a unique job name
- -mapperXmx, -nodes, and -output are required
- Root permissions are not required - just unzip the H2O .zip file on any single node
Prerequisite: Open Communication Paths
H2O communicates using two communication paths. Verify these are open and available for use by H2O.
Path 1: mapper to driver
Optionally specify this port using the -driverport
option in the hadoop jar
command (see “Hadoop Launch Parameters” below). This port is opened on the driver host (the host where you entered the hadoop jar
command). By default, this port is chosen randomly by the operating system.
Path 2: mapper to mapper
Optionally specify this port using the -baseport
option in the hadoop jar
command (see “Hadoop Launch Parameters” below). This port and the next subsequent port are opened on the mapper hosts (the Hadoop worker nodes) where the H2O mapper nodes are placed by the Resource Manager. By default, ports 54321 (TCP) and 54322 (TCP & UDP) are used.
The mapper port is adaptive: if 54321 and 54322 are not available, H2O will try 54323 and 54324 and so on. The mapper port is designed to be adaptive because sometimes if the YARN cluster is low on resources, YARN will place two H2O mappers for the same H2O cluster request on the same physical host. For this reason, we recommend opening a range of more than two ports (20 ports should be sufficient).
Tutorial
The following tutorial will walk the user through the download or build of H2O and the parameters involved in launching H2O from the command line.
Download the latest H2O-dev release for your version of Hadoop:
wget http://h2o-release.s3.amazonaws.com/h2o-dev/master/1110/h2o-dev-0.3.0.1110-cdh5.2.zip
wget http://h2o-release.s3.amazonaws.com/h2o-dev/master/1110/h2o-dev-0.3.0.1110-cdh5.3.zip
wget http://h2o-release.s3.amazonaws.com/h2o-dev/master/1110/h2o-dev-0.3.0.1110-hdp2.1.zip
wget http://h2o-release.s3.amazonaws.com/h2o-dev/master/1110/h2o-dev-0.3.0.1110-hdp2.2.zip
wget http://h2o-release.s3.amazonaws.com/h2o-dev/master/1110/h2o-dev-0.3.0.1110-mapr3.1.1.zip
wget http://h2o-release.s3.amazonaws.com/h2o-dev/master/1110/h2o-dev-0.3.0.1110-mapr4.0.1.zip
Note: Enter only one of the above commands.
Prepare the job input on the Hadoop Node by unzipping the build file and changing to the directory with the Hadoop and H2O’s driver jar files.
unzip h2o-dev-0.3.0.1110-*.zip
cd h2o-dev-0.3.0.1110-*
To launch H2O nodes and form a cluster on the Hadoop cluster, run:
hadoop jar h2odriver.jar -nodes 1 -mapperXmx 1g -output hdfsOutputDirName
The above command launches a 1g node of H2O. We recommend launching the cluster with at least four times the memory of your data file size.
-mapperXmx is the amount of memory allocated to each mapper (node).
-nodes is the number of nodes requested to form the cluster.
-output is the name of the HDFS output directory; it is created each time an H2O cloud is launched, so the name must be unique for each launch.
To monitor your job, direct your web browser to your standard job tracker Web UI. To access H2O's Web UI, direct your web browser to one of the launched instances. If you are unsure where your JVM is launched, review the output from your command after the nodes have clouded up and formed a cluster. Any of the nodes' IP addresses will work, as there is no master node.
Determining driver host interface for mapper->driver callback...
    [Possible callback IP address: 172.16.2.181]
    [Possible callback IP address: 127.0.0.1]
...
Waiting for H2O cluster to come up...
H2O node 172.16.2.184:54321 requested flatfile
Sending flatfiles to nodes...
    [Sending flatfile to node 172.16.2.184:54321]
H2O node 172.16.2.184:54321 reports H2O cluster size 1
H2O cluster (1 nodes) is up
Blocking until the H2O cluster shuts down...
Hadoop Launch Parameters
-libjars <.../h2o.jar>
: Add external jar files; must end with h2o.jar.
-h | -help
: Display help.
-jobname <JobName>
: Specify a job name; the default is H2O_nnnnn (where n is chosen randomly).
-driverif <IP address of mapper -> driver callback interface>
: Specify the IP address for callback messages from the mapper to the driver.
-driverport <port of mapper -> driver callback interface>
: Specify the port number for callback messages from the mapper to the driver.
-network <IPv4Network1>[,<IPv4Network2>]
: Specify the IPv4 network(s) to bind to the H2O nodes; multiple networks can be specified to force H2O to use the specified host in the Hadoop cluster. 10.1.2.0/24 allows 256 possibilities.
-timeout <seconds>
: Specify the timeout duration (in seconds) to wait for the cluster to form before failing.
-disown
: Exit the driver after the cluster forms.
-notify <notification file name>
: Specify a file to write when the cluster is up. The file contains the IP and port of the embedded web server for one of the nodes in the cluster. All mappers must start before the H2O cloud is considered "up".
-mapperXmx <per mapper Java Xmx heap size>
: Specify the amount of memory to allocate to H2O.
-extramempercent <0-20>
: Specify the extra memory for internal JVM use outside of the Java heap, as a percentage of mapperXmx.
-n | -nodes <number of H2O nodes>
: Specify the number of nodes.
-nthreads <maximum number of CPUs>
: Specify the number of CPUs to use. Enter -1 to use all CPUs on the host, or enter a positive integer.
-baseport <initialization port for H2O nodes>
: Specify the initialization port for the H2O nodes. The default is 54321.
-ea
: Enable assertions to verify boolean expressions for error detection.
-verbose:gc
: Include heap and garbage collection information in the logs.
-XX:+PrintGCDetails
: Include a short message after each garbage collection.
-license <license file name>
: Specify the local filesystem location and the license file name.
-o | -output <HDFS output directory>
: Specify the HDFS directory for the output.
-flow_dir <Saved Flows directory>
: Specify the directory for saved flows. By default, H2O will try to find the HDFS home directory to use as the directory for flows. If the HDFS home directory is not found, flows cannot be saved unless a directory is specified using -flow_dir.
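For example, a launch that combines several of these flags (all values are illustrative):
hadoop jar h2odriver.jar -nodes 3 -mapperXmx 6g -baseport 54321 -timeout 180 -output hdfsOutputDirName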
How to Pass S3 Credentials to H2O
- Standalone Instance
- Accessing S3 Data from Hadoop Instance
- Sparkling Water Instance
- Core-site.xml Example
To use the Amazon Web Services (AWS) storage solution S3, you will need to pass your S3 access credentials to H2O. This allows you to access your data on S3 when importing data frames with path prefixes s3n://....
For security reasons, we recommend writing a script to read the access credentials that are stored in a separate file. This will not only keep your credentials from propagating to other locations, but it will also make it easier to change the credential information later.
Standalone Instance
When running H2O in standalone mode (that is, using the simple java launch command), you can pass in the S3 credentials in two ways.
The first way is to pass in credentials the same way you access data from HDFS on Hadoop: create a core-site.xml file and pass it in with the flag -hdfs_config. For an example core-site.xml file, refer to Core-site.xml.
Edit the properties in the core-site.xml file to include your Access Key ID and Access Key as shown in the following example:
<property>
<name>fs.s3n.awsAccessKeyId</name>
<value>[AWS ACCESS KEY ID]</value>
</property>
<property>
<name>fs.s3n.awsSecretAccessKey</name>
<value>[AWS SECRET ACCESS KEY]</value>
</property>
Launch with the configuration file core-site.xml by running one of the following in the command line:
java -jar h2o.jar -hdfs_config core-site.xml
or
java -cp h2o.jar water.H2OApp -hdfs_config core-site.xml
Then import the data with the S3 URL path s3n://bucket/path/to/file.csv using importFile.
The second way is to pass the AWS Access Key and Secret Access Key in an S3N URL in Flow, R, or Python (where AWS_ACCESS_KEY represents your user name and AWS_SECRET_KEY represents your password).
To import the data from the Flow API:
importFiles [ "s3n://<AWS_ACCESS_KEY>:<AWS_SECRET_KEY>@bucket/path/to/file.csv" ]
To import the data from the R API:
h2o.importFile(path = "s3n://<AWS_ACCESS_KEY>:<AWS_SECRET_KEY>@bucket/path/to/file.csv")
To import the data from the Python API:
h2o.import_frame(path = "s3n://<AWS_ACCESS_KEY>:<AWS_SECRET_KEY>@bucket/path/to/file.csv")
Accessing S3 Data from Hadoop Instance
H2O launched on Hadoop servers can access S3 data in addition to HDFS. To do this, edit Hadoop's core-site.xml the same way, and set the HADOOP_CONF_DIR environment property to the directory containing the core-site.xml file. For an example core-site.xml file, refer to Core-site.xml. Typically, the configuration directory for most Hadoop distributions is /etc/hadoop/conf.
To launch H2O without using any schedulers or with Yarn, use the same process as the standalone instance with the exception of the HDFS configuration directory path:
java -jar h2o.jar -hdfs_config $HADOOP_CONF_DIR/core-site.xml
or
java -cp h2o.jar water.H2OApp -hdfs_config $HADOOP_CONF_DIR/core-site.xml
To pass the S3 credentials when launching H2O with the hadoop jar command, use the -D flag:
hadoop jar h2odriver.jar -Dfs.s3.awsAccessKeyId="${AWS_ACCESS_KEY}" -Dfs.s3n.awsSecretAccessKey="${AWS_SECRET_KEY}" -n 3 -mapperXmx 10g -output outputDirectory
where AWS_ACCESS_KEY represents your user name and AWS_SECRET_KEY represents your password.
Then you can import the data with the S3 URL path:
To import the data from the Flow API:
importFiles [ "s3n://bucket/path/to/file.csv" ]
To import the data from the R API:
h2o.importFile(path = "s3n://bucket/path/to/file.csv")
To import the data from the Python API:
h2o.import_frame(path = "s3n://bucket/path/to/file.csv")
Sparkling Water Instance
For Sparkling Water, the S3 credentials need to be passed via HADOOP_CONF_DIR, which must point to a core-site.xml containing the AWS_ACCESS_KEY and AWS_SECRET_KEY. On Hadoop, the configuration directory is typically set to /etc/hadoop/conf:
export HADOOP_CONF_DIR=/etc/hadoop/conf
If you are running a local instance, create a configuration directory locally with the core-site.xml
and then export the path to the configuration directory:
mkdir CONF
cd CONF
export HADOOP_CONF_DIR=`pwd`
Core-site.xml Example
The following is an example core-site.xml file:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<!--
<property>
<name>fs.default.name</name>
<value>s3n://<your s3 bucket></value>
</property>
-->
<property>
<name>fs.s3n.awsAccessKeyId</name>
<value>insert access key here</value>
</property>
<property>
<name>fs.s3n.awsSecretAccessKey</name>
<value>insert secret key here</value>
</property>
</configuration>
Data Science in H2O-Dev
Commonalities
Missing Value Handling for Training
If missing values are found in the validation frame during model training or during the scoring process for creating predictions, the missing values are automatically imputed.
If the missing values are found during POJO scoring, the answer is converted to NaN.
K-Means
Introduction
K-Means falls in the general category of clustering algorithms.
Defining a K-Means Model
Destination_key: (Optional) Enter a custom name for the model to use as a reference. By default, H2O automatically generates a destination key.
Training_frame: (Required) Select the dataset used to build the model. NOTE: If you click the Build a model button from the Parse cell, the training frame is entered automatically.
Validation_frame: (Optional) Select the dataset used to evaluate the accuracy of the model.
Ignored_columns: (Optional) Click the plus sign next to a column name to add it to the list of columns excluded from the model. To add all columns, click the Add all button. To remove a column from the list of ignored columns, click the X next to the column name. To remove all columns from the list of ignored columns, click the Clear all button.
Drop_na20_cols: (Optional) Check this checkbox to omit columns that have at least 20% missing values.
Score_each_iteration: (Optional) Check this checkbox to score during each iteration of the model training.
K: Specify the number of clusters. The default is 1.
User_points: Specify a vector of initial cluster centers.
Max_iterations: Specify the maximum number of training iterations. The default is 1000.
Init: Select the initialization mode. The options are Random, Furthest, PlusPlus, or User. Note: If PlusPlus is selected, the initial Y matrix is chosen by the final cluster centers from the K-Means PlusPlus algorithm.
Standardize: To standardize the numeric columns to have mean of zero and unit variance, check this checkbox. Standardization is highly recommended; if you do not use standardization, the results can include components that are dominated by variables that appear to have larger variances relative to other attributes as a matter of scale, rather than true contribution. This option is selected by default.
Seed: Specify the random number generator (RNG) seed for algorithm components dependent on randomization. The seed is consistent for each H2O instance so that you can create models with the same starting conditions in alternative configurations.
Interpreting a K-Means Model
By default, the following output displays:
- A graph of the scoring history (number of iterations vs. average within the cluster’s sum of squares)
- Output (model category, validation metrics if applicable, and centers std)
- Model Summary (number of clusters, number of categorical columns, number of iterations, avg. within sum of squares, avg. sum of squares, avg. between the sum of squares)
- Scoring history (number of iterations, avg. change of standardized centroids, avg. within cluster sum of squares)
- Training metrics (model name, checksum name, frame name, frame checksum name, description if applicable, model category, duration in ms, scoring time, predictions, MSE, avg. within sum of squares, avg. between sum of squares)
- Centroid statistics (centroid number, size, within sum of squares)
- Cluster means (centroid number, column)
K-Means randomly chooses starting points and converges to a local minimum of centroids. The number of clusters is arbitrary, and should be thought of as a tuning parameter. The output is a matrix of the cluster assignments and the coordinates of the cluster centers in terms of the originally chosen attributes. Your cluster centers may differ slightly from run to run as this problem is Non-deterministic Polynomial-time (NP)-hard.
FAQ
How does the algorithm handle missing values during training?
Missing values are automatically imputed by the column mean.
How does the algorithm handle missing values during testing?
Missing values are automatically imputed by the column mean of the training data.
Does it matter if the data is sorted?
No.
Should data be shuffled before training?
No.
What if there are a large number of columns?
K-Means suffers from the curse of dimensionality: all points are roughly at the same distance from each other in high dimensions, making the algorithm less and less useful.
What if there are a large number of categorical factor levels?
This can be problematic, as categoricals are one-hot encoded on the fly, which can lead to the same problem as datasets with a large number of columns.
K-Means Algorithm
The number of clusters $K$ is user-defined and is determined a priori.
Choose $K$ initial cluster centers $m_{k}$ according to one of the following:
Randomization: Choose $K$ clusters from the set of $N$ observations at random so that each observation has an equal chance of being chosen.
Plus Plus
a. Choose one center $m_{1}$ at random.
Calculate the difference between $m_{1}$ and each of the remaining $N-1$ observations $x_{i}$: $d(x_{i}, m_{1}) = \lVert x_{i}-m_{1}\rVert^2$
Let $P(i)$ be the probability of choosing $x_{i}$ as $m_{2}$. Weight $P(i)$ by $d(x_{i}, m_{1})$ so that those $x_{i}$ furthest from $m_{1}$ have a higher probability of being selected than those $x_{i}$ close to $m_{1}$.
Choose the next center $m_{2}$ by drawing at random according to the weighted probability distribution.
Repeat until $K$ centers have been chosen.
Furthest
a. Choose one center $m_{1}$ at random.
Calculate the difference between $m_{1}$ and each of the remaining $N-1$ observations $x_{i}$: $d(x_{i}, m_{1}) = \lVert x_{i}-m_{1}\rVert^2$
Choose $m_{2}$ to be the $x_{i}$ that maximizes $d(x_{i}, m_{1})$.
Repeat until $K$ centers have been chosen.
Once $K$ initial centers have been chosen, calculate the difference between each observation $x_{i}$ and each of the centers $m_{1},\ldots,m_{K}$, where difference is the squared Euclidean distance taken over $p$ parameters:
$d(x_{i}, m_{k})=\sum_{j=1}^{p}(x_{ij}-m_{k})^2=\lVert x_{i}-m_{k}\rVert^2$
Assign $x_{i}$ to the cluster $k$ defined by the $m_{k}$ that minimizes $d(x_{i}, m_{k})$.
When all observations $x_{i}$ are assigned to a cluster, calculate the mean of the points in the cluster:
$\bar{x}(k)=\lbrace\bar{x_{i1}},\ldots,\bar{x_{ip}}\rbrace$
Set the $\bar{x}(k)$ as the new cluster centers $m_{k}$. Repeat steps 2 through 5 until the specified number of max iterations is reached or the cluster assignments of the $x_{i}$ are stable.
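Equivalently (a standard characterization, not specific to H2O's implementation), steps 2 through 5 perform a local minimization of the total within-cluster sum of squares:
$\min_{m_{1},\ldots,m_{K}}\sum_{k=1}^{K}\sum_{x_{i}\in C_{k}}\lVert x_{i}-m_{k}\rVert^2$
where $C_{k}$ denotes the set of observations currently assigned to center $m_{k}$.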
References
Xiong, Hui, Junjie Wu, and Jian Chen. “K-means Clustering Versus Validation Measures: A Data- distribution Perspective.” Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on 39.2 (2009): 318-331.
GLM
Introduction
Generalized Linear Models (GLM) estimate regression models for outcomes following exponential distributions. In addition to the Gaussian (i.e. normal) distribution, these include Poisson, binomial, and gamma distributions. Each serves a different purpose, and depending on distribution and link function choice, can be used either for prediction or classification.
The GLM suite includes:
- Gaussian regression
- Poisson regression
- Binomial regression
- Gamma regression
Defining a GLM Model
Destination_key: (Optional) Enter a custom name for the model to use as a reference. By default, H2O automatically generates a destination key.
Training_frame: (Required) Select the dataset used to build the model. NOTE: If you click the Build a model button from the Parse cell, the training frame is entered automatically.
Validation_frame: (Optional) Select the dataset used to evaluate the accuracy of the model.
Ignored_columns: (Optional) Click the plus sign next to a column name to add it to the list of columns excluded from the model. To add all columns, click the Add all button. To remove a column from the list of ignored columns, click the X next to the column name. To remove all columns from the list of ignored columns, click the Clear all button.
Drop_na20_cols: (Optional) Check this checkbox to omit columns that have at least 20% missing values.
Response_column: (Required) Select the column to use as the independent variable.
Family: Select the model type (Gaussian, Binomial, Poisson, or Gamma).
Solver: Select the solver to use (IRLSM, L_BFGS, or auto). IRLSM is fast on problems with a small number of predictors and for lambda search with an L1 penalty, while L_BFGS scales better for datasets with many columns. The default is IRLSM.
Alpha: Specify the regularization distribution between L1 and L2. The default value is 0.5.
Lambda: Specify the regularization strength. The default value is data dependent.
Lambda_search: Check this checkbox to enable lambda search, starting with lambda max. The given lambda is then interpreted as lambda min.
Standardize: To standardize the numeric columns to have a mean of zero and unit variance, check this checkbox. Standardization is highly recommended; if you do not use standardization, the results can include components that are dominated by variables that appear to have larger variances relative to other attributes as a matter of scale, rather than true contribution. This option is selected by default.
Beta constraints: To use beta constraints, select a dataset from the drop-down menu. The selected frame is used to constrain the coefficient vector with upper and lower bounds.
Max_confusion_matrix_size: Specify the maximum size (number of classes) for the confusion matrices printed in the logs.
Max_hits_ratio_k: Specify the maximum number (top K) of predictions to use for hit ratio computation. Applicable to multi-class only. To disable, enter 0.
Max_iterations: Specify the number of training iterations. The default is 50.
Beta_epsilon: Specify the beta epsilon value. If the L1 normalization of the current beta change is below this threshold, the model is considered to have converged.
Link: Select a link function (Identity, Family_Default, Logit, Log, or Inverse).
Prior: Specify the prior probability for y == 1. Use this parameter for logistic regression if the data has been sampled and the mean of response does not reflect reality. The default value is 0.
Max_active_predictors: Specify the maximum number of active predictors during computation. This value is used as a stopping criterion to prevent expensive model building with many predictors.
Interpreting a GLM Model
By default, the following output displays:
- A graph of the normalized coefficient magnitudes
- Output (model category, model summary, scoring history, training metrics, validation metrics, best lambda, threshold, residual deviance, null deviance, residual degrees of freedom, null degrees of freedom, AIC, AUC, binomial, rank)
- Coefficients
- Coefficient magnitudes
FAQ
How does the algorithm handle missing values during training?
GLM skips rows with missing values.
How does the algorithm handle missing values during testing?
GLM will predict Double.NaN for rows containing missing values.
What happens if the response has missing values?
It is handled properly, but verify the results are correct.
What happens during prediction if the new sample has categorical levels not seen in training?
It will predict Double.NaN.
Does it matter if the data is sorted?
No.
Should data be shuffled before training?
No.
How does the algorithm handle highly imbalanced data in a response column?
GLM does not require special handling for imbalanced data.
What if there are a large number of columns?
IRLSM will get quadratically slower with the number of columns. Try L_BFGS for datasets with more than 5-10 thousand columns.
What if there are a large number of categorical factor levels?
GLM internally one-hot encodes the categorical factor levels; the same limitations as with a high column count will apply.
GLM Algorithm
Following the definitive text by P. McCullagh and J.A. Nelder (1989) on the generalization of linear models to non-linear distributions of the response variable Y, H2O fits GLM models based on maximum likelihood estimation via iteratively reweighted least squares.
Let $y_{1},\ldots,y_{n}$ be n observations of the independent, random response variable $Y_{i}$.
Assume that the observations are distributed according to a function from the exponential family and have a probability density function of the form:
$f(y_{i})=\exp\left[\frac{y_{i}\theta_{i} - b(\theta_{i})}{a_{i}(\phi)} + c(y_{i}; \phi)\right]$ where $\theta$ and $\phi$ are location and scale parameters, and $a_{i}(\phi)$, $b_{i}(\theta_{i})$, $c_{i}(y_{i}; \phi)$ are known functions.
$a_{i}$ is of the form $a_{i}=\frac{\phi}{p_{i}}$; $p_{i}$ is a known prior weight.
When $Y$ has a pdf from the exponential family:
$E(Y_{i})=\mu_{i}=b^{\prime}(\theta_{i})$, $var(Y_{i})=\sigma_{i}^2=b^{\prime\prime}(\theta_{i})a_{i}(\phi)$
Let $g(\mu_{i})=\eta_{i}$ be a monotonic, differentiable transformation of the expected value of $y_{i}$. The function $\eta_{i}$ is the link function and follows a linear model.
$g(\mu_{i})=\eta_{i}=\mathbf{x_{i}^{\prime}}\beta$
When inverted: $\mu=g^{-1}(\mathbf{x_{i}^{\prime}}\beta)$
Maximum Likelihood Estimation
For an initial rough estimate of the parameters $\hat{\beta}$, use the estimate to generate fitted values: $\mu_{i}=g^{-1}(\hat{\eta_{i}})$
Let $z$ be a working dependent variable such that $z_{i}=\hat{\eta_{i}}+(y_{i}-\hat{\mu_{i}})\frac{d\eta_{i}}{d\mu_{i}}$,
where $\frac{d\eta_{i}}{d\mu_{i}}$ is the derivative of the link function evaluated at the trial estimate.
Calculate the iterative weights: $w_{i}=\frac{p_{i}}{b^{\prime\prime}(\theta_{i})\left(\frac{d\eta_{i}}{d\mu_{i}}\right)^{2}}$
where $b^{\prime\prime}$ is the second derivative of $b(\theta_{i})$ evaluated at the trial estimate.
Assume $a_{i}(\phi)$ is of the form $\frac{\phi}{p_{i}}$. The weight $w_{i}$ is inversely proportional to the variance of the working dependent variable $z_{i}$ for current parameter estimates and proportionality factor $\phi$.
Regress $z_{i}$ on the predictors $x_{i}$ using the weights $w_{i}$ to obtain new estimates of $\beta$: $\hat{\beta}=(\mathbf{X}^{\prime}\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^{\prime}\mathbf{W}\mathbf{z}$
where $\mathbf{X}$ is the model matrix, $\mathbf{W}$ is a diagonal matrix of $w_{i}$, and $\mathbf{z}$ is a vector of the working response variable $z_{i}$.
This process is repeated until the estimates $\hat{\beta}$ change by less than the specified amount.
Cost of computation
H2O can process large data sets because it relies on parallel processes. Large data sets are divided into smaller data sets and processed simultaneously and the results are communicated between computers as needed throughout the process.
In GLM, data are split by rows but not by columns, because the predicted Y values depend on information in each of the predictor variable vectors. If $O$ is a complexity function, $N$ is the number of observations (or rows), and $p$ is the number of predictors (or columns), then
$Runtime \propto p^3+\frac{N \cdot p^2}{CPUs}$
Distribution reduces the time it takes an algorithm to process because it decreases N.
Relative to $p$, the larger that $\frac{N}{CPUs}$ becomes, the more trivial $p$ becomes to the overall computational cost. However, when $p$ is greater than $\frac{N}{CPUs}$, $O$ is dominated by $p$:
$Complexity = O(p^3 + N \cdot p^2)$
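To make this concrete, a quick plug-in under assumed values ($N = 10^{6}$ rows, $p = 100$ predictors, 8 CPUs):
$p^3 = 10^{6}, \qquad \frac{N \cdot p^2}{CPUs} = \frac{10^{6}\cdot 10^{4}}{8} \approx 1.25\times 10^{9}$
so the distributed $N \cdot p^2$ term dominates at these sizes; only as $p$ grows toward $\frac{N}{CPUs}$ does the serial $p^3$ term take over.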
References
Breslow, N E. “Generalized Linear Models: Checking Assumptions and Strengthening Conclusions.” Statistica Applicata 8 (1996): 23-41.
Frome, E L. “The Analysis of Rates Using Poisson Regression Models.” Biometrics (1983): 665-674.
Snee, Ronald D. “Validation of Regression Models: Methods and Examples.” Technometrics 19.4 (1977): 415-428.
DRF
Introduction
Distributed Random Forest (DRF) is a powerful classification tool. When given a set of data, DRF generates a forest of classification trees, rather than a single classification tree. Each of these trees is a weak learner built on a subset of rows and columns. More trees will reduce the variance. The classification from each H2O tree can be thought of as a vote; the most votes determines the classification.
Defining a DRF Model
Destination_key: (Optional) Enter a custom name for the model to use as a reference. By default, H2O automatically generates a destination key.
Training_frame: (Required) Select the dataset used to build the model. NOTE: If you click the Build a model button from the Parse cell, the training frame is entered automatically.
Validation_frame: (Optional) Select the dataset used to evaluate the accuracy of the model.
Ignored_columns: (Optional) Click the plus sign next to a column name to add it to the list of columns excluded from the model. To add all columns, click the Add all button. To remove a column from the list of ignored columns, click the X next to the column name. To remove all columns from the list of ignored columns, click the Clear all button.
Drop_na20_cols: (Optional) Check this checkbox to omit columns that have at least 20% missing values.
Score_each_iteration: (Optional) Check this checkbox to score during each iteration of the model training.
Response_column: (Required) Select the column to use as the independent variable.
Ntrees: Specify the number of trees. The default value is 50.
Max_depth: Specify the maximum tree depth. The default value is 5.
Min_rows: Specify the minimum number of observations for a leaf (nodesize in R). The default value is 10.
Nbins: Specify the number of bins for the histogram. The default value is 20.
Mtries: Specify the number of columns to randomly select at each level. To use the square root of the number of columns, enter -1. The default value is -1.
Sample_rate: Specify the sample rate. The range is 0 to 1.0 and the default value is 0.6666667.
Build_tree_one_node: To run on a single node, check this checkbox. This is suitable for small datasets as there is no network overhead but fewer CPUs are used. The default setting is disabled.
Balance_classes: Oversample the minority classes to balance the class distribution. This option is not selected by default. This option is only applicable for classification. Majority classes can be undersampled to satisfy the Max_after_balance_size parameter.
Max_confusion_matrix_size: Specify the maximum size (in number of classes) for confusion matrices to be printed in the Logs.
Max_hit_ratio_k: Specify the maximum number (top K) of predictions to use for hit ratio computation. Applicable to multi-class only. To disable, enter 0.
Class_sampling_factors: Specify the per-class (in lexicographical order) over/under-sampling ratios. If this option is not specified, the ratios are automatically computed during training to obtain the class balance. Requires Balance_classes.
Seed: Specify the random number generator (RNG) seed for algorithm components dependent on randomization. The seed is consistent for each H2O instance so that you can create models with the same starting conditions in alternative configurations.
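As a sketch of how these options map onto H2O's R API (the file path and column name below are placeholders, not part of any shipped example):

library(h2o)
h2o.init()
train <- h2o.importFile("train.csv")  # hypothetical dataset

# Parameter names mirror the options above (Ntrees -> ntrees, etc.).
drf <- h2o.randomForest(x = setdiff(names(train), "label"),
                        y = "label",
                        training_frame = train,
                        ntrees = 50,              # Ntrees
                        max_depth = 5,            # Max_depth
                        min_rows = 10,            # Min_rows (nodesize in R)
                        nbins = 20,               # Nbins
                        mtries = -1,              # -1 = sqrt(#columns)
                        sample_rate = 0.6666667,  # Sample_rate
                        seed = 1234)              # Seed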
Interpreting a DRF Model
By default, the following output displays:
- Model parameters (hidden)
- A graph of the scoring history (number of trees vs. training MSE)
- A graph of the ROC curve (TPR vs. FPR)
- A graph of the variable importances
- Output (model category, validation metrics, init_f)
- Model summary (number of trees, min. depth, max. depth, mean depth, min. leaves, max. leaves, mean leaves)
- Scoring history in tabular format
- Training metrics (model name, checksum name, frame name, frame checksum name, description, model category, duration in ms, scoring time, predictions, MSE, R2, logloss, AUC, GINI)
- Training metrics for thresholds (thresholds, F1, F2, F0point5, Accuracy, Precision, Recall, Specificity, Absolute MCC, min. per-class accuracy, TNs, FNs, FPs, TPs, idx)
- Maximum metrics (metric, threshold, value, IDX)
- Variable importances in tabular format
FAQ
How does the algorithm handle missing values during training?
Missing values are ignored when split points are computed (i.e., they are not counted when computing means or errors), so they do not change the split-point of the column they are in. Rows containing missing values still contribute to tree building through their non-missing columns.
How does the algorithm handle missing values during testing?
During scoring, missing values “always go left” at any decision point in a tree. Due to dynamic binning in DRF, a row with a missing value typically ends up in the “leftmost bin” - with other outliers.
What happens if the response has missing values?
No errors will occur, but nothing will be learned from rows with a missing response.
Does it matter if the data is sorted?
No.
Should data be shuffled before training?
No.
How does the algorithm handle highly imbalanced data in a response column?
Specify balance_classes, class_sampling_factors, and max_after_balance_size to control over/under-sampling.
What if there are a large number of columns?
DRFs are best for datasets with fewer than a few thousand columns.
What if there are a large number of categorical factor levels?
Large numbers of categoricals are handled very efficiently - there is never any one-hot encoding.
DRF Algorithm
References
Naïve Bayes
- Introduction
- Defining a Naïve Bayes Model
- Interpreting a Naïve Bayes Model
- FAQ
- Naïve Bayes Algorithm
- References
Introduction
Naïve Bayes (NB) is a classification algorithm that applies Bayes' theorem under a strong assumption of independence between covariates. NB models are commonly used as an alternative to decision trees for classification problems.
Defining a Naïve Bayes Model
Destination_key: (Optional) Enter a custom name for the model to use as a reference. By default, H2O automatically generates a destination key.
Training_frame: (Required) Select the dataset used to build the model. NOTE: If you click the Build a model button from the Parse cell, the training frame is entered automatically.
Validation_frame: (Optional) Select the dataset used to evaluate the accuracy of the model.
Ignored_columns: (Optional) Click the plus sign next to a column name to add it to the list of columns excluded from the model. To add all columns, click the Add all button. To remove a column from the list of ignored columns, click the X next to the column name. To remove all columns from the list of ignored columns, click the Clear all button.
Drop_na20_cols: (Optional) Check this checkbox to omit columns that have at least 20% missing values.
Response_column: (Required) Select the column to use as the dependent variable.
Laplace: Specify the Laplace smoothing parameter. The default value is 0.
Min_sdev: Specify the minimum standard deviation to use for observations without enough data. The default value is 0.001.
Eps_sdev: Specify the threshold for standard deviation. If this threshold is not met, the min_sdev value is used. The default value is 1e-10.
Min_prob: Specify the minimum probability to use for observations without enough data. The default value is 0.001.
Eps_prob: Specify the threshold for probability. If this threshold is not met, the min_prob value is used. The default value is 1e-10.
Max_confusion_matrix_size: Specify the maximum size (in number of classes) for confusion matrices to be printed in the Logs.
Max_hit_ratio_k: Specify the maximum number (top K) of predictions to use for hit ratio computation. Applicable to multi-class only. To disable, enter 0.
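A minimal sketch of the equivalent call in R (file path and column name are placeholders):

library(h2o)
h2o.init()
train <- h2o.importFile("train.csv")  # hypothetical dataset

# laplace mirrors the Laplace option above; 0 (the default) disables smoothing.
nb <- h2o.naiveBayes(x = setdiff(names(train), "class"),
                     y = "class",
                     training_frame = train,
                     laplace = 1)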
Interpreting a Naïve Bayes Model
The output from Naïve Bayes is a list of tables containing the a-priori and conditional probabilities of each class of the response. The a-priori probability is the estimated probability of a particular class before observing any of the predictors. Each conditional probability table corresponds to a predictor column. The row headers are the classes of the response and the column headers are the classes of the predictor. Thus, in the table below, the probability that a person is male (x) given that they did not survive (y = No) is 0.91543624.
Sex
Survived | Male | Female
No | 0.91543624 | 0.08456376
Yes | 0.51617440 | 0.48382560
When the predictor is numeric, Naïve Bayes assumes it is sampled from a Gaussian distribution given the class of the response. The first column contains the mean and the second column contains the standard deviation of the distribution.
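Concretely, the class-conditional likelihood used for a numeric predictor is the standard Gaussian density built from those two columns:
$p(x_{i} \mid y=c) = \frac{1}{\sqrt{2\pi}\,\sigma_{ic}} \exp\left(-\frac{(x_{i}-\mu_{ic})^{2}}{2\sigma_{ic}^{2}}\right)$
where $\mu_{ic}$ and $\sigma_{ic}$ are the tabulated mean and standard deviation of predictor $i$ for response class $c$.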
By default, the following output displays:
- Output (model category, model summary, scoring history, training metrics, validation metrics)
- Y-Levels (levels of the response column)
- P-conditionals
FAQ
How does the algorithm handle missing values during training?
All rows with one or more missing values (either in the predictors or the response) will be skipped during model building.
How does the algorithm handle missing values during testing?
If a predictor is missing, it will be skipped when taking the product of conditional probabilities in calculating the joint probability conditional on the response.
What happens if the response domain is different in the training and test datasets?
The response column in the test dataset is not used during scoring, so any response categories absent in the training data will not be predicted.
What happens during prediction if the new sample has categorical levels not seen in training?
The conditional probability of that predictor level will be set according to the Laplace smoothing factor. If Laplace smoothing is disabled (set to zero), the joint probability will be zero. See pgs. 13-14 of Andrew Ng’s “Generative learning algorithms” in the References section for mathematical details.
Does it matter if the data is sorted?
No.
Should data be shuffled before training?
This does not affect model building.
How does the algorithm handle highly imbalanced data in a response column?
Unbalanced data will not affect the model. However, if one response category has very few observations compared to the total, the conditional probability may be very low. A cutoff (eps_prob) and minimum value (min_prob) are available for the user to set a floor on the calculated probability.
What if there are a large number of columns?
More memory will be allocated on each node to store the joint frequency counts and sums.
What if there are a large number of categorical factor levels?
More memory will be allocated on each node to store the joint frequency count of each categorical predictor level with the response’s level.
Naïve Bayes Algorithm
The algorithm is presented for the simplified binomial case without loss of generality.
Under the Naïve Bayes assumption of independence, given a training set of discrete-valued features $\{(X^{(i)},\ y^{(i)});\ i=1,\dots,m\}$, the joint likelihood of the data can be expressed as:
$\mathcal{L}(\phi(y),\ \phi_{i|y=1},\ \phi_{i|y=0}) = \prod_{i=1}^{m} p(X^{(i)},\ y^{(i)})$
The model can be parameterized by:
$\phi_{i|y=0} = p(x_{i}=1 \mid y=0);\quad \phi_{i|y=1} = p(x_{i}=1 \mid y=1);\quad \phi(y)$
where $\phi_{i|y=0} = p(x_{i}=1 \mid y=0)$ can be thought of as the fraction of the observed instances where feature $x_{i}$ is observed and the outcome is $y=0$, and $\phi_{i|y=1} = p(x_{i}=1 \mid y=1)$ is the fraction of the observed instances where feature $x_{i}$ is observed and the outcome is $y=1$.
The objective of the algorithm is to maximize this likelihood with respect to $\phi_{i|y=0}$, $\phi_{i|y=1}$, and $\phi(y)$.
Where the maximum likelihood estimates are:
$\phi_{j|y=1} = \frac{\sum_{i=1}^{m} 1(x_{j}^{(i)}=1 \ \cap\ y^{(i)}=1)}{\sum_{i=1}^{m} 1(y^{(i)}=1)}$
$\phi_{j|y=0} = \frac{\sum_{i=1}^{m} 1(x_{j}^{(i)}=1 \ \cap\ y^{(i)}=0)}{\sum_{i=1}^{m} 1(y^{(i)}=0)}$
$\phi(y) = \frac{\sum_{i=1}^{m} 1(y^{(i)}=1)}{m}$
Once all parameters $\phi_{j|y}$ are fitted, the model can be used to predict new examples with features $X^{(i^{*})}$.
This is carried out by calculating:
$p(y=1 \mid x) = \frac{\prod_{i} p(x_{i} \mid y=1)\, p(y=1)}{\prod_{i} p(x_{i} \mid y=1)\, p(y=1) + \prod_{i} p(x_{i} \mid y=0)\, p(y=0)}$
$p(y=0 \mid x) = \frac{\prod_{i} p(x_{i} \mid y=0)\, p(y=0)}{\prod_{i} p(x_{i} \mid y=1)\, p(y=1) + \prod_{i} p(x_{i} \mid y=0)\, p(y=0)}$
and predicting the class with the highest probability.
It is possible that prediction sets contain features not originally seen in the training set. If this occurs, the maximum likelihood estimates for these features predict a probability of 0 for all cases of y.
Laplace smoothing allows the model to handle feature values that were absent from the training data by adjusting the maximum likelihood estimates to:
$\phi_{j|y=1} = \frac{\sum_{i=1}^{m} 1(x_{j}^{(i)}=1 \ \cap\ y^{(i)}=1) + 1}{\sum_{i=1}^{m} 1(y^{(i)}=1) + 2}$
$\phi_{j|y=0} = \frac{\sum_{i=1}^{m} 1(x_{j}^{(i)}=1 \ \cap\ y^{(i)}=0) + 1}{\sum_{i=1}^{m} 1(y^{(i)}=0) + 2}$
Note that in the general case where $y$ takes on $k$ values, 1 is added to each numerator and $k$ (rather than 2, as in the two-level classifier shown here) is added to each denominator.
Laplace smoothing should be used with care; it is generally intended to allow for predictions in rare events. As prediction data becomes increasingly distinct from training data, train new models when possible to account for a broader set of possible X values.
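As a small worked example (the counts are hypothetical): suppose feature $x_{j}$ never appears among $10$ training rows with $y=1$. The unsmoothed estimate is $\phi_{j|y=1} = 0/10 = 0$, which zeroes out the entire product of conditional probabilities for any new row with $x_{j}=1$. With Laplace smoothing, $\phi_{j|y=1} = (0+1)/(10+2) = 1/12 \approx 0.083$, so a meaningful prediction is still possible.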
References
Ng, Andrew. “Generative Learning algorithms.” (2008).
PCA
PCA is currently in progress in H2O-Dev. Once implementation of this algorithm is complete, this section of the document will be updated.
GBM
Introduction
Gradient Boosted Regression and Gradient Boosted Classification are forward learning ensemble methods. The guiding heuristic is that good predictive results can be obtained through increasingly refined approximations. H2O’s GBM sequentially builds regression trees on all the features of the dataset in a fully distributed way - each tree is built in parallel.
Defining a GBM Model
Destination_key: (Optional) Enter a custom name for the model to use as a reference. By default, H2O automatically generates a destination key.
Training_frame: (Required) Select the dataset used to build the model. NOTE: If you click the Build a model button from the Parse cell, the training frame is entered automatically.
Validation_frame: (Optional) Select the dataset used to evaluate the accuracy of the model.
Ignored_columns: (Optional) Click the plus sign next to a column name to add it to the list of columns excluded from the model. To add all columns, click the Add all button. To remove a column from the list of ignored columns, click the X next to the column name. To remove all columns from the list of ignored columns, click the Clear all button.
Drop_na20_cols: (Optional) Check this checkbox to omit columns that have at least 20% missing values.
Score_each_iteration: (Optional) Check this checkbox to score during each iteration of the model training.
Response_column: (Required) Select the column to use as the dependent variable.
Ntrees: Specify the number of trees. The default value is 50.
Max_depth: Specify the maximum tree depth. The default value is 5.
Min_rows: Specify the minimum number of observations for a leaf (nodesize in R). The default value is 10.
Nbins: Specify the number of bins for the histogram. The default value is 20.
Learn_rate: Specify the learning rate. The range is 0.0 to 1.0 and the default is 0.1.
Distribution: Select the loss function. The options are auto, bernoulli, multinomial, or gaussian and the default is auto.
Balance_classes: Oversample the minority classes to balance the class distribution. This option is not selected by default. This option is only applicable for classification. Majority classes can be undersampled to satisfy the Max_after_balance_size parameter.
Max_confusion_matrix_size: Specify the maximum size (in number of classes) for confusion matrices to be printed in the Logs.
Max_hit_ratio_k: Specify the maximum number (top K) of predictions to use for hit ratio computation. Applicable to multi-class only. To disable, enter 0.
Class_sampling_factors: Specify the per-class (in lexicographical order) over/under-sampling ratios. If this option is not specified, the ratios are automatically computed during training to obtain the class balance. Requires Balance_classes.
Seed: Specify the random number generator (RNG) seed for algorithm components dependent on randomization. The seed is consistent for each H2O instance so that you can create models with the same starting conditions in alternative configurations.
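A minimal sketch of the equivalent call in R (file path and column name are placeholders):

library(h2o)
h2o.init()
train <- h2o.importFile("train.csv")  # hypothetical dataset

# learn_rate and distribution mirror the Learn_rate and Distribution options.
gbm <- h2o.gbm(x = setdiff(names(train), "label"),
               y = "label",
               training_frame = train,
               ntrees = 50,
               max_depth = 5,
               min_rows = 10,
               learn_rate = 0.1,
               distribution = "bernoulli")  # for a two-class response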
Interpreting a GBM Model
The output for GBM includes the following:
- Model parameters (hidden)
- A graph of the scoring history (training MSE vs number of trees)
- A graph of the variable importances
- Output (model category, validation metrics, init_f)
- Model summary (number of trees, min. depth, max. depth, mean depth, min. leaves, max. leaves, mean leaves)
- Scoring history in tabular format
- Training metrics (model name, model checksum name, frame name, description, model category, duration in ms, scoring time, predictions, MSE, R2)
- Variable importances in tabular format
FAQ
How does the algorithm handle missing values during training?
Missing values are ignored when split points are computed (i.e., they are not counted when computing means or errors), so they do not change the split-point of the column they are in. Rows containing missing values still contribute to tree building through their non-missing columns.
How does the algorithm handle missing values during testing?
During scoring, missing values “always go left” at any decision point in a tree. Due to dynamic binning in GBM, a row with a missing value typically ends up in the “leftmost bin” - with other outliers.
What happens if the response has missing values?
No errors will occur, but nothing will be learned from rows with a missing response.
Does it matter if the data is sorted?
No.
Should data be shuffled before training?
No.
How does the algorithm handle highly imbalanced data in a response column?
You can specify balance_classes, class_sampling_factors, and max_after_balance_size to control over/under-sampling.
What if there are a large number of columns?
GBM models are best for datasets with fewer than a few thousand columns.
What if there are a large number of categorical factor levels?
Large numbers of categoricals are handled very efficiently - there is never any one-hot encoding.
GBM Algorithm
H2O’s Gradient Boosting Algorithms follow the algorithm specified by Hastie et al (2001):
Initialize $f_{k0}(x) = 0,\ k=1,2,\dots,K$
For $m=1$ to $M$:
(a) Set $p_{k}(x) = \frac{e^{f_{k}(x)}}{\sum_{l=1}^{K} e^{f_{l}(x)}},\ k=1,2,\dots,K$
(b) For $k=1$ to $K$:
i. Compute $r_{ikm} = y_{ik} - p_{k}(x_{i}),\ i=1,2,\dots,N$
ii. Fit a regression tree to the targets $r_{ikm},\ i=1,2,\dots,N$, giving terminal regions $R_{jkm},\ j=1,2,\dots,J_{m}$
iii. Compute $\gamma_{jkm} = \frac{K-1}{K}\ \frac{\sum_{x_{i} \in R_{jkm}} r_{ikm}}{\sum_{x_{i} \in R_{jkm}} |r_{ikm}|\,(1-|r_{ikm}|)},\ j=1,2,\dots,J_{m}$
iv. Update $f_{km}(x) = f_{k,m-1}(x) + \sum_{j=1}^{J_{m}} \gamma_{jkm}\, I(x \in R_{jkm})$
Output $\hat{f}_{k}(x) = f_{kM}(x),\ k=1,2,\dots,K$
References
Dietterich, Thomas G, and Eun Bae Kong. “Machine Learning Bias, Statistical Bias, and Statistical Variance of Decision Tree Algorithms.” ML-95 255 (1995).
Elith, Jane, John R Leathwick, and Trevor Hastie. “A Working Guide to Boosted Regression Trees.” Journal of Animal Ecology 77.4 (2008): 802-813.
Friedman, Jerome H. “Greedy Function Approximation: A Gradient Boosting Machine.” Annals of Statistics (2001): 1189-1232.
Friedman, Jerome, Trevor Hastie, Saharon Rosset, Robert Tibshirani, and Ji Zhu. “Discussion of Boosting Papers.” Ann. Statist 32 (2004): 102-107.
Deep Learning
- Introduction
- Defining a Deep Learning Model
- Interpreting a Deep Learning Model
- FAQ
- Deep Learning Algorithm
- References
Introduction
H2O’s Deep Learning is based on a multi-layer feed-forward artificial neural network that is trained with stochastic gradient descent using back-propagation. The network can contain a large number of hidden layers consisting of neurons with tanh, rectifier and maxout activation functions. Advanced features such as adaptive learning rate, rate annealing, momentum training, dropout, L1 or L2 regularization, checkpointing and grid search enable high predictive accuracy. Each compute node trains a copy of the global model parameters on its local data with multi-threading (asynchronously), and contributes periodically to the global model via model averaging across the network.
Defining a Deep Learning Model
H2O Deep Learning models have many input parameters, many of which are only accessible via the expert mode. For most cases, use the default values. Please read the following instructions before building extensive Deep Learning models. The application of grid search and successive continuation of winning models via checkpoint restart is highly recommended, as model performance can vary greatly.
Destination_key: (Optional) Enter a custom name for the model to use as a reference. By default, H2O automatically generates a destination key.
Training_frame: (Required) Select the dataset used to build the model. NOTE: If you click the Build a model button from the Parse cell, the training frame is entered automatically.
Validation_frame: (Optional) Select the dataset used to evaluate the accuracy of the model.
Ignored_columns: (Optional) Click the plus sign next to a column name to add it to the list of columns excluded from the model. To add all columns, click the Add all button. To remove a column from the list of ignored columns, click the X next to the column name. To remove all columns from the list of ignored columns, click the Clear all button.
Drop_na20_cols: (Optional) Check this checkbox to omit columns that have at least 20% missing values.
Response_column: (Required) Select the column to use as the dependent variable.
Activation: Select the activation function (Tanh, Tanh with dropout, Rectifier, Rectifier with dropout, Maxout, Maxout with dropout). The default option is Rectifier.
Hidden: Specify the hidden layer sizes (e.g., 100,100). The default value is 200,200.
Epochs: Specify the number of times to iterate (stream) the dataset. The value can be a fraction. The default value for DL is 10.
Variable_importances: Check this checkbox to compute variable importance. This option is not selected by default.
Balance_classes: Oversample the minority classes to balance the class distribution. This option is not selected by default. This option is only applicable for classification. Majority classes can be undersampled to satisfy the Max_after_balance_size parameter.
Max_confusion_matrix_size: Specify the maximum size (in number of classes) for confusion matrices to be printed in the Logs.
Max_hit_ratio_k: Specify the maximum number (top K) of predictions to use for hit ratio computation. Applicable to multi-class only. To disable, enter 0.
Checkpoint: Enter a model key associated with a previously-trained Deep Learning model. Use this option to build a new model as a continuation of a previously-generated model (e.g., by a grid search).
Use_all_factor_levels: Check this checkbox to use all factor levels in the possible set of predictors; if you enable this option, sufficient regularization is required. By default, the first factor level is skipped. For Deep Learning models, this option is useful for determining variable importances and is automatically enabled if the autoencoder is selected.
Train_samples_per_iteration: Specify the number of global training samples per MapReduce iteration. To specify one epoch, enter 0. To specify all available data (e.g., replicated training data), enter -1. To use the automatic values, enter -2. The default is -2.
Adaptive_rate: Check this checkbox to enable the adaptive learning rate (ADADELTA). This option is selected by default.
Input_dropout_ratio: Specify the input layer dropout ratio to improve generalization. Suggested values are 0.1 or 0.2. The default value is 0.
L1: Specify the L1 regularization to add stability and improve generalization; sets the value of many weights to 0. The default value is 0.
L2: Specify the L2 regularization to add stability and improve generalization; sets the value of many weights to smaller values. The default value is 0.
Loss: Select the loss function. The options are automatic, mean square, cross-entropy, Huber, or Absolute and the default value is automatic.
Score_interval: Specify the shortest time interval (in seconds) to wait between model scoring. The default value is 5.0.
Score_training_samples: Specify the number of training set samples for scoring. To use all training samples, enter 0. The default value is 10000.
Score_duty_cycle: Specify the maximum duty cycle fraction for scoring. A lower value results in more training and a higher value results in more scoring. The default value is 0.1.
Autoencoder: Check this checkbox to enable the Deep Learning autoencoder. This option is not selected by default. Note: This option requires MeanSquare as the loss function.
Class_sampling_factors: Specify the per-class (in lexicographical order) over/under-sampling ratios. If this option is not specified, the ratios are automatically computed during training to obtain the class balance. Requires Balance_classes.
Overwrite_with_best_model: Check this checkbox to overwrite the final model with the best model found during training. This option is selected by default.
Target_ratio_comm_to_comp: Specify the target ratio of communication overhead to computation. This option is only enabled for multi-node operation and if train_samples_per_iteration equals -2 (auto-tuning). The default value is 0.02.
Seed: Specify the random number generator (RNG) seed for algorithm components dependent on randomization. The seed is consistent for each H2O instance so that you can create models with the same starting conditions in alternative configurations.
Rho: Specify the adaptive learning rate time decay factor. The default value is 0.99.
Epsilon: Specify the adaptive learning rate time smoothing factor to avoid dividing by zero. The default value is 1e-8.
Max_W2: Specify the constraint for the squared sum of the incoming weights per unit (e.g., for Rectifier). The default value is infinity.
Initial_weight_distribution: Select the initial weight distribution (Uniform Adaptive, Uniform, or Normal). The default is Uniform Adaptive.
Regression_stop: Specify the stopping criterion for regression error (MSE) on the training data. To disable this option, enter -1. The default value is 1.0E-6.
Diagnostics: Check this checkbox to compute the variable importances for input features (using the Gedeon method). For large networks, selecting this option can reduce speed. This option is selected by default.
Fast_mode: Check this checkbox to enable fast mode, a minor approximation in back-propagation. This option is selected by default.
Ignore_const_cols: Check this checkbox to ignore constant training columns, since no information can be gained from them. This option is selected by default.
Force_load_balance: Check this checkbox to force extra load balancing to increase training speed for small datasets and use all cores. This option is selected by default.
Single_node_mode: Check this checkbox to force H2O to run on a single node for fine-tuning of model parameters. This option is not selected by default.
Shuffle_training_data: Check this checkbox to shuffle the training data. This option is recommended if the training data is replicated and the value of train_samples_per_iteration is close to the number of nodes times the number of rows. This option is not selected by default.
Missing_values_handling: Select how to handle missing values (skip or mean imputation). The default value is mean imputation.
Quiet_mode: Check this checkbox to display less output in the standard output. This option is not selected by default.
Sparse: Check this checkbox to use sparse data handling. This option is not selected by default.
Col_major: Check this checkbox to use a column major weight matrix for the input layer. This option can speed up forward propagation but may reduce the speed of backpropagation. This option is not selected by default.
Average_activation: Specify the average activation for the sparse autoencoder. The default value is 0.0.
Sparsity_beta: Specify the sparsity regularization. The default value is 0.0.
Max_categorical_features: Specify the maximum number of categorical features enforced via hashing.
Reproducible: To force reproducibility on small data, check this checkbox. If this option is enabled, the model takes more time to generate, since it uses only one thread.
Export_weights_and_biases: To export the neural network weights and biases as H2O frames, check this checkbox.
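A minimal sketch in R using a few of the options above, with everything else left at its defaults as recommended (file path and column name are placeholders):

library(h2o)
h2o.init()
train <- h2o.importFile("train.csv")  # hypothetical dataset

dl <- h2o.deeplearning(x = setdiff(names(train), "label"),
                       y = "label",
                       training_frame = train,
                       activation = "RectifierWithDropout",
                       hidden = c(200, 200),       # two hidden layers of 200 units
                       epochs = 10,
                       input_dropout_ratio = 0.1,  # suggested value from above
                       l1 = 1e-5)                  # small L1 penalty for stability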
Interpreting a Deep Learning Model
To view the results, click the View button. The output for the Deep Learning model includes the following information for both the training and testing sets:
- Model parameters (hidden)
- A chart of the variable importances
- A graph of the scoring history (training MSE and validation MSE vs epochs)
- Output (model category, weights, biases)
- Status of neuron layers (layer number, units, type, dropout, L1, L2, mean rate, rate RMS, momentum, mean weight, weight RMS, mean bias, bias RMS)
- Scoring history in tabular format
- Training metrics (model name, model checksum name, frame name, frame checksum name, description, model category, duration in ms, scoring time, predictions, MSE, R2, logloss)
- Top-K Hit Ratios (for multi-class classification)
- Confusion matrix (for classification)
FAQ
How does the algorithm handle missing values during training?
Missing values are handled according to the user-specified missing_values_handling option: specify either the skip or mean-impute option.
How does the algorithm handle missing values during testing?
Missing values in the test set will be mean-imputed during scoring.
What happens if the response has missing values?
No errors will occur, but nothing will be learned from rows with a missing response.
Does it matter if the data is sorted?
Yes, since the training set is processed in order. Depending on whether train_samples_per_iteration is enabled, some rows will be skipped. If shuffle_training_data is enabled, each thread processing a small subset of rows will process its rows randomly, but it is not a global shuffle.
Should data be shuffled before training?
Yes, the data should be shuffled before training, especially if the dataset is sorted.
How does the algorithm handle highly imbalanced data in a response column?
Specify balance_classes, class_sampling_factors, and max_after_balance_size to control over/under-sampling.
What if there are a large number of columns?
The input neuron layer’s size is scaled to the number of input features, so as the number of columns increases, the model complexity increases as well.
What if there are a large number of categorical factor levels?
This is something to look out for. For example, say you have three columns: zip code (70k levels), height, and income. The resulting number of internally one-hot encoded features will be 70,002, and only 3 of them will be activated (non-zero). If the first hidden layer has 200 neurons, the resulting weight matrix will be of size 70,002 x 200, which can take a long time to train and converge. In this case, we recommend either reducing the number of categorical factor levels upfront (e.g., using h2o.interaction() from R) or specifying max_categorical_features to use feature hashing to reduce the dimensionality, as in the sketch below.
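A sketch of both mitigations in R (the column names are hypothetical, and the exact h2o.interaction arguments may vary slightly between releases):

library(h2o)
h2o.init()
df <- h2o.importFile("data.csv")  # hypothetical dataset

# Option 1: cap the factor levels of the zip column upfront; levels beyond
# max_factors are collapsed into a single catch-all level.
zip_capped <- h2o.interaction(df, factors = "zip",
                              pairwise = FALSE,
                              max_factors = 1000,
                              min_occurrence = 2)

# Option 2: let Deep Learning hash the one-hot encoded space down.
dl <- h2o.deeplearning(x = c("zip", "height"), y = "income",
                       training_frame = df,
                       max_categorical_features = 1000)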
Deep Learning Algorithm
For more information about how the Deep Learning algorithm works, refer to the Deep Learning booklet.
References
Candel, Arno and Parmar, Viraj. “Deep Learning with H2O.” H2O.ai, Inc. (2015).
Candel, Arno. “The Definitive Performance Tuning Guide for H2O Deep Learning.” H2O.ai, Inc. (2015).
REST API Reference
- /3/About
- /3/Cloud
- /3/Cloud
- /3/CreateFrame
- /3/DKV
- /3/DKV/(?.*)
- /3/DownloadDataset
- /3/Find
- /3/Frames
- /3/Frames
- /3/Frames/(?.*)
- /3/Frames/(?.*)
- /3/Frames/(?.*)/columns
- /3/Frames/(?.*)/columns/(?.*)
- /3/Frames/(?.*)/columns/(?.*)/domain
- /3/Frames/(?.*)/columns/(?.*)/summary
- /3/Frames/(?.*)/export/(?.*)/overwrite/(?.*)
- /3/Frames/(?.*)/summary
- /3/ImportFiles
- /3/InitID
- /3/JStack
- /3/Jobs
- /3/Jobs/(?.*)
- /3/Jobs/(?.*)/cancel
- /3/KillMinus3
- /3/LogAndEcho
- /3/Logs/nodes/(?.*)/files/(?.*)
- /3/MakeGLMModel
- /3/Metadata/endpoints
- /3/Metadata/endpoints/(?[0-9]+)
- /3/Metadata/endpoints/(?.*)
- /3/Metadata/schemaclasses/(?.*)
- /3/Metadata/schemas
- /3/Metadata/schemas/(?.*)
- /3/MissingInserter
- /3/ModelBuilders
- /3/ModelBuilders/(?.*)
- /3/ModelBuilders/deeplearning
- /3/ModelBuilders/deeplearning/parameters
- /3/ModelBuilders/drf
- /3/ModelBuilders/drf/parameters
- /3/ModelBuilders/gbm
- /3/ModelBuilders/gbm/parameters
- /3/ModelBuilders/glm
- /3/ModelBuilders/glm/parameters
- /3/ModelBuilders/kmeans
- /3/ModelBuilders/kmeans/parameters
- /3/ModelBuilders/naivebayes
- /3/ModelBuilders/naivebayes/parameters
- /3/ModelBuilders/pca
- /3/ModelBuilders/pca/parameters
- /3/ModelMetrics
- /3/ModelMetrics/frames/(?.*)
- /3/ModelMetrics/frames/(?.*)/models/(?.*)
- /3/ModelMetrics/frames/(?.*)/models/(?.*)
- /3/ModelMetrics/models/(?.*)
- /3/ModelMetrics/models/(?.*)/frames/(?.*)
- /3/ModelMetrics/models/(?.*)/frames/(?.*)
- /3/ModelMetrics/models/(?.*)/frames/(?.*)
- /3/Models
- /3/Models
- /3/Models/(?.*)
- /3/Models/(?.*)
- /3/Models/(?.*)/preview
- /3/NetworkTest
- /3/NodePersistentStorage/(?.*)
- /3/NodePersistentStorage/(?.*)
- /3/NodePersistentStorage/(?.*)/(?.*)
- /3/NodePersistentStorage/(?.*)/(?.*)
- /3/NodePersistentStorage/(?.*)/(?.*)
- /3/NodePersistentStorage/categories/(?.*)/exists
- /3/NodePersistentStorage/categories/(?.*)/names/(?.*)/exists
- /3/NodePersistentStorage/configured
- /3/Parse
- /3/ParseSetup
- /3/Predictions/models/(?.*)/frames/(?.*)
- /3/Profiler
- /3/Rapids
- /3/Rapids/isEval
- /3/Shutdown
- /3/SplitFrame
- /3/Timeline
- /3/Tutorials
- /3/Typeahead/files
- /3/UnlockKeys
- /3/WaterMeterCpuTicks/(?.*)
- /3/WaterMeterIo
- /3/WaterMeterIo/(?.*)
- /99/Sample
GET /3/About
Return information about this H2O instance.
Input | AboutV3 |
Output | AboutV3 |
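As a quick illustration, any of these endpoints can be called with an ordinary HTTP client. A minimal sketch in R with the httr package, assuming an H2O instance running on the default port 54321:

library(httr)
# Query the /3/About endpoint of a local H2O instance.
resp <- GET("http://localhost:54321/3/About")
about <- content(resp, as = "parsed")  # parsed JSON response
str(about$entries)  # the AboutV3 'entries' list of name/value properties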
GET /3/Cloud
Determine the status of the nodes in the H2O cloud.
Input | CloudV3 |
Output | CloudV3 |
HEAD /3/Cloud
Determine the status of the nodes in the H2O cloud.
Input | CloudV3 |
Output | CloudV3 |
POST /3/CreateFrame
Create a synthetic H2O Frame.
Input | CreateFrameV3 |
Output | CreateFrameV3 |
DELETE /3/DKV
Remove all keys from the H2O distributed K/V store.
Input | RemoveAllV3 |
Output | RemoveAllV3 |
DELETE /3/DKV/(?.*)
Remove an arbitrary key from the H2O distributed K/V store.
Input | RemoveV3 |
Output | RemoveV3 |
GET /3/DownloadDataset
Download the specified dataset.
Input | DownloadDataV3 |
Output | DownloadDataV3 |
GET /3/Find
Find a value within a Frame.
Input | FindV3 |
Output | FindV3 |
GET /3/Frames
Return all Frames in the H2O distributed K/V store.
Input | FramesV3 |
Output | FramesV3 |
DELETE /3/Frames
Delete all Frames from the H2O distributed K/V store.
Input | FramesV3 |
Output | FramesV3 |
GET /3/Frames/(?.*)
Return the specified Frame.
Input | FramesV3 |
Output | FramesV3 |
DELETE /3/Frames/(?.*)
Delete the specified Frame from the H2O distributed K/V store.
Input | FramesV3 |
Output | FramesV3 |
GET /3/Frames/(?.*)/columns
Return all the columns from a Frame.
Input | FramesV3 |
Output | FramesV3 |
GET /3/Frames/(?.*)/columns/(?.*)
Return the specified column from a Frame.
Input | FramesV3 |
Output | FramesV3 |
GET /3/Frames/(?.*)/columns/(?.*)/domain
Return the domains for the specified column. “null” if the column is not an Enum.
Input | FramesV3 |
Output | FramesV3 |
GET /3/Frames/(?.*)/columns/(?.*)/summary
Return the summary metrics for a column, e.g. mins, maxes, mean, sigma, percentiles, etc.
Input | FramesV3 |
Output | FramesV3 |
GET /3/Frames/(?.*)/export/(?.*)/overwrite/(?.*)
Export a Frame to the given path with optional overwrite.
Input | FramesV3 |
Output | FramesV3 |
GET /3/Frames/(?.*)/summary
Return a Frame, including the histograms, after forcing computation of rollups.
Input | FramesV3 |
Output | FramesV3 |
GET /3/ImportFiles
Import raw data files into a single-column H2O Frame.
Input | ImportFilesV3 |
Output | ImportFilesV3 |
GET /3/InitID
Issue a new session ID.
Input | InitIDV3 |
Output | InitIDV3 |
GET /3/JStack
Return a stack trace for each node in the cluster.
Input | JStackV3 |
Output | JStackV3 |
GET /3/Jobs
Get a list of all the H2O Jobs (long-running actions).
Input | JobsV3 |
Output | Schema |
GET /3/Jobs/(?.*)
Get the status of the given H2O Job (long-running action).
Input | JobsV3 |
Output | Schema |
POST /3/Jobs/(?.*)/cancel
Cancel a running job.
Input | JobsV3 |
Output | Schema |
GET /3/KillMinus3
Issue a 'kill -3' signal on this node to dump thread stacks to the logs.
Input | KillMinus3V3 |
Output | KillMinus3V3 |
POST /3/LogAndEcho
Save a message to the H2O logfile.
Input | LogAndEchoV3 |
Output | LogAndEchoV3 |
GET /3/Logs/nodes/(?.*)/files/(?.*)
Get named log file for a node.
Input | LogsV3 |
Output | LogsV3 |
POST /3/MakeGLMModel
Create a new GLM model based on an existing one.
Input | MakeGLMModelV3 |
Output | GLMModelV3 |
GET /3/Metadata/endpoints
Return a list of all the REST API endpoints.
Input | DocsV3 |
Output | DocsV3 |
GET /3/Metadata/endpoints/(?[0-9]+)
Return the REST API endpoint metadata, including documentation, for the endpoint specified by number.
Input | DocsV3 |
Output | DocsV3 |
GET /3/Metadata/endpoints/(?.*)
Return the REST API endpoint metadata, including documentation, for the endpoint specified by path.
Input | DocsV3 |
Output | DocsV3 |
GET /3/Metadata/schemaclasses/(?.*)
Return the REST API schema metadata for specified schema class.
Input | DocsV3 |
Output | DocsV3 |
GET /3/Metadata/schemas
Return list of all REST API schemas.
Input | DocsV3 |
Output | DocsV3 |
GET /3/Metadata/schemas/(?.*)
Return the REST API schema metadata for specified schema.
Input | DocsV3 |
Output | DocsV3 |
POST /3/MissingInserter
Insert missing values.
Input | MissingInserterV3 |
Output | MissingInserterV3 |
GET /3/ModelBuilders
Return the Model Builder metadata for all available algorithms.
Input | ModelBuildersV3 |
Output | ModelBuildersV3 |
GET /3/ModelBuilders/(?.*)
Return the Model Builder metadata for the specified algorithm.
Input | ModelBuildersV3 |
Output | ModelBuildersV3 |
POST /3/ModelBuilders/deeplearning
Train a Deep Learning model on the specified Frame.
Input | DeepLearningV3 |
Output | Schema |
POST /3/ModelBuilders/deeplearning/parameters
Validate a set of Deep Learning model builder parameters.
Input | DeepLearningV3 |
Output | DeepLearningV3 |
POST /3/ModelBuilders/drf
Train a DRF model on the specified Frame.
Input | DRFV3 |
Output | Schema |
POST /3/ModelBuilders/drf/parameters
Validate a set of DRF model builder parameters.
Input | DRFV3 |
Output | DRFV3 |
POST /3/ModelBuilders/gbm
Train a GBM model on the specified Frame.
Input | GBMV3 |
Output | Schema |
POST /3/ModelBuilders/gbm/parameters
Validate a set of GBM model builder parameters.
Input | GBMV3 |
Output | GBMV3 |
POST /3/ModelBuilders/glm
Train a GLM model on the specified Frame.
Input | GLMV3 |
Output | Schema |
POST /3/ModelBuilders/glm/parameters
Validate a set of GLM model builder parameters.
Input | GLMV3 |
Output | GLMV3 |
POST /3/ModelBuilders/kmeans
Train a KMeans model on the specified Frame.
Input | KMeansV3 |
Output | Schema |
POST /3/ModelBuilders/kmeans/parameters
Validate a set of KMeans model builder parameters.
Input | KMeansV3 |
Output | KMeansV3 |
POST /3/ModelBuilders/naivebayes
Train a Naive Bayes model on the specified Frame.
Input | NaiveBayesV3 |
Output | Schema |
POST /3/ModelBuilders/naivebayes/parameters
Validate a set of Naive Bayes model builder parameters.
Input | NaiveBayesV3 |
Output | NaiveBayesV3 |
POST /3/ModelBuilders/pca
Train a PCA model on the specified Frame.
Input | PCAV3 |
Output | Schema |
POST /3/ModelBuilders/pca/parameters
Validate a set of PCA model builder parameters.
Input | PCAV3 |
Output | PCAV3 |
GET /3/ModelMetrics
Return all the saved scoring metrics.
Input | ModelMetricsListSchemaV3 |
Output | ModelMetricsListSchemaV3 |
GET /3/ModelMetrics/frames/(?.*)
Return the saved scoring metrics for the specified Frame.
Input | ModelMetricsListSchemaV3 |
Output | ModelMetricsListSchemaV3 |
GET /3/ModelMetrics/frames/(?.*)/models/(?.*)
Return the saved scoring metrics for the specified Model and Frame.
Input | ModelMetricsListSchemaV3 |
Output | ModelMetricsListSchemaV3 |
DELETE /3/ModelMetrics/frames/(?.*)/models/(?.*)
Delete the saved scoring metrics for the specified Model and Frame.
Input | ModelMetricsListSchemaV3 |
Output | ModelMetricsListSchemaV3 |
GET /3/ModelMetrics/models/(?.*)
Return the saved scoring metrics for the specified Model.
Input | ModelMetricsListSchemaV3 |
Output | ModelMetricsListSchemaV3 |
GET /3/ModelMetrics/models/(?.*)/frames/(?.*)
Return the saved scoring metrics for the specified Model and Frame.
Input | ModelMetricsListSchemaV3 |
Output | ModelMetricsListSchemaV3 |
DELETE /3/ModelMetrics/models/(?.*)/frames/(?.*)
Delete the saved scoring metrics for the specified Model and Frame.
Input | ModelMetricsListSchemaV3 |
Output | ModelMetricsListSchemaV3 |
POST /3/ModelMetrics/models/(?.*)/frames/(?.*)
Return the scoring metrics for the specified Frame with the specified Model. If the Frame has already been scored with the Model then cached results will be returned; otherwise predictions for all rows in the Frame will be generated and the metrics will be returned.
Input | ModelMetricsListSchemaV3 |
Output | ModelMetricsListSchemaV3 |
GET /3/Models
Return all Models from the H2O distributed K/V store.
Input | ModelsV3 |
Output | ModelsV3 |
DELETE /3/Models
Delete all Models from the H2O distributed K/V store.
Input | ModelsV3 |
Output | ModelsV3 |
GET /3/Models/(?.*)
Return the specified Model from the H2O distributed K/V store, optionally with the list of compatible Frames.
Input | ModelsV3 |
Output | ModelsV3 |
DELETE /3/Models/(?.*)
Delete the specified Model from the H2O distributed K/V store.
Input | ModelsV3 |
Output | ModelsV3 |
GET /3/Models/(?.*)/preview
Return potentially abridged model suitable for viewing in a browser (currently only used for java model code).
Input | ModelsV3 |
Output | ModelsV3 |
GET /3/NetworkTest
Run a network test between the nodes of the H2O cluster.
Input | NetworkTestV3 |
Output | NetworkTestV3 |
POST /3/NodePersistentStorage/(?.*)
Store a value.
Input | NodePersistentStorageV3 |
Output | NodePersistentStorageV3 |
GET /3/NodePersistentStorage/(?.*)
Return all keys stored for a given category.
Input | NodePersistentStorageV3 |
Output | NodePersistentStorageV3 |
POST /3/NodePersistentStorage/(?.*)/(?.*)
Store a named value.
Input | NodePersistentStorageV3 |
Output | NodePersistentStorageV3 |
GET /3/NodePersistentStorage/(?.*)/(?.*)
Return value for a given name.
Input | NodePersistentStorageV3 |
Output | NodePersistentStorageV3 |
DELETE /3/NodePersistentStorage/(?.*)/(?.*)
Delete a key.
Input | NodePersistentStorageV3 |
Output | NodePersistentStorageV3 |
GET /3/NodePersistentStorage/categories/(?.*)/exists
Return true or false.
Input | NodePersistentStorageV3 |
Output | NodePersistentStorageV3 |
GET /3/NodePersistentStorage/categories/(?.*)/names/(?.*)/exists
Return true or false.
Input | NodePersistentStorageV3 |
Output | NodePersistentStorageV3 |
GET /3/NodePersistentStorage/configured
Return true or false.
Input | NodePersistentStorageV3 |
Output | NodePersistentStorageV3 |
POST /3/Parse
Parse a raw byte-oriented Frame into a useful columnar data Frame.
Input | ParseV3 |
Output | ParseV3 |
POST /3/ParseSetup
Guess the parameters for parsing raw byte-oriented data into an H2O Frame.
Input | ParseSetupV3 |
Output | ParseSetupV3 |
POST /3/Predictions/models/(?.*)/frames/(?.*)
Score (generate predictions) for the specified Frame with the specified Model. Both the Frame of predictions and the metrics will be returned.
Input | ModelMetricsListSchemaV3 |
Output | ModelMetricsListSchemaV3 |
GET /3/Profiler
Return profiling information from all nodes of the H2O cluster.
Input | ProfilerV3 |
Output | ProfilerV3 |
POST /3/Rapids
Execute a Rapids expression (used for remote execution from the R client).
Input | RapidsV3 |
Output | RapidsV3 |
GET /3/Rapids/isEval
Check whether a Rapids expression key has been evaluated.
Input | RapidsV3 |
Output | RapidsV3 |
POST /3/Shutdown
Shut down the cluster.
Input | ShutdownV3 |
Output | ShutdownV3 |
POST /3/SplitFrame
Split an H2O Frame.
Input | SplitFrameV3 |
Output | SplitFrameV3 |
GET /3/Timeline
Return a timeline of recent cluster events.
Input | TimelineV3 |
Output | TimelineV3 |
GET /3/Tutorials
H2O tutorials.
Input | TutorialsV3 |
Output | TutorialsV3 |
GET /3/Typeahead/files
Typeahead handler for filename completion.
Input | TypeaheadV3 |
Output | Schema |
POST /3/UnlockKeys
Unlock all keys in the H2O distributed K/V store, to attempt to recover from a crash.
Input | UnlockKeysV3 |
Output | UnlockKeysV3 |
GET /3/WaterMeterCpuTicks/(?.*)
Return a CPU usage snapshot of all cores of all nodes in the H2O cluster.
Input | WaterMeterCpuTicksV3 |
Output | WaterMeterCpuTicksV3 |
GET /3/WaterMeterIo
Return IO usage snapshot of all nodes in the H2O cluster.
Input | WaterMeterIoV3 |
Output | WaterMeterIoV3 |
GET /3/WaterMeterIo/(?.*)
Return IO usage snapshot of all nodes in the H2O cluster.
Input | WaterMeterIoV3 |
Output | WaterMeterIoV3 |
GET /99/Sample
Example of an experimental endpoint. Call via /EXPERIMENTAL/Sample. Experimental endpoints can change at any moment.
Input | CloudV3 |
Output | CloudV3 |
REST API Schema Reference
- AboutEntryV3
- AboutV3
- CloudV3
- ClusteringModelBuilderSchema
- ClusteringModelParametersSchema
- ColSpecifierV2
- ColV2
- ColumnSpecsBase
- ConfusionMatrixBase
- ConfusionMatrixV3
- CoxPHModelOutputV3
- CoxPHModelV3
- CoxPHParametersV3
- CoxPHV3
- CreateFrameV3
- DRFModelOutputV3
- DRFModelV3
- DRFParametersV3
- DRFV3
- DStackTraceV2
- DeepLearningModelOutputV3
- DeepLearningModelV3
- DeepLearningParametersV3
- DeepLearningV3
- DocsBase
- DocsV3
- DownloadDataV3
- EventV2
- ExampleModelOutputV3
- ExampleModelV3
- ExampleParametersV3
- ExampleV3
- FieldMetadataBase
- FieldMetadataV3
- FindV3
- FrameKeyV3
- FrameV3
- FramesBase
- FramesV3
- GBMModelOutputV3
- GBMModelV3
- GBMParametersV3
- GBMV3
- GLMModelOutputV3
- GLMModelV3
- GLMParametersV3
- GLMV3
- GrepModelOutputV3
- GrepModelV3
- GrepParametersV3
- GrepV3
- H2OErrorV3
- H2OModelBuilderErrorV3
- HeartBeatEvent
- IOEvent
- ImportFilesV3
- InitIDV3
- IoStatsEntry
- JStackV3
- JobKeyV3
- JobV3
- JobsV3
- KMeansModelOutputV3
- KMeansModelV3
- KMeansParametersV3
- KMeansV3
- KeyV3
- KillMinus3V3
- LogAndEchoV3
- LogsV3
- MakeGLMModelV3
- MissingInserterV3
- ModelBuilderSchema
- ModelBuildersBase
- ModelBuildersV3
- ModelKeyV3
- ModelMetricsAutoEncoderV3
- ModelMetricsBase
- ModelMetricsBinomialGLMV3
- ModelMetricsBinomialV3
- ModelMetricsClusteringV3
- ModelMetricsListSchemaV3
- ModelMetricsMultinomialV3
- ModelMetricsPCAV3
- ModelMetricsRegressionGLMV3
- ModelMetricsRegressionV3
- ModelOutputSchema
- ModelParameterSchemaV3
- ModelParametersSchema
- ModelSchema
- ModelsBase
- ModelsV3
- NaiveBayesModelOutputV3
- NaiveBayesModelV3
- NaiveBayesParametersV3
- NaiveBayesV3
- NetworkEvent
- NetworkTestV3
- NodePersistentStorageEntryV3
- NodePersistentStorageV3
- NodeV1
- PCAModelOutputV3
- PCAModelV3
- PCAParametersV3
- PCAV3
- ParseSetupV3
- ParseV3
- ProfilerNodeEntryV3
- ProfilerNodeV3
- ProfilerV3
- QuantileParametersV2
- QuantileV3
- RapidsV3
- RemoveAllV3
- RemoveV3
- RouteBase
- RouteV3
- Schema
- SchemaMetadataBase
- SchemaMetadataV3
- SharedTreeModelOutputV3
- SharedTreeModelV3
- SharedTreeParametersV3
- SharedTreeV3
- ShutdownV3
- SplitFrameV3
- SupervisedModelBuilderSchema
- SupervisedModelParametersSchema
- SynonymV3
- TimelineV3
- TreeStatsV3
- TutorialsV3
- TwoDimTableBase
- TwoDimTableV3
- TypeaheadV3
- UnlockKeysV3
- ValidationMessageBase
- ValidationMessageV2
- VarImpBase
- VarImpV3
- VecKeyV3
- WaterMeterCpuTicksV3
- WaterMeterIoV3
- Word2VecModelOutputV3
- Word2VecModelV3
- Word2VecParametersV3
- Word2VecV3
AboutEntryV3
name string | Property name | Out |
value string | Property value | Out |
AboutV3
entries Iced[] | List of properties about this running H2O instance | Out |
CloudV3
skip_ticks boolean | skip_ticks | In |
version string | version | Out |
node_idx int | Node index number cloud status is collected from (zero-based) | Out |
cloud_name string | cloud_name | Out |
cloud_size int | cloud_size | Out |
cloud_uptime_millis long | cloud_uptime_millis | Out |
cloud_healthy boolean | cloud_healthy | Out |
bad_nodes int | Nodes reporting unhealthy | Out |
consensus boolean | Cloud voting is stable | Out |
locked boolean | Cloud is accepting new members or not | Out |
nodes Iced[] | nodes | Out |
ClusteringModelBuilderSchema
parameters Parameters | Model builder parameters. | In |
__http_status int | HTTP status to return for this build. | In |
algo string | The algo name for this ModelBuilder. | Out |
algo_full_name string | The pretty algo name for this ModelBuilder (e.g., Generalized Linear Model, rather than GLM). | Out |
can_build enum[] | Model categories this ModelBuilder can build. | Out |
job Job | Job Key | Out |
validation_messages ValidationMessage[] | Parameter validation messages | Out |
validation_error_count int | Count of parameter validation errors | Out |
ClusteringModelParametersSchema
k int | Number of clusters | In/Out |
model_id Key | Destination id for this model; auto-generated if not specified | In/Out |
training_frame Key | Training frame | In/Out |
validation_frame Key | Validation frame | In/Out |
ignored_columns string[] | Ignored columns | In/Out |
drop_na20_cols boolean | Drop columns with more than 20% missing values | In/Out |
score_each_iteration boolean | Whether to score during each iteration of model training | In/Out |
ColSpecifierV2
column_name string | Name of the column | In/Out |
is_member_of_frames string[] | List of fields which specify columns that must contain this column | In/Out |
ColV2
label string | label | Out |
missing_count long | missing | Out |
zero_count long | zeros | Out |
positive_infinity_count long | positive infinities | Out |
negative_infinity_count long | negative infinities | Out |
mins double[] | mins | Out |
maxs double[] | maxs | Out |
mean double | mean | Out |
sigma double | sigma | Out |
type string | datatype: {enum, string, int, real, time, uuid} | Out |
domain string[] | domain; not-null for enum columns only | Out |
data double[] | data | Out |
string_data string[] | string data | Out |
precision byte | decimal precision, -1 for all digits | Out |
histogram_bins long[] | Histogram bins; null if not computed | Out |
histogram_base double | Start of histogram bin zero | Out |
histogram_stride double | Stride per bin | Out |
percentiles double[] | Percentile values, matching the default percentiles | Out |
ColumnSpecsBase
name string | Column Name | Out |
type string | Column Type | Out |
format string | Column Format (printf) | Out |
description string | Column Description | Out |
ConfusionMatrixBase
table TwoDimTable | Annotated confusion matrix | Out |
ConfusionMatrixV3
table TwoDimTable | Annotated confusion matrix | Out |
CoxPHModelOutputV3
names string[] | Column names. | Out |
domains string[][] | Domains for categorical (enum) columns. | Out |
model_category enum | Category of the model (e.g., Binomial). | Out |
model_summary TwoDimTable | Model summary | Out |
scoring_history TwoDimTable | Scoring history | Out |
training_metrics ModelMetrics | Training data model metrics | Out |
validation_metrics ModelMetrics | Validation data model metrics | Out |
help Map | Help information for output fields | Out |
CoxPHModelV3
model_id Key | Model key | In/Out |
algo string | The algo name for this Model. | Out |
algo_full_name string | The pretty algo name for this Model (e.g., Generalized Linear Model, rather than GLM). | Out |
parameters CoxPHParameters | The build parameters for the model (e.g. K for KMeans). | Out |
output CoxPHOutput | The build output for the model (e.g. the cluster centers for KMeans). | Out |
compatible_frames string[] | Compatible frames, if requested | Out |
checksum long | Checksum for all the things that go into building the Model. | Out |
CoxPHParametersV3
model_id Key | Destination id for this model; auto-generated if not specified | In/Out |
training_frame Key | Training frame | In/Out |
validation_frame Key | Validation frame | In/Out |
ignored_columns string[] | Ignored columns | In/Out |
drop_na20_cols boolean | Drop columns with more than 20% missing values | In/Out |
score_each_iteration boolean | Whether to score during each iteration of model training | In/Out |
CoxPHV3
parameters CoxPHParameters | Model builder parameters. | In |
__http_status int | HTTP status to return for this build. | In |
algo string | The algo name for this ModelBuilder. | Out |
algo_full_name string | The pretty algo name for this ModelBuilder (e.g., Generalized Linear Model, rather than GLM). | Out |
can_build enum[] | Model categories this ModelBuilder can build. | Out |
job Job | Job Key | Out |
validation_messages ValidationMessage[] | Parameter validation messages | Out |
validation_error_count int | Count of parameter validation errors | Out |
CreateFrameV3
rows long | Number of rows | In |
cols int | Number of data columns (in addition to the first response column) | In |
seed long | Random number seed | In |
randomize boolean | Whether frame should be randomized | In |
value long | Constant value (for randomize=false) | In |
real_range long | Range for real variables (-range … range) | In |
categorical_fraction double | Fraction of categorical columns (for randomize=true) | In |
factors int | Factor levels for categorical variables | In |
integer_fraction double | Fraction of integer columns (for randomize=true) | In |
integer_range long | Range for integer variables (-range … range) | In |
binary_fraction double | Fraction of binary columns (for randomize=true) | In |
binary_ones_fraction double | Fraction of 1’s in binary columns | In |
missing_fraction double | Fraction of missing values | In |
response_factors int | Number of factor levels of the first column (1=real, 2=binomial, N=multinomial) | In |
has_response boolean | Whether an additional response column should be generated | In |
key Key | Job Key | In |
description string | Job description | In |
dest Key | destination key | In/Out |
status string | job status | Out |
progress float | progress, from 0 to 1 | Out |
progress_msg string | current progress status description | Out |
start_time long | Start time | Out |
msec long | runtime | Out |
exception string | exception | Out |
DRFModelOutputV3
variable_importances TwoDimTable | Variable Importances | Out |
init_f double | The Intercept term, the initial model function value to which trees make adjustments | Out |
names string[] | Column names. | Out |
domains string[][] | Domains for categorical (enum) columns. | Out |
model_category enum | Category of the model (e.g., Binomial). | Out |
model_summary TwoDimTable | Model summary | Out |
scoring_history TwoDimTable | Scoring history | Out |
training_metrics ModelMetrics | Training data model metrics | Out |
validation_metrics ModelMetrics | Validation data model metrics | Out |
help Map | Help information for output fields | Out |
DRFModelV3
model_id Key | Model key | In/Out |
algo string | The algo name for this Model. | Out |
algo_full_name string | The pretty algo name for this Model (e.g., Generalized Linear Model, rather than GLM). | Out |
parameters DRFParameters | The build parameters for the model (e.g. K for KMeans). | Out |
output DRFOutput | The build output for the model (e.g. the cluster centers for KMeans). | Out |
compatible_frames string[] | Compatible frames, if requested | Out |
checksum long | Checksum for all the things that go into building the Model. | Out |
DRFParametersV3
mtries int | Number of columns to randomly select at each level, or -1 for sqrt(#cols) | In |
sample_rate float | Sample rate, from 0. to 1.0 | In |
build_tree_one_node boolean | Run on one node only; no network overhead but fewer cpus used. Suitable for small datasets. | In |
ntrees int | Number of trees. | In |
max_depth int | Maximum tree depth. | In |
min_rows int | Fewest allowed observations in a leaf (in R called ‘nodesize’). | In |
nbins int | Build a histogram of this many bins, then split at the best point | In |
seed long | Seed for pseudo random number generator (if applicable) | In |
response_column VecSpecifier | Response column | In/Out |
balance_classes boolean | Balance training data class counts via over/under-sampling (for imbalanced data). | In/Out |
class_sampling_factors float[] | Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes. | In/Out |
max_after_balance_size float | Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes. | In/Out |
max_confusion_matrix_size int | Maximum size (# classes) for confusion matrices to be printed in the Logs | In/Out |
max_hit_ratio_k int | Max. number (top K) of predictions to use for hit ratio computation (for multi-class only, 0 to disable) | In/Out |
model_id Key | Destination id for this model; auto-generated if not specified | In/Out |
training_frame Key | Training frame | In/Out |
validation_frame Key | Validation frame | In/Out |
ignored_columns string[] | Ignored columns | In/Out |
drop_na20_cols boolean | Drop columns with more than 20% missing values | In/Out |
score_each_iteration boolean | Whether to score during each iteration of model training | In/Out |
DRFV3
parameters DRFParameters | Model builder parameters. | In |
__http_status int | HTTP status to return for this build. | In |
algo string | The algo name for this ModelBuilder. | Out |
algo_full_name string | The pretty algo name for this ModelBuilder (e.g., Generalized Linear Model, rather than GLM). | Out |
can_build enum[] | Model categories this ModelBuilder can build. | Out |
job Job | Job Key | Out |
validation_messages ValidationMessage[] | Parameter validation messages | Out |
validation_error_count int | Count of parameter validation errors | Out |
DStackTraceV2
node string | Node name | Out |
time long | Unix epoch time | Out |
thread_traces string[] | One trace per thread | Out |
DeepLearningModelOutputV3
weights Key[] | Frame keys for weight matrices | In |
biases Key[] | Frame keys for bias vectors | In |
variable_importances TwoDimTable | Variable Importances | Out |
names string[] | Column names. | Out |
domains string[][] | Domains for categorical (enum) columns. | Out |
model_category enum | Category of the model (e.g., Binomial). | Out |
model_summary TwoDimTable | Model summary | Out |
scoring_history TwoDimTable | Scoring history | Out |
training_metrics ModelMetrics | Training data model metrics | Out |
validation_metrics ModelMetrics | Validation data model metrics | Out |
help Map | Help information for output fields | Out |
DeepLearningModelV3
model_id Key | Model key | In/Out |
algo string | The algo name for this Model. | Out |
algo_full_name string | The pretty algo name for this Model (e.g., Generalized Linear Model, rather than GLM). | Out |
parameters DeepLearningParameters | The build parameters for the model (e.g. K for KMeans). | Out |
output DeepLearningModelOutput | The build output for the model (e.g. the cluster centers for KMeans). | Out |
compatible_frames string[] | Compatible frames, if requested | Out |
checksum long | Checksum for all the things that go into building the Model. | Out |
DeepLearningParametersV3
checkpoint Key | Model checkpoint to resume training with | In/Out |
override_with_best_model boolean | If enabled, override the final model with the best model found during training | In/Out |
autoencoder boolean | Auto-Encoder | In/Out |
use_all_factor_levels boolean | Use all factor levels of categorical variables. Otherwise, the first factor level is omitted (without loss of accuracy). Useful for variable importances and auto-enabled for autoencoder. | In/Out |
activation enum | Activation function | In/Out |
hidden int[] | Hidden layer sizes (e.g. 100,100). | In/Out |
epochs double | How many times the dataset should be iterated (streamed), can be fractional | In/Out |
train_samples_per_iteration long | Number of training samples (globally) per MapReduce iteration. Special values are 0: one epoch, -1: all available data (e.g., replicated training data), -2: automatic | In/Out |
target_ratio_comm_to_comp double | Target ratio of communication overhead to computation. Only for multi-node operation and train_samples_per_iteration=-2 (auto-tuning) | In/Out |
seed long | Seed for random numbers (affects sampling) - Note: only reproducible when running single threaded | In/Out |
adaptive_rate boolean | Adaptive learning rate | In/Out |
rho double | Adaptive learning rate time decay factor (similarity to prior updates) | In/Out |
epsilon double | Adaptive learning rate smoothing factor (to avoid divisions by zero and allow progress) | In/Out |
rate double | Learning rate (higher => less stable, lower => slower convergence) | In/Out |
rate_annealing double | Learning rate annealing: rate / (1 + rate_annealing * samples) | In/Out |
rate_decay double | Learning rate decay factor between layers (N-th layer: rate*alpha^(N-1)) | In/Out |
momentum_start double | Initial momentum at the beginning of training (try 0.5) | In/Out |
momentum_ramp double | Number of training samples for which momentum increases | In/Out |
momentum_stable double | Final momentum after the ramp is over (try 0.99) | In/Out |
nesterov_accelerated_gradient boolean | Use Nesterov accelerated gradient (recommended) | In/Out |
input_dropout_ratio double | Input layer dropout ratio (can improve generalization, try 0.1 or 0.2) | In/Out |
hidden_dropout_ratios double[] | Hidden layer dropout ratios (can improve generalization), specify one value per hidden layer, defaults to 0.5 | In/Out |
l1 double | L1 regularization (can add stability and improve generalization, causes many weights to become 0) | In/Out |
l2 double | L2 regularization (can add stability and improve generalization, causes many weights to be small) | In/Out |
max_w2 float | Constraint for squared sum of incoming weights per unit (e.g. for Rectifier) | In/Out |
initial_weight_distribution enum | Initial Weight Distribution | In/Out |
initial_weight_scale double | Initial weight scale (Uniform: -value…value, Normal: stddev) | In/Out |
loss enum | Loss function | In/Out |
score_interval double | Shortest time interval (in secs) between model scoring | In/Out |
score_training_samples long | Number of training set samples for scoring (0 for all) | In/Out |
score_validation_samples long | Number of validation set samples for scoring (0 for all) | In/Out |
score_duty_cycle double | Maximum duty cycle fraction for scoring (lower: more training, higher: more scoring). | In/Out |
classification_stop double | Stopping criterion for classification error fraction on training data (-1 to disable) | In/Out |
regression_stop double | Stopping criterion for regression error (MSE) on training data (-1 to disable) | In/Out |
quiet_mode boolean | Enable quiet mode for less output to standard output | In/Out |
score_validation_sampling enum | Method used to sample validation dataset for scoring | In/Out |
diagnostics boolean | Enable diagnostics for hidden layers | In/Out |
variable_importances boolean | Compute variable importances for input features (Gedeon method) - can be slow for large networks | In/Out |
fast_mode boolean | Enable fast mode (minor approximation in back-propagation) | In/Out |
ignore_const_cols boolean | Ignore constant training columns (no information can be gained anyway) | In/Out |
force_load_balance boolean | Force extra load balancing to increase training speed for small datasets (to keep all cores busy) | In/Out |
replicate_training_data boolean | Replicate the entire training dataset onto every node for faster training on small datasets | In/Out |
single_node_mode boolean | Run on a single node for fine-tuning of model parameters | In/Out |
shuffle_training_data boolean | Enable shuffling of training data (recommended if training data is replicated and train_samples_per_iteration is close to #nodes x #rows) | In/Out |
missing_values_handling enum | Handling of missing values. Either Skip or MeanImputation. | In/Out |
sparse boolean | Sparse data handling (Experimental). | In/Out |
col_major boolean | Use a column major weight matrix for input layer. Can speed up forward propagation, but might slow down backpropagation (Experimental). | In/Out |
average_activation double | Average activation for sparse auto-encoder (Experimental) | In/Out |
sparsity_beta double | Sparsity regularization (Experimental) | In/Out |
max_categorical_features int | Max. number of categorical features, enforced via hashing (Experimental) | In/Out |
reproducible boolean | Force reproducibility on small data (will be slow - only uses 1 thread) | In/Out |
export_weights_and_biases boolean | Whether to export Neural Network weights and biases to H2O Frames | In/Out |
response_column VecSpecifier | Response column | In/Out |
balance_classes boolean | Balance training data class counts via over/under-sampling (for imbalanced data). | In/Out |
class_sampling_factors float[] | Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes. | In/Out |
max_after_balance_size float | Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes. | In/Out |
max_confusion_matrix_size int | Maximum size (# classes) for confusion matrices to be printed in the Logs | In/Out |
max_hit_ratio_k int | Max. number (top K) of predictions to use for hit ratio computation (for multi-class only, 0 to disable) | In/Out |
model_id Key | Destination id for this model; auto-generated if not specified | In/Out |
training_frame Key | Training frame | In/Out |
validation_frame Key | Validation frame | In/Out |
ignored_columns string[] | Ignored columns | In/Out |
drop_na20_cols boolean | Drop columns with more than 20% missing values | In/Out |
score_each_iteration boolean | Whether to score during each iteration of model training | In/Out |
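The Deep Learning parameter set is large, but a build request only needs to override the defaults it cares about. Below is a minimal sketch; the bracketed-list encoding for array-valued fields such as `hidden`, along with the endpoint path and the frame/column names, are assumptions for the example.

```python
# Hypothetical sketch: a Deep Learning build overriding a few
# DeepLearningParametersV3 defaults. Assumes H2O at localhost:54321.
import requests

params = {
    "training_frame": "my_frame",
    "response_column": "label",
    "hidden": "[200,200]",        # two hidden layers of 200 units (assumed encoding)
    "epochs": 10,                 # iterate (stream) the dataset ten times
    "activation": "Rectifier",    # one of the `activation` enum values (assumed)
    "adaptive_rate": "true",      # adaptive learning rate (rho/epsilon apply)
}
resp = requests.post("http://localhost:54321/3/ModelBuilders/deeplearning",
                     data=params)
print(resp.json()["validation_messages"])   # parameter validation feedback
```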
DeepLearningV3
parameters DeepLearningParameters | Model builder parameters. | In |
__http_status int | HTTP status to return for this build. | In |
algo string | The algo name for this ModelBuilder. | Out |
algo_full_name string | The pretty algo name for this ModelBuilder (e.g., Generalized Linear Model, rather than GLM). | Out |
can_build enum[] | Model categories this ModelBuilder can build. | Out |
job Job | Job Key | Out |
validation_messages ValidationMessage[] | Parameter validation messages | Out |
validation_error_count int | Count of parameter validation errors | Out |
DocsBase
num int | Number for specifying an endpoint | In |
http_method string | HTTP method (GET, POST, DELETE) if fetching by path | In |
path string | Path for specifying an endpoint | In |
classname string | Class name, for fetching docs for a schema (DEPRECATED) | In |
schemaname string | Schema name (e.g., DocsV1), for fetching docs for a schema | In |
routes Route[] | List of endpoint routes | Out |
schemas SchemaMetadata[] | List of schemas | Out |
markdown string | Table of Contents Markdown | Out |
DocsV3
num int | Number for specifying an endpoint | In |
http_method string | HTTP method (GET, POST, DELETE) if fetching by path | In |
path string | Path for specifying an endpoint | In |
classname string | Class name, for fetching docs for a schema (DEPRECATED) | In |
schemaname string | Schema name (e.g., DocsV1), for fetching docs for a schema | In |
routes Route[] | List of endpoint routes | Out |
schemas SchemaMetadata[] | List of schemas | Out |
markdown string | Table of Contents Markdown | Out |
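The Docs schemas back the metadata endpoints, which are the programmatic way to fetch the same reference material shown on this page. A minimal sketch, assuming the Metadata route paths:

```python
# Hypothetical sketch: fetch the docs for a single schema by name.
import requests

base = "http://localhost:54321"
doc = requests.get(base + "/3/Metadata/schemas/FrameV3").json()
print(doc.get("markdown", ""))   # human-readable docs, per the markdown field above
```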
DownloadDataV3
frame_id Key | Frame to download | In |
hex_string boolean | Emit double values in a machine-readable, lossless format with Double.toHexString(). | In |
csv string | CSV Stream | Out |
filename string | Suggested Filename | Out |
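A minimal sketch of using these fields to save a frame locally, assuming the dataset-download route path:

```python
# Hypothetical sketch: stream a frame to disk as CSV via DownloadDataV3.
import requests

resp = requests.get(
    "http://localhost:54321/3/DownloadDataset",
    params={"frame_id": "my_frame", "hex_string": "false"},
    stream=True,                     # the CSV comes back as a stream
)
with open("my_frame.csv", "wb") as f:
    for chunk in resp.iter_content(chunk_size=65536):
        f.write(chunk)
```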
EventV2
date string | Time when the event was recorded. Format is hh:mm:ss:ms | In |
nanos long | Time in nanos | In |
type enum | type of recorded event | In |
ExampleModelOutputV3
iterations int | Iterations executed | In |
maxs double[] | (No description available) | In |
names string[] | Column names. | Out |
domains string[][] | Domains for categorical (enum) columns. | Out |
model_category enum | Category of the model (e.g., Binomial). | Out |
model_summary TwoDimTable | Model summary | Out |
scoring_history TwoDimTable | Scoring history | Out |
training_metrics ModelMetrics | Training data model metrics | Out |
validation_metrics ModelMetrics | Validation data model metrics | Out |
help Map | Help information for output fields | Out |
ExampleModelV3
model_id Key | Model key | In/Out |
algo string | The algo name for this Model. | Out |
algo_full_name string | The pretty algo name for this Model (e.g., Generalized Linear Model, rather than GLM). | Out |
parameters ExampleParameters | The build parameters for the model (e.g. K for KMeans). | Out |
output ExampleOutput | The build output for the model (e.g. the cluster centers for KMeans). | Out |
compatible_frames string[] | Compatible frames, if requested | Out |
checksum long | Checksum for all the things that go into building the Model. | Out |
ExampleParametersV3
max_iterations int | Maximum training iterations. | In |
model_id Key | Destination id for this model; auto-generated if not specified | In/Out |
training_frame Key | Training frame | In/Out |
validation_frame Key | Validation frame | In/Out |
ignored_columns string[] | Ignored columns | In/Out |
drop_na20_cols boolean | Drop columns with more than 20% missing values | In/Out |
score_each_iteration boolean | Whether to score during each iteration of model training | In/Out |
ExampleV3
parameters ExampleParameters | Model builder parameters. | In |
__http_status int | HTTP status to return for this build. | In |
algo string | The algo name for this ModelBuilder. | Out |
algo_full_name string | The pretty algo name for this ModelBuilder (e.g., Generalized Linear Model, rather than GLM). | Out |
can_build enum[] | Model categories this ModelBuilder can build. | Out |
job Job | Job Key | Out |
validation_messages ValidationMessage[] | Parameter validation messages | Out |
validation_error_count int | Count of parameter validation errors | Out |
FieldMetadataBase
schema_name string | Schema name for this field, if it is_schema, or the name of the enum, if it’s an enum. | In |
name string | Field name in the Schema | Out |
type string | Type for this field | Out |
is_schema boolean | Type for this field is itself a Schema. | Out |
value Polymorphic | Value for this field | Out |
help string | A short help description to appear alongside the field in a UI | Out |
label string | The label that should be displayed for the field if the name is insufficient | Out |
required boolean | Is this field required, or is the default value generally sufficient? | Out |
level enum | How important is this field? The web UI uses the level to do a slow reveal of the parameters | Out |
direction enum | Is this field an input, output or inout? | Out |
values string[] | For enum-type fields the allowed values are specified using the values annotation; this is used in UIs to tell the user the allowed values, and for validation | Out |
json boolean | Should this field be rendered in the JSON representation? | Out |
is_member_of_frames string[] | For Vec-type fields this is the set of Frame-type fields which must contain the named column; for example, for a SupervisedModel the response_column must be in both the training_frame and (if it’s set) the validation_frame | Out |
is_mutually_exclusive_with string[] | For Vec-type fields this is the set of other Vec-type fields which must contain mutually exclusive values; for example, for a SupervisedModel the response_column must be mutually exclusive with the weights_column | Out |
FieldMetadataV3
schema_name string | Schema name for this field, if it is_schema, or the name of the enum, if it’s an enum. | In |
name string | Field name in the Schema | Out |
type string | Type for this field | Out |
is_schema boolean | Type for this field is itself a Schema. | Out |
value Polymorphic | Value for this field | Out |
help string | A short help description to appear alongside the field in a UI | Out |
label string | The label that should be displayed for the field if the name is insufficient | Out |
required boolean | Is this field required, or is the default value generally sufficient? | Out |
level enum | How important is this field? The web UI uses the level to do a slow reveal of the parameters | Out |
direction enum | Is this field an input, output or inout? | Out |
values string[] | For enum-type fields the allowed values are specified using the values annotation; this is used in UIs to tell the user the allowed values, and for validation | Out |
json boolean | Should this field be rendered in the JSON representation? | Out |
is_member_of_frames string[] | For Vec-type fields this is the set of Frame-type fields which must contain the named column; for example, for a SupervisedModel the response_column must be in both the training_frame and (if it’s set) the validation_frame | Out |
is_mutually_exclusive_with string[] | For Vec-type fields this is the set of other Vec-type fields which must contain mutually exclusive values; for example, for a SupervisedModel the response_column must be mutually exclusive with the weights_column | Out |
FindV3
key Frame | Frame to search | In |
column string | Column, or null for all | In |
row long | Starting row for search | In |
match string | Value to search for; leave blank for a search for missing values | In |
prev long | previous row with matching value, or -1 | Out |
next long | next row with matching value, or -1 | Out |
FrameKeyV3
name string | Name (string representation) for this Key. | In/Out |
type string | Name (string representation) for the type of Keyed this Key points to. | In/Out |
URL string | URL for the resource that this Key points to, if one exists. | In/Out |
FrameV3
frame_id Key | Key to inspect | In |
row_offset long | Row offset to display | In |
row_count int | Number of rows to display | In/Out |
checksum long | checksum | Out |
rows long | Number of rows | Out |
byte_size long | Total data size in bytes | Out |
is_text boolean | Raw unparsed text | Out |
default_percentiles double[] | Default percentiles, from 0 to 1 | Out |
columns Vec[] | Columns | Out |
compatible_models string[] | Compatible models, if requested | Out |
vec_ids Key | The set of IDs of vectors in the Frame | Out |
chunk_summary TwoDimTable | Chunk summary | Out |
distribution_summary TwoDimTable | Distribution summary | Out |
FramesBase
frame_id Key | Name of Frame of interest | In |
column string | Name of column of interest | In |
find_compatible_models boolean | Find and return compatible models? | In |
path string | File output path | In |
force boolean | Overwrite existing file | In |
row_offset long | Row offset to display | In/Out |
row_count int | Number of rows to display | In/Out |
frames Frame[] | Frames | Out |
compatible_models Model[] | Compatible models | Out |
domain string[][] | Domains | Out |
FramesV3
frame_id Key | Name of Frame of interest | In |
column string | Name of column of interest | In |
find_compatible_models boolean | Find and return compatible models? | In |
path string | File output path | In |
force boolean | Overwrite existing file | In |
row_offset long | Row offset to display | In/Out |
row_count int | Number of rows to display | In/Out |
frames Frame[] | Frames | Out |
compatible_models Model[] | Compatible models | Out |
domain string[][] | Domains | Out |
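A minimal sketch of listing frames and paging through one of them with the `row_offset`/`row_count` inputs above; the route paths and the response nesting (`frame_id` as a FrameKeyV3 with a `name`) are assumptions:

```python
# Hypothetical sketch: list frames, then view 100 rows of one of them.
import requests

base = "http://localhost:54321"
frames = requests.get(base + "/3/Frames").json()["frames"]
for fr in frames:
    print(fr["frame_id"]["name"], fr["rows"], "rows")

view = requests.get(base + "/3/Frames/my_frame",
                    params={"row_offset": 0, "row_count": 100}).json()
```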
GBMModelOutputV3
variable_importances TwoDimTable | Variable Importances | Out |
init_f double | The Intercept term, the initial model function value to which trees make adjustments | Out |
names string[] | Column names. | Out |
domains string[][] | Domains for categorical (enum) columns. | Out |
model_category enum | Category of the model (e.g., Binomial). | Out |
model_summary TwoDimTable | Model summary | Out |
scoring_history TwoDimTable | Scoring history | Out |
training_metrics ModelMetrics | Training data model metrics | Out |
validation_metrics ModelMetrics | Validation data model metrics | Out |
help Map | Help information for output fields | Out |
GBMModelV3
model_id Key | Model key | In/Out |
algo string | The algo name for this Model. | Out |
algo_full_name string | The pretty algo name for this Model (e.g., Generalized Linear Model, rather than GLM). | Out |
parameters GBMParameters | The build parameters for the model (e.g. K for KMeans). | Out |
output GBMOutput | The build output for the model (e.g. the cluster centers for KMeans). | Out |
compatible_frames string[] | Compatible frames, if requested | Out |
checksum long | Checksum for all the things that go into building the Model. | Out |
GBMParametersV3
learn_rate float | Learning rate from 0.0 to 1.0 | In |
distribution enum | Distribution function | In |
ntrees int | Number of trees. | In |
max_depth int | Maximum tree depth. | In |
min_rows int | Fewest allowed observations in a leaf (in R called ‘nodesize’). | In |
nbins int | Build a histogram of this many bins, then split at the best point | In |
seed long | Seed for pseudo random number generator (if applicable) | In |
response_column VecSpecifier | Response column | In/Out |
balance_classes boolean | Balance training data class counts via over/under-sampling (for imbalanced data). | In/Out |
class_sampling_factors float[] | Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes. | In/Out |
max_after_balance_size float | Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes. | In/Out |
max_confusion_matrix_size int | Maximum size (# classes) for confusion matrices to be printed in the Logs | In/Out |
max_hit_ratio_k int | Max. number (top K) of predictions to use for hit ratio computation (for multi-class only, 0 to disable) | In/Out |
model_id Key | Destination id for this model; auto-generated if not specified | In/Out |
training_frame Key | Training frame | In/Out |
validation_frame Key | Validation frame | In/Out |
ignored_columns string[] | Ignored columns | In/Out |
drop_na20_cols boolean | Drop columns with more than 20% missing values | In/Out |
score_each_iteration boolean | Whether to score during each iteration of model training | In/Out |
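A minimal sketch of a GBM build request, including a check of the validation fields returned in the GBMV3 response below; the endpoint path, names, and the `distribution` value are assumptions:

```python
# Hypothetical sketch: GBM build mirroring GBMParametersV3.
import requests

resp = requests.post("http://localhost:54321/3/ModelBuilders/gbm", data={
    "training_frame": "my_frame",
    "response_column": "label",
    "ntrees": 100,
    "max_depth": 5,
    "learn_rate": 0.1,           # from 0.0 to 1.0, per the table above
    "distribution": "bernoulli", # a `distribution` enum value (assumed)
})
body = resp.json()
if body["validation_error_count"] > 0:
    for msg in body["validation_messages"]:
        print(msg)
```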
GBMV3
parameters GBMParameters | Model builder parameters. | In |
__http_status int | HTTP status to return for this build. | In |
algo string | The algo name for this ModelBuilder. | Out |
algo_full_name string | The pretty algo name for this ModelBuilder (e.g., Generalized Linear Model, rather than GLM). | Out |
can_build enum[] | Model categories this ModelBuilder can build. | Out |
job Job | Job Key | Out |
validation_messages ValidationMessage[] | Parameter validation messages | Out |
validation_error_count int | Count of parameter validation errors | Out |
GLMModelOutputV3
coefficients_table TwoDimTable | Table of coefficients | In |
coefficients_magnitude TwoDimTable | Coefficient magnitudes | In |
names string[] | Column names. | Out |
domains string[][] | Domains for categorical (enum) columns. | Out |
model_category enum | Category of the model (e.g., Binomial). | Out |
model_summary TwoDimTable | Model summary | Out |
scoring_history TwoDimTable | Scoring history | Out |
training_metrics ModelMetrics | Training data model metrics | Out |
validation_metrics ModelMetrics | Validation data model metrics | Out |
help Map | Help information for output fields | Out |
GLMModelV3
model_id Key | Model key | In/Out |
algo string | The algo name for this Model. | Out |
algo_full_name string | The pretty algo name for this Model (e.g., Generalized Linear Model, rather than GLM). | Out |
parameters GLMParameters | The build parameters for the model (e.g. K for KMeans). | Out |
output GLMOutput | The build output for the model (e.g. the cluster centers for KMeans). | Out |
compatible_frames string[] | Compatible frames, if requested | Out |
checksum long | Checksum for all the things that go into building the Model. | Out |
GLMParametersV3
family enum | Family. Use binomial for classification with logistic regression; the other families are for regression problems. | In |
solver enum | Auto will pick the solver best suited for the given dataset; in the case of lambda search, solvers may be changed during computation. IRLSM is fast on problems with a small number of predictors and for lambda search with an L1 penalty; L_BFGS scales better for datasets with many columns. | In |
alpha double[] | distribution of regularization between L1 and L2. | In |
lambda double[] | regularization strength | In |
lambda_search boolean | Use lambda search starting at lambda max; the given lambda is then interpreted as lambda min | In |
nlambdas int | number of lambdas to be used in a search | In |
standardize boolean | Standardize numeric columns to have zero mean and unit variance | In |
max_iterations int | Maximum number of iterations | In |
beta_epsilon double | Beta epsilon: the model is considered converged if the L1 norm of the current beta change is below this threshold | In |
link enum | (No description available) | In |
prior double | Prior probability for y==1. Use only for logistic regression, and only if the data has been sampled and the mean of the response does not reflect reality. | In |
lambda_min_ratio double | min lambda used in lambda search, specified as a ratio of lambda_max | In |
use_all_factor_levels boolean | By default, the first factor level is skipped from the possible set of predictors. Set this flag if you want to use all of the levels. Needs sufficient regularization to solve! | In |
beta_constraints Key | beta constraints | In |
max_active_predictors int | Maximum number of active predictors during computation. Use as a stopping criterion to prevent expensive model building with many predictors. | In |
response_column VecSpecifier | Response column | In/Out |
balance_classes boolean | Balance training data class counts via over/under-sampling (for imbalanced data). | In/Out |
class_sampling_factors float[] | Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes. | In/Out |
max_after_balance_size float | Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes. | In/Out |
max_confusion_matrix_size int | Maximum size (# classes) for confusion matrices to be printed in the Logs | In/Out |
max_hit_ratio_k int | Max. number (top K) of predictions to use for hit ratio computation (for multi-class only, 0 to disable) | In/Out |
model_id Key | Destination id for this model; auto-generated if not specified | In/Out |
training_frame Key | Training frame | In/Out |
validation_frame Key | Validation frame | In/Out |
ignored_columns string[] | Ignored columns | In/Out |
drop_na20_cols boolean | Drop columns with more than 20% missing values | In/Out |
score_each_iteration boolean | Whether to score during each iteration of model training | In/Out |
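A minimal sketch of a logistic-regression GLM with lambda search; the bracketed-list encoding for `alpha` and the endpoint path are assumptions:

```python
# Hypothetical sketch: binomial GLM with a lambda search (GLMParametersV3).
import requests

resp = requests.post("http://localhost:54321/3/ModelBuilders/glm", data={
    "training_frame": "my_frame",
    "response_column": "label",
    "family": "binomial",     # classification via logistic regression
    "alpha": "[0.5]",         # even mix of L1 and L2 penalties (assumed encoding)
    "lambda_search": "true",  # search from lambda max down to lambda_min_ratio
    "nlambdas": 30,
})
print(resp.json()["job"])
```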
GLMV3
parameters GLMParameters | Model builder parameters. | In |
__http_status int | HTTP status to return for this build. | In |
algo string | The algo name for this ModelBuilder. | Out |
algo_full_name string | The pretty algo name for this ModelBuilder (e.g., Generalized Linear Model, rather than GLM). | Out |
can_build enum[] | Model categories this ModelBuilder can build. | Out |
job Job | Job Key | Out |
validation_messages ValidationMessage[] | Parameter validation messages | Out |
validation_error_count int | Count of parameter validation errors | Out |
GrepModelOutputV3
matches string[] | Matching strings | In |
offsets long[] | Byte offsets of matches | In |
names string[] | Column names. | Out |
domains string[][] | Domains for categorical (enum) columns. | Out |
model_category enum | Category of the model (e.g., Binomial). | Out |
model_summary TwoDimTable | Model summary | Out |
scoring_history TwoDimTable | Scoring history | Out |
training_metrics ModelMetrics | Training data model metrics | Out |
validation_metrics ModelMetrics | Validation data model metrics | Out |
help Map | Help information for output fields | Out |
GrepModelV3
model_id Key | Model key | In/Out |
algo string | The algo name for this Model. | Out |
algo_full_name string | The pretty algo name for this Model (e.g., Generalized Linear Model, rather than GLM). | Out |
parameters GrepParameters | The build parameters for the model (e.g. K for KMeans). | Out |
output GrepOutput | The build output for the model (e.g. the cluster centers for KMeans). | Out |
compatible_frames string[] | Compatible frames, if requested | Out |
checksum long | Checksum for all the things that go into building the Model. | Out |
GrepParametersV3
regex string | regex | In |
model_id Key | Destination id for this model; auto-generated if not specified | In/Out |
training_frame Key | Training frame | In/Out |
validation_frame Key | Validation frame | In/Out |
ignored_columns string[] | Ignored columns | In/Out |
drop_na20_cols boolean | Drop columns with more than 20% missing values | In/Out |
score_each_iteration boolean | Whether to score during each iteration of model training | In/Out |
GrepV3
parameters GrepParameters | Model builder parameters. | In |
__http_status int | HTTP status to return for this build. | In |
algo string | The algo name for this ModelBuilder. | Out |
algo_full_name string | The pretty algo name for this ModelBuilder (e.g., Generalized Linear Model, rather than GLM). | Out |
can_build enum[] | Model categories this ModelBuilder can build. | Out |
job Job | Job Key | Out |
validation_messages ValidationMessage[] | Parameter validation messages | Out |
validation_error_count int | Count of parameter validation errors | Out |
H2OErrorV3
timestamp long | Milliseconds since the epoch for the time that this H2OError instance was created. Generally this is a short time since the underlying error occurred. | Out |
error_url string | Error URL | Out |
msg string | Message intended for the end user (a data scientist). | Out |
dev_msg string | Potentially more detailed message intended for a developer (e.g. a front end engineer or someone designing a language binding). | Out |
http_status int | HTTP status code for this error. | Out |
values Map | Any values that are relevant to reporting or handling this error. Examples are a key name if the error is on a key, or a field name and object name if it’s on a specific field. | Out |
exception_type string | Exception type, if any. | Out |
exception_msg string | Raw exception message, if any. | Out |
stacktrace string[] | Stacktrace, if any. | Out |
H2OModelBuilderErrorV3
parameters Parameters | Model builder parameters. | Out |
validation_messages ValidationMessage[] | Parameter validation messages | Out |
validation_error_count int | Count of parameter validation errors | Out |
timestamp long | Milliseconds since the epoch for the time that this H2OError instance was created. Generally this is a short time since the underlying error occurred. | Out |
error_url string | Error URL | Out |
msg string | Message intended for the end user (a data scientist). | Out |
dev_msg string | Potentially more detailed message intended for a developer (e.g. a front end engineer or someone designing a language binding). | Out |
http_status int | HTTP status code for this error. | Out |
values Map | Any values that are relevant to reporting or handling this error. Examples are a key name if the error is on a key, or a field name and object name if it’s on a specific field. | Out |
exception_type string | Exception type, if any. | Out |
exception_msg string | Raw exception message, if any. | Out |
stacktrace string[] | Stacktrace, if any. | Out |
HeartBeatEvent
sends int | number of sent heartbeats | In |
recvs int | number of received heartbeats | In |
date string | Time when the event was recorded. Format is hh:mm:ss:ms | In |
nanos long | Time in nanos | In |
type enum | type of recorded event | In |
IOEvent
io_flavor string | flavor of the recorded io (ice/hdfs/…) | In |
node string | node where this io event happened | In |
data string | data info | In |
date string | Time when the event was recorded. Format is hh:mm:ss:ms | In |
nanos long | Time in nanos | In |
type enum | type of recorded event | In |
ImportFilesV3
path string | path | In |
files string[] | files | Out |
destination_frames string[] | names | Out |
fails string[] | fails | Out |
dels string[] | dels | Out |
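A minimal sketch, assuming the ImportFiles route path; the file path itself is a placeholder:

```python
# Hypothetical sketch: import raw files and inspect the ImportFilesV3 outputs.
import requests

resp = requests.get("http://localhost:54321/3/ImportFiles",
                    params={"path": "/data/iris.csv"})
body = resp.json()
print(body["destination_frames"])   # keys of the raw (still unparsed) data
print(body["fails"])                # anything that could not be imported
```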
InitIDV3
session_key string | Session ID | Out |
IoStatsEntry
backend string | Back end type | Out |
store_count long | Number of store events | Out |
store_bytes long | Cumulative stored bytes | Out |
delete_count long | Number of delete events | Out |
load_count long | Number of load events | Out |
load_bytes long | Cumulative loaded bytes | Out |
JStackV3
traces DStackTrace[] | Stacktraces | Out |
JobKeyV3
name string | Name (string representation) for this Key. | In/Out |
type string | Name (string representation) for the type of Keyed this Key points to. | In/Out |
URL string | URL for the resource that this Key points to, if one exists. | In/Out |
JobV3
key Key | Job Key | In |
description string | Job description | In |
dest Key | destination key | In/Out |
status string | job status | Out |
progress float | progress, from 0 to 1 | Out |
progress_msg string | current progress status description | Out |
start_time long | Start time | Out |
msec long | runtime | Out |
exception string | exception | Out |
JobsV3
job_id Key | Optional Job identifier | In |
jobs Job[] | jobs | Out |
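Model builds return a Job rather than blocking, so clients poll using the JobV3 fields above. A minimal sketch, assuming the job-lookup route form and the usual status strings:

```python
# Hypothetical sketch: poll a job by key until it leaves the running states.
import requests
import time

def wait_for_job(job_name, base="http://localhost:54321"):
    while True:
        job = requests.get(base + "/3/Jobs/" + job_name).json()["jobs"][0]
        print(job["status"], round(job["progress"] * 100), "%")
        if job["status"] not in ("CREATED", "RUNNING"):  # assumed status values
            return job
        time.sleep(1)
```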
KMeansModelOutputV3
centers TwoDimTable | Cluster Centers[k][features] | In |
centers_std TwoDimTable | Cluster Centers[k][features] on Standardized Data | In |
names string[] | Column names. | Out |
domains string[][] | Domains for categorical (enum) columns. | Out |
model_category enum | Category of the model (e.g., Binomial). | Out |
model_summary TwoDimTable | Model summary | Out |
scoring_history TwoDimTable | Scoring history | Out |
training_metrics ModelMetrics | Training data model metrics | Out |
validation_metrics ModelMetrics | Validation data model metrics | Out |
help Map | Help information for output fields | Out |
KMeansModelV3
model_id Key | Model key | In/Out |
algo string | The algo name for this Model. | Out |
algo_full_name string | The pretty algo name for this Model (e.g., Generalized Linear Model, rather than GLM). | Out |
parameters KMeansParameters | The build parameters for the model (e.g. K for KMeans). | Out |
output KMeansOutput | The build output for the model (e.g. the cluster centers for KMeans). | Out |
compatible_frames string[] | Compatible frames, if requested | Out |
checksum long | Checksum for all the things that go into building the Model. | Out |
KMeansParametersV3
user_points Key | User-specified points | In |
max_iterations int | Maximum training iterations | In |
standardize boolean | Standardize columns | In |
seed long | RNG Seed | In |
init enum | Initialization mode | In |
k int | Number of clusters | In/Out |
model_id Key | Destination id for this model; auto-generated if not specified | In/Out |
training_frame Key | Training frame | In/Out |
validation_frame Key | Validation frame | In/Out |
ignored_columns string[] | Ignored columns | In/Out |
drop_na20_cols boolean | Drop columns with more than 20% missing values | In/Out |
score_each_iteration boolean | Whether to score during each iteration of model training | In/Out |
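A minimal sketch of a K-Means build; the endpoint path and the parameter encoding are assumptions:

```python
# Hypothetical sketch: K-Means build mirroring KMeansParametersV3.
import requests

resp = requests.post("http://localhost:54321/3/ModelBuilders/kmeans", data={
    "training_frame": "my_frame",
    "k": 3,                  # number of clusters
    "max_iterations": 100,
    "standardize": "true",   # recommended when columns differ in scale
    "seed": 1234,            # RNG seed, for repeatability
})
print(resp.json()["job"])
```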
KMeansV3
parameters KMeansParameters | Model builder parameters. | In |
__http_status int | HTTP status to return for this build. | In |
algo string | The algo name for this ModelBuilder. | Out |
algo_full_name string | The pretty algo name for this ModelBuilder (e.g., Generalized Linear Model, rather than GLM). | Out |
can_build enum[] | Model categories this ModelBuilder can build. | Out |
job Job | Job Key | Out |
validation_messages ValidationMessage[] | Parameter validation messages | Out |
validation_error_count int | Count of parameter validation errors | Out |
KeyV3
name string | Name (string representation) for this Key. | In/Out |
type string | Name (string representation) for the type of Keyed this Key points to. | In/Out |
URL string | URL for the resource that this Key points to, if one exists. | In/Out |
KillMinus3V3
(No fields)
LogAndEchoV3
message string | Message to be Logged and Echoed | In |
LogsV3
nodeidx int | Index of node to query ticks for (0-based). -1 means current node. | In |
name string | Which specific log file to read from the log file directory. If left unspecified, the system chooses a default for you. | In |
log string | Content of log file | Out |
MakeGLMModelV3
model Key | source model | In |
dest Key | destination key | In |
names string[] | coefficient names | In |
beta double[] | new glm coefficients | In |
threshold float | decision threshold for label-generation | In |
MissingInserterV3
dataset Key | dataset | In |
fraction double | Fraction of data to replace with a missing value | In |
seed long | Seed | In |
key Key | Job Key | In |
description string | Job description | In |
dest Key | destination key | In/Out |
status string | job status | Out |
progress float | progress, from 0 to 1 | Out |
progress_msg string | current progress status description | Out |
start_time long | Start time | Out |
msec long | runtime | Out |
exception string | exception | Out |
ModelBuilderSchema
parameters Parameters | Model builder parameters. | In |
__http_status int | HTTP status to return for this build. | In |
algo string | The algo name for this ModelBuilder. | Out |
algo_full_name string | The pretty algo name for this ModelBuilder (e.g., Generalized Linear Model, rather than GLM). | Out |
can_build enum[] | Model categories this ModelBuilder can build. | Out |
job Job | Job Key | Out |
validation_messages ValidationMessage[] | Parameter validation messages | Out |
validation_error_count int | Count of parameter validation errors | Out |
ModelBuildersBase
algo string | Algo of ModelBuilder of interest | In |
model_builders Map | ModelBuilders | Out |
ModelBuildersV3
algo string | Algo of ModelBuilder of interest | In |
model_builders Map | ModelBuilders | Out |
ModelKeyV3
name string | Name (string representation) for this Key. | In/Out |
type string | Name (string representation) for the type of Keyed this Key points to. | In/Out |
URL string | URL for the resource that this Key points to, if one exists. | In/Out |
ModelMetricsAutoEncoderV3
model Key | The model used for this scoring run. | In/Out |
model_checksum long | The checksum for the model used for this scoring run. | In/Out |
frame Key | The frame used for this scoring run. | In/Out |
frame_checksum long | The checksum for the frame used for this scoring run. | In/Out |
description string | Optional description for this scoring run (to note out-of-bag, sampled data, etc.) | Out |
model_category enum | The category (e.g., Clustering) for the model used for this scoring run. | Out |
duration_in_ms long | The duration in ms for this scoring run. | Out |
scoring_time long | The time in ms since the epoch for the start of this scoring run. | Out |
predictions Frame | Predictions Frame. | Out |
MSE double | The Mean Squared Error of the prediction for this scoring run. | Out |
ModelMetricsBase
model Key | The model used for this scoring run. | In/Out |
model_checksum long | The checksum for the model used for this scoring run. | In/Out |
frame Key | The frame used for this scoring run. | In/Out |
frame_checksum long | The checksum for the frame used for this scoring run. | In/Out |
description string | Optional description for this scoring run (to note out-of-bag, sampled data, etc.) | Out |
model_category enum | The category (e.g., Clustering) for the model used for this scoring run. | Out |
duration_in_ms long | The duration in ms for this scoring run. | Out |
scoring_time long | The time in ms since the epoch for the start of this scoring run. | Out |
predictions Frame | Predictions Frame. | Out |
MSE double | The Mean Squared Error of the prediction for this scoring run. | Out |
ModelMetricsBinomialGLMV3
model Key | The model used for this scoring run. | In/Out |
model_checksum long | The checksum for the model used for this scoring run. | In/Out |
frame Key | The frame used for this scoring run. | In/Out |
frame_checksum long | The checksum for the frame used for this scoring run. | In/Out |
residual_deviance double | residual deviance | Out |
null_deviance double | null deviance | Out |
AIC double | AIC | Out |
null_degrees_of_freedom long | null DOF | Out |
residual_degrees_of_freedom long | residual DOF | Out |
r2 double | The R^2 for this scoring run. | Out |
logloss double | The logarithmic loss for this scoring run. | Out |
AUC double | The AUC for this scoring run. | Out |
Gini double | The Gini score for this scoring run. | Out |
thresholds_and_metric_scores TwoDimTable | The Metrics for various thresholds. | Out |
max_criteria_and_metric_scores TwoDimTable | The Metrics for various criteria. | Out |
description string | Optional description for this scoring run (to note out-of-bag, sampled data, etc.) | Out |
model_category enum | The category (e.g., Clustering) for the model used for this scoring run. | Out |
duration_in_ms long | The duration in ms for this scoring run. | Out |
scoring_time long | The time in ms since the epoch for the start of this scoring run. | Out |
predictions Frame | Predictions Frame. | Out |
MSE double | The Mean Squared Error of the prediction for this scoring run. | Out |
ModelMetricsBinomialV3
model Key | The model used for this scoring run. | In/Out |
model_checksum long | The checksum for the model used for this scoring run. | In/Out |
frame Key | The frame used for this scoring run. | In/Out |
frame_checksum long | The checksum for the frame used for this scoring run. | In/Out |
r2 double | The R^2 for this scoring run. | Out |
logloss double | The logarithmic loss for this scoring run. | Out |
AUC double | The AUC for this scoring run. | Out |
Gini double | The Gini score for this scoring run. | Out |
thresholds_and_metric_scores TwoDimTable | The Metrics for various thresholds. | Out |
max_criteria_and_metric_scores TwoDimTable | The Metrics for various criteria. | Out |
description string | Optional description for this scoring run (to note out-of-bag, sampled data, etc.) | Out |
model_category enum | The category (e.g., Clustering) for the model used for this scoring run. | Out |
duration_in_ms long | The duration in ms for this scoring run. | Out |
scoring_time long | The time in ms since the epoch for the start of this scoring run. | Out |
predictions Frame | Predictions Frame. | Out |
MSE double | The Mean Squared Error of the prediction for this scoring run. | Out |
ModelMetricsClusteringV3
avg_within_ss double | Average within-cluster Mean Square Error | In |
avg_ss double | Average Mean Square Error to the grand mean | In |
avg_between_ss double | Average between-cluster Mean Square Error | In |
centroid_stats TwoDimTable | Centroid Statistics | In |
model Key | The model used for this scoring run. | In/Out |
model_checksum long | The checksum for the model used for this scoring run. | In/Out |
frame Key | The frame used for this scoring run. | In/Out |
frame_checksum long | The checksum for the frame used for this scoring run. | In/Out |
description string | Optional description for this scoring run (to note out-of-bag, sampled data, etc.) | Out |
model_category enum | The category (e.g., Clustering) for the model used for this scoring run. | Out |
duration_in_ms long | The duration in ms for this scoring run. | Out |
scoring_time long | The time in ms since the epoch for the start of this scoring run. | Out |
predictions Frame | Predictions Frame. | Out |
MSE double | The Mean Squared Error of the prediction for this scoring run. | Out |
ModelMetricsListSchemaV3
model Key | Key of Model of interest (optional) | In |
frame Key | Key of Frame of interest (optional) | In |
reconstruction_error boolean | Compute reconstruction error (optional, only for Deep Learning AutoEncoder models) | In |
deep_features_hidden_layer int | Extract Deep Features for given hidden layer (optional, only for Deep Learning models) | In |
predictions_frame Key | Key of predictions frame, if predictions are requested (optional) | In/Out |
model_metrics ModelMetrics[] | ModelMetrics | Out |
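A minimal sketch of requesting metrics for a model/frame pair; the nested route form is an assumption:

```python
# Hypothetical sketch: score a frame with a model and read back the metrics.
import requests

resp = requests.post(
    "http://localhost:54321/3/ModelMetrics/models/my_model/frames/my_frame")
for mm in resp.json()["model_metrics"]:
    print(mm.get("MSE"))   # MSE is common to every ModelMetrics schema above
```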
ModelMetricsMultinomialV3
model Key | The model used for this scoring run. | In/Out |
model_checksum long | The checksum for the model used for this scoring run. | In/Out |
frame Key | The frame used for this scoring run. | In/Out |
frame_checksum long | The checksum for the frame used for this scoring run. | In/Out |
r2 double | The R^2 for this scoring run. | Out |
hit_ratio_table TwoDimTable | The hit ratio table for this scoring run. | Out |
cm ConfusionMatrix | The ConfusionMatrix object for this scoring run. | Out |
logloss double | The logarithmic loss for this scoring run. | Out |
description string | Optional description for this scoring run (to note out-of-bag, sampled data, etc.) | Out |
model_category enum | The category (e.g., Clustering) for the model used for this scoring run. | Out |
duration_in_ms long | The duration in ms for this scoring run. | Out |
scoring_time long | The time in ms since the epoch for the start of this scoring run. | Out |
predictions Frame | Predictions Frame. | Out |
MSE double | The Mean Squared Error of the prediction for this scoring run. | Out |
ModelMetricsPCAV3
model Key | The model used for this scoring run. | In/Out |
model_checksum long | The checksum for the model used for this scoring run. | In/Out |
frame Key | The frame used for this scoring run. | In/Out |
frame_checksum long | The checksum for the frame used for this scoring run. | In/Out |
description string | Optional description for this scoring run (to note out-of-bag, sampled data, etc.) | Out |
model_category enum | The category (e.g., Clustering) for the model used for this scoring run. | Out |
duration_in_ms long | The duration in ms for this scoring run. | Out |
scoring_time long | The time in ms since the epoch for the start of this scoring run. | Out |
predictions Frame | Predictions Frame. | Out |
MSE double | The Mean Squared Error of the prediction for this scoring run. | Out |
ModelMetricsRegressionGLMV3
model Key | The model used for this scoring run. | In/Out |
model_checksum long | The checksum for the model used for this scoring run. | In/Out |
frame Key | The frame used for this scoring run. | In/Out |
frame_checksum long | The checksum for the frame used for this scoring run. | In/Out |
residual_deviance double | residual deviance | Out |
null_deviance double | null deviance | Out |
AIC double | AIC | Out |
null_degrees_of_freedom long | null DOF | Out |
residual_degrees_of_freedom long | residual DOF | Out |
r2 double | The R^2 for this scoring run. | Out |
description string | Optional description for this scoring run (to note out-of-bag, sampled data, etc.) | Out |
model_category enum | The category (e.g., Clustering) for the model used for this scoring run. | Out |
duration_in_ms long | The duration in ms for this scoring run. | Out |
scoring_time long | The time in ms since the epoch for the start of this scoring run. | Out |
predictions Frame | Predictions Frame. | Out |
MSE double | The Mean Squared Error of the prediction for this scoring run. | Out |
ModelMetricsRegressionV3
model Key | The model used for this scoring run. | In/Out |
model_checksum long | The checksum for the model used for this scoring run. | In/Out |
frame Key | The frame used for this scoring run. | In/Out |
frame_checksum long | The checksum for the frame used for this scoring run. | In/Out |
r2 double | The R^2 for this scoring run. | Out |
description string | Optional description for this scoring run (to note out-of-bag, sampled data, etc.) | Out |
model_category enum | The category (e.g., Clustering) for the model used for this scoring run. | Out |
duration_in_ms long | The duration in ms for this scoring run. | Out |
scoring_time long | The time in ms since the epoch for the start of this scoring run. | Out |
predictions Frame | Predictions Frame. | Out |
MSE double | The Mean Squared Error of the prediction for this scoring run. | Out |
ModelOutputSchema
names string[] | Column names. | Out |
domains string[][] | Domains for categorical (enum) columns. | Out |
model_category enum | Category of the model (e.g., Binomial). | Out |
model_summary TwoDimTable | Model summary | Out |
scoring_history TwoDimTable | Scoring history | Out |
training_metrics ModelMetrics | Training data model metrics | Out |
validation_metrics ModelMetrics | Validation data model metrics | Out |
help Map | Help information for output fields | Out |
ModelParameterSchemaV3
is_member_of_frames string[] | For Vec-type fields this is the set of Frame-type fields which must contain the named column; for example, for a SupervisedModel the response_column must be in both the training_frame and (if it’s set) the validation_frame | In |
is_mutually_exclusive_with string[] | For Vec-type fields this is the set of other Vec-type fields which must contain mutually exclusive values; for example, for a SupervisedModel the response_column must be mutually exclusive with the weights_column | In |
name string | name in the JSON, e.g. “lambda” | Out |
label string | label in the UI, e.g. “lambda” | Out |
help string | help for the UI, e.g. “regularization multiplier, typically used for foo bar baz etc.” | Out |
required boolean | the field is required | Out |
type string | Java type, e.g. “double” | Out |
default_value Polymorphic | default value, e.g. 1 | Out |
actual_value Polymorphic | actual value as set by the user and/or modified by the ModelBuilder, e.g., 10 | Out |
level string | the importance of the parameter, used by the UI, e.g. “critical”, “extended” or “expert” | Out |
values string[] | list of valid values for use by the front-end | Out |
ModelParametersSchema
model_id Key | Destination id for this model; auto-generated if not specified | In/Out |
training_frame Key | Training frame | In/Out |
validation_frame Key | Validation frame | In/Out |
ignored_columns string[] | Ignored columns | In/Out |
drop_na20_cols boolean | Drop columns with more than 20% missing values | In/Out |
score_each_iteration boolean | Whether to score during each iteration of model training | In/Out |
ModelSchema
model_id Key | Model key | In/Out |
algo string | The algo name for this Model. | Out |
algo_full_name string | The pretty algo name for this Model (e.g., Generalized Linear Model, rather than GLM). | Out |
parameters Parameters | The build parameters for the model (e.g. K for KMeans). | Out |
output Output | The build output for the model (e.g. the cluster centers for KMeans). | Out |
compatible_frames string[] | Compatible frames, if requested | Out |
checksum long | Checksum for all the things that go into building the Model. | Out |
ModelsBase
model_id Key | Name of Model of interest | In |
preview boolean | Return potentially abridged model suitable for viewing in a browser | In |
find_compatible_frames boolean | Find and return compatible frames? | In |
models Model[] | Models | Out |
compatible_frames Frame[] | Compatible frames | Out |
ModelsV3
model_id Key | Name of Model of interest | In |
preview boolean | Return potentially abridged model suitable for viewing in a browser | In |
find_compatible_frames boolean | Find and return compatible frames? | In |
models Model[] | Models | Out |
compatible_frames Frame[] | Compatible frames | Out |
NaiveBayesModelOutputV3
levels string[] | Categorical levels of the response | In |
apriori TwoDimTable | A-priori probabilities of the response | In |
pcond TwoDimTable[] | Conditional probabilities of the predictors | In |
names string[] | Column names. | Out |
domains string[][] | Domains for categorical (enum) columns. | Out |
model_category enum | Category of the model (e.g., Binomial). | Out |
model_summary TwoDimTable | Model summary | Out |
scoring_history TwoDimTable | Scoring history | Out |
training_metrics ModelMetrics | Training data model metrics | Out |
validation_metrics ModelMetrics | Validation data model metrics | Out |
help Map | Help information for output fields | Out |
NaiveBayesModelV3
model_id Key | Model key | In/Out |
algo string | The algo name for this Model. | Out |
algo_full_name string | The pretty algo name for this Model (e.g., Generalized Linear Model, rather than GLM). | Out |
parameters NaiveBayesParameters | The build parameters for the model (e.g. K for KMeans). | Out |
output NaiveBayesOutput | The build output for the model (e.g. the cluster centers for KMeans). | Out |
compatible_frames string[] | Compatible frames, if requested | Out |
checksum long | Checksum for all the things that go into building the Model. | Out |
NaiveBayesParametersV3
laplace double | Laplace smoothing parameter | In |
min_sdev double | Min. standard deviation to use for observations with not enough data | In |
eps_sdev double | Cutoff below which standard deviation is replaced with min_sdev | In |
min_prob double | Min. probability to use for observations with not enough data | In |
eps_prob double | Cutoff below which probability is replaced with min_prob | In |
response_column VecSpecifier | Response column | In/Out |
balance_classes boolean | Balance training data class counts via over/under-sampling (for imbalanced data). | In/Out |
class_sampling_factors float[] | Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes. | In/Out |
max_after_balance_size float | Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes. | In/Out |
max_confusion_matrix_size int | Maximum size (# classes) for confusion matrices to be printed in the Logs | In/Out |
max_hit_ratio_k int | Max. number (top K) of predictions to use for hit ratio computation (for multi-class only, 0 to disable) | In/Out |
model_id Key | Destination id for this model; auto-generated if not specified | In/Out |
training_frame Key | Training frame | In/Out |
validation_frame Key | Validation frame | In/Out |
ignored_columns string[] | Ignored columns | In/Out |
drop_na20_cols boolean | Drop columns with more than 20% missing values | In/Out |
score_each_iteration boolean | Whether to score during each iteration of model training | In/Out |
NaiveBayesV3
parameters NaiveBayesParameters | Model builder parameters. | In |
__http_status int | HTTP status to return for this build. | In |
algo string | The algo name for this ModelBuilder. | Out |
algo_full_name string | The pretty algo name for this ModelBuilder (e.g., Generalized Linear Model, rather than GLM). | Out |
can_build enum[] | Model categories this ModelBuilder can build. | Out |
job Job | Job Key | Out |
validation_messages ValidationMessage[] | Parameter validation messages | Out |
validation_error_count int | Count of parameter validation errors | Out |
NetworkEvent
is_send boolean | Boolean flag distinguishing between sends (true) and receives (false) | In |
protocol string | network protocol (UDP/TCP) | In |
msg_type string | UDP type (exec, ack, ackack, …) | In |
from string | Sending node | In |
to string | Receiving node | In |
data string | Pretty print of the first few bytes of the msg payload. Contains class name for tasks. | In |
date string | Time when the event was recorded. Format is hh:mm:ss:ms | In |
nanos long | Time in nanos | In |
type enum | type of recorded event | In |
NetworkTestV3
microseconds_collective double[] | Collective broadcast/reduce times in microseconds (for each message size) | Out |
bandwidths_collective double[] | Collective bandwidths in Bytes/sec (for each message size, for each node) | Out |
microseconds double[][] | Round-trip times in microseconds (for each message size, for each node) | Out |
bandwidths double[][] | Bi-directional bandwidths in Bytes/sec (for each message size, for each node) | Out |
nodes string[] | Nodes | Out |
table TwoDimTable | NetworkTestResults | Out |
NodePersistentStorageEntryV3
category string | Category name | Out |
name string | Key name | Out |
size long | Size in bytes of value | Out |
timestamp_millis long | Epoch time in milliseconds of when the value was written | Out |
NodePersistentStorageV3
category string | Category name | In/Out |
name string | Key name | In/Out |
value string | Value | In/Out |
configured boolean | Configured | Out |
exists boolean | Exists | Out |
entries Iced[] | List of entries | Out |
NodeV1
h2o string | IP | Out |
ip_port string | IP address and port in the form a.b.c.d:e | Out |
healthy boolean | (now-last_ping)<HeartbeatThread.TIMEOUT | Out |
last_ping long | Time (in msec) of last ping | Out |
sys_load float | System load; average #runnables/#cores | Out |
gflops double | Linpack GFlops | Out |
mem_bw double | Memory Bandwidth | Out |
total_value_size long | Data on Node (memory or disk) | Out |
mem_value_size long | Data on Node (memory only) | Out |
num_keys int | Number of local keys | Out |
free_mem long | Free heap | Out |
tot_mem long | Total heap | Out |
max_mem long | Max heap | Out |
free_disk long | Free disk | Out |
max_disk long | Max disk | Out |
rpcs_active int | Active Remote Procedure Calls | Out |
fjthrds short[] | F/J Thread count, by priority | Out |
fjqueue short[] | F/J Task count, by priority | Out |
tcps_active int | Open TCP connections | Out |
open_fds int | Open File Descriptors | Out |
num_cpus int | num_cpus | Out |
cpus_allowed int | cpus_allowed | Out |
nthreads int | nthreads | Out |
my_cpu_pct int | System CPU percentage used by this H2O process in last interval | Out |
sys_cpu_pct int | System CPU percentage used by everything in last interval | Out |
pid string | PID | Out |
PCAModelOutputV3
iterations int | Iterations executed | In |
archetypes double[][] | Mapping from training data to lower dimensional k-space | In |
std_deviation double[] | Standard deviation of each principal component | In |
eigenvectors TwoDimTable | Principal components matrix | In |
pc_importance TwoDimTable | Importance of each principal component | In |
names string[] | Column names. | Out |
domains string[][] | Domains for categorical (enum) columns. | Out |
model_category enum | Category of the model (e.g., Binomial). | Out |
model_summary TwoDimTable | Model summary | Out |
scoring_history TwoDimTable | Scoring history | Out |
training_metrics ModelMetrics | Training data model metrics | Out |
validation_metrics ModelMetrics | Validation data model metrics | Out |
help Map | Help information for output fields | Out |
PCAModelV3
model_id Key | Model key | In/Out |
algo string | The algo name for this Model. | Out |
algo_full_name string | The pretty algo name for this Model (e.g., Generalized Linear Model, rather than GLM). | Out |
parameters PCAParameters | The build parameters for the model (e.g. K for KMeans). | Out |
output PCAOutput | The build output for the model (e.g. the cluster centers for KMeans). | Out |
compatible_frames string[] | Compatible frames, if requested | Out |
checksum long | Checksum for all the things that go into building the Model. | Out |
PCAParametersV3
transform enum | Transformation of training data | In |
k int | Rank of matrix approximation | In |
gamma double | Regularization weight | In |
max_iterations int | Maximum training iterations | In |
seed long | RNG seed for k-means++ initialization | In |
init enum | Initialization mode | In |
user_points Key | User-specified initial Y | In |
loading_key Key | Frame key to save resulting X | In |
model_id Key | Destination id for this model; auto-generated if not specified | In/Out |
training_frame Key | Training frame | In/Out |
validation_frame Key | Validation frame | In/Out |
ignored_columns string[] | Ignored columns | In/Out |
drop_na20_cols boolean | Drop columns with more than 20% missing values | In/Out |
score_each_iteration boolean | Whether to score during each iteration of model training | In/Out |
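These parameters are submitted as form fields to the PCA model builder. A minimal sketch, assuming the builder route is /3/ModelBuilders/pca and that a parsed frame named my_frame already exists (both are illustrative assumptions):

    import requests

    H2O = "http://localhost:54321"

    resp = requests.post(H2O + "/3/ModelBuilders/pca", data={
        "training_frame": "my_frame",   # hypothetical parsed frame key
        "k": 3,                         # rank of the matrix approximation
        "transform": "STANDARDIZE",
        "max_iterations": 1000,
    }).json()

    # The response follows PCAV3; poll the returned job for completion.
    print(resp.get("job"), resp.get("validation_messages"))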
PCAV3
parameters PCAParameters | Model builder parameters. | In |
__http_status int | HTTP status to return for this build. | In |
algo string | The algo name for this ModelBuilder. | Out |
algo_full_name string | The pretty algo name for this ModelBuilder (e.g., Generalized Linear Model, rather than GLM). | Out |
can_build enum[] | Model categories this ModelBuilder can build. | Out |
job Job | Job Key | Out |
validation_messages ValidationMessage[] | Parameter validation messages | Out |
validation_error_count int | Count of parameter validation errors | Out |
ParseSetupV3
source_frames Key[] | Source frames | In/Out |
parse_type enum | Parser type | In/Out |
separator byte | Field separator | In/Out |
single_quotes boolean | Single quotes | In/Out |
check_header int | Check header: 0 means guess, +1 means 1st line is header not data, -1 means 1st line is data not header | In/Out |
column_names string[] | Column names | In/Out |
column_types string[] | Value types for columns | In/Out |
na_strings string[] | NA strings for columns | In/Out |
destination_frame string | Suggested name | Out |
is_valid boolean | The initial parse is sane | Out |
invalid_lines long | Number of broken/invalid lines found | Out |
header_lines long | Number of header lines found | Out |
number_columns int | Number of columns | Out |
domains string[][] | Domains for categorical columns | Out |
data string[][] | Sample data | Out |
chunk_size int | Size of individual parse tasks | Out |
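ParseSetupV3 is the first half of the two-step ingest flow: POST the source frame keys and H2O guesses a parse configuration for you to review. A sketch, assuming the route /3/ParseSetup and an already-imported file key (the bracketed-list encoding of array-valued form fields is also an assumption):

    import requests

    H2O = "http://localhost:54321"

    # "nfs://path/to/data.csv" stands in for a real imported-file key.
    setup = requests.post(H2O + "/3/ParseSetup", data={
        "source_frames": '["nfs://path/to/data.csv"]',
    }).json()

    print(setup["parse_type"], setup["separator"], setup["number_columns"])
    print(setup["column_names"])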
ParseV3
destination_frame Key | Final frame name | In |
source_frames Key[] | Source frames | In |
parse_type enum | Parser type | In |
separator byte | Field separator | In |
single_quotes boolean | Single Quotes | In |
check_header int | Check header: 0 means guess, +1 means 1st line is header not data, -1 means 1st line is data not header | In |
number_columns int | Number of columns | In |
column_names string[] | Column names | In |
column_types string[] | Value types for columns | In |
domains string[][] | Domains for categorical columns | In |
na_strings string[] | NA strings for columns | In |
chunk_size int | Size of individual parse tasks | In |
delete_on_done boolean | Delete input key after parse | In |
blocking boolean | Block until the parse completes (as opposed to returning early and requiring polling) | In |
remove_frame boolean | Remove frame after blocking parse, and return array of Vecs | In |
job Job | Parse job | Out |
rows long | Rows | Out |
vec_ids Key | Vec IDs | Out |
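ParseV3 is the second half: the guessed setup is echoed back, with any corrections, to launch the parse job. A sketch assuming the route /3/Parse; the hard-coded values below would normally come from the ParseSetupV3 response above:

    import requests

    H2O = "http://localhost:54321"

    job = requests.post(H2O + "/3/Parse", data={
        "destination_frame": "data.hex",
        "source_frames": '["nfs://path/to/data.csv"]',
        "parse_type": "CSV",
        "separator": 44,            # ',' as a byte value
        "number_columns": 4,
        "single_quotes": False,
        "column_names": '["a","b","c","d"]',
        "check_header": 1,          # first line is the header
        "chunk_size": 4194304,
        "delete_on_done": True,
        "blocking": True,           # block until the parse completes
    }).json()
    print(job.get("job"))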
ProfilerNodeEntryV3
stacktrace string | Stack trace | Out |
count int | Profile Count | Out |
ProfilerNodeV3
node_name string | Node names | Out |
timestamp long | Timestamp (millis since epoch) | Out |
entries Iced[] | Profile entry list | Out |
ProfilerV3
depth int | Stack trace depth | In |
nodes Iced[] | (No description available) | Out |
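A sketch of pulling a stack-trace profile, assuming the route is /3/Profiler and that depth bounds the number of stack frames collected:

    import requests

    H2O = "http://localhost:54321"

    prof = requests.get(H2O + "/3/Profiler", params={"depth": 10}).json()

    # Each entry in "nodes" follows ProfilerNodeV3.
    for node in prof["nodes"]:
        print(node.get("node_name"), len(node.get("entries", [])))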
QuantileParametersV2
probs double[] | Probabilities for quantiles | In |
combine_method enum | How to combine quantiles for even sample sizes | In |
model_id Key | Destination id for this model; auto-generated if not specified | In/Out |
training_frame Key | Training frame | In/Out |
validation_frame Key | Validation frame | In/Out |
ignored_columns string[] | Ignored columns | In/Out |
drop_na20_cols boolean | Drop columns with more than 20% missing values | In/Out |
score_each_iteration boolean | Whether to score during each iteration of model training | In/Out |
QuantileV3
parameters QuantileParameters | Model builder parameters. | In |
__http_status int | HTTP status to return for this build. | In |
algo string | The algo name for this ModelBuilder. | Out |
algo_full_name string | The pretty algo name for this ModelBuilder (e.g., Generalized Linear Model, rather than GLM). | Out |
can_build enum[] | Model categories this ModelBuilder can build. | Out |
job Job | Job Key | Out |
validation_messages ValidationMessage[] | Parameter validation messages | Out |
validation_error_count int | Count of parameter validation errors | Out |
RapidsV3
ast string | An Abstract Syntax Tree. | In |
fun string | An array of function definitions. | In |
ast_key Key | A pointer to a Frame | In |
error string | Parsing error, if any | Out |
key Key | Result key | Out |
num_rows long | Rows in Frame result | Out |
num_cols int | Columns in Frame result | Out |
scalar double | Scalar result | Out |
funstr string | Function result | Out |
col_names string[] | Column Names | Out |
string string | String result | Out |
result string | Result | Out |
evaluated boolean | Was evaluated | Out |
head string[][] | Head of a Frame result | Out |
result_type int | Result Type. | Out |
vec_ids Key | Vec keys for key result | Out |
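RapidsV3 drives H2O's expression language: an abstract syntax tree is POSTed as a string, and the result comes back as a scalar, string, or frame. A sketch, assuming the route /3/Rapids; the expression below is purely illustrative and the exact literal syntax may differ on your build:

    import requests

    H2O = "http://localhost:54321"

    # A trivial arithmetic AST; "#" marks numeric literals in old Rapids.
    resp = requests.post(H2O + "/3/Rapids", data={"ast": "(+ #1 #2)"}).json()

    if resp.get("error"):
        print("parse error:", resp["error"])
    else:
        print("scalar result:", resp.get("scalar"))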
RemoveAllV3
(No fields)
RemoveV3
key Key | Object to be removed. | In |
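A sketch of both removal calls, assuming the keys live under a /3/DKV route (an assumption; check the route table on your build):

    import requests

    H2O = "http://localhost:54321"

    # Remove a single object ("data.hex" is an illustrative frame key).
    requests.delete(H2O + "/3/DKV/data.hex")

    # Remove everything in the store - use with care.
    requests.delete(H2O + "/3/DKV")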
RouteBase
http_method string | (No description available) | Out |
url_pattern string | (No description available) | Out |
summary string | (No description available) | Out |
handler_class string | (No description available) | Out |
handler_method string | (No description available) | Out |
input_schema string | (No description available) | Out |
output_schema string | (No description available) | Out |
doc_method string | (No description available) | Out |
path_params string[] | (No description available) | Out |
markdown string | (No description available) | Out |
RouteV3
http_method string | (No description available) | Out |
url_pattern string | (No description available) | Out |
summary string | (No description available) | Out |
handler_class string | (No description available) | Out |
handler_method string | (No description available) | Out |
input_schema string | (No description available) | Out |
output_schema string | (No description available) | Out |
doc_method string | (No description available) | Out |
path_params string[] | (No description available) | Out |
markdown string | (No description available) | Out |
Schema
(No fields)
SchemaMetadataBase
version int | Version number of the Schema. | In |
name string | Simple name of the Schema. NOTE: the schema_names form a single namespace. | In |
superclass string | Simple name of the superclass of the Schema. NOTE: the schema_names form a single namespace. | In |
type string | Simple name of H2O type that this Schema represents. Must not be changed after creation (treat as final). | In |
fields FieldMetadata[] | All the public fields of the schema | Out |
markdown string | Documentation for the schema in Markdown format with GitHub extensions | Out |
SchemaMetadataV3
version int | Version number of the Schema. | In |
name string | Simple name of the Schema. NOTE: the schema_names form a single namespace. | In |
superclass string | Simple name of the superclass of the Schema. NOTE: the schema_names form a single namespace. | In |
type string | Simple name of H2O type that this Schema represents. Must not be changed after creation (treat as final). | In |
fields FieldMetadata[] | All the public fields of the schema | Out |
markdown string | Documentation for the schema in Markdown format with GitHub extensions | Out |
SharedTreeModelOutputV3
variable_importances TwoDimTable | Variable Importances | Out |
init_f double | The intercept term: the initial model function value to which the trees make adjustments | Out |
names string[] | Column names. | Out |
domains string[][] | Domains for categorical (enum) columns. | Out |
model_category enum | Category of the model (e.g., Binomial). | Out |
model_summary TwoDimTable | Model summary | Out |
scoring_history TwoDimTable | Scoring history | Out |
training_metrics ModelMetrics | Training data model metrics | Out |
validation_metrics ModelMetrics | Validation data model metrics | Out |
help Map | Help information for output fields | Out |
SharedTreeModelV3
model_id Key | Model key | In/Out |
algo string | The algo name for this Model. | Out |
algo_full_name string | The pretty algo name for this Model (e.g., Generalized Linear Model, rather than GLM). | Out |
parameters Parameters | The build parameters for the model (e.g. K for KMeans). | Out |
output Output | The build output for the model (e.g. the cluster centers for KMeans). | Out |
compatible_frames string[] | Compatible frames, if requested | Out |
checksum long | Checksum for all the things that go into building the Model. | Out |
SharedTreeParametersV3
ntrees int | Number of trees. | In |
max_depth int | Maximum tree depth. | In |
min_rows int | Fewest allowed observations in a leaf (in R called ‘nodesize’). | In |
nbins int | Build a histogram of this many bins, then split at the best point | In |
seed long | Seed for pseudo random number generator (if applicable) | In |
response_column VecSpecifier | Response column | In/Out |
balance_classes boolean | Balance training data class counts via over/under-sampling (for imbalanced data). | In/Out |
class_sampling_factors float[] | Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes. | In/Out |
max_after_balance_size float | Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes. | In/Out |
max_confusion_matrix_size int | Maximum size (# classes) for confusion matrices to be printed in the Logs | In/Out |
max_hit_ratio_k int | Max. number (top K) of predictions to use for hit ratio computation (for multi-class only, 0 to disable) | In/Out |
model_id Key | Destination id for this model; auto-generated if not specified | In/Out |
training_frame Key | Training frame | In/Out |
validation_frame Key | Validation frame | In/Out |
ignored_columns string[] | Ignored columns | In/Out |
drop_na20_cols boolean | Drop columns with more than 20% missing values | In/Out |
score_each_iteration boolean | Whether to score during each iteration of model training | In/Out |
SharedTreeV3
parameters Parameters | Model builder parameters. | In |
__http_status int | HTTP status to return for this build. | In |
algo string | The algo name for this ModelBuilder. | Out |
algo_full_name string | The pretty algo name for this ModelBuilder (e.g., Generalized Linear Model, rather than GLM). | Out |
can_build enum[] | Model categories this ModelBuilder can build. | Out |
job Job | Job Key | Out |
validation_messages ValidationMessage[] | Parameter validation messages | Out |
validation_error_count int | Count of parameter validation errors | Out |
ShutdownV3
(No fields)
SplitFrameV3
dataset Key | Dataset | In |
ratios double[] | Split ratios - the resulting number of splits is ratios.length+1 | In |
key Key | Job Key | In |
description string | Job description | In |
destination_frames Key[] | Destination keys for each output frame split. | In/Out |
dest Key | Destination key | In/Out |
status string | Job status | Out |
progress float | Progress, from 0 to 1 | Out |
progress_msg string | Current progress status description | Out |
start_time long | Start time | Out |
msec long | Runtime in milliseconds | Out |
exception string | Exception, if any | Out |
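SplitFrameV3 doubles as a job schema: the In fields launch the split and the Out fields report its progress. A sketch, assuming the route /3/SplitFrame and a parsed frame data.hex (both illustrative):

    import requests

    H2O = "http://localhost:54321"

    # Two ratios yield three splits (ratios.length + 1).
    resp = requests.post(H2O + "/3/SplitFrame", data={
        "dataset": "data.hex",
        "ratios": "[0.75,0.15]",
        "destination_frames": '["train.hex","valid.hex","test.hex"]',
    }).json()
    print(resp.get("status"), resp.get("progress"))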
SupervisedModelBuilderSchema
parameters Parameters | Model builder parameters. | In |
__http_status int | HTTP status to return for this build. | In |
algo string | The algo name for this ModelBuilder. | Out |
algo_full_name string | The pretty algo name for this ModelBuilder (e.g., Generalized Linear Model, rather than GLM). | Out |
can_build enum[] | Model categories this ModelBuilder can build. | Out |
job Job | Job Key | Out |
validation_messages ValidationMessage[] | Parameter validation messages | Out |
validation_error_count int | Count of parameter validation errors | Out |
SupervisedModelParametersSchema
response_column VecSpecifier | Response column | In/Out |
balance_classes boolean | Balance training data class counts via over/under-sampling (for imbalanced data). | In/Out |
class_sampling_factors float[] | Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes. | In/Out |
max_after_balance_size float | Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes. | In/Out |
max_confusion_matrix_size int | Maximum size (# classes) for confusion matrices to be printed in the Logs | In/Out |
max_hit_ratio_k int | Max. number (top K) of predictions to use for hit ratio computation (for multi-class only, 0 to disable) | In/Out |
model_id Key | Destination id for this model; auto-generated if not specified | In/Out |
training_frame Key | Training frame | In/Out |
validation_frame Key | Validation frame | In/Out |
ignored_columns string[] | Ignored columns | In/Out |
drop_na20_cols boolean | Drop columns with more than 20% missing values | In/Out |
score_each_iteration boolean | Whether to score during each iteration of model training | In/Out |
SynonymV3
key Key | A word2vec model key. | In |
target string | The target string for which to find synonyms. | In |
cnt int | Find the top cnt synonyms of the target word. | In |
synonyms string[] | The synonyms. | Out |
cos_sim float[] | The cosine similarities. | Out |
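A sketch of querying synonyms from a trained word2vec model, assuming the route is /3/Synonyms and that a model with key w2v.hex exists (both illustrative):

    import requests

    H2O = "http://localhost:54321"

    resp = requests.get(H2O + "/3/Synonyms", params={
        "key": "w2v.hex",    # hypothetical word2vec model key
        "target": "tokyo",
        "cnt": 5,
    }).json()

    for word, sim in zip(resp["synonyms"], resp["cos_sim"]):
        print(word, sim)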
TimelineV3
now long | Current time in millis. | Out |
self string | This node | Out |
events Iced[] | Recorded timeline events | Out |
TreeStatsV3
min_depth int | Minimum tree depth | In |
max_depth int | Maximum tree depth | In |
mean_depth float | Mean tree depth | In |
min_leaves int | Minimum number of leaves | In |
max_leaves int | Maximum number of leaves | In |
mean_leaves float | Mean number of leaves | In |
TutorialsV3
(No fields)
TwoDimTableBase
name string | Table Name | Out |
description string | Table Description | Out |
columns Iced[] | Column Specification | Out |
rowcount int | Number of Rows | Out |
data Polymorphic[][] | Table Data (col-major) | Out |
TwoDimTableV3
name string | Table Name | Out |
description string | Table Description | Out |
columns Iced[] | Column Specification | Out |
rowcount int | Number of Rows | Out |
data Polymorphic[][] | Table Data (col-major) | Out |
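Note that the data field is column-major: data[c][r] is row r of column c. A small client-side helper for flipping a parsed TwoDimTableV3 payload into rows:

    def table_rows(table):
        # "table" is a parsed TwoDimTableV3 JSON payload; "data" holds
        # one list per column, so transpose it into row-major order.
        cols = table["data"]
        return [[col[r] for col in cols] for r in range(table["rowcount"])]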
TypeaheadV3
src string | The partial source string to complete | In |
limit int | limit | In |
matches string[] | matches | Out |
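Typeahead backs the path-completion boxes in Flow. A sketch, assuming the file-completion route is /3/Typeahead/files:

    import requests

    H2O = "http://localhost:54321"

    resp = requests.get(H2O + "/3/Typeahead/files",
                        params={"src": "/home/", "limit": 10}).json()
    print(resp["matches"])   # up to "limit" paths starting with "src"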
UnlockKeysV3
(No fields)
ValidationMessageBase
message_type string | Type of validation message (ERROR, WARN, INFO, HIDE) | Out |
field_name string | Field to which the message applies | Out |
message string | Message text | Out |
ValidationMessageV2
message_type string | Type of validation message (ERROR, WARN, INFO, HIDE) | Out |
field_name string | Field to which the message applies | Out |
message string | Message text | Out |
VarImpBase
varimp float[] | Variable importance of individual variables | Out |
names string[] | Names of variables | Out |
VarImpV3
varimp float[] | Variable importance of individual variables | Out |
names string[] | Names of variables | Out |
VecKeyV3
name string | Name (string representation) for this Key. | In/Out |
type string | Name (string representation) for the type of Keyed this Key points to. | In/Out |
URL string | URL for the resource that this Key points to, if one exists. | In/Out |
WaterMeterCpuTicksV3
nodeidx int | Index of node to query ticks for (0-based) | In |
cpu_ticks long[][] | Array of tick counts per core | Out |
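A sketch of sampling CPU ticks for one node, assuming the route takes the node index as a path parameter (/3/WaterMeterCpuTicks/{nodeidx}):

    import requests

    H2O = "http://localhost:54321"

    ticks = requests.get(H2O + "/3/WaterMeterCpuTicks/0").json()
    print(ticks["cpu_ticks"])   # one array of tick counters per core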
WaterMeterIoV3
nodeidx int | Index of node to query ticks for (0-based) | In |
persist_stats Iced[] | Array of I/O info | Out |
Word2VecModelOutputV3
names string[] | Column names. | Out |
domains string[][] | Domains for categorical (enum) columns. | Out |
model_category enum | Category of the model (e.g., Binomial). | Out |
model_summary TwoDimTable | Model summary | Out |
scoring_history TwoDimTable | Scoring history | Out |
training_metrics ModelMetrics | Training data model metrics | Out |
validation_metrics ModelMetrics | Validation data model metrics | Out |
help Map | Help information for output fields | Out |
Word2VecModelV3
model_id Key | Model key | In/Out |
algo string | The algo name for this Model. | Out |
algo_full_name string | The pretty algo name for this Model (e.g., Generalized Linear Model, rather than GLM). | Out |
parameters Word2VecParameters | The build parameters for the model (e.g. K for KMeans). | Out |
output Word2VecOutput | The build output for the model (e.g. the cluster centers for KMeans). | Out |
compatible_frames string[] | Compatible frames, if requested | Out |
checksum long | Checksum for all the things that go into building the Model. | Out |
Word2VecParametersV3
vecSize int | Size of word vectors | In |
windowSize int | Maximum skip length between words | In |
sentSampleRate float | Threshold for word occurrence: words that appear with higher frequency in the training data are randomly down-sampled; the useful range is (0, 1e-5) | In |
normModel enum | Use Hierarchical Softmax or Negative Sampling | In |
negSampleCnt int | Number of negative examples; common values are 3-10 (0 = not used) | In |
epochs int | Number of training iterations to run | In |
minWordFreq int | Discard words that appear fewer than this many times | In |
initLearningRate float | Starting learning rate | In |
wordModel enum | Use the continuous bag-of-words (CBOW) model or the skip-gram model | In |
model_id Key | Destination id for this model; auto-generated if not specified | In/Out |
training_frame Key | Training frame | In/Out |
validation_frame Key | Validation frame | In/Out |
ignored_columns string[] | Ignored columns | In/Out |
drop_na20_cols boolean | Drop columns with more than 20% missing values | In/Out |
score_each_iteration boolean | Whether to score during each iteration of model training | In/Out |
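As with the other builders, these fields are submitted as form parameters. A sketch, assuming the builder route /3/ModelBuilders/word2vec and a parsed text frame text.hex (both illustrative):

    import requests

    H2O = "http://localhost:54321"

    resp = requests.post(H2O + "/3/ModelBuilders/word2vec", data={
        "training_frame": "text.hex",   # hypothetical parsed text frame
        "vecSize": 100,                 # dimensionality of the word vectors
        "windowSize": 5,
        "epochs": 10,
        "minWordFreq": 5,
    }).json()

    # The response follows Word2VecV3; poll the returned job for completion.
    print(resp.get("job"))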
Word2VecV3
parameters Word2VecParameters | Model builder parameters. | In |
__http_status int | HTTP status to return for this build. | In |
algo string | The algo name for this ModelBuilder. | Out |
algo_full_name string | The pretty algo name for this ModelBuilder (e.g., Generalized Linear Model, rather than GLM). | Out |
can_build enum[] | Model categories this ModelBuilder can build. | Out |
job Job | Job Key | Out |
validation_messages ValidationMessage[] | Parameter validation messages | Out |
validation_error_count int | Count of parameter validation errors | Out |