H2O Module

h2o

h2o.h2o.as_list(data, use_pandas=True)[source]

Convert an H2O data object into a python-specific object.

WARNING! This will pull all data local!

If Pandas is available (and use_pandas is True), then pandas will be used to parse the data frame. Otherwise, a list-of-lists populated by character data will be returned (so the types of data will all be str).

Parameters:

data : H2OFrame

An H2O data object.

use_pandas : bool

Try to use pandas for reading in the data.

Returns:

A pandas DataFrame if use_pandas is True and pandas is available; otherwise a list of lists (rows x columns).
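
Examples

A minimal sketch, assuming a small H2OFrame named iris has already been imported into the cluster:

>>> iris_df = h2o.as_list(iris, use_pandas=True)    # pandas DataFrame if pandas is installed
>>> iris_rows = h2o.as_list(iris, use_pandas=False) # list of lists of str otherwise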

h2o.h2o.cluster_info()[source]

Display the current H2O cluster information.

h2o.h2o.cluster_status()[source]

Retrieve information on the status of the cluster running H2O. Note that this can be confusing: the call may come back without warning, but if a user then tries to do any remoteSend, they will get a “cloud sick” warning.

h2o.h2o.create_frame(id=None, rows=10000, cols=10, randomize=True, value=0, real_range=100, categorical_fraction=0.2, factors=100, integer_fraction=0.2, integer_range=100, binary_fraction=0.1, binary_ones_fraction=0.02, time_fraction=0, string_fraction=0, missing_fraction=0.01, response_factors=2, has_response=False, seed=None, seed_for_column_types=None)[source]

Data Frame Creation in H2O. Creates a data frame in H2O with real-valued, categorical, integer, and binary columns specified by the user.

Parameters:

id : str

A string indicating the destination key. If empty, this will be auto-generated by H2O.

rows : int

The number of rows of data to generate.

cols : int

The number of columns of data to generate. Excludes the response column if has_response == True.

randomize : bool

A logical value indicating whether data values should be randomly generated. This must be TRUE if either categorical_fraction or integer_fraction is non-zero.

value : int

If randomize == FALSE, then all real-valued entries will be set to this value.

real_range : float

The range of randomly generated real values.

categorical_fraction : float

The fraction of total columns that are categorical.

factors : int

The number of (unique) factor levels in each categorical column.

integer_fraction : float

The fraction of total columns that are integer-valued.

integer_range : list

The range of randomly generated integer values.

binary_fraction : float

The fraction of total columns that are binary-valued.

binary_ones_fraction : float

The fraction of values in a binary column that are set to 1.

time_fraction : float

The fraction of randomly created date/time columns.

string_fraction : float

The fraction of randomly created string columns.

missing_fraction : float

The fraction of total entries in the data frame that are set to NA.

response_factors : int

If has_response == TRUE, then this is the number of factor levels in the response column.

has_response : bool

A logical value indicating whether an additional response column should be prepended to the final H2O data frame. If set to TRUE, the total number of columns will be cols+1.

seed : int

A seed used to generate random values when randomize = TRUE.

seed_for_column_types : int

A seed used to generate random column types when randomize = TRUE.

Returns:

H2OFrame
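
Examples

A minimal sketch using only parameters documented above; the values chosen are illustrative:

>>> random_df = h2o.create_frame(rows=100, cols=5,
...                              categorical_fraction=0.2,
...                              integer_fraction=0.2,
...                              missing_fraction=0.01,
...                              seed=1234)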

h2o.h2o.download_all_logs(dirname='.', filename=None)[source]

Download H2O Log Files to Disk

Parameters:

dirname : str, optional

A character string indicating the directory that the log file should be saved in.

filename : str, optional

A string indicating the name that the log file should be saved as.

Returns:

Path of logs written.
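
Examples

A minimal sketch; the directory and file name are arbitrary:

>>> h2o.download_all_logs(dirname="./h2o_logs", filename="h2o_log.zip")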

h2o.h2o.download_csv(data, filename)[source]

Download an H2O data set to a CSV file on the local disk.

Warning: Files located on the H2O server may be very large! Make sure you have enough hard drive space to accommodate the entire file.

Parameters:

data : H2OFrame

An H2OFrame object to be downloaded.

filename : str

A string indicating the name that the CSV file should be saved to.
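
Examples

A minimal sketch, assuming an H2OFrame named prostate is already loaded in the cluster; the output path is arbitrary:

>>> h2o.download_csv(prostate, "prostate.csv")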

h2o.h2o.download_pojo(model, path='', get_jar=True)[source]

Download the POJO for this model to the directory specified by path (no trailing slash!). If path is “”, then dump to screen.

Parameters:

model : H2OModel

Retrieve this model’s scoring POJO.

path : str

An absolute path to the directory where POJO should be saved.

get_jar : bool

Retrieve the h2o-genmodel.jar also.
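
Examples

A minimal sketch, assuming my_model is a previously trained H2O model; the output directory is arbitrary:

>>> h2o.download_pojo(my_model, path="/tmp/pojo_dir", get_jar=True)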

h2o.h2o.export_file(frame, path, force=False)[source]

Export a given H2OFrame to a path on the machine this python session is currently connected to. To view the current session, call h2o.cluster_info().

Parameters:

frame : H2OFrame

The Frame to save to disk.

path : str

The path to the save point on disk.

force : bool

Overwrite any preexisting file with the same path.
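
Examples

A minimal sketch, assuming an H2OFrame named predictions exists in the cluster; the destination path is arbitrary:

>>> h2o.export_file(predictions, path="/tmp/predictions.csv", force=True)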

h2o.h2o.frame(frame_id, exclude='')[source]

Retrieve metadata for an id that points to a Frame.

Parameters:

frame_id : str

A pointer to a Frame in H2O.

Returns:

Python dict containing the frame meta-information

h2o.h2o.frames()[source]

Retrieve all the Frames.

Returns:

Meta information on the frames

h2o.h2o.get_frame(frame_id)[source]

Obtain a handle to the frame in H2O with the frame_id key.

Returns:

H2OFrame
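
Examples

A minimal sketch; the frame id is hypothetical:

>>> fr = h2o.get_frame("my_frame_id")
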
h2o.h2o.get_grid(grid_id)[source]

Return the specified grid

Parameters:

grid_id : str

The grid identification in h2o

Returns:

H2OGridSearch instance

h2o.h2o.get_model(model_id)[source]

Return the specified model

Parameters:

model_id : str

The model identification in h2o

Returns:

Subclass of H2OEstimator
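
Examples

A minimal sketch; the model id is hypothetical and would normally come from a previously trained model:

>>> model = h2o.get_model("GBM_model_python_1")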

h2o.h2o.get_timezone()[source]

Get the Time Zone on the H2O Cloud

Returns:

The time zone (string)

h2o.h2o.import_file(path=None, destination_frame='', parse=True, header=(-1, 0, 1), sep='', col_names=None, col_types=None, na_strings=None)[source]

Have H2O import a dataset into memory. The path to the data must be a valid path for each node in the H2O cluster. If some node in the H2O cluster cannot see the file, then an exception will be thrown by the H2O cluster. Does a parallel/distributed multi-threaded pull of the data. Also see upload_file.

Parameters:

path : str

A path specifying the location of the data to import.

destination_frame : str, optional

The unique hex key assigned to the imported file. If none is given, a key will automatically be generated.

parse : bool, optional

A logical value indicating whether the file should be parsed after import.

header : int, optional

-1 means the first line is data, 0 means guess, 1 means first line is header.

sep : str, optional

The field separator character. Values on each line of the file are separated by this character. If sep = “”, the parser will automatically detect the separator.

col_names : list, optional

A list of column names for the file.

col_types : list or dict, optional

A list of types or a dictionary of column names to types to specify whether columns should be forced to a certain type upon import parsing. If a list, the types for elements that are None will be guessed. The possible types a column may have are:

“unknown” - this will force the column to be parsed as all NA
“uuid” - the values in the column must be true UUID or will be parsed as NA
“string” - force the column to be parsed as a string
“numeric” - force the column to be parsed as numeric. H2O will handle the compression of the numeric data in the optimal manner.
“enum” - force the column to be parsed as a categorical column.
“time” - force the column to be parsed as a time column. H2O will attempt to parse the following date/time formats: date - “yyyy-MM-dd”, “yyyy MM dd”, “dd-MMM-yy”, “dd MMM yy”; time - “HH:mm:ss”, “HH:mm:ss:SSS”, “HH:mm:ss:SSSnnnnnn”, “HH.mm.ss”, “HH.mm.ss.SSS”, “HH.mm.ss.SSSnnnnnn”. Times can also contain “AM” or “PM”.

na_strings : list or dict, optional

A list of strings, or a list of lists of strings (one list per column), or a dictionary of column names to strings which are to be interpreted as missing values.

Returns:

A new H2OFrame instance.
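
Examples

A minimal sketch; the file path and column settings are illustrative:

>>> airlines = h2o.import_file(path="/data/airlines.csv",
...                            destination_frame="airlines_hex",
...                            header=1, sep=",",
...                            na_strings=["NA", ""])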

h2o.h2o.import_sql_select(connection_url, select_query, username, password, optimize=None)[source]

Imports the SQL table that is the result of the specified SQL query to H2OFrame in memory. Creates a temporary SQL table from the specified sql_query. Runs multiple SELECT SQL queries on the temporary table concurrently for parallel ingestion, then drops the table. Be sure to start the h2o.jar in the terminal with your downloaded JDBC driver in the classpath:

java -cp <path_to_h2o_jar>:<path_to_jdbc_driver_jar> water.H2OApp

Also see h2o.import_sql_table. Currently supported SQL databases are MySQL, PostgreSQL, and MariaDB. Support for Oracle 12g and Microsoft SQL Server is forthcoming.

Parameters:

connection_url : str

URL of the SQL database connection as specified by the Java Database Connectivity (JDBC) Driver. For example, “jdbc:mysql://localhost:3306/menagerie?&useSSL=false”

select_query : str

SQL query starting with SELECT that returns rows from one or more database tables.

username : str

Username for SQL server

password : str

Password for SQL server

optimize : bool, optional, default is True

Optimize import of SQL table for faster imports. Experimental.

Returns:

H2OFrame containing data of specified SQL select query

Examples

>>> conn_url = "jdbc:mysql://172.16.2.178:3306/ingestSQL?&useSSL=false"
>>> select_query = "SELECT bikeid from citibike20k"
>>> username = "root"
>>> password = "abc123"
>>> my_citibike_data = h2o.import_sql_select(conn_url, select_query, username, password)

h2o.h2o.import_sql_table(connection_url, table, username, password, columns=None, optimize=None)[source]

Import SQL table to H2OFrame in memory. Assumes that the SQL table is not being updated and is stable. Runs multiple SELECT SQL queries concurrently for parallel ingestion. Be sure to start the h2o.jar in the terminal with your downloaded JDBC driver in the classpath:

java -cp <path_to_h2o_jar>:<path_to_jdbc_driver_jar> water.H2OApp

Also see h2o.import_sql_select. Currently supported SQL databases are MySQL, PostgreSQL, and MariaDB. Support for Oracle 12g and Microsoft SQL Server is forthcoming.

Parameters:

connection_url : str

URL of the SQL database connection as specified by the Java Database Connectivity (JDBC) Driver. For example, “jdbc:mysql://localhost:3306/menagerie?&useSSL=false”

table : str

Name of SQL table

username : str

Username for SQL server

password : str

Password for SQL server

columns : list of strings, optional

A list of column names to import from SQL table. Default is to import all columns.

optimize : bool, optional, default is True

Optimize import of SQL table for faster imports. Experimental.

Returns:

H2OFrame containing data of specified SQL table

Examples

>>> conn_url = "jdbc:mysql://172.16.2.178:3306/ingestSQL?&useSSL=false"
>>> table = "citibike20k"
>>> username = "root"
>>> password = "abc123"
>>> my_citibike_data = h2o.import_sql_table(conn_url, table, username, password)

h2o.h2o.init(ip='localhost', port=54321, start_h2o=True, enable_assertions=True, license=None, nthreads=-1, max_mem_size=None, min_mem_size=None, ice_root=None, strict_version_check=True, proxy=None, https=False, insecure=False, username=None, password=None, cluster_name=None, force_connect=False, max_mem_size_GB=None, min_mem_size_GB=None, proxies=None, size=None)[source]

Initiate an H2O connection to the specified ip and port.

Parameters:

ip : str

A string representing the hostname or IP address of the server where H2O is running.

port : int

A port, default is 54321

start_h2o : bool

A boolean dictating whether this module should start the H2O JVM. An attempt is made anyway if _connect fails.

enable_assertions : bool

If start_h2o, pass -ea as a VM option.

license : str

If not None, is a path to a license file.

nthreads : int

Number of threads in the thread pool. This relates very closely to the number of CPUs used. -1 means use all CPUs on the host. A positive integer specifies the number of CPUs directly. This value is only used when Python starts H2O.

max_mem_size : int

Maximum heap size (jvm option Xmx) in gigabytes.

min_mem_size : int

Minimum heap size (jvm option Xms) in gigabytes.

ice_root : str

A temporary directory (default location is determined by tempfile.mkdtemp()) to hold H2O log files.

strict_version_check : bool

Setting this to False is unsupported and should only be done when advised by technical support.

proxy : dict

A dictionary with keys ‘ftp’, ‘http’, ‘https’ and values that correspond to a proxy path.

https : bool

Set this to True to use https instead of http.

insecure : bool

Set this to True to disable SSL certificate checking.

username : str

Username to login with.

password : str

Password to login with.

cluster_name : str

Cluster to login to.

force_connect : bool

When set to True, an attempt to connect to the cluster will be made regardless of its reported health.

max_mem_size_GB : DEPRECATED

Use max_mem_size instead.

min_mem_size_GB : DEPRECATED

Use min_mem_size instead.

proxies : DEPRECATED

Use proxy instead.

size : DEPRECATED

Size is deprecated.

Examples

Using the ‘proxy’ parameter

>>> import h2o
>>> import urllib
>>> proxy_dict = urllib.getproxies()
>>> h2o.init(proxy=proxy_dict)
Starting H2O JVM and connecting: ............... Connection successful!

h2o.h2o.interaction(data, factors, pairwise, max_factors, min_occurrence, destination_frame=None)[source]

Categorical Interaction Feature Creation in H2O. Creates a frame in H2O with n-th order interaction features between categorical columns, as specified by the user.

Parameters:

data : H2OFrame

the H2OFrame that holds the target categorical columns.

factors : list

Factor columns (either indices or column names).

pairwise : bool

Whether to create pairwise interactions between factors (otherwise create one higher-order interaction). Only applicable if there are 3 or more factors.

max_factors : int

Max. number of factor levels in pair-wise interaction terms (if enforced, one extra catch-all factor will be made).

min_occurrence : int

Min. occurrence threshold for factor levels in pair-wise interaction terms.

destination_frame : str

A string indicating the destination key. If empty, this will be auto-generated by H2O.

Returns:

H2OFrame
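
Examples

A minimal sketch, assuming an H2OFrame named df whose first three columns are categorical; the destination key is arbitrary:

>>> higher_order = h2o.interaction(df, factors=[0, 1, 2],
...                                pairwise=False, max_factors=100,
...                                min_occurrence=1,
...                                destination_frame="df_interactions")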

h2o.h2o.lazy_import(path)[source]

Import a single file or collection of files.

Parameters:

path : str

A path to a data file (remote or local).

h2o.h2o.list_timezones()[source]

Get a list of all the timezones

Returns:

The time zones (as an H2OFrame)

h2o.h2o.load_model(path)[source]

Load a saved H2O model from disk.

Parameters:

path : str

The full path of the H2O Model to be imported.

Returns:

H2OEstimator object

Examples

>>> path = h2o.save_model(my_model, path=my_path)
>>> h2o.load_model(path)

h2o.h2o.log_and_echo(message)[source]

Log a message on the server-side logs. This is helpful when running several pieces of work one after another on a single H2O cluster and you want to make a notation in the H2O server-side log where one piece of work ends and the next begins.

Sends a message to H2O for logging. Generally used for debugging purposes.

Parameters:

message : str

A character string with the message to write to the log.

h2o.h2o.ls()[source]

List Keys on an H2O Cluster

Returns:

A list of keys in the current H2O instance.

h2o.h2o.no_progress()[source]

Disable the progress bar from flushing to stdout. The completed progress bar is printed when a job is complete so as to demarcate a log file.

h2o.h2o.ou()[source]

Where is my baguette!?

Returns:

The name of the baguette. oh uhr uhr huhr

h2o.h2o.parse_raw(setup, id=None, first_line_is_header=(-1, 0, 1))[source]

Used in conjunction with lazy_import and parse_setup in order to make alterations before parsing.

Parameters:

setup : dict

Result of h2o.parse_setup

id : str, optional

An id for the frame.

first_line_is_header : int, optional

-1 means the first line is data, 0 means guess, 1 means the first line is the header.

Returns:

H2OFrame

h2o.h2o.parse_setup(raw_frames, destination_frame='', header=(-1, 0, 1), separator='', column_names=None, column_types=None, na_strings=None)[source]

During parse setup, the H2O cluster will make several guesses about the attributes of the data. This method allows a user to perform corrective measures by updating the returning dictionary from this method. This dictionary is then fed into parse_raw to produce the H2OFrame instance.

Parameters:

raw_frames : H2OFrame

A collection of imported file frames

destination_frame : str, optional

The unique hex key assigned to the imported file. If none is given, a key will automatically be generated.

header : int, optional

-1 means the first line is data, 0 means guess, 1 means first line is header.

separator : str, optional

The field separator character. Values on each line of the file are separated by this character. If separator = “”, the parser will automatically detect the separator.

column_names : list, optional

A list of column names for the file.

column_types : list or dict, optional

A list of types or a dictionary of column names to types to specify whether columns should be forced to a certain type upon import parsing. If a list, the types for elements that are None will be guessed. The possible types a column may have are:

“unknown” - this will force the column to be parsed as all NA
“uuid” - the values in the column must be true UUID or will be parsed as NA
“string” - force the column to be parsed as a string
“numeric” - force the column to be parsed as numeric. H2O will handle the compression of the numeric data in the optimal manner.
“enum” - force the column to be parsed as a categorical column.
“time” - force the column to be parsed as a time column. H2O will attempt to parse the following date/time formats: date - “yyyy-MM-dd”, “yyyy MM dd”, “dd-MMM-yy”, “dd MMM yy”; time - “HH:mm:ss”, “HH:mm:ss:SSS”, “HH:mm:ss:SSSnnnnnn”, “HH.mm.ss”, “HH.mm.ss.SSS”, “HH.mm.ss.SSSnnnnnn”. Times can also contain “AM” or “PM”.

na_strings : list or dict, optional

A list of strings, or a list of lists of strings (one list per column), or a dictionary of column names to strings which are to be interpreted as missing values.

Returns:

A dictionary is returned containing all of the guesses made by the H2O back end.
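
Examples

A minimal end-to-end sketch of the lazy_import / parse_setup / parse_raw workflow; the file path and frame id are illustrative:

>>> raw = h2o.lazy_import("/data/iris.csv")
>>> setup = h2o.parse_setup(raw)
>>> # inspect and, if needed, adjust the guessed settings in the setup dict here
>>> iris = h2o.parse_raw(setup, id="iris_hex", first_line_is_header=1)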

h2o.h2o.rapids(expr)[source]

Execute a Rapids expression.

Parameters:

expr : str

The rapids expression (ascii string).

Returns:

The JSON response (as a python dictionary) of the Rapids execution

h2o.h2o.remove(x)[source]

Remove object(s) from H2O.

Parameters:

x : H2OFrame, H2OEstimator, or basestring, or a list/tuple of those things.

The object(s) or unique id(s) pointing to the object(s) to be removed.
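
Examples

A minimal sketch, assuming an H2OFrame named temp_frame that is no longer needed; the string keys are hypothetical ids:

>>> h2o.remove(temp_frame)          # a single object
>>> h2o.remove(["key_1", "key_2"])  # or a list of keys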

h2o.h2o.remove_all()[source]

Remove all objects from H2O.

h2o.h2o.save_model(model, path='', force=False)[source]

Save an H2O Model Object to Disk.

Parameters:

model : H2OModel

The model object to save.

path : str

A path to save the model at (hdfs, s3, local)

force : bool

Overwrite destination directory in case it exists or throw exception if set to false.

Returns:

The path of the saved model (string)
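
Examples

A minimal sketch, assuming my_model is a trained H2O model; the directory is arbitrary:

>>> model_path = h2o.save_model(my_model, path="/tmp/models", force=True)
>>> loaded = h2o.load_model(model_path)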

h2o.h2o.set_timezone(tz)[source]

Set the Time Zone on the H2O Cloud

Parameters:

tz : str

The desired timezone.
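
Examples

A minimal sketch; any timezone name returned by h2o.list_timezones() can be used:

>>> h2o.list_timezones()
>>> h2o.set_timezone("America/Los_Angeles")
>>> h2o.get_timezone()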

h2o.h2o.show_progress()[source]

Enable the progress bar. (Progress bar is enabled by default).

h2o.h2o.shutdown(conn=None, prompt=True)[source]

Shut down the specified instance. All data will be lost. This method checks if H2O is running at the specified IP address and port, and if it is, shuts down that H2O instance.

Parameters:

conn : H2OConnection

An H2OConnection object containing the IP address and port of the server running H2O.

prompt : bool

A logical value indicating whether to prompt the user before shutting down the H2O server.
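
Examples

A minimal sketch; pass prompt=False to skip the interactive confirmation:

>>> h2o.shutdown(prompt=False)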

h2o.h2o.upload_file(path, destination_frame='', header=(-1, 0, 1), sep='', col_names=None, col_types=None, na_strings=None)[source]

Upload a dataset at the path given from the local machine to the H2O cluster. Does a single-threaded push to H2O. Also see import_file.

Parameters:

path : str

A path specifying the location of the data to upload.

destination_frame : str, optional

The unique hex key assigned to the imported file. If none is given, a key will automatically be generated.

header : int, optional

-1 means the first line is data, 0 means guess, 1 means first line is header.

sep : str, optional

The field separator character. Values on each line of the file are separated by this character. If sep = “”, the parser will automatically detect the separator.

col_names : list, optional

A list of column names for the file.

col_types : list or dict, optional

A list of types or a dictionary of column names to types to specify whether columns should be forced to a certain type upon import parsing. If a list, the types for elements that are None will be guessed. The possible types a column may have are:

“unknown” - this will force the column to be parsed as all NA
“uuid” - the values in the column must be true UUID or will be parsed as NA
“string” - force the column to be parsed as a string
“numeric” - force the column to be parsed as numeric. H2O will handle the compression of the numeric data in the optimal manner.
“enum” - force the column to be parsed as a categorical column.
“time” - force the column to be parsed as a time column. H2O will attempt to parse the following date/time formats: date - “yyyy-MM-dd”, “yyyy MM dd”, “dd-MMM-yy”, “dd MMM yy”; time - “HH:mm:ss”, “HH:mm:ss:SSS”, “HH:mm:ss:SSSnnnnnn”, “HH.mm.ss”, “HH.mm.ss.SSS”, “HH.mm.ss.SSSnnnnnn”. Times can also contain “AM” or “PM”.

na_strings : list or dict, optional

A list of strings, or a list of lists of strings (one list per column), or a dictionary of column names to strings which are to be interpreted as missing values.

Returns:

A new H2OFrame instance.

Examples

>>> import h2o as ml
>>> ml.upload_file(path="/path/to/local/data", destination_frame="my_local_data")
...