H2O Module
h2o – module for using H2O services.
h2o.connect(server=None, url=None, ip=None, port=None, https=None, verify_ssl_certificates=None, auth=None, proxy=None, cluster_id=None, cookies=None, verbose=True)
Connect to an existing H2O server, remote or local.
There are two ways to connect to a server: either pass a server parameter containing an instance of an H2OLocalServer, or specify ip and port of the server that you want to connect to.
Parameters: - server – An H2OLocalServer instance to connect to (optional).
- url – Full URL of the server to connect to (can be used instead of ip + port + https).
- ip – The ip address (or host name) of the server where H2O is running.
- port – Port number that H2O service is listening to.
- https – Set to True to connect via https:// instead of http://.
- verify_ssl_certificates – When using https, setting this to False will disable SSL certificates verification.
- auth – Either a (username, password) pair for basic authentication, or one of the requests.auth authenticator objects.
- proxy – Proxy server address.
- cluster_id – Name of the H2O cluster to connect to. This option is used from Steam only.
- cookies – Cookie (or list of cookies) to add to each request.
- verbose – Set to False to disable printing connection status messages.
Returns: the new H2OConnection object.
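As the parameter list notes, a full url can stand in for ip + port + https. The sketch below illustrates that equivalence. It assumes a server is already listening on localhost:54321 (a placeholder address), and build_url and connect_by_url are hypothetical helper names, not part of h2o.

```python
def build_url(ip, port, https=False):
    """Assemble the full URL that is equivalent to ip + port + https."""
    scheme = "https" if https else "http"
    return "%s://%s:%d" % (scheme, ip, port)


def connect_by_url(ip="localhost", port=54321, https=False):
    """Connect through the url parameter (requires a running H2O server)."""
    import h2o  # imported lazily so build_url works without h2o installed
    return h2o.connect(url=build_url(ip, port, https), verbose=False)
```

Calling connect_by_url() should behave like h2o.connect(ip="localhost", port=54321), provided a server is reachable at that address.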
h2o.init(url=None, ip=None, port=None, https=None, insecure=None, username=None, password=None, cluster_id=None, cookies=None, proxy=None, start_h2o=True, nthreads=-1, ice_root=None, enable_assertions=True, max_mem_size=None, min_mem_size=None, strict_version_check=None, **kwargs)
Attempt to connect to a local server, or if not successful start a new server and connect to it.
Parameters: - url – Full URL of the server to connect to (can be used instead of ip + port + https).
- ip – The ip address (or host name) of the server where H2O is running.
- port – Port number that H2O service is listening to.
- https – Set to True to connect via https:// instead of http://.
- insecure – When using https, setting this to True will disable SSL certificates verification.
- username – Username for basic authentication.
- password – Password for basic authentication.
- cluster_id – Name of the H2O cluster to connect to. This option is used from Steam only.
- cookies – Cookie (or list of) to add to each request.
- proxy – Proxy server address.
- start_h2o – If False, do not attempt to start an h2o server when connection to an existing one failed.
- nthreads – “Number of threads” option when launching a new h2o server.
- ice_root – Directory for temporary files for the new h2o server.
- enable_assertions – Enable assertions in Java for the new h2o server.
- max_mem_size – Maximum memory to use for the new h2o server.
- min_mem_size – Minimum memory to use for the new h2o server.
- strict_version_check – If True, an error will be raised if the client and server versions don’t match.
- kwargs – (all other deprecated attributes)
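The start_h2o flag decides whether a failed connection attempt falls back to launching a new server. A minimal sketch of the two modes follows. The address, the thread count, and the names ATTACH_ONLY, ATTACH_OR_START and init_cluster are illustrative assumptions.

```python
# Attach to an existing server only; never launch a new one.
ATTACH_ONLY = dict(ip="localhost", port=54321, start_h2o=False)

# Attach if possible, otherwise launch a local server with 2 threads.
ATTACH_OR_START = dict(ip="localhost", port=54321, start_h2o=True,
                       nthreads=2, strict_version_check=True)


def init_cluster(allow_start=True):
    """Call h2o.init() with one of the two option sets above."""
    import h2o  # imported lazily; requires the h2o package
    options = ATTACH_OR_START if allow_start else ATTACH_ONLY
    h2o.init(**options)
```

The launch-related options (nthreads, ice_root, enable_assertions, max_mem_size, min_mem_size) are described above as applying to a new server, so they are left out of the attach-only set.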
h2o.api(endpoint, data=None, json=None, filename=None, save_to=None)
Perform a REST API request to a previously connected server.
This function is mostly for internal purposes, but may occasionally be useful for direct access to the backend H2O server. It has the same parameters as H2OConnection.request.
h2o.upload_file(path, destination_frame=None, header=0, sep=None, col_names=None, col_types=None, na_strings=None)
Upload a dataset from the provided local path to the H2O cluster.
Does a single-threaded push to H2O. Also see import_file().
Parameters: - path – A path specifying the location of the data to upload.
- destination_frame – The unique hex key assigned to the imported file. If none is given, a key will be automatically generated.
- header – -1 means the first line is data, 0 means guess, 1 means first line is header.
- sep – The field separator character. Values on each line of the file are separated by this character. If not provided, the parser will automatically detect the separator.
- col_names – A list of column names for the file.
- col_types –
A list of types or a dictionary of column names to types to specify whether columns should be forced to a certain type upon import parsing. If a list, the types for elements that are None will be guessed. The possible types a column may have are:
- “unknown” - this will force the column to be parsed as all NA
- “uuid” - the values in the column must be true UUID or will be parsed as NA
- “string” - force the column to be parsed as a string
- “numeric” - force the column to be parsed as numeric. H2O will handle the compression of the numeric data in the optimal manner.
- “enum” - force the column to be parsed as a categorical column.
- “time” - force the column to be parsed as a time column. H2O will attempt to parse the following list of date time formats: (date) “yyyy-MM-dd”, “yyyy MM dd”, “dd-MMM-yy”, “dd MMM yy”, (time) “HH:mm:ss”, “HH:mm:ss:SSS”, “HH:mm:ss:SSSnnnnnn”, “HH.mm.ss”, “HH.mm.ss.SSS”, “HH.mm.ss.SSSnnnnnn”. Times can also contain “AM” or “PM”.
- na_strings – A list of strings, or a list of lists of strings (one list per column), or a dictionary of column names to strings which are to be interpreted as missing values.
Returns: a new H2OFrame instance.
Examples:
>>> frame = h2o.upload_file("/path/to/local/data")
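The col_types argument accepts only the type names listed above. The sketch below checks a requested mapping against that list before uploading. The column names, the na_strings values, and the helpers check_col_types and upload_with_types are hypothetical. Treating None entries as "let the parser guess" is an assumption about list-style col_types.

```python
ALLOWED_TYPES = {"unknown", "uuid", "string", "numeric", "enum", "time"}


def check_col_types(col_types):
    """Return col_types unchanged if every requested type is a listed one."""
    values = col_types.values() if isinstance(col_types, dict) else col_types
    # None entries are skipped: they are assumed to mean "guess this column".
    bad = sorted(set(t for t in values if t is not None and t not in ALLOWED_TYPES))
    if bad:
        raise ValueError("unsupported column types: %s" % ", ".join(bad))
    return col_types


def upload_with_types(path):
    """Upload a local file, forcing three (hypothetical) columns to given types."""
    import h2o  # imported lazily; requires a connected H2O cluster
    col_types = check_col_types(
        {"id": "uuid", "species": "enum", "measured_at": "time"})
    return h2o.upload_file(path, header=1, col_types=col_types,
                           na_strings=["NA", "?"])
```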
h2o.lazy_import(path, pattern=None)
Import a single file or collection of files.
Parameters: - path – A path to a data file (remote or local).
- pattern – Character string containing a regular expression to match file(s) in the folder.
Returns: either an H2OFrame with the content of the provided file, or a list of such frames if importing multiple files.
h2o.import_file(path=None, destination_frame=None, parse=True, header=0, sep=None, col_names=None, col_types=None, na_strings=None, pattern=None)
Import a dataset that is already on the cluster.
The path to the data must be a valid path for each node in the H2O cluster. If some node in the H2O cluster cannot see the file, then an exception will be thrown by the H2O cluster. Does a parallel/distributed multi-threaded pull of the data. The main difference between this method and upload_file() is that the latter works with local files, whereas this method imports remote files (i.e. files local to the server). If you are running the H2O server on your own machine, then both methods behave the same.
Parameters: - path – Path(s) specifying the location of the data to import, or a path to a directory of files to import.
- destination_frame – The unique hex key assigned to the imported file. If none is given, a key will be automatically generated.
- parse – If True, the file should be parsed after import.
- header – -1 means the first line is data, 0 means guess, 1 means first line is header.
- sep – The field separator character. Values on each line of the file are separated by this character. If not provided, the parser will automatically detect the separator.
- col_names – A list of column names for the file.
- col_types –
A list of types or a dictionary of column names to types to specify whether columns should be forced to a certain type upon import parsing. If a list, the types for elements that are None will be guessed. The possible types a column may have are:
- “unknown” - this will force the column to be parsed as all NA
- “uuid” - the values in the column must be true UUID or will be parsed as NA
- “string” - force the column to be parsed as a string
- “numeric” - force the column to be parsed as numeric. H2O will handle the compression of the numeric data in the optimal manner.
- “enum” - force the column to be parsed as a categorical column.
- “time” - force the column to be parsed as a time column. H2O will attempt to parse the following list of date time formats: (date) “yyyy-MM-dd”, “yyyy MM dd”, “dd-MMM-yy”, “dd MMM yy”, (time) “HH:mm:ss”, “HH:mm:ss:SSS”, “HH:mm:ss:SSSnnnnnn”, “HH.mm.ss”, “HH.mm.ss.SSS”, “HH.mm.ss.SSSnnnnnn”. Times can also contain “AM” or “PM”.
- na_strings – A list of strings, or a list of lists of strings (one list per column), or a dictionary of column names to strings which are to be interpreted as missing values.
- pattern – Character string containing a regular expression to match file(s) in the folder if path is a directory.
Returns: a new H2OFrame instance.
Examples:
>>> # Single file import
>>> iris = h2o.import_file("h2o-3/smalldata/iris.csv")
>>> # Return all files in the folder iris/ matching the regex r"iris_.*\.csv"
>>> iris_pattern = h2o.import_file(path="h2o-3/smalldata/iris",
...                                pattern=r"iris_.*\.csv")
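Before importing a directory, it can help to preview which file names a pattern would select. The sketch below does this locally with Python's re module, which is only a stand-in: the server applies its own matching rules, and full-name matching is an assumption here. The file names and the helper preview_matches are hypothetical.

```python
import re


def preview_matches(filenames, pattern):
    """Return the file names that the regular expression matches in full."""
    regex = re.compile(pattern)
    return [name for name in filenames if regex.fullmatch(name)]


files = ["iris_train.csv", "iris_test.csv", "iris.csv", "iris_notes.txt"]
matched = preview_matches(files, r"iris_.*\.csv")
print(matched)
```

Writing the pattern as a raw string (r"...") keeps the backslash in \. from being interpreted by Python before it reaches the regex engine.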
h2o.import_sql_table(connection_url, table, username, password, columns=None, optimize=True)
Import SQL table to H2OFrame in memory.
Assumes that the SQL table is not being updated and is stable. Runs multiple SELECT SQL queries concurrently for parallel ingestion. Be sure to start the h2o.jar in the terminal with your downloaded JDBC driver in the classpath:
java -cp <path_to_h2o_jar>:<path_to_jdbc_driver_jar> water.H2OApp
Also see import_sql_select(). Currently supported SQL databases are MySQL, PostgreSQL, and MariaDB. Support for Oracle 12g and Microsoft SQL Server is forthcoming.
Parameters: - connection_url – URL of the SQL database connection as specified by the Java Database Connectivity (JDBC) Driver. For example, “jdbc:mysql://localhost:3306/menagerie?&useSSL=false”
- table – name of SQL table
- columns – a list of column names to import from SQL table. Default is to import all columns.
- username – username for SQL server
- password – password for SQL server
- optimize – optimize import of SQL table for faster imports. Experimental.
Returns: an H2OFrame containing data of the specified SQL table.
Examples:
>>> conn_url = "jdbc:mysql://172.16.2.178:3306/ingestSQL?&useSSL=false"
>>> table = "citibike20k"
>>> username = "root"
>>> password = "abc123"
>>> my_citibike_data = h2o.import_sql_table(conn_url, table, username, password)
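The connection_url follows the JDBC form shown in the parameter description. The sketch below assembles a MySQL URL in that form and restricts the import to selected columns. The helpers mysql_jdbc_url and import_columns are hypothetical names, and the host, port, and database are the placeholder values from the description above.

```python
def mysql_jdbc_url(host, port, database, use_ssl=False):
    """Build a MySQL JDBC URL in the form used by the documentation example."""
    ssl_flag = "true" if use_ssl else "false"
    return "jdbc:mysql://%s:%d/%s?&useSSL=%s" % (host, port, database, ssl_flag)


def import_columns(table, columns, username, password):
    """Import only the named columns of a table (requires a running cluster
    that was started with the JDBC driver on its classpath)."""
    import h2o  # imported lazily; requires the h2o package
    url = mysql_jdbc_url("localhost", 3306, "menagerie")
    return h2o.import_sql_table(url, table, username, password, columns=columns)
```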
h2o.import_sql_select(connection_url, select_query, username, password, optimize=True)
Import the SQL table that is the result of the specified SQL query to H2OFrame in memory.
Creates a temporary SQL table from the specified select_query. Runs multiple SELECT SQL queries on the temporary table concurrently for parallel ingestion, then drops the table. Be sure to start the h2o.jar in the terminal with your downloaded JDBC driver in the classpath:
java -cp <path_to_h2o_jar>:<path_to_jdbc_driver_jar> water.H2OApp
Also see h2o.import_sql_table. Currently supported SQL databases are MySQL, PostgreSQL, and MariaDB. Support for Oracle 12g and Microsoft SQL Server is forthcoming.
Parameters: - connection_url – URL of the SQL database connection as specified by the Java Database Connectivity (JDBC) Driver. For example, “jdbc:mysql://localhost:3306/menagerie?&useSSL=false”
- select_query – SQL query starting with SELECT that returns rows from one or more database tables.
- username – username for SQL server
- password – password for SQL server
- optimize – optimize import of SQL table for faster imports. Experimental.
Returns: an H2OFrame containing data of the specified SQL query.
Examples:
>>> conn_url = "jdbc:mysql://172.16.2.178:3306/ingestSQL?&useSSL=false"
>>> select_query = "SELECT bikeid from citibike20k"
>>> username = "root"
>>> password = "abc123"
>>> my_citibike_data = h2o.import_sql_select(conn_url, select_query,
...                                          username, password)
h2o.parse_setup(raw_frames, destination_frame=None, header=0, separator=None, column_names=None, column_types=None, na_strings=None)
Retrieve H2O’s best guess as to what the structure of the data file is.
During parse setup, the H2O cluster will make several guesses about the attributes of the data. This method allows a user to perform corrective measures by updating the dictionary returned by this method. This dictionary is then fed into parse_raw to produce the H2OFrame instance.
Parameters: - raw_frames – a collection of imported file frames
- destination_frame – The unique hex key assigned to the imported file. If none is given, a key will automatically be generated.
- header – -1 means the first line is data, 0 means guess, 1 means first line is header.
- separator – The field separator character. Values on each line of the file are separated by this character. If not provided, the parser will automatically detect the separator.
- column_names – A list of column names for the file.
- column_types –
A list of types or a dictionary of column names to types to specify whether columns should be forced to a certain type upon import parsing. If a list, the types for elements that are None will be guessed. The possible types a column may have are:
- “unknown” - this will force the column to be parsed as all NA
- “uuid” - the values in the column must be true UUID or will be parsed as NA
- “string” - force the column to be parsed as a string
- “numeric” - force the column to be parsed as numeric. H2O will handle the compression of the numeric data in the optimal manner.
- “enum” - force the column to be parsed as a categorical column.
- “time” - force the column to be parsed as a time column. H2O will attempt to parse the following list of date time formats: (date) “yyyy-MM-dd”, “yyyy MM dd”, “dd-MMM-yy”, “dd MMM yy”, (time) “HH:mm:ss”, “HH:mm:ss:SSS”, “HH:mm:ss:SSSnnnnnn”, “HH.mm.ss”, “HH.mm.ss.SSS”, “HH.mm.ss.SSSnnnnnn”. Times can also contain “AM” or “PM”.
- na_strings – A list of strings, or a list of lists of strings (one list per column), or a dictionary of column names to strings which are to be interpreted as missing values.
Returns: a dictionary containing parse parameters guessed by the H2O backend.
h2o.parse_raw(setup, id=None, first_line_is_header=0)
Parse dataset using the parse setup structure.
Parameters: - setup – Result of h2o.parse_setup().
- id – an id for the frame.
- first_line_is_header – -1, 0, 1 if the first line is to be used as the header
Returns: an H2OFrame object.
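The two-step flow of parse_setup followed by parse_raw can be sketched as below. The helper names are hypothetical. The sketch assumes that the setup dictionary holds parallel "column_names" and "column_types" lists, and that the result of lazy_import can be passed as raw_frames. Both assumptions should be checked against the dictionary your server actually returns.

```python
def override_column_types(setup, overrides):
    """Apply {column name: type} overrides to a parse-setup dictionary in place."""
    names = setup["column_names"]  # assumed key, parallel to "column_types"
    for name, new_type in overrides.items():
        setup["column_types"][names.index(name)] = new_type
    return setup


def import_with_overrides(path, overrides):
    """Import a file, correcting H2O's guessed column types before parsing."""
    import h2o  # imported lazily; requires a connected H2O cluster
    raw = h2o.lazy_import(path)      # unparsed import
    setup = h2o.parse_setup(raw)     # H2O's best guess at the structure
    return h2o.parse_raw(override_column_types(setup, overrides))
```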
h2o.assign(data, xid)
(internal) Assign new id to the frame.
Parameters: - data – an H2OFrame whose id should be changed
- xid – new id for the frame.
Returns: the passed frame.
h2o.deep_copy(data, xid)
Create a deep clone of the frame data.
Parameters: - data – an H2OFrame to be cloned
- xid – (internal) id to be assigned to the new frame.
Returns: new H2OFrame which is the clone of the passed frame.
h2o.get_model(model_id)
Load a model from the server.
Parameters: model_id – The model identification in H2O.
Returns: Model object, a subclass of H2OEstimator.
h2o.get_grid(grid_id)
Return the specified grid.
Parameters: grid_id – The grid identification in H2O.
Returns: an H2OGridSearch instance.
h2o.get_frame(frame_id)
Obtain a handle to the frame in H2O with the frame_id key.
Parameters: frame_id (str) – id of the frame to retrieve.
Returns: an H2OFrame object.
h2o.no_progress()
Disable the progress bar from flushing to stdout.
The completed progress bar is printed when a job is complete so as to demarcate a log file.
h2o.log_and_echo(message=u'')
Log a message on the server-side logs.
This is helpful when running several pieces of work one after the other on a single H2O cluster and you want to note in the H2O server-side log where one piece of work ends and the next begins.
Sends a message to H2O for logging. Generally used for debugging purposes.
Parameters: message – message to write to the log.
h2o.remove(x)
Remove object(s) from H2O.
Parameters: x – H2OFrame, H2OEstimator, or string, or a list of those things: the object(s) or unique id(s) pointing to the object(s) to be removed.
h2o.rapids(expr)
Execute a Rapids expression.
Parameters: expr – The rapids expression (ascii string).
Returns: The JSON response (as a python dictionary) of the Rapids execution.
h2o.frame(frame_id)
Retrieve metadata for an id that points to a Frame.
Parameters: frame_id – the key of a Frame in H2O.
Returns: dict containing the frame meta-information.
h2o.download_pojo(model, path=u'', get_jar=True)
Download the POJO for this model to the directory specified by path; if path is “”, then dump to screen.
Parameters: - model – the model whose scoring POJO should be retrieved.
- path – an absolute path to the directory where POJO should be saved.
- get_jar – retrieve the h2o-genmodel.jar also (will be saved to the same folder path).
Returns: location of the downloaded POJO file.
h2o.download_csv(data, filename)
Download an H2O data set to a CSV file on the local disk.
Warning: Files located on the H2O server may be very large! Make sure you have enough hard drive space to accommodate the entire file.
Parameters: - data – an H2OFrame object to be downloaded.
- filename – name for the CSV file where the data should be saved to.
h2o.download_all_logs(dirname=u'.', filename=None)
Download H2O log files to disk.
Parameters: - dirname – a character string indicating the directory that the log file should be saved in.
- filename – a string indicating the name to give the downloaded log file.
Returns: path of logs written.
h2o.save_model(model, path=u'', force=False)
Save an H2O Model object to disk.
Parameters: - model – The model object to save.
- path – a path to save the model at (hdfs, s3, local)
- force – if True overwrite destination directory in case it exists, or throw exception if set to False.
Returns: the path of the saved model
h2o.load_model(path)
Load a saved H2O model from disk.
Parameters: path – the full path of the H2O Model to be imported.
Returns: an H2OEstimator object.
Examples:
>>> path = h2o.save_model(my_model, path=my_path)
>>> h2o.load_model(path)
h2o.export_file(frame, path, force=False, parts=1)
Export a given H2OFrame to a path on the machine this python session is currently connected to.
Parameters: - frame – the Frame to save to disk.
- path – the path to the save point on disk.
- force – if True, overwrite any preexisting file with the same path
- parts – enables export to multiple ‘part’ files instead of just a single file. Convenient for large datasets that take too long to store in a single file. Use parts=-1 to instruct H2O to determine the optimal number of part files, or specify your desired maximum number of part files. When exporting to multiple files, path must be a directory, and that directory must be empty. Default is parts=1, which exports to a single file.
h2o.create_frame(frame_id=None, rows=10000, cols=10, randomize=True, real_fraction=None, categorical_fraction=None, integer_fraction=None, binary_fraction=None, time_fraction=None, string_fraction=None, value=0, real_range=100, factors=100, integer_range=100, binary_ones_fraction=0.02, missing_fraction=0.01, has_response=False, response_factors=2, positive_response=False, seed=None, seed_for_column_types=None)
Create a new frame with random data.
Creates a data frame in H2O with real-valued, categorical, integer, and binary columns specified by the user.
Parameters: - frame_id – the destination key. If empty, this will be auto-generated.
- rows – the number of rows of data to generate.
- cols – the number of columns of data to generate. Excludes the response column if has_response is True.
- randomize – If True, data values will be randomly generated. This must be True if either categorical_fraction or integer_fraction is non-zero.
- value – if randomize is False, then all real-valued entries will be set to this value.
- real_range – the range of randomly generated real values.
- real_fraction – the fraction of columns that are real-valued.
- categorical_fraction – the fraction of total columns that are categorical.
- factors – the number of (unique) factor levels in each categorical column.
- integer_fraction – the fraction of total columns that are integer-valued.
- integer_range – the range of randomly generated integer values.
- binary_fraction – the fraction of total columns that are binary-valued.
- binary_ones_fraction – the fraction of values in a binary column that are set to 1.
- time_fraction – the fraction of randomly created date/time columns.
- string_fraction – the fraction of randomly created string columns.
- missing_fraction – the fraction of total entries in the data frame that are set to NA.
- has_response – A logical value indicating whether an additional response column should be prepended to the final H2O data frame. If set to True, the total number of columns will be cols + 1.
- response_factors – if has_response is True, then this variable controls the type of the “response” column: setting response_factors to 1 will generate a real-valued response, any value greater than or equal to 2 will create a categorical response with that many categories.
- positive_response – when the response variable is present and of real type, this controls whether it contains positive values only, or both positive and negative.
- seed – a seed used to generate random values when randomize is True.
- seed_for_column_types – a seed used to generate random column types when randomize is True.
Returns: an H2OFrame object.
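The interplay between cols and has_response can be made concrete with a small sketch. The helper names and the specific fractions are illustrative assumptions. The four column-type fractions are chosen to sum to 1, which is assumed here to be an acceptable combination.

```python
def expected_shape(rows, cols, has_response=False):
    """Shape implied by the parameter descriptions: the response column is extra."""
    return (rows, cols + 1 if has_response else cols)


def make_demo_frame():
    """Create a small random frame with a categorical response column."""
    import h2o  # imported lazily; requires a connected H2O cluster
    return h2o.create_frame(rows=100, cols=4,
                            real_fraction=0.25, categorical_fraction=0.25,
                            integer_fraction=0.25, binary_fraction=0.25,
                            time_fraction=0, string_fraction=0,
                            missing_fraction=0.0,
                            has_response=True, response_factors=2,
                            seed=42)
```

Under the description above, make_demo_frame() should produce a frame whose dimensions match expected_shape(100, 4, has_response=True).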
h2o.interaction(data, factors, pairwise, max_factors, min_occurrence, destination_frame=None)
Categorical Interaction Feature Creation in H2O.
Creates a frame in H2O with n-th order interaction features between categorical columns, as specified by the user.
Parameters: - data – the H2OFrame that holds the target categorical columns.
- factors – factor columns (either indices or column names).
- pairwise – If True, create pairwise interactions between factors (otherwise create one higher-order interaction). Only applicable if there are 3 or more factors.
- max_factors – Max. number of factor levels in pair-wise interaction terms (if enforced, one extra catch-all factor will be made).
- min_occurrence – Min. occurrence threshold for factor levels in pair-wise interaction terms
- destination_frame – a string indicating the destination key. If empty, this will be auto-generated by H2O.
Returns: H2OFrame
h2o.as_list(data, use_pandas=True, header=True)
Convert an H2O data object into a python-specific object.
WARNING! This will pull all data local!
If Pandas is available (and use_pandas is True), then pandas will be used to parse the data frame. Otherwise, a list-of-lists populated by character data will be returned (so the types of data will all be str).
Parameters: - data – an H2O data object.
- use_pandas – If True, try to use pandas for reading in the data.
- header – If True, return column names as first element in list
Returns: List of lists (Rows x Columns).
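When pandas is not used, every value comes back as a string and, with header=True, the column names form the first row. The sketch below shows one way to post-process such a result. The sample table is made-up data standing in for the return value of h2o.as_list(frame, use_pandas=False), and split_header and to_floats are hypothetical helpers.

```python
def split_header(table):
    """Separate the header row from the data rows of a list-of-lists result."""
    return table[0], table[1:]


def to_floats(rows, column):
    """Convert one column of string values to floats."""
    return [float(row[column]) for row in rows]


# Made-up stand-in for h2o.as_list(frame, use_pandas=False, header=True).
table = [["sepal_len", "species"], ["5.1", "setosa"], ["4.9", "setosa"]]
header, rows = split_header(table)
lengths = to_floats(rows, header.index("sepal_len"))
print(header, lengths)
```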
h2o.demo(funcname, interactive=True, echo=True, test=False)
H2O built-in demo facility.
Parameters: - funcname – A string that identifies the h2o python function to demonstrate.
- interactive – If True, the user will be prompted to continue the demonstration after every segment.
- echo – If True, the python commands that are executed will be displayed.
- test – If True, h2o.init() will not be called (used for pyunit testing).
Example:
>>> import h2o
>>> h2o.demo("gbm")