H2O Module
h2o – module for using H2O services.
h2o.api(endpoint, data=None, json=None, filename=None, save_to=None)
Perform a REST API request to a previously connected server.
This function is mostly for internal purposes, but may occasionally be useful for direct access to the backend H2O server. It has the same parameters as H2OConnection.request.
The list of available endpoints can be obtained using:
endpoints = [' '.join([r.http_method, r.url_pattern]) for r in h2o.api("GET /3/Metadata/endpoints").routes]
For each route, the available parameters (passed as data or json) can be obtained using:
parameters = {f.name: f.help for f in h2o.api("GET /3/Metadata/schemas/{route.input_schema}").fields}
- Examples
>>> res = h2o.api("GET /3/NetworkTest")
>>> res["table"].show()
h2o.as_list(data, use_pandas=True, header=True)
Convert an H2O data object into a python-specific object.
WARNING! This will pull all data local!
If Pandas is available (and use_pandas is True), then pandas will be used to parse the data frame. Otherwise, a list-of-lists populated by character data will be returned (so the types of data will all be str).
- Parameters
data – an H2O data object.
use_pandas – If True, try to use pandas for reading in the data.
header – If True, return column names as the first element in the list.
- Returns
List of lists (Rows x Columns).
- Examples
>>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
>>> from h2o.utils.typechecks import assert_is_type
>>> res1 = h2o.as_list(iris, use_pandas=False)
>>> assert_is_type(res1, list)
>>> res1 = list(zip(*res1))
>>> assert (abs(float(res1[0][9]) - 4.4) < 1e-10 and abs(float(res1[1][9]) - 2.9) < 1e-10 and
...         abs(float(res1[2][9]) - 1.4) < 1e-10), "incorrect values"
>>> res1
h2o.assign(data, xid)
(internal) Assign a new id to the frame.
- Parameters
data – an H2OFrame whose id should be changed
xid – new id for the frame.
- Returns
the passed frame.
- Examples
>>> old_name = "prostate.csv"
>>> new_name = "newProstate.csv"
>>> training_data = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip",
...                                 destination_frame=old_name)
>>> temp = h2o.assign(training_data, new_name)
h2o.cluster()
Return an H2OCluster object describing the backend H2O cluster.
- Examples
>>> import h2o
>>> h2o.init()
>>> h2o.cluster()
h2o.connect(server=None, url=None, ip=None, port=None, https=None, verify_ssl_certificates=None, cacert=None, auth=None, proxy=None, cookies=None, verbose=True, config=None, strict_version_check=False)
Connect to an existing H2O server, remote or local.
There are two ways to connect to a server: either pass a server parameter containing an instance of an H2OLocalServer, or specify ip and port of the server that you want to connect to.
- Parameters
server – An H2OLocalServer instance to connect to (optional).
url – Full URL of the server to connect to (can be used instead of ip + port + https).
ip – The ip address (or host name) of the server where H2O is running.
port – Port number that H2O service is listening to.
https – Set to True to connect via https:// instead of http://.
verify_ssl_certificates – When using https, setting this to False will disable SSL certificates verification.
cacert – Path to a CA bundle file or a directory with certificates of trusted CAs (optional).
auth – Either a (username, password) pair for basic authentication, an instance of h2o.auth.SpnegoAuth or one of the requests.auth authenticator objects.
proxy – Proxy server address.
cookies – Cookie (or list of) to add to each request.
verbose – Set to False to disable printing connection status messages.
config – Connection configuration object encapsulating connection parameters.
strict_version_check – If True, an error will be raised if the client and server versions don’t match.
- Returns
the new H2OConnection object.
- Examples
>>> import h2o
>>> ipA = "127.0.0.1"
>>> portN = "54321"
>>> urlS = "http://127.0.0.1:54321"
>>> connect_type = h2o.connect(ip=ipA, port=portN, verbose=True)
>>> # or
>>> connect_type2 = h2o.connect(url=urlS, https=True, verbose=True)
h2o.connection()
Return the current H2OConnection handler.
- Examples
>>> temp = h2o.connection()
>>> temp
h2o.create_frame(frame_id=None, rows=10000, cols=10, randomize=True, real_fraction=None, categorical_fraction=None, integer_fraction=None, binary_fraction=None, time_fraction=None, string_fraction=None, value=0, real_range=100, factors=100, integer_range=100, binary_ones_fraction=0.02, missing_fraction=0.01, has_response=False, response_factors=2, positive_response=False, seed=None, seed_for_column_types=None)
Create a new frame with random data.
Creates a data frame in H2O with real-valued, categorical, integer, and binary columns specified by the user.
- Parameters
frame_id – the destination key. If empty, this will be auto-generated.
rows – the number of rows of data to generate.
cols – the number of columns of data to generate. Excludes the response column if has_response is True.
randomize – If True, data values will be randomly generated. This must be True if either categorical_fraction or integer_fraction is non-zero.
value – if randomize is False, then all real-valued entries will be set to this value.
real_range – the range of randomly generated real values.
real_fraction – the fraction of columns that are real-valued.
categorical_fraction – the fraction of total columns that are categorical.
factors – the number of (unique) factor levels in each categorical column.
integer_fraction – the fraction of total columns that are integer-valued.
integer_range – the range of randomly generated integer values.
binary_fraction – the fraction of total columns that are binary-valued.
binary_ones_fraction – the fraction of values in a binary column that are set to 1.
time_fraction – the fraction of randomly created date/time columns.
string_fraction – the fraction of randomly created string columns.
missing_fraction – the fraction of total entries in the data frame that are set to NA.
has_response – A logical value indicating whether an additional response column should be prepended to the final H2O data frame. If set to True, the total number of columns will be cols + 1.
response_factors – if has_response is True, then this variable controls the type of the "response" column: setting response_factors to 1 will generate a real-valued response, while any value greater than or equal to 2 will create a categorical response with that many categories.
positive_response – when the response variable is present and of real type, this controls whether it contains positive values only, or both positive and negative.
seed – a seed used to generate random values when randomize is True.
seed_for_column_types – a seed used to generate random column types when randomize is True.
- Returns
an H2OFrame object
- Examples
>>> import random
>>> dataset_params = {}
>>> dataset_params['rows'] = random.sample(list(range(50,150)),1)[0]
>>> dataset_params['cols'] = random.sample(list(range(3,6)),1)[0]
>>> dataset_params['categorical_fraction'] = round(random.random(),1)
>>> left_over = (1 - dataset_params['categorical_fraction'])
>>> dataset_params['integer_fraction'] = \
...     round(left_over - round(random.uniform(0,left_over),1),1)
>>> if dataset_params['integer_fraction'] + dataset_params['categorical_fraction'] == 1:
...     if dataset_params['integer_fraction'] > dataset_params['categorical_fraction']:
...         dataset_params['integer_fraction'] = dataset_params['integer_fraction'] - 0.1
...     else:
...         dataset_params['categorical_fraction'] = dataset_params['categorical_fraction'] - 0.1
>>> dataset_params['missing_fraction'] = random.uniform(0,0.5)
>>> dataset_params['has_response'] = False
>>> dataset_params['randomize'] = True
>>> dataset_params['factors'] = random.randint(2,5)
>>> print("Dataset parameters: {0}".format(dataset_params))
>>> distribution = random.sample(['bernoulli','multinomial',
...                               'gaussian','poisson','gamma'], 1)[0]
>>> if distribution == 'bernoulli':
...     dataset_params['response_factors'] = 2
... elif distribution == 'gaussian':
...     dataset_params['response_factors'] = 1
... elif distribution == 'multinomial':
...     dataset_params['response_factors'] = random.randint(3,5)
... else:
...     dataset_params['has_response'] = False
>>> print("Distribution: {0}".format(distribution))
>>> train = h2o.create_frame(**dataset_params)
h2o.deep_copy(data, xid)
Create a deep clone of the frame data.
- Parameters
data – an H2OFrame to be cloned
xid – (internal) id to be assigned to the new frame.
- Returns
a new H2OFrame which is the clone of the passed frame.
- Examples
>>> training_data = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
>>> new_name = "new_frame"
>>> training_copy = h2o.deep_copy(training_data, new_name)
>>> training_copy
h2o.demo(funcname, interactive=True, echo=True, test=False)
H2O built-in demo facility.
- Parameters
funcname – A string that identifies the h2o python function to demonstrate.
interactive – If True, the user will be prompted to continue the demonstration after every segment.
echo – If True, the python commands that are executed will be displayed.
test – If True, h2o.init() will not be called (used for pyunit testing).
- Example
>>> import h2o
>>> h2o.demo("gbm")
h2o.download_all_logs(dirname='.', filename=None, container=None)
Download H2O log files to disk.
- Parameters
dirname – a character string indicating the directory that the log file should be saved in.
filename – a string indicating the name for the output file. Note that the default container format is .zip, so the file name must include the .zip extension.
container – a string indicating how to archive the logs, choice of "ZIP" (default) and "LOG":
ZIP: individual log files archived in a ZIP package
LOG: all log files will be concatenated together in one text file
- Returns
path of logs written in a zip file.
- Examples
The following code will save the zip file 'h2o_log.zip' in a directory named your_directory_name, one level down from the current working directory. (Note that your_directory_name should be replaced with the name of a directory that you have already created.)
>>> h2o.download_all_logs(dirname='./your_directory_name/', filename = 'h2o_log.zip')
h2o.download_csv(data, filename)
Download an H2O data set to a CSV file on the local disk.
Warning: Files located on the H2O server may be very large! Make sure you have enough hard drive space to accommodate the entire file.
- Parameters
data – an H2OFrame object to be downloaded.
filename – name for the CSV file where the data should be saved.
- Examples
>>> iris = h2o.load_dataset("iris")
>>> h2o.download_csv(iris, "iris_delete.csv")
>>> iris2 = h2o.import_file("iris_delete.csv")
h2o.download_model(model, path='', export_cross_validation_predictions=False, filename=None)
Download an H2O Model object to the machine this python session is currently connected to. The owner of the saved file is the user under which the python session was executed.
- Parameters
model – The model object to download.
path – a path to the directory where the model should be saved.
export_cross_validation_predictions – logical, indicates whether the exported model artifact should also include CV Holdout Frame predictions. Default is not to include the predictions.
filename – a filename for the saved model
- Returns
the path of the downloaded model
- Examples
>>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator
>>> h2o_df = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
>>> my_model = H2OGeneralizedLinearEstimator(family="binomial")
>>> my_model.train(y="CAPSULE",
...                x=["AGE", "RACE", "PSA", "GLEASON"],
...                training_frame=h2o_df)
>>> h2o.download_model(my_model, path='')
h2o.download_pojo(model, path='', get_jar=True, jar_name='')
Download the POJO for this model to the directory specified by path; if path is "", then dump to screen.
- Parameters
model – the model whose scoring POJO should be retrieved.
path – an absolute path to the directory where POJO should be saved.
get_jar – retrieve the h2o-genmodel.jar also (will be saved to the same folder path).
jar_name – Custom name of genmodel jar.
- Returns
location of the downloaded POJO file.
- Examples
>>> h2o_df = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
>>> h2o_df['CAPSULE'] = h2o_df['CAPSULE'].asfactor()
>>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator
>>> binomial_fit = H2OGeneralizedLinearEstimator(family="binomial")
>>> binomial_fit.train(y="CAPSULE",
...                    x=["AGE", "RACE", "PSA", "GLEASON"],
...                    training_frame=h2o_df)
>>> h2o.download_pojo(binomial_fit, path='', get_jar=False)
h2o.enable_expr_optimizations(flag)
Enable expression tree local optimizations.
- Examples
>>> h2o.enable_expr_optimizations(True)
h2o.estimate_cluster_mem(ncols, nrows, num_cols=0, string_cols=0, cat_cols=0, time_cols=0, uuid_cols=0)
Computes an estimate for cluster memory usage in GB.
Number of columns and number of rows are required. For a better estimate, you can provide counts of the different types of columns in the dataset.
- Parameters
ncols – (Required) total number of columns in a dataset. Integer, can't be negative.
nrows – (Required) total number of rows in a dataset. Integer, can't be negative.
num_cols – number of numeric columns in a dataset. Integer, can’t be negative.
string_cols – number of string columns in a dataset. Integer, can’t be negative.
cat_cols – number of categorical columns in a dataset. Integer, can’t be negative.
time_cols – number of time columns in a dataset. Integer, can’t be negative.
uuid_cols – number of uuid columns in a dataset. Integer, can’t be negative.
- Returns
A memory estimate in GB.
- Example
>>> from h2o import estimate_cluster_mem
>>> ### I will load a parquet file with 18 columns and 2 million lines
>>> estimate_cluster_mem(18, 2000000)
>>> ### I will load another parquet file with 16 columns and 2 million lines; I ask for a more precise estimate
>>> ### because I know 12 of the 16 columns are categorical and 1 of the 16 columns consists of uuids.
>>> estimate_cluster_mem(16, 2000000, cat_cols=12, uuid_cols=1)
>>> ### I will load a parquet file with 8 columns and 31 million lines; I ask for a more precise estimate
>>> ### because I know 4 of the 8 columns are categorical and 4 of the 8 columns consist of numbers.
>>> estimate_cluster_mem(ncols=8, nrows=31000000, cat_cols=4, num_cols=4)
h2o.explain(models, frame, columns=None, top_n_features=5, include_explanations='ALL', exclude_explanations=[], plot_overrides={}, figsize=(16, 9), render=True, qualitative_colormap='Dark2', sequential_colormap='RdYlBu_r')
Generate model explanations on frame data set.
The H2O Explainability Interface is a convenient wrapper to a number of explainability methods and visualizations in H2O. The function can be applied to a single model or group of models and returns an object containing explanations, such as a partial dependence plot or a variable importance plot. Most of the explanations are visual (plots). These plots can also be created by individual utility functions/methods.
- Parameters
models – a list of H2O models, an H2O AutoML instance, or an H2OFrame with a ‘model_id’ column (e.g. H2OAutoML leaderboard).
frame – H2OFrame.
columns – either a list of columns or column indices to show. If specified, the parameter top_n_features will be ignored.
top_n_features – a number of columns to pick using variable importance (where applicable).
include_explanations – if specified, return only the specified model explanations (mutually exclusive with exclude_explanations).
exclude_explanations – exclude specified model explanations.
plot_overrides – overrides for individual model explanations.
figsize – figure size; passed directly to matplotlib.
render – if True, render the model explanations; otherwise model explanations are just returned.
- Returns
H2OExplanation containing the model explanations including headers and descriptions.
- Examples
>>> import h2o
>>> from h2o.automl import H2OAutoML
>>>
>>> h2o.init()
>>>
>>> # Import the wine dataset into H2O:
>>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
>>> df = h2o.import_file(f)
>>>
>>> # Set the response
>>> response = "quality"
>>>
>>> # Split the dataset into a train and test set:
>>> train, test = df.split_frame([0.8])
>>>
>>> # Train an H2OAutoML
>>> aml = H2OAutoML(max_models=10)
>>> aml.train(y=response, training_frame=train)
>>>
>>> # Create the H2OAutoML explanation
>>> aml.explain(test)
>>>
>>> # Create the leader model explanation
>>> aml.leader.explain(test)
h2o.explain_row(models, frame, row_index, columns=None, top_n_features=5, include_explanations='ALL', exclude_explanations=[], plot_overrides={}, qualitative_colormap='Dark2', figsize=(16, 9), render=True)
Generate model explanations on frame data set for a given instance.
Explain the behavior of a model or group of models with respect to a single row of data. The function returns an object containing explanations, such as a partial dependence plot or a variable importance plot. Most of the explanations are visual (plots). These plots can also be created by individual utility functions/methods.
- Parameters
models – H2OAutoML object, supervised H2O model, or list of supervised H2O models.
frame – H2OFrame.
row_index – row index of the instance to inspect.
columns – either a list of columns or column indices to show. If specified, parameter top_n_features will be ignored.
top_n_features – a number of columns to pick using variable importance (where applicable).
include_explanations – if specified, return only the specified model explanations (mutually exclusive with exclude_explanations).
exclude_explanations – exclude specified model explanations.
plot_overrides – overrides for individual model explanations.
qualitative_colormap – a colormap name.
figsize – figure size; passed directly to matplotlib.
render – if True, render the model explanations; otherwise model explanations are just returned.
- Returns
H2OExplanation containing the model explanations including headers and descriptions.
- Examples
>>> import h2o
>>> from h2o.automl import H2OAutoML
>>>
>>> h2o.init()
>>>
>>> # Import the wine dataset into H2O:
>>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
>>> df = h2o.import_file(f)
>>>
>>> # Set the response
>>> response = "quality"
>>>
>>> # Split the dataset into a train and test set:
>>> train, test = df.split_frame([0.8])
>>>
>>> # Train an H2OAutoML
>>> aml = H2OAutoML(max_models=10)
>>> aml.train(y=response, training_frame=train)
>>>
>>> # Create the H2OAutoML explanation
>>> aml.explain_row(test, row_index=0)
>>>
>>> # Create the leader model explanation
>>> aml.leader.explain_row(test, row_index=0)
h2o.export_file(frame, path, force=False, sep=',', compression=None, parts=1, header=True, quote_header=True, parallel=False, format='csv')
Export a given H2OFrame to a path on the machine this python session is currently connected to.
- Parameters
frame – the Frame to save to disk.
path – the path to the save point on disk.
force – if True, overwrite any preexisting file with the same path.
sep – field delimiter for the output file.
compression – how to compress the exported dataset (default none; gzip, bzip2 and snappy available)
parts – enables export to multiple 'part' files instead of just a single file. Convenient for large datasets that take too long to store in a single file. Use parts = -1 to instruct H2O to determine the optimal number of part files, or specify your desired maximum number of part files. The path needs to be a directory when exporting to multiple files, and that directory must be empty. Default is parts = 1, which exports to a single file.
header – if True, write out column names in the header line.
quote_header – if True, quote column names in the header.
parallel – use a parallel export to a single file (doesn’t apply when num_parts != 1, might create temporary files in the destination directory).
format – one of ‘csv’ or ‘parquet’. Defaults to ‘csv’. Export to parquet is multipart and H2O itself determines the optimal number of files (1 file per chunk).
- Examples
>>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator
>>> h2o_df = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/prostate/prostate.csv")
>>> h2o_df['CAPSULE'] = h2o_df['CAPSULE'].asfactor()
>>> rand_vec = h2o_df.runif(1234)
>>> train = h2o_df[rand_vec <= 0.8]
>>> valid = h2o_df[(rand_vec > 0.8) & (rand_vec <= 0.9)]
>>> test = h2o_df[rand_vec > 0.9]
>>> binomial_fit = H2OGeneralizedLinearEstimator(family="binomial")
>>> binomial_fit.train(y="CAPSULE",
...                    x=["AGE", "RACE", "PSA", "GLEASON"],
...                    training_frame=train, validation_frame=valid)
>>> pred = binomial_fit.predict(test)
>>> h2o.export_file(pred, "/tmp/pred.csv", force=True)
h2o.flow()
Open H2O Flow in your browser.
- Examples
>>> import h2o
>>> h2o.init()
>>> h2o.flow()
h2o.frames()
Retrieve all the Frames.
- Returns
Meta information on the frames
- Examples
>>> arrestsH2O = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/USArrests.csv")
>>> h2o.frames()
h2o.get_frame(frame_id, **kwargs)
Obtain a handle to the frame in H2O with the frame_id key.
- Parameters
frame_id (str) – id of the frame to retrieve.
- Returns
an H2OFrame object
- Examples
>>> from h2o.frame import H2OFrame
>>> frame1 = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> frame2 = h2o.get_frame(frame1.frame_id)
h2o.get_grid(grid_id)
Return the specified grid.
- Parameters
grid_id – The grid identification in h2o
- Returns
an H2OGridSearch instance.
- Examples
>>> from h2o.grid.grid_search import H2OGridSearch
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> x = ["DayofMonth", "Month"]
>>> hyper_parameters = {'learn_rate':[0.1,0.2],
...                     'max_depth':[2,3],
...                     'ntrees':[5,10]}
>>> search_crit = {'strategy': "RandomDiscrete",
...                'max_models': 5,
...                'seed': 1234,
...                'stopping_metric': "AUTO",
...                'stopping_tolerance': 1e-2}
>>> air_grid = H2OGridSearch(H2OGradientBoostingEstimator,
...                          hyper_params=hyper_parameters,
...                          search_criteria=search_crit)
>>> air_grid.train(x=x,
...                y="IsDepDelayed",
...                training_frame=airlines,
...                distribution="bernoulli")
>>> fetched_grid = h2o.get_grid(str(air_grid.grid_id))
>>> fetched_grid
h2o.get_model(model_id)
Load a model from the server.
- Parameters
model_id – The model identification in H2O
- Returns
Model object, a subclass of H2OEstimator
- Examples
>>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator
>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> model = H2OGeneralizedLinearEstimator(family="binomial",
...                                       alpha=0,
...                                       Lambda=1e-5)
>>> model.train(x=predictors,
...             y=response,
...             training_frame=airlines)
>>> model2 = h2o.get_model(model.model_id)
h2o.import_file(path=None, destination_frame=None, parse=True, header=0, sep=None, col_names=None, col_types=None, na_strings=None, pattern=None, skipped_columns=None, custom_non_data_line_markers=None, partition_by=None, quotechar=None, escapechar=None)
Import a dataset that is already on the cluster.
The path to the data must be a valid path for each node in the H2O cluster. If some node in the H2O cluster cannot see the file, then an exception will be thrown by the H2O cluster. Does a parallel/distributed multi-threaded pull of the data. The main difference between this method and upload_file() is that the latter works with local files, whereas this method imports remote files (i.e. files local to the server). If you are running the H2O server on your own machine, then both methods behave the same.
- Parameters
path – path(s) specifying the location of the data to import or a path to a directory of files to import
destination_frame – The unique hex key assigned to the imported file. If none is given, a key will be automatically generated.
parse – If True, the file should be parsed after import. If False, then a list is returned containing the file path.
header – -1 means the first line is data, 0 means guess, 1 means first line is header.
sep – The field separator character. Values on each line of the file are separated by this character. If not provided, the parser will automatically detect the separator.
col_names – A list of column names for the file.
col_types – A list of types or a dictionary of column names to types to specify whether columns should be forced to a certain type upon import parsing. If a list, the types for elements that are None will be guessed. The possible types a column may have are:
"unknown" - this will force the column to be parsed as all NA
"uuid" - the values in the column must be true UUID or will be parsed as NA
"string" - force the column to be parsed as a string
"numeric" - force the column to be parsed as numeric. H2O will handle the compression of the numeric data in the optimal manner.
"enum" - force the column to be parsed as a categorical column.
"time" - force the column to be parsed as a time column. H2O will attempt to parse the following list of date time formats:
"yyyy-MM-dd" (date),
"yyyy MM dd" (date),
"dd-MMM-yy" (date),
"dd MMM yy" (date),
"HH:mm:ss" (time),
"HH:mm:ss:SSS" (time),
"HH:mm:ss:SSSnnnnnn" (time),
"HH.mm.ss" (time),
"HH.mm.ss.SSS" (time),
"HH.mm.ss.SSSnnnnnn" (time).
Times can also contain "AM" or "PM".
partition_by – Names of the columns the persisted dataset has been partitioned by.
na_strings – A list of strings, or a list of lists of strings (one list per column), or a dictionary of column names to strings which are to be interpreted as missing values.
pattern – Character string containing a regular expression to match file(s) in the folder if path is a directory.
skipped_columns – an integer list of column indices to skip, which will not be parsed into the final frame from the import file.
custom_non_data_line_markers – If a line in the imported file starts with any character in the given string, it will NOT be imported. An empty string means all lines are imported; None means the default behaviour for the given format will be used.
quotechar – A hint for the parser which character to expect as quoting character. Only single quote, double quote or None (default) are allowed. None means automatic detection.
escapechar – (Optional) One ASCII character used to escape other characters.
- Returns
a new H2OFrame instance.
- Examples
>>> birds = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/birds.csv")
h2o.import_hive_table(database=None, table=None, partitions=None, allow_multi_format=False)
Import Hive table to H2OFrame in memory.
Make sure to start H2O with Hive on the classpath. Uses hive-site.xml on the classpath to connect to Hive. When the database is specified as a JDBC URL, the Hive JDBC driver is used to obtain the table metadata; direct HDFS access is then used to import the data.
- Parameters
database – Name of Hive database (default database will be used by default), can be also a JDBC URL.
table – name of Hive table to import
partitions – a list of lists of strings - partition key column values of partitions you want to import.
allow_multi_format – enable import of partitioned tables with different storage formats used. WARNING: this may fail on out-of-memory for tables with a large number of small partitions.
- Returns
an H2OFrame containing data of the specified Hive table.
- Examples
>>> basic_import = h2o.import_hive_table("default",
...                                      "table_name")
>>> jdbc_import = h2o.import_hive_table("jdbc:hive2://hive-server:10000/default",
...                                     "table_name")
>>> multi_format_enabled = h2o.import_hive_table("default",
...                                              "table_name",
...                                              allow_multi_format=True)
>>> with_partition_filter = h2o.import_hive_table("jdbc:hive2://hive-server:10000/default",
...                                               "table_name",
...                                               [["2017", "02"]])
h2o.import_mojo(mojo_path, model_id=None)
Imports an existing MOJO model as an H2O model.
- Parameters
mojo_path – Path to the MOJO archive on H2O's filesystem
model_id – Model ID, default is None
- Returns
An H2OGenericEstimator instance embedding the given MOJO
- Examples
>>> import tempfile
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> model = H2OGradientBoostingEstimator(ntrees=1)
>>> model.train(x=["Origin", "Dest"],
...             y="IsDepDelayed",
...             training_frame=airlines)
>>> original_model_filename = tempfile.mkdtemp()
>>> original_model_filename = model.download_mojo(original_model_filename)
>>> mojo_model = h2o.import_mojo(original_model_filename)
h2o.import_sql_select(connection_url, select_query, username, password, optimize=True, use_temp_table=None, temp_table_name=None, fetch_mode=None, num_chunks_hint=None)
Import the SQL table that is the result of the specified SQL query to H2OFrame in memory.
Creates a temporary SQL table from the specified sql_query. Runs multiple SELECT SQL queries on the temporary table concurrently for parallel ingestion, then drops the table. Be sure to start the h2o.jar in the terminal with your downloaded JDBC driver in the classpath:
java -cp <path_to_h2o_jar>:<path_to_jdbc_driver_jar> water.H2OApp
Also see h2o.import_sql_table. Currently supported SQL databases are MySQL, PostgreSQL, MariaDB, Hive, Oracle and Microsoft SQL Server.
- Parameters
connection_url – URL of the SQL database connection as specified by the Java Database Connectivity (JDBC) Driver. For example, “jdbc:mysql://localhost:3306/menagerie?&useSSL=false”
select_query – SQL query starting with SELECT that returns rows from one or more database tables.
username – username for SQL server
password – password for SQL server
optimize – DEPRECATED. Ignored - use fetch_mode instead. Optimize import of SQL table for faster imports.
use_temp_table – whether a temporary table should be created from select_query
temp_table_name – name of temporary table to be created from select_query
fetch_mode – Set to DISTRIBUTED to enable distributed import. Set to SINGLE to force a sequential read by a single node from the database.
num_chunks_hint – Desired number of chunks for the target Frame.
- Returns
an H2OFrame containing data of the specified SQL query.
- Examples
>>> conn_url = "jdbc:mysql://172.16.2.178:3306/ingestSQL?&useSSL=false"
>>> select_query = "SELECT bikeid from citibike20k"
>>> username = "root"
>>> password = "abc123"
>>> my_citibike_data = h2o.import_sql_select(conn_url, select_query,
...                                          username, password)
h2o.import_sql_table(connection_url, table, username, password, columns=None, optimize=True, fetch_mode=None, num_chunks_hint=None)
Import SQL table to H2OFrame in memory.
Assumes that the SQL table is not being updated and is stable. Runs multiple SELECT SQL queries concurrently for parallel ingestion. Be sure to start the h2o.jar in the terminal with your downloaded JDBC driver in the classpath:
java -cp <path_to_h2o_jar>:<path_to_jdbc_driver_jar> water.H2OApp
Also see import_sql_select(). Currently supported SQL databases are MySQL, PostgreSQL, MariaDB, Hive, Oracle and Microsoft SQL Server.
- Parameters
connection_url – URL of the SQL database connection as specified by the Java Database Connectivity (JDBC) Driver. For example: "jdbc:mysql://localhost:3306/menagerie?&useSSL=false"
table – name of SQL table
columns – a list of column names to import from SQL table. Default is to import all columns.
username – username for SQL server
password – password for SQL server
optimize – DEPRECATED. Ignored - use fetch_mode instead. Optimize import of SQL table for faster imports.
fetch_mode – Set to DISTRIBUTED to enable distributed import. Set to SINGLE to force a sequential read by a single node from the database.
num_chunks_hint – Desired number of chunks for the target Frame.
- Returns
an H2OFrame containing data of the specified SQL table.
- Examples
>>> conn_url = "jdbc:mysql://172.16.2.178:3306/ingestSQL?&useSSL=false"
>>> table = "citibike20k"
>>> username = "root"
>>> password = "abc123"
>>> my_citibike_data = h2o.import_sql_table(conn_url, table, username, password)
h2o.init(url=None, ip=None, port=None, name=None, https=None, cacert=None, insecure=None, username=None, password=None, cookies=None, proxy=None, start_h2o=True, nthreads=-1, ice_root=None, log_dir=None, log_level=None, max_log_file_size=None, enable_assertions=True, max_mem_size=None, min_mem_size=None, strict_version_check=None, ignore_config=False, extra_classpath=None, jvm_custom_args=None, bind_to_localhost=True, **kwargs)
Attempt to connect to a local server, or if not successful, start a new server and connect to it.
- Parameters
url – Full URL of the server to connect to (can be used instead of ip + port + https).
ip – The ip address (or host name) of the server where H2O is running.
port – Port number that H2O service is listening to.
name – Cluster name. If None while connecting to an existing cluster, the cluster name will not be checked. If set, the client will connect only if the target cluster name matches. If no instance is found and a new local one is started, this will be used as the cluster name (a random name is generated if set to None).
https – Set to True to connect via https:// instead of http://.
cacert – Path to a CA bundle file or a directory with certificates of trusted CAs (optional).
insecure – When using https, setting this to True will disable SSL certificates verification.
username – Username and
password – Password for basic authentication.
cookies – Cookie (or list of) to add to each request.
proxy – Proxy server address.
start_h2o – If False, do not attempt to start an h2o server when connection to an existing one failed.
nthreads – “Number of threads” option when launching a new h2o server.
ice_root – Directory for temporary files for the new h2o server.
log_dir – Directory for H2O logs to be stored if a new instance is started. Ignored if connecting to an existing node.
log_level – The logger level for H2O if a new instance is started. One of:
TRACE
DEBUG
INFO
WARN
ERRR
FATA
Default is INFO. Ignored if connecting to an existing node.
max_log_file_size – Maximum size of INFO and DEBUG log files. The file is rolled over after a specified size has been reached. (The default is 3MB. Minimum is 1MB and maximum is 99999MB)
enable_assertions – Enable assertions in Java for the new h2o server.
max_mem_size – Maximum memory to use for the new h2o server. Integer input will be evaluated as gigabytes. Other units can be specified by passing in a string (e.g. "160M" for 160 megabytes).
Note: If max_mem_size is not defined, then the amount of memory that H2O allocates will be determined by the default memory of the Java Virtual Machine (JVM). This amount depends on the Java version, but it will generally be 25% of the machine's physical memory.
min_mem_size – Minimum memory to use for the new h2o server. Integer input will be evaluated as gigabytes. Other units can be specified by passing in a string (e.g. “160M” for 160 megabytes).
strict_version_check – If True, an error will be raised if the client and server versions don’t match.
ignore_config – Indicates whether processing of a .h2oconfig file should be conducted or not. Default value is False.
extra_classpath – List of paths to libraries that should be included on the Java classpath when starting H2O from Python.
kwargs – (all other deprecated attributes)
jvm_custom_args – Custom, user-defined arguments for the JVM H2O is instantiated in. Ignored if there is an instance of H2O already running and the client connects to it.
bind_to_localhost – A flag indicating whether access to the H2O instance should be restricted to the local machine (default) or if it can be reached from other computers on the network.
- Examples
>>> import h2o
>>> h2o.init(ip="localhost", port=54323)
h2o.interaction(data, factors, pairwise, max_factors, min_occurrence, destination_frame=None)
Categorical Interaction Feature Creation in H2O.
Creates a frame in H2O with n-th order interaction features between categorical columns, as specified by the user.
- Parameters
data – the H2OFrame that holds the target categorical columns.
factors – factor columns (either indices or column names).
pairwise – If True, create pairwise interactions between factors (otherwise create one higher-order interaction). Only applicable if there are 3 or more factors.
max_factors – Max. number of factor levels in pair-wise interaction terms (if enforced, one extra catch-all factor will be made).
min_occurrence – Min. occurrence threshold for factor levels in pair-wise interaction terms.
destination_frame – a string indicating the destination key. If empty, this will be auto-generated by H2O.
- Returns
an H2OFrame containing the computed interaction features.
- Examples
>>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
>>> iris = iris.cbind(iris[4] == "Iris-setosa")
>>> iris[5] = iris[5].asfactor()
>>> iris.set_name(5, "C6")
>>> iris = iris.cbind(iris[4] == "Iris-virginica")
>>> iris[6] = iris[6].asfactor()
>>> iris.set_name(6, name="C7")
>>> two_way_interactions = h2o.interaction(iris,
...                                        factors=[4,5,6],
...                                        pairwise=True,
...                                        max_factors=10000,
...                                        min_occurrence=1)
>>> from h2o.utils.typechecks import assert_is_type
>>> from h2o.frame import H2OFrame
>>> assert_is_type(two_way_interactions, H2OFrame)
>>> levels1 = two_way_interactions.levels()[0]
>>> levels2 = two_way_interactions.levels()[1]
>>> levels3 = two_way_interactions.levels()[2]
>>> two_way_interactions
h2o.is_expr_optimizations_enabled()
Check whether expression tree local optimizations are enabled.
- Examples
>>> h2o.enable_expr_optimizations(True)
>>> h2o.is_expr_optimizations_enabled()
>>> h2o.enable_expr_optimizations(False)
>>> h2o.is_expr_optimizations_enabled()
h2o.lazy_import(path, pattern=None)
Import a single file or collection of files.
- Parameters
path – A path to a data file (remote or local).
pattern – Character string containing a regular expression to match file(s) in the folder.
- Returns
either an H2OFrame with the content of the provided file, or a list of such frames if importing multiple files.
- Examples
>>> iris = h2o.lazy_import("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
h2o.load_dataset(relative_path)
Imports a data file within the 'h2o_data' folder.
- Examples
>>> fr = h2o.load_dataset("iris")
h2o.load_frame(frame_id, path, force=True)
Load a frame previously stored in H2O's native format.
This will load a data frame from a file-system location. Stored data can be loaded only by a cluster of the same size and same version as the one that wrote the data. The provided directory must be accessible from all nodes (HDFS, NFS). The provided frame_id must be the same as the one used when writing the data.
- Parameters
frame_id – the frame ID of the original frame
path – a filesystem location where to look for frame data
force – overwrite an already existing frame (defaults to true)
- Returns
A Frame object.
- Examples
>>> iris = h2o.load_frame("iris_weather.hex", "hdfs://namenode/h2o_data")
h2o.load_grid(grid_file_path, load_params_references=False)
Loads a previously saved grid with all its models from the same folder.
- Parameters
grid_file_path – A string containing the path to the file with the saved grid. Grid models are expected to be in the same folder.
load_params_references – If true, will attempt to reload saved objects referenced by grid parameters (e.g. training frame, calibration frame); will fail if the grid was saved without the referenced objects.
- Returns
An instance of H2OGridSearch
- Examples
>>> from collections import OrderedDict
>>> from h2o.grid.grid_search import H2OGridSearch
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> train = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
>>> # Run GBM Grid Search
>>> ntrees_opts = [1, 3]
>>> learn_rate_opts = [0.1, 0.01, .05]
>>> hyper_parameters = OrderedDict()
>>> hyper_parameters["learn_rate"] = learn_rate_opts
>>> hyper_parameters["ntrees"] = ntrees_opts
>>> export_dir = pyunit_utils.locate("results")
>>> gs = H2OGridSearch(H2OGradientBoostingEstimator, hyper_params=hyper_parameters)
>>> gs.train(x=list(range(4)), y=4, training_frame=train)
>>> grid_id = gs.grid_id
>>> old_grid_model_count = len(gs.model_ids)
>>> # Save the grid search to the export directory
>>> saved_path = h2o.save_grid(export_dir, grid_id)
>>> h2o.remove_all()
>>> train = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
>>> # Load the grid search
>>> grid = h2o.load_grid(saved_path)
>>> grid.train(x=list(range(4)), y=4, training_frame=train)
h2o.load_model(path)
Load a saved H2O model from disk. (Note that ensemble binary models can now be loaded using this method.)
- Parameters
path – the full path of the H2O Model to be imported.
- Returns
an H2OEstimator object
- Examples
>>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator
>>> training_data = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
...               "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> model = H2OGeneralizedLinearEstimator(family="binomial",
...                                       alpha=0,
...                                       Lambda=1e-5)
>>> model.train(x=predictors,
...             y=response,
...             training_frame=training_data)
>>> saved_path = h2o.save_model(model, path='', force=True)
>>> loaded_model = h2o.load_model(saved_path)
h2o.log_and_echo(message='')
Log a message on the server-side logs.
This is helpful when running several pieces of work one after the other on a single H2O cluster and you want to make a notation in the H2O server side log where one piece of work ends and the next piece of work begins.
Sends a message to H2O for logging. Generally used for debugging purposes.
- Parameters
message – message to write to the log.
- Examples
>>> ret = h2o.log_and_echo("Testing h2o.log_and_echo")
h2o.ls()
List keys on an H2O Cluster.
- Examples
>>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
>>> h2o.ls()
h2o.make_leaderboard(object, leaderboard_frame=None, sort_metric='AUTO', extra_columns=[], scoring_data='AUTO')
Create a leaderboard from a list of models, grids and/or automls.
- Parameters
object – List of models, automls, or grids; or just a single automl/grid object.
leaderboard_frame – Frame used for generating the metrics (optional).
sort_metric – Metric used for sorting the leaderboard.
extra_columns – What extra columns should be calculated (might require leaderboard_frame). Use "ALL" for all available columns, or provide a list of extra columns.
scoring_data – Metrics to be reported in the leaderboard (“xval”, “train”, or “valid”). Used if no leaderboard_frame is provided.
- Returns
H2OFrame
- Examples
>>> import h2o
>>> from h2o.grid.grid_search import H2OGridSearch
>>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator
>>> h2o.init()
>>> training_data = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/logreg/benign.csv")
>>> hyper_parameters = {'alpha': [0.01,0.5],
...                     'lambda': [1e-5,1e-6]}
>>> gs = H2OGridSearch(H2OGeneralizedLinearEstimator(family='binomial'),
...                    hyper_parameters)
>>> gs.train(y=3, training_frame=training_data)
>>> h2o.make_leaderboard(gs, training_data)
h2o.make_metrics(predicted, actual, domain=None, distribution=None, weights=None, treatment=None, auc_type='NONE', auuc_type='AUTO', auuc_nbins=-1)
Create Model Metrics from predicted and actual values in H2O.
- Parameters
predicted (H2OFrame) – an H2OFrame containing predictions.
actual (H2OFrame) – an H2OFrame containing actual values.
domain – list of response factors for classification.
distribution – distribution for regression.
weights (H2OFrame) – an H2OFrame containing observation weights (optional).
treatment (H2OFrame) – an H2OFrame containing treatment information for uplift binomial classification only.
auc_type – For multinomial classification you have to specify which type of aggregated AUC/AUCPR will be used to calculate this metric. Possibilities are:
MACRO_OVO
MACRO_OVR
WEIGHTED_OVO
WEIGHTED_OVR
NONE
AUTO
(OVO = One vs. One, OVR = One vs. Rest). Default is “NONE” (AUC and AUCPR are not calculated).
auuc_type –
For uplift binomial classification you have to specify which type of AUUC will be used to calculate this metric. Choose from:
gini
lift
gain
AUTO (default, uses qini)
auuc_nbins – For uplift binomial classification you have to specify the number of bins to be used for calculating the AUUC. Default is -1, which means 1000.
- Examples
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> fr = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
>>> fr["CAPSULE"] = fr["CAPSULE"].asfactor()
>>> fr["RACE"] = fr["RACE"].asfactor()
>>> response = "AGE"
>>> predictors = list(set(fr.names) - {"ID", response})
>>> for distr in ["gaussian", "poisson", "laplace", "gamma"]:
...     print("distribution: %s" % distr)
...     model = H2OGradientBoostingEstimator(distribution=distr,
...                                          ntrees=2,
...                                          max_depth=3,
...                                          min_rows=1,
...                                          learn_rate=0.1,
...                                          nbins=20)
...     model.train(x=predictors,
...                 y=response,
...                 training_frame=fr)
...     predicted = h2o.assign(model.predict(fr), "pred")
...     actual = fr[response]
...     m0 = model.model_performance(train=True)
...     m1 = h2o.make_metrics(predicted, actual, distribution=distr)
...     m2 = h2o.make_metrics(predicted, actual)
>>> print(m0)
>>> print(m1)
>>> print(m2)
h2o.model_correlation_heatmap(models, frame, top_n=None, cluster_models=True, triangular=True, figsize=(13, 13), colormap='RdYlBu_r', save_plot_path=None)
Model Prediction Correlation Heatmap
This plot shows the correlation between the predictions of the models. For classification, frequency of identical predictions is used. By default, models are ordered by their similarity (as computed by hierarchical clustering).
- Parameters
models – a list of H2O models, an H2O AutoML instance, or an H2OFrame with a ‘model_id’ column (e.g. H2OAutoML leaderboard)
frame – H2OFrame
top_n – DEPRECATED. show just top n models (applies only when used with H2OAutoML).
cluster_models – if True, cluster the models
triangular – make the heatmap triangular
figsize – figure size; passed directly to matplotlib
colormap – colormap to use
save_plot_path – a path to save the plot to, using the matplotlib savefig function
- Returns
object that contains the resulting figure (can be accessed using result.figure())
- Examples
>>> import h2o
>>> from h2o.automl import H2OAutoML
>>>
>>> h2o.init()
>>>
>>> # Import the wine dataset into H2O:
>>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
>>> df = h2o.import_file(f)
>>>
>>> # Set the response
>>> response = "quality"
>>>
>>> # Split the dataset into a train and test set:
>>> train, test = df.split_frame([0.8])
>>>
>>> # Train an H2OAutoML
>>> aml = H2OAutoML(max_models=10)
>>> aml.train(y=response, training_frame=train)
>>>
>>> # Create the model correlation heatmap
>>> aml.model_correlation_heatmap(test)
h2o.models()
Retrieve the IDs of all the Models.
- Returns
Handles of all the models present in the cluster
- Examples
>>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator
>>> from h2o.estimators.xgboost import H2OXGBoostEstimator
>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"] = airlines["Year"].asfactor()
>>> airlines["Month"] = airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> response = "IsDepDelayed"
>>> model1 = H2OGeneralizedLinearEstimator(family="binomial")
>>> model1.train(y=response, training_frame=airlines)
>>> model2 = H2OXGBoostEstimator()
>>> model2.train(y=response, training_frame=airlines)
>>> model_list = h2o.models()
h2o.mojo_predict_csv(input_csv_path, mojo_zip_path, output_csv_path=None, genmodel_jar_path=None, classpath=None, java_options=None, verbose=False, setInvNumNA=False, predict_contributions=False, predict_calibrated=False, extra_cmd_args=None)
MOJO scoring function to take a CSV file and use a MOJO model as a zip file to score.
- Parameters
input_csv_path – Path to input CSV file.
mojo_zip_path – Path to MOJO zip downloaded from H2O.
output_csv_path – Optional, name of the output CSV file with computed predictions. If None (default), then predictions will be saved as prediction.csv in the same folder as the MOJO zip.
genmodel_jar_path – Optional, path to genmodel jar file. If None (default) then the h2o-genmodel.jar in the same folder as the MOJO zip will be used.
classpath – Optional, specifies custom user defined classpath which will be used when scoring. If None (default) then the default classpath for this MOJO model will be used.
java_options – Optional, custom user defined options for Java. By default -Xmx4g -XX:ReservedCodeCacheSize=256m is used.
verbose – Optional, if True, then additional debug information will be printed. False by default.
predict_contributions – if True, then return prediction contributions instead of regular predictions (only for tree-based models).
predict_calibrated – if true, then return calibrated probabilities in addition to the predicted probabilities.
extra_cmd_args – Optional, a list of additional arguments to append to genmodel.jar’s command line.
- Returns
List of computed predictions
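- Examples
A minimal usage sketch (the file paths below are hypothetical, and the MOJO zip is assumed to have been downloaded beforehand with model.download_mojo()):
>>> mojo_zip = "/tmp/gbm_model.zip"  # hypothetical MOJO archive
>>> preds = h2o.mojo_predict_csv("/tmp/to_score.csv", mojo_zip,
...                              output_csv_path="/tmp/predictions.csv")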
h2o.mojo_predict_pandas(dataframe, mojo_zip_path, genmodel_jar_path=None, classpath=None, java_options=None, verbose=False, setInvNumNA=False, predict_contributions=False, predict_calibrated=False)
MOJO scoring function to take a Pandas frame and use a MOJO model as a zip file to score.
- Parameters
dataframe – Pandas frame to score.
mojo_zip_path – Path to MOJO zip downloaded from H2O.
genmodel_jar_path – Optional, path to genmodel jar file. If None (default) then the h2o-genmodel.jar in the same folder as the MOJO zip will be used.
classpath – Optional, specifies custom user defined classpath which will be used when scoring. If None (default) then the default classpath for this MOJO model will be used.
java_options – Optional, custom user defined options for Java. By default -Xmx4g is used.
verbose – Optional, if True, then additional debug information will be printed. False by default.
predict_contributions – if True, then return prediction contributions instead of regular predictions (only for tree-based models).
predict_calibrated – if true, then return calibrated probabilities in addition to the predicted probabilities.
- Returns
Pandas frame with predictions
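- Examples
A minimal usage sketch (the paths and input frame are hypothetical; the MOJO zip is assumed to have been downloaded beforehand with model.download_mojo()):
>>> import pandas as pd
>>> df = pd.read_csv("/tmp/to_score.csv")  # hypothetical input data
>>> preds = h2o.mojo_predict_pandas(df, "/tmp/gbm_model.zip")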
h2o.no_progress()
Disable the progress bar from flushing to stdout.
The completed progress bar is printed when a job is complete so as to demarcate a log file.
- Examples
>>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator
>>> h2o.no_progress()
>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> x = ["DayofMonth", "Month"]
>>> model = H2OGeneralizedLinearEstimator(family="binomial",
...                                       alpha=0,
...                                       Lambda=1e-5)
>>> model.train(x=x, y="IsDepDelayed", training_frame=airlines)
h2o.parse_raw(setup, id=None, first_line_is_header=0)
Parse dataset using the parse setup structure.
- Parameters
setup – Result of h2o.parse_setup()
id – an id for the frame.
first_line_is_header – -1, 0, 1 if the first line is to be used as the header
- Returns
an H2OFrame object.
- Examples
>>> fraw = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip",
...                        parse=False)
>>> fhex = h2o.parse_raw(h2o.parse_setup(fraw),
...                      id='prostate.csv',
...                      first_line_is_header=0)
>>> fhex.summary()
h2o.parse_setup(raw_frames, destination_frame=None, header=0, separator=None, column_names=None, column_types=None, na_strings=None, skipped_columns=None, custom_non_data_line_markers=None, partition_by=None, quotechar=None, escapechar=None)
Retrieve H2O's best guess as to what the structure of the data file is.
During parse setup, the H2O cluster will make several guesses about the attributes of the data. This method allows a user to perform corrective measures by updating the dictionary returned from this method. This dictionary is then fed into parse_raw to produce the H2OFrame instance.
- Parameters
raw_frames – a collection of imported file frames
destination_frame – The unique hex key assigned to the imported file. If none is given, a key will automatically be generated.
header – -1 means the first line is data, 0 means guess, 1 means first line is header.
separator – The field separator character. Values on each line of the file are separated by this character. If not provided, the parser will automatically detect the separator.
column_names – A list of column names for the file. If skipped_columns are specified, only list column names of columns that are not skipped.
column_types – A list of types or a dictionary of column names to types to specify whether columns should be forced to a certain type upon import parsing. If a list, the types for elements that are None will be guessed. If skipped_columns are specified, only list column types of columns that are not skipped. The possible types a column may have are:
"unknown" - this will force the column to be parsed as all NA
"uuid" - the values in the column must be true UUID or will be parsed as NA
"string" - force the column to be parsed as a string
"numeric" - force the column to be parsed as numeric. H2O will handle the compression of the numeric data in the optimal manner.
"enum" - force the column to be parsed as a categorical column.
"time" - force the column to be parsed as a time column. H2O will attempt to parse the following list of date time formats:
"yyyy-MM-dd" (date),
"yyyy MM dd" (date),
"dd-MMM-yy" (date),
"dd MMM yy" (date),
"HH:mm:ss" (time),
"HH:mm:ss:SSS" (time),
"HH:mm:ss:SSSnnnnnn" (time),
"HH.mm.ss" (time),
"HH.mm.ss.SSS" (time),
"HH.mm.ss.SSSnnnnnn" (time).
Times can also contain "AM" or "PM".
na_strings – A list of strings, or a list of lists of strings (one list per column), or a dictionary of column names to strings which are to be interpreted as missing values.
skipped_columns – an integer list of column indices to skip, which will not be parsed into the final frame from the import file.
custom_non_data_line_markers – If a line in the imported file starts with any character in the given string, it will NOT be imported. An empty string means all lines are imported; None means the default behaviour for the given format will be used.
partition_by – A list of columns the dataset has been partitioned by. None by default.
quotechar – A hint for the parser which character to expect as quoting character. Only single quote, double quote or None (default) are allowed. None means automatic detection.
escapechar – (Optional) One ASCII character used to escape other characters.
- Returns
a dictionary containing parse parameters guessed by the H2O backend.
- Examples
>>> col_headers = ["ID","CAPSULE","AGE","RACE",
...                "DPROS","DCAPS","PSA","VOL","GLEASON"]
>>> col_types = ['enum','enum','numeric','enum',
...              'enum','enum','numeric','numeric','numeric']
>>> hex_key = "training_data.hex"
>>> fraw = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip",
...                        parse=False)
>>> setup = h2o.parse_setup(fraw,
...                         destination_frame=hex_key,
...                         header=1,
...                         separator=',',
...                         column_names=col_headers,
...                         column_types=col_types,
...                         na_strings=["NA"])
>>> setup
h2o.pd_multi_plot(models, frame, column, best_of_family=True, row_index=None, target=None, max_levels=30, figsize=(16, 9), colormap='Dark2', markers=['o', 'v', 's', 'P', '*', 'D', 'X', '^', '<', '>', '.'], save_plot_path=None, show_rug=True)
Plot partial dependencies of a variable across multiple models.
The partial dependence plot (PDP) provides a graph of the marginal effect of a variable on the response. The effect of a variable is measured by the change in the mean response. The PDP assumes independence between the feature for which the PDP is computed and the rest of the features.
- Parameters
models – a list of H2O models, an H2O AutoML instance, or an H2OFrame with a ‘model_id’ column (e.g. H2OAutoML leaderboard)
frame – H2OFrame
column – string containing column name
best_of_family – if True, show only the best models per family
row_index – if None, plot partial dependence; if an integer, plot the individual conditional expectation for the row specified by that integer
target – (only for multinomial classification) the target class for which the plot should be produced
max_levels – maximum number of factor levels to show
figsize – figure size; passed directly to matplotlib
colormap – colormap name
markers – a list of markers to use for factors; if there are more factor levels than markers, the last marker in the list is reused
save_plot_path – a path to save the plot to, using the matplotlib function savefig
show_rug – show a rug plot to visualize the density of the column
- Returns
object that contains the resulting matplotlib figure (can be accessed using result.figure()).
- Examples
>>> import h2o
>>> from h2o.automl import H2OAutoML
>>>
>>> h2o.init()
>>>
>>> # Import the wine dataset into H2O:
>>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
>>> df = h2o.import_file(f)
>>>
>>> # Set the response
>>> response = "quality"
>>>
>>> # Split the dataset into a train and test set:
>>> train, test = df.split_frame([0.8])
>>>
>>> # Train an H2OAutoML
>>> aml = H2OAutoML(max_models=10)
>>> aml.train(y=response, training_frame=train)
>>>
>>> # Create a partial dependence plot
>>> aml.pd_multi_plot(test, column="alcohol")
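The module-level function can also be called with an explicit list of models rather than an AutoML instance; a minimal sketch, assuming two already-trained models gbm and glm built on the same data:
>>> # Compare partial dependence of "alcohol" across two hypothetical models
>>> h2o.pd_multi_plot([gbm, glm], frame=test, column="alcohol")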
- h2o.print_mojo(mojo_path, format='json', tree_index=None)[source]¶ Generate a string representation of an existing MOJO model.
- Parameters
mojo_path – Path to the MOJO archive on the user’s local filesystem
format – Output format. Possible values: json (default), dot, png
tree_index – Index of tree to print
- Returns
A string representation of the MOJO for text output formats; a path to a directory with the rendered images for image output formats (or a path to a file if only a single tree is output)
- Example
>>> import json
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv")
>>> prostate["CAPSULE"] = prostate["CAPSULE"].asfactor()
>>> gbm_h2o = H2OGradientBoostingEstimator(ntrees=5,
...                                        learn_rate=0.1,
...                                        max_depth=4,
...                                        min_rows=10)
>>> gbm_h2o.train(x=list(range(1, prostate.ncol)),
...               y="CAPSULE",
...               training_frame=prostate)
>>> mojo_path = gbm_h2o.download_mojo()
>>> mojo_str = h2o.print_mojo(mojo_path)
>>> mojo_dict = json.loads(mojo_str)
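For the non-default output formats, a hedged sketch continuing the example above (the rendered file names and locations are determined by the backend):
>>> # "dot" produces a Graphviz description of the trees as a string
>>> dot_str = h2o.print_mojo(mojo_path, format="dot", tree_index=0)
>>> # "png" renders the trees as images and returns the output path
>>> img_path = h2o.print_mojo(mojo_path, format="png")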
- h2o.rapids(expr)[source]¶ Execute a Rapids expression.
- Parameters
expr – The Rapids expression (an ASCII string).
- Returns
The JSON response (as a python dictionary) of the Rapids execution.
- Examples
>>> rapidTime = h2o.rapids("(getTimeZone)")["string"]
>>> print(str(rapidTime))
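Rapids expressions are Lisp-like strings evaluated on the backend. As a further hedged sketch (assuming the setTimeZone built-in is available on the server):
>>> # Set the cluster time zone, then read it back
>>> h2o.rapids('(setTimeZone "America/Los_Angeles")')
>>> print(h2o.rapids("(getTimeZone)")["string"])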
- h2o.remove(x, cascade=True)[source]¶ Remove object(s) from H2O.
- Parameters
x – H2OFrame, H2OEstimator, or string, or a list of those things: the object(s) or unique id(s) pointing to the object(s) to be removed.
cascade – boolean; if True (default), the object's dependencies (e.g. submodels) are also removed.
- Examples
>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> h2o.remove(airlines)
>>> airlines # Should receive error: "This H2OFrame has been removed."
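Because x also accepts unique ids and lists, several objects can be removed in one call; a hedged sketch assuming a trained model gbm alongside the frame:
>>> # Remove a frame and a model together by their unique ids
>>> h2o.remove([airlines.frame_id, gbm.model_id])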
- h2o.remove_all(retained=None)[source]¶ Remove all objects from H2O, with the possibility to specify models and frames to retain. Retained keys must be keys of models and frames only. For retained models, the training and validation frames are retained as well. Cross-validation models of a retained model are NOT retained automatically; those must be specified explicitly.
- Parameters
retained – Keys of models and frames to retain
- Examples
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> gbm = H2OGradientBoostingEstimator(ntrees=1)
>>> gbm.train(x=["Origin", "Dest"],
...           y="IsDepDelayed",
...           training_frame=airlines)
>>> h2o.remove_all([airlines.frame_id,
...                 gbm.model_id])
- h2o.resume(recovery_dir=None)[source]¶ Trigger auto-recovery resume - this will look into the configured recovery dir and resume any tasks that were interrupted by an unexpected cluster stop.
- Parameters
recovery_dir – A path to where cluster recovery data is stored; if blank, the cluster's configuration is used.
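- Examples
No example is given in the source; a minimal hedged sketch (the recovery directory below is hypothetical and must match the one the cluster was configured with):
>>> # Resume tasks interrupted by an unexpected cluster stop
>>> h2o.resume(recovery_dir="hdfs://namenode/h2o_recovery")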
- h2o.save_grid(grid_directory, grid_id, save_params_references=False, export_cross_validation_predictions=False)[source]¶ Export a grid and all of its models into the given folder.
- Parameters
grid_directory – A string containing the path to the folder for the grid to be saved to.
grid_id – A string identifying the grid in H2O.
save_params_references – True if objects referenced by grid parameters (e.g. training frame, calibration frame) should also be saved.
export_cross_validation_predictions – A boolean flag indicating whether the models exported from the grid should be saved with CV Holdout Frame predictions. Default is not to export the predictions.
- Examples
>>> from collections import OrderedDict
>>> from h2o.grid.grid_search import H2OGridSearch
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> train = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
>>> # Run a GBM grid search
>>> ntrees_opts = [1, 3]
>>> learn_rate_opts = [0.1, 0.01, .05]
>>> hyper_parameters = OrderedDict()
>>> hyper_parameters["learn_rate"] = learn_rate_opts
>>> hyper_parameters["ntrees"] = ntrees_opts
>>> export_dir = "results"  # any writable local directory
>>> gs = H2OGridSearch(H2OGradientBoostingEstimator, hyper_params=hyper_parameters)
>>> gs.train(x=list(range(4)), y=4, training_frame=train)
>>> grid_id = gs.grid_id
>>> old_grid_model_count = len(gs.model_ids)
>>> # Save the grid search to the export directory
>>> saved_path = h2o.save_grid(export_dir, grid_id)
>>> h2o.remove_all()
>>> train = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
>>> # Load the grid search
>>> grid = h2o.load_grid(saved_path)
>>> grid.train(x=list(range(4)), y=4, training_frame=train)
- h2o.save_model(model, path='', force=False, export_cross_validation_predictions=False, filename=None)[source]¶ Save an H2O Model object to disk. (Note that ensemble binary models can now be saved using this method.) The saved file is owned by the user under which the H2O cluster was executed.
- Parameters
model – The model object to save.
path – a path to save the model at (hdfs, s3, local)
force – if True, overwrite the destination directory if it exists; if False, throw an exception instead.
export_cross_validation_predictions – logical, indicates whether the exported model artifact should also include CV Holdout Frame predictions. Default is not to export the predictions.
filename – a filename for the saved model
- Returns
the path of the saved model
- Examples
>>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator
>>> h2o_df = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
>>> my_model = H2OGeneralizedLinearEstimator(family="binomial")
>>> my_model.train(y="CAPSULE",
...                x=["AGE", "RACE", "PSA", "GLEASON"],
...                training_frame=h2o_df)
>>> h2o.save_model(my_model, path='', force=True)
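Because save_model returns the path of the saved model, it pairs naturally with h2o.load_model; a sketch continuing the example above:
>>> # Round-trip: save the model, then load it back from the returned path
>>> model_path = h2o.save_model(my_model, path='', force=True)
>>> reloaded = h2o.load_model(model_path)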
- h2o.show_progress()[source]¶ Enable the progress bar (it is enabled by default).
- Examples
>>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator
>>> h2o.no_progress()
>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> x = ["DayofMonth", "Month"]
>>> model = H2OGeneralizedLinearEstimator(family="binomial",
...                                       alpha=0,
...                                       Lambda=1e-5)
>>> model.train(x=x, y="IsDepDelayed", training_frame=airlines)
>>> h2o.show_progress()
>>> model.train(x=x, y="IsDepDelayed", training_frame=airlines)
- h2o.upload_custom_metric(func, func_file='metrics.py', func_name=None, class_name=None, source_provider=None)[source]¶ Upload a given metrics function into the H2O cluster.
- The metric can be supplied in different representations:
class: needs to implement the map(pred, act, weight, offset, model), reduce(l, r), and metric(l) methods
string: the same as in the class case, but the class is given as a string
- Parameters
func – the metric representation: a string or a class
func_file – internal name of the file in which the given metric representation is saved
func_name – name of the h2o key under which the given metric is saved
class_name – name of the class wrapping the metrics function (when supplied as a string)
source_provider – a function that provides the source code for the given function
- Returns
reference to uploaded metrics function
- Examples
>>> class CustomMaeFunc:
>>>     def map(self, pred, act, w, o, model):
>>>         return [abs(act[0] - pred[0]), 1]
>>>
>>>     def reduce(self, l, r):
>>>         return [l[0] + r[0], l[1] + r[1]]
>>>
>>>     def metric(self, l):
>>>         return l[0] / l[1]
>>>
>>> custom_func_str = '''class CustomMaeFunc:
>>>     def map(self, pred, act, w, o, model):
>>>         return [abs(act[0] - pred[0]), 1]
>>>
>>>     def reduce(self, l, r):
>>>         return [l[0] + r[0], l[1] + r[1]]
>>>
>>>     def metric(self, l):
>>>         return l[0] / l[1]'''
>>>
>>> h2o.upload_custom_metric(custom_func_str, class_name="CustomMaeFunc", func_name="mae")
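The returned reference is typically passed to an estimator's custom_metric_func parameter; a hedged sketch (the training frame train and response "y" are hypothetical):
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> # Keep the reference returned by upload_custom_metric
>>> custom_mae = h2o.upload_custom_metric(custom_func_str, class_name="CustomMaeFunc", func_name="mae")
>>> gbm = H2OGradientBoostingEstimator(ntrees=5, custom_metric_func=custom_mae)
>>> gbm.train(y="y", training_frame=train)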
- h2o.upload_file(path, destination_frame=None, header=0, sep=None, col_names=None, col_types=None, na_strings=None, skipped_columns=None, quotechar=None, escapechar=None)[source]¶ Upload a dataset from the provided local path to the H2O cluster.
Does a single-threaded push to H2O. Also see import_file().
- Parameters
path – A path specifying the location of the data to upload.
destination_frame – The unique hex key assigned to the imported file. If none is given, a key will be automatically generated.
header – -1 means the first line is data, 0 means guess, 1 means first line is header.
sep – The field separator character. Values on each line of the file are separated by this character. If not provided, the parser will automatically detect the separator.
col_names – A list of column names for the file.
col_types –
A list of types or a dictionary of column names to types to specify whether columns should be forced to a certain type upon import parsing. If a list, the types for elements that are None will be guessed. The possible types a column may have are:
“unknown” - this will force the column to be parsed as all NA
“uuid” - the values in the column must be true UUID or will be parsed as NA
“string” - force the column to be parsed as a string
“numeric” - force the column to be parsed as numeric. H2O will handle the compression of the numeric data in the optimal manner.
“enum” - force the column to be parsed as a categorical column.
“time” - force the column to be parsed as a time column. H2O will attempt to parse the following list of date time formats:
“yyyy-MM-dd” (date),
“yyyy MM dd” (date),
“dd-MMM-yy” (date),
“dd MMM yy” (date),
“HH:mm:ss” (time),
“HH:mm:ss:SSS” (time),
“HH:mm:ss:SSSnnnnnn” (time),
“HH.mm.ss” (time),
“HH.mm.ss.SSS” (time),
“HH.mm.ss.SSSnnnnnn” (time).
Times can also contain “AM” or “PM”.
na_strings – A list of strings, or a list of lists of strings (one list per column), or a dictionary of column names to strings which are to be interpreted as missing values.
skipped_columns – a list of integer column indices to skip; the skipped columns are not parsed into the final frame.
quotechar – A hint for the parser which character to expect as quoting character. Only single quote, double quote or None (default) are allowed. None means automatic detection.
escapechar – (Optional) One ASCII character used to escape other characters.
- Returns
a new H2OFrame instance.
- Examples
>>> iris_df = h2o.upload_file("~/Desktop/repos/h2o-3/smalldata/iris/iris.csv")
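A hedged sketch of overriding the parser's guesses on upload (the file path and column names here are hypothetical):
>>> # Force column types and declare custom missing-value strings on upload
>>> df = h2o.upload_file("data/my_data.csv",
...                      destination_frame="my_data.hex",
...                      col_types={"id": "string", "label": "enum"},
...                      na_strings=["NA", "?"])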
- h2o.upload_model(path)[source]¶ Upload a binary model from the provided local path to the H2O cluster. (An H2O model can be saved in binary form either by the save_model() or the download_model() function.)
- Parameters
path – A path on the machine this python session is currently connected to, specifying the location of the model to upload.
- Returns
a new H2OEstimator object.
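- Examples
The source gives no example here; a minimal hedged sketch, assuming a model previously saved via save_model() as in the save_model example above:
>>> # Upload a previously saved binary model into the connected cluster
>>> model_path = h2o.save_model(my_model, path='', force=True)
>>> uploaded = h2o.upload_model(model_path)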
- h2o.upload_mojo(mojo_path, model_id=None)[source]¶ Upload an existing MOJO model from the local filesystem into H2O and import it as an H2O Generic Model.
- Parameters
mojo_path – Path to the MOJO archive on the user’s local filesystem
model_id – Model ID, default None
- Returns
An H2OGenericEstimator instance embedding the given MOJO
- Examples
>>> import tempfile
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> model = H2OGradientBoostingEstimator(ntrees=1)
>>> model.train(x=["Origin", "Dest"],
...             y="IsDepDelayed",
...             training_frame=airlines)
>>> original_model_filename = tempfile.mkdtemp()
>>> original_model_filename = model.download_mojo(original_model_filename)
>>> mojo_model = h2o.upload_mojo(original_model_filename)
- h2o.varimp_heatmap(models, top_n=None, num_of_features=20, figsize=(16, 9), cluster=True, colormap='RdYlBu_r', save_plot_path=None)[source]¶ Variable Importance Heatmap across a group of models.
The variable importance heatmap shows variable importance across multiple models. Some models in H2O return variable importance for one-hot (binary indicator) encoded versions of categorical columns (e.g. Deep Learning, XGBoost). In order for the variable importance of categorical columns to be compared across all model types, we compute a summarization of the variable importance across all one-hot encoded features and return a single variable importance for the original categorical feature. By default, the models and variables are ordered by their similarity.
- Parameters
models – a list of H2O models, an H2O AutoML instance, or an H2OFrame with a ‘model_id’ column (e.g. H2OAutoML leaderboard)
top_n – DEPRECATED. Use only the top n models (applies only when used with H2OAutoML).
num_of_features – limit the number of features to plot based on the maximum variable importance across the models. Use None for unlimited.
figsize – figure size; passed directly to matplotlib
cluster – if True, cluster the models and variables
colormap – colormap to use
save_plot_path – a path to save the plot to, using the matplotlib function savefig
- Returns
object that contains the resulting figure (can be accessed using result.figure()).
- Examples
>>> import h2o
>>> from h2o.automl import H2OAutoML
>>>
>>> h2o.init()
>>>
>>> # Import the wine dataset into H2O:
>>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
>>> df = h2o.import_file(f)
>>>
>>> # Set the response
>>> response = "quality"
>>>
>>> # Split the dataset into a train and test set:
>>> train, test = df.split_frame([0.8])
>>>
>>> # Train an H2OAutoML
>>> aml = H2OAutoML(max_models=10)
>>> aml.train(y=response, training_frame=train)
>>>
>>> # Create the variable importance heatmap
>>> aml.varimp_heatmap()
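As with pd_multi_plot, the models argument also accepts an H2OAutoML leaderboard frame directly; a hedged sketch continuing the example:
>>> # Plot the heatmap for the leaderboard models, limited to 10 features
>>> h2o.varimp_heatmap(aml.leaderboard, num_of_features=10)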