H2O Module

The H2O Python Module

This module provides access to the H2O JVM (and extensions thereof), its objects, its machine-learning algorithms, and modeling support (basic munging and feature generation) capabilities.

The H2O JVM sports a web server such that all communication occurs on a socket (specified by an IP address and a port) via a series of REST calls (see connection.py for the REST layer implementation and details). There is a single active connection to the H2O JVM at any one time, and this handle is stashed away out of sight in a singleton instance of H2OConnection (this is the global __H2OConn__). In other words, this package does not rely on Jython, and there is no direct manipulation of the JVM.
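For example (using the h2o.init call documented later in this module), a single call establishes that connection and stashes the handle in the singleton:

>>> import h2o
>>>
>>> h2o.init(ip="localhost", port=54321)   # connect to a running H2O JVM; the handle lands in __H2OConn__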

The H2O Python module is not intended as a replacement for other popular machine learning modules such as scikit-learn, pylearn2, and their ilk. This module is a complementary interface to a modeling engine intended to make the transition of models from development to production as seamless as possible. Additionally, it is designed to bring H2O to a wider audience of data and machine learning devotees who work exclusively with Python (rather than R, Scala, or Java, which are other popular interfaces that H2O supports) and want another tool for building applications or doing data munging in a fast, scalable environment without any extra mental anguish about threads and parallelism. There are additional treasures that H2O incorporates meant to alleviate the pain of doing some basic feature manipulation (e.g. automatic categorical handling, so there is no need to one-hot encode).

What is H2O?

H2O is a piece of Java software for data modeling and general computing. There are many different views of the H2O software, but the primary view is that of a distributed (many machines), parallel (many CPUs), in-memory (several hundred GBs Xmx) processing engine.

There are two levels of parallelism:

  • within-node
  • across-node (or between-node).

The goal, remember, is to “simply” add more processors to a given problem in order to produce a solution faster. The conceptual paradigm MapReduce (also known as “divide and conquer and combine”), along with a good concurrent application structure (cf. jsr166y and NonBlockingHashMap), enables this type of scaling in H2O (we’re really cooking with gas now!).
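To make the divide-and-conquer-and-combine idea concrete, here is a minimal, purely illustrative Python sketch of the MapReduce pattern applied to chunks of a column (the chunking and names here are hypothetical and say nothing about H2O's actual JVM internals):

>>> from functools import reduce
>>>
>>> data = list(range(1, 101))                            # one "column" of values
>>> chunks = [data[i:i+25] for i in range(0, 100, 25)]    # divide: four "chunks"
>>>
>>> partials = [sum(c) for c in chunks]                   # conquer: a partial sum per chunk (parallelizable)
>>> reduce(lambda a, b: a + b, partials)                  # combine: fold the partial results together
5050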

For application developers and data scientists, the gritty details of thread-safety, algorithm parallelism, and node coherence on a network are concealed by simple-to-use REST calls that are all documented here. In addition, H2O is an open-source project under the Apache v2 license. All of the source code is on GitHub, there is an active Google Group mailing list, our nightly tests are open for perusal, and our JIRA ticketing system is open for public use. Last, but not least, we regularly engage the machine learning community all over the nation with a very busy meetup schedule (so if you’re not in The Valley, no sweat, we’re probably coming to you soon!), and we host our very own H2O World conference. We also sometimes host hack-a-thons at our campus in Mountain View, CA. Needless to say, there is a lot of support for the application developer.

In order to make the most out of H2O, there are some key conceptual pieces that are helpful to know before getting started. Mainly, it’s helpful to know about the different types of objects that live in H2O and what the rules of engagement are in the context of the REST API (which is what any non-JVM interface is all about).

Let’s get started!

The H2O Object System

H2O sports a distributed key-value store (the “DKV”), which contains pointers to the various objects that make up the H2O ecosystem. The DKV is a kind of biosphere in that it encapsulates all shared objects (though, it may not encapsulate all objects). Some shared objects are mutable by the client; some shared objects are read-only by the client, but mutable by H2O (e.g. a model being constructed will change over time); and actions by the client may have side-effects on other clients (multi-tenancy is not a supported model of use, but it is possible for multiple clients to attach to a single H2O cloud).

Briefly, these objects are:

  • Key: A key is an entry in the DKV that maps to an object in H2O.
  • Frame: A Frame is a collection of Vec objects. It is a 2D array of elements.
  • Vec: A Vec is a collection of Chunk objects. It is a 1D array of elements.
  • Chunk: A Chunk holds a fraction of the BigData. It is a 1D array of elements.
  • ModelMetrics: A collection of metrics for a given category of model.
  • Model: A model is an immutable object having predict and metrics methods.
  • Job: A Job is a non-blocking task that performs a finite amount of work.

Many of these objects have no meaning to an end Python user, but in order to make sense of the objects available in this module, it is helpful to understand how these objects map to objects in the JVM (because, after all, this module is an interface that allows the manipulation of a distributed system).

Objects In This Module

The objects that are of primary concern to the Python user are (in order) Keys, Frames, Vecs, Models, ModelMetrics, and to a lesser extent Jobs. Each of these objects is described in greater detail throughout this documentation, but a few brief notes are warranted here.

H2OFrame

An H2OFrame is a 2D array of uniformly-typed columns. Data in H2O is compressed (often achieving 2-4x better compression than gzip on disk) and is held in the JVM heap (i.e. data is “in memory”), not in the Python process’s local memory. The H2OFrame is an iterable (supporting list comprehensions) wrapper around a list of H2OVec objects. All an H2OFrame object is, therefore, is a wrapper on a list that supports various types of operations that may or may not be lazy. Here’s an example showing how a list comprehension is combined with lazy expressions to compute the column means for all columns in the H2OFrame:

>>> df = h2o.import_frame(path="smalldata/logreg/prostate.csv")  # import prostate data
>>>
>>> colmeans = [v.mean() for v in df]                            # compute column means
>>>
>>> colmeans                                                     # print the results
[5.843333333333335, 3.0540000000000007, 3.7586666666666693, 1.1986666666666672]

Lazy expressions will be discussed lightly in the coming sections, as they are not necessarily going to be front-and-center for the practicing data scientist, but their primary purpose is to cut down on the chatter between the client (a.k.a. this Python interface) and H2O. Lazy expressions are Katamari’d together and only ever evaluated when some piece of output is requested (e.g. print-to-screen).

The set of operations on an H2OFrame is described in a chapter devoted to this object, but suffice it to say that this set of operations closely resembles those that may be performed on an R data.frame. This includes all manner of slicing (with complex conditionals), broadcasting operations, and a slew of math operations for transforming and mutating a Frame (all the while the actual Big Data sits in the H2O cloud). The semantics for modifying a Frame closely resemble R’s copy-on-modify semantics, except when it comes to mutating a Frame in place. For example, it’s possible to assign all occurrences of the number 0 in a column to missing (or NA in R parlance), as demonstrated in the following snippet:

>>> df = h2o.import_frame(path="smalldata/logreg/prostate.csv")  # import prostate data
>>>
>>> vol = df['VOL']                                              # select the VOL column
>>>
>>> vol[vol == 0] = None                                         # 0 VOL means 'missing'

After this operation, vol has been permanently mutated in place (it is not a copy!).

H2OVec

An H2OVec is a single column of data that is uniformly typed and possibly lazily computed. As with H2OFrame, an H2OVec is a pointer to a distributed java object residing in the H2O cloud (and truthfully, an H2OFrame is simply a collection of H2OVec pointers along with some metadata and various member methods).
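For example, selecting a single column out of an H2OFrame yields an H2OVec (the printed class path below is indicative, not a captured trace):

>>> df = h2o.import_frame(path="smalldata/logreg/prostate.csv")  # import prostate data
>>>
>>> vol = df['VOL']                                              # select the VOL column
>>>
>>> type(vol)                                                    # expected: <class 'h2o.frame.H2OVec'>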

Expr

Deep in the guts of this module is the Expr class, which defines those objects holding the cumulative, unevaluated expressions that may become H2OFrame/H2OVec objects. For example:

>>> fr = h2o.import_frame(path="smalldata/logreg/prostate.csv")  # import prostate data
>>>
>>> a = fr + 3.14159                                             # "a" is now an Expr
>>>
>>> type(a)                                                      # <class 'h2o.expr.Expr'>

These objects are not too important to distinguish at the user level, and all operations can be performed with the mental model of operating on 2D frames (i.e. everything is an H2OFrame), but it is worth mentioning them here for completeness, as they will not be discussed elsewhere.

In the previous snippet, a has not yet triggered any big data evaluation and is, in fact, a pending computation. Once a is evaluated, it stays evaluated. Additionally, when a is evaluated, all dependent subparts composing a are evaluated as well.
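Continuing the snippet above, requesting any output forces the pending computation (as noted earlier, evaluation is triggered by e.g. print-to-screen):

>>> a                                                            # printing forces evaluation; "a" is now (and stays) evaluated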

It is worth mentioning at this point that this module relies on reference counting of Python objects to dispose of out-of-scope objects. The Expr class destroys objects and their big data counterparts in the H2O cloud by way of a remove call:

>>> fr = h2o.import_frame(path="smalldata/logreg/prostate.csv")  # import prostate data
>>>
>>> h2o.remove(fr)                                               # remove prostate data
>>> fr                                                           # attempting to use fr results in a ValueError

Notice that attempting to use the object after a remove call has been issued will result in a ValueError. Therefore, any outstanding reference is not necessarily cleaned up, but it will no longer be functional. Note that deleting an unevaluated expression will not delete all of its subparts!

Models

The model-building experience with this module is unique, and is not the same as for those coming from a background in scikit-learn. Instead of using objects to build the model, builder functions are provided in the top-level module, and the result of a call is a model object belonging to one of the following categories:

  • Regression
  • Binomial
  • Multinomial
  • Clustering
  • Autoencoder

This is better demonstrated by way of an example:

>>> fr = h2o.import_frame(path="smalldata/logreg/prostate.csv")  # import prostate data
>>>
>>> fr[1] = fr[1].asfactor()                                     # make 2nd column a factor
>>>
>>> m = h2o.glm(x=fr[2:], y=fr[1])                               # build a glm with a method call
>>>
>>> m.__class__                                                  # <h2o.model.binomial.H2OBinomialModel object at 0x104659cd0>
>>>
>>> m.show()                                                     # print the model details
>>>
>>> m.summary()                                                  # print a model summary

As you can see, the result of the glm call is a binomial model. This example also showcases an important feature-munging step required to make the glm perform a classification task rather than a regression task. Namely, the second column is a numeric column when it’s initially read in, but it must be cast to a factor by way of the H2OVec operation asfactor. Let’s take a look at this more deeply:

>>> fr = h2o.import_frame(path="smalldata/logreg/prostate.csv")  # import prostate data
>>>
>>> fr[1].isfactor()                                             # produces False
>>>
>>> m = h2o.gbm(x=fr[2:],y=fr[1])                                # build the gbm
>>>
>>> m.__class__                                                  # <h2o.model.regression.H2ORegressionModel object at 0x104d07590>
>>>
>>> fr[1] = fr[1].asfactor()                                     # cast the 2nd column to a factor column
>>>
>>> fr[1].isfactor()                                             # produces True
>>>
>>> m = h2o.gbm(x=fr[2:],y=fr[1])                                # build the gbm
>>>
>>> m.__class__                                                  # <h2o.model.binomial.H2OBinomialModel object at 0x104d18f50>

The above example shows how to properly deal with numeric columns you would like to use in a classification setting. Additionally, H2O can perform on-the-fly scoring of validation data and provide a host of metrics on the validation and training data. Here’s an example of doing this, where we additionally split the data set into three pieces for training, validation, and finally testing:

>>> fr = h2o.import_frame(path="smalldata/logreg/prostate.csv")  # import prostate
>>>
>>> fr[1] = fr[1].asfactor()                                     # cast to factor
>>>
>>> r = fr[0].runif()                                            # Random UNIform numbers, one per row
>>>
>>> train = fr[ r < 0.6 ]                                        # 60% for training data
>>>
>>> valid = fr[ (0.6 <= r) & (r < 0.9) ]                         # 30% for validation
>>>
>>> test  = fr[ 0.9 <= r ]                                       # 10% for testing
>>>
>>> m = h2o.deeplearning(x=train[2:],y=train[1],validation_x=valid[2:],validation_y=valid[1])  # build a deeplearning with a validation set (yes it's this simple)
>>>
>>> m                                                            # display the model summary by default (can also call m.show())
>>>
>>> m.show()                                                     # equivalent to the above
>>>
>>> m.model_performance()                                        # show the performance on the training data (can also be m.model_performance(train=True))
>>>
>>> m.model_performance(valid=True)                              # show the performance on the validation data
>>>
>>> m.model_performance(test_data=test)                          # score and compute new metrics on the test data!

Continuing from this example, there are a number of ways of querying a model for its attributes. Here are some examples doing just that:

>>> m.mse()           # MSE on the training data
>>>
>>> m.mse(valid=True) # MSE on the validation data
>>>
>>> m.r2()            # R^2 on the training data
>>>
>>> m.r2(valid=True)  # R^2 on the validation data
>>>
>>> m.confusion_matrix()  # confusion matrix for max F1
>>>
>>> m.confusion_matrix("tpr") # confusion matrix for max true positive rate
>>>
>>> m.confusion_matrix("max_per_class_error")   # etc.

All of our models support various accessors such as these. Please refer to the relevant documentation for each model category to get further specifics on arguments and available metrics.

Each model handles missing data (colloquially: “missing” or “NA”) and categorical data automatically. Because each model handles these types of data differently, please refer to the documentation for each model call below for more information. If you still have questions, please send a note to support@h2o.ai.


Example of H2O on Hadoop

Here is a small example (H2O on Hadoop):

import h2o
h2o.init(ip="192.168.1.10", port=54321)
--------------------------  ------------------------------------
H2O cluster uptime:         2 minutes 1 seconds 966 milliseconds
H2O cluster version:        0.1.27.1064
H2O cluster name:           H2O_96762
H2O cluster total nodes:    4
H2O cluster total memory:   38.34 GB
H2O cluster total cores:    16
H2O cluster allowed cores:  80
H2O cluster healthy:        True
--------------------------  ------------------------------------
pathDataTrain = ["hdfs://192.168.1.10/user/data/data_train.csv"]
pathDataTest = ["hdfs://192.168.1.10/user/data/data_test.csv"]
trainFrame = h2o.import_frame(path=pathDataTrain)
testFrame = h2o.import_frame(path=pathDataTest)

#Parse Progress: [##################################################] 100%
#Imported ['hdfs://192.168.1.10/user/data/data_train.csv'] into cluster with 60000 rows and 500 cols

#Parse Progress: [##################################################] 100%
#Imported ['hdfs://192.168.1.10/user/data/data_test.csv'] into cluster with 10000 rows and 500 cols

trainFrame[499]._name = "label"    # rename the last (500th) column, the response, to "label"
testFrame[499]._name = "label"

model = h2o.gbm(x=trainFrame.drop("label"),
                y=trainFrame["label"],
                validation_x=testFrame.drop("label"),
                validation_y=testFrame["label"],
                ntrees=100,
                max_depth=10)

#gbm Model Build Progress: [##################################################] 100%

predictFrame = model.predict(testFrame)
model.model_performance(testFrame)

h2o

This module provides all of the top-level calls for models and various data transform methods.

h2o.h2o.as_list(data)[source]

If data is an Expr, then eagerly evaluate it and pull the result from H2O into the local environment. In the local environment, an H2O Frame is represented as a list of lists (each element in the outer list represents a row). Note: this function uses h2o.frame(), which will return meta information on the H2O Frame and only the first 100 rows. This function is only intended to be used within the testing framework. More robust functionality must be constructed for production conversion between H2O and Python data types.

Returns:List of lists (rows x columns).
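A brief usage sketch (reusing the prostate path from the examples above):

>>> fr = h2o.import_frame(path="smalldata/logreg/prostate.csv")  # import prostate data
>>>
>>> rows = h2o.as_list(fr)                                       # eagerly pull rows into a list of lists
>>>
>>> rows[0]                                                      # inspect the first returned row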
h2o.h2o.autoencoder(x, **kwargs)[source]

Build an Autoencoder

Parameters:
  • x – Columns with which to build an autoencoder
  • kwargs – Additional arguments to pass to the autoencoder.
Returns:

A new autoencoder model

h2o.h2o.cbind(left, right)[source]

Combine two H2O data objects by column (analogous to R’s cbind).

Parameters:
  • left – H2OFrame or H2OVec
  • right – H2OFrame or H2OVec
Returns:

A new H2OFrame with left and right column-bound (left|right)
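A brief usage sketch (column names taken from the prostate examples above):

>>> fr = h2o.import_frame(path="smalldata/logreg/prostate.csv")  # import prostate data
>>>
>>> ages_and_vols = h2o.cbind(fr['AGE'], fr['VOL'])              # column-bind two Vecs into a new two-column Frame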

h2o.h2o.cluster_info()[source]

Display the current H2O cluster information.

Returns:None
h2o.h2o.deeplearning(x, y=None, validation_x=None, validation_y=None, **kwargs)[source]

Build a supervised Deep Learning model (kwargs are the same arguments that you can find in FLOW)

Returns:A new classifier or regression model.
h2o.h2o.dim_check(data1, data2)[source]

Check that the dimensions of data1 and data2 are the same.

Parameters:
  • data1 – an H2OFrame, H2OVec or Expr
  • data2 – an H2OFrame, H2OVec or Expr
Returns:

None

h2o.h2o.export_file(frame, path, force=False)[source]

Export a given H2OFrame to a path on the machine this Python session is currently connected to. To view the current session, call h2o.cluster_info().

Parameters:
  • frame – The Frame to save to disk.
  • path – The path to the save point on disk.
  • force – Overwrite any preexisting file with the same path
Returns:

None
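A brief usage sketch (the output path is hypothetical):

>>> fr = h2o.import_frame(path="smalldata/logreg/prostate.csv")  # import prostate data
>>>
>>> h2o.export_file(fr, path="/tmp/prostate_export.csv", force=True)  # write to disk, overwriting any existing file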

h2o.h2o.frame(frame_id)[source]

Retrieve metadata for an id that points to a Frame.

Parameters:frame_id – A pointer to a Frame in H2O.
Returns:Meta information on the frame
h2o.h2o.frame_summary(key)[source]

Retrieve metadata and summary information for a key that points to a Frame/Vec

Parameters:key – A pointer to a Frame/Vec in H2O
Returns:Meta and summary info on the frame
h2o.h2o.frames()[source]

Retrieve all the Frames.

Returns:Meta information on the frames
h2o.h2o.gbm(x, y, validation_x=None, validation_y=None, **kwargs)[source]

Build a Gradient Boosting Machine model (kwargs are the same arguments that you can find in FLOW)

Returns:A new classifier or regression model.
h2o.h2o.glm(x, y, validation_x=None, validation_y=None, **kwargs)[source]

Build a Generalized Linear Model (kwargs are the same arguments that you can find in FLOW)

Returns:A new regression or binomial classifier.
h2o.h2o.import_file(path)[source]

Import a single file or collection of files.

Parameters:path – A path to a data file (remote or local).
Returns:A new H2OFrame
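A brief usage sketch (same prostate path as above):

>>> fr = h2o.import_file(path="smalldata/logreg/prostate.csv")   # import a single file into a new H2OFrame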
h2o.h2o.import_frame(path=None, vecs=None)[source]

Import a frame from a file (remote or local machine). If you run H2O on Hadoop, you can access HDFS.

Parameters:path – A path specifying the location of the data to import.
Returns:A new H2OFrame
h2o.h2o.init(ip='localhost', port=54321, size=1, start_h2o=False, enable_assertions=False, license=None, max_mem_size_GB=None, min_mem_size_GB=None, ice_root=None, strict_version_check=False)[source]

Initiate an H2O connection to the specified ip and port.

Parameters:
  • ip – An IP address, default is “localhost”
  • port – A port, default is 54321
  • size – The expected number of H2O instances (ignored if start_h2o is True)
  • start_h2o – A boolean dictating whether this module should start the H2O JVM. An attempt is made anyway if the initial connection fails.
  • enable_assertions – If start_h2o, pass -ea as a VM option.
  • license – If not None, is a path to a license file.
  • max_mem_size_GB – Maximum heap size (jvm option Xmx) in gigabytes.
  • min_mem_size_GB – Minimum heap size (jvm option Xms) in gigabytes.
  • ice_root – A temporary directory (default location is determined by tempfile.mkdtemp()) to hold H2O log files.
Returns:

None
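A brief usage sketch showing the memory options (the values here are arbitrary examples):

>>> import h2o
>>>
>>> h2o.init(max_mem_size_GB=4, min_mem_size_GB=1)               # request a 1-4 GB heap if a JVM must be started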

h2o.h2o.kmeans(x, validation_x=None, **kwargs)[source]

Build a KMeans model (kwargs are the same arguments that you can find in FLOW)

Returns:A new clustering model
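A hedged usage sketch (k is assumed to be passed through kwargs, mirroring the FLOW argument of the same name):

>>> fr = h2o.import_frame(path="smalldata/logreg/prostate.csv")  # import prostate data
>>>
>>> km = h2o.kmeans(x=fr[2:], k=3)                               # cluster the predictor columns into 3 groups
>>>
>>> km.show()                                                    # print the clustering model details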
h2o.h2o.locate(path)[source]

Search for a relative path and turn it into an absolute path. This is handy when hunting for data files to be passed into h2o and used by import_file. Note: This function is for unit testing purposes only.

Parameters:path – Path to search for
Returns:Absolute path if it is found. None otherwise.
h2o.h2o.np_comparison_check(h2o_data, np_data, num_elements)[source]

Check values achieved by h2o against values achieved by numpy.

Parameters:
  • h2o_data – an H2OFrame, H2OVec or Expr
  • np_data – a numpy array
  • num_elements – number of elements to compare
Returns:

None

h2o.h2o.parse(setup, h2o_name, first_line_is_header=(-1, 0, 1))[source]

Trigger a parse; blocking; remove the raw frames and keep just the Vecs.

Parameters:
  • setup – The result of calling parse_setup.
  • h2o_name – The name of the H2O Frame on the back end.
  • first_line_is_header – -1 means data, 0 means guess, 1 means header.
Returns:

A new parsed object

h2o.h2o.parse_setup(raw_frames)[source]
Parameters:raw_frames – A collection of imported file frames
Returns:A ParseSetup “object”
h2o.h2o.random_forest(x, y, validation_x=None, validation_y=None, **kwargs)[source]

Build a Random Forest Model (kwargs are the same arguments that you can find in FLOW)

Returns:A new classifier or regression model.
h2o.h2o.rapids(expr)[source]

Fire off a Rapids expression.

Parameters:expr – The Rapids expression (ASCII string).
Returns:The JSON response of the Rapids execution
h2o.h2o.remove(object)[source]

Remove an object from H2O.

Parameters:object – The object (or the key pointing to the object) to be removed.
Returns:None
h2o.h2o.upload_file(path, destination_frame='')[source]

Upload a dataset at the path given from the local machine to the H2O cluster.

Parameters:
  • path – A path specifying the location of the data to upload.
  • destination_frame – The name of the H2O Frame in the H2O Cluster.
Returns:

A new H2OFrame
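A brief usage sketch (the local path and frame name are hypothetical):

>>> fr = h2o.upload_file(path="data/local_train.csv", destination_frame="local_train")  # push a local file to the cluster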

h2o.h2o.value_check(h2o_data, local_data, num_elements, col=None)[source]

Check that the values of h2o_data and local_data are the same. In a testing context, this could be used to check that an operation did not alter the original h2o_data.

Parameters:
  • h2o_data – an H2OFrame, H2OVec or Expr
  • local_data – a list of lists (row x col format)
  • num_elements – number of elements to check
  • col – an optional integer that specifies the particular column to check
Returns:

None
