H2O Module¶
The H2O Python Module¶
This module provides access to the H2O JVM, as well as its extensions, objects, machine-learning algorithms, and modeling support capabilities, such as basic munging and feature generation.
The H2O JVM uses a web server so that all communication occurs on a socket (specified by an IP address and a port) via a series of REST calls (see connection.py for the REST layer implementation and details). There is a single active connection to the H2O JVM at any time, and this handle is stashed out of sight in a singleton instance of H2OConnection (this is the global __H2OConn__). In other words, this package does not rely on Jython, and there is no direct manipulation of the JVM.
The H2O Python module is not intended as a replacement for other popular machine learning frameworks such as scikit-learn, pylearn2, and their ilk; rather, it is intended to bring H2O to a wider audience of data and machine learning devotees who work exclusively with Python.
H2O from Python is a tool for rapidly turning over models, doing data munging, and building applications in a fast, scalable environment without any of the mental anguish about parallelism and distribution of work.
What is H2O?¶
H2O is Java-based software for data modeling and general computing. There are many different perceptions of the H2O software, but its primary purpose is to serve as a distributed (many machines), parallel (many CPUs), in-memory (several hundred GBs Xmx) processing engine.
There are two levels of parallelism:
- within node
- across (or between) nodes
The goal, remember, is to easily add more processors to a given problem in order to produce a solution faster. The conceptual paradigm MapReduce (also known as “divide and conquer and combine”), along with a good concurrent application structure (cf. jsr166y and NonBlockingHashMap), enables this type of scaling in H2O – we’re really cooking with gas now!
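The “divide and conquer and combine” idea can be sketched in a few lines of plain Python. This is only an illustration of the pattern, not H2O's implementation: the helper names (`split_into_chunks`, `map_chunk`, `combine`, `parallel_mean`) are made up for this example, and a local thread pool stands in for the nodes and CPUs of a real cloud.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_chunks(values, n_chunks):
    """Divide: cut the data into roughly equal contiguous chunks."""
    size = max(1, -(-len(values) // n_chunks))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def map_chunk(chunk):
    """Map: each worker reduces its own chunk to a small (sum, count) pair."""
    return (sum(chunk), len(chunk))


def combine(a, b):
    """Reduce: merge two partial results into one."""
    return (a[0] + b[0], a[1] + b[1])


def parallel_mean(values, n_workers=4):
    """Compute the mean by mapping over chunks and combining the partials."""
    chunks = split_into_chunks(values, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    total, count = reduce(combine, partials, (0.0, 0))
    return total / count


data = [float(i) for i in range(1, 101)]
print(parallel_mean(data))  # 50.5
```

Because each worker returns only a small `(sum, count)` pair, the amount of data that has to travel between workers stays tiny no matter how large the chunks are.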
For application developers and data scientists, the gritty details of thread-safety, algorithm parallelism, and node coherence on a network are concealed by simple-to-use REST calls that are all documented here. In addition, H2O is an open-source project under the Apache v2 licence. All of the source code is on GitHub, there is an active Google Group mailing list, our nightly tests are open for perusal, and our JIRA ticketing system is also open for public use. Last, but not least, we regularly engage the machine learning community all over the nation with a very busy meetup schedule (so if you’re not in The Valley, no sweat, we’re probably coming to your area soon!), and we host our very own H2O World conference. We also sometimes host hack-a-thons at our campus in Mountain View, CA. Needless to say, H2O provides a lot of support for application developers.
In order to make the most out of H2O, there are some key conceptual pieces that are important to know before getting started. Mainly, it’s helpful to know about the different types of objects that live in H2O and what the rules of engagement are in the context of the REST API (which is what any non-JVM interface is all about).
Let’s get started!
The H2O Object System¶
H2O uses a distributed key-value store (the “DKV”) that contains pointers to the various objects of the H2O ecosystem. The DKV is a kind of biosphere in that it encapsulates all shared objects; however, it may not encapsulate all objects. Some shared objects are mutable by the client; some shared objects are read-only by the client, but are mutable by H2O (e.g. a model being constructed will change over time); and actions by the client may have side-effects on other clients (multi-tenancy is not a supported model of use, but it is possible for multiple clients to attach to a single H2O cloud).
Briefly, these objects are:
- Key: A key is an entry in the DKV that maps to an object in H2O.
- Frame: A Frame is a collection of Vec objects. It is a 2D array of elements.
- Vec: A Vec is a collection of Chunk objects. It is a 1D array of elements.
- Chunk: A Chunk holds a fraction of the BigData. It is a 1D array of elements.
- ModelMetrics: A collection of metrics for a given category of model.
- Model: A model is an immutable object having predict and metrics methods.
- Job: A Job is a non-blocking task that performs a finite amount of work.
Many of these objects have no meaning to a Python end-user, but to make sense of the objects available in this module it is helpful to understand how these objects map to objects in the JVM. After all, this module is an interface that allows the manipulation of a distributed system.
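To make the containment relationships concrete, here is a toy model of the hierarchy in plain Python. It is a sketch under stated assumptions, not the JVM classes themselves: the dataclasses below only mirror the descriptions in the list above, a plain `dict` stands in for the DKV, and the key `"prostate.hex"` is a hypothetical name.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Chunk:
    """A contiguous slice of one column's values."""
    values: List[Optional[float]]


@dataclass
class Vec:
    """One column: an ordered collection of chunks."""
    chunks: List[Chunk]

    def __len__(self):
        return sum(len(c.values) for c in self.chunks)

    def __iter__(self):
        for c in self.chunks:
            yield from c.values


@dataclass
class Frame:
    """A 2D table: named columns that all have the same length."""
    names: List[str]
    vecs: List[Vec]

    @property
    def shape(self):
        nrows = len(self.vecs[0]) if self.vecs else 0
        return (nrows, len(self.vecs))


# A plain dict stands in for the distributed key-value store.
dkv: Dict[str, object] = {}

age = Vec([Chunk([65.0, 72.0]), Chunk([70.0])])
vol = Vec([Chunk([0.0, 15.9]), Chunk([21.4])])
dkv["prostate.hex"] = Frame(names=["AGE", "VOL"], vecs=[age, vol])

frame = dkv["prostate.hex"]   # a key maps to an object
print(frame.shape)            # (3, 2)
print(list(frame.vecs[1]))    # [0.0, 15.9, 21.4]
```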
Objects In This Module¶
The objects that are of primary concern to the Python user are (in order of importance):
- IDs/Keys
- Frames
- Models
- ModelMetrics
- Jobs (to a lesser extent)
Each of these objects is described in greater detail in this documentation, but a few brief notes are provided here.
H2OFrame¶
An H2OFrame is a 2D array of uniformly-typed columns. Data in H2O is compressed (often achieving 2-4x better compression than gzip on disk) and is held in the JVM heap (i.e. data is “in memory”), not in the Python process’s local memory. The H2OFrame is an iterable (supporting list comprehensions). An H2OFrame object is therefore a wrapper on a list that supports various types of operations that may or may not be lazy. Here’s an example that computes the column means for all columns in the H2OFrame:
>>> df = h2o.import_file(path="smalldata/logreg/prostate.csv") # import prostate data
>>>
>>> colmeans = df.mean() # compute column means
>>>
>>> colmeans # print the results
[5.843333333333335, 3.0540000000000007, 3.7586666666666693, 1.1986666666666672]
Lazy expressions will be discussed only briefly in the coming sections, as they are not necessarily going to be integral to the practicing data scientist. Their primary purpose is to cut down on the chatter between the client (a.k.a. the Python interface) and H2O. Lazy expressions are Katamari’d together and only ever evaluated when some piece of output is requested (e.g. print-to-screen).
The set of operations on an H2OFrame is described in a dedicated chapter, but in general, this set of operations closely resembles those that may be performed on an R data.frame. This includes all types of slicing (with complex conditionals), broadcasting operations, and a slew of math operations for transforming and mutating a Frame – all the while the actual Big Data is sitting in the H2O cloud. The semantics for modifying a Frame closely resemble R’s copy-on-modify semantics, except when it comes to mutating a Frame in place. For example, it’s possible to assign all occurrences of the number 0 in a column to missing (or NA in R parlance) as demonstrated in the following snippet:
>>> df = h2o.import_file(path="smalldata/logreg/prostate.csv") # import prostate data
>>>
>>> vol = df['VOL'] # select the VOL column
>>>
>>> vol[vol == 0] = None # 0 VOL means 'missing'
After this operation, vol has been permanently mutated in place (it is not a copy!).
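The difference between mutating in place and working on a copy can be shown with an ordinary Python list. This is only an analogy for the semantics described above; the function name `set_zeros_to_missing` and the sample values are invented for the example.

```python
import copy


def set_zeros_to_missing(column):
    """Mutate `column` in place: every 0 becomes None (missing)."""
    for i, value in enumerate(column):
        if value == 0:
            column[i] = None


vol = [0, 15.9, 0, 21.4]
alias = vol                 # a second name for the same object
snapshot = copy.copy(vol)   # an independent copy taken before the change

set_zeros_to_missing(vol)

print(vol)       # [None, 15.9, None, 21.4]
print(alias)     # same object, so it shows the change too
print(snapshot)  # [0, 15.9, 0, 21.4], the copy is untouched
```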
ExprNode¶
In the guts of this module is the Expr class, which defines objects holding the cumulative, unevaluated expressions that may become H2OFrame objects. For example:
>>> fr = h2o.import_file(path="smalldata/logreg/prostate.csv") # import prostate data
>>>
>>> a = fr + 3.14159 # "a" is now an Expr
>>>
>>> type(a) # <class 'h2o.expr.Expr'>
These objects are not as important to distinguish at the user level, and all operations can be performed with the mental model of operating on 2D frames (i.e. everything is an H2OFrame).
In the previous snippet, a has not yet triggered any big data evaluation and is, in fact, a pending computation. Once a is evaluated, it stays evaluated. Additionally, all dependent subparts composing a are also evaluated.
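The evaluate-once behaviour can be sketched with a tiny expression tree. This is a toy model, not the module's Expr class: `LazyExpr`, its `evaluate` method, and the `evaluations` counter are names invented for this illustration.

```python
class LazyExpr:
    """A node in an expression tree; nothing runs until `evaluate()` is called."""

    evaluations = 0  # counts how many pending nodes were actually computed

    def __init__(self, op, *children):
        self.op = op
        self.children = children
        self._cache = None
        self._is_evaluated = False

    @classmethod
    def leaf(cls, values):
        """Wrap data that already exists, so there is nothing left to compute."""
        node = cls("data")
        node._cache = list(values)
        node._is_evaluated = True
        return node

    def __add__(self, scalar):
        return LazyExpr(lambda xs: [x + scalar for x in xs], self)

    def __mul__(self, scalar):
        return LazyExpr(lambda xs: [x * scalar for x in xs], self)

    def evaluate(self):
        if not self._is_evaluated:
            (child,) = self.children
            # Evaluating a node first evaluates the subparts it depends on.
            self._cache = self.op(child.evaluate())
            self._is_evaluated = True
            LazyExpr.evaluations += 1
        return self._cache


fr = LazyExpr.leaf([1, 2, 3])
a = fr + 10                   # pending, nothing computed yet
b = a * 2                     # still pending
print(LazyExpr.evaluations)   # 0
print(b.evaluate())           # [22, 24, 26]
print(LazyExpr.evaluations)   # 2: both a and b were computed
print(b.evaluate())           # cached, no further work
print(LazyExpr.evaluations)   # still 2
```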
This module relies on reference counting of python objects to dispose of out-of-scope objects. The Expr class destroys objects and their big data counterparts in the H2O cloud using a remove call:
>>> fr = h2o.import_file(path="smalldata/logreg/prostate.csv") # import prostate data
>>>
>>> h2o.remove(fr) # remove prostate data
>>> fr # attempting to use fr results in a ValueError
Notice that attempting to use the object after a remove call has been issued will result in a ValueError. Therefore, any working references may not be cleaned up, but they will no longer be functional. Deleting an unevaluated expression will not delete all subparts.
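A dangling reference of this kind can be modelled with a handle that looks its data up by key. The sketch below is hypothetical throughout (`FrameHandle`, `remove`, and the `store` dict are not part of the module); it only shows why every remaining reference stops working once the backing data is gone.

```python
class FrameHandle:
    """Client-side handle that refers to data stored elsewhere by key."""

    def __init__(self, store, key):
        self._store = store
        self._key = key

    def values(self):
        if self._key not in self._store:
            raise ValueError("frame %r has been removed" % self._key)
        return self._store[self._key]


def remove(store, handle):
    """Delete the backing data; existing handles are left dangling."""
    store.pop(handle._key, None)


store = {"fr1": [1, 2, 3]}
fr = FrameHandle(store, "fr1")
other_ref = fr                # a second working reference

remove(store, fr)

try:
    other_ref.values()
except ValueError as err:
    print("ValueError:", err)
```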
Models¶
The model-building experience with this module is unique, especially for those coming from a background in scikit-learn. Instead of using objects to build the model, builder functions are provided in the top-level module, and the result of a call is a model object belonging to one of the following categories:
- Regression
- Binomial
- Multinomial
- Clustering
- Autoencoder
To better demonstrate this concept, refer to the following example:
>>> fr = h2o.import_file(path="smalldata/logreg/prostate.csv") # import prostate data
>>>
>>> fr[1] = fr[1].asfactor() # make 2nd column a factor
>>>
>>> m = h2o.glm(x=fr[2:], y=fr[1]) # build a glm with a method call
>>>
>>> m.__class__ # <class 'h2o.model.binomial.H2OBinomialModel'>
>>>
>>> m.show() # print the model details
>>>
>>> m.summary() # print a model summary
As you can see in the example, the result of the GLM call is a binomial model. This example also showcases an important feature-munging step needed for GLM to perform a classification task rather than a regression task. Namely, the second column is initially read as a numeric column, but it must be changed to a factor by way of the operation asfactor. Let’s take a look at this more deeply:
>>> fr = h2o.import_file(path="smalldata/logreg/prostate.csv") # import prostate data
>>>
>>> fr[1].isfactor() # produces False
>>>
>>> m = h2o.gbm(x=fr[2:],y=fr[1]) # build the gbm
>>>
>>> m.__class__ # <class 'h2o.model.regression.H2ORegressionModel'>
>>>
>>> fr[1] = fr[1].asfactor() # cast the 2nd column to a factor column
>>>
>>> fr[1].isfactor() # produces True
>>>
>>> m = h2o.gbm(x=fr[2:],y=fr[1]) # build the gbm
>>>
>>> m.__class__ # <class 'h2o.model.binomial.H2OBinomialModel'>
The above example shows how to properly deal with numeric columns you would like to use in a classification setting. Additionally, H2O can perform on-the-fly scoring of validation data and provide a host of metrics on the validation and training data. Here’s an example of this functionality, where we additionally split the data set into three pieces for training, validation, and finally testing:
>>> fr = h2o.import_file(path="smalldata/logreg/prostate.csv") # import prostate
>>>
>>> fr[1] = fr[1].asfactor() # cast to factor
>>>
>>> r = fr[0].runif() # Random UNIform numbers, one per row
>>>
>>> train = fr[ r < 0.6 ] # 60% for training data
>>>
>>> valid = fr[ (0.6 <= r) & (r < 0.9) ] # 30% for validation
>>>
>>> test = fr[ 0.9 <= r ] # 10% for testing
>>>
>>> m = h2o.deeplearning(x=train[2:],y=train[1],validation_x=valid[2:],validation_y=valid[1]) # build a deeplearning with a validation set (yes it's this simple)
>>>
>>> m # display the model summary by default (can also call m.show())
>>>
>>> m.show() # equivalent to the above
>>>
>>> m.model_performance() # show the performance on the training data (can also be m.performance(train=True))
>>>
>>> m.model_performance(valid=True) # show the performance on the validation data
>>>
>>> m.model_performance(test_data=test) # score and compute new metrics on the test data!
Expanding on this example, there are a number of ways of querying a model for its attributes. Here are some examples of how to do just that:
>>> m.mse() # MSE on the training data
>>>
>>> m.mse(valid=True) # MSE on the validation data
>>>
>>> m.r2() # R^2 on the training data
>>>
>>> m.r2(valid=True) # R^2 on the validation data
>>>
>>> m.confusion_matrix() # confusion matrix for max F1
>>>
>>> m.confusion_matrix("tpr") # confusion matrix for max true positive rate
>>>
>>> m.confusion_matrix("max_per_class_error") # etc.
All of our models support various accessor methods such as these. The following section will discuss model metrics in greater detail.
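For readers who want to see what such accessors report, the sketch below computes MSE, R^2, and a confusion matrix at the threshold that maximizes F1, using the textbook definitions in plain Python. It is not H2O's code, the function names are invented here, and H2O's own threshold search may differ in detail.

```python
def mse(actual, predicted):
    """Mean squared error between two equal-length sequences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r2(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def confusion_counts(labels, scores, threshold):
    """Return (tn, fp, fn, tp), calling a row positive when score >= threshold."""
    tn = fp = fn = tp = 0
    for y, s in zip(labels, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def f1_score(counts):
    """F1 = 2*tp / (2*tp + fp + fn); defined as 0 when there are no positives."""
    tn, fp, fn, tp = counts
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def max_f1_threshold(labels, scores):
    """Try every observed score as a threshold and keep the one with the best F1."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_score(confusion_counts(labels, scores, t)))


actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]
print(mse(actual, predicted), r2(actual, predicted))

labels = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]
best = max_f1_threshold(labels, scores)
print(best, confusion_counts(labels, scores, best))  # 0.35 (2, 1, 0, 3)
```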
On a final note, each of H2O’s algorithms handles missing values (colloquially, “NA”) and categorical data automatically, although the handling differs from algorithm to algorithm. You can find out more about each of the individual differences at the following link: http://docs2.h2o.ai/datascience/top.html
Metrics¶
H2O models exhibit a wide array of metrics for each of the model categories:
- Clustering
- Binomial
- Multinomial
- Regression
- AutoEncoder
In turn, each of these categories is associated with a corresponding H2OModelMetrics class.
All algorithm calls return at least one type of metrics: the training set metrics. When building a model in H2O, you can optionally provide a validation set for on-the-fly evaluation of holdout data. If the validation set is provided, then two types of metrics are returned: the training set metrics and the validation set metrics.
In addition to the metrics that can be retrieved at model-build time, there is a possible third type of metrics available post-build for the final holdout test set that contains data that does not appear in either the training or validation sets: the test set metrics. While the returned object is an H2OModelMetrics rather than an H2O model, it can be queried in the same exact way. Here’s an example:
>>> fr = h2o.import_file(path="smalldata/iris/iris_wheader.csv") # import iris
>>>
>>> r = fr[0].runif() # generate a random vector for splitting
>>>
>>> train = fr[ r < 0.6 ] # split out 60% for training
>>>
>>> valid = fr[ (0.6 <= r) & (r < 0.9) ] # split out 30% for validation
>>>
>>> test = fr[ 0.9 <= r ] # split out 10% for testing
>>>
>>> my_model = h2o.glm(x=train[1:], y=train[0], validation_x=valid[1:], validation_y=valid[0]) # build a GLM
>>>
>>> my_model.coef() # print the GLM coefficients, can also perform my_model.coef_norm() to get the normalized coefficients
>>>
>>> my_model.null_deviance() # get the null deviance from the training set metrics
>>>
>>> my_model.residual_deviance() # get the residual deviance from the training set metrics
>>>
>>> my_model.null_deviance(valid=True) # get the null deviance from the validation set metrics (similar for residual deviance)
>>>
>>> # now generate a new metrics object for the test hold-out data:
>>>
>>> my_metrics = my_model.model_performance(test_data=test) # create the new test set metrics
>>>
>>> my_metrics.null_degrees_of_freedom() # returns the test null dof
>>>
>>> my_metrics.residual_deviance() # returns the test res. deviance
>>>
>>> my_metrics.aic() # returns the test aic
As you can see, the new model metrics object generated by calling model_performance on the model object supports the same metric accessor methods as a model. For a complete list of the available metrics for various model categories, please refer to the “Metrics in H2O” section of this document.
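The train/validation/test splits in the examples above all follow the same recipe: draw one uniform random number per row, then keep the rows whose number falls in a given interval. Here is a plain-Python sketch of that recipe; the function `three_way_split` and its parameters are invented for this illustration, and Python's `random` module stands in for runif.

```python
import random


def three_way_split(rows, seed=42, train_frac=0.6, valid_frac=0.3):
    """Assign each row to train, valid, or test using one uniform draw per row."""
    rng = random.Random(seed)
    r = [rng.random() for _ in rows]   # one number in [0, 1) per row
    cut = train_frac + valid_frac
    train = [row for row, u in zip(rows, r) if u < train_frac]
    valid = [row for row, u in zip(rows, r) if train_frac <= u < cut]
    test = [row for row, u in zip(rows, r) if cut <= u]
    return train, valid, test


rows = list(range(1000))
train, valid, test = three_way_split(rows)

# The three pieces are disjoint and together cover every row exactly once.
print(len(train), len(valid), len(test))
```

The split sizes are only approximately 60/30/10, because each row is assigned independently at random rather than by counting.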
Example of H2O on Hadoop¶
Here is a brief example of H2O on Hadoop:
import h2o
h2o.init(ip="192.168.1.10", port=54321)
-------------------------- ------------------------------------
H2O cluster uptime: 2 minutes 1 seconds 966 milliseconds
H2O cluster version: 0.1.27.1064
H2O cluster name: H2O_96762
H2O cluster total nodes: 4
H2O cluster total memory: 38.34 GB
H2O cluster total cores: 16
H2O cluster allowed cores: 80
H2O cluster healthy: True
-------------------------- ------------------------------------
pathDataTrain = ["hdfs://192.168.1.10/user/data/data_train.csv"]
pathDataTest = ["hdfs://192.168.1.10/user/data/data_test.csv"]
trainFrame = h2o.import_file(path=pathDataTrain)
testFrame = h2o.import_file(path=pathDataTest)
#Parse Progress: [##################################################] 100%
#Imported ['hdfs://192.168.1.10/user/data/data_train.csv'] into cluster with 60000 rows and 500 cols
#Parse Progress: [##################################################] 100%
#Imported ['hdfs://192.168.1.10/user/data/data_test.csv'] into cluster with 10000 rows and 500 cols
trainFrame[499]._name = "label"
testFrame[499]._name = "label"
model = h2o.gbm(x=trainFrame.drop("label"),
                y=trainFrame["label"],
                validation_x=testFrame.drop("label"),
                validation_y=testFrame["label"],
                ntrees=100,
                max_depth=10
                )
#gbm Model Build Progress: [##################################################] 100%
predictFrame = model.predict(testFrame)
model.model_performance(testFrame)
h2o¶
- class h2o.h2o.H2ODisplay(table=None, header=None, table_header=None, **kwargs)[source]¶
Pretty printing for H2O Objects; Handles both IPython and vanilla console display
- h2o.h2o.as_list(data, use_pandas=True)[source]¶
Convert an H2O data object into a python-specific object.
WARNING: This will pull all data local!
If Pandas is available (and use_pandas is True), then pandas will be used to parse the data frame. Otherwise, a list-of-lists populated by character data will be returned (so the types of data will all be str).
- data : H2OFrame
- An H2O data object.
- use_pandas : bool
- Try to use pandas for reading in the data.
Returns: List of lists (Rows x Columns).
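When pandas is not available and every cell comes back as a str, the values usually need to be converted by hand. The sketch below shows one way to do that in plain Python. It makes several assumptions that the description above does not confirm: that the first row holds the column names, and that missing cells appear as an empty string or “NA”. The helper names `coerce_cell` and `coerce_table` are invented for this example.

```python
def coerce_cell(text):
    """Turn one str cell into int, float, or None; leave other text unchanged."""
    if text == "" or text == "NA":   # assumed markers for a missing value
        return None
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text


def coerce_table(rows, has_header=True):
    """Split off the header row (if any) and convert every remaining cell."""
    header, body = (rows[0], rows[1:]) if has_header else (None, rows)
    typed = [[coerce_cell(cell) for cell in row] for row in body]
    return header, typed


# Stand-in for the list-of-lists of strings described above.
raw = [["sepal_len", "class"],
       ["5.1", "setosa"],
       ["4.9", "NA"]]

header, typed = coerce_table(raw)
print(header)  # ['sepal_len', 'class']
print(typed)   # [[5.1, 'setosa'], [4.9, None]]
```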
- h2o.h2o.autoencoder(x, training_frame=None, model_id=None, overwrite_with_best_model=None, checkpoint=None, use_all_factor_levels=None, activation=None, hidden=None, epochs=None, train_samples_per_iteration=None, target_ratio_comm_to_comp=None, seed=None, adaptive_rate=None, rho=None, epsilon=None, rate=None, rate_annealing=None, rate_decay=None, momentum_start=None, momentum_ramp=None, momentum_stable=None, nesterov_accelerated_gradient=None, input_dropout_ratio=None, hidden_dropout_ratios=None, l1=None, l2=None, max_w2=None, initial_weight_distribution=None, initial_weight_scale=None, loss=None, distribution=None, tweedie_power=None, score_interval=None, score_training_samples=None, score_duty_cycle=None, classification_stop=None, regression_stop=None, quiet_mode=None, max_confusion_matrix_size=None, max_hit_ratio_k=None, balance_classes=None, class_sampling_factors=None, max_after_balance_size=None, diagnostics=None, variable_importances=None, fast_mode=None, ignore_const_cols=None, force_load_balance=None, replicate_training_data=None, single_node_mode=None, shuffle_training_data=None, sparse=None, col_major=None, average_activation=None, sparsity_beta=None, max_categorical_features=None, reproducible=None, export_weights_and_biases=None)[source]¶
Build an unsupervised autoencoder using H2O Deep Learning.
- x : H2OFrame
- An H2OFrame containing the predictors in the model.
- training_frame : H2OFrame
- (Optional) An H2OFrame. Only used to retrieve weights, offset, or nfolds columns, if they aren’t already provided in x.
- model_id : str
- (Optional) The unique id assigned to the resulting model. If none is given, an id will automatically be generated.
- overwrite_with_best_model : bool
- Logical. If True, overwrite the final model with the best model found during training. Defaults to True.
- checkpoint : H2ODeepLearningModel
- Model checkpoint (either key or H2ODeepLearningModel) to resume training with.
- use_all_factor_levels : bool
- Logical. Use all factor levels of categorical variables. Otherwise the first factor level is omitted (without loss of accuracy). Useful for variable importances and auto-enabled for autoencoder.
- activation : str
- A string indicating the activation function to use. Must be either “Tanh”, “TanhWithDropout”, “Rectifier”, “RectifierWithDropout”, “Maxout”, or “MaxoutWithDropout”
- hidden : list
- Hidden layer sizes (e.g. [100, 100])
- epochs : float
- How many times the dataset should be iterated (streamed), can be fractional
- train_samples_per_iteration : int
- Number of training samples (globally) per MapReduce iteration. Special values are: 0 one epoch; -1 all available data (e.g., replicated training data); or -2 auto-tuning (default)
- target_ratio_comm_to_comp : float
- Target ratio of communication overhead to computation. Only for multi-node operation and train_samples_per_iteration=-2 (auto-tuning). Higher values can lead to faster convergence.
- seed : int
- Seed for random numbers (affects sampling) - Note: only reproducible when running single threaded
- adaptive_rate : bool
- Logical. Adaptive learning rate (ADADELTA)
- rho : float
- Adaptive learning rate time decay factor (similarity to prior updates)
- epsilon : float
- Adaptive learning rate parameter, similar to learn rate annealing during initial training phase. Typical values are between 1.0e-10 and 1.0e-4
- rate : float
- Learning rate (higher => less stable, lower => slower convergence)
- rate_annealing : float
- Learning rate annealing: rate / (1 + rate_annealing * samples)
- rate_decay : float
- Learning rate decay factor between layers (N-th layer: rate * alpha^(N-1), with alpha = rate_decay)
- momentum_start : float
- Initial momentum at the beginning of training (try 0.5)
- momentum_ramp : int
- Number of training samples for which momentum increases
- momentum_stable : float
- Final momentum after the ramp is over (try 0.99)
- nesterov_accelerated_gradient : bool
- Logical. Use Nesterov accelerated gradient (recommended)
- input_dropout_ratio : float
- A fraction of the features for each training row to be omitted from training in order to improve generalization (dimension sampling).
- hidden_dropout_ratios : float
- Hidden layer dropout ratios (can improve generalization); specify one value per hidden layer. Defaults to 0.5.
- l1 : float
- L1 regularization (can add stability and improve generalization, causes many weights to become 0)
- l2 : float
- L2 regularization (can add stability and improve generalization, causes many weights to be small)
- max_w2 : float
- Constraint for squared sum of incoming weights per unit (e.g. Rectifier)
- initial_weight_distribution : str
- Can be “Uniform”, “UniformAdaptive”, or “Normal”
- initial_weight_scale : str
- Uniform: -value ... value, Normal: stddev
- loss : str
- Loss function: “Automatic”, “CrossEntropy” (for classification only), “Quadratic”, “Absolute” (experimental) or “Huber” (experimental)
- distribution : str
- A character string. The distribution function of the response. Must be “AUTO”, “bernoulli”, “multinomial”, “poisson”, “gamma”, “tweedie”, “laplace”, “huber” or “gaussian”
- tweedie_power : float
- Tweedie power (only for Tweedie distribution, must be between 1 and 2)
- score_interval : int
- Shortest time interval (in secs) between model scoring
- score_training_samples : int
- Number of training set samples for scoring (0 for all)
- score_duty_cycle : float
- Maximum duty cycle fraction for scoring (lower: more training, higher: more scoring)
- classification_stop : float
- Stopping criterion for classification error fraction on training data (-1 to disable)
- regression_stop : float
- Stopping criterion for regression error (MSE) on training data (-1 to disable)
- stopping_rounds : int
- Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve (by stopping_tolerance) for k=stopping_rounds scoring events. Can only trigger after at least 2k scoring events. Use 0 to disable.
- stopping_metric : str
- Metric to use for convergence checking; only used when stopping_rounds > 0. Can be one of “AUTO” or “MSE”.
- stopping_tolerance : float
- Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)
- quiet_mode : bool
- Enable quiet mode for less output to standard output
- max_confusion_matrix_size : int
- Max. size (number of classes) for confusion matrices to be shown
- max_hit_ratio_k : float
- Max number (top K) of predictions to use for hit ratio computation (for multi-class only, 0 to disable)
- balance_classes : bool
- Balance training data class counts via over/under-sampling (for imbalanced data)
- class_sampling_factors : list
- Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes.
- max_after_balance_size : float
- Maximum relative size of the training data after balancing class counts (can be less than 1.0)
- diagnostics : bool
- Enable diagnostics for hidden layers
- variable_importances : bool
- Compute variable importances for input features (Gedeon method); can be slow for large networks
- fast_mode : bool
- Enable fast mode (minor approximations in back-propagation)
- ignore_const_cols : bool
- Ignore constant columns (no information can be gained anyway)
- force_load_balance : bool
- Force extra load balancing to increase training speed for small datasets (to keep all cores busy)
- replicate_training_data : bool
- Replicate the entire training dataset onto every node for faster training
- single_node_mode : bool
- Run on a single node for fine-tuning of model parameters
- shuffle_training_data : bool
- Enable shuffling of training data (recommended if training data is replicated and train_samples_per_iteration is close to numRows * numNodes)
- sparse : bool
- Sparse data handling (Experimental)
- col_major : bool
- Use a column major weight matrix for input layer. Can speed up forward propagation, but might slow down backpropagation (Experimental)
- average_activation : float
- Average activation for sparse auto-encoder (Experimental)
- sparsity_beta : float
- Sparsity regularization (Experimental)
- max_categorical_features : int
- Max. number of categorical features, enforced via hashing (Experimental)
- reproducible : bool
- Force reproducibility on small data (will be slow - only uses 1 thread)
- export_weights_and_biases : bool
- Whether to export Neural Network weights and biases to H2O Frames
Returns: H2OAutoEncoderModel
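The learning-rate and momentum parameters above are easier to follow with numbers plugged in. The sketch below implements the two formulas given for rate_annealing and rate_decay in plain Python. The momentum function assumes a linear ramp from momentum_start to momentum_stable, which the parameter descriptions do not state explicitly. The function names are invented for this illustration and are not part of the module.

```python
def annealed_rate(rate, rate_annealing, samples):
    """Learning rate after `samples` training samples: rate / (1 + rate_annealing * samples)."""
    return rate / (1.0 + rate_annealing * samples)


def layer_rate(rate, rate_decay, layer):
    """Learning rate for the N-th layer (1-based): rate * rate_decay ** (N - 1)."""
    return rate * rate_decay ** (layer - 1)


def momentum_at(samples, momentum_start, momentum_ramp, momentum_stable):
    """Momentum after `samples` training samples, ASSUMING a linear ramp."""
    if momentum_ramp <= 0 or samples >= momentum_ramp:
        return momentum_stable
    frac = samples / momentum_ramp
    return momentum_start + frac * (momentum_stable - momentum_start)


# After one million samples, annealing of 1e-6 has roughly halved the rate.
print(annealed_rate(0.005, 1e-6, 1_000_000))

# With a decay factor of 0.5, the third layer trains at a quarter of the base rate.
print(layer_rate(0.01, 0.5, 3))

# Halfway through the ramp, momentum sits about midway between start and stable.
print(momentum_at(500_000, 0.5, 1_000_000, 0.99))
```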
- h2o.h2o.cluster_status()[source]¶
Retrieve information on the status of the cluster running H2O.
Note: this is not really a cluster status; it is a status check for the node the client is connected to. This can be confusing, because the call can come back without warning, yet a user who then tries any remote send will get a “cloud sick” warning.
Returns: None
- h2o.h2o.create_frame(id=None, rows=10000, cols=10, randomize=True, value=0, real_range=100, categorical_fraction=0.2, factors=100, integer_fraction=0.2, integer_range=100, binary_fraction=0.1, binary_ones_fraction=0.02, missing_fraction=0.01, response_factors=2, has_response=False, seed=None)[source]¶
Data Frame Creation in H2O. Creates a data frame in H2O with real-valued, categorical, integer, and binary columns specified by the user.
- id : str
- A string indicating the destination key. If empty, this will be auto-generated by H2O.
- rows : int
- The number of rows of data to generate.
- cols : int
- The number of columns of data to generate. Excludes the response column if has_response == True.
- randomize : bool
- A logical value indicating whether data values should be randomly generated. This must be True if either categorical_fraction or integer_fraction is non-zero.
- value : int
- If randomize == False, then all real-valued entries will be set to this value.
- real_range : float
- The range of randomly generated real values.
- categorical_fraction : float
- The fraction of total columns that are categorical.
- factors : int
- The number of (unique) factor levels in each categorical column.
- integer_fraction : float
- The fraction of total columns that are integer-valued.
- integer_range : list
- The range of randomly generated integer values.
- binary_fraction : float
- The fraction of total columns that are binary-valued.
- binary_ones_fraction : float
- The fraction of values in a binary column that are set to 1.
- missing_fraction : float
- The fraction of total entries in the data frame that are set to NA.
- response_factors : int
- If has_response == True, then this is the number of factor levels in the response column.
- has_response : bool
- A logical value indicating whether an additional response column should be prepended to the final H2O data frame. If set to True, the total number of columns will be cols+1.
- seed : int
- A seed used to generate random values when randomize == True.
Returns: the H2OFrame that was created
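To see how the fraction parameters interact, the sketch below works out how many columns of each type a given call would ask for. It is a guess at the arithmetic, not H2O's code: it assumes each count is the fraction times cols, truncated to an integer, and that real-valued columns take whatever is left. The function name `column_type_counts` is invented for this example.

```python
def column_type_counts(cols=10, categorical_fraction=0.2,
                       integer_fraction=0.2, binary_fraction=0.1):
    """Split `cols` columns among the four column types.

    ASSUMPTION: each count is fraction * cols truncated to an integer,
    and the remaining columns are real-valued.
    """
    fractions = (categorical_fraction, integer_fraction, binary_fraction)
    if any(f < 0 for f in fractions) or sum(fractions) > 1.0:
        raise ValueError("fractions must be non-negative and sum to at most 1")
    counts = {
        "categorical": int(categorical_fraction * cols),
        "integer": int(integer_fraction * cols),
        "binary": int(binary_fraction * cols),
    }
    counts["real"] = cols - sum(counts.values())
    return counts


# With the default arguments from the signature above:
print(column_type_counts())
# {'categorical': 2, 'integer': 2, 'binary': 1, 'real': 5}
```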
- h2o.h2o.deeplearning(x, y=None, validation_x=None, validation_y=None, training_frame=None, model_id=None, overwrite_with_best_model=None, validation_frame=None, checkpoint=None, autoencoder=None, use_all_factor_levels=None, activation=None, hidden=None, epochs=None, train_samples_per_iteration=None, target_ratio_comm_to_comp=None, seed=None, adaptive_rate=None, rho=None, epsilon=None, rate=None, rate_annealing=None, rate_decay=None, momentum_start=None, momentum_ramp=None, momentum_stable=None, nesterov_accelerated_gradient=None, input_dropout_ratio=None, hidden_dropout_ratios=None, l1=None, l2=None, max_w2=None, initial_weight_distribution=None, initial_weight_scale=None, loss=None, distribution=None, tweedie_power=None, score_interval=None, score_training_samples=None, score_validation_samples=None, score_duty_cycle=None, classification_stop=None, regression_stop=None, stopping_rounds=None, stopping_metric=None, stopping_tolerance=None, quiet_mode=None, max_confusion_matrix_size=None, max_hit_ratio_k=None, balance_classes=None, class_sampling_factors=None, max_after_balance_size=None, score_validation_sampling=None, diagnostics=None, variable_importances=None, fast_mode=None, ignore_const_cols=None, force_load_balance=None, replicate_training_data=None, single_node_mode=None, shuffle_training_data=None, sparse=None, col_major=None, average_activation=None, sparsity_beta=None, max_categorical_features=None, reproducible=None, export_weights_and_biases=None, offset_column=None, weights_column=None, nfolds=None, fold_column=None, fold_assignment=None, keep_cross_validation_predictions=None)[source]¶
Build a supervised Deep Learning model. Performs Deep Learning neural networks on an H2OFrame.
- x : H2OFrame
- An H2OFrame containing the predictors in the model.
- y : H2OFrame
- An H2OFrame of the response variable in the model.
- training_frame : H2OFrame
- (Optional) An H2OFrame. Only used to retrieve weights, offset, or nfolds columns, if they aren’t already provided in x.
- model_id : str
- (Optional) The unique id assigned to the resulting model. If none is given, an id will automatically be generated.
- overwrite_with_best_model : bool
- Logical. If True, overwrite the final model with the best model found during training. Defaults to True.
- validation_frame : H2OFrame
- (Optional) An H2OFrame object indicating the validation dataset used to construct the confusion matrix. If left blank, this defaults to the training data when nfolds = 0
- checkpoint : H2ODeepLearningModel
- Model checkpoint (either key or H2ODeepLearningModel) to resume training with.
- autoencoder : bool
- Enable auto-encoder for model building.
- use_all_factor_levels : bool
- Logical. Use all factor levels of categorical variables. Otherwise the first factor level is omitted (without loss of accuracy). Useful for variable importances and auto-enabled for autoencoder.
- activation : str
- A string indicating the activation function to use. Must be either “Tanh”, “TanhWithDropout”, “Rectifier”, “RectifierWithDropout”, “Maxout”, or “MaxoutWithDropout”
- hidden : list
- Hidden layer sizes (e.g. [100, 100])
- epochs : float
- How many times the dataset should be iterated (streamed), can be fractional
- train_samples_per_iteration : int
- Number of training samples (globally) per MapReduce iteration. Special values are: 0 one epoch; -1 all available data (e.g., replicated training data); or -2 auto-tuning (default)
- target_ratio_comm_to_comp : float
- Target ratio of communication overhead to computation. Only for multi-node operation and train_samples_per_iteration=-2 (auto-tuning). Higher values can lead to faster convergence.
- seed : int
- Seed for random numbers (affects sampling) - Note: only reproducible when running single threaded
- adaptive_rate : bool
- Logical. Adaptive learning rate (ADADELTA)
- rho : float
- Adaptive learning rate time decay factor (similarity to prior updates)
- epsilon : float
- Adaptive learning rate parameter, similar to learn rate annealing during initial training phase. Typical values are between 1.0e-10 and 1.0e-4
- rate : float
- Learning rate (higher => less stable, lower => slower convergence)
- rate_annealing : float
- Learning rate annealing: rate / (1 + rate_annealing * samples)
- rate_decay : float
- Learning rate decay factor between layers (N-th layer: rate * alpha^(N-1), where alpha is rate_decay)
- momentum_start : float
- Initial momentum at the beginning of training (try 0.5)
- momentum_ramp : float
- Number of training samples for which momentum increases
- momentum_stable : float
- Final momentum after the ramp is over (try 0.99)
- nesterov_accelerated_gradient : bool
- Logical. Use Nesterov accelerated gradient (recommended)
- input_dropout_ratio : float
- A fraction of the features for each training row to be omitted from training in order to improve generalization (dimension sampling).
- hidden_dropout_ratios : float
- Hidden layer dropout ratios (can improve generalization). Specify one value per hidden layer; defaults to 0.5.
- l1 : float
- L1 regularization (can add stability and improve generalization, causes many weights to become 0)
- l2 : float
- L2 regularization (can add stability and improve generalization, causes many weights to be small)
- max_w2 : float
- Constraint for squared sum of incoming weights per unit (e.g. Rectifier)
- initial_weight_distribution : str
- Can be “Uniform”, “UniformAdaptive”, or “Normal”
- initial_weight_scale : float
- Uniform: -value ... value, Normal: stddev
- loss : str
- Loss function: “Automatic”, “CrossEntropy” (for classification only), “Quadratic”, “Absolute” (experimental) or “Huber” (experimental)
- distribution : str
- A character string. The distribution function of the response. Must be “AUTO”, “bernoulli”, “multinomial”, “poisson”, “gamma”, “tweedie”, “laplace”, “huber” or “gaussian”
- tweedie_power : float
- Tweedie power (only for Tweedie distribution, must be between 1 and 2)
- score_interval : int
- Shortest time interval (in secs) between model scoring
- score_training_samples : int
- Number of training set samples for scoring (0 for all)
- score_validation_samples : int
- Number of validation set samples for scoring (0 for all)
- score_duty_cycle : float
- Maximum duty cycle fraction for scoring (lower: more training, higher: more scoring)
- classification_stop : float
- Stopping criterion for classification error fraction on training data (-1 to disable)
- regression_stop : float
- Stopping criterion for regression error (MSE) on training data (-1 to disable)
- stopping_rounds : int
- Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve (by stopping_tolerance) for k=stopping_rounds scoring events. Can only trigger after at least 2k scoring events. Use 0 to disable.
- stopping_metric : str
- Metric to use for convergence checking, only for stopping_rounds > 0. Can be one of “AUTO”, “deviance”, “logloss”, “MSE”, “AUC”, “r2”, “misclassification”.
- stopping_tolerance : float
- Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)
- quiet_mode : bool
- Enable quiet mode for less output to standard output
- max_confusion_matrix_size : int
- Max. size (number of classes) for confusion matrices to be shown
- max_hit_ratio_k : float
- Max number (top K) of predictions to use for hit ratio computation (for multi-class only, 0 to disable)
- balance_classes : bool
- Balance training data class counts via over/under-sampling (for imbalanced data)
- class_sampling_factors : list
- Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes.
- max_after_balance_size : float
- Maximum relative size of the training data after balancing class counts (can be less than 1.0)
- score_validation_sampling :
- Method used to sample validation dataset for scoring
- diagnostics : bool
- Enable diagnostics for hidden layers
- variable_importances : bool
- Compute variable importances for input features (Gedeon method). Can be slow for large networks.
- fast_mode : bool
- Enable fast mode (minor approximations in back-propagation)
- ignore_const_cols : bool
- Ignore constant columns (no information can be gained anyway)
- force_load_balance : bool
- Force extra load balancing to increase training speed for small datasets (to keep all cores busy)
- replicate_training_data : bool
- Replicate the entire training dataset onto every node for faster training
- single_node_mode : bool
- Run on a single node for fine-tuning of model parameters
- shuffle_training_data : bool
- Enable shuffling of training data (recommended if training data is replicated and train_samples_per_iteration is close to numRows * numNodes)
- sparse : bool
- Sparse data handling (more efficient for data with lots of 0 values)
- col_major : bool
- Use a column major weight matrix for input layer. Can speed up forward propagation, but might slow down backpropagation (Experimental)
- average_activation : float
- Average activation for sparse auto-encoder (Experimental)
- sparsity_beta : float
- Sparsity regularization (Experimental)
- max_categorical_features : int
- Max. number of categorical features, enforced via hashing (Experimental)
- reproducible : bool
- Force reproducibility on small data (will be slow - only uses 1 thread)
- export_weights_and_biases : bool
- Whether to export Neural Network weights and biases to H2O Frames
- offset_column : H2OFrame
- Specify the offset column.
- weights_column : H2OFrame
- Specify the weights column.
- nfolds : int
- (Optional) Number of folds for cross-validation. If nfolds >= 2, then validation must remain empty.
- fold_column : H2OFrame
- (Optional) Column with cross-validation fold index assignment per observation
- fold_assignment : str
- Cross-validation fold assignment scheme, if fold_column is not specified. Must be “AUTO”, “Random” or “Modulo”.
- keep_cross_validation_predictions : bool
- Whether to keep the predictions of the cross-validation models
Returns: A new classifier or regression model.
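The rate_annealing and rate_decay formulas above can be evaluated by hand to see how quickly the learning rate shrinks. The following is a minimal sketch in plain Python; the helper names annealed_rate and layer_rate are hypothetical and are not part of the h2o module, and the sketch assumes that the alpha in the per-layer formula is the rate_decay value.

```python
def annealed_rate(rate, rate_annealing, samples):
    """Learning rate after `samples` training samples: rate / (1 + rate_annealing * samples)."""
    return rate / (1.0 + rate_annealing * samples)


def layer_rate(rate, rate_decay, layer):
    """Learning rate of the N-th layer (1-based): rate * rate_decay^(N-1).

    Assumption: the decay factor in the formula is the rate_decay parameter.
    """
    return rate * rate_decay ** (layer - 1)


# With rate_annealing = 1e-6, the rate is halved after one million samples.
print(annealed_rate(0.005, 1e-6, 1000000))

# With rate_decay = 0.5, each layer learns at half the rate of the previous one.
print([layer_rate(0.005, 0.5, n) for n in (1, 2, 3)])
```

A rate_decay of 1.0 leaves every layer at the same rate, and a rate_annealing of 0 keeps the rate constant over time.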
- h2o.h2o.download_all_logs(dirname='.', filename=None)[source]¶
Download H2O Log Files to Disk
- dirname : str, optional
- A character string indicating the directory that the log file should be saved in.
- filename : str, optional
- A string indicating the name that the log file should be saved to.
Returns: Path of logs written.
- h2o.h2o.download_csv(data, filename)[source]¶
Download an H2O data set to a CSV file on the local disk.
Warning: Files located on the H2O server may be very large! Make sure you have enough hard drive space to accommodate the entire file.
- data : H2OFrame
- An H2OFrame object to be downloaded.
- filename : str
- A string indicating the name that the CSV file should be saved to.
- h2o.h2o.download_pojo(model, path='', get_jar=True)[source]¶
Download the POJO for this model to the directory specified by path (no trailing slash!). If path is “”, then dump to screen.
- model : H2OModel
- Retrieve this model’s scoring POJO.
- path : str
- An absolute path to the directory where POJO should be saved.
- get_jar : bool
- Retrieve the h2o-genmodel.jar also.
- h2o.h2o.export_file(frame, path, force=False)[source]¶
Export a given H2OFrame to a path on the machine this python session is currently connected to. To view the current session, call h2o.cluster_info().
- frame : H2OFrame
- The Frame to save to disk.
- path : str
- The path to the save point on disk.
- force : bool
- Overwrite any preexisting file with the same path
Returns: None
- h2o.h2o.frame(frame_id, exclude='')[source]¶
Retrieve metadata for an id that points to a Frame.
- frame_id : str
- A pointer to a Frame in H2O.
Returns: Python dict containing the frame meta-information
- h2o.h2o.gbm(x, y, validation_x=None, validation_y=None, training_frame=None, model_id=None, distribution=None, tweedie_power=None, ntrees=None, max_depth=None, min_rows=None, learn_rate=None, sample_rate=None, col_sample_rate=None, nbins=None, nbins_top_level=None, nbins_cats=None, validation_frame=None, balance_classes=None, max_after_balance_size=None, seed=None, build_tree_one_node=None, nfolds=None, fold_column=None, fold_assignment=None, keep_cross_validation_predictions=None, score_each_iteration=None, offset_column=None, weights_column=None, do_future=None, checkpoint=None, stopping_rounds=None, stopping_metric=None, stopping_tolerance=None)[source]¶
Builds gradient boosted classification trees and gradient boosted regression trees on a parsed data set. The default distribution function will guess the model type based on the response column type. To run properly, the response column must be numeric for “gaussian” or an enum for “bernoulli” or “multinomial”.
- x : H2OFrame
- An H2OFrame containing the predictors in the model.
- y : H2OFrame
- An H2OFrame of the response variable in the model.
- training_frame : H2OFrame
- (Optional) An H2OFrame. Only used to retrieve weights, offset, or nfolds columns, if they aren’t already provided in x.
- model_id : str
- (Optional) The unique id assigned to the resulting model. If none is given, an id will automatically be generated.
- distribution : str
- A character string. The distribution function of the response. Must be “AUTO”, “bernoulli”, “multinomial”, “poisson”, “gamma”, “tweedie” or “gaussian”
- tweedie_power : float
- Tweedie power (only for Tweedie distribution, must be between 1 and 2)
- ntrees : int
- A non-negative integer that determines the number of trees to grow.
- max_depth : int
- Maximum depth to grow the tree.
- min_rows : int
- Minimum number of rows to assign to terminal nodes.
- learn_rate : float
- Learning rate (from 0.0 to 1.0)
- sample_rate : float
- Row sample rate (from 0.0 to 1.0)
- col_sample_rate : float
- Column sample rate (from 0.0 to 1.0)
- nbins : int
- For numerical columns (real/int), build a histogram of (at least) this many bins, then split at the best point.
- nbins_top_level : int
- For numerical columns (real/int), build a histogram of (at most) this many bins at the root level, then decrease by factor of two per level.
- nbins_cats : int
- For categorical columns (factors), build a histogram of this many bins, then split at the best point. Higher values can lead to more overfitting.
- validation_frame : H2OFrame
- An H2OFrame object indicating the validation dataset used to construct the confusion matrix. If left blank, this defaults to the training data when nfolds = 0.
- balance_classes : bool
- Logical. Indicates whether or not to balance training data class counts via over/under-sampling (for imbalanced data)
- max_after_balance_size : float
- Maximum relative size of the training data after balancing class counts (can be less than 1.0). Ignored if balance_classes is False, which is the default behavior.
- seed : int
- Seed for random numbers (affects sampling when balance_classes=True)
- build_tree_one_node : bool
- Run on one node only; no network overhead but fewer cpus used. Suitable for small datasets.
- nfolds : int
- (Optional) Number of folds for cross-validation. If nfolds >= 2, then validation must remain empty.
- fold_column : H2OFrame
- (Optional) Column with cross-validation fold index assignment per observation
- fold_assignment : str
- Cross-validation fold assignment scheme, if fold_column is not specified. Must be “AUTO”, “Random” or “Modulo”.
- keep_cross_validation_predictions : bool
- Whether to keep the predictions of the cross-validation models
- score_each_iteration : bool
- Attempts to score each tree.
- offset_column : H2OFrame
- Specify the offset column.
- weights_column : H2OFrame
- Specify the weights column.
- stopping_rounds : int
- Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve (by stopping_tolerance) for k=stopping_rounds scoring events. Can only trigger after at least 2k scoring events. Use 0 to disable.
- stopping_metric : str
- Metric to use for convergence checking, only for stopping_rounds > 0. Can be one of “AUTO”, “deviance”, “logloss”, “MSE”, “AUC”, “r2”, “misclassification”.
- stopping_tolerance : float
- Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)
Returns: A new classifier or regression model.
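The early-stopping rule shared by stopping_rounds, stopping_metric, and stopping_tolerance can be read as a comparison of moving averages. The sketch below is one plausible interpretation of that description in plain Python, not H2O's actual implementation. The function should_stop and its lower_is_better flag are hypothetical, and the sketch assumes a positive metric value.

```python
def should_stop(history, stopping_rounds, stopping_tolerance, lower_is_better=True):
    """Return True when the metric has stopped improving.

    history: metric value at each scoring event, oldest first.
    Assumption: "improve" means a relative change of at least
    stopping_tolerance against the moving average that precedes
    the last k moving averages.
    """
    k = stopping_rounds
    # Disabled with 0, and cannot trigger before 2k scoring events.
    if k <= 0 or len(history) < 2 * k:
        return False

    # Simple moving averages of length k over the scoring history.
    averages = [sum(history[i:i + k]) / k for i in range(len(history) - k + 1)]

    reference = averages[-k - 1]   # last average before the recent window
    recent = averages[-k:]         # the k most recent averages

    if lower_is_better:            # e.g. logloss, MSE, deviance
        improved = min(recent) < reference * (1.0 - stopping_tolerance)
    else:                          # e.g. AUC
        improved = max(recent) > reference * (1.0 + stopping_tolerance)
    return not improved


print(should_stop([1.0, 0.9, 0.8, 0.7, 0.6, 0.5], 2, 0.01))  # still improving
print(should_stop([0.5, 0.5, 0.5, 0.5, 0.5, 0.5], 2, 0.01))  # flat metric
```

Averaging over k events keeps a single noisy scoring event from ending training early, which is why the rule needs at least 2k events before it can fire.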
- h2o.h2o.get_frame(frame_id)[source]¶
Obtain a handle to the frame in H2O with the frame_id key.
Returns: H2OFrame
- h2o.h2o.get_future_model(future_model)[source]¶
Waits for the future model to finish building, and then returns the model.
- future_model : H2OModelFuture
- an H2OModelFuture object
Returns: H2OEstimator
- h2o.h2o.get_model(model_id)[source]¶
Return the specified model
- model_id : str
- The model identification in h2o
Returns: H2OEstimator
- h2o.h2o.glm(x, y, validation_x=None, validation_y=None, training_frame=None, model_id=None, validation_frame=None, max_iterations=None, beta_epsilon=None, solver=None, standardize=None, family=None, link=None, tweedie_variance_power=None, tweedie_link_power=None, alpha=None, prior=None, lambda_search=None, nlambdas=None, lambda_min_ratio=None, beta_constraints=None, offset_column=None, weights_column=None, nfolds=None, fold_column=None, fold_assignment=None, keep_cross_validation_predictions=None, intercept=None, Lambda=None, max_active_predictors=None, do_future=None, checkpoint=None)[source]¶
Build a Generalized Linear Model. Fit a generalized linear model, specified by a response variable, a set of predictors, and a description of the error distribution.
- x : H2OFrame
- An H2OFrame containing the predictors in the model.
- y : H2OFrame
- An H2OFrame of the response variable in the model.
- training_frame : H2OFrame
- (Optional) An H2OFrame. Only used to retrieve weights, offset, or nfolds columns, if they aren’t already provided in x.
- model_id : str
- (Optional) The unique id assigned to the resulting model. If none is given, an id will automatically be generated.
- validation_frame : H2OFrame
- An H2OFrame object containing the variables in the model.
- max_iterations : int
- A non-negative integer specifying the maximum number of iterations.
- beta_epsilon : float
- A non-negative number specifying the magnitude of the maximum difference between the coefficient estimates from successive iterations. Defines the convergence criterion for h2o.glm.
- solver : str
- A character string specifying the solver used: IRLSM (supports more features), L_BFGS (scales better for datasets with many columns)
- standardize : bool
- A logical value indicating whether the numeric predictors should be standardized to have a mean of 0 and a variance of 1 prior to training the models.
- family : str
- A character string specifying the distribution of the model: gaussian, binomial, poisson, gamma, tweedie.
- link : str
- A character string specifying the link function. The default is the canonical link for the family.
The supported links for each of the family specifications are:
“gaussian”: “identity”, “log”, “inverse”
“binomial”: “logit”, “log”
“poisson”: “log”, “identity”
“gamma”: “inverse”, “log”, “identity”
“tweedie”: “tweedie”
- tweedie_variance_power : int
- A numeric specifying the power for the variance function when family = “tweedie”.
- tweedie_link_power : int
- A numeric specifying the power for the link function when family = “tweedie”.
- alpha : float
- A numeric in [0, 1] specifying the elastic-net mixing parameter.
The elastic-net penalty is defined to be: $P(\alpha,\beta) = \frac{1-\alpha}{2}\|\beta\|_2^2 + \alpha\|\beta\|_1 = \sum_j \left[\frac{1-\alpha}{2}\beta_j^2 + \alpha|\beta_j|\right]$, making alpha = 1 the lasso penalty and alpha = 0 the ridge penalty.
- Lambda : float
- A non-negative shrinkage parameter for the elastic-net, which multiplies $P(\alpha,\beta)$ in the objective function. When Lambda = 0, no elastic-net penalty is applied and ordinary generalized linear models are fit.
- prior : float
- (Optional) A numeric specifying the prior probability of class 1 in the response when family = “binomial”. The default prior is the observational frequency of class 1.
- lambda_search : bool
- A logical value indicating whether to conduct a search over the space of lambda values starting from lambda max; the given Lambda is then interpreted as lambda min.
- nlambdas : int
- The number of lambda values to use when lambda_search = True.
- lambda_min_ratio : float
- Smallest value for lambda as a fraction of lambda max. By default, if the number of observations is greater than the number of variables, then lambda_min_ratio = 0.0001; if the number of observations is less than the number of variables, then lambda_min_ratio = 0.01.
- beta_constraints : H2OFrame
- An H2OFrame with the columns [“names”, “lower_bounds”, “upper_bounds”, “beta_given”], where each row corresponds to a predictor in the GLM. “names” contains the predictor names, “lower_bounds” and “upper_bounds” are the lower and upper bounds of beta, and “beta_given” is some supplied starting values for the
- offset_column : H2OFrame
- Specify the offset column.
- weights_column : H2OFrame
- Specify the weights column.
- nfolds : int
- (Optional) Number of folds for cross-validation. If nfolds >= 2, then validation must remain empty.
- fold_column : H2OFrame
- (Optional) Column with cross-validation fold index assignment per observation
- fold_assignment : str
- Cross-validation fold assignment scheme, if fold_column is not specified. Must be “AUTO”, “Random” or “Modulo”.
- keep_cross_validation_predictions : bool
- Whether to keep the predictions of the cross-validation models
- intercept : bool
- Logical, include constant term (intercept) in the model
- max_active_predictors : int
- (Optional) Convergence criteria for number of predictors when using L1 penalty.
Returns: A subclass of ModelBase is returned. The specific subclass depends on the machine learning task at hand (if it’s binomial classification, then an H2OBinomialModel is returned; if it’s regression, then an H2ORegressionModel is returned). The default print-out of the models is shown, but further GLM-specific information can be queried out of the object. Upon completion of the GLM, the resulting object has coefficients, normalized coefficients, residual/null deviance, aic, and a host of model metrics including MSE, AUC (for logistic regression), degrees of freedom, and confusion matrices.
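The elastic-net penalty from the alpha description can be checked numerically. The sketch below is plain Python with hypothetical helper names; it computes the penalty both in norm form and as a per-coefficient sum, and shows the lasso (alpha = 1) and ridge (alpha = 0) special cases. Lambda then scales this value in the objective function.

```python
def elastic_net_penalty(alpha, beta):
    """Norm form: (1 - alpha)/2 * ||beta||_2^2 + alpha * ||beta||_1."""
    l2_squared = sum(b * b for b in beta)
    l1 = sum(abs(b) for b in beta)
    return (1.0 - alpha) / 2.0 * l2_squared + alpha * l1


def elastic_net_penalty_by_term(alpha, beta):
    """Per-coefficient form: sum_j [(1 - alpha)/2 * beta_j^2 + alpha * |beta_j|]."""
    return sum((1.0 - alpha) / 2.0 * b * b + alpha * abs(b) for b in beta)


beta = [1.0, -2.0, 0.5]  # illustrative coefficients, not from any fitted model

print(elastic_net_penalty(0.5, beta))   # mixed penalty
print(elastic_net_penalty(1.0, beta))   # lasso: just the L1 norm
print(elastic_net_penalty(0.0, beta))   # ridge: half the squared L2 norm
```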
- h2o.h2o.glrm(x, validation_x=None, training_frame=None, validation_frame=None, k=None, max_iterations=None, transform=None, seed=None, ignore_const_cols=None, loss=None, multi_loss=None, loss_by_col=None, loss_by_col_idx=None, regularization_x=None, regularization_y=None, gamma_x=None, gamma_y=None, init_step_size=None, min_step_size=None, init=None, svd_method=None, user_y=None, user_x=None, expand_user_y=None, impute_original=None, recover_svd=None)[source]¶
Builds a generalized low rank model of an H2O dataset.
- k : int
- The rank of the resulting decomposition. This must be between 1 and the number of columns in the training frame inclusive.
- max_iterations : int
- The maximum number of iterations to run the optimization loop. Each iteration consists of an update of the X matrix, followed by an update of the Y matrix.
- transform : str
- A character string that indicates how the training data should be transformed before running GLRM. Possible values are “NONE”: for no transformation, “DEMEAN”: for subtracting the mean of each column, “DESCALE”: for dividing by the standard deviation of each column, “STANDARDIZE”: for demeaning and descaling, and “NORMALIZE”: for demeaning and dividing each column by its range (max - min).
- seed : int
- (Optional) Random seed used to initialize the X and Y matrices.
- ignore_const_cols : bool
- (Optional) A logical value indicating whether to ignore constant columns in the training frame. A column is constant if all of its non-missing values are the same value.
- loss : str
- A character string indicating the default loss function for numeric columns. Possible values are “Quadratic” (default), “Absolute”, “Huber”, “Poisson”, “Hinge”, and “Logistic”.
- multi_loss : str
- A character string indicating the default loss function for enum columns. Possible values are “Categorical” and “Ordinal”.
- loss_by_col : str
- (Optional) A list of strings indicating the loss function for specific columns by corresponding index in loss_by_col_idx. Will override loss for numeric columns and multi_loss for enum columns.
- loss_by_col_idx : str
- (Optional) A list of column indices to which the corresponding loss functions in loss_by_col are assigned. Must be zero indexed.
- regularization_x : str
- A character string indicating the regularization function for the X matrix. Possible values are “None” (default), “Quadratic”, “L2”, “L1”, “NonNegative”, “OneSparse”, “UnitOneSparse”, and “Simplex”.
- regularization_y : str
- A character string indicating the regularization function for the Y matrix. Possible values are “None” (default), “Quadratic”, “L2”, “L1”, “NonNegative”, “OneSparse”, “UnitOneSparse”, and “Simplex”.
- gamma_x : float
- The weight on the X matrix regularization term.
- gamma_y : float
- The weight on the Y matrix regularization term.
- init_step_size : float
- Initial step size. Divided by number of columns in the training frame when calculating the proximal gradient update. The algorithm begins at init_step_size and decreases the step size at each iteration until a termination condition is reached.
- min_step_size : float
- Minimum step size upon which the algorithm is terminated.
- init : str
- A character string indicating how to select the initial X and Y matrices. Possible values are “Random”: for initialization to a random array from the standard normal distribution, “PlusPlus”: for initialization using the clusters from k-means++ initialization, “SVD”: for initialization using the first k (approximate) right singular vectors, and “User”: user-specified initial X and Y frames (must set user_y and user_x arguments).
- svd_method : str
- A character string that indicates how SVD should be calculated during initialization. Possible values are “GramSVD”: distributed computation of the Gram matrix followed by a local SVD using the JAMA package, “Power”: computation of the SVD using the power iteration method, “Randomized”: approximate SVD by projecting onto a random subspace.
- user_x : H2OFrame
- (Optional) An H2OFrame object specifying the initial X matrix. Only used when init = “User”.
- user_y : H2OFrame
- (Optional) An H2OFrame object specifying the initial Y matrix. Only used when init = “User”.
- expand_user_y : bool
- A logical value indicating whether the categorical columns of the initial Y matrix should be one-hot expanded. Only used when init = “User” and user_y is specified.
- impute_original : bool
- A logical value indicating whether to reconstruct the original training data by reversing the transformation during prediction. Model metrics are calculated with respect to the original data.
- recover_svd : bool
- A logical value indicating whether the singular values and eigenvectors should be recovered during post-processing of the generalized low rank decomposition.
Returns: A new dimensionality reduction model.
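The five transform options can be illustrated on a single numeric column. This is a minimal sketch in plain Python, not the GLRM code itself: transform_column is a hypothetical helper, and using the sample standard deviation is an assumption, since the description above does not say which estimator H2O applies.

```python
from statistics import mean, stdev


def transform_column(values, transform="NONE"):
    """Apply one of the GLRM transform options to a list of numbers."""
    mu = mean(values)
    sd = stdev(values)                  # assumption: sample standard deviation
    span = max(values) - min(values)    # range (max - min)

    if transform == "NONE":
        return list(values)
    if transform == "DEMEAN":
        return [v - mu for v in values]
    if transform == "DESCALE":
        return [v / sd for v in values]
    if transform == "STANDARDIZE":
        return [(v - mu) / sd for v in values]
    if transform == "NORMALIZE":
        return [(v - mu) / span for v in values]
    raise ValueError("unknown transform: %r" % transform)


column = [2.0, 4.0, 6.0]
for name in ("NONE", "DEMEAN", "DESCALE", "STANDARDIZE", "NORMALIZE"):
    print(name, transform_column(column, name))
```

A constant column has zero standard deviation and zero range, so this sketch would divide by zero on it; that is one reason to leave ignore_const_cols enabled.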
- h2o.h2o.import_file(path=None, destination_frame='', parse=True, header=(-1, 0, 1), sep='', col_names=None, col_types=None, na_strings=None)[source]¶
Have H2O import a dataset into memory. The path to the data must be a valid path for each node in the H2O cluster. If some node in the H2O cluster cannot see the file, then an exception will be thrown by the H2O cluster.
- path : str
- A path specifying the location of the data to import.
- destination_frame : str, optional
- The unique hex key assigned to the imported file. If none is given, a key will automatically be generated.
- parse : bool, optional
- A logical value indicating whether the file should be parsed after import.
- header : int, optional
- -1 means the first line is data, 0 means guess, 1 means first line is header.
- sep : str, optional
- The field separator character. Values on each line of the file are separated by this character. If sep = “”, the parser will automatically detect the separator.
- col_names : list, optional
- A list of column names for the file.
- col_types : list or dict, optional
- A list of types or a dictionary of column names to types to specify whether columns should be forced to a certain type upon import parsing. If a list, the types for elements that are None will be guessed. The possible types a column may have are:
“unknown” - this will force the column to be parsed as all NA
“uuid” - the values in the column must be true UUID or will be parsed as NA
“string” - force the column to be parsed as a string
“numeric” - force the column to be parsed as numeric. H2O will handle the compression of the numeric data in the optimal manner.
“enum” - force the column to be parsed as a categorical column.
“time” - force the column to be parsed as a time column. H2O will attempt to parse the following list of date time formats.
- date:
- “yyyy-MM-dd” “yyyy MM dd” “dd-MMM-yy” “dd MMM yy”
- time:
- “HH:mm:ss” “HH:mm:ss:SSS” “HH:mm:ss:SSSnnnnnn” “HH.mm.ss” “HH.mm.ss.SSS” “HH.mm.ss.SSSnnnnnn”
Times can also contain “AM” or “PM”.
- na_strings : list or dict, optional
- A list of strings, or a list of lists of strings (one list per column), or a dictionary of column names to strings which are to be interpreted as missing values.
Returns: A new H2OFrame instance.
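The two accepted shapes of col_types, a positional list with None for "guess" or a dictionary keyed by column name, are easy to mix up. The sketch below builds both in plain Python and checks them against the type names listed above. The check_col_types helper and the column names are hypothetical and not part of the h2o module; H2O performs its own validation when the file is parsed.

```python
ALLOWED_TYPES = {"unknown", "uuid", "string", "numeric", "enum", "time"}


def check_col_types(col_types):
    """Return True if every requested type is one of the documented names.

    Accepts a list (one entry per column, None meaning "let H2O guess")
    or a dict mapping column names to types.
    """
    if isinstance(col_types, dict):
        requested = col_types.values()
    else:
        requested = col_types
    named = {t for t in requested if t is not None}
    unknown = sorted(named - ALLOWED_TYPES)
    if unknown:
        raise ValueError("unsupported column types: %s" % ", ".join(unknown))
    return True


# List form: first column numeric, second guessed, third categorical.
as_list = ["numeric", None, "enum"]

# Dict form: only the named columns are forced; the rest are guessed.
as_dict = {"user_id": "uuid", "signup_time": "time"}

print(check_col_types(as_list))
print(check_col_types(as_dict))
```

The list form requires knowing the column order in the file, while the dict form only names the columns whose guessed type needs overriding.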
- h2o.h2o.init(ip='localhost', port=54321, size=1, start_h2o=False, enable_assertions=False, license=None, max_mem_size_GB=None, min_mem_size_GB=None, ice_root=None, strict_version_check=True)[source]¶
Initiate an H2O connection to the specified ip and port.
- ip : str
- A string representing the hostname or IP address of the server where H2O is running.
- port : int
- A port, default is 54321
- size : int
- The expected number of h2o instances (ignored if start_h2o is True)
- start_h2o : bool
- A boolean dictating whether this module should start the H2O JVM. An attempt is made anyway if _connect fails.
- enable_assertions : bool
- If start_h2o, pass -ea as a VM option.
- license : str
- If not None, is a path to a license file.
- max_mem_size_GB : int
- Maximum heap size (jvm option Xmx) in gigabytes.
- min_mem_size_GB : int
- Minimum heap size (jvm option Xms) in gigabytes.
- ice_root : str
- A temporary directory (default location is determined by tempfile.mkdtemp()) to hold H2O log files.
- h2o.h2o.interaction(data, factors, pairwise, max_factors, min_occurrence, destination_frame=None)[source]¶
Categorical Interaction Feature Creation in H2O. Creates a frame in H2O with n-th order interaction features between categorical columns, as specified by the user.
- data : H2OFrame
- the H2OFrame that holds the target categorical columns.
- factors : list
- Factor columns (either indices or column names).
- pairwise : bool
- Whether to create pairwise interactions between factors (otherwise create one higher-order interaction). Only applicable if there are 3 or more factors.
- max_factors : int
- Max. number of factor levels in pair-wise interaction terms (if enforced, one extra catch-all factor will be made)
- min_occurrence : int
- Min. occurrence threshold for factor levels in pair-wise interaction terms
- destination_frame : str
- A string indicating the destination key. If empty, this will be auto-generated by H2O.
Returns: H2OFrame
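The difference between pairwise=True and pairwise=False is easiest to see on a tiny table. The following sketch is plain Python and only imitates the idea; it does not call H2O. The interaction_columns helper, the underscore-joined level names, and the sample rows are all assumptions made for illustration, and the max_factors and min_occurrence limits are left out.

```python
from itertools import combinations


def interaction_columns(rows, factors, pairwise):
    """Build interaction features from categorical columns.

    rows: list of dicts, one per observation.
    pairwise=True  -> one new column per pair of factors.
    pairwise=False -> a single higher-order column over all factors.
    """
    if pairwise:
        groups = list(combinations(factors, 2))
    else:
        groups = [tuple(factors)]
    return {
        "_".join(group): ["_".join(str(row[f]) for f in group) for row in rows]
        for group in groups
    }


rows = [
    {"color": "red", "size": "S", "shape": "round"},
    {"color": "blue", "size": "L", "shape": "square"},
]
factors = ["color", "size", "shape"]

print(interaction_columns(rows, factors, pairwise=True))
print(interaction_columns(rows, factors, pairwise=False))
```

With three factors, the pairwise setting yields three new columns and the higher-order setting yields one, which matches the note that the flag only matters when there are 3 or more factors.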
- h2o.h2o.keys_leaked(num_keys)[source]¶
Ask H2O if any keys leaked.
- num_keys : int
- The number of keys that should be there.
Returns: A boolean True/False if keys leaked. If keys leaked, check H2O logs for further detail.
- h2o.h2o.kmeans(x, validation_x=None, k=None, model_id=None, max_iterations=None, standardize=None, init=None, seed=None, nfolds=None, fold_column=None, fold_assignment=None, training_frame=None, validation_frame=None, user_points=None, ignored_columns=None, score_each_iteration=None, keep_cross_validation_predictions=None, ignore_const_cols=None, checkpoint=None)[source]¶
Performs k-means clustering on an H2O dataset.
- x : H2OFrame
- The data columns on which k-means operates.
- k : int
- The number of clusters. Must be between 1 and 1e7 inclusive. k may be omitted if the user specifies the initial centers in the init parameter. If k is not omitted, in this case, then it should be equal to the number of user-specified centers.
- model_id : str
- (Optional) The unique id assigned to the resulting model. If none is given, an id will automatically be generated.
- max_iterations : int
- The maximum number of iterations allowed. Must be between 0 and 1e6 inclusive.
- standardize : bool
- Indicates whether the data should be standardized before running k-means.
- init : str
- A character string that selects the initial set of k cluster centers. Possible values are “Random”: for random initialization, “PlusPlus”: for k-means++ initialization, or “Furthest”: for initialization at the furthest point from each successive center. Additionally, the user may specify the initial centers as a matrix, data.frame, H2OFrame, or list of vectors. For matrices, data.frames, and H2OFrames, each row of the respective structure is an initial center. For lists of vectors, each vector is an initial center.
- seed : int
- (Optional) Random seed used to initialize the cluster centroids.
- nfolds : int
- (Optional) Number of folds for cross-validation. If nfolds >= 2, then validation must remain empty.
- fold_column : H2OFrame
- (Optional) Column with cross-validation fold index assignment per observation
- fold_assignment : str
- Cross-validation fold assignment scheme, if fold_column is not specified. Must be “AUTO”, “Random” or “Modulo”.
Returns: An instance of H2OClusteringModel.
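The “Furthest” option for init can be sketched as a greedy loop: each new center is the point that lies furthest from the centers chosen so far. The code below is a plain-Python illustration of that idea, not H2O's implementation. The furthest_init helper is hypothetical, and starting from the first point is an assumption made to keep the example deterministic.

```python
import math


def furthest_init(points, k):
    """Pick k initial centers by repeatedly taking the furthest point.

    points: list of equal-length coordinate tuples.
    Assumption: the first center is simply points[0].
    """
    centers = [points[0]]
    while len(centers) < k:
        # Distance of a point to its nearest already-chosen center.
        def nearest_center_distance(p):
            return min(math.dist(p, c) for c in centers)

        centers.append(max(points, key=nearest_center_distance))
    return centers


points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (5.0, 0.0)]
print(furthest_init(points, 3))
```

This spreads the initial centers across the data, at the cost of being sensitive to outliers, since an extreme point is always a strong candidate.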
- h2o.h2o.lazy_import(path)[source]¶
Import a single file or collection of files.
- path : str
- A path to a data file (remote or local).
- h2o.h2o.list_timezones()[source]¶
Get a list of all the timezones
Returns: the time zones (as an H2OFrame)
- h2o.h2o.load_model(path)[source]¶
Load a saved H2O model from disk. Example:
>>> path = h2o.save_model(my_model, dir=my_path)
>>> h2o.load_model(path) # use the result of save_model
- path : str
- The full path of the H2O Model to be imported.
Returns: the model
- h2o.h2o.log_and_echo(message)[source]¶
Log a message on the server-side logs. This is helpful when running several pieces of work one after the other on a single H2O cluster and you want to make a notation in the H2O server-side log where one piece of work ends and the next piece of work begins.
Sends a message to H2O for logging. Generally used for debugging purposes.
- message : str
- A character string with the message to write to the log.
- h2o.h2o.naive_bayes(x, y, validation_x=None, validation_y=None, training_frame=None, validation_frame=None, laplace=None, threshold=None, eps=None, compute_metrics=None, offset_column=None, weights_column=None, balance_classes=None, max_after_balance_size=None, nfolds=None, fold_column=None, fold_assignment=None, keep_cross_validation_predictions=None, checkpoint=None)[source]¶
The naive Bayes classifier assumes independence between predictor variables conditional on the response, and a Gaussian distribution of numeric predictors with mean and standard deviation computed from the training dataset. When building a naive Bayes classifier, every row in the training dataset that contains at least one NA will be skipped completely. If the test dataset has missing values, then those predictors are omitted in the probability calculation during prediction.
- laplace : int
- A positive number controlling Laplace smoothing. The default zero disables smoothing.
- threshold : float
- The minimum standard deviation to use for observations without enough data. Must be at least 1e-10.
- eps : float
- A threshold cutoff to deal with numeric instability, must be positive.
- compute_metrics : bool
- A logical value indicating whether model metrics should be computed. Set to False to reduce the runtime of the algorithm.
- training_frame : H2OFrame
- Training Frame
- validation_frame : H2OFrame
- Validation Frame
- offset_column : H2OFrame
- Specify the offset column.
- weights_column : H2OFrame
- Specify the weights column.
- nfolds : int
- (Optional) Number of folds for cross-validation. If nfolds >= 2, then validation must remain empty.
- fold_column : H2OFrame
- (Optional) Column with cross-validation fold index assignment per observation
- fold_assignment : str
- Cross-validation fold assignment scheme, used if fold_column is not specified. Must be “AUTO”, “Random” or “Modulo”.
- keep_cross_validation_predictions : bool
- Whether to keep the predictions of the cross-validation models
Returns: an H2OBinomialModel if the response has two categorical levels, H2OMultinomialModel otherwise.
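The roles of laplace and threshold can be illustrated with a small, self-contained sketch of the textbook calculations. The function names are hypothetical, and treating threshold as a floor on the standard deviation is an assumption based on the parameter description above; this is not the code H2O runs.

```python
import math
from collections import Counter


def smoothed_level_probabilities(values, levels, laplace=0):
    """Conditional probabilities of one categorical predictor within one class.

    Additive (Laplace) smoothing adds `laplace` pseudo-counts to every level,
    so unseen levels no longer get probability zero. Assumes `values` is
    non-empty or laplace > 0.
    """
    counts = Counter(values)
    denominator = len(values) + laplace * len(levels)
    return {level: (counts[level] + laplace) / denominator for level in levels}


def gaussian_density(x, mean, sd, threshold=1e-10):
    """Gaussian likelihood of a numeric predictor, with sd floored at threshold."""
    sd = max(sd, threshold)  # assumption: threshold acts as a minimum sd
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


# Level "c" never occurs in this class, but still gets non-zero probability.
print(smoothed_level_probabilities(["a", "a", "b"], ["a", "b", "c"], laplace=1))
```

With laplace=0 the level "c" would receive probability zero, which zeroes out the whole product of conditional probabilities for that class.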
- h2o.h2o.no_progress()[source]¶
Disable the progress bar from flushing to stdout. The completed progress bar is printed when a job is complete so as to demarcate a log file.
- h2o.h2o.parse_raw(setup, id=None, first_line_is_header=(-1, 0, 1))[source]¶
Used in conjunction with lazy_import and parse_setup in order to make alterations before parsing.
- setup : dict
- Result of h2o.parse_setup
- id : str, optional
- An id for the frame.
- first_line_is_header : int, optional
- -1 means the first line is data, 0 means guess, 1 means the first line is the header.
Returns: an H2OFrame.
- h2o.h2o.parse_setup(raw_frames, destination_frame='', header=(-1, 0, 1), separator='', column_names=None, column_types=None, na_strings=None)[source]¶
During parse setup, the H2O cluster makes several guesses about the attributes of the data. This method lets the user correct those guesses by updating the dictionary it returns. That dictionary is then passed to parse_raw to produce the H2OFrame instance.
- raw_frames : H2OFrame
- A collection of imported file frames
- destination_frame : str, optional
- The unique hex key assigned to the imported file. If none is given, a key will automatically be generated.
- parse : bool, optional
- A logical value indicating whether the file should be parsed after import.
- header : int, optional
- -1 means the first line is data, 0 means guess, 1 means first line is header.
- separator : str, optional
- The field separator character. Values on each line of the file are separated by this character. If separator = “”, the parser will automatically detect the separator.
- column_names : list, optional
- A list of column names for the file.
- column_types : list or dict, optional
A list of types or a dictionary of column names to types to specify whether columns should be forced to a certain type upon import parsing. If a list, the types for elements that are None will be guessed. The possible types a column may have are:
- “unknown” - this will force the column to be parsed as all NA
- “uuid” - the values in the column must be true UUID or will be parsed as NA
- “string” - force the column to be parsed as a string
- “numeric” - force the column to be parsed as numeric. H2O will handle the compression of the numeric data in the optimal manner.
- “enum” - force the column to be parsed as a categorical column.
- “time” - force the column to be parsed as a time column. H2O will attempt to parse the following list of date time formats.
- date: “yyyy-MM-dd”, “yyyy MM dd”, “dd-MMM-yy”, “dd MMM yy”
- time: “HH:mm:ss”, “HH:mm:ss:SSS”, “HH:mm:ss:SSSnnnnnn”, “HH.mm.ss”, “HH.mm.ss.SSS”, “HH.mm.ss.SSSnnnnnn”
Times can also contain “AM” or “PM”.
- na_strings : list or dict, optional
- A list of strings, or a list of lists of strings (one list per column), or a dictionary of column names to strings which are to be interpreted as missing values.
Returns: a dictionary containing all of the guesses made by the H2O back end.
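To get a feel for which values a “time” column would accept, the sketch below maps a few of the date and time patterns listed above to their approximate Python strptime equivalents. The mapping and the helper name are assumptions made for illustration; the fractional-second patterns are left out, and parsing on the H2O cluster is done by the JVM, not by this code.

```python
from datetime import datetime

# Approximate strptime equivalents of some of the listed patterns (assumption).
# The patterns with %b depend on the locale's month abbreviations.
CANDIDATE_FORMATS = [
    "%Y-%m-%d",  # yyyy-MM-dd
    "%Y %m %d",  # yyyy MM dd
    "%d-%b-%y",  # dd-MMM-yy
    "%d %b %y",  # dd MMM yy
    "%H:%M:%S",  # HH:mm:ss
    "%H.%M.%S",  # HH.mm.ss
]


def parse_time_value(text):
    """Return a datetime for the first matching format, or None (treated as NA)."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None


print(parse_time_value("2015-03-14"))
print(parse_time_value("not a date"))
```

Values that match none of the formats come back as None here, mirroring the way unparseable entries end up as NA in a forced “time” column.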
- h2o.h2o.prcomp(x, validation_x=None, k=None, model_id=None, max_iterations=None, transform=None, seed=None, use_all_factor_levels=None, training_frame=None, validation_frame=None, pca_method=None)[source]¶
Principal components analysis of an H2O dataset.
- k : int
- The number of principal components to be computed. This must be between 1 and min(ncol(training_frame), nrow(training_frame)) inclusive.
- model_id : str
- (Optional) The unique hex key assigned to the resulting model. Automatically generated if none is provided.
- max_iterations : int
- The maximum number of iterations to run each power iteration loop. Must be between 1 and 1e6 inclusive.
- transform : str
- A character string that indicates how the training data should be transformed before running PCA. Possible values are “NONE”: for no transformation, “DEMEAN”: for subtracting the mean of each column, “DESCALE”: for dividing by the standard deviation of each column, “STANDARDIZE”: for demeaning and descaling, and “NORMALIZE”: for demeaning and dividing each column by its range (max - min).
- seed : int
- (Optional) Random seed used to initialize the right singular vectors at the beginning of each power method iteration.
- use_all_factor_levels : bool
- (Optional) A logical value indicating whether all factor levels should be included in each categorical column expansion. If FALSE, the indicator column corresponding to the first factor level of every categorical variable will be dropped. Defaults to FALSE.
- pca_method : str
- A character string that indicates how PCA should be calculated. Possible values are “GramSVD”: distributed computation of the Gram matrix followed by a local SVD using the JAMA package, “Power”: computation of the SVD using the power iteration method, “Randomized”: approximate SVD by projecting onto a random subspace, “GLRM”: fit a generalized low rank model with an l2 loss function (no regularization) and solve for the SVD using local matrix algebra.
Returns: a new dim reduction model
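The five transform options can be illustrated on a single numeric column. This is a minimal sketch with a hypothetical helper name; it assumes the sample standard deviation and a non-constant column with at least two values, and H2O applies the transformation on the cluster rather than in Python.

```python
import statistics


def transform_column(values, transform="NONE"):
    """Apply one of the documented transform options to a list of numbers."""
    if transform == "NONE":
        return list(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # assumption: sample standard deviation
    value_range = max(values) - min(values)
    if transform == "DEMEAN":
        return [v - mean for v in values]
    if transform == "DESCALE":
        return [v / sd for v in values]
    if transform == "STANDARDIZE":
        return [(v - mean) / sd for v in values]
    if transform == "NORMALIZE":
        return [(v - mean) / value_range for v in values]
    raise ValueError("unknown transform: %r" % (transform,))


column = [1.0, 2.0, 3.0]
for name in ("NONE", "DEMEAN", "DESCALE", "STANDARDIZE", "NORMALIZE"):
    print(name, transform_column(column, name))
```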
- h2o.h2o.random_forest(x, y, validation_x=None, validation_y=None, training_frame=None, model_id=None, mtries=None, sample_rate=None, build_tree_one_node=None, ntrees=None, max_depth=None, min_rows=None, nbins=None, nbins_top_level=None, nbins_cats=None, binomial_double_trees=None, validation_frame=None, balance_classes=None, max_after_balance_size=None, seed=None, offset_column=None, weights_column=None, nfolds=None, fold_column=None, fold_assignment=None, keep_cross_validation_predictions=None, score_each_iteration=None, checkpoint=None, stopping_rounds=None, stopping_metric=None, stopping_tolerance=None)[source]¶
Build a Big Data Random Forest model on an H2OFrame.
- x : H2OFrame
- An H2OFrame containing the predictors in the model.
- y : H2OFrame
- An H2OFrame of the response variable in the model.
- training_frame : H2OFrame
- (Optional) An H2OFrame. Only used to retrieve weights, offset, or nfolds columns, if they aren’t already provided in x.
- model_id : str
- (Optional) The unique id assigned to the resulting model. If none is given, an id will automatically be generated.
- mtries : int
- Number of variables randomly sampled as candidates at each split. If set to -1, defaults to sqrt(p) for classification, and p/3 for regression, where p is the number of predictors.
- sample_rate : float
- Sample rate, from 0 to 1.0.
- build_tree_one_node : bool
- Run on one node only; no network overhead but fewer cpus used. Suitable for small datasets.
- ntrees : int
- A nonnegative integer that determines the number of trees to grow.
- max_depth : int
- Maximum depth to grow the tree.
- min_rows : int
- Minimum number of rows to assign to terminal nodes.
- nbins : int
- For numerical columns (real/int), build a histogram of (at least) this many bins, then split at the best point.
- nbins_top_level : int
- For numerical columns (real/int), build a histogram of (at most) this many bins at the root level, then decrease by factor of two per level.
- nbins_cats : int
- For categorical columns (factors), build a histogram of this many bins, then split at the best point. Higher values can lead to more overfitting.
- binomial_double_trees : bool
- For binary classification: build 2x as many trees (one per class). Can lead to higher accuracy.
- validation_frame : H2OFrame
- An H2OFrame object containing the variables in the model.
- balance_classes : bool
- Whether to balance training data class counts via over/under-sampling (for imbalanced data).
- max_after_balance_size : float
- Maximum relative size of the training data after balancing class counts (can be less than 1.0). Ignored if balance_classes is False, which is the default behavior.
- seed : int
- Seed for random numbers (affects sampling). Note: results are only reproducible when running single threaded.
- offset_column : H2OFrame
- Specify the offset column.
- weights_column : H2OFrame
- Specify the weights column.
- nfolds : int
- (Optional) Number of folds for cross-validation. If nfolds >= 2, then validation must remain empty.
- fold_column : H2OFrame
- (Optional) Column with cross-validation fold index assignment per observation
- fold_assignment : str
- Cross-validation fold assignment scheme, used if fold_column is not specified. Must be “AUTO”, “Random” or “Modulo”.
- keep_cross_validation_predictions : bool
- Whether to keep the predictions of the cross-validation models
- score_each_iteration : bool
- Attempts to score each tree.
- stopping_rounds : int
- Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the stopping_metric does not improve (by stopping_tolerance) for k=stopping_rounds scoring events. Can only trigger after at least 2k scoring events. Use 0 to disable.
- stopping_metric : str
- Metric to use for convergence checking; only used when stopping_rounds > 0. Can be one of “AUTO”, “deviance”, “logloss”, “MSE”, “AUC”, “r2”, “misclassification”.
- stopping_tolerance : float
- Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)
Returns: A new classifier or regression model.
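Two of the rules described above, the default for mtries and the moving-average early-stopping test, can be sketched in plain Python. The helper names are hypothetical, rounding down in default_mtries is an assumption, and should_stop is one reading of the stopping_rounds description rather than H2O's own implementation.

```python
import math


def default_mtries(n_predictors, classification):
    """sqrt(p) for classification, p/3 for regression (rounded down, at least 1)."""
    if classification:
        return max(1, math.isqrt(n_predictors))
    return max(1, n_predictors // 3)


def moving_averages(values, k):
    """Simple moving averages of window length k."""
    return [sum(values[i:i + k]) / k for i in range(len(values) - k + 1)]


def should_stop(history, stopping_rounds, stopping_tolerance, lower_is_better=True):
    """Decide whether to stop, given the metric value of every scoring event so far."""
    k = stopping_rounds
    if k <= 0 or len(history) < 2 * k:
        return False  # disabled, or not enough scoring events yet
    averages = moving_averages(history[-2 * k:], k)  # k + 1 averages
    reference = averages[0]  # the oldest window is the baseline
    margin = stopping_tolerance * abs(reference)
    if lower_is_better:
        improved = min(averages[1:]) < reference - margin
    else:
        improved = max(averages[1:]) > reference + margin
    return not improved


print(default_mtries(16, classification=True))         # 4
print(should_stop([1.0, 0.5, 0.25, 0.125], 2, 0.01))   # False
print(should_stop([0.5, 0.5, 0.5, 0.5], 2, 0.01))      # True
```

For metrics where larger is better, such as AUC, pass lower_is_better=False.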
- h2o.h2o.rapids(expr)[source]¶
Execute a Rapids expression.
- expr : str
- The rapids expression (ascii string).
Returns: the JSON response (as a Python dictionary) of the Rapids execution.
- h2o.h2o.remove(x)[source]¶
Remove object from H2O.
- x : H2OFrame or str
- The H2OFrame, or the string key of the object, to be removed.
- h2o.h2o.save_model(model, path='', force=False)[source]¶
Save an H2O Model Object to Disk.
- model : H2OModel
- The model object to save.
- path : str
- A path to save the model at (hdfs, s3, local)
- force : bool
- If True, overwrite the destination directory if it exists. If False, throw an exception when the destination already exists.
Returns: the path of the saved model (string)
- h2o.h2o.set_timezone(tz)[source]¶
Set the Time Zone on the H2O Cloud
- tz : str
- The desired timezone.
Returns: None
- h2o.h2o.shutdown(conn=None, prompt=True)[source]¶
Shut down the specified instance. All data will be lost. This method checks if H2O is running at the specified IP address and port, and if it is, shuts down that H2O instance.
- conn : H2OConnection
- An H2OConnection object containing the IP address and port of the server running H2O.
- prompt : bool
- A logical value indicating whether to prompt the user before shutting down the H2O server.
Returns: None
- h2o.h2o.start_glm_job(x, y, validation_x=None, validation_y=None, **kwargs)[source]¶
Build a Generalized Linear Model. Note: this function is the same as glm(), but it doesn’t block on model-build. Instead, it returns an H2OModelFuture object immediately. The model can be retrieved from the H2OModelFuture object with get_future_model().
Returns: H2OModelFuture
- h2o.h2o.store_size()[source]¶
Get the H2O store size (current count of keys).
Returns: the number of keys in the H2O cloud
- h2o.h2o.svd(x, validation_x=None, training_frame=None, validation_frame=None, nv=None, max_iterations=None, transform=None, seed=None, use_all_factor_levels=None, svd_method=None)[source]¶
Singular value decomposition of an H2O dataset.
- nv : int
- The number of right singular vectors to be computed. This must be between 1 and min(ncol(training_frame), nrow(training_frame)) inclusive.
- max_iterations : int
- The maximum number of iterations to run each power iteration loop. Must be between 1 and 1e6 inclusive.
- transform : str
- A character string that indicates how the training data should be transformed before running SVD. Possible values are “NONE”: for no transformation, “DEMEAN”: for subtracting the mean of each column, “DESCALE”: for dividing by the standard deviation of each column, “STANDARDIZE”: for demeaning and descaling, and “NORMALIZE”: for demeaning and dividing each column by its range (max - min).
- seed : int
- (Optional) Random seed used to initialize the right singular vectors at the beginning of each power method iteration.
- use_all_factor_levels : bool
- (Optional) A logical value indicating whether all factor levels should be included in each categorical column expansion. If FALSE, the indicator column corresponding to the first factor level of every categorical variable will be dropped. Defaults to TRUE.
- svd_method : str
- A character string that indicates how SVD should be calculated. Possible values are “GramSVD”: distributed computation of the Gram matrix followed by a local SVD using the JAMA package, “Power”: computation of the SVD using the power iteration method, “Randomized”: approximate SVD by projecting onto a random subspace.
Returns: a new dim reduction model
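The “Power” method, together with the seed and max_iterations parameters, can be illustrated with a minimal single-machine sketch of power iteration for the leading singular value and right singular vector. The function name and the convergence tolerance are hypothetical; H2O's distributed implementation will differ.

```python
import math
import random


def top_singular_pair(rows, max_iterations=1000, seed=0, tol=1e-12):
    """Power iteration on A^T A for the largest singular value of A.

    `rows` is a list of equal-length lists (the matrix A, row by row).
    Returns (sigma, v) where v is the unit-length right singular vector.
    """
    rng = random.Random(seed)  # the seed fixes the random starting vector
    n_cols = len(rows[0])
    v = [rng.gauss(0.0, 1.0) for _ in range(n_cols)]
    norm = math.sqrt(sum(c * c for c in v))
    v = [c / norm for c in v]
    for _ in range(max_iterations):
        u = [sum(a * b for a, b in zip(row, v)) for row in rows]  # u = A v
        w = [sum(row[j] * u[i] for i, row in enumerate(rows))     # w = A^T u
             for j in range(n_cols)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break  # A v is zero, nothing more to learn from iterating
        new_v = [c / norm for c in w]
        change = max(abs(a - b) for a, b in zip(new_v, v))
        v = new_v
        if change < tol:
            break
    u = [sum(a * b for a, b in zip(row, v)) for row in rows]
    sigma = math.sqrt(sum(c * c for c in u))  # sigma = ||A v||
    return sigma, v


sigma, v = top_singular_pair([[3.0, 0.0], [0.0, 1.0]])
print(sigma, v)
```

Further singular vectors are usually obtained by removing the component already found and repeating, which is why seed is described as being used at the beginning of each power method iteration.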
- h2o.h2o.upload_file(path, destination_frame='', header=(-1, 0, 1), sep='', col_names=None, col_types=None, na_strings=None)[source]¶
Upload a dataset at the path given from the local machine to the H2O cluster.
- path : str
- A path specifying the location of the data to upload.
- destination_frame : str, optional
- The unique hex key assigned to the imported file. If none is given, a key will automatically be generated.
- header : int, optional
- -1 means the first line is data, 0 means guess, 1 means first line is header.
- sep : str, optional
- The field separator character. Values on each line of the file are separated by this character. If sep = “”, the parser will automatically detect the separator.
- col_names : list, optional
- A list of column names for the file.
- col_types : list or dict, optional
A list of types or a dictionary of column names to types to specify whether columns should be forced to a certain type upon import parsing. If a list, the types for elements that are None will be guessed. The possible types a column may have are:
- “unknown” - this will force the column to be parsed as all NA
- “uuid” - the values in the column must be true UUID or will be parsed as NA
- “string” - force the column to be parsed as a string
- “numeric” - force the column to be parsed as numeric. H2O will handle the compression of the numeric data in the optimal manner.
- “enum” - force the column to be parsed as a categorical column.
- “time” - force the column to be parsed as a time column. H2O will attempt to parse the following list of date time formats.
- date: “yyyy-MM-dd”, “yyyy MM dd”, “dd-MMM-yy”, “dd MMM yy”
- time: “HH:mm:ss”, “HH:mm:ss:SSS”, “HH:mm:ss:SSSnnnnnn”, “HH.mm.ss”, “HH.mm.ss.SSS”, “HH.mm.ss.SSSnnnnnn”
Times can also contain “AM” or “PM”.
- na_strings : list or dict, optional
- A list of strings, or a list of lists of strings (one list per column), or a dictionary of column names to strings which are to be interpreted as missing values.
Returns: a new H2OFrame instance. Example:
>>> import h2o as ml
>>> ml.upload_file(path="/path/to/local/data", destination_frame="my_local_data")
...
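The two accepted shapes of col_types (a positional list with None for “guess”, or a dictionary keyed by column name) can be made concrete with a small sketch. The helper below is hypothetical and is not part of the h2o module; its validation rules are assumptions chosen for illustration.

```python
def resolve_col_types(col_names, col_types=None):
    """Return one entry per column: a type name, or None meaning 'let the parser guess'."""
    if col_types is None:
        return [None] * len(col_names)
    if isinstance(col_types, dict):
        unknown = set(col_types) - set(col_names)
        if unknown:
            raise ValueError("unknown column names: %s" % sorted(unknown))
        # Columns missing from the dictionary are left for the parser to guess.
        return [col_types.get(name) for name in col_names]
    if len(col_types) != len(col_names):
        raise ValueError("col_types must have one entry per column")
    return list(col_types)


names = ["id", "age", "city"]
print(resolve_col_types(names, {"city": "enum"}))          # [None, None, 'enum']
print(resolve_col_types(names, ["uuid", None, "string"]))  # ['uuid', None, 'string']
```

The dictionary form is convenient for wide files where only a few columns need to be forced to a particular type.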