R/deepwater.R
Builds a deep neural network on an H2OFrame containing various data sources.
h2o.deepwater(x, y, training_frame, model_id = NULL, checkpoint = NULL, autoencoder = FALSE, validation_frame = NULL, nfolds = 0, balance_classes = FALSE, max_after_balance_size = 5, class_sampling_factors = NULL, keep_cross_validation_predictions = FALSE, keep_cross_validation_fold_assignment = FALSE, fold_assignment = c("AUTO", "Random", "Modulo", "Stratified"), fold_column = NULL, offset_column = NULL, weights_column = NULL, score_each_iteration = FALSE, categorical_encoding = c("AUTO", "Enum", "OneHotInternal", "OneHotExplicit", "Binary", "Eigen", "LabelEncoder", "SortByResponse", "EnumLimited"), overwrite_with_best_model = TRUE, epochs = 10, train_samples_per_iteration = -2, target_ratio_comm_to_comp = 0.05, seed = -1, standardize = TRUE, learning_rate = 0.001, learning_rate_annealing = 1e-06, momentum_start = 0.9, momentum_ramp = 10000, momentum_stable = 0.9, distribution = c("AUTO", "bernoulli", "multinomial", "gaussian", "poisson", "gamma", "tweedie", "laplace", "quantile", "huber"), score_interval = 5, score_training_samples = 10000, score_validation_samples = 0, score_duty_cycle = 0.1, classification_stop = 0, regression_stop = 0, stopping_rounds = 5, stopping_metric = c("AUTO", "deviance", "logloss", "MSE", "RMSE", "MAE", "RMSLE", "AUC", "lift_top_group", "misclassification", "mean_per_class_error"), stopping_tolerance = 0, max_runtime_secs = 0, ignore_const_cols = TRUE, shuffle_training_data = TRUE, mini_batch_size = 32, clip_gradient = 10, network = c("auto", "user", "lenet", "alexnet", "vgg", "googlenet", "inception_bn", "resnet"), backend = c("mxnet", "caffe", "tensorflow"), image_shape = c(0, 0), channels = 3, sparse = FALSE, gpu = TRUE, device_id = c(0), cache_data = TRUE, network_definition_file = NULL, network_parameters_file = NULL, mean_image_file = NULL, export_native_parameters_prefix = NULL, activation = c("Rectifier", "Tanh"), hidden = NULL, input_dropout_ratio = 0, hidden_dropout_ratios = NULL, problem_type = c("auto", "image", "dataset"))
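The signature above is long, so a minimal call may help. The sketch below trains on a tabular frame with `problem_type = "dataset"`. It assumes an H2O installation that still ships `h2o.deepwater()` with a working Deep Water backend; the built-in `iris` data, the chosen argument values, and the `deepwater_args` / `has_deepwater` names are illustrative only and do not come from this page. Training is skipped when the function is not available.

```r
# Illustrative argument list for a tabular ("dataset") model.
# All values are assumptions chosen for the example, not recommendations.
deepwater_args <- list(
  x = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
  y = "Species",                        # factor response -> classification
  problem_type = "dataset",
  backend = "mxnet",
  activation = "Rectifier",
  hidden = c(200, 200),                 # two hidden layers
  hidden_dropout_ratios = c(0.5, 0.5),  # one value per hidden layer
  epochs = 10,
  mini_batch_size = 32,
  gpu = FALSE,
  seed = 42
)

# Only attempt training if this h2o installation provides h2o.deepwater().
has_deepwater <- requireNamespace("h2o", quietly = TRUE) &&
  exists("h2o.deepwater", envir = asNamespace("h2o"), inherits = FALSE)

if (has_deepwater) {
  h2o::h2o.init()
  train <- h2o::as.h2o(iris)
  model <- do.call(
    h2o::h2o.deepwater,
    c(deepwater_args, list(training_frame = train))
  )
  print(h2o::h2o.performance(model))
} else {
  message("h2o.deepwater() is not available in this installation; skipping training.")
}
```

Collecting the arguments in a list and passing them through `do.call()` keeps the configuration inspectable before any cluster is started.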
Argument | Description |
---|---|
x | (Optional) A vector containing the names or indices of the predictor variables to use in building the model. If x is missing, then all columns except y are used. |
y | The name or column index of the response variable in the data. The response must be either a numeric or a categorical/factor variable. If the response is numeric, then a regression model will be trained, otherwise it will train a classification model. |
training_frame | Id of the training data frame. |
model_id | Destination id for this model; auto-generated if not specified. |
checkpoint | Model checkpoint to resume training with. |
autoencoder | Logical. Defaults to FALSE. |
validation_frame | Id of the validation data frame. |
nfolds | Number of folds for K-fold cross-validation (0 to disable or >= 2). Defaults to 0. |
balance_classes | Logical. Defaults to FALSE. |
max_after_balance_size | Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires balance_classes. Defaults to 5.0. |
class_sampling_factors | Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will be automatically computed to obtain class balance during training. Requires balance_classes. |
keep_cross_validation_predictions | Logical. Defaults to FALSE. |
keep_cross_validation_fold_assignment | Logical. Defaults to FALSE. |
fold_assignment | Cross-validation fold assignment scheme, if fold_column is not specified. The 'Stratified' option will stratify the folds based on the response variable, for classification problems. Must be one of: "AUTO", "Random", "Modulo", "Stratified". Defaults to AUTO. |
fold_column | Column with cross-validation fold index assignment per observation. |
offset_column | Offset column. This will be added to the combination of columns before applying the link function. |
weights_column | Column with observation weights. Giving an observation a weight of zero is equivalent to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data frame. A weight is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor. |
score_each_iteration | Logical. Defaults to FALSE. |
categorical_encoding | Encoding scheme for categorical features. Must be one of: "AUTO", "Enum", "OneHotInternal", "OneHotExplicit", "Binary", "Eigen", "LabelEncoder", "SortByResponse", "EnumLimited". Defaults to AUTO. |
overwrite_with_best_model | Logical. Defaults to TRUE. |
epochs | How many times the dataset should be iterated (streamed), can be fractional. Defaults to 10. |
train_samples_per_iteration | Number of training samples (globally) per MapReduce iteration. Special values are 0: one epoch, -1: all available data (e.g., replicated training data), -2: automatic. Defaults to -2. |
target_ratio_comm_to_comp | Target ratio of communication overhead to computation. Only for multi-node operation and train_samples_per_iteration = -2 (auto-tuning). Defaults to 0.05. |
seed | Seed for random numbers (affects the stochastic parts of the algorithm, which may or may not be enabled by default). Note: results are only reproducible when running single-threaded. Defaults to -1 (time-based random number). |
standardize | Logical. Defaults to TRUE. |
learning_rate | Learning rate (higher => less stable, lower => slower convergence). Defaults to 0.001. |
learning_rate_annealing | Learning rate annealing: rate / (1 + rate_annealing * samples). Defaults to 1e-06. |
momentum_start | Initial momentum at the beginning of training (try 0.5). Defaults to 0.9. |
momentum_ramp | Number of training samples for which momentum increases. Defaults to 10000. |
momentum_stable | Final momentum after the ramp is over (try 0.99). Defaults to 0.9. |
distribution | Distribution function. Must be one of: "AUTO", "bernoulli", "multinomial", "gaussian", "poisson", "gamma", "tweedie", "laplace", "quantile", "huber". Defaults to AUTO. |
score_interval | Shortest time interval (in seconds) between model scoring. Defaults to 5. |
score_training_samples | Number of training set samples for scoring (0 for all). Defaults to 10000. |
score_validation_samples | Number of validation set samples for scoring (0 for all). Defaults to 0. |
score_duty_cycle | Maximum duty cycle fraction for scoring (lower: more training, higher: more scoring). Defaults to 0.1. |
classification_stop | Stopping criterion for classification error fraction on training data (-1 to disable). Defaults to 0. |
regression_stop | Stopping criterion for regression error (MSE) on training data (-1 to disable). Defaults to 0. |
stopping_rounds | Early stopping based on convergence of stopping_metric. Stop if the simple moving average of length k of the stopping_metric does not improve for k := stopping_rounds scoring events (0 to disable). Defaults to 5. |
stopping_metric | Metric to use for early stopping (AUTO: logloss for classification, deviance for regression). Must be one of: "AUTO", "deviance", "logloss", "MSE", "RMSE", "MAE", "RMSLE", "AUC", "lift_top_group", "misclassification", "mean_per_class_error". Defaults to AUTO. |
stopping_tolerance | Relative tolerance for the metric-based stopping criterion (stop if the relative improvement is not at least this much). Defaults to 0. |
max_runtime_secs | Maximum allowed runtime in seconds for model training. Use 0 to disable. Defaults to 0. |
ignore_const_cols | Logical. Defaults to TRUE. |
shuffle_training_data | Logical. Defaults to TRUE. |
mini_batch_size | Mini-batch size (smaller leads to better fit, larger can speed up and generalize better). Defaults to 32. |
clip_gradient | Clip gradients once their absolute value is larger than this value. Defaults to 10. |
network | Network architecture. Must be one of: "auto", "user", "lenet", "alexnet", "vgg", "googlenet", "inception_bn", "resnet". Defaults to auto. |
backend | Deep Learning Backend. Must be one of: "mxnet", "caffe", "tensorflow". Defaults to mxnet. |
image_shape | Width and height of image. Defaults to [0, 0]. |
channels | Number of (color) channels. Defaults to 3. |
sparse | Logical. Defaults to FALSE. |
gpu | Logical. Defaults to TRUE. |
device_id | Device IDs (which GPUs to use). Defaults to [0]. |
cache_data | Logical. Defaults to TRUE. |
network_definition_file | Path of file containing network definition (graph, architecture). |
network_parameters_file | Path of file containing network (initial) parameters (weights, biases). |
mean_image_file | Path of file containing the mean image data for data normalization. |
export_native_parameters_prefix | Path (prefix) where to export the native model parameters after every iteration. |
activation | Activation function. Only used if no user-defined network architecture file is provided, and only for problem_type=dataset. Must be one of: "Rectifier", "Tanh". |
hidden | Hidden layer sizes (e.g. [200, 200]). Only used if no user-defined network architecture file is provided, and only for problem_type=dataset. |
input_dropout_ratio | Input layer dropout ratio (can improve generalization, try 0.1 or 0.2). Defaults to 0. |
hidden_dropout_ratios | Hidden layer dropout ratios (can improve generalization). Specify one value per hidden layer. Defaults to 0.5. |
problem_type | Problem type, auto-detected by default. If set to image, the first column of the H2OFrame must be a string column containing the path (URI or URL) to each image. If set to text, the first column of the H2OFrame must be a string column containing the text. If set to dataset, Deep Water behaves just like any other H2O model and builds a model on the provided H2OFrame (non-string columns). Must be one of: "auto", "image", "dataset". Defaults to auto. |
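
The learning_rate_annealing and momentum arguments describe schedules over the number of training samples seen, which can be easier to follow when computed directly. The plain R sketch below evaluates the annealing formula exactly as given in the table, `rate / (1 + rate_annealing * samples)`. The momentum schedule is an assumption: the table only says that momentum increases over `momentum_ramp` samples and then stays at `momentum_stable`, so the sketch uses a linear ramp for illustration. The function names `annealed_rate` and `momentum_at` are hypothetical helpers, not part of the h2o package.

```r
# Effective learning rate after `samples` training samples, following
# the formula in the learning_rate_annealing description.
annealed_rate <- function(samples, rate = 0.001, rate_annealing = 1e-06) {
  rate / (1 + rate_annealing * samples)
}

# Momentum after `samples` training samples.
# ASSUMPTION: the ramp is linear from momentum_start to momentum_stable;
# the documentation above does not state the shape of the ramp.
# Requires momentum_ramp > 0.
momentum_at <- function(samples, momentum_start = 0.9,
                        momentum_ramp = 10000, momentum_stable = 0.9) {
  frac <- pmin(samples / momentum_ramp, 1)  # fraction of the ramp completed
  momentum_start + frac * (momentum_stable - momentum_start)
}

samples <- c(0, 5000, 10000, 1e6)

# With the default annealing of 1e-06, the rate halves after one million samples.
print(annealed_rate(samples))

# Using the values suggested in the table (start at 0.5, end at 0.99).
print(momentum_at(samples, momentum_start = 0.5, momentum_stable = 0.99))
```

With the documented defaults (`momentum_start = 0.9`, `momentum_stable = 0.9`) the momentum stays constant, so the ramp only matters when the two values differ.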