Glossary¶

0xdata: Maker of H₂O. Visit our website: http://0xdata.com

backpropogation

binomial

a variable that takes on only the value 0 or 1. Binomial variables are often interpreted as 0 indicates that an event hasn’t occurred or that the observation lacks a feature, where 1 indicates occurrence or display of an attribute.

categorical data or categorical variable

A qualitative variable (for example: blood type); (synonym for enumerator, factor).

cloud

(Synonym for cluster.) See the definition of cluster.

cluster

1. (Synonym for cloud.) A group of H₂O nodes that work together; when a job is submitted to a cluster, all the nodes in the cluster work on a portion of the job.

2. in statistics: A cluster is a group of observations from a data set that have been identified as similar according to a particular clustering algorithm.

continuous

A variable that can take on all or nearly all values along an interval on the real number line.

CSV file

CSV is an acronym for comma separated value. A CSV file stores data in a plain text format.

destination key

Automatically generated key for a model; it allows recall of a specific model later in analysis. Users can specify a different destination key than the key generated by H₂O.

deviance

Deviance is the difference between an expected value and an observed value. It plays a critical role in defining GLM models. For a more detailed discussion of deviance please see the H₂O Data Science documentation on GLM.

DKV

Distributed key/value store; see key/value store.

family

In GLM family describes the options available for predictive modeling in GLM. Also see gaussian, poisson, gamma, binomial

feature

Synonym for attribute, predictor, or independent variable. Usually the data observed on features are given in the columns of a data set.

feed forward

gz (gzipped) file

Gzip is a type of file compression, a gzipped file is a file compressed by gzip.

HEX format

Records made up of hexadecimal numbers that represent machine language code or constant data. In H₂O, data must be parsed into .hex format before operations can be preformed on it.

instance

An instance of H₂O occurs each time the user opens and runs H₂O, and in the process builds a cluster of nodes (even a one node cluster on their local machine). The instance begins when the cluster is formed, and terminates when the program is closed and the cloud is terminated.

job

A piece of work that needs to be done. For example, reading in a data file, parsing a data file, or building a model. In the browser based GUI of H₂O each job is listed in the admin menu under jobs. For more information see the user guide.

key

The .hex key generated when data are parsed into H₂O. In the web based GUI “key” is an input on each page where users define models, and also any page where users validate models on a new data set or use a model to generate predictions.

key/value pair

A type of data that associates a particular key index to a certain datum.

key/value store

A tool that allows storage of schema-less data. Data usually consists of a string which represents the key, and the data itself which is the value.

L1 regularization

A regularization method that constrains the absolute value of the weights and has the net effect of dropping some weights (setting them to zero) from a model to reduce complexity and avoid overfitting.

L2 Regularization

A regularization method that constrains the sum of the squared weights. This method introduces bias into parameter estimates, but frequently produces substantial gains in modeling as estimate variance is reduced.

link function

A user defined option in GLM. See the GLM user guide in Data Science for further detail.

loss function

Synonymous to objective function and criterion function; the loss function is the function minimized in order to achieve a desired estimator. For example, linear regression defines the set of best parameter estimates to be the set of estimates that produce the minimum of the sum of the squared errors. Errors are the difference between the predicted value and the observed value for an observation.

n-folds

User defined number of cross validation models generated by H₂O

node

In distributed computing systems nodes include clients, servers or peers.

In statistics a node is a decision or terminal point in a classification tree.

parse

Analysis of a string of symbols or datum resulting in the conversion of a set of information from a person-readable format to a machine-readable format.

seed

A starting point for randomization. Seed specification is used when machine learning models have a random component; it allows users to recreate the exact “random” conditions used in a model at a later time.

standard deviation

Abbreviated sd The standard deviation of the data in the column, defined as the square root of the sum of the deviance of observed values from the mean divided by the number of elements in the column less one.

standardization

Transformation of a variable such that it is mean centered at 0 and scaled by the standard deviation.

XLS file

A Microsoft Excel 2003 - 2007 spreadsheet file format.

Y

Dependent variable used in GLM; a user defined input selected from the set of variables present in the user’s data.