# Glossary

**0xdata**- Maker of H2O. Visit our website: http://0xdata.com

**backpropagation**

**binomial**- A variable that takes on only the value 0 or 1. A value of 0 is usually interpreted to mean that an event has not occurred or that the observation lacks a feature; a value of 1 indicates that the event occurred or that the attribute is present.
**categorical data or categorical variable**- A qualitative variable (for example: blood type); (synonym for enumerator, factor).
**cloud**- (Synonym for cluster.) See the definition of cluster.
**cluster**- 1. (Synonym for cloud.) A group of H2O nodes that work together; when a job is submitted to a cluster, all the nodes in the cluster work on a portion of the job. 2. In statistics: a group of observations from a data set that have been identified as similar according to a particular clustering algorithm.

**continuous**- A variable that can take on all or nearly all values along an interval on the real number line.
**CSV file**- CSV is an acronym for comma-separated values. A CSV file stores tabular data in a plain text format.
**destination key**- Automatically generated key for a model; it allows recall of a specific model later in analysis. Users can specify a different destination key than the key generated by H2O.
**deviance**- Deviance is the difference between an expected value and an observed value. It plays a critical role in defining GLM models. For a more detailed discussion of deviance, please see the H2O Data Science documentation on GLM.
**DKV**- Distributed key/value store; see key/value store.
**family**- In GLM, the family describes the options available for predictive modeling. See also gaussian, poisson, gamma, and binomial.
**feature**- Synonym for attribute, predictor, or independent variable. Usually the data observed on features are given in the columns of a data set.

**feed forward**

**gz (gzipped) file**- Gzip is a type of file compression; a gzipped file is a file compressed by gzip.
**HEX format**- Records made up of hexadecimal numbers that represent machine language code or constant data. In H2O, data must be parsed into .hex format before operations can be performed on it.
**instance**- An instance of H2O occurs each time the user opens and runs H2O, and in the process builds a cluster of nodes (even a one-node cluster on their local machine). The instance begins when the cluster is formed and terminates when the program is closed and the cloud is terminated.
**job**- A piece of work that needs to be done, for example reading in a data file, parsing a data file, or building a model. In the browser-based GUI of H2O, each job is listed in the admin menu under jobs. For more information, see the user guide.
**key**- The .hex key generated when data are parsed into H2O. In the web-based GUI, “key” is an input on each page where users define models, and also on any page where users validate models on a new data set or use a model to generate predictions.
**key/value pair**- A type of data that associates a particular key index to a certain datum.
**key/value store**- A tool that allows storage of schema-less data. Data usually consists of a string which represents the key, and the data itself which is the value.
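
The idea of a key/value store can be illustrated with a minimal in-memory sketch. This is not H2O's distributed store: a plain Python dictionary stands in for it, and the helper names `put` and `get` as well as the example keys are hypothetical, chosen only for illustration.

```python
# Hypothetical in-memory stand-in for a key/value store.
# Keys are strings; values can be any schema-less data.
store = {}

def put(key, value):
    """Associate a value with a string key, replacing any earlier value."""
    store[key] = value

def get(key, default=None):
    """Look up a value by key; return `default` if the key is absent."""
    return store.get(key, default)

# The values need not share a structure: one is a table of rows,
# the other is a small description of a model.
put("example_data.hex", [[65, 1], [72, 0]])
put("example_model", {"algorithm": "GLM"})

print(get("example_model"))
print(get("no_such_key"))
```

Because the store imposes no schema, the caller is responsible for knowing what kind of value sits behind each key.
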
**L1 regularization**- A regularization method that constrains the sum of the absolute values of the weights and has the net effect of dropping some weights (setting them to zero) from a model to reduce complexity and avoid overfitting.
**L2 regularization**- A regularization method that constrains the sum of the squared weights. This method introduces bias into parameter estimates, but frequently produces substantial gains in modeling as estimate variance is reduced.
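
The two penalties described above can be sketched in a few lines. This is an illustration under assumptions, not H2O's implementation: the function names and the penalty strength `lam` are hypothetical, and real libraries often scale these terms differently (for example, by one half or by the number of observations).

```python
def l1_penalty(weights, lam):
    """L1 penalty: lam times the sum of the absolute values of the weights."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """L2 penalty: lam times the sum of the squared weights."""
    return lam * sum(w * w for w in weights)

# Hypothetical model weights; one of them is already zero.
weights = [0.5, -2.0, 0.0, 1.5]

print(l1_penalty(weights, lam=0.1))
print(l2_penalty(weights, lam=0.1))
```

Either term is added to the loss being minimized, so larger weights make a model more costly. The squared term punishes large weights far more heavily than small ones, while the absolute-value term applies the same pressure at every magnitude, which is why it tends to push some weights all the way to zero.
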
**link function**- A user defined option in GLM. See the GLM user guide in Data Science for further detail.
**loss function**- Synonymous with objective function and criterion function; the loss function is the function minimized in order to achieve a desired estimator. For example, linear regression defines the set of best parameter estimates to be the set of estimates that produce the minimum of the sum of the squared errors. An error is the difference between the predicted value and the observed value for an observation.
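
The linear regression example can be made concrete with a short sketch. The data and the two candidate slopes below are made up for illustration, and the function name `sum_squared_errors` is an assumption rather than part of any H2O API.

```python
def sum_squared_errors(predicted, observed):
    """Sum of squared differences between predicted and observed values."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    return sum((p - o) ** 2 for p, o in zip(predicted, observed))

# Made-up observations and two candidate models of the form y = slope * x.
x = [1.0, 2.0, 3.0]
observed = [2.0, 4.1, 5.9]

loss_by_slope = {}
for slope in (1.5, 2.0):
    predicted = [slope * xi for xi in x]
    loss_by_slope[slope] = sum_squared_errors(predicted, observed)

# The preferred estimate is the one that yields the smaller loss.
best_slope = min(loss_by_slope, key=loss_by_slope.get)
print(loss_by_slope, best_slope)
```

Here the search covers only two candidate slopes; an actual fitting procedure minimizes the same quantity over all possible parameter values.
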
**n-folds**- User-defined number of cross-validation models generated by H2O.
**node**- 1. In distributed computing systems, nodes include clients, servers, or peers. 2. In statistics, a node is a decision or terminal point in a classification tree.

**parse**- Analysis of a string of symbols or datum resulting in the conversion of a set of information from a person-readable format to a machine-readable format.
**seed**- A starting point for randomization. Seed specification is used when machine learning models have a random component; it allows users to recreate the exact “random” conditions used in a model at a later time.
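
The effect of a seed can be shown with Python's standard `random` module, used here as a stand-in for the random component of a model; it is not H2O's random number generator. The helper name `sample_rows` is hypothetical.

```python
import random

def sample_rows(n_rows, k, seed):
    """Pick k distinct row indices from range(n_rows) using a seeded generator."""
    rng = random.Random(seed)  # private generator, so global state is untouched
    return rng.sample(range(n_rows), k)

# The same seed reproduces the same "random" selection.
first = sample_rows(100, 5, seed=42)
second = sample_rows(100, 5, seed=42)
print(first == second)
```

Recording the seed alongside a model's other parameters is what makes it possible to rerun the same analysis later and get the same result.
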
**standard deviation**- Abbreviated **sd**. The standard deviation of the data in the column, defined as the square root of the sum of the squared deviations of observed values from the mean, divided by the number of elements in the column less one.
**standardization**- Transformation of a variable such that it is mean centered at 0 and scaled by the standard deviation.
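
Both definitions can be checked numerically. The sketch below assumes the usual sample formula (squared deviations from the mean, summed, divided by n - 1, then square-rooted); the function names and the example column are illustrative only.

```python
import math

def sample_sd(values):
    """Sample standard deviation: sqrt(sum((v - mean)^2) / (n - 1))."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two values are required")
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))

def standardize(values):
    """Center values at mean 0 and scale them by the sample standard deviation."""
    mean = sum(values) / len(values)
    sd = sample_sd(values)
    if sd == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mean) / sd for v in values]

# Made-up column of data.
column = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

print(sample_sd(column))
print(standardize(column))
```

After standardization the column has a mean of 0 and a standard deviation of 1 (up to floating-point rounding), which puts variables measured in different units on a comparable scale.
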
**XLS file**- A Microsoft Excel 2003 - 2007 spreadsheet file format.
**Y**- Dependent variable used in GLM; a user defined input selected from the set of variables present in the user’s data.