.. -*- mode: rst -*-

.. _glossary:

Glossary
========

.. glossary::

   **0xdata**
      Maker of H\ :sub:`2`\ O. Visit our website: http://0xdata.com

   **binomial**
      A variable that takes on only the value 0 or 1. Typically, 0 indicates
      that an event has not occurred or that the observation lacks a feature,
      and 1 indicates that the event occurred or that the attribute is
      present.

   **categorical data or categorical variable**
      A qualitative variable (for example, blood type). Synonym for
      enumerator, factor.

   **cloud**
      (Synonym for cluster.) See the definition of cluster.

   **cluster**
      1. (Synonym for cloud.) A group of H\ :sub:`2`\ O nodes that work
         together; when a job is submitted to a cluster, all the nodes in
         the cluster work on a portion of the job.
      2. In statistics, a group of observations from a data set that have
         been identified as similar according to a particular clustering
         algorithm.

   **continuous**
      A variable that can take on all or nearly all values along an interval
      on the real number line.

   **CSV file**
      CSV is an acronym for comma-separated values. A CSV file stores data
      in a plain text format.

   **destination key**
      Automatically generated key for a model; it allows recall of a
      specific model later in analysis. Users can specify a different
      destination key than the key generated by H\ :sub:`2`\ O.

   **deviance**
      A measure of the difference between an expected value and an observed
      value. It plays a critical role in defining GLM models. For a more
      detailed discussion of deviance, please see the H\ :sub:`2`\ O Data
      Science documentation on GLM.

   **DKV**
      Distributed key/value store; see key/value store.

   **family**
      In GLM, the family is the option that specifies the distribution
      assumed for the dependent variable. Also see gaussian, poisson, gamma,
      binomial.

   **feature**
      Synonym for attribute, predictor, or independent variable. Usually the
      data observed on features are given in the columns of a data set.

   **gz (gzipped) file**
      Gzip is a type of file compression; a gzipped file is a file
      compressed by gzip.

   **HEX format**
      Records made up of hexadecimal numbers that represent machine language
      code or constant data. In H\ :sub:`2`\ O, data must be parsed into
      .hex format before operations can be performed on it.

   **instance**
      An instance of H\ :sub:`2`\ O occurs each time the user opens and runs
      H\ :sub:`2`\ O, and in the process builds a cluster of nodes (even a
      one-node cluster on their local machine). The instance begins when the
      cluster is formed, and terminates when the program is closed and the
      cloud is terminated.

   **job**
      A piece of work that needs to be done, for example, reading in a data
      file, parsing a data file, or building a model. In the browser-based
      GUI of H\ :sub:`2`\ O, each job is listed in the admin menu under
      jobs. For more information, see the user guide.

   **key**
      The .hex key generated when data are parsed into H\ :sub:`2`\ O. In
      the web-based GUI, "key" is an input on each page where users define
      models, and also on any page where users validate models on a new data
      set or use a model to generate predictions.

   **key/value pair**
      A type of data that associates a particular key index with a certain
      datum.

   **key/value store**
      A tool that allows storage of schema-less data. Data usually consist
      of a string that represents the key, and the data itself, which is the
      value.

   **link function**
      A user-defined option in GLM. See the GLM user guide in Data Science
      for further detail.

   **n-folds**
      User-defined number of cross-validation models generated by
      H\ :sub:`2`\ O.

   **node**
      In distributed computing systems, nodes include clients, servers, or
      peers. In statistics, a node is a decision or terminal point in a
      classification tree.

   **parse**
      Analysis of a string of symbols or data resulting in the conversion of
      a set of information from a person-readable format to a
      machine-readable format.

   **seed**
      A starting point for randomization.
      Seed specification is used when machine learning models have a random
      component; it allows users to recreate the exact "random" conditions
      used in a model at a later time.

   **standardization**
      Transformation of a variable such that it is centered at a mean of 0
      and scaled by the standard deviation.

   **XLS file**
      A Microsoft Excel 2003 - 2007 spreadsheet file format.

   **Y**
      Dependent variable used in GLM; a user-defined input selected from the
      set of variables present in the user's data.
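
The seed and standardization entries above can be illustrated with a minimal
sketch in plain Python. It does not use the H\ :sub:`2`\ O API: the helper
names ``random_column`` and ``standardize`` are hypothetical, and the choice of
the sample standard deviation (n - 1 denominator) is an assumption made for the
example, not a statement about what H\ :sub:`2`\ O does internally.

```python
import random
import statistics


def standardize(values):
    """Center values at mean 0 and scale by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # assumption: sample SD (n - 1 denominator)
    if sd == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mean) / sd for v in values]


def random_column(seed, n=5):
    """Draw n pseudo-random values; the same seed always gives the same draws."""
    rng = random.Random(seed)
    return [rng.gauss(10.0, 2.0) for _ in range(n)]


first = random_column(seed=42)
second = random_column(seed=42)  # same seed: identical "random" values
other = random_column(seed=7)    # different seed: different values

z = standardize(first)
print(first == second)       # True
print(statistics.mean(z))    # approximately 0
print(statistics.stdev(z))   # approximately 1
```

Reusing the seed reproduces the draws exactly, which is what allows a model
with a random component to be rebuilt later under the same conditions. After
standardization, the column has mean 0 and standard deviation 1 up to
floating-point rounding.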