.. _Hadoop_Glossary:

Hadoop Glossary
===============

**Driver Jar File**
  The JAR file that lets Hadoop launch H\ :sub:`2`\ O and that connects
  HDFS to H\ :sub:`2`\ O so that data can be imported from HDFS.

**H2O**
  H\ :sub:`2`\ O makes Hadoop do math. H\ :sub:`2`\ O is an Apache v2
  licensed, open source math and prediction engine.

**H2O Cluster**
  A group of H\ :sub:`2`\ O nodes that operate together to work on jobs.
  H\ :sub:`2`\ O scales by distributing work over many H\ :sub:`2`\ O
  nodes. (Note that multiple H\ :sub:`2`\ O nodes can run on a single
  Hadoop node if sufficient resources are available.) All H\ :sub:`2`\ O
  nodes in an H\ :sub:`2`\ O cluster are peers. There is no "master" node.

**H2O Key Value**
  H\ :sub:`2`\ O implements a distributed in-memory Key/Value store within
  the H\ :sub:`2`\ O cluster. H\ :sub:`2`\ O uses Keys to uniquely identify
  data sets that have been read in (pre-parse), data sets that have been
  parsed (into HEX format), and models (e.g., GLM) that have been created.
  For example, when you ingest your data from HDFS into H\ :sub:`2`\ O,
  that entire data set is referred to by a single Key.

**H2O Node**
  H\ :sub:`2`\ O nodes are launched via Hadoop MapReduce and run on Hadoop
  DataNodes. (At a system level, an H\ :sub:`2`\ O node is a Java
  invocation of h2o.jar.) Note that while Hadoop operations are centralized
  around HDFS file accesses, H\ :sub:`2`\ O operations are memory-based
  when possible for best performance. (H\ :sub:`2`\ O reads the data set
  from HDFS into memory and then attempts to perform all operations on the
  data in memory.)

**Hadoop**
  An open source big-data platform. Cloudera, MapR, and Hortonworks are
  distribution providers of Hadoop. Data is stored in HDFS (DataNode,
  NameNode), processed through MapReduce, and managed via the JobTracker.

**HDFS**
  The Hadoop Distributed File System is a distributed file system that
  stores data on commodity machines, providing very high aggregate
  bandwidth across the cluster.

**HEX format**
  The HEX format is an efficient internal representation for data that can
  be used by H\ :sub:`2`\ O algorithms. A data set must be parsed into HEX
  format before you can operate on it.

**JobTracker**
  The JobTracker is the service within Hadoop that farms out MapReduce
  tasks to specific nodes in the cluster.

**JobTracker port**
  The port on which the JobTracker can be reached. The default port may
  differ between distributions.

**Mapper Size**
  The amount of memory allocated to each mapper task launched on the
  Hadoop nodes.

**MapReduce**
  MapReduce is Hadoop's programming model for large-scale data processing.
  H\ :sub:`2`\ O nodes are launched via Hadoop MapReduce and run on Hadoop
  DataNodes.

**Parse**
  The parse operation converts an in-memory raw data set (in CSV format,
  for example) into a HEX format data set. The parse operation takes a
  data set named by a Key as input and produces a HEX format Key/Value
  pair as output.

**Spilling**
  An H\ :sub:`2`\ O node may choose to temporarily "spill" data from memory
  onto disk. (Think of this like swapping.) In Hadoop environments,
  H\ :sub:`2`\ O spills to HDFS. The spilled data is intended to act as a
  temporary cache and is discarded when the job is done.

**YARN**
  A resource-management platform responsible for managing compute
  resources in clusters and using them to schedule users' applications.
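
The relationship between the Key/Value store and the parse operation described in the entries above can be illustrated with a toy, single-machine sketch. It does not use the H\ :sub:`2`\ O API: the ``store`` dict and the ``ingest`` and ``parse`` functions are hypothetical names invented for this illustration, and the dict of columns is only an arbitrary stand-in for the HEX format, whose actual layout is not described in this glossary.

```python
import csv
import io

# Toy stand-in for the distributed Key/Value store: a plain dict on one
# machine. All names here are hypothetical and not part of the H2O API.
store = {}


def ingest(key, raw_text):
    """Store an unparsed (pre-parse) data set under a single Key."""
    store[key] = raw_text


def parse(src_key, dest_key):
    """Read the raw CSV text named by src_key and store a parsed table
    under dest_key. Assumes a header row and purely numeric columns."""
    rows = list(csv.reader(io.StringIO(store[src_key])))
    header, body = rows[0], rows[1:]
    # A dict of numeric columns stands in for the HEX format here.
    store[dest_key] = {
        name: [float(row[i]) for row in body]
        for i, name in enumerate(header)
    }
    return dest_key


# The raw data set and the parsed data set are each referred to by one Key.
ingest("iris.csv", "sepal_len,sepal_wid\n5.1,3.5\n4.9,3.0\n")
parsed_key = parse("iris.csv", "iris.hex")
print(store[parsed_key]["sepal_len"])
```

In this sketch the raw text and the parsed table live side by side under different Keys, mirroring the glossary's distinction between data sets that have been read in (pre-parse) and data sets that have been parsed.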