Data In H2O¶
An H2OFrame represents a 2D array of data where each column is uniformly typed.
The data may be local or it may be in an H2O cluster. The data are loaded from a CSV file or from a native python data structure; the source is either a python-process-local file, a cluster-local file, or a list of H2OVec objects.
Loading Data From A CSV File¶
H2O’s parser supports data of various formats from multiple sources. The following formats are supported:
- SVMLight
- CSV (data may be delimited by any of the 128 ASCII characters)
- XLS
The following data sources are supported:
- NFS / Local File / List of Files
- HDFS
- URL
- A Directory (with many data files inside at the same level – no support for recursive import of data)
- S3/S3N
- Native Language Data Structure (see the next section)
>>> import h2o
>>> trainFrame = h2o.import_file(path="hdfs://192.168.1.10/user/data/data_test.csv")
>>> # or
>>> trainFrame = h2o.import_file(path="~/data/data_test.csv")
Loading Data From A Python Object¶
To transfer the data that are stored in python data structures to H2O, use the H2OFrame constructor and the python_obj argument. If the python_obj argument is not None, then additional arguments are ignored.
The following types are permissible for python_obj:
- tuple ()
- list []
- dict {}
- collections.OrderedDict
The type of python_obj is inspected by performing an isinstance call. A ValueError will be raised if the type of python_obj is not one of the above types. For example, sets, byte arrays, and un-contained types are not permissible.
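The type check described above can be pictured with a minimal sketch. The helper name `check_python_obj` is hypothetical and is not part of the H2O API; it only mirrors the isinstance rule stated here.

```python
from collections import OrderedDict

# Types the text lists as permissible for python_obj.
# OrderedDict is a dict subclass; it is listed separately only to mirror the text.
PERMITTED_TYPES = (tuple, list, dict, OrderedDict)


def check_python_obj(python_obj):
    """Hypothetical helper: raise ValueError unless python_obj is a permitted container."""
    if not isinstance(python_obj, PERMITTED_TYPES):
        raise ValueError(
            "python_obj must be a tuple, list, dict, or collections.OrderedDict; "
            "got %s" % type(python_obj).__name__
        )
    return python_obj
```

Under this sketch a set, a bytearray, or a bare number is rejected, while any of the four container types passes through unchanged.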
The subsequent sections discuss each data type in detail in terms of the “source” representation (the python object) and the “target” representation (the H2O object). Concretely, the topics of discussion will be on the following: Headers, Data Types, Number of Rows, Number of Columns, and Missing Values.
In the following documentation, H2OFrame and Frame will be used synonymously. Technically, an H2OFrame is the object-pointer that resides in the python VM and points to a Frame object inside of the H2O JVM. Similarly, H2OFrame, Frame, and H2O Frame all refer to the same kind of object. In general, though, the context is from the python VM, unless otherwise specified.
Loading A Python Tuple¶
Essentially, the tuple is an immutable list. This immutability does not map to the H2OFrame. So pythonistas beware!
The restrictions on what goes inside the tuple are fairly relaxed, but if the contents are not recognized, a ValueError is raised.
A tuple is formatted as follows:
(i1, i2, i3, ..., iN)
Restrictions are mainly on the types of the individual iJ (1 <= J <= N).
If iJ is {} for some J, then a ValueError is raised.
If iJ is a () (tuple) or [] (list), then iJ must be a () or [] for all J; otherwise a ValueError is raised.
If iJ is a () or [] and itself contains a nested () or [], then a ValueError is raised. In other words, only a single level of nesting is valid and all internal arrays must be flat – H2O does not flatten them for you.
If iJ is not a () or [], then it must be of type string or a non-complex numeric type (float or int). In other words, if iJ is not a tuple, list, string, float, or int, for some J, then a ValueError is raised.
Some examples of acceptable inputs are:
- Example A: (1,2,3)
- Example B: ((1,2,3), (4,5,6), ("cat", "dog"))
- Example C: ((1,2,3), [4,5,6], ["blue", "yellow"], (321.239, "green","hi"))
- Example D: (3284.123891, "dog", 89)
Note that it is perfectly fine to mix () and [] within a tuple.
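The restrictions on tuple contents can be summarized in a short validator. This is only a sketch of the rules as written in this section; `validate_tuple_items` and `_check_scalar` are hypothetical names, not functions from the H2O package.

```python
def _check_scalar(value, scalar_types):
    # Anything other than str, int, or float (for example dict or complex) is rejected.
    if not isinstance(value, scalar_types):
        raise ValueError("unsupported item type: %s" % type(value).__name__)


def validate_tuple_items(items):
    """Hypothetical validator for the rules above; not H2O's own code."""
    scalar_types = (str, int, float)
    array_types = (tuple, list)

    # Rule: a dict is never allowed as an item.
    if any(isinstance(i, dict) for i in items):
        raise ValueError("dict items are not allowed")

    n_arrays = sum(isinstance(i, array_types) for i in items)
    if n_arrays:
        # Rule: if one item is a () or [], then every item must be.
        if n_arrays != len(items):
            raise ValueError("cannot mix arrays and scalars at the top level")
        for sub in items:
            for value in sub:
                # Rule: only one level of nesting; sub-arrays must be flat.
                if isinstance(value, array_types):
                    raise ValueError("nested arrays must be flat")
                _check_scalar(value, scalar_types)
    else:
        for value in items:
            _check_scalar(value, scalar_types)
    return True
```

Examples A through D all pass this check, while inputs such as `(1, (2, 3))` or `((1, (2, 3)),)` raise a ValueError.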
Headers, Columns, Rows, Data Types, and Missing Values:
The format of the H2OFrame is as follows:
column1 column2 column3 ... columnN
a11, a12, a13, ..., a1N
., ., ., ..., .
., ., ., ..., .
., ., ., ..., .
aM1, aM2, aM3, ..., aMN
It looks exactly like an MxN matrix with an additional header “row”. This header cannot be specified when loading data from a () (or from a []), but it is possible to specify a header with a python dictionary (see below for details).
Headers:
Since no header row can be specified for this case, H2O automatically generates a column header in the following format:
C1, C2, C3, ..., CN
Notably, these column names use 1-based numbering (i.e. the column at index 0 is named “C1”).
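The naming scheme is simple enough to reproduce in one line. The function name `generate_header` below is hypothetical and used only for illustration.

```python
def generate_header(n_cols):
    """Hypothetical helper: build the default names C1..CN for n_cols columns."""
    # Names are numbered from 1, so the column at index 0 is "C1".
    return ["C%d" % (i + 1) for i in range(n_cols)]
```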
Rows, Columns, and Missing Data:
The shape of the H2OFrame is determined by two factors:
- the number of arrays nested in the ()
- the number of items in each array
If there are no nested arrays (as in Example A and Example D above), the resulting H2OFrame will have the following shape (rows x cols):
1 x len(tuple)
(i.e. a Frame with a single row).
If there are nested arrays (as in Example B and Example C above), then (given the rules stated above) the resulting H2OFrame will have ROWS equal to the number of arrays nested within and COLUMNS equal to the length of the longest sub-array:
len(tuple) x max( [len(l) for l in tuple] )
Note that this addresses the issue with ragged sub-arrays by assuming that shorter sub-arrays will pad themselves with NA (missing values) at the end so that they become the correct length.
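A rough sketch of the padding behaviour follows, assuming one row per nested array as the prose above describes. `to_rows` is a hypothetical helper, and Python's `None` stands in for H2O's NA.

```python
def to_rows(obj):
    """Hypothetical: return (padded_rows, (n_rows, n_cols)) for a tuple or list."""
    # A flat input becomes a single row; a nested input gives one row per sub-array.
    nested = len(obj) > 0 and all(isinstance(i, (tuple, list)) for i in obj)
    rows = [list(r) for r in obj] if nested else [list(obj)]
    # The longest row fixes the number of columns.
    n_cols = max(len(r) for r in rows)
    # Shorter rows are padded at the end with None (the stand-in for NA).
    padded = [r + [None] * (n_cols - len(r)) for r in rows]
    return padded, (len(padded), n_cols)
```

For Example C this yields four rows of three columns each, with the two-item sub-array padded by one missing value.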
Because the Frame is uniformly typed, combining data types within a column may produce unexpected results. Please read up on the H2O parser for details on how a column type is determined for mixed-type columns.
Loading A Python List¶
The same principles that apply to tuples also apply to lists. Lists are mutable objects, so there is no semantic difference regarding mutability between an H2OFrame and a list (as there is for a tuple).
Additionally, a list [] is ordered the same way as a tuple (), with the data appearing within the brackets.
Loading A Python Dictionary Or collections.OrderedDict¶
Each entry in the {} is expected to represent a single column. Keys in the {} must be character strings following the pattern: ^[a-zA-Z_][a-zA-Z0-9_.]*$ without restriction on length. A valid column name may begin with any letter (capital or not) or an “_”, followed by any number of letters, digits, “_”s, or ”.”s.
Values in the {} may be a flat [], a flat (), or a single int, float, or string value. Nested [] and () will raise a ValueError. This is the only additional restriction on [] and () that applies in this context.
Note that the built-in dict does not provide any guarantees on ordering. This has implications for the order of columns in the eventual H2OFrame, since the columns may be written in a different order from the one in which they were put into the dict.
collections.OrderedDict preserves the order in which the key-value pairs were entered.
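The key pattern and the flat-value rule can be checked before handing a dictionary to H2O. The sketch below uses the regular expression quoted above; `check_column_dict` is a hypothetical helper, not an H2O function.

```python
import re
from collections import OrderedDict

# Pattern for column names, copied from the text above.
COLUMN_NAME = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_.]*$")


def check_column_dict(columns):
    """Hypothetical check: return the column names in iteration order, or raise ValueError."""
    for name, values in columns.items():
        if not isinstance(name, str) or not COLUMN_NAME.match(name):
            raise ValueError("invalid column name: %r" % (name,))
        # Values may be a flat list or tuple, or a single scalar; nesting is rejected.
        if isinstance(values, (list, tuple)):
            if any(isinstance(v, (list, tuple)) for v in values):
                raise ValueError("column %r must be a flat list or tuple" % (name,))
    return list(columns.keys())


# An OrderedDict keeps the columns in the order they were entered.
example = OrderedDict([("b_col", [1, 2]), ("a.col", (3, 4)), ("_x", 5)])
column_order = check_column_dict(example)
```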
H2OFrame¶
- class h2o.frame.H2OFrame(python_obj=None, file_path=None, raw_id=None, expr=None, destination_frame='', header=(-1, 0, 1), separator='', column_names=None, column_types=None, na_strings=None)[source]¶
Bases: h2o.frame.H2OFrameWeakRefMixin
- COUNTING = True¶
- MAGIC_REF_COUNT = 5¶
- any(na_rm=False)[source]¶
Parameters: na_rm – True or False to remove NAs from computation. Returns: True if any element is True in the column.
- as_data_frame(use_pandas=True)[source]¶
Obtain the dataset as a python-local object (pandas frame if possible, list otherwise)
Parameters: use_pandas – A flag specifying whether or not to attempt to coerce to Pandas. Returns: A local python object containing this H2OFrame instance’s data.
- as_date(format)[source]¶
Return the column with all elements converted to millis since the epoch.
Parameters: format – The date time format string Returns: H2OFrame
- asnumeric()[source]¶
Returns: A frame with factor columns converted to numbers (numeric columns untouched).
- col_names None[source]¶
Retrieve the column names (one name per H2OVec) for this H2OFrame.
Returns: A character list[] of column names.
- countmatches(pattern)[source]¶
Count the occurrences of the given pattern in each string of the target column.
Parameters: pattern – (str) The pattern to count matches on in each string.
Returns: H2OFrame
- cut(breaks, labels=None, include_lowest=False, right=True, dig_lab=3)[source]¶
Cut a numeric vector into factor “buckets”. Similar to R’s cut method.
Parameters: - breaks – The cut points in the numeric vector (must span the range of the col.)
- labels – Factor labels, defaults to set notation of intervals defined by breaks.
- include_lowest – By default, cuts are defined as (lo,hi]. If True, get [lo,hi].
- right – Include the high value: (lo,hi]. If False, get (lo,hi).
- dig_lab – Number of digits following the decimal point to consider.
Returns: A factor column.
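To make the default interval convention concrete, here is a minimal sketch of assigning a value to a (lo,hi] bucket. It covers only the default right=True case plus include_lowest, assumes the break points are sorted in ascending order, and uses a hypothetical function name; it is not how H2O implements cut.

```python
from bisect import bisect_left


def bucket_index(x, breaks, include_lowest=False):
    """Hypothetical sketch: index of the (lo, hi] interval containing x, or None."""
    # With include_lowest, the first interval is closed on the left: [lo, hi].
    if include_lowest and x == breaks[0]:
        return 0
    # bisect_left returns the number of break points strictly less than x.
    i = bisect_left(breaks, x)
    if i == 0 or i == len(breaks):
        # x is at or below the first break, or above the last break.
        return None
    return i - 1
```

With breaks `[0, 10, 20]`, the value 10 falls in the first bucket (0,10] and the value 0 falls in no bucket unless include_lowest is True.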
- ddply(cols, fun)[source]¶
Parameters: - cols – Column names used to control grouping
- fun – Function to execute on each group. Right now limited to textual Rapids expression
Returns: New frame with 1 row per-group, of results from ‘fun’
- describe()[source]¶
Generate an in-depth description of this H2OFrame.
The description is a tabular print of the type, min, max, sigma, number of zeros, and number of missing elements for each H2OVec in this H2OFrame.
Returns: None (print to stdout)
- dim None[source]¶
Get the number of rows and columns in the H2OFrame.
Returns: The number of rows and columns in the H2OFrame as a list [rows, cols].
- drop(i)[source]¶
Returns a Frame with the column at index i dropped.
Parameters: i – Column to drop Returns: Returns an H2OFrame
- dropped_instances = []¶
- filterNACols(frac=0.2)[source]¶
Filter columns with proportion of NAs >= frac.
Parameters: frac – Fraction of NAs in the column. Returns: A list of column indices.
- group_by(by, order_by=None)[source]¶
Returns a new GroupBy object using this frame and the desired grouping columns.
Parameters: - by – The columns to group on.
- order_by – A list of column names or indices on which to order the results.
Returns: A new GroupBy object.
- gsub(pattern, replacement, ignore_case=False)[source]¶
sub and gsub perform replacement of the first and all matches respectively. Of note, mutates the frame.
Parameters: - pattern –
- replacement –
- ignore_case –
Returns: H2OFrame
- head(rows=10, cols=200, show=False, as_pandas=False)[source]¶
Analogous to R’s head call on a data.frame. Display a digestible chunk of the H2OFrame starting from the beginning.
Parameters: - rows – Number of rows to display.
- cols – Number of columns to display.
- show – Display the output.
- as_pandas – Display with pandas frame.
Returns: None
- hist(breaks='Sturges', plot=True, **kwargs)[source]¶
Compute a histogram over a numeric column. If breaks=="FD", the MAD is used over the IQR in computing bin width.
Parameters: - breaks – Can be one of the following: a string ("Sturges", "Rice", "sqrt", "Doane", "FD", or "Scott"); a single number giving the number of equal-width bins into which the range of the vec is split; or a vector of numbers giving the split points, e.g., [-50, 213.2123, 9324834].
- plot – A logical value indicating whether or not a plot should be generated (default is True).
Returns: if plot is True, then return None, else, an H2OFrame with these columns: breaks, counts, mids_true, mids, and density
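For orientation, the sketch below shows Sturges' rule in its textbook form and the equal-width break points implied by passing a single number. Both function names are hypothetical, and whether H2O computes its bins in exactly this way is an assumption.

```python
import math


def sturges_bins(n):
    """Textbook Sturges' rule: ceil(log2(n)) + 1 bins for n observations (n >= 1)."""
    return int(math.ceil(math.log2(n))) + 1


def equal_width_breaks(lo, hi, n_bins):
    """Hypothetical helper: n_bins + 1 break points splitting [lo, hi] into equal widths."""
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]
```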
- impute(column, method='mean', combine_method='interpolate', by=None, inplace=True)[source]¶
Impute a column in this H2OFrame.
Parameters: - column – The column to impute
- method – How to compute the imputation value.
- combine_method – For even samples and method=”median”, how to combine quantiles.
- by – Columns to group-by for computing imputation value per groups of columns.
- inplace – Impute inplace?
Returns: the imputed frame.
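As an illustration of what method='mean' amounts to for a single column, consider the sketch below. It works on a plain Python list with `None` standing in for a missing value; `impute_mean` is a hypothetical name and this is not the H2O implementation.

```python
def impute_mean(column):
    """Hypothetical sketch: replace None entries with the mean of the present values."""
    present = [v for v in column if v is not None]
    if not present:
        # Nothing to compute a mean from; return the column unchanged.
        return list(column)
    fill = sum(present) / len(present)
    return [fill if v is None else v for v in column]
```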
- insert_missing_values(fraction=0.1, seed=None)[source]¶
Insert missing values into an H2OFrame. This is primarily used for testing. Randomly replaces a user-specified fraction of entries in an H2O dataset with missing values. WARNING: This will modify the original dataset. Unless this is intended, this function should only be called on a subset of the original.
Parameters: - fraction – A number between 0 and 1 indicating the fraction of entries to replace with missing.
- seed – A random number used to select which entries to replace with missing values. Default of seed = -1 will automatically generate a seed in H2O.
Returns: H2OFrame with missing values inserted
- interaction(factors, pairwise, max_factors, min_occurrence, destination_frame=None)[source]¶
Categorical Interaction Feature Creation in H2O. Creates a frame in H2O with n-th order interaction features between categorical columns, as specified by the user.
Parameters: - factors – factors Factor columns (either indices or column names).
- pairwise – Whether to create pairwise interactions between factors (otherwise create one higher-order interaction). Only applicable if there are 3 or more factors.
- max_factors – Max. number of factor levels in pair-wise interaction terms (if enforced, one extra catch-all factor will be made)
- min_occurrence – Min. occurrence threshold for factor levels in pair-wise interaction terms
- destination_frame – A string indicating the destination key. If empty, this will be auto-generated by H2O.
Returns: H2OFrame
- ischaracter()[source]¶
Returns: True if the column is a character column, otherwise False (same as isstring)
- isfactor()[source]¶
Returns: A lazy Expr representing the truth of whether or not this vec is a factor.
- isstring()[source]¶
Returns: True if the column is a string column, otherwise False (same as ischaracter)
- kfold_column(n_folds=3, seed=-1)[source]¶
Build a fold assignments column for cross-validation. This call will produce a column having the same data layout as the calling object.
Parameters: - n_folds – Number of folds.
- seed – Seed for random numbers (affects sampling when balance_classes=T)
Returns: A column of fold IDs.
- levels(col=None)[source]¶
Get the factor levels for this frame and the specified column index.
Parameters: col – A column index in this H2OFrame. Returns: a list of strings that are the factor levels for the column.
- match(table, nomatch=0)[source]¶
Makes a vector of the positions of (first) matches of its first argument in its second.
Parameters: - table –
- nomatch –
Returns: bit H2OVec
- mean(na_rm=False)[source]¶
Parameters: na_rm – True or False to remove NAs from computation. Returns: The mean of the column.
- merge(other, allLeft=False, allRite=False)[source]¶
Merge two datasets based on common column names
Parameters: - other – Other dataset to merge. Must have at least one column in common with self, and all columns in common are used as the merge key. If you want to use only a subset of the columns in common, rename the other columns so the columns are unique in the merged result.
- allLeft – If true, include all rows from the left/self frame
- allRite – If true, include all rows from the right/other frame
Returns: Original self frame enhanced with merged columns and rows
- static mktime(year=1970, month=0, day=0, hour=0, minute=0, second=0, msec=0)[source]¶
All units are zero-based (including months and days). Missing year is 1970.
Returns: Returns msec since the Epoch.
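The zero-based convention is easy to get wrong, so a plain-Python sketch may help. It assumes UTC and shifts the zero-based month and day to the one-based values that `datetime` expects; `mktime_sketch` is a hypothetical stand-in and does not call H2O.

```python
from datetime import datetime, timezone


def mktime_sketch(year=1970, month=0, day=0, hour=0, minute=0, second=0, msec=0):
    """Hypothetical sketch: milliseconds since the Epoch for zero-based month and day (UTC assumed)."""
    # month=0 means January and day=0 means the first day of the month.
    moment = datetime(year, month + 1, day + 1, hour, minute, second, tzinfo=timezone.utc)
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    delta = moment - epoch
    return (delta.days * 86400 + delta.seconds) * 1000 + msec
```

With all defaults the result is 0, which corresponds to midnight on 1 January 1970.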
- modulo_kfold_column(n_folds=3)[source]¶
Build a fold assignments column for cross-validation. Rows are assigned a fold according to the current row number modulo n_folds.
Parameters: n_folds – (int) The number of folds to build.
Returns: An H2OFrame holding a single column of the fold assignments.
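The assignment rule can be written out directly. The sketch below assumes zero-based row numbers and fold IDs, which the text does not state; `modulo_fold_ids` is a hypothetical name.

```python
def modulo_fold_ids(n_rows, n_folds=3):
    """Hypothetical sketch: fold ID for each row is its row number modulo n_folds."""
    return [row % n_folds for row in range(n_rows)]
```

Seven rows with three folds would be assigned folds 0, 1, 2, 0, 1, 2, 0 under these assumptions.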
- mult(matrix)[source]¶
Perform matrix multiplication.
Parameters: matrix – The matrix to multiply to the left of self. Returns: The multiplied matrices.
- names None[source]¶
Retrieve the column names (one name per H2OVec) for this H2OFrame.
Returns: A character list[] of column names.
- ncol None[source]¶
Get the number of columns in this H2OFrame.
Returns: The number of columns in this H2OFrame.
- nlevels(col=None)[source]¶
Get the number of factor levels for this frame and the specified column index.
Parameters: col – A column index in this H2OFrame. Returns: an integer.
- nrow None[source]¶
Get the number of rows in this H2OFrame.
Returns: The number of rows in this dataset.
- pop(i)[source]¶
Pop a column out of an H2OFrame.
Parameters: i – The index or name of the column to pop. Returns: The column dropped from the frame.
- prod(na_rm=False)[source]¶
Parameters: na_rm – True or False to remove NAs from computation. Returns: The product of the column.
- quantile(prob=None, combine_method='interpolate')[source]¶
Compute quantiles over a given H2OFrame.
Parameters: - prob – A list of probabilities, default is [0.01,0.1,0.25,0.333,0.5,0.667,0.75,0.9,0.99]. You may provide any sequence of any length.
- combine_method – For even samples, how to combine quantiles. Should be one of [“interpolate”, “average”, “low”, “hi”]
Returns: an H2OFrame containing the quantiles and probabilities.
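One way to read the combine_method options is shown in the sketch below. It assumes the quantile position is prob * (n - 1) on the sorted data and that the option decides how the two neighbouring values are combined when that position falls between rows. This is an illustrative interpretation with a hypothetical function name, not a statement of H2O's exact algorithm.

```python
import math


def quantile_sketch(values, prob, combine_method="interpolate"):
    """Hypothetical sketch of combining the two values that bracket a quantile."""
    data = sorted(values)
    pos = prob * (len(data) - 1)  # fractional position of the quantile
    lo, hi = int(math.floor(pos)), int(math.ceil(pos))
    if lo == hi:
        return data[lo]  # the position lands exactly on a row
    if combine_method == "low":
        return data[lo]
    if combine_method == "hi":
        return data[hi]
    if combine_method == "average":
        return (data[lo] + data[hi]) / 2
    if combine_method == "interpolate":
        return data[lo] + (pos - lo) * (data[hi] - data[lo])
    raise ValueError("unknown combine_method: %r" % (combine_method,))
```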
- rbind(data)[source]¶
Combine H2O Datasets by Rows. Takes a sequence of H2O data sets and combines them by rows.
Parameters: data – an H2OFrame Returns: self, with data appended (row-wise)
- remove_vecs(cols)[source]¶
Parameters: cols – Drop these columns. Returns: A frame with the columns dropped.
- rep_len(length_out)[source]¶
Replicates the values in data in the H2O backend
Parameters: length_out – the number of columns of the resulting H2OFrame Returns: an H2OFrame
- round(digits=0)[source]¶
Parameters: digits – Returns: The rounded values in the H2OFrame to the specified number of decimal digits.
- runif(seed=None)[source]¶
Parameters: seed – A random seed. If None, then one will be generated. Returns: A new H2OVec filled with doubles sampled uniformly from [0,1).
- scale(center=True, scale=True)[source]¶
Centers and/or scales the columns of the H2OFrame
Returns: H2OFrame
Parameters: - center – either a ‘logical’ value or numeric list of length equal to the number of columns of the H2OFrame
- scale – either a ‘logical’ value or numeric list of length equal to the number of columns of H2OFrame.
- sd(na_rm=False)[source]¶
Parameters: na_rm – True or False to remove NAs from computation. Returns: Standard deviation of the H2OVec elements.
- setLevel(level)[source]¶
A method to set all column values to one of the levels.
Parameters: level – The level at which the column will be set (a string) Returns: An H2OFrame with all entries set to the desired level
- setLevels(levels)[source]¶
Works on a single categorical vector. New domains must be aligned with the old domains. This call has SIDE EFFECTS and mutates the column in place (does not make a copy).
Parameters: levels – A list of strings specifying the new levels. The number of new levels must match the number of old levels. Returns: None
- setName(col=None, name=None)[source]¶
Set the name of the column at the specified index.
Parameters: - col – Index of the column whose name is to be set.
- name – The new name of the column to set
Returns: the input frame
- setNames(names)[source]¶
Change the column names to names.
Parameters: names – A list of strings equal to the number of columns in the H2OFrame. Returns: None. Rename the column names in this H2OFrame.
- show(as_pandas=True)[source]¶
Used by the H2OFrame.__repr__ method to display a snippet of the data frame.
Parameters: as_pandas – Display a pandas style data frame (better for pretty printing wide datasets) Returns: None
- signif(digits=6)[source]¶
Parameters: digits – Returns: The rounded values in the H2OFrame to the specified number of significant digits.
- split_frame(ratios=[0.75], destination_frames='')[source]¶
Split a frame into distinct subsets of size determined by the given ratios. The number of subsets is always 1 more than the number of ratios given.
Parameters: - ratios – The fraction of rows for each split.
- destination_frames – names of the split frames
Returns: a list of frames
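The "one more subset than ratios" rule follows from the final subset taking whatever fraction is left over. A small sketch, with the hypothetical helper `split_fractions`, makes the arithmetic explicit; it says nothing about how H2O assigns individual rows to splits.

```python
def split_fractions(ratios):
    """Hypothetical sketch: fractions of rows per split, including the implied last one."""
    remainder = 1.0 - sum(ratios)
    if remainder <= 0 or any(r <= 0 for r in ratios):
        raise ValueError("ratios must be positive and sum to less than 1")
    # The last split receives the fraction not claimed by the given ratios.
    return list(ratios) + [remainder]
```

The default `ratios=[0.75]` therefore describes two subsets holding roughly 75% and 25% of the rows.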
- stratified_kfold_column(n_folds=3, seed=-1)[source]¶
Build a fold assignment column with the constraint that each fold has the same class distribution as the fold column.
Parameters: - n_folds – (int) The number of folds to build.
- seed – (int) A random seed.
Returns: An H2OFrame holding a single column of the fold assignments.
- strsplit(pattern)[source]¶
Split the strings in the target column on the given pattern
Parameters: pattern – (str) The split pattern.
Returns: H2OFrame
- structure()[source]¶
Similar to R’s str method: Compactly Display the Structure of this H2OFrame instance.
Returns: None
- sub(pattern, replacement, ignore_case=False)[source]¶
sub and gsub perform replacement of the first and all matches respectively. Of note, mutates the frame.
Parameters: - pattern –
- replacement –
- ignore_case –
Returns: H2OFrame
- table(data2=None)[source]¶
Parameters: data2 – (H2OFrame) Default is None; an optional single column to aggregate counts by.
Returns: An H2OFrame of the counts at each combination of factor levels
- tail(rows=10, cols=200, show=False, as_pandas=False)[source]¶
Analogous to R’s tail call on a data.frame. Display a digestible chunk of the H2OFrame starting from the end.
Parameters: - rows – Number of rows to display.
- cols – Number of columns to display.
- show –
- as_pandas –
Returns: None
- tolower()[source]¶
Translate characters from upper to lower case for a particular column. Of note, mutates the frame.
Returns: H2OFrame
- toupper()[source]¶
Translate characters from lower to upper case for a particular column. Of note, mutates the frame.
Returns: H2OFrame
- trim()[source]¶
Trim the edge-spaces in a column of strings (only operates on frame with one column)
Returns: H2OFrame
- types None[source]¶
Get the column types of H2OFrame.
Returns: A dictionary of column_name-type pairs
- unique()[source]¶
Extract the unique values in the column.
Returns: A new H2OFrame of just the unique values in the column.
GroupBy¶
- class h2o.group_by.GroupBy(fr, by, order_by=None)[source]¶
A class that represents the group by operation on an H2OFrame.
Sample usage:
>>> my_frame = ...  # some existing H2OFrame
>>> grouped = my_frame.group_by(by=["C1","C2"],order_by="C1")
>>> grouped.sum(col="X1",na="all").mean(col="X5",na="all").max()
Any number of aggregations may be chained together in this manner.
If no arguments are given to the aggregation (e.g. “max” in the above example), then it is assumed that the aggregation should apply to all columns but the group by columns.
The na parameter is one of ["all","ignore","rm"]:
- "all" - include NAs
- "rm" - exclude NAs
- "ignore" - ignore NAs in aggregates, but count them (e.g. in denominators for mean, var, sd, etc.)
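The difference between the three na modes is easiest to see on a mean. The sketch below is one reading of the descriptions above, applied to a plain list in which `None` marks an NA. Treating "all" as producing NaN whenever an NA is present is an assumption, and `group_mean` is a hypothetical name rather than part of the GroupBy class.

```python
import math


def group_mean(values, na="all"):
    """Hypothetical sketch of a mean under the three na modes (non-empty input assumed)."""
    present = [v for v in values if v is not None]
    if na == "all":
        # NAs take part in the aggregate, so any NA makes the result NaN (assumed).
        if len(present) != len(values):
            return float("nan")
        return sum(present) / len(present)
    if na == "rm":
        # NAs are dropped from both the sum and the count.
        return sum(present) / len(present)
    if na == "ignore":
        # NAs add nothing to the sum but still count in the denominator.
        return sum(present) / len(values)
    raise ValueError("na must be one of 'all', 'ignore', 'rm'")
```

For the values 2.0, NA, 4.0 this gives 3.0 under "rm" and 2.0 under "ignore".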