Data In H2O¶
A H2OFrame represents a 2D array of data where each column is uniformly typed.
The data may be local or it may be in an H2O cluster. The data are loaded from a CSV file or from a native python data structure, and are either a python-process-local file, a cluster-local file, or a list of H2OVec objects.
Loading Data From A CSV File¶
H2O’s parser supports data of various formats from multiple sources. The following formats are supported:
- SVMLight
- CSV (data may be delimited by any of the 128 ASCII characters)
- XLS
The following data sources are supported:
- NFS / Local File / List of Files
- HDFS
- URL
- A Directory (with many data files inside at the same level – no support for recursive import of data)
- S3/S3N
- Native Language Data Structure (c.f. the subsequent section)
>>> trainFrame = h2o.import_file(path="hdfs://192.168.1.10/user/data/data_test.csv")
>>> # or
>>> trainFrame = h2o.import_file(path="~/data/data_test.csv")
Loading Data From A Python Object¶
To transfer the data that are stored in python data structures to H2O, use the H2OFrame constructor and the python_obj argument. If the python_obj argument is not None, then additional arguments are ignored.
The following types are permissible for python_obj:
- tuple ()
- list []
- dict {}
- collections.OrderedDict
The type of python_obj is inspected by performing an isinstance call. A ValueError will be raised if the type of python_obj is not one of the above types. For example, sets, byte arrays, and un-contained types are not permissible.
The subsequent sections discuss each data type in detail in terms of the “source” representation (the python object) and the “target” representation (the H2O object). Concretely, the topics of discussion will be on the following: Headers, Data Types, Number of Rows, Number of Columns, and Missing Values.
In the following documentation, H2OFrame and Frame will be used synonymously. Technically, an H2OFrame is the object-pointer that resides in the python VM and points to a Frame object inside of the H2O JVM. Similarly, H2OFrame, Frame, and H2O Frame all refer to the same kind of object. In general, though, the context is from the python VM, unless otherwise specified.
Loading A Python Tuple¶
Essentially, the tuple is an immutable list. This immutability does not map to the H2OFrame. So pythonistas beware!
The restrictions on what goes inside the tuple are fairly relaxed, but if an item is not of a recognized type, a ValueError is raised.
A tuple is formatted as follows:
(i1, i2, i3, ..., iN)
Restrictions are mainly on the types of the individual iJ (1 <= J <= N).
If iJ is {} for some J, then a ValueError is raised.
If iJ is a () (tuple) or [] (list), then iJ must be a () or [] for all J; otherwise a ValueError is raised.
If iJ is a () or [] and itself contains a nested () or [], then a ValueError is raised. In other words, only a single level of nesting is valid and all internal arrays must be flat; H2O does not flatten them for you.
If iJ is not a () or [], then it must be of type string or a non-complex numeric type (float or int). In other words, if iJ is not a tuple, list, string, float, or int, for some J, then a ValueError is raised.
- Some examples of acceptable inputs are:
- Example A: (1,2,3)
- Example B: ((1,2,3), (4,5,6), ("cat", "dog"))
- Example C: ((1,2,3), [4,5,6], ["blue", "yellow"], (321.239, "green", "hi"))
- Example D: (3284.123891, "dog", 89)
Note that it is perfectly fine to mix () and [] within a tuple.
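The restrictions above can be sketched as a small standalone checker. This is a minimal illustration only: `check_tuple_input` is a hypothetical helper, not part of the H2O API, and the actual validation performed by H2OFrame may differ in detail.

```python
def check_tuple_input(obj):
    """Raise ValueError if obj breaks the stated rules for tuple/list input.

    Hypothetical helper for illustration; not an H2O function.
    """
    scalar_types = (str, int, float)
    nested = [isinstance(item, (tuple, list)) for item in obj]
    if any(nested):
        # If one item is a tuple or list, every item must be one.
        if not all(nested):
            raise ValueError("cannot mix arrays and scalars at the top level")
        for row in obj:
            for item in row:
                # Only a single level of nesting is allowed.
                if not isinstance(item, scalar_types):
                    raise ValueError("nested arrays must be flat")
    else:
        for item in obj:
            # Dicts, complex numbers, and other types are rejected.
            if not isinstance(item, scalar_types):
                raise ValueError("unsupported item type: %s" % type(item).__name__)
    return True


# Examples A through D from the list above are all accepted.
check_tuple_input((1, 2, 3))
check_tuple_input(((1, 2, 3), [4, 5, 6], ["blue", "yellow"], (321.239, "green", "hi")))
```

Inputs such as `({},)`, `((1, 2), 3)`, or `(1 + 2j,)` make this sketch raise a ValueError, matching the rules stated above.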
Headers, Columns, Rows, Data Types, and Missing Values:
The format of the H2OFrame is as follows:
column1  column2  column3  ...  columnN
a11,     a12,     a13,     ..., a1N
.,       .,       .,       ..., .
.,       .,       .,       ..., .
.,       .,       .,       ..., .
aM1,     aM2,     aM3,     ..., aMN
It looks exactly like an MxN matrix with an additional header “row”. This header cannot be specified when loading data from a () (or from a []), but it is possible to specify a header with a python dictionary (see below for details).
Headers:
Since no header row can be specified for this case, H2O automatically generates a column header in the following format:
C1, C2, C3, ..., CN
Notably, these columns have a 1-based indexing (i.e. the 0th column is “C1”).
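A minimal sketch of this naming scheme, using a hypothetical helper `default_header` (not an H2O function):

```python
def default_header(n_cols):
    """Return the auto-generated names C1..CN for n_cols columns."""
    # Position 0 maps to "C1", position 1 to "C2", and so on.
    return ["C%d" % (i + 1) for i in range(n_cols)]


print(default_header(3))  # ['C1', 'C2', 'C3']
```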
Rows, Columns, and Missing Data:
The shape of the H2OFrame is determined by two factors:
- the number of arrays nested in the ()
- the number of items in each array
If there are no nested arrays (as in Example A and Example D above), the resulting H2OFrame will have the following shape (rows x cols):
1 x len(tuple)
(i.e. a Frame with a single row).
If there are nested arrays (as in Example B and Example C above), then (given the rules stated above) the resulting H2OFrame will have ROWS equal to the number of arrays nested within and COLUMNS equal to the length of the longest sub-array (rows x cols):
len(tuple) x max( [len(l) for l in tuple] )
Note that this addresses the issue with ragged sub-arrays by assuming that shorter sub-arrays will pad themselves with NA (missing values) at the end so that they become the correct length.
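The shape rule described in prose above (one row per nested array, one column per position in the longest sub-array, shorter sub-arrays padded at the end) can be sketched in plain Python. The helpers `frame_shape` and `pad_rows` are hypothetical names used only for illustration, and `None` stands in for NA.

```python
def frame_shape(obj):
    """Return (rows, cols) for a flat or singly nested tuple/list."""
    if obj and all(isinstance(item, (tuple, list)) for item in obj):
        # Nested input: one row per sub-array, width of the longest one.
        return len(obj), max(len(row) for row in obj)
    # Flat input becomes a single row.
    return 1, len(obj)


def pad_rows(obj, na=None):
    """Pad ragged rows of a nested input on the right with the NA marker."""
    _, n_cols = frame_shape(obj)
    return [list(row) + [na] * (n_cols - len(row)) for row in obj]


example_b = ((1, 2, 3), (4, 5, 6), ("cat", "dog"))
print(frame_shape(example_b))  # (3, 3)
print(pad_rows(example_b))     # [[1, 2, 3], [4, 5, 6], ['cat', 'dog', None]]
```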
Because the Frame is uniformly typed, combining data types within a column may produce unexpected results. Please read up on the H2O parser for details on how a column type is determined for mixed-type columns.
Loading A Python List¶
The same principles that apply to tuples also apply to lists. Lists are mutable objects, so there is no semantic difference regarding mutability between an H2OFrame and a list (as there is for a tuple).
Additionally, a list [] is ordered the same way as a tuple (), with the data appearing within the brackets.
Loading A Python Dictionary Or collections.OrderedDict¶
Each entry in the {} is expected to represent a single column. Keys in the {} must be character strings following the pattern: ^[a-zA-Z_][a-zA-Z0-9_.]*$ without restriction on length. A valid column name may begin with any letter (capital or not) or an “_”, followed by any number of letters, digits, “_”s, or ”.”s.
Values in the {} may be a flat [], a flat (), or a single int, float, or string value. Nested [] and () will raise a ValueError. This is the only additional restriction on [] and () that applies in this context.
Note that in Python versions before 3.7 the built-in dict does not guarantee ordering. This affects the order of columns in the eventual H2OFrame, since columns may be written in a different order from the one in which they were put into the dict.
collections.OrderedDict preserves the order in which the key-value pairs were entered.
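As a rough illustration of the key pattern and of ordered input, the sketch below checks candidate column names against the regular expression quoted above before any data would be handed to H2O. `invalid_column_names` is a hypothetical helper, not an H2O function.

```python
import re
from collections import OrderedDict

# The column-name pattern quoted in the text above.
COLUMN_NAME = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_.]*$")


def invalid_column_names(columns):
    """Return the keys that do not match the documented pattern."""
    return [
        key for key in columns
        if not (isinstance(key, str) and COLUMN_NAME.fullmatch(key))
    ]


# An OrderedDict keeps the columns in the order they were entered.
columns = OrderedDict([
    ("id", [1, 2, 3]),               # flat list
    ("score.raw", (0.5, 0.7, 0.9)),  # flat tuple
    ("_flag", "y"),                  # single string value
])
print(list(columns))                  # ['id', 'score.raw', '_flag']
print(invalid_column_names(columns))  # []
```

Names such as `"1abc"` (leading digit) or `"a-b"` (hyphen) would be reported as invalid by this check.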
H2OFrame¶
- class h2o.frame.H2OFrame(python_object=None)[source]¶
Bases: object
- anyfactor()[source]¶
Test if H2OFrame has any factor columns.
True if there are any categorical columns; False otherwise.
- apply(fun=None, axis=0)[source]¶
Apply a lambda expression to an H2OFrame.
- fun: lambda
- A lambda expression to be applied per row or per column
- axis: int
- 0: apply to each column; 1: apply to each row
H2OFrame
- as_data_frame(use_pandas=False)[source]¶
Obtain the dataset as a python-local object.
- use_pandas : bool, default=False
- A flag specifying whether or not to return a pandas DataFrame.
A local python object containing this H2OFrame instance’s data: a list of lists of strings (each inner list is a row) if use_pandas=False, otherwise a pandas DataFrame.
- as_date(format)[source]¶
Return the column with all elements converted to millis since the epoch.
- format : str
- A datetime format string (e.g. “YYYY-mm-dd”)
An H2OFrame instance.
- cbind(data)[source]¶
Append data to this H2OFrame column-wise.
- data : H2OFrame
- H2OFrame to be column bound to the right of this H2OFrame.
H2OFrame of the combined datasets.
- countmatches(pattern)[source]¶
For each string in the column, count the occurrences of pattern.
- pattern : str
- The pattern to count matches on in each string.
A single-column H2OFrame containing the counts for the per-row occurrences of pattern in the input column.
- cut(breaks, labels=None, include_lowest=False, right=True, dig_lab=3)[source]¶
Cut a numeric vector into factor “buckets”. Similar to R’s cut method.
- breaks : list
- The cut points in the numeric vector (must span the range of the col.)
- labels: list
- Factor labels, defaults to set notation of intervals defined by breaks.
- include_lowest : bool
- By default, cuts are defined as (lo,hi]. If True, get [lo,hi].
- right : bool
- Include the high value: (lo,hi]. If False, get (lo,hi).
- dig_lab: int
- Number of digits following the decimal point to consider.
Single-column H2OFrame of categorical data.
- describe()[source]¶
Generate an in-depth description of this H2OFrame. Everything in summary(), plus the data layout.
- drop(i)[source]¶
Drop a column from the current H2OFrame.
- i : str, int
- The column to be dropped
H2OFrame with the column at index i dropped. Returns a new H2OFrame.
- filter_na_cols(frac=0.2)[source]¶
Filter columns with proportion of NAs >= frac.
- frac : float
- Fraction of NAs in the column.
A list of column indices
- static from_python(python_obj, destination_frame='', header=(-1, 0, 1), separator='', column_names=None, column_types=None, na_strings=None)[source]¶
Properly handle native python data types. For a discussion of the rules and permissible data types please refer to the main documentation for H2OFrame.
- python_obj : tuple, list, dict, collections.OrderedDict
- If a nested list/tuple, then each nested collection is a column.
- destination_frame : str, optional
- The unique hex key assigned to the imported file. If none is given, a key will automatically be generated.
- header : int, optional
- -1 means the first line is data, 0 means guess, 1 means first line is header.
- separator : str, optional
- The field separator character. Values on each line of the file are separated by this character. If separator = “”, the parser will automatically detect the separator.
- column_names : list, optional
- A list of column names for the file.
- column_types : list or dict, optional
- A list of types or a dictionary of column names to types to specify whether columns should be forced to a certain type upon import parsing. If a list, the types for elements that are None will be guessed. The possible types a column may have are:
- “unknown” - this will force the column to be parsed as all NA
- “uuid” - the values in the column must be true UUID or will be parsed as NA
- “string” - force the column to be parsed as a string
- “numeric” - force the column to be parsed as numeric. H2O will handle the compression of the numeric data in the optimal manner.
- “enum” - force the column to be parsed as a categorical column.
- “time” - force the column to be parsed as a time column. H2O will attempt to parse the following list of date time formats.
- date:
- “yyyy-MM-dd” “yyyy MM dd” “dd-MMM-yy” “dd MMM yy”
- time:
- “HH:mm:ss” “HH:mm:ss:SSS” “HH:mm:ss:SSSnnnnnn” “HH.mm.ss” “HH.mm.ss.SSS” “HH.mm.ss.SSSnnnnnn”
Times can also contain “AM” or “PM”.
- na_strings : list or dict, optional
- A list of strings, or a list of lists of strings (one list per column), or a dictionary of column names to strings which are to be interpreted as missing values.
A new H2OFrame instance.
>>> l = [[1,2,3,4,5], [99,123,51233,321]]
>>> l = H2OFrame(l)
>>> l
- static get_frame(frame_id)[source]¶
Create an H2OFrame mapped to an existing id in the cluster.
H2OFrame that points to a pre-existing big data H2OFrame in the cluster
- group_by(by)[source]¶
- Returns a new GroupBy object using this frame and the desired grouping columns.
- The returned groups are sorted by the natural group-by column sort.
- by : list
- The columns to group on.
A new GroupBy object.
- gsub(pattern, replacement, ignore_case=False)[source]¶
Globally substitute occurrences of pattern in a string with replacement.
- pattern : str
- A regular expression.
- replacement : str
- A replacement string.
- ignore_case : bool
- If True then pattern will match against upper and lower case.
H2OFrame
- head(rows=10, cols=200)[source]¶
Analogous to R’s head call on a data.frame.
- rows : int, default=10
- Number of rows starting from the topmost
- cols : int, default=200
- Number of columns starting from the leftmost
An H2OFrame.
- hist(breaks='Sturges', plot=True, **kwargs)[source]¶
Compute a histogram over a numeric column.
- breaks: str, int, list
- Can be one of “Sturges”, “Rice”, “sqrt”, “Doane”, “FD”, or “Scott”. Can be a single number for the number of breaks. Can be a list containing the split points, e.g., [-50,213.2123,9324834]. If breaks is “FD”, the MAD is used over the IQR in computing bin width.
- plot : bool, default=True
- If True, then a plot is generated
If plot is False, return H2OFrame with these columns: breaks, counts, mids_true, mids, and density; otherwise produce the plot.
- ifelse(yes, no)[source]¶
Equivalent to [y if t else n for t,y,n in zip(self,yes,no)]
Based on the booleans in the test vector, the output has the values of the yes and no vectors interleaved (or merged together). All Frames must have the same row count. Single column frames are broadened to match wider Frames. Scalars are allowed, and are also broadened to match wider frames.
- test : H2OFrame (self)
- Frame of values treated as booleans; may be a single column
- yes : H2OFrame
- Frame to use if [test] is true ; may be a scalar or single column
- no : H2OFrame
- Frame to use if [test] is false; may be a scalar or single column
H2OFrame of the merged yes/no Frames/scalars according to the test input frame.
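The list-comprehension equivalence given above, together with scalar broadcasting, can be sketched in plain Python. This models single columns as lists and is only an illustration of the selection rule; it does not cover multi-column frames, and the standalone `ifelse` function here is hypothetical rather than the H2OFrame method.

```python
def ifelse(test, yes, no):
    """Pick yes[i] where test[i] is truthy, otherwise no[i]."""
    size = len(test)

    def broadcast(value):
        # A scalar is repeated to match the length of the test column.
        if isinstance(value, (list, tuple)):
            return list(value)
        return [value] * size

    return [y if t else n for t, y, n in zip(test, broadcast(yes), broadcast(no))]


print(ifelse([1, 0, 1], ["a", "b", "c"], "z"))  # ['a', 'z', 'c']
```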
- impute(column, method='mean', combine_method='interpolate', by=None)[source]¶
Impute a column in this H2OFrame
- column : int, str
- The column to impute
- method: str, default=”mean”
- How to compute the imputation value.
- combine_method: str, default=”interpolate”
- For even samples and method=”median”, how to combine quantiles.
- by : list
- Columns to group-by for computing imputation value per groups of columns.
An H2OFrame with the desired column’s NAs filled with imputed values. Note that the returned Frame is conceptually a new Frame, but due to back-end optimizations is frequently not actually a copy.
- insert_missing_values(fraction=0.1, seed=None)[source]¶
Insert missing values into an H2OFrame. This is primarily used for testing.
Randomly replaces a user-specified fraction of entries in an H2O dataset with missing values.
WARNING: This will modify the original dataset. Unless this is intended, this function should only be called on a subset of the original.
- fraction : float
- A number between 0 and 1 indicating the fraction of entries to replace with missing.
- seed : int
- A random number used to select which entries to replace with missing values.
H2OFrame with missing values inserted.
- interaction(factors, pairwise, max_factors, min_occurrence, destination_frame=None)[source]¶
Categorical Interaction Feature Creation in H2O. Creates a frame in H2O with n-th order interaction features between categorical columns, as specified by the user.
- factors : list
- Factor columns (either indices or column names).
- pairwise : bool
- Whether to create pairwise interactions between factors (otherwise create one higher-order interaction). Only applicable if there are 3 or more factors.
- max_factors: int
- Max. number of factor levels in pair-wise interaction terms (if enforced, one extra catch-all factor will be made)
- min_occurrence: int
- Min. occurrence threshold for factor levels in pair-wise interaction terms
- destination_frame: str, optional
- A string indicating the destination key.
H2OFrame
- isfactor()[source]¶
Test if the selection is a factor column.
True if the column is categorical; otherwise False. For String columns, the result is False.
- isna()[source]¶
For each row in a column, determine if it is NA or not.
Single-column H2OFrame of 1s and 0s. 1 means the value was NA.
- kfold_column(n_folds=3, seed=-1)[source]¶
Build a fold assignments column for cross-validation. This call will produce a column having the same data layout as the calling object.
- n_folds : int
- An integer specifying the number of validation sets to split the training data into.
- seed : int, optional
- Seed for random numbers as fold IDs are randomly assigned.
A single column H2OFrame with the fold assignments.
- match(table, nomatch=0)[source]¶
Makes a vector of the positions of (first) matches of its first argument in its second.
Parameters: - table –
- nomatch –
Returns: H2OFrame of one boolean column
- mean(na_rm=False)[source]¶
Compute the mean.
- na_rm: bool, default=False
- If True, then remove NAs from the computation.
A list containing the mean for each column (NaN for non-numeric columns).
- median(na_rm=False)[source]¶
Compute the median.
- na_rm: bool, default=False
- If True, then remove NAs from the computation.
A list containing the median for each column (NaN for non-numeric columns).
- merge(other, allLeft=True, allRite=False)[source]¶
Merge two datasets based on common column names
- other: H2OFrame
- Other dataset to merge. Must have at least one column in common with self, and all columns in common are used as the merge key. If you want to use only a subset of the columns in common, rename the other columns so the columns are unique in the merged result.
- allLeft: bool, default=True
- If True, include all rows from the left/self frame
- allRite: bool, default=False
- If True, include all rows from the right/other frame
Original self frame enhanced with merged columns and rows
- static mktime(year=1970, month=0, day=0, hour=0, minute=0, second=0, msec=0)[source]¶
All units are zero-based (including months and days). Missing year is 1970.
- year : int, H2OFrame
- month : int, H2OFrame
- day : int, H2OFrame
- hour : int, H2OFrame
- minute : int, H2OFrame
- second : int, H2OFrame
- msec : int, H2OFrame
H2OFrame of one column containing the date in millis since the epoch.
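To make the zero-based convention concrete, here is a plain-Python sketch that converts scalar arguments to milliseconds since the epoch. It assumes UTC and integer inputs only; `mktime_millis` is a hypothetical stand-in, and the time zone handling of the actual H2O method may differ.

```python
from datetime import datetime, timezone


def mktime_millis(year=1970, month=0, day=0, hour=0, minute=0, second=0, msec=0):
    """Milliseconds since the epoch for zero-based month and day (UTC assumed)."""
    # datetime expects one-based month and day, so shift both by one.
    moment = datetime(year, month + 1, day + 1, hour, minute, second,
                      msec * 1000, tzinfo=timezone.utc)
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    delta = moment - epoch
    return delta.days * 86400000 + delta.seconds * 1000 + delta.microseconds // 1000


print(mktime_millis())       # 0, i.e. 1970-01-01 00:00:00 UTC
print(mktime_millis(day=1))  # 86400000, i.e. 1970-01-02
```

With all defaults, month 0 and day 0 resolve to 1 January 1970, so the result is 0.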
- modulo_kfold_column(n_folds=3)[source]¶
Build a fold assignments column for cross-validation. Rows are assigned a fold according to the current row number modulo n_folds.
- n_folds : int
- An integer specifying the number of validation sets to split the training data into.
A single column H2OFrame with the fold assignments.
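The modulo rule can be sketched in one line of plain Python. The sketch assumes zero-based row numbers and fold ids, which the text does not state; `modulo_fold_ids` is a hypothetical helper.

```python
def modulo_fold_ids(n_rows, n_folds=3):
    """Assign each row the fold id (row number modulo n_folds)."""
    return [row % n_folds for row in range(n_rows)]


print(modulo_fold_ids(7))  # [0, 1, 2, 0, 1, 2, 0]
```

Unlike a randomly seeded assignment, this is deterministic, and fold sizes differ by at most one row.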
- mult(matrix)[source]¶
Perform matrix multiplication.
- matrix : H2OFrame
- The right-hand-side matrix
H2OFrame result of the matrix multiplication
- names None[source]¶
Retrieve the column names (one name per H2OVec) for this H2OFrame.
A str list of column names
- nchar()[source]¶
Count the number of characters in each string of single-column H2OFrame.
A single-column H2OFrame containing the per-row character count.
- nlevels()[source]¶
Get the number of factor levels for this frame.
A dictionary of column_name:number_levels pairs.
- pop(i)[source]¶
Pop a column from the H2OFrame at index i
- i : int, str
- The index or name of the column to pop.
The column dropped from the frame; the frame is modified in place and no longer contains the column.
- prod(na_rm=False)[source]¶
- na_rm : bool, default=False
- True or False to remove NAs from computation.
The product of the column.
- quantile(prob=None, combine_method='interpolate')[source]¶
Compute quantiles.
- prob : list, default=[0.01,0.1,0.25,0.333,0.5,0.667,0.75,0.9,0.99]
- A list of probabilities of any length.
- combine_method : str, default=”interpolate”
- For even samples, how to combine quantiles. Should be one of [“interpolate”, “average”, “low”, “hi”]
A new H2OFrame containing the quantiles and probabilities.
- rbind(data)[source]¶
Combine H2O Datasets by rows. Takes a sequence of H2O data sets and combines them by rows.
- data : H2OFrame
Returns this H2OFrame with data appended row-wise.
- rep_len(length_out)[source]¶
Replicates the values in data in the H2O backend
- length_out : int
- Number of columns of the resulting H2OFrame
H2OFrame
- runif(seed=None)[source]¶
Generate a column of random numbers drawn from a uniform distribution [0,1) and having the same data layout as the calling H2OFrame instance.
- seed : int, optional
- A random seed. If None, then one will be generated.
Single-column H2OFrame filled with doubles sampled uniformly from [0,1).
- scale(center=True, scale=True)[source]¶
Centers and/or scales the columns of this H2OFrame.
- center : bool, list
- If True, then demean the data by the mean. If False, no shifting is done. If a list, then shift each column by the given amount in the list.
- scale : bool, list
- If True, then scale the data by the column standard deviation. If False, no scaling is done. If a list, then scale each column by the given amount in the list.
H2OFrame
- sd(na_rm=False)[source]¶
Compute the standard deviation.
- na_rm : bool, default=False
- Remove NAs from the computation.
A list containing the standard deviation for each column (NaN for non-numeric columns).
- set_level(level)[source]¶
A method to set all column values to one of the levels.
- level : str
- The level at which the column will be set (a string)
H2OFrame with entries set to the desired level.
- set_levels(levels)[source]¶
Works on a single categorical column. New domains must be aligned with the old domains. This call has copy-on-write semantics.
- levels : list
- A list of strings specifying the new levels. The number of new levels must match the number of old levels.
A single-column H2OFrame with the desired levels.
- set_name(col=None, name=None)[source]¶
Set the name of the column at the specified index.
- col : int, str
- Index of the column whose name is to be set; may be skipped for 1-column frames
- name : str
- The new name of the column to set
Returns self.
- set_names(names)[source]¶
Change all of this H2OFrame instance’s column names.
- names : list
- A list of strings equal to the number of columns in the H2OFrame.
- show(use_pandas=False)[source]¶
Used by the H2OFrame.__repr__ method to print or display a snippet of the data frame. If called from IPython, displays an HTML-formatted result; otherwise prints a tabulated result.
- signif(digits=6)[source]¶
Round doubles/floats to the given number of significant digits.
- digits : int, default=6
- Number of significant digits to round doubles/floats.
H2OFrame
- split_frame(ratios=None, destination_frames=None, seed=None)[source]¶
Split a frame into distinct subsets of size determined by the given ratios. The number of subsets is always 1 more than the number of ratios given.
- ratios : list
- The fraction of rows for each split.
- destination_frames : list
- The names of the split frames.
- seed : int
- Used for selecting which H2OFrame a row will belong to.
A list of H2OFrame instances
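The relationship between ratios and the number of subsets can be illustrated with a plain-Python sketch. It assumes one plausible mechanism, a per-row random draw compared against cumulative ratios, which is not confirmed to be how H2O assigns rows; `split_rows` is a hypothetical helper, and split sizes are only approximately proportional to the ratios.

```python
import random


def split_rows(rows, ratios, seed=None):
    """Split rows into len(ratios) + 1 lists using a per-row random draw."""
    rng = random.Random(seed)
    # Cumulative upper bounds, e.g. [0.6, 0.2] -> [0.6, 0.8].
    bounds = []
    total = 0.0
    for ratio in ratios:
        total += ratio
        bounds.append(total)
    splits = [[] for _ in range(len(ratios) + 1)]
    for row in rows:
        draw = rng.random()
        # The first bound above the draw selects the split;
        # anything past the last bound lands in the final, leftover split.
        index = next((i for i, b in enumerate(bounds) if draw < b), len(ratios))
        splits[index].append(row)
    return splits


parts = split_rows(list(range(100)), ratios=[0.6, 0.2], seed=1)
print(len(parts))  # 3
```

Two ratios produce three subsets: the last one receives the remaining fraction (here roughly 0.2 of the rows).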
- stratified_kfold_column(n_folds=3, seed=-1)[source]¶
Build a fold assignment column with the constraint that each fold has the same class distribution as the fold column.
- n_folds: int
- The number of folds to build.
- seed: int
- A random seed.
A single column H2OFrame with the fold assignments.
- stratified_split(test_frac=0.2, seed=-1)[source]¶
Construct a column that can be used to perform a random stratified split.
- test_frac : float, default=0.2
- The fraction of rows that will belong to the “test”.
- seed : int
- For seeding the random splitting.
A categorical column of two levels “train” and “test”.
>>> my_stratified_split = my_frame["response"].stratified_split(test_frac=0.3,seed=12349453)
>>> train = my_frame[my_stratified_split=="train"]
>>> test = my_frame[my_stratified_split=="test"]
>>> # check the distributions among the initial frame, and the train/test frames match
>>> my_frame["response"].table()["Count"] / my_frame["response"].table()["Count"].sum()
>>> train["response"].table()["Count"] / train["response"].table()["Count"].sum()
>>> test["response"].table()["Count"] / test["response"].table()["Count"].sum()
- strsplit(pattern)[source]¶
Split the strings in the target column on the given pattern
- pattern : str
- The split pattern.
H2OFrame containing columns of the split strings.
- sub(pattern, replacement, ignore_case=False)[source]¶
Substitute the first occurrence of pattern in a string with replacement.
- pattern : str
- A regular expression.
- replacement : str
- A replacement string.
- ignore_case : bool
- If True then pattern will match against upper and lower case.
H2OFrame
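The difference between sub (first occurrence only) and gsub (every occurrence) can be illustrated on plain Python strings with the standard `re` module. This is an analogy only: `sub_first` and `gsub_all` are hypothetical helpers, and the regular-expression dialect used by the H2O backend may differ from Python's.

```python
import re


def sub_first(text, pattern, replacement, ignore_case=False):
    """Replace only the first match of pattern, like sub."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.sub(pattern, replacement, text, count=1, flags=flags)


def gsub_all(text, pattern, replacement, ignore_case=False):
    """Replace every match of pattern, like gsub."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.sub(pattern, replacement, text, flags=flags)


print(sub_first("banana", "an", "AN"))  # bANana
print(gsub_all("banana", "an", "AN"))   # bANANa
```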
- table(data2=None)[source]¶
Compute the counts of values appearing in a column, or co-occurrence counts between two columns.
- data2 : H2OFrame
- Default is None, can be an optional single column to aggregate counts by.
H2OFrame of the counts at each combination of factor levels
- tail(rows=10, cols=200)[source]¶
Analogous to R’s tail call on a data.frame.
- rows : int, default=10
- Number of rows starting from the bottommost
- cols: int, default=200
- Number of columns starting from the leftmost
An H2OFrame.
- trim()[source]¶
Trim white space on the left and right of strings in a single-column H2OFrame.
H2OFrame with trimmed strings.
- unique()[source]¶
Extract the unique values in the column.
H2OFrame of just the unique values in the column.
- var(y=None, use='everything')[source]¶
Compute the variance, or co-variance matrix.
- y : H2OFrame, default=None
- If y is None, then the variance is computed for self. If self has more than one column, then the covariance matrix is returned. If y is not None, then the covariance between self and y is computed (self and y must therefore both be single columns).
- use : str
- One of “everything”, “complete.obs”, or “all.obs”
The covariance matrix of the columns in this H2OFrame if it has more than one column and y is None; otherwise an eagerly computed scalar.
GroupBy¶
- class h2o.group_by.GroupBy(fr, by)[source]¶
A class that represents the group by operation on an H2OFrame.
Sample usage:
>>> my_frame = ... # some existing H2OFrame
>>> grouped = my_frame.group_by(by=["C1","C2"])
>>> grouped.sum(col="X1",na="all").mean(col="X5",na="all").max()
>>> grouped.get_frame
Any number of aggregations may be chained together in this manner.
If no arguments are given to the aggregation (e.g. “max” in the above example), then it is assumed that the aggregation should apply to all columns but the group by columns.
The na parameter is one of [“all”, “ignore”, “rm”]:
- “all” - include NAs
- “rm” - exclude NAs
Variance (var) and standard deviation (sd) are the sample (not population) statistics.
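The distinction between sample and population statistics can be checked with Python's standard `statistics` module. The values below are made up for illustration and have nothing to do with any H2O dataset.

```python
import statistics

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Sample statistics divide the sum of squared deviations by n - 1.
sample_var = statistics.variance(values)
sample_sd = statistics.stdev(values)

# Population statistics divide by n instead.
population_var = statistics.pvariance(values)

print(sample_var > population_var)  # True
print(population_var)               # 4.0
```

For this data the sum of squared deviations is 32, so the sample variance is 32/7 while the population variance is 32/8 = 4.0.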