Data Manipulation¶
H2OFrame
¶

class
h2o.frame.
H2OFrame
(python_obj=None, destination_frame=None, header=0, separator=u', ', column_names=None, column_types=None, na_strings=None)[source]¶ Primary data store for H2O.
H2OFrame is similar to pandas’ DataFrame, or R’s data.frame. One of the critical distinctions is that the data is generally not held in memory; instead it is located on a (possibly remote) H2O cluster, and thus H2OFrame represents a mere handle to that data.
acosh
()[source]¶ Return new H2OFrame equal to elementwise inverse hyperbolic cosine of the current frame.

apply
(fun=None, axis=0)[source]¶ Apply a lambda expression to an H2OFrame.
Parameters:  fun – a lambda expression to be applied per row or per column.
 axis – 0 = apply to each column; 1 = apply to each row
Returns: a new H2OFrame with the results of applying fun to the current frame.

as_data_frame
(use_pandas=True, header=True)[source]¶ Obtain the dataset as a python-local object.
Parameters:  use_pandas (bool) – If True (default), then return the H2OFrame as a pandas DataFrame (requires that the pandas library is installed). If False, then return the contents of the H2OFrame as a plain nested list, in rowwise order.
 header (bool) – If True (default), then column names will be included as the first row of the list.
Returns: A python object containing this H2OFrame instance’s data: a pandas DataFrame if use_pandas=True, otherwise a list of lists of strings in which each inner list is a row.

as_date
(format)[source]¶ Convert the frame (containing strings / categoricals) into the date format.
Parameters: format (str) – the format string (e.g. “YYYYmmdd”)
Returns: new H2OFrame with “date” column types

ascharacter
()[source]¶ Convert all columns in the frame into strings.
Returns: new H2OFrame with columns of “string” type.

asfactor
()[source]¶ Convert columns in the current frame to categoricals.
Returns: new H2OFrame with columns of the “enum” type.

asinh
()[source]¶ Return new H2OFrame equal to elementwise inverse hyperbolic sine of the current frame.

atanh
()[source]¶ Return new H2OFrame equal to elementwise inverse hyperbolic tangent of the current frame.

categories
()[source]¶ Return the list of levels for an enum (categorical) column.
This function can only be applied to a single-column categorical frame.

cbind
(data)[source]¶ Append data to this frame columnwise.
Parameters: data (H2OFrame) – append columns of frame data to the current frame. You can also cbind a number, in which case it will get converted into a constant column.
Returns: new H2OFrame with all frames in data appended columnwise.

ceil
()[source]¶ Apply the ceiling function to the current frame.
ceil(x) is the smallest integer greater than or equal to x.
Returns: new H2OFrame of ceiling values of the original frame.

col_names
¶ Same as self.names.

columns
¶ Same as self.names.

columns_by_type
(coltype=u'numeric')[source]¶ Extract columns of the specified type from the frame.
Parameters: coltype (str) – A character string indicating which column type to filter by. This must be one of the following:
 "numeric" – Numeric, but not categorical or time
 "categorical" – Integer, with a categorical/factor String mapping
 "string" – String column
 "time" – Long msec since the Unix Epoch, with a variety of display/parse options
 "uuid" – UUID
 "bad" – No non-NA rows (triple negative! all NAs or zero rows)
Returns: list of indices of columns that have the requested type

concat
(frames, axis=1)[source]¶ Append multiple H2OFrames to this frame, columnwise or rowwise.
Parameters:  frames (List[H2OFrame]) – list of frames that should be appended to the current frame.
 axis (int) – if 1 then append columnwise (default), if 0 then append rowwise.
Returns: an H2OFrame of the combined datasets.

cor
(y=None, na_rm=False, use=None)[source]¶ Compute the correlation matrix of one or two H2OFrames.
Parameters:  y (H2OFrame) – If this parameter is provided, then compute correlation between the columns of y and the columns of the current frame. If this parameter is not given, then just compute the correlation matrix for the columns of the current frame.
 use (str) – A string indicating how to handle missing values. This could be one of the following:
 "everything": outputs NaNs whenever one of its contributing observations is missing
 "all.obs": presence of missing observations will throw an error
 "complete.obs": discards missing values along with all observations in their rows so that only complete observations are used
 na_rm (bool) – an alternative to use: when this is True then the default value for use is "everything"; and if False then the default use is "complete.obs". This parameter has no effect if use is given explicitly.
Returns: An H2OFrame of the correlation matrix of the columns of this frame (if y is not given), or with the columns of y (if y is given). However, when this frame and y are both single rows or single columns, then the correlation is returned as a scalar.

cosh
()[source]¶ Make new H2OFrame with values equal to the hyperbolic cosines of the values in the current frame.

cospi
()[source]¶ Return new H2OFrame equal to elementwise cosine of the current frame multiplied by Pi.

countmatches
(pattern)[source]¶ For each string in the frame, count the occurrences of the provided pattern.
The pattern here is a plain string, not a regular expression. We will search for the occurrences of the pattern as a substring in each element of the frame. This function is applicable to frames containing only string or categorical columns.
Parameters: pattern (str) – The pattern to count matches on in each string. This can also be a list of strings, in which case all of them will be searched for.
Returns: numeric H2OFrame with the same shape as the original, containing counts of matches of the pattern for each cell in the original frame.
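The counting rule described above can be illustrated in plain Python, without an H2O cluster. This is only a sketch: it assumes non-overlapping, case-sensitive substring matches (the behaviour of Python's str.count), which this page does not confirm for H2O, and the helper name count_matches is hypothetical.

```python
def count_matches(cells, patterns):
    """Count plain-substring occurrences of each pattern in every cell."""
    if isinstance(patterns, str):
        patterns = [patterns]
    # str.count treats the pattern literally: "." is a dot, not a regex wildcard.
    return [sum(cell.count(p) for p in patterns) for cell in cells]


column = ["banana", "a.b.c", "cherry"]
print(count_matches(column, "an"))        # [2, 0, 0]
print(count_matches(column, ["a", "."]))  # [3, 3, 0]
```

With a list of patterns, the counts for the individual patterns are summed per cell, which is one reasonable reading of "all of them will be searched for".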

cummax
(axis=0)[source]¶ Compute cumulative maximum over rows / columns of the frame.
Parameters: axis (int) – 0 for columnwise, 1 for rowwise
Returns: new H2OFrame with running maximums of the original frame.

cummin
(axis=0)[source]¶ Compute cumulative minimum over rows / columns of the frame.
Parameters: axis (int) – 0 for columnwise, 1 for rowwise
Returns: new H2OFrame with running minimums of the original frame.

cumprod
(axis=0)[source]¶ Compute cumulative product over rows / columns of the frame.
Parameters: axis (int) – 0 for columnwise, 1 for rowwise
Returns: new H2OFrame with cumulative products of the original frame.

cumsum
(axis=0)[source]¶ Compute cumulative sum over rows / columns of the frame.
Parameters: axis (int) – 0 for columnwise, 1 for rowwise
Returns: new H2OFrame with cumulative sums of the original frame.
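What the four cumulative methods (cummax, cummin, cumprod, cumsum) compute along one column can be shown with the standard library alone. The snippet below is a plain-Python illustration on a list and does not call H2O; handling of NAs is deliberately left out.

```python
from itertools import accumulate
import operator

column = [3, 1, 4, 1, 5]

running_sum = list(accumulate(column))                 # [3, 4, 8, 9, 14]
running_prod = list(accumulate(column, operator.mul))  # [3, 3, 12, 12, 60]
running_max = list(accumulate(column, max))            # [3, 3, 4, 4, 5]
running_min = list(accumulate(column, min))            # [3, 1, 1, 1, 1]

print(running_sum, running_prod, running_max, running_min)
```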

cut
(breaks, labels=None, include_lowest=False, right=True, dig_lab=3)[source]¶ Cut a numeric vector into categorical “buckets”.
This method is only applicable to a single-column numeric frame.
Parameters:  breaks (List[float]) – The cut points in the numeric vector.
 labels (List[str]) – Labels for categorical levels produced. Defaults to set notation of intervals defined by the breaks.
 include_lowest (bool) – By default, cuts are defined as intervals (lo, hi]. If this parameter is True, then the interval becomes [lo, hi].
 right (bool) – Include the high value: (lo, hi]. If False, get (lo, hi).
 dig_lab (int) – Number of digits following the decimal point to consider.
Returns: Single-column H2OFrame of categorical data.
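The default (lo, hi] bucketing and the effect of include_lowest can be sketched in plain Python. This is an illustration of the interval rule only, not H2O's implementation: it assumes sorted breaks, covers only right=True, and represents values that fall outside every interval as None. The function name and label format are hypothetical.

```python
from bisect import bisect_left


def cut(values, breaks, include_lowest=False):
    """Label each value with the interval (lo, hi] that contains it."""
    labels = []
    for v in values:
        if include_lowest and v == breaks[0]:
            i = 1  # the lowest break is admitted into the first interval
        else:
            i = bisect_left(breaks, v)  # index of the first break >= v
        if 0 < i < len(breaks):
            left = "[" if include_lowest and i == 1 else "("
            labels.append("%s%s,%s]" % (left, breaks[i - 1], breaks[i]))
        else:
            labels.append(None)  # outside all intervals
    return labels


print(cut([0, 5, 10, 20, 25], [0, 10, 20]))
print(cut([0, 5, 10, 20, 25], [0, 10, 20], include_lowest=True))
```

In the first call the value 0 gets no bucket because the lowest interval is open on the left; in the second call it lands in [0,10].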

day
()[source]¶ Extract the “day” part from a date column.
Returns: a single-column H2OFrame containing the “day” part from the source frame.

dayOfWeek
()[source]¶ Extract the “day-of-week” part from a date column.
Returns: a single-column H2OFrame containing the “day-of-week” part from the source frame.

describe
(chunk_summary=False)[source]¶ Generate an in-depth description of this H2OFrame.
This will print to the console the dimensions of the frame; names/types/summary statistics for each column; and finally the first ten rows of the frame.
Parameters: chunk_summary (bool) – Retrieve the chunk summary along with the distribution summary

difflag1
()[source]¶ Conduct a diff-1 transform on a numeric frame column.
Returns: an H2OFrame where each element is equal to the corresponding element in the source frame minus the previous-row element in the same frame.
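The transform amounts to a lag-1 difference. A plain-Python sketch on a list follows; it does not call H2O, and the use of None for the first row (which has no previous row) is an assumption, since this page does not say what H2O places there.

```python
def difflag1(column):
    """Return column[i] - column[i - 1] for every row after the first."""
    if not column:
        return []
    # The first row has no predecessor; None stands in for a missing value.
    return [None] + [cur - prev for prev, cur in zip(column, column[1:])]


print(difflag1([1, 4, 9, 16]))  # [None, 3, 5, 7]
```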

dim
¶ Same as list(self.shape).

distance
(y, measure=None)[source]¶ Compute a pairwise distance measure between all rows of two numeric H2OFrames.
Parameters:  y (H2OFrame) – Frame containing queries (small)
 measure (str) – A string indicating what distance measure to use. Must be one of:
 "l1": Absolute distance (L1-norm, >=0)
 "l2": Euclidean distance (L2-norm, >=0)
 "cosine": Cosine similarity (-1...1)
 "cosine_sq": Squared Cosine similarity (0...1)
Examples:
>>> iris_h2o = h2o.import_file(path=pyunit_utils.locate("smalldata/iris/iris.csv"))
>>> references = iris_h2o[10:150,0:4]
>>> queries = iris_h2o[0:10,0:4]
>>> A = references.distance(queries, "l1")
>>> B = references.distance(queries, "l2")
>>> C = references.distance(queries, "cosine")
>>> D = references.distance(queries, "cosine_sq")
>>> E = queries.distance(references, "l1")
>>> (E.transpose() == A).all()
Returns: An H2OFrame of the matrix containing pairwise distance / similarity between the rows of this frame (N x p) and y (M x p), with dimensions (N x M).
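For readers without a cluster at hand, the four measures and the N x M result shape can be sketched in plain Python on nested lists. This is a textbook rendering of the formulas named above, not H2O's code; the function and variable names are hypothetical.

```python
import math


def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)  # undefined for all-zero rows


MEASURES = {
    "l1": l1,
    "l2": l2,
    "cosine": cosine,
    "cosine_sq": lambda a, b: cosine(a, b) ** 2,
}


def pairwise(frame, y, measure):
    """Return an N x M nested list: one row per row of frame, one column per row of y."""
    fn = MEASURES[measure]
    return [[fn(row, query) for query in y] for row in frame]


references = [[1.0, 0.0], [3.0, 4.0], [0.0, 2.0]]  # N = 3 rows
queries = [[1.0, 0.0], [0.0, 1.0]]                 # M = 2 rows

A = pairwise(references, queries, "l1")            # 3 x 2
E = pairwise(queries, references, "l1")            # 2 x 3
E_transposed = [list(col) for col in zip(*E)]
print(A == E_transposed)                           # True
```

The final comparison mirrors the last line of the doctest: swapping the two frames transposes the result.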

drop
(index, axis=1)[source]¶ Drop a single column or row or a set of columns or rows from an H2OFrame.
Dropping a column or row is not in-place. Indices of rows and columns are zero-based.
Parameters:  index – A list of column indices, column names, or row indices to drop; or a string to drop a single column by name; or an int to drop a single column by index.
 axis (int) – If 1 (default), then drop columns; if 0 then drop rows.
Returns: a new H2OFrame with the respective dropped columns or rows. The original H2OFrame remains unchanged.

entropy
()[source]¶ For each string, compute its Shannon entropy; if the string is empty, the entropy is 0.
Returns: an H2OFrame of Shannon entropies.
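As an illustration of the quantity involved, the sketch below computes the Shannon entropy of a single Python string from its character frequencies. It assumes a base-2 logarithm and per-character probabilities; this page does not state which base or alphabet H2O uses, so treat the numbers as indicative only.

```python
import math
from collections import Counter


def shannon_entropy(s):
    """Entropy in bits of the character distribution of s; 0 for an empty string."""
    if not s:
        return 0.0
    n = len(s)
    # Sum of p * log2(1 / p) over the distinct characters of s.
    return sum((c / n) * math.log2(n / c) for c in Counter(s).values())


print(shannon_entropy(""))      # 0.0
print(shannon_entropy("aaaa"))  # 0.0
print(shannon_entropy("abcd"))  # 2.0
```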

expm1
()[source]¶ Return new H2OFrame equal to elementwise exponent minus 1 (i.e. e^x - 1) of the current frame.

fillna
(method=u'forward', axis=0, maxlen=1)[source]¶ Return a new Frame that fills NA along a given axis and along a given direction with a maximum fill length.
Parameters:  method – "forward" or "backward"
 axis – 0 for columnwise or 1 for rowwise fill
 maxlen – Max number of consecutive NAs to fill

filter_na_cols
(frac=0.2)[source]¶ Filter columns with proportion of NAs greater than or equal to frac.
Parameters: frac (float) – Maximum fraction of NAs in the column to keep.
Returns: A list of indices of columns that have fewer NAs than frac. If all columns are filtered, None is returned.
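The selection rule reads as "keep a column only if its NA fraction is strictly below frac". A plain-Python sketch on lists of values follows, with None standing in for NA; the function name mirrors the method but nothing here calls H2O, and columns are assumed to be non-empty.

```python
def filter_na_cols(columns, frac=0.2):
    """Return indices of columns whose fraction of None values is below frac."""
    keep = [
        i
        for i, col in enumerate(columns)
        if sum(v is None for v in col) / len(col) < frac
    ]
    return keep or None  # None when every column was filtered out


columns = [
    [1, 2, 3, 4, 5],         # 0% missing
    [1, None, 3, 4, 5],      # 20% missing
    [None, None, 3, 4, 5],   # 40% missing
]
print(filter_na_cols(columns))            # [0]
print(filter_na_cols(columns, frac=0.5))  # [0, 1, 2]
```

Note that the column with exactly 20% missing values is dropped at the default threshold, because the comparison is "greater than or equal to".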

flatten
()[source]¶ Convert a 1x1 frame into a scalar.
Returns: content of this 1x1 frame as a scalar (int, float, or str).
Raises: H2OValueError – if current frame has shape other than 1x1

floor
()[source]¶ Apply the floor function to the current frame.
floor(x) is the largest integer smaller than or equal to x.
Returns: new H2OFrame of floor values of the original frame.

frame_id
¶ Internal id of the frame (str).

static
from_python
(python_obj, destination_frame=None, header=0, separator=u', ', column_names=None, column_types=None, na_strings=None)[source]¶ [DEPRECATED] Use constructor H2OFrame() instead.

static
get_frame
(frame_id)[source]¶ Retrieve an existing H2OFrame from the H2O cluster using the frame’s id.
Parameters: frame_id (str) – id of the frame to retrieve
Returns: an existing H2OFrame with the id provided; or None if such a frame doesn’t exist.

get_frame_data
()[source]¶ Get frame data as a string in csv format.
This will create a multiline string, where each line will contain a separate row of frame’s data, with individual values separated by commas.

getrow
()[source]¶ Convert a 1xn frame into an n-element list.
Returns: content of this 1xn frame as a Python list.
Raises: H2OValueError – if current frame has more than one row.

grep
(pattern, ignore_case=False, invert=False, output_logical=False)[source]¶ Searches for matches to argument pattern within each element of a string column.
Default behavior is to return indices of the elements matching the pattern. Parameter output_logical can be used to return a logical vector indicating if the element matches the pattern (1) or not (0).
Parameters:  pattern (str) – A character string containing a regular expression.
 ignore_case (bool) – If True, then case is ignored during matching.
 invert (bool) – If True, then identify elements that do not match the pattern.
 output_logical (bool) – If True, then return logical vector of indicators instead of list of matching positions
Returns: H2OFrame holding the matching positions or a logical list if output_logical is enabled.

group_by
(by)[source]¶ Return a new GroupBy object using this frame and the desired grouping columns.
The returned groups are sorted by the natural group-by column sort.
Parameters: by – The columns to group on (either a single column name, or a list of column names, or a list of column indices).

gsub
(pattern, replacement, ignore_case=False)[source]¶ Globally substitute occurrences of pattern in a string with replacement.
Parameters:  pattern (str) – A regular expression.
 replacement (str) – A replacement string.
 ignore_case (bool) – If True then pattern will match case-insensitively.
Returns: an H2OFrame with all occurrences of pattern in all values replaced with replacement.

head
(rows=10, cols=200)[source]¶ Return the first rows and cols of the frame as a new H2OFrame.
Parameters:  rows (int) – maximum number of rows to return
 cols (int) – maximum number of columns to return
Returns: a new H2OFrame cut from the top left corner of the current frame, and having dimensions at most rows x cols.

hist
(breaks=u'sturges', plot=True, **kwargs)[source]¶ Compute a histogram over a numeric column.
Parameters:  breaks – Can be one of "sturges", "rice", "sqrt", "doane", "fd", "scott"; or a single number for the number of breaks; or a list containing the split points, e.g. [50, 213.2123, 9324834]. If breaks is "fd", the MAD is used over the IQR in computing bin width.
 plot (bool) – If True (default), then a plot will be generated using matplotlib.
Returns: If plot is False, return H2OFrame with these columns: breaks, counts, mids_true, mids, and density; otherwise this method draws a plot and returns nothing.

hour
()[source]¶ Extract the “hour-of-day” part from a date column.
Returns: a single-column H2OFrame containing the “hour-of-day” part from the source frame.

idxmax
(skipna=True, axis=0)[source]¶ Get the index of the max value in a column or row
Parameters:  skipna (bool) – If True (default), then NAs are ignored during the search. Otherwise presence of NAs renders the entire result NA.
 axis (int) – Direction of finding the max index. If 0 (default), then the max index is searched columnwise, and the result is a frame with 1 row and number of columns as in the original frame. If 1, then the max index is searched rowwise and the result is a frame with 1 column, and number of rows equal to the number of rows in the original frame.
Returns: either a list of max index values per-column or an H2OFrame containing max index values per-row from the original frame.

idxmin
(skipna=True, axis=0)[source]¶ Get the index of the min value in a column or row
Parameters:  skipna (bool) – If True (default), then NAs are ignored during the search. Otherwise presence of NAs renders the entire result NA.
 axis (int) – Direction of finding the min index. If 0 (default), then the min index is searched columnwise, and the result is a frame with 1 row and number of columns as in the original frame. If 1, then the min index is searched rowwise and the result is a frame with 1 column, and number of rows equal to the number of rows in the original frame.
Returns: either a list of min index values per-column or an H2OFrame containing min index values per-row from the original frame.

ifelse
(yes, no)[source]¶ Equivalent to [y if t else n for t,y,n in zip(self,yes,no)].
Based on the booleans in the test vector, the output has the values of the yes and no vectors interleaved (or merged together). All Frames must have the same row count. Single column frames are broadened to match wider Frames. Scalars are allowed, and are also broadened to match wider frames.
Parameters:  yes – Frame to use if test is true; may be a scalar or single column
 no – Frame to use if test is false; may be a scalar or single column
Returns: an H2OFrame of the merged yes/no frames/scalars according to the test input frame.
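The list-comprehension equivalence quoted above, including the broadening of scalars, can be tried out on ordinary lists. The sketch below is plain Python for a single column; it does not call H2O, and the helper name broaden is hypothetical.

```python
def ifelse(test, yes, no):
    """Pick yes[i] where test[i] is truthy, otherwise no[i]; scalars are broadcast."""
    n_rows = len(test)

    def broaden(x):
        # A scalar is repeated so that it lines up with every row of test.
        return x if isinstance(x, list) else [x] * n_rows

    return [y if t else n for t, y, n in zip(test, broaden(yes), broaden(no))]


print(ifelse([1, 0, 1], [10, 20, 30], -1))  # [10, -1, 30]
print(ifelse([0, 0, 1], "hit", "miss"))     # ['miss', 'miss', 'hit']
```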

impute
(column=-1, method=u'mean', combine_method=u'interpolate', by=None, group_by_frame=None, values=None)[source]¶ Impute missing values into the frame, modifying it in-place.
Parameters:  column (int) – Index of the column to impute, or -1 to impute the entire frame.
 method (str) – The method of imputation: "mean", "median", or "mode".
 combine_method (str) – When the method is "median", this setting dictates how to combine quantiles for even samples. One of "interpolate", "average", "low", "high".
 by – The list of columns to group on.
 group_by_frame (H2OFrame) – Impute the values with this pre-computed grouped frame.
 values (List) – The list of impute values, one per column. None indicates to skip the column.
Returns: A list of values used in the imputation or the group-by result used in imputation.

insert_missing_values
(fraction=0.1, seed=None)[source]¶ Insert missing values into the current frame, modifying it in-place.
Randomly replaces a user-specified fraction of entries in an H2O dataset with missing values.
Parameters:  fraction (float) – A number between 0 and 1 indicating the fraction of entries to replace with missing.
 seed (int) – The seed for the random number generator used to determine which values to make missing.
Returns: the original H2OFrame with missing values inserted.

interaction
(factors, pairwise, max_factors, min_occurrence, destination_frame=None)[source]¶ Categorical Interaction Feature Creation in H2O.
Creates a frame in H2O with n-th order interaction features between categorical columns, as specified by the user.
Parameters:  factors – list of factor columns (either indices or column names).
 pairwise (bool) – Whether to create pairwise interactions between factors (otherwise create one higher-order interaction). Only applicable if there are 3 or more factors.
 max_factors (int) – Max. number of factor levels in pairwise interaction terms (if enforced, one extra catch-all factor will be made).
 min_occurrence (int) – Min. occurrence threshold for factor levels in pairwise interaction terms.
 destination_frame (str) – (internal) string indicating the key for the frame created.
Returns: an H2OFrame

isax
(num_words, max_cardinality, optimize_card=False, **kwargs)[source]¶ Compute the iSAX index for DataFrame which is assumed to be numeric time series data.
References:
Parameters:  num_words (int) – Number of iSAX words for the time series, i.e. granularity along the time series
 max_cardinality (int) – Maximum cardinality of the iSAX word. Each word can have less than the max
 optimize_card (bool) – An optimization flag that will find the max cardinality regardless of what is passed in for max_cardinality.
Returns: An H2OFrame with the name of time series, string representation of iSAX word, followed by binary representation.

isfactor
()[source]¶ Test which columns in the current frame are categorical.
Returns: a list of True/False indicating for each column in the frame whether it is categorical.

isin
(item)[source]¶ Test whether elements of an H2OFrame are contained in the item.
Parameters: item – An item or a list of items to compare the H2OFrame against.
Returns: An H2OFrame of 0s and 1s showing whether each element in the original H2OFrame is contained in item.

isna
()[source]¶ For each element in an H2OFrame, determine if it is NA or not.
Returns: an H2OFrame of 1s and 0s, where 1s mean the values were NAs.

isnumeric
()[source]¶ Test which columns in the frame are numeric.
Returns: a list of True/False indicating for each column in the frame whether it is numeric.

isstring
()[source]¶ Test which columns in the frame are string.
Returns: a list of True/False indicating for each column in the frame whether it is string.

kfold_column
(n_folds=3, seed=1)[source]¶ Build a fold assignments column for cross-validation.
This method will produce a column having the same data layout as the source frame.
Parameters:  n_folds (int) – An integer specifying the number of validation sets to split the training data into.
 seed (int) – Seed for random numbers as fold IDs are randomly assigned.
Returns: A single column H2OFrame with the fold assignments.

kurtosis
(na_rm=False)[source]¶ Compute the kurtosis of each column in the frame.
We calculate the common kurtosis, such that kurtosis(normal distribution) is 3.
Parameters: na_rm (bool) – If True, then ignore NAs during the computation.
Returns: A list containing the kurtosis for each column (NaN for non-numeric columns).
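"Common" kurtosis here means the non-excess form, for which a normal distribution scores 3. A plain-Python sketch using the ratio of the fourth central moment to the squared variance is given below. It assumes population moments with no sample-size correction; the exact estimator H2O applies is not stated on this page.

```python
def kurtosis(values):
    """Non-excess kurtosis m4 / m2**2 computed from population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m4 / m2 ** 2


print(kurtosis([1, 2, 3, 4, 5]))
```

Subtracting 3 from this value gives the "excess" kurtosis reported by some other libraries.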

lgamma
()[source]¶ Return new H2OFrame equal to elementwise logarithm of the gamma function of the current frame.

log1p
()[source]¶ Return new H2OFrame equal to elementwise ln(1 + x) for each x in the current frame.

logical_negation
()[source]¶ Returns new H2OFrame equal to elementwise Logical NOT applied to the current frame.

lstrip
(set=u' ')[source]¶ Return a copy of the column with leading characters removed.
The set argument is a string specifying the set of characters to be removed. If omitted, the set argument defaults to removing whitespace.
Parameters: set (str) – The set of characters to lstrip from strings in column
Returns: a new H2OFrame with the same shape as the original frame and having all its values trimmed from the left (equivalent of Python’s str.lstrip()).

match
(table, nomatch=0)[source]¶ Make a vector of the positions of (first) matches of its first argument in its second.
Only applicable to single-column categorical/string frames.
Parameters:  table (List) – the list of items to match against
 nomatch (int) – value that should be returned when there is no match.
Returns: a new H2OFrame containing, for each cell of the source frame, the position of its first match in table, or nomatch if there is none.

mean
(skipna=True, axis=0, **kwargs)[source]¶ Compute the frame’s means by-column (or by-row).
Parameters:  skipna (bool) – If True (default), then NAs are ignored during the computation. Otherwise presence of NAs renders the entire result NA.
 axis (int) – Direction of mean computation. If 0 (default), then mean is computed columnwise, and the result is a frame with 1 row and number of columns as in the original frame. If 1, then mean is computed rowwise and the result is a frame with 1 column (called “mean”), and number of rows equal to the number of rows in the original frame.
Returns: either a list of mean values per-column (old semantic); or an H2OFrame containing mean values per-column/per-row from the original frame (new semantic). The new semantic is triggered by either providing the return_frame=True parameter, or having the general.allow_breaking_changed config option turned on.

median
(na_rm=False)[source]¶ Compute the median of each column in the frame.
Parameters: na_rm (bool) – If True, then ignore NAs during the computation.
Returns: A list containing the median for each column (NaN for non-numeric columns).

merge
(other, all_x=False, all_y=False, by_x=None, by_y=None, method=u'auto')[source]¶ Merge two datasets based on common column names.
Parameters:  other (H2OFrame) – The frame to merge to the current one. By default, must have at least one column in common with this frame, and all columns in common are used as the merge key. If you want to use only a subset of the columns in common, rename the other columns so the columns are unique in the merged result.
 all_x (bool) – If True, include all rows from the left/self frame
 all_y (bool) – If True, include all rows from the right/other frame
 by_x – list of columns in the current frame to use as a merge key.
 by_y – list of columns in the other frame to use as a merge key. Should have the same number of columns as in the by_x list.
Returns: New H2OFrame with the result of merging the current frame with the other frame.

minute
()[source]¶ Extract the “minute” part from a date column.
Returns: a single-column H2OFrame containing the “minute” part from the source frame.

static
mktime
(year=1970, month=0, day=0, hour=0, minute=0, second=0, msec=0)[source]¶ Deprecated, use moment() instead.
This function was left for backward-compatibility purposes only. It is not very stable, and counter-intuitively uses 0-based months and days, so “January 4th, 2001” should be entered as mktime(2001, 0, 3).
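The 0-based convention is easy to get wrong, so here is a small plain-Python helper that translates mktime-style arguments into a standard datetime. The helper name is hypothetical and nothing here calls H2O; it only demonstrates the off-by-one shift described above.

```python
from datetime import datetime


def mktime_args_to_datetime(year=1970, month=0, day=0,
                            hour=0, minute=0, second=0, msec=0):
    """Interpret 0-based month/day arguments as a regular (1-based) datetime."""
    # datetime expects 1-based month and day, and microseconds instead of msec.
    return datetime(year, month + 1, day + 1, hour, minute, second, msec * 1000)


print(mktime_args_to_datetime(2001, 0, 3))  # 2001-01-04 00:00:00
print(mktime_args_to_datetime())            # 1970-01-01 00:00:00
```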

modulo_kfold_column
(n_folds=3)[source]¶ Build a fold assignments column for cross-validation.
Rows are assigned a fold according to the current row number modulo n_folds.
Parameters: n_folds (int) – An integer specifying the number of validation sets to split the training data into.
Returns: A single-column H2OFrame with the fold assignments.
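The assignment rule is deterministic and can be reproduced in one line of plain Python. The sketch assumes row numbers start at 0; the function name is hypothetical and does not call H2O.

```python
def modulo_kfold(n_rows, n_folds=3):
    """Assign row i to fold i % n_folds."""
    return [row % n_folds for row in range(n_rows)]


print(modulo_kfold(7))  # [0, 1, 2, 0, 1, 2, 0]
```

Unlike a randomly seeded fold column, this scheme depends only on row order, so sorted data yields folds that interleave the sort order.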

static
moment
(year=None, month=None, day=None, hour=None, minute=None, second=None, msec=None, date=None, time=None)[source]¶ Create a time column from individual components.
Each parameter should be either an integer, or a single-column H2OFrame containing the corresponding time parts for each row.
The “date” part of the timestamp can be specified using either the tuple (year, month, day), or an explicit date parameter. The “time” part of the timestamp is optional, but can be specified either via the time parameter, or via the (hour, minute, second, msec) tuple.
Parameters:  year – the year part of the constructed date
 month – the month part of the constructed date
 day – the day-of-the-month part of the constructed date
 hour – the hours part of the constructed date
 minute – the minutes part of the constructed date
 second – the seconds part of the constructed date
 msec – the milliseconds part of the constructed date
 date (date) – construct the timestamp from Python’s native datetime.date (or datetime.datetime) object. If the object passed is of type date, then you can specify the time part using either the time argument, or the hour ... msec arguments (but not both). If the object passed is of type datetime, then no other arguments can be provided.
 time (time) – construct the timestamp from Python’s native datetime.time object. This argument cannot be used alone; it should be supplemented with either the date argument, or the year ... day tuple.
Returns: H2OFrame with one column containing the date constructed from the provided arguments.

month
()[source]¶ Extract the “month” part from a date column.
Returns: a single-column H2OFrame containing the “month” part from the source frame.

mult
(matrix)[source]¶ Multiply this frame, viewed as a matrix, by another matrix.
Parameters: matrix – another frame that you want to multiply the current frame by; must be compatible with the current frame (i.e. its number of rows must be the same as the number of columns in the current frame).
Returns: new H2OFrame, which is the result of multiplying the current frame by matrix.

na_omit
()[source]¶ Remove rows with NAs from the H2OFrame.
Returns: new H2OFrame with all rows from the original frame containing any NAs removed.

nacnt
()[source]¶ Count of NAs for each column in this H2OFrame.
Returns: A list of the na counts (one entry per column).

names
¶ The list of column names (List[str]).

nchar
()[source]¶ Count the length of each string in a single-column H2OFrame of string type.
Returns: A single-column H2OFrame containing the per-row character count.

ncol
¶ Same as self.ncols.

ncols
¶ Number of columns in the dataframe (int).

nlevels
()[source]¶ Get the number of factor levels for each categorical column.
Returns: A list of the number of levels per column.

nrow
¶ Same as self.nrows.

nrows
¶ Number of rows in the dataframe (int).

num_valid_substrings
(path_to_words)[source]¶ For each string, find the count of all possible substrings with 2 characters or more that are contained in the line-separated text file whose path is given.
Parameters: path_to_words (str) – Path to file that contains a line-separated list of strings considered valid.
Returns: An H2OFrame with the number of substrings that are contained in the given word list.

pivot
(index, column, value)[source]¶ Pivot the frame designated by the three columns: index, column, and value.
Index and column should be of type enum, int, or time. For cases of multiple indexes for a column label, the aggregation method is to pick the first occurrence in the data frame.
Parameters:  index – Index is a column that will be the row label
 column – The labels for the columns in the pivoted Frame
 value – The column of values for the given index and column label
Returns:

pop
(i)[source]¶ Pop a column from the H2OFrame at index i.
Parameters: i – The index (int) or name (str) of the column to pop.
Returns: an H2OFrame containing the column dropped from the current frame; the current frame is modified in-place and loses the column.

prod
(na_rm=False)[source]¶ Compute the product of all values in the frame.
Parameters: na_rm (bool) – If True then NAs will be ignored during the computation.
Returns: product of all values in the frame (a float)

quantile
(prob=None, combine_method=u'interpolate', weights_column=None)[source]¶ Compute quantiles.
Parameters:  prob (List[float]) – list of probabilities for which quantiles should be computed.
 combine_method (str) – for even samples this setting determines how to combine quantiles. This can be one of "interpolate", "average", "low", "high".
 weights_column – optional weights for each row. If not given, all rows are assumed to have equal importance. This parameter can be either the name of the column containing the observation weights in this frame, or a separate single-column H2OFrame of observation weights.
Returns: a new H2OFrame containing the quantiles and probabilities.

rbind
(data)[source]¶ Append data to this frame rowwise.
Parameters: data – an H2OFrame or a list of H2OFrames to be combined with the current frame rowwise.
Returns: this H2OFrame with all frames in data appended rowwise.

relevel
(y)[source]¶ Reorder levels of an H2O factor.
The levels of a factor are reordered such that the reference level is at level 0, all remaining levels are moved down as needed.
Parameters: y (str) – The reference level
Returns: New reordered factor column

rep_len
(length_out)[source]¶ Create a new frame replicating the current frame.
If the source frame has a single column, then the new frame will be replicating rows and its dimensions will be length_out x 1. However, if the source frame has more than 1 column, then the new frame will be replicating data in columnwise direction, and its dimensions will be nrows x length_out, where nrows is the number of rows in the source frame. Also note that if length_out is smaller than the corresponding dimension of the source frame, then the new frame will actually be a truncated version of the original.
Parameters: length_out (int) – Number of columns (rows) of the resulting H2OFrame
Returns: new H2OFrame with repeated data from the current frame.
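For the single-column case, the replicate-or-truncate behaviour described above corresponds to cycling through the values until length_out items have been produced. A plain-Python sketch on a list (not an H2O call):

```python
from itertools import cycle, islice


def rep_len(column, length_out):
    """Repeat the values of column cyclically until length_out items are produced."""
    return list(islice(cycle(column), length_out))


print(rep_len([1, 2, 3], 7))  # [1, 2, 3, 1, 2, 3, 1]
print(rep_len([1, 2, 3], 2))  # [1, 2]  (truncated)
```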

round
(digits=0)[source]¶ Round doubles/floats to the given number of decimal places.
Parameters: digits (int) – The number of decimal places to retain. Rounding to a negative number of decimal places is not supported. For rounding we use the “round half to even” mode (IEC 60559 standard), so that round(2.5) = 2 and round(3.5) = 4.
Returns: new H2OFrame with rounded values from the original frame.
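Python's built-in round and the decimal module follow the same "round half to even" rule, so the behaviour can be checked locally without H2O. The helper round_half_even below is hypothetical; it takes the number as a string so that the tie cases are exact rather than subject to binary floating-point representation.

```python
from decimal import Decimal, ROUND_HALF_EVEN

# Built-in round() on exact halves goes to the nearest even integer.
print(round(2.5), round(3.5))  # 2 4


def round_half_even(text, digits=0):
    """Round a decimal literal given as a string to `digits` places, ties to even."""
    quantum = Decimal(1).scaleb(-digits)  # 10 ** -digits, e.g. 0.01 for digits=2
    return Decimal(text).quantize(quantum, rounding=ROUND_HALF_EVEN)


print(round_half_even("2.125", 2))  # 2.12
print(round_half_even("2.135", 2))  # 2.14
```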

rstrip
(set=u' ')[source]¶ Return a copy of the column with trailing characters removed.
The set argument is a string specifying the set of characters to be removed. If omitted, the set argument defaults to removing whitespace.
Parameters: set (str) – The set of characters to rstrip from strings in column Returns: a new H2OFrame with the same shape as the original frame and having all its values trimmed from the right (equivalent of Python’s str.rstrip()).
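Since the result is described as equivalent to Python's str.rstrip(), the per-value behavior can be previewed locally on ordinary strings. Note that the argument is treated as a set of characters, not as a suffix.

```python
stripped_default = "data  \t".rstrip()    # whitespace removed when no set is given
stripped_slash = "path///".rstrip("/")    # only trailing '/' characters removed
stripped_set = "abcba".rstrip("ab")       # any trailing run of 'a' or 'b' removed

print(stripped_default, stripped_slash, stripped_set)  # data path abc
```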

runif
(seed=None)[source]¶ Generate a column of random numbers drawn from a uniform distribution [0,1) and having the same data layout as the source frame.
Parameters: seed (int) – seed for the random number generator. Returns: Single-column H2OFrame filled with doubles sampled uniformly from [0,1).

scale
(center=True, scale=True)[source]¶ Center and/or scale the columns of the current frame.
Parameters:  center – If True, then demean the data. If False, no shifting is done. If center is a list of numbers then shift each column by the corresponding amount.
 scale – If True, then scale the data by each column’s standard deviation. If False, no scaling is done. If scale is a list of numbers, then scale each column by the requested amount.
Returns: an H2OFrame with scaled values from the current frame.
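The following sketch applies the center/scale options to a single column held in a Python list. It rests on two assumptions not stated in the text: that "shift" means subtracting and "scale" means dividing, and that the standard deviation is the sample one (statistics.stdev). The helper `scale_column` is hypothetical.

```python
from statistics import mean, stdev


def scale_column(values, center=True, scale=True):
    # center: True -> subtract the mean; False -> no shift; number -> subtract it.
    if center is True:
        shift = mean(values)
    elif center is False:
        shift = 0.0
    else:
        shift = center
    # scale: True -> divide by the (sample) standard deviation; False -> no
    # scaling; number -> divide by it.
    if scale is True:
        divisor = stdev(values)
    elif scale is False:
        divisor = 1.0
    else:
        divisor = scale
    return [(v - shift) / divisor for v in values]


print(scale_column([1.0, 2.0, 3.0]))                         # [-1.0, 0.0, 1.0]
print(scale_column([1.0, 2.0, 3.0], center=False, scale=2))  # [0.5, 1.0, 1.5]
```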

sd
(na_rm=False)[source]¶ Compute the standard deviation for each column in the frame.
Parameters: na_rm (bool) – if True, then NAs will be removed from the computation. Returns: A list containing the standard deviation for each column (NaN for nonnumeric columns).

second
()[source]¶ Extract the “second” part from a date column.
Returns: a single-column H2OFrame containing the “second” part from the source frame.

set_level
(level)[source]¶ A method to set all column values to one of the levels.
Parameters: level (str) – The level at which the column will be set (a string) Returns: H2OFrame with entries set to the desired level.

set_levels
(levels)[source]¶ Replace the levels of a categorical column.
New levels must be aligned with the old domain. This call has copy-on-write semantics.
Parameters: levels (List[str]) – A list of strings specifying the new levels. The number of new levels must match the number of old levels. Returns: A single-column H2OFrame with the desired levels.

set_name
(col=None, name=None)[source]¶ Set a new name for a column.
Parameters:  col – index or name of the column whose name is to be set; may be skipped for 1-column frames
 name – the new name of the column

set_names
(names)[source]¶ Change names of all columns in the frame.
Parameters: names (List[str]) – The list of new names for every column in the frame.

shape
¶ Number of rows and columns in the dataframe as a tuple (nrows, ncols).

show
(use_pandas=False)[source]¶ Used by the H2OFrame.__repr__ method to print or display a snippet of the data frame.
If called from IPython, displays an html’ized result. Else prints a tabulate’d result.

signif
(digits=6)[source]¶ Round doubles/floats to the given number of significant digits.
Parameters: digits (int) – Number of significant digits to retain. Returns: new H2OFrame with rounded values from the original frame.
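To make the difference from round() concrete, here is a small local sketch of rounding to significant digits using Python's "g" format. The helper `signif_sketch` is hypothetical, and its handling of exact ties may differ from H2O's.

```python
def signif_sketch(x, digits=6):
    # The 'g' presentation type keeps `digits` significant digits.
    return float(f"{x:.{digits}g}")


print(signif_sketch(123456.789, 3))   # 123000.0
print(signif_sketch(0.00123456, 2))   # 0.0012
```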

sinpi
()[source]¶ Return new H2OFrame equal to elementwise sine of the current frame multiplied by Pi.

skewness
(na_rm=False)[source]¶ Compute the skewness of each column in the frame.
Parameters: na_rm (bool) – If True, then ignore NAs during the computation. Returns: A list containing the skewness for each column (NaN for nonnumeric columns).

sort
(by)[source]¶ Return a new Frame that is sorted by column(s) in ascending order. A fully distributed and parallel sort.
Parameters: by – The column to sort by (either a single column name, or a list of column names, or a list of column indices) Returns: a new sorted Frame

split_frame
(ratios=None, destination_frames=None, seed=None)[source]¶ Split a frame into distinct subsets of size determined by the given ratios.
The number of subsets is always 1 more than the number of ratios given. Note that this does not give an exact split. H2O is designed to be efficient on big data using a probabilistic splitting method rather than an exact split. For example when specifying a split of 0.75/0.25, H2O will produce a test/train split with an expected value of 0.75/0.25 rather than exactly 0.75/0.25. On small datasets, the sizes of the resulting splits will deviate from the expected value more than on big data, where they will be very close to exact.
Parameters:  ratios (List[float]) – The fractions of rows for each split.
 destination_frames (List[str]) – The names of the split frames.
 seed (int) – seed for the random number generator
Returns: A list of H2OFrames
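The effect of a probabilistic split can be simulated locally. The sketch below assigns each row index to a split by drawing a uniform random number and comparing it against cumulative ratios; this is an assumed mechanism chosen for illustration, not a description of H2O's internals, and `probabilistic_split` is a hypothetical helper.

```python
import random


def probabilistic_split(n_rows, ratios, seed=None):
    rng = random.Random(seed)
    # Cumulative thresholds, e.g. [0.75] for ratios=[0.75]; the final split
    # receives whatever probability mass is left over.
    thresholds = []
    total = 0.0
    for r in ratios:
        total += r
        thresholds.append(total)
    splits = [[] for _ in range(len(ratios) + 1)]
    for row in range(n_rows):
        u = rng.random()
        # Index of the first threshold above u, or the last split otherwise.
        idx = next((i for i, t in enumerate(thresholds) if u < t), len(ratios))
        splits[idx].append(row)
    return splits


parts = probabilistic_split(1000, [0.75], seed=42)
print([len(p) for p in parts])  # close to, but generally not exactly, [750, 250]
```

One ratio yields two subsets, and the sizes only approximate the requested fractions, with the relative deviation shrinking as the number of rows grows.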

stratified_kfold_column
(n_folds=3, seed=1)[source]¶ Build a fold assignment column with the constraint that each fold has the same class distribution as the fold column.
Parameters:  n_folds (int) – The number of folds to build.
 seed (int) – A seed for the random number generator.
Returns: A single column H2OFrame with the fold assignments.

stratified_split
(test_frac=0.2, seed=1)[source]¶ Construct a column that can be used to perform a random stratified split.
Parameters:  test_frac (float) – The fraction of rows that will belong to the “test”.
 seed (int) – The seed for the random number generator.
Returns: an H2OFrame having a single categorical column with two levels: "train" and "test".
Examples:
>>> stratsplit = df["y"].stratified_split(test_frac=0.3, seed=12349453)
>>> train = df[stratsplit=="train"]
>>> test = df[stratsplit=="test"]
>>>
>>> # check that the distributions among the initial frame, and the
>>> # train/test frames match
>>> df["y"].table()["Count"] / df["y"].table()["Count"].sum()
>>> train["y"].table()["Count"] / train["y"].table()["Count"].sum()
>>> test["y"].table()["Count"] / test["y"].table()["Count"].sum()

strdistance
(y, measure=None)[source]¶ Compute elementwise string distances between two H2OFrames. Both frames need to have the same shape and only contain string/factor columns.
Parameters:  y (H2OFrame) – A comparison frame.
 measure (str) – A string identifier indicating what string distance measure to use. Must be one of:
 "lv": Levenshtein distance
 "lcs": Longest common substring distance
 "qgram": q-gram distance
 "jaccard": Jaccard distance between q-gram profiles
 "jw": Jaro, or Jaro-Winkler distance
 "soundex": Distance based on soundex encoding
Examples:
>>> x = h2o.H2OFrame.from_python(['Martha', 'Dwayne', 'Dixon'], column_types=['factor'])
>>> y = h2o.H2OFrame.from_python(['Marhta', 'Duane', 'Dicksonx'], column_types=['string'])
>>> x.strdistance(y, measure="jw")
Returns: An H2OFrame of the matrix containing element-wise distance between the strings of this frame and y. The returned frame has the same shape as the input frames.
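As a reference point for the "lv" measure, the textbook Levenshtein distance can be computed locally as below. This is the standard dynamic-programming definition written in plain Python; whether H2O returns exactly these values (for example, unnormalized) is an assumption.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


pairs = [("Martha", "Marhta"), ("Dwayne", "Duane")]
print([levenshtein(a, b) for a, b in pairs])  # [2, 2]
```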

strsplit
(pattern)[source]¶ Split the strings in the target column on the given regular expression pattern.
Parameters: pattern (str) – The split pattern. Returns: H2OFrame containing columns of the split strings.

sub
(pattern, replacement, ignore_case=False)[source]¶ Substitute the first occurrence of pattern in a string with replacement.
Parameters:  pattern (str) – A regular expression.
 replacement (str) – A replacement string.
 ignore_case (bool) – If True then pattern will match case-insensitively.
Returns: an H2OFrame with all values matching pattern replaced with replacement.
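The "first occurrence only" behavior can be previewed with Python's re module by passing count=1. This is a local analogy: H2O evaluates the pattern on the cluster, and its regular-expression dialect may differ from Python's.

```python
import re

# Only the first match is replaced; later matches are left untouched.
first_only = re.sub("a", "X", "banana", count=1)
# Analogue of ignore_case=True.
case_insensitive = re.sub("AN", "-", "banana", count=1, flags=re.IGNORECASE)

print(first_only, case_insensitive)  # bXnana b-ana
```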

substring
(start_index, end_index=None)[source]¶ For each string, return a new string that is a substring of the original string.
If end_index is not specified, then the substring extends to the end of the original string. If start_index is greater than the length of the string, or is greater than or equal to end_index, an empty string is returned. A negative start_index is coerced to 0.
Parameters:  start_index (int) – The index of the original string at which to start the substring, inclusive.
 end_index (int) – The index of the original string at which to end the substring, exclusive.
Returns: An H2OFrame containing the specified substrings.
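The index rules above can be sketched for a single Python string as follows. The helper `substring_sketch` is hypothetical and only mirrors the documented edge cases.

```python
def substring_sketch(s, start_index, end_index=None):
    start = max(start_index, 0)  # negative start is coerced to 0
    end = len(s) if end_index is None else end_index
    if start >= len(s) or start >= end:
        return ""                # out-of-range or empty interval
    return s[start:end]          # start inclusive, end exclusive


print(substring_sketch("hello", 1, 3))   # el
print(substring_sketch("hello", 2))      # llo
print(substring_sketch("hello", -2, 2))  # he
```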

sum
(skipna=True, axis=0, **kwargs)[source]¶ Compute the frame’s sum by-column (or by-row).
Parameters:  skipna (bool) – If True (default), then NAs are ignored during the computation. Otherwise presence of NAs renders the entire result NA.
 axis (int) – Direction of sum computation. If 0 (default), then sum is computed columnwise, and the result is a frame with 1 row and number of columns as in the original frame. If 1, then sum is computed rowwise and the result is a frame with 1 column (called “sum”), and number of rows equal to the number of rows in the original frame.
Returns: either a list of sum of values per-column (old semantic); or an H2OFrame containing sum of values per-column/per-row in the original frame (new semantic). The new semantic is triggered by either providing the return_frame=True parameter, or having the general.allow_breaking_changed config option turned on.

summary
(return_data=False)[source]¶ Display summary information about the frame.
Summary includes min/mean/max/sigma and other rollup data.
Parameters: return_data (bool) – Return a dictionary of the summary output

table
(data2=None, dense=True)[source]¶ Compute the counts of values appearing in a column, or co-occurrence counts between two columns.
Parameters:  data2 (H2OFrame) – An optional single column to aggregate counts by.
 dense (bool) – If True (default) then use dense representation, which lists only non-zero counts, 1 combination per row. Set to False to expand counts across all combinations.
Returns: H2OFrame of the counts at each combination of factor levels
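The difference between the dense and the expanded counts can be illustrated with two small Python lists standing in for columns. The exact layout of the frame H2O returns is not reproduced here; only the counting logic is shown.

```python
from collections import Counter
from itertools import product

a = ["x", "y", "x", "x"]
b = ["u", "u", "v", "u"]

# Dense: one entry per combination that actually occurs.
dense = Counter(zip(a, b))

# Expanded: every combination of levels, including those with zero count.
expanded = {pair: dense.get(pair, 0)
            for pair in product(sorted(set(a)), sorted(set(b)))}

print(dict(dense))  # {('x', 'u'): 2, ('y', 'u'): 1, ('x', 'v'): 1}
print(expanded)     # additionally contains ('y', 'v'): 0
```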

tail
(rows=10, cols=200)[source]¶ Return the last rows and cols of the frame as a new H2OFrame.
Parameters:  rows (int) – maximum number of rows to return
 cols (int) – maximum number of columns to return
Returns: a new H2OFrame cut from the bottom left corner of the current frame, and having dimensions at most rows x cols.

tanpi
()[source]¶ Return new H2OFrame equal to elementwise tangent of the current frame multiplied by Pi.

tokenize
(split)[source]¶ Tokenize strings.
tokenize() is similar to strsplit(); the difference is that tokenize() stores the tokenized text in a single column, making it easier to do additional processing (filtering stop words, word2vec algo, ...).
Parameters: split (str) – The regular expression to split on. Returns: An H2OFrame with a single column representing the tokenized strings. Original rows of the input frame are separated by NA.

tolower
()[source]¶ Translate characters from upper to lower case for a particular column.
Returns: new H2OFrame with all strings in the current frame converted to the lowercase.

toupper
()[source]¶ Translate characters from lower to upper case for a particular column.
Returns: new H2OFrame with all strings in the current frame converted to the uppercase.

transpose
()[source]¶ Transpose rows and columns of this frame.
Returns: new H2OFrame with the rows and columns of the original frame transposed.

trigamma
()[source]¶ Return new H2OFrame equal to elementwise trigamma function of the current frame.

trim
()[source]¶ Trim white space on the left and right of strings in a single-column H2OFrame.
Returns: H2OFrame with trimmed strings.

trunc
()[source]¶ Apply the numeric truncation function.
trunc(x) is the integer obtained from x by dropping its decimal tail. This is equal to floor(x) if x is positive, and ceil(x) if x is negative. Truncation is also called “rounding towards zero”.
Returns: new H2OFrame of truncated values of the original frame.
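The relationship between truncation, floor, and ceil can be checked locally with Python's math module on plain floats. No H2O call is involved.

```python
import math

values = [2.7, -2.7, 0.5, -0.5]
truncated = [math.trunc(x) for x in values]
print(truncated)  # [2, -2, 0, 0]

# Positive input: trunc agrees with floor. Negative input: trunc agrees with ceil.
print(math.trunc(2.7) == math.floor(2.7), math.trunc(-2.7) == math.ceil(-2.7))  # True True
```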

type
(col)[source]¶ The type for the given column.
Parameters: col – either a name, or an index of the column to look up Returns: type of the column, one of: str, int, real, enum, time, bool. Raises: H2OValueError – if such column does not exist in the frame.

types
¶ The dictionary of column name/type pairs.

unique
()[source]¶ Extract the unique values in the column.
Returns: H2OFrame of just the unique values in the column.

var
(y=None, na_rm=False, use=None)[source]¶ Compute the variance-covariance matrix of one or two H2OFrames.
Parameters:  y (H2OFrame) – If this parameter is given, then a covariance matrix between the columns of the target frame and the columns of y is computed. If this parameter is not provided then the covariance matrix of the target frame is returned. If target frame has just a single column, then return the scalar variance instead of the matrix. Single rows are treated as single columns.
 use (str) – A string indicating how to handle missing values. This could be one of the following:
 "everything": outputs NaNs whenever one of its contributing observations is missing
 "all.obs": presence of missing observations will throw an error
 "complete.obs": discards missing values along with all observations in their rows so that only complete observations are used
 na_rm (bool) – an alternative to use: when this is True then default value for use is "everything"; and if False then default use is "complete.obs". This parameter has no effect if use is given explicitly.
Returns: An H2OFrame of the covariance matrix of the columns of this frame (if y is not given), or with the columns of y (if y is given). However when this frame and y are both single rows or single columns, then the variance is returned as a scalar.

week
()[source]¶ Extract the “week” part from a date column.
Returns: a single-column H2OFrame containing the “week” part from the source frame.

which
()[source]¶ Compose the list of row indices for which the frame contains non-zero values.
Only applicable to integer single-column frames. Equivalent to comprehension [index for index, value in enumerate(self) if value].
Returns: a new single-column H2OFrame containing indices of those rows in the original frame that contained non-zero values.
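Applied to an ordinary Python list standing in for an integer column, the comprehension quoted above behaves as follows.

```python
column = [0, 3, 0, 1, 0]
# Keep the (0-based) positions whose value is non-zero.
nonzero_rows = [index for index, value in enumerate(column) if value]
print(nonzero_rows)  # [1, 3]
```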

GroupBy
¶

class
h2o.group_by.
GroupBy
(fr, by)[source]¶ Bases:
object
A class that represents the group by operation on an H2OFrame.
The returned groups are sorted by the natural groupby column sort.
Parameters:  fr (H2OFrame) – H2OFrame that you want the group by operation to be performed on.
 by – by can be a column name (str) or an index (int) of a single column, or a list for multiple columns denoting the set of columns to group by.
Sample usage:
>>> my_frame = ... # some existing H2OFrame
>>> grouped = my_frame.group_by(by=["C1", "C2"])
>>> grouped.sum(col="X1", na="all").mean(col="X5", na="all").max()
>>> grouped.get_frame()
Any number of aggregations may be chained together in this manner. Note that once the aggregation operations are complete, calling the GroupBy object with a new set of aggregations will yield no effect. You must generate a new GroupBy object in order to apply a new aggregation on it. In addition, certain aggregations are only defined for numerical or categorical columns. An error will be thrown for calling aggregation on the wrong data types.
If no arguments are given to the aggregation (e.g. “max” in the above example), then it is assumed that the aggregation should apply to all columns but the group by columns.
All GroupBy aggregations take parameter na, which controls treatment of NA values during the calculation. It can be one of:
 “all” (default) – any NAs are used in the calculation as-is, which usually results in the final result being NA too.
 “ignore” – NA entries are not included in calculations, but the total number of entries is taken as the total number of rows. For example, mean([1, 2, 3, nan], na="ignore") will produce 1.5.
 “rm” – NA entries are skipped during the calculations, reducing the total effective count of entries. For example, mean([1, 2, 3, nan], na="rm") will produce 2.
Variance (var) and standard deviation (sd) are the sample (not population) statistics.
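The three na modes can be reproduced for the mean of a single group in plain Python. The function `group_mean` below is a hypothetical stand-in that follows the definitions given above; it is not the GroupBy implementation.

```python
import math


def group_mean(values, na="all"):
    present = [v for v in values if not math.isnan(v)]
    if na == "all":
        # NAs take part in the arithmetic, so any NA makes the result NA.
        return sum(values) / len(values)
    if na == "ignore":
        # NAs add nothing to the sum but still count as rows.
        return sum(present) / len(values)
    if na == "rm":
        # NAs are dropped from both the sum and the count.
        return sum(present) / len(present)
    raise ValueError("na must be one of 'all', 'ignore', 'rm'")


vals = [1, 2, 3, float("nan")]
print(group_mean(vals, "all"))     # nan
print(group_mean(vals, "ignore"))  # 1.5
print(group_mean(vals, "rm"))      # 2.0
```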

count
(na=u'all')[source]¶ Count the number of rows in each group of a GroupBy object.
Parameters: na (str) – one of ‘rm’, ‘ignore’ or ‘all’ (default). Returns: the original GroupBy object (self), for ease of constructing chained operations.

frame
¶ same as get_frame().

get_frame
()[source]¶ Return the resulting H2OFrame containing the result(s) of aggregation(s) of the group by.
The number of rows denotes the number of groups generated by the group by operation.
The number of columns depends on the number of aggregations performed and on the number of columns specified in the col parameter. Generally, expect the number of columns to be
(len(col) of aggregation 0 + len(col) of aggregation 1 +...+ len(col) of aggregation n) x (number of groups of the GroupBy object) +1 (for groupby group names).
 Note:
 the count aggregation only generates one column;
 if col is a str or int, len(col) = 1.

max
(col=None, na=u'all')[source]¶ Calculate the maximum of each column specified in col for each group of a GroupBy object. If no col is given, compute the maximum among all numeric columns other than those being grouped on.
Parameters:  col – col can be None (default), a column name (str) or an index (int) of a single column, or a list for multiple columns
 na (str) – one of ‘rm’, ‘ignore’ or ‘all’ (default).
Returns: the original GroupBy object (self), for ease of constructing chained operations.

mean
(col=None, na=u'all')[source]¶ Calculate the mean of each column specified in col for each group of a GroupBy object. If no col is given, compute the mean among all numeric columns other than those being grouped on.
Parameters:  col – col can be None (default), a column name (str) or an index (int) of a single column, or a list for multiple columns
 na (str) – one of ‘rm’, ‘ignore’ or ‘all’ (default).
Returns: the original GroupBy object (self), for ease of constructing chained operations.

min
(col=None, na=u'all')[source]¶ Calculate the minimum of each column specified in col for each group of a GroupBy object. If no col is given, compute the minimum among all numeric columns other than those being grouped on.
Parameters:  col – col can be None (default), a column name (str) or an index (int) of a single column, or a list for multiple columns
 na (str) – one of ‘rm’, ‘ignore’ or ‘all’ (default).
Returns: the original GroupBy object (self), for ease of constructing chained operations.

mode
(col=None, na=u'all')[source]¶ Calculate the mode of each column specified in col for each group of a GroupBy object. If no col is given, compute the mode among all categorical columns other than those being grouped on.
Parameters:  col – col can be None (default), a column name (str) or an index (int) of a single column, or a list for multiple columns
 na (str) – one of ‘rm’, ‘ignore’ or ‘all’ (default).
Returns: the original GroupBy object (self), for ease of constructing chained operations.

sd
(col=None, na=u'all')[source]¶ Calculate the standard deviation of each column specified in col for each group of a GroupBy object. If no col is given, compute the standard deviation among all numeric columns other than those being grouped on.
Parameters:  col – col can be None (default), a column name (str) or an index (int) of a single column, or a list for multiple columns
 na (str) – one of ‘rm’, ‘ignore’ or ‘all’ (default).
Returns: the original GroupBy object (self), for ease of constructing chained operations.

ss
(col=None, na=u'all')[source]¶ Calculate the sum of squares of each column specified in col for each group of a GroupBy object. If no col is given, compute the sum of squares among all numeric columns other than those being grouped on.
Parameters:  col – col can be None (default), a column name (str) or an index (int) of a single column, or a list for multiple columns
 na (str) – one of ‘rm’, ‘ignore’ or ‘all’ (default).
Returns: the original GroupBy object (self), for ease of constructing chained operations.

sum
(col=None, na=u'all')[source]¶ Calculate the sum of each column specified in col for each group of a GroupBy object. If no col is given, compute the sum among all numeric columns other than those being grouped on.
Parameters:  col – col can be None (default), a column name (str) or an index (int) of a single column, or a list for multiple columns
 na (str) – one of ‘rm’, ‘ignore’ or ‘all’ (default).
Returns: the original GroupBy object (self), for ease of constructing chained operations.

var
(col=None, na=u'all')[source]¶ Calculate the variance of each column specified in col for each group of a GroupBy object. If no col is given, compute the variance among all numeric columns other than those being grouped on.
Parameters:  col – col can be None (default), a column name (str) or an index (int) of a single column, or a list for multiple columns
 na (str) – one of ‘rm’, ‘ignore’ or ‘all’ (default).
Returns: the original GroupBy object (self), for ease of constructing chained operations.