Data Manipulation¶
H2OFrame
¶
-
class
h2o.frame.
H2OFrame
(python_obj=None)[source]¶ Bases:
object
Primary data store for H2O.
H2OFrame is similar to pandas’ DataFrame or R’s data.frame. One critical distinction is that the data is generally not held in memory; instead it is located on a (possibly remote) H2O cluster, and thus an H2OFrame represents a mere handle to that data.
-
anyfactor
()[source]¶ Test if H2OFrame has any factor columns.
Returns: True if there are any categorical columns; False otherwise.
-
apply
(fun=None, axis=0)[source]¶ Apply a lambda expression to an H2OFrame.
Parameters: - fun – a lambda expression to be applied per row or per column.
- axis – 0 = apply to each column; 1 = apply to each row
Returns: an H2OFrame
-
as_data_frame
(use_pandas=True)[source]¶ Obtain the dataset as a python-local object.
Parameters: use_pandas : bool, default=True
A flag specifying whether or not to return a pandas DataFrame.
Returns: A local python object (a list of lists of strings, each list is a row, if
use_pandas=False, otherwise a pandas DataFrame) containing this H2OFrame instance’s data.
-
as_date
(format)[source]¶ Return the column with all elements converted to millis since the epoch.
Parameters: format : str
A datetime format string (e.g. “YYYY-mm-dd”)
Returns: An H2OFrame instance.
-
categories
()[source]¶ Create a list of categorical levels for an H2OFrame factor (enum) column.
Returns: Pythonic list of categorical levels.
-
cbind
(data)[source]¶ Append data to this frame column-wise.
Parameters: data – an H2OFrame or a list of H2OFrames to be column-bound to the right of this frame. You can also cbind a number, in which case it is converted into a constant column.
Returns: A new H2OFrame with all frames in data appended column-wise.
-
col_names
¶ Same as
self.names
.
-
columns
¶ Same as
self.names
.
-
columns_by_type
(coltype=u'numeric')[source]¶ Obtain a list of columns that are specified by coltype
Parameters: coltype : str
A character string indicating which column type to filter by. This must be one of the following:
- “numeric” - Numeric, but not categorical or time
- “categorical” - Integer, with a categorical/factor String mapping
- “string” - String column
- “time” - Long msec since the Unix Epoch, with a variety of display/parse options
- “uuid” - UUID
- “bad” - No non-NA rows (all NAs or zero rows)
Returns: A list of column indices that correspond to coltype.
-
concat
(frames, axis=1)[source]¶ Append multiple frames to this H2OFrame column-wise or row-wise.
Parameters: frames : List of H2OFrame’s
H2OFrame’s to be column bound to the right of this H2OFrame.
- axis : int, default = 1
Type of concatenation to conduct. If axis = 1, then column-wise (Default). If axis = 0, then row-wise.
Returns: H2OFrame of the combined datasets.
-
cor
(y=None, na_rm=False, use=None)[source]¶ Compute the correlation matrix of one or two H2OFrames.
Parameters: y : H2OFrame, default=None
If y is None and self is a single column, then the correlation is computed for self. If self has multiple columns, then its correlation matrix is returned. Single rows are treated as single columns. If y is not None, then a correlation matrix between the columns of self and the columns of y is computed.
na_rm : bool, default=False
Remove NAs from the computation.
use : str, default=None, which acts as “everything” if na_rm is False, and “complete.obs” if na_rm is True
A string indicating how to handle missing values. This must be one of the following:
- “everything” - outputs NaNs whenever one of its contributing observations is missing
- “all.obs” - presence of missing observations will throw an error
- “complete.obs” - discards missing values along with all observations in their rows so that only complete observations are used
Returns: An H2OFrame of the correlation matrix of the columns of this H2OFrame with itself (if y is not given), or
with the columns of y (if y is given). If self and y are single rows or single columns, the correlation is given as a scalar.
-
countmatches
(pattern)[source]¶ For each string in the column, count the occurrences of pattern.
Parameters: pattern : str
The pattern to count matches on in each string.
Returns: A single-column H2OFrame containing the counts for the per-row occurrences of
pattern in the input column.
-
cut
(breaks, labels=None, include_lowest=False, right=True, dig_lab=3)[source]¶ Cut a numeric vector into factor “buckets”. Similar to R’s cut method.
Parameters: breaks : list
The cut points in the numeric vector (must span the range of the col.)
- labels: list
Factor labels, defaults to set notation of intervals defined by breaks.
- include_lowest : bool
By default, cuts are defined as (lo,hi]. If True, get [lo,hi].
- right : bool
Include the high value: (lo,hi]. If False, get (lo,hi).
- dig_lab: int
Number of digits following the decimal point to consider.
Returns: Single-column H2OFrame of categorical data.
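As a rough illustration of the default (lo,hi] bucketing, the sketch below bins plain Python numbers with the standard bisect module. The function name cut_local is hypothetical, and returning None for values that fall outside the breaks is an assumption of this sketch. It illustrates the documented interval convention only and is not H2O's implementation.

```python
from bisect import bisect_left


def cut_local(values, breaks, include_lowest=False):
    """Label each value with the (lo,hi] interval that contains it."""
    labels = []
    for v in values:
        # Index of the first break >= v, so v lies in (breaks[i-1], breaks[i]].
        i = bisect_left(breaks, v)
        if include_lowest and v == breaks[0]:
            i = 1  # the lowest break joins the first interval: [lo,hi]
        if i == 0 or i == len(breaks):
            labels.append(None)  # assumption: values outside the breaks get no label
            continue
        left = "[" if include_lowest and i == 1 else "("
        labels.append("%s%s,%s]" % (left, breaks[i - 1], breaks[i]))
    return labels


print(cut_local([1, 5, 10, 0], [0, 5, 10]))
# ['(0,5]', '(0,5]', '(5,10]', None]
print(cut_local([0], [0, 5, 10], include_lowest=True))
# ['[0,5]']
```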
-
describe
(chunk_summary=False)[source]¶ Generate an in-depth description of this H2OFrame.
Parameters: chunk_summary : bool, default=False
Retrieve the chunk summary along with the distribution summary
-
difflag1
()[source]¶ Conduct a lag 1 transform on a numeric H2OFrame column
Returns: H2OFrame column with a lag 1 transform
-
dim
¶ Same as
list(self.shape)
.
-
drop
(index, axis=1)[source]¶ Drop a single column or row or a set of columns or rows from a H2OFrame. Dropping a column or row is not in-place. Dropping a column or row by index or a set of indexes is zero-based.
Parameters: index : list,str,int
A list of column indexes, column names, or row indexes to drop; a string to drop a single column by column name; or an int to drop a single column by index.
- axis : int, default = 1
Type of drop to conduct. If axis = 1, then column-wise (Default). If axis = 0, then row-wise.
Returns: H2OFrame with the respective dropped columns or rows. Returns a new H2OFrame.
-
entropy
()[source]¶ For each string, return the Shannon entropy. If the string is empty, the entropy is 0.
Returns: An H2OFrame of Shannon entropies.
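For readers unfamiliar with the quantity, here is a minimal plain-Python sketch of a per-string Shannon entropy. It assumes the entropy is taken over the character frequencies of each string and uses base-2 logarithms; neither detail is stated above, and the helper name shannon_entropy is hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(s):
    """Entropy (in bits) of the character distribution of s; 0 for an empty string."""
    if not s:
        return 0.0
    n = len(s)
    counts = Counter(s)
    # -sum(p * log2(p)) over the relative frequency p of each distinct character
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


print(shannon_entropy(""))      # 0.0
print(shannon_entropy("aabb"))  # 1.0
```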
-
filter_na_cols
(frac=0.2)[source]¶ Filter columns with proportion of NAs >= frac.
Parameters: frac : float
Fraction of NAs in the column.
Returns: A list of indices of the columns that have fewer NAs than the threshold.
If all columns are filtered, None is returned.
-
frame_id
¶ The name of the frame.
-
static
from_python
(python_obj, destination_frame=None, header=0, separator=u', ', column_names=None, column_types=None, na_strings=None)[source]¶ Create a new
H2OFrame
object from an existing Python object (which can be of different kinds). Properly handle native python data types. For a discussion of the rules and permissible data types please refer to the main documentation for H2OFrame.
Parameters: python_obj : tuple, list, dict, collections.OrderedDict
If a nested list/tuple, then each nested collection is a row.
- destination_frame : str, optional
The unique hex key assigned to the imported file. If none is given, a key will automatically be generated.
- header : int, optional
-1 means the first line is data, 0 means guess, 1 means first line is header.
- separator : str, optional
The field separator character. Values on each line of the file are separated by this character. If separator = “”, the parser will automatically detect the separator.
- column_names : list, optional
A list of column names for the file.
- column_types : list or dict, optional
A list of types or a dictionary of column names to types to specify whether columns should be forced to a certain type upon import parsing. If a list, the types for elements that are None will be guessed. The possible types a column may have are.
- na_strings : list or dict, optional
A list of strings, or a list of lists of strings (one list per column), or a dictionary of column names to strings which are to be interpreted as missing values.
Returns: A new H2OFrame instance.
Examples
>>> l = [[1,2,3,4,5], [99,123,51233,321]]
>>> l = H2OFrame(l)
>>> l
-
static
get_frame
(frame_id)[source]¶ Create an H2OFrame mapped to an existing id in the cluster.
Returns: H2OFrame that points to a pre-existing big data H2OFrame in the cluster
-
get_frame_data
()[source]¶ Get frame data as str in csv format
Returns: A local python string, each line is a row and each element separated by commas,
containing this H2OFrame instance’s data.
-
group_by
(by)[source]¶ Return a new GroupBy object using this frame and the desired grouping columns.
The returned groups are sorted by the natural group-by column sort.
Parameters: by : list
The columns to group on.
Returns: A new GroupBy object.
-
gsub
(pattern, replacement, ignore_case=False)[source]¶ Globally substitute occurrences of pattern in a string with replacement.
Parameters: pattern : str
A regular expression.
- replacement : str
A replacement string.
- ignore_case : bool
If True then pattern will match against upper and lower case.
Returns: H2OFrame
-
head
(rows=10, cols=200)[source]¶ Equivalent of R’s head call on a data.frame.
Parameters: rows : int, default=10
Number of rows starting from the topmost
- cols : int, default=200
Number of columns starting from the leftmost
Returns: An H2OFrame.
-
hist
(breaks=u'Sturges', plot=True, **kwargs)[source]¶ Compute a histogram over a numeric column.
Parameters: breaks: str, int, list
Can be one of “Sturges”, “Rice”, “sqrt”, “Doane”, “FD”, “Scott”. Can be a single number for the number of breaks. Can be a list containing the split points, e.g., [-50,213.2123,9324834]. If breaks is “FD”, the MAD is used over the IQR in computing bin width.
- plot : bool, default=True
If True, then a plot is generated
Returns: If plot is False, return H2OFrame with these columns: breaks, counts, mids_true,
mids, and density; otherwise produce the plot.
-
ifelse
(yes, no)[source]¶ Equivalent to [y if t else n for t,y,n in zip(self,yes,no)].
Based on the booleans in the test vector, the output has the values of the yes and no vectors interleaved (or merged together). All Frames must have the same row count. Single column frames are broadened to match wider Frames. Scalars are allowed, and are also broadened to match wider frames.
Parameters: test : H2OFrame (self)
Frame of values treated as booleans; may be a single column
- yes : H2OFrame
Frame to use if [test] is true ; may be a scalar or single column
- no : H2OFrame
Frame to use if [test] is false; may be a scalar or single column
Returns: H2OFrame of the merged yes/no Frames/scalars according to the test input frame.
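The element-wise selection described above can be sketched with plain Python lists for the single-column case. The helper name ifelse_local is hypothetical, and the sketch only mirrors the documented semantics (including broadening of scalars); it does not touch an H2O cluster.

```python
def ifelse_local(test, yes, no):
    """Pick yes[i] where test[i] is truthy, otherwise no[i]."""
    n = len(test)
    # Scalars are broadened to the length of the test column.
    yes_col = yes if isinstance(yes, list) else [yes] * n
    no_col = no if isinstance(no, list) else [no] * n
    if len(yes_col) != n or len(no_col) != n:
        raise ValueError("all columns must have the same row count")
    return [y if t else f for t, y, f in zip(test, yes_col, no_col)]


result = ifelse_local([1, 0, 1], [10, 20, 30], -1)
print(result)  # [10, -1, 30]
```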
-
impute
(column=-1, method=u'mean', combine_method=u'interpolate', by=None, group_by_frame=None, values=None)[source]¶ Impute in place.
Parameters: column: int, default=-1
The column to impute, if -1 then impute the whole frame
- method : str, default=”mean”
The method of imputation: mean, median, mode
- combine_method : str, default=”interpolate”
When method is “median”, dictates how to combine quantiles for even samples.
- by : list, default=None
The columns to group on.
- group_by_frame : H2OFrame, default=None
Impute the column col with this pre-computed grouped frame.
- values : list
A list of impute values (one per column). NaN indicates to skip the column.
Returns: A list of values used in the imputation or the group by result used in imputation.
-
insert_missing_values
(fraction=0.1, seed=None)[source]¶ Insert missing values into an H2OFrame. Randomly replaces a user-specified fraction of entries in an H2O dataset with missing values.
WARNING! This will modify the original dataset. Unless this is intended, this function should only be called on a subset of the original.
Parameters: fraction : float
A number between 0 and 1 indicating the fraction of entries to replace with missing.
- seed : int
A random number used to select which entries to replace with missing values.
Returns: H2OFrame with missing values inserted.
-
interaction
(factors, pairwise, max_factors, min_occurrence, destination_frame=None)[source]¶ Categorical Interaction Feature Creation in H2O. Creates a frame in H2O with n-th order interaction features between categorical columns, as specified by the user.
Parameters: factors : list
Factor columns (either indices or column names).
- pairwise : bool
Whether to create pairwise interactions between factors (otherwise create one higher-order interaction). Only applicable if there are 3 or more factors.
- max_factors: int
Max. number of factor levels in pair-wise interaction terms (if enforced, one extra catch-all factor will be made)
- min_occurrence: int
Min. occurrence threshold for factor levels in pair-wise interaction terms
- destination_frame: str, optional
A string indicating the destination key.
Returns: H2OFrame
-
ischaracter
()[source]¶ Returns: True if the column is a character column, otherwise False (same as isstring)
-
isfactor
()[source]¶ Test if the selection is a factor column.
Returns: True if the column is categorical; otherwise False. For String columns, the result
is False.
-
isin
(item)[source]¶ Test whether elements of an H2OFrame are contained in the item.
Parameters: item : any element or a list of elements
An item or a list of items to compare the H2OFrame against.
Returns: An H2OFrame of 0s and 1s showing whether each element in the original H2OFrame is contained in item.
-
isna
()[source]¶ For each element in an H2OFrame, determine if it is NA or not.
Returns: H2OFrame of 1s and 0s. 1 means the value was NA.
-
isstring
()[source]¶ Returns: True if the column is a string column, otherwise False (same as ischaracter)
-
kfold_column
(n_folds=3, seed=-1)[source]¶ Build a fold assignments column for cross-validation. This call will produce a column having the same data layout as the calling object.
Parameters: n_folds : int
An integer specifying the number of validation sets to split the training data into.
- seed : int, optional
Seed for random numbers as fold IDs are randomly assigned.
Returns: A single column H2OFrame with the fold assignments.
-
kurtosis
(na_rm=False)[source]¶ Compute the kurtosis.
Parameters: na_rm: bool, default=False
If True, then remove NAs from the computation.
Returns: A list containing the kurtosis for each column (NaN for non-numeric columns).
-
lstrip
(set=u' ')[source]¶ Return a copy of the column with leading characters removed.
The set argument is a string specifying the set of characters to be removed. If omitted, the set argument defaults to removing whitespace.
Parameters: set : str
Set of characters to lstrip from strings in column
Returns: H2OFrame with lstripped strings.
-
match
(table, nomatch=0)[source]¶ Make a vector of the positions of (first) matches of its first argument in its second.
Parameters: table : list
list of items to match against
nomatch : optional
Returns: H2OFrame of one boolean column
-
mean
(skipna=True, axis=0, **kwargs)[source]¶ Compute the frame’s means by-column (or by-row).
Parameters: skipna : bool, default=True
If True, then NAs are ignored during the computation. Otherwise the presence of NAs renders the entire result NA.
- axis : int, default=0
Direction of mean computation. If 0, then the mean is computed column-wise, and the result is a frame with 1 row and the same number of columns as the original frame. If 1, then the mean is computed row-wise, and the result is a frame with 1 column (called “mean”) and the same number of rows as the original frame.
Returns: H2OFrame containing the results.
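The skipna and axis behavior described above can be illustrated on a small list of rows. This is a plain-Python sketch in which None stands in for NA; the helper name mean_local is hypothetical, and treating NA handling per column (or per row) is an assumption of the sketch.

```python
def mean_local(rows, skipna=True, axis=0):
    """Column-wise (axis=0) or row-wise (axis=1) means; None marks an NA."""
    vectors = [list(col) for col in zip(*rows)] if axis == 0 else rows
    means = []
    for vec in vectors:
        present = [x for x in vec if x is not None]
        if not present or (not skipna and len(present) < len(vec)):
            means.append(None)  # NA result
        else:
            means.append(sum(present) / len(present))
    return means


data = [[1.0, 10.0], [3.0, None]]
print(mean_local(data))                # [2.0, 10.0]
print(mean_local(data, axis=1))        # [5.5, 3.0]
print(mean_local(data, skipna=False))  # [2.0, None]
```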
-
median
(na_rm=False)[source]¶ Compute the median.
Parameters: na_rm: bool, default=False
If True, then remove NAs from the computation.
Returns: A list containing the median for each column (NaN for non-numeric columns).
-
merge
(other, all_x=False, all_y=False, by_x=None, by_y=None, method=u'auto')[source]¶ Merge two datasets based on common column names.
Parameters: other: H2OFrame
Other dataset to merge. Must have at least one column in common with self, and all columns in common are used as the merge key. If you want to use only a subset of the columns in common, rename the other columns so the columns are unique in the merged result.
- all_x: bool, default=False
If True, include all rows from the left/self frame
- all_y: bool, default=False
If True, include all rows from the right/other frame
Returns: Original self frame enhanced with merged columns and rows
-
static
mktime
(year=1970, month=0, day=0, hour=0, minute=0, second=0, msec=0)[source]¶ All units are zero-based (including months and days). Missing year is 1970.
Parameters: year : int, H2OFrame
the year
- month: int, H2OFrame
the month
- day : int, H2OFrame
the day
- hour : int, H2OFrame
the hour
- minute : int, H2OFrame
the minute
- second : int, H2OFrame
the second
- msec : int, H2OFrame
the millisecond
Returns: H2OFrame of one column containing the date in millis since the epoch.
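Because months and days are zero-based here, an off-by-one is easy to make. The sketch below shows the conversion for scalar inputs using only the standard library. It assumes the time is interpreted as UTC, which is not stated above, and the helper name mktime_local is hypothetical.

```python
import calendar


def mktime_local(year=1970, month=0, day=0, hour=0, minute=0, second=0, msec=0):
    """Milliseconds since the epoch for zero-based month and day values."""
    # Shift the zero-based month/day to the one-based values the calendar expects.
    seconds = calendar.timegm((year, month + 1, day + 1, hour, minute, second))
    return seconds * 1000 + msec


print(mktime_local())                 # 0  (1970-01-01 00:00:00)
print(mktime_local(day=1))            # 86400000  (1970-01-02)
print(mktime_local(year=2000))        # 946684800000  (2000-01-01)
```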
-
modulo_kfold_column
(n_folds=3)[source]¶ Build a fold assignments column for cross-validation. Rows are assigned a fold according to the current row number modulo n_folds.
Parameters: n_folds : int
An integer specifying the number of validation sets to split the training data into.
Returns: A single column H2OFrame with the fold assignments.
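The modulo assignment is deterministic and easy to reproduce locally. The sketch below assumes zero-based row numbers and zero-based fold IDs, which the description above does not specify; the helper name modulo_folds is hypothetical.

```python
def modulo_folds(n_rows, n_folds=3):
    """Assign row i to fold i % n_folds."""
    return [i % n_folds for i in range(n_rows)]


folds = modulo_folds(7, n_folds=3)
print(folds)  # [0, 1, 2, 0, 1, 2, 0]
```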
-
mult
(matrix)[source]¶ Perform matrix multiplication.
Parameters: matrix : H2OFrame
The right-hand-side matrix
Returns: H2OFrame result of the matrix multiplication
-
nacnt
()[source]¶ Count of NAs for each column in this H2OFrame.
Returns: A list of the NA counts (one entry per column).
-
names
¶ The list of column names.
-
nchar
()[source]¶ Count the number of characters in each string of a single-column H2OFrame.
Returns: A single-column H2OFrame containing the per-row character count.
-
ncol
¶ Same as
self.ncols
.
-
ncols
¶ Number of columns in the dataframe.
-
nlevels
()[source]¶ Get the number of factor levels for this frame.
Returns: A list of the number of levels per column.
-
nrow
¶ Same as
self.nrows
.
-
nrows
¶ Number of rows in the dataframe.
-
num_valid_substrings
(path_to_words)[source]¶ For each string, find the count of all possible substrings >= 2 characters that are contained in the line-separated text file whose path is given.
Parameters: path_to_words : str
Path to file that contains a line-separated list of strings considered valid.
Returns: An H2OFrame with the number of substrings that are contained in the given word list.
-
pop
(i)[source]¶ Pop a column from the H2OFrame at index i.
Parameters: i – The index (int) or name (str) of the column to pop.
Returns: The column dropped from the frame; the frame is modified in place and loses the column.
-
prod
(na_rm=False)[source]¶ Compute the product of the column.
Parameters: na_rm : bool, default=False
If True, then remove NAs from the computation.
Returns: The product of the column.
-
quantile
(prob=None, combine_method=u'interpolate', weights_column=None)[source]¶ Compute quantiles.
Parameters: prob : list, default=[0.01,0.1,0.25,0.333,0.5,0.667,0.75,0.9,0.99]
A list of probabilities of any length.
- combine_method : str, default=“interpolate”
For even samples, how to combine quantiles. Should be one of [“interpolate”, “average”, “low”, “high”].
- weights_column : str, default=None
Name of column with optional observation weights in this H2OFrame or a 1-column H2OFrame of observation weights.
Returns: A new H2OFrame containing the quantiles and probabilities.
-
rbind
(data)[source]¶ Append data to this frame row-wise.
Parameters: data – an H2OFrame or a list of H2OFrames to be combined with the current frame row-wise.
Returns: This H2OFrame with all frames in data appended row-wise.
-
relevel
(y)[source]¶ Reorder levels of an H2O factor, similarly to standard R’s relevel.
The levels of a factor are reordered such that the reference level is at level 0, remaining levels are moved down as needed.
Parameters: x : Column (self)
Column in H2O Frame
- y : String
Reference level
Returns: New reordered factor column
-
rep_len
(length_out)[source]¶ Replicate the values in data in the H2O backend.
Parameters: length_out : int
Number of columns of the resulting H2OFrame
Returns: H2OFrame
-
round
(digits=0)[source]¶ Round doubles/floats to the given number of decimal places.
Parameters: digits : int, default=0
Number of decimal places to round doubles/floats. Rounding to a negative number of decimal places is not supported. For rounding off a 5, the IEC 60559 standard is used, ‘go to the even digit’. Therefore rounding 2.5 gives 2 and rounding 3.5 gives 4.
Returns: H2OFrame
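The 'go to the even digit' rule is the same tie-breaking rule offered by Python's decimal module as ROUND_HALF_EVEN, so it can be tried locally. The helper name round_half_even is hypothetical; values are passed as strings so that exact decimal ties are preserved. This illustrates the rounding rule only, not H2O's implementation.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def round_half_even(value, digits=0):
    """Round a decimal string to `digits` places; ties go to the even digit."""
    if digits < 0:
        raise ValueError("a negative number of decimal places is not supported")
    quantum = Decimal(10) ** -digits  # e.g. digits=2 -> Decimal('0.01')
    return Decimal(value).quantize(quantum, rounding=ROUND_HALF_EVEN)


print(round_half_even("2.5"))       # 2
print(round_half_even("3.5"))       # 4
print(round_half_even("0.125", 2))  # 0.12
```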
-
rstrip
(set=u' ')[source]¶ Return a copy of the column with trailing characters removed.
The set argument is a string specifying the set of characters to be removed. If omitted, the set argument defaults to removing whitespace.
Parameters: set : str
Set of characters to rstrip from strings in column
Returns: H2OFrame with rstripped strings.
-
runif
(seed=None)[source]¶ Generate a column of random numbers drawn from a uniform distribution [0,1) and having the same data layout as the calling H2OFrame instance.
Parameters: seed : int, optional
A random seed. If None, then one will be generated.
Returns: Single-column H2OFrame filled with doubles sampled uniformly from [0,1).
-
scale
(center=True, scale=True)[source]¶ Center and/or scale the columns of the current frame.
Parameters: center : bool, list
If True, then demean the data by the mean. If False, no shifting is done. If a list, then shift each column by the given amount in the list.
- scale : bool, list
If True, then scale the data by the column standard deviation. If False, no scaling is done. If a list, then scale each column by the given amount in the list.
Returns: H2OFrame
-
sd
(na_rm=False)[source]¶ Compute the standard deviation.
Parameters: na_rm : bool, default=False
Remove NAs from the computation.
Returns: A list containing the standard deviation for each column (NaN for non-numeric
columns).
-
set_level
(level)[source]¶ A method to set all column values to one of the levels.
Parameters: level : str
The level at which the column will be set (a string)
Returns: H2OFrame with entries set to the desired level.
-
set_levels
(levels)[source]¶ Works on a single categorical column. New domains must be aligned with the old domains. This call has copy-on-write semantics.
Parameters: levels : list
A list of strings specifying the new levels. The number of new levels must match the number of old levels.
Returns: A single-column H2OFrame with the desired levels.
-
set_name
(col=None, name=None)[source]¶ Set the name of a column.
Parameters: - col – index or name of the column whose name is to be set; may be skipped for 1-column frames
- name – the new name of the column
-
set_names
(names)[source]¶ Change all of this H2OFrame instance’s column names.
Parameters: names : list
A list of strings equal to the number of columns in the H2OFrame.
-
shape
¶ Number of rows and columns in the dataframe as a tuple (nrows, ncols).
-
show
(use_pandas=False)[source]¶ Used by the H2OFrame.__repr__ method to print or display a snippet of the data frame.
If called from IPython, displays an html’ized result. Else prints a tabulate’d result.
-
signif
(digits=6)[source]¶ Round doubles/floats to the given number of significant digits.
Parameters: digits : int, default=6
Number of significant digits to round doubles/floats.
Returns: H2OFrame
-
skewness
(na_rm=False)[source]¶ Compute the skewness.
Parameters: na_rm: bool, default=False
If True, then remove NAs from the computation.
Returns: A list containing the skewness for each column (NaN for non-numeric columns).
-
split_frame
(ratios=None, destination_frames=None, seed=None)[source]¶ Split a frame into distinct subsets of size determined by the given ratios.
The number of subsets is always 1 more than the number of ratios given. Note that this does not give an exact split. H2O is designed to be efficient on big data using a probabilistic splitting method rather than an exact split. For example, when specifying a split of 0.75/0.25, H2O will produce a train/test split with an expected value of 0.75/0.25 rather than exactly 0.75/0.25. On small datasets, the sizes of the resulting splits will deviate from the expected value more than on big data, where they will be very close to exact.
Parameters: ratios : list
The fraction of rows for each split.
- destination_frames : list
The names of the split frames.
- seed : int
Used for selecting which H2OFrame a row will belong to.
Returns: A list of H2OFrame instances
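To see why a probabilistic split is only approximately sized, consider the following plain-Python sketch, in which each row draws a uniform random number and is placed in the subset whose cumulative ratio range contains it. This is one plausible way to realize such a split and is an assumption of the sketch, not a description of H2O's internals; the helper name probabilistic_split is hypothetical.

```python
import random


def probabilistic_split(n_rows, ratios, seed=None):
    """Assign each row index to one of len(ratios) + 1 subsets at random."""
    rng = random.Random(seed)
    # Cumulative thresholds, e.g. [0.6, 0.2] -> [0.6, 0.8] and three subsets.
    thresholds = []
    total = 0.0
    for r in ratios:
        total += r
        thresholds.append(total)
    splits = [[] for _ in range(len(ratios) + 1)]
    for row in range(n_rows):
        u = rng.random()
        # Count how many thresholds the draw has passed to find its subset.
        splits[sum(u >= t for t in thresholds)].append(row)
    return splits


train, test = probabilistic_split(10000, [0.75], seed=42)
print(len(train), len(test))  # close to 7500 and 2500, but rarely exact
```

With only a handful of rows the subset sizes can differ noticeably from the requested ratios, which matches the caveat above about small datasets.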
-
stratified_kfold_column
(n_folds=3, seed=-1)[source]¶ Build a fold assignment column with the constraint that each fold has the same class distribution as the fold column.
Parameters: n_folds: int
The number of folds to build.
- seed: int
A random seed.
Returns: A single column H2OFrame with the fold assignments.
-
stratified_split
(test_frac=0.2, seed=-1)[source]¶ Construct a column that can be used to perform a random stratified split.
Parameters: test_frac : float, default=0.2
The fraction of rows that will belong to the “test”.
- seed : int
For seeding the random splitting.
Returns: A categorical column of two levels “train” and “test”.
Examples
>>> my_stratified_split = my_frame["response"].stratified_split(test_frac=0.3, seed=12349453)
>>> train = my_frame[my_stratified_split == "train"]
>>> test = my_frame[my_stratified_split == "test"]
>>> # check that the distributions among the initial frame and the train/test frames match
>>> my_frame["response"].table()["Count"] / my_frame["response"].table()["Count"].sum()
>>> train["response"].table()["Count"] / train["response"].table()["Count"].sum()
>>> test["response"].table()["Count"] / test["response"].table()["Count"].sum()
-
strsplit
(pattern)[source]¶ Split the strings in the target column on the given regular expression pattern.
Parameters: pattern : str
The split pattern.
Returns: H2OFrame containing columns of the split strings.
-
sub
(pattern, replacement, ignore_case=False)[source]¶ Substitute the first occurrence of pattern in a string with replacement.
Parameters: pattern : str
A regular expression.
- replacement : str
A replacement string.
- ignore_case : bool
If True then pattern will match against upper and lower case.
Returns: H2OFrame
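The difference between sub (first occurrence only) and gsub (every occurrence) can be shown on a single string with Python's re module. Note that Python's regular expression dialect may differ from the one used by the H2O backend, so this is only an analogy; the helper name sub_local and its first_only flag are hypothetical.

```python
import re


def sub_local(pattern, replacement, s, ignore_case=False, first_only=True):
    """Replace the first match (like sub) or all matches (like gsub)."""
    flags = re.IGNORECASE if ignore_case else 0
    count = 1 if first_only else 0  # in re.sub, count=0 means replace all
    return re.sub(pattern, replacement, s, count=count, flags=flags)


print(sub_local("a", "X", "banana"))                    # bXnana
print(sub_local("a", "X", "banana", first_only=False))  # bXnXnX
print(sub_local("A", "X", "banana", ignore_case=True))  # bXnana
```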
-
substring
(start_index, end_index=None)[source]¶ For each string, return a new string that is a substring of the original string.
If end_index is not specified, then the substring extends to the end of the original string. If the start_index is longer than the length of the string, or is greater than or equal to the end_index, an empty string is returned. Negative start_index is coerced to 0.
Parameters: start_index : int
The index of the original string at which to start the substring, inclusive.
- end_index: int, optional
The index of the original string at which to end the substring, exclusive.
Returns: An H2OFrame containing the specified substrings.
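The index rules above differ slightly from Python slicing (for example, a negative start is coerced to 0 rather than counted from the end). A plain-Python sketch of the documented rules for one string follows; the helper name substring_local is hypothetical.

```python
def substring_local(s, start_index, end_index=None):
    """Substring from start_index (inclusive) to end_index (exclusive)."""
    start = max(start_index, 0)  # negative start_index is coerced to 0
    end = len(s) if end_index is None else end_index
    if start >= len(s) or start >= end:
        return ""
    return s[start:end]


print(substring_local("hello", 1, 3))   # el
print(substring_local("hello", -2, 2))  # he
print(substring_local("hello", 2))      # llo
```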
-
table
(data2=None, dense=True)[source]¶ Compute the counts of values appearing in a column, or co-occurrence counts between two columns.
Parameters: data2 : H2OFrame
Default is None, can be an optional single column to aggregate counts by.
- dense : bool
Default is True, for dense representation, which lists only non-zero counts, 1 combination per row. Set to False to expand counts across all combinations.
Returns: H2OFrame of the counts at each combination of factor levels
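The dense representation (only combinations that actually occur, one per entry) corresponds closely to what collections.Counter produces for plain Python lists. The sketch below is a local analogy under that assumption; the helper name table_local is hypothetical.

```python
from collections import Counter


def table_local(column, column2=None):
    """Counts of values in one column, or of value pairs across two columns."""
    if column2 is None:
        return Counter(column)
    return Counter(zip(column, column2))


print(table_local(["a", "b", "a"]))
# Counter({'a': 2, 'b': 1})
pairs = table_local(["a", "b", "a"], ["x", "x", "y"])
print(pairs[("a", "x")], pairs[("a", "y")], pairs[("b", "y")])
# 1 1 0
```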
-
tail
(rows=10, cols=200)[source]¶ Equivalent of R’s tail call on a data.frame.
Parameters: rows : int, default=10
Number of rows starting from the bottommost
- cols: int, default=200
Number of columns starting from the leftmost
Returns: An H2OFrame.
-
transpose
()[source]¶ Transpose rows and columns of H2OFrame.
Returns: The transpose of the input frame.
-
trim
()[source]¶ Trim white space on the left and right of strings in a single-column H2OFrame.
Returns: H2OFrame with trimmed strings.
-
types
¶ The dictionary of column name/type pairs.
-
unique
()[source]¶ Extract the unique values in the column.
Returns: H2OFrame of just the unique values in the column.
-
var
(y=None, na_rm=False, use=None)[source]¶ Compute the variance or covariance matrix of one or two H2OFrames.
Parameters: y : H2OFrame, default=None
If y is None and self is a single column, then the variance is computed for self. If self has multiple columns, then its covariance matrix is returned. Single rows are treated as single columns. If y is not None, then a covariance matrix between the columns of self and the columns of y is computed.
na_rm : bool, default=False
Remove NAs from the computation.
use : str, default=None, which acts as “everything” if na_rm is False, and “complete.obs” if na_rm is True
A string indicating how to handle missing values. This must be one of the following:
- “everything” - outputs NaNs whenever one of its contributing observations is missing
- “all.obs” - presence of missing observations will throw an error
- “complete.obs” - discards missing values along with all observations in their rows so that only complete observations are used
Returns: An H2OFrame of the covariance matrix of the columns of this H2OFrame with itself (if y is not given), or with
the columns of y (if y is given). If self and y are single rows or single columns, the variance or covariance is given as a scalar.
-
GroupBy
¶
-
class
h2o.group_by.
GroupBy
(fr, by)[source]¶ Bases:
object
A class that represents the group by operation on an H2OFrame.
Sample usage:
>>> my_frame = ...  # some existing H2OFrame
>>> grouped = my_frame.group_by(by=["C1","C2"])
>>> grouped.sum(col="X1",na="all").mean(col="X5",na="all").max()
>>> grouped.get_frame()
Any number of aggregations may be chained together in this manner.
If no arguments are given to the aggregation (e.g. “max” in the above example), then it is assumed that the aggregation should apply to all columns but the group by columns.
The na parameter is one of [“all”, “ignore”, “rm”]:
- “all” - include NAs
- “rm” - exclude NAs
Variance (var) and standard deviation (sd) are the sample (not population) statistics.
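The distinction between sample and population statistics matters most for small groups. The sketch below groups a few (key, value) rows in plain Python and compares the two variances with the standard statistics module; the data and variable names are made up for illustration.

```python
import statistics
from collections import defaultdict

rows = [("a", 2.0), ("a", 4.0), ("a", 6.0), ("b", 1.0), ("b", 3.0)]

# Collect the values of each group.
groups = defaultdict(list)
for key, value in rows:
    groups[key].append(value)

# Sample variance divides by n - 1; population variance divides by n.
sample_var = {k: statistics.variance(v) for k, v in groups.items()}
population_var = {k: statistics.pvariance(v) for k, v in groups.items()}

print(sample_var)  # {'a': 4.0, 'b': 2.0}
print(population_var)
```

The sample values (4.0 and 2.0) are the kind of result the note above refers to; the population values for the same groups are smaller.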
-
frame
¶ Returns: the result of the group by