Aggregator

Introduction

The H2O Aggregator method is a clustering-based method for reducing a numerical/categorical dataset into a dataset with fewer rows. If the dataset has categorical columns, then for each categorical column, Aggregator will:

  1. Accumulate the category frequencies.
  2. For the top 1,000 or fewer categories (by frequency), generate dummy variables (called one-hot encoding by ML people, called dummy coding by statisticians).
  3. Calculate the first eigenvector of the covariance matrix of these dummy variables.
  4. Replace the row values on the categorical column with the value from the eigenvector corresponding to the dummy values.

Aggregator maintains outliers as outliers, but lumps together dense clusters into exemplars with an attached count column showing the member points.

The Aggregator method behaves just any other unsupervised model. You can ignore columns, which will then be dropped for distance computation. Training itself creates the aggregated H2O Frame, which also includes the count of members for every row/exemplar. The aggregated frame always includes the full original content of the training frame, even if some columns were ignored for the distance computation. Scoring/prediction is overloaded with a function that returns the members of a given exemplar row index from 0…Nexemplars (this time without a count).

Defining an Aggregator Model

  • model_id: (Optional) Specify a custom name for the model to use as a reference. By default, H2O automatically generates a destination key.
  • training_frame: (Required) Specify the dataset used to build the model. NOTE: In Flow, if you click the Build a model button from the Parse cell, the training frame is entered automatically.
  • ignored_columns: (Optional) Specify the column or columns to be excluded from the model. In Flow, click the checkbox next to a column name to add it to the list of columns excluded from the model. To add all columns, click the All button. To remove a column from the list of ignored columns, click the X next to the column name. To remove all columns from the list of ignored columns, click the None button. To search for a specific column, type the column name in the Search field above the column list. To only show columns with a specific percentage of missing values, specify the percentage in the Only show columns with more than 0% missing values field. To change the selections for the hidden columns, use the Select Visible or Deselect Visible buttons.
  • ignore_const_cols: Enable this option to ignore constant training columns, since no information can be gained from them. This option is enabled by default.
  • target_num_exemplars: Specify a value for the targeted number of exemplars. This value defaults to 5000.
  • rel_tol_num_exemplars: Specify the relative tolerance for the number of exemplars (e.g, 0.5 is +/- 50 percent). This value defaults to 0.5.
  • transform: Specify the transformation method for the training data: None, Standardize, Normalize, Demean, or Descale. The default is Normalize.
  • categorical_encoding: Specify one of the following encoding schemes for handling categorical features:
  • auto or AUTO: Allow the algorithm to decide (default). In GBM, the algorithm will automatically perform enum encoding.
  • one_hot_internal or OneHotInternal: On the fly N+1 new cols for categorical features with N levels (default)
  • binary: No more than 32 columns per categorical feature
  • eigen or Eigen: k columns per categorical feature, keeping projections of one-hot-encoded matrix onto k-dim eigen space only
  • label_encoder or LabelEncoder: Convert every enum into the integer of its index (for example, level 0 -> 0, level 1 -> 1, etc.)
  • enum_limited or EnumLimited: Automatically reduce categorical levels to the most prevalent ones during Aggregator training and only keep the T (1024) most frequent levels.
  • save_mapping_frame: When this option is enabled, the mapping of rows in an aggregated frame to the one in the original/raw frame will be created and exported. This option is disabled by default.

Aggregator Output

The output of the aggregation is a new aggregrated frame that can be accessed in R and Python.

# Create a random frame with 5 columns and 100 rows
df <- h2o.createFrame(
  rows=100,
  cols=5,
  categorical_fraction=0.6,
  integer_fraction=0,
  binary_fraction=0,
  real_range=100,
  integer_range=100,
  missing_fraction=0,
  seed=123
)

# View the dataframe
df
      C1        C2     C3        C4     C5
1 c0.l53  10.94351 c2.l88 -93.64087 c4.l56
2 c0.l21 -93.70999 c2.l37  39.10130 c4.l97
3 c0.l96  55.43136  c2.l7 -43.47587 c4.l23
4 c0.l78  27.41477 c2.l63  83.09211 c4.l81
5 c0.l95 -77.98143 c2.l17 -93.95397  c4.l8
6 c0.l90  12.54660 c2.l36  60.54920 c4.l56

[100 rows x 5 columns]

# Build an aggregated frame using eigan categorical encoding
target_num_exemplars=1000
rel_tol_num_exemplars=0.5
encoding="Eigen"
agg <- h2o.aggregator(training_frame=df,
                      target_num_exemplars=target_num_exemplars,
                      rel_tol_num_exemplars=rel_tol_num_exemplars,
                      categorical_encoding=encoding)

# Use the aggregated frame to create a new dataframe
new_df <- h2o.aggregated_frame(agg)

#View the new dataframe
new_df
      C1        C2     C3        C4     C5 counts
1 c0.l53  10.94351 c2.l88 -93.64087 c4.l56      1
2 c0.l21 -93.70999 c2.l37  39.10130 c4.l97      1
3 c0.l96  55.43136  c2.l7 -43.47587 c4.l23      1
4 c0.l78  27.41477 c2.l63  83.09211 c4.l81      1
5 c0.l95 -77.98143 c2.l17 -93.95397  c4.l8      1
6 c0.l90  12.54660 c2.l36  60.54920 c4.l56      1

[100 rows x 6 columns]
import h2o
h2o.init()
from h2o.estimators.aggregator import H2OAggregatorEstimator

# Create a random data frame with 5 columns and 100 rows
df = h2o.create_frame(
    rows=100,
    cols=5,
    categorical_fraction=0.6,
    integer_fraction=0,
    binary_fraction=0,
    real_range=100,
    integer_range=100,
    missing_fraction=0,
    seed=1234
)

# View the dataframe
>>> df
      C1  C2      C3            C4  C5
--------  ------  ------  --------  ------
 56.3978  c1.l74  c2.l58   36.4711  c4.l66
-41.3355  c1.l31  c2.l43  -54.4267  c4.l4
 79.9964  c1.l4   c2.l68  -13.5409  c4.l49
 73.4546  c1.l5   c2.l25  -23.6456  c4.l12
 12.2449  c1.l7   c2.l49  -71.3769  c4.l61
-20.2171  c1.l41  c2.l92  -70.2103  c4.l50
 80.6089  c1.l28  c2.l18  -34.7444  c4.l19
-99.6821  c1.l21  c2.l74   93.7822  c4.l31
-56.1135  c1.l35  c2.l8   -79.3114  c4.l75
-71.4061  c1.l77  c2.l83  -32.2047  c4.l65

[100 rows x 5 columns]

# Build an aggregated frame using eigan categorical encoding
params = {
    "target_num_exemplars": 1000,
    "rel_tol_num_exemplars": 0.5,
    "categorical_encoding": "eigen"
}
agg = H2OAggregatorEstimator(**params)
agg.train(training_frame=df)

# Use the aggregated model to create a new dataframe using aggregated_frame
new_df = agg.aggregated_frame

# View the new dataframe
new_df
      C1  C2      C3            C4  C5        counts
--------  ------  ------  --------  ------  --------
 56.3978  c1.l74  c2.l58   36.4711  c4.l66         1
-41.3355  c1.l31  c2.l43  -54.4267  c4.l4          1
 79.9964  c1.l4   c2.l68  -13.5409  c4.l49         1
 73.4546  c1.l5   c2.l25  -23.6456  c4.l12         1
 12.2449  c1.l7   c2.l49  -71.3769  c4.l61         1
-20.2171  c1.l41  c2.l92  -70.2103  c4.l50         1
 80.6089  c1.l28  c2.l18  -34.7444  c4.l19         1
-99.6821  c1.l21  c2.l74   93.7822  c4.l31         1
-56.1135  c1.l35  c2.l8   -79.3114  c4.l75         1
-71.4061  c1.l77  c2.l83  -32.2047  c4.l65         1

[100 rows x 6 columns]