The H2O Aggregator method is a clustering-based method for reducing a numerical/categorical dataset into a dataset with fewer rows. If the dataset has categorical columns, then for each categorical column, Aggregator will:
Accumulate the category frequencies.
For the top 1,000 or fewer categories (by frequency), generate dummy variables (known as one-hot encoding in machine learning and as dummy coding in statistics).
Calculate the first eigenvector of the covariance matrix of these dummy variables.
Replace the row values on the categorical column with the value from the eigenvector corresponding to the dummy values.
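The per-column steps above can be sketched in plain Python. This is only an illustration of the idea, not H2O's implementation: the names `eigen_encode`, `max_levels`, and `iters` are hypothetical, power iteration stands in for whatever eigen-solver H2O actually uses, and mapping categories outside the retained levels to 0.0 is an assumption. The sign of an eigenvector is arbitrary, so only the relative values are meaningful.

```python
from collections import Counter
import math


def eigen_encode(values, max_levels=1000, iters=100):
    """Replace each category with its component of the first eigenvector
    of the covariance matrix of the one-hot (dummy) columns."""
    # Step 1: accumulate the category frequencies.
    freq = Counter(values)

    # Step 2: keep the most frequent levels; each one is a dummy column.
    levels = [lvl for lvl, _ in freq.most_common(max_levels)]
    n, k = len(values), len(levels)
    p = [freq[lvl] / n for lvl in levels]  # mean of each dummy column

    # Step 3: population covariance of the dummy columns.
    # For 0/1 columns that never overlap: cov = diag(p) - p p^T.
    cov = [[p[i] * (1 - p[i]) if i == j else -p[i] * p[j] for j in range(k)]
           for i in range(k)]

    # First eigenvector by power iteration (assumed solver, deterministic start).
    v = [1.0] + [0.0] * (k - 1)
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # e.g. a constant column: no variance to project on
            break
        v = [x / norm for x in w]

    # Step 4: replace each row value with the eigenvector entry of its level.
    score = dict(zip(levels, v))
    return [score.get(val, 0.0) for val in values]  # 0.0 for dropped levels (assumption)


column = ["a", "a", "a", "a", "b", "b", "c", "c"]
encoded = eigen_encode(column)
```

Rows that share a category receive the same numeric value, so the column can then take part in the distance computation like any other numeric column.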
Aggregator maintains outliers as outliers, but lumps together dense clusters into exemplars with an attached count column showing the number of member points.
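A minimal sketch of that behavior, assuming a simple fixed-radius rule: a point joins the first exemplar within `radius`, otherwise it becomes a new exemplar. The function name `aggregate` and the `radius` parameter are hypothetical, and this single-pass rule is a simplification rather than H2O's actual procedure.

```python
import math


def aggregate(points, radius):
    """Collapse nearby points into exemplars, each paired with a member count."""
    exemplars = []  # each entry is [exemplar_point, count]
    for pt in points:
        for ex in exemplars:
            if math.dist(pt, ex[0]) <= radius:
                ex[1] += 1  # dense region: absorb into an existing exemplar
                break
        else:
            exemplars.append([pt, 1])  # isolated point: becomes its own exemplar
    return [(tuple(p), c) for p, c in exemplars]


points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (10.0, 10.0)]
result = aggregate(points, radius=1.0)
```

In this toy input, the three points near the origin collapse into one exemplar with a count of 3, while the distant point survives as an exemplar with a count of 1.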
The Aggregator method behaves just like any other unsupervised model. You can ignore columns, which will then be dropped for distance computation. Training itself creates the aggregated H2O Frame, which also includes the count of members for every row/exemplar. The aggregated frame always includes the full original content of the training frame, even if some columns were ignored for the distance computation. Scoring/prediction is overloaded with a function that returns the members of a given exemplar row index from 0…Nexemplars (this time without a count).
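The relationship between ignored columns, the aggregated frame, and member lookup can be illustrated with a small sketch. Everything here is hypothetical and independent of the H2O API: rows are plain dictionaries, `assignment[i]` is assumed to hold the index of the exemplar row that row `i` was assigned to, and the count column is assumed to be named `counts`.

```python
import math


def distance(a, b, ignored=()):
    """Euclidean distance over every column that is not ignored."""
    cols = [c for c in a if c not in ignored]
    return math.sqrt(sum((a[c] - b[c]) ** 2 for c in cols))


def summarize(rows, assignment):
    """Build the aggregated rows (with counts) and the member rows per exemplar."""
    members = {}  # exemplar row index -> indices of its member rows
    for idx, ex_idx in enumerate(assignment):
        members.setdefault(ex_idx, []).append(idx)
    # Exemplars keep every original column, including ignored ones, plus a count.
    aggregated = [dict(rows[ex], counts=len(m)) for ex, m in members.items()]
    # Member lookup returns the original rows, without a count column.
    member_rows = [[rows[i] for i in m] for m in members.values()]
    return aggregated, member_rows


rows = [
    {"id": 1, "x": 0.0, "y": 0.0},
    {"id": 2, "x": 0.1, "y": 0.0},
    {"id": 3, "x": 5.0, "y": 5.0},
]
d = distance(rows[0], rows[1], ignored={"id"})  # "id" does not affect distance
aggregated, member_rows = summarize(rows, assignment=[0, 0, 2])
```

Here the `id` column is excluded from the distance but still appears in both the aggregated rows and the member rows.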
Defining an Aggregator Model
model_id: (Optional) Specify a custom name for the model to use as a reference. By default, H2O automatically generates a destination key.
training_frame: (Required) Specify the dataset used to build the model. NOTE: In Flow, if you click the Build a model button from the Parse cell, the training frame is entered automatically.
ignored_columns: (Optional) Specify the column or columns to be excluded from the model. In Flow, click the checkbox next to a column name to add it to the list of columns excluded from the model. To add all columns, click the All button. To remove a column from the list of ignored columns, click the X next to the column name. To remove all columns from the list of ignored columns, click the None button. To search for a specific column, type the column name in the Search field above the column list. To only show columns with a specific percentage of missing values, specify the percentage in the Only show columns with more than 0% missing values field. To change the selections for the hidden columns, use the Select Visible or Deselect Visible buttons.
ignore_const_cols: Enable this option to ignore constant training columns, since no information can be gained from them. This option is enabled by default.
target_num_exemplars: Specify a value for the targeted number of exemplars. This value defaults to 5000.
rel_tol_num_exemplars: Specify the relative tolerance for the number of exemplars (e.g., 0.5 is +/- 50 percent). This value defaults to 0.5.
transform: Specify the transformation method for numeric columns in the training data: None, Standardize, Normalize, Demean, or Descale. The default is Normalize.
categorical_encoding: Specify one of the following encoding schemes for handling categorical features:
AUTO: Allow the algorithm to decide (default). In Aggregator, the algorithm will automatically perform
OneHotInternal: On the fly, create N+1 new columns for categorical features with N levels (default)
Binary: No more than 32 columns per categorical feature
Eigen: k columns per categorical feature, keeping projections of one-hot-encoded matrix onto k-dim eigen space only
LabelEncoder: Convert every enum into the integer of its index (for example, level 0 -> 0, level 1 -> 1, etc.)
EnumLimited: Automatically reduce categorical levels to the most prevalent ones during Aggregator training and only keep the T (10) most frequent levels.
save_mapping_frame: When this option is enabled, the mapping of rows in an aggregated frame to the one in the original/raw frame will be created and exported. This option is disabled by default.
export_checkpoints_dir: Specify a directory to which generated models will automatically be exported.
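To make the interaction between target_num_exemplars and rel_tol_num_exemplars concrete, the sketch below computes the range of exemplar counts implied by the two options. The helper names are hypothetical, and treating the bounds as inclusive and rounded inward is an assumption rather than documented behavior.

```python
import math


def exemplar_window(target_num_exemplars=5000, rel_tol_num_exemplars=0.5):
    """Return the (low, high) exemplar counts implied by the relative tolerance."""
    low = target_num_exemplars * (1 - rel_tol_num_exemplars)
    high = target_num_exemplars * (1 + rel_tol_num_exemplars)
    return math.ceil(low), math.floor(high)  # inward rounding is an assumption


def within_tolerance(n, target_num_exemplars=5000, rel_tol_num_exemplars=0.5):
    """Check whether an exemplar count n falls inside the tolerance window."""
    low, high = exemplar_window(target_num_exemplars, rel_tol_num_exemplars)
    return low <= n <= high


window = exemplar_window()  # window for the documented defaults
```

With the defaults (5000 and 0.5), the window spans 2500 to 7500 exemplars, which matches the "+/- 50 percent" reading of the tolerance.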
The output of the aggregation is a new aggregated frame that can be accessed in R and Python.
Below is a simple example showing how to build an Aggregator model.