H2O supports training of supervised models (where the outcome variable is known) and unsupervised models (unlabeled data). Below we present examples of classification, regression, clustering, dimensionality reduction, and training on data segments (training a set of models, one for each partition of the data).
Supervised learning algorithms support classification and regression problems.
Classification and Regression¶
In classification problems, the output or “response” variable is a categorical value. The answer can be binary (for example, yes/no), or it can be multiclass (for example, dog/cat/mouse).
In regression problems, the output or “response” variable is a numeric value. An example would be predicting someone’s age or the sale price of a house.
Classification vs Regression in H2O models: You tell H2O whether to perform classification or regression for a particular supervised algorithm by encoding the response column as either a categorical/factor type (classification) or a numeric type (regression). If your column is represented as strings (“yes”, “no”), then H2O will automatically encode that column as a categorical/factor (aka “enum”) type when you import your dataset. However, if you have a column of integers that represent classes in a classification problem (0, 1), you will have to change the column type from numeric to categorical/factor (aka “enum”), for example with .asfactor() in Python or as.factor() in R. H2O requires the response column to be encoded as the “correct” type for a particular task in order to maximize the efficiency of the algorithm.
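The type-based rule above can be sketched in plain Python. This is a conceptual illustration of how the response column's type selects the task, not the H2O API; the `inferred_task` helper is hypothetical.

```python
# Conceptual sketch (not the H2O API): the task is chosen from the
# response column's type. Integer class labels look numeric, so they
# mean regression unless the column is explicitly made a factor.
def inferred_task(response_values, as_factor=False):
    if as_factor:
        return "classification"       # explicit categorical/factor encoding
    if all(isinstance(v, str) for v in response_values):
        return "classification"       # strings auto-encode as enum on import
    return "regression"               # numeric column defaults to regression

print(inferred_task(["yes", "no", "yes"]))        # classification
print(inferred_task([0, 1, 1]))                   # regression
print(inferred_task([0, 1, 1], as_factor=True))   # classification
```

This mirrors why a 0/1 integer response must be converted before running a classification model: without the conversion, the numeric type silently selects regression.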
This example uses the Prostate dataset and H2O’s GLM algorithm to predict the likelihood of a patient being diagnosed with prostate cancer. The dataset includes the following columns:
ID: A row identifier. This can be dropped from the list of predictors.
CAPSULE: Whether the tumor penetrated the prostatic capsule
AGE: The patient’s age
RACE: The patient’s race
DPROS: The result of the digital rectal exam, where 1=no nodule; 2=unilobar nodule on the left; 3=unilobar nodule on the right; and 4=bilobar nodule.
DCAPS: Whether there existed capsular involvement on the rectal exam
PSA: The Prostate Specific Antigen Value (mg/ml)
VOL: The tumor volume (cm3)
GLEASON: The patient’s Gleason score in the range 0 to 10
This example uses only the AGE, RACE, VOL, and GLEASON columns to make the prediction.
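H2O's GLM with a binomial family amounts to (regularized) logistic regression. The sketch below shows that idea in plain Python on toy data; it is a conceptual illustration under stated assumptions, not H2O's implementation, and the two predictors are toy stand-ins for scaled AGE and GLEASON values with a 0/1 CAPSULE-style label.

```python
import math

# Toy data: two scaled predictors and a 0/1 response (hypothetical values,
# shaped like scaled AGE/GLEASON with a CAPSULE label).
X = [[0.5, 0.5], [1.0, 1.0], [-0.5, -0.5], [0.0, 0.0], [1.2, 1.5], [-0.2, -0.5]]
y = [1, 1, 0, 0, 1, 0]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, lr=0.05, epochs=2000):
    # Plain stochastic gradient descent on the logistic loss.
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

w, b = train(X, y)
preds = [int(sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) > 0.5) for xi in X]
print(preds)
```

On this separable toy data the fitted model recovers all six training labels; a real H2O GLM adds regularization, standardization, and distributed solvers on top of this core idea.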
This example uses the Boston Housing data and H2O’s GLM algorithm to predict the median home price using all available features. The dataset includes the following columns:
crim: The per capita crime rate by town
zn: The proportion of residential land zoned for lots over 25,000 sq.ft
indus: The proportion of non-retail business acres per town
chas: A Charles River dummy variable (1 if the tract bounds the Charles river; 0 otherwise)
nox: Nitric oxides concentration (parts per 10 million)
rm: The average number of rooms per dwelling
age: The proportion of owner-occupied units built prior to 1940
dis: The weighted distances to five Boston employment centers
rad: The index of accessibility to radial highways
tax: The full-value property-tax rate per $10,000
ptratio: The pupil-teacher ratio by town
b: 1000(Bk - 0.63)^2, where Bk is the black proportion of population
lstat: The % lower status of the population
medv: The median value of owner-occupied homes in $1000’s
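GLM with a Gaussian family on a dataset like this reduces to linear least squares. The sketch below fits a single predictor in plain Python with the closed-form slope/intercept formulas; it is a conceptual illustration, not H2O's solver, and the `rm` vs. `medv` pairing with made-up values is just for illustration.

```python
# Ordinary least squares for one predictor (toy rm -> medv pairing).
rm   = [5.0, 6.0, 6.5, 7.0, 8.0]        # average rooms per dwelling (toy)
medv = [15.0, 21.0, 24.0, 27.0, 33.0]   # median home value in $1000s (toy)

n = len(rm)
mean_x = sum(rm) / n
mean_y = sum(medv) / n
# slope = covariance(x, y) / variance(x); intercept pins the line to the means.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(rm, medv))
         / sum((x - mean_x) ** 2 for x in rm))
intercept = mean_y - slope * mean_x
print(round(slope, 2), round(intercept, 2))  # 6.0 -15.0
```

A real H2O GLM generalizes this to many predictors, adds elastic-net regularization, and distributes the computation.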
Unsupervised learning algorithms include clustering and anomaly detection methods. Unsupervised learning algorithms such as GLRM and PCA can also be used to perform dimensionality reduction.
The example below uses the K-Means algorithm to build a simple clustering model of the Iris dataset.
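The K-Means idea itself can be shown in miniature. The sketch below runs Lloyd's algorithm on toy 2-D points in plain Python with fixed initial centroids; it is a conceptual illustration, not H2O's implementation (which adds smarter initialization and distributed execution).

```python
# Minimal K-Means (Lloyd's algorithm) on toy 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.2, 0.8),
          (8.0, 8.0), (8.5, 8.2), (7.8, 9.0)]

def dist2(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def kmeans(points, centroids, iters=10):
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)), key=lambda i: dist2(p, centroids[i]))
            clusters[i].append(p)
        # Update step: each centroid moves to its cluster's mean.
        centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(sorted(len(c) for c in clusters))  # [3, 3]
```

With two well-separated groups, the algorithm converges to one centroid per group; the same two steps (assign, then re-average) are what K-Means performs at scale.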
Anomaly Detection Example¶
This example uses the Isolation Forest algorithm to detect anomalies in the Electrocardiograms (ECG) dataset.
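The intuition behind Isolation Forest can be sketched in miniature: random splits isolate an anomaly in fewer steps than a point inside a dense cluster, so anomalies have a shorter average path length across many random trees. The plain-Python 1-D sketch below is a conceptual illustration, not H2O's implementation.

```python
import random

# One isolation "tree" step: pick a random split, keep the side containing x.
def isolation_path(x, data, rng, depth=0, max_depth=20):
    if len(data) <= 1 or depth >= max_depth:
        return depth
    lo, hi = min(data), max(data)
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)
    side = [v for v in data if (v < split) == (x < split)]
    return isolation_path(x, side, rng, depth + 1, max_depth)

# Average the isolation depth over many random trees (a seeded forest).
def avg_path_length(x, data, n_trees=200, seed=42):
    rng = random.Random(seed)
    return sum(isolation_path(x, data, rng) for _ in range(n_trees)) / n_trees

data = [0.1 * i for i in range(20)] + [10.0]   # dense cluster plus one outlier
# The outlier at 10.0 is isolated far faster than an inlier near the cluster.
print(avg_path_length(10.0, data) < avg_path_length(0.5, data))  # True
```

Real Isolation Forests work the same way in higher dimensions, turning the averaged path length into a normalized anomaly score.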
In H2O, you can perform bulk training on segments, or partitions, of the training set. The train_segments() method in Python and the h2o.train_segments() function in R train an H2O model for each segment (subpopulation) of the training dataset.
Defining a Segmented Model¶
The train_segments() function accepts the following parameters:
x: A list of column names or indices indicating the predictor columns.
y: An index or a column name indicating the response column.
algorithm (R only): When building a segmented model in R, specify the name of the algorithm to use.
training_frame: The H2OFrame having the columns indicated by y (as well as any additional columns specified by fold_column, offset_column, and weights_column).
offset_column: The name or index of the column in the training_frame that holds the offsets.
fold_column: The name or index of the column in the training_frame that holds the per-row fold assignments.
weights_column: The name or index of the column in the training_frame that holds the per-row weights.
validation_frame: The H2OFrame with validation data to be scored on while training.
max_runtime_secs: Maximum allowed runtime in seconds for each model training. Use 0 to disable. Please note that regardless of how this parameter is set, a model will be built for each input segment. This parameter only affects individual model training.
segments (Python)/segment_columns (R): A list of columns to segment by. H2O will group the training (and validation) dataset by the segment-by columns and train a separate model for each segment (group of rows). As an alternative to providing a list of columns, users can also supply an explicit enumeration of the segments to build models for. This enumeration needs to be represented as an H2OFrame.
segment_models_id: Identifier for the returned collection of Segment Models. If not specified, it will be automatically generated.
parallelism: Level of parallelism of the bulk segment models building. This is the maximum number of models each H2O node will be building in parallel.
verbose: Enable to print additional information during model building. Defaults to False.
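The segment-by behavior described above can be sketched in plain Python: group the rows by the segment columns, then train one model per group. This is a conceptual illustration of the grouping, not the H2O API; the `rows`, `train_model`, and `train_segments` names here are hypothetical stand-ins.

```python
from collections import defaultdict

# Toy dataset with one segment column ("region") and a simple x -> y signal.
rows = [
    {"region": "east", "x": 1.0, "y": 2.0},
    {"region": "east", "x": 2.0, "y": 4.1},
    {"region": "west", "x": 1.0, "y": 10.0},
    {"region": "west", "x": 2.0, "y": 19.8},
]

def train_model(segment_rows):
    # Hypothetical stand-in for a real training call: fit y ~ slope * x
    # through the origin by least squares.
    num = sum(r["x"] * r["y"] for r in segment_rows)
    den = sum(r["x"] ** 2 for r in segment_rows)
    return {"slope": num / den}

def train_segments(rows, segment_columns):
    # Group rows by the segment-by columns, then train one model per group.
    segments = defaultdict(list)
    for r in rows:
        key = tuple(r[c] for c in segment_columns)
        segments[key].append(r)
    return {key: train_model(group) for key, group in segments.items()}

models = train_segments(rows, segment_columns=["region"])
print(sorted(models))  # [('east',), ('west',)]
```

Each segment gets its own fitted model keyed by its segment values, which is the shape of the result H2O's bulk segment training returns (a collection of per-segment models); parallelism simply runs several of these per-segment fits concurrently.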