# Generalized Additive Models (GAM)¶

**Note**: GAM models are currently experimental.

## Introduction¶

A generalized additive model (GAM) is a Generalized Linear Model (GLM) in which the linear predictor depends linearly on predictor variables and smooth functions of predictor variables. H2O’s GAM implementation follows closely the implementation as described in “Generalized Additive Models: An Introduction with R, Texts in Statistical Science [1]” by Simon N. Wood. Another good resource for GAM can be found in “Generalized Additive Models” by T.J. Hastie and R.J. Tibshirani [2].

## Defining a GAM Model¶

model_id: (Optional) Specify a custom name for the model to use as a reference. By default, H2O automatically generates a destination key.

training_frame: (Required) Specify the dataset used to build the model.

**NOTE**: In Flow, if you click the**Build a model**button from the`Parse`

cell, the training frame is entered automatically.validation_frame: (Optional) Specify the dataset used to evaluate the accuracy of the model.

nfolds: Specify the number of folds for cross-validation.

seed: Specify the random number generator (RNG) seed for algorithm components dependent on randomization. The seed is consistent for each H2O instance so that you can create models with the same starting conditions in alternative configurations.

y: (Required) Specify the column to use as the dependent variable.

For a regression model, this column must be numeric (

**Real**or**Int**).For a classification model, this column must be categorical (

**Enum**or**String**). If the family is**Binomial**, the dataset cannot contain more than two levels.

x: Specify a vector containing the names or indices of the predictor variables to use when building the model. If

`x`

is missing, then all columns except`y`

are used.keep_cross_validation_models: Specify whether to keep the cross-validated models. Keeping cross-validation models may consume significantly more memory in the H2O cluster. This option defaults to TRUE.

keep_cross_validation_predictions: Specify whether to keep the cross-validation predictions.

keep_cross_validation_fold_assignment: Enable this option to preserve the cross-validation fold assignment.

fold_assignment: (Applicable only if a value for

**nfolds**is specified and**fold_column**is not specified) Specify the cross-validation fold assignment scheme. The available options are AUTO (which is Random), Random, Modulo, or Stratified (which will stratify the folds based on the response variable for classification problems).fold_column: Specify the column that contains the cross-validation fold index assignment per observation.

ignored_columns: (Optional, Python and Flow only) Specify the column or columns to be excluded from the model. In Flow, click the checkbox next to a column name to add it to the list of columns excluded from the model. To add all columns, click the

**All**button. To remove a column from the list of ignored columns, click the X next to the column name. To remove all columns from the list of ignored columns, click the**None**button. To search for a specific column, type the column name in the**Search**field above the column list. To only show columns with a specific percentage of missing values, specify the percentage in the**Only show columns with more than 0% missing values**field. To change the selections for the hidden columns, use the**Select Visible**or**Deselect Visible**buttons.ignore_const_cols: Enable this option to ignore constant training columns, since no information can be gained from them. This option is enabled by default.

score_each_iteration: (Optional) Enable this option to score during each iteration of the model training.

offset_column: Specify a column to use as the offset; the value cannot be the same as the value for the

`weights_column`

.**Note**: Offsets are per-row “bias values” that are used during model training. For Gaussian distributions, they can be seen as simple corrections to the response (y) column. Instead of learning to predict the response (y-row), the model learns to predict the (row) offset of the response column. For other distributions, the offset corrections are applied in the linearized space before applying the inverse link function to get the actual response values.weights_column: Specify a column to use for the observation weights, which are used for bias correction. The specified

`weights_column`

must be included in the specified`training_frame`

.*Python only*: To use a weights column when passing an H2OFrame to`x`

instead of a list of column names, the specified`training_frame`

must contain the specified``weights_column``.**Note**: Weights are per-row observation weights and do not increase the size of the data frame. This is typically the number of times a row is repeated, but non-integer values are supported as well. During training, rows with higher weights matter more, due to the larger loss function pre-factor.family: Specify the model type.

If the family is

**gaussian**, the response must be numeric (**Real**or**Int**). (default)If the family is

**binomial**, the response must be categorical 2 levels/classes or binary (**Enum**or**Int**).If the family is

**fractionalbinomial**, the response must be a numeric between 0 and 1.If the family is

**multinomial**, the response can be categorical with more than two levels/classes (**Enum**).If the family is

**ordinal**, the response must be categorical with at least 3 levels.If the family is

**quasibinomial**, the response must be numeric.If the family is

**poisson**, the response must be numeric and non-negative (**Int**).If the family is

**negativebinomial**, the response must be numeric and non-negative (**Int**).If the family is

**gamma**, the response must be numeric and continuous and positive (**Real**or**Int**).If the family is

**tweedie**, the response must be numeric and continuous (**Real**) and non-negative.If the family is

**AUTO**,and the response is

**Enum**with cardinality = 2, then the family is automatically determined as**binomial**.and the response is

**Enum**with cardinality > 2, then the family is automatically determined as**multinomial**.and the response is numeric (

**Real**or**Int**), then the family is automatically determined as**gaussian**.

tweedie_variance_power: (Only applicable if

*Tweedie*is specified for**Family**) Specify the Tweedie variance power.tweedie_link_power: (Only applicable if

*Tweedie*is specified for**Family**) Specify the Tweedie link power.theta: Theta value (equal to 1/r) for use with the negative binomial family. This value must be > 0 and defaults to 1e-10.

solver: Specify the solver to use (AUTO, IRLSM, L_BFGS, COORDINATE_DESCENT_NAIVE, COORDINATE_DESCENT, GRADIENT_DESCENT_LH, or GRADIENT_DESCENT_SQERR). IRLSM is fast on problems with a small number of predictors and for lambda search with L1 penalty, while L_BFGS scales better for datasets with many columns. COORDINATE_DESCENT is IRLSM with the covariance updates version of cyclical coordinate descent in the innermost loop. COORDINATE_DESCENT_NAIVE is IRLSM with the naive updates version of cyclical coordinate descent in the innermost loop. GRADIENT_DESCENT_LH and GRADIENT_DESCENT_SQERR can only be used with the Ordinal family. AUTO well set the solver based on the given data and other parameters.

alpha: Specify the regularization distribution between L1 and L2.

lambda: Specify the regularization strength.

lambda_search: Specify whether to enable lambda search, starting with lambda max (the smallest \(\lambda\) that drives all coefficients to zero). If you also specify a value for

`lambda_min_ratio`

, then this value is interpreted as lambda min. If you do not specify a value for`lambda_min_ratio`

, then GAM will calculate the minimum lambda.early_stopping: Specify whether to stop early when there is no more relative improvement on the training or validation set.

stopping_rounds: Stops training when the option selected for

**stopping_metric**doesn’t improve for the specified number of training rounds, based on a simple moving average. To disable this feature, specify`0`

.**Note:**If cross-validation is enabled:All cross-validation models stop training when the validation metric doesn’t improve.

The main model runs for the mean number of epochs.

N+1 models may be off by the number specified for

**stopping_rounds**from the best model, but the cross-validation metric estimates the performance of the main model for the resulting number of epochs (which may be fewer than the specified number of epochs).

stopping_metric: Specify the metric to use for early stopping. The available options are:

`AUTO`

: This defaults to`logloss`

for classification,`deviance`

for regression, and`anomaly_score`

for Isolation Forest. Note that`custom`

and`custom_increasing`

can only be used in GBM and DRF with the Python Client. Must be one of:`AUTO`

,`anomaly_score`

. Defaults to`AUTO`

.`anomaly_score`

(Isolation Forest only)`deviance`

`logloss`

`MSE`

`RMSE`

`MAE`

`RMSLE`

`AUC`

(area under the ROC curve)`AUCPR`

(area under the Precision-Recall curve)`lift_top_group`

`misclassification`

`mean_per_class_error`

`custom`

(GBM/DRF Python client only)`custom_increasing`

(GBM/DRF Python client only)

stopping_tolerance: Specify the relative tolerance for the metric-based stopping to stop training if the improvement is less than this value. Defaults to

`0.001`

.nlambdas: (Applicable only if

**lambda_search**is enabled) Specify the number of lambdas to use in the search. When`alpha`

> 0, the default value for`lambda_min_ratio`

is \(1e^{-4}\), then the default value for`nlambdas`

is 100. This gives a ratio of 0.912. (For best results when using strong rules, keep the ratio close to this default.) When`alpha=0`

, the default value for`nlamdas`

is set to 30 because fewer lambdas are needed for ridge regression.standardize: Specify whether to standardize the numeric columns to have a mean of zero and unit variance. Standardization is highly recommended; if you do not use standardization, the results can include components that are dominated by variables that appear to have larger variances relative to other attributes as a matter of scale, rather than true contribution. This option is enabled by default.

missing_values_handling: Specify how to handle missing values (Skip, MeanImputation, or PlugValues).

plug_values: When

`missing_values_handling="PlugValues"`

, specify a single row frame containing values that will be used to impute missing values of the training/validation frame.compute_p_values: Request computation of p-values. Only applicable with no penalty (lambda = 0 and no beta constraints). Setting remove_collinear_columns is recommended. H2O will return an error if p-values are requested and there are collinear columns and remove_collinear_columns flag is not enabled. Note that this option is not available for

`family="multinomial"`

or`family="ordinal"`

.remove_collinear_columns: Specify whether to automatically remove collinear columns during model-building. When enabled, collinear columns will be dropped from the model and will have 0 coefficient in the returned model. This can only be set if there is no regularization (lambda=0).

intercept: Specify whether to include a constant term in the model. This option is enabled by default.

non_negative: Specify whether to force coefficients to have non-negative values.

max_iterations: Specify the number of training iterations.

objective_epsilon: If the objective value is less than this threshold, then the model is converged. If

`lambda_search=True`

, then this value defaults to .0001. If`lambda_search=False`

and lambda is equal to zero, then this value defaults to .000001. For any other value of lambda, the default value of objective_epsilon is set to .0001.beta_epsilon: Converge if beta changes less than this value (using L-infinity norm). This only applies to IRLSM solver.

gradient_epsilon: (For L-BFGS only) Specify a threshold for convergence. If the objective value (using the L-infinity norm) is less than this threshold, the model is converged. If

`lambda_search=True`

, then this value defaults to .0001. If`lambda_search=False`

and lambda is equal to zero, then this value defaults to .000001. For any other value of lambda, this value defaults to .0001.link: Specify a link function (Identity, Family_Default, Logit, Log, Inverse, Tweedie, or Ologit).

If the family is

**Gaussian**, then**Identity**,**Log**, and**Inverse**are supported.If the family is

**Binomial**, then**Logit**is supported.If the family is

**Fractionalbinomial**, then Logit is supported.If the family is

**Poisson**, then**Log**and**Identity**are supported.If the family is

**Gamma**, then**Inverse**,**Log**, and**Identity**are supported.If the family is

**Tweedie**, then only**Tweedie**is supported.If the family is

**Multinomial**, then only**Family_Default**is supported. (This defaults to`multinomial`

.)If the family is

**Quasibinomial**, then only**Logit**is supported.If the family is

**Ordinal**, then only**Ologit**is supportedIf the family is

**Negative Binomial**, then only**Log**and**Identity**are supported.If the family is

**AUTO**,and a link is not specified, then the link is determined as

**Family_Default**(defaults to the family to which AUTO is determined).and a link is specified, the link is used so long as the specified link is compatible with the family to which AUTO is determined. Otherwise, an error message is thrown stating that AUTO for underlying data requires a different link and gives a list of possible compatible links.

The list of supported links for

`family = AUTO`

is:If the response is

**Enum**with cardinality = 2, then**Logit**is supported.If the response is

**Enum**with cardinality > 2, then only**Family_Default**is supported (this defaults to**multinomial**).If the response is numeric (

**Real**or**Int**), then**Identity**,**Log**, and**Inverse**are suported.

prior: Specify prior probability for p(y==1). Use this parameter for logistic regression if the data has been sampled and the mean of response does not reflect reality. This value defaults to -1 and must be a value in the range (0,1).

**Note**: This is a simple method affecting only the intercept. You may want to use weights and offset for a better fit.lambda_min_ratio: Specify the minimum lambda to use for lambda search. This is specified as a ratio of

**lambda_max**, which is the smallest \(\lambda\) for which the solution is all zeros.beta_constraints: Specify a dataset to use beta constraints. The selected frame is used to constrain the coefficient vector to provide upper and lower bounds. The dataset must contain a names column with valid coefficient names.

max_active_predictors: Specify the maximum number of active predictors during computation. This value is used as a stopping criterium to prevent expensive model building with many predictors.

interactions: Specify a list of predictor column indices to interact. All pairwise combinations will be computed for this list.

interaction_pairs: When defining interactions, use this option to specify a list of pairwise column interactions (interactions between two variables). Note that this is different than

`interactions`

, which will compute all pairwise combinations of specified columns.**obj_reg**: Specifies the likelihood divider in objective value computation. This defaults to 1/nobs.export_checkpoints_dir: Specify a directory to which generated models will automatically be exported.

**subspaces**: A list of model parameters that, when in the same subspace list, can vary freely amongst each other. This allows the user to group model parameters that can vary with restrictions. If specified, the following parameters must have the same array dimension:`gam_columns`

`scale`

`num_knots`

`bs`

These parameters can be specified as follows:

hyper_parameters = { 'alpha':[0.9, 0.1], ‘subspaces’: [{ ‘scale’:[[1,1,1],[0.1, 0.1, 0.1]], ‘num_knots’:[[4,4,4],[5,5,5]], ‘gam_columns’:[[“C1”,“C2",“C3”]] }, { ‘scale’:[[1,1],[0.1, 0.1,]], ‘num_knots’:[[4,4],[5,5]], ‘gam_columns’:[[“C2",“C3”]] }] }

With this hyperparameter specification, there will be 16 GAM models built.

**num_knots**: An array that specifies the number of knots for each predictor specified in`gam_columns`

.**knot_ids**: A string array storing frame keys/IDs that contain knot locations. Specify one value for each GAM column specified in`gam_columns`

.**gam_columns**: Required. An array of column names representing the smoothing terms used for prediction. GAM will build a smoother for each specified column.**bs**: An array specifying the B-Splines for each GAM predictor. You must include one value for each GAM predictor. This defaults to 0, which specifies cubic regression spline.**scale**: An array specifying the smoothing parameter for GAM.**keep_gam_cols**: Specify whether to save keys storing GAM columns.

## A Simple Linear Model¶

Consider \(n\) observations, \(x_i\) with response variable \(y_i\), where \(y_i\) is an observation on random variable \(Y_i\). Let \(u_i ≡ E(Y_i)\). Assuming a linear relationship between the predictor variables and the response, the following relationship exists between \(xi\) and \(Y_i\) as:

\(Y_i = u_i + \epsilon_i \text{ where } u_i = \beta_i x_i + \beta_0\)

and \(\beta_i, \beta_0\) are unknown parameters, \(\epsilon_i\) are i.i.d zero mean variables with variances \(\delta^2\). We already know how to estimate \(\beta_i, \beta_0\) using GLM.

## A Simple Linear GAM Model¶

Using the same observations as in the previous A Simple Linear Model section, a linear GAM model can be:

\(Y_i = f(x_i) + \epsilon_i \text{ where } f(x_i) = {\Sigma_{j=1}^k}b_j(x_i)\beta_j+\beta_0\)

Again, \(\beta = [\beta_0, \beta_1, \ldots, b_k]\) is an unknown parameter vector that can also be estimated using GLM. This can be done by using \([b_1(x_i), b_2(x_i), \ldots , b_K(x_i)]\) as the predictor variables instead of \(x_i\). Here, we are basically estimating \(f(x_i)\) using a set of basis functions:

\(\{b_1(x_i), b_2(x_i), \ldots, b_K(x_i)\}\)

where \(k\) is the number of basis functions used. Note that for each predictor variable, we get to decide the types and number of basis functions that we would like to use to best generate a GAM.

## A Simple Piecewise Linear Basis Function¶

To understand the role of basis functions, we are going to use a linear tent function.

Using piecewise basis functions, we need to pay attention to the locations of the function’s derivative discontinuities, that is by the locations at which the linear pieces join up. These locations are referred to as the knots and denoted by \(\{x_i^*:j=1, \ldots, K\}\). And suppose that the knots are sorted, meaning that \(x_i^* > x_{i-1}^*\). Then for \(j=2, \ldots, K - 1\), we have basis function \(b_j(x)\) defined as:

## Using Piecewise Tent Function to Approximate One Predictor Variable¶

To illustrate how we can use the piecewise tent functions to approximate a predictor variable, let’s use the following example for a predictor:

Predictor value goes from 0.0 to 1.0

Set \(K=10\) to use 10 piecewise tent functions

The knots are located at 0, 1/9, 2/9, 3/9, …, 8/9, 1.

The basis function values are plotted in in the figure below. Note that there are 10 basis functions. The basis function values overlap with its neighbors from the left and the right except for the first and the last basis functions.

For simplicity, let’s assume that we only have 21 predictor values uniformly spreading over the range from 0 to 1 with values 0, 0.05, 0.1, 0.15, …, 1.0. The next task is to translate each \(x_j\) to a set of 10 basis function values. This means that for every value of \(x_j\), we will obtain 10 values, each one correspond to each of the basis function.

For the predictor value at 0, the only basis function that matters is the first one. All the other basis function contributes 0 to the predictor value. Hence, for \(x_j = 0\), the vector corresponding to all basis functions will have the following values: {1,0,0,0,0,0,0,0,0,0} because the first basis function value is 1 at \(x_j = 0\). **Note** Substitute \(x=0\) to the first basis function

\(b_1(x) = \frac{\big(\frac{1}{9} - x \big)}{\big(\frac{2}{9} - \frac{1}{9} \big)}\)

For predictor value 0.05, only the first and second basis functions contribute to its value, while the other basis functions are 0 at 0.05. The value of the first basis function is 0.55. **Note** Substitute \(x=0.05\) to the first basis function

\(b_1(x) = \frac{\big(\frac{1}{9} - x \big)}{\big(\frac{2}{9} - \frac{1}{9} \big)}\)

The value of the second basis function at 0.05 is 0.45. **Note** Substitute \(x=0.05\) to the second basis function

\(b_2(x) = \frac{x}{\big(\frac{1}{9}\big)}\)

Hence, for \(x_j = 0.05\), the vector corresponding to all basis function is {0.55,0.45,0,0,0,0,0,0,0,0}.

We have calculated the expanded basis function vector for all predictor values, and they can be found in following table.

\(x_j\) |
\(b_1\) |
\(b_2\) |
\(b_3\) |
\(b_4\) |
\(b_5\) |
\(b_6\) |
\(b_7\) |
\(b_8\) |
\(b_9\) |
\(b_{10}\) |
---|---|---|---|---|---|---|---|---|---|---|

0 |
1 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |

0.05 |
0.55 |
0.45 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |

0.1 |
0.1 |
0.9 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |

0.15 |
0 |
0.65 |
0.35 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |

0.2 |
0 |
0.2 |
0.8 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |

0.25 |
0 |
0 |
0.75 |
0.25 |
0 |
0 |
0 |
0 |
0 |
0 |

0.3 |
0 |
0 |
0.3 |
0.7 |
0 |
0 |
0 |
0 |
0 |
0 |

0.35 |
0 |
0 |
0 |
0.85 |
0.15 |
0 |
0 |
0 |
0 |
0 |

0.4 |
0 |
0 |
0 |
0.4 |
0.6 |
0 |
0 |
0 |
0 |
0 |

0.45 |
0 |
0 |
0 |
0 |
0.95 |
0.05 |
0 |
0 |
0 |
0 |

0.5 |
0 |
0 |
0 |
0 |
0.5 |
0.5 |
0 |
0 |
0 |
0 |

0.55 |
0 |
0 |
0 |
0 |
0.05 |
0.95 |
0 |
0 |
0 |
0 |

0.6 |
0 |
0 |
0 |
0 |
0 |
0.6 |
0.4 |
0 |
0 |
0 |

0.65 |
0 |
0 |
0 |
0 |
0 |
0.15 |
0.85 |
0 |
0 |
0 |

0.7 |
0 |
0 |
0 |
0 |
0 |
0 |
0.7 |
0.3 |
0 |
0 |

0.75 |
0 |
0 |
0 |
0 |
0 |
0 |
0.25 |
0.75 |
0 |
0 |

0.8 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0.8 |
0.2 |
0 |

0.85 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0.35 |
0.65 |
0 |

0.9 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0.9 |
0.1 |

0.95 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0.45 |
0.55 |

1 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
1 |

## Spline Functions¶

It has been proven in [2] that the natural cubic splines are the smoothest interpolators. For a set of points \({x_i, y_i:i = 1, \ldots, n}\) where \(x_i \leq x_{i+1}\). The natural cubic spline, \(g(x)\), interpolating these points, is a function made up of sections of cubic polynomial, one for each \([x_i, x_{i+1}]\). They are joined up together so that the whole spline is continuous to second derivative, while \(g(x_i) = y_i\) and \(g^{''}(x_i) = g^{''}(x_n) = 0\). To ensure smooth function, we can add a penalty function \(J(f) = \int_{x_1}^{x_n} {(f^{''}(x))^2}dx\) to the actual objective function that we are trying to optimize. The rationality behind this penalty is that the second derivative of a function measures the gradient change. For functions that wriggle a lot, it will have a higher second derivative magnitude. For a straight line that does not wriggle at all, the second derivative is zero.

### Cubic Regression Splines¶

Following the implementation in [1], we have implemented the cubic regression splines for a single predictor variable. This approach defines the splines in terms of its values at the knots. Next, we define a cubic spline function, \(f(x)\), \(k\) knots, \(x_1, x_2, \ldots, x_k\). Let \(\beta_j = f(x_j)\) and \(\delta_j = f^{''}(x_j) = \frac{d^2f(x_j)}{d^2x}\).

The splines can be written as:

where:

\(a_j^-(x) = (x_{j+1} - x)/h_j, a_j^+(x) = (x - x_j) / h_j\)

\(c_j^-(x) = \big[\frac{(x_{j+1}-x)^3}{h_j} - h_j(x_{j+1} - x)\big] /6, c_j^+(x) = \big[\frac{(x-x_j)^3}{h_j} - h_j(x-x_j \big] / 6\)

Note that in order to ensure smooth fitting functions at the knots, the spline must be continuous to second derivative at the \(x_j\) and should have zero second derivative at \(x_1\) and \(x_k\). It can be shown that \(\beta\delta^- = DB\) (to be added at a later date), where

Let \(BinvD = B^{-1}D\) and let \(F = {\begin{bmatrix}0\\BinvD\\0\end{bmatrix}}\)

The spline can be rewritten entire in terms of \(\beta\) as

\(f(x) = a_j^-(x)\beta_j + a_j^+(x)\beta_{j+1} + c_j^-(x)F_j\beta + c_j^+(x)F_{j+1}\beta \text{ for } x_j \leq x \leq x_{j+1}\)

which can be written as \(f(x_i) = \sum_{j=1}^{k}b_j(x_i)\beta_j+\beta_0\) where \(b_j(x_i)\) are the basis functions and \(\beta_0, \beta_1, \ldots, \beta_k\) are the unknown parameters that can be estimated using GLM. In addition, the penalty term added to the final objective function can be derived to be:

where \(S = D^T B^{-1} D\)

For linear regression models, the final objective function to minimize is

Note that \(\lambda\) will be another parameter for the user to choose using gridsearch. In a future release, we may use cross-validation to automatically choose \(lambda\).

Hence, at this point, we can call our GLM. However, we still need to add the contribution of the penalty term to the gradient and hessian calculation.

## General GAM¶

In a general GAM, using the GLM jargon, the link function can be constructed using a mixture of predictor variables and smooth functions of predictor variables as follows:

This is the GAM we implemented in H2O. However, with multiple predictor variables in any form, we need to resolve the identifiability problems by adding identifiability constraints.

### Identifiability Constraints¶

Consider GAM with multiple predictor smooth functions like the following:

The model now contains more than one function introduces an identifiability problem: \(f_1\) an \(f_2\) are each only estimable to within an additive constant. This is due to the fact that \(f_1(x_i) + f_2(v_i) = (f_1(x_i) + C) + (f_2(v_i) - C)\). Hence, identifiability constraints have to be imposed on the model before fitting to avoid the identifiability problem. The following sum-to-zero constraints are implemented in H2O:

where 1 is a column vector of 1, and \(f_p\) is the column vector containing \(f_p(x_1), \ldots ,f_p(x_n)\). To apply the sum-to-zero constraints, a Householder transform is used. Refer to [1] for details. This transform is applied to each basis function of any predictor column we choose on its own.

### Sum-to-zero Constraints Implementation¶

Let \(X\) be the model matrix that contain the basis functions of one predictor variable, the sum-to-zero constraints required that

where \(\beta\) contains the coefficients relating to the basis functions of that particular predictor column. The idea is to create a \(k \text{ by } (k-1)\) matrix \(Z\) such that \(\beta = Z\beta_z\), then \(1^TX\beta =0\) for any \(\beta_z\). To see how this works, let’s go through the following derivations:

With \(Z\), we are looking at \(0 = 1^TX\beta = 1^TXZ\beta_z\)

Let \(C=1^TX\), then the QR decomposition of \(C^T = U {\begin{bmatrix}P\\0\end{bmatrix}}\) where \(C^T\) is of size \(k \times 1\), \(U\) is of size \(k \times k\), \(P\) is the size of \(1\times1\)

Substitute everything back to \(1^TXZ\beta_z = [P^T \text{ } 0]{\begin{bmatrix}D^T\\Z^T\end{bmatrix}} Z\beta_z = [P^T \text{ } 0]{\begin{bmatrix}D^TZ\beta_z\\Z^TZ\beta_z\end{bmatrix}} = P^TD^TZ\beta_z + 0Z^TZ\beta_z=0\) since \(D^TZ=0\)

### Generating the Z Matrix¶

One Householder reflection is used to generate the \(Z\) matrix. To create the \(Z\) matrix, we need to calculate the QR decomposition of \(C^T = X^T1\) Since \(C^T\) is of size \(k \times 1\), the application of one householder reflection will generate \(HC^T = {\begin{bmatrix}R\\0\end{bmatrix}}\) where \(R\) is of size \(1 \times 1\). This implies that \(H = Q^T = Q\), since the householder reflection matrix is symmetrical. Hence, computing \(XZ\) is equivalent to computing \(XH\) and dropping the first column.

### Generating the Householder reflection matrix H¶

Let \(\bar{x} = X^T1\) and \(\bar{x}' = {\begin{bmatrix}{\parallel{\bar{x}}\parallel}\\0\end{bmatrix}}\), then \(H = (I - \frac{2uu^T}{(u^Tu)})\) and \(u = \bar{x} = \bar{x}'\).

### Estimation of GAM Coefficients with Identifiability Constraints¶

The following procedure is used to estimate the GAM coefficients:

Generating \(Z\) matrix for each predictor column that uses smoothe functions

Generate new model matrix for each predictor column smooth function as \(X_z = XZ\), new penalty function \({\beta{^T_z}}Z^TSZ\beta_z\).

Call GLM using model matrix \(X_z\), penalty function \({\beta{^T_z}}Z^TSZ\beta_z\) to get coefficient estimates of \(\beta_z\)

Convert \(\beta_z\) to \(\beta\) using \(\beta = Z\beta_z\) and performing scoring with \(\beta\) and the original model matrix \(X\).

## Examples¶

Below are simple examples showing how to use GAM in R and Python.

```
library(h2o)
h2o.init()
# create frame knots
knots1 <- c(-1.99905699, -0.98143075, 0.02599159, 1.00770987, 1.99942290)
frame_Knots1 <- as.h2o(knots1)
knots2 <- c(-1.999821861, -1.005257990, -0.006716042, 1.002197392, 1.999073589)
frame_Knots2 <- as.h2o(knots2)
knots3 <- c(-1.999675688, -0.979893796, 0.007573327, 1.011437347, 1.999611676)
frame_Knots3 <- as.h2o(knots3)
# import the dataset
h2o_data <- h2o.importFile("https://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/multinomial_10_classes_10_cols_10000_Rows_train.csv")
# Convert the C1, C2, and C11 columns to factors
h2o_data["C1"] <- as.factor(h2o_data["C1"])
h2o_data["C2"] <- as.factor(h2o_data["C2"])
h2o_data["C11"] <- as.factor(h2o_data["C11"])
# split into train and test sets
splits <- h2o.splitFrame(data = h2o_data, ratios = 0.8)
train <- splits[[1]]
test <- splits[[2]]
# Set the predictor and response columns
predictors <- colnames(train[1:2])
response <- 'C11'
# specify the knots array
numKnots <- c(5, 5, 5)
# build the GAM model
gam_model <- h2o.gam(x = predictors,
y = response,
training_frame = train,
family = 'multinomial',
gam_columns = c("C6", "C7", "C8"),
scale = c(1, 1, 1),
num_knots = numKnots,
knot_ids = c(h2o.keyof(frame_Knots1), h2o.keyof(frame_Knots2), h2o.keyof(frame_Knots3)))
# get the model coefficients
coefficients <- h2o.coef(gam_model)
# generate predictions using the test data
pred <- h2o.predict(object = gam_model, newdata = test)
```

```
import h2o
from h2o.estimators.gam import H2OGeneralizedAdditiveEstimator
h2o.init()
# create frame knots
knots1 = [-1.99905699, -0.98143075, 0.02599159, 1.00770987, 1.99942290]
frameKnots1 = h2o.H2OFrame(python_obj=knots1)
knots2 = [-1.999821861, -1.005257990, -0.006716042, 1.002197392, 1.999073589]
frameKnots2 = h2o.H2OFrame(python_obj=knots2)
knots3 = [-1.999675688, -0.979893796, 0.007573327,1.011437347, 1.999611676]
frameKnots3 = h2o.H2OFrame(python_obj=knots3)
# import the dataset
h2o_data = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/multinomial_10_classes_10_cols_10000_Rows_train.csv")
# convert the C1, C2, and C11 columns to factors
h2o_data["C1"] = h2o_data["C1"].asfactor()
h2o_data["C2"] = h2o_data["C2"].asfactor()
h2o_data["C11"] = h2o_data["C11"].asfactor()
# split into train and validation sets
train, test = h2o_data.split_frame(ratios = [.8])
# set the predictor and response columns
y = "C11"
x = ["C1","C2"]
# specify the knots array
numKnots = [5,5,5]
# build the GAM model
h2o_model = H2OGeneralizedAdditiveEstimator(family='multinomial',
gam_columns=["C6","C7","C8"],
scale=[1,1,1],
num_knots=numKnots,
knot_ids=[frameKnots1.key, frameKnots2.key, frameKnots3.key])
h2o_model.train(x=x, y=y, training_frame=train)
# get the model coefficients
h2oCoeffs = h2o_model.coef()
# generate predictions using the test data
pred = h2o_model.predict(test)
```

## References¶

Simon N. Wood, Generalized Additive Models: An Introduction with R, Texts in Statistical Science, CRC Press, Second Edition.

T.J. Hastie, R.J. Tibshirani, Generalized Additive Models, Chapman and Hall, First Edition, 1990.