Available in: GBM, DRF, Deep Learning, GLM, GAM, Naïve-Bayes, K-Means, XGBoost
This option specifies the scheme to use for cross-validation fold assignment. It is only applicable if a value for nfolds is specified and a fold_column is not specified. Options include:
Auto: Allow the algorithm to automatically choose an option. Auto currently uses Random.
Random: Randomly splits the data into nfolds folds.
Modulo: Assigns each row to a fold by applying a modulo operation to the row index.
Stratified: Stratifies the folds based on the response variable for classification problems.
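The difference between Random and Modulo can be sketched in plain Python. This is an illustration only, not H2O's implementation: it assumes Modulo places row i in fold i mod nfolds and that Random draws each row's fold independently and uniformly. The helper names modulo_folds and random_folds are hypothetical.

```python
import random
from collections import Counter

def modulo_folds(n_rows, nfolds):
    """Row i goes to fold i % nfolds (deterministic, no seed involved)."""
    return [i % nfolds for i in range(n_rows)]

def random_folds(n_rows, nfolds, seed):
    """Each row's fold is drawn independently and uniformly at random."""
    rng = random.Random(seed)
    return [rng.randrange(nfolds) for _ in range(n_rows)]

modulo = modulo_folds(10, 3)
rand = random_folds(10, 3, seed=42)

print(modulo)           # [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]
print(Counter(modulo))  # fold sizes differ by at most one row
print(Counter(rand))    # fold sizes depend on the random draw
```

With only 10 rows, the random draw can easily leave one fold much smaller than the others, while the modulo assignment cannot.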
Keep the following in mind when specifying a fold assignment for your data:
Random is best for large datasets, but can lead to imbalanced samples for small datasets.
Modulo is a simple deterministic way to evenly split the dataset into the folds and does not depend on the seed.
Specifying Stratified will attempt to evenly distribute observations from the different classes to all sets when splitting a dataset into training and validation. This can be useful if there are many classes and the dataset is relatively small.
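As a rough sketch of what stratification aims for, the following plain-Python example deals the rows of each class across the folds in turn, so that every fold receives a similar share of each class. It is an assumption-laden illustration rather than H2O's algorithm, and stratified_folds is a hypothetical helper.

```python
import random
from collections import Counter, defaultdict

def stratified_folds(labels, nfolds, seed):
    """Shuffle the rows of each class, then deal them round-robin to folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in enumerate(labels):
        by_class[label].append(row)

    folds = [None] * len(labels)
    for label in sorted(by_class):
        rows = by_class[label]
        rng.shuffle(rows)
        for position, row in enumerate(rows):
            folds[row] = position % nfolds
    return folds

# Small, imbalanced toy response: 9 rows of class "a", 3 rows of class "b".
labels = ["a"] * 9 + ["b"] * 3
folds = stratified_folds(labels, nfolds=3, seed=7)

# Count how many rows of each class landed in each fold.
per_fold = Counter(zip(folds, labels))
for fold in range(3):
    print(fold, per_fold[(fold, "a")], per_fold[(fold, "b")])
```

In this toy case every fold ends up with three rows of class "a" and one row of class "b", whichever seed is used; only the identity of the rows changes.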
Note that all three options are only suitable for datasets that are i.i.d. (independent and identically distributed). If the dataset requires custom grouping to perform meaningful cross-validation, then a fold_column should be created and provided instead.
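One way to build such a column is to map every distinct group to a single fold, so that all rows of a group stay together. The sketch below uses hypothetical patient identifiers as the grouping key and a hypothetical helper, group_fold_column; the resulting values would then be stored as a column of the training data and its name supplied as fold_column.

```python
def group_fold_column(group_ids, nfolds):
    """Assign each distinct group to one fold, in order of first appearance."""
    fold_of_group = {}
    for group in group_ids:
        if group not in fold_of_group:
            fold_of_group[group] = len(fold_of_group) % nfolds
    return [fold_of_group[group] for group in group_ids]

# Hypothetical data: several rows per patient, which must not be split
# between the training and holdout parts of any fold.
patients = ["p1", "p1", "p2", "p3", "p2", "p4", "p3", "p1"]
fold_column = group_fold_column(patients, nfolds=2)

print(fold_column)  # [0, 0, 1, 0, 1, 1, 0, 0]
```

Assigning groups in order of first appearance keeps the example deterministic; a real dataset might instead balance the folds by group size.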
In general, when comparing multiple models using validation sets, ensure that you use the same validation set for all models. When performing cross-validation, specify the same seed for all models, or specify Modulo for the fold_assignment. This ensures that the cross-validation folds are the same, and eliminates the noise that can come from, for example, the Random fold assignment.
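The effect of the seed can be illustrated with a small plain-Python stand-in for a random fold assignment. This is a sketch under the assumption that folds are drawn uniformly from a seeded generator; it does not call H2O, and the model names are hypothetical.

```python
import random

def random_folds(n_rows, nfolds, seed):
    """Seeded stand-in for a random fold assignment."""
    rng = random.Random(seed)
    return [rng.randrange(nfolds) for _ in range(n_rows)]

# Hypothetical setup: three models cross-validated on the same 1,000 rows.
folds_model_a = random_folds(1000, 5, seed=1234)
folds_model_b = random_folds(1000, 5, seed=1234)  # same seed as model A
folds_model_c = random_folds(1000, 5, seed=99)    # different seed

# Models A and B are evaluated on identical folds; model C is not.
same_seed_agree = folds_model_a == folds_model_b
rows_moved = sum(a != c for a, c in zip(folds_model_a, folds_model_c))
print(same_seed_agree, rows_moved)
```

Any difference in cross-validated metrics between models A and B can then be attributed to the models themselves, whereas a comparison with model C also reflects the different split.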