Available in: GBM, DRF, Deep Learning, GLM, GAM, PCA, GLRM, Naïve-Bayes, K-Means, Stacked Ensembles, AutoML, XGBoost
Datasets are commonly split into training, testing, and validation sets. When splitting a dataset, the bulk of the data goes into the training dataset, with small portions held out for the testing and validation dataframes.
training_frame is used to build the model, the
validation_frame is used to compare against the adjusted model and evaluate the model’s accuracy. Typically, the model will include sampled data which will then be compared against the validation frame’s unsampled data. The recommended process is to train on the training set and stop early based on the validation set (and/or cross-validation). When you find a good model, you score it once on the test set to estimate the generalization error.