class: title-slide, center, bottom

# Build a fine-tuned model

## Tidymodels, virtually — Session 05

### Alison Hill

---

.pull-left[
### Single decision tree

```r
tree_mod <- 
  decision_tree(engine = "rpart") %>% 
  set_mode("classification")

tree_wf <-
  workflow() %>% 
  add_formula(Class ~ .) %>% 
  add_model(tree_mod)

set.seed(100)
tree_res <- tree_wf %>% 
  fit_resamples(resamples = alz_folds, 
                control = control_resamples(save_pred = TRUE))

tree_res %>% 
  collect_metrics()
# # A tibble: 2 × 6
#   .metric  .estimator  mean     n std_err .config
#   <chr>    <chr>      <dbl> <int>   <dbl> <chr>
# 1 accuracy binary     0.756    10  0.0245 Preprocessor1_Mod…
# 2 roc_auc  binary     0.770    10  0.0255 Preprocessor1_Mod…
```
]

--

.pull-right[
### A random forest of trees

```r
rf_mod <-
  rand_forest(engine = "ranger") %>% 
  set_mode("classification")

rf_wf <- tree_wf %>% 
  update_model(rf_mod)

set.seed(100)
rf_res <- rf_wf %>% 
  fit_resamples(resamples = alz_folds, 
                control = control_resamples(save_pred = TRUE))

rf_res %>% 
  collect_metrics()
# # A tibble: 2 × 6
#   .metric  .estimator  mean     n std_err .config
#   <chr>    <chr>      <dbl> <int>   <dbl> <chr>
# 1 accuracy binary     0.837    10  0.0172 Preprocessor1_Mod…
# 2 roc_auc  binary     0.886    10  0.0171 Preprocessor1_Mod…
```
]

---

.pull-left[
### Single decision tree

<img src="figs/05/unnamed-chunk-3-1.png" width="504" style="display: block; margin: auto;" />
]

--

.pull-right[
### A random forest of trees

<img src="figs/05/unnamed-chunk-4-1.png" width="504" style="display: block; margin: auto;" />
]

---
class: your-turn

# Your turn 1

Challenge: Fit 3 more random forest models, using 3, 8, and 30 variables at each split, respectively. Update your `rf_wf` with each new model. Which value maximizes the area under the ROC curve?
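---

# One way to avoid copy-paste

The challenge above repeats the same steps for each value of `mtry`. The sketch below wraps those steps in a small helper and loops over the three values. It is only one possible approach, not the solution used in this session: `fit_mtry()`, `mtry_metrics`, and `mtry_auc` are hypothetical names, and the set-up assumes `ad_data` is available from the modeldata package and that ranger is installed.

```r
library(tidymodels)

# set-up, mirroring the splitting and resampling code used in this session
data("ad_data", package = "modeldata")
set.seed(100)
alz_split <- initial_split(ad_data, strata = Class, prop = .9)
alz_train <- training(alz_split)
set.seed(100)
alz_folds <- vfold_cv(alz_train, v = 10, strata = Class)

rf_mod <- rand_forest(engine = "ranger") %>%
  set_mode("classification")

rf_wf <- workflow() %>%
  add_formula(Class ~ .) %>%
  add_model(rf_mod)

# hypothetical helper: resample one random forest for a given mtry
fit_mtry <- function(m) {
  set.seed(100)
  rf_wf %>%
    update_model(set_args(rf_mod, mtry = !!m)) %>%
    fit_resamples(resamples = alz_folds) %>%
    collect_metrics() %>%
    mutate(mtry = m)
}

# one set of metrics per mtry value, stacked into a single tibble
mtry_metrics <- bind_rows(lapply(c(3, 8, 30), fit_mtry))

# rank the three candidates by cross-validated ROC AUC
mtry_auc <- mtry_metrics %>%
  filter(.metric == "roc_auc") %>%
  arrange(desc(mean))

mtry_auc
```

The first row of `mtry_auc` is the value that maximizes ROC AUC across the folds. Repeating this by hand for many values is what `tune_grid()` automates.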
---

```r
rf3_mod <- rf_mod %>% 
* set_args(mtry = 3) 

rf8_mod <- rf_mod %>% 
* set_args(mtry = 8) 

rf30_mod <- rf_mod %>% 
* set_args(mtry = 30) 
```

---

```r
rf3_wf <- rf_wf %>% 
  update_model(rf3_mod)

set.seed(100)
rf3_wf %>% 
  fit_resamples(resamples = alz_folds) %>% 
  collect_metrics()
# # A tibble: 2 × 6
#   .metric  .estimator  mean     n std_err .config
#   <chr>    <chr>      <dbl> <int>   <dbl> <chr>
# 1 accuracy binary     0.786    10  0.0145 Preprocessor1_Mod…
# 2 roc_auc  binary     0.862    10  0.0167 Preprocessor1_Mod…
```

---

```r
rf8_wf <- rf_wf %>% 
  update_model(rf8_mod)

set.seed(100)
rf8_wf %>% 
  fit_resamples(resamples = alz_folds) %>% 
  collect_metrics()
# # A tibble: 2 × 6
#   .metric  .estimator  mean     n std_err .config
#   <chr>    <chr>      <dbl> <int>   <dbl> <chr>
# 1 accuracy binary     0.817    10  0.0137 Preprocessor1_Mod…
# 2 roc_auc  binary     0.886    10  0.0152 Preprocessor1_Mod…
```

---

```r
rf30_wf <- rf_wf %>% 
  update_model(rf30_mod)

set.seed(100)
rf30_wf %>% 
  fit_resamples(resamples = alz_folds) %>% 
  collect_metrics()
# # A tibble: 2 × 6
#   .metric  .estimator  mean     n std_err .config
#   <chr>    <chr>      <dbl> <int>   <dbl> <chr>
# 1 accuracy binary     0.846    10  0.0140 Preprocessor1_Mod…
# 2 roc_auc  binary     0.897    10  0.0133 Preprocessor1_Mod…
```

---
class: middle, center, frame

# tune

Functions for fitting and tuning models

<https://tune.tidymodels.org>

<iframe src="https://tune.tidymodels.org" width="100%" height="400px"></iframe>

---
class: middle, center

# `tune()`

A placeholder for hyper-parameters to be "tuned"

```r
nearest_neighbor(neighbors = tune())
```

---

.center[
# `tune_grid()`

A version of `fit_resamples()` that performs a grid search for the best combination of tuned hyper-parameters.
]

.pull-left[
```r
tune_grid(
  object,
  resamples,
  ...,
  grid = 10,
  metrics = NULL,
  control = control_grid()
)
```
]

---

.center[
# `tune_grid()`

A version of `fit_resamples()` that performs a grid search for the best combination of tuned hyper-parameters.
]

.pull-left[
```r
tune_grid(
* object,
  resamples,
  ...,
  grid = 10,
  metrics = NULL,
  control = control_grid()
)
```
]

--

.pull-right[
One of:

+ A parsnip `model` object

+ A `workflow`
]

---

.center[
# `tune_grid()`

A version of `fit_resamples()` that performs a grid search for the best combination of tuned hyper-parameters.
]

.pull-left[
```r
tune_grid(
* object,
* preprocessor,
  resamples,
  ...,
  grid = 10,
  metrics = NULL,
  control = control_grid()
)
```
]

.pull-right[
A `model` + `recipe`
]

---

.center[
# `tune_grid()`

A version of `fit_resamples()` that performs a grid search for the best combination of tuned hyper-parameters.
]

.pull-left[
```r
tune_grid(
  object,
  resamples,
  ...,
* grid = 10,
  metrics = NULL,
  control = control_grid()
)
```
]

.pull-right[
One of:

+ A positive integer.

+ A data frame of tuning combinations.
]

---

.center[
# `tune_grid()`

A version of `fit_resamples()` that performs a grid search for the best combination of tuned hyper-parameters.
]

.pull-left[
```r
tune_grid(
  object,
  resamples,
  ...,
* grid = 10,
  metrics = NULL,
  control = control_grid()
)
```
]

.pull-right[
Number of candidate parameter sets to be created automatically; `10` is the default.
]

---

```r
data("ad_data")
alz <- ad_data

# data splitting
set.seed(100) # Important!
alz_split <- initial_split(alz, strata = Class, prop = .9)
alz_train <- training(alz_split)
alz_test  <- testing(alz_split)

# data resampling
set.seed(100)
alz_folds <- 
    vfold_cv(alz_train, v = 10, strata = Class)
```

---
class: your-turn

# Your Turn 2

Here's our random forest model plus workflow to work with.

```r
rf_mod <- 
  rand_forest(engine = "ranger") %>% 
  set_mode("classification")

rf_wf <-
  workflow() %>% 
  add_formula(Class ~ .) %>% 
  add_model(rf_mod)
```

---
class: your-turn

# Your Turn 2

Here is the output from `fit_resamples()`...

```r
set.seed(100) # Important!
rf_results <-
  rf_wf %>% 
  fit_resamples(resamples = alz_folds)

rf_results %>% 
  collect_metrics()
# # A tibble: 2 × 6
#   .metric  .estimator  mean     n std_err .config
#   <chr>    <chr>      <dbl> <int>   <dbl> <chr>
# 1 accuracy binary     0.826    10  0.0181 Preprocessor1_Mod…
# 2 roc_auc  binary     0.882    10  0.0248 Preprocessor1_Mod…
```

---
class: your-turn

# Your Turn 2

Edit the random forest model to tune the `mtry` and `min_n` hyper-parameters.

Update your workflow to use the tuned model.

Then use `tune_grid()` to find the combination of hyper-parameters that maximizes `roc_auc`; let `tune` set up the grid for you.

How does it compare to the average ROC AUC across folds from `fit_resamples()`?
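---

# What does "let tune set up the grid" involve?

A hedged sketch of what is marked for tuning, and why `mtry` needs the data before a grid can be built. It assumes recent tidymodels releases in which `extract_parameter_set_dials()` and `finalize()` are available; `rf_params`, `predictors`, `rf_params_final`, and `mtry_final` are hypothetical names introduced only for this illustration.

```r
library(tidymodels)

data("ad_data", package = "modeldata")

# mark mtry and min_n for tuning with the tune() placeholder
rf_tuner <- rand_forest(engine = "ranger", mtry = tune(), min_n = tune()) %>%
  set_mode("classification")

rf_wf <- workflow() %>%
  add_formula(Class ~ .) %>%
  add_model(rf_tuner)

# list the parameters flagged with tune()
rf_params <- extract_parameter_set_dials(rf_wf)
rf_params

# mtry cannot exceed the number of predictors, so its upper bound
# is unknown until the parameter set has seen the predictor columns
predictors <- dplyr::select(ad_data, -Class)
rf_params_final <- finalize(rf_params, x = predictors)

# pull out the finalized mtry parameter object and inspect its range
mtry_final <- rf_params_final$object[[which(rf_params_final$id == "mtry")]]
mtry_final
```

When `grid` is an integer, `tune_grid()` handles this kind of finalization itself before generating candidate values, which is why you can call it without building a grid by hand.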
---

```r
rf_tuner <- 
  rand_forest(engine = "ranger", 
              mtry = tune(),
              min_n = tune()) %>% 
  set_mode("classification")

rf_wf <-
  rf_wf %>% 
  update_model(rf_tuner)

set.seed(100) # Important!
rf_results <-
  rf_wf %>% 
  tune_grid(resamples = alz_folds)
```

---

```r
rf_results %>% 
  collect_metrics()
# # A tibble: 20 × 8
#     mtry min_n .metric  .estimator  mean     n std_err
#    <int> <int> <chr>    <chr>      <dbl> <int>   <dbl>
#  1    40     3 accuracy binary     0.853    10  0.0158
#  2    40     3 roc_auc  binary     0.889    10  0.0241
#  3    60    20 accuracy binary     0.857    10  0.0190
#  4    60    20 roc_auc  binary     0.891    10  0.0225
#  5   104    24 accuracy binary     0.864    10  0.0156
#  6   104    24 roc_auc  binary     0.881    10  0.0233
#  7    19    15 accuracy binary     0.837    10  0.0160
#  8    19    15 roc_auc  binary     0.889    10  0.0245
#  9   124    26 accuracy binary     0.860    10  0.0172
# 10   124    26 roc_auc  binary     0.885    10  0.0222
# # … with 10 more rows, and 1 more variable: .config <chr>
```

---

```r
rf_results %>% 
  collect_metrics(summarize = FALSE)
# # A tibble: 200 × 7
#    id      mtry min_n .metric  .estimator .estimate .config
#    <chr>  <int> <int> <chr>    <chr>          <dbl> <chr>
#  1 Fold01    40     3 accuracy binary         0.839 Preproc…
#  2 Fold01    40     3 roc_auc  binary         0.869 Preproc…
#  3 Fold02    40     3 accuracy binary         0.806 Preproc…
#  4 Fold02    40     3 roc_auc  binary         0.869 Preproc…
#  5 Fold03    40     3 accuracy binary         0.933 Preproc…
#  6 Fold03    40     3 roc_auc  binary         0.966 Preproc…
#  7 Fold04    40     3 accuracy binary         0.8   Preproc…
#  8 Fold04    40     3 roc_auc  binary         0.847 Preproc…
#  9 Fold05    40     3 accuracy binary         0.867 Preproc…
# 10 Fold05    40     3 roc_auc  binary         0.977 Preproc…
# # … with 190 more rows
```

---

.center[
# `tune_grid()`

A version of `fit_resamples()` that performs a grid search for the best combination of tuned hyper-parameters.
]

.pull-left[
```r
tune_grid(
  object,
  resamples,
  ...,
* grid = df,
  metrics = NULL,
  control = control_grid()
)
```
]

.pull-right[
A data frame of tuning combinations.
]

---
class: middle, center

# `expand_grid()`

Takes one or more vectors, and returns a data frame holding all combinations of their values.
```r
expand_grid(mtry = c(1, 5), min_n = 1:3)
# # A tibble: 6 × 2
#    mtry min_n
#   <dbl> <int>
# 1     1     1
# 2     1     2
# 3     1     3
# 4     5     1
# 5     5     2
# 6     5     3
```

--

.footnote[tidyr package; see also base `expand.grid()`]

---
class: middle
name: show-best

.center[
# `show_best()`

Shows the .display[n] best combinations of hyper-parameters.
]

```r
rf_results %>% 
  show_best(metric = "roc_auc", n = 5)
```

---
template: show-best

```
# # A tibble: 5 × 8
#    mtry min_n .metric .estimator  mean     n std_err .config
#   <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>
# 1    91     9 roc_auc binary     0.893    10  0.0243 Prepro…
# 2    71    33 roc_auc binary     0.892    10  0.0230 Prepro…
# 3    60    20 roc_auc binary     0.891    10  0.0225 Prepro…
# 4    28    29 roc_auc binary     0.889    10  0.0249 Prepro…
# 5    19    15 roc_auc binary     0.889    10  0.0245 Prepro…
```

---
class: middle, center

# `autoplot()`

Quickly visualize tuning results

```r
rf_results %>% autoplot()
```

<img src="figs/05/rf-plot-1.png" width="504" style="display: block; margin: auto;" />

---
class: middle, center

<img src="figs/05/unnamed-chunk-26-1.png" width="504" style="display: block; margin: auto;" />

---
class: middle
name: select-best

.center[
# `select_best()`

Shows the .display[top] combination of hyper-parameters.
]

```r
alz_best <-
  rf_results %>% 
  select_best(metric = "roc_auc")

alz_best
```

---
template: select-best

```
# # A tibble: 1 × 3
#    mtry min_n .config
#   <int> <int> <chr>
# 1    91     9 Preprocessor1_Model09
```

---
class: middle

.center[
# `finalize_workflow()`

Replaces `tune()` placeholders in a model/recipe/workflow with a set of hyper-parameter values.
]

```r
last_rf_workflow <- 
  rf_wf %>%
  finalize_workflow(alz_best) 
```

---
background-image: url(images/diamonds.jpg)
background-size: contain
background-position: left
class: middle, center
background-color: #f5f5f5

.pull-right[
## We are ready to touch the jewels...

## The .display[testing set]!
]

---
class: middle

.center[
# `last_fit()`
]

```r
last_rf_fit <-
  last_rf_workflow %>% 
  last_fit(split = alz_split)
```

---

```r
last_rf_fit
# # Resampling results
# # Manual resampling
# # A tibble: 1 × 6
#   splits           id    .metrics .notes .predictions .workflow
#   <list>           <chr> <list>   <list> <list>       <list>
# 1 <split [300/33]> trai… <tibble… <tibb… <tibble [33… <workflo…
```

---
class: your-turn

# Your Turn 3

Use `select_best()`, `finalize_workflow()`, and `last_fit()` to take the best combination of hyper-parameters from `rf_results` and use them to predict the test set.

How does our actual test ROC AUC compare to our cross-validated estimate?
---

```r
alz_best <-
  rf_results %>% 
  select_best(metric = "roc_auc")

last_rf_workflow <- 
  rf_wf %>%
  finalize_workflow(alz_best) 

last_rf_fit <-
  last_rf_workflow %>% 
  last_fit(split = alz_split)

last_rf_fit %>% 
  collect_metrics()
```

---
class: middle, frame

.center[
# Final metrics
]

```r
last_rf_fit %>% 
  collect_metrics()
# # A tibble: 2 × 4
#   .metric  .estimator .estimate .config
#   <chr>    <chr>          <dbl> <chr>
# 1 accuracy binary         0.818 Preprocessor1_Model1
# 2 roc_auc  binary         0.819 Preprocessor1_Model1
```

---
class: middle

.center[
# Final test predictions
]

```r
last_rf_fit %>% 
  collect_predictions()
# # A tibble: 33 × 7
#    id               .pred_Impaired .pred_Control  .row .pred_class
#    <chr>                     <dbl>         <dbl> <int> <fct>
#  1 train/test split        0.264           0.736    13 Control
#  2 train/test split        0.223           0.777    14 Control
#  3 train/test split        0.256           0.744    33 Control
#  4 train/test split        0.434           0.566    43 Control
#  5 train/test split        0.0765          0.924    46 Control
#  6 train/test split        0.270           0.730    48 Control
#  7 train/test split        0.00933         0.991    49 Control
#  8 train/test split        0.808           0.192    56 Impaired
#  9 train/test split        0.629           0.371    67 Impaired
# 10 train/test split        0.110           0.890    68 Control
# # … with 23 more rows, and 2 more variables: Class <fct>,
# #   .config <chr>
```

---

```r
roc_values <- 
  last_rf_fit %>% 
  collect_predictions() %>% 
  roc_curve(truth = Class, estimate = .pred_Impaired)

autoplot(roc_values)
```

<img src="figs/05/unnamed-chunk-35-1.png" width="50%" style="display: block; margin: auto;" />

---

# The set-up

```r
set.seed(100) # Important!

# holdout method
alz_split <- initial_split(alz, strata = Class, prop = .9)
alz_train <- training(alz_split)
alz_test  <- testing(alz_split)

# add cross-validation
set.seed(100)
alz_folds <- 
    vfold_cv(alz_train, v = 10, strata = Class)
```

---

# The tune-up

```r
# here come the actual ML bits…

# pick model to tune
rf_tuner <- 
  rand_forest(engine = "ranger", 
              mtry = tune(),
              min_n = tune()) %>% 
  set_mode("classification")

rf_wf <-
  workflow() %>% 
  add_formula(Class ~ .) %>% 
  add_model(rf_tuner)

rf_results <-
  rf_wf %>% 
  tune_grid(resamples = alz_folds,
            control = control_grid(save_pred = TRUE))
```

---

# Quick check-in...

```r
rf_results %>% 
  collect_predictions() %>% 
  group_by(.config, mtry, min_n) %>% 
  summarize(folds = n_distinct(id))
# # A tibble: 10 × 4
# # Groups:   .config, mtry [10]
#    .config                mtry min_n folds
#    <chr>                 <int> <int> <int>
#  1 Preprocessor1_Model01    72    31    10
#  2 Preprocessor1_Model02    44    21    10
#  3 Preprocessor1_Model03    59    28    10
#  4 Preprocessor1_Model04    31    11    10
#  5 Preprocessor1_Model05    88    38    10
#  6 Preprocessor1_Model06    94    33    10
#  7 Preprocessor1_Model07   118    15    10
#  8 Preprocessor1_Model08     6    17    10
#  9 Preprocessor1_Model09    26     9    10
# 10 Preprocessor1_Model10   106     2    10
```

---

# The match up!

.pull-left[
```r
show_best(rf_results, metric = "roc_auc", n = 5)
# # A tibble: 5 × 8
#    mtry min_n .metric .estimator  mean     n std_err .config
#   <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>
# 1    44    21 roc_auc binary     0.894    10  0.0227 Prepro…
# 2    31    11 roc_auc binary     0.891    10  0.0224 Prepro…
# 3    59    28 roc_auc binary     0.888    10  0.0229 Prepro…
# 4    26     9 roc_auc binary     0.888    10  0.0259 Prepro…
# 5   106     2 roc_auc binary     0.887    10  0.0248 Prepro…

# pick final model workflow
alz_best <-
  rf_results %>% 
  select_best(metric = "roc_auc")

alz_best
# # A tibble: 1 × 3
#    mtry min_n .config
#   <int> <int> <chr>
# 1    44    21 Preprocessor1_Model02
```
]

.pull-right[
<img src="figs/05/unnamed-chunk-40-1.png" width="504" style="display: block; margin: auto;" />
]

---

# The wrap-up

.pull-left[
```r
last_rf_workflow <- 
  rf_wf %>%
  finalize_workflow(alz_best) 

last_rf_workflow
# ══ Workflow ════════════════════════════════════════════════════════════════════
# Preprocessor: Formula
# Model: rand_forest()
# 
# ── Preprocessor ────────────────────────────────────────────────────────────────
# Class ~ .
# 
# ── Model ───────────────────────────────────────────────────────────────────────
# Random Forest Model Specification (classification)
# 
# Main Arguments:
#   mtry = 44
#   min_n = 21
# 
# Computational engine: ranger
```
]

--

.pull-right[
```r
# train + test final model
last_rf_fit <-
  last_rf_workflow %>% 
  last_fit(split = alz_split)

# explore final model
last_rf_fit %>% 
  collect_metrics()
# # A tibble: 2 × 4
#   .metric  .estimator .estimate .config
#   <chr>    <chr>          <dbl> <chr>
# 1 accuracy binary         0.788 Preprocessor1_Model1
# 2 roc_auc  binary         0.815 Preprocessor1_Model1

last_rf_fit %>% 
  collect_predictions() %>% 
  roc_curve(truth = Class, estimate = .pred_Impaired) %>% 
  autoplot()
```
]
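---

# Extra: your own grid, start to finish

The slides show `grid = df` and `expand_grid()` separately. As a rough sketch, here is how they might fit together in one pass, ending with a side-by-side look at the cross-validated and test-set ROC AUC. The candidate values in `rf_grid` are hypothetical and chosen only for illustration; `rf_grid_results`, `cv_auc`, and `test_auc` are likewise made-up names. The set-up assumes `ad_data` comes from the modeldata package and that ranger is installed.

```r
library(tidymodels)

data("ad_data", package = "modeldata")

# the set-up
set.seed(100)
alz_split <- initial_split(ad_data, strata = Class, prop = .9)
set.seed(100)
alz_folds <- vfold_cv(training(alz_split), v = 10, strata = Class)

# the tune-up
rf_tuner <- rand_forest(engine = "ranger", mtry = tune(), min_n = tune()) %>%
  set_mode("classification")

rf_wf <- workflow() %>%
  add_formula(Class ~ .) %>%
  add_model(rf_tuner)

# hypothetical candidate values: 2 x 2 = 4 combinations
rf_grid <- tidyr::expand_grid(mtry = c(10, 40), min_n = c(5, 20))

set.seed(100)
rf_grid_results <- rf_wf %>%
  tune_grid(resamples = alz_folds,
            grid = rf_grid,
            metrics = metric_set(roc_auc))

# the match up: best combination according to cross-validation
alz_best <- select_best(rf_grid_results, metric = "roc_auc")

cv_auc <- show_best(rf_grid_results, metric = "roc_auc", n = 1) %>%
  transmute(source = "cross-validation", roc_auc = mean)

# the wrap-up: refit on the training set, evaluate once on the test set
test_auc <- rf_wf %>%
  finalize_workflow(alz_best) %>%
  last_fit(split = alz_split) %>%
  collect_metrics() %>%
  filter(.metric == "roc_auc") %>%
  transmute(source = "test set", roc_auc = .estimate)

bind_rows(cv_auc, test_auc)
```

The two rows will rarely match exactly. The test set here is small (33 rows in the output shown earlier), so a single test estimate is noisy, and the cross-validated value belongs to the winner of a search, which tends to make it somewhat optimistic.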