class: title-slide, center, bottom

# Build a fine-tuned model

## Tidymodels, virtually &mdash; Session 05

### Alison Hill

---

.pull-left[

### Single decision tree

```r
tree_mod <- 
  decision_tree(engine = "rpart") %>% 
  set_mode("classification")

tree_wf <-
  workflow() %>% 
  add_formula(Class ~ .) %>% 
  add_model(tree_mod)

set.seed(100)
tree_res <- 
  tree_wf %>% 
  fit_resamples(resamples = alz_folds,
                control = control_resamples(save_pred = TRUE))

tree_res %>% 
  collect_metrics()
# # A tibble: 2 × 6
#   .metric  .estimator  mean     n std_err .config           
#   <chr>    <chr>      <dbl> <int>   <dbl> <chr>             
# 1 accuracy binary     0.756    10  0.0245 Preprocessor1_Mod…
# 2 roc_auc  binary     0.770    10  0.0255 Preprocessor1_Mod…
```
]

.pull-right[

### A random forest of trees

```r
rf_mod <-
  rand_forest(engine = "ranger") %>% 
  set_mode("classification")

rf_wf <-
  tree_wf %>% 
  update_model(rf_mod)

set.seed(100)
rf_res <- rf_wf %>% 
  fit_resamples(resamples = alz_folds,
                control = control_resamples(save_pred = TRUE))

rf_res %>% 
  collect_metrics()
# # A tibble: 2 × 6
#   .metric  .estimator  mean     n std_err .config           
#   <chr>    <chr>      <dbl> <int>   <dbl> <chr>             
# 1 accuracy binary     0.837    10  0.0172 Preprocessor1_Mod…
# 2 roc_auc  binary     0.886    10  0.0171 Preprocessor1_Mod…
```
]

---

.pull-left[

### Single decision tree
<img src="figs/05/unnamed-chunk-3-1.png" width="504" style="display: block; margin: auto;" />
]

.pull-right[

### A random forest of trees
<img src="figs/05/unnamed-chunk-4-1.png" width="504" style="display: block; margin: auto;" />
]

---
class: your-turn

# Your Turn 1

Challenge: Fit 3 more random forest models, using 3, 8, and 30 variables at each split, respectively. Update your `rf_wf` with each new model. Which value maximizes the area under the ROC curve?

---

```r
rf3_mod <- rf_mod %>% 
* set_args(mtry = 3)

rf8_mod <- rf_mod %>% 
* set_args(mtry = 8)

rf30_mod <- rf_mod %>% 
* set_args(mtry = 30)
```

---

```r
rf3_wf <- rf_wf %>% 
  update_model(rf3_mod)

set.seed(100)
rf3_wf %>% 
  fit_resamples(resamples = alz_folds) %>% 
  collect_metrics()
# # A tibble: 2 × 6
#   .metric  .estimator  mean     n std_err .config           
#   <chr>    <chr>      <dbl> <int>   <dbl> <chr>             
# 1 accuracy binary     0.786    10  0.0145 Preprocessor1_Mod…
# 2 roc_auc  binary     0.862    10  0.0167 Preprocessor1_Mod…
```

---

```r
rf8_wf <- rf_wf %>% 
  update_model(rf8_mod)

set.seed(100)
rf8_wf %>% 
  fit_resamples(resamples = alz_folds) %>% 
  collect_metrics()
# # A tibble: 2 × 6
#   .metric  .estimator  mean     n std_err .config           
#   <chr>    <chr>      <dbl> <int>   <dbl> <chr>             
# 1 accuracy binary     0.817    10  0.0137 Preprocessor1_Mod…
# 2 roc_auc  binary     0.886    10  0.0152 Preprocessor1_Mod…
```

---

```r
rf30_wf <- rf_wf %>% 
  update_model(rf30_mod)

set.seed(100)
rf30_wf %>% 
  fit_resamples(resamples = alz_folds) %>% 
  collect_metrics()
# # A tibble: 2 × 6
#   .metric  .estimator  mean     n std_err .config           
#   <chr>    <chr>      <dbl> <int>   <dbl> <chr>             
# 1 accuracy binary     0.846    10  0.0140 Preprocessor1_Mod…
# 2 roc_auc  binary     0.897    10  0.0133 Preprocessor1_Mod…
```

---
class: middle, center, frame

# tune

Functions for fitting and tuning models

<https://tune.tidymodels.org>

---
class: middle, center

# `tune()`

A placeholder for hyper-parameters to be "tuned"

```r
nearest_neighbor(neighbors = tune())
```
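
To see which arguments a specification has marked for tuning, one option is to ask for its parameter set. A minimal sketch, assuming the default `kknn` engine for `nearest_neighbor()`; the name `knn_spec` is made up for illustration.

```r
library(tidymodels)

# Mark `neighbors` as "to be tuned" instead of giving it a value
knn_spec <-
  nearest_neighbor(neighbors = tune()) %>%
  set_mode("classification")

# List every argument that carries a tune() placeholder
knn_params <- extract_parameter_set_dials(knn_spec)
knn_params
```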

---

.center[
# `tune_grid()`

A version of `fit_resamples()` that performs a grid search for the best combination of tuned hyper-parameters.
]

.pull-left[

```r
tune_grid(
  object, 
  resamples, 
  ..., 
  grid = 10, 
  metrics = NULL, 
  control = control_grid()
)
```

]

---

.center[
# `tune_grid()`

A version of `fit_resamples()` that performs a grid search for the best combination of tuned hyper-parameters.
]

.pull-left[

```r
tune_grid(
* object,
  resamples, 
  ..., 
  grid = 10, 
  metrics = NULL, 
  control = control_grid()
)
```

]

.pull-right[
One of:

+ A parsnip `model` object

+ A `workflow`

]

---

.center[
# `tune_grid()`

A version of `fit_resamples()` that performs a grid search for the best combination of tuned hyper-parameters.
]

.pull-left[

```r
tune_grid(
* object,
* preprocessor,
  resamples, 
  ..., 
  grid = 10, 
  metrics = NULL, 
  control = control_grid()
)
```

]

.pull-right[
A `model` + `recipe`
]

---

.center[
# `tune_grid()`

A version of `fit_resamples()` that performs a grid search for the best combination of tuned hyper-parameters.
]

.pull-left[

```r
tune_grid(
  object, 
  resamples, 
  ..., 
* grid = 10,
  metrics = NULL, 
  control = control_grid()
)
```

]

.pull-right[
One of:

+ A positive integer.

+ A data frame of tuning combinations.

]

---

.center[

# `tune_grid()`

A version of `fit_resamples()` that performs a grid search for the best combination of tuned hyper-parameters.

]

.pull-left[

```r
tune_grid(
  object, 
  resamples, 
  ..., 
* grid = 10,
  metrics = NULL, 
  control = control_grid()
)
```

]

.pull-right[
Number of candidate parameter sets to be created automatically; `10` is the default.
]
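
When `grid` is an integer, `tune_grid()` builds that many candidate parameter sets itself, after resolving data-dependent ranges such as the upper bound of `mtry`. A rough sketch of the idea using dials directly: `mtcars[, -1]` is only a stand-in for the predictors, and `grid_random()` is an assumed stand-in for whatever design tune uses internally.

```r
library(tidymodels)

# The two parameters tuned later in these slides
rf_params <- parameters(mtry(), min_n())

# mtry's upper bound is unknown until the number of predictors is known;
# finalize() fills it in from a data frame of predictors (10 columns here)
rf_params <- finalize(rf_params, x = mtcars[, -1])

# Draw up to 10 candidate combinations (duplicates, if any, are dropped)
set.seed(100)
candidate_grid <- grid_random(rf_params, size = 10)
candidate_grid
```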

---

```r
library(tidymodels) # attaches modeldata, which provides ad_data

data("ad_data")
alz <- ad_data

# data splitting
set.seed(100) # Important!
alz_split  <- initial_split(alz, strata = Class, prop = .9)
alz_train  <- training(alz_split)
alz_test   <- testing(alz_split)

# data resampling
set.seed(100)
alz_folds <- 
    vfold_cv(alz_train, v = 10, strata = Class)
```

---
class: your-turn

# Your Turn 2

Here's our random forest model plus workflow to work with.

```r
rf_mod <- 
  rand_forest(engine = "ranger") %>% 
  set_mode("classification")

rf_wf <-
  workflow() %>% 
  add_formula(Class ~ .) %>% 
  add_model(rf_mod)
```

---
class: your-turn

# Your Turn 2

Here is the output from `fit_resamples()`...

```r
set.seed(100) # Important!
rf_results <-
  rf_wf %>% 
  fit_resamples(resamples = alz_folds)

rf_results %>% 
  collect_metrics()
# # A tibble: 2 × 6
#   .metric  .estimator  mean     n std_err .config           
#   <chr>    <chr>      <dbl> <int>   <dbl> <chr>             
# 1 accuracy binary     0.826    10  0.0181 Preprocessor1_Mod…
# 2 roc_auc  binary     0.882    10  0.0248 Preprocessor1_Mod…
```

---
class: your-turn

# Your Turn 2

Edit the random forest model to tune the `mtry` and `min_n` hyper-parameters.

Update your workflow to use the tuned model.

Then use `tune_grid()` to find the best combination of hyper-parameters to maximize `roc_auc`; let tune set up the grid for you.

How does it compare to the average ROC AUC across folds from `fit_resamples()`?

---

```r
rf_tuner <- 
  rand_forest(engine = "ranger", 
              mtry = tune(),
              min_n = tune()) %>% 
  set_mode("classification")

rf_wf <-
  rf_wf %>% 
  update_model(rf_tuner)

set.seed(100) # Important!
rf_results <-
  rf_wf %>% 
  tune_grid(resamples = alz_folds)
```

---

```r
rf_results %>% 
  collect_metrics() 
# # A tibble: 20 × 8
#     mtry min_n .metric  .estimator  mean     n std_err
#    <int> <int> <chr>    <chr>      <dbl> <int>   <dbl>
#  1    40     3 accuracy binary     0.853    10  0.0158
#  2    40     3 roc_auc  binary     0.889    10  0.0241
#  3    60    20 accuracy binary     0.857    10  0.0190
#  4    60    20 roc_auc  binary     0.891    10  0.0225
#  5   104    24 accuracy binary     0.864    10  0.0156
#  6   104    24 roc_auc  binary     0.881    10  0.0233
#  7    19    15 accuracy binary     0.837    10  0.0160
#  8    19    15 roc_auc  binary     0.889    10  0.0245
#  9   124    26 accuracy binary     0.860    10  0.0172
# 10   124    26 roc_auc  binary     0.885    10  0.0222
# # … with 10 more rows, and 1 more variable: .config <chr>
```

---

```r
rf_results %>% 
  collect_metrics(summarize = FALSE) 
# # A tibble: 200 × 7
#    id      mtry min_n .metric  .estimator .estimate .config 
#    <chr>  <int> <int> <chr>    <chr>          <dbl> <chr>   
#  1 Fold01    40     3 accuracy binary         0.839 Preproc…
#  2 Fold01    40     3 roc_auc  binary         0.869 Preproc…
#  3 Fold02    40     3 accuracy binary         0.806 Preproc…
#  4 Fold02    40     3 roc_auc  binary         0.869 Preproc…
#  5 Fold03    40     3 accuracy binary         0.933 Preproc…
#  6 Fold03    40     3 roc_auc  binary         0.966 Preproc…
#  7 Fold04    40     3 accuracy binary         0.8   Preproc…
#  8 Fold04    40     3 roc_auc  binary         0.847 Preproc…
#  9 Fold05    40     3 accuracy binary         0.867 Preproc…
# 10 Fold05    40     3 roc_auc  binary         0.977 Preproc…
# # … with 190 more rows
```

---

.center[
# `tune_grid()`

A version of `fit_resamples()` that performs a grid search for the best combination of tuned hyper-parameters.

]

.pull-left[

```r
tune_grid(
  object, 
  resamples, 
  ..., 
* grid = df,
  metrics = NULL, 
  control = control_grid()
)
```

]

.pull-right[
A data frame of tuning combinations.
]

---
class: middle, center

# `expand_grid()`

Takes one or more vectors, and returns a data frame holding all combinations of their values.

```r
expand_grid(mtry = c(1, 5), min_n = 1:3)
# # A tibble: 6 × 2
#    mtry min_n
#   <dbl> <int>
# 1     1     1
# 2     1     2
# 3     1     3
# 4     5     1
# 5     5     2
# 6     5     3
```

.footnote[tidyr package; see also base `expand.grid()`]
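
A data frame built this way can be handed to `tune_grid()` through `grid`. A minimal sketch, assuming the small `two_class_dat` data from modeldata as a stand-in (it has only two predictors, so `mtry` is limited to 1 or 2); the `toy_*` names are made up for illustration.

```r
library(tidymodels)

# Stand-in resamples: 3-fold CV on a small two-class data set
set.seed(100)
toy_folds <- vfold_cv(two_class_dat, v = 3, strata = Class)

rf_tuner <-
  rand_forest(engine = "ranger",
              mtry = tune(),
              min_n = tune()) %>%
  set_mode("classification")

toy_wf <-
  workflow() %>%
  add_formula(Class ~ .) %>%
  add_model(rf_tuner)

# Column names must match the tune() placeholders
rf_grid <- expand_grid(mtry = c(1, 2), min_n = c(5, 20))

set.seed(100)
toy_results <-
  toy_wf %>%
  tune_grid(resamples = toy_folds,
            grid = rf_grid,
            metrics = metric_set(roc_auc))

# One row per grid combination (a single metric was requested)
toy_results %>%
  collect_metrics()
```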

---
class: middle
name: show-best

.center[
# `show_best()`

Shows the .display[n] best combinations of hyper-parameters
]

```r
rf_results %>% 
  show_best(metric = "roc_auc", n = 5)
```

---
template: show-best

```
# # A tibble: 5 × 8
#    mtry min_n .metric .estimator  mean     n std_err .config
#   <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>  
# 1    91     9 roc_auc binary     0.893    10  0.0243 Prepro…
# 2    71    33 roc_auc binary     0.892    10  0.0230 Prepro…
# 3    60    20 roc_auc binary     0.891    10  0.0225 Prepro…
# 4    28    29 roc_auc binary     0.889    10  0.0249 Prepro…
# 5    19    15 roc_auc binary     0.889    10  0.0245 Prepro…
```
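
Conceptually, this is a ranking of the summarized metrics. A rough sketch of that idea in plain dplyr, not the function's actual implementation: the values are copied from the first five candidates of the `collect_metrics()` output shown earlier, so the winner here differs from the full ten-candidate result above.

```r
library(tidymodels)

# Mean metrics for the first five candidates (copied from the slides)
rf_metrics <- tribble(
  ~mtry, ~min_n, ~.metric,   ~mean,
     40,      3, "accuracy", 0.853,
     40,      3, "roc_auc",  0.889,
     60,     20, "accuracy", 0.857,
     60,     20, "roc_auc",  0.891,
    104,     24, "accuracy", 0.864,
    104,     24, "roc_auc",  0.881,
     19,     15, "accuracy", 0.837,
     19,     15, "roc_auc",  0.889,
    124,     26, "accuracy", 0.860,
    124,     26, "roc_auc",  0.885
)

# Keep one metric, sort from best to worst, take the top n
top_auc <-
  rf_metrics %>%
  filter(.metric == "roc_auc") %>%
  arrange(desc(mean)) %>%
  slice_head(n = 3)

top_auc
```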

---
class: middle, center

# `autoplot()`

Quickly visualize tuning results

```r
rf_results %>% autoplot()
```

---
class: middle
name: select-best

.center[
# `select_best()`

Shows the .display[top] combination of hyper-parameters.
]

```r
alz_best <-
  rf_results %>% 
  select_best(metric = "roc_auc")

alz_best
```

---
template: select-best

```
# # A tibble: 1 × 3
#    mtry min_n .config              
#   <int> <int> <chr>                
# 1    91     9 Preprocessor1_Model09
```

---
class: middle

.center[
# `finalize_workflow()`

Replaces `tune()` placeholders in a model/recipe/workflow with a set of hyper-parameter values.
]

```r
last_rf_workflow <- 
  rf_wf %>%
  finalize_workflow(alz_best) 
```
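
To check that the placeholders are gone, the finalized model specification can be inspected. A minimal sketch that does not need any data: the one-row tibble `chosen` is written by hand for illustration (with the values `mtry = 44`, `min_n = 21` that appear later in these slides) and stands in for the output of `select_best()`.

```r
library(tidymodels)

rf_tuner <-
  rand_forest(engine = "ranger",
              mtry = tune(),
              min_n = tune()) %>%
  set_mode("classification")

tunable_wf <-
  workflow() %>%
  add_formula(Class ~ .) %>%
  add_model(rf_tuner)

# Hand-written stand-in for select_best() output
chosen <- tibble(mtry = 44L, min_n = 21L)

final_wf <- finalize_workflow(tunable_wf, chosen)

# The model spec now holds concrete values instead of tune()
final_spec <- extract_spec_parsnip(final_wf)
final_spec
```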

---
background-image: url(images/diamonds.jpg)
background-size: contain
background-position: left
class: middle, center
background-color: #f5f5f5

.pull-right[
## We are ready to touch the jewels...

## The .display[testing set]!

]

---
class: middle

.center[

# `last_fit()`

]

```r
last_rf_fit <-
  last_rf_workflow %>% 
  last_fit(split = alz_split)
```
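
`last_fit()` fits the workflow once on the training portion of the split and evaluates it once on the testing portion. A minimal sketch comparing that with doing the two steps by hand, assuming the small `two_class_dat` data from modeldata and a single decision tree as stand-ins for the `alz` data and the tuned random forest; the `toy_*` names are made up for illustration.

```r
library(tidymodels)

set.seed(100)
toy_split <- initial_split(two_class_dat, strata = Class, prop = .9)

toy_wf <-
  workflow() %>%
  add_formula(Class ~ .) %>%
  add_model(decision_tree(engine = "rpart") %>%
              set_mode("classification"))

# One call: train on training(toy_split), score on testing(toy_split)
toy_last <-
  toy_wf %>%
  last_fit(split = toy_split, metrics = metric_set(roc_auc))

# The same two steps by hand
toy_fit <- fit(toy_wf, data = training(toy_split))

manual_auc <-
  augment(toy_fit, new_data = testing(toy_split)) %>%
  roc_auc(truth = Class, .pred_Class1)

collect_metrics(toy_last)
manual_auc
```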

---

```r
last_rf_fit
# # Resampling results
# # Manual resampling 
# # A tibble: 1 × 6
#   splits           id    .metrics .notes .predictions .workflow
#   <list>           <chr> <list>   <list> <list>       <list>   
# 1 <split [300/33]> trai… <tibble… <tibb… <tibble [33… <workflo…
```

---
class: your-turn

# Your Turn 3

Use `select_best()`, `finalize_workflow()`, and `last_fit()` to take the best combination of hyper-parameters from `rf_results` and use them to predict the test set.

How does our actual test ROC AUC compare to our cross-validated estimate?

---

```r
alz_best <-
  rf_results %>% 
  select_best(metric = "roc_auc")

last_rf_workflow <- 
  rf_wf %>%
  finalize_workflow(alz_best)

last_rf_fit <-
  last_rf_workflow %>% 
  last_fit(split = alz_split)

last_rf_fit %>% 
  collect_metrics()
```

---
class: middle, frame

.center[
# Final metrics
]

```r
last_rf_fit %>% 
  collect_metrics()
# # A tibble: 2 × 4
#   .metric  .estimator .estimate .config             
#   <chr>    <chr>          <dbl> <chr>               
# 1 accuracy binary         0.818 Preprocessor1_Model1
# 2 roc_auc  binary         0.819 Preprocessor1_Model1
```

---
class: middle

.center[
# Final test predictions
]

```r
last_rf_fit %>% 
  collect_predictions()
# # A tibble: 33 × 7
#    id               .pred_Impaired .pred_Control  .row .pred_class
#    <chr>                     <dbl>         <dbl> <int> <fct>      
#  1 train/test split        0.264           0.736    13 Control    
#  2 train/test split        0.223           0.777    14 Control    
#  3 train/test split        0.256           0.744    33 Control    
#  4 train/test split        0.434           0.566    43 Control    
#  5 train/test split        0.0765          0.924    46 Control    
#  6 train/test split        0.270           0.730    48 Control    
#  7 train/test split        0.00933         0.991    49 Control    
#  8 train/test split        0.808           0.192    56 Impaired   
#  9 train/test split        0.629           0.371    67 Impaired   
# 10 train/test split        0.110           0.890    68 Control    
# # … with 23 more rows, and 2 more variables: Class <fct>,
# #   .config <chr>
```

---

```r
roc_values <- 
  last_rf_fit %>% 
  collect_predictions() %>% 
  roc_curve(truth = Class, estimate = .pred_Impaired)
autoplot(roc_values)
```

---

# The set-up

```r
set.seed(100) # Important!

# holdout method
alz_split  <- initial_split(alz, strata = Class, prop = .9)
alz_train  <- training(alz_split)
alz_test   <- testing(alz_split)

# add cross-validation
set.seed(100)
alz_folds <- 
    vfold_cv(alz_train, v = 10, strata = Class)
```

---

# The tune-up

```r
# here come the actual ML bits…

# pick model to tune
rf_tuner <- 
  rand_forest(engine = "ranger", 
              mtry = tune(),
              min_n = tune()) %>% 
  set_mode("classification")

rf_wf <-
  workflow() %>% 
  add_formula(Class ~ .) %>% 
  add_model(rf_tuner)

rf_results <-
  rf_wf %>% 
  tune_grid(resamples = alz_folds,
            control = control_grid(save_pred = TRUE))
```

---

# Quick check-in...

```r
rf_results %>%
  collect_predictions() %>% 
  group_by(.config, mtry, min_n) %>% 
  summarize(folds = n_distinct(id))
# # A tibble: 10 × 4
# # Groups:   .config, mtry [10]
#    .config                mtry min_n folds
#    <chr>                 <int> <int> <int>
#  1 Preprocessor1_Model01    72    31    10
#  2 Preprocessor1_Model02    44    21    10
#  3 Preprocessor1_Model03    59    28    10
#  4 Preprocessor1_Model04    31    11    10
#  5 Preprocessor1_Model05    88    38    10
#  6 Preprocessor1_Model06    94    33    10
#  7 Preprocessor1_Model07   118    15    10
#  8 Preprocessor1_Model08     6    17    10
#  9 Preprocessor1_Model09    26     9    10
# 10 Preprocessor1_Model10   106     2    10
```

---

# The match up!

.pull-left[

```r
show_best(rf_results, metric = "roc_auc", n = 5)
# # A tibble: 5 × 8
#    mtry min_n .metric .estimator  mean     n std_err .config
#   <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>  
# 1    44    21 roc_auc binary     0.894    10  0.0227 Prepro…
# 2    31    11 roc_auc binary     0.891    10  0.0224 Prepro…
# 3    59    28 roc_auc binary     0.888    10  0.0229 Prepro…
# 4    26     9 roc_auc binary     0.888    10  0.0259 Prepro…
# 5   106     2 roc_auc binary     0.887    10  0.0248 Prepro…

# pick final model workflow
alz_best <-
  rf_results %>% 
  select_best(metric = "roc_auc")

alz_best
# # A tibble: 1 × 3
#    mtry min_n .config              
#   <int> <int> <chr>                
# 1    44    21 Preprocessor1_Model02
```
]

.pull-right[
<img src="figs/05/unnamed-chunk-40-1.png" width="504" style="display: block; margin: auto;" />

]

---

# The wrap-up

.pull-left[

```r
last_rf_workflow <- 
  rf_wf %>%
  finalize_workflow(alz_best)

last_rf_workflow
# ══ Workflow ════════════════════════════════════════════════════════════════════
# Preprocessor: Formula
# Model: rand_forest()
# 
# ── Preprocessor ────────────────────────────────────────────────────────────────
# Class ~ .
# 
# ── Model ───────────────────────────────────────────────────────────────────────
# Random Forest Model Specification (classification)
# 
# Main Arguments:
#   mtry = 44
#   min_n = 21
# 
# Computational engine: ranger
```
]

.pull-right[

```r
# train + test final model
last_rf_fit <-
  last_rf_workflow %>% 
  last_fit(split = alz_split)

# explore final model
last_rf_fit %>% 
  collect_metrics()
# # A tibble: 2 × 4
#   .metric  .estimator .estimate .config             
#   <chr>    <chr>          <dbl> <chr>               
# 1 accuracy binary         0.788 Preprocessor1_Model1
# 2 roc_auc  binary         0.815 Preprocessor1_Model1

last_rf_fit %>% 
  collect_predictions() %>% 
  roc_curve(truth = Class, estimate = .pred_Impaired) %>% 
  autoplot()
```
]