class: title-slide, center, bottom

# Build a useful model

## Tidymodels, virtually — Session 02

### Alison Hill

---
class: center, middle, inverse

# What is Machine Learning?

---
class: middle

# .center[Alzheimer's disease data]

Data from a clinical trial of individuals with well-characterized cognitive impairment, and age-matched control participants.

```r
# install.packages("modeldata")
library(modeldata)
data("ad_data")
alz <- ad_data
```

---
class: middle

```r
glimpse(alz)
# Rows: 333
# Columns: 131
# $ ACE_CD143_Angiotensin_Converti <dbl> 2.0031003, 1.5618…
# $ ACTH_Adrenocorticotropic_Hormon <dbl> -1.3862944, -1.38…
# $ AXL <dbl> 1.09838668, 0.683…
# $ Adiponectin <dbl> -5.360193, -5.020…
# $ Alpha_1_Antichymotrypsin <dbl> 1.7404662, 1.4586…
# $ Alpha_1_Antitrypsin <dbl> -12.631361, -11.9…
# $ Alpha_1_Microglobulin <dbl> -2.577022, -3.244…
# $ Alpha_2_Macroglobulin <dbl> -72.65029, -154.6…
# $ Angiopoietin_2_ANG_2 <dbl> 1.06471074, 0.741…
# $ Angiotensinogen <dbl> 2.510547, 2.45728…
# $ Apolipoprotein_A_IV <dbl> -1.427116, -1.660…
# $ Apolipoprotein_A1 <dbl> -7.402052, -7.047…
# $ Apolipoprotein_A2 <dbl> -0.26136476, -0.8…
# $ Apolipoprotein_B <dbl> -4.624044, -6.747…
# $ Apolipoprotein_CI <dbl> -1.2729657, -1.27…
# $ Apolipoprotein_CIII <dbl> -2.312635, -2.343…
# $ Apolipoprotein_D <dbl> 2.0794415, 1.3350…
# $ Apolipoprotein_E <dbl> 3.7545215, 3.0971…
# $ Apolipoprotein_H <dbl> -0.15734908, -0.5…
# $ B_Lymphocyte_Chemoattractant_BL <dbl> 2.296982, 1.67312…
# $ BMP_6 <dbl> -2.200744, -1.728…
# $ Beta_2_Microglobulin <dbl> 0.69314718, 0.470…
# $ Betacellulin <int> 34, 53, 49, 52, 6…
# $ C_Reactive_Protein <dbl> -4.074542, -6.645…
# $ CD40 <dbl> -0.7964147, -1.27…
# $ CD5L <dbl> 0.09531018, -0.67…
# $ Calbindin <dbl> 33.21363, 25.2763…
# $ Calcitonin <dbl> 1.3862944, 3.6109…
# $ CgA <dbl> 397.6536, 465.675…
# $ Clusterin_Apo_J <dbl> 3.555348, 3.04452…
# $ Complement_3 <dbl> -10.36305, -16.10…
# $ Complement_Factor_H <dbl> 3.573725, 3.60004…
# $ Connective_Tissue_Growth_Factor <dbl> 0.5306283, 0.5877…
# $ Cortisol <dbl> 10.0, 12.0, 10.0,…
# $ Creatine_Kinase_MB <dbl> -1.710172, -1.751…
# $ Cystatin_C <dbl> 9.041922, 9.06762…
# $ EGF_R <dbl> -0.1354543, -0.37…
# $ EN_RAGE <dbl> -3.688879, -3.816…
# $ ENA_78 <dbl> -1.349543, -1.356…
# $ Eotaxin_3 <int> 53, 62, 62, 44, 6…
# $ FAS <dbl> -0.08338161, -0.5…
# $ FSH_Follicle_Stimulation_Hormon <dbl> -0.6516715, -1.62…
# $ Fas_Ligand <dbl> 3.1014922, 2.9788…
# $ Fatty_Acid_Binding_Protein <dbl> 2.5208712, 2.2477…
# $ Ferritin <dbl> 3.329165, 3.93295…
# $ Fetuin_A <dbl> 1.2809338, 1.1939…
# $ Fibrinogen <dbl> -7.035589, -8.047…
# $ GRO_alpha <dbl> 1.381830, 1.37243…
# $ Gamma_Interferon_induced_Monokin <dbl> 2.949822, 2.72179…
# $ Glutathione_S_Transferase_alpha <dbl> 1.0641271, 0.8670…
# $ HB_EGF <dbl> 6.559746, 8.75453…
# $ HCC_4 <dbl> -3.036554, -4.074…
# $ Hepatocyte_Growth_Factor_HGF <dbl> 0.58778666, 0.530…
# $ I_309 <dbl> 3.433987, 3.13549…
# $ ICAM_1 <dbl> -0.1907787, -0.46…
# $ IGF_BP_2 <dbl> 5.609472, 5.34710…
# $ IL_11 <dbl> 5.121987, 4.93670…
# $ IL_13 <dbl> 1.282549, 1.26946…
# $ IL_16 <dbl> 4.192081, 2.87633…
# $ IL_17E <dbl> 5.731246, 6.70589…
# $ IL_1alpha <dbl> -6.571283, -8.047…
# $ IL_3 <dbl> -3.244194, -3.912…
# $ IL_4 <dbl> 2.484907, 2.39789…
# $ IL_5 <dbl> 1.09861229, 0.693…
# $ IL_6 <dbl> 0.26936976, 0.096…
# $ IL_6_Receptor <dbl> 0.64279595, 0.431…
# $ IL_7 <dbl> 4.8050453, 3.7055…
# $ IL_8 <dbl> 1.711325, 1.67555…
# $ IP_10_Inducible_Protein_10 <dbl> 6.242223, 5.68697…
# $ IgA <dbl> -6.812445, -6.377…
# $ Insulin <dbl> -0.6258253, -0.94…
# $ Kidney_Injury_Molecule_1_KIM_1 <dbl> -1.204295, -1.197…
# $ LOX_1 <dbl> 1.7047481, 1.5260…
# $ Leptin <dbl> -1.5290628, -1.46…
# $ Lipoprotein_a <dbl> -4.268698, -4.933…
# $ MCP_1 <dbl> 6.740519, 6.84906…
# $ MCP_2 <dbl> 1.9805094, 1.8088…
# $ MIF <dbl> -1.237874, -1.897…
# $ MIP_1alpha <dbl> 4.968453, 3.69016…
# $ MIP_1beta <dbl> 3.258097, 3.13549…
# $ MMP_2 <dbl> 4.478566, 3.78147…
# $ MMP_3 <dbl> -2.207275, -2.465…
# $ MMP10 <dbl> -3.270169, -3.649…
# $ MMP7 <dbl> -3.7735027, -5.96…
# $ Myoglobin <dbl> -1.89711998, -0.7…
# $ NT_proBNP <dbl> 4.553877, 4.21950…
# $ NrCAM <dbl> 5.003946, 5.20948…
# $ Osteopontin <dbl> 5.356586, 6.00388…
# $ PAI_1 <dbl> 1.00350156, -0.03…
# $ PAPP_A <dbl> -2.902226, -2.813…
# $ PLGF <dbl> 4.442651, 4.02535…
# $ PYY <dbl> 3.218876, 3.13549…
# $ Pancreatic_polypeptide <dbl> 0.5787808, 0.3364…
# $ Prolactin <dbl> 0.00000000, -0.51…
# $ Prostatic_Acid_Phosphatase <dbl> -1.620527, -1.739…
# $ Protein_S <dbl> -1.784998, -2.463…
# $ Pulmonary_and_Activation_Regulat <dbl> -0.8439701, -2.30…
# $ RANTES <dbl> -6.214608, -6.938…
# $ Resistin <dbl> -16.475315, -16.0…
# $ S100b <dbl> 1.5618560, 1.7566…
# $ SGOT <dbl> -0.94160854, -0.6…
# $ SHBG <dbl> -1.897120, -1.560…
# $ SOD <dbl> 5.609472, 5.81413…
# $ Serum_Amyloid_P <dbl> -5.599422, -6.119…
# $ Sortilin <dbl> 4.908629, 5.47873…
# $ Stem_Cell_Factor <dbl> 4.174387, 3.71357…
# $ TGF_alpha <dbl> 8.649098, 11.3316…
# $ TIMP_1 <dbl> 15.20465, 11.2665…
# $ TNF_RII <dbl> -0.0618754, -0.32…
# $ TRAIL_R3 <dbl> -0.1829004, -0.50…
# $ TTR_prealbumin <dbl> 2.944439, 2.83321…
# $ Tamm_Horsfall_Protein_THP <dbl> -3.095810, -3.111…
# $ Thrombomodulin <dbl> -1.340566, -1.675…
# $ Thrombopoietin <dbl> -0.1026334, -0.67…
# $ Thymus_Expressed_Chemokine_TECK <dbl> 4.149327, 3.81018…
# $ Thyroid_Stimulating_Hormone <dbl> -3.863233, -4.828…
# $ Thyroxine_Binding_Globulin <dbl> -1.4271164, -1.60…
# $ Tissue_Factor <dbl> 2.04122033, 2.028…
# $ Transferrin <dbl> 3.332205, 2.89037…
# $ Trefoil_Factor_3_TFF3 <dbl> -3.381395, -3.912…
# $ VCAM_1 <dbl> 3.258097, 2.70805…
# $ VEGF <dbl> 22.03456, 18.6018…
# $ Vitronectin <dbl> -0.04082199, -0.3…
# $ von_Willebrand_Factor <dbl> -3.146555, -3.863…
# $ age <dbl> 0.9876238, 0.9861…
# $ tau <dbl> 6.297754, 6.65929…
# $ p_tau <dbl> 4.348108, 4.85996…
# $ Ab_42 <dbl> 12.019678, 11.015…
# $ male <dbl> 0, 0, 1, 0, 0, 1,…
# $ Genotype <fct> E3E3, E3E4, E3E4,…
# $ Class <fct> Control, Control,…
```

---
background-image: url(images/hands.jpg)
background-size: contain
background-position: left
class: middle

.pull-right[

## Alzheimer's disease data

+ N = 333
+ 1 categorical outcome: `Class`
+ 130 predictors
  + 126 protein measurements
  + also: `age`, `male`, `Genotype`

]

---
background-image: url(images/hands.jpg)
background-size: contain
background-position: left
class: middle

.pull-right[
<img src="figs/02/unnamed-chunk-3-1.png" width="504" style="display: block; margin: auto;" />
]

---
class: middle, center, inverse

# What is the goal of machine learning?

---
class: middle, center, frame

# Goal

--

## Build .display[models] that

--

## generate .display[accurate predictions]

--

## for .display[future, yet-to-be-seen data].

--

.footnote[Max Kuhn & Kjell Johnson, http://www.feat.engineering/]

???

This is our whole game vision for today. This is the main goal for predictive modeling broadly, and for machine learning specifically.
We'll use this goal to drive learning of 3 core tidymodels packages:

- parsnip
- yardstick
- and rsample

---
class: inverse, middle, center

# 🔨 Build models

--

# with parsnip

???

Enter the parsnip package

---
class: middle, center, frame

# parsnip

<iframe src="https://parsnip.tidymodels.org" width="100%" height="400px"></iframe>

---
class: middle

# .center[`glm()`]

```r
glm(Class ~ tau, family = binomial, data = alz)
# 
# Call:  glm(formula = Class ~ tau, family = binomial, data = alz)
# 
# Coefficients:
# (Intercept)          tau
#      13.664       -2.148
# 
# Degrees of Freedom: 332 Total (i.e. Null);  331 Residual
# Null Deviance:     390.6
# Residual Deviance: 318.8   AIC: 322.8
```

???

So let's start with prediction. To predict, we have to have two things: a model to generate predictions, and data to predict.

This type of formula interface may look familiar.

How would we use parsnip to build this kind of logistic regression model?

---
exclude: true
name: step1
background-image: url("images/predicting/predicting.001.jpeg")
background-size: contain

---
class: middle, frame

# .center[To specify a model with parsnip]

.right-column[

1\. Pick a .display[model] + .display[engine]

2\. Set the .display[mode] (if needed)

]

---
class: middle, frame

# .center[To specify a model with parsnip]

```r
logistic_reg(engine = "glm") %>%
  set_mode("classification")
# Logistic Regression Model Specification (classification)
# 
# Computational engine: glm
```

---
class: middle, frame

# .center[To specify a model with parsnip]

```r
decision_tree(engine = "C5.0") %>%
  set_mode("classification")
# Decision Tree Model Specification (classification)
# 
# Computational engine: C5.0
```

---
class: middle, frame

# .center[To specify a model with parsnip]

```r
nearest_neighbor(engine = "kknn") %>%
  set_mode("classification")
# K-Nearest Neighbor Model Specification (classification)
# 
# Computational engine: kknn
```

---
class: middle, frame

.fade[
# .center[To specify a model with parsnip]
]

.right-column[

1\. Pick a .display[model] + .display[engine]

.fade[
2\. Set the .display[mode] (if needed)
]

]

---
class: middle, center

# 1\. Pick a .display[model] + .display[engine]

All available models are listed at <https://www.tidymodels.org/find/parsnip/>

<iframe src="https://www.tidymodels.org/find/parsnip/" width="504" height="400px"></iframe>

---
class: middle

.center[

# `logistic_reg()`

Specifies a model that uses logistic regression

]

```r
logistic_reg(engine = "glm", penalty = NULL, mixture = NULL)
```

---
class: middle

.center[

# `logistic_reg()`

Specifies a model that uses logistic regression

]

```r
logistic_reg(
  mode = "classification",  # "default" mode
  engine = "glm",           # default computational engine
  penalty = NULL,           # model hyper-parameter
  mixture = NULL            # model hyper-parameter
)
```

---
class: middle, frame

.fade[
# .center[To specify a model with parsnip]
]

.right-column[

.fade[
1\. Pick a .display[model] + .display[engine]
]

2\. Set the .display[mode] (if needed)

]

---
class: middle, center

# `set_mode()`

Sets the class of problem the model will solve, which influences which output is collected. Not necessary if mode is set in Step 1.

```r
logistic_reg() %>%
  set_mode(mode = "classification")
```

---
class: your-turn

# Your turn 1

Run the chunk in your .Rmd and look at the output. Then, copy/paste the code and edit to create:

+ a decision tree model for classification
+ that uses the `C5.0` engine.

Save it as `tree_mod` and look at the object. What is different about the output?

*Hint: you'll need https://www.tidymodels.org/find/parsnip/*
⏳ 03:00
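---
class: middle

# .center[A spec is not a fit]

A parsnip specification only stores instructions; nothing is estimated until it meets data. A minimal sketch of fitting the logistic regression spec, assuming the tidymodels meta-package is loaded and `alz` is in your session:

```r
library(tidymodels)

lr_mod <- logistic_reg(engine = "glm") %>%
  set_mode("classification")

# fit() pairs the spec with a formula and data,
# then estimates the coefficients via the glm engine
lr_fit <- fit(lr_mod, Class ~ tau, data = alz)
lr_fit
```

The printed coefficients should match the plain `glm()` call from a few slides back, since the spec delegates to the same engine.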
---

```r
lr_mod
# Logistic Regression Model Specification (classification)
# 
# Computational engine: glm

tree_mod <- decision_tree(engine = "C5.0") %>%
  set_mode("classification")
tree_mod
# Decision Tree Model Specification (classification)
# 
# Computational engine: C5.0
```

---
class: inverse, middle, center

## Now we've built a model.

--

## But, how do we *use* a model?

--

## First - what does it mean to use a model?

---
class: inverse, middle, center

![](https://media.giphy.com/media/fhAwk4DnqNgw8/giphy.gif)

Statistical models learn from the data. Many learn model parameters, which *can* be useful as values for inference and interpretation.

---

## Our data

<img src="figs/02/unnamed-chunk-17-1.png" width="504" style="display: block; margin: auto;" />

---

## "All models are wrong, but some are useful"

.pull-left[
<img src="figs/02/unnamed-chunk-19-1.png" width="504" style="display: block; margin: auto;" />
]

--

.pull-right[
```
#           Truth
# Prediction Impaired Control
#   Impaired       61      17
#   Control        30     225
```
]

---

## "All models are wrong, but some are useful"

.pull-left[
<img src="figs/02/unnamed-chunk-22-1.png" width="504" style="display: block; margin: auto;" />
]

--

.pull-right[
```
#           Truth
# Prediction Impaired Control
#   Impaired       65      22
#   Control        26     220
```
]

---
class: inverse, middle, center

# ♻️ Resample models

--

# with rsample

???

Enter the rsample package

---
class: middle, center, frame

# rsample

<iframe src="https://tidymodels.github.io/rsample/" width="100%" height="400px"></iframe>

---
class: middle, center, frame

# The holdout method

--

<img src="figs/02/all-split-1.png" width="864" style="display: block; margin: auto;" />

???

We refer to the group for which we know the outcome, and use to develop the algorithm, as the training set. We refer to the group for which we pretend we don't know the outcome as the test set.

---
class: center, middle

# `initial_split()`

"Splits" data randomly into a single testing and a single training set.

```r
initial_split(data, prop = 3/4, strata = NULL, breaks = 4)
```

---

```r
alz_split <- initial_split(alz, strata = Class, prop = .9)
alz_split
# <Analysis/Assess/Total>
# <300/33/333>
```

???

data splitting

---
class: center, middle

# `training()` and `testing()`

Extract training and testing sets from an rsplit

```r
training(alz_split)
testing(alz_split)
```

---

```r
alz_train <- training(alz_split)
alz_train
# # A tibble: 300 × 131
#    ACE_CD143_Angioten… ACTH_Adrenocortic…    AXL Adiponectin
#                  <dbl>              <dbl>  <dbl>       <dbl>
#  1               2.00              -1.39   1.10        -5.36
#  2               1.56              -1.39   0.683       -5.02
#  3               1.52              -1.71  -0.145       -5.81
#  4               1.68              -1.61   0.683       -5.12
#  5               2.40              -0.968  0.191       -4.78
#  6               0.431             -1.27  -0.222       -5.22
#  7               0.946             -1.90   0.530       -6.12
#  8               0.708             -1.83  -0.327       -4.88
#  9               1.11              -1.97   0.191       -5.17
# 10               1.60              -1.51   0.449       -5.57
# # … with 290 more rows, and 127 more variables:
# #   Alpha_1_Antichymotrypsin <dbl>,
# #   Alpha_1_Antitrypsin <dbl>, Alpha_1_Microglobulin <dbl>,
# #   Alpha_2_Macroglobulin <dbl>,
# #   Angiopoietin_2_ANG_2 <dbl>, …
```

---
class: your-turn

# Your turn 2

Fill in the blanks. Use `initial_split()`, `training()`, and `testing()` to:

1. Split **alz** into training and test sets. Save the rsplit!

2. Extract the training data and the testing data.

3. Check the proportions of the `Class` variable in each set.

Keep `set.seed(100)` at the start of your code.
⏳ 04:00
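---
class: middle

# .center[Hint: checking proportions]

For part 3 of the exercise, here is one possible sketch using dplyr's `count()` (hypothetical helper code; it assumes you saved `alz_train` and `alz_test` in parts 1 and 2):

```r
library(tidymodels)

# class proportions in the training set
alz_train %>%
  count(Class) %>%
  mutate(prop = n / sum(n))

# class proportions in the testing set
alz_test %>%
  count(Class) %>%
  mutate(prop = n / sum(n))
```

With `strata = Class`, the `Impaired`/`Control` mix in both sets should stay close to the original 27%/73% split.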
---

```r
set.seed(100) # Important!

alz_split <- initial_split(alz, strata = Class, prop = .9)
alz_train <- training(alz_split)
alz_test  <- testing(alz_split)
```

---
class: middle, center

# Data Splitting

--

<img src="figs/02/unnamed-chunk-32-1.png" width="720" style="display: block; margin: auto;" />

--

<img src="figs/02/unnamed-chunk-33-1.png" width="720" style="display: block; margin: auto;" />

--

<img src="figs/02/unnamed-chunk-34-1.png" width="720" style="display: block; margin: auto;" />

--

<img src="figs/02/unnamed-chunk-35-1.png" width="720" style="display: block; margin: auto;" />

--

<img src="figs/02/unnamed-chunk-36-1.png" width="720" style="display: block; margin: auto;" />

--

<img src="figs/02/unnamed-chunk-37-1.png" width="720" style="display: block; margin: auto;" />

--

<img src="figs/02/unnamed-chunk-38-1.png" width="720" style="display: block; margin: auto;" />

--

<img src="figs/02/unnamed-chunk-39-1.png" width="720" style="display: block; margin: auto;" />

---

<img src="figs/02/unnamed-chunk-40-1.png" width="1080" style="display: block; margin: auto;" />

--

<img src="figs/02/unnamed-chunk-41-1.png" width="1080" style="display: block; margin: auto;" />

--

<img src="figs/02/unnamed-chunk-42-1.png" width="1080" style="display: block; margin: auto;" />

--

<img src="figs/02/unnamed-chunk-43-1.png" width="1080" style="display: block; margin: auto;" />

--

<img src="figs/02/unnamed-chunk-44-1.png" width="1080" style="display: block; margin: auto;" />

--

<img src="figs/02/unnamed-chunk-45-1.png" width="1080" style="display: block; margin: auto;" />

--

<img src="figs/02/unnamed-chunk-46-1.png" width="1080" style="display: block; margin: auto;" />

--

<img src="figs/02/unnamed-chunk-47-1.png" width="1080" style="display: block; margin: auto;" />

--

.right[Mean metric]

---
background-image: url(images/diamonds.jpg)
background-size: contain
background-position: left
class: middle, center
background-color: #f5f5f5

.pull-right[
## The .display[testing set] is precious...

## we can only use it once!
]

---
background-image: url(images/diamonds.jpg)
background-size: contain
background-position: left
class: middle, center
background-color: #f5f5f5

.pull-right[
## How can we use the training set to compare, evaluate, and tune models?
]

---
background-image: url(https://www.tidymodels.org/start/resampling/img/resampling.svg)
background-size: 60%

---
class: middle, center, inverse

# Cross-validation

---
background-image: url(images/cross-validation/Slide2.png)
background-size: contain

---
background-image: url(images/cross-validation/Slide3.png)
background-size: contain

---
background-image: url(images/cross-validation/Slide4.png)
background-size: contain

---
background-image: url(images/cross-validation/Slide5.png)
background-size: contain

---
background-image: url(images/cross-validation/Slide6.png)
background-size: contain

---
background-image: url(images/cross-validation/Slide7.png)
background-size: contain

---
background-image: url(images/cross-validation/Slide8.png)
background-size: contain

---
background-image: url(images/cross-validation/Slide9.png)
background-size: contain

---
background-image: url(images/cross-validation/Slide10.png)
background-size: contain

---
background-image: url(images/cross-validation/Slide11.png)
background-size: contain

---
class: middle, center

# V-fold cross-validation

```r
vfold_cv(data, v = 10, repeats = 1, strata = NULL, breaks = 4, ...)
```
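---
class: middle

Each fold is an rsplit, just like `alz_split`. A quick sketch of peeking inside one, using rsample's `analysis()` and `assessment()` accessors (assumes `alz_train` from earlier):

```r
library(tidymodels)

set.seed(100)
folds <- vfold_cv(alz_train, v = 10)

first_split <- folds$splits[[1]]
dim(analysis(first_split))    # the rows a model is fit on
dim(assessment(first_split))  # the rows held out for this fold
```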
---
exclude: true

---
class: middle, center

# Guess

How many times does an observation/row appear in the assessment set?

<img src="figs/02/vfold-tiles-1.png" width="864" style="display: block; margin: auto;" />

---

<img src="figs/02/unnamed-chunk-49-1.png" width="864" style="display: block; margin: auto;" />

---
class: middle, center

# Quiz

If we use 10 folds, which percent of our data will end up in the analysis set and which percent in the assessment set for each fold?

--

90% - analysis

10% - assessment

---
class: middle, center, inverse

# Stratified sampling

---

## What if...

The assessment set looked like this?

<img src="figs/02/unnamed-chunk-51-1.png" width="504" style="display: block; margin: auto;" />

---

## Or this?

<img src="figs/02/unnamed-chunk-52-1.png" width="504" style="display: block; margin: auto;" />

---

<img src="figs/02/unnamed-chunk-53-1.png" width="504" style="display: block; margin: auto;" />

---
class: middle

.pull-left[
<img src="figs/02/unnamed-chunk-54-1.png" width="504" style="display: block; margin: auto;" />
]

.pull-right[

## Original

```
# Class        n percent
# Impaired    91   27.3%
# Control    242   72.7%
```

## Resample

```
# Class    Analysis Assessment
# Impaired   75.82%     24.18%
# Control         -          -
```
]

---
class: middle

.pull-left[
<img src="figs/02/unnamed-chunk-56-1.png" width="504" style="display: block; margin: auto;" />
]

.pull-right[

## Original

```
# Class        n percent
# Impaired    91   27.3%
# Control    242   72.7%
```

## Resample

```
# Class    Analysis Assessment
# Impaired   75.82%     24.18%
# Control    75.21%     24.79%
```
]

---
class: your-turn

# Your Turn 3

Run the code below. What does it return?

```r
set.seed(100)
alz_folds <- vfold_cv(alz_train, v = 10, strata = Class)
alz_folds
```
⏳ 01:00
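---
class: middle

Stratification can be verified directly. A sketch that computes the share of `Impaired` cases in each fold's assessment set with purrr (assumes `alz_train` from earlier; the fold object matches the one in the exercise):

```r
library(tidymodels)

set.seed(100)
alz_folds <- vfold_cv(alz_train, v = 10, strata = Class)

# proportion of Impaired cases held out in each fold
map_dbl(alz_folds$splits,
        ~ mean(assessment(.x)$Class == "Impaired"))
```

With stratification these proportions hover near the overall 27%; without it they can drift from fold to fold.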
---

```r
set.seed(100)
alz_folds <- vfold_cv(alz_train, v = 10, strata = Class)
alz_folds
# #  10-fold cross-validation using stratification 
# # A tibble: 10 × 2
#    splits           id    
#    <list>           <chr> 
#  1 <split [269/31]> Fold01
#  2 <split [269/31]> Fold02
#  3 <split [270/30]> Fold03
#  4 <split [270/30]> Fold04
#  5 <split [270/30]> Fold05
#  6 <split [270/30]> Fold06
#  7 <split [270/30]> Fold07
#  8 <split [270/30]> Fold08
#  9 <split [271/29]> Fold09
# 10 <split [271/29]> Fold10
```

---
class: middle

.center[

# `fit_resamples()`

Trains and tests a resampled model.

]

```r
tree_mod %>%
  fit_resamples(
    Class ~ tau + VEGF,
    resamples = alz_folds
  )
```

---

```r
tree_mod %>%
  fit_resamples(
    Class ~ tau + VEGF,
    resamples = alz_folds
  )
# # Resampling results
# # 10-fold cross-validation using stratification 
# # A tibble: 10 × 4
#    splits           id     .metrics         .notes          
#    <list>           <chr>  <list>           <list>          
#  1 <split [269/31]> Fold01 <tibble [2 × 4]> <tibble [0 × 1]>
#  2 <split [269/31]> Fold02 <tibble [2 × 4]> <tibble [0 × 1]>
#  3 <split [270/30]> Fold03 <tibble [2 × 4]> <tibble [0 × 1]>
#  4 <split [270/30]> Fold04 <tibble [2 × 4]> <tibble [0 × 1]>
#  5 <split [270/30]> Fold05 <tibble [2 × 4]> <tibble [0 × 1]>
#  6 <split [270/30]> Fold06 <tibble [2 × 4]> <tibble [0 × 1]>
#  7 <split [270/30]> Fold07 <tibble [2 × 4]> <tibble [0 × 1]>
#  8 <split [270/30]> Fold08 <tibble [2 × 4]> <tibble [0 × 1]>
#  9 <split [271/29]> Fold09 <tibble [2 × 4]> <tibble [0 × 1]>
# 10 <split [271/29]> Fold10 <tibble [2 × 4]> <tibble [0 × 1]>
```

---
class: middle, center

# `collect_metrics()`

Unnest the metrics column from a tidymodels `fit_resamples()`

```r
_results %>% collect_metrics(summarize = TRUE)
```

--

.footnote[`TRUE` is actually the default; averages across folds]

---

.pull-left[

```r
tree_fit <- tree_mod %>%
  fit_resamples(
    Class ~ tau + VEGF,
    resamples = alz_folds
  )

collect_metrics(tree_fit)
# # A tibble: 2 × 6
#   .metric  .estimator  mean     n std_err .config           
#   <chr>    <chr>      <dbl> <int>   <dbl> <chr>             
# 1 accuracy binary     0.820    10  0.0204 Preprocessor1_Mod…
# 2 roc_auc  binary     0.786    10  0.0180 Preprocessor1_Mod…
```
]

--

.pull-right[

```r
collect_metrics(tree_fit, summarize = FALSE)
# # A tibble: 20 × 5
#    id     .metric  .estimator .estimate .config             
#    <chr>  <chr>    <chr>          <dbl> <chr>               
#  1 Fold01 accuracy binary         0.774 Preprocessor1_Model1
#  2 Fold01 roc_auc  binary         0.692 Preprocessor1_Model1
#  3 Fold02 accuracy binary         0.839 Preprocessor1_Model1
#  4 Fold02 roc_auc  binary         0.848 Preprocessor1_Model1
#  5 Fold03 accuracy binary         0.867 Preprocessor1_Model1
#  6 Fold03 roc_auc  binary         0.852 Preprocessor1_Model1
#  7 Fold04 accuracy binary         0.8   Preprocessor1_Model1
#  8 Fold04 roc_auc  binary         0.795 Preprocessor1_Model1
#  9 Fold05 accuracy binary         0.767 Preprocessor1_Model1
# 10 Fold05 roc_auc  binary         0.744 Preprocessor1_Model1
# # … with 10 more rows
```
]

---
class: middle, center, frame

# 10-fold CV

### 10 different analysis/assessment sets

### 10 different models (trained on .display[analysis] sets)

### 10 different sets of performance statistics (on .display[assessment] sets)

---
class: middle, center

<iframe src="https://tidymodels.github.io/yardstick/articles/metric-types.html#metrics" width="504" height="400px"></iframe>

<https://tidymodels.github.io/yardstick/articles/metric-types.html#metrics>

---
class: middle, center

# `roc_curve()`

Takes predictions from `fit_resamples()` run with `control_resamples(save_pred = TRUE)`. Returns a tibble of sensitivity and specificity at each probability threshold.

```r
roc_curve(data, truth = Class, estimate = .pred_Impaired)
```

Truth = .display[true] class

Estimate = .display[predicted] probability of the target response
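---
class: middle

For classification, `fit_resamples()` computes accuracy and ROC AUC by default. A sketch of requesting other yardstick metrics instead, via the `metrics` argument (assumes `tree_mod` and `alz_folds` from earlier; `sens` and `spec` are yardstick's sensitivity and specificity):

```r
library(tidymodels)

tree_mod %>%
  fit_resamples(
    Class ~ tau + VEGF,
    resamples = alz_folds,
    metrics = metric_set(roc_auc, sens, spec)
  ) %>%
  collect_metrics()
```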
---

```r
tree_preds <- tree_mod %>%
  fit_resamples(
    Class ~ tau + VEGF,
    resamples = alz_folds,
    control = control_resamples(save_pred = TRUE)
  )

tree_preds %>%
  collect_predictions() %>%
  roc_curve(truth = Class, estimate = .pred_Impaired)
# # A tibble: 40 × 3
#    .threshold specificity sensitivity
#         <dbl>       <dbl>       <dbl>
#  1  -Inf           0            1    
#  2     0.0703      0            1    
#  3     0.0708      0.0688       0.976
#  4     0.0713      0.124        0.963
#  5     0.0767      0.183        0.951
#  6     0.0772      0.248        0.939
#  7     0.0925      0.317        0.927
#  8     0.0960      0.344        0.890
#  9     0.0998      0.408        0.854
# 10     0.102       0.454        0.829
# # … with 30 more rows
```

---

## Area under the curve

.pull-left[
<img src="images/roc-pretty-good.png" width="1037" style="display: block; margin: auto;" />
]

.pull-right[
* AUC = 0.5: random guessing
* AUC = 1: perfect classifier
* In general, an AUC above 0.8 is considered "good"
]

---

<img src="images/roc-guessing.png" width="80%" style="display: block; margin: auto;" />

---

<img src="images/roc-perfect.png" width="80%" style="display: block; margin: auto;" />

---

<img src="images/roc-bad.png" width="80%" style="display: block; margin: auto;" />

---

<img src="images/roc-ok.png" width="80%" style="display: block; margin: auto;" />

---

<img src="images/roc-pretty-good.png" width="80%" style="display: block; margin: auto;" />

---
class: your-turn

# Your turn 4

Add `autoplot()` to the pipeline to visualize the ROC curve.

---

```r
tree_preds %>%
  collect_predictions() %>%
  roc_curve(truth = Class, estimate = .pred_Impaired) %>%
  autoplot()
```

<img src="figs/02/unnamed-chunk-75-1.png" width="504" style="display: block; margin: auto;" />
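---
class: middle

The area under that curve can also be computed as a single number. A closing sketch with yardstick's `roc_auc()`, reusing the saved predictions from above:

```r
tree_preds %>%
  collect_predictions() %>%
  roc_auc(truth = Class, .pred_Impaired)
```

The estimate should land near the resampled `roc_auc` value from `collect_metrics()` earlier.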