class: title-slide, center, bottom

# The forest for the trees

## Tidymodels, virtually — Session 04

### Alison Hill

---
class: middle, frame, center

# Decision Trees

To predict the outcome of a new data point:

Uses rules learned from splits

Each split maximizes information gain

---
class: middle, center

![](https://media.giphy.com/media/gj4ZruUQUnpug/source.gif)

---

<img src="figs/04/unnamed-chunk-2-1.png" width="504" style="display: block; margin: auto;" />

---

<img src="figs/04/unnamed-chunk-3-1.png" width="504" style="display: block; margin: auto;" />

---
class: middle, center

# Quiz

How do we assess predictions here?

--

RMSE

---

<img src="figs/04/rt-test-resid-1.png" width="504" style="display: block; margin: auto;" />

---
class: middle, center

<img src="https://raw.githubusercontent.com/EmilHvitfeldt/blog/master/static/blog/2019-08-09-authorship-classification-with-tidymodels-and-textrecipes_files/figure-html/unnamed-chunk-18-1.png" width="70%" style="display: block; margin: auto;" />

https://www.hvitfeldt.me/blog/authorship-classification-with-tidymodels-and-textrecipes/

---
class: middle, center

<img src="https://www.kaylinpavlik.com/content/images/2019/12/dt-1.png" width="50%" style="display: block; margin: auto;" />

https://www.kaylinpavlik.com/classifying-songs-genres/

---
class: middle, center

<img src="https://a3.typepad.com/6a0105360ba1c6970c01b7c95c61fb970b-pi" width="40%" style="display: block; margin: auto;" />

.footnote[[tweetbotornot2](https://github.com/mkearney/tweetbotornot2)]

---
name: guess-the-animal
class: middle, center, inverse

<img src="http://www.atarimania.com/8bit/screens/guess_the_animal.gif" width="100%" style="display: block; margin: auto;" />

---
class: middle, center

# What makes a good guesser?

--

High information gain per question (can it fly?)

--

Clear features (feathers vs. is it "small"?)

--

Order matters

---
background-image: url(images/aus-standard-animals.png)
background-size: cover

.footnote[[Australian Computing Academy](https://aca.edu.au/resources/decision-trees-classifying-animals/)]

---
background-image: url(images/aus-standard-tree.png)
background-size: cover

.footnote[[Australian Computing Academy](https://aca.edu.au/resources/decision-trees-classifying-animals/)]

---
background-image: url(images/annotated-tree/annotated-tree.001.png)
background-size: cover

---
background-image: url(images/annotated-tree/annotated-tree.002.png)
background-size: cover

---
background-image: url(images/annotated-tree/annotated-tree.003.png)
background-size: cover

---
background-image: url(images/annotated-tree/annotated-tree.004.png)
background-size: cover

---
background-image: url(images/annotated-tree/annotated-tree.005.png)
background-size: cover

---
class: middle, frame

# .center[To specify a model with parsnip]

.right-column[

1\. Pick a .display[model] + .display[engine]

2\. Set the .display[mode] (if needed)
]
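---
class: middle

For example, the same two steps with a model that isn't a tree (a sketch for illustration only; this model isn't used in this session):

```r
library(parsnip)

# 1. pick a model + engine
# 2. set the mode; logistic_reg() is classification-only,
#    so this step is optional here
lr_mod <- logistic_reg(engine = "glm")
```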
---
class: middle, frame

# .center[To specify a decision tree model with parsnip]

```r
tree_mod <- decision_tree(engine = "rpart") %>% 
  set_mode("classification")
```

---
class: middle, center

<img src="figs/04/alz-tree-01-1.png" width="40%" style="display: block; margin: auto;" />

```
# nn Class    Imp  Con                                      cover
#  4 Impaired [.82 .18] when tau >= 5.9        & VEGF <  17   19%
# 10 Impaired [.75 .25] when tau >= 6.7        & VEGF >= 17    4%
# 11 Control  [.16 .84] when tau is 5.9 to 6.7 & VEGF >= 17   19%
#  3 Control  [.10 .90] when tau <  5.9                       58%
```

---

.pull-left[
<img src="figs/04/unnamed-chunk-9-1.png" width="504" style="display: block; margin: auto;" />
]

.pull-right[
<img src="figs/04/unnamed-chunk-10-1.png" width="504" style="display: block; margin: auto;" />
]

---
class: your-turn

# Your turn 1

Here is our very-vanilla parsnip model specification for a decision tree (also in your Rmd)...

```r
tree_mod <- decision_tree(engine = "rpart") %>% 
  set_mode("classification")
```

And a workflow:

```r
tree_wf <- workflow() %>% 
  add_formula(Class ~ .) %>% 
  add_model(tree_mod)
```

For decision trees, no recipe really required 🎉

---
class: your-turn

# Your turn 1

Fill in the blanks to return the accuracy and ROC AUC for this model using 10-fold cross-validation.
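If you haven't created the resamples yet, here is one way to do it (a sketch: the training-data name `alz_train` and the stratification choice are assumptions, not shown on these slides):

```r
library(tidymodels)

# 10 folds, stratified by the outcome
set.seed(100)
alz_folds <- vfold_cv(alz_train, v = 10, strata = Class)
```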
02:00
---

```r
set.seed(100)
tree_wf %>% 
  fit_resamples(resamples = alz_folds) %>% 
  collect_metrics()

# # A tibble: 2 × 6
#   .metric  .estimator  mean     n std_err .config           
#   <chr>    <chr>      <dbl> <int>   <dbl> <chr>             
# 1 accuracy binary     0.756    10  0.0245 Preprocessor1_Mod…
# 2 roc_auc  binary     0.770    10  0.0255 Preprocessor1_Mod…
```

---
class: middle, center

# `args()`

Print the arguments for a **parsnip** model specification.

```r
args(decision_tree)
```

---
class: middle, center

# `decision_tree()`

Specifies a decision tree model

```r
decision_tree(engine = "rpart", tree_depth = 30, min_n = 20, cost_complexity = .01)
```

--

*either* mode works!

---
class: middle

.center[

# `decision_tree()`

Specifies a decision tree model

]

```r
decision_tree(
  engine = "rpart",       # default computational engine
  tree_depth = 30,        # max tree depth
  min_n = 20,             # smallest node allowed
  cost_complexity = .01   # 0 < cp < 0.1
)
```

---
class: middle, center

# `set_args()`

Change the arguments for a **parsnip** model specification.

```r
tree_mod %>% set_args(tree_depth = 3)
```

---
class: middle

```r
decision_tree(engine = "rpart") %>% 
  set_mode("classification") %>% 
*  set_args(tree_depth = 3)

# Decision Tree Model Specification (classification)
# 
# Main Arguments:
#   tree_depth = 3
# 
# Computational engine: rpart
```

---
class: middle

```r
*decision_tree(engine = "rpart", tree_depth = 3) %>% 
  set_mode("classification")

# Decision Tree Model Specification (classification)
# 
# Main Arguments:
#   tree_depth = 3
# 
# Computational engine: rpart
```

---
class: middle, center

# `tree_depth`

Cap the maximum tree depth.

A method to stop the tree early. Used to prevent overfitting.

```r
tree_mod %>% set_args(tree_depth = 30)
```

---
class: middle, center
exclude: true

---
class: middle, center

<img src="figs/04/unnamed-chunk-22-1.png" width="864" style="display: block; margin: auto;" />

---
class: middle, center

<img src="figs/04/unnamed-chunk-23-1.png" width="864" style="display: block; margin: auto;" />

---
class: middle, center

# `min_n`

Set minimum `n` to split at any node.

Another early stopping method. Used to prevent overfitting.

```r
tree_mod %>% set_args(min_n = 20)
```

---
class: middle, center

# Quiz

What value of `min_n` would lead to the *most overfit* tree?

--

`min_n` = 1

---
class: middle, center, frame

# Recap: early stopping

| `parsnip` arg | `rpart` arg | default | overfit? |
|---------------|-------------|:-------:|:--------:|
| `tree_depth`  | `maxdepth`  |   30    |    ⬆️    |
| `min_n`       | `minsplit`  |   20    |    ⬇️    |

---
class: middle, center

# `cost_complexity`

Adds a cost or penalty to the error rates of more complex trees.

A way to prune a tree. Used to prevent overfitting.

```r
tree_mod %>% set_args(cost_complexity = .01)
```

--

Closer to zero ➡️ larger trees. 

Higher penalty ➡️ smaller trees. 

---
class: middle, center

<img src="figs/04/unnamed-chunk-26-1.png" width="720" style="display: block; margin: auto;" />

---
class: middle, center

<img src="figs/04/unnamed-chunk-27-1.png" width="864" style="display: block; margin: auto;" />

---
class: middle, center

<img src="figs/04/unnamed-chunk-28-1.png" width="864" style="display: block; margin: auto;" />

---
name: bonsai
background-image: url(images/kari-shea-AVqh83jStMA-unsplash.jpg)
background-position: left
background-size: contain
class: middle

---
template: bonsai

.pull-right[

# Consider the bonsai

1. Small pot

1. Strong shears
]

---
template: bonsai

.pull-right[

# Consider the bonsai

1. ~~Small pot~~ .display[Early stopping]

1. ~~Strong shears~~ .display[Pruning]
]
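---
class: middle

Both levers map onto arguments we have already seen. A sketch combining them (the values here are illustrative, not recommendations):

```r
tree_mod %>% 
  set_args(
    tree_depth = 10,       # early stopping: cap the depth
    min_n = 40,            # early stopping: require bigger nodes
    cost_complexity = .1   # pruning: heavier complexity penalty
  )
```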
---
class: middle, center, frame

# Recap: early stopping & pruning

| `parsnip` arg     | `rpart` arg | default | overfit? |
|-------------------|-------------|:-------:|:--------:|
| `tree_depth`      | `maxdepth`  |   30    |    ⬆️    |
| `min_n`           | `minsplit`  |   20    |    ⬇️    |
| `cost_complexity` | `cp`        |   .01   |    ⬇️    |

---
class: middle, center

| engine | parsnip           | original   |
|--------|-------------------|------------|
| rpart  | tree_depth        | maxdepth   |
| rpart  | min_n             | minsplit   |
| rpart  | cost_complexity   | cp         |

<https://rdrr.io/cran/rpart/man/rpart.control.html>

---
class: middle, center

<img src="figs/04/unnamed-chunk-30-1.png" width="504" style="display: block; margin: auto;" />

---

<img src="figs/04/unnamed-chunk-31-1.png" width="504" style="display: block; margin: auto;" />

---

<img src="figs/04/unnamed-chunk-32-1.png" width="504" style="display: block; margin: auto;" />

---

<img src="figs/04/unnamed-chunk-33-1.png" width="504" style="display: block; margin: auto;" />

---
class: middle, center

<img src="figs/04/unnamed-chunk-34-1.png" width="504" style="display: block; margin: auto;" />

---
class: middle, center

<img src="figs/04/unnamed-chunk-35-1.png" width="504" style="display: block; margin: auto;" />

---

<img src="figs/rmed02-workflows/big-alz-tree-1.png" width="672" style="display: block; margin: auto;" />

---
class: middle, frame, center

# Axiom

There is an inverse relationship between model *accuracy* and model *interpretability*.

---
class: middle, center

# `rand_forest()`

Specifies a random forest model

```r
rand_forest(mtry = 4, trees = 500, min_n = 1)
```

--

*either* mode works!

---
class: middle

.center[

# `rand_forest()`

Specifies a random forest model

]

```r
rand_forest(
  engine = "ranger",  # default computational engine
  mtry = 4,           # predictors seen at each node
  trees = 500,        # trees per forest
  min_n = 1           # smallest node allowed
)
```

---
class: your-turn

# Your turn 2

Create a new parsnip model called `rf_mod`, which will learn an ensemble of classification trees from our training data using the **ranger** engine. Update your `tree_wf` with this new model.

Fit your workflow with 10-fold cross-validation and compare the ROC AUC of the random forest to that of your single decision tree model. Which predicts the test set better?

*Hint: you'll need https://www.tidymodels.org/find/parsnip/*
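You can also list the available engines without leaving R (a quick sketch using parsnip's `show_engines()`):

```r
library(parsnip)

# engines and modes registered for rand_forest()
show_engines("rand_forest")
```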
04:00
---

```r
rf_mod <- rand_forest(engine = "ranger") %>% 
  set_mode("classification")

rf_wf <- tree_wf %>% 
  update_model(rf_mod)

set.seed(100)
rf_wf %>% 
  fit_resamples(resamples = alz_folds) %>% 
  collect_metrics()

# # A tibble: 2 × 6
#   .metric  .estimator  mean     n std_err .config           
#   <chr>    <chr>      <dbl> <int>   <dbl> <chr>             
# 1 accuracy binary     0.837    10  0.0172 Preprocessor1_Mod…
# 2 roc_auc  binary     0.886    10  0.0171 Preprocessor1_Mod…
```

---
class: middle, center

# `mtry`

The number of predictors that will be randomly sampled at each split when creating the tree models.

```r
rand_forest(mtry = 11)
```

**ranger** default = `floor(sqrt(num_predictors))`
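---
class: middle, center

For example, with 10 predictors (a hypothetical count, just to make the default concrete):

```r
num_predictors <- 10          # hypothetical number of predictors
floor(sqrt(num_predictors))   # ranger would sample 3 per split
```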