class: title-slide, center, bottom

# The forest for the trees

## Tidymodels, virtually — Session 04

### Alison Hill

---
class: middle, frame, center

# Decision Trees

To predict the outcome of a new data point:

Uses rules learned from splits

Each split maximizes information gain

---
class: middle, center

![](https://media.giphy.com/media/gj4ZruUQUnpug/source.gif)

---

<img src="figs/04/unnamed-chunk-2-1.png" width="504" style="display: block; margin: auto;" />

---

<img src="figs/04/unnamed-chunk-3-1.png" width="504" style="display: block; margin: auto;" />

---
class: middle, center

# Quiz

How do we assess predictions here?

--

RMSE

---

<img src="figs/04/rt-test-resid-1.png" width="504" style="display: block; margin: auto;" />

---
class: middle, center

<img src="https://raw.githubusercontent.com/EmilHvitfeldt/blog/master/static/blog/2019-08-09-authorship-classification-with-tidymodels-and-textrecipes_files/figure-html/unnamed-chunk-18-1.png" width="70%" style="display: block; margin: auto;" />

https://www.hvitfeldt.me/blog/authorship-classification-with-tidymodels-and-textrecipes/

---
class: middle, center

<img src="https://www.kaylinpavlik.com/content/images/2019/12/dt-1.png" width="50%" style="display: block; margin: auto;" />

https://www.kaylinpavlik.com/classifying-songs-genres/

---
class: middle, center

<img src="https://a3.typepad.com/6a0105360ba1c6970c01b7c95c61fb970b-pi" width="40%" style="display: block; margin: auto;" />

.footnote[[tweetbotornot2](https://github.com/mkearney/tweetbotornot2)]

---
name: guess-the-animal
class: middle, center, inverse

<img src="http://www.atarimania.com/8bit/screens/guess_the_animal.gif" width="100%" style="display: block; margin: auto;" />

---
class: middle, center

# What makes a good guesser?

--

High information gain per question (can it fly?)

--

Clear features (feathers vs. is it "small"?)

--

Order matters

---
background-image: url(images/aus-standard-animals.png)
background-size: cover

.footnote[[Australian Computing Academy](https://aca.edu.au/resources/decision-trees-classifying-animals/)]

---
background-image: url(images/aus-standard-tree.png)
background-size: cover

.footnote[[Australian Computing Academy](https://aca.edu.au/resources/decision-trees-classifying-animals/)]

---
background-image: url(images/annotated-tree/annotated-tree.001.png)
background-size: cover

---
background-image: url(images/annotated-tree/annotated-tree.002.png)
background-size: cover

---
background-image: url(images/annotated-tree/annotated-tree.003.png)
background-size: cover

---
background-image: url(images/annotated-tree/annotated-tree.004.png)
background-size: cover

---
background-image: url(images/annotated-tree/annotated-tree.005.png)
background-size: cover

---
class: middle, frame

# .center[To specify a model with parsnip]

.right-column[

1\. Pick a .display[model] + .display[engine]

2\. Set the .display[mode] (if needed)
]
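---
class: middle

For example, the same two steps with a model that isn't a tree (a sketch for illustration only; this model isn't used in this session):

```r
library(parsnip)

# 1. pick a model + engine
# 2. set the mode; logistic_reg() is classification-only,
#    so this step is optional here
lr_mod <- logistic_reg(engine = "glm")
```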
---
class: middle, frame

# .center[To specify a decision tree model with parsnip]

```r
tree_mod <- decision_tree(engine = "rpart") %>% 
  set_mode("classification")
```

---
class: middle, center

<img src="figs/04/alz-tree-01-1.png" width="40%" style="display: block; margin: auto;" />

```
# nn Class    Imp  Con                                      cover
#  4 Impaired [.82 .18] when tau >= 5.9        & VEGF <  17   19%
# 10 Impaired [.75 .25] when tau >= 6.7        & VEGF >= 17    4%
# 11 Control  [.16 .84] when tau is 5.9 to 6.7 & VEGF >= 17   19%
#  3 Control  [.10 .90] when tau <  5.9                       58%
```

---

.pull-left[
<img src="figs/04/unnamed-chunk-9-1.png" width="504" style="display: block; margin: auto;" />
]

.pull-right[
<img src="figs/04/unnamed-chunk-10-1.png" width="504" style="display: block; margin: auto;" />
]

---
class: your-turn

# Your turn 1

Here is our very-vanilla parsnip model specification for a decision tree (also in your Rmd)...

```r
tree_mod <- decision_tree(engine = "rpart") %>% 
  set_mode("classification")
```

And a workflow:

```r
tree_wf <- workflow() %>% 
  add_formula(Class ~ .) %>% 
  add_model(tree_mod)
```

For decision trees, no recipe really required 🎉

---
class: your-turn

# Your turn 1

Fill in the blanks to return the accuracy and ROC AUC for this model using 10-fold cross-validation.
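If you haven't created the resamples yet, here is one way to do it (a sketch: the training-data name `alz_train` and the stratification choice are assumptions, not shown on these slides):

```r
library(tidymodels)

# 10 folds, stratified by the outcome
set.seed(100)
alz_folds <- vfold_cv(alz_train, v = 10, strata = Class)
```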
02:00
---

```r
set.seed(100)
tree_wf %>% 
  fit_resamples(resamples = alz_folds) %>% 
  collect_metrics()

# # A tibble: 2 × 6
#   .metric  .estimator  mean     n std_err .config           
#   <chr>    <chr>      <dbl> <int>   <dbl> <chr>             
# 1 accuracy binary     0.756    10  0.0245 Preprocessor1_Mod…
# 2 roc_auc  binary     0.770    10  0.0255 Preprocessor1_Mod…
```

---
class: middle, center

# `args()`

Print the arguments for a **parsnip** model specification.

```r
args(decision_tree)
```

---
class: middle, center

# `decision_tree()`

Specifies a decision tree model

```r
decision_tree(engine = "rpart", tree_depth = 30, min_n = 20, cost_complexity = .01)
```

--

*either* mode works!

---
class: middle

.center[

# `decision_tree()`

Specifies a decision tree model

]

```r
decision_tree(
  engine = "rpart",       # default computational engine
  tree_depth = 30,        # max tree depth
  min_n = 20,             # smallest node allowed
  cost_complexity = .01   # 0 < cp < 0.1
)
```

---
class: middle, center

# `set_args()`

Change the arguments for a **parsnip** model specification.

```r
tree_mod %>% set_args(tree_depth = 3)
```

---
class: middle

```r
decision_tree(engine = "rpart") %>% 
  set_mode("classification") %>% 
*  set_args(tree_depth = 3)

# Decision Tree Model Specification (classification)
# 
# Main Arguments:
#   tree_depth = 3
# 
# Computational engine: rpart
```

---
class: middle

```r
*decision_tree(engine = "rpart", tree_depth = 3) %>% 
  set_mode("classification")

# Decision Tree Model Specification (classification)
# 
# Main Arguments:
#   tree_depth = 3
# 
# Computational engine: rpart
```

---
class: middle, center

# `tree_depth`

Cap the maximum tree depth.

A method to stop the tree early. Used to prevent overfitting.

```r
tree_mod %>% set_args(tree_depth = 30)
```

---
class: middle, center
exclude: true

---
class: middle, center

<img src="figs/04/unnamed-chunk-22-1.png" width="864" style="display: block; margin: auto;" />

---
class: middle, center

<img src="figs/04/unnamed-chunk-23-1.png" width="864" style="display: block; margin: auto;" />

---
class: middle, center

# `min_n`

Set minimum `n` to split at any node.

Another early stopping method. Used to prevent overfitting.

```r
tree_mod %>% set_args(min_n = 20)
```

---
class: middle, center

# Quiz

What value of `min_n` would lead to the *most overfit* tree?

--

`min_n` = 1

---
class: middle, center, frame

# Recap: early stopping

| `parsnip` arg | `rpart` arg | default | overfit? |
|---------------|-------------|:-------:|:--------:|
| `tree_depth`  | `maxdepth`  |   30    |    ⬆️    |
| `min_n`       | `minsplit`  |   20    |    ⬇️    |

---
class: middle, center

# `cost_complexity`

Adds a cost or penalty to the error rates of more complex trees.

A way to prune a tree. Used to prevent overfitting.

```r
tree_mod %>% set_args(cost_complexity = .01)
```

--

Closer to zero ➡️ larger trees. 

Higher penalty ➡️ smaller trees. 

---
class: middle, center

<img src="figs/04/unnamed-chunk-26-1.png" width="720" style="display: block; margin: auto;" />

---
class: middle, center

<img src="figs/04/unnamed-chunk-27-1.png" width="864" style="display: block; margin: auto;" />

---
class: middle, center

<img src="figs/04/unnamed-chunk-28-1.png" width="864" style="display: block; margin: auto;" />

---
name: bonsai
background-image: url(images/kari-shea-AVqh83jStMA-unsplash.jpg)
background-position: left
background-size: contain
class: middle

---
template: bonsai

.pull-right[

# Consider the bonsai

1. Small pot

1. Strong shears
]

---
template: bonsai

.pull-right[

# Consider the bonsai

1. ~~Small pot~~ .display[Early stopping]

1. ~~Strong shears~~ .display[Pruning]
]
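---
class: middle

Both levers map onto arguments we have already seen. A sketch combining them (the values here are illustrative, not recommendations):

```r
tree_mod %>% 
  set_args(
    tree_depth = 10,       # early stopping: cap the depth
    min_n = 40,            # early stopping: require bigger nodes
    cost_complexity = .1   # pruning: heavier complexity penalty
  )
```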
---
class: middle, center, frame

# Recap: early stopping & pruning

| `parsnip` arg     | `rpart` arg | default | overfit? |
|-------------------|-------------|:-------:|:--------:|
| `tree_depth`      | `maxdepth`  |   30    |    ⬆️    |
| `min_n`           | `minsplit`  |   20    |    ⬇️    |
| `cost_complexity` | `cp`        |   .01   |    ⬇️    |

---
class: middle, center

| engine | parsnip           | original   |
|--------|-------------------|------------|
| rpart  | tree_depth        | maxdepth   |
| rpart  | min_n             | minsplit   |
| rpart  | cost_complexity   | cp         |

<https://rdrr.io/cran/rpart/man/rpart.control.html>

---
class: middle, center

<img src="figs/04/unnamed-chunk-30-1.png" width="504" style="display: block; margin: auto;" />

---

<img src="figs/04/unnamed-chunk-31-1.png" width="504" style="display: block; margin: auto;" />

---

<img src="figs/04/unnamed-chunk-32-1.png" width="504" style="display: block; margin: auto;" />

---

<img src="figs/04/unnamed-chunk-33-1.png" width="504" style="display: block; margin: auto;" />

---
class: middle, center

<img src="figs/04/unnamed-chunk-34-1.png" width="504" style="display: block; margin: auto;" />

---
class: middle, center

<img src="figs/04/unnamed-chunk-35-1.png" width="504" style="display: block; margin: auto;" />

---

<img src="figs/rmed02-workflows/big-alz-tree-1.png" width="672" style="display: block; margin: auto;" />

---
class: middle, frame, center

# Axiom

There is an inverse relationship between model *accuracy* and model *interpretability*.

---
class: middle, center

# `rand_forest()`

Specifies a random forest model

```r
rand_forest(mtry = 4, trees = 500, min_n = 1)
```

--

*either* mode works!

---
class: middle

.center[

# `rand_forest()`

Specifies a random forest model

]

```r
rand_forest(
  engine = "ranger",  # default computational engine
  mtry = 4,           # predictors seen at each node
  trees = 500,        # trees per forest
  min_n = 1           # smallest node allowed
)
```

---
class: your-turn

# Your turn 2

Create a new parsnip model called `rf_mod`, which will learn an ensemble of classification trees from our training data using the **ranger** engine. Update your `tree_wf` with this new model.

Fit your workflow with 10-fold cross-validation and compare the ROC AUC of the random forest to that of your single decision tree model. Which predicts the test set better?

*Hint: you'll need https://www.tidymodels.org/find/parsnip/*
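You can also list the available engines without leaving R (a quick sketch using parsnip's `show_engines()`):

```r
library(parsnip)

# engines and modes registered for rand_forest()
show_engines("rand_forest")
```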
04:00
---

```r
rf_mod <- rand_forest(engine = "ranger") %>% 
  set_mode("classification")

rf_wf <- tree_wf %>% 
  update_model(rf_mod)

set.seed(100)
rf_wf %>% 
  fit_resamples(resamples = alz_folds) %>% 
  collect_metrics()

# # A tibble: 2 × 6
#   .metric  .estimator  mean     n std_err .config           
#   <chr>    <chr>      <dbl> <int>   <dbl> <chr>             
# 1 accuracy binary     0.837    10  0.0172 Preprocessor1_Mod…
# 2 roc_auc  binary     0.886    10  0.0171 Preprocessor1_Mod…
```

---
class: middle, center

# `mtry`

The number of predictors that will be randomly sampled at each split when creating the tree models.

```r
rand_forest(mtry = 11)
```

**ranger** default = `floor(sqrt(num_predictors))`
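---
class: middle, center

For example, with 10 predictors (a hypothetical count, just to make the default concrete):

```r
num_predictors <- 10          # hypothetical number of predictors
floor(sqrt(num_predictors))   # ranger would sample 3 per split
```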