class: title-slide, center, bottom # A bird's eye view ## Tidymodels, Virtually — Session 01 ### Alison Hill --- class: middle, center, inverse ## .big-text[Hello.] --- name: hello class: middle, center, inverse ### Alison Hill <img style="border-radius: 50%;" src="https://www.apreshill.com/about/sidebar/avatar.jpg" width="150px"/> [
@apreshill](https://github.com/apreshill) [
@apreshill](https://twitter.com/apreshill) [
apreshill.com](https://www.apreshill.com) --- class: center, middle, inverse # What is Machine Learning? ??? Machine Learning is usually thought of as a subfield of artificial intelligence that itself contains other hot sub-fields. Let's start somewhere familiar. I have a data set and I want to analyze it. The actual data set is named `penguins` and it comes in the `palmerpenguins` R package. No need to open your computers. Let's just discuss for a few minutes. --- class: middle, center # What is machine learning? -- <img src="https://imgs.xkcd.com/comics/machine_learning.png" style="display: block; margin: auto;" /> --- class: top, center background-image: url(images/intro.002.jpeg) background-size: cover --- class: top, center background-image: url(images/intro.003.jpeg) background-size: cover --- class: top, center background-image: url(images/all-of-ml.jpg) background-size: contain .footnote[Credit: <https://vas3k.com/blog/machine_learning/>] --- class: middle, center, frame # Two modes -- .pull-left[ ## Classification ] -- .pull-right[ ## Regression ] --- class: middle, center, frame # Two cultures .pull-left[ ### Statistics ] .pull-right[ ### Machine Learning ] --- class: middle, center, frame # Which is which? .pull-left[ model first inference emphasis ] .pull-right[ data first prediction emphasis ] --- class: middle, center, frame # Which is which? 
.pull-left[ model first inference emphasis ### Statistics ] .pull-right[ data first prediction emphasis ### Machine Learning ] --- name: train-love background-image: url(images/train.jpg) background-size: contain background-color: #f6f6f6 --- template: train-love class: center, top # Statistics --- template: train-love class: bottom > *"Statisticians, like artists, have the bad habit of falling in love with their models."* > > — George Box --- class: freight-slide, center, inverse # Predictive modeling --- class: middle, inverse, center # Schedule for Today -- 01 - A bird's eye view -- 02 - Build a useful model -- 03 - Build better training data -- 04 - The forest for the trees -- 05 - Build a fine-tuned model -- 06 - Case study --- class: inverse, middle, center # tidymodels --- background-image: url(images/tm-org.png) background-size: contain --- class: middle ```r library(tidymodels) ``` --- class: inverse, middle, center # To the workshop! 🚀 Slides: https://apreshill.github.io/tidymodels-it/ RStudio Cloud: https://bit.ly/tidymodels-it --- ```r # pick model rt_mod <- decision_tree(engine = "rpart") %>% set_mode("regression") rt_mod %>% # fit fit(body_mass_g ~ ., data = penguins) %>% # predict predict(new_data = penguins) %>% mutate(body_mass_g = penguins$body_mass_g) %>% # compare rmse(truth = body_mass_g, estimate = .pred) # # A tibble: 1 × 3 # .metric .estimator .estimate # <chr> <chr> <dbl> # 1 rmse standard 311. 
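# NOTE (aside, not in the original deck): the RMSE above is computed on
# the same rows used to fit the model, so it is an optimistic in-sample
# estimate; the resampling approach on the following slides gives a more
# honest one.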
``` --- ```r # holdout method peng_split <- initial_split(penguins, strata = species) peng_train <- training(peng_split) peng_test <- testing(peng_split) # add cross-validation peng_folds <- vfold_cv(data = peng_train, strata = "species") # pick model rt_mod <- decision_tree(engine = "rpart") %>% set_mode("regression") # here comes the actual ML bit… fit_resamples( rt_mod, preprocessor = body_mass_g ~ ., resamples = peng_folds ) %>% collect_metrics() # # A tibble: 2 × 6 # .metric .estimator mean n std_err .config # <chr> <chr> <dbl> <int> <dbl> <chr> # 1 rmse standard 317. 10 14.0 Preprocessor1_Mo… # 2 rsq standard 0.840 10 0.0200 Preprocessor1_Mo… ``` --- ```r # holdout method peng_split <- initial_split(penguins, strata = species) peng_train <- training(peng_split) peng_test <- testing(peng_split) # add cross-validation peng_folds <- vfold_cv(data = peng_train, strata = "species") *# pick model *rt_mod <- * parsnip::decision_tree(engine = "rpart") %>% * parsnip::set_mode("regression") # here comes the actual ML bit… fit_resamples( rt_mod, preprocessor = body_mass_g ~ ., resamples = peng_folds ) %>% collect_metrics() ``` --- class: inverse # Pick a model with parsnip --- class: middle, center # Quiz How many ways can you think of in R to do some type of linear regression? -- `lm` for the general linear model -- `glmnet` for regularized regression -- `stan` for Bayesian regression -- `keras` for regression using tensorflow -- `spark` for large data sets -- ... --- class: middle, frame # .center[To specify a model with parsnip] .right-column[ 1\. Pick a .display[model] + .display[engine] 2\. 
Set the .display[mode] (if needed) ] --- class: middle, frame # .center[To specify a model with parsnip] ```r decision_tree(engine = "C5.0") %>% set_mode("classification") ``` --- class: middle, frame # .center[To specify a model with parsnip] ```r nearest_neighbor(engine = "kknn") %>% set_mode("regression") ``` --- class: middle, frame .fade[ # .center[To specify a model with parsnip] ] .right-column[ 1\. Pick a .display[model] + .display[engine] .fade[ 2\. Set the .display[mode] (if needed) ] ] --- class: middle, center # 1\. Pick a .display[model] + .display[engine] All available models are listed at <https://www.tidymodels.org/find/parsnip/> <iframe src="https://www.tidymodels.org/find/parsnip/" width="504" height="400px"></iframe> --- class: middle .center[ # `linear_reg()` Specifies a model that uses linear regression ] ```r linear_reg(mode = "regression", engine = "lm", penalty = NULL, mixture = NULL) ``` --- class: middle .center[ # `linear_reg()` Specifies a model that uses linear regression ] ```r linear_reg( mode = "regression", # "default" mode, if one exists engine = "lm", # default computational engine penalty = NULL, # model hyper-parameter mixture = NULL # model hyper-parameter ) ``` --- class: middle, frame .fade[ # .center[To specify a model with parsnip] ] .right-column[ .fade[ 1\. Pick a .display[model] + .display[engine] ] 2\. Set the .display[mode] (if needed) ] --- class: middle, center # `set_mode()` Sets the class of problem the model will solve, which influences which output is collected. Not necessary if the mode was set in Step 1. 
```r lm_mod %>% set_mode(mode = "regression") ``` --- class: middle .pull-left[ ```r *# pick model *lm_mod <- * parsnip::linear_reg(engine = "lm") %>% * parsnip::set_mode("regression") lm_mod %>% fit(body_mass_g ~ flipper_length_mm, data = penguins) # parsnip model object # # Fit time: 3ms # # Call: # stats::lm(formula = body_mass_g ~ flipper_length_mm, data = data) # # Coefficients: # (Intercept) flipper_length_mm # -5872.09 50.15 ``` ] .pull-right[ <img src="figs/01/penguins-lm-1.png" width="504" style="display: block; margin: auto;" /> ] --- class: middle, center, inverse # A model doesn't have to be a straight line! --- class: middle .pull-left[ ```r *# pick model *rt_mod <- * parsnip::decision_tree(engine = "rpart") %>% * parsnip::set_mode("regression") rt_mod %>% fit(body_mass_g ~ flipper_length_mm, data = penguins) # parsnip model object # # Fit time: 4ms # n= 333 # # node), split, n, deviance, yval # * denotes terminal node # # 1) root 333 215259700 4207.057 # 2) flipper_length_mm< 206.5 208 38996800 3702.524 # 4) flipper_length_mm< 193.5 129 19162970 3555.039 * # 5) flipper_length_mm>=193.5 79 12445890 3943.354 * # 3) flipper_length_mm>=206.5 125 35211680 5046.600 # 6) flipper_length_mm< 214.5 49 8152398 4614.796 * # 7) flipper_length_mm>=214.5 76 12032500 5325.000 # 14) flipper_length_mm< 220.5 41 5359848 5134.756 * # 15) flipper_length_mm>=220.5 35 3450464 5547.857 * ``` ] -- .pull-right[ <img src="figs/01/penguins-rt-1.png" width="504" style="display: block; margin: auto;" /> ] --- ```r # holdout method *peng_split <- rsample::initial_split(penguins, strata = species) *peng_train <- rsample::training(peng_split) # add cross-validation *peng_folds <- rsample::vfold_cv(data = peng_train, strata = "species") # pick model rt_mod <- decision_tree(engine = "rpart") %>% set_mode("regression") # here comes the actual ML bit… fit_resamples( rt_mod, preprocessor = body_mass_g ~ ., resamples = peng_folds ) %>% collect_metrics() ``` --- class: inverse, middle, 
center # Resample a model with rsample --- class: inverse, middle, center # .fade[Resample a model with rsample] # Step 1: The holdout method --- class: middle, center <img src="figs/01/all-split-1.png" width="864" style="display: block; margin: auto;" /> --- class: middle .pull-left[ Train with `training()` Test with `testing()` ] .pull-right[ <img src="figs/01/lm-test-resid-1.png" width="504" style="display: block; margin: auto;" /> ] --- class: middle .pull-left[ Train with `training()` Test with `testing()` ] .pull-right[ <img src="figs/01/rt-test-resid-1.png" width="504" style="display: block; margin: auto;" /> ] --- # Do it once .pull-left[ ```r # pick model rt_mod <- decision_tree(engine = "rpart") %>% set_mode("regression") rt_mod %>% # fit fit(body_mass_g ~ ., data = penguins) %>% # predict predict(new_data = penguins) %>% mutate(body_mass_g = penguins$body_mass_g) %>% # compare rmse(truth = body_mass_g, estimate = .pred) # # A tibble: 1 × 3 # .metric .estimator .estimate # <chr> <chr> <dbl> # 1 rmse standard 311. ``` ] .pull-right[ ```r peng_test <- testing(peng_split) rt_mod %>% # TRAIN: get fitted model fit(body_mass_g ~ ., data = peng_train) %>% # TEST: get predictions predict(new_data = peng_test) %>% # COMPARE: get metrics bind_cols(peng_test) %>% rmse(truth = body_mass_g, estimate = .pred) # # A tibble: 1 × 3 # .metric .estimator .estimate # <chr> <chr> <dbl> # 1 rmse standard 309. ``` ] --- # Train + test the model once .pull-left[ ```r peng_test <- testing(peng_split) rt_mod %>% # TRAIN: get fitted model fit(body_mass_g ~ ., data = peng_train) %>% # TEST: get predictions predict(new_data = peng_test) %>% # COMPARE: get metrics bind_cols(peng_test) %>% rmse(truth = body_mass_g, estimate = .pred) # # A tibble: 1 × 3 # .metric .estimator .estimate # <chr> <chr> <dbl> # 1 rmse standard 309. 
``` ] --- # Train + test the model 10 times .pull-left[ ```r peng_test <- testing(peng_split) rt_mod %>% # TRAIN: get fitted model fit(body_mass_g ~ ., data = peng_train) %>% # TEST: get predictions predict(new_data = peng_test) %>% # COMPARE: get metrics bind_cols(peng_test) %>% rmse(truth = body_mass_g, estimate = .pred) # # A tibble: 1 × 3 # .metric .estimator .estimate # <chr> <chr> <dbl> # 1 rmse standard 309. ``` ] .pull-right[ ```r # holdout method peng_split <- initial_split(penguins, strata = species) peng_train <- training(peng_split) peng_test <- testing(peng_split) # add cross-validation peng_folds <- vfold_cv(data = peng_train, strata = "species") # pick model rt_mod <- decision_tree(engine = "rpart") %>% set_mode("regression") # here comes the actual ML bit… fit_resamples( rt_mod, preprocessor = body_mass_g ~ ., resamples = peng_folds ) %>% collect_metrics() # # A tibble: 2 × 6 # .metric .estimator mean n std_err .config # <chr> <chr> <dbl> <int> <dbl> <chr> # 1 rmse standard 310. 
10 9.45 Preprocessor1_Mo… # 2 rsq standard 0.861 10 0.0146 Preprocessor1_Mo… ``` ] --- class: inverse, middle, center # .fade[Resample a model with rsample] # .fade[Step 1: The holdout method] # Step 2: Cross-validation --- class: middle, center # Data Splitting -- <img src="figs/01/unnamed-chunk-17-1.png" width="720" style="display: block; margin: auto;" /> -- <img src="figs/01/unnamed-chunk-18-1.png" width="720" style="display: block; margin: auto;" /> -- <img src="figs/01/unnamed-chunk-19-1.png" width="720" style="display: block; margin: auto;" /> -- <img src="figs/01/unnamed-chunk-20-1.png" width="720" style="display: block; margin: auto;" /> -- <img src="figs/01/unnamed-chunk-21-1.png" width="720" style="display: block; margin: auto;" /> -- <img src="figs/01/unnamed-chunk-22-1.png" width="720" style="display: block; margin: auto;" /> -- <img src="figs/01/unnamed-chunk-23-1.png" width="720" style="display: block; margin: auto;" /> -- <img src="figs/01/unnamed-chunk-24-1.png" width="720" style="display: block; margin: auto;" /> --- class: frame, center, middle # Resampling Let's resample 10 times, then compute the mean of the results... --- ```r rmse %>% tibble::enframe(name = "rmse") # # A tibble: 10 × 2 # rmse value # <int> <dbl> # 1 1 396. # 2 2 424. # 3 3 394. # 4 4 352. # 5 5 350. # 6 6 456. # 7 7 346. # 8 8 370. # 9 9 420. # 10 10 393. mean(rmse) # [1] 390.2079 ``` --- background-image: url(images/diamonds.jpg) background-size: contain background-position: left class: middle, center background-color: #f5f5f5 .pull-right[ ## The .display[testing set] is precious... ## we can only use it once! 
] --- background-image: url(https://www.tidymodels.org/start/resampling/img/resampling.svg) background-size: 60% --- background-image: url(images/cross-validation/Slide2.png) background-size: contain --- background-image: url(images/cross-validation/Slide3.png) background-size: contain --- background-image: url(images/cross-validation/Slide4.png) background-size: contain --- background-image: url(images/cross-validation/Slide5.png) background-size: contain --- background-image: url(images/cross-validation/Slide6.png) background-size: contain --- background-image: url(images/cross-validation/Slide7.png) background-size: contain --- background-image: url(images/cross-validation/Slide8.png) background-size: contain --- background-image: url(images/cross-validation/Slide9.png) background-size: contain --- background-image: url(images/cross-validation/Slide10.png) background-size: contain --- background-image: url(images/cross-validation/Slide11.png) background-size: contain --- class: middle, center # V-fold cross-validation ```r vfold_cv(data, v = 10, ...) ``` --- exclude: true --- class: middle, center # Guess How many times does an observation/row appear in the assessment set? 
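One way to check the guess in code (a sketch, assuming the same `palmerpenguins` data used throughout; `rsample::complement()` returns the assessment-set row indices of a split):

```r
library(tidymodels)
library(palmerpenguins)

set.seed(1)
peng_folds <- vfold_cv(penguins, v = 10)

# collect the assessment-set row indices from every fold
assess_idx <- unlist(purrr::map(peng_folds$splits, rsample::complement))

length(assess_idx) == nrow(penguins)  # TRUE: every row gets assessed
any(duplicated(assess_idx))           # FALSE: no row is assessed twice
```

Because v-fold cross-validation *partitions* the rows, each observation lands in the assessment set exactly once.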
<img src="figs/01/vfold-tiles-1.png" width="864" style="display: block; margin: auto;" /> --- <img src="figs/01/unnamed-chunk-28-1.png" width="864" style="display: block; margin: auto;" /> --- class: middle .left-column[ ![](https://parsnip.tidymodels.org/reference/figures/logo.png) ] .right-column[ ```r # holdout method peng_split <- initial_split(penguins, strata = species) peng_train <- training(peng_split) peng_test <- testing(peng_split) # add cross-validation peng_folds <- vfold_cv(data = peng_train, strata = "species") *# pick model *rt_mod <- * parsnip::decision_tree(engine = "rpart") %>% * parsnip::set_mode("regression") # here comes the actual ML bit… fit_resamples( rt_mod, preprocessor = body_mass_g ~ ., resamples = peng_folds ) %>% collect_metrics() ``` ] --- class: middle .left-column[ ![](https://rsample.tidymodels.org/reference/figures/logo.png) ] .right-column[ ```r # holdout method *peng_split <- rsample::initial_split(penguins, strata = species) *peng_train <- rsample::training(peng_split) # add cross-validation *peng_folds <- rsample::vfold_cv(data = peng_train, strata = "species") # pick model rt_mod <- decision_tree(engine = "rpart") %>% set_mode("regression") # here comes the actual ML bit… fit_resamples( rt_mod, preprocessor = body_mass_g ~ ., resamples = peng_folds ) %>% collect_metrics() ``` ] --- class: middle .left-column[ ![](https://tune.tidymodels.org/reference/figures/logo.png) ] .right-column[ ```r # holdout method peng_split <- initial_split(penguins, strata = species) peng_train <- training(peng_split) # add cross-validation peng_folds <- vfold_cv(data = peng_train, strata = "species") # pick model rt_mod <- decision_tree(engine = "rpart") %>% set_mode("regression") *# here comes the actual ML bit… *tune::fit_resamples( * rt_mod, * preprocessor = body_mass_g ~ ., * resamples = peng_folds *) %>% * tune::collect_metrics() ``` ] --- class: your-turn ## Your turn Unscramble! 
```r peng_folds <- vfold_cv(data = peng_train, strata = "species") peng_train <- training(peng_split) peng_split <- initial_split(penguins, strata = species) peng_metrics <- collect_metrics(peng_fits) rt_mod <- decision_tree(engine = "rpart") %>% set_mode("regression") peng_fits <- fit_resamples( rt_mod, preprocessor = body_mass_g ~ ., resamples = peng_folds ) ``` --- ```r peng_split <- initial_split(penguins, strata = species) peng_train <- training(peng_split) peng_folds <- vfold_cv(data = peng_train, strata = "species") rt_mod <- decision_tree(engine = "rpart") %>% set_mode("regression") peng_fits <- fit_resamples( rt_mod, preprocessor = body_mass_g ~ ., resamples = peng_folds ) peng_metrics <- collect_metrics(peng_fits) ```
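A natural next step, sketched here rather than taken from the slides: once resampling has settled the model choice, spend the precious testing set exactly once with `tune::last_fit()`, which fits on the training set and evaluates on the testing set in one call:

```r
library(tidymodels)
library(palmerpenguins)

set.seed(123)
peng_split <- initial_split(penguins, strata = species)

# pick model
rt_mod <- decision_tree(engine = "rpart") %>%
  set_mode("regression")

# TRAIN on the training set, then evaluate ONCE on the testing set
rt_final <- last_fit(rt_mod, body_mass_g ~ ., split = peng_split)
collect_metrics(rt_final)
```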