class: title-slide, center, bottom

# Build a useful model

## Tidymodels, virtually — Session 02

### Alison Hill

---
class: center, middle, inverse

# What is Machine Learning?

---
class: middle

# .center[Alzheimer's disease data]

Data from a clinical trial of individuals with well-characterized cognitive impairment, and age-matched control participants.

```r
# install.packages("modeldata")
library(modeldata)
data("ad_data")
alz <- ad_data
```

---
class: middle

```r
glimpse(alz)
# Rows: 333
# Columns: 131
# $ ACE_CD143_Angiotensin_Converti <dbl> 2.0031003, 1.5618…
# $ ACTH_Adrenocorticotropic_Hormon <dbl> -1.3862944, -1.38…
# $ AXL <dbl> 1.09838668, 0.683…
# $ Adiponectin <dbl> -5.360193, -5.020…
# $ Alpha_1_Antichymotrypsin <dbl> 1.7404662, 1.4586…
# $ Alpha_1_Antitrypsin <dbl> -12.631361, -11.9…
# $ Alpha_1_Microglobulin <dbl> -2.577022, -3.244…
# $ Alpha_2_Macroglobulin <dbl> -72.65029, -154.6…
# $ Angiopoietin_2_ANG_2 <dbl> 1.06471074, 0.741…
# $ Angiotensinogen <dbl> 2.510547, 2.45728…
# $ Apolipoprotein_A_IV <dbl> -1.427116, -1.660…
# $ Apolipoprotein_A1 <dbl> -7.402052, -7.047…
# $ Apolipoprotein_A2 <dbl> -0.26136476, -0.8…
# $ Apolipoprotein_B <dbl> -4.624044, -6.747…
# $ Apolipoprotein_CI <dbl> -1.2729657, -1.27…
# $ Apolipoprotein_CIII <dbl> -2.312635, -2.343…
# $ Apolipoprotein_D <dbl> 2.0794415, 1.3350…
# $ Apolipoprotein_E <dbl> 3.7545215, 3.0971…
# $ Apolipoprotein_H <dbl> -0.15734908, -0.5…
# $ B_Lymphocyte_Chemoattractant_BL <dbl> 2.296982, 1.67312…
# $ BMP_6 <dbl> -2.200744, -1.728…
# $ Beta_2_Microglobulin <dbl> 0.69314718, 0.470…
# $ Betacellulin <int> 34, 53, 49, 52, 6…
# $ C_Reactive_Protein <dbl> -4.074542, -6.645…
# $ CD40 <dbl> -0.7964147, -1.27…
# $ CD5L <dbl> 0.09531018, -0.67…
# $ Calbindin <dbl> 33.21363, 25.2763…
# $ Calcitonin <dbl> 1.3862944, 3.6109…
# $ CgA <dbl> 397.6536, 465.675…
# $ Clusterin_Apo_J <dbl> 3.555348, 3.04452…
# $ Complement_3 <dbl> -10.36305, -16.10…
# $ Complement_Factor_H <dbl> 3.573725, 3.60004…
# $ Connective_Tissue_Growth_Factor <dbl> 0.5306283, 0.5877…
# $ Cortisol <dbl> 10.0, 12.0, 10.0,…
# $ Creatine_Kinase_MB <dbl> -1.710172, -1.751…
# $ Cystatin_C <dbl> 9.041922, 9.06762…
# $ EGF_R <dbl> -0.1354543, -0.37…
# $ EN_RAGE <dbl> -3.688879, -3.816…
# $ ENA_78 <dbl> -1.349543, -1.356…
# $ Eotaxin_3 <int> 53, 62, 62, 44, 6…
# $ FAS <dbl> -0.08338161, -0.5…
# $ FSH_Follicle_Stimulation_Hormon <dbl> -0.6516715, -1.62…
# $ Fas_Ligand <dbl> 3.1014922, 2.9788…
# $ Fatty_Acid_Binding_Protein <dbl> 2.5208712, 2.2477…
# $ Ferritin <dbl> 3.329165, 3.93295…
# $ Fetuin_A <dbl> 1.2809338, 1.1939…
# $ Fibrinogen <dbl> -7.035589, -8.047…
# $ GRO_alpha <dbl> 1.381830, 1.37243…
# $ Gamma_Interferon_induced_Monokin <dbl> 2.949822, 2.72179…
# $ Glutathione_S_Transferase_alpha <dbl> 1.0641271, 0.8670…
# $ HB_EGF <dbl> 6.559746, 8.75453…
# $ HCC_4 <dbl> -3.036554, -4.074…
# $ Hepatocyte_Growth_Factor_HGF <dbl> 0.58778666, 0.530…
# $ I_309 <dbl> 3.433987, 3.13549…
# $ ICAM_1 <dbl> -0.1907787, -0.46…
# $ IGF_BP_2 <dbl> 5.609472, 5.34710…
# $ IL_11 <dbl> 5.121987, 4.93670…
# $ IL_13 <dbl> 1.282549, 1.26946…
# $ IL_16 <dbl> 4.192081, 2.87633…
# $ IL_17E <dbl> 5.731246, 6.70589…
# $ IL_1alpha <dbl> -6.571283, -8.047…
# $ IL_3 <dbl> -3.244194, -3.912…
# $ IL_4 <dbl> 2.484907, 2.39789…
# $ IL_5 <dbl> 1.09861229, 0.693…
# $ IL_6 <dbl> 0.26936976, 0.096…
# $ IL_6_Receptor <dbl> 0.64279595, 0.431…
# $ IL_7 <dbl> 4.8050453, 3.7055…
# $ IL_8 <dbl> 1.711325, 1.67555…
# $ IP_10_Inducible_Protein_10 <dbl> 6.242223, 5.68697…
# $ IgA <dbl> -6.812445, -6.377…
# $ Insulin <dbl> -0.6258253, -0.94…
# $ Kidney_Injury_Molecule_1_KIM_1 <dbl> -1.204295, -1.197…
# $ LOX_1 <dbl> 1.7047481, 1.5260…
# $ Leptin <dbl> -1.5290628, -1.46…
# $ Lipoprotein_a <dbl> -4.268698, -4.933…
# $ MCP_1 <dbl> 6.740519, 6.84906…
# $ MCP_2 <dbl> 1.9805094, 1.8088…
# $ MIF <dbl> -1.237874, -1.897…
# $ MIP_1alpha <dbl> 4.968453, 3.69016…
# $ MIP_1beta <dbl> 3.258097, 3.13549…
# $ MMP_2 <dbl> 4.478566, 3.78147…
# $ MMP_3 <dbl> -2.207275, -2.465…
# $ MMP10 <dbl> -3.270169, -3.649…
# $ MMP7 <dbl> -3.7735027, -5.96…
# $ Myoglobin <dbl> -1.89711998, -0.7…
# $ NT_proBNP <dbl> 4.553877, 4.21950…
# $ NrCAM <dbl> 5.003946, 5.20948…
# $ Osteopontin <dbl> 5.356586, 6.00388…
# $ PAI_1 <dbl> 1.00350156, -0.03…
# $ PAPP_A <dbl> -2.902226, -2.813…
# $ PLGF <dbl> 4.442651, 4.02535…
# $ PYY <dbl> 3.218876, 3.13549…
# $ Pancreatic_polypeptide <dbl> 0.5787808, 0.3364…
# $ Prolactin <dbl> 0.00000000, -0.51…
# $ Prostatic_Acid_Phosphatase <dbl> -1.620527, -1.739…
# $ Protein_S <dbl> -1.784998, -2.463…
# $ Pulmonary_and_Activation_Regulat <dbl> -0.8439701, -2.30…
# $ RANTES <dbl> -6.214608, -6.938…
# $ Resistin <dbl> -16.475315, -16.0…
# $ S100b <dbl> 1.5618560, 1.7566…
# $ SGOT <dbl> -0.94160854, -0.6…
# $ SHBG <dbl> -1.897120, -1.560…
# $ SOD <dbl> 5.609472, 5.81413…
# $ Serum_Amyloid_P <dbl> -5.599422, -6.119…
# $ Sortilin <dbl> 4.908629, 5.47873…
# $ Stem_Cell_Factor <dbl> 4.174387, 3.71357…
# $ TGF_alpha <dbl> 8.649098, 11.3316…
# $ TIMP_1 <dbl> 15.20465, 11.2665…
# $ TNF_RII <dbl> -0.0618754, -0.32…
# $ TRAIL_R3 <dbl> -0.1829004, -0.50…
# $ TTR_prealbumin <dbl> 2.944439, 2.83321…
# $ Tamm_Horsfall_Protein_THP <dbl> -3.095810, -3.111…
# $ Thrombomodulin <dbl> -1.340566, -1.675…
# $ Thrombopoietin <dbl> -0.1026334, -0.67…
# $ Thymus_Expressed_Chemokine_TECK <dbl> 4.149327, 3.81018…
# $ Thyroid_Stimulating_Hormone <dbl> -3.863233, -4.828…
# $ Thyroxine_Binding_Globulin <dbl> -1.4271164, -1.60…
# $ Tissue_Factor <dbl> 2.04122033, 2.028…
# $ Transferrin <dbl> 3.332205, 2.89037…
# $ Trefoil_Factor_3_TFF3 <dbl> -3.381395, -3.912…
# $ VCAM_1 <dbl> 3.258097, 2.70805…
# $ VEGF <dbl> 22.03456, 18.6018…
# $ Vitronectin <dbl> -0.04082199, -0.3…
# $ von_Willebrand_Factor <dbl> -3.146555, -3.863…
# $ age <dbl> 0.9876238, 0.9861…
# $ tau <dbl> 6.297754, 6.65929…
# $ p_tau <dbl> 4.348108, 4.85996…
# $ Ab_42 <dbl> 12.019678, 11.015…
# $ male <dbl> 0, 0, 1, 0, 0, 1,…
# $ Genotype <fct> E3E3, E3E4, E3E4,…
# $ Class <fct> Control, Control,…
```

---
background-image: url(images/hands.jpg)
background-size: contain
background-position: left
class: middle

.pull-right[

## Alzheimer's disease data

+ N = 333
+ 1 categorical outcome: `Class`
+ 130 predictors
  + 126 protein measurements
  + also: `age`, `male`, `Genotype`

]

---
background-image: url(images/hands.jpg)
background-size: contain
background-position: left
class: middle

.pull-right[
<img src="figs/02/unnamed-chunk-3-1.png" width="504" style="display: block; margin: auto;" />
]

---
class: middle, center, inverse

# What is the goal of machine learning?

---
class: middle, center, frame

# Goal

--

## Build .display[models] that

--

## generate .display[accurate predictions]

--

## for .display[future, yet-to-be-seen data].

--

.footnote[Max Kuhn & Kjell Johnson, http://www.feat.engineering/]

???

This is our whole game vision for today. This is the main goal for predictive modeling broadly, and for machine learning specifically.
We'll use this goal to drive learning of 3 core tidymodels packages:

- parsnip
- yardstick
- and rsample

---
class: inverse, middle, center

# 🔨 Build models

--

# with parsnip

???

Enter the parsnip package

---
class: middle, center, frame

# parsnip

<iframe src="https://parsnip.tidymodels.org" width="100%" height="400px"></iframe>

---
class: middle

# .center[`glm()`]

```r
glm(Class ~ tau, family = binomial, data = alz)
# 
# Call:  glm(formula = Class ~ tau, family = binomial, data = alz)
# 
# Coefficients:
# (Intercept)          tau
#      13.664       -2.148
# 
# Degrees of Freedom: 332 Total (i.e. Null);  331 Residual
# Null Deviance:     390.6
# Residual Deviance: 318.8   AIC: 322.8
```

???

So let's start with prediction. To predict, we have to have two things: a model to generate predictions, and data to predict.

This type of formula interface may look familiar.

How would we use parsnip to build this kind of logistic regression model?

---
exclude: true
name: step1
background-image: url("images/predicting/predicting.001.jpeg")
background-size: contain

---
class: middle, frame

# .center[To specify a model with parsnip]

.right-column[

1\. Pick a .display[model] + .display[engine]

2\. Set the .display[mode] (if needed)

]

---
class: middle, frame

# .center[To specify a model with parsnip]

```r
logistic_reg(engine = "glm") %>%
  set_mode("classification")
# Logistic Regression Model Specification (classification)
# 
# Computational engine: glm
```

---
class: middle, frame

# .center[To specify a model with parsnip]

```r
decision_tree(engine = "C5.0") %>%
  set_mode("classification")
# Decision Tree Model Specification (classification)
# 
# Computational engine: C5.0
```

---
class: middle, frame

# .center[To specify a model with parsnip]

```r
nearest_neighbor(engine = "kknn") %>%
  set_mode("classification")
# K-Nearest Neighbor Model Specification (classification)
# 
# Computational engine: kknn
```

---
class: middle, frame

.fade[
# .center[To specify a model with parsnip]
]

.right-column[

1\. Pick a .display[model] + .display[engine]

.fade[
2\. Set the .display[mode] (if needed)
]

]

---
class: middle, center

# 1\. Pick a .display[model] + .display[engine]

All available models are listed at <https://www.tidymodels.org/find/parsnip/>

<iframe src="https://www.tidymodels.org/find/parsnip/" width="504" height="400px"></iframe>

---
class: middle

.center[

# `logistic_reg()`

Specifies a model that uses logistic regression

]

```r
logistic_reg(engine = "glm", penalty = NULL, mixture = NULL)
```

---
class: middle

.center[

# `logistic_reg()`

Specifies a model that uses logistic regression

]

```r
logistic_reg(
  mode = "classification",  # "default" mode
  engine = "glm",           # default computational engine
  penalty = NULL,           # model hyper-parameter
  mixture = NULL            # model hyper-parameter
)
```

---
class: middle, frame

.fade[
# .center[To specify a model with parsnip]
]

.right-column[

.fade[
1\. Pick a .display[model] + .display[engine]
]

2\. Set the .display[mode] (if needed)

]

---
class: middle, center

# `set_mode()`

Sets the class of problem the model will solve, which influences which output is collected. Not necessary if mode is set in Step 1.

```r
logistic_reg() %>%
  set_mode(mode = "classification")
```

---
class: your-turn

# Your turn 1

Run the chunk in your .Rmd and look at the output. Then, copy/paste the code and edit to create:

+ a decision tree model for classification
+ that uses the `C5.0` engine.

Save it as `tree_mod` and look at the object. What is different about the output?

*Hint: you'll need https://www.tidymodels.org/find/parsnip/*
⏳ 03:00
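---
class: middle

# .center[A spec is not a fit]

A parsnip specification only stores instructions; nothing is estimated until it meets data. A minimal sketch of fitting the logistic regression spec, assuming the tidymodels meta-package is loaded and `alz` is in your session:

```r
library(tidymodels)

lr_mod <- logistic_reg(engine = "glm") %>%
  set_mode("classification")

# fit() pairs the spec with a formula and data,
# then estimates the coefficients via the glm engine
lr_fit <- fit(lr_mod, Class ~ tau, data = alz)
lr_fit
```

The printed coefficients should match the plain `glm()` call from a few slides back, since the spec delegates to the same engine.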
---

```r
lr_mod
# Logistic Regression Model Specification (classification)
# 
# Computational engine: glm

tree_mod <- decision_tree(engine = "C5.0") %>%
  set_mode("classification")
tree_mod
# Decision Tree Model Specification (classification)
# 
# Computational engine: C5.0
```

---
class: inverse, middle, center

## Now we've built a model.

--

## But, how do we *use* a model?

--

## First - what does it mean to use a model?

---
class: inverse, middle, center

![](https://media.giphy.com/media/fhAwk4DnqNgw8/giphy.gif)

Statistical models learn from the data. Many learn model parameters, which *can* be useful as values for inference and interpretation.

---

## Our data

<img src="figs/02/unnamed-chunk-17-1.png" width="504" style="display: block; margin: auto;" />

---

## "All models are wrong, but some are useful"

.pull-left[
<img src="figs/02/unnamed-chunk-19-1.png" width="504" style="display: block; margin: auto;" />
]

--

.pull-right[
```
#           Truth
# Prediction Impaired Control
#   Impaired       61      17
#   Control        30     225
```
]

---

## "All models are wrong, but some are useful"

.pull-left[
<img src="figs/02/unnamed-chunk-22-1.png" width="504" style="display: block; margin: auto;" />
]

--

.pull-right[
```
#           Truth
# Prediction Impaired Control
#   Impaired       65      22
#   Control        26     220
```
]

---
class: inverse, middle, center

# ♻️ Resample models

--

# with rsample

???

Enter the rsample package

---
class: middle, center, frame

# rsample

<iframe src="https://tidymodels.github.io/rsample/" width="100%" height="400px"></iframe>

---
class: middle, center, frame

# The holdout method

--

<img src="figs/02/all-split-1.png" width="864" style="display: block; margin: auto;" />

???

We refer to the group for which we know the outcome, and use to develop the algorithm, as the training set. We refer to the group for which we pretend we don't know the outcome as the test set.

---
class: center, middle

# `initial_split()`

"Splits" data randomly into a single testing and a single training set.

```r
initial_split(data, prop = 3/4, strata = NULL, breaks = 4)
```

---

```r
alz_split <- initial_split(alz, strata = Class, prop = .9)
alz_split
# <Analysis/Assess/Total>
# <300/33/333>
```

???

data splitting

---
class: center, middle

# `training()` and `testing()`

Extract training and testing sets from an rsplit

```r
training(alz_split)
testing(alz_split)
```

---

```r
alz_train <- training(alz_split)
alz_train
# # A tibble: 300 × 131
#    ACE_CD143_Angioten… ACTH_Adrenocortic…    AXL Adiponectin
#                  <dbl>              <dbl>  <dbl>       <dbl>
#  1               2.00              -1.39   1.10        -5.36
#  2               1.56              -1.39   0.683       -5.02
#  3               1.52              -1.71  -0.145       -5.81
#  4               1.68              -1.61   0.683       -5.12
#  5               2.40              -0.968  0.191       -4.78
#  6               0.431             -1.27  -0.222       -5.22
#  7               0.946             -1.90   0.530       -6.12
#  8               0.708             -1.83  -0.327       -4.88
#  9               1.11              -1.97   0.191       -5.17
# 10               1.60              -1.51   0.449       -5.57
# # … with 290 more rows, and 127 more variables:
# #   Alpha_1_Antichymotrypsin <dbl>,
# #   Alpha_1_Antitrypsin <dbl>, Alpha_1_Microglobulin <dbl>,
# #   Alpha_2_Macroglobulin <dbl>,
# #   Angiopoietin_2_ANG_2 <dbl>, …
```

---
class: your-turn

# Your turn 2

Fill in the blanks. Use `initial_split()`, `training()`, and `testing()` to:

1. Split **alz** into training and test sets. Save the rsplit!

2. Extract the training data and the testing data.

3. Check the proportions of the `Class` variable in each set.

Keep `set.seed(100)` at the start of your code.
⏳ 04:00
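---
class: middle

# .center[Hint: checking proportions]

For part 3 of the exercise, here is one possible sketch using dplyr's `count()` (hypothetical helper code; it assumes you saved `alz_train` and `alz_test` in parts 1 and 2):

```r
library(tidymodels)

# class proportions in the training set
alz_train %>%
  count(Class) %>%
  mutate(prop = n / sum(n))

# class proportions in the testing set
alz_test %>%
  count(Class) %>%
  mutate(prop = n / sum(n))
```

With `strata = Class`, the `Impaired`/`Control` mix in both sets should stay close to the original 27%/73% split.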
---

```r
set.seed(100) # Important!

alz_split <- initial_split(alz, strata = Class, prop = .9)
alz_train <- training(alz_split)
alz_test  <- testing(alz_split)
```

---
class: middle, center

# Data Splitting

--

<img src="figs/02/unnamed-chunk-32-1.png" width="720" style="display: block; margin: auto;" />

--

<img src="figs/02/unnamed-chunk-33-1.png" width="720" style="display: block; margin: auto;" />

--

<img src="figs/02/unnamed-chunk-34-1.png" width="720" style="display: block; margin: auto;" />

--

<img src="figs/02/unnamed-chunk-35-1.png" width="720" style="display: block; margin: auto;" />

--

<img src="figs/02/unnamed-chunk-36-1.png" width="720" style="display: block; margin: auto;" />

--

<img src="figs/02/unnamed-chunk-37-1.png" width="720" style="display: block; margin: auto;" />

--

<img src="figs/02/unnamed-chunk-38-1.png" width="720" style="display: block; margin: auto;" />

--

<img src="figs/02/unnamed-chunk-39-1.png" width="720" style="display: block; margin: auto;" />

---

<img src="figs/02/unnamed-chunk-40-1.png" width="1080" style="display: block; margin: auto;" />

--

<img src="figs/02/unnamed-chunk-41-1.png" width="1080" style="display: block; margin: auto;" />

--

<img src="figs/02/unnamed-chunk-42-1.png" width="1080" style="display: block; margin: auto;" />

--

<img src="figs/02/unnamed-chunk-43-1.png" width="1080" style="display: block; margin: auto;" />

--

<img src="figs/02/unnamed-chunk-44-1.png" width="1080" style="display: block; margin: auto;" />

--

<img src="figs/02/unnamed-chunk-45-1.png" width="1080" style="display: block; margin: auto;" />

--

<img src="figs/02/unnamed-chunk-46-1.png" width="1080" style="display: block; margin: auto;" />

--

<img src="figs/02/unnamed-chunk-47-1.png" width="1080" style="display: block; margin: auto;" />

--

.right[Mean metric]

---
background-image: url(images/diamonds.jpg)
background-size: contain
background-position: left
class: middle, center
background-color: #f5f5f5

.pull-right[
## The .display[testing set] is precious...

## we can only use it once!
]

---
background-image: url(images/diamonds.jpg)
background-size: contain
background-position: left
class: middle, center
background-color: #f5f5f5

.pull-right[
## How can we use the training set to compare, evaluate, and tune models?
]

---
background-image: url(https://www.tidymodels.org/start/resampling/img/resampling.svg)
background-size: 60%

---
class: middle, center, inverse

# Cross-validation

---
background-image: url(images/cross-validation/Slide2.png)
background-size: contain

---
background-image: url(images/cross-validation/Slide3.png)
background-size: contain

---
background-image: url(images/cross-validation/Slide4.png)
background-size: contain

---
background-image: url(images/cross-validation/Slide5.png)
background-size: contain

---
background-image: url(images/cross-validation/Slide6.png)
background-size: contain

---
background-image: url(images/cross-validation/Slide7.png)
background-size: contain

---
background-image: url(images/cross-validation/Slide8.png)
background-size: contain

---
background-image: url(images/cross-validation/Slide9.png)
background-size: contain

---
background-image: url(images/cross-validation/Slide10.png)
background-size: contain

---
background-image: url(images/cross-validation/Slide11.png)
background-size: contain

---
class: middle, center

# V-fold cross-validation

```r
vfold_cv(data, v = 10, repeats = 1, strata = NULL, breaks = 4, ...)
```
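---
class: middle

Each fold is an rsplit, just like `alz_split`. A quick sketch of peeking inside one, using rsample's `analysis()` and `assessment()` accessors (assumes `alz_train` from earlier):

```r
library(tidymodels)

set.seed(100)
folds <- vfold_cv(alz_train, v = 10)

first_split <- folds$splits[[1]]
dim(analysis(first_split))    # the rows a model is fit on
dim(assessment(first_split))  # the rows held out for this fold
```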
---
exclude: true

---
class: middle, center

# Guess

How many times does an observation/row appear in the assessment set?

<img src="figs/02/vfold-tiles-1.png" width="864" style="display: block; margin: auto;" />

---

<img src="figs/02/unnamed-chunk-49-1.png" width="864" style="display: block; margin: auto;" />

---
class: middle, center

# Quiz

If we use 10 folds, which percent of our data will end up in the analysis set and which percent in the assessment set for each fold?

--

90% - analysis

10% - assessment

---
class: middle, center, inverse

# Stratified sampling

---

## What if...

The assessment set looked like this?

<img src="figs/02/unnamed-chunk-51-1.png" width="504" style="display: block; margin: auto;" />

---

## Or this?

<img src="figs/02/unnamed-chunk-52-1.png" width="504" style="display: block; margin: auto;" />

---

<img src="figs/02/unnamed-chunk-53-1.png" width="504" style="display: block; margin: auto;" />

---
class: middle

.pull-left[
<img src="figs/02/unnamed-chunk-54-1.png" width="504" style="display: block; margin: auto;" />
]

.pull-right[

## Original

```
# Class        n percent
# Impaired    91   27.3%
# Control    242   72.7%
```

## Resample

```
# Class    Analysis Assessment
# Impaired   75.82%     24.18%
# Control         -          -
```
]

---
class: middle

.pull-left[
<img src="figs/02/unnamed-chunk-56-1.png" width="504" style="display: block; margin: auto;" />
]

.pull-right[

## Original

```
# Class        n percent
# Impaired    91   27.3%
# Control    242   72.7%
```

## Resample

```
# Class    Analysis Assessment
# Impaired   75.82%     24.18%
# Control    75.21%     24.79%
```
]

---
class: your-turn

# Your Turn 3

Run the code below. What does it return?

```r
set.seed(100)
alz_folds <- vfold_cv(alz_train, v = 10, strata = Class)
alz_folds
```
⏳ 01:00
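---
class: middle

Stratification can be verified directly. A sketch that computes the share of `Impaired` cases in each fold's assessment set with purrr (assumes `alz_train` from earlier; the fold object matches the one in the exercise):

```r
library(tidymodels)

set.seed(100)
alz_folds <- vfold_cv(alz_train, v = 10, strata = Class)

# proportion of Impaired cases held out in each fold
map_dbl(alz_folds$splits,
        ~ mean(assessment(.x)$Class == "Impaired"))
```

With stratification these proportions hover near the overall 27%; without it they can drift from fold to fold.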
---

```r
set.seed(100)
alz_folds <- vfold_cv(alz_train, v = 10, strata = Class)
alz_folds
# #  10-fold cross-validation using stratification 
# # A tibble: 10 × 2
#    splits           id    
#    <list>           <chr> 
#  1 <split [269/31]> Fold01
#  2 <split [269/31]> Fold02
#  3 <split [270/30]> Fold03
#  4 <split [270/30]> Fold04
#  5 <split [270/30]> Fold05
#  6 <split [270/30]> Fold06
#  7 <split [270/30]> Fold07
#  8 <split [270/30]> Fold08
#  9 <split [271/29]> Fold09
# 10 <split [271/29]> Fold10
```

---
class: middle

.center[

# `fit_resamples()`

Trains and tests a resampled model.

]

```r
tree_mod %>%
  fit_resamples(
    Class ~ tau + VEGF,
    resamples = alz_folds
  )
```

---

```r
tree_mod %>%
  fit_resamples(
    Class ~ tau + VEGF,
    resamples = alz_folds
  )
# # Resampling results
# # 10-fold cross-validation using stratification 
# # A tibble: 10 × 4
#    splits           id     .metrics         .notes          
#    <list>           <chr>  <list>           <list>          
#  1 <split [269/31]> Fold01 <tibble [2 × 4]> <tibble [0 × 1]>
#  2 <split [269/31]> Fold02 <tibble [2 × 4]> <tibble [0 × 1]>
#  3 <split [270/30]> Fold03 <tibble [2 × 4]> <tibble [0 × 1]>
#  4 <split [270/30]> Fold04 <tibble [2 × 4]> <tibble [0 × 1]>
#  5 <split [270/30]> Fold05 <tibble [2 × 4]> <tibble [0 × 1]>
#  6 <split [270/30]> Fold06 <tibble [2 × 4]> <tibble [0 × 1]>
#  7 <split [270/30]> Fold07 <tibble [2 × 4]> <tibble [0 × 1]>
#  8 <split [270/30]> Fold08 <tibble [2 × 4]> <tibble [0 × 1]>
#  9 <split [271/29]> Fold09 <tibble [2 × 4]> <tibble [0 × 1]>
# 10 <split [271/29]> Fold10 <tibble [2 × 4]> <tibble [0 × 1]>
```

---
class: middle, center

# `collect_metrics()`

Unnest the metrics column from a tidymodels `fit_resamples()`

```r
_results %>% collect_metrics(summarize = TRUE)
```

--

.footnote[`TRUE` is actually the default; averages across folds]

---

.pull-left[

```r
tree_fit <- tree_mod %>%
  fit_resamples(
    Class ~ tau + VEGF,
    resamples = alz_folds
  )

collect_metrics(tree_fit)
# # A tibble: 2 × 6
#   .metric  .estimator  mean     n std_err .config           
#   <chr>    <chr>      <dbl> <int>   <dbl> <chr>             
# 1 accuracy binary     0.820    10  0.0204 Preprocessor1_Mod…
# 2 roc_auc  binary     0.786    10  0.0180 Preprocessor1_Mod…
```
]

--

.pull-right[

```r
collect_metrics(tree_fit, summarize = FALSE)
# # A tibble: 20 × 5
#    id     .metric  .estimator .estimate .config             
#    <chr>  <chr>    <chr>          <dbl> <chr>               
#  1 Fold01 accuracy binary         0.774 Preprocessor1_Model1
#  2 Fold01 roc_auc  binary         0.692 Preprocessor1_Model1
#  3 Fold02 accuracy binary         0.839 Preprocessor1_Model1
#  4 Fold02 roc_auc  binary         0.848 Preprocessor1_Model1
#  5 Fold03 accuracy binary         0.867 Preprocessor1_Model1
#  6 Fold03 roc_auc  binary         0.852 Preprocessor1_Model1
#  7 Fold04 accuracy binary         0.8   Preprocessor1_Model1
#  8 Fold04 roc_auc  binary         0.795 Preprocessor1_Model1
#  9 Fold05 accuracy binary         0.767 Preprocessor1_Model1
# 10 Fold05 roc_auc  binary         0.744 Preprocessor1_Model1
# # … with 10 more rows
```
]

---
class: middle, center, frame

# 10-fold CV

### 10 different analysis/assessment sets

### 10 different models (trained on .display[analysis] sets)

### 10 different sets of performance statistics (on .display[assessment] sets)

---
class: middle, center

<iframe src="https://tidymodels.github.io/yardstick/articles/metric-types.html#metrics" width="504" height="400px"></iframe>

<https://tidymodels.github.io/yardstick/articles/metric-types.html#metrics>

---
class: middle, center

# `roc_curve()`

Takes predictions from `fit_resamples()` run with `control_resamples(save_pred = TRUE)`. Returns a tibble of sensitivity and specificity at each probability threshold.

```r
roc_curve(data, truth = Class, estimate = .pred_Impaired)
```

Truth = .display[true] class

Estimate = .display[predicted] probability of the target response
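---
class: middle

For classification, `fit_resamples()` computes accuracy and ROC AUC by default. A sketch of requesting other yardstick metrics instead, via the `metrics` argument (assumes `tree_mod` and `alz_folds` from earlier; `sens` and `spec` are yardstick's sensitivity and specificity):

```r
library(tidymodels)

tree_mod %>%
  fit_resamples(
    Class ~ tau + VEGF,
    resamples = alz_folds,
    metrics = metric_set(roc_auc, sens, spec)
  ) %>%
  collect_metrics()
```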
---

```r
tree_preds <- tree_mod %>%
  fit_resamples(
    Class ~ tau + VEGF,
    resamples = alz_folds,
    control = control_resamples(save_pred = TRUE)
  )

tree_preds %>%
  collect_predictions() %>%
  roc_curve(truth = Class, estimate = .pred_Impaired)
# # A tibble: 40 × 3
#    .threshold specificity sensitivity
#         <dbl>       <dbl>       <dbl>
#  1  -Inf           0            1    
#  2     0.0703      0            1    
#  3     0.0708      0.0688       0.976
#  4     0.0713      0.124        0.963
#  5     0.0767      0.183        0.951
#  6     0.0772      0.248        0.939
#  7     0.0925      0.317        0.927
#  8     0.0960      0.344        0.890
#  9     0.0998      0.408        0.854
# 10     0.102       0.454        0.829
# # … with 30 more rows
```

---

## Area under the curve

.pull-left[
<img src="images/roc-pretty-good.png" width="1037" style="display: block; margin: auto;" />
]

.pull-right[
* AUC = 0.5: random guessing
* AUC = 1: perfect classifier
* In general, an AUC above 0.8 is considered "good"
]

---

<img src="images/roc-guessing.png" width="80%" style="display: block; margin: auto;" />

---

<img src="images/roc-perfect.png" width="80%" style="display: block; margin: auto;" />

---

<img src="images/roc-bad.png" width="80%" style="display: block; margin: auto;" />

---

<img src="images/roc-ok.png" width="80%" style="display: block; margin: auto;" />

---

<img src="images/roc-pretty-good.png" width="80%" style="display: block; margin: auto;" />

---
class: your-turn

# Your turn 4

Add `autoplot()` to the pipeline to visualize the ROC curve.

---

```r
tree_preds %>%
  collect_predictions() %>%
  roc_curve(truth = Class, estimate = .pred_Impaired) %>%
  autoplot()
```

<img src="figs/02/unnamed-chunk-75-1.png" width="504" style="display: block; margin: auto;" />
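---
class: middle

The area under that curve can also be computed as a single number. A closing sketch with yardstick's `roc_auc()`, reusing the saved predictions from above:

```r
tree_preds %>%
  collect_predictions() %>%
  roc_auc(truth = Class, .pred_Impaired)
```

The estimate should land near the resampled `roc_auc` value from `collect_metrics()` earlier.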