Is your hospital in a state of crisis care?
What percentage of hospital beds are occupied?
"Statisticians, like artists, have the bad habit of falling in love with their models."
— George Box
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 0.2.0 ──
## ✔ broom        0.8.0     ✔ recipes      0.2.0
## ✔ dials        0.1.1     ✔ rsample      0.1.1
## ✔ dplyr        1.0.9     ✔ tibble       3.1.7
## ✔ ggplot2      3.3.6     ✔ tidyr        1.2.0
## ✔ infer        1.0.0     ✔ tune         0.2.0
## ✔ modeldata    0.1.1     ✔ workflows    0.2.6
## ✔ parsnip      0.2.1     ✔ workflowsets 0.2.1
## ✔ purrr        0.3.4     ✔ yardstick    1.0.0
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ✖ recipes::step()  masks stats::step()
## • Dig deeper into tidy modeling with R at https://www.tmwr.org
library(rcfss)
data("bechdel")
glimpse(bechdel)
## Rows: 1,394
## Columns: 10
## $ year          <dbl> 2013, 2013, 2013, 2013, 2013, 2013, …
## $ title         <chr> "12 Years a Slave", "2 Guns", "42", …
## $ bechdel       <fct> Fail, Fail, Fail, Fail, Fail, Pass, …
## $ budget_2013   <dbl> 2.00, 6.10, 4.00, 22.50, 9.20, 1.20,…
## $ domgross_2013 <dbl> 5.310703, 7.561246, 9.502021, 3.8362…
## $ intgross_2013 <dbl> 15.860703, 13.249301, 9.502021, 14.5…
## $ rated         <chr> "R", "R", "PG-13", "PG-13", "R", "R"…
## $ metascore     <dbl> 97, 55, 62, 29, 28, 55, 48, 33, 90, …
## $ imdb_rating   <dbl> 8.3, 6.8, 7.6, 6.6, 5.4, 7.8, 5.7, 5…
## $ genre         <chr> "Biography", "Action", "Biography", …
bechdel
Max Kuhn & Kjell Johnson, http://www.feat.engineering/
We'll use this goal to drive learning of 3 core tidymodels packages:
parsnip
yardstick
rsample
parsnip
glm()
glm(bechdel ~ metascore, family = binomial, data = bechdel)
## 
## Call:  glm(formula = bechdel ~ metascore, family = binomial, data = bechdel)
## 
## Coefficients:
## (Intercept)    metascore  
##    0.052274    -0.004563  
## 
## Degrees of Freedom: 1393 Total (i.e. Null);  1392 Residual
## Null Deviance:       1916 
## Residual Deviance: 1914   AIC: 1918
parsnip
logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")
## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm
parsnip
decision_tree() %>%
  set_engine("C5.0") %>%
  set_mode("classification")
## Decision Tree Model Specification (classification)
## 
## Computational engine: C5.0
parsnip
nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("classification")
## K-Nearest Neighbor Model Specification (classification)
## 
## Computational engine: kknn
All available models are listed at https://www.tidymodels.org/find/parsnip/
logistic_reg()
Specifies a model that uses logistic regression
logistic_reg(penalty = NULL, mixture = NULL)
logistic_reg(
  mode = "classification",  # "default" mode, if exists
  penalty = NULL,           # model hyper-parameter
  mixture = NULL            # model hyper-parameter
)
set_engine()
Adds an engine to power or implement the model.
logistic_reg() %>% set_engine(engine = "glm")
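The arguments left as NULL above only come into play with an engine that can use them. A minimal sketch of a regularized spec, assuming the glmnet package is installed; the penalty value here is purely illustrative:

# A lasso-penalized logistic regression spec (sketch; penalty value is illustrative)
logistic_reg(penalty = 0.1, mixture = 1) %>%  # mixture = 1 means pure lasso
  set_engine("glmnet")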
set_mode()
Sets the class of problem the model will solve, which influences which output is collected. Not necessary if mode is set in Step 1.
logistic_reg() %>% set_mode(mode = "classification")
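Chaining all three steps together gives the logistic regression spec that later chunks call lr_mod; a sketch (the object name matches what's used below, it isn't created for you automatically):

lr_mod <- logistic_reg() %>%        # 1. pick the model
  set_engine(engine = "glm") %>%    # 2. set the engine
  set_mode(mode = "classification") # 3. set the mode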
Run the chunk in your .Rmd and look at the output. Then, copy/paste the code and edit to create:
a decision tree model for classification that uses the C5.0 engine. Save it as tree_mod and look at the object. What is different about the output?
Hint: you'll need https://www.tidymodels.org/find/parsnip/
lr_mod
## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm

tree_mod <- decision_tree() %>%
  set_engine(engine = "C5.0") %>%
  set_mode("classification")

tree_mod
## Decision Tree Model Specification (classification)
## 
## Computational engine: C5.0
Statistical models learn from the data.
Many learn model parameters, which can be useful as values for inference and interpretation.
lr_mod %>%
  fit(bechdel ~ metascore + imdb_rating, data = bechdel) %>%
  broom::tidy()
## # A tibble: 3 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   2.70     0.436        6.20 5.64e-10
## 2 metascore     0.0202   0.00481      4.20 2.66e- 5
## 3 imdb_rating  -0.606    0.0889      -6.82 8.87e-12
bechdel_new <- tibble(
  metascore = c(40, 50, 60),
  imdb_rating = c(6, 6, 6),
  bechdel = c("Fail", "Fail", "Pass")
) %>%
  mutate(bechdel = factor(bechdel, levels = c("Fail", "Pass")))

bechdel_new
## # A tibble: 3 × 3
##   metascore imdb_rating bechdel
##       <dbl>       <dbl> <fct>  
## 1        40           6 Fail   
## 2        50           6 Fail   
## 3        60           6 Pass
tree_mod %>%
  fit(bechdel ~ metascore + imdb_rating, data = bechdel) %>%
  predict(new_data = bechdel) %>%
  mutate(true_bechdel = bechdel$bechdel) %>%
  accuracy(truth = true_bechdel, estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.580
tree_mod %>%
  fit(bechdel ~ metascore + imdb_rating, data = bechdel) %>%
  predict(new_data = bechdel_new) %>%
  mutate(true_bechdel = bechdel_new$bechdel) %>%
  accuracy(truth = true_bechdel, estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.333
fit()
Trains a model by fitting it to data. Returns a parsnip model fit.
fit(tree_mod, bechdel ~ metascore + imdb_rating, data = bechdel)
tree_mod %>%                               # parsnip model
  fit(bechdel ~ metascore + imdb_rating,   # a formula
      data = bechdel                       # dataframe
  )
tree_fit <- tree_mod %>%                   # parsnip model
  fit(bechdel ~ metascore + imdb_rating,   # a formula
      data = bechdel                       # dataframe
  )
predict()
Use a fitted model to predict new y values from data. Returns a tibble.
predict(tree_fit, new_data = bechdel_new)
tree_fit %>%
  predict(new_data = bechdel_new)
## # A tibble: 3 × 1
##   .pred_class
##   <fct>      
## 1 Pass       
## 2 Pass       
## 3 Pass
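predict() can also return class probabilities instead of hard class labels; a sketch with the same fitted model, using parsnip's type = "prob" option for classification models:

tree_fit %>%
  predict(new_data = bechdel_new, type = "prob")
# Returns one probability column per class level,
# here .pred_Fail and .pred_Pass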
The best way to measure a model's performance at predicting new data is to predict new data.
rsample
initial_split()*
"Splits" data randomly into a single testing and a single training set.
initial_split(data, prop = 3/4)
* from rsample
bechdel_split <- initial_split(data = bechdel, strata = bechdel, prop = 3/4)
bechdel_split
## <Analysis/Assess/Total>
## <1045/349/1394>
training() and testing()*
Extract training and testing sets from an rsplit
training(bechdel_split)
testing(bechdel_split)
* from rsample
bechdel_train <- training(bechdel_split)
bechdel_train
## # A tibble: 1,045 × 10
##     year title             bechdel budget_2013 domgross_2013
##    <dbl> <chr>             <fct>         <dbl>         <dbl>
##  1  2013 12 Years a Slave  Fail           2             5.31
##  2  2013 2 Guns            Fail           6.1           7.56
##  3  2013 42                Fail           4             9.50
##  4  2013 47 Ronin          Fail          22.5           3.84
##  5  2013 A Good Day to Di… Fail           9.2           6.73
##  6  2013 After Earth       Fail          13             6.05
##  7  2013 Cloudy with a Ch… Fail           7.8          12.0 
##  8  2013 Don Jon           Fail           0.55          2.45
##  9  2013 Escape Plan       Fail           7             2.52
## 10  2013 Gangster Squad    Fail           6             4.60
## # … with 1,035 more rows, and 5 more variables:
## #   intgross_2013 <dbl>, rated <chr>, metascore <dbl>,
## #   imdb_rating <dbl>, genre <chr>
Fill in the blanks. Use initial_split(), training(), and testing() to:
Split bechdel into training and test sets. Save the rsplit!
Extract the training data and fit your classification tree model.
Predict the testing data, and save the true bechdel values.
Measure the accuracy of your model with your test set.
Keep set.seed(100) at the start of your code.
set.seed(100)  # Important!
bechdel_split <- initial_split(bechdel, strata = bechdel, prop = 3/4)
bechdel_train <- training(bechdel_split)
bechdel_test <- testing(bechdel_split)

tree_mod %>%
  fit(bechdel ~ metascore + imdb_rating, data = bechdel_train) %>%
  predict(new_data = bechdel_test) %>%
  mutate(true_bechdel = bechdel_test$bechdel) %>%
  accuracy(truth = true_bechdel, estimate = .pred_class)
Better Model = Better Predictions (Lower error rate)
accuracy()*
Calculates the accuracy based on two columns in a dataframe:
The truth: $y_i$
The predicted estimate: $\hat{y}_i$
accuracy(data, truth, estimate)
* from yardstick
tree_mod %>%
  fit(bechdel ~ metascore + imdb_rating, data = bechdel_train) %>%
  predict(new_data = bechdel_test) %>%
  mutate(true_bechdel = bechdel_test$bechdel) %>%
  accuracy(truth = true_bechdel, estimate = .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.547
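Under the hood, accuracy is just the proportion of predictions that match the truth. A hand-rolled sketch of the same number, reusing the fit from above:

preds <- tree_mod %>%
  fit(bechdel ~ metascore + imdb_rating, data = bechdel_train) %>%
  predict(new_data = bechdel_test)
# Proportion of exact matches equals the accuracy above
mean(preds$.pred_class == bechdel_test$bechdel)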
What would happen if you repeated this process? Would you get the same answers?
Note your accuracy from above. Then change your seed number and rerun just the last code chunk above. Do you get the same answer?
Try it a few times with a few different seeds.
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.553

## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.559

## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.567
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.539

## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.504

## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.550
Why is the new estimate different?
Mean accuracy
Let's resample 10 times, then compute the mean of the results...
acc %>%
  tibble::enframe(name = "accuracy")
## # A tibble: 10 × 2
##    accuracy value
##       <int> <dbl>
##  1        1 0.550
##  2        2 0.625
##  3        3 0.599
##  4        4 0.587
##  5        5 0.605
##  6        6 0.593
##  7        7 0.542
##  8        8 0.530
##  9        9 0.564
## 10       10 0.599

mean(acc)
## [1] 0.5793696
Which do you think is a better estimate: the best result or the mean of the results? Why?
Fit with training set
Predict with testing set
Rinse and repeat?
acc <- vector(length = 10, mode = "double")
for (i in 1:10) {
  new_split <- initial_split(bechdel)
  new_train <- training(new_split)
  new_test <- testing(new_split)
  acc[i] <- lr_mod %>%
    fit(bechdel ~ metascore + imdb_rating, data = new_train) %>%
    predict(new_test) %>%
    mutate(truth = new_test$bechdel) %>%
    accuracy(truth, .pred_class) %>%
    pull(.estimate)
}
vfold_cv(data, v = 10, ...)
How many times does an observation/row appear in the assessment set?
If we use 10 folds, which percent of our data will end up in the training set and which percent in the testing set for each fold?
90% - training
10% - test
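You can check those proportions directly; a sketch using rsample's analysis()/assessment() accessors for the two partitions of a fold:

folds <- vfold_cv(bechdel_train, v = 10)
# Share of rows in the first fold's analysis (training) side: ~0.9
nrow(analysis(folds$splits[[1]])) / nrow(bechdel_train)
# Share in its assessment (testing) side: ~0.1
nrow(assessment(folds$splits[[1]])) / nrow(bechdel_train)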
Run the code below. What does it return?
set.seed(100)
bechdel_folds <- vfold_cv(data = bechdel_train, v = 10, strata = bechdel)
bechdel_folds
set.seed(100)
bechdel_folds <- vfold_cv(data = bechdel_train, v = 10, strata = bechdel)
bechdel_folds
## #  10-fold cross-validation using stratification 
## # A tibble: 10 × 2
##    splits            id    
##    <list>            <chr> 
##  1 <split [940/105]> Fold01
##  2 <split [940/105]> Fold02
##  3 <split [940/105]> Fold03
##  4 <split [940/105]> Fold04
##  5 <split [940/105]> Fold05
##  6 <split [940/105]> Fold06
##  7 <split [941/104]> Fold07
##  8 <split [941/104]> Fold08
##  9 <split [941/104]> Fold09
## 10 <split [942/103]> Fold10
split1 <- bechdel_folds %>%
  pluck("splits", 1)
split1_train <- training(split1)
split1_test <- testing(split1)

tree_mod %>%
  fit(bechdel ~ ., data = split1_train) %>%
  predict(split1_test) %>%
  mutate(truth = split1_test$bechdel) %>%
  accuracy(truth, .pred_class)

# rinse and repeat
split2 <- ...
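Instead of copy/pasting for all ten splits, you could map over the splits column yourself; a hand-rolled sketch (analysis() and assessment() are the fold-level counterparts of training() and testing()) before we meet the helper that does this bookkeeping for us:

fold_acc <- purrr::map_dbl(bechdel_folds$splits, function(split) {
  tree_mod %>%
    fit(bechdel ~ metascore + imdb_rating, data = analysis(split)) %>%
    predict(assessment(split)) %>%
    mutate(truth = assessment(split)$bechdel) %>%
    accuracy(truth, .pred_class) %>%
    pull(.estimate)
})
mean(fold_acc)  # cross-validated accuracy estimate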
fit_resamples()
Trains and tests a resampled model.
tree_mod %>%
  fit_resamples(
    bechdel ~ metascore + imdb_rating,
    resamples = bechdel_folds
  )
tree_mod %>%
  fit_resamples(
    bechdel ~ metascore + imdb_rating,
    resamples = bechdel_folds
  )
## # Resampling results
## # 10-fold cross-validation using stratification 
## # A tibble: 10 × 4
##    splits            id     .metrics         .notes  
##    <list>            <chr>  <list>           <list>  
##  1 <split [940/105]> Fold01 <tibble [2 × 4]> <tibble>
##  2 <split [940/105]> Fold02 <tibble [2 × 4]> <tibble>
##  3 <split [940/105]> Fold03 <tibble [2 × 4]> <tibble>
##  4 <split [940/105]> Fold04 <tibble [2 × 4]> <tibble>
##  5 <split [940/105]> Fold05 <tibble [2 × 4]> <tibble>
##  6 <split [940/105]> Fold06 <tibble [2 × 4]> <tibble>
##  7 <split [941/104]> Fold07 <tibble [2 × 4]> <tibble>
##  8 <split [941/104]> Fold08 <tibble [2 × 4]> <tibble>
##  9 <split [941/104]> Fold09 <tibble [2 × 4]> <tibble>
## 10 <split [942/103]> Fold10 <tibble [2 × 4]> <tibble>
collect_metrics()
Unnests the metrics column from the output of a tidymodels fit_resamples()
_results %>% collect_metrics(summarize = TRUE)
summarize = TRUE is actually the default; it averages across folds.
tree_mod %>%
  fit_resamples(
    bechdel ~ metascore + imdb_rating,
    resamples = bechdel_folds
  ) %>%
  collect_metrics(summarize = FALSE)
## # A tibble: 20 × 5
##    id     .metric  .estimator .estimate .config             
##    <chr>  <chr>    <chr>          <dbl> <chr>               
##  1 Fold01 accuracy binary         0.610 Preprocessor1_Model1
##  2 Fold01 roc_auc  binary         0.587 Preprocessor1_Model1
##  3 Fold02 accuracy binary         0.438 Preprocessor1_Model1
##  4 Fold02 roc_auc  binary         0.434 Preprocessor1_Model1
##  5 Fold03 accuracy binary         0.6   Preprocessor1_Model1
##  6 Fold03 roc_auc  binary         0.570 Preprocessor1_Model1
##  7 Fold04 accuracy binary         0.562 Preprocessor1_Model1
##  8 Fold04 roc_auc  binary         0.547 Preprocessor1_Model1
##  9 Fold05 accuracy binary         0.6   Preprocessor1_Model1
## 10 Fold05 roc_auc  binary         0.572 Preprocessor1_Model1
## # … with 10 more rows
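Those 20 unsummarized rows are exactly what summarize = TRUE averages. The same means computed by hand, as a sketch; cv_metrics is a hypothetical name for the unsummarized tibble above:

cv_metrics %>%
  group_by(.metric) %>%
  summarize(mean = mean(.estimate),        # average across the 10 folds
            n = n(),                       # number of folds
            std_err = sd(.estimate) / sqrt(n()))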
10 different analysis/assessment sets
10 different models (trained on analysis sets)
10 different sets of performance statistics (on assessment sets)
Modify the code below to use fit_resamples() and bechdel_folds to cross-validate the classification tree model. What is the ROC AUC that you collect at the end?
set.seed(100)
tree_mod %>%
  fit(bechdel ~ metascore + imdb_rating, data = bechdel_train) %>%
  predict(new_data = bechdel_test) %>%
  mutate(true_bechdel = bechdel_test$bechdel) %>%
  accuracy(truth = true_bechdel, estimate = .pred_class)
set.seed(100)
tree_mod %>%
  fit_resamples(bechdel ~ metascore + imdb_rating,
                resamples = bechdel_folds) %>%
  collect_metrics()
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.560    10  0.0157 Preprocessor1_Mod…
## 2 roc_auc  binary     0.547    10  0.0154 Preprocessor1_Mod…
tree_mod %>%
  fit(bechdel ~ metascore + imdb_rating, data = bechdel_train) %>%
  predict(bechdel_test) %>%
  mutate(truth = bechdel_test$bechdel) %>%
  accuracy(truth, .pred_class)
## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.550
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.560    10  0.0157 Preprocessor1_Mod…
## 2 roc_auc  binary     0.547    10  0.0154 Preprocessor1_Mod…