
An introduction to machine learning

MACS 30500
University of Chicago

1 / 117

What is machine learning?

2 / 117


Types of machine learning

  • Supervised
  • Unsupervised
10 / 117
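To make the distinction concrete, here is a minimal base-R sketch (not from the original slides): a supervised method learns to predict a labeled outcome, while an unsupervised method looks for structure with no outcome at all. The mtcars data and the choice of three clusters are purely illustrative.

# Supervised: we have a labeled outcome (mpg) and learn to predict it
supervised_fit <- lm(mpg ~ wt + hp, data = mtcars)

# Unsupervised: no outcome; look for structure among the predictors
unsupervised_fit <- kmeans(scale(mtcars[, c("wt", "hp")]), centers = 3)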

Examples of supervised learning

  • Ad click prediction
  • Supreme Court decision making
  • Police misconduct prediction
12 / 117

Two modes

Classification

Is your hospital in a state of crisis care?

Regression

What percentage of hospital beds are occupied?

13 / 117
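As a rough sketch using the parsnip syntax introduced later in these slides (and assuming library(tidymodels) is loaded), the same model family can be pointed at either mode; only the mode changes:

library(tidymodels)

# Classification: predict a category (e.g. crisis care, yes/no)
decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")

# Regression: predict a number (e.g. percent of beds occupied)
decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("regression")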

Two cultures

Statistics

  • model first
  • inference emphasis

Machine Learning

  • data first
  • prediction emphasis
14 / 117

Statistics

16 / 117

"Statisticians, like artists, have the bad habit of falling in love with their models."

— George Box

17 / 117

Predictive modeling

library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 0.2.0 ──
## ✔ broom 0.8.0 ✔ recipes 0.2.0
## ✔ dials 0.1.1 ✔ rsample 0.1.1
## ✔ dplyr 1.0.9 ✔ tibble 3.1.7
## ✔ ggplot2 3.3.6 ✔ tidyr 1.2.0
## ✔ infer 1.0.0 ✔ tune 0.2.0
## ✔ modeldata 0.1.1 ✔ workflows 0.2.6
## ✔ parsnip 0.2.1 ✔ workflowsets 0.2.1
## ✔ purrr 0.3.4 ✔ yardstick 1.0.0
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ recipes::step() masks stats::step()
## • Dig deeper into tidy modeling with R at https://www.tmwr.org
18 / 117

Bechdel Test

Source: FiveThirtyEight

19 / 117

Bechdel Test

  1. It has to have at least two [named] women in it
  2. Who talk to each other
  3. About something besides a man
library(rcfss)
data("bechdel")
glimpse(bechdel)
## Rows: 1,394
## Columns: 10
## $ year <dbl> 2013, 2013, 2013, 2013, 2013, 2013, …
## $ title <chr> "12 Years a Slave", "2 Guns", "42", …
## $ bechdel <fct> Fail, Fail, Fail, Fail, Fail, Pass, …
## $ budget_2013 <dbl> 2.00, 6.10, 4.00, 22.50, 9.20, 1.20,…
## $ domgross_2013 <dbl> 5.310703, 7.561246, 9.502021, 3.8362…
## $ intgross_2013 <dbl> 15.860703, 13.249301, 9.502021, 14.5…
## $ rated <chr> "R", "R", "PG-13", "PG-13", "R", "R"…
## $ metascore <dbl> 97, 55, 62, 29, 28, 55, 48, 33, 90, …
## $ imdb_rating <dbl> 8.3, 6.8, 7.6, 6.6, 5.4, 7.8, 5.7, 5…
## $ genre <chr> "Biography", "Action", "Biography", …
20 / 117

Bechdel test data

  • N = 1394
  • 1 categorical outcome: bechdel
  • 9 predictors
21 / 117
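Before modeling, a quick way to check the balance of the categorical outcome is to count it; a minimal sketch assuming the bechdel data loaded above (the resulting counts are not shown on the original slides):

# How balanced is the Fail/Pass outcome?
bechdel %>%
  count(bechdel) %>%
  mutate(prop = n / sum(n))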


What is the goal of machine learning?

Build models that

generate accurate predictions

for future, yet-to-be-seen data.

Max Kuhn & Kjell Johnson, http://www.feat.engineering/

23 / 117

Machine learning

We'll use this goal to drive learning of 3 core tidymodels packages:

  • parsnip
  • yardstick
  • rsample
24 / 117

🔨 Build models with parsnip

25 / 117

parsnip

26 / 117

glm()

glm(bechdel ~ metascore, family = binomial, data = bechdel)
##
## Call: glm(formula = bechdel ~ metascore, family = binomial, data = bechdel)
##
## Coefficients:
## (Intercept) metascore
## 0.052274 -0.004563
##
## Degrees of Freedom: 1393 Total (i.e. Null); 1392 Residual
## Null Deviance: 1916
## Residual Deviance: 1914 AIC: 1918
27 / 117


To specify a model with parsnip

  1. Pick a model
  2. Set the engine
  3. Set the mode (if needed)
29 / 117

To specify a model with parsnip

logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")
## Logistic Regression Model Specification (classification)
##
## Computational engine: glm
30 / 117

To specify a model with parsnip

decision_tree() %>%
  set_engine("C5.0") %>%
  set_mode("classification")
## Decision Tree Model Specification (classification)
##
## Computational engine: C5.0
31 / 117

To specify a model with parsnip

nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("classification")
## K-Nearest Neighbor Model Specification (classification)
##
## Computational engine: kknn
32 / 117

1. Pick a model

All available models are listed at

https://www.tidymodels.org/find/parsnip/

33 / 117

logistic_reg()

Specifies a model that uses logistic regression

logistic_reg(penalty = NULL, mixture = NULL)
34 / 117

logistic_reg()

Specifies a model that uses logistic regression

logistic_reg(
  mode = "classification",  # "default" mode, if exists
  penalty = NULL,           # model hyper-parameter
  mixture = NULL            # model hyper-parameter
)
35 / 117
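penalty and mixture only come into play with engines that fit regularized models. As a hedged example (not from the original slides), a lasso-style logistic regression via the glmnet engine might look like this; the penalty value 0.01 is arbitrary:

# Regularized logistic regression:
#   penalty = amount of regularization
#   mixture = 1 is pure lasso, 0 is pure ridge
logistic_reg(penalty = 0.01, mixture = 1) %>%
  set_engine("glmnet")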

set_engine()

Adds an engine to power or implement the model.

logistic_reg() %>% set_engine(engine = "glm")
36 / 117

set_mode()

Sets the class of problem the model will solve, which influences which output is collected. Not necessary if mode is set in Step 1.

logistic_reg() %>% set_mode(mode = "classification")
37 / 117

Your turn 1

Run the chunk in your .Rmd and look at the output. Then, copy/paste the code and edit to create:

  • a decision tree model for classification

  • that uses the C5.0 engine.

Save it as tree_mod and look at the object. What is different about the output?

Hint: you'll need https://www.tidymodels.org/find/parsnip/

03:00
38 / 117
lr_mod <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

lr_mod
## Logistic Regression Model Specification (classification)
##
## Computational engine: glm

tree_mod <- decision_tree() %>%
  set_engine(engine = "C5.0") %>%
  set_mode("classification")

tree_mod
## Decision Tree Model Specification (classification)
##
## Computational engine: C5.0
39 / 117

Now we've built a model.

But, how do we use a model?

First - what does it mean to use a model?

40 / 117

Statistical models learn from the data.

Many learn model parameters, which can be useful as values for inference and interpretation.

41 / 117

A fitted model

lr_mod %>%
  fit(bechdel ~ metascore + imdb_rating,
      data = bechdel) %>%
  broom::tidy()
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 2.70 0.436 6.20 5.64e-10
## 2 metascore 0.0202 0.00481 4.20 2.66e- 5
## 3 imdb_rating -0.606 0.0889 -6.82 8.87e-12

42 / 117

"All models are wrong, but some are useful"

43 / 117

"All models are wrong, but some are useful"

44 / 117

Predict new data

bechdel_new <- tibble(metascore = c(40, 50, 60),
                      imdb_rating = c(6, 6, 6),
                      bechdel = c("Fail", "Fail", "Pass")) %>%
  mutate(bechdel = factor(bechdel, levels = c("Fail", "Pass")))
bechdel_new
## # A tibble: 3 × 3
## metascore imdb_rating bechdel
## <dbl> <dbl> <fct>
## 1 40 6 Fail
## 2 50 6 Fail
## 3 60 6 Pass
45 / 117

Predict old data

tree_mod %>%
  fit(bechdel ~ metascore + imdb_rating,
      data = bechdel) %>%
  predict(new_data = bechdel) %>%
  mutate(true_bechdel = bechdel$bechdel) %>%
  accuracy(truth = true_bechdel,
           estimate = .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.580
46 / 117

Predict new data

out with the old...

tree_mod %>%
  fit(bechdel ~ metascore + imdb_rating,
      data = bechdel) %>%
  predict(new_data = bechdel) %>%
  mutate(true_bechdel = bechdel$bechdel) %>%
  accuracy(truth = true_bechdel,
           estimate = .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.580

in with the 🆕

tree_mod %>%
  fit(bechdel ~ metascore + imdb_rating,
      data = bechdel) %>%
  predict(new_data = bechdel_new) %>%
  mutate(true_bechdel = bechdel_new$bechdel) %>%
  accuracy(truth = true_bechdel,
           estimate = .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.333
47 / 117

fit()

Train a model by fitting a model. Returns a parsnip model fit.

fit(tree_mod, bechdel ~ metascore + imdb_rating, data = bechdel)
48 / 117

fit()

Train a model by fitting a model. Returns a parsnip model fit.

tree_mod %>%                              # parsnip model
  fit(bechdel ~ metascore + imdb_rating,  # a formula
      data = bechdel                      # dataframe
  )
49 / 117

fit()

Train a model by fitting a model. Returns a parsnip model fit.

tree_fit <- tree_mod %>%                  # parsnip model
  fit(bechdel ~ metascore + imdb_rating,  # a formula
      data = bechdel                      # dataframe
  )
50 / 117


predict()

Use a fitted model to predict new y values from data. Returns a tibble.

predict(tree_fit, new_data = bechdel_new)
53 / 117
tree_fit %>%
  predict(new_data = bechdel_new)
## # A tibble: 3 × 1
## .pred_class
## <fct>
## 1 Pass
## 2 Pass
## 3 Pass
54 / 117

Axiom

The best way to measure a model's performance at predicting new data is to predict new data.

55 / 117

Data splitting

56 / 117

♻️ Resample models with rsample

57 / 117

rsample

58 / 117

initial_split()*

"Splits" data randomly into a single testing and a single training set.

initial_split(data, prop = 3/4)

* from rsample

59 / 117
bechdel_split <- initial_split(data = bechdel, strata = bechdel, prop = 3/4)
bechdel_split
## <Analysis/Assess/Total>
## <1045/349/1394>
60 / 117

training() and testing()*

Extract training and testing sets from an rsplit

training(bechdel_split)
testing(bechdel_split)

* from rsample

61 / 117
bechdel_train <- training(bechdel_split)
bechdel_train
## # A tibble: 1,045 × 10
## year title bechdel budget_2013 domgross_2013
## <dbl> <chr> <fct> <dbl> <dbl>
## 1 2013 12 Years a Slave Fail 2 5.31
## 2 2013 2 Guns Fail 6.1 7.56
## 3 2013 42 Fail 4 9.50
## 4 2013 47 Ronin Fail 22.5 3.84
## 5 2013 A Good Day to Di… Fail 9.2 6.73
## 6 2013 After Earth Fail 13 6.05
## 7 2013 Cloudy with a Ch… Fail 7.8 12.0
## 8 2013 Don Jon Fail 0.55 2.45
## 9 2013 Escape Plan Fail 7 2.52
## 10 2013 Gangster Squad Fail 6 4.60
## # … with 1,035 more rows, and 5 more variables:
## # intgross_2013 <dbl>, rated <chr>, metascore <dbl>,
## # imdb_rating <dbl>, genre <chr>
62 / 117


Your turn 2

Fill in the blanks.

Use initial_split(), training(), and testing() to:

  1. Split bechdel into training and test sets. Save the rsplit!

  2. Extract the training data and fit your classification tree model.

  3. Predict the testing data, and save the true bechdel values.

  4. Measure the accuracy of your model with your test set.

Keep set.seed(100) at the start of your code.

04:00
70 / 117
set.seed(100)  # Important!

bechdel_split <- initial_split(bechdel, strata = bechdel, prop = 3/4)
bechdel_train <- training(bechdel_split)
bechdel_test  <- testing(bechdel_split)

tree_mod %>%
  fit(bechdel ~ metascore + imdb_rating,
      data = bechdel_train) %>%
  predict(new_data = bechdel_test) %>%
  mutate(true_bechdel = bechdel_test$bechdel) %>%
  accuracy(truth = true_bechdel, estimate = .pred_class)
71 / 117

Goal of Machine Learning

🎯 generate accurate predictions

72 / 117

Axiom

Better Model = Better Predictions (Lower error rate)

73 / 117

accuracy()*

Calculates the accuracy based on two columns in a data frame:

The truth: y_i

The predicted estimate: ŷ_i

accuracy(data, truth, estimate)

* from yardstick

74 / 117
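Under the hood, accuracy is just the share of rows where the estimate matches the truth. A minimal sketch (not part of the original slides) that should agree with yardstick's accuracy() on the same split:

# Accuracy "by hand": proportion of predictions equal to the truth
tree_mod %>%
  fit(bechdel ~ metascore + imdb_rating, data = bechdel_train) %>%
  predict(new_data = bechdel_test) %>%
  mutate(true_bechdel = bechdel_test$bechdel) %>%
  summarize(accuracy = mean(.pred_class == true_bechdel))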
tree_mod %>%
  fit(bechdel ~ metascore + imdb_rating,
      data = bechdel_train) %>%
  predict(new_data = bechdel_test) %>%
  mutate(true_bechdel = bechdel_test$bechdel) %>%
  accuracy(truth = true_bechdel, estimate = .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.547
75 / 117


Your Turn 3

What would happen if you repeated this process? Would you get the same answers?

Note your accuracy from above. Then change your seed number and rerun just the last code chunk above. Do you get the same answer?

Try it a few times with a few different seeds.

02:00
80 / 117
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.553
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.559
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.567
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.539
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.504
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.550
81 / 117

Quiz

Why is the new estimate different?

82 / 117

Data Splitting

83 / 117


Mean accuracy

84 / 117

Resampling

Let's resample 10 times

then compute the mean of the results...

85 / 117
acc %>% tibble::enframe(name = "accuracy")
## # A tibble: 10 × 2
## accuracy value
## <int> <dbl>
## 1 1 0.550
## 2 2 0.625
## 3 3 0.599
## 4 4 0.587
## 5 5 0.605
## 6 6 0.593
## 7 7 0.542
## 8 8 0.530
## 9 9 0.564
## 10 10 0.599
mean(acc)
## [1] 0.5793696
86 / 117

Guess

Which do you think is a better estimate?

The best result or the mean of the results? Why?

87 / 117

But also...

Fit with training set

Predict with testing set

Rinse and repeat?

88 / 117

There has to be a better way...

acc <- vector(length = 10, mode = "double")

for (i in 1:10) {
  new_split <- initial_split(bechdel)
  new_train <- training(new_split)
  new_test  <- testing(new_split)
  acc[i] <- lr_mod %>%
    fit(bechdel ~ metascore + imdb_rating, data = new_train) %>%
    predict(new_test) %>%
    mutate(truth = new_test$bechdel) %>%
    accuracy(truth, .pred_class) %>%
    pull(.estimate)
}
89 / 117

The testing set is precious...

we can only use it once!

90 / 117

How can we use the training set to compare, evaluate, and tune models?

91 / 117


Cross-validation

93 / 117


V-fold cross-validation

vfold_cv(data, v = 10, ...)
103 / 117

Guess

How many times does an observation/row appear in the assessment set?

104 / 117
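One way to check, as a sketch that is not part of the original slides: tag each row with an id, build the folds, and stack the held-out (assessment) pieces. With v-fold cross-validation every row appears in exactly one assessment set:

# Each row should appear exactly once across the 10 assessment sets
bechdel_train %>%
  mutate(row_id = row_number()) %>%
  vfold_cv(v = 10) %>%
  pull(splits) %>%
  map_dfr(assessment) %>%   # stack the 10 held-out sets
  count(row_id) %>%
  distinct(n)               # n is 1 for every row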


Quiz

If we use 10 folds, which percent of our data will end up in the training set and which percent in the testing set for each fold?

90% - training

10% - test

106 / 117
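A quick back-of-the-envelope check (not on the original slide): the training set built earlier has 1,045 rows, so each held-out assessment set should contain roughly a tenth of them, which matches the 940/105 and 941/104 splits shown on the next slide.

nrow(bechdel_train) / 10
## [1] 104.5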

Your Turn 4

Run the code below. What does it return?

set.seed(100)
bechdel_folds <- vfold_cv(data = bechdel_train, v = 10, strata = bechdel)
bechdel_folds
01:00
107 / 117
set.seed(100)
bechdel_folds <- vfold_cv(data = bechdel_train, v = 10, strata = bechdel)
bechdel_folds
## # 10-fold cross-validation using stratification
## # A tibble: 10 × 2
## splits id
## <list> <chr>
## 1 <split [940/105]> Fold01
## 2 <split [940/105]> Fold02
## 3 <split [940/105]> Fold03
## 4 <split [940/105]> Fold04
## 5 <split [940/105]> Fold05
## 6 <split [940/105]> Fold06
## 7 <split [941/104]> Fold07
## 8 <split [941/104]> Fold08
## 9 <split [941/104]> Fold09
## 10 <split [942/103]> Fold10
108 / 117

We need a new way to fit

split1 <- bechdel_folds %>% pluck("splits", 1)
split1_train <- training(split1)
split1_test  <- testing(split1)

tree_mod %>%
  fit(bechdel ~ ., data = split1_train) %>%
  predict(split1_test) %>%
  mutate(truth = split1_test$bechdel) %>%
  accuracy(truth, .pred_class)

# rinse and repeat
split2 <- ...
109 / 117

fit_resamples()

Trains and tests a resampled model.

tree_mod %>%
  fit_resamples(
    bechdel ~ metascore + imdb_rating,
    resamples = bechdel_folds
  )
110 / 117
tree_mod %>%
  fit_resamples(
    bechdel ~ metascore + imdb_rating,
    resamples = bechdel_folds
  )
## # Resampling results
## # 10-fold cross-validation using stratification
## # A tibble: 10 × 4
## splits id .metrics .notes
## <list> <chr> <list> <list>
## 1 <split [940/105]> Fold01 <tibble [2 × 4]> <tibble>
## 2 <split [940/105]> Fold02 <tibble [2 × 4]> <tibble>
## 3 <split [940/105]> Fold03 <tibble [2 × 4]> <tibble>
## 4 <split [940/105]> Fold04 <tibble [2 × 4]> <tibble>
## 5 <split [940/105]> Fold05 <tibble [2 × 4]> <tibble>
## 6 <split [940/105]> Fold06 <tibble [2 × 4]> <tibble>
## 7 <split [941/104]> Fold07 <tibble [2 × 4]> <tibble>
## 8 <split [941/104]> Fold08 <tibble [2 × 4]> <tibble>
## 9 <split [941/104]> Fold09 <tibble [2 × 4]> <tibble>
## 10 <split [942/103]> Fold10 <tibble [2 × 4]> <tibble>
111 / 117

collect_metrics()

Unnest the .metrics column from a tidymodels fit_resamples() result

_results %>% collect_metrics(summarize = TRUE)

TRUE is actually the default; averages across folds

112 / 117
tree_mod %>%
  fit_resamples(
    bechdel ~ metascore + imdb_rating,
    resamples = bechdel_folds
  ) %>%
  collect_metrics(summarize = FALSE)
## # A tibble: 20 × 5
## id .metric .estimator .estimate .config
## <chr> <chr> <chr> <dbl> <chr>
## 1 Fold01 accuracy binary 0.610 Preprocessor1_Model1
## 2 Fold01 roc_auc binary 0.587 Preprocessor1_Model1
## 3 Fold02 accuracy binary 0.438 Preprocessor1_Model1
## 4 Fold02 roc_auc binary 0.434 Preprocessor1_Model1
## 5 Fold03 accuracy binary 0.6 Preprocessor1_Model1
## 6 Fold03 roc_auc binary 0.570 Preprocessor1_Model1
## 7 Fold04 accuracy binary 0.562 Preprocessor1_Model1
## 8 Fold04 roc_auc binary 0.547 Preprocessor1_Model1
## 9 Fold05 accuracy binary 0.6 Preprocessor1_Model1
## 10 Fold05 roc_auc binary 0.572 Preprocessor1_Model1
## # … with 10 more rows
113 / 117

10-fold CV

  • 10 different analysis/assessment sets

  • 10 different models (trained on analysis sets)

  • 10 different sets of performance statistics (on assessment sets)

114 / 117

Your Turn 5

Modify the code below to use fit_resamples and bechdel_folds to cross-validate the classification tree model. What is the ROC AUC that you collect at the end?

set.seed(100)
tree_mod %>%
  fit(bechdel ~ metascore + imdb_rating,
      data = bechdel_train) %>%
  predict(new_data = bechdel_test) %>%
  mutate(true_bechdel = bechdel_test$bechdel) %>%
  accuracy(truth = true_bechdel, estimate = .pred_class)
03:00
115 / 117
set.seed(100)
tree_mod %>%
  fit_resamples(bechdel ~ metascore + imdb_rating,
                resamples = bechdel_folds) %>%
  collect_metrics()
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.560 10 0.0157 Preprocessor1_Mod…
## 2 roc_auc binary 0.547 10 0.0154 Preprocessor1_Mod…
116 / 117

How did we do?

tree_mod %>%
  fit(bechdel ~ metascore + imdb_rating, data = bechdel_train) %>%
  predict(bechdel_test) %>%
  mutate(truth = bechdel_test$bechdel) %>%
  accuracy(truth, .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.550
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.560 10 0.0157 Preprocessor1_Mod…
## 2 roc_auc binary 0.547 10 0.0154 Preprocessor1_Mod…
117 / 117
