
An introduction to machine learning

MACS 30500
University of Chicago

1 / 117

What is machine learning?

2 / 117


Types of machine learning

  • Supervised
  • Unsupervised
10 / 117
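To make the distinction concrete, here is a minimal base-R sketch (not from the original slides): a supervised method learns to predict a labeled outcome, while an unsupervised method looks for structure with no outcome at all. The mtcars data and the choice of three clusters are purely illustrative.

# Supervised: we have a labeled outcome (mpg) and learn to predict it
supervised_fit <- lm(mpg ~ wt + hp, data = mtcars)

# Unsupervised: no outcome; look for structure among the predictors
unsupervised_fit <- kmeans(scale(mtcars[, c("wt", "hp")]), centers = 3)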

Examples of supervised learning

  • Ad click prediction
  • Supreme Court decision making
  • Police misconduct prediction
12 / 117

Two modes

Classification

Is your hospital in a state of crisis care?

Regression

What percentage of hospital beds are occupied?

13 / 117
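As a rough sketch using the parsnip syntax introduced later in these slides (and assuming library(tidymodels) is loaded), the same model family can be pointed at either mode; only the mode changes:

library(tidymodels)

# Classification: predict a category (e.g. crisis care, yes/no)
decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")

# Regression: predict a number (e.g. percent of beds occupied)
decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("regression")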

Two cultures

Statistics

  • model first
  • inference emphasis

Machine Learning

  • data first
  • prediction emphasis
14 / 117

Statistics

16 / 117

"Statisticians, like artists, have the bad habit of falling in love with their models."

— George Box

17 / 117

Predictive modeling

library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 0.2.0 ──
## ✔ broom 0.8.0 ✔ recipes 0.2.0
## ✔ dials 0.1.1 ✔ rsample 0.1.1
## ✔ dplyr 1.0.9 ✔ tibble 3.1.7
## ✔ ggplot2 3.3.6 ✔ tidyr 1.2.0
## ✔ infer 1.0.0 ✔ tune 0.2.0
## ✔ modeldata 0.1.1 ✔ workflows 0.2.6
## ✔ parsnip 0.2.1 ✔ workflowsets 0.2.1
## ✔ purrr 0.3.4 ✔ yardstick 1.0.0
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ recipes::step() masks stats::step()
## • Dig deeper into tidy modeling with R at https://www.tmwr.org
18 / 117

Bechdel Test

Source: FiveThirtyEight

19 / 117

Bechdel Test

  1. It has to have at least two [named] women in it
  2. Who talk to each other
  3. About something besides a man
library(rcfss)
data("bechdel")
glimpse(bechdel)
## Rows: 1,394
## Columns: 10
## $ year <dbl> 2013, 2013, 2013, 2013, 2013, 2013, …
## $ title <chr> "12 Years a Slave", "2 Guns", "42", …
## $ bechdel <fct> Fail, Fail, Fail, Fail, Fail, Pass, …
## $ budget_2013 <dbl> 2.00, 6.10, 4.00, 22.50, 9.20, 1.20,…
## $ domgross_2013 <dbl> 5.310703, 7.561246, 9.502021, 3.8362…
## $ intgross_2013 <dbl> 15.860703, 13.249301, 9.502021, 14.5…
## $ rated <chr> "R", "R", "PG-13", "PG-13", "R", "R"…
## $ metascore <dbl> 97, 55, 62, 29, 28, 55, 48, 33, 90, …
## $ imdb_rating <dbl> 8.3, 6.8, 7.6, 6.6, 5.4, 7.8, 5.7, 5…
## $ genre <chr> "Biography", "Action", "Biography", …
20 / 117

Bechdel test data

  • N = 1394
  • 1 categorical outcome: bechdel
  • 9 predictors
21 / 117
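Before modeling, a quick way to check the balance of the categorical outcome is to count it; a minimal sketch assuming the bechdel data loaded above (the resulting counts are not shown on the original slides):

# How balanced is the Fail/Pass outcome?
bechdel %>%
  count(bechdel) %>%
  mutate(prop = n / sum(n))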


What is the goal of machine learning?

Build models that

generate accurate predictions

for future, yet-to-be-seen data.

Max Kuhn & Kjell Johnson, http://www.feat.engineering/

23 / 117

Machine learning

We'll use this goal to drive learning of 3 core tidymodels packages:

  • parsnip
  • yardstick
  • rsample
24 / 117

🔨 Build models with parsnip

25 / 117

parsnip

26 / 117

glm()

glm(bechdel ~ metascore, family = binomial, data = bechdel)
##
## Call: glm(formula = bechdel ~ metascore, family = binomial, data = bechdel)
##
## Coefficients:
## (Intercept) metascore
## 0.052274 -0.004563
##
## Degrees of Freedom: 1393 Total (i.e. Null); 1392 Residual
## Null Deviance: 1916
## Residual Deviance: 1914 AIC: 1918
27 / 117


To specify a model with parsnip

  1. Pick a model
  2. Set the engine
  3. Set the mode (if needed)
29 / 117

To specify a model with parsnip

logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")
## Logistic Regression Model Specification (classification)
##
## Computational engine: glm
30 / 117

To specify a model with parsnip

decision_tree() %>%
  set_engine("C5.0") %>%
  set_mode("classification")
## Decision Tree Model Specification (classification)
##
## Computational engine: C5.0
31 / 117

To specify a model with parsnip

nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("classification")
## K-Nearest Neighbor Model Specification (classification)
##
## Computational engine: kknn
32 / 117

1. Pick a model

All available models are listed at

https://www.tidymodels.org/find/parsnip/

33 / 117

logistic_reg()

Specifies a model that uses logistic regression

logistic_reg(penalty = NULL, mixture = NULL)
34 / 117

logistic_reg()

Specifies a model that uses logistic regression

logistic_reg(
  mode = "classification",  # "default" mode, if exists
  penalty = NULL,           # model hyper-parameter
  mixture = NULL            # model hyper-parameter
)
35 / 117
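penalty and mixture only come into play with engines that fit regularized models. As a hedged example (not from the original slides), a lasso-style logistic regression via the glmnet engine might look like this; the penalty value 0.01 is arbitrary:

# Regularized logistic regression:
#   penalty = amount of regularization
#   mixture = 1 is pure lasso, 0 is pure ridge
logistic_reg(penalty = 0.01, mixture = 1) %>%
  set_engine("glmnet")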

set_engine()

Adds an engine to power or implement the model.

logistic_reg() %>% set_engine(engine = "glm")
36 / 117

set_mode()

Sets the class of problem the model will solve, which influences which output is collected. Not necessary if mode is set in Step 1.

logistic_reg() %>% set_mode(mode = "classification")
37 / 117

Your turn 1

Run the chunk in your .Rmd and look at the output. Then, copy/paste the code and edit to create:

  • a decision tree model for classification

  • that uses the C5.0 engine.

Save it as tree_mod and look at the object. What is different about the output?

Hint: you'll need https://www.tidymodels.org/find/parsnip/

03:00
38 / 117
lr_mod <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

lr_mod
## Logistic Regression Model Specification (classification)
##
## Computational engine: glm

tree_mod <- decision_tree() %>%
  set_engine(engine = "C5.0") %>%
  set_mode("classification")

tree_mod
## Decision Tree Model Specification (classification)
##
## Computational engine: C5.0
39 / 117

Now we've built a model.

But, how do we use a model?

First - what does it mean to use a model?

40 / 117

Statistical models learn from the data.

Many learn model parameters, which can be useful as values for inference and interpretation.

41 / 117

A fitted model

lr_mod %>%
  fit(bechdel ~ metascore + imdb_rating,
      data = bechdel) %>%
  broom::tidy()
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 2.70 0.436 6.20 5.64e-10
## 2 metascore 0.0202 0.00481 4.20 2.66e- 5
## 3 imdb_rating -0.606 0.0889 -6.82 8.87e-12

42 / 117

"All models are wrong, but some are useful"

43 / 117

"All models are wrong, but some are useful"

44 / 117

Predict new data

bechdel_new <- tibble(metascore = c(40, 50, 60),
                      imdb_rating = c(6, 6, 6),
                      bechdel = c("Fail", "Fail", "Pass")) %>%
  mutate(bechdel = factor(bechdel, levels = c("Fail", "Pass")))
bechdel_new
## # A tibble: 3 × 3
## metascore imdb_rating bechdel
## <dbl> <dbl> <fct>
## 1 40 6 Fail
## 2 50 6 Fail
## 3 60 6 Pass
45 / 117

Predict old data

tree_mod %>%
  fit(bechdel ~ metascore + imdb_rating,
      data = bechdel) %>%
  predict(new_data = bechdel) %>%
  mutate(true_bechdel = bechdel$bechdel) %>%
  accuracy(truth = true_bechdel,
           estimate = .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.580
46 / 117

Predict new data

out with the old...

tree_mod %>%
  fit(bechdel ~ metascore + imdb_rating,
      data = bechdel) %>%
  predict(new_data = bechdel) %>%
  mutate(true_bechdel = bechdel$bechdel) %>%
  accuracy(truth = true_bechdel,
           estimate = .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.580

in with the 🆕

tree_mod %>%
  fit(bechdel ~ metascore + imdb_rating,
      data = bechdel) %>%
  predict(new_data = bechdel_new) %>%
  mutate(true_bechdel = bechdel_new$bechdel) %>%
  accuracy(truth = true_bechdel,
           estimate = .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.333
47 / 117

fit()

Train a model by fitting a model. Returns a parsnip model fit.

fit(tree_mod, bechdel ~ metascore + imdb_rating, data = bechdel)
48 / 117

fit()

Train a model by fitting a model. Returns a parsnip model fit.

tree_mod %>%                              # parsnip model
  fit(bechdel ~ metascore + imdb_rating,  # a formula
      data = bechdel                      # dataframe
  )
49 / 117

fit()

Train a model by fitting a model. Returns a parsnip model fit.

tree_fit <- tree_mod %>%                  # parsnip model
  fit(bechdel ~ metascore + imdb_rating,  # a formula
      data = bechdel                      # dataframe
  )
50 / 117


predict()

Use a fitted model to predict new y values from data. Returns a tibble.

predict(tree_fit, new_data = bechdel_new)
53 / 117
tree_fit %>%
  predict(new_data = bechdel_new)
## # A tibble: 3 × 1
## .pred_class
## <fct>
## 1 Pass
## 2 Pass
## 3 Pass
54 / 117

Axiom

The best way to measure a model's performance at predicting new data is to predict new data.

55 / 117

Data splitting

56 / 117

♻️ Resample models with rsample

57 / 117

rsample

58 / 117

initial_split()*

"Splits" data randomly into a single testing and a single training set.

initial_split(data, prop = 3/4)

* from rsample

59 / 117
bechdel_split <- initial_split(data = bechdel, strata = bechdel, prop = 3/4)
bechdel_split
## <Analysis/Assess/Total>
## <1045/349/1394>
60 / 117

training() and testing()*

Extract training and testing sets from an rsplit

training(bechdel_split)
testing(bechdel_split)

* from rsample

61 / 117
bechdel_train <- training(bechdel_split)
bechdel_train
## # A tibble: 1,045 × 10
## year title bechdel budget_2013 domgross_2013
## <dbl> <chr> <fct> <dbl> <dbl>
## 1 2013 12 Years a Slave Fail 2 5.31
## 2 2013 2 Guns Fail 6.1 7.56
## 3 2013 42 Fail 4 9.50
## 4 2013 47 Ronin Fail 22.5 3.84
## 5 2013 A Good Day to Di… Fail 9.2 6.73
## 6 2013 After Earth Fail 13 6.05
## 7 2013 Cloudy with a Ch… Fail 7.8 12.0
## 8 2013 Don Jon Fail 0.55 2.45
## 9 2013 Escape Plan Fail 7 2.52
## 10 2013 Gangster Squad Fail 6 4.60
## # … with 1,035 more rows, and 5 more variables:
## # intgross_2013 <dbl>, rated <chr>, metascore <dbl>,
## # imdb_rating <dbl>, genre <chr>
62 / 117


Your turn 2

Fill in the blanks.

Use initial_split(), training(), and testing() to:

  1. Split bechdel into training and test sets. Save the rsplit!

  2. Extract the training data and fit your classification tree model.

  3. Predict the testing data, and save the true bechdel values.

  4. Measure the accuracy of your model with your test set.

Keep set.seed(100) at the start of your code.

04:00
70 / 117
set.seed(100)  # Important!

bechdel_split <- initial_split(bechdel, strata = bechdel, prop = 3/4)
bechdel_train <- training(bechdel_split)
bechdel_test  <- testing(bechdel_split)

tree_mod %>%
  fit(bechdel ~ metascore + imdb_rating,
      data = bechdel_train) %>%
  predict(new_data = bechdel_test) %>%
  mutate(true_bechdel = bechdel_test$bechdel) %>%
  accuracy(truth = true_bechdel, estimate = .pred_class)
71 / 117

Goal of Machine Learning

🎯 generate accurate predictions

72 / 117

Axiom

Better Model = Better Predictions (Lower error rate)

73 / 117

accuracy()*

Calculates the accuracy based on two columns in a data frame:

The truth: y_i

The predicted estimate: ŷ_i

accuracy(data, truth, estimate)

* from yardstick

74 / 117
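Under the hood, accuracy is just the share of rows where the estimate matches the truth. A minimal sketch (not part of the original slides) that should agree with yardstick's accuracy() on the same split:

# Accuracy "by hand": proportion of predictions equal to the truth
tree_mod %>%
  fit(bechdel ~ metascore + imdb_rating, data = bechdel_train) %>%
  predict(new_data = bechdel_test) %>%
  mutate(true_bechdel = bechdel_test$bechdel) %>%
  summarize(accuracy = mean(.pred_class == true_bechdel))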
tree_mod %>%
  fit(bechdel ~ metascore + imdb_rating,
      data = bechdel_train) %>%
  predict(new_data = bechdel_test) %>%
  mutate(true_bechdel = bechdel_test$bechdel) %>%
  accuracy(truth = true_bechdel, estimate = .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.547
75 / 117


Your Turn 3

What would happen if you repeated this process? Would you get the same answers?

Note your accuracy from above. Then change your seed number and rerun just the last code chunk above. Do you get the same answer?

Try it a few times with a few different seeds.

02:00
80 / 117
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.553
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.559
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.567
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.539
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.504
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.550
81 / 117

Quiz

Why is the new estimate different?

82 / 117

Data Splitting

83 / 117


Mean accuracy

84 / 117

Resampling

Let's resample 10 times

then compute the mean of the results...

85 / 117
acc %>% tibble::enframe(name = "accuracy")
## # A tibble: 10 × 2
## accuracy value
## <int> <dbl>
## 1 1 0.550
## 2 2 0.625
## 3 3 0.599
## 4 4 0.587
## 5 5 0.605
## 6 6 0.593
## 7 7 0.542
## 8 8 0.530
## 9 9 0.564
## 10 10 0.599
mean(acc)
## [1] 0.5793696
86 / 117

Guess

Which do you think is a better estimate?

The best result or the mean of the results? Why?

87 / 117

But also...

Fit with training set

Predict with testing set

Rinse and repeat?

88 / 117

There has to be a better way...

acc <- vector(length = 10, mode = "double")

for (i in 1:10) {
  new_split <- initial_split(bechdel)
  new_train <- training(new_split)
  new_test  <- testing(new_split)
  acc[i] <- lr_mod %>%
    fit(bechdel ~ metascore + imdb_rating, data = new_train) %>%
    predict(new_test) %>%
    mutate(truth = new_test$bechdel) %>%
    accuracy(truth, .pred_class) %>%
    pull(.estimate)
}
89 / 117

The testing set is precious...

we can only use it once!

90 / 117

How can we use the training set to compare, evaluate, and tune models?

91 / 117


Cross-validation

93 / 117


V-fold cross-validation

vfold_cv(data, v = 10, ...)
103 / 117

Guess

How many times does an observation/row appear in the assessment set?

104 / 117
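One way to check, as a sketch that is not part of the original slides: tag each row with an id, build the folds, and stack the held-out (assessment) pieces. With v-fold cross-validation every row appears in exactly one assessment set:

# Each row should appear exactly once across the 10 assessment sets
bechdel_train %>%
  mutate(row_id = row_number()) %>%
  vfold_cv(v = 10) %>%
  pull(splits) %>%
  map_dfr(assessment) %>%   # stack the 10 held-out sets
  count(row_id) %>%
  distinct(n)               # n is 1 for every row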


Quiz

If we use 10 folds, which percent of our data will end up in the training set and which percent in the testing set for each fold?

90% - training

10% - test

106 / 117
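A quick back-of-the-envelope check (not on the original slide): the training set built earlier has 1,045 rows, so each held-out assessment set should contain roughly a tenth of them, which matches the 940/105 and 941/104 splits shown on the next slide.

nrow(bechdel_train) / 10
## [1] 104.5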

Your Turn 4

Run the code below. What does it return?

set.seed(100)
bechdel_folds <- vfold_cv(data = bechdel_train, v = 10, strata = bechdel)
bechdel_folds
01:00
107 / 117
set.seed(100)
bechdel_folds <- vfold_cv(data = bechdel_train, v = 10, strata = bechdel)
bechdel_folds
## # 10-fold cross-validation using stratification
## # A tibble: 10 × 2
## splits id
## <list> <chr>
## 1 <split [940/105]> Fold01
## 2 <split [940/105]> Fold02
## 3 <split [940/105]> Fold03
## 4 <split [940/105]> Fold04
## 5 <split [940/105]> Fold05
## 6 <split [940/105]> Fold06
## 7 <split [941/104]> Fold07
## 8 <split [941/104]> Fold08
## 9 <split [941/104]> Fold09
## 10 <split [942/103]> Fold10
108 / 117

We need a new way to fit

split1 <- bechdel_folds %>% pluck("splits", 1)
split1_train <- training(split1)
split1_test  <- testing(split1)

tree_mod %>%
  fit(bechdel ~ ., data = split1_train) %>%
  predict(split1_test) %>%
  mutate(truth = split1_test$bechdel) %>%
  accuracy(truth, .pred_class)

# rinse and repeat
split2 <- ...
109 / 117

fit_resamples()

Trains and tests a resampled model.

tree_mod %>%
  fit_resamples(
    bechdel ~ metascore + imdb_rating,
    resamples = bechdel_folds
  )
110 / 117
tree_mod %>%
  fit_resamples(
    bechdel ~ metascore + imdb_rating,
    resamples = bechdel_folds
  )
## # Resampling results
## # 10-fold cross-validation using stratification
## # A tibble: 10 × 4
## splits id .metrics .notes
## <list> <chr> <list> <list>
## 1 <split [940/105]> Fold01 <tibble [2 × 4]> <tibble>
## 2 <split [940/105]> Fold02 <tibble [2 × 4]> <tibble>
## 3 <split [940/105]> Fold03 <tibble [2 × 4]> <tibble>
## 4 <split [940/105]> Fold04 <tibble [2 × 4]> <tibble>
## 5 <split [940/105]> Fold05 <tibble [2 × 4]> <tibble>
## 6 <split [940/105]> Fold06 <tibble [2 × 4]> <tibble>
## 7 <split [941/104]> Fold07 <tibble [2 × 4]> <tibble>
## 8 <split [941/104]> Fold08 <tibble [2 × 4]> <tibble>
## 9 <split [941/104]> Fold09 <tibble [2 × 4]> <tibble>
## 10 <split [942/103]> Fold10 <tibble [2 × 4]> <tibble>
111 / 117

collect_metrics()

Unnest the .metrics column from a tidymodels fit_resamples() result

_results %>% collect_metrics(summarize = TRUE)

TRUE is actually the default; averages across folds

112 / 117
tree_mod %>%
  fit_resamples(
    bechdel ~ metascore + imdb_rating,
    resamples = bechdel_folds
  ) %>%
  collect_metrics(summarize = FALSE)
## # A tibble: 20 × 5
## id .metric .estimator .estimate .config
## <chr> <chr> <chr> <dbl> <chr>
## 1 Fold01 accuracy binary 0.610 Preprocessor1_Model1
## 2 Fold01 roc_auc binary 0.587 Preprocessor1_Model1
## 3 Fold02 accuracy binary 0.438 Preprocessor1_Model1
## 4 Fold02 roc_auc binary 0.434 Preprocessor1_Model1
## 5 Fold03 accuracy binary 0.6 Preprocessor1_Model1
## 6 Fold03 roc_auc binary 0.570 Preprocessor1_Model1
## 7 Fold04 accuracy binary 0.562 Preprocessor1_Model1
## 8 Fold04 roc_auc binary 0.547 Preprocessor1_Model1
## 9 Fold05 accuracy binary 0.6 Preprocessor1_Model1
## 10 Fold05 roc_auc binary 0.572 Preprocessor1_Model1
## # … with 10 more rows
113 / 117

10-fold CV

  • 10 different analysis/assessment sets

  • 10 different models (trained on analysis sets)

  • 10 different sets of performance statistics (on assessment sets)

114 / 117

Your Turn 5

Modify the code below to use fit_resamples and bechdel_folds to cross-validate the classification tree model. What is the ROC AUC that you collect at the end?

set.seed(100)
tree_mod %>%
  fit(bechdel ~ metascore + imdb_rating,
      data = bechdel_train) %>%
  predict(new_data = bechdel_test) %>%
  mutate(true_bechdel = bechdel_test$bechdel) %>%
  accuracy(truth = true_bechdel, estimate = .pred_class)
03:00
115 / 117
set.seed(100)
tree_mod %>%
  fit_resamples(bechdel ~ metascore + imdb_rating,
                resamples = bechdel_folds) %>%
  collect_metrics()
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.560 10 0.0157 Preprocessor1_Mod…
## 2 roc_auc binary 0.547 10 0.0154 Preprocessor1_Mod…
116 / 117

How did we do?

tree_mod %>%
  fit(bechdel ~ metascore + imdb_rating, data = bechdel_train) %>%
  predict(bechdel_test) %>%
  mutate(truth = bechdel_test$bechdel) %>%
  accuracy(truth, .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.550
## # A tibble: 2 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.560 10 0.0157 Preprocessor1_Mod…
## 2 roc_auc binary 0.547 10 0.0154 Preprocessor1_Mod…
117 / 117
