
Build a better workflow

MACS 30500
University of Chicago

1 / 71

Build a better training set with recipes

2 / 71

Recipes

3 / 71

Preprocessing options

  • Encode categorical predictors
  • Center and scale variables
  • Handle class imbalance
  • Impute missing data
  • Perform dimensionality reduction
  • A lot more! (see the sketch below)
4 / 71
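Each option above maps onto one or more step_*() functions. A sketch, as promised in the list: the step names are real recipes functions, but this particular combination is purely illustrative (and step_smote() for class imbalance lives in the separate themis package, so it is only mentioned here):

library(recipes)

recipe(bechdel ~ ., data = bechdel) %>%
  step_impute_median(all_numeric_predictors()) %>%  # impute missing data
  step_dummy(all_nominal_predictors()) %>%          # encode categorical predictors
  step_normalize(all_numeric_predictors()) %>%      # center and scale
  step_pca(all_numeric_predictors())                # dimensionality reduction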


To build a recipe

  1. Start the recipe()
  2. Define the variables involved
  3. Describe preprocessing step-by-step
6 / 71

recipe()

Creates a recipe for a set of variables

recipe(bechdel ~ ., data = bechdel)
7 / 71

recipe()

Creates a recipe for a set of variables

rec <- recipe(bechdel ~ ., data = bechdel)
8 / 71

step_*()

Adds a single transformation to a recipe. Transformations are replayed in order when the recipe is run on data.

rec <- recipe(bechdel ~ ., data = bechdel) %>%
  step_other(genre, threshold = .05)
9 / 71


Before recipe

## # A tibble: 14 × 2
##    genre         n
##    <chr>     <int>
##  1 Action      334
##  2 Comedy      286
##  3 Drama       236
##  4 Adventure    86
##  5 Animation    82
##  6 Crime        77
##  7 Horror       67
##  8 Biography    58
##  9 Mystery      10
## 10 Fantasy       6
## # … with 4 more rows

After recipe

## # A tibble: 8 × 2
##   genre         n
##   <fct>     <int>
## 1 Action      334
## 2 Comedy      286
## 3 Drama       236
## 4 Adventure    86
## 5 other        85
## 6 Animation    82
## 7 Crime        77
## 8 Horror       67
11 / 71
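You can check a step's effect yourself by training the recipe and tallying the processed data. A minimal sketch using prep() and bake(), which are explained shortly (count() is from dplyr):

rec %>%
  prep() %>%                 # train the recipe on bechdel
  bake(new_data = NULL) %>%  # return the processed training data
  count(genre, sort = TRUE)  # rare genres are now collapsed into "other"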

step_*()

Complete list at: https://recipes.tidymodels.org/reference/index.html

12 / 71

K Nearest Neighbors (KNN)

To predict the outcome of a new data point:

  • Find the K most similar old data points
  • Take the average/mode/etc. outcome (sketched below)
13 / 71
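This is not how the kknn engine implements it, but the core idea fits in a few lines. A hand-rolled sketch for classification (all names here are made up; x_train is a numeric matrix):

knn_classify <- function(x_new, x_train, y_train, k = 5) {
  # Euclidean distance from the new point to every training point
  dists <- sqrt(rowSums(sweep(x_train, 2, x_new)^2))
  # outcomes of the K nearest training points
  neighbors <- y_train[order(dists)[1:k]]
  # mode of those outcomes (use mean() instead for regression)
  names(sort(table(neighbors), decreasing = TRUE))[1]
}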

To specify a model with parsnip

  1. Pick a model
  2. Set the engine
  3. Set the mode (if needed)
14 / 71

To specify a KNN model with parsnip

knn_mod <- nearest_neighbor() %>%
set_engine("kknn") %>%
set_mode("classification")
15 / 71

Fact

KNN requires all numeric predictors, and all need to be centered and scaled.

What does that mean?

16 / 71

Quiz

Why do you need to "train" a recipe?

Imagine "scaling" a new data point. What do you subtract from it? What do you divide it by?

17 / 71
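That is the point of prep(): it "trains" the recipe, memorizing statistics such as each predictor's training mean and sd, which bake() then replays on new data. A minimal sketch, assuming the bechdel_train/bechdel_test split used later in these slides:

norm_rec <- recipe(bechdel ~ ., data = bechdel_train) %>%
  step_normalize(all_numeric_predictors()) %>%
  prep()  # means and sds are estimated from bechdel_train only

bake(norm_rec, new_data = bechdel_test)  # test rows are scaled with *training* statistics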


Guess

# A tibble: 5 × 1
  rated
  <chr>
1 R
2 PG-13
3 PG
4 G
5 NC-17

# A tibble: 1,394 × 5
       R `PG-13`    PG     G `NC-17`
   <dbl>   <dbl> <dbl> <dbl>   <dbl>
 1     1       0     0     0       0
 2     1       0     0     0       0
 3     0       1     0     0       0
 4     0       1     0     0       0
 5     1       0     0     0       0
 6     1       0     0     0       0
 7     0       1     0     0       0
 8     0       1     0     0       0
 9     1       0     0     0       0
10     1       0     0     0       0
# … with 1,384 more rows
21 / 71
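A table like this can be reproduced with step_dummy(one_hot = TRUE), which keeps one indicator column per level. A sketch (by default step_dummy() instead drops one reference level, which is what the glm() output on the next slide reflects):

recipe(~ rated, data = bechdel) %>%
  step_dummy(rated, one_hot = TRUE) %>%  # one 0/1 column per rating level
  prep() %>%
  bake(new_data = NULL)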

Dummy Variables

glm(bechdel ~ rated, family = "binomial", data = bechdel)
## # A tibble: 5 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)   -0.470     0.403    -1.17     0.244
## 2 ratedNC-17    -1.14      1.17     -0.976    0.329
## 3 ratedPG        0.225     0.427     0.527    0.598
## 4 ratedPG-13     0.354     0.412     0.859    0.391
## 5 ratedR         0.198     0.411     0.482    0.630
22 / 71

step_dummy()

Converts nominal data into numeric dummy variables, needed as predictors for models like KNN.

rec <- recipe(bechdel ~ ., data = bechdel) %>%
  step_other(genre, threshold = .05) %>%
  step_dummy(all_nominal_predictors())

You don't need this for decision trees or ensembles of trees.

23 / 71

Quiz

How does recipes know which variables are numeric and which are nominal?

rec <- recipe(
  bechdel ~ .,
  data = bechdel
)

24 / 71

Quiz

How does recipes know what is a predictor and what is an outcome?

rec <- recipe(
  bechdel ~ .,
  data = bechdel
)

The formula indicates outcomes vs predictors

The data is only used to catalog the names and types of each variable

25 / 71

Selectors

Helper functions for selecting sets of variables

rec %>%
  step_novel(all_nominal()) %>%
  step_zv(all_predictors())
26 / 71
selector                   description
all_predictors()           Each x variable (right side of ~)
all_outcomes()             Each y variable (left side of ~)
all_numeric()              Each numeric variable
all_nominal()              Each categorical variable (e.g. factor, string)
all_nominal_predictors()   Each categorical variable (e.g. factor, string) that is defined as a predictor
all_numeric_predictors()   Each numeric variable that is defined as a predictor
dplyr::select() helpers    starts_with('IL_'), etc.

27 / 71
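Selectors compose like dplyr::select(): they can be combined and negated inside any step. A hypothetical example (the imdb_ column prefix is made up for illustration):

rec %>%
  step_log(all_numeric_predictors(), -starts_with("imdb_")) %>%
  step_unknown(all_nominal_predictors())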

Let's think about the modeling.

What if there were no films rated NC-17 in the training data?

Will the model have a coefficient for rated NC-17?

No

What will happen if the test data includes a film rated NC-17?

Error!

28 / 71

step_novel()

Adds a catch-all level to a factor for any new values not encountered in model training, which lets R intelligently predict new levels in the test set.

rec <- recipe(bechdel ~ ., data = bechdel) %>%
  step_other(genre, threshold = .05) %>%
  step_novel(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors())

Use it before step_dummy() so the new level is dummified.

29 / 71
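A toy illustration of the failure mode and the fix (data made up; the level "c" never appears in training):

library(recipes)

train <- data.frame(y = c(1, 2), x = factor(c("a", "b")))
test  <- data.frame(y = 3, x = factor("c"))

rec_novel <- recipe(y ~ x, data = train) %>%
  step_novel(x) %>%  # reserve a catch-all "new" level...
  step_dummy(x) %>%  # ...so it gets its own dummy column
  prep()

bake(rec_novel, new_data = test)  # "c" is mapped to "new" instead of erroring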

Guess

What would happen if you try to normalize a variable that doesn't vary?

Error! You'd be dividing by zero!

30 / 71

step_zv()

Intelligently handles zero variance variables (variables that contain only a single value)

rec <- recipe(bechdel ~ ., data = bechdel) %>%
  step_other(genre, threshold = .05) %>%
  step_novel(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors())
31 / 71
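A quick toy check of what step_zv() drops (data made up):

library(recipes)

df <- data.frame(y = 1:4, constant = 1, varying = c(2, 5, 3, 8))

recipe(y ~ ., data = df) %>%
  step_zv(all_predictors()) %>%
  prep() %>%
  bake(new_data = NULL)  # "constant" is removed; "varying" survives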

step_normalize()

Centers, then scales, numeric variables (mean = 0, sd = 1)

rec <- recipe(bechdel ~ ., data = bechdel) %>%
  step_other(genre, threshold = .05) %>%
  step_novel(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric())
32 / 71

Your Turn 1

Unscramble! You have all the steps from our knn_rec: your challenge is to unscramble them into the right order!

Save the result as knn_rec

05:00
33 / 71
knn_rec <- recipe(formula = bechdel ~ ., data = bechdel) %>%
  step_other(genre, threshold = .05) %>%
  step_novel(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())

knn_rec
## Recipe
##
## Inputs:
##
##       role #variables
##    outcome          1
##  predictor          8
##
## Operations:
##
## Collapsing factor levels for genre
## Novel factor level assignment for all_nominal_predictors()
## Dummy variables from all_nominal_predictors()
## Zero variance filter on all_predictors()
## Centering and scaling for all_numeric_predictors()
34 / 71
library(usemodels)
use_kknn(bechdel ~ ., data = bechdel, verbose = TRUE, tune = FALSE)
## kknn_recipe <-
##   recipe(formula = bechdel ~ ., data = bechdel) %>%
##   ## For modeling, it is preferred to encode qualitative data as factors
##   ## (instead of character).
##   step_string2factor(one_of("rated", "genre")) %>%
##   step_novel(all_nominal_predictors()) %>%
##   ## This model requires the predictors to be numeric. The most common
##   ## method to convert qualitative predictors to numeric is to create
##   ## binary indicator variables (aka dummy variables) from these
##   ## predictors.
##   step_dummy(all_nominal_predictors()) %>%
##   ## Since distance calculations are used, the predictor variables should
##   ## be on the same scale. Before centering and scaling the numeric
##   ## predictors, any predictors with a single unique value are filtered
##   ## out.
##   step_zv(all_predictors()) %>%
##   step_normalize(all_numeric_predictors())
##
## kknn_spec <-
##   nearest_neighbor() %>%
##   set_mode("classification") %>%
##   set_engine("kknn")
##
## kknn_workflow <-
##   workflow() %>%
##   add_recipe(kknn_recipe) %>%
##   add_model(kknn_spec)
36 / 71
use_glmnet(bechdel ~ ., data = bechdel, verbose = TRUE, tune = FALSE)
## glmnet_recipe <-
##   recipe(formula = bechdel ~ ., data = bechdel) %>%
##   ## For modeling, it is preferred to encode qualitative data as factors
##   ## (instead of character).
##   step_string2factor(one_of("rated", "genre")) %>%
##   step_novel(all_nominal_predictors()) %>%
##   ## This model requires the predictors to be numeric. The most common
##   ## method to convert qualitative predictors to numeric is to create
##   ## binary indicator variables (aka dummy variables) from these
##   ## predictors.
##   step_dummy(all_nominal_predictors()) %>%
##   ## Regularization methods sum up functions of the model slope
##   ## coefficients. Because of this, the predictor variables should be on
##   ## the same scale. Before centering and scaling the numeric predictors,
##   ## any predictors with a single unique value are filtered out.
##   step_zv(all_predictors()) %>%
##   step_normalize(all_numeric_predictors())
##
## glmnet_spec <-
##   logistic_reg() %>%
##   set_mode("classification") %>%
##   set_engine("glmnet")
##
## glmnet_workflow <-
##   workflow() %>%
##   add_recipe(glmnet_recipe) %>%
##   add_model(glmnet_spec)
37 / 71

Now we've built a recipe.

But, how do we use a recipe?

38 / 71

Axiom

Feature engineering and modeling are two halves of a single predictive workflow.

39 / 71


Workflows

53 / 71

workflow()

Creates a workflow to which you can add a model and more

workflow()
54 / 71

add_formula()

Adds a formula to a workflow *

workflow() %>% add_formula(bechdel ~ metascore)

* If you do not plan to do your own preprocessing

55 / 71

add_model()

Adds a parsnip model spec to a workflow

workflow() %>% add_model(knn_mod)
56 / 71

Guess

If we use add_model() to add a model to a workflow, what would we use to add a recipe?

Let's see!

57 / 71

Your Turn 2

Fill in the blanks to make a workflow that combines knn_rec with knn_mod.

01:00
58 / 71
knn_wf <- workflow() %>%
  add_recipe(knn_rec) %>%
  add_model(knn_mod)

knn_wf
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: nearest_neighbor()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 5 Recipe Steps
##
## • step_other()
## • step_novel()
## • step_dummy()
## • step_zv()
## • step_normalize()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## K-Nearest Neighbor Model Specification (classification)
##
## Computational engine: kknn
59 / 71

add_recipe()

Adds a recipe to a workflow.

knn_wf <- workflow() %>%
  add_recipe(knn_rec) %>%
  add_model(knn_mod)
60 / 71

Guess

Do you need to add a formula if you have a recipe?

Nope!

rec <- recipe(
  bechdel ~ .,
  data = bechdel
)

61 / 71

fit()

Fit a workflow that bundles a recipe* and a model.

_wf %>%
  fit(data = bechdel_train) %>%
  predict(bechdel_test)...

* or a formula, if you do not plan to do your own preprocessing

62 / 71
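End to end, this looks something like the sketch below. The initial_split() code is an assumption; these slides only ever reference bechdel_train and bechdel_test:

library(tidymodels)
set.seed(100)

bechdel_split <- initial_split(bechdel, strata = bechdel)  # assumed split
bechdel_train <- training(bechdel_split)
bechdel_test  <- testing(bechdel_split)

knn_fit <- knn_wf %>% fit(data = bechdel_train)           # preps the recipe, then fits the model
predict(knn_fit, new_data = bechdel_test, type = "prob")  # bakes, then predicts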

Preprocess k-fold resamples?

set.seed(100)
bechdel_folds <- vfold_cv(bechdel_train, v = 10, strata = bechdel)
63 / 71

fit_resamples()

Fit a workflow that bundles a recipe* and a model with resampling.

_wf %>%
  fit_resamples(resamples = bechdel_folds)

* or a formula, if you do not plan to do your own preprocessing

64 / 71

Your Turn 3

Run the first chunk. Then try our KNN workflow on bechdel_folds. What is the ROC AUC?

03:00
65 / 71
set.seed(100)
bechdel_folds <- vfold_cv(bechdel_train, v = 10, strata = bechdel)

knn_wf %>%
  fit_resamples(resamples = bechdel_folds) %>%
  collect_metrics()
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>
## 1 accuracy binary     0.591    10  0.0165 Preprocessor1_Mod…
## 2 roc_auc  binary     0.604    10  0.0176 Preprocessor1_Mod…
66 / 71

Feature Engineering

[Figures: model results "Before" and "After" feature engineering]

67 / 71

update_recipe()

Replace the recipe in a workflow.

_wf %>%
  update_recipe(glmnet_rec)
68 / 71

update_model()

Replace the model in a workflow.

_wf %>%
  update_model(tree_mod)
69 / 71

Your Turn 4

Turns out, the same knn_rec recipe can also be used to fit a penalized logistic regression model. Let's try it out!

plr_mod <- logistic_reg(penalty = .01, mixture = 1) %>%
  set_engine("glmnet") %>%
  set_mode("classification")

plr_mod %>%
  translate()
## Logistic Regression Model Specification (classification)
##
## Main Arguments:
## penalty = 0.01
## mixture = 1
##
## Computational engine: glmnet
##
## Model fit template:
## glmnet::glmnet(x = missing_arg(), y = missing_arg(), weights = missing_arg(),
##     alpha = 1, family = "binomial")
03:00
70 / 71
glmnet_wf <- knn_wf %>%
  update_model(plr_mod)

glmnet_wf %>%
  fit_resamples(resamples = bechdel_folds) %>%
  collect_metrics()
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>
## 1 accuracy binary     0.615    10  0.0139 Preprocessor1_Mod…
## 2 roc_auc  binary     0.647    10  0.0193 Preprocessor1_Mod…
71 / 71
