
Build a better workflow

MACS 30500
University of Chicago

1 / 71

Build a better training set with recipes

2 / 71

Recipes

3 / 71

Preprocessing options

  • Encode categorical predictors
  • Center and scale variables
  • Handle class imbalance
  • Impute missing data
  • Perform dimensionality reduction
  • A lot more! (see the sketch below)
4 / 71
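Each option above maps onto one or more step_*() functions. A sketch, as promised in the list: the step names are real recipes functions, but this particular combination is purely illustrative (and step_smote() for class imbalance lives in the separate themis package, so it is only mentioned here):

library(recipes)

recipe(bechdel ~ ., data = bechdel) %>%
  step_impute_median(all_numeric_predictors()) %>%  # impute missing data
  step_dummy(all_nominal_predictors()) %>%          # encode categorical predictors
  step_normalize(all_numeric_predictors()) %>%      # center and scale
  step_pca(all_numeric_predictors())                # dimensionality reduction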


To build a recipe

  1. Start the recipe()
  2. Define the variables involved
  3. Describe preprocessing step-by-step
6 / 71

recipe()

Creates a recipe for a set of variables

recipe(bechdel ~ ., data = bechdel)
7 / 71

recipe()

Creates a recipe for a set of variables

rec <- recipe(bechdel ~ ., data = bechdel)
8 / 71

step_*()

Adds a single transformation to a recipe. Transformations are replayed in order when the recipe is run on data.

rec <- recipe(bechdel ~ ., data = bechdel) %>%
  step_other(genre, threshold = .05)
9 / 71


Before recipe

## # A tibble: 14 × 2
##    genre         n
##    <chr>     <int>
##  1 Action      334
##  2 Comedy      286
##  3 Drama       236
##  4 Adventure    86
##  5 Animation    82
##  6 Crime        77
##  7 Horror       67
##  8 Biography    58
##  9 Mystery      10
## 10 Fantasy       6
## # … with 4 more rows

After recipe

## # A tibble: 8 × 2
##   genre         n
##   <fct>     <int>
## 1 Action      334
## 2 Comedy      286
## 3 Drama       236
## 4 Adventure    86
## 5 other        85
## 6 Animation    82
## 7 Crime        77
## 8 Horror       67
11 / 71
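You can check a step's effect yourself by training the recipe and tallying the processed data. A minimal sketch using prep() and bake(), which are explained shortly (count() is from dplyr):

rec %>%
  prep() %>%                 # train the recipe on bechdel
  bake(new_data = NULL) %>%  # return the processed training data
  count(genre, sort = TRUE)  # rare genres are now collapsed into "other"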

step_*()

Complete list at: https://recipes.tidymodels.org/reference/index.html

12 / 71

K Nearest Neighbors (KNN)

To predict the outcome of a new data point:

  • Find the K most similar old data points
  • Take the average/mode/etc. outcome (sketched below)
13 / 71
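This is not how the kknn engine implements it, but the core idea fits in a few lines. A hand-rolled sketch for classification (all names here are made up; x_train is a numeric matrix):

knn_classify <- function(x_new, x_train, y_train, k = 5) {
  # Euclidean distance from the new point to every training point
  dists <- sqrt(rowSums(sweep(x_train, 2, x_new)^2))
  # outcomes of the K nearest training points
  neighbors <- y_train[order(dists)[1:k]]
  # mode of those outcomes (use mean() instead for regression)
  names(sort(table(neighbors), decreasing = TRUE))[1]
}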

To specify a model with parsnip

  1. Pick a model
  2. Set the engine
  3. Set the mode (if needed)
14 / 71

To specify a KNN model with parsnip

knn_mod <- nearest_neighbor() %>%
set_engine("kknn") %>%
set_mode("classification")
15 / 71

Fact

KNN requires all numeric predictors, and all need to be centered and scaled.

What does that mean?

16 / 71

Quiz

Why do you need to "train" a recipe?

Imagine "scaling" a new data point. What do you subtract from it? What do you divide it by?

17 / 71
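That is the point of prep(): it "trains" the recipe, memorizing statistics such as each predictor's training mean and sd, which bake() then replays on new data. A minimal sketch, assuming the bechdel_train/bechdel_test split used later in these slides:

norm_rec <- recipe(bechdel ~ ., data = bechdel_train) %>%
  step_normalize(all_numeric_predictors()) %>%
  prep()  # means and sds are estimated from bechdel_train only

bake(norm_rec, new_data = bechdel_test)  # test rows are scaled with *training* statistics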


Guess

# A tibble: 5 × 1
  rated
  <chr>
1 R
2 PG-13
3 PG
4 G
5 NC-17

# A tibble: 1,394 × 5
       R `PG-13`    PG     G `NC-17`
   <dbl>   <dbl> <dbl> <dbl>   <dbl>
 1     1       0     0     0       0
 2     1       0     0     0       0
 3     0       1     0     0       0
 4     0       1     0     0       0
 5     1       0     0     0       0
 6     1       0     0     0       0
 7     0       1     0     0       0
 8     0       1     0     0       0
 9     1       0     0     0       0
10     1       0     0     0       0
# … with 1,384 more rows
21 / 71
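A table like this can be reproduced with step_dummy(one_hot = TRUE), which keeps one indicator column per level. A sketch (by default step_dummy() instead drops one reference level, which is what the glm() output on the next slide reflects):

recipe(~ rated, data = bechdel) %>%
  step_dummy(rated, one_hot = TRUE) %>%  # one 0/1 column per rating level
  prep() %>%
  bake(new_data = NULL)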

Dummy Variables

glm(bechdel ~ rated, family = "binomial", data = bechdel)
## # A tibble: 5 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)   -0.470     0.403    -1.17     0.244
## 2 ratedNC-17    -1.14      1.17     -0.976    0.329
## 3 ratedPG        0.225     0.427     0.527    0.598
## 4 ratedPG-13     0.354     0.412     0.859    0.391
## 5 ratedR         0.198     0.411     0.482    0.630
22 / 71

step_dummy()

Converts nominal data into numeric dummy variables, needed as predictors for models like KNN.

rec <- recipe(bechdel ~ ., data = bechdel) %>%
  step_other(genre, threshold = .05) %>%
  step_dummy(all_nominal_predictors())

You don't need this for decision trees or ensembles of trees.

23 / 71

Quiz

How does recipes know which variables are numeric and which are nominal?

rec <- recipe(
  bechdel ~ .,
  data = bechdel
)

24 / 71

Quiz

How does recipes know what is a predictor and what is an outcome?

rec <- recipe(
  bechdel ~ .,
  data = bechdel
)

The formula indicates outcomes vs predictors

The data is only used to catalog the names and types of each variable

25 / 71

Selectors

Helper functions for selecting sets of variables

rec %>%
  step_novel(all_nominal()) %>%
  step_zv(all_predictors())
26 / 71
selector                   description
all_predictors()           Each x variable (right side of ~)
all_outcomes()             Each y variable (left side of ~)
all_numeric()              Each numeric variable
all_nominal()              Each categorical variable (e.g. factor, string)
all_nominal_predictors()   Each categorical variable (e.g. factor, string) that is defined as a predictor
all_numeric_predictors()   Each numeric variable that is defined as a predictor
dplyr::select() helpers    starts_with('IL_'), etc.

27 / 71
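Selectors compose like dplyr::select(): they can be combined and negated inside any step. A hypothetical example (the imdb_ column prefix is made up for illustration):

rec %>%
  step_log(all_numeric_predictors(), -starts_with("imdb_")) %>%
  step_unknown(all_nominal_predictors())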

Let's think about the modeling.

What if there were no films rated NC-17 in the training data?

Will the model have a coefficient for rated NC-17?

No

What will happen if the test data includes a film rated NC-17?

Error!

28 / 71

step_novel()

Adds a catch-all level to a factor for any new values not encountered in model training, which lets R intelligently predict new levels in the test set.

rec <- recipe(bechdel ~ ., data = bechdel) %>%
  step_other(genre, threshold = .05) %>%
  step_novel(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors())

Use it before step_dummy() so the new level is dummified.

29 / 71
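A toy illustration of the failure mode and the fix (data made up; the level "c" never appears in training):

library(recipes)

train <- data.frame(y = c(1, 2), x = factor(c("a", "b")))
test  <- data.frame(y = 3, x = factor("c"))

rec_novel <- recipe(y ~ x, data = train) %>%
  step_novel(x) %>%  # reserve a catch-all "new" level...
  step_dummy(x) %>%  # ...so it gets its own dummy column
  prep()

bake(rec_novel, new_data = test)  # "c" is mapped to "new" instead of erroring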

Guess

What would happen if you try to normalize a variable that doesn't vary?

Error! You'd be dividing by zero!

30 / 71

step_zv()

Intelligently handles zero variance variables (variables that contain only a single value)

rec <- recipe(bechdel ~ ., data = bechdel) %>%
  step_other(genre, threshold = .05) %>%
  step_novel(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors())
31 / 71
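A quick toy check of what step_zv() drops (data made up):

library(recipes)

df <- data.frame(y = 1:4, constant = 1, varying = c(2, 5, 3, 8))

recipe(y ~ ., data = df) %>%
  step_zv(all_predictors()) %>%
  prep() %>%
  bake(new_data = NULL)  # "constant" is removed; "varying" survives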

step_normalize()

Centers, then scales, numeric variables (mean = 0, sd = 1)

rec <- recipe(bechdel ~ ., data = bechdel) %>%
  step_other(genre, threshold = .05) %>%
  step_novel(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric())
32 / 71

Your Turn 1

Unscramble! You have all the steps from our knn_rec: your challenge is to unscramble them into the right order!

Save the result as knn_rec

05:00
33 / 71
knn_rec <- recipe(formula = bechdel ~ ., data = bechdel) %>%
  step_other(genre, threshold = .05) %>%
  step_novel(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())

knn_rec
## Recipe
##
## Inputs:
##
##       role #variables
##    outcome          1
##  predictor          8
##
## Operations:
##
## Collapsing factor levels for genre
## Novel factor level assignment for all_nominal_predictors()
## Dummy variables from all_nominal_predictors()
## Zero variance filter on all_predictors()
## Centering and scaling for all_numeric_predictors()
34 / 71
library(usemodels)
use_kknn(bechdel ~ ., data = bechdel, verbose = TRUE, tune = FALSE)
## kknn_recipe <-
##   recipe(formula = bechdel ~ ., data = bechdel) %>%
##   ## For modeling, it is preferred to encode qualitative data as factors
##   ## (instead of character).
##   step_string2factor(one_of("rated", "genre")) %>%
##   step_novel(all_nominal_predictors()) %>%
##   ## This model requires the predictors to be numeric. The most common
##   ## method to convert qualitative predictors to numeric is to create
##   ## binary indicator variables (aka dummy variables) from these
##   ## predictors.
##   step_dummy(all_nominal_predictors()) %>%
##   ## Since distance calculations are used, the predictor variables should
##   ## be on the same scale. Before centering and scaling the numeric
##   ## predictors, any predictors with a single unique value are filtered
##   ## out.
##   step_zv(all_predictors()) %>%
##   step_normalize(all_numeric_predictors())
##
## kknn_spec <-
##   nearest_neighbor() %>%
##   set_mode("classification") %>%
##   set_engine("kknn")
##
## kknn_workflow <-
##   workflow() %>%
##   add_recipe(kknn_recipe) %>%
##   add_model(kknn_spec)
36 / 71
use_glmnet(bechdel ~ ., data = bechdel, verbose = TRUE, tune = FALSE)
## glmnet_recipe <-
##   recipe(formula = bechdel ~ ., data = bechdel) %>%
##   ## For modeling, it is preferred to encode qualitative data as factors
##   ## (instead of character).
##   step_string2factor(one_of("rated", "genre")) %>%
##   step_novel(all_nominal_predictors()) %>%
##   ## This model requires the predictors to be numeric. The most common
##   ## method to convert qualitative predictors to numeric is to create
##   ## binary indicator variables (aka dummy variables) from these
##   ## predictors.
##   step_dummy(all_nominal_predictors()) %>%
##   ## Regularization methods sum up functions of the model slope
##   ## coefficients. Because of this, the predictor variables should be on
##   ## the same scale. Before centering and scaling the numeric predictors,
##   ## any predictors with a single unique value are filtered out.
##   step_zv(all_predictors()) %>%
##   step_normalize(all_numeric_predictors())
##
## glmnet_spec <-
##   logistic_reg() %>%
##   set_mode("classification") %>%
##   set_engine("glmnet")
##
## glmnet_workflow <-
##   workflow() %>%
##   add_recipe(glmnet_recipe) %>%
##   add_model(glmnet_spec)
37 / 71

Now we've built a recipe.

But, how do we use a recipe?

38 / 71

Axiom

Feature engineering and modeling are two halves of a single predictive workflow.

39 / 71


Workflows

53 / 71

workflow()

Creates a workflow to which you can add a model and more

workflow()
54 / 71

add_formula()

Adds a formula to a workflow *

workflow() %>% add_formula(bechdel ~ metascore)

* If you do not plan to do your own preprocessing

55 / 71

add_model()

Adds a parsnip model spec to a workflow

workflow() %>% add_model(knn_mod)
56 / 71

Guess

If we use add_model() to add a model to a workflow, what would we use to add a recipe?

Let's see!

57 / 71

Your Turn 2

Fill in the blanks to make a workflow that combines knn_rec with knn_mod.

01:00
58 / 71
knn_wf <- workflow() %>%
  add_recipe(knn_rec) %>%
  add_model(knn_mod)

knn_wf
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: nearest_neighbor()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 5 Recipe Steps
##
## • step_other()
## • step_novel()
## • step_dummy()
## • step_zv()
## • step_normalize()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## K-Nearest Neighbor Model Specification (classification)
##
## Computational engine: kknn
59 / 71

add_recipe()

Adds a recipe to a workflow.

knn_wf <- workflow() %>%
  add_recipe(knn_rec) %>%
  add_model(knn_mod)
60 / 71

Guess

Do you need to add a formula if you have a recipe?

Nope!

rec <- recipe(
  bechdel ~ .,
  data = bechdel
)

61 / 71

fit()

Fit a workflow that bundles a recipe* and a model.

_wf %>%
  fit(data = bechdel_train) %>%
  predict(bechdel_test)...

* or a formula, if you do not plan to do your own preprocessing

62 / 71
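End to end, this looks something like the sketch below. The initial_split() code is an assumption; these slides only ever reference bechdel_train and bechdel_test:

library(tidymodels)
set.seed(100)

bechdel_split <- initial_split(bechdel, strata = bechdel)  # assumed split
bechdel_train <- training(bechdel_split)
bechdel_test  <- testing(bechdel_split)

knn_fit <- knn_wf %>% fit(data = bechdel_train)           # preps the recipe, then fits the model
predict(knn_fit, new_data = bechdel_test, type = "prob")  # bakes, then predicts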

Preprocess k-fold resamples?

set.seed(100)
bechdel_folds <- vfold_cv(bechdel_train, v = 10, strata = bechdel)
63 / 71

fit_resamples()

Fit a workflow that bundles a recipe* and a model with resampling.

_wf %>%
  fit_resamples(resamples = bechdel_folds)

* or a formula, if you do not plan to do your own preprocessing

64 / 71

Your Turn 3

Run the first chunk. Then try our KNN workflow on bechdel_folds. What is the ROC AUC?

03:00
65 / 71
set.seed(100)
bechdel_folds <- vfold_cv(bechdel_train, v = 10, strata = bechdel)

knn_wf %>%
  fit_resamples(resamples = bechdel_folds) %>%
  collect_metrics()
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>
## 1 accuracy binary     0.591    10  0.0165 Preprocessor1_Mod…
## 2 roc_auc  binary     0.604    10  0.0176 Preprocessor1_Mod…
66 / 71

Feature Engineering

[Figures: model results "Before" and "After" feature engineering]

67 / 71

update_recipe()

Replace the recipe in a workflow.

_wf %>%
  update_recipe(glmnet_rec)
68 / 71

update_model()

Replace the model in a workflow.

_wf %>%
  update_model(tree_mod)
69 / 71

Your Turn 4

Turns out, the same knn_rec recipe can also be used to fit a penalized logistic regression model. Let's try it out!

plr_mod <- logistic_reg(penalty = .01, mixture = 1) %>%
  set_engine("glmnet") %>%
  set_mode("classification")

plr_mod %>%
  translate()
## Logistic Regression Model Specification (classification)
##
## Main Arguments:
## penalty = 0.01
## mixture = 1
##
## Computational engine: glmnet
##
## Model fit template:
## glmnet::glmnet(x = missing_arg(), y = missing_arg(), weights = missing_arg(),
##     alpha = 1, family = "binomial")
03:00
70 / 71
glmnet_wf <- knn_wf %>%
  update_model(plr_mod)

glmnet_wf %>%
  fit_resamples(resamples = bechdel_folds) %>%
  collect_metrics()
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>
## 1 accuracy binary     0.615    10  0.0139 Preprocessor1_Mod…
## 2 roc_auc  binary     0.647    10  0.0193 Preprocessor1_Mod…
71 / 71
