recipes
recipe()
Creates a recipe for a set of variables
rec <- recipe(bechdel ~ ., data = bechdel)
step_*()
Adds a single transformation to a recipe. Transformations are replayed in order when the recipe is run on data.
rec <- recipe(bechdel ~ ., data = bechdel) %>%
  step_other(genre, threshold = .05)
## # A tibble: 14 × 2
##    genre         n
##    <chr>     <int>
##  1 Action      334
##  2 Comedy      286
##  3 Drama       236
##  4 Adventure    86
##  5 Animation    82
##  6 Crime        77
##  7 Horror       67
##  8 Biography    58
##  9 Mystery      10
## 10 Fantasy       6
## # … with 4 more rows
## # A tibble: 8 × 2
##   genre         n
##   <fct>     <int>
## 1 Action      334
## 2 Comedy      286
## 3 Drama       236
## 4 Adventure    86
## 5 other        85
## 6 Animation    82
## 7 Crime        77
## 8 Horror       67
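The idea behind step_other() can be sketched in base R (this is an illustration, not from the slides; the genre vector and the 40% threshold are made up):

```r
# Made-up genre vector; step_other() collapses infrequent levels into "other"
genre <- factor(c("Action", "Action", "Comedy", "Drama", "Mystery"))

# Find levels appearing in less than 40% of rows
freq <- prop.table(table(genre))
rare <- names(freq)[freq < 0.4]

# Assigning a duplicated level name merges those levels into one
levels(genre)[levels(genre) %in% rare] <- "other"
table(genre)  # Action: 2, other: 3
```

step_other() does this for you at the chosen threshold, and remembers which levels were collapsed so the same lumping is applied to new data.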
To predict the outcome of a new data point:
parsnip
knn_mod <- nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("classification")
KNN requires all numeric predictors, and all need to be centered and scaled.
What does that mean?
Why do you need to "train" a recipe?
Imagine "scaling" a new data point. What do you subtract from it? What do you divide it by?
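The point can be made concrete in base R (a sketch with made-up numbers, not the bechdel data): the centering and scaling statistics must come from the training data, and those same statistics are then applied to any new point.

```r
# "Training" amounts to computing and storing statistics from training data
train  <- c(2, 4, 6, 8, 10)
center <- mean(train)
spread <- sd(train)

# A new data point is scaled with the *training* statistics,
# not with statistics recomputed on the new data
new_point <- 7
(new_point - center) / spread
```

This is exactly what prep() stores for a recipe: the trained step parameters, ready to be replayed on test data.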
# A tibble: 5 × 1
  rated
  <chr>
1 R
2 PG-13
3 PG
4 G
5 NC-17
# A tibble: 1,394 × 5
       R `PG-13`    PG     G `NC-17`
   <dbl>   <dbl> <dbl> <dbl>   <dbl>
 1     1       0     0     0       0
 2     1       0     0     0       0
 3     0       1     0     0       0
 4     0       1     0     0       0
 5     1       0     0     0       0
 6     1       0     0     0       0
 7     0       1     0     0       0
 8     0       1     0     0       0
 9     1       0     0     0       0
10     1       0     0     0       0
# … with 1,384 more rows
glm(bechdel ~ rated, family = "binomial", data = bechdel)
## # A tibble: 5 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)   -0.470     0.403    -1.17    0.244
## 2 ratedNC-17    -1.14      1.17     -0.976   0.329
## 3 ratedPG        0.225     0.427     0.527   0.598
## 4 ratedPG-13     0.354     0.412     0.859   0.391
## 5 ratedR         0.198     0.411     0.482   0.630
step_dummy()
Converts nominal data into numeric dummy variables, needed as predictors for models like KNN.
rec <- recipe(bechdel ~ ., data = bechdel) %>%
  step_other(genre, threshold = .05) %>%
  step_dummy(all_nominal_predictors())
You don't need this for decision trees or ensembles of trees
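Dummy coding is the same expansion base R performs with model.matrix(); a minimal sketch with a made-up ratings vector (not the bechdel data):

```r
# Four made-up ratings; levels default to alphabetical order
ratings <- data.frame(rated = factor(c("R", "PG-13", "PG", "R")))

# model.matrix() expands the factor into 0/1 indicator columns,
# using the first level ("PG") as the reference
mm <- model.matrix(~ rated, data = ratings)
colnames(mm)  # "(Intercept)" "ratedPG-13" "ratedR"
```

step_dummy() produces the same kind of indicator columns, but inside a recipe so the encoding learned on training data carries over to new data.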
How does recipes know which variables are numeric and what is nominal?
rec <- recipe(
  bechdel ~ .,
  data = bechdel
)
How does recipes know what is a predictor and what is an outcome?
rec <- recipe(
  bechdel ~ .,
  data = bechdel
)
The formula → indicates outcomes vs predictors
The data → is only used to catalog the names and types of each variable
Helper functions for selecting sets of variables
rec %>%
  step_novel(all_nominal()) %>%
  step_zv(all_predictors())
selector | description
---|---
all_predictors() | Each x variable (right side of ~)
all_outcomes() | Each y variable (left side of ~)
all_numeric() | Each numeric variable
all_nominal() | Each categorical variable (e.g. factor, string)
all_nominal_predictors() | Each categorical variable (e.g. factor, string) that is defined as a predictor
all_numeric_predictors() | Each numeric variable that is defined as a predictor
What if there were no films with rated NC-17 in the training data?
Will the model have a coefficient for rated NC-17?
No
What will happen if the test data includes a film with rated NC-17?
Error!
step_novel()
Adds a catch-all level to a factor for any new values not encountered in model training, so predictions can still be made when the test set contains factor levels never seen in training.
rec <- recipe(bechdel ~ ., data = bechdel) %>%
  step_other(genre, threshold = .05) %>%
  step_novel(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors())
Use before step_dummy() so the new level is dummified
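The underlying problem is visible in base R (a sketch with made-up levels): a value coerced to a factor whose levels were fixed on the training data becomes NA rather than a new level.

```r
# Levels observed in training; "NC-17" never appeared
train_levels <- c("G", "PG", "PG-13", "R")

# A test-set value unseen in training silently becomes NA
x <- factor("NC-17", levels = train_levels)
is.na(x)  # TRUE

# step_novel() avoids this by reserving a catch-all level up front
```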
What would happen if you try to normalize a variable that doesn't vary?
Error! You'd be dividing by zero!
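A quick base R check (made-up constant column, not the bechdel data) shows why: the standard deviation of a constant is 0, and dividing by it produces NaN.

```r
# A zero-variance "variable": every value is the same
constant <- c(3, 3, 3, 3)

sd(constant)  # 0
(constant - mean(constant)) / sd(constant)  # NaN NaN NaN NaN
```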
step_zv()
Intelligently handles zero variance variables (variables that contain only a single value)
rec <- recipe(bechdel ~ ., data = bechdel) %>%
  step_other(genre, threshold = .05) %>%
  step_novel(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors())
step_normalize()
Centers then scales numeric variables (mean = 0, sd = 1)
rec <- recipe(bechdel ~ ., data = bechdel) %>%
  step_other(genre, threshold = .05) %>%
  step_novel(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric())
Unscramble! You have all the steps from our knn_rec - your challenge is to unscramble them into the right order! Save the result as knn_rec.
knn_rec <- recipe(formula = bechdel ~ ., data = bechdel) %>%
  step_other(genre, threshold = .05) %>%
  step_novel(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())

knn_rec
## Recipe
##
## Inputs:
##
##       role #variables
##    outcome          1
##  predictor          8
##
## Operations:
##
## Collapsing factor levels for genre
## Novel factor level assignment for all_nominal_predictors()
## Dummy variables from all_nominal_predictors()
## Zero variance filter on all_predictors()
## Centering and scaling for all_numeric_predictors()
library(usemodels)
use_kknn(bechdel ~ ., data = bechdel, verbose = TRUE, tune = FALSE)
## kknn_recipe <-
##   recipe(formula = bechdel ~ ., data = bechdel) %>%
##   ## For modeling, it is preferred to encode qualitative data as factors
##   ## (instead of character).
##   step_string2factor(one_of("rated", "genre")) %>%
##   step_novel(all_nominal_predictors()) %>%
##   ## This model requires the predictors to be numeric. The most common
##   ## method to convert qualitative predictors to numeric is to create
##   ## binary indicator variables (aka dummy variables) from these
##   ## predictors.
##   step_dummy(all_nominal_predictors()) %>%
##   ## Since distance calculations are used, the predictor variables should
##   ## be on the same scale. Before centering and scaling the numeric
##   ## predictors, any predictors with a single unique value are filtered
##   ## out.
##   step_zv(all_predictors()) %>%
##   step_normalize(all_numeric_predictors())
##
## kknn_spec <-
##   nearest_neighbor() %>%
##   set_mode("classification") %>%
##   set_engine("kknn")
##
## kknn_workflow <-
##   workflow() %>%
##   add_recipe(kknn_recipe) %>%
##   add_model(kknn_spec)
use_glmnet(bechdel ~ ., data = bechdel, verbose = TRUE, tune = FALSE)
## glmnet_recipe <-
##   recipe(formula = bechdel ~ ., data = bechdel) %>%
##   ## For modeling, it is preferred to encode qualitative data as factors
##   ## (instead of character).
##   step_string2factor(one_of("rated", "genre")) %>%
##   step_novel(all_nominal_predictors()) %>%
##   ## This model requires the predictors to be numeric. The most common
##   ## method to convert qualitative predictors to numeric is to create
##   ## binary indicator variables (aka dummy variables) from these
##   ## predictors.
##   step_dummy(all_nominal_predictors()) %>%
##   ## Regularization methods sum up functions of the model slope
##   ## coefficients. Because of this, the predictor variables should be on
##   ## the same scale. Before centering and scaling the numeric predictors,
##   ## any predictors with a single unique value are filtered out.
##   step_zv(all_predictors()) %>%
##   step_normalize(all_numeric_predictors())
##
## glmnet_spec <-
##   logistic_reg() %>%
##   set_mode("classification") %>%
##   set_engine("glmnet")
##
## glmnet_workflow <-
##   workflow() %>%
##   add_recipe(glmnet_recipe) %>%
##   add_model(glmnet_spec)
Feature engineering and modeling are two halves of a single predictive workflow.
workflow()
Creates a workflow to add a model and more to
workflow()
add_formula()
Adds a formula to a workflow *
workflow() %>% add_formula(bechdel ~ metascore)
* if you do not plan to do your own preprocessing
add_model()
Adds a parsnip model spec to a workflow
workflow() %>% add_model(knn_mod)
If we use add_model() to add a model to a workflow, what would we use to add a recipe?
Let's see!
Fill in the blanks to make a workflow that combines knn_rec with knn_mod.
knn_wf <- workflow() %>%
  add_recipe(knn_rec) %>%
  add_model(knn_mod)

knn_wf
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: nearest_neighbor()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 5 Recipe Steps
##
## • step_other()
## • step_novel()
## • step_dummy()
## • step_zv()
## • step_normalize()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## K-Nearest Neighbor Model Specification (classification)
##
## Computational engine: kknn
add_recipe()
Adds a recipe to a workflow.
knn_wf <- workflow() %>%
  add_recipe(knn_rec) %>%
  add_model(knn_mod)
Do you need to add a formula if you have a recipe?
Nope!
rec <- recipe(bechdel ~ ., data = bechdel)
fit()
Fit a workflow that bundles a recipe* and a model.
_wf %>%
  fit(data = bechdel_train) %>%
  predict(bechdel_test)
...
* or a formula, if you do not plan to do your own preprocessing
set.seed(100)
bechdel_folds <- vfold_cv(bechdel_train, v = 10, strata = bechdel)
fit_resamples()
Fit a workflow that bundles a recipe* and a model with resampling.
_wf %>% fit_resamples(resamples = bechdel_folds)
* or a formula, if you do not plan to do your own preprocessing
Run the first chunk. Then try our KNN workflow on bechdel_folds. What is the ROC AUC?
set.seed(100)
bechdel_folds <- vfold_cv(bechdel_train, v = 10, strata = bechdel)

knn_wf %>%
  fit_resamples(resamples = bechdel_folds) %>%
  collect_metrics()
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>
## 1 accuracy binary     0.591    10  0.0165 Preprocessor1_Mod…
## 2 roc_auc  binary     0.604    10  0.0176 Preprocessor1_Mod…
Before
After
update_recipe()
Replace the recipe in a workflow.
_wf %>% update_recipe(glmnet_rec)
update_model()
Replace the model in a workflow.
_wf %>% update_model(tree_mod)
Turns out, the same knn_rec recipe can also be used to fit a penalized logistic regression model. Let's try it out!
plr_mod <- logistic_reg(penalty = .01, mixture = 1) %>%
  set_engine("glmnet") %>%
  set_mode("classification")

plr_mod %>% translate()
## Logistic Regression Model Specification (classification)
##
## Main Arguments:
##   penalty = 0.01
##   mixture = 1
##
## Computational engine: glmnet
##
## Model fit template:
## glmnet::glmnet(x = missing_arg(), y = missing_arg(), weights = missing_arg(),
##     alpha = 1, family = "binomial")
glmnet_wf <- knn_wf %>%
  update_model(plr_mod)

glmnet_wf %>%
  fit_resamples(resamples = bechdel_folds) %>%
  collect_metrics()
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>
## 1 accuracy binary     0.615    10  0.0139 Preprocessor1_Mod…
## 2 roc_auc  binary     0.647    10  0.0193 Preprocessor1_Mod…
recipes