Working with statistical models

library(tidyverse)
library(tidymodels)
library(rcfss)

set.seed(123)

theme_set(theme_minimal())

Run the code below in your console to download this exercise as a set of R scripts.

usethis::use_course("uc-cfss/statistical-learning")

Exercise: linear regression with scorecard

Recall the scorecard data set which contains information on U.S. institutions of higher learning.

scorecard
## # A tibble: 1,753 x 15
##    unitid name  state type  admrate satavg  cost netcost avgfacsal pctpell
##     <int> <chr> <chr> <fct>   <dbl>  <dbl> <int>   <dbl>     <dbl>   <dbl>
##  1 420325 Yesh… NY    Priv…  0.531      NA 14874    4018     26253   0.958
##  2 430485 The … NE    Priv…  0.667      NA 41627   39020     54000   0.529
##  3 100654 Alab… AL    Publ…  0.899     957 22489   14444     63909   0.707
##  4 102234 Spri… AL    Priv…  0.658    1130 51969   19718     60048   0.342
##  5 100724 Alab… AL    Publ…  0.977     972 21476   13043     69786   0.745
##  6 106467 Arka… AR    Publ…  0.902      NA 18627   12362     61497   0.396
##  7 106704 Univ… AR    Publ…  0.911    1186 21350   14723     63360   0.430
##  8 109651 Art … CA    Priv…  0.676      NA 64097   43010     69984   0.307
##  9 110404 Cali… CA    Priv…  0.0662   1566 68901   23820    179937   0.142
## 10 112394 Cogs… CA    Priv…  0.579      NA 35351   31537     66636   0.461
## # … with 1,743 more rows, and 5 more variables: comprate <dbl>, firstgen <dbl>,
## #   debt <dbl>, locale <fct>, openadmp <fct>

Answer the following questions using the statistical modeling tools you have learned.

  1. What is the relationship between admission rate and cost? Report this relationship using a scatterplot and a linear best-fit line.

    Click for the solution

    ggplot(scorecard, aes(admrate, cost)) +
      geom_point() +
      geom_smooth(method = "lm")
    

  2. Estimate a linear regression of the relationship between admission rate and cost, and report your results in a tidy table.

    Click for the solution

    scorecard_fit <- linear_reg() %>%
      set_engine("lm") %>%
      fit(cost ~ admrate, data = scorecard)
    tidy(scorecard_fit)
    
    ## # A tibble: 2 x 5
    ##   term        estimate std.error statistic   p.value
    ##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
    ## 1 (Intercept)   51723.     1242.      41.6 3.74e-262
    ## 2 admrate      -22928.     1768.     -13.0 9.32e- 37
    

  3. Estimate a linear regression of the relationship between admission rate and cost, while also accounting for type of college and percent of Pell Grant recipients, and report your results in a tidy table.

    Click for the solution

    scorecard_fit <- linear_reg() %>%
      set_engine("lm") %>%
      fit(cost ~ admrate + type + pctpell, data = scorecard)
    tidy(scorecard_fit)
    
    ## # A tibble: 5 x 5
    ##   term                    estimate std.error statistic   p.value
    ##   <chr>                      <dbl>     <dbl>     <dbl>     <dbl>
    ## 1 (Intercept)               47642.     1010.      47.2 1.94e-311
    ## 2 admrate                  -12456.     1145.     -10.9 1.06e- 26
    ## 3 typePrivate, nonprofit    20235.      512.      39.5 4.58e-243
    ## 4 typePrivate, for-profit   16833.     1203.      14.0 3.50e- 42
    ## 5 pctpell                  -43757.     1465.     -29.9 4.02e-158
    

Exercise: logistic regression with mental_health

Why do some people vote in elections while others do not? Typical explanations focus on a resource model of participation – individuals with greater resources, such as time, money, and civic skills, are more likely to participate in politics. An emerging theory assesses an individual’s mental health and its effect on political participation.1 Depression increases individuals’ feelings of hopelessness and political efficacy, so depressed individuals will have less desire to participate in politics. More importantly to our resource model of participation, individuals with depression suffer physical ailments such as a lack of energy, headaches, and muscle soreness which drain an individual’s energy and requires time and money to receive treatment. For these reasons, we should expect that individuals with depression are less likely to participate in election than those without symptoms of depression.

Use the mental_health data set in library(rcfss) and logistic regression to predict whether or not an individual voted in the 1996 presidential election.

mental_health
## # A tibble: 1,317 x 5
##    vote96   age  educ female mhealth
##     <dbl> <dbl> <dbl>  <dbl>   <dbl>
##  1      1    60    12      0       0
##  2      1    36    12      0       1
##  3      0    21    13      0       7
##  4      0    29    13      0       6
##  5      1    39    18      1       2
##  6      1    41    15      1       1
##  7      1    48    20      0       2
##  8      0    20    12      1       9
##  9      0    27    11      1       9
## 10      0    34     7      1       2
## # … with 1,307 more rows
  1. Estimate a logistic regression model of voter turnout with mhealth as the predictor. Estimate predicted probabilities and a 95% confidence interval, and plot the logistic regression predictions using ggplot.

    Click for the solution

    # convert vote96 to a factor column
    mental_health <- rcfss::mental_health %>%
      mutate(vote96 = factor(vote96, labels = c("Not voted", "Voted")))
    
    # estimate model
    mh_mod <- logistic_reg() %>%
      set_engine("glm") %>%
      fit(vote96 ~ mhealth, data = mental_health)
    
    # generate predicted probabilities + confidence intervals
    new_points <- tibble(
      mhealth = seq(
        from = min(mental_health$mhealth),
        to = max(mental_health$mhealth)
      )
    )
    
    bind_cols(
      new_points,
      # predicted probabilities
      predict(mh_mod, new_data = new_points, type = "prob"),
      # confidence intervals
      predict(mh_mod, new_data = new_points, type = "conf_int")
    ) %>%
      # graph the predictions
      ggplot(mapping = aes(x = mhealth, y = .pred_Voted)) +
      geom_pointrange(mapping = aes(ymin = .pred_lower_Voted, ymax = .pred_upper_Voted)) +
      labs(
        title = "Relationship Between Mental Health and Voter Turnout",
        x = "Mental health status",
        y = "Predicted Probability of Voting"
      )
    

  2. Estimate a second logistic regression model of voter turnout using using age and gender (i.e. the female column). Extract predicted probabilities and confidence intervals for all possible values of age, and visualize using ggplot().

    Click for the solution

    # recode female
    mental_health <- rcfss::mental_health %>%
      mutate(
        vote96 = factor(vote96, labels = c("Not voted", "Voted")),
        female = factor(female, labels = c("Male", "Female"))
      )
    
    # estimate model
    mh_int_mod <- logistic_reg() %>%
      set_engine("glm") %>%
      fit(vote96 ~ age * female, data = mental_health)
    
    # generate predicted probabilities + confidence intervals
    new_points <- expand.grid(
      age = seq(
        from = min(mental_health$age),
        to = max(mental_health$age)
      ),
      female = unique(mental_health$female)
    )
    
    bind_cols(
      new_points,
      # predicted probabilities
      predict(mh_int_mod, new_data = new_points, type = "prob"),
      # confidence intervals
      predict(mh_int_mod, new_data = new_points, type = "conf_int")
    ) %>%
      # graph the predictions
      ggplot(mapping = aes(x = age, y = .pred_Voted, color = female)) +
      # predicted probability
      geom_line(linetype = 2) +
      # confidence interval
      geom_ribbon(mapping = aes(
        ymin = .pred_lower_Voted, ymax = .pred_upper_Voted,
        fill = female
      ), alpha = .2) +
      scale_color_viridis_d(
        end = 0.7, aesthetics = c("color", "fill"),
        name = NULL
      ) +
      labs(
        title = "Relationship Between Age and Voter Turnout",
        x = "Age",
        y = "Predicted Probability of Voting"
      )
    

Session Info

devtools::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.0.4 (2021-02-15)
##  os       macOS Big Sur 10.16         
##  system   x86_64, darwin17.0          
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       America/Chicago             
##  date     2021-05-25                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package      * version    date       lib source        
##  assertthat     0.2.1      2019-03-21 [1] CRAN (R 4.0.0)
##  backports      1.2.1      2020-12-09 [1] CRAN (R 4.0.2)
##  blogdown       1.3        2021-04-14 [1] CRAN (R 4.0.2)
##  bookdown       0.22       2021-04-22 [1] CRAN (R 4.0.2)
##  broom        * 0.7.6      2021-04-05 [1] CRAN (R 4.0.4)
##  bslib          0.2.5      2021-05-12 [1] CRAN (R 4.0.4)
##  cachem         1.0.5      2021-05-15 [1] CRAN (R 4.0.2)
##  callr          3.7.0      2021-04-20 [1] CRAN (R 4.0.2)
##  cellranger     1.1.0      2016-07-27 [1] CRAN (R 4.0.0)
##  class          7.3-19     2021-05-03 [1] CRAN (R 4.0.2)
##  cli            2.5.0      2021-04-26 [1] CRAN (R 4.0.2)
##  codetools      0.2-18     2020-11-04 [1] CRAN (R 4.0.4)
##  colorspace     2.0-1      2021-05-04 [1] CRAN (R 4.0.2)
##  crayon         1.4.1      2021-02-08 [1] CRAN (R 4.0.2)
##  DBI            1.1.1      2021-01-15 [1] CRAN (R 4.0.2)
##  dbplyr         2.1.1      2021-04-06 [1] CRAN (R 4.0.4)
##  desc           1.3.0      2021-03-05 [1] CRAN (R 4.0.2)
##  devtools       2.4.1      2021-05-05 [1] CRAN (R 4.0.2)
##  dials        * 0.0.9      2020-09-16 [1] CRAN (R 4.0.2)
##  DiceDesign     1.9        2021-02-13 [1] CRAN (R 4.0.2)
##  digest         0.6.27     2020-10-24 [1] CRAN (R 4.0.2)
##  dplyr        * 1.0.6      2021-05-05 [1] CRAN (R 4.0.2)
##  ellipsis       0.3.2      2021-04-29 [1] CRAN (R 4.0.2)
##  evaluate       0.14       2019-05-28 [1] CRAN (R 4.0.0)
##  fansi          0.4.2      2021-01-15 [1] CRAN (R 4.0.2)
##  fastmap        1.1.0      2021-01-25 [1] CRAN (R 4.0.2)
##  forcats      * 0.5.1      2021-01-27 [1] CRAN (R 4.0.2)
##  foreach        1.5.1      2020-10-15 [1] CRAN (R 4.0.2)
##  fs             1.5.0      2020-07-31 [1] CRAN (R 4.0.2)
##  furrr          0.2.2      2021-01-29 [1] CRAN (R 4.0.2)
##  future         1.21.0     2020-12-10 [1] CRAN (R 4.0.2)
##  generics       0.1.0      2020-10-31 [1] CRAN (R 4.0.2)
##  ggplot2      * 3.3.3      2020-12-30 [1] CRAN (R 4.0.2)
##  globals        0.14.0     2020-11-22 [1] CRAN (R 4.0.2)
##  glue           1.4.2      2020-08-27 [1] CRAN (R 4.0.2)
##  gower          0.2.2      2020-06-23 [1] CRAN (R 4.0.2)
##  GPfit          1.0-8      2019-02-08 [1] CRAN (R 4.0.0)
##  gtable         0.3.0      2019-03-25 [1] CRAN (R 4.0.0)
##  haven          2.4.1      2021-04-23 [1] CRAN (R 4.0.2)
##  here           1.0.1      2020-12-13 [1] CRAN (R 4.0.2)
##  hms            1.1.0      2021-05-17 [1] CRAN (R 4.0.4)
##  htmltools      0.5.1.1    2021-01-22 [1] CRAN (R 4.0.2)
##  httr           1.4.2      2020-07-20 [1] CRAN (R 4.0.2)
##  infer        * 0.5.4      2021-01-13 [1] CRAN (R 4.0.2)
##  ipred          0.9-11     2021-03-12 [1] CRAN (R 4.0.2)
##  iterators      1.0.13     2020-10-15 [1] CRAN (R 4.0.2)
##  jquerylib      0.1.4      2021-04-26 [1] CRAN (R 4.0.2)
##  jsonlite       1.7.2      2020-12-09 [1] CRAN (R 4.0.2)
##  knitr          1.33       2021-04-24 [1] CRAN (R 4.0.2)
##  lattice        0.20-44    2021-05-02 [1] CRAN (R 4.0.2)
##  lava           1.6.9      2021-03-11 [1] CRAN (R 4.0.2)
##  lhs            1.1.1      2020-10-05 [1] CRAN (R 4.0.2)
##  lifecycle      1.0.0      2021-02-15 [1] CRAN (R 4.0.2)
##  listenv        0.8.0      2019-12-05 [1] CRAN (R 4.0.0)
##  lubridate      1.7.10     2021-02-26 [1] CRAN (R 4.0.2)
##  magrittr       2.0.1      2020-11-17 [1] CRAN (R 4.0.2)
##  MASS           7.3-54     2021-05-03 [1] CRAN (R 4.0.2)
##  Matrix         1.3-3      2021-05-04 [1] CRAN (R 4.0.2)
##  memoise        2.0.0      2021-01-26 [1] CRAN (R 4.0.2)
##  modeldata    * 0.1.0      2020-10-22 [1] CRAN (R 4.0.2)
##  modelr         0.1.8      2020-05-19 [1] CRAN (R 4.0.0)
##  munsell        0.5.0      2018-06-12 [1] CRAN (R 4.0.0)
##  nnet           7.3-16     2021-05-03 [1] CRAN (R 4.0.2)
##  parallelly     1.25.0     2021-04-30 [1] CRAN (R 4.0.2)
##  parsnip      * 0.1.5      2021-01-19 [1] CRAN (R 4.0.2)
##  pillar         1.6.1      2021-05-16 [1] CRAN (R 4.0.4)
##  pkgbuild       1.2.0      2020-12-15 [1] CRAN (R 4.0.2)
##  pkgconfig      2.0.3      2019-09-22 [1] CRAN (R 4.0.0)
##  pkgload        1.2.1      2021-04-06 [1] CRAN (R 4.0.2)
##  plyr           1.8.6      2020-03-03 [1] CRAN (R 4.0.0)
##  prettyunits    1.1.1      2020-01-24 [1] CRAN (R 4.0.0)
##  pROC           1.17.0.1   2021-01-13 [1] CRAN (R 4.0.2)
##  processx       3.5.2      2021-04-30 [1] CRAN (R 4.0.2)
##  prodlim        2019.11.13 2019-11-17 [1] CRAN (R 4.0.0)
##  ps             1.6.0      2021-02-28 [1] CRAN (R 4.0.2)
##  purrr        * 0.3.4      2020-04-17 [1] CRAN (R 4.0.0)
##  R6             2.5.0      2020-10-28 [1] CRAN (R 4.0.2)
##  rcfss        * 0.2.1      2020-12-08 [1] local         
##  Rcpp           1.0.6      2021-01-15 [1] CRAN (R 4.0.2)
##  readr        * 1.4.0      2020-10-05 [1] CRAN (R 4.0.2)
##  readxl         1.3.1      2019-03-13 [1] CRAN (R 4.0.0)
##  recipes      * 0.1.16     2021-04-16 [1] CRAN (R 4.0.2)
##  remotes        2.3.0      2021-04-01 [1] CRAN (R 4.0.2)
##  reprex         2.0.0      2021-04-02 [1] CRAN (R 4.0.2)
##  rlang          0.4.11     2021-04-30 [1] CRAN (R 4.0.2)
##  rmarkdown      2.8        2021-05-07 [1] CRAN (R 4.0.2)
##  rpart          4.1-15     2019-04-12 [1] CRAN (R 4.0.4)
##  rprojroot      2.0.2      2020-11-15 [1] CRAN (R 4.0.2)
##  rsample      * 0.1.0      2021-05-08 [1] CRAN (R 4.0.2)
##  rstudioapi     0.13       2020-11-12 [1] CRAN (R 4.0.2)
##  rvest          1.0.0      2021-03-09 [1] CRAN (R 4.0.2)
##  sass           0.4.0      2021-05-12 [1] CRAN (R 4.0.2)
##  scales       * 1.1.1      2020-05-11 [1] CRAN (R 4.0.0)
##  sessioninfo    1.1.1      2018-11-05 [1] CRAN (R 4.0.0)
##  stringi        1.6.1      2021-05-10 [1] CRAN (R 4.0.2)
##  stringr      * 1.4.0      2019-02-10 [1] CRAN (R 4.0.0)
##  survival       3.2-11     2021-04-26 [1] CRAN (R 4.0.2)
##  testthat       3.0.2      2021-02-14 [1] CRAN (R 4.0.2)
##  tibble       * 3.1.1      2021-04-18 [1] CRAN (R 4.0.2)
##  tidymodels   * 0.1.3      2021-04-19 [1] CRAN (R 4.0.2)
##  tidyr        * 1.1.3      2021-03-03 [1] CRAN (R 4.0.2)
##  tidyselect     1.1.1      2021-04-30 [1] CRAN (R 4.0.2)
##  tidyverse    * 1.3.1      2021-04-15 [1] CRAN (R 4.0.2)
##  timeDate       3043.102   2018-02-21 [1] CRAN (R 4.0.0)
##  tune         * 0.1.5      2021-04-23 [1] CRAN (R 4.0.2)
##  usethis        2.0.1      2021-02-10 [1] CRAN (R 4.0.2)
##  utf8           1.2.1      2021-03-12 [1] CRAN (R 4.0.2)
##  vctrs          0.3.8      2021-04-29 [1] CRAN (R 4.0.2)
##  withr          2.4.2      2021-04-18 [1] CRAN (R 4.0.2)
##  workflows    * 0.2.2      2021-03-10 [1] CRAN (R 4.0.2)
##  workflowsets * 0.0.2      2021-04-16 [1] CRAN (R 4.0.2)
##  xfun           0.23       2021-05-15 [1] CRAN (R 4.0.2)
##  xml2           1.3.2      2020-04-23 [1] CRAN (R 4.0.0)
##  yaml           2.2.1      2020-02-01 [1] CRAN (R 4.0.0)
##  yardstick    * 0.0.8      2021-03-28 [1] CRAN (R 4.0.2)
## 
## [1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library