Practice transforming college education (data)
library(tidyverse)
Run the code below in your console to download this exercise as a set of R scripts.
usethis::use_course("uc-cfss/data-transformation")
The Department of Education collects annual statistics on colleges and universities in the United States. I have included a subset of this data from 2018-19 in the rcfss
library from GitHub. To install the package, run the command devtools::install_github("uc-cfss/rcfss")
in the console.
devtools
library installed, you will get an error. Go back and install this first using install.packages("devtools")
, then run devtools::install_github("uc-cfss/rcfss")
.library(rcfss)
data("scorecard")
glimpse(scorecard)
## Rows: 1,732
## Columns: 14
## $ unitid <dbl> 100654, 100663, 100706, 100724, 100751, 100830, 100858, 1009…
## $ name <chr> "Alabama A & M University", "University of Alabama at Birmin…
## $ state <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", …
## $ type <fct> "Public", "Public", "Public", "Public", "Public", "Public", …
## $ admrate <dbl> 0.9175, 0.7366, 0.8257, 0.9690, 0.8268, 0.9044, 0.8067, 0.53…
## $ satavg <dbl> 939, 1234, 1319, 946, 1261, 1082, 1300, 1230, 1066, NA, 1076…
## $ cost <dbl> 23053, 24495, 23917, 21866, 29872, 19849, 31590, 32095, 3431…
## $ netcost <dbl> 14990, 16953, 15860, 13650, 22597, 13987, 24104, 22107, 2071…
## $ avgfacsal <dbl> 69381, 99441, 87192, 64989, 92619, 71343, 96642, 56646, 5400…
## $ pctpell <dbl> 0.7019, 0.3512, 0.2536, 0.7627, 0.1772, 0.4644, 0.1455, 0.23…
## $ comprate <dbl> 0.2974, 0.6340, 0.5768, 0.3276, 0.7110, 0.3401, 0.7911, 0.69…
## $ firstgen <dbl> 0.3658281, 0.3412237, 0.3101322, 0.3434343, 0.2257127, 0.381…
## $ debt <dbl> 15250, 15085, 14000, 17500, 17671, 12000, 17500, 16000, 1425…
## $ locale <fct> City, City, City, City, City, City, City, City, City, Suburb…
glimpse()
is part of the tibble
package and is a transposed version of print()
: columns run down the page, and data runs across. With a data frame with multiple columns, sometimes there is not enough horizontal space on the screen to print each column. By transposing the data frame, we can see all the columns and the values recorded for the initial rows.Type ?scorecard
in the console to open up the help file for this data set. This includes the documentation for all the variables. Use your knowledge of the dplyr
functions to perform the following tasks.
Generate a data frame of schools with a greater than 40% share of first-generation students
Click for the solution
filter(.data = scorecard, firstgen > .40)
## # A tibble: 356 × 14
## unitid name state type admrate satavg cost netcost avgfacsal pctpell
## <dbl> <chr> <chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 101189 Faulkner U… AL Priv… 0.783 1066 34317 20715 54009 0.488
## 2 101365 Herzing Un… AL Priv… 0.783 NA 30119 26680 54684 0.706
## 3 101541 Judson Col… AL Priv… 0.372 1020 32691 16827 52020 0.545
## 4 101587 University… AL Publ… 0.349 1041 21657 15514 58329 0.535
## 5 102270 Stillman C… AL Priv… 0.330 NA 25413 18352 43605 0.709
## 6 104717 Grand Cany… AZ Priv… 0.769 NA 31213 21020 60741 0.454
## 7 106467 Arkansas T… AR Publ… 0.947 NA 18358 10772 61812 0.361
## 8 107983 Southern A… AR Publ… 0.651 1085 22579 14270 61650 0.487
## 9 110361 California… CA Priv… 0.783 1096 46261 24707 88335 0.453
## 10 110486 California… CA Publ… 0.807 NA 16660 5318 86760 0.619
## # … with 346 more rows, and 4 more variables: comprate <dbl>, firstgen <dbl>,
## # debt <dbl>, locale <fct>
Generate a data frame with the 10 most expensive colleges in 2018-19 based on net cost of attendance
Click for the solution
We could use a combination of arrange()
and slice()
to sort the data frame from most to least expensive, then keep the first 10 rows:
arrange(.data = scorecard, desc(netcost)) %>%
slice(1:10)
## # A tibble: 10 × 14
## unitid name state type admrate satavg cost netcost avgfacsal pctpell
## <dbl> <chr> <chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 192712 Manhattan … NY Priv… 0.355 NA 68686 54902 73863 0.129
## 2 111081 California… CA Priv… 0.253 NA 71382 50412 86760 0.248
## 3 136774 Ringling C… FL Priv… 0.639 NA 67325 49649 78435 0.284
## 4 164748 Berklee Co… MA Priv… 0.514 NA 64436 49514 93870 0.166
## 5 247649 Landmark C… VT Priv… 0.470 NA 73821 47373 59373 0.219
## 6 109651 Art Center… CA Priv… 0.708 NA 64316 47080 71523 0.283
## 7 135726 University… FL Priv… 0.271 1371 67249 46949 115353 0.143
## 8 194578 Pratt Inst… NY Priv… 0.555 1273 67703 45571 101079 0.198
## 9 165662 Emerson Co… MA Priv… 0.334 1318 68350 45365 90747 0.161
## 10 143048 School of … IL Priv… 0.570 1238 67058 44815 96102 0.192
## # … with 4 more variables: comprate <dbl>, firstgen <dbl>, debt <dbl>,
## # locale <fct>
We can also use the slice_max()
function in dplyr
to accomplish the same thing in one line of code.
slice_max(.data = scorecard, order_by = netcost, n = 10)
## # A tibble: 10 × 14
## unitid name state type admrate satavg cost netcost avgfacsal pctpell
## <dbl> <chr> <chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 192712 Manhattan … NY Priv… 0.355 NA 68686 54902 73863 0.129
## 2 111081 California… CA Priv… 0.253 NA 71382 50412 86760 0.248
## 3 136774 Ringling C… FL Priv… 0.639 NA 67325 49649 78435 0.284
## 4 164748 Berklee Co… MA Priv… 0.514 NA 64436 49514 93870 0.166
## 5 247649 Landmark C… VT Priv… 0.470 NA 73821 47373 59373 0.219
## 6 109651 Art Center… CA Priv… 0.708 NA 64316 47080 71523 0.283
## 7 135726 University… FL Priv… 0.271 1371 67249 46949 115353 0.143
## 8 194578 Pratt Inst… NY Priv… 0.555 1273 67703 45571 101079 0.198
## 9 165662 Emerson Co… MA Priv… 0.334 1318 68350 45365 90747 0.161
## 10 143048 School of … IL Priv… 0.570 1238 67058 44815 96102 0.192
## # … with 4 more variables: comprate <dbl>, firstgen <dbl>, debt <dbl>,
## # locale <fct>
Generate a data frame with the average SAT score for each type of college
Click for the solution
scorecard %>%
group_by(type) %>%
summarize(mean_sat = mean(satavg, na.rm = TRUE))
## # A tibble: 3 × 2
## type mean_sat
## <fct> <dbl>
## 1 Public 1126.
## 2 Private, nonprofit 1152.
## 3 Private, for-profit 1121.
Calculate for each school how many students it takes to pay the average faculty member’s salary and generate a data frame with the school’s name and the calculated value
Note: use the net cost of attendance.
Click for the solution
scorecard %>%
mutate(ratio = avgfacsal / netcost) %>%
select(name, ratio)
## # A tibble: 1,732 × 2
## name ratio
## <chr> <dbl>
## 1 Alabama A & M University 4.63
## 2 University of Alabama at Birmingham 5.87
## 3 University of Alabama in Huntsville 5.50
## 4 Alabama State University 4.76
## 5 The University of Alabama 4.10
## 6 Auburn University at Montgomery 5.10
## 7 Auburn University 4.01
## 8 Birmingham-Southern College 2.56
## 9 Faulkner University 2.61
## 10 Herzing University-Birmingham 2.05
## # … with 1,722 more rows
Calculate how many private, nonprofit schools have a smaller net cost than the University of Chicago
Hint: the result should be a data frame with one row for the University of Chicago, and a column containing the requested value.
Report the number as the total number of schools
Click for the solution
scorecard %>%
filter(type == "Private, nonprofit") %>%
arrange(netcost) %>%
# use row_number() but subtract 1 since UChicago is not cheaper than itself
mutate(school_cheaper = row_number() - 1) %>%
filter(name == "University of Chicago") %>%
glimpse()
## Rows: 1
## Columns: 15
## $ unitid <dbl> 144050
## $ name <chr> "University of Chicago"
## $ state <chr> "IL"
## $ type <fct> "Private, nonprofit"
## $ admrate <dbl> 0.0617
## $ satavg <dbl> 1528
## $ cost <dbl> 78555
## $ netcost <dbl> 27315
## $ avgfacsal <dbl> 166923
## $ pctpell <dbl> 0.1135
## $ comprate <dbl> 0.9473
## $ firstgen <dbl> 0.2024353
## $ debt <dbl> 13000
## $ locale <fct> City
## $ school_cheaper <dbl> 808
Report the number as the percentage of schools
Click for the solution
scorecard %>%
filter(type == "Private, nonprofit") %>%
mutate(netcost_rank = percent_rank(netcost)) %>%
filter(name == "University of Chicago") %>%
glimpse()
## Rows: 1
## Columns: 15
## $ unitid <dbl> 144050
## $ name <chr> "University of Chicago"
## $ state <chr> "IL"
## $ type <fct> "Private, nonprofit"
## $ admrate <dbl> 0.0617
## $ satavg <dbl> 1528
## $ cost <dbl> 78555
## $ netcost <dbl> 27315
## $ avgfacsal <dbl> 166923
## $ pctpell <dbl> 0.1135
## $ comprate <dbl> 0.9473
## $ firstgen <dbl> 0.2024353
## $ debt <dbl> 13000
## $ locale <fct> City
## $ netcost_rank <dbl> 0.7516279
Session Info
devtools::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.2.0 (2022-04-22)
## os macOS Monterey 12.2.1
## system aarch64, darwin20
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz America/Chicago
## date 2022-06-08
## pandoc 2.17.1.1 @ /Applications/RStudio.app/Contents/MacOS/quarto/bin/ (via rmarkdown)
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date (UTC) lib source
## assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.2.0)
## backports 1.4.1 2021-12-13 [1] CRAN (R 4.2.0)
## blogdown 1.10 2022-05-10 [1] CRAN (R 4.2.0)
## bookdown 0.26 2022-04-15 [1] CRAN (R 4.2.0)
## brio 1.1.3 2021-11-30 [1] CRAN (R 4.2.0)
## broom 0.8.0 2022-04-13 [1] CRAN (R 4.2.0)
## bslib 0.3.1 2021-10-06 [1] CRAN (R 4.2.0)
## cachem 1.0.6 2021-08-19 [1] CRAN (R 4.2.0)
## callr 3.7.0 2021-04-20 [1] CRAN (R 4.2.0)
## cellranger 1.1.0 2016-07-27 [1] CRAN (R 4.2.0)
## cli 3.3.0 2022-04-25 [1] CRAN (R 4.2.0)
## codetools 0.2-18 2020-11-04 [1] CRAN (R 4.2.0)
## colorspace 2.0-3 2022-02-21 [1] CRAN (R 4.2.0)
## crayon 1.5.1 2022-03-26 [1] CRAN (R 4.2.0)
## DBI 1.1.2 2021-12-20 [1] CRAN (R 4.2.0)
## dbplyr 2.2.0 2022-06-05 [1] CRAN (R 4.2.0)
## desc 1.4.1 2022-03-06 [1] CRAN (R 4.2.0)
## devtools 2.4.3 2021-11-30 [1] CRAN (R 4.2.0)
## digest 0.6.29 2021-12-01 [1] CRAN (R 4.2.0)
## dplyr * 1.0.9 2022-04-28 [1] CRAN (R 4.2.0)
## ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.2.0)
## evaluate 0.15 2022-02-18 [1] CRAN (R 4.2.0)
## fansi 1.0.3 2022-03-24 [1] CRAN (R 4.2.0)
## fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.2.0)
## forcats * 0.5.1 2021-01-27 [1] CRAN (R 4.2.0)
## fs 1.5.2 2021-12-08 [1] CRAN (R 4.2.0)
## generics 0.1.2 2022-01-31 [1] CRAN (R 4.2.0)
## ggplot2 * 3.3.6 2022-05-03 [1] CRAN (R 4.2.0)
## glue 1.6.2 2022-02-24 [1] CRAN (R 4.2.0)
## gtable 0.3.0 2019-03-25 [1] CRAN (R 4.2.0)
## haven 2.5.0 2022-04-15 [1] CRAN (R 4.2.0)
## here 1.0.1 2020-12-13 [1] CRAN (R 4.2.0)
## hms 1.1.1 2021-09-26 [1] CRAN (R 4.2.0)
## htmltools 0.5.2 2021-08-25 [1] CRAN (R 4.2.0)
## httr 1.4.3 2022-05-04 [1] CRAN (R 4.2.0)
## jquerylib 0.1.4 2021-04-26 [1] CRAN (R 4.2.0)
## jsonlite 1.8.0 2022-02-22 [1] CRAN (R 4.2.0)
## knitr 1.39 2022-04-26 [1] CRAN (R 4.2.0)
## lifecycle 1.0.1 2021-09-24 [1] CRAN (R 4.2.0)
## lubridate 1.8.0 2021-10-07 [1] CRAN (R 4.2.0)
## magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.2.0)
## memoise 2.0.1 2021-11-26 [1] CRAN (R 4.2.0)
## modelr 0.1.8 2020-05-19 [1] CRAN (R 4.2.0)
## munsell 0.5.0 2018-06-12 [1] CRAN (R 4.2.0)
## pillar 1.7.0 2022-02-01 [1] CRAN (R 4.2.0)
## pkgbuild 1.3.1 2021-12-20 [1] CRAN (R 4.2.0)
## pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.2.0)
## pkgload 1.2.4 2021-11-30 [1] CRAN (R 4.2.0)
## prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.2.0)
## processx 3.5.3 2022-03-25 [1] CRAN (R 4.2.0)
## ps 1.7.0 2022-04-23 [1] CRAN (R 4.2.0)
## purrr * 0.3.4 2020-04-17 [1] CRAN (R 4.2.0)
## R6 2.5.1 2021-08-19 [1] CRAN (R 4.2.0)
## rcfss * 0.2.4 2022-06-08 [1] local
## readr * 2.1.2 2022-01-30 [1] CRAN (R 4.2.0)
## readxl 1.4.0 2022-03-28 [1] CRAN (R 4.2.0)
## remotes 2.4.2 2021-11-30 [1] CRAN (R 4.2.0)
## reprex 2.0.1 2021-08-05 [1] CRAN (R 4.2.0)
## rlang 1.0.2 2022-03-04 [1] CRAN (R 4.2.0)
## rmarkdown 2.14 2022-04-25 [1] CRAN (R 4.2.0)
## rprojroot 2.0.3 2022-04-02 [1] CRAN (R 4.2.0)
## rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.2.0)
## rvest 1.0.2 2021-10-16 [1] CRAN (R 4.2.0)
## sass 0.4.1 2022-03-23 [1] CRAN (R 4.2.0)
## scales 1.2.0 2022-04-13 [1] CRAN (R 4.2.0)
## sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.2.0)
## showtext 0.9-5 2022-02-09 [1] CRAN (R 4.2.0)
## showtextdb 3.0 2020-06-04 [1] CRAN (R 4.2.0)
## stringi 1.7.6 2021-11-29 [1] CRAN (R 4.2.0)
## stringr * 1.4.0 2019-02-10 [1] CRAN (R 4.2.0)
## sysfonts 0.8.8 2022-03-13 [1] CRAN (R 4.2.0)
## testthat 3.1.4 2022-04-26 [1] CRAN (R 4.2.0)
## tibble * 3.1.7 2022-05-03 [1] CRAN (R 4.2.0)
## tidyr * 1.2.0 2022-02-01 [1] CRAN (R 4.2.0)
## tidyselect 1.1.2 2022-02-21 [1] CRAN (R 4.2.0)
## tidyverse * 1.3.1 2021-04-15 [1] CRAN (R 4.2.0)
## tzdb 0.3.0 2022-03-28 [1] CRAN (R 4.2.0)
## usethis 2.1.6 2022-05-25 [1] CRAN (R 4.2.0)
## utf8 1.2.2 2021-07-24 [1] CRAN (R 4.2.0)
## vctrs 0.4.1 2022-04-13 [1] CRAN (R 4.2.0)
## withr 2.5.0 2022-03-03 [1] CRAN (R 4.2.0)
## xaringanthemer 0.4.1 2021-11-21 [1] CRAN (R 4.2.0)
## xfun 0.31 2022-05-10 [1] CRAN (R 4.2.0)
## xml2 1.3.3 2021-11-30 [1] CRAN (R 4.2.0)
## yaml 2.3.5 2022-02-21 [1] CRAN (R 4.2.0)
##
## [1] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
##
## ──────────────────────────────────────────────────────────────────────────────