class: center, middle, inverse, title-slide

.title[
# Text analysis: classification and topic modeling
]
.author[
### MACS 30500
University of Chicago
]

---

# Supervised learning

1. Hand-code a small set of documents `\(N = 1,000\)`
1. Train a statistical learning model on the hand-coded data
1. Evaluate the effectiveness of the statistical learning model
1. Apply the final model to the remaining set of documents `\(N = 1,000,000\)`

---

# `USCongress`

```
## Rows: 4,449
## Columns: 7
## $ ID       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
## $ cong     <dbl> 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 107, 1…
## $ billnum  <dbl> 4499, 4500, 4501, 4502, 4503, 4504, 4505, 4506, 4507, 4508, 4…
## $ h_or_sen <chr> "HR", "HR", "HR", "HR", "HR", "HR", "HR", "HR", "HR", "HR", "…
## $ major    <dbl> 18, 18, 18, 18, 5, 21, 15, 18, 18, 18, 18, 16, 18, 12, 2, 3, …
## $ text     <chr> "To suspend temporarily the duty on Fast Magenta 2 Stage.", "…
## $ label    <fct> "Foreign trade", "Foreign trade", "Foreign trade", "Foreign t…
```

```
## [1] "To suspend temporarily the duty on Fast Magenta 2 Stage."
## [2] "To suspend temporarily the duty on Fast Black 286 Stage."
## [3] "To suspend temporarily the duty on mixtures of Fluazinam."
## [4] "To reduce temporarily the duty on Prodiamine Technical."
## [5] "To amend the Immigration and Nationality Act in regard to Caribbean-born immigrants."
## [6] "To amend title 38, United States Code, to extend the eligibility for housing loans guaranteed by the Secretary of Veterans Affairs under the Native American Housing Loan Pilot Program to veterans who are married to Native Americans."
```

---

# Split the data set

```r
set.seed(123)

# convert response variable to factor
congress <- congress %>%
  mutate(major = factor(x = major, levels = major, labels = label))

# split into training and testing sets
congress_split <- initial_split(data = congress, strata = major, prop = .8)
congress_split
## <Analysis/Assess/Total>
## <3558/891/4449>

congress_train <- training(congress_split)
congress_test <- testing(congress_split)

# generate cross-validation folds
congress_folds <- vfold_cv(data = congress_train, strata = major)
```

---

# Class imbalance

<img src="index_files/figure-html/major-topic-dist-1.png" width="80%" style="display: block; margin: auto;" />

---

# Preprocessing the data frame

```r
congress_rec <- recipe(major ~ text, data = congress_train)
```

```r
library(textrecipes)

congress_rec <- congress_rec %>%
  step_tokenize(text) %>%
  step_stopwords(text) %>%
  step_tokenfilter(text, max_tokens = 500) %>%
  step_tfidf(text) %>%
  step_downsample(major)
```

---

# Define the model

```r
tree_spec <- decision_tree() %>%
  set_mode("classification") %>%
  set_engine("C5.0")

tree_spec
## Decision Tree Model Specification (classification)
## 
## Computational engine: C5.0
```

---

# Train the model

```r
tree_wf <- workflow() %>%
  add_recipe(congress_rec) %>%
  add_model(tree_spec)
```

```r
set.seed(123)

tree_cv <- fit_resamples(
  tree_wf,
  congress_folds,
  control = control_resamples(save_pred = TRUE)
)
```

```r
tree_cv_metrics <- collect_metrics(tree_cv)
tree_cv_predictions <- collect_predictions(tree_cv)

tree_cv_metrics
## # A tibble: 2 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy multiclass 0.454    10 0.00835 Preprocessor1_Model1
## 2 roc_auc  hand_till  0.772    10 0.00837 Preprocessor1_Model1
```

---

# Confusion matrix

<img src="index_files/figure-html/tree-confusion-1.png" width="80%" style="display: block; margin: auto;" />
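---

# Evaluate the final model (sketch)

The resampling results above only estimate performance; a natural last step, not shown on the earlier slides, is to fit the workflow once on the full training set and evaluate it on the held-out test set. A minimal sketch, reusing `tree_wf` and `congress_split` from before with the standard `tune`/`yardstick` helpers; the object name `tree_final` is illustrative.

```r
# fit on the training set, then predict the held-out test set
tree_final <- last_fit(tree_wf, congress_split)

# test set performance metrics
collect_metrics(tree_final)

# confusion matrix for the test set predictions
collect_predictions(tree_final) %>%
  conf_mat(truth = major, estimate = .pred_class)
```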
---

# Name That Tune!

.pull-left[

<div style="width:100%;height:0;padding-bottom:54%;position:relative;"><iframe src="https://giphy.com/embed/10JbbHzFsBpg40" width="100%" height="100%" style="position:absolute" frameBorder="0" class="giphy-embed" allowFullScreen></iframe></div>

]

.pull-right[

<div style="width:100%;height:0;padding-bottom:100%;position:relative;"><iframe src="https://giphy.com/embed/7SKWbnycqb2Pze62Zk" width="100%" height="100%" style="position:absolute" frameBorder="0" class="giphy-embed" allowFullScreen></iframe></div>

]
<!-- countdown timer: 15:00 -->
---

# Topic modeling

* Themes
* Probabilistic topic models
* Latent Dirichlet allocation

---

# Example: food and animals

1. I ate a banana and spinach smoothie for breakfast.
1. I like to eat broccoli and bananas.
1. Chinchillas and kittens are cute.
1. My sister adopted a kitten yesterday.
1. Look at this cute hamster munching on a piece of broccoli.

---

# LDA document structure

* Decide on the number of words `\(N\)` the document will have
    * [Dirichlet probability distribution](https://en.wikipedia.org/wiki/Dirichlet_distribution)
* Fixed set of `\(k\)` topics
* Generate each word in the document:
    * Pick a topic
    * Generate the word
* LDA backtracks from this generative process to infer the topics most likely to have produced the collection (formalized in the sketch a few slides ahead)

---

# `r/jokes`

<blockquote class="reddit-card" data-card-created="1552319072"><a href="https://www.reddit.com/r/Jokes/comments/a593r0/twenty_years_from_now_kids_are_gonna_think_baby/">Twenty years from now, kids are gonna think "Baby it's cold outside" is really weird, and we're gonna have to explain that it has to be understood as a product of its time.</a> from <a href="http://www.reddit.com/r/Jokes">r/Jokes</a></blockquote>
<script async src="//embed.redditmedia.com/widgets/platform.js" charset="UTF-8"></script>

---

# `r/jokes` dataset

```
## Rows: 194,553
## Columns: 4
## $ body  <chr> "Now I have to say \"Leroy can you please paint the fence?\"", "…
## $ id    <chr> "5tz52q", "5tz4dd", "5tz319", "5tz2wj", "5tz1pc", "5tz1o1", "5tz…
## $ score <dbl> 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 15, 0, 0, 3, 1, 0, 3, 2, 2, 3, 0, …
## $ title <chr> "I hate how you cant even say black paint anymore", "What's the …
```

---

# Create the recipe

```r
set.seed(123) # set seed for random sampling

jokes_rec <- recipe(~., data = jokes) %>%
  step_sample(size = 1e04) %>%
  step_tokenize(title, body) %>%
  step_tokenmerge(title, body, prefix = "joke") %>%
  step_stopwords(joke) %>%
  step_ngram(joke, num_tokens = 5, min_num_tokens = 1) %>%
  step_tokenfilter(joke, max_tokens = 2500) %>%
  step_tf(joke)
```

---

# Bake the recipe

```r
jokes_prep <- prep(jokes_rec)
jokes_df <- bake(jokes_prep, new_data = NULL)

jokes_df %>%
  slice(1:5)
## # A tibble: 5 × 2,502
##   id    score tf_joke_0 tf_joke_1 tf_joke_10 tf_joke_100 tf_joke_1000 tf_joke_11
##   <fct> <dbl>     <dbl>     <dbl>      <dbl>       <dbl>        <dbl>      <dbl>
## 1 2tzi…    12         0         0          0           0            0          0
## 2 4zqp…     0         0         0          0           0            0          0
## 3 2lgw…    58         0         0          0           0            0          0
## 4 3qx3…     9         0         0          0           0            0          0
## 5 2x2z…     0         0         0          0           0            0          0
## # … with 2,494 more variables: tf_joke_12 <dbl>, tf_joke_13 <dbl>,
## #   tf_joke_14 <dbl>, tf_joke_15 <dbl>, tf_joke_16 <dbl>, tf_joke_18 <dbl>,
## #   tf_joke_1st <dbl>, tf_joke_2 <dbl>, tf_joke_20 <dbl>,
## #   tf_joke_20_years <dbl>, tf_joke_200 <dbl>, tf_joke_2015 <dbl>,
## #   tf_joke_25 <dbl>, tf_joke_3 <dbl>, tf_joke_30 <dbl>, tf_joke_3rd <dbl>,
## #   tf_joke_4 <dbl>, tf_joke_40 <dbl>, tf_joke_4th <dbl>, tf_joke_5 <dbl>,
## #   tf_joke_50 <dbl>, tf_joke_500 <dbl>, tf_joke_5th <dbl>, tf_joke_6 <dbl>, …
```

---

# Convert to document-term matrix

```r
jokes_dtm <- jokes_df %>%
  pivot_longer(cols = -c(id, score), names_to = "token", values_to = "n") %>%
  filter(n != 0) %>%
  # clean the token column so it just includes the token
  # drop empty levels from id - this includes jokes which did not
  # have any tokens retained after step_tokenfilter()
  mutate(
    token = str_remove(string = token, pattern = "tf_joke_"),
    id = fct_drop(f = id)
  ) %>%
  cast_dtm(document = id, term = token, value = n)

jokes_dtm
## <<DocumentTermMatrix (documents: 9944, terms: 2500)>>
## Non-/sparse entries: 140880/24719120
## Sparsity           : 99%
## Maximal term length: 60
## Weighting          : term frequency (tf)
```
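---

# The LDA generative story (sketch)

A compact restatement of the *LDA document structure* slide, using standard LDA notation that is not in the original deck: `\(\alpha\)` and `\(\eta\)` are Dirichlet priors, `\(\theta_d\)` is the topic mixture of document `\(d\)`, `\(\beta_k\)` is the word distribution of topic `\(k\)`, and `\(z_{d,n}\)` and `\(w_{d,n}\)` are the topic and word at position `\(n\)` of document `\(d\)`.

$$
\begin{aligned}
\beta_k &\sim \text{Dirichlet}(\eta) && \text{for each topic } k = 1, \dots, K \\
\theta_d &\sim \text{Dirichlet}(\alpha) && \text{for each document } d \\
z_{d,n} &\sim \text{Multinomial}(\theta_d) && \text{for each word position } n \\
w_{d,n} &\sim \text{Multinomial}(\beta_{z_{d,n}}) && \text{the observed word}
\end{aligned}
$$

Fitting LDA reverses this story: given only the observed words `\(w_{d,n}\)`, infer the hidden topic mixtures `\(\theta\)` and topic-word distributions `\(\beta\)`.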
---

# `\(k=4\)`

```r
jokes_lda4 <- LDA(jokes_dtm, k = 4, control = list(seed = 123))
```

<img src="index_files/figure-html/jokes-4-topn-1.png" width="70%" style="display: block; margin: auto;" />

---

# `\(k=12\)`

<img src="index_files/figure-html/jokes-12-topn-1.png" width="80%" style="display: block; margin: auto;" />

---

# Perplexity

* A statistical measure of how well a probability model predicts a sample
* Compares the theoretical word distributions represented by the topics to the actual distribution of words in your documents
* Lower perplexity indicates a better fit
* Perplexity for the 12-topic LDA model: 994.97

---

# Perplexity

<img src="index_files/figure-html/jokes_lda_compare_viz-1.png" width="80%" style="display: block; margin: auto;" />

---

# `\(k=100\)`

<img src="index_files/figure-html/jokes-100-topn-1.png" width="80%" style="display: block; margin: auto;" />

---

# LDAvis

* Interactive visualization of LDA model results

1. What is the meaning of each topic?
1. How prevalent is each topic?
1. How do the topics relate to each other?
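---

# Building the LDAvis view (sketch)

The slides above do not show the code behind the LDAvis output. A minimal sketch, assuming the `LDAvis` and `slam` packages are installed and reusing `jokes_lda4` and `jokes_dtm` from earlier; the object names (`post`, `phi`, `theta`, `doc_length`, `vocab`, `term_freq`, `json`) are illustrative.

```r
library(LDAvis)
library(slam)

# posterior topic-term (phi) and document-topic (theta) distributions
post <- topicmodels::posterior(jokes_lda4)
phi <- post$terms    # topics x terms
theta <- post$topics # documents x topics

# document lengths and overall term frequencies from the document-term matrix
doc_length <- row_sums(jokes_dtm)
vocab <- colnames(phi)
term_freq <- col_sums(jokes_dtm)[vocab]

# assemble the JSON payload and launch the interactive viewer
json <- createJSON(
  phi = phi,
  theta = theta,
  doc.length = doc_length,
  vocab = vocab,
  term.frequency = term_freq
)
serVis(json)
```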