Many existing text mining datasets are stored as a DocumentTermMatrix
object (a class from the tm package). For
example, consider the corpus of 2246 Associated Press articles from the
topicmodels package:
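The summary that follows can be reproduced with something along these lines. This is a minimal sketch, assuming the tm and topicmodels packages are installed; loading tm is only needed so that the object prints with its DocumentTermMatrix summary.

```r
# Assumption: tm and topicmodels are installed.
library(tm)  # supplies the print method for DocumentTermMatrix objects

# Load the AssociatedPress dataset without attaching all of topicmodels
data("AssociatedPress", package = "topicmodels")

# Printing the object shows the summary (documents, terms, sparsity)
AssociatedPress
```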
## <<DocumentTermMatrix (documents: 2246, terms: 10473)>>
## Non-/sparse entries: 302031/23220327
## Sparsity : 99%
## Maximal term length: 18
## Weighting : term frequency (tf)
If we want to analyze this with tidy tools, we need to turn it into a
one-term-per-document-per-row data frame first. The tidy()
function does this, returning one row for each non-zero document-term entry, with columns document, term, and count. (For more on the tidy verb, see the broom package.)
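A minimal sketch of that conversion, assuming tm, topicmodels, and tidytext are installed. The name ap_td is chosen here for illustration.

```r
library(tm)        # DocumentTermMatrix class
library(tidytext)  # tidy() method for DocumentTermMatrix objects

data("AssociatedPress", package = "topicmodels")

# One row per non-zero (document, term) pair; ap_td is an illustrative name
ap_td <- tidy(AssociatedPress)
ap_td
```

Only non-zero entries are kept, so the result has as many rows as the matrix has non-sparse entries rather than documents times terms.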
Just as shown in this vignette, having the text in this format is convenient for analysis with the tidytext package. For example, you can perform sentiment analysis on these newspaper articles.
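One way to build the ap_sentiments table used below is to join the tidied terms against a sentiment lexicon. This sketch assumes the Bing lexicon from get_sentiments("bing"), which matches the positive/negative labels in the output that follows; the choice of lexicon is an assumption, not something stated in the text.

```r
library(dplyr)
library(tm)
library(tidytext)

data("AssociatedPress", package = "topicmodels")

# Assumption: the Bing lexicon, which labels words "positive" or "negative"
ap_sentiments <- tidy(AssociatedPress) %>%
  inner_join(get_sentiments("bing"), by = c(term = "word"))

ap_sentiments
```

The inner join drops every term that does not appear in the lexicon, which is why the result is much smaller than the full tidied table.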
## # A tibble: 30,094 × 4
##    document term    count sentiment
##       <int> <chr>   <dbl> <chr>
##  1        1 assault     1 negative
##  2        1 complex     1 negative
##  3        1 death       1 negative
##  4        1 died        1 negative
##  5        1 good        2 positive
##  6        1 illness     1 negative
##  7        1 killed      2 negative
##  8        1 like        2 positive
##  9        1 liked       1 positive
## 10        1 miracle     1 positive
## # ℹ 30,084 more rows
We can find the most negative documents:
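library(tidyr)  # for pivot_wider()

# Net sentiment per document, most negative first
ap_sentiments %>%
  count(document, sentiment, wt = count) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
  arrange(sentiment)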
## # A tibble: 2,190 × 4
##    document negative positive sentiment
##       <int>    <dbl>    <dbl>     <dbl>
##  1     1251       54        6       -48
##  2     1380       53        5       -48
##  3      531       51        9       -42
##  4       43       45       11       -34
##  5     1263       44       10       -34
##  6     2178       40        6       -34
##  7      334       45       12       -33
##  8     1664       38        5       -33
##  9     2147       47       14       -33
## 10      516       38        6       -32
## # ℹ 2,180 more rows
Or visualize which words contributed to positive and negative sentiment:
library(ggplot2)

# Top 10 terms by weighted count within each sentiment
ap_sentiments %>%
  count(sentiment, term, wt = count) %>%
  group_by(sentiment) %>%
  slice_max(n, n = 10) %>%
  mutate(term = reorder(term, n)) %>%
  ggplot(aes(n, term, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(vars(sentiment), scales = "free_y") +
  labs(x = "Contribution to sentiment", y = NULL)