Datasets for categorical data analysis

Michael Friendly

2023-08-21

The vcdExtra package contains 45 datasets, taken from the literature on categorical data analysis, and selected to illustrate various methods of analysis and data display. These are in addition to the 33 datasets in the vcd package.

To make it easier to find those which illustrate a particular method, the datasets in vcdExtra have been classified using method tags. This vignette creates an “inverse table”, listing the datasets that apply to each method. It also illustrates a general method for classifying datasets in R packages.

library(dplyr)
library(tidyr)
library(readxl)

Processing tags

Using the result of vcdExtra::datasets(package="vcdExtra"), I created a spreadsheet, vcdExtra-datasets.xlsx, and then added method tags to it by hand.

dsets_tagged <- read_excel(here::here("inst", "extdata", "vcdExtra-datasets.xlsx"), 
                           sheet="vcdExtra-datasets")

dsets_tagged <- dsets_tagged |>
  dplyr::select(-Title, -dim) |>
  dplyr::rename(dataset = Item)

head(dsets_tagged)
## # A tibble: 6 × 3
##   dataset   class      tags                                
##   <chr>     <chr>      <chr>                               
## 1 Abortion  table      loglinear;logit;2x2                 
## 2 Accident  data.frame loglinear; glm; logistic            
## 3 AirCrash  data.frame reorder; ca                         
## 4 Alligator data.frame loglinear;multinomial;zeros         
## 5 Bartlett  table      2x2;loglinear; homogeneity;oddsratio
## 6 Burt      data.frame ca

To invert the table, we need to split the tags into separate observations (one row per dataset-tag pair), then collapse the rows that share the same tag.

dset_split <- dsets_tagged |>
  tidyr::separate_longer_delim(tags, delim = ";") |>
  dplyr::mutate(tag = stringr::str_trim(tags)) |>
  dplyr::select(-tags)

# collapse the rows for the same tag
tag_dset <- dset_split |>
  dplyr::arrange(tag) |>
  dplyr::group_by(tag) |>
  dplyr::summarise(datasets = paste(dataset, collapse = "; ")) |>
  dplyr::ungroup()

# get a list of the unique tags
unique(tag_dset$tag)
##  [1] "2x2"         "agree"       "binomial"    "ca"          "glm"        
##  [6] "homogeneity" "lm"          "logistic"    "logit"       "loglinear"  
## [11] "mobility"    "multinomial" "oddsratio"   "one-way"     "ordinal"    
## [16] "poisson"     "reorder"     "square"      "zeros"
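The split-then-collapse steps can be tried without the spreadsheet. The following is a minimal sketch that re-enters by hand the six rows printed by head(dsets_tagged) above; the names toy and toy_inverse are made up for this illustration, and base R's trimws() stands in for stringr::str_trim().

```r
library(dplyr)
library(tidyr)

# The six rows shown by head(dsets_tagged), typed in by hand (illustration only)
toy <- tibble::tribble(
  ~dataset,    ~class,       ~tags,
  "Abortion",  "table",      "loglinear;logit;2x2",
  "Accident",  "data.frame", "loglinear; glm; logistic",
  "AirCrash",  "data.frame", "reorder; ca",
  "Alligator", "data.frame", "loglinear;multinomial;zeros",
  "Bartlett",  "table",      "2x2;loglinear; homogeneity;oddsratio",
  "Burt",      "data.frame", "ca"
)

toy_inverse <- toy |>
  # one row per (dataset, tag) pair
  separate_longer_delim(tags, delim = ";") |>
  # some tags were typed with a space after the semicolon; strip it
  mutate(tag = trimws(tags)) |>
  # one row per tag, listing every dataset that carries it
  group_by(tag) |>
  summarise(datasets = paste(dataset, collapse = "; "))

toy_inverse
```

Trimming matters here: without it, "glm" and " glm" would be treated as two different tags.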

Make this into a nice table

Another sheet in the spreadsheet gives a more descriptive topic corresponding to each tag.

tags <- read_excel(here::here("inst", "extdata", "vcdExtra-datasets.xlsx"), 
                   sheet="tags")
head(tags)
## # A tibble: 6 × 2
##   tag         topic                     
##   <chr>       <chr>                     
## 1 2x2         2 by 2 tables             
## 2 agree       observer agreement        
## 3 binomial    binomial distributions    
## 4 ca          correspondence analysis   
## 5 glm         generalized linear models 
## 6 homogeneity homogeneity of association

Now, join this with the tag_dset created above.

tag_dset <- tag_dset |>
  dplyr::left_join(tags, by = "tag") |>
  dplyr::relocate(topic, .after = tag)

tag_dset |>
  dplyr::select(-tag) |>
  head()
## # A tibble: 6 × 2
##   topic                      datasets                                           
##   <chr>                      <chr>                                              
## 1 2 by 2 tables              Abortion; Bartlett; Heart                          
## 2 observer agreement         Mammograms                                         
## 3 binomial distributions     Geissler                                           
## 4 correspondence analysis    AirCrash; Burt; Draft1970table; Gilby; HospVisits;…
## 5 generalized linear models  Accident; Cormorants; DaytonSurvey; Donner; Draft1…
## 6 homogeneity of association Bartlett
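Because the tags were typed by hand, a tag that is missing from the topic sheet (or misspelled) would come through the left join with an NA topic. One way to check for this, sketched here on made-up toy data (tag_dset_toy, tags_toy, the tag "misspelled" and the dataset "SomeData" are all hypothetical), is dplyr::anti_join():

```r
library(dplyr)

# Toy stand-ins for tag_dset and tags; the third tag is deliberately bogus
tag_dset_toy <- tibble::tibble(
  tag      = c("2x2", "ca", "misspelled"),
  datasets = c("Abortion; Bartlett", "AirCrash; Burt", "SomeData")
)
tags_toy <- tibble::tibble(
  tag   = c("2x2", "ca"),
  topic = c("2 by 2 tables", "correspondence analysis")
)

# left_join() keeps every tag; unmatched tags get topic = NA
joined <- left_join(tag_dset_toy, tags_toy, by = "tag")

# anti_join() returns only the rows of the first table with no match
unmatched <- anti_join(tag_dset_toy, tags_toy, by = "tag")
unmatched$tag
```

An empty result from anti_join() indicates that every tag has a descriptive topic.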

Make the table

Use purrr::map() to apply the helper add_links() to the datasets string for each tag, one row at a time. (Calling mutate(datasets = add_links(datasets)) by itself doesn’t work.)
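The definition of add_links() is not shown in this text. As a rough illustration only, the sketch below uses a hypothetical stand-in, add_links_sketch(), with a placeholder base URL; the real helper may differ. It assumes the helper takes a single "; "-separated string and returns a single string, which is one plausible reason a direct call inside mutate() would fail while element-wise mapping works.

```r
# Hypothetical stand-in for add_links(); not the vignette's actual helper.
# Takes ONE string such as "Abortion; Bartlett" and returns ONE string in
# which each dataset name is wrapped in a Markdown-style link.
add_links_sketch <- function(dsets,
                             base_url = "https://example.org/reference/",  # placeholder URL
                             sep = "; ") {
  # [[1]] means only the first element of `dsets` is ever used
  names <- trimws(strsplit(dsets, sep, fixed = TRUE)[[1]])
  links <- paste0("[", names, "](", base_url, names, ".html)")
  paste(links, collapse = sep)
}

x <- c("Abortion; Bartlett", "Mammograms")

# Called on the whole vector, the function returns a single string,
# so it cannot fill a column with one value per row
direct <- add_links_sketch(x)
length(direct)

# Mapping over the elements gives one linked string per row
mapped <- purrr::map_chr(x, add_links_sketch)
length(mapped)
```

purrr::map() returns a list column rather than a character vector; purrr::map_chr() is used in this sketch only to make the lengths easy to compare.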

tag_dset |>
  dplyr::select(-tag) |>
  dplyr::mutate(datasets = purrr::map(datasets, add_links)) |>
  knitr::kable()
|topic                      |datasets |
|:--------------------------|:--------|
|2 by 2 tables              |Abortion; Bartlett; Heart |
|observer agreement         |Mammograms |
|binomial distributions     |Geissler |
|correspondence analysis    |AirCrash; Burt; Draft1970table; Gilby; HospVisits; HouseTasks; Mental |
|generalized linear models  |Accident; Cormorants; DaytonSurvey; Donner; Draft1970table; GSS; ICU; PhdPubs |
|homogeneity of association |Bartlett |
|linear models              |Draft1970 |
|logistic regression        |Accident; Donner; ICU; Titanicp |
|logit models               |Abortion; Cancer |
|loglinear models           |Abortion; Accident; Alligator; Bartlett; Caesar; Cancer; Detergent; Dyke; Heckman; Hoyt; JobSat; Mice; TV; Titanicp; Toxaemia; Vietnam; Vote1980; WorkerSat |
|mobility tables            |Glass; Hauser79; Mobility; Yamaguchi87 |
|multinomial models         |Alligator |
|odds ratios                |Bartlett; Fungicide |
|one-way tables             |CyclingDeaths; Depends; ShakeWords |
|ordinal variables          |Draft1970table; Gilby; HairEyePlace; Hauser79; HospVisits; JobSat; Mammograms; Mental; Mice; Mobility; Yamaguchi87 |
|Poisson distributions      |Cormorants; PhdPubs |
|reordering values          |AirCrash; Glass; HouseTasks |
|square tables              |Glass; Hauser79; Mobility; Yamaguchi87 |
|zero counts                |Alligator; Caesar; PhdPubs; Vote1980 |

Voilà!