Survey data often encounters the challenge of dropout – instances where participants fail to complete sections of the survey due to interruptions or omissions. Handling dropouts effectively is crucial for accurate data analysis and interpretation. The dropout package provides a solution by offering insights into participant behavior during the survey process.
The dropout package empowers you with the capability to extract valuable insights from your dataset, such as:
By leveraging these insights, you can:
In this vignette, we will provide an in-depth overview of the dropout
package’s features and their practical utilization. We will use a sample
dataset named “flying” to illustrate these concepts. This is a modified
version of the Flying Etiquette Survey data behind the story: 41 percent
of flyers say it’s rude to recline your seat on an airplane. You can
load this preinstalled dataset into your environment using the command
data(flying)
.
The package is designed for effortless integration with widely used
data wrangling libraries such as dplyr from the tidyverse ecosystem.
These libraries provide a rich toolkit for data manipulation and
analysis. Additionally, in a recent update, substantial portions of the
package’s code were rewritten using C++, a high-performance programming
language. This enhancement allows the dropout
package to
handle larger datasets efficiently. Consequently, it is well-suited for
analyzing substantial volumes of data, making it a valuable tool for
both small-scale and big data survey analyses.
drop_summary
Let’s embark on a deeper exploration of the dropout
package by delving into the drop_summary
function. This
function serves as a pivotal tool for gaining in-depth insights into
dropout patterns within your dataset. To effectively utilize the
drop_summary
function, you should specify the last column
in your dataset that corresponds to the survey items. If you leave out
this part of the function, the package will attempt to automatically
detect whether the last columns are part of the survey items. This is
achieved by excluding the trailing columns that do not contain any
missing values (NAs). While this automatic detection can be convenient
in many cases, it might not always align with the specific structure of
your dataset.
If the automatic detection behavior does not suit the needs of your dataset, you have the flexibility to set the last_col parameter explicitly. By doing so, you inform the package about the exact location of the last item in your survey data. This manual specification ensures precise alignment with your dataset’s structure and is particularly useful when dealing with unconventional or complex survey formats.
For example, in the “flying” dataset, the final survey-related item is stored in the “location_census_region” column. Following this, the “survey_type” column contains supplementary survey information. Many datasets incorporate similar non-survey-related data, and it’s crucial to consider such cases.
To gain a comprehensive overview of the dropout patterns within your dataset, consider the following code snippet:
drop_summary(flying, "location_census_region")
#> # A tibble: 27 × 9
#> column_name dropout drop_rate cum_drop_rate drop_na section_na single_na
#> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 respondent_id 0 0 0 0 0 0
#> 2 travel_frequency 0 0 0 0 0 0
#> 3 seat_recline 18 0.02 0.02 18 164 0
#> 4 height 0 0 0.02 18 164 12
#> 5 children_under_… 1 0 0.02 19 164 6
#> 6 two_armrests 1 0 0.02 20 164 0
#> 7 middle_armrest 0 0 0.02 20 164 0
#> 8 window_shade 0 0 0.02 20 164 0
#> 9 moving_to_unsol… 1 0 0.02 21 164 0
#> 10 talking_to_seat… 0 0 0.02 21 164 0
#> # ℹ 17 more rows
#> # ℹ 2 more variables: missing <dbl>, completion_rate <dbl>
Now, let’s delve into the intricacies of the
drop_summary
function and the valuable insights it provides
in a structured format.
drop_summary
When you use the drop_summary
function, the output you
receive is a compact yet informative summary, packaged as either a
dataframe or a tibble. This summary consists of multiple columns, each
of which provides insights into different dimensions of dropout analysis
within your dataset.
column_name
: Lists the names of the
columns from your dataset that have been analyzed for dropouts.
dropout
: Contains the frequency of
dropouts within each listed column, allowing you to see where dropout
rates might be the most significant.
drop_rate
: Shows the percentage of
dropout incidents in each column. This is useful for understanding the
relative impact of dropouts in various parts of your dataset.
cum_drop_rate
: Shows the overall
percentage of dropout incidents in each column.
drop_na
: Provides the percentage of
missing values in each column that can be attributed specifically to
dropouts. This offers insights into the nature of missing data.
section_na
: Indicates occurrences
of missing values that span at least n
consecutive columns
(n
defaults to 3). You can adjust this parameter using
section_min
as shown below:
drop_detect
to Identify DropoutsOne of the core tools in the dropout
package is the
drop_detect
function. This function serves as a
comprehensive tool for isolating and understanding individual
participant dropout behaviors. Specifically, it reveals whether a
participant has left the survey prematurely and pinpoints the exact
juncture at which the dropout took place.
The structure and usage of drop_detect
are intentionally
made to resonate with the drop_summary
function, ensuring
consistency and ease of adoption.
dropout
: A Boolean column (TRUE or
FALSE). A TRUE
value signifies that the respective
participant exited the survey prematurely.
dropout_column
: For those marked as
TRUE
in the dropout
column, this field
specifies the exact column or question that triggered the
dropout.
dropout_index
: Offers a direct
reference to the row number where the dropout incident occurred,
facilitating easier traceability.
For practical insights, consider applying drop_detect
on
the ‘flying’ dataset. Here’s how you can achieve this:
Moreover, if you wish to append the extracted dropout details back to
the original dataset, you can employ the bind_cols
function
from the dplyr
package:
Such integration of dropout specifics into the primary dataset can act as a preliminary step for more nuanced analyses, like zoning into specific dropout triggers, assessing commonalities among dropouts, or any other relevant exploratory exercises.
Subsequent sections will delve into exemplified applications of this integrated approach.
The drop_detect
function can be useful for identifying
and filtering out early dropouts, i.e., participants who stopped
answering the survey at a specific column. For example, you can filter
for participants who did not drop out early, or had a ‘late’ dropout in
the demographic part of the questions, using the
dropout_index
:
If you’re interested in a specific section of questions and want to
filter for dropouts and section_na
, you have two
approaches:
In this approach, we create a subset of the data containing only the
first 22 columns. We then apply the drop_detect
function to
this subset to identify dropouts and other relevant indicators:
last_col
ParameterYou can specify the last_col
parameter to the last
question in the section, to identify all dropouts up to that point. If
left out, the function will use the same detection steps of non survey
related columns as described in drop_summary
function.
One practical application is to compare the demographics (e.g., age and gender) between those who left out a section and those who did not. The following code generates a bar graph that breaks down dropout rates by age group and gender.
library(ggplot2)
flying %>%
drop_detect("smoking_violation") %>%
bind_cols(flying, .) %>%
filter(!is.na(gender)) %>%
mutate(age = factor(age, levels = c("18-29", "30-44", "45-60", "> 60"))) %>%
ggplot(aes(x=age, fill=dropout)) +
geom_bar(position="dodge") +
facet_grid(gender ~ .) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
By visualizing the data, you can more easily discern patterns and disparities among different demographic groups with respect to dropout rates.
Another interesting avenue is to explore whether there’s a relationship between dropout behavior (or in this case leaving out a section) and demographic variables like age and gender:
test <- flying %>%
drop_detect("smoking_violation") %>%
bind_cols(flying, .) %>%
filter(!is.na(gender)) %>%
select(dropout, age, gender) %>%
mutate(dropout = as.numeric(ifelse(dropout == TRUE, 1, 0)))
glm_model <- glm(dropout ~ gender + age, data = test, family = binomial)
print(summary(glm_model))
This can be particularly useful for hypothesis testing and can aid in uncovering patterns in your data.