The goal of this package is to iteratively build a customizable data table, one row at a time. This package will allow a user to input a data object, specify the rows and columns to use for the summary table, and select the type of data to use for each individual row. Missing data, overall statistics, and comparison tests can be calculated using this package as well.
install.packages("tangram.pipe")
suppressPackageStartupMessages(require(tangram.pipe))
suppressPackageStartupMessages(require(knitr))
suppressPackageStartupMessages(require(kableExtra))
The first step to using this package is to initialize the table to be
created. Here, the user will select the name of the dataset to be
analyzed in the table and specify the variable name to use for the
columns. In addition, the user will need to determine whether to account
for missing data, calculate overall statistics across all columns, or
conduct comparison tests across the columns for each row. The arguments
for `missing`, `overall`, and `comparison` will be used as the defaults
for each subsequent row added to the table; however, a user can specify a
different entry for each argument for individual rows if desired. Finally,
the user can choose the default summary function to use for each type of row.
This vignette will use the built-in `iris` dataset, which is a well-known
dataset containing flower measurements for three species of iris flowers.
Since most of the data in `iris` is numerical, we will add in two made-up
variables (flower color and stem size) in order to demonstrate
table-building functions for non-numeric data. Note that the additional
columns are made up purely for demonstration of this package. We also set
a few values to `NA` so that the missing-data options have something to
report.
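# Two made-up variables for the non-numeric examples
iris$color <- sample(c("Blue", "Purple"), size=150, replace=TRUE)
iris$Stem.Size <- sample(c("Small", "Medium", "Medium", "Large"), size=150, replace=TRUE)
# Row 149: remove the species (the column variable)
iris[149,5] <- NA
# Row 150: remove every value except the species
iris[150,c(1:4, 6:7)] <- NA
head(iris) %>%
  kable(escape=F, align="c") %>%
  trimws %>%
  kable_styling(c("striped","bordered"))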
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | color | Stem.Size |
---|---|---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa | Blue | Medium |
4.9 | 3.0 | 1.4 | 0.2 | setosa | Blue | Large |
4.7 | 3.2 | 1.3 | 0.2 | setosa | Blue | Large |
4.6 | 3.1 | 1.5 | 0.2 | setosa | Blue | Small |
5.0 | 3.6 | 1.4 | 0.2 | setosa | Blue | Large |
5.4 | 3.9 | 1.7 | 0.4 | setosa | Purple | Large |
For this example, the variable ‘Species’ will be chosen as the column
variable; `missing` and `comparison` will be set to `FALSE` to generate a
simple example. We will also set each type of summary function to the
default setting used by the package.
tbl1 <- tbl_start(data=iris,
                  col_var="Species",
                  missing=FALSE,
                  overall=TRUE,
                  comparison=FALSE,
                  digits = 2,
                  default_num_summary = num_default,
                  default_cat_summary = cat_default,
                  default_binary_summary = binary_default)
Using this function creates a list object that stores the user preferences for building the table going forward; in addition to the nine elements listed here, the number of rows is also saved to the list. Subsequent entries to the list will store information for the rows, which will ultimately be compiled to create the final table after all row information has been added.
The `tbl_start` arguments are set to the following defaults. Aside from
`data` and `col_var`, the remaining arguments do not need to be specified
if they match the following default values:
- `missing`: `FALSE`
- `overall`: `TRUE`
- `comparison`: `FALSE`
- `digits`: `2`
- `default_num_summary`: `num_default`
- `default_cat_summary`: `cat_default`
- `default_binary_summary`: `binary_default`
To start off, we will first add a numeric row to the table. The function
`num_row` reads in data that is numeric in form, and by default calculates
the five-number summary statistics (minimum, first quartile, median, third
quartile, maximum), as well as the mean and standard deviation for the
numeric variable within each column. Since we specified `overall=TRUE` in
the initialization step, an overall summary column will be included as
well. The default summary function is `num_default`, but the user may
write their own function to calculate different summary statistics from
what is shown here.
Currently, there are five summary functions available for use within
`num_row`. The default summary to use for each row can be specified in
`tbl_start`, or determined using the `summary` argument of each row.
- `num_default`: Calculates the five-number summary, mean, and standard deviation
- `num_minmax`: Calculates the minimum and maximum values
- `num_medianiqr`: Calculates the median and interquartile range
- `num_mean_sd`: Calculates the mean and standard deviation
- `num_date`: Calculates the five-number summary for a date object
More information on writing your own summary functions can be found in the accompanying package vignette “Writing User-Defined Summary Functions”.
Let’s start by calculating summary statistics for the Sepal Length in the
`iris` dataset. Since it makes more sense to display the variable name as
“Sepal Length” rather than the R-generated “Sepal.Length”, we will use the
`rowlabel` argument to make this change for the table. Note that if you
have a dataframe with labelled variables as columns, leaving `rowlabel`
blank will automatically input the variable’s label as the rowlabel. To
output the final object, we use the function `tbl_out` to display the
table.
tbl1 <- tbl_start(data=iris, col_var="Species", missing=FALSE, overall=TRUE, comparison=FALSE) %>%
  num_row(row_var="Sepal.Length", rowlabel="Sepal Length") %>%
  tbl_out()
tbl1 %>%
  tangram_styling() %>%
  kable(escape=F, align="l") %>%
  trimws %>%
  kable_styling(c("striped","bordered"))
| Variable | Measure | setosa | versicolor | virginica | Overall |
|---|---|---|---|---|---|
| Sepal Length | min | 4.30 | 4.90 | 4.90 | 4.30 |
| | Q1 | 4.80 | 5.60 | 6.30 | 5.10 |
| | median | 5.00 | 5.90 | 6.50 | 5.80 |
| | Q3 | 5.20 | 6.30 | 6.95 | 6.40 |
| | max | 5.80 | 7.00 | 7.90 | 7.90 |
| | mean | 5.01 | 5.94 | 6.61 | 5.84 |
| | SD | 0.35 | 0.52 | 0.64 | 0.83 |
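As a cross-check on the default numeric summary, the setosa column above can be reproduced in base R. This is only a sketch of the statistics the text describes (five-number summary, mean, and standard deviation); it does not call tangram.pipe, and the helper name `five_num_mean_sd` is made up for illustration. It assumes R's default quantile definition, which the setosa values above are consistent with.

```r
# Hypothetical helper: the statistics described for the default numeric summary.
five_num_mean_sd <- function(x) {
  x <- x[!is.na(x)]                                   # drop missing values first
  q <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1), names = FALSE)
  c(min = q[1], Q1 = q[2], median = q[3], Q3 = q[4], max = q[5],
    mean = mean(x), SD = sd(x))
}

# Use the untouched built-in data so the vignette's modified `iris` is not required.
dat <- datasets::iris
setosa <- dat$Sepal.Length[dat$Species == "setosa"]
setosa_summary <- round(five_num_mean_sd(setosa), 2)
setosa_summary
```

The setosa rows were not altered when the missing values were introduced, so the rounded results should line up with the setosa column of the table.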
By default, each row function will use two decimal places in reported
statistics. We can use the `digits` argument to specify more or fewer
decimal places in the reported table.
tbl1 <- tbl_start(data=iris, col_var="Species", missing=FALSE, overall=TRUE, comparison=FALSE) %>%
  num_row(row_var="Sepal.Length", rowlabel="Sepal Length", digits=4) %>%
  tbl_out()
tbl1 %>%
  tangram_styling() %>%
  kable(escape=F, align="l") %>%
  trimws %>%
  kable_styling(c("striped","bordered"))
| Variable | Measure | setosa | versicolor | virginica | Overall |
|---|---|---|---|---|---|
| Sepal Length | min | 4.3000 | 4.9000 | 4.9000 | 4.3000 |
| | Q1 | 4.8000 | 5.6000 | 6.3000 | 5.1000 |
| | median | 5.0000 | 5.9000 | 6.5000 | 5.8000 |
| | Q3 | 5.2000 | 6.3000 | 6.9500 | 6.4000 |
| | max | 5.8000 | 7.0000 | 7.9000 | 7.9000 |
| | mean | 5.0060 | 5.9360 | 6.6104 | 5.8405 |
| | SD | 0.3525 | 0.5162 | 0.6386 | 0.8331 |
There is a small amount of missing data within the `iris` dataset.
Currently, `num_row` filters out the missing data and only considers data
with complete cases of the row and column variables. To see how much
missing data there is in the sepal length, we specify `missing=TRUE`.
tbl1 <- tbl_start(data=iris, col_var="Species", missing=FALSE, overall=TRUE, comparison=FALSE) %>%
  num_row(row_var="Sepal.Length", rowlabel="Sepal Length", missing=TRUE) %>%
  tbl_out()
tbl1 %>%
  tangram_styling() %>%
  kable(escape=F, align="l") %>%
  trimws %>%
  kable_styling(c("striped","bordered"))
| Variable | Measure | setosa | versicolor | virginica | Overall |
|---|---|---|---|---|---|
| Sepal Length | min | 4.30 | 4.90 | 4.90 | 4.30 |
| | Q1 | 4.80 | 5.60 | 6.30 | 5.10 |
| | median | 5.00 | 5.90 | 6.50 | 5.80 |
| | Q3 | 5.20 | 6.30 | 6.95 | 6.40 |
| | max | 5.80 | 7.00 | 7.90 | 7.90 |
| | mean | 5.01 | 5.94 | 6.61 | 5.84 |
| | SD | 0.35 | 0.52 | 0.64 | 0.83 |
| | Missing | 0 | 0 | 1 | 1 |
The function above tells us that the dataset is missing a sepal length measurement for one of the virginica flowers. Note that the function cannot locate instances of missingness in the column variable.
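The same bookkeeping can be sketched in base R on a copy of the built-in data. This is an illustration of the idea rather than the package's internal code; the object names are made up, and only the two missing-data edits from the setup are recreated.

```r
# Recreate the two missing-data edits on a copy of the built-in data.
dat <- datasets::iris
dat[149, "Species"] <- NA    # column variable missing
dat[150, 1:4] <- NA          # measurements missing, species kept

# Missing sepal lengths per species; tapply() drops rows whose species is NA.
missing_by_species <- tapply(is.na(dat$Sepal.Length), dat$Species, sum)
missing_by_species

# Missingness in the column variable itself has to be checked separately.
missing_species <- sum(is.na(dat$Species))
missing_species
```

The per-species counts correspond to the Missing row, while the separate check on `Species` shows the kind of missingness that a row summary split by species cannot report.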
Finally, suppose we want to look at the differences in means across all
species. The function `num_diff` for the `comparison` argument will
calculate the mean difference in sepal length for each column compared to
a reference category, which is coded as the first column category in the
table. Here, versicolor and virginica will be compared to setosa. The
function also provides a 95% confidence interval to accompany the mean
difference. Currently, `num_diff` is the only built-in comparison function
for `num_row`.
tbl1 <- tbl_start(data=iris, col_var="Species", missing=FALSE, overall=TRUE, comparison=FALSE) %>%
  num_row(row_var="Sepal.Length", rowlabel="Sepal Length", comparison=num_diff) %>%
  tbl_out()
tbl1 %>%
  tangram_styling() %>%
  kable(escape=F, align="l") %>%
  trimws %>%
  kable_styling(c("striped","bordered"))
| Variable | Measure | setosa | versicolor | virginica | Overall | Test | setosa vs. versicolor | setosa vs. virginica | Compare: All Groups |
|---|---|---|---|---|---|---|---|---|---|
| Sepal Length | min | 4.30 | 4.90 | 4.90 | 4.30 | Difference in Means | -0.93 (-1.11, -0.75) | -1.60 (-1.81, -1.40) | p ≤ 0.001 |
| | Q1 | 4.80 | 5.60 | 6.30 | 5.10 | | | | |
| | median | 5.00 | 5.90 | 6.50 | 5.80 | | | | |
| | Q3 | 5.20 | 6.30 | 6.95 | 6.40 | | | | |
| | max | 5.80 | 7.00 | 7.90 | 7.90 | | | | |
| | mean | 5.01 | 5.94 | 6.61 | 5.84 | | | | |
| | SD | 0.35 | 0.52 | 0.64 | 0.83 | | | | |
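For readers who want to see where a difference in means and its interval come from, the setosa versus versicolor comparison can be sketched with base R's `t.test`. Whether `num_diff` uses a Welch or a pooled-variance interval is not stated here, so the Welch default of `t.test` is an assumption; with these two groups, both give the same interval after rounding to two decimals.

```r
dat <- datasets::iris
setosa     <- dat$Sepal.Length[dat$Species == "setosa"]
versicolor <- dat$Sepal.Length[dat$Species == "versicolor"]

mean_diff <- mean(setosa) - mean(versicolor)   # reference category minus comparison
ci <- t.test(setosa, versicolor)$conf.int      # Welch 95% interval (an assumption)

diff_summary <- round(c(difference = mean_diff, lower = ci[1], upper = ci[2]), 2)
diff_summary
```

Neither species' sepal lengths were altered in the setup, so the result can be compared with the "setosa vs. versicolor" cell of the table above.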
Now, we will look at adding categorical variables. The function `cat_row`
reads in data that is categorical in form, and by default calculates the
number of instances for each row category within each column category, as
well as the column-wise proportions. The default summary function is
`cat_default`, but the user may write their own function to calculate
different summary statistics from what is shown here.
Currently, there are four summary functions available for use within
`cat_row`. The default summary to use for each row can be specified in
`tbl_start`, or determined using the `summary` argument of each row.
- `cat_default`: Calculates the cell counts and column-wise proportions
- `cat_pct`: Calculates the cell counts and column-wise percentages
- `cat_count`: Calculates the cell counts
- `cat_jama`: Calculates the column-wise percentages and cell counts divided by column totals. This is the style used by the Journal of the American Medical Association.
We will demonstrate this function by looking at `Stem.Size` in the `iris`
dataset. Note that `cat_row` and `num_row` have nearly identical
arguments, but `cat_row` allows you to choose the number of spaces to
indent category names using the `indent` argument. The default setting is
5 spaces.
tbl1 <- tbl_start(data=iris, col_var="Species", missing=FALSE, overall=TRUE, comparison=FALSE) %>%
  cat_row("Stem.Size", rowlabel="Stem Size") %>%
  tbl_out()
tbl1 %>%
  tangram_styling() %>%
  kable(escape=F, align="l") %>%
  trimws %>%
  kable_styling(c("striped","bordered"))
| Variable | Measure | setosa | versicolor | virginica | Overall |
|---|---|---|---|---|---|
| Stem Size | Col. Prop. (N) | | | | |
| Large | | 0.34 (17) | 0.24 (12) | 0.25 (12) | 0.28 (41) |
| Medium | | 0.44 (22) | 0.48 (24) | 0.50 (24) | 0.47 (70) |
| Small | | 0.22 (11) | 0.28 (14) | 0.25 (12) | 0.25 (37) |
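The "Col. Prop. (N)" cells pair a column-wise proportion with a cell count. A minimal base R sketch of that calculation follows; the `toy` data frame is hypothetical (the vignette's stem sizes are randomly generated and cannot be reproduced exactly), and the code does not use tangram.pipe.

```r
# Hypothetical toy data, not the vignette's randomly generated columns.
toy <- data.frame(
  group = c("a", "a", "a", "a", "b", "b", "b", "b", "b"),
  size  = c("Small", "Small", "Large", "Medium", "Large", "Large", "Medium", "Small", "Large")
)

counts <- table(toy$size, toy$group)       # rows = categories, columns = groups
props  <- prop.table(counts, margin = 2)   # column-wise: each column sums to 1

# Combine both into "proportion (count)" cells.
cells <- matrix(sprintf("%.2f (%d)", as.vector(props), as.vector(counts)),
                nrow = nrow(counts), dimnames = dimnames(counts))
cells
```

Using `margin = 2` is what makes the proportions column-wise: each group's cells are divided by that group's total rather than by the grand total.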
Setting `missing=TRUE` will reveal the proportion of each species that
does not have a corresponding entry for stem size. When missing data is
accounted for, the missingness will be recorded as the percentage of each
column that is designated as missing data.
tbl1 <- tbl_start(data=iris, col_var="Species", missing=FALSE, overall=TRUE, comparison=FALSE) %>%
  cat_row("Stem.Size", rowlabel="Stem Size", missing=TRUE) %>%
  tbl_out()
tbl1 %>%
  tangram_styling() %>%
  kable(escape=F, align="l") %>%
  trimws %>%
  kable_styling(c("striped","bordered"))
| Variable | Measure | setosa | versicolor | virginica | Overall |
|---|---|---|---|---|---|
| Stem Size | Col. Prop. (N) | | | | |
| Large | | 0.34 (17) | 0.24 (12) | 0.24 (12) | 0.28 (41) |
| Medium | | 0.44 (22) | 0.48 (24) | 0.49 (24) | 0.47 (70) |
| Small | | 0.22 (11) | 0.28 (14) | 0.24 (12) | 0.25 (37) |
| Missing | | 0.00 (0) | 0.00 (0) | 0.02 (1) | 0.01 (1) |
We can also sort a categorical row in ascending or descending order by
category counts for a specified column. The `ordering` argument will sort
the row variable, and `sortcol` specifies which column we would like to
sort our row by. Permissible arguments for `ordering` are
`c("ascending", "descending")`; by default, the row function will sort by
the overall cell counts unless a valid column category name is passed to
`sortcol`. If an invalid category name is used, the row function will sort
by the overall cell counts instead.
tbl1 <- tbl_start(data=iris, col_var="Species", missing=FALSE, overall=TRUE, comparison=FALSE) %>%
  cat_row("Stem.Size", rowlabel="Stem Size (Ascending by versicolor)", missing=TRUE,
          ordering = "ascending", sortcol = "versicolor") %>%
  cat_row("Stem.Size", rowlabel="Stem Size (Descending by overall count)", missing=TRUE,
          ordering = "descending") %>%
  tbl_out()
tbl1 %>%
  tangram_styling() %>%
  kable(escape=F, align="l") %>%
  trimws %>%
  kable_styling(c("striped","bordered"))
| Variable | Measure | setosa | versicolor | virginica | Overall |
|---|---|---|---|---|---|
| Stem Size (Ascending by versicolor) | Col. Prop. (N) | | | | |
| Large | | 0.34 (17) | 0.24 (12) | 0.24 (12) | 0.28 (41) |
| Small | | 0.22 (11) | 0.28 (14) | 0.24 (12) | 0.25 (37) |
| Medium | | 0.44 (22) | 0.48 (24) | 0.49 (24) | 0.47 (70) |
| Missing | | 0.00 (0) | 0.00 (0) | 0.02 (1) | 0.01 (1) |
| Stem Size (Descending by overall count) | Col. Prop. (N) | | | | |
| Medium | | 0.44 (22) | 0.48 (24) | 0.49 (24) | 0.47 (70) |
| Large | | 0.34 (17) | 0.24 (12) | 0.24 (12) | 0.28 (41) |
| Small | | 0.22 (11) | 0.28 (14) | 0.24 (12) | 0.25 (37) |
| Missing | | 0.00 (0) | 0.00 (0) | 0.02 (1) | 0.01 (1) |
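The two orderings in the table above can be reproduced by sorting the category counts directly. The sketch below uses the versicolor and overall counts read off the table; it illustrates the sorting rule in base R and is not the package's own implementation.

```r
# Counts (Large, Medium, Small) read off the table above.
versicolor <- c(Large = 12, Medium = 24, Small = 14)
overall    <- c(Large = 41, Medium = 70, Small = 37)

asc_by_versicolor <- names(sort(versicolor))                  # ascending by one column
desc_by_overall   <- names(sort(overall, decreasing = TRUE))  # descending by overall

asc_by_versicolor
desc_by_overall
```

In the table, the Missing entry stays at the bottom of each block under either ordering, so it is left out of the sort here.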
Finally, let’s look at a comparison test for a categorical row. The
default comparison function is `cat_comp_default`, which will calculate
the relative entropy between each column and the reference category, as
well as conduct a Chi-Square Goodness of Fit test on the data present.
Currently, `cat_comp_default` is the only built-in function for
categorical data, but a user may write their own function to use instead.
tbl1 <- tbl_start(data=iris, col_var="Species", missing=FALSE, overall=TRUE, comparison=FALSE) %>%
  cat_row("Stem.Size", rowlabel="Stem Size", comparison=cat_comp_default) %>%
  tbl_out()
tbl1 %>%
  tangram_styling() %>%
  kable(escape=F, align="l") %>%
  trimws %>%
  kable_styling(c("striped","bordered"))
| Variable | Measure | setosa | versicolor | virginica | Overall | Test | setosa vs. versicolor | setosa vs. virginica | Compare: All Groups |
|---|---|---|---|---|---|---|---|---|---|
| Stem Size | Col. Prop. (N) | | | | | Relative Entropy | 0.03 | 0.02 | p = 0.80 |
| Large | | 0.34 (17) | 0.24 (12) | 0.25 (12) | 0.28 (41) | | | | |
| Medium | | 0.44 (22) | 0.48 (24) | 0.50 (24) | 0.47 (70) | | | | |
| Small | | 0.22 (11) | 0.28 (14) | 0.25 (12) | 0.25 (37) | | | | |
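Relative entropy (Kullback-Leibler divergence) between two columns can be sketched from the counts in the table above. The exact definition used by `cat_comp_default` (log base, and which column plays the reference role) is not spelled out here, so the natural-log form below, with setosa as the first argument, is an assumption; the helper name `relative_entropy` is made up for illustration.

```r
# Kullback-Leibler divergence of q from p (natural log); assumes no zero cells.
relative_entropy <- function(p, q) sum(p * log(p / q))

# Stem-size counts (Large, Medium, Small) read off the table above.
counts <- cbind(setosa     = c(17, 22, 11),
                versicolor = c(12, 24, 14),
                virginica  = c(12, 24, 12))
props <- prop.table(counts, margin = 2)    # column-wise proportions

re_versicolor <- relative_entropy(props[, "setosa"], props[, "versicolor"])
re_virginica  <- relative_entropy(props[, "setosa"], props[, "virginica"])
re_rounded <- round(c(versicolor = re_versicolor, virginica = re_virginica), 2)
re_rounded
```

Under these assumptions the rounded values agree with the Relative Entropy cells shown above, which suggests the sketch is close to what the package reports, though it is not a substitute for the package's own function.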
The final type of data we will examine here is binary data; this is when a
variable can only take on two possible values. In a table, it can be
helpful to only include one of the options if the second entry can be
deduced from looking at the first. This is done using the `binary_row`
function. The default summary function is `binary_default`, but the user
may write their own function to calculate different summary statistics
from what is shown here.
Currently, there are four summary functions available for use within
`binary_row`. The default summary to use for each row can be specified in
`tbl_start`, or determined using the `summary` argument of each row.
- `binary_default`: Calculates the cell counts and column-wise proportions
- `binary_pct`: Calculates the cell counts and column-wise percentages
- `binary_count`: Calculates the cell counts
- `binary_jama`: Calculates the column-wise percentages and cell counts divided by column totals. This is the style used by the Journal of the American Medical Association.
Note that a user may use `cat_row` to process binary data if they wish to
see both row entries included in the table.
We will now demonstrate the use of `binary_row` on the color variable in
`iris`. In the dataset, the available colors are blue and purple, so we do
not wish to include both entries here.
tbl1 <- tbl_start(data=iris, col_var="Species", missing=FALSE, overall=TRUE, comparison=FALSE) %>%
  binary_row("color", rowlabel = "Color") %>%
  tbl_out()
tbl1 %>%
  tangram_styling() %>%
  kable(escape=F, align="l") %>%
  trimws %>%
  kable_styling(c("striped","bordered"))
Variable | Measure | setosa | versicolor | virginica | Overall |
---|---|---|---|---|---|
Color: Blue | Col. Prop. (N) | 0.36 (18) | 0.56 (28) | 0.60 (29) | 0.51 (75) |
The `binary_row` function includes all of the same arguments as the
previous row functions, but additionally includes three new arguments.
`reference` allows a user to choose which group will appear on the table.
By default, the alphabetically first row group will appear on the table,
which is why ‘Blue’ appeared above. If we want to see the statistics for
purple flowers, we can run the following code.
tbl1 <- tbl_start(data=iris, col_var="Species", missing=FALSE, overall=TRUE, comparison=FALSE) %>%
  binary_row("color", rowlabel = "Color", reference="Purple") %>%
  tbl_out()
tbl1 %>%
  tangram_styling() %>%
  kable(escape=F, align="l") %>%
  trimws %>%
  kable_styling(c("striped","bordered"))
Variable | Measure | setosa | versicolor | virginica | Overall |
---|---|---|---|---|---|
Color: Purple | Col. Prop. (N) | 0.64 (32) | 0.44 (22) | 0.40 (19) | 0.49 (73) |
Notice in the previous examples that the binary row is contained entirely
within one row. This is because many tables in professional journals will
often abbreviate binary data to fit within a single row of data. If you do
not wish to do this within your table, you can set the additional argument
`compact` to be `FALSE` and display the row information in more than one
row.
tbl1 <- tbl_start(data=iris, col_var="Species", missing=FALSE, overall=TRUE, comparison=FALSE) %>%
  binary_row("color", rowlabel = "Color", compact = FALSE) %>%
  tbl_out()
tbl1 %>%
  tangram_styling() %>%
  kable(escape=F, align="l") %>%
  trimws %>%
  kable_styling(c("striped","bordered"))
| Variable | Measure | setosa | versicolor | virginica | Overall |
|---|---|---|---|---|---|
| Color | Col. Prop. (N) | | | | |
| Blue | | 0.36 (18) | 0.56 (28) | 0.60 (29) | 0.51 (75) |
As of package version 1.1.2, the user can now choose to remove the
reference group label from the table if they do not want it to be present.
The argument `ref.label` allows a user to toggle the name of the reference
group in the table; by default, this is set to `on`, but a user can input
`off` to remove it.
Finally, let’s look at some comparison functions used for binary data. By
default, this row function will calculate the difference in proportions by
using `binary_diff` if `comparison=TRUE` was set during initialization.
This will calculate differences in proportions across columns; the
calculations will also include 95% confidence intervals.
tbl1 <- tbl_start(data=iris, col_var="Species", missing=FALSE, overall=TRUE, comparison=FALSE) %>%
  binary_row("color", comparison=binary_diff) %>%
  tbl_out()
tbl1 %>%
  tangram_styling() %>%
  kable(escape=F, align="l") %>%
  trimws %>%
  kable_styling(c("striped","bordered"))
Variable | Measure | setosa | versicolor | virginica | Overall | Test | setosa vs. versicolor | setosa vs. virginica | Compare: All Groups |
---|---|---|---|---|---|---|---|---|---|
color: Blue | Col. Prop. (N) | 0.36 (18) | 0.56 (28) | 0.60 (29) | 0.51 (75) | Difference in Proportions | -0.20 (-0.41, 0.01) | -0.23 (-0.44, -0.02) | p = 0.04 |
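The setosa versus versicolor difference in proportions can be sketched with base R's `prop.test`, using the blue-flower counts shown in the table above. That `binary_diff` relies on this exact method is an assumption; the sketch only shows one standard way to obtain a difference with a 95% interval, and `prop.test` applies a continuity correction by default.

```r
blue  <- c(setosa = 18, versicolor = 28)   # blue flowers, from the table above
total <- c(setosa = 50, versicolor = 50)   # flowers per species

fit <- prop.test(blue, total)              # continuity correction is on by default
prop_diff <- unname(fit$estimate[1] - fit$estimate[2])

prop_summary <- round(c(difference = prop_diff,
                        lower = fit$conf.int[1],
                        upper = fit$conf.int[2]), 2)
prop_summary
```

With these counts the rounded result matches the "setosa vs. versicolor" cell above, which is consistent with, but does not prove, the package using the same calculation.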
The package has two additional options for comparison tests using binary
data. Odds ratios can be calculated using `binary_or`, and risk ratios can
be calculated with `binary_rr`. Note that if `comparison=TRUE` is
initialized in `tbl_start` and a user wants to use an odds ratio or risk
ratio here, `comparison` must be set to either of those two options in
this row addition, since excluding the argument will lead to `binary_diff`
being called by default.
tbl1 <- tbl_start(data=iris, col_var="Species", missing=FALSE, overall=TRUE, comparison=FALSE) %>%
  binary_row("color", comparison=binary_or) %>%
  tbl_out()
tbl1 %>%
  tangram_styling() %>%
  kable(escape=F, align="l") %>%
  trimws %>%
  kable_styling(c("striped","bordered"))
Variable | Measure | setosa | versicolor | virginica | Overall | Test | setosa vs. versicolor | setosa vs. virginica | Compare: All Groups |
---|---|---|---|---|---|---|---|---|---|
color: Blue | Col. Prop. (N) | 0.36 (18) | 0.56 (28) | 0.60 (29) | 0.51 (75) | Odds Ratio | 0.44 (0.20, 0.99) | 0.37 (0.16, 0.83) | p = 0.04 |
tbl1 <- tbl_start(data=iris, col_var="Species", missing=FALSE, overall=TRUE, comparison=FALSE) %>%
  binary_row("color", comparison=binary_rr) %>%
  tbl_out()
tbl1 %>%
  tangram_styling() %>%
  kable(escape=F, align="l") %>%
  trimws %>%
  kable_styling(c("striped","bordered"))
Variable | Measure | setosa | versicolor | virginica | Overall | Test | setosa vs. versicolor | setosa vs. virginica | Compare: All Groups |
---|---|---|---|---|---|---|---|---|---|
color: Blue | Col. Prop. (N) | 0.36 (18) | 0.56 (28) | 0.60 (29) | 0.51 (75) | Risk Ratio | 0.66 (0.33, 1.33) | 0.61 (0.30, 1.23) | p = 0.04 |
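To make the two ratio measures concrete, here is a base R sketch of the textbook odds ratio (with a Wald interval on the log scale) and the textbook risk ratio for setosa versus versicolor, using the blue and non-blue counts implied by the tables above. The interval method is an assumption, and the package's own estimators, in particular for the risk ratio, may be defined differently, so no claim is made that every value below matches the package output.

```r
blue  <- c(setosa = 18, versicolor = 28)   # blue flowers per species
other <- c(setosa = 32, versicolor = 22)   # non-blue flowers per species

# Odds ratio: odds of blue in setosa relative to versicolor.
odds <- blue / other
odds_ratio <- unname(odds["setosa"] / odds["versicolor"])

# Wald 95% interval on the log-odds scale (an assumption about the method).
se_log_or <- sqrt(sum(1 / c(blue, other)))
or_ci <- exp(log(odds_ratio) + c(-1, 1) * qnorm(0.975) * se_log_or)

# Risk ratio: proportion of blue in setosa relative to versicolor.
risk <- blue / (blue + other)
risk_ratio <- unname(risk["setosa"] / risk["versicolor"])

or_summary <- round(c(OR = odds_ratio, lower = or_ci[1], upper = or_ci[2]), 2)
or_summary
```

The odds ratio and its interval computed this way agree with the odds ratio row shown earlier for setosa versus versicolor.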
The `n_row` function will count the number of rows in your dataset, as
well as the total instances of each column variable. Note that you can
decide whether or not you want this function to include the missing data
as part of your row count. For the example below we will not include rows
from missing data.
tbl1 <- tbl_start(data=iris, col_var="Species", missing=FALSE, overall=TRUE, comparison=FALSE) %>%
  n_row() %>%
  tbl_out()
tbl1 %>%
  tangram_styling() %>%
  kable(escape=F, align="l") %>%
  trimws %>%
  kable_styling(c("striped","bordered"))
| Variable | Measure | setosa | versicolor | virginica | Overall |
|---|---|---|---|---|---|
| N | | 50 | 50 | 49 | 149 |
The `empty_row` function will add a blank row to the final table. This is
useful if a user wants to include blank space between some of the table’s
rows. The user only needs to specify the name of the list object in order
to create the blank row. An optional argument is a header to include,
should the user want to create a label for the subsequent rows that follow
in the table.
tbl1 <- tbl1 %>% empty_row()
The following code will generate a finalized table for the `iris` dataset.
It will include all four numeric variables (sepal length, sepal width,
petal length, petal width), as well as stem size and color. The final
table itself is generated using `tbl_out`. Below is an example of a
customized table report that can be produced using tangram.pipe.
Annotations for the unique elements of the rows are created by inserting
the comments into the header argument for the `empty_row()` command.
tbl1 <- tbl_start(
  data = iris,
  col_var = "Species",
  missing=FALSE,
  overall=TRUE,
  comparison=TRUE,
  default_num_summary = num_default,
  default_cat_summary = cat_pct,
  default_binary_summary = binary_default) %>%
  n_row() %>%
  num_row("Sepal.Length", rowlabel="Sepal Length") %>%
  empty_row('<i>No rowlabel, 3 decimal places</i>') %>%
  num_row("Sepal.Width", digits=3) %>%
  empty_row("<i>No comparison test used, Min-Max summary</i>") %>%
  num_row("Petal.Length", rowlabel="Petal Length", summary = num_minmax, comparison=FALSE) %>%
  empty_row("<i>Missing data considered, mean/Std. Dev summary</i>") %>%
  num_row("Petal.Width", rowlabel="Petal Width", summary = num_mean_sd, missing=TRUE) %>%
  cat_row("Stem.Size", rowlabel="Stem Size", missing=TRUE) %>%
  empty_row("<i>No rowlabels, indent 3 spaces, odds ratio as test</i>") %>%
  binary_row("color", missing = TRUE, comparison=binary_or, indent=3) %>%
  tbl_out()
tbl1 %>%
  tangram_styling() %>%
  kable(escape=F, align="l") %>%
  trimws %>%
  kable_styling(c("striped","bordered"))
| Variable | Measure | setosa | versicolor | virginica | Overall | Test | setosa vs. versicolor | setosa vs. virginica | Compare: All Groups |
|---|---|---|---|---|---|---|---|---|---|
| N | | 50 | 50 | 49 | 149 | | | | |
| Sepal Length | min | 4.30 | 4.90 | 4.90 | 4.30 | Difference in Means | -0.93 (-1.11, -0.75) | -1.60 (-1.81, -1.40) | p ≤ 0.001 |
| | Q1 | 4.80 | 5.60 | 6.30 | 5.10 | | | | |
| | median | 5.00 | 5.90 | 6.50 | 5.80 | | | | |
| | Q3 | 5.20 | 6.30 | 6.95 | 6.40 | | | | |
| | max | 5.80 | 7.00 | 7.90 | 7.90 | | | | |
| | mean | 5.01 | 5.94 | 6.61 | 5.84 | | | | |
| | SD | 0.35 | 0.52 | 0.64 | 0.83 | | | | |
| No rowlabel, 3 decimal places | | | | | | | | | |
| Sepal.Width | min | 2.300 | 2.000 | 2.200 | 2.000 | Difference in Means | 0.658 (0.520, 0.796) | 0.463 (0.322, 0.605) | p ≤ 0.001 |
| | Q1 | 3.200 | 2.525 | 2.800 | 2.800 | | | | |
| | median | 3.400 | 2.800 | 3.000 | 3.000 | | | | |
| | Q3 | 3.675 | 3.000 | 3.125 | 3.300 | | | | |
| | max | 4.400 | 3.400 | 3.800 | 4.400 | | | | |
| | mean | 3.428 | 2.770 | 2.965 | 3.055 | | | | |
| | SD | 0.379 | 0.314 | 0.323 | 0.438 | | | | |
| No comparison test used, Min-Max summary | | | | | | | | | |
| Petal Length | Min – Max | 1.00–1.90 | 3.00–5.10 | 4.50–6.90 | 1.00–6.90 | | | | |
| Missing data considered, mean/Std. Dev summary | | | | | | | | | |
| Petal Width | Mean (Std. Dev.) | 0.25 (0.11) | 1.33 (0.20) | 2.02 (0.28) | 1.19 (0.76) | Difference in Means | -1.08 (-1.14, -1.02) | -1.78 (-1.86, -1.69) | p ≤ 0.001 |
| | Missing | 0 | 0 | 1 | 1 | | | | |
| Stem Size | Col. Pct. (N) | | | | | Relative Entropy | 0.03 | 0.02 | p = 0.80 |
| Large | | 34.00% (17) | 24.00% (12) | 24.49% (12) | 27.52% (41) | | | | |
| Medium | | 44.00% (22) | 48.00% (24) | 48.98% (24) | 46.98% (70) | | | | |
| Small | | 22.00% (11) | 28.00% (14) | 24.49% (12) | 24.83% (37) | | | | |
| Missing | | 0.00% (0) | 0.00% (0) | 2.04% (1) | 0.67% (1) | | | | |
| No rowlabels, indent 3 spaces, odds ratio as test | | | | | | | | | |
| color: Blue | Col. Prop. (N) | 0.36 (18) | 0.56 (28) | 0.59 (29) | 0.50 (75) | Odds Ratio | 0.44 (0.20, 0.99) | 0.37 (0.16, 0.83) | p = 0.04 |
| Missing | | 0.00 (0) | 0.00 (0) | 0.02 (1) | 0.01 (1) | | | | |
The package can handle cases where a user only wants a single summary
column of data. In the `iris` dataset, if we set the column variable to be
`NULL` in `tbl_start`, we can obtain just one summary column for the
dataset without breaking the table up by columns. Note that comparison
functions will not run here, even if the `comparison` argument is set to
`TRUE`.
tbl1 <- tbl_start(iris, NULL, missing=FALSE, overall=TRUE, comparison=FALSE) %>%
  n_row() %>%
  num_row("Sepal.Length", rowlabel="Sepal Length") %>%
  tbl_out()
tbl1 %>%
  tangram_styling() %>%
  kable(escape=F, align="l") %>%
  trimws %>%
  kable_styling(c("striped","bordered"))
| Variable | Measure | Overall |
|---|---|---|
| N | | 150 |
| Sepal Length | min | 4.30 |
| | Q1 | 5.10 |
| | median | 5.80 |
| | Q3 | 6.40 |
| | max | 7.90 |
| | mean | 5.84 |
| | SD | 0.83 |
This package allows for an individual row to use a different dataset from
the one initialized in `tbl_start`. Use the `newdata` argument to specify
the new dataset to use, then define the rows and columns for the new data.
Note that if a new row is added after the row with the differing dataset,
the new row will automatically return to using the initialized dataset
from `tbl_start` unless the user specifies otherwise in `newdata`.
For this example, we will split the `iris` dataset so that the sepal and
petal variables are in separate datasets, and show that the `newdata`
argument can allow the information from both datasets to be combined in
one table.
sepaldat <- iris %>% select(-c(Petal.Length, Petal.Width))
petaldat <- iris %>% select(-c(Sepal.Length, Sepal.Width))
tbl1 <- tbl_start(sepaldat, "Species", missing=FALSE, overall=TRUE, comparison=FALSE) %>%
  num_row("Sepal.Length", rowlabel="Sepal Length") %>%
  num_row("Sepal.Width", rowlabel="Sepal Width") %>%
  empty_row(header="Switch to Petal Dataset") %>%
  num_row(row_var="Petal.Length", col_var="Species", newdata=petaldat, rowlabel="Petal Length") %>%
  num_row(row_var="Petal.Width", col_var="Species", newdata=petaldat, rowlabel="Petal Width") %>%
  tbl_out()
tbl1 %>%
  tangram_styling() %>%
  kable(escape=F, align="l") %>%
  trimws %>%
  kable_styling(c("striped","bordered"))
| Variable | Measure | setosa | versicolor | virginica | Overall |
|---|---|---|---|---|---|
| Sepal Length | min | 4.30 | 4.90 | 4.90 | 4.30 |
| | Q1 | 4.80 | 5.60 | 6.30 | 5.10 |
| | median | 5.00 | 5.90 | 6.50 | 5.80 |
| | Q3 | 5.20 | 6.30 | 6.95 | 6.40 |
| | max | 5.80 | 7.00 | 7.90 | 7.90 |
| | mean | 5.01 | 5.94 | 6.61 | 5.84 |
| | SD | 0.35 | 0.52 | 0.64 | 0.83 |
| Sepal Width | min | 2.30 | 2.00 | 2.20 | 2.00 |
| | Q1 | 3.20 | 2.52 | 2.80 | 2.80 |
| | median | 3.40 | 2.80 | 3.00 | 3.00 |
| | Q3 | 3.68 | 3.00 | 3.12 | 3.30 |
| | max | 4.40 | 3.40 | 3.80 | 4.40 |
| | mean | 3.43 | 2.77 | 2.96 | 3.06 |
| | SD | 0.38 | 0.31 | 0.32 | 0.44 |
| Switch to Petal Dataset | | | | | |
| Petal Length | min | 1.00 | 3.00 | 4.50 | 1.00 |
| | Q1 | 1.40 | 4.00 | 5.10 | 1.58 |
| | median | 1.50 | 4.35 | 5.60 | 4.30 |
| | Q3 | 1.58 | 4.60 | 5.90 | 5.10 |
| | max | 1.90 | 5.10 | 6.90 | 6.90 |
| | mean | 1.46 | 4.26 | 5.56 | 3.74 |
| | SD | 0.17 | 0.47 | 0.56 | 1.77 |
| Petal Width | min | 0.10 | 1.00 | 1.40 | 0.10 |
| | Q1 | 0.20 | 1.20 | 1.80 | 0.30 |
| | median | 0.20 | 1.30 | 2.00 | 1.30 |
| | Q3 | 0.30 | 1.50 | 2.30 | 1.80 |
| | max | 0.60 | 1.80 | 2.50 | 2.50 |
| | mean | 0.25 | 1.33 | 2.02 | 1.19 |
| | SD | 0.11 | 0.20 | 0.28 | 0.76 |
Notice that in this example, the column variable for `sepaldat` was the
same as that for `petaldat`. If the columns used had differed between the
datasets, all columns would be included in the table, but only columns
corresponding to the data used in the rows would have values filled in.
A common usage for the `newdata` argument is when you want to make a table
which combines summary statistics for subsets of data. Suppose we were to
display the sepal measures for the entire dataset, then show these same
measurements for two subsets of data which are determined by the petal
length. Here, we divide the dataset into two subsets: petal length > 4.3
and petal length <= 4.3.
petal.small <- iris %>% filter(Petal.Length <= 4.3)
petal.large <- iris %>% filter(Petal.Length > 4.3)
tbl1 <- tbl_start(iris, "Species", missing=FALSE, overall=TRUE, comparison=FALSE) %>%
  empty_row(header="All Data") %>%
  n_row() %>%
  num_row("Sepal.Length", rowlabel=" Sepal Length") %>%
  num_row("Sepal.Width", rowlabel=" Sepal Width") %>%
  empty_row(header="Petal Length less than 4.3") %>%
  n_row(newdata=petal.small) %>%
  num_row("Sepal.Length", rowlabel=" Sepal Length", col_var="Species", newdata=petal.small) %>%
  num_row("Sepal.Width", rowlabel=" Sepal Width", col_var="Species", newdata=petal.small) %>%
  empty_row(header="Petal Length greater than 4.3") %>%
  n_row(newdata=petal.large) %>%
  num_row("Sepal.Length", rowlabel=" Sepal Length", col_var="Species", newdata=petal.large) %>%
  num_row("Sepal.Width", rowlabel=" Sepal Width", col_var="Species", newdata=petal.large) %>%
  tbl_out()
tbl1 %>%
  tangram_styling() %>%
  kable(escape=F, align="l") %>%
  trimws %>%
  kable_styling(c("striped","bordered"))
| Variable | Measure | setosa | versicolor | virginica | Overall |
|---|---|---|---|---|---|
| All Data | | | | | |
| N | | 50 | 50 | 49 | 149 |
| Sepal Length | min | 4.30 | 4.90 | 4.90 | 4.30 |
| | Q1 | 4.80 | 5.60 | 6.30 | 5.10 |
| | median | 5.00 | 5.90 | 6.50 | 5.80 |
| | Q3 | 5.20 | 6.30 | 6.95 | 6.40 |
| | max | 5.80 | 7.00 | 7.90 | 7.90 |
| | mean | 5.01 | 5.94 | 6.61 | 5.84 |
| | SD | 0.35 | 0.52 | 0.64 | 0.83 |
| Sepal Width | min | 2.30 | 2.00 | 2.20 | 2.00 |
| | Q1 | 3.20 | 2.52 | 2.80 | 2.80 |
| | median | 3.40 | 2.80 | 3.00 | 3.00 |
| | Q3 | 3.68 | 3.00 | 3.12 | 3.30 |
| | max | 4.40 | 3.40 | 3.80 | 4.40 |
| | mean | 3.43 | 2.77 | 2.96 | 3.06 |
| | SD | 0.38 | 0.31 | 0.32 | 0.44 |
| Petal Length less than 4.3 | | | | | |
| N | | 50 | 25 | 0 | 75 |
| Sepal Length | min | 4.30 | 4.90 | | 4.30 |
| | Q1 | 4.80 | 5.50 | | 4.90 |
| | median | 5.00 | 5.60 | | 5.10 |
| | Q3 | 5.20 | 5.80 | | 5.55 |
| | max | 5.80 | 6.40 | | 6.40 |
| | mean | 5.01 | 5.62 | | 5.21 |
| | SD | 0.35 | 0.37 | | 0.46 |
| Sepal Width | min | 2.30 | 2.00 | | 2.00 |
| | Q1 | 3.20 | 2.40 | | 2.85 |
| | median | 3.40 | 2.70 | | 3.20 |
| | Q3 | 3.68 | 2.90 | | 3.50 |
| | max | 4.40 | 3.00 | | 4.40 |
| | mean | 3.43 | 2.63 | | 3.16 |
| | SD | 0.38 | 0.27 | | 0.51 |
| Petal Length greater than 4.3 | | | | | |
| N | | 0 | 25 | 48 | 73 |
| Sepal Length | min | | 5.40 | 4.90 | 4.90 |
| | Q1 | | 6.00 | 6.30 | 6.10 |
| | median | | 6.30 | 6.50 | 6.40 |
| | Q3 | | 6.60 | 6.95 | 6.80 |
| | max | | 7.00 | 7.90 | 7.90 |
| | mean | | 6.26 | 6.61 | 6.49 |
| | SD | | 0.44 | 0.64 | 0.60 |
| Sepal Width | min | | 2.20 | 2.20 | 2.20 |
| | Q1 | | 2.80 | 2.80 | 2.80 |
| | median | | 3.00 | 3.00 | 3.00 |
| | Q3 | | 3.10 | 3.12 | 3.10 |
| | max | | 3.40 | 3.80 | 3.80 |
| | mean | | 2.91 | 2.96 | 2.95 |
| | SD | | 0.29 | 0.32 | 0.31 |
The `knitr` and `kableExtra` packages can be used to add styling features to the finished tables. Captions can be added with the `caption` argument, and tables can be rendered in LaTeX format with the `format` argument; both are passed to the `kable` function. `kable_styling` accepts a `font_size` argument that sets how large the table text should be.
tbl1 <- tbl_start(iris, "Species", missing=TRUE, overall=TRUE, comparison=TRUE,
                  default_num_summary = num_minmax,
                  default_cat_summary = cat_pct,
                  default_binary_summary = binary_jama) %>%
  n_row() %>%
  num_row("Sepal.Length", rowlabel="Sepal Length") %>%
  cat_row("Stem.Size", rowlabel="Stem Size") %>%
  binary_row("color", rowlabel="Color") %>%
  tbl_out()
tbl1 %>%
  tangram_styling() %>%
  kable(escape=F, align="l", caption = "Example Summary table", format = "html") %>%
  trimws %>%
  kable_styling(c("striped","bordered"), font_size = 12)
| Variable | Measure | setosa | versicolor | virginica | Overall | Test | setosa vs. versicolor | setosa vs. virginica | Compare: All Groups |
|---|---|---|---|---|---|---|---|---|---|
| N | | 50 | 50 | 49 | 150 | | | | |
| Sepal Length | Min – Max | 4.30–5.80 | 4.90–7.00 | 4.90–7.90 | 4.30–7.90 | Difference in Means | -0.93 (-1.11, -0.75) | -1.60 (-1.81, -1.40) | p ≤ 0.001 |
| | Missing | 0 | 0 | 1 | 1 | | | | |
| Stem Size | Col. Pct. (N) | | | | | Relative Entropy | 0.03 | 0.02 | p = 0.80 |
| | Large | 34.00% (17) | 24.00% (12) | 24.49% (12) | 27.52% (41) | | | | |
| | Medium | 44.00% (22) | 48.00% (24) | 48.98% (24) | 46.98% (70) | | | | |
| | Small | 22.00% (11) | 28.00% (14) | 24.49% (12) | 24.83% (37) | | | | |
| | Missing | 0.00% (0) | 0.00% (0) | 2.04% (1) | 0.67% (1) | | | | |
| Color: Blue | Pct. (n/N) | 36.00 (18/50) | 56.00 (28/50) | 59.18 (29/49) | 50.34 (75/149) | Difference in Proportions | -0.20 (-0.41, 0.01) | -0.23 (-0.44, -0.02) | p = 0.04 |
| | Missing | 0.00 (0/50) | 0.00 (0/50) | 2.04 (1/49) | 0.67 (1/149) | | | | |
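The example above renders HTML. As a minimal sketch of the LaTeX route mentioned earlier, the snippet below passes `format = "latex"` and a caption to `kable`. It uses a small slice of the built-in `iris` data rather than a `tangram.pipe` table, and the object names `small_df` and `latex_tbl` are illustrative only.

```r
library(knitr)

# A small plain data frame stands in for a finished summary table
small_df <- head(iris[, c("Sepal.Length", "Species")], 3)

# format = "latex" asks kable for LaTeX markup instead of HTML;
# caption is emitted as a \caption{} entry in the output
latex_tbl <- kable(small_df, format = "latex",
                   caption = "Example summary table")

# Print the generated LaTeX source
cat(latex_tbl, sep = "\n")
```

The same two arguments should apply when the input is the data frame returned by `tbl_out()`, although HTML-specific steps such as `tangram_styling()` may not be appropriate for LaTeX output.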
One of the key features of this package is the flexibility to supply custom summary and comparison functions, which lets the user create tables in formats not built into `tangram.pipe`. The accompanying vignette “Writing User-Defined Summary Functions” outlines how to write functions that work well with `tangram.pipe`.
The `digits` parameter is now available in `tbl_start` for specifying the default number of digits to use throughout the table.
Added the `ref.label` argument to the binary summary functions, allowing the user to toggle reference group labels in binary rows.
Deprecated the `print.tangram.pipe` function, as the update to `tbl_out` in version 1.1.1 rendered it obsolete.
Fixed a bug in `num_row` where column category names containing spaces would not format correctly.
Changed `binary_row` output to include the `rowlabel` along with the displayed category when `compact = TRUE`.
Fixed a bug in `binary_row` where numeric row category labels would not format correctly when `compact = TRUE`.
Added `ordering` and `sortcol` arguments to `cat_row`.
Edited categorical summary functions to use the sorting arguments.
Added prewritten summary functions `num_date`, `cat_count`, and `binary_count`.
Edited `tbl_out` to output the finalized data frame object (the previous version only appended the final table to the table information list).
Changed the `rowlabels` argument to `rowlabel`.
Options `overall`, `missing`, and `comparison` now have default values in `tbl_start()`.
Only leading white space is converted to HTML form in `tangram_styling`.
Added `n_row()` as a row function for the table.
Added prewritten summary functions `num_minmax`, `num_medianiqr`, `num_mean_sd`, `cat_pct`, `cat_jama`, `binary_pct`, and `binary_jama`.
Added options `default_num_summary`, `default_cat_summary`, and `default_binary_summary` to `tbl_start()`. Default values are set to the default summary functions for each row type.
Changed the `summary` argument within the row functions to automatically use the default specified in `tbl_start()` unless another function is supplied by the user. The default in the function argument has changed from a function to `NULL`.
Summary functions now accept generic arguments through an ellipsis (`...`), but still work the same as before within the row functions.
`binary_row()` now has the option to condense its output to one row (`compact`). The default is `TRUE`.
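Several of the options named in the notes above can be combined in one table. The following is a minimal sketch, not a definitive recipe: it assumes a package version that includes all of these changes, and the objects `flowers` and `tbl2` are hypothetical names introduced here. It sets `digits` in `tbl_start()`, overrides the default numeric summary for a single row through `summary`, and turns off the one-row binary layout with `compact = FALSE`.

```r
library(tangram.pipe)
library(magrittr)  # loaded explicitly for the %>% pipe

# Copy of iris with a made-up binary column, as in this vignette
set.seed(1)
flowers <- iris
flowers$color <- sample(c("Blue", "Purple"),
                        size = nrow(flowers), replace = TRUE)

tbl2 <- tbl_start(flowers, "Species", missing = FALSE, overall = TRUE,
                  comparison = FALSE,
                  digits = 1) %>%            # table-wide default digits
  num_row("Sepal.Length", rowlabel = "Sepal Length",
          summary = num_mean_sd) %>%         # per-row summary override
  binary_row("color", rowlabel = "Color",
             compact = FALSE) %>%            # do not condense to one row
  tbl_out()                                  # finalized data frame

tbl2
```

Because `summary` is `NULL` by default in the row functions, rows that omit it fall back to whatever was set in `tbl_start()`.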