When analyzing multi-omics datasets, the search for features that could serve as biomarkers is an important aspect. Because these biomarkers might be used in clinical settings, e.g. for disease diagnosis, it is extremely important to minimize false positives. One possible error source is confounding variables: the biomarker is not directly linked to the disease but is influenced by a third (confounding) variable that in turn is linked to the disease.
The R package metadeconfoundR
was developed to address
this issue. It first uses univariate statistics to find associations
between omics features and disease status or other metadata. Those associations are then checked for confounding effects from other covariates/metadata using post hoc testing based on nested linear model comparisons, and a status label is returned. The tool is able to handle large-scale multi-omics datasets in reasonable time through parallel processing suitable for high-performance computing clusters. In addition, results can be summarized by a range of plotting functions.
The main (metadeconfoundR::MetaDeconfound()
) analysis is
a two-step process:
First, significant associations between single omics features (like
gut microbial OTUs) and metadata (like disease status, drug
administration, BMI) are identified. Based on the data type of the
respective metadata, either wilcox.test()
(for binary),
cor.test()
(for continuous numerical) or
kruskal.test()
(for neither numerical nor binary) is used.
All three tests are rank-based to minimize assumptions about data
distribution (Fig. 1, left). In addition to collecting p-values for all
computed tests, effect size is measured as Cliff’s Delta and Spearman’s
Rho for binary and continuous data, respectively. Since there is no
suitable effect size metric for categorical data with more than 2
levels, no value is reported here. It is recommended to introduce binary
pseudo-variables for each level of the categorical metadata to partially
circumvent this drawback.
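As a rough illustration of this first, naive step, the following sketch (not the package's internal code; feature and meta_var are made-up toy objects) shows the kind of test dispatch described above, as well as how binary pseudo-variables can be derived from a multi-level categorical meta-variable:

# toy data, for illustration only
feature <- rnorm(100)                                        # one omics feature
meta_var <- sample(c("drugA", "drugB", "none"), 100, replace = TRUE)

if (is.numeric(meta_var) && all(meta_var %in% c(0, 1))) {
  wilcox.test(feature ~ meta_var)                    # binary: effect size would be Cliff's Delta
} else if (is.numeric(meta_var)) {
  cor.test(feature, meta_var, method = "spearman")   # continuous: effect size is Spearman's Rho
} else {
  kruskal.test(feature ~ as.factor(meta_var))        # categorical: no effect size reported
}

# recommended workaround for multi-level categorical meta-variables:
# one binary pseudo-variable per level
pseudo_vars <- as.data.frame(
  sapply(unique(meta_var), function(lvl) as.integer(meta_var == lvl))
)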
In the second step, all hits are checked for confounding effects and a status is reported for each feature-metadata combination (Fig. 1, center and right; Fig. 2). A "hit" here is defined as a feature-metadata association with a sufficiently small FDR-corrected p-value and a sufficiently large effect size, as reported by the first, naive part of the pipeline. Thresholds for both parameters can be set via QCutoff and DCutoff when starting the analysis. Since confounding of a signal can only occur when more than one metadata variable is associated with a certain feature, all features with only one significant metadata association are trivially deconfounded and get the status "No Covariates (OK_nc)".
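Expressed in terms of the reported values (see Tab. 3 below), the hit criterion for a single feature-metadata pair corresponds to the following sketch; the cutoff values used here are assumptions for illustration, see the MetaDeconfound() help page for the actual defaults:

QCutoff <- 0.1    # assumed value, for illustration
DCutoff <- 0.01   # assumed value, for illustration
q_value <- 0.003  # FDR-corrected p-value from the naive testing step
d_value <- -0.14  # Cliff's Delta (binary) or Spearman's Rho (continuous)

# a "hit" needs both a small enough q-value and a large enough absolute effect size
is_hit <- (q_value < QCutoff) && (abs(d_value) > DCutoff)
is_hit
#> [1] TRUE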
The actual confounder detection is done by performing a set of two likelihood ratio tests (LRTs) of nested linear models. For each possible combination of a feature and two of its associated meta-variables, three models are fitted to the rank-transformed feature: a full model containing both meta-variables, and one reduced model for each of the two meta-variables alone.
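A minimal sketch of this comparison for one feature and two associated meta-variables, using lm() and lmtest::lrtest() on toy data (this mirrors the described logic, not the package's exact internal implementation; with randomVar/fixedVar supplied, mixed models, e.g. via lme4, are used instead of plain linear models):

library(lmtest)

# hypothetical toy data for one feature and two associated meta-variables
d <- data.frame(
  feature    = rnorm(100),
  covariate1 = rbinom(100, 1, 0.5),   # e.g. disease status
  covariate2 = rbinom(100, 1, 0.3)    # e.g. drug intake
)

full <- lm(rank(feature) ~ covariate1 + covariate2, data = d)  # both meta-variables
red1 <- lm(rank(feature) ~ covariate1, data = d)               # covariate1 only
red2 <- lm(rank(feature) ~ covariate2, data = d)               # covariate2 only

lrtest(red1, full)  # does adding covariate2 significantly improve the model?
lrtest(red2, full)  # does adding covariate1 significantly improve the model?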
The LRTs reveal whether the inclusion of covariate1 and/or covariate2 significantly improves the performance of the model. Random and/or fixed effects can be added to all models by the user. These additional effects will, however, not be considered in the first naive association testing step of the pipeline.
Importantly, metadeconfoundR
will always
rank-transform the features during analysis.
Minimal input consists of two data.frames holding feature data (Tab. 1) and metadata (Tab. 2), respectively. Both data.frames must have one row per sample (sample names as rownames), with matching order of sample IDs, and one feature/meta-variable per column. The first column of the metadata data.frame must be binary (i.e. numeric and consisting of only 0/1 entries). Usually this is the control/case variable, but any other binary meta-variable will work as well.
|  | MS0001 | MS0006 | MS0007 | MS0008 | MS0012 |
|---|---|---|---|---|---|
| BGI003A | 0 | 42.0 | 4.0 | 153.0 | 126.0 |
| BGI089A | 0 | 155.5 | 34.5 | 360.5 | 116.5 |
| DLF001 | 3 | 67.0 | 6.0 | 443.0 | 40.0 |
| DLF002 | 1 | 58.0 | 18.0 | 175.0 | 181.5 |
| DLF003 | 45 | 43.0 | 0.0 | 66.0 | 74.0 |
| DLF004 | 0 | 41.0 | 1.0 | 206.0 | 37.0 |
|  | Status | Dataset | Metformin | continuous_dummy | altered_dummy |
|---|---|---|---|---|---|
| BGI003A | 0 | CHN | 0 | 0.0617863 | 0.0617863 |
| BGI089A | 0 | CHN | 0 | 0.2059746 | 0.2059746 |
| DLF001 | 1 | CHN | 0 | 0.1765568 | 0.1765568 |
| DLF002 | 1 | CHN | 0 | 0.6870228 | 0.6870228 |
| DLF003 | 1 | CHN | 1 | 0.3841037 | 0.3841037 |
| DLF004 | 1 | CHN | 0 | 0.7698414 | 0.7698414 |
MetaDeconfound() has built-in quality checks for the formatting of the input data, but it is best to verify proper formatting beforehand. Ensure that colnames and rownames do not contain any problematic characters, e.g. by running them through make.names(), and check that the rows of both input data.frames are in the same order.
data(reduced_feature)
data(metaMatMetformin)
# check correct ordering
all(rownames(metaMatMetformin) == rownames(reduced_feature))
#> [1] TRUE
all(order(rownames(metaMatMetformin)) == order(rownames(reduced_feature)))
#> [1] TRUE
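If names do need cleaning, a minimal sketch of the sanitization and of checking the binary first metadata column could look like this (overwriting names in place):

# sanitize row and column names of both data.frames
rownames(reduced_feature)  <- make.names(rownames(reduced_feature))
colnames(reduced_feature)  <- make.names(colnames(reduced_feature))
rownames(metaMatMetformin) <- make.names(rownames(metaMatMetformin))
colnames(metaMatMetformin) <- make.names(colnames(metaMatMetformin))

# the first metadata column must be numeric and contain only 0/1 entries
all(metaMatMetformin[[1]] %in% c(0, 1))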
example_output <- MetaDeconfound(
featureMat = reduced_feature,
metaMat = metaMatMetformin,
returnLong = TRUE,
logLevel = "ERROR"
)
Random and/or fixed effects can be included in the modeling process by supplying the randomVar and/or fixedVar parameters (Fig. 3, right).
RandDataset_output <- MetaDeconfound(
featureMat = reduced_feature,
metaMat = metaMatMetformin,
randomVar = c("Dataset"),
returnLong = TRUE,
logLevel = "ERROR"
)
#> boundary (singular) fit: see help('isSingular')
#> boundary (singular) fit: see help('isSingular')
For a full list of input parameters please refer to the help page.
Output can be returned either as a list of wide-format data.frames (default) or as a single long-format data.frame (Tab. 3). In both cases, raw p-values (Ps), multiple-testing-corrected p-values (Qs), the corresponding effect sizes (Ds), and the confounding status (status) are reported for each possible combination of a feature and a meta-variable.
While Ps, Qs, and Ds are always based solely on the naive association testing, the status label reflects the effects of included random/fixed effects. A naively significant feature-metadata association that is reducible to a random effect can thus have a Q-value < QCutoff and still be labeled as "NS" (not significant).
| feature | metaVariable | Ps | Qs | Ds | status |
|---|---|---|---|---|---|
| MS0001 | Status | 0.0039510 | 0.0103973 | -0.1396941 | AD |
| MS0006 | Status | 0.0000321 | 0.0001147 | 0.2023848 | OK_sd |
| MS0007 | Status | 0.0000001 | 0.0000016 | 0.2425731 | OK_sd |
| MS0008 | Status | 0.7124524 | 0.7740977 | 0.0179492 | NS |
| MS0012 | Status | 0.1847586 | 0.3079311 | 0.0645627 | NS |
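Since returnLong = TRUE was used above, example_output is a single long-format data.frame, so the results can be filtered with standard subsetting, for example to keep only associations that were not labeled "NS" (with the default returnLong = FALSE, the same information is instead spread across a list of wide-format data.frames):

# keep only feature-metadata combinations not labeled "NS"
signif_hits <- subset(example_output, status != "NS")
head(signif_hits)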
Minimal input consists only of an output object from the main MetaDeconfound() function. This returns a ggplot2 heatmap with effect size as tile color, in which black asterisks mark significant, unconfounded associations and grey circles mark confounded associations, based on corrected p-values. The plot is clustered on both axes, and features as well as meta-variables without any associations passing the effect size and significance cutoffs (default: q_cutoff = 0.1, d_cutoff = 0.01) are removed (Fig. 3).
left <- BuildHeatmap(example_output)
right <- BuildHeatmap(RandDataset_output)
grid.arrange(left, right, ncol = 2)
For both this default heatmap and the alternative cuneiform plot (cuneiform = TRUE), a range of customizations is available. In Fig. 4, meta-variables not passing the effect size and significance cutoffs are manually kept in the plot (keepMeta), and the shown range of effect sizes is set from -1 to +1 (d_range = "full"). For a full list of options, again, refer to the help page.
BuildHeatmap(
example_output,
cuneiform = TRUE,
keepMeta = colnames(metaMatMetformin),
d_range = "full"
)
The BuildHeatmap()
function returns a ggplot2 object.
This makes it possible to perform some easy alterations manually
(Fig. 5).
BuildHeatmap(example_output) +
theme(
legend.position = "none",
axis.text.y = element_text(face = "italic"),
plot.title = element_blank(),
plot.subtitle = element_blank()
)
R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
locale: LC_CTYPE=de_DE.UTF-8, LC_NUMERIC=C, LC_TIME=de_DE.UTF-8, LC_COLLATE=C, LC_MONETARY=de_DE.UTF-8, LC_MESSAGES=de_DE.UTF-8, LC_PAPER=de_DE.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=de_DE.UTF-8 and LC_IDENTIFICATION=C
attached base packages: stats, graphics, grDevices, utils, datasets, methods and base
other attached packages: kableExtra(v.1.3.4), gridExtra(v.2.3), ggplot2(v.3.4.2), metadeconfoundR(v.1.0.2) and detectseparation(v.0.3)
loaded via a namespace (and not attached): gtable(v.0.3.3), xfun(v.0.39), bslib(v.0.4.2), lattice(v.0.21-8), numDeriv(v.2016.8-1.1), vctrs(v.0.6.2), tools(v.4.3.0), generics(v.0.1.3), parallel(v.4.3.0), tibble(v.3.2.1), fansi(v.1.0.4), highr(v.0.10), pkgconfig(v.2.0.3), Matrix(v.1.5-1), checkmate(v.2.2.0), webshot(v.0.5.5), lifecycle(v.1.0.3), farver(v.2.1.1), compiler(v.4.3.0), stringr(v.1.5.0), munsell(v.0.5.0), codetools(v.0.2-19), htmltools(v.0.5.5), sass(v.0.4.6), yaml(v.2.3.7), pillar(v.1.9.0), nloptr(v.2.0.3), jquerylib(v.0.1.4), MASS(v.7.3-59), cachem(v.1.0.8), iterators(v.1.0.14), boot(v.1.3-28), foreach(v.1.5.2), nlme(v.3.1-162), tidyselect(v.1.2.0), rvest(v.1.0.3), digest(v.0.6.31), stringi(v.1.7.12), slam(v.0.1-50), pander(v.0.6.5), dplyr(v.1.1.2), reshape2(v.1.4.4), labeling(v.0.4.2), splines(v.4.3.0), fastmap(v.1.1.1), grid(v.4.3.0), colorspace(v.2.1-0), cli(v.3.6.1), magrittr(v.2.0.3), utf8(v.1.2.3), withr(v.2.5.0), scales(v.1.2.1), backports(v.1.4.1), registry(v.0.5-1), ROI.plugin.lpsolve(v.1.0-1), httr(v.1.4.6), rmarkdown(v.2.21), lambda.r(v.1.2.4), lme4(v.1.1-33), futile.logger(v.1.4.3), zoo(v.1.8-12), ROI(v.1.0-1), lpSolveAPI(v.5.5.2.0-17.9), evaluate(v.0.21), knitr(v.1.43), doParallel(v.1.0.17), lmtest(v.0.9-40), viridisLite(v.0.4.2), rlang(v.1.1.1), futile.options(v.1.0.1), Rcpp(v.1.0.10), glue(v.1.6.2), xml2(v.1.3.4), formatR(v.1.14), svglite(v.2.1.1), rstudioapi(v.0.14), minqa(v.1.2.5), jsonlite(v.1.8.4), R6(v.2.5.1), plyr(v.1.8.8) and systemfonts(v.1.0.4)