Plotting two chemical metrics

Load required packages
GlobalPatterns
Humboldt Sulfuretum (Marine Sediment)
Qinghai-Tibet Plateau (Hot Springs)
Comparing more datasets
Take-home messages
References

Community-level chemical metrics are computed from the elemental compositions of community reference proteomes, which in turn are derived from genomic protein sequences weighted by taxonomic abundances. Unlike the synthetic variables in ordination methods (e.g. the principal components in PCA), chemical metrics are analogous to thermodynamic components that are defined independently of the data, so they can be compared across datasets. Their theoretical foundation and cross-dataset applicability facilitate the use of chemical metrics to explore hypotheses about genomic adaptation to multiple physicochemical variables at a global scale.
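As a concrete illustration of one such metric, the sketch below computes carbon oxidation state from an elemental formula using the standard definition Z_C = (-h + 3n + 2o + 2s + z) / c for a formula with c carbon, h hydrogen, n nitrogen, o oxygen, and s sulfur atoms and charge z. It uses base R only and is not how chem16S itself is implemented; the function name calc_Zc and the two taxon formulas and abundances are hypothetical values chosen for illustration.

```r
# Carbon oxidation state from a named vector of element counts.
# Z_C = (-h + 3n + 2o + 2s + z) / c
calc_Zc <- function(formula, charge = 0) {
  # Elements missing from the formula count as zero
  count <- function(el) if (el %in% names(formula)) formula[[el]] else 0
  (-count("H") + 3 * count("N") + 2 * count("O") + 2 * count("S") + charge) /
    count("C")
}

# Small molecules as sanity checks
alanine <- c(C = 3, H = 7, N = 1, O = 2)
glycine <- c(C = 2, H = 5, N = 1, O = 2)
zc_alanine <- calc_Zc(alanine)
zc_glycine <- calc_Zc(glycine)

# Hypothetical reference proteome formulas for two taxa (made-up numbers)
taxon_formulas <- rbind(
  taxon_A = c(C = 500, H = 790, N = 135, O = 150, S = 3),
  taxon_B = c(C = 480, H = 760, N = 130, O = 155, S = 2)
)
# Hypothetical taxonomic abundances (e.g. read counts)
abundances <- c(taxon_A = 30, taxon_B = 70)

# Weight each taxon's formula by its abundance (row i is multiplied by
# abundances[i]) and sum to get a community-level elemental composition
community_formula <- colSums(taxon_formulas * abundances)
zc_community <- calc_Zc(community_formula)
zc_taxa <- apply(taxon_formulas, 1, calc_Zc)
```

Because the metric is computed from the summed elemental composition, the community value lies between the values for the individual taxa and depends only on the formulas and abundances, not on any other sample in the dataset.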

In this vignette we’ll analyze phyloseq’s GlobalPatterns dataset (based on data from Caporaso et al., 2011) to visualize chemical variation of community reference proteomes across environments. Then, we’ll explore specific hypotheses about the effects of redox conditions and salinity on genomic adaptation by analyzing datasets for microbial communities in marine sediment (Fonseca et al., 2022) and geothermal waters (Zhang et al., 2023).

Load required packages

require(chem16S)
require(phyloseq)
require(ggplot2)
theme_set(theme_bw())

# For composing plots and making a common legend (plot_layout())
require(patchwork)

# For annotating plots with regression coefficients (stat_poly_line())
require(ggpmisc)

This vignette was compiled on 2025-01-16 with chem16S version 1.2.0 and phyloseq version 1.48.0.

GlobalPatterns

We will use the GlobalPatterns dataset ‘as-is’, without the preprocessing described in phyloseq’s Ordination Plots tutorial. There, less-abundant OTUs and phyla were removed in order to show high-level trends and shorten computing time. One step we do take from that tutorial is the addition of a categorical variable that identifies whether the samples are human-associated:

data(GlobalPatterns)
Human = get_variable(GlobalPatterns, "SampleType") %in% c("Feces", "Mock", "Skin", "Tongue")
sample_data(GlobalPatterns)$Human <- factor(Human)

Taxonomic assignments reported by Caporaso et al. (2011) were made using the RDP Classifier (presumably with its default training set), so we pass refdb = "RefSeq_206" to apply the manual mappings between RDP and RefSeq taxonomy described by Dick and Tan (2023).

p2 <- plot_ps_metrics2(GlobalPatterns, color = "SampleType", shape = "Human", refdb = "RefSeq_206")
## [1] "map_taxa: using these manual mapping(s) to NCBI RefSeq:"

order_Rhizobiales --> order_Hyphomicrobiales (0.2%)
order_Clostridiales --> order_Eubacteriales (0.6%)
family_Ruminococcaceae --> family_Oscillospiraceae (3.1%)

## [1] "map_taxa: can't map groups order_Stramenopiles (12.94%), family_ACK-M1 (3.27%), 374 others (11.75%)"
## [1] "map_taxa: mapping rate to RefSeq_206 taxonomy is 71.9%"

p2 + geom_polygon(aes(fill = SampleType), alpha = 0.5) + geom_point(size = 3) +
  guides(colour = guide_legend(override.aes = list(shape = c(17, 19, 19, 17, 19, 19, 17, 19, 17))))
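The shape vector passed to override.aes has to follow the order of the legend entries. A minimal base-R sketch of how that vector could be derived rather than typed by hand, assuming the nine SampleType values listed below are those in GlobalPatterns and that the legend lists them in sorted order; the variable names are hypothetical.

```r
# Assumed SampleType values in GlobalPatterns
sample_types <- c("Feces", "Freshwater", "Freshwater (creek)", "Mock", "Ocean",
                  "Sediment (estuary)", "Skin", "Soil", "Tongue")
# Sample types treated as human-associated in the code above
human_types <- c("Feces", "Mock", "Skin", "Tongue")

# One shape per legend entry: 17 (triangle) for human-associated sample
# types and 19 (circle) for all others
legend_order <- sort(unique(sample_types))
legend_shapes <- ifelse(legend_order %in% human_types, 17, 19)
```

Deriving the vector this way keeps the legend symbols in step with the Human variable if the set of human-associated sample types is changed.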

At the extremes of carbon oxidation state (Z_C), soil communities are the most oxidized and skin and tongue communities are the most reduced. At the extremes of stoichiometric hydration state (nH₂O), skin communities are the most hydrated and some fecal communities are the least hydrated. Looking more closely, some environmental microbiomes show similar ranges of chemical metrics (e.g., Freshwater (creek) and Sediment (estuary)), whereas others differ. Freshwater, described as “lake” by Caporaso et al. (2011), has lower Z_C than Freshwater (creek), and some ocean samples have lower nH₂O than either freshwater group. These patterns could reflect greater oxygenation in flowing water than in lakes (lotic versus lentic systems) and dehydration in communities adapted to salty water rather than freshwater.

Humboldt Sulfuretum (Marine Sediment)

Fonseca et al. (2022) reported 16S rRNA gene sequences for sediment samples from the oxygen minimum zone of the Pacific Ocean along the coast of Chile, known as the Humboldt Sulfuretum. The sample data include dissolved oxygen, redox potential in sediment and overlying water, and organic matter (OM) content. This is a useful dataset for exploring the hypothesis that Z_C of proteins is shaped by redox conditions.

Here we read the phyloseq-class object created by using DADA2 (Callahan et al., 2016) to identify amplicon sequence variants (ASVs) in this dataset and to classify them using the GTDB training set. A sample taken from 50 m depth at the Valparaiso location on 2012-05-12 is available in the Sequence Read Archive (SRA) but was not included in the analysis described by Fonseca et al. (2022). The taxonomic composition of this sample differs markedly from that of all the others (see the ordination plots in the extdata directory where the ps_FEN+22.rds file is located), so we exclude it to avoid anomalous results.

psfile <- system.file("extdata/DADA2-GTDB_220/FEN+22/ps_FEN+22.rds", package = "chem16S")
ps <- readRDS(psfile)
ps <- prune_samples(sample_names(ps) != "SRR1346095", ps)
ps

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 2095 taxa and 13 samples ]
## sample_data() Sample Data:       [ 13 samples by 14 sample variables ]
## tax_table()   Taxonomy Table:    [ 2095 taxa by 6 taxonomic ranks ]
## refseq()      DNAStringSet:      [ 2095 reference sequences ]

Then, we plot Z_C and nH₂O for the community reference proteomes using different colors for sample groups.

plot_ps_metrics2(ps, refdb = "GTDB_220", color = "Location") +
  geom_polygon(aes(fill = Location), alpha = 0.5) + geom_point(size = 3)

## [1] "map_taxa: mapping rate to GTDB_220 taxonomy is 100.0%"

Among the areas with more than one sample, the community reference proteomes for Iquique are more reduced (i.e., have lower Z_C) than those for Concepcion and Valparaiso. Two of the communities at Valparaiso are also characterized by higher nH₂O, suggesting the influence of a hydrating factor.

Let’s take a step toward more quantitative tests of these hypotheses about genomic adaptation to environments. The color scales in the next two plots reflect sediment redox potential and concentration of organic matter. The rationale for choosing these environmental measurements is described below.

p2 <- plot_ps_metrics2(ps, refdb = "GTDB_220", color = "Sediment_redox") +
  geom_point(size = 4) + labs(color = "Sediment redox (Eh)")

## [1] "map_taxa: mapping rate to GTDB_220 taxonomy is 100.0%"

p3 <- plot_ps_metrics2(ps, refdb = "GTDB_220", color = "Organic_matter") +
  geom_point(size = 4) + labs(color = "Organic matter (%)")

## [1] "map_taxa: mapping rate to GTDB_220 taxonomy is 100.0%"

p2 / p3

Thermodynamic considerations predict a positive correlation between redox potential and Z_C (Dick and Meng, 2023). It has also been predicted that salinity has a dehydrating effect that favors proteins with lower nH₂O (Dick et al., 2020). However, these samples have no documented salinity gradient. Previous observations that protein expression in cells is shifted toward lower nH₂O under hyperglycemic (high-glucose) conditions (Dick et al., 2020) suggest another hypothesis: a higher content of organic matter may be a proxy for dehydrating conditions.

We can use correlations between two environmental variables (redox potential or OM) and two chemical metrics for communities (Z_C or nH₂O) in order to test these hypotheses. To make the plots, let’s construct a single data frame containing the sample data and chemical metrics.

sample.data.and.chemical.metrics.for.communities <- cbind(sample_data(ps), ps_metrics(ps, refdb = "GTDB_220"))

## [1] "map_taxa: mapping rate to GTDB_220 taxonomy is 100.0%"

Now let’s write a function to create a scatter plot for two variables and add a regression line. We use this function to make a plot for each combination of environmental variable and chemical metric.

# Defuse (enquo) and Inject (!!) from https://www.tidyverse.org/blog/2018/07/ggplot2-tidy-evaluation/
# Regression line and equation from https://stackoverflow.com/questions/7549694/add-regression-line-equation-and-r2-on-graph
scatter_plot <- function(data = sample.data.and.chemical.metrics.for.communities, x, y, xlab, ylab) {
  x <- enquo(x)
  y <- enquo(y)
  ggplot(data, aes(x = !!x, y = !!y, color = .data[["Location"]])) +
    geom_point() + xlab(xlab) + ylab(ylab) +
    # Override aes to plot one regression line for samples from all locations
    stat_poly_line(aes(x = !!x, y = !!y), inherit.aes = FALSE) +
    stat_poly_eq(aes(x = !!x, y = !!y), inherit.aes = FALSE, label.x = "center")
}

sp1 <- scatter_plot(x = Sediment_redox, y = Zc, xlab = "Sediment redox (mV)", ylab = chemlab("Zc"))
sp2 <- scatter_plot(x = Sediment_redox, y = nH2O, xlab = "Sediment redox (mV)", ylab = chemlab("nH2O"))
sp3 <- scatter_plot(x = Organic_matter, y = Zc, xlab = "Organic matter (%)", ylab = chemlab("Zc"))
sp4 <- scatter_plot(x = Organic_matter, y = nH2O, xlab = "Organic matter (%)", ylab = chemlab("nH2O"))
sp1 + sp2 + sp3 + sp4 + plot_layout(guides = "collect")
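To make the regression annotations less of a black box, here is a base-R sketch of the kind of straight-line fit that such a plot summarizes, assuming a linear model of the form y ~ x. The redox and Z_C values are made-up numbers for illustration only, not values from the Humboldt Sulfuretum dataset, and the variable names are hypothetical.

```r
# Made-up illustrative values (NOT taken from the dataset)
toy <- data.frame(
  redox_mV = c(-200, -150, -100, -50, 0),
  Zc = c(-0.190, -0.182, -0.169, -0.163, -0.150)
)

# Ordinary least-squares fit of Zc against redox potential
fit <- lm(Zc ~ redox_mV, data = toy)

# Slope, intercept, and coefficient of determination of the fitted line
slope <- unname(coef(fit)["redox_mV"])
intercept <- unname(coef(fit)["(Intercept)"])
r_squared <- summary(fit)$r.squared
```

A positive slope corresponds to a positive correlation between the environmental variable and the chemical metric, and r_squared measures how much of the variation the straight line accounts for.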

We find that carbon oxidation state is positively correlated with redox potential, and stoichiometric hydration state is negatively correlated with organic matter content. Taken alone, each of these correlations supports our initial hypotheses. However, in part because of strong covariation of the environmental variables, Z_C is also negatively correlated with OM content, and nH₂O is positively correlated with redox potential.

The covariation of environmental variables makes it difficult to identify primary factors that drive the observed differences between communities. However, the chemical nature of these variables provides additional clues. The covariation of environmental variables (higher OM content with lower redox potential) makes sense if greater availability of organic compounds drives respiration and ensuing depletion of oxygen. This interaction among environmental variables yields a mechanistic hypothesis for the positive association between Z_C and nH₂O (see first plot above), which could not be explained by our initial hypotheses about the effects of single variables.
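The reasoning about covariation can be sketched with a toy example in base R. Here Z_C is constructed to depend only on redox potential and nH₂O only on organic matter, while redox potential itself decreases with organic matter. All numbers and variable names are made up for illustration and are not measurements from the dataset.

```r
# Made-up organic matter contents (%)
om <- c(2, 4, 6, 8, 10)
# Redox potential (mV) decreases as organic matter increases
redox <- 100 - 30 * om + c(5, -5, 0, 5, -5)
# Zc is constructed from redox potential only (plus small deviations)
zc <- -0.15 + 0.0002 * redox + c(0.002, -0.002, 0.001, -0.001, 0)
# nH2O is constructed from organic matter only (plus small deviations)
nh2o <- -0.70 - 0.004 * om + c(-0.001, 0.001, 0, 0.001, -0.001)

# Correlations expected from the construction
cor_zc_redox <- cor(zc, redox)   # positive
cor_nh2o_om <- cor(nh2o, om)     # negative
# "Cross" correlations that arise only because redox and OM covary
cor_zc_om <- cor(zc, om)         # negative
cor_nh2o_redox <- cor(nh2o, redox) # positive
# Resulting association between the two chemical metrics
cor_zc_nh2o <- cor(zc, nh2o)     # positive
```

Even though neither metric is constructed from the other, the shared dependence on covarying environmental variables is enough to produce a positive association between Z_C and nH₂O in this toy setting.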

Qinghai-Tibet Plateau (Hot Springs)

Zhang et al. (2023) reported 16S rRNA gene sequences for mildly alkaline hot spring reservoirs in the Qinghai-Tibet Plateau. The following general predictions can be made about these data:

A positive correlation between Z_C of community reference proteomes and oxidation-reduction potential (ORP).
Because of input of reducing fluids, lower Z_C in this dataset compared to marine sediments of the Humboldt Sulfuretum.
Assuming that these samples have relatively low salinity, higher nH₂O than marine sediment communities.

Let’s test these predictions by doing some calculations. The following commands load the data and plot two environmental variables (ORP and temperature (T)) against two chemical metrics.

psfile2 <- system.file("extdata/DADA2-GTDB_220/ZFZ+23/ps_ZFZ+23.rds", package = "chem16S")
ps2 <- readRDS(psfile2)
data.and.metrics <- cbind(sample_data(ps2), ps_metrics(ps2, refdb = "GTDB_220"))

## [1] "map_taxa: using these post-curation mapping(s) for GTDB release 220:"

family_Koribacteraceae --> family_Korobacteraceae (0.022%)
order_Ammonifexales --> order_Ammonificales (0.011%)
family_Phormidesmiaceae --> family_Phormidesmidaceae (0.018%)
order_Hydrogenedentiales --> order_Hydrogenedentales (0.0063%)

## [1] "map_taxa: mapping rate to GTDB_220 taxonomy is 100.0%"

ps2

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 9466 taxa and 7 samples ]
## sample_data() Sample Data:       [ 7 samples by 24 sample variables ]
## tax_table()   Taxonomy Table:    [ 9466 taxa by 6 taxonomic ranks ]
## refseq()      DNAStringSet:      [ 9466 reference sequences ]

scatter_plot_2 <- function(data = data.and.metrics, x, y, xlab, ylab) {
  x <- enquo(x)
  y <- enquo(y)
  ggplot(data, aes(x = !!x, y = !!y)) +
    geom_point() + xlab(xlab) + ylab(ylab) +
    stat_poly_line() +
    stat_poly_eq(label.x = "center")
}
sp1 <- scatter_plot_2(x = ORP, y = Zc, xlab = "ORP (mV)", ylab = chemlab("Zc"))
sp2 <- scatter_plot_2(x = ORP, y = nH2O, xlab = "ORP (mV)", ylab = chemlab("nH2O"))
sp3 <- scatter_plot_2(x = T, y = Zc, xlab = "T (°C)", ylab = chemlab("Zc"))
sp4 <- scatter_plot_2(x = T, y = nH2O, xlab = "T (°C)", ylab = chemlab("nH2O"))
sp1 + sp2 + sp3 + sp4

The correlation between ORP and Z_C isn’t as strong as might be predicted. Moreover, neither of the chemical metrics is strongly associated with temperature. Therefore, this dataset seems to be an exception to the notion that particular chemical metrics of community reference proteomes are shaped by the environment at a local scale.

But let’s not forget about the global-scale predictions! How do communities in hot springs compare to those in ocean sediments? In order to make a plot, we can merge both datasets into a new phyloseq-class object. The sequence processing pipeline assigned the same taxon names to both datasets (ASV1, ASV2, etc.). Therefore, let’s append a letter to one set of names so that distinct taxa are not mistakenly combined.

taxa_names(ps2) <- paste0(taxa_names(ps2), "b")
ps_merged <- merge_phyloseq(ps, ps2)
ps_merged

## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 11561 taxa and 20 samples ]
## sample_data() Sample Data:       [ 20 samples by 34 sample variables ]
## tax_table()   Taxonomy Table:    [ 11561 taxa by 6 taxonomic ranks ]
## refseq()      DNAStringSet:      [ 11561 reference sequences ]
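The effect of the renaming step can be checked with plain character vectors. This is a base-R sketch with hypothetical ASV names, not a call into phyloseq: it only shows that appending a suffix removes the overlap between the two sets of names.

```r
# Hypothetical taxon names as assigned independently in two datasets
asv_names_1 <- paste0("ASV", 1:5)
asv_names_2 <- paste0("ASV", 1:3)

# Without renaming, identical names would be treated as the same taxon
shared_before <- intersect(asv_names_1, asv_names_2)

# Appending a letter to one set makes every name unique across datasets
asv_names_2b <- paste0(asv_names_2, "b")
shared_after <- intersect(asv_names_1, asv_names_2b)

# With no shared names, the merged taxon count is the sum of the two counts
n_merged <- length(union(asv_names_1, asv_names_2b))
```

The same arithmetic holds for the output above: the merged object has 11561 taxa, which equals 2095 + 9466, so no taxa from the two datasets were combined.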

Let’s add a column to the sample data to indicate the type of environment for each dataset, then plot nH₂O against Z_C.

sample_data(ps_merged)$Environment <-
  ifelse(is.na(sample_data(ps_merged)$Depth), "Hot spring", "Marine sediment")
plot_ps_metrics2(ps_merged, refdb = "GTDB_220", color = "Environment", shape = "Environment") +
  geom_point(size = 3)

## [1] "map_taxa: using these post-curation mapping(s) for GTDB release 220:"

family_Koribacteraceae --> family_Korobacteraceae (0.021%)
order_Ammonifexales --> order_Ammonificales (0.01%)
family_Phormidesmiaceae --> family_Phormidesmidaceae (0.017%)
order_Hydrogenedentiales --> order_Hydrogenedentales (0.006%)

## [1] "map_taxa: mapping rate to GTDB_220 taxonomy is 100.0%"
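The Environment labels above rely on Depth being missing for the hot spring samples. A base-R sketch of the idea, using two small hypothetical data frames in place of the real sample data: when tables with different columns are combined, a variable recorded in only one of them is NA for rows from the other. This assumes Depth is recorded for every sediment sample.

```r
# Hypothetical sample tables with different variables
sediment_samples <- data.frame(Sample = c("sed1", "sed2"), Depth = c(50, 100))
spring_samples <- data.frame(Sample = "spring1", ORP = -120)

# Combining them fills variables that are absent from one table with NA
merged_samples <- merge(sediment_samples, spring_samples, all = TRUE)

# Rows without a Depth value must have come from the hot spring table
merged_samples$Environment <-
  ifelse(is.na(merged_samples$Depth), "Hot spring", "Marine sediment")
```

If a sediment sample had a missing Depth value it would be mislabeled, so an explicit dataset identifier added before merging would be a more robust alternative.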

In most cases, the communities from hot springs in the Qinghai-Tibet Plateau have lower Z_C and higher nH₂O compared to those in marine sediments of the Humboldt Sulfuretum. This outcome is consistent with predictions about genomic adaptation to relatively more reducing and less saline conditions of the hot springs.

The positive association between Z_C and nH₂O for the sediment communities does not extend to the comparison between the two datasets. This suggests that environmental factors influence community-level elemental composition differently at local and global scales.

Comparing more datasets

Let’s add more data to the previous comparison, specifically freshwater and marine samples from the Baltic Sea salinity gradient (Herlemann et al., 2016). The first part of the following code chunk is adapted from the help page for plot_metrics(). A few lines are added to remove brackish samples and adjust the plotting symbols for the remaining samples. Then, the values for hot springs and sediments are appended to the data frame. Finally, new colors are assigned and the plot is made. Note: this chunk does not depend on running the previous code, except for loading the chem16S and phyloseq packages.

# Get data for the Baltic Sea salinity gradient from Herlemann et al., 2016
RDPfile <- system.file("extdata/RDP/HLA+16.tab.xz", package = "chem16S")
RDP <- read_RDP(RDPfile)
map <- map_taxa(RDP, refdb = "RefSeq_206")
metrics <- get_metrics(RDP, map, refdb = "RefSeq_206")
mdatfile <- system.file("extdata/metadata/HLA+16.csv", package = "chem16S")
mdat <- get_metadata(mdatfile, metrics)
# Take out brackish samples (6-20 PSU)
ibrackish <- mdat$metadata$pch == 20
mdat$metadata <- mdat$metadata[!ibrackish, ]
mdat$metrics <- mdat$metrics[!ibrackish, ]
# Keep the values used for the plot
mdat$metadata <- mdat$metadata[, c("name", "pch", "col")]
mdat$metrics <- mdat$metrics[, c("Zc", "nH2O")]
# Change symbols
isalt <- mdat$metadata$pch == 21
ifresh <- mdat$metadata$pch == 24
mdat$metadata$pch[isalt] <- 24
mdat$metadata$pch[ifresh] <- 21
mdat$metadata$col[isalt] <- 3
mdat$metadata$col[ifresh] <- 4
# Append hot spring and sediment values
FEN22 <- readRDS(system.file("extdata/DADA2-GTDB_220/FEN+22/ps_FEN+22.rds", package = "chem16S"))
FEN22 <- prune_samples(sample_names(FEN22) != "SRR1346095", FEN22)
FEN22_metrics <- ps_metrics(FEN22, refdb = "GTDB_220")[, c("Zc", "nH2O")]
ZFZ23 <- readRDS(system.file("extdata/DADA2-GTDB_220/ZFZ+23/ps_ZFZ+23.rds", package = "chem16S"))
ZFZ23_metrics <- ps_metrics(ZFZ23, refdb = "GTDB_220")[, c("Zc", "nH2O")]
mdat$metadata <- rbind(mdat$metadata,
  data.frame(name = "sediment", pch = 25, col = rep(7, nrow(FEN22_metrics))))
mdat$metadata <- rbind(mdat$metadata,
  data.frame(name = "hot spring", pch = 22, col = rep(2, nrow(ZFZ23_metrics))))
mdat$metrics <- rbind(mdat$metrics, FEN22_metrics, ZFZ23_metrics)
# Change colors
mdat$metadata$col[mdat$metadata$col == 2] <- (red <- "#db2828")
mdat$metadata$col[mdat$metadata$col == 3] <- (green <- "#21ba45")
mdat$metadata$col[mdat$metadata$col == 4] <- (blue <- "#2185d0")
mdat$metadata$col[mdat$metadata$col == 7] <- (yellow <- "#fbbd08")
# Create bold axis labels
Zclab <- quote(bolditalic(Z)[bold(C)])
nH2Olab <- quote(bolditalic(n)[bold(H[2]*O)])
# Make the plot
par(mar = c(4, 4, 1, 1))
par(cex.lab = 1.2, mgp = c(2.8, 1, 0))
pm <- plot_metrics(mdat, title = FALSE, xlab = Zclab, ylab = nH2Olab)
# Add a legend
legend <- c("Hot spring", "Freshwater", "Marine water", "Marine sediment")
pch <- c(22, 21, 24, 25)
pt.bg <- c(red, blue, green, yellow)
legend("bottomleft", legend, pch = pch, col = 1, pt.bg = pt.bg, bg = "white", bty = "n")

Notice how more reducing samples (hot spring and sediment) have relatively low Z_C and more saline samples (marine water and sediment) have relatively low nH₂O.

Take-home messages

Chemical metrics are defined independently of the data and can be compared across datasets.
In some cases, covariation between Z_C and nH₂O suggests interactions between environmental variables. For instance, if lower redox potential is associated with greater organic matter content in sediments, this could drive a positive association between Z_C and nH₂O.
Variation of nH₂O may reflect the influence of diverse factors including salinity, organic matter content, and others. An open question is: What explains the very low nH₂O for communities in some fecal samples?
Multiple datasets can be used to examine global-scale influences of redox potential and salinity on genomic adaptation at the community level. More comprehensive tests of the prediction of a positive correlation between Z_C and redox potential at local and global scales have been reported by Dick and Meng (2023).

References

Callahan BJ, McMurdie PJ, Rosen MJ, Han AW, Johnson AJA, Holmes SP. 2016. DADA2: High-resolution sample inference from Illumina amplicon data. Nature Methods 13(7): 581–583. doi: 10.1038/nmeth.3869

Caporaso JG, Lauber CL, Walters WA, Berg-Lyons D, Lozupone CA, Turnbaugh PJ, Fierer N, Knight R. 2011. Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample. Proceedings of the National Academy of Sciences 108(Supplement 1): 4516–4522. doi: 10.1073/pnas.1000080107

Dick JM, Meng D. 2023. Community- and genome-based evidence for a shaping influence of redox potential on bacterial protein evolution. mSystems 8(3): e0001423. doi: 10.1128/msystems.00014-23

Dick JM, Tan J. 2023. Chemical links between redox conditions and estimated community proteomes from 16S rRNA and reference protein sequences. Microbial Ecology 85(4): 1338–1355. doi: 10.1007/s00248-022-01988-9

Dick JM, Yu M, Tan J. 2020. Uncovering chemical signatures of salinity gradients through compositional analysis of protein sequences. Biogeosciences 17(23): 6145–6162. doi: 10.5194/bg-17-6145-2020

Fonseca A, Espinoza C, Nielsen LP, Marshall IPG, Gallardo VA. 2022. Bacterial community of sediments under the Eastern Boundary Current System shows high microdiversity and a latitudinal spatial pattern. Frontiers in Microbiology 13: 1016418. doi: 10.3389/fmicb.2022.1016418

Herlemann DPR, Lundin D, Andersson AF, Labrenz M, Jürgens K. 2016. Phylogenetic signals of salinity and season in bacterial community composition across the salinity gradient of the Baltic Sea. Frontiers in Microbiology 7: 1883. doi: 10.3389/fmicb.2016.01883

Zhang H-S, Feng Q-D, Zhang D-Y, Zhu G-L, Yang L. 2023. Bacterial community structure in geothermal springs on the northern edge of Qinghai-Tibet plateau. Frontiers in Microbiology 13: 994179. doi: 10.3389/fmicb.2022.994179