Our understanding of dynamical biological systems such as developmental processes or transitions into disease states is limited by our ability to reverse-engineer these systems using available data. Most medium- and high-throughput experimental protocols (e.g. single cell RNA-seq) are destructive in nature, generating cross-sectional time series in which it is not possible to track the progress of one cell through the system. In typical systems, individual cells progress at different rates and a sample’s experimental capture time will not accurately reflect how far it has progressed. In these cases, cross-sectional data can appear particularly noisy.
We propose a probabilistic model that uses smoothness assumptions to estimate and correct for this effect. Each cell is assigned a pseudotime that represents its progress through the system. These pseudotimes are related to but not determined by the cells’ experimental capture times. Replacing capture times with pseudotimes gives us a more representative view of the underlying system, improving downstream analyses.
We model the smoothness assumption on each gene’s expression profile \(y_g\) using a Gaussian process over pseudotime. Gene-specific parameters in the covariance function represent intrinsic measurement noise \(\omega_g\) and variation of the profile over time \(\psi_g\). Each sample’s pseudotime \(\tau_c\) is given a normal prior centred on its capture time. \[ \begin{aligned} y_{g} &\sim \mathcal{GP}(\phi_g, \Sigma_g) \\ \Sigma_g(\tau_1, \tau_2) &= \psi_g \Sigma_\tau(\tau_1, \tau_2) + \omega_g \delta_{\tau_1,\tau_2} \\ \log \psi_g &\sim \mathcal{N}(\mu_\psi, \sigma_\psi) \\ \log \omega_g &\sim \mathcal{N}(\mu_\omega, \sigma_\omega) \\ \Sigma_\tau(\tau_1, \tau_2) &= \textrm{Matern}_{3/2}\bigg(r=\frac{|\tau_1 - \tau_2|}{l}\bigg) = (1 + \sqrt{3}r) \exp[-\sqrt{3}r] \\ \tau_c &\sim \mathcal{N}(k_c, \sigma_\tau) \end{aligned} \] This model is effectively a one-dimensional Gaussian process latent variable model with a structured prior on the latent variable (pseudotime).
Guo et al. assayed single cell expression values at 7 time points in mouse embryonic cells and the data is contained in the DeLorean package. We will load this data and analyse a subset of it corresponding to the epiblast lineage. First we must build a de.lorean object with the correct data. Load the data.
library(DeLorean)
library(dplyr)
data(GuoDeLorean)
# Limit number of cores to 2 for CRAN
options(DL.num.cores=min(default.num.cores(), 2))Create a de.lorean object from the full data set.
dl <- de.lorean(guo.expr, guo.gene.meta, guo.cell.meta)Estimate hyperparameters for the model from the whole data set. Here we set the width of the normal prior on the pseudotimes to be 0.5.
dl <- estimate.hyper(
    dl,
    sigma.tau=0.5,
    length.scale=1.5,
    model.name='exact-sizes')DeLorean also offers slight variations of the model, we could use model.name='lowrank' or model.name='exact'. See the documentation for estimate.hyper for more details.
Choose a few cells from each capture point from the epiblast lineage.
num.at.each.stage <- 5
epi.sampled.cells <- guo.cell.meta %>%
    filter(capture < "32C" |
           "EPI" == cell.type |
           "ICM" == cell.type) %>%
    group_by(capture) %>%
    do(sample_n(., num.at.each.stage))
dl <- filter_cells(dl, cells=epi.sampled.cells$cell)We only have data for a few genes and can easily model them all which is typical for qPCR data. RNA-seq data often has far too many genes for the model to fit. In any case most are probably irrelevant. In these cases we recommend an analysis of variance across the capture times to choose those genes whose means vary most across time. These are most likely to be relevant for fitting the model.
dl <- aov.dl(dl)The most temporally variable genes (by p-value) are at the head of the result:
head(dl$aov)## # A tibble: 6 x 7
##     gene    term    df     sumsq    meansq statistic      p.value
##   <fctr>   <chr> <dbl>     <dbl>     <dbl>     <dbl>        <dbl>
## 1  Sall4 capture     6  98.10583 16.350971  71.18447 1.186993e-15
## 2  Gata6 capture     6 344.14119 57.356865  42.13893 9.468650e-13
## 3   Bmp4 capture     6 303.90384 50.650640  36.58841 5.391462e-12
## 4   Actb capture     6  34.13423  5.689038  34.63947 1.049178e-11
## 5 Tspan8 capture     6 278.21687 46.369478  33.80668 1.408351e-11
## 6 Pou5f1 capture     6  79.39507 13.232512  27.24854 1.827721e-10The least temporally variable genes (by p-value) are at the tail of the result:
tail(dl$aov)## # A tibble: 6 x 7
##      gene    term    df     sumsq    meansq statistic    p.value
##    <fctr>   <chr> <dbl>     <dbl>     <dbl>     <dbl>      <dbl>
## 1    Klf4 capture     6  22.45294  3.742156 3.3399243 0.01311283
## 2     Msc capture     6  85.89667 14.316111 2.9890350 0.02197590
## 3   Pdgfa capture     6 120.90622 20.151036 2.9164331 0.02449000
## 4  Atp12a capture     6  17.51055  2.918426 1.6109303 0.18111345
## 5 Tcfap2a capture     6  44.21587  7.369311 1.2677130 0.30370287
## 6   Sox17 capture     6  23.17571  3.862618 0.9168198 0.49758776and for instance you could run the model on the 20 most variable genes by executing
dl <- filter_genes(dl, genes=head(dl$aov, 20)$gene)otherwise do not call filter_genes and DeLorean will use all the genes.
Now we have the data we can fit our model using Stan’s ADVI variational Bayes algorithm. To run the No-U-Turn sampler use method='sample'.
dl <- fit.dl(dl, method='vb')If running a sampler, Stan provides \(\hat{R}\) statistics that can aid detecting convergence problems. This makes no sense for ADVI but we show how to produce the boxplots here for users of the samplers.
dl <- examine.convergence(dl)
plot(dl, type='Rhat')Plot the pseudotimes from the best sample (best in the sense of highest likelihood). The prior means for the capture points are shown as dashed lines.
plot(dl, type='pseudotime')Plot the expression data over the pseudotimes from the best sample.
dl <- make.predictions(dl)
plot(dl, type='profiles')