library(EcoEnsemble)
library(mgcv)
library(ggplot2)
library(cowplot)
set.seed(5678)
Sometimes a set of simulators can be jointly affected by a process
driving the dynamics. For example, the outputs of a set of ecosystem
models (simulators) predicting the biomass of various species could be
affected by a set of earth-system models (drivers) giving a set of
combined ecosystem - earth system model outputs.
EcoEnsemble
allows fitting of an ensemble model to inputs
resulting from such scenarios. These inputs should take the form of a
set of simulator models and a set of driver models with the data being
the output of simulator-driver pairs. Input data must be supplied for
all simulator-driver combinations as fitting specific combinations is
not currently supported.
At time \(t\), the \(d\) variables of interest (VoI), \(\mathbf{y}^{(t)}=(y_1^{(t)},\ldots,y_d^{(t)})'\), are described by \(m\) simulators and \(p\) drivers, which for \(k=1,\ldots,m\) and \(j=1,\ldots,p\) give \(n_k\) outputs \(\mathbf{\hat{x}}_{k,j}^{(t)}=(\hat{x}_{k,j,1}^{(t)},\ldots,\hat{x}_{k,j,n_{k}}^{(t)})'\) each relating to one of the variables of interest. We also have noisy observations of the variables of interest, \(\hat{\mathbf{y}}^{(t)}=(\hat{y}_1^{(t)},\ldots,\hat{y}_d^{(t)})'\).
Not all of the simulator-driver pairs output all the variables over
the whole time period. To accommodate these differences, we introduce a
latent variable known as the “best guess”, \(\mathbf{x}_{k,j}^{(t)}=({x}_{1}^{(t)},\ldots,{x}_{d}^{(t)})'\),
which represents the output if simulator \(k\) and driver \(j\) described all \(d\) variables at time \(t\) with no parameter uncertainty. In
EcoEnsemble
, variables of interest are either present or
absent in each simulator-driver pair, therefore if the \(k\)th simulator and \(j\)th driver were evaluated at time \(t\), their combined output is \[
\mathbf{\hat{x}}_{k,j}^{(t)}\sim{}N(M_{k,j}\mathbf{x}_{k,j}^{(t)},\Sigma_{k,j}),
\] where for each \(j =
1,\ldots,p\), \(M_{k,j}^{(t)}\)
is an \(n_k\times{}d\) matrix and \(\Sigma_{k,j}\) reflects the combined
parameter uncertainty of the \(k\)th
simulator and the \(j\)th driver.
The variables of interest at time \(t\) are equal to the “best guess” of simulator \(k\) and driver \(j\), \(\mathbf{x}_{k,j}^{(t)}\), plus a discrepancy term, \(\mathbf{\zeta}^{(t)}_{k,j}\) (Kennedy and O’Hagan (2001)), i.e. \[ \mathbf{y}^{(t)} = \mathbf{x}_{k,j}^{(t)} + \mathbf{\zeta}^{(t)}_{k,j}. \] The discrepancy term, \(\mathbf{\zeta}^{(t)}_{k,j}\), is split between discrepancies that are shared between all of the simulators/driver combinations, and discrepancies that were specific to each simulator/driver combination. The shared discrepancy is further split into long-term shared discrepancy \(\delta\) and the short-term shared discrepancy \(\mathbf\eta^{(t)}\). The discrepancies specific to each combination are also split into a discrepancy arising from the simulator and a discrepancy arising from the driver. These two discrepancies are further split into long-term individual discrepancies \(\mathbf\gamma_k\) (simulator) and \(\mathbf\iota_j\) (driver), and short-term individual discrepancies \(\mathbf z_{k}^{(t)}\) (simulator) and \(\mathbf\alpha_{j}^{(t)}\) (driver). Written together we have \[ \mathbf{\zeta}^{(t)}_{k,j} = \mathbf\delta + \mathbf\eta^{(t)} + \mathbf\gamma_k + \mathbf{z}_k^{(t)} + \mathbf\iota_{j} + \mathbf\alpha_{j}^{(t)}. \] The long-term individual discrepancies for the \(k\)th simulator and \(j\)th driver are \[ \mathbf{\gamma}_k\sim{}N(\mathbf{0},C_{\gamma}) \hspace{10pt},\hspace{10pt} \mathbf{\iota}_j\sim{}N(\mathbf{0},C_{\iota}). \] The short-term discrepancy terms, \(\mathbf\eta^{(t)}\), \(\mathbf{z}_k^{(t)}\) and \(\mathbf{\alpha}_j^{(t)}\), follow auto-regressive processes of order one, \[ \mathbf{\eta}^{(t)}\sim{}N(R_{\eta}\mathbf{\eta}^{(t-1)},\Lambda_{\eta}) \] and \[ \mathbf{z}_{k}^{(t)}\sim{}N(R_{k}\mathbf{z}_{k}^{(t-1)},\Lambda_{k}) \hspace{10pt},\hspace{10pt} \mathbf{\alpha}_{j}^{(t)}\sim{}N(Q_{j}\mathbf{\alpha}_{j}^{(t-1)},\Gamma_{j}) \] respectively, where the diagonal elements of \(R_i, (i = \eta, 1, 2, \ldots, m )\) and \(Q_j, (j = 1,2,\ldots,p)\) are between \(-1\) and \(1\), and the off-diagonal elements are \(0\).
In the absence of any other information, the variables of interest evolve according to a random walk, \[ \mathbf{y}^{(t)}\sim{}N(\mathbf{y}^{(t-1)},\Lambda_y), \] with noisy observations, \[ \mathbf{\hat{y}}^{(t)}\sim{}N(\mathbf{y}^{(t)},\Sigma_y), \] when there are observations at time \(t\).
As in the case without drivers, the ensemble model can be written as a dynamical linear model, by letting \[\begin{equation} \mathbf\theta^{(t)}=(\mathbf{y}^{(t)'},\mathbf\eta^{(t)'},\mathbf{z}_1^{(t)'},\ldots,\mathbf{z}_m^{(t)'},\mathbf{\alpha}_1^{(t)'},\ldots,\mathbf{\alpha}_p^{(t)'})', \end{equation}\] so that \[\begin{equation} \mathbf\theta^{(t)}|\mathbf\theta^{(t-1)}\sim{}N(A\mathbf\theta^{(t)},\Lambda), \end{equation}\] with \[\begin{equation} A= \begin{pmatrix} I_d & 0 & 0 & 0 & \ldots & 0 & 0 & 0 & \ldots & 0\\ 0 & R_\eta & 0 & 0 & \ldots & 0 & 0 & 0 & \ldots & 0\\ 0 & 0 & R_1 & 0 & \ldots & 0 & 0 & 0 & \ldots & 0\\ 0 & 0 & 0 & R_2 & \ldots & 0 & 0 & 0 & \ldots & 0\\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ldots & \vdots\\ 0 & 0 & 0 & 0 & \ldots & R_m & 0 & 0 & \ldots & 0\\ 0 & 0 & 0 & 0 & \ldots & 0 & Q_1 & 0 & \ldots & 0\\ 0 & 0 & 0 & 0 & \ldots & 0 & 0 & Q_2 & \ldots & 0\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & 0 & 0 & \ldots & 0 & 0 & 0 & \ldots & Q_p\\ \end{pmatrix}, \end{equation}\] with \(I_d\) being a \(d\) dimensional indicator matrix, and \[\begin{equation} \Lambda= \begin{pmatrix} \Lambda_y & 0 & 0 & 0 & \ldots & 0 & 0 & 0 & \ldots & 0\\ 0 & \Lambda_\eta & 0 & 0 & \ldots & 0 & 0 & 0 & \ldots & 0\\ 0 & 0 & \Lambda_1 & 0 & \ldots & 0 & 0 & 0 & \ldots & 0\\ 0 & 0 & 0 & \Lambda_2 & \ldots & 0 & 0 & 0 & \ldots & 0\\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ldots & \vdots\\ 0 & 0 & 0 & 0 & \ldots & \Lambda_m & 0 & 0 & \ldots & 0\\ 0 & 0 & 0 & 0 & \ldots & 0 & \Gamma_1 & 0 & \ldots & 0\\ 0 & 0 & 0 & 0 & \ldots & 0 & 0 & \Gamma_2 & \ldots & 0\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & 0 & 0 & \ldots & 0 & 0 & 0 & \ldots & \Gamma_p\\ \end{pmatrix}. \end{equation}\] The observations \[\begin{equation} \mathbf{w}^{(t)}=(\mathbf{\hat{y}}^{(t)'},\mathbf{\hat{x}}^{(t)'}_{1,1},\ldots,\mathbf{\hat{x}}^{(t)'}_{1,p},\ldots,\mathbf{\hat{x}}^{(t)'}_{m,1},\ldots,\mathbf{\hat{x}}^{(t)'}_{m,p})' \end{equation}\] are \[\begin{equation} S^{(t)}\mathbf{w}^{(t)}\sim{}N(S^{(t)}B(\mathbf{\theta}^{(t)} + \mathbf\zeta),S^{(t)}\Sigma S^{(t)'}) \label{eq:DLM_obs} \end{equation}\] where \[\begin{equation} B= \begin{pmatrix} I_d & 0 & 0 & 0 & \ldots & 0 & 0 & 0 & \ldots & 0\\ M_{1,1} & M_{1,1} & M_{1,1} & 0 & \ldots & 0 & M_{1,1} & 0 & \ldots & 0\\ M_{1,2} & M_{1,2} & M_{1,2} & 0 & \ldots & 0 & 0 & M_{1,2} & \ldots & 0\\ \vdots & \vdots & \vdots & \vdots & \ddots & 0 & \vdots & \vdots & \ddots & \vdots\\ M_{1,p} & M_{1,p} & M_{1,p} & 0 & \ldots & 0 & 0 & 0 & \ldots & M_{1,p}\\ M_{2,1} & M_{2,1} & 0 & M_{2,1} & \ldots & 0 & M_{2,1} & 0 & \ldots & 0\\ M_{2,2} & M_{2,2} & 0 & M_{2,2} & \ldots & 0 & 0 & M_{2,2} & \ldots & 0\\ \vdots & \vdots & \vdots & \vdots & \ddots & 0 & \vdots & \vdots & \ddots & \vdots\\ M_{2,p} & M_{2,p} & 0 & M_{2,p} & \ldots & 0 & 0 & 0 & \ldots & M_{2,p}\\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots\\ M_{m,1} & M_{m,1} & 0 & 0 & \ldots & M_{m,1} & M_{m,1} & 0 & \ldots & 0\\ M_{m,2} & M_{m,2} & 0 & 0 & \ldots & M_{m,2} & 0 & M_{m,2} & \ldots & 0\\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots\\ M_{m,p} & M_{m,p} & 0 & 0 & \ldots & M_{m,p} & 0 & 0 & \ldots & M_{m,p}\\ \end{pmatrix}, \end{equation}\] \[\begin{equation} \mathbf\zeta=(\mathbf{0}',\mathbf{\delta}',\mathbf\gamma'_1,\ldots,\mathbf\gamma'_m,\mathbf\iota'_1,\ldots,\mathbf\iota'_p)', \end{equation}\] with \(\mathbf{0}\) being a \(d\) dimensional vector of 0s, \[\begin{equation} \Sigma= \begin{pmatrix} \Sigma_y & 0 &\ldots & 0 & 0 & \ldots & 0\\ 0 & \Sigma_{1,1} & \ldots & 0 & 0 & \ldots & 0\\ \vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & \Sigma_{1,p} & 0 & \ldots & 0 \\ 0 & 0 & \ldots & 0 & \Sigma_{2,1} & \ldots & 0\\ \vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & 0 & 0 & \ldots & \Sigma_{m,p} \end{pmatrix} \end{equation}\] and \(S^{(t)}\) is a matrix that is used to describe which variables are observed at time \(t\). The \(S^{(t)}\) matrix is an \(r(t) \times \left(d + \sum_{k=1}^m pn_k\right)\) matrix, where \(r(t)\) is the total number of observations and simulator/driver outputs available at time \(t\), with each element given by \[ S^{(t)}_{i, j} = \begin{cases} 1 & \mathrm{if\ the \ }i^{th} \mathrm{\ available\ observation\ /\ output\ is\ the\ }j^{th}\mathrm{\ of\ all\ possible\ observation\ /\ outputs.} \\ 0 & \mathrm{otherwise} \end{cases} \]
For example, if all observations and the simulators/drivers give an output at time \(t\) then \(S^{(t)}=I_{d+\sum_{k=1}^{m}pn_k}\). If all simulators/drivers output the variables of interest but there are no observations then \(S^{(t)}\) is a \(\left(\sum_{k=1}^m pn_k\right) \times \left(d + \sum_{k=1}^m pn_k\right)\) matrix, equal to \(I_{d+\sum_{k=1}^{m}pn_k}\) with the first \(d\) rows removed.
Here we generate synthetic data for simulator/driver outputs and observations, then fit the ensemble model using these data. We do this for an ensemble model of 3 simulators and 2 drivers, each describing 3 variables of interest (which we imagine to be species biomass).
We generate the synthetic data by sampling from an informative prior
distribution of the ensemble model. We choose priors with strong
correlations of individual long-term discrepancies (\(\mathrm{Beta}(5,1)\) via the method of
concordance) and high autocorrelations of the AR processes on individual
and shared short-term discrepancies (\(\frac{r_i + 1}{2}\sim
\mathrm{Beta}(10,2)\)). These are different to the vague priors
that will later be used to fit the model. We can configure this prior
using the EnsemblePrior()
constructor function.
d <- 3 # The number of variables.
priors <- EnsemblePrior(d,
ind_st_params = IndSTPrior(AR_params=c(10,2)),
ind_lt_params = IndLTPrior("beta",
cor_params = list(matrix(5, d, d),
matrix(1, d, d)),var_params = list(1,0.5)),
sha_st_params = ShaSTPrior(AR_params=c(10,2)),
)
From this configuration, sampling the ensemble model parameters is
done using the prior_ensemble_model()
function. In this
case we sample the prior of an ensemble model with \(m_{1} = 3\) simulators and \(m_{2} = 2\) drivers.
M <- 3 # The number of simulators we are considering.
MM <- 2 # The number of drivers we are considering.
prior_density <- prior_ensemble_model(priors, M = M, MM = MM)
ex.fit <- rstan::extract(prior_density$samples) # Extracted samples
Of these samples, the data we choose to make up our synthetic data is the single sample which maximizes the likelihood (see above section). The samples from the prior distribution provide us with \(A\) and \(\Lambda\) values. To get from these ensemble model parameters to simulated model outputs requires using the above transformations. Values of the truth and discrepancy terms are generated using these values and sums of discrepancy terms to give the “best guesses” of simulator/driver combined outputs.
Time <- 50
true_par <-which.max(ex.fit$lp__)
latent <- matrix(NA, Time, (M+MM+2)*d)
#Priors on initial values
latent[1, 1:d] <- rnorm(d, 0, 1)
latent[1, -(1:d)] <- t(chol(ex.fit$SIGMA_init[true_par,-(1:d) ,-(1:d)])) %*% rnorm((M+MM+1)*d, 0, 1)
#Find all the latent values of the dynamic linear model
SIGMA <- ex.fit$SIGMA[true_par,,]
SIGMA_chol <- t(chol(SIGMA))
for (t in 2:Time) {
latent[t,] <- ex.fit$AR_params[true_par,] * latent[t-1,] + SIGMA_chol %*% rnorm((M+MM+2)*d, 0, 1)
}
#The truth is simply the first d values
truth_latent <- latent[,1:d]
#The best guesses are sums of the truth and discrepancies
simulator_best_guesses <- array(NA,dim=c(Time,d,M*MM))
for(i in 1:M){
for (j in 1:MM){
simulator_best_guesses[,,MM*(i-1)+j] <- t(
t(latent[,1:d] + latent[,(d+1):(2*d)] + latent[,(i + 1)*d + (1:d)] + latent[,(M + j + 1)*d + (1:d)]) +
ex.fit$ind_lt[true_par,i,] + ex.fit$ind_lt_dri[true_par,j,] + ex.fit$sha_lt[true_par,])
}
}
We can see how the truth relates to simulator/driver best guesses by plotting these outputs, for example for the first variable.
plotlist <- list()
for (sim in 1:M){
for (dri in 1:MM){
plotlist[[MM*(sim-1)+dri]] <- cbind(c(1:Time),simulator_best_guesses[,1,MM*(sim-1)+dri],rep(sim,Time),rep(dri,Time),rep("sim",Time))
}
}
plot_df <- data.frame(rbind(do.call(rbind, plotlist), cbind(c(1:Time), truth_latent[,1], rep(0,Time), rep(0,Time), rep("truth",Time))))
names(plot_df) <- c("Year", "Value", "Simulator", "Driver", "Type")
plot_df$Year <- as.numeric(plot_df$Year)
plot_df$Value <- as.numeric(plot_df$Value)
bgplot <- ggplot2::ggplot(plot_df) + geom_line(data = plot_df[which(plot_df$Type == "sim"),], aes(x = Year, y = Value, color = Simulator, linetype = Driver), linewidth = 0.8) + geom_line(data = plot_df[which(plot_df$Type == "truth"),], aes(x = Year, y = Value), linewidth = 1, color = "purple")