This example comes directly from Maslen et al. 2023, for more details behind the methodology go to this paper. If you use this package for sample size analysis in an associated paper or otherwise, please cite Maslen et al. 2023.
Researchers within the Operation Crayweed Restoration Project in Sydney are restoring the locally extinct macro-algae Phyllospora comosa (“crayweed”: see Coleman et al., 2008; Campbell et al., 2014; Vergés et al., 2020 and are interested in the effect of this restoration on associated ecological communities (Marzinelli et al., 2014, 2016; Wood et al., 2019). Pilot data have already been collected, where the abundance of fish species in nine open ocean sites have been recorded in 2015. We are interested in observing if there is a change in mean fish abundance between control sites (those in Sydney without crayweed) and restored sites (similar sites where crayweed has recently been transplanted). There are plans to collect more data in the future, however there is an upper bound of approximately 24 possible spatially independent restored or control coastal bays/sites within the Sydney region and surroundings.
In this example we attempt to answer the following experimental design questions:
Ecopower can be installed from CRAN using
install.packages("ecopower")
, however for the latest
version users can install from GitHub with
devtools::install_github("BenMaslen/ecopower")
.
The first 34 columns of the fish
data set make up the
multivariate abundance matrix of fish counts across our 9 sites.
Below we first fit a manyglm
model, and then a copula
model using the cord
function from the
ecoCopula
package to our pilot data as our data generating
model.
To specify an effect of interest, we first specify a list of responses (or taxa) we believe are ‘increasing’ or ‘decreasing’ in response to our treatment. Responses/taxa we do not specify as either ‘increasers’ or ‘decreasers’ are assumed to have no effect.
Next we specify the multiplicative effect size and term of interest
in the effect_alt function. Here effect_size=1.2
is a \(20\%\) change in mean abundance.
#specify 'increaser' and 'decreaser' species:
increasers <- c("Aplodactylus.lophodon", "Atypichthys.strigatus", "Cheilodactylus.fuscus",
"Olisthops.cyanomelas", "Pictilabrius.laticlavius")
decreasers <- c("Abudefduf.sp", "Acanthurus.nigrofuscus", "Chromis.hypsilepis",
"Naso.unicornis", "Parma.microlepis", "Parupeneus.signatus",
"Pempheris.compressa", "Scorpis.lineolatus", "Trachinops.taeniatus")
#specify effect size of interest
coeff.alt <- effect_alt(fit, effect_size = 1.2, increasers, decreasers, term = "Site.Type")
Notice in the original pilot data we have three treatments but are
only interested in comparing control vs. restored sites. In order to
simulate under only control and restored sites, we can create a new
dataset where reference sites have been relabeled as restored (below -
fish_data_sim
) and simulate under this design using the
newdata
argument in powersim()
.
Note that we could have just started out fitting a model to only restored and control sites, however we would lose samples that could be used to help estimate nuisance parameters like overdispersion and covariance. Thus we are effectively ‘borrowing strength’ from the reference sites to help in the estimation of our nuisance parameters.
Let’s start by estimating power for a single sample size specification.
Below we calculate power for a sample size of \(N=100\) using our pre-specified effect size
(coeff.alt
) of \(20\%\)
differences in mean abundance.
Note that nsim
and ncrit
have been set to
50 and 49 respectively in order to run timely, however in practice you
would want to set these to larger values (we recommend setting
nsim=1000
and ncrit=4999
) to get a more
accurate estimate of power.
powersim(fit.cord, N=100, coeff.alt, term="Site.Type", nsim=50, ncrit=49, newdata = fish_data_sim, ncores=2)
#> Time elapsed: 0 hr 0 min 20 sec
#> Power: 0.48
Based on this initial power simulation we would need more samples in order to obtain the conventional power target of \(80\%\).
In order to answer our first experimental design question, we can estimate power across a range of sample sizes for an effect size of \(20\%\) differences in mean abundance. By plotting the results, we can observe the sample size required to reliably detect (with a conventional power target of \(80\%\)) the effect of interest.
Users could also plot multiple power curves to get an idea of the variability in the power estimates (as in Maslen et al. 2023).
Note: The below code takes ~10minutes-1hour to run depending on the
speed of your computer and the number of logical processors you have. As
such, this code has not been evaluated by default. Instead, the
resultant figure has been added for reference. If you want to run this
code and reduce computation time, you can reduce nsim
or
ncrit
, however this will also make power estimates more
variable.
First we estimate power at each sample size of interest:
#specify sample sizes to simulate under
sample_sizes <- c(10, 25, 50, 75, 100, 125, 150, 175, 200, 225, 250, 275, 300)
power_estimates <- rep(NA, length(sample_sizes))
#loop through sample sizes and estimate power at each step
for (i in 1:length(sample_sizes)){
power_estimates[i] <- powersim(fit.cord, N = sample_sizes[i], coeff.alt, term = "Site.Type",
nsim = 1000, ncrit = 4999, newdata = fish_data_sim)
}
Next, let’s save the results as a data frame and produce a plot against the conventional power target of \(80\%\):
#store the results in a data.frame
powercurve_dat <- data.frame(sample_sizes = sample_sizes, Power = unlist(power_estimates))
#plot power curves
ggplot(powercurve_dat, aes(x = sample_sizes, y = Power)) +
geom_line(size = 1.06) + theme_bw() + geom_hline(yintercept = 0.8, linetype = "dashed", color = "red", size=1) +
xlab("N") + ylab("Power") + scale_x_continuous(breaks = sample_sizes) +
scale_y_continuous(breaks = c(0.2, 0.4, 0.6, 0.8, 1.0))
This suggests that we require a sample size of at-least \(N \approx 150\) in order to likely detect \(20\%\) changes in mean abundances, using a conventional power target of \(80\%\).
Thus we have answered our first experimental design question!
In order to answer the second experimental design question, we can estimate and then plot multiple effect size curves for different effect size specifications (let’s do this for effect sizes of \(10\%, 20\%, \ldots, 100\%\) changes in mean abundance by specifying effect_size\(=1.1, 1.2, \ldots, 2\). We will also do this across sample sizes that we can sample using a balanced sampling design (from \(N=4,6,\ldots,24\)).
As we are simulating from copula models with small sample sizes, we
will also increase n.samp
(number of simulations for
importance sampling in copula modelling) and decrease nlv
to 1 (number of latent factors in the copula) as recommended in the
cord
function description.
Note: The below code takes ~30minutes-2hours to run depending on the
speed of your computer and the number of logical processors you have. As
such, this code has not been evaluated by default. Instead, the
resultant figure has been added for reference. If you want to run this
code and reduce computation time, you can reduce nsim
or
ncrit
, however this will also make power estimates more
variable.
First we estimate power at each sample size and effect size of interest:
#specify effect sizes and sample sizes to loop through
sample_sizes <- c(3:12)*2
effect_sizes <- c(1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2)
mult_power_estimates <- rep(NA,length(sample_sizes)*length(effect_sizes))
#loop through sample sizes and effect sizes
for (j in 1:length(effect_sizes)){
coeff.alt <- effect_alt(fit, effect_size = effect_sizes[j], increasers, decreasers, term = "Site.Type")
for (i in 1:length(sample_sizes)){
try(mult_power_estimates[10*(j-1)+i] <- powersim(fit.cord, N = sample_sizes[i], coeff.alt, term = "Site.Type", nsim = 1000,
ncrit = 4999, newdata = fish_data_sim, nlv = 1, n.samp = 500))
}
}
Next, let’s save the results as a data frame and produce a plot of the resulting power estimates:
#store the results in a data.frame
mult_powercurve_dat <- data.frame(sample_sizes = rep(sample_sizes, times = length(effect_sizes)),
power_estimates = unlist(mult_power_estimates),
Perc_Change = factor(rep(c("10%", "20%", "30%", "40%", "50%", "60%",
"70%", "80%", "90%", "100%"), each = 10),
levels = c("100%", "90%", "80%", "70%", "60%", "50%",
"40%", "30%", "20%", "10%")),
effect_sizes = rep(effect_sizes, each = length(sample_sizes)))
#plot multiple power curve
ggplot(mult_powercurve_dat, aes(x = sample_sizes, y = power_estimates, colour = Perc_Change)) +
geom_line(linewidth = 1.05) + theme_bw() + geom_hline(yintercept = 0.8, linetype = "dashed", color = "red", size = 1) +
xlab("N") + ylab("Power") + scale_x_continuous(breaks = sample_sizes) +
scale_y_continuous(breaks = c(0.2, 0.4, 0.6, 0.8, 1.0)) + scale_color_brewer(palette = "Paired", direction = -1, name = "% change")