---
title: "Getting started"
output:
rmarkdown::html_vignette:
toc: true
bibliography: refs.bib
vignette: >
%\VignetteIndexEntry{Getting started}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
comment = "#>",
message = F, warning = F,
fig.align = "center",
echo = T
)
library(dplyr)
library(wqtrends)
library(plotly)
library(english)
```
This package can be used to assess water quality trends for long-term monitoring data in estuaries using Generalized Additive Models and mixed-effects meta-analysis [@Wood17; @Sera19]. These models are appropriate for data typically from surface water quality monitoring programs at roughly monthly or biweekly collection intervals, covering at least a decade of observations [e.g., @Cloern16]. Daily or continuous monitoring data covering many years are not appropriate for these methods, due to computational limitations and a goal of the analysis to estimate long-term, continuous trends from irregular or discontinuous sampling.
# Basic usage
The sample dataset `rawdat` is included in the package and is used for the examples below. This dataset includes monthly time series data over ~30 years for nine stations in South Bay, San Francisco Estuary. Data are available for `r english(length(unique(rawdat$param)))` water quality parameters. All data are in long format with one observation per row.
The data are pre-processed to work with the GAM fitting functions included in this package. The columns include date, station number, parameter name, and value for the date. Additional date columns are included that describe the day of year (`doy`), date in decimal time (`cont_year`), year (`yr`), and month (`mo` as character label). These are required for model fitting or use with the analysis/plotting functions.
```{r}
head(rawdat)
```
One GAM model can be fit to the time series data. Each GAM fits additive smoothing functions to describe variation of the response variable (`value`) over time, where time is measured as a continuous number. The basic GAM used by this package is as follows:
* `S`: value ~ s(year, k = *large*)
The `cont_year` vector is measured as a continuous numeric variable for the annual effect (e.g., January 1st, 2000 is 2000.0, July 1st, 2000 is 2000.5, etc.). The function `s()` models `cont_year` as a smoothed, non-linear variable. The optimal amount of smoothing on `cont_year` is determined by cross-validation as implemented in the mgcv package [@Wood17] and an upper theoretical upper limit on the number of knots for `k` should be large enough to allow sufficient flexibility in the smoothing term. The upper limit of `k` was chosen as 12 times the number of years for the input data. If insufficient data are available to fit a model with the specified `k`, the number of knots is decreased until the data can be modelled, e.g., 11 times the number of years, 10 times the number of years, etc.
The `anlz_gam()` function is used to fit the model. First, the raw data are filtered to select only station 34 and the chlorophyll parameter. The model is fit using a log-10 transformation of the response variable. Available transformation options are log-10 (`log10`) or identity (`ident`). The log-10 transformation is used by default if not specified by the user.
```{r}
tomod <- rawdat %>%
filter(station %in% 34) %>%
filter(param %in% "chl")
mod <- anlz_gam(tomod, trans = "log10")
mod
```
All remaining functions use the model results to assess fit, calculate seasonal metrics and trends, and plot results.
The fit can be assessed using `anlz_smooth()` and `anlz_fit()`, where the former assesses the individual smoother functions and the latter assesses overall fit. The `anlz_smooth()` results show the results for the fit to the `cont_year` smoother as the effective degrees of freedom (`edf`), the reference degrees of freedom (`Ref.df`), the test statistic (`F`), and statistical significance (`p-value`). The significance is in part based on the difference between `edf` and `Ref.df`. The `anlz_fit()` results show the overall summary of the model as Akaike Information Criterion (`AIC`), the generalized cross-validation score (`GCV`), and the `R2` values. Lower values for `AIC` and `GCV` and higher values for `R2` indicate better model fit.
```{r}
anlz_smooth(mod)
anlz_fit(mod)
```
The plotting functions show the results in different formats. If appropriate for the response variable, the model predictions are back-transformed and the scales on each plot are shown in log10-scale to preserve the values of the results.
The `show_prddoy()` function shows estimated results by day of year with separate lines for each year.
```{r, fig.height = 5, fig.width = 6}
ylab <- "Chlorophyll-a"
show_prddoy(mod, ylab = ylab)
```
The `show_prdseries()` function shows predictions for the model across the entire time series. Points are the observed data and the lines are the predicted.
```{r, fig.height = 4, fig.width = 6}
show_prdseries(mod, ylab = ylab)
```
The `show_prdseason()` function is similar except that the model predictions are grouped by month. This provides a simple visual depiction of changes by month over time. The trend analysis functions below can be used to statistically test the seasonal changes.
```{r, fig.height = 4, fig.width = 6}
show_prdseason(mod, ylab = ylab)
```
Finally, the `show_prd3d()` function shows a three-dimensional fit of the estimated trends across year and day of year with the z-axis showing the estimates for the response variable.
```{r}
show_prd3d(mod, ylab = ylab)
```
# Trend testing
Statistical tests for evaluating trends are available in this package. These methods are considered "secondary" analyses that use results from a fitted GAM to evaluate trends or changes over time. In particular, significance of changes over time are evaluated using mixed-effect meta-analysis [@Sera19] applied to the GAM results to allow for full propagation of uncertainty between methods. Each test includes a plotting method to view the results.
## Evaluating changes between time periods
The `anlz_perchg()` and `show_perchg()` functions can be used to compare annual averages between two time periods of interest. The functions require base and test year inputs that are used for comparison. More than one year can be entered for the base and test years, e.g., `baseyr = c(1990, 1992, 1993)` vs. `testyr = c(2014, 2015, 2016)`.
```{r}
anlz_perchg(mod, baseyr = 2006, testyr = 2017)
```
To plot the results for one GAM, use the `show_perchg()` function. The plot title summarizes the results.
```{r, fig.height = 4, fig.width = 9}
show_perchg(mod, baseyr = 2006, testyr = 2017, ylab = "Chlorophyll-a (ug/L)")
```
## Evaluating seasonal changes over time
The `anlz_metseason()`, `anlz_mixmeta()`, and `show_metseason()` functions evaluate seasonal metrics (e.g., mean, max, etc.) between years, including an assessment of the trend for selected years using mixed-effects meta-analysis modelling. These functions require inputs for the seasonal ranges to evaluate (`doyend`, `doystr`) and years for assessing the trend in the seasonal averages/metrics (`yrstr`, `yrend`).
The `anlz_metseason()` function estimates the seasonal metrics (including uncertainty as standard error) for results from the GAM fit. The seasonal metric can be any summary function available in R, such as seasonal maxima (`max`), minima (`min`), variance (`var`), or others. The function uses repeated resampling of the GAM model coefficients to simulate multiple time series as an estimate of uncertainty for the summary parameter.
The inputs for `anlz_metseason()` include the seasonal range as day of year using start (`doystr`) and end (`doyend`) days and the `metfun` and `nsim` arguments to specify the summary function and number of simulations, respectively. Here we show the estimate for the maximum chlorophyll in each season, using a relatively low number of simulations. Repeating this function will produce similar but slightly different results because the estimates are stochastic. In practice, a large value for `nsim` should be used to produce accurate results (e.g., `nsim = 1e5`).
```{r}
metseason <- anlz_metseason(mod, metfun = max, doystr = 90, doyend = 180, nsim = 100)
metseason
```
The `anlz_mixmeta()` function uses results from the `anlz_metseason()` to estimate the trend in the seasonal metric over a selected year range. Here, we evaluate the seasonal trend from 2006 to 2017 for the seasonal estimate of the model results above.
```{r}
anlz_mixmeta(metseason, yrstr = 2006, yrend = 2017)
```
The `show_metseason()` function plots the seasonal metrics and trends over time. The `anlz_metseason()` and `anlz_mixmeta()` functions are used internally to get the predictions. The same arguments for these functions are used for `show_metseason`, with the mean as the default metric.
```{r, fig.height = 4, fig.width = 9, echo = T}
show_metseason(mod, doystr = 90, doyend = 180, yrstr = 2006, yrend = 2017, ylab = "Chlorophyll-a (ug/L)")
```
To plot only the seasonal metrics, the regression line showing trends over time can be suppressed by setting one or both of `yrstr` and `yrend` as `NULL`.
```{r, fig.height = 4, fig.width = 9, echo = T}
show_metseason(mod, doystr = 90, doyend = 180, yrstr = NULL, yrend = NULL, ylab = "Chlorophyll-a (ug/L)")
```
Adding an argument for `metfun` to `show_metseason()` will plot results and trends for a metric other than the average. Note the use of `nsim` in this example. In practice, a much higher value should be used (e.g., `nsim = 1e5`)
```{r, fig.height = 4, fig.width = 9, echo = T}
show_metseason(mod, metfun = max, nsim = 100, doystr = 90, doyend = 180, yrstr = 2006, yrend = 2017, ylab = "Chlorophyll-a (ug/L)")
```
For convenience, the `anlz_sumstats()` function returns a list of summary statistics for the GAM and associated mixed-effect meta-analysis model. This function can be useful for creating tabular results of the models. The list output includes `mixmet` as a `mixmeta` object of the fitted mixed-effects meta-analysis trend model, `metseason` as a tibble object of the fitted seasonal metrics as returned by `anlz_metseason()` or `anlz_avgseason()`, `summary` of the `mixmet` object, and `coeffs` as a tibble object of the slope estimate coefficients from `mixmet`. An approximately linear slope estimate will be included as `slope.approx` in `coeffs` if `trans = 'log10'` for the GAM used in `mod`.
```{r}
anlz_sumstats(mod, metfun = mean, doystr = 90, doyend = 180, yrstr = 2006, yrend = 2017)
```
The seasonal estimates and mixed-effects meta-analysis regression can be used to estimate the rate of seasonal change across the time series. For any given year and seasonal metric, a trend can be estimated within a specific window (i.e., `yrstr` and `yrend` arguments in `show_metseason()`). This trend can be estimated for every year in the period of record to estimate the rate of change over time for the seasonal estimates.
The `anlz_trndseason()` function estimates the rate of change and the `show_trndseason()` function plots the results. For both, all inputs required for the `anlz_metseason()` function are required, in addition to the desired window width to evaluate for each year (`win`) and the justification for the window as `"left"`, `"right"`, or `"center"` from each year (`justify`).
It's important to note the behavior of the centering for window widths (`win` argument) if choosing even or odd values. For left and right windows, the exact number of years in `win` is used. For example, a left-centered window for 1990 of ten years will include exactly ten years from 1990, 1991, ... , 1999. The same applies to a right-centered window, e.g., for 1990 it would include 1981, 1982, ..., 1990 (if those years have data). However, for a centered window, picking an even number of years for the window width will create a slightly off-centered window because it is impossible to center on an even number of years. For example, if `win = 8` and `justify = 'center'`, the estimate for 2000 will be centered on 1997 to 2004 (three years left, four years right, eight years total). Centering for window widths with an odd number of years will always create a symmetrical window, i.e., if `win = 7` and `justify = 'center'`, the estimate for 2000 will be centered on 1997 and 2003 (three years left, three years right, seven years total).
```{r}
trndseason <- anlz_trndseason(mod, doystr = 90, doyend = 180, justify = 'left', win = 5)
head(trndseason)
```
The `show_trndseason()` function can be used to plot the results directly, one model at a time.
```{r, fig.height = 5, fig.width = 9, echo = T}
show_trndseason(mod, doystr = 90, doyend = 180, justify = 'left', win = 5, ylab = 'Chl. change/yr, average')
```
As before, adding an argument for `metfun` to `show_trndseason()` will plot results and trends for a metric other than the average. Note the use of `nsim` in this example. In practice, a much higher value should be used (e.g., `nsim = 1e5`)
```{r, fig.height = 5, fig.width = 9, echo = T}
show_trndseason(mod, metfun = max, nsim = 100, doystr = 90, doyend = 180, justify = 'left', win = 5, ylab = 'Chl. change/yr, maximum')
```
The results supplied by `show_trndseason()` can be extended to multiple window widths by stacking the results into a single plot. Below, results for window widths from 5 to 15 years are shown using the `show_sumtrndseason()` function for a selected seasonal range using a left-justified window. This function only works with average seasonal metrics due to long processing times with other metrics. To retrieve the results in tabular form, use `anlz_sumtrndseason()`.
```{r, fig.height = 7, fig.width = 9, echo = T}
show_sumtrndseason(mod, doystr = 90, doyend = 180, justify = 'left', win = 5:15)
```
Lastly, the plots returned by `show_metseason()` and `show_trndseason()` can be combined using the `show_mettrndseason()` function. This plot will show the seasonal metrics from the GAM as in `show_metseason()` with the colors of the points for the seasonal metrics colored by the significance of the moving window trends shown in `show_trndseason()`. The four colors indicate increasing, decreasing, no trend, or no estimate (i.e., too few points for the window). Most of the arguments for `show_metseason()` and `show_trndseason()` apply to `show_mettrndseason()`.
```{r, fig.height = 5, fig.width = 9, echo = T}
show_mettrndseason(mod, metfun = mean, doystr = 90, doyend = 180, ylab = "Chlorophyll-a (ug/L)", win = 5, justify = 'left')
```
Four colors are used to define increasing, decreasing, no trend, or no estimate. The `cmbn` argument can be used to combine the no trend and no estimate colors into one color and label. Although this may be desired for aesthetic reasons, the colors and labels may be misleading with the default names since no trend is shown for points where no estimates were made.
```{r, fig.height = 5, fig.width = 9, echo = T}
show_mettrndseason(mod, metfun = mean, doystr = 90, doyend = 180, ylab = "Chlorophyll-a (ug/L)", win = 5, justify = 'left', cmbn = T)
```
# References