Preparing toxEval Data

22 November, 2024

Introduction

What kind of data could be used in a toxEval analysis? It was designed with concentration measurements from water samples as the primary use case. There may be other concentration measurements that could be used as well, but it is up to the researcher to determine if special considerations must be taken in those circumstances. For instance, there was a toxEval analysis done on the concentration of chemicals measured in eagle plasma.

For all cases within toxEval, a “sample” is considered a unique site/date. There are times when this might not be especially relevant to the data collection (passive samplers, groundwater samples at separate depths, etc.). The user will need to come up with strategies to deal with the limiting workflow. For example, single sites at different depths could add site suffixes (site_a_3m, site_a_6m, etc.). Passive samplers could use the start or end time as the sampling times. If a suffix is added in one tab of the data, it must be added to all tabs of the data. So, SiteID must match in the Data and Sites tabs.

Preparing the data

Input data for toxEval should be prepared in a Microsoft ™ Excel file using specifically named sheets (also known as tabs). There are 3 mandatory sheets (Data, Chemical, Sites), and 2 optional sheets (Exclude, Benchmarks). The sheets should appear as follows (although the order is not important):

Each sheet has mandatory columns; the order of the columns is not important, but the names of the columns are important. Additional columns can be included but will be ignored. The top row of each sheet must contain the column names (headers), and the second row should begin with the data. That means no titles or comment rows should precede the data.

Data

The “Data” sheet is used to define the measured concentrations to be evaluated in toxEval. Four columns are required in this sheet: “CAS”, “SiteID”, “Value”, and “Sample Date”. The columns can be in any order, but the first row of the sheet must be the header (column names).

Note: Additional columns may be useful to organize the data. These additional columns will be ignored by toxEval and will not influence a toxEval analysis. For example, many data sets have detection level or censoring information. In this version of toxEval, that type of analysis is ignored. The censored data can be entered in the Value column as the detection level, or half the detection level, 0, or some other strategy…that is currently up to the researcher. This is a topic that could be re-evaluated in future versions of toxEval.

As an example, the first several rows of a minimal example would look like this:

Chemicals

The “Chemicals” sheet is used to define the unique chemicals included in the “Data” sheet (so, 1 row per unique chemical). Two columns are required in this sheet: “CAS” and “Class”. The columns can be in any order, but the first row of the sheet must be the header (column names). If you need chemical names that do not match up with the “tox_chemical” list provided in the package, you will want to include a 3rd column “Chemical” which is the chemical name to use for plots and tables.

Note: Additional columns may be useful to organize the data. These additional columns will be ignored by toxEval and will not influence a toxEval analysis.

Sites

The “Sites” sheet is used to define site information for locations where samples were collected. Four columns are required in this sheet: “SiteID”, “Short Name”, “dec_lon”, and “dec_lat”. The columns can be in any order, but the first row of the sheet must be the header (column names).

Note: Additional columns may be useful to organize the data. These additional columns will be ignored by toxEval and will not influence a toxEval analysis.

Exclude

At times, it may be appropriate to exclude endpoints, chemicals, or specific endpoint:chemical combinations from a data analysis due to lack of relevance to the study objective or low confidence in specific portions of the data. The “Exclude” sheet is used for this purpose.

The “Exclude” sheet is optional, but if used, two columns are required: “CAS” and “endPoint”. They can be in any order, but the first row of the sheet should be the header (column names).

Why would you choose to exclude a chemical/endpoint value? There are times that the dose-response curves from ToxCast may not trigger any automated flags, but upon inspection, the curves seem suspect (NOTE: as of ToxCast database 4.1, there are more flags available and excluding based on flag does appear to capture most cases that should be excluded).

The easiest way to view the dose response curves is from the Comptox dashboard. The function endpoint_hits_DT includes an option to get a direct link to find the dose-response curves if the category is “Chemical”. This is handy to do quick checks on the endpoint/chemical combinations that produce the highest EARs. If the highest EAR values have dose-response curves that seem suspect, consider adding those to the “Exclude” tab, or at least trying to get more information on that endpoint/assay.

Another way to quickly check which endpoints to check on Comtox is to plot the endpoints for a single chemical using the ACC values provided in toxEval. The LOWER the endpoint’s ACC value, the higher the EAR will be, and therefore those are the endpoints you might want to check carefully:

library(toxEval)
library(ggplot2)

CAS <- c("1912-24-9")
default_id = c(5, 6, 11, 15, 18)

ACC_atrazine <- get_ACC(CAS) |> 
  dplyr::arrange(ACC_value) |> 
  dplyr::mutate(index = dplyr::row_number(),
                percent = 100*index/dplyr::n(),
                has_default_flag = colSums(sapply(flags, "%in%", x = default_id)) > 0)

ggplot() +
  geom_point(data = ACC_atrazine,
             aes(x = ACC_value, y = percent,
                 color = has_default_flag)) +
  xlab("ACC ug/L") +
  ylab("Percentile") +
  ggtitle("Individual ACC values for Atrazine") +
  geom_text(data = head(ACC_atrazine |> 
                          dplyr::filter(!has_default_flag),
                        n = 5),
            aes(y = percent, x = ACC_value,
                label = endPoint,
                color = has_default_flag), 
             size = 2, hjust = -0.2, show.legend = FALSE) +
  theme_bw() +
  scale_color_manual("Includes default\nflags for removal", 
                     values = c("blue", "grey80"))

Figure 1: All ACC values in ug/L of atrazine from ToxCast. The values that would be flagged by default are in grey, the values that would remain are in blue. If there were any outliers that do not have default flags and have very low ACC values, those dose-response curves should be verified.

Note: Additional columns may be useful to organize the data. These additional columns will be ignored by toxEval and will not influence a toxEval analysis.

Benchmarks

The user may provide a set of concentration benchmarks to be used in place of the ToxCast database. For example, there may be a need to perform similar toxEval analysis using EPA aquatic life benchmarks to compare measured concentrations against established toxicity thresholds. The “Benchmarks” sheet is used for this purpose. For more information, see here.

The “Benchmarks” sheet is optional, but if used, five columns are required: “CAS”, “Chemical”, “endPoint”, “Value”, and “groupCol”. They can be in any order, but the first row of the sheet should be the header (column names).

Note: Additional columns may be useful to organize the data. These additional columns will be ignored by toxEval and will not influence a toxEval analysis.

Summary

To summarize, there are 3 mandatory sheets (Data, Chemicals, Sites) and 2 option sheets (Exclude, Benchmarks). SiteID and CAS columns MUST match between each sheet.

Any use of trade, firm, or product names is for descriptive purposes only and does not imply endorsement by the U.S. Government.