DEET relies on the following packages. Since they are all CRAN in origin, they should download and install automatically with devtools::install_github or utils::install.packages. The required dependencies are listed below.
All processed DEGs, metadata, and enriched pathways in formats compatible with this package as well as other methods such as gene set enrichment analysis are stored here: http://wilsonlab.org/public/DEET_data/
No functions within DEET automatically load data for the user, so the data either needs to be downloaded directly from the ftp, or using the downloader function.
The DEET_data_download function, with possible inputs “ALL”, “metadata”, “enrich”, and “feature_extract” automatically downloads the data required to run DEET_enrich and/or DEET_feature_extract.
We reccomended using:
downloaded <- DEET_data_download("ALL")
metadata <- downloaded$metadata
DEET_feature_extract_input <- downloaded$DEET_feature_extract
DEET_enrich_input <- downloaded$DEET_enrichHere: DEET_enrich_input replaces DEET_example_data for DEET_enrich(). DEET_feature_extract_input replaces DEET_feature_extract_example_matrix for DEET_feature_extract() Lastly, metadata is not directly used in any of the function, but summarizes all of the pairwise comparisons using the following columns.
Once download, save these data and DEET can be used offline.
A comparison - by - explanatory piece of data dataframe providing important details to contextualize each study. For every pairwise comparison, the study name, source (SRA, TCGA, GTEx and SRA-manual), description from the DRA compendium, the number of samples (total, up-condition, and down-condition), samples (total ,up-condition, down-condition), tissue (including tumour from TCGA), number of DEs (total, up-condition, down-condition), age (mean +- sd), sex, top 15 DEGs - up, top 15 DEGs - down, top 5 enriched pathways, and top 5 enriched TFs. PMID are also available for studies selected from SRA. Lastly, each pairwise comparison was given an overall category based on those decided in Crow et al., 2019.
This is the meat and potatoes of the DEET dataset. Here, you can find all of the significant DE genes computed within DEET (padj < 0.05), DEGs, pathways, and TFs sorted into *gmt files compatible with traditional pathway enrichment tools (e.g., GSEA, gprofiler etc.), respective metadata, and the pathway enrichment and TF enrichment files used to generate the internal pathway enrichments of DEET_enrich. A more specific breakdown of these objects are below:
http://download.baderlab.org/EM_Genesets/.http://download.baderlab.org/EM_Genesets/.metadata dataframe (see above).A gene by comparison matrix populated by the log2Fold-change of genes that are significantly DE in the comparison (padj < 0.05). The file is the input to the mat variable in DEET_feature_extract.
The primary function of the DEET R package is to allow users to query their own list of DEGs against the consistently computed DEGs within DEET by using the function DEET_enrich(). The optimal input into DEET_enrich() is a data frame of genes (human gene symbols) with an associated p-value and coefficient (e.g., Fold-change) in conjunction with a list of genes designating the statistical background. DEET_enrich() first identifies enriched biological pathways and TF targets using the *.gmt files used for all of the DEET comparisons (i.e., “Human_GO_AllPathways_with_GO_iea_June_01_2021_symbo.gmt” for pathways and “Human_TranscriptionFactors_MSigdb_June_01_2021_symbol.gmt” for TFs), allowing us to not only compare overlapping genes between the user-inputted genes and the DEGs in DEET but also overlapping pathways and TFs. All gene-set enrichment within the DEET_enrich() functions use ActivePathways with all detected genes as the background, Brown’s p-value fusion method, a false-discovery rate for p-value correction, and a cutoff of 0.05. Then, DEET_enrich() enriches the users’ inputted genes, pathways, and TF targets against the DEGs, pathways, and TF targets stored within DEET. Enrichment of the user’s inputted gene lists against the DE comparisons within DEET are also completed with ActivePathways, with a minimum geneset filter of 15 and a maximum of 10000. Then, DEET_enrich() computes the Spearman’s and Pearson’s correlation between the coefficients within the user’s imputed list of DEGs that overlap with the log2(Fold-change) of DEGs within enriched pairwise comparisons.P-values of these correlations are corrected with an FDR-adjustment. Together, DEET_enrich() returns significantly enriched studies based on overlapping DEGs, pathways, and TFs. Similarly, DEET_enrich() returns the traditional pathway and TF motif enrichment of the inputted gene list. All enrichment outputs are in the format of the output of ActivePathways (study, FDR-adjusted p-value, input length, DE comparison length, overlapping genes). DEET_enrich() also returns a dataframe of the Spearman’s and Pearson’s correlation (with associated FDR-adjusted p-values) between the inputted DE list with the DEGs found in DEET as well as the intersecting DEGs within those studies. Optionally, DEET_enrich() may be used with a generic gene list (i.e. without P-values or coefficients). If the inputted gene list is ordered, then the p-value is artificially generated as equation 1 and the coefficient is artificially generated as equation 2. We assume an inputted list in decreasing order of significance, so the FDR and coef in equations 1 and 2 are reversed. DEET_enrich() then runs normally but Pearson’s correlation between the inputted gene list and the DEGs within DEET are excluded. If the inputted gene list is unordered, then all of the p-values are set to 0.049 and both Spearman’s and Pearon’s correlations between the users inputted genes and the DEGs within DEET are excluded. If users do not provide a background set of genes, then we assume the background set is all genes detected within DEET.
For a sorted list of genes without a p-value or coefficient: Note, this happens internally, you do not have to do it.
DEG_list <- c("a", "b", "c", "d") # list of genes user inputs
DEG_processed <- data.frame(gene_symbol = DEG_list)
# DEG list is the list of genes that the user inputs
      padj <- 0.049
      for(i in 2:nrow(DEG_processed)) {
        padj[i] <- padj[i-1] * 0.95
      }
      padj <- rev(padj)
      log2fc <- rev(seq(1, 1 + 0.1*(nrow(DEG_processed) - 1), 0.1))
      DEG_processed$padj <- padj
      DEG_processed$coef <- log2fc
      colnames(DEG_processed) <- c("gene_symbol", "padj", "coef")The DEET R package also contains plotting functions to summarize the most significant studies based on each enrichment test and correlation within DEET_enrich(). The proccess_and_plot_DEET_enrich() function plots barplots of the most enriched studies based on gene set enrichment (ActivePathways) of studies enriched based on overlapping DEGs, pathways, and TF targets. The DEET_plot_correlation() function generates scatterplots of the most enriched studies based on Spearman’s correlation analysis. All plots are generated using ggplot2, and the functions return the ggplot2 objects, allowing researchers to further modify and/or save the plots.
Lastly, the DEET R package contains a function called DEET_feature_extract(), which allows researchers to identify genes that are associated with metadata. If the response variable are continuous (e.g., number of DEGs in study, Fold-change of TP53 etc.) then features are extracted by calculating the coefficients from a Gaussian family elastic net regression using the glmnet R package, as well as Spearman’s correlation between every gene and the response variable. If the response variable is categorical (e.g., Source, Category etc.), then features are extracted by calculating the coefficients from a multinomial family elastic net regression, as well as an ANOVA between each category within the response variable. Lastly, in the response variable is ordinal (e.g., enriches for TNFa pathway, Cancer study yes/no etc.), then features are extracted using a binomial family elastic net regression, as well as a Wilcoxon’s test between the two categories within the response variable.
DEET_enrich() expects a list of human gene symbols as either a character vector or as a data frame with the columns c("gene_symbol", "padj", "coef"). If you are inputted DEGs, the log2Fold-change is the most appropriate coefficient, but I left this general for other inputted data types (e.g., effect size from GWAS). Theexample_DEET_enrich_inputhas the structure required for a gene list with p-values and coefficients. An example of the unordered list isexample_DEET_enrich_input$gene_symbol`.
DEET_dataset should either be the example file in DEET_example_data or DEET_enrich_input downloaded by the downloader function or manually from the FTP. other options will likely cause the tool to crash.
ordered: This parameter only applies when the DEG_list variable is a character vector (not a dataframe). It determins if the inputted list is ordered or not.
background: a character vector of genes within the universe. For example, if your input is a list of DEGs in RNA-seq, then “background” should be all of your detected genes. Like in traditional pathway enrichment, not seeting a background may bias the enriched studies by tissue and/or cell-type as more highly expressed genes have more power to be detected as DE.
The output of these three comparisons will be comparable, however the correlation variable is of note when the input gene set is just a list of genes.
When the gene set is ordered, a Spearman’s correlation is interpretable, as it is simply the rank-order of genes, however a Pearson’s correlation is not interpretable as we do not know the relative difference in coefficient size of your inputted genes
If the gene list is unordered, correlation analysis is entirely uninterpretable and is not run. You are given this message: Input gene list is considered UNORDERED: Correlation analysis will not be run and pathway enrichment will be unordered.
Since it not run the output is No variance in coefs. Cannot proceed with correlation.
Named list where each element contains 6 objects. Each object will contain the results (enrichment or correlation) and corresponding metadata.
AP_INPUT_BP_output but instead of enriched gene ontologies, there is the enriched TFs.results, which is the same data.table as in AP_INPUT_BP_output but instead of pathways, we are enriching against DE comparisons found within DEET. The second is metadata which is a subset of the larger metadata dataframe that corresponds with the gene sets in results.AP_DEET_DE_output where pathway gene ontology term names are in the place of DE gene names.AP_DEET_DE_output where pathway gene transcription factor term names are in the place of DE gene names.results is a data frame where the rows are different pairwise comparisons and the columns are the ouput of the correlational analysis between overlapping genes. Columns are the DEET ID and DEET name, the pearson/spearman’s correlation coefficients, as well as their raw and FDR adjusted p-values. The metadata object are the subsetted rows from the metadata data frame that align with the studies containing significant correlations. Lastly, the distributions object is a list where each element is a diffierent significantly associated pairwise comparisons. Each element is populated by a dataframe where rows are genes significant in at least one study, columns are the coefficients from the input study and DEET, and colour designates whether the gene is DE in one study (grey), both studies in the same direction (purple) and both studies in the opposite direction (orange).DEET_feature_extract expects three variables: a gene-by-comparison matrix populated with a statistic related to differential expresion (e.g., p-value, fold-change), a response variable (i.e., an output dependent variable), and a category.
mat - a gene-bycomparison matrix populated with a statistic related to differential expression. The matrix in DEET_data_download() is populated with the log2FC of genes if they were deemed as significant (padj < 0.05). Other matrices, namely those populated by p-value, fold-change, and t-statitic (including the matrices for all DEGs) can be found in http://wilsonlab.org/public/DEET_data/feature_matrices and can be downloaded separately.
respeonse: a character vector identifying each study, this can be categorical, binomial (e.g. 0/1, T/F), or continuous
datatype: a character indicating if the response is “binomial”, “categorical”, or “continuous”. This is case sensitive.
DEET feature extract outputs a list of three objects.
datatype is “continuous”, then this will just be a one-element list.glmnet are default aside from the family, which is determined by datatype and alpha which is set to 0.5 (selecting an elastic net instead of L1 or L2 regularized).datatype input was categorical, binomial or continuous respsectively. The other object is a dataframe showing the significance of each gene, where rows are genes, and columns are stastistics (e.g., F-statistic for ANOVA), P-value, and FDR-adjusted p-value.The proccess_and_plot_DEET_enrich() function is a wrapper that generates barplots and a dot plot of enrichment for all of the individual outputs of DEET_enrich() (not the correlations). The outputs are in ggplot2 objects, allowing users to further modify the plots or print however they like.
DEET_enrich() a named list with the ActivePathways enrichment from each data type. If there wasn’t enrichment of a certain datatype (e.g. AP_DEET_BP_output) the function runs properly but it is not plotted.The remaining varables are for graphical parameters that are passed into the DEET_enrichment_plot() function.
data("example_DEET_enrich_input")
data("DEET_example_data")
DEET_out <- DEET_enrich(example_DEET_enrich_input, DEET_dataset = DEET_example_data)
plotting_example <- proccess_and_plot_DEET_enrich(DEET_out, text_angle = 45,
horizontal = TRUE, topn=4)Another example Where AP_DEET_BP_output is not significant, to show the plotting function still works.
data("example_DEET_enrich_input")
data("DEET_example_data")
DEET_out <- DEET_enrich(example_DEET_enrich_input, DEET_dataset = DEET_example_data)
DEET_out$AP_DEET_DE_output <- "No enrichment to be plotted"
plotting_example <- proccess_and_plot_DEET_enrich(DEET_out, text_angle = 45,
horizontal = TRUE, topn=4)There are up to four outputs assuming everything is significant, each output is a list or a ggplot object.
DEET_enrich()).DEET_enrichment_plot. An example of this is shown below.DE_example <- DEET_out$AP_DEET_DE_output$results
# Changes for DEET_example_plot
DE_example$term.name <- DEET_out$AP_DEET_DE_output$metadata$DEET.Name
DE_example$domain <- "DE"
DE_example$overlap.size <- lengths(DE_example$overlap)
DE_example$p.value <- DE_example$adjusted.p.val
DE_example_plot <- DEET_enrichment_plot(list(DE_example = DE_example), "DE_example")As shown above, from here you can also just use DEET_enrichment_plot directly to have some more control over these plots.
This function also takes the direct output from DEET_enrich and generates scatterplots of the correlations of studies whose log2FCs are significantly correlated with the input DE list.
correlation_input - The DE_correlations object that is the output of the DEET_enrich function. It only works if there was at least one study that was significantly correlated.
As mentioned previously, the genesets within DEET are easily transferrable to other gene set enrichment datasets.
One option is to download the *gmt files diretly from our ftp. http://wilsonlab.org/public/DEET_data/DEET_DE.gmt Is directly compatible with these tools.
The other option would be to save the downloaded DEET gmt as a gmt file. This is completed using the ActivePathways R package. Instead of using the example data as shown below, please use the full dataset.
Instead of saving to a temporary directory like in this vignette, save the file wherever you want the directory to be saved.
If you have two dependent gene lists to input into DEET, you can use ActivePathways directly to find combind DEET-comparison enrichment of the two gene sets.
set.seed(1234) # as I sample p-values to make the toy example
# For example two, I had the same genes but I shuffled the p-value 
example_DEET_enrich_input$padj2 <- sample(example_DEET_enrich_input$padj, length(example_DEET_enrich_input$padj), replace = FALSE)
# Make a gene-by-input-list matrix of the adjusted p-values from your multiple gene sets
AP_matrix <- as.matrix(example_DEET_enrich_input[,c("padj", "padj2")])
# Run activepathways on the combined matrix.
# Get gmt file, again from the whole list:
DEET_gmt <- DEET_example_data$DEET_gmt_DE
head(AP_matrix)
AP_example_out <- ActivePathways::ActivePathways(scores=AP_matrix, gmt=DEET_gmt, geneset.filter = c(5,10000),correction.method = "fdr")The outputs of using ActivePathways are the same as DEET_enrich() but with a couple extra columns. * evidence: Whether the DEET comparison is enriched because of one gene list, both gene lists, or an integrated version of these gene lists * Genes_colname: The genes that contributed to enrichment from each inputted gene list.