R Sys.Date()Cancer is driven by somatic mutations. A few of these mutations confer a survival advantage to the tumour (drivers) while most mutations play a passive role in tumour development (passengers). Most known driver mutations are located in protein-coding genes in whole exome sequencing datasets. As whole genome sequencing datasets are increasingly available, new methods are required to discover driver mutations in the vast noncoding regulatory genome. ActiveDriverWGS is a statistical model to detect candidate driver mutations genome-wide.
ActiveDriverWGS is a recurrence-based method which builds on the idea that driver mutations are subject to positive selection and should appear more frequently than expected by chance alone. This method analyzes the mutational burden of SNVs and short indels (less than 50bps) in functionally defined genomic elements. These elements can include the protein-coding regions of genes (i.e., exons), as well as noncoding regulatory regions such as promoters, enhancers and untranslated regions. Each element can include one or multiple sub-elements; for example protein-coding genes have multiple exons tested together in ActiveDriverWGS while the introns of the are considered as background.
Optionally, the tested genomic elements may also include active sites of interest which reside within the elements themselves. Examples of such active sites include post-translational modification (PTM) sites in protein-coding genes, miRNA binding sites in mRNA sequences and transcription factor binding sites (TFBS) in enhancers. If sites are specified in addition to the elements, enrichment of site-specific mutations are estimated in elements with enriched mutational burden. This additional test asks if the sites in a predicted driver element are enriched in mutations even beyond the mutations seen in the given driver region.
ActiveDriverWGS uses a Poisson generalized linear regression to compare the mutational burden of elements against the expected mutational burden of a background window (Figure 1). In our work, we have optimized the background to be 50,000 bps upstream and downstream of the elements. For an element comprising a single sub-element the background sequence would total to 100 kbps.
If an element is segmented as multiple sub-elements, such as the exons of a protein coding gene, the inter-segment sequences are also used to calculate the expected background rate, up to +/- 50 kbps around every segment. ActiveDriverWGS also incorporates the effect of mutational signatures, specifically the probability distribution of SNVs arising across trinucleotide contexts which vary with mutational processes. Users also have the option to exclude hyper-mutated samples which decrease the accuracy of driver discovery.
The default genome build for ActiveDriverWGS is
hg19. Additional options for human
(hg38) and mouse (mm9,
mm10) are now supported. Please refer to the user
manual and the option ref_genome of
ActiveDriverWGS().
For a more detailed reference on ActiveDriverWGS, please refer to the publication. Helen Zhu, Liis Uuskula-Reimand, Keren Isaev*, Lina Wadi, Azad Alizada, Shimin Shuai, Vincent Huang, Dike Aduluso-Nwaobasi, Marta Paczkowska, Diala Abd-Rabbo, Oliver Ocsenas, Minggao Liang, J. Drew Thompson, Yao Li, Luyao Ruan, Michal Krassowski, Irakli Dzneladze, Jared T. Simpson, Mathieu Lupien, Lincoln D. Stein, Paul C. Boutros, Michael D. Wilson, Jüri Reimand. Candidate Cancer Driver Mutations in Distal Regulatory Elements and Long-Range Chromatin Interaction Networks. Molecular Cell (2020), https://doi.org/10.1016/j.molcel.2019.12.027 .
The ActiveDriverWGS Model
ActiveDriverWGS requires a file for somatic mutations
and a file for genomic elements. A third optional file for
sites can be specified by the user. Elements may contain
multiple segments, each represented in a separate row. Segments
belonging to the same element must share a common id unique to the
element. Sites are only incorporated in the model if they reside within
elements but not all elements need to contain sites. Elements may
contain multiple sites. Site ids have to match element ids.
The mutations data must be in a data frame containing
the columns with the correct column names chr,
pos1, pos2, ref,
alt, and patient. Additional columns may be
included but will not be analyzed. Patient ID is required since per
element, only one mutation per patient is counted towards
element-specific mutation rate and driver prediction. This allows us to
avoid inflation of significance in cases where an element is locally
hypermutated in a single patient.
chr: autosomal chromosomes as chr1 to chr22 and sex
chromosomes as chrX and chrY for the human genome, and chr1 to chr19 for
mouse
pos1: the start position of the mutation in base 1
coordinates
pos2: the end position of the mutation in base 1
coordinates
ref: the reference allele as a string containing the
bases A, T, C or G
alt: the alternate allele as a string containing the
bases A, T, C or G
patient: the patient identifier as a string
The elements and sites data must be in data
frames containing the columns with the correct column names
chr, start, end, and
id. Additional columns may be included but will not be
analyzed.
chr: autosomal chromosomes as chr1 to chr22 and sex
chromosomes as chrX and chrY
start: the start position of the element or site in
base 0 coordinates (BED format)
end: the end position of the element or site in base
0 coordinates (BED format)
id: the element identifier - if the element contains
multiple segments such as exons, each segment should be a separate row
with the segment coordinates and the element identifier as id. Elements
can be coding or noncoding such as exons of protein coding genes or
active enhancers.
## chr pos1 pos2 ref alt patient
## 779701 chr6 96651182 96651183 AA T 001-0002-03TD
## 779702 chr10 106556005 106556005 C T 001-0002-03TD
## 779703 chr10 107457352 107457352 C T 001-0002-03TD
## 779704 chr10 108111334 108111334 C A 001-0002-03TD
## 779705 chr10 109605024 109605024 T A 001-0002-03TD
## [ reached 'max' / getOption("max.print") -- omitted 1 rows ]
## chr start end id
## 648 chr1 2488103 2488172 TNFRSF14
## 649 chr1 2489164 2489273 TNFRSF14
## 650 chr1 2489781 2489907 TNFRSF14
## 651 chr1 2491261 2491417 TNFRSF14
## 652 chr1 2492062 2492153 TNFRSF14
## 653 chr1 2492932 2492963 TNFRSF14
## chr start end id
## 1 chr1 11169412 11169416 MTOR
## 2 chr1 11169418 11169422 MTOR
## 3 chr1 11169424 11169427 MTOR
## 4 chr1 11169706 11169709 MTOR
## 5 chr1 11169711 11169718 MTOR
## 6 chr1 11169720 11169724 MTOR
For elements and sites written in BED12 files and BED4 files, the
functions prepare_elements_from_BED12() and
prepare_elements_from_BED4() can be used to read the file
and adapt it to fulfill the format requirements for the
elements and sites parameters of the
ActiveDriverWGS function. For more information on the BED12 or BED4
format, please refer to the UCSC
guidelines. In this example, elements are adapted from annotations
for protein coding genes from GENCODE.v19
for chromosome 17 and sites are adapted from post-translational
modification (TPM) sites from ActiveDriverDB.
elements = prepare_elements_from_BED12(
system.file(
"extdata",
"chr17.coding_regions.bed",
package = "ActiveDriverWGS",
mustWork = TRUE))
head(elements)## chr start end id
## 1 chr17 6006 6168 gc19_pc.cds::gencode::DOC2B::ENSG00000272636.1
## 2 chr17 11205 11332 gc19_pc.cds::gencode::DOC2B::ENSG00000272636.1
## 3 chr17 11871 11981 gc19_pc.cds::gencode::DOC2B::ENSG00000272636.1
## 4 chr17 13920 13995 gc19_pc.cds::gencode::DOC2B::ENSG00000272636.1
## 5 chr17 22327 22407 gc19_pc.cds::gencode::DOC2B::ENSG00000272636.1
## 6 chr17 30897 31270 gc19_pc.cds::gencode::DOC2B::ENSG00000272636.1
sites = prepare_elements_from_BED4(
system.file(
"extdata",
"chr17.PTM_sites.bed",
package = "ActiveDriverWGS",
mustWork = TRUE))
head(sites)## chr start end id
## 1 chr17 63526097 63526102 gc19_pc.cds::gencode::AXIN2::ENSG00000168646.8
## 2 chr17 63526104 63526108 gc19_pc.cds::gencode::AXIN2::ENSG00000168646.8
## 3 chr17 63526110 63526114 gc19_pc.cds::gencode::AXIN2::ENSG00000168646.8
## 4 chr17 63526116 63526117 gc19_pc.cds::gencode::AXIN2::ENSG00000168646.8
## 5 chr17 63526119 63526123 gc19_pc.cds::gencode::AXIN2::ENSG00000168646.8
## 6 chr17 63526125 63526126 gc19_pc.cds::gencode::AXIN2::ENSG00000168646.8
ActiveDriverWGS can be run with the mutations file, the elements file and an optional sites file. In this example, mutations are adapted from the Alexandrov et al, 2013 dataset for chronic lymphocytic leukemia (CLL) patients. Regions are adapted from the cancer gene census and annotations for protein coding genes are adapted from GENCODE.v19.
some_genes = c("ATM", "MYD88", "NOTCH1")
results = ActiveDriverWGS(mutations = cll_mutations,
elements = cancer_genes[cancer_genes$id %in% some_genes,],
sites = cancer_gene_sites)All three genes have a significant enrichment of somatic mutations as
indicated by the fdr_element field. Note that NOTCH1 is
also highlighted for its site-related mutations since two of two
mutations in NOTCH1 affect a PTM site and the FDR value
fdr_site is also significant.
ActiveDriverWGS has several adjustable parameters:
window_size: A background window both upstream and
downstream of the element in which mutation rates are assumed to remain
constant. We have optimized this parameter on the PCAWG dataset to be
50,000 bps for SNVs and indels. For a single non-fragmented element the
background sequence is thus 100 kbps, while elements with multiple
fragments capture more sequence space and therefore their background
sequence is added up to a maximum of 2x50 kbps per sub-element.
filter_hyper_MB: The threshold for the number of
somatic mutations (SNVs and indels combined) per megabase above which a
sample is considered hypermutated. We define the default to be 30
mutations/megabase according to published literature. The genome is
assumed to be 3000 mbps.
recovery.dir: The directory for writing recovery
files for ActiveDriverWGS. If the directory does not exist, it will be
created. If the parameter is unspecified, recovery files will not be
saved. As an ActiveDriverWGS query for large datasets may be
computationally heavy, specifying a recovery directory will recover
previously computed results if a query is interrupted. The results of
each element are stored in this folder and can be recovered if the
process is terminated while in progress.
mc.cores: The number of cores that the user wishes
to allocate to running ActiveDriverWGS. For more information, refer to
the R package parallel.
## id pp_element element_muts_obs element_muts_exp element_enriched
## 1 ATM 0.002092488 2 0.11059736 TRUE
## 2 MYD88 0.003309630 1 0.01337728 TRUE
## pp_site site_muts_obs site_muts_exp site_enriched fdr_element fdr_site
## 1 0.5494199 0 0.17113693 FALSE 0.004964277 0.6697083
## 2 0.6697083 0 0.08695652 FALSE 0.004964277 0.6697083
## has_site_mutations
## 1
## 2
## [ reached 'max' / getOption("max.print") -- omitted 1 rows ]
ActiveDriverWGS will return results in a data frame format with the following columns.
id: The identifier for the element.
pp_element: The p-value associated with enrichment
of mutations in the element. A value of NA indicates that no mutations
fall within the element.
element_muts_obs: Number of patients with mutations
in the element. A value of NA indicates that no mutations fall within
the element.
element_muts_exp: Number of expected patients with
mutations in the element. A value of NA indicates that no mutations fall
within the element.
element_enriched: Boolean indicating whether an
enrichment of mutations in the element is observed. A value of NA
indicates that no mutations fall within the element.
pp_site: The p-value associated with enrichment of
mutations in the sites.
site_muts_obs: Number of patients with mutations in
the sites. A value of 0 means that sites exist but are unaffected by
mutations. A value of NA indicates that no site resides within the
element.
site_muts_exp: Number of expected patients with
mutations in the sites. A value of 0 means that sites exist but are
unaffected by mutations. A value of NA indicates that no site resides
within the element.
site_enriched: Boolean indicating whether an
enrichment of mutations in the sites is observed. A value of NA
indicates that no site resides within the element.
fdr_element: FDR corrected p-value associated with
the element.
fdr_site: FDR corrected p-value associated with the
site.
has_site_mutations: “V” indicating the presence of
site mutations. An empty string indicates that no site mutations are
present.
In ActiveDriverWGS, multiple testing is performed for all given regions using the Benjamini-Hochberg FDR method. We encourage the users to filter by FDR corrected values rather than p-values to eliminate false positives. FDR correction makes the conservative assumption that genes/regions with 0 mutations have P = 1. FDR correction for sites is conducted over a restricted set of hypotheses, comprising genes/regions that have a significant enrichment of mutations (FDR < 0.05) at the level of genes or regions.
Compute time increases with the number of samples, mutations and regions. Hence, the two main functions integral to ActiveDriverWGS have also been made available in the package and can be adapted to individual local high performance computing clusters (HPCCs). One approach is to assign a subset of testable elements (and/or cancer types) to individual compute nodes and later use an additional script that collects the data from these nodes.
This function formats the mutations data frame, removes hyper-mutated samples and removes non-mitochondrial mutations in extrachromosomal regions. It adds an additional column to the mutations data frame that provides the trinucleotide context of the given mutation which will be later used to estimate the mutational distribution across signatures.
This function calculates the enrichment of mutations for a particular region id. It applies a Poisson generalized linear regression model across mutation signatures to identify enriched regions.
The following example demonstrates how to build an ActiveDriverWGS
pipeline which can be adapted to HPCCs. Parallel jobs are executed for a
list of element ids whereas mutation data and element coordinates are
prepared prior to the process. Note that the creation of GRanges objects
is part of the ActiveDriverWGS() wrapper function and must
be completed manually by users wishing to create personalized pipelines.
The function ADWGS_test() needs a link to the reference
genome object (from the R package BSgenome) as an argument,
and assumes that the mutations and element coordinates match the
reference genome. Also, note that multiple testing corrections will need
to be recalculated by the user after the results have been collected
from individual jobs.
library(GenomicRanges)
# Loading elements & creating a GRanges object
data(cancer_genes)
gr_element_coords = GRanges(seqnames = cancer_genes$chr,
IRanges(start = cancer_genes$start,
end = cancer_genes$end),
mcols = cancer_genes$id)
# Loading sites & creating a GRanges object
data(cancer_gene_sites)
gr_site_coords = GRanges(seqnames = cancer_gene_sites$chr,
IRanges(start = cancer_gene_sites$start,
end = cancer_gene_sites$end),
mocols = cancer_gene_sites$id)
# Loading mutations, format muts & creating a GRanges object
data(cll_mutations)
# load the default reference genome
this_genome = BSgenome.Hsapiens.UCSC.hg19::Hsapiens
# format_muts
cll_mutations = format_muts(cll_mutations, this_genome,
filter_hyper_MB = 30)
gr_maf = GRanges(cll_mutations$chr,
IRanges(start = cll_mutations$pos1,
end = cll_mutations$pos2),
mcols = cll_mutations[,c("patient", "tag")])
# Examplifying the ATM Element
id = "ATM"Note that when splitting tasks using the
ADWGS_test function, only the parameter id
needs to modified for each element while gr_element_coords,
gr_sites and gr_maf can be the complete
datasets. However, the compute time and memory can be saved in very
large analyses by providing more-optimal subsets of these datasets for
each test, for example those limited to specific chromosomes.
# Result of 1 input element
result = ADWGS_test(id = id,
gr_element_coords = gr_element_coords,
gr_site_coords = gr_site_coords,
gr_maf = gr_maf,
win_size = 50000, this_genome = this_genome)
result## id pp_element element_muts_obs element_muts_exp element_enriched pp_site
## 1 ATM 0.002092488 2 0.1105974 TRUE 0.5494199
## site_muts_obs site_muts_exp site_enriched
## 1 0 0.1711369 FALSE
For questions, technical support or to report bugs and errors, please use our GitHub.