Introduction to enrichit

Guangchuang Yu
School of Basic Medical Sciences, Southern Medical University

2025-12-17

Introduction

Functional enrichment analysis is a standard step in bioinformatics for interpreting gene lists identified from omics experiments. The enrichit package provides fast, C++-based implementations of two of the most widely used methods:

  1. Over-Representation Analysis (ORA)
  2. Gene Set Enrichment Analysis (GSEA)

The package is designed for speed and for easy integration into existing workflows, with standardized output formats.

Installation

You can install enrichit from GitHub:

# install.packages("devtools")  # run first if devtools is not installed
devtools::install_github("YuLab-SMU/enrichit")

Over-Representation Analysis (ORA)

ORA tests whether a known gene set (e.g., a biological pathway) is over-represented among a set of genes of interest (e.g., differentially expressed genes), that is, whether the overlap between the two is larger than would be expected by chance.

Method

enrichit implements ORA with the hypergeometric test, which is equivalent to a one-sided Fisher’s exact test. The p-value is the probability of observing at least k genes from a given gene set among the n selected genes, when the background population (universe) contains N genes, M of which belong to that set:

$$ p = 1 - \sum_{i=0}^{k-1} \frac{\binom{M}{i} \binom{N-M}{n-i}}{\binom{N}{n}} $$
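
As a quick check of the formula, the same upper-tail probability can be computed with base R's phyper(). The sketch below assumes the values used for PathwayA in the example of this vignette (N = 1000, M = 50, n = 20, k = 20). It does not call enrichit, and the package's C++ code may evaluate the tail differently.

```r
# Values assumed from the vignette's ORA example (PathwayA)
N <- 1000  # universe size
M <- 50    # genes in the gene set
n <- 20    # selected genes
k <- 20    # overlap between the selection and the gene set

# P(X >= k): phyper() returns P(X <= q), so request the upper tail at q = k - 1
p_tail <- phyper(k - 1, M, N - M, n, lower.tail = FALSE)

# Literal transcription of the formula; floating-point cancellation makes it
# unreliable when the true p-value is far below machine precision
p_naive <- 1 - sum(dhyper(0:(k - 1), M, N - M, n))

print(p_tail)
print(p_naive)
```

Here p_tail should agree with the PathwayA p-value reported in the example output (about 1.39e-28), whereas p_naive cannot resolve values much smaller than roughly 1e-16. This is why the upper tail is requested directly instead of subtracting a cumulative sum from 1.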

Example

library(enrichit)

# Simulate a universe of 1000 genes
universe <- paste0("Gene", 1:1000)

# Define gene sets
gene_sets <- list(
  PathwayA = paste0("Gene", 1:50),       # Genes 1-50 (50 genes)
  PathwayB = paste0("Gene", 800:850)     # Genes 800-850 (51 genes)
)

# Select 'significant' genes (e.g., top 20 genes)
# All 20 belong to PathwayA, so PathwayA should be enriched
sig_genes <- paste0("Gene", 1:20)

# Run ORA
ora_result <- ora(
  gene = sig_genes,
  gene_sets = gene_sets,
  universe = universe
)

# View results
as.data.frame(ora_result)
        ID SetSize Count DESize UniverseSize       pvalue
1 PathwayA      50    20     20         1000 1.388265e-28
2 PathwayB      51     0     20         1000 1.000000e+00
                                                                                                                              geneID
1 Gene8/Gene19/Gene4/Gene3/Gene17/Gene14/Gene11/Gene10/Gene7/Gene1/Gene12/Gene2/Gene15/Gene5/Gene6/Gene9/Gene18/Gene20/Gene16/Gene13
2                                                                                                                                   
  GeneRatio BgRatio RichFactor FoldEnrichment
1     20/20 50/1000        0.4             20
2      0/20 51/1000        0.0              0
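
The derived columns in this output can be reproduced by hand. The sketch below infers the definitions from the numbers shown for PathwayA (RichFactor as Count / SetSize, FoldEnrichment as GeneRatio divided by BgRatio). These definitions are assumptions read off the table, not taken from the enrichit source, and the variable names are illustrative.

```r
# Numbers copied from the PathwayA row of the output
count         <- 20    # Count: selected genes that fall in the gene set
set_size      <- 50    # SetSize: genes in the gene set
de_size       <- 20    # DESize: number of selected genes
universe_size <- 1000  # UniverseSize: background genes

gene_ratio      <- count / de_size             # GeneRatio, printed as 20/20
bg_ratio        <- set_size / universe_size    # BgRatio, printed as 50/1000
rich_factor     <- count / set_size            # share of the set that was selected
fold_enrichment <- gene_ratio / bg_ratio       # observed vs. expected proportion

print(c(RichFactor = rich_factor, FoldEnrichment = fold_enrichment))
```

Under these assumed definitions the values match the table: a RichFactor of 0.4 and a FoldEnrichment of 20.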

Gene Set Enrichment Analysis (GSEA)

GSEA evaluates whether a defined set of genes shows statistically significant, concordant differences between two biological states. Unlike ORA, GSEA uses the entire ranked list of genes, avoiding the need for arbitrary thresholds to select “significant” genes.

Method

enrichit offers a fast C++ implementation of GSEA. It calculates an Enrichment Score (ES) that reflects the degree to which a gene set is over-represented at the top or bottom of a ranked list of genes.
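
The running-sum statistic behind the ES can be sketched in a few lines of base R. The function below follows the classic weighted formulation of GSEA (hits increase the sum in proportion to the absolute gene score, misses decrease it uniformly). The name running_es and the exponent argument are hypothetical, and enrichit's C++ implementation may differ in its details.

```r
# Hypothetical helper for illustration; this is not an enrichit function.
running_es <- function(stats, gene_set, exponent = 1) {
  stats <- sort(stats, decreasing = TRUE)       # rank genes, highest score first
  in_set <- names(stats) %in% gene_set          # TRUE for hits, FALSE for misses
  hit_weight <- ifelse(in_set, abs(stats)^exponent, 0)
  step_up <- hit_weight / sum(hit_weight)       # weighted increment at each hit
  step_down <- (!in_set) / sum(!in_set)         # uniform decrement at each miss
  running <- cumsum(step_up - step_down)        # running-sum statistic
  unname(running[which.max(abs(running))])      # ES = largest deviation from zero
}

# Toy ranked list: 20 genes with scores 10, 9, ..., 1, -1, ..., -10
stats <- setNames(c(10:1, -(1:10)), paste0("Gene", 1:20))

es_top    <- running_es(stats, paste0("Gene", 1:5))          # set at the top
es_bottom <- running_es(stats, paste0("Gene", 16:20))        # set at the bottom
es_mixed  <- running_es(stats, paste0("Gene", c(3, 8, 14)))  # scattered set

print(c(top = es_top, bottom = es_bottom, mixed = es_mixed))
```

A set that occupies the very top of the list reaches an ES of 1 and a set at the very bottom reaches -1 (up to floating-point rounding), while a scattered set stays closer to zero.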

The package supports different methods for p-value calculation:

  1. Multilevel (method = "multilevel"): This is the default and recommended method. It uses an adaptive multi-level splitting Monte Carlo approach to estimate small p-values efficiently and accurately, similar to the approach used in the fgsea package.
  2. Simple Permutation (method = "permute"): Standard permutation of gene labels.
  3. Sample Permutation (method = "sample"): Random sampling of gene sets (faster but less rigorous for some null hypotheses).
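
To make the idea of a permutation-based p-value concrete, the sketch below builds an empirical null distribution from random gene sets of the same size and compares the observed ES against it. All helper names are hypothetical and the code is plain R, not enrichit's C++ implementation. The same-sign comparison and the +1 pseudocount are common conventions assumed here, not details confirmed by the package.

```r
# Hypothetical helper; 'stats' must already be sorted in decreasing order.
es_score <- function(stats, gene_set, exponent = 1) {
  in_set <- names(stats) %in% gene_set
  hit <- ifelse(in_set, abs(stats)^exponent, 0)
  running <- cumsum(hit / sum(hit) - (!in_set) / sum(!in_set))
  unname(running[which.max(abs(running))])
}

set.seed(1)
stats <- sort(rnorm(200), decreasing = TRUE)
names(stats) <- paste0("Gene", seq_along(stats))
top_set <- names(stats)[1:20]          # a set concentrated at the top

obs <- es_score(stats, top_set)        # observed enrichment score

# Null distribution: ES of random gene sets of the same size
n_perm <- 1000
null_es <- replicate(
  n_perm,
  es_score(stats, sample(names(stats), length(top_set)))
)

# Compare only against null scores with the same sign as the observed ES
same_sign <- null_es[sign(null_es) == sign(obs)]
p_emp <- (sum(abs(same_sign) >= abs(obs)) + 1) / (length(same_sign) + 1)

# Normalized ES: observed ES scaled by the mean same-sign null ES
nes <- obs / mean(abs(same_sign))

print(c(ES = obs, NES = nes, p = p_emp))
```

With a fixed number of random sets, the smallest p-value this scheme can report is on the order of 1 / n_perm, however strong the enrichment. This illustrates why an adaptive scheme such as the multilevel method is useful for strongly enriched gene sets.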

Example

# Generate synthetic ranked gene list
set.seed(42)
geneList <- sort(rnorm(1000), decreasing = TRUE)
names(geneList) <- paste0("Gene", 1:1000)

# Define gene sets
# PathwayTop is enriched at the top (positive ES)
# PathwayBottom is enriched at the bottom (negative ES)
gene_sets <- list(
  PathwayTop = names(geneList)[1:50],
  PathwayBottom = names(geneList)[951:1000],
  PathwayRandom = sample(names(geneList), 50)
)

# Run GSEA using the multilevel method
gsea_result <- gsea(
  geneList = geneList,
  gene_sets = gene_sets,
  method = "multilevel",
  nPerm = 1000,    # Base permutations
  minGSSize = 10,
  maxGSSize = 500
)

# View results
head(gsea_result)
             ID enrichmentScore        NES       pvalue setSize   log2err
1    PathwayTop       1.0000000  3.9449035 2.491505e-12      50 0.8986712
2 PathwayBottom      -1.0000000 -3.7744199 1.797993e-12      50 0.9101197
3 PathwayRandom      -0.2638292 -0.9958021 4.639175e-01      50       NaN
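
When many gene sets are tested, raw p-values are usually adjusted for multiple testing before interpretation. The sketch below rebuilds a small table by hand from the printed output and applies the Benjamini-Hochberg correction with base R's p.adjust(). The column name p.adjust and the 0.05 cutoff are assumptions for illustration; whether gsea() already returns adjusted p-values is not visible in the output shown here.

```r
# Values copied from the printed GSEA output
gsea_tbl <- data.frame(
  ID     = c("PathwayTop", "PathwayBottom", "PathwayRandom"),
  NES    = c(3.9449035, -3.7744199, -0.9958021),
  pvalue = c(2.491505e-12, 1.797993e-12, 4.639175e-01)
)

# Benjamini-Hochberg adjustment across all tested gene sets
gsea_tbl$p.adjust <- p.adjust(gsea_tbl$pvalue, method = "BH")

# Keep significant sets and order them by absolute NES
significant <- gsea_tbl[gsea_tbl$p.adjust < 0.05, ]
significant <- significant[order(-abs(significant$NES)), ]

print(significant)
```

Only PathwayTop and PathwayBottom pass the cutoff; PathwayRandom keeps its p-value of about 0.46 after adjustment because it is the largest of the three.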

Working with GSON

enrichit accepts GSON objects directly. The GSON class, defined in the gson package, stores gene sets together with their metadata in a structured form: gene identifiers, gene set names, and other associated information.

# Assuming you have a GSON object 'g'
# result <- gsea_gson(geneList = geneList, gson = g)
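
If the gene sets start out as a named list, as in the examples in this vignette, one plausible route to a GSON object is to reshape the list into a long two-column table first. The sketch below does the reshaping in base R. The column names gsid and gene, and the commented gson::gson() call, reflect the gson package's constructor as assumed here and should be checked against its documentation.

```r
# Small named list of gene sets (toy data)
gene_sets <- list(
  PathwayA = paste0("Gene", 1:3),
  PathwayB = paste0("Gene", 4:5)
)

# Long format: one row per (gene set, gene) pair
gsid2gene <- stack(gene_sets)[, c("ind", "values")]
names(gsid2gene) <- c("gsid", "gene")
gsid2gene$gsid <- as.character(gsid2gene$gsid)

print(gsid2gene)

# Assumed usage; requires the gson package and a ranked geneList:
# g <- gson::gson(gsid2gene = gsid2gene)
# result <- gsea_gson(geneList = geneList, gson = g)
```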

Session Info

sessionInfo()
R version 4.5.2 (2025-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26200)

Matrix products: default
  LAPACK version 3.12.1

locale:
[1] LC_COLLATE=C                               
[2] LC_CTYPE=Chinese (Simplified)_China.utf8   
[3] LC_MONETARY=Chinese (Simplified)_China.utf8
[4] LC_NUMERIC=C                               
[5] LC_TIME=Chinese (Simplified)_China.utf8    

time zone: Asia/Shanghai
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] enrichit_0.0.8

loaded via a namespace (and not attached):
 [1] digest_0.6.39     fastmap_1.2.0     xfun_0.54         yulab.utils_0.2.3
 [5] rappdirs_0.3.3    knitr_1.50        htmltools_0.5.8.1 rmarkdown_2.30   
 [9] cli_3.6.5         compiler_4.5.2    tools_4.5.2       evaluate_1.0.5   
[13] Rcpp_1.1.0        yaml_2.3.11       rlang_1.1.6       jsonlite_2.0.0   
[17] fs_1.6.6