TemporalForest: A Quick Start Guide

Sisi Shao

Department of Biostatistics, Fielding School of Public Health, University of California, Los Angeles, CA, USA

Jason H. Moore

Department of Biostatistics, Fielding School of Public Health, University of California, Los Angeles, CA, USA
Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, CA, USA

Christina M. Ramirez

Department of Biostatistics, Fielding School of Public Health, University of California, Los Angeles, CA, USA

Abstract

The TemporalForest package provides a reproducible method for feature selection in high-dimensional longitudinal data. It combines network analysis, mixed-effects models, and stability selection to identify robust predictors over time. This vignette offers a quick start guide to using the package.

1. Introduction

Longitudinal ’omics studies, in which subjects are measured repeatedly over time, present distinct challenges for feature selection: high dimensionality, temporal dependence, and complex correlations among features. The TemporalForest algorithm addresses these with a multi-stage pipeline that identifies features that are both predictive and stable across bootstrap resamples.

2. Installation

Since the package is not yet on CRAN, you can install the development version from GitHub:

# install.packages("remotes")
remotes::install_github("SisiShao/TemporalForest")

3. Quick Start: Primary Example

This example walks you through a complete analysis with a small, simulated dataset.

Simulate a Longitudinal Dataset

This small demo is designed to recover all three true signals quickly (about 1–3 s). We simulate a dataset with 60 subjects, 2 time points, and 20 candidate predictors, and inject 3 true signals into the outcome Y from predictors V1, V2, and V3. To keep the example fast and reliable under CRAN checks, we pass a precomputed dissimilarity matrix, which skips Stage 1 (WGCNA/TOM).

set.seed(11) # For reproducibility
n_subjects <- 60; n_timepoints <- 2; p <- 20

# Build X (two time points) with matching colnames
X <- replicate(n_timepoints, matrix(rnorm(n_subjects * p), n_subjects, p), simplify = FALSE)
colnames(X[[1]]) <- colnames(X[[2]]) <- paste0("V", 1:p)

# Long view and IDs
X_long <- do.call(rbind, X)
id     <- rep(seq_len(n_subjects), each = n_timepoints)
time   <- rep(seq_len(n_timepoints), times = n_subjects)

# Strong signal on V1, V2, V3 + modest subject random effect + small noise
u_subj <- rnorm(n_subjects, 0, 0.7)
eps    <- rnorm(length(id), 0, 0.08)
Y <- 4*X_long[, "V1"] + 3.5*X_long[, "V2"] + 3.2*X_long[, "V3"] +
     rep(u_subj, each = n_timepoints) + eps

# Lightweight dissimilarity to skip Stage 1 (fast on CRAN)
A <- 1 - abs(stats::cor(X_long)); diag(A) <- 0
dimnames(A) <- list(colnames(X[[1]]), colnames(X[[1]]))
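
A note on row order: do.call(rbind, X) stacks the time-point matrices one after another, so rows are grouped by time point first and by subject second. The toy sketch below (3 subjects, 2 time points; all object names are hypothetical and not part of the package) makes that order visible and builds labels that follow it. With real data, Y, id, and time should follow whatever row order temporal_forest() expects, so it is worth checking ?temporal_forest before building them by hand.

```r
# Toy data: 3 subjects, 2 time points, 1 feature.
# Each value encodes its origin as 10 * time + subject.
n_s <- 3; n_t <- 2
X_toy <- lapply(seq_len(n_t), function(t) {
  matrix(10 * t + seq_len(n_s), nrow = n_s,
         dimnames = list(paste0("s", seq_len(n_s), "_t", t), "V1"))
})

# Stack exactly as in the simulation code: one time point after another
X_toy_long <- do.call(rbind, X_toy)
rownames(X_toy_long)

# Labels that follow this stacked (time-major) row order
row_time    <- rep(seq_len(n_t), each  = n_s)
row_subject <- rep(seq_len(n_s), times = n_t)
data.frame(row = rownames(X_toy_long), subject = row_subject, time = row_time)
```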

Run TemporalForest

We call the main function, passing our precomputed dissimilarity_matrix = A and asking for 3 features.

# Run TemporalForest with minimal settings for the vignette
tf_result <- temporal_forest(
  X = X, Y = Y, id = id, time = time,
  dissimilarity_matrix = A,       # skip WGCNA/TOM (Stage 1)
  n_features_to_select = 3,
  n_boot_screen = 4,              # very low for a quick demo
  n_boot_select = 8,              # very low for a quick demo
  keep_fraction_screen = 1,       # permissive screening
  min_module_size = 2,
  alpha_screen = 0.5,             # permissive screening
  alpha_select = 0.6
)
#>  ..cutHeight not given, setting it to 0.951  ===>  99% of the (truncated) height range in dendro.
#>  ..done.

Interpret the Results

Examine the selected features and check if the true predictors were found.

print(tf_result)
#> --- Temporal Forest Results ---
#> 
#> Top 3 feature(s) selected:
#>   V1
#>   V3
#>   V2 
#> 
#> 5 feature(s) were candidates in the final stage.

# Validate against ground truth
true_predictors <- c("V1", "V2", "V3")
cat("True predictors found:", sum(true_predictors %in% tf_result$top_features), 
    "out of", length(true_predictors), "\n")
#> True predictors found: 3 out of 3

The algorithm successfully identified all three true predictors in this high signal-to-noise example.
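
When the ground truth is known, as in a simulation, the same comparison can be summarised as precision and recall. The sketch below is a minimal base-R illustration; the selected vector is copied from the printed result above so that the block runs on its own, and in a live session it could be replaced by tf_result$top_features.

```r
true_predictors <- c("V1", "V2", "V3")

# Copied from the printed output above; in a live session use tf_result$top_features
selected <- c("V1", "V3", "V2")

# True positives are the selected features that are genuine signals
tp <- length(intersect(selected, true_predictors))

precision <- tp / length(selected)          # share of selected features that are true
recall    <- tp / length(true_predictors)   # share of true features that were selected

false_positives <- setdiff(selected, true_predictors)
missed          <- setdiff(true_predictors, selected)

cat("Precision:", precision, " Recall:", recall, "\n")
```

Precision and recall are more informative than a simple count once the number of selected features differs from the number of true signals.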

4. How TemporalForest Works

TemporalForest operates in three stages:

  1. Time-Aware Module Construction: Groups correlated features into modules that are stable across time points using a consensus topological overlap matrix (TOM).
  2. Within-Module Screening: Uses mixed-effects model trees to select the most important predictors from each module while accounting for within-subject correlations.
  3. Stability Selection: Applies bootstrapping to calculate selection probabilities, ensuring only the most reproducible features are included in the final set.
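
The idea behind the stability selection stage can be sketched in base R. This is a simplified illustration, not the package's implementation: the selector below ranks features by absolute correlation with the outcome as a stand-in for mixed-effects model trees, and all object and function names (select_top_k, selection_prob, and so on) are hypothetical. Subjects, rather than individual rows, are resampled so that repeated measures stay together.

```r
set.seed(1)
n_subj <- 40; n_time <- 2; p <- 10

# Long-format toy data, rows ordered by subject and then by time point
id <- rep(seq_len(n_subj), each = n_time)
Xl <- matrix(rnorm(length(id) * p), ncol = p,
             dimnames = list(NULL, paste0("V", seq_len(p))))
y  <- 3 * Xl[, "V1"] + 3 * Xl[, "V2"] +
      rep(rnorm(n_subj, 0, 0.5), each = n_time) +  # subject random effect
      rnorm(length(id), 0, 0.1)                    # residual noise

# Stand-in selector (assumption): top-k features by absolute correlation
select_top_k <- function(X, y, k) {
  score <- abs(stats::cor(X, y))[, 1]
  names(sort(score, decreasing = TRUE))[seq_len(k)]
}

n_boot <- 50; k <- 2
counts <- setNames(numeric(p), colnames(Xl))

for (b in seq_len(n_boot)) {
  # Resample whole subjects with replacement, keeping all of their rows
  boot_ids <- sample(unique(id), replace = TRUE)
  rows     <- unlist(lapply(boot_ids, function(s) which(id == s)))
  picked   <- select_top_k(Xl[rows, , drop = FALSE], y[rows], k)
  counts[picked] <- counts[picked] + 1
}

# Selection probability: fraction of resamples in which each feature was picked
selection_prob <- sort(counts / n_boot, decreasing = TRUE)
head(selection_prob, 3)
```

Features whose selection probability stays high across resamples are the ones a stability-based procedure would retain.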

5. Key Parameters Guide

  • n_features_to_select: Final number of features to return (default: 10)
  • n_boot_screen, n_boot_select: Number of bootstrap samples for screening and selection stages. Increase for more stable results (defaults: 50, 100).
  • keep_fraction_screen: Proportion of features from each module passed to final selection (default: 0.25). Increase if too few features are selected.
  • min_module_size: Minimum size for network modules (default: 4).
  • alpha_screen, alpha_select: Significance levels for splitting in screening and selection trees (defaults: 0.2, 0.05).
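
To see how far the quick-start settings depart from the defaults, the two sets of values can be placed side by side. The sketch below uses only base R and the values quoted in this guide; the list and data frame names are hypothetical, and the commented-out call at the end assumes the Section 3 objects (X, Y, id, time, A) are available in the session.

```r
# Defaults as listed in the parameter guide above
defaults <- list(
  n_features_to_select = 10,
  n_boot_screen        = 50,
  n_boot_select        = 100,
  keep_fraction_screen = 0.25,
  min_module_size      = 4,
  alpha_screen         = 0.2,
  alpha_select         = 0.05
)

# Overrides used in the quick-start example (Section 3)
quick <- utils::modifyList(defaults, list(
  n_features_to_select = 3,
  n_boot_screen        = 4,
  n_boot_select        = 8,
  keep_fraction_screen = 1,
  min_module_size      = 2,
  alpha_screen         = 0.5,
  alpha_select         = 0.6
))

comparison <- data.frame(
  parameter   = names(defaults),
  default     = unlist(defaults),
  quick_start = unlist(quick[names(defaults)]),
  row.names   = NULL
)
comparison

# With the package loaded, the settings could be passed along these lines:
# tf_result <- do.call(temporal_forest,
#   c(list(X = X, Y = Y, id = id, time = time, dissimilarity_matrix = A), quick))
```

The quick-start values are deliberately permissive and use very few bootstrap samples; for a real analysis the defaults, or larger bootstrap counts, are the more sensible starting point.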

6. Troubleshooting

Symptom                    | Likely Cause          | Solution
No features selected       | Screening too strict  | Increase keep_fraction_screen or alpha_screen
Too many features selected | Selection too liberal | Decrease keep_fraction_screen or alpha_select
Long computation time      | Data too large        | Reduce bootstrap numbers or pre-filter features

7. Input Data Validation

The package includes checks for proper data formatting. Here’s an example of the error message for inconsistent inputs:

# This will produce a clear error message
mat1 <- matrix(1:4, nrow=2, dimnames=list(NULL, c("A", "B")))
mat2 <- matrix(1:4, nrow=2, dimnames=list(NULL, c("A", "C")))
bad_X <- list(mat1, mat2)

TemporalForest::check_temporal_consistency(bad_X)
#> Error: Inconsistent data format: The column names of the matrix for time point 2 do not match the column names of the first time point.
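
The same kind of check can be run by hand before calling the package, for example to find which time points are out of line. The helper below is a stand-alone base-R sketch (same_columns is a hypothetical name, not a package function) and may differ from what check_temporal_consistency() tests internally.

```r
# Compare every matrix's column names against those of the first time point
same_columns <- function(X_list) {
  ref <- colnames(X_list[[1]])
  vapply(X_list, function(m) identical(colnames(m), ref), logical(1))
}

mat1 <- matrix(1:4, nrow = 2, dimnames = list(NULL, c("A", "B")))
mat2 <- matrix(1:4, nrow = 2, dimnames = list(NULL, c("A", "C")))

ok <- same_columns(list(mat1, mat2))
which(!ok)  # indices of time points whose column names differ from the first
```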

8. Conclusion

TemporalForest provides an end-to-end solution for reproducible feature selection in longitudinal high-dimensional data. For detailed information on all function parameters and advanced usage, see the package documentation (?TemporalForest).

9. Citation

To cite TemporalForest in publications, please use:

citation("TemporalForest")
#> To cite package 'TemporalForest' in publications use:
#> 
#>   Shao S, Moore J, Ramirez C (2025). _TemporalForest: A package for
#>   reproducible feature selection in high-dimensional longitudinal
#>   data_. R package version 0.1.0,
#>   <https://github.com/SisiShao/TemporalForest>.
#> 
#>   Shao S, Moore J, Ramirez C (2025). "Network-Guided TemporalForest for
#>   Feature Selection in High-Dimensional Longitudinal Data." Manuscript
#>   submitted for publication.,
#>   <https://github.com/SisiShao/TemporalForest>.
#> 
#> To see these entries in BibTeX format, use 'print(<citation>,
#> bibtex=TRUE)', 'toBibtex(.)', or set
#> 'options(citation.bibtex.max=999)'.

Session Info

sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-apple-darwin20
#> Running under: macOS Sonoma 14.2.1
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
#> 
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: America/Los_Angeles
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] TemporalForest_0.1.4
#> 
#> loaded via a namespace (and not attached):
#>   [1] Rdpack_2.6.4            DBI_1.2.3               gridExtra_2.3          
#>   [4] rlang_1.1.6             magrittr_2.0.4          matrixStats_1.5.0      
#>   [7] compiler_4.4.1          RSQLite_2.4.3           png_0.1-8              
#>  [10] vctrs_0.6.5             stringr_1.5.2           pkgconfig_2.0.3        
#>  [13] crayon_1.5.3            fastmap_1.2.0           backports_1.5.0        
#>  [16] XVector_0.44.0          inum_1.0-5              rmarkdown_2.30         
#>  [19] UCSC.utils_1.0.0        nloptr_2.2.1            preprocessCore_1.66.0  
#>  [22] bit_4.6.0               xfun_0.53               zlibbioc_1.50.0        
#>  [25] cachem_1.1.0            flashClust_1.01-2       GenomeInfoDb_1.40.1    
#>  [28] jsonlite_2.0.0          blob_1.2.4              parallel_4.4.1         
#>  [31] cluster_2.1.8.1         R6_2.6.1                glmertree_0.2-6        
#>  [34] bslib_0.9.0             stringi_1.8.7           RColorBrewer_1.1-3     
#>  [37] boot_1.3-32             rpart_4.1.24            jquerylib_0.1.4        
#>  [40] Rcpp_1.1.0              iterators_1.0.14        knitr_1.50             
#>  [43] WGCNA_1.73              base64enc_0.1-3         IRanges_2.38.1         
#>  [46] Matrix_1.7-4            splines_4.4.1           nnet_7.3-20            
#>  [49] tidyselect_1.2.1        rstudioapi_0.17.1       yaml_2.3.10            
#>  [52] partykit_1.2-24         doParallel_1.0.17       codetools_0.2-20       
#>  [55] lattice_0.22-7          tibble_3.3.0            Biobase_2.64.0         
#>  [58] KEGGREST_1.44.1         S7_0.2.0                evaluate_1.0.5         
#>  [61] foreign_0.8-90          survival_3.8-3          Biostrings_2.72.1      
#>  [64] pillar_1.11.1           checkmate_2.3.3         foreach_1.5.2          
#>  [67] stats4_4.4.1            reformulas_0.4.1        generics_0.1.4         
#>  [70] S4Vectors_0.42.1        ggplot2_4.0.0           scales_1.4.0           
#>  [73] minqa_1.2.8             glue_1.8.0              Hmisc_5.2-4            
#>  [76] tools_4.4.1             data.table_1.17.8       lme4_1.1-37            
#>  [79] mvtnorm_1.3-3           fastcluster_1.3.0       grid_4.4.1             
#>  [82] impute_1.78.0           libcoin_1.0-10          rbibutils_2.3          
#>  [85] AnnotationDbi_1.66.0    colorspace_2.1-2        nlme_3.1-168           
#>  [88] GenomeInfoDbData_1.2.12 htmlTable_2.4.3         Formula_1.2-5          
#>  [91] cli_3.6.5               dplyr_1.1.4             gtable_0.3.6           
#>  [94] dynamicTreeCut_1.63-1   sass_0.4.10             digest_0.6.37          
#>  [97] BiocGenerics_0.50.0     htmlwidgets_1.6.4       farver_2.1.2           
#> [100] memoise_2.0.1           htmltools_0.5.8.1       lifecycle_1.0.4        
#> [103] httr_1.4.7              GO.db_3.19.1            bit64_4.6.0-1          
#> [106] MASS_7.3-65