Validation Study

Scope

This vignette documents the release-validation study shipped with SelectBoost.quantile. The goal is not to claim universal superiority, but to show how the current prototype behaves against two direct baselines. Three methods are compared:

plain quantile lasso (lasso)
cross-validated quantile lasso with a 1-SE penalty rule (lasso_tuned)
selectboost_quantile() with tau-aware screening, stronger lambda tuning, complementary-pairs stability selection, capped neighborhoods, and a hybrid support score (selectboost)

The included benchmark artifacts were generated with:

scenarios from default_quantile_benchmark_scenarios()
tau = c(0.25, 0.5, 0.75)
4 Monte Carlo replications per scenario
selectboost_quantile(..., B = 8, step_num = 0.5, screen = "auto", tune_lambda = "cv", lambda_rule = "one_se", lambda_inflation = 1.25, complementary_pairs = TRUE, max_group_size = 15, nlambda = 8)
stable support extracted with the hybrid summary score at threshold = 0.55
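
The complementary_pairs = TRUE setting refers to complementary-pairs stability selection, in which each draw splits the sample into two disjoint half-samples and the selector is fitted on both. The base-R sketch below shows only the index draw. draw_complementary_pairs() is a hypothetical helper, and treating B as the number of pairs is an assumption that may not match the package internals.

```r
# Hypothetical helper, not part of SelectBoost.quantile.
# Draws B pairs of disjoint half-samples from the row indices 1..n.
draw_complementary_pairs <- function(n, B) {
  half <- floor(n / 2)
  lapply(seq_len(B), function(b) {
    perm <- sample.int(n)  # random permutation of the row indices
    list(
      first = perm[seq_len(half)],
      second = perm[half + seq_len(half)]
    )
  })
}

set.seed(1)
pairs <- draw_complementary_pairs(n = 100, B = 8)

# Each pair is disjoint, so the two fits never share an observation.
vapply(pairs, function(p) length(intersect(p$first, p$second)), integer(1))
```

Selection frequencies would then be averaged over all 2 * B half-sample fits before any threshold is applied.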

summary_path <- system.file(
  "extdata",
  "validation",
  "quantile_benchmark_release_summary.csv",
  package = "SelectBoost.quantile"
)
raw_path <- system.file(
  "extdata",
  "validation",
  "quantile_benchmark_release_raw.csv",
  package = "SelectBoost.quantile"
)

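# Prefer the installed copy of an artifact; fall back to the source tree when
# the package is not installed.
resolve_validation_path <- function(installed_path, filename) {
  # system.file() returns "" when the file is not found.
  if (nzchar(installed_path) && file.exists(installed_path)) {
    return(installed_path)
  }

  # Look relative to the package root, then one directory below it.
  candidates <- c(
    file.path("inst", "extdata", "validation", filename),
    file.path("..", "inst", "extdata", "validation", filename)
  )
  candidates <- candidates[file.exists(candidates)]
  if (!length(candidates)) {
    stop("Could not locate shipped validation artifact: ", filename, call. = FALSE)
  }
  candidates[[1]]
}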

summary_path <- resolve_validation_path(summary_path, "quantile_benchmark_release_summary.csv")
raw_path <- resolve_validation_path(raw_path, "quantile_benchmark_release_raw.csv")

validation_summary <- utils::read.csv(summary_path, stringsAsFactors = FALSE)
validation_raw <- utils::read.csv(raw_path, stringsAsFactors = FALSE)

# Scenario names carry a "_tau_..." suffix; strip it to get the scenario family.
validation_summary$family <- sub("_tau_.*$", "", validation_summary$scenario)
validation_summary$is_high_dim <- grepl("^high_dim", validation_summary$scenario)
# F1 computed from the mean counts; NA when TP, FP, and FN are all zero.
validation_summary$mean_f1 <- with(
  validation_summary,
  ifelse(
    (2 * mean_tp + mean_fp + mean_fn) > 0,
    2 * mean_tp / (2 * mean_tp + mean_fp + mean_fn),
    NA_real_
  )
)
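
The mean_f1 column above is derived from mean counts, not averaged over replications. A small sketch with made-up counts (not taken from the shipped study) shows the formula and how the two quantities can differ. f1_from_counts() is a hypothetical helper that mirrors the expression used above.

```r
# Made-up counts for two replications, for illustration only.
tp <- c(3, 1)
fp <- c(0, 4)
fn <- c(0, 2)

# Hypothetical helper mirroring the vignette's F1 expression.
f1_from_counts <- function(tp, fp, fn) {
  denom <- 2 * tp + fp + fn
  ifelse(denom > 0, 2 * tp / denom, NA_real_)
}

# F1 computed from the mean counts, as in the vignette code: 4 / 7.
f1_of_means <- f1_from_counts(mean(tp), mean(fp), mean(fn))

# Mean of the per-replication F1 scores: (1 + 0.25) / 2 = 0.625.
mean_of_f1 <- mean(f1_from_counts(tp, fp, fn))

c(f1_of_means = f1_of_means, mean_of_f1 = mean_of_f1)
```

For the same reason, mean_f1 cannot be recovered exactly from the mean_tpr and mean_fdr columns in the tables.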

Overall summary

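The first table averages the scenario-level summaries across the full shipped grid, including the n < p stress regime. The printed tables omit the failure_rate column; no method failures were recorded in the shipped study.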

overall <- aggregate(
  cbind(mean_tpr, mean_fdr, mean_f1, failure_rate, mean_runtime_sec) ~ method,
  data = validation_summary,
  FUN = mean
)

knitr::kable(overall, digits = 3)

method	mean_tpr	mean_fdr	mean_f1	mean_runtime_sec
lasso	0.856	0.655	0.484	0.005
lasso_tuned	0.900	0.735	0.383	0.069
selectboost	0.734	0.063	0.808	3.794

Across the full grid, tuned lasso has the highest average true-positive rate (0.900), but it also carries the highest average false-discovery rate (0.735). The current selectboost_quantile() release is markedly more conservative: it gives up some recall (0.734), but in exchange it lowers the mean false-discovery rate to 0.063 and yields the best average F1 score (0.808) on the shipped benchmark grid.
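
One detail of aggregate() with a formula is worth keeping in mind when reading these averages: its default na.action is na.omit, so a row with NA in any of the bound columns is dropped from every column's mean. The toy data frame below is made up for illustration and is not part of the shipped study.

```r
# Made-up values; method "a" has one row with a missing F1.
toy <- data.frame(
  method = c("a", "a", "b", "b"),
  mean_tpr = c(1.0, 0.5, 0.8, 0.6),
  mean_f1 = c(0.9, NA, 0.7, 0.5)
)

# The second row has NA in mean_f1, so it is removed before averaging.
# For method "a", mean_tpr is therefore 1.0 rather than 0.75.
agg <- aggregate(cbind(mean_tpr, mean_f1) ~ method, data = toy, FUN = mean)
agg
```

Because the shipped summary defines mean_f1 as NA only when a scenario has no true positives, false positives, or false negatives at all, this would matter only for such degenerate rows.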

Correlated but not high-dimensional regimes

The high_dim scenario is intentionally hard and changes the picture substantially. Excluding that regime gives a cleaner view of the correlated and misspecified-noise settings that the current prototype handles more naturally.

stable_regimes <- subset(validation_summary, !is_high_dim)

stable_overall <- aggregate(
  cbind(mean_tpr, mean_fdr, mean_f1, failure_rate, mean_runtime_sec) ~ method,
  data = stable_regimes,
  FUN = mean
)

knitr::kable(stable_overall, digits = 3)

method	mean_tpr	mean_fdr	mean_f1	mean_runtime_sec
lasso	0.886	0.625	0.521	0.005
lasso_tuned	0.936	0.719	0.406	0.040
selectboost	0.758	0.062	0.822	3.894

On these non-high-dimensional settings, the shipped study shows a consistent pattern:

lasso_tuned has the highest mean recall
selectboost_quantile() has the lowest mean false-discovery rate by a large margin
selectboost_quantile() also has the highest mean F1 score on the shipped grid
selectboost_quantile() remains slower than either lasso baseline, which is expected because it perturbs, subsamples, and refits repeatedly

The family-level breakdown is below.

family_summary <- aggregate(
  cbind(mean_tpr, mean_fdr, mean_f1) ~ family + method,
  data = stable_regimes,
  FUN = mean
)

knitr::kable(family_summary, digits = 3)

family	method	mean_tpr	mean_fdr	mean_f1
block_corr	lasso	0.806	0.682	0.454
heavy_tail	lasso	0.986	0.597	0.565
heteroskedastic	lasso	0.861	0.643	0.503
high_corr	lasso	0.778	0.600	0.518
moderate_corr	lasso	1.000	0.601	0.565
block_corr	lasso_tuned	0.917	0.754	0.345
heavy_tail	lasso_tuned	1.000	0.692	0.449
heteroskedastic	lasso_tuned	0.903	0.696	0.437
high_corr	lasso_tuned	0.861	0.722	0.383
moderate_corr	lasso_tuned	1.000	0.729	0.414
block_corr	selectboost	0.778	0.164	0.783
heavy_tail	selectboost	0.792	0.014	0.876
heteroskedastic	selectboost	0.611	0.075	0.708
high_corr	selectboost	0.708	0.059	0.797
moderate_corr	selectboost	0.903	0.000	0.948
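
The long table can be easier to scan in wide form. The sketch below copies the F1 values from the table above so that it runs on its own; inside the vignette, family_summary could be used directly. The names f1_long, f1_wide, and f1_gap are introduced here for illustration.

```r
# F1 values copied from the family-level table above.
f1_long <- data.frame(
  family = rep(c("block_corr", "heavy_tail", "heteroskedastic",
                 "high_corr", "moderate_corr"), times = 3),
  method = rep(c("lasso", "lasso_tuned", "selectboost"), each = 5),
  mean_f1 = c(0.454, 0.565, 0.503, 0.518, 0.565,
              0.345, 0.449, 0.437, 0.383, 0.414,
              0.783, 0.876, 0.708, 0.797, 0.948)
)

# One row per family, one column per method.
f1_wide <- with(f1_long, tapply(mean_f1, list(family, method), mean))
f1_wide

# Gap between selectboost and the better of the two lasso baselines.
f1_gap <- f1_wide[, "selectboost"] -
  pmax(f1_wide[, "lasso"], f1_wide[, "lasso_tuned"])
round(f1_gap, 3)
```

On these values the gap is positive in every family, smallest for heteroskedastic and largest for moderate_corr.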

plot_df <- stable_regimes
method_levels <- c("lasso", "lasso_tuned", "selectboost")
cols <- c("lasso" = "#4C78A8", "lasso_tuned" = "#F58518", "selectboost" = "#54A24B")
plot(
  plot_df$mean_fdr,
  plot_df$mean_f1,
  col = cols[plot_df$method],
  pch = 19,
  xlab = "Mean FDR",
  ylab = "Mean F1",
  main = "Validation Summary by Scenario"
)
legend(
  "bottomleft",
  legend = method_levels,
  col = cols[method_levels],
  pch = 19,
  bty = "n"
)

High-dimensional stress regime

The high_dim family remains difficult, but it is no longer a failure mode in the earlier sense of selecting almost everything. The improved SelectBoost workflow now returns much sparser and more stable supports than either lasso baseline: a mean support size of 3.917, against 22.167 for lasso and 26.250 for tuned lasso.

high_dim <- subset(validation_summary, is_high_dim)

high_dim_overall <- aggregate(
  cbind(mean_tpr, mean_fdr, mean_f1, failure_rate, mean_support_size) ~ method,
  data = high_dim,
  FUN = mean
)

knitr::kable(high_dim_overall, digits = 3)

method	mean_tpr	mean_fdr	mean_f1	mean_support_size
lasso	0.708	0.804	0.300	22.167
lasso_tuned	0.722	0.817	0.270	26.250
selectboost	0.611	0.067	0.738	3.917

The main remaining tradeoff is recall: selectboost_quantile() is much cleaner than the lasso baselines in high_dim, but it is still more conservative and can miss weaker signals. Even so, on the shipped study it achieves the best mean F1 score in that regime because it avoids the large false-positive burden of the lasso baselines. This remaining recall gap is the main reason the package is best described as a polished v2 prototype rather than a finished methodological endpoint.
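
As a rough reading of the high_dim table, multiplying the mean support size by the mean FDR approximates the average number of false positives per fit. This is only a back-of-the-envelope figure, since a mean of ratios is not a ratio of means. The approx_false_pos column is introduced here for illustration.

```r
# Values copied from the high_dim table above.
high_dim_tbl <- data.frame(
  method = c("lasso", "lasso_tuned", "selectboost"),
  mean_fdr = c(0.804, 0.817, 0.067),
  mean_support_size = c(22.167, 26.250, 3.917)
)

# Approximate false positives per fit: FDR times support size.
high_dim_tbl$approx_false_pos <- round(
  high_dim_tbl$mean_fdr * high_dim_tbl$mean_support_size, 1
)
high_dim_tbl
```

Under this approximation the lasso baselines carry on the order of 18 to 21 false positives per fit, while selectboost carries well under one.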

failure_rows <- subset(validation_summary, failure_rate > 0)
if (nrow(failure_rows)) {
  knitr::kable(failure_rows[, c(
    "scenario",
    "method",
    "failure_rate",
    "mean_tpr",
    "mean_fdr",
    "mean_support_size"
  )], digits = 3)
} else {
  cat("No method failures were recorded in the shipped study.\n")
}
#> No method failures were recorded in the shipped study.

Reproducing the study

From a source checkout, regenerate benchmark artifacts into a temporary directory with:

out_dir <- file.path(tempdir(), "SelectBoost.quantile-validation")
system2(
  "Rscript",
  c("inst/scripts/run_quantile_benchmark.R", out_dir, "4", "0.55")
)

The script loads the local package automatically when run from a source tree. It writes raw results, aggregated summaries, and a sessionInfo record to the chosen output directory. If no output directory is supplied, it defaults to a subdirectory of tempdir(). In the current source tree, that rerun uses the screening, stronger lambda tuning, complementary-pairs stability, neighborhood-cap, and hybrid-support defaults defined in the package benchmark helper.
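
When scripting the rerun, it can help to fail early if the script is missing and to check the exit status. The wrapper below is a hypothetical sketch: run_validation() is not part of the package, and reading the second and third arguments as the replication count and support threshold is an assumption based on the settings listed earlier.

```r
# Hypothetical wrapper; only the script path and the argument order come from
# the command shown above.
run_validation <- function(script = "inst/scripts/run_quantile_benchmark.R",
                           out_dir = file.path(tempdir(), "SelectBoost.quantile-validation"),
                           replications = 4,
                           threshold = 0.55) {
  if (!file.exists(script)) {
    stop("Benchmark script not found; run from the package source root: ",
         script, call. = FALSE)
  }
  # system2() returns the exit status when output is not captured.
  status <- system2(
    "Rscript",
    c(shQuote(script), shQuote(out_dir),
      as.character(replications), as.character(threshold))
  )
  if (!identical(status, 0L)) {
    stop("Benchmark script exited with status ", status, call. = FALSE)
  }
  # Return the files written by the script.
  list.files(out_dir, full.names = TRUE)
}
```

Calling run_validation() from the package source root would then return the paths of the regenerated artifacts, or stop with a message if the script cannot be found or fails.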