GAReg

Genetic Algorithms in Regression.

Authors: Mo Li (mo.li@louisiana.edu), QiQi Lu (qlu2@vcu.edu), Robert Lund (rolund@ucsc.edu), Xueheng Shi (xshi11@unl.edu)

Overview

GAReg is a unified genetic algorithm framework for regression problems that require discrete optimization over model spaces with unknown or variable dimension, where gradient-based methods and exhaustive search are impractical. It provides a compact chromosome representation for tasks such as optimal spline knot placement, best-subset variable selection, and multiple changepoint detection, along with exact uniform initialization, constraint-preserving crossover and mutation, steady-state replacement, and optional island-model parallelization. In challenging high-dimensional settings, GAReg enables efficient search and delivers near-optimal solutions when alternative algorithms are not well-justified.

Package download and installation

You can install the version of changepointGA from CRAN:

install.packages("GAReg")

or the development version from Github:

# install.packages("remotes")
remotes::install_github("mli171/GAReg", build_vignettes = TRUE, dependencies = TRUE)

Open the Vignette in R for more details

browseVignettes("GAReg")

GAReg: Genetic Algorithm Applications in Regression

Best Subset Variable Selection

We use the simulation function below for subset selection illustration. Here, n is the number of observations and p is the number of predictors. For the covairates, s0 represent number of truly active predictors, valued range from 0 to p. magnitudes_range specifies the range of significantly expressed coefficients that corresponding to the truly active predictors. If rho is specified with some values, the autoregressive structure is introduced into the error terms. If rho=NULL, we will have independent and identically distributed (IID) errors. We can also specify sigma for the error standard deviation.

sim_subset_data <- function(n = 60, p = 50, s0 = 25, sigma = 1.5,
                            magnitudes_range = c(0.5, 2),
                            rho = NULL,
                            seed = NULL) {
  stopifnot(n > 0, p > 0, s0 >= 0, s0 <= p, sigma >= 0)
  if (!is.null(seed)) set.seed(seed)

  X <- matrix(rnorm(n * p), n, p)
  
  # Active set and coefficients
  true_idx    <- if (s0 > 0) sort(sample.int(p, s0)) else integer(0)
  signs       <- if (s0 > 0) sample(c(-1, 1), s0, replace = TRUE) else numeric(0)
  magnitudes  <- if (s0 > 0) runif(s0, magnitudes_range[1], magnitudes_range[2]) else numeric(0)

  beta_true <- numeric(p)
  if (s0 > 0) beta_true[true_idx] <- magnitudes * signs

  if (is.null(rho)) {
    e <- rnorm(n, sd = sigma)
  } else {
    sd_innov <- sigma * sqrt(1 - rho^2)
    burn_in  <- 100
    z        <- rnorm(n + burn_in, sd = sd_innov)
    e_full   <- numeric(n + burn_in)
    for (t in 2:(n + burn_in)) e_full[t] <- rho * e_full[t - 1] + z[t]
    e <- e_full[(burn_in + 1):(burn_in + n)]
  }

  y <- as.numeric(X %*% beta_true + e)

  DF <- data.frame(y = y, as.data.frame(X))
  colnames(DF)[-1] <- paste0("X", seq_len(p))

  list(
    X = X,
    y = y,
    beta_true = beta_true,
    true_idx = true_idx,
    DF = DF,
    rho = if (is.null(rho)) NULL else rho,
    args = list(n = n, p = p, s0 = s0, sigma = sigma,
                magnitudes_range = magnitudes_range,
                rho = rho, seed = seed)
  )
}

sim <- sim_subset_data(n=100, p=50, s0=25, sigma=1.5, rho=NULL, seed=123)
y <- sim$y
X <- sim$X
ga <- gareg_subset(y=y, X=X, gaMethod = "GA", monitor = FALSE,
                   gacontrol=list(popSize=120,
                                  maxiter=20000,
                                  run=4000,
                                  pmutation=0.02))
summary(ga)

# False Discovery Rate and True Positive Rate calculation function
res <- FDRCalc(truelabel=sim$true_idx, predlabel=ga@bestsol, N=50)
# FALSE Discover Rate
res$fdr
# TRUE Positive Rate
res$tpr

Changepoint Detection

The multiple changepoint detection can be conducted through changepointGA package (Li and Lu, 2024). The BIC penalized function is provided below for IID data. The related math details can be found in (Li et al., 2026).

BIC.cpt = function(chromosome, Xt){
  
  m   = chromosome[1]
  tau = chromosome[2:(2 + m - 1)]
  N   = length(Xt)
  
  if(m==0){
    mu.hat = mean(Xt)
    sigma.hatsq = sum( (Xt-mu.hat)^2 )/N
    BIC.obj = 0.5*N*log(sigma.hatsq)+ 2*log(N)
  }
  else{
    tau.ext = c(1, tau, N+1)
    seg.len = diff(tau.ext)
    ff = rep(0:m, times=seg.len)
    Xseg = split(Xt, ff)
    mu.seg = unlist(lapply(Xseg,mean), use.names=F)
    mu.hat = rep(mu.seg, seg.len)
    sigma.hatsq = sum( (Xt-mu.hat)^2 )/N
    BIC.obj = 0.5*N*log(sigma.hatsq) + (2*m + 2)*log(N)
  }
  return(BIC.obj)
}

# IID data
set.seed(1234)
n = 200
et = rnorm(n)
Xt = et + rep(c(0,2,0,2), each=n/4)

library(changepointGA)
GA.res <- cptga(
  ObjFunc = BIC.cpt, N = n, popSize = 500, 
  pcrossover = 0.95, pmutation = 0.3, pchangepoint = 10/n,
  Xt = Xt
)
summary(GA.res)

Optimal Knot Selection for Splines

The classic motocycle impact dataset from MASS package (Venables & Ripley, 2002) is used as example here.

library(MASS)
library(splines)

data(mcycle)
head(mcycle)

Different GA model set-up

Varying knots with single population GA;

g1 <- gareg_knots(
  y=mcycle$accel, x=mcycle$times,
  minDist = 5,
  gaMethod = "cptga",
  cptgactrl = cptgaControl(popSize=200, pcrossover=0.9, pmutation=0.3),
  ic_method = "BIC"
)
summary(g1)

# knots location
g1@bestsol

Varying knots with island model GA;

g2 <- gareg_knots(
  y=mcycle$accel, x=mcycle$times,
  minDist = 5,
  gaMethod = "cptgaisl",
  cptgactrl = cptgaControl(numIslands=5, popSize=200, maxMig=250,
                           pcrossover=0.9, pmutation=0.3),
  ic_method = "BIC"
)
summary(g2)

Fixed knots with single population GA;

g3 <- gareg_knots(
  y=mcycle$accel, x=mcycle$times,
  fixedknots = 3,
  minDist = 5,
  gaMethod = "cptga",
  cptgactrl = cptgaControl(popSize=200, pcrossover=0.9, pmutation=0.3),
  ic_method = "BIC"
)
summary(g3)

Fixed knots with island model GA;

g4 <- gareg_knots(
  y=mcycle$accel, x=mcycle$times,
  fixedknots = 4,
  minDist = 5,
  gaMethod = "cptgaisl",
  cptgactrl = cptgaControl(numIslands=5, popSize=200, maxMig=250, 
                           pcrossover=0.9, pmutation=0.3),
  ic_method = "BIC"
)
summary(g4)

Comparison;

y = mcycle$accel
x = mcycle$times
x_unique = unique(x)

tBIC.vary.ga     = g1@bestsol
tBIC.vary.gaisl  = g2@bestsol
tBIC.fix.3.ga    = g3@bestsol
tBIC.fix.4.gaisl = g4@bestsol

bsfit.vary.ga     = lm(y ~ bs(x, knots=x_unique[g1@bestsol], Boundary.knots = range(x)))
bsfit.vary.gaisl  = lm(y ~ bs(x, knots=x_unique[g2@bestsol], Boundary.knots = range(x)))
bsfit.fix.3.ga    = lm(y ~ bs(x, knots=x_unique[g3@bestsol], Boundary.knots = range(x)))
bsfit.fix.4.gaisl = lm(y ~ bs(x, knots=x_unique[g4@bestsol], Boundary.knots = range(x)))

plot(x, y, xlab = "Time (ms)", ylab = "Acceleration (g)")
ht = seq(min(x), max(x), length.out = 200)
lines(ht, predict(bsfit.vary.ga, data.frame(x = ht)), col="blue", lty = 5, lwd = 2)
lines(ht, predict(bsfit.vary.gaisl, data.frame(x = ht)), col="orange", lty = 4, lwd = 2)
lines(ht, predict(bsfit.fix.3.ga, data.frame(x = ht)), col="purple", lty = 3, lwd = 2)
lines(ht, predict(bsfit.fix.4.gaisl, data.frame(x = ht)), col="#D55E00", lty = 2, lwd = 2)
legend("bottomright", 
       legend = c("Varying knots GA", 
                  "Varying knots island model GA", 
                  "Fixed 3 knots GA", 
                  "Fixed 4 knots island model GA"),
       lty = 5:2, lwd = 2,
       col = c("blue", "orange", "purple", "#D55E00"), bty = "n")

Spline options: piecewise polynomials, natural cubic, and B-splines

This section illustrates how to build spline design matrices via splineX() for three common options:

type="ppolys": degree-d truncated power piecewise polynomials;
type="ns": degree-3 natural cubic spline (degree will be ignored);
type="bs": degree-d B-spline basis;

We’ll use the motorcycle acceleration data MASS::mcycle, create interior knots at quantiles of times, and compare how different spline types/degrees behave. Here, we only illustrate through Varying number and locations of knots set-up (Let GA choose both how many knots and where they go).

g_pp3 <- gareg_knots(
  y = y, x = x,
  minDist   = 5,
  gaMethod  = "cptga",
  ObjFunc   = NULL,          # use default varyknotsIC
  type      = "ppolys",
  degree    = 3,             # degree-3 piecewise polynomial
  intercept = TRUE,
  ic_method = "BIC"
)
summary(g_pp3)

g_ns <- gareg_knots(
  y = y, x = x,
  minDist   = 5,
  gaMethod  = "cptga",
  type      = "ns",          # natural cubic (degree ignored)
  degree    = 3,             # ignored for "ns"
  intercept = TRUE,
  ic_method = "BIC"
)
summary(g_ns)

g_bs1 <- gareg_knots(
  y = y, x = x,
  minDist   = 5,
  gaMethod  = "cptga",
  type      = "bs",
  degree    = 1,             # linear B-splines
  intercept = TRUE,
  ic_method = "BIC"
)
summary(g_bs1)

References

Li, M., & Lu, Q. (2024). changepointGA: An R package for Fast Changepoint Detection via Genetic Algorithm. arXiv preprint arXiv:2410.15571.

Mo Li, QiQi Lu, Robert Lund, & Xueheng Shi. (2026). Genetic Algorithms in Regression. arXiv preprint arXiv:.

Venables, W. N. & Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth Edition. Springer, New York. ISBN 0-387-95457-0.

Code style

Before pushing changes, please run

styler::style_pkg()

to ensure your code follows the tidyverse style guide.