Type: Package
Title: Hybrid Stepwise Regression with Single-Split Dummy Encoding
Version: 1.0.2
Description: Implements 'SplitWise', a hybrid regression approach that transforms numeric variables into either single-split (0/1) dummy variables or retains them as continuous predictors. The transformation is followed by stepwise selection to identify the most relevant variables. The default 'iterative' mode adaptively explores partial synergies among variables to enhance model performance, while an alternative 'univariate' mode applies simpler transformations independently to each predictor. For details, see Kurbucz et al. (2025) <doi:10.48550/arXiv.2505.15423>.
License: GPL (>= 3)
Encoding: UTF-8
Depends: R (>= 3.5.0)
Imports: rpart, stats
RoxygenNote: 7.3.2
Suggests: knitr, rmarkdown, testthat (>= 3.0.0)
Config/testthat/edition: 3
VignetteBuilder: knitr
NeedsCompilation: no
Packaged: 2025-07-31 05:22:34 UTC; Marcell
Author: Marcell T. Kurbucz [aut, cre], Nikolaos Tzivanakis [aut], Nilufer Sari Aslam [aut], Adam M. Sykulski [aut]
Maintainer: Marcell T. Kurbucz <m.kurbucz@ucl.ac.uk>
Repository: CRAN
Date/Publication: 2025-07-31 05:40:02 UTC

Decide Variable Type (Iterative)

Description

A stepwise variable-selection method that iteratively chooses each variable's best form: "linear", single-split "dummy", or double-split ("middle=1") dummy, based on AIC/BIC improvement. Supports "forward", "backward", or "both" strategies.

Usage

decide_variable_type_iterative(
  X,
  Y,
  min_support = 0.1,
  min_improvement = 3,
  direction = c("backward", "forward", "both"),
  criterion = c("AIC", "BIC"),
  exclude_vars = NULL,
  verbose = FALSE,
  ...
)

Arguments

X

A data frame of predictors (no response).

Y

A numeric vector (the response).

min_support

Minimum fraction (0-0.5) of observations required in either group after a dummy split. Default = 0.1.

min_improvement

Minimum required improvement in AIC/BIC for accepting a dummy split or variable transformation. Default = 3.

direction

Stepwise strategy: "forward", "backward", or "both". Default = "backward".

criterion

A character string: either "AIC" or "BIC". Default = "AIC".

exclude_vars

A character vector of variable names to exclude from dummy transformations. These variables will always be treated as linear. Default = NULL.

verbose

Logical; if TRUE, prints messages for debugging. Default = FALSE.

...

Additional arguments (currently unused).
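
The min_support rule can be illustrated with a minimal sketch. The helper name has_min_support is hypothetical and not part of the package, and the use of ">=" at the boundary is an assumption; the package may compare differently.

```r
# Hypothetical helper (not a package function): TRUE if both groups of a
# 0/1 dummy hold at least `min_support` of the observations.
has_min_support <- function(dummy, min_support = 0.1) {
  share_ones <- mean(dummy == 1)
  # Assumption: a group exactly at the threshold is accepted (">=")
  min(share_ones, 1 - share_ones) >= min_support
}

x <- 1:20
balanced <- as.integer(x >= 11)  # 10 of 20 observations coded 1
lopsided <- as.integer(x >= 20)  # 1 of 20 observations coded 1

has_min_support(balanced)  # both groups hold 50% of the data
has_min_support(lopsided)  # the smaller group holds only 5%
```

With the default of 0.1, the second split would be rejected because one group contains only 5% of the observations.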

Details

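No split is attempted on fewer than 5 observations: minsplit is set to max(5, ceiling(min_support * n)), where n is the number of observations. This floor is not user-configurable.

Dummy forms come from a shallow (maxdepth = 2) rpart tree fit to the partial residuals of the current model. Up to two splits are extracted: one split defines a single-split dummy, and two splits define a double-split dummy that equals 1 between the two cutoffs ("middle=1").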

The function then picks the form (linear, single-split dummy, or double-split dummy) that yields the lowest AIC/BIC. Variables listed in exclude_vars will be forced to remain linear (dummy transformations are never attempted).
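
The split search described above might look roughly like the following sketch. It is not the package's internal code: the data are simulated, the ordinary residuals of a model without x1 stand in for the partial residuals, and maxcompete = 0 / maxsurrogate = 0 are assumptions made here so that only primary splits are stored.

```r
library(rpart)

set.seed(1)
n  <- 200
x1 <- runif(n, 0, 10)
x2 <- rnorm(n)
# Simulated response: linear in x2, step change in x1 at 5
y  <- 2 * x2 + 3 * (x1 >= 5) + rnorm(n, sd = 0.5)

# Current model contains only x2; its residuals stand in for the
# partial residuals (an assumption of this sketch)
current   <- lm(y ~ x2)
resid_now <- residuals(current)

min_support <- 0.1
ctrl <- rpart.control(
  maxdepth     = 2,
  minsplit     = max(5, ceiling(min_support * n)),
  maxcompete   = 0,  # store primary splits only
  maxsurrogate = 0
)
tree <- rpart(resid_now ~ x1,
              data    = data.frame(resid_now, x1),
              control = ctrl)

# tree$splits is absent when the tree has no split; the "index" column
# holds the numeric cut points, with the root split in the first row
if (is.null(tree$splits)) {
  root_cut <- NA_real_
  cutoffs  <- numeric(0)
} else {
  root_cut <- unname(tree$splits[1, "index"])
  cutoffs  <- sort(unique(unname(tree$splits[, "index"])))
}
root_cut
cutoffs
```

A depth-2 tree can return more than two cut points; how the package narrows these down to at most two is not shown here.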

Value

A named list of decisions, where each element is a list with:

type

Either "linear" or "dummy".

cutoff

A numeric vector of length 1 or 2 (the chosen split points).


Decide Variable Type (Univariate)

Description

For each numeric predictor, this function fits a shallow (maxdepth = 2) rpart tree directly on Y ~ x and tests whether a dummy transformation improves model fit.

Usage

decide_variable_type_univariate(
  X,
  Y,
  min_support = 0.1,
  min_improvement = 3,
  criterion = c("AIC", "BIC"),
  exclude_vars = NULL,
  verbose = FALSE
)

Arguments

X

A data frame of numeric predictors (no response).

Y

A numeric response vector.

min_support

Minimum fraction (0-0.5) of observations required in either group after a dummy split. Default = 0.1.

min_improvement

Minimum required improvement in AIC/BIC for accepting a dummy split or variable transformation. Default = 3.

criterion

A character string: either "AIC" or "BIC". Default = "AIC".

exclude_vars

A character vector of variable names to exclude from dummy transformations. These variables will always be treated as linear. Default = NULL.

verbose

Logical; if TRUE, prints messages for debugging. Default = FALSE.

Details

No split is attempted on fewer than 5 observations: minsplit is set to max(5, ceiling(min_support * n)), where n is the number of observations. This floor is not user-configurable.

Dummy forms come from a shallow (maxdepth = 2) rpart tree fit to the data. Up to two splits are extracted: one split defines a single-split dummy, and two splits define a double-split dummy.

The function then picks the form (linear, single-split dummy, or double-split dummy) that yields the lowest AIC/BIC. If a variable is listed in exclude_vars, it will always be used as a linear predictor (dummy transformation is never attempted).
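
The comparison between the linear and dummy forms can be sketched as below. This is an illustration on simulated data, not the package's code: the cutoff is fixed by hand rather than taken from rpart, the dummy is assumed to be 1 at or above the cutoff, and the acceptance rule "gain >= min_improvement" is an assumption.

```r
set.seed(42)
n <- 200
x <- runif(n, 0, 10)
y <- 3 * (x >= 5) + rnorm(n, sd = 0.5)  # simulated step at x = 5

cutoff <- 5                       # fixed by hand for this sketch
d      <- as.integer(x >= cutoff) # assumed coding: 1 at or above the cutoff

fit_linear <- lm(y ~ x)
fit_dummy  <- lm(y ~ d)

min_improvement <- 3
# Positive gain means the dummy form has the lower (better) AIC
gain   <- AIC(fit_linear) - AIC(fit_dummy)
chosen <- if (gain >= min_improvement) "dummy" else "linear"
chosen
```

Replacing AIC() with BIC() gives the corresponding check for criterion = "BIC".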

Value

A named list of decisions, where each element is a list with:

type

Either "dummy" or "linear".

cutoffs

A numeric vector (length 1 or 2) if type = "dummy", or NULL if linear.

tree_model

The fitted rpart model (for reference) or NULL if excluded.


SplitWise Regression

Description

Transforms each numeric variable into a single-split dummy or keeps it linear, then runs stats::step() for stepwise selection. The user can choose between a simpler univariate transformation and an iterative approach.

Usage

splitwise(
  formula,
  data,
  transformation_mode = c("iterative", "univariate"),
  direction = c("backward", "forward", "both"),
  min_support = 0.1,
  min_improvement = 3,
  criterion = c("AIC", "BIC"),
  exclude_vars = NULL,
  verbose = FALSE,
  steps = 1000,
  k = 2,
  ...
)

## S3 method for class 'splitwise_lm'
print(x, ...)

## S3 method for class 'splitwise_lm'
summary(object, ...)

## S3 method for class 'splitwise_lm'
predict(object, newdata, ...)

## S3 method for class 'splitwise_lm'
coef(object, ...)

## S3 method for class 'splitwise_lm'
fitted(object, ...)

## S3 method for class 'splitwise_lm'
residuals(object, ...)

## S3 method for class 'splitwise_lm'
model.matrix(object, ...)

Arguments

formula

A formula specifying the response and (initial) predictors, e.g. mpg ~ . (all other columns as predictors).

data

A data frame containing the variables used in formula.

transformation_mode

Either "iterative" or "univariate". Default = "iterative".

direction

Stepwise direction: "backward", "forward", or "both". Default = "backward".

min_support

Minimum fraction (between 0 and 0.5) of observations needed in either group when making a dummy split. Prevents over-fragmented or tiny dummy groups. Default = 0.1.

min_improvement

Minimum required improvement (in AIC/BIC units) for accepting a dummy split or variable transformation. Helps guard against overfitting from marginal improvements. Default = 3.

criterion

Either "AIC" or "BIC". Default = "AIC". Note: If you choose "BIC", you typically want k = log(nrow(data)) in stepwise.

exclude_vars

A character vector naming variables that should be forced to remain linear (i.e., no dummy splits allowed). Default = NULL.

verbose

Logical; if TRUE, prints debug info in transformation steps. If FALSE, the stepwise selection process is run quietly (trace = 0 in step()). Default = FALSE.

steps

Maximum number of steps for step(). Default = 1000.

k

Penalty multiple for the number of degrees of freedom (used by step()). E.g. 2 for AIC, log(n) for BIC. Default = 2.

...

Additional arguments passed to predict.lm.

x

A "splitwise_lm" object returned by splitwise.

object

An object of class splitwise_lm, as returned by splitwise.

newdata

A data frame of new data (with original predictors) to generate predictions for. The appropriate dummy variables will be generated using the transformation rules learned during model training. If omitted, predictions for the training data are returned.
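
The relationship between criterion and k can be seen with stats::step() alone. The sketch below uses a plain lm on mtcars rather than splitwise(), so it only illustrates how the penalty k enters the stepwise search.

```r
data(mtcars)
n    <- nrow(mtcars)
full <- lm(mpg ~ ., data = mtcars)

# k = 2 applies the AIC penalty; k = log(n) applies the BIC penalty
fit_aic <- step(full, direction = "backward", k = 2,      trace = 0)
fit_bic <- step(full, direction = "backward", k = log(n), trace = 0)

formula(fit_aic)
formula(fit_bic)

# The two criteria differ only in the per-parameter penalty
p <- attr(logLik(fit_bic), "df")
all.equal(BIC(fit_bic) - AIC(fit_bic), (log(n) - 2) * p)
```

Because log(n) exceeds 2 once n is 8 or more, the BIC penalty is the stricter of the two on any data set of realistic size.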

Value

An S3 object of class c("splitwise_lm", "lm"), storing:

splitwise_info

List containing transformation decisions, final data, and call.

Examples

# Load the mtcars dataset
data(mtcars)

# Univariate transformations (AIC-based, backward stepwise)
model_uni <- splitwise(
  mpg ~ .,
  data                = mtcars,
  transformation_mode = "univariate",
  direction           = "backward"
)
summary(model_uni)

# Iterative approach (BIC-based, forward stepwise)
# Note: typically set k = log(nrow(mtcars)) for BIC in step().
model_iter <- splitwise(
  mpg ~ .,
  data                = mtcars,
  transformation_mode = "iterative",
  direction           = "forward",
  criterion           = "BIC",
  k                   = log(nrow(mtcars))
)
summary(model_iter)

Transform Features (Iterative Logic)

Description

Builds the final data frame from the decisions returned by decide_variable_type_iterative, that is, which variables to add and in which form.

Usage

transform_features_iterative(X, decisions)

Arguments

X

Original predictor data frame.

decisions

Output of decide_variable_type_iterative.

Value

A data frame with the chosen variables in their final forms (dummy or linear).
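
A rough sketch of how such decisions could be turned into columns follows. The function apply_decisions and the cutoff values are hypothetical, chosen only for illustration; the coding of a single-split dummy as 1 at or above the cutoff is an assumption, while the "middle = 1" coding for two cutoffs follows the description of decide_variable_type_iterative.

```r
data(mtcars)

# Hypothetical decisions list, shaped like the documented return value
decisions <- list(
  wt   = list(type = "linear", cutoff = NULL),
  hp   = list(type = "dummy",  cutoff = 120),
  disp = list(type = "dummy",  cutoff = c(100, 300))
)

# Hypothetical helper (not a package function)
apply_decisions <- function(X, decisions) {
  out <- lapply(names(decisions), function(v) {
    dec <- decisions[[v]]
    x   <- X[[v]]
    if (dec$type == "linear") {
      x                                   # keep the original column
    } else if (length(dec$cutoff) == 1) {
      as.integer(x >= dec$cutoff)         # assumed: 1 at or above the cutoff
    } else {
      # double split: 1 strictly between the two cutoffs ("middle = 1")
      as.integer(x > dec$cutoff[1] & x < dec$cutoff[2])
    }
  })
  names(out) <- names(decisions)
  as.data.frame(out)
}

X_new <- apply_decisions(mtcars, decisions)
head(X_new)
```

Variables that do not appear in the decisions list are dropped, which matches the idea that only the chosen variables enter the final data frame.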


Transform Features (Univariate Logic)

Description

Given the decisions (dummy or linear) for each predictor, produce a transformed data frame. Dummy columns are 0/1 based on the cutoff.

Usage

transform_features_univariate(X, decisions)

Arguments

X

Original predictor data frame.

decisions

The list returned by decide_variable_type_univariate.

Value

A new data frame with either the original column or a dummy column for each variable.