Help for package SplitWise

Type:

Package

Title:

Hybrid Stepwise Regression with Single-Split Dummy Encoding

Version:

1.0.2

Description:

Implements 'SplitWise', a hybrid regression approach that transforms numeric variables into either single-split (0/1) dummy variables or retains them as continuous predictors. The transformation is followed by stepwise selection to identify the most relevant variables. The default 'iterative' mode adaptively explores partial synergies among variables to enhance model performance, while an alternative 'univariate' mode applies simpler transformations independently to each predictor. For details, see Kurbucz et al. (2025) <doi:10.48550/arXiv.2505.15423>.

License:

GPL (≥ 3)

Encoding:

UTF-8

Depends:

R (≥ 3.5.0)

Imports:

rpart, stats

RoxygenNote:

7.3.2

Suggests:

knitr, rmarkdown, testthat (≥ 3.0.0)

Config/testthat/edition:

VignetteBuilder:

knitr

NeedsCompilation:

Packaged:

2025-07-31 05:22:34 UTC; Marcell

Author:

Marcell T. Kurbucz [aut, cre], Nikolaos Tzivanakis [aut], Nilufer Sari Aslam [aut], Adam M. Sykulski [aut]

Maintainer:

Marcell T. Kurbucz <m.kurbucz@ucl.ac.uk>

Repository:

CRAN

Date/Publication:

2025-07-31 05:40:02 UTC

Decide Variable Type (Iterative)

Description

A stepwise variable-selection method that iteratively chooses each variable's best form: "linear", single-split "dummy", or double-split ("middle=1") dummy, based on AIC/BIC improvement. Supports "forward", "backward", or "both" strategies.

Usage

decide_variable_type_iterative(
  X,
  Y,
  min_support = 0.1,
  min_improvement = 3,
  direction = c("backward", "forward", "both"),
  criterion = c("AIC", "BIC"),
  exclude_vars = NULL,
  verbose = FALSE,
  ...
)

Arguments

X

A data frame of predictors (no response).

Y

A numeric vector (the response).

min_support

Minimum fraction (0-0.5) of observations required in either group after a dummy split. Default = 0.1.

min_improvement

Minimum required improvement in AIC/BIC for accepting a dummy split or variable transformation. Default = 3.

direction

Stepwise strategy: "forward", "backward", or "both". Default = "backward".

criterion

A character string: either "AIC" or "BIC". Default = "AIC".

exclude_vars

A character vector of variable names to exclude from dummy transformations. These variables will always be treated as linear. Default = NULL.

verbose

Logical; if TRUE, prints messages for debugging. Default = FALSE.

...

Additional arguments (currently unused).

Details

No split is attempted on fewer than 5 observations: minsplit is fixed at max(5, ceiling(min_support * n)), where n is the number of observations. This is not user-configurable.
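As a worked illustration of the minsplit rule, the arithmetic can be written as a one-line helper. The name min_split_size is hypothetical and is not exported by the package; it only restates the formula given above.

```r
# Hypothetical helper restating the documented rule; not a SplitWise function.
min_split_size <- function(n, min_support = 0.1) {
  max(5, ceiling(min_support * n))
}

# Small sample: ceiling(0.1 * 32) = 4, so the floor of 5 observations applies.
small_n <- min_split_size(32)

# Larger sample: ceiling(0.1 * 200) = 20, so min_support drives the value.
large_n <- min_split_size(200)

c(small_n, large_n)
```

For data sets with fewer than about 50 rows and the default min_support, the fixed floor of 5 is the binding constraint.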

Dummy forms come from a shallow (maxdepth = 2) rpart tree fit to the partial residuals of the current model. We extract up to two splits:

Single cutoff dummy (e.g., x >= c)
Double cutoff dummy (e.g., c1 < x < c2)

The function then picks the form (linear, single-split dummy, or double-split dummy) that yields the lowest AIC/BIC. Variables listed in exclude_vars will be forced to remain linear (dummy transformations are never attempted).
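The following is a minimal sketch of a single iteration of this idea, not the package's internal code. It assumes a current model mpg ~ wt on mtcars and a candidate variable hp (both chosen here only for illustration), fits a shallow rpart tree to the residuals, and compares the linear and single-split forms by AIC. The acceptance rule (dummy AIC must beat linear AIC by at least min_improvement) and the support check are one plausible reading of the arguments above.

```r
library(rpart)

data(mtcars)
n <- nrow(mtcars)
min_support <- 0.1
min_improvement <- 3

# Current model and its residuals (illustrative starting point).
current <- lm(mpg ~ wt, data = mtcars)
work <- data.frame(r = resid(current), hp = mtcars$hp)

# Shallow tree on the residuals, using the documented minsplit rule.
ctrl <- rpart.control(
  maxdepth = 2,
  minsplit = max(5, ceiling(min_support * n)),
  maxcompete = 0,    # keep only primary splits in tree$splits
  maxsurrogate = 0
)
tree <- rpart(r ~ hp, data = work, method = "anova", control = ctrl)

chosen <- "linear"
if (!is.null(tree$splits)) {
  cutoff <- unname(tree$splits[1, "index"])        # root split point
  cand <- transform(mtcars, hp_dummy = as.integer(hp >= cutoff))
  support <- mean(cand$hp_dummy)                   # share of observations coded 1

  aic_linear <- AIC(lm(mpg ~ wt + hp, data = mtcars))
  aic_dummy  <- AIC(lm(mpg ~ wt + hp_dummy, data = cand))

  # Accept the dummy only if both groups are large enough and AIC improves enough.
  if (min(support, 1 - support) >= min_support &&
      aic_linear - aic_dummy >= min_improvement) {
    chosen <- "dummy"
  }
}
chosen
```

The real function repeats this kind of comparison across variables and stepwise directions; the sketch shows only where the cutoff comes from and how the two forms can be compared.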

Value

A named list of decisions, where each element is a list with:

type: Either "linear" or "dummy".
cutoff: A numeric vector of length 1 or 2 (the chosen split points).

Decide Variable Type (Univariate)

Description

For each numeric predictor, this function fits a shallow (maxdepth = 2) rpart tree directly on Y ~ x and tests whether a dummy transformation improves model fit.

Usage

decide_variable_type_univariate(
  X,
  Y,
  min_support = 0.1,
  min_improvement = 3,
  criterion = c("AIC", "BIC"),
  exclude_vars = NULL,
  verbose = FALSE
)

Arguments

X

A data frame of numeric predictors (no response).

Y

A numeric response vector.

min_support

Minimum fraction (0-0.5) of observations required in either group after a dummy split. Default = 0.1.

min_improvement

Minimum required improvement in AIC/BIC for accepting a dummy split or variable transformation. Default = 3.

criterion

A character string: either "AIC" or "BIC". Default = "AIC".

exclude_vars

A character vector of variable names to exclude from dummy transformations. These variables will always be treated as linear. Default = NULL.

verbose

Logical; if TRUE, prints messages for debugging. Default = FALSE.

Details

No split is attempted on fewer than 5 observations: minsplit is fixed at max(5, ceiling(min_support * n)), where n is the number of observations. This is not user-configurable.

Dummy forms come from a shallow (maxdepth = 2) rpart tree fit to the data. We extract up to two splits:

Single cutoff dummy (e.g., x >= c)
Double cutoff dummy (e.g., c1 < x < c2)

The function then picks the form (linear, single-split dummy, or double-split dummy) that yields the lowest AIC/BIC. If a variable is listed in exclude_vars, it will always be used as a linear predictor (dummy transformation is never attempted).
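A rough sketch of the three-way comparison for one predictor is shown below. It is an illustration under stated assumptions rather than the package's implementation: it uses hp and mpg from mtcars, takes the first two primary split points reported by rpart as the candidate cutoffs, and simply picks the form with the lowest AIC without applying min_improvement.

```r
library(rpart)

data(mtcars)
dat <- data.frame(x = mtcars$hp, y = mtcars$mpg)

ctrl <- rpart.control(
  maxdepth = 2,
  minsplit = max(5, ceiling(0.1 * nrow(dat))),
  maxcompete = 0,    # keep only primary splits in tree$splits
  maxsurrogate = 0
)
tree <- rpart(y ~ x, data = dat, method = "anova", control = ctrl)

# Primary split points; empty if the tree did not split at all.
cuts <- if (is.null(tree$splits)) numeric(0) else unname(tree$splits[, "index"])

# Linear form is always a candidate.
aic <- c(linear = AIC(lm(y ~ x, data = dat)))

if (length(cuts) >= 1) {
  dat$d1 <- as.integer(dat$x >= cuts[1])            # single cutoff: x >= c
  aic["single"] <- AIC(lm(y ~ d1, data = dat))
}
if (length(cuts) >= 2) {
  rng <- range(cuts[1:2])
  dat$d2 <- as.integer(dat$x > rng[1] & dat$x < rng[2])  # double cutoff: c1 < x < c2
  aic["double"] <- AIC(lm(y ~ d2, data = dat))
}

best_form <- names(which.min(aic))
best_form
```

Swapping AIC() for BIC() gives the corresponding BIC-based comparison.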

Value

A named list of decisions, where each element is a list with:

type: Either "dummy" or "linear".
cutoffs: A numeric vector (length 1 or 2) if type = "dummy", or NULL if linear.
tree_model: The fitted rpart model (for reference) or NULL if excluded.

SplitWise Regression

Description

For each numeric variable, either transforms it into a single-split dummy or keeps it linear, then runs stats::step() for stepwise selection. The user can choose between a simpler univariate transformation and an iterative approach.

Usage

splitwise(
  formula,
  data,
  transformation_mode = c("iterative", "univariate"),
  direction = c("backward", "forward", "both"),
  min_support = 0.1,
  min_improvement = 3,
  criterion = c("AIC", "BIC"),
  exclude_vars = NULL,
  verbose = FALSE,
  steps = 1000,
  k = 2,
  ...
)

## S3 method for class 'splitwise_lm'
print(x, ...)

## S3 method for class 'splitwise_lm'
summary(object, ...)

## S3 method for class 'splitwise_lm'
predict(object, newdata, ...)

## S3 method for class 'splitwise_lm'
coef(object, ...)

## S3 method for class 'splitwise_lm'
fitted(object, ...)

## S3 method for class 'splitwise_lm'
residuals(object, ...)

## S3 method for class 'splitwise_lm'
model.matrix(object, ...)

Arguments

formula

A formula specifying the response and (initial) predictors, e.g., mpg ~ .

data

A data frame containing the variables used in formula.

transformation_mode

Either "iterative" or "univariate". Default = "iterative".

direction

Stepwise direction: "backward", "forward", or "both". Default = "backward".

min_support

Minimum fraction (between 0 and 0.5) of observations needed in either group when making a dummy split. Prevents over-fragmented or tiny dummy groups. Default = 0.1.

min_improvement

Minimum required improvement (in AIC/BIC units) for accepting a dummy split or variable transformation. Helps guard against overfitting from marginal improvements. Default = 3.

criterion

Either "AIC" or "BIC". Default = "AIC". Note: if you choose "BIC", you typically also want to set k = log(nrow(data)) for the stepwise selection.

exclude_vars

A character vector naming variables that should be forced to remain linear (i.e., no dummy splits allowed). Default = NULL.

verbose

Logical; if TRUE, prints debug info in transformation steps. If FALSE, the stepwise selection process is run quietly (trace = 0 in step()). Default = FALSE.

steps

Maximum number of steps for step(). Default = 1000.

k

Penalty multiple for the number of degrees of freedom (used by step()). E.g. 2 for AIC, log(n) for BIC. Default = 2.

...

Additional arguments passed to predict.lm.

x

A "splitwise_lm" object returned by splitwise.

object

An object of class splitwise_lm, as returned by splitwise.

newdata

A data frame of new data (with original predictors) to generate predictions for. The appropriate dummy variables will be generated using the transformation rules learned during model training. If omitted, predictions for the training data are returned.
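To make the relationship between criterion and k concrete, here is a small base-R sketch that calls stats::step() directly on a plain lm fit of mtcars. It does not use splitwise itself; it only illustrates how the penalty multiple changes between an AIC-style and a BIC-style search.

```r
data(mtcars)
n <- nrow(mtcars)
full <- lm(mpg ~ ., data = mtcars)

# AIC-style penalty: k = 2 per estimated parameter.
fit_aic <- step(full, direction = "backward", k = 2, trace = 0, steps = 1000)

# BIC-style penalty: k = log(n) per estimated parameter (larger than 2 once n >= 8).
fit_bic <- step(full, direction = "backward", k = log(n), trace = 0, steps = 1000)

# Compare which terms each search retained.
formula(fit_aic)
formula(fit_bic)
```

Because log(n) exceeds 2 for all but very small samples, the BIC-style penalty tends to favor smaller models.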

Value

An S3 object of class c("splitwise_lm", "lm"), storing:

splitwise_info

List containing transformation decisions, final data, and call.

Functions

print(splitwise_lm): Prints a summary of the splitwise_lm object.
summary(splitwise_lm): Provides a detailed summary, including how dummies were created.
predict(splitwise_lm): Generate predictions from a splitwise_lm object using learned transformation rules.
coef(splitwise_lm): Extract model coefficients from a SplitWise linear model.
fitted(splitwise_lm): Extract fitted values from a SplitWise linear model.
residuals(splitwise_lm): Extract residuals from a SplitWise linear model.
model.matrix(splitwise_lm): Extract the model matrix from a SplitWise linear model.

Examples

# Load the mtcars dataset
data(mtcars)

# Univariate transformations (AIC-based, backward stepwise)
model_uni <- splitwise(
  mpg ~ .,
  data                = mtcars,
  transformation_mode = "univariate",
  direction           = "backward"
)
summary(model_uni)

# Iterative approach (BIC-based, forward stepwise)
# Note: typically set k = log(nrow(mtcars)) for BIC in step().
model_iter <- splitwise(
  mpg ~ .,
  data                = mtcars,
  transformation_mode = "iterative",
  direction           = "forward",
  criterion           = "BIC",
  k                   = log(nrow(mtcars))
)
summary(model_iter)

Transform Features (Iterative Logic)

Description

Builds the final data frame from the decisions returned by decide_variable_type_iterative, that is, which variables to include and in which form (dummy or linear).

Usage

transform_features_iterative(X, decisions)

Arguments

X

Original predictor data frame.

decisions

Output of decide_variable_type_iterative.

Value

A data frame with the chosen variables in their final forms (dummy or linear).

Transform Features (Univariate Logic)

Description

Given the decisions (dummy or linear) for each predictor, produce a transformed data frame. Dummy columns are 0/1 based on the cutoff.

Usage

transform_features_univariate(X, decisions)

Arguments

X

Original predictor data frame.

decisions

The list returned by decide_variable_type_univariate.

Value

A new data frame with either the original column or a dummy column for each variable.
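The transformation step can be pictured with the sketch below. It is not the package's code: apply_decisions is a hypothetical helper, the decisions list is hand-written with made-up cutoffs in the shape documented for decide_variable_type_univariate (type and cutoffs), and the inequality directions follow the examples x >= c and c1 < x < c2 given in the Details sections. The column naming used by the real functions may differ.

```r
data(mtcars)

# Hand-written decisions with made-up cutoffs, for illustration only.
decisions <- list(
  hp   = list(type = "dummy",  cutoffs = 120),
  wt   = list(type = "linear", cutoffs = NULL),
  qsec = list(type = "dummy",  cutoffs = c(16, 19))
)

# Hypothetical helper: keep linear columns as-is, code dummies as 0/1.
apply_decisions <- function(X, decisions) {
  out <- lapply(names(decisions), function(v) {
    d <- decisions[[v]]
    x <- X[[v]]
    if (d$type == "linear") {
      return(x)
    }
    if (length(d$cutoffs) == 1) {
      return(as.integer(x >= d$cutoffs))                    # single cutoff
    }
    as.integer(x > min(d$cutoffs) & x < max(d$cutoffs))     # middle = 1
  })
  names(out) <- names(decisions)
  as.data.frame(out)
}

X_new <- apply_decisions(mtcars[, c("hp", "wt", "qsec")], decisions)
head(X_new)
```

Applying the same stored cutoffs to a new data frame is what allows predictions on unseen data to use the transformation learned during training.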