Type: | Package |
Title: | Hybrid Stepwise Regression with Single-Split Dummy Encoding |
Version: | 1.0.2 |
Description: | Implements 'SplitWise', a hybrid regression approach that transforms numeric variables into either single-split (0/1) dummy variables or retains them as continuous predictors. The transformation is followed by stepwise selection to identify the most relevant variables. The default 'iterative' mode adaptively explores partial synergies among variables to enhance model performance, while an alternative 'univariate' mode applies simpler transformations independently to each predictor. For details, see Kurbucz et al. (2025) <doi:10.48550/arXiv.2505.15423>. |
License: | GPL (≥ 3) |
Encoding: | UTF-8 |
Depends: | R (≥ 3.5.0) |
Imports: | rpart, stats |
RoxygenNote: | 7.3.2 |
Suggests: | knitr, rmarkdown, testthat (≥ 3.0.0) |
Config/testthat/edition: | 3 |
VignetteBuilder: | knitr |
NeedsCompilation: | no |
Packaged: | 2025-07-31 05:22:34 UTC; Marcell |
Author: | Marcell T. Kurbucz [aut, cre], Nikolaos Tzivanakis [aut], Nilufer Sari Aslam [aut], Adam M. Sykulski [aut] |
Maintainer: | Marcell T. Kurbucz <m.kurbucz@ucl.ac.uk> |
Repository: | CRAN |
Date/Publication: | 2025-07-31 05:40:02 UTC |
Decide Variable Type (Iterative)
Description
A stepwise variable-selection method that iteratively chooses
each variable's best form: "linear", single-split "dummy", or
double-split ("middle=1") dummy, based on AIC/BIC improvement. Supports
"forward", "backward", or "both" strategies.
Usage
decide_variable_type_iterative(
  X,
  Y,
  min_support = 0.1,
  min_improvement = 3,
  direction = c("backward", "forward", "both"),
  criterion = c("AIC", "BIC"),
  exclude_vars = NULL,
  verbose = FALSE,
  ...
)
Arguments
X |
A data frame of predictors (no response). |
Y |
A numeric vector (the response). |
min_support |
Minimum fraction (0-0.5) of observations required in either group after a dummy split. Default = 0.1. |
min_improvement |
Minimum required improvement in AIC/BIC for accepting a dummy split or variable transformation. Default = 3. |
direction |
Stepwise strategy: "backward", "forward", or "both". |
criterion |
A character string: either "AIC" or "BIC". |
exclude_vars |
A character vector of variable names to exclude from
dummy transformations. These variables will always be treated as linear.
Default = NULL. |
verbose |
Logical; if |
... |
Additional arguments (currently unused). |
Details
By default, no split is allowed with fewer than 5 observations (i.e., minsplit is max(5, ceiling(min_support * n))). This is not user-configurable.
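The rule above can be written out directly. The following is a minimal sketch in base R; the helper name min_split_size is hypothetical and is not a function exported by the package.

```r
# Hypothetical helper (not part of SplitWise): the smallest node size
# implied by the rule max(5, ceiling(min_support * n)).
min_split_size <- function(n, min_support = 0.1) {
  max(5, ceiling(min_support * n))
}

# Small sample: ceiling(0.1 * 32) is 4, so the floor of 5 applies.
min_split_size(32)

# Larger sample: the min_support fraction dominates the floor of 5.
min_split_size(200, min_support = 0.25)
```

For small data sets the fixed floor of 5 is the binding constraint; min_support only takes over once min_support * n exceeds 5.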
Dummy forms come from a shallow (maxdepth = 2) rpart tree fit to the
partial residuals of the current model. We extract up to two splits:
- Single cutoff dummy (e.g., x >= c)
- Double cutoff dummy (e.g., c1 < x < c2)
The function then picks the form (linear, single-split dummy, or
double-split dummy) that yields the lowest AIC/BIC. Variables listed in
exclude_vars will be forced to remain linear (dummy transformations
are never attempted).
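One step of this procedure might look roughly like the sketch below. It is an illustration under stated assumptions, not the package's internal code: the current model (mpg ~ wt), the candidate variable (hp), the use of ordinary lm residuals as the tree's response, and cp = 0 are all choices made here for the example.

```r
library(rpart)

data(mtcars)
n <- nrow(mtcars)

# Assumed current model for illustration: mpg explained by wt only.
current <- lm(mpg ~ wt, data = mtcars)

# Fit a shallow tree for the candidate variable hp on the residuals
# of the current model (assumption: plain lm residuals are used).
d <- data.frame(res = residuals(current), hp = mtcars$hp)
tree <- rpart(
  res ~ hp,
  data = d,
  control = rpart.control(
    maxdepth = 2,
    minsplit = max(5, ceiling(0.1 * n)),
    cp = 0, maxcompete = 0, maxsurrogate = 0
  )
)

# Cut points of the primary splits; empty if the tree did not split.
cuts <- if (is.null(tree$splits)) numeric(0) else unname(tree$splits[, "index"])

if (length(cuts) >= 1) {
  c1 <- cuts[1]  # root split, used here as the single cutoff
  fit_linear <- lm(mpg ~ wt + hp, data = mtcars)
  fit_dummy  <- lm(mpg ~ wt + I(as.numeric(hp >= c1)), data = mtcars)
  # Compare the two candidate forms of hp by AIC (lower is better).
  print(c(cutoff = c1, AIC_linear = AIC(fit_linear), AIC_dummy = AIC(fit_dummy)))
}
```

The same comparison can be made with BIC() in place of AIC() when criterion = "BIC".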
Value
A named list of decisions, where each element is a list with:
- type: Either "linear" or "dummy".
- cutoff: A numeric vector of length 1 or 2 (the chosen split points).
Decide Variable Type (Univariate)
Description
For each numeric predictor, this function fits a shallow
(maxdepth = 2) rpart tree directly on Y ~ x and tests
whether a dummy transformation improves model fit.
Usage
decide_variable_type_univariate(
  X,
  Y,
  min_support = 0.1,
  min_improvement = 3,
  criterion = c("AIC", "BIC"),
  exclude_vars = NULL,
  verbose = FALSE
)
Arguments
X |
A data frame of numeric predictors (no response). |
Y |
A numeric response vector. |
min_support |
Minimum fraction (0-0.5) of observations required in either group after a dummy split. Default = 0.1. |
min_improvement |
Minimum required improvement in AIC/BIC for accepting a dummy split or variable transformation. Default = 3. |
criterion |
A character string: either "AIC" or "BIC". |
exclude_vars |
A character vector of variable names to exclude from
dummy transformations. These variables will always be treated as linear.
Default = NULL. |
verbose |
Logical; if |
Details
By default, no split is allowed with fewer than 5 observations (i.e., minsplit is max(5, ceiling(min_support * n))). This is not user-configurable.
Dummy forms come from a shallow (maxdepth = 2) rpart tree
fit to the data. We extract up to two splits:
- Single cutoff dummy (e.g., x >= c)
- Double cutoff dummy (e.g., c1 < x < c2)
The function then picks the form (linear, single-split dummy, or
double-split dummy) that yields the lowest AIC/BIC. If a variable is
listed in exclude_vars, it will always be used as a linear
predictor (dummy transformation is never attempted).
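The decision for a single predictor can be sketched as follows. This is a simplified illustration on simulated data, not the package's implementation: it considers only the single-cutoff form, and the exact way min_support and min_improvement are combined here is an assumption.

```r
library(rpart)

set.seed(1)
n <- 200
x <- runif(n)
y <- 2 * (x >= 0.6) + rnorm(n, sd = 0.5)  # simulated step-shaped relationship
d <- data.frame(x = x, y = y)

min_support <- 0.1
min_improvement <- 3

# Shallow tree fit directly on y ~ x.
tree <- rpart(
  y ~ x,
  data = d,
  control = rpart.control(
    maxdepth = 2,
    minsplit = max(5, ceiling(min_support * n)),
    maxcompete = 0, maxsurrogate = 0
  )
)

# Root split point, used as the single cutoff.
cutoff <- unname(tree$splits[1, "index"])
d$dummy <- as.numeric(d$x >= cutoff)

aic_linear <- AIC(lm(y ~ x, data = d))
aic_dummy  <- AIC(lm(y ~ dummy, data = d))

# Accept the dummy only if both groups are large enough and the
# AIC gain over the linear form reaches min_improvement.
support_ok <- min(mean(d$dummy), 1 - mean(d$dummy)) >= min_support
use_dummy  <- support_ok && (aic_linear - aic_dummy) >= min_improvement

decision <- if (use_dummy) {
  list(type = "dummy", cutoffs = cutoff)
} else {
  list(type = "linear", cutoffs = NULL)
}
str(decision)
```

Because the simulated response is a step function of x, the dummy form is expected to win here; for a smoothly linear relationship the same rule would normally keep the variable linear.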
Value
A named list of decisions, where each element is a list with:
- type: Either "dummy" or "linear".
- cutoffs: A numeric vector (length 1 or 2) if type = "dummy", or NULL if linear.
- tree_model: The fitted rpart model (for reference) or NULL if excluded.
SplitWise Regression
Description
Either transforms each numeric variable into a single-split
dummy or keeps it linear, then runs stats::step() for stepwise
selection. The user can choose between a simpler univariate
transformation and an iterative approach.
Usage
splitwise(
  formula,
  data,
  transformation_mode = c("iterative", "univariate"),
  direction = c("backward", "forward", "both"),
  min_support = 0.1,
  min_improvement = 3,
  criterion = c("AIC", "BIC"),
  exclude_vars = NULL,
  verbose = FALSE,
  steps = 1000,
  k = 2,
  ...
)
## S3 method for class 'splitwise_lm'
print(x, ...)
## S3 method for class 'splitwise_lm'
summary(object, ...)
## S3 method for class 'splitwise_lm'
predict(object, newdata, ...)
## S3 method for class 'splitwise_lm'
coef(object, ...)
## S3 method for class 'splitwise_lm'
fitted(object, ...)
## S3 method for class 'splitwise_lm'
residuals(object, ...)
## S3 method for class 'splitwise_lm'
model.matrix(object, ...)
Arguments
formula |
A formula specifying the response and (initial) predictors,
e.g. |
data |
A data frame containing the variables used in formula. |
transformation_mode |
Either "iterative" or "univariate". |
direction |
Stepwise direction: "backward", "forward", or "both". |
min_support |
Minimum fraction (between 0 and 0.5) of observations
needed in either group when making a dummy split. Prevents over-fragmented
or tiny dummy groups. Default = 0.1. |
min_improvement |
Minimum required improvement (in AIC/BIC units) for
accepting a dummy split or variable transformation. Helps guard against
overfitting from marginal improvements. Default = 3. |
criterion |
Either "AIC" or "BIC". |
exclude_vars |
A character vector naming variables that should be
forced to remain linear (i.e., no dummy splits allowed).
Default = NULL. |
verbose |
Logical; if |
steps |
Maximum number of steps for stats::step(). Default = 1000. |
k |
Penalty multiple for the number of degrees of freedom
(used by stats::step()). Default = 2. |
... |
Additional arguments passed to |
x |
A "splitwise_lm" object. |
object |
An object of class "splitwise_lm". |
newdata |
A data frame of new data (with original predictors) to generate predictions for. The appropriate dummy variables will be generated using the transformation rules learned during model training. If omitted, predictions for the training data are returned. |
Value
An S3 object of class c("splitwise_lm", "lm"), storing:
splitwise_info |
List containing transformation decisions, final data, and call. |
Functions
- print(splitwise_lm): Prints a summary of the splitwise_lm object.
- summary(splitwise_lm): Provides a detailed summary, including how dummies were created.
- predict(splitwise_lm): Generates predictions from a splitwise_lm object using learned transformation rules.
- coef(splitwise_lm): Extracts model coefficients from a SplitWise linear model.
- fitted(splitwise_lm): Extracts fitted values from a SplitWise linear model.
- residuals(splitwise_lm): Extracts residuals from a SplitWise linear model.
- model.matrix(splitwise_lm): Extracts the model matrix from a SplitWise linear model.
Examples
# Load the mtcars dataset
data(mtcars)

# Univariate transformations (AIC-based, backward stepwise)
model_uni <- splitwise(
  mpg ~ .,
  data = mtcars,
  transformation_mode = "univariate",
  direction = "backward"
)
summary(model_uni)

# Iterative approach (BIC-based, forward stepwise)
# Note: typically set k = log(nrow(mtcars)) for BIC in step().
model_iter <- splitwise(
  mpg ~ .,
  data = mtcars,
  transformation_mode = "iterative",
  direction = "forward",
  criterion = "BIC",
  k = log(nrow(mtcars))
)
summary(model_iter)
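The note about k in the second example can be illustrated with stats::step() alone. The sketch below uses only base R on untransformed mtcars data, so it shows the effect of the penalty and not SplitWise itself.

```r
data(mtcars)
n <- nrow(mtcars)
full <- lm(mpg ~ ., data = mtcars)

# k = 2 is the AIC penalty per parameter; k = log(n) is the BIC penalty.
sel_aic <- step(full, direction = "backward", k = 2, trace = 0)
sel_bic <- step(full, direction = "backward", k = log(n), trace = 0)

# Compare which terms each criterion keeps.
formula(sel_aic)
formula(sel_bic)

# extractAIC() shows the penalty directly: the two criteria differ by
# (log(n) - 2) times the equivalent degrees of freedom.
extractAIC(full, k = log(n)) - extractAIC(full, k = 2)
```

Since log(n) exceeds 2 once n is 8 or more, the BIC penalty is the heavier one for most data sets and tends to favour smaller models.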
Transform Features (Iterative Logic)
Description
Once decide_variable_type_iterative has chosen which
variables to add (and how), we can build a final data frame from those
decisions.
Usage
transform_features_iterative(X, decisions)
Arguments
X |
Original predictor data frame. |
decisions |
Output of decide_variable_type_iterative. |
Value
A data frame with the chosen variables in their final forms (dummy or linear).
Transform Features (Univariate Logic)
Description
Given the decisions (dummy or linear) for each predictor, produce a transformed data frame. Dummy columns are 0/1 based on the cutoff.
Usage
transform_features_univariate(X, decisions)
Arguments
X |
Original predictor data frame. |
decisions |
The list returned by decide_variable_type_univariate. |
Value
A new data frame with either the original column or a dummy column for each variable.
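The mapping from decisions to columns can be sketched in a few lines of base R. The helper apply_decisions below is hypothetical and is not the package's implementation; it assumes each decision carries the type and cutoffs fields described for the univariate decisions, and that the dummy conditions follow the x >= c and c1 < x < c2 forms given in the Details sections.

```r
# Hypothetical helper (not part of SplitWise): replace each "dummy"
# variable by a 0/1 column and leave "linear" variables unchanged.
apply_decisions <- function(X, decisions) {
  out <- X
  for (v in names(decisions)) {
    d <- decisions[[v]]
    if (identical(d$type, "dummy")) {
      cuts <- sort(d$cutoffs)
      out[[v]] <- if (length(cuts) == 1) {
        as.numeric(X[[v]] >= cuts[1])                    # single cutoff
      } else {
        as.numeric(X[[v]] > cuts[1] & X[[v]] < cuts[2])  # middle = 1
      }
    }
  }
  out
}

X <- data.frame(a = c(1, 5, 9), b = c(10, 20, 30))
decisions <- list(
  a = list(type = "dummy", cutoffs = 4),
  b = list(type = "linear", cutoffs = NULL)
)

# Column a becomes 0, 1, 1; column b is returned as-is.
apply_decisions(X, decisions)
```

Applying the same stored cutoffs to new data is what allows predictions to reuse the transformation rules learned during training.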