% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/woe.binning.R
\name{woe.binning}
\alias{woe.binning}
\title{Binning via Fine and Coarse Classing}
\usage{
woe.binning(df, target.var, pred.var, min.perc.total,
            min.perc.class, stop.limit, abbrev.fact.levels, event.class)
}
\arguments{
\item{df}{Name of data frame with input data.}

\item{target.var}{Name of dichotomous target variable in quotes. Only target variables with
two distinct values (e.g. 0, 1 or \dQuote{Y}, \dQuote{N}) are accepted;
cases with NAs in the target variable will be ignored.}

\item{pred.var}{Name of predictor variable(s) to be binned in quotes.
A single variable name can be provided, e.g. \dQuote{varname1}, or a character
vector of variable names, e.g. c(\dQuote{varname1}, \dQuote{varname2}). Alternatively,
pass the input data frame again; the function is then applied
to all of its variables except the target variable.
Numeric variables and factors are supported and may contain NAs.}

\item{min.perc.total}{For numeric variables this parameter defines the number of initial
classes before any merging is applied. For example \emph{min.perc.total=0.05}
(5\%) will result in 20 initial classes. For factors the original
levels with a percentage below this limit are collected in a \sQuote{miscellaneous}
level before the merging based on the \emph{min.perc.class} and on the
WOE starts. Increasing the \emph{min.perc.total} parameter will avoid
sparse bins. Accepted range: 0.01-0.2; default: 0.05.}

\item{min.perc.class}{If a column percentage of one of the target classes within a bin is
below this limit (e.g. below 0.01=1\%) then the respective bin will be
joined with others. In case of numeric variables adjacent predictor classes
are merged. For factors respective levels (including sparse NAs) are
assigned to a \sQuote{miscellaneous} level. Setting \emph{min.perc.class}>0
may provide more reliable WOE values. Accepted range: 0-0.2;
default: 0, i.e. no merging with respect to sparse target classes
is applied.}

\item{stop.limit}{Stops WOE based merging of the predictor's classes/levels if the
resulting information value (IV) decreases by more than the specified proportion
(e.g. 0.05 = 5\%) compared to the preceding binning step. \emph{stop.limit=0} skips any
WOE based merging. Increasing \emph{stop.limit} simplifies the binning
solution and may avoid overfitting. Accepted range: 0-0.5; default: 0.1.}

\item{abbrev.fact.levels}{Abbreviates the names of new (merged) factor levels via the base R
\code{\link{abbreviate}} function if they exceed the specified number of
characters. Accepted range: 0-1000; default: 200.
A value of 0 disables abbreviation, i.e. only factor levels with
more than 1000 characters are truncated then.
This option is particularly relevant when generating dummy
variables via the \code{\link{woe.binning.deploy}} function, because the
factor levels become part of the dummy variable names.}

\item{event.class}{Optional parameter for specifying the class of the target event. This
class typically indicates a negative event like a loan default or a
disease. Use integers (e.g. 1) or characters in quotes (e.g. \dQuote{bad}).
This class will be represented by negative WOE values then.}
}
\value{
\code{woe.binning} generates an object containing the information necessary
for studying and applying the realized binning solution. When assigned to a
variable, it can be passed to the functions \code{\link{woe.binning.plot}},
\code{\link{woe.binning.table}} and \code{\link{woe.binning.deploy}}.
}
\description{
\code{woe.binning} generates a supervised fine and coarse classing of numeric
variables and factors with respect to a dichotomous target variable. Its parameters
provide flexibility in finding a binning that fits specific data characteristics
and practical needs.
}
\section{Binning of Numeric Variables}{

Numeric variables (continuous and ordinal) are first divided into initial classes with
similar frequencies, which are then merged. The number of initial bins follows from the
\emph{min.perc.total} parameter: it results in \code{trunc(1/min.perc.total)} initial bins,
where \code{trunc} is needed to guarantee bins with similar frequencies.
For example, \emph{min.perc.total=0.07} yields trunc(14.3)=14 initial classes.
Next, if \emph{min.perc.class}>0, bins with sparse target classes are merged with
the next upper bin or, in case of the last bin, with the next lower one. NAs have
their own bin and are not merged with others. Finally, neighboring bins with the most similar
weight of evidence (WOE) values are joined step by step until the information value
(IV) decreases by more than the percentage specified via the \emph{stop.limit} parameter
or until only two bins remain.
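
As a rough illustration of the initial fine classing (a sketch under stated assumptions,
not the package's internal code), the number of initial bins and a set of cut points with
similar frequencies could be derived in base R as shown below. The names \code{x},
\code{n.bins}, \code{breaks} and \code{bins} are hypothetical, and obtaining the cut
points via \code{quantile} is an assumption made for this sketch:
\preformatted{
# Made-up numeric predictor with 1000 distinct values
x <- seq_len(1000)

# Number of initial bins implied by min.perc.total = 0.07
min.perc.total <- 0.07
n.bins <- trunc(1 / min.perc.total)   # trunc(14.3) = 14

# Assumed approach: quantile-based cut points give similar frequencies
breaks <- quantile(x, probs = seq(0, 1, length.out = n.bins + 1))
bins <- cut(x, breaks = unique(breaks), include.lowest = TRUE)

# Frequencies per initial bin
table(bins)
}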
}

\section{Binning of Factors}{

Factors (categorical variables) are binned by merging factor levels. To start, sparse
levels (defined via the \emph{min.perc.total} and \emph{min.perc.class} parameters)
are merged into a \sQuote{miscellaneous} level: if possible, the respective levels (including
sparse NAs) are bundled either as \sQuote{misc. level pos.} (associated with positive WOE
values) or as \sQuote{misc. level neg.} (associated with negative WOE
values). If a misc. level contains only NAs, it is named \sQuote{Missing}.
Afterwards, levels with similar WOE values are joined step by step until the information
value (IV) decreases by more than the percentage specified via the \emph{stop.limit} parameter
or until only two bins remain.
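
The first step, collecting sparse levels, can be pictured with the following base R
sketch. It is a simplified illustration rather than the package's implementation: it
applies only the \emph{min.perc.total} criterion, uses a single hypothetical level name
\sQuote{misc. level}, and ignores the split into positive and negative WOE values. The
names \code{f}, \code{share} and \code{sparse} are made up for this example:
\preformatted{
# Made-up factor with two sparse levels ("c" and "d")
f <- factor(c(rep("a", 60), rep("b", 37), rep("c", 2), "d"))

# Share of each level in the total
share <- prop.table(table(f))

# Levels below the assumed limit min.perc.total = 0.05
sparse <- names(share)[share < 0.05]

# Assigning the same name to several levels merges them
levels(f)[is.element(levels(f), sparse)] <- "misc. level"
table(f)
}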
}

\section{Adjustment of 0 Frequencies}{

If the crosstab of the bins with the target classes contains zero frequencies,
the column percentages are adjusted so that the WOE and IV values can be computed:
an offset of 0.0001 (=0.01\%) is added to each column percentage cell and the column
percentages are then recomputed. This allows bins associated with only one target
class to be considered, but may cause extreme WOE values for these bins. If such a correction
is not appropriate, choose \emph{min.perc.class}>0; bins with sparse target classes are
then merged before any WOE or IV value is computed.
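
The adjustment can be traced with a small worked example in base R. This is a sketch
under assumptions, not the package's code: the crosstab is made up, the names
\code{freq}, \code{col.perc} and \code{adj} are hypothetical, and WOE is taken as the
natural log of the non-event share over the event share, so the package's exact scaling
may differ:
\preformatted{
# Made-up crosstab: bins in rows, target classes in columns.
# Bin B3 contains no "bad" cases.
freq <- matrix(c(40, 10,
                 50, 20,
                 10,  0),
               ncol = 2, byrow = TRUE,
               dimnames = list(c("B1", "B2", "B3"), c("good", "bad")))

# Column percentages: share of each bin within a target class
col.perc <- prop.table(freq, margin = 2)

# Add the offset to every cell, then recompute the column percentages
adj <- prop.table(col.perc + 0.0001, margin = 2)

# Assumed convention: WOE = ln(good share / bad share)
woe <- log(adj[, "good"] / adj[, "bad"])
iv <- sum((adj[, "good"] - adj[, "bad"]) * woe)

woe   # finite for B3, but extreme compared to B1 and B2
iv
}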
}

\section{Handling of Missing Data}{

Cases with NAs in the target variable are ignored. For predictor variables the following
applies: if NAs already occurred when the binning solution was generated,
the code \sQuote{Missing} is displayed and a corresponding WOE value can be computed.
(Note that factor NAs may be joined with other sparse levels into a \sQuote{miscellaneous}
level, see above; only this \sQuote{miscellaneous} level is displayed then.)
If NAs occur only in the deployment scenario, \sQuote{Missing} is
displayed for numeric variables and \sQuote{unknown} for factors, and
the corresponding WOE values are NA as well.
}

\examples{
# Load German credit data and create subset
data(germancredit)
df <- germancredit[, c('creditability', 'credit.amount', 'duration.in.month',
                  'savings.account.and.bonds', 'purpose')]

# Bin a single numeric variable
binning <- woe.binning(df, 'creditability', 'duration.in.month',
                       min.perc.total=0.05, min.perc.class=0.01,
                       stop.limit=0.1, event.class='bad')

# Bin a single factor
binning <- woe.binning(df, 'creditability', 'purpose',
                       min.perc.total=0.05, min.perc.class=0, stop.limit=0.1,
                       abbrev.fact.levels=50, event.class='bad')

# Bin two variables (one numeric and one factor)
# with default parameter settings
binning <- woe.binning(df, 'creditability', c('credit.amount','purpose'))

# Bin all variables of the data frame (apart from the target variable)
# with default parameter settings
binning <- woe.binning(df, 'creditability', df)
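
# Inspect and apply the binning solution generated above.
# Sketch only: the calls use the companion functions named in the
# Value section; the argument add.woe.or.dum.var='woe' is assumed from
# the documentation of woe.binning.deploy, so check that help page.
woe.binning.table(binning)
woe.binning.plot(binning)
df.with.binned.vars.added <- woe.binning.deploy(df, binning,
                                                add.woe.or.dum.var='woe')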

}
\seealso{
Other binning functions: \code{\link{woe.tree.binning}}
}
