\name{ROSE}
\alias{ROSE}

\title{
Generation of synthetic data by Randomly Over Sampling Examples (ROSE)
}

\description{
Creates a sample of synthetic data by enlarging the feature space of minority and majority class examples. 
Operationally, the new examples are drawn from a conditional kernel density estimate of the two classes, as 
described in Menardi and Torelli (2013).
}

\usage{
 
ROSE(formula, data, N, p=0.5, hmult.majo=1, hmult.mino=1, subset,
     na.action, seed)
}

\arguments{
  \item{formula}{
An object of class \code{\link{formula}} (or one that can be coerced to that class). 
The left-hand-side (response) should be a vector specifying the class labels. 
The right-hand-side should be a series of vectors with the predictors. See ``Warning'' 
for information about interaction among predictors or their transformations.
}
  \item{data}{
	An optional data frame, list or environment (or object
	coercible to a data frame by \code{as.data.frame}) in which 
   to preferentially interpret \code{formula}. 
   If not specified, the variables are taken from \code{environment(formula)}.
}
  \item{N}{
The desired sample size of the resulting data set generated by ROSE. If missing, 
it is set equal to the length of the response variable in \code{formula}.
}
  \item{p}{
The probability of the minority class examples in the resulting data set generated by ROSE.
}
  \item{hmult.majo}{
Optional shrink factor to be multiplied by the smoothing parameters to estimate the conditional
kernel density of the majority class. See ``References'' and ``Details''.
}
  \item{hmult.mino}{
Optional shrink factor to be multiplied by the smoothing parameters to estimate the conditional
kernel density of the minority class. See ``References'' and ``Details''. 
}
   \item{subset}{
An optional vector specifying a subset of observations to be used in the sampling process.
 } 
 \item{na.action}{
 A function which indicates what should happen when the data contain \code{NA}s.  
 The default is set by the \code{\link{na.action}} setting of \code{\link{options}}.
}
  \item{seed}{
A single value, interpreted as an integer, used to set the random seed so that the  
 generated sample can be reproduced.
}
}

\details{
ROSE (Random Over-Sampling Examples) is a smoothed-bootstrap based technique 
which aids the task of binary classification in the presence of rare classes.
ROSE produces a synthetic, possibly balanced, sample of data by simulating 
new examples from a kernel estimate of the conditional density
of the two classes. 

Let \eqn{y} denote the binary response and \eqn{\bold{x}} a vector 
of numeric predictors observed on \eqn{n} subjects \eqn{i} (\eqn{i=1, \ldots, n}). 
Synthetic examples with class label \eqn{k} (\eqn{k=0, 1}) are generated from:

\deqn{
\hat{f}(\bold{x}|y = k) = \sum_{i: y_i=k} \frac{1}{\# i: y_i=k} K_{H_k} (\bold{x}- \bold{x}_i)  
}

where \eqn{K} is a Normal product kernel centered at \eqn{\bold{x}_i} with
diagonal covariance matrix \eqn{H_k}. Here, \eqn{H_k} is the asymptotically optimal
smoothing matrix under the assumption of multivariate normality. See ``References''
below and further references therein.

Essentially, ROSE selects observed data belonging to the class \eqn{k} 
and generates new examples in its neighborhood, 
where the width of the neighborhood is determined by \eqn{H_k}. The user may 
shrink \eqn{H_k} through the arguments \code{hmult.majo} and \code{hmult.mino}.  
The class balance is regulated by argument \code{p}, namely the probability of 
generating examples from class \eqn{k=1}.

As they stand, kernel-based methods may be applied to continuous data only.
However, since ROSE includes a combination of over- and under-sampling as a special case when 
\eqn{H_k} tends to zero, the assumption of continuity may be circumvented by 
using a degenerate kernel distribution to draw synthetic categorical examples. 
In practice, if the \eqn{j}-th component of \eqn{\bold{x}_i} is categorical, a synthetic clone 
of \eqn{\bold{x}_i} takes as its \eqn{j}-th component the same value as the \eqn{j}-th component of \eqn{\bold{x}_i}.
}

\value{
The value is an object of class \code{ROSE} which has components
  \item{Call}{The matched call.}
  \item{method}{The method used to balance the sample. The only possible choice is \cr \code{ROSE}.}
  \item{data}{An object of class \code{data.frame} containing new examples generated by ROSE.} 
}

\references{
Menardi, G. and Torelli, N. (2013). Training and assessing classification rules with imbalanced data. \emph{Data Mining and Knowledge Discovery}, 
DOI:10.1007/s10618-012-0295-5, to appear.
}

\section{Warning}{
The purpose of \code{ROSE} is to generate new synthetic examples in the feature space. The use of \code{formula} is intended solely to 
distinguish the response variable from the predictors. 
Hence, \code{formula} must not be confused with the one supplied to fit a classifier, in which specifying transformations 
of, or interactions among, variables may be sensible or necessary. 
In the current version, \code{ROSE} automatically discards any interactions and transformations of predictors specified in \code{formula}. 
The automatic parsing of \code{formula} handles virtually all cases on which it has been tested, but 
the user is advised to use caution when specifying entangled functions of predictors. 
Reports of possible malfunctioning of the parsing mechanism are welcome.

}


\seealso{
\code{\link{ovun.sample}}, \code{\link{ROSE.eval}}.
}

\examples{
# 2-dimensional example
# loading data
data(hacide)

# imbalance on training set
table(hacide.train$cls)
# imbalance on test set
table(hacide.test$cls)

# plot unbalanced data highlighting the majority and 
# minority class examples.
par(mfrow=c(1,2))
plot(hacide.train[, 2:3], main="Unbalanced data", xlim=c(-4,4),
     ylim=c(-4,4), col=as.numeric(hacide.train$cls), pch=20)
legend("topleft", c("Majority class","Minority class"), pch=20, col=1:2)

# model estimation using logistic regression
fit <- glm(cls~., data=hacide.train, family="binomial")
# prediction using test set
pred <- predict(fit, newdata=hacide.test)
roc.curve(hacide.test$cls, pred,
          main="ROC curve \n (Half circle depleted data)")

# generating data according to ROSE: p=0.5 as default
data.rose <- ROSE(cls~., data=hacide.train, seed=3)$data
table(data.rose$cls)
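
# a sketch of the remaining tuning arguments, with values chosen only
# for illustration: N sets the size of the synthetic sample, p is the
# probability of the minority class, and hmult.majo and hmult.mino shrink
# the smoothing matrices so that new examples stay closer to observed ones
data.rose.h <- ROSE(cls~., data=hacide.train, N=500, p=0.3,
                    hmult.majo=0.25, hmult.mino=0.5, seed=3)$data
nrow(data.rose.h)
table(data.rose.h$cls)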

par(mfrow=c(1,2))
# plot new data generated by ROSE highlighting the 
# majority and minority class examples.
plot(data.rose[, 2:3], main="Balanced data by ROSE",
     xlim=c(-6,6), ylim=c(-6,6), col=as.numeric(data.rose$cls), pch=20)
legend("topleft", c("Majority class","Minority class"), pch=20, col=1:2)

fit.rose <- glm(cls~., data=data.rose, family="binomial")
pred.rose <- predict(fit.rose, newdata=data.rose, type="response")
roc.curve(data.rose$cls, pred.rose, 
          main="ROC curve \n (Half circle depleted data balanced by ROSE)")
par(mfrow=c(1,1))
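
# The generation step described in Details can be sketched in base R for
# a single class with numeric predictors only. This is an illustration
# under stated assumptions, not the code used by ROSE(): the helper name
# rose.sketch is hypothetical, and each bandwidth is assumed to follow the
# normal-reference rule h_q = (4 / ((d + 2) * n))^(1 / (d + 4)) * sd_q,
# used here as the standard deviation of the Normal kernel.
rose.sketch <- function(x, n.new, h.mult = 1) {
  x <- as.matrix(x)
  n <- nrow(x)
  d <- ncol(x)
  # one bandwidth per predictor, optionally shrunk by h.mult
  h <- h.mult * (4 / ((d + 2) * n))^(1 / (d + 4)) * apply(x, 2, sd)
  # step 1: resample observed rows with replacement (the kernel centres)
  idx <- sample(n, n.new, replace = TRUE)
  # step 2: add Normal noise scaled by h around each selected centre
  noise <- matrix(rnorm(n.new * d), n.new, d)
  x[idx, , drop = FALSE] + sweep(noise, 2, h, "*")
}

set.seed(1)
x.obs <- matrix(rnorm(40), ncol = 2)  # 20 simulated examples of one class
x.new <- rose.sketch(x.obs, n.new = 100)
dim(x.new)
# with h.mult = 0 the kernel degenerates and observed rows are resampled
# unchanged, which corresponds to plain over- or under-sampling
x.copy <- rose.sketch(x.obs, n.new = 10, h.mult = 0)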
}

\keyword{ supervised classification }
\keyword{ imbalanced classes }
\keyword{ bootstrap }

