% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/EWF.R
\name{EWF}
\alias{EWF}
\alias{EWF.default}
\alias{EWF.formula}
\title{Edge Weight Filter}
\usage{
\method{EWF}{formula}(formula, data, ...)

\method{EWF}{default}(x, threshold = 0.25, noiseAction = "remove",
  classColumn = ncol(x), ...)
}
\arguments{
\item{formula}{A formula describing the classification variable and the attributes to be used.}

\item{data, x}{Data frame containing the training dataset to be filtered.}

\item{...}{Optional parameters to be passed to other methods.}

\item{threshold}{Real number between 0 and 1. It sets the limit between good and suspicious instances. Its
default value is 0.25.}

\item{noiseAction}{Character string, either 'remove' or 'hybrid'. It determines what to do with noisy
instances. By default, it is set to 'remove'.}

\item{classColumn}{Positive integer indicating the column which contains the
(factor of) classes. By default, the last column is considered.}
}
\value{
An object of class \code{filter}, which is a list with seven components:
\itemize{
   \item \code{cleanData} is a data frame containing the filtered dataset.
   \item \code{remIdx} is a vector of integers indicating the indexes for
   removed instances (i.e. their row number with respect to the original data frame).
   \item \code{repIdx} is a vector of integers indicating the indexes for
   repaired/relabelled instances (i.e. their row number with respect to the original data frame).
   \item \code{repLab} is a factor containing the new labels for repaired instances.
   \item \code{parameters} is a list containing the argument values.
   \item \code{call} contains the original call to the filter.
   \item \code{extraInf} is a character string with additional
   information not covered by the previous items.
}
}
\description{
Similarity-based filter for removing or repairing label noise from a dataset as a
preprocessing step of classification. For more information, see 'Details' and
'References' sections.
}
\details{
\code{EWF} builds a Relative Neighborhood Graph (RNG) from the dataset. Then, it identifies
as 'suspicious' those instances with a significant value of their \emph{local cut edge weight statistic}, which
intuitively means that they are surrounded by examples from a different class.

Namely, the aforementioned statistic is the sum of the weights of the edges joining
the instance (in the RNG) with instances from a different class.
Under the null hypothesis of the class label being independent of
the event 'being neighbors in the RNG', the distribution of this statistic can be approximated by a
Gaussian one. Then, the p-value for the observed value is computed and compared with the
provided \code{threshold}.
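
As an illustrative sketch of the definition above (the symbols are notation assumed here, not taken
from the package source), write \eqn{N(i)} for the set of RNG-neighbors of instance \eqn{i},
\eqn{w_{ij}}{w_ij} for the weight of the edge joining \eqn{i} and \eqn{j}, and \eqn{y_i} for the
class label of \eqn{i}. The statistic can then be written as
\deqn{J_i = \sum_{j \in N(i)} w_{ij} \, \mathbf{1}(y_j \neq y_i)}{J_i = sum over j in N(i) of w_ij * 1(y_j != y_i)}
where \eqn{\mathbf{1}(\cdot)}{1(.)} equals 1 when its argument holds and 0 otherwise. How the
weights are computed is left to the implementation and is not specified here.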

There are two approaches to handle 'suspicious' instances ('remove' or 'hybrid'), and the argument
\code{noiseAction} determines which one to use. With 'remove', every suspect is removed from the dataset.
With the 'hybrid' approach, an instance is removed if it has no \emph{good} (i.e. non-suspicious)
RNG-neighbors. Otherwise, it is relabelled with the majority class among its \emph{good} RNG-neighbors.
}
\examples{
# The next example is not run because EWF is time-consuming
\dontrun{
   data(iris)
   trainData <- iris[c(1:20,51:70,101:120),]
   out <- EWF(Species~Petal.Length+Sepal.Length, data = trainData, noiseAction = "hybrid")
   print(out)
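   # A possible follow-up, assuming 'out' was created by the call above:
   # inspect the components documented in the 'Value' section.
   str(out$cleanData)   # filtered data frame
   out$remIdx           # row numbers of removed instances
   out$repIdx           # row numbers of relabelled instances
   out$repLab           # new labels assigned to the relabelled instances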
}
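
# A toy sketch of the 'hybrid' rule described in 'Details', using base R only.
# It does not call EWF or reproduce its internals: 'labels', 'suspicious' and
# 'neighbors' are hypothetical inputs made up for illustration.
labels <- factor(c("a", "a", "b", "b", "a"))
suspicious <- c(FALSE, FALSE, TRUE, FALSE, TRUE)
neighbors <- list(c(2, 3), c(1, 3), c(1, 2, 4), c(3, 5), c(3))
for (i in which(suspicious)) {
  # Keep only the neighbors that were not flagged as suspicious
  good <- neighbors[[i]][!suspicious[neighbors[[i]]]]
  if (length(good) == 0) {
    cat("instance", i, "would be removed\n")
  } else {
    # Majority class among good neighbors (ties go to the first level here,
    # which is an assumption of this sketch)
    newLab <- names(which.max(table(labels[good])))
    cat("instance", i, "would be relabelled as", newLab, "\n")
  }
}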
}
\references{
Muhlenbach F., Lallich S., Zighed D. A. (2004): Identifying and handling mislabelled
instances. \emph{Journal of Intelligent Information Systems}, 22(1), 89-109.
}

