lpda
is an R package that addresses the classification
problem through linear programming. The method looks for a hyperplane,
H, which separates the samples into two groups by minimizing
the sum of all the distances to the subspace assigned to the group each
individual belongs to. It results in a convex optimization problem for
which we find an equivalent linear programming problem. We demonstrated
that H exists when the centroids of the two groups are not
equal [1]. The method has been extended to more than two groups by
considering pairwise comparisons. Moreover, lpda
offers the
possibility of dealing with Principal Components (PCs) to reduce the
dimension of the data avoiding overfitting problems. This option can be
applied independently of the number of samples, \(n\), and variables, \(p\), that is \(n>p\) or \(n<p\). Compared to other similar
techniques it is very fast, mainly because it is based in a linear
programming problem [2].
library(lpda)
Main function is lpda
that collect the input data,
standarises the data or applies Principal Component Analysis (PCA)
through lpda.pca
if it is required. Then, it calls to
lpda.fit
as many times as pairwise comparisons there are.
The result is a lpda
type object that is the input to
compute predictions through ‘predict’ function and to visualize results
through ‘plot’ function.
The package has also three functions to compute by crossvalidation (CV) the classification error in different test data sets. This is helpful to decide an appropriate number of PCs or a specific strategy to compute the hyperplane. The functions are:
bestPC
: computes the classification error rate for
lpda.pca models obtained with the number of components specified in
PCs
argument. The result is the average classification
error rate from the models computed for each number of PCs.
bestVariability
: computes the classification error
rate for lpda.pca models obtained with the number of components needed
to reach the explained variability specified in Vars
argument. The result is the average classification error rate from the
models computed for each explained variability specified in
Vars
.
lpdaCV
: Computes the classification error rate for a
specific model. The user can choose leaf one out (loo) CV, that
uses CVloo
function, or random test sets with a specified
size with ktest
option, that uses CVktest
function.
lpda
package includes two data sets concerning data
science: palmdates
and RNAseq
. The first one
is a real data set from a chemometric study and the second one a
simulated RNA-seq experiment. In this document we show the performance
of the package with these data sets and with iris
data,
available in R
package.
Palmdates
is a data set with scores of 21 palm dates
including their respective Raman spectra and the concentration of five
compounds covering a wide range of concentrations: fibre, glucose,
fructose, sorbitol and myo-inositol [3]. The first 11 dates are Spanish
(from Elche, Alicante) with no well-defined variety and the last 10 are
from other countries and varieties, mainly Arabian. The data set has two
data.frames: conc
with 5 variables and spectra
with 2050.
data("palmdates")
names(palmdates)
#> [1] "conc" "spectra"
dim(palmdates$spectra)
#> [1] 21 2050
names(palmdates$conc)
#> [1] "fibre" "sorbitol" "fructose" "glucose" "myo-inositol"
As conc
and spectra
, are very correlated,
the application of the method with PCs reduces substantially the
dimension.
This data set has been simulated as Negative Binomial distributed and transformed to rpkm (Reads per kilo base per million mapped reads). It contains 600 genes (in columns) and 60 samples (rows), 30 of each one of the experimental groups. First 30 samples are from first group and the remaining samples from the second one. It has been simulated with few variables (genes) that discriminate between groups. There is few correlation and a lot of noise.
data("RNAseq")
dim(RNAseq)
#> [1] 60 600
head(RNAseq[,1:6])
#> Gene1 Gene2 Gene3 Gene4 Gene5 Gene6
#> G1.T1.1 82 154 44 82 55 55
#> G1.T1.2 66 66 15 25 66 112
#> G1.T1.3 141 19 160 175 78 68
#> G1.T1.4 381 57 71 119 186 62
#> G1.T1.5 60 23 88 134 143 60
#> G1.T1.6 149 37 64 16 90 117
Palmdates
dataFirst we apply the method with the first two variables:
fibre
and sorbitol
to show the performance of
the package. The application of the method with two variables allows the
visualization of the hyperplane in two dimensions, in this case, a
straight line.
= as.factor( c(rep("Spanish",11), rep("Other",10)) )
group = lpda(data = palmdates$conc[,1:2], group = group ) model1
First output of lpda
is a matrix with the coefficients
of the hyperplane: \(a'x=b\) for
each pair-wise comparison. In this example:
$coef
model1#> Other-Spanish
#> [1,] -1.116131
#> [2,] -3.002923
#> [3,] -6.563646
being -1.116 and -3.003 the coefficients of fibre
and
sorbitol
respectively and -6.564 the constant \(b\). We can plot the line on the points,
that represent the samples, with the following code:
plot(palmdates$conc[,1:2], col = as.numeric(group)+1, pch = 20,
main = "Palmdates example")
abline(model1$coef[3]/model1$coef[2], -model1$coef[1]/model1$coef[2], cex = 2)
legend("bottomright", c("Other","Spanish"),col = c(2,3), pch = 20, cex=0.8)
Predicted group with the model is obtained with predict
function. Confusion matrix is computed as follows. We observe that all
the samples are well classified.
= predict(model1)
pred table(pred$fitted, group)
#> group
#> Other Spanish
#> Other 10 0
#> Spanish 0 11
By considering the five concentration variables, all samples are well classified as well. As a 5-dimensional plot can not be showed, we offer the possibility of seeing a plot that shows the situation of samples with respect to \(H\). Y-axis represents order in which they appear in the data matrix and X-axis distances of each sample to \(H\).
= lpda(data = palmdates$conc, group = group )
model2 plot(model2)
Another option is the application of lpda
to the first
PCs scores. When data is highly correlated, two components are usually
preferred. In such case choosing as plot
arguments
PCscores = TRUE
we will visualize the first two PCs,
indicating in the axis the proportion of explained variance by each PC.
Moreover if there are two groups, as in this example, the optimal
hyperplane is also showed.
= lpda(data = palmdates$conc, group = group, pca = TRUE, Variability = 0.7)
model3 plot(model3, PCscores = TRUE, main = "PCA-Substances palmdates")
When having data sets with more variables than individuals PCA is not
directly applicable. In lpda
we have implemented the
possibility of dealing with such problem, working with the equivalences
between the PCA of the data matrix and the transposed matrix.
In palmdates$spectra
there are 2050 very correlated
measurements as it is showed in the following figure:
Due to the high correlation the application of lpda
to
all the spectra variables and to the first 2 PCs gives the same
solution: 0 prediction errors.
<- lpda(data = palmdates$spectra, group = group)
model4 = predict(model4)
pred table(pred$fitted, group)
#> group
#> Other Spanish
#> Other 10 0
#> Spanish 0 11
= lpda(data = palmdates$spectra, group = group, pca = TRUE, Variability = 0.9)
model5 plot(model5, PCscores = TRUE, main = "Spectra palmdates")
Following example shows how predict the group of a test set that does not participate in the model. In this set we include two samples of each group.
= c(10,11,12,13)
test = lpda(data = palmdates$spectra[-test,], group = group[-test], pca = TRUE,
model6 Variability = 0.9)
= predict(model6, palmdates$spectra[test,])
pred $fitted
pred#> [1] "Spanish" "Spanish" "Other" "Other"
We can also use lpdaCV
function to compute the predicted
error in different test sets by crossvalidation with a specific model.
As explained before, two strategies are implemented: loo
(leave one out) and ktest
that is specified in
CV
argument and by default is loo
. When
ktest
is selected, ntest
is the size of test
set (samples not used for computing the model, only to evaluate the
number of prediction errors) and R
is the times the model
is computed and evaluated with different training and test sets.
lpdaCV(palmdates$spectra, group, pca = TRUE, CV = "loo")
lpdaCV(palmdates$spectra, group, pca = TRUE, CV = "ktest", ntest = 5, R = 10)
CV results are so good due to the clear difference between the two groups in spectra data. We continue with CV functions in Example 2 that deals noisier data.
This section applies cros-validation functions with RNAseq data that is noisier than palmdates data and has 60 samples. Firstly we can see that the model with all the data gets a separating hyperplane.
= as.factor(rep(c("G1","G2"), each = 30))
group = lpda(RNAseq, group) # model with all the variables
model = predict(model)
pred table(pred$fitted, group)
#> group
#> G1 G2
#> G1 30 0
#> G2 0 30
Evaluating the error in several test sets with CV functions we can
see that, to avoid overfitting, it is preferable the application of
lpda
with PCA:
lpdaCV(RNAseq, group, pca = FALSE, CV = "ktest", ntest = 10)
#> [1] "Prediction error rate: 0.39"
lpdaCV(RNAseq, group, pca = TRUE, CV = "ktest", ntest = 10)
#> [1] "Prediction error rate: 0.06"
However, the success of PCA solution depends on the chosen number of
components. For this reason it is recommended the application of
bestVariability
or bestPC
functions, that can
help to decide the number of PCs that better fit the data.
bestVariability(RNAseq, group, ntest = 10, R = 10, Vars = c(0.1, 0.9))
#> 0.1 0.9
#> 0.09 0.04
bestPC(RNAseq, group, ntest = 10, R = 10, PCs = c(2, 10))
#> 2 10
#> 0.06 0.05
Therefore variability = 0.9
or PC=10
are
preferable and the model for future predictions will be:
lpda(RNAseq, group, pca = TRUE, Variability = 0.9)
iris
dataTo show an example with more than two groups we use the famous
(Fisher’s or Anderson’s) iris
data set that is available in
base R
. The iris
data set gives the
measurements in centimeters of the variables sepal length and width and
petal length and width, respectively, for 50 flowers from each of 3
species of iris. The species are Iris setosa
,
versicolor
, and virginica
.
Results with the four variables give 2 classification errors. In this
case the model computes 3 hyperplanes for each one of the 3 pairwise
comparisons. Now plot
function gives 3 plots.
= lpda(iris[,-5], iris[,5])
model.iris = predict(model.iris)
pred.iris table(pred.iris$fitted, iris[,5])
#>
#> setosa versicolor virginica
#> setosa 50 0 0
#> versicolor 0 49 1
#> virginica 0 1 49
The application of lpda
with 3 PCs gives the same
classification error. Function plot
with
Pcscores = TRUE
gives the scores in the first two PCs, but
in this case without the hyperplanes.
= lpda(iris[,-5], iris[,5], pca=TRUE, PC=3)
model.iris2 = predict(model.iris2)
pred.iris2 table(pred.iris2$fitted, iris[,5])
#>
#> setosa versicolor virginica
#> setosa 50 0 0
#> versicolor 0 49 1
#> virginica 0 1 49
plot(model.iris2, PCscores= TRUE)
[1] Nueda, M.J., Gandía, C, Molina, M.D. (2022) LPDA: A new classification method based on linear programming. PLOS ONE 17(7): e0270403. https://doi.org/10.1371/journal.pone.0270403
[2] Nueda, M.J., Gandía, C. and Molina, M.D. Classifying sequencing data using linear programming. Euro30 Conference, Dublin, June-2019.
[3] Abdrabo, S.S., Gras, L., Grindlay, G. and Mora, J. (2022) Evaluation of Fourier Transform-Raman spectroscopy for carbohydrates characterization of palm dates (Phoenix dactylifera). Food Analytical Methods. Submitted.