The R package klic (Kernel Learning Integrative Clustering) contains a collection of tools for integrative clustering of multiple data types.
The main function in this package is klic. It takes as input a list of datasets and builds one kernel per dataset using consensus clustering. The kernels are then combined through localised kernel k-means to obtain the final clustering. The function also lets the user try several values for the number of clusters, both at the level of the individual datasets and for the final clustering. This is the simplest way to perform Kernel Learning Integrative Clustering.
However, users may want to include other types of kernels (instead of using only those derived from consensus clustering), or to try different parameters for consensus clustering and include all the corresponding kernels in the analysis. Therefore, we also make available some of the functions needed to build a customised KLIC pipeline. These are:
spectrumShift. This function takes as input any symmetric matrix (including co-clustering matrices) and checks whether it is positive semi-definite. If it is not, the eigenvalues of the matrix are shifted by a (small) constant to make sure that it is a valid kernel.
lmkkmeans. This is a function implemented by Gonen and Margolin (2014) that performs localised kernel k-means on a set of kernel matrices (for example, appropriately shifted co-clustering matrices).
The other function needed for a customised pipeline is consensusCluster. This function can be found in the R package coca and is used to perform consensus clustering on one dataset and obtain a co-clustering matrix (Monti et al. 2003).
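To make the idea behind eigenvalue shifting concrete, here is a minimal base R sketch. The helper `shift_spectrum` and its `margin` argument are hypothetical names chosen for illustration; this is not the implementation of spectrumShift in klic, whose arguments and choice of constant may differ.

```r
# Hypothetical helper, not the klic implementation: add a constant to the
# diagonal so that the smallest eigenvalue becomes `margin` (> 0).
shift_spectrum <- function(A, margin = 1e-6) {
  stopifnot(is.matrix(A), isSymmetric(A))
  min_eig <- min(eigen(A, symmetric = TRUE, only.values = TRUE)$values)
  if (min_eig >= 0) {
    return(A)  # already positive semi-definite, nothing to do
  }
  # Adding c * I shifts every eigenvalue by c and leaves eigenvectors unchanged
  A + diag(margin - min_eig, nrow(A))
}

# A symmetric similarity matrix with one negative eigenvalue
A <- matrix(c(1.0, 0.9, 0.1,
              0.9, 1.0, 0.9,
              0.1, 0.9, 1.0), nrow = 3, byrow = TRUE)
K <- shift_spectrum(A)
min(eigen(A, symmetric = TRUE)$values)  # negative
min(eigen(K, symmetric = TRUE)$values)  # positive after the shift
```

Because only the diagonal changes, the off-diagonal similarities between distinct observations are preserved.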
Other functions included in the package are:
kkmeans, another function implemented by Gonen and Margolin (2014), to perform kernel k-means with only one kernel matrix (Girolami, 2002);
copheneticCorrelation, to calculate the cophenetic correlation of a similarity matrix.
First, we load three synthetic datasets with the same clustering structure (6 clusters of equal size) and different levels of noise.
## Load synthetic data
data1 <- as.matrix(read.csv(system.file("extdata", "dataset1.csv", package = "klic"), row.names = 1))
data2 <- as.matrix(read.csv(system.file("extdata", "dataset2.csv", package = "klic"), row.names = 1))
data3 <- as.matrix(read.csv(system.file("extdata", "dataset3.csv", package = "klic"), row.names = 1))
data <- list(data1, data2, data3)
n_datasets <- 3
N <- dim(data[[1]])[1]
true_labels <- as.matrix(read.csv(system.file("extdata", "cluster_labels.csv",
                                              package = "klic"), row.names = 1))
Now we can use the consensusCluster function from the coca package to compute a consensus matrix for each dataset.
## Compute co-clustering matrices for each dataset
CM <- array(NA, c(N, N, n_datasets))
for (i in seq_len(n_datasets)) {
  # Scale the columns to have zero mean and unit variance
  scaledData <- scale(data[[i]])
  # Use consensus clustering to find the consensus matrix of each dataset
  # (K = 6 matches the number of clusters in the synthetic data)
  CM[,,i] <- coca::consensusCluster(scaledData, K = 6, B = 50)
}
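For readers who want to see what a consensus (co-clustering) matrix represents, the following base R sketch builds one in the spirit of Monti et al. (2003), by repeatedly clustering random subsamples with k-means. The function `consensus_matrix` and its argument `p_item` are hypothetical names used only for this illustration; coca::consensusCluster may use a different resampling scheme, clustering algorithm and defaults.

```r
set.seed(1)

# Hypothetical illustration, not the coca implementation.
consensus_matrix <- function(X, K, B = 50, p_item = 0.8) {
  n <- nrow(X)
  together <- matrix(0, n, n)  # times i and j fell in the same cluster
  sampled  <- matrix(0, n, n)  # times i and j were both subsampled
  for (b in seq_len(B)) {
    idx <- sample(n, size = floor(p_item * n))
    cl <- kmeans(X[idx, , drop = FALSE], centers = K, nstart = 5)$cluster
    sampled[idx, idx]  <- sampled[idx, idx] + 1
    together[idx, idx] <- together[idx, idx] + outer(cl, cl, "==")
  }
  cm <- together / pmax(sampled, 1)  # avoid dividing by zero
  diag(cm) <- 1
  cm
}

# Toy data: two well-separated groups of 10 observations each
X <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 10), ncol = 2))
cm <- consensus_matrix(X, K = 2)
cm[1, 2]   # pair from the same group
cm[1, 11]  # pair from different groups
```

Each entry is the proportion of subsamples containing both observations in which they were assigned to the same cluster, so values close to 1 indicate pairs that are consistently clustered together.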
## Plot consensus matrix of one of the datasets
true_labels <- as.factor(true_labels)
names(true_labels) <- as.character(1:N)
CM3 <- as.matrix(CM[,,3])
rownames(CM3) <- colnames(CM3) <- names(true_labels)
klic::plotSimilarityMatrix(CM3, y = as.data.frame(true_labels))
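Besides visual inspection, a consensus matrix can be summarised numerically through its cophenetic correlation. The base R sketch below shows one common way to compute it: convert similarities to distances, build a hierarchical clustering, and correlate the original distances with the cophenetic distances of the dendrogram. The helper `cophenetic_corr` and the choice of average linkage are assumptions made for illustration; copheneticCorrelation in klic may differ in its details.

```r
# Hypothetical helper, not the klic implementation.
cophenetic_corr <- function(S, method = "average") {
  d  <- as.dist(1 - S)             # similarities in [0, 1] become distances
  hc <- hclust(d, method = method) # hierarchical clustering of the items
  cor(d, cophenetic(hc))           # agreement between distances and dendrogram
}

# Toy similarity matrix with two clear groups: items {1, 2} and {3, 4}
S <- matrix(c(1.00, 0.90, 0.10, 0.20,
              0.90, 1.00, 0.15, 0.10,
              0.10, 0.15, 1.00, 0.80,
              0.20, 0.10, 0.80, 1.00), nrow = 4, byrow = TRUE)
cc <- cophenetic_corr(S)
cc
```

A value close to 1 suggests that the similarity matrix is well described by a hierarchical grouping of the observations.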