%\VignetteIndexEntry{tapkee methods}
%\VignetteEngine{R.rsp::tex}

\documentclass[11pt]{article}
\usepackage{amsfonts,amsmath,hyperref}
\usepackage[utf8]{inputenc}
\usepackage[margin=1in]{geometry}

\providecommand{\tightlist}{%
  \setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}

\title{\texttt{tapkee} methods}
\author{}
\date{}

\begin{document}

\maketitle

\tableofcontents

\section{Diffusion map}\label{diffusion-map}

The diffusion map algorithm performs the following steps to embed
feature vectors $x_1, \dots, x_N$:

\begin{itemize}
\item
  Compute $N \times N$ gaussian kernel matrix $K$ such that
  $$K_{i,j} = exp \left\{ - \frac{d^2(x_i,x_j)}{\omega} \right\},$$
  where $d : X \times X \to \mathbb{R}$ is a distance function and
  $\omega > 0$ is a width of the kernel.
\item
  Transform the matrix $K$ using the following equations
  $$K_{i,j} \leftarrow \frac{K_{i,j}}{(p_i p_j)^q},$$ where
  $p_i = \sum_{j=1}^{N} K_{j,i}$. Only $q=1$ for `standard'
  diffusion map is currently supported. Then, recompute
  $p_i = \sum_{j=1}^{N} K_{j,i}$ again and do
  $$K_{i,j} \leftarrow \frac{K_{i,j}}{\sqrt{p_i p_j}}.$$
\item
  Construct embedding with $\dim = d$ from the solution of the
  following partial eigenproblem $$K f = \lambda f$$ for $d+1$ largest
  eigenvalues. Form the embedding matrix such that the $i$-th
  coordinate ($i=1,\dots,N$) of $j$-th largest eigenvector
  ($j=2,\dots,d+1$) corresponds to $j$-th coordinate of projected
  $i$-th vector, normalized by $\lambda_i^t$ and the first
  eigenvector corresponding to $\lambda_1 = 1$.
\end{itemize}

\subsection{References}\label{references}

\begin{itemize}
\tightlist
\item
  Coifman, R., \& Lafon, S. (2006).
  \href{http://linkinghub.elsevier.com/retrieve/pii/S1063520306000546}{Diffusion
  maps}
\end{itemize}

\section{Factor analysis}\label{factor-analysis}

Factor analysis aims at describing how several observed variables are
correlated to each other by means of identifying a set of unobserved
variables, the so-called factors. Desirably, the number of factors is
shorter than the number of observed variables.

Factor analysis is an iterative algorithm. First of all the projection
matrix is initialized randomly and the factors variance is set to the
identity. Then, every iteration consists of the following steps:

\begin{itemize}
\item
  Compute the regularized inverse covariance matrix of the projection.
\item
  Update the factors variance matrix.
\item
  Update the projection matrix.
\item
  Check for convergence using the log-likelihood of the model. If the
  difference between the current log-likelihood and the previous
  iteration's log-likelihood is below a threshold, then the algorithm
  has converged.
\end{itemize}

\subsection{References}\label{references-1}

\begin{itemize}
\tightlist
\item
  Spearman, C. (1904).
  \href{http://www.mendeley.com/catalog/general-intelligence-objectively-determined-measured/}{General
  Intelligence, Objectively Determined and Measured}.
\end{itemize}

\section{Hessian Locally Linear
Embedding}\label{hessian-locally-linear-embedding}

Just like the Local Tangent Space Alignment, the Hessian Locally Linear
Embedding algorithm is very similar to the Locally Linear Embedding
algorithm.

Given a set of feature vectors $X = \{ x_1, x_2, \dots x_N \}$ the
HLLE algorithm proposes to perform the following steps:

\begin{itemize}
\item
  Identify nearest neighbors. For each $x \in X$ identify its $k$
  nearest neighbors, i.e.~a set $\mathcal{N}_x$ of $k$ feature
  vectors such that
  $$\arg\min_{\mathcal{N}_x} \sum_{x_n \in \mathcal{N}_x}\| x - x_n \|_2^2$$
\item
  Analyze hessian of each local patch. For each $x \in X$ compute the
  Gram matrix $G$ of its neighbors such that
  $G_{i,j} = (\mathcal{N}_x^{i} , \mathcal{N}_x^{j})$ and center
  it. Compute its $t$ (the number of required features) eigenvectors
  $v_1, \dots v_t$. Construct hessian approximating matrix
  $$Y = \begin{bmatrix} 1_k & v_1 & \dots & v_t & v_1\cdot v_1 & \dots & v_1\cdot v_t & \dots \end{bmatrix},$$
  where $\cdot : X \times X \to X$ denotes coefficient-wise product.
  Normalize columns of the matrix $Y$ and then compute matrix
  $$Q = Y Y^{T}$$ and put it to the sparse alignment matrix $L$
  (initially set by zeroes) using the following procedure:
  $$L \leftarrow L + Q.$$
\item
  Embedding through eigendecomposition. To obtain $t$ features
  (coordinates) of embedded vectors solve the partial eigenproblem
  $$L f = \lambda f,$$ for smallest eigenvalues
  $\lambda_1, \dots, \lambda_t, \lambda_{t+1}$ and its
  corresponding eigenvectors $f_1, \dots, f_t, f_{t+1}$. Drop the
  smallest eigenvalue $\lambda_1 \sim 0$ (with its corresponding
  eigenvector) and form embedding matrix such that $i$-th coordinate
  ($i=1,\dots,N$) of $j$-th eigenvector ($j=1,\dots,t$)
  corresponds to $j$-th coordinate of projected $i$-th vector.
\end{itemize}

\subsection{References}\label{references-2}

\begin{itemize}
\tightlist
\item
  Donoho, D., \& Grimes, C. (2003).
  \href{http://www-stat.stanford.edu/~donoho/Reports/2003/HessianEigenmaps.pdf}{Hessian
  eigenmaps: new locally linear embedding techniques for
  high-dimensional data}
\end{itemize}

\section{Isomap}\label{isomap}

The Isomap algorithm can be considered as a modification of the classic
Multidimensional Scaling algorithm. The algorithm itself consists of the
following steps:

\begin{itemize}
\item
  For each feature vector $x \in X$ find $k$ its nearest neighbors
  and construct the sparse neighborhood graph.
\item
  Compute squared distances matrix $D$ such as
  $D_{i,j} = d^2(x_i,x_j)$.
\item
  Relax distances with shortest (so-called geodesic) distances on the
  sparse neighborhood graph (e.g.~with Dijkstra's algorithm).
\item
  Center the matrix $D$ with subtracting row mean, column mean and
  adding the grand mean. Multiply $D$ element-wise with $-0.5$.
\item
  Compute embedding with the $t$ eigenvectors that correspond to the
  largest eigenvalues of the matrix $D$; normalize these vectors with
  dividing each eigenvector by the square root of its corresponding
  eigenvalue. Form the final embedding with eigenvectors as rows and
  projected feature vectors as columns.
\end{itemize}

\subsection{References}\label{references-3}

\begin{itemize}
\tightlist
\item
  Tenenbaum, J. B., De Silva, V., \& Langford, J. C. (2000).
  \href{http://www.robots.ox.ac.uk/~az/lectures/ml/tenenbaum-isomap-Science2000.pdf}
  {A global geometric framework for nonlinear dimensionality reduction}
\end{itemize}

\section{Kernel Principal Component
Analysis}\label{kernel-principal-component-analysis}

The Kernel Principal Component Analysis algorithm is a generalization of
the PCA algorithm. The algorithm performs the following steps

\begin{itemize}
\item
  Compute the kernel matrix $K$ such that $K_{i,j} = k(x_i,x_j)$
  where $k : X \times X \to \mathbb{R}$ is a Mercer kernel function
  and $X$ is a set of feature vectors $\\{ x_1, x_2, \dots, x_N \\}$
\item
  Center the matrix $K$ with subtracting row mean, column mean and
  adding the grand mean.
\item
  Compute embedding with the $t$ eigenvectors that correspond to the
  largest eigenvalues of the matrix $D$; normalize these vectors with
  dividing each eigenvectors with square root of its corresponding
  eigenvalue. Form the final embedding with eigenvectors as rows and
  projected feature vectors as columns.
\end{itemize}

\subsection{References}\label{references-4}

\begin{itemize}
\tightlist
\item
  Schölkopf, B., Smola, A., \& Müller, K. R. (1997). Kernel principal
  component analysis
\end{itemize}

\section{Laplacian Eigenmaps}\label{laplacian-eigenmaps}

The Laplacian Eigenmaps algorithm performs the following simple steps to
embed given feature vectors $x_1, \dots, x_N$:

\begin{itemize}
\item
  Identify nearest neighbors. For each $x \in X$ identify its $k$
  nearest neighbors, i.e.~a set $\mathcal{N}_x$ of $k$ feature
  vectors such that
  $$\arg\min_{\mathcal{N}_x} \sum_{x_n \in \mathcal{N}_x} d(x,x_n),$$
  where $d : X \times X \to \mathbb{R}$ is a distance function.
\item
  Construct weight matrix. Initially setting $N\times N$ matrix $W$
  to zero, set
  $$W_{i,j} = exp \left\{ - \frac{d^2(x_i,x_j)}{\tau} \right\}$$
  iff for $i$-th vector $x_i$ neighbors set $\mathcal{N}_{x_i}$
  contains $x_j$ and vice versa (so-called mutual neighborhood). Find
  a diagonal matrix $D$ such that
  $D_{i,i} = \sum_{j=1}^{N} W_{j,i}$.
\item
  Find embedding throught eigendecomposition. To obtain $t$ features
  (coordinates) of embedded vectors solve the partial generalized
  eigenproblem $$(D-W) f = \lambda D f,$$ for smallest eigenvalues
  $\lambda_1, \dots, \lambda_t, \lambda_{t+1}$ and its
  corresponding eigenvectors $f_1, \dots, f_t, f_{t+1}$. Drop the
  smallest eigenvalue $\lambda_1 \sim 0$ (with the corresponding
  eigenvector) and form embedding matrix such that $i$-th coordinate
  ($i=1,\dots,N$) of $j$-th eigenvector ($j=1,\dots,t$)
  corresponds to $j$-th coordinate of projected $i$-th vector.
\end{itemize}

\subsection{References}\label{references-5}

\begin{itemize}
\tightlist
\item
  Belkin, M., \& Niyogi, P. (2002).
  \href{http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.19.9400\&rep=rep1\&type=pdf}{Laplacian
  Eigenmaps and Spectral Techniques for Embedding and Clustering}
\end{itemize}

\section{Locally Linear Embedding}\label{locally-linear-embedding}

Given a set of feature vectors $X = \{ x_1, x_2, \dots x_N \}$ the
Locally Linear Embedding algorithm proposes to perform the following
steps:

\begin{itemize}
\item
  Identify nearest neighbors. For each $x \in X$ identify its $k$
  nearest neighbors, i.e.~a set $\mathcal{N}_x$ of $k$ feature
  vectors such that
  $$\arg\min_{\mathcal{N}_x} \sum_{x_n \in \mathcal{N}_x}\| x - x_n \|_2^2$$
\item
  Compute linear reconstruction weights. For each $x \in X$ compute
  weight vector $w \in \mathbb{R}^n$ that minimizes
  $$\| x - \sum_{i=1}^{k} w_i \mathcal{N}_x^{i} \|_2, ~~ \text{w.r.t.} ~~ \|w\|_2 = 1$$
  where $\mathcal{N}_x^{i}$ is a $i$-th element of the set
  $\mathcal{N}_x$. The solution of the problem stated above can be
  found from the normalized solution of the following equation:
  $$G w = 1_k,$$ where $G$ is a $k \times k$ matrix such that
  $G_{i,j} = (x - \mathcal{N}_x^{i})(x - \mathcal{N}_x^{j})$ and
  $1_k \in \mathbb{R}^k$ is a vector of all ones. Obviously, the
  problem comes ill-posed in case $k$ gets more than dimension of
  feature space $X$. This can be avoided with the regularization:
  $$G \leftarrow G + \varepsilon E,$$ where $E$ is an identity matrix
  and $\varepsilon$ is a pre-defined constant reconstruction shift
  (usually $10^{-3}$). Once $w$ is computed it is stored into the
  sparse alignment matrix $L$ (initially set by zero) with the
  following procedure: $$L_{I,I} \leftarrow L_{I,I} + W,$$ where $I$
  is a set containing indices of all element of the set
  $\mathcal{N}_x$ and $x$ itself, $L_{I,I}$ denotes all
  $(i,j)$ elements of the sparse matrix L such that $i,j \in I$ and
  $$W = \begin{bmatrix} 1 & -w \\ -w^{T} & w^T w \end{bmatrix}$$.
\item
  Embedding through eigendecomposition. To obtain $t$ features
  (coordinates) of embedded vectors solve the partial eigenproblem
  $$L f = \lambda f,$$ for smallest eigenvalues
  $\lambda_1, \dots, \lambda_t, \lambda_{t+1}$ and its
  corresponding eigenvectors $f_1, \dots, f_t, f_{t+1}$. Drop the
  smallest eigenvalue $\lambda_1 \sim 0$ (with the corresponding
  eigenvector) and form embedding matrix such that $i$-th coordinate
  ($i=1,\dots,N$) of $j$-th eigenvector ($j=1,\dots,t$)
  corresponds to $j$-th coordinate of projected $i$-th vector.
\end{itemize}

\section{Locally Linear Embedding}\label{locally-linear-embedding-1}

The Locally Linear Embedding algorithm can be generalized for spaces
with defined dot product function $k(x,y)$ (so-called
\href{http://en.wikipedia.org/wiki/Reproducing_kernel_Hilbert_space}{RKHS})
in the elegant way. Using the following equation
$$|| x - y||_2^2 = (x,x) - 2 (x,y) + (y,y)$$ we may transform the
nearest neighbors problem to the following form:
$$\arg\min_{\mathcal{N}_x} \sum_{x_n \in \mathcal{N}_x} 
\left[ k(x,x) - 2 k(x,x_n) + k(x_n, x_n) \right] .$$
The matrix $G$ can be formulated in terms of dot product as well. To
find $G$ using only dot products we can compute the Gram matrix $K$
such that $K_{i,j} = k(x_i, x_j)$ and center it using the matrix
$C_k = E_k - \frac{1}{k} 1 1^{T}$: $$G = K C_k K.$$ There is an
efficient way to compute that - it is can be done with subtracting a
column mean of $K$ from each column of $K$, subtracting a row mean
of $K$ from each row of $K$ and adding the grand mean of all
elements of $K$ to $K$.

\subsection{References}\label{references-6}

\begin{itemize}
\tightlist
\item
  \href{http://www.cs.nyu.edu/~roweis/lle/}{Sam Roweis' page on LLE}
\item
  Saul, L. K., Ave, P., Park, F., \& Roweis, S. T. (2001).
  \href{http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.123.7319\&rep=rep1\&type=pdf}{An
  introduction to Locally Linear Embedding}
\item
  Zhao, D. (2006)
  \href{http://linkinghub.elsevier.com/retrieve/pii/S0031320306002160}{Formulating
  LLE using alignment technique}
\end{itemize}

\section{Linear Local Tangent Space
Alignment}\label{linear-local-tangent-space-alignment}

The Linear Local Tangent Space Alignment is a modification of the LTSA
algorithm. Main difference (just like in NPE and LLE) of linear and
original LTSA methods lies in the way of constructing embedding. Instead
of solving common for LLE and LTSA eigenproblem, LLTSA requires solving
the following generalized eigenproblem: $$R L R^T f = \lambda R R^T f,$$
where $R$ is a matrix containing all feature vectors
$x_1, \dots, x_N$ row-wise. The problem is solved for smallest
eigenvalues $\lambda_1 \leq \lambda_2 \leq \dots \leq \lambda_t$
and its corresponding eigenvectors $f_1, \dots, f_t$. To find final
embedding LLTSA forms a matrix such that $i$-th coordinate
($i=1,\dots,N$) of $j$-th eigenvector ($j=1,\dots,t$) corresponds
to $j$-th coordinate of projected $i$-th vector.

\subsection{References}\label{references-7}

\begin{itemize}
\tightlist
\item
  Zhang, T., Yang, J., Zhao, D., \& Ge, X. (2007).
  \href{http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.85.2698}{Linear
  local tangent space alignment and application to face recognition}
\end{itemize}

\section{Locality Preserving
Projections}\label{locality-preserving-projections}

The Locality Preserving Projections algorithm can be viewed as a linear
approximation of the Laplacian Eigenmaps algorithm. It reproduces first
two steps of the Laplacian Eigenmaps and the difference lies in the step
3. To obtain $t$ features (coordinates) of embedded vectors LPP solves
the partial generalized eigenproblem
$$R (D-W) R^{T} f = \lambda R D R^{T} f,$$ where $R$ contains all
feature vectors row-wise, for smallest eigenvalues
$\lambda_1, \dots, \lambda_t, \lambda_{t+1}$ and its corresponding
eigenvectors $f_1, \dots, f_t, f_{t+1}$. Drop the smallest
eigenvalue $\lambda_1 \sim 0$ (with the corresponding eigenvector)
and form embedding matrix such that $i$-th coordinate
($i=1,\dots,N$) of $j$-th eigenvector ($j=1,\dots,t$) corresponds
to $j$-th coordinate of projected $i$-th vector.

\subsection{References}\label{references-8}

\begin{itemize}
\tightlist
\item
  He, X., \& Niyogi, P. (2003).
  \href{http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.85.7147\&rep=rep1\&type=pdf}{Locality
  Preserving Projections}
\end{itemize}

\section{Local Tangent Space
Alignment}\label{local-tangent-space-alignment}

The Local Tangent Space Alignment algorithm is pretty similar to the
Locally Linear Embedding algorithm.

Given a set of feature vectors $X = \{ x_1, x_2, \dots x_N \}$ the
Local Tangent Space Alignment algorithm performs the following steps:

\begin{itemize}
\item
  Identify nearest neighbors. For each $x \in X$ identify its $k$
  nearest neighbors, i.e.~a set $\mathcal{N}_x$ of $k$ feature
  vectors such that
  $$\arg\min_{\mathcal{N}_x} \sum_{x_n \in \mathcal{N}_x}\| x - x_n \|_2^2$$
\item
  Perform principal component analysis of each local neighborhood patch.
  For each $x \in X$ compute the Gram matrix $G$ of its neighbors
  such that $G_{i,j} = (\mathcal{N}_x^{i} , \mathcal{N}_x^{j})$ and
  center it. Compute its $t$ (the number of required features)
  eigenvectors and store it in the matrix $V$. Compute matrix
  $$Q = \begin{bmatrix} 1_k & V \end{bmatrix} \begin{bmatrix} 1_k \\ V \end{bmatrix}$$
  and put it to the sparse alignment matrix $L$ (initially set by
  zeroes) using the following procedure: $$L \leftarrow L + E_k - Q.$$
\item
  Embedding through eigendecomposition. To obtain $t$ features
  (coordinates) of embedded vectors solve the partial eigenproblem
  $$L f = \lambda f,$$ for smallest eigenvalues
  $\lambda_1, \dots, \lambda_t, \lambda_{t+1}$ and its
  corresponding eigenvectors $f_1, \dots, f_t, f_{t+1}$. Drop the
  smallest eigenvalue $\lambda_1 \sim 0$ (with the corresponding
  eigenvector) and form embedding matrix such that $i$-th coordinate
  ($i=1,\dots,N$) of $j$-th eigenvector ($j=1,\dots,t$)
  corresponds to $j$-th coordinate of projected $i$-th vector.
\end{itemize}

\section{Kernel Local Tangent Space
Alignment}\label{kernel-local-tangent-space-alignment}

Like the Locally Linear Embedding algorithm, LTSA allows generalization
for Mercer kernel functions. Nearest neighbors computation in KLTSA is
identical to one in KLTSA and the matrix $G$ in the step 2 is
naturally replaced with matrix
$K_{i,j} = k(\mathcal{N}_x^{i},\mathcal{N}_x^{j})$.

\subsection{References}\label{references-9}

\begin{itemize}
\tightlist
\item
  Zhang, Z., \& Zha, H. (2002).
  \href{http://arxiv.org/abs/cs/0212008}{Principal Manifolds and
  Nonlinear Dimension Reduction via Local Tangent Space Alignment}
\end{itemize}

\section{Classic multidimensional
scaling}\label{classic-multidimensional-scaling}

The classic multidimensional scaling algorithm is probably the simplest
dimensionality reduction algorithm which reduced data in an attempt to
keep pairwise distances the same. The algorithm itself is:

\begin{itemize}
\item
  For a given set of vectors $X = x_1, x_2, \dots, x_N$ compute the
  pairwise distances matrix $D$ such that $D_{i,j} = d(x_i,x_j)$,
\item
  Square each element of the distances matrix $D$ and center the
  matrix with subtracting row mean, column mean and adding the grand
  mean.
\item
  Compute embedding with the $t$ eigenvectors that correspond to the
  largest eigenvalues of the matrix $D$; normalize these vectors with
  dividing each eigenvectors with square root of its corresponding
  eigenvalue. Form the final embedding with eigenvectors as rows and
  projected feature vectors as columns.
\end{itemize}

\section{Neighborhood Preserving
Embedding}\label{neighborhood-preserving-embedding}

The Neighborhood Preserving Embedding (NPE) algorithm can be considered
as a linear approximation of the Locally Linear Embedding algorithm.
Thus most of computation routines can be shared with LLE. The NPE
algorithm uses steps 1 and 2 of the Locally Linear Embedding and the
main difference lies in the eigendecomposition based embedding.

According to the NPE algorithm embedding can be found from the solution
of the following partial generalized eigenproblem:
$$R L R f = \lambda R R^T f$$ where $R$ is a matrix containing all
feature vectors $x_1 , \dots , x_N$ row-wise. The problem is solved
for smallest eigenvalues $\lambda_1, \dots, \lambda_t$ and its
corresponding eigenvectors $f_1, \dots, f_t$. The final embedding is
obtained with a matrix such that $i$-th coordinate ($i=1,\dots,N$)
of $j$-th eigenvector ($j=1,\dots,t$) corresponds to $j$-th
coordinate of projected $i$-th vector.

References

\begin{itemize}
\tightlist
\item
  He, X., Cai, D., Yan, S., \& Zhang, H.-J. (2005).
  \href{http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1544858}{Neighborhood
  preserving embedding}
\end{itemize}

\section{Principal Component
Analysis}\label{principal-component-analysis}

The Principal Component Analysis is probably the oldest dimension
reduction algorithm which comes in various flavours today. The simplest
`version' of the PCA algorithm could look like that:

\begin{itemize}
\item
  Subtract mean feature vector from each feature vector of a set
  $X = \{ x_1, x_2, \dots, x_N \}$.
\item
  Compute the covariance matrix $C$ using all feature vectors.
\item
  Find top $t$ (desired dimension of embedded space) and form
  projection matrix $P$ with eigenvectors as columns.
\item
  Project the data with $Y = P X$.
\end{itemize}

\section{Random projection}\label{random-projection}

The Random projection algorithm is yet more simple algorithm (comparing
to PCA and MDS). It can be said that the algorithm is based on
Johnson-Lindenstrauss lemma that states that a small number of vectors
in high-dimensional space can be embedded into a space of much lower
dimension with keeping pairwise distances nearly preserved. The
algorithm itself is:

\begin{itemize}
\item
  Construct random basis matrix $P$ with normalized random gaussian
  vectors as columns.
\item
  Project data with left multiplication with generated matrix
  $Y = P X$.
\end{itemize}

\section{Stochastic Proximity
Embedding}\label{stochastic-proximity-embedding}

Stochastic Proximity Embedding (SPE) acts on a set of
$N$ vectors $Y = \{ y_1, y_2, \dots y_N \}$ with corresponding
symmetric distance matrix $D_{ij}$ in the following manner:

\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\item
  Choose an initial learning rate $\lambda$.
\item
  Initialize randomly the point coordinates in the embedded space
  $X = \{ x_1, x_2, \dots x_N \}$.
\item
  Select at random a pair of points with indices $i$ and $j$. For a
  prescribed number of iterations $S$, compute their distances in the
  embedded space, $$d_{i,j} = \| x_i - x_j \|$$; if
  $d_{i,j} \neq D_{i,j}$ then update the coordinates of the selected
  points by
  $$x_i \leftarrow x_i + \lambda \frac{1}{2} \frac{D_{ij} - d_{ij}}{d_{ij} + \epsilon} (x_i - x_j),$$
  $$x_j \leftarrow x_j + \lambda \frac{1}{2} \frac{D_{ij} - d_{ij}}{d_{ij} + \epsilon} (x_j - x_i).$$
\item
  Decrease the learning rate $\lambda$ by
  $\delta \lambda | 0 < \delta < 1$. $\lambda$ is decreased to avoid
  oscillatory behaviour.
\item
  Repeat steps 3 and 4 for a predetermined number of iterations $C$.
\end{enumerate}

SPE is an interesting method because of its simplicity and efficiency,
as it scales linearly with the sample size $N$.

\subsection{Reference}\label{reference}

\begin{itemize}
\tightlist
\item
  D. K. Agrafiotis. ``Stochastic Proximity Embedding,'' \emph{Journal of
  Computational Chemistry}, 2003.
\end{itemize}

\section{Stochastic Neighbour
Embedding}\label{stochastic-neighbour-embedding}

Stochastic Neighbour Embedding (SNE) uses conditional probability
densities in order to model pairwise similarities between data points,
rather than using Euclidean distances directly. The similarity of the
point $x_j$ to the point $x_i$ is the conditional probability
$p_{j|i}$, which is the probability that $x_j$ would be $x_i$'s
neighbour taking into account that neighbourhoods are built in
proportion to Gaussian probability densities centered at $x_{i}$.
Formally,
$$p_{j|i} = \frac{exp(-\|x_i - x_j\|^2 / 2 \sigma_i^2)}{\sum_{k \neq i} exp(-\|x_i - x_k\|^2 / 2 \sigma_i^2)}$$
where $\sigma_i$ is the variance of the Gaussian centered on
$x_i$, whose computation will be later explained. The similarities in
the low-dimensional space are defined in a similar way. However, the
variance of the Gaussian distributions employed are fixed to
$\frac{1}{\sqrt{2}}$ this time, i.e.
$$q_{j|i} = \frac{exp(-\|x_i - x_j\|^2)}{\sum_{k \neq i} exp(-\|x_i - x_k\|^2)}.$$

Intuitively, if the low-dimensional points $Y_i$ and $Y_j$ map
correctly the similarity between their high-dimensional counterparts
$x_i$ and $x_j$, then $p_{j|i}$ and $q_{j|i}$ will be close
to each other. SNE aims at making these quantities as close as possible
minimizing the sum of Kullback-Leibler divergences over all the data
set. Thus, the cost function is
$$C = \sum_{i} KL(P_i \| Q_i) = \sum_{i} \sum_{j} p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}.$$

Unfortunately, there exists no optimal value of the Gaussian variance
for all the points in the data set since the data may vary considerably
throughout the data set. SNE chooses the value of each $sigma_i$
performing binary search so that a user specified value for the Shannon
entropy of $P_i$ is achieved.

\section{t-Distributed Stochastic Neighbour
Embedding}\label{t-distributed-stochastic-neighbour-embedding}

There are two main issues related to SNE that t-Distributed Stochastic
Neighbour Embedding (t-SNE) addresses:

\begin{itemize}
\item
  SNE's cost function using gradient descent is faster optimized using
  symmetric similarities. Therefore, t-SNE uses joint probability
  distributions $p_{ij}$ and $q_{ij}$ instead of conditional
  distributions.
\item
  t-SNE uses Student's t instead of Gaussian distributions to handle
  better the so-called crowding problem.
\end{itemize}

\subsection{References}\label{references-10}

\begin{itemize}
\tightlist
\item
  Van der Maaten, L., Hinton, G. (2008).
  \href{http://jmlr.csail.mit.edu/papers/v9/vandermaaten08a.html}{Visualizing
  Data using t-SNE}. 3
\end{itemize}

\end{document}