\documentclass{article} %\VignetteEngine{knitr::knitr} %\VignetteIndexEntry{Dimensionality Reduction} %\VignetteKeyword{Dimensionality Reduction} \usepackage[utf8]{inputenc} \usepackage[T1]{fontenc} \usepackage{hyperref} \usepackage{amsmath,amssymb} \usepackage{booktabs} \usepackage{tikz} \usetikzlibrary{trees} \usepackage[sectionbib,round]{natbib} \title{\pkg{dimRed} and \pkg{coRanking}---Unifying Dimensionality Reduction in R} \author{Guido Kraemer \and Markus Reichstein \and Miguel D.\ Mahecha} % these are taken from RJournal.sty: \makeatletter \DeclareRobustCommand\code{\bgroup\@noligs\@codex} \def\@codex#1{\texorpdfstring% {{\normalfont\ttfamily\hyphenchar\font=-1 #1}}% {#1}\egroup} \newcommand{\kbd}[1]{{\normalfont\texttt{#1}}} \newcommand{\key}[1]{{\normalfont\texttt{\uppercase{#1}}}} \DeclareRobustCommand\samp{`\bgroup\@noligs\@sampx} \def\@sampx#1{{\normalfont\texttt{#1}}\egroup'} \newcommand{\var}[1]{{\normalfont\textsl{#1}}} \let\env=\code \newcommand{\file}[1]{{`\normalfont\textsf{#1}'}} \let\command=\code \let\option=\samp \newcommand{\dfn}[1]{{\normalfont\textsl{#1}}} % \acronym is effectively disabled since not used consistently \newcommand{\acronym}[1]{#1} \newcommand{\strong}[1]{\texorpdfstring% {{\normalfont\fontseries{b}\selectfont #1}}% {#1}} \let\pkg=\strong \newcommand{\CRANpkg}[1]{\href{https://CRAN.R-project.org/package=#1}{\pkg{#1}}}% \let\cpkg=\CRANpkg \newcommand{\ctv}[1]{\href{https://CRAN.R-project.org/view=#1}{\emph{#1}}} \newcommand{\BIOpkg}[1]{\href{https://www.bioconductor.org/packages/release/bioc/html/#1.html}{\pkg{#1}}} \makeatother \begin{document} \maketitle \abstract{ % This document is based on the manuscript of \citet{kraemer_dimred_2018} which was published in the R-Journal and has been modified and extended to fit the format of a package vignette and to match the extended functionality of the \pkg{dimRed} package. 
``Dimensionality reduction'' (DR) is a widely used approach to find low dimensional and interpretable representations of data that are natively embedded in high-dimensional spaces.
%
DR can be realized by a plethora of methods with different properties, objectives, and, hence, (dis)advantages. The resulting low-dimensional data embeddings are often difficult to compare with objective criteria.
%
Here, we introduce the \CRANpkg{dimRed} and \CRANpkg{coRanking} packages for the R language.
%
These open source software packages enable users to easily access multiple classical and advanced DR methods using a common interface.
%
The packages also provide quality indicators for the embeddings and easy visualization of high dimensional data.
%
The \pkg{coRanking} package provides the functionality for assessing DR methods in the co-ranking matrix framework.
%
In tandem, these packages allow for uncovering complex structures in high dimensional data.
%
Currently 15 DR methods are available in the package, some of which were not previously available to R users.
%
Here, we outline the \pkg{dimRed} and \pkg{coRanking} packages and make the implemented methods understandable to the interested reader.
%
}

\section{Introduction}
\label{sec:intro}

Dimensionality Reduction (DR) essentially aims to find low dimensional representations of data while preserving their key properties.
%
Many methods exist in the literature, optimizing different criteria:
%
maximizing the variance or the statistical independence of the projected data,
%
minimizing the reconstruction error under different constraints,
%
or optimizing for different error metrics,
%
just to name a few.
%
Choosing an inadequate method may imply that much of the underlying structure remains undiscovered.
%
Often the structures of interest in a data set can be well represented by fewer dimensions than exist in the original data.
%
Data compression of this kind has the additional benefit of making the encoded information easier for us to comprehend in further analysis tasks such as classification or regression problems.
%
For example, the morphology of a plant's leaves, stems, and seeds reflects the environmental conditions the species usually grows in (e.g.,\ plants with large soft leaves will never grow in a desert but might have an advantage in a humid and shady environment).
%
Because the morphology of the entire plant depends on the environment, many morphological combinations will never occur in nature and the morphological space of all plant species is tightly constrained.
%
\citet{diaz_global_2016} found that out of six observed morphological characteristics only two embedding dimensions were enough to represent three quarters of the total observed variability.
%
DR is a widely used approach for the detection of structure in multivariate data, and has applications in a variety of fields.
%
In climatology, DR is used to find the modes of some phenomenon, e.g.,\ the first Empirical Orthogonal Function of monthly mean sea surface temperature of a given region over the Pacific is often linked to the El Ni\~no Southern Oscillation or ENSO \citep[e.g.,\ ][]{hsieh_nonlinear_2004}.
%
In ecology the comparison of sites with different species abundances is a classical multivariate problem: each observed species adds an extra dimension, and because species are often bound to certain habitats, there is a lot of redundant information. Using DR is a popular technique to represent the sites in few dimensions, e.g.,\ \citet{aart_distribution_1972} matches wolf spider communities to habitat and \citet{morrall_soil_1974} matches soil fungi data to soil types. (In ecology the general name for DR is ordination or indirect gradient analysis.)
%
Today, hyperspectral satellite imagery collects so many bands that it is very difficult to analyze and interpret the data directly.
%
Summarizing the data in a set of few, yet independent, components is one way to reduce complexity \citep[e.g.,\ see][]{laparra_dimensionality_2015}.
%
DR can also be used to visualize the interiors of deep neural networks \citep[e.g.,\ see ][]{han_deep_2016}, where the high dimensionality comes from the large number of weights used in a neural network and convergence can be visualized by means of DR\@.
%
We could list many more example applications here but this is not the main focus of this publication.
%
The difficulty in applying DR is that each DR method is designed to maintain certain aspects of the original data and therefore may be appropriate for one task and inappropriate for another.
%
Most methods also have parameters to tune and follow different assumptions. The quality of the outcome may strongly depend on their tuning, which adds further complexity.
%
DR methods can be modeled after physical models with attracting and repelling forces (Force Directed Methods), projections onto low dimensional planes (PCA, ICA), divergence of statistical distributions (SNE family), or the reconstruction of local spaces or points by their neighbors (LLE).
%
As an example of how changing internal parameters of a method can have a great impact, the breakthrough for Stochastic Neighbor Embedding (SNE) methods came when a Student's $t$-distribution was used instead of a normal distribution to model probabilities in low dimensional space to avoid the ``crowding problem'', that is,\ a sphere in high dimensional space has a much larger volume than in low dimensional space and may contain too many points to be represented accurately in few dimensions.
%
The $t$-distribution allows medium distances to be accurately represented in few dimensions by larger distances due to its heavier tails.
%
The result is called $t$-SNE and is especially good at preserving local structures in very few dimensions. This feature made $t$-SNE useful for a wide array of data visualization tasks and the method became much more popular than standard SNE (around six times more citations of \citet{van_der_maaten_visualizing_2008} compared to \citet{hinton_stochastic_2003} in Scopus \citep{noauthor_scopus_nodate}).
%
There are a number of software packages for other languages providing collections of methods: In Python there is scikit-learn \citep{scikit-learn}, which contains a module for DR. In Julia we currently find ManifoldLearning.jl for nonlinear and MultivariateStats.jl for linear DR methods.
%
There are several toolboxes for DR implemented in Matlab \citep{van_der_maaten_dimensionality_2009, arenas-garcia_kernel_2013}. The Shogun toolbox \citep{soeren_sonnenburg_2017_1067840} implements a variety of methods for dimensionality reduction in C++ and offers bindings for many common high level languages (including R, but the installation is anything but simple, as there is no CRAN package).
%
However, there is no comprehensive package for R and none of the aforementioned software packages provides means to consistently compare the quality of different methods for DR.
%
For many applications it can be difficult to objectively find the right method or parameterization for the DR task.
%
This paper presents the \pkg{dimRed} and \pkg{coRanking} packages for the popular programming language R. Together, they provide a standardized interface to various dimensionality reduction methods and quality metrics for embeddings. They are implemented using the S4 class system of R, making the packages both easy to use and to extend. The design goal for these packages is to enable researchers, who may not necessarily be experts in DR, to apply the methods in their own work and to objectively identify the most suitable methods for their data.
%
This paper provides an overview of the methods collected in the packages and contains examples of how to use the packages.
%
The notation in this paper will be as follows: $X = [x_i]_{1\leq i \leq n}^T \in \mathbb{R}^{n\times p}$ denotes the data matrix, with observations $x_i \in \mathbb{R}^p$.
%
These observations may be transformed prior to the dimensionality reduction step (e.g.,\ centering and/or standardization) resulting in $X' = [x'_i]_{1\leq i \leq n}^T \in \mathbb{R}^{n\times p}$.
%
A DR method then embeds each vector in $X'$ onto a vector in $Y = [y_i]_{1\leq i \leq n}^T \in \mathbb{R}^{n\times q}$ with $y_i \in \mathbb{R}^q$, ideally with $q \ll p$.
%
Some methods provide an explicit mapping $f(x'_i) = y_i$. Some even offer an inverse mapping $f^{-1}(y_{i}) = \hat x'_{i}$, such that one can reconstruct a (usually approximate) sample from the low-dimensional representation.
%
For some methods, pairwise distances between points are needed; we set $d_{ij} = d(x_{i}, x_{j})$ and $\hat{d}_{ij} = d(y_i, y_j)$, where $d$ is some appropriate distance function.

When referring to \code{functions} in the \pkg{dimRed} package or base R, only the function name is mentioned; functions from other packages are referenced with their namespace, as in \code{package::function}.

\begin{figure}[htbp]
\centering
\input{classification_tree.tex}
\caption{%
Classification of dimensionality reduction methods. Methods in bold face are implemented in \pkg{dimRed}. Modified from \citet{van_der_maaten_dimensionality_2009}.
}\label{fig:classification}
\end{figure}

\section{Dimensionality Reduction Methods}
\label{sec:dimredtec}

In the following section we do not aim for an exhaustive explanation of every method in \pkg{dimRed} but rather to provide a general idea of how the methods work.
%
An overview and classification of the most commonly used DR methods can be found in Figure~\ref{fig:classification}.
In all methods, parameters have to be optimized or decisions have to be made, even if it is just about the preprocessing steps of data.
%
The \pkg{dimRed} package tries to make the optimization process for parameters as easy as possible, but, if possible, the parameter space should be narrowed down using prior knowledge.
%
Often decisions can be made based on theoretical knowledge. For example,\ sometimes an analysis requires data to be kept in their original scales and sometimes this is exactly what has to be avoided, as when comparing different physical units.
%
Sometimes decisions based on the experience of others can be made, e.g.,\ the Gaussian kernel is probably the most universal kernel and therefore should be tested first if there is a choice.
%
All methods presented here have the embedding dimensionality, $q$, as a parameter (or \code{ndim} as a parameter for \code{embed}).
%
For methods based on eigenvector decomposition, the result generally does not depend on the number of dimensions, i.e.,\ the first dimension will be the same, no matter if we decide to calculate only two dimensions or more.
%
If more dimensions are added, more information is maintained; the first dimension is the most important and higher dimensions are successively less important.
%
This means that a method based on eigenvalue decomposition only has to be run once if one wishes to compare the embedding in different dimensions.
%
In optimization based methods this is generally not the case: the number of dimensions has to be chosen a priori, embeddings in 2 and 3 dimensions may differ significantly, and there is no ordered importance of dimensions.
%
This means that comparing dimensions of optimization-based methods is computationally much more expensive.
%
We try to give the computational complexity of the methods. Because of differences in the actual implementations, computation times may differ considerably.
%
R is an interpreted language, so all parts of an algorithm that are implemented in R tend to be slow compared to methods that call efficient implementations in a compiled language.
%
Methods where most of the computing time is spent on eigenvalue decomposition do have very efficient implementations, as R uses optimized linear algebra libraries, although eigenvalue decomposition itself does not scale very well in naive implementations ($\mathcal{O}(n^3)$).

\subsection{PCA}
\label{sec:pca}

Principal Component Analysis (PCA) is the most basic technique for reducing dimensions. It dates back to \citet{pearson_lines_1901}. PCA finds a linear projection ($U$) of the high dimensional space into a low dimensional space $Y = XU$, maintaining maximum variance of the data. It is based on solving the following eigenvalue problem:
\begin{equation}
(C_{XX}-\lambda_k I)u_k=0\label{eq:pca}
\end{equation}
where $C_{XX} = \frac 1 n X^TX$ is the covariance matrix of the centered data, $\lambda_k$ and $u_k$ are the $k$-th eigenvalue and eigenvector, and $I$ is the identity matrix.
%
The equation has several solutions for different values of $\lambda_k$ (leaving aside the trivial solution $u_k = 0$).
%
PCA can be efficiently applied to large data sets, because it computationally scales as $\mathcal{O}(np^2 + p^3)$, that is, it scales linearly with the number of samples, and R uses specialized linear algebra libraries for this kind of computation.

PCA is a rotation around the origin and there exist a forward and an inverse mapping.
%
PCA may suffer from a scale problem, i.e.,\ one variable may dominate the variance simply because it is measured on a larger scale. To remedy this, the data can be scaled to zero mean and unit variance if the use case requires it.
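
As a minimal sketch of Equation~\ref{eq:pca} (not the code used inside \pkg{dimRed}), the following base R lines solve the eigenvalue problem for the built-in \code{iris} measurements and compare the resulting scores with those of \code{prcomp}. All variable names are illustrative assumptions. $C_{XX}$ is computed with the divisor $n$ as in the text, whereas \code{prcomp} uses $n - 1$; this changes the eigenvalues but not the eigenvectors.

<<eval=FALSE>>=
## centered data matrix X (n x p); iris ships with base R
X <- scale(as.matrix(iris[, 1:4]), center = TRUE, scale = FALSE)
n <- nrow(X)

## covariance matrix with divisor n, as in the text
C_XX <- crossprod(X) / n

## eigenvectors u_k are the columns of U, ordered by decreasing eigenvalue
eig <- eigen(C_XX, symmetric = TRUE)
U <- eig$vectors

## forward mapping: projection onto the first q = 2 principal axes
Y <- X %*% U[, 1:2]

## prcomp gives the same scores up to the sign of each axis
Y_prcomp <- prcomp(X)$x[, 1:2]
max(abs(abs(Y) - abs(Y_prcomp)))

## inverse mapping: approximate reconstruction from q = 2 dimensions
X_hat <- Y %*% t(U[, 1:2])
mean((X - X_hat)^2)
@

The sign of each principal axis is arbitrary, which is why the comparison is made on absolute values.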
%
Base R implements PCA in the functions \code{prcomp} and \code{princomp}, but several other implementations exist, e.g.,\ \BIOpkg{pcaMethods} from Bioconductor, which implements versions of PCA that can deal with missing data.
%
The \pkg{dimRed} package wraps \code{prcomp}.

\subsection{kPCA}
\label{sec:kpca}

Kernel Principal Component Analysis (kPCA) extends PCA to deal with nonlinear dependencies among variables.
%
The idea behind kPCA is to map the data into a high dimensional space using a possibly non-linear function $\phi$ and then to perform a PCA in this high dimensional space.
%
Some mathematical tricks are used for efficient computation.
%
If the columns of $X$ are centered around $0$, then the principal components can also be computed from the inner product matrix $K = XX^T$.
%
Because of this way of calculating a PCA, we do not need to explicitly map all points into the high dimensional space and do the calculations there; it is enough to obtain the inner product matrix or kernel matrix $K \in \mathbb{R}^{n\times n}$ of the mapped points \citep{scholkopf_nonlinear_1998}.
%
Here is an example calculating the kernel matrix using a Gaussian kernel:
\begin{equation}\label{eq:gauss}
K_{ij} = \phi(x_i)^T \phi(x_j) = \kappa(x_i, x_j) = \exp\left( -\frac{\| x_i- x_j\|^2}{2 \sigma^2} \right),
\end{equation}
where $\sigma$ is a length scale parameter accounting for the width of the kernel.
%
The other trick used is known as the ``representer theorem.'' The interested reader is referred to \citet{scholkopf_generalized_2001}.

The kPCA method is very flexible and there exist many kernels for special purposes. The most common kernel function is the Gaussian kernel (Equation\ \ref{eq:gauss}).
%
The flexibility comes at the price that the method has to be finely tuned for the data set because some parameter combinations are simply unsuitable for certain data.
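
The kernel matrix of Equation~\ref{eq:gauss} and the subsequent eigenvalue decomposition can be sketched in a few lines of base R. This is an illustration only and not the implementation of \code{kernlab::kpca}; the value of \code{sigma} and all variable names are arbitrary assumptions. Note also that the Gaussian kernel in \pkg{kernlab} (\code{rbfdot}) is parameterized as $\exp(-\sigma \| x - x' \|^2)$, so its \code{sigma} is not the length scale used here.

<<eval=FALSE>>=
X <- as.matrix(iris[, 1:4])
n <- nrow(X)
sigma <- 1  # length scale, chosen arbitrarily for this sketch

## squared Euclidean distances between all pairs of observations
D2 <- as.matrix(dist(X))^2

## Gaussian kernel matrix K (n x n)
K <- exp(-D2 / (2 * sigma^2))

## center the kernel matrix, i.e. center the mapped points in feature space
H <- diag(n) - matrix(1 / n, n, n)
K_c <- H %*% K %*% H

## eigenvalue decomposition of the centered kernel matrix
eig <- eigen(K_c, symmetric = TRUE)

## embedding: leading eigenvectors scaled by the square root of the eigenvalues
q <- 2
Y <- eig$vectors[, 1:q] %*% diag(sqrt(eig$values[1:q]))
@

The matrices \code{K} and \code{H} are both $n \times n$, which makes the quadratic memory requirement of the method directly visible.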
%
The method is not suitable for very large data sets, because memory scales with $\mathcal{O}(n^2)$ and computation time with $\mathcal{O}(n^3)$.
%
Diffusion Maps, Isomap, Locally Linear Embedding, and some other techniques can be seen as special cases of kPCA, in which case an out-of-sample extension using the Nyström formula can be applied \citep{bengio_learning_2004}.
%
This can also yield applications for bigger data, where an embedding is trained with a sub-sample of all data and then the remaining data are embedded using the Nyström formula.

Kernel PCA in R is implemented in the \CRANpkg{kernlab} package by the function \code{kernlab::kpca}, which supports a number of kernels and user defined functions. For details see the help page for \code{kernlab::kpca}.

The \pkg{dimRed} package wraps \code{kernlab::kpca} but additionally provides forward and inverse methods \citep{bakir_learning_2004} which can be used to fit out-of-sample data or to visualize the transformation of the data space.
%

\subsection{Classical Scaling}
\label{sec:classscale}

What today is called Classical Scaling was first introduced by \citet{torgerson_multidimensional_1952}. It uses an eigenvalue decomposition of a transformed distance matrix to find an embedding that maintains the distances of the distance matrix.
%
The method works for the same reason that kPCA works, i.e.,\ classical scaling can be seen as a kPCA with kernel $x^Ty$.
%
A matrix of Euclidean distances can be transformed into an inner product matrix by some simple transformations and therefore yields the same result as a PCA\@.
%
Classical scaling is conceptually more general than PCA in that arbitrary distance matrices can be used, i.e.,\ the method does not even need the original coordinates, just a distance matrix $D$.
%
Then it tries to find an embedding $Y$ so that $\hat d_{ij}$ is as similar to $d_{ij}$ as possible.
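
The equivalence with PCA for Euclidean distances can be checked in base R. The following sketch (variable names are illustrative assumptions) double-centers the squared distance matrix to obtain an inner product matrix, builds the embedding from its leading eigenvectors, and compares the result with \code{cmdscale} and \code{prcomp}.

<<eval=FALSE>>=
X <- as.matrix(iris[, 1:4])
n <- nrow(X)

## squared Euclidean distance matrix
D2 <- as.matrix(dist(X))^2

## double centering turns squared distances into inner products
H <- diag(n) - matrix(1 / n, n, n)
B <- -0.5 * H %*% D2 %*% H

## embedding from the two leading eigenvectors of B
eig <- eigen(B, symmetric = TRUE)
Y <- eig$vectors[, 1:2] %*% diag(sqrt(eig$values[1:2]))

## the same result, up to axis signs, from cmdscale and from PCA
Y_cmd <- cmdscale(dist(X), k = 2)
Y_pca <- prcomp(X)$x[, 1:2]
max(abs(abs(Y) - abs(Y_cmd)))
max(abs(abs(Y) - abs(Y_pca)))
@

Only \code{dist(X)} enters the first two computations, so any other distance matrix could be substituted there, whereas the PCA comparison needs the original coordinates.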
The disadvantage is that it is computationally much more demanding, i.e.,\ an eigenvalue decomposition of an $n\times n$ matrix has to be computed. This step requires $\mathcal{O}(n^2)$ memory and $\mathcal{O}(n^3)$ computation time, while PCA requires only the eigenvalue decomposition of a $p\times p$ matrix and usually $n \gg p$.
%
R implements classical scaling in the \code{cmdscale} function.
%
The \pkg{dimRed} package wraps \code{cmdscale} and allows the specification of arbitrary distance functions for calculating the distance matrix. Additionally a forward method is implemented.

\subsection{Isomap}
\label{sec:isomap}

As Classical Scaling can deal with arbitrarily defined distances, \citet{tenenbaum_global_2000} suggested approximating the structure of the manifold by using geodesic distances.
%
In practice, a graph is created by either keeping only the connections between every point and its $k$ nearest neighbors to produce a $k$-nearest neighbor graph ($k$-NNG), or simply by keeping all distances smaller than a value $\varepsilon$ producing an $\varepsilon$-neighborhood graph ($\varepsilon$-NNG).
%
Geodesic distances are approximated by the shortest path distances on the graph and classical scaling is used to find an embedding in fewer dimensions. This leads to an ``unfolding'' of possibly convoluted structures (see Figure~\ref{fig:knn}).

Isomap's computational cost is dominated by the eigenvalue decomposition and therefore scales with $\mathcal{O}(n^3)$.
%
Other related techniques can use more efficient algorithms because the distance matrix becomes sparse due to a different preprocessing.

In R, Isomap is implemented in the \CRANpkg{vegan} package. The function \code{vegan::isomap} calculates an Isomap embedding and \code{vegan::isomapdist} calculates a geodesic distance matrix.
%
The \pkg{dimRed} package uses its own implementation.
This implementation is faster mainly due to using a KD-tree for the nearest neighbor search (from the \CRANpkg{RANN} package) and to a faster implementation for the shortest path search in the $k$-NNG (from the \CRANpkg{igraph} package).
%
The implementation in \pkg{dimRed} also includes a forward method that can be used to train the embedding on a subset of data points and then use these points to approximate an embedding for the remaining points. This technique is generally referred to as landmark Isomap \citep{de_silva_sparse_2004}.
%

\subsection{Locally Linear Embedding}
\label{sec:lle}

Points that lie on a manifold in a high dimensional space can be reconstructed through linear combinations of their neighborhoods if the manifold is well sampled and the neighborhoods lie on a locally linear patch.
%
These reconstruction weights, $W$, are the same in the high dimensional space as in the internal coordinates of the manifold.
%
Locally Linear Embedding \citep[LLE; ][]{roweis_nonlinear_2000} is a technique that constructs a weight matrix $W \in \mathbb{R}^{n\times n}$ with elements $w_{ij}$ so that
\begin{equation}
\sum_{i=1}^n \bigg\| x_i- \sum_{j=1}^{n} w_{ij}x_j \bigg\|^2\label{eq:lle}
\end{equation}
is minimized under the constraint that $w_{ij} = 0 $ if $x_j$ does not belong to the neighborhood of $x_i$ and the constraint that $\sum_{j=1}^n w_{ij} = 1$.
%
Finally the embedding is made in such a way that the following cost function is minimized for $Y$,
\begin{equation}
\sum_{i=1}^n\bigg\| y_i - \sum_{j=1}^n w_{ij}y_j \bigg\|^2.\label{eq:lle2}
\end{equation}
This can be solved using an eigenvalue decomposition.

Conceptually the method is similar to Isomap but it is computationally much nicer because the weight matrix is sparse and there exist efficient solvers.
%
In R, LLE is implemented by the package \CRANpkg{lle}; the embedding can be calculated with \code{lle::lle}. Unfortunately the implementation does not make use of the sparsity of the weight matrix $W$.
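
The weights minimizing Equation~\ref{eq:lle} for a single point can be sketched in base R. This is not the code of \code{lle::lle}; the neighborhood size \code{k}, the regularization constant, and all variable names are assumptions made for illustration. For one point $x_i$, the constrained least squares problem reduces to solving a small linear system with the Gram matrix of the neighbors after shifting $x_i$ to the origin.

<<eval=FALSE>>=
X <- as.matrix(iris[, 1:4])
i <- 1   # index of the point to reconstruct
k <- 10  # neighborhood size, arbitrary for this sketch

## indices of the k nearest neighbors of x_i (excluding x_i itself)
d_i <- sqrt(rowSums(sweep(X, 2, X[i, ])^2))
nb <- setdiff(order(d_i), i)[1:k]

## neighbors shifted so that x_i is the origin
Z <- sweep(X[nb, , drop = FALSE], 2, X[i, ])

## local Gram matrix, regularized because it is singular when k > p
G <- tcrossprod(Z)
G <- G + diag(k) * 1e-3 * sum(diag(G))

## solve G w = 1 and rescale so that the weights sum to one
w <- solve(G, rep(1, k))
w <- w / sum(w)

## reconstruction of x_i from its neighbors and the reconstruction error
x_hat <- colSums(w * X[nb, , drop = FALSE])
sqrt(sum((X[i, ] - x_hat)^2))
@

Repeating this for every $i$ fills one row of $W$ at a time with only \code{k} non-zero entries, which is the sparsity referred to above.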
%
The manifold must be well sampled and the neighborhood size must be chosen appropriately for LLE to give good results.
%

\subsection{Laplacian Eigenmaps}
\label{sec:laplaceigenmaps}

Laplacian Eigenmaps were originally developed under the name spectral clustering to separate non-convex clusters.
%
Later they were also used for graph embedding and DR \citep{belkin_laplacian_2003}.
%
A number of variants have been proposed.
%
First, a graph is constructed, usually from a distance matrix; the graph can be made sparse by keeping only the $k$ nearest neighbors, or by specifying an $\varepsilon$ neighborhood.
%
Then, a similarity matrix $W$ is calculated by using a Gaussian kernel (see Equation \ref{eq:gauss}). If $c = 2 \sigma^2 = \infty$, then all distances are treated equally; the smaller $c$, the more emphasis is given to differences in distance.
%
The degree of vertex $i$ is $d_i = \sum_{j=1}^n w_{ij}$ and the degree matrix, $D$, is the diagonal matrix with entries $d_i$.
%
We can then form the graph Laplacian $L = D - W$. There are several ways to proceed from here; an overview can be found in \citet{luxburg_tutorial_2007}.
%

The \pkg{dimRed} package implements the algorithm from \citet{belkin_laplacian_2003}. Analogously to LLE, Laplacian eigenmaps avoid computational complexity by creating a sparse matrix and not having to estimate the distances between all pairs of points.
%
Then the eigenvectors corresponding to the lowest eigenvalues larger than $0$ of either the matrix $L$ or the normalized Laplacian $D^{-1/2}LD^{-1/2}$ are computed and form the embedding.

\subsection{Diffusion Maps}
\label{sec:isodiffmaplle}

Diffusion Maps \citep{coifman_diffusion_2006} take a distance matrix as input and calculate the transition probability matrix $P$ of a diffusion process between the points to approximate the manifold.
%
Then the embedding is done by an eigenvalue decomposition of $P$ to calculate the coordinates of the embedding.
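
The construction of $P$ and its eigenvalue decomposition can be sketched in base R as follows. This is a simplified illustration under stated assumptions and not the implementation of \code{diffusionMap::diffuse}: the kernel width \code{eps}, the number of time steps, and all variable names are chosen arbitrarily, and a dense Gaussian similarity matrix is used.

<<eval=FALSE>>=
X <- as.matrix(iris[, 1:4])
eps <- 1      # kernel width, arbitrary for this sketch
t_steps <- 1  # number of diffusion time steps

## Gaussian similarities between all pairs of points
W <- exp(-as.matrix(dist(X))^2 / eps)

## degrees and row-stochastic transition matrix P = D^{-1} W
d <- rowSums(W)
P <- W / d

## P is similar to the symmetric matrix S = D^{-1/2} W D^{-1/2},
## which allows a numerically stable symmetric eigenvalue decomposition
S <- W / sqrt(outer(d, d))
eig <- eigen(S, symmetric = TRUE)

## right eigenvectors of P; the first one is constant and is dropped
psi <- eig$vectors / sqrt(d)
q <- 2
Y <- psi[, 2:(q + 1)] %*% diag(eig$values[2:(q + 1)]^t_steps)
@

Raising the eigenvalues to the power of the number of time steps corresponds to running the diffusion process for that many steps before embedding.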
%
The algorithm for calculating Diffusion Maps shares some elements with the way Laplacian Eigenmaps are calculated.
%
Both algorithms start from the same weight matrix, but Diffusion Maps calculate the transition probability on the graph after $t$ time steps and do the embedding on this probability matrix.

The idea is to simulate a diffusion process between the nodes of the graph, which is more robust to short-circuiting than the $k$-NNG from Isomap (see bottom right Figure \ref{fig:knn}).
%
Diffusion maps in R are accessible via the \code{diffusionMap::diffuse()} function, which is available in the \CRANpkg{diffusionMap} package.
%
Additional points can be approximated into an existing embedding using the Nyström formula \citep{bengio_learning_2004}.
%
The implementation in \pkg{dimRed} is based on the \code{diffusionMap::diffuse} function.
% , which does not contain an
% approximation for unequally sampled manifolds
% \citep{coifman_geometric_2005}.
%

\subsection{non-Metric Dimensional Scaling}
\label{sec:nmds}

While Classical Scaling and derived methods (see section \nameref{sec:classscale}) use eigenvector decomposition to embed the data in such a way that the given distances are maintained, non-Metric Dimensional Scaling \citep[nMDS, ][]{kruskal_multidimensional_1964,kruskal_nonmetric_1964} uses optimization methods to reach the same goal.
% Therefore a stress function, \begin{equation} \label{eq:stress} S = \sqrt{\frac{\sum_{i>= if(Sys.getenv("BNET_BUILD_VIGNETTE") != "") { library(dimRed); library(ggplot2); #library(dplyr); library(tidyr) ## define which methods to apply embed_methods <- c("Isomap", "PCA") ## load test data set data_set <- loadDataSet("3D S Curve", n = 1000) ## apply dimensionality reduction data_emb <- lapply(embed_methods, function(x) embed(data_set, x)) names(data_emb) <- embed_methods ## plot data set, embeddings, and quality analysis ## plot(data_set, type = "3vars") ## lapply(data_emb, plot, type = "2vars") ## plot_R_NX(data_emb) add_label <- function(label) grid::grid.text(label, 0.2, 1, hjust = 0, vjust = 1, gp = grid::gpar(fontface = "bold", cex = 1.5)) ## pdf('~/phd/text/dimRedPackage/plots/plot_example.pdf', width = 4, height = 4) ## plot the results plot(data_set, type = "3vars", angle = 15, mar = c(3, 3, 0, 0), box = FALSE, grid = FALSE, pch = 16) add_label("a") par(mar = c(4, 4, 0, 0) + 0.1, bty = "n", las = 1) plot(data_emb$Isomap, type = "2vars", pch = 16) add_label("b") plot(data_emb$PCA, type = "2vars", pch = 16) add_label("d") ## calculate quality scores print( plot_R_NX(data_emb) + theme(legend.title = element_blank(), legend.position = c(0.5, 0.1), legend.justification = c(0.5, 0.1)) ) add_label("c") } else { # These cannot all be plot(1:10)!!! It's a mistery to me. plot(1:10) barplot(1:10) hist(1:10) plot(1:10) } @ \includegraphics[page=1,width=.45\textwidth]{figure/pca_isomap_example-1.pdf} \includegraphics[page=1,width=.45\textwidth]{figure/pca_isomap_example-2.pdf} \includegraphics[page=1,width=.45\textwidth]{figure/pca_isomap_example-3.pdf} \includegraphics[page=1,width=.45\textwidth]{figure/pca_isomap_example-4.pdf} \caption[dimRed example]{% Comparing PCA and Isomap: % (a) An S-shaped manifold, colors represent the internal coordinates of the manifold. % (b) Isomap embedding, the S-shaped manifold is unfolded. 
%
(c) $R_{NX}$ plotted against neighborhood sizes, Isomap is much better at preserving local distances and PCA is better at preserving global Euclidean distances.
%
The numbers on the legend are the $\text{AUC}_{1 / K}$.
(d) PCA projection of the data, the directions of maximum variance are preserved.
%
}\label{fig:plotexample}
\end{figure}

<>=
## define which methods to apply
embed_methods <- c("Isomap", "PCA")
## load test data set
data_set <- loadDataSet("3D S Curve", n = 1000)
## apply dimensionality reduction
data_emb <- lapply(embed_methods, function(x) embed(data_set, x))
names(data_emb) <- embed_methods
## figure \ref{fig:plotexample}a, the data set
plot(data_set, type = "3vars")
## figures \ref{fig:plotexample}b (Isomap) and \ref{fig:plotexample}d (PCA)
lapply(data_emb, plot, type = "2vars")
## figure \ref{fig:plotexample}c, quality analysis
plot_R_NX(data_emb)
@

The function \code{plot\_R\_NX} produces a figure that plots the neighborhood size ($k$ on a log scale) against the quality measure $\text{R}_{NX}(k)$ (see Equation \ref{eq:rnx}).
%
This gives an overview of the general behavior of methods: if $\text{R}_{NX}$ is high for low values of $K$, then local neighborhoods are maintained well; if $\text{R}_{NX}$ is high for large values of $K$, then global gradients are maintained well.
%
It also provides a way to directly compare methods by plotting more than one $\text{R}_{NX}$ curve, and the area under the curve serves as an indicator for the overall quality of the embedding (see Equation~\ref{eq:auclnk}), which is shown as a number in the legend.

Therefore we can see from Figure~\ref{fig:plotexample}c that Isomap is very good at maintaining close and medium distances for the given data set, whereas PCA is only better at maintaining the very large distances.
%
The large distances are dominated by the overall bent shape of the S in 3D space, while the close distances are not affected by this bending.
%
This is reflected in the properties recovered by the different methods: the PCA embedding recovers the S-shape, while Isomap ignores the S-shape and recovers the inner structure of the manifold.
%
Example 2: Often the quality of an embedding strongly depends on the choice of parameters; the interface of \pkg{dimRed} can be used to facilitate searching the parameter space.

Isomap has one parameter $k$ which determines the number of neighbors used to construct the $k$-NNG\@.
%
If this number is too large, then Isomap will resemble an MDS (Figure~\ref{fig:knn} e); if the number is too small, the resulting embedding contains holes (Figure~\ref{fig:knn} c).
%
The following code finds the optimal value, $k_{\text{max}}$, for $k$ using the $Q_{\text{local}}$ criterion; the results are visualized in Figure~\ref{fig:knn} a:

\begin{figure}[htp]
\centering
<>=
if(Sys.getenv("BNET_BUILD_VIGNETTE") != "") {
library(dimRed)
library(cccd)
## Load data
ss <- loadDataSet("3D S Curve", n = 500)
## Parameter space
kk <- floor(seq(5, 100, length.out = 40))
## Embedding over parameter space
emb <- lapply(kk, function(x) embed(ss, "Isomap", knn = x))
## Quality over embeddings
qual <- sapply(emb, function(x) quality(x, "Q_local"))
## Find best value for K
ind_max <- which.max(qual)
k_max <- kk[ind_max]
## helper: bold panel label in the top left corner of the current plot
add_label <- function(label){
  par(xpd = TRUE)
  b <- par("usr")
  text(b[1], b[4], label, adj = c(0, 1), cex = 1.5, font = 2)
  par(xpd = FALSE)
}
names(qual) <- kk
}
@
<<"select_k",include=FALSE,fig.width=11,fig.height=5>>=
if (Sys.getenv("BNET_BUILD_VIGNETTE") != "") {
par(mfrow = c(1, 2),
    mar = c(5, 4, 0, 0) + 0.1,
    oma = c(0, 0, 0, 0))
plot(kk, qual, type = "l", xlab = "k", ylab = expression(Q[local]), bty = "n")
abline(v = k_max, col = "red")
add_label("a")
plot(ss, type = "3vars", angle = 15, mar = c(3, 3, 0, 0), box = FALSE, grid = FALSE, pch = 16)
add_label("b")
} else {
plot(1:10)
plot(1:10)
}
@
<<"knngraphs",include=FALSE,fig.width=8,fig.height=3>>=
if(Sys.getenv("BNET_BUILD_VIGNETTE") != "") {
par(mfrow = c(1, 3),
    mar = c(5, 4, 0, 0) + 0.1,
    oma = c(0, 0, 0, 0))
## draw the edges of the k-NNG for embedding number ind
add_knn_graph <- function(ind) {
  nn1 <- nng(ss@data, k = kk[ind])
  el <- get.edgelist(nn1)
  segments(x0 = emb[[ind]]@data@data[el[, 1], 1],
           y0 = emb[[ind]]@data@data[el[, 1], 2],
           x1 = emb[[ind]]@data@data[el[, 2], 1],
           y1 = emb[[ind]]@data@data[el[, 2], 2],
           col = "#00000010")
}
plot(emb[[2]]@data@data, type = "n", bty = "n")
add_knn_graph(2)
points(emb[[2]]@data@data, col = dimRed:::colorize(ss@meta), pch = 16)
add_label("c")
plot(emb[[ind_max]]@data@data, type = "n", bty = "n")
add_knn_graph(ind_max)
points(emb[[ind_max]]@data@data, col = dimRed:::colorize(ss@meta), pch = 16)
add_label("d")
plot(emb[[length(emb)]]@data@data, type = "n", bty = "n")
add_knn_graph(length(emb))
points(emb[[length(emb)]]@data@data, col = dimRed:::colorize(ss@meta), pch = 16)
add_label("e")
} else {
plot(1:10)
plot(1:10)
plot(1:10)
}
@
\includegraphics[width=.95\textwidth]{figure/select_k-1.pdf}
\includegraphics[width=.95\textwidth]{figure/knngraphs-1.pdf}
\caption[Estimating $k$ using $Q_\text{local}$]{%
Using \pkg{dimRed} and the $Q_\text{local}$ indicator to estimate a good value for the parameter $k$ in Isomap.
%
(a) $Q_\text{local}$ for different values of $k$, the vertical red line indicates the maximum $k_{\text{max}}$.
%
(b) The original data set, a 2 dimensional manifold bent in an S-shape in 3 dimensional space.
%
Bottom row: Embeddings and $k$-NNG for different values of $k$.
%
(c) When $k = 7$, the value for $k$ is too small, resulting in holes in the embedding; the manifold itself is still unfolded correctly.
%
(d) With $k = k_\text{max}$, the best representation of the original manifold in two dimensions achievable with Isomap.
%
(e) $k = 100$, too large, the $k$-NNG does not approximate the manifold any more.
  %
}\label{fig:knn}
\end{figure}
<>=
## Load data
ss <- loadDataSet("3D S Curve", n = 500)
## Parameter space
kk <- floor(seq(5, 100, length.out = 40))
## Embedding over parameter space
emb <- lapply(kk, function(x) embed(ss, "Isomap", knn = x))
## Quality over embeddings
qual <- sapply(emb, function(x) quality(x, "Q_local"))
## Find best value for K
ind_max <- which.max(qual)
k_max <- kk[ind_max]
@

Figure~\ref{fig:knn}a shows how the $Q_{\text{local}}$ criterion changes when varying the neighborhood size $k$ for Isomap; the gray lines in Figure~\ref{fig:knn}c--e represent the edges of the $k$-NN graph.
%
If the value for $k$ is too low, the inner structure of the manifold will still be recovered, but imperfectly (Figure~\ref{fig:knn}c; note that the holes appear in places that are not covered by the edges of the $k$-NN graph), and the $Q_{\text{local}}$ score is therefore lower than optimal.
%
If $k$ is too large, the error of the embedding is much larger due to short circuiting, and we observe a very steep drop in the $Q_{\text{local}}$ score.
%
The short circuiting can be observed in Figure~\ref{fig:knn}e, where edges cross the gap between the tips and the center of the S-shape.
%

Example 3:
It is also very easy to compare across methods and quality scores.
%
The following code produces a matrix of quality scores and methods, where \code{dimRedMethodList} returns a character vector with all methods.
A visualization of the matrix can be found in Figure~\ref{fig:qualityexample}.
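
Before running the comparison itself, it can help to check which method and quality names the installed version of \pkg{dimRed} actually provides.
The chunk below is only a minimal sketch, not part of the original example: it assumes that \pkg{dimRed} is installed and that \code{dimRedQualityList} is exported alongside \code{dimRedMethodList}; the chunk option \code{eval=FALSE} and the variable names are our own choices for illustration.
No output is shown, because the returned lists depend on the installed version and on which optional packages are available.

<<eval=FALSE>>=
library(dimRed)

## Names accepted by embed(), returned as a character vector
embed_methods <- dimRedMethodList()

## Names accepted by quality(); assumed to be exported by the
## installed version of dimRed
quality_criteria <- dimRedQualityList()

## Keep only those criteria of this example that are actually available
wanted <- c("Q_local", "Q_global", "AUC_lnK_R_NX", "cophenetic_correlation")
intersect(wanted, quality_criteria)
@

Checking the names first avoids filling rows or columns of the result matrix with \code{NA} merely because a name was misspelled or is not provided by the installed version.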
%

\begin{figure}[htp]
\centering
<<"plot_quality",include=FALSE>>=
if(Sys.getenv("BNET_BUILD_VIGNETTE") != "") {
  embed_methods <- dimRedMethodList()
  quality_methods <- c("Q_local", "Q_global", "AUC_lnK_R_NX",
                       "cophenetic_correlation")
  iris_data <- loadDataSet("Iris")
  quality_results <- matrix(
    NA, length(embed_methods), length(quality_methods),
    dimnames = list(embed_methods, quality_methods)
  )
  embedded_data <- list()

  ## try() keeps the loop going if a method or criterion fails
  for (e in embed_methods) {
    try(embedded_data[[e]] <- embed(iris_data, e))
    for (q in quality_methods)
      try(quality_results[e, q] <- quality(embedded_data[[e]], q))
  }

  ## Order the methods by their mean quality score
  quality_results <- quality_results[order(rowMeans(quality_results)), ]

  palette(c("#1b9e77", "#d95f02", "#7570b3", "#e7298a", "#66a61e"))
  col_hsv <- rgb2hsv(col2rgb(palette()))
  ## col_hsv["v", ] <- col_hsv["v", ] * 3 / 1
  palette(hsv(col_hsv["h", ], col_hsv["s", ], col_hsv["v", ]))
  par(mar = c(2, 8, 0, 0) + 0.1)
  barplot(t(quality_results), beside = TRUE, col = 1:4,
          legend.text = quality_methods, horiz = TRUE, las = 1,
          cex.names = 0.85,
          args.legend = list(x = "topleft", bg = "white", cex = 0.8))
} else {
  plot(1:10)
}
@
\includegraphics[width=.5\textwidth]{figure/plot_quality-1.pdf}
\caption[Quality comparison]{%
  A visualization of the \code{quality\_results} matrix.
  %
  The methods are ordered by mean quality score.
  %
  The reconstruction error was omitted because a higher value means a worse embedding, whereas for the quality measures shown a higher score means a better embedding.
  %
  Parameters were not tuned for the example; it should therefore not be seen as a general quality assessment of the methods.
  %
}\label{fig:qualityexample}
\end{figure}
<>=
embed_methods <- dimRedMethodList()
quality_methods <- c("Q_local", "Q_global", "AUC_lnK_R_NX",
                     "cophenetic_correlation")
scurve <- loadDataSet("3D S Curve", n = 2000)
quality_results <- matrix(
  NA, length(embed_methods), length(quality_methods),
  dimnames = list(embed_methods, quality_methods)
)
embedded_data <- list()

for (e in embed_methods) {
  embedded_data[[e]] <- embed(scurve, e)
  for (q in quality_methods) {
    try(quality_results[e, q] <- quality(embedded_data[[e]], q))
  }
}
@

This example showcases the simplicity with which different methods and quality criteria can be combined.
%
Because the results depend strongly on the parameters, it is not advisable to apply this kind of analysis without tuning the parameters for each method separately.
%
There is no automated way to tune parameters in \pkg{dimRed}.
%

\section{Conclusion}
\label{sec:conc}

This paper presents the \pkg{dimRed} and \pkg{coRanking} packages and provides a brief overview of the methods implemented therein.
%
The \pkg{dimRed} package is written in the R language, one of the most popular languages for data analysis.
The package is freely available from CRAN.
%
The package is object oriented and completely open source, and therefore easily available and extensible.
%
Although most of the DR methods already had implementations in R, \pkg{dimRed} adds some new methods for dimensionality reduction, and \pkg{coRanking} adds methods for an independent quality control of DR methods to the R ecosystem.
%
DR is a widely used technique.
However, due to the lack of easily usable tools, choosing the right method for DR is complex and depends upon a variety of factors.
%
The \pkg{dimRed} package aims to facilitate experimentation with different techniques, parameters, and quality measures so that choosing the right method becomes easier.
%
The \pkg{dimRed} package aims to enable the user to objectively compare methods that rely on very different algorithmic approaches.
%
It also makes the life of the programmer easier, because all methods are aggregated in one place and are accessed through a single interface and standardized classes.
%

\section{Acknowledgments}
\label{sec:ack}

We thank Dr.\ G.\ Camps-Valls and an anonymous reviewer for many useful comments.
%
This study was supported by the European Space Agency (ESA) via the Earth System Data Lab project (\url{http://earthsystemdatacube.org}) and the EU via the H2020 project BACI, grant agreement No 640176.
%

\bibliographystyle{abbrvnat}
\bibliography{bibliography}
\end{document}