An R package for non-negative and sparse principal component analysis.
Principal component analysis (PCA) provides a lower dimensional approximation of high-dimensional data, where the reconstruction error (measured using Euclidean distance) is minimal. PCA is therefore useful to succinctly describe a data set which has an “intrinsic” dimensionality that is low, even though the number of features might be high. A classical example is the set of face images: even low-resolution images quickly exceed 1e4 pixels, yet about 50 principal components (PCs) are sufficient to achieve a good approximation of a face image (Kirby and Sirovich, 1990).
Although PCA often provides a good approximation with few PCs, each component is usually a linear combination of all features of the data. Enforcing sparsity of the principal axes (PA) facilitates identification of the relevant features and is therefore an unsupervised feature selection method. In applications where a fixed penalty is associated with each included feature (e.g. transaction costs in finance), a small loss in variance of the PC for a large reduction in cardinality of the PA can lead to an overall more desirable solution. Enforcing non-negativity of the PAs renders PCA applicable to domains where only positive influence of features is deemed appropriate, i.e. the total variance is explained additively. Non-negative solutions often show some degree of sparsity already, but a combination of both constraints enables precise control over the cardinality of the PAs.
This package implements two non-negative and/or sparse PCA algorithms
which are rooted in expectation-maximization (EM) for a
probabilistic generative model of PCA (Sigg and Buhmann, 2008). The
nsprcomp
algorithm can also be described as applying a
soft-thresholding operator to the well-known power iteration method for
computing eigenvalues.
A small example from the domain of portfolio optimization is provided
in this
blog post, and a comparison to the arrayspc
algorithm
from the elasticnet package can be found here.
In each EM (or power) iteration of the nsprcomp
algorithm, a soft thresholding operator is applied to enforce sparsity
of the PA, and projection to the non-negative orthant is applied to
enforce non-negativity. Once the EM procedure has converged and the
support of the PA has been identified, the non-zero coefficients are
recomputed to maximize the variance of the PC. Finally, because EM is a
local optimizer, random restarts are employed to (hopefully) avoid bad
local minima. The same iterative procedure (without the non-negativity
constraint) was later also proposed by Journee et al. (2010).
The nsprcomp
algorithm computes one PC after the other.
Because constrained PAs no longer correspond to true eigenvectors of the
covariance matrix and are usually not pairwise orthogonal, special
attention needs to be paid when computing more than a single PC. The
algorithm implements the generalized deflation method proposed
by Mackey (2009) to maximize the additional variance of each component.
The variance of the PC is maximized after projecting the PA to the
ortho-complement of the space spanned by the previous PAs. This
procedure maximizes the additional variance not explained by previous
components, and is identical to standard PCA if no sparsity or
non-negativity constraints are enforced on the PAs.
The nsprcomp
algorithm is suitable for large and
high-dimensional data sets, because it entirely avoids computing the
covariance matrix. It is therefore especially suited to the case where
the number of features exceeds the number of observations.
The nscumcomp
algorithm jointly computes all PCs, such
that the cumulative variance of the components is maximized. The
computation is again based on EM iterations, but here the maximization
step has to be carried out numerically. Non-negativity of the PAs is
achieved by enforcing a zero lower bound in the L-BFGS-B algorithm used
for performing the maximization step. Sparsity of the PAs is achieved by
a subsequent soft-thresholding of the pseudo-rotation matrix (the matrix
which has the PAs as its columns). In contrast to sequential PCA
algorithms such as nsprcomp
, only the total number
k
of non-zero loadings is specified by the user, and the
non-zero loadings are distributed over all PAs by the algorithm.
The algorithm penalizes the divergence of the pseudo-rotation matrix
from ortho-normality using a Lagrange parameter gamma
. A
positive gamma
enforces a minimum angle between the PAs,
and is sometimes necessary to avoid PAs collapsing onto each other
during the EM procedure.
The nscumcomp
algorithm also scales to large and
high-dimensional data sets. But due to the numerical maximization of the
M-step, it is computationally more involved than the
nsprcomp
algorithm.
The nscumcomp
algorithm is currently unpublished.
Kirby, M., & Sirovich, L. (1990). Application of the Karhunen-Loeve procedure for the characterization of human faces. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 12(1), 103-108.
Sigg, C. D., & Buhmann, J. M. (2008). Expectation-maximization for sparse and non-negative PCA. In Proceedings of the 25th International Conference on Machine Learning (pp. 960-967).
Journee, M., Nesterov, Y., Richtarik, P., & Sepulchre, R. (2010). Generalized power method for sparse principal component analysis. The Journal of Machine Learning Research, 11, 517-553.
Mackey, L. (2009). Deflation methods for sparse pca. Advances in Neural Information Processing Systems, 21, 1017-1024.