In traditional Archetypal Analysis (AA) [1] we have a data frame \(df\), represented as a matrix \(Y\) of dimension \(n \times d\), where \(n\) is the number of observations (rows) and \(d\) is the number of variables (columns), i.e. the dimension of the relevant linear space \(R^d\).
The output of our AA algorithm, implemented in the R package “archetypal”, consists of the matrices \(A\), \(B\) such that: \[ Y\sim\,A\,B\,Y \] More strictly, the \(kappas\times\,d\) matrix of archetypes \(B\,Y\) is such that the next squared Frobenius norm is minimal: \[ SSE=\|Y-A\,B\,Y \|^2 = minimum \] Here \(A\) (\(n\times\,kappas\)) is a row stochastic matrix and \(B\) (\(kappas\times\,n\)) is a column stochastic one.
We also define the Variance Explained as follows: \[ varexpl=\frac{SST-SSE}{SST} \] with SST being the total sum of squares for all elements of matrix \(Y\).
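Under the definitions above, both quantities are easy to compute by hand; a minimal sketch, assuming matrices `Y`, `A` and `B` are already available in the workspace:

```r
# Sketch: SSE and Variance Explained for given Y, A, B
# (assumes Y is n x d, A is n x kappas, B is kappas x n)
Yhat    <- A %*% B %*% Y        # approximation by mixtures of archetypes
SSE     <- sum((Y - Yhat)^2)    # squared Frobenius norm
SST     <- sum(Y^2)             # total sum of squares of all elements of Y
varexpl <- (SST - SSE) / SST
```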
It is a suitable modification of the \(PCHA\) algorithm, see [2], [3], which now uses the data frames without transposing them and gives full control over all of its external and internal parameters.
But there are many disadvantages along this route:
But what does the concept of archetypes really stand for?
We know that archetypes lie on the boundary of the convex hull that covers all our data points.
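In two dimensions this is easy to visualize with base R; a small sketch for illustration only (`chull()` from grDevices is not part of the package workflow):

```r
# Illustration: the convex hull boundary of a 2D point cloud
set.seed(1)
xy  <- matrix(runif(60), ncol = 2)
idx <- chull(xy)            # indices of the hull vertices, in order
plot(xy, pch = 19)
polygon(xy[idx, ])          # candidate archetypes lie on this boundary
```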
Now the next questions arise:
The solution to all the above problems is Geometrical Archetypal Analysis (GAA), see also the pdf version [4].
Given the data frame \(df\) and its corresponding matrix \(Y\) with rows in \(R^d\), we follow the next steps:
Since we are sure that our archetypes cover 100% of the data points, we do not care so much about their number; what actually matters is the composition of each point.
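A natural choice of archetypes that is guaranteed to cover all points is the set of \(2^d\) corners of the data's bounding box; a minimal sketch of constructing such corners by hand (`grid_corners` is a hypothetical helper, not a package function):

```r
# Hypothetical helper: the 2^d corners of the bounding box of the data,
# i.e. all combinations of per-column minima and maxima.
grid_corners <- function(Y) {
  ranges <- lapply(as.data.frame(Y), range)  # min and max of each column
  as.matrix(expand.grid(ranges))             # 2^d corner points
}
```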
We create a set of random 3D data points and then we compute the Grid Archetypal Analysis.
library(GeomArchetypal)
# Create random data
set.seed(20140519)
df = matrix(runif(300), nrow = 100, ncol = 3)
colnames(df) = c("x","y","z")
# Grid Archetypal
gaa = grid_archetypal(df, diag_less = 1e-6, niter = 50, verbose = FALSE)
summary(gaa)
## Number of observations / data rows n = 108
## (The first 8 rows are the Grid Archetypes)
## Dimension of data variables d = 3
## Number of Grid Archetypes k = 8
##
## Grid Archetypes:
##
## x y z
## 1 0.04107952 0.00131741 0.01868682
## 2 0.98128784 0.00131741 0.01868682
## 3 0.04107952 0.99690733 0.01868682
## 4 0.98128784 0.99690733 0.01868682
## 5 0.04107952 0.00131741 0.99566805
## 6 0.98128784 0.00131741 0.99566805
## 7 0.04107952 0.99690733 0.99566805
## 8 0.98128784 0.99690733 0.99566805
##
## Run details:
##
## SSE VarianceExplained Convergence Iterations ElapsedTime
## 1 5.951504e-06 0.9999999 TRUE 39 0.01
If we plot the data points and the archetypes we observe that the Archetypal Spanning Convex Hull covers all our points, a fact that can be checked by using the function points_inside_convex_hull():
plot(gaa)
points_inside_convex_hull(df, gaa$grid)
## [1] 100
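Since every point lies inside the spanning convex hull, its row of convex coefficients in \(A\) gives its composition with respect to the eight Grid Archetypes. A quick sanity check of row stochasticity (the slot name `gaa$A` is an assumption; inspect `str(gaa)` for the exact structure of the returned object):

```r
# Assumed slot name for the convex coefficients; verify with str(gaa)
A <- gaa$A
range(rowSums(A))   # every row of A should sum to 1 (row stochastic)
```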
For the same data set as before, we now want to use as archetypes the data points closest to the Grid Archetypes, because we would like our archetypes to belong to the data set.
# Closer Grid Archetypal
cga = closer_grid_archetypal(df, diag_less = 1e-3, niter = 70, verbose = FALSE)
summary(cga)
## Number of observations / data rows n = 100
## Dimension of data variables d = 3
## Number of Closer Grid Archetypes k = 8
## The rows that formed the Closer Grid Archetypes are:
## 52,69,18,87,6,14,39,5
##
## The Closer Grid Archetypes are:
##
## x y z
## 1 0.13352227 0.15353714 0.07159086
## 2 0.75890322 0.09126450 0.01868682
## 3 0.07234262 0.79526378 0.06748076
## 4 0.88561538 0.94945905 0.10888000
## 5 0.31213541 0.12031532 0.87912934
## 6 0.96420042 0.07130253 0.77504836
## 7 0.24115414 0.88179278 0.90609350
## 8 0.89296529 0.95993797 0.94319426
##
## The Run details are:
##
## SSE VarianceExplained Convergence Iterations ElapsedTime
## 1 0.3367711 0.9965373 TRUE 26 0
plot(cga)
What is the percentage of data coverage for the Closer Grid Archetypes?
pc = points_inside_convex_hull(df, cga$aa$BY)
print(pc)
## [1] 59
This is a relatively high percentage compared to the usually very low values of less than 10%, especially if we seek a small number of archetypes.
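Coverage can also be quantified by volume rather than by point counts; a sketch using the geometry package (an assumed extra dependency, used here only for illustration):

```r
# Compare the volume of the Closer Grid Archetypes' hull with the
# volume of the hull of all data points (requires the geometry package)
library(geometry)
vol_data   <- convhulln(as.matrix(df), "FA")$vol
vol_closer <- convhulln(cga$aa$BY, "FA")$vol
vol_closer / vol_data    # fraction of the data hull volume that is covered
```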
If we decide to take a specific set of rows as archetypes, then we can use the function fast_archetypal() to compute the \(A\)-matrix of convex coefficients for the rest of the data points.
It is the generic function used by both grid_archetypal() and closer_grid_archetypal() for their main computations.
Here we can use the rows we have found just before as the rows for the archetypes.
# Fast Archetypal
fa = fast_archetypal(df, irows = cga$grid_rows, diag_less = 1e-6,
                     niter = 100, verbose = FALSE)
summary(fa)
## Number of observations / data rows n = 100
## Dimension of data variables d = 3
## Number of archetypes k = 8
##
## Archetypes:
##
## x y z
## 1 0.13352227 0.15353714 0.07159086
## 2 0.75890322 0.09126450 0.01868682
## 3 0.07234262 0.79526378 0.06748076
## 4 0.88561538 0.94945905 0.10888000
## 5 0.31213541 0.12031532 0.87912934
## 6 0.96420042 0.07130253 0.77504836
## 7 0.24115414 0.88179278 0.90609350
## 8 0.89296529 0.95993797 0.94319426
##
## Run details:
##
## SSE VarianceExplained Convergence Iterations ElapsedTime
## 1 0.3305205 0.9966016 TRUE 57 0
##
## Call:
## fast_archetypal(df = df, irows = cga$grid_rows, diag_less = 1e-06,
## niter = 100, verbose = FALSE)
plot(fa)
As expected, the archetypes are identical to those found before, while the slightly smaller SSE reflects the stricter diag_less tolerance used here.
[1] Cutler, A., & Breiman, L. (1994). Archetypal Analysis. Technometrics, 36(4), 338–347. https://doi.org/10.1080/00401706.1994.10485840
[2] Mørup, M., & Hansen, L. K. (2012). Archetypal analysis for machine learning and data mining. Neurocomputing. https://doi.org/10.1016/j.neucom.2011.06.033
[3] Source: https://mortenmorup.dk/?page_id=2 , last accessed 2024-03-09
[4] Christopoulos, D. T. (2024). Geometrical Archetypal Analysis: One Step Beyond the Current View. http://dx.doi.org/10.13140/RG.2.2.14030.88642