In traditional Archetypal Analysis (AA) [1] we have a data frame \(df\), represented as a matrix \(Y\) of dimension \(n \times d\), where \(n\) is the number of observations (rows) and \(d\) is the number of variables (columns), i.e. the dimension of the relevant linear space \(R^d\).
The output of our AA algorithm, implemented in the R package “archetypal”, consists of the matrices \(A\), \(B\) such that: \[ Y\sim\,A\,B\,Y \] More strictly, the \(kappas\times\,d\) matrix of archetypes \(B\,Y\) is such that the next squared Frobenius norm is minimal: \[ SSE=\|Y-A\,B\,Y \|^2 = minimum \] Here \(A\) (\(n\times\,kappas\)) is a row stochastic matrix and \(B\) (\(kappas\times\,n\)) is a column stochastic one.
We also define the Variance Explained as follows: \[ varexpl=\frac{SST-SSE}{SST} \] with SST being the total sum of squares for all elements of matrix \(Y\).
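Under the definitions above, both quantities are easy to compute by hand; a minimal sketch, assuming matrices `Y`, `A` and `B` are already available in the workspace:

```r
# Sketch: SSE and Variance Explained for given Y, A, B
# (assumes Y is n x d, A is n x kappas, B is kappas x n)
Yhat    <- A %*% B %*% Y        # approximation by mixtures of archetypes
SSE     <- sum((Y - Yhat)^2)    # squared Frobenius norm
SST     <- sum(Y^2)             # total sum of squares of all elements of Y
varexpl <- (SST - SSE) / SST
```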
It is a suitable modification of the \(PCHA\) algorithm, see [2], [3], which now uses the data frames without transposing them and gives full control over all of its external and internal parameters.
But there are many disadvantages along this route:
But what does the concept of archetypes really stand for?
We know that archetypes lie on the boundary of the convex hull that covers all our data points.
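In two dimensions this is easy to visualize with base R; a small sketch for illustration only (`chull()` from grDevices is not part of the package workflow):

```r
# Illustration: the convex hull boundary of a 2D point cloud
set.seed(1)
xy  <- matrix(runif(60), ncol = 2)
idx <- chull(xy)            # indices of the hull vertices, in order
plot(xy, pch = 19)
polygon(xy[idx, ])          # candidate archetypes lie on this boundary
```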
Now the next questions arise:
The solution to all the above problems is Geometrical Archetypal Analysis (GAA), see also the pdf version [4].
Given the data frame \(df\) and its corresponding matrix \(Y\) with rows in \(R^d\), we follow the next steps:
Since we are sure that our archetypes cover 100% of the data points, we do not care so much about their number; what actually matters is the composition of each point.
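A natural choice of archetypes that is guaranteed to cover all points is the set of \(2^d\) corners of the data's bounding box; a minimal sketch of constructing such corners by hand (`grid_corners` is a hypothetical helper, not a package function):

```r
# Hypothetical helper: the 2^d corners of the bounding box of the data,
# i.e. all combinations of per-column minima and maxima.
grid_corners <- function(Y) {
  ranges <- lapply(as.data.frame(Y), range)  # min and max of each column
  as.matrix(expand.grid(ranges))             # 2^d corner points
}
```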
We create a set of random 3D data points and then we compute the Grid Archetypal Analysis.
library(GeomArchetypal)
# Create random data
set.seed(20140519)
df = matrix(runif(300), nrow = 100, ncol = 3)
colnames(df) = c("x","y","z")
# Grid Archetypal
gaa = grid_archetypal(df, diag_less = 1e-6, niter = 50, verbose = FALSE)
summary(gaa)
## Number of observations / data rows n = 108
## (The first 8 rows are the Grid Archetypes)
## Dimension of data variables d = 3
## Number of Grid Archetypes k = 8
##
## Grid Archetypes:
##
## x y z
## 1 0.04107952 0.00131741 0.01868682
## 2 0.98128784 0.00131741 0.01868682
## 3 0.04107952 0.99690733 0.01868682
## 4 0.98128784 0.99690733 0.01868682
## 5 0.04107952 0.00131741 0.99566805
## 6 0.98128784 0.00131741 0.99566805
## 7 0.04107952 0.99690733 0.99566805
## 8 0.98128784 0.99690733 0.99566805
##
## Run details:
##
## SSE VarianceExplained Convergence Iterations ElapsedTime
## 1 5.951504e-06 0.9999999 TRUE 39 0.01
If we plot the data points and the archetypes we observe that the Archetypal Spanning Convex Hull covers all our points, a fact that can be checked by using the function points_inside_convex_hull():
plot(gaa)
points_inside_convex_hull(df, gaa$grid)
## [1] 100
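Since every point lies inside the spanning convex hull, its row of convex coefficients in \(A\) gives its composition with respect to the eight Grid Archetypes. A quick sanity check of row stochasticity (the slot name `gaa$A` is an assumption; inspect `str(gaa)` for the exact structure of the returned object):

```r
# Assumed slot name for the convex coefficients; verify with str(gaa)
A <- gaa$A
range(rowSums(A))   # every row of A should sum to 1 (row stochastic)
```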
For the same data set as before, we now want to use as archetypes the data points closest to the Grid Archetypes, because we would like our archetypes to belong to the data set.
# Closer Grid Archetypal
cga = closer_grid_archetypal(df, diag_less = 1e-3, niter = 70, verbose = FALSE)
summary(cga)
## Number of observations / data rows n = 100
## Dimension of data variables d = 3
## Number of Closer Grid Archetypes k = 8
## The rows that formed the Closer Grid Archetypes are:
## 52,69,18,87,6,14,39,5
##
## The Closer Grid Archetypes are:
##
## x y z
## 1 0.13352227 0.15353714 0.07159086
## 2 0.75890322 0.09126450 0.01868682
## 3 0.07234262 0.79526378 0.06748076
## 4 0.88561538 0.94945905 0.10888000
## 5 0.31213541 0.12031532 0.87912934
## 6 0.96420042 0.07130253 0.77504836
## 7 0.24115414 0.88179278 0.90609350
## 8 0.89296529 0.95993797 0.94319426
##
## The Run details are:
##
## SSE VarianceExplained Convergence Iterations ElapsedTime
## 1 0.3367711 0.9965373 TRUE 26 0
plot(cga)
What is the percentage of data coverage for the Closer Grid Archetypes?
pc = points_inside_convex_hull(df, cga$aa$BY)
print(pc)
## [1] 59
This is a relatively high percentage compared to the usually very low values of less than 10%, especially if we seek a small number of archetypes.
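Coverage can also be quantified by volume rather than by point counts; a sketch using the geometry package (an assumed extra dependency, used here only for illustration):

```r
# Compare the volume of the Closer Grid Archetypes' hull with the
# volume of the hull of all data points (requires the geometry package)
library(geometry)
vol_data   <- convhulln(as.matrix(df), "FA")$vol
vol_closer <- convhulln(cga$aa$BY, "FA")$vol
vol_closer / vol_data    # fraction of the data hull volume that is covered
```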
If we decide to take a specific set of rows as archetypes, then we can use the function fast_archetypal() to compute the \(A\)-matrix of convex coefficients for the rest of the data points.
It is the generic function used by both grid_archetypal() and closer_grid_archetypal() for their main computations.
Here we can use the rows we have found just before as the rows for the archetypes.
# Fast Archetypal
fa = fast_archetypal(df, irows = cga$grid_rows, diag_less = 1e-6,
                     niter = 100, verbose = FALSE)
summary(fa)
## Number of observations / data rows n = 100
## Dimension of data variables d = 3
## Number of archetypes k = 8
##
## Archetypes:
##
## x y z
## 1 0.13352227 0.15353714 0.07159086
## 2 0.75890322 0.09126450 0.01868682
## 3 0.07234262 0.79526378 0.06748076
## 4 0.88561538 0.94945905 0.10888000
## 5 0.31213541 0.12031532 0.87912934
## 6 0.96420042 0.07130253 0.77504836
## 7 0.24115414 0.88179278 0.90609350
## 8 0.89296529 0.95993797 0.94319426
##
## Run details:
##
## SSE VarianceExplained Convergence Iterations ElapsedTime
## 1 0.3305205 0.9966016 TRUE 57 0
##
## Call:
## fast_archetypal(df = df, irows = cga$grid_rows, diag_less = 1e-06,
## niter = 100, verbose = FALSE)
plot(fa)
As expected, the archetypes are identical to those found before, while the slightly smaller SSE reflects the stricter diag_less tolerance used here.
[1] Cutler, A., & Breiman, L. (1994). Archetypal Analysis. Technometrics, 36(4), 338–347. https://doi.org/10.1080/00401706.1994.10485840
[2] Mørup, M., & Hansen, L. K. (2012). Archetypal analysis for machine learning and data mining. Neurocomputing. https://doi.org/10.1016/j.neucom.2011.06.033
[3] Source: https://mortenmorup.dk/?page_id=2 , last accessed 2024-03-09
[4] Christopoulos, D. T. (2024). Geometrical Archetypal Analysis: One Step Beyond the Current View. http://dx.doi.org/10.13140/RG.2.2.14030.88642