version 1.1.0
A number of interesting packages are available to perform Correspondence Analysis in R. To the best of my knowledge, however, they lack some tools to help users eyeball some critical CA aspects (e.g., contribution of row/column categories to the principal axes, quality of the display, correlation of row/column categories with dimensions, etc.). Besides providing those facilities, this package allows calculating the significance of the CA dimensions by means of the ‘Average Rule’, the Malinvaud test, and a permutation test. Further, it allows calculating the permuted significance of the CA total inertia.
The package comes with some datasets drawn from the literature:
brand_coffee
: after Kennedy R et al, Practical
Applications of Correspondence Analysis to Categorical Data in Market
Research, in Journal of Targeting Measurement and Analysis for
Marketing, 1996
breakfast
: after Bendixen M, A Practical Guide to
the Use of Correspondence Analysis in Marketing Research, in
Research on-line 1, 1996, 16-38 (table 5)
diseases
: after Velleman P F, Hoaglin D C,
Applications, Basics, and Computing of Exploratory Data
Analysis, Wadsworth Pub Co 1984 (Exhibit 8-1)
fire_loss
: after Li et al, Influences of Time,
Location, and Cause Factors on the Probability of Fire Loss in China: A
Correspondence Analysis, in Fire Technology 50(5), 2014, 1181-1200
(table 5)
greenacre_data
: after Greenacre M, Correspondence
Analysis in Practice, Boca Raton-London-New York,
Chapman&Hall/CRC 2007 (exhibit 12.1)
aver.rule()
: average rule chart.
caCluster()
: clustering row/column categories on the
basis of Correspondence Analysis coordinates from a space of
user-defined dimensionality.
caCorr()
: chart of correlation between rows and columns
categories.
caPercept()
: perceptual map-like Correspondence
Analysis scatterplot.
caPlot()
: interpretation-oriented Correspondence
Analysis scatterplots, with informative and flexible (non-overlapping)
labels.
caPlus()
: facility for interpretation-oriented CA
scatterplot.
caScatter()
: basic scatterplot visualization
facility.
cols.cntr()
: columns contribution chart.
cols.cntr.scatter()
: scatterplot for column categories
contribution to dimensions.
cols.qlt()
: chart of columns quality of the
display.
groupBycoord()
: define groups of categories on the
basis of a selected partition into k groups employing the Jenks’ natural
break method on the selected dimension’s coordinates.
malinvaud()
: Malinvaud’s test for significance of the
CA dimensions.
rescale()
: rescale row/column categories coordinates
between a minimum and maximum value.
rows.cntr()
: rows contribution chart.
rows.cntr.scatter()
: scatterplot for row categories
contribution to dimensions.
rows.qlt()
: chart of rows quality of the display.
sig.dim.perm()
: permuted significance of CA
dimensions.
sig.dim.perm.scree()
: scree plot to test the
significance of CA dimensions by means of a randomized procedure.
sig.tot.inertia.perm()
: permuted significance of the CA
total inertia.
table.collapse()
: collapse rows and columns of a table
on the basis of hierarchical clustering.
aver.rule()
: allows you to locate the number of
dimensions which are important for CA interpretation, according to the
so-called average rule. The reference line showing up in the returned
histogram indicates the threshold of an optimal dimensionality of the
solution according to the average rule.
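The logic of the average rule can be sketched in base R, without the package. This is a minimal illustration and not the package's own code: it assumes the threshold is the average percentage of inertia per dimension (100 divided by the number of dimensions), and `tab` is a made-up toy table.

```r
# Toy contingency table (made-up counts, for illustration only)
tab <- matrix(c(30,  5,  4,  6,
                 6, 25,  5,  4,
                 5,  6, 28,  7,
                 8,  7,  6, 22,
                12, 10, 11,  9),
              nrow = 5, byrow = TRUE)

# CA eigenvalues from the SVD of the standardized residuals
P  <- tab / sum(tab)
r  <- rowSums(P)
cm <- colSums(P)
S  <- (P - outer(r, cm)) / sqrt(outer(r, cm))
n_dim <- min(dim(tab)) - 1
eig   <- svd(S)$d[seq_len(n_dim)]^2

# Percentage of inertia per dimension and the (assumed) average-rule threshold
pct       <- 100 * eig / sum(eig)
threshold <- 100 / n_dim
keep      <- which(pct > threshold)  # dimensions above the reference line
```

Calling `barplot(pct)` followed by `abline(h = threshold)` would draw a chart in the spirit of the one described above.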
caCluster()
: performs a hierarchical cluster analysis of row and/or column
categories on the basis of the Correspondence Analysis results, and
plots a dendrogram, a silhouette plot depicting the “quality” of the
clustering solution, and a scatterplot with points coded according to
cluster membership. The clustering is based on the row and/or column
categories’ coordinates from:
(1) the full-dimensional CA space;
(2) a space of lower, user-defined dimensionality;
(3) a sub-space defined by a pair of user-selected dimensions.
To obtain (1), the dim parameter must be left at its default value
(NULL); to obtain (2), the dim parameter must be given an integer
(smaller than the full dimensionality of the input data); to obtain
(3), the dim parameter must be given a vector (e.g., c(1,3))
specifying the dimensions the user is interested in.
The method by which the distance is calculated is specified using the
dist.meth parameter, while the agglomerative method is specified using
the aggl.meth parameter. By default, they are set to euclidean and
ward.D2 respectively.
The user may want to specify beforehand the desired number of clusters (i.e., the cluster solution). This is accomplished by feeding an integer into the part parameter. A dendrogram (with rectangles indicating the clustering solution), a silhouette plot (indicating the “quality” of the cluster solution), and a CA scatterplot (with points given colours on the basis of their cluster membership) are returned. Please note that, when a high-dimensional space is selected, the scatterplot will use the first 2 CA dimensions; the user must keep in mind that a clustering based on a higher-dimensional space may not be well reflected in the sub-space defined by the first two dimensions only.
Also note:
if both row and column categories are subject to the clustering, the column categories will be flagged by an asterisk (*) in the dendrogram (and in the silhouette plot) just to make it easier to identify rows and columns;
the silhouette plot displays the average silhouette width as a dashed vertical line; the dimensionality of the CA space used is reported in the plot’s title; if a pair of dimensions has been used, the individual dimensions are reported in the plot’s title;
the silhouette plot’s labels end with a number indicating the cluster to which each category is closer.
An optimal clustering solution can be obtained by setting the opt.part
parameter to TRUE. The optimal partition is selected by means of an
iterative routine which locates the cluster solution at which the
highest average silhouette width is achieved. If the opt.part
parameter is set to TRUE, an additional plot is returned along with the
silhouette plot. It displays a scatterplot in which the cluster
solution (x-axis) is plotted against the average silhouette width
(y-axis). A vertical reference line indicates the cluster solution
which maximizes the silhouette width, corresponding to the suggested
optimal partition.
The function returns a list storing information about the cluster membership (i.e., which categories belong to which cluster).
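The core idea (clustering categories on their CA coordinates) can be sketched in base R. This is only an illustration under stated assumptions, not the internals of caCluster(): the toy table, the choice of row principal coordinates from the full-dimensional space, and the two-cluster cut are all made up for the example.

```r
# Toy contingency table (made-up counts, for illustration only)
tab <- matrix(c(30,  5,  4,  6,
                 6, 25,  5,  4,
                 5,  6, 28,  7,
                 8,  7,  6, 22,
                12, 10, 11,  9),
              nrow = 5, byrow = TRUE,
              dimnames = list(paste0("row", 1:5), paste0("col", 1:4)))

# Row principal coordinates from the SVD of the standardized residuals
P   <- tab / sum(tab)
r   <- rowSums(P)
cm  <- colSums(P)
S   <- (P - outer(r, cm)) / sqrt(outer(r, cm))
sv  <- svd(S)
idx <- seq_len(min(dim(tab)) - 1)          # full dimensionality
F_rows <- sweep(sv$u[, idx, drop = FALSE], 2, sv$d[idx], "*") / sqrt(r)
rownames(F_rows) <- rownames(tab)

# Hierarchical clustering with the defaults named in the text
hc <- hclust(dist(F_rows, method = "euclidean"), method = "ward.D2")
membership <- cutree(hc, k = 2)            # hypothetical 2-cluster solution
split(names(membership), membership)       # which categories fall in which cluster
```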
Further info and Disclaimer about the caCluster() function:
The silhouette plot is obtained from the silhouette() function of the
cluster package. For a detailed description of the silhouette plot, its
rationale, and its interpretation, see:
For the idea of clustering categories on the basis of the CA coordinates from a full high-dimensional space (or from a subset thereof), see:
Please note that the interpretation of the clustering when both row AND column categories are used must proceed with caution due to the issue of inter-class points’ distance interpretation. For a full description of the issue (also with further references), see:
caCorr()
: allows you to calculate the strength of the
correlation between rows and columns of the contingency table. A
reference line indicates the threshold above which the correlation can
be considered important.
caPercept()
: plots a variant of the traditional
Correspondence Analysis scatterplots that facilitates the
interpretation of the results. It aims at producing what in marketing
research is called perceptual map, a visual representation of
the CA results that seeks to avoid the problem of interpreting
inter-spatial distance. It represents only one type of points (say,
column points), and “gives names to the axes” corresponding to the major
row category contributors to the two selected dimensions.
caPlot()
: plots different types of CA scatterplots,
adding information that is relevant to the CA interpretation. Thanks to
the ggrepel package, the labels tend not to overlap, so
producing a nicely readable chart. The function provides the facility to
produce:
The function returns a dataframe containing data about row and column points:
caPlus()
: plots Correspondence Analysis scatterplots
modified to help interpret the analysis’ results. In particular, the
function aims at making it easier to understand in the same visual context:
(a) which (say, column) categories are actually contributing to the
definition of given pairs of dimensions;
(b) which (say, row) categories are more correlated to which dimension.
caScatter()
: produces different types of CA
scatterplots. It is just a wrapper for functions from the
ca and FactoMineR packages.
cols.cntr()
: column equivalent of rows.cntr() (see below).
cols.cntr.scatter()
: column equivalent of rows.cntr.scatter() (see below).
cols.corr()
: column equivalent of rows.corr() (see below).
cols.corr.scatter()
: column equivalent of rows.corr.scatter() (see below).
cols.qlt()
: column equivalent of rows.qlt() (see below).
groupBycoord()
: groups the row/column
categories into k user-defined partitions. The k groups are created
by applying the Jenks’ natural break method to the selected
dimension’s coordinates. A dotchart is returned representing the
categories grouped into the selected partitions. At the bottom of the
chart, the Goodness of Fit statistic is also reported. The function also
returns a dataframe storing the categories’ coordinates on the selected
dimension and the group each category belongs to.
malinvaud()
: performs the Malinvaud test, which
assesses the significance of the CA dimensions. The function returns
both a table and a plot. The former lists relevant information, among
which the significance of each CA dimension. The latter, a dotchart,
graphically represents the p-value of each dimension; dimensions are
grouped by level of significance; a red reference line indicates the
0.05 threshold.
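The Malinvaud statistic can be sketched in base R as follows. This is a hedged illustration on a made-up table, using the textbook form of the test (n times the sum of the eigenvalues left out after retaining k dimensions, referred to a chi-square distribution with (I-k-1)(J-k-1) degrees of freedom); the table layout returned by malinvaud() may differ.

```r
# Toy contingency table (made-up counts, for illustration only)
tab <- matrix(c(30,  5,  4,  6,
                 6, 25,  5,  4,
                 5,  6, 28,  7,
                 8,  7,  6, 22,
                12, 10, 11,  9),
              nrow = 5, byrow = TRUE)

# CA eigenvalues from the SVD of the standardized residuals
n  <- sum(tab)
P  <- tab / n
r  <- rowSums(P)
cm <- colSums(P)
S  <- (P - outer(r, cm)) / sqrt(outer(r, cm))
n_dim <- min(dim(tab)) - 1
eig   <- svd(S)$d[seq_len(n_dim)]^2

# For k = 0, 1, ..., n_dim - 1 retained dimensions:
# statistic = n * (sum of the eigenvalues beyond the k-th)
k       <- 0:(n_dim - 1)
stat    <- n * rev(cumsum(rev(eig)))
df      <- (nrow(tab) - k - 1) * (ncol(tab) - k - 1)
p_value <- pchisq(stat, df, lower.tail = FALSE)

data.frame(k = k, statistic = stat, df = df, p_value = p_value)
```

A small p-value in the row for k suggests that the inertia left after retaining k dimensions is still significant, i.e. that at least one more dimension is worth considering. For k = 0 the statistic coincides with the ordinary chi-square statistic of the table.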
rescale()
: rescales the coordinates of a
selected dimension so that they are constrained between user-defined
minimum and maximum values. The rationale of the function is that users may wish
to use the coordinates on a given dimension to devise a scale, along the
lines of what is accomplished in: Greenacre M 2002, The Use of
Correspondence Analysis in the Exploration of Health Survey Data,
Documentos de Trabajo 5, Fundacion BBVA, pp. 7-39. The function returns
a chart representing the row/column categories against the rescaled
coordinates from the selected dimension. A dataframe is also returned
containing the original values (i.e., the coordinates) and the
corresponding rescaled values.
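The rescaling itself is a plain linear (min-max) transformation. The helper below is a hypothetical stand-in written for illustration, not the package's rescale() function, and the coordinates are made up.

```r
# Hypothetical helper: map coordinates linearly onto [min_v, max_v]
rescale_coords <- function(x, min_v = 0, max_v = 100) {
  (x - min(x)) / (max(x) - min(x)) * (max_v - min_v) + min_v
}

# Made-up coordinates of four categories on one CA dimension
coords <- c(a = -0.8, b = -0.1, c = 0.3, d = 1.2)

data.frame(original = coords,
           rescaled = rescale_coords(coords, min_v = 0, max_v = 100))
```

The ordering of the categories and the relative spacing between them are preserved; only the units change.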
rows.cntr()
: calculates the contribution of the row
categories to a selected dimension. It displays the contribution of the
categories as a dotplot. A reference line indicates the threshold above
which a contribution can be considered important for the determination
of the selected dimension. The parameter sort=TRUE sorts
the categories in descending order of contribution to the inertia of the
selected dimension. At the left-hand side of the plot, the categories’
labels are given a symbol (+ or -) according to whether each category
is actually contributing to the definition of the positive or negative
side of the dimension, respectively. The categories are grouped into two
groups: ‘major’ and ‘minor’ contributors to the inertia of the selected
dimension. At the right-hand side, a legend (which is enabled/disabled
using the leg parameter) reports the correlation
(sqrt(COS2)) of the column categories with the selected dimension. A
symbol (+ or -) indicates with which side of the selected dimension each
column category is correlated.
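Row contributions can be sketched in base R as below. This is an illustration on a made-up table, not the code of rows.cntr(); in particular, taking the average contribution (100 divided by the number of rows, in percent) as the threshold is an assumption of this sketch.

```r
# Toy contingency table (made-up counts, for illustration only)
tab <- matrix(c(30,  5,  4,  6,
                 6, 25,  5,  4,
                 5,  6, 28,  7,
                 8,  7,  6, 22,
                12, 10, 11,  9),
              nrow = 5, byrow = TRUE,
              dimnames = list(paste0("row", 1:5), paste0("col", 1:4)))

P   <- tab / sum(tab)
r   <- rowSums(P)
cm  <- colSums(P)
S   <- (P - outer(r, cm)) / sqrt(outer(r, cm))
sv  <- svd(S)
idx <- seq_len(min(dim(tab)) - 1)

# Contribution of row i to dimension k: mass * coordinate^2 / eigenvalue,
# which equals the squared left singular vector entry (here in percent)
ctr <- 100 * sv$u[, idx, drop = FALSE]^2
rownames(ctr) <- rownames(tab)

# Dimension 1: sorted contributions, assumed threshold, and pole of each row
threshold <- 100 / nrow(tab)
dim1      <- sort(ctr[, 1], decreasing = TRUE)
major     <- names(dim1)[dim1 > threshold]
pole      <- ifelse(sv$u[, 1] > 0, "+", "-")  # overall sign of an SVD axis is arbitrary
names(pole) <- rownames(tab)
```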
rows.cntr.scatter()
: plots a scatterplot of the
contribution of row categories to two selected dimensions. Two
reference lines (in RED) indicate the threshold above which the
contribution can be considered important for the determination of the
dimensions. A diagonal line (in BLACK) is a visual aid to eyeball
whether a category is actually contributing more (in relative terms) to
either of the two dimensions. The row categories’ labels are coupled
with + or - symbols within round brackets indicating to which side of
the two selected dimensions the contribution values that can be read off
from the chart are actually referring. The first symbol (i.e., the one
to the left), either + or -, refers to the first of the selected
dimensions (i.e., the one reported on the x-axis). The second symbol
(i.e., the one to the right) refers to the second of the selected
dimensions (i.e., the one reported on the y-axis).
rows.corr()
: calculates and graphically displays the
correlation (sqrt(COS2)) of the row categories with the selected
dimension. The parameter sort=TRUE arranges the categories
in decreasing order of correlation. In the returned chart, at the
left-hand side, the categories’ labels show a symbol (+ or -) according
to which side of the selected dimension they are correlated, either
positive or negative. The categories are grouped into two groups:
categories correlated with the positive (‘pole +’) or negative (‘pole
-’) pole of the selected dimension. At the right-hand side, a legend
indicates the column categories’ contribution (in permils) to the
selected dimension (value enclosed within round brackets), and a symbol
(+ or -) indicating whether they are actually contributing to the
definition of the positive or negative side of the dimension,
respectively. Further, an asterisk (*) flags the categories which can be
considered major contributors to the definition of the dimension.
rows.corr.scatter()
: plots a scatterplot of the
correlation (sqrt(COS2)) of row categories with two selected dimensions.
A diagonal line (in BLACK) is a visual aid to eyeball whether a category
is actually more correlated (in relative terms) to either of the two
dimensions. The row categories’ labels are coupled with two + or -
symbols within round brackets indicating to which side of the two
selected dimensions the correlation values that can be read off from the
chart are actually referring. The first symbol (i.e., the one to the
left), either + or -, refers to the first of the selected dimensions
(i.e., the one reported on the x-axis). The second symbol (i.e., the one
to the right) refers to the second of the selected dimensions (i.e., the
one reported on the y-axis).
rows.qlt()
: plots the quality of row categories display
on the sub-space determined by a pair of selected dimensions.
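The correlation (sqrt(COS2)) and quality measures used by the rows.corr() and rows.qlt() family can be sketched in base R. This is an illustrative computation on a made-up table, not the package's code; the choice of dimensions 1 and 2 is arbitrary.

```r
# Toy contingency table (made-up counts, for illustration only)
tab <- matrix(c(30,  5,  4,  6,
                 6, 25,  5,  4,
                 5,  6, 28,  7,
                 8,  7,  6, 22,
                12, 10, 11,  9),
              nrow = 5, byrow = TRUE,
              dimnames = list(paste0("row", 1:5), paste0("col", 1:4)))

# Row principal coordinates on all dimensions
P   <- tab / sum(tab)
r   <- rowSums(P)
cm  <- colSums(P)
S   <- (P - outer(r, cm)) / sqrt(outer(r, cm))
sv  <- svd(S)
idx <- seq_len(min(dim(tab)) - 1)
F_rows <- sweep(sv$u[, idx, drop = FALSE], 2, sv$d[idx], "*") / sqrt(r)
rownames(F_rows) <- rownames(tab)

# COS2: share of each row's squared distance from the centroid
# that is accounted for by each dimension
cos2 <- F_rows^2 / rowSums(F_rows^2)
corr <- sqrt(cos2)                      # correlation with each dimension

# Quality of the display on the plane of dimensions 1 and 2
quality_12 <- rowSums(cos2[, c(1, 2)])
```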
sig.dim.perm()
: calculates the significance of a pair of
selected dimensions via a permutation test, and displays the results as
a scatterplot; a large RED dot indicates the observed inertia. Permuted
p-values are reported in the axes’ labels.
sig.dim.perm.scree()
: tests the significance of the CA
dimensions by means of permutation of the input contingency table. A
scree-plot displays for each dimension the observed eigenvalue and the
95th percentile of the permuted distribution of the corresponding
eigenvalue. Observed eigenvalues that are larger than the corresponding
95th percentile are significant at least at alpha 0.05. P-values are
displayed in the chart.
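A randomized test of the eigenvalues can be sketched in base R as follows. This is a sketch under assumptions, not the routine used by sig.dim.perm.scree(): random tables are drawn with `r2dtable()`, which keeps the observed margins fixed, as a stand-in for permuting the data; the table is made up, B is kept small, and the p-value formula is one common convention.

```r
set.seed(1)

# Toy contingency table (made-up counts, for illustration only)
tab <- matrix(c(30,  5,  4,  6,
                 6, 25,  5,  4,
                 5,  6, 28,  7,
                 8,  7,  6, 22,
                12, 10, 11,  9),
              nrow = 5, byrow = TRUE)

# Hypothetical helper: CA eigenvalues of a two-way table
ca_eig <- function(tab) {
  P  <- tab / sum(tab)
  r  <- rowSums(P)
  cm <- colSums(P)
  S  <- (P - outer(r, cm)) / sqrt(outer(r, cm))
  svd(S)$d[seq_len(min(dim(tab)) - 1)]^2
}

obs <- ca_eig(tab)
B   <- 499

# One column of eigenvalues per random table with the observed margins
perm <- sapply(r2dtable(B, rowSums(tab), colSums(tab)), ca_eig)

q95     <- apply(perm, 1, quantile, probs = 0.95)   # per-dimension threshold
p_value <- (1 + rowSums(perm >= obs)) / (B + 1)     # per-dimension p-value

data.frame(dimension = seq_along(obs), observed = obs,
           perm_95th = q95, p_value = p_value)
```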
sig.tot.inertia.perm()
: calculates the significance of
the CA total inertia via permutation test; a histogram of the permuted
total inertia is displayed along with the observed total inertia and the
95th percentile of the permuted total inertia. The latter can be
regarded as a 0.05 alpha threshold for the observed total inertia’s
significance.
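The permutation logic for the total inertia can be sketched in base R. This is an illustration on a made-up table and not the package's code: the table is expanded to one record per observation, the column variable is shuffled B times, and the total inertia (chi-square divided by n) is recomputed each time. The p-value formula is one common convention.

```r
set.seed(1)

# Toy contingency table (made-up counts, for illustration only)
tab <- matrix(c(30,  5,  4,  6,
                 6, 25,  5,  4,
                 5,  6, 28,  7,
                 8,  7,  6, 22,
                12, 10, 11,  9),
              nrow = 5, byrow = TRUE)

# Hypothetical helper: total inertia = chi-square statistic / n
total_inertia <- function(tab) {
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  sum((tab - expected)^2 / expected) / sum(tab)
}

# Expand the table to one (row, column) pair per observation
cases   <- as.data.frame(as.table(tab))
row_var <- rep(cases$Var1, cases$Freq)
col_var <- rep(cases$Var2, cases$Freq)

obs  <- total_inertia(tab)
B    <- 499
perm <- replicate(B, total_inertia(table(row_var, sample(col_var))))

q95     <- quantile(perm, 0.95)               # 0.05 alpha threshold
p_value <- (1 + sum(perm >= obs)) / (B + 1)
```

`hist(perm)` with `abline(v = c(obs, q95))` would give a display along the lines of the one described above.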
table.collapse()
: collapses the rows and
columns of the input contingency table on the basis of the results of a
hierarchical clustering. The function returns a list containing the
input table, the rows-collapsed table, the columns-collapsed table, and
a table with both rows and columns collapsed. It optionally returns two
dendrograms (one for the row profiles, one for the column profiles)
representing the clusters. The hierarchical clustering is obtained using
FactoMineR’s HCPC() function.
Rationale: clustering rows and/or columns of a table could
interest the users who want to know where a significant association
is concentrated by collecting together similar rows (or
columns) in discrete groups (Greenacre M, Correspondence
Analysis in Practice, Boca Raton-London-New York,
Chapman&Hall/CRC 2007, pp. 116, 120). Rows and/or columns are
progressively aggregated in a way in which every successive merging
produces the smallest change in the table’s inertia. The underlying
logic lies in the fact that rows (or columns) whose merging produces a
small change in table’s inertia have similar profiles. This procedure
can be thought of as maximizing the between-group inertia and minimizing
the within-group inertia. An essentially similar method is that provided
by the FactoMineR package (Husson F, Le S, Pages J,
Exploratory Multivariate Analysis by Example Using R, Boca
Raton-London-New York, CRC Press, pp. 177-185). The cluster solution is
based on the following rationale: a division into Q (i.e., a given
number of) clusters is suggested when the increase in between-group
inertia attained when passing from a Q-1 to a Q partition is greater
than that from a Q to a Q+1 partition. In other words, during
the process of merging rows (or columns), if the next aggregation
sharply raises the within-group inertia, it means that very different
profiles are being aggregated at that step.
## History
version 1.1.0
: minor changes to optimize the calculation of permuted p-values
returned by the functions sig.dim.perm(), sig.dim.perm.scree(), and
sig.tot.inertia.perm(); sig.dim.perm.scree() and sig.dim.perm() now
return permuted p-values in a dataframe (besides reporting them in the
output plots); minor improvements and typo fixes to the package’s help
documentation.
version 1.0.0
: first release to CRAN.