An R package for comparing two classifications or clustering solutions that have different structures - i.e. the two classifications have a different number of classes, or one classification has soft membership and one classification has hard membership. You can create a confusion matrix (error matrix) and then calculate various metrics to assess how the clusters compare to each other. The calculations are simple, but provide a handy tool for users unfamiliar with matrix multiplication. The helper functions also help you to do things like make a soft classification into a hard one, or turn a set of class labels into a binary classification matrix.
The basic premise is that you already have two (or more perhaps) classifications that you would like compare - these could be from a clustering algorithm, extracted from a remote sensing map, a set of classes assigned manually etc. There already exist a number of tools and packages to calculate cluster diagnostics or accuracy metrics, but they are usually focused on comparing clustering solutions that are hard (i.e. each observation has only one class) and have the same number of classes (e.g. clustering solution vs. the ‘truth’). c2c is designed to allow you to compare classifications that to not fit into this scenario. The motivating problem was the need to compare a probabilistic clustering of vegetation data to an existing hard classification (which had a hierarchy with of numbers of classes) of that data, without losing the probabilistic component that the clustering algorithm produces.
This example is on silly fake data, but it’s quick and will run
without any additional data or package loads. Check out the vignette for
something a little more sensible.
c2c
vignette.
First install and load c2c
install.packages("c2c")
library(c2c)Make a silly made up soft classification matrix
my_soft_mat <- matrix(runif(50,0,1), nrow = 10, ncol = 5)and a made up set of class labels, with matching number of observations
my_labels <- rep(c("a","b","c"), length.out = 10)The two main functions are get_conf_mat and
calculate_clustering_metrics. First generate the confusion
matrix
conf_mat <- get_conf_mat(my_soft_mat, my_labels)
conf_matthen calculate the metrics - see
?calculate_cluster_metrics for details
calculate_clustering_metrics(conf_mat)You could also just pass any confusion matrix (that you have already
generated elsewhere). Another thing you can do within
get_conf_mat is turn a soft matrix into a hard one.
You can install directly from CRAN as above
install.packages("c2c")or if you want to get the development version, which might have some
new functionality, you can install from GitHub. It’s very easy, simply
use Hadley Wickham’s (excellent) devtools package
install.packages("devtools")then call
library(devtools)
devtools::install_github("mitchest/c2c")There are some probably. If you find them, please let me know about them - either directly on github, or the contact details below.
Lyons, Foster and Keith (2017). Simultaneous vegetation classification and mapping at large spatial scales. Journal of Biogeography.
Foster, Hill and Lyons (2017) “Ecological Grouping of Survey Sites when Sampling Artefacts are Present”. Journal of the Royal Statistical Society: Series C (Applied Statistics). DOI: http://dx.doi.org/10.1111/rssc.12211