The QuadratiK package provides the first implementation, in R and Python, of a comprehensive set of goodness-of-fit tests and a clustering technique for spherical data using kernel-based quadratic distances. The primary goal of QuadratiK is to offer flexible tools for testing multivariate and high-dimensional data for uniformity and normality, and for comparing two or more samples.
This package includes several novel algorithms that are designed to handle spherical data, which is often encountered in fields like directional statistics, geospatial data analysis, and signal processing. In particular, it offers functions for clustering spherical data efficiently, for computing the density value and for generating random samples from a Poisson kernel-based density.
You can install the version of QuadratiK published on CRAN:
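For example, in an R session:
install.packages("QuadratiK")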
Or install the development version from GitHub:
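A minimal sketch using devtools (the repository path is taken from the citation links below; remotes::install_github() would work equally well):
# install.packages("devtools")
devtools::install_github("giovsaraceno/QuadratiK-package")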
The QuadratiK package is also available in Python on PyPI and as a Dashboard application. Usage instructions for the Dashboard can be found at https://quadratik.readthedocs.io/en/latest/user_guide/dashboard_application_usage.html.
If you use this package in your research or work, please cite it as follows:
Saraceno G, Markatou M, Mukhopadhyay R, Golzy M (2024). QuadratiK: Collection of Methods Constructed using Kernel-Based Quadratic Distances. https://cran.r-project.org/package=QuadratiK
@Manual{saraceno2024QuadratiK,
title = {QuadratiK: Collection of Methods Constructed using Kernel-Based
Quadratic Distances},
author = {Giovanni Saraceno and Marianthi Markatou and Raktim Mukhopadhyay
and Mojgan Golzy},
year = {2024},
note = {https://cran.r-project.org/package=QuadratiK,
https://github.com/giovsaraceno/QuadratiK-package,
https://giovsaraceno.github.io/QuadratiK-package/}
}
and the associated paper:
Saraceno G, Markatou M, Mukhopadhyay R, Golzy M (2024). Goodness-of-Fit and Clustering of Spherical Data: the QuadratiK package in R and Python. arXiv preprint arXiv:2402.02290.
@misc{saraceno2024package,
title={Goodness-of-Fit and Clustering of Spherical Data: the QuadratiK package in
R and Python},
author={Giovanni Saraceno and Marianthi Markatou and Raktim Mukhopadhyay and
Mojgan Golzy},
year={2024},
eprint={2402.02290},
archivePrefix={arXiv},
primaryClass={stat.CO},
url={https://arxiv.org/abs/2402.02290}
}
The software implements one-, two-, and k-sample tests for goodness of fit, offering an efficient and mathematically sound way to assess the fit of probability distributions. Our tests are particularly useful for large, high-dimensional data sets where the assessment of fit of probability models is of interest.
The provided goodness-of-fit tests can be performed using the kb.test() function. The kernel-based quadratic distance tests are constructed using the normal kernel, which depends on the tuning parameter \(h\). If a value for \(h\) is not provided, the function performs the select_h() algorithm, searching for an optimal value. For more details please visit the corresponding help documentation.
The proposed tests perform well in terms of level and power for contiguous alternatives, heavy-tailed distributions, and in higher dimensions.
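As a minimal sketch of the automatic tuning-parameter selection (not run here, since the search over candidate values of \(h\) can take some time):
x <- matrix(rnorm(100), ncol = 2)
# No h is supplied, so kb.test() selects an optimal value via select_h()
kb.test(x)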
Test for normality
To test the null hypothesis of normality \(H_0:F=\mathcal{N}_d(\mu, \Sigma)\):
x <- matrix(rnorm(100), ncol = 2)
# Does x come from a multivariate standard normal distribution?
kb.test(x, h=0.4)
##
## Kernel-based quadratic distance Normality test
## U-statistic V-statistic
## ------------------------------------------------
## Test Statistic: -0.9832709 0.668447
## Critical Value: 1.717843 8.901682
## H0 is rejected: FALSE FALSE
## Selected tuning parameter h: 0.4
If needed, we can specify \(\mu\) and \(\Sigma\); otherwise, the standard normal distribution is considered.
x <- matrix(rnorm(100,4), ncol = 2)
# Does x come from the specified multivariate normal distribution?
kb.test(x, mu_hat = c(4,4), Sigma_hat = diag(2), h = 0.4)
##
## Kernel-based quadratic distance Normality test
## U-statistic V-statistic
## ------------------------------------------------
## Test Statistic: -0.7180988 0.7387403
## Critical Value: 1.768274 8.901682
## H0 is rejected: FALSE FALSE
## Selected tuning parameter h: 0.4
Two-sample test
To compare two samples \(X \sim F\) and \(Y \sim G\), we test the null hypothesis \(H_0:F=G\) vs \(H_1:F\not =G\):
x <- matrix(rnorm(100), ncol = 2)
y <- matrix(rnorm(100,mean = 5), ncol = 2)
# Do x and y come from the same distribution?
kb.test(x, y, h = 0.4)
##
## Kernel-based quadratic distance two-sample test
## U-statistic Dn Trace
## ------------------------------------------------
## Test Statistic: 5.771118 11.68892
## Critical Value: 0.6535748 1.325249
## H0 is rejected: TRUE TRUE
## CV method: subsampling
## Selected tuning parameter h: 0.4
k-sample test
To compare \(k\) samples, with \(k>2\), we test \(H_0:F_1=F_2=\ldots=F_k\) vs \(H_1:F_i\not =F_j\) for some \(i\not = j\):
x1 <- matrix(rnorm(100), ncol = 2)
x2 <- matrix(rnorm(100), ncol = 2)
x3 <- matrix(rnorm(100, mean = 5), ncol = 2)
y <- rep(c(1, 2, 3), each = 50)
# Do x1, x2 and x3 come from the same distribution?
x <- rbind(x1, x2, x3)
kb.test(x, y, h = 0.4)
##
## Kernel-based quadratic distance k-sample test
## U-statistic Dn Trace
## ------------------------------------------------
## Test Statistic: 7.568824 10.9979
## Critical Value: 0.8262628 1.201499
## H0 is rejected: TRUE TRUE
## CV method: subsampling
## Selected tuning parameter h: 0.4
Expanded capabilities include tests for uniformity on the \(d\)-dimensional sphere based on the Poisson kernel. The Poisson kernel depends on the concentration parameter \(\rho\) and a location vector \(\mu\). For more details please visit the help documentation of the pk.test() function.
To test the null hypothesis of uniformity on the \(d\)-dimensional sphere \(\mathcal{S}^{d-1} = \{x \in \mathbb{R}^d : ||x||=1 \}\):
# Generate points on the sphere from the uniform distribution
x <- sample_hypersphere(d = 3, n_points = 100)
# Does x come from the uniform distribution on the sphere?
pk.test(x, rho = 0.7)
##
## Poisson Kernel-based quadratic distance test of
## Uniformity on the Sphere
## Selected concentration parameter rho: 0.7
##
## U-statistic:
##
## H0 is rejected: FALSE
## Statistic Un: 1.620118
## Critical value: 1.862317
##
## V-statistic:
##
## H0 is rejected: FALSE
## Statistic Vn: 22.84617
## Critical value: 23.22949
The package offers functions for computing the density value and for generating random samples from a Poisson kernel-based density (PKBD). The Poisson kernel-based densities are based on the normalized Poisson kernel and are defined on the \(d\)-dimensional unit sphere. For more details please visit the help documentation of the dpkb() and rpkb() functions.
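For reference, the PKBD with location \(\mu \in \mathcal{S}^{d-1}\) and concentration \(\rho \in [0,1)\) has the form (as given in Golzy and Markatou, 2020):
\[
f(x; \rho, \mu) = \frac{1-\rho^2}{\omega_d \, ||x - \rho\mu||^d}, \qquad x \in \mathcal{S}^{d-1},
\]
where \(\omega_d = 2\pi^{d/2}/\Gamma(d/2)\) denotes the surface area of \(\mathcal{S}^{d-1}\).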
Example
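The output below comes from calls along the following lines (a sketch: the concentration value and the argument names of dpkb() are assumptions; the $x component of the rpkb() output is used as in the clustering example further down):
rho <- 0.9
mu <- c(1, 0, 0)
# Generate random samples from the PKBD with location mu and concentration rho
pkbd_sample <- rpkb(n = 100, mu = mu, rho = rho)
head(pkbd_sample$x)
# Evaluate the density at the sampled points
dens_values <- dpkb(x = pkbd_sample$x, mu = mu, rho = rho)
head(dens_values)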
## [,1] [,2] [,3]
## [1,] 0.9939084 -0.081245323 0.07446703
## [2,] 0.9981673 -0.031731342 0.05152923
## [3,] 0.9968513 -0.006737282 0.07900642
## [4,] 0.9968730 0.060027451 0.05138932
## [5,] 0.9945320 0.011481164 -0.10379956
## [6,] 0.9682762 -0.091252086 -0.23262456
## [,1]
## [1,] 4.9808540
## [2,] 9.8586575
## [3,] 7.7097546
## [4,] 7.7386652
## [5,] 5.4094177
## [6,] 0.8698273
The package incorporates a unique clustering algorithm specifically tailored for spherical data; it is especially useful in the presence of noise and of non-negligible overlap between clusters. The algorithm leverages a mixture of Poisson kernel-based densities on the sphere, enabling effective clustering of spherical data or data that has been spherically transformed. For more details please visit the help documentation of the pkbc() function.
Example
# Generate 3 samples from the PKBD with different location directions
x1 <- rpkb(n = 100, mu = c(1,0,0), rho = rho)
x2 <- rpkb(n = 100, mu = c(-1,0,0), rho = rho)
x3 <- rpkb(n = 100, mu = c(0,0,1), rho = rho)
x <- rbind(x1$x, x2$x, x3$x)
# Perform the clustering algorithm
# Search for 2, 3 or 4 clusters
cluster_res <- pkbc(dat = x, nClust = c(2, 3, 4))
summary(cluster_res)
## Poisson Kernel-Based Clustering on the Sphere (pkbc) Results
## ------------------------------------------------------------
##
## Summary:
## LogLik WCSS
## [1,] -600.0535 399.7122
## [2,] -315.6083 320.8486
## [3,] -304.9507 320.1200
##
## Results for 2 clusters:
## Estimated Mixing Proportions (alpha):
## [1] 0.7074514 0.2925486
##
## Clustering table:
##
## 1 2
## 209 91
##
##
## Results for 3 clusters:
## Estimated Mixing Proportions (alpha):
## [1] 0.3238438 0.3325603 0.3435959
##
## Clustering table:
##
## 1 2 3
## 95 101 104
##
##
## Results for 4 clusters:
## Estimated Mixing Proportions (alpha):
## [1] 0.332708607 0.005782234 0.339847442 0.321661717
##
## Clustering table:
##
## 1 2 3 4
## 101 2 102 95
The software includes additional graphical functions that help users validate and represent the clustering results, enhancing the interpretability and usability of the analysis.
# Predict the membership of new data with respect to the clustering results
x_new <- rpkb(n = 10, mu = c(1,0,0), rho = rho)
memb_new <- predict(cluster_res, k = 3, newdata = x_new$x)
memb_new$Memb
## [1] 2 2 2 2 2 2 2 2 2 2
# Compute measures for evaluating the clustering results
val_res <- pkbc_validation(cluster_res)
val_res
## $metrics
## 2 3 4
## ASW 0.518038 0.6928325 0.6049581
##
## $IGP
## $IGP[[1]]
## NULL
##
## $IGP[[2]]
## [1] 0.9948980 0.9903846
##
## $IGP[[3]]
## [1] 0.988764 0.990099 1.000000
##
## $IGP[[4]]
## [1] 0.990099 1.000000 1.000000 0.988764
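A minimal sketch of how such a graphical summary might be invoked (the existence of a plot method for pkbc objects and its k argument are assumptions here; see the package documentation for the exact interface):
# Hypothetical call: visualize the clustering results for k = 3
plot(cluster_res, k = 3)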
For more detailed information about the QuadratiK
package, you can explore the following resources:
If you're new to the package, we recommend starting with the available vignettes.
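From within R, the vignettes shipped with the package can be listed and opened with the standard utils helper:
# List and browse the available vignettes
browseVignettes("QuadratiK")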
For more information on the methods implemented in this package, refer to the associated research papers:
Markatou, M. and Saraceno, G. (2024). “A Unified Framework for Multivariate Two- and k-Sample Kernel-based Quadratic Distance Goodness-of-Fit Tests.” arXiv:2407.16374
Ding, Y., Markatou, M. and Saraceno, G. (2023). “Poisson Kernel-Based Tests for Uniformity on the d-Dimensional Sphere.” Statistica Sinica. doi: 10.5705/ss.202022.0347.
Golzy, M. and Markatou, M. (2020). "Poisson Kernel-Based Clustering on the Sphere: Convergence Properties, Identifiability, and a Method of Sampling." Journal of Computational and Graphical Statistics, 29(4), 758-770. doi: 10.1080/10618600.2020.1740713.