A set-based association test in snpsettest

Jaehyun Joo

09 September, 2023

For set-based association tests, the snpsettest package employed the statistical model described in VEGAS (versatile gene-based association study) [1], which takes as input variant-level p values and reference linkage disequilibrium (LD) data. Briefly, the test statistics is defined as the sum of squared variant-level Z-statistics. Letting a set of \(Z\) scores of individual SNPs \(z_i\) for \(i \in 1:p\) within a set \(s\), the test statistic \(Q_s\) is

\[Q_s = \sum_{i=1}^p z_i^2\]

Here, \(Z = \{z_1,...,z_p\}'\) is a vector of multivariate normal distribution with a mean vector \(\mu\) and a covariance matrix \(\Sigma\) in which \(\Sigma\) represents LD among SNPs. To test a set-level association, we need to evaluate the distribution of \(Q_s\). VEGAS uses Monte Carlo simulations to approximate the distribution of \(Q_s\) (directly simulate \(Z\) from multivariate normal distribution), and thus, compute a set-level p value. However, its use is hampered in practice when set-based p values are very small because the number of simulations required to obtain such p values is be very large. The snpsettest package utilizes a different approach to evaluate the distribution of \(Q_s\) more efficiently.

Let \(Y = \Sigma^{-\frac12}Z\) (instead of \(\Sigma^{-\frac12}\), we could use any decomposition that satisfies \(\Sigma = AA'\) with a \(p \times p\) non-singular matrix \(A\) such that \(Y = A^{-1}Z\)). Then,

\[ \begin{gathered} E(Y) = \Sigma^{-\frac12} \mu \\ Var(Y) = \Sigma^{-\frac12}\Sigma\Sigma^{-\frac12} = I_p \\ Y \sim N(\Sigma^{-\frac12} \mu,~I_p) \end{gathered} \]

Now, we posit \(U = \Sigma^{-\frac12}(Z - \mu)\) so that

\[U \sim N(\mathbf{0}, I_p),~~U = Y - \Sigma^{-\frac12}\mu\]

and express the test statistic \(Q_s\) as a quadratic form:

\[ \begin{aligned} Q_s &= \sum_{i=1}^p z_i^2 = Z'I_pZ = Y'\Sigma^{\frac12}I_p\Sigma^{\frac12}Y \\ &= (U + \Sigma^{-\frac12}\mu)'\Sigma(U + \Sigma^{-\frac12}\mu) \end{aligned} \]

With the spectral theorem, \(\Sigma\) can be decomposed as follow:

\[ \begin{gathered} \Sigma = P\Lambda P' \\ \Lambda = \mathbf{diag}(\lambda_1,...,\lambda_p),~~P'P = PP' = I_p \end{gathered} \]

where \(P\) is an orthogonal matrix. If we set \(X = P'U\), \(X\) is a vector of independent standard normal variable \(X \sim N(\mathbf{0}, I_p)\) since

\[E(X) = P'E(U) = \mathbf{0},~~Var(X) = P'Var(U)P = P'I_pP = I_p\]

\[ \begin{aligned} Q_s &= (U + \Sigma^{-\frac12}\mu)'\Sigma(U + \Sigma^{-\frac12}\mu) \\ &= (U + \Sigma^{-\frac12}\mu)'P\Lambda P'(U + \Sigma^{-\frac12}\mu) \\ &= (X + P'\Sigma^{-\frac12}\mu)'\Lambda (X + P'\Sigma^{-\frac12}\mu) \end{aligned} \]

Under the null hypothesis, \(\mu\) is assumed to be \(\mathbf{0}\). Hence,

\[Q_s = X'\Lambda X = \sum_{i=1}^p \lambda_i x_i^2\]

where \(X = \{x_1,...,x_p\}'\). Thus, the null distribution of \(Q_s\) is a linear combination of independent chi-square variables \(x_i^2 \sim \chi_{(1)}^2\) (i.e., central quadratic form in independent normal variables). For computing a probability with a scalar \(q\),

\[Pr(Q_s > q)\]

several methods have been proposed, such as numerical inversion of the characteristic function [2]. The snpsettest package uses the algorithm of Davies [3] or saddlepoint approximation [4] to obtain set-based p values.

References

Liu JZ, Mcrae AF, Nyholt DR, Medland SE, Wray NR, Brown KM, et al. A Versatile Gene-Based Test for Genome-wide Association Studies. Am J Hum Genet. 2010 Jul 9;87(1):139–45.
Duchesne P, De Micheaux P. Computing the distribution of quadratic forms: Further comparisons between the Liu-Tang-Zhang approximation and exact methods. Comput Stat Data Anal. 2010;54:858–62.
Davies RB. Algorithm AS 155: The Distribution of a Linear Combination of Chi-square Random Variables. J R Stat Soc Ser C Appl Stat. 1980;29(3):323–33.
Kuonen D. Saddlepoint Approximations for Distributions of Quadratic Forms in Normal Variables. Biometrika. 1999;86(4):929–35.