Suppose that there are a set of variables \(X\) that are related to each other as seen in the model below:
Suppose that there is a profile of scores in \(X\) such that:
\[X=\{X_1,X_2, X_3, X_4\} = \{2,3,1,2\}. \]
As seen in Figure 1, this profile of scores is summarized by a composite score of 2.30.
How can we calculate the Mahalanobis distance for profiles that all have the same elevation (i.e., composite score)? For your reference, we will do everything “by hand” using matrix algebra.
In r, we can make a named vector of scores like so:
We will need to store variable names:
v_observed <- names(X)
v_latent <- "X"
v_composite <- "X_Composite"
v_names <- c(v_observed, v_composite)
We can create a matrix of factor loadings:
lambda <- c(0.95, 0.90, 0.85, 0.60) %>%
matrix() %>%
`rownames<-`(v_observed) %>%
`colnames<-`(v_latent)
lambda
#> X
#> X_1 0.95
#> X_2 0.90
#> X_3 0.85
#> X_4 0.60
Now we calculate the model-implied correlations among the observed variables:
# Observed Correlations
R_X <- lambda %*% t(lambda) %>%
`diag<-`(1)
R_X
#> X_1 X_2 X_3 X_4
#> X_1 1.0000 0.855 0.8075 0.57
#> X_2 0.8550 1.000 0.7650 0.54
#> X_3 0.8075 0.765 1.0000 0.51
#> X_4 0.5700 0.540 0.5100 1.00
Presented formally, the model-implied correlations are:
\[R_{X} \approx \begin{bmatrix} 1 & .85 & .81 & .57\\ .85 & 1 & .76 & .54\\ .81 & .76 & 1 & .51\\ .57 & .54 & .51 & 1 \end{bmatrix}\]
We need to use this matrix to create a new 5 × 5 correlation matrix that includes the correlations among the four variables and also each variable’s correlation with the general composite score (i.e., the standardized sum of four variables). Fortunately, such a matrix can be calculated with only a few steps.
We will need a “weight” matrix that will select each variable individually and also the sum of the four variables.
\[w=\begin{bmatrix} 1 & 0 & 0 & 0 & 1\\ 0 & 1 & 0 & 0 & 1\\ 0 & 0 & 1 & 0 & 1\\ 0 & 0 & 0 & 1 & 1 \end{bmatrix}\]
Notice that the first column of this matrix has a 1 in first position and zeroes elsewhere. It selects the first variable, X1. The second column selects X2, and so on to the fourth column. The last column is all ones, which will select all four variables and add them up.
We can construct this matrix with the diag
function,
which creates an identity matrix. This matrix is appended to a column of
ones:
w <- cbind(diag(4),
rep(1, 4)) %>%
`rownames<-`(v_observed) %>%
`colnames<-`(v_names)
w
#> X_1 X_2 X_3 X_4 X_Composite
#> X_1 1 0 0 0 1
#> X_2 0 1 0 0 1
#> X_3 0 0 1 0 1
#> X_4 0 0 0 1 1
Now we can use the weight matrix w to calculate the covariance matrix:
\[\Sigma = w'R_{X}w\]
\[\Sigma \approx \begin{bmatrix} 1 & .85 & .81 & .57 & 3.23\\ .85 & 1 & .76 & .54 & 3.16\\ .81 & .76 & 1 & .51 & 3.08\\ .57 & .54 & .51 & 1 & 2.62\\ 3.23 & 3.16 & 3.08 & 2.62 & 12.09 \end{bmatrix}\]
Sigma <- (t(w) %*% R_X %*% w)
Sigma
#> X_1 X_2 X_3 X_4 X_Composite
#> X_1 1.0000 0.855 0.8075 0.57 3.2325
#> X_2 0.8550 1.000 0.7650 0.54 3.1600
#> X_3 0.8075 0.765 1.0000 0.51 3.0825
#> X_4 0.5700 0.540 0.5100 1.00 2.6200
#> X_Composite 3.2325 3.160 3.0825 2.62 12.0950
Now we need to convert the covariance matrix to a correlation matrix. With matrix equations, we would need to create a matrix of with a vector of variances on the diagonal:
\[D = \text{diag}(\Sigma)\] Then we would take the square root, invert this matrix, and then pre-multiply it and post-multiply it by the covariance matrix.
\[R_{All} = D^{-0.5}\Sigma D^{-0.5}\]
\[R_{All} \approx \begin{bmatrix} 1 & .86 & .81 & .57 & .93\\ .86 & 1 & .76 & .54 & .91\\ .81 & .76 & 1 & .51 & .89\\ .57 & .54 & .51 & 1 & .75\\ .93 & .91 & .89 & .75 & 1 \end{bmatrix}\]
If we really want to use pure matrix algebra functions, we could do this:
D_root_inverted <- Sigma %>%
diag() %>%
sqrt() %>%
diag() %>%
solve() %>%
`rownames<-`(v_names) %>%
`colnames<-`(v_names)
R_all <- D_root_inverted %*% Sigma %*% D_root_inverted
R_all
#> X_1 X_2 X_3 X_4 X_Composite
#> X_1 1.0000000 0.8550000 0.8075000 0.5700000 0.9294705
#> X_2 0.8550000 1.0000000 0.7650000 0.5400000 0.9086239
#> X_3 0.8075000 0.7650000 1.0000000 0.5100000 0.8863396
#> X_4 0.5700000 0.5400000 0.5100000 1.0000000 0.7533527
#> X_Composite 0.9294705 0.9086239 0.8863396 0.7533527 1.0000000
However, it is probably best to sidestep all this complication of
converting covariances to correlations with the cov2cor
function:
# Convert covariance matrix to correlations
R_all <- cov2cor(Sigma)
R_all
#> X_1 X_2 X_3 X_4 X_Composite
#> X_1 1.0000000 0.8550000 0.8075000 0.5700000 0.9294705
#> X_2 0.8550000 1.0000000 0.7650000 0.5400000 0.9086239
#> X_3 0.8075000 0.7650000 1.0000000 0.5100000 0.8863396
#> X_4 0.5700000 0.5400000 0.5100000 1.0000000 0.7533527
#> X_Composite 0.9294705 0.9086239 0.8863396 0.7533527 1.0000000
To calculate the standardized composite score \(z_C\), add each variable’s deviation from its own mean and divide by the square root of the sum of the observed score covariance matrix.
\[z_C=\frac{1'(X-\mu_X)}{\sqrt{1'\Sigma_X1}}\]
Where
\(z_C\) is a standardized composite score.
\(X\) is a vector of observed scores.
\(\mu_X\) is the vector of means for the \(X\) variables.
\(\Sigma_X\) is the covariance matrix of the \(X\) variables.
\(1\) is a vector of ones compatible with \(\Sigma_X\).
The composite score is:
# Population means of observed variables
mu_X <- rep_along(X, 0)
# Population standard deviations of observed variables
sd_X <- rep_along(X, 1)
# Covariance Matrix
Sigma_X <- diag(sd_X) %*% R_X %*% diag(sd_X)
# Vector of ones
ones <- rep_along(X, 1)
# Standardized composite score
z_C <- c(ones %*% (X - mu_X) / sqrt(ones %*% (Sigma_X) %*% ones))
Given a particular composite score, we need to calculate a predicted score. That is, if the composite score is 1.5 standard deviations above the mean, what are the expected subtest scores?
\[\hat{X}=\sigma_Xz_Cr_{XX_C}+\mu_X\]
Where
\(\hat{X}\) is the vector of expected subtest scores
\(\sigma_X\) is the vector of standard deviations for \(X\)
\(z_C\) is the composite score
\(r_{XX_C}\) is a vector of correlations of each variable in \(X\) with the composite score \(z_C\)
\(\mu_X\) is the vector of means for \(X\)
Thus,
\[d_{M_C}=\sqrt{\left(X-\hat{X}\right)'\Sigma_{X}^{-1}\left(X-\hat{X}\right)}\]
Where
\(d_{M_C}\) is the Conditional Mahalanobis Distance
\(X\) is a vector of subtest scores
\(\hat{X}\) is the vector of expected subtest scores
\(\Sigma_{X}\) is the covariance matrix of the subtest scores
Suppose there are k outcome scores, and j composite scores used to calculate the expected scores \(\hat{X}\). If multivariate normality of the subtest scores can be assumed, then the Conditional Mahalanobis Distance squared has a χ2 distribution with k − j degrees of freedom.
\[d_{M_C}^{2} \sim\chi^{2}(k-j)\]
# Number of observed variables
k <- length(v_observed)
# Number of composite variables
j <- length(v_composite)
# Cumulative distribution function
p <- pchisq(d_mc ^ 2, df = k - j)
p
#> [1] 0.9641406
If we can assume that the observed variables in X are multivariate normal, a profile of X = {2,3,1,2} is more unusual than 96% of profiles that also have a composite score of zC = 2.3.