The MR.RGM R package addresses a common limitation of Mendelian randomization (MR) studies. While MR has proven invaluable for establishing causal links between exposures and outcomes, its traditional focus on a single exposure and a specific outcome can be limiting: biological systems are often complex, with interdependent outcomes influenced by numerous factors. MR.RGM introduces a network-based approach to MR, allowing researchers to explore this broader causal landscape.
With two available functions, RGM and NetworkMotif, the package offers versatility in analyzing causal relationships. RGM primarily focuses on constructing causal networks among response variables and between responses and instrumental variables. On the other hand, NetworkMotif specializes in quantifying uncertainty for given network structures among response variables.
RGM accommodates both individual-level data and two types of summary-level data, making it adaptable to various data availability scenarios. This adaptability enhances the package’s utility across different research contexts. The outputs of RGM include estimates of causal effects, adjacency matrices, and other relevant parameters. Together, these outputs contribute to a deeper understanding of the intricate relationships within complex biological networks, thereby enriching insights derived from MR studies.
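Schematically, and as demonstrated in detail below, these three input modes correspond to three call signatures (a sketch only; the full worked examples follow):

# Sketch of the three ways to supply data to RGM (full demos below):
# RGM(X = X, Y = Y, D = D, prior = "Threshold")                               # individual-level data
# RGM(Syy = Syy, Syx = Syx, Sxx = Sxx, D = D, n = n, prior = "Spike and Slab")  # summary-level data
# RGM(Sxx = Sxx, Beta = Beta, SigmaHat = SigmaHat, D = D, n = n,
#     prior = "Spike and Slab")                                               # regression summaries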
You can install the MR.RGM R package from CRAN with:
install.packages("MR.RGM")
Once the MR.RGM package is installed, load it into the R workspace:
library("MR.RGM")
We offer a concise demonstration of the capabilities of the RGM function within the package, showcasing its effectiveness in computing causal interactions among response variables and between responses and instrumental variables using simulated data sets. Subsequently, we provide an example of how NetworkMotif can be applied, utilizing a specified network structure and GammaPst acquired from executing the RGM function.
# Model: Y = AY + BX + E
# Set seed
set.seed(9154)

# Number of data points
n = 10000

# Number of response variables and number of instrument variables
p = 5
k = 6

# Initialize causal interaction matrix between response variables
A = matrix(sample(c(-0.1, 0.1), p^2, replace = TRUE), p, p)

# Diagonal entries of A matrix will always be 0
diag(A) = 0

# Make the network sparse
A[sample(which(A != 0), length(which(A != 0)) / 2)] = 0

# Create D matrix (Indicator matrix where each row corresponds to a response variable
# and each column corresponds to an instrument variable)
D = matrix(0, nrow = p, ncol = k)

# Manually assign values to D matrix
D[1, 1:2] = 1  # First response variable is influenced by the first 2 instruments
D[2, 3] = 1    # Second response variable is influenced by the 3rd instrument
D[3, 4] = 1    # Third response variable is influenced by the 4th instrument
D[4, 5] = 1    # Fourth response variable is influenced by the 5th instrument
D[5, 6] = 1    # Fifth response variable is influenced by the 6th instrument

# Initialize B matrix
B = matrix(0, p, k)  # Initialize B matrix with zeros

# Calculate B matrix based on D matrix
for (i in 1:p) {
  for (j in 1:k) {
    if (D[i, j] == 1) {
      B[i, j] = 1  # Set B[i, j] to 1 if D[i, j] is 1
    }
  }
}

# Create variance-covariance matrix
Sigma = 1 * diag(p)
Mult_Mat = solve(diag(p) - A)
Variance = Mult_Mat %*% Sigma %*% t(Mult_Mat)

# Generate instrument data matrix
X = matrix(runif(n * k, 0, 5), nrow = n, ncol = k)

# Initialize response data matrix
Y = matrix(0, nrow = n, ncol = p)

# Generate response data matrix based on instrument data matrix
for (i in 1:n) {
  Y[i, ] = MASS::mvrnorm(n = 1, Mult_Mat %*% B %*% X[i, ], Variance)
}
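Since the model Y = AY + BX + E implies the reduced form Y = (I - A)^(-1) (BX + E), the simulated residuals should have covariance close to Variance. A quick optional sanity check (our addition, not part of the package workflow):

# Optional sanity check: reduced-form residuals should have covariance ~ Variance
Resid = Y - X %*% t(Mult_Mat %*% B)
max(abs(cov(Resid) - Variance))  # should be near 0 for large n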
# Print true causal interaction matrices between response variables
# and between response and instrument variables
A
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]  0.0 -0.1  0.0  0.0  0.1
#> [2,]  0.1  0.0 -0.1  0.1  0.1
#> [3,]  0.0 -0.1  0.0  0.0  0.1
#> [4,]  0.0 -0.1  0.0  0.0  0.0
#> [5,]  0.0  0.1  0.0  0.0  0.0
B
#>      [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,]    1    1    0    0    0    0
#> [2,]    0    0    1    0    0    0
#> [3,]    0    0    0    1    0    0
#> [4,]    0    0    0    0    1    0
#> [5,]    0    0    0    0    0    1
We now apply RGM to individual-level data, to summary-level data, and to the Beta and SigmaHat matrices, to demonstrate its functionality.
# Apply RGM on individual level data with Threshold prior
Output1 = RGM(X = X, Y = Y, D = D, prior = "Threshold")

# Calculate summary level data
Syy = t(Y) %*% Y / n
Syx = t(Y) %*% X / n
Sxx = t(X) %*% X / n

# Apply RGM on summary level data for Spike and Slab Prior
Output2 = RGM(Syy = Syy, Syx = Syx, Sxx = Sxx,
              D = D, n = 10000, prior = "Spike and Slab")

# Calculate Beta and Sigma_Hat
# Centralize Data
Y = t(t(Y) - colMeans(Y))
X = t(t(X) - colMeans(X))

# Calculate Sxx
Sxx = t(X) %*% X / n

# Generate Beta matrix and SigmaHat
Beta = matrix(0, nrow = p, ncol = k)
SigmaHat = matrix(0, nrow = p, ncol = k)

for (i in 1:p) {
  for (j in 1:k) {
    fit = lm(Y[, i] ~ X[, j])
    Beta[i, j] = fit$coefficients[2]
    SigmaHat[i, j] = sum(fit$residuals^2) / n
  }
}

# Apply RGM on Sxx, Beta and SigmaHat for Spike and Slab Prior
Output3 = RGM(Sxx = Sxx, Beta = Beta, SigmaHat = SigmaHat,
              D = D, n = 10000, prior = "Spike and Slab")
We obtain the estimated causal interaction matrix among the response variables as follows:
Output1$AEst
#>           [,1]        [,2]       [,3]      [,4]       [,5]
#> [1,] 0.0000000 -0.11032661  0.0000000 0.0000000 0.10676811
#> [2,] 0.0991208  0.00000000 -0.1104576 0.1002182 0.11012341
#> [3,] 0.0000000 -0.09370579  0.0000000 0.0000000 0.09747664
#> [4,] 0.0000000 -0.10185959  0.0000000 0.0000000 0.00000000
#> [5,] 0.0000000  0.10045256  0.0000000 0.0000000 0.00000000

Output2$AEst
#>              [,1]       [,2]          [,3]          [,4]        [,5]
#> [1,]  0.000000000 -0.1127249 -0.0006355133  0.0012237310 0.107520412
#> [2,]  0.099881480  0.0000000 -0.1079853541  0.0996589823 0.109275947
#> [3,] -0.001747246 -0.0929592  0.0000000000 -0.0006473297 0.099778267
#> [4,] -0.003863658 -0.1030056 -0.0024683725  0.0000000000 0.009191016
#> [5,]  0.001966818  0.1014136 -0.0055458038 -0.0050688662 0.000000000

Output3$AEst
#>             [,1]        [,2]         [,3]          [,4]       [,5]
#> [1,]  0.00000000 -0.08972628  0.039412442 -0.0006516754 0.09454632
#> [2,]  0.11040214  0.00000000 -0.112569739  0.0983975424 0.13676682
#> [3,]  0.01896308 -0.09780459  0.000000000  0.0100459671 0.11765744
#> [4,]  0.00000000 -0.11255315  0.001661851  0.0000000000 0.01358832
#> [5,] -0.00174675  0.12947084  0.011533919 -0.0025906139 0.00000000
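Since the true A is known in this simulation, a simple illustrative comparison (our addition, not part of the package output) is the Frobenius norm of each estimation error:

# Frobenius norm of the error in the estimated causal effect matrices
sqrt(sum((Output1$AEst - A)^2))
sqrt(sum((Output2$AEst - A)^2))
sqrt(sum((Output3$AEst - A)^2))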
We obtain the estimated causal network structure among the response variables as follows:
Output1$zAEst
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    0    1    0    0    1
#> [2,]    1    0    1    1    1
#> [3,]    0    1    0    0    1
#> [4,]    0    1    0    0    0
#> [5,]    0    1    0    0    0

Output2$zAEst
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    0    1    0    0    1
#> [2,]    1    0    1    1    1
#> [3,]    0    1    0    0    1
#> [4,]    0    1    0    0    0
#> [5,]    0    1    0    0    0

Output3$zAEst
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    0    1    0    0    1
#> [2,]    1    0    1    1    1
#> [3,]    0    1    0    0    1
#> [4,]    0    1    0    0    0
#> [5,]    0    1    0    0    0
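The estimated structures can also be checked against the truth programmatically (an illustrative check, not part of the package output):

# Check whether each estimated adjacency matrix matches the true structure
all(Output1$zAEst == (A != 0) * 1)
all(Output2$zAEst == (A != 0) * 1)
all(Output3$zAEst == (A != 0) * 1)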
We observe that the causal network structures inferred from the three outputs are identical. To gain a clearer understanding of the network, we compare the true network structure with the one estimated by RGM. Since the networks derived from all three outputs are consistent, we plot a single graph representing the estimated causal network.
# Define a function to create smaller arrowheads
smaller_arrowheads <- function(graph) {
  igraph::E(graph)$arrow.size = 0.60  # Adjust the arrow size value as needed
  return(graph)
}
# Create a layout for multiple plots
par(mfrow = c(1, 2))
# Plot the true causal network
plot(smaller_arrowheads(igraph::graph_from_adjacency_matrix((A != 0) * 1,
mode = "directed")), layout = igraph::layout_in_circle,
main = "True Causal Network")
# Plot the estimated causal network
plot(smaller_arrowheads(igraph::graph_from_adjacency_matrix(Output1$zAEst,
mode = "directed")), layout = igraph::layout_in_circle,
main = "Estimated Causal Network")
We obtain the estimated causal interaction matrix between the responses and the instrument variables as follows:
Output1$BEst
#>           [,1]     [,2]      [,3]      [,4]      [,5]      [,6]
#> [1,] 0.9921448 1.010301 0.0000000 0.0000000 0.0000000 0.0000000
#> [2,] 0.0000000 0.000000 0.9961312 0.0000000 0.0000000 0.0000000
#> [3,] 0.0000000 0.000000 0.0000000 0.9989298 0.0000000 0.0000000
#> [4,] 0.0000000 0.000000 0.0000000 0.0000000 0.9991987 0.0000000
#> [5,] 0.0000000 0.000000 0.0000000 0.0000000 0.0000000 0.9977224

Output2$BEst
#>           [,1]     [,2]      [,3]      [,4]     [,5]     [,6]
#> [1,] 0.9942219 1.009985 0.0000000 0.0000000 0.000000 0.000000
#> [2,] 0.0000000 0.000000 0.9931885 0.0000000 0.000000 0.000000
#> [3,] 0.0000000 0.000000 0.0000000 0.9992243 0.000000 0.000000
#> [4,] 0.0000000 0.000000 0.0000000 0.0000000 1.000336 0.000000
#> [5,] 0.0000000 0.000000 0.0000000 0.0000000 0.000000 1.001666

Output3$BEst
#>           [,1]     [,2]     [,3]      [,4]      [,5]      [,6]
#> [1,] 0.9910789 1.006024 0.000000 0.0000000 0.0000000 0.0000000
#> [2,] 0.0000000 0.000000 0.993928 0.0000000 0.0000000 0.0000000
#> [3,] 0.0000000 0.000000 0.000000 0.9987596 0.0000000 0.0000000
#> [4,] 0.0000000 0.000000 0.000000 0.0000000 0.9974289 0.0000000
#> [5,] 0.0000000 0.000000 0.000000 0.0000000 0.0000000 0.9948345
We obtain the estimated graph structure between the responses and the instrument variables as follows:
Output1$zBEst
#>      [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,]    1    1    0    0    0    0
#> [2,]    0    0    1    0    0    0
#> [3,]    0    0    0    1    0    0
#> [4,]    0    0    0    0    1    0
#> [5,]    0    0    0    0    0    1

Output2$zBEst
#>      [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,]    1    1    0    0    0    0
#> [2,]    0    0    1    0    0    0
#> [3,]    0    0    0    1    0    0
#> [4,]    0    0    0    0    1    0
#> [5,]    0    0    0    0    0    1

Output3$zBEst
#>      [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,]    1    1    0    0    0    0
#> [2,]    0    0    1    0    0    0
#> [3,]    0    0    0    1    0    0
#> [4,]    0    0    0    0    1    0
#> [5,]    0    0    0    0    0    1
We can plot the log-likelihoods from the outputs as follows:
plot(Output1$LLPst, type = 'l', xlab = "Iterations", ylab = "Log-likelihood")
plot(Output2$LLPst, type = 'l', xlab = "Iterations", ylab = "Log-likelihood")
plot(Output3$LLPst, type = 'l', xlab = "Iterations", ylab = "Log-likelihood")
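Beyond eyeballing the trace plots, a rough convergence heuristic (our addition, assuming LLPst is the vector of per-iteration log-likelihoods) is to compare the average log-likelihood across the two halves of the chain:

# Rough convergence heuristic: similar means in the two halves of the chain
# suggest the sampler has stabilized
ll = Output1$LLPst
half = floor(length(ll) / 2)
c(first_half = mean(ll[1:half]), second_half = mean(ll[(half + 1):length(ll)]))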
Next, we present the implementation of the NetworkMotif function. We begin by defining a random subgraph among the response variables. Subsequently, we collect GammaPst arrays from various outputs and proceed to execute NetworkMotif based on these arrays.
# Start with a random subgraph
Gamma = matrix(0, nrow = p, ncol = p)
Gamma[5, 2] = Gamma[3, 5] = Gamma[2, 3] = 1

# Plot the subgraph to get an idea about the causal network
plot(smaller_arrowheads(igraph::graph_from_adjacency_matrix(Gamma,
     mode = "directed")), layout = igraph::layout_in_circle,
     main = "Subgraph")

# Store the GammaPst arrays from outputs
GammaPst1 = Output1$GammaPst
GammaPst2 = Output2$GammaPst
GammaPst3 = Output3$GammaPst

# Get the posterior probabilities of Gamma with these GammaPst matrices
NetworkMotif(Gamma = Gamma, GammaPst = GammaPst1)
#> [1] 1
NetworkMotif(Gamma = Gamma, GammaPst = GammaPst2)
#> [1] 0.37325
NetworkMotif(Gamma = Gamma, GammaPst = GammaPst3)
#> [1] 0.461375
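As a further illustration (our addition), the same machinery can quantify posterior support for the full true structure:

# Posterior support for the true network structure itself (illustrative)
TrueGamma = (A != 0) * 1
NetworkMotif(Gamma = TrueGamma, GammaPst = GammaPst1)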
In real-world scenarios, it is common to encounter a large number of instrumental variables (IVs), each explaining only a small proportion of the trait variance. To better reflect this, we have expanded our simulation setup with the following new elements:
- PCA per SNP block: We perform PCA on the block of SNPs corresponding to each response variable and retain the top principal components (PCs).
- compact_X: By combining the top PCs from each response variable, we form a condensed matrix compact_X. This matrix aggregates the instrumental variables into a more manageable form, facilitating a more efficient analysis.
- New summary-level data: Using compact_X, we calculate new summary level data (Sxx_compact and Syx_compact) for the RGM function application. This approach provides a more realistic representation of the instrumental variables' effects in scenarios with many IVs explaining only a small proportion of the variance.

Here is the updated R code reflecting these changes:
# Load necessary libraries
library(MASS)
library(igraph)
#> 
#> Attaching package: 'igraph'
#> The following objects are masked from 'package:stats':
#> 
#>     decompose, spectrum
#> The following object is masked from 'package:base':
#> 
#>     union

# Set seed for reproducibility
set.seed(9154)

# Number of data points
n = 10000

# Number of response variables
p = 5

# Number of SNPs per response variable
num_snps_per_y = 100

# Total number of SNPs
k = num_snps_per_y * p

# Initialize causal interaction matrix between response variables
A = matrix(sample(c(-0.1, 0.1), p^2, replace = TRUE), p, p)
diag(A) = 0
A[sample(which(A != 0), length(which(A != 0)) / 2)] = 0

# Create D matrix (Indicator matrix where each row corresponds to a response variable
# and each column corresponds to an instrument variable)
D = matrix(0, nrow = p, ncol = k)

# Assign values to D matrix using a loop
for (run in 1:p) {
  D[run, ((run - 1) * 100 + 1):(run * 100)] = 1
}

# Initialize B matrix
B = matrix(0, p, k)  # Initialize B matrix with zeros

# Calculate B matrix based on D matrix
for (i in 1:p) {
  for (j in 1:k) {
    if (D[i, j] == 1) {
      B[i, j] = 1  # Set B[i, j] to 1 if D[i, j] is 1
    }
  }
}

# Calculate Variance-Covariance matrix
Sigma = diag(p)
Mult_Mat = solve(diag(p) - A)
Variance = Mult_Mat %*% Sigma %*% t(Mult_Mat)

# Generate instrument data matrix (X)
X = matrix(rnorm(n * k, 0, 1), nrow = n, ncol = k)

# Initialize response data matrix (Y)
Y = matrix(0, nrow = n, ncol = p)

# Generate response data matrix based on instrument data matrix
for (i in 1:n) {
  Y[i, ] = MASS::mvrnorm(n = 1, Mult_Mat %*% B %*% X[i, ], Variance)
}

# Calculate summary level data
Syy = t(Y) %*% Y / n
Syx = t(Y) %*% X / n
Sxx = t(X) %*% X / n

# Perform PCA for each response variable to get top 20 PCs
top_snps_list = list()
for (i in 1:p) {
  X_sub = X[, (num_snps_per_y * (i - 1) + 1):(num_snps_per_y * i)]
  pca = prcomp(X_sub, center = TRUE, scale. = TRUE)
  top_20_pcs = pca$x[, 1:20]
  top_snps_list[[i]] = top_20_pcs
}

# Combine the top PCs from all response variables
compact_X = do.call(cbind, top_snps_list)

# Calculate summary level data based on compact_X
Sxx_compact = t(compact_X) %*% compact_X / n
Syx_compact = t(Y) %*% compact_X / n

# Create D_New
D_New = matrix(0, nrow = p, ncol = 20 * p)

# Assign values to D_New matrix using a loop
for (run in 1:p) {
  D_New[run, ((run - 1) * 20 + 1):(run * 20)] = 1
}

# Apply RGM on summary level data for Spike and Slab Prior using the compact_X matrix
Output = RGM(Syy = Syy, Syx = Syx_compact, Sxx = Sxx_compact,
             D = D_New, n = n, prior = "Spike and Slab")
Output
# Print estimated causal interaction matrices
Output$AEst
#>               [,1]        [,2]         [,3]         [,4]       [,5]
#> [1,]  0.0000000000 -0.10304239  0.012795897  0.006911765 0.09119095
#> [2,]  0.1102430697  0.00000000 -0.101718794  0.095093965 0.10762267
#> [3,] -0.0039579652 -0.10801784  0.000000000 -0.008388722 0.10255348
#> [4,] -0.0013919956 -0.09076408  0.011536478  0.000000000 0.01075220
#> [5,] -0.0005533104  0.10848253 -0.002160112 -0.011725145 0.00000000

Output$zAEst
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    0    1    0    0    1
#> [2,]    1    0    1    1    1
#> [3,]    0    1    0    0    1
#> [4,]    0    1    0    0    0
#> [5,]    0    1    0    0    0
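As in the smaller example, the estimation error can be quantified against the true A (an illustrative check, our addition):

# Frobenius norm of the estimation error in the many-IV setting
sqrt(sum((Output$AEst - A)^2))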
# Create a layout for multiple plots
par(mfrow = c(1, 2))
# Plot the true causal network
plot(smaller_arrowheads(igraph::graph_from_adjacency_matrix((A != 0) * 1, mode = "directed")),
layout = igraph::layout_in_circle, main = "True Causal Network")
# Plot the estimated causal network
plot(smaller_arrowheads(igraph::graph_from_adjacency_matrix(Output$zAEst, mode = "directed")),
layout = igraph::layout_in_circle, main = "Estimated Causal Network")
Conclusion
Although we have mimicked a real-world setup where there are numerous instrumental variables (IVs), each explaining only a small portion of the trait variance, our approach still yields very promising results. This demonstrates that our method is robust even in complex scenarios with many IVs.
The dimensionality reduction technique we employed, specifically using Principal Component Analysis (PCA) to select the top principal components as IVs, proves to be effective. This approach can be broadly applied to similar problems where dimensionality reduction is necessary. By leveraging PCA or other dimensionality reduction methods, researchers can efficiently manage large sets of IVs and apply our algorithm to gain valuable insights into causal relationships.
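For instance, rather than fixing 20 PCs per block, one could choose the number of components by cumulative variance explained. A minimal sketch (the 80% threshold is an assumption for illustration, not a package recommendation):

# Sketch: pick the smallest number of PCs explaining at least 80% of the
# variance in one SNP block (threshold chosen purely for illustration)
pca = prcomp(X[, 1:num_snps_per_y], center = TRUE, scale. = TRUE)
var_explained = cumsum(pca$sdev^2) / sum(pca$sdev^2)
n_pcs = which(var_explained >= 0.80)[1]
n_pcs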
Reference
Yang Ni, Yuan Ji, Peter Müller (2018). "Reciprocal Graphical Models for Integrative Gene Regulatory Network Analysis." Bayesian Analysis 13(4): 1095-1110. https://doi.org/10.1214/17-BA1087