The T-Rex selector performs fast variable/feature selection in large-scale high-dimensional settings. It provably controls the false discovery rate (FDR), i.e., the expected fraction of false positives among all selected variables, at a user-defined target level. In addition to controlling the FDR, it achieves a high true positive rate (TPR), i.e., power, by maximizing the number of selected variables subject to that FDR constraint. To do so, it performs terminated-random experiments (T-Rex) using the T-LARS algorithm (implemented in the R package ‘tlars’) and fuses the selected active sets of all random experiments to obtain a final set of selected variables. The T-Rex selector can be applied in fields such as genomics and financial engineering, and more generally wherever a fast, FDR-controlling variable/feature selection method is needed for large-scale high-dimensional data (see, e.g., [1]–[9]).
\[ \DeclareMathOperator{\FDP}{FDP} \DeclareMathOperator{\FDR}{FDR} \DeclareMathOperator{\TPP}{TPP} \DeclareMathOperator{\TPR}{TPR} \newcommand{\A}{\mathcal{A}} \newcommand{\X}{\boldsymbol{X}} \newcommand{\XWK}{\boldsymbol{\tilde{X}}} \newcommand{\C}{\mathcal{C}} \newcommand{\coloneqq}{\mathrel{\vcenter{:}}=} \]
Before installing the ‘TRexSelector’ package, you need to install the required ‘tlars’ package. You can install the ‘tlars’ package from CRAN (stable version) or GitHub (developer version) with:
# Option 1: Install stable version from CRAN
install.packages("tlars")
# Option 2: install developer version from GitHub
install.packages("devtools")
devtools::install_github("jasinmachkour/tlars")
Then, you can install the ‘TRexSelector’ package from CRAN (stable version) or GitHub (developer version) with:
# Option 1: Install stable version from CRAN
install.packages("TRexSelector")
# Option 2: install developer version from GitHub
install.packages("devtools")
devtools::install_github("jasinmachkour/TRexSelector")
You can open the help pages with:
library(TRexSelector)
help(package = "TRexSelector")
?trex
?random_experiments
?lm_dummy
?add_dummies
?add_dummies_GVS
?FDP
?TPP
# etc.
To cite the package ‘TRexSelector’ in publications use:
citation("TRexSelector")
This section illustrates the basic usage of the ‘TRexSelector’ package to perform FDR-controlled variable selection in large-scale high-dimensional settings based on the T-Rex selector.
library(TRexSelector)
# Setup
n <- 75 # number of observations
p <- 150 # number of variables
num_act <- 3 # number of true active variables
beta <- c(rep(1, times = num_act), rep(0, times = p - num_act)) # coefficient vector
true_actives <- which(beta > 0) # indices of true active variables
num_dummies <- p # number of dummy predictors (also referred to as dummies)
# Generate Gaussian data
set.seed(123)
X <- matrix(stats::rnorm(n * p), nrow = n, ncol = p)
y <- X %*% beta + stats::rnorm(n)
# Seed
set.seed(1234)
# Numerical zero
eps <- .Machine$double.eps
# Variable selection via T-Rex
res <- trex(X = X, y = y, tFDR = 0.05, verbose = FALSE)
selected_var <- which(res$selected_var > eps)
paste0("True active variables: ", paste(as.character(true_actives), collapse = ", "))
#> [1] "True active variables: 1, 2, 3"
paste0("Selected variables: ", paste(as.character(selected_var), collapse = ", "))
#> [1] "Selected variables: 1, 2, 3"
So, for a preset target FDR of 5%, the T-Rex selector has selected all true active variables and there are no false positives in this example.
Note that users have to choose the target FDR according to the requirements of their specific applications.
We give a mathematical definition of two important metrics in variable selection, i.e., the false discovery rate (FDR) and the true positive rate (TPR):
Definitions (FDR and TPR) Let \(\widehat{\A} \subseteq \lbrace 1, \ldots, p \rbrace\) be the set of selected variables, \(\A \subseteq \lbrace 1, \ldots, p \rbrace\) the set of true active variables, \(| \widehat{\A} |\) the cardinality of \(\widehat{\A}\), and define \(1 \lor a \coloneqq \max\lbrace 1, a \rbrace\), \(a \in \mathbb{R}\). Then, the false discovery rate (FDR) and the true positive rate (TPR) are defined by \[ \FDR \coloneqq \mathbb{E} \big[ \FDP \big] \coloneqq \mathbb{E} \left[ \dfrac{\big| \widehat{\A} \backslash \A \big|}{1 \lor \big| \widehat{\A} \big|} \right] \] and
\[ \TPR \coloneqq \mathbb{E} \big[ \TPP \big] \coloneqq \mathbb{E} \left[ \dfrac{| \A \cap \widehat{\A} |}{1 \lor | \A |} \right], \] respectively. Ideally, \(\FDR = 0\) and \(\TPR = 1\). In practice, this is not always achievable. Therefore, the FDR is controlled at a sufficiently low level, while the TPR is maximized.
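To make the definitions concrete, the following minimal sketch computes the FDP and TPP of a single selection directly from index sets, using base R only. The helper names `fdp_from_sets` and `tpp_from_sets` and the toy index sets are hypothetical and chosen for illustration; they are not part of the ‘TRexSelector’ package.

```r
# Hypothetical helpers (not part of the package API): FDP and TPP of one
# selection, computed from index sets as in the definitions above.
fdp_from_sets <- function(selected, true_actives) {
  # |selected \ true_actives| / max(1, |selected|)
  length(setdiff(selected, true_actives)) / max(1, length(selected))
}

tpp_from_sets <- function(selected, true_actives) {
  # |true_actives ∩ selected| / max(1, |true_actives|)
  length(intersect(selected, true_actives)) / max(1, length(true_actives))
}

# Toy index sets (made up for illustration only)
true_actives <- c(1, 2, 3)
selected <- c(1, 2, 7, 9)

fdp <- fdp_from_sets(selected, true_actives) # 2 false positives among 4 selections
tpp <- tpp_from_sets(selected, true_actives) # 2 of 3 true actives recovered

# The "1 v a" convention avoids division by zero for an empty selection
fdp_empty <- fdp_from_sets(integer(0), true_actives)
```

Averaging such FDP and TPP values over many independent data sets yields Monte Carlo estimates of the FDR and TPR.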
Let us have a look at the behavior of the T-Rex selector for different choices of the target FDR. We conduct Monte Carlo simulations and plot the resulting averaged false discovery proportions (FDP) and true positive proportions (TPP) over the target FDR. Note that the averaged FDP and TPP are estimates of the FDR and TPR, respectively:
# Computations might take up to 10 minutes... Please wait...
# Numerical zero
eps <- .Machine$double.eps
# Seed
set.seed(1234)
# Setup
n <- 100 # number of observations
p <- 150 # number of variables
# Parameters
num_act <- 10 # number of true active variables
beta <- rep(0, times = p) # coefficient vector (all zeros first)
beta[sample(seq(p), size = num_act, replace = FALSE)] <- 1 # coefficient vector (active variables with non-zero coefficients)
true_actives <- which(beta > 0) # indices of true active variables
tFDR_vec <- c(0.1, 0.15, 0.2, 0.25) # target FDR levels
MC <- 100 # number of Monte Carlo runs per target FDR level
# Initialize results matrices (one row per Monte Carlo run, one column per target FDR level)
FDP <- matrix(NA, nrow = MC, ncol = length(tFDR_vec))
TPP <- matrix(NA, nrow = MC, ncol = length(tFDR_vec))
# Run simulations
for (t in seq_along(tFDR_vec)) {
  for (mc in seq(MC)) {
    # Generate Gaussian data
    X <- matrix(stats::rnorm(n * p), nrow = n, ncol = p)
    y <- X %*% beta + stats::rnorm(n)
    # Run T-Rex selector
    res <- trex(X = X, y = y, tFDR = tFDR_vec[t], verbose = FALSE)
    selected_var <- which(res$selected_var > eps)
    # Results
    FDP[mc, t] <- length(setdiff(selected_var, true_actives)) / max(1, length(selected_var))
    TPP[mc, t] <- length(intersect(selected_var, true_actives)) / max(1, length(true_actives))
  }
}
# Compute estimates of FDR and TPR by averaging FDP and TPP over MC Monte Carlo runs
FDR <- colMeans(FDP)
TPR <- colMeans(TPP)
We observe that the T-Rex selector controls the FDR at every target level considered: the estimated FDR (green line) stays below the red dashed reference line, which marks the maximum allowed value of the FDR. For more details and discussions, we refer the interested reader to the T-Rex paper [1].
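The simulation results can be visualized along the following lines. This is a minimal sketch using base R graphics, not the plotting code used to produce the original figure. The helper `plot_fdr_tpr` is a hypothetical name introduced here; it expects the target FDR levels and the corresponding FDR and TPR estimates as numeric vectors of equal length.

```r
# Hypothetical helper (not part of 'TRexSelector'): plot estimated FDR and TPR
# against the target FDR using base R graphics.
plot_fdr_tpr <- function(tFDR_vec, FDR, TPR) {
  stopifnot(length(tFDR_vec) == length(FDR), length(tFDR_vec) == length(TPR))

  # Two panels side by side; restore the graphics settings on exit
  old_par <- graphics::par(mfrow = c(1, 2))
  on.exit(graphics::par(old_par))

  # Left panel: estimated FDR (green) and the reference line FDR = target FDR (red, dashed)
  graphics::plot(tFDR_vec, FDR,
    type = "b", col = "darkgreen", pch = 19,
    xlab = "Target FDR", ylab = "Estimated FDR",
    ylim = c(0, max(tFDR_vec, FDR))
  )
  graphics::lines(tFDR_vec, tFDR_vec, col = "red", lty = 2)

  # Right panel: estimated TPR
  graphics::plot(tFDR_vec, TPR,
    type = "b", col = "darkgreen", pch = 19,
    xlab = "Target FDR", ylab = "Estimated TPR",
    ylim = c(0, 1)
  )

  # Invisibly report, per target level, whether the estimate is at or below the target
  invisible(FDR <= tFDR_vec)
}
```

After the Monte Carlo simulation has finished, calling `plot_fdr_tpr(tFDR_vec, FDR, TPR)` draws both panels. The invisibly returned logical vector indicates for which target levels the estimated FDR does not exceed the target.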
The general steps that define the framework are illustrated in Figure 1. The key idea is to design randomized controlled experiments where fake variables, so-called dummies, act as a negative control group in the variable selection process.
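To illustrate the idea of dummies as a negative control group, the following minimal sketch appends randomly generated dummy columns to a predictor matrix using base R. It assumes i.i.d. standard normal dummies and as many dummies as original variables; both are assumptions of this example (see [1] for the actual conditions and recommended choices). The package provides its own function for this step (see `?add_dummies`), which is not used here.

```r
# Minimal sketch, assuming i.i.d. standard normal dummies (assumption of this example)
set.seed(1)
n <- 75  # number of observations
p <- 150 # number of original variables
L <- p   # number of dummies (hypothetical choice: as many dummies as variables)

X <- matrix(stats::rnorm(n * p), nrow = n, ncol = p)       # original predictors
dummies <- matrix(stats::rnorm(n * L), nrow = n, ncol = L) # fake variables, independent of any response
X_enlarged <- cbind(X, dummies)                            # enlarged predictor matrix

# Columns 1, ..., p hold the original variables; columns p + 1, ..., p + L hold the dummies
dummy_idx <- seq(p + 1, p + L)
dim(X_enlarged)
```

Because the dummies are generated independently of the response, any dummy that enters the model during forward selection is a false positive by construction, which is what makes the dummies usable as a negative control.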
Within the framework, a total of \(K\) random experiments with independently generated dummy matrices are conducted. Figure 2 shows the structure of the enlarged predictor matrix. Without loss of generality, true active variables (green), non-active (null) variables (red), and dummies (yellow) are illustrated as blocks within the predictor matrix. Note that this arrangement is for visualization purposes only; in practice, the active and null variables are interspersed. In each random experiment, the dummy variables (yellow) compete with the given input variables in \(\X\) (green and red) to be included by a forward variable selection method, such as the LARS algorithm [10], the Lasso [11], or the elastic net [12]. The solution path of each random experiment is terminated early, as soon as a pre-defined number \(T\) of dummies has been included in the model. This results in the \(K\) candidate sets \(\C_{1, L}(T), \ldots, \C_{K, L}(T)\), where \(L\) denotes the number of dummies. The early stopping drastically reduces the computation time for sparse problems, where continuing the forward selection beyond some point only leads to including more null variables. Finally, a voting scheme is applied to the candidate sets, which yields the final active set \(\widehat{\A}_{L}(v^{*}, T^{*})\). As detailed in [1], the calibration process ensures that the FDR is controlled at the user-defined level \(\alpha\) while maximizing the TPR. It does so by determining the optimal voting level \(v^{*}\) and the optimal number of included dummies \(T^{*}\) after which the forward selection process is terminated.
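The voting step can be sketched as follows. This minimal example assumes that a variable is selected when its relative occurrence across the \(K\) candidate sets exceeds a fixed voting level \(v\). The candidate sets, the value of `v`, and the strict inequality are assumptions made for illustration; in the actual method, \(v^{*}\) and \(T^{*}\) are determined by the calibration process described in [1].

```r
# Hypothetical toy candidate sets from K = 4 random experiments (made up for
# illustration; in practice they result from the terminated forward selection)
p <- 6
candidate_sets <- list(c(1, 2, 3), c(1, 2, 5), c(1, 3), c(1, 2, 3, 6))
K <- length(candidate_sets)

# Relative occurrence: fraction of random experiments in which each variable was selected
counts <- tabulate(unlist(candidate_sets), nbins = p)
rel_occ <- counts / K

# Voting: keep variables whose relative occurrence exceeds the voting level v
v <- 0.5 # fixed here for illustration; not the calibrated v*
selected <- which(rel_occ > v)
selected
```

In this toy example, variables 1, 2, and 3 appear in at least three of the four candidate sets and are therefore kept, whereas variables 5 and 6 appear only once and are discarded.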
For a more detailed description of Figures 1 and 2 and more details on the T-Rex selector in general, we refer the interested reader to the original paper [1].