Main simulation
Understanding the effect of the elastic net’s \(\alpha\) parameter
We begin with a simulation showing the best-case performance of the elastic net for several values of \(\alpha\).
library(simulator)
<- "elastic-net"
name_of_simulation <- new_simulation(name_of_simulation, "Elastic Nets") %>%
sim generate_model(make_sparse_linear_model_with_corr_design,
n = 100, p = 50, snr = 2, k = 10,
rho = as.list(seq(0, 0.8, length = 6)),
vary_along = "rho") %>%
simulate_from_model(nsim = 3, index = 1:4) %>%
run_method(list_of_elastic_nets,
parallel = list(socket_names = 2, libraries = "glmnet")) %>%
evaluate(list(sqr_err, nnz, best_sqr_err))
In the above code, we consider a sequence of models in which we vary
the correlation rho
among the features. For each model, we
fit a sequence of elastic net methods (varying the tuning parameter
\(\alpha\)). For each method, we
compute the best-case mean-squared error. By best-case, we mean \(\min_{\lambda\ge0}\frac1{p}\|\hat\beta_{\lambda,\alpha}-\beta\|_2^2\),
which imagines we have an oracle-like ability to choose the best \(\lambda\) for minimizing the MSE.
We provide below all the code for the problem-specific components. We
use the R package glmnet
to fit the elastic net. The most distinctive feature of this particular
vignette is how the list of methods list_of_elastic_nets
was created. This is shown in the Methods section.
plot_evals(sim, "nnz", "sqr_err")
The first plot shows the MSE versus sparsity level for each method (parameterized by \(\lambda\)). As expected, we see that when \(\alpha=1\) (pure ridge regression), there is no sparsity. We see that the performance of the methods with \(\alpha<1\) degrades as the correlation among features increases, especially when a lot of features are included in the fitted model.
It is informative to look at how the height of the minimum of each of the above curves varies with \(\rho\).
plot_eval_by(sim, "best_sqr_err", varying = "rho", include_zero = TRUE)
We see that when the correlation between features is low, the methods with some \(\ell_1\) penalty do better than ridge regression. However, as the features become increasingly correlated, a pure ridge penalty becomes better. Of course, none of the methods are doing as well in the high correlation regime (which is reminiscent of the “bet on sparsity principle”).
A side note: the simulator automatically records the computing time of each method as an additional metric:
plot_eval(sim, "time", include_zero = TRUE)
Results for Cross-Validated Elastic Net
We might be reluctant to draw conclusions about the methods based on the oracle-like version that we used above (in which each method on each random draw gets to pick the best possible \(\lambda\) value). We might therefore look at the performance of the methods using cross-validation to select \(\lambda\).
<- sim %>% subset_simulation(methods = "") %>%
sim_cv rename("elastic-net-cv") %>%
relabel("Elastic Nets with CV") %>%
run_method(list_of_elastic_nets + cv,
parallel = list(socket_names = 2, libraries = "glmnet")) %>%
evaluate(list(sqr_err, nnz))
Reassuringly, the relative performance of these methods is largely the same (though we see that all methods’ MSEs are higher).
plot_eval_by(sim_cv, "sqr_err", varying = "rho", include_zero = TRUE)