This package acts as a wrapper to the penalized R package to add the following functionality to that package:
It also provides a function for simulation of collinear high-dimensional data with survival or binary response.
This package was developed in support of the study by Waldron et al. (2011).
This paper contains greater detail on proper application of the methods
provided here. Please cite this paper when using the pensim package in
your research, as well as the penalized package (Goeman (2010)).
pensim provides example data from a microarray experiment investigating survival of cancer patients with lung adenocarcinomas (Beer et al. (2002)). Load this data and do an initial pre-filter of genes with low IQR:
library(pensim)
data(beer.exprs)
data(beer.survival)
##select just 100 genes to speed computation, just for the sake of example:
set.seed(1)
beer.exprs.sample <- beer.exprs[sample(1:nrow(beer.exprs), 100),]
#
gene.quant <- apply(beer.exprs.sample, 1, quantile, probs = 0.75)
dat.filt <- beer.exprs.sample[gene.quant > log2(100),]
gene.iqr <- apply(dat.filt, 1, IQR)
dat.filt <- as.matrix(dat.filt[gene.iqr > 0.5,])
dat.filt <- t(dat.filt)
dat.filt <- data.frame(dat.filt)
#
library(survival)
surv.obj <- Surv(beer.survival$os, beer.survival$status)
Note that the expression data are in “wide” format, with one column per predictor (gene). It is recommended to put covariate data in a dataframe object, rather than a matrix.
Unbiased estimation of prediction accuracy involves two levels of cross-validation: an outer level for estimating prediction accuracy, and an inner level for model tuning. This process is simplified by the opt.nested.crossval function.
It is recommended first to establish the arguments for the penalized
regression by testing on the penalized package functions
optL1
(for LASSO), optL2
(for Ridge) or
cvl
(for Elastic Net). Here we use LASSO. Setting
maxlambda1=5
is not a generally recommended procedure, but
is useful in this toy example to avoid converging on the null model.
library(penalized)
## Warning: package 'penalized' was built under R version 4.2.2
## Welcome to penalized. For extended examples, see vignette("penalized").
testfit <- optL1(
response = surv.obj,
penalized = dat.filt,
fold = 5,
maxlambda1 = 5,
positive = FALSE,
standardize = TRUE,
trace = FALSE
)
Now pass these arguments to opt.nested.crossval()
for
cross-validated calculation and assessment of risk scores, with the
additional arguments:
outerfold
and nprocessors
: number of folds
for the outer level of cross-validation, and the number of processors to
use for the outer level of cross-validation (see
?opt.nested.crossval
)opt1D
for LASSO or Ridge, opt2D
for Elastic
Net) - see ?opt.splitval
.nsim
defines the number of times to repeat tuning (see
?opt1D
. opt2D
has different required
arguments.)set.seed(1)
preds <-
opt.nested.crossval(
outerfold = 5,
nprocessors = 1,
#opt.nested.crossval arguments
optFUN = "opt1D",
scaling = FALSE,
#opt.splitval arguments
setpen = "L1",
nsim = 1,
#opt1D arguments
response = surv.obj,
#rest are penalized::optl1 arguments
penalized = dat.filt,
fold = 5,
positive = FALSE,
standardize = TRUE,
trace = FALSE
)
Ideally nsim would be 50, and outerfold and fold would be 10, but the
values below speed computation 200x compared to these recommended
values. Note that here we are using the standardize=TRUE
argument of optL1
rather than the scaling=TRUE
argument of opt.splitval. These two approaches to scaling are roughly
equivalent, but the scaling approaches are not the same
(scaling=TRUE
does z-score, standardize=TRUE
scales to unit central L2 norm), and results will not be identical.
Also, using standardize=TRUE
scales variables but provides
coeffients for the original scale, whereas using scaling=TRUE scales
variables in the training set then applies the same scales to the test
set.
Cox fit on the continuous risk predictions:
coxfit.continuous <- coxph(surv.obj~preds)
summary(coxfit.continuous)
## Call:
## coxph(formula = surv.obj ~ preds)
##
## n= 86, number of events= 24
##
## coef exp(coef) se(coef) z Pr(>|z|)
## preds 0.01639 1.01653 0.07769 0.211 0.833
##
## exp(coef) exp(-coef) lower .95 upper .95
## preds 1.017 0.9837 0.873 1.184
##
## Concordance= 0.543 (se = 0.057 )
## Likelihood ratio test= 0.04 on 1 df, p=0.8
## Wald test = 0.04 on 1 df, p=0.8
## Score (logrank) test = 0.04 on 1 df, p=0.8
Dichotomize the cross-validated risk predictions at the median, for visualization:
preds.dichot <- preds > median(preds)
Plot the ROC curve:
nobs <- length(preds)
cutoff <- 12
if (requireNamespace("survivalROC", quietly = TRUE)) {
preds.roc <-
survivalROC::survivalROC(
Stime = beer.survival$os,
status = beer.survival$status,
marker = preds,
predict.time = cutoff,
span = 0.01 * nobs ^ (-0.20)
)
plot(
preds.roc$FP,
preds.roc$TP,
type = "l",
xlim = c(0, 1),
ylim = c(0, 1),
xlab = paste("FP", "\n", "AUC = ", round(preds.roc$AUC, 3)),
lty = 2,
ylab = "TP",
main = "LASSO predictions\n ROC curve at 12 months"
)
abline(0, 1)
}
Finally, we can get coefficients for the model fit on all the data,
for future use. Note that nsim should ideally be greater than 1, to
train the model using multiple foldings for cross-validation. The output
of opt1D
or opt2D
will be a matrix with one
row per simulation. The default behavior in
opt.nested.crossval()
is to take the simulation with
highest cross-validated partial log likelihood (CVL),
which is the recommended way to select a model from the multiple
simulations.
beer.coefs <- opt1D(
setpen = "L1",
nsim = 1,
response = surv.obj,
penalized = dat.filt,
fold = 5,
maxlambda1 = 5,
positive = FALSE,
standardize = TRUE,
trace = FALSE
)
We can also include unpenalized covariates, if desired. Note that
when keeping only one variable for a penalized or unpenalized covariate,
indexing a dataframe like [1]
instead of doing
[, 1]
preserves the variable name. With [, 1]
the variable name gets converted to ““.
beer.coefs.unpen <-
opt1D(
setpen = "L1",
nsim = 1,
response = surv.obj,
penalized = dat.filt[-1],
# This is equivalent to dat.filt[,-1]
unpenalized = dat.filt[1],
fold = 5,
maxlambda1 = 5,
positive = FALSE,
standardize = TRUE,
trace = FALSE
)
Note the non-zero first coefficient this time, due to it being unpenalized:
beer.coefs[1, 1:5] #example output with no unpenalized covariates
## L1 cvl J04130_s_at U20758_rna1_at L11672_r_at
## 4.99896033 -115.19258054 0.00000000 0.09998177 -0.10312924
beer.coefs.unpen[1, 1:5] #example output with first covariate unpenalized
## L1 cvl J04130_s_at U20758_rna1_at L11672_r_at
## 4.99906823 -117.76317249 -0.15193598 0.10888431 -0.03394335
The pensim also provides a convenient means to simulation high-dimensional expression data with (potentially censored) survival outcome or binary outcome which is dependent on specified covariates.
First, generate the data. Here we simulate 20 variables. The first 15
(group “a”) are uncorrelated, and have no association
with outcome. The final five (group “b”) have
covariance of 0.8 to each other variable in that group. The response
variable is associated with the first variable group “b”
(firstonly=TRUE
) with a coefficient of 2.
Binary outcomes for \(n_s= 50\) samples are simulated as a Bernoulli distribution with probability for patient s:
\[\begin{equation} p_{s} =\frac{1}{1 + exp(-\beta X_{s})} \end{equation}\]
with \(\beta_{s,16} = 0.5\) and all other \(\beta_{s,i}\) equal to zero.
The code for this simulation is as follows:
set.seed(9)
x <- create.data(
nvars = c(15, 5),
cors = c(0, 0.8),
associations = c(0, 2),
firstonly = c(TRUE, TRUE),
nsamples = 50,
response = "binary",
logisticintercept = 0.5
)
Take a look at the simulated data:
summary(x)
## Length Class Mode
## summary 6 data.frame list
## associations 20 -none- numeric
## covariance 400 -none- numeric
## data 21 data.frame list
x$summary
## start end cors associations num firstonly
## a 1 15 0.0 0 15 TRUE
## b 16 20 0.8 2 5 TRUE
A simple logistic model fails at variable selection in this case:
simplemodel <- glm(outcome ~ ., data = x$data, family = binomial)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(simplemodel)
##
## Call:
## glm(formula = outcome ~ ., family = binomial, data = x$data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.155e-05 -2.100e-08 2.100e-08 2.100e-08 7.017e-05
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 146.045 28566.445 0.005 0.996
## a.1 205.225 36863.342 0.006 0.996
## a.2 -350.824 60613.163 -0.006 0.995
## a.3 26.278 9549.552 0.003 0.998
## a.4 -221.866 39113.234 -0.006 0.995
## a.5 95.717 24911.161 0.004 0.997
## a.6 36.924 12817.964 0.003 0.998
## a.7 -246.090 44715.891 -0.006 0.996
## a.8 65.058 13128.095 0.005 0.996
## a.9 -316.993 58214.882 -0.005 0.996
## a.10 4.299 20688.797 0.000 1.000
## a.11 -218.757 39199.270 -0.006 0.996
## a.12 63.066 20305.078 0.003 0.998
## a.13 54.937 11733.007 0.005 0.996
## a.14 -112.590 30198.194 -0.004 0.997
## a.15 398.687 74940.068 0.005 0.996
## b.1 498.778 88875.945 0.006 0.996
## b.2 -297.858 56626.061 -0.005 0.996
## b.3 -677.344 116466.624 -0.006 0.995
## b.4 188.575 39169.971 0.005 0.996
## b.5 699.090 123792.675 0.006 0.995
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6.5342e+01 on 49 degrees of freedom
## Residual deviance: 6.6338e-08 on 29 degrees of freedom
## AIC: 42
##
## Number of Fisher Scoring iterations: 25
But LASSO does a better job, selecting several of the collinear variables in the “b” group of variables which are associated with outcome:
lassofit <-
opt1D(
nsim = 3,
nprocessors = 1,
setpen = "L1",
penalized = x$data[1:20],
response = x$data[, "outcome"],
trace = FALSE,
fold = 10
)
print(lassofit)
## L1 cvl (Intercept) a.1 a.2 a.3 a.4 a.5 a.6 a.7 a.8 a.9 a.10
## [1,] 3.436910 -27.56252 0.6389925 0 0 0 0 0 0 0 0 0 0
## [2,] 4.167477 -27.70431 0.6255205 0 0 0 0 0 0 0 0 0 0
## [3,] 3.949795 -28.37533 0.6291330 0 0 0 0 0 0 0 0 0 0
## a.11 a.12 a.13 a.14 a.15 b.1 b.2 b.3 b.4 b.5
## [1,] 0 0 0 0 0.0128977 0.9981297 0 0 0 0
## [2,] 0 0 0 0 0.0000000 0.8915954 0 0 0 0
## [3,] 0 0 0 0 0.0000000 0.9227009 0 0 0 0
And visualize the data as a heatmap:
dat <- t(as.matrix(x$data[,-match("outcome", colnames(x$data))]))
heatmap(dat, ColSideColors = ifelse(x$data$outcome == 0, "black", "white"))
We simulate these data in the same way, but with
response="timetoevent"
. Here censoring is uniform random
between times 2 and 10, generating approximately 34% censoring:
set.seed(1)
x <- create.data(
nvars = c(15, 5),
cors = c(0, 0.8),
associations = c(0, 0.5),
firstonly = c(TRUE, TRUE),
nsamples = 50,
censoring = c(2, 10),
response = "timetoevent"
)
How many events are censored?
sum(x$data$cens == 0) / nrow(x$data)
## [1] 0.42
Kaplan-Meier plot of this simulated cohort:
library(survival)
surv.obj <- Surv(x$data$time, x$data$cens)
plot(survfit(surv.obj ~ 1), ylab = "Survival probability", xlab = "time")
sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] penalized_0.9-52 survival_3.3-1 pensim_1.3.6
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.9 knitr_1.39 magrittr_2.0.3 MASS_7.3-58.1
## [5] splines_4.2.1 lattice_0.20-45 R6_2.5.1 rlang_1.0.4
## [9] fastmap_1.1.0 highr_0.9 stringr_1.4.0 tools_4.2.1
## [13] parallel_4.2.1 grid_4.2.1 xfun_0.31 cli_3.3.0
## [17] jquerylib_0.1.4 htmltools_0.5.3 yaml_2.3.5 digest_0.6.29
## [21] Matrix_1.4-1 sass_0.4.4 cachem_1.0.6 evaluate_0.15
## [25] rmarkdown_2.18 stringi_1.7.8 compiler_4.2.1 bslib_0.4.1
## [29] jsonlite_1.8.0