This vignette of package maxEff
(Github, RPubs) documents three
methods of selecting, from many candidate numeric predictors, the one
whose addition to a regression model yields the maximum effect size.
The three methods are implemented in the functions
add_num(), add_dummy() and add_dummy_partition().
Examples in this vignette require that the relevant packages are
attached to the search path.
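As a hedged setup sketch, inferred from the functions called unqualified in the examples below (rpart(), coxph(), Surv(), aggregate_quantile(), add_num(), etc.) rather than stated explicitly by the authors, the search path would need at least:

```r
# Packages whose functions are used unqualified in this vignette
# (inferred from usage; adjust as needed):
library(maxEff)
library(groupedHyperframe)  # Ki67 data; aggregate_quantile()
library(survival)           # coxph(); Surv()
library(rpart)              # rpart()
```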
Users should remove the parameter mc.cores = 1L
from all
examples and use the default option, which engages all CPU cores on the
current host on macOS. The authors are forced to use
mc.cores = 1L
in this vignette in order to pass
CRAN’s submission check.
Term / Abbreviation | Description | Reference |
---|---|---|
`\|>` | Forward pipe operator | ?base::pipeOp, introduced in R 4.1.0 |
abs | Absolute value | base::abs |
coxph | Cox model | survival::coxph |
CRAN, R | The Comprehensive R Archive Network | https://cran.r-project.org |
factor | Factor, or categorical variable | base::factor |
function | R function | base::`function` |
groupedHyperframe | Grouped hyper data frame | groupedHyperframe::as.groupedHyperframe |
head | First parts of an object | utils::head; utils:::head.default |
hypercolumns, hyperframe | (Hyper columns of) hyper data frame | spatstat.geom::hyperframe |
labels | Labels from object | base::labels; maxEff::labels.node1 |
levels | Levels of a factor | base::levels |
listof | List of objects | stats::listof |
logistic | Logistic regression model | stats::glm(., family = binomial('logit')) |
matrix | Matrix | base::matrix |
partition | Stratified partition | maxEff::statusPartition, caret::createDataPartition |
PFS | Progression/recurrence-free survival | https://en.wikipedia.org/wiki/Progression-free_survival |
predict | Model prediction | stats::predict; maxEff::predict.add_num; maxEff::predict.add_dummy |
quantile | Quantile | stats::quantile |
rpart | Recursive partitioning and regression trees | rpart::rpart |
S3, generic, methods | S3 object-oriented system | base::UseMethod; utils::methods; utils::getS3method; https://adv-r.hadley.nz/s3.html |
sort_by | Sort an object by some criterion | base::sort_by; maxEff::sort_by.add_ |
subset | Subsets of object by conditions | base::subset; maxEff::subset.add_dummy |
Surv | Survival object | survival::Surv |
update | Update and re-fit a model call | stats::update |
This work is supported by NCI R01CA222847 (I. Chervoneva, T. Zhan, and H. Rui) and R01CA253977 (H. Rui and I. Chervoneva).
node1(): Dichotomization via First Node of Recursive Partitioning

Consider the following recursive partitioning and regression example,
data(cu.summary, package = 'rpart')
(r = rpart(Price ~ Mileage, data = cu.summary, cp = .Machine$double.eps, maxdepth = 1L))
#> n=60 (57 observations deleted due to missingness)
#>
#> node), split, n, deviance, yval
#> * denotes terminal node
#>
#> 1) root 60 983551500 12615.67
#> 2) Mileage>=24.5 25 206417200 9730.16 *
#> 3) Mileage< 24.5 35 420299300 14676.74 *
Function node1() creates a dichotomizing rule, i.e., a function, based
on the first node of the recursive partitioning,
(foo = r |> node1())
#>
#> Dichotomizing Rule based on Recursive Partitioning:
#>
#> function (newx)
#> {
#> ret <- (newx >= 24.5)
#> if (all(ret, na.rm = TRUE) || !any(ret, na.rm = TRUE))
#> warning("Dichotomized value is all-0 or all-1")
#> return(ret)
#> }
#> <environment: 0x125350040>
The dichotomizing rule foo converts a numeric object to a logical
object. Developers may obtain the numeric cutoff value of foo, or a
brief text describing foo, for further analysis.
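Conceptually, such a dichotomizing rule is a closure over the first-node cutoff. Below is a minimal base-R sketch of this idea; make_rule() is a hypothetical helper (not part of maxEff), and the cutoff 24.5 mirrors the first split shown above.

```r
# A node1()-style rule sketched as a closure over a cutoff.
# make_rule() is hypothetical; 24.5 mirrors the first split above.
make_rule <- function(cutoff) {
  function(newx) {
    ret <- (newx >= cutoff)
    if (all(ret, na.rm = TRUE) || !any(ret, na.rm = TRUE))
      warning("Dichotomized value is all-0 or all-1")
    ret
  }
}
foo_sketch <- make_rule(24.5)
foo_sketch(c(20, 30))
#> [1] FALSE  TRUE
```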
statusPartition(): Stratified Partition by Survival Status

Function statusPartition() performs a stratified partition based on the
survival status of a Surv response, to avoid the situation where a Cox
model is degenerate because all subjects are censored. This is a
deviation from the very popular function caret::createDataPartition(),
which stratifies a Surv response by the quantiles of the survival time.
y = (survival::capacitor) |>
with(expr = Surv(time, status))
set.seed(15); id = y |>
statusPartition(times = 1L, p = .5)
table(y[id[[1L]], 2L]) / table(y[,2L]) # balanced by status
#>
#> 0 1
#> 0.5 0.5
set.seed(15); id0 = y |>
caret::createDataPartition(times = 1L, p = .5)
table(y[id0[[1L]], 2L]) / table(y[,2L]) # *not* balanced by status
#>
#> 0 1
#> 0.46875 0.56250
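The mechanism can be sketched in base R: sample within each level of the status indicator, so that each stratum contributes the same proportion. This is a minimal sketch with hypothetical data, not the implementation of statusPartition().

```r
# Status-stratified 50% partition sketched in base R:
# sample half of the indices within each stratum of `status`.
status <- rep(c(0, 1), times = c(32, 32))  # hypothetical censoring status
set.seed(15)
id <- unlist(lapply(split(seq_along(status), status),
                    function(i) sample(i, size = length(i) %/% 2L)))
table(status[id]) / table(status)  # balanced by construction
#> 
#>   0   1 
#> 0.5 0.5
```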
Data set Ki67 in package groupedHyperframe is a groupedHyperframe.
data(Ki67, package = 'groupedHyperframe')
Ki67
#> Grouped Hyperframe: ~patientID/tissueID
#>
#> 645 tissueID nested in
#> 622 patientID
#>
#> logKi67 tissueID Tstage PFS recfreesurv_mon recurrence adj_rad adj_chemo
#> 1 (numeric) TJUe_I17 2 100+ 100 0 FALSE FALSE
#> 2 (numeric) TJUe_G17 1 22 22 1 FALSE FALSE
#> 3 (numeric) TJUe_F17 1 99+ 99 0 FALSE NA
#> 4 (numeric) TJUe_D17 1 99+ 99 0 FALSE TRUE
#> 5 (numeric) TJUe_J18 1 112 112 1 TRUE TRUE
#> 6 (numeric) TJUe_N17 4 12 12 1 TRUE FALSE
#> 7 (numeric) TJUe_J17 2 64+ 64 0 FALSE FALSE
#> 8 (numeric) TJUe_F19 2 56+ 56 0 FALSE FALSE
#> 9 (numeric) TJUe_P19 2 79+ 79 0 FALSE FALSE
#> 10 (numeric) TJUe_O19 2 26 26 1 FALSE TRUE
#> histology Her2 HR node race age patientID
#> 1 3 TRUE TRUE TRUE White 66 PT00037
#> 2 3 FALSE TRUE FALSE Black 42 PT00039
#> 3 3 FALSE TRUE FALSE White 60 PT00040
#> 4 3 FALSE TRUE TRUE White 53 PT00042
#> 5 3 FALSE TRUE TRUE White 52 PT00054
#> 6 2 TRUE TRUE TRUE Black 51 PT00059
#> 7 3 FALSE TRUE TRUE Asian 50 PT00062
#> 8 2 TRUE TRUE TRUE White 37 PT00068
#> 9 3 TRUE TRUE FALSE White 68 PT00082
#> 10 2 TRUE TRUE FALSE Black 55 PT00084
s = Ki67 |>
aggregate_quantile(by = ~ patientID, probs = seq.int(from = .01, to = .99, by = .01))
#> Column(s) tissueID removed; as they are not identical per aggregation-group
Candidates of additional predictors are stored in the
numeric-hypercolumn logKi67.quantile. Users are encouraged to learn
more about the groupedHyperframe class and the function
aggregate_quantile() from the vignette of package groupedHyperframe.
Partition into a training (80%) and test (20%) set.
set.seed(234); id = sample.int(n = nrow(s), size = nrow(s)*.8) |> sort.int()
s0 = s[id, , drop = FALSE] # training set
s1 = s[-id, , drop = FALSE] # test set
Let’s consider a starting model of the endpoint PFS with the predictor
Tstage on the training set s0.
summary(m <- coxph(PFS ~ Tstage, data = s0))
#> Warning in as.data.frame.hyperframe(data, optional = TRUE): 1 variable
#> discarded in conversion to data frame
#> Warning in as.data.frame.hyperframe(data, optional = TRUE): 1 variable
#> discarded in conversion to data frame
#> Warning in as.data.frame.hyperframe(data): 1 variable discarded in conversion
#> to data frame
#> Call:
#> coxph(formula = PFS ~ Tstage, data = s0)
#>
#> n= 497, number of events= 90
#>
#> coef exp(coef) se(coef) z Pr(>|z|)
#> Tstage 0.7162 2.0466 0.1020 7.019 2.23e-12 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> exp(coef) exp(-coef) lower .95 upper .95
#> Tstage 2.047 0.4886 1.676 2.5
#>
#> Concordance= 0.679 (se = 0.028 )
#> Likelihood ratio test= 39.61 on 1 df, p=3e-10
#> Wald test = 49.27 on 1 df, p=2e-12
#> Score (logrank) test = 54.96 on 1 df, p=1e-13
numeric Predictor with Maximum Effect Size

Function add_num() treats each additional predictor as a numeric
variable, and updates the starting model with each additional
predictor. Function add_num() returns an 'add_num' object, which is a
listof objects with an internal class 'add_num_'.

The S3 generic sort_by() sorts the 'add_num' object by the absolute
value of the regression coefficient (i.e., effect size) of the
additional numeric predictor.
The S3 generic head() chooses the top n elements from the object
returned by the previous step.
set.seed(14837); m1 = m |>
add_num(x = ~ logKi67.quantile, mc.cores = 1L) |>
sort_by(y = abs(effsize)) |>
head(n = 2L)
m1
#> .slice(logKi67.quantile, "0.99")
#> .slice(logKi67.quantile, "0.98")
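The sort-then-truncate selection above can be sketched in base R with hypothetical coefficients: order the candidates by absolute effect size and keep the top n.

```r
# Hypothetical effect sizes of three candidate predictors:
effsize <- c(x1 = 0.12, x2 = -0.85, x3 = 0.40)
# Sort by absolute value, decreasing, and keep the top 2:
head(effsize[order(abs(effsize), decreasing = TRUE)], n = 2L)
#>    x2    x3 
#> -0.85  0.40
```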
The S3 generic predict() uses model m1 on the test set s1.
m1[1L] |> predict(newdata = s1)
#> .slice(logKi67.quantile, "0.99") :
#> Call:
#> coxph(formula = PFS ~ Tstage + x., data = newd)
#>
#> coef exp(coef) se(coef) z p
#> Tstage 0.4068 1.5020 0.2359 1.724 0.0847
#> x. 0.1114 1.1179 0.1926 0.579 0.5629
#>
#> Likelihood ratio test=3.21 on 2 df, p=0.2008
#> n= 125, number of events= 28
logical Predictor with Maximum Effect Size

add_dummy(): Naive Practice

Function add_dummy() partitions each additional numeric predictor into
a logical variable using function node1(), and updates the starting
model by adding in each of the dichotomized logical predictors.
Function add_dummy() returns an 'add_dummy' object, which is a listof
node1 objects.

The S3 generic subset() subsets the 'add_dummy' object by the balance
of the partition of the additional predictor.
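The balance can be read as the proportion of TRUE values after dichotomization. Below is a minimal sketch with hypothetical data; the name p1 follows the subset(., subset = p1 > .05 & p1 < .95) call in the example further down.

```r
x <- c(1.2, 3.4, 5.6, 2.1, 4.8, 6.3)  # hypothetical numeric predictor
dummy <- (x >= 3)                      # a node1()-style dichotomization
(p1 <- mean(dummy))                    # proportion of TRUE
#> [1] 0.6666667
```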
The S3 generic sort_by() sorts the 'add_dummy' object by the absolute
value of the regression coefficient (i.e., effect size) of the
additional logical predictor.

The S3 generic head() chooses the top n elements from the object
returned by the previous step.
set.seed(14837); m2 = m |>
add_dummy(x = ~ logKi67.quantile, mc.cores = 1L) |>
subset(subset = p1 > .05 & p1 < .95) |>
sort_by(y = abs(effsize)) |>
head(n = 2L)
m2
#> .slice(logKi67.quantile, "0.7")>=6.60981019916946
#> .slice(logKi67.quantile, "0.01")>=6.02789726883965
The S3 generic predict() uses model m2 on the test set s1.
m2[1L] |> predict(newdata = s1)
#> .slice(logKi67.quantile, "0.7")>=6.60981019916946 :
#> Call:
#> coxph(formula = PFS ~ Tstage + x., data = newd)
#>
#> coef exp(coef) se(coef) z p
#> Tstage 0.3644 1.4397 0.2402 1.517 0.1292
#> x.TRUE 0.6600 1.9348 0.3841 1.718 0.0857
#>
#> Likelihood ratio test=5.83 on 2 df, p=0.05415
#> n= 125, number of events= 28
add_dummy_partition(): via Repeated Partitions

Function add_dummy_partition() partitions each additional numeric
predictor into a logical variable in the following steps.

1. Generate a random partition of the data into a training and a test
set, and create a dichotomizing rule (node1()) on the training set.
2. Apply this dichotomizing rule on the test set and obtain the
estimated regression coefficient (i.e., effect size) of the additional
logical predictor.
3. Select, among the repeated partitions, the one with the median
effect size of the additional logical predictor.

Function add_dummy_partition() also returns an 'add_dummy' object.
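Assuming the representative partition is chosen as the one whose effect size is the median across the repeated partitions (an assumption of this sketch, not a statement of the package internals), the selection step can be illustrated in base R:

```r
# Hypothetical effect sizes from five repeated partitions:
effsize <- c(0.31, 0.52, 0.44, 0.60, 0.28)
# Index of the partition at the median effect size:
which(effsize == median(effsize))
#> [1] 3
```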
set.seed(14837); m3 = m |>
add_dummy_partition(~ logKi67.quantile, times = 20L, mc.cores = 1L) |>
subset(subset = p1 > .15 & p1 < .85) |>
sort_by(y = abs(effsize), decreasing = TRUE) |>
head(n = 2L)
m3
#> .slice(logKi67.quantile, "0.14")>=6.26711450931504
#> .slice(logKi67.quantile, "0.26")>=6.38090570405786
m3[1L] |> predict(newdata = s1)
#> .slice(logKi67.quantile, "0.14")>=6.26711450931504 :
#> Call:
#> coxph(formula = PFS ~ Tstage + x., data = newd)
#>
#> coef exp(coef) se(coef) z p
#> Tstage 0.3912 1.4788 0.2348 1.666 0.0956
#> x.TRUE 0.3783 1.4598 0.3918 0.966 0.3343
#>
#> Likelihood ratio test=3.79 on 2 df, p=0.1505
#> n= 125, number of events= 28