We can train the model with the caret package (for further information about caret, see the original website). We use parallel computing to speed up the computation.
# parallel computing
library(doParallel)
cl <- makePSOCKcluster(2)
registerDoParallel(cl)

# stop the cluster after finishing the computation
stopCluster(cl)
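The example above hard-codes two workers. On other machines, a common pattern (a sketch, assuming the doParallel and parallel packages are installed) is to size the cluster from the number of available cores:

```r
library(doParallel)

# size the cluster from the machine, leaving one core free
n_cores <- max(1, parallel::detectCores() - 1)
cl <- makePSOCKcluster(n_cores)
registerDoParallel(cl)

# ... train models here; caret picks up the registered backend
# when allowParallel = TRUE ...

stopCluster(cl)
```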
The following example shows how to estimate the ITR with a gradient boosting machine (GBM) using the caret package. Note that we have already loaded the data and specified the treatment, outcome, and covariates as shown in the Sample Splitting vignette. Since we are using the caret package, we need to specify the trainControl and/or tuneGrid arguments. The trainControl argument specifies the cross-validation method and the tuneGrid argument specifies the tuning grid. For more information about these arguments, please refer to the caret website.
We estimate the ITR with only one machine learning algorithm (GBM) and evaluate the ITR with the evaluate_itr() function. To compute PAPDp, we need to supply the algorithms argument with at least two machine learning algorithms.
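For instance, a minimal sketch that would allow PAPDp to be computed pairs GBM with a causal forest (assuming the same data and formula as below; fit_two and est_two are hypothetical object names):

```r
# hypothetical: estimate the ITR with two algorithms so that
# evaluate_itr() can compute the pairwise PAPDp
fit_two <- estimate_itr(
  treatment = "treatment",
  form = user_formula,
  data = star_data,
  algorithms = c("gbm", "causal_forest"),
  budget = 0.2,
  split_ratio = 0.7)
est_two <- evaluate_itr(fit_two)  # PAPDp is computed for the pair
```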
library(evalITR)
library(caret)

# specify the trainControl method
fitControl <- caret::trainControl(
  method = "repeatedcv",
  number = 3,            # 3-fold CV
  repeats = 3,           # repeated 3 times
  search = "grid",       # grid search
  allowParallel = TRUE)  # allow parallel computing

# specify the tuning grid
gbmGrid <- expand.grid(
  interaction.depth = c(5, 9),
  n.trees = (5:10) * 100,
  shrinkage = 0.1,
  n.minobsinnode = 20)

# estimate ITR
fit_caret <- estimate_itr(
  treatment = "treatment",
  form = user_formula,
  trControl = fitControl,
  data = star_data,
  algorithms = c("gbm"),
  budget = 0.2,
  split_ratio = 0.7,
  tuneGrid = gbmGrid,
  verbose = FALSE)
#> Evaluate ITR under sample splitting ...
#> Evaluate ITR under sample splitting ...
# evaluate ITR
<- evaluate_itr(fit_caret)
est_caret #> Cannot compute PAPDp
We can extract the trained model from caret and check the model performance. Other functions from caret can be applied to the trained model.
# extract the final model
caret_model <- fit_caret$estimates$models$gbm
print(caret_model$finalModel)
#> A gradient boosted model with gaussian loss function.
#> 500 iterations were performed.
#> There were 53 predictors of which 50 had non-zero influence.

# check model performance
trellis.par.set(caretTheme())  # set the plotting theme
plot(caret_model)

# heatmap of the tuning results
plot(
  caret_model,
  plotType = "level",
  scales = list(x = list(rot = 90)))
Alternatively, we can train the model with the SuperLearner package (for further information about SuperLearner, see the original website). SuperLearner uses an ensemble method: it takes an optimal weighted average of multiple machine learning algorithms to improve model performance.
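To illustrate the ensemble idea on its own (a standalone sketch with simulated toy data, independent of the ITR workflow), SuperLearner reports the weight it assigns to each candidate learner:

```r
library(SuperLearner)

# simulated toy data (hypothetical, for illustration only)
set.seed(1)
n <- 200
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
Y <- X$x1 + rnorm(n)

# fit the ensemble over two candidate learners
sl <- SuperLearner(Y = Y, X = X, family = gaussian(),
                   SL.library = c("SL.ranger", "SL.glmnet"))

# optimal (non-negative, sum-to-one) weights for each learner
sl$coef
```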
We will compare the performance of the ITRs estimated with causal_forest and SuperLearner.
library(SuperLearner)

fit_sl <- estimate_itr(
  treatment = "treatment",
  form = user_formula,
  data = star_data,
  algorithms = c("causal_forest", "SuperLearner"),
  budget = 0.2,
  split_ratio = 0.7,
  SL_library = c("SL.ranger", "SL.glmnet"))
#> Evaluate ITR under sample splitting ...

est_sl <- evaluate_itr(fit_sl)
# summarize estimates
summary(est_sl)
#> -- PAPE ------------------------------------------------------------------------
#> estimate std.deviation algorithm statistic p.value
#> 1 0.29 0.86 causal_forest 0.34 0.74
#> 2 0.78 1.24 SuperLearner 0.63 0.53
#>
#> -- PAPEp -----------------------------------------------------------------------
#> estimate std.deviation algorithm statistic p.value
#> 1 1.2 1.1 causal_forest 1.13 0.26
#> 2 1.1 1.2 SuperLearner 0.97 0.33
#>
#> -- PAPDp -----------------------------------------------------------------------
#> estimate std.deviation algorithm statistic p.value
#> 1 0.12 1.1 causal_forest x SuperLearner 0.11 0.91
#>
#> -- AUPEC -----------------------------------------------------------------------
#> estimate std.deviation algorithm statistic p.value
#> 1 1.0 0.80 causal_forest 1.3 0.21
#> 2 1.2 0.96 SuperLearner 1.2 0.23
#>
#> -- GATE ------------------------------------------------------------------------
#> estimate std.deviation algorithm group statistic p.value upper lower
#> 1 -180 107 causal_forest 1 -1.68 0.093 -357 -3.8
#> 2 149 108 causal_forest 2 1.38 0.168 -29 326.8
#> 3 30 108 causal_forest 3 0.27 0.785 -148 207.4
#> 4 -36 107 causal_forest 4 -0.33 0.740 -212 141.1
#> 5 40 106 causal_forest 5 0.38 0.705 -134 214.7
#> 6 -116 108 SuperLearner 1 -1.07 0.282 -294 61.6
#> 7 -65 107 SuperLearner 2 -0.61 0.544 -241 111.4
#> 8 104 108 SuperLearner 3 0.96 0.336 -74 282.2
#> 9 17 107 SuperLearner 4 0.16 0.870 -158 193.1
#> 10 62 107 SuperLearner 5 0.58 0.561 -114 238.1
We plot the estimated Area Under the Prescriptive Effect Curve (AUPEC) for the writing score across a range of budget constraints, separately for the two ITRs estimated with causal_forest and SuperLearner.
# plot the AUPEC
plot(est_sl)