To deal with missing data in parametric models, multiple imputation is the golden standard (Schafer & Graham, 2002). With GLMs, the models fitted on each imputed dataset can then be pooled. For non-parametric methods and specifically prediction rule ensembles, the jury is still out on how to best deal with missing data. There are several possible approaches:
Listwise deletion. Although the default in pre()
, it
is certainly the least favorable option.
Single imputation: Perform only a single imputation and fit a prediction rule ensemble on this single dataset. This is likely better than listwise deletion, but likely inferior to multiple imputation. The main advantage is that a PRE can be fit as usual.
Multiple imputation followed by pooling. This approach takes multiple imputed datasets and fits a separate prediction rule ensemble to each of the imputed datasets, which are aggregated to form a final ensemble. The main disadvantage is that this yields an ensemble of ensembles, which will be (much) more complex and thus less interpretable than a single PRE.
Multiple imputation followed by stacking. This approach stacks the multiple imputed datasets and fits a single prediction rule ensemble to this dataset. Rule generation and estimation of the final model can be adjusted to counter the artificial inflation of sample size. The main advantage is that it yields a single PRE, while treating missing data in an optimal manner.
An alternative, would be the Missing-In-Attributes approach for rule
generation, followed by mean imputation for the final model. According
to Josse et al. (2019), mean imputation and the Missing-In-Attributes
approaches perform well from a prediction perspective and are
computationally inexpensive. This will be implemented in future versions
of package pre
.
Below, we provide examples for the four approaches described above.
For the examples, we will be predicting Wind speeds using the
airquality
dataset (we focus on predicting the
wind
variable, because it does not have missing values,
while variables Ozone
and Solar.R
do):
This option, although not recommended, is the default of function
pre()
:
## Warning in pre(Wind ~ ., data = airquality): Some observations have missing values and have been removed from the data. New sample size is 111.
##
## Final ensemble with cv error within 1se of minimum:
##
## lambda = 0.5076221
## number of terms = 2
## mean cv error (se) = 9.955139 (1.553628)
##
## cv error type : Mean-Squared Error
##
## rule coefficient description
## (Intercept) 9.034190233 1
## rule8 1.743533723 Ozone <= 45
## Ozone -0.006180118 6.75 <= Ozone <= 119
With listwise deletion, only 111 out of 153 observations are retained. We obtain a rather sparse ensemble.
Here we apply single imputation by replacing missing values with the mean:
imp0 <- airquality
imp0$Solar.R[is.na(imp0$Solar.R)] <- mean(imp0$Solar.R, na.rm=TRUE)
imp0$Ozone[is.na(imp0$Ozone)] <- mean(imp0$Ozone, na.rm=TRUE)
set.seed(43)
airq.ens.imp0 <- pre(Wind ~., data = imp0)
airq.ens.imp0
##
## Final ensemble with cv error within 1se of minimum:
##
## lambda = 0.2751573
## number of terms = 5
## mean cv error (se) = 9.757455 (0.7622478)
##
## cv error type : Mean-Squared Error
##
## rule coefficient description
## (Intercept) 10.48717592 1
## rule2 1.18133248 Ozone <= 45
## rule48 -0.56453456 Temp > 72 & Solar.R <= 258
## rule28 -0.51357673 Temp > 73 & Ozone > 22
## Ozone -0.01910646 7 <= Ozone <= 115.6
## rule40 0.01440472 Temp <= 81
We obtain a larger number of rules, and slightly lower cross-validated mean squared error. However, this model cannot really be compared with the listwise deletion model, because they are computed over different sets of observations.
We first perform multiple imputation by chained equations, using the predictive mean matching method. We generate 5 imputed datasets:
We create a list
with imputed datasets:
To deal with imputed data, we have two options: Stacking the
imputed datasets or pooling the resulting ensembles. The former
is likely most beneficial for retaining interpretability and therefore
an experimental (!) version has been implemented in package
pre
.
Function mi_pre
implements a so-called stacking
approach (see also Wood et al., 2008), where imputed datasets are
combined into one large dataset. In addition to adjustments of the
sampling procedures, adjustments to observation weight are made to
counter the artificial inflation of sample size. Function
mi_pre
takes a list of imputed datasets as input data,
because it is assumed imputation has already been performed.
Although the option to use the fraction of complete data for
computing observation weight is provided through argument
compl_frac
of function mi_pre
, users are not
advised to use it. For example, Du et al. (2022) write: “An alternative
weight specification, proposed in Wan et al. (2015), is \(o_i = \frac{f_i}{D}\), where \(f_i\) is the number of observed predictors
out of the total number of predictors for subject \(i\) […] upweighting subjects with less
missingness and downweighting subjects with more missingness can, in
some sense, be viewed as making the optimization more like complete-case
analysis, which might be problematic for Missing at Random (MAR) and
Missing not at Random (MNAR) scenarios.”
We fit a rule ensemble to the imputed data using stacking:
All S3
methods for objects of class pre
are
also applicable to the ensembles resulting from application of function
mi_pre
, e.g.:
##
## Final ensemble with cv error within 1se of minimum:
##
## lambda = 0.3397143
## number of terms = 6
## mean cv error (se) = 9.821413 (1.103229)
##
## cv error type : Mean-Squared Error
## rule coefficient description
## 49 (Intercept) 9.44439871 1
## 6 rule6 1.29750892 Ozone <= 45
## 20 rule23 0.17877849 Temp <= 73
## 2 rule2 0.15142541 Ozone <= 21
## 5 rule5 0.13556989 Ozone <= 47
## 10 rule11 0.03251171 Ozone <= 52
## 50 Ozone -0.01401860 7 <= Ozone <= 118
Computation of partial dependence plots can become computationally
prohibitive, especially with multiply-imputed data. To reduce
computational load, function mi_mean
computes the average
imputed dataset, which can then be supplied to the newdata
arguments of functions singleplot
or pairplot
to speed up computation of partial dependence plots:
## Loading required namespace: interp
To apply pooling, we create a custom function that fits PREs to several datasets contained in a list:
pre.agg <- function(datasets, ...) {
result <- list()
for (i in 1:length(datasets)) {
result[[i]] <- pre(datasets[[i]], ...)
}
result
}
We apply the new function:
Note that we can used the ellipsis (...
) to pass
arguments to pre
(see ?pre
for an overview of
arguments that can be specified).
We now define print
, summary
,
predict
and coef
methods to extract results
from the fitted ensemble. Again, we can use the ellipsis
(...
) to pass arguments to the print
,
summary
, predict
and coef
methods
of function pre
(see e.g., ?pre:::print.pre
for more info):
print.agg <- function(object, ...) {
result <- list()
sink("NULL")
for (i in 1:length(object)) {
result[[i]] <- print(object[[i]], ...)
}
sink()
print(result)
}
print.agg(airq.agg) ## results suppressed for space considerations
summary.agg <- function(object, ...) {
for (i in 1:length(object)) summary(object[[i]], ...)
}
summary.agg(airq.agg) ## results suppressed for space considerations
For averaging over predictions, there is only one option for continuous outcomes. For non-continuous outcomes, we can average over the linear predictor, or over the predicted values on the scale of the response. I am not sure which would be more appropriate; the resulting predicted values will not be identical, but highly correlated, though.
predict.agg <- function(object, newdata, ...) {
rowMeans(sapply(object, predict, newdata = newdata, ...))
}
agg_preds <- predict.agg(airq.agg, newdata = airquality[1:4, ])
agg_preds
## 1 2 3 4
## 10.42757 10.59272 10.93324 11.00302
Finally, the coef
method should return the averaged /
aggregated final PRE. That is, it returns:
One averaged intercept;
All rules and linear terms, with their coefficients scaled by the number of datasets;
In presence of identical rules and linear terms, it aggregates those rules and their coefficients into one rule / term, and adds together the scaled coefficients.
Note that linear terms of the same predictors, which obtained
different winsorizing points across imputed datasets will be retained
seperately and will not be aggregated. Note also that the labels of
rules and variables may overlap between different datasets (e.g., the
label rule 12
may appear multiple times in the aggregated
ensemble, but each rule 12
will have different
conditions).
coef.agg <- function(object, ...) {
coefs <- coef(object[[1]], ...)
coefs <- coefs[coefs$coefficient != 0, ]
for (i in 2:length(object)) {
coefs_tmp <- coef(object[[i]], ...)
coefs <- rbind(coefs, coefs_tmp[coefs_tmp$coefficient != 0, ])
}
## Divide coefficients by the number of datasets:
coefs$coefficient <- coefs$coefficient / length(object)
## Identify identical rules:
duplicates <- which(duplicated(coefs$description))
for (i in duplicates) {
first_match <- which(coefs$description == coefs$description[i])[1]
## Add the coefficients:
coefs$coefficient[first_match] <-
coefs$coefficient[first_match] + coefs$coefficient[i]
}
## Remove duplicates:
if (length(duplicates) > 0) coefs <- coefs[-duplicates, ]
## Check if there are- duplicate linear terms left and repeat:
duplicates <- which(duplicated(coefs$rule))
for (i in duplicates) {
first_match <- which(coefs$rule == coefs$rule[i])[1]
coefs$coefficient[first_match] <-
coefs$coefficient[first_match] + coefs$coefficient[i]
}
if (length(duplicates) > 0) coefs <- coefs[-duplicates, ]
## Return results:
coefs
}
coef.agg(airq.agg)
## rule coefficient description
## 65 (Intercept) 9.433571984 1
## 29 rule32 0.097328032 Temp > 73 & Ozone > 23
## 3 rule3 0.968746033 Ozone <= 45
## 7 rule7 0.040507812 Ozone > 14 & Solar.R <= 238
## 66 Ozone -0.006089757 7 <= Ozone <= 115.6
## 6 rule6 0.074835353 Ozone <= 59
## 25 rule26 -0.067771974 Temp > 63 & Ozone > 14
## 17 rule17 -0.038092316 Temp > 75 & Ozone > 47
## 1 rule1 0.090749273 Ozone <= 21
## 58 rule62 -0.277510244 Ozone > 45 & Solar.R <= 275
## 36 rule39 -0.046938032 Ozone > 14 & Solar.R <= 201
## 40 rule43 -0.035213431 Temp > 72 & Solar.R <= 255
## 51 rule5 0.031821453 Ozone <= 52
## 10 rule10 0.070056367 Temp <= 73
## 19 rule22 0.066962331 Ozone <= 63
## 14 rule15 0.001044073 Ozone <= 45 & Solar.R > 212
We have obtained a final ensemble of 15 terms.
We compare performance using 10-fold cross validation. We evaluate predictive accuracy and the number of selected rules. We only evaluate accuracy for observations that have no missing values.
k <- 10
set.seed(43)
fold_ids <- sample(1:k, size = nrow(airquality), replace = TRUE)
observed <- c()
for (i in 1:k) {
## Separate training and test data
test <- airquality[fold_ids == i, ]
test <- test[!is.na(test$Ozone), ]
test <- test[!is.na(test$Solar.R), ]
observed <- c(observed, test$Wind)
}
preds <- data.frame(observed)
preds$LWD <- preds$SI <- preds$MI <- preds$observed
nterms <- matrix(nrow = k, ncol = 3)
colnames(nterms) <- c("LWD", "SI", "MI")
row <- 1
for (i in 1:k) {
if (i > 1) row <- row + nrow(test)
## Separate training and test data
train <- airquality[fold_ids != i, ]
test <- airquality[fold_ids == i, ]
test <- test[!is.na(test$Ozone), ]
test <- test[!is.na(test$Solar.R), ]
## Fit and evaluate listwise deletion
premod <- pre(Wind ~ ., data = train)
preds$LWD[row:(row+nrow(test)-1)] <- predict(premod, newdata = test)
tmp <- print(premod)
nterms[i, "LWD"] <- nrow(tmp) - 1
## Fit and evaluate single imputation
imp0 <- train
imp0$Solar.R[is.na(imp0$Solar.R)] <- mean(imp0$Solar.R, na.rm=TRUE)
imp0$Ozone[is.na(imp0$Ozone)] <- mean(imp0$Ozone, na.rm=TRUE)
premod.imp0 <- pre(Wind ~., data = imp0)
imp1 <- test
imp1$Solar.R[is.na(imp1$Solar.R)] <- mean(imp0$Solar.R, na.rm=TRUE)
imp1$Ozone[is.na(imp1$Ozone)] <- mean(imp0$Ozone, na.rm=TRUE)
preds$SI[row:(row+nrow(test)-1)] <- predict(premod.imp0, newdata = imp1)
tmp <- print(premod.imp0)
nterms[i, "SI"] <- nrow(tmp) - 1
## Perform multiple imputation
imp <- mice(train, m = 5)
imp1 <- complete(imp, action = "all", include = FALSE)
airq.agg <- pre.agg(imp1, formula = Wind ~ .)
preds$MI[row:(row+nrow(test)-1)] <- predict.agg(airq.agg, newdata = test)
nterms[i, "MI"] <- nrow(coef.agg(airq.agg)) - 1
}
## observed MI SI LWD
## 0.000000 9.462657 10.087872 9.824913
## observed MI SI LWD
## 0.000000 1.472118 1.538254 1.499660
## [1] 12.65732
Interestingly, we see that all three methods yield similar predictions and accuracy, and explain about 20% of variance in the response. Multiple imputation performed best, followed by listwise deletion, followed by single imputation. Taking into account the standard errors, however, these differences are not significant. Also, this simple evaluation on only a single dataset should not be taken too seriously. The better performance of multiple imputation does come at the cost of increased complexity:
In line with findings of Josse et al. (2019), we expect MIA to work
better for rules than mean imputation. In future versions of package
pre
, we plan to implement MIA (for the
rules) and combine it with mean imputation (for the linear terms).
In case you obtained different results, these results were obtained using the following:
## R version 4.3.1 (2023-06-16 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19045)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=C LC_CTYPE=Dutch_Netherlands.utf8
## [3] LC_MONETARY=Dutch_Netherlands.utf8 LC_NUMERIC=C
## [5] LC_TIME=Dutch_Netherlands.utf8
##
## time zone: Europe/Amsterdam
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] pre_1.0.7 mice_3.16.0
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.7 utf8_1.2.3 generics_0.1.3 tidyr_1.3.0
## [5] shape_1.4.6 stringi_1.7.12 lattice_0.21-8 inum_1.0-5
## [9] plotmo_3.6.2 lme4_1.1-35.1 digest_0.6.31 magrittr_2.0.3
## [13] mitml_0.4-5 evaluate_0.23 grid_4.3.1 iterators_1.0.14
## [17] mvtnorm_1.2-3 fastmap_1.1.1 foreach_1.5.2 jomo_2.7-6
## [21] jsonlite_1.8.4 glmnet_4.1-8 Matrix_1.6-2 nnet_7.3-19
## [25] backports_1.4.1 Formula_1.2-5 survival_3.5-5 purrr_1.0.1
## [29] fansi_1.0.4 codetools_0.2-19 jquerylib_0.1.4 cli_3.6.0
## [33] rlang_1.1.1 splines_4.3.1 plotrix_3.8-4 cachem_1.0.8
## [37] yaml_2.3.7 pan_1.9 tools_4.3.1 deldir_1.0-9
## [41] MatrixModels_0.5-3 earth_5.3.2 nloptr_2.0.3 minqa_1.2.6
## [45] dplyr_1.1.2 interp_1.1-4 boot_1.3-28.1 broom_1.0.5
## [49] rpart_4.1.19 vctrs_0.6.3 R6_2.5.1 lifecycle_1.0.4
## [53] stringr_1.5.1 libcoin_1.0-10 MASS_7.3-60 partykit_1.2-20
## [57] pkgconfig_2.0.3 pillar_1.9.0 bslib_0.5.1 TeachingDemos_2.12
## [61] glue_1.6.2 Rcpp_1.0.10 highr_0.10 xfun_0.41
## [65] tibble_3.2.1 tidyselect_1.2.0 rstudioapi_0.15.0 knitr_1.45
## [69] htmltools_0.5.7 nlme_3.1-162 rmarkdown_2.25 compiler_4.3.1
Du, J., Boss, J., Han, P., Beesley, L. J., Kleinsasser, M., Goutman, S.A., … & Mukherjee, B. (2022). Variable selection with multiply-imputed datasets: choosing between stacked and grouped methods. Journal of Computational and Graphical Statistics, 31(4), 1063-1075.
Josse, J., Prost, N., Scornet, E., & Varoquaux, G. (2019). On the consistency of supervised learning with missing values. arXiv preprint arXiv:1902.06931. https://arxiv.org/abs/1902.06931
Miles, A. (2016). Obtaining predictions from models fit to multiply imputed data. Sociological Methods & Research, 45(1), 175-185.
Schafer, J. L., & Graham, J. W. (2002). Missing data: our view of the state of the art. Psychological Methods, 7(2), 147.
Wood, A. M., White, I. R., & Royston, P. (2008). How should variable selection be performed with multiply imputed data? Statistics in Medicine, 27(17), 3227-3246.