
Author: Maciej Nasinski
Check the miceFast website for more details
miceFast provides fast methods for imputing missing
data, leveraging an object-oriented programming paradigm and optimized
linear algebra routines.
The package includes convenient helper functions compatible with
data.table, dplyr, and other popular R
packages.
Major speed improvements occur when:
- Using a grouping variable, where the data is
automatically sorted by group, significantly reducing computation time.
- Performing multiple imputations, by evaluating the
underlying quantitative model only once for multiple draws.
- Running Predictive Mean Matching (PMM), thanks to presorting
and binary search.
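To illustrate why presorting plus binary search helps PMM, here is a minimal base R sketch. It is not the package's internal implementation: `pmm_nearest_donor()` is a hypothetical helper, the numbers are toy values, and it picks only the single nearest donor, whereas real PMM samples among several nearest donors.

```r
# Illustrative PMM donor lookup: sort donors once, then binary-search each target.
# pmm_nearest_donor() is a hypothetical helper, not part of miceFast.
pmm_nearest_donor <- function(pred_obs, y_obs, pred_mis) {
  ord <- order(pred_obs)              # presort donors by predicted value
  p_sorted <- pred_obs[ord]
  y_sorted <- y_obs[ord]
  n <- length(p_sorted)

  # findInterval() uses binary search: index of the last donor <= target
  lo <- findInterval(pred_mis, p_sorted)
  left  <- pmax(lo, 1L)               # neighbour at or below the target
  right <- pmin(lo + 1L, n)           # neighbour just above the target

  # Keep whichever neighbour's prediction is closer (ties go to the left).
  use_right <- abs(p_sorted[right] - pred_mis) < abs(p_sorted[left] - pred_mis)
  y_sorted[ifelse(use_right, right, left)]
}

pred_obs <- c(3.0, 1.0, 2.0, 5.0)     # toy predictions for rows with observed y
y_obs    <- c(30, 10, 20, 50)         # their observed outcomes
pred_mis <- c(1.2, 4.4, 0.0, 9.0)     # toy predictions for rows with missing y
donors <- pmm_nearest_donor(pred_obs, y_obs, pred_mis)
```

Sorting costs O(n log n) once, after which each lookup is O(log n) instead of a full scan over all donors.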
For performance details, see performance_validity.R in
the extdata folder.
Vignettes: the Introduction vignette and the MI vignette.

- Single imputation with fill_NA() or averaging draws with fill_NA_N() is fast and convenient.
For any inferential statement use full MI with pool().
- Compare against a complete-case analysis (complete.cases()). Listwise deletion is unbiased under MCAR
and may be sufficient when the fraction of incomplete rows is small.
- Run sensitivity checks: compare results across approaches (complete.cases(), mean imputation) and across different
imputation models (lm_bayes, lm_noise, pmm). Vary the number of imputations. If conclusions
change, investigate why.
- Report the imputation model, m, and any assumptions about the missing-data mechanism.

mice implements the full MI pipeline (impute, analyze, pool). miceFast focuses on the computationally expensive part: fitting the imputation models. It is typically ~10× faster than mice for the imputation step alone (see benchmarks). Two usage modes:
- MI with Rubin’s rules. Call fill_NA() with a stochastic model in a loop to create
m completed datasets, then pool() the fitted models. For continuous variables use
lm_bayes (strictly proper; it draws from the posterior). For both continuous and
categorical variables, pmm (Predictive Mean Matching) is also proper: it draws from
the posterior and matches to observed values, preserving the data distribution. Use
the OOP interface (impute("pmm", ...)) in a loop for MI with PMM. For categorical
variables, lda with a random ridge is approximate (an ad-hoc perturbation rather than
a posterior draw, though it works well in practice). lm_noise is improper (no
parameter uncertainty) and is useful for sensitivity checks. See the MI vignette.
- Single-dataset imputation. fill_NA_N() with lm_bayes/lm_noise returns the mean of
k stochastic draws per missing value. With pmm, k is the number of nearest
neighbours to sample from (no averaging). Handy for exploration, but not for
Rubin’s rules (between-imputation variance is lost).
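To make the role of between-imputation variance concrete, the following base R sketch pools a single coefficient by hand using the standard Rubin's rules formulas. `rubin_pool()` is a hypothetical helper written for this illustration (it is not the package's pool() and omits the degrees-of-freedom adjustment), and the inputs are made-up toy numbers.

```r
# Hypothetical helper: pool one coefficient across m imputed-data fits.
rubin_pool <- function(estimates, variances) {
  m <- length(estimates)
  q_bar <- mean(estimates)            # pooled point estimate
  w     <- mean(variances)            # within-imputation variance
  b     <- var(estimates)             # between-imputation variance
  total <- w + (1 + 1 / m) * b        # total variance
  c(estimate = q_bar, within = w, between = b, std.error = sqrt(total))
}

est <- c(1.0, 1.2, 0.8, 1.1, 0.9)       # toy coefficient estimates, m = 5
se2 <- c(0.04, 0.05, 0.04, 0.05, 0.04)  # toy squared standard errors
pooled <- rubin_pool(est, se2)
pooled
```

When draws are averaged into one dataset first, only a single fit exists, so the `between` term cannot be estimated and the standard error is understated.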
Iterative FCS (chained equations). When multiple variables have interlocking (non-monotone) missingness, you can cycle through the variables in a loop, restoring the original NAs in each one and re-imputing it from the others; this is the same algorithm mice uses. With a monotone pattern a single pass suffices and FCS is unnecessary. See the Introduction vignette for details.
See the MI vignette for worked examples.
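The FCS cycle described above could be sketched roughly as follows. This is an illustration under assumptions rather than the vignette's code: the mean-fill starting values and the choice of five iterations are arbitrary, and fill_NA() is called with a data frame and column names in the same way as in the quick-start examples.

```r
library(miceFast)

data(air_miss)
df <- air_miss[, c("Ozone", "Solar.R", "Wind", "Temp")]

# Remember which cells were missing originally.
na_ozone <- is.na(df$Ozone)
na_solar <- is.na(df$Solar.R)

# Starting values so every predictor is complete (mean fill is an assumption).
df$Ozone[na_ozone]   <- mean(df$Ozone, na.rm = TRUE)
df$Solar.R[na_solar] <- mean(df$Solar.R, na.rm = TRUE)

set.seed(1234)
n_iter <- 5  # number of cycles, chosen arbitrarily for this sketch
for (iter in seq_len(n_iter)) {
  # Restore the NAs in Solar.R, then re-impute it from the other variables.
  df$Solar.R[na_solar] <- NA
  df$Solar.R <- fill_NA(df, "lm_bayes", "Solar.R", c("Ozone", "Wind", "Temp"))

  # Same for Ozone, now using the freshly imputed Solar.R.
  df$Ozone[na_ozone] <- NA
  df$Ozone <- fill_NA(df, "lm_bayes", "Ozone", c("Solar.R", "Wind", "Temp"))
}
```

This yields one completed dataset; for MI, the whole cycle would be repeated m times and the fitted models passed to pool().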
You can install miceFast from CRAN:

install.packages("miceFast")

Or install the development version from GitHub:

# install.packages("devtools")
devtools::install_github("polkas/miceFast")

library(miceFast)
library(dplyr)
data(air_miss)
# Visualize the NA structure
upset_NA(air_miss, 6)
# Select the 4 core variables for regression: Ozone ~ Solar.R + Wind + Temp
# Ozone has 37 NAs, Solar.R has 7 NAs, Wind and Temp are complete.
df <- air_miss[, c("Ozone", "Solar.R", "Wind", "Temp")]
# MI with Rubin's rules: impute m = 10 datasets, fit model, pool.
# Impute Solar.R first (predictors fully observed), then Ozone
# (can now use the freshly imputed Solar.R). This sequential order
# resolves joint missingness in a single pass.
set.seed(1234)
completed <- lapply(1:10, function(i) {
  df %>%
    mutate(Solar.R = fill_NA(., "lm_bayes", "Solar.R", c("Wind", "Temp"))) %>%
    mutate(Ozone = fill_NA(., "lm_bayes", "Ozone", c("Solar.R", "Wind", "Temp")))
})
fits <- lapply(completed, function(d) lm(Ozone ~ Solar.R + Wind + Temp, data = d))
pool(fits)
#> Pooled results from 10 imputed datasets
#> Rubin's rules with Barnard-Rubin df adjustment
#>
#> term         estimate std.error statistic    df   p.value
#> (Intercept) -49.50313  21.74948    -2.276 78.41 2.557e-02
#> Solar.R       0.05771   0.02294     2.516 72.83 1.407e-02
#> Wind         -3.44033   0.62721    -5.485 76.15 5.185e-07
#> Temp          1.47603   0.23404     6.307 97.50 8.345e-09

library(miceFast)
library(data.table)
data(air_miss)
dt <- as.data.table(air_miss[, c("Ozone", "Solar.R", "Wind", "Temp")])
# MI with Rubin's rules: same sequential chain as above.
set.seed(1234)
completed <- lapply(1:10, function(i) {
  d <- copy(dt)
  d[, Solar.R := fill_NA(.SD, "lm_bayes", "Solar.R", c("Wind", "Temp"))]
  d[, Ozone := fill_NA(.SD, "lm_bayes", "Ozone", c("Solar.R", "Wind", "Temp"))]
  d
})
fits <- lapply(completed, function(d) lm(Ozone ~ Solar.R + Wind + Temp, data = d))
pool(fits)

For iterative FCS (chained equations) with non-monotone missingness, see the Introduction vignette.
# Quick baseline. Biased; does not account for relationships between variables.
naive_fill_NA(air_miss)

See the Introduction vignette for weights, the OOP interface, log-transformations, and more.
- miceFast objects (Rcpp modules).
- fill_NA(): Single imputation (lda, lm_pred, lm_bayes, lm_noise).
- fill_NA_N(): Multiple imputations. Averaged draws for lm_bayes/lm_noise; nearest-neighbour sampling for pmm.
- pool(): Pool multiply imputed results using Rubin’s rules.
- VIF(): Variance Inflation Factor calculations.
- naive_fill_NA(): Automatic naive imputations.
- compare_imp(): Compare original vs. imputed values.
- upset_NA(): Visualize NA structure using UpSetR.

Quick Reference Table:
| Function | Description |
|---|---|
| new(miceFast) | Creates an OOP instance with numerous imputation methods (see the vignette). |
| fill_NA() | Single imputation: lda, lm_pred, lm_bayes, lm_noise. |
| fill_NA_N() | lm_bayes/lm_noise: averages k draws. pmm: samples from k nearest observed values (works for both continuous and categorical). |
| pool() | Pools estimates from m imputed datasets using Rubin’s rules. Works with any model that has coef() and vcov(). |
| VIF() | Computes Variance Inflation Factors. |
| naive_fill_NA() | Performs automatic, naive imputations. |
| compare_imp() | Compares imputations vs. original data. |
| upset_NA() | Visualizes NA structure using an UpSet plot. |
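
As a usage sketch for the fill_NA_N() row in the table above, the snippet below produces a single filled-in Ozone column in two ways. It assumes the same positional arguments as the fill_NA() quick-start calls plus a `k` argument; the values of `k` and the new column names (`Ozone_avg`, `Ozone_pmm`) are arbitrary choices for illustration.

```r
library(miceFast)

data(air_miss)
df <- air_miss[, c("Ozone", "Solar.R", "Wind", "Temp")]

set.seed(1234)
# lm_bayes: each missing Ozone value becomes the mean of k = 20 stochastic draws.
df$Ozone_avg <- fill_NA_N(df, "lm_bayes", "Ozone", c("Wind", "Temp"), k = 20)
# pmm: each missing value is sampled from its k = 5 nearest observed values.
df$Ozone_pmm <- fill_NA_N(df, "pmm", "Ozone", c("Wind", "Temp"), k = 5)

# Compare the original column with the two filled-in versions.
summary(df[, c("Ozone", "Ozone_avg", "Ozone_pmm")])
```

Both columns are suited to exploration only; for inference, use the MI loop with pool() instead.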
Median timings on 100k rows, 10 variables, 100 groups (R 4.4.3, macOS M3 Pro, optimized BLAS/LAPACK):
Imputation quality (SSE) is comparable to mice across all models.

Full benchmark script:
inst/extdata/performance_validity.R.