Imputing missing values in IPD

Michael Seo

2022-06-03

We describe how to impute missing data in individual patient data (IPD) using multiple imputation. The package is not restricted to IPD; it can also be used for other multilevel data. The method is most appropriate for IPD with two treatment arms (e.g. placebo and an active treatment). The package conveniently sets up the input parameters needed to use the imputation methods in mice and the packages that extend it (micemd and miceadds). We first simulate a mock dataset to use for illustration.

Missing data exploration

# install.packages('bipd') or
# devtools::install_github('MikeJSeo/bipd')
library(bipd)
set.seed(1)
simulated_dataset <- generate_sysmiss_ipdma_example(Nstudies = 10,
    Ncov = 5, sys_missing_prob = 0.3, magnitude = 0.2, heterogeneity = 0.1)
head(simulated_dataset)
#> # A tibble: 6 x 8
#>       y     x1 x2    x3         x4     x5 study treat
#>   <dbl>  <dbl> <fct> <fct>   <dbl>  <dbl> <int> <int>
#> 1 2.24   0.271 0     0     -0.280  0.819      1     0
#> 2 2.62   0.768 1     1      0.429  0.0837     1     0
#> 3 3.23  -1.31  1     0     -1.23   0.165      1     1
#> 4 0.815 -0.590 1     1      0.435  0.345      1     0
#> 5 3.33   1.20  1     1      0.0561 0.287      1     0
#> 6 2.65  -1.00  0     0     -2.19   1.63       1     1

An IPD meta-analysis combines multiple studies reporting the same clinical outcome, and two types of missingness can arise. When a predictor is only partly reported within a study, e.g. when some patients did not provide information on their age, we refer to it as ‘sporadically missing’. A different missing data problem, relevant to meta-analysis only, is that of ‘systematically missing’ predictors: different studies measured different sets of predictors, so a predictor is missing for every patient in some studies. The findMissingPattern function calculates the percentage of patients with missing values for each predictor in each study and determines whether each variable is sporadically or systematically missing.

missP <- findMissingPattern(simulated_dataset, covariates = c("x1",
    "x2", "x3", "x4", "x5"), typeofvar = c("continuous", "binary",
    "binary", "continuous", "continuous"), studyname = "study",
    treatmentname = "treat", outcomename = "y")
missP$missingpercent
#> # A tibble: 10 x 6
#>    study    x1    x2    x3    x4    x5
#>    <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1     1     0     0     0     0     0
#>  2     2     0     0     0     0     0
#>  3     3     0     0   100     0     0
#>  4     4     0     0     0     0     0
#>  5     5     0     0     0     0     0
#>  6     6     0     0     0     0     0
#>  7     7     0     0     0     0   100
#>  8     8     0     0   100     0     0
#>  9     9     0     0   100     0     0
#> 10    10     0     0     0   100   100
missP$sys_covariates
#> [1] "x3" "x4" "x5"
missP$spor_covariates
#> character(0)

We see that x3, x4, and x5 are systematically missing, and that no variable is sporadically missing. Now let’s insert some sporadically missing values by randomly setting x1 to NA for some patients. The output below shows that x1 is then sporadically missing, with roughly 10% of values missing in each study.

simulated_dataset2 <- simulated_dataset
randomindex <- sample(c(TRUE, FALSE), dim(simulated_dataset)[1],
    replace = TRUE, prob = c(0.1, 0.9))
simulated_dataset2[randomindex, "x1"] <- NA
missP2 <- findMissingPattern(simulated_dataset2, covariates = c("x1",
    "x2", "x3", "x4", "x5"), typeofvar = c("continuous", "binary",
    "binary", "continuous", "continuous"), studyname = "study",
    treatmentname = "treat", outcomename = "y")
missP2$missingpercent
#> # A tibble: 10 x 6
#>    study    x1    x2    x3    x4    x5
#>    <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1     1     9     0     0     0     0
#>  2     2    10     0     0     0     0
#>  3     3     6     0   100     0     0
#>  4     4     8     0     0     0     0
#>  5     5    10     0     0     0     0
#>  6     6    11     0     0     0     0
#>  7     7     6     0     0     0   100
#>  8     8    11     0   100     0     0
#>  9     9     8     0   100     0     0
#> 10    10    12     0     0   100   100
missP2$sys_covariates
#> [1] "x3" "x4" "x5"
missP2$spor_covariates
#> [1] "x1"
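
The classification reported by findMissingPattern can be approximated in a few lines of base R. The sketch below is not the package's implementation: the toy data frame and the rule (100% missing in at least one study counts as systematic; strictly between 0% and 100% in at least one study counts as sporadic) are our assumptions for illustration.

```r
# Toy IPD (hypothetical): 3 studies with 20 patients each
set.seed(42)
toy <- data.frame(
  study = rep(1:3, each = 20),
  x1 = rnorm(60),
  x2 = rnorm(60)
)
toy$x1[c(2, 25, 47)] <- NA      # one patient per study lacks x1
toy$x2[toy$study == 3] <- NA    # study 3 never measured x2

covariates <- c("x1", "x2")

# Percentage of missing values for each covariate (columns) in each study (rows)
pct_missing <- sapply(covariates, function(v) {
  tapply(is.na(toy[[v]]), toy$study, function(z) 100 * mean(z))
})

# Assumed rule: fully missing in at least one study -> systematically missing
sys_cov <- covariates[apply(pct_missing == 100, 2, any)]
# Assumed rule: partly missing in at least one study -> sporadically missing
spor_cov <- covariates[apply(pct_missing > 0 & pct_missing < 100, 2, any)]

pct_missing
sys_cov
spor_cov
```

Under this rule a covariate could be flagged as both systematically and sporadically missing; how findMissingPattern treats that case should be checked in its help page.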

Multiple Imputations

This package relies on up to three mice-related packages. Let’s install and load them to make sure the imputation function works.

library(mice)  #for datasets with only one study level
library(miceadds)  #for multilevel datasets without systematically missing predictors
library(micemd)  #for multilevel datasets with systematically missing predictors.

We can impute the missing variables using the ipdma.impute function, which automatically sets up the parameters and calls the mice imputation function.

imputation <- ipdma.impute(simulated_dataset, covariates = c("x1",
    "x2", "x3", "x4", "x5"), typeofvar = c("continuous", "binary",
    "binary", "continuous", "continuous"), interaction = TRUE,
    studyname = "study", treatmentname = "treat", outcomename = "y",
    m = 5)
#> 
#>  iter imp variable
#>   1   1  x3  x4  x5  x1treat  x2treat  x3treat  x4treat  x5treat
#>   1   2  x3  x4  x5  x1treat  x2treat  x3treat  x4treat  x5treat
#>   1   3  x3  x4  x5  x1treat  x2treat  x3treat  x4treat  x5treat
#>   1   4  x3  x4  x5  x1treat  x2treat  x3treat  x4treat  x5treat
#>   1   5  x3  x4  x5  x1treat  x2treat  x3treat  x4treat  x5treat
#>   2   1  x3  x4  x5  x1treat  x2treat  x3treat  x4treat  x5treat
#>   2   2  x3  x4  x5  x1treat  x2treat  x3treat  x4treat  x5treat
#>   2   3  x3  x4  x5  x1treat  x2treat  x3treat  x4treat  x5treat
#>   2   4  x3  x4  x5  x1treat  x2treat  x3treat  x4treat  x5treat
#>   2   5  x3  x4  x5  x1treat  x2treat  x3treat  x4treat  x5treat
#>   3   1  x3  x4  x5  x1treat  x2treat  x3treat  x4treat  x5treat
#>   3   2  x3  x4  x5  x1treat  x2treat  x3treat  x4treat  x5treat
#>   3   3  x3  x4  x5  x1treat  x2treat  x3treat  x4treat  x5treat
#>   3   4  x3  x4  x5  x1treat  x2treat  x3treat  x4treat  x5treat
#>   3   5  x3  x4  x5  x1treat  x2treat  x3treat  x4treat  x5treat
#>   4   1  x3  x4  x5  x1treat  x2treat  x3treat  x4treat  x5treat
#>   4   2  x3  x4  x5  x1treat  x2treat  x3treat  x4treat  x5treat
#>   4   3  x3  x4  x5  x1treat  x2treat  x3treat  x4treat  x5treat
#>   4   4  x3  x4  x5  x1treat  x2treat  x3treat  x4treat  x5treat
#>   4   5  x3  x4  x5  x1treat  x2treat  x3treat  x4treat  x5treat
#>   5   1  x3  x4  x5  x1treat  x2treat  x3treat  x4treat  x5treat
#>   5   2  x3  x4  x5  x1treat  x2treat  x3treat  x4treat  x5treat
#>   5   3  x3  x4  x5  x1treat  x2treat  x3treat  x4treat  x5treat
#>   5   4  x3  x4  x5  x1treat  x2treat  x3treat  x4treat  x5treat
#>   5   5  x3  x4  x5  x1treat  x2treat  x3treat  x4treat  x5treat

A lot happens underneath ipdma.impute. The most important parameters that it sets up automatically are the imputation method (meth) and the predictor matrix (pred). “meth” specifies the imputation method for each incomplete covariate, and “pred” specifies the variables used to impute each incomplete variable. Let’s see what was used in the imputation process.

imputation$meth
#>                                      study 
#>                                         "" 
#>                                      treat 
#>                                         "" 
#>                                          y 
#>                                         "" 
#>                                         x1 
#>                                         "" 
#>                                         x2 
#>                                         "" 
#>                                         x3 
#>                            "2l.2stage.bin" 
#>                                         x4 
#>                           "2l.2stage.norm" 
#>                                         x5 
#>                           "2l.2stage.norm" 
#>                                    x1treat 
#> "~ I(as.numeric(as.character(x1)) *treat)" 
#>                                    x2treat 
#> "~ I(as.numeric(as.character(x2)) *treat)" 
#>                                    x3treat 
#> "~ I(as.numeric(as.character(x3)) *treat)" 
#>                                    x4treat 
#> "~ I(as.numeric(as.character(x4)) *treat)" 
#>                                    x5treat 
#> "~ I(as.numeric(as.character(x5)) *treat)"
imputation$pred
#>         study treat y x1 x2 x3 x4 x5 x1treat x2treat x3treat x4treat x5treat
#> study       0     0 0  0  0  0  0  0       0       0       0       0       0
#> treat       0     0 0  0  0  0  0  0       0       0       0       0       0
#> y          -2     2 0  1  1  1  1  1       1       1       1       1       1
#> x1          0     0 0  0  0  0  0  0       0       0       0       0       0
#> x2          0     0 0  0  0  0  0  0       0       0       0       0       0
#> x3         -2     2 1  1  1  0  1  1       1       1       0       1       1
#> x4         -2     2 1  1  1  1  0  1       1       1       1       0       1
#> x5         -2     2 1  1  1  1  1  0       1       1       1       1       0
#> x1treat     0     0 0  0  0  0  0  0       0       0       0       0       0
#> x2treat     0     0 0  0  0  0  0  0       0       0       0       0       0
#> x3treat     0     0 0  0  0  0  0  0       0       0       0       0       0
#> x4treat     0     0 0  0  0  0  0  0       0       0       0       0       0
#> x5treat     0     0 0  0  0  0  0  0       0       0       0       0       0

Note that the systematically missing variables are assigned imputation methods (2l.2stage.bin and 2l.2stage.norm) from the micemd package, which are designed to handle systematically missing variables. The mice package uses fully conditional specification: a conditional distribution is defined for each incomplete variable, and the incomplete variables are imputed iteratively. Each row of the predictor matrix describes how one of these imputation models is built. Only the rows for variables with missing values need to be filled in, and the row variable is the left-hand side of the imputation model equation. A value of 1 places the column variable on the right-hand side of the equation. A value of -2 marks the variable that identifies the cluster or study. A value of 2 marks variables that are assumed to have random effects in the linear mixed effects model.
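
To make the coding concrete, here is a minimal base R sketch that builds one row of a predictor matrix by hand. The reduced variable list is an assumption chosen for brevity; ipdma.impute constructs the full matrix for you.

```r
# Reduced, hypothetical variable set; only x3 is assumed to be incomplete
vars <- c("study", "treat", "y", "x1", "x3")
pred <- matrix(0, nrow = length(vars), ncol = length(vars),
               dimnames = list(vars, vars))

# Row "x3" defines the imputation model with x3 on the left-hand side
pred["x3", ] <- 1           # start with every variable as a fixed-effect predictor
pred["x3", "x3"] <- 0       # a variable does not predict itself
pred["x3", "study"] <- -2   # cluster (study) indicator
pred["x3", "treat"] <- 2    # treatment enters with a random effect

pred
```

Rows for complete variables stay at zero, which mirrors the pattern in the rows for x1 and x2 of imputation$pred above.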

ls(imputation)
#> [1] "imp"            "imp.list"       "meth"           "missingPattern"
#> [5] "pred"

There are two additional outputs from the ipdma.impute function. ‘imp’ is the raw object returned by the mice function, and ‘imp.list’ contains the imputed datasets as a list. You may use whichever format is easier for your analysis. Note that we requested 5 imputed datasets by setting the parameter m = 5.

length(imputation$imp.list)
#> [1] 5
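
One way to work with a list of completed datasets is to fit the same model to each and combine the results with Rubin's rules. The sketch below is only an illustration in base R: imp_list is a simulated stand-in for imputation$imp.list (independently generated data frames, not real imputations), and the analysis model y ~ treat + x1 is an assumption.

```r
set.seed(123)
# Stand-in for imputation$imp.list: five hypothetical completed data frames
imp_list <- lapply(1:5, function(i) {
  treat <- rep(0:1, times = 50)
  x1 <- rnorm(100)
  data.frame(y = 1 + 0.5 * treat + 0.3 * x1 + rnorm(100),
             treat = treat, x1 = x1)
})

# Fit the same analysis model to every completed dataset
fits <- lapply(imp_list, function(d) lm(y ~ treat + x1, data = d))
est <- unname(sapply(fits, function(f) coef(f)["treat"]))
se2 <- sapply(fits, function(f) vcov(f)["treat", "treat"])

# Rubin's rules for the treatment coefficient
m <- length(fits)
pooled_est <- mean(est)       # pooled point estimate
within_var <- mean(se2)       # average within-imputation variance
between_var <- var(est)       # between-imputation variance
total_var <- within_var + (1 + 1 / m) * between_var
pooled_se <- sqrt(total_var)

c(estimate = pooled_est, se = pooled_se)
```

With the raw ‘imp’ object, the with and pool functions in mice carry out the same steps.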

Let’s try another example using the dataset with a sporadically missing predictor that we created previously. In this example we do not include treatment-covariate interactions, and we use a different imputation method for the systematically missing variables. The “interaction” parameter tells the program whether to include treatment-covariate interactions in the imputation model. See help(ipdma.impute) for more details on the parameters.

imputation2 <- ipdma.impute(simulated_dataset2, covariates = c("x1",
    "x2", "x3", "x4", "x5"), typeofvar = c("continuous", "binary",
    "binary", "continuous", "continuous"), sys_impute_method = "2l.glm",
    interaction = FALSE, studyname = "study", treatmentname = "treat",
    outcomename = "y", m = 5)

Note that this takes a longer time to run since we used a different imputation method for systematically missing predictors.

imputation2$meth
#>         study         treat             y            x1            x2 
#>            ""            ""            ""      "2l.pmm"            "" 
#>            x3            x4            x5 
#>  "2l.glm.bin" "2l.glm.norm" "2l.glm.norm"
imputation2$pred
#>       study treat y x1 x2 x3 x4 x5
#> study     0     0 0  0  0  0  0  0
#> treat     0     0 0  0  0  0  0  0
#> y        -2     2 0  1  1  1  1  1
#> x1       -2     2 1  0  1  1  1  1
#> x2        0     0 0  0  0  0  0  0
#> x3       -2     2 1  1  1  0  1  1
#> x4       -2     2 1  1  1  1  0  1
#> x5       -2     2 1  1  1  1  1  0

Since x1 is sporadically missing, it is assigned a suitable imputation method (2l.pmm) from the miceadds package. This method uses predictive mean matching based on predictions from a linear mixed model. To impute x1 with a different method, the user can pass a modified “meth” vector to the ipdma.impute function.

meth <- imputation2$meth
meth["x1"] <- "2l.norm"
imputation2 <- ipdma.impute(simulated_dataset2, covariates = c("x1",
    "x2", "x3", "x4", "x5"), typeofvar = c("continuous", "binary",
    "binary", "continuous", "continuous"), sys_impute_method = "2l.glm",
    interaction = FALSE, studyname = "study", treatmentname = "treat",
    outcomename = "y", m = 5, meth = meth)

Note how we have changed the imputation method for x1 to a normal linear mixed effects model (2l.norm). Similarly, the predictor matrix can be adapted to the user’s needs by specifying “pred” differently.

Try your data

It is quite likely that, when you try imputing your own data, multiple imputation that accounts for the study level does not work. For instance, if all patients in one study were female, then the imputation methods that account for systematically missing predictors (i.e. the methods from micemd) would not work. Similarly, if you include treatment-covariate interactions in the imputation model and a study administered treatment only to patients who were taking anti-psychotic drugs (a binary covariate), then the more advanced imputation models might not work. If the multiple imputation method fails for whatever reason, I suggest first trying simple multiple imputation methods that ignore the study level. “pmm” is a nice place to start and is the default method in the mice function. In the ipdma.impute function, the user can request it by setting sys_impute_method = "pmm".

imputation2 <- ipdma.impute(simulated_dataset2, covariates = c("x1",
    "x2", "x3", "x4", "x5"), typeofvar = c("continuous", "binary",
    "binary", "continuous", "continuous"), sys_impute_method = "pmm",
    interaction = FALSE, studyname = "study", treatmentname = "treat",
    outcomename = "y", m = 5)
imputation2$meth
#> study treat     y    x1    x2    x3    x4    x5 
#>    ""    ""    "" "pmm"    "" "pmm" "pmm" "pmm"
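
Before running the multilevel methods, it can help to look for the kind of problem described above, namely a covariate that takes a single value within a study. The check below is a base R sketch on a hypothetical toy dataset; it is not part of bipd.

```r
# Hypothetical toy data: every patient in study 2 is female
toy <- data.frame(
  study = rep(1:3, each = 4),
  female = c(0, 1, 1, 0,  1, 1, 1, 1,  0, 1, 0, 1)
)

# Number of distinct observed values of the covariate within each study
n_levels <- tapply(toy$female, toy$study,
                   function(z) length(unique(z[!is.na(z)])))

# Studies where the covariate is observed but constant
constant_studies <- names(n_levels)[n_levels == 1]
constant_studies
```

A study in which the covariate is entirely missing has zero observed values and is deliberately not flagged here, since that is the systematically missing case the multilevel methods are meant to handle.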