The HLMdiag package provides functionality to examine
diagnostics for hierarchical linear models, including residuals values
and influence diagnostics. This vignette aims to:
hlm_influence(), which provides any easy way to obtain
multiple influence diagnostics in one tibblehlm_augment(),
which combines hlm_influence() and hlm_resid()
to return influence diagnostics and residualslme created from the
nlme packageHLMdiag and those returned by the lme4 and
car packageshlm_influence() functionThe hlm_influence() function provides a wrapper that
returns influence diagnostics appended to the original model frame. It
is useful for assessing the influence of individual observations, or
groups of observations, and can also be used with
dotplot_diag() to visually explore the influence
diagnostics. The diagnostics returned by hlm_influence()
include Cook’s distance, MDFFITS, covariance trace (covtrace),
covariance ratio (covratio), leverage, and relative variance change
(RVC).
Cook’s distance and MDFFITS both measure the distance between fixed
effects estimated from the full data set and those obtained from the
reduced data set. For Cook’s distance, the change in parameter estimates
is scaled by the estimated covariance matrix of the original parameter
estimates, while MDFFITS scales this change by the estimated covariance
matrix of the deletion estimates. The covariance trace and covariance
ratio both measure how precision is affected by the deletion of a
particular observation or group of observations i. Covariance
trace is a measure of the ratio between the covariance matrices with and
without unit i to the identity matrix, while covariance ratio
is a comparison of the two covariance matrices with and without unit
i using their determinants. Relative variance change (RVC) is a
measurement of the ratio of estimates of the l th variance
component with and without unit i. The final influence
diagnostic returned by hlm_influence(), leverage, is the
rate of change in the predicted response with respect to the observed
response. For a full explanation of these influence diagnostics,
including formulas, please refer to Loy and Hofmann (2014).
To explore the functionality of hlm_influence(), we will
use the data set classroom (West et. al,
2014). The data set consists of 1,190 observations of students.
Students are grouped within classes, which are nested within schools.
There are 312 distinct classes nested within 107 schools.
#> # A tibble: 1,190 × 12
#> sex minority mathkind mathgain ses yearstea mathknow housepov mathprep
#> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 448 32 0.46 1 NA 0.082 2
#> 2 0 1 460 109 -0.27 1 NA 0.082 2
#> 3 1 1 511 56 -0.03 1 NA 0.082 2
#> 4 0 1 449 83 -0.38 2 -0.11 0.082 3.25
#> 5 0 1 425 53 -0.03 2 -0.11 0.082 3.25
#> 6 1 1 450 65 0.76 2 -0.11 0.082 3.25
#> 7 0 1 452 51 -0.03 2 -0.11 0.082 3.25
#> 8 0 1 443 66 0.2 2 -0.11 0.082 3.25
#> 9 1 1 422 88 0.64 2 -0.11 0.082 3.25
#> 10 0 1 480 -7 0.13 2 -0.11 0.082 3.25
#> classid schoolid childid
#> <int> <int> <int>
#> 1 160 1 1
#> 2 160 1 2
#> 3 160 1 3
#> 4 217 1 4
#> 5 217 1 5
#> 6 217 1 6
#> 7 217 1 7
#> 8 217 1 8
#> 9 217 1 9
#> 10 217 1 10
#> # ℹ 1,180 more rows
We will fit a simple three-level hierarchical linear model using the
lme4 package:
levelhlm_influence() calculates influence diagnostics for
individual cases, or larger groups. For example, to obtain a tibble with
influence diagnostics for all 1,190 students we use the following:
#> # A tibble: 1,190 × 14
#> id mathgain mathkind sex minority ses housepov schoolid classid
#> <int> <int> <int> <int> <int> <dbl> <dbl> <int> <int>
#> 1 1 32 448 1 1 0.46 0.082 1 160
#> 2 2 109 460 0 1 -0.27 0.082 1 160
#> 3 3 56 511 1 1 -0.03 0.082 1 160
#> 4 4 83 449 0 1 -0.38 0.082 1 217
#> 5 5 53 425 0 1 -0.03 0.082 1 217
#> 6 6 65 450 1 1 0.76 0.082 1 217
#> 7 7 51 452 0 1 -0.03 0.082 1 217
#> 8 8 66 443 0 1 0.2 0.082 1 217
#> 9 9 88 422 1 1 0.64 0.082 1 217
#> 10 10 -7 480 0 1 0.13 0.082 1 217
#> cooksd mdffits covtrace covratio leverage.overall
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.00106 0.00106 0.00289 1.00 0.121
#> 2 0.00143 0.00143 0.00246 1.00 0.121
#> 3 0.000336 0.000335 0.00378 1.00 0.122
#> 4 0.000225 0.000225 0.00169 1.00 0.0780
#> 5 0.000173 0.000172 0.00178 1.00 0.0780
#> 6 0.000000807 0.000000804 0.00321 1.00 0.0794
#> 7 0.0000233 0.0000233 0.00113 1.00 0.0775
#> 8 0.0000000504 0.0000000504 0.00120 1.00 0.0775
#> 9 0.000130 0.000129 0.00401 1.00 0.0801
#> 10 0.00108 0.00108 0.00147 1.00 0.0778
#> # ℹ 1,180 more rows
Note that the parameter used to select individual observations is now
called level, not group. Previous versions of
HLMdiag used the group parameter for influence diagnostics,
where group = NULL was the default, and users could choose
to delete groups instead by setting group equal to level
names. As of HLMdiag version 0.4.0, group has
been replaced by level. level defaults to
level = 1 which deletes individual observations, exactly
how group = NULL used to work.
The resulting tibble can be used along with
dotplot_diag() to identify potentially influential
observations using either an internal cutoff of 3*IQR or a
user-specified cutoff. For example, the plot shown below reveals and
labels all observations above the internally calculated cutoff for
Cook’s distance.
The plot illustrates that there are many observations with a Cook’s
distance value above the internally calculated cutoff.
dotplot_diag() labels the top 5, which is difficult to see
in this case. Rather than investigate all observations above the cutoff,
we may be interested in the observations with the highest Cook’s
distance values. The tibble returned by hlm_influence()
provides an easy way to do this when used with the
arrange() function from dplyr.
#> # A tibble: 1,190 × 14
#> id mathgain mathkind sex minority ses housepov schoolid classid cooksd
#> <int> <int> <int> <int> <int> <dbl> <dbl> <int> <int> <dbl>
#> 1 539 253 290 0 1 -0.4 0.257 47 82 0.0536
#> 2 41 -110 434 0 1 -0.37 0.365 4 179 0.0265
#> 3 1078 203 290 1 1 -0.32 0.067 95 144 0.0256
#> 4 664 -84 629 1 0 2.33 0.17 62 22 0.0245
#> 5 312 138 542 1 0 2.27 0.05 27 104 0.0221
#> 6 754 218 368 0 1 -1.14 0.43 70 152 0.0202
#> 7 337 197 290 1 1 -0.46 0.214 28 203 0.0194
#> 8 723 214 290 1 1 -0.67 0.209 68 244 0.0187
#> 9 812 147 510 0 0 -0.47 0.367 75 42 0.0148
#> 10 1146 11 376 0 1 0.62 0.126 102 44 0.0132
#> mdffits covtrace covratio leverage.overall
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0.0527 0.0172 1.02 0.112
#> 2 0.0264 0.00406 1.00 0.143
#> 3 0.0250 0.0250 1.03 0.147
#> 4 0.0242 0.0125 1.01 0.0869
#> 5 0.0219 0.00988 1.01 0.0919
#> 6 0.0200 0.00702 1.01 0.0896
#> 7 0.0190 0.0198 1.02 0.147
#> 8 0.0183 0.0195 1.02 0.155
#> 9 0.0147 0.00797 1.01 0.0722
#> 10 0.0131 0.00825 1.01 0.0962
#> # ℹ 1,180 more rows
There does not appear to be a clear pattern among students who have
relatively higher Cook’s distance values; they do not tend to be from a
particular school or class. In order to investigate these observations
further, one could look to summary statistics for the different
explanatory variables. Additionally, a similar analysis could be done
with the other influence diagnostics provided in the tibble from
hlm_influence().
In addition to identifying influential observations, it may also be
useful to identify influential groups. The level parameter
in hlm_influence() can be used for this purpose. For
example, we can obtain influence diagnostics for each class:
#> # A tibble: 312 × 6
#> `classid:schoolid` cooksd mdffits covtrace covratio leverage.overall
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 160:1 0.00224 0.00224 0.00990 1.01 0.0980
#> 2 217:1 0.00133 0.00131 0.0238 1.02 0.128
#> 3 197:2 0.00323 0.00320 0.0106 1.01 0.123
#> 4 211:2 0.00724 0.00719 0.0202 1.02 0.0891
#> 5 307:2 0.00346 0.00343 0.0197 1.02 0.152
#> 6 11:3 0.00116 0.00115 0.0141 1.01 0.0960
#> 7 137:3 0.000873 0.000868 0.0131 1.01 0.150
#> 8 145:3 0.00517 0.00515 0.0248 1.02 0.0983
#> 9 228:3 0.000479 0.000478 0.00797 1.01 0.126
#> 10 48:4 0.00371 0.00367 0.0212 1.02 0.134
#> # ℹ 302 more rows
Note that the level parameter is set as
classid:schoolid, not classid. The level
parameter should be specified as found in model@flist for
lmerMod objects. For three level models, this usually takes the form of
“level2:level3”, where level2 is the name of the second level variable,
and level3 is the name of the third level variable.
Using dotplot_diag() again with one of the columns from
the resulting tibble of hlm_influence reveals that class
218 has a relatively high Cook’s distance value.
We can repeat a similar analysis to flag influential schools.
#> # A tibble: 107 × 6
#> schoolid cooksd mdffits covtrace covratio leverage.overall
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0.000561 0.000557 0.0381 1.04 0.0901
#> 2 2 0.00681 0.00672 0.0554 1.06 0.115
#> 3 3 0.0343 0.0333 0.0723 1.07 0.108
#> 4 4 0.0248 0.0246 0.0360 1.04 0.125
#> 5 5 0.00224 0.00223 0.0583 1.06 0.0998
#> 6 6 0.0109 0.0107 0.0690 1.07 0.104
#> 7 7 0.00576 0.00560 0.0785 1.08 0.108
#> 8 8 0.0121 0.0118 0.0785 1.08 0.0926
#> 9 9 0.00624 0.00590 0.0793 1.08 0.141
#> 10 10 0.00848 0.00826 0.0723 1.07 0.0928
#> # ℹ 97 more rows
Similarly, we use dotplot_diag to visually represent the
differences. In this case, there are only four observations above the
cutoff, so we set modify = "dotplot" in order to better
visualize the influential observations. modify = "dotplot"
works well when there are a relatively small number of observations
above the cutoff.
The plot reveals that schools 68, 75, 70, and 27 have relatively high Cook’s distance values.
deleteThe delete parameter allows the user to calculate
influence diagnostics for a specified group of observations, or group
factor level. For example, influence diagnostics for the influential
schools flagged above (68, 75, 70, and 27) can be calculated as shown
below, deleting all four schools at once:
hlm_influence(class.lmer, level = "schoolid", delete = c("27", "70", "75", "68"))
#> # A tibble: 1 × 4
#> cooksd mdffits covtrace covratio
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0.238 0.222 0.370 1.43Note that in this case, delete is specified as a
character vector consisting of the school ID’s of interest.
delete can also be set as a numeric vector of indices; in
this case, setting delete to the row numbers of all
students in the data frame who attend schools 68, 75, 70, or 27 would be
equivalent to the line above.
leverage
argumentIn the previous examples, hlm_influence() only returned
overall leverage. However, hlm_influence() also allows for
other types of leverage, including the leverage corresponding to the
fixed effects, the leverage corresponding to the random effects as
proposed by Demidenko and Stukel (2005), and the unconfounded leverage
corresponding to the random effects proposed by Nobre and Singer (2011).
These types of leverage are referred to as fixef,
ranef, and ranef.uc, respectively.
None of our observations are flagged for high leverage when looking only at overall leverage:
However, we can obtain the other types of leverage as follows:
infl2 <- hlm_influence(class.lmer, level = 1, leverage = c("overall", "fixef", "ranef", "ranef.uc"))This example illustrates how to select all four types of leverage, but any one or more may also be selected.
We can then use dotplot_diag() with one of the new
leverage columns to investigate outlier for that type of leverage.
However, further analysis reveals that several observations have high leverage when considering only the leverage corresponding to the fixed effects.
approxhlm_influence() defaults to calculating influence
diagnostics based off a one step approximation (Christensen
et.al 1992; Shi and Chen 2008; Zewotir
2008). However, the approx parameter allows the
user to calculate influence diagnostics based off a full refit of the
data using hlm_influence(). For example, if we wished to
calculate influence diagnostics for each school by fully refitting the
model each time, we could use:
#> # A tibble: 107 × 9
#> schoolid cooksd mdffits covtrace covratio rvc.sigma2 rvc.D11 rvc.D22
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0.000642 0.000639 0.0565 1.06 -0.00262 0.0154 0.0183
#> 2 2 0.00684 0.00676 0.0423 1.04 -0.00409 0.0161 -0.00403
#> 3 3 0.0389 0.0386 0.0256 0.973 0.00299 0.000121 -0.0952
#> 4 4 0.0249 0.0253 0.0971 0.906 -0.0315 -0.0293 0.0174
#> 5 5 0.00222 0.00221 0.0691 1.07 -0.00229 0.0160 0.00997
#> 6 6 0.0111 0.0109 0.0671 1.07 0.00443 0.00323 -0.0198
#> 7 7 0.00564 0.00546 0.0970 1.10 0.00436 0.0230 -0.0119
#> 8 8 0.0117 0.0115 0.0653 1.07 -0.00744 -0.0804 0.0574
#> 9 9 0.00634 0.00598 0.0922 1.09 0.00313 0.00326 -0.00216
#> 10 10 0.00816 0.00790 0.109 1.11 0.00796 0.0220 -0.00935
#> leverage.overall
#> <dbl>
#> 1 0.0901
#> 2 0.115
#> 3 0.108
#> 4 0.125
#> 5 0.0998
#> 6 0.104
#> 7 0.108
#> 8 0.0926
#> 9 0.141
#> 10 0.0928
#> # ℹ 97 more rows
In most cases, using the default of approx = TRUE is
sufficient for influence analysis. Setting approx = FALSE
also takes much longer than the default setting since the model must be
refit each time with each group or observation deleted. However, the
full refit method also allows for relative variance change (RVC) to be
returned. If this is desired, approx = FALSE should be
used.
na.action and the data argumenthlm_influence() was written to respect the
na.action parameter from the lme4 package.
This argument defaults to na.omit, which means all rows in
the data sets with NAs present are simply deleted from the
model. However, there is also an na.exclude option, which
pads the resulting tibbles with NAs in the the rows that
contained NAs in the original data set in place of deleting
them altogether. In order for hlm_influence() to do this,
the original data set must be passed into hlm_influence()
via the data argument whenever the na.action
was set to na.exclude in the model fitting process.
For example, while the class data set does not have any
NAs in the data set, we can introduce a couple for the
purposes of an example.
We can then fit the same model using the lme4 package as
before. Below, we fit two models, one with
na.action = "na.exclude" and one with the default
na.action = "na.omit".
class.lmer.exclude <- lme4::lmer(mathgain ~ mathkind + sex + minority + ses + housepov + (1|schoolid/classid), data = classNA, na.action = "na.exclude")
class.lmer.omit <- lme4::lmer(mathgain ~ mathkind + sex + minority + ses + housepov + (1|schoolid/classid), data = classNA, na.action = "na.omit")We then run hlm_influence() on the model where
na.action = "na.omit".
#> # A tibble: 1,188 × 14
#> id mathgain mathkind sex minority ses housepov schoolid classid
#> <int> <int> <int> <int> <int> <dbl> <dbl> <int> <int>
#> 1 1 32 448 1 1 0.46 0.082 1 160
#> 2 2 83 449 0 1 -0.38 0.082 1 217
#> 3 3 53 425 0 1 -0.03 0.082 1 217
#> 4 4 65 450 1 1 0.76 0.082 1 217
#> 5 5 51 452 0 1 -0.03 0.082 1 217
#> 6 6 66 443 0 1 0.2 0.082 1 217
#> 7 7 88 422 1 1 0.64 0.082 1 217
#> 8 8 -7 480 0 1 0.13 0.082 1 217
#> 9 9 60 502 0 1 0.83 0.082 1 217
#> 10 10 -2 502 1 1 0.06 0.082 2 197
#> cooksd mdffits covtrace covratio leverage.overall
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.000710 0.000708 0.00351 1.00 0.160
#> 2 0.000295 0.000294 0.00182 1.00 0.0801
#> 3 0.000140 0.000140 0.00183 1.00 0.0801
#> 4 0.00000818 0.00000815 0.00325 1.00 0.0814
#> 5 0.0000145 0.0000144 0.00124 1.00 0.0795
#> 6 0.00000224 0.00000224 0.00127 1.00 0.0796
#> 7 0.000184 0.000183 0.00400 1.00 0.0821
#> 8 0.00111 0.00111 0.00163 1.00 0.0799
#> 9 0.000375 0.000374 0.00338 1.00 0.0815
#> 10 0.00250 0.00248 0.00508 1.01 0.136
#> # ℹ 1,178 more rows
Note than there are only 1,188 rows in the returned tibble, although there were 1,190 observations in the original data set. The two rows with NAs were deleted from the returned tibble.
We repeat this with the model where
na.action = "na.exclude".
#> # A tibble: 1,190 × 14
#> id mathgain mathkind sex minority ses housepov schoolid classid
#> <int> <int> <int> <int> <int> <dbl> <dbl> <int> <int>
#> 1 1 32 448 1 1 0.46 0.082 1 160
#> 2 2 109 NA 0 1 -0.27 0.082 1 160
#> 3 3 NA 511 1 1 -0.03 0.082 1 160
#> 4 4 83 449 0 1 -0.38 0.082 1 217
#> 5 5 53 425 0 1 -0.03 0.082 1 217
#> 6 6 65 450 1 1 0.76 0.082 1 217
#> 7 7 51 452 0 1 -0.03 0.082 1 217
#> 8 8 66 443 0 1 0.2 0.082 1 217
#> 9 9 88 422 1 1 0.64 0.082 1 217
#> 10 10 -7 480 0 1 0.13 0.082 1 217
#> cooksd mdffits covtrace covratio leverage.overall
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.000710 0.000708 0.00351 1.00 0.160
#> 2 NA NA NA NA NA
#> 3 NA NA NA NA NA
#> 4 0.000295 0.000294 0.00182 1.00 0.0801
#> 5 0.000140 0.000140 0.00183 1.00 0.0801
#> 6 0.00000818 0.00000815 0.00325 1.00 0.0814
#> 7 0.0000145 0.0000144 0.00124 1.00 0.0795
#> 8 0.00000224 0.00000224 0.00127 1.00 0.0796
#> 9 0.000184 0.000183 0.00400 1.00 0.0821
#> 10 0.00111 0.00111 0.00163 1.00 0.0799
#> # ℹ 1,180 more rows
In this tibble, there are 1,190 rows. Furthermore, the two rows with
NAs display NAs for the influence diagnostics, instead of being entirely
absent as in the above example. It is important to note that the
data argument is necessary. Failing to provide the data set
through the data argument in this situation will result in
an error.
The hlm_augment() function combines the
hlm_resid() and hlm_influence() functions to
return a tibble containing information about the residuals and the
influence diagnostics appended to the data. For example, we can obtain
residuals and influence diagnostics at once for all students in the
class data set with the following:
#> # A tibble: 1,190 × 20
#> .id mathgain mathkind sex minority ses housepov schoolid classid
#> <dbl> <int> <int> <int> <int> <dbl> <dbl> <int> <int>
#> 1 1 32 448 1 1 0.46 0.082 1 160
#> 2 2 109 460 0 1 -0.27 0.082 1 160
#> 3 3 56 511 1 1 -0.03 0.082 1 160
#> 4 4 83 449 0 1 -0.38 0.082 1 217
#> 5 5 53 425 0 1 -0.03 0.082 1 217
#> 6 6 65 450 1 1 0.76 0.082 1 217
#> 7 7 51 452 0 1 -0.03 0.082 1 217
#> 8 8 66 443 0 1 0.2 0.082 1 217
#> 9 9 88 422 1 1 0.64 0.082 1 217
#> 10 10 -7 480 0 1 0.13 0.082 1 217
#> .resid .fitted .ls.resid .ls.fitted .mar.resid .mar.fitted cooksd
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 -37.7 69.7 0 32 -34.6 66.6 0.00106
#> 2 47.5 61.5 0 109 50.6 58.4 0.00143
#> 3 18.5 37.5 0 56 21.6 34.4 0.000336
#> 4 23.3 59.7 35.4 47.6 20.0 63.0 0.000225
#> 5 -19.9 72.9 -15.4 68.4 -23.1 76.1 0.000173
#> 6 1.01 64.0 -4.18 69.2 -2.22 67.2 0.000000807
#> 7 -9.14 60.1 -1.19 52.2 -12.4 63.4 0.0000233
#> 8 0.413 65.6 4.24 61.8 -2.82 68.8 0.0000000504
#> 9 11.5 76.5 4.18 83.8 8.22 79.8 0.000130
#> 10 -54.8 47.8 -45.3 38.3 -58.0 51.0 0.00108
#> mdffits covtrace covratio leverage.overall
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0.00106 0.00289 1.00 0.121
#> 2 0.00143 0.00246 1.00 0.121
#> 3 0.000335 0.00378 1.00 0.122
#> 4 0.000225 0.00169 1.00 0.0780
#> 5 0.000172 0.00178 1.00 0.0780
#> 6 0.000000804 0.00321 1.00 0.0794
#> 7 0.0000233 0.00113 1.00 0.0775
#> 8 0.0000000504 0.00120 1.00 0.0775
#> 9 0.000129 0.00401 1.00 0.0801
#> 10 0.00108 0.00147 1.00 0.0778
#> # ℹ 1,180 more rows
This is useful for inspecting residuals values and influence
diagnostics values at the same time. However, hlm_augment()
lacks some of the functionality that hlm_influence() and
hlm_resid() have. The delete and
approx parameters available for
hlm_influence() are not available in
hlm_augment(), so the function will always use a one step
approximation and delete all observations or groups instead of a
selected few. The standardize parameter from
hlm_resid() is also not included, so
hlm_influence() or hlm_resid() should be used
instead if this functionality is desired. For more information about
available functionality in hlm_resid(), see the
hlm_resid vignette.
hlm_augment() is especially useful for inspecting
residual values of observations with relatively high influence
diagnostics values, or vice versa.
#> # A tibble: 1,190 × 20
#> .id mathgain mathkind sex minority ses housepov schoolid classid .resid
#> <dbl> <int> <int> <int> <int> <dbl> <dbl> <int> <int> <dbl>
#> 1 539 253 290 0 1 -0.4 0.257 47 82 110.
#> 2 41 -110 434 0 1 -0.37 0.365 4 179 -157.
#> 3 1078 203 290 1 1 -0.32 0.067 95 144 62.0
#> 4 664 -84 629 1 0 2.33 0.17 62 22 -88.7
#> 5 312 138 542 1 0 2.27 0.05 27 104 94.6
#> 6 754 218 368 0 1 -1.14 0.43 70 152 107.
#> 7 337 197 290 1 1 -0.46 0.214 28 203 60.7
#> 8 723 214 290 1 1 -0.67 0.209 68 244 59.8
#> 9 812 147 510 0 0 -0.47 0.367 75 42 87.2
#> 10 1146 11 376 0 1 0.62 0.126 102 44 -79.8
#> .fitted .ls.resid .ls.fitted .mar.resid .mar.fitted cooksd mdffits covtrace
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 143. 4.04e+ 1 213. 117. 136. 0.0536 0.0527 0.0172
#> 2 47.0 0 -110 -177. 66.8 0.0265 0.0264 0.00406
#> 3 141. 0 203 65.9 137. 0.0256 0.0250 0.0250
#> 4 4.67 -2.79e+ 1 -56.1 -81.9 -2.08 0.0245 0.0242 0.0125
#> 5 43.4 -4.44e-15 138 98.1 39.9 0.0221 0.0219 0.00988
#> 6 111. -3.11e-15 218 125. 93.1 0.0202 0.0200 0.00702
#> 7 136. 0 197 62.3 135. 0.0194 0.0190 0.0198
#> 8 154. 0 214 80.4 134. 0.0187 0.0183 0.0195
#> 9 59.8 2.78e+ 1 119. 109. 38.3 0.0148 0.0147 0.00797
#> 10 90.8 -2.60e+ 1 37.0 -91.1 102. 0.0132 0.0131 0.00825
#> covratio leverage.overall
#> <dbl> <dbl>
#> 1 1.02 0.112
#> 2 1.00 0.143
#> 3 1.03 0.147
#> 4 1.01 0.0869
#> 5 1.01 0.0919
#> 6 1.01 0.0896
#> 7 1.02 0.147
#> 8 1.02 0.155
#> 9 1.01 0.0722
#> 10 1.01 0.0962
#> # ℹ 1,180 more rows
We can sort by a particular column in order to inspect the values of other influence diagnostics and the residuals of influential observations.
Previously, only the individual one step approximation influence
functions worked on lme models fit using the
nlme package. However, hlm_influence() can
also be used on lme objects, as can
hlm_augment(). This allows the user to calculate influence
diagnostics for lme models by fulling refitting the model
using the approx = FALSE argument.
In most cases, using a lme object for
hlm_influence() or hlm_augment() is identical
to their usage with lmerMod objects from lme4.
However, there are a few notable exceptions.
For both hlm_influence() and hlm_augment(),
levels should be specified by names that appear in
model@flist. For the second level of a three level
lmerMod model, this usually looks like “level2:level3”
where level2 and level3 are the names of the second and third level
variables, respectively. However, for a lme model, levels
should be specified by names that appear in model$groups.
For example, we can obtain influence diagnostics for each classroom from
the class data set in the following way:
class.lme <- nlme::lme(mathgain ~ mathkind + sex + minority + ses + housepov, random = ~ 1|schoolid/classid, data = classroom)
hlm_influence(class.lme, level = "classid")
#> # A tibble: 312 × 6
#> classid cooksd mdffits covtrace covratio leverage.overall
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1/160 0.00224 0.00224 0.00990 1.01 0.121
#> 2 1/217 0.00133 0.00131 0.0238 1.02 0.0784
#> 3 2/197 0.00323 0.00320 0.0106 1.01 0.126
#> 4 2/211 0.00724 0.00719 0.0202 1.02 0.102
#> 5 2/307 0.00346 0.00343 0.0197 1.02 0.0752
#> 6 3/11 0.00116 0.00115 0.0141 1.01 0.102
#> 7 3/137 0.000873 0.000868 0.0131 1.01 0.0819
#> 8 3/145 0.00517 0.00515 0.0248 1.02 0.107
#> 9 3/228 0.000479 0.000478 0.00797 1.01 0.148
#> 10 4/48 0.00371 0.00367 0.0212 1.02 0.149
#> # ℹ 302 more rowsFor the lmerMod model, we specified
level = classid:schoolid. However, for lme
models, the name of the second level alone is sufficient. However,
specifying names of specific groups to delete for lme
models is also slightly different. For example, we can obtain influence
diagnostics for a model created when classes 160 and 217 are deleted for
a lmerMod model in the following way:
hlm_influence(class.lmer, level = "classid:schoolid", delete = c("160:1", "217:1"))
#> # A tibble: 1 × 4
#> cooksd mdffits covtrace covratio
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0.000561 0.000557 0.0381 1.04Obtaining influence diagnostics for a model created with the deletion
of classes 160 and 217 from a lme model is a bit
different:
hlm_influence(class.lme, level = "classid", delete = c("1/160", "1/217"))
#> # A tibble: 1 × 4
#> cooksd mdffits covtrace covratio
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0.000561 0.000557 0.0381 1.04Note that both examples specify units for deletion in the same way
they are specified in model@flist (lmerMod
models) or model$groups (lme models).
Additionally, lmerMod models require that the
data argument is passed into hlm_influence()
and hlm_augment() when
na.action = "na.exclude" during the model fitting process.
However, this is unnecessary for lme models.
The HLMdiag package has two different ways to calculate
Cook’s distance. One of them, which is more exact, refits the model with
each observation or group deleted from the model and calculates Cook’s
distance from the resulting coefficient estimates and variance
components estimates. The second is a one-step approximation. For more
information about how the one step approximation works, we refer the
reader to Christensen et.al (1992); Shi and
Chen (2008); and Zewotir (2008). Other R
packages also have functions that can calculate Cook’s distance. In this
section, we highlight the differences between the ways of calculating
Cook’s distance in HLMdiag versus other methods in the
car and lme4 packages.
car packageThe cooks_distance() function from HLMdiag
calculates Cook’s distance through a full refit of the model using the
following formula:
\(C_i(\hat{\beta}) = \frac{1}{p}{(\hat{\beta} - \hat{\beta}_{(i)})}^\top\widehat{\mathrm{VAR}(\hat{\beta})}^{-1}(\hat{\beta} - \hat{\beta}_{(i)})\)
Notice that the difference in the change in the parameter estimates
is scaled by the estimated covariance matrix of the original parameter
estimates. We can calculate the Cook’s distance values in the
HLMdiag package by first using the
case_delete() function, followed by the
cooks.distance() function.
hlm_case <- HLMdiag:::case_delete.lmerMod(class.lmer)
hlm_cd_full <- HLMdiag:::cooks.distance.case_delete(hlm_case)
head(hlm_cd_full)
#> [1] 9.327238e-04 1.415243e-03 3.316859e-04 2.282399e-04 1.797497e-04
#> [6] 6.968431e-07Conversely, the mdffits() function from
HLMdiag calculates MDFFITS, which is a multivariate version
of the DFFITS statistic. This is calculated as follows:
\(MDFFITS_i(\hat{\beta}) = \frac{1}{p}{(\hat{\beta} - \hat{\beta}_{(i)})}^\top\widehat{\mathrm{VAR}(\hat{\beta}_{(i)})}^{-1}(\hat{\beta} - \hat{\beta}_{(i)})\)
Instead of scaling by the estimated covariance matrix of the original
parameter estimates, calculations for MDFFITS are scaled by the
estimated covariance matrix of the deletion estimates. For each deleted
observation or group, the model is refit and the covariance structure
re-estimated without unit i. We can calculate this similarly to
how we calculated Cook’s distance, using the same
case_delete object.
hlm_mdffits <- mdffits(hlm_case)
head(hlm_mdffits)
#> [1] 9.304263e-04 1.412796e-03 3.302360e-04 2.278198e-04 1.793468e-04
#> [6] 6.942262e-07These estimates are pretty similar to those produced by
cooks_distance(); the difference is due to the use of the
covariance matrix of the deletion estimates, rather than the original
estimates.
The car package calculates Cook’s distance similarly to
how the HLMdiag package calculates MDFFITS. Instead of
using the covariance matrix of the original parameter estimates like
HLMdiag’s cooks_distance() function, the
cooks.distance() function from car uses the
estimated covariance matrix of the deletion estimates. However, the
cooks_distance() function from car is not identical to the
mdffits function from HLMdiag; the
car::cooks.distance() function also scaled each
observation, i, by dividing it by the estimated variance of the
model calculated without observation i. Therefore, the formula
used to calculate Cook’s distance in the car package is as
follows:
\(C_i(\hat{\beta}) = \frac{1}{p\hat{\sigma_{i}}^2}{(\hat{\beta} - \hat{\beta}_{(i)})}^\top\widehat{\mathrm{VAR}(\hat{\beta}_{(i)})}^{-1}(\hat{\beta} - \hat{\beta}_{(i)})\)
We can calculate this by first using the influence()
function followed by the cooks.distance() function.
library(car)
car_case <- influence(class.lmer)
car_cd <- cooks.distance(car_case)
head(car_cd)
#> 1 2 3 4 5 6
#> 9.280228e-04 1.408088e-03 3.300041e-04 2.270854e-04 1.788868e-04 6.909835e-07These values initially seem fairly different from the MDFFITS and
Cook’s distance values produced by HLMdiag, due to the
additional scaling by the inverse of the variance of each refitted
model. We can adjust for this by multiplying the vector of Cook’s
distance values from car by the estimated variance from
each refitted model.
sig.sq <- car_case[[4]][,1]
car_cd_adjusted <- car_cd * sig.sq
head(car_cd_adjusted)
#> 1 2 3 4 5 6
#> 0.6793697325 1.0316179073 0.2425234966 0.1667251605 0.1314640724 0.0005079863Once the values from car have been adjusted by sigma
squared, they now appear very similar to the MDFFITS values from
HLMdiag. The plot below shows the difference between the
MDFFITS estimates from HLMdiag and the variance-adjusted
Cook’s distance estimates from car for all 1,190
observations in the classroom data set.
While the difference between the estimates varies slightly due to differences in how the fixed effects and variance components are calculated for refit models in both packages, the difference in values tends to be very small.
lme4 packageSimilar to HLMdiag, the lme4 package has
two methods that calculate Cook’s distance. One of them, similar to the
case_delete method, conducts a full refit of models without
each observation or group. The second is a quicker approximation.
lme4 approximationIn order to calculate the approximation of Cook’s distance values
from lme4, we use the cooks.distance()
function.
lme_cd_approx <- lme4:::cooks.distance.merMod(class.lmer)
head(lme_cd_approx)
#> 1 2 3 4 5 6
#> 0.050602009 0.080124360 0.012322170 0.011276222 0.008216468 0.000021657However, this approximation is fairly different from the one produced
from HLMdiag.
hlm_cd_approx <- HLMdiag:::cooks.distance.lmerMod(class.lmer)
head(hlm_cd_approx)
#> [1] 1.060728e-03 1.433652e-03 3.358071e-04 2.249644e-04 1.726769e-04
#> [6] 8.069538e-07The approximation from HLMdiag is also closer to the
full refit calculated by HLMdiag than the lme4
approximation is. The plots below show the difference between the full
refit from HLMdiag and the HLMdiag
approximation (left) and the lme4 approximation
(right).
The estimates produced by the lme4 approximation are
never smaller than the values from the HLMdiag full refit,
but they tend to be further off the HLMdiag full refit
values than the HLMdiag approximation values are. The
difference between the full refit and the approximation from
HLMdiag tends to be very small (< 0.0005), while the
difference between the full refit and the lme4
approximation values tends to be much larger.
lme4 full refitlme4 also has a function that performs a full refit of
the models without each observation or group and calculates Cook’s
distance values based off those refits. We can calculate this by using
the influence function from lme4.
lme_case <- lme4:::influence.merMod(class.lmer, data = classroom)
cd_lme_full <- cooks.distance(lme_case) lme4’s cook’s distance function for objects created from
the influence function has a bug that prevents us from
obtaining Cook’s distance values from the influence object
using lme4; however, we can use the
cooks.distance function from the stats package
on the influence object from lme4.
The values obtained from lme4 match those from
car, meaning that lme4 is also scaling the
estimates by dividing by the estimated variance of each refit model, in
addition to using the estimated covariance matrix from the deletion
estimates, rather than the original estimates. Therefore, the
lme4 package is also using this formula to calculate Cook’s
distance:
\(C_i(\hat{\beta}) = \frac{1}{p\hat{\sigma_{i}}^2}{(\hat{\beta} - \hat{\beta}_{(i)})}^\top\widehat{\mathrm{VAR}(\hat{\beta}_{(i)})}^{-1}(\hat{\beta} - \hat{\beta}_{(i)})\)
The full refits from the lme4 and car
packages both choose to calculate what is MDFFITS in the
HLMdiag package, and additionally choose to scale by
dividing by the variance from each refit model. Those choices explain
almost all of the differences between Cook’s distance values from the
three different packages; additional variation is due to differences in
how the new models are fit and coefficients estimated.
##References Bates D, Maechler M, Bolker B, Walker S (2015). Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software, 67(1), 1-48.
Christensen R, Pearson L, Johnson W (1992). “Case-Deletion Diagnostics for Mixed Models.” Technometrics, 34(1), 38–45.
Fox J, Weisburg S (2019). A {R} Companion to Applied Regression, Third Addition. Thousand Oaks CA: Sage. URL: https://www.john-fox.ca/Companion/
Loy A, Hofmann H (2014). HLMdiag: A Suite of Diagnostics for Hierarchical Linear Models in R. Journal of Statistical Software, 56(5), 1-28.
Pinheiro J, Bates D, DebRoy S, Sarkar D, R Core Team (2020). nlme: Linear and Nonlinear Mixed Effects Models. R package version 3.1-148, <URL: https://CRAN.R-project.org/package=nlme>.
Shi L, Chen G (2008). “Case Deletion Diagnostics in Multilevel Models.” Journal of Multivariate Analysis, 99(9), 1860–1877.
West, B., Welch, K. & Galecki, A. (2006) Linear Mixed Models: A Practical Guide Using Statistical Software. First Edition. Chapman Hall / CRC Press. ISBN 1584884800
Zewotir T (2008). “Multiple Cases Deletion Diagnostics for Linear Mixed Models.” Communications in Statistics – Theory and Methods, 37(7), 1071–1084.