jrtThis package provides user-friendly functions designed for the easy implementation of Item-Response Theory (IRT) models and scoring with judgment data. Although it can be used in a variety of contexts, the original motivation for implementation is to facilitate use for creativity researchers.
jrt is not an estimation package, it provides wrapper
functions that call estimation packages and extract/report/plot
information from them. At this stage, jrt uses the
(excellent) package mirt (Chalmers, 2012) as its only IRT
engine. Thus, if you use jrt for your research, please
ensure to cite mirt as the estimation package/engine:
We also encourage that you cite jrt – especially if you
use the plots or the automatic model selection. Currently, this would be
done with:
Ok now let’s get started…
Then, a judgment data.frame would be provided to the
function jrt. Here we’ll use the simulated one in
jrt::ratings.
It looks like this:
head(data)
#>   Judge_1 Judge_2 Judge_3 Judge_4 Judge_5 Judge_6
#> 1       5       4       3       4       4       4
#> 2       3       3       2       3       2       2
#> 3       3       3       3       3       3       2
#> 4       3       2       2       3       4       2
#> 5       2       3       1       2       2       1
#> 6       3       2       2       3       2       1In the current release:
mirt-fitted binary and
nominal models)jrt()You will first want to first load the library.
The main function of the jrt package is
jrt(). By default, this function will:
@factor.scores (or @output.data) slot of the
jrt object.Let’s do it!
fit) to do more after.
Note: There’s a progress bar by default, but it takes space in the
vignette, so I’ll remove it here with
progress.bar = F.fit <- jrt(data, progress.bar = F)
#> ℹ️ The possible responses detected are: 1-2-3-4-5
#> 
#>  📌 Model Selection (6 judges) 📌
#> ▸ AIC for Rating Scale Model: 4414.163 | Model weight: 0.000
#> ▸ AIC for Generalized Rating Scale Model: 4368.781 | Model weight: 0.000
#> ▸ AIC for Partial Credit Model: 4022.956 | Model weight: 0.000
#> ▸ AIC for Generalized Partial Credit Model: 4014.652 | Model weight: 0.000
#> ▸ AIC for Constrained Graded Rating Scale Model: 4399.791 | Model weight: 0.000
#> ▸ AIC for Graded Rating Scale Model: 4307.955 | Model weight: 0.000
#> ▸ AIC for Constrained Graded Response Model: 3999.248 | Model weight: 0.673
#> ▸ AIC for Graded Response Model: 4000.689 | Model weight: 0.327
#> 👉 The best fitting model is the Constrained Graded Response Model.
#> 
#> 📌 General Summary 📌
#> ▸ 6 Judges
#> ▸ 300 Products
#> ▸ 5 response categories (1-2-3-4-5)
#> ▸ Mean judgment = 2.977 | SD = 0.862
#> 
#> 📌 IRT Summary 📌
#> ▸ Model: Constrained (equal slopes) Graded Response Model (Samejima, 1969) | doi: 10.1007/BF03372160
#> ▸ Estimation package: mirt (Chalmers, 2012) | doi: 10.18637/jss.v048.i06
#> ▸ Estimation algorithm: Expectation-Maximization (EM; Bock & Atkin, 1981) | doi: 10.1007/BF02293801
#> ▸ Factor scoring method: Expected A Posteriori (EAP)
#> ▸ AIC = 3999.248 | BIC = 4091.843 | SABIC = 4091.843 | HQ = 4036.305
#> 
#> 📌 Model-based reliability 📌
#> ▸ Empirical reliability | Average in the sample: .893
#> ▸ Expected reliability | Assumes a Normal(0,1) prior density: .894Of course there’s more available here than one would report. If using IRT scoring (which is the main purpose of this package), we recommend reporting what IRT model was selected, along with IRT indices primarily, since the scoring is based on the estimation of the \(\theta\) abilities. In this case typically what is reported in the empirical reliability (here 0.893), which is the estimate of the reliability of the observations in the sample. It can be interpreted similarily as other more traditionnal indices of reliability (like Cronbach’s \(\alpha\)).
One may of course select a model based on assumptions on the data
rather than on model fit comparisons. This is done through using the
name of a model as an imput of the argument irt.model of
the jrt() function. This bypasses the automatic model
selection stage.
fit <- jrt(data, "PCM")
#> ℹ️ The possible responses detected are: 1-2-3-4-5
#> 
#> 📌 General Summary 📌
#> ▸ 6 Judges
#> ▸ 300 Products
#> ▸ 5 response categories (1-2-3-4-5)
#> ▸ Mean judgment = 2.977 | SD = 0.862
#> 
#> 📌 IRT Summary 📌
#> ▸ Model: Partial Credit Model (Masters, 1982) | doi: 10.1007/BF02296272
#> ▸ Estimation package: mirt (Chalmers, 2012) | doi: 10.18637/jss.v048.i06
#> ▸ Estimation algorithm: Expectation-Maximization (EM; Bock & Atkin, 1981) | doi: 10.1007/BF02293801
#> ▸ Factor scoring method: Expected A Posteriori (EAP)
#> ▸ AIC = 4022.956 | BIC = 4115.551 | SABIC = 4115.551 | HQ = 4060.013
#> 
#> 📌 Model-based reliability 📌
#> ▸ Empirical reliability | Average in the sample: .889
#> ▸ Expected reliability | Assumes a Normal(0,1) prior density: .507See the documentation for a list of available models. Most models are
directly those of mirt. Others are versions of the Graded
Response Model or Generalized Partial Credit Model that are constrained
in various ways (equal discriminations and/or equal category structures)
through the mirt.model() function of mirt.
Note that they can also be called by their full names
(e.g. jrt(data, "Graded Response Model")).
@factor.scores.head(fit@factor.scores)
#>   Judgments.Factor.Score Judgments.Standard.Error Judgments.Mean.Score
#> 1              1.7071888                0.5824418             4.000000
#> 2             -0.7216079                0.5581625             2.500000
#> 3             -0.1530602                0.5119408             2.833333
#> 4             -0.4249492                0.5319721             2.666667
#> 5             -2.2559489                0.6720186             1.833333
#> 6             -1.4157511                0.6202543             2.166667Note : If you want a more complete output with the original data, use
@output.data. If there were missing data,
@output.data also appends imputed data.
head(fit@output.data)
#>   Judge_1 Judge_2 Judge_3 Judge_4 Judge_5 Judge_6 Judgments.Factor.Score
#> 1       5       4       3       4       4       4              1.7071888
#> 2       3       3       2       3       2       2             -0.7216079
#> 3       3       3       3       3       3       2             -0.1530602
#> 4       3       2       2       3       4       2             -0.4249492
#> 5       2       3       1       2       2       1             -2.2559489
#> 6       3       2       2       3       2       1             -1.4157511
#>   Judgments.Standard.Error Judgments.Mean.Score
#> 1                0.5824418             4.000000
#> 2                0.5581625             2.500000
#> 3                0.5119408             2.833333
#> 4                0.5319721             2.666667
#> 5                0.6720186             1.833333
#> 6                0.6202543             2.166667Judge characteristics can be inspected with Judge Category Curve
(JCC) plots. They are computed with the function
jcc.plot().
A basic example for Judge 3…
Now of course, there are many options, but a few things that you could try:
judge = "all" or simply removing the judge
argument (note that you can change the number of columns or rows, see
the documentation for these advanced options).greyscale = TRUE (this uses linetypes instead of
colors)…overlay.reliability = TRUE (reliability is scaled from
\(0\) to \(1\), making it easier to read with
probabilities than information)labelled = FALSE.column.names.jcc.plot(fit, 3:4,
         manual.facet.names = paste("Expert ", c("A", "B", "C", "D", "E", "F")),
         manual.line.names = c("Totally disagree", "Disagree", "Neither agree\nnor disagree", "Agree", "Totally agree"),
         labelled = F)titletheta.span = 5 (sets the maximum, the minimum is
automatically adjusted)color.palette
(uses the RColorBrewer palettes in ggplot2), the background
colors with theme (uses the ggplot2 themes,
like bw, light, grey, etc.), and
the line size with line.width.jcc.plot(fit, 1, color.palette = "Dark2", theme = "classic", line.width = 1.5, font.family = "serif", overlay.reliability = T, name.for.reliability = "Reliability")or
I’ve also integrated the colors of the ggsci package
(npg, aaas, nejm,
lancet, jama, jco,
D3, locuszoom, igv,
uchicago, startrek, tron,
futurama), but be careful, not all may have sufficient
color values!
The jrt() function already plots an information plot,
but information plots can be called (as well as variants of information,
like standard error and reliability), with the info.plot()
function.
judge
argument.type argument.(type = "reliability" also works)
(type = "Standard Error" also works)
y.line to add a horizontal line, for example for a
.70 threshold, usual (though rarely used in IRT) for reliability.type = ir)
or with standard error (type = ise).With a threshold value
And here again, themes are available.
Similar customizing options than jcc.plot() are
available, here is an example:
info.plot(fit, 1, "ir",
          column.names = "Rater",
          theta.span = 5,
          theme = "classic",
          line.width = 2,
          greyscale = T,
          font.family = "serif", 
          axis.title.info = expression(I(theta))
          )Some polytomous IRT models (namely, the Rating Scale models) assume
that judges all have the same response category structure, and so they
cannot be estimated if all judges do not have the same observed
categories. So, if your data includes judges with unobserved categories,
how does jrt deal with that?
For the automatic model selection stage, jrt will by
default keep all judges but, if there are judges with unobserved
categories, it will not fit the Rating Scale and Generalized Rating
Scale models. You will be notified in the output.
Note : The possible values are automatically detected, but it can
be bypassed with the possible.values argument.
Here’s an example on a data set where a judge had unobserved categories. By default the set of candidate models will exclude rating scale models (note in the plot that the last judge has an uboserved category).
fit <- jrt(data, 
           progress.bar = F, #removing the progress bar for the example
           plots = F) 
#> ℹ️ The possible responses detected are: 1-2-3-4-5
#> 12.5% Judges (1 out of 8) did not have all categories (1-2-3-4-5 observed). Rating scale models were ignored. See documentation (argument remove.judges.with.unobserved.categories) for details.
#> 
#>  📌 Model Selection (8 judges) 📌
#> ▸ AIC for Graded Response Model: 1656.018 | Model weight: 0.546
#> ▸ AIC for Constrained Graded Response Model: 1656.393 | Model weight: 0.453
#> ▸ AIC for Partial Credit Model: 1678.702 | Model weight: 0.000
#> ▸ AIC for Generalized Partial Credit Model: 1668.746 | Model weight: 0.001
#> 👉 The best fitting model is the Graded Response Model.
#> 
#> 📌 General Summary 📌
#> ▸ 8 Judges
#> ▸ 100 Products
#> ▸ 5 response categories (1-2-3-4-5)
#> ▸ Mean judgment = 2.841 | SD = 0.785
#> 
#> 📌 IRT Summary 📌
#> ▸ Model: Graded Response Model (Samejima, 1969) | doi: 10.1007/BF03372160
#> ▸ Estimation package: mirt (Chalmers, 2012) | doi: 10.18637/jss.v048.i06
#> ▸ Estimation algorithm: Expectation-Maximization (EM; Bock & Atkin, 1981) | doi: 10.1007/BF02293801
#> ▸ Factor scoring method: Expected A Posteriori (EAP)
#> ▸ AIC = 1656.018 | BIC = 1755.014 | SABIC = 1755.014 | HQ = 1696.084
#> 
#> 📌 Model-based reliability 📌
#> ▸ Empirical reliability | Average in the sample: .921
#> ▸ Expected reliability | Assumes a Normal(0,1) prior density: .919Now, if you want instead to remove the incomplete judges to compare
the models, set
remove.judges.with.unobserved.categories = TRUE (it’s a
long name for an argument, so if you have a better idea of a clear but
shorter name shoot me an email!). Now all models will be compared, but
with only the complete judges.
After this stage:
An example with the same data as above but with
remove.judges.with.unobserved.categories = TRUE. Here,
since the best fitting model was the Constrained Graded Response Model
(not a Rating Scale Model), then the model is fit again with all judges
(hence the different AIC between the two stages).
fit <- jrt(data, 
           remove.judges.with.unobserved.categories = T,
           progress.bar = F, #removing the progress bar for the example
           plots = F) 
#> ℹ️ The possible responses detected are: 1-2-3-4-5
#> 12.5% Judges (1 out of 8) did not have all categories (1-2-3-4-5 observed). Incomplete Judges were removed for model comparison, and in subsequent analyses if a rating scale model is selected. See documentation (argument remove.judges.with.unobserved.categories) for details.
#> 
#>  📌 Model Selection (7 judges) 📌
#> ▸ AIC for Rating Scale Model: 1723.348 | Model weight: 0.000
#> ▸ AIC for Generalized Rating Scale Model: 1706.738 | Model weight: 0.000
#> ▸ AIC for Partial Credit Model: 1574.999 | Model weight: 0.001
#> ▸ AIC for Generalized Partial Credit Model: 1579.209 | Model weight: 0.000
#> ▸ AIC for Constrained Graded Rating Scale Model: 1724.575 | Model weight: 0.000
#> ▸ AIC for Graded Rating Scale Model: 1701.954 | Model weight: 0.000
#> ▸ AIC for Constrained Graded Response Model: 1561.043 | Model weight: 0.945
#> ▸ AIC for Graded Response Model: 1566.783 | Model weight: 0.054
#> 👉 The best fitting model is the Constrained Graded Response Model.
#> 
#> 📌 General Summary 📌
#> ▸ 8 Judges
#> ▸ 100 Products
#> ▸ 5 response categories (1-2-3-4-5)
#> ▸ Mean judgment = 2.841 | SD = 0.785
#> 
#> 📌 IRT Summary 📌
#> ▸ Model: Constrained (equal slopes) Graded Response Model (Samejima, 1969) | doi: 10.1007/BF03372160
#> ▸ Estimation package: mirt (Chalmers, 2012) | doi: 10.18637/jss.v048.i06
#> ▸ Estimation algorithm: Expectation-Maximization (EM; Bock & Atkin, 1981) | doi: 10.1007/BF02293801
#> ▸ Factor scoring method: Expected A Posteriori (EAP)
#> ▸ AIC = 1656.393 | BIC = 1737.154 | SABIC = 1737.154 | HQ = 1689.078
#> 
#> 📌 Model-based reliability 📌
#> ▸ Empirical reliability | Average in the sample: .916
#> ▸ Expected reliability | Assumes a Normal(0,1) prior density: .915Additionnal statistics may be computed with
additional.stats = TRUE.
fit <- jrt(data,
           additional.stats = T,
           progress.bar = F,
           plots = F) #removing the progress bar for the example
#> ℹ️ The possible responses detected are: 1-2-3-4-5
#> 12.5% Judges (1 out of 8) did not have all categories (1-2-3-4-5 observed). Rating scale models were ignored. See documentation (argument remove.judges.with.unobserved.categories) for details.
#> 
#>  📌 Model Selection (8 judges) 📌
#> ▸ AIC for Graded Response Model: 1656.018 | Model weight: 0.546
#> ▸ AIC for Constrained Graded Response Model: 1656.393 | Model weight: 0.453
#> ▸ AIC for Partial Credit Model: 1678.702 | Model weight: 0.000
#> ▸ AIC for Generalized Partial Credit Model: 1668.746 | Model weight: 0.001
#> 👉 The best fitting model is the Graded Response Model.
#> 
#> 📌 General Summary 📌
#> ▸ 8 Judges
#> ▸ 100 Products
#> ▸ 5 response categories (1-2-3-4-5)
#> ▸ Mean judgment = 2.841 | SD = 0.785
#> 
#> 📌 IRT Summary 📌
#> ▸ Model: Graded Response Model (Samejima, 1969) | doi: 10.1007/BF03372160
#> ▸ Estimation package: mirt (Chalmers, 2012) | doi: 10.18637/jss.v048.i06
#> ▸ Estimation algorithm: Expectation-Maximization (EM; Bock & Atkin, 1981) | doi: 10.1007/BF02293801
#> ▸ Factor scoring method: Expected A Posteriori (EAP)
#> ▸ AIC = 1656.018 | BIC = 1755.014 | SABIC = 1755.014 | HQ = 1696.084
#> 
#> 📌 Model-based reliability 📌
#> ▸ Empirical reliability | Average in the sample: .921
#> ▸ Expected reliability | Assumes a Normal(0,1) prior density: .919
#> 📌 Other reliability statistics (packages "irr" and "psych") 📌
#> ▸ Cronbach's Alpha: .903
#> ▸ Standardized Cronbach's Alpha : .913
#> ▸ Guttman's Lambda 4 :.939
#> ▸ Guttman's Lambda 6 :.908
#> ▸ Fleiss' Kappa : .153
#> ▸ Fleiss-Conger's Exact Kappa : .164
#> ▸ Intraclass Correlation Coefficient (One-Way Consistency model): .495
#> ▸ Intraclass Correlation Coefficient (Two-Way Consistency model): .538
#> ▸ Intraclass Correlation Coefficient (One-Way Agreement model): .495
#> ▸ Intraclass Correlation Coefficient (Two-Way Agreement model): .500The (manually or automatically) selected fitted model is stored in
the slot @mirt.object, so additional functions from
mirt can be easily used.
For example:
# Get more fit indices and compare models
mirt::anova(fit@mirt.object, verbose = F)
#>                      AIC    SABIC       HQ      BIC   logLik
#> fit@mirt.object 1656.018 1635.001 1696.084 1755.014 -790.009
# Get total information for a given vector of attributes
mirt::testinfo(fit@mirt.object, Theta = seq(from = -3, to = 3, by = 1))
#> [1]  2.602953  6.107328 13.251952 10.853827 13.552225  9.937194  4.407784
# Get the test information for case 1
mirt::testinfo(fit@mirt.object, Theta = fit@factor.scores.vector[1])
#> [1] 15.50897
# Get marginal reliability for high abilities – using a Normal(1,1) prior
mirt::marginal_rxx(fit@mirt.object,
                   density = function(x) {dnorm(x, mean = 1, sd = 1)})
#> [1] 0.9141302For now, direct comparisons between two models are not directly
implemented, but rather easy to do with mirt’s
anova() function.
The ratings_missing data is a simulated dataset with a
planned missingness design. The first 5 columns contains simulated
ratings (one column per judge), while the last contains true \(\theta\) values used for simulation.
jrt will be default impute missing data for partially
missing data, but can be easily retrieved.
fit <- jrt(ratings_missing[,1:5], irt.model = "PCM", silent = T) #fit model
#> ⚠️ Person fit statistics based on imputed data! Use with caution!The fit@output.data contains both the original data and
the data with imputation (variable names are tagged “original”” and
“imputed”), as well as the factor scores.
To retrieved them separately, the imputed data can be retrieved with
fit@imputed.data, the original data is in
fit@input.data and the factor scores can be retrieved like
described previously.
jrt for plotting?You may want to use jrt as a plotting device only.
That’s ok, because jrt plotting functions will accept
mirt objects as input. They should be detected
automatically as such (unidimensional models only).
Let’s fit a Generalized Partial Credit Model with mirt
for this example.
Now jcc.plot() can plot the category curves. Note that
the default column names is now automatically switched to “Item”.
For the information plot:
For convenience the argument item can be used instead of
judge in both plotting functions:
Even though it isn’t its primary purpose, jrt can also
plot binary item response functions. They will be automatically detected
and the plot will be named accordingly.
# SAT data from mirt
## Convert to binary
data <- mirt::key2binary(mirt::SAT12,
    key = c(1,4,5,2,3,1,2,1,3,1,2,4,2,1,5,3,4,4,1,4,3,3,4,1,3,5,1,3,1,5,4,5))
## Fit 2PL model in mirt
fit <- mirt::mirt(data = data, model = 1, itemtype = "2PL", verbose = F)
## Plotting an item response function
jcc.plot(fit, item = 2)## Plotting the item response functions of the first 12 items with a larger theta range
jcc.plot(fit, facet.cols = 4, item = 1:12, theta.span = 5)