This is an introduction to the auctestr package, a package for statistical testing of the AUC (Area Under the Receiver Operating Characteristic Curve, also known as A’) statistic. The AUC has useful statistical properties that make it especially simple to apply statistical tests to. Furthermore, auctestr implements basic statistical procedures for applying these tests even when you have several observations of the AUC of a given model, including observations over different datasets, and within datasets when there is some form of dependency (such as observations over time, or across multiple randomized resamples or cross-validation folds).
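The key property in question is that the standard error of an observed AUC can be computed in closed form from the AUC value and the counts of positive and negative examples, via the AUC's equivalence to the Wilcoxon statistic (Hanley and McNeil, 1982). As a minimal sketch using the package's se_auc() function (the argument order shown here, AUC followed by positive and negative counts, is an assumption based on the package documentation):

library(auctestr)
# standard error of an observed AUC of 0.80, computed from the AUC value alone
# plus the number of positive (350) and negative (1407) cases;
# argument order is assumed, not confirmed, here
se_auc(0.80, 350, 1407)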
auctestr is useful if you:

- Are evaluating predictive models.
- Need to conduct pairwise comparisons of the performance of those models.
- Are using AUC (or A’) to evaluate the performance of those models (note that there are multiclass versions of AUC that can also be used for this).
For the remainder of this document, we refer to the statistic of interest simply as AUC. Note that the statistical properties used in this package are unique to the AUC statistic and cannot be used to evaluate other model performance metrics (e.g., accuracy, F1 score).
auctestr currently contains only four simple functions, which are all that is required for complete statistical testing of the AUC. An example dataset consists of one or more observations of at least two different predictive models:
data("sample_experiment_data", package="auctestr")
head(sample_experiment_data, 15)
##          auc precision  accuracy    n n_p  n_n  dataset time model_id model_variant
## 1  0.7957640 0.5354970 0.8207171 1757 350 1407 dataset1    0   ModelA      VariantA
## 2  0.7957640 0.5354970 0.8207171 1757 350 1407 dataset1    0   ModelC      VariantA
## 3  0.7957640 0.5354970 0.8207171 1757 350 1407 dataset1    0   ModelB      VariantA
## 4  0.8459516 0.4772727 0.8471926 1407 199 1208 dataset1    1   ModelA      VariantA
## 5  0.7473793 0.6300578 0.8905473 1407 199 1208 dataset1    1   ModelC      VariantA
## 6  0.7440098 0.6407186 0.8919687 1407 199 1208 dataset1    1   ModelB      VariantA
## 7  0.8434291 0.6080000 0.8841060 1208 194 1014 dataset1    2   ModelA      VariantA
## 8  0.8097918 0.7371429 0.9081126 1208 194 1014 dataset1    2   ModelC      VariantA
## 9  0.8009618 0.7440476 0.9072848 1208 194 1014 dataset1    2   ModelB      VariantA
## 10 0.8455385 0.6265823 0.8471400 1014 235  779 dataset1    3   ModelA      VariantA
## 11 0.8339251 0.6654676 0.8589744 1014 235  779 dataset1    3   ModelC      VariantA
## 12 0.7319750 0.7393939 0.8461538 1014 235  779 dataset1    3   ModelB      VariantA
## 13 0.4970371 0.3500000 0.5866496  779 316  463 dataset1    4   ModelA      VariantA
## 14 0.7426457 0.7167235 0.7573813  779 316  463 dataset1    4   ModelC      VariantA
## 15 0.7586701 0.7046154 0.7650834  779 316  463 dataset1    4   ModelB      VariantA
Statistical comparisons of models, including comparisons over time, can be completed in a single call to auc_compare():
# compare ModelA and ModelB, evaluating only VariantC of both models
z_score <- auc_compare(sample_experiment_data,
                       compare_values = c("ModelA", "ModelB"),
                       filter_value = c("VariantC"),
                       time_col = "time",
                       outcome_col = "auc",
                       compare_col = "model_id",
                       over_col = "dataset",
                       filter_col = "model_variant")
## fetching comparison results for models ModelA, ModelB in dataset dataset1 with filter value VariantC
## fetching comparison results for models ModelA, ModelB in dataset dataset2 with filter value VariantC
## fetching comparison results for models ModelA, ModelB in dataset dataset3 with filter value VariantC
z_score
## [1] 3.604343
# one-tailed p-value of this comparison
pnorm(-abs(z_score))
## [1] 0.0001564715
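Note that pnorm(-abs(z_score)) gives a one-tailed p-value; for a two-tailed test of the null hypothesis that the two models' AUCs are equal, simply double it:

# two-tailed p-value of the same comparison
2 * pnorm(-abs(z_score))

## [1] 0.000312943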
auctestr also allows flexible control over which pairwise comparisons are conducted, and which elements are held fixed (the fixed values are set using the filter_value and filter_col parameters):
z_score <- auc_compare(sample_experiment_data,
                       compare_values = c("VariantA", "VariantB"),
                       filter_value = c("ModelC"),
                       time_col = "time",
                       outcome_col = "auc",
                       compare_col = "model_variant",
                       over_col = "dataset",
                       filter_col = "model_id")
## fetching comparison results for models VariantA, VariantB in dataset dataset1 with filter value ModelC
## fetching comparison results for models VariantA, VariantB in dataset dataset2 with filter value ModelC
## fetching comparison results for models VariantA, VariantB in dataset dataset3 with filter value ModelC
z_score
## [1] 1.655143
pnorm(-abs(z_score))
## [1] 0.04894775
The model comparisons are conducted using a method described in detail in: Fogarty, James, Ryan S. Baker, and Scott E. Hudson. “Case studies in the use of ROC curve analysis for sensor-based estimates in human computer interaction.” Proceedings of Graphics Interface 2005. Canadian Human-Computer Communications Society, 2005.
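In brief, that method treats each observed AUC as approximately normal, with a standard error given by the Hanley-McNeil (1982) formula, and compares two AUCs with a z-test on their difference. The following is a minimal base-R sketch of that logic under those assumptions; hm_se() is a hypothetical helper written out here for illustration (auctestr's se_auc() plays this role in the package itself), and the AUC values and class counts are made up:

# hypothetical helper: Hanley-McNeil (1982) standard error of an AUC
hm_se <- function(auc, n_p, n_n) {
  q1 <- auc / (2 - auc)        # P(two random positives both rank above a random negative)
  q2 <- 2 * auc^2 / (1 + auc)  # P(a random positive ranks above two random negatives)
  sqrt((auc * (1 - auc) + (n_p - 1) * (q1 - auc^2) +
          (n_n - 1) * (q2 - auc^2)) / (n_p * n_n))
}
# z-test on the difference of two independent AUCs (made-up values)
auc_a <- 0.84; auc_b <- 0.76
z <- (auc_a - auc_b) / sqrt(hm_se(auc_a, 200, 1200)^2 + hm_se(auc_b, 200, 1200)^2)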
Note that these comparisons assume there is a dataset column whose levels must be statistically averaged over; auc_compare() combines the per-dataset Z-scores using Stouffer's method:
\(Z = \frac{\sum_{i=1}^{k} Z_i}{\sqrt{k}}\)
This is a conservative adjustment; more powerful, less conservative adjustments may be added in future versions. For more information, see Stouffer, S.A., Suchman, E.A., DeVinney, L.C., Star, S.A., and Williams, R.M. Jr. (1949). The American Soldier, Vol. 1: Adjustment During Army Life. Princeton University Press, Princeton, or [Wikipedia](https://en.wikipedia.org/wiki/Fisher%27s_method#Relation_to_Stouffer.27s_Z-score_method).
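For concreteness, here is the Stouffer combination written out in base R; the per-dataset Z-scores below are made-up values for illustration, and this is the computation auc_compare() applies internally across datasets:

# combine k independent per-dataset Z-scores into a single Z-score
z_by_dataset <- c(2.1, 1.4, 2.7)               # hypothetical Z-scores from k = 3 datasets
sum(z_by_dataset) / sqrt(length(z_by_dataset)) # compare against N(0, 1) as usual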
We hope to implement more features in future versions, but auctestr is already a complete package for principled statistical model selection based on the unique statistical properties of the AUC metric; we hope it improves your research and modeling.