testdat is a package designed to ease data validation, particularly for complex data processing, inspired by software unit testing. testdat extends the strong and flexible unit testing framework already provided by testthat with a family of functions and reporting tools focused on checking data frames.
Features include:
A fully fledged test framework so you can spend more time specifying tests and less time running them
A set of common methods for simply specifying data validation rules
Repeatability of data tests (avoid unintentionally breaking your data set!)
Data-focused reporting of test results
As an extension of testthat, testdat uses the same basic testing framework. Before using testdat (and reading this documentation), make sure you’re familiar with the introduction to testthat in R Packages.
The main addition provided by testdat is a set of expectations designed for testing data frames and an accompanying mechanism to globally specify a data set for testing.
In general, our approach to data testing is variable-centric: the majority of expectations perform a check on one or more variables in a given data frame.
The standard form of a data expectation is:
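The general shape can be read off the arguments described in the rest of this section. The sketch below is assembled from calls that appear later in this document rather than copied from the package help pages, so the schematic line in the comment is illustrative only and is not a real function.

```r
library(testdat)

# Schematic form (illustrative, not an actual function):
#   expect_*(vars, <expectation-specific arguments>, flt = <row filter>, data = <data frame>)

# A concrete instance, using the expect_range() call shown later in this document:
expect_range(
  Petal.Width,                # vars: the variable(s) to check
  0, 1,                       # expectation-specific arguments (here min and max)
  flt = Species == "setosa",  # optional filter restricting the rows tested
  data = iris                 # data frame to test
)
```

Each of these arguments is described in turn below.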
testdat uses dplyr at its core, and thus supports tidy evaluation.
Most operations act on one or more variables. There are two variants of the variable argument:
vars requires a set of columns specified as tidy selections.
test_that("multi-variable identifier is unique", {
  expect_unique(c(name, year, month, day, hour), data = storms)
})
#> -- Failure ('<text>:2:3'): multi-variable identifier is unique -----------------
#> `storms` has 46 duplicate records on variable `name, year, month, day, hour`.
#> Filter: None
#> Error in `reporter$stop_if_needed()`:
#> ! Test failed
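var requires an unquoted variable name. This only applies to a small number of expectations. Note that the example below is run without a data argument or a prior call to set_testdata(), so it stops with an error instead of reporting a test result.
test_that("hour values are valid", {
  expect_base(ts_diameter, year >= 2004)
})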
#> -- Error ('<text>:2:3'): hour values are valid ---------------------------------
#> Error: A test data frame has not been specified. Use `set_testdata()` to set the data frame.
#> Backtrace:
#> x
#> 1. \-testdat::expect_base(ts_diameter, year >= 2004)
#> 2. +-testthat::quasi_label(enquo(data))
#> 3. | \-rlang::eval_bare(expr, quo_get_env(quo))
#> 4. \-testdat::get_testdata()
#> Error in `reporter$stop_if_needed()`:
#> ! Test failed
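To get a feel for what a tidy selection resolves to, the same selection syntax can be tried with dplyr::select(). This is a minimal sketch that assumes vars accepts the same kinds of selections as select(), which the text above implies but does not spell out; the variable names on the left are introduced only for illustration.

```r
library(dplyr)

# Bare column names combined with c(), as in the vars example above
named_cols <- names(select(iris, c(Sepal.Length, Species)))

# Selection helpers pick columns by pattern
petal_cols <- names(select(iris, starts_with("Petal")))

# A range of adjacent columns
range_cols <- names(select(iris, Sepal.Length:Petal.Length))

named_cols  # "Sepal.Length" "Species"
petal_cols  # "Petal.Length" "Petal.Width"
range_cols  # "Sepal.Length" "Sepal.Width" "Petal.Length"
```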
The flt argument takes a logical predicate defined in terms of the variables in data using data masking. Only rows where the condition evaluates to TRUE are included in the test.
test_that("iris range checks", {
  expect_range(Petal.Width, 0, 1, data = iris)
})
#> -- Failure ('<text>:2:3'): iris range checks -----------------------------------
#> `iris` has 93 records failing range check on variable `Petal.Width`.
#> Variable set: `Petal.Width`
#> Filter: None
#> Arguments: `min = 0, max = 1`
#> Error in `reporter$stop_if_needed()`:
#> ! Test failed
test_that("iris range checks filtered", {
  # Test passes for setosa rows
  expect_range(Petal.Width, 0, 1, flt = Species == "setosa", data = iris)
  # Failures will provide the filter
  expect_range(Petal.Width, 0, 0.5, flt = Species == "setosa", data = iris)
})
#> -- Failure ('<text>:9:3'): iris range checks filtered --------------------------
#> `iris` has 1 records failing range check on variable `Petal.Width`.
#> Variable set: `Petal.Width`
#> Filter: `Species == "setosa"`
#> Arguments: `min = 0, max = 0.5`
#> Error in `reporter$stop_if_needed()`:
#> ! Test failed
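The failure counts reported above can be reproduced by hand, which may help clarify what the filter does. The following base R sketch is not how testdat computes its results; it simply restricts the rows first and then counts the values outside the range.

```r
# Rows that the filter Species == "setosa" keeps
setosa <- iris[iris$Species == "setosa", ]

# Records outside [0, 0.5] among the filtered rows
n_fail_filtered <- sum(setosa$Petal.Width < 0 | setosa$Petal.Width > 0.5)

# Records outside [0, 1] when no filter is applied
n_fail_all <- sum(iris$Petal.Width < 0 | iris$Petal.Width > 1)

n_fail_filtered  # 1, matching the filtered failure message
n_fail_all       # 93, matching the unfiltered failure message
```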
The data argument takes a data frame to test. To avoid redundant code, the data argument defaults to a global test data set retrieved using get_testdata(). This can be used in two ways:
Using set_testdata().
set_testdata(iris)
identical(get_testdata(), iris)
#> [1] TRUE
test_that("Versicolor has sepal length greater than 5 - will fail", {
  expect_cond(Species %in% "versicolor", Sepal.Length >= 5)
})
#> -- Failure ('<text>:5:3'): Versicolor has sepal length greater than 5 - will fail --
#> get_testdata() failed consistency check. 1 cases have `Species %in% "versicolor"` but not `Sepal.Length >= 5`.
#> Error in `reporter$stop_if_needed()`:
#> ! Test failed
Using the with_testdata() wrapper to temporarily set the global test data for a block of code.
set_testdata(mtcars)
identical(get_testdata(), mtcars)
#> [1] TRUE
with_testdata(iris, {
  test_that("Versicolor has sepal length greater than 5 - will fail", {
    expect_cond(Species %in% "versicolor", Sepal.Length >= 5)
  })
})
#> -- Failure ('<text>:6:5'): Versicolor has sepal length greater than 5 - will fail --
#> get_testdata() failed consistency check. 1 cases have `Species %in% "versicolor"` but not `Sepal.Length >= 5`.
#> Error in `reporter$stop_if_needed()`:
#> ! Test failed
identical(get_testdata(), mtcars)
#> [1] TRUE
Both approaches are equivalent to:
test_that("Versicolor has sepal length greater than 5 - will fail", {
  expect_cond(Species %in% "versicolor", Sepal.Length >= 5, data = iris)
})
#> -- Failure ('<text>:2:3'): Versicolor has sepal length greater than 5 - will fail --
#> `iris` failed consistency check. 1 cases have `Species %in% "versicolor"` but not `Sepal.Length >= 5`.
#> Error in `reporter$stop_if_needed()`:
#> ! Test failed
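The "set temporarily, then restore" behaviour seen with with_testdata() is a common R pattern. The sketch below shows one way such a wrapper could be written. All names in it (.store, set_current(), get_current(), with_current()) are hypothetical stand-ins, and it is not testdat's implementation: for one thing, it stores the data itself rather than a quosure.

```r
# Hypothetical stand-ins for illustration only, NOT testdat's implementation
.store <- new.env()

set_current <- function(data) {
  old <- .store$data
  .store$data <- data
  invisible(old)  # return the previous value so it can be restored
}

get_current <- function() .store$data

with_current <- function(data, code) {
  old <- set_current(data)
  on.exit(set_current(old), add = TRUE)  # restore even if `code` throws an error
  force(code)                            # `code` is evaluated only now, after the swap
}

set_current(mtcars)
inside <- with_current(iris, names(get_current())[1])
outside <- names(get_current())[1]

inside   # "Sepal.Length"
outside  # "mpg"
```

Using on.exit() means the previous data set comes back even when a test inside the block fails, which is consistent with the identical(get_testdata(), mtcars) check shown above.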
By default, set_testdata() stores a quosure with a reference to the provided data frame rather than the data itself, so changes made to the data frame will be reflected in the test results.
tmp_data <- tibble(x = c(1, 0), y = c(1, NA))
set_testdata(tmp_data)
print(get_testdata())
#> # A tibble: 2 x 2
#> x y
#> <dbl> <dbl>
#> 1 1 1
#> 2 0 NA
expect_base(y, x == 1)
tmp_data$y <- 1
print(get_testdata())
#> # A tibble: 2 x 2
#> x y
#> <dbl> <dbl>
#> 1 1 1
#> 2 0 1
expect_base(y, x == 1)
#> Error: get_testdata() has a base mismatch in variable `y`.
#> 0 cases have `x == 1` but `y` is missing.
#> 1 cases do not have `x == 1` but `y` is non missing.
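The difference between holding a reference and holding a copy can be seen with rlang directly. This is a minimal sketch using rlang's quo() and eval_tidy(); whether set_testdata() evaluates its quosure in exactly this way is an assumption, and the names by_reference and by_value are introduced here for illustration.

```r
library(rlang)

tmp_data <- data.frame(x = c(1, 0), y = c(1, NA))

by_reference <- quo(tmp_data)  # captures the expression `tmp_data` and its environment
by_value <- tmp_data           # copies the current value

tmp_data$y <- 1                # modify the data frame afterwards

ref_now <- eval_tidy(by_reference)  # re-evaluated, so it sees the modified data

anyNA(ref_now$y)   # FALSE: the quosure reflects the change
anyNA(by_value$y)  # TRUE: the copy still holds the old values
```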
... stands for additional arguments, which are specific to the expectation. See the help page for the expectation function for details.
Data expectations fall into a few main classes.
?`date-expectations`
?`label-expectations`
?`pattern-expectations`
?`proportion-expectations`
?`text-expectations`
?`value-expectations`
Value expectations test variable values for valid data. Tests include explicit value checks, pattern checks and others.
?`exclusivity-expectations`
?`uniqueness-expectations`
Relationship expectations test for relationships among variables. Tests include uniqueness and exclusivity checks.
?`conditional-expectations`
Conditional expectations check for the co-existence of multiple conditions.
?`datacomp-expectations`
Data frame comparison expectations test for consistency between data frames, for example ensuring similar frequencies between similar variables in different data frames.
?`generic-expectations`
Generic expectations allow for testing of a data frame using an arbitrary function. The function provided should take a single vector as its first argument and return a logical vector showing whether each element has passed or failed. Additional arguments to the checking function can be passed as a list using the args argument.
testdat includes a set of useful checking functions. See ?`chk-generic` for details. Several of the checking functions have a corresponding expectation. These are listed in ?`chk-expect`.
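As an illustration of the contract described above, here is a sketch of a checking function in base R. The function chk_multiple_of() is hypothetical and not part of testdat; the do.call() line only imitates how extra arguments held in a list might be forwarded to it.

```r
# Hypothetical checking function: takes a vector first, returns one logical per element
chk_multiple_of <- function(x, base = 1) {
  !is.na(x) & x %% base == 0
}

values <- c(2, 4, 5, NA)
extra_args <- list(base = 2)  # additional arguments, held in a list

# Call the function with the vector first, followed by the extra arguments
passed <- do.call(chk_multiple_of, c(list(values), extra_args))

passed  # TRUE TRUE FALSE FALSE
```

A function of this shape could then be handed to one of the generic expectations listed in ?`generic-expectations`, with list(base = 2) supplied through args.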
The easiest way to use data testing is directly inside an R script. Expectations and test blocks throw an error if they fail, so it is very clear to the user that something needs to be checked, and the script will fail when sourced.
The example below shows how to adapt a script from exploratory “print and check” to a testing approach, using a simple data processing exercise.
library(dplyr)
x <- tribble(
  ~id, ~pcode, ~state, ~nsw_only,
  1, 2000, "NSW", 1,
  2, 3123, "VIC", NA,
  3, 2123, "NSW", 3,
  4, 12345, "VIC", 3
)
# check id is unique
x %>% filter(duplicated(id))
#> # A tibble: 0 x 4
#> # i 4 variables: id <dbl>, pcode <dbl>, state <chr>, nsw_only <dbl>
# check values
x %>% filter(!pcode %in% 2000:3999)
#> # A tibble: 1 x 4
#> id pcode state nsw_only
#> <dbl> <dbl> <chr> <dbl>
#> 1 4 12345 VIC 3
x %>% count(state)
#> # A tibble: 2 x 2
#> state n
#> <chr> <int>
#> 1 NSW 2
#> 2 VIC 2
x %>% count(nsw_only)
#> # A tibble: 3 x 2
#> nsw_only n
#> <dbl> <int>
#> 1 1 1
#> 2 3 2
#> 3 NA 1
# check base for nsw_only variable
x %>% filter(state != "NSW") %>% count(nsw_only)
#> # A tibble: 2 x 2
#> nsw_only n
#> <dbl> <int>
#> 1 3 1
#> 2 NA 1
x <- x %>% mutate(market = case_when(pcode %in% 2000:2999 ~ 1,
                                     pcode %in% 3000:3999 ~ 2))
x %>% count(market)
#> # A tibble: 3 x 2
#> market n
#> <dbl> <int>
#> 1 1 2
#> 2 2 1
#> 3 NA 1
library(testdat)
library(dplyr)
x <- tribble(
  ~id, ~pcode, ~state, ~nsw_only,
  1, 2000, "NSW", 1,
  2, 3123, "VIC", NA,
  3, 2123, "NSW", 3,
  4, 12345, "VIC", 3
)
with_testdata(x, {
  test_that("id is unique", {
    expect_unique(id)
  })
  test_that("variable values are correct", {
    expect_values(pcode, 2000:2999, 3000:3999)
    expect_values(state, c("NSW", "VIC"))
    expect_values(nsw_only, 1:3) # by default expect_values allows NAs
  })
  test_that("filters applied correctly", {
    expect_base(nsw_only, state == "NSW")
  })
})
#> Test passed
#> -- Failure ('<text>:18:5'): variable values are correct ------------------------
#> get_testdata() has 1 records failing value check on variable `pcode`.
#> Variable set: `pcode`
#> Filter: None
#> Arguments: `<int: 2000L, 2001L, 2002L, 2003L, 2004L, ...>, <int: 3000L, 3001L, 3002L,`
#> get_testdata() has 1 records failing value check on variable `pcode`.
#> Variable set: `pcode`
#> Filter: None
#> Arguments: ` 3003L, 3004L, ...>, miss = <chr: NA, "">`
#> Error in `reporter$stop_if_needed()`:
#> ! Test failed
x <- x %>% mutate(market = case_when(pcode %in% 2000:2999 ~ 1,
                                     pcode %in% 3000:3999 ~ 2))
with_testdata(x, {
  test_that("market derived correctly", {
    expect_values(market, 1:2, miss = NULL) # miss = NULL excludes NAs from valid values
  })
})
#> -- Failure ('<text>:33:5'): market derived correctly ---------------------------
#> get_testdata() has 1 records failing value check on variable `market`.
#> Variable set: `market`
#> Filter: None
#> Arguments: `<int: 1L, 2L>, miss = NULL`
#> Error in `reporter$stop_if_needed()`:
#> ! Test failed
Note that:
The global test data can be set with set_testdata() or with_testdata().
The test_that() wrapper is not strictly necessary, but it provides more informative error messaging and logically groups a set of expectations.
Each with_testdata() block will fail on the first test_that() block that fails. The “filters applied correctly” test block is not run in the example.
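Because a failed expectation outside test_that() raises an ordinary R error, a script can either let it stop execution or handle it explicitly. The sketch below uses a plain testthat expectation on the postcode values from the example rather than a testdat one, so it only illustrates the general error behaviour; the variable result is introduced for illustration.

```r
library(testthat)

# A bare expectation, with no test_that() wrapper, on two of the example postcodes
result <- tryCatch(
  {
    expect_true(all(c(2000, 12345) %in% 2000:3999))
    "passed"
  },
  error = function(e) "failed"  # a failed expectation surfaces as an error
)

result  # "failed", because 12345 is outside 2000:3999
```

Without the tryCatch() the error would propagate, which is what makes a sourced script stop at the first failing check.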
Setting up a proper project test suite can be useful for automatically validating final processed data sets.
Setting up proper file-based testing infrastructure is outside the scope of this vignette. See R Packages for a brief introduction to testing infrastructure.
In testthat, related tests are grouped into files. When using testdat, each file should have a test data set specified with a call to set_testdata().
set_testdata(iris)
test_that("Variable format checks", {
  expect_regex(Species, "^[a-z]+$")
})
#> Test passed
test_that("Versicolor has sepal length greater than 5 - will fail", {
  expect_cond(Species %in% "versicolor", Sepal.Length >= 5)
})
#> -- Failure ('<text>:8:3'): Versicolor has sepal length greater than 5 - will fail --
#> get_testdata() failed consistency check. 1 cases have `Species %in% "versicolor"` but not `Sepal.Length >= 5`.
#> Error in `reporter$stop_if_needed()`:
#> ! Test failed