In today’s blog post we will take a look at a package that allows you to keep track of the number of observations that you in- or exclude from your dataset or study population.
In epidemiological studies we often want to limit our study population based on some inclusion and exclusion criteria. This is also the case in other areas such as social science or econometrics. Although there are a lot of nice functions in R to exclude observations from your study population, i.e. in the packages {base}, {dplyr} or {data.table}, there is no function that records how many observations you excluded or included in each step. Hence, I created a package {ExclusionTable} which includes the exclusion_table() function that does exactly this. In this blog post, I would like to introduce the function and show some of its features. The blog post also serves as a vignette for the {ExclusionTable} package.
For this let’s imagine we are working with data from the the US National Health and Nutrition Examination Study (NHANES) from the years 2009 to 2012. This dataset includes questionnaire data on health related topics for a sample of the US population. To load the data we can conveniently use the {NHANES} package on CRAN. If you would like to get more information on the dataset please take a look at ?NHANES.
# Load packages
library(NHANES) # NHANES dataset
library(dplyr) # Data manipulation
library(ExclusionTable)
# Attach data
data("NHANES")
Usually we don’t want to use the whole study population for our analysis but only a subset. Let’s assume we are interested in a subset of older females included in the NHANES dataset who have complete information on BMI. For this we would like to include females aged 65 years or above and exclude all individuals with missing information on BMI. If we would like to do this in R we could for example use dplyr::filter().
# Subset NHANES data using dplyr::filter
NHANES_subset <- NHANES %>%
filter(Gender == "female",
Age >= 65,
!is.na(BMI))
# Print number of observations
nrow(NHANES_subset)
## [1] 731
The code above obtained the subpopulation of the study population that we are interest in. For this we excluded 9,269 of the 10,000 observations leaving us with 731 observations in the dataset. However, we don’t know how many observations we excluded at each step, i.e. based on each of the three criteria. Often we want to report those numbers in our articles, e.g. in study population flow-charts. That’s were the {ExclusionTable} package makes your life easier.
The ExclusionTable::exclusion_table() function takes as first argument a dataframe and as second and third arguments a character vector of inclusion and exclusion criteria, respectively. Let’s try this function for our previous exclusions and inclusions. For this let’s set keep_data = FALSE, since we are for now only interested in obtaining the numbers of excluded individuals.
exclusion_table(NHANES,
inclusion_criteria = c("Gender == 'female'",
"Age >= 65"),
exclusion_criteria = "is.na(BMI)",
keep_data = FALSE)
##
## ===============================================
## Excluded the following observations:
## ===============================================
## Exclusions based on INCLUSION criteria
##
## inclusion n_prior n_post n_excluded
## 1 Gender == 'female' 10000 5020 4980
## 2 Age >= 65 5020 739 4281
## 3 TOTAL 10000 739 9261
##
## Exclusions based on EXCLUSION criteria
##
## exclusion n_prior n_post n_excluded
## 1 is.na(BMI) 739 731 8
## 2 TOTAL 739 731 8
##
## ===============================================
The output above shows us that we excludes 4,980 male individuals from our dataset and 4,281 individuals that were aged below 65. In total we excluded 9,261 individuals based on our inclusion criteria. Additionally, we excluded 8 individuals with missing information on BMI. Hence, we end up with a dataset including 731 individuals as before. Please notice the difference between inclusion and exclusion criteria: Individuals who do not meet the inclusion criteria will be excluded whereas individuals who meet the exclusion criteria will excluded. Also, ExclusionTable::exclusion_table(} will always apply the exclusion criteria after the inclusion criteria.
If we want to use strings in the filter statements, we can use single quotation marks within the strings, e.g. “Gender == ‘female’”.
This looks already really nice, but let’s improve the output a little bit and supply some labels for our inclusion and exclusion criteria.
exclusion_table(NHANES,
inclusion_criteria = c("Gender == 'female'",
"Age >= 65"),
exclusion_criteria = "is.na(BMI)",
labels_inclusion = c("Get females",
"Age is >= 65"),
labels_exclusion = "Missing BMI",
keep_data = FALSE)
##
## =========================================
## Excluded the following observations:
## =========================================
## Exclusions based on INCLUSION criteria
##
## inclusion n_prior n_post n_excluded
## 1 Get females 10000 5020 4980
## 2 Age is >= 65 5020 739 4281
## 3 TOTAL 10000 739 9261
##
## Exclusions based on EXCLUSION criteria
##
## exclusion n_prior n_post n_excluded
## 1 Missing BMI 739 731 8
## 2 TOTAL 739 731 8
##
## =========================================
Now we get nice labels in our inclusion and exclusion tables. Let’s next actually keep the data and take a look at the output object of the exclusion_table() function. For this we will set keep_data = TRUE.
NHANES_ex_tab <-
exclusion_table(NHANES,
inclusion_criteria = c("Gender == 'female'",
"Age >= 65"),
exclusion_criteria = "is.na(BMI)",
labels_inclusion = c("Get females",
"Age is >= 65"),
labels_exclusion = "Missing BMI",
keep_data = TRUE)
# Print structure
str(NHANES_ex_tab, 1)
## List of 3
## $ table_in:'data.frame': 3 obs. of 4 variables:
## $ table_ex:'data.frame': 2 obs. of 4 variables:
## $ dataset :'data.frame': 731 obs. of 76 variables:
## - attr(*, "class")= chr "exl_tbl"
The exclusion_table() function returns an exl_tbl objects which is a list with maximum three data.frames, depending on the options used. The [[“table_in”]] and [[“table_ex”]] data.frames include the tables with the numbers of individuals excluded due to our inclusion and exclusion criteria, respectively. Those are the tables we have seen previously. The [[“dataset”]] data.frame includes the cleaned dataset without the excluded observations. You can use this dataset for your further analysis. I usually assign the dataset to another object to access it more easily.
## [1] 731
Sometimes you might also be interested in using a variable from your global environment in the filter call. I often do this when I want to filter certain diagnosis codes, that I stored in a character variable in my environment. However, since we don’t have any diagnosis codes in this dataset we will take a look at certain number of rooms in the individual’s home. This information is stored in the variable HomeRooms. Let’s assume we would like to only include individuals who live in a home with 2, 4 and 9 rooms.
room_selection <- c(2, 4, 9)
exclusion_table(NHANES,
inclusion_criteria = c("HomeRooms %in% obj$room_selection"),
labels_inclusion = c("2, 4, 9 rooms"),
obj = list(room_selection = room_selection))
##
## ==========================================
## Excluded the following observations:
## ==========================================
## Exclusions based on INCLUSION criteria
##
## inclusion n_prior n_post n_excluded
## 1 2, 4, 9 rooms 10000 2283 7717
## 2 TOTAL 10000 2283 7717
##
## ==========================================
By supplying objects as names lists to the obj argument of exclusion_table we can access them in our filter calls using obj$<name of your object>.
To summaries the ExclusionTable::exclusion_table() function allows you to easily keep track of your exclusion and inclusion criteria and helps you to report how many individuals you excluded at each step. If you have any questions or suggestions please write me an email or contact me on Twitter or file an issue on GitHub.