Real World Example

This vignette demonstrates using dwctaxon on “real life” data found in the wild. Our goal is to import the data and validate it.

First, load the packages used for this vignette.

library(dwctaxon)
library(readr)
library(tibble)
library(dplyr)

Import data

We will use the Database of Vascular Plants of Canada (VASCAN), which is available as a Darwin Core Archive.

The data can be obtained manually by going to the VASCAN website, downloading the Darwin Core Archive, and unzipping it [1].

Alternatively, it can be downloaded and unzipped with R. First, we set up some temporary folders for downloading and specify the URL:

# - Specify temporary folder for downloading data
temp_dir <- tempdir()
# - Set name of zip file
temp_zip <- paste0(temp_dir, "/dwca-vascan.zip")
# - Set name of unzipped folder
temp_unzip <- paste0(temp_dir, "/dwca-vascan")
# - Set URL of the Darwin Core Archive (VASCAN v37.12)
vascan_url <- "https://data.canadensys.net/ipt/archive.do?r=vascan&v=37.12"

Next, download and unzip the zip file.

# Download data
download.file(url = vascan_url, destfile = temp_zip, mode = "wb")

# Unzip
unzip(temp_zip, exdir = temp_unzip)

# Check the contents of the unzipped data (the Darwin Core Archive)
list.files(temp_unzip)
#> [1] "description.txt"          "distribution.txt"         "eml.xml"                  "meta.xml"                 "resourcerelationship.txt"
#> [6] "taxon.txt"                "vernacularname.txt"

Finally, load the taxonomic data (taxon.txt) into R. It is a tab-separated text file, so we use readr::read_tsv() to load it.

vascan <- read_tsv(paste0(temp_unzip, "/taxon.txt"))
#> Warning: One or more parsing issues, call `problems()` on your data frame for details, e.g.:
#>   dat <- vroom(...)
#>   problems(dat)
#> Rows: 32770 Columns: 24
#> ── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (20): nameAccordingToID, scientificName, acceptedNameUsage, parentNameUsage, nameAccordingTo, higherClassification, class, order, fa...
#> dbl  (4): id, taxonID, acceptedNameUsageID, parentNameUsageID
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Take a peek at the data
vascan
#> # A tibble: 32,770 × 24
#>       id taxonID acceptedNameUsageID parentNameUsageID nameAccordingToID    scientificName acceptedNameUsage parentNameUsage nameAccordingTo
#>    <dbl>   <dbl>               <dbl>             <dbl> <chr>                <chr>          <chr>             <chr>           <chr>          
#>  1    73      73                  73                NA http://dx.doi.org/1… Equisetopsida… Equisetopsida C.… <NA>            Chase, M.W. & …
#>  2    26      26                  26                73 http://dx.doi.org/1… Equisetidae W… Equisetidae Warm… Equisetopsida … Chase, M.W. & …
#>  3    25      25                  25                26 http://www.jstor.or… Equisetales d… Equisetales de C… Equisetidae Wa… Smith, A.R., K…
#>  4   128     128                 128                25 http://www.jstor.or… Equisetaceae … Equisetaceae Mic… Equisetales de… Smith, A.R., K…
#>  5  1142    1142                1142               128 http://www.efloras.… Equisetum Lin… Equisetum Linnae… Equisetaceae M… FNA Editorial …
#>  6  2004    2004                2004              1142 http://www.efloras.… Equisetum sub… Equisetum subg. … Equisetum Linn… FNA Editorial …
#>  7  5467    5467                5467              2004 http://www.efloras.… Equisetum flu… Equisetum fluvia… Equisetum subg… FNA Editorial …
#>  8  5466    5466                5466              2004 http://www.efloras.… Equisetum arv… Equisetum arvens… Equisetum subg… FNA Editorial …
#>  9  5472    5472                5472              2004 http://www.efloras.… Equisetum pra… Equisetum praten… Equisetum subg… FNA Editorial …
#> 10  5471    5471                5471              2004 http://www.efloras.… Equisetum pal… Equisetum palust… Equisetum subg… FNA Editorial …
#> # ℹ 32,760 more rows
#> # ℹ 15 more variables: higherClassification <chr>, class <chr>, order <chr>, family <chr>, genus <chr>, subgenus <chr>,
#> #   specificEpithet <chr>, infraspecificEpithet <chr>, taxonRank <chr>, scientificNameAuthorship <chr>, taxonomicStatus <chr>,
#> #   modified <chr>, license <chr>, bibliographicCitation <chr>, references <chr>

The dataset includes 32,770 rows (taxa) and 24 columns.
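
The parsing warning printed by read_tsv() is worth a look before moving on. The sketch below uses a small made-up file (not the VASCAN data) to show how readr::problems() reports values that could not be parsed with the declared column type. The file contents and the object name toy are assumptions for illustration only.

```r
library(readr)

# Write a tiny tab-separated file in which one taxonID is not a number
toy_file <- tempfile(fileext = ".txt")
writeLines(
  c("taxonID\tscientificName", "1\tSpecies a", "x2\tSpecies b"),
  toy_file
)

# Read it while forcing taxonID to be numeric, which triggers a parsing issue
toy <- read_tsv(
  toy_file,
  col_types = cols(
    taxonID = col_double(),
    scientificName = col_character()
  )
)

# List the rows and columns where parsing failed
problems(toy)
```

For the real dataset, the equivalent call would be problems(vascan).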

Validation

Let’s see if the dataset passes validation with dwctaxon.

It is usually a good idea to just run dct_validate() with default settings the first time. If it passes, you can move on.

dct_validate(vascan)
#> Error: check_sci_name failed
#>    scientificName detected with duplicated value
#>    Bad scientificName: Scilla esculenta Ker Gawler, Arnica monocephala Rydberg, Arnica pedunculata Rydberg, Trifolium tridentatum Lindley, Oenothera angustissima R.R. Gates, Spiraea discolor Pursh, Viola discurrens Greene, Cogswellia simplex (Nuttall ex S. Watson) M.E. Jones, Scilla esculenta Ker Gawler, Ginseng trifolium (Linnaeus) Alph. Wood, Arnica pedunculata Rydberg, Arnica monocephala Rydberg, Swida racemosa (Lamarck) Moldenke, Spiraea discolor Pursh, Swida racemosa (Lamarck) Moldenke, Trifolium tridentatum Lindley, Oenothera angustissima R.R. Gates, Aralia triphylla Poiret, Panax lanceolatus Rafinesque, Panax pusillus Sims, Aralia triphylla Poiret, Ginseng trifolium (Linnaeus) Alph. Wood, Panax lanceolatus Rafinesque, Panax pusillus Sims, Cogswellia simplex (Nuttall ex S. Watson) M.E. Jones, Viola discurrens Greene

Looks like we’ve got problems…

To dig into these in more detail, let’s run dct_validate() again, but this time output a summary of errors.

validation_res <- dct_validate(vascan, on_fail = "summary")
#> Warning: scientificName detected with duplicated value
#> Warning: Invalid column names detected: id

validation_res
#> # A tibble: 27 × 4
#>    taxonID scientificName                                       error                                         check          
#>      <dbl> <chr>                                                <glue>                                        <chr>          
#>  1      NA <NA>                                                 Invalid column names detected: id             check_col_names
#>  2   10170 Scilla esculenta Ker Gawler                          scientificName detected with duplicated value check_sci_name 
#>  3   10664 Arnica monocephala Rydberg                           scientificName detected with duplicated value check_sci_name 
#>  4   10665 Arnica pedunculata Rydberg                           scientificName detected with duplicated value check_sci_name 
#>  5   16398 Trifolium tridentatum Lindley                        scientificName detected with duplicated value check_sci_name 
#>  6   17569 Oenothera angustissima R.R. Gates                    scientificName detected with duplicated value check_sci_name 
#>  7   20099 Spiraea discolor Pursh                               scientificName detected with duplicated value check_sci_name 
#>  8   21522 Viola discurrens Greene                              scientificName detected with duplicated value check_sci_name 
#>  9   21946 Cogswellia simplex (Nuttall ex S. Watson) M.E. Jones scientificName detected with duplicated value check_sci_name 
#> 10   24660 Scilla esculenta Ker Gawler                          scientificName detected with duplicated value check_sci_name 
#> # ℹ 17 more rows

The summary lists one error per row, along with the taxonID and scientificName of the affected record where applicable. Let’s count these to get a higher-level view of what’s going on.

validation_res %>%
  count(check, error)
#> # A tibble: 2 × 3
#>   check           error                                             n
#>   <chr>           <glue>                                        <int>
#> 1 check_col_names Invalid column names detected: id                 1
#> 2 check_sci_name  scientificName detected with duplicated value    26

We see there are two kinds of errors: one column with an invalid name (id) and 26 rows with duplicated scientific names.

Investigate errors

Duplicate names

Let’s take a closer look at some of those duplicated names.

dup_names <-
  validation_res %>%
  filter(grepl("scientificName detected with duplicated value", error)) %>%
  arrange(scientificName)

dup_names
#> # A tibble: 26 × 4
#>    taxonID scientificName                                       error                                         check         
#>      <dbl> <chr>                                                <glue>                                        <chr>         
#>  1   29934 Aralia triphylla Poiret                              scientificName detected with duplicated value check_sci_name
#>  2   31463 Aralia triphylla Poiret                              scientificName detected with duplicated value check_sci_name
#>  3   10664 Arnica monocephala Rydberg                           scientificName detected with duplicated value check_sci_name
#>  4   25705 Arnica monocephala Rydberg                           scientificName detected with duplicated value check_sci_name
#>  5   10665 Arnica pedunculata Rydberg                           scientificName detected with duplicated value check_sci_name
#>  6   25704 Arnica pedunculata Rydberg                           scientificName detected with duplicated value check_sci_name
#>  7   21946 Cogswellia simplex (Nuttall ex S. Watson) M.E. Jones scientificName detected with duplicated value check_sci_name
#>  8   32013 Cogswellia simplex (Nuttall ex S. Watson) M.E. Jones scientificName detected with duplicated value check_sci_name
#>  9   25350 Ginseng trifolium (Linnaeus) Alph. Wood              scientificName detected with duplicated value check_sci_name
#> 10   31464 Ginseng trifolium (Linnaeus) Alph. Wood              scientificName detected with duplicated value check_sci_name
#> # ℹ 16 more rows

We can join back to the original data to investigate these names.

inner_join(
  select(dup_names, taxonID),
  vascan,
  by = "taxonID"
) %>%
  # Just look at the first 6 columns
  select(1:6)
#> # A tibble: 26 × 6
#>    taxonID    id acceptedNameUsageID parentNameUsageID nameAccordingToID                                                 scientificName     
#>      <dbl> <dbl>               <dbl>             <dbl> <chr>                                                             <chr>              
#>  1   29934 29934                2695                NA <NA>                                                              Aralia triphylla P…
#>  2   31463 31463                2695                NA <NA>                                                              Aralia triphylla P…
#>  3   10664 10664                2857                NA http://www.efloras.org/volume_page.aspx?volume_id=1021&flora_id=1 Arnica monocephala…
#>  4   25705 25705                2857                NA http://www.botanicus.org/item/31753003488092                      Arnica monocephala…
#>  5   10665 10665                2857                NA http://www.efloras.org/volume_page.aspx?volume_id=1021&flora_id=1 Arnica pedunculata…
#>  6   25704 25704                2857                NA http://www.botanicus.org/item/31753003488092                      Arnica pedunculata…
#>  7   21946 21946                2613                NA <NA>                                                              Cogswellia simplex…
#>  8   32013 32013                2613                NA http://www.tropicos.org                                           Cogswellia simplex…
#>  9   25350 25350                2695                NA <NA>                                                              Ginseng trifolium …
#> 10   31464 31464                2695                NA <NA>                                                              Ginseng trifolium …
#> # ℹ 16 more rows

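We see that in some cases, multiple entries for the exact same scientific name (for example, Arnica monocephala Rydberg) share the same acceptedNameUsageID and, among the columns shown, differ only in their identifiers (taxonID and id) and in the value for nameAccordingToID.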

So this seems like something the database manager should fix.
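
One way to check which columns actually differ within each set of duplicates is to count the distinct values per column for every duplicated name. The sketch below does this on a small made-up table that mimics the pattern seen above. The values in toy are invented; only the column names come from the dataset.

```r
library(dplyr)
library(tibble)

# Made-up rows mimicking the duplicated names above (values are illustrative)
toy <- tribble(
  ~taxonID, ~acceptedNameUsageID, ~nameAccordingToID, ~scientificName,
  1,        10,                   "source A",         "Species a Author",
  2,        10,                   "source B",         "Species a Author",
  3,        11,                   NA,                 "Species b Author"
)

# Keep names that occur more than once, then count the distinct values
# in every other column for each name (a count of 1 means "no difference")
col_diffs <- toy %>%
  group_by(scientificName) %>%
  filter(n() > 1) %>%
  summarise(across(everything(), n_distinct))

col_diffs
```

Columns with a count greater than 1 are the ones in which the duplicated entries disagree.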

Invalid column names

Let’s see what is in the id column.

vascan %>%
  select(id)
#> # A tibble: 32,770 × 1
#>       id
#>    <dbl>
#>  1    73
#>  2    26
#>  3    25
#>  4   128
#>  5  1142
#>  6  2004
#>  7  5467
#>  8  5466
#>  9  5472
#> 10  5471
#> # ℹ 32,760 more rows

n_distinct(vascan$id)
#> [1] 32770

id contains numbers that are all unique. In other words, these appear to be unique key values for each row in the dataset (as one would expect from the name id).

This dataset probably has a good reason for using the id column, even though it is not a standard DwC column.
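
In the rows printed earlier, id and taxonID hold the same values. Whether that is true of every row can be checked directly. The sketch below uses a made-up table; with the real data, the same comparison would be run on vascan. Dropping the column is shown only as one possible alternative to allowing it during validation.

```r
library(dplyr)
library(tibble)

# Made-up table in which id repeats taxonID (values are illustrative)
toy <- tibble(
  id = c(73, 26, 25),
  taxonID = c(73, 26, 25),
  scientificName = c("Name a", "Name b", "Name c")
)

# TRUE if id duplicates taxonID in every row
id_is_redundant <- all(toy$id == toy$taxonID)
id_is_redundant

# If so, the extra column could be dropped instead of being allowed
toy_no_id <- select(toy, -id)
colnames(toy_no_id)
```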

Fixing the data

Let’s see if we can get this dataset to pass validation.

First, let’s remove the duplicated names. This is something that should be done with more thought, but here let’s just keep the first name of each pair.

vascan_fixed <-
  vascan %>%
  filter(!duplicated(scientificName))
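
Note that filter(!duplicated(...)) keeps whichever row happens to come first. A slightly more deliberate rule, sketched below on made-up data, is to prefer the row that has a nameAccordingToID. This is only one possible rule, chosen here as an assumption for illustration; it is not a recommendation from the VASCAN maintainers.

```r
library(dplyr)
library(tibble)

# Made-up rows: "Species a Author" appears twice, once without a source
toy <- tribble(
  ~taxonID, ~nameAccordingToID, ~scientificName,
  1,        NA,                 "Species a Author",
  2,        "source B",         "Species a Author",
  3,        "source C",         "Species b Author"
)

toy_dedup <- toy %>%
  # Sort rows with a missing nameAccordingToID after rows that have one
  arrange(scientificName, is.na(nameAccordingToID)) %>%
  # Keep the first row per name, retaining all other columns
  distinct(scientificName, .keep_all = TRUE)

toy_dedup
```

Unlike the filter() call above, this approach reorders the rows by scientificName, so the original row order is not preserved.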

Next, we will run validation again, but this time allow id as an extra column.

dct_validate(
  vascan_fixed,
  extra_cols = "id"
)
#> # A tibble: 32,757 × 24
#>       id taxonID acceptedNameUsageID parentNameUsageID nameAccordingToID    scientificName acceptedNameUsage parentNameUsage nameAccordingTo
#>    <dbl>   <dbl>               <dbl>             <dbl> <chr>                <chr>          <chr>             <chr>           <chr>          
#>  1    73      73                  73                NA http://dx.doi.org/1… Equisetopsida… Equisetopsida C.… <NA>            Chase, M.W. & …
#>  2    26      26                  26                73 http://dx.doi.org/1… Equisetidae W… Equisetidae Warm… Equisetopsida … Chase, M.W. & …
#>  3    25      25                  25                26 http://www.jstor.or… Equisetales d… Equisetales de C… Equisetidae Wa… Smith, A.R., K…
#>  4   128     128                 128                25 http://www.jstor.or… Equisetaceae … Equisetaceae Mic… Equisetales de… Smith, A.R., K…
#>  5  1142    1142                1142               128 http://www.efloras.… Equisetum Lin… Equisetum Linnae… Equisetaceae M… FNA Editorial …
#>  6  2004    2004                2004              1142 http://www.efloras.… Equisetum sub… Equisetum subg. … Equisetum Linn… FNA Editorial …
#>  7  5467    5467                5467              2004 http://www.efloras.… Equisetum flu… Equisetum fluvia… Equisetum subg… FNA Editorial …
#>  8  5466    5466                5466              2004 http://www.efloras.… Equisetum arv… Equisetum arvens… Equisetum subg… FNA Editorial …
#>  9  5472    5472                5472              2004 http://www.efloras.… Equisetum pra… Equisetum praten… Equisetum subg… FNA Editorial …
#> 10  5471    5471                5471              2004 http://www.efloras.… Equisetum pal… Equisetum palust… Equisetum subg… FNA Editorial …
#> # ℹ 32,747 more rows
#> # ℹ 15 more variables: higherClassification <chr>, class <chr>, order <chr>, family <chr>, genus <chr>, subgenus <chr>,
#> #   specificEpithet <chr>, infraspecificEpithet <chr>, taxonRank <chr>, scientificNameAuthorship <chr>, taxonomicStatus <chr>,
#> #   modified <chr>, license <chr>, bibliographicCitation <chr>, references <chr>

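It passes: when all checks succeed, dct_validate() returns the input data, here with 32,757 rows (the original 32,770 minus one row from each of the 13 duplicated pairs). We have now confirmed that the only steps needed to obtain correctly formatted DwC data are to de-duplicate the scientific names and account for the id column.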

Summary

This vignette shows how dwctaxon can be used on DwC data to find possible problems in a taxonomic dataset. We were able to identify 26 rows (13 pairs) with duplicated scientific names and one column that does not follow DwC standards. Other than that, it passes validation, giving us confidence that this dataset can be used for downstream taxonomic analyses.


  1. If you download the data manually, it may be a different version than the one used here (v37.12).