taxodist answers a simple question: how related are
any two living things?
Given any two taxon names (a pair of dinosaurs, a dinosaur and a
fungus, two species of fly, or an oak tree and a human),
taxodist retrieves their full hierarchical lineages from The Taxonomicon and computes a
dissimilarity index between them.
The Taxonomicon is based on Systema Naturae 2000 (Brands, 1989 onwards) and provides exceptionally deep lineage resolution, substantially deeper than that of most other programmatic sources.
Searches work at any taxonomic level: genus, species, family, order,
or any clade. Both "Tyrannosaurus" and
"Tyrannosaurus rex" are valid inputs, as are
"Drosophila melanogaster", "Homo sapiens", or
"Araucaria angustifolia".
taxodist measures how related two taxa are by asking a
single question: how deep is their most recent common
ancestor?
\[d(A, B) = \frac{1}{\text{depth}(\text{MRCA}(A,B))}\]
The deeper the shared ancestor, the smaller the distance, meaning the more related the two taxa are. A shallow MRCA (close to the root) means the two taxa diverged early and are distantly related; a deep MRCA means they share a long common history and are closely related.
This has a key property: taxa that diverged at the same point in the tree are always equidistant from any third taxon, regardless of how many nodes each has in its lineage below the split. For example:
Tyrannosaurus and Velociraptor are both Tetanurae: they diverged from Carnotaurus (Ceratosauria) at the same node (Averostra), so both have exactly the same distance to Carnotaurus.
All dinosaurs diverged from the mammal lineage at the same node (Amniota), so Homo sapiens is equally distant from Tyrannosaurus, Triceratops, Carnotaurus and Cyanocorax.
The distance is not rescaled to span \([0, 1]\); its magnitude depends on the depth of the MRCA in The Taxonomicon’s classification. Deeper, more finely resolved clades will have smaller distances between their members.
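The metric can be sketched in a few lines of base R. This is a minimal illustration, not taxodist's actual implementation: `mrca_depth()` and `toy_distance()` are hypothetical helpers, lineages are assumed to be root-to-tip character vectors, and the three lineages below are heavily abbreviated stand-ins, so their depths are far smaller than the real ones. Returning 0 for identical lineages is likewise an assumption.

```r
# Hypothetical helper: depth of the MRCA, taken as the length of the
# shared prefix of two root-to-tip lineage vectors.
mrca_depth <- function(lin_a, lin_b) {
  n <- min(length(lin_a), length(lin_b))
  same <- lin_a[seq_len(n)] == lin_b[seq_len(n)]
  first_diff <- match(FALSE, same)          # NA when no mismatch is found
  if (is.na(first_diff)) n else first_diff - 1L
}

# Hypothetical helper: d(A, B) = 1 / depth(MRCA(A, B)).
toy_distance <- function(lin_a, lin_b) {
  if (identical(lin_a, lin_b)) return(0)    # assumption: a taxon is at distance 0 from itself
  depth <- mrca_depth(lin_a, lin_b)
  if (depth == 0) return(NA_real_)          # no shared node at all
  1 / depth
}

# Heavily abbreviated toy lineages (NOT the full Taxonomicon lineages).
carno   <- c("Theropoda", "Averostra", "Ceratosauria", "Carnotaurus")
tyranno <- c("Theropoda", "Averostra", "Tetanurae", "Tyrannoraptora",
             "Tyrannosauroidea", "Tyrannosaurus")
veloci  <- c("Theropoda", "Averostra", "Tetanurae", "Tyrannoraptora",
             "Velociraptor")

d_tc <- toy_distance(tyranno, carno)
d_vc <- toy_distance(veloci, carno)
d_tv <- toy_distance(tyranno, veloci)
```

In this toy tree both Tetanurae members end up at distance 1/2 from Carnotaurus, although their lineages differ in length below the split, and they sit at 1/4 from each other because their MRCA is two nodes deeper.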
lin <- get_lineage("Tyrannosaurus")
tail(lin, 8)
#> [1] "Avetheropoda" "Coelurosauria" "Tyrannoraptora" "Tyrannosauroidea"
#> [5] "Tyrannosauridae" "Tyrannosaurinae" "Tyrannosaurini" "Tyrannosaurus"

Species-level searches also work.
result <- taxo_distance("Tyrannosaurus", "Velociraptor")
print(result)
#> -- Taxonomic Distance --
#>
#> * Tyrannosaurus vs Velociraptor
#> Distance : 0.0153846153846154
#> MRCA : Tyrannoraptora (depth 65)
#> Depth A : 70
#> Depth B : 73

The distance between a dinosaur and a mammal, or between a bacterium and a human, is larger.
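As a worked check of the formula against the example outputs in this document (MRCA depth 65 for Tyrannoraptora, depth 60 for Averostra):

\[d(\text{Tyrannosaurus}, \text{Velociraptor}) = \frac{1}{65} \approx 0.01538, \qquad d(\text{Tyrannosaurus}, \text{Carnotaurus}) = \frac{1}{60} \approx 0.01667\]

By the same arithmetic, a distance of 0.02777778 corresponds to an MRCA depth of 36, since \(1/36 \approx 0.02778\). That depth is inferred from the formula here rather than read from any printed output.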
taxa <- c("Tyrannosaurus", "Carnotaurus", "Velociraptor",
          "Triceratops", "Homo", "Drosophila melanogaster")
mat <- distance_matrix(taxa)
print(mat)
#>                         Tyrannosaurus Carnotaurus Velociraptor Triceratops       Homo
#> Carnotaurus                0.01666667
#> Velociraptor               0.01538462  0.01666667
#> Triceratops                0.01818182  0.01818182   0.01818182
#> Homo                       0.02777778  0.02777778   0.02777778  0.02777778
#> Drosophila melanogaster    0.06666667  0.06666667   0.06666667  0.06666667 0.06666667

The matrix is symmetric with zeros on the diagonal. Taxa are ordered so that closely related pairs appear near each other when clustered.
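One way to look at the clustering is sketched below in base R, with the distances re-typed by hand from the printed matrix. It assumes nothing about the class of object that distance_matrix() returns, and average linkage is an arbitrary choice made for illustration.

```r
taxa <- c("Tyrannosaurus", "Carnotaurus", "Velociraptor",
          "Triceratops", "Homo", "Drosophila melanogaster")

# Lower-triangle values, column by column, copied from the printed matrix.
vals <- c(0.01666667, 0.01538462, 0.01818182, 0.02777778, 0.06666667,  # vs Tyrannosaurus
          0.01666667, 0.01818182, 0.02777778, 0.06666667,              # vs Carnotaurus
          0.01818182, 0.02777778, 0.06666667,                          # vs Velociraptor
          0.02777778, 0.06666667,                                      # vs Triceratops
          0.06666667)                                                  # Homo vs Drosophila

m <- matrix(0, nrow = 6, ncol = 6, dimnames = list(taxa, taxa))
m[lower.tri(m)] <- vals          # lower.tri() fills in column-major order
d <- as.dist(m)                  # as.dist() reads the lower triangle

hc <- hclust(d, method = "average")
groups <- cutree(hc, k = 2)      # split the tree into two groups
```

Cutting the tree into two groups isolates Drosophila melanogaster, the only taxon at 0.0667 from every other; `plot(hc)` draws the corresponding dendrogram.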
closest_relative(
  "Carnotaurus",
  c("Aucasaurus", "Velociraptor", "Triceratops",
    "Brachiosaurus", "Homo sapiens", "Apis mellifera")
)
#>            taxon   distance
#> 1     Aucasaurus 0.01515152
#> 2   Velociraptor 0.01666667
#> 4  Brachiosaurus 0.01754386
#> 3    Triceratops 0.01818182
#> 5   Homo sapiens 0.02777778
#> 6 Apis mellifera 0.06666667

compare_lineages("Carnotaurus", "Tyrannosaurus")
#> -- Lineage Comparison --
#> MRCA: Averostra at depth 60
#>
#> Shared lineage (60 nodes):
#> Biota ... Theropoda
#>
#> Carnotaurus only (7 nodes):
#> Ceratosauria
#> Neoceratosauria
#> Abelisauroidea
#> Abelisauria
#> Abelisauridae
#> Carnotaurinae
#> Carnotaurus
#>
#> Tyrannosaurus only (10 nodes):
#> Tetanurae
#> Orionides
#> ...

taxa <- c("Tyrannosaurus", "Carnotaurus", "Triceratops",
          "Velociraptor", "Homo sapiens", "Drosophila melanogaster",
          "Quercus robur", "Saccharomyces cerevisiae")
filter_clade(taxa, "Dinosauria")
#> [1] "Tyrannosaurus" "Carnotaurus" "Triceratops" "Velociraptor"
filter_clade(taxa, "Theropoda")
#> [1] "Tyrannosaurus" "Carnotaurus" "Velociraptor"
filter_clade(taxa, "Animalia")
#> [1] "Tyrannosaurus" "Carnotaurus"
#> [3] "Triceratops" "Velociraptor"
#> [5] "Homo sapiens" "Drosophila melanogaster"

taxa <- c("Tyrannosaurus", "Velociraptor", "Apis mellifera", "Fakeosaurus")
check_coverage(taxa)
#>  Tyrannosaurus   Velociraptor Apis mellifera    Fakeosaurus
#>           TRUE           TRUE           TRUE          FALSE

Use check_coverage() to pre-screen a list before running
distance_matrix() on a large dataset: taxa that return
FALSE will produce NA distances.
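A minimal sketch of that pre-screening step, assuming check_coverage() returns a named logical vector as the printed output suggests. The vector is typed in by hand here so the snippet runs without network access; `coverage`, `covered` and `unresolved` are illustrative names.

```r
# Hand-typed stand-in for the result of check_coverage(taxa) shown above.
coverage <- c(Tyrannosaurus = TRUE, Velociraptor = TRUE,
              `Apis mellifera` = TRUE, Fakeosaurus = FALSE)

covered    <- names(coverage)[coverage]    # taxa found in the source
unresolved <- names(coverage)[!coverage]   # taxa that would yield NA distances

if (length(unresolved) > 0) {
  message("Dropping unresolved taxa: ", paste(unresolved, collapse = ", "))
}
```

The `covered` vector could then be passed to distance_matrix() in place of the original list.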
The Taxonomicon provides substantially deeper lineage resolution than most other programmatic sources. For example, Tyrannosaurus has 70 nodes in its lineage, capturing intermediate clades at the level of superfamilies, tribes, and named subclades that are absent from most sources. This depth is what makes the distance metric meaningful: shallower sources would produce coarser distances that conflate distantly related groups.
All lineage data is sourced from The Taxonomicon (taxonomy.nl), based on Systema Naturae 2000:
Brands, S.J. (1989 onwards). Systema Naturae 2000. Amsterdam, The Netherlands. Retrieved from The Taxonomicon, http://taxonomicon.taxonomy.nl.
Please cite this resource in any published work using
taxodist.