Getting started with {tidystringdist}

About {tidystringdist}

{tidystringdist} is a package that extends the {stringdist} package with tidy data principles.

The idea is to compute string distances and combine the results with the data manipulation and visualisation functions of the tidyverse.

Installing {tidystringdist}

You can install the latest stable version from CRAN with:

install.packages("tidystringdist")

Or the dev version from GitHub:

# install.packages("remotes")
remotes::install_github("ColinFay/tidystringdist")

{tidystringdist} basic workflow

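`tidy_comb()`

The tidy_comb_all() and tidy_comb() functions return pairs of strings as a tibble. tidy_comb_all() builds every unique pair from a vector or from a column of a data.frame, and tidy_comb() combines two vectors, as in the last example below where "Paris" is paired with each state name: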

library(tidystringdist)

tidy_comb_all(LETTERS[1:3])
#> # A tibble: 3 x 2
#>   V1    V2   
#> * <chr> <chr>
#> 1 A     B    
#> 2 A     C    
#> 3 B     C

tidy_comb_all(iris, Species)
#> # A tibble: 3 x 2
#>   V1         V2        
#> * <chr>      <chr>     
#> 1 setosa     versicolor
#> 2 setosa     virginica 
#> 3 versicolor virginica

tidy_comb("Paris", state.name)
#> # A tibble: 50 x 2
#>    V1          V2   
#>  * <chr>       <chr>
#>  1 Alabama     Paris
#>  2 Alaska      Paris
#>  3 Arizona     Paris
#>  4 Arkansas    Paris
#>  5 California  Paris
#>  6 Colorado    Paris
#>  7 Connecticut Paris
#>  8 Delaware    Paris
#>  9 Florida     Paris
#> 10 Georgia     Paris
#> # … with 40 more rows
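
As a rough mental model, the pairs that tidy_comb_all() returns for a vector are the unordered pairs that base R's combn() produces. The sketch below is only an illustration in base R, not the package's implementation. It also shows where the row count comes from: n elements give n * (n - 1) / 2 pairs.

```r
# Base R illustration only; this is not how {tidystringdist} is implemented.
x <- LETTERS[1:3]

# combn() returns one pair per column, so transpose to get one pair per row.
pairs <- as.data.frame(t(combn(x, 2)), stringsAsFactors = FALSE)
names(pairs) <- c("V1", "V2")
pairs

# n elements give n * (n - 1) / 2 unordered pairs.
n_states <- length(state.name)
n_pairs <- choose(n_states, 2)
n_pairs
```

For the 50 state names this gives 1,225 pairs, which matches the number of rows in the tibbles computed from tidy_comb_all(state.name).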

Compute string distance

Once you have this data.frame, use tidy_stringdist() to compute string distances. The function takes a data.frame, the two columns containing the strings, and one or more {stringdist} methods.

comb <- tidy_comb_all(state.name) 
tidy_stringdist(comb)
#> # A tibble: 1,225 x 12
#>    V1    V2      osa    lv    dl hamming   lcs qgram cosine jaccard    jw
#>  * <chr> <chr> <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl>  <dbl>   <dbl> <dbl>
#>  1 Alab… Alas…     3     3     3     Inf     5     5  0.216   0.571 0.254
#>  2 Alab… Ariz…     5     5     5       5    10    10  0.581   0.8   0.476
#>  3 Alab… Arka…     6     6     6     Inf     9     9  0.440   0.778 0.399
#>  4 Alab… Cali…     8     8     8     Inf    13    11  0.481   0.818 0.535
#>  5 Alab… Colo…     6     6     6     Inf    11    11  0.704   0.778 0.488
#>  6 Alab… Conn…    11    11    11     Inf    18    18  1       1     1    
#>  7 Alab… Dela…     5     5     5     Inf     9     9  0.440   0.778 0.399
#>  8 Alab… Flor…     5     5     5       5    10    10  0.581   0.8   0.476
#>  9 Alab… Geor…     6     6     6       6    12    12  0.686   0.909 0.571
#> 10 Alab… Hawa…     5     5     5     Inf     9     9  0.474   0.875 0.460
#> # … with 1,215 more rows, and 1 more variable: soundex <dbl>
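
The calls above rely on the default column names V1 and V2. The sketch below passes the two columns explicitly, assuming the arguments are named v1 and v2 (check ?tidy_stringdist for your installed version). The data frame `word_pairs` and its columns `a` and `b` are hypothetical names used only for illustration. It also shows why the hamming column contains Inf in the output above: the Hamming distance is only defined for strings of equal length.

```r
library(tidystringdist)

# Hypothetical input: two columns of strings with arbitrary names.
word_pairs <- data.frame(
  a = c("kitten", "flaw"),
  b = c("sitting", "lawn"),
  stringsAsFactors = FALSE
)

# Assumption: the string columns are selected with the v1 and v2 arguments.
res <- tidy_stringdist(word_pairs, v1 = a, v2 = b, method = c("lv", "hamming"))
res

# "kitten" and "sitting" differ in length, so their Hamming distance is Inf;
# "flaw" and "lawn" have the same length, so it is finite.
```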

By default, all the methods are computed. To compute only specific ones, use the method argument:

comb <- tidy_comb_all(state.name)
tidy_stringdist(comb, method = c("osa","jw"))
#> # A tibble: 1,225 x 4
#>    V1      V2            osa    jw
#>  * <chr>   <chr>       <dbl> <dbl>
#>  1 Alabama Alaska          3 0.254
#>  2 Alabama Arizona         5 0.476
#>  3 Alabama Arkansas        6 0.399
#>  4 Alabama California      8 0.535
#>  5 Alabama Colorado        6 0.488
#>  6 Alabama Connecticut    11 1    
#>  7 Alabama Delaware        5 0.399
#>  8 Alabama Florida         5 0.476
#>  9 Alabama Georgia         6 0.571
#> 10 Alabama Hawaii          5 0.460
#> # … with 1,215 more rows
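
Because the result is a tibble, it can be passed straight to tidyverse verbs. The following is a minimal sketch, assuming {dplyr} is installed; the 0.2 threshold and the name `close_pairs` are arbitrary choices for illustration, not recommendations from the package.

```r
library(tidystringdist)
library(dplyr)

# Keep only the pairs of state names that are close under Jaro-Winkler,
# then sort them from most to least similar.
close_pairs <- tidy_comb_all(state.name) %>%
  tidy_stringdist(method = "jw") %>%
  filter(jw < 0.2) %>%   # arbitrary threshold, adjust to your data
  arrange(jw)

close_pairs
```

The same pattern works with any of the methods, since each one is returned as a column named after the method.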

Colin Fay

2019-03-20