Introduction to reclin2

reclin2 implements methodology for linking records based on inexact keys. It allows for maximum flexibility by giving users full control over each step of the linking procedure. The package is built with performance and scalability in mind: the core algorithms have been implemented in C++.

> library(reclin2)

We will work with a pair of data sets with artificial data. They are tiny, but that allows us to see what happens. In this example we will perform ‘classic’ probabilistic record linkage. When some known true links are known it is also possible to use machine learning methods. This is illustrated in another vignette.

> data("linkexample1", "linkexample2")
> print(linkexample1)
  id lastname firstname    address sex postcode
1  1    Smith      Anna 12 Mainstr   F  1234 AB
2  2    Smith    George 12 Mainstr   M  1234 AB
3  3  Johnson      Anna 61 Mainstr   F  1234 AB
4  4  Johnson   Charles 61 Mainstr   M  1234 AB
5  5  Johnson    Charly 61 Mainstr   M  1234 AB
6  6 Schwartz       Ben  1 Eaststr   M  6789 XY
> print(linkexample2)
  id lastname firstname       address  sex postcode
1  2    Smith    Gearge 12 Mainstreet <NA>  1234 AB
2  3   Jonson        A. 61 Mainstreet    F  1234 AB
3  4  Johnson   Charles    61 Mainstr    F  1234 AB
4  6 Schwartz       Ben        1 Main    M  6789 XY
5  7 Schwartz      Anna     1 Eaststr    F  6789 XY

We have two data sets with personal information. The second data set contains a lot of errors, but we will try to link the second data set to the first.

Step 1: generate pairs

In principle linkage consists of comparing each combination of records from the two data sets and determine which of those combinations (or pairs as we will call them below) belong to the same entity. In case of a perfect linkage key, it is of course, not necessary to compare all combinations of records, but when the linkage keys are imperfect and contain errors, it is in principle necessary to compare all pairs.

However, comparing all pairs can result in an intractable number of pairs: when linking two data sets with a million records there are 10¹² possible pairs. Therefore, some sort of reduction of the possible pairs is usually applied. In the example below, we apply blocking, which means that pairs are only generated when they agree on the blocking variable (in this case the postcode). This means that pairs of records that disagree on the blocking variable are not considered. Therefore, one will only use variables that can be considered without errors as blocking variable, or link multiple times with different blocking variables and combine both data sets.

> pairs <- pair_blocking(linkexample1, linkexample2, "postcode")
> print(pairs)
  First data set:  6 records
  Second data set: 5 records
  Total number of pairs: 17 pairs
  Blocking on: 'postcode'

       .x    .y
    <int> <int>
 1:     1     1
 2:     1     2
 3:     1     3
 4:     2     1
 5:     2     2
 6:     2     3
 7:     3     1
 8:     3     2
 9:     3     3
10:     4     1
11:     4     2
12:     4     3
13:     5     1
14:     5     2
15:     5     3
16:     6     4
17:     6     5

As you can see, record 1 from x (the first data set) is compared to records 1, 2 and 3 from y. Also note that reclin2 uses the data.table package to efficiently perform some of the computations. Therefore, the pairs object is a data.table.

Step 2: compare pairs

> pairs <- compare_pairs(pairs, on = c("lastname", "firstname", "address", "sex"))
> print(pairs)
  First data set:  6 records
  Second data set: 5 records
  Total number of pairs: 17 pairs
  Blocking on: 'postcode'

       .x    .y lastname firstname address    sex
    <int> <int>   <lgcl>    <lgcl>  <lgcl> <lgcl>
 1:     1     1     TRUE     FALSE   FALSE     NA
 2:     1     2    FALSE     FALSE   FALSE   TRUE
 3:     1     3    FALSE     FALSE   FALSE   TRUE
 4:     2     1     TRUE     FALSE   FALSE     NA
 5:     2     2    FALSE     FALSE   FALSE  FALSE
 6:     2     3    FALSE     FALSE   FALSE  FALSE
 7:     3     1    FALSE     FALSE   FALSE     NA
 8:     3     2    FALSE     FALSE   FALSE   TRUE
 9:     3     3     TRUE     FALSE    TRUE   TRUE
10:     4     1    FALSE     FALSE   FALSE     NA
11:     4     2    FALSE     FALSE   FALSE  FALSE
12:     4     3     TRUE      TRUE    TRUE  FALSE
13:     5     1    FALSE     FALSE   FALSE     NA
14:     5     2    FALSE     FALSE   FALSE  FALSE
15:     5     3     TRUE     FALSE    TRUE  FALSE
16:     6     4     TRUE      TRUE   FALSE   TRUE
17:     6     5     TRUE     FALSE    TRUE  FALSE

As you can see, we don’t need to pass the original data sets although the variables lastname etc. are from those original data sets. This is because a copy of the original data sets are stored with the pairs object pairs (and should you be worrying about memory: as long as the original data sets are not modified the data sets are not actually copied).

In the example above the result of compare_pairs was assigned back to pairs. When working with large datasets it can be more efficient to modify pairs in place preventing unnecessary copies. This behaviour can be switched on using the inplace argument which is accepted by most functions.

> compare_pairs(pairs, on = c("lastname", "firstname", "address", "sex"), 
+   inplace = TRUE)
> print(pairs)
  First data set:  6 records
  Second data set: 5 records
  Total number of pairs: 17 pairs
  Blocking on: 'postcode'

       .x    .y lastname firstname address    sex
    <int> <int>   <lgcl>    <lgcl>  <lgcl> <lgcl>
 1:     1     1     TRUE     FALSE   FALSE     NA
 2:     1     2    FALSE     FALSE   FALSE   TRUE
 3:     1     3    FALSE     FALSE   FALSE   TRUE
 4:     2     1     TRUE     FALSE   FALSE     NA
 5:     2     2    FALSE     FALSE   FALSE  FALSE
 6:     2     3    FALSE     FALSE   FALSE  FALSE
 7:     3     1    FALSE     FALSE   FALSE     NA
 8:     3     2    FALSE     FALSE   FALSE   TRUE
 9:     3     3     TRUE     FALSE    TRUE   TRUE
10:     4     1    FALSE     FALSE   FALSE     NA
11:     4     2    FALSE     FALSE   FALSE  FALSE
12:     4     3     TRUE      TRUE    TRUE  FALSE
13:     5     1    FALSE     FALSE   FALSE     NA
14:     5     2    FALSE     FALSE   FALSE  FALSE
15:     5     3     TRUE     FALSE    TRUE  FALSE
16:     6     4     TRUE      TRUE   FALSE   TRUE
17:     6     5     TRUE     FALSE    TRUE  FALSE

The default comparison function returns TRUE when the linkage keys agree and false when they don’t. However, when looking at the original data sets, we can see that most of our linkage keys are string variables that contain typing errors. The quality of our linkage could be improved if we could use a similarity score to compare the two strings: a high score means that the two strings are very similar a value close to zero means that the strings are very different.

> compare_pairs(pairs, on = c("lastname", "firstname", "address", "sex"),
+   default_comparator = cmp_jarowinkler(0.9), inplace = TRUE)
> print(pairs)
  First data set:  6 records
  Second data set: 5 records
  Total number of pairs: 17 pairs
  Blocking on: 'postcode'

       .x    .y lastname firstname   address   sex
    <int> <int>    <num>     <num>     <num> <num>
 1:     1     1 1.000000 0.4722222 0.9230769    NA
 2:     1     2 0.000000 0.5833333 0.8641026     1
 3:     1     3 0.447619 0.4642857 0.9333333     1
 4:     2     1 1.000000 0.8888889 0.9230769    NA
 5:     2     2 0.000000 0.0000000 0.8641026     0
 6:     2     3 0.447619 0.5396825 0.9333333     0
 7:     3     1 0.447619 0.4722222 0.8641026    NA
 8:     3     2 0.952381 0.5833333 0.9230769     1
 9:     3     3 1.000000 0.4642857 1.0000000     1
10:     4     1 0.447619 0.6428571 0.8641026    NA
11:     4     2 0.952381 0.0000000 0.9230769     0
12:     4     3 1.000000 1.0000000 1.0000000     0
13:     5     1 0.447619 0.5555556 0.8641026    NA
14:     5     2 0.952381 0.0000000 0.9230769     0
15:     5     3 1.000000 0.8492063 1.0000000     0
16:     6     4 1.000000 1.0000000 0.6111111     1
17:     6     5 1.000000 0.5277778 1.0000000     0

The function compare_vars offers more flexibility than compare_pairs. It can for example compare multiple variables at the same time (e.g. compare birth day and month allowing for swaps) or generate multiple results from comparing on one variable.

Step 3: score pairs

The next step in the process, is to determine which pairs of records belong to the same entity and which do not. There are numerous ways to do this. One possibility is to label some of the pairs as match or no match, and use some machine learning algorithm to predict the match status using the comparison vectors. Traditionally, the probabilistic linkage framework initially formalised by Fellegi and Sunter tries is used. The function problink_em uses an EM-algorithm to estimate the so called m- and u-probabilities for each of the linkage variables. The m-probability is the probability that two records concerning the same entity agree on the linkage variable; this means that the m-probability corresponds to the probability that there is an error in the linkage variables. The u-probability is the probability that two records belonging to different entities agree on a variable. For a variable with few categories (such as sex) this probability will be large, while for a variable with a large number of categories (such as last name) this probability will be small.

> m <- problink_em(~ lastname + firstname + address + sex, data = pairs)
> print(m)
M- and u-probabilities estimated by the EM-algorithm:
  Variable M-probability U-probability
  lastname     0.9990000   0.001152679
 firstname     0.1999999   0.000100000
   address     0.8999206   0.285831118
       sex     0.3002011   0.285427112

Matching probability: 0.5885595.

> pairs <- predict(m, pairs = pairs, add = TRUE)
> print(pairs)
  First data set:  6 records
  Second data set: 5 records
  Total number of pairs: 17 pairs
  Blocking on: 'postcode'

       .x    .y lastname firstname   address   sex    weights
    <int> <int>    <num>     <num>     <num> <num>      <num>
 1:     1     1 1.000000 0.4722222 0.9230769    NA  7.7103862
 2:     1     2 0.000000 0.5833333 0.8641026     1 -5.9463949
 3:     1     3 0.447619 0.4642857 0.9333333     1  0.8042090
 4:     2     1 1.000000 0.8888889 0.9230769    NA  8.6064218
 5:     2     2 0.000000 0.0000000 0.8641026     0 -6.3177171
 6:     2     3 0.447619 0.5396825 0.9333333     0  0.7937508
 7:     3     1 0.447619 0.4722222 0.8641026    NA  0.6017106
 8:     3     2 0.952381 0.5833333 0.9230769     1  4.0674910
 9:     3     3 1.000000 0.4642857 1.0000000     1  7.9350221
10:     4     1 0.447619 0.6428571 0.8641026    NA  0.7713174
11:     4     2 0.952381 0.0000000 0.9230769     0  3.6961688
12:     4     3 1.000000 1.0000000 1.0000000     0 15.4915816
13:     5     1 0.447619 0.5555556 0.8641026    NA  0.6717426
14:     5     2 0.952381 0.0000000 0.9230769     0  3.6961688
15:     5     3 1.000000 0.8492063 1.0000000     0  8.5458257
16:     6     4 1.000000 1.0000000 0.6111111     1 14.6796595
17:     6     5 1.000000 0.5277778 1.0000000     0  7.9139248

With add = TRUE the predictions are added to the pairs object. The higher the weight the more likely the two pairs belong to the same entity/are a match.

The prediction function can also return the m- and u-probabilities and the posterior m- and u-probabilities.

A simpler method is to base the score directly on the similarity scores by calculating a (weighted) sum of the similarity scores. The higher the score the more similar the two records in the pair are. In the simplest case score_simple just calculates the sum of the similarity scores of the compared variables:

> pairs <- score_simple(pairs, "score", 
+   on = c("lastname", "firstname", "address", "sex"))

However, it is also possible to provide custom weights for each of the variables. In that way some variables can contribute more or less to the final score. For example, agreement on last name might give more information than agreement on a variable such as sex.

> pairs <- score_simple(pairs, "score", 
+   on = c("lastname", "firstname", "address", "sex"), 
+   w1 = c(lastname = 2, firstname = 2, address = 1, sex = 0.5),
+   w0 = -1, wna = 0)

Using w1, w0 and wna it is possible to specify separate weights for agreement, non-agreement and missing values respectively. For example, agreement on last name might be a strong indication that two records belong to the same object, while non-agreement on last name might not be strong indication of two records not being from the same object as last name might contain a lot of errors. In that case a high values for w1 and a hight (e.g. only slightly negative) values for w0 might be appropriate. For sex the reverse might be true: agreement does not give much information (as two random persons will generally have a 50 percent chance of having the same sec) while non-agreement might be a strong indication that the two records do not belong to the same person (e.g. when there are few errors in sex). In that case relatively low values for w1 and low values for w0 are appropriate.

Or wna when the comparison is missing. This assumes that similarity scores are between zero and one.

Step 4: select pairs

The final step is to select the pairs that are considered to belong to the same entities. The simplest method is to select all pairs above a certain threshold

> pairs <- select_threshold(pairs, "threshold", score = "weights", threshold = 8)
> print(pairs)
  First data set:  6 records
  Second data set: 5 records
  Total number of pairs: 17 pairs
  Blocking on: 'postcode'

       .x    .y lastname firstname   address   sex    weights      score
    <int> <int>    <num>     <num>     <num> <num>      <num>      <num>
 1:     1     1 1.000000 0.4722222 0.9230769    NA  7.7103862  3.2628205
 2:     1     2 0.000000 0.5833333 0.8641026     1 -5.9463949  0.9782051
 3:     1     3 0.447619 0.4642857 0.9333333     1  0.8042090  2.1023810
 4:     2     1 1.000000 0.8888889 0.9230769    NA  8.6064218  4.5128205
 5:     2     2 0.000000 0.0000000 0.8641026     0 -6.3177171 -2.2717949
 6:     2     3 0.447619 0.5396825 0.9333333     0  0.7937508  0.8285714
 7:     3     1 0.447619 0.4722222 0.8641026    NA  0.6017106  1.4877289
 8:     3     2 0.952381 0.5833333 0.9230769     1  4.0674910  3.9532967
 9:     3     3 1.000000 0.4642857 1.0000000     1  7.9350221  3.8928571
10:     4     1 0.447619 0.6428571 0.8641026    NA  0.7713174  1.9996337
11:     4     2 0.952381 0.0000000 0.9230769     0  3.6961688  0.7032967
12:     4     3 1.000000 1.0000000 1.0000000     0 15.4915816  4.0000000
13:     5     1 0.447619 0.5555556 0.8641026    NA  0.6717426  1.7377289
14:     5     2 0.952381 0.0000000 0.9230769     0  3.6961688  0.7032967
15:     5     3 1.000000 0.8492063 1.0000000     0  8.5458257  3.5476190
16:     6     4 1.000000 1.0000000 0.6111111     1 14.6796595  4.7222222
17:     6     5 1.000000 0.5277778 1.0000000     0  7.9139248  2.5833333
    threshold
       <lgcl>
 1:     FALSE
 2:     FALSE
 3:     FALSE
 4:      TRUE
 5:     FALSE
 6:     FALSE
 7:     FALSE
 8:     FALSE
 9:     FALSE
10:     FALSE
11:     FALSE
12:      TRUE
13:     FALSE
14:     FALSE
15:      TRUE
16:      TRUE
17:     FALSE

The select functions add a (logical) variable to the data set indicating whether a pairs is selected or not.

In this case we know which records truly belong to each other. We can use that to evaluate the linkage:

> pairs <- compare_vars(pairs, "truth", on_x = "id", on_y = "id")
> print(pairs)
  First data set:  6 records
  Second data set: 5 records
  Total number of pairs: 17 pairs
  Blocking on: 'postcode'

       .x    .y lastname firstname   address   sex    weights      score
    <int> <int>    <num>     <num>     <num> <num>      <num>      <num>
 1:     1     1 1.000000 0.4722222 0.9230769    NA  7.7103862  3.2628205
 2:     1     2 0.000000 0.5833333 0.8641026     1 -5.9463949  0.9782051
 3:     1     3 0.447619 0.4642857 0.9333333     1  0.8042090  2.1023810
 4:     2     1 1.000000 0.8888889 0.9230769    NA  8.6064218  4.5128205
 5:     2     2 0.000000 0.0000000 0.8641026     0 -6.3177171 -2.2717949
 6:     2     3 0.447619 0.5396825 0.9333333     0  0.7937508  0.8285714
 7:     3     1 0.447619 0.4722222 0.8641026    NA  0.6017106  1.4877289
 8:     3     2 0.952381 0.5833333 0.9230769     1  4.0674910  3.9532967
 9:     3     3 1.000000 0.4642857 1.0000000     1  7.9350221  3.8928571
10:     4     1 0.447619 0.6428571 0.8641026    NA  0.7713174  1.9996337
11:     4     2 0.952381 0.0000000 0.9230769     0  3.6961688  0.7032967
12:     4     3 1.000000 1.0000000 1.0000000     0 15.4915816  4.0000000
13:     5     1 0.447619 0.5555556 0.8641026    NA  0.6717426  1.7377289
14:     5     2 0.952381 0.0000000 0.9230769     0  3.6961688  0.7032967
15:     5     3 1.000000 0.8492063 1.0000000     0  8.5458257  3.5476190
16:     6     4 1.000000 1.0000000 0.6111111     1 14.6796595  4.7222222
17:     6     5 1.000000 0.5277778 1.0000000     0  7.9139248  2.5833333
    threshold  truth
       <lgcl> <lgcl>
 1:     FALSE  FALSE
 2:     FALSE  FALSE
 3:     FALSE  FALSE
 4:      TRUE   TRUE
 5:     FALSE  FALSE
 6:     FALSE  FALSE
 7:     FALSE  FALSE
 8:     FALSE   TRUE
 9:     FALSE  FALSE
10:     FALSE  FALSE
11:     FALSE  FALSE
12:      TRUE   TRUE
13:     FALSE  FALSE
14:     FALSE  FALSE
15:      TRUE  FALSE
16:      TRUE   TRUE
17:     FALSE  FALSE

> table(pairs$truth, pairs$threshold)
       
        FALSE TRUE
  FALSE    12    1
  TRUE      1    3

We see that three of the four matches that should have been found have indeed been found (the recall is 3/4) and we have one false link (sensitivity is 1/4).

Using a threshold, does not take into account the fact that often we know that one record from the first data set can be linked to at most one record from the second data set and vice versa. If we make the threshold low enough we have more links than records in either data set. reclin contains two functions that force one-to-one linkage: select_greedy and select_n_to_m. The first is fast (it selects pairs starting from the highest score; pairs are only selected when each of the records in a pair have not been selected previously); the second is slower, but can lead to better results (it tries to optimise to total score of the selected records under the restriction that each record can be selected only once):

> pairs <- select_greedy(pairs, "weights", variable = "greedy", threshold = 0)
> table(pairs$truth, pairs$greedy)
       
        FALSE TRUE
  FALSE    13    0
  TRUE      0    4

> pairs <- select_n_to_m(pairs, "weights", variable = "ntom", threshold = 0)
> table(pairs$truth, pairs$ntom)
       
        FALSE TRUE
  FALSE    13    0
  TRUE      0    4

The final, last step

The real final step is to create the linked data set. We now know which pairs are to be linked, but we still have to actually link them. link does that (the optional arguments all_x and all_y control the type of linkage):

> linked_data_set <- link(pairs, selection = "ntom")
> print(linked_data_set)
  Total number of pairs: 4 pairs

Key: <.y>
      .y    .x  id.x lastname.x firstname.x  address.x  sex.x postcode.x  id.y
   <int> <int> <int>     <fctr>      <fctr>     <fctr> <fctr>     <fctr> <int>
1:     1     2     2      Smith      George 12 Mainstr      M    1234 AB     2
2:     2     3     3    Johnson        Anna 61 Mainstr      F    1234 AB     3
3:     3     4     4    Johnson     Charles 61 Mainstr      M    1234 AB     4
4:     4     6     6   Schwartz         Ben  1 Eaststr      M    6789 XY     6
   lastname.y firstname.y     address.y  sex.y postcode.y
       <fctr>      <fctr>        <fctr> <fctr>     <fctr>
1:      Smith      Gearge 12 Mainstreet   <NA>    1234 AB
2:     Jonson          A. 61 Mainstreet      F    1234 AB
3:    Johnson     Charles    61 Mainstr      F    1234 AB
4:   Schwartz         Ben        1 Main      M    6789 XY