reclin2
implements methodology for linking records based
on inexact keys. It allows for maximum flexibility by giving users full
control over each step of the linking procedure. The package is built
with performance and scalability in mind: the core algorithms have been
implemented in C++
.
> library(reclin2)
We will work with a pair of data sets with artificial data. They are tiny, but that allows us to see what happens. In this example we will perform ‘classic’ probabilistic record linkage. When some known true links are known it is also possible to use machine learning methods. This is illustrated in another vignette.
> data("linkexample1", "linkexample2")
> print(linkexample1)
id lastname firstname address sex postcode1 1 Smith Anna 12 Mainstr F 1234 AB
2 2 Smith George 12 Mainstr M 1234 AB
3 3 Johnson Anna 61 Mainstr F 1234 AB
4 4 Johnson Charles 61 Mainstr M 1234 AB
5 5 Johnson Charly 61 Mainstr M 1234 AB
6 6 Schwartz Ben 1 Eaststr M 6789 XY
> print(linkexample2)
id lastname firstname address sex postcode1 2 Smith Gearge 12 Mainstreet <NA> 1234 AB
2 3 Jonson A. 61 Mainstreet F 1234 AB
3 4 Johnson Charles 61 Mainstr F 1234 AB
4 6 Schwartz Ben 1 Main M 6789 XY
5 7 Schwartz Anna 1 Eaststr F 6789 XY
We have two data sets with personal information. The second data set contains a lot of errors, but we will try to link the second data set to the first.
In principle linkage consists of comparing each combination of records from the two data sets and determine which of those combinations (or pairs as we will call them below) belong to the same entity. In case of a perfect linkage key, it is of course, not necessary to compare all combinations of records, but when the linkage keys are imperfect and contain errors, it is in principle necessary to compare all pairs.
However, comparing all pairs can result in an intractable number of pairs: when linking two data sets with a million records there are 1012 possible pairs. Therefore, some sort of reduction of the possible pairs is usually applied. In the example below, we apply blocking, which means that pairs are only generated when they agree on the blocking variable (in this case the postcode). This means that pairs of records that disagree on the blocking variable are not considered. Therefore, one will only use variables that can be considered without errors as blocking variable, or link multiple times with different blocking variables and combine both data sets.
The first step in (probabilistic) linkage is, therefore, generating all pairs:
> pairs <- pair_blocking(linkexample1, linkexample2, "postcode")
> print(pairs)
: 6 records
First data set: 5 records
Second data set: 17 pairs
Total number of pairs: 'postcode'
Blocking on
.x .y<int> <int>
1: 1 1
2: 1 2
3: 1 3
4: 2 1
5: 2 2
6: 2 3
7: 3 1
8: 3 2
9: 3 3
10: 4 1
11: 4 2
12: 4 3
13: 5 1
14: 5 2
15: 5 3
16: 6 4
17: 6 5
As you can see, record 1 from x
(the first data set) is
compared to records 1, 2 and 3 from y
. Also note that
reclin2
uses the data.table
package to
efficiently perform some of the computations. Therefore, the
pairs
object is a data.table
.
Other functions to generate pairs are:
pair
: generate all possible pairspair_minsim
: generate pairs that have minimum
similarity score (e.g. should agree on at least one variable in a set of
given variables). Can be computationally intensive as all records have
to be compared.We can now compare the records on their linkage keys:
> pairs <- compare_pairs(pairs, on = c("lastname", "firstname", "address", "sex"))
> print(pairs)
: 6 records
First data set: 5 records
Second data set: 17 pairs
Total number of pairs: 'postcode'
Blocking on
.x .y lastname firstname address sex<int> <int> <lgcl> <lgcl> <lgcl> <lgcl>
1: 1 1 TRUE FALSE FALSE NA
2: 1 2 FALSE FALSE FALSE TRUE
3: 1 3 FALSE FALSE FALSE TRUE
4: 2 1 TRUE FALSE FALSE NA
5: 2 2 FALSE FALSE FALSE FALSE
6: 2 3 FALSE FALSE FALSE FALSE
7: 3 1 FALSE FALSE FALSE NA
8: 3 2 FALSE FALSE FALSE TRUE
9: 3 3 TRUE FALSE TRUE TRUE
10: 4 1 FALSE FALSE FALSE NA
11: 4 2 FALSE FALSE FALSE FALSE
12: 4 3 TRUE TRUE TRUE FALSE
13: 5 1 FALSE FALSE FALSE NA
14: 5 2 FALSE FALSE FALSE FALSE
15: 5 3 TRUE FALSE TRUE FALSE
16: 6 4 TRUE TRUE FALSE TRUE
17: 6 5 TRUE FALSE TRUE FALSE
As you can see, we don’t need to pass the original data sets although
the variables lastname
etc. are from those original data
sets. This is because a copy of the original data sets are stored with
the pairs object pairs
(and should you be worrying about
memory: as long as the original data sets are not modified the data sets
are not actually copied).
In the example above the result of compare_pairs
was
assigned back to pairs
. When working with large datasets it
can be more efficient to modify pairs
in place preventing
unnecessary copies. This behaviour can be switched on using the
inplace
argument which is accepted by most functions.
> compare_pairs(pairs, on = c("lastname", "firstname", "address", "sex"),
+ inplace = TRUE)
> print(pairs)
: 6 records
First data set: 5 records
Second data set: 17 pairs
Total number of pairs: 'postcode'
Blocking on
.x .y lastname firstname address sex<int> <int> <lgcl> <lgcl> <lgcl> <lgcl>
1: 1 1 TRUE FALSE FALSE NA
2: 1 2 FALSE FALSE FALSE TRUE
3: 1 3 FALSE FALSE FALSE TRUE
4: 2 1 TRUE FALSE FALSE NA
5: 2 2 FALSE FALSE FALSE FALSE
6: 2 3 FALSE FALSE FALSE FALSE
7: 3 1 FALSE FALSE FALSE NA
8: 3 2 FALSE FALSE FALSE TRUE
9: 3 3 TRUE FALSE TRUE TRUE
10: 4 1 FALSE FALSE FALSE NA
11: 4 2 FALSE FALSE FALSE FALSE
12: 4 3 TRUE TRUE TRUE FALSE
13: 5 1 FALSE FALSE FALSE NA
14: 5 2 FALSE FALSE FALSE FALSE
15: 5 3 TRUE FALSE TRUE FALSE
16: 6 4 TRUE TRUE FALSE TRUE
17: 6 5 TRUE FALSE TRUE FALSE
The default comparison function returns TRUE
when the
linkage keys agree and false when they don’t. However, when looking at
the original data sets, we can see that most of our linkage keys are
string variables that contain typing errors. The quality of our linkage
could be improved if we could use a similarity score to compare the two
strings: a high score means that the two strings are very similar a
value close to zero means that the strings are very different.
Below we use the cmp_jarowinkler
similarity score to
compare all fields:
> compare_pairs(pairs, on = c("lastname", "firstname", "address", "sex"),
+ default_comparator = cmp_jarowinkler(0.9), inplace = TRUE)
> print(pairs)
: 6 records
First data set: 5 records
Second data set: 17 pairs
Total number of pairs: 'postcode'
Blocking on
.x .y lastname firstname address sex<int> <int> <num> <num> <num> <num>
1: 1 1 1.000000 0.4722222 0.9230769 NA
2: 1 2 0.000000 0.5833333 0.8641026 1
3: 1 3 0.447619 0.4642857 0.9333333 1
4: 2 1 1.000000 0.8888889 0.9230769 NA
5: 2 2 0.000000 0.0000000 0.8641026 0
6: 2 3 0.447619 0.5396825 0.9333333 0
7: 3 1 0.447619 0.4722222 0.8641026 NA
8: 3 2 0.952381 0.5833333 0.9230769 1
9: 3 3 1.000000 0.4642857 1.0000000 1
10: 4 1 0.447619 0.6428571 0.8641026 NA
11: 4 2 0.952381 0.0000000 0.9230769 0
12: 4 3 1.000000 1.0000000 1.0000000 0
13: 5 1 0.447619 0.5555556 0.8641026 NA
14: 5 2 0.952381 0.0000000 0.9230769 0
15: 5 3 1.000000 0.8492063 1.0000000 0
16: 6 4 1.000000 1.0000000 0.6111111 1
17: 6 5 1.000000 0.5277778 1.0000000 0
The function compare_vars
offers more flexibility than
compare_pairs
. It can for example compare multiple
variables at the same time (e.g. compare birth day and month allowing
for swaps) or generate multiple results from comparing on one
variable.
The next step in the process, is to determine which pairs of records
belong to the same entity and which do not. There are numerous ways to
do this. One possibility is to label some of the pairs as match or no
match, and use some machine learning algorithm to predict the match
status using the comparison vectors. Traditionally, the probabilistic
linkage framework initially formalised by Fellegi and Sunter tries is
used. The function problink_em
uses an EM-algorithm to
estimate the so called m- and u-probabilities for each of the linkage
variables. The m-probability is the probability that two records
concerning the same entity agree on the linkage variable; this means
that the m-probability corresponds to the probability that there is an
error in the linkage variables. The u-probability is the probability
that two records belonging to different entities agree on a variable.
For a variable with few categories (such as sex) this probability will
be large, while for a variable with a large number of categories (such
as last name) this probability will be small.
> m <- problink_em(~ lastname + firstname + address + sex, data = pairs)
> print(m)
- and u-probabilities estimated by the EM-algorithm:
M-probability U-probability
Variable M0.9990000 0.001152679
lastname 0.1999999 0.000100000
firstname 0.8999206 0.285831118
address 0.3002011 0.285427112
sex
: 0.5885595. Matching probability
These m- and u-probabilities can be used to score the pairs:
> pairs <- predict(m, pairs = pairs, add = TRUE)
> print(pairs)
: 6 records
First data set: 5 records
Second data set: 17 pairs
Total number of pairs: 'postcode'
Blocking on
.x .y lastname firstname address sex weights<int> <int> <num> <num> <num> <num> <num>
1: 1 1 1.000000 0.4722222 0.9230769 NA 7.7103862
2: 1 2 0.000000 0.5833333 0.8641026 1 -5.9463949
3: 1 3 0.447619 0.4642857 0.9333333 1 0.8042090
4: 2 1 1.000000 0.8888889 0.9230769 NA 8.6064218
5: 2 2 0.000000 0.0000000 0.8641026 0 -6.3177171
6: 2 3 0.447619 0.5396825 0.9333333 0 0.7937508
7: 3 1 0.447619 0.4722222 0.8641026 NA 0.6017106
8: 3 2 0.952381 0.5833333 0.9230769 1 4.0674910
9: 3 3 1.000000 0.4642857 1.0000000 1 7.9350221
10: 4 1 0.447619 0.6428571 0.8641026 NA 0.7713174
11: 4 2 0.952381 0.0000000 0.9230769 0 3.6961688
12: 4 3 1.000000 1.0000000 1.0000000 0 15.4915816
13: 5 1 0.447619 0.5555556 0.8641026 NA 0.6717426
14: 5 2 0.952381 0.0000000 0.9230769 0 3.6961688
15: 5 3 1.000000 0.8492063 1.0000000 0 8.5458257
16: 6 4 1.000000 1.0000000 0.6111111 1 14.6796595
17: 6 5 1.000000 0.5277778 1.0000000 0 7.9139248
With add = TRUE
the predictions are added to the
pairs
object. The higher the weight the more likely the two
pairs belong to the same entity/are a match.
The prediction function can also return the m- and u-probabilities and the posterior m- and u-probabilities.
A simpler method is to base the score directly on the similarity
scores by calculating a (weighted) sum of the similarity scores. The
higher the score the more similar the two records in the pair are. In
the simplest case score_simple
just calculates the sum of
the similarity scores of the compared variables:
> pairs <- score_simple(pairs, "score",
+ on = c("lastname", "firstname", "address", "sex"))
However, it is also possible to provide custom weights for each of the variables. In that way some variables can contribute more or less to the final score. For example, agreement on last name might give more information than agreement on a variable such as sex.
> pairs <- score_simple(pairs, "score",
+ on = c("lastname", "firstname", "address", "sex"),
+ w1 = c(lastname = 2, firstname = 2, address = 1, sex = 0.5),
+ w0 = -1, wna = 0)
Using w1
, w0
and wna
it is
possible to specify separate weights for agreement, non-agreement and
missing values respectively. For example, agreement on last name might
be a strong indication that two records belong to the same object, while
non-agreement on last name might not be strong indication of two records
not being from the same object as last name might contain a lot of
errors. In that case a high values for w1
and a hight
(e.g. only slightly negative) values for w0
might be
appropriate. For sex the reverse might be true: agreement does not give
much information (as two random persons will generally have a 50 percent
chance of having the same sec) while non-agreement might be a strong
indication that the two records do not belong to the same person
(e.g. when there are few errors in sex). In that case relatively low
values for w1
and low values for w0
are
appropriate.
The total contribution of the comparison ci of variable i is
wi = w1, ici + w0, i(1−ci)
Or wna
when the comparison is missing. This assumes that
similarity scores are between zero and one.
The final step is to select the pairs that are considered to belong to the same entities. The simplest method is to select all pairs above a certain threshold
> pairs <- select_threshold(pairs, "threshold", score = "weights", threshold = 8)
> print(pairs)
: 6 records
First data set: 5 records
Second data set: 17 pairs
Total number of pairs: 'postcode'
Blocking on
.x .y lastname firstname address sex weights score<int> <int> <num> <num> <num> <num> <num> <num>
1: 1 1 1.000000 0.4722222 0.9230769 NA 7.7103862 3.2628205
2: 1 2 0.000000 0.5833333 0.8641026 1 -5.9463949 0.9782051
3: 1 3 0.447619 0.4642857 0.9333333 1 0.8042090 2.1023810
4: 2 1 1.000000 0.8888889 0.9230769 NA 8.6064218 4.5128205
5: 2 2 0.000000 0.0000000 0.8641026 0 -6.3177171 -2.2717949
6: 2 3 0.447619 0.5396825 0.9333333 0 0.7937508 0.8285714
7: 3 1 0.447619 0.4722222 0.8641026 NA 0.6017106 1.4877289
8: 3 2 0.952381 0.5833333 0.9230769 1 4.0674910 3.9532967
9: 3 3 1.000000 0.4642857 1.0000000 1 7.9350221 3.8928571
10: 4 1 0.447619 0.6428571 0.8641026 NA 0.7713174 1.9996337
11: 4 2 0.952381 0.0000000 0.9230769 0 3.6961688 0.7032967
12: 4 3 1.000000 1.0000000 1.0000000 0 15.4915816 4.0000000
13: 5 1 0.447619 0.5555556 0.8641026 NA 0.6717426 1.7377289
14: 5 2 0.952381 0.0000000 0.9230769 0 3.6961688 0.7032967
15: 5 3 1.000000 0.8492063 1.0000000 0 8.5458257 3.5476190
16: 6 4 1.000000 1.0000000 0.6111111 1 14.6796595 4.7222222
17: 6 5 1.000000 0.5277778 1.0000000 0 7.9139248 2.5833333
threshold<lgcl>
1: FALSE
2: FALSE
3: FALSE
4: TRUE
5: FALSE
6: FALSE
7: FALSE
8: FALSE
9: FALSE
10: FALSE
11: FALSE
12: TRUE
13: FALSE
14: FALSE
15: TRUE
16: TRUE
17: FALSE
The select functions add a (logical) variable to the data set indicating whether a pairs is selected or not.
In this case we know which records truly belong to each other. We can use that to evaluate the linkage:
> pairs <- compare_vars(pairs, "truth", on_x = "id", on_y = "id")
> print(pairs)
: 6 records
First data set: 5 records
Second data set: 17 pairs
Total number of pairs: 'postcode'
Blocking on
.x .y lastname firstname address sex weights score<int> <int> <num> <num> <num> <num> <num> <num>
1: 1 1 1.000000 0.4722222 0.9230769 NA 7.7103862 3.2628205
2: 1 2 0.000000 0.5833333 0.8641026 1 -5.9463949 0.9782051
3: 1 3 0.447619 0.4642857 0.9333333 1 0.8042090 2.1023810
4: 2 1 1.000000 0.8888889 0.9230769 NA 8.6064218 4.5128205
5: 2 2 0.000000 0.0000000 0.8641026 0 -6.3177171 -2.2717949
6: 2 3 0.447619 0.5396825 0.9333333 0 0.7937508 0.8285714
7: 3 1 0.447619 0.4722222 0.8641026 NA 0.6017106 1.4877289
8: 3 2 0.952381 0.5833333 0.9230769 1 4.0674910 3.9532967
9: 3 3 1.000000 0.4642857 1.0000000 1 7.9350221 3.8928571
10: 4 1 0.447619 0.6428571 0.8641026 NA 0.7713174 1.9996337
11: 4 2 0.952381 0.0000000 0.9230769 0 3.6961688 0.7032967
12: 4 3 1.000000 1.0000000 1.0000000 0 15.4915816 4.0000000
13: 5 1 0.447619 0.5555556 0.8641026 NA 0.6717426 1.7377289
14: 5 2 0.952381 0.0000000 0.9230769 0 3.6961688 0.7032967
15: 5 3 1.000000 0.8492063 1.0000000 0 8.5458257 3.5476190
16: 6 4 1.000000 1.0000000 0.6111111 1 14.6796595 4.7222222
17: 6 5 1.000000 0.5277778 1.0000000 0 7.9139248 2.5833333
threshold truth<lgcl> <lgcl>
1: FALSE FALSE
2: FALSE FALSE
3: FALSE FALSE
4: TRUE TRUE
5: FALSE FALSE
6: FALSE FALSE
7: FALSE FALSE
8: FALSE TRUE
9: FALSE FALSE
10: FALSE FALSE
11: FALSE FALSE
12: TRUE TRUE
13: FALSE FALSE
14: FALSE FALSE
15: TRUE FALSE
16: TRUE TRUE
17: FALSE FALSE
> table(pairs$truth, pairs$threshold)
FALSE TRUE
FALSE 12 1
TRUE 1 3
We see that three of the four matches that should have been found have indeed been found (the recall is 3/4) and we have one false link (sensitivity is 1/4).
Using a threshold, does not take into account the fact that often we
know that one record from the first data set can be linked to at most
one record from the second data set and vice versa. If we make the
threshold low enough we have more links than records in either data set.
reclin
contains two functions that force one-to-one
linkage: select_greedy
and select_n_to_m
. The
first is fast (it selects pairs starting from the highest score; pairs
are only selected when each of the records in a pair have not been
selected previously); the second is slower, but can lead to better
results (it tries to optimise to total score of the selected records
under the restriction that each record can be selected only once):
> pairs <- select_greedy(pairs, "weights", variable = "greedy", threshold = 0)
> table(pairs$truth, pairs$greedy)
FALSE TRUE
FALSE 13 0
TRUE 0 4
> pairs <- select_n_to_m(pairs, "weights", variable = "ntom", threshold = 0)
> table(pairs$truth, pairs$ntom)
FALSE TRUE
FALSE 13 0
TRUE 0 4
Perfection!
Other functions to select pairs are
select_unique
which will deselect pairs that are
matched multiple times.The real final step is to create the linked data set. We now know
which pairs are to be linked, but we still have to actually link them.
link
does that (the optional arguments all_x
and all_y
control the type of linkage):
> linked_data_set <- link(pairs, selection = "ntom")
> print(linked_data_set)
: 4 pairs
Total number of pairs
: <.y>
Key
.y .x id.x lastname.x firstname.x address.x sex.x postcode.x id.y<int> <int> <int> <fctr> <fctr> <fctr> <fctr> <fctr> <int>
1: 1 2 2 Smith George 12 Mainstr M 1234 AB 2
2: 2 3 3 Johnson Anna 61 Mainstr F 1234 AB 3
3: 3 4 4 Johnson Charles 61 Mainstr M 1234 AB 4
4: 4 6 6 Schwartz Ben 1 Eaststr M 6789 XY 6
lastname.y firstname.y address.y sex.y postcode.y<fctr> <fctr> <fctr> <fctr> <fctr>
1: Smith Gearge 12 Mainstreet <NA> 1234 AB
2: Jonson A. 61 Mainstreet F 1234 AB
3: Johnson Charles 61 Mainstr F 1234 AB
4: Schwartz Ben 1 Main M 6789 XY