Normalize Geographic Variables

Messy Data

In this vignette, we will clean some exaggerated fake messy contribution data from a fictitious campaign in Vermont. This data is found in the vt_contribs.csv file included with our package.

ex_file <- system.file("extdata", "vt_contribs.csv", package = "campfin")

id	cand	date	amount	name	address	city	state	zip
01	Bob Miller	02/09/2019	10	Lisa Miller	4 Sheffield square Road	Sheffield	VT	05866
02	Bob Miller	03/09/2009	20	Deb Brown	Requested	Requested	RE	00000
03	Chuck White	04/09/2019	25	N/A	p.o. box 567	Midlebury	Vermont	05753-0567
04	Chuck White	05/09/2019	100	Josh Jones	sugarhouse SQU	e Corinth	VT	5076
05	Bob Miller	02/09/2019	10	Lisa Miller	4 Sheffield square Road	Sheffield	VT	05866
06	Chuck White	06/09/2019	1000	Bob Taylor	55 thisplace av	young america	mn	55555
07	Chuck White	07/09/2019	-600	Alex Johnson	11 Liberty AVN	Bristol, VT	VT	99999
08	Alice Walsh	08/09/2019	0	Ruth Smith	2 Burlington sqre	Brulington	vt	05401
09	Alice Walsh	09/09/2019	69	Joe Garcia	770 5th-st-nw	Washington	D.C.	20001-2674
10	Alice Walsh	11/09/2019	222	Dave Wilson	XXXXXXXXXXXXXXXXX	SA	Texas	78202

What are some of the potential problems we can see in this data?

The date column is not parsed as an R date, making it impossible to perform mathematical calculations like min().
There is one negative amount value and another that’s zero. Not necessarily uncommon, but worth noting.
One record is missing both the contributor’s name.
One record is duplicated across every column.
In address we see:
- Inconsistent capitalization.
- A mix of full and abbreviated suffixes.
- Alternative versions of NA.
- Unnecessary and inconsistent punctuation.
- Excess trailing and internal white space.
- Incorrect punctuation and replaced spaces.
- Repeating character strings used instead of NA.
In city we see many of the same problems, plus:
- Geographic abbreviations.
- Repeated state information.
- Misspellings.
- Colloquial city abbreviations.
In state we see a mix of full and abbreviated state names.
In zip:
- Repeated digits used for NA values (NB: 55555 is a valid ZIP code, 99999 is not).
- Unnecessary and inconsistent ZIP+4 usage.
- Leading zeroes dropped by Excel or some other program.

While this data is obviously much smaller and more full of errors than real campaign finance data, these errors are not uncommon and need to be addressed. The campfin package contains many of the tools we need to first find and then fix these common problems.

Read

In most cases, the first step is to download and read the file from a state agency. When reading the data with the popular readr::read_delim() function, the col_date_mdy() function can be used as a quick shortcut for readr::col_date(format = "%m/%d/%Y"), the format most commonly found in U.S. campaign finance data.

vt <- read_csv(
  file = ex_file,
  trim_ws = FALSE,
  na = c("", "NA", "N/A"),
  col_types = cols(
    amount = col_number(),
    date = col_date_mdy()
  )
)

We can see how the new date column is an actual date object, allowing for mathematical manipulation.

min(vt$date)
#> [1] "2009-03-09"

Next, we should try to normalize our data as much as possible. We can use some simple counting functions and built in vectors to check the cleanliness of our raw data.

prop_in(vt$city, str_to_lower(valid_city))
#> [1] 0.1
prop_in(vt$state, valid_state)
#> [1] 0.4
prop_in(vt$zip, valid_zip)
#> [1] 0.5

col_stats(vt, n_distinct)
#> # A tibble: 9 × 4
#>   col     class      n     p
#>   <chr>   <chr>  <int> <dbl>
#> 1 id      <chr>     10   1  
#> 2 cand    <chr>      3   0.3
#> 3 date    <date>     9   0.9
#> 4 amount  <dbl>      9   0.9
#> 5 name    <chr>      9   0.9
#> 6 address <chr>      9   0.9
#> 7 city    <chr>      9   0.9
#> 8 state   <chr>      7   0.7
#> 9 zip     <chr>      9   0.9
col_stats(vt, count_na)
#> # A tibble: 9 × 4
#>   col     class      n     p
#>   <chr>   <chr>  <int> <dbl>
#> 1 id      <chr>      0   0  
#> 2 cand    <chr>      0   0  
#> 3 date    <date>     0   0  
#> 4 amount  <dbl>      0   0  
#> 5 name    <chr>      1   0.1
#> 6 address <chr>      0   0  
#> 7 city    <chr>      0   0  
#> 8 state   <chr>      0   0  
#> 9 zip     <chr>      0   0

A typical benchmark is to reach greater than 95% valid. That is, we want to normalize our data enough that less than 5% of our data can not be easily confirmed as valid using a fairly comprehensive list of cities, states, and ZIP codes.

We will first try to reach this threshold by normalizing our data. This process involves reducing inconsistencies through string manipulation. There are separate normal_*() functions for each of the 4 types of geographic variables. Typically we use dplyr::mutate() to create new, normalized versions of our messy columns, preserving the old data for transparency. Here, we will just overwrite our example data for simplicity.

vt <- vt %>% 
  mutate(
    address = normal_address(
      address = address,
      abbs = usps_street,
      na = invalid_city,
      na_rep = TRUE
    ),
    city = normal_city(
      city = city,
      abbs = usps_city,
      states = "VT",
      na = invalid_city
    ),
    state = normal_state(
      state = state,
      abbreviate = TRUE,
      na_rep = TRUE,
      valid = valid_state
    ),
    zip = normal_zip(
      zip = zip,
      na_rep = TRUE
    )
  )

We can see how these functions and our built in data was used to normalize the geographic contributor data and remove anything that didn’t present real information. This format is much more easily explored and search.

#> # A tibble: 10 × 4
#>    address               city          state zip  
#>    <chr>                 <chr>         <chr> <chr>
#>  1 4 SHEFFIELD SQUARE RD SHEFFIELD     VT    05866
#>  2 <NA>                  <NA>          <NA>  <NA> 
#>  3 PO BOX 567            MIDLEBURY     VT    05753
#>  4 SUGARHOUSE SQ         EAST CORINTH  VT    5076 
#>  5 4 SHEFFIELD SQUARE RD SHEFFIELD     VT    05866
#>  6 55 THISPLACE AVE      YOUNG AMERICA MN    55555
#>  7 11 LIBERTY AVE        BRISTOL       VT    <NA> 
#>  8 2 BURLINGTON SQ       BRULINGTON    VT    05401
#>  9 770 5THSTNW           WASHINGTON    DC    20001
#> 10 <NA>                  SA            TX    78202

Cities

However, the problem has not been solved. City names are the most troublesome; There are so many city names and such great variety (compared to states and ZIP codes), that it can be difficult to normalize and difficult to assess.

Our valid_city vector contains many city names, but far less than exist in the country, especially when you account for neighborhoods that aren’t really cities, but shouldn’t be changed (some of these are contained in our curated extra_city vector).

length(valid_city)
#> [1] 19083
sample(valid_city, 6)
#> [1] "SUTHERLAND SPRINGS" "ELON COLLEGE"       "DODGERTOWN"        
#> [4] "JEFFERSONVILLE"     "LARNED"             "MACHIASPORT"
sample(extra_city, 6)
#> [1] "ETIWANDA"        "FORKS TOWNSHIP"  "VANPORT"         "SYMMES TOWNSHIP"
#> [5] "YARDVILLE"       "DEL CITY"
# combine both vectors
many_city <- c(valid_city, extra_city)

Still, checking against this list is a good way to check for values that need additional attention.

#> # A tibble: 3 × 5
#>   id    city       state zip   valid
#>   <chr> <chr>      <chr> <chr> <lgl>
#> 1 03    MIDLEBURY  VT    05753 FALSE
#> 2 08    BRULINGTON VT    05401 FALSE
#> 3 10    SA         TX    78202 FALSE

It might not be clear what’s actually wrong with these values. A common way to check is by comparing them against expected city for a given ZIP code.

bad <- left_join(
  x = bad,
  y = zipcodes,
  by = c("zip", "state"), 
  suffix = c("_raw", "_match")
)

id	city_raw	state	zip	city_match
03	MIDLEBURY	VT	05753	MIDDLEBURY
08	BRULINGTON	VT	05401	BURLINGTON
10	SA	TX	78202	SAN ANTONIO

Now the problems become clear. Two city names are misspelled and the third is an obvious abbreviation. When dealing with millions of city names, we need a way to check each raw value against it’s expected ZIP code match.

The str_dist() and is_abbrev() functions can be used to compared the value we have with the value we expect. By only checking against the corresponding city to that record’s ZIP code, we are making extremely confident changes (compared to the incredibly useful clustering algorithms like those provided by the refinr package).

First, we can use str_dist() to check the distance between the two strings; distance is defined as the number of changes we’d need to make to our normalized value to get our expected matched value. If that distance is small (usually 1 or 2), we can pretty confidently use the matched value.

str_dist("example", "xampel")
#> [1] 2

But the string distance does not catch colloquial city abbreviations (e.g., NYC, BOS, LA, CHI, ABQ, BRNX, DFW, OKC). Many residents get so used to writing their city’s name they use abbreviations and assume them to be universally understood. The is_abbrev() function can be used to check to one string might be an abbreviation for another. Every abbreviation generated by the abbreviate() function satisfied the requirements of is_abbrev().

is_abbrev(abb = "NYC", full = "New York City")
#> [1] TRUE
is_abbrev(abb = "DC", full = "Washington")
#> [1] FALSE

bad <- bad %>% 
  mutate(
    match_dist = str_dist(city_raw, city_match),
    match_abb = is_abbrev(city_raw, city_match)
  )

id	city_raw	state	zip	city_match	match_dist	match_abb
03	MIDLEBURY	VT	05753	MIDDLEBURY	1	FALSE
08	BRULINGTON	VT	05401	BURLINGTON	1	FALSE
10	SA	TX	78202	SAN ANTONIO	9	TRUE

Here’s what this process would look like when employed on an entire data frame. It’s important to ensure that the number of rows in our campaign finance data is kept consistent throughout the wrangling process and that original columns are left unchanged.

vt <- vt %>%
  rename(city_raw = city) %>% 
  # match city by ZIP
  left_join(zipcodes) %>% 
  rename(city_match = city) %>%
  mutate(
    # check against match
    match_dist = str_dist(city_raw, city_match),
    match_abb = is_abbrev(city_raw, city_match),
    city = ifelse(match_abb | match_dist == 1, city_match, city_raw)
  ) %>% 
  # remove intermediary columns
  select(-city_raw, -city_match, -match_dist, -match_abb)
#> Joining with `by = join_by(state, zip)`

Now every city, state, and zip value is contained in our list of valid values.

id	state	zip	city	all_valid
01	VT	05866	SHEFFIELD	TRUE
03	VT	05753	MIDDLEBURY	TRUE
05	VT	05866	SHEFFIELD	TRUE
06	MN	55555	YOUNG AMERICA	TRUE
08	VT	05401	BURLINGTON	TRUE
09	DC	20001	WASHINGTON	TRUE
10	TX	78202	SAN ANTONIO	TRUE

Once our data is as normal as we can confidently make it, we can begin to explore. First, we’ll explore the data for missing values with flag_na(), which takes a tidyselect input of columns (or something like dplyr::everything()).

(vt <- flag_na(vt, name))
#> # A tibble: 10 × 10
#>    id    cand        date       amount name    address state zip   city  na_flag
#>    <chr> <chr>       <date>      <dbl> <chr>   <chr>   <chr> <chr> <chr> <lgl>  
#>  1 01    Bob Miller  2019-02-09     10 Lisa M… 4 SHEF… VT    05866 SHEF… FALSE  
#>  2 02    Bob Miller  2009-03-09     20 Deb Br… <NA>    <NA>  <NA>  <NA>  FALSE  
#>  3 03    Chuck White 2019-04-09     25 <NA>    PO BOX… VT    05753 MIDD… TRUE   
#>  4 04    Chuck White 2019-05-09    100 Josh J… SUGARH… VT    5076  <NA>  FALSE  
#>  5 05    Bob Miller  2019-02-09     10 Lisa M… 4 SHEF… VT    05866 SHEF… FALSE  
#>  6 06    Chuck White 2019-06-09   1000 Bob Ta… 55 THI… MN    55555 YOUN… FALSE  
#>  7 07    Chuck White 2019-07-09   -600 Alex J… 11 LIB… VT    <NA>  <NA>  FALSE  
#>  8 08    Alice Walsh 2019-08-09      0 Ruth S… 2 BURL… VT    05401 BURL… FALSE  
#>  9 09    Alice Walsh 2019-09-09     69 Joe Ga… 770 5T… DC    20001 WASH… FALSE  
#> 10 10    Alice Walsh 2019-11-09    222 Dave W… <NA>    TX    78202 SAN … FALSE

Next, we’ll want to check for duplicate rows using flag_dupes(), which takes the same kind of arguments. Here, we can ignore the supposedly unique id variable. It’s possible for a person to make the same contribution on the same date, but we should flag them nonetheless.

(vt <- flag_dupes(vt, -id, .both = TRUE))
#> # A tibble: 10 × 11
#>    id    cand        date       amount name    address state zip   city  na_flag
#>    <chr> <chr>       <date>      <dbl> <chr>   <chr>   <chr> <chr> <chr> <lgl>  
#>  1 01    Bob Miller  2019-02-09     10 Lisa M… 4 SHEF… VT    05866 SHEF… FALSE  
#>  2 02    Bob Miller  2009-03-09     20 Deb Br… <NA>    <NA>  <NA>  <NA>  FALSE  
#>  3 03    Chuck White 2019-04-09     25 <NA>    PO BOX… VT    05753 MIDD… TRUE   
#>  4 04    Chuck White 2019-05-09    100 Josh J… SUGARH… VT    5076  <NA>  FALSE  
#>  5 05    Bob Miller  2019-02-09     10 Lisa M… 4 SHEF… VT    05866 SHEF… FALSE  
#>  6 06    Chuck White 2019-06-09   1000 Bob Ta… 55 THI… MN    55555 YOUN… FALSE  
#>  7 07    Chuck White 2019-07-09   -600 Alex J… 11 LIB… VT    <NA>  <NA>  FALSE  
#>  8 08    Alice Walsh 2019-08-09      0 Ruth S… 2 BURL… VT    05401 BURL… FALSE  
#>  9 09    Alice Walsh 2019-09-09     69 Joe Ga… 770 5T… DC    20001 WASH… FALSE  
#> 10 10    Alice Walsh 2019-11-09    222 Dave W… <NA>    TX    78202 SAN … FALSE  
#> # ℹ 1 more variable: dupe_flag <lgl>

Normalize Geographic Variables

Background

Messy Data

Read

Cities

Conclude

id	cand	date	amount	name	address	state	zip	city	na_flag	dupe_flag
01	Bob Miller	2019-02-09	10	Lisa Miller	4 SHEFFIELD SQUARE RD	VT	05866	SHEFFIELD	FALSE	TRUE
02	Bob Miller	2009-03-09	20	Deb Brown	NA	NA	NA	NA	FALSE	FALSE
03	Chuck White	2019-04-09	25	NA	PO BOX 567	VT	05753	MIDDLEBURY	TRUE	FALSE
04	Chuck White	2019-05-09	100	Josh Jones	SUGARHOUSE SQ	VT	5076	NA	FALSE	FALSE
05	Bob Miller	2019-02-09	10	Lisa Miller	4 SHEFFIELD SQUARE RD	VT	05866	SHEFFIELD	FALSE	TRUE
06	Chuck White	2019-06-09	1000	Bob Taylor	55 THISPLACE AVE	MN	55555	YOUNG AMERICA	FALSE	FALSE
07	Chuck White	2019-07-09	-600	Alex Johnson	11 LIBERTY AVE	VT	NA	NA	FALSE	FALSE
08	Alice Walsh	2019-08-09	0	Ruth Smith	2 BURLINGTON SQ	VT	05401	BURLINGTON	FALSE	FALSE
09	Alice Walsh	2019-09-09	69	Joe Garcia	770 5THSTNW	DC	20001	WASHINGTON	FALSE	FALSE
10	Alice Walsh	2019-11-09	222	Dave Wilson	NA	TX	78202	SAN ANTONIO	FALSE	FALSE