JSON (JavaScript Object Notation) is a lightweight and flexible data format that is easy for humans to read and for machines to parse. JSON has become a common format used in:
Public APIs (e.g., Twitter)
NoSQL databases as a document format (e.g., MongoDB)
Relational databases as a new column type (e.g., PostgreSQL)
Tidyjson provides a grammar for turning complex JSON data into tidy data frames that are easy to work with in the tidyverse.
Several libraries exist for working with JSON data in R, such as rjson, rjsonio and jsonlite. These libraries transform JSON data automatically into nested R lists or complex data frames. However, working with these complex objects can be difficult.
The tidyjson package takes a different approach to structuring JSON data into tidy data frames. Similar to tidyr, tidyjson builds a grammar for manipulating JSON into a tidy table structure.
Tidyjson is based on the following principles:
Work on a single JSON document, or on a collection of related documents
Create pipelines with %>%
, producing code that
can be read from left to right
Guarantee the structure of the data produced, even if the input
JSON structure changes (with the exception of
spread_all
)
Work with arbitrarily nested arrays or objects
Handle ‘ragged’ arrays and / or objects (varying lengths by document)
Allow for extraction of data in values or object names
Ensure edge cases are handled correctly (especially empty data)
Integrate seamlessly with dplyr
, allowing
tbl_json
objects to pipe in and out of dplyr
verbs where reasonable
A simple example of how tidyjson works is as follows:
library(dplyr)
# Define a simple people JSON collection
<- c('{"age": 32, "name": {"first": "Bob", "last": "Smith"}}',
people '{"age": 54, "name": {"first": "Susan", "last": "Doe"}}',
'{"age": 18, "name": {"first": "Ann", "last": "Jones"}}')
# Tidy the JSON data
%>% spread_all
people #> # A tbl_json: 3 x 5 tibble with a "JSON" attribute
#> ..JSON document.id age name.first name.last
#> <chr> <int> <dbl> <chr> <chr>
#> 1 "{\"age\":32,\"name..." 1 32 Bob Smith
#> 2 "{\"age\":54,\"name..." 2 54 Susan Doe
#> 3 "{\"age\":18,\"name..." 3 18 Ann Jones
This produces a tbl_json
object, where each row
corresponds to an element of the people
vector (a
“document” in tidyjson). The JSON attribute of the tbl_json
object is shown first, then the columns of the tibble are shown - a
document.id
indicating which document the row originated
in, and then the age and name columns that spread_all
created.
The tidyjson package really shines in a more complex example.
Consider the worldbank
data included in the tidyjson
package.
%>% str
worldbank #> chr [1:500] "{\"_id\":{\"$oid\":\"52b213b38594d8a2be17c780\"},\"boardapprovaldate\":\"2013-11-12T00:00:00Z\",\"closingdate\""| __truncated__ ...
It is a 500 length character vector of projects funded by the world
bank. We can quickly expand all simple columns using
spread_all
%>% spread_all
worldbank #> # A tbl_json: 500 x 9 tibble with a "JSON" attribute
#> ..JSON docum…¹ board…² closi…³ count…⁴ proje…⁵ regio…⁶ total…⁷ _id.$…⁸
#> <chr> <int> <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
#> 1 "{\"_id\":{\"… 1 2013-1… 2018-0… Ethiop… Ethiop… Africa 1.3 e8 52b213…
#> 2 "{\"_id\":{\"… 2 2013-1… <NA> Tunisia TN: DT… Middle… 0 52b213…
#> 3 "{\"_id\":{\"… 3 2013-1… <NA> Tuvalu Tuvalu… East A… 6.06e6 52b213…
#> 4 "{\"_id\":{\"… 4 2013-1… <NA> Yemen,… Gov't … Middle… 0 52b213…
#> # … with 496 more rows, and abbreviated variable names ¹document.id,
#> # ²boardapprovaldate, ³closingdate, ⁴countryshortname, ⁵project_name,
#> # ⁶regionname, ⁷totalamt, ⁸`_id.$oid`
And we can limit the coluns produced by calling
dplyr::select
after
%>% spread_all %>% select(regionname, totalamt)
worldbank #> # A tbl_json: 500 x 3 tibble with a "JSON" attribute
#> ..JSON regionname totalamt
#> <chr> <chr> <dbl>
#> 1 "{\"_id\":{\"$oid\":..." Africa 130000000
#> 2 "{\"_id\":{\"$oid\":..." Middle East and North Africa 0
#> 3 "{\"_id\":{\"$oid\":..." East Asia and Pacific 6060000
#> 4 "{\"_id\":{\"$oid\":..." Middle East and North Africa 0
#> # … with 496 more rows
But worldbank also contains arrays, which cannot be naively spread
into new columns. We can use gather_object
to gather all
name-value paris by name, and then json_types
to identify
the type of JSON stored under each value, and dplyr::count
to aggregate across documents:
%>% gather_object %>% json_types %>% count(name, type)
worldbank #> # A tibble: 8 × 3
#> name type n
#> <chr> <fct> <int>
#> 1 _id object 500
#> 2 boardapprovaldate string 500
#> 3 closingdate string 370
#> 4 countryshortname string 500
#> 5 majorsector_percent array 500
#> 6 project_name string 500
#> 7 regionname string 500
#> 8 totalamt number 500
It appears that majorsector_percent
is an array, and so
we can use enter_object
to enter into it:
%>% enter_object(majorsector_percent)
worldbank #> # A tbl_json: 500 x 2 tibble with a "JSON" attribute
#> ..JSON document.id
#> <chr> <int>
#> 1 "[{\"Name\":\"Educa..." 1
#> 2 "[{\"Name\":\"Publi..." 2
#> 3 "[{\"Name\":\"Trans..." 3
#> 4 "[{\"Name\":\"Healt..." 4
#> # … with 496 more rows
and gather_array
to gather it by index
%>% enter_object(majorsector_percent) %>% gather_array
worldbank #> # A tbl_json: 1,405 x 3 tibble with a "JSON" attribute
#> ..JSON document.id array.index
#> <chr> <int> <int>
#> 1 "{\"Name\":\"Educat..." 1 1
#> 2 "{\"Name\":\"Educat..." 1 2
#> 3 "{\"Name\":\"Public..." 1 3
#> 4 "{\"Name\":\"Educat..." 1 4
#> # … with 1,401 more rows
We can then spread_all
again to capture the name-value
pairs stored in each object
%>%
worldbank enter_object(majorsector_percent) %>% gather_array %>% spread_all
#> # A tbl_json: 1,405 x 5 tibble with a "JSON" attribute
#> ..JSON document.id array.index Name Percent
#> <chr> <int> <int> <chr> <dbl>
#> 1 "{\"Name\":\"Educat..." 1 1 Education 46
#> 2 "{\"Name\":\"Educat..." 1 2 Education 26
#> 3 "{\"Name\":\"Public..." 1 3 Public Administration… 16
#> 4 "{\"Name\":\"Educat..." 1 4 Education 12
#> # … with 1,401 more rows
By combining with our initial top-level spread_all
, we
can aggregate funding dollars by sector by region:
%>%
worldbank %>% select(region = regionname, funding = totalamt) %>%
spread_all enter_object(majorsector_percent) %>% gather_array %>%
%>% rename(sector = Name, percent = Percent) %>%
spread_all group_by(region, sector) %>%
summarize(funding = sum(funding * percent))
#> `summarise()` has grouped output by 'region'. You can override using the
#> `.groups` argument.
#> # A tibble: 63 × 3
#> # Groups: region [7]
#> region sector funding
#> <chr> <chr> <dbl>
#> 1 Africa Agriculture, fishing, and forestry 96648500000
#> 2 Africa Education 83370000000
#> 3 Africa Energy and mining 86709960000
#> 4 Africa Finance 30761000000
#> # … with 59 more rows
Tidyjson provides the following core functions:
Function | Use | Row Effect | Column Effect | JSON Effect |
---|---|---|---|---|
spread_all |
Spread all object values | add many columns | ||
spread_values |
Spread specific columns | specific columns | ||
gather_array |
Gather a JSON array | Duplicates rows | index column | enter array |
gather_object |
Gather a JSON object | Duplicates rows | name column | enter object |
append_values_X |
Append all values of a type | column of type X | ||
enter_object |
Enter into an object by name | Keeps rows with name | enter object | |
json_types |
Identify JSON type | type column | ||
json_lengths |
Identify JSON length | length column |
These functions can be composed into pipelines using the
%>%
operator, as each takes a tbl_json
object and returns a tbl_json
object.
The following are common patterns for manipulating JSON data with tidyjson.
Spread all object values and select a subset to continue working with.
spread_all %>% select
%>% spread_all %>% select(regionname, totalamt)
worldbank #> # A tbl_json: 500 x 3 tibble with a "JSON" attribute
#> ..JSON regionname totalamt
#> <chr> <chr> <dbl>
#> 1 "{\"_id\":{\"$oid\":..." Africa 130000000
#> 2 "{\"_id\":{\"$oid\":..." Middle East and North Africa 0
#> 3 "{\"_id\":{\"$oid\":..." East Asia and Pacific 6060000
#> 4 "{\"_id\":{\"$oid\":..." Middle East and North Africa 0
#> # … with 496 more rows
Identify the names under an object, and their type and frequency.
gather_object %>% json_types %>% count(name, type)
%>% gather_object %>% json_types %>% count(name, type)
worldbank #> # A tibble: 8 × 3
#> name type n
#> <chr> <fct> <int>
#> 1 _id object 500
#> 2 boardapprovaldate string 500
#> 3 closingdate string 370
#> 4 countryshortname string 500
#> # … with 4 more rows
Enter into an array nested under an object, and gather it
enter_object %>% gather_array
%>% enter_object(majorsector_percent) %>% gather_array
worldbank #> # A tbl_json: 1,405 x 3 tibble with a "JSON" attribute
#> ..JSON document.id array.index
#> <chr> <int> <int>
#> 1 "{\"Name\":\"Educat..." 1 1
#> 2 "{\"Name\":\"Educat..." 1 2
#> 3 "{\"Name\":\"Public..." 1 3
#> 4 "{\"Name\":\"Educat..." 1 4
#> # … with 1,401 more rows
Filter to just objects / arrays and then gather them
filter(is_json_X(.)) %>% gather_X
1] %>% gather_object %>%
companies[filter(is_json_array(.)) %>% gather_array
#> # A tbl_json: 8 x 4 tibble with a "JSON" attribute
#> ..JSON document.id name array.index
#> <chr> <int> <chr> <int>
#> 1 "{\"is_past\":fals..." 1 relationships 1
#> 2 "{\"id\":6252,\"rou..." 1 funding_rounds 1
#> 3 "{\"id\":15229,\"ro..." 1 funding_rounds 2
#> 4 "{\"id\":22565,\"ro..." 1 funding_rounds 3
#> # … with 4 more rows
1] %>% gather_object %>%
companies[filter(is_json_object(.)) %>% gather_object
#> Warning in gather_object(.): name column name already exists, changing to name.2
#> # A tbl_json: 3 x 4 tibble with a "JSON" attribute
#> ..JSON document.id name name.2
#> <chr> <int> <chr> <chr>
#> 1 "\"52cdef7e4bab8b..." 1 _id $oid
#> 2 "[[[150,22],\"ass..." 1 image available_sizes
#> 3 "null" 1 image attribution
Useful when data is stored in object names as well as values
gather_object %>% append_values_X
<- '{"2015": 5, "2016": 10}'
json %>% gather_object("year") %>% append_values_number("count")
json #> # A tbl_json: 2 x 4 tibble with a "JSON" attribute
#> ..JSON document.id year count
#> <chr> <int> <chr> <dbl>
#> 1 5 1 2015 5
#> 2 10 1 2016 10
The first step in using tidyjson is to get your data into a
tbl_json
object. All tidyjson functions automatically
coerce objects into tbl_json
if they are not already, so
you may be able to just start manipulating your data directly.
But if not, you can use as.tbl_json
directly. Here are
examples for common scenarios:
The simplest case is when your JSON data is already in R as a
character vector, like the worldbank
data:
%>% as.tbl_json
worldbank #> # A tbl_json: 500 x 2 tibble with a "JSON" attribute
#> ..JSON document.id
#> <chr> <int>
#> 1 "{\"_id\":{\"$oid\":..." 1
#> 2 "{\"_id\":{\"$oid\":..." 2
#> 3 "{\"_id\":{\"$oid\":..." 3
#> 4 "{\"_id\":{\"$oid\":..." 4
#> # … with 496 more rows
If this generates errors, then likely your JSON data is malformed:
<- '{"key": "value"'
bad_json %>% as.tbl_json
bad_json #> Error in `purrr::map()`:
#> ℹ In index: 1.
#> Caused by error:
#> ! parse error: premature EOF
#> {"key": "value"
#> (right here) ------^
tidyjson uses jsonlite::fromJSON
to parse the JSON, and
so will print out a useful error message.
Many APIs will return multiple documents in a single array, like the
issues
data from github.
%>% as.tbl_json
issues #> # A tbl_json: 1 x 2 tibble with a "JSON" attribute
#> ..JSON document.id
#> <chr> <int>
#> 1 "[{\"url\":\"https:..." 1
A single call to gather_array makes this data look like the
worldbank
data:
%>% as.tbl_json %>% gather_array
issues #> # A tbl_json: 30 x 3 tibble with a "JSON" attribute
#> ..JSON document.id array.index
#> <chr> <int> <int>
#> 1 "{\"url\":\"https:/..." 1 1
#> 2 "{\"url\":\"https:/..." 1 2
#> 3 "{\"url\":\"https:/..." 1 3
#> 4 "{\"url\":\"https:/..." 1 4
#> # … with 26 more rows
If your JSON is a list of character strings, you can use `purrr::flatten_chr’ to flatten it into a character vector and then proceed as usual:
library(purrr)
list('1', '2') %>% flatten_chr %>% as.tbl_json
#> # A tbl_json: 2 x 2 tibble with a "JSON" attribute
#> ..JSON document.id
#> <chr> <int>
#> 1 1 1
#> 2 2 2
If you extracted JSON from a table in a database into a data frame,
then likely you already have other columns in the data frame you would
like to retain. You can use then json.column
argument to
as.tbl_json
to specify which column contains the JSON of
interest:
<- tibble(id = 1:2, json = list('[1, 2]', '[3, 4]'))
df %>% as.tbl_json(json.column = "json")
df #> # A tbl_json: 2 x 2 tibble with a "JSON" attribute
#> ..JSON id
#> <chr> <int>
#> 1 "\"[1, 2]\"" 1
#> 2 "\"[3, 4]\"" 2
If your JSON is in a file, like in the jsonlines format, then you can use
read_json
to read the file into a tbl_json
object directly.
The tidyjson package comes with several JSON example datasets:
commits
: commit data for the dplyr repo from github
API
issues
: issue data for the dplyr repo from github
API
worldbank
: world bank funded projects from
jsonstudio
companies
: startup company data from
jsonstudio
Each dataset has some example tidyjson queries in
help(commits)
, help(issues)
,
help(worldbank)
and help(companies)
.