prt
The prt
object introduced by this package is intended to
represent tabular data stored as one or more fst
files. This is in
similar spirit as disk.frame
, but is much
less ambitious in scope and therefore much simpler in implementation.
While the disk.frame
package attempts to provide a dplyr
compliant API
and offers parallel computation via the future
package, the intended use-case for prt
objects is the
situation where only a (small) subset of rows of the (large) tabular
dataset are of interest for analysis at once. This subset can be
specified using the base generic function subset()
and the
selected data is read into memory as a data.table
object.
Subsequent data operations and analysis is then preformed on this
data.table
representation. For this reason, partition-level
parallelism is not in-scope for prt
as fst
already provides an efficient shared memory parallel implementation for
decompression. Furthermore the much more complex multi-function
non-standard evaluation API provided by dplyr
was forgone
in favor of the very simple one-function approach presented by the base
R S3 generic function subset()
.
For the purpose of illustration of some prt
features and
particularities, we instantiate a dataset as data.table
object and create a temporary directory which will contain the
file-based data back ends.
tmp <- tempfile()
dir.create((tmp))
dat <- data.table::setDT(nycflights13::flights)
print(dat)
#> year month day dep_time sched_dep_time dep_delay arr_time
#> 1: 2013 1 1 517 515 2 830
#> 2: 2013 1 1 533 529 4 850
#> 3: 2013 1 1 542 540 2 923
#> 4: 2013 1 1 544 545 -1 1004
#> 5: 2013 1 1 554 600 -6 812
#> ---
#> 336772: 2013 9 30 NA 1455 NA NA
#> 336773: 2013 9 30 NA 2200 NA NA
#> 336774: 2013 9 30 NA 1210 NA NA
#> 336775: 2013 9 30 NA 1159 NA NA
#> 336776: 2013 9 30 NA 840 NA NA
#> sched_arr_time arr_delay carrier flight tailnum origin dest air_time
#> 1: 819 11 UA 1545 N14228 EWR IAH 227
#> 2: 830 20 UA 1714 N24211 LGA IAH 227
#> 3: 850 33 AA 1141 N619AA JFK MIA 160
#> 4: 1022 -18 B6 725 N804JB JFK BQN 183
#> 5: 837 -25 DL 461 N668DN LGA ATL 116
#> ---
#> 336772: 1634 NA 9E 3393 <NA> JFK DCA NA
#> 336773: 2312 NA 9E 3525 <NA> LGA SYR NA
#> 336774: 1330 NA MQ 3461 N535MQ LGA BNA NA
#> 336775: 1344 NA MQ 3572 N511MQ LGA CLE NA
#> 336776: 1020 NA MQ 3531 N839MQ LGA RDU NA
#> distance hour minute time_hour
#> 1: 1400 5 15 2013-01-01 05:00:00
#> 2: 1416 5 29 2013-01-01 05:00:00
#> 3: 1089 5 40 2013-01-01 05:00:00
#> 4: 1576 5 45 2013-01-01 05:00:00
#> 5: 762 6 0 2013-01-01 06:00:00
#> ---
#> 336772: 213 14 55 2013-09-30 14:00:00
#> 336773: 198 22 0 2013-09-30 22:00:00
#> 336774: 764 12 10 2013-09-30 12:00:00
#> 336775: 419 11 59 2013-09-30 11:00:00
#> 336776: 431 8 40 2013-09-30 08:00:00
Creating a prt
object consisting of 2 partitions can for
example be done as
flights <- as_prt(dat, n_chunks = 2L, dir = tempfile(tmpdir = tmp))
print(flights)
#> # A prt: 336,776 × 19
#> # Partitioning: [168,388, 168,388] rows
#> year month day dep_time sched_dep_time dep_delay arr_time
#> <int> <int> <int> <int> <int> <dbl> <int>
#> 1 2013 1 1 517 515 2 830
#> 2 2013 1 1 533 529 4 850
#> 3 2013 1 1 542 540 2 923
#> 4 2013 1 1 544 545 -1 1004
#> 5 2013 1 1 554 600 -6 812
#> …
#> 336,772 2013 9 30 NA 1455 NA NA
#> 336,773 2013 9 30 NA 2200 NA NA
#> 336,774 2013 9 30 NA 1210 NA NA
#> 336,775 2013 9 30 NA 1159 NA NA
#> 336,776 2013 9 30 NA 840 NA NA
#> # ℹ 336,771 more rows
#> # ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#> # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#> # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
This simply splits rows of dat
into 2 equally sized
groups, preserving the original row ordering and writes each group to
its own fst
file. Depending on the types of queries that
are most frequently run against the data, this naive partitioning might
not be optimal. While fst
does provide random row access,
row selection is only possible via index ranges. Consequently, for each
partition all rows that fall into the range between the minimum and the
maximum required index will be read into memory and superfluous rows are
discarded. If for example the data were to be most frequently accessed
by airline, the resulting data loads would be more efficient if the data
was already sorted by carrier codes.
dat <- data.table::setorderv(dat, "carrier")
grp <- cumsum(table(dat$carrier)) / nrow(dat) < 0.5
dat <- split(dat, grp[dat$carrier])
by_carrier <- as_prt(dat, dir = tempfile(tmpdir = tmp))
by_carrier
#> # A prt: 336,776 × 19
#> # Partitioning: [182,128, 154,648] rows
#> year month day dep_time sched_dep_time dep_delay arr_time
#> <int> <int> <int> <int> <int> <dbl> <int>
#> 1 2013 1 1 557 600 -3 709
#> 2 2013 1 1 624 630 -6 909
#> 3 2013 1 1 632 608 24 740
#> 4 2013 1 1 809 815 -6 1043
#> 5 2013 1 1 811 815 -4 1006
#> …
#> 336,772 2013 9 30 1955 2000 -5 2219
#> 336,773 2013 9 30 1956 1825 91 2208
#> 336,774 2013 9 30 2041 2045 -4 2147
#> 336,775 2013 9 30 2050 2045 5 20
#> 336,776 2013 9 30 2121 2100 21 2349
#> # ℹ 336,771 more rows
#> # ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
#> # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#> # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
The behavior of subsetting operations on prt
objects is
modeled after that of tibble
objects.
Columns can be extracted using [[
, $
(with
partial matching being disallowed), or by selecting a single column with
[
and passing TRUE
as drop
argument.
str(flights[[1L]])
#> int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
identical(flights[["year"]], flights$year)
#> [1] TRUE
identical(flights[["year"]], flights[, "year", drop = TRUE])
#> [1] TRUE
str(flights$yea)
#> Warning: Unknown or uninitialised column: `yea`.
#> NULL
If the object resulting from the subsetting operation is
two-dimensional, it is returned as data.table
object. Apart
form this distinction, again the intent is to replicate
tibble
behavior. One way in which tibble
and
data.frame
do not behave in the same way is in default
coercion to lower dimensions. The default value for the
drop
argument of [.data.frame
is
FALSE
if only one row is returned but changes to
TRUE
where the result is a single column, while it is
always FALSE
for tibble
s. A difference in
behavior between data.table
and tibble
(any by
extension prt
) is a missing j
argument: in the
tibble
(and in the data.frame
) implementation,
the i
argument is then interpreted as column specification,
whereas for data.frame
s, i
remains a row
selection.
datasets::mtcars[, "mpg"]
#> [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
#> [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
#> [31] 15.0 21.4
flights[, "dep_time"]
#> dep_time
#> 1: 517
#> 2: 533
#> 3: 542
#> 4: 544
#> 5: 554
#> ---
#> 336772: NA
#> 336773: NA
#> 336774: NA
#> 336775: NA
#> 336776: NA
jan_dt <- flights[flights$month == 1L, ]
jan_dt[1L]
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> 1: 2013 1 1 517 515 2 830 819
#> arr_delay carrier flight tailnum origin dest air_time distance hour minute
#> 1: 11 UA 1545 N14228 EWR IAH 227 1400 5 15
#> time_hour
#> 1: 2013-01-01 05:00:00
flights[1L]
#> year
#> 1: 2013
#> 2: 2013
#> 3: 2013
#> 4: 2013
#> 5: 2013
#> ---
#> 336772: 2013
#> 336773: 2013
#> 336774: 2013
#> 336775: 2013
#> 336776: 2013
Deviation of prt
subsetting behavior from that of
tibble
objects is most likely unintentional and bug reports
are much appreciated as github issues.
The main feature of prt
is the ability to load only a
subset of a much larger tabular dataset and a useful function for
selecting rows and columns of a table in a concise manner is the base R
S3 generic function subset()
. As such, a prt
specific method is provided by this package. Using this functionality,
above query for selecting all flights in January can be written as
follows
To illustrate the importance of row-ordering consider the following
small benchmark example: we subset on the carrier
column,
selecting only American Airlines flights. In one prt
object, rows are ordered by carrier whereas in the other they are not,
which will cause rows that are interleaved with those corresponding to
AA
flights to be read and discarded.
bench::mark(
subset(flights, carrier == "AA"),
subset(by_carrier, carrier == "AA")
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:> <bch:> <dbl> <bch:byt> <dbl>
#> 1 subset(flights, carrier == "AA") 69.7ms 74ms 13.5 56.3MB 47.3
#> 2 subset(by_carrier, carrier == "AA") 18.1ms 18.6ms 52.5 17.7MB 26.3
A common problem with non-standard evaluation (NSE) is potential
ambiguity. Symbols in expressions passed as subset
and
select
arguments are first resolved in the context of the
data, followed by the environment the expression was created in (the quosure
environment). Expressions are evaluated using
rlang::eval_tidy()
, which makes possible the distinction
between symbols referring to the data mask from those referring to the
expression environment. This can either be achieved using the
.data
and .env
pronouns or by forcing parts
of the expression.
month <- 1L
subset(flights, month == month, 1L:7L)
#> year month day dep_time sched_dep_time dep_delay arr_time
#> 1: 2013 1 1 517 515 2 830
#> 2: 2013 1 1 533 529 4 850
#> 3: 2013 1 1 542 540 2 923
#> 4: 2013 1 1 544 545 -1 1004
#> 5: 2013 1 1 554 600 -6 812
#> ---
#> 336772: 2013 9 30 NA 1455 NA NA
#> 336773: 2013 9 30 NA 2200 NA NA
#> 336774: 2013 9 30 NA 1210 NA NA
#> 336775: 2013 9 30 NA 1159 NA NA
#> 336776: 2013 9 30 NA 840 NA NA
identical(jan_dt, subset(flights, month == !!month))
#> [1] TRUE
identical(jan_dt, subset(flights, .env$month == .data$month))
#> [1] TRUE
While in the above example it is fairly clear what is happening and
it should come as no surprise that the symbol month
cannot
simultaneously refer to a value in the calling environment and the name
of a column in the data mask, a more subtle issue is considered in the
following example. The environment which takes precedence for evaluating
the select
argument is a named list of column indices. This
makes it possible for example to specify a range of columns as (and
makes the behavior of subset()
being applied to a
prt
object consistent with that of a
data.frame
).
subset(flights, select = year:day)
#> year month day
#> 1: 2013 1 1
#> 2: 2013 1 1
#> 3: 2013 1 1
#> 4: 2013 1 1
#> 5: 2013 1 1
#> ---
#> 336772: 2013 9 30
#> 336773: 2013 9 30
#> 336774: 2013 9 30
#> 336775: 2013 9 30
#> 336776: 2013 9 30
Now recall that symbols that cannot be resolved in this data
environment will be looked up in the calling environment. Therefore the
following effect, while potentially unintuitive, can easily be
explained. Again, the .data
and .env
pronouns
can be used to resolve potential issues.
sched_dep_time <- "dep_time"
colnames(subset(flights, select = sched_dep_time))
#> [1] "sched_dep_time"
actual_dep_time <- "dep_time"
colnames(subset(flights, select = actual_dep_time))
#> [1] "dep_time"
colnames(subset(flights, select = .env$sched_dep_time))
#> [1] "dep_time"
colnames(subset(flights, select = .env$actual_dep_time))
#> [1] "dep_time"
colnames(subset(flights, select = .data$sched_dep_time))
#> [1] "sched_dep_time"
colnames(subset(flights, select = .data$actual_dep_time))
#> Error in `.data$actual_dep_time`:
#> ! Column `actual_dep_time` not found in `.data`.
By default, subset
expressions have to be evaluated on
the entire dataset at once in order to be consistent with base R
subset()
for data.frames
. Often times this is
inefficient and this behavior can be modified using the
part_saft
argument. Consider the following query which
selects all rows where the arrival delay is larger than the mean arrival
delay. Obviously an expression like this can yield different results
depending on whether it is evaluated on individual partitions or over
the entire data. Other queries such as the one above where we threshold
on a fixed value, however can safely be evaluated on partitions
individually.
is_true <- function(x) !is.na(x) & x
expr <- quote(is_true(arr_delay > mean(arr_delay, na.rm = TRUE)))
nrow(subset_quo(flights, expr, part_safe = FALSE))
#> [1] 105827
nrow(subset_quo(flights, expr, part_safe = TRUE))
#> [1] 104752
As an aside, in addition to subset()
, which creates
quosures from the expressions passed as subset
and
select
, (using rlang::enquo()
) the function
subset_quo()
which operates on already quoted expressions
is exported as well. Thanks to the double curly brace forwarding
operator introduced in rlang 0.4.0, this escape-hatch mechanism however
is of lesser importance.
col_safe_subset <- function(x, expr, cols) {
stopifnot(is_prt(x), is.character(cols))
subset(x, {{ expr }}, .env$cols)
}
air_time <- c("dep_time", "arr_time")
col_safe_subset(flights, month == 1L, air_time)
#> dep_time arr_time
#> 1: 517 830
#> 2: 533 850
#> 3: 542 923
#> 4: 544 1004
#> 5: 554 812
#> ---
#> 27000: NA NA
#> 27001: NA NA
#> 27002: NA NA
#> 27003: NA NA
#> 27004: NA NA
In addition to subsetting, concise and informative printing is
another area which effort ha been put into. Inspired by (and liberally
borrowing code from) tibble
, the print()
method of fst
objects adds the data.table
approach of showing both the first and last n
rows of the
table in question. This functionality can be used by other classes used
to represent tabular data, as the function trunc_dt()
driving this is exported. All that is required are implementations of
the base S3 generic functions dim()
, head()
,
tail()
and of course print()
.
new_tbl <- function(...) structure(list(...), class = "my_tbl")
dim.my_tbl <- function(x) {
rows <- unique(lengths(x))
stopifnot(length(rows) == 1L)
c(rows, length(x))
}
head.my_tbl <- function(x, n = 6L, ...) {
as.data.frame(lapply(x, `[`, seq_len(n)))
}
tail.my_tbl <- function(x, n = 6L, ...) {
as.data.frame(lapply(x, `[`, seq(nrow(x) - n + 1L, nrow(x))))
}
print.my_tbl <- function(x, ..., n = NULL, width = NULL,
max_extra_cols = NULL) {
out <- format_dt(x, n = n, width = width, max_extra_cols = max_extra_cols)
out <- paste0(out, "\n")
cat(out, sep = "")
invisible(x)
}
new_tbl(a = letters, b = 1:26)
#> # Description: my_tbl[,2]
#> a b
#> <chr> <int>
#> 1 a 1
#> 2 b 2
#> 3 c 3
#> 4 d 4
#> 5 e 5
#> …
#> 22 v 22
#> 23 w 23
#> 24 x 24
#> 25 y 25
#> 26 z 26
#> # ℹ 21 more rows
Similarly, the function glimpse_dt()
which can be used
to implement a class-specific function for the tibble
S3
generic tibble::glimpse()
. In order to customize the text
description of the object a class-specific function for the
tibble
S3 generic tibble::tbl_sum()
can be
provided.