oshka
We will implement simplified versions of dplyr
and data.table
to illustrate how to write programmable NSE functions with oshka
. The implementations are intentionally limited in functionality, robustness, and speed for the sake of simplicity.
dplyr
The interface is as follows:
group_r <- function(x, ...) {...} # similar to dplyr::group_by
filter_r <- function(x, subset) {...} # similar to dplyr::filter
summarize_r <- function(x, ...) {...} # similar to dplyr::summarise
`%$%` <- function(x, y) {...} # similar to the magrittr pipe
Our functions mimic the corresponding dplyr
ones:
CO2 %$% # built-in dataset
filter_r(grepl("[12]", Plant)) %$%
group_r(Type, Treatment) %$%
summarize_r(mean(conc), mean(uptake))
Type Treatment mean.conc. mean.uptake.
1 Quebec nonchilled 435 34.19286
2 Mississippi nonchilled 435 26.87143
3 Quebec chilled 435 31.33571
4 Mississippi chilled 435 15.07143
Most of the implementation is not directly related to oshka
NSE, but we will go over summarize_r
to highlight how those parts integrate with the rest. summarize_r
is just a forwarding function:
summarize_r <- function(x, ...)
eval(bquote(.(summarize_r_l)(.(x), .(substitute(list(...))))), parent.frame())
We use the eval
/bquote
pattern to forward NSE
arguments. We retrieve summarize_r_l
from the current function frame with .()
, because there is no guarantee we would find it on the search path starting from the parent frame. In this case it happens to be available, but it would not be if these functions were in a package.
We present summarize_r_l
in full for reference, but feel free to skip as we highlight the interesting bits next:
summarize_r_l <- function(x, els) {
frm <- parent.frame()
exps.sub <- expand(substitute(els), x, frm)
if(is.null(exps.sub)) x else {
# compute groups and splits
grps <- make_grps(x) # see appendix
splits <- lapply(grps, eval, x, frm)
dat.split <- split(x, splits, drop=TRUE)
grp.split <- if(!is.null(grps)) lapply(splits, split, splits, drop=TRUE)
# aggregate
res.list <- lapply(
dot_list(exps.sub), # see appendix
function(exp) lapply(dat.split, eval, expr=exp, enclos=frm)
)
list_to_df(res.list, grp.split, splits) # see appendix
}
}
The only oshka
specific line is the second one:
exps.sub <- expand(substitute(els), x, frm)
els
is the language captured and forwarded by summarize_r
. We run expand
on that language with our data x
as the environment and the parent frame as the enclosure. We then compute the groups:
grps <- make_grps(x) # see appendix
splits <- lapply(grps, eval, x, frm)
make_grps
extracts the grouping expressions generating by group_r
. These have already been substituted so we evaluate each one with x
as the environment and the parent frame as the enclosure. We use this to split our data into groups:
dat.split <- split(x, splits, drop=TRUE)
Finally we can evaluate our expand
ed expressions within each of the groups:
# aggregate
res.list <- lapply(
dot_list(exps.sub), # see appendix
function(exp) lapply(dat.split, eval, expr=exp, enclos=frm)
)
list_to_df(res.list, grp.split, splits) # see appendix
dot.list
turns exps.sub
into a list of expressions. Each expression is then evaluated with each data chunk as the environment and the parent frame as the enclosure. Finally list_to_df
turns our lists of vectors into a data frame.
You can see the rest of the implementation in the appendix.
That single expand
line enables a programmable NSE:
f.exp <- quote(grepl("[12]", Plant))
s.exp <- quote(mean(uptake))
CO2 %$%
filter_r(f.exp & conc > 500) %$%
group_r(Type, Treatment) %$%
summarize_r(round(s.exp))
Type Treatment round.mean.uptake..
1 Quebec nonchilled 41
2 Mississippi nonchilled 33
3 Quebec chilled 38
4 Mississippi chilled 18
Because %$%
uses expand
you can even do the following:
f.exp.b <- quote(filter_r(grepl("[12]", Plant) & conc > 500))
g.exp.b <- quote(group_r(Type, Treatment))
s.exp.b <- quote(summarize_r(mean(conc), mean(uptake)))
exp <- quote(f.exp.b %$% g.exp.b %$% s.exp.b)
CO2 %$% exp
Type Treatment mean.conc. mean.uptake.
1 Quebec nonchilled 837.5 41.150
2 Mississippi nonchilled 837.5 32.625
3 Quebec chilled 837.5 38.500
4 Mississippi chilled 837.5 18.050
data.table
We wish to re-use our ersatz dplyr
functions to create a data.table
-like interface:
as.super_df <- function(x) {
class(x) <- c("super_df", class(x))
x
}
"[.super_df" <- function(x, i=NULL, j=NULL, by=NULL) {
frm <- parent.frame() # as per docs, safer to do this here
x <- as.data.frame(x)
x <- eval(bquote(.(filter_r)( .(x), .(substitute(i)))), frm)
x <- eval(bquote(.(group_r_l)( .(x), .(substitute(by)))), frm)
x <- eval(bquote(.(summarize_r_l)(.(x), .(substitute(j)))), frm)
as.super_df(x)
}
Again, we use the eval
/bquote
pattern to forward the NSE arguments to our NSE functions filter_r
, group_r_l
, and summarize_r_l
. The pattern is not trivial, but it only took six lines of code to transmogrify our faux-dplyr
into a faux-data.table
.
After we add the super_df
class to our data we can start using it with data.table
semantics, but with programmable NSE:
co2 <- as.super_df(CO2)
co2[f.exp, s.exp, by=Type]
Type mean.uptake.
1 Quebec 32.76429
2 Mississippi 20.97143
exp.a <- quote(max(conc))
exp.b <- quote(min(conc))
co2[f.exp, list(exp.a, exp.b), by=list(Type, Treatment)][1:3,]
Type Treatment max.conc. min.conc.
1 Quebec nonchilled 1000 95
2 Mississippi nonchilled 1000 95
3 Quebec chilled 1000 95
exp.c <- quote(list(exp.a, exp.b))
exp.d <- quote(list(Type, Treatment))
co2[f.exp, exp.c, by=exp.d][1:3,]
Type Treatment max.conc. min.conc.
1 Quebec nonchilled 1000 95
2 Mississippi nonchilled 1000 95
3 Quebec chilled 1000 95
Despite the forwarding layers the symbols resolve as expected in complex circumstances:
exps <- quote(list(stop("boo"), stop("ya"))) # don't use this
g.exp <- quote(Whatever) # nor this
local({
summarize_r_l <- function(x, y) stop("boom") # nor this
max.upt <- quote(max(uptake)) # use this
min.upt <- quote(min(uptake)) # and this
exps <- list(max.upt, min.upt)
g.exp <- quote(Treatment) # and this
lapply(exps, function(y) co2[f.exp, y, by=g.exp])
})
[[1]]
Treatment max.uptake.
1 nonchilled 44.3
2 chilled 42.4
[[2]]
Treatment min.uptake.
1 nonchilled 10.6
2 chilled 7.7
And we can even nest our dplyr
and data.table
for an unholy abomination:
exp <- quote(data.frame(upt=uptake) %$% summarize_r(new.upt=upt * 1.2))
local({
exps <- list(quote(sum(exp$new.upt)), quote(sum(uptake)))
g.exp <- quote(Treatment)
lapply(exps, function(y) co2[f.exp, y, by=g.exp])
})
[[1]]
Treatment V1
1 nonchilled 1025.88
2 chilled 779.64
[[2]]
Treatment sum.uptake.
1 nonchilled 854.9
2 chilled 649.7
Ersatz dplyr
implementation:
## - Summarize -----------------------------------------------------------------
summarize_r <- function(x, ...)
eval(bquote(.(summarize_r_l)(.(x), .(substitute(list(...))))), parent.frame())
summarize_r_l <- function(x, els) {
frm <- parent.frame()
exps.sub <- expand(substitute(els), x, frm)
if(is.null(exps.sub)) x else {
# compute groups and splits
grps <- make_grps(x) # see appendix
splits <- lapply(grps, eval, x, frm)
dat.split <- split(x, splits, drop=TRUE)
grp.split <- if(!is.null(grps)) lapply(splits, split, splits, drop=TRUE)
# aggregate
res.list <- lapply(
dot_list(exps.sub), # see appendix
function(exp) lapply(dat.split, eval, expr=exp, enclos=frm)
)
list_to_df(res.list, grp.split, splits) # see appendix
}
}
## - Grouping ------------------------------------------------------------------
group_r <- function(x, ...)
eval(bquote(.(group_r_l)(.(x), .(substitute(list(...))))), parent.frame())
group_r_l <- function(x, els) {
exps.sub <- expand(substitute(els), x, parent.frame())
if(is.null(exps.sub)) x else {
if(!is.call(exps.sub) || exps.sub[[1L]] != quote(list))
exps.sub <- call("list", exps.sub)
structure(x, .GRP=dot_list(exps.sub, "G"))
} }
## - Filtering -----------------------------------------------------------------
filter_r <- function(x, subset) {
sub.exp <- expand(substitute(subset), x, parent.frame())
sub.val <- eval(sub.exp, x, parent.frame())
as.data.frame(
if(!is.null(sub.val)) {
as.data.frame(x)[
if(is.numeric(sub.val)) sub.val else !is.na(sub.val) & sub.val,
]
} else x
)
}
## - Pipe ----------------------------------------------------------------------
`%$%` <- function(x, y) {
x.sub <- expand(substitute(x), parent.frame())
y.sub <- expand(substitute(y), parent.frame())
y.list <- if(!is.call(y.sub)) list(y.sub) else as.list(y.sub)
eval(sub_dat(y.sub, x), parent.frame())
}
## - Helper Funs ---------------------------------------------------------------
# Takes result of `substitute(list(...))` and returns a list of quoted language
# object with nice names.
dot_list <- function(x, pre="V") {
if(!is.call(x) || x[[1L]] != quote(list)) x <- call("list", x)
dots <- tail(as.list(x), -1L)
if(is.null(names(dots))) names(dots) <- character(length(dots))
for(i in seq_along(dots)[!nzchar(names(dots))])
names(dots)[i] <- if(
is.language(dots[[i]]) && nchar(deparse(dots[[i]])[[1]]) < 20
) deparse(dots[[i]])[[1]] else sprintf("%s%d", pre, i)
dots
}
# Used by the `%$%` pipe operator to find the correct point in the RHS to
# substitute the forwarded argument in
sub_dat <- function(z, dat) {
if(is.call(z)) {
if(z[[1]] == as.name('%$%')) z[[2]] <- sub_dat(z[[2]], dat)
else {
z.list <- as.list(z)
z <- as.call(c(z.list[1], list(dat), tail(z.list, -1)))
} }
z
}
# convert the ".GRP" attribute into usable form
make_grps <- function(x)
if(is.null(attr(x, ".GRP")) || !length(attr(x, ".GRP")))
list(rep_len(1, nrow(x))) else attr(x, ".GRP")
# Takes result list and makes into a data.frame by recycling elements so they
# are the same length a longest, and also adds in cols for the group vars
list_to_df <- function(dat, grp, splits) {
lens <- do.call(pmax, lapply(dat, lengths, integer(length(splits))))
as.data.frame(
lapply(c(grp, dat), function(x) unname(unlist(Map(rep_len, x, lens))))
)
}