A common data manipulation need is: applying the same operation to a number of columns.
In the rquery
R
package we strongly
recommend using value-oriented (or standard evaluation) for this
task.
Here is what this looks like.
For our data lets build a simple data set similar to
mtcars
:
library(wrapr)
library(rquery)
have_rqdatatable <- requireNamespace('rqdatatable', quietly = TRUE)
example_data <- wrapr::build_frame(
"mpg" , "cyl", "disp", "car" |
21 , 6 , 160 , "Mazda RX4" |
21 , 6 , 160 , "Mazda RX4 Wag" |
22.8 , 4 , 108 , "Datsun 710" |
21.4 , 6 , 258 , "Hornet 4 Drive" |
18.7 , 8 , 360 , "Hornet Sportabout" |
18.1 , 6 , 225 , "Valiant" )
knitr::kable(example_data)
mpg | cyl | disp | car |
---|---|---|---|
21.0 | 6 | 160 | Mazda RX4 |
21.0 | 6 | 160 | Mazda RX4 Wag |
22.8 | 4 | 108 | Datsun 710 |
21.4 | 6 | 258 | Hornet 4 Drive |
18.7 | 8 | 360 | Hornet Sportabout |
18.1 | 6 | 225 | Valiant |
Now suppose for a number of columns we wish to perform a calculation, such centering it with respect to the grand average.
This is easily handled by first specifying the set of variables we wish to work with.
## [1] "mpg" "cyl" "disp"
Now we build up what we want as a name-vector of strings using the :=
named map builder.
## mpg cyl disp
## "mpg - mean(mpg)" "cyl - mean(cyl)" "disp - mean(disp)"
The idea is: the :=
operator fits into R
idiom by looking very much like a vectorized version of “names get
assigned expressions”.
These expressions can then be used in an rquery
pipeline
using the _se
-variant of extend()
:
extend_se()
.
## mk_td("example_data", c(
## "mpg",
## "cyl",
## "disp",
## "car")) %.>%
## extend(.,
## mpg := mpg - mean(mpg),
## cyl := cyl - mean(cyl),
## disp := disp - mean(disp))
And this operator pipeline is ready to use (assuming we have
rqdatatable
available):
mpg | cyl | disp | car |
---|---|---|---|
0.5 | 0 | -51.83333 | Mazda RX4 |
0.5 | 0 | -51.83333 | Mazda RX4 Wag |
2.3 | -2 | -103.83333 | Datsun 710 |
0.9 | 0 | 46.16667 | Hornet 4 Drive |
-1.8 | 2 | 148.16667 | Hornet Sportabout |
-2.4 | 0 | 13.16667 | Valiant |
The expression construction can also be done inside the
extend_se()
operator.
ops <- local_td(example_data) %.>%
extend_se(., vars := paste0(vars, ' - mean(', vars, ')'))
cat(format(ops))
## mk_td("example_data", c(
## "mpg",
## "cyl",
## "disp",
## "car")) %.>%
## extend(.,
## mpg := mpg - mean(mpg),
## cyl := cyl - mean(cyl),
## disp := disp - mean(disp))
Note: the above is only a notional example to demonstrate the
operations; for supervised machine learning we would probably use
base::scale()
, which saves the learned centering for later
re-use on new data.
The point is: we use standard R
tools to build up the
lists of names and operations. We are not restricted to any single
argument manipulation grammar.
For example we could build all interaction terms as follows.
combos <- t(combn(vars, 2))
interactions <-
paste0(combos[, 1], '_', combos[, 2]) :=
paste0(combos[, 1], ' * ', combos[, 2])
print(interactions)
## mpg_cyl mpg_disp cyl_disp
## "mpg * cyl" "mpg * disp" "cyl * disp"
## mk_td("example_data", c(
## "mpg",
## "cyl",
## "disp",
## "car")) %.>%
## extend(.,
## mpg_cyl := mpg * cyl,
## mpg_disp := mpg * disp,
## cyl_disp := cyl * disp)
It is a critical advantage to work with sets of variables as standard
values. In this case what we can do is limited only by the power of
R
itself.
Note: we also supply an alias for :=
as
%:=%
for those that don’t want to confuse this assignment
with how the symbol is used in data.table
. Take care that
:=
has the precedence-level of an assignment and
%:=%
has the precedence-level of a user defined
operator.
As, as always, our queries can be used on data.
mpg | cyl | disp | car | mpg_cyl | mpg_disp | cyl_disp |
---|---|---|---|---|---|---|
21.0 | 6 | 160 | Mazda RX4 | 126.0 | 3360.0 | 960 |
21.0 | 6 | 160 | Mazda RX4 Wag | 126.0 | 3360.0 | 960 |
22.8 | 4 | 108 | Datsun 710 | 91.2 | 2462.4 | 432 |
21.4 | 6 | 258 | Hornet 4 Drive | 128.4 | 5521.2 | 1548 |
18.7 | 8 | 360 | Hornet Sportabout | 149.6 | 6732.0 | 2880 |
18.1 | 6 | 225 | Valiant | 108.6 | 4072.5 | 1350 |
Or even in a database.
have_db <- requireNamespace("DBI", quietly = TRUE) &&
requireNamespace("RSQLite", quietly = TRUE)
if(have_db) {
raw_connection <- DBI::dbConnect(RSQLite::SQLite(),
":memory:")
RSQLite::initExtension(raw_connection)
db <- rquery_db_info(
connection = raw_connection,
is_dbi = TRUE,
connection_options = rq_connection_tests(raw_connection))
rq_copy_to(db, 'example_data',
example_data,
temporary = TRUE,
overwrite = TRUE)
sql <- to_sql(ops, db)
cat(format(sql))
}
## SELECT
## `mpg`,
## `cyl`,
## `disp`,
## `car`,
## `mpg` * `cyl` AS `mpg_cyl`,
## `mpg` * `disp` AS `mpg_disp`,
## `cyl` * `disp` AS `cyl_disp`
## FROM (
## SELECT
## `mpg`,
## `cyl`,
## `disp`,
## `car`
## FROM
## `example_data`
## ) tsql_73730651641561315786_0000000000
if(have_db) {
res_table <- materialize(db, ops)
DBI::dbReadTable(raw_connection, res_table$table_name) %.>%
knitr::kable(.)
}
mpg | cyl | disp | car | mpg_cyl | mpg_disp | cyl_disp |
---|---|---|---|---|---|---|
21.0 | 6 | 160 | Mazda RX4 | 126.0 | 3360.0 | 960 |
21.0 | 6 | 160 | Mazda RX4 Wag | 126.0 | 3360.0 | 960 |
22.8 | 4 | 108 | Datsun 710 | 91.2 | 2462.4 | 432 |
21.4 | 6 | 258 | Hornet 4 Drive | 128.4 | 5521.2 | 1548 |
18.7 | 8 | 360 | Hornet Sportabout | 149.6 | 6732.0 | 2880 |
18.1 | 6 | 225 | Valiant | 108.6 | 4072.5 | 1350 |