Working with Expressions: Helper Functions for Advanced Data Manipulation

Introduction

Three blocks accept expressions: expression filter, mutate, and summarize.

Helper functions make expressions more powerful by applying operations to multiple columns at once. This vignette covers:

See also: dplyr column-wise operations and dplyr window functions.

Expression Blocks

Expression filter block: Logical expressions to keep rows

Mutate block: Create or modify columns

Summarize block: Aggregate data

Useful Functions for Mutate

Arithmetic

Logs and exponentials

Offsets

Cumulative aggregates

Ranking

Logical comparisons

Examples

# Calculate percentage
across(c(hp, wt), \(x) x / sum(x) * 100)

# Lag differences
mpg - lag(mpg)

# Cumulative sums by group (use by parameter)
cumsum(sales)

# Rank values
min_rank(desc(hp))

See dplyr window functions for more.

Useful Functions for Summarize

Center

Spread

Range

Position

Count

Sums and products

Examples

# Basic statistics
across(where(is.numeric), list(mean = mean, sd = sd))

# Count by group (use by parameter)
n()

# Multiple stats
list(
  avg = mean(hp),
  min = min(hp),
  max = max(hp),
  count = n()
)

For handling missing values, add na.rm = TRUE:

mean(hp, na.rm = TRUE)
across(where(is.numeric), \(x) mean(x, na.rm = TRUE))

See dplyr summarise for more.

Column Selection Helpers

Select columns by name pattern or type (used inside across(), if_any(), if_all()):

Combine selections:

c(starts_with("Sepal"), ends_with("Width"))
where(is.numeric) & starts_with("x")

The across() Function

Apply the same operation to multiple columns.

Syntax: across(.cols, .fns, .names = NULL)

In Mutate Block

Transform multiple columns:

# Round all numeric columns
across(where(is.numeric), round)

# Scale to 0-1 range
across(c(mpg, hp, wt), \(x) x / max(x))

# Log transform with custom names
across(where(is.numeric), \(x) log(x + 1), .names = "log_{.col}")

Use \(x) to create anonymous functions where x represents the current column.

In Summarize Block

Calculate statistics for multiple columns:

# Mean of all numeric columns
across(where(is.numeric), mean)

# Multiple functions
across(c(hp, wt), list(mean = mean, sd = sd))

# With grouping (use by parameter)
across(everything(), n_distinct)

Custom names

# Default: col_fn
across(c(mpg, hp), list(mean = mean, sd = sd))
# Result: mpg_mean, mpg_sd, hp_mean, hp_sd

# Custom: fn.col
across(c(mpg, hp), list(mean = mean, sd = sd), .names = "{.fn}.{.col}")
# Result: mean.mpg, sd.mpg, mean.hp, sd.hp

The pick() Function

Select columns as a data frame for custom functions.

Syntax: pick(.cols)

Use in summarize block with custom functions that need a data frame:

# With custom function
calc_stats(pick(everything()))

# Select specific columns
my_function(pick(c(hp, wt, mpg)))

Relationship to unpack parameter

When your expression returns a data frame, use the unpack option:

Example:

# Custom function that returns data frame
calc_stats <- function(df) {
  data.frame(mean_x = mean(df$x), sd_x = sd(df$x))
}

# In summarize block with unpack = TRUE:
calc_stats(pick(everything()))
# Result: mean_x and sd_x as separate columns

Filter Helpers: if_any() and if_all()

Check conditions across multiple columns in the expression filter block.

if_any(): TRUE when condition is true for at least one column

# Rows with any NA
if_any(everything(), is.na)

# Any numeric column > 100
if_any(where(is.numeric), \(x) x > 100)

# Search across text columns
if_any(where(is.character), \(x) x == "setosa")

if_all(): TRUE when condition is true for all columns

# All numeric columns positive
if_all(where(is.numeric), \(x) x > 0)

# All width measurements > 2
if_all(ends_with("Width"), \(x) x > 2)

# No missing values
if_all(everything(), \(x) !is.na(x))

Common Patterns

Mutate Block

# Round numeric columns
across(where(is.numeric), round)

# Scale to max
across(c(mpg, hp, wt), \(x) x / max(x))

# Uppercase text
across(where(is.character), toupper)

# Log transform
across(where(is.numeric), \(x) log(x + 1), .names = "log_{.col}")

Summarize Block

# Means (add grouping with by parameter)
across(where(is.numeric), mean)

# Multiple statistics
across(c(hp, wt), list(mean = mean, sd = sd, min = min, max = max))

# Count non-missing
across(everything(), \(x) sum(!is.na(x)))

Expression Filter Block

# Complete rows only
if_all(everything(), \(x) !is.na(x))

# Any negative
if_any(where(is.numeric), \(x) x < 0)

# All widths > threshold
if_all(ends_with("Width"), \(x) x > 2.5)

# Search text columns
if_any(where(is.character), \(x) grepl("pattern", x))

Tips

Start simple: Test on one column, then use across()

Check preview: Verify results in the block preview

Unpack option: In summarize, enable unpack when expressions return data frames

Combine helpers: Use & and | to combine selections

across(where(is.numeric) & starts_with("Sepal"), mean)

Function syntax:

Missing values: Add na.rm = TRUE to aggregation functions

across(where(is.numeric), \(x) mean(x, na.rm = TRUE))

Learn More

For comprehensive documentation on column-wise operations, see:

These resources provide additional examples, edge cases, and advanced techniques for working with expressions in dplyr.