Three blocks accept expressions: expression filter, mutate, and summarize.
Helper functions make expressions more powerful by applying operations to multiple columns at once. This vignette covers:
across(), where(),
if_any(), if_all(), pick()See also: dplyr column-wise operations and dplyr window functions.
Expression filter block: Logical expressions to keep rows
>, <,
==, !=, >=,
<=& (AND) or | (OR)if_any(), if_all()Mutate block: Create or modify columns
across() to transform multiple columnsSummarize block: Aggregate data
mean(), sum(),
n()across() for multiple columns,
pick() for custom functions+, -, *, /,
^ - basic operations%/% - integer division%% - remainder (modulo)log(), log2(), log10() -
logarithmsexp() - exponentialsqrt() - square rootlag() - previous valueslead() - next valuescumsum() - cumulative sumcummean() - cumulative meancummin(), cummax() - cumulative
min/maxcumprod() - cumulative productrow_number() - sequential ranks (1, 2, 3, 4)min_rank() - ranks with ties (1, 2, 2, 4)dense_rank() - ranks without gaps (1, 2, 2, 3)percent_rank() - percentile (0 to 1)ntile(n) - divide into n bins<, >, <=,
>=, ==, !=& (and), | (or), !
(not)%in% - test membership# Calculate percentage
across(c(hp, wt), \(x) x / sum(x) * 100)
# Lag differences
mpg - lag(mpg)
# Cumulative sums by group (use by parameter)
cumsum(sales)
# Rank values
min_rank(desc(hp))
See dplyr window functions for more.
mean() - averagemedian() - middle valuesd() - standard deviationIQR() - interquartile rangemad() - median absolute deviationmin(), max() - minimum and maximumquantile(x, 0.25) - percentilesfirst(), last(), nth(x, 2) -
positional valuesn() - count rowsn_distinct() - count unique valuessum(!is.na(x)) - count non-missingsum() - totalprod() - product# Basic statistics
across(where(is.numeric), list(mean = mean, sd = sd))
# Count by group (use by parameter)
n()
# Multiple stats
list(
avg = mean(hp),
min = min(hp),
max = max(hp),
count = n()
)
For handling missing values, add na.rm = TRUE:
mean(hp, na.rm = TRUE)
across(where(is.numeric), \(x) mean(x, na.rm = TRUE))
See dplyr summarise for more.
Select columns by name pattern or type (used inside
across(), if_any(),
if_all()):
everything() - all columnsstarts_with("prefix") - columns starting with
prefixends_with("suffix") - columns ending with suffixcontains("text") - columns containing textwhere(is.numeric) - columns by type
(is.character, is.factor)c(col1, col2) - specific columnsCombine selections:
c(starts_with("Sepal"), ends_with("Width"))
where(is.numeric) & starts_with("x")
Apply the same operation to multiple columns.
Syntax:
across(.cols, .fns, .names = NULL)
.cols - which columns (use selection helpers).fns - function(s) to apply.names - control output names (default:
{.col}_{.fn})Transform multiple columns:
# Round all numeric columns
across(where(is.numeric), round)
# Scale to 0-1 range
across(c(mpg, hp, wt), \(x) x / max(x))
# Log transform with custom names
across(where(is.numeric), \(x) log(x + 1), .names = "log_{.col}")
Use \(x) to create anonymous functions where
x represents the current column.
Calculate statistics for multiple columns:
# Mean of all numeric columns
across(where(is.numeric), mean)
# Multiple functions
across(c(hp, wt), list(mean = mean, sd = sd))
# With grouping (use by parameter)
across(everything(), n_distinct)
# Default: col_fn
across(c(mpg, hp), list(mean = mean, sd = sd))
# Result: mpg_mean, mpg_sd, hp_mean, hp_sd
# Custom: fn.col
across(c(mpg, hp), list(mean = mean, sd = sd), .names = "{.fn}.{.col}")
# Result: mean.mpg, sd.mpg, mean.hp, sd.hp
Select columns as a data frame for custom functions.
Syntax: pick(.cols)
Use in summarize block with custom functions that need a data frame:
# With custom function
calc_stats(pick(everything()))
# Select specific columns
my_function(pick(c(hp, wt, mpg)))
When your expression returns a data frame, use the unpack option:
Example:
# Custom function that returns data frame
calc_stats <- function(df) {
data.frame(mean_x = mean(df$x), sd_x = sd(df$x))
}
# In summarize block with unpack = TRUE:
calc_stats(pick(everything()))
# Result: mean_x and sd_x as separate columns
Check conditions across multiple columns in the expression filter block.
if_any(): TRUE when condition is true for at least one column
# Rows with any NA
if_any(everything(), is.na)
# Any numeric column > 100
if_any(where(is.numeric), \(x) x > 100)
# Search across text columns
if_any(where(is.character), \(x) x == "setosa")
if_all(): TRUE when condition is true for all columns
# All numeric columns positive
if_all(where(is.numeric), \(x) x > 0)
# All width measurements > 2
if_all(ends_with("Width"), \(x) x > 2)
# No missing values
if_all(everything(), \(x) !is.na(x))
# Round numeric columns
across(where(is.numeric), round)
# Scale to max
across(c(mpg, hp, wt), \(x) x / max(x))
# Uppercase text
across(where(is.character), toupper)
# Log transform
across(where(is.numeric), \(x) log(x + 1), .names = "log_{.col}")
# Means (add grouping with by parameter)
across(where(is.numeric), mean)
# Multiple statistics
across(c(hp, wt), list(mean = mean, sd = sd, min = min, max = max))
# Count non-missing
across(everything(), \(x) sum(!is.na(x)))
# Complete rows only
if_all(everything(), \(x) !is.na(x))
# Any negative
if_any(where(is.numeric), \(x) x < 0)
# All widths > threshold
if_all(ends_with("Width"), \(x) x > 2.5)
# Search text columns
if_any(where(is.character), \(x) grepl("pattern", x))
Start simple: Test on one column, then use
across()
Check preview: Verify results in the block preview
Unpack option: In summarize, enable unpack when expressions return data frames
Combine helpers: Use & and
| to combine selections
across(where(is.numeric) & starts_with("Sepal"), mean)
Function syntax:
round (no arguments needed)\(x) round(x, 2) (with
arguments)x represents the current columnMissing values: Add na.rm = TRUE to
aggregation functions
across(where(is.numeric), \(x) mean(x, na.rm = TRUE))
For comprehensive documentation on column-wise operations, see:
These resources provide additional examples, edge cases, and advanced techniques for working with expressions in dplyr.