---
title: "Working with Expressions: Helper Functions for Advanced Data Manipulation"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Working with Expressions: Helper Functions for Advanced Data Manipulation}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  echo = TRUE,
  warning = FALSE,
  message = FALSE
)
```

## Introduction

Three blocks accept expressions: **expression filter**, **mutate**, and **summarize**.

Helper functions make expressions more powerful by applying operations to multiple columns at once. This vignette covers:

- Common functions for mutate and summarize
- Helper functions: `across()`, `where()`, `if_any()`, `if_all()`, `pick()`
- Column selection helpers
- Practical examples

See also: [dplyr column-wise operations](https://dplyr.tidyverse.org/articles/colwise.html) and [dplyr window functions](https://dplyr.tidyverse.org/articles/window-functions.html).

## Expression Blocks

**Expression filter block**: Logical expressions to keep rows

- Expressions return TRUE/FALSE for each row
- Use comparison operators: `>`, `<`, `==`, `!=`, `>=`, `<=`
- Combine with `&` (AND) or `|` (OR)
- Helpers: `if_any()`, `if_all()`

**Mutate block**: Create or modify columns

- Perform calculations on existing columns
- Use arithmetic, logs, lags, cumulative functions
- Helper: `across()` to transform multiple columns

**Summarize block**: Aggregate data

- Reduce rows to summary statistics
- Use aggregation functions: `mean()`, `sum()`, `n()`
- Helpers: `across()` for multiple columns, `pick()` for custom functions

## Useful Functions for Mutate

### Arithmetic
- `+`, `-`, `*`, `/`, `^` - basic operations
- `%/%` - integer division
- `%%` - remainder (modulo)

### Logs and exponentials
- `log()`, `log2()`, `log10()` - logarithms
- `exp()` - exponential
- `sqrt()` - square root

### Offsets
- `lag()` - previous values
- `lead()` - next values

### Cumulative aggregates
- `cumsum()` - cumulative sum
- `cummean()` - cumulative mean
- `cummin()`, `cummax()` - cumulative min/max
- `cumprod()` - cumulative product

### Ranking
- `row_number()` - sequential ranks (1, 2, 3, 4)
- `min_rank()` - ranks with ties (1, 2, 2, 4)
- `dense_rank()` - ranks without gaps (1, 2, 2, 3)
- `percent_rank()` - percentile (0 to 1)
- `ntile(n)` - divide into n bins

### Logical comparisons
- `<`, `>`, `<=`, `>=`, `==`, `!=`
- `&` (and), `|` (or), `!` (not)
- `%in%` - test membership

### Examples
```
# Calculate percentage
across(c(hp, wt), \(x) x / sum(x) * 100)

# Lag differences
mpg - lag(mpg)

# Cumulative sums by group (use by parameter)
cumsum(sales)

# Rank values
min_rank(desc(hp))
```

See [dplyr window functions](https://dplyr.tidyverse.org/articles/window-functions.html) for more.

## Useful Functions for Summarize

### Center
- `mean()` - average
- `median()` - middle value

### Spread
- `sd()` - standard deviation
- `IQR()` - interquartile range
- `mad()` - median absolute deviation

### Range
- `min()`, `max()` - minimum and maximum
- `quantile(x, 0.25)` - percentiles

### Position
- `first()`, `last()`, `nth(x, 2)` - positional values

### Count
- `n()` - count rows
- `n_distinct()` - count unique values
- `sum(!is.na(x))` - count non-missing

### Sums and products
- `sum()` - total
- `prod()` - product

### Examples
```
# Basic statistics
across(where(is.numeric), list(mean = mean, sd = sd))

# Count by group (use by parameter)
n()

# Multiple stats
list(
  avg = mean(hp),
  min = min(hp),
  max = max(hp),
  count = n()
)
```

For handling missing values, add `na.rm = TRUE`:
```
mean(hp, na.rm = TRUE)
across(where(is.numeric), \(x) mean(x, na.rm = TRUE))
```

See [dplyr summarise](https://dplyr.tidyverse.org/reference/summarise.html) for more.

## Column Selection Helpers

Select columns by name pattern or type (used inside `across()`, `if_any()`, `if_all()`):

- `everything()` - all columns
- `starts_with("prefix")` - columns starting with prefix
- `ends_with("suffix")` - columns ending with suffix
- `contains("text")` - columns containing text
- `where(is.numeric)` - columns by type (`is.character`, `is.factor`)
- `c(col1, col2)` - specific columns

Combine selections:
```
c(starts_with("Sepal"), ends_with("Width"))
where(is.numeric) & starts_with("x")
```

## The across() Function

Apply the same operation to multiple columns.

**Syntax**: `across(.cols, .fns, .names = NULL)`

- `.cols` - which columns (use selection helpers)
- `.fns` - function(s) to apply
- `.names` - control output names (default: `{.col}_{.fn}`)

### In Mutate Block

Transform multiple columns:
```
# Round all numeric columns
across(where(is.numeric), round)

# Scale to 0-1 range
across(c(mpg, hp, wt), \(x) x / max(x))

# Log transform with custom names
across(where(is.numeric), \(x) log(x + 1), .names = "log_{.col}")
```

Use `\(x)` to create anonymous functions where `x` represents the current column.

### In Summarize Block

Calculate statistics for multiple columns:
```
# Mean of all numeric columns
across(where(is.numeric), mean)

# Multiple functions
across(c(hp, wt), list(mean = mean, sd = sd))

# With grouping (use by parameter)
across(everything(), n_distinct)
```

### Custom names
```
# Default: col_fn
across(c(mpg, hp), list(mean = mean, sd = sd))
# Result: mpg_mean, mpg_sd, hp_mean, hp_sd

# Custom: fn.col
across(c(mpg, hp), list(mean = mean, sd = sd), .names = "{.fn}.{.col}")
# Result: mean.mpg, sd.mpg, mean.hp, sd.hp
```

## The pick() Function

Select columns as a data frame for custom functions.

**Syntax**: `pick(.cols)`

Use in summarize block with custom functions that need a data frame:
```
# With custom function
calc_stats(pick(everything()))

# Select specific columns
my_function(pick(c(hp, wt, mpg)))
```

### Relationship to unpack parameter

When your expression returns a data frame, use the **unpack** option:

- **unpack = FALSE** (default): Result is a single list-column
- **unpack = TRUE**: Columns are spread into separate columns

Example:
```
# Custom function that returns data frame
calc_stats <- function(df) {
  data.frame(mean_x = mean(df$x), sd_x = sd(df$x))
}

# In summarize block with unpack = TRUE:
calc_stats(pick(everything()))
# Result: mean_x and sd_x as separate columns
```

## Filter Helpers: if_any() and if_all()

Check conditions across multiple columns in the expression filter block.

**if_any()**: TRUE when condition is true for **at least one** column
```
# Rows with any NA
if_any(everything(), is.na)

# Any numeric column > 100
if_any(where(is.numeric), \(x) x > 100)

# Search across text columns
if_any(where(is.character), \(x) x == "setosa")
```

**if_all()**: TRUE when condition is true for **all** columns
```
# All numeric columns positive
if_all(where(is.numeric), \(x) x > 0)

# All width measurements > 2
if_all(ends_with("Width"), \(x) x > 2)

# No missing values
if_all(everything(), \(x) !is.na(x))
```

## Common Patterns

### Mutate Block
```
# Round numeric columns
across(where(is.numeric), round)

# Scale to max
across(c(mpg, hp, wt), \(x) x / max(x))

# Uppercase text
across(where(is.character), toupper)

# Log transform
across(where(is.numeric), \(x) log(x + 1), .names = "log_{.col}")
```

### Summarize Block
```
# Means (add grouping with by parameter)
across(where(is.numeric), mean)

# Multiple statistics
across(c(hp, wt), list(mean = mean, sd = sd, min = min, max = max))

# Count non-missing
across(everything(), \(x) sum(!is.na(x)))
```

### Expression Filter Block
```
# Complete rows only
if_all(everything(), \(x) !is.na(x))

# Any negative
if_any(where(is.numeric), \(x) x < 0)

# All widths > threshold
if_all(ends_with("Width"), \(x) x > 2.5)

# Search text columns
if_any(where(is.character), \(x) grepl("pattern", x))
```

## Tips

**Start simple**: Test on one column, then use `across()`

**Check preview**: Verify results in the block preview

**Unpack option**: In summarize, enable unpack when expressions return data frames

**Combine helpers**: Use `&` and `|` to combine selections
```
across(where(is.numeric) & starts_with("Sepal"), mean)
```

**Function syntax**:

- Direct: `round` (no arguments needed)
- Anonymous function: `\(x) round(x, 2)` (with arguments)
- The `x` represents the current column

**Missing values**: Add `na.rm = TRUE` to aggregation functions
```
across(where(is.numeric), \(x) mean(x, na.rm = TRUE))
```

## Learn More

For comprehensive documentation on column-wise operations, see:

- [Column-wise operations vignette](https://dplyr.tidyverse.org/articles/colwise.html) - detailed guide to across(), where(), if_any(), and if_all()
- [across() reference](https://dplyr.tidyverse.org/reference/across.html) - complete documentation with all arguments and options
- [R for Data Science (2e)](https://r4ds.hadley.nz/data-transform.html) - comprehensive tutorial on data transformation

These resources provide additional examples, edge cases, and advanced techniques for working with expressions in dplyr.
