---
title: "slideimp"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{slideimp}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
```{r setup}
library(slideimp)
set.seed(1234)
```
## Comparing Methods Using a Shared Cross-Validation Split with `tune_imp()`
* Simulate some data
```{r}
# 20 rows, 1000 columns, all columns have at least some NA
sim_obj <- sim_mat(n = 20, p = 1000, perc_col_na = 1)
obj <- sim_obj$input
obj[1:4, 1:4]
```
* Instead of letting `tune_imp()` sample NA locations internally (it calls `sample_na_loc()` under the hood), we can generate the NA locations up front and pass them to `tune_imp()`.
* Here, we randomly select 5 rows from each of 200 random columns (features) for 5 repetitions using `sample_na_loc()`. To restrict sampling to specific columns (e.g., clock CpGs), provide the `na_col_subset` argument.
- `na_loc` has 5 elements, one for each repeat. Each element is a matrix whose rows store the row and column index of a missing value.
```{r}
na_loc <- sample_na_loc(obj, n_cols = 200, n_rows = 5, n_reps = 5)
length(na_loc)
na_loc[[1]][1:6, ]
```
* Then we can compare (1) PCA, (2) K-NN imputation, and (3) a custom method on equal footing, because the cross-validation missing values are identical for all methods.
* **Note**: The custom function requires the first argument to be `obj`, must return an object with the same dimensions as `obj`, and all subsequent arguments must match the column names of the `parameters` data frame.
```{r}
# This custom function imputes missing values with random normal values and takes
# `mean` and `sd` as params
rnorm_imp <- function(obj, mean, sd) {
  na <- is.na(obj)
  obj[na] <- rnorm(sum(na), mean = mean, sd = sd) # <- impute values with rnorm
  return(obj) # <- return an imputed object with the same dim as obj
}
pca_tune <- tune_imp(
  obj,
  .f = "pca_imp",
  na_loc = na_loc,
  parameters = data.frame(ncp = 10)
)
knn_tune <- tune_imp(
  obj,
  .f = "knn_imp",
  na_loc = na_loc,
  parameters = data.frame(k = 10)
)
rnorm_tune <- tune_imp(
  obj,
  .f = rnorm_imp,
  na_loc = na_loc,
  parameters = data.frame(mean = 0, sd = 1) # must match the arguments of `rnorm_imp`
)
```
* Pick the method with the lowest cross-validation error:
```{r}
mean(compute_metrics(pca_tune, metrics = "rmse")$.estimate)
mean(compute_metrics(knn_tune, metrics = "rmse")$.estimate)
mean(compute_metrics(rnorm_tune, metrics = "rmse")$.estimate)
```
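* With more candidate methods, collecting the mean errors in a named vector makes the selection programmatic. A base-R sketch with placeholder RMSE values (substitute the `compute_metrics()` results above):
```{r}
# Placeholder mean RMSE values; in practice these come from compute_metrics()
rmse <- c(pca = 0.52, knn = 0.61, rnorm = 1.41)
names(which.min(rmse)) # name of the best-performing method
```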
## Group-Wise Imputation, Small-Group Padding, and Group-Wise Parameters with `group_imp()`
* `group_imp()` allows imputation to be performed separately within defined groups (e.g., by chromosome), which significantly reduces run time and can increase accuracy for both K-NN and PCA imputation.
* `group_imp()` requires the `group` argument, which maps `colnames(obj)` to groups. This can be created up front with `prep_groups()` for advanced features such as group-wise parameters and padding of small groups with random features from other groups. `prep_groups()` returns a list-column data frame with:
- `features`: **required** - a list-column where each element is a character vector of variable names to be imputed together.
- `aux`: **optional** - auxiliary variables to include in each group. These are used only to improve the imputation quality of `features` and are not imputed themselves. If a group is too small (e.g., `chrM`), `aux` is used to pad it with features randomly drawn from other groups until it meets `min_group_size`.
- `parameters`: **optional** - group-specific imputation parameters.
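* As a sketch of the expected shape, such a list-column data frame can be built by hand (hypothetical group and feature names; in practice `prep_groups()` constructs it for you):
```{r}
# Hypothetical group table with the list-columns described above
group_df <- data.frame(group = c("g1", "g2"))
group_df$features   <- list(c("f1", "f2"), c("f3", "f4")) # imputed together
group_df$aux        <- list(character(0), "f1")           # padding features
group_df$parameters <- list(list(k = 3), list(k = 5))     # group-wise k
group_df$features
```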
* First, we simulate data from 2 groups. We then move one feature into a new `group3`, leaving it with only 1 feature, to show how `min_group_size` pads it via the `aux` list column.
```{r}
sim_obj <- sim_mat(n = 20, p = 50, n_col_groups = 2)
# Matrix to be imputed
obj <- sim_obj$input
obj[1:5, 1:4]
# Metadata, i.e., which features belong to which group
meta <- sim_obj$col_group
meta[1:5, ]
# We put feature 1 in `group3`
meta[1, 2] <- "group3"
meta[1:5, ]
```
* Then we generate the `group` parameter for `group_imp()` up front. We can see that `group3` has been padded to have 10 columns.
* For each group, we also assign a different number of nearest neighbors as a demonstration.
```{r}
set.seed(1234)
group_imp_df <- prep_groups(colnames(obj), group = meta, min_group_size = 10)
group_imp_df$parameters <- list(list(k = 3), list(k = 4), list(k = 5))
group_imp_df
```
* Finally, we impute `obj` using the modified `group_imp_df`. The `k = 10` passed to `group_imp()` is ignored because every group has a group-wise `k` specified.
```{r}
knn_results <- group_imp(obj, group = group_imp_df, cores = 4, k = 10)
print(knn_results, p = 4)
```
## Sliding Window Imputation for WGBS/EM-seq Data with `slide_imp()`
### Select `window_size`, `overlap_size`, and PCA/K-NN Parameters
* We simulate the output of the `{methylKit}` package.
- **Note:** WGBS/EM-seq data should be grouped by chromosome before performing sliding window imputation.
- Here, we simulate 1000 sites.
* Importantly, we simulate the genomic position of each feature, with adjacent sites spaced 50 to 500 bp apart.
* The `locations` vector contains the genomic position of each feature (column). It is used to determine which columns are grouped together given a window size.
```{r}
set.seed(1234)
sample_names <- paste0("S", 1:10)
n_sites <- 1000
# Simulate positions with 50–500 bp between each site
distances_between <- sample(50:500, size = n_sites, replace = TRUE)
locations <- cumsum(distances_between) # <- important, location vector
methyl <- data.frame(
  chr = "chr1",
  start = locations,
  end = locations,
  strand = "+"
)
for (i in seq_along(sample_names)) {
  methyl[[paste0("numCs", i)]] <- sample.int(100, size = n_sites, replace = TRUE)
  methyl[[paste0("numTs", i)]] <- sample.int(100, size = n_sites, replace = TRUE)
  methyl[[paste0("coverage", i)]] <- methyl[[paste0("numCs", i)]] + methyl[[paste0("numTs", i)]]
}
methyl[1:5, 1:10]
```
* First, we convert the data into a beta matrix with features in the columns.
```{r}
numCs_matrix <- as.matrix(methyl[, paste0("numCs", seq_along(sample_names))])
cov_matrix <- as.matrix(methyl[, paste0("coverage", seq_along(sample_names))])
beta_matrix <- numCs_matrix / cov_matrix
colnames(beta_matrix) <- sample_names
rownames(beta_matrix) <- methyl$start
beta_matrix <- t(beta_matrix)
# Set 10% of the data to missing
set.seed(1234)
beta_matrix[sample.int(length(beta_matrix), floor(length(beta_matrix) * 0.1))] <- NA
beta_matrix[1:4, 1:4]
```
* Then, in a real dataset, we would tune hyperparameters on a single chromosome (e.g., `chr22`). Here, as a demonstration, we use the whole dataset since it is small.
* We use 2 repetitions of cross-validation (increase to 10–30 in real analyses). We are selecting between:
- `ncp` (number of principal components) of 2 or 4, indicating that we are performing sliding PCA imputation. Pass `k` for sliding K-NN imputation.
- `window_size` of 5,000 or 10,000 bp.
- `overlap_size` fixed at 1,000 bp (does not affect results much in real analyses).
```{r}
params <- expand.grid(ncp = c(2, 4), window_size = c(5000, 10000))
params$overlap_size <- 1000
params$min_window_n <- 20 # windows with fewer than 20 columns are dropped
# In real analyses, increase n_reps and tune on a held-out chromosome (e.g., chr22)
tune_slide_pca <- tune_imp(
  obj = beta_matrix,
  parameters = params,
  .f = "slide_imp",
  n_reps = 2,
  location = locations
)
metrics <- compute_metrics(tune_slide_pca)
aggregate(.estimate ~ .metric + ncp + window_size, data = metrics, FUN = mean)
```
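* An analogous grid for sliding K-NN imputation would swap `ncp` for `k` (hypothetical `k` values shown):
```{r}
# Hypothetical K-NN tuning grid: `k` replaces `ncp` for sliding K-NN imputation
knn_params <- expand.grid(k = c(5, 10), window_size = c(5000, 10000))
knn_params$overlap_size <- 1000
knn_params
```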
* Finally, we can use `slide_imp()` to impute the full `beta_matrix`. Use the best parameter combination from the cross-validation metrics.
* First, we use `dry_run = TRUE` to examine the columns to be imputed.
- `start` and `end` are the genomic boundaries of each window.
- `window_n` is the number of features included in the window.
```{r}
slide_imp(
  obj = beta_matrix,
  location = locations,
  window_size = 5000,
  overlap_size = 1000,
  ncp = 2,
  min_window_n = 20,
  dry_run = TRUE # <- dry_run to inspect the windows
)
```
* Turn off `dry_run` to impute the data:
```{r}
slide_imp(
  obj = beta_matrix,
  location = locations,
  window_size = 5000,
  overlap_size = 1000,
  ncp = 2,
  min_window_n = 20,
  dry_run = FALSE,
  .progress = FALSE
)
```
### Subset and Flanking Mode
* Use this mode when you only need to impute specific target features (e.g., clock CpGs) rather than the entire dataset.
- Pass the desired feature names to the `subset` argument. Only windows containing these features will be imputed.
- Set `flank = TRUE` to build windows *centered* on each feature in the `subset`. Each window will extend `window_size` bp on either side of the target feature (flanking mode).
- In this mode, the `overlap_size` argument is ignored.
* In this example, we only want to impute the features `"1323"` and `"33810"` by creating 5,000 bp flanking windows around each feature:
```{r}
slide_imp(
  obj = beta_matrix,
  location = locations,
  window_size = 5000,
  ncp = 2,
  min_window_n = 20,
  subset = c("1323", "33810"),
  flank = TRUE,
  dry_run = TRUE
)
```