---
title: "Adapter cookbook: from split_spec to native resamples"
author: "Selçuk Korkmaz"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_document:
    toc: true
    toc_float: true
    number_sections: true
    theme: flatly
    highlight: tango
vignette: >
  %\VignetteIndexEntry{Adapter cookbook: from split_spec to native resamples}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  message = FALSE,
  warning = FALSE,
  eval = TRUE
)

package_root <- if (file.exists("../DESCRIPTION")) ".." else "."
if (requireNamespace("pkgload", quietly = TRUE) &&
    file.exists(file.path(package_root, "DESCRIPTION"))) {
  pkgload::load_all(package_root, export_all = FALSE, helpers = FALSE, quiet = TRUE)
} else {
  library(splitGraph)
}
```

## What this vignette is for

`splitGraph` ends at a `split_spec` object. It deliberately knows nothing about `rsample`, `tidymodels`, or any other resampling engine. The handoff contract is the `sample_data` table inside the spec plus a few scalar fields (`group_var`, `block_vars`, `time_var`, `ordering_required`, `recommended_resampling`).

This cookbook shows three small, self-contained adapters that turn a `split_spec` into something a downstream workflow can use:

1. A **base-R adapter** that returns a list of `(train, test)` row-index pairs — runnable here, no extra dependencies.
2. An **`rsample::group_vfold_cv()`** adapter for grouped cross-validation keyed to `group_id`.
3. An **`rsample::rolling_origin()`** adapter for ordered evaluation keyed to `order_rank`.

Adapters 2 and 3 show idiomatic glue but are not evaluated in this vignette, so that `splitGraph` does not pick up `rsample` as a build-time dependency. The same pattern works for any other resampling library you happen to use.
## Build a split_spec to work with

```{r build-spec}
meta <- data.frame(
  sample_id    = c("S1", "S2", "S3", "S4", "S5", "S6"),
  subject_id   = c("P1", "P1", "P2", "P2", "P3", "P3"),
  batch_id     = c("B1", "B2", "B1", "B2", "B1", "B2"),
  timepoint_id = c("T0", "T1", "T0", "T1", "T0", "T1"),
  time_index   = c(0, 1, 0, 1, 0, 1),
  outcome_id   = c("ctrl", "case", "ctrl", "case", "case", "ctrl"),
  stringsAsFactors = FALSE
)

g <- graph_from_metadata(meta, graph_name = "cookbook")
subject_constraint <- derive_split_constraints(g, mode = "subject")
spec <- as_split_spec(subject_constraint, graph = g)
spec
```

The `sample_data` table is the contract:

```{r show-sample-data}
as.data.frame(spec)[, c("sample_id", "group_id", "batch_group", "order_rank")]
```

## Adapter 1 — base R: leave-one-group-out folds

This is the simplest meaningful adapter. It groups by whatever `split_spec$group_var` says is the split unit, and returns one held-out group per fold.

```{r adapter-base-r}
logo_folds <- function(spec, observation_data, sample_id_col = "sample_id") {
  stopifnot(inherits(spec, "split_spec"))
  if (!sample_id_col %in% names(observation_data)) {
    stop("`observation_data` must contain a `", sample_id_col, "` column.")
  }
  joined <- merge(
    observation_data,
    spec$sample_data[, c("sample_id", spec$group_var)],
    by.x = sample_id_col,
    by.y = "sample_id",
    sort = FALSE
  )
  joined$.row <- seq_len(nrow(joined))
  groups <- split(joined$.row, joined[[spec$group_var]])
  lapply(names(groups), function(g) {
    list(
      group  = g,
      train  = unlist(groups[setdiff(names(groups), g)], use.names = FALSE),
      assess = groups[[g]]
    )
  })
}

# Pretend we have an observation frame keyed by sample_id.
obs <- data.frame(
  sample_id = meta$sample_id,
  x = rnorm(nrow(meta)),
  y = rbinom(nrow(meta), 1, 0.5)
)

folds <- logo_folds(spec, obs)
length(folds)
folds[[1]]
```

That is the entire downstream contract: take `spec`, take an observation frame, return train/assess index lists. Anything more complicated is specific to a resampling library.
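To show the contract in use, here is a small sketch that consumes the `folds` and `obs` objects from the chunk above, fitting one model per fold and scoring the held-out group. The logistic model is purely illustrative, and the sketch relies on every `obs` row matching a spec row (true here), so the returned indices line up with `obs`:

```{r use-folds}
fold_errors <- vapply(folds, function(f) {
  # Fit on every group except the held-out one.
  fit  <- glm(y ~ x, data = obs[f$train, , drop = FALSE], family = binomial)
  prob <- predict(fit, newdata = obs[f$assess, , drop = FALSE], type = "response")
  # Brier score for the held-out group.
  mean((obs$y[f$assess] - prob)^2)
}, numeric(1))

setNames(fold_errors, vapply(folds, `[[`, character(1), "group"))
```

Nothing here depends on how the folds were constructed; any adapter that emits the same `train`/`assess` index lists would plug into the same loop.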
## Adapter 2 — `rsample::group_vfold_cv()`

Grouped CV keyed to `group_id`. The downstream package would typically ship something like this; the adapter is short enough that you can paste it into your own analysis script.

```{r adapter-rsample-group, eval=FALSE}
spec_to_group_vfold <- function(spec, observation_data, v = NULL,
                                sample_id_col = "sample_id") {
  stopifnot(inherits(spec, "split_spec"))
  if (!requireNamespace("rsample", quietly = TRUE)) {
    stop("Install rsample to use this adapter.")
  }
  joined <- merge(
    observation_data,
    spec$sample_data[, c("sample_id", spec$group_var)],
    by.x = sample_id_col,
    by.y = "sample_id",
    sort = FALSE
  )
  n_groups <- length(unique(joined[[spec$group_var]]))
  if (is.null(v)) v <- n_groups
  # `group` accepts a single column name as a character string.
  rsample::group_vfold_cv(
    data  = joined,
    group = spec$group_var,
    v     = v
  )
}
```

`v = NULL` (the default above) gives leave-one-group-out, which is the right default when `splitGraph` has already grouped samples by their deepest leakage-relevant unit (e.g. subject). Pick a smaller `v` for k-fold-style grouped CV.

## Adapter 3 — `rsample::rolling_origin()`

When `spec$ordering_required` is `TRUE` (or `spec$time_var` is set), the right downstream object is an ordered split rather than a grouped one.
```{r adapter-rsample-rolling, eval=FALSE}
spec_to_rolling_origin <- function(spec, observation_data,
                                   sample_id_col = "sample_id",
                                   initial = NULL, assess = 1L) {
  stopifnot(inherits(spec, "split_spec"))
  if (is.null(spec$time_var)) {
    stop("This split_spec has no `time_var`; ordered evaluation is not available.")
  }
  if (!requireNamespace("rsample", quietly = TRUE)) {
    stop("Install rsample to use this adapter.")
  }
  joined <- merge(
    observation_data,
    spec$sample_data[, c("sample_id", spec$time_var)],
    by.x = sample_id_col,
    by.y = "sample_id",
    sort = FALSE
  )
  ordered <- joined[order(joined[[spec$time_var]]), , drop = FALSE]
  if (is.null(initial)) initial <- max(1L, floor(nrow(ordered) * 0.6))
  rsample::rolling_origin(ordered, initial = initial, assess = assess)
}
```

The key idea: `splitGraph` puts ordering information on the spec; the adapter is just a thin shim that consumes it.

## Going across language boundaries via JSON

If the downstream consumer is not in R, write the spec to JSON and let the consumer (Python, Julia, a CLI) interpret it.

```{r serialize, eval = requireNamespace("jsonlite", quietly = TRUE)}
tmp <- tempfile(fileext = ".json")
write_split_spec(spec, tmp)

# Inspect the first ~30 lines so the on-disk format is visible.
cat(readLines(tmp, n = 30), sep = "\n")

# And read it back exactly.
spec2 <- read_split_spec(tmp)
identical(spec$sample_data$group_id, spec2$sample_data$group_id)

unlink(tmp)
```

The same pair exists for `dependency_graph` (`write_dependency_graph()` / `read_dependency_graph()`). Both formats are documented under `?write_split_spec` and `?write_dependency_graph` and include a `schema_version` field so consumers can detect drift.

## When you need a custom adapter

The only assumptions an adapter has to honor:

- `split_spec$sample_data` is keyed by `sample_id` (character).
- `split_spec$group_var` is the column that holds the splitting unit.
- `split_spec$block_vars` are present-but-coarser blocking columns.
- `split_spec$time_var`, when non-`NULL`, defines the ordering.
- `split_spec$recommended_resampling` is a hint, not a contract — your adapter is free to ignore it.

That is the whole interface. As long as those five fields are honored, anything is a valid downstream consumer.