---
title: "CharAnalysis: Diagnostic and analytical tools for peak detection and fire-history interpretations using high-resolution sediment-charcoal records"
author: "Philip Higuera"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
    fig_width: 7
    fig_height: 5
vignette: >
  %\VignetteIndexEntry{CharAnalysis: Diagnostic and analytical tools for peak detection and fire-history interpretations using high-resolution sediment-charcoal records}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse  = TRUE,
  comment   = "#>",
  fig.align = "center",
  fig.width = 7,
  fig.height = 5,
  warning   = FALSE,
  message   = FALSE
)

# Suggests-package gate: figure chunks live-render only when ggplot2,
# patchwork, and ggtext are all installed. If any is missing, those
# chunks skip cleanly and the rest of the vignette still builds.
PLOT_OK <- requireNamespace("ggplot2",   quietly = TRUE) &&
           requireNamespace("patchwork", quietly = TRUE) &&
           requireNamespace("ggtext",    quietly = TRUE)
```

## Overview

*CharAnalysis* is a tool to help reconstruct local fire history from lake-sediment charcoal records. Charcoal preserved in lake sediments is a direct proxy for past fire activity, and peaks in charcoal deposition above a slowly varying background level have been shown to record individual fire events near a lake. *CharAnalysis* formalizes this peak-detection logic as a reproducible, quantitative workflow.

This R package is a direct translation of *CharAnalysis* v2.0 (MATLAB), validated against reference outputs on four benchmark datasets spanning a range of record lengths, sampling resolutions, ecosystems, and analysis configurations. Analytical methods are described in Higuera et al. (2009).

The full workflow proceeds in five steps:

1. **Pretreatment** — Resample the raw charcoal series to equal time steps and compute charcoal accumulation rate (CHAR, pieces cm^-2^ yr^-1^).
2. **Smoothing** — Estimate low-frequency trends (C~background~) using lowess or a moving-window statistic.
3. **Peak decomposition** — Compute the high-frequency residual C~peak~ = C~interp~ − C~background~ (or ratio C~interp~ / C~background~).
4. **Thresholding** — Define a noise threshold and flag C~peak~ values that exceed it as candidate fire events.  Thresholds can be global (one distribution fitted to the full record) or local (sliding-window distribution).
5. **Screening and metrics** — Apply a minimum-count significance test, then compute fire-return intervals (FRIs), Weibull statistics, smoothed fire frequency, signal-to-noise index (SNI), and goodness-of-fit.

---

## Installation

```{r install, eval = FALSE}
# Install from GitHub (requires devtools)
devtools::install_github("phiguera/CharAnalysis",
                         subdir = "CharAnalysis_2_0_R")

# Suggested packages for figures
install.packages(c("ggplot2", "patchwork", "ggtext"))
```

---

## Worked example: Code Lake

Code Lake (site code `CO`) in Alaska, USA, is the primary validation dataset for *CharAnalysis*. The record spans approximately the past 7,300 years and is analyzed here using a local Gaussian mixture model (GMM) for threshold determination, with the working threshold percentile set at 0.95 to compensate for language-induced GMM drift relative to the MATLAB v2.0 implementation; see the *Comparison with MATLAB v2.0* section below for the rationale.

### Input files

*CharAnalysis* reads two CSV files:

- **`CO_charData.csv`** — the charcoal data table (depth, age, volume, count).
- **`CO_charParams.csv`** — the analysis parameter file.

Both files ship with the package in `inst/validation/` and are located with `system.file()`. The data file is auto-derived from the site name embedded inside the params file, so only the params path needs to be supplied to `CharAnalysis()`. In your own use you would substitute the path to a `*_charParams.csv` file of your choosing.

For this worked example we use `CO_compensated_charParams.csv`, a variant of the standard `CO_charParams.csv` in which the working threshold percentile is lowered from 0.99 to 0.95. This compensates for a language-induced shift in the GMM-fitted noise distribution; see the *Comparison with MATLAB v2.0* section below for details.

```{r paths}
params_file <- system.file("validation", "CO_compensated_charParams.csv",
                           package = "CharAnalysis")
params_file
```

### Running the full pipeline

A single call to `CharAnalysis()` runs all five analytical steps and returns a named list of results. The function prints progress messages as it works through each step.

```{r run, message = TRUE}
library(CharAnalysis)
out <- CharAnalysis(params_file)
```

### Exploring the output

`CharAnalysis()` returns a named list. The most commonly used elements are:

```{r output-structure}
names(out)
```

#### The `charcoal` object

`out$charcoal` holds all time-series outputs, both raw and processed:

```{r charcoal}
# Inspect the first few rows of key time series
head(data.frame(
  age_BP    = out$charcoal$ybpI,          # interpolated age (yr BP)
  CHAR      = out$charcoal$accI,          # C_interpolated  (pieces cm-2 yr-1)
  C_bkg     = out$charcoal$accIS,         # C_background
  C_peak    = out$charcoal$peak,          # C_peak (residuals)
  peaks     = out$charcoal$charPeaks[, 4] # final-threshold peak flags (0/1)
))
```

#### The `char_thresh` object

`out$char_thresh` holds threshold values, SNI, and goodness-of-fit results:

```{r thresh}
# Threshold at the final percentile (column 4 = threshValues[4])
range(out$char_thresh$pos[, 4], na.rm = TRUE)

# Signal-to-noise index (SNI): values > 3 indicate a strong signal
summary(out$char_thresh$SNI)
```

#### Post-processing metrics

`out$post` holds fire-history summary metrics:

```{r post}
# Fire-return intervals (FRIs) and mean FRI
cat("Number of FRIs:", length(out$post$FRI), "\n")
cat("Mean FRI:", round(mean(out$post$FRI), 1), "yr\n")

# Per-zone Weibull statistics (zone 1)
fri_z1 <- out$post$FRI_params_zone[1, ]
cat(sprintf(
  "Zone 1 — nFRI: %d  mFRI: %.1f yr  WBLb: %.1f  WBLc: %.2f\n",
  fri_z1[1], fri_z1[2], fri_z1[5], fri_z1[8]
))
```

#### The `char_results` matrix

`out$char_results` is the complete N × 33 output matrix, with columns matching the MATLAB `charResults` CSV exactly:

```{r char-results}
dim(out$char_results)

# Total number of fire events identified
sum(out$charcoal$charPeaks[, 4], na.rm = TRUE)
```

---

### Output figures

*CharAnalysis* provides eight publication-quality ggplot2 figures. The five analytical figures (3, 5, 6, 7, 8) summarise the fire-history analysis at progressively higher levels of integration; the three diagnostic figures (1, 2, 4) support inspection of the pretreatment, smoothing, and threshold-determination steps. Each figure is produced by its own `char_plot_*()` function; the convenience wrapper `char_plot_all()` writes the five analytical figures to PDF in one call. The figures below render live from the Code Lake analysis just produced.

The figures require the **ggplot2**, **patchwork**, and **ggtext** packages, all in `Suggests`. If any of these is unavailable in the current R environment the figure chunks below will be skipped silently.

#### Figure 1 — Smoothing options for C~background~

`char_plot_raw()` is the diagnostic figure for the pretreatment and smoothing steps. It overlays the raw CHAR series, the resampled (interpolated) CHAR at the constant time-step set by `yrInterp`, and the C~background~ curves produced by all five smoothing methods (1: lowess; 2: robust lowess; 3: moving average; 4: moving median; 5: moving mode) over the user-specified smoothing window. Inspecting these alternatives helps assess whether the chosen smoothing method is well-behaved on the record at hand and whether the smoothing window is appropriate.

```{r fig1, eval = PLOT_OK, fig.width = 7, fig.height = 5}
char_plot_raw(out)
```

#### Figure 2 — Local threshold determination diagnostics

`char_plot_thresh_diag()` shows how the local threshold is determined within sliding windows along the record. The figure is organized as a 5×5 grid of windows, each panel showing the empirical C~peak~ distribution within that window, the fitted noise component (Gaussian or Gaussian mixture, per `threshMethod`), and the resulting threshold value at the working percentile. This diagnostic is useful for confirming that the local threshold algorithm is behaving sensibly across the full record, and for spotting any windows where the noise model fits poorly.

```{r fig2, eval = PLOT_OK, fig.width = 8, fig.height = 8}
char_plot_thresh_diag(out)
```

#### Figure 3 — C~interp~, C~background~, and C~peak~

`char_plot_peaks()` is the diagnostic figure for the peak-detection step. The top panel shows the resampled charcoal accumulation rate (CHAR, black bars) with the fitted C~background~ trend overlaid as a grey line. The bottom panel shows the C~peak~ residual series with the positive and negative threshold lines (red), identified fire events (+ symbols), and candidate peaks that failed the minimum-count significance screen (grey dots). This is typically the first figure to inspect when assessing whether the smoothing window and threshold settings are appropriate for a given record.

```{r fig3, eval = PLOT_OK, fig.width = 7, fig.height = 6}
char_plot_peaks(out)
```

#### Figure 5 — Cumulative peaks through time

`char_plot_cumulative()` shows the cumulative count of identified fire events plotted against age. The slope of the curve at any point reflects the instantaneous fire frequency: steeper slopes indicate higher fire frequency, gentler slopes lower fire frequency. Visible changes in slope identify periods of regime change in the record.

```{r fig5, eval = PLOT_OK, fig.width = 7, fig.height = 4}
char_plot_cumulative(out)
```

#### Figure 6 — FRI distributions by zone

`char_plot_fri()` summarises the distribution of fire-return intervals (FRIs) within each user-defined stratigraphic zone. Each panel shows a histogram of FRIs in 20-year bins, normalised to proportions of the zone's FRI population. A two-parameter Weibull probability density function is overlaid where the Kolmogorov-Smirnov goodness-of-fit test passes (p > 0.10 for n < 30; p > 0.05 for n ≥ 30). Weibull scale (*b*) and shape (*c*) parameters, mean FRI, and sample size are annotated on each panel.

```{r fig6, eval = PLOT_OK, fig.width = 7, fig.height = 5}
char_plot_fri(out)
```

#### Figure 7 — Continuous fire history

`char_plot_fire_history()` is the integrated fire-history summary, with three stacked panels sharing a common time axis. The top panel shows peak magnitude (integrated C~peak~ above threshold, pieces cm^-2^ peak^-1^) for each fire event. The middle panel shows fire-return intervals through time as filled squares, with the smoothed mean FRI curve (black line) and bootstrapped 95% confidence ribbon (grey). The bottom panel shows lowess-smoothed fire frequency (fires per 1000 yr), the most common single summary in fire-history publications.

```{r fig7, eval = PLOT_OK, fig.width = 7, fig.height = 8}
char_plot_fire_history(out)
```

#### Figure 8 — Between-zone CHAR comparisons

`char_plot_zones()` tests whether raw charcoal accumulation differs between user-defined stratigraphic zones. The left panel shows empirical cumulative distribution functions of raw CHAR within each zone, with pairwise two-sample Kolmogorov-Smirnov p-values annotated. The right panel shows box plots (10th, 25th, 50th, 75th, 90th percentiles) of raw CHAR by zone, allowing direct visual comparison of central tendency and spread.

```{r fig8, eval = PLOT_OK, fig.width = 8, fig.height = 4}
char_plot_zones(out)
```

#### Saving all figures to PDF

`char_plot_all()` saves all five figures to PDF in one call. `out_dir` is required when `save = TRUE`. The example below writes to a temporary directory so it runs on any system; in your own work you would substitute a path of your choosing.

```{r save-figs, eval = FALSE}
char_plot_all(out, save = TRUE, out_dir = tempdir())
# Saves to tempdir():
#   CO_03_CHAR_analysis.pdf
#   CO_05_cumulative_peaks.pdf
#   CO_06_FRI_distributions.pdf
#   CO_07_continuous_fire_hx.pdf
#   CO_08_zone_comparisons.pdf
```

> **Note:** Figures 9 (threshold sensitivity detail) and 10 (multi-site comparisons) from the MATLAB v2.0 interface are not implemented in this R package. Figure 4 (`char_plot_sni()`, sensitivity to alternative threshold values and signal-to-noise index) is implemented but not displayed in this vignette; call it directly on the `out` object to inspect.

---

### Writing results to CSV

`char_write_results()` writes the 33-column output matrix to a CSV file whose column names and format match the MATLAB `charResults` output exactly. `out_dir` is required (no default); substitute a path of your choosing for the temporary directory used here.

```{r write, eval = FALSE}
char_write_results(out$char_results,
                 site    = out$site,
                 out_dir = tempdir())
# Writes: <tempdir>/CO_charResults.csv
```

The output CSV contains one row per interpolated time step and 33 columns covering all analytical outputs from C~interp~ through to per-zone Weibull confidence intervals. Column headers match the MATLAB reference file exactly to facilitate direct numerical comparison.

---

## Key parameters

The parameter file (`*_charParams.csv`) controls all aspects of the analysis. The most commonly adjusted parameters are:

| Parameter | Default | Description |
|-----------|---------|-------------|
| `yrInterp` | 15 | Resampling resolution (yr). Set to 0 for automatic (median raw resolution). |
| `yr` | 500 | Smoothing window width (yr) for C~background~ estimation. |
| `threshType` | 2 | Threshold type: 1 = global, 2 = local (sliding window). |
| `threshMethod` | 3 | Noise distribution: 2 = Gaussian, 3 = Gaussian mixture model. |
| `threshValues` | 0.95, 0.99, 0.999, 0.99 | Percentile thresholds; the final value defines the working threshold. |
| `minCountP` | 0.05 | Alpha level for the minimum-count significance screen. |
| `peakFrequ` | 1000 | Window width (yr) for smoothed fire frequency and FRI statistics. |
| `zoneDiv` | record bounds | Zone boundaries (yr BP) for stratified FRI and Weibull analysis. |

---

## Comparison with MATLAB v2.0

*CharAnalysis* in R is a direct translation of *CharAnalysis* v2.0 (MATLAB). Outputs are quantitatively equivalent on all validated reference datasets. The table below summarises results across the four validation datasets; full details are in `inst/z_Validation_report_R_vs_MATLAB_V_2.0.md`.

| Dataset | Site | Smoothing | Threshold | charBkg max\|diff\| | Peaks R v2.0.x | Peaks MATLAB v2.0 |
|---------|------|-----------|-----------|-------------------|---------|-------------|
| CO | Code Lake, AK | Method 4 (moving median) | 0.99 | ~5 × 10^-6^ | 39 | 48 |
| CO (compensated) | Code Lake, AK | Method 4 (moving median) | 0.95\* | ~5 × 10^-6^ | 50 | 48 |
| CH10 | Chickaree Lake, CO | Method 2 (robust lowess) | 0.99 | 0.267 | 59 | 50 |
| SI17 | Silver Lake, CO | Method 2 (robust lowess) | 0.99 | 0.130 | 25 | 31 |
| RA07 | Raven Lake, AK | Method 2 (robust lowess) | 0.99 | < 0.001 | 15 | 17 |

\* The Threshold column lists the R-side working percentile (`threshValues[4]`); MATLAB uses 0.99 in all rows. The "CO (compensated)" row therefore compares R @ 0.95 with MATLAB @ 0.99, illustrating the parameter offset that brings R into qualitative agreement with the published MATLAB result. See *Compensating for GMM drift* below.

Two sources of numerical difference are documented:

**1. Robust lowess background (smoothing method 2)** — MATLAB's Curve Fitting Toolbox `smooth(..., 'rlowess')` and the R `char_lowess()` port produce slightly different C~background~ series. For gap-free records (RA07) the difference is negligible (< 0.001). For records with NaN gaps (CH10, SI17) the difference is larger (≤ 0.267) because the two implementations handle gap positions differently inside the bisquare robustness iteration. The other smoothing methods (1 plain lowess, 3 moving average, 4 moving median, 5 moving mode) are unaffected and agree to within floating-point noise on all datasets — including Code Lake (CO), which uses method 4.

**2. Gaussian mixture model (GMM) peak counts** — The R package ports the MATLAB `GaussianMixture.m` EM algorithm directly. Floating-point arithmetic accumulates differently across languages during EM iterations, causing the two implementations to reach slightly different threshold values in some local windows. Peak counts differ by 10–20% across datasets, with the direction varying (R sometimes higher, sometimes lower). All threshold and peak differences are downstream consequences of this single source; interpolation and peak-magnitude outputs agree to within numerical precision.

**Compensating for GMM drift.** The peak-count differences in the strict-comparison rows above reflect a 1-to-1 reproduction of each MATLAB v2.0 reference run, with all parameters held identical across languages. The differences are not irreducible. For Code Lake, the R implementation produces a slightly higher GMM threshold than MATLAB does, so fewer C~peak~ values pass the screen; lowering the working threshold percentile (`threshValues[4]`) from 0.99 to 0.95 compensates for this drift and brings the R peak count substantially closer to the MATLAB v2.0 count. The required compensation is dataset-specific because the direction of GMM drift varies by record. We report unmodified results in the validation table to characterize the language-induced numerical drift directly, rather than mask it.

The worked example earlier in this vignette uses the bundled `CO_compensated_charParams.csv` (identical to `CO_charParams.csv` except `threshValues[4] = 0.95`). With this compensation, R identifies 50 peaks for Code Lake. This is much closer to the published MATLAB v2.0 result of 48 peaks, and it captures the significant decrease in FRIs from Zone 1 to Zone 2, highlighted in Higuera et al. (2009).

**Weibull confidence intervals** — Bootstrap CIs use random resampling and will differ between R and MATLAB runs regardless of any other differences. Weibull point estimates (scale *b*, shape *c*) agree within a few percent on datasets where peak counts are similar.

---

## Citation

If you use *CharAnalysis* in published research, please cite Higuera et al. (2009), the first study to apply the core analytical tools implemented in the program. If you used *CharAnalysis* v2.0 (MATLAB or R v2.0.x) specifically, please also cite the software:

> Higuera, P.E., L.B. Brubaker, P.M. Anderson, F.S. Hu, and T.A. Brown. 2009. Vegetation mediated the impacts of postglacial climate change on fire regimes in the south-central Brooks Range, Alaska. *Ecological Monographs* 79:201–219. https://doi.org/10.1890/07-2019.1

> Higuera, P.E. 2026. *CharAnalysis*: Diagnostic and analytical tools for peak analysis in sediment-charcoal records (Version 2.0). Zenodo. https://doi.org/10.5281/zenodo.19304064

---

## References

Higuera, P.E., L.B. Brubaker, P.M. Anderson, F.S. Hu, and T.A. Brown. 2009. Vegetation mediated the impacts of postglacial climate change on fire regimes in the south-central Brooks Range, Alaska. *Ecological Monographs* 79:201–219. https://doi.org/10.1890/07-2019.1