---
title: "TemporalForest: A Quick Start Guide"
output:
  rmarkdown::html_vignette
author:
  - name: "Sisi Shao"
    affiliation: "Department of Biostatistics, Fielding School of Public Health, University of California, Los Angeles, CA, USA"
    orcid: "0009-0000-9783-9205"
  - name: "Jason H. Moore"
    affiliation: |
      Department of Biostatistics, Fielding School of Public Health,  
      University of California, Los Angeles, CA, USA  
      Department of Computational Biomedicine,  
      Cedars-Sinai Medical Center, Los Angeles, CA, USA
    orcid: "0000-0002-5015-1099"
  - name: "Christina M. Ramirez"
    affiliation: "Department of Biostatistics, Fielding School of Public Health, University of California, Los Angeles, CA, USA" 
    corresponding: true
    email: "cr@ucla.edu"
    orcid: "0000-0002-8435-0416"
bibliography: refs.bib
link-citations: yes
vignette: >
  %\VignetteIndexEntry{A Quick Start Guide to TemporalForest}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
old_ops <- options()
suppressPackageStartupMessages(library(TemporalForest))
knitr::opts_chunk$set(
  collapse   = TRUE,  
  comment    = "#>",   
  fig.width  = 7,
  fig.height = 5,
  message    = FALSE, 
  warning    = FALSE  
)
options(stringsAsFactors = FALSE)
suppressPackageStartupMessages({
  ok_wgcna <- requireNamespace("WGCNA", quietly = TRUE)
})
if (ok_wgcna && "disableWGCNAThreads" %in% getNamespaceExports("WGCNA")) {
  suppressMessages(WGCNA::disableWGCNAThreads())
}
```

## Abstract

The TemporalForest package provides a reproducible method for feature selection in high-dimensional longitudinal data. It combines network analysis, mixed-effects models, and stability selection to identify robust predictors over time. This vignette offers a quick start guide to using the package.

## 1. Introduction

Longitudinal 'omics studies, where subjects are measured repeatedly over time, present unique challenges for feature selection: high dimensionality, temporal dependence, and complex correlations. The `TemporalForest` algorithm addresses these with a multi-stage pipeline that identifies features that are both predictive and stable across bootstrap resamples.

## 2. Installation

Since the package is not yet on CRAN, you can install the development version from GitHub:

```{r eval=FALSE}
# install.packages("remotes")
remotes::install_github("SisiShao/TemporalForest")
```

## 3. Quick Start: Primary Example

This example walks you through a complete analysis with a small, simulated dataset.

### Simulate a Longitudinal Dataset
This small demo is designed to recover all true signals quickly (roughly 1–3 seconds). We simulate a dataset with 60 subjects, 2 time points, and 20 candidate predictors, and inject **3 true signals** into the outcome $Y$ from predictors `V1`, `V2`, and `V3`. To keep the example fast and reliable under CRAN checks, we pass a precomputed dissimilarity matrix to **skip Stage 1 (WGCNA/TOM)**.


```{r}
set.seed(11) # For reproducibility
n_subjects <- 60; n_timepoints <- 2; p <- 20

# Build X (two time points) with matching colnames
X <- replicate(n_timepoints, matrix(rnorm(n_subjects * p), n_subjects, p), simplify = FALSE)
colnames(X[[1]]) <- colnames(X[[2]]) <- paste0("V", 1:p)

# Long view: rbind() stacks all time-1 rows, then all time-2 rows,
# so id and time are built in that same row order
X_long <- do.call(rbind, X)
id     <- rep(seq_len(n_subjects), times = n_timepoints)
time   <- rep(seq_len(n_timepoints), each = n_subjects)

# Strong signal on V1, V2, V3 + modest subject random effect + small noise
u_subj <- rnorm(n_subjects, 0, 0.7)
eps    <- rnorm(length(id), 0, 0.08)
Y <- 4 * X_long[, "V1"] + 3.5 * X_long[, "V2"] + 3.2 * X_long[, "V3"] +
     u_subj[id] + eps

# Lightweight dissimilarity to skip Stage 1 (fast on CRAN)
A <- 1 - abs(stats::cor(X_long)); diag(A) <- 0
dimnames(A) <- list(colnames(X[[1]]), colnames(X[[1]]))
```
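
Before passing a precomputed matrix as `dissimilarity_matrix`, it can help to confirm that it looks like a dissimilarity: square, symmetric, zero on the diagonal, bounded in [0, 1], and named after the predictors. The following is a minimal base R sketch. `is_valid_dissimilarity()` is a hypothetical helper written for illustration, not a function exported by the package, and the checks the package itself enforces may differ.

```r
# Hypothetical helper (not part of TemporalForest): basic sanity checks
# for a correlation-based dissimilarity matrix.
is_valid_dissimilarity <- function(D, feature_names, tol = 1e-8) {
  is.matrix(D) &&
    nrow(D) == ncol(D) &&
    isTRUE(all.equal(D, t(D), tolerance = tol)) &&   # symmetric
    all(abs(diag(D)) < tol) &&                       # zero diagonal
    all(D >= -tol & D <= 1 + tol) &&                 # bounded in [0, 1]
    identical(rownames(D), feature_names) &&
    identical(colnames(D), feature_names)
}

# Small illustrative matrix, built the same way as A above
set.seed(1)
M <- matrix(rnorm(50 * 5), 50, 5, dimnames = list(NULL, paste0("V", 1:5)))
D <- 1 - abs(stats::cor(M))
diag(D) <- 0

is_valid_dissimilarity(D, colnames(M))
```

The same call could be applied to `A` with `colnames(X[[1]])` as the second argument.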

### Run TemporalForest

We call the main function, passing our precomputed `dissimilarity_matrix = A` and asking for 3 features.

```{r}
# Run TemporalForest with minimal settings for the vignette
tf_result <- temporal_forest(
  X = X, Y = Y, id = id, time = time,
  dissimilarity_matrix = A,       # skip WGCNA/TOM (Stage 1)
  n_features_to_select = 3,
  n_boot_screen = 4,              # very low for a quick demo
  n_boot_select = 8,              # very low for a quick demo
  keep_fraction_screen = 1,       # permissive screening
  min_module_size = 2,
  alpha_screen = 0.5,             # permissive screening
  alpha_select = 0.6              # permissive selection
)
```


### Interpret the Results

Examine the selected features and check if the true predictors were found.

```{r}
print(tf_result)
```

```{r}
# Validate against ground truth
true_predictors <- c("V1", "V2", "V3")
cat("True predictors found:", sum(true_predictors %in% tf_result$top_features), 
    "out of", length(true_predictors), "\n")
```

In this high signal-to-noise example, the count printed above should confirm that all three true predictors were recovered.
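
When the ground truth is known, as in a simulation, recall alone can hide false positives, so it is worth reporting precision as well. The sketch below computes both from a vector of selected names. The `selected` vector is made up for illustration and is not output from the package.

```r
# Illustrative values only: 'selected' is a made-up selection, not package output.
truth    <- c("V1", "V2", "V3")
selected <- c("V1", "V3", "V7")

hits      <- intersect(selected, truth)
precision <- length(hits) / length(selected)  # share of selections that are true
recall    <- length(hits) / length(truth)     # share of true signals recovered

round(c(precision = precision, recall = recall), 2)
```

In a real run, `selected` would be replaced by `tf_result$top_features`.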

## 4. How TemporalForest Works

TemporalForest operates in three stages:

1.  **Time-Aware Module Construction:** Groups correlated features into modules that are stable across time points using a consensus topological overlap matrix (TOM).
2.  **Within-Module Screening:** Uses mixed-effects model trees to select the most important predictors from each module while accounting for within-subject correlations.
3.  **Stability Selection:** Applies bootstrapping to calculate selection probabilities, ensuring only the most reproducible features are included in the final set.
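
The idea behind the stability-selection stage can be sketched in base R as bootstrap selection frequencies. This is an illustration of the general technique, not the package's implementation. As a loudly labelled assumption, features are ranked here by absolute marginal correlation with the outcome, a simple stand-in for the mixed-effects model trees. The data and all variable names are made up for the sketch.

```r
# Sketch of bootstrap selection frequencies (Stage 3 idea only).
# Assumption: features are ranked by absolute marginal correlation with the
# outcome, a simple stand-in for the package's mixed-effects model trees.
set.seed(42)
n_subj <- 40; n_time <- 2; p <- 8
subj <- rep(seq_len(n_subj), times = n_time)          # subject ID per row
Z <- matrix(rnorm(n_subj * n_time * p), ncol = p,
            dimnames = list(NULL, paste0("V", 1:p)))
y <- 3 * Z[, "V1"] + 2 * Z[, "V2"] +
  rnorm(n_subj)[subj] + rnorm(nrow(Z), sd = 0.5)      # subject effect + noise

n_boot <- 50; k <- 2
counts <- setNames(numeric(p), colnames(Z))
for (b in seq_len(n_boot)) {
  # Resample whole subjects so repeated measures stay together
  boot_subj <- sample(seq_len(n_subj), replace = TRUE)
  rows <- unlist(lapply(boot_subj, function(s) which(subj == s)))
  score <- abs(stats::cor(Z[rows, ], y[rows]))[, 1]
  top_k <- names(sort(score, decreasing = TRUE))[seq_len(k)]
  counts[top_k] <- counts[top_k] + 1
}

# Selection probability = share of bootstrap samples in which a feature
# ranked in the top k
sel_prob <- sort(counts / n_boot, decreasing = TRUE)
sel_prob
```

Resampling subjects rather than individual rows keeps each subject's repeated measurements together, which is the usual choice when observations within a subject are correlated.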

## 5. Key Parameters Guide

- `n_features_to_select`: Final number of features to return (default: 10).
- `n_boot_screen`, `n_boot_select`: Number of bootstrap samples for the screening and selection stages (defaults: 50, 100). Increase for more stable results.
- `keep_fraction_screen`: Proportion of features from each module passed to final selection (default: 0.25). Increase if too few features are selected.
- `min_module_size`: Minimum size for network modules (default: 4).
- `alpha_screen`, `alpha_select`: Significance levels for splitting in the screening and selection trees (defaults: 0.2, 0.05). Larger values are more permissive.

## 6. Troubleshooting

| Symptom | Likely Cause | Solution |
|---------|--------------|----------|
| No features selected | Screening too strict | Increase `keep_fraction_screen` or `alpha_screen` |
| Too many features selected | Selection too liberal | Decrease `keep_fraction_screen` or `alpha_select` |
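| Long computation time | Large data or many bootstrap samples | Reduce `n_boot_screen` and `n_boot_select`, pre-filter features, or supply a precomputed `dissimilarity_matrix` to skip Stage 1 |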
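
One way to pre-filter features, sketched below in base R, is to keep only the most variable ones before calling the package. Both the variance criterion and the helper name `prefilter_by_variance()` are assumptions made for illustration. Any unsupervised filter that does not use the outcome could be substituted.

```r
# Hypothetical helper (not part of TemporalForest): keep the n_keep features
# with the largest variance across all time points, in every matrix of X.
prefilter_by_variance <- function(X_list, n_keep) {
  stacked <- do.call(rbind, X_list)              # pool rows over time points
  v <- apply(stacked, 2, stats::var)
  keep <- names(sort(v, decreasing = TRUE))[seq_len(min(n_keep, length(v)))]
  keep <- intersect(colnames(stacked), keep)     # restore original column order
  lapply(X_list, function(m) m[, keep, drop = FALSE])
}

# Toy data: V1, V3, and V5 are simulated with the largest spread
set.seed(7)
sds <- c(3, 0.1, 2, 0.1, 1)
make_tp <- function() {
  m <- sapply(sds, function(s) rnorm(30, sd = s))
  colnames(m) <- paste0("V", seq_along(sds))
  m
}
X_small <- prefilter_by_variance(list(make_tp(), make_tp()), n_keep = 3)
colnames(X_small[[1]])
```

If a precomputed dissimilarity matrix is also supplied, it would need to be subset to the same features.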

## 7. Input Data Validation

The package includes checks for proper data formatting. Here's an example of the error message for inconsistent inputs:

```{r error=TRUE}
# This will produce a clear error message
mat1 <- matrix(1:4, nrow=2, dimnames=list(NULL, c("A", "B")))
mat2 <- matrix(1:4, nrow=2, dimnames=list(NULL, c("A", "C")))
bad_X <- list(mat1, mat2)

TemporalForest::check_temporal_consistency(bad_X)
```
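
The kind of check involved can be sketched in a few lines of base R. The helper below, `same_colnames()`, is hypothetical and is not the package's implementation of `check_temporal_consistency()`. It only illustrates comparing every time point's column names against those of the first.

```r
# Hypothetical helper (not the package's implementation): TRUE when every
# matrix in the list has the same column names, in the same order.
same_colnames <- function(X_list) {
  ref <- colnames(X_list[[1]])
  all(vapply(X_list, function(m) identical(colnames(m), ref), logical(1)))
}

m1 <- matrix(1:4, nrow = 2, dimnames = list(NULL, c("A", "B")))
m2 <- matrix(1:4, nrow = 2, dimnames = list(NULL, c("A", "C")))

same_colnames(list(m1, m1))  # consistent inputs
same_colnames(list(m1, m2))  # second matrix has a different column name
```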

## 8. Conclusion

TemporalForest provides an end-to-end solution for reproducible feature selection in longitudinal high-dimensional data. For detailed information on all function parameters and advanced usage, see the function documentation (`?temporal_forest`).

## 9. Citation
To cite TemporalForest in publications, please use:

```{r citation}
citation("TemporalForest")
```

## Session Info
```{r}
sessionInfo()
options(old_ops)
```