---
title: "Data Quality Assessment"
author:
  - Matthias Templ^[University of Applied Sciences Northwestern Switzerland (FHNW)]
  - Barbara Templ^[Swiss Federal Institute for Forest, Snow and Landscape Research (WSL)]
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
bibliography: references.bib
vignette: >
  %\VignetteIndexEntry{Data Quality Assessment}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

*Vignette 2 of 4 for the **pep725** R package. The most current version is available on [GitHub](https://github.com/matthias-da/pep725) and [CRAN](https://CRAN.R-project.org/package=pep725).*

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5
)
```

## Introduction

Data quality assessment is a critical but often overlooked step in phenological analysis. Before calculating trends, normals, or climate sensitivities, you need to understand what's in your data and make informed decisions about unusual observations.

### The Data Quality Challenge

Long-term phenological datasets such as PEP725 are compiled from many observers over decades. This richness enables large-scale analyses, but it also introduces data quality challenges that need to be assessed:

| Issue Type | Examples | Consequence if Ignored |
|------------|----------|------------------------|
| **Recording errors** | Typographical errors (DOY 150 instead of 105), incorrect year | Outliers bias statistics |
| **Identification errors** | Confusion between morphologically similar species | Inconsistent or mixed time series |
| **Protocol changes** | Changes in the definition of phenological phases over time | Artificial shifts or trends |
| **Biological anomalies** | Irregular phenological events as a result of climate extremes | May represent real biological signals |

The following sections introduce practical tools in the pep725 package for diagnosing and handling these issues in a transparent and reproducible way.
### What You'll Learn

| Section | Topic | Key Functions |
|---------|-------|---------------|
| Part 1 | Detecting statistical outliers | `pep_flag_outliers()`, `pep_plot_outliers()` |
| Part 2 | Temporal coverage and completeness | `pep_completeness()` |
| Part 3 | Phase presence validation | `pep_check_phases()` |
| Part 4 | Integrated workflow | Combining all approaches |

### Prerequisites

This vignette assumes you are familiar with:

- Basic phenological concepts (from the "Getting Started" vignette)
- Day of Year (DOY) as a timing metric
- BBCH phenological phase codes

## Setup

```{r setup, message=FALSE, warning=FALSE}
library(pep725)
library(data.table)

# Download the synthetic dataset
pep <- pep_download()

# For this vignette, we'll focus on flowering phases
flowering <- pep[phase_id %in% c(60, 65)]

cat("Flowering observations:", nrow(flowering), "\n")
cat("Species:", length(unique(flowering$species)), "\n")
cat("Year range:", min(flowering$year), "-", max(flowering$year), "\n")
```

---

# Part 1: Outlier Detection

## Why Detect Outliers?

An **outlier** is an observation that differs substantially from the expected pattern. In phenological data, outliers may indicate:

1. **Recording errors** that should be excluded from analysis
   - A typo: DOY 250 instead of DOY 150 (roughly 3 months difference)
   - Wrong year entered: observation assigned to incorrect season
2. **Unusual weather events** worth investigating
   - Very warm winter causing early flowering
   - Late frost delaying spring phenology
3. **Second flowering** or other abnormal phenological events
   - Plants flowering again in autumn/winter after spring flowering
   - Increasingly observed with climate change

**The challenge**: You can't simply delete all outliers. Some are errors, but others are scientifically valuable observations of unusual events.

## Statistical Methods for Outlier Detection

The `pep_flag_outliers()` function provides six statistical methods for identifying potential outliers.
Each method has different assumptions and strengths:

### Method 1: 30-Day Rule (Simple Threshold)

The simplest approach: flag any observation more than 30 days from the median for that species/phase combination.

```{r flag-basic}
# Flag outliers using the default 30-day rule
outliers <- pep_flag_outliers(
  pep = flowering,
  method = "30day",
  by = c("species", "phase_id")
)

print(outliers)
```

**How it works:**

```
Distribution of flowering dates

 ← Early                       Late →
----|----[=======|=======]----|-----
    ^            ^            ^
 -30 days      median      +30 days

Anything outside the brackets is flagged as an outlier
```

### Understanding the Output

The function adds new columns to your original data:

| Column | What It Contains | How to Use It |
|--------|------------------|---------------|
| `is_outlier` | TRUE/FALSE flag | Filter data: `data[is_outlier == FALSE]` |
| `deviation` | Days from expected | Prioritize investigation: larger = more extreme |
| `expected_doy` | Reference DOY (median) | Understand what "normal" looks like |

### Method 2: MAD (Median Absolute Deviation) - Recommended

The MAD method is **robust to the presence of multiple outliers**. This is important because if your data contains many outliers, they can distort mean-based methods.
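Before calling the package function, the robustness claim is easy to verify with base R alone. The snippet below uses made-up flowering dates (purely illustrative, not PEP725 data): two injected recording errors roughly double the standard deviation, while the MAD barely moves.

```r
# Synthetic flowering dates (DOY), centred on roughly DOY 120.
set.seed(42)
doy_clean <- round(rnorm(100, mean = 120, sd = 7))

# Inject two gross recording errors (e.g. DOY 210 typed instead of 120).
doy_dirty <- c(doy_clean, 210, 215)

# The SD inflates under contamination; the MAD stays essentially unchanged.
c(sd_clean = sd(doy_clean),   sd_dirty = sd(doy_dirty))
c(mad_clean = mad(doy_clean), mad_dirty = mad(doy_dirty))

# Consequence for flagging: the SD-based upper fence is dragged outwards
# by the very outliers it should catch, while the MAD-based fence stays put.
median(doy_dirty) + 3 * mad(doy_dirty)  # MAD-based upper fence
mean(doy_dirty)   + 3 * sd(doy_dirty)   # SD-based upper fence (inflated)
```

Because MAD-based fences stay put under contamination, the flagged set does not shrink as the data get dirtier - which is exactly the failure mode of the z-score method described below.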
```{r flag-methods, eval=FALSE}
# MAD method: flag if > 3 MAD from median (robust)
outliers_mad <- pep_flag_outliers(flowering, method = "mad", threshold = 3)
```

**Why MAD is recommended:**

| Scenario | Mean-based method | MAD method |
|----------|-------------------|------------|
| 1-2 outliers | Works reasonably | Works well |
| Many outliers | Outliers inflate SD, may miss more | Stays robust |
| Asymmetric data | Assumes symmetry | Handles asymmetry |

### Method 3: IQR (Interquartile Range) - Also Robust

The IQR method uses quartiles and is familiar from boxplot "whiskers":

```{r iqr-method, eval=FALSE}
# IQR method: flag if outside 1.5 * IQR (standard boxplot rule)
outliers_iqr <- pep_flag_outliers(flowering, method = "iqr", threshold = 1.5)
```

**How it works:**

- Q1 = 25th percentile, Q3 = 75th percentile
- IQR = Q3 - Q1
- Outlier if: value < Q1 - 1.5×IQR OR value > Q3 + 1.5×IQR

### Method 4: Z-Score - Sensitive but Less Robust

The z-score method assumes normally distributed data:

```{r zscore-method, eval=FALSE}
# Z-score method: flag if |z| > 3 (assumes normal distribution)
outliers_zscore <- pep_flag_outliers(flowering, method = "zscore", threshold = 3)
```

**Caution**: The z-score method is sensitive to existing outliers. If your data has many outliers, they will inflate the standard deviation, making the z-score threshold less effective at detecting them.

### Method 5: GAM Residual - Model-based, Covariate-aware

The four methods above are all **univariate within a group** — they compare each observation to the central tendency of its own station-species-phase combination. That misses three realistic outlier patterns:

1. **Year-effect outliers.** In a year where most of central Europe bloomed 8 days earlier, a station that flowered on its 30-year median looks "late" only because the reference ignores the shared weather signal.
2.
**Covariate-inconsistent reports.** A lowland station that reports a DOY typical of a mountain station is invisible to a within-station detector because the station's own history may be short.
3. **Trend-inconsistent reports.** A station whose DOY is constant while every other station in the network has a trend is anomalous only in relation to the trend.

`method = "gam_residual"` tackles all three. For each group (typically species × phase), it fits a GAM of DOY on year, altitude, latitude, and a station random intercept (customisable via `formula =`) and flags observations whose residual — standardised by a robust MAD scale — exceeds the threshold:

```{r gam-method, eval=FALSE}
# Model-based detection across the whole species x phase group.
outliers_gam <- pep_flag_outliers(
  apple_flowering,
  by = c("genus", "species", "phase_id"),
  method = "gam_residual",
  threshold = 3.5
)
```

The `deviation` column now holds GAM residuals in days (not deviations from a within-station median), and `expected_doy` holds the fitted values — so the station-year whose residual is 20 days above the fit is flagged whether or not its station's history made that value look "normal". Groups with fewer observations than `min_n_per_group` (default 50), or where the GAM fails to converge, silently fall back to the 30-day rule.

### Method 6: Mahalanobis - Multivariate, Robust (MCD)

All methods above flag observations one at a time. A station-year could have an entirely reasonable BBCH 60 date and an entirely reasonable BBCH 65 date, but only **one day apart** — biologically impossible, because first flowering precedes full flowering by ~10 days. `method = "mahalanobis"` catches this by treating each station-year as a *vector* across phases and computing the **robust Mahalanobis distance** of that vector from the centre of the rest of the network.

"Robust" matters: the classical Mahalanobis distance uses the sample mean and covariance, which are themselves pulled by outliers (masking).
We use the **Minimum Covariance Determinant** estimator (Rousseeuw 1984, `robustbase::covMcd()`), which gives a high-breakdown robust centre and scatter, so the outliers stay *outside* the robust ellipsoid instead of contaminating it:

```{r mahalanobis-method, eval=FALSE}
# Multivariate detection on the phase profile per station-year.
# by = c("genus", "species") pools across phases so the covariance
# is estimated from the joint (BBCH 60, BBCH 65, BBCH 89) distribution.
outliers_mahal <- pep_flag_outliers(
  apple,
  by = c("genus", "species"),
  method = "mahalanobis"
)
```

The default threshold is `sqrt(qchisq(0.975, df = p))`, where *p* is the number of phases — the standard chi-square containment cut-off under multivariate normality. Flagged station-years get all their phase rows marked; `deviation` stores the robust MD (not a day-count, so it is unitless).

### Choosing a Method

| Method | Best For | Threshold Meaning |
|--------|----------|-------------------|
| `30day` | Simple, interpretable threshold | Absolute days from median |
| `mad` | Most situations (recommended for single-phase) | Number of MADs from median |
| `iqr` | Familiar boxplot-style detection | IQR multiplier |
| `zscore` | Clean data, normal distribution | Standard deviations |
| `gam_residual` | Year / elevation / station covariates available | Robust z on GAM residuals |
| `mahalanobis` | Multiple phases per station-year | `sqrt(chi^2_{0.975, p})` default |

### Summary Statistics

```{r flag-summary}
summary(outliers)
```

### Comparing Outlier Rates Across Species

Different species may have different outlier rates due to observation difficulty or data quality:

```{r flag-compare}
# Check outliers for grapevine
vine_flowering <- pep[species == "Vitis vinifera" & phase_id %in% c(60, 65)]

if (nrow(vine_flowering) > 0) {
  outliers_vine <- pep_flag_outliers(
    pep = vine_flowering,
    method = "30day",
    by = c("species", "phase_id")
  )
  cat("Outlier comparison:\n")
  cat("All flowering species: ", round(100 *
mean(outliers$is_outlier), 2), "%\n")
  cat("Grapevine only:        ", round(100 * mean(outliers_vine$is_outlier), 2), "%\n")
} else {
  cat("No grapevine flowering data available for comparison.\n")
}
```

## Visualizing Outliers

Visual inspection is essential for understanding outlier patterns. The `pep_plot_outliers()` function provides several visualization types — four for the univariate methods plus two targeted at the model-based and multivariate methods:

### 1. Overview Plot

Shows the big picture: how many outliers, which species/phases are affected:

```{r plot-overview, fig.height=6}
# Overview of outlier patterns
pep_plot_outliers(outliers, type = "overview")
```

**What to look for:**

- Which species have the most outliers? (Might indicate data quality issues)
- Are outliers concentrated in certain phases? (Might indicate definition problems)
- What's the overall outlier rate? (Typical: 1-5% for well-curated data)

### 2. Seasonal Distribution

When in the year do outliers occur?

```{r plot-seasonal, fig.height=5}
# When do outliers occur in the year?
pep_plot_outliers(outliers, type = "seasonal")
```

**Interpretation guide:**

| Outlier Timing | Likely Explanation |
|----------------|-------------------|
| Very early (DOY < 50) | Possible data errors (flowering in January/February unlikely for most species) |
| Somewhat early | Warm winters or Mediterranean locations |
| Somewhat late | Cool springs or high-altitude stations |
| Very late (DOY > 250) | Possible **second flowering** events - investigate! |

### 3. Detailed Context

See outliers alongside normal observations:

```{r plot-detail, fig.height=5}
# See outliers in context of all observations
pep_plot_outliers(outliers, type = "detail", n_top = 15)
```

**This plot shows:**

- Full distribution of observations (gray)
- Flagged outliers (highlighted)
- The `n_top` parameter controls how many species/phases to show

### 4. Geographic Distribution

Where are outliers located?
```{r plot-map, fig.height=5}
# Where are outliers located?
pep_plot_outliers(outliers, type = "map")
```

**What to look for:**

- Clustered outliers in one region? Might indicate local data quality issues
- Outliers at network edges? Might be at environmental limits
- Random scatter? Suggests individual observation errors

### 5. Model Diagnostic (for `gam_residual` / `mahalanobis`)

`type = "diagnostic"` produces a paper-ready 4-panel figure tailored to the detection method. For `gam_residual` the panels are classical GAM diagnostics — residuals vs fitted, Q-Q against Normal, |residual| vs altitude, and a per-station worst-case-residual map. For `mahalanobis` the panels switch automatically to MD-specific diagnostics: sorted MD with the chi-square threshold line, Q-Q of MD² against χ²_*p*, mean/max MD over time (exposes network-wide anomalous epochs), and the per-station worst-case MD map.

```{r plot-diagnostic-gam, eval=FALSE}
# For gam_residual: classical model diagnostics
outliers_gam <- pep_flag_outliers(
  apple_flowering,
  by = c("genus", "species", "phase_id"),
  method = "gam_residual"
)
pep_plot_outliers(outliers_gam, type = "diagnostic")
```

![GAM-residual diagnostic on a synthetic fixture with one injected contamination at station 1, year 2000. Panel A: residuals vs fitted; the injected point appears at residual ≈ +50 days. Panel B: Q-Q against N(0,1); the right-tail departure corresponds to the outlier. Panel C: |residual| vs altitude, showing the model has no altitude bias away from the outlier. Panel D: per-station worst-case residual map.](figures/gam_diagnostic.png)

```{r plot-diagnostic-mahal, eval=FALSE}
# For mahalanobis: MD-specific diagnostics
outliers_mahal <- pep_flag_outliers(
  apple,
  by = c("genus", "species"),
  method = "mahalanobis"
)
pep_plot_outliers(outliers_mahal, type = "diagnostic")
```

![Mahalanobis-specific diagnostic on the same synthetic fixture.
Panel A: sorted robust Mahalanobis distances across all station-years, with the $\sqrt{\chi^2_{0.975,\,3}} \approx 3.06$ threshold as a red dashed line; flagged points are red. Panel B: Q-Q of observed $\mathrm{MD}^2$ against $\chi^2_3$ — upper-tail departure identifies the outlier mass. Panel C: mean (blue) and max (red dotted) MD per year — the spike at year 2000 coincides with the injected anomaly. Panel D: per-station worst-case MD map.](figures/mahalanobis_diagnostic.png)

**What to look for:**

- **Residuals vs fitted (GAM)**: a level trend with no structure is ideal; a funnel or curve signals model misspecification, not outliers.
- **Q-Q (GAM)**: heavy-tailed deviations in the corners are the real outlier signal — light tails mean the threshold is picking up noise.
- **Sorted MD with threshold (Mahalanobis)**: the steep rise at the right-hand end is where flagged station-years sit; a long flat "knee" across the threshold means the threshold is arguably too loose or too strict.
- **Mean / max MD over time (Mahalanobis)**: spikes in the max line identify *years* in which many stations reported biologically inconsistent phase sequences — useful for correlating with known data-collection or instrumental events.

### 6. Phase-profile Plot (primary Mahalanobis figure)

The MD detector flags station-years whose *shape* across phases is atypical, even when each marginal DOY looks fine. `type = "profile"` plots the phase profile as parallel coordinates — one line per station-year, flagged ones in red, with the robust median profile overlaid in black as a reference:

```{r plot-profile, eval=FALSE}
# Phase-profile parallel coordinates (primary Mahalanobis figure).
pep_plot_outliers(outliers_mahal, type = "profile")
```

![Phase-profile parallel-coordinates plot.
Each line is one station-year; the grey envelope shows the natural variability of the network, the red lines are station-years flagged by the robust Mahalanobis detector, and the thick black line is the MCD robust per-phase median profile. Red lines that cross the phases at atypical angles — not just at unusual absolute DOYs — are the classic Mahalanobis flags.](figures/mahalanobis_profile.png)

**How to read it:**

- Every grey line is a "normal-looking" station-year; together they form the envelope of the robust ellipsoid.
- Red lines are flagged. Look where they bend oddly: a station-year whose BBCH 60 → 65 slope is much flatter or steeper than the reference is the classic MD flag.
- The black reference profile is the robust (MCD) per-phase median — the centre of the ellipsoid. Anything strongly parallel to this is safe; anything that crosses multiple phases at unusual angles is suspect.

---

# Part 2: Data Completeness

## Why Check Completeness?

**Data completeness** refers to how many years of observations exist for each station/phase combination. This matters because:

1. **Trend analysis requires continuous data**: Gaps can bias trend estimates
2. **Normals require representative coverage**: WMO recommends 24+ years in a 30-year period
3.
**Missing data isn't random**: Stations often drop out or are added systematically

### Visualizing Completeness Issues

```
Station A: ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●  <- 100% complete
           1990                        2020

Station B: ●●●●●●●●●●○○○○○○○○○○●●●●●●●●●●  <- Gap in middle
           1990                        2020

Station C: ○○○○○○○○○○○○○○○○○○○○●●●●●●●●●●  <- Only recent data
           1990                        2020

Station D: ●●●●●●●●●●●●●●●●●●●●○○○○○○○○○○  <- Discontinued
           1990                        2020
```

Different completeness patterns have different implications:

| Pattern | Implications for Analysis |
|---------|---------------------------|
| Full coverage | Ideal - reliable trends and normals |
| Gap in middle | Careful - may miss important changes |
| Only recent | Cannot compare to historical period |
| Discontinued | Cannot assess recent changes |

## Assessing Completeness

```{r completeness-basic}
# Check completeness by station and phase
# Use year_range to focus on a specific period
completeness <- pep_completeness(
  pep = flowering,
  by = c("s_id", "phase_id"),
  year_range = c(1990, 2020)
)

print(completeness)
```

### Understanding Completeness Metrics

The function calculates several metrics:

| Metric | What It Measures | Interpretation |
|--------|------------------|----------------|
| `n_years` | Total years with data | More = better |
| `completeness_pct` | % of year span covered | 70%+ for trends, 80%+ for normals |
| `year_span` | Total span (year_max - year_min + 1) | Context for completeness |
| `year_min` | Earliest observation | Needed for historical comparison |
| `year_max` | Most recent observation | Needed for current status |

### Summary Statistics

```{r completeness-summary}
summary(completeness)
```

## Filtering by Completeness

Use completeness information to select stations appropriate for your analysis:

```{r completeness-filter}
# Get stations with good coverage (>= 70%)
good_coverage <- completeness[completeness_pct >= 70]
cat("Stations with >= 70% coverage:", nrow(good_coverage), "\n")

# Use these for trend analysis
good_stations <-
unique(good_coverage$s_id)
flowering_complete <- flowering[s_id %in% good_stations]
cat("Observations from complete stations:", nrow(flowering_complete), "\n")
```

### Completeness Thresholds for Different Analyses

| Analysis Type | Minimum Completeness | Reasoning |
|---------------|----------------------|-----------|
| Trend detection | 50% (15+ years) | Need enough points for trend fitting |
| Climate sensitivity | 60% | Need variability across climate conditions |
| 30-year normals | 80% (24+ years) | WMO standard requirement |
| Station comparison | Same time period | Avoid bias from different eras |

## Visualizing Completeness

```{r completeness-plot, fig.height=5}
plot(completeness)
```

---

# Part 3: Phase Presence Validation

## Why Check Phase Presence?

Before starting an analysis, it's important to verify that your data contains the phenological phases you need. The `pep_check_phases()` function validates that expected BBCH phase codes are present in your data.

**Common questions this helps answer:**

- Does my dataset have flowering observations (phase 60)?
- Are flowering and fruit maturity phases both available for apple?
- Which phases are missing for my species of interest?
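Under the hood, a phase-presence check is essentially a set comparison between the BBCH codes you need and the codes that actually occur in your data. A minimal base-R sketch of the idea (the toy data frame below is invented for illustration and is not the real PEP725 structure):

```r
# Toy observations: flowering phases recorded, fruit maturity never reported.
obs <- data.frame(
  species  = "Malus domestica",
  phase_id = c(60, 60, 65, 65, 65)
)

expected <- c(60, 65, 87)  # flowering, full flowering, fruit maturity

present <- intersect(expected, unique(obs$phase_id))
missing <- setdiff(expected, unique(obs$phase_id))

present              # 60 65
missing              # 87 -> any analysis needing fruit maturity is blocked
table(obs$phase_id)  # observation counts per phase
```

`pep_check_phases()` wraps this logic and additionally reports per-phase observation counts (`n_obs`) and an overall `complete` flag.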
## Checking Phase Presence

```{r check-basic}
# Check if expected phases are present for apple
apple <- pep[species == "Malus domestica"]

phase_check <- pep_check_phases(
  pep = apple,
  expected = c(60, 65, 87)  # flowering, full flowering, fruit maturity
)

print(phase_check)
```

### Understanding the Output

The function returns information about phase coverage:

| Output | What It Shows |
|--------|---------------|
| `expected` | The phases you asked for |
| `present` | Which expected phases were found |
| `missing` | Which expected phases are absent |
| `complete` | TRUE if all expected phases are present |
| `n_obs` | Number of observations per phase |

## Checking Multiple Species

To check phase presence across multiple species at once:

```{r check-multi}
# Check phases for multiple species
multi_check <- pep_check_phases_multi(
  pep = pep,
  species_list = c("Malus domestica", "Vitis vinifera"),
  expected = c(60, 65, 87)
)

print(multi_check)
```

### Common Phases to Check

| Plant Type | Common Phases | BBCH Codes |
|------------|---------------|------------|
| Cereals | Heading, Flowering, Harvest | 60, 65, 100 |
| Fruit trees | Flowering, Full flowering, Fruit maturity | 60, 65, 87 |
| Deciduous trees | Leaf unfolding, Flowering, Leaf fall | 11, 60, 95 |

---

# Part 4: Integrated Quality Workflow

## Putting It All Together

Here's a recommended workflow for data quality assessment that combines all the tools covered in this vignette:

```{r workflow}
# ══════════════════════════════════════════════════════════════════════════════
# STEP 1: Assess temporal completeness
# ══════════════════════════════════════════════════════════════════════════════
# Why: Incomplete stations can bias trend estimates and normals

completeness <- pep_completeness(flowering, by = c("s_id", "phase_id"))
good_stations <- completeness[completeness_pct >= 50, s_id]
fl_filtered <- flowering[s_id %in% good_stations]

cat("Kept", length(good_stations), "stations with >= 50% completeness\n")

# ══════════════════════════════════════════════════════════════════════════════
# STEP 2: Flag statistical outliers
# ══════════════════════════════════════════════════════════════════════════════
# Why: Identify observations that deviate from expected patterns

outliers_wf <- pep_flag_outliers(fl_filtered, method = "mad", threshold = 3)

cat("Flagged", sum(outliers_wf$is_outlier), "outliers",
    "(", round(100 * mean(outliers_wf$is_outlier), 1), "% )\n")

# ══════════════════════════════════════════════════════════════════════════════
# STEP 3: Make informed decisions about exclusion
# ══════════════════════════════════════════════════════════════════════════════
# Key principle: Document your decisions!

# Option A: Strict cleaning (for normals calculation)
fl_strict <- outliers_wf[is_outlier == FALSE]

# Option B: Moderate cleaning (keep moderate outliers)
fl_moderate <- outliers_wf[is_outlier == FALSE | abs(deviation) < 60]

cat("Strict cleaning keeps:  ", nrow(fl_strict), "obs\n")
cat("Moderate cleaning keeps:", nrow(fl_moderate), "obs\n")

# ══════════════════════════════════════════════════════════════════════════════
# STEP 4: Proceed with analysis on cleaned data
# ══════════════════════════════════════════════════════════════════════════════

normals <- pheno_normals(fl_moderate, period = 1991:2020, min_years = 5)
cat("Normals calculated for", sum(!is.na(normals$mean_doy)), "groups\n")
```

## Documentation Template

When publishing research, document your quality control decisions:

```
Data Quality Control:

1. Completeness filtering: Excluded stations with < 50% temporal coverage
   (reduced dataset from N to M stations)

2. Outlier detection: Used MAD method with threshold = 3
   (flagged X observations as outliers, Y% of total)

3. Outlier disposition:
   - Excluded: Z observations with deviations > 60 days (likely errors)
   - Retained: W observations identified as abnormal events (e.g., second flowering)

4.
Phase validation: Checked for sequence violations in [species list]
   (identified V cases for manual review)
```

---

# Best Practices Summary

## For Outlier Detection

1. **Use robust methods**: MAD or IQR methods handle multiple outliers better than z-scores. The MAD method is generally recommended.
2. **Check seasonally**: Late-season outliers (DOY > 250) may be biologically meaningful events (e.g. second flowering) rather than errors.
3. **Verify extremes**: For observations with very large deviations (> 60 days), try to check original data sources or contact data providers.
4. **Document decisions**: Record which outliers you exclude and why. This is essential for reproducibility and peer review.
5. **Don't delete automatically**: Human judgment is needed to distinguish errors from genuine unusual events.

## For Abnormal Event Detection

1. **Consider species biology**: Some species (e.g., certain Rosaceae) are more prone to second flowering than others.
2. **Check geographic patterns**: If multiple nearby stations report late events in the same year, this suggests a real regional phenomenon.
3. **Look at weather context**: Link abnormal events to weather extremes - drought is a common trigger.
4. **Distinguish from errors**: Very isolated observations (single station, single year) need verification before treating them as, e.g., second flowering.
5. **Keep for separate analysis**: Abnormal phenological events are scientifically interesting (e.g., as evidence of climate-change impacts) - don't just delete them!

## For Completeness Assessment

1. **Set appropriate thresholds**: Use 80%+ coverage for calculating normals, 50%+ for trend analysis.
2. **Consider gap patterns**: Many small gaps are usually better than one long gap that might coincide with important climate changes.
3. **Check temporal bias**: Ensure you're not comparing stations with data only from early periods to stations with only recent data.
4.
**Document station selection**: Report how many stations met your completeness criteria and how this affected your sample.

## For Phase Presence Checking

1. **Check early**: Run `pep_check_phases()` at the start of your analysis to verify required phases are available in your data.
2. **Know required phases**: Different analyses need different phases - know which BBCH codes you need before starting.
3. **Check across species**: Use `pep_check_phases_multi()` to verify data availability for all species you plan to analyze.

---

## Summary

This vignette covered data quality tools for phenological analysis:

| Function | Purpose | Key Output |
|----------|---------|------------|
| `pep_flag_outliers()` | Identify unusual observations | `is_outlier` flag, `deviation` in days |
| `pep_plot_outliers()` | Visualize outlier patterns | Six plot types: overview, seasonal, detail, map, diagnostic, profile |
| `pep_completeness()` | Assess temporal coverage | Completeness %, gaps, year range |
| `pep_check_phases()` | Check phase presence | Missing phases, observation counts |

### Key Take-Home Messages

1. **Data quality assessment is not optional** - it's a critical step that affects the validity of all downstream analyses.
2. **Not all outliers are errors** - some represent genuine biological phenomena like second flowering that deserve further study.
3. **Document your decisions** - quality control choices should be transparent and reproducible.
4. **Use robust methods** - the MAD method is recommended for phenological data because it handles multiple outliers well.
5. **Visual inspection matters** - plots often reveal patterns that summary statistics miss.
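If you use the Mahalanobis method from Part 1, one number worth recomputing for your own data is its default cut-off, `sqrt(qchisq(0.975, df = p))`. It is plain base R, so you can tabulate the threshold for any number of phases *p*:

```r
# Default robust-Mahalanobis cut-offs for p phases per station-year.
p <- 1:4
cutoff <- sqrt(qchisq(0.975, df = p))
round(setNames(cutoff, paste0("p=", p)), 2)
#  p=1  p=2  p=3  p=4
# 2.24 2.72 3.06 3.34
```

The value for p = 3, about 3.06, matches the threshold line quoted in the Mahalanobis diagnostic figure; the cut-off grows with the number of phases because a p-dimensional vector has more room to drift from the centre by chance.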
## Next Steps

Explore the other vignettes for complementary analyses:

- **Getting Started**: Data access, the `pep` class, and basic exploration

```{r eval = FALSE}
vignette("getting-started", package = "pep725")
```

- **Phenological Analysis**: Normals, anomalies, quality grading, and trends

```{r eval = FALSE}
vignette("phenological-analysis", package = "pep725")
```

- **Spatial Phenological Patterns**: Gradients, synchrony, and mapping

```{r eval = FALSE}
vignette("spatial-patterns", package = "pep725")
```

---

## Session Info

```{r session}
sessionInfo()
```