---
title: "Evidence-based Bayesian disaggregation"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Evidence-based Bayesian disaggregation}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```

## What this package does

Given an observed **aggregate** price index $\mathrm{cpi}_t$ and a matrix of
(known) sectoral aggregation weights $W_{t,k}$, given as value-added (VAB)
shares, the goal is to recover the $K$ latent **sectoral** price indices
$\varphi_{t,k}$ that the aggregate is made of. The sectoral indices then feed a
downstream nested Ornstein–Uhlenbeck model (`bayesianOU`) as the market price
$\varphi$.

The disaggregation is genuinely Bayesian: the aggregate enters as **evidence**
(an observation density), and the sectoral indices come out as a **posterior**
with credible intervals, not as a single deterministic re-weighting.

## Why "evidence-based": the F1–F6 history

The 0.1.x family advertised "MCMC-free Bayesian disaggregation", but the
aggregate CPI never entered the computation (F1): the "posterior" was derived
from the prior weight matrix alone, the Dirichlet concentration cancelled on
renormalization (F2), the temporal pattern cancelled too (F3), an "efficiency"
term was a fixed constant (F4), there were no recovery tests (F5), and a
correlation helper opportunistically picked whichever of Pearson/Spearman was
larger (F6). That foundational defect (not using the data) cannot be patched
within a deterministic re-weighting; the fix *is* a model that conditions on
the aggregate. The deterministic family has been removed; two honest Bayesian
engines replace it.

## The model (state-space, "Model A")

Latent state in logs, with a random walk plus drift and partial pooling:

$$
\log \varphi_{t,k} = \log \varphi_{t-1,k} + \delta_k + \tau_k\,\eta_{t,k},
\qquad \eta_{t,k}\sim\mathcal N(0,1),
$$

with $\delta_k \sim \mathcal N(\delta_\mu,\delta_\sigma)$ and
$\log\tau_k \sim \mathcal N(\mu_{\log\tau}, \sigma_{\log\tau})$ (the drift and
the innovation scale are pooled across sectors). The cross-section at $t=1$ is
anchored at the aggregate level with an **estimable** dispersion
$\omega_{\text{struct}}$ (the real concentration the old Dirichlet $\gamma$
failed to be):

$$
\log\varphi_{1,k} = \log(\text{phi1\_center}) + \omega_{\text{struct}}\,z_k .
$$

The aggregate is the genuine observation:

$$
\mathrm{cpi}_t \sim \mathrm{Student\text{-}t}\!\left(\nu,\ \textstyle\sum_k W_{t,k}\,\varphi_{t,k},\ \sigma\right),
$$

(Gaussian if `student_obs = FALSE`).

### Identification, honestly (rigour by layers)

The **aggregate** $\sum_k W\varphi$ is *strongly* identified by the observation
density. The **per-sector** split is only *weakly* identified: at each period
one linear combination of the $K$ sectors is pinned by the CPI, and the
remaining $K-1$ directions are governed by the cross-sectional prior plus
temporal smoothness. So the per-sector intervals are honestly **wide and
prior-influenced**. This is not a defect to hide; it is the correct
uncertainty, and it is precisely why we feed the *full posterior draws* (not a
point estimate) to the OU by multiple imputation: the sectoral uncertainty is
propagated, not faked away.

## Two engines, one trade-off

* **Closed-form (conjugate): `disaggregate_conjugate()`**. A linear-Gaussian
  random walk in *levels* with the same aggregate observation; its exact
  posterior is the Kalman filter + RTS smoother, with no MCMC. Joint posterior
  draws come from the Durbin–Koopman simulation smoother. This is the correct
  realization of the original "MCMC-free posterior" idea.
* **MCMC: `disaggregate_statespace()`**.
  The richer model above (log scale ⇒ positivity, Student-t ⇒ robustness to
  aggregate outliers, hierarchical pooling), which is *not* conjugate and
  therefore needs HMC.

Both are Bayesian. Closed form buys speed and exactness at the cost of a
simpler (Gaussian, linear) model; MCMC buys richness at the cost of sampling.

## A runnable example (closed-form engine)

```{r conjugate, eval = requireNamespace("BayesianDisaggregation", quietly = TRUE)}
library(BayesianDisaggregation)
sim <- simulate_disagg(T = 30, K = 4, seed = 1)   # synthetic CPI + VAB weights
bl  <- disaggregate_conjugate(sim$cpi, sim$W, n_draws = 100, seed = 1)
bl

## the smoothed aggregate tracks the CPI tightly (aggregate is well identified)
round(cor(bl$agg_summary[, "median"], sim$cpi), 4)

## joint posterior draws: the [T, K, draws] contract consumed by the nested OU
dim(bl$phi_draws)
```

## The MCMC engine (sketch, not evaluated here)

```{r statespace, eval = FALSE}
fit <- disaggregate_statespace(sim$cpi, sim$W, chains = 4, iter = 2000, warmup = 1000)
fit$diagnostics        # rhat_max, divergences
dim(fit$phi_draws)     # T x K x draws
str(fit$phi_summary)   # median, q2.5, q97.5 (T x K each)

## couple to the nested OU (uncertainty propagated by Rubin's rule):
## bayesianOU::fit_ou_nested_mi(phi_draws = fit$phi_draws, X = Phi_index, ...)
```

From Excel directly, reusing the bundled readers:

```{r fromfiles, eval = FALSE}
cpi_file <- system.file("extdata", "CPI.xlsx", package = "BayesianDisaggregation")
w_file   <- system.file("extdata", "WEIGHTS.xlsx", package = "BayesianDisaggregation")
fit <- disaggregate_from_files(cpi_file, w_file, chains = 2, iter = 1000)
```

## Data note: comparing index vs index

The model is about **index levels**, so the CPI must be a level series (FRED
units "Index source base", aggregation "Average" for annual data, never a
rate of change), re-indexed to the **same base** as the production prices it
will be compared against (e.g. 1982–1984 = 100 via the project's
`convert_to_index`).
Feeding a percent-change series here is a category error: the aggregate would
not be on the same scale as $\sum_k W\varphi$.

## Coupling to the nested OU

`disaggregate_statespace()$phi_draws` (or
`disaggregate_conjugate(..., n_draws = M)$phi_draws`) is a `[T, K, M]` array:
exactly the multiple-imputation input of `bayesianOU::fit_ou_nested_mi()`. The
OU refits once per imputation and combines the analyses by Rubin's rule, so
the disaggregation uncertainty becomes part of the OU posterior.
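
For reference, the pooling step can be written out. What follows is the
standard textbook form of Rubin's rules for a scalar OU estimand $\theta$,
offered as a sketch: the symbols $\hat\theta_m$, $U_m$, $\bar\theta$, $\bar U$,
$B$ and $V$ are notation introduced here, and whether `fit_ou_nested_mi()`
pools in exactly this way is an assumption, not something this vignette
specifies. With $M$ imputations, where refit $m$ yields the estimate
$\hat\theta_m$ and its variance $U_m$:

$$
\bar\theta = \frac{1}{M}\sum_{m=1}^{M}\hat\theta_m, \qquad
\bar U = \frac{1}{M}\sum_{m=1}^{M} U_m, \qquad
B = \frac{1}{M-1}\sum_{m=1}^{M}\bigl(\hat\theta_m-\bar\theta\bigr)^2, \qquad
V = \bar U + \Bigl(1+\frac{1}{M}\Bigr)B .
$$

The between-imputation term $B$ is where the disaggregation uncertainty
enters: if every draw of $\varphi$ gave the same OU estimate, $B$ would be zero
and $V$ would reduce to the within-fit variance $\bar U$.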