--- title: "MAIHDA for Binary Outcomes (Discriminatory Accuracy)" author: "Hamid Bulut" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{MAIHDA for Binary Outcomes (Discriminatory Accuracy)} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 4.5 ) ``` ## Why binary outcomes? Many of the original MAIHDA applications target a **binary** health outcome -- obese vs. not, hypertensive vs. not, screened vs. not. Merlo's framing of *discriminatory accuracy* is, at heart, a question about a binary classifier: how well do the intersectional strata alone separate people who do and do not have the outcome? For a binary outcome the model is a multilevel **logistic** regression: \[ \text{logit}\,\Pr(y_{ij} = 1) = \beta_0 + u_j, \qquad u_j \sim N(0, \sigma_u^2), \] where \(j\) indexes the intersectional stratum. MAIHDA reads off two complementary quantities from this model: - the **VPC/ICC**, the share of the (latent) variation that lies *between* strata; - the **discriminatory accuracy** (e.g. the AUC / C-statistic), how well stratum membership predicts the individual outcome. A high between-stratum VPC can still go with only *moderate* discriminatory accuracy at the individual level -- that contrast is the whole point of the "DA" in MAIHDA, and this vignette shows how to get both numbers. ```{r load} library(MAIHDA) data("maihda_health_data") # A two-level outcome: obese (Yes) vs. not (No) table(maihda_health_data$Obese) ``` ## Fitting a logistic MAIHDA model `fit_maihda()` detects a two-level outcome automatically. If you leave `family` at its default, the function switches to `family = "binomial"`, recodes the response to 0/1 for `glmer()`, and warns you so the change is never silent: ```{r autodetect} model_null <- fit_maihda( Obese ~ 1 + (1 | Gender:Race:Education), data = maihda_health_data ) ``` The warning is a feature, not a problem. To be explicit (and to silence the warning), pass `family = "binomial"` yourself -- this is the recommended form for scripts: ```{r explicit, eval=FALSE} model_null <- fit_maihda( Obese ~ 1 + (1 | Gender:Race:Education), data = maihda_health_data, family = "binomial" ) ``` > If you actually want a linear probability model on a 0/1 outcome, ask for it > explicitly with `family = "gaussian"`; the auto-switch only fires for the > default family. ## The VPC is on the latent scale `summary()` reports the VPC/ICC for the logistic model. There is no observed-scale residual variance for a Bernoulli outcome, so MAIHDA uses the standard **latent-scale** approximation: the level-1 (within-stratum) variance is fixed at \(\pi^2/3 \approx 3.29\) for the logit link (and the corresponding constant for a probit link). The VPC is therefore the between-stratum share of variation on that underlying latent scale, not on the probability scale. ```{r summary-null} summary_null <- summary(model_null) print(summary_null) ``` Read from this null model, the VPC is the **total** between-stratum share of latent variation in the odds of obesity. As in the Gaussian case, the stratum random effects capture the *combined* additive + interaction differences across strata; they isolate the *pure* intersectional (interaction) component only once the additive main effects of the strata variables are in the model. > For a binomial model `summary()` now reports the discriminatory accuracy (AUC / > Median Odds Ratio) automatically, so the printed summary above already carries that > block -- the [Discriminatory accuracy](#discriminatory-accuracy-auc-and-median-odds-ratio) > section below explains how to read it and how to obtain the pieces on their own. For an interpretable **probability-scale** complement to the latent-scale VPC, the package provides `maihda_vpc_response()`, which estimates the response-scale VPC by simulation (Goldstein, Browne & Rasbash 2002): ```{r vpc-response} maihda_vpc_response(model_null, seed = 1) ``` Report it *alongside* -- not instead of -- the latent-scale VPC: it depends on the overall prevalence, and for adjusted models it is evaluated at the mean covariate profile, so it is best read from the null model. ## Adjusted model and PCV A "Model 2" adds individual-level covariates to ask how much of the between-stratum variation they account for. Here we adjust for age, fit on the **same analytic sample and strata** as the null model, and compare with `calculate_pvc()`. PCV compares variances across models, so both must use the same complete-case sample: ```{r adjusted} health_complete <- maihda_health_data[complete.cases( maihda_health_data[, c("Obese", "Age", "Gender", "Race", "Education")] ), ] model_null2 <- fit_maihda( Obese ~ 1 + (1 | Gender:Race:Education), data = health_complete, family = "binomial" ) # Model 2: adjust for an individual-level covariate (age) model_adj <- fit_maihda( Obese ~ Age + (1 | Gender:Race:Education), data = health_complete, family = "binomial" ) pcv <- calculate_pvc(model_null2, model_adj) print(pcv) ``` The same caveats as in the continuous case apply, with one extra wrinkle: for non-Gaussian models the latent residual variance is fixed by the link, so a change in the between-stratum variance is partly a **rescaling** of the latent scale, not only "variance explained". Interpret the PCV as a model-dependent, descriptive change, not a causal decomposition. > **A note on adjusting for the strata's own categories.** If instead you add the > *categorical* main effects that define the strata (e.g. > `Obese ~ Gender + Race + Education + ...`) to recover the additive-vs-interaction > split shown in the [introduction](introduction.html), be aware that in a logistic > model those fixed effects are nearly collinear with the stratum random intercept, > so `glmer()` often reports a convergence or "nearly unidentifiable" note. Scaling > covariates and using `control = lme4::glmerControl(optimizer = "bobyqa")` usually > helps. ## Discriminatory accuracy (AUC and Median Odds Ratio) The VPC summarises *variation*; discriminatory accuracy summarises *prediction*. For a binomial model this is reported **automatically**: `summary()` of a binomial `maihda_model` carries a `discriminatory_accuracy` slot, and `maihda(..., family = "binomial")` surfaces it on its summaries and headline `print()`. The explicit `maihda_discriminatory_accuracy()` below is the same quantity on its own -- it bundles the two individual-level summaries for a logistic MAIHDA model: the **AUC / C-statistic** (how well the predicted probabilities separate cases from non-cases) and the **Median Odds Ratio (MOR)** (the between-stratum heterogeneity expressed on the odds-ratio scale). The strata-only model's AUC is the discriminatory accuracy of the intersectional strata *themselves* -- Merlo's central quantity. Comparing it with the adjusted model shows whether individual information beyond stratum membership sharpens classification: ```{r da} da_null <- maihda_discriminatory_accuracy(model_null2) da_adj <- maihda_discriminatory_accuracy(model_adj) da_null da_adj ``` You can also call the pieces directly: `maihda_auc(prob, y)` on any vector of predicted probabilities and 0/1 outcomes (it equals the Mann-Whitney U statistic, so it needs no extra package), and `maihda_mor(model)` for the Median Odds Ratio. ```{r auc-direct} prob_null <- predict_maihda(model_null2, type = "individual", scale = "response") y_obs <- as.numeric(lme4::getME(model_null2$model, "y")) maihda_auc(prob_null, y_obs) ``` An AUC of 0.5 is chance. Here both models sit around 0.6: even a non-trivial between-stratum VPC translates into only **modest** accuracy at the *individual* level -- exactly the cautionary message that motivates discriminatory-accuracy reporting in MAIHDA. Note the adjusted model barely moves the AUC: the categorical covariates that *define* the strata are already captured by stratum membership, so only the continuous covariate (`Age`) adds genuinely new individual-level information. > **The MOR needs the logit link.** The Median Odds Ratio is an odds-ratio-scale > quantity, so it is defined only for `binomial(link = "logit")` (the default). For a > `binomial(link = "probit")` fit, `maihda_mor()` errors and > `maihda_discriminatory_accuracy()` reports the AUC with `mor = NA` -- the AUC, > being rank-based, is link-agnostic. ## Plots adapt to the binomial family The standard plots recognise the binomial family and switch to the probability scale and to deviance-based diagnostics automatically: ```{r plot-predicted} # Predicted probabilities per stratum with intervals plot(model_adj, type = "predicted") ``` ```{r plot-vpc} # Latent-scale variance partition plot(model_adj, type = "vpc") ``` ```{r plot-prediction-deviation, warning = FALSE} # For binomial fits the dashboard highlights the largest absolute # deviance residuals rather than raw deviations from the mean. plot(model_adj, type = "prediction_deviation") ``` See the [plot interpretation vignette](interpreting_plots.html) for how to read each of these. ## Count outcomes work the same way A Poisson (count) outcome follows the identical pattern -- pass `family = "poisson"` to `fit_maihda()`. The VPC then uses the log-link latent-scale residual variance, and the summary, PCV, and plotting helpers all behave as above. ## References - Merlo, J. (2018). Multilevel analysis of individual heterogeneity and discriminatory accuracy (MAIHDA) within an intersectional framework. *Social Science & Medicine*, 203, 74-80. - Evans, C. R., Williams, D. R., Onnela, J. P., & Subramanian, S. V. (2018). A multilevel approach to modeling health inequalities at the intersection of multiple social identities. *Social Science & Medicine*, 203, 64-73. - Merlo, J., Wagner, P., Ghith, N., & Leckie, G. (2016). An original stepwise multilevel logistic regression analysis of discriminatory accuracy: the case of neighbourhoods and health. *PLOS ONE*, 11(4), e0153778.