Categorical summary tables in R

library(spicy)

table_categorical() builds publication-ready categorical tables suitable for APA-style reporting in social science and data science research. With by, it produces grouped cross-tabulation tables with chi-squared \(p\)-values, effect sizes, confidence intervals, and multi-level headers. Without by, it produces one-way frequency-style tables for the selected variables. Export to gt, tinytable, flextable, Excel, or Word. This vignette walks through the main features.

Basic usage

For grouped tables, provide a data frame, one or more selected variables, and a grouping variable:

table_categorical(
  sochealth,
  select = c(smoking, physical_activity, dentist_12m),
  by = education
)
#> Categorical table by education
#> 
#>  Variable          │ Lower secondary n  Lower secondary %  Upper secondary n 
#> ───────────────────┼─────────────────────────────────────────────────────────
#>  smoking           │                                                         
#>    No              │        179               69.6                415        
#>    Yes             │         78               30.4                112        
#> ╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌
#>  physical_activity │                                                         
#>    No              │        177               67.8                310        
#>    Yes             │         84               32.2                229        
#> ╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌
#>  dentist_12m       │                                                         
#>    No              │        113               43.3                174        
#>    Yes             │        148               56.7                365        
#> 
#>  Variable          │ Upper secondary %  Tertiary n  Tertiary %  Total n 
#> ───────────────────┼────────────────────────────────────────────────────
#>  smoking           │                                                    
#>    No              │       78.7            332         84.9       926   
#>    Yes             │       21.3             59         15.1       249   
#> ╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌
#>  physical_activity │                                                    
#>    No              │       57.5            163         40.8       650   
#>    Yes             │       42.5            237         59.2       550   
#> ╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌
#>  dentist_12m       │                                                    
#>    No              │       32.3             67         16.8       354   
#>    Yes             │       67.7            333         83.2       846   
#> 
#>  Variable          │ Total %    p    Cramer's V 
#> ───────────────────┼────────────────────────────
#>  smoking           │          <.001     .14     
#>    No              │  78.8                      
#>    Yes             │  21.2                      
#> ╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌
#>  physical_activity │          <.001     .21     
#>    No              │  54.2                      
#>    Yes             │  45.8                      
#> ╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌
#>  dentist_12m       │          <.001     .22     
#>    No              │  29.5                      
#>    Yes             │  70.5

The default output is "default", which prints a styled ASCII table to the console. Use output = "data.frame" to get a plain numeric data frame suitable for further processing.

One-way tables

Omit by to build a frequency-style table for the selected variables:

table_categorical(
  sochealth,
  select = c(smoking, physical_activity),
  output = "default"
)
#> Categorical table
#> 
#>  Variable            │   n      %    
#> ─────────────────────┼───────────────
#>  smoking             │               
#>    No                │  926    78.8  
#>    Yes               │  249    21.2  
#> ╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌
#>  physical_activity   │               
#>    No                │  650    54.2  
#>    Yes               │  550    45.8

Output formats

table_categorical() supports several output formats. The table below summarizes the options:

Format Description
"default" Styled ASCII table in the console (default)
"data.frame" Wide data frame, one row per modality
"long" Long data frame, one row per modality x group
"gt" Formatted gt table
"tinytable" Formatted tinytable
"flextable" Formatted flextable
"excel" Excel file (requires excel_path)
"clipboard" Copy to clipboard
"word" Word document (requires word_path)

gt output

The "gt" format produces a table with APA-style borders, column spanners, and proper alignment:

pkgdown_dark_gt(
  table_categorical(
    sochealth,
    select = c(smoking, physical_activity, dentist_12m),
    by = education,
    output = "gt"
  )
)

tinytable output

table_categorical(
  sochealth,
  select = c(smoking, physical_activity),
  by = sex,
  output = "tinytable"
)

Data frame output

Use output = "data.frame" for a wide numeric data frame (one row per modality), or output = "long" for a long format (one row per modality x group):

table_categorical(
  sochealth,
  select = smoking,
  by = education,
  output = "data.frame"
)
#>   Variable Level Lower secondary n Lower secondary % Upper secondary n
#> 1  smoking    No               179              69.6               415
#> 2  smoking   Yes                78              30.4               112
#>   Upper secondary % Tertiary n Tertiary % Total n Total %            p
#> 1              78.7        332       84.9     926    78.8 2.012877e-05
#> 2              21.3         59       15.1     249    21.2 2.012877e-05
#>   Cramer's V
#> 1  0.1356677
#> 2  0.1356677

Custom labels

By default, table_categorical() uses variable names as row headers. Use the labels argument to provide human-readable labels. Two forms are accepted (matching table_continuous() and table_continuous_lm()):

pkgdown_dark_gt(
  table_categorical(
    sochealth,
    select = c(smoking, physical_activity),
    by = education,
    labels = c(
      smoking           = "Smoking status",
      physical_activity = "Regular physical activity"
    ),
    output = "gt"
  )
)

Association measures and confidence intervals

table_categorical() picks the association measure per row variable based on the variable type (assoc_measure = "auto", the default):

When the chosen measures differ across rows, the column header collapses to "Effect size" and an APA-style Note. line documents which measure was used for each variable.

Override with a single string for uniform application, or with a named vector to mix measures per row:

# Uniform: same measure for every row variable
table_categorical(
  sochealth,
  select = smoking,
  by = education,
  assoc_measure = "lambda",
  output = "tinytable"
)
# Per-row: pick the right measure for each variable.
# `smoking` x `education` is 2x3 (binary x ordered) -> Cramer's V;
# `self_rated_health` x `education` is ordered x ordered -> Tau-b.
# The mixed result collapses the header to "Effect size" and adds an
# APA `Note.` line documenting the per-row measure.
table_categorical(
  sochealth,
  select = c(smoking, self_rated_health),
  by = education,
  assoc_measure = c(
    smoking           = "cramer_v",
    self_rated_health = "tau_b"
  ),
  output = "tinytable"
)

Add confidence intervals with assoc_ci = TRUE. In rendered formats (gt, tinytable, flextable), the CI is shown inline:

pkgdown_dark_gt(
  table_categorical(
    sochealth,
    select = c(smoking, physical_activity),
    by = education,
    assoc_ci = TRUE,
    output = "gt"
  )
)

In data formats ("data.frame", "long", "excel", "clipboard"), separate CI lower and CI upper columns are added:

table_categorical(
  sochealth,
  select = smoking,
  by = education,
  assoc_ci = TRUE,
  output = "data.frame"
)
#>   Variable Level Lower secondary n Lower secondary % Upper secondary n
#> 1  smoking    No               179              69.6               415
#> 2  smoking   Yes                78              30.4               112
#>   Upper secondary % Tertiary n Tertiary % Total n Total %            p
#> 1              78.7        332       84.9     926    78.8 2.012877e-05
#> 2              21.3         59       15.1     249    21.2 2.012877e-05
#>   Cramer's V   CI lower  CI upper
#> 1  0.1356677 0.07909264 0.1913716
#> 2  0.1356677 0.07909264 0.1913716

Weighted tables

Pass survey weights with the weights argument. Use rescale = TRUE so the total weighted N matches the unweighted N:

pkgdown_dark_gt(
  table_categorical(
    sochealth,
    select = c(smoking, physical_activity),
    by = education,
    weights = "weight",
    rescale = TRUE,
    output = "gt"
  )
)

Handling missing values

By default, rows with missing values are dropped (drop_na = TRUE). Set drop_na = FALSE to display them as a “(Missing)” category:

pkgdown_dark_gt(
  table_categorical(
    sochealth,
    select = income_group,
    by = education,
    drop_na = FALSE,
    output = "gt"
  )
)

Filtering and reordering levels

Use levels_keep to display only specific modalities. The order you specify controls the display order, which is useful for placing “(Missing)” first to highlight missingness:

pkgdown_dark_gt(
  table_categorical(
    sochealth,
    select = income_group,
    by = education,
    drop_na = FALSE,
    levels_keep = c("(Missing)", "Low", "High"),
    output = "gt"
  )
)

Formatting options

Control the number of digits for percentages, p-values, and the association measure:

pkgdown_dark_gt(
  table_categorical(
    sochealth,
    select = smoking,
    by = education,
    percent_digits = 2,
    p_digits = 4,
    v_digits = 3,
    output = "gt"
  )
)

p_digits drives both the displayed precision of the p column and the small-p threshold (p_digits = 3 -> <.001, p_digits = 4 -> <.0001), matching table_continuous() and table_continuous_lm().

Decimal alignment

By default (align = "decimal") numeric columns are aligned on the decimal mark, the standard scientific-publication convention used by SPSS, SAS, LaTeX siunitx, and the native primitives of gt::cols_align_decimal() and tinytable::style_tt(align = "d"). Engines without a native primitive (flextable, word, clipboard, ASCII print) get the alignment via leading / trailing space padding, with flextable / word switching the body font to Consolas so character widths match.

Pass align = "auto" to revert to the legacy uniform right-alignment used in spicy < 0.11.0:

table_categorical(
  sochealth,
  select = c(smoking, physical_activity),
  by = sex,
  align = "auto"
)
#> Categorical table by sex
#> 
#>  Variable          │ Female n  Female %  Male n  Male %  Total n  Total %     p 
#> ───────────────────┼────────────────────────────────────────────────────────────
#>  smoking           │                                                       .713 
#>    No              │      475      78.4     451    79.3      926     78.8       
#>    Yes             │      131      21.6     118    20.7      249     21.2       
#> ╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌
#>  physical_activity │                                                       .832 
#>    No              │      334      53.9     316    54.5      650     54.2       
#>    Yes             │      286      46.1     264    45.5      550     45.8       
#> 
#>  Variable          │ Phi 
#> ───────────────────┼─────
#>  smoking           │ .01 
#>    No              │     
#>    Yes             │     
#> ╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌
#>  physical_activity │ .01 
#>    No              │     
#>    Yes             │

"center" and "right" apply literal alignment.

Tidying for downstream pipelines

table_categorical() returns an object that can be coerced to a plain data.frame / tbl_df (stripping the spicy formatting attributes) or piped into broom::tidy() / broom::glance() for use with gtsummary, modelsummary, parameters, or any other tidyverse-stats workflow:

out <- table_categorical(
  sochealth,
  select = c(smoking, physical_activity),
  by = sex
)
#> Categorical table by sex
#> 
#>  Variable          │ Female n  Female %  Male n  Male %  Total n  Total %   p   
#> ───────────────────┼────────────────────────────────────────────────────────────
#>  smoking           │                                                       .713 
#>    No              │   475       78.4     451     79.3     926     78.8         
#>    Yes             │   131       21.6     118     20.7     249     21.2         
#> ╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌
#>  physical_activity │                                                       .832 
#>    No              │   334       53.9     316     54.5     650     54.2         
#>    Yes             │   286       46.1     264     45.5     550     45.8         
#> 
#>  Variable          │ Phi 
#> ───────────────────┼─────
#>  smoking           │ .01 
#>    No              │     
#>    Yes             │     
#> ╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌
#>  physical_activity │ .01 
#>    No              │     
#>    Yes             │

# One row per (variable x level x group) with broom-style columns
# (outcome, level, group, n, proportion). The synthetic Total
# margin is excluded so each observation is counted once.
broom::tidy(out)
#> # A tibble: 8 × 5
#>   outcome           level group      n proportion
#>   <chr>             <chr> <chr>  <int>      <dbl>
#> 1 smoking           No    Female   475      0.784
#> 2 smoking           No    Male     451      0.793
#> 3 smoking           Yes   Female   131      0.216
#> 4 smoking           Yes   Male     118      0.207
#> 5 physical_activity No    Female   334      0.539
#> 6 physical_activity No    Male     316      0.545
#> 7 physical_activity Yes   Female   286      0.461
#> 8 physical_activity Yes   Male     264      0.455

# One row per outcome with the omnibus chi-squared test and the
# chosen association measure (test_type, statistic, df, p.value,
# assoc_type, assoc_value, assoc_ci_lower / assoc_ci_upper, n_total).
broom::glance(out)
#> # A tibble: 2 × 10
#>   outcome           test_type   statistic    df p.value assoc_type assoc_value
#>   <chr>             <chr>           <dbl> <int>   <dbl> <chr>            <dbl>
#> 1 physical_activity chi_squared    0.0452     1   0.832 Phi            0.00614
#> 2 smoking           chi_squared    0.136      1   0.713 Phi            0.0107 
#> # ℹ 3 more variables: assoc_ci_lower <dbl>, assoc_ci_upper <dbl>, n_total <int>

Exporting to Excel, Word, or clipboard

For Excel export, provide a file path:

table_categorical(
  sochealth,
  select = c(smoking, physical_activity, dentist_12m),
  by = education,
  output = "excel",
  excel_path = "my_table.xlsx"
)

For Word, use output = "word":

table_categorical(
  sochealth,
  select = c(smoking, physical_activity, dentist_12m),
  by = education,
  output = "word",
  word_path = "my_table.docx"
)

You can also copy directly to the clipboard for pasting into a spreadsheet or a text editor:

table_categorical(
  sochealth,
  select = c(smoking, physical_activity),
  by = education,
  output = "clipboard"
)