---
title: "Performance notes"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Performance notes}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```

This vignette collects practical advice for reading large DATASUS files
efficiently with `datasusr`.

## Column selection

The single most effective way to speed up reads is to select only the columns
you need. The C parser skips over unneeded fields at the byte level, so fewer
columns means less allocation and less parsing:

```{r eval = FALSE}
library(datasusr)

# Slow: reads all ~100+ columns
x <- read_datasus_dbc("RDPE2401.dbc")

# Fast: reads only 4 columns
x <- read_datasus_dbc(
  "RDPE2401.dbc",
  select = c("uf_zi", "ano_cmpt", "munic_res", "val_tot")
)
```

## Type inference

When `guess_types = TRUE` (the default), the reader scans every value of each
numeric field to decide whether it fits in an integer or requires a double.
Disabling this skips the scan and relies on the DBF field metadata alone:

```{r eval = FALSE}
x <- read_datasus_dbc("RDPE2401.dbc", guess_types = FALSE)
```

This is safe for most analyses and can save noticeable time on files with
millions of rows.

## Explicit types

When you know exactly which types you need, `col_types` lets you specify them
upfront and skip inference entirely:

```{r eval = FALSE}
x <- read_datasus_dbc(
  "RDPE2401.dbc",
  select = c("uf_zi", "dt_inter", "val_tot"),
  col_types = c(uf_zi = "character", dt_inter = "date", val_tot = "double"),
  parse_dates = TRUE,
  guess_types = FALSE
)
```

## Date parsing

Date parsing (`parse_dates = TRUE`) adds a small overhead per date field. Only
enable it when you actually need `Date` objects; otherwise the raw `YYYYMMDD`
character strings are often sufficient for filtering and grouping.
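
As a rough illustration of working with the raw strings, the sketch below filters and groups on `YYYYMMDD` values using base R only. The data frame is made up and merely stands in for the result of a read with date parsing left off; it assumes the date column arrives as zero-padded character strings, as described above.

```{r eval = FALSE}
# Toy data standing in for a file read without date parsing;
# the values are invented for illustration.
x <- data.frame(
  dt_inter = c("20231228", "20240105", "20240117", "20240203"),
  val_tot  = c(100, 250, 75, 300),
  stringsAsFactors = FALSE
)

# Zero-padded YYYYMMDD strings sort chronologically, so plain string
# comparison selects a date range without converting to Date.
jan <- x[x$dt_inter >= "20240101" & x$dt_inter <= "20240131", ]

# The first six characters (YYYYMM) serve as a year-month grouping key.
monthly <- tapply(x$val_tot, substr(x$dt_inter, 1, 6), sum)
```

Converting to `Date` is still worthwhile when you need date arithmetic, such as lengths of stay or weekday extraction.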

## Benchmarking example

```{r eval = FALSE}
library(bench)

file <- "RDPE2401.dbc"

bench::mark(
  default = read_datasus_dbc(file),
  no_guess = read_datasus_dbc(file, guess_types = FALSE),
  selected = read_datasus_dbc(file, select = c("uf_zi", "val_tot")),
  typed = read_datasus_dbc(
    file,
    select = c("uf_zi", "dt_inter", "val_tot"),
    col_types = c(uf_zi = "character", dt_inter = "date", val_tot = "double"),
    parse_dates = TRUE,
    guess_types = FALSE
  ),
  check = FALSE,
  iterations = 5
)
```

## Memory

`datasusr` allocates R vectors directly from C without materialising
intermediate files on disk. For very large datasets (tens of millions of rows),
consider processing files in batches rather than binding everything into a
single data frame.
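
One possible shape for batch processing is sketched below: each file is read and reduced to a small summary before the next one is loaded, so only the summaries are ever bound together. The helper `summarise_in_batches()` and its `read_one` argument are hypothetical names introduced for this example, not part of `datasusr`, and the sketch assumes the reader returns a data frame with `munic_res` and `val_tot` columns.

```{r eval = FALSE}
# Hypothetical helper: aggregate each file on its own, keeping only the
# per-file summary in memory. `read_one` is any function that takes a path
# and returns a data frame with `munic_res` and `val_tot` columns.
summarise_in_batches <- function(files, read_one) {
  per_file <- lapply(files, function(path) {
    x <- read_one(path)
    # Reduce the full table to one total per municipality
    aggregate(val_tot ~ munic_res, data = x, FUN = sum)
  })
  combined <- do.call(rbind, per_file)
  # Collapse the per-file totals into one row per municipality
  aggregate(val_tot ~ munic_res, data = combined, FUN = sum)
}

# Intended use with datasusr (the second file name is made up):
# totals <- summarise_in_batches(
#   c("RDPE2401.dbc", "RDPE2402.dbc"),
#   function(path) read_datasus_dbc(
#     path,
#     select = c("munic_res", "val_tot"),
#     col_types = c(munic_res = "character", val_tot = "double"),
#     guess_types = FALSE
#   )
# )
```

Passing the reader as an argument keeps the column selection and type options in one place. The sketch does not handle an empty `files` vector.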