--- title: "Getting started with TestGenerator" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Getting started with TestGenerator} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = FALSE ) ``` TestGenerator helps you test pharmacoepidemiological study code against a small, explicit OMOP CDM test population. The typical workflow is: 1. Create a small patient dataset in Excel or CSV files. 2. Convert that dataset to a Unit Test Definition JSON file. 3. Load the JSON into a blank CDM. 4. Run your study code and assert the expected results. This vignette uses the ICU sample population included with the package. ## Create a Unit Test Definition An Excel input file should contain one sheet per OMOP CDM table. For example, the sheet names can include `person`, `observation_period`, `visit_occurrence`, `condition_occurrence`, `drug_exposure`, and `measurement`. ```{r create-json} library(TestGenerator) file_path <- system.file( "extdata", "icu_sample_population.xlsx", package = "TestGenerator" ) output_path <- file.path(tempdir(), "testgenerator-example") dir.create(output_path, showWarnings = FALSE, recursive = TRUE) readPatients( filePath = file_path, testName = "icu_sample", outputPath = output_path, cdmVersion = "5.4" ) ``` This writes `icu_sample.json` to `output_path`. Keeping these JSON files in `tests/testthat/testCases` makes them easy to reuse from package tests. When `outputPath = NULL`, TestGenerator writes to that default test case folder. ## Load the Test Population into a CDM Use `patientsCDM()` to create a CDM reference containing the small patient population and a complete vocabulary. By default, the CDM is created in DuckDB. ```{r load-cdm} cdm <- patientsCDM( pathJson = output_path, testName = "icu_sample", cdmVersion = "5.4" ) cdm[["person"]] ``` If `pathJson = NULL`, TestGenerator looks for JSON files in `tests/testthat/testCases`. ```{r default-test-path} cdm <- patientsCDM( pathJson = NULL, testName = "icu_sample", cdmVersion = "5.4" ) ``` ## Use the CDM in Unit Tests Once the test CDM is available, run the same study code you use on a real CDM. The package includes example cohort definitions under `inst/extdata/test_cohorts`. ```{r cohort-test} library(CDMConnector) library(dplyr) library(testthat) test_cohorts <- system.file( "extdata", "test_cohorts", package = "TestGenerator" ) cohort_set <- readCohortSet(test_cohorts) cdm <- generateCohortSet( cdm = cdm, cohortSet = cohort_set, name = "test_cohorts" ) cohort_attrition <- attrition(cdm[["test_cohorts"]]) excluded_records <- cohort_attrition |> pull(excluded_records) |> sum() expect_equal(excluded_records, 0) ``` In a package test, place this code in `tests/testthat/test-*.R` and assert the specific counts, dates, durations, or intersections that your study should produce for the micro population. ## Visualise Cohort Timelines `graphCohort()` can help inspect whether cohort intersections and timing look as expected for a single subject. ```{r graph-cohort} diazepam <- cdm[["test_cohorts"]] |> filter(cohort_definition_id == 1) |> collect() hospitalisation <- cdm[["test_cohorts"]] |> filter(cohort_definition_id == 2) |> collect() icu_visit <- cdm[["test_cohorts"]] |> filter(cohort_definition_id == 3) |> collect() graphCohort( subject_id = 4, cohorts = list( diazepam = diazepam, hospitalisation = hospitalisation, icu_visit = icu_visit ) ) ``` ## Start from a Blank Excel Template If you want to design a new test population from scratch, create an Excel workbook with the required CDM table columns. ```{r generate-template} generateTestTables( tableNames = c( "person", "observation_period", "visit_occurrence", "condition_occurrence", "drug_exposure", "measurement" ), cdmVersion = "5.4", outputFolder = output_path, filename = "my_test_population" ) ``` Fill in the workbook rows for the small set of patients needed by your test, then pass the completed workbook to `readPatients()`. ## CSV Inputs For CSV inputs, place one file per CDM table in a folder. File names should match the table names, for example `person.csv` and `observation_period.csv`. ```{r csv-input} csv_path <- system.file( "extdata", "mimic_sample", package = "TestGenerator" ) readPatients.csv( filePath = csv_path, testName = "mimic_sample", outputPath = output_path, cdmVersion = "5.4" ) ``` For source datasets with very large integer identifiers, set `reduceLargeIds = TRUE`. ## Remote Databases DuckDB is the default and is usually enough for unit tests. When you need to test SQL translation on another backend, `patientsCDM()` can also create a test CDM in Spark, SQL Server, or PostgreSQL. ```{r remote-cdm} cdm <- patientsCDM( pathJson = output_path, testName = "icu_sample", cdmVersion = "5.4", dbms = "postgresql" ) # Drop the remote test schema and disconnect when finished. cleanupTestCdm(cdm) ``` Remote database connections require the relevant environment variables to be configured before calling `patientsCDM()`. | Backend | Required environment variables | | --- | --- | | Spark | `DATABRICKS_HOST`, `DATABRICKS_TOKEN`, `DATABRICKS_HTTPPATH` | | SQL Server | `DARWIN_SQLSERVER_SERVER`, `DARWIN_SQLSERVER_DBNAME`, `DARWIN_SQLSERVER_PORT`, `DARWIN_SQLSERVER_USER`, `DARWIN_SQLSERVER_PASSWORD` | | PostgreSQL | `DARWIN_POSTGRESQL_SERVER`, `DARWIN_POSTGRESQL_DBNAME`, `DARWIN_POSTGRESQL_PORT`, `DARWIN_POSTGRESQL_USER`, `DARWIN_POSTGRESQL_PASSWORD` | Spark also reads `DATABRICKS_USER` and `DATABRICKS_WORKSPACE` when they are set. If they are not set, TestGenerator uses `token` as the Databricks user and `hive_metastore` as the workspace/catalog. SQL Server reads `SQL_SERVER_DRIVER` when it is set; otherwise it uses `ODBC Driver 18 for SQL Server`. ## Clean Up For local DuckDB examples, disconnect when the test has finished. ```{r cleanup} DBI::dbDisconnect(CDMConnector::cdmCon(cdm), shutdown = TRUE) unlink(output_path, recursive = TRUE) ```