---
title: "Documenting datasets"
description: >
  How to document datasets stored in `data/`.
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Documenting datasets}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r}
#| include: false
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

Datasets are stored in `data/`, not as regular R objects in the package.
This means you need to document them in a slightly different way: instead of documenting the data directly, you quote the dataset's name.
For example, this is the roxygen2 block used for `ggplot2::diamonds`:

```{r}
#| eval: false
#' Prices of over 50,000 round cut diamonds
#'
#' A dataset containing the prices and other attributes of almost 54,000
#' diamonds. The variables are as follows:
#'
#' @format A data frame with 53940 rows and 10 variables:
#' \describe{
#'   \item{price}{price in US dollars ($326--$18,823)}
#'   \item{carat}{weight of the diamond (0.2--5.01)}
#'   \item{cut}{quality of the cut (Fair, Good, Very Good, Premium, Ideal)}
#'   \item{color}{diamond colour, from D (best) to J (worst)}
#'   \item{clarity}{a measurement of how clear the diamond is (I1 (worst), SI2,
#'     SI1, VS2, VS1, VVS2, VVS1, IF (best))}
#'   \item{x}{length in mm (0--10.74)}
#'   \item{y}{width in mm (0--58.9)}
#'   \item{z}{depth in mm (0--31.8)}
#'   \item{depth}{total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43--79)}
#'   \item{table}{width of top of diamond relative to widest point (43--95)}
#' }
#'
#' @source {ggplot2} tidyverse R package.
"diamonds"
```

Datasets should never be exported with `@export` because they are not found in the `NAMESPACE`.
Instead, datasets will either be automatically available if you set `LazyData: true` in your `DESCRIPTION`, or available after calling `data()` if not.
This field also affects the default usage.
If you have `LazyData: true`, the usage will be just the dataset name (e.g. `diamonds`).
Otherwise, the usage will be wrapped in `data()` (e.g.
`data(diamonds)`).

Note the use of two additional tags that are particularly useful for documenting data:

-   `@format`, which gives an overview of the structure of the dataset.
    This should include a **definition list** that describes each variable.
    There's currently no way to generate this with Markdown, so this is one of the few places you'll need to use Rd markup directly.

-   `@source`, which records where you got the data from, often a URL.
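
Because the `@format` definition list has to be written by hand in Rd markup, it can help to generate a skeleton and then fill in the descriptions. The following is a minimal sketch under stated assumptions: `format_skeleton()` is a hypothetical helper written for this example (it is not part of roxygen2), it uses only base R, and it is shown with the built-in `mtcars` data frame so that no extra packages are required.

```{r}
#| eval: false
# Hypothetical helper, not part of roxygen2: build the roxygen lines for a
# `@format` tag, with one \item{} placeholder per column of a data frame.
format_skeleton <- function(df) {
  stopifnot(is.data.frame(df))

  # First class of each column, included as a hint for whoever writes the text
  col_types <- unname(vapply(df, function(col) class(col)[1], character(1)))

  header <- sprintf(
    "#' @format A data frame with %d rows and %d variables:",
    nrow(df), ncol(df)
  )
  items <- sprintf(
    "#'   \\item{%s}{TODO: describe this variable (%s)}",
    names(df), col_types
  )

  c(header, "#' \\describe{", items, "#' }")
}

# Print the skeleton so it can be pasted above the quoted dataset name
skeleton <- format_skeleton(mtcars)
writeLines(skeleton)
```

The output is only a starting point: each `TODO` still needs a real description, and a `@source` line still has to be added by hand.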