---
title: "nuggets: Get Started"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{nuggets: Get Started}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r, include=FALSE}
library(nuggets)
library(dplyr)
library(ggplot2)
library(tidyr)

options(tibble.width = Inf)
```

# Introduction

The `nuggets` package searches for patterns that can be expressed as formulae in the form of elementary conjunctions, referred to in this text as *conditions*. Conditions are constructed from *predicates*, which correspond to data columns. The interpretation of conditions depends on the choice of underlying logic:

- *Crisp (Boolean) logic*: each predicate takes values `TRUE` (1) or `FALSE` (0). The truth value of a condition is computed according to the rules of classical Boolean algebra.

- *Fuzzy logic*: each predicate is assigned a *truth degree* from the interval $[0, 1]$. The truth degree of a conjunction is then computed using a chosen *triangular norm (t-norm)*. The package supports three common t-norms, which are defined for predicates' truth degrees $a, b \in [0, 1]$ as follows:
    - *Gödel* (minimum) t-norm: $\min(a, b)$;
    - *Goguen* (product) t-norm: $a \cdot b$;
    - *Łukasiewicz* t-norm: $\max(0, a + b - 1)$.

Before applying `nuggets`, data columns intended as predicates must be prepared either by *dichotomization* (conversion into *dummy* logical variables) or by transformation into *fuzzy sets*. The package provides functions for both transformations. See the [Data Preparation](#data-preparation) section below for a quick overview, or the [Data Preparation](data-preparation.html) vignette for a comprehensive guide.

`nuggets` implements functions to search for pre-defined types of patterns or to discover patterns of *user-defined* type.
For example, the package provides:

- `dig_associations()` for association rules,
- `dig_baseline_contrasts()`, `dig_complement_contrasts()`, and `dig_paired_baseline_contrasts()` for various contrast patterns on numeric variables,
- `dig_correlations()` for conditional correlations.

To provide custom evaluation functions for conditions and to search for *user-defined* types of patterns, the package offers two general functions:

- `dig()` is a general function for searching arbitrary pattern types.
- `dig_grid()` is a wrapper around `dig()` for patterns defined by conditions and a pair of columns evaluated by a user-defined function.

See the section [Pre-defined Patterns](#pre-defined-patterns) below for examples and details on using the pre-defined pattern discovery functions and the section [Advanced Use](#advanced-use) for examples of custom pattern discovery.

Discovered rules and patterns can be post-processed, visualized, and explored interactively. That part is covered in the section [Post-processing and Visualization](#post-processing-and-visualization) below.

# Data Preparation

Before applying `nuggets`, data columns intended as predicates must be prepared either by *dichotomization* (conversion into *dummy variables*) or by transformation into *fuzzy sets*. The package provides the `partition()` function for both transformations.

This section gives a quick overview of data preparation with `nuggets`. For a detailed guide, including information about all available functions and advanced techniques, please see the [Data Preparation Vignette](data-preparation.html).

## Crisp (Boolean) Predicates Example

For crisp patterns, numeric columns are transformed to logical (`TRUE`/`FALSE`) columns.
To show the process, we start with the built-in `mtcars` dataset, which we first slightly modify by converting the `cyl` column to a factor:

```{r}
# For demonstration, convert 'cyl' column of the mtcars dataset to a factor
mtcars <- mtcars |>
    mutate(cyl = factor(cyl,
                        levels = c(4, 6, 8),
                        labels = c("four", "six", "eight")))

head(mtcars, n = 3)
```

Now we can use the `partition()` function to transform all columns into crisp predicates:

```{r}
# Transform the whole dataset to crisp predicates
crisp_mtcars <- mtcars |>
    partition(cyl, vs:gear, .method = "dummy") |>
    partition(mpg, .method = "crisp", .breaks = c(-Inf, 15, 20, 30, Inf)) |>
    partition(disp:carb, .method = "crisp", .breaks = 3)

head(crisp_mtcars, n = 3)
```

As seen above, the `"dummy"` method can be used to create logical columns for each category of processed variables. Here, it was applied to create dummy variables for the factor variable `cyl` as well as for the numeric variables `vs`, `am`, and `gear`. The method `"crisp"` creates logical columns representing intervals for numeric variables. In the example, it was used to create intervals for `mpg` based on specified breakpoints (`-Inf`, `15`, `20`, `30`, `Inf`), and for `disp`, `hp`, `drat`, `wt`, `qsec`, and `carb` using equal-width intervals (3 intervals each).

Now all columns are logical and can be used as predicates in crisp conditions.

## Fuzzy Predicates Example

Fuzzy predicates express the degree to which a condition is satisfied, with values in the interval $[0,1]$.
This allows modeling of smooth transitions between categories:

```{r, message=FALSE}
# Start again from the (modified) mtcars and transform to fuzzy predicates
fuzzy_mtcars <- mtcars |>
    partition(cyl, vs:gear, .method = "dummy") |>
    partition(mpg, .method = "triangle", .breaks = c(-Inf, 15, 20, 30, Inf)) |>
    partition(disp:carb, .method = "triangle", .breaks = 3)

head(fuzzy_mtcars, n = 3)
```

Similar to the crisp example, the `"dummy"` method creates logical columns for categorical variables (`cyl`, `vs`, `am`, `gear`). The `"triangle"` method creates fuzzy predicates with triangular membership functions. For `mpg`, it uses specified breakpoints to define fuzzy intervals. For the remaining numeric variables (`disp` through `carb`), it automatically creates 3 overlapping fuzzy sets with smooth transitions between intervals.

Note that the `cyl`, `vs`, `am`, and `gear` columns are still represented by dummy logical columns, while the numeric columns are now represented by fuzzy sets. This combination allows both crisp and fuzzy predicates to be used together in pattern discovery.

## Advanced Data Preparation Capabilities

The `nuggets` package provides powerful and flexible data preparation tools. The [Data Preparation](data-preparation.html) vignette covers these capabilities in depth, including:

- **Crisp (Boolean) partitioning** with customizable interval strategies:
    - Equal-width intervals for uniform discretization
    - Data-driven methods (quantile, k-means, hierarchical clustering, etc.) for optimal breakpoints that respect the data structure
    - Custom breakpoints for domain-specific intervals

- **Fuzzy partitioning** for modeling gradual transitions and uncertainty:
    - Triangular membership functions for basic fuzzy sets
    - Raised-cosine membership functions for smoother transitions
    - Trapezoidal shapes using `.span` and `.inc` parameters for overlapping fuzzy sets

- **Quality control utilities** to improve pattern mining:
    - `is_almost_constant()` and `remove_almost_constant()` to identify and filter uninformative columns
    - `dig_tautologies()` to find always-true or almost-always-true rules that can be used to prune search spaces

- **Custom labels** for predicates to make discovered patterns more interpretable

For example, you can use quantile-based partitioning to ensure balanced predicates, or use raised-cosine fuzzy sets with custom labels to create meaningful linguistic terms like "very_low", "low", "medium", "high", and "very_high". These preparation choices significantly impact the interpretability and usefulness of patterns discovered in subsequent analyses.

# Pre-defined Patterns

The package `nuggets` provides a set of functions for discovering some of the best-known pattern types. These functions can process Boolean data, fuzzy data, or both. Each function returns a tibble, where every row represents one detected pattern.

> **Note:** This section assumes that the data have already been **preprocessed**,
> i.e., transformed into a binarized or fuzzified form. See the previous
> section [Data Preparation](#data-preparation) for details on how to prepare
> your dataset (for example, `crisp_mtcars` and `fuzzy_mtcars`).

For more advanced workflows, such as defining custom pattern types or computing user-defined measures, see the section [Advanced Use](#advanced-use).

## Search for Association Rules

**Association rules** identify conditions (*antecedents*) under which a specific feature (*consequent*) is present very often.
\[
A \Rightarrow C
\]

If condition `A` is satisfied, then the feature `C` tends to be present. For example,

`university_edu & middle_age & IT_industry => high_income`

can be read as:

*Middle-aged people with a university education who work in the IT industry are very likely to have a high income.*

In practice, the antecedent `A` is a set of predicates, and the consequent `C` is usually a single predicate.

For a set of predicates \(I\), let \(\text{supp}(I)\) denote the *support*, that is, the relative frequency (for logical data) or the mean truth degree (for fuzzy data) of rows satisfying all predicates in \(I\). Using this notation, the following rule properties and quality measures may be defined:

- **Length**: number of predicates in the antecedent.
- **Coverage** (antecedent support): \(\text{supp}(A)\).
- **Consequent support**: \(\text{supp}(C)\).
- **Support**: \(\text{supp}(A \cup C)\).
- **Confidence**: \(\text{supp}(A \cup C) / \text{supp}(A)\).
- **Lift**: \(\text{supp}(A \cup C) / (\text{supp}(A) \text{supp}(C))\).

Rules with high *support* are frequent in the data. Rules with high *confidence* indicate a strong association between antecedent and consequent. Rules with high *lift* (above 1) suggest that satisfying the antecedent increases the likelihood of the consequent occurring.

Before searching for rules, it is recommended to create a *vector of disjoints*, which specifies predicates that must not appear together in the same condition. This vector should have the same length as the number of dataset columns. For example, columns representing `gear=3` and `gear=4` are mutually exclusive, so their shared group label in `disj` prevents meaningless conditions like `gear=3 & gear=4`. You can conveniently generate this vector with `var_names()`:

```{r}
disj <- var_names(colnames(fuzzy_mtcars))
print(disj)
```

The `dig_associations()` function searches for association rules.
Its main arguments are:

- `x`: the data matrix or data frame (logical or numeric);
- `antecedent`, `consequent`: tidyselect expressions selecting columns for each side of the rule;
- `disjoint`: a vector defining mutually exclusive predicates;
- rule filtering thresholds such as `min_support`, `min_confidence`, `min_coverage`, and limits like `min_length`, `max_length`;
- optional parameters such as `t_norm` and `contingency_table`.

In the following example, we search for fuzzy association rules in the dataset `fuzzy_mtcars`, such that:

- any column except those starting with `"am"` may appear in the antecedent;
- columns starting with `"am"` may appear in the consequent;
- minimum support is `0.02`, i.e., 2% of data rows have to contain both the antecedent and consequent of the rule;
- minimum confidence is `0.8`, i.e., the conditional probability of the consequent given the antecedent should be at least 80%;
- in addition to the basic quality measures, the contingency table for each rule is computed.

The *contingency table* is a quadruplet `pp`, `pn`, `np`, and `nn`, which contains the counts (or sums of degrees) of rows satisfying antecedent & consequent (`pp`), antecedent & not consequent (`pn`), not antecedent & consequent (`np`), and not antecedent & not consequent (`nn`). These values are needed to compute various additional interestingness measures.

```{r}
result <- dig_associations(fuzzy_mtcars,
                           antecedent = !starts_with("am"),
                           consequent = starts_with("am"),
                           disjoint = disj,
                           min_support = 0.02,
                           min_confidence = 0.8,
                           contingency_table = TRUE)
```

The result is a tibble containing the discovered rules and their quality metrics. You can arrange them, for example, by decreasing support:

```{r}
result <- arrange(result, desc(support))
print(result)
```

This example illustrates the typical workflow for mining association rules with `nuggets`. The same structure and arguments apply when analyzing either fuzzy or Boolean datasets.
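
To make the reported measures concrete, the following base R sketch recomputes coverage, support, confidence, and lift by hand for a single rule `a & b => c` under each of the three t-norms from the Introduction. The toy truth degrees, the column names `a`, `b`, `c`, and the helper `rule_measures()` are all made up for illustration; the sketch does not call `nuggets` and is not meant to mirror its internals.

```r
# Toy truth degrees (made up for illustration, not taken from mtcars)
toy <- data.frame(
  a = c(1.0, 0.8, 0.2, 0.0),  # first antecedent predicate
  b = c(0.9, 0.6, 0.7, 0.1),  # second antecedent predicate
  c = c(1.0, 0.5, 0.3, 0.0)   # consequent predicate
)

# The three t-norms, applied elementwise to vectors of truth degrees
tnorms <- list(
  goedel      = function(x, y) pmin(x, y),
  goguen      = function(x, y) x * y,
  lukasiewicz = function(x, y) pmax(0, x + y - 1)
)

# Hypothetical helper: measures of the rule `a & b => c` for a given t-norm
rule_measures <- function(data, tnorm) {
  ante <- tnorm(data$a, data$b)   # truth degree of the antecedent per row
  both <- tnorm(ante, data$c)     # antecedent and consequent together
  coverage <- mean(ante)
  support <- mean(both)
  c(coverage   = coverage,
    support    = support,
    confidence = support / coverage,
    lift       = support / (coverage * mean(data$c)))
}

# One column of measures per t-norm
measures <- sapply(tnorms, function(tn) rule_measures(toy, tn))
round(measures, 3)
```

On this toy table the Łukasiewicz t-norm gives the lowest support and the Gödel t-norm the highest, which is worth keeping in mind when a `min_support` threshold is combined with a particular `t_norm`.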
## Conditional Correlations

**Conditional correlations** identify strong relationships between pairs of numeric variables under specific conditions. The `dig_correlations()` function searches for pairs of variables that are significantly correlated within sub-data satisfying generated conditions. This is useful for discovering context-dependent relationships.

In the following example, we search for correlations between different numeric variables in the original `mtcars` data under conditions defined by the prepared predicates in `crisp_mtcars`:

```{r}
# Prepare combined dataset with both condition predicates and numeric variables
combined_mtcars <- cbind(crisp_mtcars, mtcars[, c("mpg", "disp", "hp", "wt")])

# Extend disjoint vector for the new numeric columns
disj_combined <- c(var_names(colnames(crisp_mtcars)),
                   c("mpg", "disp", "hp", "wt"))

# Search for conditional correlations
corr_result <- dig_correlations(combined_mtcars,
                                condition = colnames(crisp_mtcars),
                                xvars = c("mpg", "hp"),
                                yvars = c("wt", "disp"),
                                disjoint = disj_combined,
                                min_length = 1,
                                max_length = 2,
                                min_support = 0.2,
                                method = "pearson")
print(corr_result)
```

This example combines crisp predicates (from `crisp_mtcars`) with numeric variables from the original `mtcars` dataset. The function searches for conditions under which pairs of numeric variables show significant Pearson correlations. The `disjoint` vector is extended to include the new numeric columns, preventing conflicts in the search algorithm. The result shows conditions under which specific pairs of variables exhibit strong correlations, along with correlation coefficients and p-values.

## Contrast Patterns

Contrast patterns identify conditions under which numeric variables show statistically significant differences. The `nuggets` package provides several functions for different types of contrasts.
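
As a rough intuition for what the contrast functions automate, the following base R sketch evaluates one hand-picked condition (`am == 1`, manual transmission) on the unmodified built-in `mtcars` data and compares `mpg` inside and outside that condition with a Welch two-sample t-test. The choice of condition and variable is arbitrary and the object names are hypothetical; the sketch does not use `nuggets`, which instead searches over many generated conditions.

```r
# Work on the unmodified built-in dataset to keep the sketch self-contained
cars <- datasets::mtcars

# Hand-picked condition: cars with manual transmission
cond <- cars$am == 1

# Relative frequency of rows satisfying the condition
cond_support <- mean(cond)

# Values of the numeric variable inside and outside the condition
inside  <- cars$mpg[cond]
outside <- cars$mpg[!cond]

# Welch two-sample t-test comparing the two groups
test <- t.test(inside, outside)

c(support      = cond_support,
  mean_inside  = mean(inside),
  mean_outside = mean(outside),
  p_value      = test$p.value)
```

A search over conditions repeats this kind of comparison for every sufficiently frequent condition, which is why thresholds such as `min_support` and `max_length` matter for keeping the number of tests manageable.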
### Baseline Contrasts

*Baseline contrasts* identify conditions under which a variable is significantly different from a baseline value (typically zero) using a one-sample statistical test.

```{r}
# Prepare combined dataset with predicates and numeric variables
combined_mtcars2 <- cbind(crisp_mtcars, mtcars[, c("mpg", "hp", "wt")])

# Extend disjoint vector for the new numeric columns
disj_combined2 <- c(var_names(colnames(crisp_mtcars)),
                    c("mpg", "hp", "wt"))

# Search for baseline contrasts
baseline_result <- dig_baseline_contrasts(combined_mtcars2,
                                          condition = colnames(crisp_mtcars),
                                          vars = c("mpg", "hp", "wt"),
                                          disjoint = disj_combined2,
                                          min_length = 1,
                                          max_length = 2,
                                          min_support = 0.2,
                                          method = "t")
head(baseline_result)
```

This example tests whether the mean of numeric variables (`mpg`, `hp`, `wt`) significantly differs from zero under various conditions. The `method = "t"` parameter specifies a t-test. The results show which combinations of conditions lead to statistically significant deviations from the baseline.

### Complement Contrasts

*Complement contrasts* identify conditions under which a variable differs significantly between elements that satisfy the condition and those that don't.

```{r}
complement_result <- dig_complement_contrasts(combined_mtcars2,
                                              condition = colnames(crisp_mtcars),
                                              vars = c("mpg", "hp", "wt"),
                                              disjoint = disj_combined2,
                                              min_length = 1,
                                              max_length = 2,
                                              min_support = 0.15,
                                              method = "t")
head(complement_result)
```

This example uses a two-sample t-test to compare the mean values of numeric variables between rows that satisfy a condition and rows that don't. The results identify conditions where subgroups have significantly different characteristics compared to the rest of the data.

### Paired Baseline Contrasts

*Paired baseline contrasts* identify conditions under which there is a significant difference between two paired numeric variables.

```{r}
paired_result <- dig_paired_baseline_contrasts(combined_mtcars2,
                                               condition = colnames(crisp_mtcars),
                                               xvars = c("mpg", "hp"),
                                               yvars = c("wt", "wt"),
                                               disjoint = disj_combined2,
                                               min_length = 1,
                                               max_length = 2,
                                               min_support = 0.2,
                                               method = "t")
head(paired_result)
```

This example performs paired t-tests to compare two variables within the same rows under specific conditions. Here, it tests whether `mpg` differs from `wt` (and `hp` from `wt`) in various subgroups. This is useful for detecting context-dependent relationships between paired measurements.

# Post-processing and Visualization

After discovering patterns with `nuggets`, you'll often want to manipulate, format, and visualize the results. The package provides several tools for these tasks.

## Visualizing Association Rules with Diamond Plots

The `geom_diamond()` function provides a specialized visualization for association rules and their hierarchical structure. It displays rules as a lattice where broader (more general) conditions appear above their descendants:

```{r, fig.width=8, fig.height=5}
# Search for rules with various confidence levels for visualization
vis_rules <- dig_associations(fuzzy_mtcars,
                              antecedent = starts_with(c("gear", "vs")),
                              consequent = "am=1",
                              disjoint = disj,
                              min_support = 0,
                              min_confidence = 0,
                              min_length = 0,
                              max_length = 3,
                              max_results = 50)
print(vis_rules)

# Create diamond plot showing rule hierarchy
ggplot(vis_rules) +
    aes(condition = antecedent,
        fill = confidence,
        linewidth = confidence,
        size = support,
        label = paste0(antecedent, "\nconf: ", round(confidence, 2))) +
    geom_diamond(nudge_y = 0.25) +
    scale_x_discrete(expand = expansion(add = 0.5)) +
    scale_y_discrete(expand = expansion(add = 0.25)) +
    labs(title = "Association Rules Hierarchy",
         subtitle = "consequent: am=1")
```

This example creates a hierarchical visualization of association rules.
The `geom_diamond()` function arranges rules in a lattice structure where simpler rules (with fewer predicates) appear at the top and more complex rules below. Visual properties (fill color, edge width, node size) encode rule quality measures, making it easy to identify the most interesting patterns. The custom label combines the antecedent with its confidence value for better readability. The additional scale settings (`scale_x_discrete()`, `scale_y_discrete()`) add padding around the plot.

The diamond plot helps identify:

- Simple vs. complex rules (vertical position)
- Antecedent relationship (ancestor and descendant rules are connected with lines)
- Strong vs. weak confidence (node color intensity)
- Frequent vs. rare rules (node size)
- Improvement/worsening of confidence (line width and color: improvement is depicted with gray lines, worsening with reddish lines; the amount of change is indicated by the width of the lines)

# Interactive Exploration

The `explore()` function launches an interactive Shiny application for exploring discovered patterns. This is particularly useful for association rules:

```{r, eval=FALSE}
# Search for association rules to explore
rules <- dig_associations(fuzzy_mtcars,
                          antecedent = everything(),
                          consequent = everything(),
                          min_support = 0.05,
                          min_confidence = 0.7)

# Open interactive explorer
explore(rules, data = fuzzy_mtcars)
```

The interactive explorer provides:

- **Rule filtering**: Filter rules by support, confidence, lift, and other measures
- **Sorting and searching**: Find specific rules of interest
- **Visualizations**: Multiple visualization types for rule exploration

# Advanced Use

For advanced workflows, the `nuggets` package allows users to define custom pattern types and evaluation functions. This section demonstrates how to use the general `dig()` function with custom callbacks and the specialized `dig_grid()` wrapper.
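
The general idea behind a callback-driven search can be sketched in base R: enumerate conjunctions of predicates, compute the support of each, and hand the sufficiently frequent ones to a user-supplied function. The helper `enumerate_conditions()` and the toy predicates below are hypothetical; this brute-force loop is only an illustration of the concept and says nothing about how `dig()` is actually implemented.

```r
# Toy logical predicates (made up for illustration)
preds <- data.frame(
  p = c(TRUE, TRUE,  TRUE, TRUE, FALSE),
  q = c(TRUE, FALSE, TRUE, TRUE, FALSE),
  r = c(FALSE, TRUE, TRUE, TRUE, TRUE)
)

# Hypothetical helper: call `f` on every conjunction of up to `max_length`
# predicates whose support reaches `min_support`
enumerate_conditions <- function(data, f, max_length = 2, min_support = 0) {
  out <- list()
  for (len in seq_len(max_length)) {
    for (cols in combn(names(data), len, simplify = FALSE)) {
      holds <- Reduce(`&`, data[cols])   # row-wise conjunction of the columns
      supp <- mean(holds)                # relative frequency of the condition
      if (supp >= min_support) {
        out[[length(out) + 1]] <- f(condition = cols, support = supp)
      }
    }
  }
  out
}

# The callback decides what is returned for each frequent condition
res <- enumerate_conditions(
  preds,
  f = function(condition, support) {
    data.frame(condition = paste(condition, collapse = " & "),
               support = support)
  },
  min_support = 0.5
)

found <- do.call(rbind, res)
print(found)
```

Keeping the enumeration separate from the callback is what makes the approach flexible: the same loop can produce association rules, test statistics, or any other per-condition summary, depending only on the function passed as `f`.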
## Custom Patterns with dig()

The `dig()` function allows you to execute a user-defined callback function on each generated frequent condition. This enables searching for custom pattern types beyond the pre-defined functions. The following example replicates the search for association rules using a custom callback function with the datasets prepared earlier:

```{r}
# Define thresholds for custom association rules
min_support <- 0.02
min_confidence <- 0.8

# Define custom callback function
f <- function(condition, support, pp, pn) {
    # Calculate confidence for each focus (consequent)
    conf <- pp / support

    # Filter rules by confidence and support thresholds
    sel <- !is.na(conf) & conf >= min_confidence &
        !is.na(pp) & pp >= min_support
    conf <- conf[sel]
    supp <- pp[sel]

    # Return list of rules meeting criteria
    lapply(seq_along(conf), function(i) {
        list(antecedent = format_condition(names(condition)),
             consequent = names(conf)[[i]],
             support = supp[[i]],
             confidence = conf[[i]])
    })
}

# Search using custom callback
custom_result <- dig(fuzzy_mtcars,
                     f = f,
                     condition = !starts_with("am"),
                     focus = starts_with("am"),
                     disjoint = disj,
                     min_length = 1,
                     min_support = min_support)

# Flatten and format results
custom_result <- custom_result |>
    unlist(recursive = FALSE) |>
    lapply(as_tibble) |>
    do.call(rbind, args = _) |>
    arrange(desc(support))

print(custom_result)
```

The callback function `f()` receives information based on its argument names:

- `condition`: vector of column indices forming the condition
- `support`: relative frequency of the condition
- `pp`, `pn`: contingency table entries

This approach gives you full control over pattern evaluation and filtering logic.

## Grid-Based Patterns with dig_grid()

The `dig_grid()` function is useful for patterns based on relationships between pairs of columns. It creates a grid of column combinations and evaluates a user-defined function for each condition and column pair.
Here's an example that computes custom statistics for pairs of numeric variables:

```{r}
# Define callback for grid-based patterns
grid_callback <- function(d, weights) {
    if (nrow(d) < 5) return(NULL)  # Skip if too few observations

    # Compute weighted correlation
    wcor <- cov.wt(d, wt = weights, cor = TRUE)$cor[1, 2]

    list(
        correlation = wcor,
        n_obs = sum(weights > 0.1),
        mean_x = weighted.mean(d[[1]], weights),
        mean_y = weighted.mean(d[[2]], weights)
    )
}

# Prepare combined dataset
combined_fuzzy <- cbind(fuzzy_mtcars, mtcars[, c("mpg", "hp", "wt")])

# Extend disjoint vector for new numeric columns
combined_disj3 <- c(var_names(colnames(fuzzy_mtcars)),
                    c("mpg", "hp", "wt"))

# Search using grid approach
grid_result <- dig_grid(combined_fuzzy,
                        f = grid_callback,
                        condition = colnames(fuzzy_mtcars),
                        xvars = c("mpg", "hp"),
                        yvars = c("wt"),
                        disjoint = combined_disj3,
                        type = "fuzzy",
                        min_length = 1,
                        max_length = 2,
                        min_support = 0.15,
                        max_results = 20)

# Display results
print(grid_result)
```

The `dig_grid()` function is particularly useful for:

- Computing conditional correlations with custom methods
- Evaluating pairwise relationships under different conditions
- Implementing specialized statistical tests on variable pairs

# Summary

This vignette has introduced the core functionality of the `nuggets` package for discovering patterns in data through systematic exploration of conditions. Key takeaways:

1. **Data Preparation**: Transform your data into predicates using `partition()`.

2. **Pre-defined Pattern Discovery**: The package provides specialized functions for common pattern types:
    - `dig_associations()` finds association rules (A → C)
    - `dig_correlations()` discovers conditional correlations between variable pairs
    - `dig_baseline_contrasts()` identifies when variables deviate from baseline under conditions
    - `dig_complement_contrasts()` finds subgroups differing from the rest
    - `dig_paired_baseline_contrasts()` compares paired variables within contexts

3. **Post-processing**: Manipulate and visualize discovered patterns:
    - Create hierarchical visualizations with `geom_diamond()`
    - Launch interactive explorers with `explore()`

4. **Advanced Usage**: Define custom pattern types:
    - Use `dig()` with custom callback functions for specialized analyses
    - Use `dig_grid()` for patterns based on variable pairs

## Next Steps

- Explore the [Data Preparation vignette](data-preparation.html) for advanced preprocessing techniques
- Review function documentation (e.g., `?dig_associations`) for detailed parameter descriptions
- Experiment with your own datasets to discover meaningful patterns
- Use interactive exploration (`explore()`) to gain insights into discovered patterns

The `nuggets` package provides a flexible framework for pattern discovery that scales from simple association rule mining to complex custom pattern searches, all while supporting both crisp and fuzzy logic approaches.