1 Comparing Empirical and Theoretical Distributions of Statistics

Because the procedure implemented by distfreereg() is distribution-free only asymptotically, it is important to explore how well the approximations perform at finite sample sizes. How large the sample size must be for the asymptotic behavior to be sufficiently achieved is dependent on the details of each application. To facilitate the exploration of sample size requirements, distfreereg provides the function compare().

In the following example, we explore the required sample size when the true mean is \[f(X;\theta) = \theta_1 + \theta_2X_1 + \theta_3X_2,\] \(\theta=(2,5,-1)\), and the errors are independent standard normal errors. The covariate matrix is generated randomly. We start with a sample size of 10.

set.seed(20240913)
n <- 10
func <- function(X, theta) theta[1] + theta[2]*X[,1] + theta[3]*X[,2]
theta <- c(2, 5, -1)
X <- matrix(rexp(2*n, rate = 1), nrow = n)
comp_dfr <- compare(theta = theta, true_mean = func, test_mean = func,
                    true_X = X, true_covariance = list(Sigma = 3), X = X,
                    covariance = list(Sigma = 3),
                    theta_init = rep(1, length(theta)))

This function generates a response vector Y from true_mean using the given values of true_X, true_covariance (by default, the errors are multivariate normal), and theta. It then passes Y and most of the remaining arguments to distfreereg(). The statistics and p-values corresponding to each simulated data set are saved, and the process is repeated. The results are returned in an object of class compare.

2 Plotting the Results

These results can be explored graphically. Several options are available. The default behavior of the compare method for plot() produces a comparison of estimated cumulative distribution functions (CDFs) for the observed and simulated statistics. If the sample size is sufficiently large, these two curves should be nearly identical.

plot(comp_dfr)

This plot gives some cause for concern.

Another way of exploring this uses quantile–quantile (Q–Q) plots. Below is a Q–Q plot that compares observed and simulated statistics. If the sample size is sufficiently large, these points should lie on the line \(y=x\).

plot(comp_dfr, which = "qq")

The curvature indicates an insufficient sample size. Similar to this is the next plot, which is a Q–Q plot comparing p-values to uniform quantiles. Once again, if the sample size is sufficiently large, these points should lie on the line \(y=x\).

plot(comp_dfr, which = "qqp")

These plots all indicate that this sample size is insufficient for the asymptotic behavior for the KS statistic. Similar plots can be made for the CvM statistic. More details on this plot method are discussed in the plotting vignette.

3 Formal Comparison Testing

The compare method for ks.test(), which compares the observed and simulated distributions, confirms that they are still different:

ks.test(comp_dfr)

## 
##  Asymptotic two-sample Kolmogorov-Smirnov test
## 
## data:  comp_dfr$empirical_stats$KS and comp_dfr$theoretical_stats$KS
## D = 0.1261, p-value = 5.559e-13
## alternative hypothesis: two-sided

The default statistic is whatever is the first statistic appearing in the supplied object:

names(comp_dfr$obs)

## NULL

ks.test(comp_dfr, stat = "CvM")

## 
##  Asymptotic two-sample Kolmogorov-Smirnov test
## 
## data:  comp_dfr$empirical_stats$CvM and comp_dfr$theoretical_stats$CvM
## D = 0.1488, p-value < 2.2e-16
## alternative hypothesis: two-sided

4 Increasing the Sample Size

Let us repeat the simulation with a sample size of 100.

n_2 <- 100
X_2 <- matrix(rexp(2*n_2, rate = 1), nrow = n_2)
comp_dfr_2 <- update(comp_dfr, true_X = X_2, X = X_2)

The following plots indicate that this sample size is sufficient.

plot(comp_dfr_2)

plot(comp_dfr_2, which = "qq")

plot(comp_dfr_2, which = "qqp")

A formal test confirms this assessment.

ks.test(comp_dfr_2)

## 
##  Asymptotic two-sample Kolmogorov-Smirnov test
## 
## data:  comp_dfr_2$empirical_stats$KS and comp_dfr_2$theoretical_stats$KS
## D = 0.0224, p-value = 0.7517
## alternative hypothesis: two-sided

5 Using a `distfreereg` Object: `asymptotics()`

When a distfreereg object has already been created, the function asymptotics() provides a convenient wrapper for compare() to explore sample size adequacy. See the function’s documentation for details.

6 Power Considerations

The examples using compare() above all illustrate behavior of the function when the null hypothesis is true. It is also important to investigate the behavior when it is false to learn about the test’s power against particular alternative hypotheses. Using the example in , consider the case in which the true mean function is quadratic in \(X_2\): \(f(x;\theta) = \theta_0 + \theta_1x_1 + \theta_2x_2 + x_2^2\). Having verified above that a sample size of 100 is sufficient for the asymptotic behavior to be sufficiently approximated, this can be investigated as follows.

alt_func <- function(X, theta) theta[1] + theta[2]*X[,1] + theta[3]*X[,2] + 0.5*X[,2]^2
comp_dfr_3 <- update(comp_dfr_2, true_mean = alt_func, true_covariance = list(Sigma = 3))

plot(comp_dfr_3)

As hoped, these curves are clearly different. Power estimates can be obtained with rejection().

rejection(comp_dfr_3)

##   stat alpha  rate        mcse
## 1   KS  0.05 0.853 0.003541059
## 2  CvM  0.05 0.863 0.003438473

Power for different significance levels can be found by changing the alpha argument.

rejection(comp_dfr_3, alpha = 0.01)

##   stat alpha  rate        mcse
## 1   KS  0.01 0.648 0.004775940
## 2  CvM  0.01 0.693 0.004612494

7 Warnings

The test implemented by distfreereg() is distribution-free, but simulating values requires specifying an error distribution. The default behavior of compare() is to use multivariate normal errors, or the default behavior of simulate(). There is no way to avoid specifying a distribution to run these simulations. Alternative error distributions can be specified, however, when true_mean is a function, an nls object, or a formula with method equal to “nls”.

The safe interpretation of these simulations is limited to the error distribution that is used.

Comparing Distributions with the distfreereg Package