Getting Started with np

This vignette is the smallest useful introduction to np that ships with the package. It focuses on one clean workflow that you can run right after installation: choose a bandwidth, fit a model, inspect the result, and plot it.

Broader worked examples, package comparisons, and method-specific articles are better served by the gallery site.

The basic workflow

In np, the bandwidth object is often the key object in the analysis. A typical session has three steps:

  1. compute or inspect a bandwidth object,
  2. fit the model,
  3. summarize or plot the result.
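
To see why the bandwidth carries so much of the analysis, it can help to look at a local-constant estimator written out by hand. The following is a minimal base-R sketch, not np's implementation: the helper name nw_fit and the simulated data are hypothetical, and the bandwidth h is supplied by hand rather than chosen by cross-validation.

```r
# Hypothetical helper: local-constant (Nadaraya-Watson) fit with a Gaussian kernel.
nw_fit <- function(x, y, h, x_eval = x) {
  vapply(x_eval, function(x0) {
    w <- dnorm((x - x0) / h)   # kernel weights centred at the evaluation point x0
    sum(w * y) / sum(w)        # locally weighted average of the response
  }, numeric(1))
}

# Simulated data, for illustration only.
set.seed(1)
x <- sort(runif(100))
y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)

fit_small <- nw_fit(x, y, h = 0.02)  # small bandwidth: wiggly fit that chases the noise
fit_large <- nw_fit(x, y, h = 10)    # huge bandwidth: nearly flat fit close to mean(y)
```

The same data give very different fits depending only on h, which is why np treats bandwidth selection as a step of its own.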

A simple regression example

library(np)
data(cps71, package = "np")

bw <- npregbw(logwage ~ age, data = cps71)
summary(bw)
#> 
#> Regression Data (205 observations, 1 variable(s)):
#> 
#> Regression Type: Local-Constant
#> Bandwidth Selection Method: Least Squares Cross-Validation
#> Formula: logwage ~ age
#> Bandwidth Type: Fixed
#> Objective Function Value: 0.316055 (achieved on multistart 1)
#> Number of Function Evaluations: 47 (fast = 0)
#> 
#> Exp. Var. Name: age Bandwidth: 1.892158 Scale Factor: 0.4487743
#> 
#> Continuous Kernel Type: Second-Order Gaussian
#> No. Continuous Explanatory Vars.: 1
#> Estimation Time: 0.075 seconds

fit <- npreg(bws = bw)
summary(fit)
#> 
#> Regression Data: 205 training points, in 1 variable(s)
#>                    age
#> Bandwidth(s): 1.892158
#> 
#> Kernel Regression Estimator: Local-Constant
#> Bandwidth Type: Fixed
#> Residual standard error: 0.5307943
#> R-squared: 0.3108675
#> 
#> Continuous Kernel Type: Second-Order Gaussian
#> No. Continuous Explanatory Vars.: 1
#> Estimation Time: 0.076 seconds (optim 0.075s, fit 0.001s)

Plotting the fitted relationship

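# Scatter the raw data, then overlay the fitted values.
# Ordering by age keeps the line drawn left to right even if the rows are not sorted.
ord <- order(cps71$age)
plot(cps71$age, cps71$logwage, cex = 0.25, col = "grey")
lines(cps71$age[ord], fitted(fit)[ord], col = 2, lwd = 2)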
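
The plot above draws the fit at the observed ages. A smoother curve can be drawn by evaluating the fit on a regular grid. The sketch below assumes that the predict() method for npreg fits accepts a newdata data frame, as described in the package help pages; the object names age_grid and pred are hypothetical.

```r
library(np)
data(cps71, package = "np")

# Same bandwidth and fit as in the regression example.
bw <- npregbw(logwage ~ age, data = cps71)
fit <- npreg(bws = bw)

# Hypothetical evaluation grid spanning the observed ages.
age_grid <- data.frame(
  age = seq(min(cps71$age), max(cps71$age), length.out = 100)
)

# Evaluate the fitted regression function at the grid points.
pred <- predict(fit, newdata = age_grid)

plot(cps71$age, cps71$logwage, cex = 0.25, col = "grey")
lines(age_grid$age, pred, col = 2, lwd = 2)
```

The column name in the grid must match the regressor name used in the formula.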

Mixed data

One important feature of np is that it handles mixed data directly. Variable class matters: unordered categorical variables should be factors, and ordered categorical variables should be ordered factors when appropriate.

set.seed(42)
mydat <- data.frame(
  y = rnorm(200),
  x_cont = runif(200),
  x_unordered = factor(sample(c("a", "b", "c"), 200, replace = TRUE)),
  x_ordered = ordered(sample(1:4, 200, replace = TRUE))
)

bw_mixed <- npregbw(y ~ x_cont + x_unordered + x_ordered, data = mydat)
fit_mixed <- npreg(bws = bw_mixed)
summary(fit_mixed)
#> 
#> Regression Data: 200 training points, in 3 variable(s)
#>                x_cont x_unordered x_ordered
#> Bandwidth(s): 1808084   0.6610069 0.9981613
#> 
#> Kernel Regression Estimator: Local-Constant
#> Bandwidth Type: Fixed
#> Residual standard error: 0.9721457
#> R-squared: 0
#> 
#> Continuous Kernel Type: Second-Order Gaussian
#> No. Continuous Explanatory Vars.: 1
#> 
#> Unordered Categorical Kernel Type: Aitchison and Aitken
#> No. Unordered Categorical Explanatory Vars.: 1
#> 
#> Ordered Categorical Kernel Type: Li and Racine
#> No. Ordered Categorical Explanatory Vars.: 1
#> Estimation Time: 0.163 seconds (optim 0.163s, fit 0s)
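
The bandwidths in this summary look odd at first, but they are what one would expect here: y was simulated independently of all three regressors, so cross-validation pushes each bandwidth toward the value at which its variable is smoothed out, and the R-squared is 0. The base-R sketch below checks this by hand. It assumes the usual forms of the three kernels named in the summary (Gaussian for continuous data, Aitchison and Aitken for unordered factors, Li and Racine for ordered factors); the variable names are hypothetical and the bandwidths are copied from the output above.

```r
# Bandwidths copied from the summary output.
h_cont   <- 1808084
lambda_u <- 0.6610069
lambda_o <- 0.9981613

# Continuous regressor on [0, 1]: ratio of the Gaussian kernel weight at the
# largest possible distance (1) to the weight at distance 0.
ratio_cont <- dnorm(1 / h_cont) / dnorm(0)

# Unordered factor with 3 levels (assumed Aitchison-Aitken form):
# weight 1 - lambda for a matching level, lambda / (c - 1) otherwise.
n_levels <- 3
aa_same <- 1 - lambda_u
aa_diff <- lambda_u / (n_levels - 1)
aa_upper <- (n_levels - 1) / n_levels  # bandwidth at which both weights equal 1 / c

# Ordered factor with levels 1:4 (assumed Li-Racine form): weight lambda^|distance|.
lr_far <- lambda_o^3  # weight at the largest possible distance, 3

round(c(ratio_cont = ratio_cont, aa_same = aa_same,
        aa_diff = aa_diff, lr_far = lr_far), 4)
```

In each case the weights are close to equal regardless of distance, so the fit is close to the overall mean of y.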

Data preparation matters

In np, the formula interface only tells the function which variable is the response and which are the regressors. It does not impose a linear or additive model; the + sign simply separates the regressors.

It is also important not to pass blocks of 0/1 dummies as you would in a standard linear-model workflow. If the underlying variable is categorical, it is usually better to keep it as a single factor or ordered factor, so that np can apply the appropriate categorical kernel.
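
If the data arrive already expanded into dummies, they can be collapsed back before calling np. The following is a minimal base-R sketch under stated assumptions: the data frame, the column names, and the region_ prefix are hypothetical, and each row is assumed to have exactly one dummy equal to 1.

```r
# Hypothetical data: one categorical variable stored as three 0/1 dummy columns.
dummies <- data.frame(
  region_north = c(1, 0, 0, 1),
  region_south = c(0, 1, 0, 0),
  region_west  = c(0, 0, 1, 0)
)

# Recover the level names by stripping the common prefix.
level_names <- sub("^region_", "", names(dummies))

# max.col() gives, for each row, the index of the column holding the 1.
idx <- max.col(as.matrix(dummies), ties.method = "first")
region <- factor(level_names[idx], levels = level_names)

# A categorical variable with a natural ordering becomes an ordered factor.
educ <- factor(c("low", "high", "mid", "low"),
               levels = c("low", "mid", "high"), ordered = TRUE)

class(region)
class(educ)
```

The resulting region and educ columns can then be used directly on the right-hand side of a formula.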

Other common starting points

This vignette keeps the package-side introduction intentionally narrow. Other common first routes are:

Those broader branches are better carried by help pages and website articles than by a single shipped vignette.

Where to go next