The SpatialInference package provides tools for statistical inference with spatially correlated data. Its main features are:

- Conley spatial HAC standard errors with a choice of kernels
- data-driven bandwidth selection via the empirical covariogram of the regression residuals
- diagnostics for spatial dependence, such as Moran's I
- a one-call workflow, `lm_sac()`, combining regression, diagnostics, and Conley standard errors
Load the package and data containing the centroids of all 3,108 counties of the contiguous United States:
library(SpatialInference)
library(sf); library(ggplot2)
#> Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
library(lfe)
#> Loading required package: Matrix
library(modelsummary)
data("US_counties_centroids")
ggplot(US_counties_centroids) + geom_sf(size = .1) + theme_bw()
The dataset contains two spatially correlated noise variables (noise1 and noise2), generated independently of each other via geostatistical simulation. Although mutually independent, each exhibits strong spatial clustering — nearby counties have similar values:
ggplot(US_counties_centroids) + geom_sf(aes(col = noise1), size = .1) + theme_bw()
ggplot(US_counties_centroids) + geom_sf(aes(col = noise2), size = .1) + theme_bw()
For comparison, the dist variable is an example of spatial non-stationarity (a spatial trend rather than spatial autocorrelation):
ggplot(US_counties_centroids) + geom_sf(aes(col = dist), size = .1) + theme_bw()
Even though the two noise variables were drawn independently (so the true coefficient is zero), a naive regression finds a highly significant relationship:
spuriouslm <- fixest::feols(noise1 ~ noise2, data = US_counties_centroids, vcov = "HC1")
spuriouslm
#> OLS estimation, Dep. Var.: noise1
#> Observations: 3,108
#> Standard-errors: Heteroskedasticity-robust
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 6.890000e-16 0.017872 3.860000e-14 1.0000e+00
#> noise2 8.738022e-02 0.015535 5.624836e+00 2.0216e-08 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> RMSE: 0.996015 Adj. R2: 0.007316
This is because the OLS errors \(\varepsilon_i\) are not i.i.d. — nearby residuals are positively correlated, which inflates the \(t\)-statistic. A map of the regression residuals reveals the spatial clustering:
US_counties_centroids$resid <- spuriouslm$residuals
ggplot(US_counties_centroids) +
  geom_sf(aes(col = resid), size = .1) + theme_bw() + scale_color_viridis_c()
Before computing Conley standard errors, we need to choose a bandwidth: the distance within which residuals are allowed to be correlated. The covgm_range() function estimates this from the data by computing the empirical covariogram of the regression residuals and identifying the distance at which covariation first crosses zero [@Lehner2026; @Pebesma2004].
covgm_range(US_counties_centroids) + theme_bw()
The plot shows the empirical covariogram — each point represents the average cross-product of residual pairs within a distance bin. The red vertical line marks the estimated correlation range \(\hat{\varsigma}\) (here around 831 km), and the red dashed line marks zero covariance. Beyond the estimated range, the covariogram fluctuates around zero, indicating that residual correlation has effectively ceased.
This is the key input for the Conley standard error: the bandwidth should match the distance over which residuals are actually correlated.
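As a sketch in generic notation (kernel \(K\), pairwise distance \(d_{ij}\), cutoff \(c\)), the Conley estimator replaces the usual heteroskedasticity-robust "meat" with a kernel-weighted double sum over residual cross-products:

\[
\widehat{V}_{\text{Conley}} = (X'X)^{-1} \left( \sum_{i=1}^{n} \sum_{j=1}^{n} K\!\left(\frac{d_{ij}}{c}\right) x_i x_j' \, \hat{\varepsilon}_i \hat{\varepsilon}_j \right) (X'X)^{-1}.
\]

For compactly supported kernels, \(K(u) = 0\) whenever \(u > 1\), so only pairs within the cutoff contribute; setting \(K(u) = \mathbb{1}\{i = j\}\) recovers the HC estimator.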
To extract just the numeric range for programmatic use:
# Compute the covariogram manually
covgm <- gstat::variogram(
  spuriouslm$residuals ~ 1,
  data = sf::as_Spatial(US_counties_centroids),
  covariogram = TRUE,
  width = 2e4,
  cutoff = as.numeric(max(sf::st_distance(US_counties_centroids))) * (2/3)
)
estimated_range <- extract_corr_range(covgm)
estimated_range
#> [1] 820.0129
A critical insight documented by @Lehner2026 is that the relationship between the bandwidth and the magnitude of Conley standard errors follows an inverse-U shape. This means that both too narrow and too wide bandwidths lead to underestimated standard errors — contradicting the conventional wisdom that wider bandwidths are always more conservative.
The inverseu_plot_conleyrange() function visualises this:
inverseu_plot_conleyrange(US_counties_centroids, seq(1, 2501, by = 200)) + theme_bw()
The grey dashed line represents the HC1 standard error. As the cutoff distance increases from zero, the Conley SE first rises (as positively correlated residual pairs are included), reaches a peak near the true correlation range, and then declines (as uncorrelated pairs dilute the estimate). At very wide bandwidths, the Conley SE can fall below the HC1 level — making wide-bandwidth Conley errors less conservative than heteroskedasticity-robust errors.
With the estimated range in hand, we compute the spatial HAC standard error using conley_SE():
spuriouslm_fe <- felm(noise1 ~ noise2 | unit + year | 0 | lat + lon,
                      data = US_counties_centroids, keepCX = TRUE)
regfe_conley <- conley_SE(reg = spuriouslm_fe, unit = "unit", time = "year",
                          lat = "lat", lon = "lon",
                          kernel = "epanechnikov", dist_fn = "Haversine",
                          dist_cutoff = 831)
conley <- round(sapply(regfe_conley, function(x) sqrt(diag(x)))[2], 5)
conley
#> Spatial.noise2
#> 0.137
The Conley SE at the estimated bandwidth is substantially larger than the HC1 error. Constructing a corrected \(t\)-statistic:
spuriouslm_fe$coefficients[1] / as.numeric(conley)
#> [1] 0.6378118
The \(t\)-statistic is far below 1.96, so we correctly fail to reject the null hypothesis — as expected, given that the two variables are genuinely independent.
The compute_conley_lfe() function provides a shortcut:
compute_conley_lfe(spuriouslm_fe, cutoff = 831, kernel_choice = "epanechnikov")
#> [1] 0.137
For a complete workflow — regression, Moran’s I test, and Conley SEs in one call — use lm_sac():
lmsac.out <- lm_sac("noise1 ~ noise2 | unit + year | 0 | lat + lon",
                    US_counties_centroids,
                    conley_cutoff = 831, conley_kernel = "epanechnikov")
lmsac.out$conley_SE
#> [1] 0.0000000 0.1370043
gm.param <- list(
  list("raw" = "nobs", "clean" = "Observations", "fmt" = 0),
  list("raw" = "Moran_y", "clean" = "Moran's I [y]", "fmt" = 3),
  list("raw" = "Moran_resid", "clean" = "Moran's I [resid]", "fmt" = 3),
  list("raw" = "y_mean", "clean" = "Mean [y]", "fmt" = 3),
  list("raw" = "y_SD", "clean" = "Std.Dev. [y]", "fmt" = 3)
)
modelsummary(lmsac.out,
             estimate = c("{estimate}"),
             statistic = c("({std.error})", "([{conley}])"),
             gof_omit = "Model|Range_resid|Range_resp|Range_y",
             gof_map = gm.param)
|              | (1)       |
|--------------|-----------|
| (Intercept)  | 0.000     |
|              | (0.018)   |
|              | ([0.137]) |
| noise2       | 0.087     |
|              | (0.018)   |
|              | ([0.137]) |
| Observations | 3108      |
The spatial HAC estimate is sensitive not only to the bandwidth but also to the kernel function. The package includes six kernels with different distance-decay profiles. The uniform kernel weights all pairs equally within the cutoff; the others downweight more distant pairs with varying decay rates:
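To make the decay profiles concrete, here is a sketch of the standard kernel weight functions in terms of the scaled distance \(u = d_{ij}/c\). These are the textbook definitions, not the package's internal implementation, and the scale factor used for the (non-compact) Gaussian kernel is an assumption for illustration:

```r
# Kernel weights as functions of scaled distance u = d / cutoff.
# Textbook definitions -- a sketch, not the package's internal code.
uniform      <- function(u) as.numeric(u <= 1)      # flat within the cutoff
bartlett     <- function(u) pmax(1 - u, 0)          # linear decay
epanechnikov <- function(u) pmax(1 - u^2, 0)        # quadratic decay
biweight     <- function(u) pmax(1 - u^2, 0)^2      # smoother near the cutoff
parzen       <- function(u) ifelse(u <= 0.5, 1 - 6*u^2 + 6*u^3,
                            ifelse(u <= 1, 2*(1 - u)^3, 0))
gaussian     <- function(u) exp(-0.5 * (3*u)^2)     # scale factor is an assumption

u <- seq(0, 1, by = 0.25)
round(bartlett(u), 2)      # 1.00 0.75 0.50 0.25 0.00
round(epanechnikov(u), 2)  # 1.00 0.94 0.75 0.44 0.00
```

At the same cutoff, faster-decaying kernels put less weight on distant (and hence noisier) cross-products, which is why the SEs below differ across kernels: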
compute_conley_lfe(spuriouslm_fe, cutoff = 831, kernel_choice = "uniform")
#> [1] 0.13964
compute_conley_lfe(spuriouslm_fe, cutoff = 831, kernel_choice = "epanechnikov")
#> [1] 0.137
compute_conley_lfe(spuriouslm_fe, cutoff = 831, kernel_choice = "bartlett")
#> [1] 0.12253
compute_conley_lfe(spuriouslm_fe, cutoff = 831, kernel_choice = "parzen")
#> [1] 0.11584
compute_conley_lfe(spuriouslm_fe, cutoff = 831, kernel_choice = "gaussian")
#> [1] 0.13799
compute_conley_lfe(spuriouslm_fe, cutoff = 831, kernel_choice = "biweight")
#> [1] 0.13087
Based on Monte Carlo evidence in @Lehner2026, the Bartlett and Epanechnikov kernels tend to deliver the best size control in practice.