Beyond family-driven GLMs, fastglm has fitting
functions for four model types/situations that arise often in practice
but require model-specific machinery on top of the standard IRLS
solver:
- fastglm_nb() — negative-binomial regression with the dispersion
  theta estimated jointly with the regression coefficients, in the
  spirit of MASS::glm.nb().
- firth = TRUE — Firth’s (1993) bias-reducing penalty for logistic
  regression, useful under separation or in small samples (the
  logistf::logistf() use case).
- fastglm_hurdle() — a two-part count model with a binary
  zero / non-zero component and a zero-truncated Poisson or NB count
  component, the same model as pscl::hurdle().
- fastglm_zi() — a zero-inflated Poisson or NB regression, with a
  binary inflation component layered on the original (untruncated)
  count distribution, the same model as pscl::zeroinfl().

All four reuse the same C++ IRLS solver as fastglm() itself; the outer
iterations (the joint NB MLE, the EM loop in zero inflation, the inner
Brent search for theta in unknown-theta NB hurdle / ZI models) likewise
live in C++. Cameron and Trivedi (1998) is the standard textbook
reference for the count-data theory underlying this vignette, and
Zeileis, Kleiber, and Jackman (2008) describes the pscl implementation
we benchmark against.
library(fastglm)
suppressPackageStartupMessages({
library(MASS) # glm.nb, rnegbin, sex2-style separation data
library(pscl) # hurdle, zeroinfl, bioChemists
library(logistf) # canonical Firth implementation
library(microbenchmark)
})

fastglm_nb() is a drop-in alternative to
MASS::glm.nb() (Venables and Ripley, 2002) for the NB2
model with Var(Y) = mu + mu^2 / theta. Both
theta and beta are estimated by maximum
likelihood, alternating an IRLS update for beta at fixed
theta with a 1-D Brent root-find for theta at
fixed beta. Both loops run in C++.
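The alternation can be mimicked in a few lines of base R, using glm.fit with MASS’s fixed-theta NB family for the beta step and optimize for the theta step; this is only a schematic of the scheme (the package’s version is C++ with a proper joint convergence test):

```r
# NB2 log-likelihood in theta at fixed mu (Var(Y) = mu + mu^2 / theta).
nb_loglik_theta <- function(theta, y, mu) {
  sum(lgamma(y + theta) - lgamma(theta) - lgamma(y + 1) +
      theta * log(theta) + y * log(mu) -
      (theta + y) * log(theta + mu))
}

# Alternate: IRLS for beta at fixed theta, then 1-D MLE for theta at fixed beta.
fit_nb_alternate <- function(X, y, iters = 25) {
  theta <- 1
  for (k in seq_len(iters)) {
    fit <- glm.fit(X, y, family = MASS::negative.binomial(theta))
    mu  <- fit$fitted.values
    theta <- optimize(nb_loglik_theta, c(1e-3, 1e3), y = y, mu = mu,
                      maximum = TRUE)$maximum
  }
  list(coef = coef(fit), theta = theta)
}
```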
A small-sample comparison on MASS::quine (school
absences):
data(quine)
X <- model.matrix(~ Eth + Sex + Age + Lrn, data = quine)
y <- quine$Days
fit_f <- fastglm_nb(X, y)
fit_m <- MASS::glm.nb(Days ~ Eth + Sex + Age + Lrn, data = quine)
c(theta_fastglm = fit_f$theta, theta_glm.nb = fit_m$theta)
#> theta_fastglm theta_glm.nb
#> 1.274893 1.274893
max(abs(unname(coef(fit_f)) - unname(coef(fit_m))))
#> [1] 3.004626e-08
abs(as.numeric(logLik(fit_f)) - as.numeric(logLik(fit_m)))
#> [1] 0

Coefficients and theta agree to roughly 1e-8 on the same MLE.
A timing comparison on a moderately sized simulated
NB(theta = 2) example:
set.seed(1)
n <- 5e4
X <- cbind(1, matrix(rnorm(n * 3), n, 3))
mu <- exp(X %*% c(0.5, 0.4, -0.2, 0.3))
y <- MASS::rnegbin(n, mu = mu, theta = 2)
df <- data.frame(y = y, x1 = X[, 2], x2 = X[, 3], x3 = X[, 4])
mb_nb <- microbenchmark(
fastglm_nb = fastglm_nb(X, y),
glm.nb = MASS::glm.nb(y ~ x1 + x2 + x3, data = df),
times = 5L
)
print(mb_nb)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> fastglm_nb 27.27906 27.54245 28.20274 28.64723 28.66466 28.88032 5
#> glm.nb 257.68812 300.93619 294.54855 303.13879 304.43976 306.53990 5
#> cld
#> a
#> b

The native NB family kernel is also exposed directly through
negbin(theta, link) for the case when theta is
known (or estimated separately). Holding theta fixed at the
joint MLE recovers the fastglm_nb() regression coefficients
exactly, since fastglm_nb() is just IRLS at the converged
theta:
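A sketch of that check on the quine fit above (assuming negbin() is passed as the family argument in the usual family-generator style):

```r
# Plain IRLS at the converged dispersion should reproduce the joint fit:
# fastglm_nb() is just fastglm() with theta held at its MLE.
fit_fixed <- fastglm(X, y, family = negbin(theta = fit_f$theta))
max(abs(coef(fit_fixed) - coef(fit_f)))
```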
Logistic regression breaks down under (quasi-)separation: the MLE is
at infinity and the Wald standard errors blow up. Firth (1993) modifies
the score by (1/2) ∂ log|I(beta)| / ∂beta, which both
removes the leading O(1/n) bias and produces finite
estimates under separation. fastglm exposes the penalty as the
firth = TRUE flag.
For binomial logit the penalty collapses to a per-iteration
adjustment of the working response by
h_i (0.5 - mu_i) / (mu_i (1 - mu_i)) where h_i
is the leverage; no new family code is needed, and the cost per IRLS
iteration is one extra streaming pass for the leverages.
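One Firth-adjusted IRLS step can be written out in base R to make the adjustment concrete; this is purely illustrative (fastglm does the equivalent inside its C++ solver):

```r
# One Firth-adjusted IRLS step for the binomial logit.
firth_irls_step <- function(X, y, beta) {
  eta <- drop(X %*% beta)
  mu  <- plogis(eta)
  w   <- mu * (1 - mu)                      # IRLS weights for the logit link
  WX  <- sqrt(w) * X                        # row-scaled design
  # Leverages h_i: diagonal of the weighted hat matrix
  h   <- rowSums((WX %*% chol2inv(chol(crossprod(WX)))) * WX)
  # Working response, shifted by the Firth term h_i (0.5 - mu_i) / w_i
  z   <- eta + (y - mu) / w + h * (0.5 - mu) / w
  qr.solve(WX, sqrt(w) * z)                 # weighted least-squares update
}
```

For an intercept-only model with k successes out of n, iterating this step converges to logit((k + 0.5) / (n + 1)), the familiar "add 1/2 a success and 1/2 a failure" estimate, which is a quick sanity check on the adjustment.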
The canonical demonstration uses the sex2 dataset
shipped with logistf (Heinze and Schemper, 2002):
data(sex2, package = "logistf")
X <- model.matrix(~ age + oc + vic + vicl + vis + dia, data = sex2)
y <- sex2$case
fit_f <- fastglm(X, y, family = binomial(), firth = TRUE)
fit_l <- logistf(case ~ age + oc + vic + vicl + vis + dia, data = sex2)
cbind(fastglm = unname(coef(fit_f)), logistf = unname(coef(fit_l)))
#> fastglm logistf
#> [1,] 0.12025405 0.12025405
#> [2,] -1.10598133 -1.10598131
#> [3,] -0.06881673 -0.06881673
#> [4,] 2.26887465 2.26887464
#> [5,] -2.11140819 -2.11140817
#> [6,] -0.78831695 -0.78831694
#> [7,] 3.09601183 3.09601166
max(abs(unname(coef(fit_f)) - unname(coef(fit_l))))
#> [1] 1.67938e-07

The two implementations land on the same penalized MLE to about 2e-7.
The difference in runtime is substantial because fastglm’s Firth-fitting
code leverages our existing C++ solver:
mb_firth <- microbenchmark(
fastglm = fastglm(X, y, family = binomial(), firth = TRUE),
logistf = logistf(case ~ age + oc + vic + vicl + vis + dia, data = sex2),
times = 10L
)
print(mb_firth)
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> fastglm 181.753 195.283 214.3357 211.2935 225.951 282.408 10 a
#> logistf 6785.336 6794.684 6853.8183 6863.2565 6894.519 6922.932 10 b

firth = TRUE is currently restricted to
family = binomial(link = "logit") on a dense matrix
x; probit / cloglog use a different score correction and
will be added separately.
Hurdle models, introduced by Mullahy (1986), factorize a count
distribution into two independent pieces:

- a binary model for the indicator 1(y > 0) over the full
  sample (the zero part);
- a zero-truncated count model for y > 0
  over the positive subset (the count part).

Because the two parts share no parameters, the joint likelihood
factorizes and they can be fit independently. fastglm’s C++
driver fits both parts using the same IRLS solver as
fastglm(), with new FAM_POIS_TRUNC_* /
FAM_NB_TRUNC_* family codes that handle the truncation
correction stably (expm1, log1p near
mu = 0).
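The correction amounts to subtracting log P(Y > 0) = log(1 - exp(-mu)) from the untruncated Poisson log-density; writing it with expm1 is what keeps it stable near mu = 0. A base-R illustration (not the package’s C++ kernel):

```r
# Zero-truncated Poisson log-pmf: Poisson log-pmf minus log P(Y > 0).
# log(-expm1(-mu)) equals log(1 - exp(-mu)) but stays accurate (and finite)
# as mu -> 0, where the naive form cancels to log(0).
dtpois_log <- function(y, mu) {
  dpois(y, mu, log = TRUE) - log(-expm1(-mu))
}

log(1 - exp(-1e-17))   # -Inf: exp(-mu) rounds to 1, the difference to 0
dtpois_log(1L, 1e-17)  # finite
sum(exp(dtpois_log(1:50, mu = 2)))  # sums to 1 over the positive support
```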
The formula uses the Formula package convention
y ~ x1 + x2 | z1 + z2; the right-hand side after
| specifies the zero-part design (it defaults to the
count-part design if absent).
data(bioChemists, package = "pscl")
fit_f <- fastglm_hurdle(art ~ ., data = bioChemists, dist = "poisson")
fit_p <- pscl::hurdle (art ~ ., data = bioChemists, dist = "poisson")
max(abs(unname(coef(fit_f, "count")) - unname(fit_p$coefficients$count)))
#> [1] 3.144497e-07
max(abs(unname(coef(fit_f, "zero")) - unname(fit_p$coefficients$zero)))
#> [1] 1.27043e-10
abs(as.numeric(logLik(fit_f)) - as.numeric(logLik(fit_p)))
#> [1] 1.07093e-10

coef() and vcov() accept a
model = c("full", "count", "zero") argument so each part
can be inspected separately:
coef(fit_f, model = "count")
#> (Intercept) femWomen marMarried kid5 phd ment
#> 0.67113934 -0.22858262 0.09648498 -0.14218724 -0.01272657 0.01874550
coef(fit_f, model = "zero")
#> (Intercept) femWomen marMarried kid5 phd ment
#> 0.23679601 -0.25115113 0.32623358 -0.28524872 0.02221940 0.08012135

For NB count parts, the dispersion is estimated by an inner Brent MLE
that runs between outer IRLS iterations; the outer-loop
tolerance is controlled by outer.tol /
outer.maxit.
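For instance (a sketch: dist = "negbin" mirrors pscl::hurdle()’s naming for the NB count part, and the name of the returned theta element is an assumption):

```r
# NB count part: theta is re-estimated by Brent between outer IRLS sweeps;
# outer.tol / outer.maxit control the outer alternation.
fit_nb <- fastglm_hurdle(art ~ ., data = bioChemists, dist = "negbin",
                         outer.tol = 1e-9, outer.maxit = 100L)
fit_nb$theta  # dispersion of the zero-truncated NB count part (assumed slot)
```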
A timing comparison on a 4000-observation simulated example:
set.seed(11)
n <- 4000
x1 <- rnorm(n); x2 <- rnorm(n)
lam <- exp(0.7 + 0.4 * x1 - 0.3 * x2)
is_pos <- rbinom(n, 1, plogis(-0.4 + 0.5 * x1 + 0.2 * x2))
yt <- integer(n)
for (i in seq_len(n)) {
repeat { v <- rpois(1, lam[i]); if (v > 0) { yt[i] <- v; break } }
}
y <- ifelse(is_pos == 1, yt, 0L)
df <- data.frame(y = y, x1 = x1, x2 = x2)
mb_hurdle <- microbenchmark(
fastglm_hurdle = fastglm_hurdle(y ~ x1 + x2, data = df, dist = "poisson"),
pscl_hurdle = pscl::hurdle (y ~ x1 + x2, data = df, dist = "poisson"),
times = 5L
)
print(mb_hurdle)
#> Unit: milliseconds
#> expr min lq mean median uq max
#> fastglm_hurdle 1.187319 1.191624 1.285629 1.234223 1.390966 1.424012
#> pscl_hurdle 15.933543 16.948949 16.986948 17.167397 17.327912 17.556938
#> neval cld
#> 5 a
#> b

Zero inflation, introduced by Lambert (1992), differs from a hurdle
in that the two components share a latent variable: a y = 0
outcome could come from the inflation component (with probability
pi_i) or from the count component (with
probability 1 - pi_i and a count-side zero). The likelihood
for y = 0 is
\[ \Pr(Y_i = 0) = \pi_i + (1 - \pi_i) f(0; \mu_i) \]
and for y > 0 it is
(1 - pi_i) f(y; mu_i). The two-component mixture rules out
a closed-form factorization, so fastglm uses an EM
algorithm:

- E-step: compute the posterior membership probabilities
  tau_i = P(Z_i = 1 | y_i) for each observation, in
  log-space with logsumexp to avoid catastrophic cancellation
  when pi_i is near 0 or 1.
- M-step: a binomial fit of the inflation model on
  tau (continuous), weights 1; followed by
  a Poisson or NB fit on the original y with prior
  weights 1 - tau. For NB, an inner Brent MLE re-estimates
  theta after each count-side step.

The final observed-information vcov comes from a
numerical Jacobian of the analytical observed score at the EM fixed
point, which is stable and cheap (block-diagonal per
(gamma, beta, theta)). The complete EM driver — E-step,
both M-steps, the inner theta MLE, and the score Jacobian — runs in
C++.
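The log-space E-step is a two-term logsumexp per observation; a base-R sketch for the Poisson case (illustrative only, the package’s driver is C++):

```r
# E-step for zero-inflated Poisson: tau_i = P(Z_i = 1 | y_i).
# Only y = 0 can arise from the inflation component, so tau = 0 for y > 0.
zip_estep <- function(y, pi, mu) {
  la <- log(pi)                                # log pi_i
  lb <- log1p(-pi) + dpois(y, mu, log = TRUE)  # log (1 - pi_i) f(y; mu_i)
  m  <- pmax(la, lb)                           # shift for logsumexp
  tau <- exp(la - (m + log(exp(la - m) + exp(lb - m))))
  ifelse(y == 0, tau, 0)
}

zip_estep(c(0L, 3L), pi = 0.3, mu = 2)
# for y = 0 this equals pi / (pi + (1 - pi) * exp(-mu)); for y > 0 it is 0
```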
The formula syntax matches fastglm_hurdle() exactly:
fit_f <- fastglm_zi(art ~ ., data = bioChemists, dist = "poisson",
em.tol = 1e-10, em.maxit = 300L)
fit_p <- pscl::zeroinfl(art ~ ., data = bioChemists, dist = "poisson")
max(abs(unname(coef(fit_f, "count")) - unname(fit_p$coefficients$count)))
#> [1] 2.019915e-05
max(abs(unname(coef(fit_f, "zero")) - unname(fit_p$coefficients$zero)))
#> [1] 4.132504e-05
abs(as.numeric(logLik(fit_f)) - as.numeric(logLik(fit_p)))
#> [1] 3.170896e-07

EM is iterative, so the agreement is slightly looser than for hurdle
(where the two parts are closed-form independent fits), but coefficients
and the joint log-likelihood still match to about 1e-5.
A timing comparison on a simulated zero-inflated Poisson:
set.seed(21)
n <- 3000
x1 <- rnorm(n); x2 <- rnorm(n)
eta_c <- 0.7 + 0.4 * x1 - 0.3 * x2
eta_z <- -0.4 + 0.5 * x1 + 0.2 * x2
z <- rbinom(n, 1, plogis(eta_z))
y <- ifelse(z == 1, 0L, rpois(n, exp(eta_c)))
df <- data.frame(y = y, x1 = x1, x2 = x2)
mb_zi <- microbenchmark(
fastglm_zi = fastglm_zi(y ~ x1 + x2, data = df, dist = "poisson"),
pscl_zi = pscl::zeroinfl(y ~ x1 + x2, data = df, dist = "poisson"),
times = 5L
)
print(mb_zi)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> fastglm_zi 9.547096 9.674073 9.760296 9.742502 9.753695 10.08411 5
#> pscl_zi 25.020619 25.383756 26.361393 26.381737 27.159507 27.86135 5
#> cld
#> a
#> b

Cameron, A. C. and Trivedi, P. K. (1998). Regression Analysis of Count Data. Cambridge University Press.
Firth, D. (1993). Bias reduction of maximum likelihood estimates. Biometrika, 80(1), 27–38. doi:10.1093/biomet/80.1.27
Heinze, G. and Schemper, M. (2002). A solution to the problem of separation in logistic regression. Statistics in Medicine, 21(16), 2409–2419. doi:10.1002/sim.1047
Lambert, D. (1992). Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics, 34(1), 1–14. doi:10.2307/1269547
Mullahy, J. (1986). Specification and testing of some modified count data models. Journal of Econometrics, 33(3), 341–365. doi:10.1016/0304-4076(86)90002-3
Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S (4th ed.). Springer.
Zeileis, A., Kleiber, C., and Jackman, S. (2008). Regression models for count data in R. Journal of Statistical Software, 27(8), 1–25. doi:10.18637/jss.v027.i08