Zero-Inflated Models
Quick Reference
Classes: ZeroInflatedPoisson, ZeroInflatedNegativeBinomial
Import: from panelbox.models.count import ZeroInflatedPoisson, ZeroInflatedNegativeBinomial
Stata equivalent: zip, zinb
R equivalent: pscl::zeroinfl()
Overview
Zero-inflated (ZI) models address count data with excess zeros --- more zeros than standard Poisson or Negative Binomial distributions can explain. The key insight is that zeros can arise from two distinct processes:
- Structural zeros: the event can never occur for some observations (e.g., a firm without an R&D department will never file patents)
- Sampling zeros: the event could occur but did not in the observation period (e.g., an R&D-active firm that happened to file zero patents this year)
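ZI models combine two components in a mixture model:
\[
P(y_{it} = k) =
\begin{cases}
\pi_{it} + (1 - \pi_{it})\, f(0 \mid \mu_{it}) & k = 0 \\
(1 - \pi_{it})\, f(k \mid \mu_{it}) & k = 1, 2, \ldots
\end{cases}
\]
where \(\pi_{it} = \Lambda(Z_{it}'\gamma)\) is the probability of a structural zero (modeled by a logit), and \(f(\cdot)\) is the count distribution (Poisson or NB) with mean \(\mu_{it}\).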
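As a rough illustration of the mixture, the sketch below evaluates the ZIP probability mass function for a single observation using only the standard library. The helper name `zip_pmf` and the example values are illustrative assumptions and are not part of panelbox.

```python
import math


def zip_pmf(k, pi, mu):
    """P(y = k) under a zero-inflated Poisson with structural-zero
    probability `pi` and Poisson mean `mu` (hypothetical helper)."""
    # Poisson pmf computed on the log scale for numerical stability
    poisson_k = math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))
    if k == 0:
        # Zeros come from both processes: structural + sampling
        return pi + (1.0 - pi) * poisson_k
    # Positive counts can only come from the count process
    return (1.0 - pi) * poisson_k


# Example: 30% structural zeros, Poisson mean of 2
probs = [zip_pmf(k, pi=0.3, mu=2.0) for k in range(50)]
print(f"P(y = 0) = {probs[0]:.4f}")
print(f"Sum of probabilities = {sum(probs):.6f}")
```

Setting `pi=0` recovers the ordinary Poisson pmf, which is why the standard model can be viewed as a special case of the zero-inflated one.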
Quick Example
from panelbox.models.count import ZeroInflatedPoisson
model = ZeroInflatedPoisson(
    endog=data["patents"],
    exog_count=data[["rd_spending", "employees", "capital"]],
    exog_inflate=data[["small_firm", "new_entrant"]]
)
results = model.fit()
print(results.summary())
# Zero proportions
print(f"Actual zeros: {results.actual_zeros:.1%}")
print(f"Predicted zeros: {results.predicted_zeros:.1%}")
# Vuong test: ZIP vs standard Poisson
print(f"Vuong statistic: {results.vuong_stat:.2f} (p = {results.vuong_pvalue:.4f})")
When to Use
- Count data with more zeros than Poisson/NB can explain
- Healthcare utilization: many people never visit the doctor (structural zeros)
- Patent counts: some firms lack R&D capacity (never-patenters)
- Insurance claims: some policies cover non-claimable events
- When the zero-generating process differs from the count process
Key Assumptions
- Two-process model: zeros arise from both a structural and a sampling process
- Logit inflation: \(P(\text{structural zero}) = \Lambda(Z'\gamma)\)
- Count distribution: Poisson (ZIP) or Negative Binomial (ZINB)
- Correct specification of both parts: misspecifying the regressors in either part can bias the estimates in both
Detailed Guide
The Excess Zeros Problem
Standard count models often underpredict the number of zeros:
import numpy as np
from scipy.stats import poisson
# Check if Poisson predicts too few zeros
y = data["patents"].values
mu_hat = y.mean()
expected_zeros_poisson = len(y) * poisson.pmf(0, mu_hat)
actual_zeros = (y == 0).sum()
print(f"Actual zeros: {actual_zeros}")
print(f"Poisson-expected zeros: {expected_zeros_poisson:.0f}")
print(f"Excess zeros: {actual_zeros - expected_zeros_poisson:.0f}")
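If actual zeros greatly exceed Poisson-expected zeros, a ZI model may be appropriate. Treat this as a rough screen: it uses the unconditional mean and ignores covariates, and overdispersion alone can also produce more zeros than a Poisson predicts.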
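To see what this check picks up, the sketch below simulates counts from a zero-inflated Poisson process and compares the observed zero share with the share a plain Poisson with the same mean would imply. Everything here is illustrative: the helpers `draw_poisson` and `simulate_zip`, the parameter values, and the seed are assumptions, and only the standard library is used.

```python
import math
import random


def draw_poisson(mu, rng):
    """Draw one Poisson variate with Knuth's multiplication method."""
    threshold = math.exp(-mu)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def simulate_zip(n, pi, mu, seed=0):
    """Simulate n counts: structural zero with prob. pi, else Poisson(mu)."""
    rng = random.Random(seed)
    return [0 if rng.random() < pi else draw_poisson(mu, rng) for _ in range(n)]


y_sim = simulate_zip(n=5000, pi=0.4, mu=2.0)
zero_share = sum(1 for v in y_sim if v == 0) / len(y_sim)
mean_sim = sum(y_sim) / len(y_sim)
poisson_zero_share = math.exp(-mean_sim)  # what a plain Poisson would imply

print(f"Observed zero share: {zero_share:.3f}")
print(f"Poisson-implied zero share: {poisson_zero_share:.3f}")
```

With these settings the theoretical zero probability is 0.4 + 0.6·exp(−2), well above what a Poisson with the same overall mean implies, so the gap between the two printed shares should be clearly visible.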
Two-Part Model Structure
The ZI model splits the data-generating process:
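Part 1 --- Inflation Model (Logit):
\[
\pi_{it} = P(\text{structural zero}) = \Lambda(Z_{it}'\gamma) = \frac{\exp(Z_{it}'\gamma)}{1 + \exp(Z_{it}'\gamma)}
\]
This determines who generates structural zeros. The regressors \(Z\) can differ from \(X\).
Part 2 --- Count Model (Poisson or NB):
\[
P(y_{it} = k \mid \text{not a structural zero}) = f(k \mid \mu_{it}), \quad k = 0, 1, 2, \ldots
\]
where \(\mu_{it} = \exp(X_{it}'\beta)\). This determines the intensity of events among potential generators. For ZIP, \(f(k \mid \mu_{it}) = e^{-\mu_{it}} \mu_{it}^{k} / k!\).
Combined likelihood:
\[
\ln L = \sum_{i,t:\, y_{it} = 0} \ln\left[\pi_{it} + (1 - \pi_{it})\, f(0 \mid \mu_{it})\right] + \sum_{i,t:\, y_{it} > 0} \left[\ln(1 - \pi_{it}) + \ln f(y_{it} \mid \mu_{it})\right]
\]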
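As a sketch, the ZIP log-likelihood can be coded directly from the two parts: a logit for the structural-zero probability and a Poisson for the counts. The function `zip_loglike`, the toy data, and the coefficient values are made up for illustration; this is not how panelbox necessarily implements it.

```python
import math


def zip_loglike(y, X, Z, beta, gamma):
    """ZIP log-likelihood for lists of rows X (count part) and Z (inflation part)."""
    total = 0.0
    for y_i, x_i, z_i in zip(y, X, Z):
        # Count mean mu = exp(x'beta); structural-zero prob. pi = logistic(z'gamma)
        mu = math.exp(sum(b * x for b, x in zip(beta, x_i)))
        pi = 1.0 / (1.0 + math.exp(-sum(g * z for g, z in zip(gamma, z_i))))
        if y_i == 0:
            # A zero is either structural or a Poisson draw of zero
            total += math.log(pi + (1.0 - pi) * math.exp(-mu))
        else:
            # A positive count must come from the count process
            log_poisson = -mu + y_i * math.log(mu) - math.lgamma(y_i + 1)
            total += math.log(1.0 - pi) + log_poisson
    return total


# Toy data: intercept plus one regressor in each part (values are made up)
y = [0, 0, 3, 1, 0]
X = [[1.0, 0.5], [1.0, 1.2], [1.0, 2.0], [1.0, 0.8], [1.0, 0.1]]
Z = [[1.0, 1.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
ll = zip_loglike(y, X, Z, beta=[0.1, 0.4], gamma=[-1.0, 1.5])
print(f"log-likelihood = {ll:.4f}")
```

Maximizing this function over `beta` and `gamma` with a numerical optimizer is what a call such as `fit()` does conceptually.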
Estimation
Zero-Inflated Poisson (ZIP)
from panelbox.models.count import ZeroInflatedPoisson
model = ZeroInflatedPoisson(
    endog=data["patents"],
    exog_count=data[["rd_spending", "employees", "capital"]],
    exog_inflate=data[["small_firm", "new_entrant"]],
    exog_count_names=["rd_spending", "employees", "capital"],
    exog_inflate_names=["small_firm", "new_entrant"]
)
results = model.fit(method="BFGS", maxiter=1000)
The count and inflation components can use different regressors. This is conceptually motivated: variables that determine whether an event can occur may differ from those that determine how many events occur.
Zero-Inflated Negative Binomial (ZINB)
ZINB adds an overdispersion parameter \(\alpha\) to the count component, handling both excess zeros and overdispersion:
from panelbox.models.count import ZeroInflatedNegativeBinomial
model = ZeroInflatedNegativeBinomial(
    endog=data["claims"],
    exog_count=data[["age", "income", "risk_score"]],
    exog_inflate=data[["new_policy", "low_coverage"]],
    exog_count_names=["age", "income", "risk_score"],
    exog_inflate_names=["new_policy", "low_coverage"]
)
results = model.fit(method="L-BFGS-B", maxiter=1000)
# Overdispersion parameter
print(f"alpha = {results.alpha:.4f}")
ZIP vs ZINB
Use ZIP when the count process follows a Poisson distribution (equidispersion among observations that are not structural zeros). Use ZINB when there are both excess zeros and overdispersion in the count process. ZINB is more flexible but requires estimating one additional parameter (\(\alpha\)).
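To illustrate how overdispersion interacts with zeros, the sketch below evaluates a Negative Binomial pmf and its zero-inflated version. It assumes the common NB2 parameterization (variance \(\mu + \alpha\mu^2\)); whether panelbox uses exactly this form is an assumption, and the helpers `nb2_pmf` and `zinb_pmf` are hypothetical.

```python
import math


def nb2_pmf(k, mu, alpha):
    """NB2 pmf with mean mu and variance mu + alpha * mu**2 (assumed form)."""
    r = 1.0 / alpha  # "size" parameter
    log_p = (
        math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
        + r * math.log(r / (r + mu))
        + k * math.log(mu / (r + mu))
    )
    return math.exp(log_p)


def zinb_pmf(k, pi, mu, alpha):
    """Zero-inflated NB2: structural zero with prob. pi, else NB2(mu, alpha)."""
    base = (1.0 - pi) * nb2_pmf(k, mu, alpha)
    return pi + base if k == 0 else base


mu = 2.0
for alpha in (0.01, 0.5, 2.0):
    print(f"alpha = {alpha:>4}: P(y = 0 | count process) = {nb2_pmf(0, mu, alpha):.4f}")
print(f"Poisson benchmark: {math.exp(-mu):.4f}")
```

Under this parameterization the count process alone produces more zeros as `alpha` grows, so part of an apparent zero excess can be absorbed by overdispersion rather than by the inflation component.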
Interpreting Results
Parameter Estimates
Results contain separate coefficients for each component:
# Count model coefficients (beta)
print("Count model:")
for name, coef, se in zip(
    results.exog_count_names, results.params_count, results.bse_count
):
    print(f"  {name}: {coef:.4f} (SE = {se:.4f})")
# Inflation model coefficients (gamma)
print("\nInflation model:")
for name, coef, se in zip(
    results.exog_inflate_names, results.params_inflate, results.bse_inflate
):
    print(f"  {name}: {coef:.4f} (SE = {se:.4f})")
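Count coefficients (\(\beta\)): semi-elasticities of the expected count among potential generators. A one-unit increase in a regressor multiplies the expected count by \(\exp(\beta)\), which is approximately a \(100\beta\) percent change when \(\beta\) is small.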
Inflation coefficients (\(\gamma\)): log-odds of being a structural zero. Positive \(\gamma\) means higher probability of being a "never-taker."
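The arithmetic behind these interpretations can be sketched with hypothetical coefficient values. The numbers below are made up for illustration and do not come from any fitted model.

```python
import math

# Hypothetical estimates, for illustration only
beta_rd = 0.25        # count-model coefficient on a regressor
gamma_small = 0.80    # inflation-model coefficient on a dummy

# Count part: a one-unit increase multiplies the expected count by exp(beta)
pct_change = 100.0 * (math.exp(beta_rd) - 1.0)

# Inflation part: exp(gamma) is the odds ratio for being a structural zero
odds_ratio = math.exp(gamma_small)

# Effect on the structural-zero probability depends on the baseline index
baseline_index = -1.0  # Z'gamma with the dummy set to 0 (made up)
p0 = 1.0 / (1.0 + math.exp(-baseline_index))
p1 = 1.0 / (1.0 + math.exp(-(baseline_index + gamma_small)))

print(f"Expected count changes by {pct_change:.1f}%")
print(f"Odds of a structural zero multiply by {odds_ratio:.2f}")
print(f"P(structural zero): {p0:.3f} -> {p1:.3f}")
```

The odds ratio is constant across observations, while the change in probability depends on where the baseline index sits on the logistic curve.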
Predictions
ZI models support multiple prediction types:
# Overall expected count: E[y] = (1 - pi) * mu
y_hat = results.predict(which="mean")
# Probability of zero (total)
p_zero = results.predict(which="prob-zero")
# Structural zero probability (pi)
p_structural = results.predict(which="prob-zero-structural")
# Sampling zero probability: (1-pi) * f(0)
p_sampling = results.predict(which="prob-zero-sampling")
# Count mean among potential generators (mu)
count_mean = results.predict(which="count-mean")
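How these quantities relate to one another can be sketched for a single observation, assuming a Poisson count part. The helper `zip_predictions` is hypothetical; its dictionary keys simply mirror the `which` options used above.

```python
import math


def zip_predictions(pi, mu):
    """Return the ZIP prediction quantities for one observation."""
    sampling_zero = (1.0 - pi) * math.exp(-mu)   # (1 - pi) * f(0)
    return {
        "mean": (1.0 - pi) * mu,                 # overall expected count
        "prob-zero": pi + sampling_zero,         # total zero probability
        "prob-zero-structural": pi,
        "prob-zero-sampling": sampling_zero,
        "count-mean": mu,                        # mean among potential generators
    }


pred = zip_predictions(pi=0.3, mu=2.0)
for name, value in pred.items():
    print(f"{name:>22}: {value:.4f}")
```

The total zero probability is the sum of the structural and sampling parts, and the overall mean is the count mean scaled down by the share of potential generators.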
Model Selection: Vuong Test
The Vuong test (1989) compares the ZI model against its non-inflated counterpart:
- \(H_0\): Standard model and ZI model are equivalent
- \(H_1\): ZI model fits better (Vuong > 1.96) or standard model fits better (Vuong < -1.96); the cutoffs of ±1.96 correspond to a two-sided test at the 5% level
# ZIP results include automatic Vuong test
print(f"Vuong statistic: {results.vuong_stat:.2f}")
print(f"Vuong p-value: {results.vuong_pvalue:.4f}")
if results.vuong_stat > 1.96:
    print("ZIP preferred over Poisson")
elif results.vuong_stat < -1.96:
    print("Poisson preferred over ZIP")
else:
    print("No significant difference")
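For intuition, a basic version of the Vuong statistic can be computed from per-observation log-likelihoods of the two models. The sketch below is one textbook form with no correction for the number of parameters; the function `vuong_statistic`, the choice of standard deviation, and the input values are assumptions, and panelbox's own computation may differ.

```python
import math


def vuong_statistic(ll_model1, ll_model2):
    """Vuong z-statistic from per-observation log-likelihoods.

    Positive values favor model 1. No correction for the number of
    parameters is applied in this sketch.
    """
    n = len(ll_model1)
    m = [a - b for a, b in zip(ll_model1, ll_model2)]
    m_bar = sum(m) / n
    # Population-style standard deviation (divide by n); conventions vary
    sd = math.sqrt(sum((v - m_bar) ** 2 for v in m) / n)
    stat = math.sqrt(n) * m_bar / sd
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(stat) / math.sqrt(2.0))
    return stat, p_value


# Made-up per-observation log-likelihoods for a ZI model and a standard model
ll_zi = [-1.10, -0.70, -2.30, -0.65, -1.90, -0.72]
ll_std = [-1.35, -1.10, -2.25, -1.05, -1.95, -1.15]
stat, p_value = vuong_statistic(ll_zi, ll_std)
print(f"Vuong statistic: {stat:.2f} (p = {p_value:.4f})")
```

Swapping the two arguments flips the sign of the statistic, which matches the two-directional reading of the test described above.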
Choosing Between ZIP and ZINB
| Feature | ZIP | ZINB |
|---|---|---|
| Excess zeros | Yes | Yes |
| Overdispersion | No | Yes |
| Parameters | \(\beta, \gamma\) | \(\beta, \gamma, \alpha\) |
| Count process | Poisson | Negative Binomial |
| Use when | Equidispersion among "users" | Overdispersion among "users" |
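One informal way to compare a ZIP and a ZINB fit on the same data is through information criteria, which penalize the extra parameter \(\alpha\). The sketch below uses the standard AIC and BIC definitions with made-up log-likelihoods and parameter counts; `information_criteria` is a hypothetical helper, not a panelbox function.

```python
import math


def information_criteria(llf, n_params, n_obs):
    """AIC and BIC from a maximized log-likelihood."""
    aic = 2.0 * n_params - 2.0 * llf
    bic = n_params * math.log(n_obs) - 2.0 * llf
    return aic, bic


# Made-up fits: ZINB has one more parameter (alpha) than ZIP
n_obs = 1000
zip_aic, zip_bic = information_criteria(llf=-1520.0, n_params=7, n_obs=n_obs)
zinb_aic, zinb_bic = information_criteria(llf=-1495.0, n_params=8, n_obs=n_obs)

print(f"ZIP:  AIC = {zip_aic:.1f}, BIC = {zip_bic:.1f}")
print(f"ZINB: AIC = {zinb_aic:.1f}, BIC = {zinb_bic:.1f}")
print("Lower values indicate the preferred model.")
```

In this made-up case the gain in log-likelihood outweighs the penalty for the additional parameter, so both criteria favor ZINB.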
Configuration Options
ZeroInflatedPoisson
| Parameter | Type | Default | Description |
|---|---|---|---|
| endog | array-like | required | Dependent variable (non-negative integers) |
| exog_count | array-like | required | Regressors for count process |
| exog_inflate | array-like | None | Regressors for inflation process (defaults to exog_count) |
| exog_count_names | list | None | Variable names for count model |
| exog_inflate_names | list | None | Variable names for inflation model |
ZeroInflatedNegativeBinomial
Same parameters as ZIP. The fit() method defaults to method="L-BFGS-B" (with bounds) for numerical stability.
fit() Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| start_params | array | None | Starting values (auto-generated if None) |
| method | str | "BFGS" / "L-BFGS-B" | Optimization method |
| maxiter | int | 1000 | Maximum iterations |
Result Attributes
| Attribute | Type | Description |
|---|---|---|
| params | array | Full parameter vector \([\beta, \gamma]\) or \([\beta, \gamma, \ln\alpha]\) |
| params_count | array | Count model coefficients (\(\beta\)) |
| params_inflate | array | Inflation model coefficients (\(\gamma\)) |
| bse | array | Standard errors (all parameters) |
| bse_count | array | Standard errors for count model |
| bse_inflate | array | Standard errors for inflation model |
| llf | float | Log-likelihood |
| aic | float | Akaike Information Criterion |
| bic | float | Bayesian Information Criterion |
| actual_zeros | float | Proportion of actual zeros |
| predicted_zeros | float | Proportion of predicted zeros |
| vuong_stat | float | Vuong test statistic (ZIP only) |
| vuong_pvalue | float | Vuong test p-value (ZIP only) |
| alpha | float | Overdispersion parameter (ZINB only) |
Tutorials
| Tutorial | Description | Link |
|---|---|---|
| Count Data Models | Complete guide including ZIP and ZINB |
See Also
- Count Data Overview --- Introduction and model selection guide
- Poisson Models --- Baseline count models (no excess zeros)
- Negative Binomial --- Overdispersion without excess zeros
- Marginal Effects for Count Data --- ZI marginal effects decomposition
References
- Lambert, D. (1992). Zero-Inflated Poisson Regression, with an Application to Defects in Manufacturing. Technometrics, 34(1), 1--14.
- Vuong, Q. H. (1989). Likelihood Ratio Tests for Model Selection and Non-Nested Hypotheses. Econometrica, 57(2), 307--333.
- Hall, D. B. (2000). Zero-Inflated Poisson and Binomial Regression with Random Effects: A Case Study. Biometrics, 56(4), 1030--1039.
- Cameron, A. C., & Trivedi, P. K. (2013). Regression Analysis of Count Data (2nd ed.). Cambridge University Press, Chapter 4.
- Greene, W. H. (1994). Accounting for Excess Zeros and Sample Selection in Poisson and Negative Binomial Regression Models. Working Paper EC-94-10, NYU.