Skip to content

Choosing a Model

PanelBox provides 13 model families covering virtually every panel data scenario. This guide helps you select the right one.

Quick Decision Tree

START: What type is your dependent variable?
├─ CONTINUOUS (y is real-valued)
│   │
│   ├─ Is there a lagged dependent variable (y_{t-1})?
│   │   ├─ YES → Dynamic models
│   │   │   ├─ Single equation? → GMM (Sec. 2)
│   │   │   └─ Multiple equations? → Panel VAR (Sec. 6)
│   │   │
│   │   └─ NO → Static models
│   │       ├─ Spatial dependence? → Spatial Models (Sec. 3)
│   │       ├─ Endogenous regressor? → Panel IV (Sec. 10)
│   │       ├─ Efficiency/productivity? → SFA (Sec. 4)
│   │       ├─ Distributional effects? → Quantile Regression (Sec. 5)
│   │       └─ Standard panel → Static Models (Sec. 1)
│   │           ├─ Hausman p < 0.05 → Fixed Effects
│   │           └─ Hausman p ≥ 0.05 → Random Effects
│   │
├─ BINARY (0/1)
│   └─ Discrete Choice: Logit / Probit (Sec. 7)
├─ COUNT (0, 1, 2, ...)
│   └─ Count Data: Poisson / NegBin / PPML (Sec. 8)
├─ ORDERED CATEGORICAL (1 < 2 < 3 ...)
│   └─ Ordered Logit / Ordered Probit (Sec. 7)
├─ CENSORED / TRUNCATED
│   └─ Tobit / Honore (Sec. 9)
└─ SELECTION BIAS (outcome observed only for a subsample)
    └─ Panel Heckman (Sec. 9)

By Data Type

Continuous Dependent Variable

When your outcome is a real-valued continuous variable (wages, investment, GDP).

Model PanelBox Class When to Use
Pooled OLS PooledOLS No entity effects (baseline)
Fixed Effects FixedEffects Entity effects correlated with X
Random Effects RandomEffects Entity effects uncorrelated with X
Between BetweenEstimator Cross-sectional variation only
First Difference FirstDifferenceEstimator Alternative to FE for short T
from panelbox import FixedEffects
model = FixedEffects("y ~ x1 + x2", data, "id", "year")
results = model.fit(cov_type="clustered")
Model PanelBox Class When to Use
Difference GMM DifferenceGMM Arellano-Bond; moderate persistence
System GMM SystemGMM Blundell-Bond; high persistence
CUE GMM CUEGMM Continuous updating; robust to weak instruments
Bias-Corrected BiasCorrectedGMM Small-sample correction
from panelbox.gmm import SystemGMM
model = SystemGMM("n ~ L.n + w + k", data, "id", "year",
                   gmm_instruments=["L.n"], iv_instruments=["w", "k"])
results = model.fit(two_step=True)
Model PanelBox Class When to Use
Spatial Lag (SAR) SpatialLag Outcome depends on neighbors' outcomes
Spatial Error (SEM) SpatialError Errors correlated across neighbors
Spatial Durbin (SDM) SpatialDurbin Both spatial lag and spatial X effects
Dynamic Spatial DynamicSpatialPanel Spatial + temporal dynamics
General Nesting GeneralNestingSpatial Most flexible spatial specification
from panelbox.models.spatial import SpatialLag
model = SpatialLag("y ~ x1 + x2", data, "region", "year", W=W)
results = model.fit()
Model PanelBox Class When to Use
Panel IV (2SLS) PanelIV Endogenous regressor with instruments
Stochastic Frontier StochasticFrontier Efficiency/productivity analysis
Four-Component SFA FourComponentSFA Persistent + transient inefficiency
Quantile Regression FixedEffectsQuantile Effects at different quantiles

Binary Dependent Variable

When your outcome is 0 or 1 (employment status, default, adoption).

Model PanelBox Class When to Use Stata Equivalent
Pooled Logit PooledLogit No entity effects logit
Pooled Probit PooledProbit No entity effects probit
FE Logit FixedEffectsLogit Conditional logit (entity effects) clogit / xtlogit, fe
RE Probit RandomEffectsProbit Random entity effects xtprobit, re
Dynamic Discrete DynamicDiscreteChoice State dependence (lagged y) xtdpdml
from panelbox.models.discrete.binary import FixedEffectsLogit
model = FixedEffectsLogit("employed ~ age + education", data, "id", "year")
results = model.fit()

Count Dependent Variable

When your outcome is a non-negative integer (number of patents, accidents, visits).

Model PanelBox Class When to Use Stata Equivalent
Poisson FE PoissonFE Count data with entity effects xtpoisson, fe
Poisson RE PoissonRE Count data, random effects xtpoisson, re
PPML PPML Gravity models, zero-robust ppmlhdfe
Negative Binomial NegativeBinomialFE Overdispersed counts xtnbreg
Zero-Inflated Poisson ZeroInflatedPoisson Excess zeros zip
Zero-Inflated NB ZeroInflatedNB Excess zeros + overdispersion zinb
from panelbox.models.count.poisson import PoissonFE
model = PoissonFE("patents ~ rd + size", data, "firm", "year")
results = model.fit()

Ordered Categorical Dependent Variable

When your outcome has a natural ordering (satisfaction: 1--5, credit rating: A--D).

Model PanelBox Class Stata Equivalent
Ordered Logit OrderedLogit ologit
Ordered Probit OrderedProbit oprobit
from panelbox.models.discrete.ordered import OrderedLogit
model = OrderedLogit("rating ~ size + leverage", data, "firm", "year")
results = model.fit()

Censored, Truncated, or Selection

When your outcome is censored (e.g., capped at zero), truncated, or selectively observed.

Model PanelBox Class When to Use Stata Equivalent
Tobit Tobit Censored at a threshold xttobit
Honore Honore FE Tobit (trimmed LAD) --
Panel Heckman PanelHeckman Selection bias correction heckman (panel)
from panelbox.models.censored import Tobit
model = Tobit("hours ~ wage + children", data, "id", "year", lower=0)
results = model.fit()

By Research Question

"I need to control for unobserved entity characteristics"

Use: Fixed Effects vs Random Effects

Run the Hausman test to decide:

from panelbox import FixedEffects, RandomEffects
from panelbox.validation import HausmanTest

fe = FixedEffects("y ~ x1 + x2", data, "id", "year").fit()
re = RandomEffects("y ~ x1 + x2", data, "id", "year").fit()
print(HausmanTest(fe, re))
  • p < 0.05: Use Fixed Effects (effects correlated with X)
  • p >= 0.05: Use Random Effects (more efficient)

"My dependent variable depends on its own lag"

Use: GMM estimators

Fixed Effects is biased when the model includes a lagged dependent variable (Nickell bias). GMM uses instrumental variables to obtain consistent estimates.

from panelbox.gmm import SystemGMM
model = SystemGMM("y ~ L.y + x1 + x2", data, "id", "year",
                   gmm_instruments=["L.y"], iv_instruments=["x1", "x2"])
results = model.fit(two_step=True)

Difference vs System GMM

Use System GMM when the dependent variable is highly persistent (autoregressive coefficient > 0.8). Use Difference GMM as a more conservative baseline.

"My entities are spatially connected"

Use: Spatial panel models

When outcomes or errors are correlated across neighboring entities (regions, countries):

  • SAR: Neighbors' outcomes affect yours (trade spillovers, policy diffusion)
  • SEM: Shared unobserved shocks across neighbors
  • SDM: Both outcome and regressor spillovers

"I want to measure efficiency or productivity"

Use: Stochastic Frontier Analysis (SFA)

from panelbox.frontier import StochasticFrontier
model = StochasticFrontier("output ~ labor + capital", data, "firm", "year",
                            frontier_type="production", dist_type="half_normal")
results = model.fit()

PanelBox also offers FourComponentSFA for decomposing persistent vs transient inefficiency -- a model unique to PanelBox among Python libraries.

"I want effects at different points of the distribution"

Use: Quantile Regression

Standard regression estimates the mean effect. Quantile regression estimates effects at the median, 10th percentile, 90th percentile, etc.

Model PanelBox Class Description
Pooled Quantile PooledQuantile Ignores panel structure
FE Quantile FixedEffectsQuantile With entity fixed effects
Canay Two-Step CanayTwoStep Debiased FE quantile
Location-Scale LocationScale Heterogeneous effects on variance
QTE QuantileTreatmentEffects Treatment effects at quantiles

"I have multiple interrelated outcome variables"

Use: Panel VAR

Model dynamic interdependencies between multiple variables (e.g., GDP, investment, and consumption jointly):

from panelbox.var import PanelVAR
model = PanelVAR(data, ["gdp", "invest", "consumption"], "country", "year", lags=2)
results = model.fit()

# Impulse response functions
irf = results.irf(periods=10)
irf.plot()

Comprehensive Comparison Table

Family Key Models When to Use Key Assumption Stata Equivalent
Static FE, RE, Pooled Standard panel, no dynamics Strict exogeneity xtreg
Dynamic GMM Diff-GMM, Sys-GMM Lagged dependent variable Sequential exogeneity xtabond2
Spatial SAR, SEM, SDM Spatially connected entities Known weight matrix xsmle
SFA SF, 4-Component Efficiency measurement Composed error (v + u) xtfrontier
Quantile FE Quantile, Canay Distributional effects Quantile restrictions qregpd
Panel VAR PVAR, PVECM Multiple endogenous variables Stationarity (VAR) pvar
Discrete Logit, Probit, FE Logit Binary/multinomial outcome Latent variable model xtlogit, xtprobit
Count Poisson, NegBin, PPML Count/non-negative outcome Conditional mean spec. xtpoisson, ppmlhdfe
Censored Tobit, Honore Censored/truncated outcome Normality (Tobit) xttobit
Selection Panel Heckman Non-random sample selection Exclusion restriction heckman
IV Panel 2SLS Endogenous regressors Instrument validity xtivreg
Standard Errors HC, Cluster, DK, PCSE All models Varies by type vce() options
Diagnostics Hausman, Mundlak, RESET Model validation Varies by test Various

Common Research Scenarios

Scenario 1: Firm Investment Decisions

Data: 500 firms, 10 years. Variables: investment, sales, debt, Tobin's Q.

Question: What drives firm investment?

Recommendation: Start with Fixed Effects (firm-specific unobservables like management quality). If investment is persistent, switch to System GMM with lagged investment as an endogenous regressor.

Scenario 2: Effect of Education on Wages

Data: 5,000 workers, 5 annual surveys. Variables: wage, education, experience, industry.

Question: What is the return to education?

Recommendation: Fixed Effects to control for unobserved ability. Note that if education doesn't change much over time, FE will have low within-variation and imprecise estimates. Consider Random Effects with the Mundlak correction as an alternative.

Scenario 3: Regional Economic Growth

Data: 200 regions, 20 years, with geographic adjacency.

Question: Do neighboring regions' growth rates affect local growth?

Recommendation: Spatial Durbin Model (SDM) to capture both direct effects and spatial spillovers through the weight matrix.

Scenario 4: Patent Activity

Data: 1,000 firms, 15 years. Variable: number of patents (count).

Question: How does R&D spending affect patent output?

Recommendation: Poisson FE for count data with firm fixed effects. If there are excess zeros (many firms with zero patents), use Zero-Inflated Poisson. If variance exceeds the mean, use Negative Binomial.

Scenario 5: Bank Default Prediction

Data: 800 banks, 12 quarters. Variable: default (0/1).

Question: What predicts bank failure?

Recommendation: Fixed Effects Logit (conditional logit) for binary outcome with bank-specific unobservables. If state dependence matters (past default predicts future default), use Dynamic Discrete Choice.

Scenario 6: Hospital Efficiency

Data: 300 hospitals, 8 years. Variables: output (patients treated), inputs (staff, beds, equipment).

Question: How efficient are hospitals, and has efficiency changed over time?

Recommendation: Stochastic Frontier Analysis with a production frontier. Use Four-Component SFA to separate persistent inefficiency (structural issues) from transient inefficiency (temporary shocks).

Testing Your Model Choice

After selecting a model, run diagnostic tests to validate the choice:

Static Model Diagnostics

from panelbox import FixedEffects, RandomEffects
from panelbox.validation import HausmanTest, MundlakTest

fe_results = FixedEffects("y ~ x1 + x2", data, "id", "year").fit()
re_results = RandomEffects("y ~ x1 + x2", data, "id", "year").fit()

# Hausman test: FE vs RE
print(HausmanTest(fe_results, re_results))

# Mundlak test: alternative to Hausman
print(MundlakTest(re_results))

GMM Diagnostics

# After fitting a GMM model:
# 1. Hansen J-test (overidentification)
print(f"Hansen J p-value: {results.hansen_j.pvalue:.3f}")  # Want p > 0.10

# 2. AR(2) test (no second-order serial correlation)
print(f"AR(2) p-value: {results.ar2_test.pvalue:.3f}")     # Want p > 0.10

# 3. Instrument count
print(f"Instruments: {results.n_instruments}")
print(f"Groups: {results.n_entities}")                      # Want instruments < groups

General Specification Tests

Test PanelBox Class Null Hypothesis Use When
Hausman HausmanTest RE is consistent Choosing FE vs RE
Mundlak MundlakTest No correlation with effects Alternative to Hausman
Breusch-Pagan BreuschPaganTest Homoskedasticity Checking error variance
Wooldridge AR WooldridgeTest No serial correlation Checking autocorrelation
Pesaran CD PesaranCDTest No cross-sectional dependence Macro panels
RESET RESETTest Correct functional form Specification check
Chow ChowTest Structural stability Testing for breaks

Further Reading

Each model family has a dedicated section in the User Guide: