Skip to content

General FAQ

Common questions for beginners and intermediate users of PanelBox.

Looking for more?


Getting Started

How do I install PanelBox?

Install from PyPI:

pip install panelbox

For the full installation with all optional dependencies (spatial, visualization, reports):

pip install panelbox[all]

See the Getting Started guide for detailed instructions.

What Python versions are supported?

PanelBox supports Python 3.9 and later. We recommend Python 3.10+ for the best experience.

What data format does PanelBox expect?

PanelBox expects a long-format pandas DataFrame with one row per entity-time observation. You specify the entity column and time column when creating a model:

from panelbox import FixedEffects

# data must have columns for entity, time, and variables
model = FixedEffects("invest ~ value + capital", data, "firm", "year")
result = model.fit()

Key requirements:

  • Long format: each row is one entity at one time period
  • Entity column: identifies the cross-sectional unit (firm, country, individual)
  • Time column: identifies the time period (year, quarter, date)
  • Variable columns: must be numeric (float or int)

If your data is in wide format, convert it first:

import pandas as pd

data_long = pd.melt(
    data_wide,
    id_vars=["firm"],
    value_vars=["sales_2020", "sales_2021", "sales_2022"],
    var_name="year",
    value_name="sales"
)
data_long["year"] = data_long["year"].str.extract(r"(\d+)").astype(int)

For a complete guide, see How to Load Data.

How do I load my CSV data?
import pandas as pd
from panelbox import FixedEffects

# 1. Load CSV
data = pd.read_csv("my_panel_data.csv")

# 2. Check structure
print(data.head())
print(f"Entities: {data['entity_id'].nunique()}")
print(f"Periods: {data['year'].nunique()}")

# 3. Estimate a model
model = FixedEffects(
    formula="y ~ x1 + x2",
    data=data,
    entity_col="entity_id",
    time_col="year"
)
results = model.fit()
print(results.summary())

PanelBox also supports loading from Excel, Stata (.dta), SQL databases, and R files. See How to Load Data for all formats.

What datasets come built-in?

PanelBox ships with classic econometric datasets:

from panelbox.datasets import load_grunfeld, load_abdata, list_datasets

# List all available datasets
print(list_datasets())

# Grunfeld: firm investment (10 firms, 20 years)
grunfeld = load_grunfeld()

# ABdata: employment dynamics (140 firms, 9 years)
abdata = load_abdata()
Can PanelBox handle unbalanced panels?

Yes. Most PanelBox estimators handle unbalanced panels automatically. You can check your panel's balance:

from panelbox.core import PanelData

panel = PanelData(data, "firm", "year")
print(f"Balanced: {panel.is_balanced}")
print(f"Entities: {panel.n_entities}, Periods: {panel.n_periods}")

Model Selection

How do I choose between Fixed Effects and Random Effects?

Run the Hausman test:

from panelbox import FixedEffects, RandomEffects
from panelbox.validation import HausmanTest

fe = FixedEffects("invest ~ value + capital", data, "firm", "year")
re = RandomEffects("invest ~ value + capital", data, "firm", "year")

fe_results = fe.fit()
re_results = re.fit()

hausman = HausmanTest(fe_results, re_results)
result = hausman.run()
print(result.conclusion)

Interpretation:

  • p < 0.05 → Use Fixed Effects (entity effects are correlated with regressors)
  • p >= 0.05Random Effects is more efficient

If the Hausman test statistic is negative, use the Mundlak test instead (see Common Pitfalls).

For a detailed decision tree, see How to Choose a Model.

When should I use GMM instead of Fixed Effects?

Use GMM (Difference or System GMM) when your model includes a lagged dependent variable:

y_it = α * y_{i,t-1} + β * X_it + η_i + ε_it

Including \(y_{i,t-1}\) as a regressor creates correlation with the error term, making Fixed Effects biased (Nickell bias). GMM uses instrumental variables to handle this endogeneity.

Rule of thumb:

  • No lagged dependent variable → Fixed Effects or Random Effects
  • Lagged dependent variable present → GMM
from panelbox.gmm import SystemGMM

model = SystemGMM(
    "n ~ L.n + w + k", data, "id", "year",
    gmm_instruments=["L.n"],
    iv_instruments=["w", "k"]
)
result = model.fit()

See the GMM tutorial for details.

Which spatial model should I use?

Use the LM test decision tree:

LM Test Result Recommended Model
Only LM-lag significant SAR (Spatial Lag)
Only LM-error significant SEM (Spatial Error)
Both significant → check robust versions SDM or GNS
Robust LM-lag significant SAR
Robust LM-error significant SEM
from panelbox.diagnostics import lm_lag_test, lm_error_test

lm_lag = lm_lag_test(result, W)
lm_err = lm_error_test(result, W)

print(f"LM-lag p-value: {lm_lag.pvalue:.4f}")
print(f"LM-error p-value: {lm_err.pvalue:.4f}")

When in doubt, LeSage & Pace (2009) recommend starting with SDM (Spatial Durbin Model). See Spatial FAQ for details.

When do I need PPML?

Use PPML (Poisson Pseudo-Maximum Likelihood) for:

  • Gravity models with zero trade flows (OLS on logs drops zeros)
  • Heteroskedastic count-like data
  • When the dependent variable is non-negative with many zeros
from panelbox.models.count import PPML

ppml = PPML(data, dep_var="trade",
            exog_vars=["log_distance", "log_gdp_i", "log_gdp_j"])
result = ppml.fit()

See Advanced FAQ for more details.


Standard Errors

Which standard errors should I use?

Decision tree:

Situation Standard Error Type Code
Default / baseline Non-robust cov_type='nonrobust'
Heteroskedasticity Robust (HC1) cov_type='robust'
Panel data (most common) Clustered by entity cov_type='clustered'
Cross-sectional dependence Driscoll-Kraay Use DriscollKraayStandardErrors
Spatial correlation Spatial HAC Use SpatialHAC
Both spatial and temporal Panel-corrected (PCSE) Use PanelCorrectedStandardErrors

Safe default for panel data: cluster by entity.

results = model.fit(cov_type="clustered")
How do I cluster standard errors?
# Cluster by entity (most common)
results = model.fit(cov_type="clustered")

# Two-way clustering (entity and time)
from panelbox.standard_errors import twoway_cluster
vcov = twoway_cluster(results)
What's the difference between HC0, HC1, HC2, HC3?

These are different heteroskedasticity-consistent (HC) variance estimators:

Variant Correction Best for
HC0 None (White, 1980) Large samples
HC1 Degrees-of-freedom General use (most common)
HC2 Leverage-based Moderate samples
HC3 Jackknife-like Small samples, most conservative

In practice, HC1 is the default and works well for most applications. Use HC3 for small samples where you want conservative inference.


Results Interpretation

How do I get coefficients, p-values, and confidence intervals?
results = model.fit()

# Coefficients
print(results.params)

# Standard errors
print(results.std_errors)

# P-values
print(results.pvalues)

# Confidence intervals
print(results.conf_int(alpha=0.05))

# Full summary table
print(results.summary())
How do I compute marginal effects?

For nonlinear models (logit, probit, count), coefficients are not directly interpretable. Compute marginal effects:

from panelbox.marginal_effects import compute_ame, compute_mem

# Average Marginal Effects (AME)
ame = compute_ame(results, data)
print(ame)

# Marginal Effects at the Mean (MEM)
mem = compute_mem(results, data)
print(mem)

For ordered models:

from panelbox.marginal_effects import compute_ordered_ame
ame = compute_ordered_ame(results, data)

See the Marginal Effects tutorial for detailed examples.

How do I compare two models?

Use the PanelExperiment workflow:

from panelbox import PanelExperiment

exp = PanelExperiment(data, "invest ~ value + capital", "firm", "year")
exp.fit_all_models(["pooled", "fe", "re"])
comparison = exp.compare_models(["pooled", "fe", "re"])
print(comparison.summary())

For formal model testing between non-nested models, use the J-test:

from panelbox.diagnostics.specification import j_test
result = j_test(result1, result2)
print(result.summary())
How do I export results to LaTeX or HTML?
from panelbox.report import ReportManager

report = ReportManager(results)

# LaTeX table
report.export_latex("results.tex")

# HTML report with interactive charts
report.export_html("analysis.html")

Or use the full Experiment workflow for a master report:

exp = PanelExperiment(data, formula, entity, time)
exp.fit_all_models(["pooled", "fe", "re"])
exp.save_master_report("master_report.html")

Common Pitfalls

My Hausman test statistic is negative — what does this mean?

A negative Hausman test statistic can occur when the estimated variance difference is not positive semi-definite. This does not mean FE or RE is better — the test is simply unreliable in this case.

Solution: Use the Mundlak test instead:

from panelbox.validation import MundlakTest

mundlak = MundlakTest(data, "invest ~ value + capital", "firm", "year")
result = mundlak.run()
print(result.conclusion)

The Mundlak test is a robust alternative that adds group means of regressors to the RE model and tests their joint significance.

My GMM has too many instruments — what should I do?

Instrument proliferation leads to overfitting and makes the Hansen J test unreliable. Rule of thumb: number of instruments should be less than N (number of entities).

Solutions:

  1. Use collapse=True to reduce instrument count:

    model = SystemGMM(
        "n ~ L.n + w + k", data, "id", "year",
        gmm_instruments=["L.n"],
        iv_instruments=["w", "k"],
        collapse=True
    )
    

  2. Limit lag depth:

    model = SystemGMM(
        ...,
        max_lags=2  # Use at most 2 lags as instruments
    )
    

  3. Check the instrument-to-entity ratio after estimation:

    result = model.fit()
    print(f"Instruments: {result.n_instruments}")
    print(f"Entities: {result.n_entities}")
    # Instruments should be < N
    

R-squared is low — is my model bad?

Not necessarily. In panel data econometrics, R-squared is less meaningful than in cross-sectional analysis:

  • Within R-squared (FE) only captures time variation, which is often small
  • A low R-squared does not mean the coefficients are wrong or insignificant
  • What matters more: coefficient significance, diagnostic tests, and economic interpretation

Focus on:

  • Statistical significance of coefficients
  • Correct signs (consistent with theory)
  • Diagnostic test results (Hausman, serial correlation, etc.)
  • Robustness across specifications
My model shows 'ConvergenceWarning' — what should I do?

For MLE-based models (logit, probit, Heckman, SFA), convergence issues often arise from:

  1. Poor starting values — try providing manual starting values
  2. Model too complex — simplify the specification
  3. Data issues — check for perfect separation (logit/probit) or outliers
# Increase iterations
result = model.fit(maxiter=500)

# Try different optimizer
result = model.fit(method="bfgs")

See Troubleshooting for detailed solutions.


What's Next?

  • Advanced FAQ

    Technical questions about GMM, VAR, Heckman, cointegration, and performance optimization.

  • Spatial FAQ

    Questions specific to spatial econometrics — weight matrices, model selection, effects interpretation.

  • Troubleshooting

    Common error messages, debugging strategies, and step-by-step solutions.

  • How to Choose a Model

    Decision trees and checklists for selecting the right panel estimator.