Difference GMM (Arellano-Bond)
Quick Reference
- Class: `panelbox.gmm.DifferenceGMM`
- Import: `from panelbox.gmm import DifferenceGMM`
- Stata equivalent: `xtabond2 y L.y x1 x2, gmm(y, lag(2 .)) iv(x1 x2) noleveleq`
- R equivalent: `pgmm(y ~ lag(y, 1) + x1 + x2 | lag(y, 2:99), transformation = "d")`
Overview
Difference GMM, introduced by Arellano and Bond (1991), is the foundational estimator for dynamic panel data models where a lagged dependent variable appears as a regressor alongside individual fixed effects. Standard estimators (OLS, Fixed Effects, Random Effects) all produce biased and inconsistent estimates in this setting.
The key insight is to first-difference the equation to eliminate fixed effects, then use lagged levels of the dependent variable as instruments for the differenced equation. This yields consistent estimates under the assumption that the original errors are serially uncorrelated.
Difference GMM is the workhorse estimator for dynamic panels with short T (few time periods) and large N (many cross-sectional units). PanelBox's implementation provides feature parity with Stata's xtabond2, including collapsed instruments (Roodman 2009) and the Windmeijer (2005) finite-sample correction for two-step standard errors.
Quick Example
```python
from panelbox.gmm import DifferenceGMM
from panelbox.datasets import load_abdata

# Load Arellano-Bond employment dataset
data = load_abdata()

# Estimate Difference GMM
model = DifferenceGMM(
    data=data,
    dep_var="n",
    lags=1,
    id_var="id",
    time_var="year",
    exog_vars=["w", "k"],
    collapse=True,
    two_step=True,
    robust=True,
)
results = model.fit()
print(results.summary())
```
When to Use
- Dynamic panel models where \(y_{it}\) depends on \(y_{i,t-1}\)
- Short panels with small T (typically T < 20) and large N
- Fixed effects are correlated with regressors (rules out RE)
- Strict exogeneity fails (some regressors are endogenous or predetermined)
- Moderate persistence: AR coefficient \(\gamma < 0.8\) (for persistent series, consider System GMM)
Key Assumptions
- No serial correlation in idiosyncratic errors: \(E[\varepsilon_{it} \varepsilon_{is}] = 0\) for \(t \neq s\)
- Sequential exogeneity: \(E[y_{i,t-s} \varepsilon_{it}] = 0\) for \(s \geq 1\) (lagged levels are valid instruments)
- Large N asymptotics: Consistency relies on \(N \to \infty\) with T fixed
- Initial conditions: \(E[y_{i1} \varepsilon_{it}] = 0\) for \(t \geq 2\)
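Detailed Guide
The Dynamic Panel Problem
Consider the standard dynamic panel model:
\[ y_{it} = \gamma y_{i,t-1} + x_{it}'\beta + \alpha_i + \varepsilon_{it} \]
where \(x_{it}\) collects any additional regressors, \(\alpha_i\) is an unobserved individual fixed effect, and \(\varepsilon_{it}\) is the idiosyncratic error.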
Why OLS is biased upward: The lagged dependent variable \(y_{i,t-1}\) is correlated with \(\alpha_i\) (since \(\alpha_i\) affects all periods), producing omitted variable bias. OLS overestimates \(\gamma\).
Why Fixed Effects is biased (Nickell bias): The within transformation creates correlation between the demeaned lagged variable \((y_{i,t-1} - \bar{y}_{i,-1})\) and the demeaned error \((\varepsilon_{it} - \bar{\varepsilon}_i)\). The lagged value \(y_{i,t-1}\) depends on \(\varepsilon_{i,t-1}\), which is part of \(\bar{\varepsilon}_i\), and the individual mean of the lagged variable \(\bar{y}_{i,-1}\) contains \(y_{it}\), which depends on \(\varepsilon_{it}\). For large N the bias is approximately \(-(1 + \gamma)/(T - 1)\), so FE underestimates \(\gamma\), severely when T is small.
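To get a feel for the size of the Nickell bias, the approximation quoted above can be evaluated for a few panel lengths. This is plain arithmetic on the stated formula; the function name `nickell_bias` is hypothetical and not part of PanelBox.

```python
def nickell_bias(gamma: float, T: int) -> float:
    """Approximate large-N bias of the FE estimate of gamma: -(1 + gamma) / (T - 1)."""
    return -(1.0 + gamma) / (T - 1)


gamma = 0.5
for T in (5, 10, 20, 50):
    bias = nickell_bias(gamma, T)
    print(f"T={T:>2}: approx. bias {bias:+.3f}, implied FE estimate {gamma + bias:.3f}")
```

With \(\gamma = 0.5\) and T = 5 the approximation gives a bias of -0.375, while at T = 50 it shrinks to about -0.03. The approximation itself is rough for very short panels, so treat these as orders of magnitude.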
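Result: For a true coefficient \(\gamma\), we expect:
\[ \hat{\gamma}_{FE} < \gamma < \hat{\gamma}_{OLS} \]
The FE and OLS estimates therefore bracket the true value, and a credible GMM estimate should fall between them.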
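The ordering of the OLS and FE estimates can be checked on simulated data. The sketch below uses only NumPy and a deliberately simple data-generating process (no extra regressors, standard normal effects and errors), all of which are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, gamma, burn = 2000, 6, 0.5, 50

# Simulate y_it = gamma * y_{i,t-1} + alpha_i + eps_it, discarding a burn-in
alpha = rng.normal(size=N)                  # individual fixed effects
y = np.zeros((N, T + burn))
for t in range(1, T + burn):
    y[:, t] = gamma * y[:, t - 1] + alpha + rng.normal(size=N)
y = y[:, burn:]                             # keep the last T periods

y_lag, y_cur = y[:, :-1], y[:, 1:]

# Pooled OLS with an intercept: ignores alpha_i, so gamma is overestimated
x_ols = y_lag.ravel() - y_lag.mean()
y_ols = y_cur.ravel() - y_cur.mean()
gamma_ols = (x_ols @ y_ols) / (x_ols @ x_ols)

# Fixed effects: demean within each individual, so gamma is underestimated
x_fe = (y_lag - y_lag.mean(axis=1, keepdims=True)).ravel()
y_fe = (y_cur - y_cur.mean(axis=1, keepdims=True)).ravel()
gamma_fe = (x_fe @ y_fe) / (x_fe @ x_fe)

print(f"FE: {gamma_fe:.3f}   true: {gamma:.3f}   OLS: {gamma_ols:.3f}")
```

On real data the true value is unknown, but the same two regressions give a quick plausibility range for the GMM estimate of the lag coefficient.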
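First-Differencing and Instruments
Step 1: First-difference to eliminate \(\alpha_i\):
\[ \Delta y_{it} = \gamma \Delta y_{i,t-1} + \Delta x_{it}'\beta + \Delta \varepsilon_{it} \]
Differencing removes \(\alpha_i\), but \(\Delta y_{i,t-1}\) is still correlated with \(\Delta \varepsilon_{it}\) through \(\varepsilon_{i,t-1}\), so OLS on the differenced equation remains inconsistent.
Step 2: Use lagged levels as instruments. The key moment conditions are:
\[ E[y_{i,t-s} \, \Delta \varepsilon_{it}] = 0 \quad \text{for } s \geq 2 \]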
This works because \(y_{i,t-2}\) is predetermined (determined before \(\varepsilon_{it}\)) and uncorrelated with \(\Delta \varepsilon_{it} = \varepsilon_{it} - \varepsilon_{i,t-1}\) under the assumption of no serial correlation in levels.
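The role of the instrument can be seen with the simplest possible version of this idea: a just-identified IV regression of the differenced equation that uses only \(y_{i,t-2}\) as instrument (the Anderson-Hsiao approach, which Difference GMM generalizes by using all available lags). The simulation settings below are illustrative assumptions, and this is not how PanelBox computes its estimates.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T, gamma, burn = 5000, 6, 0.5, 50

# Simulate y_it = gamma * y_{i,t-1} + alpha_i + eps_it
alpha = rng.normal(size=N)
y = np.zeros((N, T + burn))
for t in range(1, T + burn):
    y[:, t] = gamma * y[:, t - 1] + alpha + rng.normal(size=N)
y = y[:, burn:]

dy = np.diff(y, axis=1)          # dy[:, j] = y[:, j + 1] - y[:, j]
dy_cur = dy[:, 1:].ravel()       # differenced outcome, periods t = 2..T-1
dy_lag = dy[:, :-1].ravel()      # differenced lag, aligned with dy_cur
z = y[:, :-2].ravel()            # level instrument y_{t-2}, aligned with dy_cur

# OLS on the differenced equation: inconsistent, dy_lag correlates with the error
gamma_fd_ols = (dy_lag @ dy_cur) / (dy_lag @ dy_lag)

# IV with y_{t-2} as instrument: consistent under no serial correlation
gamma_iv = (z @ dy_cur) / (z @ dy_lag)

print(f"FD-OLS: {gamma_fd_ols:.3f}   IV: {gamma_iv:.3f}   true: {gamma:.3f}")
```

In this design OLS on the differenced equation is badly biased (it even gets the sign wrong), while the IV estimate lands near the true value.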
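One-Step vs Two-Step Estimation
One-step GMM uses the weighting matrix \(W_1 = (Z'HZ)^{-1}\), where \(H\) is block-diagonal across individuals and each block has 2 on the main diagonal and \(-1\) on the first off-diagonals, matching the MA(1) structure that first-differencing induces in the errors. It is consistent, but efficient only when the errors are homoskedastic.
Two-step GMM constructs an optimal weighting matrix from one-step residuals:
\[ W_2 = \left( \sum_{i=1}^{N} Z_i' \, \Delta\hat{\varepsilon}_i \, \Delta\hat{\varepsilon}_i' \, Z_i \right)^{-1} \]
where \(\Delta\hat{\varepsilon}_i\) is the vector of one-step residuals for individual \(i\).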
Two-step is asymptotically efficient but has downward-biased standard errors in finite samples.
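The two weighting schemes can be written out in a few lines of NumPy. The following is a minimal sketch for a pure AR(1) model on simulated data, with the full (uncollapsed) instrument set and no Windmeijer correction. Variable names and simulation settings are assumptions for illustration, and PanelBox's internals may differ.

```python
import numpy as np

rng = np.random.default_rng(3)
N, T, gamma, burn = 3000, 6, 0.5, 50

# Simulate y_it = gamma * y_{i,t-1} + alpha_i + eps_it
alpha = rng.normal(size=N)
y = np.zeros((N, T + burn))
for t in range(1, T + burn):
    y[:, t] = gamma * y[:, t - 1] + alpha + rng.normal(size=N)
y = y[:, burn:]

# Differenced equation for periods t = 2..T-1 (0-indexed)
dy = np.diff(y, axis=1)
Y, X = dy[:, 1:], dy[:, :-1]             # each (N, T-2)
n_eq = T - 2

# Instrument matrix Z_i: the row for period t holds y_0..y_{t-2} in its own columns
n_inst = n_eq * (n_eq + 1) // 2
Z = np.zeros((N, n_eq, n_inst))
col = 0
for r in range(n_eq):
    k = r + 1
    Z[:, r, col:col + k] = y[:, :k]
    col += k

Zx = np.einsum("irl,ir->l", Z, X)        # sum_i Z_i' X_i
Zy = np.einsum("irl,ir->l", Z, Y)        # sum_i Z_i' Y_i


def gmm(W):
    """Linear GMM estimate of gamma for a given weighting matrix W."""
    return (Zx @ W @ Zy) / (Zx @ W @ Zx)


# One-step: H has 2 on the diagonal and -1 on the first off-diagonals
H = 2 * np.eye(n_eq) - np.eye(n_eq, k=1) - np.eye(n_eq, k=-1)
W1 = np.linalg.inv(np.einsum("irl,rs,ism->lm", Z, H, Z))
gamma_1 = gmm(W1)

# Two-step: weight by outer products of the one-step residual moments
resid = Y - gamma_1 * X
Ze = np.einsum("irl,ir->il", Z, resid)   # row i is Z_i' e_i
W2 = np.linalg.inv(Ze.T @ Ze)
gamma_2 = gmm(W2)

print(f"one-step: {gamma_1:.3f}   two-step: {gamma_2:.3f}   true: {gamma:.3f}")
```

With homoskedastic simulated errors like these, the one-step and two-step point estimates should be close. The practical difference lies in the standard errors, which this sketch does not compute.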
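Windmeijer (2005) Correction
The Windmeijer correction adjusts two-step standard errors for the extra variability introduced by estimating the weighting matrix from one-step residuals. Uncorrected two-step standard errors tend to be too small in finite samples, so the correction matters in practice. It is applied automatically when `robust=True` (the default).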
Best Practice
Always use `two_step=True` with `robust=True` to get efficient estimates with properly corrected standard errors.
Data Preparation
```python
import pandas as pd
from panelbox.datasets import load_abdata

data = load_abdata()

# Check panel structure
is_balanced = data.groupby("id").size().nunique() == 1
print(f"Panels (N): {data['id'].nunique()}")
print(f"Time periods (T): {data['year'].nunique()}")
print(f"Observations: {len(data)}")
print(f"Balance: {'Balanced' if is_balanced else 'Unbalanced'}")
```
Estimation
```python
from panelbox.gmm import DifferenceGMM

model = DifferenceGMM(
    data=data,
    dep_var="n",             # Log employment
    lags=1,                  # AR(1) specification
    id_var="id",
    time_var="year",
    exog_vars=["w", "k"],    # Wages and capital (strictly exogenous)
    time_dummies=True,       # Include time fixed effects
    collapse=True,           # Collapse instruments (Roodman 2009)
    two_step=True,           # Two-step estimation
    robust=True,             # Windmeijer-corrected SEs
)
results = model.fit()
```
With predetermined and endogenous variables (the column names `policy`, `capital`, and `labor` are illustrative; substitute columns from your own data):
```python
model = DifferenceGMM(
    data=data,
    dep_var="n",
    lags=1,
    id_var="id",
    time_var="year",
    exog_vars=["policy"],             # Strictly exogenous
    predetermined_vars=["capital"],   # Instruments: t-2 and earlier
    endogenous_vars=["labor"],        # Instruments: t-3 and earlier
    collapse=True,
    two_step=True,
)
results = model.fit()
```
Interpreting Results
```python
# Coefficient on lagged dependent variable
gamma = results.params["L1.n"]
se = results.std_errors["L1.n"]
print(f"Persistence: {gamma:.4f} (SE: {se:.4f})")

# 95% confidence interval
ci = results.conf_int()
print(f"95% CI: [{ci.loc['L1.n', 'lower']:.4f}, {ci.loc['L1.n', 'upper']:.4f}]")

# Diagnostic tests
print(f"AR(2) p-value: {results.ar2_test.pvalue:.4f}")
print(f"Hansen J p-value: {results.hansen_j.pvalue:.4f}")
print(f"Instrument ratio: {results.instrument_ratio:.3f}")
```
Configuration Options
| Parameter | Type | Default | Description |
|---|---|---|---|
| `data` | `pd.DataFrame` | required | Panel data in long format |
| `dep_var` | `str` | required | Dependent variable name |
| `lags` | `int` or `list[int]` | required | Lags of dependent variable (e.g., `1` or `[1, 2]`) |
| `id_var` | `str` | `"id"` | Cross-sectional identifier |
| `time_var` | `str` | `"year"` | Time variable |
| `exog_vars` | `list[str]` | `None` | Strictly exogenous variables |
| `endogenous_vars` | `list[str]` | `None` | Endogenous variables (instrumented with t-3+) |
| `predetermined_vars` | `list[str]` | `None` | Predetermined variables (instrumented with t-2+) |
| `time_dummies` | `bool` | `True` | Include time fixed effects |
| `collapse` | `bool` | `False` | Collapse instruments (Roodman 2009) |
| `two_step` | `bool` | `True` | Use two-step estimation |
| `robust` | `bool` | `True` | Windmeijer-corrected standard errors |
| `gmm_type` | `str` | `"two_step"` | `"one_step"`, `"two_step"`, or `"iterative"` |
| `gmm_max_lag` | `int` or `None` | `None` | Maximum lag for GMM instruments (`None` = all) |
| `iv_max_lag` | `int` | `0` | Maximum lag for IV instruments of exogenous vars |
Recommended Settings
For most applications, use `collapse=True`, `two_step=True`, `robust=True`. Only change these for specific robustness checks.
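To make the effect of collapsing concrete, the sketch below builds both instrument layouts for a single individual, following the description in Roodman (2009). The helper names `full_instruments` and `collapsed_instruments` are hypothetical, and it is an assumption that PanelBox orders its columns the same way.

```python
import numpy as np


def full_instruments(y_i):
    """Uncollapsed layout: every (period, lag) pair gets its own column."""
    n_eq = len(y_i) - 2                      # differenced equations t = 2..T-1
    Z = np.zeros((n_eq, n_eq * (n_eq + 1) // 2))
    col = 0
    for r in range(n_eq):                    # row r is period t = r + 2
        k = r + 1                            # y_0..y_{t-2} are available
        Z[r, col:col + k] = y_i[:k]
        col += k
    return Z


def collapsed_instruments(y_i):
    """Collapsed layout: one column per lag distance, shared across periods."""
    n_eq = len(y_i) - 2
    Z = np.zeros((n_eq, n_eq))
    for r in range(n_eq):                    # row r is period t = r + 2
        for j in range(r + 1):               # column j holds y_{t-2-j}
            Z[r, j] = y_i[r - j]
    return Z


y_i = np.arange(1.0, 7.0)                    # toy series with T = 6
Z_full = full_instruments(y_i)
Z_coll = collapsed_instruments(y_i)
print("full:", Z_full.shape, "collapsed:", Z_coll.shape)
print(Z_coll)
```

For T periods the full layout has (T-2)(T-1)/2 columns while the collapsed layout has T-2, so the instrument count grows linearly rather than quadratically in T.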
Diagnostics
Essential Diagnostic Tests
```python
# 1. AR(2) test - CRITICAL: must NOT reject
ar2 = results.ar2_test
print(f"AR(2): z={ar2.statistic:.3f}, p={ar2.pvalue:.4f} [{ar2.conclusion}]")

# 2. Hansen J test - instrument validity
hansen = results.hansen_j
print(f"Hansen J: stat={hansen.statistic:.3f}, p={hansen.pvalue:.4f} [{hansen.conclusion}]")

# 3. Instrument ratio - overfitting check
print(f"Instruments: {results.n_instruments}, Groups: {results.n_groups}")
print(f"Instrument ratio: {results.instrument_ratio:.3f}")

# 4. AR(1) test - expected to reject
ar1 = results.ar1_test
print(f"AR(1): z={ar1.statistic:.3f}, p={ar1.pvalue:.4f} [{ar1.conclusion}]")
```
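Why AR(1) is expected to reject while AR(2) must not: the tests are run on differenced residuals, and differencing serially uncorrelated errors mechanically creates first-order correlation of about -0.5 but no second-order correlation. A small NumPy check under the assumption of i.i.d. normal level errors:

```python
import numpy as np

rng = np.random.default_rng(2)
eps = rng.normal(size=(5000, 8))     # serially uncorrelated level errors
d_eps = np.diff(eps, axis=1)         # first-differenced errors


def autocorr(d, lag):
    """Pooled correlation between d_t and d_{t-lag} across all individuals."""
    return np.corrcoef(d[:, lag:].ravel(), d[:, :-lag].ravel())[0, 1]


r1, r2 = autocorr(d_eps, 1), autocorr(d_eps, 2)
print(f"first-order: {r1:.3f}   second-order: {r2:.3f}")
```

If the level errors were themselves serially correlated, the second-order correlation of the differences would move away from zero, which is what the AR(2) test is designed to detect.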
Diagnostic Checklist
| Test | Criterion | Interpretation |
|---|---|---|
| AR(2) | p > 0.10 | Moment conditions valid |
| Hansen J | 0.10 < p < 0.25 | Instruments appear valid |
| Hansen J | p > 0.25 | Possible weak instruments |
| Hansen J | p < 0.10 | Instruments rejected |
| Instrument ratio | < 1.0 | No proliferation |
| AR(1) | p < 0.10 | Expected (informational) |
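The checklist can be turned into a small helper that takes plain numbers, so it works with values read off any results object. The function `check_gmm_diagnostics` is hypothetical and not part of PanelBox, the treatment of values exactly at 0.10 and 0.25 is an assumption, and the inputs in the usage lines are made-up illustrative values.

```python
def check_gmm_diagnostics(ar1_p, ar2_p, hansen_p, n_instruments, n_groups):
    """Apply the checklist thresholds and return one message per diagnostic."""
    notes = {}

    if ar2_p > 0.10:
        notes["AR(2)"] = "ok: moment conditions not rejected"
    else:
        notes["AR(2)"] = "problem: second-order serial correlation detected"

    if hansen_p < 0.10:
        notes["Hansen J"] = "problem: instruments rejected"
    elif hansen_p <= 0.25:
        notes["Hansen J"] = "ok: instruments appear valid"
    else:
        notes["Hansen J"] = "caution: p > 0.25, possible weak instruments"

    ratio = n_instruments / n_groups
    if ratio < 1.0:
        notes["Instrument ratio"] = f"ok: {ratio:.3f} is below 1.0"
    else:
        notes["Instrument ratio"] = f"problem: {ratio:.3f} suggests proliferation"

    if ar1_p < 0.10:
        notes["AR(1)"] = "expected: first-order correlation in differences"
    else:
        notes["AR(1)"] = "note: AR(1) not rejected (informational only)"

    return notes


# Illustrative inputs only, not results from a real estimation
report = check_gmm_diagnostics(
    ar1_p=0.001, ar2_p=0.45, hansen_p=0.18, n_instruments=12, n_groups=140
)
for name, message in report.items():
    print(f"{name}: {message}")
```

In practice the arguments would come from attributes such as `results.ar2_test.pvalue`, `results.hansen_j.pvalue`, `results.n_instruments`, and `results.n_groups` shown earlier.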
Overfitting Diagnostics
```python
from panelbox.gmm import GMMOverfitDiagnostic

diag = GMMOverfitDiagnostic(model, results)
print(diag.summary())
```
For detailed diagnostic interpretation, see GMM Diagnostics.
Common Issues
| Problem | Symptom | Solution |
|---|---|---|
| Too many instruments | Instrument ratio > 1.0, Hansen p near 1.0 | Use `collapse=True` |
| Weak instruments | Very large SEs, Hansen p > 0.50 | Try System GMM |
| Serial correlation | AR(2) p < 0.05 | Add more lags: `lags=[1, 2]` |
| Low observation retention | Warning about < 30% retention | Set `time_dummies=False`, use `collapse=True` |
| Coefficient outside bounds | Estimate > OLS or < FE | Check specification, reduce instruments |
Tutorials
| Tutorial | Description | Link |
|---|---|---|
| Complete GMM Guide | Step-by-step applied tutorial | Complete Guide |
| GMM Instruments | Instrument selection and management | Instruments |
| GMM Diagnostics | Interpreting all diagnostic tests | Diagnostics |
See Also
- System GMM -- Blundell-Bond System GMM for persistent series
- CUE-GMM -- Continuous Updating Estimator for robustness checks
- Bias-Corrected GMM -- Analytical bias correction for moderate N, T
- Instruments -- Instrument selection and the proliferation problem
- Diagnostics -- Complete guide to GMM diagnostic tests
- Complete Guide -- Step-by-step applied GMM tutorial
References
- Arellano, M., & Bond, S. (1991). "Some Tests of Specification for Panel Data: Monte Carlo Evidence and an Application to Employment Equations." Review of Economic Studies, 58(2), 277-297.
- Roodman, D. (2009). "How to do xtabond2: An Introduction to Difference and System GMM in Stata." The Stata Journal, 9(1), 86-136.
- Windmeijer, F. (2005). "A Finite Sample Correction for the Variance of Linear Efficient Two-Step GMM Estimators." Journal of Econometrics, 126(1), 25-51.
- Nickell, S. (1981). "Biases in Dynamic Models with Fixed Effects." Econometrica, 49(6), 1417-1426.
- Bond, S. R. (2002). "Dynamic Panel Data Models: A Guide to Micro Data Methods and Practice." Portuguese Economic Journal, 1(2), 141-162.