Panel IV / Two-Stage Least Squares
Quick Reference
Class: panelbox.models.iv.PanelIV
Import: from panelbox.models.iv import PanelIV
Stata equivalent: xtivreg y x1 (endog = z1 z2), fe
R equivalent: plm::plm(..., model="within", inst.method="bvk")
Overview
Instrumental Variables (IV) estimation addresses the problem of endogenous regressors -- explanatory variables that are correlated with the error term. This can arise from omitted variables, measurement error, simultaneity, or reverse causality. When endogeneity is present, OLS produces biased and inconsistent estimates.
The IV approach uses instruments \(Z\) -- variables that are correlated with the endogenous regressor but uncorrelated with the error term -- to isolate the exogenous variation in \(X\) and obtain consistent estimates. The most common implementation is Two-Stage Least Squares (2SLS).
PanelBox's PanelIV supports pooled, fixed effects, and random effects specifications with a convenient formula syntax and comprehensive first-stage diagnostics.
Quick Example
from panelbox.models.iv import PanelIV
# Formula: y ~ exogenous + endogenous | instruments
model = PanelIV(
    formula="invest ~ capital + value | capital + lag_value + lag2_value",
    data=df,
    entity_col="firm",
    time_col="year",
    model_type="fe",
)
results = model.fit(cov_type="clustered")
print(results.summary())
# Check instrument strength
for var, fs in results.first_stage_results.items():
    print(f"{var}: F-stat = {fs['f_statistic']:.2f}")
When to Use
- One or more regressors are endogenous (\(\text{Cov}(X_{it}, \varepsilon_{it}) \neq 0\))
- You have external instruments that are correlated with the endogenous variable but uncorrelated with the error
- The first-stage relationship is strong (common rule of thumb: first-stage F-statistic > 10)
- Examples: returns to education (ability bias), demand estimation (simultaneity), peer effects (reflection problem)
Key Assumptions
- Relevance: \(\text{Cov}(Z_{it}, X_{it}) \neq 0\) -- instruments must predict the endogenous variable
- Exogeneity: \(\text{Cov}(Z_{it}, \varepsilon_{it}) = 0\) -- instruments must be uncorrelated with the error
- Exclusion restriction: instruments affect \(y\) only through \(X\), not directly
The Endogeneity Problem
Sources of Endogeneity
| Source | Example | Why OLS fails |
|---|---|---|
| Omitted variables | Ability affects both education and wages | \(\text{Cov}(education, ability) \neq 0\) |
| Simultaneity | Price and quantity determined jointly | Supply and demand interact |
| Measurement error | True \(X^*\) measured with noise \(X = X^* + u\) | \(\text{Cov}(X, \varepsilon) \neq 0\) by construction |
| Reverse causality | Does trade cause growth or growth cause trade? | Direction of causation unclear |
The IV Solution
Find instruments \(Z\) such that:
- Stage 1: \(X = Z'\pi + \nu\) (instruments predict the endogenous variable)
- Stage 2: \(y = \hat{X}'\beta + \varepsilon\) (use predicted \(\hat{X}\) in place of \(X\))
The predicted values \(\hat{X}\) contain only the variation in \(X\) that is driven by \(Z\), which by assumption is uncorrelated with \(\varepsilon\).
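The two stages can be sketched by hand for the simplest case of one endogenous regressor and one instrument. The snippet below is a standard-library illustration on simulated data, not PanelBox code; the helper names (`ols_slope`, `two_stage_slope`) and the data-generating process are assumptions made for the example. An unobserved confounder `u` drives both `x` and `y`, so OLS is biased, while the instrument `z` affects `y` only through `x`.

```python
import random

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)

def ols_slope(x, y):
    """Slope from regressing y on x with an intercept."""
    return cov(x, y) / cov(x, x)

def two_stage_slope(z, x, y):
    """2SLS slope for one endogenous regressor x and one instrument z."""
    pi_hat = ols_slope(z, x)                               # Stage 1: x on z
    x_bar, z_bar = mean(x), mean(z)
    x_hat = [x_bar + pi_hat * (zi - z_bar) for zi in z]    # first-stage fitted values
    return ols_slope(x_hat, y)                             # Stage 2: y on x_hat

# Simulated data (assumed process): true beta = 2, u is an omitted confounder
random.seed(0)
n, beta = 20_000, 2.0
z = [random.gauss(0, 1) for _ in range(n)]
u = [random.gauss(0, 1) for _ in range(n)]
x = [zi + ui + random.gauss(0, 1) for zi, ui in zip(z, u)]
y = [beta * xi + 3.0 * ui + random.gauss(0, 1) for xi, ui in zip(x, u)]

beta_ols = ols_slope(x, y)
beta_2sls = two_stage_slope(z, x, y)
print(f"true beta = {beta}, OLS = {beta_ols:.3f}, 2SLS = {beta_2sls:.3f}")
```

In this just-identified case the two-stage estimate coincides with the ratio Cov(z, y) / Cov(z, x). Standard errors must be computed from residuals based on the actual `x`, not `x_hat`, which is one reason to rely on a dedicated estimator rather than running the two regressions manually.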
Detailed Guide
Formula Syntax
PanelIV uses a formula with | to separate the outcome equation from the instrument list:
- Before |: all variables in the outcome equation (both exogenous and endogenous)
- After |: all instruments (exogenous controls appear on both sides; excluded instruments appear only after |)
Identification rule: Variables that appear before | but not after | are treated as endogenous. Variables that appear after | but not before | are the excluded instruments.
# Example: 'value' is endogenous, instrumented by lag_value and lag2_value
# 'capital' is exogenous (appears on both sides)
model = PanelIV(
    formula="invest ~ capital + value | capital + lag_value + lag2_value",
    data=df,
    entity_col="firm",
    time_col="year",
)
In this example:
- invest: dependent variable
- capital: exogenous regressor
- value: endogenous regressor (in outcome but not in instruments)
- lag_value, lag2_value: excluded instruments (in instruments but not in outcome)
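The identification rule amounts to simple set logic on the two sides of the `|`. The helper below, `classify_iv_terms`, is a hypothetical illustration of that rule for plain `+`-separated variable names; it is not PanelBox's formula parser and ignores transformations, interactions, and intercept terms.

```python
def classify_iv_terms(formula):
    """Split 'y ~ regressors | instruments' into variable roles (illustrative only)."""
    lhs, rhs = formula.split("~", 1)
    outcome_part, instrument_part = rhs.split("|", 1)
    regressors = [t.strip() for t in outcome_part.split("+") if t.strip()]
    instruments = [t.strip() for t in instrument_part.split("+") if t.strip()]
    return {
        "dependent": lhs.strip(),
        # On both sides of '|': exogenous controls
        "exogenous": [v for v in regressors if v in instruments],
        # Before '|' only: endogenous regressors
        "endogenous": [v for v in regressors if v not in instruments],
        # After '|' only: excluded instruments
        "excluded_instruments": [v for v in instruments if v not in regressors],
    }

roles = classify_iv_terms(
    "invest ~ capital + value | capital + lag_value + lag2_value"
)
for role, names in roles.items():
    print(f"{role}: {names}")
```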
Panel Specifications
# Pooled IV (no entity effects)
model = PanelIV(formula, data, "id", "year", model_type="pooled")
# Fixed Effects IV (within transformation)
model = PanelIV(formula, data, "id", "year", model_type="fe")
# Random Effects IV
model = PanelIV(formula, data, "id", "year", model_type="re")
| model_type | Transformation | Intercept | Use when |
|---|---|---|---|
| "pooled" | None | Yes | No entity heterogeneity |
| "fe" | Within (demean) | No | \(\text{Cov}(\alpha_i, X_{it}) \neq 0\) |
| "re" | None | Yes | \(\text{Cov}(\alpha_i, X_{it}) = 0\) |
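To see what the within transformation does in the "fe" case, the sketch below demeans the outcome, regressor, and instrument by entity before applying a just-identified IV formula. It is a toy, noise-free example with hypothetical helper names (`within_transform`, `iv_slope`) and made-up data, intended only to show the mechanics; it does not reproduce PanelBox internals.

```python
from collections import defaultdict

def within_transform(values, entities):
    """Subtract each entity's mean from its own observations."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, e in zip(values, entities):
        sums[e] += v
        counts[e] += 1
    return [v - sums[e] / counts[e] for v, e in zip(values, entities)]

def iv_slope(z, x, y):
    """Just-identified IV slope for centered data (no intercept)."""
    num = sum(zi * yi for zi, yi in zip(z, y))
    den = sum(zi * xi for zi, xi in zip(z, x))
    return num / den

# Toy panel: y = 2 * x + alpha_i, with the entity effect alpha_i correlated with z
firm = ["A", "A", "A", "B", "B", "B"]
alpha = {"A": 10.0, "B": -5.0}
x = [1.0, 2.0, 4.0, 3.0, 5.0, 6.0]
z = [0.5, 1.5, 3.0, 1.0, 2.0, 4.0]
y = [2.0 * xi + alpha[f] for xi, f in zip(x, firm)]

# Pooled IV with an intercept: center on the overall mean only
overall = ["all"] * len(firm)
beta_pooled_iv = iv_slope(
    within_transform(z, overall), within_transform(x, overall), within_transform(y, overall)
)

# Fixed effects IV: center on entity means, which removes alpha_i
beta_fe_iv = iv_slope(
    within_transform(z, firm), within_transform(x, firm), within_transform(y, firm)
)
print(f"pooled IV = {beta_pooled_iv:.4f}, FE IV = {beta_fe_iv:.4f}")
```

Because the entity effect is constant within each firm, demeaning removes it exactly, and the FE estimate recovers the slope of 2 used to build the data, while the pooled estimate does not.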
Estimation
Result Attributes
| Attribute | Description |
|---|---|
| results.params | 2SLS coefficient estimates |
| results.std_errors | Standard errors |
| results.cov_params | Variance-covariance matrix |
| results.resid | Second-stage residuals |
| results.first_stage_results | Dict of first-stage results per endogenous variable |
| results.weak_instruments | True if any first-stage F < 10 |
| results.n_instruments | Number of instruments |
| results.n_endogenous | Number of endogenous variables |
| results.endogenous_vars | List of endogenous variable names |
| results.instruments | List of instrument names |
First-Stage Results
The first-stage results are stored as a dictionary keyed by endogenous variable name:
for endog_var, fs in results.first_stage_results.items():
    print(f"\nFirst stage for '{endog_var}':")
    print(f"  F-statistic: {fs['f_statistic']:.2f}")
    print(f"  R-squared: {fs['rsquared']:.4f}")
| Key | Description |
|---|---|
| "fitted" | Fitted values from first stage |
| "gamma" | First-stage coefficients |
| "rsquared" | First-stage \(R^2\) |
| "f_statistic" | First-stage F-statistic |
| "residuals" | First-stage residuals |
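As a rough guide to what these quantities mean, the sketch below computes them by hand for a single instrument and an intercept. The function name `first_stage_stats` is hypothetical, and the returned keys merely mirror the table above. How PanelBox computes its F-statistic (degrees of freedom, treatment of exogenous controls) is not specified here, so values from the library may differ.

```python
def first_stage_stats(z, x):
    """First-stage regression of x on one instrument z with an intercept."""
    n = len(x)
    z_bar, x_bar = sum(z) / n, sum(x) / n
    s_zz = sum((zi - z_bar) ** 2 for zi in z)
    s_zx = sum((zi - z_bar) * (xi - x_bar) for zi, xi in zip(z, x))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)

    gamma = s_zx / s_zz                                    # first-stage slope
    fitted = [x_bar + gamma * (zi - z_bar) for zi in z]
    residuals = [xi - fi for xi, fi in zip(x, fitted)]
    rsquared = 1.0 - sum(r * r for r in residuals) / s_xx

    # F-test of H0: gamma = 0, with 1 restriction and n - 2 residual degrees of freedom
    f_statistic = rsquared / ((1.0 - rsquared) / (n - 2))
    return {
        "fitted": fitted,
        "gamma": gamma,
        "rsquared": rsquared,
        "f_statistic": f_statistic,
        "residuals": residuals,
    }

# Tiny made-up sample
stats = first_stage_stats(z=[0.0, 1.0, 2.0, 3.0], x=[1.0, 3.0, 2.0, 4.0])
weak = stats["f_statistic"] < 10   # same rule-of-thumb threshold as weak_instruments
print(f"gamma = {stats['gamma']:.2f}, R2 = {stats['rsquared']:.2f}, "
      f"F = {stats['f_statistic']:.2f}, weak = {weak}")
```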
Standard Errors
PanelIV supports multiple covariance estimators:
# Classical 2SLS (homoskedastic)
results = model.fit(cov_type="nonrobust")
# Heteroskedasticity-robust (HC1)
results = model.fit(cov_type="robust")
# Cluster-robust (by entity)
results = model.fit(cov_type="clustered")
# Two-way clustering (entity and time)
results = model.fit(cov_type="twoway")
# Driscoll-Kraay HAC
results = model.fit(cov_type="driscoll_kraay")
| cov_type | Description | When to use |
|---|---|---|
| "nonrobust" | Classical 2SLS | Homoskedastic errors (textbook) |
| "robust" / "hc1" | HC1 robust | Heteroskedasticity suspected |
| "clustered" | Cluster by entity | Within-entity correlation (recommended) |
| "twoway" | Two-way clustering | Entity and time correlation |
| "driscoll_kraay" | DK HAC | Cross-sectional dependence |
Default recommendation
For panel data, cov_type="clustered" is usually the safest choice: it accounts for arbitrary within-entity correlation and heteroskedasticity. Because fit() defaults to cov_type="nonrobust", clustering must be requested explicitly.
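The difference between the classical and cluster-robust estimators can be illustrated for a scalar, just-identified IV model on centered data. The sketch below uses the textbook sandwich form without any finite-sample cluster correction; the function name `iv_standard_errors` and the toy data are assumptions, and PanelBox may apply adjustments that change the numbers.

```python
from collections import defaultdict
from math import sqrt

def iv_standard_errors(z, x, y, clusters):
    """Return (beta, se_nonrobust, se_clustered) for y = beta * x + e,
    one instrument z, data treated as already demeaned (no intercept)."""
    n = len(y)
    s_zx = sum(zi * xi for zi, xi in zip(z, x))
    s_zy = sum(zi * yi for zi, yi in zip(z, y))
    s_zz = sum(zi * zi for zi in z)

    beta = s_zy / s_zx
    # Residuals use the actual x, not first-stage fitted values
    resid = [yi - beta * xi for xi, yi in zip(x, y)]

    # Classical: sigma^2 * (Z'Z) / (Z'X)^2, with n - 1 degrees of freedom
    sigma2 = sum(e * e for e in resid) / (n - 1)
    var_nonrobust = sigma2 * s_zz / s_zx ** 2

    # Cluster-robust: sum the scores z_i * e_i within each cluster, then square
    scores = defaultdict(float)
    for zi, ei, g in zip(z, resid, clusters):
        scores[g] += zi * ei
    meat = sum(s * s for s in scores.values())
    var_clustered = meat / s_zx ** 2

    return beta, sqrt(var_nonrobust), sqrt(var_clustered)

beta, se_nr, se_cl = iv_standard_errors(
    z=[1.0, -1.0, 1.0, -1.0],
    x=[1.0, -1.0, 2.0, -2.0],
    y=[1.0, -2.0, 4.0, -3.0],
    clusters=["a", "a", "b", "b"],
)
print(f"beta = {beta:.4f}, nonrobust SE = {se_nr:.4f}, clustered SE = {se_cl:.4f}")
```

With only two clusters the clustered estimate is not reliable in practice; the example is sized for hand verification, not for inference.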
Configuration Options
PanelIV Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| formula | str | required | "y ~ exog + endog \| instruments" |
| data | DataFrame | required | Panel data in long format |
| entity_col | str | required | Entity identifier column |
| time_col | str | required | Time identifier column |
| model_type | str | "pooled" | "pooled", "fe", or "re" |
| weights | array-like | None | Observation weights |
fit() Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| cov_type | str | "nonrobust" | Covariance type |
| **cov_kwds | dict | -- | Additional SE arguments (e.g., cluster, maxlags) |
Identification
Order Condition
The model requires at least as many excluded instruments as endogenous variables:
\[k_{excluded} \geq k_{endog}\]
where \(k_{excluded}\) = number of instruments not in the outcome equation and \(k_{endog}\) = number of endogenous regressors.
| \(k_{excluded}\) vs \(k_{endog}\) | Name | Consequence | Implication |
|---|---|---|---|
| \(k_{excluded} < k_{endog}\) | Under-identified | Cannot estimate | Need more instruments |
| \(k_{excluded} = k_{endog}\) | Exactly identified | Unique solution | Cannot test overidentification |
| \(k_{excluded} > k_{endog}\) | Over-identified | Can test validity | Sargan/Hansen J test available |
PanelBox raises a ValueError if the model is under-identified.
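The order condition is easy to check before fitting. The function below, `identification_status`, is a hypothetical stand-alone helper that mirrors the table above, including raising a ValueError in the under-identified case. It checks only the counts; the order condition is necessary but not sufficient, since the instruments must also be relevant.

```python
def identification_status(n_excluded, n_endogenous):
    """Classify a model by the order condition (counts only, not instrument relevance)."""
    if n_excluded < n_endogenous:
        raise ValueError(
            f"Under-identified: {n_excluded} excluded instrument(s) "
            f"for {n_endogenous} endogenous variable(s)"
        )
    if n_excluded == n_endogenous:
        return "exactly identified"
    return "over-identified"

# 'value' instrumented by lag_value and lag2_value: 2 excluded instruments, 1 endogenous
status = identification_status(n_excluded=2, n_endogenous=1)
print(status)
```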
Instrument Strength
A weak first-stage relationship leads to:
- Biased 2SLS estimates (toward OLS)
- Unreliable standard errors and test statistics
- Confidence intervals with poor coverage
The first-stage F-statistic is the key diagnostic:
if results.weak_instruments:
    print("WARNING: Weak instruments detected!")
    print("Consider finding stronger instruments or using LIML.")
See IV Diagnostics for detailed guidance.
Tutorials
| Tutorial | Description | Link |
|---|---|---|
| Panel IV | 2SLS estimation with diagnostics | Static Models Tutorial |
See Also
- IV Diagnostics -- Instrument validity, weak instruments, and overidentification tests
- GMM (Arellano-Bond) -- Dynamic panels with internal instruments
- Static Models -- OLS, FE, RE without endogeneity