Murphy-Topel Variance Correction¶
Quick Reference
Functions: murphy_topel_variance, heckman_two_step_variance
Import: from panelbox.models.selection.murphy_topel import murphy_topel_variance, heckman_two_step_variance
Applies to: Any two-step estimator (Heckman two-step, generated regressors)
Overview¶
Two-step estimation procedures -- such as the Heckman two-step -- produce standard errors that are too small when computed naively. The problem is that Step 2 treats the Step 1 estimates as known constants, ignoring the sampling uncertainty from Step 1. This leads to understated standard errors, overstated t-statistics, and incorrect inference.
Murphy & Topel (1985) developed a general correction that accounts for the estimation error propagated from Step 1 to Step 2. In the context of the Heckman model, Step 1 is the probit estimation of the selection equation, and Step 2 is the augmented OLS with the Inverse Mills Ratio (IMR) -- which depends on the estimated probit coefficients \(\hat{\gamma}\).
The corrected variance is at least as large as the naive variance, reflecting the additional uncertainty from the first step.
The Problem¶
Naive Two-Step Standard Errors¶
Consider a two-step procedure:
- Step 1: Estimate \(\hat{\theta}_1\) (e.g., probit coefficients \(\hat{\gamma}\))
- Step 2: Estimate \(\hat{\theta}_2\) using \(\hat{\theta}_1\) as if known (e.g., OLS with IMR)
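The naive variance from Step 2 is the usual Step 2 variance estimate, computed with \(\hat{\theta}_1\) held fixed:

\[
\widehat{\mathrm{Var}}_{\text{naive}}(\hat{\theta}_2) = \hat{V}_2
\]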
This ignores that \(\hat{\theta}_1\) was estimated, not known. The true asymptotic variance is larger.
Impact on Inference¶
| Measure | Naive (incorrect) | Murphy-Topel (correct) |
|---|---|---|
| Standard errors | Too small | Correct (larger) |
| t-statistics | Too large | Correct (smaller) |
| Confidence intervals | Too narrow | Correct (wider) |
| p-values | Too small | Correct (larger) |
| Rejection rates | Too high (over-rejection) | Correct |
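To make the table concrete, the sketch below works through one case with hypothetical numbers: a coefficient of 0.5, a naive SE of 0.24, and a corrected SE of 0.28. None of these values come from a real fit; they are chosen only to show how a modest SE increase can change a test decision at the 5% level.

```python
import math


def two_sided_p(t_stat):
    """Two-sided p-value under a standard normal reference distribution."""
    return math.erfc(abs(t_stat) / math.sqrt(2.0))


# Hypothetical values, for illustration only
coef = 0.5
se_naive = 0.24
se_corrected = 0.28

t_naive = coef / se_naive
t_corrected = coef / se_corrected
p_naive = two_sided_p(t_naive)
p_corrected = two_sided_p(t_corrected)

print(f"naive:     t = {t_naive:.3f}, p = {p_naive:.4f}")
print(f"corrected: t = {t_corrected:.3f}, p = {p_corrected:.4f}")
```

In this made-up case the naive t-statistic exceeds 1.96 while the corrected one does not, so the two sets of standard errors lead to different conclusions.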
The Murphy-Topel Correction¶
General Formula¶
The corrected variance-covariance matrix is:

\[
\hat{V}_2^{MT} = \hat{V}_2 + C \hat{V}_1 C'
\]
where:
- \(\hat{V}_1\) is the variance-covariance from Step 1 (probit)
- \(\hat{V}_2\) is the uncorrected variance from Step 2 (OLS)
- \(C = \frac{\partial^2 Q}{\partial \theta_2 \partial \theta_1'}\) is the cross-derivative matrix that captures how Step 2 depends on Step 1
The correction term \(C \hat{V}_1 C'\) is always positive semi-definite, so corrected standard errors are always at least as large as naive ones.
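Reading the corrected matrix as the naive Step 2 variance plus the correction term \(C \hat{V}_1 C'\), the computation can be sketched in a few lines of NumPy. The function name `murphy_topel_sketch` and the input matrices are hypothetical, and whether `murphy_topel_variance` performs exactly these steps internally is an assumption.

```python
import numpy as np


def murphy_topel_sketch(vcov_step1, vcov_step2_uncorrected, cross_derivative):
    """Illustrative only: naive Step 2 variance plus C V1 C'."""
    V1 = np.asarray(vcov_step1, dtype=float)
    V2 = np.asarray(vcov_step2_uncorrected, dtype=float)
    C = np.asarray(cross_derivative, dtype=float)
    k1, k2 = V1.shape[0], V2.shape[0]
    if V1.shape != (k1, k1) or V2.shape != (k2, k2) or C.shape != (k2, k1):
        raise ValueError("expected V1 (k1 x k1), V2 (k2 x k2), C (k2 x k1)")
    return V2 + C @ V1 @ C.T


# Made-up inputs with k1 = 2 and k2 = 3
V1 = np.array([[0.04, 0.01],
               [0.01, 0.09]])
V2 = np.array([[0.25, 0.02, 0.00],
               [0.02, 0.16, 0.01],
               [0.00, 0.01, 0.36]])
C = np.array([[0.5, -0.2],
              [0.1, 0.3],
              [-0.4, 0.6]])

V_mt = murphy_topel_sketch(V1, V2, C)
se_naive = np.sqrt(np.diag(V2))
se_mt = np.sqrt(np.diag(V_mt))
print("naive SEs:    ", se_naive)
print("corrected SEs:", se_mt)
```

Because \(\hat{V}_1\) is positive semi-definite, each diagonal entry of the added term is non-negative, which is why every corrected SE in this sketch is at least as large as its naive counterpart.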
For Heckman Specifically¶
In the Heckman two-step:
- Step 1 estimates \(\hat{\gamma}\) (probit on selection equation)
- Step 2 estimates \((\hat{\beta}, \hat{\theta})\) where \(\theta = \rho \sigma_\varepsilon\) is the IMR coefficient
- The IMR \(\hat{\lambda}_{it} = \phi(W_{it}'\hat{\gamma}) / \Phi(W_{it}'\hat{\gamma})\) depends on \(\hat{\gamma}\)
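- The cross-derivative captures how the IMR changes with \(\gamma\). By the chain rule, using \(\lambda'(z) = -\lambda(z)\,(z + \lambda(z))\):

\[
\frac{\partial \hat{\lambda}_{it}}{\partial \gamma} = -\hat{\lambda}_{it} \left( \hat{\lambda}_{it} + W_{it}'\hat{\gamma} \right) W_{it}
\]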
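The IMR and its sensitivity to \(\gamma\) can be checked numerically. The sketch below uses only the standard library and the closed-form derivative \(d\lambda/dz = -\lambda(z)\,(z + \lambda(z))\); the helper names and the example vectors `w` and `gamma` are hypothetical and are not part of PanelBox.

```python
import math


def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def imr(z):
    """Inverse Mills Ratio: phi(z) / Phi(z)."""
    return norm_pdf(z) / norm_cdf(z)


def imr_derivative(z):
    """d(lambda)/dz = -lambda(z) * (z + lambda(z))."""
    lam = imr(z)
    return -lam * (z + lam)


def imr_gradient(w, gamma):
    """Gradient of lambda(w'gamma) with respect to gamma (chain rule)."""
    z = sum(wi * gi for wi, gi in zip(w, gamma))
    d = imr_derivative(z)
    return [d * wi for wi in w]


# Made-up selection regressors and probit coefficients for one observation
w = [1.0, 0.5, -1.2]
gamma = [0.3, 0.8, 0.4]
grad = imr_gradient(w, gamma)
print("d lambda / d gamma:", grad)
```

Since the derivative with respect to the index is always negative, each element of the gradient has the opposite sign to the corresponding element of `w`.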
API Reference¶
General Murphy-Topel Correction¶
```python
from panelbox.models.selection.murphy_topel import murphy_topel_variance

corrected_vcov = murphy_topel_variance(
    vcov_step1=vcov_probit,           # (k1 x k1) from Step 1
    vcov_step2_uncorrected=vcov_ols,  # (k2 x k2) naive from Step 2
    cross_derivative=C,               # (k2 x k1) cross-derivative
)
```
Parameters:
| Parameter | Shape | Description |
|---|---|---|
| vcov_step1 | \((k_1, k_1)\) | Step 1 variance-covariance matrix |
| vcov_step2_uncorrected | \((k_2, k_2)\) | Naive Step 2 variance-covariance |
| cross_derivative | \((k_2, k_1)\) | Cross-derivative \(\partial^2 Q / \partial \theta_2 \partial \theta_1'\) |
Returns: Corrected \((k_2, k_2)\) variance-covariance matrix.
Heckman-Specific Convenience Function¶
```python
from panelbox.models.selection.murphy_topel import heckman_two_step_variance

vcov_corrected, se_corrected = heckman_two_step_variance(
    X=X,                   # Outcome regressors (full sample)
    W=W,                   # Selection regressors (full sample)
    y=y,                   # Outcome variable (NaN for non-selected)
    beta=beta_hat,         # Outcome coefficients
    gamma=gamma_hat,       # Probit coefficients
    theta=theta_hat,       # IMR coefficient (rho * sigma)
    sigma=sigma_hat,       # Outcome error SD
    selected=d,            # Selection indicator
    vcov_probit=V_probit,  # Probit variance-covariance
)
```
This function handles all intermediate computations:
- Computes the IMR and its derivative
- Computes the cross-derivative matrix
- Computes the uncorrected OLS variance
- Applies the Murphy-Topel correction
Returns:
- vcov_corrected: Corrected \((k_{outcome}+1, k_{outcome}+1)\) matrix (for \(\beta\) and \(\theta\))
- se_corrected: Corrected standard errors (square root of the diagonal)
Cross-Derivative Computation¶
For advanced users, the cross-derivative can be computed separately:
```python
from panelbox.models.selection.murphy_topel import compute_cross_derivative_heckman

C = compute_cross_derivative_heckman(
    X=X_selected,         # Outcome regressors (selected only)
    W=W_selected,         # Selection regressors (selected only)
    imr=imr_selected,     # IMR values (selected only)
    imr_derivative=dimr,  # dλ/dz (selected only)
    beta=beta_hat,        # Outcome coefficients
    theta=theta_hat,      # IMR coefficient
    selected=selected,    # Selection indicator
)
```
Practical Example¶
Manual Correction¶
```python
import numpy as np
from panelbox.models.selection import PanelHeckman
from panelbox.models.selection.murphy_topel import murphy_topel_variance

# y, X, d, and Z are assumed to be defined already: outcome,
# outcome regressors, selection indicator, and selection regressors.

# Fit the Heckman model
model = PanelHeckman(
    endog=y, exog=X, selection=d, exog_selection=Z,
    method="two_step",
)
results = model.fit()

# The PanelHeckman two-step automatically applies Murphy-Topel
# internally. But you can also apply it manually:
# Step 1: Get probit variance (from probit fit)
# Step 2: Get uncorrected OLS variance
# Step 3: Compute cross-derivative
# Step 4: Apply correction
# corrected_vcov = murphy_topel_variance(V1, V2, C)
```
Comparing Naive vs. Corrected SEs¶
```python
# The correction typically increases SEs by 5-30%
# depending on the strength of the first-stage estimation
print("Naive SEs are always <= Murphy-Topel SEs")
print("The difference reflects Step 1 estimation uncertainty")
```
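Given a naive and a corrected variance-covariance matrix, the size of the adjustment can be reported per coefficient. The sketch below is a minimal illustration: `compare_standard_errors` is a hypothetical helper rather than a PanelBox function, and the two matrices are made-up values chosen so that both SEs rise by 10%.

```python
import numpy as np


def compare_standard_errors(vcov_naive, vcov_corrected, names=None):
    """Return (name, naive SE, corrected SE, percent increase) per coefficient."""
    se_naive = np.sqrt(np.diag(np.asarray(vcov_naive, dtype=float)))
    se_corr = np.sqrt(np.diag(np.asarray(vcov_corrected, dtype=float)))
    if se_naive.shape != se_corr.shape:
        raise ValueError("the two matrices must have the same dimension")
    if names is None:
        names = [f"param_{i}" for i in range(se_naive.size)]
    pct = 100.0 * (se_corr / se_naive - 1.0)
    return list(zip(names, se_naive, se_corr, pct))


# Made-up matrices for illustration only
vcov_naive = np.diag([0.04, 0.09])
vcov_corrected = np.diag([0.0484, 0.1089])

rows = compare_standard_errors(vcov_naive, vcov_corrected, names=["beta_1", "theta"])
for name, s0, s1, pct in rows:
    print(f"{name:>8}: naive={s0:.4f}  corrected={s1:.4f}  change={pct:+.1f}%")
```

With real output, the same comparison shows which coefficients are most affected by Step 1 uncertainty.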
When the Correction Matters¶
The Murphy-Topel correction is most important when:
- Strong selection: \(|\rho|\) is far from zero, so the IMR plays a large role
- Imprecise first stage: the probit has few observations or weak predictors
- Many observations at the boundary: a large fraction of the sample is non-selected
The correction is less important when:
- Weak selection: \(\rho \approx 0\) (IMR coefficient is near zero)
- Precise first stage: many observations, strong exclusion restriction
- MLE is used: joint MLE estimates the selection and outcome equations together, so its standard errors already reflect the uncertainty in the selection parameters
Alternatives¶
Bootstrap¶
An alternative to the analytical Murphy-Topel correction is to bootstrap the entire two-step procedure:
- Resample entities (with replacement) to preserve panel structure
- Re-estimate both steps on the bootstrap sample
- Repeat \(B\) times
- Compute variance across bootstrap estimates
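This is computationally expensive (\(B \times\) the estimation time) but does not require analytical derivatives. Because whole entities are resampled, it also accommodates heteroskedasticity and within-entity correlation. It does not, however, repair a misspecified selection or outcome equation.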
Bootstrap implementation
PanelBox provides the bootstrap_two_step_variance function signature, but the full implementation is not yet available. For now, use the analytical Murphy-Topel correction or implement panel bootstrap manually.
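A manual entity-level bootstrap can be sketched as follows. Everything here is an assumption made for illustration: `entity_bootstrap_variance` is a hypothetical helper, not the PanelBox function, and `estimate` stands for a user-supplied callable that takes row positions, re-runs both steps on those rows, and returns the parameter vector. The toy usage substitutes a sample mean for the two-step fit so the block runs on its own.

```python
import numpy as np


def entity_bootstrap_variance(entity_ids, estimate, n_boot=200, seed=0):
    """Resample whole entities with replacement and re-estimate each time."""
    rng = np.random.default_rng(seed)
    entity_ids = np.asarray(entity_ids)
    entities = np.unique(entity_ids)
    # Row positions per entity, so each draw keeps an entity's full history
    rows_by_entity = {e: np.flatnonzero(entity_ids == e) for e in entities}
    draws = []
    for _ in range(n_boot):
        sampled = rng.choice(entities, size=entities.size, replace=True)
        rows = np.concatenate([rows_by_entity[e] for e in sampled])
        draws.append(np.asarray(estimate(rows), dtype=float))
    draws = np.vstack(draws)
    # Variance across bootstrap estimates (one column per parameter)
    vcov = np.atleast_2d(np.cov(draws, rowvar=False))
    return vcov, np.sqrt(np.diag(vcov))


# Toy usage: 50 entities with 4 periods each; the "estimator" is a sample
# mean standing in for a full two-step fit.
ids = np.repeat(np.arange(50), 4)
y_toy = np.random.default_rng(1).normal(size=ids.size)
vcov_boot, se_boot = entity_bootstrap_variance(
    ids, lambda rows: [y_toy[rows].mean()]
)
print("bootstrap SE of the mean:", se_boot[0])
```

In a real application, `estimate` would refit the probit, recompute the IMR, and rerun the outcome regression on the resampled rows, so that Step 1 uncertainty is carried into the spread of the bootstrap estimates.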
Tutorials¶
| Tutorial | Description | Link |
|---|---|---|
| Selection Models | Includes Murphy-Topel correction examples |
See Also¶
- Panel Heckman -- The primary user of Murphy-Topel correction
- Tobit Models -- Censored regression (uses MLE, not two-step)
- Marginal Effects for Censored Models -- Interpreting effects in nonlinear models
References¶
- Murphy, K. M., & Topel, R. H. (1985). Estimation and inference in two-step econometric models. Journal of Business & Economic Statistics, 3(4), 370-379.
- Wooldridge, J. M. (1995). Selection corrections for panel data models under conditional mean independence assumptions. Journal of Econometrics, 68(1), 115-132.
- Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data (2nd ed.). MIT Press. Section 12.5.
- Pagan, A. (1984). Econometric issues in the analysis of regressions with generated regressors. International Economic Review, 25(1), 221-247.