Murphy-Topel Variance Correction¶

Quick Reference

Functions: murphy_topel_variance, heckman_two_step_variance
Import: from panelbox.models.selection.murphy_topel import murphy_topel_variance, heckman_two_step_variance
Applies to: Any two-step estimator (Heckman two-step, generated regressors)

Overview¶

Two-step estimation procedures -- such as the Heckman two-step -- produce standard errors that are too small when computed naively. The problem is that Step 2 treats the Step 1 estimates as known constants, ignoring the sampling uncertainty from Step 1. This leads to understated standard errors, overstated t-statistics, and incorrect inference.

Murphy & Topel (1985) developed a general correction that accounts for the estimation error propagated from Step 1 to Step 2. In the context of the Heckman model, Step 1 is the probit estimation of the selection equation, and Step 2 is the augmented OLS with the Inverse Mills Ratio (IMR) -- which depends on the estimated probit coefficients \(\hat{\gamma}\).

Under the additive form used on this page, the corrected variance is at least as large as the naive variance (in the positive semi-definite sense), reflecting the additional uncertainty from the first step.

The Problem¶

Naive Two-Step Standard Errors¶

Consider a two-step procedure:

Step 1: Estimate \(\hat{\theta}_1\) (e.g., probit coefficients \(\hat{\gamma}\))
Step 2: Estimate \(\hat{\theta}_2\) using \(\hat{\theta}_1\) as if known (e.g., OLS with IMR)

The naive variance from Step 2 is:

\[\hat{V}_2^{naive} = \hat{\sigma}^2 (X_{aug}'X_{aug})^{-1}\]

This ignores that \(\hat{\theta}_1\) was estimated rather than known, so it omits the variance contributed by Step 1.
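
A minimal NumPy sketch of the naive formula, assuming a homoskedastic error and simulated data. The helper name `naive_ols_vcov` and the stand-in IMR column are illustrative and not part of PanelBox.

```python
import numpy as np

def naive_ols_vcov(X_aug, y):
    """Return OLS coefficients and the naive vcov sigma^2 (X'X)^{-1}."""
    n, k = X_aug.shape
    XtX_inv = np.linalg.inv(X_aug.T @ X_aug)
    coef = XtX_inv @ X_aug.T @ y
    resid = y - X_aug @ coef
    sigma2 = resid @ resid / (n - k)  # degrees-of-freedom adjusted
    return coef, sigma2 * XtX_inv

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
imr_hat = rng.uniform(0.2, 1.5, size=n)  # stand-in for an estimated IMR column
X_aug = np.column_stack([np.ones(n), x, imr_hat])
y = X_aug @ np.array([1.0, 2.0, 0.5]) + rng.normal(size=n)

coef, V2_naive = naive_ols_vcov(X_aug, y)
se_naive = np.sqrt(np.diag(V2_naive))
```

Nothing in this computation knows that the IMR column was itself estimated, which is exactly the omission described above.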

Impact on Inference¶

Measure	Naive (incorrect)	Murphy-Topel (correct)
Standard errors	Too small	Correct (larger)
t-statistics	Too large	Correct (smaller)
Confidence intervals	Too narrow	Correct (wider)
p-values	Too small	Correct (larger)
Rejection rates	Too high (over-rejection)	Correct
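
To make the table concrete, the sketch below runs the same two-sided z-test with a naive and a corrected standard error. The coefficient and both standard errors are made-up illustrative numbers, not output from any model, and `z_test` is a hypothetical helper.

```python
from scipy.stats import norm

def z_test(coef, se):
    """Two-sided z-test of H0: coef = 0; returns (statistic, p-value)."""
    stat = coef / se
    return stat, 2.0 * norm.sf(abs(stat))

# Illustrative numbers only.
coef = 0.50
se_naive, se_mt = 0.24, 0.29

t_naive, p_naive = z_test(coef, se_naive)
t_mt, p_mt = z_test(coef, se_mt)

reject_naive = p_naive < 0.05
reject_mt = p_mt < 0.05
```

With these numbers the naive test rejects at the 5% level while the corrected test does not, which is the over-rejection pattern listed in the last row.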

The Murphy-Topel Correction¶

General Formula¶

The corrected variance-covariance matrix is:

\[\hat{V}^{MT}_2 = \hat{V}_2 + C \hat{V}_1 C'\]

where:

\(\hat{V}_1\) is the variance-covariance from Step 1 (probit)
\(\hat{V}_2\) is the uncorrected variance from Step 2 (OLS)
\(C = \frac{\partial^2 Q}{\partial \theta_2 \partial \theta_1'}\) is the cross-derivative matrix of the Step 2 objective function \(Q\), which captures how Step 2 depends on Step 1

The correction term \(C \hat{V}_1 C'\) is always positive semi-definite, so corrected standard errors are always at least as large as naive ones.
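
The additive formula and the positive semi-definiteness claim can be checked numerically. This is a sketch of the formula as written above, using arbitrary random matrices; it is not claimed to reproduce the internals of murphy_topel_variance, and `mt_additive_correction` is a hypothetical name.

```python
import numpy as np

def mt_additive_correction(V1, V2, C):
    """Return V2 + C V1 C' for C of shape (k2, k1)."""
    V1, V2, C = (np.asarray(a, dtype=float) for a in (V1, V2, C))
    k2, k1 = C.shape
    if V1.shape != (k1, k1) or V2.shape != (k2, k2):
        raise ValueError("expected V1 (k1, k1), V2 (k2, k2), C (k2, k1)")
    return V2 + C @ V1 @ C.T

rng = np.random.default_rng(1)
k1, k2 = 3, 2
A1 = rng.normal(size=(k1, k1))
A2 = rng.normal(size=(k2, k2))
V1 = A1 @ A1.T  # arbitrary positive semi-definite matrix
V2 = A2 @ A2.T
C = rng.normal(size=(k2, k1))

V_mt = mt_additive_correction(V1, V2, C)
correction = V_mt - V2
eigs = np.linalg.eigvalsh(correction)  # all >= 0 up to rounding
```

Because every eigenvalue of the correction term is non-negative, each diagonal element of `V_mt` is at least as large as the matching element of `V2`.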

For Heckman Specifically¶

In the Heckman two-step:

Step 1 estimates \(\hat{\gamma}\) (probit on selection equation)
Step 2 estimates \((\hat{\beta}, \hat{\theta})\) where \(\theta = \rho \sigma_\varepsilon\) is the IMR coefficient
The IMR \(\hat{\lambda}_{it} = \phi(W_{it}'\hat{\gamma}) / \Phi(W_{it}'\hat{\gamma})\) depends on \(\hat{\gamma}\)
The cross-derivative captures how the IMR changes with \(\gamma\). Writing \(z_{it} = W_{it}'\gamma\):

\[\frac{\partial \hat{\lambda}}{\partial \gamma} = \frac{d\lambda}{dz} \cdot W_{it} = -\lambda(\lambda + z) \cdot W_{it}\]
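
The derivative above can be verified against a finite difference. The sketch uses SciPy's normal distribution functions; the helper names, the toy `W` and the value of `gamma` are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import norm

def inverse_mills(z):
    """lambda(z) = phi(z) / Phi(z), computed in logs for stability."""
    z = np.asarray(z, dtype=float)
    return np.exp(norm.logpdf(z) - norm.logcdf(z))

def inverse_mills_derivative(z):
    """d lambda / dz = -lambda (lambda + z)."""
    lam = inverse_mills(z)
    return -lam * (lam + z)

rng = np.random.default_rng(2)
W = np.column_stack([np.ones(6), rng.normal(size=6)])
gamma = np.array([0.3, -0.8])
z = W @ gamma

# Row i holds d lambda_i / d gamma' = (d lambda / dz)_i * W_i
jac = inverse_mills_derivative(z)[:, None] * W

# Central finite-difference check with respect to each element of gamma
h = 1e-6
numeric = np.column_stack([
    (inverse_mills(W @ (gamma + h * e)) - inverse_mills(W @ (gamma - h * e))) / (2 * h)
    for e in np.eye(gamma.size)
])
max_abs_err = np.max(np.abs(numeric - jac))
```

Working with `logpdf` and `logcdf` avoids dividing by a very small \(\Phi(z)\) when \(z\) is strongly negative.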

API Reference¶

General Murphy-Topel Correction¶

from panelbox.models.selection.murphy_topel import murphy_topel_variance

corrected_vcov = murphy_topel_variance(
    vcov_step1=vcov_probit,            # (k1 x k1) from Step 1
    vcov_step2_uncorrected=vcov_ols,   # (k2 x k2) naive from Step 2
    cross_derivative=C,                # (k2 x k1) cross-derivative
)

Parameters:

Parameter	Shape	Description
`vcov_step1`	\((k_1, k_1)\)	Step 1 variance-covariance matrix
`vcov_step2_uncorrected`	\((k_2, k_2)\)	Naive Step 2 variance-covariance
`cross_derivative`	\((k_2, k_1)\)	Cross-derivative \(\partial^2 Q / \partial \theta_2 \partial \theta_1'\)

Returns: Corrected \((k_2, k_2)\) variance-covariance matrix.

Heckman-Specific Convenience Function¶

from panelbox.models.selection.murphy_topel import heckman_two_step_variance

vcov_corrected, se_corrected = heckman_two_step_variance(
    X=X,                   # Outcome regressors (full sample)
    W=W,                   # Selection regressors (full sample)
    y=y,                   # Outcome variable (NaN for non-selected)
    beta=beta_hat,         # Outcome coefficients
    gamma=gamma_hat,       # Probit coefficients
    theta=theta_hat,       # IMR coefficient (rho * sigma)
    sigma=sigma_hat,       # Outcome error SD
    selected=d,            # Selection indicator
    vcov_probit=V_probit,  # Probit variance-covariance
)

This function handles all intermediate computations:

Computes the IMR and its derivative
Computes the cross-derivative matrix
Computes the uncorrected OLS variance
Applies the Murphy-Topel correction

Returns:

vcov_corrected: Corrected \((k_{outcome}+1, k_{outcome}+1)\) matrix (for \(\beta\) and \(\theta\))
se_corrected: Corrected standard errors (square root of diagonal)
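
The argument comments above imply a particular data layout: full-sample X and W, a selection indicator, and NaN in y for non-selected rows. The sketch below generates synthetic data in that layout. The data-generating process is illustrative, and whether the function expects an explicit intercept column is an assumption not confirmed by this page.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
rho, sigma = 0.6, 1.5

# W contains one variable that is excluded from the outcome equation.
x1 = rng.normal(size=n)
excl = rng.normal(size=n)
W = np.column_stack([np.ones(n), x1, excl])  # selection regressors, full sample
X = np.column_stack([np.ones(n), x1])        # outcome regressors, full sample

# Correlated errors: u drives selection, eps enters the outcome equation.
u = rng.normal(size=n)
v = rng.normal(size=n)
eps = sigma * (rho * u + np.sqrt(1.0 - rho**2) * v)

d = (W @ np.array([0.2, 0.5, 1.0]) + u > 0).astype(int)  # selection indicator
y_star = X @ np.array([1.0, 2.0]) + eps
y = np.where(d == 1, y_star, np.nan)                     # NaN for non-selected
```

In this design the true IMR coefficient is \(\rho \sigma_\varepsilon = 0.9\), which gives a known target when experimenting with the estimator.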

Cross-Derivative Computation¶

For advanced users, the cross-derivative can be computed separately:

from panelbox.models.selection.murphy_topel import compute_cross_derivative_heckman

C = compute_cross_derivative_heckman(
    X=X_selected,          # Outcome regressors (selected only)
    W=W_selected,          # Selection regressors (selected only)
    imr=imr_selected,      # IMR values (selected only)
    imr_derivative=dimr,   # dλ/dz (selected only)
    beta=beta_hat,         # Outcome coefficients
    theta=theta_hat,       # IMR coefficient
    selected=selected,     # Selection indicator
)

Practical Example¶

Manual Correction¶

import numpy as np
from panelbox.models.selection import PanelHeckman
from panelbox.models.selection.murphy_topel import murphy_topel_variance

# Fit the Heckman model
model = PanelHeckman(
    endog=y, exog=X, selection=d, exog_selection=Z,
    method="two_step",
)
results = model.fit()

# The PanelHeckman two-step automatically applies Murphy-Topel
# internally. But you can also apply it manually:

# Step 1: Get probit variance (from probit fit)
# Step 2: Get uncorrected OLS variance
# Step 3: Compute cross-derivative
# Step 4: Apply correction
# corrected_vcov = murphy_topel_variance(V1, V2, C)
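
The four commented steps can be sketched end to end in plain NumPy and SciPy on simulated data. This is an illustration under stated assumptions, not PanelBox's implementation: the probit is fit with a small Fisher-scoring routine written here, the cross-derivative is built as a delta-method Jacobian of the Step 2 estimates with respect to \(\gamma\), and the naive variance is the homoskedastic OLS one. All function and variable names are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def probit_score_info(W, d, gamma):
    """Score and expected information of the probit log-likelihood."""
    z = W @ gamma
    pdf = norm.pdf(z)
    cdf = np.clip(norm.cdf(z), 1e-12, 1 - 1e-12)
    weight = pdf / (cdf * (1 - cdf))
    score = W.T @ (weight * (d - cdf))
    info = (W * (weight * pdf)[:, None]).T @ W
    return score, info

def fit_probit(W, d, n_iter=50, tol=1e-10):
    """Probit MLE by Fisher scoring; returns (gamma_hat, vcov)."""
    gamma = np.zeros(W.shape[1])
    for _ in range(n_iter):
        score, info = probit_score_info(W, d, gamma)
        step = np.linalg.solve(info, score)
        gamma = gamma + step
        if np.max(np.abs(step)) < tol:
            break
    _, info = probit_score_info(W, d, gamma)
    return gamma, np.linalg.inv(info)

# Simulated data (illustrative); the true IMR coefficient is 0.9
rng = np.random.default_rng(4)
n = 2000
x1, excl = rng.normal(size=n), rng.normal(size=n)
W = np.column_stack([np.ones(n), x1, excl])
X = np.column_stack([np.ones(n), x1])
u = rng.normal(size=n)
eps = 0.9 * u + rng.normal(size=n)
d = (W @ np.array([0.2, 0.5, 1.0]) + u > 0).astype(float)
y = X @ np.array([1.0, 2.0]) + eps  # only used where d == 1

# Step 1: probit coefficients and their variance
gamma_hat, V1 = fit_probit(W, d)

# Step 2: OLS on the selected sample, augmented with the estimated IMR
sel = d == 1
z = W[sel] @ gamma_hat
lam = np.exp(norm.logpdf(z) - norm.logcdf(z))
X_aug = np.column_stack([X[sel], lam])
XtX_inv = np.linalg.inv(X_aug.T @ X_aug)
coef = XtX_inv @ X_aug.T @ y[sel]
resid = y[sel] - X_aug @ coef
V2 = resid @ resid / (sel.sum() - X_aug.shape[1]) * XtX_inv
theta_hat = coef[-1]

# Step 3: sensitivity of the Step 2 estimates to gamma, shape (k2, k1)
dlam_dz = -lam * (lam + z)
C = -theta_hat * XtX_inv @ X_aug.T @ (dlam_dz[:, None] * W[sel])

# Step 4: additive correction
V_mt = V2 + C @ V1 @ C.T
se_naive, se_mt = np.sqrt(np.diag(V2)), np.sqrt(np.diag(V_mt))
```

The sketch only adds the Step 1 term. Selected-sample errors are also heteroskedastic, which a complete Heckman variance estimator would address separately, so treat the output as a demonstration of the mechanics rather than a replacement for the library routine.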

Comparing Naive vs. Corrected SEs¶

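# The correction typically increases SEs by 5-30%
# depending on the strength of the first-stage estimation.
# V2 and corrected_vcov are the naive and corrected matrices from the manual steps above.
se_naive = np.sqrt(np.diag(V2))
se_corrected = np.sqrt(np.diag(corrected_vcov))
print("Naive SEs:       ", se_naive)
print("Murphy-Topel SEs:", se_corrected)
print("Ratio (>= 1):    ", se_corrected / se_naive)  # reflects Step 1 estimation uncertainty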

When the Correction Matters¶

The Murphy-Topel correction is most important when:

Strong selection: \(|\rho|\) is far from zero, so the IMR plays a large role
Imprecise first stage: the probit has few observations or weak predictors
Many observations at the boundary: a large fraction of the sample is non-selected

The correction is less important when:

Weak selection: \(\rho \approx 0\) (IMR coefficient is near zero)
Precise first stage: many observations, strong exclusion restriction
MLE is used: joint maximum likelihood estimates both equations simultaneously, so its standard errors already account for uncertainty in the selection equation
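
The first two cases in each list can be illustrated numerically. The sketch assumes that \(C\) is proportional to the IMR coefficient \(\theta\) (as it is when Step 2 depends on \(\gamma\) only through \(\theta \hat{\lambda}\)); under that assumption the correction term scales with \(\theta^2\) and linearly with the Step 1 variance. The matrices are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(5)
k1, k2 = 3, 3
A = rng.normal(size=(k1, k1))
V1 = A @ A.T / 100.0           # stand-in for a probit vcov
B = rng.normal(size=(k2, k1))  # stand-in for the theta-free part of C

def correction_term(theta, scale_v1=1.0):
    """C V1 C' with C = theta * B and the Step 1 vcov scaled by scale_v1."""
    C = theta * B
    return C @ (scale_v1 * V1) @ C.T

base = correction_term(theta=0.5)
weak_selection = correction_term(theta=0.05)                     # theta 10x smaller
precise_first_stage = correction_term(theta=0.5, scale_v1=0.25)  # Step 1 vcov 4x smaller
no_selection = correction_term(theta=0.0)
```

Here `weak_selection` equals `base / 100`, `precise_first_stage` equals `base / 4`, and `no_selection` is a zero matrix, so naive and corrected variances coincide when \(\theta = 0\).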

Alternatives¶

Bootstrap¶

An alternative to the analytical Murphy-Topel correction is to bootstrap the entire two-step procedure:

Resample entities (with replacement) to preserve panel structure
Re-estimate both steps on the bootstrap sample
Repeat \(B\) times
Compute variance across bootstrap estimates

This is computationally expensive (\(B \times\) estimation time) but does not require analytical derivatives, and it does not depend on the analytical variance formula being correctly specified.

Bootstrap implementation

PanelBox provides the bootstrap_two_step_variance function signature, but the full implementation is not yet available. For now, use the analytical Murphy-Topel correction or implement panel bootstrap manually.
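
A manual panel bootstrap along the lines of the four steps above might look like the following sketch. It is not PanelBox's bootstrap_two_step_variance; `panel_bootstrap_vcov` is a hypothetical helper, and a plain OLS fit stands in for a callable that would rerun both estimation steps.

```python
import numpy as np

def panel_bootstrap_vcov(estimator, data, entity_ids, n_boot=200, seed=0):
    """Entity-level bootstrap variance of estimator(rows).

    estimator: callable taking a 2-D array of rows and returning a 1-D
               parameter vector; it should rerun BOTH steps internally.
    data: (n_obs, n_cols) array; entity_ids: (n_obs,) entity labels.
    """
    rng = np.random.default_rng(seed)
    entities = np.unique(entity_ids)
    rows_by_entity = {e: np.flatnonzero(entity_ids == e) for e in entities}
    draws = []
    for _ in range(n_boot):
        # Resample whole entities with replacement to keep each panel intact
        sampled = rng.choice(entities, size=entities.size, replace=True)
        idx = np.concatenate([rows_by_entity[e] for e in sampled])
        draws.append(estimator(data[idx]))
    draws = np.asarray(draws)
    return np.cov(draws, rowvar=False), draws

# Toy usage: OLS stands in for the two-step estimator
def ols(rows):
    y, X = rows[:, 0], rows[:, 1:]
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(6)
n_entities, n_periods = 50, 4
ids = np.repeat(np.arange(n_entities), n_periods)
X = np.column_stack([np.ones(ids.size), rng.normal(size=ids.size)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=ids.size)

vcov_boot, draws = panel_bootstrap_vcov(ols, np.column_stack([y, X]), ids)
se_boot = np.sqrt(np.diag(vcov_boot))
```

For a Heckman application the callable would refit the probit, rebuild the IMR and rerun the augmented OLS on each resample, so that the spread of the draws reflects both steps.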

Tutorials¶

Tutorial	Description	Link
Selection Models	Includes Murphy-Topel correction examples

References¶

Murphy, K. M., & Topel, R. H. (1985). Estimation and inference in two-step econometric models. Journal of Business & Economic Statistics, 3(4), 370-379.
Wooldridge, J. M. (1995). Selection corrections for panel data models under conditional mean independence assumptions. Journal of Econometrics, 68(1), 115-132.
Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data (2nd ed.). MIT Press. Section 12.5.
Pagan, A. (1984). Econometric issues in the analysis of regressions with generated regressors. International Economic Review, 25(1), 221-247.