Panel Heckman¶

Quick Reference

Class: panelbox.models.selection.PanelHeckman
Import: from panelbox.models.selection import PanelHeckman
Stata equivalent: heckman y x1 x2, select(d = z1 z2)
R equivalent: sampleSelection::selection()

Overview¶

The Panel Heckman model corrects for sample selection bias: the problem that arises when the outcome of interest is observed only for a non-random subset of the population. The classic example is wage estimation: wages are observed only for employed individuals, and the employment decision is likely correlated with potential wages. Ignoring this selection produces biased coefficient estimates.

The model consists of two equations:

Selection equation: determines whether the outcome is observed

\[d_{it} = \mathbb{1}\{W_{it}'\gamma + v_{it} > 0\}\]

Outcome equation: the model of interest, observed only when $d_{it} = 1$

\[y_{it} = X_{it}'\beta + \varepsilon_{it}\]

The errors $(v_{it}, \varepsilon_{it})$ follow a bivariate normal distribution with correlation $\rho$. When $\rho \neq 0$, OLS on the selected subsample is biased. The Heckman correction adds the Inverse Mills Ratio (IMR) to the outcome equation, removing the selection bias.

PanelBox implements the Wooldridge (1995) panel extension with both two-step and maximum likelihood estimation, plus comprehensive diagnostics for assessing selection bias.

Quick Example¶

import numpy as np
from panelbox.models.selection import PanelHeckman

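# Prepare data. The right-hand names (wages, X_outcome, ...) are placeholders
# for arrays you have already loaded.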
y = wages              # Outcome (observed for employed only)
X = X_outcome          # Outcome regressors: [education, experience]
d = employed           # Selection indicator: 1=employed, 0=not
Z = X_selection        # Selection regressors: [education, experience, children]
entity = person_id
time = year

# Fit two-step Heckman model
model = PanelHeckman(
    endog=y,
    exog=X,
    selection=d,
    exog_selection=Z,
    entity=entity,
    time=time,
    method="two_step",
)
results = model.fit()
print(results.summary())

# Test for selection bias
test = results.selection_effect()
print(test["interpretation"])

When to Use¶

Your outcome is observed only for a non-random subsample
You suspect the selection mechanism is correlated with the outcome
You have a valid exclusion restriction (a variable affecting selection but not the outcome)
Examples: wages (employed only), medical costs (insured only), firm profits (surviving firms only)

Key Assumptions

Bivariate normality: $(v_{it}, \varepsilon_{it}) \sim N(0, \Sigma)$
Exclusion restriction: at least one variable in $W$ should not appear in $X$
Exogenous regressors: $X$ and $W$ are uncorrelated with the errors (conditional on selection)
The selection indicator is binary (0/1)

The Selection Problem¶

Why OLS is Biased¶

When we estimate $y_{it} = X_{it}'\beta + \varepsilon_{it}$ using only selected observations ($d_{it} = 1$), the conditional expectation is:

\[E[y_{it} \mid d_{it} = 1, X_{it}, W_{it}] = X_{it}'\beta + \rho \sigma_\varepsilon \lambda(W_{it}'\gamma)\]

where $\lambda(z) = \phi(z) / \Phi(z)$ is the Inverse Mills Ratio. The term $\rho \sigma_\varepsilon \lambda(\cdot)$ is the omitted variable that causes bias in OLS.
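
As a quick numerical illustration, the IMR can be evaluated with the standard library alone. This is a minimal sketch: the helper names `norm_pdf`, `norm_cdf`, and `inverse_mills_ratio` are assumptions made for this example and are not part of PanelBox.

```python
import math


def norm_pdf(z: float) -> float:
    """Standard normal density phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z: float) -> float:
    """Standard normal CDF Phi(z), written with erfc for accuracy in the lower tail."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def inverse_mills_ratio(z: float) -> float:
    """lambda(z) = phi(z) / Phi(z)."""
    return norm_pdf(z) / norm_cdf(z)


# The IMR falls as the selection index rises: observations that are
# very likely to be selected need almost no correction.
for z in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"z = {z:+.1f}  lambda(z) = {inverse_mills_ratio(z):.4f}")
```

Because $\lambda(\cdot)$ is strictly decreasing, the omitted term is largest for observations with a low selection index.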

Direction of Bias¶

Sign of $\rho$	Interpretation	OLS bias
$\rho > 0$	Positive selection: high-outcome individuals more likely selected	Upward bias
$\rho < 0$	Negative selection: low-outcome individuals more likely selected	Downward bias
$\rho = 0$	No selection bias	OLS is unbiased
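
The first row of the table can be checked by simulation. The sketch below assumes a constant selection index and illustrative values for $\rho$ and $\sigma_\varepsilon$ (all numbers are made up for the example); it compares the average outcome error among selected observations with the theoretical term $\rho \sigma_\varepsilon \lambda(\cdot)$.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
rho, sigma = 0.5, 2.0   # illustrative values, rho > 0 means positive selection
index = 0.3             # W'gamma, held constant to keep the example small

# Draw (v, eps) with corr(v, eps) = rho, sd(v) = 1, sd(eps) = sigma
v = rng.standard_normal(n)
u = rng.standard_normal(n)
eps = sigma * (rho * v + math.sqrt(1.0 - rho**2) * u)

# Selection rule d = 1{W'gamma + v > 0}
selected = index + v > 0

# Theoretical mean of eps among the selected: rho * sigma * lambda(index)
pdf = math.exp(-0.5 * index**2) / math.sqrt(2.0 * math.pi)
cdf = 0.5 * math.erfc(-index / math.sqrt(2.0))
theory = rho * sigma * pdf / cdf

simulated = eps[selected].mean()
print(f"simulated E[eps | d=1] = {simulated:.4f}")
print(f"rho * sigma * lambda   = {theory:.4f}")
```

With $\rho > 0$ the selected errors have a positive mean, which is the upward shift that OLS on the selected sample absorbs.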

Detailed Guide¶

Model Specification¶

from panelbox.models.selection import PanelHeckman

model = PanelHeckman(
    endog=y,                   # Outcome variable
    exog=X,                    # Outcome regressors (n x k_outcome)
    selection=d,               # Binary selection: 1=observed, 0=not
    exog_selection=Z,          # Selection regressors (n x k_selection)
    entity=entity,             # Entity IDs (optional)
    time=time,                 # Time IDs (optional)
    method="two_step",         # 'two_step' or 'mle'
)

Exclusion restriction

The selection regressors `exog_selection` should include at least one variable that is not in `exog`. This exclusion restriction is critical for identification. Without it, the model relies solely on the nonlinearity of the IMR, which is fragile.

Good exclusion restrictions affect selection but not the outcome directly:

Application	Selection variable	Exclusion restriction
Wages	Employment	Number of children, non-labor income
Training effects	Participation	Program availability, distance
Insurance	Purchase decision	State regulations, subsidies
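
Before fitting, it can help to confirm mechanically that the selection regressors contain something the outcome regressors do not. The helper below is a hypothetical sketch (`excluded_columns` is not a PanelBox function): it flags columns of `Z` that are not linear combinations of the columns of `X`. It only checks the algebra; whether an excluded variable is economically valid remains a judgment call.

```python
import numpy as np


def excluded_columns(X, Z, tol=1e-8):
    """Return indices of columns in Z that lie outside the column space of X."""
    X = np.asarray(X, dtype=float)
    Z = np.asarray(Z, dtype=float)
    # Project every Z column on X and look at what is left over
    coef, *_ = np.linalg.lstsq(X, Z, rcond=None)
    resid = Z - X @ coef
    rel = np.linalg.norm(resid, axis=0) / np.maximum(np.linalg.norm(Z, axis=0), 1e-300)
    return [j for j, r in enumerate(rel) if r > tol]


# Illustrative data: children enters the selection equation only
rng = np.random.default_rng(7)
n = 500
education = rng.normal(12.0, 2.0, n)
experience = rng.normal(10.0, 5.0, n)
children = rng.integers(0, 4, n).astype(float)

X = np.column_stack([np.ones(n), education, experience])
Z = np.column_stack([np.ones(n), education, experience, children])

print("Excluded columns of Z:", excluded_columns(X, Z))
print("Excluded columns when Z == X:", excluded_columns(X, X))
```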

Two-Step Estimation¶

The default and recommended approach:

results = model.fit()  # Uses two_step by default

Step 1: Estimate the selection equation via probit:

\[\hat{\gamma} = \arg\max \sum_{i,t} \left[ d_{it} \log \Phi(W_{it}'\gamma) + (1-d_{it}) \log(1-\Phi(W_{it}'\gamma)) \right]\]

Step 2: Compute the IMR and run augmented OLS on the selected sample:

\[y_{it} = X_{it}'\beta + \theta \hat{\lambda}_{it} + \text{error}_{it} \quad (d_{it}=1 \text{ only})\]

where $\hat{\lambda}_{it} = \phi(W_{it}'\hat{\gamma}) / \Phi(W_{it}'\hat{\gamma})$ and $\theta = \rho \sigma_\varepsilon$.
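
The two steps can be sketched by hand in NumPy. This is a textbook pooled two-step estimator under stated assumptions, not PanelBox's implementation: it ignores the panel structure, uses a hand-written Fisher-scoring probit, and reports no standard errors. All function names and the simulated data are made up for the example.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)


def norm_pdf(z):
    return np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)


def norm_cdf(z):
    return 0.5 * (1.0 + _erf(z / np.sqrt(2.0)))


def fit_probit(W, d, n_iter=50, tol=1e-10):
    """Step 1: probit MLE for gamma via Fisher scoring."""
    gamma = np.zeros(W.shape[1])
    for _ in range(n_iter):
        z = W @ gamma
        p = np.clip(norm_cdf(z), 1e-10, 1.0 - 1e-10)
        pdf = norm_pdf(z)
        score_i = (d - p) * pdf / (p * (1.0 - p))   # per-observation score factor
        weight = pdf**2 / (p * (1.0 - p))           # expected information weight
        step = np.linalg.solve(W.T @ (W * weight[:, None]), W.T @ score_i)
        gamma = gamma + step
        if np.max(np.abs(step)) < tol:
            break
    return gamma


def heckman_two_step(y, X, d, W):
    """Step 2: OLS of y on [X, IMR] using selected observations only."""
    gamma = fit_probit(W, d)
    z = W @ gamma
    imr = norm_pdf(z) / np.clip(norm_cdf(z), 1e-10, None)
    sel = d == 1
    X_aug = np.column_stack([X[sel], imr[sel]])
    coef, *_ = np.linalg.lstsq(X_aug, y[sel], rcond=None)
    return coef[:-1], coef[-1], gamma   # beta_hat, theta_hat, gamma_hat


# Simulated data with a strong exclusion restriction (z_excl)
rng = np.random.default_rng(42)
n = 20_000
x = rng.standard_normal(n)
z_excl = rng.standard_normal(n)
X = np.column_stack([np.ones(n), x])
W = np.column_stack([np.ones(n), x, z_excl])
rho, sigma = 0.6, 1.5
v = rng.standard_normal(n)
eps = sigma * (rho * v + np.sqrt(1.0 - rho**2) * rng.standard_normal(n))
d = (W @ np.array([0.2, 0.5, 1.0]) + v > 0).astype(float)
y = X @ np.array([1.0, 2.0]) + eps   # only y[d == 1] is used below

beta_hat, theta_hat, gamma_hat = heckman_two_step(y, X, d, W)
naive, *_ = np.linalg.lstsq(X[d == 1], y[d == 1], rcond=None)
print("two-step beta:", beta_hat, " theta:", theta_hat)
print("naive OLS beta:", naive)
```

In this simulation the true values are $\beta = (1, 2)$ and $\theta = \rho \sigma_\varepsilon = 0.9$, so the printed estimates can be compared against them directly.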

MLE Estimation¶

Full information maximum likelihood jointly estimates all parameters:

results_mle = model.fit(method="mle")

MLE is asymptotically more efficient but computationally expensive. It uses Fisher's z-transformation for $\rho$ and log-transformation for $\sigma$ to ensure parameters stay in valid ranges.
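
The two transformations mentioned above can be written out in a few lines. This is a sketch of the general reparameterization technique, not PanelBox's internal code; the function names are hypothetical.

```python
import math


def to_unconstrained(rho: float, sigma: float) -> tuple[float, float]:
    """Map (rho, sigma) to the real line: Fisher's z = atanh(rho), and log(sigma)."""
    return math.atanh(rho), math.log(sigma)


def to_constrained(z_rho: float, log_sigma: float) -> tuple[float, float]:
    """Inverse map: tanh keeps rho inside (-1, 1), exp keeps sigma positive."""
    return math.tanh(z_rho), math.exp(log_sigma)


# An optimizer may propose any real pair; the mapped values stay valid.
for z_rho, log_sigma in [(-3.0, -2.0), (0.0, 0.0), (3.0, 2.0)]:
    rho, sigma = to_constrained(z_rho, log_sigma)
    print(f"z_rho={z_rho:+.1f}, log_sigma={log_sigma:+.1f} -> rho={rho:+.4f}, sigma={sigma:.4f}")
```

Optimizing over the unconstrained pair means no bounds have to be passed to the optimizer, at the cost of slow progress when $|\rho|$ approaches 1 and the transformed parameter drifts toward infinity.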

Performance

MLE with $N > 500$ observations may take several minutes. Use two-step estimation for large samples or exploratory analysis.

Result Attributes¶

Attribute	Description
`results.outcome_params`	Outcome equation coefficients $\hat{\beta}$
`results.probit_params`	Selection equation coefficients $\hat{\gamma}$
`results.sigma`	Outcome error SD $\hat{\sigma}_\varepsilon$
`results.rho`	Error correlation $\hat{\rho}$
`results.lambda_imr`	Inverse Mills Ratio for each observation
`results.method`	Estimation method used
`results.llf`	Log-likelihood (MLE only)
`results.converged`	Convergence status
`results.n_selected`	Number of selected observations
`results.n_total`	Total number of observations

Predictions¶

# Unconditional: E[y*] = X'beta (latent outcome)
y_unconditional = results.predict(type="unconditional")

# Conditional: E[y|selected] = X'beta + rho*sigma*lambda
y_conditional = results.predict(type="conditional")

# Out-of-sample prediction
y_new = results.predict(
    exog=X_new,
    exog_selection=Z_new,
    type="conditional",
)
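
To make the difference between the two prediction types concrete, both formulas can be evaluated by hand. The parameter values and arrays below are made-up illustrations, not estimates, and the helper `imr` is an assumption of this sketch rather than a PanelBox function.

```python
import math
import numpy as np


def imr(index):
    """Vectorized lambda(z) = phi(z) / Phi(z)."""
    index = np.asarray(index, dtype=float)
    pdf = np.exp(-0.5 * index**2) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * np.vectorize(math.erfc)(-index / math.sqrt(2.0))
    return pdf / cdf


# Hypothetical fitted parameters (illustrative numbers only)
beta = np.array([1.0, 0.5])          # outcome coefficients
gamma = np.array([0.2, 0.4, -0.3])   # selection coefficients
rho, sigma = 0.4, 1.2

X_new = np.array([[1.0, -1.0], [1.0, 0.5]])
Z_new = np.array([[1.0, -1.0, 2.0], [1.0, 0.5, 0.0]])

# Unconditional: X'beta
y_unconditional = X_new @ beta

# Conditional on selection: X'beta + rho * sigma * lambda(W'gamma)
y_conditional = y_unconditional + rho * sigma * imr(Z_new @ gamma)

print("unconditional:", y_unconditional)
print("conditional:  ", y_conditional)
```

With $\rho > 0$ the conditional prediction always lies above the unconditional one, and the gap is widest for rows with a low selection index.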

Diagnostics¶

Testing for Selection Bias¶

The fundamental question: is selection bias actually present? Test $H_0: \rho = 0$.

test = results.selection_effect()
print(f"Test statistic: {test['statistic']:.3f}")
print(f"P-value: {test['pvalue']:.4f}")
print(test["interpretation"])

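If you fail to reject $H_0$, there is no evidence of selection bias and OLS on the selected sample may be adequate. Keep in mind that the test can have low power when the exclusion restriction is weak.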

IMR Diagnostics¶

Examine the distribution of the Inverse Mills Ratio:

diag = results.imr_diagnostics()
print(f"Mean IMR: {diag['imr_mean']:.3f}")
print(f"Std IMR: {diag['imr_std']:.3f}")
print(f"Range: [{diag['imr_min']:.3f}, {diag['imr_max']:.3f}]")
print(f"High IMR (>2): {diag['high_imr_count']}")
print(f"Selection rate: {diag['selection_rate']:.1%}")

High IMR values (> 2) indicate observations with very low selection probabilities where the correction is large. Many such observations can make the model unstable.
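
The link between the IMR threshold and the selection probability can be checked directly. The sketch below computes similar summary statistics by hand from a hypothetical vector of fitted selection indices; it reuses the dictionary key names shown above for readability, but it is not the library's code and may differ from it in which observations are included.

```python
import math
import numpy as np


def norm_cdf(z: float) -> float:
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def imr_scalar(z: float) -> float:
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi) / norm_cdf(z)


# lambda(z) crosses 2 between z = -1.6 and z = -1.5,
# i.e. at a selection probability of roughly 6%.
for z in (-1.6, -1.5):
    print(f"z = {z}: IMR = {imr_scalar(z):.3f}, P(selected) = {norm_cdf(z):.3f}")

# Hypothetical fitted indices W'gamma_hat (simulated for illustration)
rng = np.random.default_rng(1)
index = rng.normal(loc=0.5, scale=1.0, size=5000)
imr = np.array([imr_scalar(z) for z in index])

summary = {
    "imr_mean": float(imr.mean()),
    "imr_std": float(imr.std()),
    "imr_min": float(imr.min()),
    "imr_max": float(imr.max()),
    "high_imr_count": int((imr > 2.0).sum()),
}
print(summary)
```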

IMR Visualization¶

fig = results.plot_imr(figsize=(12, 5))
# Creates scatter (IMR vs selection prob) and histogram

Comparing OLS vs. Heckman¶

Assess the magnitude of selection bias:

comparison = results.compare_ols_heckman()
print(comparison["interpretation"])

# Coefficient-by-coefficient comparison
for i in range(len(comparison["beta_ols"])):
    print(f"  Variable {i}: OLS={comparison['beta_ols'][i]:.4f}, "
          f"Heckman={comparison['beta_heckman'][i]:.4f}, "
          f"Diff={comparison['difference'][i]:.4f}")

Large differences indicate substantial selection bias. If the differences are small and $\rho \approx 0$, OLS is adequate.

Configuration Options¶

PanelHeckman Parameters¶

Parameter	Type	Default	Description
`endog`	array-like	required	Outcome variable
`exog`	array-like	required	Outcome equation regressors
`selection`	array-like	required	Binary selection indicator (0/1)
`exog_selection`	array-like	required	Selection equation regressors
`entity`	array-like	`None`	Entity identifiers
`time`	array-like	`None`	Time identifiers
`method`	str	`"two_step"`	`"two_step"` or `"mle"`

fit() Parameters¶

Parameter	Type	Default	Description
`method`	str	`None`	Override default method

Common Issues¶

Collinearity with IMR¶

If the exclusion restriction is weak ($X$ and $W$ are very similar), the IMR $\hat{\lambda}$ becomes highly collinear with $X$. This leads to unstable estimates and large standard errors. Solution: find a better exclusion restriction.
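
The collinearity problem is easy to reproduce. The following sketch uses simulated data with made-up coefficients and measures how much of the IMR is explained linearly by the outcome regressors, with and without an excluded variable in the selection index.

```python
import math
import numpy as np


def imr(index):
    """Vectorized lambda(z) = phi(z) / Phi(z)."""
    index = np.asarray(index, dtype=float)
    pdf = np.exp(-0.5 * index**2) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * np.vectorize(math.erfc)(-index / math.sqrt(2.0))
    return pdf / cdf


def r_squared(y, X):
    """R-squared from an OLS regression of y on X (X includes a constant)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1.0 - resid.var() / y.var()


rng = np.random.default_rng(3)
n = 10_000
x = rng.standard_normal(n)        # regressor shared by both equations
z = rng.standard_normal(n)        # candidate excluded variable
X = np.column_stack([np.ones(n), x])

# No exclusion restriction: the selection index depends on x only
r2_without = r_squared(imr(0.5 + 1.0 * x), X)

# With exclusion restriction: z shifts selection but is not in X
r2_with = r_squared(imr(0.5 + 1.0 * x + 1.0 * z), X)

print(f"R^2 of IMR on X without exclusion: {r2_without:.3f}")
print(f"R^2 of IMR on X with exclusion:    {r2_with:.3f}")
```

An $R^2$ near 1 in the first case means the IMR adds almost no independent variation to the second-step regression, which is what inflates the standard errors.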

Extreme Selection Probabilities¶

When $\Phi(W'\gamma)$ is close to 0, the IMR becomes very large, causing numerical instability; when it is close to 1, the IMR shrinks toward zero and contributes little identifying variation. PanelBox clips probabilities to $[10^{-10}, 1 - 10^{-10}]$ to prevent division by zero, but extreme values still produce unreliable corrections.
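
To see why clipping alone does not rescue extreme cases, the sketch below compares a clipped IMR with a tail-accurate one. It assumes the clip is applied to $\Phi$ before dividing, as the text describes; the function names are hypothetical and the library's actual code path is not shown here.

```python
import math


def norm_pdf(z: float) -> float:
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def imr_clipped(z: float, eps: float = 1e-10) -> float:
    """IMR with the probability clipped to [eps, 1 - eps] before dividing."""
    prob = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    prob = min(max(prob, eps), 1.0 - eps)
    return norm_pdf(z) / prob


def imr_tail_accurate(z: float) -> float:
    """IMR using erfc, which keeps precision deep in the lower tail."""
    return norm_pdf(z) / (0.5 * math.erfc(-z / math.sqrt(2.0)))


# For very negative z the true IMR grows roughly like |z|, while the
# clipped version stays finite but collapses once the clip binds.
for z in (-3.0, -6.0, -9.0):
    print(f"z = {z:+.1f}  clipped = {imr_clipped(z):.3e}  accurate = {imr_tail_accurate(z):.3e}")
```

The clip removes the division-by-zero error, but once it binds the resulting IMR no longer reflects the true correction, so such observations deserve scrutiny rather than trust.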

MLE Convergence¶

If MLE fails to converge:

Check that two-step estimates are reasonable (used as starting values)
Reduce the number of parameters (simplify the model)
Check for near-boundary $\rho$ ($|\rho| \approx 1$)
Fall back to two-step estimation

Extreme Selection Rates¶

Selection rate < 5% or > 95%

Very low or very high selection rates make the IMR unstable. PanelBox warns when this occurs. Consider whether the Heckman model is appropriate for your data.

Standard Errors¶

Murphy-Topel correction

The two-step estimator produces naive standard errors that understate uncertainty because they treat the first-step probit estimates as known. The Murphy-Topel correction adjusts for the estimation error in $\hat{\gamma}$, producing correct (larger) standard errors. See the dedicated page for details.

Tutorials¶

Tutorial	Description	Link
Selection Models	Two-step and MLE Heckman estimation

References¶

Heckman, J. J. (1979). Sample selection bias as a specification error. Econometrica, 47(1), 153-161.
Wooldridge, J. M. (1995). Selection corrections for panel data models under conditional mean independence assumptions. Journal of Econometrics, 68(1), 115-132.
Murphy, K. M., & Topel, R. H. (1985). Estimation and inference in two-step econometric models. Journal of Business & Economic Statistics, 3(4), 370-379.
Cameron, A. C., & Trivedi, P. K. (2005). Microeconometrics: Methods and Applications. Cambridge University Press. Chapter 16.
Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data (2nd ed.). MIT Press. Chapter 19.
