Influence Diagnostics¶

Quick Reference

Class: panelbox.validation.robustness.InfluenceDiagnostics Import: from panelbox.validation.robustness import InfluenceDiagnostics Key method: influence.compute() returns InfluenceResults Stata equivalent: predict, cooksd dffits dfbeta(x1) R equivalent: stats::influence.measures(), car::influenceIndexPlot()

What It Measures¶

Influence diagnostics identify observations that have disproportionate impact on regression results. An influential observation is one whose removal would substantially change the estimated coefficients, fitted values, or predictions.

Three measures capture different aspects of influence:

Measure	What It Measures	Default Threshold
Cook's D	Overall influence on all coefficients jointly	\(4/N\)
DFFITS	Influence on the observation's own fitted value	\(2\sqrt{K/N}\)
DFBETAS	Influence on each coefficient individually	\(2/\sqrt{N}\)

Mathematical Details¶

Cook's Distance¶

Cook's distance combines residual magnitude and leverage into a single measure:

\[D_i = \frac{r_i^2}{K \cdot MSE} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}\]

where:

\(r_i\) = residual for observation \(i\)
\(K\) = number of parameters
\(MSE\) = mean squared error
\(h_{ii}\) = leverage (hat value) for observation \(i\)

An observation can be influential through a large residual (outlier), high leverage (unusual predictor values), or both.

DFFITS¶

DFFITS measures how much the fitted value \(\hat{y}_i\) changes when observation \(i\) is deleted:

\[DFFITS_i = r_i^{*} \sqrt{\frac{h_{ii}}{1 - h_{ii}}}\]

where \(r_i^{*}\) is the standardized residual \(r_i / \sqrt{MSE \cdot (1 - h_{ii})}\).

DFBETAS¶

DFBETAS measures how much each individual coefficient \(\hat\beta_j\) changes when observation \(i\) is deleted:

\[DFBETAS_{ij} = \frac{\hat\beta_j - \hat\beta_j^{(-i)}}{SE(\hat\beta_j^{(-i)})}\]

This reveals which specific coefficients are most affected by each observation.

Quick Example¶

from panelbox import FixedEffects
from panelbox.validation.robustness import InfluenceDiagnostics
from panelbox.datasets import load_grunfeld

data = load_grunfeld()
model = FixedEffects("invest ~ value + capital", data, "firm", "year")
results = model.fit()

# Compute all influence diagnostics
influence = InfluenceDiagnostics(results, verbose=True)
infl = influence.compute()

# Access individual measures
print(f"Max Cook's D: {infl.cooks_d.max():.6f}")
print(f"Max |DFFITS|: {infl.dffits.abs().max():.6f}")

# Find influential observations
influential = influence.influential_observations(method="cooks_d")
print(influential)

# Summary with top 10 most influential observations
print(influence.summary())

# 4-panel influence plot
influence.plot_influence()

API Reference¶

Constructor¶

InfluenceDiagnostics(
    results=results,   # PanelResults from model.fit()
    verbose=True,      # Print progress
)

Methods¶

Method	Returns	Description
`compute()`	`InfluenceResults`	Compute Cook's D, DFFITS, DFBETAS, leverage
`influential_observations(method, threshold)`	`pd.DataFrame`	Observations exceeding influence threshold
`summary(n_top=10)`	`str`	Formatted summary of top influential observations
`plot_influence(save_path=None)`	--	4-panel influence diagnostic plot

Properties (auto-compute on access)¶

Property	Type	Description
`leverage`	`pd.Series`	Approximate leverage (hat) values
`cooks_d`	`pd.Series`	Cook's distance
`dffits`	`pd.Series`	DFFITS values
`dfbetas`	`pd.DataFrame`	DFBETAS (observations \(\times\) parameters)

These properties call compute() automatically if it hasn't been called yet.

InfluenceResults Attributes¶

Attribute	Type	Description
`cooks_d`	`pd.Series`	Cook's distance for each observation
`dffits`	`pd.Series`	DFFITS for each observation
`dfbetas`	`pd.DataFrame`	DFBETAS (\(N \times K\))
`leverage`	`pd.Series`	Approximate leverage values
`standardized_residuals`	`pd.Series`	Standardized residuals

Identifying Influential Observations¶

# By Cook's distance (default threshold: 4/N)
influential_cook = influence.influential_observations(method="cooks_d")

# By DFFITS (default threshold: 2*sqrt(K/N))
influential_dffits = influence.influential_observations(method="dffits")

# By DFBETAS (default threshold: 2/sqrt(N))
influential_dfbetas = influence.influential_observations(method="dfbetas")

# Custom threshold
influential_custom = influence.influential_observations(method="cooks_d", threshold=0.5)

Default Thresholds¶

Method	Default Threshold	Formula
`cooks_d`	\(4/N\)	Flags approximately 0.4% of observations in large samples
`dffits`	\(2\sqrt{K/N}\)	Scales with model complexity and sample size
`dfbetas`	\(2/\sqrt{N}\)	Per-coefficient threshold; flags if any coefficient is affected

Where \(N\) = number of observations and \(K\) = number of parameters.

Visualization¶

influence.plot_influence()

Produces a 4-panel figure:

Cook's Distance -- Stem plot with \(4/N\) threshold line
DFFITS -- Stem plot with \(\pm 2\sqrt{K/N}\) threshold lines
Leverage vs Residuals -- Scatter of leverage against standardized residuals (high in both corners = influential)
Leverage Values -- Stem plot with \(2 \times \bar{h}\) threshold line

Interpretation Guide¶

Cook's Distance¶

Range	Interpretation
\(D_i < 4/N\)	Not influential
\(4/N < D_i < 1\)	Moderately influential; investigate
\(D_i > 1\)	Highly influential; strongly affects results

DFFITS¶

Range	Interpretation
\(\\|DFFITS_i\\| < 2\sqrt{K/N}\)	Not influential on fitted value
\(\\|DFFITS_i\\| > 2\sqrt{K/N}\)	Substantially changes own fitted value when removed

DFBETAS¶

Range	Interpretation
\(\\|DFBETAS_{ij}\\| < 2/\sqrt{N}\)	Not influential on coefficient \(j\)
\(\\|DFBETAS_{ij}\\| > 2/\sqrt{N}\)	Substantially changes coefficient \(j\) when removed

Which Measure to Use?

Start with Cook's D for an overall picture
Use DFFITS to assess prediction impact
Use DFBETAS when you need to know which specific coefficients are affected
Always examine observations flagged by any measure

Panel Context¶

In panel data, influential observations may cluster within entities or time periods:

import pandas as pd

# Compute influence
influence = InfluenceDiagnostics(results)
infl = influence.compute()

# Merge with panel identifiers
model = results._model
infl_data = pd.DataFrame({
    "entity": model.data.data[model.data.entity_col].values,
    "time": model.data.data[model.data.time_col].values,
    "cooks_d": infl.cooks_d.values,
})

# Check if influence clusters by entity
entity_influence = infl_data.groupby("entity")["cooks_d"].mean()
print("Mean Cook's D by entity:")
print(entity_influence.sort_values(ascending=False))

# Check if influence clusters by time period
time_influence = infl_data.groupby("time")["cooks_d"].mean()
print("\nMean Cook's D by time period:")
print(time_influence.sort_values(ascending=False))

If influence clusters within a specific entity or time period, consider whether that unit has unique characteristics that warrant separate treatment or additional controls.

Comprehensive Example¶

from panelbox import FixedEffects
from panelbox.validation.robustness import InfluenceDiagnostics
from panelbox.datasets import load_grunfeld
import numpy as np

data = load_grunfeld()
model = FixedEffects("invest ~ value + capital", data, "firm", "year")
results = model.fit()

# Full influence analysis
influence = InfluenceDiagnostics(results, verbose=True)
infl = influence.compute()

# Thresholds
n = len(infl.cooks_d)
k = len(results.params)
print(f"N = {n}, K = {k}")
print(f"Cook's D threshold (4/N):          {4/n:.6f}")
print(f"DFFITS threshold (2*sqrt(K/N)):    {2*np.sqrt(k/n):.6f}")
print(f"DFBETAS threshold (2/sqrt(N)):     {2/np.sqrt(n):.6f}")

# Count influential observations by each measure
n_cook = (infl.cooks_d > 4/n).sum()
n_dffits = (infl.dffits.abs() > 2*np.sqrt(k/n)).sum()
n_dfbetas = (infl.dfbetas.abs() > 2/np.sqrt(n)).any(axis=1).sum()

print(f"\nInfluential by Cook's D: {n_cook}")
print(f"Influential by DFFITS:   {n_dffits}")
print(f"Influential by DFBETAS:  {n_dfbetas}")

# Detailed summary
print(influence.summary())

# Visualization
influence.plot_influence()

Common Pitfalls¶

Watch Out

Approximate leverage: For panel models with entity fixed effects, PanelBox uses an approximate leverage based on the Mahalanobis distance of the design matrix. Exact leverage requires the full hat matrix, which may be too large for big panels.
Multiple flagged observations: Influence is computed one observation at a time. If multiple influential observations are present, deleting one may reveal (or mask) others.
Don't remove without investigation: Finding high Cook's D does not mean the observation should be deleted. It means it deserves scrutiny.
Small panels: In small panels, a few observations can have high leverage by construction. Thresholds like \(4/N\) may flag too many observations; consider using \(1.0\) as an alternative Cook's D cutoff.

References¶

Cook, R. D. (1977). Detection of influential observation in linear regression. Technometrics, 19(1), 15-18.
Belsley, D. A., Kuh, E., & Welsch, R. E. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. John Wiley & Sons.
Cook, R. D., & Weisberg, S. (1982). Residuals and Influence in Regression. Chapman and Hall.
Welsch, R. E., & Kuh, E. (1977). Linear regression diagnostics. NBER Working Paper No. 173.

Influence Diagnostics¶

What It Measures¶

Mathematical Details¶

Cook's Distance¶

DFFITS¶

DFBETAS¶

Quick Example¶

API Reference¶

Constructor¶

Methods¶

Properties (auto-compute on access)¶

InfluenceResults Attributes¶

Identifying Influential Observations¶

Default Thresholds¶

Visualization¶

Interpretation Guide¶

Cook's Distance¶

DFFITS¶

DFBETAS¶

Panel Context¶

Comprehensive Example¶

Common Pitfalls¶

See Also¶

References¶