Outlier Detection¶

Quick Reference

Class: panelbox.validation.robustness.OutlierDetector Import: from panelbox.validation.robustness import OutlierDetector Key methods: detect_outliers_residuals(), detect_outliers_univariate(), detect_outliers_multivariate() Stata equivalent: predict, rstudent R equivalent: stats::influence.measures()

What It Does¶

Outliers are observations that deviate markedly from the bulk of the data. In panel data, outliers can arise from data entry errors, structural breaks, or genuinely extreme realizations. The OutlierDetector class provides four detection methods covering univariate, multivariate, and regression-based approaches.

Four Detection Methods¶

Method	Approach	Default Threshold	Best For
Univariate IQR	\(Q_1 - k \cdot IQR\) to \(Q_3 + k \cdot IQR\)	\(k = 1.5\)	Exploring individual variables
Univariate Z-score	\(\\|z_i\\| > t\) where \(z_i = (x_i - \bar{x})/s\)	\(t = 2.5\)	Normally distributed variables
Multivariate	Mahalanobis distance	\(3.0\)	Correlated regressors
Residual-based	Standardized or studentized residuals	\(2.5\)	Regression outliers

Quick Example¶

from panelbox import FixedEffects
from panelbox.validation.robustness import OutlierDetector
from panelbox.datasets import load_grunfeld

data = load_grunfeld()
model = FixedEffects("invest ~ value + capital", data, "firm", "year")
results = model.fit()

# Detect outliers via standardized residuals
detector = OutlierDetector(results, verbose=True)
outliers = detector.detect_outliers_residuals(method="standardized", threshold=2.5)

print(f"Outliers detected: {outliers.n_outliers} / {len(outliers.outliers)}")
print(outliers.outlier_table)  # Only flagged rows

# 4-panel diagnostic plot
detector.plot_diagnostics()

Univariate Detection¶

IQR Method¶

Flags observations outside the "fences" defined by the interquartile range:

\[\text{Lower} = Q_1 - k \cdot IQR, \quad \text{Upper} = Q_3 + k \cdot IQR\]

where \(IQR = Q_3 - Q_1\) and \(k\) is the threshold (default: 1.5 for "outlier", 3.0 for "extreme outlier").

# IQR method on residuals (default when variable=None)
outliers_iqr = detector.detect_outliers_univariate(method="iqr", threshold=1.5)

# IQR method on a specific variable
outliers_var = detector.detect_outliers_univariate(variable="value", method="iqr", threshold=1.5)

Z-Score Method¶

Flags observations with \(|z| > t\) where \(z_i = (x_i - \bar{x}) / s\):

outliers_z = detector.detect_outliers_univariate(method="zscore", threshold=3.0)

Multivariate Detection¶

Uses Mahalanobis distance, which accounts for correlations between variables:

\[D_M(x) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}\]

outliers_mahal = detector.detect_outliers_multivariate(threshold=3.0)

The threshold is scaled by the \(\chi^2\) distribution critical value to account for dimensionality.

Singular Covariance

If the covariance matrix is singular (e.g., multicollinear regressors), PanelBox automatically switches to the pseudo-inverse with a warning.

Residual-Based Detection¶

The most common approach for regression diagnostics:

# Standardized residuals: r / sqrt(MSE)
outliers_std = detector.detect_outliers_residuals(method="standardized", threshold=2.5)

# Studentized residuals: r / sqrt(MSE * (1 - h_ii))
outliers_stud = detector.detect_outliers_residuals(method="studentized", threshold=2.5)

Type	Formula	Accounts For
Standardized	\(e_i / \sqrt{MSE}\)	Overall error scale
Studentized	\(e_i / \sqrt{MSE \cdot (1 - h_{ii})}\)	Overall error scale + leverage

The studentized residual adjusts for the observation's leverage (\(h_{ii}\)), making it more reliable for detecting outliers at high-leverage points.

Convenience Alias

detector.detect_outliers(method, threshold) is a shortcut for detect_outliers_residuals() with a default threshold of 3.0.

OutlierResults¶

All detection methods return an OutlierResults object:

Attribute	Type	Description
`outliers`	`pd.DataFrame`	Full results with `entity`, `time`, `is_outlier`, `distance` columns
`method`	`str`	Method description string
`threshold`	`float`	Threshold used
`n_outliers`	`int`	Count of flagged observations
`outlier_table`	`pd.DataFrame`	Property: only rows where `is_outlier == True`

Leverage Points¶

High-leverage points are observations with unusual predictor values that can pull the regression line toward them. The default threshold is \(2K/N\) where \(K\) is the number of parameters and \(N\) is the sample size.

leverage_df = detector.detect_leverage_points(threshold=None)  # Uses 2K/N
print(leverage_df[leverage_df["is_high_leverage"]])

Returns a pd.DataFrame with columns: entity, time, leverage, is_high_leverage.

Approximate Leverage

For panel models with entity fixed effects, exact leverage computation requires the full hat matrix. PanelBox provides an approximation based on the Mahalanobis distance of the design matrix.

Diagnostic Plots¶

detector.plot_diagnostics()

Produces a 4-panel figure:

Residuals vs Fitted -- Checks for heteroskedasticity and nonlinearity
Normal Q-Q Plot -- Checks normality of residuals
Scale-Location -- \(\sqrt{|r_i^*|}\) vs fitted values (detects heteroskedasticity)
Residual Histogram -- Distribution of residuals with normal overlay

What To Do With Outliers¶

Never Blindly Remove Outliers

Outliers are data, not noise. Removing them without justification biases your estimates.

Recommended workflow:

Investigate: Look up the flagged observations. Are they data entry errors? Extreme but real events? Structural breaks?
Understand: If they are real, consider whether the model should accommodate them (e.g., with additional controls or a different functional form).
Compare: Re-estimate with and without outliers. If results change substantially, report both sets of estimates.
Winsorize: If you must limit the impact of outliers, winsorize at the 1^st/99^th percentile rather than dropping observations.
Robust estimation: Consider robust regression methods that downweight influential observations.

Comprehensive Example¶

from panelbox import FixedEffects
from panelbox.validation.robustness import OutlierDetector
from panelbox.datasets import load_grunfeld

data = load_grunfeld()
model = FixedEffects("invest ~ value + capital", data, "firm", "year")
results = model.fit()

detector = OutlierDetector(results, verbose=True)

# Multiple detection methods
residual_outliers = detector.detect_outliers_residuals(method="standardized", threshold=2.5)
mahal_outliers = detector.detect_outliers_multivariate(threshold=3.0)
leverage_points = detector.detect_leverage_points()

print(f"Residual outliers:     {residual_outliers.n_outliers}")
print(f"Multivariate outliers: {mahal_outliers.n_outliers}")
print(f"High-leverage points:  {leverage_points['is_high_leverage'].sum()}")

# Investigate flagged observations
print("\nResidual outlier details:")
print(residual_outliers.outlier_table)

# Diagnostic plots
detector.plot_diagnostics()

Common Pitfalls¶

Watch Out

Threshold choice matters: A threshold of 2.0 flags ~5% of normally distributed data; 2.5 flags ~1.2%; 3.0 flags ~0.3%. Choose based on context, not convenience.
Masking effect: In the presence of multiple outliers, standard methods can fail to detect some outliers because the contaminated mean and variance absorb the outlier signal. Use robust methods (multivariate Mahalanobis) when you suspect clusters of outliers.
Panel structure: An observation may be an outlier within one entity but normal across entities. Consider entity-specific outlier detection for heterogeneous panels.
Confusing outliers with influence: An outlier (unusual residual) is not necessarily influential (changes coefficients). Use InfluenceDiagnostics to assess impact on estimates.

References¶

Cook, R. D., & Weisberg, S. (1982). Residuals and Influence in Regression. Chapman and Hall.
Rousseeuw, P. J., & Leroy, A. M. (1987). Robust Regression and Outlier Detection. John Wiley & Sons.
Belsley, D. A., Kuh, E., & Welsch, R. E. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. John Wiley & Sons.