Pooled OLS¶

Quick Reference

Class: panelbox.models.static.pooled_ols.PooledOLS Import: from panelbox import PooledOLS Stata equivalent: reg y x1 x2 R equivalent: plm(y ~ x1 + x2, data, model = "pooling")

Overview¶

Pooled OLS is the simplest panel data estimator. It stacks all observations from all entities and time periods into a single dataset and estimates a standard OLS regression, completely ignoring the panel structure. The model is:

\[y_{it} = X_{it} \beta + \varepsilon_{it}\]

where \(i\) indexes entities and \(t\) indexes time periods. No entity-specific or time-specific effects are included.

Pooled OLS serves primarily as a baseline model for comparison with panel-specific estimators like Fixed Effects or Random Effects. If unobserved entity-specific heterogeneity exists and is correlated with the regressors, Pooled OLS produces biased and inconsistent estimates due to omitted variable bias.

In practice, researchers estimate Pooled OLS first, then test whether the panel structure matters by comparing it with Fixed Effects (using an F-test) or Random Effects (using the Breusch-Pagan LM test).

Quick Example¶

from panelbox import PooledOLS
from panelbox.datasets import load_grunfeld

data = load_grunfeld()
model = PooledOLS("invest ~ value + capital", data, "firm", "year")
results = model.fit(cov_type="robust")
print(results.summary())

When to Use¶

As a baseline or benchmark before estimating panel models
When you believe no unobserved entity-specific heterogeneity exists
When the panel structure is irrelevant (e.g., pooled cross-sections)
To compare coefficient magnitudes with FE/RE and detect potential omitted variable bias
When computing bounds on the true coefficients (Pooled OLS vs FE)

Key Assumptions

No unobserved heterogeneity: \(E[\varepsilon_{it} | X_{it}] = 0\) (no omitted entity effects)
Linearity: The conditional expectation of \(y\) is linear in \(X\)
No perfect multicollinearity: Regressors are not perfectly correlated
If using classical SEs: Homoskedasticity and no serial correlation within entities

If unobserved entity effects \(\alpha_i\) exist and are correlated with \(X_{it}\), Pooled OLS is biased and inconsistent. Use Fixed Effects instead.

Detailed Guide¶

Data Preparation¶

PanelBox expects data in long format (one row per entity-time observation):

import pandas as pd
from panelbox.datasets import load_grunfeld

data = load_grunfeld()
print(data.head())
#    firm  year  invest  value  capital
# 0     1  1935   317.6  3078.5    2.8
# 1     1  1936   391.8  4661.7   52.6
# ...

The data must contain:

A dependent variable and one or more independent variables
An entity identifier column (e.g., "firm")
A time identifier column (e.g., "year")

Estimation¶

from panelbox import PooledOLS

# Basic estimation with classical standard errors
model = PooledOLS("invest ~ value + capital", data, "firm", "year")
results = model.fit()

# With robust standard errors (recommended)
results_robust = model.fit(cov_type="robust")

# With clustered standard errors by entity
results_cluster = model.fit(cov_type="clustered")

# With Driscoll-Kraay standard errors
results_dk = model.fit(cov_type="driscoll_kraay", max_lags=3)

Interpreting Results¶

print(results.summary())

Key output attributes:

Attribute	Description
`results.params`	Estimated coefficients (pd.Series)
`results.std_errors`	Standard errors (pd.Series)
`results.tvalues`	t-statistics (pd.Series)
`results.pvalues`	Two-sided p-values (pd.Series)
`results.rsquared`	R-squared
`results.rsquared_adj`	Adjusted R-squared
`results.nobs`	Number of observations
`results.conf_int()`	95% confidence intervals (DataFrame)
`results.resid`	Residuals (np.ndarray)
`results.fittedvalues`	Fitted values (np.ndarray)

# Access individual results
print(f"R-squared: {results.rsquared:.4f}")
print(f"Coefficients:\n{results.params}")
print(f"Confidence intervals:\n{results.conf_int()}")

Configuration Options¶

Parameter	Type	Default	Description
`formula`	str	required	R-style formula (e.g., `"y ~ x1 + x2"`)
`data`	DataFrame	required	Panel data in long format
`entity_col`	str	required	Entity identifier column name
`time_col`	str	required	Time identifier column name
`weights`	np.ndarray	`None`	Observation weights for WLS estimation

fit() method:

Parameter	Type	Default	Description
`cov_type`	str	`"nonrobust"`	Standard error type (see table below)
`max_lags`	int	auto	Maximum lags for Driscoll-Kraay / Newey-West
`kernel`	str	`"bartlett"`	Kernel for HAC estimators

Standard Errors¶

`cov_type`	Method	When to Use
`"nonrobust"`	Classical OLS	Homoskedastic errors, no autocorrelation
`"robust"` / `"hc1"`	White HC1	Heteroskedasticity of unknown form
`"hc0"`	White HC0	Heteroskedasticity (no small-sample correction)
`"hc2"`	HC2	Heteroskedasticity (leverage-based)
`"hc3"`	HC3	Heteroskedasticity (jackknife-like, conservative)
`"clustered"`	Cluster-robust	Within-entity correlation over time
`"twoway"`	Two-way clustered	Correlation within entities and time periods
`"driscoll_kraay"`	Driscoll-Kraay	Cross-sectional dependence + serial correlation
`"newey_west"`	Newey-West HAC	Serial correlation in time series
`"pcse"`	Panel-corrected (Beck-Katz)	Cross-sectional dependence, T > N

Recommendation

For panel data, always use at least cov_type="clustered" to account for within-entity correlation. Classical standard errors are almost always too small for panel data.

Diagnostics¶

After estimating Pooled OLS, run these tests to check whether the panel structure matters:

from panelbox import FixedEffects, RandomEffects

# Compare with Fixed Effects
fe = FixedEffects("invest ~ value + capital", data, "firm", "year")
fe_results = fe.fit()

# F-test for entity effects (FE vs Pooled OLS)
# Reported automatically in FE results
print(f"F-statistic: {fe_results.f_statistic:.4f}")
print(f"F-test p-value: {fe_results.f_pvalue:.4f}")
# If p < 0.05 -> entity effects exist -> Pooled OLS is inadequate

# Breusch-Pagan LM test for random effects
from panelbox.validation import BreuschPaganTest
bp = BreuschPaganTest(results)
bp_result = bp.run()
print(bp_result.summary())

Tutorials¶

Tutorial	Level	Colab
Pooled OLS Introduction	Beginner
Comparison of All Estimators	Advanced

References¶

Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data (2^nd ed.). MIT Press. Chapter 10.
Baltagi, B. H. (2021). Econometric Analysis of Panel Data (6^th ed.). Springer. Chapter 2.
Cameron, A. C., & Trivedi, P. K. (2005). Microeconometrics: Methods and Applications. Cambridge University Press. Chapter 21.