Statistical Foundations for Factor Model Analysis

Inference, the Central Limit Theorem, and Nonparametric Robust Standard Errors


Where This Fits

This is the first notebook in the series. The recommended reading order is:

  1. Statistical Foundations — this notebook

  2. Time Series Foundations — Stationarity, autocorrelation, volatility clustering, ergodicity

  3. Fama-French 3-Factor Model — Data, regression mechanics, diagnostics, robust SEs, interpretation

  4. Advanced Factor Models — FF5, momentum, profitability, beta anomaly, rolling windows

  5. Practical Factor Investing — Real ETF analysis, portfolio construction, constrained optimisation

  6. Solution Manual — Complete worked solutions to all exercises

This notebook develops the statistical concepts that all later tutorials rely on — with small, hand-checkable examples so that every formula can be verified without a computer. The next notebook applies these tools to financial time series specifically.

What We’ll Cover

  1. Descriptive Statistics Refresher — Mean, variance, covariance with a tiny dataset

  2. Random Variables & Expectation — The population vs. sample distinction

  3. The Central Limit Theorem (CLT) — Why it is the single most important result in applied statistics

  4. Hypothesis Testing and the CLT — t-statistics, p-values, confidence intervals, and why the normal approximation works

  5. Linear Regression and OLS — From scalar formulas to matrix notation, with hand-checkable examples

  6. Nonparametric Robust Inference — The sandwich formula, the CLT connection, and HC standard errors

  7. Newey-West HAC Standard Errors — Full derivation with a hand-checkable example

  8. Newey-West on a Realistic Sample — Simulation with heteroscedasticity and autocorrelation

  9. Putting It All Together — From CLT to Newey-West to the standard normal p-value

  10. Summary — Quick-reference tables and connection to the Fama-French tutorials

Philosophy

Every result in this notebook is demonstrated on data small enough to check by hand (typically 5–8 observations). The code cells verify the hand calculations — they never replace them. We encourage you to work through the arithmetic with pen and paper before running the code.

Prerequisites

  • Probability theory — random variables, expectation, variance, the law of large numbers

  • Basic statistics — hypothesis testing, confidence intervals, the normal distribution

  • Linear algebra essentials — matrix multiplication, transpose, inverse (used in the OLS sections)

  • Basic familiarity with Python (for the verification code; the math stands alone)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import statsmodels.api as sm
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)

try:
    plt.style.use('seaborn-v0_8-darkgrid')
except Exception:  # style name not available in this matplotlib version
    plt.style.use('default')

np.random.seed(42)
print("Libraries loaded.")

Section 1: Descriptive Statistics — A Hand-Checkable Example

We start with a dataset small enough to compute everything on paper. Suppose we observe 5 monthly excess returns (in percent):

$$x = (2, \; -1, \; 3, \; 0, \; 1)$$

1.1 Sample Mean

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{2 + (-1) + 3 + 0 + 1}{5} = \frac{5}{5} = 1.0$$

1.2 Sample Variance (with Bessel’s correction)

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

The deviations from the mean are:

| $i$ | $x_i$ | $x_i - \bar{x}$ | $(x_i - \bar{x})^2$ |
|-----|-------|-----------------|---------------------|
| 1 | 2 | 1 | 1 |
| 2 | −1 | −2 | 4 |
| 3 | 3 | 2 | 4 |
| 4 | 0 | −1 | 1 |
| 5 | 1 | 0 | 0 |
| **Sum** | | | **10** |

$$s^2 = \frac{10}{5-1} = \frac{10}{4} = 2.5 \qquad s = \sqrt{2.5} \approx 1.5811$$

1.3 Why $n - 1$? (Bessel’s Correction)

If we knew the true population mean $\mu$, we would divide by $n$. But we estimated $\mu$ from the same data (using $\bar{x}$), which “uses up” one degree of freedom. Dividing by $n - 1$ corrects for this, making $s^2$ an unbiased estimator of the population variance:

$$E[s^2] = \sigma^2$$

With only $n = 5$ observations, the difference between dividing by 4 vs. 5 is 25% — far from negligible!
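The unbiasedness claim can be checked by simulation. The sketch below uses an assumed normal population with $\sigma^2 = 4$ (the population, seed, and replication count are arbitrary choices, not part of the hand example): dividing by $n - 1$ recovers the true variance on average, while dividing by $n$ underestimates it by the factor $(n-1)/n$.

```python
import numpy as np

# Monte Carlo check of Bessel's correction: draw many samples of size n = 5
# from a hypothetical population with known variance sigma^2 = 4, then compare
# the average of the (n-1)-denominator estimator against the n-denominator one.
rng = np.random.default_rng(0)
n, num_samples, sigma2 = 5, 200_000, 4.0

samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(num_samples, n))
var_bessel = samples.var(axis=1, ddof=1)   # divide by n - 1
var_biased = samples.var(axis=1, ddof=0)   # divide by n

print(f"True variance:           {sigma2}")
print(f"Mean of s² (ddof=1):     {var_bessel.mean():.4f}")  # ≈ 4.0 (unbiased)
print(f"Mean of biased (ddof=0): {var_biased.mean():.4f}")  # ≈ (n-1)/n · σ² = 3.2
```

With $n = 5$ the biased estimator undershoots by exactly the 25% mentioned above.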

# ============================================================================
# Section 1: Verify hand calculations
# ============================================================================

x = np.array([2, -1, 3, 0, 1], dtype=float)
n = len(x)

mean = x.sum() / n
var = ((x - mean)**2).sum() / (n - 1)
std = np.sqrt(var)

print("Hand-checkable dataset: x =", x)
print(f"\nSample size n = {n}")
print(f"Sum = {x.sum()}")
print(f"Mean = {x.sum()}/{n} = {mean}")
print(f"\nDeviations from mean: {x - mean}")
print(f"Squared deviations:   {(x - mean)**2}")
print(f"Sum of squared devs:  {((x - mean)**2).sum()}")
print(f"\nSample variance s² = {((x - mean)**2).sum()}/{n-1} = {var}")
print(f"Sample std dev s   = √{var} = {std:.4f}")

# Verify against numpy (note: np.var and np.std default to ddof=0, i.e. divide by n)
print(f"\nVerification:")
print(f"  np.mean(x) = {np.mean(x)}")
print(f"  np.var(x, ddof=1) = {np.var(x, ddof=1)}")  # ddof=1 => Bessel's
print(f"  np.std(x, ddof=1) = {np.std(x, ddof=1):.4f}")

1.4 Covariance and Correlation (Two Variables)

Now suppose we have a second variable — say, the market excess return for the same 5 months:

$$y = (3, \; -2, \; 4, \; 1, \; 0)$$

$$\bar{y} = \frac{3 + (-2) + 4 + 1 + 0}{5} = \frac{6}{5} = 1.2$$

Sample covariance:

$$s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$

| $i$ | $x_i - \bar{x}$ | $y_i - \bar{y}$ | Product |
|-----|-----------------|-----------------|---------|
| 1 | 1 | 1.8 | 1.8 |
| 2 | −2 | −3.2 | 6.4 |
| 3 | 2 | 2.8 | 5.6 |
| 4 | −1 | −0.2 | 0.2 |
| 5 | 0 | −1.2 | 0.0 |
| **Sum** | | | **14.0** |

$$s_{xy} = \frac{14.0}{4} = 3.5$$

Sample correlation:

$$r_{xy} = \frac{s_{xy}}{s_x \cdot s_y}$$

We already know $s_x = \sqrt{2.5}$. For $y$: the squared deviations are $1.8^2 + 3.2^2 + 2.8^2 + 0.2^2 + 1.2^2 = 3.24 + 10.24 + 7.84 + 0.04 + 1.44 = 22.80$, so $s_y^2 = 22.80/4 = 5.70$ and $s_y = \sqrt{5.70} \approx 2.3875$.

$$r_{xy} = \frac{3.5}{\sqrt{2.5} \times \sqrt{5.70}} = \frac{3.5}{1.5811 \times 2.3875} \approx \frac{3.5}{3.7749} \approx 0.9272$$

This strong positive correlation ($\approx 0.93$) means $x$ and $y$ move together closely.

# ============================================================================
# Verify covariance and correlation
# ============================================================================

y = np.array([3, -2, 4, 1, 0], dtype=float)
y_mean = y.mean()
s_y2 = np.var(y, ddof=1)
s_y = np.std(y, ddof=1)

cov_xy = ((x - mean) * (y - y_mean)).sum() / (n - 1)
corr_xy = cov_xy / (std * s_y)

print(f"y = {y}")
print(f"ȳ = {y_mean}")
print(f"s_y² = {s_y2:.2f},  s_y = {s_y:.4f}")
print(f"\nProducts (x_i - x̄)(y_i - ȳ): {(x - mean) * (y - y_mean)}")
print(f"Sum of products: {((x - mean) * (y - y_mean)).sum()}")
print(f"Covariance s_xy = {cov_xy}")
print(f"Correlation r_xy = {cov_xy} / ({std:.4f} × {s_y:.4f}) = {corr_xy:.4f}")
print(f"\nVerification: np.corrcoef(x, y)[0,1] = {np.corrcoef(x, y)[0, 1]:.4f}")

Section 2: Random Variables, Expectation, and the Population–Sample Distinction

2.1 Population vs. Sample

A crucial distinction underlies all of statistics:

  • The population is the entire (usually infinite or very large) set of possible outcomes. For example, “all possible monthly returns that SPY could ever produce.”

  • A sample is a finite collection of observations drawn from the population. For example, “the 168 monthly returns we actually observed from 2010–2023.”

We use sample statistics ($\bar{x}$, $s^2$, etc.) to estimate population parameters ($\mu$, $\sigma^2$, etc.). The sample mean $\bar{x}$ is our best guess of the true mean $\mu$, but it comes with uncertainty because different samples would give different $\bar{x}$ values.

2.2 The Standard Error of the Mean

If $X_1, X_2, \ldots, X_n$ are independent draws from a population with mean $\mu$ and variance $\sigma^2$, then the sample mean

$$\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$$

has:

$$E[\bar{X}] = \mu \qquad \text{Var}(\bar{X}) = \frac{\sigma^2}{n}$$

Why $n$ and not $n - 1$ here? In Section 1.3 we used Bessel’s correction ($n - 1$) when estimating the population variance $\sigma^2$ from a sample — that corrects for the bias introduced by using $\bar{x}$ in place of the unknown $\mu$. Here the situation is different: we are not estimating anything. The formula $\text{Var}(\bar{X}) = \sigma^2 / n$ is an exact algebraic result that follows directly from the properties of variance. Since the $X_i$ are independent, $\text{Var}\left(\sum X_i\right) = \sum \text{Var}(X_i) = n\sigma^2$, and dividing a random variable by $n$ scales its variance by $1/n^2$, giving $\text{Var}(\bar{X}) = n\sigma^2 / n^2 = \sigma^2 / n$. The $n$ here simply counts how many observations are being averaged — it has nothing to do with degrees of freedom or bias correction.

The standard error of the mean is therefore:

$$\text{SE}(\bar{X}) = \frac{\sigma}{\sqrt{n}}$$

In practice $\sigma$ is unknown, so we replace it with the sample standard deviation $s$ (computed with the Bessel-corrected $n - 1$ denominator from Section 1.2). Note that the $n - 1$ inside $s$ and the $n$ inside $\sqrt{n}$ play completely different roles: $n - 1$ corrects the bias when estimating the variance, while $n$ reflects the number of observations being averaged. The estimated standard error is:

$$\widehat{\text{SE}}(\bar{X}) = \frac{s}{\sqrt{n}}$$

Hand calculation with our data ($\bar{x} = 1$, $s = \sqrt{2.5}$, $n = 5$):

$$\widehat{\text{SE}} = \frac{\sqrt{2.5}}{\sqrt{5}} = \sqrt{\frac{2.5}{5}} = \sqrt{0.5} \approx 0.7071$$

This tells us: “If we repeatedly drew samples of size 5 from this population and computed the mean each time, those means would have a standard deviation of about 0.71.”

# ============================================================================
# Standard error of the mean — hand calculation verification
# ============================================================================

se_mean = std / np.sqrt(n)
print(f"s = {std:.4f}")
print(f"n = {n}")
print(f"SE(x̄) = s / √n = {std:.4f} / √{n} = {se_mean:.4f}")
print(f"Verification: √(s²/n) = √({var}/{n}) = √{var/n:.4f} = {np.sqrt(var/n):.4f}")
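The “repeated samples” interpretation can itself be checked by simulation. The sketch below assumes a hypothetical normal population whose variance matches our sample ($\mu = 1$, $\sigma^2 = 2.5$); the seed and replication count are arbitrary:

```python
import numpy as np

# Repeatedly draw samples of size n = 5 from a hypothetical population with
# sigma^2 = 2.5 and measure the spread of the resulting sample means.
rng = np.random.default_rng(1)
n, num_samples = 5, 100_000
sigma = np.sqrt(2.5)

sample_means = rng.normal(loc=1.0, scale=sigma, size=(num_samples, n)).mean(axis=1)

print(f"Theoretical SE:  σ/√n = {sigma / np.sqrt(n):.4f}")   # √0.5 ≈ 0.7071
print(f"Empirical SD of sample means: {sample_means.std():.4f}")
```

The empirical standard deviation of the means lands right on the $\sigma/\sqrt{n}$ prediction.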

Section 3: The Central Limit Theorem — The Engine of Inference

The CLT is the single most important result in statistics. It is the reason we can do hypothesis testing, build confidence intervals, and compute p-values. In this section we state the classical version first, then progressively relax its assumptions toward versions that actually apply to financial time series. By the end, you will understand exactly which CLT variant underwrites the Newey-West standard errors used throughout the Fama-French tutorials.

Reading guide. Sections 3.1–3.2 (classical CLT) and Section 3.7 (visual demo) are essential for everyone. Sections 3.3–3.5 develop the theoretical machinery that justifies Newey-West for time-series data — they can be skimmed on a first reading and revisited after working through Section 7.

What are Newey-West standard errors? In short, they are a method for computing standard errors that remain valid even when the data are serially correlated and/or heteroscedastic — two features that are ubiquitous in financial time series. Where ordinary standard errors assume each observation is independent, Newey-West standard errors account for the fact that nearby observations may be correlated, producing wider (and more honest) confidence intervals. We develop the full theory and a hand-checkable example in Section 7.

3.1 Version 1: The Classical CLT (Lindeberg-Lévy)

Theorem. Let $X_1, X_2, \ldots, X_n$ be independent and identically distributed (i.i.d.) random variables with mean $\mu$ and finite variance $\sigma^2 > 0$. Then as $n \to \infty$:

$$\frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \xrightarrow{d} \mathcal{N}(0, 1)$$

In words: the standardized sample mean converges in distribution to a standard normal, regardless of the shape of the underlying distribution.

Assumptions: (1) identical distributions, (2) independence, (3) finite variance.

What this buys us: We can compute the distribution of $\bar{X}_n$ without knowing the shape of the $X_i$. This is extraordinary — we need no parametric model for the data.

Relevance for finance: If monthly returns were truly i.i.d. (same distribution every month, no dependence across time), this version would be all we need. But they are not — so we need to relax the assumptions.

3.2 Why This Is Revolutionary

The CLT tells us something extraordinary:

  • The $X_i$ might follow any distribution — uniform, exponential, chi-squared, a bizarre bimodal shape, anything with finite variance.

  • Yet the average of many such observations will be approximately normal.

  • The approximation improves as nn grows.

This is why the normal distribution appears everywhere in statistics: it governs the behavior of averages and sums, not necessarily of individual observations.

3.3 Version 2: Dropping “Identical” — The Lindeberg-Feller CLT

The problem with “identically distributed.” In a regression $y_i = \alpha + \beta x_i + \epsilon_i$, the coefficient estimator $\hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ is a weighted average of the $y_i$ — where the weights depend on the $x_i$ values and differ across observations. Even if the errors $\epsilon_i$ are i.i.d., the weighted terms are not identically distributed. The classical CLT does not directly apply.

The solution. The Lindeberg-Feller CLT drops the “identically distributed” requirement:

Theorem (Lindeberg-Feller). Let $X_1, \ldots, X_n$ be independent (but not necessarily identically distributed) random variables with $E[X_i] = \mu_i$ and $\text{Var}(X_i) = \sigma_i^2$. If no single observation dominates the total variance (the Lindeberg condition), then:

$$\frac{\sum_{i=1}^n (X_i - \mu_i)}{\sqrt{\sum_{i=1}^n \sigma_i^2}} \xrightarrow{d} \mathcal{N}(0, 1)$$

Assumptions: (1) independence, (2) finite variances, (3) no single term dominates.

What this buys us: Regression coefficients $\hat{\beta}$ are asymptotically normal even when the regressors $x_i$ take different values across observations. This is why OLS inference works for regression coefficients, not just sample means.

Relevance for finance: This handles the “weighted average” structure of $\hat{\beta}$, but still requires independence — which financial time series violate.
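A small simulation can illustrate this. It is a sketch with assumed ingredients (an evenly spaced regressor, centered-exponential errors, and arbitrary intercept and slope values — none of these come from the text): the OLS slope, a weighted average of non-identically distributed terms, still comes out approximately normal.

```python
import numpy as np
from scipy import stats

# The OLS slope is beta_hat = sum(x_dev * y) / sum(x_dev^2): a weighted
# average whose terms differ across i. With skewed (centered exponential)
# i.i.d. errors, the slope's sampling distribution is still ~ normal.
rng = np.random.default_rng(2)
n_obs, num_sims = 200, 20_000
x_reg = np.linspace(0.0, 10.0, n_obs)     # regressor values differ across i
x_dev = x_reg - x_reg.mean()
beta_true = 0.5

eps = rng.exponential(scale=1.0, size=(num_sims, n_obs)) - 1.0  # mean 0, skew 2
y_sim = 1.0 + beta_true * x_reg + eps
beta_hat = (y_sim * x_dev).sum(axis=1) / (x_dev**2).sum()

# Theoretical SE of the slope: sigma_eps / sqrt(sum of squared x deviations),
# with Var(eps) = 1 for Exponential(1) errors
se_theory = 1.0 / np.sqrt((x_dev**2).sum())
z = (beta_hat - beta_true) / se_theory

print(f"Mean of β̂: {beta_hat.mean():.4f} (true value {beta_true})")
print(f"SD of standardized β̂: {z.std():.4f} (≈ 1)")
print(f"Skewness of standardized β̂: {stats.skew(z):.4f} (≈ 0, despite skewed errors)")
```

Even though each error is strongly right-skewed, the standardized slope estimates show essentially no skew — the averaging washes it out, exactly as Lindeberg-Feller promises.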

3.4 Version 3: Dropping “Independent” (Partially) — The Martingale Difference CLT

The problem with “independent.” Monthly stock returns exhibit volatility clustering — periods of high volatility (like 2008–2009 or March 2020) alternate with calm periods. Even if returns themselves are nearly unpredictable (weak autocorrelation), they are not independent, because knowing yesterday’s return magnitude tells you something about today’s volatility.

A useful middle ground. Many financial models assume that returns are a martingale difference sequence (MDS). This means:

$$E[X_t \mid X_{t-1}, X_{t-2}, \ldots] = 0$$

In words: the conditional mean of the next return is zero (given past data), even though other aspects of the conditional distribution (like the variance) may depend on the past. This is weaker than independence — it says returns are unpredictable (in the linear sense) but not necessarily independent.

Theorem (MDS-CLT). Let $\{X_t\}$ be a stationary, ergodic martingale difference sequence with $E[X_t^2] = \sigma^2 < \infty$. Then:

$$\frac{\bar{X}_n}{\sigma / \sqrt{n}} \xrightarrow{d} \mathcal{N}(0, 1)$$

Assumptions: (1) MDS property (unpredictability), (2) stationarity and ergodicity, (3) finite variance. Independence is not required.

What this buys us: We can handle volatility clustering. Even though $\text{Var}(X_t \mid \text{past})$ changes over time (GARCH effects), as long as returns themselves are unpredictable, $\bar{X}_n$ is asymptotically normal with variance $\sigma^2/n$ where $\sigma^2 = E[X_t^2]$ is the unconditional variance.

Relevance for finance: The efficient market hypothesis (in its weak form) implies that returns are approximately an MDS. This CLT version is what makes standard hypothesis tests on mean returns valid — even with GARCH-type volatility clustering — as long as the standard errors are computed correctly (i.e., using the unconditional variance, or better yet, a HAC estimator).
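To make this concrete, here is a sketch using an ARCH(1) process — a standard example of an MDS with volatility clustering. The parameters $\omega$ and $\alpha$ are arbitrary choices, not from the text. The unconditional variance is $\omega/(1-\alpha)$, and the MDS-CLT says the sample mean’s dispersion should match $\sigma/\sqrt{n}$ despite the conditional heteroscedasticity.

```python
import numpy as np

# ARCH(1): X_t = sigma_t * eps_t with sigma_t^2 = omega + alpha * X_{t-1}^2.
# The conditional mean is zero (an MDS), but the conditional variance depends
# on the past -- volatility clustering. Unconditional Var(X_t) = omega/(1-alpha).
rng = np.random.default_rng(3)
omega, alpha = 0.5, 0.5
n_obs, num_sims = 500, 10_000

x = np.zeros((num_sims, n_obs))   # x[:, 0] = 0 (minor start-up effect only)
for t in range(1, n_obs):
    sigma_t = np.sqrt(omega + alpha * x[:, t - 1]**2)   # conditional volatility
    x[:, t] = sigma_t * rng.standard_normal(num_sims)

means = x.mean(axis=1)
sigma2_uncond = omega / (1 - alpha)   # = 1.0 here

print(f"Unconditional variance σ² = {sigma2_uncond}")
print(f"Theoretical SD of X̄ (σ/√n): {np.sqrt(sigma2_uncond / n_obs):.4f}")
print(f"Empirical SD of X̄:          {means.std():.4f}")
```

Despite the GARCH-style dynamics, the sample means spread out exactly as the unconditional $\sigma/\sqrt{n}$ predicts — no long-run-variance correction is needed because ARCH returns are serially uncorrelated.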

3.5 Version 4: Allowing Mild Dependence — The CLT for Mixing Processes

The problem with MDS. In reality, monthly returns may exhibit mild autocorrelation (short-run momentum or reversal effects), and regression residuals may be serially correlated. If $E[X_t \mid X_{t-1}] \neq 0$, we can’t use the MDS-CLT.

The solution. The most general CLT relevant to finance allows weak dependence — observations can be correlated, but observations far apart in time must be nearly independent. This is formalized by mixing conditions (such as $\alpha$-mixing or strong mixing):

Theorem (CLT for stationary mixing sequences). Let $\{X_t\}$ be a stationary sequence with $E[X_t] = \mu$, $\text{Var}(X_t) = \gamma_0 < \infty$, and autocovariances $\gamma_k = \text{Cov}(X_t, X_{t+k})$ that decay sufficiently fast (specifically, $\sum_{k=0}^{\infty} |\gamma_k| < \infty$). Then:

$$\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} \mathcal{N}\left(0,\; \gamma_0 + 2\sum_{k=1}^{\infty} \gamma_k \right)$$

The asymptotic variance is not simply $\gamma_0 = \sigma^2$. It is the long-run variance:

$$\sigma_{\text{LR}}^2 = \gamma_0 + 2\sum_{k=1}^{\infty}\gamma_k = \sum_{k=-\infty}^{\infty} \gamma_k$$

This equals $2\pi$ times the spectral density at frequency zero — the “DC component” of the time series.

Assumptions: (1) stationarity, (2) autocovariances decay fast enough for the sum to converge, (3) finite variance. Neither independence nor the MDS property is required.

What this buys us: Full generality for weakly dependent time series. The sample mean $\bar{X}_n$ is still asymptotically normal, but its variance is $\sigma_{\text{LR}}^2 / n$, not $\sigma^2 / n$. If autocovariances are positive, $\sigma_{\text{LR}}^2 > \sigma^2$ and the “naive” standard error $s/\sqrt{n}$ understates the true uncertainty.

Relevance for finance: This is the CLT that Newey-West standard errors are built on. Newey-West estimates $\sigma_{\text{LR}}^2$ from the data using a weighted sum of sample autocovariances. We will derive this in full detail in Section 7.

3.6 Summary: Which CLT Do We Need?

| Version | Assumes | Allows | Variance of $\bar{X}_n$ | Used for |
|---------|---------|--------|-------------------------|----------|
| Lindeberg-Lévy (3.1) | i.i.d. | Non-normality | $\sigma^2/n$ | Textbook examples |
| Lindeberg-Feller (3.3) | Independent (not identical) | Weighted averages, regression | $\sum \sigma_i^2 / n^2$ | OLS with known $\sigma$ |
| MDS-CLT (3.4) | Unpredictable (MDS) | Volatility clustering | $\sigma^2/n$ | Returns under EMH |
| Mixing CLT (3.5) | Decaying dependence | Serial correlation + heteroscedasticity | $\sigma_{\text{LR}}^2/n$ | Newey-West / HAC |

For empirical asset pricing with financial time series, we need the Mixing CLT (or at minimum the MDS-CLT). The classical i.i.d. version is a pedagogical starting point — useful for intuition — but it does not match the properties of real financial data.

The key takeaway: The CLT always delivers normality of the sample mean (or regression coefficients). What changes across versions is the variance expression. Getting the variance right is the entire game — and it is exactly what robust standard errors (Newey-West) are designed to do.

3.7 A Visual Demonstration

Let us verify the CLT by repeatedly drawing samples from a decidedly non-normal distribution — the exponential distribution (which is right-skewed with skewness = 2) — and examining the distribution of sample means.

# ============================================================================
# CLT Demonstration: Sample means from an Exponential(1) distribution
# True mean μ = 1, true variance σ² = 1
# ============================================================================

np.random.seed(42)
num_simulations = 10_000
sample_sizes = [1, 5, 30, 200]

fig, axes = plt.subplots(1, 4, figsize=(18, 4))

for ax, n_sim in zip(axes, sample_sizes):
    # Draw all samples at once (vectorized — much faster than a Python loop)
    sample_means = np.random.exponential(scale=1.0, size=(num_simulations, n_sim)).mean(axis=1)
    
    # Standardize: z = (x̄ - μ) / (σ/√n)
    z_scores = (sample_means - 1.0) / (1.0 / np.sqrt(n_sim))
    
    ax.hist(z_scores, bins=50, density=True, alpha=0.7, color='steelblue', edgecolor='white')
    
    # Overlay standard normal
    t_grid = np.linspace(-4, 4, 200)
    ax.plot(t_grid, stats.norm.pdf(t_grid), 'r-', linewidth=2, label='N(0,1)')
    
    ax.set_title(f'n = {n_sim}', fontweight='bold', fontsize=13)
    ax.set_xlim(-4, 4)
    ax.set_ylim(0, 0.55)
    ax.legend(fontsize=9)
    ax.grid(True, alpha=0.3)

fig.suptitle('CLT in Action: Distribution of Standardized Sample Means\n'
             '(from Exponential(1) — a skewed, non-normal population)',
             fontsize=14, fontweight='bold', y=1.05)
plt.tight_layout()
plt.show()

print("""
OBSERVATIONS:
• n = 1:   The sample mean IS the single draw → looks exponential (right-skewed)
• n = 5:   Already much more symmetric, but still visibly non-normal
• n = 30:  Nearly indistinguishable from the standard normal curve
• n = 200: Essentially perfect normal bell curve

KEY INSIGHT: The population is exponential (skewness = 2), yet the sample 
mean's distribution converges to N(0,1) as n grows. This is the CLT.
""")

3.8 Why the Long-Run Variance Matters — A Quick Demonstration

The classical CLT says $\text{Var}(\bar{X}_n) = \sigma^2 / n$. But when data are autocorrelated, the true variance of the sample mean is $\sigma_{\text{LR}}^2 / n$, where $\sigma_{\text{LR}}^2 = \sum_{k=-\infty}^{\infty} \gamma_k$ can be much larger than $\sigma^2 = \gamma_0$.

The following simulation makes this concrete. We draw from an AR(1) process with autocorrelation $\rho = 0.5$ and compare the actual dispersion of sample means against what the naive formula $\sigma / \sqrt{n}$ predicts.

# ============================================================================
# Mixing CLT: naive SE vs true SE for autocorrelated data
# ============================================================================

np.random.seed(123)
n_ar = 100
rho_demo = 0.5
num_mc = 5000

# Generate num_mc independent AR(1) series, each of length n_ar
means_ar = np.zeros(num_mc)
for sim in range(num_mc):
    eps = np.random.randn(n_ar)
    ar_series = np.zeros(n_ar)
    ar_series[0] = eps[0]
    for t in range(1, n_ar):
        ar_series[t] = rho_demo * ar_series[t - 1] + eps[t]
    means_ar[sim] = ar_series.mean()

# Theoretical values
gamma0 = 1.0 / (1 - rho_demo**2)          # Var(X_t) for AR(1)
sigma_lr_sq = gamma0 * (1 + rho_demo) / (1 - rho_demo)  # long-run variance
naive_se = np.sqrt(gamma0 / n_ar)
true_se = np.sqrt(sigma_lr_sq / n_ar)

fig, ax = plt.subplots(figsize=(9, 4.5))
ax.hist(means_ar, bins=60, density=True, alpha=0.6, color='steelblue',
        edgecolor='white', label='Simulated $\\bar{X}_n$')

grid = np.linspace(means_ar.min(), means_ar.max(), 300)
ax.plot(grid, stats.norm.pdf(grid, 0, naive_se), 'r--', linewidth=2,
        label=f'Naive: $\\sigma / \\sqrt{{n}}$ = {naive_se:.3f}')
ax.plot(grid, stats.norm.pdf(grid, 0, true_se), 'g-', linewidth=2,
        label=f'True:  $\\sigma_{{LR}} / \\sqrt{{n}}$ = {true_se:.3f}')

ax.set_title(f'AR(1) with ρ = {rho_demo}: Naive SE Understates Uncertainty',
             fontweight='bold', fontsize=13)
ax.set_xlabel('$\\bar{X}_n$')
ax.set_ylabel('Density')
ax.legend(fontsize=10)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"Naive SE (σ/√n):              {naive_se:.4f}")
print(f"True SE  (σ_LR/√n):           {true_se:.4f}")
print(f"Empirical SD of sample means:  {means_ar.std():.4f}")
print(f"\nThe naive SE underestimates uncertainty by a factor of "
      f"{true_se / naive_se:.2f}x.")
print("This is exactly the problem that Newey-West standard errors fix (Section 7).")

Section 4: Hypothesis Testing, t-Statistics, and p-Values — and the CLT Connection

4.1 The Framework

Hypothesis testing asks: “Could this result have arisen by chance?”

Suppose we observe a sample mean $\bar{x} = 1.0$ for a set of monthly excess returns. That sounds like a positive return — but with only 5 noisy observations, maybe the true mean is actually zero and we just got an unlucky draw. Hypothesis testing gives us a principled way to weigh the evidence.

We set up two competing hypotheses:

  • $H_0$ (null hypothesis): The effect is zero. For example, $\mu = 0$ (“the true mean excess return is zero — the asset earns no premium”).

  • $H_1$ (alternative hypothesis): The effect is nonzero. For example, $\mu \neq 0$.

The logic proceeds by proof by contradiction: we temporarily assume $H_0$ is true and ask how surprising our observed data would be under that assumption. If the data would be very surprising (low probability), we reject $H_0$. If the data are not particularly unusual under $H_0$, we have no grounds to reject it.

This is analogous to a courtroom trial: the null hypothesis is “innocent until proven guilty.” We need sufficient evidence (data) to reject this default presumption. Importantly, failing to reject $H_0$ does not prove $H_0$ is true — it merely means we lack enough evidence to overturn it.

4.2 The t-Distribution — What It Is and Where It Comes From

Before constructing our test statistic, we need to understand a distribution that plays a central role in small-sample inference: the Student’s $t$-distribution.

The problem. If $X_1, \ldots, X_n$ are i.i.d. $\mathcal{N}(\mu, \sigma^2)$ and we knew the true $\sigma$, then

$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \sim \mathcal{N}(0, 1) \quad \text{(exactly)}$$

But we almost never know $\sigma$. When we replace it with the sample standard deviation $s$, we introduce extra randomness in the denominator — $s$ itself varies from sample to sample. The resulting ratio is no longer standard normal.

The solution (Student, 1908). William Sealy Gosset (writing under the pseudonym “Student”) showed that if the underlying data are normally distributed, then

$$T = \frac{\bar{X} - \mu}{s / \sqrt{n}} \sim t_{n-1}$$

follows a $t$-distribution with $n - 1$ degrees of freedom. The parameter $n - 1$ (degrees of freedom) matches the denominator in Bessel’s correction, for the same reason: we used up one degree of freedom by estimating $\mu$ with $\bar{x}$.

Key properties of the $t$-distribution:

| Property | Description |
|----------|-------------|
| Shape | Bell-shaped and symmetric around zero, like the standard normal |
| Tails | Heavier than the normal — extreme values are more likely |
| Parameter | Single parameter: $\nu = n - 1$ degrees of freedom |
| Convergence | As $\nu \to \infty$, $t_\nu \to \mathcal{N}(0,1)$ |
| Variance | $\text{Var}(T) = \frac{\nu}{\nu - 2}$ for $\nu > 2$ (always $> 1$, so wider than $\mathcal{N}(0,1)$) |

The heavier tails reflect the extra uncertainty from estimating $\sigma$ with $s$. With small samples, $s$ can differ substantially from $\sigma$, so the $t$-distribution is noticeably wider than the normal. With large samples, $s \approx \sigma$ and the $t$-distribution is nearly indistinguishable from $\mathcal{N}(0,1)$.

For our data with $n = 5$, we have $\nu = 4$ degrees of freedom. The $t_4$ distribution has variance $4/(4-2) = 2$ — double that of the standard normal. This matters: using the normal instead of the $t$ would make us overconfident.

The plot below visualizes exactly this — notice how the $t_4$ distribution spreads more probability into the tails compared to $\mathcal{N}(0,1)$, and how larger $\nu$ brings the $t$-distribution closer to the normal.

# ============================================================================
# Visualizing the t-distribution vs the standard normal
# ============================================================================

fig, axes = plt.subplots(1, 2, figsize=(16, 5))

# --- Left panel: overlay several t-distributions against N(0,1) ---
t_grid = np.linspace(-5, 5, 500)
dfs = [2, 4, 10, 30]
colors = ['#e74c3c', '#e67e22', '#2ecc71', '#3498db']

ax = axes[0]
ax.plot(t_grid, stats.norm.pdf(t_grid), 'k-', linewidth=2.5, label='N(0, 1)')
for df_val, color in zip(dfs, colors):
    ax.plot(t_grid, stats.t.pdf(t_grid, df=df_val), '--', linewidth=1.8,
            color=color, label=f'$t_{{{df_val}}}$')

ax.set_title('Student\'s t-distribution vs Standard Normal', fontweight='bold', fontsize=13)
ax.set_xlabel('t')
ax.set_ylabel('Density')
ax.legend(fontsize=10)
ax.set_xlim(-5, 5)
ax.set_ylim(0, 0.45)
ax.grid(True, alpha=0.3)
ax.annotate('Heavier tails\n(more probability\nin extremes)',
            xy=(3.0, stats.t.pdf(3.0, df=2)), xytext=(3.5, 0.12),
            fontsize=9, ha='center',
            arrowprops=dict(arrowstyle='->', color='#e74c3c'),
            color='#e74c3c')

# --- Right panel: zoom into the right tail to highlight the difference ---
ax2 = axes[1]
tail_grid = np.linspace(1.5, 5, 300)
ax2.fill_between(tail_grid, stats.norm.pdf(tail_grid), alpha=0.3, color='black', label='N(0,1) tail')
ax2.fill_between(tail_grid, stats.t.pdf(tail_grid, df=4), alpha=0.3, color='#e67e22', label='$t_4$ tail')
ax2.plot(tail_grid, stats.norm.pdf(tail_grid), 'k-', linewidth=2)
ax2.plot(tail_grid, stats.t.pdf(tail_grid, df=4), '--', linewidth=2, color='#e67e22')

# Mark the t = 2 line and show the p-value difference
ax2.axvline(x=2.0, color='gray', linestyle=':', linewidth=1.5)
p_norm_tail = 1 - stats.norm.cdf(2.0)
p_t4_tail = 1 - stats.t.cdf(2.0, df=4)
ax2.annotate(f'P(T4 >= 2) = {p_t4_tail:.3f}', xy=(2.05, 0.055), fontsize=10, color='#e67e22', fontweight='bold')
ax2.annotate(f'P(Z  >= 2) = {p_norm_tail:.3f}', xy=(2.05, 0.04), fontsize=10, color='black', fontweight='bold')
ax2.annotate(f'Ratio: {p_t4_tail/p_norm_tail:.1f}x', xy=(2.05, 0.025), fontsize=10, color='#c0392b', fontweight='bold')

ax2.set_title('Right Tail Close-Up (v = 4 vs Normal)', fontweight='bold', fontsize=13)
ax2.set_xlabel('t')
ax2.set_ylabel('Density')
ax2.legend(fontsize=10)
ax2.set_xlim(1.5, 5)
ax2.set_ylim(0, 0.14)
ax2.grid(True, alpha=0.3)

fig.suptitle('Why the t-Distribution Matters for Small Samples',
             fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

print("""
KEY OBSERVATIONS:
- The t-distribution is bell-shaped and symmetric like the normal, but with heavier tails.
- With v = 4 (our n = 5 example), the tails are substantially heavier -- the t4 allocates
  about 2.5x more probability beyond t = 2 than the standard normal.
- As v increases (v = 10, 30, ...), the t-distribution rapidly approaches N(0, 1).
- For v >= 30, the two are nearly indistinguishable in practice.
- This is why we MUST use the t-distribution for small samples: using the normal would
  make us overconfident (p-values too small, confidence intervals too narrow).
""")
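The variance and tail-probability claims above can be checked directly with scipy (a quick verification, not a cell from the original notebook):

```python
from scipy import stats

# Verify the t-distribution properties quoted above for ν = 4 (our n = 5 case)
print(f"Var(t_4)  = {stats.t.var(df=4):.1f}")    # ν/(ν-2) = 4/2 = 2.0
print(f"Var(t_30) = {stats.t.var(df=30):.3f}")   # approaches 1 as ν grows

# Tail comparison at t = 2, matching the right-panel annotation
p_t4 = 1 - stats.t.cdf(2.0, df=4)
p_z = 1 - stats.norm.cdf(2.0)
print(f"P(T_4 ≥ 2) = {p_t4:.4f}")
print(f"P(Z   ≥ 2) = {p_z:.4f}")
print(f"Tail ratio: {p_t4 / p_z:.2f}x")
```

The ratio of roughly 2.5 is exactly the “about 2.5x more probability beyond t = 2” quoted in the observations above.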

4.3 The t-Statistic

Under $H_0: \mu = 0$, the test statistic is:

$$t = \frac{\bar{x} - 0}{\widehat{\text{SE}}(\bar{x})} = \frac{\bar{x}}{s / \sqrt{n}}$$

This measures how many standard errors the sample mean is away from zero. The intuition is simple:

  • If $\bar{x}$ is far from 0 (the null value) relative to its own uncertainty, that is strong evidence against $H_0$.

  • If $\bar{x}$ is close to 0 relative to the noise, the data are consistent with $H_0$.

The $t$-statistic standardizes the sample mean into a “signal-to-noise ratio”: the numerator is the signal ($\bar{x} - 0$) and the denominator is the noise ($s/\sqrt{n}$).

Hand calculation with our data:

$$t = \frac{1.0}{0.7071} = \frac{1.0}{\sqrt{0.5}} = \sqrt{2} \approx 1.4142$$

So our observed mean is about 1.41 standard errors above zero — a positive signal, but not an overwhelming one.

4.4 The p-Value

The p-value is the probability of observing a test statistic at least as extreme as the one we got, assuming $H_0$ is true. “At least as extreme” means at least as far from zero in either direction (for a two-sided test).

Formally, for a two-sided test with $n - 1 = 4$ degrees of freedom:

$$p = P\bigl(|T_4| \geq |t_{\text{obs}}|\bigr) = 2 \times P(T_4 \geq 1.4142)$$

where $T_4$ denotes a random variable following a $t$-distribution with 4 degrees of freedom.
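As a quick check, this two-sided p-value can be computed directly from the $t_4$ CDF. A minimal sketch using `scipy.stats` (the same library the cells below use); the variable names here are illustrative:

```python
# Two-sided p-value for the hand-computed t-statistic t_obs = sqrt(2), df = 4
import numpy as np
from scipy import stats

t_obs = np.sqrt(2)                                # ≈ 1.4142, from Section 4.3
p_two_sided = 2 * (1 - stats.t.cdf(t_obs, df=4))
print(f"p = {p_two_sided:.4f}")                   # ≈ 0.2302, well above 0.05
```

With $p \approx 0.23$, the data are consistent with $H_0$ at any conventional significance level.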

How to interpret the p-value:

  • A small p-value (e.g., $p < 0.05$) means: “Data this extreme would be very unlikely if $H_0$ were true.” This is evidence against $H_0$.

  • A large p-value means: “Data this extreme is not unusual under $H_0$.” We have no strong reason to reject $H_0$.

  • The p-value is not the probability that $H_0$ is true. It is the probability of seeing data this extreme given that $H_0$ is true — a crucial distinction.

The conventional significance level is $\alpha = 0.05$, meaning we reject $H_0$ if $p < 0.05$. This corresponds to a $|t|$ threshold of about 2 (more precisely, $t_{0.025, \nu}$ for the relevant degrees of freedom). The “rule of thumb” you often hear — “a $t$-statistic above 2 is significant” — comes from this.
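The rule of thumb can be made precise by printing the exact two-sided 5% critical values for a few degrees of freedom, a small illustration:

```python
# Two-sided 5% critical values: t_{0.025, v} shrinks toward the normal's 1.96
from scipy import stats

for df in [4, 10, 30, 100]:
    print(f"df = {df:3d}:  |t| threshold = {stats.t.ppf(0.975, df):.3f}")
print(f"normal:   |z| threshold = {stats.norm.ppf(0.975):.3f}")  # 1.960
```

With 4 degrees of freedom the threshold is 2.776, noticeably above 2; by $\nu = 30$ it is already close to the normal's 1.96.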

4.5 Why a t-Distribution Instead of the Standard Normal?

When the sample size $n$ is small and we estimate $\sigma$ with $s$, the ratio $\bar{x} / (s/\sqrt{n})$ does not exactly follow $\mathcal{N}(0,1)$. As derived in Section 4.2, it follows a $t$-distribution with $n - 1$ degrees of freedom.

Using the standard normal when you should use the $t$-distribution would understate the p-value — you’d think data is more significant than it actually is. This is because the $t$-distribution allocates more probability to the tails.

For example, $P(|Z| \geq 2) = 0.046$ under the standard normal, but $P(|T_4| \geq 2) = 0.116$ under $t_4$. That is a factor of 2.5 difference. With 4 degrees of freedom, the normal would mislead you substantially.

As $n \to \infty$, the $t_{n-1}$ distribution converges to $\mathcal{N}(0,1)$. For $n \geq 30$, the difference is already small. This convergence is related to the CLT: with enough data, estimating $\sigma$ with $s$ introduces negligible additional uncertainty, and the extra randomness in the denominator vanishes.

4.6 Confidence Intervals

A confidence interval inverts the hypothesis test. Instead of asking “Is $\mu = 0$?”, it asks: “What range of $\mu$ values is consistent with our data?”

A 95% confidence interval for $\mu$ is:

$$\bar{x} \pm t_{0.025,\, n-1} \times \frac{s}{\sqrt{n}}$$

where $t_{0.025,\, 4} \approx 2.776$ is the 97.5th percentile of the $t_4$ distribution. We use the $t$-distribution (not the normal) for the same reason as above: we are estimating $\sigma$ from a small sample.

The critical value 2.776 is notably larger than the 1.96 we would use from the standard normal. This makes the confidence interval wider, appropriately reflecting our greater uncertainty with $n = 5$.

Hand calculation:

$$\text{CI} = 1.0 \pm 2.776 \times 0.7071 = 1.0 \pm 1.963 = [-0.963, \; 2.963]$$

Since this interval contains zero, we cannot reject $H_0: \mu = 0$ at the 5% level — consistent with our p-value being above 0.05.

Interpretation. “If we repeated this experiment many times and computed a 95% CI each time, about 95% of those intervals would contain the true $\mu$.” It does not mean there is a 95% probability that $\mu$ lies in this particular interval — $\mu$ is a fixed (unknown) number, not a random variable.
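The repeated-sampling interpretation can be checked directly by simulation. A sketch under illustrative values (not from the text): true mean $\mu = 1$, $\sigma = 2$, $n = 5$, with 20,000 replications:

```python
# Repeated-sampling check: t-based 95% CIs cover the true mean ~95% of the time.
# Illustrative values: mu = 1, sigma = 2, n = 5.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu, sigma, n_sim, reps = 1.0, 2.0, 5, 20_000
t_crit = stats.t.ppf(0.975, df=n_sim - 1)         # 2.776 for df = 4

samples = rng.normal(mu, sigma, size=(reps, n_sim))
means = samples.mean(axis=1)
half_widths = t_crit * samples.std(axis=1, ddof=1) / np.sqrt(n_sim)
coverage = (np.abs(means - mu) <= half_widths).mean()
print(f"Empirical coverage: {coverage:.3f}")      # close to the nominal 0.95
```

Using the normal's 1.96 instead of 2.776 in this experiment would shrink the intervals and push coverage well below 95% — exactly the overconfidence described in Section 4.5.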

# ============================================================================
# Hand-checkable hypothesis test
# Uses: mean, se_mean, n from Sections 1–2 (cells above)
# ============================================================================

t_stat = mean / se_mean
df_t = n - 1
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df=df_t))
t_crit = stats.t.ppf(0.975, df=df_t)
ci_low = mean - t_crit * se_mean
ci_high = mean + t_crit * se_mean

print("HYPOTHESIS TEST: H₀: μ = 0  vs  H₁: μ ≠ 0")
print("=" * 55)
print(f"  x̄ = {mean}")
print(f"  SE(x̄) = {se_mean:.4f}")
print(f"  t = x̄ / SE = {mean} / {se_mean:.4f} = {t_stat:.4f}")
print(f"  Degrees of freedom = n - 1 = {df_t}")
print(f"\n  p-value (two-sided) = {p_value:.4f}")
print(f"  t-critical (α=0.05, df={df_t}) = ±{t_crit:.3f}")
print(f"  95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
print(f"\n  Decision: {'Reject H₀' if p_value < 0.05 else 'Fail to reject H₀'} at 5% level")
print(f"  (CI {'does NOT contain' if ci_low > 0 or ci_high < 0 else 'contains'} zero — consistent with p-value)")

4.7 Why the CLT Justifies Normal p-Values

Let us pause and appreciate the deep connection between the CLT and hypothesis testing. It is the intellectual heart of this entire notebook and fundamental to understanding how Newey-West works.

Above we used the t-distribution to compute the p-value. This was exact — it required that the $X_i$ are normally distributed (or that we use an exact finite-sample result). But in practice, financial returns are not normally distributed: they have fat tails, skewness, and time-varying volatility.

So why do our p-values still work?

The answer is the CLT. Here is the argument, spelled out carefully:

4.8 The Argument in Five Steps

Step 1. We want to test $H_0: \mu = 0$. Our test statistic is:

$$t = \frac{\bar{X}_n}{s / \sqrt{n}}$$

Step 2. By the CLT, for large $n$:

$$\frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \xrightarrow{d} \mathcal{N}(0,1)$$

Under $H_0$ ($\mu = 0$): $\frac{\bar{X}_n}{\sigma / \sqrt{n}} \xrightarrow{d} \mathcal{N}(0,1)$.

Step 3. By the Law of Large Numbers, $s^2 \xrightarrow{p} \sigma^2$, i.e. the sample variance converges in probability to the true variance.

Step 4. By Slutsky’s theorem, replacing $\sigma$ with $s$ in the denominator does not change the limiting distribution.

Slutsky’s Theorem. If $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{p} c$ (a nonzero constant), then $X_n / Y_n \xrightarrow{d} X / c$, and more generally $g(X_n, Y_n) \xrightarrow{d} g(X, c)$ for any continuous $g$.

Applying this with $X_n = \bar{X}_n / (\sigma/\sqrt{n}) \xrightarrow{d} \mathcal{N}(0,1)$ and $Y_n = s/\sigma \xrightarrow{p} 1$:

$$t = \frac{\bar{X}_n}{s / \sqrt{n}} = \frac{\bar{X}_n / (\sigma/\sqrt{n})}{s / \sigma} \xrightarrow{d} \frac{\mathcal{N}(0,1)}{1} = \mathcal{N}(0,1)$$

Step 5. Therefore, for large $n$, we can compute p-values using the standard normal $\mathcal{N}(0,1)$ table (or the $t_{n-1}$ distribution, which is nearly identical for large $n$).
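The five steps can be watched in action with decidedly non-normal data. A sketch, assuming exponential observations (skewness 2) with $n = 200$; the names and seed are illustrative:

```python
# Steps 1-5 in action: skewed (exponential) data, n = 200. The studentized
# mean behaves approximately like N(0,1): the two-sided 5% rejection rate
# comes out near 0.05, even though the individual tails can be mildly
# asymmetric in finite samples.
import numpy as np

rng = np.random.default_rng(42)
n_clt, reps = 200, 20_000
xs = rng.exponential(scale=1.0, size=(reps, n_clt))     # true mean is 1

tvals = (xs.mean(axis=1) - 1.0) / (xs.std(axis=1, ddof=1) / np.sqrt(n_clt))
lower = (tvals < -1.96).mean()
upper = (tvals > 1.96).mean()
print(f"P(t < -1.96) = {lower:.4f},  P(t > 1.96) = {upper:.4f}")
print(f"two-sided rejection rate = {lower + upper:.4f}  (nominal: 0.05)")
```

No normality assumption went into the data, yet the two-sided test calibrated from $\mathcal{N}(0,1)$ rejects at close to its nominal 5% rate.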

4.9 What “Nonparametric” Means Here

Notice what we did not assume in Steps 1–5:

  • We did not assume the $X_i$ are normally distributed.

  • We did not assume the $X_i$ have any particular distributional shape.

  • We only assumed they have a finite mean and variance and are (approximately) independent.

This is what makes the CLT-based inference nonparametric (more precisely, semiparametric or distribution-free): it works regardless of the population distribution, provided $n$ is large enough.

4.10 How Large is “Large Enough”?

The speed of CLT convergence depends on the distribution’s skewness and kurtosis:

| Distribution | Skewness | Kurtosis | $n$ needed for good approximation |
|---|---|---|---|
| Normal | 0 | 3 | Any $n$ (exact) |
| Uniform | 0 | 1.8 | $n \geq 10$ |
| Exponential | 2 | 9 | $n \geq 30$ |
| Highly skewed (e.g., Pareto) | Large | Very large | $n \geq 100+$ |
| Financial returns (mild fat tails) | ~0 | 4–6 | $n \geq 30$–50 |

For monthly Fama-French data with 100–200 observations, the CLT approximation is excellent. This is why finance researchers can use standard-normal-based p-values with Newey-West standard errors even though returns are not normally distributed.
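The table's pattern can be verified by simulation: the skewness of the sampling distribution of $\bar{X}_n$ decays like $\gamma/\sqrt{n}$, where $\gamma$ is the population skewness ($\gamma = 2$ for the exponential). A sketch with illustrative sample sizes:

```python
# Convergence speed: skewness of the sampling distribution of the mean falls
# like gamma / sqrt(n), so more-skewed populations need larger n.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
reps = 50_000
for n_demo in [5, 30, 100]:
    means = rng.exponential(1.0, size=(reps, n_demo)).mean(axis=1)
    print(f"n = {n_demo:3d}: skewness of x-bar = {stats.skew(means):6.3f} "
          f"(theory: {2 / np.sqrt(n_demo):.3f})")
```

By $n = 100$ the residual skewness of $\bar{X}_n$ is only about 0.2, small enough that the normal approximation is very good.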

4.11 A Subtle but Critical Point

The CLT tells us that $\bar{X}_n$ is approximately normal. This does not mean:

  • ❌ Individual observations $X_i$ are normal

  • ❌ Residuals from a regression are normal

  • ❌ The error term $\epsilon$ is normal

It does mean:

  • ✅ The sample mean $\bar{X}_n$ is approximately normal even when the underlying data is decidedly non-normal

  • ✅ Functions of averages (like regression coefficients $\hat{\beta}$, which are weighted averages of the data) are approximately normal for large $n$

  • ✅ We can use the normal distribution to compute p-values for these statistics

# ============================================================================
# Demonstration: CLT-based p-value vs exact t-distribution p-value
# As n grows, they converge — showing the CLT justification in action
# ============================================================================

print("COMPARISON: t-distribution vs standard normal p-values")
print("=" * 65)
print(f"{'n':>6}  {'t-stat':>8}  {'p (t-dist)':>12}  {'p (normal)':>12}  {'Δp':>10}")
print("-" * 65)

# For each sample size, compute p-value under both distributions
for n_demo in [5, 10, 30, 50, 100, 200, 500]:
    t_val = 2.0  # Fix t-statistic at 2.0 for comparison
    p_t = 2 * (1 - stats.t.cdf(t_val, df=n_demo - 1))
    p_z = 2 * (1 - stats.norm.cdf(t_val))
    print(f"{n_demo:6d}  {t_val:8.2f}  {p_t:12.6f}  {p_z:12.6f}  {abs(p_t - p_z):10.6f}")

print(f"\nAs n → ∞, the t-distribution p-values converge to the normal p-values.")
print(f"For n ≥ 100 (typical in Fama-French analysis), the difference is < 0.002.")
print(f"\nThis is why Newey-West uses the standard normal: with n ≈ 100-200 months,")
print(f"the CLT guarantees the t-statistic is approximately N(0,1).")

Section 5: Linear Regression and OLS

We now shift from testing a single mean to analyzing the relationship between variables. This is the core of empirical finance: the Fama-French model asks whether exposure to market, size, and value factors explains the cross-section of expected returns — a question that is fundamentally about regression.

5.1 What Is Regression?

Regression answers the question: how does one variable change, on average, when another variable changes?

Concretely, given paired observations $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, we model:

$$y_i = \alpha + \beta x_i + \epsilon_i$$

where:

  • $\alpha$ (the intercept) is the expected value of $y$ when $x = 0$,

  • $\beta$ (the slope) is the expected change in $y$ per one-unit change in $x$,

  • $\epsilon_i$ (the error or residual) captures everything about $y_i$ that is not explained by $x_i$.

In the Fama-French context, $y_i$ is the excess return of a portfolio, $x_i$ is the excess return of the market (or a factor), and $\beta$ measures the portfolio’s sensitivity (or loading) on that factor. A positive $\beta$ on the market factor means the portfolio tends to go up when the market goes up, and vice versa. The intercept $\alpha$ — often called Jensen’s alpha — measures the average return that is not explained by factor exposure. If $\alpha > 0$, the portfolio earns a premium beyond what its risk exposure would predict.

5.2 The Idea: Minimizing Squared Errors

How do we choose $\alpha$ and $\beta$? The Ordinary Least Squares (OLS) principle says: choose the line that makes the residuals $\hat{\epsilon}_i = y_i - (\hat{\alpha} + \hat{\beta} x_i)$ as small as possible, in the sense of minimizing the sum of squared residuals:

$$\text{RSS} = \sum_{i=1}^n \hat{\epsilon}_i^2 = \sum_{i=1}^n \bigl(y_i - \hat{\alpha} - \hat{\beta} x_i\bigr)^2$$

Why squared errors rather than absolute errors? Three reasons:

  1. Differentiability — Squared errors are smooth, so we can use calculus to find the minimum.

  2. Unique solution — The squared-error problem always has exactly one solution (for non-degenerate data).

  3. Connection to the CLT — Under the Gauss-Markov assumptions, the OLS estimator is the best linear unbiased estimator (BLUE), and its large-sample sampling distribution is governed by the CLT.

5.3 Deriving the OLS Formulas (Scalar Case)

To minimize $\text{RSS}$, we take partial derivatives with respect to $\alpha$ and $\beta$ and set them to zero.

Derivative with respect to $\alpha$:

$$\frac{\partial \, \text{RSS}}{\partial \alpha} = -2\sum_{i=1}^n (y_i - \alpha - \beta x_i) = 0$$

This yields the first normal equation:

$$\sum_{i=1}^n y_i = n\alpha + \beta \sum_{i=1}^n x_i \qquad \Longrightarrow \qquad \hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$$

Derivative with respect to $\beta$:

$$\frac{\partial \, \text{RSS}}{\partial \beta} = -2\sum_{i=1}^n x_i(y_i - \alpha - \beta x_i) = 0$$

Substituting $\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$ and simplifying gives the second normal equation:

$$\hat{\beta} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \frac{\text{Cov}(x, y)}{\text{Var}(x)}$$

This is an elegant result: the slope equals the sample covariance of $x$ and $y$ divided by the sample variance of $x$. This makes intuitive sense — $\hat{\beta}$ measures how much $y$ moves per unit of $x$-variability.
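The covariance-over-variance form can be cross-checked with NumPy's sample covariance matrix; a minimal sketch using the Section 1 data (variable names here are illustrative):

```python
# Cross-check: beta-hat = Cov(x, y) / Var(x) via NumPy's sample covariance
import numpy as np

x = np.array([2, -1, 3, 0, 1], dtype=float)
y = np.array([3, -2, 4, 1, 0], dtype=float)

S = np.cov(x, y, ddof=1)       # 2x2 sample covariance matrix
b_hat = S[0, 1] / S[0, 0]      # Cov(x, y) / Var(x) = 3.5 / 2.5
a_hat = y.mean() - b_hat * x.mean()
print(f"beta-hat = {b_hat:.4f}, alpha-hat = {a_hat:.4f}")  # 1.4000, -0.2000
```

The denominators $n-1$ in the sample covariance and sample variance cancel, so the `ddof` choice does not affect the slope.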

5.4 A Geometric Intuition

Think of the $n$ observed $y$-values as a point in $n$-dimensional space. The set of all vectors of the form $\hat{\alpha}\mathbf{1} + \hat{\beta}\mathbf{x}$ (where $\mathbf{1}$ is the column of ones) traces out a 2-dimensional plane within that $n$-dimensional space.

OLS finds the point on that plane that is closest to $\mathbf{y}$ — that is, the orthogonal projection of $\mathbf{y}$ onto the column space of $\mathbf{X}$. The residual vector $\hat{\boldsymbol{\epsilon}} = \mathbf{y} - \hat{\mathbf{y}}$ is perpendicular to the column space, which is precisely the condition $\mathbf{X}^T\hat{\boldsymbol{\epsilon}} = \mathbf{0}$ — the normal equations in matrix form.

This geometric picture explains several properties of OLS that can otherwise seem mysterious:

  • The residuals sum to zero (because $\hat{\boldsymbol{\epsilon}}$ is perpendicular to the constant column $\mathbf{1}$).

  • The fitted values and residuals are uncorrelated (because $\hat{\mathbf{y}}$ lies in the column space while $\hat{\boldsymbol{\epsilon}}$ is perpendicular to it).

  • $R^2$ is the squared cosine of the angle between the demeaned $\mathbf{y}$ and the demeaned $\hat{\mathbf{y}}$.
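These projection facts can be verified numerically on the Section 1 data. A small sketch (variable names illustrative): residuals sum to zero, residuals are orthogonal to the fitted values, and $R^2$ equals the squared cosine of the angle between the demeaned $\mathbf{y}$ and $\hat{\mathbf{y}}$:

```python
# Numerical check of the projection properties of OLS
import numpy as np

x = np.array([2, -1, 3, 0, 1], dtype=float)
y = np.array([3, -2, 4, 1, 0], dtype=float)

beta = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
alpha = y.mean() - beta * x.mean()
y_hat = alpha + beta * x
resid = y - y_hat

print(f"sum of residuals: {resid.sum():+.2e}")     # ≈ 0
print(f"resid . fitted:   {resid @ y_hat:+.2e}")   # ≈ 0 (orthogonality)

yc, yhc = y - y.mean(), y_hat - y_hat.mean()
cos2 = (yc @ yhc)**2 / ((yc @ yc) * (yhc @ yhc))   # squared cosine of the angle
R2 = 1 - (resid @ resid) / (yc @ yc)
print(f"R^2 = {R2:.4f},  cos^2(angle) = {cos2:.4f}")
```

Both quantities come out equal (about 0.86 for this dataset), confirming the geometric reading of $R^2$.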

5.5 Hand Calculation with Our Data

Using the dataset from Section 1: $x = [2, -1, 3, 0, 1]$ and $y = [3, -2, 4, 1, 0]$.

From Section 1 we already computed: $\bar{x} = 1.0$, $\bar{y} = 1.2$.

Step 1: Compute the deviations.

| $i$ | $x_i$ | $y_i$ | $x_i - \bar{x}$ | $y_i - \bar{y}$ | $(x_i - \bar{x})(y_i - \bar{y})$ | $(x_i - \bar{x})^2$ |
|---|---|---|---|---|---|---|
| 1 | 2 | 3 | 1.0 | 1.8 | 1.80 | 1.00 |
| 2 | −1 | −2 | −2.0 | −3.2 | 6.40 | 4.00 |
| 3 | 3 | 4 | 2.0 | 2.8 | 5.60 | 4.00 |
| 4 | 0 | 1 | −1.0 | −0.2 | 0.20 | 1.00 |
| 5 | 1 | 0 | 0.0 | −1.2 | 0.00 | 0.00 |
| | | | | Sum: | 14.0 | 10.0 |

Step 2: Compute $\hat{\beta}$ and $\hat{\alpha}$.

$$\hat{\beta} = \frac{14.0}{10.0} = 1.4$$

$$\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x} = 1.2 - 1.4(1.0) = -0.2$$

So the fitted line is $\hat{y} = -0.2 + 1.4x$.

Interpretation: Each additional unit of $x$ is associated with a 1.4-unit increase in $y$. The intercept −0.2 is the predicted value of $y$ when $x = 0$.

5.6 Visualizing the Fit

The code below plots the data and the fitted regression line, and marks the residuals as vertical distances from each point to the line — making concrete the idea that OLS minimizes the sum of these squared vertical distances.

# ============================================================================
# Section 5: Linear Regression and OLS — Hand calculation & visualization
# ============================================================================

# Regression data — same x, y from Section 1, renamed to avoid shadowing
# the single-variable analysis (mean, std, se_mean used in Section 4).
x_reg = np.array([2, -1, 3, 0, 1], dtype=float)
y_reg = np.array([3, -2, 4, 1, 0], dtype=float)
n_reg = len(x_reg)

# --- Step 1: Compute β̂ and α̂ using the scalar formulas ---
x_bar = x_reg.mean()
y_bar = y_reg.mean()

numerator = np.sum((x_reg - x_bar) * (y_reg - y_bar))   # Cov(x,y) * (n-1)
denominator = np.sum((x_reg - x_bar) ** 2)                # Var(x) * (n-1)

beta_hat = numerator / denominator
alpha_hat = y_bar - beta_hat * x_bar

print("SCALAR OLS — HAND VERIFICATION")
print("=" * 55)
print(f"  x̄ = {x_bar},  ȳ = {y_bar}")
print(f"  Σ(xᵢ - x̄)(yᵢ - ȳ) = {numerator}")
print(f"  Σ(xᵢ - x̄)²         = {denominator}")
print(f"\n  β̂ = {numerator}/{denominator} = {beta_hat}")
print(f"  α̂ = {y_bar} − {beta_hat}×{x_bar} = {alpha_hat}")

# --- Step 2: Fitted values and residuals ---
y_hat = alpha_hat + beta_hat * x_reg
residuals = y_reg - y_hat
RSS = np.sum(residuals ** 2)

print(f"\n  Fitted values ŷ: {y_hat}")
print(f"  Residuals ε̂:     {residuals}")
print(f"  Σε̂ᵢ = {residuals.sum():.10f}  (should ≈ 0)")
print(f"  RSS = Σε̂ᵢ² = {RSS:.2f}")

# --- Step 3: R² ---
SS_tot = np.sum((y_reg - y_bar) ** 2)
R_squared = 1 - RSS / SS_tot
print(f"\n  SS_tot = {SS_tot:.2f}")
print(f"  R² = 1 − RSS/SS_tot = 1 − {RSS:.2f}/{SS_tot:.2f} = {R_squared:.4f}")
print(f"  → {R_squared*100:.1f}% of the variation in y is explained by x.")

# ============================================================================
# Visualization: data, fitted line, and residuals
# ============================================================================

fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# --- Left panel: scatter + fitted line + residuals ---
ax = axes[0]
x_grid = np.linspace(-2, 4, 100)
y_grid = alpha_hat + beta_hat * x_grid

ax.plot(x_grid, y_grid, 'r-', linewidth=2, label=f'ŷ = {alpha_hat:.1f} + {beta_hat:.1f}x')

# Draw residuals as vertical lines
for xi, yi, yhi in zip(x_reg, y_reg, y_hat):
    ax.plot([xi, xi], [yi, yhi], 'b--', linewidth=1, alpha=0.7)

ax.scatter(x_reg, y_reg, s=80, c='steelblue', edgecolors='navy', zorder=5, label='Data')
ax.scatter(x_reg, y_hat, s=40, c='red', marker='x', zorder=5, label='Fitted values')

ax.set_xlabel('x', fontsize=12)
ax.set_ylabel('y', fontsize=12)
ax.set_title('OLS Regression: Data, Fitted Line, and Residuals', fontweight='bold')
ax.legend(fontsize=10)
ax.grid(True, alpha=0.3)

# Annotate residuals
for i, (xi, yi, ei) in enumerate(zip(x_reg, y_reg, residuals)):
    if abs(ei) > 0.05:
        ax.annotate(f'ε̂={ei:.1f}', xy=(xi, (yi + y_hat[i])/2),
                    fontsize=8, color='blue', ha='left',
                    xytext=(5, 0), textcoords='offset points')

# --- Right panel: residuals vs fitted values ---
ax2 = axes[1]
ax2.scatter(y_hat, residuals, s=80, c='steelblue', edgecolors='navy', zorder=5)
ax2.axhline(y=0, color='red', linestyle='-', linewidth=1.5)
ax2.set_xlabel('Fitted values ŷ', fontsize=12)
ax2.set_ylabel('Residuals ε̂', fontsize=12)
ax2.set_title('Residuals vs Fitted Values', fontweight='bold')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("""
KEY OBSERVATIONS:
• The fitted line passes through (x̄, ȳ) = (1.0, 1.2) — always true for OLS with an intercept.
• Residuals sum to zero — a consequence of including an intercept.
• The residual plot (right) shows no obvious pattern — consistent with a 
  reasonable linear fit (though with only 5 points, diagnostics are limited).
• R² = {:.1f}%: most of the variation in y is captured by the linear relationship.
""".format(R_squared * 100))

5.7 The Matrix Formulation

The scalar formulas from Section 5.3 work for simple regression (one regressor). But the Fama-French model has three regressors (market, size, value) — and in general we need to handle $k$ regressors simultaneously. Matrix algebra provides a compact, general framework that works for any number of regressors. It also sets up the variance formulas we will need for robust standard errors in subsequent sections.

The model in matrix form. Stack all $n$ observations into a single equation:

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$$

where:

  • $\mathbf{y}$ is the $n \times 1$ vector of outcomes,

  • $\mathbf{X}$ is the $n \times k$ design matrix — its first column is a column of ones (the intercept), and the remaining columns are the regressors,

  • $\boldsymbol{\beta}$ is the $k \times 1$ vector of parameters $(\alpha, \beta_1, \ldots, \beta_{k-1})^T$,

  • $\boldsymbol{\epsilon}$ is the $n \times 1$ vector of errors.

For our 5-observation simple regression, $k = 2$ (intercept + one slope):

$$\mathbf{X} = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}, \qquad \boldsymbol{\beta} = \begin{pmatrix} \alpha \\ \beta \end{pmatrix}$$

Deriving the OLS estimator. The residual sum of squares in matrix form is:

$$\text{RSS}(\boldsymbol{\beta}) = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$$

Expanding:

$$\text{RSS}(\boldsymbol{\beta}) = \mathbf{y}^T\mathbf{y} - 2\boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y} + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}$$

To minimize, we take the matrix derivative (gradient) and set it to zero. The key matrix calculus identities are:

  • $\frac{\partial}{\partial \boldsymbol{\beta}}(\boldsymbol{\beta}^T\mathbf{a}) = \mathbf{a}$, for any constant vector $\mathbf{a}$;

  • $\frac{\partial}{\partial \boldsymbol{\beta}}(\boldsymbol{\beta}^T\mathbf{A}\boldsymbol{\beta}) = 2\mathbf{A}\boldsymbol{\beta}$, for any symmetric matrix $\mathbf{A}$.

Applying these:

$$\frac{\partial \, \text{RSS}}{\partial \boldsymbol{\beta}} = -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} = \mathbf{0}$$

Rearranging gives the normal equations:

$$\boxed{\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^T\mathbf{y}}$$

This is a system of $k$ linear equations in $k$ unknowns. If $\mathbf{X}^T\mathbf{X}$ is invertible (which requires that no regressor is a perfect linear combination of the others), we can solve directly:

$$\boxed{\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}}$$

Interpreting the components. Each piece of this formula has a clear meaning:

| Expression | Dimension | Meaning |
|---|---|---|
| $\mathbf{X}^T\mathbf{X}$ | $k \times k$ | Information matrix — measures the total “variation” in the regressors. Larger values (more spread in $x$) mean more precise estimates. |
| $\mathbf{X}^T\mathbf{y}$ | $k \times 1$ | Cross-moment vector — measures how each regressor covaries with the outcome. |
| $(\mathbf{X}^T\mathbf{X})^{-1}$ | $k \times k$ | Inverts the information matrix, converting raw covariation into properly scaled regression coefficients. |

The formula says: $\hat{\boldsymbol{\beta}}$ equals the covariation of regressors with outcomes, scaled by the inverse of the regressors’ own variation. This is the multivariate generalization of $\hat{\beta} = \text{Cov}(x,y) / \text{Var}(x)$.


Connection to projection (Section 5.4 revisited). The normal equations can be rewritten as:

$$\mathbf{X}^T(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \mathbf{0} \qquad \Longleftrightarrow \qquad \mathbf{X}^T\hat{\boldsymbol{\epsilon}} = \mathbf{0}$$

This says the residual vector is orthogonal to every column of $\mathbf{X}$ — exactly the geometric projection condition from Section 5.4. The fitted values are the orthogonal projection of $\mathbf{y}$ onto the column space of $\mathbf{X}$:

$$\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = \mathbf{H}\mathbf{y}$$

where $\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$ is the hat matrix (or projection matrix). It “puts the hat on $\mathbf{y}$.” The hat matrix is symmetric ($\mathbf{H}^T = \mathbf{H}$) and idempotent ($\mathbf{H}^2 = \mathbf{H}$) — properties that projections always have.
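The hat matrix properties are easy to confirm numerically on the 5-observation design matrix; a quick sketch:

```python
# Checking H = H^T and H^2 = H for the 5-observation design matrix
import numpy as np

x = np.array([2, -1, 3, 0, 1], dtype=float)
X = np.column_stack([np.ones(5), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T

print("symmetric: ", np.allclose(H, H.T))
print("idempotent:", np.allclose(H @ H, H))
print(f"trace(H) = {np.trace(H):.4f}")   # equals k = 2, the number of columns of X
```

The trace of a projection matrix equals the dimension of the space it projects onto, so trace($\mathbf{H}$) $= k$, the number of regressors including the intercept.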



5.8 From Coefficients to Standard Errors: Deriving $\text{Var}(\hat{\boldsymbol{\beta}})$

Computing $\hat{\boldsymbol{\beta}}$ is only half the job. To do inference (test hypotheses, build confidence intervals), we need to know how uncertain our estimates are. This requires deriving the variance of the estimator.

Step 1 — Express $\hat{\boldsymbol{\beta}}$ in terms of the true errors. Substituting $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$ into the OLS formula:

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}) = \boldsymbol{\beta} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\epsilon}$$

This is a fundamental decomposition: $\hat{\boldsymbol{\beta}}$ equals the truth $\boldsymbol{\beta}$ plus a noise term $(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\epsilon}$ that depends on the errors. Two immediate consequences:

  1. Unbiasedness: If $E[\boldsymbol{\epsilon} \mid \mathbf{X}] = \mathbf{0}$, then $E[\hat{\boldsymbol{\beta}} \mid \mathbf{X}] = \boldsymbol{\beta}$.

  2. Consistency: As $n \to \infty$, the noise term vanishes (under regularity conditions), so $\hat{\boldsymbol{\beta}} \xrightarrow{p} \boldsymbol{\beta}$.
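Unbiasedness can be illustrated with a small Monte Carlo sketch: hold $\mathbf{X}$ fixed, draw fresh errors with $E[\boldsymbol{\epsilon} \mid \mathbf{X}] = \mathbf{0}$ many times, and average the resulting estimates. The "true" coefficients below are illustrative (they reuse the Section 5.5 estimates as the truth):

```python
# Monte Carlo sketch of unbiasedness: E[beta-hat | X] = beta
import numpy as np

rng = np.random.default_rng(7)
x = np.array([2, -1, 3, 0, 1], dtype=float)
X = np.column_stack([np.ones(5), x])
beta_true = np.array([-0.2, 1.4])          # illustrative "truth"

A = np.linalg.inv(X.T @ X) @ X.T           # the noise weights (X'X)^{-1} X'
reps = 20_000
Y = X @ beta_true + rng.normal(0.0, 1.0, size=(reps, 5))  # fresh errors, X fixed
betas = Y @ A.T                            # OLS estimate for every replication

print("average beta-hat:", betas.mean(axis=0).round(3))
print("true beta:       ", beta_true)
```

Each individual $\hat{\boldsymbol{\beta}}$ is noisy, but the average across replications sits on top of the truth, exactly as the decomposition predicts.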

Step 2 — The variance-covariance matrix: what it is and why we need it. For a scalar random variable $Z$, its variance is $\text{Var}(Z) = E[(Z - E[Z])^2]$ — a single number. But $\hat{\boldsymbol{\beta}}$ is a vector of $k$ random variables $(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_{k-1})^T$. We need a matrix that captures the variance of each component and the covariances between components. This is the variance-covariance matrix, defined as:

$$\text{Var}(\hat{\boldsymbol{\beta}}) \;\equiv\; E\bigl[(\hat{\boldsymbol{\beta}} - E[\hat{\boldsymbol{\beta}}])(\hat{\boldsymbol{\beta}} - E[\hat{\boldsymbol{\beta}}])^T\bigr]$$

This is a $k \times k$ matrix. Its structure is:

$$\text{Var}(\hat{\boldsymbol{\beta}}) = \begin{pmatrix} \text{Var}(\hat{\beta}_0) & \text{Cov}(\hat{\beta}_0, \hat{\beta}_1) & \cdots & \text{Cov}(\hat{\beta}_0, \hat{\beta}_{k-1}) \\ \text{Cov}(\hat{\beta}_1, \hat{\beta}_0) & \text{Var}(\hat{\beta}_1) & \cdots & \text{Cov}(\hat{\beta}_1, \hat{\beta}_{k-1}) \\ \vdots & \vdots & \ddots & \vdots \\ \text{Cov}(\hat{\beta}_{k-1}, \hat{\beta}_0) & \text{Cov}(\hat{\beta}_{k-1}, \hat{\beta}_1) & \cdots & \text{Var}(\hat{\beta}_{k-1}) \end{pmatrix}$$

  • The diagonal entries $[\text{Var}(\hat{\boldsymbol{\beta}})]_{jj} = \text{Var}(\hat{\beta}_j)$ are the variances — the standard errors come directly from these: $\text{SE}(\hat{\beta}_j) = \sqrt{\text{Var}(\hat{\beta}_j)}$.

  • The off-diagonal entries are covariances between different coefficient estimates — these matter when testing joint hypotheses (e.g., are multiple coefficients simultaneously zero?).

“Conditional on $\mathbf{X}$” means we treat the regressor matrix as fixed (non-random) and ask: over repeated samples of $\boldsymbol{\epsilon}$ drawn from the same error distribution, how much would $\hat{\boldsymbol{\beta}}$ vary? This is the standard frequentist perspective — the randomness comes only from $\boldsymbol{\epsilon}$, not from $\mathbf{X}$.

Step 3 — Derive the formula. From Step 1, we know $\hat{\boldsymbol{\beta}} - \boldsymbol{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\epsilon}$. Since $E[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}$ (unbiasedness), we can substitute directly into the definition:

$$\text{Var}(\hat{\boldsymbol{\beta}}) = E\bigl[(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})^T\bigr] = E\Bigl[\bigl((\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\epsilon}\bigr)\bigl((\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\epsilon}\bigr)^T\Bigr]$$

Using the transpose rule $(\mathbf{A}\mathbf{b})^T = \mathbf{b}^T\mathbf{A}^T$:

$$= E\Bigl[(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\;\boldsymbol{\epsilon}\boldsymbol{\epsilon}^T\;\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\Bigr]$$

Since $\mathbf{X}$ is treated as fixed, only $\boldsymbol{\epsilon}\boldsymbol{\epsilon}^T$ is random, so the expectation passes through:

$$= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \; \underbrace{E[\boldsymbol{\epsilon}\boldsymbol{\epsilon}^T]}_{\boldsymbol{\Omega}} \; \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}$$

Here $\boldsymbol{\Omega} = E[\boldsymbol{\epsilon}\boldsymbol{\epsilon}^T]$ is the $n \times n$ variance-covariance matrix of the errors — its $(i,j)$ entry is $E[\epsilon_i \epsilon_j]$. The diagonal entries are the individual error variances $E[\epsilon_i^2]$, and the off-diagonals capture any correlation between errors at different observations.

This gives us the sandwich formula — so named for its bread-meat-bread structure:

$$\boxed{\text{Var}(\hat{\boldsymbol{\beta}}) = \underbrace{(\mathbf{X}^T\mathbf{X})^{-1}}_{\text{bread}} \underbrace{\mathbf{X}^T\boldsymbol{\Omega}\mathbf{X}}_{\text{meat}} \underbrace{(\mathbf{X}^T\mathbf{X})^{-1}}_{\text{bread}}}$$

The bread comes from the OLS mechanics and is the same no matter what the errors look like. The meat $\mathbf{X}^T\boldsymbol{\Omega}\mathbf{X}$ captures the actual error structure — their variances, their correlations, everything.

5.9 Classic OLS: The Homoscedastic Special Case

Under the classic assumptions — errors are homoscedastic and uncorrelated:

$$\boldsymbol{\Omega} = E[\boldsymbol{\epsilon}\boldsymbol{\epsilon}^T] = \sigma^2\mathbf{I}_n$$

the meat simplifies dramatically:

$$\mathbf{X}^T\boldsymbol{\Omega}\mathbf{X} = \sigma^2\mathbf{X}^T\mathbf{I}_n\mathbf{X} = \sigma^2\mathbf{X}^T\mathbf{X}$$

Substituting into the sandwich formula:

$$\text{Var}(\hat{\boldsymbol{\beta}}) = (\mathbf{X}^T\mathbf{X})^{-1}\sigma^2\mathbf{X}^T\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} = \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}$$

$$\boxed{\text{Var}_{\text{classic}}(\hat{\boldsymbol{\beta}}) = \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}}$$

This is the classic OLS variance formula. The cancellation occurs because when $\boldsymbol{\Omega} \propto \mathbf{I}$, the bread and meat share the same $\mathbf{X}^T\mathbf{X}$ structure, and one copy cancels. But when $\boldsymbol{\Omega} \neq \sigma^2\mathbf{I}$ — for instance, if errors are heteroscedastic or autocorrelated — this cancellation does not occur, and the classic formula gives wrong standard errors. This is exactly the problem that robust methods (Section 6) will fix.
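The cancellation is easy to confirm numerically: plugging $\boldsymbol{\Omega} = \sigma^2\mathbf{I}$ into the sandwich reproduces $\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}$ exactly. A small sketch using the 5-observation design matrix (the $\sigma^2$ value is illustrative):

```python
# With Omega = sigma^2 * I, the sandwich collapses to the classic formula
import numpy as np

x = np.array([2, -1, 3, 0, 1], dtype=float)
X = np.column_stack([np.ones(5), x])
sigma2 = 1.0667                        # illustrative error variance

bread = np.linalg.inv(X.T @ X)
Omega = sigma2 * np.eye(5)             # homoscedastic, uncorrelated errors
sandwich = bread @ (X.T @ Omega @ X) @ bread
classic = sigma2 * bread

print("sandwich equals classic:", np.allclose(sandwich, classic))
```

Replacing `Omega` with any non-scalar diagonal (heteroscedasticity) or non-diagonal (autocorrelation) matrix breaks the equality, which is why the full sandwich is needed in those cases.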

Estimating $\sigma^2$. Since $\sigma^2$ is unknown, we estimate it from residuals:

$$\hat{\sigma}^2 = \frac{\hat{\boldsymbol{\epsilon}}^T\hat{\boldsymbol{\epsilon}}}{n - k} = \frac{\text{RSS}}{n - k}$$

We divide by $n - k$ (not $n$) because the OLS fit “uses up” $k$ degrees of freedom — same logic as dividing by $n - 1$ for sample variance. The standard errors are then:

$$\text{SE}(\hat{\beta}_j) = \sqrt{\hat{\sigma}^2 \bigl[(\mathbf{X}^T\mathbf{X})^{-1}\bigr]_{jj}}$$

5.10 Numerical Verification

Applying the matrix formulas to our 5-observation example confirms the scalar results from Section 5.5: $\hat{\alpha} = -0.2$, $\hat{\beta} = 1.4$, $\text{RSS} = 3.20$, $\hat{\sigma}^2 = 1.0667$, and $\text{SE}(\hat{\beta}) = 0.3266$. The code below carries out the full matrix computation and cross-checks against statsmodels.

# ============================================================================
# Verify the matrix OLS formulas from Sections 5.7–5.10
# ============================================================================

# Our data
x_reg = np.array([2, -1, 3, 0, 1], dtype=float)
y_reg = np.array([3, -2, 4, 1, 0], dtype=float)
n_reg = len(x_reg)
k_reg = 2  # intercept + one slope

# Design matrix  (n × k)
X_mat = np.column_stack([np.ones(n_reg), x_reg])

# OLS via the normal equations:  β̂ = (X'X)⁻¹ X'y
XtX     = X_mat.T @ X_mat
XtX_inv = np.linalg.inv(XtX)
beta_hat = XtX_inv @ (X_mat.T @ y_reg)

# Residuals and RSS
y_hat = X_mat @ beta_hat
resid = y_reg - y_hat
RSS   = resid @ resid

# Classic variance estimate:  Var(β̂) = σ̂²(X'X)⁻¹
sigma2_hat = RSS / (n_reg - k_reg)
var_beta   = sigma2_hat * XtX_inv
se_classic = np.sqrt(np.diag(var_beta))

# t-statistics and p-values
t_stats = beta_hat / se_classic
p_vals  = 2 * (1 - stats.t.cdf(np.abs(t_stats), df=n_reg - k_reg))

# Verify orthogonality condition:  X'ε̂ = 0  (the normal equations)
ortho_check = X_mat.T @ resid

print("MATRIX OLS — NUMERICAL VERIFICATION")
print("=" * 55)
print(f"  β̂ = (X'X)⁻¹X'y = [{beta_hat[0]:.1f}, {beta_hat[1]:.1f}]")
print(f"  α̂ = {beta_hat[0]:.1f},  β̂ = {beta_hat[1]:.1f}")
print(f"\n  X'ε̂ = {ortho_check}  (orthogonality check — should be ≈ 0)")
print(f"  Σε̂ = {resid.sum():.10f}  (residuals sum to zero ✓)")
print(f"  RSS = {RSS:.2f}")
print(f"\n  σ̂² = RSS/(n−k) = {RSS:.2f}/{n_reg - k_reg} = {sigma2_hat:.4f}")
print(f"\n  Var(β̂) = σ̂²·(X'X)⁻¹ =")
print(f"    {var_beta}")
print(f"\n  SE(α̂) = {se_classic[0]:.4f},   SE(β̂) = {se_classic[1]:.4f}")
print(f"  t(α̂)  = {t_stats[0]:.3f}  (p = {p_vals[0]:.4f})")
print(f"  t(β̂)  = {t_stats[1]:.3f}  (p = {p_vals[1]:.4f})")

# Cross-check with statsmodels
model_check = sm.OLS(y_reg, X_mat).fit()
print(f"\n  statsmodels match — coefficients: {np.allclose(beta_hat, model_check.params)}")
print(f"  statsmodels match — std errors:   {np.allclose(se_classic, model_check.bse)}")

Section 6: Nonparametric Robust Inference

6.1 Why Variance Estimation Is the Key Problem

In Section 4 we showed that the CLT justifies using normal p-values for hypothesis tests — provided we plug the right variance into the denominator of the t-statistic. In Section 3 we saw that for weakly dependent time series (the Mixing CLT, Section 3.5), the relevant variance is the long-run variance σLR2=k=γk\sigma_{\text{LR}}^2 = \sum_{k=-\infty}^{\infty} \gamma_k, not simply σ2\sigma^2.

Here is the crux of the problem:

The naive sample variance $s^2$ converges to $\gamma_0 = \text{Var}(X_t)$, not to $\sigma_{\text{LR}}^2$. So if we use $s^2$ in the denominator, Slutsky’s theorem plugs in the wrong variance, and the resulting test has incorrect size — the $t$-statistic is typically too large, leading to false rejections.

The fix: Replace s2s^2 with a HAC (heteroscedasticity and autocorrelation consistent) estimator — specifically, the Newey-West estimator — which consistently estimates σLR2\sigma_{\text{LR}}^2 by including weighted sample autocovariances:

σ^LR2=γ^0+2k=1Lwkγ^k\hat{\sigma}_{\text{LR}}^2 = \hat{\gamma}_0 + 2\sum_{k=1}^{L} w_k \,\hat{\gamma}_k

With this correction, Slutsky’s theorem gives, under $H_0: \mu = 0$, $t = \bar{X}_n / (\hat{\sigma}_{\text{LR}} / \sqrt{n}) \xrightarrow{d} \mathcal{N}(0,1)$.

The punchline: The CLT handles non-normality and dependence — it guarantees asymptotic normality. The variance estimator handles the denominator — it ensures we plug the right number in. Getting the variance right is the entire game, and it is exactly what the robust methods in this and the following sections are designed to do.
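This mismatch is easy to see numerically. The sketch below (the helper `long_run_variance` is our own illustration, not a library function) estimates both the naive variance and the Newey-West long-run variance on simulated AR(1) data with $\rho = 0.5$, where the true ratio is $\sigma_{\text{LR}}^2/\gamma_0 = (1+\rho)/(1-\rho) = 3$; with a finite bandwidth the Bartlett-weighted estimate lands a bit below that.

```python
import numpy as np

def long_run_variance(x, L):
    # Newey-West long-run variance:
    # sigma2_LR_hat = gamma0_hat + 2 * sum_{k=1}^{L} (1 - k/(L+1)) * gammak_hat
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    sigma2 = xc @ xc / n                       # gamma0_hat
    for k in range(1, L + 1):
        gamma_k = xc[k:] @ xc[:-k] / n         # lag-k sample autocovariance
        sigma2 += 2 * (1 - k / (L + 1)) * gamma_k
    return sigma2

# Simulate an AR(1) series with rho = 0.5
rng = np.random.default_rng(0)
n, rho = 5000, 0.5
innov = rng.standard_normal(n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = rho * x[t - 1] + innov[t]

naive = x.var(ddof=1)               # converges to gamma_0
hac = long_run_variance(x, L=20)    # converges to sigma2_LR
print(f"naive s² = {naive:.3f}, HAC σ̂²_LR = {hac:.3f}, ratio = {hac/naive:.2f}")
```

The ratio comes out near 3, confirming that $s^2$ understates the variance of the sample mean by roughly a factor of three for this process.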

6.2 What Does “Nonparametric” Mean in This Context?

In the classic OLS framework, we assume:

ϵiiidN(0,σ2)\epsilon_i \overset{iid}{\sim} \mathcal{N}(0, \sigma^2)

This is a parametric assumption — it specifies the exact distribution of the errors.

A nonparametric (or semiparametric) approach to inference relaxes this assumption. Instead of assuming errors are normal with constant variance, we say:

“We make no assumption about the distribution of ϵi\epsilon_i. We only require that the sample is large enough for the CLT to apply.”

The OLS coefficient estimates β^=(XX)1Xy\hat{\beta} = (X'X)^{-1}X'y remain valid (unbiased and consistent) under either approach. What changes is how we estimate the uncertainty — i.e., the standard errors.

6.3 Why Classic Standard Errors Can Fail

Recall the classic OLS covariance formula:

Var^classic(β^)=σ^2(XTX)1\widehat{\text{Var}}_{\text{classic}}(\hat{\boldsymbol{\beta}}) = \hat{\sigma}^2 (\mathbf{X}^T\mathbf{X})^{-1}

This assumes two things simultaneously:

  1. Homoscedasticity: Var(ϵi)=σ2\text{Var}(\epsilon_i) = \sigma^2 for all ii (constant variance)

  2. No autocorrelation: Cov(ϵi,ϵj)=0\text{Cov}(\epsilon_i, \epsilon_j) = 0 for iji \neq j

If either assumption fails:

  • The coefficient estimates β^\hat{\beta} are still consistent (the CLT still applies to them)

  • But the standard errors are wrong — typically too small

  • Which makes t-statistics too large and p-values too small

  • Leading to false discoveries (rejecting H0H_0 when you shouldn’t)

6.3½ Seeing the Problem: A Quick Simulation

Before deriving the fix, let’s see the problem in action. The cell below generates data with heteroscedastic errors and checks whether classic 95% confidence intervals actually achieve 95% coverage. If classic SEs are too small, the intervals will be too narrow — and coverage will fall below 95%.

# ============================================================================
# Quick demo: classic SEs understate uncertainty under heteroscedasticity
# ============================================================================

np.random.seed(99)
n_motiv = 100
num_sims = 2000
beta_true_m = 1.0
cover_classic = 0
cover_hc1 = 0

for _ in range(num_sims):
    x_m = np.random.randn(n_motiv)
    # Heteroscedastic errors: variance proportional to x²
    eps_m = np.random.randn(n_motiv) * (0.5 + 2.0 * np.abs(x_m))
    y_m = beta_true_m * x_m + eps_m
    
    X_m = sm.add_constant(x_m)
    res_m = sm.OLS(y_m, X_m).fit()
    res_hc1 = res_m.get_robustcov_results(cov_type='HC1')
    
    # Classic 95% CI for beta
    ci_c = res_m.conf_int(alpha=0.05)[1]
    if ci_c[0] <= beta_true_m <= ci_c[1]:
        cover_classic += 1
    
    # HC1 95% CI for beta
    ci_h = res_hc1.conf_int(alpha=0.05)[1]
    if ci_h[0] <= beta_true_m <= ci_h[1]:
        cover_hc1 += 1

print("COVERAGE OF 95% CONFIDENCE INTERVALS (2000 simulations)")
print("=" * 55)
print(f"  Data:  n={n_motiv}, heteroscedastic errors (σᵢ ∝ |xᵢ|)")
print(f"  True β = {beta_true_m}")
print(f"")
print(f"  Classic OLS coverage: {cover_classic/num_sims*100:.1f}%  (should be 95%)")
print(f"  HC1 robust coverage:  {cover_hc1/num_sims*100:.1f}%  (should be 95%)")
print(f"")
if cover_classic / num_sims < 0.93:
    print(f"  ⚠ Classic SEs give UNDER-COVERAGE: the intervals are too narrow.")
    print(f"    You'd think you have 95% confidence, but you really have ~{cover_classic/num_sims*100:.0f}%.")
print(f"  ✓ HC1 robust SEs restore correct coverage.")
print(f"\nThis motivates the sandwich formula derived next.")

6.4 The True Variance Formula

Without the homoscedasticity/independence assumptions, the exact variance of the OLS estimator is:

Var(β^)=(XTX)1(XTΩX)(XTX)1\text{Var}(\hat{\boldsymbol{\beta}}) = (\mathbf{X}^T\mathbf{X})^{-1} \left(\mathbf{X}^T \boldsymbol{\Omega} \mathbf{X}\right) (\mathbf{X}^T\mathbf{X})^{-1}

where Ω=E[ϵϵT]\boldsymbol{\Omega} = E[\boldsymbol{\epsilon}\boldsymbol{\epsilon}^T] is the true covariance matrix of the error vector.

This is called the sandwich formula — and the name is worth remembering, because it encodes the entire logic of robust inference:

  • The two outer copies of (XTX)1(\mathbf{X}^T\mathbf{X})^{-1} are the bread. They come from the OLS mechanics and are the same whether you use classic or robust standard errors.

  • The inner term XTΩX\mathbf{X}^T \boldsymbol{\Omega} \mathbf{X} is the meat. It captures what the errors actually look like — their variances, their correlations across time, their entire dependence structure. Classic OLS replaces the meat with the crude approximation σ2XTX\sigma^2 \mathbf{X}^T\mathbf{X} (assuming all errors are identical and independent). The sandwich estimator instead lets the residuals speak for themselves.

Every robust SE method you encounter — HC0, HC1, HC3, Newey-West — is just a different recipe for the meat. The bread never changes.

Under homoscedasticity, Ω=σ2I\boldsymbol{\Omega} = \sigma^2 \mathbf{I}, and the meat simplifies to σ2XTX\sigma^2 \mathbf{X}^T\mathbf{X}, recovering the classic formula. The sandwich formula is the general case; classic OLS is a special case.
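As a minimal sketch (toy design matrix and error variances, chosen only for illustration), the following verifies both claims: with $\boldsymbol{\Omega} = \sigma^2\mathbf{I}$ the sandwich collapses to $\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}$, and with a heteroscedastic $\boldsymbol{\Omega}$ it does not.

```python
import numpy as np

# Toy design matrix: n = 4 observations, intercept + one slope (values illustrative)
X = np.column_stack([np.ones(4), np.array([1.0, 2.0, 3.0, 4.0])])
bread = np.linalg.inv(X.T @ X)

def sandwich(X, Omega, bread):
    # Var(beta_hat) = (X'X)^{-1} (X' Omega X) (X'X)^{-1}
    return bread @ (X.T @ Omega @ X) @ bread

# Homoscedastic Omega = sigma^2 I: the meat cancels one slice of bread
sigma2 = 2.0
V_classic = sandwich(X, sigma2 * np.eye(4), bread)
print(np.allclose(V_classic, sigma2 * bread))   # True — the classic formula

# Heteroscedastic Omega: the cancellation no longer happens
Omega_het = np.diag([0.5, 1.0, 2.0, 4.0])
V_het = sandwich(X, Omega_het, bread)
print(np.allclose(V_het, np.trace(Omega_het) / 4 * bread))  # False — no single sigma² works
```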

6.5 HC0 (White’s Heteroscedasticity-Consistent Estimator)

White (1980) proposed replacing the unknown Ω\boldsymbol{\Omega} with the diagonal matrix of squared residuals:

Ω^HC0=diag(ϵ^12,  ϵ^22,  ,  ϵ^n2)\hat{\boldsymbol{\Omega}}_{HC0} = \text{diag}(\hat{\epsilon}_1^2, \; \hat{\epsilon}_2^2, \; \ldots, \; \hat{\epsilon}_n^2)

The HC0 estimator is:

Var^HC0(β^)=(XTX)1(i=1nϵ^i2xixiT)(XTX)1\widehat{\text{Var}}_{HC0}(\hat{\boldsymbol{\beta}}) = (\mathbf{X}^T\mathbf{X})^{-1} \left(\sum_{i=1}^{n} \hat{\epsilon}_i^2 \, \mathbf{x}_i \mathbf{x}_i^T \right) (\mathbf{X}^T\mathbf{X})^{-1}

where xi\mathbf{x}_i is the ii-th row of X\mathbf{X} (as a column vector).

Key idea: Instead of assuming all errors have the same variance, let each residual “speak for itself.” The squared residual ϵ^i2\hat{\epsilon}_i^2 is a (noisy) estimate of Var(ϵi)\text{Var}(\epsilon_i).

6.6 Hand Calculation: HC0 for Our 5-Observation Regression

Let’s compute the HC0 “meat” matrix:

M=i=15ϵ^i2xixiT\mathbf{M} = \sum_{i=1}^{5} \hat{\epsilon}_i^2 \, \mathbf{x}_i \mathbf{x}_i^T

Each xi=(1,xi)T\mathbf{x}_i = (1, x_i)^T and ϵ^i2\hat{\epsilon}_i^2 was computed in Section 5.9:

| $i$ | $\hat{\epsilon}_i^2$ | $\mathbf{x}_i\mathbf{x}_i^T$ |
| --- | --- | --- |
| 1 | 0.16 | $\begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix}$ |
| 2 | 0.16 | $\begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}$ |
| 3 | 0.00 | $\begin{pmatrix} 1 & 3 \\ 3 & 9 \end{pmatrix}$ |
| 4 | 1.44 | $\begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}$ |
| 5 | 1.44 | $\begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$ |
M=0.16(1224)+0.16(1111)+0(1339)+1.44(1000)+1.44(1111)\mathbf{M} = 0.16\begin{pmatrix}1&2\\2&4\end{pmatrix} + 0.16\begin{pmatrix}1&-1\\-1&1\end{pmatrix} + 0\begin{pmatrix}1&3\\3&9\end{pmatrix} + 1.44\begin{pmatrix}1&0\\0&0\end{pmatrix} + 1.44\begin{pmatrix}1&1\\1&1\end{pmatrix}

Computing element-by-element (top-left as example):

M11=0.16(1)+0.16(1)+0(1)+1.44(1)+1.44(1)=0.16+0.16+0+1.44+1.44=3.20M_{11} = 0.16(1) + 0.16(1) + 0(1) + 1.44(1) + 1.44(1) = 0.16 + 0.16 + 0 + 1.44 + 1.44 = 3.20

You can verify the full meat matrix is:

M=(3.201.601.602.24)\mathbf{M} = \begin{pmatrix} 3.20 & 1.60 \\ 1.60 & 2.24 \end{pmatrix}

Then the HC0 variance:

Var^HC0=(XTX)1M(XTX)1\widehat{\text{Var}}_{HC0} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{M}(\mathbf{X}^T\mathbf{X})^{-1}

6.7 Finite-Sample Corrections: HC1, HC2, HC3

HC0 is consistent (correct as nn \to \infty) but tends to underestimate the variance in finite samples. Several corrections exist:

  • HC1 multiplies HC0 by the factor nnk\frac{n}{n-k} (a degrees-of-freedom correction analogous to Bessel’s correction for the variance). This is the most common variant in applied work: Var^HC1=nnkVar^HC0\widehat{\text{Var}}_{HC1} = \frac{n}{n-k} \widehat{\text{Var}}_{HC0}.

  • HC2 rescales each squared residual by 1/(1hii)1/(1 - h_{ii}), where hiih_{ii} is the ii-th diagonal element of the hat matrix H=X(XTX)1XT\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T.

  • HC3 uses ϵ^i2/(1hii)2\hat{\epsilon}_i^2 / (1 - h_{ii})^2, providing an even more conservative correction. HC3 is popular in econometrics because it approximates a jackknife estimator.

For our 5-observation example with k=2k = 2: nnk=531.667\frac{n}{n-k} = \frac{5}{3} \approx 1.667, so HC1 standard errors are about 29% larger than HC0 — a non-negligible correction for small samples. In the Fama-French tutorials (where n168n \approx 168), the difference between HC0 and HC1 is only about 1.2%, so the choice matters much less.

All of these HC variants handle heteroscedasticity but not autocorrelation. For time-series data, we need Newey-West — the subject of the next section.

# ============================================================================
# Hand-checkable HC0 sandwich estimator
# Uses: X_mat, resid, n_reg, k_reg, XtX_inv, se_classic, model_check
#       from Section 5 matrix OLS cell above
# ============================================================================

print("HC0 (WHITE) SANDWICH ESTIMATOR — HAND VERIFICATION")
print("=" * 60)

# Meat matrix: sum of e_i^2 * x_i x_i'
meat_HC0 = np.zeros((k_reg, k_reg))
for i in range(n_reg):
    xi = X_mat[i, :].reshape(-1, 1)  # column vector
    meat_HC0 += resid[i]**2 * (xi @ xi.T)

print(f"Squared residuals: {resid**2}")
print(f"\nMeat matrix M = Σ ε̂²ᵢ xᵢxᵢ' =\n{meat_HC0}")

# Sandwich
var_HC0 = XtX_inv @ meat_HC0 @ XtX_inv
se_HC0 = np.sqrt(np.diag(var_HC0))

print(f"\nHC0 Var(β̂) = (X'X)⁻¹ M (X'X)⁻¹ =\n{var_HC0}")
print(f"\nComparison of standard errors:")
print(f"  {'':15s}  {'Classic':>10s}  {'HC0':>10s}  {'% Change':>10s}")
print(f"  {'SE(α̂)':15s}  {se_classic[0]:10.4f}  {se_HC0[0]:10.4f}  {(se_HC0[0]-se_classic[0])/se_classic[0]*100:+10.1f}%")
print(f"  {'SE(β̂)':15s}  {se_classic[1]:10.4f}  {se_HC0[1]:10.4f}  {(se_HC0[1]-se_classic[1])/se_classic[1]*100:+10.1f}%")

# Verify with statsmodels HC0
results_hc0 = model_check.get_robustcov_results(cov_type='HC0')
print(f"\nStatsmodels HC0 SEs: {results_hc0.bse}")
print(f"Our HC0 SEs:         {se_HC0}")
print(f"Match: {np.allclose(se_HC0, results_hc0.bse)}")
# ============================================================================
# HC0 / HC1 / HC2 / HC3 comparison on our 5-observation regression
# ============================================================================

# Hat matrix: H = X (X'X)^{-1} X'
H_mat = X_mat @ XtX_inv @ X_mat.T
h_ii = np.diag(H_mat)  # leverage values

print("HC VARIANT COMPARISON (n=5, k=2)")
print("=" * 65)
print(f"Leverage values h_ii: {h_ii}")
print(f"Residuals:            {resid}")
print()

# HC0: standard White
var_hc0 = XtX_inv @ meat_HC0 @ XtX_inv

# HC1: degrees-of-freedom correction
dof_factor = n_reg / (n_reg - k_reg)
var_hc1 = dof_factor * var_hc0

# HC2: rescale by 1/(1 - h_ii)
meat_hc2 = np.zeros((k_reg, k_reg))
for i in range(n_reg):
    xi = X_mat[i, :].reshape(-1, 1)
    meat_hc2 += (resid[i]**2 / (1 - h_ii[i])) * (xi @ xi.T)
var_hc2 = XtX_inv @ meat_hc2 @ XtX_inv

# HC3: rescale by 1/(1 - h_ii)^2
meat_hc3 = np.zeros((k_reg, k_reg))
for i in range(n_reg):
    xi = X_mat[i, :].reshape(-1, 1)
    meat_hc3 += (resid[i]**2 / (1 - h_ii[i])**2) * (xi @ xi.T)
var_hc3 = XtX_inv @ meat_hc3 @ XtX_inv

# Standard errors
se_hc0 = np.sqrt(np.diag(var_hc0))
se_hc1 = np.sqrt(np.diag(var_hc1))
se_hc2 = np.sqrt(np.diag(var_hc2))
se_hc3 = np.sqrt(np.diag(var_hc3))

print(f"{'':15s}  {'Classic':>9s}  {'HC0':>9s}  {'HC1':>9s}  {'HC2':>9s}  {'HC3':>9s}")
print("-" * 65)
for j, name in enumerate(['SE(α̂)', 'SE(β̂)']):
    print(f"  {name:13s}  {se_classic[j]:9.4f}  {se_hc0[j]:9.4f}  "
          f"{se_hc1[j]:9.4f}  {se_hc2[j]:9.4f}  {se_hc3[j]:9.4f}")

print(f"\nDOF factor n/(n-k) = {n_reg}/{n_reg - k_reg} = {dof_factor:.3f}")
print(f"HC1 = HC0 × √({dof_factor:.3f}) in SE terms")
print(f"\nWith n=5, HC3 is notably larger than HC0 (conservative small-sample correction).")
print(f"With n=168 (Fama-French), all HC variants give very similar results.")

# Verify against statsmodels
for hc_type in ['HC0', 'HC1', 'HC2', 'HC3']:
    res_check = model_check.get_robustcov_results(cov_type=hc_type)
    our_se = {'HC0': se_hc0, 'HC1': se_hc1, 'HC2': se_hc2, 'HC3': se_hc3}[hc_type]
    match = np.allclose(our_se, res_check.bse)
    print(f"  {hc_type} matches statsmodels: {match}")

Section 7: Newey-West HAC Standard Errors — Complete Derivation

7.1 The Problem Newey-West Solves

The HC0 estimator handles heteroscedasticity (non-constant error variance) but still assumes no autocorrelation — that is, Cov(ϵi,ϵj)=0\text{Cov}(\epsilon_i, \epsilon_j) = 0 for iji \neq j.

In financial time-series, this assumption often fails. Monthly returns can be correlated across time due to:

  • Stale prices and illiquidity

  • Overlapping data (e.g., overlapping holding periods)

  • Slow information diffusion

  • Volatility clustering (ARCH/GARCH effects induce serial dependence in squared returns)

Newey and West (1987) proposed an estimator that is consistent in the presence of both heteroscedasticity and autocorrelation — hence HAC (Heteroscedasticity and Autocorrelation Consistent).

7.2 The Newey-West Meat

The Newey-West estimator modifies the “meat” of the sandwich to include cross-products of residuals at different lags:

MNW=i=1nϵ^i2xixiTHC0 meat (lag 0)+j=1Lwji=j+1nϵ^iϵ^ij(xixijT+xijxiT)lag-j cross terms\mathbf{M}_{NW} = \underbrace{\sum_{i=1}^{n} \hat{\epsilon}_i^2 \, \mathbf{x}_i\mathbf{x}_i^T}_{\text{HC0 meat (lag 0)}} + \sum_{j=1}^{L} w_j \underbrace{\sum_{i=j+1}^{n} \hat{\epsilon}_i \hat{\epsilon}_{i-j} \left(\mathbf{x}_i\mathbf{x}_{i-j}^T + \mathbf{x}_{i-j}\mathbf{x}_i^T\right)}_{\text{lag-}j\text{ cross terms}}

where:

  • LL = maximum lag (the “bandwidth” or “truncation lag”)

  • wj=1jL+1w_j = 1 - \frac{j}{L + 1} are Bartlett kernel weights (linearly declining from 1 to 0)

The full Newey-West variance is then:

Var^NW(β^)=(XTX)1MNW(XTX)1\widehat{\text{Var}}_{NW}(\hat{\boldsymbol{\beta}}) = (\mathbf{X}^T\mathbf{X})^{-1} \, \mathbf{M}_{NW} \, (\mathbf{X}^T\mathbf{X})^{-1}

7.3 Understanding the Bartlett Weights

The Bartlett weights wj=1jL+1w_j = 1 - \frac{j}{L+1} have two important properties:

  1. Monotone decline: Nearby lags get more weight than distant ones, reflecting the empirical fact that autocorrelation typically decays with distance.

  2. Guaranteed positive semi-definiteness: The resulting variance estimate is always a valid (non-negative) covariance matrix. This is not guaranteed by all kernel choices.

For example, with L=3L = 3:

| Lag $j$ | Weight $w_j = 1 - j/4$ |
| --- | --- |
| 0 | 1.000 (this is the HC0 part) |
| 1 | 0.750 |
| 2 | 0.500 |
| 3 | 0.250 |
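The weights are one line of code; a quick check (the helper name `bartlett_weights` is ours) reproduces the $L = 3$ values above:

```python
import numpy as np

def bartlett_weights(L):
    # w_j = 1 - j/(L+1) for lags j = 0, 1, ..., L
    j = np.arange(L + 1)
    return 1 - j / (L + 1)

print(bartlett_weights(3))  # lags 0..3: 1.0, 0.75, 0.5, 0.25
```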

7.4 Lag Selection

The choice of LL involves a bias-variance tradeoff:

  • Too small LL: May not capture all autocorrelation → biased (too optimistic)

  • Too large LL: Includes noisy estimates of near-zero autocorrelations → inefficient

The standard rule of thumb (Andrews, 1991):

L=4(n100)2/9L = \left\lfloor 4 \left(\frac{n}{100}\right)^{2/9} \right\rfloor

For typical Fama-French samples:

  • n=100n = 100: L=4L = 4

  • n=168n = 168: L=4L = 4

  • n=200n = 200: L=4L = 4

  • n=288n = 288: L=5L = 5
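The rule is a one-liner in code; the helper below (`andrews_lag` is our own name for it) reproduces these values:

```python
import numpy as np

def andrews_lag(n):
    # Rule-of-thumb truncation lag: L = floor(4 * (n/100)^(2/9))
    return int(np.floor(4 * (n / 100) ** (2 / 9)))

for n in (100, 168, 200, 288):
    print(n, andrews_lag(n))  # 4, 4, 4, 5
```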

7.5 Hand Calculation: Newey-West with L=1L = 1

Let us compute the Newey-West estimator for our 5-observation regression using L=1L = 1 (one lag of autocorrelation). This keeps the arithmetic manageable.

Step 1: The lag-0 (HC0) meat — already computed:

M0=i=15ϵ^i2xixiT=(3.201.601.602.24)\mathbf{M}_0 = \sum_{i=1}^{5} \hat{\epsilon}_i^2 \, \mathbf{x}_i\mathbf{x}_i^T = \begin{pmatrix} 3.20 & 1.60 \\ 1.60 & 2.24 \end{pmatrix}

Step 2: The lag-1 cross-term

We need ϵ^iϵ^i1\hat{\epsilon}_i \hat{\epsilon}_{i-1} for i=2,3,4,5i = 2, 3, 4, 5:

| $i$ | $\hat{\epsilon}_i$ | $\hat{\epsilon}_{i-1}$ | Product $\hat{\epsilon}_i\hat{\epsilon}_{i-1}$ |
| --- | --- | --- | --- |
| 2 | −0.4 | 0.4 | −0.16 |
| 3 | 0.0 | −0.4 | 0.00 |
| 4 | 1.2 | 0.0 | 0.00 |
| 5 | −1.2 | 1.2 | −1.44 |

Now compute the sum i=25ϵ^iϵ^i1(xixi1T+xi1xiT)\sum_{i=2}^{5} \hat{\epsilon}_i\hat{\epsilon}_{i-1}(\mathbf{x}_i\mathbf{x}_{i-1}^T + \mathbf{x}_{i-1}\mathbf{x}_i^T):

Each term is a scalar ($\hat{\epsilon}_i\hat{\epsilon}_{i-1}$) times a symmetric matrix. Because $\hat{\epsilon}_3 = 0$ appears in both the $i = 3$ and $i = 4$ products, only $i = 2$ and $i = 5$ contribute.

i=2i = 2: ϵ^2ϵ^1=0.16\hat{\epsilon}_2\hat{\epsilon}_1 = -0.16

x2x1T+x1x2T=(11)(12)+(12)(11)=(1212)+(1122)=(2114)\mathbf{x}_2\mathbf{x}_1^T + \mathbf{x}_1\mathbf{x}_2^T = \begin{pmatrix}1\\-1\end{pmatrix}\begin{pmatrix}1&2\end{pmatrix} + \begin{pmatrix}1\\2\end{pmatrix}\begin{pmatrix}1&-1\end{pmatrix} = \begin{pmatrix}1&2\\-1&-2\end{pmatrix} + \begin{pmatrix}1&-1\\2&-2\end{pmatrix} = \begin{pmatrix}2&1\\1&-4\end{pmatrix}

Contribution: 0.16×(2114)=(0.320.160.160.64)-0.16 \times \begin{pmatrix}2&1\\1&-4\end{pmatrix} = \begin{pmatrix}-0.32&-0.16\\-0.16&0.64\end{pmatrix}

i=5i = 5: ϵ^5ϵ^4=1.44\hat{\epsilon}_5\hat{\epsilon}_4 = -1.44

x5x4T+x4x5T=(11)(10)+(10)(11)=(1010)+(1100)=(2110)\mathbf{x}_5\mathbf{x}_4^T + \mathbf{x}_4\mathbf{x}_5^T = \begin{pmatrix}1\\1\end{pmatrix}\begin{pmatrix}1&0\end{pmatrix} + \begin{pmatrix}1\\0\end{pmatrix}\begin{pmatrix}1&1\end{pmatrix} = \begin{pmatrix}1&0\\1&0\end{pmatrix} + \begin{pmatrix}1&1\\0&0\end{pmatrix} = \begin{pmatrix}2&1\\1&0\end{pmatrix}

Contribution: 1.44×(2110)=(2.881.441.440)-1.44 \times \begin{pmatrix}2&1\\1&0\end{pmatrix} = \begin{pmatrix}-2.88&-1.44\\-1.44&0\end{pmatrix}

Lag-1 sum (before weighting):

G1=(0.322.880.161.440.161.440.64+0)=(3.201.601.600.64)\mathbf{G}_1 = \begin{pmatrix}-0.32 - 2.88 & -0.16 - 1.44\\ -0.16 - 1.44 & 0.64 + 0\end{pmatrix} = \begin{pmatrix}-3.20 & -1.60\\ -1.60 & 0.64\end{pmatrix}

Step 3: Apply Bartlett weight with L=1L = 1:

w1=111+1=10.5=0.5w_1 = 1 - \frac{1}{1+1} = 1 - 0.5 = 0.5

Step 4: Assemble the Newey-West meat:

MNW=M0+w1G1=(3.201.601.602.24)+0.5(3.201.601.600.64)\mathbf{M}_{NW} = \mathbf{M}_0 + w_1 \cdot \mathbf{G}_1 = \begin{pmatrix}3.20&1.60\\1.60&2.24\end{pmatrix} + 0.5\begin{pmatrix}-3.20&-1.60\\-1.60&0.64\end{pmatrix}
=(3.201.601.600.801.600.802.24+0.32)=(1.600.800.802.56)= \begin{pmatrix}3.20 - 1.60 & 1.60 - 0.80\\ 1.60 - 0.80 & 2.24 + 0.32\end{pmatrix} = \begin{pmatrix}1.60 & 0.80\\ 0.80 & 2.56\end{pmatrix}

Step 5: Compute the Newey-West variance

Var^NW=(XTX)1MNW(XTX)1\widehat{\text{Var}}_{NW} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{M}_{NW}(\mathbf{X}^T\mathbf{X})^{-1}

We already know (XTX)1=(0.300.100.100.10)(\mathbf{X}^T\mathbf{X})^{-1} = \begin{pmatrix}0.30&-0.10\\-0.10&0.10\end{pmatrix}.

# ============================================================================
# Hand-checkable Newey-West with L = 1
# Uses: meat_HC0, X_mat, resid, n_reg, k_reg, XtX_inv, se_classic, se_HC0,
#       beta_hat from Sections 5–6 cells above
# ============================================================================

print("NEWEY-WEST HAC ESTIMATOR (L = 1) — HAND VERIFICATION")
print("=" * 60)

L_nw = 1  # One lag

# Step 1: HC0 meat (lag 0) — already computed
print("Step 1: HC0 meat (lag 0) — from Section 6")
print(f"  M₀ =\n{meat_HC0}\n")

# Step 2: Lag-1 cross-terms
print("Step 2: Lag-1 cross-products")
G1 = np.zeros((k_reg, k_reg))
print(f"  Residuals: {resid}")
print(f"  Lag-1 products ε̂ᵢε̂ᵢ₋₁:")
for i in range(1, n_reg):
    cross = resid[i] * resid[i - 1]
    xi = X_mat[i, :].reshape(-1, 1)
    xi_lag = X_mat[i - 1, :].reshape(-1, 1)
    outer = xi @ xi_lag.T + xi_lag @ xi.T
    contribution = cross * outer
    G1 += contribution
    print(f"    i={i+1}: ε̂_{i+1}·ε̂_{i} = {resid[i]:.1f} × {resid[i-1]:.1f} = {cross:.2f}")

print(f"\n  G₁ (lag-1 sum, unweighted) =\n{G1}")

# Step 3: Bartlett weight
w1 = 1 - 1 / (L_nw + 1)
print(f"\nStep 3: Bartlett weight w₁ = 1 - 1/{L_nw + 1} = {w1}")

# Step 4: Newey-West meat
meat_NW = meat_HC0 + w1 * G1
print(f"\nStep 4: M_NW = M₀ + w₁·G₁ =\n{meat_NW}")

# Step 5: Newey-West variance
var_NW = XtX_inv @ meat_NW @ XtX_inv
se_NW = np.sqrt(np.diag(var_NW))
print(f"\nStep 5: Var_NW(β̂) = (X'X)⁻¹ M_NW (X'X)⁻¹ =\n{var_NW}")

# Comparison table
print(f"\n{'='*60}")
print(f"STANDARD ERROR COMPARISON")
print(f"{'='*60}")
print(f"  {'':15s}  {'Classic':>10s}  {'HC0':>10s}  {'NW(L=1)':>10s}")
print(f"  {'SE(α̂)':15s}  {se_classic[0]:10.4f}  {se_HC0[0]:10.4f}  {se_NW[0]:10.4f}")
print(f"  {'SE(β̂)':15s}  {se_classic[1]:10.4f}  {se_HC0[1]:10.4f}  {se_NW[1]:10.4f}")

# t-statistics and p-values under each method
print(f"\n  {'':15s}  {'Classic':>10s}  {'HC0':>10s}  {'NW(L=1)':>10s}")
for idx, name in enumerate(['t(α̂)', 't(β̂)']):
    t_c = beta_hat[idx] / se_classic[idx]
    t_h = beta_hat[idx] / se_HC0[idx]
    t_n = beta_hat[idx] / se_NW[idx]
    print(f"  {name:15s}  {t_c:10.3f}  {t_h:10.3f}  {t_n:10.3f}")

print(f"\nNOTE: With only n=5, these numbers are illustrative — the CLT hasn't")
print(f"kicked in yet. With n=168 (as in the Fama-French tutorials), the")
print(f"Newey-West estimator becomes highly reliable.")

Section 8: Newey-West on a Realistic Sample — CLT in Action

The 5-observation example above showed the mechanics. But with n=5n = 5, the CLT has barely begun to work. Let’s now see Newey-West on a realistic sample where the asymptotic theory actually applies.

We’ll generate 200 observations from a deliberately non-normal regression with:

  • Heteroscedastic errors (variance depends on xx)

  • Autocorrelated errors (AR(1) structure)

  • Fat-tailed errors (t-distributed innovations)

This scenario represents a stylized version of what real financial data looks like. Classic OLS standard errors will be wrong; Newey-West will handle it correctly.

# ============================================================================
# Newey-West on a realistic simulated dataset
# ============================================================================

np.random.seed(42)
n_sim = 200

# Generate regressor
x_sim = np.random.randn(n_sim) * 0.04  # Scale like monthly factor returns

# Generate errors with heteroscedasticity + autocorrelation + fat tails
innovations = stats.t.rvs(df=5, size=n_sim)  # fat-tailed (t with 5 df)
errors = np.zeros(n_sim)
rho = 0.3  # AR(1) coefficient
for t in range(n_sim):
    if t == 0:
        errors[t] = innovations[t]
    else:
        errors[t] = rho * errors[t-1] + innovations[t]
# Add heteroscedasticity: variance proportional to |x|
errors = errors * (0.01 + 0.5 * np.abs(x_sim))

# True model: y = 0.001 + 1.0 * x + error
alpha_true, beta_true = 0.001, 1.0
y_sim = alpha_true + beta_true * x_sim + errors

# Fit OLS
X_sim = sm.add_constant(x_sim)
model_sim = sm.OLS(y_sim, X_sim).fit()

# Compute Newey-West with Andrews lag
lag_sim = int(np.floor(4 * (n_sim / 100) ** (2/9)))
model_nw = model_sim.get_robustcov_results(cov_type='HAC', maxlags=lag_sim)

print("REALISTIC SIMULATION: n=200, heteroscedastic + autocorrelated + fat-tailed errors")
print("=" * 75)
print(f"True parameters: α = {alpha_true}, β = {beta_true}")
print(f"Newey-West lag length (Andrews formula): L = {lag_sim}")

print(f"\n{'':15s}  {'Estimate':>10s}  {'Classic SE':>10s}  {'NW SE':>10s}  {'Classic t':>10s}  {'NW t':>10s}")
print("-" * 75)
for i, name in enumerate(['α (intercept)', 'β (slope)']):
    coef = model_sim.params[i]
    se_c = model_sim.bse[i]
    se_nw_val = model_nw.bse[i]
    t_c = model_sim.tvalues[i]
    t_nw = model_nw.tvalues[i]
    print(f"  {name:13s}  {coef:10.5f}  {se_c:10.5f}  {se_nw_val:10.5f}  {t_c:10.3f}  {t_nw:10.3f}")

print(f"\nClassic SE underestimates uncertainty:")
print(f"  SE(β) ratio (NW / Classic): {model_nw.bse[1] / model_sim.bse[1]:.2f}x")
print(f"\n  Classic p-value for β: {model_sim.pvalues[1]:.4f}")
print(f"  NW p-value for β:     {model_nw.pvalues[1]:.4f}")

# Visualize
fig, axes = plt.subplots(1, 3, figsize=(16, 4.5))

# Plot 1: Residuals vs fitted
axes[0].scatter(model_sim.fittedvalues, model_sim.resid, alpha=0.5, s=20)
axes[0].axhline(0, color='red', linewidth=1, linestyle='--')
axes[0].set_xlabel('Fitted values')
axes[0].set_ylabel('Residuals')
axes[0].set_title('Residuals vs Fitted\n(Heteroscedasticity visible)', fontweight='bold')
axes[0].grid(True, alpha=0.3)

# Plot 2: Autocorrelation of residuals
acf_vals = [pd.Series(model_sim.resid).autocorr(lag=l) for l in range(1, 11)]
axes[1].bar(range(1, 11), acf_vals, color='steelblue', alpha=0.7)
axes[1].axhline(1.96/np.sqrt(n_sim), color='red', linestyle='--', alpha=0.5, label='95% band')
axes[1].axhline(-1.96/np.sqrt(n_sim), color='red', linestyle='--', alpha=0.5)
axes[1].set_xlabel('Lag')
axes[1].set_ylabel('Autocorrelation')
axes[1].set_title('Residual Autocorrelation\n(Lag 1 is significant)', fontweight='bold')
axes[1].legend(fontsize=9)
axes[1].grid(True, alpha=0.3)

# Plot 3: Confidence intervals comparison
coef_val = model_sim.params[1]
ci_classic = [coef_val - 1.96 * model_sim.bse[1], coef_val + 1.96 * model_sim.bse[1]]
ci_nw = [coef_val - 1.96 * model_nw.bse[1], coef_val + 1.96 * model_nw.bse[1]]
axes[2].barh(['NW HAC', 'Classic'], 
             [ci_nw[1] - ci_nw[0], ci_classic[1] - ci_classic[0]],
             left=[ci_nw[0], ci_classic[0]], height=0.4, alpha=0.6,
             color=['darkred', 'steelblue'])
axes[2].axvline(beta_true, color='green', linewidth=2, linestyle='--', label=f'True β = {beta_true}')
axes[2].axvline(coef_val, color='black', linewidth=1.5, label=f'Estimate = {coef_val:.3f}')
axes[2].set_xlabel('β coefficient')
axes[2].set_title('95% Confidence Intervals\n(NW is wider = more honest)', fontweight='bold')
axes[2].legend(fontsize=9)
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nThe Newey-West CI is wider because it honestly accounts for the")
print("heteroscedasticity and autocorrelation that classic OLS ignores.")

Section 9: Putting It All Together — From CLT to Newey-West to p-Values

9.1 The Grand Argument

We can now state the complete logical chain that justifies using the standard normal distribution for p-values in Fama-French regressions, even though the data is non-normal, heteroscedastic, and autocorrelated:

Step 1 (OLS coefficients are weighted averages). The OLS estimator β^=(XX)1Xy\hat{\beta} = (X'X)^{-1}X'y is a linear function of y1,,yny_1, \ldots, y_n. Since each yi=βxi+ϵiy_i = \beta x_i + \epsilon_i, the estimator β^\hat{\beta} is ultimately a weighted average of the error terms ϵi\epsilon_i.

Step 2 (CLT applies to weighted averages). A generalized version of the CLT (Lindeberg-Feller or Eicker conditions) guarantees that β^\hat{\beta} is asymptotically normal as nn \to \infty, provided the errors are not too dependent or too heavy-tailed:

β^dN(β,  Var(β^))\hat{\beta} \xrightarrow{d} \mathcal{N}\left(\beta, \; \text{Var}(\hat{\beta})\right)

This holds regardless of the distribution of ϵi\epsilon_i: no normality needed.

Step 3 (Newey-West consistently estimates the variance). Even though Var(β^)\text{Var}(\hat{\beta}) depends on the unknown error structure (heteroscedasticity + autocorrelation), the Newey-West estimator Var^NW(β^)\widehat{\text{Var}}_{NW}(\hat{\beta}) converges to the true variance as nn \to \infty.

Step 4 (Slutsky’s theorem combines Steps 2–3). The t-statistic

t=β^β0Var^NW(β^)dN(0,1)t = \frac{\hat{\beta} - \beta_0}{\sqrt{\widehat{\text{Var}}_{NW}(\hat{\beta})}} \xrightarrow{d} \mathcal{N}(0, 1)

under H0:β=β0H_0: \beta = \beta_0. Therefore, p-values can be computed from the standard normal distribution.
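In code, the last step is a one-line conversion from a HAC t-statistic to a two-sided standard normal p-value (the t-value below is illustrative only, not from the tutorials):

```python
from scipy import stats

# Example Newey-West t-statistic (illustrative value)
t_nw = 2.10
p_value = 2 * (1 - stats.norm.cdf(abs(t_nw)))  # two-sided, standard normal
print(f"p-value = {p_value:.4f}")
```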

9.2 What This Does NOT Require

  • ❌ Normally distributed errors

  • ❌ Constant error variance (homoscedasticity)

  • ❌ Independent errors (no autocorrelation)

  • ❌ Any specific parametric distribution for the errors

9.3 What This DOES Require

  • ✅ A large enough sample (n50n \geq 50, ideally 100\geq 100 for financial data)

  • ✅ Finite second moments (E[ϵi2]<E[\epsilon_i^2] < \infty)

  • ✅ Errors that are “not too dependent” (autocorrelation decays to zero as lag increases — satisfied by most financial return series)

  • ✅ A consistent lag length LL (grows with nn but slower than nn)

9.4 The Classification: Is This “Nonparametric”?

Technically, the approach is semiparametric:

| Aspect | Parametric or not? |
| --- | --- |
| Model for the mean ($E[y \mid x] = \alpha + \beta x$) | Parametric — a linear functional form is assumed |
| Distribution of errors | Nonparametric — no distributional assumption |
| Variance estimation | Nonparametric — Newey-West uses the data directly (squared residuals and cross-products), not a parametric formula |
| Inference (p-values) | Based on asymptotic normality via the CLT, not on any parametric distributional assumption |

In the finance literature, this is often loosely called “nonparametric” or “robust” inference. The key point: the p-values do not depend on the error distribution, only on the CLT.

# ============================================================================
# Monte Carlo: Verify that Newey-West p-values are correctly calibrated
# under H₀, even with non-normal, heteroscedastic, autocorrelated errors
# ============================================================================
# If the theory is correct:
#   - Under H₀: β = 0, we should reject at 5% level exactly 5% of the time
#   - Classic OLS will over-reject (too many false positives)

np.random.seed(0)
num_mc = 2000
n_mc = 150  # Typical Fama-French sample size
reject_classic = 0
reject_nw = 0

for sim in range(num_mc):
    # Generate x (factor)
    x_mc = np.random.randn(n_mc) * 0.04
    
    # Generate non-normal, heteroscedastic, AR(1) errors
    innov = stats.t.rvs(df=5, size=n_mc)
    eps = np.zeros(n_mc)
    eps[0] = innov[0]
    for t in range(1, n_mc):
        eps[t] = 0.3 * eps[t-1] + innov[t]
    eps = eps * (0.01 + 0.4 * np.abs(x_mc))  # heteroscedasticity: scale grows with |x|
    
    # True β = 0 (null hypothesis is true)
    y_mc = 0.0 + 0.0 * x_mc + eps
    
    X_mc = sm.add_constant(x_mc)
    res_mc = sm.OLS(y_mc, X_mc).fit()
    lag_mc = int(np.floor(4 * (n_mc / 100) ** (2/9)))
    res_nw_mc = res_mc.get_robustcov_results(cov_type='HAC', maxlags=lag_mc)
    
    # Test β = 0 at 5% level (two-sided)
    if res_mc.pvalues[1] < 0.05:
        reject_classic += 1
    if res_nw_mc.pvalues[1] < 0.05:
        reject_nw += 1

print("MONTE CARLO: FALSE REJECTION RATES UNDER H₀ (β = 0)")
print("=" * 60)
print(f"Number of simulations:     {num_mc}")
print(f"Sample size per sim:       {n_mc}")
print(f"Errors: t(5) innovations, AR(1) ρ=0.3, heteroscedastic")
print(f"\nNominal size (α):          5.0%")
print(f"Classic OLS rejection rate: {reject_classic/num_mc*100:.1f}%")
print(f"Newey-West rejection rate:  {reject_nw/num_mc*100:.1f}%")
print(f"\nINTERPRETATION:")
if reject_classic / num_mc > 0.07:
    print(f"  Classic OLS OVER-REJECTS: {reject_classic/num_mc*100:.1f}% >> 5%")
    print(f"    → Using classic SEs, you'd get too many false positives!")
print(f"  Newey-West is close to 5%: the CLT + HAC correction works.")
print(f"\n  This confirms the theory from Section 9.1:")
print(f"  Newey-West + CLT → correctly calibrated p-values")
print(f"  even with non-normal, heteroscedastic, autocorrelated data.")

# ── Visualize the result: rejection-rate bar chart ──
rate_classic = reject_classic / num_mc * 100
rate_nw = reject_nw / num_mc * 100

fig, ax = plt.subplots(figsize=(7, 4.5))
bars = ax.bar(['Classic OLS', 'Newey-West HAC'], [rate_classic, rate_nw],
              color=['#d9534f', '#5cb85c'], width=0.5, edgecolor='black', linewidth=0.8)
ax.axhline(5, color='black', linestyle='--', linewidth=1.5, label='Nominal 5 % level')
ax.set_ylabel('Rejection rate (%)', fontsize=12)
ax.set_title('Monte Carlo Rejection Rates Under $H_0$\n'
             f'({num_mc} simulations, n = {n_mc}, AR(1) ρ = 0.3, heteroscedastic t(5) errors)',
             fontweight='bold', fontsize=11)
ax.set_ylim(0, max(rate_classic, rate_nw) * 1.35)

# Annotate bars
for bar, rate in zip(bars, [rate_classic, rate_nw]):
    ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.4,
            f'{rate:.1f} %', ha='center', va='bottom', fontweight='bold', fontsize=12)

ax.legend(fontsize=10)
ax.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

print("\nThe bar chart confirms: Classic OLS over-rejects (red bar above the")
print("dashed line), while Newey-West stays near the nominal 5 % level (green bar).")

Section 10: Summary and Connection to the Fama-French Tutorials

10.1 What We’ve Shown

| Section | Key Result |
| --- | --- |
| 1. Descriptive Stats | Mean, variance, covariance computed by hand on 5 data points |
| 2. Population vs. Sample | Standard error of the mean $= s / \sqrt{n}$ measures estimation uncertainty |
| 3. CLT | Sample means are approximately normal for large $n$, regardless of the population distribution |
| 4. Hypothesis Testing & CLT | t-statistics, p-values, confidence intervals, and why the normal approximation works |
| 5. Linear Regression & OLS | Scalar and matrix derivations, residuals, and classic standard errors on a 5-observation regression |
| 6. Nonparametric Inference | The sandwich formula: HC0 replaces the scalar $\hat{\sigma}^2$ with observation-specific $\hat{\epsilon}_i^2$ |
| 7. Newey-West | Extends HC0 by adding Bartlett-weighted cross-lag terms — handles autocorrelation too |
| 8. Realistic Simulation | Newey-West gives wider (more honest) CIs when the data has time-series structure |
| 9. Monte Carlo | Classic OLS over-rejects under $H_0$; Newey-West maintains the correct 5% size |

10.2 How This Connects to the Fama-French Notebooks

In the 3-Factor Tutorial:

  • Section 10 uses Newey-West standard errors for the Fama-French regression — now you understand exactly what that estimator computes and why it works.

  • Section 11 interprets t-statistics and p-values — now you know these are justified by the CLT, not by assuming normally distributed errors.

  • The Andrews lag-selection formula $L = \lfloor 4(n/100)^{2/9} \rfloor$ is the same one derived here.

In the Advanced Tutorial:

  • Every regression (5-factor, Carhart, rolling window) uses Newey-West HAC standard errors.

  • The run_ff_regression() helper function computes the Newey-West estimator automatically.

  • The Monte Carlo evidence here explains why those p-values are trustworthy despite the non-normality, fat tails, and volatility clustering present in real factor returns.

10.3 Quick Reference

When to Use What:

| Situation | Standard Errors | Why |
| --- | --- | --- |
| Cross-sectional data, homoscedastic | Classic OLS | All assumptions hold |
| Cross-sectional, possible heteroscedasticity | HC0/HC1/HC3 | Sandwich estimator handles non-constant variance |
| Time-series data (financial returns) | Newey-West | Handles both heteroscedasticity and autocorrelation |
| Any situation with $n \geq 100$ | Normal approximation for p-values | CLT guarantees asymptotic normality |

The Newey-West Recipe:

  1. Fit OLS: $\hat{\beta} = (X'X)^{-1}X'y$

  2. Compute residuals: $\hat{\epsilon}_i = y_i - \mathbf{x}_i'\hat{\beta}$

  3. Choose the lag length $L = \lfloor 4(n/100)^{2/9}\rfloor$

  4. Compute the meat, using Bartlett weights $w_j = 1 - j/(L+1)$: $\mathbf{M}_{NW} = \sum_i \hat{\epsilon}_i^2 \mathbf{x}_i\mathbf{x}_i' + \sum_{j=1}^{L} w_j \sum_{i>j} \hat{\epsilon}_i\hat{\epsilon}_{i-j}(\mathbf{x}_i\mathbf{x}_{i-j}' + \mathbf{x}_{i-j}\mathbf{x}_i')$

  5. Sandwich: $\widehat{\text{Var}}_{NW} = (X'X)^{-1}\mathbf{M}_{NW}(X'X)^{-1}$

  6. Standard errors: $\text{SE}(\hat{\beta}_j) = \sqrt{[\widehat{\text{Var}}_{NW}]_{jj}}$

  7. t-statistics: $t_j = \hat{\beta}_j / \text{SE}(\hat{\beta}_j) \xrightarrow{d} \mathcal{N}(0,1)$ under $H_0$

  8. p-values: $p = 2\Phi(-|t_j|)$, where $\Phi$ is the standard normal CDF

10.4 Next Steps

You are now ready to proceed to the Fama-French 3-Factor Tutorial, where these concepts are applied to real market data. The statistical machinery developed here — CLT-based inference, sandwich estimators, Newey-West HAC — forms the foundation of all the inference in both tutorial notebooks.


Exercises

Test your understanding of the statistical foundations developed in this notebook.

Exercise 1: Hand Computation (Pencil & Paper)

Consider the dataset $x = [2, 5, 3, 8, 7]$ and $y = [4, 7, 5, 10, 9]$.

  1. Compute the sample mean, variance, and standard deviation of $x$ and $y$.

  2. Compute $\text{Cov}(x, y)$ and $\text{Corr}(x, y)$ by hand.

  3. Set up the design matrix $\mathbf{X}$ (with a constant column) and compute $\hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ using the normal equations. Verify your answer with numpy.

Exercise 2: CLT in Action (Hands-On)

The CLT was demonstrated in Section 3.7 using an Exponential(1) distribution. Repeat the experiment with a qualitatively different non-normal distribution:

  1. Draw samples of size $n = 5, 30, 100$ from a Beta(0.5, 0.5) distribution (use np.random.beta(0.5, 0.5, size=n)). This distribution is bimodal — most of the mass is near 0 and 1.

  2. Compute 5 000 sample means for each $n$ and plot the histograms.

  3. At what sample size does the distribution of $\bar{X}_n$ start looking approximately normal? Is this faster or slower than for Exponential(1)?

  4. Why might a bimodal distribution converge faster than a heavily skewed one?

# Starter code:
import numpy as np
from scipy import stats

n_sims = 5000
mu_beta = 0.5                       # E[Beta(0.5, 0.5)] = a/(a+b) = 0.5
var_beta = 0.5 * 0.5 / (1.0 * 2.0)  # a*b / ((a+b)^2 * (a+b+1)) = 0.125
for n in [5, 30, 100]:
    means = np.array([np.random.beta(0.5, 0.5, size=n).mean() for _ in range(n_sims)])
    z = (means - mu_beta) / np.sqrt(var_beta / n)
    # ... plot histogram of z and overlay the N(0, 1) density

Exercise 3: Sandwich Estimator Comparison (Hands-On)

Using the 5-observation dataset from Section 5:

  1. Artificially make the residuals heteroscedastic by multiplying $y$ by $[1, 1, 1, 3, 3]$ (i.e., inflate the last two observations). Refit by OLS.

  2. Compute the classic SE, HC0 SE, and Newey-West SE for $\hat{\beta}_1$.

  3. Which estimator changes the most? Why?

Exercise 4: Monte Carlo — Varying the Autocorrelation (Challenge)

Extend the Monte Carlo simulation from Section 9:

  1. Run the same experiment with autocorrelation coefficients $\rho \in \{0, 0.3, 0.6, 0.9\}$.

  2. For each $\rho$, record the empirical rejection rate of the classic t-test and the Newey-West t-test at the 5% level.

  3. Plot the rejection rate vs. $\rho$ for both methods. At what $\rho$ does the classic test start seriously over-rejecting?

# Starter code:
rho_values = [0.0, 0.3, 0.6, 0.9]
for rho in rho_values:
    # Generate AR(1) errors: e_t = rho * e_{t-1} + v_t
    # Fit OLS, compute classic and NW t-stats, check rejection at 5%
    pass

Exercise 5: Why Does the Mixing CLT Matter? (Discussion)

  1. Explain in your own words why the standard CLT (Lindeberg-Lévy) cannot be applied directly to time-series data.

  2. What additional condition does the Mixing CLT require, and why is it satisfied for financial returns?

  3. If the mixing condition were not satisfied (e.g., a random walk), what would go wrong with the Newey-West estimator?