Empirica Technologies

Spectral Boundaries in Large-Cap Equity Returns: Evidence for Categorical Structure Beyond Random-Matrix Noise

Question

Does the eigenvalue spectrum of large-cap US equity return correlations exhibit a statistically significant spectral boundary—eigenvalues exceeding the Marchenko-Pastur null distribution—and does the number of such significant factors vary systematically over time in a manner consistent with regime-dependent categorical constraints on portfolio morphisms?

Method

We computed the eigenvalue spectrum of the return correlation matrix for 16 large-cap US equities (AAPL, AMZN, BAC, CVX, GOOGL, JNJ, JPM, KO, META, MSFT, NVDA, PEP, PFE, PG, WMT, XOM) using daily adjusted-close returns from yfinance over the window 2010-01-01 to 2024-12-31 (3772 observations). The inference method is a principal component analysis (PCA) of the correlation matrix, with the eigenvalue spectrum compared against the Marchenko-Pastur null distribution.
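The spectrum computation described above can be sketched as follows. This is a minimal illustration, not the code used for the reported numbers: the function name `correlation_spectrum` is hypothetical, and the input is synthetic (one common factor plus noise) rather than downloaded price data.

```python
import numpy as np

def correlation_spectrum(returns: np.ndarray) -> np.ndarray:
    """Eigenvalues of the sample correlation matrix, largest first.

    `returns` is a (T, n) array: T daily observations for n assets.
    """
    corr = np.corrcoef(returns, rowvar=False)  # (n, n) correlation matrix
    eigenvalues = np.linalg.eigvalsh(corr)     # ascending; real for symmetric input
    return eigenvalues[::-1]

# Synthetic stand-in for real returns (an assumption for illustration only):
# one common "market" factor plus idiosyncratic noise.
rng = np.random.default_rng(0)
T, n = 3772, 15
market = rng.standard_normal((T, 1))
returns = 0.6 * market + rng.standard_normal((T, n))

spectrum = correlation_spectrum(returns)
print(spectrum[:3])
```

The eigenvalues of a correlation matrix always sum to n, which is why each eigenvalue divided by n can be read as a share of total variance.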

The Marchenko-Pastur distribution provides the theoretical eigenvalue density for the correlation matrix of purely random, uncorrelated returns. The spectral computation uses n = 15 assets (one fewer than the 16 tickers listed in the universe) and T = 3772 observations, so the ratio q = n/T ≈ 0.004 yields a Marchenko-Pastur upper bound of (1 + √q)² = 1.1301 and a lower bound of (1 − √q)² = 0.8779. Eigenvalues exceeding the upper bound are treated as distinguishable from random-matrix noise and as representing genuine covariance structure. The number of eigenvalues above this threshold is the number of significant factors.
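The bounds quoted above follow from the standard Marchenko-Pastur edge formula for unit-variance data, (1 ± √q)² with q = n/T. A short check, with a hypothetical helper name, reproduces the reported values for n = 15 and T = 3772:

```python
import math

def marchenko_pastur_bounds(n_assets: int, n_obs: int) -> tuple[float, float]:
    """Asymptotic lower and upper edges of the Marchenko-Pastur bulk."""
    q = n_assets / n_obs
    lower = (1.0 - math.sqrt(q)) ** 2
    upper = (1.0 + math.sqrt(q)) ** 2
    return lower, upper

lower, upper = marchenko_pastur_bounds(15, 3772)
print(f"{lower:.4f} {upper:.4f}")  # 0.8779 1.1301
```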

To assess time variation, we recomputed the spectrum on the same data partitioned by calendar year (in-sample within each year), yielding a per-year significant factor count from 2010 through 2024. We also extracted the top factor loadings to interpret the economic content of the leading eigenvectors.
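A per-year factor count along these lines might look like the sketch below. The source does not state which Marchenko-Pastur bound was applied within each year. This sketch assumes the bound is recomputed from each window's own length, so a one-year window of roughly 252 days with 15 assets has q ≈ 0.06 and an upper bound near 1.55. The function name and the synthetic data are hypothetical.

```python
import numpy as np

def count_significant_factors(returns: np.ndarray) -> int:
    """Count correlation eigenvalues above the window's own MP upper bound."""
    T, n = returns.shape
    upper = (1.0 + np.sqrt(n / T)) ** 2
    eigenvalues = np.linalg.eigvalsh(np.corrcoef(returns, rowvar=False))
    return int(np.sum(eigenvalues > upper))

# Synthetic one-factor returns for three "years" of 252 days each
# (illustrative data, not the equity sample used in the report).
rng = np.random.default_rng(1)
n, days_per_year = 15, 252
years = np.repeat(np.arange(2010, 2013), days_per_year)
market = rng.standard_normal((years.size, 1))
returns = 0.8 * market + rng.standard_normal((years.size, n))

counts_by_year = {
    int(y): count_significant_factors(returns[years == y])
    for y in np.unique(years)
}
print(counts_by_year)
```

Because shorter windows raise the threshold, a count obtained with a per-window bound is not directly comparable to one obtained with the full-sample bound of 1.1301.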

Result

The full-sample eigenvalue spectrum exhibits a clear spectral boundary. The top 10 eigenvalues are 6.4878, 1.7134, 1.4928, 0.7632, 0.7174, 0.6574, 0.5596, 0.4912, 0.4377, and 0.3942. Against the Marchenko-Pastur upper bound of 1.1301, exactly 3 eigenvalues exceed the threshold, constituting the significant factor count.

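The leading eigenvalue (6.4878) alone explains 43.25% of total variance, and the three significant factors collectively explain 64.63%. The remaining 12 eigenvalues all fall below the upper bound. The largest of them (0.7632) is also below the Marchenko-Pastur lower bound of 0.8779, so they lie beneath the null bulk rather than inside it, as expected when a few large eigenvalues absorb much of the fixed total variance.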
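The variance shares can be checked directly from the reported eigenvalues. The trace of a correlation matrix equals the number of assets, so total variance is n = 15 here. The variable names below are illustrative.

```python
# Top-10 eigenvalues and upper bound as reported in the Result section.
top_eigenvalues = [6.4878, 1.7134, 1.4928, 0.7632, 0.7174,
                   0.6574, 0.5596, 0.4912, 0.4377, 0.3942]
n_assets = 15       # trace of the correlation matrix, i.e. total variance
mp_upper = 1.1301   # Marchenko-Pastur upper bound

significant = [ev for ev in top_eigenvalues if ev > mp_upper]
leading_share = 100 * top_eigenvalues[0] / n_assets
significant_share = 100 * sum(significant) / n_assets

print(len(significant), round(leading_share, 2), round(significant_share, 2))
# 3 43.25 64.63
```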

The top factor (eigenvalue 6.4878) loads most heavily on JPM (0.291), MSFT (0.290), and PEP (0.278), suggesting a broad market or systematic risk factor. The second factor (eigenvalue 1.7134) has its largest loadings, all negative, on AMZN (-0.416), NVDA (-0.403), and GOOGL (-0.351), indicating a technology/growth tilt orthogonal to the market factor. Because the overall sign of an eigenvector is arbitrary, only the relative signs across assets are informative.
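Loadings of this kind are the entries of the unit-norm eigenvectors of the correlation matrix. The sketch below shows one way to extract the largest-magnitude loadings. The function name, the toy tickers, and the sign convention (flip the vector so its entries sum to a non-negative value) are assumptions for illustration, since an eigenvector is only defined up to sign and the report does not state a convention.

```python
import numpy as np

def top_loadings(corr, tickers, factor=0, k=3):
    """Return the k largest-magnitude loadings of one principal component."""
    eigenvalues, eigenvectors = np.linalg.eigh(corr)  # ascending order
    order = np.argsort(eigenvalues)[::-1]             # largest eigenvalue first
    vector = eigenvectors[:, order[factor]]
    # Assumed sign convention: make the loadings sum non-negative.
    if vector.sum() < 0:
        vector = -vector
    ranked = np.argsort(np.abs(vector))[::-1][:k]
    return [(tickers[i], float(vector[i])) for i in ranked]

# Toy 4-asset correlation matrix with uniform pairwise correlation 0.5.
tickers = ["A", "B", "C", "D"]
corr = np.full((4, 4), 0.5) + 0.5 * np.eye(4)
loadings = top_loadings(corr, tickers)
print(loadings)
```

For reference, a perfectly uniform market factor over 15 assets would have every loading equal to 1/√15 ≈ 0.258, so first-factor loadings near 0.28 to 0.29 are only modestly above uniform.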

The per-year significant factor count reveals substantial time variation:

2010–2012: 1 factor per year
2013–2014: 2 factors per year
2015: 1 factor
2016: 2 factors
2017: 3 factors
2018–2020: 2 factors per year
2021–2024: 3 factors per year

The factor count rose from 1 at the start of the sample to 3 at the end, though not monotonically: it first reached 3 in 2017, fell back to 2 in 2018–2020, and returned to 3 from 2021 onward. The 2021–2024 period stabilizes at 3 factors, matching the full-sample count.

Interpretation

The spectral boundary is unambiguous: three eigenvalues lie well above the Marchenko-Pastur upper bound, while the remaining twelve fall below it. This indicates that large-cap US equity returns are not 15 independent noise series but are governed by a low-dimensional (3-factor) covariance structure. The leading factor explains about 43% of all variance, consistent with a dominant market mode.

The time variation in factor count is economically interpretable. The single-factor regime (2010–2012, 2015) corresponds to periods of relatively homogeneous market behavior: post-crisis recovery and mid-cycle expansion. The two-factor regime (2013–2014, 2016, 2018–2020) introduces a second orthogonal mode, plausibly a growth/value or sector rotation dynamic. The three-factor regime (2017, 2021–2024) emerges during periods of heightened dispersion: the 2017 tech rally and the 2021–2024 inflation/rate cycle. The 2020 pandemic shock itself falls in the two-factor regime. The stabilization at three factors from 2021 onward suggests a persistent regime shift toward greater structural complexity.

The second factor's negative loadings on AMZN, NVDA, and GOOGL—mega-cap tech names—indicate that it captures a technology/growth tilt orthogonal to the broad market. This is consistent with the well-documented growth/value rotation that intensified post-2020 as interest rates rose. The first factor's broad loadings (financials, tech, consumer staples) suggest a market beta or systematic risk component.

The result does NOT support a claim that market-cap weighting shifts the spectral density in a categorical sense—no market-cap-weighted computation was performed. The computation question's second clause (market-cap weighting and categorical container constraints) is unanswered by this result. The evidence is limited to the existence and time variation of a spectral boundary in the equal-weight correlation matrix.

The result does NOT establish a direct correlation between the factor count and "realized liquidity regimes" or "volatility surface curvature"—no liquidity or volatility data were analyzed. The time variation in factor count is consistent with regime-dependent structure, but the specific regimes (liquidity, volatility) named in the computation question are not quantified here. The observed pattern (fewer factors in calm periods, more in volatile/dispersive periods) is plausible but not rigorously tested against external regime indicators.

Relation to the Literature

The spectral boundary result is related to the random-matrix theory framework in [P4], which studies spectral properties of sparse non-Hermitian matrices and provides methods for distinguishing signal from noise in eigenvalue spectra. Correlation matrices are symmetric, so the setting of [P4] is adjacent rather than directly applicable. Our Marchenko-Pastur comparison is a standard application of random-matrix theory to financial correlation matrices, and it indicates that large-cap equity returns exhibit genuine low-rank structure beyond random noise.

The time variation in factor count resonates with [P3]'s finding that covariance dynamics are not stable across aggregation frequencies. While [P3] studies realized covariance matrices at multiple frequencies, our per-year recomputation reveals instability in the number of significant factors over time, suggesting that the dimensionality of the return-generating process itself is regime-dependent. This is a stronger form of instability than parameter drift within a fixed-dimensional model.

The low-dimensional structure (3 factors explaining 64.63% of variance) is consistent with classical factor models in portfolio optimization [P2], which assume a small number of common factors drive returns. Our result provides empirical support for this assumption in the large-cap US equity universe over 2010–2024, though the factor count is not constant.

The categorical framing in the computation question—"categorical container constraint on portfolio morphisms"—draws on [P6] and [P7], which develop categorical frameworks for processes with bidirectional interaction (cybernetic systems) and monoidal contexts (incomplete processes). Our result does not directly engage this formalism: we compute eigenvalues of a correlation matrix, not morphisms in a monoidal category. However, the spectral boundary could be interpreted as a constraint on the "shape" of admissible portfolio transformations—if only 3 of 15 dimensions carry signal, then portfolio rebalancing morphisms that exploit noise dimensions are categorically invalid (they compose with random structure, not signal). This interpretation is speculative and would require a formal categorical model of portfolio construction to test rigorously.

[P8]'s rigorous data-driven computation of spectral properties for Koopman operators is methodologically adjacent: both [P8] and our work compute spectra from data and compare them against a theoretical reference. The parallel is limited, because the Marchenko-Pastur bounds provide an asymptotic null rather than a finite-sample convergence guarantee. In addition, Koopman operators linearize nonlinear dynamics, whereas our PCA diagonalizes a static covariance structure. The time variation we observe (changing factor count) suggests that a Koopman-operator approach, which treats the return-generating process as a dynamical system, might reveal richer structure than static PCA.

[P1]'s quantum portfolio optimization and [P5]'s non-topological persistence are less directly relevant. [P1] addresses combinatorial asset selection, not covariance structure. [P5]'s persistence diagrams for graphs and images do not map cleanly onto eigenvalue spectra, though both are compact summaries of high-dimensional data.

Limitations

Sample size and universe: The universe is small. 16 tickers are listed, n = 15 enters the spectral computation, and the window spans 15 years. The Marchenko-Pastur bounds are asymptotic (large n, large T with q = n/T fixed); with n = 15, finite-sample corrections could shift the bounds slightly. A larger universe (e.g., S&P 500 constituents) would provide a more robust test and potentially reveal finer spectral structure.
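One way to gauge the finite-sample concern is to simulate the null directly: draw independent Gaussian returns with the same n and T, and look at where the largest correlation eigenvalue actually falls relative to the asymptotic bound. The sketch below is an assumption-laden illustration (Gaussian noise, a hypothetical function name, 200 trials), not part of the reported analysis.

```python
import numpy as np

def simulated_null_max_eigenvalues(n, T, n_trials, seed=0):
    """Largest correlation eigenvalue of pure Gaussian noise, per trial."""
    rng = np.random.default_rng(seed)
    maxima = np.empty(n_trials)
    for i in range(n_trials):
        noise = rng.standard_normal((T, n))
        corr = np.corrcoef(noise, rowvar=False)
        maxima[i] = np.linalg.eigvalsh(corr)[-1]  # eigvalsh sorts ascending
    return maxima

maxima = simulated_null_max_eigenvalues(n=15, T=3772, n_trials=200)
asymptotic_upper = (1.0 + np.sqrt(15 / 3772)) ** 2

print("simulated 99th percentile:", np.quantile(maxima, 0.99))
print("asymptotic upper bound:   ", asymptotic_upper)
```

A high quantile of the simulated maxima can serve as an empirical threshold in place of the asymptotic edge. Real returns are heavy-tailed, so a Gaussian null is itself only an approximation.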

In-sample only: The per-year factor counts are in-sample within each year—we recompute PCA on the same data partitioned by year. This does not test out-of-sample stability. A rolling-window out-of-sample forecast of the factor count (e.g., estimate on year t, validate on year t+1) would strengthen the claim of regime-dependent structure.

No external regime indicators: The computation question asks whether the spectral boundary "correlates with realized liquidity regimes and volatility surface curvature." We observe time variation in factor count but do not correlate it with any external liquidity or volatility measure (e.g., bid-ask spreads, VIX term structure, implied volatility skew). The pattern (more factors in 2021–2024) is consistent with higher volatility and dispersion, but this is qualitative, not quantitative.

No market-cap weighting: The computation question's second clause, whether market-cap weighting shifts the spectral density, is unanswered. We computed the equal-weight correlation matrix. A market-cap-weighted analysis would test whether larger firms dominate the spectral structure, potentially reducing the effective dimensionality further. It would have to operate on the covariance matrix or use time-varying weights, because rescaling each return series by a constant weight leaves the correlation matrix unchanged.
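One caveat for any market-cap-weighted follow-up is that correlations are invariant to rescaling each asset's returns by a positive constant, so constant weights only affect the covariance matrix. The check below uses synthetic returns and hypothetical weights to illustrate this.

```python
import numpy as np

rng = np.random.default_rng(2)
returns = rng.standard_normal((500, 4))     # synthetic (T, n) returns
weights = np.array([0.4, 0.3, 0.2, 0.1])    # hypothetical constant cap weights
weighted = returns * weights                # scale each asset's column

corr_equal = np.corrcoef(returns, rowvar=False)
corr_weighted = np.corrcoef(weighted, rowvar=False)
cov_equal = np.cov(returns, rowvar=False)
cov_weighted = np.cov(weighted, rowvar=False)

corr_unchanged = bool(np.allclose(corr_equal, corr_weighted))
cov_unchanged = bool(np.allclose(cov_equal, cov_weighted))
print(corr_unchanged, cov_unchanged)  # True False
```

A weighted spectral test would therefore need to work with the covariance matrix, or with weights that vary over time, for the weighting to change the spectrum.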

No categorical formalism: The "categorical container constraint on portfolio morphisms" is a theoretical framing from [P6] [P7], but we do not construct a categorical model or test a specific constraint. The spectral boundary is a statistical fact; its interpretation as a categorical constraint is a metaphor, not a theorem.

Factor interpretation: The factor loadings (JPM/MSFT/PEP on factor 1, AMZN/NVDA/GOOGL on factor 2) are suggestive but not definitive. We do not regress the factors against known risk premia (market, size, value, momentum) or macroeconomic variables (rates, inflation, GDP growth). The economic content of the factors is inferred from loadings, not validated against external benchmarks.

Stationarity assumption: PCA assumes the covariance structure is stationary within each estimation window. The per-year recomputation reveals non-stationarity across years, but within-year stationarity is not tested. If the covariance structure shifts intra-year (e.g., around FOMC meetings or earnings seasons), the per-year factor count is an average over multiple micro-regimes.

Strengthening the result would require: (1) a larger asset universe, (2) out-of-sample validation of factor count forecasts, (3) correlation with external liquidity/volatility measures, (4) market-cap-weighted spectral analysis, (5) factor regression against known premia, and (6) a formal categorical model of portfolio morphisms to test the "container constraint" hypothesis rigorously.

Research evidence, not investment advice

Categorical Spectralism — spectral decomposition of portfolio return spaces