Empirica Technologies

Categorical Spectralism: Eigenvalue Spectrum of Large-Cap Equity Returns Reveals Low-Rank Factor Structure

Question

Does the eigenvalue spectrum of large-cap equity returns exhibit a low-rank structure (few dominant factors) consistent with categorical equivalence classes, or does it approach the Marchenko-Pastur random-matrix null (no meaningful common structure)?

Method

Principal component analysis was performed on the return correlation matrix of 13 large-cap U.S. equities (AAPL, AMZN, BAC, CVX, GOOGL, JNJ, JPM, KO, MSFT, NVDA, PEP, PFE, XOM) over the period 2010-01-01 through 2024-12-31, yielding 3,772 daily observations. Daily returns were computed from yfinance adjusted-close prices. The eigenvalue spectrum was compared against the Marchenko-Pastur (MP) distribution, which characterizes the eigenvalues of a sample correlation matrix when all population correlations are zero, so that any spread in the eigenvalues arises purely from sampling noise. The MP bounds depend on the ratio q = n_assets / n_obs = 13 / 3772 ≈ 0.0034 through λ± = (1 ± √q)², yielding an upper bound of 1.1209 and a lower bound of 0.886. Eigenvalues exceeding the upper bound are treated as distinguishable from random-matrix noise and as indicating genuine common factors. The analysis was repeated on a rolling per-calendar-year basis (in-sample within each year) to assess time variation in the number of significant factors.
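The bounds quoted above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, assuming the standard MP edge formula for a correlation matrix under the identity null; the function name mp_bounds is illustrative and not taken from the original analysis.

```python
import math

def mp_bounds(n_assets, n_obs):
    """Return (lower, upper) Marchenko-Pastur edges under the identity null."""
    q = n_assets / n_obs          # aspect ratio of the data matrix
    root = math.sqrt(q)
    return (1.0 - root) ** 2, (1.0 + root) ** 2

# Dimensions stated in the Method section: 13 assets, 3,772 daily observations.
lower, upper = mp_bounds(13, 3772)
print(f"q = {13 / 3772:.4f}, lower = {lower:.4f}, upper = {upper:.4f}")
```

Both edges approach 1 as q shrinks, which is why a long daily sample on a small universe gives such a narrow noise band.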

Result

The full-sample eigenvalue spectrum departs clearly from the Marchenko-Pastur null. The top ten eigenvalues are 5.8867, 1.6483, 1.235, 0.7523, 0.6741, 0.5584, 0.4915, 0.4329, 0.3911, and 0.342. Three eigenvalues exceed the MP upper bound of 1.1209, indicating three statistically significant common factors. The first eigenvalue (5.8867) is more than five times the MP threshold, indicating a dominant market-wide factor. The second (1.6483) lies about 47% above the threshold, while the third (1.235) clears it by a narrower margin of about 10%; together they point to a low-rank structure.

The top factor explains 45.28% of total variance; the three significant factors together explain 67.46% of variance. This concentration is inconsistent with the random-matrix null, under which variance would be diffusely distributed across all eigenvalues near unity.
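The factor count and variance shares can be checked directly from the reported eigenvalues. This is a small sketch that uses only the ten eigenvalues listed above and the fact that the eigenvalues of a 13 x 13 correlation matrix sum to 13; the variable names are illustrative.

```python
# Eigenvalues as reported in the text (only the top ten of 13 are listed).
reported = [5.8867, 1.6483, 1.235, 0.7523, 0.6741,
            0.5584, 0.4915, 0.4329, 0.3911, 0.342]
n_assets = 13        # trace of a 13 x 13 correlation matrix
mp_upper = 1.1209    # upper bound quoted in the Method section

significant = [ev for ev in reported if ev > mp_upper]
top_share = 100 * reported[0] / n_assets       # percent of variance, factor 1
sig_share = 100 * sum(significant) / n_assets  # percent, all significant factors
unlisted_sum = n_assets - sum(reported)        # mass left for the 3 unlisted eigenvalues

print(len(significant), round(top_share, 2), round(sig_share, 2), round(unlisted_sum, 4))
```

The residual unlisted_sum is a consistency check on the reported figures: it is the total left over for the three eigenvalues that the text does not list.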

The factor loadings admit an economic interpretation. The first factor loads most heavily on JPM (−0.315), MSFT (−0.308), and BAC (−0.301), suggesting a broad market or financial-sector component; the three values are close to one another and to the equal-weight benchmark 1/√13 ≈ 0.277, which favors the broad-market reading. The second factor loads most heavily on AMZN (−0.424), NVDA (−0.379), and GOOGL (−0.346), consistent with a technology or growth-stock factor. (Sign convention is arbitrary in PCA; the magnitude and relative ordering of loadings carry the economic content.)
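Because eigenvector signs are arbitrary, it can help to fix a convention before comparing loadings across runs or years. The sketch below is one possible convention (flip each eigenvector so that its largest-magnitude loading is positive), demonstrated on synthetic one-factor data rather than the actual returns; pca_loadings and the simulated series are assumptions made for illustration.

```python
import numpy as np

def pca_loadings(returns):
    """Eigen-decompose the correlation matrix; return eigenvalues (descending)
    and eigenvectors with a deterministic sign convention."""
    corr = np.corrcoef(returns, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)            # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Flip each column so that its largest-magnitude entry is positive.
    idx = np.argmax(np.abs(eigvecs), axis=0)
    signs = np.sign(eigvecs[idx, np.arange(eigvecs.shape[1])])
    return eigvals, eigvecs * signs

# Synthetic stand-in for the return matrix: one common factor plus noise.
rng = np.random.default_rng(0)
market = rng.standard_normal((2000, 1))
returns = 0.7 * market + rng.standard_normal((2000, 13))

eigvals, loadings = pca_loadings(returns)
equal_weight = 1 / np.sqrt(13)   # loading of a perfectly uniform factor
print(np.round(loadings[:, 0], 3), round(equal_weight, 3))
```

With a single common driver, the first column comes out close to the equal-weight value for every asset, which is the pattern a pure market factor would produce.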

The rolling per-year analysis shows time variation in factor count. From 2010 through 2015, only one significant factor was detected each year. In 2016, the count rose to two. From 2017 onward, the count fluctuated between two and three, with three factors detected in 2017, 2021, 2022, 2023, and 2024. The increase in significant factor count after 2016 coincides with rising market dispersion and the emergence of distinct sector rotations (technology outperformance, energy volatility, financials' sensitivity to rate cycles). The drop to two factors in 2020 is consistent with the pandemic-driven risk-off shock concentrating cross-sectional variance in fewer common modes, followed by re-emergence of multi-factor structure in 2021.
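A per-year count of this kind can be sketched as follows. The code assumes that each year's threshold is recomputed from that year's own observation count, which the text does not state explicitly; with roughly 252 trading days, q ≈ 0.052 and the upper bound rises to about 1.51, so a yearly eigenvalue must clear a higher bar than in the full sample. The function name and the synthetic data are illustrative.

```python
import numpy as np
import pandas as pd

def yearly_factor_counts(returns):
    """Count eigenvalues above the MP upper bound separately for each calendar year."""
    counts = {}
    for year, block in returns.groupby(returns.index.year):
        n_obs, n_assets = block.shape
        upper = (1 + np.sqrt(n_assets / n_obs)) ** 2   # per-year threshold (assumption)
        eigvals = np.linalg.eigvalsh(block.corr().to_numpy())
        counts[year] = int((eigvals > upper).sum())
    return pd.Series(counts, name="n_factors")

# Synthetic two-year panel with a single common factor, standing in for real returns.
rng = np.random.default_rng(1)
dates = pd.bdate_range("2010-01-01", "2011-12-31")
market = rng.standard_normal((len(dates), 1))
demo = pd.DataFrame(0.8 * market + rng.standard_normal((len(dates), 13)), index=dates)

counts = yearly_factor_counts(demo)
print(counts)
```

On real data, the input would be a DataFrame of daily returns indexed by date with one column per ticker.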

Interpretation

The eigenvalue spectrum decisively rejects the Marchenko-Pastur random-matrix null. The presence of three eigenvalues above the MP upper bound, with the largest eigenvalue more than five times the threshold, indicates that large-cap equity returns are governed by a low-rank factor structure rather than idiosyncratic noise. This finding supports the categorical spectralism hypothesis: equity return spaces admit spectral decomposition into a small number of categorical equivalence classes (factors), each representing a coherent source of common variation.

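The concentration of 67.46% of variance in three factors implies that roughly two-thirds of the variation in the 13-dimensional return space lies in a three-dimensional subspace. The remaining ten eigenvalues all fall below the MP upper bound and therefore provide no evidence of additional common factors. They also fall below the MP lower bound of 0.886 (the largest is 0.7523), which is expected: the eigenvalues of a correlation matrix sum to 13, so three large eigenvalues push the rest below the range implied by the pure-noise null. This low effective dimensionality is the quantitative signature of categorical structure: assets cluster into equivalence classes (a market-wide class, a technology class, and a third class not yet identified from its loadings) rather than varying independently.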

The factor loadings provide economic grounding. The first factor's near-uniform top loadings on JPM, MSFT, and BAC suggest a market-wide or systematic risk component, consistent with the capital asset pricing model's single-factor structure. The second factor's concentration in AMZN, NVDA, and GOOGL isolates technology-sector exposure and may reflect growth-versus-value rotation. The third factor's loadings are not reported; given the universe composition (CVX, XOM, JPM, BAC), it plausibly represents energy or financials, but this cannot be confirmed from the eigenvalue count alone.

The time variation in significant factor count is economically interpretable. The single-factor regime of 2010–2015 reflects the post-crisis environment of synchronized monetary policy and low dispersion, when the market factor dominated. The transition to two and then three factors after 2016 coincides with the divergence of technology from the broader market, the re-emergence of sector-specific drivers (energy price cycles, financial deregulation), and the breakdown of the low-volatility regime. The temporary collapse to two factors in 2020 is consistent with the pandemic's indiscriminate risk-off shock, which compressed cross-sectional variance. The return to three factors in 2021–2024 reflects the restoration of sector differentiation in the post-pandemic recovery.

The result does not support forward return prediction or trading signals. The eigenvalue spectrum characterizes the covariance structure of realized returns, not the conditional expectation of future returns. A low-rank structure implies that portfolio risk can be hedged with a small number of factors, but it does not imply that factor exposures predict alpha. The finding is a statement about the geometry of the return space, not about expected returns.

Relation to the Literature

No closely related papers were retrieved for this specific computation. The result stands on the computed eigenvalue spectrum and its comparison to the Marchenko-Pastur null. The methodology draws on random matrix theory, which has been applied to financial correlation matrices to distinguish signal from noise, but the present analysis is an original empirical application to this specific universe and window. The finding of three significant factors is consistent with the broader empirical asset pricing literature's documentation of multi-factor models (market, size, value, momentum, quality), though the specific factors suggested here (market, technology, and a third tentatively attributed to energy or financials) are tailored to the large-cap universe studied.

Limitations

The sample is restricted to 13 large-cap U.S. equities, a small universe that is homogeneous in size and geography even though it spans several sectors. The low q-ratio (≈ 0.0034) makes the Marchenko-Pastur bounds tight, increasing statistical power to detect factors, but the small cross-section limits the generality of the factor structure. A broader universe (mid-caps, small-caps, international equities, other asset classes) would test whether the three-factor structure is specific to large-cap U.S. equities or a more general feature of equity return spaces.

The analysis is in-sample. The eigenvalue spectrum is computed on the same data used to estimate the correlation matrix, with no out-of-sample validation. The Marchenko-Pastur comparison establishes that the factors are statistically significant (not noise), but it does not establish that the factor structure is stable or predictive. An out-of-sample test would split the data, estimate factors on the first half, and verify that the same eigenvalue structure appears in the second half.
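One way to make the split-half idea concrete is to ask how much second-half variance is captured by directions estimated on the first half, relative to the best the second half could achieve with the same number of directions. The sketch below is one possible implementation under that assumption and is not the procedure used in this analysis; split_half_check and the synthetic three-factor data are hypothetical.

```python
import numpy as np

def split_half_check(returns, n_factors):
    """Variance share of the second-half correlation matrix captured by
    first-half eigenvectors, alongside the second half's own optimum."""
    half = returns.shape[0] // 2
    corr1 = np.corrcoef(returns[:half], rowvar=False)
    corr2 = np.corrcoef(returns[half:], rowvar=False)
    _, vecs1 = np.linalg.eigh(corr1)
    top = vecs1[:, ::-1][:, :n_factors]               # leading first-half directions
    n_assets = corr2.shape[0]
    captured = np.trace(top.T @ corr2 @ top) / n_assets
    best = np.linalg.eigvalsh(corr2)[::-1][:n_factors].sum() / n_assets
    return captured, best

# Synthetic data with a stable structure: one broad factor and two group factors.
rng = np.random.default_rng(2)
n_obs, n_assets = 3772, 13
loadings = np.zeros((n_assets, 3))
loadings[:, 0] = 0.8      # market-like factor on every asset
loadings[:4, 1] = 0.7     # hypothetical group A
loadings[-4:, 2] = 0.7    # hypothetical group B
factors = rng.standard_normal((n_obs, 3))
returns = factors @ loadings.T + rng.standard_normal((n_obs, n_assets))

captured, best = split_half_check(returns, n_factors=3)
print(f"captured {captured:.3f} vs. second-half optimum {best:.3f}")
```

A captured share close to the optimum would suggest a stable factor subspace; a large shortfall would suggest that the first-half factors do not carry over.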

The rolling per-year analysis is also in-sample within each year. Each year's factor count is computed on that year's data alone, with no forward or backward validation. The time variation in factor count could reflect genuine structural change (sector divergence, regime shifts) or sampling variation in short windows. A longer rolling window (e.g., three-year) would smooth sampling noise and clarify whether the increase in factor count after 2016 is persistent.

The factor interpretation relies on loadings for only the top two factors, and only the top three tickers per factor are reported. A complete interpretation would examine all loadings for all three significant factors, assess their economic coherence (do they align with known sector or style classifications?), and test their stability over time. The present analysis establishes that three factors exist but does not fully characterize their economic content.

The Marchenko-Pastur null assumes that the true correlation matrix is the identity (zero correlations). This is a strong null, appropriate for detecting any common structure, but it does not distinguish between different types of structure (e.g., block-diagonal sector structure versus a single dominant market factor). A more refined null, such as a factor model with a specified number of factors, would test whether the data require three factors or whether two would suffice.
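A refined null of this kind can be approximated by simulation. The sketch below fits a k-factor correlation structure from the leading k eigenpairs, simulates Gaussian returns from it, and compares the observed (k+1)-th eigenvalue with a simulated quantile. The PCA-based fit, the Gaussian draws, the function name, and the synthetic input are all simplifying assumptions made for illustration, not part of the original analysis.

```python
import numpy as np

def k_factor_null_quantile(corr, n_obs, k, n_sims=200, level=0.95, seed=0):
    """Return the observed (k+1)-th eigenvalue and the simulated quantile of
    the (k+1)-th eigenvalue under a fitted k-factor null."""
    n_assets = corr.shape[0]
    vals, vecs = np.linalg.eigh(corr)
    vals, vecs = vals[::-1], vecs[:, ::-1]                  # descending order
    common = (vecs[:, :k] * vals[:k]) @ vecs[:, :k].T       # rank-k common part
    specific = np.clip(1.0 - np.diag(common), 0.0, None)    # restores unit diagonal
    null_cov = common + np.diag(specific)
    rng = np.random.default_rng(seed)
    draws = np.empty(n_sims)
    for i in range(n_sims):
        sample = rng.multivariate_normal(np.zeros(n_assets), null_cov, size=n_obs)
        sim_vals = np.linalg.eigvalsh(np.corrcoef(sample, rowvar=False))
        draws[i] = sim_vals[-(k + 1)]                       # (k+1)-th largest
    return vals[k], np.quantile(draws, level)

# Synthetic two-factor data: a one-factor null should be rejected.
rng = np.random.default_rng(3)
n_obs, n_assets = 1500, 13
group = np.zeros(n_assets)
group[:5] = 0.9                                             # hypothetical sector block
factors = rng.standard_normal((n_obs, 2))
returns = (0.8 * factors[:, [0]] + factors[:, [1]] * group
           + rng.standard_normal((n_obs, n_assets)))
corr = np.corrcoef(returns, rowvar=False)

observed, q95 = k_factor_null_quantile(corr, n_obs, k=1)
print(f"observed 2nd eigenvalue {observed:.3f}, null 95% quantile {q95:.3f}")
```

Applied with k = 2 to the actual correlation matrix, the same comparison would indicate whether the third eigenvalue (1.235) is larger than a two-factor structure plus sampling noise would typically produce.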

The daily return frequency may introduce microstructure noise (bid-ask bounce, non-synchronous trading), which can distort correlation estimates and eigenvalue spectra. Monthly returns would reduce microstructure effects but would also reduce the sample size (from 3,772 to ~180 observations), widening the Marchenko-Pastur bounds and reducing power. The choice of daily data prioritizes sample size over noise reduction; the large n_obs (3,772) relative to n_assets (13) makes the q-ratio small and the MP bounds tight, but microstructure noise remains a potential confound.
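The trade-off between sample size and bound width can be made explicit by evaluating the MP edges at different observation counts. In the sketch below, 3,772 and 180 come from the text, while 252 and 756 are assumed approximations for one-year and three-year daily windows, included only for comparison.

```python
import math

def mp_edges(n_assets, n_obs):
    """Lower and upper Marchenko-Pastur edges for q = n_assets / n_obs."""
    root = math.sqrt(n_assets / n_obs)
    return (1 - root) ** 2, (1 + root) ** 2

scenarios = {
    "daily, full sample": 3772,          # from the text
    "monthly, full sample": 180,         # from the text (~15 years x 12)
    "daily, one-year window": 252,       # assumed trading-day count
    "daily, three-year window": 756,     # assumed trading-day count
}

edges = {label: mp_edges(13, n_obs) for label, n_obs in scenarios.items()}
for label, (lo, hi) in edges.items():
    print(f"{label:26s} lower = {lo:.4f}  upper = {hi:.4f}")
```

Under these figures, the third full-sample eigenvalue of 1.235 would sit inside the monthly noise band, which illustrates the loss of power the paragraph describes.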

The universe excludes bonds, commodities, currencies, and other asset classes. A multi-asset universe would test whether the low-rank structure is equity-specific or a general feature of financial returns. Cross-asset factors (e.g., a global risk-off factor spanning equities, bonds, and currencies) might emerge, or the factor count might increase as new sources of variation (interest rate risk, commodity price risk) enter the return space.

The analysis does not address the economic source of the factors. The reported loadings suggest market and technology factors, with the third tentatively attributed to energy or financials, but these are post-hoc interpretations. A formal factor model (e.g., regressing returns on observable characteristics like sector, size, value) would test whether the statistical factors align with economic fundamentals or whether they are purely statistical artifacts of the covariance structure.

The result is a quantified bound on the effective dimensionality of the return space, not a causal model. The eigenvalue spectrum indicates that three eigenvalues exceed the noise bound of the identity null, but it does not establish that exactly three factors are required, identify what those factors are (market risk, sector risk, style risk), or explain why they exist (investor preferences, production technologies, information structures). Causal identification would require additional structure (e.g., instrumental variables, natural experiments, structural models).

Strengthening the result would require: (1) expanding the universe to test generality; (2) out-of-sample validation of the factor structure; (3) longer rolling windows to distinguish structural change from sampling variation; (4) complete factor interpretation (all loadings, all factors); (5) comparison to alternative nulls (e.g., a two-factor model); (6) robustness to return frequency (monthly, weekly); (7) multi-asset extension; (8) formal factor model linking statistical factors to economic fundamentals; (9) causal identification of factor sources. The present result establishes the existence and count of significant factors but leaves their economic interpretation and predictive power as open questions.

Categorical Spectralism — spectral decomposition of portfolio return spaces