Gravity models for correlation prediction: Physics-inspired asset distance metrics and portfolio construction

Spectral Decomposition of Large-Cap Equity Returns: Evidence for a Dominant Market Factor and Time-Varying Dimensionality

Question

Does the eigenvalue spectrum of large-cap equity returns reveal a dominant gravitational factor (market-cap-weighted common mode) statistically separable from random-matrix noise, and how many independent spectral modes persist after removing the largest eigenvalue?

Method

We computed the eigenvalue spectrum of the return correlation matrix for 10 large-cap U.S. equities (AAPL, AMZN, CVX, GOOGL, JNJ, JPM, KO, MSFT, PG, XOM) using daily adjusted-close returns from yfinance over 2010-01-01 to 2024-12-31 (n = 3772 observations). The null hypothesis is the Marchenko-Pastur (MP) random-matrix distribution: under the assumption that returns are independent Gaussian noise, eigenvalues of the sample correlation matrix concentrate in a bounded interval [λ_min, λ_max] = [(1 − √q)², (1 + √q)²] determined by the ratio q = n_assets / n_obs = 10 / 3772 ≈ 0.0027. For this q, the MP bounds are [0.8997, 1.1056]. We treat eigenvalues exceeding the upper bound 1.1056 as distinguishable from noise and as evidence of genuine covariance structure; because the bound is an asymptotic edge rather than a test statistic, this is a screening rule and not a formal significance level. We applied principal component analysis (PCA) to extract the full eigenvalue spectrum and factor loadings, then counted significant factors (those above the MP threshold) and examined their loadings. To assess time variation, we recomputed the significant factor count separately for each calendar year (in-sample within each year) using the same data and method.
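The thresholding step described above can be sketched in a few lines. This is a minimal illustration, not the code used for the reported numbers: the function names mp_bounds and count_significant are hypothetical, and the input is assumed to be an (n_obs, n_assets) array of daily returns that has already been downloaded and cleaned.

```python
import numpy as np

def mp_bounds(n_assets: int, n_obs: int) -> tuple[float, float]:
    """Marchenko-Pastur support edges for a pure-noise correlation matrix."""
    q = n_assets / n_obs
    root = np.sqrt(q)
    return (1.0 - root) ** 2, (1.0 + root) ** 2

def count_significant(returns: np.ndarray) -> tuple[np.ndarray, int]:
    """Return eigenvalues (descending) and how many exceed the MP upper edge.

    `returns` is assumed to be an (n_obs, n_assets) array of daily returns.
    """
    n_obs, n_assets = returns.shape
    corr = np.corrcoef(returns, rowvar=False)
    eigvals = np.linalg.eigvalsh(corr)[::-1]  # eigvalsh sorts ascending
    _, upper = mp_bounds(n_assets, n_obs)
    return eigvals, int(np.sum(eigvals > upper))

# Bounds for the sample dimensions quoted in the text (10 assets, 3772 days).
lower, upper = mp_bounds(10, 3772)
print(round(lower, 4), round(upper, 4))
```

Calling mp_bounds with the sample dimensions from the text should reproduce the quoted interval to four decimals, which is a quick sanity check on the value of q.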

Result

The 10 eigenvalues, in descending order, are: 4.6999, 1.4355, 1.0925, 0.5220, 0.4955, 0.4741, 0.3945, 0.3756, 0.3462, 0.1641. They sum to 10, the trace of the correlation matrix, up to rounding. The Marchenko-Pastur upper bound is 1.1056 and the lower bound is 0.8997. Two eigenvalues exceed the upper bound: λ₁ = 4.6999 and λ₂ = 1.4355. The third eigenvalue (1.0925) lies just below the threshold and is statistically consistent with noise. The number of significant factors is therefore 2.

The dominant eigenvalue λ₁ = 4.6999 accounts for 47% of total variance (variance_explained_top1 = 0.47). The two significant factors together explain 61.35% of variance (variance_explained_significant = 0.6135). The first factor's top three loadings are MSFT (−0.349), JPM (−0.334), and CVX (−0.330), indicating a broad market mode with near-uniform negative loadings (the sign is arbitrary in PCA; the uniformity signals a common factor). The second factor's top three loadings are AMZN (−0.480), XOM (+0.395), and GOOGL (−0.382), revealing a sector-rotation or style contrast: technology/consumer (AMZN, GOOGL) load negatively while energy (XOM) loads positively.
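The factor count and variance shares reported above follow directly from the listed eigenvalues. The short check below uses only numbers quoted in the text; the variable names are illustrative.

```python
import numpy as np

# Eigenvalues and MP upper bound exactly as quoted in the text.
eigvals = np.array([4.6999, 1.4355, 1.0925, 0.5220, 0.4955,
                    0.4741, 0.3945, 0.3756, 0.3462, 0.1641])
mp_upper = 1.1056

share = eigvals / eigvals.sum()                  # variance share per mode
n_significant = int((eigvals > mp_upper).sum())  # modes above the threshold
top1_share = float(share[0])
significant_share = float(share[:n_significant].sum())

print(n_significant, round(top1_share, 4), round(significant_share, 4))
```

Because the eigenvalues of a correlation matrix sum to the number of assets, each variance share is simply the eigenvalue divided by 10.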

The significant factor count, recomputed for each calendar year, shows pronounced time variation:

2010–2016: 1 significant factor every year.
2017–2018: 2 factors.
2019: 1 factor.
2020: 2 factors.
2021: 3 factors.
2022: 2 factors.
2023–2024: 3 factors.

The dimensionality of the covariance structure increased markedly after 2020, with three significant factors emerging in 2021, 2023, and 2024, compared to a single dominant factor in the 2010–2016 period.
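A per-calendar-year count of this kind could be computed roughly as follows. The sketch assumes that the MP threshold is recomputed from each year's own observation count (the text says only that the same method was applied), and the function name yearly_factor_counts and the synthetic one-factor demo data are hypothetical. Under that assumption, a year of about 252 trading days gives q ≈ 0.04 and an upper bound near 1.44, noticeably higher than the full-sample 1.1056.

```python
import numpy as np
import pandas as pd

def yearly_factor_counts(returns: pd.DataFrame) -> pd.Series:
    """Count eigenvalues above the MP upper edge within each calendar year.

    `returns` is assumed to have a DatetimeIndex and one column per asset.
    """
    counts = {}
    for year, block in returns.groupby(returns.index.year):
        n_obs, n_assets = block.shape
        # Assumption: the threshold uses this year's own n_obs.
        upper = (1.0 + np.sqrt(n_assets / n_obs)) ** 2
        eigvals = np.linalg.eigvalsh(block.corr().to_numpy())
        counts[year] = int((eigvals > upper).sum())
    return pd.Series(counts, name="n_significant")

# Synthetic demo data (NOT the study's returns): one common factor plus noise.
rng = np.random.default_rng(0)
dates = pd.bdate_range("2010-01-01", "2011-12-31")
market = rng.normal(size=(len(dates), 1))
demo = pd.DataFrame(market + rng.normal(size=(len(dates), 10)), index=dates)
counts = yearly_factor_counts(demo)
print(counts)
```

The higher per-year threshold matters when comparing the annual counts with the full-sample count of 2: an eigenvalue of 1.4355 would sit right at the edge of a one-year noise band.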

Interpretation

Dominant gravitational factor. The largest eigenvalue (4.6999) is 4.25 times the Marchenko-Pastur upper bound, providing strong evidence for a dominant common mode in large-cap equity returns. This mode accounts for 47% of variance, and its largest loadings (0.33 to 0.35 in magnitude) sit close to the equal-loading value 1/√10 ≈ 0.316, consistent with a market-wide "gravitational" factor that pulls all large-cap stocks in the same direction. The metaphor of gravity is apt: just as gravitational attraction scales with mass and diminishes with distance, the dominant eigenvalue captures the aggregate pull of market-wide information (earnings cycles, monetary policy, risk appetite) that affects all large-cap names to a similar degree. Because no single stock dominates the factor, the mode reads as systemic rather than idiosyncratic.
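The near-uniformity claim can be checked against the three loadings quoted in the Result section. The sketch below assumes the eigenvector is normalized to unit length (the usual PCA convention, and an assumption here); the helper uniformity is a hypothetical name for the absolute cosine between a loading vector and the equal-weight direction.

```python
import numpy as np

def uniformity(loadings) -> float:
    """Absolute cosine between a loading vector and the equal-weight direction."""
    v = np.asarray(loadings, dtype=float)
    v = v / np.linalg.norm(v)
    equal = np.ones_like(v) / np.sqrt(v.size)
    return float(abs(v @ equal))  # abs() because the PCA sign is arbitrary

n_assets = 10
equal_loading = 1.0 / np.sqrt(n_assets)          # loading if all assets were equal
reported_top3 = np.array([0.349, 0.334, 0.330])  # |loadings| of MSFT, JPM, CVX

# A unit-norm eigenvector has squared loadings summing to 1, so the seven
# unreported loadings must have this root-mean-square magnitude.
rms_rest = float(np.sqrt((1.0 - np.sum(reported_top3 ** 2)) / (n_assets - 3)))
print(round(equal_loading, 3), round(rms_rest, 3))
```

Under the unit-norm assumption, the seven unreported loadings average about 0.31 in root-mean-square terms, close to the equal-loading value, which supports the reading of the first mode as a broad common factor. Given the full eigenvector, uniformity would return a value near 1 for such a mode.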

Persistence of a second factor. The second eigenvalue (1.4355) is 30% above the noise threshold, indicating a secondary mode that clears the MP bound by a clear margin. Its loadings reveal a sector/style split: technology and consumer discretionary (AMZN, GOOGL) load negatively, while energy (XOM) loads positively. This factor captures rotation between growth-oriented and value/commodity-linked sectors, a well-documented source of cross-sectional variation orthogonal to the market factor. That only two of the 10 eigenvalues survive the noise threshold implies that large-cap equity covariance is low-dimensional: the market mode and one sector contrast account for 61% of variance, and the remaining 39% cannot be separated, by this criterion, from idiosyncratic noise or transient correlations at the level of sampling error.

Time-varying dimensionality. The per-year factor count reveals a structural shift. From 2010 to 2016, a single factor dominated, suggesting that large-cap returns moved nearly in lockstep (high market beta, low dispersion). The emergence of a second factor in 2017–2018 coincides with the rise of sector-specific narratives (tech outperformance, energy underperformance). The jump to three factors in 2021, 2023, and 2024 suggests increased cross-sectional differentiation, possibly driven by divergent monetary policy impacts (rate-sensitive growth stocks vs. inflation-hedging commodities) or idiosyncratic firm-level events gaining prominence. This time variation is economically meaningful: the covariance structure is not static but responds to regime changes in market dynamics.

What the result does NOT support. The result does not imply that the dominant factor is literally market-cap-weighted in the sense of the CAPM beta. The loadings are uniform rather than proportional to market cap: the three largest (MSFT, JPM, CVX) are nearly equal in magnitude even though the three companies differ widely in capitalization. The result also does not predict future returns or validate any portfolio construction rule; it is a descriptive decomposition of realized covariance. The out-of-sample stability of the factor structure is not tested here; the per-year counts are in-sample within each year. Finally, the result does not address whether the factors are priced (i.e., whether they carry risk premia), only that they exist as statistically significant sources of covariation.

Relation to the Literature

The finding of a dominant market factor aligns with the network-based risk management literature [P1], which emphasizes that equity portfolios exhibit clustered correlation structures driven by systemic factors. Our eigenvalue decomposition provides a complementary, model-free quantification: the dominant eigenvalue is the spectral signature of the "market cluster" that network methods visualize as a densely connected core. The result also resonates with [P2]'s geometric diversification framework, which seeks portfolios equidistant from single-asset vertices. Our finding that 47% of variance concentrates in one mode with near-uniform loadings implies that naive equal-weighting (the geometric center of the simplex) is almost parallel to the dominant eigenvector and therefore diversifies idiosyncratic risk but not the common mode: reducing exposure to that mode requires tilting away from the dominant eigenvector.

The time-varying factor count extends [P6]'s observation that network topology (e.g., minimum spanning tree length) predicts volatility. Our result suggests a mechanism: as the number of significant factors increases, the covariance structure becomes more complex, potentially raising forecast uncertainty and realized volatility. The 2021–2024 regime of two to three factors may correspond to periods when network-based volatility forecasts outperform univariate models, as [P6] documents for European and Asian markets.

The sector-rotation second factor (tech vs. energy) is consistent with [P8]'s topological data analysis of stock markets, which identifies persistent homology features (1-simplices, 2-simplices) beyond pairwise correlations. Our second eigenvalue captures a "face" in the correlation simplex, namely a triangular relationship among tech, energy, and the market, that a minimum spanning tree (which retains only n−1 edges) would collapse. The result underscores [P8]'s argument that higher-dimensional topological features (k-simplices for k ≥ 2) contain information lost in tree-based network reductions.

Our finding of low effective dimensionality (2 factors explain 61% of variance in 10 assets) contrasts with [P10]'s distributionally robust portfolio optimization under copula ambiguity. [P10] assumes the joint distribution is uncertain but the marginals are known; our result suggests that for large-cap equities the copula is highly constrained (dominated by a single factor), so ambiguity in the copula may be less consequential than ambiguity in the marginals (e.g., tail risk in individual stocks). This has implications for robust portfolio design: if the covariance structure is low-rank and stable, worst-case optimization over copulas may be overly conservative. Stability is not established here, given the year-to-year variation in the factor count.

The result does not directly engage [P4]'s geopolitical fragmentation metrics or [P5]'s neural-network price prediction, as our focus is covariance structure rather than return forecasting or cross-border flows. However, the time-varying factor count (one factor in 2010–2016, three factors in 2021, 2023, and 2024) is consistent with [P4]'s observation of increasing fragmentation in trade and policy after 2022: if geopolitical shocks create sector-specific or regional divergence, we would expect the equity covariance matrix to exhibit more independent modes, as observed.

Limitations

Sample size and universe. The analysis uses 10 large-cap U.S. equities, a deliberately small universe to ensure a low q-ratio (about 0.003) and tight Marchenko-Pastur bounds. The result may not generalize to mid-cap, small-cap, or international equities, where idiosyncratic risk is larger and the dominant factor may be weaker. A larger universe (e.g., 100 stocks) could reveal additional significant factors, but it would also increase q and widen the MP bounds (raising the threshold an eigenvalue must clear), so a longer time series would be needed to maintain statistical power.

In-sample decomposition. The eigenvalue spectrum is computed on the full 2010–2024 sample (or on per-year subsamples for the annual count). This is an in-sample decomposition: we have not tested whether the factor structure is stable out-of-sample or whether the dominant eigenvector predicts future covariance. An out-of-sample test would split the data, estimate factors on a training window, and measure their explanatory power on a holdout window. Such a test would strengthen the claim that the factors are persistent features rather than sample-specific artifacts.
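One way such a holdout test could look is sketched below. It is an illustration under stated assumptions rather than a test that was run: the function name oos_variance_explained is hypothetical, the holdout window is standardized with its own mean and standard deviation, and the demo data are synthetic.

```python
import numpy as np

def oos_variance_explained(train: np.ndarray, test: np.ndarray, k: int) -> float:
    """Share of holdout variance captured by the top-k training eigenvectors.

    Both inputs are assumed to be (n_obs, n_assets) return arrays.
    """
    corr_train = np.corrcoef(train, rowvar=False)
    _, vecs = np.linalg.eigh(corr_train)   # eigenvalues in ascending order
    top = vecs[:, ::-1][:, :k]             # top-k training eigenvectors
    # Assumption: standardize the holdout window with its own moments.
    z = (test - test.mean(axis=0)) / test.std(axis=0)
    projected = z @ top
    return float(projected.var(axis=0).sum() / z.var(axis=0).sum())

# Synthetic demo data (NOT the study's returns): one common factor plus noise.
rng = np.random.default_rng(0)
market = rng.normal(size=(2000, 1))
demo = market + rng.normal(size=(2000, 10))
train, test = demo[:1500], demo[1500:]
print(round(oos_variance_explained(train, test, k=1), 3))
```

If the holdout share for k = 2 stayed close to the in-sample 61%, that would support persistence of the two-factor structure; a sharp drop would point toward sample-specific artifacts.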

Gaussian assumption. The Marchenko-Pastur benchmark used here assumes returns are i.i.d. Gaussian. Equity returns exhibit fat tails, volatility clustering, and autocorrelation, all of which violate this assumption. Fat tails can inflate the largest noise eigenvalues (spurious factors), while serial dependence reduces the effective number of independent observations, so the nominal bounds may misstate the true dimensionality. Robust random-matrix benchmarks (e.g., bootstrapped null distributions or heavy-tailed MP variants) would provide more conservative significance thresholds.
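A simple empirical null of the kind mentioned above can be built by permuting each asset's return series independently. The sketch below is one possible version, not the benchmark used in this note: the function name permutation_threshold and its defaults are hypothetical. Shuffling preserves each asset's marginal distribution (including fat tails) while destroying cross-sectional correlation; it also destroys serial dependence, so a block bootstrap would be needed to retain volatility clustering.

```python
import numpy as np

def permutation_threshold(returns, n_perm=200, quantile=0.95, seed=0) -> float:
    """Quantile of the largest eigenvalue under independently shuffled columns."""
    data = np.asarray(returns, dtype=float)
    rng = np.random.default_rng(seed)
    n_assets = data.shape[1]
    max_eigs = np.empty(n_perm)
    for i in range(n_perm):
        # Shuffle every column separately to break cross-asset alignment.
        shuffled = np.column_stack(
            [rng.permutation(data[:, j]) for j in range(n_assets)]
        )
        corr = np.corrcoef(shuffled, rowvar=False)
        max_eigs[i] = np.linalg.eigvalsh(corr)[-1]  # largest eigenvalue
    return float(np.quantile(max_eigs, quantile))
```

An observed eigenvalue would then be compared with this data-driven threshold in place of the analytic MP upper bound.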

Factor interpretation. We interpret the first factor as a market mode and the second as sector rotation based on loadings, but this is post-hoc labeling. PCA factors are mathematical constructs (orthogonal directions of maximum variance) and need not correspond to economic factors (market, value, momentum). A formal factor model (e.g., regressing returns on the Fama-French factors) would test whether the dominant eigenvector is indeed the market factor or a linear combination of multiple priced factors.

Rolling window granularity. The per-year factor count is coarse (annual recomputation on non-overlapping calendar years). An overlapping rolling window (e.g., 252 trading days, advanced daily or monthly) would reveal intra-year dynamics and potential regime switches (e.g., a crisis month where dimensionality changes abruptly). The current result shows that dimensionality increased on average after 2020 but does not pinpoint the timing or duration of high-dimensional regimes.
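A rolling-window version of the factor count could be sketched as follows. The function name rolling_factor_counts and the 21-day step are assumptions chosen for illustration; only the 252-day window length comes from the text.

```python
import numpy as np
import pandas as pd

def rolling_factor_counts(returns: pd.DataFrame, window: int = 252,
                          step: int = 21) -> pd.Series:
    """Count eigenvalues above the MP upper edge on overlapping windows.

    Each count is stamped with the last date of its window.
    """
    n_assets = returns.shape[1]
    upper = (1.0 + np.sqrt(n_assets / window)) ** 2  # same edge for every window
    values = returns.to_numpy()
    out = {}
    for end in range(window, len(values) + 1, step):
        corr = np.corrcoef(values[end - window:end], rowvar=False)
        out[returns.index[end - 1]] = int((np.linalg.eigvalsh(corr) > upper).sum())
    return pd.Series(out, name="n_significant")
```

Because consecutive windows share most of their observations, the resulting series is strongly autocorrelated, so changes in the count are better read as gradual regime indicators than as independent observations.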

Strengthening the result. The result would be strengthened by: (1) an out-of-sample covariance forecast test (does the dominant eigenvector predict next-period correlations?); (2) a larger universe (50–100 stocks) to test whether the two-factor structure scales; (3) a bootstrap or permutation test to confirm that λ₁ and λ₂ exceed noise under realistic (non-Gaussian) return distributions; (4) a comparison to a factor model (does the dominant eigenvector align with the market portfolio or a Fama-French factor?); and (5) a finer rolling window to identify regime transitions.

Research evidence, not investment advice.