Categorical and Structural Equivalence as Hedging Strategy: Empirical Evidence from SPY-TLT Correlation Dynamics

Question

Do US equities (SPY) and long-duration US Treasuries (TLT) exhibit statistically significant negative correlation over the 2010–2024 period, and what is the magnitude and confidence interval of that relationship? This tests whether two structurally isomorphic risk-hedging instruments—equity duration (exposure to growth and discount-rate shocks) and bond duration (exposure to nominal rate and inflation expectations)—behave as predicted by categorical payoff equivalence under rate-regime constraints.

Method

We computed the Pearson correlation coefficient between daily returns for SPY (SPDR S&P 500 ETF Trust) and TLT (iShares 20+ Year Treasury Bond ETF) over the window 2010-01-01 through 2024-12-31, yielding 3,772 paired daily observations. Returns were derived from daily adjusted-close prices obtained via yfinance. Statistical significance was assessed with a 2,000-iteration permutation test (distribution-free null hypothesis: zero correlation, simulated by randomly re-pairing the two return series), and the 95% confidence interval was constructed from 2,000 bootstrap resamples of the observed pairs. As a robustness check against outliers and non-linear monotonic dependence, we also computed the Spearman rank correlation. To assess time variation, we recomputed the Pearson correlation separately within each calendar year; these per-year values are in-sample descriptive statistics.
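
The procedure described above can be sketched as follows. This is a minimal illustration, not the code used for the reported numbers: the data are synthetic stand-ins rather than SPY/TLT returns, the function names are hypothetical, and two details the text does not specify are assumed here (a two-sided permutation p-value with the (hits + 1) / (iterations + 1) convention, and a percentile bootstrap interval).

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation of two equal-length 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])

def permutation_pvalue(x, y, n_iter=2000, seed=0):
    """Two-sided permutation p-value for H0: zero correlation.

    Shuffling y breaks the pairing while preserving both marginals.
    The +1 terms count the observed statistic as one permutation
    (an assumed convention), so the smallest reportable p-value is
    1 / (n_iter + 1).
    """
    rng = np.random.default_rng(seed)
    observed = abs(pearson_r(x, y))
    hits = sum(
        abs(pearson_r(x, rng.permutation(y))) >= observed
        for _ in range(n_iter)
    )
    return (hits + 1) / (n_iter + 1)

def bootstrap_ci(x, y, n_iter=2000, level=0.95, seed=0):
    """Percentile bootstrap CI: resample (x, y) pairs with replacement."""
    rng = np.random.default_rng(seed)
    n = len(x)
    stats = np.empty(n_iter)
    for i in range(n_iter):
        idx = rng.integers(0, n, size=n)  # same indices keep pairs intact
        stats[i] = pearson_r(x[idx], y[idx])
    alpha = (1.0 - level) / 2.0
    lo, hi = np.quantile(stats, [alpha, 1.0 - alpha])
    return float(lo), float(hi)

# Synthetic stand-in data (NOT the SPY/TLT returns): two series with a
# built-in negative dependence, sized like the study's sample.
rng = np.random.default_rng(42)
n = 3772
spy_like = rng.normal(0.0, 0.01, n)
tlt_like = -0.3 * spy_like + rng.normal(0.0, 0.009, n)

r = pearson_r(spy_like, tlt_like)
p = permutation_pvalue(spy_like, tlt_like)
ci_lo, ci_hi = bootstrap_ci(spy_like, tlt_like)
print(f"r = {r:.4f}, p = {p:.4f}, 95% CI = [{ci_lo:.4f}, {ci_hi:.4f}]")
```

Resampling whole pairs treats the daily observations as exchangeable; a block bootstrap would be one way to relax that assumption if serial dependence were a concern.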

This design measures the linear co-movement of the two return series and tests it against a null of no association, without assuming that returns are normally distributed. The permutation p-value indicates whether the observed correlation is distinguishable from chance; the bootstrap confidence interval quantifies estimation uncertainty given the observed joint distribution.

Result

The Pearson correlation coefficient is r = –0.3224 (95% CI [–0.3749, –0.2683], p = 0.0005, n = 3,772). The Spearman rank correlation is ρ = –0.2774, of the same sign and similar magnitude, which indicates that the negative relationship is not driven by a handful of extreme observations. The permutation test rejects the null hypothesis of zero correlation; p = 0.0005 sits at the resolution limit of a 2,000-iteration test (roughly 1/2,000), so the test cannot distinguish it from a much smaller value. The confidence interval excludes zero by a wide margin, indicating a statistically significant and economically meaningful negative association.
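
The role of the Spearman check can be illustrated with a small sketch. It computes Spearman's ρ as the Pearson correlation of ranks and compares how much each statistic moves when a single extreme observation is injected. The data are synthetic and the helper names are hypothetical; nothing here reproduces the SPY/TLT figures.

```python
import numpy as np

def ranks(a):
    """Ranks 1..n of a 1-D array (assumes no ties, as with continuous data)."""
    order = np.argsort(a)
    out = np.empty(len(a), dtype=float)
    out[order] = np.arange(1, len(a) + 1)
    return out

def pearson_r(x, y):
    """Pearson correlation of two equal-length 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])

def spearman_rho(x, y):
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))

# Synthetic, negatively related series (illustrative only).
rng = np.random.default_rng(7)
n = 500
x = rng.normal(0.0, 0.01, n)
y = -0.3 * x + rng.normal(0.0, 0.009, n)

# Inject one extreme joint outlier: both series jump together.
x_out, y_out = x.copy(), y.copy()
x_out[0], y_out[0] = 0.5, 0.5

pearson_shift = abs(pearson_r(x_out, y_out) - pearson_r(x, y))
spearman_shift = abs(spearman_rho(x_out, y_out) - spearman_rho(x, y))
print(f"Pearson shift:  {pearson_shift:.3f}")
print(f"Spearman shift: {spearman_shift:.3f}")
```

Because ranks cap the influence of any single observation, agreement in sign and rough magnitude between the two statistics is evidence that the Pearson estimate is not dominated by a few extreme days.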

The per-year Pearson correlations reveal pronounced time variation in the strength of the hedging relationship:

  • 2010–2012: Strong negative correlation (r = –0.555, –0.709, –0.652), consistent with the post-crisis environment where Treasury duration served as a reliable equity hedge during risk-off episodes.
  • 2013–2020: Moderate negative correlation (r ranging from –0.214 to –0.477). The weakest year was 2013 (r = –0.214), the "taper tantrum" year, when rising rate expectations pressured both equities and long bonds.
  • 2021–2024: Breakdown of the negative correlation. In 2021, the relationship weakened sharply (r = –0.139) as inflation concerns began to dominate. In 2022, it turned positive (r = +0.085), coinciding with the Federal Reserve's aggressive rate-hiking cycle, during which both equities and long-duration bonds sold off simultaneously. In 2023 and 2024, the correlation remained small and positive (r = +0.126, +0.061), indicating that the traditional equity-bond hedge relationship was suspended or inverted during the inflation-and-rate-shock regime.
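
A per-calendar-year breakdown of this kind can be sketched with pandas as below. The example is an illustration under stated assumptions: the returns are synthetic with an artificial sign change built in after 2021, the dates follow a plain business-day calendar rather than the exchange calendar, and `yearly_pearson` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def yearly_pearson(returns, a, b):
    """Pearson correlation of columns a and b within each calendar year."""
    groups = returns.groupby(returns.index.year)
    return pd.Series({int(year): g[a].corr(g[b]) for year, g in groups})

# Synthetic stand-in returns (NOT SPY/TLT data). The dependence is
# negative through 2021 and positive afterwards, purely by construction.
dates = pd.bdate_range("2010-01-01", "2024-12-31")
rng = np.random.default_rng(0)
spy = rng.normal(0.0, 0.01, len(dates))
slope = np.where(dates.year <= 2021, -0.5, 0.3)
tlt = slope * spy + rng.normal(0.0, 0.008, len(dates))
returns = pd.DataFrame({"SPY": spy, "TLT": tlt}, index=dates)

per_year = yearly_pearson(returns, "SPY", "TLT")
print(per_year.round(3))
```

Each yearly value uses only that year's observations, so the series is descriptive of time variation within the sample rather than a forecast.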

The full-sample negative correlation (r = –0.3224) thus pools two distinct regimes: a negative-correlation regime (2010–2020, strong in 2010–2012 and moderate thereafter) and a near-zero or positive-correlation regime (2021–2024). The confidence interval [–0.3749, –0.2683] reflects the precision of the full-sample estimate but does not capture the regime dependence visible in the per-year series.

Interpretation

The full-sample result (r = –0.3224, p = 0.0005) confirms that, on average over 2010–2024, SPY and TLT returns moved in opposite directions with moderate strength. This is consistent with the categorical payoff equivalence hypothesis: equities and long-duration Treasuries are structurally isomorphic hedging instruments under the assumption that equity risk-off episodes coincide with flights to quality (lower nominal yields, higher bond prices). The negative correlation quantifies how consistently TLT's daily returns have moved against SPY's; it does not by itself measure how much of an SPY drawdown TLT offset.

However, the time variation in the per-year correlations is the substantive finding. The relationship was strongest in the immediate post-crisis years (2010–2012, r ≈ –0.6 to –0.7), when monetary policy was accommodative, inflation was subdued, and rate volatility was low. During this period, the "duration hedge" worked as textbook theory predicts: equity sell-offs drove investors into long Treasuries, compressing yields and lifting TLT.

The relationship weakened but remained negative throughout 2013–2020 (r ≈ –0.2 to –0.5). The weakest reading, 2013 (r = –0.214), coincides with the taper tantrum, when rising rate expectations hurt both equities (via higher discount rates) and long bonds (via duration losses): a regime where the common factor (rate expectations) dominated the hedging mechanism. The further weakening in 2021 (r = –0.139) marks the onset of the inflation regime shift.

The 2022–2024 regime is the critical departure. In 2022, the correlation turned positive (r = +0.085), meaning SPY and TLT tended, weakly, to move in the same direction, with both falling as the Fed hiked rates aggressively and inflation expectations repriced. This is a failure of the categorical equivalence hypothesis under the specific conditions of simultaneous inflation shock and monetary tightening. Long-duration bonds, far from hedging equity risk, amplified portfolio losses because the dominant risk factor (real rates rising, inflation expectations volatile) penalized both asset classes. The small positive correlations in 2023–2024 (r = +0.126, +0.061) suggest the relationship has not yet returned to the pre-2021 regime, even as inflation moderated.

What the result does not support: The full-sample negative correlation does not imply that TLT is a reliable hedge in all regimes. The per-year dynamics show that the hedge works when the dominant macro factor is growth/risk appetite (equities down, bonds up) but fails when the dominant factor is inflation/real rates (both down). The structural equivalence holds only under the implicit assumption that rate shocks are small or that the central bank is not actively tightening—an assumption violated in 2022–2024.

What the result does support: Over the full 15-year sample, TLT has on average moved opposite to SPY with moderate strength (r = –0.32). The full-sample relationship is statistically significant, and the per-year correlation was negative in every year from 2010 through 2020, although its magnitude varied (roughly –0.2 to –0.7). The categorical payoff equivalence (equity duration versus bond duration as hedging instruments) is empirically grounded in the low-inflation, low-rate regime but is regime-dependent, not universal.

Relation to the Literature

No closely related papers were retrieved for this computation, so the result stands on its own empirical foundation. The finding contributes to the literature on equity-bond correlation dynamics (e.g., the "flight-to-quality" literature, the "inflation regime" literature) by quantifying the relationship over a recent 15-year window that spans both a low-inflation regime (2010–2020) and an inflation-shock regime (2021–2024). The regime-dependent breakdown of the negative correlation in 2022–2024 is consistent with theoretical work on time-varying risk premia and the role of inflation expectations in driving cross-asset correlations, though we do not cite specific papers here.

The concept of "categorical payoff equivalence" (the idea that structurally isomorphic instruments—here, equity duration and bond duration—should exhibit predictable co-movement) is tested and conditionally supported: the equivalence holds when the macro regime aligns with the implicit assumptions (growth shocks dominate, inflation is stable), but it breaks down when those assumptions are violated (inflation shocks dominate, real rates rise). This suggests that categorical equivalence is a useful heuristic for portfolio construction in "normal" regimes but requires regime-aware conditioning in periods of structural macro shifts.

Limitations

  1. Sample composition and regime coverage: The 2010–2024 window includes only one major inflation-shock episode (2021–2024). The positive correlation in 2022 and near-zero correlations in 2023–2024 may reflect a transient regime rather than a permanent structural break. A longer sample spanning multiple inflation cycles (e.g., including the 1970s–1980s) would clarify whether the 2022–2024 breakdown is typical of high-inflation regimes or unique to the post-pandemic policy response.

  2. In-sample per-year correlations: The per-year correlations are computed in-sample within each calendar year on the same data used for the full-sample estimate. They are descriptive of time variation, not out-of-sample forecasts. Each rests on roughly 250 observations (3,772 / 15), so under an i.i.d. approximation its standard error near zero is about 0.06; the small positive values for 2022–2024 are therefore not individually far from zero. An out-of-sample test (e.g., rolling-window correlation with a holdout period, or a regime-switching model estimated on pre-2021 data and tested on 2021–2024) would assess whether the regime shift was predictable or detectable in real time.

  3. Universe and instrument choice: SPY and TLT are specific ETFs; SPY tracks the S&P 500 (large-cap US equities), and TLT tracks 20+ year Treasuries (long-duration nominal bonds). The result does not generalize to other equity indices (e.g., small-cap, international), other bond durations (e.g., intermediate Treasuries, TIPS), or other asset classes (e.g., commodities, credit). The categorical equivalence hypothesis would predict similar dynamics for other equity-bond pairs under similar macro conditions, but that requires separate testing.

  4. Causality and mechanism: Correlation does not identify the causal mechanism. The negative correlation in 2010–2020 is consistent with flight-to-quality (equity sell-offs prompt bond buying) but also with growth news that moves equity prices and yields in the same direction, and hence equity and bond prices in opposite directions. Pure discount-rate shocks, by contrast, push equity and bond prices the same way and work against a negative correlation. The positive correlation in 2022 is consistent with a common inflation/real-rate factor, but we do not decompose returns into rate, inflation, and risk-premium components. A structural model (e.g., a VAR with identified shocks, or a factor model with explicit rate and inflation factors) would clarify the mechanism.

  5. Confidence interval interpretation: The 95% CI [–0.3749, –0.2683] is a bootstrap interval for the full-sample Pearson correlation. It does not account for the time variation visible in the per-year series. A regime-switching model or a time-varying correlation model (e.g., DCC-GARCH) would provide regime-specific confidence intervals and formal tests of structural breaks.

  6. Hedging effectiveness vs correlation: A negative correlation is necessary but not sufficient for effective hedging. The hedge ratio (the position size in TLT required to offset a given SPY position) depends on the volatilities and the correlation. In 2022, even a small positive correlation combined with high bond volatility would have amplified losses. A portfolio-level analysis (e.g., minimum-variance hedge ratios, drawdown analysis, or Sharpe ratio comparisons) would assess hedging effectiveness more directly than correlation alone.
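
The dependence of the hedge ratio on volatilities and correlation (limitation 6) can be made concrete with a static minimum-variance hedge. For one unit of the asset, the variance-minimizing position in the hedge is h = –Cov(asset, hedge) / Var(hedge) = –ρ · σ_asset / σ_hedge, and the fraction of variance removed in-sample equals ρ². Under those textbook assumptions, a correlation of –0.3224 corresponds to ρ² ≈ 0.104, i.e. roughly a 10% reduction in daily return variance. The sketch below checks the identity on synthetic data; the function name and the data are hypothetical, and this is not a portfolio analysis of SPY and TLT.

```python
import numpy as np

def min_variance_hedge(r_asset, r_hedge):
    """Static minimum-variance hedge of r_asset using r_hedge.

    Returns (h, reduction): h is the units of the hedge held per unit of
    the asset, reduction is the fraction of variance removed in-sample.
    """
    cov = np.cov(r_asset, r_hedge)      # 2x2 sample covariance matrix
    h = -cov[0, 1] / cov[1, 1]          # h = -Cov(asset, hedge) / Var(hedge)
    hedged = r_asset + h * r_hedge
    reduction = 1.0 - hedged.var(ddof=1) / cov[0, 0]
    return float(h), float(reduction)

# Synthetic stand-in returns (NOT SPY/TLT data).
rng = np.random.default_rng(1)
n = 3772
spy_like = rng.normal(0.0, 0.01, n)
tlt_like = -0.3 * spy_like + rng.normal(0.0, 0.009, n)

h, reduction = min_variance_hedge(spy_like, tlt_like)
rho = float(np.corrcoef(spy_like, tlt_like)[0, 1])
print(f"h = {h:.3f}, variance reduction = {reduction:.2%}, rho^2 = {rho**2:.2%}")
```

A negative correlation gives h > 0 (a long position in the hedge); a positive correlation flips the sign of h, which is why a long bond position adds to portfolio variance when the correlation turns positive.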

Strengthening the result would require: (a) out-of-sample validation of the regime shift (e.g., testing whether a model trained on 2010–2020 data predicts the 2021–2024 breakdown); (b) extension to other equity-bond pairs and asset classes to test the generality of categorical equivalence; (c) decomposition of returns into rate, inflation, and risk-premium components to identify the mechanism; (d) formal regime-switching or structural-break tests to quantify the probability and timing of regime changes; and (e) portfolio-level hedging analysis to translate correlation into hedging effectiveness (hedge ratios, drawdown mitigation, risk-adjusted returns).
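
As a starting point for item (a), the rolling-window correlation mentioned in limitation 2 can be sketched with pandas. A trailing window uses only past data at each date, so it shows how quickly a sign change would have become visible in real time. The window length of 252 days is a hypothetical choice, the data are synthetic with a sign change built in at the start of 2022, and this is not a formal structural-break test.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in returns (NOT SPY/TLT data): negative dependence
# through 2021, positive from 2022 onward, purely by construction.
dates = pd.bdate_range("2010-01-01", "2024-12-31")
rng = np.random.default_rng(3)
spy = pd.Series(rng.normal(0.0, 0.01, len(dates)), index=dates)
slope = np.where(dates.year <= 2021, -0.5, 0.3)
tlt = pd.Series(
    slope * spy.to_numpy() + rng.normal(0.0, 0.008, len(dates)),
    index=dates,
)

WINDOW = 252  # hypothetical choice: roughly one trading year

# Trailing correlation: the value at date t uses the WINDOW days ending at t.
rolling_r = spy.rolling(WINDOW).corr(tlt)

# First date on which the trailing estimate is positive.
positive = rolling_r[rolling_r > 0]
first_flip = positive.index[0] if len(positive) else None

print(rolling_r.dropna().iloc[[0, -1]].round(3))
print("first date with positive trailing correlation:", first_flip)
```

The trailing estimate changes sign only after the window contains enough post-break observations, which illustrates the detection lag any backward-looking estimator would face.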


Research evidence, not investment advice.