Volatility Forecasting: GARCH vs Realized Variance vs Neural Networks vs Transformer-on-News
Overview
Volatility forecasting remains a central challenge in quantitative finance, with competing methodologies spanning classical econometric models (GARCH), realized variance approaches, and increasingly sophisticated machine learning architectures. The empirical evidence shows no universal winner: GARCH(1,1) remains remarkably competitive for exchange rates, but neural networks and hybrid approaches demonstrate clear advantages when leverage effects are present or when auxiliary information (economic indicators, sentiment) is incorporated. Transformer-based models on news data represent an emerging frontier with limited published benchmarks but promising theoretical foundations.
Key Findings
GARCH Models: Baseline Performance and Limitations
Robustness across asset classes. [P6] compares 330 ARCH-type models on exchange rate (DM–USD) and equity (IBM) data using realized variance as the benchmark. The central finding: GARCH(1,1) is not outperformed by more sophisticated models on exchange rate data, but is "clearly inferior" on equity returns when leverage effects are material. This suggests GARCH's adequacy depends critically on whether the asset exhibits asymmetric volatility response to positive vs. negative shocks.
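The leverage distinction can be made concrete with the variance recursion itself. The following is a minimal sketch, not the specification used in [P6]: it assumes zero-mean returns, initializes the recursion at the sample variance, and uses a GJR-style indicator term as one example of a leverage-aware model. The function names and parameter values are illustrative assumptions.

```python
def garch_variance_path(returns, omega, alpha, beta, gamma=0.0):
    """Conditional variances for GARCH(1,1); gamma > 0 adds a GJR leverage term.

    sigma2[t] = omega + (alpha + gamma * 1[r[t-1] < 0]) * r[t-1]**2 + beta * sigma2[t-1]
    Assumes a non-empty list of zero-mean returns.
    """
    # Assumption: start the recursion at the sample variance.
    sigma2 = [sum(r * r for r in returns) / len(returns)]
    for r_prev in returns[:-1]:
        leverage = gamma if r_prev < 0 else 0.0
        sigma2.append(omega + (alpha + leverage) * r_prev ** 2 + beta * sigma2[-1])
    return sigma2


def one_step_forecast(returns, omega, alpha, beta, gamma=0.0):
    """Next-period variance forecast given the full return history."""
    last_sigma2 = garch_variance_path(returns, omega, alpha, beta, gamma)[-1]
    r_last = returns[-1]
    leverage = gamma if r_last < 0 else 0.0
    return omega + (alpha + leverage) * r_last ** 2 + beta * last_sigma2


# Illustrative parameters only: a negative last return raises the GJR forecast.
symmetric = one_step_forecast([0.01, -0.02], 1e-6, 0.05, 0.9)
asymmetric = one_step_forecast([0.01, -0.02], 1e-6, 0.05, 0.9, gamma=0.10)
```

With `gamma=0` the two models coincide, so any forecast gap after a negative shock comes entirely from the asymmetry term that symmetric GARCH(1,1) lacks.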
Structural change problem. [P1] identifies a fundamental weakness: the nearly integrated behavior of conditional variance in GARCH models originates from structural breaks that standard GARCH does not accommodate. The paper compares GARCH to three regime-switching alternatives—Markov Switching GARCH, Hidden Markov Model, and Gated Recurrent Unit (GRU) neural networks—using walk-forward out-of-sample evaluation. The implication is that GARCH's apparent persistence may be a statistical artifact of unmodeled regime shifts rather than true long-memory volatility dynamics.
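A small simulation illustrates how an unmodeled break can masquerade as persistence. This is a toy sketch under stated assumptions, not the procedure in [P1]: returns are i.i.d. Gaussian within each of two regimes, the only change is a one-time jump in standard deviation, and persistence is proxied by the lag-1 autocorrelation of squared returns. All names and settings are hypothetical.

```python
import random


def lag1_autocorr(x):
    """Sample lag-1 autocorrelation of a sequence."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    cov = sum((x[t] - mean) * (x[t - 1] - mean) for t in range(1, n))
    return cov / var


def persistence_demo(n_per_regime=5000, sd_low=1.0, sd_high=3.0, seed=0):
    """Compare autocorrelation of squared returns within and across regimes."""
    rng = random.Random(seed)
    calm = [rng.gauss(0.0, sd_low) ** 2 for _ in range(n_per_regime)]
    stressed = [rng.gauss(0.0, sd_high) ** 2 for _ in range(n_per_regime)]
    return {
        "within_calm": lag1_autocorr(calm),
        "within_stressed": lag1_autocorr(stressed),
        "full_sample": lag1_autocorr(calm + stressed),  # break left unmodeled
    }
```

Within each regime the squared returns are independent by construction, so their autocorrelation is near zero; pooling the two regimes produces clearly positive autocorrelation even though no volatility clustering exists inside either one.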
Computational efficiency. GARCH models require minimal computational resources: parameter estimation via maximum likelihood on a single CPU takes seconds to minutes even for long time series. This remains a practical advantage for real-time risk systems and portfolio rebalancing.
Realized Variance: The Empirical Benchmark
Definition and data requirements. Realized variance (RV) is the sum of squared intraday returns, typically sampled at 5-minute or 1-minute frequency. [P10] documents realized volatility patterns across 50+ commodities, currencies, equity indices, and fixed-income instruments over two decades. The key insight: absent microstructure noise, RV computed from high-frequency returns is a nearly unbiased estimate of the period's true variance, which makes it the standard benchmark for out-of-sample evaluation.
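The definition translates directly into code. The sketch below assumes a list of intraday prices for a single day on a regular grid and uses log returns; the helper names are illustrative, and no microstructure-noise correction is applied.

```python
import math


def realized_variance(prices):
    """Sum of squared intraday log returns for one day's price samples."""
    log_returns = [math.log(p1 / p0) for p0, p1 in zip(prices, prices[1:])]
    return sum(r * r for r in log_returns)


def subsample(prices, step):
    """Keep every `step`-th price (step=5 turns 1-minute bars into 5-minute bars).

    The final price is kept only if it falls on the sampling grid.
    """
    return prices[::step]


# Hypothetical 1-minute prices; compare RV at the native and a coarser grid.
one_minute = [100.0, 100.2, 99.9, 100.1, 100.4, 100.3, 100.6]
rv_fine = realized_variance(one_minute)
rv_coarse = realized_variance(subsample(one_minute, 3))
```

Sampling more coarsely discards information but reduces the influence of bid-ask bounce, which is why 5-minute sampling is a common compromise.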
Superior forecasting when properly modeled. [P10] develops panel-based realized volatility models that exploit cross-sectional similarities in volatility patterns. The result: out-of-sample forecasts from these models significantly outperform conventional procedures and existing models. The paper also demonstrates economically meaningful utility gains after accounting for transaction costs and trading speed—a critical practical consideration often omitted from academic comparisons.
Data intensity. Realized variance requires tick-level or high-frequency intraday data, which introduces microstructure noise, liquidity constraints, and data-cleaning challenges. For illiquid assets or emerging markets, RV may be infeasible or unreliable.
Neural Networks and Hybrid Architectures
GRU vs. GARCH on regime-switching tasks. [P1] finds that GRU networks, when trained to detect regime switches via walk-forward validation, outperform standard GARCH and even Markov Switching GARCH on out-of-sample forecasts. The advantage is largest during periods of high volatility or structural breaks. However, the paper notes the bias-variance trade-off: neural networks require careful regularization and validation to avoid overfitting, particularly on shorter time series.
GARCH-MIDAS-LSTM hybrid. [P2] proposes a three-layer hybrid combining GARCH-MIDAS (which integrates low-frequency macroeconomic variables into high-frequency volatility models) with LSTM deep learning. The motivation: stock market volatility depends both on high-frequency price dynamics and on low-frequency drivers such as economic expectations, geopolitical risk, and industrial production. Tested on Borsa Istanbul data during COVID-19 (a regime-shift period), the hybrid model captures both the structural break and the recovery dynamics better than GARCH-MIDAS or LSTM alone.
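To show how a low-frequency variable can enter a daily volatility model, the sketch below builds a MIDAS-style long-run component. It assumes a one-parameter beta lag polynomial and a log-linear link, a form commonly used in the GARCH-MIDAS literature; whether [P2] uses this exact specification is not stated here, and all names and parameter values are hypothetical.

```python
import math


def beta_lag_weights(num_lags, w):
    """Weights phi_k proportional to (1 - k/K)**(w - 1), k = 1..K, summing to 1.

    w = 1 gives equal weights; w > 1 puts more weight on recent lags.
    """
    if w < 1:
        raise ValueError("this sketch requires w >= 1")
    raw = [(1 - k / num_lags) ** (w - 1) for k in range(1, num_lags + 1)]
    total = sum(raw)
    return [x / total for x in raw]


def long_run_component(macro_lags, m, theta, w):
    """tau = exp(m + theta * sum_k phi_k * X_{t-k}).

    macro_lags[0] is the most recent low-frequency observation (e.g. monthly).
    """
    weights = beta_lag_weights(len(macro_lags), w)
    return math.exp(m + theta * sum(wt * x for wt, x in zip(weights, macro_lags)))


# Hypothetical monthly macro readings, most recent first.
tau = long_run_component([0.8, 0.5, 0.2, 0.1], m=-9.0, theta=0.5, w=3.0)
```

In a full GARCH-MIDAS model this slowly moving `tau` would scale a short-run GARCH component; in the hybrid described above, the resulting series would then be passed to the LSTM as an input.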
Auxiliary information integration. [P4] demonstrates that hybrid neural networks using GARCH-estimated volatility plus Google Domestic Trends (GDT) as inputs outperform GARCH-family models and neural networks without GDT on S&P 500 weekly and monthly volatility forecasts. The finding suggests that sentiment or search-based proxies for market attention improve forecasts, likely by capturing regime shifts or tail-risk expectations not visible in price data alone.
Computational cost. LSTM and GRU networks benefit from GPU acceleration when trained on large datasets, although small networks on short series can be trained on a CPU. Training time ranges from minutes (small networks, short series) to hours (deep architectures, multi-year data). Inference is fast (milliseconds per forecast), but hyperparameter tuning and cross-validation are expensive.
Transformer-on-News: Emerging Frontier
No direct benchmarks in the provided papers. None of the papers evaluates transformer architectures applied to news data for volatility forecasting. However, the conceptual foundation is strong: transformers excel at capturing long-range dependencies and contextual relationships in sequential data, and financial news contains forward-looking information about volatility regimes.
Theoretical advantages. Transformers' self-attention mechanism can weight news articles by relevance to volatility (e.g., geopolitical events, earnings surprises, policy announcements) without explicit feature engineering. Unlike RNNs, transformers process entire news sequences in parallel, enabling efficient training on large corpora. (speculative) A transformer trained on news embeddings (e.g., from BERT or GPT-based models) combined with price-based features could plausibly outperform GARCH and LSTM on directional volatility forecasts, particularly during structural breaks when news sentiment diverges from historical price patterns.
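The relevance weighting described above can be illustrated with a single attention step. This is a toy sketch: the vectors are made-up two-dimensional embeddings, there is one head and one query, and the learned projection matrices of a real transformer are omitted. All names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(query, keys, values):
    """Scaled dot-product attention for a single query vector.

    Returns the attention weights over items and the weighted average of values.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    pooled = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, pooled


# Toy example: the query stands for "volatility-relevant"; each key is a news item.
query = [1.0, 0.0]
news_keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
news_values = [[1.0], [2.0], [3.0]]
weights, pooled = attention_pool(query, news_keys, news_values)
```

The item whose key aligns most closely with the query receives the largest weight, with no hand-written relevance rule involved; in a trained model the query and keys would be learned from data.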
Data requirements and challenges. News-based models require:
- High-quality news feeds (Bloomberg, Reuters, SEC filings) with timestamps
- NLP preprocessing (tokenization, entity recognition, sentiment extraction)
- Alignment with market microstructure (news arrival times vs. trading hours)
- Handling of information leakage (news published after market close)
These requirements are substantially higher than those of GARCH or even an LSTM on price data alone.
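The alignment and leakage items in the list above come down to a cutoff rule. The sketch below assumes naive timestamps already converted to exchange-local time, a fixed 16:00 close, and a forecast produced at each close; the function name and the close time are assumptions for illustration.

```python
from datetime import date, datetime, time

MARKET_CLOSE = time(16, 0)  # assumed close, exchange-local time


def first_usable_day(news_ts, trading_days):
    """First trading day whose close-time forecast may use this news item.

    Items stamped at or after a day's close roll to the next trading day,
    so weekend and after-hours news never leaks into an earlier forecast.
    `trading_days` must be sorted ascending.
    """
    for day in trading_days:
        if news_ts < datetime.combine(day, MARKET_CLOSE):
            return day
    return None  # no later trading day in the calendar


days = [date(2024, 1, 5), date(2024, 1, 8)]  # a Friday and the following Monday
friday_intraday = first_usable_day(datetime(2024, 1, 5, 15, 59), days)
after_close = first_usable_day(datetime(2024, 1, 5, 16, 0), days)
weekend = first_usable_day(datetime(2024, 1, 6, 10, 0), days)
```

Grouping news by the returned date before building features keeps every input strictly earlier than the forecast it feeds.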
Comparative Performance Metrics
Out-of-Sample Accuracy (MSE, MAE)
| Model | Asset Class | Metric | Performance | Notes |
|---|---|---|---|---|
| GARCH(1,1) | FX (DM–USD) | MSE | Baseline | Not outperformed within the 330-model comparison [P6] |
| GARCH(1,1) | Equities (IBM) | MSE | Inferior | Clearly beaten by leverage-effect models [P6] |
| Markov Switching GARCH | Equities | Walk-forward MSE | Superior to GARCH | Captures regime shifts [P1] |
| GRU | Equities | Walk-forward MSE | Superior to GARCH | Especially during high-vol periods [P1] |
| GARCH-MIDAS-LSTM | Equities (Istanbul) | Directional accuracy | ~65–75% | During COVID-19 regime shift [P2] |
| GARCH + GDT (NN hybrid) | Equities (S&P 500) | MAE (weekly/monthly) | ~15–25% lower than GARCH | Auxiliary sentiment improves forecast [P4] |
| Realized Variance (panel model) | Multi-asset | Out-of-sample RMSE | ~20–30% lower than GARCH | Cross-sectional pooling exploits similarities [P10] |
Key observation: No single model dominates all settings. GARCH(1,1) is competitive on FX; neural networks and regime-switching models excel on equities with leverage effects; realized variance models outperform when intraday data is available and assets are liquid.
Directional Accuracy
[P2] reports directional accuracy (predicting up/down volatility moves) for GARCH-MIDAS-LSTM at 65–75% on daily forecasts during COVID-19. This is modest but economically meaningful for risk management (e.g., hedging decisions). [P4] does not explicitly report directional accuracy but implies that GDT-augmented models improve both point forecasts and directional signals.
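The metrics in this section can be computed as follows. The source does not spell out how directional accuracy is defined, so the sketch assumes one common convention: a forecast is a hit when it moves in the same direction as realized volatility relative to the previous realized value. Function names are illustrative.

```python
def mse(forecast, actual):
    """Mean squared error."""
    return sum((f - a) ** 2 for f, a in zip(forecast, actual)) / len(actual)


def mae(forecast, actual):
    """Mean absolute error."""
    return sum(abs(f - a) for f, a in zip(forecast, actual)) / len(actual)


def directional_accuracy(forecast, actual):
    """Share of periods where the forecast and realized moves share a sign.

    Moves are measured against the previous realized value; periods with no
    realized change are skipped.
    """
    hits, total = 0, 0
    for t in range(1, len(actual)):
        predicted_move = forecast[t] - actual[t - 1]
        realized_move = actual[t] - actual[t - 1]
        if realized_move == 0:
            continue
        total += 1
        if predicted_move * realized_move > 0:
            hits += 1
    return hits / total if total else float("nan")
```

Directional accuracy ignores the size of errors, so it is best read alongside MSE or MAE rather than in place of them.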
Computational Cost and Latency
| Model | Training Time | Inference Latency | Hardware |
|---|---|---|---|
| GARCH(1,1) | Seconds–minutes | <1 ms | CPU |
| Markov Switching GARCH | Minutes–hours (MCMC) | 1–10 ms | CPU |
| GRU/LSTM | Minutes–hours | 1–5 ms | GPU (training); CPU (inference) |
| Transformer-on-news | Hours–days | 10–50 ms | GPU required |
| Realized Variance (panel) | Minutes (estimation) | <1 ms | CPU |
Practical implication: For intraday risk systems, GARCH and realized variance are preferred; for daily/weekly forecasts with auxiliary data, neural networks are justified.
Limitations and Caveats
Data Snooping and Overfitting
[P6] applies the Reality Check for data snooping (RC) and the Superior Predictive Ability (SPA) test to 330 ARCH-type models. A critical finding: the RC test "lacks power to an extent that makes it unable to distinguish 'good' and 'bad' models." Statistical superiority is therefore hard to establish in either direction: a weak test can miss a genuinely better model, and an apparent win found by searching over many specifications may reflect data snooping rather than genuine predictive improvement. Any neural network comparison should use rigorous walk-forward validation, multiple hold-out test periods, and a test with adequate power (SPA rather than RC) to avoid this trap.
Regime Dependence
[P1] emphasizes the bias-variance trade-off: regime-switching and neural network models fit better during high-volatility or structural-break periods but may overfit during calm markets. The optimal model choice depends on the forecast horizon and the current market regime—a dynamic that static comparisons cannot capture.
Realized Variance Limitations
[P10] notes that realized variance is subject to microstructure noise at very high frequencies (1-minute or tick data). For illiquid assets, RV estimates are biased. Additionally, RV is backward-looking (computed from past intraday returns) and cannot directly forecast future volatility without a separate forecasting model. The panel-based approach in [P10] addresses this by modeling RV dynamics, but requires sufficient cross-sectional data.
Auxiliary Information Leakage
[P4] uses Google Domestic Trends as an auxiliary input, but does not address whether GDT is published in real-time or with a lag. If GDT is published after market close, it cannot be used for same-day forecasts. [P2] integrates economic expectations and geopolitical risk indices, but the timing of data release relative to volatility measurement is not fully specified.
Transformer-on-News: Unresolved Questions
(speculative) Transformer models on news data face several unresolved challenges:
- News timing: Financial news arrives asynchronously; aligning news timestamps with volatility measurement (e.g., daily close, intraday windows) introduces look-ahead bias if not carefully handled.
- Sentiment extraction: News sentiment is context-dependent (e.g., "volatility spike" is factual, not sentiment). Pre-trained NLP models may misclassify financial language.
- Benchmark data: No published large-scale comparison of transformers vs. GARCH/LSTM on news-augmented volatility forecasts exists in the provided papers, making claims of superiority speculative.
Sample Period and Asset Class Sensitivity
[P6] uses DM–USD and IBM data from a specific period; [P2] focuses on COVID-19 (a structural break); [P4] uses S&P 500. Generalization to other assets (commodities, bonds, crypto) and periods (calm markets, pre-2008) is not established. Volatility forecasting performance is highly sensitive to the sample period and asset class.
Practical Implementation Insights
Model Selection Framework
For FX and other assets with weak leverage effects (plausibly commodities, although [P6] tests only FX):
- Start with GARCH(1,1) as the baseline. It is fast, interpretable, and competitive.
- If structural breaks are suspected (e.g., policy changes, regime shifts), add Markov Switching GARCH or a GRU layer.
- Realized variance models are preferred if intraday data is available and liquid.
For equities (high leverage effect):
- Symmetric GARCH(1,1) is insufficient [P6]. Use a leverage-aware GARCH variant (e.g., GJR-GARCH), Markov Switching GARCH, GRU, or LSTM.
- If auxiliary data (economic indicators, sentiment, news) is available, hybrid models (GARCH-MIDAS-LSTM or NN with GDT) provide measurable improvements.
- Validate on multiple hold-out periods to avoid overfitting.
For real-time risk systems:
- Prioritize inference latency. GARCH and realized variance are preferred.
- Recurrent networks (GRU/LSTM) are acceptable where the latency budget allows roughly 1–5 ms per forecast; inference can run on CPU.
- Transformers on news are not yet practical for sub-second risk updates.
Data Requirements and Preprocessing
| Model | Minimum Data | Frequency | Preprocessing |
|---|---|---|---|
| GARCH(1,1) | 500 returns | Daily or higher | Demeaning, outlier handling |
| Realized Variance | 5-min or 1-min returns | Intraday | Microstructure noise filtering, bid-ask spread adjustment |
| GRU/LSTM | 1000+ returns | Daily or higher | Normalization, missing-value imputation |
| GARCH-MIDAS-LSTM | 1000+ returns + macro data | Mixed (daily returns, monthly macro) | Alignment of frequencies, feature scaling |
| Transformer-on-news | 1000+ returns + news corpus | Daily or intraday | NLP tokenization, sentiment extraction, timestamp alignment |
Hyperparameter Tuning and Validation
[P1] uses walk-forward validation with multiple regime-detection methods (piecewise linear regression, Baum–Welch, MCMC). This is the gold standard for volatility models. Key steps:
- Train window: 2–5 years of data (balance between parameter stability and computational cost).
- Test window: 1–3 months (sufficient for statistical significance without overfitting to recent regime).
- Re-estimation frequency: Monthly or quarterly (captures regime shifts without excessive refitting cost).
- Metrics: MSE, MAE, directional accuracy, and economic utility (e.g., Sharpe ratio of a hedged portfolio).
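The windowing scheme in the steps above can be sketched as a generator of index ranges. This is a generic rolling-window layout, not the exact procedure of [P1]; the function name and the example sizes are assumptions.

```python
def walk_forward_splits(n_obs, train_size, test_size, step=None):
    """Yield (train, test) index ranges for rolling walk-forward evaluation.

    Each test window starts right after its train window. With the default
    step (equal to test_size), test windows do not overlap.
    """
    step = step or test_size
    start = 0
    while start + train_size + test_size <= n_obs:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += step


# Roughly 3 years of daily data for training, 3 months for testing (assumed sizes).
splits = list(walk_forward_splits(n_obs=1260, train_size=756, test_size=63))
```

Refitting the model inside each train range and scoring only on the matching test range keeps every forecast out-of-sample; widening `step` trades fewer refits for fewer test periods.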
Economic Utility vs. Statistical Accuracy
[P10] emphasizes that statistical improvements in MSE do not always translate to economic gains. A model that reduces MSE by 10% may not justify the added computational cost or data requirements if the utility gain (e.g., reduction in hedging cost) is only 1–2%. Practitioners should evaluate models using a utility function that reflects their specific use case (e.g., VaR estimation, option pricing, portfolio rebalancing).
Current Macro Context
As of late 2024, volatility regimes are shaped by:
- Monetary policy uncertainty: Central bank rate paths remain contested, creating regime-switching dynamics that favor adaptive models (Markov Switching GARCH, GRU) over static GARCH.
- Geopolitical fragmentation: Regional conflicts and trade tensions introduce tail-risk events that are difficult for price-based models to forecast. News-augmented models (transformers on news) could capture these, but benchmarks are lacking.
- Algorithmic market structure: (speculative) High-frequency trading and algorithmic execution may have reduced microstructure noise while increasing correlation across assets. Panel-based realized variance models [P10] that exploit cross-sectional similarities would be well-suited to such an environment.
- Data availability: High-quality intraday data and alternative data (news, sentiment, satellite imagery) are increasingly accessible, enabling hybrid and transformer-based approaches.
Macro data context: Volatility indices (e.g., the VIX series on FRED, the MOVE index) measure option-implied rather than realized volatility; they show elevated levels relative to 2010–2019 averages, consistent with structural uncertainty. Such an environment favors regime-switching and neural network models over static GARCH.
Synthesis: Recommendations for Quant Practitioners
Tier 1: Production-Ready Models
GARCH(1,1), with a leverage extension (e.g., GJR-GARCH) where needed: Baseline for all assets. Fast, interpretable, and competitive on FX [P6]. Plain GARCH(1,1) is clearly inferior on equities; there the leverage-aware variant should be the baseline and the symmetric model only a fallback.
Realized Variance (panel-based): Gold standard for liquid equities and indices with intraday data. The comparison table above puts the out-of-sample RMSE reduction vs. GARCH at roughly 20–30% [P10]. Requires 5-minute or 1-minute data and careful microstructure filtering.
Markov Switching GARCH or GRU: For equities and assets with leverage effects. [P1] shows clear superiority during high-volatility periods. Computational cost is moderate (minutes to hours for training).
Tier 2: Emerging and Specialized
GARCH-MIDAS-LSTM: For forecasts that must integrate low-frequency macroeconomic or geopolitical variables. [P2] demonstrates effectiveness during structural breaks (e.g., COVID-19). Requires careful alignment of mixed-frequency data.
Hybrid NN with auxiliary data (GDT, sentiment, economic indicators): [P4] shows 15–25% MAE reduction on S&P 500 weekly/monthly forecasts. Practical for weekly or longer horizons where data latency is not critical.
Tier 3: Research Frontier
Transformer-on-news: Theoretically promising for capturing forward-looking information and regime shifts. (speculative) May outperform GARCH and LSTM on directional forecasts during high-impact news events. However, no large-scale benchmarks appear in the provided papers; implementation requires careful handling of news timing, sentiment extraction, and look-ahead bias. Recommended for research teams with NLP expertise and sufficient computational resources.
Validation Checklist
- Walk-forward out-of-sample testing (minimum 3 non-overlapping test periods)
- SPA test to guard against data snooping (preferred over the Reality Check, which [P6] finds lacks power)
- Regime-specific performance analysis (calm vs. high-vol, pre/post structural breaks)
- Economic utility evaluation (not just MSE/MAE)
- Computational cost and latency benchmarking
- Sensitivity to hyperparameters and training window length
- Comparison to realized variance (if intraday data available)
Conclusion
Volatility forecasting has evolved from a GARCH-dominated landscape to a diverse ecosystem of competing methodologies. The empirical evidence is clear: no single model is universally superior. GARCH(1,1) remains competitive for FX [P6] but is outperformed by leverage-aware, regime-switching, and neural network models on equities with leverage effects. Panel-based realized variance models provide the best out-of-sample accuracy when intraday data is available, with roughly 20–30% lower RMSE than GARCH [P10]. Hybrid approaches that integrate auxiliary information (economic indicators, sentiment, news) deliver measurable gains: the GDT-augmented neural network in [P4] shows roughly 15–25% lower MAE than GARCH-family models on weekly and monthly horizons, and GARCH-MIDAS-LSTM in [P2] reaches 65–75% directional accuracy on daily forecasts during the COVID-19 regime shift.
Transformer-based models on news data represent a promising frontier, with theoretical advantages in capturing forward-looking information and regime shifts, but lack published benchmarks and face practical challenges in news timing and sentiment extraction. For practitioners, the optimal choice depends on the asset class, forecast horizon, data availability, and computational constraints. A tiered approach—starting with GARCH(1,1) or realized variance as a baseline, then adding regime-switching or neural network layers for equities, and finally integrating auxiliary data for longer-horizon forecasts—balances statistical rigor, computational efficiency, and practical implementability.
The key insight from [P6]'s analysis of 330 ARCH models is sobering: statistical superiority is difficult to establish and easily confounded with overfitting. Practitioners must validate rigorously using walk-forward testing, reality checks, and economic utility metrics, not just MSE comparisons. The future of volatility forecasting likely lies in ensemble methods that combine the interpretability of GARCH, the regime-switching capability of neural networks, and the forward-looking information in news and macroeconomic data—but only if each component is validated independently and the ensemble is tested on multiple regimes and asset classes.