Weather forecasting is a fundamental problem for anticipating and mitigating the impacts of climate change. Recently, data-driven approaches for weather forecasting based on deep learning have shown great promise, achieving accuracies competitive with operational systems. However, these methods often employ complex, customized architectures without sufficient ablation analysis, making it difficult to understand what truly contributes to their success.
Here we introduce Stormer, a simple transformer model that achieves state-of-the-art performance on weather forecasting with minimal changes to the standard transformer backbone. Through careful empirical analyses, we identify the key components of Stormer: a weather-specific embedding, a randomized dynamics forecasting objective, and a pressure-weighted loss.
At the core of Stormer is a randomized forecasting objective that trains the model to forecast the weather dynamics over varying time intervals. During inference, this allows us to produce multiple forecasts for a target lead time and combine them to obtain better forecast accuracy.
On WeatherBench 2, Stormer performs competitively on short- to medium-range forecasts and outperforms current methods beyond 7 days, while requiring orders of magnitude less training data and compute. Additionally, we demonstrate Stormer's favorable scaling properties, showing consistent improvements in forecast accuracy as model size and the number of training tokens increase. Code and checkpoints will be made publicly available.
…We start with a standard vision transformer (ViT) architecture and, through extensive ablation studies, identify three key components responsible for the model's performance: (1) a weather-specific embedding layer that transforms the input data into a sequence of tokens by modeling the interactions among atmospheric variables; (2) a randomized dynamics forecasting objective that trains the model to predict the weather dynamics at random intervals; and (3) a pressure-weighted loss that weights variables at different pressure levels in the loss function to approximate the air density at each pressure level. During inference, our proposed randomized dynamics forecasting objective allows a single model to produce multiple forecasts for a specified lead time by using different combinations of the intervals for which the model was trained. For example, one can obtain a 3-day forecast either by rolling out the 6-hour predictions 12× or the 12-hour predictions 6×. Combining these forecasts leads to substantial performance improvements, especially at long lead times.
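As a concrete illustration of the inference-time mechanism, the sketch below enumerates, for each trained interval that evenly divides a target lead time, the corresponding homogeneous rollout schedule, then averages the resulting forecasts. This is a minimal sketch, not the authors' code: the interval set and the one-step callable `model_step(x, dt)` are assumptions for illustration.

```python
# Intervals (in hours) the model is assumed to be trained on.
TRAINED_INTERVALS = (6, 12, 24)

def homogeneous_rollouts(lead_time, intervals=TRAINED_INTERVALS):
    """One rollout schedule per trained interval that divides the
    lead time, e.g. 72 h -> 6 h x 12, 12 h x 6, 24 h x 3."""
    return [
        [dt] * (lead_time // dt)
        for dt in intervals
        if lead_time % dt == 0
    ]

def ensemble_forecast(model_step, x0, lead_time):
    """Roll the model out along each schedule with the hypothetical
    one-step function `model_step(x, dt)`, then average forecasts."""
    forecasts = []
    for schedule in homogeneous_rollouts(lead_time):
        x = x0
        for dt in schedule:
            x = model_step(x, dt)
        forecasts.append(x)
    return sum(forecasts) / len(forecasts)
```

For a 72-hour target, `homogeneous_rollouts(72)` yields the 6 h × 12, 12 h × 6, and 24 h × 3 schedules from the example above; more elaborate mixed-interval schedules are also possible but omitted here for brevity.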
…Experiments show that Stormer achieves competitive forecast accuracy on key atmospheric variables for 1–7 days and outperforms the state of the art beyond 7 days. Notably, Stormer achieves this performance while training on data with more than 5× lower resolution and using orders of magnitude fewer GPU hours than the baselines. Finally, our scaling analysis shows that the performance of Stormer improves consistently with increases in model capacity and data size, demonstrating the potential for further improvements.
…Moreover, Stormer achieves this performance with much less compute and training data than the two deep learning baselines. We train Stormer on 6-hourly data at 1.40625° with 13 pressure levels, which is ~190× less data than Pangu-Weather's hourly data at 0.25°, and 90× less than that used for GraphCast, which also uses 6-hourly data but at 0.25° resolution with 37 pressure levels. The training of Stormer was completed in under 24 hours on 128 A100 GPUs. In contrast, Pangu-Weather took 60 days to train 4 models on 192 V100 GPUs, and GraphCast required 28 days on 32 TPUv4 devices. This training efficiency will facilitate future work that builds upon our proposed framework.
…4.3. Scaling analysis: We examine the scalability of Stormer with respect to model size and the number of training tokens. We evaluate three variants of Stormer—Stormer-S, Stormer-B, and Stormer-L—whose parameter counts are similar to ViT-S, ViT-B, and ViT-L, respectively. To understand the impact of training token count, we vary the patch size from 2 to 16; the number of training tokens increases 4× each time the patch size is halved.
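Since tokens tile the spatial grid with non-overlapping patches, halving the patch side quadruples the token count per sample. A quick sketch of this bookkeeping, assuming the 1.40625° grid has 128 × 256 points (180°/1.40625° × 360°/1.40625°):

```python
def num_tokens(height, width, patch):
    """Number of ViT tokens when an H x W grid is tiled with
    non-overlapping patch x patch patches (assumes divisibility)."""
    return (height // patch) * (width // patch)

# Assumed 1.40625-degree grid: 128 x 256 points.
counts = {p: num_tokens(128, 256, p) for p in (16, 8, 4, 2)}
```

Here `counts` goes from 128 tokens at patch size 16 up to 8,192 tokens at patch size 2, a 4× jump at each halving.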
Figure 6: Stormer improves consistently with larger models (top) and smaller patch sizes (bottom).
Figure 6 shows a substantial improvement in forecast accuracy when we increase the model size, and the performance gap widens as we increase the lead time.
Since we do not perform multi-step fine-tuning for these models, minor performance differences at short intervals may become magnified over time. While multi-step fine-tuning could potentially reduce this gap, it is unlikely to eliminate it entirely. Reducing the patch size also consistently improves the performance of the model. From a practical standpoint, smaller patches mean more tokens and consequently more training data. From a climate perspective, smaller patches capture finer weather details and processes not evident in larger patches, allowing the model to more effectively capture the physical dynamics that drive weather patterns.