Most state-of-the-art approaches for weather and climate modeling are based on physics-informed numerical models of the atmosphere. These approaches aim to model the non-linear dynamics and complex interactions between multiple variables, which are challenging to approximate. Additionally, many such numerical models are computationally intensive, especially when modeling atmospheric phenomena at a fine-grained spatial and temporal resolution. Recent data-driven approaches based on machine learning instead aim to directly solve a downstream forecasting or projection task by learning a functional mapping with deep neural networks. However, these networks are trained using curated and homogeneous climate datasets for specific spatiotemporal tasks, and thus lack the generality of numerical models.
We develop and demonstrate ClimaX, a flexible and generalizable deep learning model for weather and climate science that can be trained using heterogeneous datasets spanning different variables, spatio-temporal coverage, and physical groundings. ClimaX extends the Transformer architecture with novel encoding and aggregation blocks that allow effective use of available compute while maintaining general utility.
ClimaX is pre-trained with a self-supervised learning objective on climate datasets derived from CMIP6. The pre-trained ClimaX can then be fine-tuned to address a breadth of climate and weather tasks, including those that involve atmospheric variables and spatio-temporal scales unseen during pretraining.
Compared to existing data-driven baselines, we show that this generality in ClimaX results in superior performance on benchmarks for weather forecasting and climate projections, even when pretrained at lower resolutions and compute budgets.
Figure 11: Error on ERA5 3-day forecasting for different variables as a function of the amount of CMIP6 5.625° data seen during pre-training. Bigger models are more sample-efficient.
4.5. Scaling laws analysis: …Figure 11 presents the performance of ClimaX as a function of data size and model capacity. The x-axis is the pretraining data size measured in gigabytes, which corresponds to 1–5 CMIP6 datasets, and the y-axis shows the RMSE of ClimaX on the 3-day forecasting task. We compare 4 ClimaX models with different capacities by varying the embedding dimension from 128 to 1,024. All experiments are conducted on the 5.625° data. The error of the two biggest models decreases consistently as we increase the data and model size. This highlights the unique ability of ClimaX to learn from diverse and heterogeneous data sources, which allows us to further improve performance by simply pretraining on more data. However, the two smaller models do not scale as well as the bigger ones: for them, increasing the data size yields little improvement and can sometimes hurt performance. This result shows that larger models not only perform better but are also more sample-efficient.
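As an illustration of the metric on the y-axis of Figure 11, the following is a minimal sketch of a latitude-weighted RMSE over a global grid. The cos(latitude) weighting is a common convention in global forecasting benchmarks and is an assumption here, as are the function names and the nested-list data layout; the text above only states that RMSE is reported.

```python
import math


def latitude_weights(lats_deg):
    """Weights proportional to cos(latitude), normalized to have mean 1.

    Assumption: grid cells shrink toward the poles on an equiangular grid,
    so each latitude row is weighted by the cosine of its latitude.
    """
    cos_lat = [math.cos(math.radians(lat)) for lat in lats_deg]
    mean_cos = sum(cos_lat) / len(cos_lat)
    return [c / mean_cos for c in cos_lat]


def weighted_rmse(pred, target, lats_deg):
    """Latitude-weighted RMSE for fields given as nested lists [n_lat][n_lon]."""
    weights = latitude_weights(lats_deg)
    n_lat, n_lon = len(pred), len(pred[0])
    total = 0.0
    for w, pred_row, target_row in zip(weights, pred, target):
        # Sum of squared errors along one latitude row, scaled by its weight.
        total += w * sum((p - t) ** 2 for p, t in zip(pred_row, target_row))
    return math.sqrt(total / (n_lat * n_lon))


# Hypothetical usage on a 32 x 64 grid with row centers spaced 5.625 degrees apart.
lats = [-90 + 5.625 / 2 + i * 5.625 for i in range(32)]
truth = [[0.0] * 64 for _ in range(32)]
forecast = [[2.0] * 64 for _ in range(32)]
score = weighted_rmse(forecast, truth, lats)
```

Because the weights are normalized to mean 1, a spatially uniform error yields the same value as an unweighted RMSE; the weighting only changes the score when errors are concentrated at particular latitudes.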
Figure 12: Scaling performance with respect to data resolution. Despite a larger patch size, ClimaX (1.40625°) achieves consistently better performance than the low-resolution model on almost all tasks, except for T2m forecasts at 1-day and 3-day lead times.
In addition to data size and model capacity, data resolution is another important scaling dimension in the context of weather and climate. In many vision tasks such as classification, understanding the general, high-level structure of the image is sufficient to make accurate predictions. To model the complex physical processes that govern weather and climate, however, a model must attend to fine-grained details of the input in order to understand the spatial and temporal structure of the data as well as the interactions between different variables. High-resolution data contains finer details and local weather processes that are not present in low-resolution data, and thus provides stronger signals for training deep learning models. Figure 12 compares the performance of ClimaX pretrained and finetuned on 5.625° and 1.40625° data on global forecasting. Except for T2m at 1-day and 3-day lead times, ClimaX (1.40625°) consistently achieves lower RMSE and higher ACC than the low-resolution model. We note that for the high-resolution data we have to use a larger patch size (4, compared to 2 for the low-resolution data) because of memory constraints. The performance of ClimaX on the 1.40625° data could likely be improved further by reducing the patch size, since smaller patches would let the model capture finer details.
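The memory trade-off behind the patch-size choice can be made concrete with a little arithmetic. The sketch below counts patch tokens for a global grid at each resolution, assuming an equiangular grid with 180/r latitude rows and 360/r longitude columns and one token per spatial patch after variable aggregation. The helper names are hypothetical, and the quadratic cost model describes standard self-attention rather than any measured memory figure.

```python
def grid_shape(resolution_deg):
    """(n_lat, n_lon) of a global equiangular grid at the given resolution (assumed layout)."""
    return round(180 / resolution_deg), round(360 / resolution_deg)


def num_tokens(resolution_deg, patch_size):
    """Number of non-overlapping square patches covering the global grid."""
    n_lat, n_lon = grid_shape(resolution_deg)
    if n_lat % patch_size or n_lon % patch_size:
        raise ValueError("patch size must divide both grid dimensions")
    return (n_lat // patch_size) * (n_lon // patch_size)


def attention_cost_ratio(tokens_a, tokens_b):
    """Relative cost of self-attention, which grows quadratically with sequence length."""
    return (tokens_a / tokens_b) ** 2


low_res = num_tokens(5.625, patch_size=2)          # 32 x 64 grid, patch size 2
high_res = num_tokens(1.40625, patch_size=4)       # 128 x 256 grid, patch size 4
high_res_small = num_tokens(1.40625, patch_size=2) # same grid, patch size 2

print(low_res, high_res, high_res_small)  # 512 2048 8192
print(attention_cost_ratio(high_res_small, high_res))  # 16.0
```

Under these assumptions, halving the patch size on the 1.40625° grid quadruples the sequence length and raises the attention cost by roughly a factor of 16, which is consistent with the memory constraint described above.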
…Future research could explore incorporating both observational and simulated datasets that cover a wider range of climate variables and higher spatiotemporal resolutions, and that even extend into future scenarios. Further, we showed that resolution plays a crucial role in the scaling of ClimaX. Due to our compute restrictions, we trained ClimaX on low to moderate resolutions. Nevertheless, our empirical trends suggest that scaling to higher resolutions (0.25°) is likely to lead to even better results.