It feels like, for the past few years, the only things you hear about in AI and machine learning are LLMs and agents, with maybe an image or video generation model here and there. And honestly, it's starting to feel a bit boring to me. So I wanted to look beyond the headline news and learn more about other cutting-edge machine learning applications. As a physicist, I got quite excited about the recent progress in machine-learned weather forecasting.
In this blog post, I want to dig a little bit into the AIFS paper by Lang et al. (2024). AIFS is the Artificial Intelligence Forecasting System by the European Centre for Medium-Range Weather Forecasts (ECMWF). It is a system that now runs alongside the traditional physics-based Integrated Forecasting System (IFS) in order to provide better, more accurate weather forecasts.
The Lang 2024 paper introduces the first version of the AIFS system, and there have been a number of changes and improvements since, which I hope to dig into in follow-up posts. But I think it is a good starting point to jump into ML-based weather forecasting. I am going to write this post as if this were a one-person journal club, and as if I were presenting the paper to a person like myself: someone who has a good understanding of current LLM systems and architectures (and of some physics) and wants to understand where ML-based weather forecasting is similar, and where it differs.
The Setup
Weather forecasting has, for the past seventy years or so, meant numerical weather prediction (NWP). You take a grid of the atmosphere, discretise the relevant equations of fluid dynamics on it, and solve them on a supercomputer. This is hard. Partial differential equations can be quite difficult to work with. You have to deal with discretisation issues, numerical instabilities, and chaotic systems (Lorenz's famous butterfly effect originates in his work on weather modelling), not to mention the parametrisation of all the physical processes that you cannot directly resolve, like cloud formation. It takes tons of computational resources. A single simulation can take hours to run on thousands of CPU cores.
That's why the ML alternative looks so attractive at first glance. Just train a neural network to make the prediction. No need to deal with all the messy physics yourself. And once the network is trained, predictions are fast and cheap. Since the beginning of the deep learning revolution in the 2010s, major companies and institutions have explored the feasibility of this approach: there is Nvidia's FourCastNet (2022), Huawei's Pangu-Weather (2023), Google's GraphCast (2023), and ECMWF's AIFS (2024).
The Data
Of course, as is always the case in machine learning, the performance of a model depends primarily on the availability of high-quality data. And for a weather prediction model, this generally means we are talking about ECMWF's own ERA5 dataset.
ERA5 is what is known as a reanalysis. It is the output of running a classical physics-based weather forecasting model that continuously assimilates, and optimises for, every observation that was available at the time (satellite radiances, radiosondes, surface stations, aircraft, buoys). The result is an hourly, ~31 km resolution, 137-vertical-level global record of the atmosphere stretching from 1940 to today. Actually, the creation of this dataset, and reanalysis in general, is a super interesting topic in its own right. I'll have to make a post about that at some point too. AIFS was trained using the ERA5 data from 1979 up to 2020.
However, this means ERA5 is not made up of direct observations. It is a combined model-observation product. Where observations are dense, ERA5 tracks them closely. Where they're sparse, ERA5 is largely model output. And this means any machine learning model we train will inherit the biases and errors of the original physics-based model, so we will not get around understanding and modelling the physics after all. Furthermore, to make a forecast, we need to provide AIFS with the initial conditions for the atmosphere, which rely on the same physics-derived models that are used for the traditional forecast. So to make a good ML-based model, we also need a good physics-based model. At least for now.
The Grid
The native grid of ERA5 is a reduced Gaussian grid denoted N320. A reduced Gaussian grid avoids the pole problem of simple regular latitude-longitude grids, where the east-west cell width collapses to zero at $\pm 90°$ while staying at roughly 28 km near the equator (for a 0.25° grid). It does so by placing fewer points on each latitude circle towards the poles, which keeps the distance between points roughly uniform across the entire sphere.
| Grid | Resolution | Number of points | Used by AIFS as |
|---|---|---|---|
| N320 (reduced Gaussian) | ~31 km | 542,080 | input / output grid |
| O96 (reduced Gaussian) | ~110 km | 40,320 | processor grid |
| 0.25° lat-lon (for reference) | ~28 km equator, <1 km pole | 1,038,240 | — |
N320 has roughly half the grid points of an equivalent-resolution lat-lon grid, and the cells are roughly the same physical size everywhere. Since the N320 is the native output/data structure of ERA5, the AIFS model uses the same input and output dimensions. This grid has to be passed to the model in a way that preserves the information about the relations between grid points, and a meaningful chunk of the architecture has to deal with this irregular geometry.
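The numbers in the table are easy to check. Below is a minimal sketch, assuming a mean Earth radius of 6371 km and the usual octahedral layout for the O-grids (20 points on the row nearest each pole, four more on every row towards the equator). The function names are mine. The N320 count is not reproduced here, because its points per latitude come from a lookup table rather than a closed formula.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumption of this sketch


def latlon_point_count(step_deg: float) -> int:
    """Points on a regular lat-lon grid that includes both poles."""
    n_lon = round(360.0 / step_deg)
    n_lat = round(180.0 / step_deg) + 1
    return n_lon * n_lat


def octahedral_point_count(n: int) -> int:
    """Points on an octahedral reduced Gaussian grid O<n>.

    Assumes 2*n latitude rows, with 20 points on the row nearest
    each pole and 4 more on each row towards the equator.
    """
    per_hemisphere = sum(4 * i + 16 for i in range(1, n + 1))
    return 2 * per_hemisphere


def east_west_spacing_km(lat_deg: float, step_deg: float) -> float:
    """East-west distance between neighbours on a regular lat-lon grid."""
    return EARTH_RADIUS_KM * math.cos(math.radians(lat_deg)) * math.radians(step_deg)


print(latlon_point_count(0.25))             # 1038240
print(octahedral_point_count(96))           # 40320
print(east_west_spacing_km(0.0, 0.25))      # about 27.8 km at the equator
print(east_west_spacing_km(89.75, 0.25))    # about 0.12 km one row from the pole
```

The last two lines show the pole problem directly: the same 0.25° step spans tens of kilometres at the equator and only about a hundred metres next to the pole.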
[Figure: grid points in two panels, a map view (global grid points) and a pole view (north polar cap).]
The Input Variables
Per grid point, AIFS sees roughly 90 channels: six pressure-level variables (geopotential, three wind components, specific humidity, temperature) at 13 vertical levels, plus eight surface variables (mean sea-level pressure, 2 m temperature, 10 m winds, total column water, etc.), plus static forcing terms (orography, land-sea mask, insolation, latitude, time-of-day, time-of-year). Two time slices are fed in per forward pass ($t-6h$ and $t$). The output is the same 90-ish channels at $t+6h$. The hope is that the atmosphere is sufficiently Markovian on six-hour scales for medium-range forecasts: the state at $t+6h$ is essentially a function of the state at $t$ (with a small benefit from including $t-6h$ to disambiguate time derivatives). So the model doesn't need a year of history in order to make a prediction.
At training time, the input is straight from ERA5. At inference time, the input comes from ECMWF's operational 4D-Var analysis, which is the best estimate of "now" given the last twelve hours of observations. AIFS is therefore fine-tuned specifically on the operational analyses to close the train/inference distribution gap.
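To see where "roughly 90" comes from, here is the channel arithmetic as a small sketch. The variable counts are the ones quoted above. Treating each forcing term as exactly one channel is my assumption (periodic quantities such as time-of-day may well be encoded with more than one), so the total is indicative only.

```python
PRESSURE_LEVEL_VARS = 6   # geopotential, three wind components, humidity, temperature
PRESSURE_LEVELS = 13
SURFACE_VARS = 8
# Assumption: one channel per forcing term listed in the text.
FORCINGS = ["orography", "land-sea mask", "insolation",
            "latitude", "time-of-day", "time-of-year"]
N320_POINTS = 542_080

# Channels the model has to predict at each grid point.
prognostic_channels = PRESSURE_LEVEL_VARS * PRESSURE_LEVELS + SURFACE_VARS

# Channels seen at each grid point for one time slice, forcings included.
channels_per_point = prognostic_channels + len(FORCINGS)

# Scalar values in one full global state on the N320 grid.
values_per_state = N320_POINTS * channels_per_point

print(prognostic_channels)   # 86
print(channels_per_point)    # 92 under the one-channel-per-forcing assumption
print(values_per_state)
```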
The Architecture
AIFS follows an encoder → processor → decoder pattern. The encoder is a graph neural network that maps from the high-resolution input reduced Gaussian grid (N320) onto a coarser processor mesh (a coarse, octahedral reduced Gaussian grid, O96). The processor is a stack of transformer layers that does the bulk of the computation on the processor mesh. The decoder is a second graph neural network that maps the processor's output back to the high-resolution grid.
The Encoder And Decoder
Both the encoder and decoder operate on bipartite graphs: every edge connects a node on one grid to a node on the other. The connectivity is asymmetric:
- Encoder (N320 → O96): each O96 processor node has incoming edges from all N320 nodes inside a fixed great-circle cut-off radius.
- Decoder (O96 → N320): each N320 output node connects to its three nearest O96 nodes.
Edge features encode geometry: great-circle distance and direction. AIFS additionally attaches eight learnable per-node and per-edge features that are pure model parameters. The mapper blocks use the graph-transformer convolution from Shi et al. 2021. I don't have a lot of experience with Graph Neural Networks, so we will have to dig into the details in another post. But for now, let's just treat it as a general encoder setup: we encode the relevant information (values at the nodes/grid points + relationship to other nodes) and pass them on to the processor.
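The two connectivity rules are simple enough to sketch in a few lines. This is a toy illustration, not the AIFS implementation: the grids are random point clouds standing in for N320 and O96, the cut-off radius of 2500 km is a made-up value chosen for these sparse toy grids, and all function names are hypothetical.

```python
import math
import random


def great_circle_km(p, q, radius_km=6371.0):
    """Haversine distance between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(a)))


def random_points(n, rng):
    """Roughly uniform random (lat, lon) points on the sphere, in degrees."""
    return [(math.degrees(math.asin(rng.uniform(-1, 1))), rng.uniform(-180, 180))
            for _ in range(n)]


def encoder_edges(fine, coarse, cutoff_km):
    """Edges (fine_idx, coarse_idx): all fine nodes within the cut-off of a coarse node."""
    return [(i, j)
            for j, c in enumerate(coarse)
            for i, f in enumerate(fine)
            if great_circle_km(f, c) <= cutoff_km]


def decoder_edges(coarse, fine, k=3):
    """Edges (coarse_idx, fine_idx): each fine node receives from its k nearest coarse nodes."""
    edges = []
    for i, f in enumerate(fine):
        by_distance = sorted(range(len(coarse)),
                             key=lambda j: great_circle_km(f, coarse[j]))
        edges.extend((j, i) for j in by_distance[:k])
    return edges


rng = random.Random(0)
fine = random_points(400, rng)    # stand-in for the N320 data grid
coarse = random_points(40, rng)   # stand-in for the O96 processor grid
enc = encoder_edges(fine, coarse, cutoff_km=2500.0)
dec = decoder_edges(coarse, fine, k=3)
print(len(enc), len(dec))
```

The asymmetry shows up in the edge counts: the decoder always has exactly three edges per output node, while the number of encoder edges depends on how many data-grid points fall inside each cut-off circle.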
Processor: Sliding-Window Transformer
The processor is made up of sixteen standard pre-norm transformer layers with multi-head attention. Since the input consists of $n = 40{,}320$ O96 nodes, full self-attention would need to calculate $n^2 \approx 1.6\times10^9$ pairwise interactions per layer per head, which would be excessively expensive. Instead, the processor uses sliding-window attention along latitude bands (similar to Longformer, Sparse Transformers, and Mistral's SWA). Sliding window in this case means that the attention window shifts in each layer, so the receptive field grows linearly with depth. Unlike for LLMs, there is no need for a causal mask, because the "autoregressive" part of AIFS happens at the rollout level (predicting $t+6h$ from $t$). Within a single forward pass, attention is fully bidirectional in space.
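To make the cost argument concrete, here is a minimal sketch of non-causal windowed attention on a 1D sequence, plus a count of how many query-key pairs get scored. The 1D ordering is a stand-in for the latitude-band ordering, the half-width of 512 is a hypothetical value (the window size is not given above), and the per-layer shifting of the window is left out.

```python
import math
import random


def window_attention(q, k, v, half_width):
    """Non-causal attention where token i only attends to j with |i - j| <= half_width."""
    n, d = len(q), len(q[0])
    out = []
    for i in range(n):
        lo, hi = max(0, i - half_width), min(n, i + half_width + 1)
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
                  for j in range(lo, hi)]
        top = max(scores)                      # subtract the max for a stable softmax
        weights = [math.exp(s - top) for s in scores]
        norm = sum(weights)
        out.append([sum(w * v[j][c] for w, j in zip(weights, range(lo, hi))) / norm
                    for c in range(len(v[0]))])
    return out


def attended_pairs(n, half_width=None):
    """Number of (query, key) pairs scored per layer and head."""
    if half_width is None:                     # full self-attention
        return n * n
    return sum(min(n, i + half_width + 1) - max(0, i - half_width) for i in range(n))


rng = random.Random(0)
n, d = 12, 4
q = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
k = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
v = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
out = window_attention(q, k, v, half_width=2)

print(attended_pairs(40_320))                  # 1625702400, the full n^2 cost
print(attended_pairs(40_320, half_width=512))  # hypothetical window size
```

Note that both the past and the future neighbours in the sequence are inside the window, which is what "no causal mask" means here. With a fixed window, one layer lets information travel `half_width` positions, so after sixteen layers it can travel sixteen times as far.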
Parallelism, Briefly
One AIFS instance is split across four A100 40 GB GPUs within a node. The parallelism stack is the standard large-model toolbox: mixed precision training, activation checkpointing, tensor parallelism via attention head sharding, and data parallelism. Since sliding-window attention is used, there is no context parallelisation despite the long sequence length (although it could be implemented in principle). AIFS also includes node and edge sharding across the graph.
Training
The forward problem is supervised regression: given $x_{t-6}$ and $x_t$ (time offsets in hours), predict $x_{t+6}$:
$$\hat{x}_{t+6} = f_\theta\big(x_{t-6},\, x_t\big).$$
The loss is an area-weighted mean squared error. The individual loss terms are scaled empirically per variable, such that each physical quantity contributes roughly equally. Additionally, the weights decrease linearly with height, so that accuracy in the lower atmosphere matters more than accuracy in the upper atmosphere. Similar to LLMs, AIFS uses a three-stage training process: a pre-training phase, where the model is trained to make a single $\hat{x}_{t+6}$ prediction; a rollout phase, where the model is fed its own predictions from previous steps through up to twelve six-hour forward passes; and a fine-tuning phase, where the model is adapted from ERA5 reanalysis inputs to operational IFS analyses.
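The loss described at the start of this section can be sketched as follows. This is a toy version under my own assumptions: the area weights are normalised to sum to one, the per-channel weight is simply a level weight (the empirical per-variable scaling is omitted), and the level weight is taken proportional to pressure, which is only one possible reading of "decreases linearly with height". All names are hypothetical.

```python
def normalise(weights):
    """Rescale non-negative weights so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


def pressure_level_weights(levels_hpa):
    """Hypothetical ramp: weight proportional to pressure, so it shrinks with height."""
    p_max = max(levels_hpa)
    return [p / p_max for p in levels_hpa]


def weighted_mse(pred, target, area_w, channel_w):
    """Area- and channel-weighted squared error.

    pred, target: one list of channel values per grid point.
    area_w:       one weight per grid point (normalised to sum to one).
    channel_w:    one weight per channel.
    """
    loss = 0.0
    for p_row, t_row, a in zip(pred, target, area_w):
        loss += a * sum(c * (p - t) ** 2 for p, t, c in zip(p_row, t_row, channel_w))
    return loss


# Toy setup: three grid cells, one variable on two levels (1000 hPa and 500 hPa).
area_w = normalise([1.0, 2.0, 1.0])               # the middle cell covers twice the area
channel_w = pressure_level_weights([1000, 500])   # [1.0, 0.5]
target = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
err_low = [[0.0, 0.0], [1.0, 0.0], [0.0, 0.0]]    # unit error at 1000 hPa
err_high = [[0.0, 0.0], [0.0, 1.0], [0.0, 0.0]]   # same error at 500 hPa

print(weighted_mse(err_low, target, area_w, channel_w))   # 0.5
print(weighted_mse(err_high, target, area_w, channel_w))  # 0.25
```

The same unit error costs half as much at 500 hPa as at 1000 hPa, which is the mechanism by which the model is nudged to prioritise the lower atmosphere.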
| Phase | Data | Steps | Learning rate | Forecast horizon |
|---|---|---|---|---|
| 1. Pre-training | ERA5 1979–2020 | 260,000 | cosine, $10^{-4} \to 3\times10^{-7}$ | single 6 h step |
| 2. Rollout fine-tuning | ERA5 1979–2018 | up to 12 unrolled steps | $6\times10^{-7}$ | 6 h → 72 h, increased by 6 h every 1000 steps |
| 3. Operational fine-tuning | IFS analyses 2019–2020 | rollout | $6\times10^{-7}$ | same rollout schedule |
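The two schedules in the table can be written down in a few lines. This is a sketch under assumptions: I use the textbook half-cosine decay between the two quoted learning rates without any warm-up, and I assume the rollout starts at a single 6 h step and gains one step every 1000 updates until it reaches 72 h. The function names are mine.

```python
import math


def cosine_lr(step, total_steps=260_000, lr_max=1e-4, lr_min=3e-7):
    """Half-cosine decay from lr_max at step 0 to lr_min at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


def rollout_horizon_hours(step, start_h=6, max_h=72, every=1000):
    """Forecast horizon during rollout training: +6 h every `every` steps, capped at max_h."""
    return min(max_h, start_h + 6 * (step // every))


for s in (0, 130_000, 260_000):
    print(s, cosine_lr(s))

for s in (0, 1000, 5000, 11_000, 20_000):
    print(s, rollout_horizon_hours(s), "h")
```

Under these assumptions the rollout reaches its full twelve-step, 72 h horizon after 11,000 updates.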
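The rollout phase deserves a closer look. The first step starts from the two true states, $\hat{x}_{t-6} = x_{t-6}$ and $\hat{x}_t = x_t$. For each subsequent step, we feed the model prediction back in as the next input,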
$$\hat{x}_{t+6(k+1)} = f_\theta\big(\hat{x}_{t+6(k-1)},\, \hat{x}_{t+6k}\big).$$
The loss is calculated against the ground truth data at every step,
$$\mathcal{L}(\theta) = \frac{1}{N}\sum_{k=1}^{N} \ell\big(\hat{x}_{t+6k},\, x_{t+6k}\big),$$
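with $\ell$ being the area-weighted MSE and $N$ being the number of rollout steps. Write $\hat{x}_k(\theta) = \hat{x}_{t+6k}$ and $\ell_k = \ell\big(\hat{x}_k,\, x_{t+6k}\big)$. Each state depends on $\theta$ both directly and through the earlier predictions it was computed from, so differentiating $\mathcal{L}$ unrolls the chain rule across the whole sequence. Keeping only the dependence on the immediately preceding state $\hat{x}_{k-1}(\theta)$ for readability (the second input, $\hat{x}_{k-2}$, contributes analogous terms), this gives: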
$$\frac{\partial \mathcal{L}}{\partial \theta} = \frac{1}{N}\sum_{k=1}^{N}\sum_{j=1}^{k} \frac{\partial \ell_k}{\partial \hat{x}_k}\left(\prod_{i=j+1}^{k}\frac{\partial \hat{x}_i}{\partial \hat{x}_{i-1}}\right)\frac{\partial \hat{x}_j}{\partial \theta}.$$
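The forward half of this procedure fits in a short loop. The sketch below uses a toy one-parameter "model" (linear extrapolation with a damping factor `theta`) and a plain MSE in place of the real network and the area-weighted loss, both of which are my stand-ins. It only computes the rollout loss; in an autodiff framework the gradient would then flow back through every step of the loop, as in the expression above.

```python
def mse(pred, truth):
    """Plain mean squared error between two equally long lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(pred)


def rollout_loss(step_fn, truth, n_steps, loss_fn=mse):
    """Average loss over an n_steps rollout.

    truth[0] and truth[1] play the role of x_{t-6} and x_t,
    and truth[1 + k] is the ground truth x_{t+6k}.
    """
    prev, cur = truth[0], truth[1]          # the first step sees true states only
    total = 0.0
    for k in range(1, n_steps + 1):
        nxt = step_fn(prev, cur)            # prediction for t + 6k
        total += loss_fn(nxt, truth[1 + k])
        prev, cur = cur, nxt                # feed the prediction back in
    return total / n_steps


def make_toy_model(theta):
    """Toy stand-in for f_theta: extrapolate the last change, scaled by theta."""
    def step(prev, cur):
        return [c + theta * (c - p) for p, c in zip(prev, cur)]
    return step


# Ground truth: a single value rising by one unit per six-hour step.
truth = [[float(i)] for i in range(14)]

print(rollout_loss(make_toy_model(1.0), truth, n_steps=12))   # 0.0, perfect model
print(rollout_loss(make_toy_model(0.9), truth, n_steps=1))
print(rollout_loss(make_toy_model(0.9), truth, n_steps=12))
```

With `theta = 0.9` the single-step error is small, but it compounds as predictions are fed back in, so the twelve-step loss is much larger than the one-step loss. That compounding is exactly what rollout training exposes the model to.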
The optimiser used is AdamW with $(\beta_1, \beta_2) = (0.9, 0.95)$. Training takes roughly one week of wall time on 64 A100s. Once trained, a ten-day forecast takes about two and a half minutes on a single A100. For reference, the equivalent IFS run takes hours on a thousand CPUs.
Results
To evaluate the performance of the model, the anomaly correlation coefficient (ACC) is used. The anomaly is the difference between a value at a grid point (forecast or analysed) and the historical average (the climatology) for that value. The ACC measures the (Pearson) correlation between the forecast anomaly and the analysis (observed ground truth) anomaly. By focusing on the anomaly rather than absolute values, we remove the easily predictable background (e.g. colder temperatures towards the poles). By focusing on the correlation, we focus on predicting the correct patterns without having to worry about (regional) biases.
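As a concrete illustration, here is a minimal sketch of the ACC as a weighted Pearson correlation of anomalies. The function name and the toy numbers are mine, and I subtract the weighted mean anomaly to match the Pearson definition used above. Some definitions of the ACC omit that step.

```python
import math


def anomaly_correlation(forecast, analysis, climatology, weights):
    """Weighted Pearson correlation between forecast and analysis anomalies."""
    f_anom = [f - c for f, c in zip(forecast, climatology)]
    a_anom = [a - c for a, c in zip(analysis, climatology)]
    w_sum = sum(weights)
    f_mean = sum(w * x for w, x in zip(weights, f_anom)) / w_sum
    a_mean = sum(w * y for w, y in zip(weights, a_anom)) / w_sum
    cov = sum(w * (x - f_mean) * (y - a_mean)
              for w, x, y in zip(weights, f_anom, a_anom))
    f_var = sum(w * (x - f_mean) ** 2 for w, x in zip(weights, f_anom))
    a_var = sum(w * (y - a_mean) ** 2 for w, y in zip(weights, a_anom))
    return cov / math.sqrt(f_var * a_var)


# Toy 2 m temperatures in kelvin at four grid points with equal area weights.
climatology = [280.0, 285.0, 290.0, 285.0]
analysis = [282.0, 284.0, 293.0, 283.0]
weights = [1.0, 1.0, 1.0, 1.0]

biased = [a + 1.5 for a in analysis]                            # right pattern, warm bias
flipped = [2 * c - a for a, c in zip(analysis, climatology)]    # anomalies with wrong sign

print(anomaly_correlation(analysis, analysis, climatology, weights))
print(anomaly_correlation(biased, analysis, climatology, weights))
print(anomaly_correlation(flipped, analysis, climatology, weights))
```

The uniformly biased forecast still scores an ACC of essentially one, while the forecast with sign-flipped anomalies scores essentially minus one. This is the sense in which the metric rewards patterns and ignores a constant bias.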
The headline result: on Northern Hemisphere 500 hPa geopotential ACC, AIFS outperforms IFS at 2-, 6-, and 10-day lead times across the whole verification period. It performs worse in the upper atmosphere, which can be traced back to the decreased weight in the loss calculation.
As the forecast horizon increases, the AIFS prediction becomes visibly smoother, and small-scale structures that are present at day 1 get progressively blurred out as the forecast unrolls. The same effect shows up in GraphCast, Pangu and other ML-based weather models. This is caused by the double-penalty problem: if the model predicts a sharp feature in the wrong spot, an MSE-type loss penalises it twice, once for the feature it predicted where there is none and once for the feature it missed where there is one. As uncertainty grows with lead time, the model therefore increasingly prefers to predict smooth, mean-like fields. The same problem occurred in pre-diffusion image generator models. For this reason, the ECMWF has been looking at ensemble-based approaches as well.
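The double penalty is easy to reproduce in one dimension. In this toy sketch (all values are made up for illustration), the truth is a single sharp spike, and three candidate forecasts are scored with a plain MSE: a sharp spike two cells off, a completely flat field, and a low, smeared-out bump.

```python
def mse(pred, truth):
    """Plain mean squared error between two equally long lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(pred)


def spike(n, centre, height=1.0):
    """A field of zeros with a single sharp feature at `centre`."""
    return [height if i == centre else 0.0 for i in range(n)]


n = 20
truth = spike(n, 10)
sharp_misplaced = spike(n, 12)                               # right shape, wrong spot
flat = [0.0] * n                                             # no feature at all
blurred = [0.2 if 8 <= i <= 12 else 0.0 for i in range(n)]   # smeared-out feature

print("sharp but misplaced:", mse(sharp_misplaced, truth))   # two full-size errors
print("flat:               ", mse(flat, truth))              # one full-size error
print("blurred:            ", mse(blurred, truth))
```

The realistic-looking but misplaced spike scores worst, the featureless field scores better, and the blurred bump scores best of all. A model trained on this loss has every incentive to blur when it is unsure where a feature will be.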
What's Next
The ECMWF team that built AIFS was quite aware of this smoothing problem from the start and built the model with the goal of extending the point prediction model to a probabilistic forecasting system. Probabilistic ensemble forecasts are the standard for physics-based models as well, and come with the advantage that they would not only solve the blurring problem but also yield uncertainties for their predictions. They are therefore the obvious next step after this initial model, and that is exactly the direction the model developers went.
The paper we have been walking through describes the first operational AIFS version from 2024. In the time since, AIFS has been extended in a number of ways, which we will explore in follow-up posts. For your own reading, the next papers we will go through are the following:
- AIFS-CRPS (2024) is the direct extension of the AIFS system to a probabilistic system. The model architecture stays the same, but multiple predictions are made by injecting Gaussian noise into the processor states. The ensemble output is then scored with a probabilistic score in order to generate a well-calibrated ensemble forecast. This solves two problems at once: the ensemble members stay sharp without progressive smoothing, and we get the ability to estimate uncertainties on the output.
- AIFS 1.1 (2025) adds physical consistency constraints in order to avoid unphysical outputs and adds additional output variables.
- AIFS v2 (2026) arrived only this month (12 May 2026), taking both the deterministic and ensemble models to version two and shipping alongside a major update to the physics-based IFS.
- Anemoi is the open-source, collaboratively developed framework that covers the whole pipeline, from preparing ML-ready datasets to training models to running them operationally.