The ECMWF's Artificial Intelligence Forecasting System

May 20, 2026

GNN, Transformers, Weather Forecasting

Update (2026-06-02): I just finished a follow-up post covering the probabilistic extension of the model: From Deterministic Weather Forecasts To Ensembles.

It feels like, for the past few years, the only things you hear about in AI and machine learning are LLMs and agents, maybe an image or video generation model here and there. And honestly, it's really starting to feel a bit boring to me. So I wanted to look a bit beyond all the headline news and learn a bit more about other cutting-edge machine learning applications. And being a physicist, the recent progress in machine-learned weather forecasting got me quite excited.

In this blog post, I want to dig a little bit into the AIFS paper by Lang et al. (2024). AIFS is the Artificial Intelligence Forecasting System by the European Centre for Medium-Range Weather Forecasts (ECMWF). It is a system that now runs alongside the traditional physics-based Integrated Forecasting System (IFS) in order to provide better, more accurate weather forecasts.

The Lang 2024 paper introduces the first version of the AIFS system, and there have been a number of changes and improvements since, which I hope to dig into in follow-up posts. But I think it is a good starting point to jump into ML-based weather forecasting. I am going to write this post as if this were a one-person journal club, and as if I were presenting the paper to a person like myself: someone who has a good understanding of current LLM systems and architectures (and of some physics) and wants to understand where ML-based weather forecasting is similar, and where it differs.

↔ LLM Analogy Think of AIFS as a transformer-based, autoregressive next-token predictor just like an LLM, except that a token is an atmospheric state rather than a word or word fragment.

The Setup

Weather forecasting has, for the past seventy years or so, meant numerical weather prediction (NWP). You take a grid of the atmosphere, discretise the relevant equations of fluid dynamics on it, and solve them on a supercomputer. This is hard. Partial differential equations can be quite difficult to work with. You have to deal with discretisation issues, numerical instabilities, and chaotic systems (Lorenz's famous butterfly effect originates in his work on weather modelling), not to mention the parametrisation of all the physical processes that you cannot directly resolve, like cloud formation. It takes tons of computational resources. A single simulation can take hours to run on thousands of CPU cores.

That's why the ML alternative looks so attractive at first glance. Just train a neural network to make the prediction. No need to deal with all the messy physics yourself. And once the network is trained, predictions are fast and cheap. Since the beginning of the deep learning revolution in the 2010s, major companies and institutions have explored the feasibility of this approach: there is Nvidia's FourCastNet (2022), Huawei's Pangu-Weather (2023), Google's GraphCast (2023), and ECMWF's AIFS (2024).

Figure (interactive): A non-exhaustive timeline of ML-based weather prediction models up to 2024.

In less than six years, we went from feasibility studies (Dueben & Bauer) to competitive accuracy (FourCastNet, Pangu, GraphCast) to an operational pipeline (AIFS).

The Data

Of course, as is always the case in machine learning, the performance of a model depends primarily on the availability of high-quality data. And for a weather prediction model, this generally means we are talking about ECMWF's own ERA5 dataset.

ERA5 is what is known as a reanalysis. It is the output of running a classical physics-based weather forecasting model that continuously assimilates, and optimises for, every observation that was available at the time (satellite radiances, radiosondes, surface stations, aircraft, buoys). The result is an hourly, ~31 km resolution, 137-vertical-level global record of the atmosphere stretching from 1940 to today. Actually, the creation of this dataset, and reanalysis in general, is a super interesting topic in its own right. I'll have to make a post about that at some point too. AIFS was trained using the ERA5 data from 1979 up to 2020.

However, this means ERA5 is not made up of direct observations. It is a combined model-observation product. Where observations are dense, ERA5 tracks them closely. Where they're sparse, ERA5 is largely model output. And this means any machine learning model we train will inherit the biases and errors of the original physics-based model, so we will not get around understanding and modelling the physics after all. Furthermore, to make a forecast, we need to provide AIFS with the initial conditions for the atmosphere, which rely on the same physics-derived models that are used for the traditional forecast. So to make a good ML-based model, we also need a good physics-based model. At least for now.

The Grid

The native grid of ERA5 is a reduced Gaussian grid denoted N320. A reduced Gaussian grid avoids the pole problem of simple regular latitude-longitude grids, where the east-west cell width collapses towards zero at $\pm 90°$ while staying at roughly 28 km near the equator (for a 0.25° grid). It does so by placing fewer points on each latitude circle towards the poles, which keeps the distance between points roughly uniform across the entire sphere.

Grid	Resolution	Number of points	Used by AIFS as
N320 (reduced Gaussian)	~31 km	542,080	input / output grid
O96 (reduced Gaussian)	~110 km	40,320	processor grid
0.25° lat-lon (for reference)	~28 km equator, <1 km pole	1,038,240	—

N320 has roughly half the grid points of an equivalent-resolution lat-lon grid, and the cells are roughly the same physical size everywhere. Since the N320 is the native output/data structure of ERA5, the AIFS model uses the same input and output dimensions. This grid has to be passed to the model in a way that preserves the information about the relations between grid points, and a meaningful chunk of the architecture has to deal with this irregular geometry.
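To get a feeling for the numbers in the table, here is a small back-of-the-envelope sketch. It only uses spherical geometry; the Earth radius value and the function name `zonal_spacing_km` are my own choices for illustration, not something taken from the paper.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed value for this sketch)

def zonal_spacing_km(lat_deg: float, dlon_deg: float = 0.25) -> float:
    """East-west distance between neighbouring points of a regular lat-lon grid."""
    circumference = 2.0 * math.pi * EARTH_RADIUS_KM * math.cos(math.radians(lat_deg))
    return circumference * dlon_deg / 360.0

# Point count of a global 0.25 degree grid that includes both poles.
n_lon = int(360 / 0.25)        # 1440 longitudes
n_lat = int(180 / 0.25) + 1    # 721 latitudes
latlon_points = n_lon * n_lat  # matches the 1,038,240 in the table

n320_points = 542_080          # taken from the table above
ratio = n320_points / latlon_points

for lat in (0, 45, 60, 85, 89):
    print(f"lat {lat:2d} deg: zonal spacing {zonal_spacing_km(lat):6.2f} km")
print(f"N320 / lat-lon point ratio: {ratio:.2f}")
```

The east-west spacing shrinks with the cosine of the latitude, which is exactly the clustering that the reduced Gaussian grid avoids, and the point ratio comes out at roughly one half.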

Figure (interactive): Rectangular vs. Gaussian grids, shown as a global map view and a north polar cap view. The lat-lon grid visibly clusters and distorts near the poles; the reduced Gaussian grid (542,080 points, about 31 km spacing at both the equator and the pole) stays roughly uniform across the sphere.

The Input Variables

Per grid point, AIFS sees roughly 90 channels: six pressure-level variables (geopotential, three wind components, specific humidity, temperature) at 13 vertical levels, plus eight surface variables (mean sea-level pressure, 2 m temperature, 10 m winds, total column water, etc.), plus static forcing terms (orography, land-sea mask, insolation, latitude, time-of-day, time-of-year). Two time slices are fed in per forward pass ($t-6h$ and $t$). The output is the same 90-ish channels at $t+6h$. The hope is that the atmosphere is sufficiently Markovian on six-hour scales for medium-range forecasts: the state at $t+6h$ is essentially a function of the state at $t$ (with a small benefit from including $t-6h$ to disambiguate time derivatives). So the model doesn't need a year of history in order to make a prediction.
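As a quick sanity check on the "roughly 90 channels", here is the bookkeeping written out. The variable names are placeholders of mine, and I am assuming that each forcing named above counts as exactly one channel and is repeated for both time slices; the real model may encode some of them differently.

```python
# Channel bookkeeping for one grid point, following the counts quoted in the text.
PRESSURE_LEVEL_VARS = [
    "geopotential", "u_wind", "v_wind", "w_wind", "specific_humidity", "temperature",
]
N_PRESSURE_LEVELS = 13
N_SURFACE_VARS = 8

# Assumption: every forcing listed in the text is a single channel.
FORCINGS = [
    "orography", "land_sea_mask", "insolation", "latitude", "time_of_day", "time_of_year",
]

n_prognostic = len(PRESSURE_LEVEL_VARS) * N_PRESSURE_LEVELS + N_SURFACE_VARS
n_per_time_slice = n_prognostic + len(FORCINGS)

# Two time slices (t - 6h and t) are stacked for one forward pass.
N_TIME_SLICES = 2
n_input_features = N_TIME_SLICES * n_per_time_slice

print(n_prognostic, n_per_time_slice, n_input_features)
```

Under these assumptions, the pressure-level and surface variables alone already account for 86 channels per time slice, which is where the "90-ish" comes from.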

↔ LLM Analogy This differs a bit from language modelling, where we care about long-range dependencies between tokens. Here, we essentially only use two context tokens to predict one output token.

At training time, the input is straight from ERA5. At inference time, the input comes from ECMWF's operational 4D-Var analysis, which is the best estimate of "now" given the last twelve hours of observations. AIFS is fine-tuned specifically on the operational analyses to close the train/inference distribution gap.

The Architecture

AIFS follows an encoder → processor → decoder pattern. The encoder is a graph neural network that maps from the high-resolution input reduced Gaussian grid (N320) onto a coarser processor mesh (a coarse, octahedral reduced Gaussian grid, O96). The processor is a stack of transformer layers that does the bulk of the computation on the processor mesh. The decoder is a second graph neural network that maps the processor's output back to the high-resolution grid.

Figure (interactive): Encoder, processor, decoder, and what flows through each block.

Block 01, encoder (N320 → O96 graph mapper): the ERA5 input grid (542k nodes) is projected onto the O96 processor grid (40k nodes) by a bipartite graph-transformer convolution. Each processor node attends to its data-grid neighbours inside a cut-off radius.
Block 02, processor (16-layer sliding-window transformer): pre-norm transformer blocks over the O96 mesh. Attention is windowed along latitude bands; the receptive field grows as $L \times w$ with depth.
Block 03, decoder (O96 → N320 graph mapper): the decoder is bipartite in the other direction. Each ERA5 output node connects to its three nearest processor nodes, and this uniform in-degree gives a clean projection back to full resolution.

The Encoder And Decoder

Both the encoder and decoder are bipartite graphs — every edge connects a node on one grid to a node on the other. The connectivity is asymmetric:

Encoder (N320 → O96): each O96 processor node has incoming edges from all N320 nodes inside a fixed great-circle cut-off radius.
Decoder (O96 → N320): each N320 output node connects to its three nearest O96 nodes.

Edge features encode geometry: great-circle distance and direction. AIFS additionally uses eight learnable per-node and per-edge features that are pure model parameters. The mapper blocks use the graph-transformer convolution from Shi et al. 2021. I don't have a lot of experience with Graph Neural Networks, so we will have to dig into the details in another post. But for now, let's just treat it as a general encoder setup: we encode the relevant information (values at the nodes/grid points + relationship to other nodes) and pass it on to the processor.
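The asymmetric connectivity is easy to sketch on a toy example. The snippet below builds encoder edges with a cut-off radius and decoder edges to the three nearest processor nodes. The grids, the cut-off value and the function names are all made up for illustration, and the brute-force search is nothing like what you would use for half a million nodes.

```python
import math

def great_circle_km(p, q, radius_km=6371.0):
    """Haversine distance between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

def encoder_edges(data_nodes, proc_nodes, cutoff_km):
    """Edges (data_idx, proc_idx): all data nodes inside the cut-off radius of a processor node."""
    return [(i, j)
            for j, pn in enumerate(proc_nodes)
            for i, dn in enumerate(data_nodes)
            if great_circle_km(dn, pn) <= cutoff_km]

def decoder_edges(proc_nodes, data_nodes, k=3):
    """Edges (proc_idx, data_idx): each data node receives from its k nearest processor nodes."""
    edges = []
    for i, dn in enumerate(data_nodes):
        by_distance = sorted(range(len(proc_nodes)),
                             key=lambda j: great_circle_km(dn, proc_nodes[j]))
        edges.extend((j, i) for j in by_distance[:k])
    return edges

# Toy grids (hypothetical): 12 fine "data" nodes and 4 coarse "processor" nodes.
data_nodes = [(lat, lon) for lat in (-1.0, 0.0, 1.0) for lon in (0.0, 1.0, 2.0, 3.0)]
proc_nodes = [(0.0, 0.0), (0.0, 1.5), (0.0, 3.0), (2.0, 1.5)]

enc = encoder_edges(data_nodes, proc_nodes, cutoff_km=150.0)
dec = decoder_edges(proc_nodes, data_nodes, k=3)

# In-degree = number of incoming edges per receiving node.
in_degree_enc = {j: sum(1 for _, p in enc if p == j) for j in range(len(proc_nodes))}
in_degree_dec = {i: sum(1 for _, d in dec if d == i) for i in range(len(data_nodes))}
print("encoder in-degrees:", in_degree_enc)
print("decoder in-degrees:", in_degree_dec)
```

With the cut-off rule, the in-degree of a processor node depends on how many data nodes happen to lie nearby, while the nearest-neighbour rule gives every output node exactly three incoming edges. Edge features (distance and direction) are left out of this sketch.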

↔ LLM Analogy Edge features encode the relationship between nodes, and take the role of positional encoding in LLMs.

Processor: Sliding-Window Transformer

The processor is made up of sixteen standard pre-norm transformer layers with multi-head attention. Since the input consists of 40,320 O96 nodes, full self-attention would need to calculate $\mathcal{O}(n^2) \approx 1.6\times10^9$ pairwise interactions per layer per head, which would be very expensive. Instead, the processor uses sliding-window attention along latitude bands (similar to Longformer and Mistral's SWA). Each layer attends to a fixed local window of $\pm w$ nodes. Stacking $L$ such layers lets the information from previous layers propagate, so that the effective receptive field grows to $\pm (L \cdot w)$. Unlike for LLMs, there is no need for a causal mask. Forecasts attend to the full atmosphere. The "autoregressive" part of AIFS happens at the rollout level (predicting $t+6h$ from $t$). Within a single forward pass, attention is fully bidirectional in space.
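Here is a minimal sketch of the two claims in this paragraph: the cost saving and the growth of the receptive field. I treat the nodes as a simple 1D sequence, which ignores the latitude-band layout, and the window size `W` is a hypothetical value since I have not quoted the real one.

```python
def attention_pairs_full(n: int) -> int:
    """Query-key pairs for full self-attention over n nodes."""
    return n * n

def attention_pairs_windowed(n: int, w: int) -> int:
    """Query-key pairs when each node attends to itself and up to w neighbours per side."""
    return sum(min(n - 1, i + w) - max(0, i - w) + 1 for i in range(n))

def receptive_radius(n: int, w: int, n_layers: int, source: int) -> int:
    """How far information from `source` has spread after n_layers windowed layers."""
    reached = {source}
    for _ in range(n_layers):
        reached = {j for i in reached for j in range(max(0, i - w), min(n, i + w + 1))}
    return max(abs(j - source) for j in reached)

N_NODES = 40_320  # O96 processor nodes, from the text
N_LAYERS = 16     # processor depth, from the text
W = 512           # hypothetical window half-width, for illustration only

full = attention_pairs_full(N_NODES)
windowed = attention_pairs_windowed(N_NODES, W)
print(f"full attention:     {full:.2e} pairs")
print(f"windowed attention: {windowed:.2e} pairs ({full / windowed:.0f}x fewer)")

# Receptive field on a small toy sequence: it grows by w per layer.
toy_radius = receptive_radius(n=200, w=4, n_layers=N_LAYERS, source=100)
print("toy receptive radius after 16 layers:", toy_radius)
```

The windowed cost scales linearly in the number of nodes instead of quadratically, and the toy sequence confirms that sixteen layers with a half-width of 4 reach 64 nodes in each direction.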

↔ LLM Analogy Attention is not calculated over the temporal sequence of atmospheric states (like the sequence of words in an LLM), but between grid points within one step.

Figure (interactive): Layer schematic. The encoder and decoder share the same GNN mapper block (only the direction and connectivity differ); the processor is a standard pre-norm transformer block, repeated sixteen times. Adapted from Figure 3 of Lang et al. (2024).

Parallelism, Briefly

One AIFS instance is split across four A100 40 GB GPUs within a node. The parallelism stack is the standard large-model toolbox: mixed-precision training, activation checkpointing, tensor parallelism via attention-head sharding, and data parallelism. Because sliding-window attention is used, there is no context parallelism despite the long sequence length (although it could be implemented in principle). AIFS also shards the nodes and edges of the graph.

Training

The forward problem is supervised regression: given $x_{t-6}$ and $x_t$, predict $x_{t+6}$ (time indices in hours):

$$\hat{x}_{t+6} = f_\theta\big(x_{t-6},\, x_t\big).$$

The loss is an area-weighted mean squared error. The individual loss terms are scaled empirically per variable, such that each physical quantity contributes roughly equally. Additionally, the weights decrease linearly with height, so that accuracy in the lower atmosphere matters more than accuracy in the upper atmosphere. Similar to LLMs, AIFS uses a three-stage training process: a pre-training phase, in which the model is trained to make a single $\hat{x}_{t+6}$ prediction; a rollout phase, in which the model is fed its own predictions from previous steps for up to twelve six-hour forward passes; and a fine-tuning phase, in which the model is adapted from ERA5 inputs to operational IFS NWP analyses.

↔ LLM Analogy In LLMs, the three stages are pretraining, supervised fine-tuning and reinforcement learning (via human feedback or verifiable reward).
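Going back to the loss for a moment, the area-weighted, per-variable-scaled MSE described above can be sketched in a few lines. The function names, the toy numbers and the `top`/`bottom` values of the height weighting are hypothetical; only the overall structure (area weights, per-variable weights, linear decrease with height) follows the description.

```python
def weighted_mse(pred, target, area_weights, var_weights):
    """Area- and variable-weighted mean squared error.

    pred, target : list over grid points of lists over variables
    area_weights : one weight per grid point (normalised to sum to 1 inside)
    var_weights  : one weight per variable
    """
    total_area = sum(area_weights)
    loss = 0.0
    for p_row, t_row, area in zip(pred, target, area_weights):
        point_error = sum(v * (p - t) ** 2 for p, t, v in zip(p_row, t_row, var_weights))
        loss += (area / total_area) * point_error
    return loss / len(var_weights)

def level_weights(n_levels, top=0.2, bottom=1.0):
    """Hypothetical weights that decrease linearly from the lowest to the highest level."""
    step = (bottom - top) / (n_levels - 1)
    return [bottom - k * step for k in range(n_levels)]

# Toy example: two grid points, two variables.
pred = [[1.0, 2.0], [3.0, 4.0]]
target = [[1.0, 0.0], [1.0, 4.0]]
area_weights = [1.0, 3.0]   # the second cell covers three times the area
var_weights = [1.0, 0.5]    # the second variable is down-weighted

loss = weighted_mse(pred, target, area_weights, var_weights)
print("toy loss:", loss)
print("level weights:", [round(w, 2) for w in level_weights(13)])
```

In the toy example the error at the larger cell dominates the loss, which is the whole point of area weighting on a grid whose cells are not all the same size.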

Phase	Data	Steps	Learning rate	Forecast horizon
1. Pre-training	ERA5 1979–2020	260,000	cosine, $10^{-4} \to 3\times10^{-7}$	single 6 h step
2. Rollout fine-tuning	ERA5 1979–2018	up to 12 unrolled steps	$6\times10^{-7}$	6 h → 72 h, increased by 6 h every 1000 steps
3. Operational fine-tuning	IFS analyses 2019–2020	rollout	$6\times10^{-7}$	same rollout schedule
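The two schedules in the table are simple enough to write down. This is a sketch based only on the numbers in the table: I do not model any warm-up, and I am assuming that the rollout schedule starts at a single 6 h step, which is my reading of "6 h → 72 h".

```python
import math

def pretrain_lr(step, total_steps=260_000, lr_max=1e-4, lr_min=3e-7):
    """Cosine decay from lr_max to lr_min (no warm-up modelled in this sketch)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

def rollout_steps(step, steps_per_increase=1000, max_steps=12):
    """Number of unrolled 6 h steps, assuming the schedule starts at one step."""
    return min(max_steps, 1 + step // steps_per_increase)

for s in (0, 65_000, 130_000, 195_000, 260_000):
    print(f"pre-training step {s:>7}: lr = {pretrain_lr(s):.2e}")

for s in (0, 1000, 5000, 11_000, 20_000):
    n = rollout_steps(s)
    print(f"rollout step {s:>6}: {n:2d} unrolled steps = {6 * n} h horizon")
```

Under this reading, the full 72 h horizon is reached after 11,000 rollout training steps and stays there.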

The rollout phase deserves a closer look. For each step, we feed the model prediction back in as the next input,

$$\hat{x}_{t+6(k+1)} = f_\theta\big(\hat{x}_{t+6(k-1)},\, \hat{x}_{t+6k}\big).$$

The loss is calculated against the ground truth data at every step,

$$\mathcal{L}(\theta) = \frac{1}{N}\sum_{k=1}^{N} \ell\big(\hat{x}_{t+6k},\, x_{t+6k}\big),$$

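with $\ell$ being the area-weighted MSE and $N$ being the number of rollout steps. Each state $\hat{x}_k (\theta) = \hat{x}_{t+6k}$ depends on $\theta$ both directly and through the previous state $\hat{x}_{k-1} (\theta)$ (and, via the second input, through $\hat{x}_{k-2}$, which I suppress here to keep the notation light), so differentiating $\mathcal{L}$ unrolls the chain rule across the whole sequence: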

$$\frac{\partial \mathcal{L}}{\partial \theta} = \frac{1}{N}\sum_{k=1}^{N}\sum_{j=1}^{k} \frac{\partial \ell_k}{\partial \hat{x}_k}\left(\prod_{i=j+1}^{k}\frac{\partial \hat{x}_i}{\partial \hat{x}_{i-1}}\right)\frac{\partial \hat{x}_j}{\partial \theta}.$$
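To convince myself that the unrolled chain rule does what it should, here is a toy check. The "model" is a single scalar parameter with $\hat{x}_k = \theta\,\hat{x}_{k-1}$ and a squared-error loss, which is of course nothing like AIFS; the one-input form and all numbers are assumptions chosen so that the double sum can be evaluated by hand and compared with a finite difference.

```python
def rollout(theta, x0, n_steps):
    """Apply the toy one-input model x_k = theta * x_{k-1} repeatedly."""
    states = [x0]
    for _ in range(n_steps):
        states.append(theta * states[-1])
    return states  # states[k] is the prediction after k steps

def rollout_loss(theta, x0, truth):
    """Mean squared error over all rollout steps."""
    preds = rollout(theta, x0, len(truth))
    return sum((p - t) ** 2 for p, t in zip(preds[1:], truth)) / len(truth)

def rollout_grad(theta, x0, truth):
    """Gradient from the unrolled chain rule (the double sum over k and j)."""
    n = len(truth)
    preds = rollout(theta, x0, n)
    grad = 0.0
    for k in range(1, n + 1):
        dl_dxk = 2.0 * (preds[k] - truth[k - 1])    # d l_k / d x_k
        for j in range(1, k + 1):
            jacobian_product = theta ** (k - j)     # each d x_i / d x_{i-1} equals theta
            dxj_dtheta = preds[j - 1]               # direct dependence of step j on theta
            grad += dl_dxk * jacobian_product * dxj_dtheta
    return grad / n

theta, x0 = 0.9, 1.0
truth = [0.8 ** k for k in range(1, 13)]  # twelve made-up "ground truth" states

eps = 1e-6
finite_diff = (rollout_loss(theta + eps, x0, truth)
               - rollout_loss(theta - eps, x0, truth)) / (2 * eps)
analytic = rollout_grad(theta, x0, truth)
print("analytic:", analytic, "finite difference:", finite_diff)
```

The two numbers agree to many digits. In a real training loop, automatic differentiation does this bookkeeping for you, and the cost of storing all intermediate states is one reason why activation checkpointing matters.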

Figure (interactive): Rollout training, feeding predictions back in. During rollout fine-tuning, the 6-hour model $f_\theta$ is applied repeatedly, with each step taking the model's own previous predictions as input and being scored against the ERA5 ground truth data. The gradient of the averaged loss then flows backward through the entire chain into the shared set of weights $\theta$ via backpropagation through time.

The optimiser used is AdamW with $(\beta_1, \beta_2) = (0.9, 0.95)$. Training takes roughly one week of wall time on 64 A100s, and at inference time a ten-day forecast takes about two and a half minutes on a single A100. For reference, the equivalent IFS run takes hours on a thousand CPUs.

Results

To evaluate the performance of the model, the anomaly correlation coefficient (ACC) is used. The anomaly is the difference between a value at a grid point (forecast or analysis) and the historical (climatological) average for that value. The ACC measures the (Pearson) correlation between the forecast anomaly and the analysis anomaly, where the analysis serves as the observed ground truth. By focusing on the anomaly rather than absolute values, we remove the easily predictable background (e.g. colder temperatures towards the poles). By focusing on the correlation, we focus on predicting the correct patterns without having to worry about (regional) biases.
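Written out as code, the definition looks as follows. This is a minimal sketch of a centred (Pearson) ACC with optional area weights; the function name and the toy temperature values are mine, and operational verification may differ in details such as how the climatology is defined.

```python
import math

def acc(forecast, analysis, climatology, weights=None):
    """Weighted Pearson correlation between forecast and analysis anomalies."""
    n = len(forecast)
    w = weights if weights is not None else [1.0] * n
    w_sum = sum(w)

    # Anomalies relative to the climatological mean at each grid point.
    f_anom = [f - c for f, c in zip(forecast, climatology)]
    a_anom = [a - c for a, c in zip(analysis, climatology)]

    f_mean = sum(wi * x for wi, x in zip(w, f_anom)) / w_sum
    a_mean = sum(wi * y for wi, y in zip(w, a_anom)) / w_sum

    cov = sum(wi * (x - f_mean) * (y - a_mean) for wi, x, y in zip(w, f_anom, a_anom))
    var_f = sum(wi * (x - f_mean) ** 2 for wi, x in zip(w, f_anom))
    var_a = sum(wi * (y - a_mean) ** 2 for wi, y in zip(w, a_anom))
    return cov / math.sqrt(var_f * var_a)

# Toy 2 m temperatures in kelvin at four grid points (made-up values).
climatology = [280.0, 285.0, 290.0, 285.0]
analysis = [282.0, 284.0, 293.0, 286.0]

perfect = list(analysis)
biased = [a + 1.5 for a in analysis]                               # right pattern, uniform offset
opposite = [c - (a - c) for a, c in zip(analysis, climatology)]   # anomalies with flipped sign

print("perfect: ", acc(perfect, analysis, climatology))
print("biased:  ", acc(biased, analysis, climatology))
print("opposite:", acc(opposite, analysis, climatology))
```

The uniformly biased forecast still scores an ACC of one, which illustrates that the ACC rewards getting the pattern right and is blind to a constant offset.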

The headline result: on Northern Hemisphere 500 hPa geopotential ACC, AIFS outperforms IFS at 2-, 6-, and 10-day lead times across the whole verification period. It performs worse in the upper atmosphere, which can be traced back to the decreased weight in the loss calculation.

As the forecast horizon increases, the AIFS prediction becomes visibly smoother, and small-scale structures that are present at day 1 get progressively blurred out as the forecast unrolls. The same effect shows up in GraphCast, Pangu and other ML-based weather models. This is caused by the double-penalty problem: if a sharp feature is predicted in the wrong spot, the model is penalised twice, once for predicting a feature where there is none and once for missing the feature where it actually is. As uncertainty increases, this pushes the model towards predicting mean values. The same problem occurred in pre-diffusion image generator models. For this reason, the ECMWF has been looking at ensemble-based approaches as well.
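The double penalty is easy to reproduce on a one-dimensional toy field. The fields below are made up: a single sharp feature, a forecast that gets the feature right but two cells off, a forecast that predicts nothing at all, and a forecast that smears the feature out.

```python
def mse(pred, truth):
    """Plain mean squared error between two equally long fields."""
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)

truth     = [0, 0, 0, 1, 0, 0, 0, 0]              # one sharp feature at index 3
displaced = [0, 0, 0, 0, 0, 1, 0, 0]              # correct shape, wrong location
missed    = [0, 0, 0, 0, 0, 0, 0, 0]              # no feature predicted at all
blurred   = [0, 0, 0.25, 0.25, 0.25, 0.25, 0, 0]  # same mass, smeared over four cells

for name, field in [("displaced", displaced), ("missed", missed), ("blurred", blurred)]:
    print(f"{name:9s} MSE = {mse(field, truth):.5f}")
```

The displaced forecast gets an MSE of 0.25, twice that of predicting nothing (0.125), and the blurred forecast scores best of the three (0.09375). A model trained on MSE therefore learns that smearing features out is the safer bet once it is unsure about their position.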

What's Next

The ECMWF team that built AIFS was quite aware of this smoothing problem from the start and built the model with the goal of extending the point prediction model to a probabilistic forecasting system. Probabilistic ensemble forecasts are the standard for physics-based models as well, and come with the advantage that they would not only solve the blurring problem but also yield uncertainties for their predictions. They are therefore the obvious next step after this initial model, and that is exactly the direction the model developers went.

The paper we have been walking through describes the first, operational AIFS version from 2024. In the time since, AIFS has been extended in a number of ways, which we will explore in follow-up posts. For your own reading, the next papers we will go through are the following:

AIFS-CRPS (2024) is the direct extension of the AIFS system to a probabilistic system. The model architecture stays the same, but multiple predictions are made by injecting Gaussian noise into the processor states. The ensemble output is then scored with a probabilistic score (the continuous ranked probability score, CRPS, which gives the model its name) in order to generate a well-calibrated ensemble forecast. This solves two problems at once: the ensemble members stay sharp without progressive smoothing, and we get the ability to estimate uncertainties on the output.
AIFS 1.1 (2025) adds physical consistency constraints in order to avoid unphysical outputs and adds additional output variables.
AIFS v2 (2026) arrived only this month (12 May 2026), taking both the deterministic and ensemble models to version two and shipping alongside a major update to the physics-based IFS.
Anemoi is the open-source, collaboratively developed framework that covers the whole pipeline, from preparing ML-ready datasets to training models to running them operationally.
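As a small teaser for the AIFS-CRPS item in the list above, here is the textbook ensemble estimator of the continuous ranked probability score for a single scalar quantity. This is the generic formula, not the loss from the AIFS-CRPS paper, which may use a different variant; the function name and the toy ensembles are mine.

```python
def ensemble_crps(members, observation):
    """Ensemble CRPS estimate: mean |x_i - y| minus half the mean |x_i - x_j|."""
    m = len(members)
    error_term = sum(abs(x - observation) for x in members) / m
    spread_term = sum(abs(xi - xj) for xi in members for xj in members) / (m * m)
    return error_term - 0.5 * spread_term

observation = 0.0
spread_ensemble = [-1.0, 0.0, 1.0]     # uncertain, but centred on the truth
collapsed_wrong = [1.0, 1.0, 1.0]      # confident and wrong
collapsed_right = [0.0, 0.0, 0.0]      # confident and right

for name, ens in [("spread", spread_ensemble),
                  ("collapsed, wrong", collapsed_wrong),
                  ("collapsed, right", collapsed_right)]:
    print(f"{name:17s} CRPS = {ensemble_crps(ens, observation):.3f}")
```

The first term rewards members for being close to the observation, while the second term rewards spread among the members. That is why a score of this kind does not push every member towards the blurry mean the way the MSE does.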

References cited