Benchmarking machine learning models with pysteps

Benchmarking machine learning models with pysteps#

How to correctly compare the accuracy of machine learning against traditional nowcasting methods available in pysteps?

Before starting the comparison, you need to ask yourself what is the objective of nowcasting:

  1. Do you only want to minimize prediction errors?

  2. Do you also want to represent the prediction uncertainty?

To achieve objective 1, it is sufficient to produce a single deterministic nowcast that filters out the unpredictable small-scale precipitation features. However, this will create a nowcast that will become increasingly smooth over time.

To achieve objective 2, you need to produce a probabilistic or an ensemble nowcast (several ensemble members or realizations).

In weather forecasting (and nowcasting), we usually want to achieve both goals because it is impossible to predict the evolution of a chaotic system with 100% accuracy, especially space-time precipitation fields and thunderstorms!

Machine learning and pysteps offer several methods to produce both deterministic and probabilistic nowcasts. Therefore, if you want to compare machine learning-based nowcasts to simpler extrapolation-based models, you need to select the right method and verification measure.

1. Deterministic nowcasting#

Deterministic nowcasts can be divided into:

  1. Variance-preserving nowcasts, such as extrapolation nowcasts by Eulerian and Lagrangian persistence.

  2. Error-minimization nowcasts, such as machine learning, Fourier-filtered and ensemble mean nowcasts.

Very important: these two types of deterministic nowcasts are not directly comparable because they have a different variance! This is best explained by the decomposition of the mean squared error (MSE):

\(MSE = bias^2 + Var\)

All deterministic machine learning algorithms that minimize the MSE (or a related measure) will also inevitably minimize the variance of nowcast fields. This is a natural attempt to filter out the unpredictable evolution of precipitation features, which would otherwise increase the variance (and the MSE). The same principle holds for convolutional and/or deep neural network architectures, which also produce smooth nowcasts.

Therefore, it is better to avoid directly comparing an error-minimization machine learning nowcast to a variance-preserving radar extrapolation, as produced by the module pysteps.nowcasts.extrapolation. Instead, you should use compare with the mean of a sufficiently large ensemble.

A deterministic equivalent of the ensemble mean can be approximated using the modules pysteps.nowcasts.sprog or pysteps.nowcasts.anvil. Another possibility, but more computationally demanding, is to average many ensemble members generated by the modules pysteps.nowcasts.steps or pysteps.nowcasts.linda.

Still, even by using the pysteps ensemble mean, it is not given that its variance will be the same as the one of machine learning predictions. Possible solutions to this:

  1. use a normalized MSE (NMSE) or another score accounting for differences in the variance between prediction and observation.

  2. decompose the field with a Fourier (or wavelet) transform to compare features at the same spatial scales.

A good deterministic comparison of a deep convolutional machine learning neural network nowcast and pysteps is given in [FNP+20].

2. Probabilistic nowcasting#

Probabilistic machine learning regression methods can be roughly categorized into:

  1. Quantile-based methods, such as quantile regression, quantile random forests, and quantile neural networks.

  2. Ensemble-based methods, such as generative adversarial networks (GANs) and variational auto-encoders (VAEs).

Quantile-based machine learning nowcasts are interesting, but can only estimate the probability of exceedance at a given point (see e.g. [FSN+19]).

To estimate areal exceedance probabilities, for example above catchments, or to propagate the nowcast uncertainty into hydrological models, the full ensemble still needs to be generated, e.g. with generative machine learning models.

Generative machine learning methods are similar to the pysteps ensemble members. Both are designed to produce an ensemble of possible realizations that preserve the variance of observed radar fields.

A proper probabilistic verification of generative machine learning models against pysteps is an interesting research direction which was recently undertake in the work of [RLW+11].


The table below is an attempt to classify machine learning and pysteps nowcasting methods according to the four main prediction types:

  1. Deterministic (variance-preserving), like one control NWP forecast

  2. Deterministic (error-minimization), like an ensemble mean NWP forecast

  3. Probabilistic (quantile-based), like a probabilistic NWP forecast (without members)

  4. Probabilistic (ensemble-based), like the members of an ensemble NWP forecast

The comparison of methods from different types should only be done carefully and with good reasons.

Nowcast type

Machine learning



Deterministic (variance-preserving)

SRGAN, Others?

pysteps.nowcasts.extrapolation (any optical flow method)


Deterministic (error-minimization)

Classical ANNs, (deep) CNNs, random forests, AdaBoost, etc

pysteps.nowcasts.sprog, pysteps.nowcasts.anvil or ensemble mean of pysteps.nowcasts.steps/linda

MSE, RMSE, MAE, ETS, etc or better normalized scores, etc

Probabilistic (quantile-based)

Quantile ANN, quantile random forests, quantile regression

pysteps.nowcasts.lagrangian_probability or probabilities derived from pysteps.nowcasts.steps/linda

Reliability diagram (predicted vs observed quantile), probability integral transform (PIT) histogram

Probabilistic (ensemble-based)

GANs ([RLW+11]), VAEs, etc

Ensemble and probabilities derived from pysteps.nowcasts.steps/linda

Probabilistic verification: reliability diagrams, continuous ranked probability scores (CRPS), etc. Ensemble verification: rank histograms, spread-error relationships, etc