Introduction

The quality of forecasts, and their comparison to other forecasts, is commonly assessed by calculating scalar summary measures. For probabilistic forecasts a few verification scores are generally used (e.g. [Wilks, 2011]). The Brier Score (BS) can be chosen for binary events (e.g. rain/no rain), whereas for discrete multiple-category events (e.g. below normal/normal/above normal precipitation) the Ranked Probability Score (RPS) is preferred. In the continuous case (an infinite number of predictand classes of infinitesimal width) the RPS is extended to the Continuous Ranked Probability Score (CRPS). In addition to these verification scores, the Plugin can calculate the associated skill scores, comparing two forecasts or comparing a forecast to the climatological reference. Since the scores are biased for finite ensemble sizes, ensemble-size-corrected “Fair” (skill) scores [Ferro, 2013] are also calculated. Unless otherwise mentioned, all scores available in this Plugin are based on the SpecsVerification R-Package from [Siegert, 2014]. Moreover, the Plugin can calculate the MSE, the MSESS, and the Pearson and Spearman correlation coefficients.

Note

For the correlation coefficients, no skill scores are calculated; the skill scores are therefore set to the missing value.

Sections Input Parameter and Output Parameter explain the input and output parameters of the ProblEMS Plugin, respectively. In section Scientific Background, the methods of pre-processing and score calculation are described. Additional information can be found at the end of the documentation.

User Manual

First, choose the variable you want to analyze. The input data for Forecast (suffix “1”) and Reference (suffix “2”, optional) can be selected from the data browser via the options project, product, institute, model, experiment, time_frequency and ensemble. In case the data belong to a decadal experiment, set is_decadal to True. If you want to compare the Forecast to the climatology, leave all entries of Reference (suffix “2”) empty.

Note

In case you want to compare a decadal experiment against a historical experiment, you have to define the decadal experiment as the Forecast and the historical experiment as the Reference. In observation you choose the observational/reanalysis data to compare against. You can select observational and reanalysis data with monthly time frequency (time_frequency=mon), or you can specify a NetCDF file with any time frequency.

In case decadal experiments are selected, you have to specify the decadals (starting/initialization years of the decadal experiments) and the leadtimes you want to analyze. Otherwise (e.g. for historical experiments), leave these empty and insert the years you want to process into the first_last_year parameter. Further, define the temporal resolution of the resulting output time series (time_frequency_output). Additionally, you can choose specific months (mon) to be processed.

Note

If you choose seas and analyze decadal experiments, the winter season contains only the months January and February. In combination with the timemean operator it is possible to define extended seasons from the months of interest.

In the next parameter (level) you can select a level for variables with vertical levels; leave it empty for 2D files. In region you can choose a region box (W,E,S,N) to analyze. In grid you have the option to specify a regular lon-lat grid to which all input data are remapped before processing; this is recommended if you analyze data with different grid sizes. Instead of a gridwise (map) analysis, a field-mean-based (fieldmean) analysis can be chosen with the output_dimension operator.

Next, specify the metrics you want to calculate. For RPSS/BSS you have to configure the parameter threshold and optionally threshold_check; otherwise leave them empty. In case you want to do a stratified verification (e.g. useful to evaluate experiments during different initialization times/states, such as ENSO phases), you can input different sets of years. For this, provide a .dat file where every column (at least one) contains the set of years defining a certain initialization state. Leave empty to evaluate the Forecast experiment in the default way.

With the option mean_or_median you specify whether the ensemble mean or median is used for the calculation of MSESS or correlation. With time_mean_or_median you specify whether the climatology for the anomaly calculation is determined from the temporal mean or median.

With the operator statistic_method, the aggregation method used to get from the input time frequency to the output time frequency can be changed. Using avg is recommended so that missing values are processed correctly. Further, it is possible to compute the anomalies (anomaly) of the time series before the scores are calculated. The calculation of significances can be enabled by entering a number in bootstraps. If you want to reproduce the bootstrap sample used for the significance calculation, you can define a fixed seed for the random number generation.
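The resampling idea behind the bootstrap-based significance estimate can be sketched as follows. This is a rough illustration with a hypothetical function, not the Plugin's actual implementation; the resampling of whole blocks of consecutive years preserves short-term autocorrelation in the time series:

```python
import numpy as np

def block_bootstrap_sample(n_years, block_length, rng):
    """Draw one resampled index sequence of length n_years built
    from randomly chosen blocks of block_length consecutive years."""
    n_blocks = int(np.ceil(n_years / block_length))
    # random starting years for each block
    starts = rng.integers(0, n_years - block_length + 1, size=n_blocks)
    idx = np.concatenate([np.arange(s, s + block_length) for s in starts])
    return idx[:n_years]  # trim to the original series length

# Fixed seed, analogous to the Plugin's "seed" option
rng = np.random.default_rng(seed=42)
sample = block_bootstrap_sample(n_years=50, block_length=5, rng=rng)
```

Scores recomputed on many such resampled index sequences yield the distribution from which significance levels are estimated.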

Finally, you can specify the number of parallel-processing tasks (ntask), the output (outputdir) and cache (cachedir) directories and you have the option to remove the cache directories (cacheclear).

Input Parameter

freva plugin problems --doc


Plugin to calculate and plot verification scores of two ensemble forecasts primarily based on the SPECS project. If one leaves the reference experiment blank, the forecast is compared to the climatology instead.

Available scores are:
RPS/FairRPS, RPSS/FairRPSS
BS/FairBS, BSS/FairBSS
EnsCRPS/FairEnsCRPS, EnsCRPSS/FairEnsCRPSS, EnsCRPS Decomposition into REL, RES and UNC
GaussCRPS, GaussCRPSS and GaussCRPS Decomposition into REL, RES and UNC
GevCRPS, GevCRPSS, MSESS, MSE,
Spearman and Pearson Correlation Coefficient.
Options:
variable                      (default: <undefined>) [mandatory]
                              Choose variable you want to analyze (check data-
                              browser)
project1                      (default: <undefined>) [mandatory]
                              Choose project e.g. CMIP5, Reanalysis, your data,
                              etc. (check data-browser)
product1                      (default: <undefined>) [mandatory]
                              Choose product (check data-browser)
institute1                    (default: <undefined>) [mandatory]
                              Choose institute (check data-browser)
model1                        (default: <undefined>) [mandatory]
                              Choose model (check data-browser)
experiment1                   (default: <undefined>) [mandatory]
                              Choose experiment (check data-browser)
time_frequency1               (default: <undefined>) [mandatory]
                              Choose time_frequency (check data-browser)
ensemble1                     (default: <undefined>) [mandatory]
                              Choose ensemble member - multi selection possible
                              (comma separated), use * for all members
is_decadal1                   (default: True)
                              Set "True" in case forecast experiment is a
                              decadal experiment.
project2                      (default: <undefined>)
                              Choose project e.g. CMIP5, Reanalysis, your data,
                              etc. (check data-browser)
product2                      (default: <undefined>)
                              Choose product (check data-browser)
institute2                    (default: <undefined>)
                              Choose institute (check data-browser)
model2                        (default: <undefined>)
                              Choose model (check data-browser)
experiment2                   (default: <undefined>)
                              Choose experiment (check data-browser)
time_frequency2               (default: <undefined>)
                              Choose time_frequency (check data-browser)
ensemble2                     (default: <undefined>)
                              Choose ensemble member - multi selection possible
                              (shell: comma separated), use * for all members
is_decadal2                   (default: False)
                              Set "True" in case the reference experiment is a
                              decadal experiment.
observation                   (default: <undefined>) [mandatory]
                              Observation to compare to. Choose observational or
                              reanalysis experiment with time_frequency=mon
                              (check data-browser). OR specify an observational
                              netcdf file with any time_frequency ->
                              path/2/file.nc.
obs_complete                  (default: 90)
                              percentage of observation data which needs to be
                              available for calculation of scores. Default: 90
                              (%)
decadals                      (default: <undefined>)
                              Comma separated lists of decadal experiments:
                              1960,1961,etc. OR use a sequence like 1960:2000
                              for all years in between
leadtimes                     (default: <undefined>)
                              Leadtimes to analyze, if a decadal experiment is
                              selected. Use e.g. "1,3" for separate lead times
                              or "1-3" for averaging.
first_last_year               (default: <undefined>)
                              First and last year of the analysis: 1960:2000 for
                              all years in between.
                              NOTE(!): Only available if the "is_decadal" option
                              is False.
time_frequency_output         (default: year) [mandatory]
                              Temporal resolution of the output time series you
                              want to analyze.
                              Valid options are "year", "seas", and "mon". If
                              you choose "mon", your selected months will be
                              individually analyzed
mon                           (default: <undefined>)
                              Months to be processed. You also can set
                              "timemean" below "True" if you want to calculate
                              the yearly average over these months (e.g. To get
                              time series of yearly averaged extended summer
                              season MJJAS insert 5,6,7,8,9). Only available if
                              "time_frequency_output" is set "mon".
timemean                      (default: False)
                              Option to calculate the yearly mean of the
                              selected months above to get the time series.
                              Only available if you insert some months in
                              "mon" and time_frequency_output is set to "mon".
level                         (default: <undefined>)
                              Choose level [in Pa], e.g. 50000. Only available
                              for files containing levels (check data-browser).
                              Leave empty for 2D files.
region                        (default: -180,180,-90,90)
                              Region box to be processed. The format has to be
                              W,E,S,N (e.g. -180,180,0,90 for NH).
grid                          (default: r72x36)
                              Gridfile or a grid description like r72x36.
                              Required in case of different grid resolutions of
                              input data.
output_dimension              (default: map) [mandatory]
                              Option to calculate input as a map (default) or
                              areamean of the region box above (or region of
                              file).
metrics                       (default: rpss) [mandatory]
                              Choose scores you want to calculate.
                              RPSS: RPS/RPSS/FairRPS/FairRPSS
                              BSS: BS/BSS/FairBS/FairBSS
                              EnsCRPSS: CRPS/CRPSS/FairCRPS/FairCRPSS for
                              ensemble forecast as well as score decomposition
                              into Resolution, Reliability and Uncertainty.
                              GaussCRPSS: CRPS/CRPSS for Gaussian forecast as
                              well as score decomposition into Resolution,
                              Reliability and Uncertainty
                              GevCRPSS: CRPS/CRPSS for generalized extreme value
                              (GEV) distribution forecast.
                              MSESS: MSE and MSESS.
                              SP-Cor: Spearman Correlation Coefficient (if the
                              reference forecast is the climatology the skill
                              score is set to the correlation coefficient
                              between forecast and observations, otherwise the
                              difference between the correlation coefficients
                              between forecast and observations and that between
                              forecast and reference is used instead of the
                              skill score).
                              PE-Cor: Pearson Correlation Coefficient (if the
                              reference forecast is the climatology the skill
                              score is set to the correlation coefficient
                              between forecast and observations, otherwise the
                              difference between the correlation coefficients
                              between forecast and observations and that between
                              forecast and reference is used instead of the
                              skill score).
mean_or_median                (default: mean) [mandatory]
                              Option whether to calculate MSES(S) or correlation
                              with ensemble mean or median.
time_mean_or_median           (default: mean) [mandatory]
                              Option whether to calculate anomalies with respect
                              to climatology (if anomaly flag is set to TRUE)
                              either with mean or median over time.
threshold                     (default: perc_1/3,2/3)
                              Threshold(s) for separating categories following
                              the format OPTION_THRESHOLD1,THRESHOLD2. Constant
                              values (OPTION=const) or percentile values
                              (OPTION=perc) are possible. Use 1 threshold for
                              BS(S) and at least 2 for RPS(S) calculation. Leave
                              empty for CRPSS metric.
                              E.g. insert perc_1/3,2/3 to separate data into 3
                              categories based on the percentiles 33.3% and
                              66.7%.
threshold_check               (default: <undefined>)
                              Option to define intervals between chosen
                              thresholds following the format
                              VALUE4TH1,VALUE4THS. If conditions are not
                              fulfilled the score will be set to NA. Only valid
                              for BSS/RPSS with percentile option. Leave empty
                              for no threshold restrictions.
                              E.g. if "1,2" is set, the first threshold value
                              must be greater equal 1 and the gap between the
                              following thresholds must be at least 2. Otherwise
                              the score is set to NA.
years_stratified_verification (default: <undefined>)
                              Option to input a .dat-file containing at least
                              one column of years (YYYY). The following
                              evaluation will be stratified on the given years
                              in the file. Useful for stratified verification
                              (e.g. years with El Nino vs. years with La Nina;
                              early time period vs. late time period). Leave
                              blank to evaluate the forecast experiment in the
                              default way (based on all available years).
                              NOTE: Stratified verification is in a test phase!
statistic_method              (default: avg) [mandatory]
                              Statistical method used to get from input
                              time_frequency to output time_frequency. "avg" is
                              recommended. "max" could be useful for analyzing
                              extreme values (e.g. GevCRPS)
anomaly                       (default: False)
                              Option to calculate anomalies of time series
                              before calculating scores.
significance_method           (default: Bootstrap) [mandatory]
                              Select method for significance estimation.
bootstraps                    (default: <undefined>)
                              Number of bootstrap runs for calculation of
                              significance levels if Bootstrap is selected via
                              significance_method. Leave empty if you do not
                              want to calculate significance levels with
                              bootstrap.
                              WARNING: This could take a lot of time when
                              enabled!
seed                          (default: <undefined>)
                              Specify the seed to initialize the random number
                              generator used to generate the bootstrap sample.
blocklength                   (default: 5)
                              If bootstrap is selected via significance_method:
                              Length of block for bootstrapping. If jackknife is
                              selected: Length of block that will be deleted for
                              resampling.
                              Usually 5 for decadals and 1 for seasonal or
                              monthly forecasts.
ntask                         (default: 6) [mandatory]
                              Choose number of tasks on machine, default set by
                              Freva for batchmode.
outputdir                     (default: $USER_OUTPUT_DIR/$SYSTEM_DATETIME) [mandatory]
                              Choose your output directory. Default is set by
                              Freva.
cachedir                      (default: $USER_CACHE_DIR) [mandatory]
                              Choose your cache directory. Default is set by
                              Freva.
cacheclear                    (default: True)
                              Option switch to NOT clear the cache.
debug                         (default: False)
                              Option for debugging purposes
extra_scheduler_options       (default: --ntasks-per-node=20)
                              Set additional options for the job submission to
                              the workload manager (comma separated). Note:
                              batchmode and web only.

Output Parameter

The processed files can be found in the selected outputdir. Pre-processed NetCDF files of the Forecast/Reference/Observation time series are stored in the fc/ref/obs directories, containing the variable with dimensions [LON; LAT; DATE; ENSEMBLE]. NetCDF files [LON; LAT; DATE] of the calculated scores can be found in separate directories depending on your chosen metrics. Output figures showing the gridwise skill score (or the fair version, if available) of your chosen metrics are saved in the plot/ directory (only available if output_dimension is set to map).


_images/tas_FairRPSS_b1_clim.png

Fig. 1 Fair RPSS of near-surface temperature (tas) for baseline1 (as Forecast) against (top) climatology and (bottom) historical (as Reference) for lead time 1-4. ERA-Interim is used as observational data (Observation).

In case the Plugin is used in the stratified verification setup, additional output files as well as figures are stored in the associated subfolder stratveri/. The following file prefixes exist (see section Stratified Verification for more information):

  • bcw<i>_: Individual contribution of time series subset i to the total skill score

    \[\frac{N_i}{N} \cdot \frac{\left( \operatorname{RPS_{ref}^\textit{i}} - \operatorname{RPS_{fc}^\textit{i}} \right)}{\operatorname{RPS_{ref}^{tot}}}\]
  • bc<i>_: Coefficient of individual contribution of time series subset i to the total skill score

    \[\frac{\left( \operatorname{RPS_{ref}^\textit{i}} - \operatorname{RPS_{fc}^\textit{i}} \right)}{\operatorname{RPS_{ref}^{tot}}}\]
  • bcw<i>-reference_: Expected individual contribution of time series subset i based on total skill score (expected: every individual time series subset has the same contribution to the total skill score)

    \[\frac{ \operatorname{RPSS_{tot}} }{ \text{number of individual time series subsets} }\]
  • bc<i>-reference_: Expected coefficient of individual contribution of time series subset i based on total skill score (expected: every individual time series subset has the same value as the total skill score)

    \[\operatorname{RPSS_{tot}}\]
  • absdiff-bcw<i>-reference_: Difference between individual contribution of time series subset i and expected individual contribution of time series subset i

    \[\frac{N_i}{N} \cdot \frac{\left( \operatorname{RPS_{ref}^\textit{i}} - \operatorname{RPS_{fc}^\textit{i}} \right)}{\operatorname{RPS_{ref}^{tot}}} - \frac{ \operatorname{RPSS_{tot}} }{ \text{number of individual parts} }\]
  • absdiff-bc<i>-reference_: Difference between coefficient of individual contribution of time series subset i and expected coefficient of individual contribution of time series subset i

    \[\frac{\left( \operatorname{RPS_{ref}^\textit{i}} - \operatorname{RPS_{fc}^\textit{i}} \right)}{\operatorname{RPS_{ref}^{tot}}} - \operatorname{RPSS_{tot}}\]
  • diff-bcw1-bcw2_: Difference between individual contribution of time series subset 1 and time series subset 2

    \[\frac{N_1}{N} \cdot \frac{\left( \operatorname{RPS_{ref}^1} - \operatorname{RPS_{fc}^1} \right)}{\operatorname{RPS_{ref}^{tot}}} - \frac{N_2}{N} \cdot \frac{\left( \operatorname{RPS_{ref}^2} - \operatorname{RPS_{fc}^2} \right)}{\operatorname{RPS_{ref}^{tot}}}\]
  • diff-bc1-bc2_: Difference between coefficient of individual contribution of time series subset 1 and time series subset 2

    \[\frac{\left( \operatorname{RPS_{ref}^1} - \operatorname{RPS_{fc}^1} \right)}{\operatorname{RPS_{ref}^{tot}}} - \frac{\left( \operatorname{RPS_{ref}^2} - \operatorname{RPS_{fc}^2} \right)}{\operatorname{RPS_{ref}^{tot}}}\]
  • rpss<i>_: Skill score of individual time series subset i

    \[\frac{\left( \operatorname{RPS_{ref}^\textit{i}} - \operatorname{RPS_{fc}^\textit{i}} \right)}{\operatorname{RPS_{ref}^{\textit{i}}}}\]
  • w<i>_: Time step weighting factor of time series subset i

    \[\frac{N_i}{N}\]
  • d<i>_: Weighting term of the reference forecast for time series subset i vs. the total time series

    \[\frac{ \operatorname{RPS_{ref}^i} }{ \operatorname{RPS_{ref}^{tot}} }\]
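The relationship between these prefixed quantities can be checked numerically: by construction, the bcw<i> contributions add up to the total skill score when the total RPS values are the time-step-weighted means over the subsets. A minimal sketch with hypothetical RPS values (illustration only, not Plugin code):

```python
import numpy as np

# Hypothetical per-subset values (two subsets, for illustration only)
N_i = np.array([20, 30])          # number of years in each subset
rps_fc = np.array([0.40, 0.55])   # forecast RPS per subset
rps_ref = np.array([0.60, 0.60])  # reference RPS per subset
N = N_i.sum()

# Totals as time-step-weighted means over the subsets
rps_fc_tot = (N_i * rps_fc).sum() / N
rps_ref_tot = (N_i * rps_ref).sum() / N
rpss_tot = 1.0 - rps_fc_tot / rps_ref_tot

# bcw<i>: individual contribution of subset i (first formula above)
bcw = (N_i / N) * (rps_ref - rps_fc) / rps_ref_tot
# bc<i>: coefficient of the individual contribution (second formula above)
bc = (rps_ref - rps_fc) / rps_ref_tot
```

Summing the `bcw` array recovers `rpss_tot`, which is exactly why the bcw<i>-reference_ files use the equal split of the total skill score as the expected contribution.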

Previous Applications and Test

The Plugin was used to verify global and regional hindcast data in a probabilistic way using the fair version of the RPSS (see forecast web page).

Scientific Background

Pre-Processing

Before the verification scores can be calculated, the input data have to be pre-processed. The general pre-processing procedure includes yearly, monthly or seasonal averaging of every year to finally obtain yearly time series. In case of decadal experiments, the time series are created according to the Leadtime-Selector Plugin. Afterwards, a remapping to a chosen regular longitude-latitude grid is applied in case the user enters a grid to remap to. Fig. 2 shows the schematic processing procedure of the ProblEMS Plugin.

Note

Each of the following score calculation procedures is done individually for every gridpoint, month (season) and/or lead-year.

For the calculation of the BSS and RPSS metrics, the definition of \(n_{th}\) threshold(s) is necessary to classify the data into discrete categories. The data will be divided by the thresholds \(X_{th}\) into \(n_{th}+1\) categories \(Y\) following the rule

\[\begin{split}\begin{aligned} Y_{1}:& \quad x < X_{1} \\ Y_{th}:& \quad X_{th-1} \leq x < X_{th} \qquad \textrm{for} \quad 2 \leq th \leq n_{th} \\ Y_{n_{th}+1}:& \quad x \geq X_{n_{th}} \quad .\end{aligned}\end{split}\]

For example, assume that two thresholds are selected: category \(Y_{1}\) will contain data below the first threshold \(X_1\), category \(Y_2\) will contain data equal to or greater than \(X_1\) and below \(X_2\), and category \(Y_3\) will contain data equal to or greater than the second threshold \(X_2\). In case of percentile-based categorization, the threshold values are derived individually for Observation, Forecast and Reference by separate percentile calculations beforehand.
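For the default percentile option perc_1/3,2/3, the categorization described above could be sketched as follows (a minimal illustration with synthetic data, not the Plugin's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=300)  # hypothetical yearly time series

# Tercile thresholds, as in the default option perc_1/3,2/3
x1, x2 = np.percentile(data, [100 / 3, 200 / 3])

# Y1: data below X1; Y2: >= X1 and < X2; Y3: >= X2
category = np.where(data < x1, 1, np.where(data < x2, 2, 3))
counts = np.bincount(category)[1:]  # occupation of Y1, Y2, Y3
```

With percentile thresholds each category is, by construction, roughly equally populated.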

_images/specs_schema.png

Fig. 2 Schematic processing procedure of the ProblEMS Plugin.

Metrics and Scores

Brier Skill Score (BSS-Metric)

For binary events (e.g. rain/no rain separated by a threshold), the Brier Score (BS) essentially gives information about the mean squared error of the probabilistic forecast of an ensemble with size \(M\) and is defined as

\[\operatorname{BS} = \frac{1}{T} \sum^{T}_{t=1}(y_t - o_t)^2,\]

where \(o\) stands for the observed event (\(o=1\) if the event occurs, \(o=0\) if it does not), \(y\) for the forecast probability of the \(t\)-th forecast-event pair and \(T\) for the number of cases (in the context of this Plugin, \(T\) is the number of years within a given time period). In the context of ensemble forecasts, the forecast probability \(y\) is based on the relative number \(\frac{e}{M}\) of ensemble members \(e\) predicting that the event occurs. The BS is bounded between 0 and 1, with \(\operatorname{BS}=0\) for a perfect forecast. To evaluate the performance of one forecast against a reference forecast, skill scores are used. The associated skill score for binary events is the Brier Skill Score

\[\operatorname{BSS} = 1 - \frac{\operatorname{BS_{fc}}}{\operatorname{BS_{ref}}},\]

where \(\operatorname{BS_{fc}}\) is the Brier Score of the Forecast and \(\operatorname{BS_{ref}}\) the Brier Score of the Reference. If no Reference forecast experiment is selected in the Plugin, the climatology is used as the Reference instead. Since scores are biased for finite ensemble sizes (with a positive bias for small ensembles), the Brier Score is adjusted following [Ferro, 2013] to allow a fair comparison of ensemble forecasts with different ensemble sizes. The Fair Brier Score FairBS for one forecast-event pair \(t\) is thus defined as

(1)\[\operatorname{FairBS_t} = \left(\frac{e}{M} - o\right)^2 - \frac{e(M-e)}{M^2(M-1)}.\]

Analogous to the ensemble-size adjustment of the Brier Score, the Fair Brier Skill Score can also be calculated, whereby the Brier Scores of the Forecast (\(\operatorname{BS_{fc}}\)) and the Reference (\(\operatorname{BS_{ref}}\)) are calculated according to equation (1).

Note

For the Fair scores in this Plugin, the climatology is assigned a hypothetical ensemble size equal to the number of observations.
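The BS, FairBS and BSS calculations above can be sketched in a few lines of NumPy (hypothetical toy data; the Plugin itself relies on the SpecsVerification R-Package):

```python
import numpy as np

# Toy binary-event data: e[t] of M members predict the event, o[t] observed
M = 10
e = np.array([7, 2, 9, 5, 0, 10, 4, 6])  # members predicting the event
o = np.array([1, 0, 1, 1, 0, 1, 0, 1])   # event occurred (1) or not (0)

y = e / M                     # forecast probabilities
bs = np.mean((y - o) ** 2)    # Brier Score

# Fair Brier Score, eq. (1), averaged over the forecast-event pairs
fair_bs = np.mean((y - o) ** 2 - e * (M - e) / (M ** 2 * (M - 1)))

# Skill score against a climatological reference (base rate as forecast)
bs_clim = np.mean((o.mean() - o) ** 2)
bss = 1.0 - bs / bs_clim
```

The correction term \(e(M-e)/(M^2(M-1))\) is non-negative, so the FairBS is never larger than the plain BS.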

Ranked Probability Skill Score (RPSS-Metric)

In case you want to divide the data into more than two discrete categories (e.g. dry/near-normal/wet precipitation conditions), the Ranked Probability Score (RPS) can be chosen. The RPS is basically an extension of the Brier Score to multiple-event situations and is based on squared errors with respect to the cumulative probabilities of forecasts \(Y\) and observations \(O\). The RPS is defined as

(2)\[ \operatorname{RPS} = \frac{1}{T} \sum^{T}_{t=1}\left[\sum^{K}_{k=1}(Y_k(t) - O_k(t))^2\right],\]

where \(K\) is the number of discrete categories. Analogous to the Brier Skill Score, the Ranked Probability Skill Score

\[\operatorname{RPSS} = 1 - \frac{\operatorname{RPS_{fc}}}{\operatorname{RPS_{ref}}},\]

can be derived. With respect to the ensemble-size adjustment, the Fair Ranked Probability Score FairRPS for one forecast-event pair \(t\) follows

\[\operatorname{FairRPS_t} = \sum^{K}_{k=1}\left[\left(\frac{E_k}{M} - O_k\right)^2 - \frac{E_k(M-E_k)}{M^2(M-1)}\right],\]

where \(E\) is the cumulative number of ensemble members. The calculation of the Fair Ranked Probability Skill Score is analogous to the computation of the FairBSS.
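As a numerical illustration of the (Fair)RPS, the following sketch reproduces the three-category example of Fig. 3 for a single forecast-event pair (toy values; the Plugin itself relies on SpecsVerification):

```python
import numpy as np

# Three categories, T = 1 forecast-event pair
M = 10
E = np.array([3, 8, 10])  # cumulative members: "below", "normal", "above"
O = np.array([0, 1, 1])   # cumulative observation: "normal" occurred

# RPS, matching the Fig. 3 example: (0.3-0)^2 + (0.8-1)^2 + (1-1)^2
rps = np.sum((E / M - O) ** 2)

# FairRPS: the ensemble-size correction is applied per category
fair_rps = np.sum((E / M - O) ** 2 - E * (M - E) / (M ** 2 * (M - 1)))
```

The last cumulative term is always zero, since both forecast and observation CDFs reach 1 in the final category.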

Continuous Ranked Probability Skill Score (CRPSS-Metric)

The Continuous Ranked Probability Score (CRPS) is essentially the extension of the RPS to the continuous case. The number of predictand classes is replaced by an infinite number of classes of infinitesimal width, so that the summation in eq. (2) is substituted by an integral:

\[\operatorname{CRPS} = \frac{1}{T} \sum^{T}_{t=1}\int^{\infty}_{-\infty}\left[F_{Y(t)}(x) - F_{O(t)}(x)\right]^2 dx,\]

where \(F_Y\) is the cumulative distribution function (CDF) of the forecast and \(F_O\) is the Heaviside function jumping from 0 to 1 at the observed value.

EnsCRPSS Metric

Following [Hersbach, 2000] and [Siegert, 2014], the CRPS derived from ensemble prediction systems (the enscrpss metric option in the ProblEMS Plugin) can be written as

\[\operatorname{CRPS_{Ens}} = \frac{1}{T} \sum^{T}_{t=1} \left[ \frac{1}{M} \sum^{M}_{i=1} \lvert y_i(t) - o_t \rvert - \frac{1}{M^2} \sum^{}_{1 \leq i < j \leq M} \lvert y_i(t) - y_j(t) \rvert \right].\]

The corresponding Continuous Ranked Probability Skill Score is defined as

(3)\[\operatorname{CRPSS} = 1 - \frac{\operatorname{CRPS_{fc}}}{\operatorname{CRPS_{ref}}}.\]

According to the ensemble-size-adjustment, the Fair Continuous Ranked Probability Score \(\operatorname{FairCRPS}\) follows

\[\operatorname{FairCRPS_{Ens}} = \frac{1}{T} \sum^{T}_{t=1} \left[ \frac{1}{M} \sum^{M}_{i=1} \lvert y_i(t) - o_t \rvert - \frac{1}{M(M-1)} \sum^{}_{1 \leq i < j \leq M} \lvert y_i(t) - y_j(t) \rvert \right].\]

The calculation of the Fair Continuous Ranked Probability Skill Score is analogous to equation (3).
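The ensemble CRPS and its Fair variant above can be sketched as follows (a minimal NumPy illustration, not the Plugin's implementation):

```python
import numpy as np

def ens_crps(y, o, fair=False):
    """CRPS of an ensemble forecast y (shape [T, M]) against
    observations o (shape [T]); fair=True applies the
    ensemble-size correction (1/(M(M-1)) instead of 1/M^2)."""
    T, M = y.shape
    # (1/M) sum_i |y_i(t) - o_t|
    term1 = np.mean(np.abs(y - o[:, None]), axis=1)
    # sum over pairs i < j of |y_i(t) - y_j(t)|
    pair = np.abs(y[:, :, None] - y[:, None, :]).sum(axis=(1, 2)) / 2
    denom = M * (M - 1) if fair else M ** 2
    return np.mean(term1 - pair / denom)

rng = np.random.default_rng(1)
y = rng.normal(size=(30, 10))  # hypothetical 10-member ensemble, 30 years
o = rng.normal(size=30)
```

For a single-member ensemble the pair term vanishes and the CRPS reduces to the mean absolute error; the fair correction always lowers the score of a spread ensemble.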

GaussCRPSS Metric

Besides the score version for ensemble forecast systems described above, a parametric version of the CRPS can also be defined for a Gaussian forecast distribution (the gausscrpss metric option in the ProblEMS Plugin). Following [Gneiting et al., 2005] and [Wilks, 2011], the \(\operatorname{CRPS_{Gauss}}\) can be written as

\[\operatorname{CRPS_{Gauss}} = \frac{1}{T} \sum^{T}_{t=1} \sigma_t \left\{\frac{o_t - \mu_t}{\sigma_t}\left[ 2\Phi \left( \frac{o_t - \mu_t}{\sigma_t}\right) - 1 \right] +2\phi \left( \frac{o_t - \mu_t}{\sigma_t} \right) - \frac{1}{\sqrt{\pi}} \right\},\]

where \(\mu_t\) (\({\sigma_t}^2\)) is the mean (variance) of the ensemble at time step \(t\) and \(\Phi\) (\(\phi\)) the CDF (PDF) of the standard Gaussian distribution.
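The closed-form Gaussian CRPS can be sketched with scipy.stats.norm (illustration only, not the Plugin's implementation):

```python
import numpy as np
from scipy.stats import norm

def gauss_crps(mu, sigma, o):
    """Closed-form CRPS for Gaussian forecasts; mu, sigma, o are
    arrays over time steps t."""
    z = (o - mu) / sigma  # standardized error (o_t - mu_t) / sigma_t
    return np.mean(sigma * (z * (2 * norm.cdf(z) - 1)
                            + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi)))
```

A useful sanity check: for a standard Gaussian forecast with the observation exactly at the mean, the formula reduces to \(2\phi(0) - 1/\sqrt{\pi} = (\sqrt{2}-1)/\sqrt{\pi}\).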

GevCRPSS Metric

In case of a generalized extreme value (GEV) distribution forecast (e.g. yearly maximum temperatures), the parametric gevcrpss option in the ProblEMS Plugin can be applied. First, for each forecast time series the GEV parameters location \(\mu\), scale \(\sigma\) and shape \(\xi\) of the distribution \(F_{\operatorname{GEV}}\)

\[\begin{split}F_{\operatorname{GEV}}(x) = \begin{cases} \exp \left(-\left[ 1 + \xi \left(\frac{x - \mu}{\sigma} \right) \right] ^{-\frac{1}{\xi}} \right) & \xi \neq 0 \\ \exp \left(-\exp \left( - \frac{x - \mu}{\sigma} \right) \right) & \xi = 0 \\ \end{cases}\end{split}\]

will be estimated based on the ismev R-Package [Coles, 2001, Gilleland, 2016]. The estimated parameters are then used for the parametric CRPS calculation of GEV distribution forecasts, following [Friederichs and Thorarinsdottir, 2012] and the scoringRules R-Package from [Jordan et al., 2017]. For \(\xi \neq 0\) the CRPS is given by

\[\begin{split}\begin{aligned} \begin{split} \label{eq:gev_crps1} \operatorname{CRPS_{GEV_{\xi \neq 0}}} = & \left( \mu - x - \frac{\sigma}{\xi} \right) \left( 1 - 2F_{\operatorname{GEV_{\xi \neq 0}}} \right) \\ & - \frac{\sigma}{\xi} \left[ 2^{\xi}\Gamma(1-\xi) - 2\Gamma_{l} \left( 1-\xi,-\log F_{\operatorname{GEV_{\xi \neq 0}}} \right) \right] \end{split}\end{aligned}\end{split}\]

and for \(\xi = 0\), the CRPS is written as

\[\operatorname{CRPS_{GEV_{\xi = 0}}} = \mu - x + \sigma \left( C - \log 2 \right) - 2\sigma Ei \left( \log F_{\operatorname{GEV_{\xi = 0}}} \right),\]

where \(C \approx 0.5772\) is the Euler-Mascheroni constant, \(Ei(x) =\int^{x}_{-\infty} (e^t/t)dt\) the exponential integral, \(\Gamma\) denotes the gamma function and \(\Gamma_{l}\) the lower incomplete gamma function. More details of the \(\operatorname{CRPS_{GEV}}\) calculation can be found in the cited references above.
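These closed forms can be sanity-checked (outside the plugin) against the integral definition \(\operatorname{CRPS} = \int \left( F(y) - \mathbb{1}\{y \geq x\} \right)^2 \, dy\). A brute-force Python sketch for the \(\xi = 0\) (Gumbel) case, with purely illustrative function names and a truncated integration range chosen so that the neglected tails are negligible:

```python
import math

def gumbel_cdf(x, mu=0.0, sigma=1.0):
    """F_GEV for xi = 0 (Gumbel distribution)."""
    return math.exp(-math.exp(-(x - mu) / sigma))

def crps_numeric(cdf, obs, lo=-15.0, hi=30.0, n=60000):
    """CRPS via midpoint-rule quadrature of the integral definition
    int (F(y) - 1{y >= obs})^2 dy over a truncated grid [lo, hi]."""
    h = (hi - lo) / n
    total = 0.0
    for k in range(n):
        y = lo + (k + 0.5) * h
        step = 1.0 if y >= obs else 0.0
        d = cdf(y) - step
        total += d * d * h
    return total
```

For \(\mu = 0\), \(\sigma = 1\) and \(x = 0\) (so \(\log F = -1\) and \(Ei(-1) \approx -0.2194\)), the closed form above evaluates to \((C - \log 2) - 2\,Ei(-1) \approx 0.3228\), which the quadrature reproduces.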

CRPS Decomposition

Details of the decomposition of the CRPS into Reliability, Resolution and Uncertainty can be found for the Gaussian CRPS in [Siegert, 2014] and [Siegert, 2017] and for the Ensemble CRPS in [Hersbach, 2000] eq. (3), (35-37) and the verification R-Package [NCAR - Research Applications Laboratory, 2015].


_images/bs_schema.png

Fig. 3 Visualization of BS (left), RPS (center) and CRPS (right) conceptual examples. BS: For a specific date, 7 out of 10 ensemble members of a forecast system predict a frost day (green). If a frost day is observed (red solid), the BS for this event is given by \((0.7-1)^2\); if a temperature above \(0^\circ\text{C}\) is observed (red dashed), the BS is given by \((0.7-0)^2\). RPS: For a specific date, 3 out of 10 ensemble members predict a temperature classified as category “below normal”, 5 out of 10 members predict the category “normal” and 2 out of 10 the category “above normal”. In this example a temperature classified as “normal” is observed, so the cumulative probability of the observation jumps from 0 to 1 (red). Considering the cumulative probability of the forecast (green), the RPS is calculated as \((0.3-0)^2 + (0.8-1)^2 + (1-1)^2\); note that the last term is always zero. CRPS: For a specific date, the temperature values predicted by the ensemble members of a forecast system follow a hypothetical Gaussian distribution (green). Assuming that the observed temperature is \(0^\circ\text{C}\) (red), the squared area between the cumulative probabilities of observation and forecast represents the CRPS. As the figure shows, this (squared) area, and thereby the CRPS, depends on the distribution of the ensemble forecast.
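The worked numbers in the caption can be reproduced directly from the definitions of BS and RPS. A minimal Python sketch (illustrative helper names, not part of the plugin):

```python
def brier_score(p_fc, obs_event):
    """BS for one date: squared difference between the forecast
    probability and the observed binary outcome (1 = event occurred)."""
    return (p_fc - obs_event) ** 2

def rps(p_categories, obs_category):
    """RPS for one date: sum of squared differences between the
    cumulative forecast and observation probabilities over ordered
    categories (the last term is always zero)."""
    score, cum_fc, cum_obs = 0.0, 0.0, 0.0
    for k, p in enumerate(p_categories):
        cum_fc += p
        cum_obs += 1.0 if k == obs_category else 0.0
        score += (cum_fc - cum_obs) ** 2
    return score

# Caption examples: 7/10 members forecast frost, frost observed
# -> BS = (0.7 - 1)^2; forecast [0.3, 0.5, 0.2], "normal" observed
# -> RPS = (0.3 - 0)^2 + (0.8 - 1)^2 + (1 - 1)^2
bs_example = brier_score(0.7, 1)
rps_example = rps([0.3, 0.5, 0.2], 1)
```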

MSE(SS)

The Mean Squared Error Skill Score \(MSESS\) is calculated as

\[MSESS=1-\frac{MSE_F}{MSE_R},\]

where \(MSE_F\) and \(MSE_R\) are the Mean Squared Errors of the forecast and the reference with

\[MSE=\frac{1}{T} \sum^{T}_{t=1} \left( f_t - o_t \right)^2\]

where \(f_t\) and \(o_t\) are the forecast and the observation for time step \(t\). Please note that \(f_t\) and \(o_t\) are anomalies calculated within a leave-one-out cross-validation setup (i.e. the year for which the anomaly is calculated does not contribute to the calculation of the climatology). The climatology that can be used as reference forecast is likewise calculated via leave-one-out cross-validation. See [Goddard et al., 2013] for further information.
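The leave-one-out idea can be summarized in a few lines of Python (an illustrative sketch of the concept, not the plugin's R code):

```python
import numpy as np

def cv_anomalies(x):
    """Leave-one-out cross-validated anomalies: the climatology for
    year t is the mean over all other years, so year t never
    contributes to its own reference value."""
    x = np.asarray(x, float)
    n = x.size
    loo_clim = (x.sum() - x) / (n - 1)   # mean excluding year t
    return x - loo_clim

def msess(fc, ref, obs):
    """MSESS = 1 - MSE_F / MSE_R, computed on cross-validated
    anomalies of forecast, reference and observation."""
    f, r, o = (cv_anomalies(v) for v in (fc, ref, obs))
    mse_f = np.mean((f - o) ** 2)
    mse_r = np.mean((r - o) ** 2)
    return 1.0 - mse_f / mse_r
```

A perfect forecast (identical to the observations) yields MSESS = 1, while a forecast no better than the reference yields MSESS = 0.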

Correlation

The Pearson and Spearman correlations are calculated with the R function cor. Cross-validated anomalies are used here as well, for both model and observations.

Significance

The information about the significance of skill scores is based on the block-bootstrap method applied in [Goddard et al., 2013], but without re-sampling ensemble members. The time series of length \(n\) is divided into \(n-l+1\) overlapping blocks of length \(l\) (here set to five), where the \(k\)-th block contains the time series values \(x_i\) with \(i = k, \ldots, k+l-1\). Then, random selection with replacement is applied to generate a new time series of Forecast-Reference-Observation cases, from which the skill score is calculated. This procedure is repeated e.g. 1000 times to build up a sampling distribution of the considered skill score (e.g. [Mason, 2008]). A positive (negative) skill score is significantly different from zero if the 5%-percentile (95%-percentile) is greater (less) than zero.
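A minimal Python sketch of this moving-block resampling (illustrative names; the actual plugin resamples Forecast-Reference-Observation cases jointly in R):

```python
import random

def block_bootstrap_indices(n, l=5, rng=None):
    """One bootstrap sample of time indices: draw overlapping blocks
    of length l (n - l + 1 possible block starts) with replacement
    until n indices are collected, then truncate to length n."""
    rng = rng or random.Random()
    idx = []
    while len(idx) < n:
        k = rng.randrange(n - l + 1)   # random block start
        idx.extend(range(k, k + l))
    return idx[:n]

def significance(skill_fn, n, n_boot=1000, l=5, seed=1):
    """Sampling distribution of a skill score under block resampling.
    skill_fn maps a list of time indices to a skill-score value; the
    5%/95% percentiles of the returned sorted list give the test."""
    rng = random.Random(seed)
    return sorted(skill_fn(block_bootstrap_indices(n, l, rng))
                  for _ in range(n_boot))
```

The percentile check then reads off the sorted distribution: a positive skill score is significant if the value at the 5% position is still above zero.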

Stratified Verification

Note

Stratified verification is in a test phase and under construction: It is not recommended for operational use!

The forecast skill of decadal predictions is assumed to vary not only with lead time but also with low-frequency climate modes such as the El Niño-Southern Oscillation (ENSO) or the Atlantic Multidecadal Variability (AMV). The state/phase of these variability modes could have an influence during the initialization procedure of decadal predictions and therefore also on the overall forecast skill. With the option of stratified verification (Plugin parameter: years_stratified_verification), the total Skill Score can be decomposed into multiple parts to estimate the skill contribution stratified along a given set of initialization years. E.g., the total Ranked Probability Skill Score (RPSStot) can be decomposed into two parts initialized during different years (periods) as follows

\[\begin{split}\begin{split} \operatorname{RPSS_{tot}} & = \frac{\operatorname{RPS_{ref}^{tot}} - \operatorname{RPS_{fc}^{tot}}}{\operatorname{RPS_{ref}^{tot}}}\\ \\ & = \frac{ \left( \frac{N_1}{N} \operatorname{RPS_{ref}^1} + \frac{N_2}{N} \operatorname{RPS_{ref}^2} \right) - \left( \frac{N_1}{N} \operatorname{RPS_{fc}^1} + \frac{N_2}{N} \operatorname{RPS_{fc}^2} \right) }{\left( \frac{N_1}{N} \operatorname{RPS_{ref}^1} + \frac{N_2}{N} \operatorname{RPS_{ref}^2} \right)}\\ \\ & = \frac{ \frac{N_1}{N} \left( \operatorname{RPS_{ref}^1} - \operatorname{RPS_{fc}^1} \right) + \frac{N_2}{N} \left( \operatorname{RPS_{ref}^2} - \operatorname{RPS_{fc}^2} \right) }{\left( \frac{N_1}{N} \operatorname{RPS_{ref}^1} + \frac{N_2}{N} \operatorname{RPS_{ref}^2} \right)}\\ \\ & = \frac{ \frac{N_1}{N} \left( \operatorname{RPS_{ref}^1} - \operatorname{RPS_{fc}^1} \right) + \frac{N_2}{N} \left( \operatorname{RPS_{ref}^2} - \operatorname{RPS_{fc}^2} \right) }{ \operatorname{RPS_{ref}^{tot}}}\\ \\ & = \overbrace{ \frac{N_1}{N} \cdot \frac{ \left( \operatorname{RPS_{ref}^1} - \operatorname{RPS_{fc}^1} \right) }{ \operatorname{RPS_{ref}^{tot}} }}^{\text{Contribution period 1}} + \overbrace{\frac{N_2}{N} \cdot \frac{ \left( \operatorname{RPS_{ref}^2} - \operatorname{RPS_{fc}^2} \right) }{ \operatorname{RPS_{ref}^{tot}} }}^{\text{Contribution period 2}}\\ \\ & = \overbrace{ \frac{N_1}{N} \cdot \frac{ \left( \operatorname{RPS_{ref}^1} - \operatorname{RPS_{fc}^1} \right) }{ \operatorname{RPS_{ref}^1} } \cdot \frac{ \operatorname{RPS_{ref}^1} }{ \operatorname{RPS_{ref}^{tot}} } }^{\text{Contribution period 1}} + \overbrace{ \frac{N_2}{N} \cdot \frac{ \left( \operatorname{RPS_{ref}^2} - \operatorname{RPS_{fc}^2} \right) }{ \operatorname{RPS_{ref}^2} } \cdot \frac{ \operatorname{RPS_{ref}^2} }{ \operatorname{RPS_{ref}^{tot}} } }^{\text{Contribution period 2}}\\ \\ & = \sum_{i=1}^{2} \underbrace{\frac{N_i}{N}}_{\text{Weighting}} \cdot \underbrace{\frac{ \left( \operatorname{RPS_{ref}^i} - \operatorname{RPS_{fc}^i} \right) }{ \operatorname{RPS_{ref}^i} }}_{\text{RPSS period i}} \cdot \underbrace{\frac{ \operatorname{RPS_{ref}^i} }{ \operatorname{RPS_{ref}^{tot}} }}_{\substack{ \text{Ratio} \\ \text{RPS period i} \\ \text{vs. total} } } \end{split}\end{split}\]
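The key property of this decomposition is that the weighted per-period contributions sum exactly to the total RPSS. This arithmetic identity is easy to verify numerically (an illustrative Python sketch with made-up RPS values, not plugin code):

```python
def rpss_total(rps_ref, rps_fc, sizes):
    """Total RPSS from per-period RPS values weighted by N_i / N."""
    n = sum(sizes)
    ref_tot = sum(w / n * r for w, r in zip(sizes, rps_ref))
    fc_tot = sum(w / n * f for w, f in zip(sizes, rps_fc))
    return (ref_tot - fc_tot) / ref_tot

def rpss_contributions(rps_ref, rps_fc, sizes):
    """Per-period contributions N_i/N * (RPS_ref^i - RPS_fc^i) /
    RPS_ref^tot; by construction they sum to the total RPSS."""
    n = sum(sizes)
    ref_tot = sum(w / n * r for w, r in zip(sizes, rps_ref))
    return [w / n * (r - f) / ref_tot
            for w, r, f in zip(sizes, rps_ref, rps_fc)]
```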

Significant differences between individual parts are likewise based on the bootstrap method described in section Significance (but without the “block” option), applied to the given time steps of each individual part.

Additional Information, Remarks and Installation

Climatology as Reference

Climatologies in ProblEMS are generated as an artificial ensemble Reference experiment whose ensemble size equals the number of time steps in the observational data. The generated climatologies are based on all time steps of the Observation; a leave-one-out method is not used here.

In general, every time step of the “climatological experiment” contains all time steps of the Observation as an artificial ensemble. Hence, each time step has the same ensemble values based on the observational data. In case of BS and RPS, the fraction of ensemble members for every category is derived from the probability of occurrence based on the user-given threshold parameter.
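Conceptually, the construction of this artificial reference can be sketched as follows (illustrative Python, hypothetical function name; the plugin builds this from NetCDF data):

```python
import numpy as np

def climatological_ensemble(obs):
    """Artificial climatological reference experiment: every time step
    carries all observed values as its ensemble, so the ensemble size
    M equals the number of observation time steps T and all time steps
    share the same ensemble values."""
    obs = np.asarray(obs, float)
    t = obs.size
    return np.tile(obs, (t, 1))   # shape (T, M) with M == T
```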

Handling of Missing Values

  1. General Preparation of Input Data (Temporal Aggregation):

    • Statistic Method: avg
      Gridwise calculation of the annual monthly/seasonal/yearly average separately for each Forecast/Reference/Observation experiment. In case at least one time step of the input data contains a missing value, the output time step of the processed year is also set to missing value. The same behavior is valid for moving average calculation over multiple lead-years.
    • Statistic Method: mean/min/max
      Gridwise calculation of the annual monthly/seasonal/yearly mean/minimum/maximum separately for each Forecast/Reference/Observation experiment. The output time step of the processed year is only set to missing value in case that all time steps of input data contain missing values. The same behavior is valid for moving mean/minimum/maximum calculation over multiple lead-years.
  2. Areamean:

    1. Gridwise calculation of overall monthly/seasonal/yearly average separately for each Forecast/Reference/Observation experiment. In case at least one time step (year) of the input time series contains a missing value, the output time step is also set to missing value. The result is an individual missing-value-mask for every Forecast/Reference/Observation experiment (separately for each month/season if chosen).

    2. Gridwise combination of Forecast-Reference-Observation cases. In case at least one individual experiment contains a missing value, the resulting grid point is also set to missing value. Hence, the missing-value-mask is the same for each experiment (separately for each month/season if chosen).

    3. Calculation of area mean over all non-missing-value grid points.

  3. Score Calculation:
    At least 90% of the Forecast-Reference-Observation cases (years) must be complete (i.e. all experiments contain a non-missing value in the given year) within the prepared time series (gridwise, for every month/season if chosen) to calculate the resulting score.
  4. Significances:
    Since missing values can also be picked when generating the random time series used for the bootstrap method, the behavior described in 3. also applies to each randomly generated time series. The 95%-level is determined by counting the relative number of scores that are not set to missing value. This means the number of score values describing the sampling distribution can be smaller than the number of bootstraps given by the user.
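The completeness rule in step 3 can be sketched as follows (illustrative Python, representing missing values as None; the plugin itself works gridwise on NetCDF missing values):

```python
def complete_cases_ok(fc, ref, obs, min_frac=0.9):
    """Check the 90%-rule: a year counts as complete only if forecast,
    reference and observation are all non-missing in that year; the
    score is calculated only if at least min_frac of years are complete."""
    complete = sum(1 for f, r, o in zip(fc, ref, obs)
                   if f is not None and r is not None and o is not None)
    return complete / len(obs) >= min_frac
```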

Installation

Software:

  • CDO (>=1.7)

  • NetCDF4

  • NCO

  • plot_engine (2.1.16)

  • R (>=3.2)

  • relevant R-packages plus dependencies: ncdf4, SpecsVerification, abind, fields, RColorBrewer, foreach, SNOW, ismev, verification

Installation Steps:

  1. Get the Plugin (incl. the git submodule leadtimeselektor) from the git repository:

    git clone --recursive https://gitlab.dkrz.de/freva/plugins4freva/problems.git
    cd problems
    

    make sure the latest version of the leadtimeselektor submodule is available in problems/tools4problems/leadtimeselector (branch to_new_freva). If not, update leadtimeselektor as follows:

    git submodule sync
    git submodule update --init --recursive
    cd tools4problems/leadtimeselector
    git checkout to_new_freva
    git pull # in case it is needed
    
  2. Create the conda environment:

    1. you will need a suitable version of conda active beforehand, for example by loading a Freva instance, e.g.:

    module load clint <your-projects-freva>
    
    2. run the Makefile:

    make all
    

To activate/deactivate the environment including its own CDO (not necessary when running the plugin via Freva):

source plugin_env/bin/activate # activate
conda deactivate # deactivate

Updating the Plugin with git:

  1. cd <PATH2PLUGIN>

  2. git checkout -b <new_branch> (commit changes in a new branch <new_branch>)

  3. git push -u origin <new_branch> (push the changes upstream to <new_branch>)

  4. pull/merge request into main/master branch (at e.g. gitlab), wait for reviews, merge