Surrogate Modelling¶

Each global sensitivity sweep solves thousands of model instances over a space-filling design of uncertain parameters. A surrogate is a cheap statistical approximation of the model’s input-output mapping, fitted once on those solved samples and then evaluated millions of times in place of the real model. Surrogates are what make downstream diagnostics tractable: Sobol variance decomposition (Sensitivity Analysis), policy sweeps, and uncertainty-band plots all query the surrogate rather than re-solving.

This page covers how surrogates are fitted, validated, and consumed. The parameter design they are fitted on, and the Sobol indices computed from them, are described in Sensitivity Analysis.

The surrogate bundle¶

A single Snakemake rule (build_surrogate) fits a surrogate for every declared output and persists it as a self-contained, pickled SurrogateBundle. The bundle carries the trained model(s), the generator spec (parameter names, distributions, slice parameters, seed), the list of trained output columns, and per-output validation metrics. It is the single artifact every downstream consumer loads – nothing re-reads the raw scenario outputs.

All methods expose a uniform prediction interface, so consumers are agnostic to which surrogate was fitted:

from workflow.scripts.analysis.surrogate import load_bundle, predict

bundle = load_bundle("results/gsa/surrogates/surrogate_gsa_mlp.pkl")
y = predict(bundle, "co2", x)   # x: (n_samples, n_params) in physical units

See Using a fitted surrogate for the full API.

Surrogate methods¶

Four methods are supported: pce, rf, xgb, and mlp. The same solved scenarios can be fitted by several methods independently; which one downstream consumers load by default is set by sensitivity_analysis.default_surrogate (currently mlp).

Polynomial Chaos Expansion (PCE)¶

PCE approximates the response as a polynomial in the uncertain inputs. If \(\mathbf{X} = (X_1, \ldots, X_d)\) are the parameters and \(Y\) is a scalar output,

\[Y \approx \sum_{\boldsymbol{\alpha}} c_{\boldsymbol{\alpha}}\, \Psi_{\boldsymbol{\alpha}}(\mathbf{X})\]

where \(\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_d)\) is a multi-index, \(c_{\boldsymbol{\alpha}}\) are scalar coefficients, and \(\Psi_{\boldsymbol{\alpha}}\) are multivariate orthonormal polynomials with respect to the joint input distribution, each a product of univariate orthonormal polynomials:

\[\Psi_{\boldsymbol{\alpha}}(\mathbf{X}) = \prod_{i=1}^d \psi_{\alpha_i}^{(i)}(X_i)\]

For uniform inputs these are (normalised) Legendre polynomials; for normal inputs, Hermite polynomials. The orthonormality property \(\mathbb{E}[\Psi_{\boldsymbol{\alpha}} \Psi_{\boldsymbol{\beta}}] = \delta_{\boldsymbol{\alpha}\boldsymbol{\beta}}\) makes the variance decomposition exact, so once the coefficients are known Sobol indices follow analytically without further model evaluations (see Sobol Indices from PCE Coefficients). The basis is generated with chaospy; a cross-truncation parameter \(q \in (0, 1]\) controls the multi-index set, with lower values favouring lower-order interaction terms.

Sparse fitting via LARS. With many parameters and moderate degree, the number of candidate basis terms can exceed the number of samples. Rather than requiring a full tensor-product design, the coefficients are fitted with Least Angle Regression (sklearn.linear_model.LarsCV, 5-fold CV), which incrementally adds the basis terms most correlated with the residual and uses cross-validation to choose how many to keep. This yields a parsimonious expansion that captures the dominant polynomial structure without overfitting.

PCE handles only scalar outputs; vector and field targets (Output target kinds) require a multi-output method.

Regression surrogates (RF, XGB, MLP)¶

The three regression surrogates share a common multi-output pipeline: all output targets are fitted by a single shared-structure estimator rather than one model per output. Targets are standardized to zero mean and unit variance before fitting, so the shared objective is not dominated by the output with the largest absolute scale; predictions are de-standardized back to physical units on the way out. Unlike PCE, these methods support scalar, vector, and field targets, and their Sobol indices are computed by Saltelli pick-freeze Monte Carlo on the fitted surrogate (Sensitivity Analysis) rather than analytically.

Random Forest (rf). A bagged ensemble of regression trees (sklearn.ensemble.RandomForestRegressor, OOB scoring enabled). Non-parametric and able to capture non-smooth or discontinuous responses that polynomials miss, at the cost of noisier sensitivity estimates and piecewise-constant (staircased) predictions.

XGBoost (xgb). Gradient-boosted trees with multi_strategy='multi_output_tree', so a single boosted model predicts all standardized targets jointly. Early stopping on the holdout set governs the number of boosting rounds. Typically the most accurate method on the sharp, near-discontinuous scalar outputs (e.g. yield-loss thresholds), but the resulting bundles are large.

ReLU MLP (mlp, the default). A multilayer perceptron with ReLU activations, wrapped in a pipeline that log-transforms the log-uniform input columns, standardizes all inputs, and regresses the standardized targets (sklearn.neural_network.MLPRegressor). Two properties motivate it as the default:

A ReLU MLP is a continuous piecewise-linear map, so predictions have smooth, non-staircased gradients – well suited to the PCA score targets of spatial fields (Output target kinds), where it outperforms the tree methods.
The fitted bundle is one to two orders of magnitude smaller on disk than an XGBoost bundle and faster at inference, which matters when many bundles are synced from the cluster and loaded by notebooks.

The log-transform stage matters because the MLP is not invariant to monotone input rescalings (tree methods are). Log-uniform parameters such as value_per_yll or ghg_price span several orders of magnitude; logging them first turns a log-uniform marginal into a uniform one and a typically log-linear response into a near-linear one, which the network fits far more readily. solver='adam' with early stopping on a held-out fraction scales to the full design and avoids the poorer multi-output minima reached by full-batch lbfgs.

The trade-off versus XGBoost: the MLP’s smoothness bias costs a little accuracy on the sharpest scalar outputs while winning on spatial fields and on bundle size. Pick xgb explicitly when raw scalar accuracy on bang-bang responses matters most.

Output target kinds¶

Each entry under sensitivity_analysis.outputs declares a kind that controls how its reducer output is turned into surrogate targets:

scalar (default): The reducer returns one float per scenario, fitted as a single target.
vector: The reducer returns a dict[str, float] (e.g. global mass per food), expanded to one {output}.{element} column per key and fitted as individual targets. Requires a multi-output method (rf/xgb/mlp).
field: A high-dimensional spatial dict[str, float] – e.g. per-region cropland or grazing area (~700 regions) – that is too wide to fit element by element. It is PCA-compressed: build_surrogate fits a PCA on the field matrix (training rows only) and the surrogate predicts the n_components score columns alongside the scalar/vector targets. Requires a multi-output method and an explicit n_components.

Per-region land-use fields are strongly low-rank – ~20-30 components capture over 99 % of the variance – so the PCA adds negligible reconstruction error, and the field’s accuracy is governed by how well the surrogate predicts the leading scores. predict_field reconstructs the dense field as scores @ components + mean and field_element_keys returns the matching element (e.g. region) labels in column order.

cropland_by_region:
  kind: field
  source: land_use.parquet
  reducer: region_field      # group value_col by key_col, with row filters
  value_col: area_mha
  key_col: region
  exclude_col: crop          # everything except grassland = cropland
  exclude_value: grassland
  n_components: 30
  label: Cropland area by region
  units: Mha

Validation¶

Surrogate quality is assessed per output target through several metrics, written to a flat validation parquet (one row per target):

Holdout error. When holdout_fraction > 0 (recommended), the tail of the Sobol sequence is reserved as held-out test data; the surrogate is fitted on the rest and scored out-of-sample. Using the tail preserves the space-filling quality of the training design (Sobol sequences front-load coverage). This is the primary validation_error.
Leave-one-out error (PCE only). A relative error computed via the hat matrix without refitting.
Out-of-bag R² (RF only). The OOB score from bootstrap aggregation.
R² on training and (when held out) test data, per target.
Field reconstruction R² (field outputs). The dense field is predicted on the holdout via the PCA decoder and scored against the true field (field_recon_r2), alongside explained_variance, n_components, and n_elements.

When holdout is disabled, validation_error falls back to LOO (PCE) or OOB (RF) error. Vector elements (minor foods) and higher-order PCA scores often have tiny mass and noisy targets, so poor fits there are logged quietly; only poor scalar fits raise a warning.

Validation parquet schema (results/{name}/surrogates/surrogate_validation_{group}_{method}.parquet):

Column	Type	Description
`output`	string	Output metric (or `{output}.{element}` for vector/score columns)
`validation_error`	float	Primary error metric (holdout error when available)
`r2_train`	float	R-squared on training data
`r2_test`	float	R-squared on holdout data (null if holdout disabled)
`n_train`	int	Number of training samples
`n_test`	int	Number of holdout samples
`method`	string	Surrogate method (`pce`, `rf`, `xgb`, `mlp`)

Additional method-specific columns: loo_error, n_terms, n_active_terms, max_degree (PCE); oob_error, n_estimators (RF); n_estimators (XGB); n_iter (MLP); field_recon_r2, explained_variance, n_components, n_elements (field rows).

As a rule of thumb, a validation error below 0.1 indicates a reliable scalar surrogate. If it is higher, increase the sample count, or for PCE increase the polynomial degree; comparing a smooth method (PCE/MLP) against a tree method (RF/XGB) reveals whether the difficulty is response non-smoothness or insufficient data.

Configuration¶

Surrogate configuration lives in the sensitivity_analysis section, separate from the _generators block that defines the sampling design (Sensitivity Analysis). This lets several methods be fitted over the same solved scenarios without duplicating the design.

sensitivity_analysis:
  holdout_fraction: 0.15        # fraction reserved for out-of-sample validation
  threads: 6                    # sklearn n_jobs / BLAS threads
  default_surrogate: mlp        # method downstream consumers load by default
  discover_scenarios_on_disk: false
  sobol:                        # Sobol index settings (see sensitivity_analysis)
    outputs: [total_cost, co2, ch4, n2o, land_use, yll]
    grid_resolution: 15
    n_mc_global: 16384
    n_mc_conditional: 2048
  methods:
    pce:
      method_options:
        max_degree: 3
        cross_truncation: 0.8
    rf:
      method_options:
        n_estimators: 500
    xgb:
      method_options:
        n_estimators: 5000
        max_depth: 4
        learning_rate: 0.02
        subsample: 0.8
        colsample_bytree: 0.8
        min_child_weight: 5
        early_stopping_rounds: 50
    mlp:
      method_options:
        hidden_layer_sizes: [256, 128, 64]
        solver: adam
        alpha: 0.0001
        max_iter: 3000
        learning_rate_init: 0.001
        n_iter_no_change: 40
  outputs:
    # ... output target declarations (see Output target kinds)

Field reference

holdout_fraction: Fraction of samples reserved for out-of-sample validation (e.g. 0.15). Set to 0 to disable holdout.
threads: Threads for the fit (sklearn n_jobs and BLAS). Note that tools/smk -jN clamps rule threads to N.
default_surrogate: Method downstream consumers (uncertainty-band plots, notebooks) load when none is named explicitly. Must match a methods key.
discover_scenarios_on_disk: When false (default), build_surrogate declares every Sobol scenario as an input so one tools/smk call drives the whole solve-analyse-surrogate chain. When true, it instead scans the analysis directory and fits over whatever scenarios have complete outputs – the cluster path, where solves run outside Snakemake (Cluster Execution).
sobol: Sobol-index settings shared across methods; see Sensitivity Analysis.
methods: Mapping of method name (pce, rf, xgb, mlp) to its method_options hyperparameters. Configuration is assumed complete, so each declared method must specify all options it uses; the defaults shipped in config/default.yaml are merged in for any method a scenario config does not override.
outputs: Surrogate target declarations (see Output target kinds).

Running¶

Fit a surrogate (after the scenarios are solved):

# MLP surrogate for the default scenario group
tools/smk -j4 --configfile config/gsa.yaml -- \
    results/gsa/surrogates/surrogate_gsa_mlp.pkl

Output paths carry two wildcards: {group} identifies the scenario sampling group (e.g. gsa, gsa-l1-low) and {method} selects the surrogate type (pce, rf, xgb, mlp). Per (group, method) the rule writes the bundle and its validation parquet under results/{name}/surrogates/.

Using a fitted surrogate¶

The bundle is the unit of reuse. Persistence is a plain pickle via save_bundle / load_bundle, and prediction is uniform across methods:

predict (bundle, output, x) – predict one scalar/vector/score column at design matrix x (physical units), returning (len(x),).
predictor (bundle, output) – return a bound x -> y callable that caches per-output setup (e.g. the PCE expansion), for repeated evaluation across a Monte Carlo grid.
predict_field (bundle, field_name, x) – predict a full spatial field, reconstructing the dense (len(x), n_elements) map from its PCA scores.
field_element_keys (bundle, field_name) – the element (region) labels for the columns of predict_field.

These predictors are exactly what the Sobol computation, policy sweeps, and uncertainty-band plots call; see Sensitivity Analysis for how Sobol indices are derived from them.