DGTLENG 206 · Lesson 2 of 5

Building Surrogate Models

The Surrogate Workflow

A surrogate model is a fast approximation of an expensive simulation. The simulation solves physics — FEA, CFD, multi-physics coupling — in hours or days. The surrogate predicts the simulation's output in milliseconds. This speed difference is not incremental. It is the difference between evaluating 50 design candidates and evaluating 50,000. Between a deterministic safety factor and a full probabilistic failure analysis. Between a static design and a real-time digital twin.

But a surrogate is only as good as the data it was trained on, and training data comes from simulations configured from the system model. The workflow has five stages, each with decisions that determine whether the final surrogate is trustworthy or dangerous. This lesson covers each stage in detail.

Stage 1: Define the Parameter Space

Before generating any data, you must define what varies and over what range. The parameter space comes directly from the system model: which design variables matter (geometry dimensions, material properties, operating conditions), what ranges are physically meaningful (you do not train a thermal surrogate on temperatures below absolute zero), and which parameters are fixed versus variable.

Dimensionality matters. A surrogate for 3 input parameters requires a qualitatively different approach than one for 30. In low dimensions (3 to 8 parameters), classical methods work well — Gaussian processes, polynomial response surfaces, and moderate-sized DOE plans. In high dimensions (20+), the curse of dimensionality dominates: the volume of the design space grows exponentially, and covering it with training data becomes prohibitively expensive. Dimensionality reduction techniques (principal component analysis, sensitivity-based screening) are often necessary before training begins.
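The exponential growth can be made concrete with a little arithmetic. The sketch below counts the runs needed for a full-factorial grid with a fixed number of levels per parameter. The choice of 5 levels and the function name are illustrative assumptions, not values from this lesson.

```python
def full_factorial_runs(levels_per_parameter, n_parameters):
    """Number of simulations needed to grid the space at a fixed resolution."""
    return levels_per_parameter ** n_parameters


# Same per-parameter resolution, increasing dimensionality.
for n_parameters in (3, 8, 20, 30):
    runs = full_factorial_runs(5, n_parameters)
    print(f"{n_parameters:>2} parameters at 5 levels each: {runs:.3e} runs")
```

Even at hours per run, the 3-parameter grid is feasible and the 30-parameter grid is not, which is why screening or dimensionality reduction usually comes first.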

Bounds must be physically meaningful. The parameter ranges define the domain within which the surrogate's predictions are valid. Setting bounds too narrow produces a surrogate that cannot answer questions about designs outside the training range. Setting bounds too wide wastes training budget on regions of the design space that will never be used — and may include regions where the simulation fails to converge or produces physically meaningless results.

The system model is the source of truth. In an MBSE-driven organization, the parameter space is not invented by the ML engineer. It is read from the model — the design variables, their ranges, their units, and their interdependencies are model artifacts. This traceability from surrogate to model is essential: when the design space changes (a new material option, a revised operating envelope), the surrogate's validity domain changes with it, and that change is visible in the model.
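One way to keep the parameter space explicit is to hold it as data rather than scattering bounds through scripts. The following is a minimal sketch under stated assumptions: the `Parameter` class, the three design variables, and their ranges are hypothetical placeholders for what would be read from the system model.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class Parameter:
    name: str
    lower: float
    upper: float
    unit: str


# Hypothetical design variables; in practice these come from the system model.
PARAMETER_SPACE = [
    Parameter("wall_thickness", 2.0, 8.0, "mm"),
    Parameter("youngs_modulus", 60.0, 210.0, "GPa"),
    Parameter("operating_temp", 20.0, 80.0, "degC"),
]


def to_physical(unit_samples, space):
    """Map samples from the unit hypercube [0, 1]^d to the physical ranges."""
    lower = np.array([p.lower for p in space])
    upper = np.array([p.upper for p in space])
    return lower + np.asarray(unit_samples, dtype=float) * (upper - lower)


# Lower corner, upper corner, and centre of the design space.
points = to_physical(
    [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [0.5, 0.5, 0.5]], PARAMETER_SPACE
)
print(points)
```

Keeping bounds and units in one structure means the DOE, the simulation setup, and any later domain checks all read from the same definition.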

Stage 2: Design of Experiments (DOE)

Given the parameter space, DOE determines where to sample. The goal is to cover the space efficiently — maximizing the information content of each expensive simulation run.

Latin Hypercube Sampling (LHS) is one of the most widely used DOE methods for surrogate training. It divides each parameter's range into N equal-probability intervals, where N is the number of samples, and ensures exactly one sample falls in each interval for each parameter. This guarantees uniform marginal coverage even with modest sample sizes. A 50-sample LHS over 8 parameters ensures each parameter's full range is represented, which random sampling cannot guarantee at this sample size.
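The stratification property is easy to see in code. Below is a minimal NumPy sketch of LHS on the unit hypercube, followed by a check that each interval is hit exactly once per parameter. The function name is an assumption for illustration; a library implementation such as `scipy.stats.qmc.LatinHypercube` would normally be used instead.

```python
import numpy as np


def latin_hypercube(n_samples, n_params, rng):
    """Return an (n_samples, n_params) LHS design on the unit hypercube."""
    # One random point inside each of the n_samples equal-width intervals.
    design = (
        np.arange(n_samples)[:, None] + rng.random((n_samples, n_params))
    ) / n_samples
    # Shuffle each column independently so the parameters are not aligned.
    for j in range(n_params):
        design[:, j] = rng.permutation(design[:, j])
    return design


rng = np.random.default_rng(0)
design = latin_hypercube(50, 8, rng)

# Defining property: every one of the 50 intervals is used once per parameter.
bins = np.floor(design * 50).astype(int)
one_per_interval = all(
    np.array_equal(np.sort(bins[:, j]), np.arange(50)) for j in range(8)
)
print(one_per_interval)
```

The design lives on [0, 1) in every dimension and still has to be scaled to the physical parameter ranges before it configures any simulation.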

Sobol sequences are quasi-random (low-discrepancy) sequences that typically fill the parameter space more uniformly than random sampling, and often more uniformly than LHS. They are constructed to keep the gaps between samples small, which improves coverage of a high-dimensional space. Sobol sequences are also common in sensitivity analysis: Sobol sensitivity indices, which quantify each parameter's contribution to output variance, are usually estimated from a structured sampling scheme (such as Saltelli sampling) that is often built on these sequences, although the indices themselves do not strictly require them.
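The uniformity claim can be checked numerically. This sketch assumes SciPy is available and uses its `scipy.stats.qmc` module to compare the discrepancy (a measure of non-uniformity, lower is better) of a scrambled Sobol design against plain random sampling. The dimension and sample count are arbitrary illustrative choices.

```python
import numpy as np
from scipy.stats import qmc

n_params = 4

# 2**7 = 128 points; Sobol sequences keep their balance at powers of two.
sobol = qmc.Sobol(d=n_params, scramble=True, seed=0)
sobol_points = sobol.random_base2(m=7)

rng = np.random.default_rng(0)
random_points = rng.random((128, n_params))

# Lower discrepancy means more even coverage of the unit hypercube.
sobol_disc = qmc.discrepancy(sobol_points)
random_disc = qmc.discrepancy(random_points)
print(f"Sobol discrepancy:  {sobol_disc:.5f}")
print(f"Random discrepancy: {random_disc:.5f}")
```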

Adaptive sampling goes beyond static DOE. After an initial batch of simulations, build a preliminary surrogate, identify where the surrogate is most uncertain (highest prediction variance), and place the next simulation runs in those regions. This concentrates the training budget where it has the most impact. Adaptive sampling is especially valuable when the response surface has regions of complex behavior (sharp gradients, near-discontinuities) surrounded by regions of smooth behavior.
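A minimal sketch of the "sample where the surrogate is most uncertain" step is shown below, using the posterior variance of a Gaussian process with a fixed squared-exponential kernel. The kernel, its length scale, and the one-dimensional candidate grid are assumptions chosen for brevity, and the function names are illustrative.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between two 1-D point sets."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)


def predictive_variance(x_train, x_query, length_scale=1.0, jitter=1e-8):
    """GP posterior variance at x_query for fixed kernel hyperparameters."""
    k_train = rbf_kernel(x_train, x_train, length_scale)
    k_train += jitter * np.eye(len(x_train))
    k_cross = rbf_kernel(x_train, x_query, length_scale)
    solved = np.linalg.solve(k_train, k_cross)
    return 1.0 - np.sum(k_cross * solved, axis=0)


def next_sample(x_train, candidates, length_scale=1.0):
    """Pick the candidate where the preliminary surrogate is least certain."""
    variance = predictive_variance(x_train, candidates, length_scale)
    return candidates[np.argmax(variance)]


# Initial batch with an unsampled gap between 3 and 9.
x_train = np.array([0.0, 1.0, 2.0, 3.0, 9.0, 10.0])
candidates = np.linspace(0.0, 10.0, 101)
x_next = next_sample(x_train, candidates)
print(f"next simulation at x = {x_next:.1f}")
```

With fixed hyperparameters the variance depends only on where the samples are, so this sketch fills gaps. Refitting the kernel after each batch is what lets the criterion also react to regions of complex behavior.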

How many samples? There is no universal formula. Rules of thumb suggest 10 to 20 samples per input dimension for smooth response surfaces, and more for highly nonlinear responses. A 5-parameter surrogate with a smooth response might train well on 75 to 100 simulations. A 10-parameter surrogate with strong nonlinearities might need 300 to 500. The only reliable test is validation performance, which is assessed in Stage 5.

Stage 3: Run Simulations

Each DOE sample point becomes a simulation configuration. Geometry parameters are set, boundary conditions are applied, material properties are assigned, and the solver runs. The output is the training data — the set of (input, output) pairs that the surrogate will learn from.

Automation is essential. Running 200 simulations manually — exporting geometry, setting up mesh, defining loads, running solver, extracting results for each case — is error-prone and labor-intensive. Automated simulation pipelines, configured from model data through tool APIs, are the practical enabler of surrogate modeling at scale. This connects directly to the engineering automation concepts from DGTLENG 205: the same pipeline infrastructure that runs parametric sweeps and CI/CD-style verification also generates surrogate training data.

Data quality determines surrogate quality. Every training point must be checked: did the simulation converge? Is the mesh adequate for this configuration? Are the results physically reasonable? A single non-converged simulation in the training set introduces a corrupt data point that the surrogate will learn from — producing predictions that look plausible but are wrong in the vicinity of that point.
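The checks above can be applied as an automated gate before any run enters the training set. The sketch below is illustrative only: the record fields (`converged`, `max_stress_mpa`) and the plausibility bounds are hypothetical and do not correspond to any particular solver's output format.

```python
import numpy as np

# Hypothetical result records collected from an automated simulation pipeline.
runs = [
    {"x": [2.0, 100.0], "max_stress_mpa": 181.2, "converged": True},
    {"x": [4.0, 100.0], "max_stress_mpa": 95.7, "converged": True},
    {"x": [6.0, 100.0], "max_stress_mpa": float("nan"), "converged": False},
    # Converged, but a negative peak stress magnitude is not physical.
    {"x": [8.0, 100.0], "max_stress_mpa": -12.0, "converged": True},
]


def is_valid(run, lower=0.0, upper=1000.0):
    """Accept a run only if it converged and its output is finite and plausible."""
    value = run["max_stress_mpa"]
    return bool(run["converged"] and np.isfinite(value) and lower <= value <= upper)


clean = [r for r in runs if is_valid(r)]
rejected = [r for r in runs if not is_valid(r)]

X_train = np.array([r["x"] for r in clean])
y_train = np.array([r["max_stress_mpa"] for r in clean])
print(f"kept {len(clean)} of {len(runs)} runs; {len(rejected)} set aside for review")
```

Rejected runs are kept in a separate list rather than silently dropped, since a cluster of failures often marks a region where the bounds or the mesh settings need another look.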

Simulation fidelity defines the ceiling. The surrogate can never be more accurate than its training data. If the simulations use a coarse mesh that underresolves stress concentrations, the surrogate faithfully reproduces the underresolved answer. If the CFD simulations use an inappropriate turbulence model, the surrogate learns the wrong physics. The surrogate inherits both the accuracy and the systematic errors of its training simulations.

Stage 4: Train the Model

With training data in hand, the next step is choosing and training the surrogate model itself. Three model families dominate engineering surrogate applications.

Gaussian Process (GP) regression (also called Kriging) provides both a prediction and a prediction uncertainty at every query point. The uncertainty estimate is the distinguishing feature: it tells you not just what the surrogate predicts, but how confident the prediction is. High uncertainty flags regions where more training data is needed or where the prediction should not be trusted. GPs work well for low-to-moderate dimensionality (up to about 15 parameters) and smooth response surfaces. They become computationally expensive for large training datasets (over about 1,000 points) due to the cubic scaling of matrix operations.
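To show what "a prediction and a prediction uncertainty" looks like, here is a minimal zero-mean GP in NumPy with a fixed squared-exponential kernel. It is a sketch under simplifying assumptions (one input, unit prior variance, no hyperparameter fitting), not a substitute for a GP library.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between two 1-D point sets."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)


def gp_predict(x_train, y_train, x_query, length_scale=1.0, noise=1e-6):
    """Return the posterior mean and standard deviation of a zero-mean GP."""
    k_train = rbf_kernel(x_train, x_train, length_scale)
    k_train += noise * np.eye(len(x_train))
    k_cross = rbf_kernel(x_train, x_query, length_scale)
    # The Cholesky factorization is the O(n^3) step that limits large datasets.
    chol = np.linalg.cholesky(k_train)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    mean = k_cross.T @ alpha
    v = np.linalg.solve(chol, k_cross)
    variance = np.clip(1.0 - np.sum(v * v, axis=0), 0.0, None)
    return mean, np.sqrt(variance)


x_train = np.linspace(0.0, 5.0, 8)
y_train = np.sin(x_train)
x_query = np.array([2.5, 12.0])  # inside the data vs. far outside it
mean, std = gp_predict(x_train, y_train, x_query)
for xq, m, s in zip(x_query, mean, std):
    print(f"x={xq:5.1f}  mean={m:+.3f}  std={s:.3f}")
```

The standard deviation is small between training points and returns to the prior value far from them, which is the signal used to decide where more simulations are needed.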

Neural networks can learn highly nonlinear, high-dimensional mappings. Deep neural networks handle 50+ input parameters and complex response surfaces that GPs cannot represent efficiently. They require more training data than GPs, provide no built-in uncertainty estimate (though ensemble methods and dropout approximations can add one), and are harder to interpret — the model is a black box. For engineering surrogates with large training databases (thousands of simulations) and high-dimensional parameter spaces, neural networks are often the practical choice.

Random forests are ensemble methods that average predictions from many decision trees. They handle mixed variable types (continuous and categorical), are robust to outliers, provide feature importance rankings (which parameters matter most), and train quickly. They tend to produce step-function-like predictions that can be less smooth than GP or neural network predictions, which matters when the surrogate is used in gradient-based optimization. For screening and ranking applications where smooth gradients are not required, random forests are effective and low-maintenance.

Model selection is not a religious choice. The right model depends on the dimensionality, the training set size, the smoothness of the response, whether uncertainty estimates are needed, and whether the surrogate will be used in gradient-based workflows. In practice, many teams train multiple model types on the same data and select based on validation performance.

Stage 5: Validate and Deploy

Validation is what separates a trained model from a trustworthy tool. Deployment is what separates a research exercise from engineering value.

Hold-out validation reserves a fraction of the simulation data (typically 15 to 25%) that the surrogate never sees during training. The surrogate predicts at these hold-out points, and the predictions are compared to the simulation results. Key metrics: root mean squared error (RMSE), maximum absolute error, and coefficient of determination (R-squared). For engineering applications, the maximum error matters more than the average error: a surrogate that is within 2% on average but 30% off at a single critical design point is dangerous.

Cross-validation (k-fold) rotates the hold-out set so that every data point is used for both training and validation. This is more statistically robust than a single hold-out split, especially for small datasets where reserving 25% for validation wastes precious data. 5-fold or 10-fold cross-validation is standard practice.
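The fold rotation can be sketched in a few lines of NumPy. A polynomial fit stands in for the surrogate here purely for brevity, and the test function is an illustrative assumption rather than real simulation data.

```python
import numpy as np


def k_fold_rmse(x, y, degree, k=5, seed=0):
    """Validation RMSE of a polynomial surrogate for each of k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    scores = []
    for i, val_idx in enumerate(folds):
        # Train on every fold except the one currently held out.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
        residuals = np.polyval(coeffs, x[val_idx]) - y[val_idx]
        scores.append(float(np.sqrt(np.mean(residuals ** 2))))
    return scores


x = np.linspace(0.0, 10.0, 30)
y = np.sin(x) + 0.3 * x  # illustrative smooth response
fold_rmse = k_fold_rmse(x, y, degree=5)
print("per-fold RMSE:", [f"{s:.3f}" for s in fold_rmse])
print(f"mean RMSE: {np.mean(fold_rmse):.3f}")
```

Reporting the per-fold values alongside the mean is useful: a single fold with a much larger error often points at a region the surrogate cannot reproduce without that data.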

Prediction intervals quantify the surrogate's uncertainty at each query point. For Gaussian processes, these are provided natively. For neural networks and random forests, ensemble methods or bootstrapping provide approximate intervals. Prediction intervals are essential for engineering applications because they enable risk-informed use: if the interval is narrow, the surrogate's prediction is reliable; if the interval is wide, the surrogate is uncertain and full simulation may be warranted.
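For models without native uncertainty, a bootstrap ensemble is one simple way to get approximate intervals. The sketch below refits a polynomial surrogate on resampled training sets and reads percentiles off the spread of predictions. The model, noise level, and ensemble size are assumptions, and the resulting band reflects variability of the fit rather than a calibrated guarantee.

```python
import numpy as np


def bootstrap_interval(x_train, y_train, x_query, degree=3, n_models=200, seed=0):
    """Approximate 95% band from an ensemble of bootstrap-refitted surrogates."""
    rng = np.random.default_rng(seed)
    predictions = np.empty((n_models, len(x_query)))
    for m in range(n_models):
        # Resample the training set with replacement and refit.
        idx = rng.integers(0, len(x_train), len(x_train))
        coeffs = np.polyfit(x_train[idx], y_train[idx], degree)
        predictions[m] = np.polyval(coeffs, x_query)
    lower, upper = np.percentile(predictions, [2.5, 97.5], axis=0)
    return lower, upper


rng = np.random.default_rng(1)
x_train = np.linspace(0.0, 5.0, 25)
y_train = np.sin(x_train) + rng.normal(0.0, 0.05, x_train.size)

x_query = np.array([2.5, 7.0])  # inside the data vs. beyond it
lower, upper = bootstrap_interval(x_train, y_train, x_query)
width = upper - lower
for xq, w in zip(x_query, width):
    print(f"x={xq:.1f}: interval width {w:.3f}")
```

A wide interval is the cue to fall back to full simulation at that design point.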

Deployment means embedding the surrogate in an engineering workflow where it delivers value: in an optimization loop, as a real-time predictor in a digital twin, as a screening tool in a design exploration pipeline. The deployed surrogate must be versioned, documented, and monitored — just like any engineering tool. When the design space changes, the surrogate must be retrained or flagged as out of scope.

When Surrogates Fail

Not every problem is amenable to surrogate modeling. Recognizing when surrogates will fail is as important as knowing how to build them.

Extrapolation. The surrogate has no basis for prediction outside its training domain. A thermal surrogate trained on operating temperatures from 20 to 80 degrees Celsius will produce a number at 150 degrees Celsius, but that number is meaningless. Most models provide no warning: the output looks exactly like a valid prediction. (A GP's prediction variance does grow away from the training data, but only helps if someone checks it.) Detecting extrapolation requires tracking the training domain boundaries and flagging queries that fall outside them, a capability that must be built into the deployment infrastructure.
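A domain guard of this kind can be a thin wrapper around the prediction function. The following is a minimal sketch: the `GuardedSurrogate` class and the stand-in linear model are hypothetical, and a simple box check is assumed to be an adequate description of the training domain.

```python
import numpy as np


class GuardedSurrogate:
    """Wrap a predict function and refuse queries outside the training bounds."""

    def __init__(self, predict_fn, lower, upper):
        self.predict_fn = predict_fn
        self.lower = np.asarray(lower, dtype=float)
        self.upper = np.asarray(upper, dtype=float)

    def in_domain(self, x):
        x = np.asarray(x, dtype=float)
        return bool(np.all((x >= self.lower) & (x <= self.upper)))

    def predict(self, x):
        if not self.in_domain(x):
            raise ValueError(
                f"query {x} is outside the training domain "
                f"[{self.lower}, {self.upper}]"
            )
        return self.predict_fn(np.asarray(x, dtype=float))


# Stand-in model, hypothetically trained on 20 to 80 degrees Celsius.
surrogate = GuardedSurrogate(lambda x: 0.5 * x[0] + 3.0, lower=[20.0], upper=[80.0])

print(surrogate.predict([50.0]))
try:
    surrogate.predict([150.0])
except ValueError as err:
    print(f"rejected: {err}")
```

A box check is necessary but not sufficient in higher dimensions, where a query can sit inside the bounds and still be far from every training point.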

Discontinuities. Response surfaces with sharp transitions — buckling thresholds, phase changes, resonance frequencies — are poorly approximated by smooth surrogate models. A GP trained on data from both sides of a buckling threshold will produce a smoothed prediction that is wrong everywhere near the threshold. Either the discontinuity must be identified and handled explicitly (separate surrogates for each regime), or the model must be specifically designed to represent discontinuities (classification to identify the regime, regression within each regime).
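The difference between a single smooth surrogate and a regime-aware one can be shown on a toy response with a jump. Everything in this sketch is illustrative: the response function, the threshold at x = 5, and the use of polynomial fits are assumptions, and the threshold is taken as known rather than learned by a classifier.

```python
import numpy as np


def true_response(x):
    """Illustrative response with a jump at x = 5 (stand-in for a buckling threshold)."""
    return np.where(x < 5.0, 1.0 + 0.1 * x, 4.0 + 0.1 * x)


x_train = np.linspace(0.0, 10.0, 41)
y_train = true_response(x_train)

# One smooth surrogate fitted across both regimes.
global_fit = np.polyfit(x_train, y_train, 5)

# Regime-aware surrogate: assign the regime first, then regress within it.
threshold = 5.0  # assumed known here; in practice it has to be identified
low = x_train < threshold
fit_low = np.polyfit(x_train[low], y_train[low], 1)
fit_high = np.polyfit(x_train[~low], y_train[~low], 1)


def predict_by_regime(x):
    x = np.asarray(x, dtype=float)
    return np.where(x < threshold, np.polyval(fit_low, x), np.polyval(fit_high, x))


x_test = np.array([4.6, 4.9, 5.1, 5.4])  # points close to the threshold
global_error = np.max(np.abs(np.polyval(global_fit, x_test) - true_response(x_test)))
regime_error = np.max(np.abs(predict_by_regime(x_test) - true_response(x_test)))
print(f"max error near threshold, single surrogate: {global_error:.3f}")
print(f"max error near threshold, per-regime:       {regime_error:.3e}")
```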

Insufficient training data. If the training database is too small relative to the dimensionality and complexity of the response, the surrogate underfits — it cannot capture the true behavior. Adding model complexity (deeper networks, more flexible kernels) without adding training data leads to overfitting — the model memorizes the training points and generalizes poorly. The only reliable cure for insufficient data is more data, which means more simulation runs, which means more computational investment.

Multi-fidelity approaches address the data cost problem by combining expensive high-fidelity simulations with cheaper low-fidelity ones. The low-fidelity model (coarse mesh, simplified physics) is cheap enough to evaluate hundreds or thousands of times. The high-fidelity model is evaluated at a smaller set of carefully chosen points. The multi-fidelity surrogate learns the correction between low and high fidelity, leveraging the low-fidelity model's broad coverage and the high-fidelity model's accuracy. This can reduce the required number of expensive simulations by 50 to 80% — a significant cost saving for organizations building surrogates at scale.
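The "learn the correction" idea can be sketched with an additive discrepancy model. Both fidelity functions below are invented stand-ins, and the example is deliberately contrived so that the difference between them is exactly linear. Real discrepancies are rarely this simple, and the low-fidelity model would itself usually be a surrogate.

```python
import numpy as np


def high_fidelity(x):
    """Stand-in for the expensive simulation."""
    return np.sin(x) + 0.3 * x


def low_fidelity(x):
    """Stand-in for the cheap model, with a smooth systematic bias."""
    return np.sin(x) + 0.2 * x - 0.5


x_hf = np.linspace(0.0, 10.0, 4)  # only four expensive runs

# Fit the discrepancy (high minus low) at the expensive points.
delta_fit = np.polyfit(x_hf, high_fidelity(x_hf) - low_fidelity(x_hf), 1)


def multi_fidelity(x):
    """Cheap model plus the learned correction."""
    return low_fidelity(x) + np.polyval(delta_fit, x)


# Baseline: a surrogate fitted to the four high-fidelity runs alone.
hf_only_fit = np.polyfit(x_hf, high_fidelity(x_hf), 3)

x_test = np.linspace(0.0, 10.0, 101)
mf_error = np.max(np.abs(multi_fidelity(x_test) - high_fidelity(x_test)))
hf_only_error = np.max(np.abs(np.polyval(hf_only_fit, x_test) - high_fidelity(x_test)))
print(f"max error, multi-fidelity:     {mf_error:.3e}")
print(f"max error, high-fidelity only: {hf_only_error:.3f}")
```

The approach pays off when the discrepancy is smoother than the response itself, so that a few expensive runs are enough to pin it down.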

The Surrogate Model Workflow: Five Stages

Activity: Identify design variables, their ranges, and their interdependencies from the system model.

Key Decision: How many parameters to include and what ranges to cover. Too few misses important effects. Too many explodes the required training data.

MBSE Connection: Parameters and ranges come from the model: design variables, operating envelopes, material options. Traceability from surrogate scope to model scope.

Risk If Done Poorly: Surrogate trained on the wrong parameter space. Either missing important variables or covering irrelevant regions.

Walk through all five stages. Notice that each stage has a direct connection to the system model — the surrogate does not exist in isolation. It is built from model-defined parameters, trained on model-configured simulations, and deployed within model-governed workflows. When the model changes, the surrogate's validity must be reassessed.

Assessment

Question 1 of 3

A team builds a surrogate model using a 100-point Latin Hypercube DOE over 6 design parameters. The surrogate achieves an R-squared of 0.98 on 5-fold cross-validation. A new design variant adds a 7th parameter (a new material option) not present in the original DOE. What should the team do? (Select all correct approaches)

Implement a simplified surrogate model validation workflow in Python. Given a set of training and validation data points, train a simple surrogate (use a radial basis function interpolation or similar), compute key validation metrics (RMSE, max error, R-squared), and flag any validation point where the error exceeds a specified threshold. This simulates the validation part of Stage 5 of the surrogate workflow, the step that determines whether the surrogate is trustworthy.

```python
import numpy as np

# Training data: 20 simulation runs (1 input parameter for simplicity)
np.random.seed(42)
X_train = np.sort(np.random.uniform(0, 10, 20))
Y_train = np.sin(X_train) + 0.3 * X_train  # true response

# Validation data: 8 held-out simulation runs
X_val = np.array([0.5, 1.8, 3.2, 4.7, 5.5, 7.1, 8.3, 9.6])
Y_val = np.sin(X_val) + 0.3 * X_val  # true response

def train_surrogate(X_train, Y_train):
  """
  Train a simple surrogate model.
  Use numpy polyfit (degree 5) or scipy RBF interpolation.
  Return a function that predicts Y given X.
  """
  # YOUR CODE: fit a model to the training data
  # Return a callable predict function
  pass

def validate_surrogate(predict_fn, X_val, Y_val, error_threshold=0.1):
  """
  Evaluate the surrogate on held-out validation data.
  Return a dict with:
    - 'rmse': root mean squared error
    - 'max_error': maximum absolute error
    - 'r_squared': coefficient of determination
    - 'flagged_points': list of (x, predicted, actual, error)
      for points where absolute error exceeds error_threshold
  """
  # YOUR CODE: compute predictions, metrics, and flag bad points
  pass

# Train and validate
predict_fn = train_surrogate(X_train, Y_train)
results = validate_surrogate(predict_fn, X_val, Y_val, error_threshold=0.1)

# Report results
print(f"RMSE: {results['rmse']:.4f}")
print(f"Max Error: {results['max_error']:.4f}")
print(f"R-squared: {results['r_squared']:.4f}")
print(f"Flagged points ({len(results['flagged_points'])} exceed threshold):")
for x, pred, actual, err in results['flagged_points']:
  print(f"  x={x:.2f}: predicted={pred:.3f}, actual={actual:.3f}, error={err:.3f}")
```
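For reference after attempting the exercise, one possible completion is sketched below. It takes the degree-5 `np.polyfit` option named in the docstring; an RBF interpolant would be an equally valid choice and would give different numbers. The `degree` argument is an addition for illustration and is not part of the original function signature.

```python
import numpy as np

# Same data as the exercise.
np.random.seed(42)
X_train = np.sort(np.random.uniform(0, 10, 20))
Y_train = np.sin(X_train) + 0.3 * X_train
X_val = np.array([0.5, 1.8, 3.2, 4.7, 5.5, 7.1, 8.3, 9.6])
Y_val = np.sin(X_val) + 0.3 * X_val


def train_surrogate(X_train, Y_train, degree=5):
    """Fit a polynomial surrogate and return a callable predictor."""
    coeffs = np.polyfit(X_train, Y_train, degree)
    return lambda x: np.polyval(coeffs, x)


def validate_surrogate(predict_fn, X_val, Y_val, error_threshold=0.1):
    """Compute RMSE, max error, R-squared, and the points that exceed the threshold."""
    predicted = predict_fn(X_val)
    abs_error = np.abs(predicted - Y_val)
    ss_res = np.sum((Y_val - predicted) ** 2)
    ss_tot = np.sum((Y_val - np.mean(Y_val)) ** 2)
    flagged = [
        (float(x), float(p), float(a), float(e))
        for x, p, a, e in zip(X_val, predicted, Y_val, abs_error)
        if e > error_threshold
    ]
    return {
        "rmse": float(np.sqrt(np.mean(abs_error ** 2))),
        "max_error": float(np.max(abs_error)),
        "r_squared": float(1.0 - ss_res / ss_tot),
        "flagged_points": flagged,
    }


predict_fn = train_surrogate(X_train, Y_train)
results = validate_surrogate(predict_fn, X_val, Y_val, error_threshold=0.1)

print(f"RMSE: {results['rmse']:.4f}")
print(f"Max Error: {results['max_error']:.4f}")
print(f"R-squared: {results['r_squared']:.4f}")
print(f"Flagged points ({len(results['flagged_points'])} exceed threshold):")
for x, pred, actual, err in results["flagged_points"]:
    print(f"  x={x:.2f}: predicted={pred:.3f}, actual={actual:.3f}, error={err:.3f}")
```

Comparing the flagged points against the average metrics is the point of the exercise: a surrogate can score well on RMSE and R-squared and still miss the threshold at individual design points.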