Multi-fidelity Gaussian process surrogate modeling for regression problems in physics

Kislaya Ravi; Vladyslav Fediukov; Felix Dietrich; Tobias Neckel; Fabian Buse; Michael Bergmann; Hans-Joachim Bungartz

doi:10.1088/2632-2153/ad7ad5

1. Introduction

Complex simulations are often computationally expensive and time-consuming. A surrogate model is a simpler and faster model that emulates the output of a complex model as a function of input parameters. Surrogates, also known as digital twins, have various applications in design optimization [1, 2], uncertainty quantification [3, 4], real-time predictions [5, 6], etc. The amount of training data is one of the crucial factors governing the quality of the surrogate. Generating an extensive training data set is computationally infeasible for expensive simulations. In this work, we tackle the data availability limitation using Multi-fidelity [7].

The advancement of computational capabilities significantly increased the use of numerical simulation in almost every field of application. This led to the development of various simulation methods offering different levels of approximation quality and computational costs. Low-fidelity models provide estimations with reduced accuracy but demand fewer resources, whereas high-fidelity models yield precise predictions at a higher cost. These distinct fidelity levels arise from differences in physical modeling or simulation of the same phenomena, the linearization of complex physical processes, or the application of the same modeling approach with different discretization levels. The trade-off between simulation accuracy and computational cost is an ever-present challenge. Relying solely on high-fidelity models for applications that require multiple evaluations of the model is impractical due to their computational cost. Conversely, utilizing only low-fidelity models may compromise the results' accuracy. Multi-fidelity methods address this challenge by leveraging low-fidelity models to alleviate computational load and incorporating occasional high-fidelity evaluations to control result accuracy.

Multi-fidelity methods are widely used in applications like surrogate modeling, uncertainty quantification [8–11], bayesian inference [12–14], optimization [15–20], etc. They are helpful in scenarios where resource considerations and accurate estimations are important simultaneously.

This work focuses on using multi-fidelity models to build a Gaussian Process (GP) [21] surrogate. We use GP because of its probabilistic nature and additional functionalities like Bayesian optimization and uncertainty quantification. There are multiple ways to incorporate multi-fidelity in GP. The first method developed was the linear auto-regressive models [22, 23]. The linear modeling was not sufficient for more complex problems. This led to the development of non-linear auto-regression methods [24, 25]. However, the non-linear methods cannot be extended to more than two levels of fidelity and require low-fidelity evaluations during prediction. One can use Deep Gaussian Process (DGP) [26] to tackle the limitations.

This paper is organized as follows. We provide the basic theoretical background on GP, kernels, and calibration in section 2. Then, we review different multi-fidelity GP surrogate models in section 3. We also suggest modifications in existing non-linear autoregressive to incorporate more than two fidelities without DGP. In the same section, we suggest using delay terms in multi-fidelity DGP. Then, we compare the performance of different methods on the academic problem and two real-world problems, namely Terramechanical problems and Plasma microturbulence simulations in section 4. Finally, we end the paper with concluding remarks in section 5.

2. Background

2.1. Gaussian process regression

Gaussian processes (GP) are an efficient and flexible tool for probabilistic predictions [21]. They provide reliable statistical inference as well as uncertainty predictions. A GP is a stochastic process in which each finite subset of variables forms a multivariate Gaussian distribution. GPs are defined by their covariance (kernel) function k and define a probability distribution over functions,

$\begin{equation} f\left(\cdot , \cdot\right) \sim \mathcal{G}\mathcal{P}\left(0, k\left(\cdot , \cdot\right)\right).\end{equation} \tag{ 1 }$

The mean function is usually assumed to be zero for simplicity. In practice, the base space of the process is sampled with a finite number of N points $X = \left\lbrace x_i \right\rbrace_{i = 1}^N$ . Then, the GP is given as a finite-dimensional, multivariable Gaussian distribution of points y,

$\begin{equation} y \sim \mathcal{N}\left(\mu, K\right),\end{equation} \tag{ 2 }$

where µ is a vector of means, usually zeros, and $K = (K_{ij})_{i,j = 1}^N = (k(x_i,x_j))_{i,j = 1}^N$ is a covariance matrix of size N × N. Gaussian distributions can be conveniently employed in a Bayesian framework because they are closed under condition and conjugate distributions, which means marginal and conditional distributions of multivariate Gaussians are again Gaussian.

GP is typically used in regression, also called the 'Kriging method' [27]. In this case, each training point $x_i\in X$ is assigned an additional function value y_i, and the goal is to construct function values $y^{*}$ for unobserved test points $X^{*} = \{x^*\}$ . Then, we can construct a joint distribution

$\begin{equation} \begin{bmatrix}f\left(X\right) \\ f\left(X^*\right) \end{bmatrix} \sim \mathcal{N}\left( \mathbf{0},\left[ \begin{matrix}K\left(X,X\right) & K\left(X,X^*\right) \\ K\left(X^*, X\right) & K\left(X^*,X^*\right) \end{matrix}\right] \right).\end{equation} \tag{ 3 }$

After we observe the training data, the posterior consists of a narrowed distribution of functions, illustrated in figure 1(b), initially defined by the prior distribution as observed in figure 1(a). The inference is done then using posterior mean and posterior covariance defined as

Figure 1. Refer to the following caption and surrounding text. — **Figure 1.** A kernel function defines the family of the functions with which we are approximating the target function. In this example, we use squared-exponential or RBF kernel and visualize (a) a prior distribution over the functions. (b) In the posterior distribution in the second plot, newly observed training data points, depicted here in black, narrow the space of potential functions from the initial family, defined by the RBF kernel.
Download figure:
Standard image High-resolution image

$\begin{equation} \mathbb{E}\left(f\left(X^*\right)\right) = K\left(X,X^*\right)^{T}K\left(X,X\right)^{-1}f\left(X\right),\end{equation} \tag{ 4 }$

$\begin{equation} cov\left(f\left(X^*\right)\right) = K\left(X^*,X^*\right)- K\left(X,X^*\right)^{T}K\left(X,X\right)^{-1}K\left(X^*,X\right).\end{equation} \tag{ 5 }$

2.2. Kernels

K denotes a matrix built with the help of a kernel, which is a function used to define the covariances between the data points. This covariance is usually interpreted as a measure of similarity between data points, and thus, points x and x' that are considered similar in the input space will have target values y and y' that are also similar.

A valid covariance function must be positive semidefinite [21]. The product and sum of two valid covariance functions generate a valid covariance function. Using this property, one can generate more sophisticated covariance functions.

There are multiple options for covariance functions like squared-exponential, exponential, trigonometric, etc. The task of choosing the right kernel is complicated. There is no standardized algorithm to do so except for the random search and the researcher's intuition regarding modeling. However, some authors are making promising progress by creating interesting heuristics and algorithms [28–31]. In this work, we only work with squared-exponential covariance functions, because it yields smooth, i.e. mean-square differentiable, sample paths from the resulting GP, while others, like exponential kernel, do not [21].

After choosing the kernel family, we evaluate the hyperparameters of the kernel by maximizing the log-likelihood. The log-likelihood L as a function of observed values of data points Y, parameterized by the hyperparameters θ as

$\begin{equation} \log\left(p\left(y|X, \theta\right)\right) = -\frac{1}{2}y^{T}{K_y}^{-1}y - \frac{1}{2}\log|K_y| - \frac{n}{2}\log 2\pi.\end{equation} \tag{ 6 }$

We will use a gradient-based method (LBFGS [32] with multiple starting points) to find the maximum likelihood.

2.3. Calibration

As shown in equation (6), hyperparameter optimization for Gaussian process kernels is usually performed by maximizing the marginal likelihood. This often leads to an acceptable mean squared error [21], but the posterior variance tends to be lower than the variance of the actual distribution [33, 34]. This problem requires an additional post-processing step known as calibration. There are various calibration techniques, but in all cases, they aim to adjust the predicted variance towards the actual variance in the data. Calibration is an established concept in the classification context, but in the case of regression, it is relatively new [35]. In general, calibration can be split into three categories: quantile, distribution, and variance calibration.

The regression is quantile calibrated, if

$\begin{equation} \frac{1}{N}\sum^{N}_{n = 1} \mathbf{I} \left\{y_n \lt F^{-1}_n\left(p\right) \right\} \rightarrow p\,\,\textrm{for all } p \in \left[0, 1\right] \textrm{as } N \rightarrow \infty\end{equation} \tag{ 7 }$

for the set of data ${(x_n, y_n)}^{N}_{n = 1}$ , where F_n stands for predicted cumulative distribution at x_n and $F^{-1}_n$ the corresponding quantile function. Intuitively, $X\%$ of the confidence interval should capture the ground truth values in $X\%$ predictions [36, 37]. Quantile calibration is considered a global calibration, as it concentrates only on the marginal distribution without considering a distribution calibration around each exact prediction.

Distribution calibration resembles the calibration approach in classification tasks because it is also focused on local calibration. In the classification case, instances are grouped by their predicted probability and considered calibrated only if each group matches the indicated probability. Assume X as input variables, Y as a target variable, and the regression model $f: \mathbb{X} \rightarrow \mathbb{S}_{\mathbb{Y}}$ or intuitively, the regression model is mapping from input features into a space of distributions and for each test instance predicted a particular distribution. Denoting s as an arbitrary distribution predicted by the regression model, the regression model f is distribution calibrated if

$\begin{equation} p\left(Y = y|f_{Y|X} = s\right) = s\left(y\right) \textrm{for all } y \in Y, s \in S,\end{equation} \tag{ 8 }$

or intuitively, for all predicted instances with a particular distribution s, all instances on average with such prediction should follow this distribution s [36].

Variance calibration aims to make the model predict its own error, i.e. for all instances where variance $\sigma^2(x)$ takes a certain value ω, the squared predicted error $\mathbb{E}_{X,Y}[\mu(x)-y]$ will match this predicted variance [38]. Given $\mu(x)$ is a predicted mean and $\sigma^2(x)$ is a predicted variance, then the regression model is variance calibrated if

$\begin{equation} \mathbb{E}_{X, Y}\left[\left(\mu\left(x\right)-y\right)^2|\sigma^2\left(x\right) = \omega\right] = \omega.\end{equation} \tag{ 9 }$

The first method performs a global calibration, while the last two perform a local one. Global calibration is a better choice in case we need only to rescale the marginalized distribution, but in the case when variance should be rescaled only in certain places, it is better to use local calibration. Knowing a priori the required type of rescaling is hard for real-world data, so if we do not have any prior assumptions on the real variance, we must empirically find the most suited calibration method.

3. Methodology

3.1. Multi-fidelity

In this paper, our focus is on exploring various multi-fidelity Gaussian process surrogate modeling methods. In the upcoming sub-sections, we delve into the motivation behind each method, elucidate their formulation, and critically evaluate their limitations. Additionally, we propose enhancements for some methods within the corresponding sub-sections to address identified shortcomings.

3.1.1. Linear auto-regressive model

We start with the Auto-Regressive Model (AR1), a straightforward linear auto-regressive model introduced by Kennedy and O'Hagan [22]. In this model, we formulate the joint distribution of all fidelities, building on the fundamental concept of expressing the model for a particular fidelity as the sum of a linear scaling of the previous fidelity and an additive correction term.

We explain AR1 using a two-fidelity case. This process can be easily extended to cases with more than two fidelities. Let us consider the following random variables:

$\begin{align*} u_l \sim \mathcal{N}\left(0, K_{l}\right) & \Rightarrow \mathbb{E}\left[u_l\right] = 0 \quad \textrm{and} \quad\,\,\,\, \mathbb{E}\left[u_l u_l^{^{\prime}}\right] = K_{l}\left(X, X^{^{\prime}}\right), \\ u_{\delta} \sim \mathcal{N}\left(0, K_{\delta}\right) & \Rightarrow \mathbb{E}\left[u_{\delta}\right] = 0 \quad \textrm{and} \quad \mathbb{E}\left[u_{\delta} u_{\delta}^{^{\prime}}\right] = K_{\delta}\left(X, X^{^{\prime}}\right).\end{align*}$

We are modeling all the surrogates and the additive correction term using GP. u_l and u_δ represent the samples drawn from the low-fidelity and the additive correction GP. Let u_h represent a realization of the GP that approximates a high-fidelity function f_h. $\rho \in \mathbb{R}$ is a linear-scaling parameter. We assume an additive relationship between the consecutive fidelities so that

$\begin{equation} u_h = \rho u_l + u_{\delta}.\end{equation} \tag{ 10 }$

The expected value of the surrogate of the high-fidelity model is assumed to be zero,

$\begin{equation} \mathbb{E}\left[u_h\right] = \mathbb{E}\left[\rho u_l + u_{\delta}\right] = \rho\mathbb{E}\left[u_l\right] + \mathbb{E}\left[u_{\rho}\right] = 0.\end{equation} \tag{ 11 }$

The covariance of the surrogate of the high-fidelity model is then given by

$\begin{align*} Cov\left(u_h, u_h^{^{\prime}}\right) & = \mathbb{E}\left[u_h u_h^{^{\prime}}\right] - \underbrace{\mathbb{E}\left[u_h\right] \mathbb{E}\left[u_h^{^{\prime}}\right]}_{ = 0} \\ & = \mathbb{E}\left[\left(\rho u_l + u_{\delta}\right) \left(\rho u_l^{^{\prime}}\right) + u_{\delta}^{^{\prime}} \right] = \rho^2 \mathbb{E}\left[u_l u_l^{^{\prime}}\right] + \mathbb{E}\left[u_{\delta} u_{\delta}^{^{\prime}}\right] = \rho^2 K_l + K_{\delta}.\end{align*}$

The covariance between u_l and u_h is

$\begin{align*} Cov\left(u_h, u_l^{^{\prime}}\right) & = \mathbb{E}\left[u_h u_l^{^{\prime}}\right] - \underbrace{\mathbb{E}\left[u_h\right] \mathbb{E}\left[u_l^{^{\prime}}\right]}_{ = 0} = \mathbb{E}\left[\left(\rho u_l + u_{\delta}\right) u_l^{^{\prime}} \right] \\ & = \rho\mathbb{E}\left[u_l u_l^{^{\prime}}\right] + \mathbb{E}\left[u_{\delta} u_l^{^{\prime}}\right] = \rho K_l + \underbrace{\mathbb{E}\left[u_{\delta}\right] \mathbb{E}\left[u_l^{^{\prime}}\right]}_{ = 0} = \rho K_l.\end{align*}$

The joint distribution of samples drawn from the low-fidelity surrogate and the high-fidelity surrogate is written

$\begin{equation} \begin{bmatrix} u_h \cr u_l \end{bmatrix} \sim \mathcal{N} \left( \begin{bmatrix} 0 \cr 0 \end{bmatrix}, \begin{bmatrix} K_{\delta} + \rho^2 K_l & \rho K_l \cr \rho K_l & K_l \end{bmatrix} \right).\end{equation} \tag{ 12 }$

Using the maximum likelihood method, one can calculate the value of ρ and the kernel hyperparameters. We can also extend the joint distribution in equation (12) to incorporate more than two fidelities that will lead to covariance matrices with sparse blocks. Le Gratiet [23] took advantage of this block structure and decreased the complexity of the method from $\mathcal{O}((\sum_{l = 1}^{L} N_l)^3)$ to $\sum_{l = 1}^{L} \mathcal{O}( N_l^3)$ . He also improved the performance using polynomial terms for scaling (ρ) instead of a constant term.

Nevertheless, the efficacy of the Auto-Regressive Model (AR1) is limited by the assumption of a linear dependency between the low-fidelity and high-fidelity functions. This assumption might prove inadequate in scenarios characterized by non-linear transformations.

3.1.2. Non-linear auto-regressive models

Let us consider a two-fidelity case. A non-linear transformation on the low-fidelity function to obtain the high-fidelity function is written as

$\begin{equation} f_h\left(x\right) = g\left(x, f_l\left(x\right)\right),\end{equation} \tag{ 13 }$

where $g:\mathbb{R}^{d+1} \rightarrow \mathbb{R}$ is a function representing the non-linear transformation applied on the low-fidelity function to obtain the high-fidelity model. When attempting to model both the low-fidelity and high-fidelity functions using GPs, writing the joint distribution of the surrogates, as presented in equation (12), becomes analytically challenging for a general transformation g. To overcome this difficulty, a common assumption in studies such as [24, 25] is that the low-fidelity function can be evaluated at zero cost. This eliminates the need for a surrogate for the low-fidelity function. Consequently, the non-linear transformation depicted in equation (13) transforms into creating a surrogate in a d + 1 dimensional space. In our work, we delve into the modeling of g using a Gaussian Process (GP). This approach enables us to effectively navigate the challenges associated with the joint distribution of surrogates, providing a viable solution for incorporating non-linear transformations within the multi-fidelity surrogate modeling framework.

Let us consider the following pedagogical examples where the high-fidelity model f_h and the lower-fidelity model f_l are defined as

$\begin{align*} f_h\left(x\right) & = \left(\sqrt{2}-x\right) \sin^3\left( 8 \pi x \right), \\ f_l\left(x\right) & = \sin \left( 8 \pi x \right).\end{align*}$

Figure 2 illustrates the relationship between f_l and f_h. Modeling the transformation g as described in equation (13) is synonymous with learning the manifold depicted in figure 2. Crucially, this manifold lacks oscillations, signifying that learning it necessitates relatively fewer high-fidelity function evaluations. This observation underscores the advantage of multi-fidelity modeling in this example, as it allows for a learning process by leveraging the complicated feature to be captured in the low-fidelity function, thereby reducing the requirement of high-fidelity function evaluations.

Figure 2. Refer to the following caption and surrounding text. — **Figure 2.** It is difficult to approximate the function $f_{h}(x)$ (orange) using only x as an input, because it is highly oscillatory. Including the function $f_{l}(x)$ (blue) as an additional input means only the green, smooth manifold must be approximated, which can be done accurately with only a few data points.
Download figure:
Standard image High-resolution image

**Figure 2.** It is difficult to approximate the function $f_{h}(x)$ (orange) using only x as an input, because it is highly oscillatory. Including the function $f_{l}(x)$ (blue) as an additional input means only the green, smooth manifold must be approximated, which can be done accurately with only a few data points.
Download figure:
Standard image High-resolution image

A refinement of the method involves incorporating a well-structured kernel into the GP model, as suggested by Peridikaris et al [25]. The proposed kernel structure is given by:

$\begin{equation} k\left(x, x^{^{\prime}}\right) = k_{\rho}\left(x, x^{^{\prime}} ; \theta_{\rho}\right) k_f \left(f_l\left(x\right), f_l\left(x^{^{\prime}}\right); \theta_f\right) + k_{\delta}\left(x, x^{^{\prime}} ; \theta_{\delta}\right).\end{equation} \tag{ 14 }$

Here, θ_ρ, θ_f, and θ_δ denote the kernel hyperparameters, which are determined by maximizing the likelihood, as discussed in the previous section. This kernel structure closely mirrors the Auto-Regressive Model (AR1) formulation described in equation (10), assigning each kernel to an individual part of the transformation. Specifically, k_ρ models the scaling term, k_f is responsible for transforming the low-fidelity model, and k_δ handles the additive correction term. The formulation above gives the method its name: Non-linear Auto-Regressive Gaussian Process (NARGP).

Introducing this well-defined prior aids the model in accurately fitting the underlying relationships. A drawback of this formulation is the increase in the number of hyperparameters, which necessitates careful consideration during the optimization process. Despite this challenge, the structured nature of the kernel contributes to a more informed and effective surrogate modeling approach.

We follow Lee et al [24] and assume sufficient smoothness of the high-fidelity function f_h. We can then use Taylor expansion to approximate the value of f_h close to x by

$\begin{equation} f_h\left(x + \Delta x\right) = f_h\left(x\right) + \sum_{i = 1}^{d} \frac{\partial }{\partial x_i}f_h\left(x\right) \Delta x_i + \mathcal{O}\left(\Delta x^2\right).\end{equation} \tag{ 15 }$

Substituting equation (13) into equation (15), we obtain

$\begin{align*} f_h\left(x + \Delta x\right)& = g\left(x, f_l\left(x\right)\right) + \sum_{i = 1}^{d} \left(\frac{\partial g}{\partial x_i}g\left(x, f_l\left(x\right)\right) \right)\Delta x_i + \mathcal{O}\left(\Delta x^2\right) \\ & = g\left(x, f_l\left(x\right)\right) + \sum_{i = 1}^{d} \left(\frac{\partial g}{\partial x_i} + \frac{\partial g}{\partial f_l} \frac{\partial f_l}{\partial x_i} \right)\Delta x_i + \mathcal{O}\left(\Delta x^2\right).\end{align*}$

The Taylor expansion converges if

(i)
$\left\Vert\frac{\partial g(\cdots)}{\partial x} \right\Vert \unicode{x2A7D} c_g$ for $c_g \in \mathbb{R}^+$
(ii)
$\left\Vert \frac{\partial f_l(\cdots)}{\partial x} \right\Vert \unicode{x2A7D} c_l$ for $c_l \in \mathbb{R}^+$ .

The Taylor expansion also motivates us to incorporate the derivative of low-fidelity in equation (13). Lee et al [24] suggest writing the non-linear transformation as

$\begin{equation} f_h\left(x\right) = g_d \left(x, f_l\left(x\right), \frac{\partial f_l}{\partial x} \right).\end{equation} \tag{ 16 }$

In many legacy simulators, the gradients are unavailable [39]. In cases where gradients are unavailable, one can resort to finite difference techniques to approximate the derivative. Lee et al [24] suggest replacing the derivative in equation (16) with the delay term to accommodate such scenarios. This delay term evaluates the low-fidelity function after a small time step $\tau \in \mathbb{R}$ . For a higher-dimensional input variable, we add a delay along each dimension, defining τ_i as a vector of zeros with τ at the i^th position. The transformed formulation is now expressed as

$\begin{equation} f_h\left(x\right) = g_d \left(x, f_l\left(x\right), f_l\left(x+\tau_1\right), f_l\left(x+\tau_2\right), \ldots, f_l\left(x+\tau_d\right) \right).\end{equation} \tag{ 17 }$

This method is called the Gaussian Process with Delay Fusion (GPDF) proposed in [24]. Additionally, we propose an enhancement to this method by modifying the kernel to mimic the Taylor expansion using a structured composite kernel similar to equation (14)

$\begin{equation} k\left(x, x^{^{\prime}}\right) = k_{\rho}\left(x, x^{^{\prime}} ; \theta_{\rho}\right) k_f \left(\left(f_l\left(x\right), f_l\left(x+\tau\right)\right), \left(f_l\left(x^{^{\prime}}\right), f_l\left(x^{^{\prime}}+\tau\right)\right); \theta_f\right) + k_{\delta}\left(x, x^{^{\prime}} ; \theta_{\delta}\right).\end{equation} \tag{ 18 }$

This improved method is called the Gaussian Process with Delay Fusion and Composite kernel (GPDFC).

The formulations of NARGP, GPDF, and GPDFC come with certain limitations that need to be addressed for broader applicability:

(i)
Nested Evaluation Requirements:
- NARGP: Requires low-fidelity model evaluations during training and prediction at precisely the same parameters as the high-fidelity model.
- GPDF and GPDFC: Share similar constraints and involve additional low-fidelity function evaluations at delay points.
These limitations were initially disregarded under the assumption of infinitely cheap low-fidelity function evaluations. However, in practice, even low-fidelity function evaluations have nonzero cost, and so these constraints can also become significant. In some cases, we are simply given data collected from the low and high-fidelity models. Gathering any further data might not be possible, and for such cases, ensuring nested training data points may also not be possible. Moreover, the possible absence of low-fidelity data at the prediction parameter point also serves as a hurdle in implementing the methods mentioned in data-driven multi-fidelity problems [26].
(ii)
Limited to Two Fidelity Cases: Modern NARGP, GPDF, and GPDFC formulations are tailored for scenarios involving only two fidelity models, restricting their application in situations with more than two fidelities.
(iii)
Overfitting: As shown in [26], NARGP is prone to overfitting, which leads to poor generalization.

To overcome the limitation of application to a two-fidelity case and extend the framework to accommodate multiple fidelity levels, we can explore two scenarios:

Nested training data: In the case of nested training data, where ( $X_{\ell + 1}^{\textrm{train}} \subset X_{\ell}^{\textrm{train}}$ ), we propose training a Gaussian Process surrogate for the lowest fidelity and multi-fidelity Gaussian Process surrogates for higher levels. We train each multi-fidelity surrogate using the previous level, as shown in figure 3(a) for three fidelity NARGP surrogates. These layers are then cascaded, forming a stacked surrogate for multiple fidelities. During prediction, we draw samples from the surrogate of the previous level to obtain prediction samples for the next level and marginalize it to derive the posterior prediction for the given level. Let us represent $y_{\ell}$ as posterior prediction at parameter X. We can evaluate the posterior distribution of $y_{\ell+1}$ as
$\begin{equation} p\left(y_{\ell + 1}|X\right) = \int p\left(y_{\ell + 1}|y_{\ell},X\right) p\left(y_{\ell}|X\right) \mathrm{d}y_{\ell}.\end{equation} \tag{ 19 }$
One can write the marginalization step after cascading as
$\begin{eqnarray*} &&p\left(y_{\ell + 1}|X\right) = \int_{y_{\ell}} \int _{y_{\ell-1}} \cdots \int_{y_1} p\left(y_{\ell + 1}|y_{\ell},X\right)p\left(y_{\ell}|y_{\ell-1}X\right)\ldots p\left(y_2|y_1, X\right) p\left(y_1|X\right) \mathrm{d}y_{\ell} \mathrm{d}y_{\ell-1} \ldots \mathrm{d}y_1.\end{eqnarray*}$
We can perform this integration using the Monte Carlo method. The prediction steps for three fidelity stacked NARGP surrogates are shown in figure 3(b).Every layer in the stacked architecture is a GP. So, the samples drawn from GP will stay close to the posterior mean when the posterior predicted variance is small. If we ensure that there are enough data points in each layer such that the maximum of the posterior variance across the parameter space is smaller than a cut-off limit then we can use the posterior predicted mean of the previous layer as input for the next layer. In this way, we can bypass the Monte Carlo step. An efficient way to ensure small posterior variance is by using the adaptivity algorithm, which we will discuss later.
Non-nested training data: In situations with non-nested training data, an alternative is to employ the DGP.

Figure 3. Refer to the following caption and surrounding text. — **Figure 3.** Flowchart showing the training and prediction steps of NARGP surrogate for a three-fidelity scenario. f₁, f₂, and f₃ represent the function at each fidelity. g₁, g₂, and g₃ represent the surrogate of the corresponding fidelity. Training the NARGP model for a particular level involves evaluating the model of that fidelity level and the previous fidelity level. We draw samples to make predictions for a particular level, which involves drawing samples from the surrogate of the previous level and feeding them as input to the surrogate of that level. We can then use the samples to obtain the required statistical moments of the predictions by marginalizing the samples.
Download figure:
Standard image High-resolution image

3.1.3. Deep Gaussian processes

Drawing inspiration from the stacked architecture of multilayer neural networks, DGPs extend a single GP by composing multiple GP with each other. Each node within DGP represents a Gaussian process. The fundamental concept of deep architectures lies in their ability to model complex functions through combinations of simpler ones, with the output of one layer serving as the input for the next [40]. The primary advantage of DGPs over a single Gaussian Process (GP) lies in the distribution of feature learning across different layers. In a DGP, each layer is dedicated to learning distinct features of the model. This is in contrast to a single GP, where all features are consolidated into one, potentially leading to a complex and intricate model, especially when using traditional kernels.

DGP can effectively capture various aspects or representations of the underlying system by distributing the learning process across multiple layers. DGP emerges as a suitable candidate for modeling multi-fidelity scenarios. With each fidelity level, DGP gradually incorporates new features, enhancing the model's capabilities as it approaches high-fidelity representations. This progressive refinement allows DGP to effectively capture the intricacies of complex systems.

We model the transformation similar to equation (13) with one difference. We replace the actual function evaluation of the previous layer with the previous GP layer

$\begin{equation} f_{\ell}\left(X\right) = g_{\ell}\left(X, g_{\ell -1}\left(\cdots\right)\right).\end{equation} \tag{ 20 }$

The analytic form of the posterior of the GP surrogate $g_{\ell}$ is intractable because one of its parameters ( $g_{\ell -1}$ ) is sampled from a posterior distribution. To sample from the posterior of $g_{\ell}$ , we use variational inference for each layer of DGP [41]. We use $M_{\ell}$ inducing points to approximate each layer representing a fidelity level. Typically, the number of inducing points is significantly smaller than the number of training data points ( $M_{\ell} \ll N_{\ell}$ ). This decreases the computational requirements of GP training and prediction steps [42–44]. This approach is also known as the Sparse Variational Gaussian Process (SVGP) [45]. We discuss the modeling approach using DGP for multi-fidelity surrogate modeling in [26, 46].

Let us assume that we have a set of locations of the training parameters as

$\begin{align*}\mathbf{X} = \left\{X_{1}^{\textrm{train}}, X_{2}^{\textrm{train}},\ldots, X_{L}^{\textrm{train}}\right\}\end{align*}$

and the corresponding function evaluations

$\begin{align*}\mathbf{Y} = \left\{\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_L \right\} = \left\{f_1 \left( X_{1}^{\textrm{train}} \right), f_2 \left( X_{2}^{\textrm{train}} \right), \ldots, f_L \left( X_{L}^{\textrm{train}} \right) \right\}.\end{align*}$

Note that we do not impose the limitation that the location of the training parameters should be nested. Let $\mathbf{g}_{\ell}^{t}$ represent the samples drawn from the $\ell$ th layer at the training parameter $X_{t}^{\textrm{train}}$ . We assume that $M_{\ell}$ inducing points are sufficient to represent the distribution at layer $\ell$ . We represent the location of the inducing point by $\mathbf{Z}_{\ell}$ . Let $\mathbf{u}_{\ell}$ be the sample drawn from the $\ell$ th layer at the inducing points $\mathbf{Z}_{\ell}$ . We represent the kernel function and the mean function of the prior at layer $\ell$ using $k_{\ell}$ and $m_{\ell}$ respectively. The unnormalized posterior distribution at $\ell$ th layer is

$\begin{eqnarray} &&p\left(\mathbf{y}_{\ell}, \mathbf{g}_{\ell}^{\ell}, \mathbf{u}_{\ell}| \mathbf{X}_{\ell}, \mathbf{g}_{\ell-1}^{\ell},\mathbf{Z}_{\ell}\right) = \underbrace{p\left(\mathbf{y}_{\ell}|\mathbf{g}_{\ell}^{\ell}, \mathbf{X}_{\ell}\right)}_{\textrm{likelihood}} \underbrace{p\left(\mathbf{g}_{\ell}^{\ell}|\mathbf{u}_{\ell}, \mathbf{g}_{\ell-1}^{\ell}, \mathbf{X}_{\ell}, \mathbf{Z}_{\ell}\right) p\left(\mathbf{u}_{\ell}|\mathbf{Z}_{\ell}\right)}_{\textrm{DGP prior}}.\end{eqnarray} \tag{ 21 }$

Figure 4(a) shows the training process for a three-fidelity scenario. The calculation of posterior at any layer needs samples from the previous layer at the training points, which in turn requires further evaluation from previous layers. For example, the calculation of the posterior of the third layer needs to draw samples from the second layer at $X_{3}^{\textrm{train}}$ , which also requires samples from the first layer at $X_{3}^{\textrm{train}}$ . We represent this hidden calculation with a dashed arrow in figure 4(a).

Figure 4. Refer to the following caption and surrounding text. — **Figure 4.** Flowchart showing the training and prediction steps of Deep Gaussian process surrogate for three fidelity scenarios. f₁, f₂, and f₃ represent the function at each fidelity. g₁, g₂, and g₃ represent the surrogate of the corresponding fidelity. Training the DGP model for a particular level involves evaluating the model of that fidelity level and the surrogate of the previous fidelity level, which further involves evaluating the one-level lower surrogate. This creates a cascading effect until we reach the lowest fidelity surrogate. The prediction process is similar to the full GP, which involves drawing and marginalizing samples.
Download figure:
Standard image High-resolution image

One can combine all the layers by multiplying the posterior from individual layers from equation (21) to obtain the joint posterior distribution for all layer $p(\mathbf{y}, \{\mathbf{g}_{\ell}^{t}, u_{\ell} \})$ . Analytical inference from the posterior is intractable. To approximate the distribution, we assume that the prior $q(\mathbf{u}_{\ell})$ of u_l is a Gaussian distribution with mean $\tilde{\mathbf{m}}_{\ell}$ and variance $\tilde{\mathbf{S}}_{\ell}$ . Now, the approximated DGP prior is

$\begin{equation} q\left(\mathbf{g}_{\ell}^{\ell}, \mathbf{u}_{\ell}| \mathbf{X}_{\ell}, \mathbf{g}_{\ell-1}^{\ell},\mathbf{Z}_{\ell}\right) = p\left(\mathbf{g}_{\ell}^{\ell}|\mathbf{u}_{\ell}, \mathbf{g}_{\ell-1}^{\ell}, \mathbf{X}_{\ell}, \mathbf{Z}_{\ell}\right) q\left(\mathbf{u}_{\ell}\right).\end{equation} \tag{ 22 }$

If $\mathbf{u}_{\ell}$ is marginalized out of the equation (22), then the distribution becomes a Gaussian distribution with the following mean and variance:

$\begin{align} \mu_{\ell}\left(x\right) & = m_{\ell}\left(x\right) + \alpha\left(x\right)^T \left(\tilde{\mathbf{m}}_{\ell} - m_{\ell}\left(x\right)\right),\end{align} \tag{ 23 }$

$\begin{align} \Sigma_{\ell}\left(x, x^{^{\prime}}\right) & = k_{\ell}\left(x, x^{^{\prime}}\right) - \alpha\left(x\right)^T \left(k_{\ell}\left(\mathbf{Z}_{\ell}, \mathbf{Z}_{\ell}\right)- \tilde{\mathbf{S}}_{\ell}\right) \alpha\left(x^{^{\prime}}\right).\end{align} \tag{ 24 }$

We can use this before approximating the posterior at each layer and then extend this to all the layers of DGP to obtain the approximate posterior distribution $q(\mathbf{y}, \{\mathbf{g}_{\ell}^{\ell}, \mathbf{u}_{\ell} \}_{\ell = 1}^{L})$ . We want to approximate the exact posterior distribution in equation (21) by an approximate distribution $q(\mathbf{y}, \{\mathbf{g}_{\ell}^{\ell}, \mathbf{u}_{\ell} \}_{\ell = 1}^{L})$ . We train the DGP by minimizing the Kullback-Leibler (KL) divergence $\textrm{KL}[q||p]$ between the variational posterior distribution q and the true posterior distribution p. An exact evaluation of the KL divergence is not feasible. So, we convert the minimization of the KL divergence to the maximization of Evidence Lower Bound (ELBO) ( $\mathcal{L}$ ) whose evaluation is feasible. ELBO is evaluated as

$\begin{equation} \mathcal{L} = \sum_{\ell = 1}^{L} \mathbb{E}_{q(\mathbf{g}_{\ell}^{\ell})} [ \log \underbrace{p \left(\mathbf{y}_{\ell} | \mathbf{g}_{\ell}^{\ell} \right)}_{\textrm{likelihood}} ] - \textrm{KL}\left[q(\mathbf{u_{\ell}})||p(\mathbf{u}_{\ell}|\mathbf{Z}_{\ell}))\right].\end{equation} \tag{ 25 }$

One can refer to [46] for detailed proof. We optimize for the variational parameters ( $\tilde{\mathbf{m}}_{\ell}$ , $\tilde{\mathbf{S}}_{\ell}$ , $\mathbf{Z}_{\ell}$ ) and the hyperparameters of the kernels ( $k_{\ell}$ ) using the Adam optimization method.

The prediction process in DGPs follows a recursive transfer of results from the previous layer into the next layer. We draw samples at each layer from the approximate distribution defined by equation (22). We fuse the prediction parameter $X^*$ in intermediate layers with the sample drawn from the previous layer. We then use the fused parameter to draw a sample from that layer. We repeat this process until we have S samples for each layer. We use the collected samples to compute the posterior mean and variance through a Monte Carlo approximation. Figure 4(b) visually illustrates these prediction steps for three fidelity models, depicting the recursive transfer and sampling process at each layer.

One can improve the surrogate by choosing a more structured kernel. Cutajar et al [26] suggest using the kernel defined in equation (14). We call this method the Non-linear Auto-Regressive Deep Gaussian Process (NARDGP). We suggest using the delay term to further add useful information. We call the method Deep Gaussian Process with Delay Fusion (DGPDF). We further improve upon the suggested method using a structured kernel as mentioned in equation (18). We named the resulting method Deep Gaussian Process with Delay Fusion and Composite kernel (DGPDFC).

Methods discussed in sections 3.1.2 and 3.1.3 have similar ideas but different approaches to approximating the posterior distribution. In future discussions, we will sometimes mention the methods in section 3.1.2 as full GP to differentiate it from DGP because they do not involve sparse variational approximation of the posterior.

3.2. Adaptive model improvement

In many cases, the available training data may not provide sufficient information to accurately model predictions, resulting in poor performance. One approach to enhance the model is adding more data points to the training set. This comes with the added cost of evaluating the corresponding function at the newly added training data locations. Therefore, judiciously choosing where to add points is crucial. The goal is to ensure that we capture the essential features of the target functions using as few training data points as possible.

In this section, we discuss an adaptive algorithm [47]. It is designed to enhance training datasets applicable to any of the methods mentioned earlier, all of which utilize combinations of Gaussian processes. This adaptive algorithm improves the overall surrogate prediction by focusing on one Gaussian process at a time. Suppose we are training a GP using n data points $\mathcal{D}_n = \{X_1, X_2, \ldots, X_n \}$ . Let $y^{n+1}$ be the value of a sample from GP at $X^{n+1}$ . We select the next point $X_{n+1}$ where we gain maximum information which is defined as follows

$\begin{equation} \mathcal{I}\left(X_{n+1}|\mathcal{D}_n\right) = H\left(y^*|\mathcal{D}_n\right) - H\left(y^*|\mathcal{D}_n \cup X_{n+1}\right),\end{equation} \tag{ 26 }$

where $H(y^{n+1}|\mathcal{D}_n)$ is the conditional entropy. We suggest a greedy algorithm to selecting the next point $X_{n+1}$ by solving the optimization problem

$\begin{equation} X_{n+1} = \mathop{\mathrm{argmax}}_{X_{n+1}} \mathcal{I}\left(X_{n+1}|\mathcal{D}_n\right).\end{equation} \tag{ 27 }$

Upon simplification, it can be shown that for GPs, the gain in information is directly proportional to the logarithm of the posterior predicted standard deviation. Thus, the optimization problem stated in equation (27) can be converted to

$\begin{equation} X_{n+1} = \mathop{\mathrm{argmax}}_{X^*} \sigma_p\left(X^*|\mathcal{D}_n\right),\end{equation} \tag{ 28 }$

where σ_p is the predicted standard deviation. We can keep adding new points for a particular fidelity level until the maximum posterior standard deviation is below a certain specified threshold level. After that, we can move to the next level until we cover all the fidelity levels.

There are certain drawbacks to this method. Firstly, this method adds one point at a time, limiting its overall speed. Secondly, the suggested points are always towards the boundary in high-dimensional space.

Various other approaches are proposed in the literature depending on alternative metrics to iteratively enhance the Gaussian Process surrogate [48]. One can improve the posterior standard deviation adaptivity metric by adjusting it via some error information as discussed in [49]. However, this requires function evaluation at additional validation points, making it infeasible for computationally expensive high-fidelity functions. Query-by-committee (QBC) [50, 51] is also computationally infeasible due to the need to train multiple models. Gradient-based adaptive approach [52] utilizes gradient information, which may be unavailable in many cases especially involving legacy solvers. Moreover, gradient approximation using the finite difference method is numerically unstable and computationally expensive for high-fidelity models [53]. There are still many other adaptivity algorithms [48, 54, 55]. We will leave the discussion and the effects of those algorithms on multi-fidelity GP as future works.

4. Numerical examples

In this segment, we evaluate the surrogate modeling techniques mentioned in section 3 using standard academic examples and two real-world problems, each presenting its distinct challenges. Subsequently, we conduct a comparative and critical analysis of the strengths and limitations of each approach in varied scenarios. The method outlined in section 3 was implemented in Python programming language using GPflow library [56]. This implementation is open source and can be accessed on GitHub⁵.

4.1. Academic examples

Before applying the problem to real-world scenarios, we test different multi-fidelity surrogate modeling methods on some academic problems shown in table 1 and visualized in figure 5, each posing a different modeling challenge. Researchers addressed the benchmarking of the surrogate models and there exists a variety of them [57]. Still, our low-fidelity function for all the test cases is the same because we want to test the capability of different methods to learn the relationship between low and high-fidelity models. We deal with two fidelity cases and train the multi-fidelity GP surrogate using fifty low-fidelity and eight high-fidelity function evaluations. The high number of low-fidelity train data ensures low predictive variance for the low-fidelity Gaussian process. We use twenty-five and eight induction points in each level for all the cases involving DGPs. Additionally, we also train a GP on high-fidelity data without multi-fidelity features. We then compare the multi-fidelity methods to the single-fidelity GP surrogate.

Figure 5. Refer to the following caption and surrounding text. — **Figure 5.** Visualization of different academic problems. The low and high-fidelity functions are visualized in each sub-plot by orange and blue curves, respectively.
Download figure:
Standard image High-resolution image

Table 1. Benchmark problems for the academic examples.

Benchmark name	Low-fidelity function (f_l)	High-fidelity function (f_h)
Linear transformation		$0.8 \sin{\left(8 \pi x\right)} + 0.3 \sin{\left(2 \pi x\right)}$
Non-linear transformation	$\sin{\left(8 \pi x\right)}$	$\sin^2{\left(8 \pi x\right)}$
Phase-shifted oscillation		$\sin^2{\left(8 \pi x + \frac{\pi}{10} \right)} + \cos{\left(4 \pi x \right)}$

After this initial training, we run the adaptivity algorithm for ten steps to add ten additional high-fidelity data points. To ensure the robustness of the adaptivity algorithm, we perform multiple runs with distinct seed values. We take the average of the mean square error (MSE) and plot it against the corresponding number of high-fidelity function evaluations shown in figure 6. The resulting plot visualizes the rate at which the respective algorithm convergences to the target high-fidelity curve. We also summarize the average MSE in table 2.

Figure 6. Refer to the following caption and surrounding text. — **Figure 6.** Average MSE evolution with respect to the number of high-fidelity function evaluations for different academic problems discussed in section 4.1.
Download figure:
Standard image High-resolution image

Table 2. Average value of MSE of each method in three different transformation scenarios, namely linear scaling, non-linear, and phase-shifted transformation as shown in table 1, w.r.t. number of high-fidelity data points used in training. The method with minimum MSE is shown in bold.

	Type of transformation
	Linear scaling		Non-linear transformation		Phase-shifted oscillation
MF method	8 HF points	18 HF points	8 HF points	18 HF points	8 HF points	18 HF points
GP	$2.36 \times 10^{-1}$	$5.85 \times 10^{-6}$	$2.45 \times 10^{-1}$	$4.79 \times 10^{-2}$	$7.32 \times 10^{-1}$	$1.37 \times 10^{-1}$
AR1	$\mathbf{4.46 \times 10^{-5}}$	$\mathbf{5.24 \times 10^{-8}}$	$3.73 \times 10^{-1}$	$2.71 \times 10^{-1}$	$2.25 \times 10^{-1}$	$\mathbf{1.14 \times 10^{-6}}$
NARGP	$2.57 \times 10^{-3}$	$3.02 \times 10^{-6}$	$\mathbf{6.57 \times 10^{-6}}$	$2.00 \times 10^{-6}$	$3.62 \times 10^{-1}$	$2.19 \times 10^{-1}$
GPDF	$9.15 \times 10^{-3}$	$3.11 \times 10^{-6}$	$2.04 \times 10^{-2}$	$\mathbf{1.68 \times 10^{-6}}$	$2.50 \times 10^{-1}$	$6.99 \times 10^{-6}$
GPDFC	$9.15 \times 10^{-3}$	$3.11 \times 10^{-6}$	$2.04 \times 10^{-2}$	$\mathbf{1.68 \times 10^{-6}}$	$2.50 \times 10^{-1}$	$6.99 \times 10^{-6}$
NARDGP	$7.63 \times 10^{-4}$	$2.88 \times 10^{-5}$	$3.34 \times 10^{-2}$	$1.14 \times 10^{-4}$	$4.84 \times 10^{-2}$	$7.16 \times 10^{-2}$
DGPDF	$1.09 \times 10^{-1}$	$2.41 \times 10^{-2}$	$8.81 \times 10^{-4}$	$1.01 \times 10^{-4}$	$17.0 \times 10^{-1}$	$8.13 \times 10^{-1}$
DGPDFC	$9.99 \times 10^{-4}$	$2.52 \times 10^{-5}$	$1.00 \times 10^{-1}$	$1.37 \times 10^{-4}$	$\mathbf{3.53 \times 10^{-2}}$	$2.52 \times 10^{-4}$

We can observe from table 2 that across all scenarios, the multi-fidelity methods outperform the single-fidelity GP by a considerable margin. This is because the multi-fidelity methods take advantage of the additional information the low-fidelity models provide. This shows the advantage of using multi-fidelity GP over single-fidelity GP. One can also conclude from table 2 that methods involving the DGP have higher MSE as compared to the corresponding non-linear autoregressive methods with full GP, specifically when the number of high-fidelity function evaluations is increased. This is because DGP involves the approximation of the posterior distribution using some finite induction points.

We observe from table 2 that the AR1 surrogate demonstrates the lowest MSE compared to the other methods in a linear transformation problem. The high-fidelity function of the linear scaling problem is written as constant scaling of the low-fidelity function with a non-linear additive term. This aligns with the assumed formulation of AR1 as described in equation (10), giving AR1 an edge over the other methods. Every other non-linear method except DGPDF also has a low MSE compared to the single-fidelity GP. The non-linear transformation methods involve expanding the surrogate dimensionality by introducing additional dimensions for low-fidelity function evaluations and/or delay terms. The assumed non-linear transformation of the low-fidelity term and the additional dimension of the delay term do not contribute meaningful information to the surrogate model in the case of the linear scaling problem. The additional kernel hyperparameters and the increase in dimension might require more evaluation points to represent the function. Therefore, non-linear auto-regressive methods underperform when compared to the AR1 method.

Table 2 shows that the AR1 surrogate cannot accurately capture non-linear transformations as expected because of the underlying linear formulation. In contrast, other methods tailored for non-linear transformations exhibit effective predictions with low MSE.

In the next example, we discuss the case where the high-fidelity function (f_h) and the low-fidelity function (f_l) are oscillating functions with phase differences. We observe such cases in our brain where neurons oscillate with certain frequencies where one uses the Hodgekin-Huxley model of them as a surrogate for real data [58].

The high-fidelity function can be expressed as a combination of sine and cosine terms, with the cosine term representing the derivative of the low-fidelity function. Consequently, methods incorporating delay terms are anticipated to outperform others

$\begin{align*} f_h\left(x\right) & = \sin{\left(8 \pi x\right)} \cos{\left( \frac{\pi}{10}\right)} + \cos{\left(8 \pi x\right)} \sin{\left( \frac{\pi}{10}\right)} + \cos{\left( 4 \pi x \right)} \\ & = \cos{\left( \frac{\pi}{10}\right)} f_l\left(x\right) + \sin{\left( \frac{\pi}{10}\right)} \frac{d f_l}{dx} + \cos{\left( 4 \pi x \right)}.\end{align*}$

NARGP and NARDGP exhibited less accurate fitting to the target high-fidelity curve, as their formulations lack derivatives directly applicable to this example. Conversely, methods incorporating delay terms in surrogate modeling demonstrated accurate predictions by mimicking the derivative of the low-fidelity function. Surprisingly, AR1 also predicts accurately because the phase-shifted oscillation gets simplified into a linear scaling problem which is suitable for AR1.

DGPDF specifically underperforms for linear-scaling problems and phase-shifted oscillations even when other non-linear autoregressive methods had low MSE. Only one kernel is responsible for capturing all the features of the target high-fidelity function. In other non-linear autoregressive methods, the kernel has a well-defined structure. This underscores the significance of a well-defined kernel structure specifically for DGP.

The three presented cases with different transformations underscore the variability in the performance of different methods across diverse scenarios. This observation emphasizes the necessity of a judicious selection of methods based on the specific characteristics of the modeled system. Prior knowledge about the system can significantly contribute to choosing the most suitable methodology.

As indicated in section 3, we can extend the surrogate modeling to encompass more than two fidelities. As an example, we consider a three-fidelity case defined as

$\begin{align} f_1\left(x\right) & = \sin \left(8 \pi x \right) \nonumber \\ f_2\left(x\right) & = \sin^2 \left(8 \pi x \right)\end{align} \tag{ 29 }$

$\begin{align} f_3\left(x\right) & = \sin^2 \left(8 \pi x \right) + x^2. \nonumber\end{align}$

Figure 7(a) shows the target function for the three fidelity cases as described in equation (29). The transition from the first fidelity to the second fidelity is non-linear, necessitating the use of a non-linear method to construct a surrogate. In this example, we present the results achieved using NARGP. We begin with forty, twelve, and eight randomly sampled points for each fidelity level and then execute the adaptivity algorithm for five steps. We have already observed from the results of two-fidelity cases that the posterior variance for the second fidelity under the same scenario using the NARGP model is very small. Consequently, we can utilize the posterior mean of the surrogate of the second level as the input for the third layer. Figure 7(b) shows that the surrogate aligns closely with the target function. Analogous to the two-fidelity cases, familiarity with the physical model will guide our selection of the appropriate method.

Figure 7. Refer to the following caption and surrounding text. — **Figure 7.** An example of multi-fidelity gaussian process surrogates for three fidelity case. We use NARGP to create surrogate because the transformation from first to the second fidelity is non-linear.
Download figure:
Standard image High-resolution image

4.2. Terramechanical example

Terramechanics explores the interaction of wheels with the underlying soil [59]. There are no exact physical formulas that describe the process of the wheel-soil interaction because of the non-linearity of the pressure-sinkage relations in the soft soil [60]. Therefore, to simulate wheel movement on soft sands and to estimate forces and torques acting on that wheel, we exploit a wide range of numerical simulations, each one with a different level of discretization, assumptions, accuracy, and computational time. This makes terramechanics a good example of multi-fidelity modeling because different numerical models are dedicated to simulating the same physical process, but with different levels of fidelity.

As data sources for lower-fidelity models for this experiment, we used two types of models, both of which were developed at the Terramechanical lab at German Aerospace Center (DLR): TerRA [62] and SCM [60]. As a high-fidelity data source, DLR's testbed TROLL was used [63]. This testbed was initially created to test and validate simulations for the wheel-soil interactions for extraterrestrial rover projects like MMX [64] and Scout [65]. The 44 high-fidelity runs from TROLL were also replicated in the simulation environment to achieve both higher- and lower-fidelity observations on the same set of inputs⁶. These runs included a variety of different steering and titling angles, different rotation and horizontal velocities, and different movement trajectories and accelerations. Movement scenarios for dataset creation were described and justified in the previous paper [61]. The scheme of the overall setup of experiments is depicted in figure 8(a).

Figure 8. Refer to the following caption and surrounding text. — **Figure 8.** Different data generators were simulating the wheel-soil interaction for the same scenarios, thus creating an ideal case for multi-fidelity modeling. Reproduced from permission from [61]. © International Society for Terrain-Vehicle Systems.
Download figure:
Standard image High-resolution image

Lower fidelity simulations are iterative simulations with each next step dependent on the state of the previous step, and the output of the simulation depends on several technical variables, which do not have a clear physical meaning and were not included as the input features in the dataset. The Gaussian process surrogate takes only a subset of the input variables of the simulation into account. Surrogate GPs were trained to approximate the outputs of TerRA and SCM simulations, and 100 runs were used in both cases. Each run was 20 s long and with a sampling rate of 3 points per second. Runs were conducted with randomly selected surfaces, random sequences of acceleration and deceleration and random sequences of steering commands. The randomness of a surface is introduced by varying its profile. The profile for each run has 6 breaking points, where after each point the change is introduced. Changes are normally distributed with a mean of 0.1 m and a variance of 0.6 m and are gradually introduced at each breakpoint. Velocities of the wheel simulation are also distributed normally, with a mean of 1 m s⁻¹ and variance of 2 m s⁻¹. Steering angle is distributed normally, with 0 mean and 0.5 radians variance. Both velocities and steering angle were changing each second of the simulation.

We split high-fidelity runs into 29 runs used for training purposes and 15 reserved for the test dataset. We varied sampling frequency from the high-fidelity data, to verify how performance changes with changes in the amount of high-fidelity data. Harvesting high-fidelity data is challenging and we want to minimize the damage to the model's performance, therefore we want to analyze the degradation of the prediction accuracy with the decrease in the number of training points. Each next subset is nested into the subset with a bigger number of training points. This logic assures the consistency of training points in all down-sampling examples.

All signals were smoothed with a low-pass filter, as in reality spikes of traction force are considered to be noise because only constant application of a force can change the wheel's movement. The data consists of 12-dimensional input and the objective function is a traction force acting on the wheel. A description of the variables is detailed in table 3. A visualization of the wheel and features describing it is depicted in figure 8(b).

Table 3. Description of the variables from the terramechanical dataset.

Variable	Specification	Description
$\vec{r}_w$	3D Input	Wheel position w.r.t the surface
$\vec{v}_w$	3D Input	Translational velocity of the wheel
$\vec{\omega}_w$	3D Input	Angular velocity of the wheel
$\vec{n}_g$	3D Input	Gravity acting on the wheel
f	1D Output	Traction force acting on the wheel

We tested all MF models discussed earlier on the terramechanical data. MF models shown here were trained with both SCM and TerRA surrogates being the lower-fidelity function and compared with each other. Table 4 shows the performance of different MF methods w.r.t. the number of high-fidelity data points used during the modeling and numerical simulation which generated the data for the lower-fidelity level. The performance of the MF models with the TerRA as a lower-fidelity data source shows marginally worse results than with SCM, however still very compatible. This is a positive result, given that the TerRA model is much simpler in implementation and less expensive in exploitation. Changing the number of high-fidelity points can give a hint of the lower acceptable bar when we construct the MF model.

Table 4. Normalized MSE of each MF method for wheel-soil locomotion simulation, w.r.t. number of high-fidelity data points used in training. Simulations using SCM were used as a lower-fidelity layer. The normalized MSE of conventional numerical simulation models is presented for comparison. GP trained only on the high-fidelity points shows good results when we have a lot of data points, but its performance deteriorates with a decrease in the training points. While the MF method, especially NARDGP, can show good results with fewer high-fidelity points and has more consistent performance. The method with minimum MSE is shown in bold.

	Number of HF points
	2085		1042		521		208		104
	Numerical simulation used as a lower fidelity layer
Method Name	TerRA	SCM	TerRA	SCM	TerRA	SCM	TerRA	SCM	TerRA	SCM
Single fidelity GP	0.149		0.239		0.331		0.32		0.756
AR1	0.235	0.235	0.235	0.235	0.235	0.235	0.235	0.235	0.235	0.235
NARGP	0.184	0.211	0.2	0.23	0.218	0.726	0.251	0.256	0.724	0.726
GPDF	0.41	0.207	0.327	0.231	0.3	0.218	0.25	0.233	0.724	0.724
GPDFC	0.41	0.207	0.327	0.231	0.3	0.218	0.25	0.233	0.724	0.724
NARDGP	0.116	0.122	0.12	0.118	0.123	0.12	0.118	0.117	0.121	0.125
DGPDF	0.726	0.726	0.726	0.726	0.726	0.726	0.726	0.726	0.726	0.726
DGPDFC	0.217	0.22	0.249	0.233	0.224	0.223	0.226	0.23	0.22	0.235

Conventional numerical simulations
TerRA	799.92
SCM	0.209

Models comparison with different subsets is illustrated in table 4. As we can see there, the multi-fidelity approach can handle the drastic decrease in the number of used high-fidelity points, while the single-fidelity model's performance decreases, as expected. Several examples of best-performing methods, NARDGP, are shown in figure 9.

Figure 9. Refer to the following caption and surrounding text. — **Figure 9.** Visual comparison of the best performing MF ML model, NARDGP, on several test runs. It is compared against the ground truth from the TROLL sensors, single-fidelity model and the most popular numerical simulation SCM. The overall performance is reasonably better compared to other methods, but the model is visibly over-confident in its uncertainty estimation.
Download figure:
Standard image High-resolution image

Empirically, there is no advantage in building more than two layers of fidelity for this regression task, as SCM simulation covers all the abilities of TerRA simulation and gives much more reliable results [60]. We conducted a comparison of the two-level NARDGP model performance where SCM data serves as a lower-fidelity and TROLL as the high-fidelity data source and three-level NARDGP, with new lower-fidelity level from TerRA data, so that SCM simulations become medium-level. For both cases, we used 43 high-fidelity data points from TROLL experiments, as we are first of all interested in the model's performance with as few high-fidelity points as possible, and 1500 data points from SCM simulation. For the three-level model, we sampled 3000 data points from the TerRA simulation. Two of the examples are depicted in figure 10 and they show a very limited difference between the performance of the two approaches. Given the limited scale of improvements we did not conduct a full analysis as in table 4, but we will leave this for further work.

Figure 10. Refer to the following caption and surrounding text. — **Figure 10.** Visual comparison of the two- and three-level NARDGP surrogate performance against the test runs from the high-fidelity TROLL experiments. The results are not identical, but the performance increase is not promising, therefore we put aside proper testing of three-level MF ML surrogates for terramechanical data.
Download figure:
Standard image High-resolution image

One of the main advantages of MF for complex systems modeling is that it can decrease computational and time costs without decreasing prediction accuracy. MF ML model outperforms both conventional numerical simulations TerRA and SCM simultaneously by MSE and by the runtime. Figure 11 shows 44 runs for both the MF ML model and two baseline numerical simulations (TerRA and SCM), distributed by their MSE and runtime. As we can see, the performance of the MF ML model is indeed faster and more accurate than the conventional terramechanical models.

Figure 11. Refer to the following caption and surrounding text. — **Figure 11.** Mean squared error comparison of predictions from the conventional numerical simulation methods and MF ML model. The multi-fidelity approach is computationally efficient while maintaining a high level of precision.
Download figure:
Standard image High-resolution image

Another advantage of the Gaussian process is the implicit handling of uncertainties. Due to its Bayesian nature, we can propagate uncertainties easily using the equation (5). It is an important part of predicting complex simulation systems, both for operational and research purposes. This adds more depth to the understanding and explainability of predictions and makes ML black boxes more transparent.

As we can see in figure 9, the main disadvantage of the NARDGP multi-fidelity prediction is that confidence intervals are too narrow, as discussed in section 2.3. This is a critical flaw when it comes to deploying ML models in practice, especially for highly sensitive operations like extraterrestrial rovers, where the cost of error is very high due to the inability to quickly compensate for the wrong move or repair the rover. To solve this problem, we applied several calibration methods, mentioned in section 2.3. Isotonic regression is an algorithm enabling quantile calibration, inspired by a Platt scaling algorithm from classification [37]. Beta calibration was chosen as an algorithm for achieving distribution calibration [36]. Normal calibration [35] introduces a scaling parameter for the predicted variance, preserving a Gaussian nature of the posterior prediction. A comparison of all calibration techniques on all of the test runs can be seen in table 5. As an example, we took NARDGP trained with 41 high-fidelity data points and calibrated it with the calibration mentioned above techniques. From this table, we can see that isotonic regression performs the best calibration, despite that it is only a global calibration of the marginal distribution. This means that the NARDGP model was constantly making overconfident predictions and global scaling of the variance solves the problem. These findings are consistent with the previous study, researching Gaussian process regression calibration for terramechanical data, but conducted only in the context of single-level medium-fidelity SCM data [68]. An example of a calibration on one test run could be seen in figure 12, where prediction became less overconfident and 95% internal captures almost the entire actual observed traction force signal.

Figure 12. Refer to the following caption and surrounding text. — **Figure 12.** Original uncalibrated prediction is visibly overconfident because the ground truth traction force from TROLL quite often finds itself outside even the 95% confidence interval. By applying Isotonic regression, we achieve quantile calibrated prediction and here predicted confidence intervals capturing ground truth much better.
Download figure:
Standard image High-resolution image

Table 5. Mean pinball loss measures the errors of over- and under-confidence for each quantile individually [66]. Negative loglikelihood (NLL) measures how likely observed data was generated by the model [36]. Expected normalized calibration error (ENCE) orders and splits instances by variance into bins and calculates the scaled difference between the predicted error and variance in each bin [38]. Mean predictive interval width (MPIW) measures the sharpness of the prediction and is defined for each confidence interval α as a width of this interval [67]. Normal and isotonic calibration have similar results almost by every metric, but normal calibration has a significantly worse sharpness metric, which leaves us only with Isotonic regression. The method with minimum error is shown in bold.

	Calibration metrics
Calibration method	Pinball	NLL	ENCE	MPIW
Uncalibrated	0.696	6.648	2.232	0.093
Isotonic regression	0.424	1.932	0.237	0.422
Normal calibration	0.424	2.229	0.21	175.675
Beta calibration	0.484	2.74	0.62	0.187

4.3. Plasma-physics example

Harnessing energy from plasma fusion promises to offer a clean alternative energy source. To achieve this goal, creating a self-sustained burning plasma is essential. Achieving a sustained burning plasma remains elusive due to physical and technological challenges. One prominent physical hurdle is the occurrence of small-scale fluctuations in confined plasma, which lead to energy loss. These small-scale fluctuations are referred to as microturbulence. Mitigating energy losses caused by microturbulence is a significant challenge within the plasma physics community.

This work simulates plasma fusion using the Automated System for TRansport Analysis (ASTRA) [69, 70] modeling suite. ASTRA solves four 1-dimensional transport equations for electron density (n_e), electron temperature (T_i), ion temperature (T_i) and poloidal flux (Ψ). Interested readers can refer to [71] for details. It uses different turbulent solvers to predict the density and temperature profiles given some initial conditions (initial profiles of T_i, T_e, and n_e) and the last-closed-flux surface. The calculation of the turbulent flux is one of the computationally expensive parts of the code. One can couple ASTRA with different turbulent solvers. In this work, we will use QuaLiKiz (QLK) [72] and QuaLiKiz Neural Network (QLKNN) [73, 74] as the two subroutines. QLK is a quasi-linear turbulence solver for the linearized gyrokinetic Vlasov equation. QLKNN is a neural network surrogate trained on a 10-dimensional Latin hypercube for a quick flux evaluation [74]. ASTRA simulation with QLK is the high-fidelity model, whereas ASTRA with QLKNN is the low-fidelity model.

We simulate a plasma discharge until it reaches a steady state. At the steady state, the heat and particle fluxes match the integrated sources at all the radial locations. In this work we simulate shot #34954 until it reaches a steady state. We train the surrogate on a six-dimensional space mentioned in table 6. We create six surrogates for six different quantities, namely, steady-state T_i, T_e, and n_e, and the corresponding negative derivative of the logarithm of each quantity ( $-\nabla T_i/ T_i$ , $-\nabla T_e/ T_e$ , and $-\nabla n_e/n_e$ ). We normalize each quantity before training the surrogate. We use 40 high-fidelity and 400 low-fidelity training points.

Table 6. List of parameters for plasma microturbulence surrogate modeling.

Parameter Name	Range
Radial distance (ρ_r)	$[0.1, 0.8]$
Initial T_i scaling	$[0.9, 1.1]$
Initial T_e scaling	$[0.9, 1.1]$
Initial n_e scaling	$[0.9, 1.1]$
Toroidal velocity scaling	$[0.9, 1.1]$
Safety factor scaling	$[0.9, 1.1]$

Figure 13 shows the simulation results for QLK (high-fidelity model) and QLKNN (low-fidelity model) when parameters other than radial distance (ρ_r) are set as 1. The figure also shows the predicted results for multi-fidelity surrogates using NARGP. Table 7 shows the spatial average of mean square error for each surrogate.

Figure 13. Refer to the following caption and surrounding text. — **Figure 13.** Steady state solution for different quantities simulated for the shot #34954. Parameters other than radial distance (ρ_r) are set as 1. The blue and orange curves show the simulation results for QLKNN (low-fidelity) and QLK (high-fidelity), respectively. Green curves show the predicted mean and 95% confidence interval for the NARGP surrogate.
Download figure:
Standard image High-resolution image

Table 7. Spatial average value of MSE for different surrogates across different quantities for the plasma microturbulence surrogate modelling. The method with minimum MSE is shown in bold.

	Quantity of Interest
Method Name	T_i	T_e	n_e	$-\nabla T_i/T_i$	$-\nabla T_e/T_e$	$-\nabla n_e/n_e$
Single-fidelity GP	$8.57 \times 10^{-3}$	$8.68 \times 10^{-4}$	$1.07 \times 10^{-2}$	$2.26 \times 10^{-1}$	$1.04 \times 10^{-1}$	$7.83 \times 10^{-2}$
AR1	$1.73 \times 10^{-2}$	$2.26 \times 10^{-3}$	$3.31 \times 10^{-2}$	$4.89 \times 10^{-1}$	$8.21 \times 10^{-2}$	1.33
NARGP	$4.46 \times 10^{-3}$	$\mathbf{8.13 \times 10^{-4}}$	$7.12 \times 10^{-3}$	$1.39 \times 10^{-1}$	$\mathbf{6.9 \times 10^{-2}}$	$\mathbf{9.57 \times 10^{-3}}$
GPDF	$3.07 \times 10^{-3}$	$\mathbf{8.13 \times 10^{-4}}$	$6.3 \times 10^{-3}$	$1.94 \times 10^{-1}$	$\mathbf{6.9 \times 10^{-2}}$	$1.44 \times 10^{-2}$
GPDFC	$3.07 \times 10^{-3}$	$\mathbf{8.13 \times 10^{-4}}$	$6.3 \times 10^{-3}$	$1.94 \times 10^{-1}$	$\mathbf{6.9 \times 10^{-2}}$	$1.44 \times 10^{-2}$
NARDGP	$2.89 \times 10^{-3}$	$8.56 \times 10^{-4}$	$6.15 \times 10^{-3}$	$4.41 \times 10^{-1}$	$1.65 \times 10^{-1}$	$1.72 \times 10^{-1}$
DGPDF	$4.52 \times 10^{-3}$	$8.58 \times 10^{-4}$	$\mathbf{5.42 \times 10^{-3}}$	$1.83 \times 10^{-1}$	$8.06 \times 10^{-2}$	3.57
DGPDFC	$\mathbf{2.88 \times 10^{-3}}$	$8.58 \times 10^{-4}$	$\mathbf{5.42 \times 10^{-3}}$	$\mathbf{1.38 \times 10^{-1}}$	$1.82 \times 10^{-1}$	$4.37 \times 10^{-2}$

We observe that non-linear auto-regressive multi-fidelity surrogates perform better than the single-fidelity GP. However, the difference between single-fidelity GP and multi-fidelity surrogates is not too significant. This may be due to the underlying simple structure that single-fidelity GP was able to learn relatively easily. AR1 struggles to model the quantity $-\nabla n_e/n_e$ , which suggests that the relation between low-fidelity and high-fidelity could be non-linear for the case of $-\nabla n_e/n_e$ . Moreover, DGPDF also has a high MSE for $-\nabla n_e/n_e$ , whereas DGPDFC fits the quantity accurately. This re-iterates the significance of structured kernels, especially for DGPs.

5. Conclusions

This paper describes different multi-fidelity Gaussian process surrogate modeling methods. We extend non-linear autoregressive models for full GP to accommodate cases that involve more than two fidelities. Using a structured kernel with delay terms, we also suggest a new family of multi-fidelity GP models (GPDFC and DGPDFC). Finally, we test all the modeling methods on different academic and real-world problems.

We observe that not all the multi-fidelity methods performed well under all the scenarios. The quality of prediction depends upon the underlying relationship between the low and high-fidelity models. Prior knowledge about that would help us choose the correct modeling method. In many scenarios, that relationship is not known. Therefore, after constructing the surrogate, one must carefully validate its predictions. We observe that the structured kernel significantly improves models involving DGPs.

We also conclude that the multi-fidelity simulations can preserve the performance of higher-fidelity simulations but reach the speed and cost of low-fidelity ones. This can aid different research fields in analyzing the underlying system or improving the algorithm using outer loop methods [7].

Acknowledgments

Research by Vladyslav Fediukov, Michael Bergmann and Kislaya Ravi is funded by the Helmholtz Association under the 'Munich School for Data Science—MuDS'(HIDSS-006). Felix Dietrich acknowledges funding by the DFG Project No. 468830823, and also association to DFG-SPP-229.

Data availability statement

All data that support the findings of this study are included within the article (and any supplementary files).

Appendix:

Figure A1. Refer to the following caption and surrounding text. — **Figure A1.** Comparison of different multi-fidelity Gaussian Process surrogate modeling methods on a linear transformation problem stated in table 1. The plots show the high-fidelity function in the blue curve, the high-fidelity training data in blue dots, the predicted mean in the orange curve, and 95% confidence interval using the orange-shaded region.
Download figure:
Standard image High-resolution image

Figure A2. Refer to the following caption and surrounding text. — **Figure A2.** Comparison of different multi-fidelity Gaussian Process surrogate modeling methods on a non-linear transformation problem stated in table 1. The plots show the high-fidelity function in the blue curve, the high-fidelity training data in blue dots, the predicted mean in the orange curve, and 95% confidence interval using the orange-shaded region.
Download figure:
Standard image High-resolution image

Figure A3. Refer to the following caption and surrounding text. — **Figure A3.** Comparison of different Multi-fidelity Gaussian Process surrogate modeling methods on a phase-shifted oscillation stated in table 1. The plots show the high-fidelity function in the blue curve, the high-fidelity training data in blue dots, the predicted mean in the orange curve, and 95% confidence interval using the orange-shaded region.
Download figure:
Standard image High-resolution image

Dates