

ARTICLE IN PRESS

Progress in Aerospace Sciences 45 (2009) 50–79

Contents lists available at ScienceDirect

Progress in Aerospace Sciences


journal homepage: www.elsevier.com/locate/paerosci

Recent advances in surrogate-based optimization


Alexander I.J. Forrester⁎, Andy J. Keane
Computational Engineering and Design Group, School of Engineering Sciences, University of Southampton, SO17 1BJ, UK

Article info
Available online 10 January 2009

Abstract
The evaluation of aerospace designs is synonymous with the use of long running and computationally intensive simulations. This fuels the desire to harness the efficiency of surrogate-based methods in aerospace design optimization. Recent advances in surrogate-based design methodology bring the promise of efficient global optimization closer to reality. We review the present state of the art of constructing surrogate models and their use in optimization strategies. We make extensive use of pictorial examples and, since no method is truly universal, give guidance as to each method's strengths and weaknesses.

© 2008 Elsevier Ltd. All rights reserved.

Contents

1. Introduction  51
2. Initial sampling  51
3. Constructing the surrogate  52
   3.1. Cross-validation  52
   3.2. Polynomials  52
   3.3. Moving least-squares  53
   3.4. Radial basis functions  54
      3.4.1. RBF models of noisy data  55
   3.5. Kriging  55
      3.5.1. Universal Kriging  56
      3.5.2. Blind Kriging  56
      3.5.3. Kriging with noisy data  58
   3.6. Support vector regression  58
      3.6.1. The support vector predictor  58
      3.6.2. Finding the support vectors  60
      3.6.3. Finding μ  60
      3.6.4. Choosing C and ε  60
   3.7. Enhanced modelling with additional design information  61
      3.7.1. Exploiting gradient information  61
      3.7.2. Multi-fidelity analysis  63
4. Infill criteria  66
   4.1. Exploitation  66
      4.1.1. Minimizing the predictor  66
      4.1.2. The trust-region method  66
   4.2. Exploration  67
   4.3. Balanced exploration/exploitation  68
      4.3.1. Two-stage approaches  68
      4.3.2. One-stage approaches  69
   4.4. Parallel infill  71
5. Constraints  72

⁎ Corresponding author.
E-mail address: [email protected] (A.I.J. Forrester).

0376-0421/$ - see front matter © 2008 Elsevier Ltd. All rights reserved.
doi:10.1016/j.paerosci.2008.11.001

6. Multiple objectives  74
   6.1. Multi-objective expected improvement  74
7. Discussion and recommendations  76
Acknowledgements  78
References  78

1. Introduction

An overview of surrogate-based analysis and optimization was presented in this journal by Queipo et al. [1]. They covered some of the most popular methods in design space sampling, surrogate model construction, model selection and validation, sensitivity analysis, and surrogate-based optimization. More recently Simpson et al. [2] presented a general overview of how this area has developed over the past 20 years, following the landmark paper of Sacks et al. [3]. Here we take a more in-depth look at the various methods of constructing a surrogate model and, in particular, at surrogate-based optimization. Our review is by no means exhaustive, but the methods we cover are those we feel are the most promising, based on the cited references coupled with our own experience. Departing from a common trend in review papers, we do not include a large, industrial-type problem which may not be of interest to all readers. Instead we have employed small illustrative examples throughout the paper in the hope that methods are explained better in this way.

The use of long-running, expensive computer simulations in design leads to a fundamental problem when trying to compare and contrast various competing options: there are never sufficient resources to analyse all of the combinations of variables that one would wish. This problem is particularly acute when using optimization schemes. All optimization methods depend on some form of internal model of the problem space they are exploring—for example, a quasi-Newton scheme attempts to construct the Hessian at the current design point by sampling the design space. To build such a model when there are many variables can require large numbers of analyses to be carried out, particularly if using finite difference methods to evaluate gradients. Because of these difficulties it is now common in aerospace design to manage explicitly the building and adaptation of the internal model used during optimization—these models are here termed surrogate models, although they are also often referred to as meta-models or response surfaces. This review is concerned with this approach to design search and, in particular, the construction of surrogates and their refinement. Fig. 1 illustrates the basic process (the steps remain the same for any optimization-based search):

1. first the variables to be optimized are chosen, often due to their importance, as determined by preliminary experiments;
2. some initial sample designs are analysed according to some pre-defined plan;
3. a surrogate model type is selected and used to build a model of the underlying problem—for surrogate-based search this process can be quite sophisticated and time consuming;
4. a search is carried out using the model to identify new design points for analysis;
5. the new results are added to those already available and, provided further analyses are desired, the process returns to step 3.

These steps assume that an automated process has been established to carry out design analyses when given a selection of design inputs, and this is a far from trivial process in most aerospace applications. Typically it requires a scheme to create meshed, water-tight geometries from a vector of design parameters, either using CAD or some specialized product-specific code, followed by the use of mesh generation. The resulting mesh is then used in a finite volume or finite difference scheme to solve the underlying equations and is then followed by some design evaluation process to calculate performance metrics such as lift, drag, stress, etc. Throughout the rest of this paper we assume such a process is established, although we allow for the possibility that it may fail to return usable results (perhaps because of convergence failure, for example) or may return results that are contaminated with noise due to round-off, discretization or convergence errors.
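The loop of Fig. 1 can be sketched in a few lines. Everything here is a hypothetical stand-in: a cheap quadratic plays the role of the expensive simulation, a random plan replaces a proper sampling plan, and a fitted polynomial replaces the surrogates reviewed in Section 3; only the structure of steps 2-5 is the point.

```python
import numpy as np

def expensive_analysis(x):
    # Stand-in for a long-running simulation (hypothetical 1-D objective).
    return (x - 0.3) ** 2

def surrogate_based_search(n_initial=5, n_infill=10, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: analyse an initial sample according to a pre-defined plan
    # (uniform random here; a maximin Latin hypercube would be preferred).
    X = rng.random(n_initial)
    y = np.array([expensive_analysis(x) for x in X])
    for _ in range(n_infill):
        # Step 3: build a surrogate of the underlying problem
        # (a quadratic polynomial stands in for Kriging, RBFs, etc.).
        coeffs = np.polyfit(X, y, deg=2)
        # Step 4: search the surrogate for a promising new design point.
        x_dense = np.linspace(0.0, 1.0, 1001)
        x_new = x_dense[np.argmin(np.polyval(coeffs, x_dense))]
        # Step 5: analyse the new design, add the result, and repeat.
        X = np.append(X, x_new)
        y = np.append(y, expensive_analysis(x_new))
    return X[np.argmin(y)], y.min()

x_best, y_best = surrogate_based_search()
```

With the quadratic stand-in the infill points converge on the optimum almost immediately; the value of the framework only shows when each call to the analysis costs hours rather than microseconds.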

Fig. 1. A surrogate-based optimization framework.

2. Initial sampling

Assuming that we already have a parameterized design coupled to a method of evaluation, the first step in the surrogate-based optimization process is to choose which parameters we wish to vary—our design variables. This may be patently obvious in a familiar design problem, but in a new design problem there may be many variables, only a subset of which we can optimize. The problem of obtaining enough information to predict a design landscape in a hyper-cube of increasing dimensions—the curse of dimensionality—is what holds us back in terms of the number of variables we can optimize. The amount of information we can obtain will, of course, depend on the computational (or experimental) expense of computing objective and, perhaps, constraint functions. Thus the number of variables we can optimize is a function of this expense. Choosing the variables that will be taken forward to optimize usually requires the design and analysis of some preliminary experiments. This process may in fact be revisited several times in the light of results coming from surrogate-based searches. We will not cover

such methods here and the reader may wish to consult Morris [4]. We endorse this reference as it makes the weakest assumptions regarding the type and size of the problem, assuming only that the function is deterministic.

With the design space identified, we must now choose which designs we wish to evaluate in order to construct the surrogate model—our sampling plan. It is worth noting here that this process is often referred to as a design of experiments, a term used for selection of physical experiments. Here we use the term sampling plan to refer to both physical and computational experiments.

To build global models of unknown landscapes, a sampling plan with a uniform, but not regular, spread of points across the design space makes intuitive sense. We also wish to use a sample of points whose projections onto each variable axis are uniform, the logic being that it is wasteful to sample a variable more than once at the same value. To this end we favour the space-filling maximin [5] Latin hypercube [6] sampling techniques of Morris and Mitchell [7]. An in-depth description of this technique, including Matlab code, can be found in Forrester et al. [8]. Note that here we are discussing the sample upon which an initial surrogate will be built. Further sampling—sometimes called adaptive sampling—can be carried out subsequently (the 'add new design(s)' box in Fig. 1) and will be discussed in due course. It is also worth noting that for some classes of surrogates it is possible to estimate directly the quality of a sampling plan on the stability of model building, e.g. D-optimal designs when used with polynomial surrogates. If a particular surrogate type is to be used, whatever the outcome of the initial experiments, this may well influence the choice of sampling plan.

3. Constructing the surrogate

In everyday life we try to save time and make predictions based on assumptions. For example, when travelling on a road we can predict the rate of turn of a bend based on the entry and surrounding landscape. Without really considering it, in our mind we are constructing a surrogate using the direction of the road, its derivatives with respect to distance along the road (at least to second order), and local elevation information. This information is coupled with assumptions based on our experience of going round many bends in the past. We also formulate an estimate of the possible error in the prediction and regulate our entry speed based on this and road surface conditions. In unfamiliar surroundings, for example in a foreign country where road building techniques are different, we have to assume a very high error because our assumptions based on experience are likely to be incorrect, and reduce our speed of entry accordingly. In essence we calculate a suitably safe speed based on our prediction of curvature and subtract a safety margin based on our predicted error. In engineering design we are faced with different problems, but in essence we try to do with a surrogate model what we do everyday with our mind: make useful predictions based on limited information and assumptions.

Care must be taken that any assumptions are well founded. The first assumption we make with all the surrogate modelling techniques discussed here is that the engineering function is continuous—incidentally one we also make about roads, permitted due to the use of signs warning of junctions, etc. This is usually a well-founded assumption, with some notable exceptions such as when dealing with aerodynamic quantities in the region of shocks, structural dynamics, and progressive failure analysis (e.g. crash simulation). This can be accommodated by using multiple surrogates, patched together at discontinuities, though we will not consider this here. This is the only assumption in Kriging (see Section 3.5), making it a versatile, but complicated method. Other methods may perform better if further assumptions prove to be valid.

A second assumption is that the engineering function (though not necessarily our analyses) is smooth. Again, this is usually a perfectly valid assumption. Methods such as moving least-squares (MLS) (Section 3.3), radial basis functions (RBFs) (Section 3.4), support vector regression (SVR) (Section 3.6), and a simplified Kriging model are based on this and the continuity assumption, as too is our mental road prediction method. Although the engineering function may be smooth, our analysis of it may not be. The smoothness assumption may then require 'noise' in the observed data, be it random physical error or computational error appearing as noise, to be filtered.

Further assumptions can be made as to the actual shape of the function itself, e.g. by applying a polynomial regression (see Section 3.2). We know of many engineering quantities that obey such forms (within certain bounds). For example stress/strain is often linear and drag/velocity quadratic. Clearly assumptions about the shape of the function can be useful, but may be unfounded in many problems. No doubt many road accidents have occurred when a bend suddenly tightened in front of a driver who wrongly assumed a circular turn was ahead.

It is worth bearing in mind what we want from our surrogate. Naturally we want an accurate prediction of the function landscape we are trying to emulate and, moreover, in the context of surrogate-based optimization, we want this prediction to be most accurate in the region of the optimum. In this section we will only be considering how to produce a prediction and will look at enhancing the accuracy in the region of the optimum in the remaining sections. As mentioned at the beginning of this paper, we will not be reviewing methods for surrogate model selection and validation. We shall though consider this topic in our final discussion section, and will now briefly describe cross-validation—an important and commonly used generalization error estimator.

3.1. Cross-validation

To compute the cross-validation error, the surrogate model training data are split (randomly) into q roughly equal subsets. Each of these subsets is removed in turn from the complete training data and the model is fitted to the remaining data. At each stage the removed subset is predicted using the model which has been fitted to the remaining data. When all subsets have been removed, n predictions (\hat{y}^{(1)}, \hat{y}^{(2)}, \ldots, \hat{y}^{(n)}) of the n observed data points (y^{(1)}, y^{(2)}, \ldots, y^{(n)}) will have been calculated. The cross-validation error is then calculated as

e_{cv} = \frac{1}{n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2.   (1)

With q = n, an almost unbiased error estimate can be obtained, but one whose variance can be very high. Hastie et al. [9] suggest using somewhat larger subsets, with q = 5 or 10.

As will be seen in the following sections, the cross-validation error can be used for surrogate model parameter estimation, model selection and validation when it is too costly to employ a separate validation data set.

3.2. Polynomials

Although fast being replaced by RBF approaches, the classic polynomial response surface model (RSM) is the original and still, probably, the most widely used form of surrogate model in engineering design. We will not dwell for too long on the subject of polynomials as this ground has been covered many times

before and better than we could hope to do so now. Of the various texts out there, one of the most popular is that by Box and Draper [10].

A polynomial approximation of order m of a function f of dimension k = 1 can be written as

\hat{y}(x; m, \mathbf{a}) = a_0 + a_1 x + a_2 x^2 + \cdots + a_m x^m = \sum_{i=0}^{m} a_i x^i.   (2)

We estimate \mathbf{a} = \{a_0, a_1, \ldots, a_m\}^T through a least-squares solution of \Phi \mathbf{a} = \mathbf{y}, where \Phi is the Vandermonde matrix:

\Phi = \begin{pmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^m \\ 1 & x_2 & x_2^2 & \cdots & x_2^m \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^m \end{pmatrix}   (3)

and \mathbf{y} is the vector of observed responses. The maximum likelihood estimate of \mathbf{a} is thus

\mathbf{a} = (\Phi^T \Phi)^{-1} \Phi^T \mathbf{y}.   (4)

Estimating m is not so simple. The polynomial approximation (2) of order m of an underlying function f is similar in some ways to a Taylor series expansion of f truncated after m + 1 terms [10]. This suggests that greater values of m (i.e. more Taylor expansion terms) will usually yield a more accurate approximation. However, the greater the number of terms, the more flexible the model becomes and we come up against the danger of over-fitting any noise that may be corrupting the underlying response. Also, we run the risk of building an excessively 'snaking' polynomial with poor generalization. We can prevent this by estimating the order m through a number of different criteria [11].

One method is to hypothesise that the true function is indeed a polynomial of degree m and any deviations from this are simply normally distributed noise. If this is the case then only a_0, a_1, \ldots, a_m are required and a_{m+1}, \ldots, a_n = 0. The null hypothesis method (see, e.g. [12, p. 254]) chooses m by minimizing

s_m^2 = \frac{d_m^2}{n - m - 1},   (5)

where

d_m^2 = \sum_{i=1}^{n} \left( y^{(i)} - \sum_{j=0}^{m} a_j \left( x^{(i)} \right)^j \right)^2.   (6)

To choose m, s_m^2 is calculated for m = 1, 2, \ldots as long as there are significant decreases in s_m^2. The smallest m beyond which there are no significant decreases in s_m^2 is chosen.

A model with better generalization might be obtained by instead choosing m to minimize the cross-validation error (see Section 3.1 and e.g. [9]). Cross-validation can give an indication of the overall generalization quality of the polynomial. We can also estimate the average mean squared error (MSE) as

s_{avg}^2 = \frac{(\mathbf{y} - \hat{\mathbf{y}})^T (\mathbf{y} - \hat{\mathbf{y}})}{n - m + 1},   (7)

where \hat{\mathbf{y}} is a vector of predictions at the observed data points and n is the number of observations.

The local MSE is estimated as

s^2 = s_{avg}^2 \, \mathbf{f}^T (\Phi^T \Phi)^{-1} \mathbf{f},   (8)

where \mathbf{f} is the vector of functions in Eq. (2), e.g. \{1, x, x^2, \ldots, x^m\}^T for a one-variable quadratic [13,14].

Polynomial surrogates are unsuitable for the non-linear, multi-modal, multi-dimensional design landscapes we often encounter in engineering unless the ranges of the variables being considered are reduced, as in trust-region methods (by suitably reducing the variable ranges under study, problems can always be simplified). In high-dimensional problems it may not be possible to obtain the quantity of data required to estimate the terms of all but a low-order polynomial. However, for problems with few dimensions, uni- or low-modality, and/or where data are very cheap to obtain, a polynomial surrogate may be an attractive choice. In particular, the terms of the polynomial expression obtained using one of the above methods can provide insight into the design problem, e.g. the effect of each variable in the design space can be easily identified from the coefficients. There may be scenarios when the analysis in lieu of which a surrogate is to be used is too expensive to be searched directly, but is cheaper than the more complex methods which follow in this paper. Here, a quick polynomial surrogate may find its niche.

When dealing with results from deterministic computer experiments, as so often is the case in design optimization, the premise that the error between the polynomial model and the data is independently randomly distributed (which the least-squares estimation in (4) is based upon) is completely false. While assuming independently randomly distributed error is often useful in laboratory experiments, for a surrogate based on deterministic experiments the error is in fact entirely modelling error and can be eliminated by interpolating rather than regressing the data. Data from computer experiments can, however, appear to be corrupted by an element of random error, or 'noise'. This can, for example, manifest via errors due to the finite discretization of governing equations over a computational mesh. In such cases we may again wish to assume the error between the data and the surrogate does to some extent have a random element. The remaining methods in this paper allow the degree of regression to be controlled in some way.

3.3. Moving least-squares

The method of MLS [15,16] allows us to build a standard polynomial regression, an interpolation, or something somewhere between the two. MLS is an embellishment of the weighted least-squares (WLS) approach [17]. WLS recognises that all \{y^{(i)}, x^{(i)}\} pairs may not be equally important in estimating the polynomial coefficients. To this end, each observation is given a weighting w^{(i)} \geq 0, defining the relative importance of \{y^{(i)}, x^{(i)}\}. With w^{(i)} = 0 the observation is neglected in the fitting. The coefficients of the WLS model are

\mathbf{a} = (\Phi^T \mathbf{W} \Phi)^{-1} \Phi^T \mathbf{W} \mathbf{y},   (9)

where

\mathbf{W} = \begin{pmatrix} w^{(1)} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & w^{(n)} \end{pmatrix}.   (10)

A WLS model is still a straightforward polynomial, but with the fit biased towards points with a higher weighting. In a MLS model, the weightings are varied depending upon the distance between the point to be predicted and each observed data point. The weighting is controlled by a function which decays with increasing distance |x^{(i)} - x|. As in the RBF literature, a whole host of decay functions have been used, but a popular choice is the Gaussian function

w^{(i)} = \exp\left( - \frac{\sum_{j=1}^{k} ( x_j^{(i)} - x_j )^2}{\sigma^2} \right)   (11)

(used, for example, by Toropov et al. [18]).

The coefficients of the MLS are found using Eq. (9) but, unlike normal least-squares and WLS, the calculation must be performed at every prediction and so the process is more computationally

intensive. The cost of prediction is increased further by the need to estimate \sigma as well as choose the form of the underlying polynomial. The number of terms in the polynomial is no longer critical, however, since \sigma will, in some senses, take care of a poor choice of underlying function. This may be a considerable advantage in a multi-variable design space where the order of the polynomial is restricted by the amount of data required to estimate all the coefficients. There is a large expanse of literature on the selection of terms for reduced-order models. Here we will only consider the choice of \sigma, which may be a nested part of the selection of terms. We can choose \sigma by minimizing a cross-validation error e_{cv} (see Eq. (1)), either using a formal optimization algorithm or simply trying a selection of \sigma's and choosing the best.

Consider the one-variable test function f(x) = (6x - 2)^2 \sin(12x - 4), x \in [0, 1], to which normally distributed random noise has been added with a standard deviation of one. A MLS surrogate is to be fitted to a sample of 21 points. Fig. 2 shows the MLS approximations found for a range of \sigma's. Clearly the method exhibits an attractive tradeoff between regression and interpolation and the e_{cv} metric provides a basis for choosing the correct tradeoff—here \sigma = 0.1 works best.

MLS is becoming increasingly popular in the aerospace sciences. Kim et al. [19] present a derivative-enhanced form of the method and MLS has been used in surrogate-based optimization by Ho et al. [20]. However, we have found no evidence of MLS being used as part of a global optimization procedure. Although Ho et al. [20] use a global algorithm (simulated annealing) to search a MLS surrogate (which is then followed by a further local direct search of the true function), the search cannot be considered global without a system of updating the surrogate to account for possible inaccuracies which may be obscuring the global optimum (see Section 4).

3.4. Radial basis functions

RBFs [21] use a weighted sum of simple functions in an attempt to emulate complicated design landscapes. Sóbester [22] uses the analogy of imitating the specific timbre of a musical instrument with a synthesizer, using a weighted combination of tones.

Consider a function f observed without error, according to the sampling plan X = \{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}^T, yielding the responses \mathbf{y} = \{y^{(1)}, y^{(2)}, \ldots, y^{(n)}\}^T. We seek an RBF approximation \hat{f} to f of the fixed form

\hat{f}(x) = \mathbf{w}^T \boldsymbol{\psi} = \sum_{i=1}^{n_c} w_i \psi(\| x - c^{(i)} \|),   (12)

where c^{(i)} denotes the ith of the n_c basis function centres and \boldsymbol{\psi} is an n_c-vector containing the values of the basis functions \psi themselves, evaluated at the Euclidean distances between the prediction site x and the centres c^{(i)} of the basis functions. Readers familiar with the technology of artificial neural networks will recognise this formulation as being identical to that of a single-layer neural network with radial coordinate neurons, featuring an input x, hidden units \psi, weights \mathbf{w}, linear output transfer functions, and output \hat{f}(x).

There is one undetermined weight per basis function if fixed bases are used. Example basis functions include

- linear \psi(r) = r,
- cubic \psi(r) = r^3, and
- thin plate spline \psi(r) = r^2 \ln r.

More freedom to improve the generalization properties of (12)—at the expense of a more complex parameter estimation process—can be gained by using parametric basis functions, such as the

- Gaussian \psi(r) = e^{-r^2 / (2\sigma^2)},
- multi-quadric \psi(r) = (r^2 + \sigma^2)^{1/2}, or
- inverse multi-quadric \psi(r) = (r^2 + \sigma^2)^{-1/2}.

Whether we choose a set of parametric basis functions or fixed ones, \mathbf{w} is easy to estimate. This can be done via the interpolation condition

\hat{f}(x^{(j)}) = \sum_{i=1}^{n_c} w_i \psi(\| x^{(j)} - c^{(i)} \|) = y^{(j)}, \quad j = 1, \ldots, n.   (13)

Fig. 2. MLS approximations of a noisy test function for varying \sigma. (Six panels plot f(x) against x \in [0, 1] for \sigma = 0.025, e_{cv} = 41.8755; \sigma = 0.05, e_{cv} = 13.5155; \sigma = 0.1, e_{cv} = 12.889; \sigma = 0.25, e_{cv} = 37.9795; \sigma = 0.5, e_{cv} = 73.1825; \sigma = 1, e_{cv} = 85.0479.)
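As an indication of how curves like those of Fig. 2 can be generated, the sketch below fits an MLS model with the Gaussian weighting of Eq. (11) and a quadratic basis to the noisy test function of Section 3.3. Since the noise realisation used in the figure is not specified, the resulting numbers will differ from those shown; the code is an illustrative sketch, not the authors' implementation.

```python
import numpy as np

def mls_predict(x, X, y, sigma, degree=2):
    # Moving least-squares: a weighted least-squares polynomial fit,
    # re-solved at every prediction site (Eqs. (9)-(11)).
    w = np.exp(-((X - x) ** 2) / sigma ** 2)     # Gaussian decay, Eq. (11)
    Phi = np.vander(X, degree + 1, increasing=True)
    W = np.diag(w)
    # a = (Phi^T W Phi)^(-1) Phi^T W y, Eq. (9)
    a = np.linalg.solve(Phi.T @ W @ Phi, Phi.T @ W @ y)
    return (np.vander(np.atleast_1d(x), degree + 1, increasing=True) @ a)[0]

rng = np.random.default_rng(1)
X = np.linspace(0.0, 1.0, 21)                    # 21-point sample
f_true = (6 * X - 2) ** 2 * np.sin(12 * X - 4)   # test function of Section 3.3
y = f_true + rng.normal(0.0, 1.0, X.shape)       # unit-variance noise
y_hat = np.array([mls_predict(x, X, y, sigma=0.1) for x in X])
```

Sweeping \sigma and computing the leave-one-out e_{cv} of Eq. (1) for each value reproduces the regression-to-interpolation tradeoff illustrated in the figure.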


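With the bases centred on the data points themselves, the interpolation condition (13) reduces to a square linear system for the weights. The following is a minimal sketch (not the authors' code), assuming Gaussian bases with a hand-picked \sigma rather than an estimated one:

```python
import numpy as np

def rbf_fit(X, y, sigma=0.1):
    # Gram matrix of Gaussian bases centred on the sample points:
    # Psi[i, j] = psi(|x_i - x_j|), giving a square n-by-n system.
    r = np.abs(X[:, None] - X[None, :])
    Psi = np.exp(-r ** 2 / (2 * sigma ** 2))
    # Interpolation condition: solve Psi w = y for the weights.
    return np.linalg.solve(Psi, y)

def rbf_predict(x, X, w, sigma=0.1):
    # Evaluate the weighted sum of bases at the prediction site(s).
    r = np.abs(np.atleast_1d(x)[:, None] - X[None, :])
    return np.exp(-r ** 2 / (2 * sigma ** 2)) @ w

X = np.linspace(0.0, 1.0, 11)   # one-dimensional sample locations
y = np.sin(6 * X)               # a smooth function observed without error
w = rbf_fit(X, y)
# Being an interpolant, the model reproduces the observations exactly.
```

In practice the Gram matrix is better factorized by Cholesky decomposition, which also serves as a cheap check that the chosen basis has kept it positive definite.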

While Eq. (13) is linear in terms of the basis function weights w, the predictor f̂ can express highly non-linear responses. It is easy to see that one of the conditions of obtaining a unique solution is that system (13) must be ‘square’, that is, n_c = n. It simplifies things if the bases actually coincide with the data points, that is c^(i) = x^(i), ∀ i = 1, …, n, which leads to the matrix equation

Ψw = y,   (14)

where Ψ denotes the so-called Gram matrix, defined as Ψ_{i,j} = ψ(‖x^(i) − x^(j)‖), i, j = 1, …, n. The fundamental step of the parameter estimation process is therefore the computation of w = Ψ⁻¹y and this is where the choice of basis function can have an important effect. For example, it can be shown that, under certain assumptions, Gaussian and inverse multi-quadric basis functions always lead to a symmetric positive definite Gram matrix [23], ensuring safe computation of w via Cholesky factorization—one reason for the popularity of these basis functions. Theoretically, other bases can also be modified to exhibit this property through the addition of a polynomial term (see, e.g. [24]).

Beyond determining w, there is, of course, the additional task of estimating any other parameters introduced via the basis functions. A typical example is the σ of the Gaussian basis function, usually taken to be the same for all basis functions in all variables. While the correct choice of w will make sure that the approximation can reproduce the training data, the correct estimation of these additional parameters will enable us to minimize the (estimated) generalization error of the model. This optimization step—say, the minimization of a cross-validation error estimate—can be performed at the top level, while the determination of w can be integrated into the process at the lower level, once for each candidate value of the parameter(s).

We have already indicated that the guarantee of a positive definite Ψ is one of the advantages of Gaussian RBFs. They also possess another desirable feature: we can estimate their prediction error at any x in the design space as

s²(x) = 1 − ψᵀΨ⁻¹ψ   (15)

(see, e.g. [25] for further details and a derivation). Such error estimates are invaluable in formulating infill criteria for use in surrogate-based optimization, which we will look at in subsequent sections.

3.4.1. RBF models of noisy data

If the responses y = {y^(1), y^(2), …, y^(n)}ᵀ are corrupted by noise, the above equations may yield a model that overfits the data, that is, it does not discriminate between the underlying response and the noise. Perhaps the easiest way around this is the introduction of added model flexibility in the form of the regularization parameter λ [26]. This is added to the main diagonal of the Gram matrix. As a result, the approximation will no longer pass through the training points and w will be the least-squares solution of

w = (Ψ + λI)⁻¹y,   (16)

where I is an n × n identity matrix. λ should, ideally, be set to the variance of the noise in the response data y [24], but since we usually do not know that, we are left with the option of simply adding it to the list of parameters that need to be estimated.

Another means of constructing a regression model through noisy data using RBFs is to reduce the number of bases. This can be done using the SVR method which we will look at in Section 3.6. While SVR is perhaps the most elegant basis function selection method, a simpler way is to use forward selection (see, e.g. [27]). With 2^{n_c} − 1 subsets of the n available centres to choose from, a truly optimal subset is unlikely to be found and so greedy forward selection algorithms which add one optimal centre at a time are often employed. Starting from an empty subset, the basis function which most reduces some error metric is chosen from n possible basis functions (centred at observed data points). The process is continued, adding one basis function at a time, until there is no significant decrease in the error metric. An improved subset might be achieved by using an exchange algorithm [28] to choose the subset at each stage of the forward selection process. The forward selection process is analogous to the null hypothesis method for polynomial fitting, but here we minimize δ²/(n − n_c) [29].

3.5. Kriging

An increasingly popular basis function—so much so that we have given it its own section—is that used in Kriging¹:

ψ^(i) = cor[Y(x^(i)), Y(x)] = exp(−Σ_{j=1}^{k} θ_j |x_j^(i) − x_j|^{p_j}),   (17)

where Y(·) are random variables which, rather counter intuitively, we assume the observed responses to be. Eq. (17) clearly has similarities with the Gaussian basis function in the previous section. Here there are more parameters: the variance of the basis function can be controlled in each of the k dimensions of the design space by θ_j and, instead of being fixed at 2, the exponent can be varied, again in each dimension, by p_j. These additional parameters of Kriging naturally lead to lengthier model training times, but this is mitigated by the possibility of improved accuracy in the surrogate. A common thread throughout the world of surrogate modelling is that we make assumptions as to the nature of the landscape we are emulating. Kriging is the least assuming method in this paper, in terms of the range of function forms it can emulate, and it is for this reason that it is so effective. Due to the expense of estimating the θ_j and p_j parameters, the method is of most use when the true function is particularly computationally intensive, e.g. a CFD-based calculation. The regurgitation of the derivation of the Kriging model is becoming somewhat trite, so we will satisfy ourselves with quoting the key equations and giving a few insights. We find the most useful derivation to be that in Jones [32]. There is an extensive section on Kriging, including Matlab code, in Forrester et al. [8].

The unknown parameters θ_j and p_j are chosen as MLEs or, to be precise, we maximize the natural logarithm of the likelihood with constant terms removed—the concentrated ln-likelihood function:

ln(L) ≈ −(n/2) ln(σ̂²) − (1/2) ln|Ψ|,   (18)

where

σ̂² = (y − 1μ̂)ᵀΨ⁻¹(y − 1μ̂)/n,   (19)

and Ψ is an n × n matrix of correlations between the sample data, with each element given by Eq. (17). It is the maximization of (18)

¹ Matheron [30] coined the term Krigeage, in honour of the South African mining engineer Danie Krige, who first developed the method we now call Kriging [31]. Kriging made its way into engineering design following the work of Sacks et al. [3], who applied the method to the approximation of computer experiments. Krige's research into the application of mathematical statistics in ore valuation started during his time in the Government Mining Engineer's office and was based on very practical analyses of the frequency distributions of gold sampling values and of the correlation patterns between the individual sample values; also between the grade estimates of ore blocks based on limited sampling of the block perimeters and the subsequent sampling results from inside the ore blocks as they were being mined out. These distribution and correlation models led directly to the development of useful spatial patterns for the data and the implementation of the geostatistical Kriging and simulation techniques now in extensive use, particularly in mining circles worldwide.
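Eqs. (17)–(19) are cheap to evaluate for a candidate (θ, p); an optimizer then maximizes this quantity over the parameters. A minimal sketch follows (the test data, hyperparameter values and small nugget on the diagonal are our own illustrative choices, not the paper's):

```python
import numpy as np

def concentrated_ln_likelihood(X, y, theta, p):
    """Evaluate Eqs. (17)-(19) for candidate hyperparameters theta, p.
    X is n x k; theta and p are length-k vectors."""
    n = X.shape[0]
    # Correlation matrix, Eq. (17), plus a tiny 'nugget' for conditioning.
    d = np.abs(X[:, None, :] - X[None, :, :])
    Psi = np.exp(-np.sum(theta * d**p, axis=2)) + 1e-10 * np.eye(n)
    one = np.ones(n)
    # GLS mean and process variance, Eqs. (21) and (19).
    mu = (one @ np.linalg.solve(Psi, y)) / (one @ np.linalg.solve(Psi, one))
    sigma2 = (y - one * mu) @ np.linalg.solve(Psi, y - one * mu) / n
    # Concentrated ln-likelihood, Eq. (18), with constant terms removed.
    _, logdet = np.linalg.slogdet(Psi)
    return -0.5 * n * np.log(sigma2) - 0.5 * logdet

rng = np.random.default_rng(0)
X = rng.random((15, 2))
y = np.sin(4.0 * X[:, 0]) + X[:, 1]**2

lnL_a = concentrated_ln_likelihood(X, y, np.array([5.0, 5.0]), np.array([2.0, 2.0]))
lnL_b = concentrated_ln_likelihood(X, y, np.array([0.5, 0.5]), np.array([2.0, 2.0]))
print(np.isfinite(lnL_a) and np.isfinite(lnL_b))  # → True
```

Each candidate (θ, p) costs one factorization of Ψ, which is the repeated matrix inversion expense referred to below.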

that lies at the heart of the computational expense of the Kriging technique. Much research effort is directed at devising suitable training strategies and reducing the expense of the multiple matrix inversions required (see, e.g., [33,34]). Still though, this parameter estimation stage limits the method to problems of low dimensionality, with k usually limited to around 20, depending on the expense of the analyses the Kriging model is to be used in lieu of.

A by-product of the parameter estimation is the insight into the design landscape we can obtain from the MLEs of θ_j and p_j. Fig. 3 shows how exp(−|x_j^(i) − x_j|^{p_j}) varies with the separation between the points. The correlation is intuitive insofar as when the two points move close together, x_j^(i) − x_j → 0, exp(−|x_j^(i) − x_j|^{p_j}) → 1 (the points show very close correlation and Y(x_j^(i)) = Y(x_j)) and when the points move apart, x_j^(i) − x_j → ∞, exp(−|x_j^(i) − x_j|^{p_j}) → 0 (the points have no correlation). Three different correlations are shown in Fig. 3: p_j = 0.1, 1, and 2. It is clear how this ‘smoothness’ parameter affects the correlation: with p_j = 2 we have a smooth correlation with continuous gradient through x_j^(i) − x_j = 0. Reducing p_j increases the rate at which the correlation initially drops as |x_j^(i) − x_j| increases. With a very low value of p_j = 0.1, we are essentially saying that there is no immediate correlation between the two points and there is a near discontinuity between Y(x_j^(i)) and Y(x_j).

[Figure] Fig. 3. Correlations with varying p (curves for p = 0.1, 1 and 2).

Fig. 4 shows how the choice of θ_j affects the correlation. It is essentially a width parameter which affects how far a sample point's influence extends. A low θ_j means that all points will have a high correlation, with Y(x_j) being similar across our sample, while a high θ_j means that there is a significant difference between the Y(x_j)'s. θ_j can therefore be considered as a measure of how ‘active’ the function we are approximating is. Considering the ‘activity’ parameter θ_j in this way is helpful in high-dimensional problems where it is difficult to visualize the design landscape and the effect of the variables is unknown. By examining the elements of θ, and providing that suitable scaling of the design variables is in use, one can determine which are the most important variables and perhaps eliminate unimportant variables from future searches [8,35].

[Figure] Fig. 4. Correlations with varying θ (curves for θ = 0.1, 1 and 10).

With the parameters estimated, we can make function predictions at an unknown x using

ŷ(x) = μ̂ + ψᵀΨ⁻¹(y − 1μ̂),   (20)

where

μ̂ = 1ᵀΨ⁻¹y / 1ᵀΨ⁻¹1.   (21)

The predictive power of Kriging is rather impressive, as shown by the prediction of the popular Branin test function based on 20 observations shown in Fig. 5. This demonstration does, however, come with the warning that engineering functions are often not so smooth, noise free and predictable.

Along with other Gaussian process based models, one of the key benefits of Kriging is the provision of an estimated error in its predictions. The estimated MSE for a Kriging model is

s²(x) = σ̂²[1 − ψᵀΨ⁻¹ψ + (1 − 1ᵀΨ⁻¹ψ)² / 1ᵀΨ⁻¹1]   (22)

(see [3] for a derivation). It is because of its error estimates that we favour Kriging for use in surrogate-based optimization and Sections 4–6 will make extensive use of it.

The above equations are categorized as ordinary Kriging. This is the most popular incarnation of Kriging in the engineering sciences. There is also universal, co- and, more recently, blind Kriging. We will look at co-Kriging in Section 3.7.

3.5.1. Universal Kriging

In universal Kriging [36] the mean term is now some function of x:

μ̂ = μ̂(x) = Σ_{i=0}^{m} μ_i ν_i(x),   (23)

where the ν_i's are some known functions and the μ_i's are unknown parameters. Usually μ̂(x) takes the form of a low-order polynomial regression. The idea is that μ̂(x) captures known trends in the data and basis functions added to this will fine-tune the model, thus giving better accuracy than ordinary Kriging where a constant μ̂ is used. However, we do not usually have a priori knowledge of the trends in the data and specifying them may introduce inaccuracies. Hence the popularity of ordinary Kriging.

3.5.2. Blind Kriging

Blind Kriging is a method by which the ν_i's are identified through some data-analytic procedures. Hopefully, if the underlying trends can be identified, the ensuing model will be more
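Given Ψ, μ̂ and σ̂² from the training stage, the predictor and its estimated MSE, Eqs. (20)–(22), are a few lines of linear algebra. The following sketch uses our own illustrative two-dimensional data and fixed hyperparameters rather than MLEs:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((20, 2))                       # 20 samples in k = 2 dimensions
y = np.sin(3.0 * X[:, 0]) + np.cos(2.0 * X[:, 1])
theta, p, n = np.array([20.0, 20.0]), 2.0, X.shape[0]

# Correlation matrix, Eq. (17), with a tiny nugget for conditioning.
d = np.abs(X[:, None, :] - X[None, :, :])
Psi = np.exp(-np.sum(theta * d**p, axis=2)) + 1e-10 * np.eye(n)
one = np.ones(n)
mu = (one @ np.linalg.solve(Psi, y)) / (one @ np.linalg.solve(Psi, one))  # Eq. (21)
sigma2 = (y - mu) @ np.linalg.solve(Psi, y - mu) / n                      # Eq. (19)

def predict(x):
    """Ordinary Kriging predictor, Eq. (20), and estimated MSE, Eq. (22)."""
    psi = np.exp(-np.sum(theta * np.abs(x - X)**p, axis=1))
    yhat = mu + psi @ np.linalg.solve(Psi, y - one * mu)
    Pinv_psi = np.linalg.solve(Psi, psi)
    mse = sigma2 * (1.0 - psi @ Pinv_psi
                    + (1.0 - one @ Pinv_psi)**2 / (one @ np.linalg.solve(Psi, one)))
    return yhat, mse

# At a training point the model interpolates and the estimated error vanishes.
yhat, mse = predict(X[0])
print(abs(yhat - y[0]) < 1e-4, abs(mse) < 1e-4)
```

The vanishing MSE at sampled points is what makes Eq. (22) useful for infill criteria: it is largest far from the data.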

Fig. 5. The true Branin function (left) compared with a Kriging prediction based on 20 sample points (right), with MSE = 9.30.

accurate than ordinary Kriging. Joseph et al. [37] have certainly shown this to be the case for their engineering design examples. We will outline the process of building a blind Kriging prediction and leave the reader to consult Joseph et al. [37] for more details (our description is drawn from this reference). The above reference suggests identifying the ν's through a Bayesian forward selection technique [38] and uses candidate variables of linear effects, quadratic effects, and two-factor interactions. The two-factor interactions are linear-by-linear, linear-by-quadratic, quadratic-by-linear, and quadratic-by-quadratic. This gives a total of 2k² candidate variables, plus the mean term.

The linear and quadratic effects can be defined using orthogonal polynomial coding [39] and, with variables scaled ∈ [0, 1], are given by

x_{lin,j} = √(3/2) · 2(x_j − 0.5) and
x_{quad,j} = (1/√2)[12(x_j − 0.5)² − 2],   (24)

for j = 1, 2, …, k.

To find the most important effect we need to find the maximum of the vector

β̂ = RUᵀΨ⁻¹(y − V_m λ̂_m),   (25)

where R is a (2k² + 1) × (2k² + 1) diagonal matrix:

R = diag(1, r_{lin,1}, r_{quad,1}, r_{lin,2}, …, r_{quad,k−1} r_{quad,k}),   (26)

where

r_{lin,j} = (3 − 3ψ_j(1)) / (3 + 4ψ_j(0.5) + 2ψ_j(1)) and
r_{quad,j} = (3 − 4ψ_j(0.5) + ψ_j(1)) / (3 + 4ψ_j(0.5) + 2ψ_j(1)).   (27)

U is an n × (2k² + 1) matrix whose first column is 1, with subsequent columns given by the interactions of the sample data:

U = [ 1  x_{lin,1}(x_1^(1))  x_{quad,1}(x_1^(1))  x_{lin,2}(x_2^(1))  ⋯  x_{quad,k−1}(x_{k−1}^(1)) x_{quad,k}(x_k^(1)) ]
    [ ⋮          ⋮                 ⋮                  ⋮          ⋱                    ⋮                ]
    [ 1  x_{lin,1}(x_1^(n))  x_{quad,1}(x_1^(n))  x_{lin,2}(x_2^(n))  ⋯  x_{quad,k−1}(x_{k−1}^(n)) x_{quad,k}(x_k^(n)) ]   (28)

λ̂_m is given by

λ̂_m = (V_mᵀΨ⁻¹V_m)⁻¹(V_mᵀΨ⁻¹y),   (29)

where V_m is an n × m matrix containing the m interactions which have been determined.

To build the blind Kriging model, first the parameters of the Kriging correlation function θ and p are estimated as per ordinary Kriging. Then we compute β̂ from Eq. (25) using V₀ = 1 and λ̂₀ = μ̂ (from ordinary Kriging). Now set m = 1 and choose ν_m as the interaction corresponding to the maximum value of β̂. We again compute β̂ and now V_m is an n × (m + 1) matrix whose first column is 1 and mth column is the column of U corresponding to the index of the maximum value of β̂. λ̂_m is found from Eq. (29). The blind Kriging parameters θ and p can now be estimated by maximizing the concentrated ln-likelihood

ln(L) ≈ −(n/2) ln(σ̂_m²) − (1/2) ln|Ψ|,   (30)

where

σ̂_m² = (y − V_m λ̂_m)ᵀΨ⁻¹(y − V_m λ̂_m)/n.   (31)

The blind Kriging predictor is

ŷ(x) = ν_m(x)ᵀλ̂_m + ψᵀΨ⁻¹(y − V_m λ̂_m),   (32)

where ν_m(x) is an (m + 1)-element column vector of interactions for the point to be predicted.

Predictions can now be used to calculate a cross-validation error and the above process iterated to reduce this error up to m = 2k² times. Joseph et al. [37] stop iterating when the cross-validation error begins to rise consistently and choose the m variables corresponding to the smallest error. They also note that it is not necessary to estimate θ and p at every step, just the first and the last.

The ordinary Kriging prediction in Fig. 5 has MSE = 9.30 based on a 101 × 101 grid of points. Following the above procedure for building a blind Kriging model leads to the prediction in Fig. 6 with MSE = 0.56. The variables chosen using the Bayesian forward selection process were (in order of selection) ν₁ = x_{lin,1}x_{lin,2}, ν₂ = x_{quad,1}, ν₃ = x_{quad,2} and ν₄ = x_{lin,1}x_{quad,2}. The addition of these variables affects the prediction accuracy as shown in Fig. 7. While the Branin prediction in Fig. 5 is impressive, the improvements brought about by the additional terms in the underlying function used in blind Kriging are very promising. We use the cautious term ‘promising’ because here we are looking at an analytical test function in just two dimensions, with a large quantity of observed data. In this case, and in Joseph et al. [37] (who use a small amount of data in high-dimensional, non-analytical problems), blind Kriging performs very well, but we would like to see it applied to more problems before we commit ourselves entirely. One should also bear in mind that the blind Kriging process is

Fig. 6. The true Branin function (left) compared with a blind Kriging prediction based on 20 sample points (right), with MSE = 0.56.

more computationally expensive and this may outweigh increased accuracy.

[Figure] Fig. 7. MSE in blind Kriging prediction as variables are added (OK indicates ordinary Kriging).

3.5.3. Kriging with noisy data

In the same way as for an RBF prediction, a Kriging model can be allowed to regress the data by adding a regularization constant to the diagonal of the correlation matrix. The addition of this constant alters the error estimation and we need to use an error estimate in keeping with the origins of the observed data. The modelling error and the errors due to noise are given by

s_r²(x) = σ̂_r²[1 + λ − ψᵀ(Ψ + λI)⁻¹ψ + (1 − 1ᵀ(Ψ + λI)⁻¹ψ)² / 1ᵀ(Ψ + λI)⁻¹1],   (33)

where [40]

σ̂_r² = (y − 1μ̂_r)ᵀ(Ψ + λI)⁻¹(y − 1μ̂_r)/n.   (34)

This error estimate is appropriate when the data contain error from physical experiments. We can express the modelling error only by using the variance

σ̂_{ri}² = (y − 1μ̂)ᵀ(Ψ + λI)⁻¹Ψ(Ψ + λI)⁻¹(y − 1μ̂)/n   (35)

in the interpolating error equation (22) [41]. This error estimate is appropriate when the data contain ‘noise’ from computer experiments. It is important to use the appropriate error estimate when formulating the infill criteria in Section 4.

3.6. Support vector regression

SVR comes from the theory of support vector machines (SVMs), which were developed at AT&T Bell Laboratories in the 1990s [42]. In a surrogate-based engineering design optimization context, it is perhaps more appropriate to consider SVR as an extension to RBF methods rather than SVMs, and we will do just that.

The key attribute of SVR is that it allows us to specify or calculate a margin (ε) within which we are willing to accept errors in the sample data without them affecting the surrogate prediction. This may be useful if our sample data have an element of random error due to, for example, finite mesh size, since through a mesh sensitivity study we could calculate a suitable value for ε. If the data are derived from a physical experiment, the accuracy of the measurements taken could be used to specify ε. To demonstrate this we have sampled our noisy one-dimensional test function at 21 evenly spaced points. Since we know the standard deviation of the noise is one, we have chosen ε = 1. The resulting SVR is shown in Fig. 8. The sample points which lie within the ε band (known as the ε-tube) are ignored, with the predictor being defined entirely by those which lie on or outside this region: the support vectors.

The basic form of the SVR prediction is the familiar sum of basis functions ψ^(i), with weightings w^(i), added to a base term μ; all calculated in different ways to their counterparts in the RBF and Kriging literature, yet contributing to the prediction in the same way:

f̂(x) = μ + Σ_{i=1}^{n} w^(i) ψ(x, x^(i)).   (36)

3.6.1. The support vector predictor

We have not included derivations of the polynomial, RBF and Kriging models in this review, as these can be found easily throughout the engineering literature. SVR is rather new in this field and so we will devote some space to its derivation. Following the theme of most SVM texts, we will first consider a linear regression, i.e. ψ(·) = x:

f̂(x) = μ + wᵀx.   (37)
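The ε margin described above corresponds to the ε-insensitive loss: points within ε of the prediction incur no penalty, while points outside are penalized linearly. A one-line sketch:

```python
import numpy as np

def eps_insensitive_loss(f_pred, y, eps):
    """Zero inside the eps-tube; grows linearly with |f - y| - eps outside it."""
    return np.maximum(np.abs(f_pred - y) - eps, 0.0)

errors = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(eps_insensitive_loss(errors, 0.0, 1.0))  # → [1. 0. 0. 0. 1.]
```

It is this flat region of the loss that removes the interior points from the prediction and leaves only the support vectors.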

[Figure] Fig. 8. A SVR prediction using a Gaussian kernel through the one-dimensional test function with added noise (showing the prediction, the ±ε tube, the sample data and the circled support vectors).

To produce a prediction which generalizes well, we wish to find a function with at most ε deviation from y and at the same time minimize the model complexity.² We can minimize the model complexity by minimizing the vector norm |w|², that is, the flatter the function the simpler it is, and the more likely to generalize well. Cast as a constrained convex quadratic optimization problem, we wish to

minimize (1/2)|w|²
subject to −ε ≤ y^(i) − w·x^(i) − μ ≤ ε.   (38)

Note that the constraints on this optimization problem assume that a function f̂(x) exists which approximates all y^(i) with precision ε. Such a solution may not actually exist and it is also likely that better predictions will be obtained if we allow for the possibility of outliers. This is achieved by introducing slack variables, ξ⁺ for f(x^(i)) − y(x^(i)) > ε and ξ⁻ for y(x^(i)) − f(x^(i)) > ε. We now

minimize (1/2)|w|² + (C/n) Σ_{i=1}^{n} (ξ⁺^(i) + ξ⁻^(i))
subject to y^(i) − w·x^(i) − μ ≤ ε + ξ⁺^(i),
           w·x^(i) + μ − y^(i) ≤ ε + ξ⁻^(i),
           ξ⁺^(i), ξ⁻^(i) ≥ 0.   (39)

From Eq. (39) we see that the minimization is a tradeoff between model complexity and the degree to which errors larger than ε are tolerated. This tradeoff is governed by the user defined constant C ≥ 0 (C = 0 would correspond to a flat function through μ). This method of tolerating errors is known as the ε-insensitive loss function and is shown in Fig. 9. Points which lie inside the ε-tube (the ε-tube is shown in Fig. 8) will have no loss associated with them, while points outside have a loss which increases linearly away from the prediction with the rate determined by C.

[Figure] Fig. 9. ε-insensitive loss function (loss plotted against f(x^(i)) − y(x^(i)), zero between −ε and +ε).

² The requirement of minimizing model complexity to improve generalization derives from Occam's Razor: entia non sunt multiplicanda praeter necessitatem, which translates to ‘entities should not be multiplied beyond necessity’ or, in lay terms, ‘all things being equal, the simplest solution tends to be the best one’. This principle is attributed to William of Ockham, a 14th century English logician and Franciscan friar.

The constrained optimization problem of Eq. (39) is solved by introducing Lagrange multipliers, η⁺^(i), η⁻^(i), α⁺^(i) and α⁻^(i), to give the Lagrangian

L = (1/2)|w|² + (C/n) Σ_{i=1}^{n} (ξ⁺^(i) + ξ⁻^(i)) − Σ_{i=1}^{n} (η⁺^(i)ξ⁺^(i) + η⁻^(i)ξ⁻^(i))
    − Σ_{i=1}^{n} α⁺^(i)(ε + ξ⁺^(i) − y^(i) + w·x^(i) + μ)
    − Σ_{i=1}^{n} α⁻^(i)(ε + ξ⁻^(i) + y^(i) − w·x^(i) − μ),   (40)

which must be minimized with respect to w, μ, ξ^± (the primal variables) and maximized with respect to η^±(i) and α^±(i) (the dual variables), where η^±(i), α^±(i) ≥ 0 (± refers to both + and − variables). For active constraints, the corresponding (α⁻^(i) + α⁺^(i)) will become the support vectors (the circled points in Fig. 8) whereas for inactive constraints, (α⁻^(i) + α⁺^(i)) = 0 and the corresponding y^(i) will be excluded from the prediction.

The minimization of L with respect to the primal variables and maximization with respect to the dual variables means we are looking for a saddle point, at which the derivatives with respect to the primal variables must vanish:

∂L/∂w = w − Σ_{i=1}^{n} (α⁺^(i) − α⁻^(i)) x^(i) = 0,   (41)

∂L/∂μ = Σ_{i=1}^{n} (α⁺^(i) − α⁻^(i)) = 0,   (42)

∂L/∂ξ⁺ = C/n − α⁺^(i) − η⁺^(i) = 0,   (43)

∂L/∂ξ⁻ = C/n − α⁻^(i) − η⁻^(i) = 0.   (44)

From (41) we obtain

w = Σ_{i=1}^{n} (α⁺^(i) − α⁻^(i)) x^(i),   (45)

and by substituting into (37) the SVR prediction is found to be

f̂(x) = μ + Σ_{i=1}^{n} (α⁺^(i) − α⁻^(i)) (x^(i)·x).   (46)

The kernel trick: Until now we have considered our data X to exist in real coordinate space, which we will denote as x ∈ ℝᵏ. We wish to extend Eq. (46) beyond linear regression, to basis functions (known in the SV literature as kernels) which can capture more complicated landscapes. To do this we say that x in (46) is in feature space, denoted as H, which may not coincide with ℝᵏ. We can define a mapping between these two spaces, φ: X → H. We are only dealing with the inner product x·x and x·x = φ·φ. We can actually choose the mapping φ and employ

different basis functions by using ψ = φ·φ:

f̂(x) = μ + Σ_{i=1}^{n} (α⁺^(i) − α⁻^(i)) ψ^(i).   (47)

We can do this so long as:

1. ψ is continuous,
2. ψ is symmetric, i.e. ψ(x, x^(i)) = ψ(x^(i), x),
3. ψ is positive definite, which means the correlation matrix Ψ = Ψᵀ and has non-negative eigenvalues,

that is, ψ must be a Mercer kernel. Popular choices for ψ are

ψ(x^(i), x^(j)) = (x^(i)·x^(j))   (linear),
ψ(x^(i), x^(j)) = (x^(i)·x^(j))^d   (d degree homogeneous polynomial),
ψ(x^(i), x^(j)) = (x^(i)·x^(j) + c)^d   (d degree inhomogeneous polynomial),
ψ(x^(i), x^(j)) = exp(−|x^(i) − x^(j)|²/σ²)   (Gaussian),
ψ(x^(i), x^(j)) = exp(−Σ_{k=1}^{l} θ_k |x_k^(i) − x_k^(j)|^{p_k})   (Kriging).   (48)

Whichever form of ψ is chosen, the method of finding the support vectors remains unchanged.

3.6.2. Finding the support vectors

With the kernel substitution made, the support vectors are found by substituting (41)–(44) into (40) to eliminate η⁻^(i) and η⁺^(i), and to finally obtain the dual variable optimization problem:

maximize −(1/2) Σ_{i,j=1}^{n} (α⁺^(i) − α⁻^(i))(α⁺^(j) − α⁻^(j)) Ψ(x^(i), x^(j))
         − ε Σ_{i=1}^{n} (α⁺^(i) + α⁻^(i)) + Σ_{i=1}^{n} y^(i)(α⁺^(i) − α⁻^(i))
subject to Σ_{i=1}^{n} (α⁺^(i) − α⁻^(i)) = 0,
           α^±(i) ∈ [0, C/n].   (49)

In order to find α⁺ and α⁻, rather than a combined (α⁺ − α⁻), we must re-write (49) as

minimize (1/2) (α⁺; α⁻)ᵀ [ Ψ  −Ψ; −Ψ  Ψ ] (α⁺; α⁻) + (1ε − y; 1ε + y)ᵀ (α⁺; α⁻)
subject to 1ᵀ(α⁺ − α⁻) = 0,
           α⁺, α⁻ ∈ [0, C/n].   (50)

Note that, as per the convention of most optimization algorithms, we have also transformed the maximization problem into a minimization.

We will not cover the formulation of quadratic programming algorithms used to solve problems such as (50). This subject is covered in detail by Schölkopf and Smola [43]. We have had success using Matlab's quadprog.

3.6.3. Finding μ

In order to find the constant term μ, known as the bias, we exploit the fact that at the point of the solution of the optimization problem (49) the product between the dual variables and the constraints vanishes and see that

α⁺^(i)(ε + ξ⁺^(i) − y^(i) + w·φ(x^(i)) + μ) = 0,   (51)

α⁻^(i)(ε + ξ⁻^(i) + y^(i) − w·φ(x^(i)) − μ) = 0   (52)

and

ξ⁺^(i)(C/n − α⁺^(i)) = 0,   (53)

ξ⁻^(i)(C/n − α⁻^(i)) = 0.   (54)

This is one of the Karush–Kuhn–Tucker conditions, which hold at the optimum (see, e.g. [43]).

From (53) and (54) we see that either (C/n − α^±(i)) = 0 or ξ^±(i) = 0 and so all points outside of the ε-tube (where the slack variable ξ^(i) > 0) must have a corresponding α^(i) = C/n. Along with Eqs. (51) and (52), and noting that w·φ(x^(i)) = Σ_{j=1}^{n} (α⁺^(j) − α⁻^(j)) ψ(x^(j), x^(i)), this tells us that either

α⁺^(i) = 0 and μ = y^(i) − Σ_{j=1}^{n} (α⁺^(j) − α⁻^(j)) ψ(x^(j), x^(i)) + ε if 0 < α⁻^(i) < C/n   (55)

or

α⁻^(i) = 0 and μ = y^(i) − Σ_{j=1}^{n} (α⁺^(j) − α⁻^(j)) ψ(x^(j), x^(i)) − ε if 0 < α⁺^(i) < C/n.   (56)

Using (55) and (56) we can compute μ from one or more α^(i)'s which are greater than zero and less than C/n. More accurate results will be obtained if an α^(i) not too close to these bounds is used.

3.6.4. Choosing C and ε

Our initial slack variable formulation (39) was a tradeoff between model complexity and the degree to which errors larger than ε are tolerated and is governed by the constant C. A small constant will lead to a flatter prediction (more emphasis on minimizing (1/2)|w|²), usually with fewer SVs, while a larger constant will lead to a closer fitting of the data (more emphasis on minimizing Σ_{i=1}^{n} (ξ⁺^(i) + ξ⁻^(i))), usually with a greater number of SVs. We wish to choose C which produces the model with the best generalization. The scaling of y will have an effect on the optimal value of C, so it is good practice to start by normalizing y to have elements between zero and one. Fig. 10 shows SVRs of the noisy one-dimensional function (this time with noise of standard deviation four), normalized between zero and one, for varying C. The Gaussian kernel variance, σ², has been tuned to minimize the RMSE of each prediction using 101 test points. This RMSE is displayed above each plot. It is clear that, although there is an optimum choice for C, the exact choice is not overly critical. It is sufficient to try a few C's of varying orders of magnitude and choose that which gives the lowest RMSE for a test data set. For small problems it is possible to obtain a more accurate C by using a simple bounded search algorithm.

Here we have prior knowledge of the amount of noise in the data and so have been able to choose ε as the standard deviation of this noise. There are many situations where we may be able to estimate the degree of noise, e.g. from a mesh dependency and solution convergence study. Situations, however, arise where the noise is an unknown quantity, e.g. a large amount of experimental data with measurements obtained by different researchers. In these situations we can calculate a value of ε which will give the most accurate prediction by using ν-SVR [43].

We have outlined the SVR formulation, but those wishing to delve further into this promising technique will find Schölkopf and Smola [43] a very useful text. A watered down version of the
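The Mercer conditions listed above are easy to check numerically for a given kernel. For the Gaussian kernel of (48), for instance, the kernel matrix built on any set of points is symmetric with non-negative eigenvalues; a quick sketch on arbitrary sample points:

```python
import numpy as np

rng = np.random.default_rng(3)

def gaussian_kernel(A, B, sigma=0.5):
    """Gaussian kernel of Eq. (48): exp(-|x_i - x_j|^2 / sigma^2)."""
    d2 = np.sum((A[:, None, :] - B[None, :, :])**2, axis=2)
    return np.exp(-d2 / sigma**2)

X = rng.random((15, 3))      # arbitrary points in a 3-D design space
K = gaussian_kernel(X, X)

# Mercer conditions: symmetry and non-negative eigenvalues
# (up to floating-point round-off).
eig = np.linalg.eigvalsh(K)
print(np.allclose(K, K.T), eig.min() > -1e-10)
```

Positive definiteness of K is what guarantees that the quadratic programme (50) is convex and has a unique solution.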

[Figure] Fig. 10. SVR predictions and corresponding RMSEs for varying C (ε = 4/range(y) = 0.18). Panels: C = 0, RMSE = 0.21, σ = 1; C = 0.01, RMSE = 0.22, σ = 0.62; C = 0.1, RMSE = 0.21, σ = 0.28; C = 1, RMSE = 0.18, σ = 0.34; C = 10, RMSE = 0.19, σ = 0.13; C = 100, RMSE = 0.19, σ = 0.16.

SVR related part of their book appears in Smola and Schölkopf [44]. Our work here and in Forrester et al. [8] is inspired by these references. While there is no doubt that SVR is a powerful method for prediction, particularly with large, high-dimensional data sets, there is only a slim body of literature detailing its use in engineering design. An example among few is Clarke et al. [45], who compare SVR with other surrogate modelling methods when predicting a variety of engineering problems. No doubt the lack of published material is partly because the method is still young, but also perhaps because the expense of engineering analyses means that we are rarely faced with the problem of very large data sets. More often than not we have a high-dimensional design space, but no possibility of filling it with large amounts of data. In such situations we habitually wish to use all our analysis data and so the SVR process of choosing a subset serves no purpose. SVR is an elegant way of producing predictions from large sets of noisy data and so may have uses in building surrogate models from, for example, extensive archives of historical data. The time required to train an SVR model is longer than for the other methods we have considered, due to the presence of the additional quadratic programming problem, but the accuracy and speed of prediction make it a good candidate for this scenario. Because of the lengthy training times, SVR is unlikely to find favour with those wishing to create surrogates in an ongoing optimization loop.

3.7. Enhanced modelling with additional design information

3.7.1. Exploiting gradient information

A key benefit of surrogate model based search is that the gradients of the true function(s) are not required. If gradient information is available, the designer may in fact choose to employ a localized gradient descent search of the true function with no surrogate model. However, if a global optimum is sought, the gradient information can be used to enhance the accuracy of a surrogate model of the design landscape, which can then be searched using a global optimizer.

Gradient information should only be used to enhance the surrogate if it is available cheaply, otherwise we are likely to be better off simply making more calls to the true function and building the surrogate using a larger sample. This effectively precludes the use of finite-differencing and the complex step approximation [46], although these methods could be useful for calculating a few derivatives, perhaps of particularly active variables. Most likely to be of use is gradient information found via algorithmic differentiation (AD) [47] (also known as automatic differentiation, though the term ‘automatic’ can instill false hope, since manual intervention is required in most cases), which requires access to the function source code, and the adjoint approach [48], which requires the creation of a whole new source code. One could also write separate code for the derivatives of a black-box code (see, e.g. [49]).

Howsoever the derivatives of the function have been found, the methods by which we incorporate them into the surrogate model are essentially the same. We mentioned earlier that Kim et al. [19] have presented a gradient enhanced MLS method. van Keulen and Vervenne [50] have presented promising results, albeit for approximating analytical test functions, for a gradient enhanced WLS method. We will examine how the information is incorporated into a Gaussian process based RBF using a form of co-Kriging [51].

RBF models are typically built from the sum of a number of basis functions centred around the sample data. The height of these functions determines the value of the prediction at the
sample points (usually such that the model interpolates the data) and the width determines the rate at which the function moves away from this value. If gradient information is available at the sample locations, we can incorporate this into the model, using a second set of basis functions.

These additional basis functions determine the gradient of the prediction at the sample points and the rate at which the function moves away from this gradient. The form of the basis function used to incorporate the gradient information is simply the derivative of the first n Gaussian basis functions with respect to the design variables:

∂ψ^(i)/∂x_l^(i) = ∂[exp(−∑_{l=1}^{k} θ_l (x_l^(i) − x_l)²)]/∂x_l^(i) = −2θ_l (x_l^(i) − x_l) ψ^(i).   (57)

Fig. 11 shows how this function behaves as x_l^(i) − x_l varies. Here we are looking at how the prediction will be distorted from the model produced by the first n basis functions. Intuitively no distortion is applied at a sampled point: we can learn no more about the value at this point than the sample data value. As we move away from the point the function pulls the prediction up or down. The θ_l hyper-parameter determines the activity of the function: a higher θ_l leads to a small region of distortion, with the value of ∂ψ/∂x_l quickly returning to zero, while a low θ_l means that a larger area is influenced by the value of the gradient in the jth direction at x^(i).

In a gradient-enhanced RBF the correlation matrix Ψ must include the correlations between the data and the gradients, and between the gradients themselves, as well as the correlations between the data, and will be denoted by the (k + 1)n × (k + 1)n matrix Ψ̇. The matrix, for a k-dimensional problem, is constructed as follows:

Ψ̇ = [ Ψ             ∂Ψ/∂x_1^(i)            ⋯   ∂Ψ/∂x_k^(i)
      ∂Ψ/∂x_1^(j)   ∂²Ψ/∂x_1^(j)∂x_1^(i)   ⋯   ∂²Ψ/∂x_1^(j)∂x_k^(i)
      ⋮             ⋮                      ⋱   ⋮
      ∂Ψ/∂x_k^(j)   ∂²Ψ/∂x_k^(j)∂x_1^(i)   ⋯   ∂²Ψ/∂x_k^(j)∂x_k^(i) ].   (58)

The superscripts in Eq. (58) refer to which way round the subtraction is being performed when calculating the distance in the correlation ψ. This is not important when we are squaring the result but, after differentiating, sign changes will appear depending upon whether we differentiate with respect to x^(i) or x^(j). Using the product and chain rules, the following derivatives are obtained:

∂Ψ^(i,j)/∂x^(i) = −2θ(x^(i) − x^(j)) Ψ^(i,j),   (59)

∂Ψ^(i,j)/∂x^(j) = 2θ(x^(i) − x^(j)) Ψ^(i,j),   (60)

∂²Ψ^(i,j)/∂x^(i)∂x^(j) = [2θ − 4θ²(x^(i) − x^(j))²] Ψ^(i,j),   (61)

∂²Ψ^(i,j)/∂x_l^(i)∂x_m^(j) = −4θ_l θ_m (x_l^(i) − x_l^(j))(x_m^(i) − x_m^(j)) Ψ^(i,j).   (62)

The θ parameter is found by maximizing the concentrated ln-likelihood in the same manner as for a standard Gaussian RBF. Other than the above correlations, the only difference in the construction of the gradient-enhanced model is that 1 is now a (k + 1)n × 1 column vector of n ones followed by nk zeros. The gradient-enhanced predictor is

ŷ(x^(n+1)) = μ̂ + ψ̇ᵀ Ψ̇⁻¹ (y − 1μ̂),   (63)

where

ψ̇ = (ψ, ∂ψ/∂x_1, …, ∂ψ/∂x_k)ᵀ.   (64)

Fig. 12 shows a contour plot of the Branin function along with a gradient-enhanced Kriging prediction based on nine sample points. True gradients and gradients calculated using a finite difference of the gradient-enhanced Kriging prediction are also shown. The agreement between the functions and gradients is remarkable for this function; however, the method is unlikely to perform quite so well on true engineering functions.

We can take the use of gradients to the next step and include second derivatives in a Hessian-enhanced model. The basis function used to incorporate the second derivative information is the second derivative of the first n Gaussian basis functions with respect to the design variables:

∂²ψ^(i)/∂x_l^(i)² = ∂²[exp(−∑_{l=1}^{k} θ_l (x_l^(i) − x_l)²)]/∂x_l^(i)² = [−2θ_l + 4θ_l²(x_l^(i) − x_l)²] ψ^(i).   (65)
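As a concrete illustration of Eqs. (59)–(61), the sketch below (our own code, not from the paper; the function name is ours) assembles the (k + 1)n × (k + 1)n matrix Ψ̇ of Eq. (58) for the one-variable case k = 1:

```python
import numpy as np

def gek_corr_matrix(x, theta):
    """Assemble the 2n x 2n gradient-enhanced correlation matrix (k = 1).

    Block (1,1): psi = exp(-theta*d^2)                  (function/function)
    Block (1,2): dpsi/dx^(i) = -2*theta*d*psi           (Eq. 59)
    Block (2,1): dpsi/dx^(j) =  2*theta*d*psi           (Eq. 60)
    Block (2,2): (2*theta - 4*theta^2*d^2)*psi          (Eq. 61)
    where d = x^(i) - x^(j).
    """
    x = np.asarray(x, dtype=float)
    d = x[:, None] - x[None, :]          # pairwise differences x^(i) - x^(j)
    psi = np.exp(-theta * d**2)
    top = np.hstack([psi, -2.0 * theta * d * psi])
    bottom = np.hstack([2.0 * theta * d * psi,
                        (2.0 * theta - 4.0 * theta**2 * d**2) * psi])
    return np.vstack([top, bottom])

# The assembled matrix is symmetric, and at zero separation the
# gradient/gradient correlation reduces to 2*theta, as Eq. (61) predicts.
Psi_dot = gek_corr_matrix([0.0, 0.4, 0.6, 1.0], theta=10.0)
```

For k > 1 the same pattern repeats, with the mixed second derivatives of Eq. (62) filling the off-diagonal gradient/gradient blocks.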
Fig. 11. Differentiated correlations −2θ(x_j^(i) − x_j) exp[−θ(x_j^(i) − x_j)²] for varying θ (θ = 0.1, 1, 10).

Fig. 13 shows how the twice differentiated basis function behaves for varying θ.

Fig. 14 shows three predictions of our one variable test function: Kriging, gradient-enhanced Kriging and Hessian-enhanced Kriging. In our other figures showing Kriging predictions of this function based on three points we have cheated a little by constraining the θ parameter to give a good prediction. Here we have opened up the bounds on θ and the MLE actually gives a very poor prediction: because of the sparsity of data, no trend is recognised and the prediction is simply narrow bumps around a mean fit. The extra gradient information significantly improves the prediction. It should be borne in mind, though, that adding extra observed data, instead of incurring the expense of calculating gradients, would have improved the prediction too. In high-dimensional problems a few extra observed points will not be as useful as many derivatives, and it is here that cheap gradient information (e.g. from adjoint formulations) is of most use.

The nine weighted basis functions and mean used to build the prediction in Fig. 14 are shown in Fig. 15. It is clear from this figure how each type of basis function affects the prediction. The first three (ψ) are simple deviations from the mean and the second
three (ψ̇) clearly match the gradient at the sample points. Of the final three bases (ψ̈), the first has little effect (the gradient is near constant at this point), the second works against ψ to flatten the function, while the third adds to the curvature, resulting in the steep curve into the global minimum.

Fig. 14. Kriging, gradient enhanced Kriging, and Hessian-enhanced Kriging predictions of f(x) = (6x − 2)² sin(12x − 4) using three sample points.

Fig. 12. Contours of the Branin function (solid) and a gradient-enhanced prediction (dashed) based on nine points (dots). True gradients (solid arrows) and gradients calculated using a finite difference of the gradient-enhanced Kriging prediction (dashed arrows) are also shown. Note that the true function and the prediction are so close that the solid contours and arrows almost completely obscure their dashed counterparts.

Fig. 13. Twice differentiated correlations [2θ − 4θ²(x^(i) − x^(j))²] exp[−θ(x^(i) − x^(j))²] for varying θ (θ = 0.1, 1, 10).

Fig. 15. The nine basis functions used to construct the Hessian-enhanced Kriging prediction, multiplied by their weights, w = Ψ̈⁻¹(y − 1μ). These are added to the constant μ to produce the prediction in Fig. 14.

The use of derivative information adds considerable complexity to the model and the increased size of the correlation matrix leads to (very much) lengthier parameter estimation; however, it clearly leads to the possibility of building more accurate predictions. Schemes to reduce model parameter estimation times for large correlation matrices are always the target of research effort, though a panacea is yet to reveal itself. Second derivatives are not often available to the designer but, with the increasing use of automatic differentiation tools, models which can take advantage of this information may soon provide significant speed-ups compared to using additional function calls—particularly in very high-dimensional problems where adjoint approaches prove most powerful.

3.7.2. Multi-fidelity analysis

When additional information is available, rather than gradients of the function to be approximated, we are perhaps more likely to have available other, cheaper approximations of the function. It may be, for example, that as well as using finite element analysis or computational fluid dynamics, a quick calculation can be made using empirical equations, simpler beam theory, or panel methods. In multi-fidelity (also known as variable-fidelity)
surrogate-based methods a greater quantity of this cheap data may be coupled with a small amount of expensive data to enhance the accuracy of a surrogate of the expensive function. To make use of the cheap data, we must formulate some form of correction process which models the differences between the cheap and expensive functions.

Although we may have many forms of analysis, let us assume for our discussion that we have just two ways of calculating the function (the methods can be extended to multiple levels of analyses). Our most accurate expensive data have values y_e at points X_e and the less accurate cheap data have values y_c at points X_c. The formulation of a correction process is simplified if the expensive function sample locations coincide with a subset of the cheap sample locations (X_e ⊂ X_c). The correction process will usually take the form

y_e = Z_r y_c + Z_d.   (66)

With Z_d = 0, Z_r can take the form of any approximation model fitted to y_e/y_c(X_e). Likewise, with Z_r = 1, Z_d can take the form of an approximation fitted to y_e − y_c(X_e). These processes are then used to correct y_c when making predictions of the expensive function f_e. If the correction process is simpler than f_e, then we can expect predictions based on a large quantity of cheap data with a simple correction to be more accurate than predictions based on a small quantity of expensive data. This simple form of combining multi-fidelity analyses has been used by Leary et al. [52] for finite element analyses using different mesh sizes and by Forrester et al. [53] for combining CFD of varying levels of convergence.

Instead of using Z_r or Z_d, which are output correction processes, we can employ an input correction, known as space mapping [54]. By distorting the locations of X_c we can attempt to align the contours of the cheap function with those of the expensive function. If the cheap and expensive functions have similar scaling, we hope to find a mapping p(X_e) such that y_e(X_e) ≈ y_c(p(X_e)). Of course the scaling of the cheap and expensive functions may be quite different and so an additional correction process from Eq. (66) may be required.

A more powerful multi-fidelity method is that of co-Kriging [36]—an enhancement to the geostatistical method of Kriging, but equally applicable as an enhancement to any parametric RBF. Co-Kriging has been used extensively outside of aerospace design. For example, Hevesi et al. [55] predict average annual precipitation values near a potential nuclear waste disposal site using a sparse set of precipitation measurements from the region along with the correlated and more easily obtainable elevation map of the region. Kennedy and O'Hagan [56] apply co-Kriging to the correlation of results of computer simulations of varying fidelities and cost. Forrester et al. [57] extend the method from prediction to optimization and present the aerodynamic design of a wing using correlated empirical and panel codes. The following presentation of the co-Kriging method is based on this reference.

Using our two sets of data, cheap and expensive, we begin the co-Kriging formulation by concatenating the sample locations to give the combined set of sample points

X = (X_c; X_e) = (x_c^(1), …, x_c^(n_c), x_e^(1), …, x_e^(n_e))ᵀ.

As with Kriging, the value at a point in X is treated as if it were the realization of a stochastic process. For co-Kriging we therefore have the random field

Y = (Y_c(X_c); Y_e(X_e)) = (Y_c(x_c^(1)), …, Y_c(x_c^(n_c)), Y_e(x_e^(1)), …, Y_e(x_e^(n_e)))ᵀ.

Here we use the auto-regressive model of Kennedy and O'Hagan [56], which assumes that cov{Y_e(x^(i)), Y_c(x) | Y_c(x^(i))} = 0, ∀ x ≠ x^(i). This means that no more can be learnt about Y_e(x^(i)) from the cheaper code if the value of the expensive function at x^(i) is known (this is known as a Markov property which, in essence, says we assume that the expensive simulation is correct and any inaccuracies lie wholly in the cheaper simulation).

Gaussian processes Z_c(·) and Z_e(·) represent the local features of the cheap and expensive codes. Using the auto-regressive model we are essentially approximating the expensive code as the cheap code multiplied by a constant scaling factor ρ plus a Gaussian process Z_d(·) which represents the difference between ρZ_c(·) and Z_e(·):

Z_e(x) = ρ Z_c(x) + Z_d(x).   (67)

Where in Kriging we have a covariance matrix cov{Y(X), Y(X)} = σ² Ψ(X, X), we now have a covariance matrix:

C = [ σ_c² Ψ_c(X_c, X_c)     ρ σ_c² Ψ_c(X_c, X_e)
      ρ σ_c² Ψ_c(X_e, X_c)   ρ² σ_c² Ψ_c(X_e, X_e) + σ_d² Ψ_d(X_e, X_e) ].   (68)

The notation Ψ_c(X_e, X_c), for example, denotes a matrix of correlations of the form ψ_c between the data X_e and X_c. The correlations are of the same form as Eq. (17), but there are two correlations, ψ_c and ψ_d, and we therefore have more parameters to estimate: θ_c, θ_d, p_c, p_d and the scaling parameter ρ. Our cheap data are considered to be independent of the expensive data and we can find MLEs for μ_c, σ_c², θ_c and p_c by maximizing the concentrated ln-likelihood:

−(n_c/2) ln(σ̂_c²) − (1/2) ln|det(Ψ_c(X_c, X_c))|,   (69)

where

σ̂_c² = (y_c − 1μ̂_c)ᵀ Ψ_c(X_c, X_c)⁻¹ (y_c − 1μ̂_c)/n_c.   (70)

To estimate μ_d, σ_d², θ_d, p_d and ρ, we first define

d = y_e − ρ y_c(X_e),   (71)

where y_c(X_e) are the values of y_c at locations common to those of X_e (the Markov property implies that we only need to consider this data). If y_c is not available at X_e we may estimate ρ at little additional cost by using Kriging estimates ŷ_c(X_e) found from Eq. (20) using the already determined parameters θ̂_c and p̂_c. The concentrated ln-likelihood of the expensive data is now

−(n_e/2) ln(σ̂_d²) − (1/2) ln|det(Ψ_d(X_e, X_e))|,   (72)

where

σ̂_d² = (d − 1μ̂_d)ᵀ Ψ_d(X_e, X_e)⁻¹ (d − 1μ̂_d)/n_e.   (73)

As with Kriging, Eqs. (69) and (72) must be maximized numerically using a suitable global search routine. Depending upon the cost of evaluating the cheap and expensive functions f_c and f_e, for very high-dimensional problems the multiple matrix inversions involved in the likelihood maximization may render the use of the co-Kriging model impractical (the size of the matrices depends directly on the quantities of data available, and
the number of search steps needed in the MLE process is linked to the number of parameters being tuned). Typically a statistical model used as a surrogate will be tuned many fewer times than the number of evaluations of f_e required by a direct search. The cost of tuning the model can therefore be allowed to exceed the cost of computing f_e and still provide significant speed-up. For large k and n the time required to find MLEs can be reduced by using a constant θ_c,j and θ_d,j for all elements of θ_c and θ_d to simplify the maximization, though this may affect the accuracy of the approximation.

Co-Kriging predictions are given by

ŷ_e(x) = μ̂ + cᵀ C⁻¹ (y − 1μ̂),   (74)

where

μ̂ = 1ᵀ C(X, X)⁻¹ y / 1ᵀ C(X, X)⁻¹ 1   (75)

and c is a column vector of the covariance between X and x (see [57] for a derivation).

If we make a prediction at one of our expensive points, x^(n+1) = x_e^(i), and c is the (n_c + i)th column of C, then cᵀC⁻¹ is the (n_c + i)th unit vector and ŷ_e(x_e^(i)) = μ̂ + y^(n_c+i) − μ̂ = y_e^(i). We see, therefore, that Eq. (74) is an interpolator of the expensive data (just like ordinary Kriging), but it will in some sense regress the cheap data unless it coincides with y_e.

The estimated MSE in this prediction is similar to the Kriging error, and is calculated as

ŝ²(x) ≈ ρ² σ̂_c² + σ̂_d² − cᵀ C⁻¹ c + (1 − 1ᵀ C⁻¹ c)² / (1ᵀ C⁻¹ 1).   (76)

For x^(n_e+1) = x_e^(i), cᵀC⁻¹ is the (n_c + i)th unit vector, cᵀC⁻¹c = c^(n_c+i) = ρ²σ_c² + σ_d², and so ŝ²(x) is zero (just like ordinary Kriging). For X_c \ X_e, ŝ²(x) ≠ 0 unless y_e = y_c(X_e). The error at these points is determined by the character of Z_d. If this difference between ρZ_c(X_e) and Z_e(X_e) is simple (characterized by low θ_d,j's) the error will be low, whereas a more complex difference (high θ_d,j's) will lead to high error estimates.

As shown for Kriging in Section 3.5.3, a regression parameter can be added to the leading diagonal of the correlation matrix when noise is present. In fact, two parameters may be used: one for the cheap data and one for the expensive data.

We will recycle our simple one variable test function to demonstrate co-Kriging. Imagine that our expensive to compute data are calculated by the original function f_e(x) = (6x − 2)² sin(12x − 4), x ∈ [0, 1], and a cheaper estimate of this data is given by f_c(x) = A f_e + B(x − 0.5) + C. We sample the design space extensively using the cheap function at X_c = {0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1}, but only run the expensive function at four of these points, X_e = {0, 0.4, 0.6, 1}.

Fig. 16. A one variable co-Kriging example. The Kriging approximation using four expensive data points (y_e) has been significantly improved using extensive sampling from the cheap function (y_c).

Fig. 17. Estimated error in the co-Kriging prediction in Fig. 16. The simple relationship between the data results in low error estimates at X_c as well as X_e.
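This test problem is easy to reproduce. The sketch below (our own illustrative code, using the constants given with Fig. 16: A = 0.5, B = 10, C = −5) also shows why the example suits a multi-fidelity approach: with the scaling ρ = 1/A, the residual d of Eq. (71) is exactly linear in x:

```python
import math

def f_e(x):
    """Expensive test function f_e(x) = (6x - 2)^2 * sin(12x - 4)."""
    return (6.0 * x - 2.0) ** 2 * math.sin(12.0 * x - 4.0)

A, B, C = 0.5, 10.0, -5.0  # constants used for Fig. 16

def f_c(x):
    """Cheap estimate f_c(x) = A*f_e(x) + B*(x - 0.5) + C."""
    return A * f_e(x) + B * (x - 0.5) + C

X_c = [i / 10.0 for i in range(11)]   # {0, 0.1, ..., 1}
X_e = [0.0, 0.4, 0.6, 1.0]            # expensive subset of X_c

# With the scaling rho = 1/A, the residual d = y_e - rho*y_c (Eq. 71)
# collapses to the straight line -(B/A)*(x - 0.5) - C/A, which is why a
# simple Z_d correction process works so well for this pair of functions.
rho = 1.0 / A
d = [f_e(x) - rho * f_c(x) for x in X_e]
```

A co-Kriging model built from these samples only has to capture this simple difference, rather than the deceptive expensive function itself.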
Fig. 16 shows the functions f_e and f_c with A = 0.5, B = 10, and C = −5. A Kriging prediction through y_e gives a poor approximation to the deliberately deceptive function, but the co-Kriging prediction lies very close to f_e, being better than both the standard Kriging model and the cheap data. Despite the considerable differences between f_e and f_c, a simple relationship has been found between the expensive and cheap data and the estimated error reduces almost to zero at X_c (see Fig. 17).

While we are not considering sampling techniques in this paper, the problem of choosing the n_e-element subset X_e of X_c is an unusual one and so in this case we will make an exception. As with an initial sample, we wish to cover the parameter space evenly so we turn to the Morris–Mitchell criterion [7], but this time we are dealing with a limited, discrete parameter space and thus the problem becomes a combinatorial one. Since selecting the subset that satisfies this is an NP-complete problem and an exhaustive search would have to examine (n_c choose n_e) = n_c!/(n_e!(n_c − n_e)!) subsets (clearly infeasible for all but very moderate cardinalities), here we use an exchange algorithm to select X_e (see, e.g. [28]). We start from a randomly selected subset X_e and calculate the Morris–Mitchell criterion. We then exchange the first point x_e^(1) with each of the remaining points in X_c \ X_e and retain the exchange which gives the best Morris–Mitchell criterion. This process is repeated for each remaining point x_e^(2), …, x_e^(n_e). A number of restarts from different initial subsets can be employed to avoid local optima. Fig. 18 shows a Morris–Mitchell optimal LH with a subset chosen using this exchange algorithm.

A rule of thumb for the number of points which should be used in the sampling plan is n = 10k. When using a particularly cheap analysis n_c may be rather greater than this, allowing us to build a more accurate model, and if the relationship between f_c and f_e is simple, n_e may be somewhat fewer—the advantage of the co-Kriging method.

Our choice of cheap function for the above example is somewhat contrived. For our test function the correction process
Z_d(·) is linear. Co-Kriging will work effectively for more complex correction processes with the proviso that Z_d(·) must be simpler to model than Z_e(·). Although we have only considered combining two levels of analysis, the co-Kriging method can be extended to multiple levels by using additional ρ's and d's (see [56] for more details).

Although multi-level modelling can be achieved simply by combining independent surrogates of the ratios or differences between data, the co-Kriging method is more powerful, both in terms of the complexity of relationships it can handle, and its ability to provide error estimates which can be used to formulate infill criteria.

Fig. 18. A 20 point Morris–Mitchell optimal Latin hypercube (+) with a five point subset found using the exchange algorithm (○).

4. Infill criteria

While our surrogate is built upon assumption, our designs, of course, cannot be—the ensuing lawsuits would be too costly! Results from the surrogate must be confirmed with calls to the true function. Indeed, at any stage we take our optimum design to be the best result of the true function, not that from the surrogate. Additional calls to the true function are not only used to validate the surrogate, but also to enhance its accuracy. It is the judicious selection of new points at which to call the true function, so-called infill points, which represents the heart of the surrogate-based optimization process. Applying a series of infill points, based on some infill criteria, is also known as adaptive sampling (or updating); that is, we are sampling the objective function in promising areas based on a constantly changing surrogate.

The success or failure of a surrogate-based optimization rests on the correct choice of model and infill criteria. Just as when choosing the model, when selecting the infill criteria we can also take short-cuts by making certain assumptions. While offering quick solutions, such short-cuts are naturally susceptible to failure. Jones [32] does an admirable job in highlighting possible avenues of failure and points towards the correct route to the global optimum. In the following sections we will give an overview of his work coupled with our own experience, and point towards some new methods which try to guarantee the eventual location of the global optimum. We should note at this point, however, that in much practical design work our aim is design improvement, which often starts from a locally optimized design. Also, it is always possible to design pathological functions that will fool any optimization process except exhaustive search.

Before embarking on an infill process, we can try a simple 'trick' to improve the accuracy of the surrogate. Consider the problem of predicting stress vs. cross-sectional area. The problem, albeit already simple, could be further simplified to a linear relationship by predicting the negative of the reciprocal of the stress. Of course relationships will rarely be so straightforward, but, nonetheless, it is worth trying a few transformations (e.g., negative reciprocal, logarithm) and re-calculating a generalization error estimate to see if the accuracy of the surrogate can be improved [58].

4.1. Exploitation

4.1.1. Minimizing the predictor

The most basic assumption we can make is that the surrogate model is globally accurate and all we need to do is validate the optimum of the surrogate, having found it with large numbers of calls to a suitably robust global optimizer, by running a single true function evaluation at this point. Hopefully our assumption of global accuracy is based on some form of validation metric. Cross-validation or, ideally, tests using a separate set of data used to compute a MSE or correlation coefficient can be used to indicate the global validity of a surrogate [9]. It is, however, unlikely that the surrogate will, initially, be sufficiently accurate in the region of the optimum and so it is usual practice to apply a succession of infill points at the predicted optimum. After each infill point the surrogate is re-fitted to the data such that an interpolating model will descend into the optimum location. This method is illustrated in Figs. 19 and 20, where a function, f(x) = (6x − 2)² sin(12x − 4), x ∈ [0, 1], is sampled with three initial points followed by five infill points at the minimum of the prediction. The method quickly descends into a local minimum. The vague region of this minimum was indicated by the initial prediction and the infill points isolate the precise position.

Fig. 19. An initial prediction of our one variable function using a Gaussian process model based on three points.

4.1.2. The trust-region method

The above method will find at least a local minimum of the surrogate, given the mild assumptions that the objective function is smooth and continuous. Convergence may be rather lengthy depending upon the function. Alexandrov et al. [59] show rigorous proofs of convergence to a local optimum from an arbitrary point for a trust-region based method which can be used if the surrogate interpolates the observed data and also matches the gradient of the objective function at the observed points. Of the surrogates we have considered, a gradient enhanced MLS and gradient enhanced Kriging are permissible. The trust-region method can also be employed by using the first order scaling
algorithm of Haftka [60] to match the gradient of the function. Eldred et al. [61] have extended this to second-order scaling.

In this approach we start at an arbitrary x_0 and search a surrogate ŷ(x) in the interval x_0 ± δ_0. The trust-region δ is initialized at some user defined value. The first plot in Fig. 21 shows a gradient enhanced Kriging model of our one variable test function through x_0 = 0.5. δ_0 = 0.25 and the second plot shows an infill point at the minimum of the trust-region with a new gradient enhanced Kriging model through this point (x_1) and the initial point. Based on this first iteration at m = 0, δ_m is updated as follows. We first evaluate how well the prediction performed as

r = (f(x_{m−1}) − f(x_m)) / (f(x_{m−1}) − ŷ(x_m)),   (77)

then calculate the new trust-region as

δ_m = { c_1 ‖x_m − x_{m−1}‖           if r < r_1,
        min{c_2 ‖x_m − x_{m−1}‖, Δ}   if r > r_2,
        ‖x_m − x_{m−1}‖               otherwise.   (78)

c_1 < 1 and c_2 > 1 are factors affecting the degree to which the trust-region shrinks and expands depending on how well the surrogate performs (here we have used c_1 = 0.75 and c_2 = 1.25). r_1 and r_2 determine how poorly we allow the surrogate to perform before reducing the trust-region and how well it must perform before increasing the trust-region. Typical values are r_1 = 0.10 and r_2 = 0.75.

We now find x_2 by minimizing ŷ(x) in the region x_1 ± δ_1. The process is repeated until a stopping criterion is met (see [62] for information on stopping criteria). The remaining plots in Fig. 21 show a further two infill points converging towards a local minimum of the function. The above description outlines the core of the trust-region approach to the use of surrogate models. More details can be found in Alexandrov et al. [59], with a multi-fidelity implementation in Alexandrov et al. [63].
RBF prediction Although the above exploitation-based infill criteria are
initial sample
attractive methods for local optimization, it is clear from Fig. 20
10 updates
that a prediction-based infill criterion may not find the global
optimum of a deceptive objective function. Likewise, although the
f (x)

5 trust-region approach will find a local optimum from an arbitrary


starting point, it may not find the global optimum if x0 is not in
132 the global basin of attraction. To locate the true global optimum,
5 4
0 we clearly need an element of exploration in our infill criterion.

−5 4.2. Exploration

Pure design space exploration can essentially be viewed as


−10 filling in the gaps between existing sample points. Perhaps the
0 0.2 0.4 0.6 0.8 1
simplest way of doing this is to use a sequentially space filling
x
sampling plan such as a Sobol sequence or LPt array [64,65],
Fig. 20. Minimum prediction based infill points, starting from the prediction in although such sample methods exhibit rather poor space filling
Fig. 19, converging towards a local optimum. characteristics for small samples. New points could also be

20 20

10 10
f (x)

f (x)

1
0 0

−10 −10
0 0.5 1 0 0.5 1
x x
20 20

10 10
f (x)

f (x)

21 3 21
0 0

−10 −10
0 0.5 1 0 0.5 1
x x

Fig. 21. Trust-region based infill points to a gradient enhanced Kriging prediction.
positioned using a maximin criterion [7]. If error estimates are available for the surrogate, infill points can be positioned at points of maximum estimated error. Error estimates from regressing models, with the exception of those with the modified Gaussian process variance (Eq. (35)), are of dubious merit here. For the exploration of a design space populated by computer experiments we require that the estimated error returns to zero at all sample locations. Otherwise we run the risk of the maximum error occurring at a previously visited point. While this is a valid outcome in the world of physical experiments with their random errors, re-running a deterministic computer experiment as an infill point is useless.

The Gaussian process based models considered in this paper assume a stationary covariance structure, that is, the basis function variance is constant across the design space. The model does not account for some areas of the design space having more activity than others, e.g. flat spots may not be modelled effectively. This is unlikely to be a serious problem for optimization, but may be for building a model which accurately predicts the underlying function in all areas. For such stationary covariance models a maximum error based infill will indeed just fill in the gaps between sample points. It is possible to build a surrogate with a non-stationary covariance [66]. Maximum error infill points based on such a model may well perform better at improving generalization than simply using a space-filling sample with more points.

Pure exploration is of dubious merit in an optimization context. Time spent accurately modelling suboptimal regions is time wasted when all we require is the global optimum itself. Exploration-based infill has its niche in design space visualization and comprehension, where the object is to build an accurate approximation of the entire design landscape to help the designer visualize and understand the design environment they are working in. We will not dwell on visualization issues, but those interested in design space visualization might consult Holden [67]. Exploration also has a role in producing a globally accurate model when the final surrogate is to be used in a real-time control system or in a more complex overarching calculation, such as aeroelasticity.

4.3. Balanced exploration/exploitation

We know that exploiting the surrogate before the design space has been explored sufficiently may leave the global optimum undiscovered, while over-exploration is a waste of resources. Thus the Holy Grail of global optimization is finding the correct balance between exploitation and exploration. Concurring with Jones [32], we will split the following discussion into two breeds of infill criteria: one- and two-stage methods. In a two-stage method the surrogate is fitted to the data and the infill criterion is calculated based upon this model. In a one-stage approach the surrogate is not fixed when calculating the infill criterion; rather, the infill criterion is used to calculate the surrogate. We will begin with the simpler and more common two-stage methods.

4.3.1. Two-stage approaches

Statistical lower bound: The simplest way of balancing exploitation of the prediction ŷ(x) and exploration using s²(x) (e.g., Eq. (22)) is to minimize a statistical lower bound

LB(x) = ŷ(x) − A s(x),   (79)

where A is a constant that controls the exploitation/exploration balance. As A → 0, LB(x) → ŷ(x) (pure exploitation), and as A → ∞ the effect of ŷ(x) becomes negligible and minimizing LB(x) is equivalent to maximizing s(x) (pure exploration). A key problem is that, as the method stands, it is difficult to choose a value for A. For example, a suitable choice of A for one function might lead to over-exploitation of another. In Section 4.3.2 we will look at how a one-stage approach can solve this problem. A possible solution for a two-stage implementation is to try a number of values of A and position infill points where there are clusters of minima of Eq. (79).

Probability of improvement: By considering ŷ(x) as the realization of a random variable we can calculate the probability of an improvement I = y_min − Y(x) upon the best observed objective value so far, y_min:

P[I(x)] = (1 / (s√(2π))) ∫_{−∞}^{0} exp(−(I − ŷ(x))² / (2s²)) dI.   (80)

This equation is interpreted graphically in Fig. 22. The figure shows the prediction in Fig. 20 along with a vertical Gaussian distribution with variance s²(x) centred around ŷ(x). This Gaussian distribution represents our uncertainty in the prediction ŷ(x), and the part of the distribution below the horizontal dotted line indicates the possibility of improving on the best observed value (the quantity we are integrating in Eq. (80)). The probability of improvement is the area enclosed by the Gaussian distribution below the best observed value so far (the value of the integral in Eq. (80)).

Expected improvement: Instead of simply finding the probability that there will be some improvement, we can calculate the amount of improvement we expect, given the mean ŷ(x) and variance s²(x). This expected improvement is given by

E[I(x)] = (y_min − ŷ(x)) Φ((y_min − ŷ(x)) / s(x)) + s φ((y_min − ŷ(x)) / s(x)) if s > 0, and E[I(x)] = 0 if s = 0,   (81)

where Φ(·) and φ(·) are the normal cumulative distribution function and probability density function, respectively. This equation can be interpreted graphically from Fig. 22 as the first moment of the area enclosed by the Gaussian distribution below the best observed value so far.

The progress of maximum E[I(x)] updates to our one-variable test function is shown in Fig. 23. Clearly the E[I(x)] search has escaped the local minimum to the left and succeeded in locating the global optimum.

The infill criteria we have reviewed so far represent the current tools of choice for surrogate-based optimization in the aerospace industry.

[Figure 22: the true function, an RBF prediction, the observed data, y_min and the vertical distribution of Y(x) with P[I(x)] indicated; axis residue removed.]
Fig. 22. A graphical interpretation of the probability of improvement.
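Once ŷ(x) and s(x) are available from the surrogate, all three two-stage criteria above are cheap to evaluate. The following is a minimal, hedged sketch (the function names are ours, and the closed form P[I(x)] = Φ((y_min − ŷ(x)) / s(x)) is the standard evaluation of Eq. (80)):

```python
import math

def phi(z):
    """Standard normal probability density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def lower_bound(y_hat, s, A):
    """Statistical lower bound, Eq. (79): LB(x) = y_hat - A*s."""
    return y_hat - A * s

def prob_improvement(y_hat, s, y_min):
    """Probability of improvement, Eq. (80), in its closed form."""
    if s <= 0.0:
        return 0.0
    return Phi((y_min - y_hat) / s)

def expected_improvement(y_hat, s, y_min):
    """Expected improvement, Eq. (81); zero where the error vanishes."""
    if s <= 0.0:
        return 0.0
    z = (y_min - y_hat) / s
    return (y_min - y_hat) * Phi(z) + s * phi(z)
```

In an infill search these functions would be maximized (or, for the lower bound, minimized) over x, with y_hat and s supplied by the Kriging or RBF model at each candidate x.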
[Figure 23: eight panels showing f(x) and E[I(x)] at successive updates, with the estimate of θ at each stage: 70.79, 2.82, 0.89, 6.31, 18.64, 16.89, 12.4 and 12.85; axis residue removed.]
Fig. 23. The progress of a search of the one variable test function using a maximum E[I(x)] infill strategy.

We will now look at a more recently developed breed of infill criteria, which attempts to address some of the problems with those we have covered so far. In many situations maximizing E[I(x)] will prove to be the best route to finding the global optimum, and as such it has become very popular as a tool for global optimization, evident from the number of citations to the seminal paper by Jones et al. [58]. Should the assumptions upon which we base our confidence in this method prove to be false, however, maximizing E[I(x)] (and P[I(x)]) may converge very slowly or not at all. The assumption upon which E[I(x)] and P[I(x)] trip up is that the model parameters have been estimated accurately from the observed data. Note that, although the search in Fig. 23 does locate the optimum, it dwells in a region which does not even contain a local optimum until θ is estimated correctly, that is, until there is sufficient data to estimate θ. In situations where data are sparse and/or the true function is deceptive, we may wish to consider a breed of infill criteria which can alleviate this pitfall.

4.3.2. One-stage approaches

All the above infill criteria could possibly be misled by a particularly poor or unlucky initial sample and a very deceptively positioned optimum. Consider the function shown in Fig. 24 which, although on face value it looks rather contrived, represents the worst case scenario of a type of situation that can occur in
surrogate-based optimization. We have been unlucky enough to sample the function at three points with the same function value. An error-based infill criterion cannot cope with the prediction in Fig. 24 because the estimated error is zero for all values of x, and so P[I(x)] or E[I(x)] would also be zero. The error does not have to be zero in all areas for problems to arise: slow convergence of error-based infill criteria can occur whenever there is a significant underestimation of the error.

In situations like that in Fig. 24 we need to employ an infill criterion which takes into account the possibility that a deceptive sample may have resulted in significant error in the model parameters. The criteria we will consider do not use the surrogate to find the minimum, but rather use the minimum to find the surrogate. Or, in a sound bite (paraphrased from [68]): ask not what the surrogate implies about the minimum; ask what the minimum implies about the surrogate.

Goal seeking: We may be able to estimate a suitable value for the global optimum, or perhaps we would just like to search for a specific improvement, even if it is not known whether that improvement is possible. In such cases we can use a method which does not search for expectations or probabilities of improvement, but assesses the likelihood that an objective function value could exist at a given point [32].

The Kriging predictor can be considered as a maximum likelihood estimate of the sample data augmented with the point to be predicted. Instead of estimating the value ŷ(x) for a given x, we can assume the predictor passes through a goal y_g as well as the sample data, and find the value of x_g which best fits this assumption. To do this we maximize the conditional ln-likelihood

−(n/2) ln(2π) − (n/2) ln(σ̂²) − (1/2) ln|C| − (y − m)ᵀ C⁻¹ (y − m) / (2σ̂²),   (82)

where

m = 1μ + w(y_g − μ)   (83)

and

C = Ψ − wwᵀ,   (84)

by varying x_g and the model parameters (at this stage we may wish to widen any upper and lower bounds on θ). The position of the goal, x_g, appears in Eq. (82) via its vector of correlations with the observed data, w. We must maximize the conditional ln-likelihood numerically in the same way as for tuning the model parameters. We can first make a substitution for the MLE σ̂² (Eq. (31)) to give the concentrated conditional ln-likelihood

−(n/2) ln(σ̂²) − (1/2) ln|C|.   (85)

To see how effective this method can be, we will consider the search of our one-dimensional test function. We begin with three sample points and set an objective function goal of −7 (a little less than the true optimum, but let us assume we do not know what that is). Fig. 25 shows the progress of infill points positioned at locations which maximize the conditional likelihood of the goal. Despite its deceptive location, the goal seeking method quickly finds the global optimum. We cannot choose a purely arbitrary goal. An overly optimistic goal will lead to too much exploration, since there will be an equally low likelihood in many areas. A pessimistic goal will result in a local search, but the goal will quickly be obtained, which may well be an acceptable outcome. Gutmann [69] suggests, and has had success with, trying a range of goals and positioning infill points where there are clusters of optimal infill locations.³ A more elegant method, when a suitable estimate for a goal cannot be made, is to calculate a lower bound based on the conditional likelihood.

The conditional lower bound: In many cases we will not be able to specify a goal for the optimization, but we can still use a conditional likelihood approach. Instead of finding the x which gives the highest likelihood conditional upon ŷ(x) passing through a goal, we find the x which minimizes ŷ(x) subject to the conditional likelihood not being too low [68].

Again, consider the prediction of our deceptive one variable test function based on an initial sample of three points. This is shown in Fig. 26, along with the statistical lower bound found by subtracting the estimated RMSE, s(x). At x = 0.7572, which we know is the minimum of the function, a point with y_h = ŷ(x) has been imputed (i.e. we have hypothesized that this point is part of the sample data, even though it has not actually been observed). The likelihood conditional upon the prediction passing through this point is shown. Subsequently, we have imputed lower and lower values at x = 0.7572 and re-optimized θ to produce a prediction through these points. These values fall well below our statistical lower bound, but still have a conditional likelihood and so represent possible values at x = 0.7572. As the imputed value reduces, the conditional likelihood becomes extremely low, and we clearly need a systematic method of dismissing imputations which are very unlikely. We achieve this using a likelihood ratio test.

By calculating the ratio of the conditional likelihood of ŷ using the maximum likelihood estimate of θ, L₀, to the conditional likelihood, L_cond, of the prediction passing through the imputed point, and comparing to the χ² distribution, we can make a decision as to whether to accept the value of the imputed point. To be accepted,

2 ln(L₀ / L_cond) < χ²_critical(limit, dof)   (86)

must be satisfied. The critical χ² value will depend upon the confidence limit we wish to obtain and the number of degrees of freedom (the number of model parameters). For the example in Fig. 26, if we wish to obtain a confidence limit of 0.95, we use limit = 0.975 (we are only considering the lower bound) and dof = 1 to obtain χ²_critical = 5.0239 (from tables or, e.g., Matlab). Fig. 26 shows the likelihood ratio for each hypothesized point which has been imputed, calculated using the conditional likelihoods shown.

[Figure 24 appears here; axis residue removed.]
Fig. 24. A deceptive function with a particularly unlucky sampling plan.

³ This is similar to trying a range of weightings between local and global search. Sóbester et al. [70] used a weighted expected improvement formulation as part of a two-stage approach to achieve similar ends, while Forrester [71] used a weighted statistical lower bound with reinforcement learning to choose the weighting.
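The machinery of Eqs. (82)–(86) can be sketched as follows. This is a minimal numpy illustration for one variable with a Gaussian correlation, in which μ and θ are assumed known (in a real search they are re-estimated alongside x_g); the function names, toy data and default χ² value (limit = 0.975, dof = 1) are our choices:

```python
import numpy as np

def corr(a, b, theta):
    """Gaussian (squared exponential) correlation between locations a and b."""
    return np.exp(-theta * (a - b) ** 2)

def concentrated_cond_lnlike(x_g, y_g, X, y, theta, mu):
    """Concentrated conditional ln-likelihood, Eq. (85), of the predictor
    passing through the goal point (x_g, y_g)."""
    n = len(X)
    Psi = corr(X[:, None], X[None, :], theta)        # sample correlation matrix
    w = corr(X, x_g, theta)                          # correlations with the goal
    m = mu + w * (y_g - mu)                          # Eq. (83)
    C = Psi - np.outer(w, w)                         # Eq. (84)
    r = y - m
    sigma2 = r @ np.linalg.solve(C, r) / n           # MLE of the process variance
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * n * np.log(sigma2) - 0.5 * logdet  # Eq. (85)

def accept_imputed(lnlike_mle, lnlike_cond, chi2_critical=5.0239):
    """Likelihood ratio test, Eq. (86), written with ln-likelihoods."""
    return 2.0 * (lnlike_mle - lnlike_cond) < chi2_critical
```

Maximizing concentrated_cond_lnlike over x_g (and, in full, over the model parameters) gives the goal seeking infill point; the same quantity feeds the acceptance test used by the conditional lower bound approach described next.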
[Figure 25: four panels showing f(x) and the conditional likelihood of the goal at successive updates, with panel annotations −0.055353, −3.9399 and −6.0009; axis residue removed.]
Fig. 25. The progress of a search of the one variable test function in the range [0, 1] using a goal seeking infill strategy.

[Figure 26: the true function, initial sample, prediction, prediction minus the estimated RMSE, and imputed points with conditional likelihoods 1.11e−2, 3.15e−3 (Λ = 2.52), 1.58e−3 (Λ = 3.9) and 8.06e−4 (Λ = 5.25); axis residue removed.]
Fig. 26. The conditional likelihood and likelihood ratio for hypothesized points with increasingly lower objective function values.

The lowest imputed value in Fig. 26 would be rejected based on χ²_critical.

Using this likelihood ratio test we can systematically compute a lower confidence bound for the prediction. The minimum of this lower bound can then be used as an infill criterion. To choose a new infill point we must minimize y_h by varying y_h, x, and the model parameters, subject to the constraint defined by Eq. (86). Fig. 27 shows the progress of a search of the deceptive one variable test problem using this infill criterion, starting from the same three point initial sample. A 95% confidence interval has been chosen. Despite us not specifying a goal a priori, the infill strategy has quickly found the global optimum. We are still left with a rather annoying control parameter: we must choose the confidence interval, and it is not entirely clear what is the best method of doing this. In a similar vein to Gutmann's goal seeking method, a number of confidence intervals could be tried.

Kriging is known to give inaccurate error estimates, particularly with sparse sampling [72], and the conditional lower bound can, in fact, be used to calculate what may be a more reliable Gaussian process based model error estimate by setting the confidence interval to give one standard deviation. By using the conditional lower bound approach to calculate error estimates, the two-stage approaches of probability of improvement and expected improvement can be transformed into one-stage methods, allowing problems such as that shown in Fig. 24 to be solved with the added benefits of E[I(x)] over a lower bound criterion. The benefits are that, assuming the function to be searched is smooth and continuous, E[I(x)] can be proved to converge to the global optimum. It can also readily be modified to account for constraints and multiple objectives, which we will consider in the next two sections. Forrester and Jones [73] show the formulation of an expected improvement criterion using conditional lower bound based error estimates.

These one-stage approaches seem to be the panacea we have been looking for, but in many situations they could prove to be intractable. In the conditional bound approach, for example, x, y_h, θ and p (for a Kriging model) need to be varied to minimize the lower bound. This search of up to 3k + 1 parameters is naturally far more computationally intensive than a standard two-stage method, particularly when we have a significant number of sample points: recall that at each step in this 3k + 1 dimensional search we must invert a matrix whose dimensions are the size of the data set. If, however, the data set is rather limited because the underlying function is extremely expensive, this may nonetheless be worthwhile. This will sometimes be the case in high fidelity CFD or non-linear FEA based optimization.

4.4. Parallel infill

We have so far assumed that infill points will be applied in a serial process, but it is often possible to apply parallel updates. Many of the infill criteria we have reviewed exhibit multi-modal behaviour; see, in particular, the E[I(x)] and conditional likelihood plots in Figs. 23 and 25. A search which locates a number of minima, e.g. multi-start hill-climbers or a genetic algorithm with clustering, can be employed to obtain a number of infill points [74].
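For a one-variable criterion, such a set of candidate infill points can be harvested with a simple scan for local optima. The following is a crude sketch (the grid resolution and names are our choices; a genuine implementation would use a proper multi-start or evolutionary search as cited above):

```python
def local_maxima(criterion, n_grid=1001):
    """Return the interior local maxima of a 1-D infill criterion on [0, 1],
    found by scanning a fine grid; each maximum is a candidate infill point
    that can then be evaluated on the expensive function in parallel."""
    xs = [i / (n_grid - 1) for i in range(n_grid)]
    ys = [criterion(x) for x in xs]
    return [xs[i] for i in range(1, n_grid - 1)
            if ys[i - 1] < ys[i] > ys[i + 1]]
```

Applied to a multi-modal E[I(x)] landscape, each returned location is a separate candidate update, so the number of points obtained depends on the modality of the criterion, as noted below.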
[Figure 27: four panels showing the true function, observed data, prediction, lower bound and new infill point at successive updates; axis residue removed.]
Fig. 27. The progress of a search of the one variable test function in the range [0, 1] using a conditional lower bound infill strategy.

These infill points can then be evaluated simultaneously to take advantage of parallel computing capabilities. We cannot guarantee how many infill points will be obtained, however, and so the process may not take advantage of all available resources.

A method of obtaining a specified number of infill points has been suggested by Schonlau [75]. We search the infill criterion to find its global optimum and then temporarily add the surrogate model predicted value at this point, i.e. we assume the model is correct at this location and impute its value. The surrogate is then constructed with this temporary new point (with no need to re-estimate the model parameters) and the infill criterion searched again. For P[I(x)] and E[I(x)] we do not change y_min, should the imputed prediction be lower than this. The process is continued until the desired number of infill points has been obtained. These infill points are then evaluated and added to the data set in place of the temporary predictions. This method makes effective use of parallel computing resources, though the sampling may not be as well placed as in a sequential scheme.

We note in passing that setting up and searching a surrogate of any kind can be a bottleneck in a heavily parallel computing environment. If we have sufficient processors, then evaluating all the points in our initial sampling plan can occur simultaneously, but we must then pull all these results together to build and study the surrogate before we can return to our parallel calculation of sets of infill points. This fact will always limit the amount of time we can dedicate to the building and searching of surrogates.

5. Constraints

Traditional constrained optimization approaches can be applied to a surrogate-based optimization process. Of note is the use of the augmented Lagrangian method for constrained optimization in a surrogate-based trust region search [63]. Simpler to implement is the application of penalty functions [76]. Whether the constraint is cheap, and evaluated directly, or expensive, and a surrogate model of the constraint is employed, in most cases penalty functions can be applied to surrogate-based search in the usual manner. When one or more constraints are violated, a suitable penalty is applied to the value obtained from the surrogate of the objective function, and thus a search of the surrogate is diverted away from regions of violation. For a max{E[I(x)]} or max{P[I(x)]} based search, y_min should be replaced with the minimum observed function value which satisfies the constraint.

We may not be able to model a constraint function, for example when a region of infeasibility is defined purely by objective function calculations failing. In such situations we can penalize the surrogate in regions of failures by imputing large objective function values at failed points. Forrester et al. [77] used ŷ(x_failed) + s²(x_failed) for imputed points and showed this to work well for an aerofoil design problem.

Assuming we can model the constraint function(s), a fully probabilistic approach can be taken to their inclusion. We shall concentrate on this surrogate model specific method. Before delving into the mathematics, it is useful to set out what we might expect when using Gaussian process (e.g. Kriging) models for both the objective and constraint functions. If, at a given point in the design space, the predicted errors in a constraint model are low and the surrogate shows a constraint violation, then the expectation of improvement will also be low, but not zero, since there is a finite possibility that a full evaluation of the constraints may actually reveal a feasible design. Conversely, if the errors in the constraints are large then there will be a significant chance that the constraint predictions are wrong and that a new point will, in fact, be feasible; thus the expectation of improvement will be greater. Clearly, for a fully probabilistic approach we must factor these ideas into the calculation of the expectation. It turns out that this is relatively simple to do, although it is rarely mentioned in the literature (a formulation can be found in the thesis of Schonlau [75]). Provided that we assume that the constraints and objective are all uncorrelated, a closed form solution can readily be derived. If not, and if the correlations can be defined, then numerical integration in probability space is required. Since such data is almost never available, this idea is not pursued further here.

We have already discussed the probability of improvement infill criterion. Now consider a situation when we have a constraint function, also modelled by a Gaussian process based on sample data in exactly the same way. Rather than calculating P[I(x)], we could use this model to calculate the probability of the prediction being greater than the constraint limit, i.e. the probability that the constraint is met, P[F(x) > g_min]. The probability that a design is feasible can be calculated following the same logic as for an improvement, only now, instead of using the current best design as the dividing point in probability space, we use the constraint limit value, i.e.

P[F(x) > g_min] = (1 / (s√(2π))) ∫_{0}^{∞} exp(−(F − ĝ(x))² / (2s²)) dF,   (87)
where g is the constraint function, g_min is the limit value, F is the measure of feasibility G(x) − g_min, G(x) is a random variable, and s² is the variance of the Kriging model of the constraint. We can couple this result to the probability of improvement from a Kriging model of the objective; the probability that a new infill point both improves on the current best point and is also feasible is then just

P[I(x) ∩ F(x) > g_min] = P[I(x)] P[F(x) > g_min],   (88)

since these are independent models.

We can also use the probability that a point will be feasible to formulate a constrained expected improvement: we simply multiply E[I(x)] (Eq. (81)) by P[F(x) > g_min]:

E[I(x) ∩ F(x) > g_min] = E[I(x)] P[F(x) > g_min].   (89)

As an example, the first plot of Fig. 28 shows our one variable function along with a constraint function (simply the negative of the objective minus a constant). The second plot shows the unconstrained E[I(x)], the third plot shows the probability of satisfying the constraint (Eq. (87)), and the fourth plot shows the product of these: our constrained expected improvement (Eq. (89)). Note how multiplying by the probability of satisfying the constraint forces the expectation away from the region where the constraint is violated, and the next infill will be on the constraint boundary (see Fig. 29, which shows the situation after this infill point has been applied).
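Under the independence assumption of Eq. (88), the constrained criteria of Eqs. (87)–(89) reduce to products of closed-form terms. A minimal sketch (names are ours; the constraint is taken to be met when the constraint prediction exceeds g_min, as in Eq. (87)):

```python
import math

def Phi(z):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def phi(z):
    """Standard normal probability density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def prob_feasible(g_hat, s_g, g_min):
    """P[F(x) > g_min], Eq. (87), for a Gaussian constraint prediction."""
    if s_g <= 0.0:
        return 1.0 if g_hat > g_min else 0.0
    return Phi((g_hat - g_min) / s_g)

def constrained_ei(y_hat, s, y_min, g_hat, s_g, g_min):
    """Constrained expected improvement, Eq. (89): E[I(x)] * P[feasible]."""
    if s <= 0.0:
        ei = 0.0
    else:
        z = (y_min - y_hat) / s
        ei = (y_min - y_hat) * Phi(z) + s * phi(z)   # Eq. (81)
    return ei * prob_feasible(g_hat, s_g, g_min)
```

When the constraint model confidently predicts a violation the product is driven towards zero, while large constraint-model errors leave a finite expectation, exactly as argued above.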

[Figure 28 appears here; axis residue removed.]
Fig. 28. Predictions of the objective (dash) and constraint functions (thin solid) based on four sample points, with the constraint limit (here g_max, which we wish to be below) shown as a dash-dot line (first plot), the unconstrained E[I(x)] (second plot), the probability of meeting the constraint (third plot) and the constrained expected improvement (final plot).

[Figure 29 appears here; axis residue removed.]
Fig. 29. The build up of the constrained expected improvement after an infill point has been applied at the maximum constrained E[I(x)] in Fig. 28.
6. Multiple objectives

In aerospace design it is common to be aiming for light weight, low cost, robust, high performance systems. These aspirations are clearly in tension with each other, and so compromise solutions have to be sought. The final selection between such compromises inevitably involves deciding on some form of weighting between the goals. However, before this stage is reached it is possible to study design problems from the perspective of Pareto sets. A Pareto set of designs is one whose members are all optimal in some sense, but where the relative weighting between the competing goals is yet to be finally fixed (see for example [78]). More formally, a Pareto set of designs contains systems that are sufficiently optimized that, to improve the performance of any set member in any one goal function, its performance in at least one of the other functions must be made worse. The designs in the set are said to be non-dominated, in that no other set member exceeds a given design's performance in all goals. It is customary to illustrate a Pareto set by plotting the performance of its members against each goal function, see Fig. 30, where the two axes are for two competing goal functions that must both be minimized. The series of horizontal and vertical lines joining the set members is referred to as the Pareto front; any design lying above and to the right of this line is dominated by members of the set.

There are a number of technical difficulties associated with constructing Pareto sets. Firstly, the set members need to be optimal in some sense; since it is desirable to have a good range of designs in the set, an order of magnitude more optimization effort is usually required to produce such a set than to find a single design that is optimal against just one goal. Secondly, it is usually necessary to provide a wide and even coverage in the set in terms of the goal function space; since the mapping between design parameters and goal functions is usually highly non-linear, gaining such coverage is far from simple. Finally, and in common with single objective design, many problems of practical interest involve the use of expensive computer simulations to evaluate the performance of each candidate, and this means that only a limited number of such simulations can usually be afforded.

[Figure 30: axes f1(x) and f2(x); axis residue removed.]
Fig. 30. A Pareto set of five non-dominated points (•) for a problem with two objectives. The solid line is the Pareto front. The shaded area shows where new points would augment the Pareto front, while the hatched area is where new points would dominate and replace the existing set of non-dominated points.

Currently, there appear to be two popular ways of constructing Pareto sets. First, and most simply, one chooses a (possibly non-linear) weighting function to combine all the goals in the problem of interest into a single quantity and carries out a single objective optimization. The weighting function is then changed and the process repeated. By slowly working through a range of weightings it is possible to build up a Pareto set of designs. In a similar vein, one can also search a single objective at a time, while constraining the other objectives; slowly working through a range of constraint values, a Pareto set can be populated. However, it is by no means clear what weighting function or constraint values to use, and how to alter them so as to be able to reach all parts of the potential design space (and thus to have a wide ranging Pareto set). In particular, the weighted single objective method will miss Pareto optimal points if the front is not convex and the weighting is linear. If it is non-linear this can be avoided, but then the form of the function to use must be decided upon.

In an attempt to address this limitation, designers have turned to a second way of constructing Pareto sets via the use of population-based search schemes. In such schemes a set of designs is worked on concurrently and evolved towards the final Pareto set in one process. In doing this, designs are compared to each other and progressed if they are of high quality and if they are widely spaced apart from other competing designs. Moreover, such schemes usually avoid the need for an explicit weighting function to combine the goals being studied. Perhaps the best known of these schemes is the NSGA-II method introduced by Deb et al. [79].

To overcome the problem of long run-times, a number of workers have advocated the use of surrogate modelling approaches within Pareto front frameworks [80,81]. It is also possible to combine tools like NSGA-II with surrogates [82]. In such schemes an initial sampling plan is evaluated and surrogate models are built as per the single objective case, but now there is one surrogate for each goal function. In the NSGA-II approach the search is simply applied to the resulting surrogates and used to produce a Pareto set of designs. These designs are then used to form an infill point set and, after running full computations, the surrogates are refined and the approach continued. Although sometimes quite successful, this approach suffers from an inability to balance explicitly exploration and exploitation in the surrogate model construction, in just the same way as when using a prediction-based infill criterion in single objective search, although the crowding or niching measures normally used help mitigate these problems to some extent. Here we consider statistically based operators for use in surrogate model based multi-objective search so as to explicitly tackle this problem.

6.1. Multi-objective expected improvement

To begin with, consider a problem where we wish to minimize two objective functions f_1(x) and f_2(x), which we can sample to find observed outputs y_1 and y_2. For simplicity, assume that x consists of just one design variable x (k = 1). By evaluating a sampling plan, X, we can obtain observed responses y_1 and y_2. This will allow us to identify the initial Pareto set of m designs that dominate all the others in the training set:

y*_{1,2} = {[y_1*(1)(x(1)), y_2*(1)(x(1))], [y_1*(2)(x(2)), y_2*(2)(x(2))], …, [y_1*(m)(x(m)), y_2*(m)(x(m))]}.

In this set the superscript * indicates that the designs are non-dominated. We may plot these results on the Pareto front axes as per Fig. 30. In that figure the solid line is the Pareto front, and the hatched and shaded areas represent locations where new designs would need to lie if they are to become members of the Pareto set.
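Identifying the non-dominated members of the training set, as described above, can be sketched with a naive pairwise scan (names are ours; both objectives are minimized, and a production code would use a faster sort-based algorithm):

```python
def pareto_set(points):
    """Return the non-dominated subset of a list of (f1, f2) tuples,
    where both objectives are to be minimized. A point is dominated if
    another distinct point is at least as good in both goals."""
    front = []
    for p in points:
        dominated = any(q[0] <= p[0] and q[1] <= p[1] and q != p
                        for q in points)
        if not dominated:
            front.append(p)
    return sorted(front)
```

Sorting the result by the first objective yields the staircase-ordered front assumed in the rectangle decomposition that follows.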
Note that if new designs lie in the shaded area they augment the set and that if they lie in the hatched area they will replace at least one member of the set (since they will then dominate some members of the old set). It is possible to set up our new metric such that an improvement is achieved if we can augment the set or, alternatively, only if we can dominate at least one set member—here we consider the latter metric.

Given the training set it is possible to build a pair of Gaussian process based models (e.g. Kriging models). As when dealing with constrained surrogates, it is assumed that these models are independent (though it is also possible to build correlated models by using co-Kriging, as per Section 3.7.2). The Gaussian processes have means $\hat{y}_1(\mathbf{x})$ and $\hat{y}_2(\mathbf{x})$ (from Eq. (20)), and variances $s_1^2(\mathbf{x})$ and $s_2^2(\mathbf{x})$ (from Eq. (22)). These values may then be used to construct a two-dimensional Gaussian probability density function for the predicted responses of the form

$$\phi(Y_1, Y_2) = \frac{1}{s_1(\mathbf{x})\sqrt{2\pi}} \exp\left[-\frac{(Y_1(\mathbf{x}) - \hat{y}_1(\mathbf{x}))^2}{2 s_1^2(\mathbf{x})}\right] \times \frac{1}{s_2(\mathbf{x})\sqrt{2\pi}} \exp\left[-\frac{(Y_2(\mathbf{x}) - \hat{y}_2(\mathbf{x}))^2}{2 s_2^2(\mathbf{x})}\right], \qquad (90)$$

where it is made explicitly clear that $\hat{y}_1(\mathbf{x})$, $s_1^2(\mathbf{x})$, $\hat{y}_2(\mathbf{x})$ and $s_2^2(\mathbf{x})$ are all functions of the location at which an estimate is being sought. Clearly this joint pdf accords with the predicted mean and errors coming from the two Kriging models at $\mathbf{x}$. When seeking to add a new point to the training data we wish to know the likelihood that any newly calculated point will be good enough to become a member of the current Pareto set and, when comparing competing potential designs, which will improve the Pareto set most.

We first consider the probability that a new design at $\mathbf{x}$ will dominate a single member of the existing Pareto set, say $[y_1^{(1)}, y_2^{(1)}]$. For a two-objective problem this may arise in one of three ways: either the new point improves over the existing set member in goal one, or in goal two, or in both (see Fig. 31). The probability of the new design being an improvement is simply $P[Y_1(\mathbf{x}) < y_1^{(i)} \cap Y_2(\mathbf{x}) < y_2^{(i)}]$, which is given by integrating the volume under the joint probability density function, i.e. by integrating over the hatched area in Fig. 31.

Next consider the probability that the new point is an improvement given all the points in the Pareto set. Now we must integrate over the hatched (and possibly the shaded) area in Fig. 30. We can distinguish whether we want the new point to augment the existing Pareto set or dominate at least one set member by changing the area over which the integration takes place. Here we will consider only points which dominate the set (for formulations which deal with Pareto set augmentation see Keane [83]). Carrying out the desired integral is best done by considering the various rectangles that comprise the hatched area in Fig. 30 and this gives

$$P[Y_1(\mathbf{x}) < y_1^{\star} \cap Y_2(\mathbf{x}) < y_2^{\star}] = \int_{-\infty}^{y_1^{\star(1)}} \int_{-\infty}^{\infty} \phi(Y_1, Y_2)\, \mathrm{d}Y_2\, \mathrm{d}Y_1 + \sum_{i=1}^{m-1} \int_{y_1^{\star(i)}}^{y_1^{\star(i+1)}} \int_{-\infty}^{y_2^{\star(i+1)}} \phi(Y_1, Y_2)\, \mathrm{d}Y_2\, \mathrm{d}Y_1 + \int_{y_1^{\star(m)}}^{\infty} \int_{-\infty}^{y_2^{\star(m)}} \phi(Y_1, Y_2)\, \mathrm{d}Y_2\, \mathrm{d}Y_1. \qquad (91)$$

This is the multi-objective equivalent of the $P[I(\mathbf{x})]$ formulation in Section 4.3.1. It will work irrespective of the relative scaling of the objectives being dealt with. When used as an infill criterion it will not, however, necessarily encourage very wide ranging exploration since it is not biased by the degree of improvement being achieved. To do this we must consider the first moment of the integral, as before when dealing with single objective problems.

The equivalent improvement metric we require for the two objective case will be the first moment of the joint probability density function integral taken over the area where improvements occur, calculated about the current Pareto front. Now, while it is simple to understand the region over which the integral is to be taken (it is just the same as in Eq. (91)), the moment arm about the current Pareto front is a less obvious concept. To understand what is involved, it is useful to return to the geometrical interpretation of $E[I(\mathbf{x})]$ (shown in Fig. 22 for the single objective case). $P[I(\mathbf{x})]$ represents integration over the probability density function in the area below and to the left of the Pareto front where improvements can occur. $E[I(\mathbf{x}^{\star})]$ (we will use the superscript $\star$ to denote the multi-objective formulation) is the first moment of the integral over this area, about the Pareto front. Now the distance the centroid of the $E[I(\mathbf{x}^{\star})]$ integral lies from the front is simply $E[I(\mathbf{x}^{\star})]$ divided by $P[I(\mathbf{x}^{\star})]$, see Fig. 32. Given this position and $P[I(\mathbf{x}^{\star})]$ it is simple to calculate $E[I(\mathbf{x}^{\star})]$ based on any location along the front. Hence we first calculate $P[I(\mathbf{x}^{\star})]$ and the location of the centroid of its integral, $(\bar{Y}_1, \bar{Y}_2)$ (by integration with respect to the origin and division by $P[I(\mathbf{x}^{\star})]$). It is then possible to establish the Euclidean distance, $D$, that the centroid lies from each member of the Pareto set. The expected improvement criterion is subsequently calculated using the set member closest to the centroid, $(y_1^{\star}(\mathbf{x}^{\star}), y_2^{\star}(\mathbf{x}^{\star}))$, by taking the product of the volume under the probability density function with the Euclidean distance between this member and

[Fig. 31. Improvements possible from a single point in the Pareto set: improvement in $f_1(\mathbf{x})$, improvement in $f_2(\mathbf{x})$, or improvement in both functions.]

[Fig. 32. Centroid of the probability integral and moment arm used in calculating $E[I(\mathbf{x}^{\star})]$, also showing the predicted position of the currently postulated update.]
the centroid, shown by the arrow in Fig. 32. This leads to the following definition of $E[I(\mathbf{x}^{\star})]$:

$$E[I(\mathbf{x}^{\star})] = P[I(\mathbf{x}^{\star})] \sqrt{(\bar{Y}_1(\mathbf{x}) - y_1^{\star}(\mathbf{x}^{\star}))^2 + (\bar{Y}_2(\mathbf{x}) - y_2^{\star}(\mathbf{x}^{\star}))^2}, \qquad (92)$$

where

$$\bar{Y}_1(\mathbf{x}) = \left\{ \int_{-\infty}^{y_1^{\star(1)}} \int_{-\infty}^{\infty} Y_1 \phi(Y_1, Y_2)\, \mathrm{d}Y_2\, \mathrm{d}Y_1 + \sum_{i=1}^{m-1} \int_{y_1^{\star(i)}}^{y_1^{\star(i+1)}} \int_{-\infty}^{y_2^{\star(i+1)}} Y_1 \phi(Y_1, Y_2)\, \mathrm{d}Y_2\, \mathrm{d}Y_1 + \int_{y_1^{\star(m)}}^{\infty} \int_{-\infty}^{y_2^{\star(m)}} Y_1 \phi(Y_1, Y_2)\, \mathrm{d}Y_2\, \mathrm{d}Y_1 \right\} \Big/ P[I(\mathbf{x}^{\star})], \qquad (93)$$

and $\bar{Y}_2(\mathbf{x})$ is defined similarly.

When defined in this way $E[I(\mathbf{x}^{\star})]$ varies with the location of the predicted position of the currently postulated update $(\hat{y}_1, \hat{y}_2)$—also shown in Fig. 32, and also with the estimated errors in this prediction, $s_1$ and $s_2$, since it is these quantities that define the probability density function being integrated.

The further the predicted update location lies below and to the left of the current Pareto front the further the centroid will lie from the front. Moreover, the further the prediction lies in this direction the closer the integral becomes to unity (since the greater the probability of the update offering an improvement). Both tendencies will drive updates to be improved with regard to the design objectives. Note that if there is a significant gap in the points forming the existing Pareto front, then centroidal positions lying in or near such a gap will score proportionately higher values of $E[I(\mathbf{x}^{\star})]$, since the Euclidean distances to the nearest point will then be greater. This pressure will tend to encourage an even spacing in the front as it is updated. Also, when the data points used to construct the Gaussian process model (i.e., all points available and not just those in the Pareto set) are widely spaced, the error terms will be larger and this tends to further increase exploration. Thus this $E[I(\mathbf{x}^{\star})]$ definition balances exploration and exploitation in just the same way as its one-dimensional equivalent.

When calculating the location of the centroid there is still no requirement to scale the objectives being studied but, when deciding which member of the current Pareto set lies closest to the centroid, relative scaling will be important (i.e., when calculating the Euclidean distance). This is an unavoidable and difficult issue that arises whenever explicitly attempting to space out points along the Pareto front, whatever method is used to do this.

Again we will use our one variable test function for illustration. The first plot in Fig. 33 shows two objective functions, the first of which is that used in the previous examples. Starting from a three point initial sample, the first four infill points, based on maximizing the dual-objective expected improvement (Eq. (92)), are all Pareto optimal. Further updates lead to the location of the global optimum of objective one, which represents another part of the Pareto front, as shown in Fig. 34.

It is worth noting that there is no fundamental difficulty in extending this form of analysis to problems with more than two goal functions. This does, of course, increase the dimensionality of the Pareto surfaces being dealt with, and so inevitably complicates further the expressions needed to calculate the improvement metrics. Nonetheless, they always remain expressible in closed form; it always being possible to define the metrics in terms of summations over known integrable functions.

[Fig. 34. Further updates locate the global optimum of objective one, which is also, naturally, Pareto optimal.]
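Because the joint density of Eq. (90) is a product of independent Gaussians, each rectangle integral in Eqs. (91) and (93) factorizes into univariate normal CDF terms and partial first moments, both available in closed form, so no numerical quadrature is needed. The following sketch illustrates this decomposition in Python with SciPy; the function names, interface and returned pair are our own illustrative choices, not constructs from the paper:

```python
import numpy as np
from scipy.stats import norm


def _partial_mean(mu, s, a, b):
    """Closed form of int_a^b y N(y; mu, s^2) dy; a, b may be +/- inf."""
    alpha, beta = (a - mu) / s, (b - mu) / s
    return mu * (norm.cdf(beta) - norm.cdf(alpha)) - s * (norm.pdf(beta) - norm.pdf(alpha))


def multiobjective_improvement(mu1, s1, mu2, s2, front):
    """P[I(x*)] of Eq. (91) and the Euclidean-distance E[I(x*)] of Eq. (92)
    for independent Gaussian predictions N(mu1, s1^2) and N(mu2, s2^2).

    `front` is an (m, 2) array of non-dominated [y1, y2] points sorted by
    increasing first objective (hence decreasing second objective).
    """
    front = np.asarray(front, dtype=float)
    y1, y2 = front[:, 0], front[:, 1]
    inf = np.inf
    F1 = lambda y: norm.cdf((y - mu1) / s1)  # marginal CDF, objective 1
    F2 = lambda y: norm.cdf((y - mu2) / s2)  # marginal CDF, objective 2

    # Rectangles (a1, b1) x (a2, b2) tiling the improvement region of Fig. 30:
    # the open strip left of the front, the steps beneath it, and the strip
    # below its right-most member.
    rects = [(-inf, y1[0], -inf, inf)]
    rects += [(y1[i], y1[i + 1], -inf, y2[i + 1]) for i in range(len(y1) - 1)]
    rects += [(y1[-1], inf, -inf, y2[-1])]

    prob, m1, m2 = 0.0, 0.0, 0.0
    for a1, b1, a2, b2 in rects:
        p1, p2 = F1(b1) - F1(a1), F2(b2) - F2(a2)
        prob += p1 * p2                              # Eq. (91), term by term
        m1 += _partial_mean(mu1, s1, a1, b1) * p2    # numerator of Y1-bar, Eq. (93)
        m2 += p1 * _partial_mean(mu2, s2, a2, b2)    # numerator of Y2-bar
    if prob <= 0.0:
        return 0.0, 0.0
    centroid = np.array([m1 / prob, m2 / prob])           # (Y1-bar, Y2-bar)
    d = np.min(np.linalg.norm(front - centroid, axis=1))  # nearest set member
    return prob, prob * d                                 # P[I(x*)], E[I(x*)]
```

For a single front point at the origin and standard-normal predictions centred on it, the probability term evaluates to 0.75 (0.5 from the strip to the left of the point plus 0.25 from the strip below it), confirming the rectangle bookkeeping.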

[Fig. 33. The first four infill points, positioned at the maximum expectation of improvement, are all Pareto optimal.]

7. Discussion and recommendations

We have covered a range of surrogate modelling methods and infill criteria and have noted the pros and cons of each method along the way. We will now provide some more general thoughts on the applicability of the methods we have discussed.
Table 1
A taxonomy of surrogate methods (✓ = suited; sample plan : infill points ratio of ∞:1 for comprehension, >2:1 for local search, ≈1:2 for P[I(x)] and E[I(x)], and <1:2 for goal seeking and the conditional lower bound).

| | Comprehension: simple landscape | Comprehension: complex landscape | Local search | P[I(x)], E[I(x)] | Goal seeking | Conditional lower bound |
| k > 20, n > 500: SVR | ✓ | ✓ | ✓ | | | |
| Fixed bases, e.g. cubic, thin plate | ✓ | | ✓ | | ✓ | |
| Polynomials | ✓ | | | | | |
| k < 20, n < 500: MLS, parametric bases, e.g. multiquadric | ✓ | ✓ | ✓ | | ✓ | |
| Gaussian bases, e.g. Kriging | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
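Before committing to one row of Table 1, the validation-based selection discussed in the following text can be prototyped cheaply: candidate surrogates are ranked by a leave-one-out cross-validation error, which reuses the training data rather than holding out a separate validation set. A minimal sketch in plain NumPy (the test function, Gaussian kernel width and diagonal jitter are our own illustrative assumptions, not values from the paper):

```python
import numpy as np


def loo_rmse(fit_predict, X, y):
    """Leave-one-out cross-validation RMSE of a surrogate builder.
    `fit_predict(X_train, y_train, X_test)` must return predictions."""
    errs = []
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        pred = fit_predict(X[mask], y[mask], X[i:i + 1])
        errs.append(pred[0] - y[i])
    return float(np.sqrt(np.mean(np.square(errs))))


def quadratic(X_tr, y_tr, X_te):
    """Second-order polynomial surrogate (one design variable)."""
    coef = np.polyfit(X_tr.ravel(), y_tr, 2)
    return np.polyval(coef, X_te.ravel())


def gaussian_rbf(X_tr, y_tr, X_te, width=0.2):
    """Interpolating Gaussian radial basis function surrogate."""
    d2 = (X_tr[:, None, 0] - X_tr[None, :, 0]) ** 2
    K = np.exp(-d2 / width ** 2) + 1e-8 * np.eye(len(X_tr))  # jitter for stability
    w = np.linalg.solve(K, y_tr)
    d2_te = (X_te[:, None, 0] - X_tr[None, :, 0]) ** 2
    return np.exp(-d2_te / width ** 2) @ w


# A multimodal one-variable test function (an illustrative stand-in)
X = np.linspace(0.0, 1.0, 12)[:, None]
y = (6.0 * X.ravel() - 2.0) ** 2 * np.sin(12.0 * X.ravel() - 4.0)

scores = {"quadratic": loo_rmse(quadratic, X, y),
          "gaussian RBF": loo_rmse(gaussian_rbf, X, y)}
best = min(scores, key=scores.get)
```

Whichever model attains the lower leave-one-out RMSE, `best`, would be retained; on a multimodal function such as this an interpolating basis typically beats a quadratic trend, but the point is that the ranking uses only the observed training data, exactly as the cross-validation strategy described below requires.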
The suitability of each method for various types of problem is shown in Table 1, which we have taken from Forrester et al. [8]. Naturally there are exceptions to every rule and it is risky to make such generalizations on the applicability of methods. The table does, however, give a concise view of the context in which we see each method.

Working through Fig. 1, while referring to Table 1, after any preliminary experiments we may wish to conduct to reduce the dimensionality of the problem, we must choose the number of points which our initial sampling plan will comprise. Assuming there is a maximum budget of function evaluations, we will define the number of points as a fraction of this budget. If our aim is purely to create an accurate model for visualization and design space comprehension, our sampling plan could contain all of our budgeted points. However, it is likely to be beneficial to position some points where it is believed that the error in the surrogate is high. Error estimates which reflect the possibility of varying function activity across the design space will be of most use here, e.g. from non-stationary Kriging [66]. If we are using the surrogate as the basis of an infill criterion, we must save some points for that process. For an exploitation-based criterion, most of the points, i.e. more than one half, should be in the initial sample because only a small amount of surrogate model enhancement is possible during the infill process. A notable exception is the trust-region example in Fig. 21, where we started from just one. If a two-stage balanced exploitation/exploration infill criterion is to be employed, Sóbester et al. [84] have shown that approximately one third of the points should be in the initial sample, with the majority saved for the infill stage. The one-stage methods rely less on the initial prediction and so fewer points are required.

The choice of which surrogate to use should be based on the problem size, that is $k$, the expected complexity, the cost of the analyses the surrogate is to be used in lieu of, and the form of infill strategy that will follow. As discussed at the end of Section 3.2, polynomial models make good surrogates of cheap analyses, following simple trends, in low dimensions. MLS (see Section 3.3) can model more complex landscapes, but is still limited to lower dimensions (for the same reasons as polynomials), and its added expense means that it may not be cheaper than some quick analyses. Fixed bases RBFs (see Section 3.4) are suitable for higher-dimensional, but simple, landscapes and can be used in lieu of cheap analyses. SVR (see Section 3.6) with a fixed kernel also fits somewhere in this category, though the initial cost of training the model is higher than for RBFs. Our most complex surrogates—Kriging (see Section 3.5) and parametric RBFs, including parametric SVRs—can only be used for relatively low-dimensional problems due to the expense of training the model, but these methods have the potential to provide more accurate predictions.

Often the choice of surrogate modelling method will be dictated by the infill criteria we wish to apply. When this is not the case, although we can pigeon-hole which surrogate is likely to perform best for a given problem (as we have in Table 1), a more educated choice can be made using various model selection and validation criteria. The accuracy of a number of surrogates could be compared by assessing their ability to predict a validation data set. Such a strategy will require a portion of observed data to be set aside for validation purposes only, making this impractical when observations are expensive. More likely is that cross-validation or bootstrapping errors [85] will be compared when selecting the most accurate surrogate. These methods rely only on the observed data used to construct the surrogate being assessed. Recently, rather than selecting one surrogate which appears to have better generalization properties, as determined by some validation metric, Goel et al. [86] have tried using a weighted average of an ensemble of surrogates. While Kriging, for example, might accurately predict some non-linear aspect of a function, a polynomial may better capture the underlying trend. By combining these two methods (and maybe others) in a weighted average, better generalization could be achieved. We would argue, though, that blind Kriging (see Section 3.5.2) can do this in a more mathematically rigorous manner. Ensembles or committees (as they are known in the machine learning literature [87]) are a powerful concept; indeed, blind Kriging could be viewed as a form of committee model. These 'Jack-of-all-trades' methods seem likely to find increasing favour in problems where the nature of the design landscapes is unknown.

We have already taken an in-depth look at the various infill criteria and Table 1 shows which surrogate types these marry to. Essentially, for a surrogate to be suited to some form of search-infill process, the surrogate must have the capacity to modify its shape to fit any complex local behaviour the true function may exhibit. Thus, polynomials must be excluded, since, for practical purposes, there is a limitation on the order of polynomial which can be used. From Figs. 19 and 20 we see how an interpolating surrogate converges on an optimum. We stop short of saying that the surrogate must interpolate the data, since SVR and regressing Kriging and RBFs will converge towards the optimum of a noisy function to an accuracy determined by the noise, not by deficiencies in the surrogate. For a global search we need some form of error estimate for predictions made by the surrogate (coupled with the above requirements). Thus, of the methods reviewed in this paper, we are limited to the Gaussian process based methods, although Gutmann [69] has employed a one-stage goal seeking approach for a variety of RBFs.

We have not yet looked at when to stop the iterative process in Fig. 1. Choosing a suitable convergence criterion to determine when to stop the surrogate infill process is rather subjective. Goal seeking is the obvious winner in terms of convergence criteria and nothing need be added to the method itself. When choosing infill points based on minimizing the prediction (exploitation), the convergence criterion is simple: we stop when the change in a number of successive infill point objective values is small. Maximum error based infill (exploration) is likely to be stopped
when some generalization error metric, e.g. cross-validation, drops below a certain threshold.

When using the probability or expectation of improvement, we can simply stop when the probability is very low or the expectation is smaller than a percentage of the range of observed objective function values. Care should, however, be taken, since the estimated MSE of Gaussian process based models is often an under-estimator and the search may be stopped prematurely. It is wise to set an overly stringent threshold and wait for a consistently low $P[I(\mathbf{x})]$, $E[I(\mathbf{x})]$ or $E[I(\mathbf{x}^{\star})]$.

When minimizing a lower bound there is no quantitative indicator of convergence and we are limited to the convergence criteria used for exploitation. Unfortunately, an infill strategy may dwell in the region of a local minimum before jumping to another, so we cannot guarantee that a series of similar objective values means that the global optimum has been found.

In many real engineering problems we actually stop when we run out of available time or resources, dictated by design cycle scheduling or costs.

Final thoughts: The above discussion gives no definitive answers, and deliberately so. This is because a method which is universally better than all others is yet to present itself. While we wait for it to do so, we must choose our surrogate-based optimization methodology carefully. Although the exact choice of methodology may be problem dependent, one underlying trait that any surrogate-based optimization must include is some form of repetitive search and infill process to ensure the surrogate is accurate in regions of interest. Other considerations in terms of model selection, validation and infill criteria are secondary to this key requirement.

Acknowledgements

We are grateful for the input and advice of Danie Krige, Donald Jones, András Sóbester, Prasanth Nair and Rafi Haftka.

References

[1] Queipo NV, Haftka RT, Shyy W, Goel T, Vaidyanathan R, Tucker PK. Surrogate-based analysis and optimization. Progress in Aerospace Sciences 2005;41:1–28.
[2] Simpson TW, Toropov V, Balabanov V, Viana FAC. Design and analysis of computer experiments in multidisciplinary design optimization: a review of how far we have come—or not. In: 12th AIAA/ISSMO multidisciplinary analysis and optimization conference, Victoria, British Columbia, 10–12 September, 2008.
[3] Sacks J, Welch WJ, Mitchell TJ, Wynn HP. Design and analysis of computer experiments. Statistical Science 1989;4(4):409–23.
[4] Morris MD. Factorial sampling plans for preliminary computational experiments. Technometrics 1991;33(2):161–74.
[5] Johnson ME, Moore LM, Ylvisaker D. Minimax and maximin distance designs. Journal of Statistical Planning and Inference 1990;26:131–48.
[6] McKay MD, Beckman RJ, Conover WJ. A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics 1979;21(2):239–45.
[7] Morris MD, Mitchell TJ. Exploratory designs for computational experiments. Journal of Statistical Planning and Inference 1995;43:381–402.
[8] Forrester AIJ, Sóbester A, Keane AJ. Engineering design via surrogate modelling: a practical guide. Chichester: Wiley; 2008.
[9] Hastie T, Tibshirani R, Friedman J. The elements of statistical learning. New York: Springer; 2001.
[10] Box GEP, Draper NR. Empirical model building and response surfaces. New York: Wiley; 1987.
[11] Cherkassky V, Mulier F. Learning from data—concepts, theory, and methods. New York: Wiley; 1998.
[12] Ralston A, Rabinowitz P. A first course in numerical analysis. New York: McGraw-Hill; 1978.
[13] Myers RH, Montgomery DC. Response surface methodology: process and product optimization using designed experiments. New York: Wiley; 1995.
[14] Goel T, Haftka R. Comparing error estimation measures for polynomial and Kriging approximation of noise-free functions. Structural and Multidisciplinary Optimization 2008; in press, doi:10.1007/s00158-008-0290-z.
[15] Lancaster P, Salkauskas K. Surfaces generated by moving least squares methods. Mathematics of Computation 1981;37(155):141–58.
[16] Levin D. The approximation power of moving least-squares. Mathematics of Computation 1998;67(224):1517–31.
[17] Aitken AC. On least squares and linear combinations of observations. Proceedings of the Royal Society of Edinburgh 1935;55:42–8.
[18] Toropov VV, Schramm U, Sahai A, Jones RD, Zeguer T. Design optimization and stochastic analysis based on the moving least squares method. In: 6th World congress of structural and multidisciplinary optimization, Rio de Janeiro, 30th May–3rd June, 2005.
[19] Kim C, Wang S, Choi KK. Efficient response surface modeling by using moving least-squares method and sensitivity. AIAA Journal 2005;43(11):2404–11.
[20] Ho SL, Yang S, Ni P, Wong HC. Developments of an efficient global optimal design technique—a combined approach of MLS and SA algorithm. COMPEL 2002;21(4):604–14.
[21] Broomhead DS, Lowe D. Multivariable functional interpolation and adaptive networks. Complex Systems 1988;2:321–55.
[22] Sóbester A. Enhancements to global optimisation. PhD thesis, University of Southampton, Southampton, UK; October 2003.
[23] Vapnik V. Statistical learning theory. New York: Wiley; 1998.
[24] Keane AJ, Nair PB. Computational approaches to aerospace design: the pursuit of excellence. Chichester: Wiley; 2005.
[25] Gibbs MN. Bayesian Gaussian processes for regression and classification. PhD dissertation, University of Cambridge; 1997.
[26] Poggio T, Girosi F. Regularization algorithms for learning that are equivalent to multilayer networks. Science 1990;247:978–82.
[27] Orr M. Regularisation in the selection of RBF centres. Neural Computation 1995;7(3):606–23.
[28] Cook RD, Nachtsheim CJ. A comparison of algorithms for constructing exact D-optimal designs. Technometrics 1980;22(3):315.
[29] Keane AJ. Design search and optimisation using radial basis functions with regression capabilities. In: Parmee IC, editor. Proceedings of the conference on adaptive computing in design and manufacture, vol. VI. Berlin: Springer; 2004. p. 39–49.
[30] Matheron G. Principles of geostatistics. Economic Geology 1963;58:1246–66.
[31] Krige DG. A statistical approach to some basic mine valuation problems on the Witwatersrand. Journal of the Chemical, Metallurgical and Mining Engineering Society of South Africa 1951;52(6):119–39.
[32] Jones DR. A taxonomy of global optimization methods based on response surfaces. Journal of Global Optimization 2001;21:345–83.
[33] Toal DJJ, Bressloff NW, Keane AJ. Kriging hyperparameter tuning strategies. AIAA Journal 2008;46(5):1240–52.
[34] Zhang Y, Leithead WE. Exploiting Hessian matrix and trust-region algorithm in hyperparameters estimation of Gaussian process. Applied Mathematics and Computation 2005;171:1264–81.
[35] Keane AJ. Wing optimization using design of experiment, response surface, and data fusion methods. Journal of Aircraft 2003;40(4):741–50.
[36] Cressie NAC. Statistics for spatial data, probability and mathematical statistics. Revised ed. New York: Wiley; 1993.
[37] Joseph VR, Hung Y, Sudjianto A. Blind Kriging: a new method for developing metamodels. ASME Journal of Mechanical Design 2008;130.
[38] Joseph VR. A Bayesian approach to the design and analysis of fractional experiments. Technometrics 2006;48:219–29.
[39] Wu CFJ, Hamada M. Experiments: planning, analysis, and parameter design optimization. New York: Wiley; 2000.
[40] Hoyle N. Automated multi-stage geometry parameterization of internal fluid flow applications. PhD thesis, University of Southampton, Southampton, UK; 2006.
[41] Forrester AIJ, Keane AJ, Bressloff NW. Design and analysis of 'noisy' computer experiments. AIAA Journal 2006;44(10):2331–9.
[42] Vapnik V. The nature of statistical learning theory. New York: Springer; 1995.
[43] Schölkopf B, Smola AJ. Learning with kernels. Cambridge, MA: MIT Press; 2002.
[44] Smola AJ, Schölkopf B. A tutorial on support vector regression. Statistics and Computing 2004;14:199–222.
[45] Clarke SM, Griebsch JH, Simpson TW. Analysis of support vector regression for approximation of complex engineering analyses. Journal of Mechanical Design 2005;127.
[46] Squire W, Trapp G. Using complex variables to estimate derivatives of real functions. SIAM Review 1998;40:110–2.
[47] Griewank A. Evaluating derivatives: principles and techniques of algorithmic differentiation. Frontiers in applied mathematics. Philadelphia: SIAM; 2000.
[48] Giles MB, Pierce NA. An introduction to the adjoint approach to design. Flow, Turbulence and Combustion 2000;65:393–415.
[49] Barthelemy BM, Haftka RT, Cohen GA. Physically based sensitivity derivatives for structural analysis programs. Computational Mechanics 1989;4(6):465–76.
[50] van Keulen F, Vervenne K. Gradient-enhanced response surface building. Structural and Multidisciplinary Optimization 2004;27:337–51.
[51] Santner TJ, Williams BJ, Notz WI. Design and analysis of computer experiments. Springer series in statistics. Berlin: Springer; 2003.
[52] Leary SJ, Bhaskar A, Keane AJ. A knowledge-based approach to response surface modelling in multifidelity optimization. Journal of Global Optimization 2003;26(3):297–319.
[53] Forrester AIJ, Bressloff NW, Keane AJ. Optimization using surrogate models and partially converged computational fluid dynamics simulations. Proceedings of the Royal Society A 2006;462(2071):2177–204.
[54] Bandler J, Cheng Q, Dakroury S, Mohamed A, Bakr M, Madsen K, et al. Space mapping: the state of the art. IEEE Transactions on Microwave Theory and Techniques 2004;52:337–61.
[55] Hevesi J, Flint A, Istok J. Precipitation estimation in mountainous terrain using multivariate geostatistics. Part II: isohyetal maps. Journal of Applied Meteorology 1992;31:677–88.
[56] Kennedy MC, O'Hagan A. Predicting the output from a complex computer code when fast approximations are available. Biometrika 2000;87(1):1–13.
[57] Forrester AIJ, Sóbester A, Keane AJ. Multi-fidelity optimization via surrogate modelling. Proceedings of the Royal Society A 2007;463(2088):3251–69.
[58] Jones DR, Schonlau M, Welch WJ. Efficient global optimization of expensive black-box functions. Journal of Global Optimization 1998;13:455–92.
[59] Alexandrov N, Dennis JE, Lewis RM, Torczon V. A trust region framework for managing the use of approximation models in optimization. Structural Optimization 1998;15:16–23.
[60] Haftka RT. Combining global and local approximations. AIAA Journal 1991;29(9):1523–5.
[61] Eldred MS, Giunta AA, Collis SS. Second-order corrections for surrogate-based optimization with model hierarchies. In: 10th AIAA/ISSMO multidisciplinary analysis and optimization conference, Albany, New York, 30–31 August 2004.
[62] Dennis JE, Schnabel RB. Numerical methods for unconstrained optimization and nonlinear equations. Englewood Cliffs, NJ: Prentice-Hall; 1983.
[63] Alexandrov NM, Lewis RM, Gumbert CR, Green LL, Newman PA. Approximation and model management in aerodynamic optimization with variable-fidelity models. Journal of Aircraft 2001;38(6):1093–101.
[64] Sobol IM. On the systematic search in a hypercube. SIAM Journal on Numerical Analysis 1979;16:790–3.
[65] Statnikov RB, Matusov JB. Multicriteria optimization and engineering: theory and practice. New York: Chapman & Hall; 1995.
[66] Xiong Y, Chen W, Apley D, Ding X. A non-stationary covariance-based Kriging method for metamodelling in engineering design. International Journal for Numerical Methods in Engineering 2007;71:733–56.
[67] Holden C. Visualization methodologies in aircraft design optimization. PhD thesis, University of Southampton, Southampton, UK; January 2004.
[68] Jones DR, Welch WJ. Global optimization using response surfaces. In: Fifth SIAM conference on optimization, Victoria, Canada, 20–22 May, 1996.
[69] Gutmann HM. A radial basis function method for global optimization. Journal of Global Optimization 2001;19(3):201–27.
[70] Sóbester A, Leary SJ, Keane AJ. On the design of optimization strategies based on global response surface approximation models. Journal of Global Optimization 2005;33:31–59.
[71] Forrester AIJ. Efficient global optimisation using expensive CFD simulations. PhD thesis, University of Southampton, Southampton, UK; November 2004.
[72] den Hertog D, Kleijnen JPC, Siem AYD. The correct Kriging variance estimated by bootstrapping. Journal of the Operational Research Society 2006;57(4):400–9.
[73] Forrester AIJ, Jones DR. Global optimization of deceptive functions with sparse sampling. In: 12th AIAA/ISSMO multidisciplinary analysis and optimization conference, Victoria, British Columbia, 10–12 September 2008.
[74] Sóbester A, Leary SJ, Keane AJ. A parallel updating scheme for approximating and optimizing high fidelity computer simulations. Structural and Multidisciplinary Optimization 2004;27:371–83.
[75] Schonlau M. Computer experiments and global optimization. PhD thesis, University of Waterloo, Waterloo, Ontario, Canada; 1997.
[76] Siddall JN. Optimal engineering design: principles and applications. New York: Marcel Dekker; 1982.
[77] Forrester AIJ, Sóbester A, Keane AJ. Optimization with missing data. Proceedings of the Royal Society A 2006;462(2067):935–45.
[78] Fonseca CM, Fleming PJ. An overview of evolutionary algorithms in multiobjective optimization. Evolutionary Computation 1995;3(1):1–16.
[79] Deb K, Pratap A, Agarwal S, Meyarivan T. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation 2002;6(2):182–97.
[80] Wilson B, Cappelleri D, Simpson W, Frecker M. Efficient Pareto frontier exploration using surrogate approximations. Optimization and Engineering 2001;2:31–50.
[81] Knowles J, Hughes EJ. Multiobjective optimization on a budget of 250 evaluations. In: Coello C, et al., editors. Evolutionary multi-criterion optimization (EMO-2005). Lecture notes in computer science, vol. 3410. Berlin: Springer; 2005.
[82] Voutchkov II, Keane AJ. Multi-objective optimization using surrogates. In: Proceedings of the 7th international conference on adaptive computing in design and manufacture (ACDM 2006), Bristol, 2006. p. 167–75. ISBN 0-9552885-0-9.
[83] Keane AJ. Statistical improvement criteria for use in multiobjective design optimization. AIAA Journal 2006;44(4):879–91.
[84] Sóbester A, Leary SJ, Keane AJ. A parallel updating scheme for approximating and optimizing high fidelity computer simulations. In: 3rd ISSMO/AIAA internet conference on approximations in optimization, 2002.
[85] Efron B. Estimating the error rate of a prediction rule: improvement on cross-validation. Journal of the American Statistical Association 1983;78(382):316–31.
[86] Goel T, Haftka R, Shyy W, Queipo NV. Ensemble of surrogates. Structural and Multidisciplinary Optimization 2007;33:199–216.
[87] Tresp V. A Bayesian committee machine. Neural Computation 2000;12:2719–41.
