Abstract
Preference-based optimization algorithms are iterative procedures that seek the optimal calibration of a decision vector based only on comparisons between pairs of different tunings. At each iteration, a human decision-maker expresses a preference between two calibrations (samples), highlighting which one, if any, is better than the other. The optimization procedure must use the observed preferences to find the tuning of the decision vector that is most preferred by the decision-maker, while also minimizing the number of comparisons. In this work, we formulate the preference-based optimization problem from a utility theory perspective. Then, we propose GLISp-r, an extension of a recent preference-based optimization procedure called GLISp. The latter uses a Radial Basis Function surrogate to describe the tastes of the decision-maker. Iteratively, GLISp proposes new samples to compare with the best calibration available by trading off exploitation of the surrogate model and exploration of the decision space. In GLISp-r, we propose a different criterion to use when looking for new candidate samples that is inspired by MSRS, a popular procedure in the black-box optimization framework. Compared to GLISp, GLISp-r is less likely to get stuck on local optima of the preference-based optimization problem. We motivate this claim theoretically, with a proof of global convergence, and empirically, by comparing the performance of GLISp and GLISp-r on several benchmark optimization problems.
1 Introduction
Preference-Based Optimization (PBO) algorithms seek the global solution of an optimization problem whose objective function is unknown and unmeasurable. Instead, the “goodness” of a decision vector is assessed by a human Decision-Maker (DM), who compares different samples (i.e. tunings of the decision vector) and states which of them he/she prefers. In this work, we consider queries to the DM that involve two options. The output of a query is the preference between the two calibrations of the decision vector expressed by the decision-maker (e.g. “this tuning is better than that one”) [3].
Preference-based optimization is closely related to industrial practice. In the context of control systems, the performance achieved by a regulator tuning is often evaluated by a decision-maker, who expresses his/her judgment by observing the behavior of the system under control. Multiple experiments must be performed until the DM is satisfied with the closed-loop performance. If a trial-and-error approach is adopted, there is no guarantee that the final tuning is optimal in any sense. Moreover, the calibration process might be quite time-consuming, since possibly many combinations of the controller parameters need to be tested. PBO algorithms constitute a better alternative to the trial-and-error methodology because they drive the experiments by exploiting the information carried by the preferences expressed by the DM. The goal is to seek the calibration of the decision vector that is most preferred by the decision-maker while also minimizing the number of queries, thus performing fewer experiments. Successful applications of preference-based optimization procedures include [39], where the authors use algorithm GLISp [3] to tune the controller for a continuous stirred-tank reactor and for autonomous driving vehicles. The same algorithm has been employed in [30] to calibrate the parameters of a velocity planner for robotic sealing tasks.
Although preference-based optimization is a valuable tool for solving those calibration tasks that involve human decision-makers, its applicability is broader. In multi-objective optimization, PBO can be employed to scalarize the multi-objective optimization problem into a single objective one by tuning the weights for the weighted sum method, or to choose which, among a set of Pareto optimal solutions, is the most suited for the DM [3, 21]. PBO can also be used to perform active preference learning. In general, preference learning methods estimate a predictive model that describes the tastes of a human decision-maker [12]. Preference-based optimization methods also rely on a predictive model (called surrogate model), which is updated after each query to the DM, but its prediction accuracy is not the main concern. Rather, the sole purpose of the surrogate model is to drive the search towards the most preferred tuning of the decision vector.
The preference-based optimization problem can be formalized by applying notions of utility theory [27]. Suppose that there exists a binary relation, called the preference relation, which describes the tastes of the decision-maker (i.e. the outputs of the queries). If the preference relation of the DM exhibits certain properties, then it is possible to represent it with a continuous (latent) utility functionFootnote 1 [8], which assigns an abstract degree of “goodness” to all possible calibrations of the decision vector. The tuning that is most preferred by the decision-maker is the one with the highest utility and corresponds to the optimizer of the PBO problem.
As previously mentioned, many preference-based optimization algorithms rely on a surrogate model, which is an approximation of the latent utility function built from the preference information at hand. In general, any procedure that uses a surrogate model when solving an optimization problem is said to be a surrogate-based (or response surface) method. These algorithms are mostly used for black-box optimization problems, where the objective function is unknown but measurable (see [19, 36]). Nevertheless, surrogate-based methods can also be employed for PBO problems, provided that the surrogate model is properly defined. In practice, preference-based response surface methods iteratively propose new samples to be compared to the available best candidate by trading off exploitation (or local search), i.e. selecting samples that are likely to offer an improvement based on the data at our disposal, and exploration (or global search), i.e. finding samples in regions of the decision space of which we have little to no knowledge. Typically, new candidate samples are sought by minimizing or maximizing an acquisition function that encapsulates these two aspects.
Most surrogate models either rely on Gaussian Processes (\(\mathcal{G}\mathcal{P}\)s) [38], giving rise to preferential Bayesian optimization [13], or Radial Basis Functions (RBFs) [15]. For example, in [7] the authors propose a predictive model for the latent utility function that is based on \(\mathcal{G}\mathcal{P}\)s. The latter is used as a surrogate model in [6] to carry out preference-based optimization. Alternative preferential Bayesian optimization algorithms are proposed in [4] and in [13]. Recently, in [3] the authors developed a preference-based optimization method, called GLISp, which is based on a RBF approximation of the latent utility function.
In this work, we propose an extension of the GLISp [3] algorithm (that we will refer to as GLISp-r) which is more robust in finding the global solution of the preference-based optimization problem. To do so, we:
-
1.
Address some limitations of the acquisition function used in [3], which can cause the procedure to miss the global solution and get stuck on local optima;
-
2.
Dynamically vary the exploration–exploitation trade-off weight, allowing GLISp-r to alternate between local and global search. This is commonly done in the black-box optimization framework, see for example [15, 28, 29, 37], but it has not been tried for GLISp [3];
-
3.
Provide a proof of global convergence for GLISp-r, further motivating its robustness. Currently, no such proof is available for GLISp [3] (or for any of the most popular preference-based response surface methods, cf. also [4, 6, 13]).
Regarding this last point, as a minor contribution of this work, we formalize the preference-based optimization problem from a utility theory perspective, allowing us to analyze the existence of a solution and, ultimately, to prove the global convergence of GLISp-r.
This paper is organized as follows. In Sect. 2, we introduce the preference-based optimization problem. Section 3 addresses how to build the surrogate model (as in GLISp [3]) and look for the next sample to evaluate, keeping in mind the exploration–exploitation dilemma. We also briefly cover the exploration function used in [3]. The latter is thoroughly analyzed in Sect. 4, where we propose a solution to the shortcomings that we have encountered in GLISp [3]. Section 5 describes algorithm GLISp-r and addresses its convergence. Then, Sect. 6 compares the performances of GLISp-r with GLISp [3] and C-GLISp [40], a revisitation of the latter method by the same authors, on several benchmark optimization problems. Finally, Sect. 7 gives some concluding remarks.
2 Problem formulation
Let \(\varvec{x} \in \mathbb {R}^n, \varvec{x} = \begin{bmatrix}x^{(1)}&\ldots&x^{(n)}\end{bmatrix}^{\top },\) be the n-dimensional decision vector and suppose that we are interested in finding its calibration that is most preferred by the decision-maker within a subset of \(\mathbb {R}^{n}\), namely \(\Omega \subset \mathbb {R}^{n}\). In particular, we define the constraint set \(\Omega \) as:
$$\begin{aligned} \Omega = \left\{ \varvec{x} \in \mathbb {R}^{n}: \varvec{l} \le \varvec{x} \le \varvec{u}, \ \varvec{g_{ineq}}\left( \varvec{x}\right) \le \varvec{0}_{q_{ineq}}, \ \varvec{g_{eq}}\left( \varvec{x}\right) = \varvec{0}_{q_{eq}} \right\} . \end{aligned}$$
(1)
In (1), \(\varvec{l}, \varvec{u} \in \mathbb {R}^{n}, \varvec{l} \le \varvec{u},\) are the lower and upper bounds on the decision vector, while \(\varvec{g_{ineq}}: \mathbb {R}^{n} \rightarrow \mathbb {R}^{q_{ineq}}\) and \(\varvec{g_{eq}}: \mathbb {R}^{n} \rightarrow \mathbb {R}^{q_{eq}}\) are the constraint functions associated with the inequality and equality constraints respectively (of which there are \(q_{ineq} \in \mathbb {N} \cup \left\{ 0 \right\} \) and \(q_{eq} \in \mathbb {N} \cup \left\{ 0 \right\} \)). Notation-wise, \(\varvec{0}_{q_{ineq}}\) represents the \(q_{ineq}\)-dimensional zero column vector (and similarly for \(\varvec{0}_{q_{eq}}\)). We suppose that: (i) all of the constraints in (1) are completely known and (ii) \(\Omega \) includes, at least, the bound constraints \(\varvec{l}\le \varvec{x}\le \varvec{u}\) (the remaining equality and inequality constraints can be omitted, resulting in \(q_{ineq} = q_{eq} = 0\)).
In this work, we formalize the preference-based optimization problem from a utility theory [27] perspective. We start by introducing preference relations, a particular kind of binary relations that play a key role in PBO.
Definition 1
(Binary relation [27]) Consider the constraint set \(\Omega \) in (1); we define a generic binary relation \(\mathcal {R}\) on \(\Omega \) as a subset \(\mathcal {R} \subseteq \Omega \times \Omega \).
Notation-wise, given two samples \(\varvec{x}_{i}, \varvec{x}_{j} \in \Omega \), we denote the ordered pairs for which the binary relation holds, \(\left( \varvec{x}_{i}, \varvec{x}_{j}\right) \in \mathcal {R}\), as \(\varvec{x}_{i} \mathcal {R} \varvec{x}_{j}\).
Definition 2
(Preference relation [27]) A preference relation, \(\succsim \subseteq \Omega \times \Omega \), is a binary relation that describes the tastes of a human decision-maker.
In the context of utility theory, \(\varvec{x}_{i} \succsim \varvec{x}_{j}\) implies that the DM with preference relation \(\succsim \) on \(\Omega \) deems the alternative \(\varvec{x}_{i}\) at least as good as \(\varvec{x}_{j}\). We say that the decision-maker is rational (in an economics sense) if his/her preference relation exhibits certain properties, as highlighted by the following Definition.
Definition 3
(Rational decision-maker [27]) Consider a decision-maker with preference relation \(\succsim \) on \(\Omega \). We say that the DM is rational if \(\succsim \) is a reflexive, transitive and complete binary relation on \(\Omega \).
Now, we briefly review each of the aforementioned properties and give some insights as to why they characterize the rationality of an individual (see [8, 10, 27] for more details):
-
Reflexivity of \(\succsim \) on \(\Omega \) implies that, for the DM, any alternative is as good as itself, i.e. \(\varvec{x}_{i} \succsim \varvec{x}_{i}\) for each \(\varvec{x}_{i} \in \Omega \);
-
A decision-maker whose preference relation \(\succsim \) on \(\Omega \) is transitive is able to express his/her preferences coherently since if \(\varvec{x}_{i} \succsim \varvec{x}_{j}\) and \(\varvec{x}_{j} \succsim \varvec{x}_{k}\) hold, then \(\varvec{x}_{i} \succsim \varvec{x}_{k}\), for any \(\varvec{x}_{i}, \varvec{x}_{j}, \varvec{x}_{k} \in \Omega \);
-
Completeness of \(\succsim \) on \(\Omega \) implies that the DM is able to express a preference between any two alternatives in \(\Omega \), i.e. either \(\varvec{x}_{i} \succsim \varvec{x}_{j}\) or \(\varvec{x}_{j} \succsim \varvec{x}_{i}\) holds for each \(\varvec{x}_{i}, \varvec{x}_{j} \in \Omega \).
The preference relation \(\succsim \) on \(\Omega \) of a rational decision-maker is usually “split” into two transitive binary relations [27]:
-
The strict preference relation \(\succ \) on \(\Omega \), i.e. \(\varvec{x}_{i} \succ \varvec{x}_{j}\) if and only if \(\varvec{x}_{i} \succsim \varvec{x}_{j}\) but not \(\varvec{x}_{j} \succsim \varvec{x}_{i}\) (\(\varvec{x}_{i}\) is “better than” \(\varvec{x}_{j}\) or, equivalently, \(\varvec{x}_j\) is “worse than” \(\varvec{x}_i\)), and
-
The indifference relation \(\sim \) on \(\Omega \), i.e. \(\varvec{x}_{i} \sim \varvec{x}_{j}\) if and only if \(\varvec{x}_{i} \succsim \varvec{x}_{j}\) and \(\varvec{x}_{j} \succsim \varvec{x}_{i}\) (\(\varvec{x}_{i}\) is “as good as” \(\varvec{x}_{j}\)).
One last relevant property for preference relations is continuity.
Definition 4
(Continuous preference relation [27]) A preference relation \(\succsim \) on \(\Omega \) is continuous if the strict upper and lower \(\succsim \)-contour sets:
are open subsets of \(\Omega \) for each \(\varvec{x} \in \Omega \).
Intuitively speaking, if \(\succsim \) on \(\Omega \) is continuous and \(\varvec{x}_i \succ \varvec{x}_j\) holds, then an alternative \(\varvec{x}_k\) which is “very close” to \(\varvec{x}_j\) should also be deemed strictly worse than \(\varvec{x}_i\) by the decision-maker, i.e. \(\varvec{x}_i \succ \varvec{x}_k\).
Having defined the preference relation \(\succsim \) on \(\Omega \) thoroughly, we can finally state the goal of preference-based optimization:
$$\begin{aligned} \text {find } \varvec{x^{*}} \in \Omega \text { such that } \varvec{x^{*}} \succsim \varvec{x}, \ \forall \varvec{x} \in \Omega . \end{aligned}$$
(2)
Formally, \(\varvec{x^{*}}\) is called the \(\succsim \)-maximum of \(\Omega \) [27], i.e. the sample that is most preferred by the decision-maker with preference relation \(\succsim \) on \(\Omega \). Concerning the existence of \(\varvec{x}^*\), we can state the following Proposition, which can be seen as a generalization of the Extreme Value Theorem [1] for preference relations.
Proposition 1
(Existence of a \(\succsim \)-maximum of \(\Omega \) [27]) A \(\succsim \)-maximum of \(\Omega \) is guaranteed to exist if \(\Omega \) is a compact subset of a metric space (in our case \(\Omega \subset \mathbb {R}^{n}\)) and \(\succsim \) is a continuous preference relation on \(\Omega \) of a rational decision-maker (see Definitions 3 and 4).
Proposition 1 will be relevant when proving the convergence of the proposed algorithm, in Sect. 5. Lastly, in order to write Problem (2) as a typical global optimization problem, we need to state one of the most important results in utility theory.
Theorem 1
(Debreu’s utility representation Theorem for \(\mathbb {R}^{n}\) [8]) Let \(\Omega \) be any nonempty subset of \(\mathbb {R}^{n}\) and \(\succsim \) be a preference relation on \(\Omega \) of a rational decision-maker (as in Definition 3). If \(\succsim \) on \(\Omega \) is continuous, then it can be represented by a continuous utility function \(u_{\scriptscriptstyle \succsim }:\Omega \rightarrow \mathbb {R}\) such that, for any \(\varvec{x}_i, \varvec{x}_j \in \Omega \):
$$\begin{aligned} \varvec{x}_i \succsim \varvec{x}_j \iff u_{\scriptscriptstyle \succsim }\left( \varvec{x}_i\right) \ge u_{\scriptscriptstyle \succsim }\left( \varvec{x}_j\right) . \end{aligned}$$
Using Theorem 1, we can build an optimization problem to find the \(\succsim \)-maximum of \(\Omega \). In particular, we define the scoring function, \(f: \mathbb {R}^{n} \rightarrow \mathbb {R}\), as \(f\left( \varvec{x}\right) = - u_{\scriptscriptstyle \succsim }\left( \varvec{x}\right) \) and re-write Problem (2) as:
$$\begin{aligned} \varvec{x^{*}} = \underset{\varvec{x} \in \Omega }{{\text {arg min}}}\, f\left( \varvec{x}\right) . \end{aligned}$$
(3)
Remark 1
Formally, there could be more than one \(\succsim \)-maximum of \(\Omega \), i.e. Problem (3) could admit multiple global solutions, as described by the set:
In this work, without loss of generality, we assume that there exists only one global solution \(\varvec{x}^{\varvec{*}}\) as in (3). In practice, as we will see in Sect. 5, any globally convergent preference-based optimization procedure generates a set of samples that is dense in \(\Omega \) and thus, at least asymptotically, it actually finds all the global minimizers in \(\mathcal {X}^*\). We do not make any assumptions on the local solutions of (3), which can be more than one.
On a side note, we could view preference-based optimization as a particular instance of black-box optimization [19, 36] where the cost function is not measurable in any way. Instead, information on \(f(\varvec{x})\) in (3) comes in the form of preferences, as described in the next section.
2.1 Data available to preference-based optimization procedures
In GLISp [3], instead of considering the preference relation \(\succsim \) on \(\Omega \) explicitly, the authors describe the outputs of the queries to the DM using the preference function \({\pi }_{\scriptscriptstyle \succsim }:\mathbb {R}^{n} \times \mathbb {R}^{n} \rightarrow \{-1,0,1\}\), defined as:
$$\begin{aligned} {\pi }_{\scriptscriptstyle \succsim }\left( \varvec{x}_{i},\varvec{x}_{j}\right) ={\left\{ \begin{array}{ll} -1 &{} \text {if } f\left( \varvec{x}_{i}\right) <f\left( \varvec{x}_{j}\right) , \text { i.e. } \varvec{x}_{i} \succ \varvec{x}_{j}\\ 0 &{} \text {if } f\left( \varvec{x}_{i}\right) =f\left( \varvec{x}_{j}\right) , \text { i.e. } \varvec{x}_{i} \sim \varvec{x}_{j}\\ 1 &{} \text {if } f\left( \varvec{x}_{i}\right) >f\left( \varvec{x}_{j}\right) , \text { i.e. } \varvec{x}_{j} \succ \varvec{x}_{i} \end{array}\right. } \end{aligned}$$
(4)
In light of the just reviewed utility theory literature, we can see that \({\pi }_{\scriptscriptstyle \succsim }(\varvec{x}_{i},\varvec{x}_{j})\) in (4) is obtained from the utility representation of the preference relation \(\succsim \) on \(\Omega \) (see Theorem 1) and from the fact that \(f(\varvec{x}) = - u_{\scriptscriptstyle \succsim }(\varvec{x})\). In the case of rational decision-makers (Definition 3), reflexivity and transitivity of the preference relation are highlighted by the following properties of the preference function:
-
\({\pi }_{\scriptscriptstyle \succsim }(\varvec{x}_{i},\varvec{x}_{i}) = 0\), for each \(\varvec{x}_{i} \in \mathbb {R}^{n}\),
-
\({\pi }_{\scriptscriptstyle \succsim }(\varvec{x}_{i},\varvec{x}_{j}) = {\pi }_{\scriptscriptstyle \succsim }(\varvec{x}_{j},\varvec{x}_{k}) = b \Rightarrow {\pi }_{\scriptscriptstyle \succsim }(\varvec{x}_{i},\varvec{x}_{k}) = b\), for any \(\varvec{x}_{i}, \varvec{x}_{j}, \varvec{x}_{k} \in \mathbb {R}^{n}\).
In the context of PBO, surrogate-based methods aim to solve Problem (3) starting from a set of \(N \in \mathbb {N}\) distinct samples of the decision vector:
and a set of \(M \in \mathbb {N}\) preferences expressed by the decision-maker:
The term \(b_h\) in (6) is the output of the h-th query, where the decision-maker was asked to compare two samples in \(\mathcal {X}\), as highlighted by the following mapping set:
In (7), \(\ell : \mathbb {N} \rightarrow \mathbb {N}\) and \(\kappa : \mathbb {N} \rightarrow \mathbb {N}\) are two mapping functions that associate the indexes of the samples, contained inside \(\mathcal {X}\), to their respective preferences in \(\mathcal {B}\). The cardinalities of these sets are \(|\mathcal {X}|=N\) and \(|\mathcal {B}|=|\mathcal {S}|=M\). Also note that \(1\le M\le \begin{pmatrix}N \\ 2 \end{pmatrix}\).
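To make the notation concrete, the following minimal sketch (in Python, with hypothetical values) shows one possible in-memory representation of the sets \(\mathcal {X}\) in (5), \(\mathcal {B}\) in (6) and \(\mathcal {S}\) in (7); the index pairs and preference outcomes are purely illustrative and follow the sign convention of the preference function in (4).

```python
# Hypothetical toy data illustrating the sets X (5), B (6) and S (7).
X = [
    [0.2, 0.7],   # x_1
    [0.5, 0.1],   # x_2
    [0.9, 0.4],   # x_3
]                 # N = 3 distinct samples of the decision vector
S = [(1, 2), (1, 3)]   # queried pairs (l(h), k(h)), 1-based as in the text
B = [-1, 0]            # b_1 = -1: x_1 better than x_2;  b_2 = 0: x_1 as good as x_3

# M = |B| = |S| preferences, with 1 <= M <= N * (N - 1) / 2
assert len(B) == len(S) and 1 <= len(B) <= len(X) * (len(X) - 1) // 2
```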
3 Handling exploration and exploitation
In this section, we review some key concepts that are common in most preference-based optimization algorithms. We also cover briefly how exploration and exploitation are handled by algorithm GLISp [3].
Preference-based response surface methods iteratively propose new samples to evaluate with the objective of solving Problem (3) while also minimizing the number of queries. Suppose that, at iteration k, we have at our disposal the set of samples \(\mathcal {X}\) in (5), \(|\mathcal {X}|= N\), and the sets \(\mathcal {B}\) in (6) and \(\mathcal {S}\) in (7), \(|\mathcal {B}|=|\mathcal {S}|=M\). We define the best sample found so far as:
The new candidate sample, \(\varvec{x}_{N+1} \in \Omega \), is obtained by solving an additional global optimization problem:
$$\begin{aligned} \varvec{x}_{N+1} = \underset{\varvec{x} \in \Omega }{{\text {arg min}}}\, a_N\left( \varvec{x}\right) , \end{aligned}$$
(8)
where \(a_N: \mathbb {R}^{n} \rightarrow \mathbb {R}\) is a properly defined acquisition function which trades off exploration and exploitation.
In practice, once \(\varvec{x}_{N+1}\) has been computed, we let the decision-maker express a preference between the best sample found so far and the new one, obtaining:
After that, \(\varvec{x}_{N+1}\) is added to the set \(\mathcal {X}\) and, similarly, the sets \(\mathcal {B}\) and \(\mathcal {S}\) are also updated with the new preference \(b_{M+1}\). The process is iterated until a certain condition is met. Typically, a budget, or rather a maximum number of samples \(N_{max} \in \mathbb {N}\), is set and the optimization procedure is stopped once it is reached.
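The query-and-update loop just described can be summarized by the following sketch. It is a simplified outline rather than the actual GLISp/GLISp-r implementation: `query_dm` and `propose_next` are placeholder callables, and we assume the convention that a preference equal to \(-1\) means the first argument of the query is preferred.

```python
def pbo_loop(query_dm, propose_next, X_init, N_max):
    """Generic preference-based surrogate loop (simplified outline).

    query_dm(x_i, x_j)    -> preference in {-1, 0, 1} expressed by the DM
    propose_next(X, S, B) -> new candidate sample (i.e. the solution of Problem (8))
    """
    X = list(X_init)                   # previously evaluated samples
    S, B = [], []                      # compared index pairs and query outcomes
    best = 0                           # index of x_best(N) within X
    while len(X) < N_max:
        x_new = propose_next(X, S, B)  # trades off exploration and exploitation
        X.append(x_new)
        pref = query_dm(x_new, X[best])
        S.append((len(X) - 1, best))
        B.append(pref)
        if pref == -1:                 # the new sample is preferred: update the incumbent
            best = len(X) - 1
    return X[best]
```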
In our case, \(a_N\left( \varvec{x}\right) \) in (8) is defined starting from a surrogate model \(\hat{f}_N: \mathbb {R}^{n} \rightarrow \mathbb {R}\), which approximates the scoring function \(f\left( \varvec{x}\right) \) of Problem (3), and a function \(z_N: \mathbb {R}^{n} \rightarrow \mathbb {R}\) that promotes the exploration of those regions of \(\Omega \) where fewer samples have been evaluated. The acquisition function that we will propose in this work is a weighted sum of these two contributions (see Sect. 4.3). \(\hat{f}_N\left( \varvec{x}\right) \) and \(z_N\left( \varvec{x}\right) \) are defined as in GLISp [3], which we now review.
3.1 Surrogate model
Given N samples \(\varvec{x}_i \in \mathcal {X}\) in (5), we define the surrogate model \(\hat{f}_N:\mathbb {R}^{n} \rightarrow \mathbb {R}\) as the radial basis function expansion [9]:
$$\begin{aligned} \hat{f}_N\left( \varvec{x}\right) = \sum _{i=1}^{N}\beta ^{(i)}\cdot \phi _{i}\left( \varvec{x}\right) = \varvec{\phi }\left( \varvec{x}\right) ^{\top }\varvec{\beta }, \end{aligned}$$
(9)
where \(\varphi :\mathbb {R}_{\ge 0}\rightarrow \mathbb {R}\) is a properly chosen radial function, \(\epsilon \in \mathbb {R}_{>0}\) is the so-called shape parameter (which needs to be tuned) and \(\phi _{i}:\mathbb {R}^{n}\rightarrow \mathbb {R}\) is the radial basis function originated from \(\varphi \left( \cdot \right) \) and center \(\varvec{x}_{i}\in \mathcal {X}\), namely \(\phi _{i}\left( \varvec{x}\right) =\varphi \left( \epsilon \cdot \left\| {{\varvec{x}-\varvec{x}_{i}}} \right\| _2\right) \). Moreover, \(\varvec{\phi }\left( \varvec{x}\right) \in \mathbb {R}^{N}\), \(\varvec{\phi }\left( \varvec{x}\right) =\begin{bmatrix}\phi _{1}\left( \varvec{x}\right)&\ldots&\phi _{N}\left( \varvec{x}\right) \end{bmatrix}^{\top }\), is the radial basis function vector and \(\varvec{\beta }=\begin{bmatrix}\beta ^{(1)}&\ldots&\beta ^{(N)}\end{bmatrix}^{\top }\in \mathbb {R}^{N}\) is a vector of weights that has to be estimated from the preferences in \(\mathcal {B}\) (6) and \(\mathcal {S}\) (7). Given a distance \(r = \left\| {{\varvec{x} - \varvec{x}_i}} \right\| _2\), some commonly used radial functions are [11]:
-
Inverse quadratic: \(\varphi \left( \epsilon \cdot r\right) =\frac{1}{1+\left( \epsilon \cdot r\right) ^{2}}\);
-
Multiquadratic: \(\varphi \left( \epsilon \cdot r\right) =\sqrt{1+\left( \epsilon \cdot r\right) ^{2}}\);
-
Linear: \(\varphi \left( \epsilon \cdot r\right) =\epsilon \cdot r\);
-
Gaussian: \(\varphi \left( \epsilon \cdot r\right) =e^{-\left( \epsilon \cdot r\right) ^{2}}\);
-
Thin plate spline: \(\varphi \left( \epsilon \cdot r\right) =\left( \epsilon \cdot r\right) ^{2}\cdot \log \left( \epsilon \cdot r\right) \);
-
Inverse multiquadratic: \(\varphi \left( \epsilon \cdot r\right) =\frac{1}{\sqrt{1+\left( \epsilon \cdot r\right) ^{2}}}\).
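A minimal sketch of the RBF expansion in (9) is reported below; the dictionary of radial functions mirrors the list above, and the weights \(\varvec{\beta }\) are assumed to have been estimated beforehand (see Problem (11)). Names and defaults are illustrative.

```python
import numpy as np

RADIAL = {
    "inverse_quadratic":      lambda s: 1.0 / (1.0 + s ** 2),
    "multiquadratic":         lambda s: np.sqrt(1.0 + s ** 2),
    "linear":                 lambda s: s,
    "gaussian":               lambda s: np.exp(-s ** 2),
    "thin_plate_spline":      lambda s: np.where(s > 0, s ** 2 * np.log(np.maximum(s, 1e-300)), 0.0),
    "inverse_multiquadratic": lambda s: 1.0 / np.sqrt(1.0 + s ** 2),
}

def surrogate(x, X, beta, eps=1.0, kind="inverse_quadratic"):
    """RBF expansion (9): f_hat_N(x) = sum_i beta_i * phi(eps * ||x - x_i||_2)."""
    r = np.linalg.norm(np.asarray(X) - np.asarray(x), axis=1)   # distances to the centers
    return RADIAL[kind](eps * r) @ np.asarray(beta)
```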
One advantage of using (9) as the surrogate model is highlighted by the following Proposition.
Proposition 2
\(\hat{f}_N\left( \varvec{x}\right) \) in (9) is differentiable everywhereFootnote 2 if and only if the chosen radial basis function \(\phi _i \left( \varvec{x}\right) = \varphi \left( \epsilon \cdot \left\| {{\varvec{x}-\varvec{x}_{i}}} \right\| _2 \right) \) is differentiable everywhere.
The surrogate model in (9) can be used to define the surrogate preference function \({\hat{\pi }}_{{\scriptscriptstyle \succsim }_{N}}:{\mathbb {R}^{n}\times \mathbb {R}^{n}\rightarrow \{-1,0,1\}}\). Differently from \({\pi }_{\scriptscriptstyle \succsim }(\varvec{x}_{i},\varvec{x}_{j})\) in (4), we consider a tolerance \(\sigma \in \mathbb {R}_{>0}\) to avoid using strict inequalities and equalities and define \({\hat{\pi }}_{{\scriptscriptstyle \succsim }_{N}}(\varvec{x}_{i},\varvec{x}_{j})\) as [3]:
Suppose now that we have at our disposal the sets \(\mathcal {X}\) in (5), \(\mathcal {B}\) in (6) and \(\mathcal {S}\) in (7). Then, we are interested in a surrogate model \(\hat{f}_N\left( \varvec{x}\right) \) in (9) that correctly describes the preferences expressed by the decision-maker, i.e. we would like the corresponding surrogate preference function \({\hat{\pi }}_{{\scriptscriptstyle \succsim }_{N}}(\varvec{x}_{i},\varvec{x}_{j})\) in (10) to be such that:
This, in turn, translates into some constraints on \(\hat{f}_N\left( \varvec{x}\right) \), which can be used to estimate \(\varvec{\beta }\). To do so, the authors of GLISp [3] define the following convex optimization problem:
where:
-
\(\varvec{\varepsilon } = \begin{bmatrix} \varepsilon ^{(1)}&\ldots&\varepsilon ^{(M)} \end{bmatrix}^{\top } \in \mathbb {R}_{\ge 0}^{M}\) is a vector of slack variables (one for each preference) which takes into consideration that: (i) there might be some outliers in \(\mathcal {B}\) and \(\mathcal {S}\) if the decision-maker expresses the preferences in an inconsistent way, and (ii) the selected radial function and/or shape parameter for \(\hat{f}_N\left( \varvec{x}\right) \) in (9) do not allow a proper approximation of the scoring function \(f\left( \varvec{x}\right) \);
-
\(\varvec{r} = \begin{bmatrix} r^{(1)}&\ldots&r^{(M)} \end{bmatrix}^{\top } \in \mathbb {R}^{M}_{>0}\) is a vector of weights that can be used to penalize more heavily the slacks related to the most important preferences. In GLISp [3], the authors put more weight on the preferences associated with the current best candidate and define \(\varvec{r}\) as follows:
$$\begin{aligned} r^{(h)}&= 1, \quad \forall h: \left( \ell (h),\kappa (h)\right) \in \mathcal {S}, \varvec{x}_{\ell (h)} \ne \varvec{x_{best}}(N)\quad \text {and}\quad \varvec{x}_{\kappa (h)} \ne \varvec{x_{best}}(N), \\ r^{(h)}&= 10, \quad \forall h: \left( \ell (h),\kappa (h)\right) \in \mathcal {S}, \varvec{x}_{\ell (h)} = \varvec{x_{best}}(N)\quad \text {or}\quad \varvec{x}_{\kappa (h)} = \varvec{x_{best}}(N). \end{aligned}$$ -
\(\lambda \in \mathbb {R}_{\ge 0}\) plays the role of a regularization parameter. It is easy to see that, for \(\lambda =0\), Problem (11) is a Linear Program (LP) while, for \(\lambda >0\), it is a Quadratic Program (QP).
Problem (11) ensures that, at least approximately, \(\hat{f}_N\left( \varvec{x}\right) \) in (9) is a suitable representation of the unknown preference relation \(\succsim \) on \(\Omega \) which produced the data described in Sect. 2.1 (see Theorem 1).
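A possible implementation of this estimation step is sketched below with CVXPY. It assumes the standard GLISp formulation of Problem (11), in which each preference \(b_h\) is translated into a (soft) linear inequality on \(\hat{f}_N\left( \varvec{x}_{\ell (h)}\right) - \hat{f}_N\left( \varvec{x}_{\kappa (h)}\right) \) involving the slack \(\varepsilon ^{(h)}\) and the tolerance \(\sigma \); the function name and default values are illustrative, not the authors' implementation.

```python
import numpy as np
import cvxpy as cp

def fit_surrogate_weights(Phi, S, b, sigma=1e-2, lam=1e-6, r=None):
    """Estimate the RBF weights beta from the expressed preferences.

    Phi   : (N, N) matrix with Phi[i, j] = phi_j(x_i), i.e. RBF values at the samples
    S     : list of compared index pairs (l, k), 0-based
    b     : list of preference outcomes in {-1, 0, 1}, one per pair
    sigma : tolerance used in place of strict inequalities
    lam   : regularization weight (lam = 0 gives an LP, lam > 0 a QP)
    r     : optional per-preference slack weights (defaults to 1)
    """
    N, M = Phi.shape[0], len(S)
    r = np.ones(M) if r is None else np.asarray(r, dtype=float)
    beta = cp.Variable(N)
    eps = cp.Variable(M, nonneg=True)          # slack variables, one per preference
    constraints = []
    for h, (l, k) in enumerate(S):
        diff = Phi[l, :] @ beta - Phi[k, :] @ beta    # f_hat(x_l) - f_hat(x_k)
        if b[h] == -1:       # x_l preferred to x_k -> lower surrogate value at x_l
            constraints.append(diff <= -sigma + eps[h])
        elif b[h] == 1:      # x_k preferred to x_l
            constraints.append(diff >= sigma - eps[h])
        else:                # indifference
            constraints.append(cp.abs(diff) <= sigma + eps[h])
    objective = cp.Minimize(r @ eps + 0.5 * lam * cp.sum_squares(beta))
    cp.Problem(objective, constraints).solve()
    return beta.value
```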
3.2 Exploration function
Consider a sample \(\varvec{x}_i \in \mathcal {X}\) in (5). Its corresponding Inverse Distance Weighting (IDW) function \(w_{i}:\mathbb {R}^{n}{\setminus }\left\{ \varvec{x}_{i}\right\} \rightarrow \mathbb {R}_{>0}\) is defined as [31]:
$$\begin{aligned} w_{i}\left( \varvec{x}\right) = \frac{1}{ \left\| {{\varvec{x}-\varvec{x}_{i}}} \right\| _2^{2}}. \end{aligned}$$
(12)
GLISp [3] uses the so-called IDW distance function \(z_N: \mathbb {R}^{n} \rightarrow (-1, 0]\),
$$\begin{aligned} z_N\left( \varvec{x}\right) ={\left\{ \begin{array}{ll} 0 &{} \text {if } \varvec{x}\in \mathcal {X}\\ -\frac{2}{\pi }\cdot \arctan \left( \frac{1}{\sum _{i=1}^{N}w_{i}\left( \varvec{x}\right) }\right) &{} \text {otherwise,} \end{array}\right. } \end{aligned}$$
(13)
to promote the exploration in those regions of \(\mathbb {R}^{n}\) where fewer samples have been evaluated. It is possible to prove the following Proposition and Lemma.
Proposition 3
The IDW distance function \(z_N\left( \varvec{x}\right) \) in (13) is differentiable everywhere.
The proof of Proposition 3 is given in [2]. Here, we derive the gradient of \(z_N\left( \varvec{x}\right) \) in (13) (see “Appendix A” for its derivation).
Lemma 1
The gradient of the IDW distance function \(z_N\left( \varvec{x}\right) \) in (13) is:
The gradient \(\nabla _{\varvec{x}} z_N\left( \varvec{x}\right) \) in (14) will be particularly relevant in the following section.
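For reference, a direct implementation of the IDW terms in (12) and of the IDW distance function in (13) is sketched below (plain NumPy, with no special handling beyond the case \(\varvec{x} \in \mathcal {X}\)).

```python
import numpy as np

def idw_weights(x, X):
    """IDW terms w_i(x) = 1 / ||x - x_i||_2^2, for x not coinciding with any x_i."""
    d2 = np.sum((np.asarray(X) - np.asarray(x)) ** 2, axis=1)
    return 1.0 / d2

def idw_distance(x, X):
    """IDW distance z_N(x): zero at the evaluated samples, in (-1, 0) elsewhere."""
    d2 = np.sum((np.asarray(X) - np.asarray(x)) ** 2, axis=1)
    if np.any(d2 == 0.0):          # x coincides with a previously evaluated sample
        return 0.0
    return -(2.0 / np.pi) * np.arctan(1.0 / np.sum(1.0 / d2))
```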
4 Next candidate sample search
As previously mentioned in Sect. 3, a key aspect of surrogate-based methods is the exploration–exploitation dilemma. Typically, new candidate samples are sought by minimizing an acquisition function \(a_N\left( \varvec{x}\right) \) that is a weighted sum between the surrogate model and the exploration function. In practice, \(\hat{f}_N\left( \varvec{x}\right) \) in (9) and \(z_N\left( \varvec{x}\right) \) in (13) often exhibit different ranges and need to be rescaled. In particular, GLISp [3] adopts the following \(a_N\left( \varvec{x}\right) \):
where \(\delta \in \mathbb {R}_{\ge 0}\) is the exploration–exploitation trade-off weight. The division by
$$\begin{aligned} \Delta \hat{F} = \max _{\varvec{x}_i \in \mathcal {X}} \hat{f}_N\left( \varvec{x}_i\right) - \min _{\varvec{x}_i \in \mathcal {X}} \hat{f}_N\left( \varvec{x}_i\right) \end{aligned}$$
(16)
aims to rescale the surrogate model to make it assume a range that is comparable to that of the IDW distance function in (13), which is \((-1, 0]\).
In this section, we address some limitations of \(a_N\left( \varvec{x}\right) \) in (15), which might prevent GLISp [3] from reaching the global minimizer of Problem (3), and define an alternative acquisition function. Furthermore, we propose a strategy to iteratively vary the exploration–exploitation trade-off.
4.1 Shortcomings of GLISp
There are two shortcomings of \(a_N\left( \varvec{x}\right) \) in (15) that limit the exploratory capabilities of GLISp [3]. First, the rescaling of \(\hat{f}_N\left( \varvec{x}\right) \) in (15), which relies on \(\Delta \hat{F}\) in (16), only takes into account the previously evaluated samples inside \(\mathcal {X}\) in (5) and thus it can be ineffective in making \(\frac{\hat{f}_N\left( \varvec{x}\right) }{\Delta \hat{F}}\) and \(z_N\left( \varvec{x}\right) \) comparable over all \(\Omega \) (see Problem (8)). Second, the IDW distance function in (13) exhibits two characteristics that can make its contribution negligible in \(a_N\left( \varvec{x}\right) \) in (15) and complicate the selection of \(\delta \):
-
1.
Even though the range of \(z_N\left( \varvec{x}\right) \) is \((-1, 0]\), what we are really interested in when solving Problem (8) and, ultimately, Problem (3), are the values that it assumes for \(\varvec{x} \in \Omega \) and not on its whole domain \(\mathbb {R}^{n}\). In particular, there are some situations for which \(|z_N\left( \varvec{x}\right) |\ll 1, \forall \varvec{x} \in \Omega \). Consider, for example, the case \(\mathcal {X} = \left\{ \varvec{x}_1\right\} \) (\(N = 1\)); then, \(\forall \varvec{x} \in \mathbb {R}^n {\setminus } \mathcal {X}\), the IDW distance function simply becomes:
$$\begin{aligned} z_1\left( \varvec{x}\right) =-\frac{2}{\pi }\cdot \arctan \left( \left\| {{\varvec{x}-\varvec{x}_{1}}} \right\| _2^{2}\right) . \end{aligned}$$Suppose that Problem (3) is only bound constrained, i.e. \(\Omega = \left\{ \varvec{x}: \varvec{l}\le \varvec{x}\le \varvec{u} \right\} \). Then, \(z_1\left( \varvec{x}\right) \) assumes its lowest value at one (or more) of the vertices of the box defined by the constraints in \(\Omega \). Define \(\varvec{v}_{\Omega } \in \Omega \) as one of such vertices; if \( \left\| {{\varvec{v}_{\Omega } - \varvec{x}_{1}}} \right\| _2\) is close to zero, then \(|z_N\left( \varvec{x}\right) |\ll 1, \forall \varvec{x} \in \Omega \). Thus, unless \(\delta \gg 1\) in (15), \(\hat{f}_N\left( \varvec{x}\right) \) in (9) and \(z_N\left( \varvec{x}\right) \) in (13) might not be comparable.
-
2.
The (absolute) values assumed by the IDW distance function decrease as the number of samples increases. To clarify this, consider two sets of samples:
$$\begin{aligned} \mathcal {X}'&= \left\{ \varvec{x}_{1}, \ldots , \varvec{x}_{N} \right\} , \quad |\mathcal {X}'|= N, \\ \mathcal {X}''&= \mathcal {X}' \cup \left\{ \varvec{x}_{N + 1} \right\} , \quad |\mathcal {X}''|= N + 1. \end{aligned}$$Given any point \(\tilde{\varvec{x}} \in \mathbb {R}^n {\setminus } \mathcal {X}''\), the IDW distance functions obtained from the previously defined sets are:
$$\begin{aligned} z_{N}\left( \varvec{\tilde{x}}\right)&= {-\frac{2}{\pi }}\cdot \arctan \left( \frac{1}{\sum _{i=1}^{N}w_{i}\left( \varvec{\tilde{x}}\right) }\right) , \\ z_{N+1}\left( \varvec{\tilde{x}}\right)&= {-\frac{2}{\pi }}\cdot \arctan \left( \frac{1}{\sum _{i=1}^{N+1}w_{i}\left( \varvec{\tilde{x}}\right) }\right) . \end{aligned}$$Note that \(w_i\left( \varvec{\tilde{x}} \right) > 0, \forall \varvec{\tilde{x}}\in \mathbb {R}^n {\setminus } \mathcal {X}''\) and \(i = 1, \ldots , N + 1\) (see (12)). Hence:
$$\begin{aligned} |z_{N}\left( \varvec{\tilde{x}}\right) |> |z_{N+1}\left( \varvec{\tilde{x}}\right) |> 0, \end{aligned}$$proving the above point. In practice, unless \(\delta \) in (15) is progressively increased as the iterations go on, GLISp [3] will explore the constraint set \(\Omega \) of Problem (3) less as the number of samples increases, regardless of whether a region that contains the global minimizer \(\varvec{x^*}\) has been located.
A visualization of these two characteristics of \(z_N\left( \varvec{x}\right) \) in (13) is presented in Fig. 1.
In this work, we overcome the limitations of \(a_N\left( \varvec{x}\right) \) in (15) by defining an acquisition function that is similar to the one used by the Metric Stochastic Response Surface (MSRS [29]) algorithm, a popular black-box optimization procedure. In MSRS [29], the surrogate \(\hat{f}_N\left( \varvec{x}\right) \) and the exploration function \(z_N\left( \varvec{x}\right) \) are made comparable by randomly sampling \(\Omega \) and rescaling the two using min–max normalization. Then, Problem (8) is not solved explicitly but by choosing the generated random sample that achieves the lowest value for the acquisition function. Here, we propose to rescale \(z_N\left( \varvec{x}\right) \) in (13) and \(\hat{f}_N\left( \varvec{x}\right) \) in (9) using some insights on the stationary points of the IDW distance function and solve Problem (8) explicitly, using a proper global optimization solver.
4.2 Novel rescaling strategy
In this section, we derive the approximate locations of the stationary points of the IDW distance function in (13) and use them to define an augmented set of samples \(\mathcal {X}_{aug} \supset \mathcal {X}\) that is suited for the min–max normalization of both \(z_N\left( \varvec{x}\right) \) and \(\hat{f}_N\left( \varvec{x}\right) \).
4.2.1 Stationary points of the IDW distance function
The locations of the global maximizers of \(z_N\left( \varvec{x}\right) \) can be deduced immediately from (13) and Lemma 1, as stated by the following Proposition.
Proposition 4
Each \(\varvec{x}_i \in \mathcal {X}\) in (5) is a global maximizer of \(z_N\left( \varvec{x}\right) \) in (13).
Proof
Recall that:
-
(i)
\(\nabla _{\varvec{x}} z_N\left( \varvec{x}_i\right) = \varvec{0}_{n}, \forall \varvec{x}_i \in \mathcal {X}\), see (14);
-
(ii)
\(z_N\left( \varvec{x}\right) < 0, \forall \varvec{x} \in \mathbb {R}^n {\setminus } \mathcal {X}\), see (13);
-
(iii)
\(z_N\left( \varvec{x}_i\right) = 0, \forall \varvec{x}_i \in \mathcal {X}\), see (13).
From Item (i) we deduce that each \(\varvec{x}_i \in \mathcal {X}\) is a stationary point of \(z_N\left( \varvec{x}\right) \). Item (ii), in conjunction with Item (iii), implies that such samples are local maximizers of the IDW distance function in (13) since there exists a neighborhood of \(\varvec{x}_i \in \mathcal {X}\), denoted as \(\mathcal {N}\left( \varvec{x}_i\right) \), such that \(z_N\left( \varvec{x}\right) \le z_N\left( \varvec{x}_i \right) , \forall \varvec{x} \in \mathcal {N}\left( \varvec{x}_i\right) \). Moreover, due to Item (ii), \(z_N\left( \varvec{x}\right) \le z_N\left( \varvec{x}_i \right) , \forall \varvec{x} \in \mathbb {R}^n\) and not just in a neighborhood of \(\varvec{x}_i \in \mathcal {X}\). Hence, each \(\varvec{x}_i \in \mathcal {X}\) is a global maximizer of \(z_N\left( \varvec{x}\right) \) in (13). \(\square \)
Reaching similar conclusions for the minimizers of \(z_N\left( \varvec{x} \right) \) is much harder; however, we can consider some simplified situations. Note that we are not necessarily interested in finding the minimizers of the IDW distance function in (13) with high accuracy, but rather in gaining some insight on where they are likely to be located, so that we can rescale both \(z_N\left( \varvec{x} \right) \) in (13) and \(\hat{f}_N\left( \varvec{x}\right) \) in (9) well enough to make them comparable. Moreover, their approximate locations can be used to solve the following global optimization problem (pure exploration):
$$\begin{aligned} \min _{\varvec{x} \in \Omega } \ z_N\left( \varvec{x}\right) \end{aligned}$$
(17)
by using a multi-start derivative-based optimization method with warm-start [23, 26] (recall that \(z_N\left( \varvec{x}\right) \) is differentiable everywhere, see Proposition 3). Problem (17) is quite relevant for the global convergence of the algorithm that we will propose in Sect. 5.
Remark 2
In the following Paragraphs, we analyze where the local minimizers of \(z_N\left( \varvec{x} \right) \) in (13) and the solution(s) of the simplified problem:
$$\begin{aligned} \min _{\varvec{l} \le \varvec{x} \le \varvec{u}} \ z_N\left( \varvec{x}\right) \end{aligned}$$
(18)
are located in some specific cases. Note that \(\left\{ \varvec{x}: \varvec{l} \le \varvec{x} \le \varvec{u}\right\} \supseteq \Omega \) (see (1)) and thus the global minimum of Problem (18) is lower than or at most equal to the global minimum of Problem (17). Therefore, the minimizers of Problem (18) are better suited to perform min–max rescaling of \(z_N\left( \varvec{x} \right) \) than those of Problem (17).
Case \(\mathcal {X} = \left\{ \varvec{x}_1\right\} \) (\(N = 1\)). The IDW distance function and its gradient \(\forall \varvec{x} \in \mathbb {R}^n {\setminus } \mathcal {X}\) are:
Clearly, \(\forall \varvec{x} \in \mathbb {R}^n {\setminus } \mathcal {X}\), the gradient is never zero since \(w_{1}\left( \varvec{x}\right) > 0\). Therefore, the only stationary point is the global maximizer \(\varvec{x}_1\) (see Proposition 4). However, if we were to consider Problem (18), then its solution would be located at one of the vertices of the box defined by the bound constraints \(\varvec{l} \le \varvec{x} \le \varvec{u}\).
Case \(\mathcal {X} = \left\{ \varvec{x}_1, \varvec{x}_2\right\} \) (\(N = 2\)). The gradient of the IDW distance function \(\forall \varvec{x} \in \mathbb {R}^n {\setminus } \mathcal {X}\) is:
Let us consider the midpoint \(\varvec{x_{\mu }}=\frac{\varvec{x}_{1}+\varvec{x}_{2}}{2}\), that is such that \( \left\| {{\varvec{x_{\mu }}-\varvec{x}_{1}}} \right\| _2= \left\| {{\varvec{x_{\mu }}-\varvec{x}_{2}}} \right\| _2\) and for which \(w_{1}\left( \varvec{x_{\mu }}\right) =w_{2}\left( \varvec{x_{\mu }}\right) \). If we substitute it in the previous expression, we obtain:
which means that \(\varvec{x_{\mu }}\) is a stationary point of \(z_N\left( \varvec{x}\right) \) in (13). It is easy to see by visual inspection that such a point is actually a local minimizer of the IDW distance function (see for example Fig. 2). However, note that \(\varvec{x_{\mu }}\) is not necessarily the global solution of Problem (18); it might just be a local one.
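This claim can also be checked numerically; the snippet below (with illustrative sample locations) evaluates a central-difference gradient of \(z_2\left( \varvec{x}\right) \) at the midpoint of two samples and returns a vector that is numerically zero.

```python
import numpy as np

def z(x, X):
    """IDW distance function (13) for x not coinciding with any sample."""
    d2 = np.sum((X - x) ** 2, axis=1)
    return -(2.0 / np.pi) * np.arctan(1.0 / np.sum(1.0 / d2))

X = np.array([[0.0, 0.0], [1.0, 2.0]])        # two samples (N = 2)
x_mu = X.mean(axis=0)                         # their midpoint
h = 1e-6
grad = np.array([(z(x_mu + h * e, X) - z(x_mu - h * e, X)) / (2 * h) for e in np.eye(2)])
print(grad)                                   # approximately [0, 0]
```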
Case \(\mathcal {X} = \mathcal {X}^{(1)} \cup \mathcal {X}^{(2)}\) (\(N > 2\)). Suppose now that the samples contained in \(\mathcal {X}\) (5) can be partitioned into two clusters:
-
\(\mathcal {X}^{(1)} = \left\{ \varvec{x}_1, \ldots , \varvec{x}_{N_1}\right\} \) (\(|\mathcal {X}^{(1)}|= N_1\)),
-
\(\mathcal {X}^{(2)} = \left\{ \varvec{x}_{N_1 + 1}, \ldots , \varvec{x}_{N}\right\} \) (\(|\mathcal {X}^{(2)}|= N - N_1\)),
such that \(\mathcal {X}^{(1)} \cap \mathcal {X}^{(2)} = \emptyset \) and \(\mathcal {X}^{(1)} \cup \mathcal {X}^{(2)} = \mathcal {X}\). Consider the midpoint between the centroids of each cluster:
We make the simplifying assumption that all the points contained inside each cluster are quite close to each other, i.e. \(\varvec{x}_{1} \approx \varvec{x}_{2} \approx \cdots \approx \varvec{x}_{N_1}\) and \(\varvec{x}_{N_1 + 1} \approx \varvec{x}_{N_1 + 2} \approx \cdots \approx \varvec{x}_{N}\). Then, the midpoint in (19) is approximately equal to \(\varvec{x_{\mu }} \approx \frac{\varvec{x}_1 + \varvec{x}_N}{2}\). Moreover, we have that \(w_{1}\left( \varvec{x_{\mu }}\right) \approx \cdots \approx w_{N_{1}}\left( \varvec{x_{\mu }}\right) \approx w_{N_{1}+1}\left( \varvec{x_{\mu }}\right) \approx \cdots \approx w_{N}\left( \varvec{x_{\mu }}\right) \). Thus, the gradient of the IDW distance function at \(\varvec{x_{\mu }}\) is approximately equal to:
Clearly, if the clusters are nearly equally sized, i.e. \(N_1 \approx N - N_1 \approx \frac{N}{2}\), then:
reaching a similar result to the one that we have seen for the case \(N = 2\).
General case (\(N > 2\)). Any set of samples \(\mathcal {X}\) in (5) can be partitioned into an arbitrary number of disjoint clusters, say \(K \in \mathbb {N}, K \le |\mathcal {X}|= N\), i.e.:
In this case, finding the local solutions of Problem (18) explicitly, or even approximately, is quite complex. Heuristically speaking, if the clusters are “well spread” (i.e. all the points contained inside each cluster \(\mathcal {X}^{(i)}\) are sufficiently far away from the others in \(\mathcal {X}^{(j)}, j = 1, \ldots , K, j \ne i\)), then we can approximately deduce where the local minimizers of \(z_N\left( \varvec{x}\right) \) in (13) are located.
For instance, Fig. 3 depicts a set of samples \(\mathcal {X}\) that has been partitioned into three clusters, \(\mathcal {X}^{(1)}, \mathcal {X}^{(2)}\) and \(\mathcal {X}^{(3)}\), and for which the previous hypothesis is satisfied.
In the general case, given the clusters \(\mathcal {X}^{(i)}\) and \(\mathcal {X}^{(j)}\), we compute their centroids \(\varvec{x}^{(i)}_{\varvec{c}}\), \(\varvec{x}^{(j)}_{\varvec{c}}\), and the corresponding midpoint \(\varvec{x_{\mu }}\) between them as:
$$\begin{aligned} \varvec{x}^{(i)}_{\varvec{c}} = \frac{1}{|\mathcal {X}^{(i)}|}\sum _{\varvec{x}_l \in \mathcal {X}^{(i)}} \varvec{x}_l, \end{aligned}$$
(20a)
$$\begin{aligned} \varvec{x_{\mu }} = \frac{\varvec{x}^{(i)}_{\varvec{c}} + \varvec{x}^{(j)}_{\varvec{c}}}{2}. \end{aligned}$$
(20b)
Going back to the example depicted in Fig. 3, if we consider the midpoint \(\varvec{x_{\mu }}\) between the centroids of \(\mathcal {X}^{(1)}\) and \(\mathcal {X}^{(2)}\), due to the “well spread” hypothesis we can say that \( \left\| {{\varvec{x_{\mu }} - \varvec{x}_i}} \right\| _2 \gg 0, \forall \varvec{x}_i \in \mathcal {X}^{(3)}\), making the contributions of the points inside the third cluster negligible when evaluating \(z_N\left( \varvec{x}\right) \) in (13) at \(\varvec{x_{\mu }}\). Therefore, the IDW distance function at \(\varvec{x_{\mu }}\) is approximately equal to:
In general, given K clusters, if these are “well spread”, then we can consider each possible couple of clusters separately and neglect the contributions of the remaining ones. Approximately speaking, we could split the general case into \(\begin{pmatrix}K \\ 2 \end{pmatrix}\) distinct problems that read as follows: find the stationary points of the IDW distance function \(z_{N_{i \cup j}}\left( \varvec{x}\right) \) in (13) defined from the set of samples \(\mathcal {X}^{(i)}\cup \mathcal {X}^{(j)}, i \ne j\) and \(N_{i \cup j} = |\mathcal {X}^{(i)}\cup \mathcal {X}^{(j)}|\). Hence, rough locations of the stationary points of \(z_{N_{i \cup j}}\left( \varvec{x}\right) \) can be found by following the same rationale proposed for the previously analyzed cases.
Some one-dimensional examples of all the previously analyzed situations are reported in Fig. 2.
4.2.2 Min–max rescaling and augmented sample set
Given a generic set of samples \(\mathcal {X} = \left\{ \varvec{x}_1, \ldots , \varvec{x}_N\right\} \) and a multivariable function \(h:\mathbb {R}^{n} \rightarrow \mathbb {R}\), min–max rescaling (or normalization) [16] rescales \(h\left( \varvec{x}\right) \) as:
$$\begin{aligned} \bar{h}\left( \varvec{x}; \mathcal {X}\right) = \frac{h\left( \varvec{x}\right) - h_{min}}{h_{max} - h_{min}}, \end{aligned}$$
(21)
where:Footnote 3
$$\begin{aligned} h_{min} = \min _{\varvec{x}_i \in \mathcal {X}} h\left( \varvec{x}_i\right) , \quad h_{max} = \max _{\varvec{x}_i \in \mathcal {X}} h\left( \varvec{x}_i\right) . \end{aligned}$$
The objective of min–max rescaling is to obtain a function with range [0, 1], i.e. we would like to have \(\bar{h}:\mathbb {R}^{n} \rightarrow [0, 1]\). Clearly, the quality of the normalization depends on the information brought by the samples contained inside \(\mathcal {X}\), as pointed out in the following Remark.
Remark 3
We can observe that:
-
1.
If \(\mathcal {X}\) contains the global minimizer(s) and maximizer(s) of \(h\left( \varvec{x}\right) \), then \(\bar{h}\left( \varvec{x}\right) \) defined as in (21) effectively has codomain [0, 1],
-
2.
Otherwise, we can only ensure that \(0 \le \bar{h}\left( \varvec{x}_i\right) \le 1, \forall \varvec{x}_i \in \mathcal {X}\).
-
3.
In general, if we increase the number of distinct samples in \(\mathcal {X}\), then the rescaling of \(h\left( \varvec{x}\right) \) gets better (or, in the worst case, stays the same).
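A minimal helper implementing (21) is shown below; the guard against a zero range is an implementation detail added here and is not part of the original definition.

```python
import numpy as np

def minmax_rescale(h, X_ref):
    """Return the min-max rescaled version of h, computed on the reference set X_ref."""
    vals = np.array([h(x) for x in X_ref])
    h_min, h_max = vals.min(), vals.max()
    span = h_max - h_min if h_max > h_min else 1.0   # avoid division by zero
    return lambda x: (h(x) - h_min) / span
```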
Going back to the problem of rescaling the IDW distance function \(z_N\left( \varvec{x}\right) \) in (13), if we were to apply (21) using the set of previously evaluated samples \(\mathcal {X}\) in (5), then it would not be effective since \(z_N\left( \varvec{x}_i\right) = 0, \forall \varvec{x}_i \in \mathcal {X}\) (see Proposition 4). Instead, we have opted to generate a sufficiently expressive augmented sample set \(\mathcal {X}_{aug} \supset \mathcal {X}\) and perform min–max normalization using \(\mathcal {X}_{aug}\) instead of \(\mathcal {X}\).
Consider the general case described in Sect. 4.2.1. Then, the augmented sample set \(\mathcal {X}_{aug}\) can be built in the following fashion:
-
1.
Partition the points in \(\mathcal {X}\) (5) into different clusters. Here, for simplicity, we fix a-priori the number \(K_{aug} \in \mathbb {N}\) of clusters and apply K-means clustering [5, 17, 22] to obtain the sets \(\mathcal {X}^{(1)}, \ldots , \mathcal {X}^{(K_{aug})}\);
-
2.
Compute the centroids of each cluster, using (20a), and group them inside the set \(\mathcal {X}_c = \left\{ \varvec{x}_{\varvec{c}}^{(1)}, \ldots , \varvec{x}_{\varvec{c}}^{(K_{aug})} \right\} \);
-
3.
Calculate all the midpoints \(\varvec{x_\mu }\) between each possible couple of centroids \(\varvec{x}_{\varvec{c}}^{(i)}, \varvec{x}_{\varvec{c}}^{(j)} \in \mathcal {X}_c, \varvec{x}_{\varvec{c}}^{(i)} \ne \varvec{x}_{\varvec{c}}^{(j)},\) using (20b);
-
4.
Build the augmented sample set as \(\mathcal {X}_{aug} = \mathcal {X} \cup \mathcal {X}_\mu \), where \(\mathcal {X}_\mu \) is the set which groups all the previously computed midpoints.
Clearly, as highlighted by (21) and Remark 3, if \(\mathcal {X}_{aug}\) contains points that are close (or equal) to the global minimizer(s) and maximizer(s) of \(z_N\left( \varvec{x}\right) \) in (13), then the quality of the min–max rescaling of the IDW distance function improves.
Algorithm 1 formalizes these steps while also taking into consideration the case \(|\mathcal {X}|\le K_{aug}\) (for which no clustering is performed). Note that we also include the bounds \(\varvec{l}\) and \(\varvec{u}\) inside \(\mathcal {X}_{c}\) and \(\mathcal {X}_{aug}\) for two reasons: (i) \(\varvec{l}\) or \(\varvec{u}\) might actually be the solutions of Problem (18)Footnote 4 and (ii) given that we also want to rescale \(\hat{f}_N\left( \varvec{x}\right) \) in (9), adding additional points to the augmented sample set improves the quality of min–max normalization (see Remark 3). Notice that the number of points contained inside \(\mathcal {X}_{aug}\) obtained from Algorithm 1 is:
Therefore, to avoid excessively large augmented sample sets, \(K_{aug}\) needs to be chosen appropriately.
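A compact sketch of the construction of \(\mathcal {X}_{aug}\) is given below, using K-means from scikit-learn; names and defaults are illustrative, and no de-duplication of coincident points is performed.

```python
import numpy as np
from itertools import combinations
from sklearn.cluster import KMeans

def augmented_sample_set(X, lb, ub, K_aug=5, seed=0):
    """Build X_aug = X U {midpoints between centroid pairs} U {lb, ub} (Algorithm 1 sketch)."""
    X = np.asarray(X, dtype=float)
    if len(X) > K_aug:
        centroids = KMeans(n_clusters=K_aug, n_init=10,
                           random_state=seed).fit(X).cluster_centers_
    else:
        centroids = X.copy()                       # too few samples: skip the clustering step
    centers = np.vstack([centroids, lb, ub])       # bounds act as additional "centroids"
    midpoints = [0.5 * (ci + cj) for ci, cj in combinations(centers, 2)]
    return np.vstack([X, midpoints, lb, ub])
```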
As a final remark, we point out that we could perform min–max normalization in (21) by using the real minima and maxima of \(z_N\left( \varvec{x}\right) \) in (13) and \(\hat{f}_N\left( \varvec{x}\right) \) in (9), which can be obtained by solving four additional global optimization problems. However, we have preferred to stick with the proposed heuristic since we are not interested in an extremely accurate rescaling and we want to avoid the potentially large overhead of solving additional global optimization problems.
4.3 Definition of the acquisition function
In this section, we take advantage of the results on the stationary points of \(z_N\left( \varvec{x}\right) \) presented in Sect. 4.2.1 to rescale the surrogate model and the exploration function. In particular, we define the following acquisition function:
$$\begin{aligned} a_N\left( \varvec{x}\right) = \delta \cdot \bar{\hat{f}}_N\left( \varvec{x}; \mathcal {X}_{aug}\right) + \left( 1 - \delta \right) \cdot \bar{z}_N\left( \varvec{x}; \mathcal {X}_{aug}\right) , \end{aligned}$$
(23)
where \(\hat{f}_N\left( \varvec{x}\right) \) in (9) and \(z_N\left( \varvec{x}\right) \) in (13) have been rescaled using min–max normalization as in (21) and \(\mathcal {X}_{aug}\) is generated by Algorithm 1. \(\delta \in \left[ 0, 1\right] \) is the exploration–exploitation trade-off weight; also note that \(\delta = 0\) corresponds to pure exploration, while \(\delta = 1\) results in pure exploitation. \(a_N\left( \varvec{x}\right) \) in (23) is similar to the acquisition function of MSRS [29] (for black-box optimization) but here we use an ad-hoc augmented sample set instead of a randomly generated one and a different exploration function. We will refer to the algorithm that we will propose in Sect. 5, which uses \(a_N\left( \varvec{x}\right) \) in (23), as GLISp-r, where the “r” highlights the min–max rescaling performed for the acquisition function.
A comparison between the terms of the acquisition functions in (23) (GLISp-r) and in (15) (GLISp [3]) is depicted in Fig. 4. As the number of samples N increases, the absolute values of \(z_N\left( \varvec{x}\right) \) in (13) get progressively smaller (see Sect. 4.1) and simply dividing \(\hat{f}_N\left( \varvec{x}\right) \) by \(\Delta \hat{F}\) as in (15) is not enough to make the exploration and exploitation contributions comparable. Thus, unless \(\delta \) in (15) is dynamically varied in between iterations of GLISp [3], solving Problem (8) with \(a_N\left( \varvec{x}\right) \) in (15) becomes similar to performing pure exploitation. This, in turn, can make GLISp [3] more prone to getting stuck on local minima of Problem (3) with no way of escaping (especially if the surrogate model is not expressive enough to capture the location of the global minimizer). Vice-versa, by performing min–max rescaling as proposed in (23), the exploration and exploitation contributions stay comparable throughout the whole optimization process and approximately assume the same range. For this reason, it is also more straightforward to define \(\delta \) in (23) compared to the weight in (15).
From Propositions 2 and 3, we can immediately deduce the following results on the differentiability of \(a_N\left( \varvec{x}\right) \) in (23).
Proposition 5
The acquisition function \(a_N\left( \varvec{x}\right) \) in (23) is differentiable everywhere provided that the surrogate model \(\hat{f}_N\left( \varvec{x}\right) \) in (9) is differentiable everywhere.
At each iteration of GLISp-r we find the next candidate for evaluation, i.e. \(\varvec{x}_{N+1}\), by solving Problem (8) with the acquisition function in (23). It is possible to use derivative-based optimization solvers since \(a_N\left( \varvec{x}\right) \) in (23) is differentiable everywhere. In general, the acquisition function is multimodal and thus it is better to employ a global optimization procedure. Moreover, \(a_N\left( \varvec{x} \right) \) is cheap to evaluate; therefore, we are not particularly concerned about the number of times it is evaluated by the solver.
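The acquisition step can be sketched as follows; the min–max rescaling is computed on \(\mathcal {X}_{aug}\) for both terms, and SciPy's differential evolution is used here only as a stand-in global solver for the purely bound-constrained case (the method itself allows a generic constraint set \(\Omega \) and any suitable global optimization procedure).

```python
import numpy as np
from scipy.optimize import differential_evolution

def make_acquisition(f_hat, z, X_aug, delta):
    """Return a_N(x) = delta * f_bar(x; X_aug) + (1 - delta) * z_bar(x; X_aug)."""
    def rescaled(h):
        vals = np.array([h(x) for x in X_aug])
        span = max(vals.max() - vals.min(), 1e-12)   # guard against a zero range
        return lambda x: (h(x) - vals.min()) / span
    f_bar, z_bar = rescaled(f_hat), rescaled(z)
    return lambda x: delta * f_bar(x) + (1.0 - delta) * z_bar(x)

def next_sample(f_hat, z, X_aug, delta, bounds):
    """Solve Problem (8) over the box [l, u] with a global optimizer (stand-in solver)."""
    a_N = make_acquisition(f_hat, z, X_aug, delta)
    return differential_evolution(a_N, bounds=bounds).x
```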
4.3.1 Greedy \(\delta \)-cycling
Many black-box optimization algorithms explicitly vary their emphasis on exploration and exploitation in between iterations. Just to cite a few:
-
Gutmann-RBF [15] uses an acquisition function that is a measure of “bumpiness” of the RBF surrogate that depends upon a target value \(\tau \) to aim for.
The author suggests cycling the values of \(\tau \) between \(\tau = \min _{\varvec{x} \in \Omega } \hat{f}_N\left( \varvec{x}\right) \) (local search) and \(\tau = -\infty \) (global search).
-
The authors of MSRS [29], which uses an acquisition function that is similar to (23), propose to cycle between different values of \(\delta \) as to prioritize exploration or exploitation more.
-
In algorithm SO-SA [37], which is a revisitation of MSRS [29], the weight \(\delta \) is chosen in a random fashion at each iteration. Moreover, the authors adopt a greedy strategy, i.e. \(\delta \) is kept unaltered until it fails to find a significantly better solution.
In GLISp [3], the weight \(\delta \) for \(a_N\left( \varvec{x}\right) \) in (15) is kept constant throughout the whole optimization process. Also, defining some form of cycling for such a hyper-parameter can be quite complex since the additive terms that compose the acquisition function are not always comparable. In this work, we propose a strategy that is in between that of MSRS [29] and SO-SA [37], which we refer to as greedy \(\delta \)-cycling. We define a sequence of \(N_{cycle} \in \mathbb {N}\) weights to cycle:
$$\begin{aligned} \Delta _{cycle} = \langle \delta _1, \ldots , \delta _{N_{cycle}}\rangle , \quad \delta _j \in \left[ 0, 1\right] , \ j = 1, \ldots , N_{cycle}. \end{aligned}$$
(24)
\(\Delta _{cycle}\) should contain values that are well spread within the \(\left[ 0, 1\right] \) range so as to properly alternate between local and global search. Greedy \(\delta \)-cycling operates as follows. Suppose that, at iteration k of GLISp-r, we have at our disposal \(|\mathcal {X}|= N\) samples and denote the trade-off weight \(\delta \) in (23) as \(\delta \left( k\right) \) to highlight the iteration number. Furthermore, assume \(\delta \left( k\right) = \delta _j \in \Delta _{cycle}\), which is used to find the new candidate sample \(\varvec{x}_{N+1}\) at iteration k by solving Problem (8). Then, if \(\varvec{x}_{N+1} \succ \varvec{x_{best}}\left( N\right) \) (i.e. there has been some improvement), the trade-off weight is kept unchanged, \(\delta \left( k+1\right) = \delta \left( k\right) = \delta _j\). Otherwise, we cycle the values in \(\Delta _{cycle}\), obtaining \(\delta \left( k+1\right) = \delta _{\left( j+1\right) \text {mod}{N_{cycle}}}\). Thus:
$$\begin{aligned} \delta \left( k+1\right) ={\left\{ \begin{array}{ll} \delta _{j} &{} \text {if } \varvec{x}_{N+1} \succ \varvec{x_{best}}\left( N\right) \\ \delta _{\left( j+1\right) \text {mod}{N_{cycle}}} &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$
(25)
In Sect. 5, we will discuss the choice of the cycling sequence in (24) more in detail and also cover its relationship with the global convergence of GLISp-r.
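Greedy \(\delta \)-cycling itself amounts to a few lines of code; in the sketch below the cycling sequence is an illustrative choice (well spread in \(\left[ 0, 1\right] \) and containing \(0\), i.e. a pure-exploration step, which matters for the convergence result of Sect. 5), not necessarily the default used in the experiments.

```python
delta_cycle = [0.95, 0.7, 0.35, 0.0]          # illustrative Delta_cycle, includes delta = 0

def update_delta_index(j, improved, n_cycle=len(delta_cycle)):
    """Greedy delta-cycling (25): keep the current weight after an improvement,
    otherwise move to the next weight in the cycle."""
    return j if improved else (j + 1) % n_cycle
```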
5 Algorithm GLISp-r and convergence
Algorithm 2 describes each step of the GLISp-r procedure. As with any surrogate-based method, GLISp-r starts from an initial set of samples \(\mathcal {X}, |\mathcal {X}|= N_{init} \in \mathbb {N}, N_{init} \ge 2\), generated by a space-filling experimental design [36], such as a Latin Hypercube Design (LHD) [24]. The sets \(\mathcal {B}\) in (6) and \(\mathcal {S}\) in (7), as well as the initial best candidate \(\varvec{x_{best}}\left( N_{init}\right) \), are obtained by asking the decision-maker to compare the samples in \(\mathcal {X}\) (5) as proposed in Algorithm 3, which prompts \(M = N_{init} - 1\) queries. Once the initial sampling phase is concluded, the new candidate samples are obtained by solving Problem (8). The procedure is stopped once \(|\mathcal {X}|= N_{max}\), where \(N_{max} \in \mathbb {N}\) is a budget specified by the user. Overall, the decision-maker is queried \(N_{max} - 1\) times.
GLISp-r follows the same scheme as GLISp [3] but, additionally, at each iteration, builds the augmented sample set \(\mathcal {X}_{aug}\) using Algorithm 1. New candidate samples are found by minimizing the acquisition function in (23) instead of \(a_N\left( \varvec{x}\right) \) in (15). Moreover, \(\delta \) is cycled as proposed in Sect. 4.3.1. Similarly to GLISp [3], the shape parameter \(\epsilon \) of the surrogate model in (9) is recalibrated at certain iterations of the algorithm, as specified by the set \(\mathcal {K}_{R} \subseteq \left\{ 1, \ldots , N_{max} - N_{init}\right\} \), using grid-search Leave One Out Cross-Validation (LOOCV). In particular, at each iteration \(k \in \mathcal {K}_{R}\), \(\epsilon \) for \(\hat{f}_N\left( \varvec{x}\right) \) in (9) is selected among a set \(\mathcal {E}_{LOOCV{}}\) of possible shape parameters as the one whose corresponding surrogate preference function \({\hat{\pi }}_{{\scriptscriptstyle \succsim }_{N}}\left( \varvec{x}_i, \varvec{x}_j\right) \) in (10) classifies (out-of-sample) most of the preferences in \(\mathcal {B}\) and \(\mathcal {S}\) correctly [3]. Lastly, consistently with GLISp [3], Problem (3) is rescaled so that each decision variable assumes the \(\left[ -1, 1\right] \) range (at least inside \(\Omega \)).
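The shape-parameter recalibration can be sketched as a leave-one-preference-out cross-validation; the snippet below is a simplified outline (it reuses the hypothetical `fit_surrogate_weights` helper sketched after Sect. 3.1 and a generic radial function `rbf`, with 0-based index pairs), not the exact implementation of GLISp or GLISp-r.

```python
import numpy as np

def select_shape_parameter(X, S, b, eps_grid, rbf, sigma=1e-2):
    """Grid-search LOOCV for the RBF shape parameter epsilon (simplified outline)."""
    X = np.asarray(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)       # pairwise distances
    best_eps, best_score = eps_grid[0], -1
    for eps in eps_grid:
        Phi = rbf(eps * D)                                          # RBF matrix for this epsilon
        score = 0
        for h in range(len(S)):                                     # leave preference h out
            keep = [i for i in range(len(S)) if i != h]
            beta = fit_surrogate_weights(Phi, [S[i] for i in keep],
                                         [b[i] for i in keep], sigma=sigma)
            l, k = S[h]
            diff = Phi[l] @ beta - Phi[k] @ beta
            pred = -1 if diff <= -sigma else (1 if diff >= sigma else 0)
            score += int(pred == b[h])                              # correctly reproduced?
        if score > best_score:
            best_eps, best_score = eps, score
    return best_eps
```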
5.1 Global convergence of GLISp-r
Whenever we are dealing with any global optimization algorithm, it is possible to guarantee its convergence to the global minimizer of Problem (3) by checking if the conditions of the following Theorem hold.
Theorem 2
(Convergence of a global optimization algorithm [33]) Let \(\Omega \subset \mathbb {R}^{n}\) be a compact set. An algorithm converges to the global minimum of every continuous function \(f:\mathbb {R}^n \rightarrow \mathbb {R}\) over \(\Omega \) if and only if its sequence of iterates,
$$\begin{aligned} \langle \varvec{x}_{i}\rangle _{i\ge 1}, \quad \varvec{x}_{i}\in \Omega , \end{aligned}$$
is dense in \(\Omega \).
In what follows, for the sake of clarity, we denote:
-
\(\mathcal {X}_\infty \) as the set containing all the elements of \(\langle \varvec{x}_i\rangle _{i \ge 1}\) (infinite sequence),
-
\(\mathcal {X}_k \subseteq \mathcal {X}_\infty \) as the set containing all the elements of \(\langle \varvec{x}_i\rangle _{i = 1}^k\), which is a subsequence of \(\langle \varvec{x}_i\rangle _{i \ge 1}\) composed of its first \(k \in \mathbb {N}\) entries.
To prove the convergence of GLISp-r, we also make use of the following Theorem, which gives us a sufficient condition that ensures the denseness of the sequence of iterates produced by any global optimization algorithm.
Theorem 3
(A sufficient condition for the denseness of \(\mathcal {X}_\infty \) [28]) Let \(\Omega \) be a compact subset of \(\mathbb {R}^n\) and let \(\langle \varvec{x}_{i}\rangle _{i\ge 1}\) be the sequence of iterates generated by an algorithm \(\texttt {A}\) (when run indefinitely). Suppose that there exists a strictly increasing sequence of positive integers \(\langle i_{t}\rangle _{t\ge 1}, i_{t}\in \mathbb {N},\) such that \(\langle \varvec{x}_{i}\rangle _{i\ge 1}\) satisfies the following condition for some \(\alpha \in (0,1]\):
where:
Then, \(\mathcal {X}_{\infty }\) generated by \(\texttt {A}\) is dense in \(\Omega \).
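In words, condition (26) requires that, infinitely often, the new iterate lies at least a fixed fraction \(\alpha \) of the current fill distance away from all previously evaluated samples; schematically (a paraphrase of (26)–(27), not a verbatim restatement),
\( \min _{\varvec{x} \in \mathcal {X}_{i_t - 1}} \left\| \varvec{x}_{i_t} - \varvec{x}\right\| _2 \ge \alpha \cdot d_{\Omega }\left( \mathcal {X}_{i_t - 1}\right) , \quad \forall t \in \mathbb {N}, \qquad d_{\Omega }\left( \mathcal {X}\right) = \max _{\varvec{x} \in \Omega } \min _{\varvec{x}_i \in \mathcal {X}} \left\| \varvec{x} - \varvec{x}_i\right\| _2, \)
where \(d_{\Omega }\left( \mathcal {X}\right) \) is the largest distance between a point of \(\Omega \) and its closest sample (the fill distance of \(\mathcal {X}\) with respect to \(\Omega \)).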
The aforementioned Theorem has been used to prove the global convergence of CORS [28], a black-box optimization procedure. Clearly, if \(\Omega \) is compact and (26) holds for some \(\alpha \in (0,1]\) (making \(\mathcal {X}_{\infty }\) dense in \(\Omega \)) then, due to Theorem 2, algorithm \(\texttt {A}\) converges to the global minimum of every continuous function \(f\left( \varvec{x}\right) \) over \(\Omega \). As far as the preference-based framework is concerned, we need to ensure that the scoring function \(f\left( \varvec{x}\right) \), which represents the preference relation \(\succsim \) on \(\Omega \), is continuous. Theorem 1 gives us necessary conditions on \(\succsim \) for such a property to hold. Furthermore, Proposition 1 can be used to check the existence of a \(\succsim \)-maximum of \(\Omega \). The next Theorem addresses the global convergence of GLISp-r (Algorithm 2).
Theorem 4
(Convergence of GLISp-r) Let \(\Omega \subset \mathbb {R}^n\) be a compact set and \(\succsim \) be a continuous preference relation on \(\Omega \) of a rational (as in Definition 3) human decision-maker. Then, provided that \(\exists \delta _j \in \Delta _{cycle}\) in (24) such that \(\delta _j = 0\) and \(N_{max} \rightarrow \infty \), GLISp-r converges to the global minimum of Problem (3) for any choice of its remaining hyper-parameters (see Footnote 5).
Proof
Compactness of \(\Omega \), continuity of \(\succsim \) on \(\Omega \) and rationality of the decision-maker are conditions that ensure the existence of a solution for Problem (3) (cf. Theorem 1 and Proposition 1).
Consider the sequence of iterates \(\langle \varvec{x}_i\rangle _{i \ge 1}\) produced by Algorithm 2. The first \(N_{init} \in \mathbb {N}, N_{init} \ge 2,\) elements of \(\langle \varvec{x}_i\rangle _{i \ge 1}\) are obtained by the LHD [24]. Instead, each \(\varvec{x}_i \in \mathcal {X}_{\infty }, i > N_{init},\) is selected as the solution of Problem (8) with \(a_N\left( \varvec{x}\right) \) in (23).
Now, suppose that \(\delta \) in (23) is cycled regardless of the improvement that the new candidate samples might bring (non-greedy cycling). Denote the exploration–exploitation trade-off weight at iteration k of Algorithm 2 as \(\delta \left( k\right) \) and assume that \(\delta (k) = \delta _j \in \Delta _{cycle}\). Then, the cycling is performed as:
\(\delta \left( k+1\right) = \delta _{\left( j+1\right) \text {mod}{N_{cycle}}}\)    (28)
instead of (25). Without loss of generality, suppose that \(\Delta _{cycle}\) in (24) is defined as:
Then, every \(N_{cycle}\) iterations Algorithm 2 looks for a new candidate sample by minimizing the (min–max rescaled) IDW distance function in (13), regardless of the surrogate model in (9), see \(a_N\left( \varvec{x}\right) \) in (23) and Problem (8). In practice, minimizing \(\bar{z}_N\left( \varvec{x}; \mathcal {X}_{aug}\right) \) over \(\Omega \) is equivalent to solving \({\text {arg}}\,\min _{\varvec{x} \in \Omega } z_N\left( \varvec{x}\right) \) since scaling and shifting the IDW distance function does not change its minimizers [26]. We define the following strictly increasing sequence of positive integers:
which is such that:
Now, recall from Proposition 4 that each \(\varvec{x}_i \in \mathcal {X}_{i_{t'} - 1}\) is a global maximizer of Problem (30). Furthermore, \(z_{i_{t'} - 1}\left( \varvec{x}\right) \) in (13) is differentiable everywhere and thus continuous (see Proposition 3). Then, by the Extreme Value Theorem [1], Problem (30) admits at least one solution. Hence, we can conclude that:
Clearly, due to how new candidate samples are sought (i.e. by minimizing some acquisition function over \(\Omega \)), we have that (recall (27)):
for some \(\alpha ' \in (0, 1]\). Then, by combining (31) and (32), we get:
Therefore, \(\exists \alpha ' \in (0, 1]\) such that \(\tilde{\alpha }' = \alpha ' \cdot d_{\Omega }\left( \mathcal {X}_{i_{t'} - 1}\right) \) which satisfies the condition (26) of Theorem 3, \(\forall t' \in \mathbb {N}\). Thus, Algorithm 2 with \(\delta \) cycled as in (28) produces a sequence of iterates that is dense in \(\Omega \).
Next, consider the greedy \(\delta \)-cycling strategy proposed in Sect. 4.3.1, for an arbitrary choice of \(\Delta _{cycle}\) in (24). Let us examine the case in (25) when \(\delta \) is kept unchanged from one iteration of Algorithm 2 to the next. Denote as \(\langle i_{t''} \rangle _{t'' = 1}^{t_{max}''}, t_{max}'' \in \mathbb {N},\) the sequence of indexes of those samples that improve upon the current best candidate, resulting in no change in the exploration–exploitation trade-off weight. We have that:
Clearly, \(\varvec{x}_{i_{t''}} \notin \mathcal {X}_{i_{t''} - 1}\) since it is strictly preferred to all the other samples in \(\mathcal {X}_{i_{t''} - 1}\). Thus, we could define a positive constant \(\tilde{\alpha }'' \in \mathbb {R}_{> 0}\) analogously to (31):
Finally, let us consider the greedy \(\delta \)-cycling strategy in (25) as a whole and assume, as in Theorem 4, that \(\exists \delta _j \in \Delta _{cycle}\) in (24) such that \(\delta _j = 0\). We can build a strictly increasing sequence of positive integers \(\langle i_{t}\rangle _{t \ge 1}\) by merging:
-
The elements of the sequence \(\langle i_{t''} \rangle _{t'' = 1}^{t_{max}''}\), which are the indexes of those samples that improve upon the best candidates \(\varvec{x_{best}}\left( i_{t''} - 1\right) , \forall t'': 1 \le t'' \le t_{max}''\);
-
The elements of the sequence \(\langle i_{t'} \rangle _{t' \ge 1}\), which constitute the indexes of those samples found by solving the pure exploration problem in (30). Note that, unless Algorithm 2 always improves upon its current best candidate (in which case \(\langle i_{t''} \rangle _{t'' = 1}^{t_{max}''}\) is actually infinite and hence \(\mathcal {X}_{\infty }\) is dense in \(\Omega \)), Problem (8) with \(a_N\left( \varvec{x}\right) \) in (23) is solved using \(\delta = 0\) infinitely often, although not necessarily every \(N_{cycle}\) iterations as in (29).
Hence, we can select a positive constant \(\tilde{\alpha } \in \mathbb {R}_{>0}\) as \(\tilde{\alpha } = \min \left\{ \tilde{\alpha }', \tilde{\alpha }''\right\} \) which, analogously to (33), is such that:
Therefore, \(\exists \alpha \in (0, 1]\) such that \(\tilde{\alpha } = \alpha \cdot d_{\Omega }\left( \mathcal {X}_{i_{t} - 1}\right) \) which satisfies the condition (26) of Theorem 3, \(\forall t \in \mathbb {N}\). Thus, GLISp-r with \(\delta \) in (23) cycled following the greedy \(\delta \)-cycling strategy in (25) converges to the global minimum of Problem (3). \(\square \)
Most preference-based response surface methods, such as the ones in [3, 4, 6, 13], do not address their convergence to the global minimum of Problem (3). In this work, we have shown that, by leveraging some results from the utility theory literature (see Sect. 2), we can find sufficient conditions on the preference relation of the human decision-maker (\(\succsim \) on \(\Omega \)) that guarantee the existence of a solution for Problem (3) and allow us to analyze the convergence of any preference-based procedure as we would in the global optimization framework.
We conclude this section with two Remarks on Theorem 4. The first deals with the importance of adding a zero entry inside \(\Delta _{cycle}\) in (24), while the second addresses the selection of the cycling set.
Remark 4
The omission of a zero entry inside \(\Delta _{cycle}\) in (24) does not necessarily preclude the global convergence of Algorithm 2 on all possible preference-based optimization problems. For example, if \(f\left( \varvec{x}\right) \) is a constant function then, after we evaluate the first sample \(\varvec{x}_1\), any other point brings no improvement (i.e. we would have \(\varvec{x}_i \sim \varvec{x}_1, \forall i > 1\)). The caveat is that, if \(\not \exists \delta _j \in \Delta _{cycle}\) such that \(\delta _j = 0\), then the sequence \(\langle i_{t''} \rangle _{t'' = 1}^{t_{max}''}\) for which (34) holds is likely to be finite (i.e. GLISp-r does not improve upon its current best candidate infinitely often). Moreover, differently from Problem (30), we have no guarantee that the solutions of:
with \(a_N\left( \varvec{x}\right) \) defined as in (23) and for \(\delta \ne 0\), are not already present in \(\mathcal {X}_N\).
Therefore, we cannot ensure the existence of a strictly increasing sequence of positive integers that is infinite and for which (26) holds. In particular, the result in Theorem 3 does not apply to \(\langle i_{t''} \rangle _{t'' = 1}^{t_{max}''}\), since the sequence is finite. Consequently, Algorithm 2 does not necessarily produce a sequence of iterates \(\langle \varvec{x}_i \rangle _{i \ge 1}\) that is dense in \(\Omega \), preventing its convergence on some (but not all) preference-based optimization problems.
Remark 5
Theorem 4 guarantees that, under some hypotheses, GLISp-r converges to the global minimum of Problem (3); however, it does not give any indication of its convergence rate. In particular, if \(\Delta _{cycle}\) is actually \(\langle 0 \rangle \), Algorithm 2 amounts to performing an exhaustive search without considering the information carried by the preferences in \(\mathcal {B}\) (6) and \(\mathcal {S}\) (7), which is quite inefficient [1]. Therefore, it is best to include some \(\delta _j\)’s in \(\Delta _{cycle}\) that allow the surrogate model to be taken into consideration. For this reason, we suggest choosing values that are well spread within the \(\left[ 0, 1\right] \) range, together with a zero entry to ensure the result in Theorem 4. Intuitively, the rate of convergence will depend on how well \(\hat{f}_{N}\left( \varvec{x}\right) \) in (9) approximates \(f\left( \varvec{x}\right) \), as well as on the choice of \(\Delta _{cycle}\) in (24).
6 Empirical results
In this section, we compare the performances of algorithms GLISp-r and GLISp [3] on a variety of benchmark bound-constrained global optimization problems taken from [2, 14, 18, 25]. Consistently with the preference-based optimization literature, we stick to benchmark problems with fewer than \(n = 10\) decision variables [3, 4, 6, 13]. We also consider the revisited version of the IDW distance function in (13) employed by C-GLISp [40], which is an extension of GLISp [3] proposed by the same authors (see Footnote 6). In particular, in C-GLISp [40]:
while \(z_N(\varvec{x}) = 0, \forall \varvec{x} \in \mathcal {X}\). In (36), \(i_{best}\left( N\right) \in \mathbb {N}, 1 \le i_{best}\left( N\right) \le N,\) represents the index of the best-found candidate when \(|\mathcal {X}|= N\). In practice, \(z_N\left( \varvec{x}\right) \) in (36) improves the exploratory capabilities of GLISp [3] without the need to define an alternative acquisition function from the one in (15) (see [40]).
We point out that we could also consider the preferential Bayesian optimization algorithm in [6] as an additional competitor for GLISp-r. However, in [3], the authors show that GLISp outperforms the aforementioned method. As we will see in the next sections, GLISp-r exhibits convergence speeds that are similar to those of GLISp [3]; hence, we have decided to omit the algorithm in [6] from our analysis. Moreover, in our preliminary tests, the latter method did not prove to be on par with the other competitors.
6.1 Experimental setup
All benchmark optimization problems have been solved on a machine with two Intel Xeon E5-2687W @3.00 GHz CPUs and 128 GB of RAM. GLISp-r has been implemented in MATLAB. Similarly, we have used the MATLAB code for GLISp provided in [3] (formally, version 2.4 of the software package) and the one for C-GLISp supplied in [40] (version 3.0 of the same code package). For all the procedures, Problem (8) has been solved using Particle Swarm Optimization (PSWARM). In particular, we have used its MATLAB implementation provided by [20, 34, 35].
To achieve a fair comparison, we have chosen the same hyper-parameters for both GLISp-r and GLISp/C-GLISp [3, 40], whenever possible. This applies, for example, to the shape parameter \(\epsilon \), which is initialized to \(\epsilon = 1\), and to the radial function \(\varphi \left( \cdot \right) \), which is an inverse quadratic. Furthermore, the shape parameter for the surrogate model \(\hat{f}_N\left( \varvec{x}\right) \) in (9) is recalibrated using LOOCV (see GLISp [3]) at the iterations \(\mathcal {K}_{R} = \left\{ 1, 50, 100\right\} \). Its possible values are \(\mathcal {E}_{LOOCV{}} = \{0.1000,\) 0.1668, 0.2783, 0.4642, 0.7743, 1, 1.2915, 2.1544, 3.5938, 5.9948, \(10\}\). The remaining hyper-parameters shared by GLISp/C-GLISp [3, 40] and GLISp-r are set to \(\lambda = 10^{-6}\) and \(\sigma = 10^{-2}\). Regarding GLISp-r, we have chosen \(\Delta _{cycle} = \langle 0.95,0.7,0.35,0 \rangle \), where we have included a zero term to comply with the convergence result in Theorem 4. The reasoning behind this cycling sequence is that, after the initial sampling phase, we give priority to the surrogate so as to drive the algorithm towards more promising regions of \(\Omega \), for example where local minima are located. In practice, if \(f\left( \varvec{x}\right) \) is a function that can be approximated well by \(\hat{f}_N\left( \varvec{x}\right) \) with little data, starting with a \(\delta \) in (23) that is close to 1 can lead the procedure to converge considerably faster. If that is not the case, the remaining terms contained inside \(\Delta _{cycle}\) promote the exploration of other zones of the constraint set, either with or without relying on \(\hat{f}_N\left( \varvec{x}\right) \). For the sake of completeness, we also analyze the performances of GLISp-r when equipped with two particular cycling sequences: \(\Delta _{cycle} = \langle 0.95 \rangle \) (“pure” exploitation) and \(\Delta _{cycle} = \langle 0 \rangle \) (pure exploration). Concerning GLISp/C-GLISp [3, 40], which use \(a_N\left( \varvec{x}\right) \) in (15), we set \(\delta = 2\), as proposed by the authors in [3]. Lastly, for GLISp-r, we select \(K_{aug} = 5\) for all optimization problems since, empirically, after several preliminary experiments, it has proven to be sufficient to rescale \(z_N\left( \varvec{x}\right) \) in (13) and \(\hat{f}_N\left( \varvec{x}\right) \) in (9) in most cases (see for example Fig. 4).
We run the procedures using a fixed budget of \(N_{max}=200\) samples and solve each benchmark optimization problem \(N_{trial} = 100\) times, starting from \(N_{init} = 4 \cdot n\) points generated by LHDs [24] with different random seeds. To ensure fairness of comparison, all the algorithms are started from the same samples.
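For illustration, an initial design of this kind can be generated, for instance, with SciPy’s quasi-Monte Carlo module (the bounds below are hypothetical; the experiments reported here rely on the MATLAB implementations of the algorithms).

    import numpy as np
    from scipy.stats import qmc

    n = 2                                                   # number of decision variables
    l, u = np.array([-5.0, -5.0]), np.array([5.0, 5.0])     # hypothetical bound constraints
    N_init = 4 * n
    sampler = qmc.LatinHypercube(d=n, seed=0)               # one design per random seed
    X_init = qmc.scale(sampler.random(n=N_init), l, u)      # N_init-by-n matrix of initial samples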
For the sake of clarity, we point out that the preferences between the couples of samples are expressed using the preference function in (4), where \(f\left( \varvec{x}\right) \) is the corresponding cost function for the considered benchmark optimization problem. Therefore, the preferences are always expressed consistently, allowing us to compare GLISp-r and GLISp/C-GLISp [3, 40] properly. That would not necessarily be the case if a human decision-maker were involved.
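Concretely, the synthetic decision-maker can be thought of as the following oracle (a minimal sketch, assuming that the preference function (4) returns \(-1\), \(0\) or \(1\) according to which sample achieves the lower cost; the exact convention is the one given with (4)).

    def preference(f, x1, x2):
        """Simulated decision-maker: derive the preference directly from the benchmark cost f."""
        f1, f2 = f(x1), f(x2)
        if f1 < f2:
            return -1    # x1 is strictly preferred to x2
        if f1 > f2:
            return 1     # x2 is strictly preferred to x1
        return 0         # the two calibrations are deemed equivalent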
6.2 Results
We compare the performances of GLISp-r and GLISp/C-GLISp [3, 40] on each benchmark optimization problem by means of convergence plots and data profiles [1]. Convergence plots depict the median, best and worst case performances over the \(N_{trial}\) instances with respect to the cost function values achieved by \(\varvec{x_{best}}\left( N\right) \), as N increases. Data profiles are one of the most popular tools for assessing the efficiency and robustness of global optimization methods: efficient surrogate-based methods exhibit steep slopes (i.e. fast convergence speeds), while robust algorithms are able to solve more (instances of) problems within the budget \(N_{max}\). In general, no method is both efficient and robust: a trade-off must be made [32]. Typically, data profiles are used to visualize the performances of several algorithms on multiple benchmark optimization problems simultaneously. However, due to the stochastic nature of LHDs [24], here we also depict the data profiles of GLISp-r, GLISp [3] and C-GLISp [40] on each benchmark optimization problem on its own, highlighting how the algorithms behave when started from different samples. In practice, data profiles show, for \(1 \le N \le N_{max}\), how many among the \(N_{trial}\) instances of one (or several) benchmark optimization problem(s) have been solved to a prescribed accuracy, defined as [1]:
where \(f^* = \min _{\varvec{x} \in \Omega } f\left( \varvec{x}\right) \). In particular, here we consider a benchmark optimization problem to be solved by some algorithm when \(acc\left( N\right) > t, t = 0.95\). In what follows, the results achieved by GLISp-r equipped with \(\Delta _{cycle} = \langle 0.95 \rangle \) and \(\Delta _{cycle} = \langle 0 \rangle \) are reported only in the data profiles (and not in the convergence plots), to make the graphs more readable.
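To clarify how the profiles are built, the sketch below computes, for each N, the fraction of instances solved to accuracy t. Purely for illustration, the optimality gap is normalized here by the gap of the best initial sample, which is an assumption on our part; the actual accuracy measure is the one displayed above.

    import numpy as np

    def data_profile(best_f, f_star, N_init, t=0.95):
        """Fraction of instances solved to accuracy t as a function of N.

        best_f is an (N_trial x N_max) array whose entry [r, N - 1] is f(x_best(N)) in run r,
        f_star the known global minimum; the initial gap is assumed to be strictly positive."""
        gap0 = best_f[:, N_init - 1] - f_star             # gap right after the initial sampling phase
        acc = 1.0 - (best_f - f_star) / gap0[:, None]     # illustrative accuracy measure, one row per run
        return (acc > t).mean(axis=0)                     # data profile for N = 1, ..., N_max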
Figure 5 shows the cumulative data profiles of GLISp-r, GLISp [3] and C-GLISp [40], which result from considering all the benchmark optimization problems simultaneously. Instead, Figs. 6 and 7 depict the convergence plots and the data profiles achieved by the considered algorithms on each benchmark. In Table 1 we report the number of samples required to reach the accuracy \(t = 0.95\), i.e.:
In practice, given that each benchmark optimization problem is solved multiple times, \(N_{acc > t}\) in Table 1 is assessed median-wise (over the \(N_{trial}\) instances), giving us an indication of the efficiency of each method. In the same table, we also report the percentages of instances of problems solved by GLISp-r, GLISp [3] and C-GLISp [40] (which is an indicator of robustness) and the average execution times of each algorithm.
From our experiments, we gather that:
-
GLISp-r (\(\Delta _{cycle} = \langle 0.95, 0.7, 0.35, 0\rangle \)) can be notably more robust than GLISp [3] without excessively compromising its convergence speed. On several occasions, the latter algorithm gets stuck on local minima of the benchmark optimization problems. That is particularly evident on the bemporad [2] and gramacy-lee [14] benchmarks, in which cases GLISp [3] solves, respectively, only 70% and 31% of the \(N_{trial}\) instances. Instead, GLISp-r (\(\Delta _{cycle} = \langle 0.95, 0.7, 0.35, 0\rangle \)) is able to solve both of them to the prescribed accuracy. In practice, GLISp [3] shines when exploitation is better suited for the benchmark optimization problem at hand. That is particularly relevant for the bukin 6 [18] problem, on which both GLISp [3] and GLISp-r with \(\Delta _{cycle} = \langle 0.95 \rangle \) (“pure” exploitation) perform quite well. Lastly, C-GLISp [40] is as robust as GLISp-r (\(\Delta _{cycle} = \langle 0.95, 0.7, 0.35, 0\rangle \)) but is the least efficient among the analyzed procedures.
-
The pure exploration strategy (i.e. GLISp-r with \(\Delta _{cycle} = \langle 0\rangle \)) performs poorly, even for \(n = 2\). When \(n = 5\), it is unable to solve any problem (in fact, the data profiles stay flat after the initial sampling phase). Only for \(n = 1\) is the pure exploration strategy quite robust and relatively efficient.
-
Vice versa, a “purely” exploitative approach (i.e. GLISp-r with \(\Delta _{cycle} = \langle 0.95\rangle \)), although not necessarily globally convergent (see Theorem 4), can actually be successful on some benchmark optimization problems. Often, such a strategy exhibits a slightly lower \(N_{acc > t}\) (median-wise) than the other procedures, see Table 1. Notably, the data profiles of GLISp-r (\(\Delta _{cycle} = \langle 0.95\rangle \)) can be quite similar to those of GLISp [3]. Therefore, we could say that GLISp [3] has limited exploratory capabilities.
-
The main disadvantage of GLISp-r, compared to GLISp [3] and C-GLISp [40], is the increased computational time, as reported in Table 1. That is due to the computational overhead of Algorithm 1, which generates the augmented sample set \(\mathcal {X}_{aug}\) for the proposed procedure by performing K-means clustering (a sketch of this step is given after this list). On average, GLISp-r (\(\Delta _{cycle} = \langle 0.95, 0.7, 0.35, 0\rangle \)) is 31% slower than GLISp [3] and 72% slower than C-GLISp [40]. However, we point out that these overheads are practically negligible when the queries to the decision-maker involve running computer simulations or performing real-world experiments which, contrary to the considered benchmark optimization problems, can take from a few minutes up to several hours. This is a common assumption made by surrogate-based methods: at each iteration, the most time-consuming operation is the query to the decision-maker (or, in the context of black-box optimization, the measurement of the cost function [19, 36]).
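The clustering step mentioned in the last point can be sketched as follows (an illustrative reading of Algorithm 1 based on scikit-learn, not a verbatim transcription; the centroids and the bounds \(\varvec{l}, \varvec{u}\) are appended to the current samples to form \(\mathcal {X}_{aug}\)).

    import numpy as np
    from sklearn.cluster import KMeans

    def augmented_sample_set(X, l, u, K_aug=5, seed=0):
        """Stack the current samples, their K-means centroids and the bounds l, u row-wise."""
        K = min(K_aug, X.shape[0])                                    # at most one cluster per sample
        km = KMeans(n_clusters=K, n_init=10, random_state=seed).fit(X)
        return np.vstack([X, km.cluster_centers_, l, u])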
7 Conclusions
In this work, we introduced the preference-based optimization problem from a utility theory perspective. Then, we extended algorithm GLISp [3], giving rise to GLISp-r. In particular, we have addressed the shortcomings of \(a_N\left( \varvec{x}\right) \) in (15), defined a new acquisition function and proposed the greedy \(\delta \)-cycling strategy, which dynamically varies the exploration–exploitation weight \(\delta \) in (23). Furthermore, we have proven the global convergence of GLISp-r, which is strictly related to the choice of the cycling sequence \(\Delta _{cycle}\). To the best of our knowledge, GLISp-r is the first preference-based surrogate-based method with a formal proof of convergence.
Compared to the original method, the proposed extension is less likely to get stuck on local minima of the scoring function and proves to be more robust on several benchmark optimization problems without particularly compromising its convergence speed. Moreover, we have also considered algorithm C-GLISp [40], which improves the exploratory capabilities of GLISp [3] by employing a different exploration function. In our experiments, we have observed that, even though C-GLISp [40] is as robust as GLISp-r, it often exhibits slower convergence rates compared to GLISp [3] and the proposed extension.
Future research will be devoted to extending GLISp-r to handle black-box constraints, which involve functions whose analytical formulations are not available. One possibility is to follow the same reasoning behind C-GLISp [40], which does so by adding a term to the acquisition function that penalizes the exploration of those regions of \(\Omega \) where the black-box constraints are likely to be violated. Additionally, a compelling line of research concerns preference-based optimization problems in which the human decision-maker is “irrational” (in the sense that some of the properties in Definition 3, typically completeness, do not hold). From a practical perspective, when the preference relation \(\succsim \) on \(\Omega \) of the DM is not complete, we would need a surrogate model that is able to handle the answer “I do not know” whenever the decision-maker cannot decide which, among two calibrations, he/she prefers.
Data availability
Notes
As a matter of fact, the goal of some preference learning methods is to estimate the utility function of the decision-maker [12].
Note that, whenever we say that a multivariable function, such as \(\hat{f}_N\left( \varvec{x}\right) \) in (9), is “differentiable everywhere” we imply that it is differentiable at each \(\varvec{x} \in \mathbb {R}^{n}\).
Note that, to avoid dividing by zero in (21), \(\Delta H\left( \mathcal {X}\right) \) can be set to \(h_{max}\left( \mathcal {X}\right) \) or 1 whenever \(h_{min}\left( \mathcal {X}\right) = h_{max}\left( \mathcal {X}\right) \ne 0\) or \(h_{min}\left( \mathcal {X}\right) = h_{max}\left( \mathcal {X}\right) = 0\) respectively.
We could add all \(2^n\) vertices of the box defined by the bound constraints \(\left\{ \varvec{x}: \varvec{l} \le \varvec{x} \le \varvec{u}\right\} \). However, we have preferred to include only \(\varvec{l}\) and \(\varvec{u}\) to avoid increasing the cardinality of the augmented sample set, especially in the case of high-dimensional problems.
Note that, in this work, we use C-GLISp [40] only to test how the IDW distance function in (36) compares to the other formulation in (13). However, C-GLISp [40] has been developed mainly to extend GLISp [3] in order to handle black-box constraints. The different definition of \(z_N\left( \varvec{x}\right) \) is only a minor detail of such paper.
References
Audet, C., Hare, W.: Derivative-Free and Blackbox Optimization. Springer, Berlin (2017). (ISBN 9783319886800)
Bemporad, A.: Global optimization via inverse distance weighting and radial basis functions. Comput. Optim. Appl. 77(2), 571–595 (2020). https://doi.org/10.1007/s10589-020-00215-w. (ISSN 1573-2894)
Bemporad, A., Piga, D.: Global optimization based on active preference learning with radial basis functions. Mach. Learn. 110(2), 417–448 (2021). https://doi.org/10.1007/s10994-020-05935-y. (ISSN 1573-0565)
Benavoli, A., Azzimonti, D., Piga, D.: Preferential Bayesian optimisation with skew gaussian processes. In: Proceedings of the Genetic and Evolutionary Computation Conference Companion, pp. 1842–1850 (2021). https://doi.org/10.1145/3449726.3463128
Bishop, C.M., Nasrabadi, N.M.: Pattern Recognition and Machine Learning. Springer, Berlin (2006)
Brochu, E., De Freitas, N., Ghosh, A.: Active preference learning with discrete choice data. In: Advances in Neural Information Processing Systems 20 (NIPS 2007), pp. 409–416 (2007). https://papers.nips.cc/paper/2007/hash/b6a1085a27ab7bff7550f8a3bd017df8-Abstract.html
Chu, W., Ghahramani, Z.: Preference learning with gaussian processes. In: Proceedings of the 22nd International Conference on Machine Learning, pp. 137–144 (2005). https://doi.org/10.1145/1102351.1102369
Debreu, G.: Theory of value: An Axiomatic Analysis of Economic Equilibrium, Vol. 17. Yale University Press (1971) (ISBN 0300015593)
Fasshauer, G.E.: Meshfree Approximation Methods with MATLAB, vol. 6. World Scientific, Singapore (2007). (ISBN 9789812706348)
Feldman, A. M., Serrano, R.: Welfare Economics and Social Choice Theory. Springer Science & Business Media (2006). (ISBN 9780387293677)
Fornberg, B., Flyer, N.: A Primer on Radial Basis Functions with Applications to the Geosciences. SIAM, Philadelphia (2015). (ISBN 9781611974027)
Fürnkranz, J., Hüllermeier, E.: Preference Learning and Ranking by Pairwise Comparison. Springer, Berlin (2010). (ISBN 9783642141249)
González, J., Dai, Z., Damianou, A., Lawrence, N. D.: Preferential Bayesian optimization. In: International Conference on Machine Learning, pp. 1282–1291 (2017). https://assets.amazon.science/da/1e/24ad7c354431bd0c2f64a6049269/preferential-bayesian-optimization.pdf
Gramacy, R.B., Lee, H.K.H.: Cases for the nugget in modeling computer experiments. Stat. Comput. 22(3), 713–722 (2012). https://doi.org/10.1007/s11222-010-9224-x
Gutmann, H.-M.: A radial basis function method for global optimization. J. Glob. Optim. 19(3), 201–227 (2001). https://doi.org/10.1023/A:1011255519438
Han, J., Pei, J., Kamber, M.: Data Mining: Concepts and Techniques. Elsevier, Amsterdam (2011). (ISBN 9789380931913)
Hastie, T., Tibshirani, R., Friedman, J.H.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction, vol. 2. Springer, Berlin (2009). (ISBN 9780387848570)
Jamil, M., Yang, X.-S.: A literature survey of benchmark functions for global optimisation problems. Int. J. Math. Modell. Numer. Optim. 4(2), 150–194 (2013). https://doi.org/10.1504/IJMMNO.2013.055204
Jones, D.R.: A taxonomy of global optimization methods based on response surfaces. J. Glob. Optim. 21(4), 345–383 (2001). https://doi.org/10.1023/A:1012771025575
Le Thi, H.A., Vaz, A.I.F., Vicente, L.N.: Optimizing radial basis functions by DC programming and its use in direct search for global derivative-free optimization. TOP 20(1), 190–214 (2012). https://doi.org/10.1007/s11750-011-0193-9
Liu, G.P., Yang, J.B., Whidborn, J.F.: Multiobjective Optimisation and Control. Research Studies Press Ltd, Baldock (2002). (ISBN 9780863802645)
Lloyd, S.: Least squares quantization in PCM. IEEE Trans. Inf. Theory 28(2), 129–137 (1982). https://doi.org/10.1109/TIT.1982.1056489
Martí, R., Lozano, J. A., Mendiburu, A., Hernando, L.: Multi-start methods. In: Handbook of Heuristics, pp. 155–175 (2018). https://doi.org/10.1007/0-306-48056-5_12
McKay, M.D., Beckman, R.J., Conover, W.J.: A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics 42(1), 55–61 (2000). https://doi.org/10.1080/00401706.2000.10485979
Mishra, S. K.: Some new test functions for global optimization and performance of repulsive particle swarm method. Available at SSRN 926132. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=926132 (2006)
Nocedal, J., Wright, S.J.: Numerical Optimization. Springer, Berlin (1999). (ISBN 0387987932)
Ok, E.A.: Real Analysis with Economic Applications. Princeton University Press, Princeton (2011). (ISBN 9780691117683)
Regis, R.G., Shoemaker, C.A.: Constrained global optimization of expensive black box functions using radial basis functions. J. Glob. Optim. 31(1), 153–171 (2005). https://doi.org/10.1007/s10898-004-0570-0
Regis, R.G., Shoemaker, C.A.: A stochastic radial basis function method for the global optimization of expensive functions. INFORMS J. Comput. 19(4), 497–509 (2007). https://doi.org/10.1287/ijoc.1060.0182
Roveda, L., Maggioni, B., Marescotti, E., Shahid, A.A., Zanchettin, A.M., Bemporad, A., Piga, D.: Pairwise preferences-based optimization of a path-based velocity planner in robotic sealing tasks. IEEE Robot. Autom. Lett. 6(4), 6632–6639 (2021). https://doi.org/10.1109/LRA.2021.3094479
Shepard, D.: A two-dimensional interpolation function for irregularly-spaced data. In Proceedings of the 1968 23rd ACM National Conference, pp. 517–524 (1968). https://doi.org/10.1145/800186.810616
Thanedar, P.B., Arora, J.S., Li, G.Y., Lin, T.C.: Robustness, generality and efficiency of optimization algorithms for practical applications. Struct. Optim. 2(4), 203–212 (1990). https://doi.org/10.1007/BF01748225
Torn, A., Zilinskas, A.: Global optimization. In: Lecture Notes in Computer Science (1989). (ISBN 9783540508717)
Vaz, A.I.F., Vicente, L.N.: A particle swarm pattern search method for bound constrained global optimization. J. Glob. Optim. 39(2), 197–219 (2007). https://doi.org/10.1007/s10898-007-9133-5
Vaz, A.I.F., Vicente, L.N.: PSwarm: a hybrid solver for linearly constrained global derivative-free optimization. Optim. Methods Softw. 24(4–5), 669–685 (2009). https://doi.org/10.1080/10556780902909948
Vu, K.K., d’Ambrosio, C., Hamadi, Y., Liberti, L.: Surrogate-based methods for black-box optimization. Int. Trans. Oper. Res. 24(3), 393–424 (2017). https://doi.org/10.1111/itor.12292
Wang, Y., Shoemaker, C. A.: A general stochastic algorithmic framework for minimizing expensive black box objective functions based on surrogate models and sensitivity analysis. arXiv:1410.6271 (2014). https://doi.org/10.48550/arXiv.1410.6271
Williams, C.K., Rasmussen, C.E.: Gaussian Processes for Machine Learning. MIT Press, Cambridge (2005)
Zhu, M., Bemporad, A., Piga, D.: Preference-based MPC calibration. arXiv:2003.11294 (2020). https://doi.org/10.48550/arXiv.2003.11294
Zhu, M., Piga, D., Bemporad, A.: C-GLISp: Preference-based global optimization under unknown constraints with applications to controller calibration. IEEE Trans. Control Syst. Technol. (2021). https://doi.org/10.1109/TCST.2021.3136711
Funding
Open access funding provided by Università degli studi di Bergamo within the CRUI-CARE Agreement.
Ethics declarations
Conflict of interest
All the authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix A: Additional proofs
Proof of Lemma 1
The IDW distance function in (13) is differentiable at any \(\varvec{x} \in \mathbb {R}^{n}\) (see Proposition 3); thus, its gradient \(\nabla _{\varvec{x}} z_N\left( \varvec{x}\right) \) is defined everywhere and can be computed by repeatedly applying the chain rule.
Consider the case \(\varvec{x} \in \mathbb {R}^n {\setminus } \mathcal {X}\). First of all, it is easy to prove that the IDW function \(w_{i}\left( \varvec{x}\right) \) in (12) is differentiable at any \(\varvec{x} \in \mathbb {R}^{n}{\setminus }\left\{ \varvec{x}_{i}\right\} \). That is because the squared Euclidean norm \( \left\| {{\varvec{x}-\varvec{x}_{i}}} \right\| _2^2\) is differentiable everywhere and also:
Therefore, due to the reciprocal rule, the IDW function is differentiable at any \(\varvec{x} \in \mathbb {R}^{n}{\setminus }\left\{ \varvec{x}_{i}\right\} \). In order to compute the gradient of \(w_{i}\left( \varvec{x}\right) \) in (12), recall that the gradient of the squared Euclidean norm is equal to:
Using (A1), it is easy to prove that:
Now, let us consider the argument of the \(\arctan \left( \cdot \right) \) function in \(z_N\left( \varvec{x}\right) \), which is (see (13)):
We can find the gradient of \(h\left( \varvec{x}\right) \) by applying the chain rule in combination with (A2):
Finally, using (A3) and applying the chain rule one last time, we can compute the gradient of the IDW distance function in (13):
Now, let us consider the case \(\varvec{x} \in \mathcal {X}\). In [2], the authors have proven that all the partial derivatives of \(z_N\left( \varvec{x}\right) \) in (13) are zero at each \(\varvec{x}_i \in \mathcal {X}\), i.e.
Lastly, combining (A4) and (A5), we obtain the expression for the gradient of the IDW distance function \(\forall \varvec{x} \in \mathbb {R}^{n}\), as reported in (14). \(\square \)
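To make the derivation above concrete, the following sketch evaluates the chain-rule gradient and checks it against central finite differences at a point \(\varvec{x} \notin \mathcal {X}\); it assumes the arctan-based IDW distance function of [2] (the sign and scaling conventions of (13) may differ, but the chain-rule structure is the same).

    import numpy as np

    def z_and_grad(x, X):
        """IDW distance z(x) = (2/pi) arctan(1 / sum_i w_i(x)) and its gradient, for x not in X."""
        d2 = np.sum((x - X) ** 2, axis=1)                 # squared distances ||x - x_i||_2^2
        w = 1.0 / d2                                      # IDW functions w_i(x)
        S = w.sum()
        z = (2.0 / np.pi) * np.arctan(1.0 / S)
        # chain rule: grad z = (4/pi) * sum_i w_i^2 (x - x_i) / (1 + S^2)
        grad = (4.0 / np.pi) * (w ** 2 @ (x - X)) / (1.0 + S ** 2)
        return z, grad

    rng = np.random.default_rng(0)
    X = rng.uniform(-1.0, 1.0, size=(6, 2))               # six samples in R^2
    x = np.array([0.3, -0.2])
    _, g = z_and_grad(x, X)
    h, g_fd = 1e-6, np.zeros_like(x)
    for k in range(x.size):
        e = np.zeros_like(x)
        e[k] = h
        g_fd[k] = (z_and_grad(x + e, X)[0] - z_and_grad(x - e, X)[0]) / (2 * h)
    print(np.allclose(g, g_fd, rtol=1e-4, atol=1e-6))     # should print True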
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Previtali, D., Mazzoleni, M., Ferramosca, A. et al. GLISp-r: a preference-based optimization algorithm with convergence guarantees. Comput Optim Appl 86, 383–420 (2023). https://doi.org/10.1007/s10589-023-00491-2