1 Introduction

1.1 Motivation

In science, one uses mathematical models to describe processes or phenomena. For example, in chemical process engineering there exists a plethora of formulas that describe the course of a reaction.

Often one can make an educated guess, based on experience or prior knowledge, as to which model fits an investigated phenomenon. Still, several models might remain plausible. In order to decide which of these models best fits the true underlying phenomenon, we want to conduct further experiments. Based on these additional experiments, we can hopefully clearly distinguish, i.e. discriminate, the suggested models and pick the one that fits best.

However, such experiments can be very time consuming and costly. For example, a large batch reactor needs time and energy until it reaches the right temperature and pressure to conduct an experiment. Therefore, if we carry out costly experiments, we want to do as few experiments as possible. Thus, every chosen experiment should carry as much information as attainable.

The task of determining such experiments is generally called optimal experimental design (OED). In this article we focus on model discrimination (MD), which is a sub-problem in the family of OED problems: Given two plausible models \(f_1\) and \(f_2\), as well as perhaps some existing data \({\mathcal {D}}_N\), we want to find settings for new experiments that give insight whether \(f_1\) or \(f_2\) better fits the real phenomenon.

1.2 Literature

In order to evaluate whether an experiment is worthwhile or not, several criteria have been introduced. Consequently, optimal settings for new experiments can be determined by optimizing these criteria.

For model discrimination specifically, the first recorded criterion was presented by Hunter and Reiner (1965). They proposed fitting the two competing response models \(f_1\) and \(f_2\) to some given data \({\mathcal {D}}_N\) and then performing the experiment where the models differ the most after the fits.

One issue with the Hunter–Reiner criterion is its implicit assumption that the measurement errors for both models have constant variance. To circumvent this problem, Box and Hill (1967) proposed a Bayesian sequential discrimination method based on the Kullback–Leibler divergence (short: KL-divergence, see, e.g., Bishop 2016). Additionally, they formulated their method in a way such that several competing models can be discriminated.

Then, in 1975, Atkinson and Fedorov (1975a) introduced the T-optimality criterion, or T-criterion for short. The idea behind this criterion is to fix one of the models and fit the other model as well as possible to the first one via a weighted least-squares sum. Afterwards, one selects the points with the largest difference between the models as the proposed experiments.

In the same year, Atkinson and Fedorov (1975b) also introduced an extension for discriminating several models at once.

Theoretical properties of the T-criterion have been investigated in further publications. A good first overview of the most important characteristics can be found in Fedorov and Leonov (2014). A comprehensive analysis is included in the dissertation by Kuczewski (2006). Additionally, Dette and Titoff (2009) establish the duality of the T-criterion and the Chebyshev approximation problem, which was later used in Braess and Dette (2013) to further characterize T-optimal solutions.

An extension of the T-criterion for discriminating dynamic systems that depend on end times and initial values has been introduced in Ucinski and Bogacka (2005). Recently, a further generalization of the T-criterion for semi-parametric models has been presented by Dette et al. (2018).

An important advancement of the T-criterion was introduced by López-Fidalgo et al. (2007): the KL-optimality criterion. In fact, the idea for the KL-criterion is the same as for the T-criterion, but in order to fit one model to the other they use the Kullback-Leibler divergence instead of a sum of least-squares. The advantage of this criterion is that we can freely choose the distribution of the measurement errors, whereas the T-criterion always implicitly assumes Gaussian measurement errors.

In order to compute optimal solutions for the T-criterion, called T-optimal designs, Atkinson and Fedorov (1975a) used an algorithm that is based on the Vertex Direction Method (Fedorov and Leonov 2014; Wynn 1970). Since then, many more strategies have been successfully applied. Kuczewski (2006) and Duarte et al. (2015) both formulated the T-criterion as a semi-infinite program (SIP) and solved it using the general Blankenship and Falk algorithm. Braess and Dette (2013) used results from approximation theory to formulate an algorithm which alternatingly finds new points by solving the dual Chebyshev approximation problem and calculates new weights for a new solution of the primal T-criterion. Later, Dette et al. (2015) simplified this approach by finding several new design points using the directional derivative from the Vertex Direction Method and computing the corresponding new weights for the current selection of design points with the help of a linearization. A heuristic approach based on particle swarm optimization has been successfully applied by Chen et al. (2020).

In the special case where we discriminate two polynomial models, Yue et al. (2019) presented an approach based on moment relaxation and semi-definite programming to find T-optimal solutions. In the even more special case where the degrees of the two competing polynomial models differ by exactly two, Dette et al. (2012) were able to give an analytic approach for calculating T-optimal designs. Further investigations of discriminating polynomial or rational models were carried out by Guchenko and Melas (2017).

1.3 Outline

In this paper, we present a new algorithm for computing T-optimal designs based on two nested adaptive discretization schemes. At first, we introduce some preliminaries in Sect. 2. This includes basic notations from design theory and selected insights on the T-criterion. Then, we introduce our new algorithm in Sect. 3. This section is separated into two parts, where the first considers the inner discretization scheme, while the second considers the outer one. We also show that our algorithm converges towards arbitrarily good approximations of T-optimal designs under realistic assumptions in Sect. 4. Finally, in Sect. 5 we compare our algorithm against three other state-of-the-art methods on two examples from chemical process engineering.

2 Preliminaries

We start by briefly introducing some basic notations from design theory and presenting the main characteristics for T-optimal experimental designs.

2.1 Fundamentals of design theory

A model function in the following context is of the form \(f: {{\,\mathrm{{\mathcal {X}}}\,}}\times \,\Theta \rightarrow {{\,\mathrm{{\mathcal {Y}}}\,}}\). It maps inputs or design points x to their respective values \(f(x,\theta )\), which also depend on the parameters \(\theta\).

We call \({{\,\mathrm{{\mathcal {X}}}\,}}\) the design space, \({{\,\mathrm{{\mathcal {Y}}}\,}}\) the output space, and \(\Theta\) the parameter space. These sets are assumed to be subsets of the multi-dimensional real numbers, i.e. \(\Theta \subseteq {\mathbb {R}}^{d_\theta }\), \({{\,\mathrm{{\mathcal {X}}}\,}}\subseteq {\mathbb {R}}^{d_x}\), and \({{\,\mathrm{{\mathcal {Y}}}\,}}\subseteq {\mathbb {R}}^{d_y}\). Additionally, if the number of outputs is one, i.e. \(d_y = 1\), we call f a single-response model, whereas if it has more than one output, i.e. \(d_y > 1\), then we call f a multi-response model. We assume there exists a true but unknown model \(f_t\) and true but unknown parameters \(\theta _t\). If we perform experiments with settings given by design points \(x_1, \ldots , x_N \in {{\,\mathrm{{\mathcal {X}}}\,}}\), we can make observations

$$\begin{aligned} y_i = f_t(x_i, \theta _t) + \varepsilon _i, \end{aligned}$$

with the \(\varepsilon _i \sim {\mathcal {N}}(0, \Sigma )\) denoting independent identically distributed random measurement errors. Typical settings in the context of chemical process engineering are, for example, temperature, pressure, or initial concentrations.

Following Fedorov and Leonov (2014), we model an experimental design plan as a probability measure. In case of a discrete measure, we write

$$\begin{aligned} \xi = \left\{ \begin{array}{ccc} x_1 & \ldots & x_N \\ w_1 & \ldots & w_N \end{array} \right\} , \end{aligned}$$

with

$$\begin{aligned}w_i\ge 0,\qquad \Vert w\Vert _1 = \sum _{i=1}^N w_i = 1.\end{aligned}$$

Here, the \(x_i\in {{\,\mathrm{{\mathcal {X}}}\,}}\) are the suggested experiments, and the weights \(w_i\) represent the relative number of experiments being proposed for their respective design points. The set of experimental designs, which coincides with the set of probability measures on \({{\,\mathrm{{\mathcal {X}}}\,}}\), is denoted by \(\Xi\), or \(\Xi ({{\,\mathrm{{\mathcal {X}}}\,}})\).
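For illustration, such a discrete design can be stored as an array of support points together with a weight vector. The following minimal Python sketch mirrors this notation; the class and its layout are our own illustration and not part of any library:

```python
import numpy as np

class Design:
    """Discrete experimental design: support points and weights summing to one."""
    def __init__(self, points, weights):
        self.points = np.atleast_2d(np.asarray(points, dtype=float))  # shape (N, d_x)
        self.weights = np.asarray(weights, dtype=float)               # shape (N,)
        assert np.all(self.weights >= 0.0)
        assert np.isclose(self.weights.sum(), 1.0)

# A uniform design on four points of a one-dimensional design space:
xi = Design(points=[[1.0], [2.0], [3.0], [4.0]],
            weights=[0.25, 0.25, 0.25, 0.25])
```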

2.2 The T-criterion

We now rigorously introduce the so-called T-optimality criterion which aims to find several design points with maximum squared difference between two competing models \(f_1\) and \(f_2\).

Definition 1

(T-Optimality Criterion) Assume the first model \(f_1\) to be true, i.e. having fixed parameters \(\theta _1\). Write \(f_1(x) = f_1(x,\theta _1)\). The alternative model \(f_2\) stays parametrized with parameters \(\theta _2\in \Theta _2\). Then, the T-optimality criterion, or short T-criterion, for model discrimination is defined as

$$\begin{aligned} T(\xi )&:= \min _{\theta _2\in \Theta _2} \int _{{{\,\mathrm{{\mathcal {X}}}\,}}} \left[ f_1(x) - f_2(x,\theta _2)\right] ^2 \xi (dx)\\&~=\min _{\theta _2\in \Theta _2}\, \sum _{i=1}^N w_i \left[ f_1(x_i) - f_2(x_i,\theta _2)\right] ^2 \end{aligned}$$

for a design \(\xi = \left\{ (x_i, w_i)\right\} _{i=1,\ldots ,N}\). A design \(\xi ^*\) is called T-optimal if

$$\begin{aligned} \xi ^* = {{\,\mathrm{arg\,max}\,}}_{\xi \in \Xi } ~T(\xi ) = {{\,\mathrm{arg\,max}\,}}_{\xi \in \Xi } \min _{\theta _2\in \Theta _2} \int _{{{\,\mathrm{{\mathcal {X}}}\,}}} \left[ f_1(x) - f_2(x,\theta _2)\right] ^2 \, \xi (dx). \end{aligned}$$
(1)

One benefit of the T-criterion is its relative independence from data. In fact, if we set the true parameters \(\theta _1\) of the first model based on prior assumptions, then we can even compute a T-optimal design without any data at all. Nevertheless, if data \({\mathcal {D}}_N = \{(x_i, y_i)\mid i=1,\ldots ,N\}\) is available, it is still best practice to use this data in order to estimate \(\theta _1\), e.g. by the method of least-squares. A drawback of the T-criterion is that the two potential models are treated differently based on the choice of which model is assumed to be true and which is chosen as the alternative. Whereas the true model is fit to data or other prior knowledge, the alternative model is fit to the true model. The T-criterion is therefore not symmetric with respect to the models.
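For illustration, estimating \(\theta _1\) from data \({\mathcal {D}}_N\) by least squares can be done with standard solvers. The following sketch assumes a single-response model and uses the parametrized form \(f_1(x,\theta _1)\); the helper name is hypothetical:

```python
import numpy as np
from scipy.optimize import least_squares

def estimate_theta1(data_x, data_y, f1_param, theta1_init):
    """Fit the reference model's parameters theta1 to data D_N by ordinary
    least squares; f1_param is the parametrized reference model f1(x, theta1)."""
    residuals = lambda th: np.array([f1_param(x, th) - y
                                     for x, y in zip(data_x, data_y)])
    return least_squares(residuals, x0=theta1_init).x
```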

To work with the T-criterion, we introduce the following notations.

Notation 2

We define the squared model difference as

$$\begin{aligned}\varphi (x,\theta _2):= \left[ f_1(x) - f_2(x,\theta _2)\right] ^2.\end{aligned}$$

We further define the function

$$\begin{aligned}T(\xi ,\theta _2):= \int _{{{\,\mathrm{{\mathcal {X}}}\,}}} \varphi (x, \theta _2) ~\xi (dx),\end{aligned}$$

and the optimal parameter with respect to a given design \(\xi\) as

$$\begin{aligned}{\hat{\theta }}_2:= {\hat{\theta }}_2(\xi ):= {{\,\mathrm{arg\,min}\,}}_{\theta _2\in \Theta _2} ~T(\xi ,\theta _2).\end{aligned}$$

With these notations, we may also write the T-criterion as \(T(\xi ) = T(\xi , {\hat{\theta }}_2).\)
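Numerically, evaluating \(T(\xi )\) for a discrete design amounts to solving the weighted least-squares problem for \({\hat{\theta }}_2(\xi )\). A minimal sketch with a local SciPy solver follows; since the problem is generally non-convex, a multi-start strategy as described in Sect. 5.1 is advisable in practice (function names are our own):

```python
import numpy as np
from scipy.optimize import minimize

def T_of(points, weights, f1, f2, theta2_init, theta2_bounds):
    """Evaluate T(xi) = min_{theta2} sum_i w_i [f1(x_i) - f2(x_i, theta2)]^2
    and return the (local) minimizer theta2_hat."""
    def objective(theta2):
        return sum(w * float(np.sum((f1(x) - f2(x, theta2)) ** 2))
                   for x, w in zip(points, weights))
    res = minimize(objective, x0=theta2_init, bounds=theta2_bounds,
                   method="L-BFGS-B")
    return res.fun, res.x  # T(xi) and theta2_hat(xi)
```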

One issue is that the optimal parameters \({\hat{\theta }}_2\) are not necessarily unique. This cannot be avoided in general, e.g. for designs with a single support point. Nevertheless, we are mostly interested in designs containing several design points, and for those one would in practice expect the corresponding least-squares problem to have a unique solution. For example, linear models have a unique least-squares solution if the number of support points equals the number of parameters. This justifies the notion of regular designs.

Definition 3

(Regular Design) A design \(\xi\) is called regular if the corresponding weighted least-squares problem \(\min _{\theta _2\in \Theta _2} T(\xi ,\theta _2)\) has a unique minimizer \({\hat{\theta }}_2(\xi )\).

We further impose the following assumptions on the model functions, which guarantee the existence of T-optimal designs and other useful properties like concavity (see, e.g., Fedorov and Leonov 2014).

Assumption 4

We assume the following properties on a single-response model \(f:{{\,\mathrm{{\mathcal {X}}}\,}}\times \,\Theta \rightarrow {\mathbb {R}}\):

  (M1) \({{\,\mathrm{{\mathcal {X}}}\,}}\ne \emptyset\) is compact.

  (M2) f is continuous with respect to \((x,\theta ) \in {{\,\mathrm{{\mathcal {X}}}\,}}\times \,\Theta\).

  (M3) \(\Theta _2\ne \emptyset\) is compact and convex.

Another important concept when optimizing the T-criterion is the directional derivative. Beforehand, we need the following notation for designs with a single support point.

Notation 5

We denote the probability measure with full weight on one design point \(x\in {{\,\mathrm{{\mathcal {X}}}\,}}\) as \(\xi _x\), i.e. \(\xi _x:= \{(x,1)\}\).

Proposition 6

(Directional Derivative of the T-Criterion) Let \(\xi\) denote a regular design and \({\bar{\xi }}\) an additional design. Then the following limit exists and is given by

$$\begin{aligned}&\lim _{\alpha \searrow 0}\frac{T\big ((1-\alpha )\xi + \alpha {\bar{\xi }}, \; {\hat{\theta }}_2((1-\alpha )\xi + \alpha {\bar{\xi }})\big ) - T(\xi , {\hat{\theta }}_2(\xi ))}{\alpha } \\&\qquad \qquad = \int _{{{\,\mathrm{{\mathcal {X}}}\,}}}\left[ \varphi (x,{\hat{\theta }}_2(\xi )) - T(\xi ,{\hat{\theta }}_2(\xi )) \right] ~{\bar{\xi }}(dx). \end{aligned}$$

We say that the right-hand directional derivative (for the T-criterion for some regular design \(\xi\) in direction \({\bar{\xi }}-\xi\)) exists and denote

$$\begin{aligned}\delta ^+ \, T(\xi ,{\bar{\xi }}-\xi ):= \int _{{{\,\mathrm{{\mathcal {X}}}\,}}}\varphi (x,{\hat{\theta }}_2(\xi ))~{\bar{\xi }}(dx) - T(\xi ,{\hat{\theta }}_2(\xi )).\end{aligned}$$

In particular, it holds

$$\begin{aligned}&T\big ((1-\alpha )\xi + \alpha {\bar{\xi }}, \; {\hat{\theta }}_2((1-\alpha )\xi + \alpha {\bar{\xi }})\big ) \\&\qquad \qquad = T(\xi , {\hat{\theta }}_2(\xi )) + \alpha \int _{{{\,\mathrm{{\mathcal {X}}}\,}}}\psi (x,\xi ;{\hat{\theta }}_2(\xi )) ~{\bar{\xi }}(dx) + {{\,\mathrm{{\mathcal {O}}}\,}}(\alpha ^2), \end{aligned}$$

where the (right-hand) directional derivative in direction \(\xi _x-\xi\) for any design point \(x\in {{\,\mathrm{{\mathcal {X}}}\,}}\) is given by

$$\begin{aligned}\psi (x,\xi ):= \psi (x,\xi ;{\hat{\theta }}_2(\xi )):= \delta ^+ \, T(\xi ,\xi _x-\xi ) = \varphi (x,{\hat{\theta }}_2(\xi )) - T(\xi ,{\hat{\theta }}_2(\xi )).\end{aligned}$$

The existence of the directional derivative is derived in, e.g., Kuczewski (2006).
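Given \({\hat{\theta }}_2(\xi )\), the directional derivative \(\psi (x,\xi )\) is a simple pointwise evaluation. A short sketch consistent with the notation above follows; the squared norm also covers the multi-response case discussed later:

```python
import numpy as np

def psi(x, points, weights, f1, f2, theta2_hat):
    """Directional derivative psi(x, xi) = phi(x, theta2_hat) - T(xi, theta2_hat)
    in direction xi_x - xi, with phi the squared model difference."""
    phi = lambda z: float(np.sum((f1(z) - f2(z, theta2_hat)) ** 2))
    T_xi = sum(w * phi(z) for z, w in zip(points, weights))
    return phi(x) - T_xi
```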

We are now able to state the Equivalence Theorem for the T-criterion, which gives a necessary and sufficient criterion for T-optimal solutions.

Theorem 7

(Equivalence Theorem for the T-Criterion) The following statements hold:

  1.

    A design \(\xi ^*\) is T-optimal if and only if there exists a measure \(\zeta\) on the parameters such that

    $$\begin{aligned} \int _{{\hat{\Theta }}_2(\xi ^*)} \varphi (x, {\hat{\theta }}_2) ~\zeta (d{\hat{\theta }}_2) \le T(\xi ^*), ~\forall x\in {{\,\mathrm{{\mathcal {X}}}\,}}, \end{aligned}$$
    (2)

    where \({\hat{\Theta }}_2(\xi ):= {{\,\mathrm{arg\,min}\,}}_{\theta _2\in \Theta _2} \int _{{{\,\mathrm{{\mathcal {X}}}\,}}} \varphi (x,\theta _2) ~\xi (dx)\) is the set of optimal parameters w.r.t. \(\xi\) and \(\zeta\) is a probability measure on \({\hat{\Theta }}_2(\xi )\), i.e. \(\zeta \big ({\hat{\Theta }}_2(\xi )\big ) = 1.\)

  2.

    The inequality (2) holds with equality for all support points of a T-optimal design \(\xi ^*\).

  3.

    The set of T-optimal designs is convex.

Proof

See, for example, Kuczewski (2006).□

We end this section with a helpful result concerning the existence of T-optimal solutions with finitely many support points.

Proposition 8

(Existence of T-optimal solutions with finite support) There always exists a T-optimal design \(\xi ^*\) with at most \(\dim (\Theta _2)+1\) design points.

Proof

The statement can be shown by applying the characterization theorem for best uniform approximation, see Braess and Dette (2013).□

3 The 2-ADAPT-MD algorithm

In the previous section we have seen that optimizing the T-criterion can be formulated as a maximin problem of the form

$$\begin{aligned} \max _{\xi \in \Xi } \min _{\theta _2\in \Theta _2} \; \sum _{i=1}^{\dim (\Theta _2)+1} w_i \left[ f_1(x_i) - f_2(x_i,\theta _2)\right] ^2, \end{aligned}$$
(3)

where \(\xi\) is a discrete probability measure with at most \(\dim (\Theta _2)+1\) support points. Thus, it can also be formulated as a semi-infinite program and solved by the Blankenship and Falk algorithm (see Appendix A). Similar formulations were also used by Kuczewski (2006) and Duarte et al. (2015). However, the upper-level problem, which has to be solved in every iteration for a finite discretization \({\dot{\Theta }}_2\subset \Theta _2\) of the parameter set, is

$$\begin{aligned} \max _{\begin{array}{c} t\in {\mathbb {R}}\\ w\in {\mathbb {R}}^N \\ x_1,\ldots ,x_N \in {{\,\mathrm{{\mathcal {X}}}\,}} \end{array}}\quad&t \\ \text { s.t.} \quad&\sum _{i=1}^{N} w_i \left[ f_1(x_i) - f_2(x_i,\theta _2)\right] ^2\ge t, \quad \forall \, \theta _2\in {\dot{\Theta }}_2,\\&\sum _{i=1}^{N} w_i = 1,\quad w\ge 0 , \end{aligned}$$

where \(N = \dim (\Theta _2)+1\) is the maximum number of required support points. Not only does this problem contain many variables, \((d_x+1)\cdot N + 1\) to be exact, but it is also often highly non-linear in the design points, which makes it hard to solve.

Therefore, we present a new algorithm for computing T-optimal designs, the 2-ADAPT-MD. Although we do not eliminate the problems of the SIP approach completely, we mitigate them by exploiting the linearity in the weights and by optimizing with respect to the design points as infrequently as possible.

The idea of the 2-ADAPT-MD algorithm is the same as that of Algorithm 3.2 by Dette et al. (2015): We use a sub-routine to compute T-optimal designs on a finite subset of design points, then we augment this subset by a new design point to improve the solution with regard to the entire set of design points.

But unlike the approach by Dette et al., our new method does not rely on a linearization of the alternative model \(f_2\). This is advantageous for cases where the linearization does not approximate the model well enough. Also, our algorithm is able to deal with multi-response models, whereas the linearization formulated by Dette et al. is limited to single-response models.

3.1 The grid-based DISC-MD sub-routine

At first we consider the sub-problem to find T-optimal designs on a finite set of design points \({{{\,\mathrm{{\mathcal {X}}}\,}}_N=\{x_1,\ldots ,x_N\} \subset {{\,\mathrm{{\mathcal {X}}}\,}}}\). This is:

$$\begin{aligned} \max _{\begin{array}{c} {w\ge 0}\\ \Vert w\Vert _1=1 \end{array}} \min _{\theta _2\in \Theta _2} \; \sum _{i=1}^N w_i \left[ f_1(x_i) - f_2(x_i,\theta _2)\right] ^2. \end{aligned}$$
(4)

We again reformulate this maximin problem as a SIP. However, this time we get a linear SIP:

$$\begin{aligned} \max _{\begin{array}{c} t\in {\mathbb {R}}\\ w\in {\mathbb {R}}^N \end{array}}\quad&t \nonumber \\ \text { s.t.} \quad&\sum _{i=1}^{N} w_i \left[ f_1(x_i) - f_2(x_i,\theta _2)\right] ^2\ge t, ~\forall \theta _2\in \Theta _2,\nonumber \\&\sum _{i=1}^{N} w_i = 1,\quad w\ge 0 . \end{aligned}$$
(5)

Notice that in this formulation we only optimize with respect to the weights (and t) since the former inner optimization problem is replaced by infinitely many constraints.

To solve (5), we may use the Blankenship and Falk algorithm. Thus, we approximate the SIP by considering finitely many constraints, i.e. we consider a finite discretization of the parameter set \({\dot{\Theta }}_2 \subset \Theta _2\). This yields the upper-level problem

$$\begin{aligned} \max _{\begin{array}{c} w\in {\mathbb {R}}^N\\ t\in {\mathbb {R}} \end{array}}\quad&t \nonumber \\ \text { s.t.} \quad&\sum _{i=1}^{N} w_i \left[ f_1(x_i) - f_2(x_i,\theta _2)\right] ^2\ge t, \quad \forall \, \theta _2\in {\dot{\Theta }}_2,\nonumber \\&\sum _{i=1}^{N} w_i = 1,\quad w\ge 0 , \end{aligned}$$
(6)

which is a linear program. The corresponding lower-level problem aims to find new parameters for the discretization \({\dot{\Theta }}_2\). This is the weighted least-squares problem

$$\begin{aligned} \min _{\theta _2\in \Theta _2} \; \sum _{i=1}^N w_i \left[ f_1(x_i) - f_2(x_i,\theta _2)\right] ^2, \end{aligned}$$
(7)

where the weights are given by the solution of the upper-level problem (6).

The complete approach is presented in Algorithm 1, which we denote by DISC-MD. The name stems from the fact that we assume a discrete (and finite) set \({{\,\mathrm{{\mathcal {X}}}\,}}_N\) for the set of all design points.

Algorithm 1 The DISC-MD sub-routine
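To make the procedure concrete, the following Python sketch implements the Blankenship and Falk loop of Algorithm 1: the upper level (6) is a linear program in \((w,t)\) and the lower level (7) a weighted least-squares problem. The solver choices, linprog and a single local L-BFGS-B start instead of the multi-start of Sect. 5.1, are simplifying assumptions and not our actual implementation:

```python
import numpy as np
from scipy.optimize import linprog, minimize

def disc_md(points, f1, f2, theta2_init, theta2_bounds, tol=1e-6, max_iter=50):
    """Sketch of DISC-MD: solve the linear SIP (5) on a fixed finite design
    space `points` via the Blankenship and Falk algorithm."""
    N = len(points)
    phi = lambda x, th: float(np.sum((f1(x) - f2(x, th)) ** 2))
    Theta_disc = [np.asarray(theta2_init, dtype=float)]  # discretization of Theta_2
    for _ in range(max_iter):
        # Upper level (6): maximize t subject to the discretized constraints.
        # linprog minimizes, so the objective is -t; variables are (w_1..w_N, t).
        A_ub = [[-phi(x, th) for x in points] + [1.0] for th in Theta_disc]
        res = linprog(c=[0.0] * N + [-1.0],
                      A_ub=A_ub, b_ub=[0.0] * len(Theta_disc),
                      A_eq=[[1.0] * N + [0.0]], b_eq=[1.0],
                      bounds=[(0.0, None)] * N + [(None, None)])
        w, t = res.x[:N], res.x[N]
        # Lower level (7): weighted least squares yields a candidate parameter.
        obj = lambda th: float(np.dot(w, [phi(x, th) for x in points]))
        ls = minimize(obj, x0=Theta_disc[-1], bounds=theta2_bounds,
                      method="L-BFGS-B")
        if ls.fun >= t - tol:       # no violated constraint: stop
            return w, ls.x, ls.fun  # weights, theta2_hat, T-value
        Theta_disc.append(ls.x)     # augment the parameter discretization
    return w, ls.x, ls.fun
```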

3.2 The 2-ADAPT-MD algorithm

Now we want to find T-optimal designs on a continuous set of design points \({{\,\mathrm{{\mathcal {X}}}\,}}\). Hence, we want to solve

$$\begin{aligned} \xi ^* = {{\,\mathrm{arg\,max}\,}}_{\xi \in \Xi ({{\,\mathrm{{\mathcal {X}}}\,}})} \; \min _{\theta _2\in \Theta _2} \int _{{{\,\mathrm{{\mathcal {X}}}\,}}} \varphi (x, \theta _2) \, \xi (dx). \end{aligned}$$
(8)

To do so, we approximate the design set \({{\,\mathrm{{\mathcal {X}}}\,}}\) by a set of finitely many design points \({{\,\mathrm{{\mathcal {X}}}\,}}_N\) and compute an optimal solution \(\xi ^*_N\) on this finite set. This is done by the DISC-MD method which yields

$$\begin{aligned} \xi ^*_N = {{\,\mathrm{arg\,max}\,}}_{\xi \in \Xi ({{\,\mathrm{{\mathcal {X}}}\,}}_N)} \; \min _{\theta _2\in \Theta _2} \, \sum _{i=1}^N \, \xi (x_i) \, \varphi (x_i, \theta _2) \end{aligned}$$
(9)

in the limit. Obviously \(T(\xi ^*_N) \le T(\xi ^*)\), but for a good choice of design points in \({{\,\mathrm{{\mathcal {X}}}\,}}_N\) we also obtain a good approximate solution \(\xi _N^*\).

In fact, recall that there always exists an optimal design \(\xi ^*\) with at most \(\dim (\Theta _2) + 1\) design points (Proposition 8). Hence, it suffices to include these few design points in the finite set \({{\,\mathrm{{\mathcal {X}}}\,}}_N\) to ensure \(T(\xi ^*_N) = T(\xi ^*)\).

The question is now: How do we find new design points to get better designs? As with the Vertex Direction Method and the algorithm by Dette et al. (2015), we take the design point that maximizes the directional derivative from Proposition 6, i.e.

$$\begin{aligned} x^{(N+1)} = {{\,\mathrm{arg\,max}\,}}_{x\in {{\,\mathrm{{\mathcal {X}}}\,}}}\, \psi (x,\xi ^*_N) = {{\,\mathrm{arg\,max}\,}}_{x\in {{\,\mathrm{{\mathcal {X}}}\,}}}\, \varphi (x,{\hat{\theta }}_2(\xi ^*_N)). \end{aligned}$$
(10)

This point is of interest since it induces a direction of ascent. It is also the point with the highest potential to increase the objective \(T(\cdot )\), as \(\varphi (x^{(N+1)},{\hat{\theta }}_2(\xi ^*_N))\) is an upper bound for the optimal objective value \(T(\xi ^*)\). This is due to the duality of the T-criterion and the Chebyshev approximation problem (Dette and Titoff 2009). Since \({\hat{\theta }}_2(\xi ^*) = {{\,\mathrm{arg\,min}\,}}_{\theta _2\in \Theta _2} \Vert \varphi (\cdot , \theta _2) \Vert _\infty\), it holds that

$$\begin{aligned} \max _{x\in {{\,\mathrm{{\mathcal {X}}}\,}}} \varphi (x,{\hat{\theta }}_2(\xi ^*_N)) \ge \max _{x\in {{\,\mathrm{{\mathcal {X}}}\,}}} \varphi (x,{\hat{\theta }}_2(\xi ^*)) = T(\xi ^*) \ge T(\xi ^*_N). \end{aligned}$$

Altogether, we end up with Algorithm 2, which is called 2-ADAPT-MD. The name follows from the idea that we consider two discretizations which we adaptively improve to compute T-optimal designs for model discrimination. On the one hand, we augment the discretization of the parameter set \({\dot{\Theta }}_2\) to get better designs \(\xi _N\) approximating the T-optimal design \(\xi ^*_N\) on \({{\,\mathrm{{\mathcal {X}}}\,}}_N\). On the other hand, we determine new points to update \({{\,\mathrm{{\mathcal {X}}}\,}}_N\) so that \(\xi ^*_N\) becomes a better approximation to the optimal design \(\xi ^*\) on \({{\,\mathrm{{\mathcal {X}}}\,}}.\)

Algorithm 2 The 2-ADAPT-MD algorithm
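A compact sketch of the outer loop reads as follows. It reuses the disc_md sketch from Sect. 3.1; the global search for a new design point (10), which we solve with BARON or by enumeration in Sect. 5.1, is replaced here by a multi-start local search purely for illustration:

```python
import numpy as np
from scipy.optimize import minimize

def two_adapt_md(X0, f1, f2, theta2_init, theta2_bounds, x_bounds,
                 eps=1e-6, max_iter=30):
    """Sketch of the 2-ADAPT-MD outer loop around the disc_md sub-routine."""
    points = [np.asarray(x, dtype=float) for x in X0]
    for _ in range(max_iter):
        w, theta2_hat, T_val = disc_md(points, f1, f2, theta2_init, theta2_bounds)
        # New candidate (10): maximize phi(., theta2_hat) over X (multi-start
        # local search as a stand-in for a global solver).
        neg_phi = lambda x: -float(np.sum((f1(x) - f2(x, theta2_hat)) ** 2))
        best = min((minimize(neg_phi, x0=x0, bounds=x_bounds, method="L-BFGS-B")
                    for x0 in points), key=lambda r: r.fun)
        if -best.fun - T_val <= eps:   # max_x psi(x, xi) <= eps: eps-T-optimal
            break
        points.append(best.x)          # augment the design space discretization
    return points, w, theta2_hat, T_val
```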

Remark 9

(2-ADAPT-MD for Multi-Response Models) With the 2-ADAPT-MD algorithm, we can also compute T-optimal designs for multi-response models \(f:{{\,\mathrm{{\mathcal {X}}}\,}}\times \Theta \rightarrow {{\,\mathrm{{\mathcal {Y}}}\,}}\) with \({{\,\mathrm{{\mathcal {Y}}}\,}}\subseteq {\mathbb {R}}^{d_y}\) and \(d_y > 1\). This is achieved by changing the squared distance \(\varphi\) to

$$\begin{aligned}\varphi (x,\theta _2) = \left\| f_1(x) - f_2(x,\theta _2) \right\| ^2.\end{aligned}$$

4 Convergence

We now show that we can find arbitrarily good approximations to T-optimal designs with the 2-ADAPT-MD algorithm.

4.1 Convergence of the DISC-MD procedure

Our initial focus is on the DISC-MD sub-routine. We show that we find arbitrarily good approximations of the T-optimal design \(\xi _N^*\) on the finite design space \({{\,\mathrm{{\mathcal {X}}}\,}}_N\), again using results from semi-infinite programming.

Theorem 10

(Convergence of the DISC-MD Procedure) Let \({\dot{\Theta }}_2^{(0)}\) be a finite discretization of \(\Theta _2\). Assume that the design \(\xi ^{(s)}\) is regular in each iteration, i.e. the lower-level least-squares problem has a unique minimizer in each iteration. In particular, the initial design \(\xi ^{(0)}\) is regular. Then, every accumulation point of the sequence of solutions \((\xi ^{(s)})_{s\in {\mathbb {N}}}\) of Algorithm 1 is a T-optimal solution on \({{\,\mathrm{{\mathcal {X}}}\,}}_N\).

Proof

We prove the statement by showing that we satisfy the conditions of Theorem 20 (see Appendix A). If its requirements are satisfied, every accumulation point of the sequence of solutions is an optimal solution of the semi-infinite problem (5). Hence, it is a T-optimal solution on \({{\,\mathrm{{\mathcal {X}}}\,}}_N\).

So, we have to show that the feasible region of the initial iteration

$$\begin{aligned} {\mathcal {F}}({\dot{\Theta }}_2^{(0)}) = \left\{ (w,t)\in {\mathbb {R}}^{N+1} ~\Big \vert ~ t \le \sum _{i=1}^{N} w_i \left[ f_1(x_i) - f_2(x_i,\theta _2)\right] ^2 ~\forall \theta _2\in {\dot{\Theta }}_2^{(0)},~ \sum _{i=1}^{N} w_i = 1,~ w\ge 0 \right\} \end{aligned}$$

is compact. Since \({\mathcal {F}}({\dot{\Theta }}_2^{(0)})\) is defined by finitely many linear inequalities, it is a closed convex polyhedron. It remains to show that \({\mathcal {F}}({\dot{\Theta }}_2^{(0)})\) is bounded. The constraints imply \({w\in [0,1]^N}\) and

$$\begin{aligned} t\le \max _{\theta _2\in {\dot{\Theta }}_2^{(0)}} \, \sum _{i=1}^N \left[ f_1(x_i) - f_2(x_i,\theta _2) \right] ^2 =: C < \infty . \end{aligned}$$

W.l.o.g., we can further assume \(t\ge 0\), since we maximize with respect to t and the upper bound C is non-negative. Altogether, \({\mathcal {F}}({\dot{\Theta }}_2^{(0)})\) is bounded and closed in \({\mathbb {R}}^{N+1}\), i.e. compact.□

4.2 Convergence of the 2-ADAPT-MD algorithm

We have already highlighted that the 2-ADAPT-MD algorithm is reminiscent of the algorithm by Dette et al. (2015). In fact, if the DISC-MD routine returned an optimal design on \({{\,\mathrm{{\mathcal {X}}}\,}}_N\), then Theorem 3.3 from their paper proves convergence. However, for this sub-routine, we only have convergence in the limit. Therefore, in finite time, we are only guaranteed that \({\hat{\xi }}_N\) is an arbitrarily good approximation to \(\xi ^*_N\). Hence, we introduce the notion of \(\varepsilon\)-T-optimal designs.

Definition 11

(\(\varepsilon\)-T-Optimal Designs) A design \(\xi\) is called \(\varepsilon\)-T-optimal if

$$\begin{aligned} \max _{x\in {{\,\mathrm{{\mathcal {X}}}\,}}} \, \psi (x,\xi ) \le \varepsilon .\end{aligned}$$

Definition 11 means that the directional derivative \(\psi (x,\xi ) = \varphi (x,{\hat{\theta }}_2(\xi )) - T(\xi )\) is small. We also conclude that if we have an \(\varepsilon\)-T-optimal solution \(\xi\), then the objective value of this design \(T(\xi )\) is close to the optimal objective value. This justifies our definition for approximate solutions.

Corollary 12

If \(\xi\) is an \(\varepsilon\)-T-optimal design and \(\xi ^*\) is an actual T-optimal design on the same design and parameter spaces, then

$$\begin{aligned}T(\xi ) \ge T(\xi ^*) - \varepsilon .\end{aligned}$$

Proof

By the duality of the T-criterion and the Chebyshev approximation problem (Dette and Titoff 2009), we have

$$\begin{aligned}\max _{x\in {{\,\mathrm{{\mathcal {X}}}\,}}} \, \varphi (x,{\hat{\theta }}_2(\xi )) \ge \min _{\theta _2\in \Theta _2} \, \max _{x\in {{\,\mathrm{{\mathcal {X}}}\,}}} \varphi (x,\theta _2) = T(\xi ^*).\end{aligned}$$

Therefore,

$$\begin{aligned}T(\xi ) \ge \max _{x\in {{\,\mathrm{{\mathcal {X}}}\,}}} \, \varphi (x,{\hat{\theta }}_2(\xi )) - \varepsilon \ge T(\xi ^*) - \varepsilon \end{aligned}$$

holds.□

Additionally, we will need the following lemma.

Lemma 13

Let Assumptions (M1)–(M3) hold. Let \((\xi _n)_{n\in {\mathbb {N}}}\) be a sequence of designs such that the objective values converge, i.e. \(T(\xi _n) \rightarrow T^*\) as \(n\rightarrow \infty\) for some \(T^*\in {\mathbb {R}}\). Then, there exists a solution \(\xi ^* \in \Xi ({{\,\mathrm{{\mathcal {X}}}\,}})\) such that

$$\begin{aligned}T(\xi ^*) = T^*.\end{aligned}$$

Proof

The sequence of solutions \((\xi _n)_{n\in {\mathbb {N}}}\) is obviously tight because the solutions are probability measures defined on the compact set \({{\,\mathrm{{\mathcal {X}}}\,}}\). Then, by Prokhorov’s Theorem (Parthasarathy 2005), there exists a subsequence \((\xi _{n_k})_{k\in {\mathbb {N}}}\) which weakly converges to a probability measure \(\xi ^*\) defined on \({{\,\mathrm{{\mathcal {X}}}\,}}\). By the weak continuity of the T-criterion (Kuczewski 2006), we have

$$\begin{aligned}\lim \limits _{k\rightarrow \infty } T(\xi _{n_k}) = T(\xi ^*).\end{aligned}$$

But since the objectives \((T(\xi _n))_{n\in {\mathbb {N}}}\) converge to \(T^*\) by assumption, we also have

$$\begin{aligned}\lim \limits _{k\rightarrow \infty } T(\xi _{n_k}) = T^*.\end{aligned}$$

So by the uniqueness of limits, we have \(T(\xi ^*) = T^*\).□

We now show that the new algorithm finds arbitrarily good approximations of the T-optimal design \(\xi ^*\) as long as the solution of the DISC-MD sub-routine returns a sufficiently good approximation for the optimal design on a finite subset \({{\,\mathrm{{\mathcal {X}}}\,}}_N\).

But before we can prove the statement, we have to introduce a bound for the second directional derivative. Recall from Proposition 6 that for a design \(\xi\) and the directional derivative \(\psi (x,\xi )\) in direction \(\xi _x-\xi\), we have

$$\begin{aligned}&T\big ((1-\alpha )\xi + \alpha {\bar{\xi }}, \, {\hat{\theta }}_2((1-\alpha )\xi + \alpha {\bar{\xi }})\big ) \nonumber \\&\qquad \qquad = T(\xi , {\hat{\theta }}_2(\xi )) + \alpha \int _{{{\,\mathrm{{\mathcal {X}}}\,}}}\psi (x,\xi ;{\hat{\theta }}_2(\xi )) ~{\bar{\xi }}(dx) + {{\,\mathrm{{\mathcal {O}}}\,}}(\alpha ^2) \nonumber \\&\qquad \qquad \ge T(\xi , {\hat{\theta }}_2(\xi )) + \alpha \int _{{{\,\mathrm{{\mathcal {X}}}\,}}}\psi (x,\xi ;{\hat{\theta }}_2(\xi )) ~{\bar{\xi }}(dx) - \alpha ^2 K \end{aligned}$$
(11)

for some \(K\in {\mathbb {R}}\) which depends on the given models. This K can be understood as a bound on the second directional derivative, but it also determines how close we can get to a T-optimal solution.

Theorem 14

(Convergence of SIP-Based Algorithm for T-Criterion) Let \(\varepsilon >0\). Assume in each iteration n that DISC-MD returns a regular \(\varepsilon\)-T-optimal design \(\xi ^{(n)}\) on the finite design space \({{\,\mathrm{{\mathcal {X}}}\,}}^{(n)}\subset {{\,\mathrm{{\mathcal {X}}}\,}}\). Then, a subsequence of intermediate solutions \((\xi ^{(n_j)})_{j\in {\mathbb {N}}}\) of Algorithm 2 converges to a \(2\sqrt{K\varepsilon }\)-T-optimal design \({\hat{\xi }}\) on \({{\,\mathrm{{\mathcal {X}}}\,}}\), i.e.

$$\begin{aligned}\max _{x\in {{\,\mathrm{{\mathcal {X}}}\,}}} \, \psi (x, {\hat{\xi }}) \le 2\sqrt{K\varepsilon }\end{aligned}$$

with \(\xi ^{(n_j)} \rightarrow {\hat{\xi }}\) as \(j\rightarrow \infty\).

Proof

Let \({\hat{\xi }}_N\) be the solution found by the DISC-MD sub-routine for the finite design space \({{\,\mathrm{{\mathcal {X}}}\,}}_N = \{x_1,\ldots ,x_N\}\) as input, and let \(\xi ^*_N\) be an actual T-optimal design on \({{\,\mathrm{{\mathcal {X}}}\,}}_N\). Since the sub-routine stops with an \(\varepsilon\)-T-optimal solution on \({{\,\mathrm{{\mathcal {X}}}\,}}_N\), we have

$$\begin{aligned} T(\xi ^*_N) \ge T({\hat{\xi }}_N) \ge T(\xi ^*_N)-\varepsilon . \end{aligned}$$
(12)

We divide this proof into two parts. First, we show that the sequence of solutions \(({\hat{\xi }}_N)_{N\in {\mathbb {N}}}\) converges weakly to a design \({\hat{\xi }}\). Then, we prove that \({\hat{\xi }}\) is a \(2\sqrt{K\varepsilon }\)-T-optimal solution.

We start with the sequence of optimal objective values \((T(\xi ^*_N))_{N\in {\mathbb {N}}}\). It is monotone increasing because we may only add points to the finite design space \({{\,\mathrm{{\mathcal {X}}}\,}}_N\) in each iteration, thus increasing the feasible region \(\Xi ({{\,\mathrm{{\mathcal {X}}}\,}}_N)\) for the designs. Also, \((T(\xi ^*_N))_{N\in {\mathbb {N}}}\) is bounded from above by \(T(\xi ^*)\), which is the objective value of the T-optimal design on the whole set \({{\,\mathrm{{\mathcal {X}}}\,}}\). Therefore this sequence converges to an element \(T^{**}\in {\mathbb {R}}\), i.e.

$$\begin{aligned} \lim \limits _{N\rightarrow \infty } T(\xi _N^*) = T^{**}. \end{aligned}$$
(13)

Next, we show that the sequence of the \(\varepsilon\)-T-optimal solutions \(({\hat{\xi }}_{N})_{N\in {\mathbb {N}}}\) has a weakly convergent subsequence. Recall (12) which states

$$\begin{aligned} T({\hat{\xi }}_N) \in \left[ T(\xi ^*_N)-\varepsilon ,\; T(\xi ^*_N)\right] . \end{aligned}$$
(14)

Combining (13) and (14), we have that for a fixed \(\delta >0\), there exists a sufficiently large \(N_0\in {\mathbb {N}}\) such that for all \(N>N_0\):

$$\begin{aligned} T^{**} - T({\hat{\xi }}_N)&= T^{**} - T(\xi ^*_N) + T(\xi ^*_N) - T({\hat{\xi }}_N)\\&\le \delta + \varepsilon . \end{aligned}$$

As a consequence, the sequence \((T({\hat{\xi }}_{N}))_{N\in {\mathbb {N}}}\) is bounded within a compact subset of \({\mathbb {R}}\). Specifically

$$\begin{aligned}T({\hat{\xi }}_N) \in \left[ T^{**}-\delta -\varepsilon ,\; T^{**}\right] \end{aligned}$$

for \(N>N_0\). Thus, there exists a convergent subsequence \((T({\hat{\xi }}_{N_{j}}))_{j\in {\mathbb {N}}}\) with

$$\begin{aligned}\lim \limits _{j\rightarrow \infty } T({\hat{\xi }}_{N_{j}}) = {\hat{T}}\end{aligned}$$

for some \({\hat{T}}\in {\mathbb {R}}\). Due to Lemma 13, we can take another subsequence which weakly converges to a design \({\hat{\xi }}\) with

$$\begin{aligned}{\hat{T}} = T({\hat{\xi }}) \in \left[ T^{**} - \varepsilon , T^{**}\right] .\end{aligned}$$

For the sake of convenience we denote the final subsequence by \(({\hat{\xi }}_s)_{s\in {\mathbb {N}}}\).

We finally show that \({\hat{\xi }}\) is a \(2\sqrt{K\varepsilon }\)-T-optimal solution. Assume it is not, i.e.

$$\begin{aligned} \max _{x\in {{\,\mathrm{{\mathcal {X}}}\,}}} \psi (x, {\hat{\xi }}) = \max _{x\in {{\,\mathrm{{\mathcal {X}}}\,}}} \, \varphi (x,{\hat{\theta }}_2({\hat{\xi }})) - T({\hat{\xi }}) > 2\sqrt{K\varepsilon }. \end{aligned}$$
(15)

Due to the continuity of the T-criterion w.r.t. weakly convergent designs, the weak outer semi-continuity of the optimal parameters (see Kuczewski (2006) for both), and the obvious continuity of \(\varphi\), we have that for some \(\varepsilon '>0\), there is a sufficiently large \(s_0\in {\mathbb {N}}\) such that for all \(s\ge s_0\):

$$\begin{aligned} \max _{x\in {{\,\mathrm{{\mathcal {X}}}\,}}} \, \psi (x, {\hat{\xi }}_s) \ge 2\sqrt{K\varepsilon } + \varepsilon '. \end{aligned}$$
(16)

Next, define a design \({\xi }_{s+1}^{(\text {LS})}\) by

$$\begin{aligned}\xi _{s+1}^{(\text {LS})}:= (1-{\hat{\alpha }}_s){\hat{\xi }}_s + {\hat{\alpha }}_s \, \xi _{{\hat{x}}_s},\end{aligned}$$

where

$$\begin{aligned}{\hat{x}}_s = {{\,\mathrm{arg\,max}\,}}_{x\in {{\,\mathrm{{\mathcal {X}}}\,}}} \, \psi (x, {\hat{\xi }}_s),\end{aligned}$$

and

$$\begin{aligned}{\hat{\alpha }}_s = {{\,\mathrm{arg\,max}\,}}_{\alpha \in [0,1]} \; T\big ((1-\alpha ){\hat{\xi }}_s + \alpha \, \xi _{{\hat{x}}_s}\big ).\end{aligned}$$

Basically, \({\xi }_{s+1}^{(\text {LS})}\) is the design we would obtain by applying an optimal line search to \({\hat{\xi }}_s\) in the ascent direction induced by \({\hat{x}}_s\). The defined design serves as a lower bound for the actual optimal design \(\xi ^*_{s+1}\) on the set \({{\,\mathrm{{\mathcal {X}}}\,}}_{s+1}:= {{\,\mathrm{{\mathcal {X}}}\,}}_s \cup \{{\hat{x}}_s\}\), i.e.

$$\begin{aligned} T(\xi ^*_{s+1}) \ge T(\xi _{s+1}^{(\text {LS})}). \end{aligned}$$
(17)

Because \(\xi _{s+1}^{(\text {LS})}\) is a convex combination of designs with an optimal step-length \({\hat{\alpha }}_s\in [0,1]\), we use (11) and our modified assumption (16) in order to conclude

$$\begin{aligned} T(\xi ^{(\text {LS})}_{s+1}) - T({\hat{\xi }}_s)&= T({\hat{\xi }}_s) + {\hat{\alpha }}_s \psi ({\hat{x}}_s,{\hat{\xi }}_s) + {{\,\mathrm{{\mathcal {O}}}\,}}({\hat{\alpha }}_s^2) - T({\hat{\xi }}_s)\\&= {\hat{\alpha }}_s \psi ({\hat{x}}_s,{\hat{\xi }}_s) + {{\,\mathrm{{\mathcal {O}}}\,}}({\hat{\alpha }}_s^2)\\&\ge {\hat{\alpha }}_s (2\sqrt{K\varepsilon } + \varepsilon ') - {\hat{\alpha }}_s^2 K\\&\ge \frac{(2\sqrt{K\varepsilon } + \varepsilon ')^2}{4K}\\&\ge \varepsilon + \frac{(\varepsilon ')^2}{4K}, \end{aligned}$$

where the second-to-last inequality follows because the optimal step-length \({\hat{\alpha }}_s\) is at least as good as the (possibly sub-optimal) choice

$$\begin{aligned}\alpha _s = \frac{2\sqrt{K\varepsilon } + \varepsilon '}{2K}.\end{aligned}$$

Therefore, with (17), we get

$$\begin{aligned} T({\hat{\xi }}_{s+1}) - T({\hat{\xi }}_s)&\ge T(\xi ^*_{s+1}) - \varepsilon - T({\hat{\xi }}_s)\\&\ge T(\xi ^{(\text {LS})}_{s+1}) - T({\hat{\xi }}_s) - \varepsilon \\&\ge \frac{(\varepsilon ')^2}{4K}. \end{aligned}$$

But this yields

$$\begin{aligned}T(\xi ^*) - T({\hat{\xi }}_0) \ge \sum _{s=0}^{n} \big (T({\hat{\xi }}_{s+1}) - T({\hat{\xi }}_s)\big ) \ge \sum _{s=0}^{n} \frac{(\varepsilon ')^2}{4K} \rightarrow \infty \end{aligned}$$

as \(n\rightarrow \infty\), which is a contradiction to \(T(\xi ^*) < \infty\) and \(T({\hat{\xi }}_0) \ge 0\). Hence, our assumption (15) is false and \({\hat{\xi }}\) must be a \(2\sqrt{K\varepsilon }\)-T-optimal solution.□

5 Numerical results

We tested our new algorithm on two examples from chemical process engineering. Additionally, we compared the performance against three other algorithms from the literature, namely the original Vertex Direction Method used by Atkinson and Fedorov, the algorithm by Dette et al. (2015), and the original SIP formulation which solves the maximin problem (3) by the Blankenship and Falk algorithm. All algorithms have been implemented in Python (Van Rossum and Drake 2009).

5.1 Implementation details

In the following, we describe how we numerically solve the sub-problems that make up each algorithm.


Computing new parameters. We solve the least-squares problem (7) with either the dogbox method of SciPy's least_squares routine (Virtanen et al. 2020) or the KNITRO solver suite (Byrd et al. 2006), depending on the application. Also, in order to find a globally optimal solution, we perform a multi-start where one starting point is the solution from the previous iteration, and all other starting points are determined by a Sobol sequence (Sobol 1967).

However, we might run into cases where we have very few support points with positive weight. In a degenerate case it might happen that we have full weight on a single design point in an iteration, i.e. we get an intermediate solution with \(\xi = \{(x_1,1),(x_2,0),\ldots ,(x_N,0)\}\). This can happen due to numerical instabilities. In these cases, the least-squares problem does not have a unique optimal solution and may yield poor fits. Thus, we use a regularization by adding a small weight to all currently selected design points \(x_1,\ldots ,x_N\). This yields the regularized least-squares problem

$$\begin{aligned} \min _{\theta _2\in \Theta _2} \, \sum _{i=1}^N w_i \left[ f_1(x_i) - f_2(x_i,\theta _2)\right] ^2 + \lambda \sum _{i=1}^N \left[ f_1(x_i) - f_2(x_i,\theta _2)\right] ^2 \end{aligned}$$
(18)

with some small least-squares regularizer \(\lambda >0\), e.g. \(\lambda = 10^{-8}\). This problem often has a unique optimal solution, and it is still very close to the original problem. The idea to consider all design points \(x_1,\ldots ,x_N\) with an additional small weight can also be justified by the fact that the Chebyshev approximation problem is the dual of the T-criterion. Hence, for the corresponding parameters of a T-optimal design, we require that the squared distances are low everywhere.
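For illustration, the regularized objective (18) can be written as a stand-alone function and passed to any of the aforementioned least-squares solvers (the function name is our own):

```python
import numpy as np

def regularized_ls_objective(theta2, points, weights, f1, f2, lam=1e-8):
    """Weighted least-squares objective (18) with a small uniform
    regularization weight lam on all currently selected design points."""
    sq = np.array([float(np.sum((f1(x) - f2(x, theta2)) ** 2)) for x in points])
    return float(np.dot(weights, sq) + lam * sq.sum())
```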


Computing new design points. The sub-problem of finding a new design point (10) is in general non-convex. We solve it using BARON or, if the design space is discrete, by enumeration.


Updating the design. Computing a new design by the linear program (6) is done with the MOSEK optimizer (MOSEK ApS 2021) via the pyomo interface. Additionally, for the other algorithms, we solve the quadratic program in Dette's algorithm with the KNITRO package (Byrd et al. 2006) and again use the BARON solver (Sahinidis 1996) for the simultaneous optimization of the weights and design points in the original SIP approach.


Stopping rule. We use a stopping criterion based on the Equivalence Theorem 7. For regular designs, it simplifies to

$$\begin{aligned} \max _{x\in {{\,\mathrm{{\mathcal {X}}}\,}}} \varphi (x,{\hat{\theta }}_2(\xi )) - T(\xi , {\hat{\theta }}_2(\xi )) \le \varepsilon , \end{aligned}$$
(19)

where \(\varepsilon\) is a pre-determined tolerance. However, if the estimated parameters \({\hat{\theta }}_2(\xi )\) are not the true (global) minimizer of the least-squares problem, this criterion might be satisfied too early. Specifically, poorly estimated parameters might cause the directional derivatives \(\psi (\cdot ,\xi )\) at the support points of \(\xi\) to differ, although the Equivalence Theorem 7 requires them to be equal at the support points of an optimal design. In such a case the support point which maximizes \(\varphi (\cdot ,{\hat{\theta }}_2(\xi ))\) might still satisfy (19), whereas the other support points do not. Therefore, we propose a small adaptation:

$$\begin{aligned} \max _{x\in {{\,\mathrm{{\mathcal {X}}}\,}}} \varphi (x,{\hat{\theta }}_2(\xi )) - \min _{x\in {{\,\mathrm{\text {supp}}\,}}(\xi )} \varphi (x,{\hat{\theta }}_2(\xi )) \le \varepsilon . \end{aligned}$$
(20)

The idea is that the generally valid inequality

$$\begin{aligned} \min _{x\in {{\,\mathrm{\text {supp}}\,}}(\xi )} \varphi (x,{\hat{\theta }}_2(\xi )) \le T(\xi ,{\hat{\theta }}_2(\xi )) \end{aligned}$$

holds with equality for optimal designs. Thus if (20) is satisfied, so is (19), but not necessarily the other way around.
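In code, the adapted rule (20) reduces to a single comparison once the maximizer of \(\varphi (\cdot ,{\hat{\theta }}_2(\xi ))\) is available; a small sketch (the support threshold is an illustrative choice):

```python
import numpy as np

def stopped(points, weights, f1, f2, theta2_hat, x_max, eps=1e-6):
    """Adapted stopping rule (20): compare the global maximum of phi with the
    smallest phi over the support of the current design.
    x_max is assumed to maximize phi(., theta2_hat) over X."""
    phi = lambda x: float(np.sum((f1(x) - f2(x, theta2_hat)) ** 2))
    support = [x for x, w in zip(points, weights) if w > 1e-12]
    return phi(x_max) - min(phi(x) for x in support) <= eps
```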

5.2 Examples

In the following, we compare the performance of the Vertex Direction Method (VDM), Dette's algorithm, the original SIP approach, and the new 2-ADAPT-MD algorithm. We consider two examples, one simple example that should be easy to solve for all algorithms and a harder example based on a small system of ordinary differential equations.

In Table 1, we present the parameters of the algorithms. They can be tweaked to find a better solution or to terminate earlier. If not stated otherwise, we take the default values. However, for the VDM, we deviate in two respects:

  • The least-squares regularizer \(\lambda\) is not required and thus set to zero since all support points keep a positive weight.

  • The maximum number of iterations \(n_\text {maxiter}\) is increased to 1000.

The following results were computed on an Intel Xeon Gold 6248R processor with four available cores.

Table 1 Table of customizable parameters

5.2.1 Michaelis–Menten model

In our first example we consider the Michaelis-Menten (MM) model. It gives a formula for the rate of a reaction where a substrate reacts with an enzyme to form a product:

$$\begin{aligned} f_2(x,(V,K)) = \frac{Vx}{K+x}. \end{aligned}$$

Here, x is a substrate concentration, V is the maximum reaction rate, and K is the so-called Michaelis-Menten constant, i.e. the substrate concentration at which the reaction rate is half of its maximum, \(\frac{1}{2} V\) (Srinivasan 2022). Alternatively, we might add a linear term to this rate. This yields the modified Michaelis-Menten (ModMM) model:

$$\begin{aligned} f_1(x,(V,K,F)) = \frac{Vx}{K+x}+Fx. \end{aligned}$$

Our goal is to find experiments such that we can discriminate between the two models and decide which one fits our real-world problem better. This example as it stands was also studied in López-Fidalgo et al. (2007) and has since then become a standard example for model discrimination.

Considering that the original model is included in the modified version, we assume that \(f_1\) is our true function with the parameters \(V=1\), \(K=1\), and \(F=0.1\). Accordingly, \(f_2\) is the alternative model. The design space is chosen to be \({{\,\mathrm{{\mathcal {X}}}\,}}= [0.001, 5]\), and for the parameter space we have \((V,K) \in \Theta _2 = [10^{-3}, 5]\times [10^{-3}, 5].\) Lastly, as the initial design we select

$$\begin{aligned}\xi ^{(0)} = \left\{ \begin{array}{cccc} 1 & 2 & 3 & 4 \\ \nicefrac {1}{4} & \nicefrac {1}{4} & \nicefrac {1}{4} & \nicefrac {1}{4} \end{array} \right\} .\end{aligned}$$
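In the setting of the sketches from Sect. 3, this instance can be transcribed as follows (an illustrative transcription, with x treated as a scalar concentration):

```python
# Reference ModMM model with the fixed true parameters V=1, K=1, F=0.1,
# and the parametrized MM alternative.
def f1(x):
    V, K, F = 1.0, 1.0, 0.1
    return V * x / (K + x) + F * x

def f2(x, theta2):
    V, K = theta2
    return V * x / (K + x)

x_bounds = [(0.001, 5.0)]                   # design space X
theta2_bounds = [(1e-3, 5.0), (1e-3, 5.0)]  # parameter space Theta_2
X0 = [[1.0], [2.0], [3.0], [4.0]]           # support of the initial design
```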

The performance of the algorithms for this example is presented in Table 2. Additionally, the computed designs for each algorithm are shown in Table 3, where the design points are ordered by weight.

Table 2 Results for Michaelis–Menten versus modified Michaelis–Menten
Table 3 Optimal designs per algorithm for Michaelis-Menten vs. modified Michaelis-Menten Model

Moreover, we illustrate the results. In Fig. 1, we show how well the Michaelis-Menten model \(f_2\) is fitted to the modified Michaelis-Menten model. Since all algorithms find designs with approximately the same optimal parameters

$$\begin{aligned} {\hat{\theta }}_2 = ({\hat{V}}, {\hat{K}}) \approx (1.86, \, 2.15), \end{aligned}$$

this figure is representative for all algorithms. Additionally, in Fig. 2, we present a plot of the directional derivative \(x\mapsto \psi (x,\xi ^*)\) for the optimal design computed by the 2-ADAPT-MD algorithm. It illustrates that the necessary and sufficient optimality criterion based on the Equivalence Theorem 7 is satisfied. We also plot the directional derivative of the solution from the Vertex Direction Method, which illustrates its difficulty in removing poor support points (Fig. 3).

Fig. 1 ModMM reference model versus fitted MM alternative model with support points of the optimal design from the 2-ADAPT-MD algorithm. The sizes of the crosses indicate the weights of the design, and the black lines highlight the model differences

Fig. 2 Directional derivative (blue) with support points (red) computed by 2-ADAPT-MD (colour figure online)

Fig. 3 Directional derivative (blue) with support points (red) computed by VDM (colour figure online)

Overall, we see that all methods are capable of computing T-optimal designs. However, the algorithm by Dette et al. is the fastest in this easy example, closely followed by the 2-ADAPT-MD algorithm. The Vertex Direction Method already takes significantly longer, whereas the original SIP approach has tremendous problems solving the upper-level problem to global optimality, thus taking by far the longest.

5.2.2 Partially reversible versus irreversible consecutive reaction

Sometimes it is desirable to discriminate between an irreversible consecutive reaction of the form

$$A \xrightarrow {k_1} B \xrightarrow {k_2} C$$
(21)

and a consecutive reaction where the first part is reversible, i.e.

$$A \underset{k_3}{\overset{k_1}{\rightleftharpoons }} B \xrightarrow {k_2} C$$
(22)

This problem was investigated in Ucinski and Bogacka (2005). Typically, reactions of such forms are modelled via a system of ordinary differential equations. For example, by modelling (22) with power law expressions for the reaction rates, we get

$$\begin{aligned} \frac{d[A]}{dt}&= -k_1 \, [A]^{n_1} + k_3 \, [B]^{n_3} \nonumber \\ \frac{d[B]}{dt}&= k_1 [A]^{n_1} - k_2 [B]^{n_2} - k_3 [B]^{n_3} \nonumber \\ \frac{d[C]}{dt}&= k_2 [B]^{n_2}, \end{aligned}$$
(23)

where [A], [B], and [C] are the concentrations of each component at a time t. Furthermore, \(k_1\), \(k_2\), and \(k_3\) are rate constants and \(n_1\), \(n_2\), and \(n_3\) are reaction orders. Note that the model for the irreversible reaction (21) is included in this model by choosing \(k_3 = 0\). Furthermore, notice that in this case we deal with multi-response models that output the concentrations for all components [A], [B], and [C] at time points t.
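Both competing models can be implemented as multi-response functions by numerically integrating (23). A sketch using SciPy's solve_ivp follows; the tolerances and lambda wrappers are illustrative choices:

```python
import numpy as np
from scipy.integrate import solve_ivp

def consecutive_reaction(x, theta):
    """Concentrations ([A], [B], [C]) at time t for the system (23).
    Design point x = (A0, B0, C0, t); theta = (k1, k2, k3, n1, n2, n3)."""
    k1, k2, k3, n1, n2, n3 = theta
    A0, B0, C0, t = x
    def rhs(_, c):
        A, B, C = c
        r1, r2, r3 = k1 * A ** n1, k2 * B ** n2, k3 * B ** n3
        return [-r1 + r3, r1 - r2 - r3, r2]
    sol = solve_ivp(rhs, (0.0, t), [A0, B0, C0], t_eval=[t], rtol=1e-8)
    return sol.y[:, -1]

# Reference model (22) with the parameters below; the alternative (21) fixes
# k3 = 0 (the value of n3 is then irrelevant).
f1 = lambda x: consecutive_reaction(x, (0.7, 0.2, 0.1, 2.0, 2.0, 1.0))
f2 = lambda x, th: consecutive_reaction(x, (th[0], th[1], 0.0, th[2], th[3], 1.0))
```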

Now, if we want to discriminate between the models, we have to assume that the reaction with the reversible part (22) is the reference model, and the irreversible reaction (21) is our alternative model. Otherwise, we could perfectly fit the alternative to the reference model, which means that the squared distance is zero everywhere and every design would be trivially optimal.

For the specific problem setting, we define the reference model by the parameter values \(k_1 = 0.7\), \(k_2=0.2\), \(k_3=0.1\), \(n_1=2\), \(n_2=2\), and \(n_3=1\). The parameter space for the alternative is given by \((k_1, k_2, n_1, n_2) \in \Theta _2:= [0.5, 1.0]\times [0.05, 0.5]\times [1.5, 3.5]\times [1.5, 3]\).

As design variables, we consider the initial concentrations \([A]_0\), \([B]_0\), and \([C]_0\) and the time point of measurement t. However, since evaluating the models at a single point is already computationally expensive, we use a small discretized design space and maximize the squared distance (10) by enumerating the complete lattice instead of optimizing over the continuous space. So we assume \(\left( [A]_0, [B]_0, [C]_0, t\right) \in {{\,\mathrm{{\mathcal {X}}}\,}}= \{0.5,0.7,0.9\}\times \{0.1,0.2,0.3\}\times \{0.0,0.15,0.3\}\times \{2,4,\ldots ,10\}\). The initial design is

$$\begin{aligned}\xi ^{(0)} = \left\{ \begin{array}{ccccc} \left( \begin{array}{c} 0.5\\ 0.1\\ 0\\ 2 \end{array}\right) & \left( \begin{array}{c} 0.5\\ 0.1\\ 0.15\\ 4 \end{array}\right) & \left( \begin{array}{c} 0.7\\ 0.3\\ 0.15\\ 6 \end{array}\right) & \left( \begin{array}{c} 0.9\\ 0.2\\ 0.15\\ 8 \end{array}\right) & \left( \begin{array}{c} 0.9\\ 0.3\\ 0.3\\ 10 \end{array}\right) \\ \nicefrac {1}{5} & \nicefrac {1}{5} & \nicefrac {1}{5} & \nicefrac {1}{5} & \nicefrac {1}{5} \end{array} \right\} .\end{aligned}$$

Due to this problem definition, the original SIP approach and the new 2-ADAPT-MD algorithm both try to solve

$$\begin{aligned} \max _{\begin{array}{c} {w\ge 0}\\ \Vert w\Vert _1=1 \end{array}} \min _{\theta _2\in \Theta _2} \; \sum _{i=1}^N w_i \left[ f_1(x_i) - f_2(x_i,\theta _2)\right] ^2. \end{aligned}$$

But whereas the original SIP approach uses the Blankenship and Falk algorithm directly, we iteratively determine a smaller subset of design points and apply the Blankenship and Falk algorithm to this subset. This reduces the number of evaluations required to formulate the linear program (6) in the upper-level but should lead to more iterations overall.

The performance of the algorithms is presented in Table 4. For the original SIP method, we set the least-squares regularizer to \(\lambda =0\), which worked better in this example, but might lead to problems with other instances. Additionally, see Table 5 for the computed designs, where the design points are again ordered by weight. Notice that this example discriminates between two multi-response models and Dette’s algorithm can only deal with single-response models, so its row is blank.

Table 4 Results for partially reversible versus irreversible consecutive reaction model
Table 5 Optimal designs per algorithm for partially reversible versus irreversible consecutive reaction model

The results show not only that we can find optimal designs even for multi-response models, but also that working with a small subset of candidate design points, as done in the 2-ADAPT-MD algorithm, is indeed beneficial compared to the usual SIP approach. Also, the least-squares regularization can be applied more effectively, which can be crucial for instances where intermediate solutions with too few support points appear.

6 Conclusion

In this paper we have introduced a novel algorithm to compute T-optimal experimental designs, the 2-ADAPT-MD. The algorithm utilizes a twofold adaptive discretization of the parameter space \(\Theta _2\) as well as the design space \({{\,\mathrm{{\mathcal {X}}}\,}}\). Both discretizations are refined similarly to approaches from semi-infinite programming, where an inner optimization problem is solved to select a new design point x or a new parameter value \(\theta _2\), respectively.

We have proven the convergence of the 2-ADAPT-MD. Additionally, we have shown bounds on the error in the objective value, depending on the tolerance obtained in the discretized model discrimination sub-problem. The main advantage of the 2-ADAPT-MD is that it converges under only mild assumptions on the considered models. By using an increasing discretization in the design points x and the parameters \(\theta _2\), we obtain linear programs and least-squares problems as the most frequent sub-problems. These structures are well studied, and specialized solvers can be used to efficiently compute corresponding solutions. The use of discretizations also decreases the problem dimensions, which leads to higher numerical stability. In particular, this is noticeable when adding a regularization to the least-squares term.

In two examples from chemical engineering we have seen that the 2-ADAPT-MD can compete with and even outperform state-of-the-art model discrimination methods. In comparison to a classical semi-infinite approach, we drastically reduce the required computation time. By using a second, inner discretization in the parameters \(\theta _2\), we decrease the dimension of the regularization term in the least-squares problem. Compared to the approach by Dette et al., we do not require linear approximations in the parameters for the selection of new candidate experiments. This makes our new algorithm more reliable and stable. Additionally, the 2-ADAPT-MD can also handle multi-response models, whereas the algorithm by Dette et al. is limited to the single-response case.

Some aspects of the 2-ADAPT-MD need further investigation and can be improved upon. First, the algorithm always adds a single design point without removing any previous points. However, we know that there always is a T-optimal design with at most \(\dim (\Theta _2)+1\) design points. Thus, it should be possible to apply an exchange method that swaps current points for new ones. Secondly, we need to solve a non-convex global optimization sub-problem in each iteration in order to refine the discretization of the design space \({{\,\mathrm{{\mathcal {X}}}\,}}\). This is similar to most state-of-the-art algorithms, which rely on global optimization solvers or discrete design spaces \({{\,\mathrm{{\mathcal {X}}}\,}}\). Finally, throughout this paper we have assumed the worst-case parameters \({\hat{\theta }}_2\) to be unique. In future work we want to investigate whether the 2-ADAPT-MD can be adjusted to also handle problems where these parameters are not unique.