Mixed noise and posterior estimation with conditional deepGEM

Paul Hagemann; Johannes Hertrich; Maren Casfor; Sebastian Heidenreich; Gabriele Steidl

doi:10.1088/2632-2153/ad5926

1. Introduction

In a variety of healthcare and other contemporary applications, the variables of primary interest are obtained through indirect measurements, such as in the case of magnetic resonance imaging and computed tomography. For some of these applications, the reliability of the results is of particular importance. The accuracy and trustworthiness of the outcomes obtained through indirect measurements are significantly influenced by two critical factors: the degree of uncertainty associated with the measuring instrument and the appropriateness of the (forward) model used for the reconstruction of the parameters of interest (measurand). In this paper, we consider Bayesian inversion to obtain the measurand from signals measured by the instrument and a noise model that mimics background noise coming from the instrument and the variation of the measurement, depending on the forward model. Within this framework, we developed an extension of the expectation maximization (EM) algorithm that is able to handle a Bayesian inversion with a measurement noise model. As a result, we obtain the posterior distribution for the parameters of interest (distribution of the measurand), which is a measure of the reliability of the measurement results. To demonstrate the applicability and effectiveness we apply the algorithm to two real examples in nanometrology, i.e. EUV Scatterometry. The key focus of the work is the development of a noise-adapted posterior sampler based on DeepGEM [24], which can incorporate information from several measurements simultaneously.

In this context we consider Bayesian inverse problems

$\begin{equation} Y_{\theta} = F\left(X\right) + \eta_{\theta},\end{equation} \tag{ 1 }$

with a possibly nonlinear forward operator $F\colon \mathbb{R}^d \to \mathbb{R}^n$ and a random noise variable η_θ which depends on an unknown parameter θ and on F(X). Note, Y describes the signals of the instrument whereas X are the parameters of interest. The posterior (parameter distribution) $P_{X|Y_{\theta} = y}$ for observations y will ultimately depend on these parameters θ, as the likelihood $P_{Y_{\theta}|X = x}$ depends on them. Therefore, we aim to estimate the parameter θ from observations $y_i \in \mathbb{R}^n$ , $i = 1,{\ldots},N$ , where N is possibly small. There exists plenty of literature on estimating the standard deviation σ within the Gaussian noise model $\eta_{\theta} = \eta_{\sigma} \sim \mathcal{N}(0,\sigma^2 I_n)$ . However, motivated by applications in nanometrology [30], we are interested in a mixture of additive and multiplicative Gaussian noise of the form

$\begin{equation} \eta_\theta = \eta_{\left(a,b\right)} = \eta_1 + \eta_2, \quad \eta_1 \sim \mathcal{N}\left(0, a^2{I_n}\right),; \eta_2 \sim \mathcal{N}\left(0, b^2 \text{diag} \left(F\left(x\right)^2\right) \right),\end{equation} \tag{ 2 }$

where $F(x)^2 = (F_i(x)^2)_{i = 1}^n$ and the identity in $\mathbb{R}^{n\times n}$ is given by I_n . For convenience we assume that the instrument noise and other sources can be described by the simple Ansatz made here. In general different noise models may appear in the applications. The noise model equation (2) was used in several previous studies in optics [29, 30, 32, 59] and analyzed in [19]. A similar noise model appears in analytical chemistry [58] and the study of gene expression arrays [57]. It belongs to the class of heteroskedastic noise models [22] and an algorithm for parameter estimation in a slightly different problem was proposed in [23]. Learning the noise model without any parametric form was done using NFs in [1].

The standard approach for parameter estimation is maximum likelihood estimation. That is, we choose θ as the minimizer of the negative log likelihood function

$\begin{align*} \mathcal L\left(\theta\right) = \frac{1}{N}\sum_{i = 1}^N \log\left(p_{\theta}\left(y_i\right)\right),\end{align*}$

where p_θ is the probability density function of Y_θ. However, in our case this function involves a high-dimensional integral of the form $\int_x p_{Y_{\theta}|X = x}(y) \textrm{d}P_X(x)$ which is intractable to compute. As a remedy, we exploit EM algorithms which were introduced in [16], see also [10] for an overview and [50, 51] for applications for parameter estimation in inverse problems. The basic idea is to iteratively compute the posterior distribution $P_{X|Y_{\theta^{(r)}} = y_i}$ for a current estimate $\theta^{(r)}$ of θ and then updating this estimate to $\theta^{(r+1)}$ based on this posterior distribution. Here, the computation of $P_{X|Y_{\theta^{(r)}} = y_i}$ is called E-step, while the update of θ is called M-step. Intuitively, this corresponds to the idea that the distribution of F(X) is approximately concentrated on a lower-dimensional manifold and consequently the distance of the y_i to this manifold contains the information of the noise parameters θ.

Recently, Gao et al [24] proposed to solve the E-step by a normalizing flow (NF) [18] using the reverse/backward Kullback–Leibler (KL) divergence as loss function. For the M-step, they apply a stochastic gradient ascent algorithm. Note that in general the same procedure can be applied to estimate parameters of the forward operator instead of the noise level, see [41]. This approach has several drawbacks. The model needs to be retrained for every new observations y and cannot profit from many observations that follow the same error parameters. Furthermore, the reverse KL is known to be mode-seeking. That is, it tends to recover only one mode of multimodal distributions, which incorporates a significant approximation error, see the discussions in [26, 48, 67] for more details. Very related is also the JANA framework [55], where the authors propose to learn forward (likelihood), posterior and summary network together in an amortized (i.e. conditional manner). However, they optimize it not iteratively, since they do not treat it as an EM framework and do not discuss noise modelling. The main idea of this paper was presented by some of the authors in a one-page extended abstract in [56].

1.1. Contributions

First, we propose to use the conditional NFs [7, 66] in the E-step. This allows the incorporation of several measurements from the same error model and to solve the inverse problem for all measurements simultaneously. Fortunately, the forward KL [7] can be used as loss function for training the conditional NFs which makes the method mode covering. Second, we propose an inner EM algorithm for solving the M-step more efficiently. For our special noise model (2), we deduce analytic expressions for E- and M-steps of this inner algorithm. The performance of our approach will be demonstrated on two applications from nano-optics. In particular, we propose a conditional version of DeepGEM and benchmark it against forward conditional DeepGEM, where the reverse KL is replaced by a forward KL.

1.2. Organization

We start in section 2 by recalling the general EM algorithm. Then, in section 3, we construct the E-step and M-step for our application. That is, we show how conditional NFs can be incorporated into the E-step and describe how the M-step can be solved for our noise model with an 'inner' EM algorithm which steps can be given in a closed analytical form. Some of the technical computations are postponed to appendix A. We test our algorithms on two nano-optics problems, which is done in section 4. Finally, conclusions are drawn in section 5.

2. EM algorithm

In this section, we introduce the EM algorithm as a maximization-maximization algorithm of an evidence lower bound. A general introduction into the EM algorithm can be found, e.g. in [10].

Let $\{Y_\theta:\theta\in\Theta\}$ be a family of n-dimensional random variables having probability density functions p_θ, $\theta\in\Theta$ . Given i.i.d. samples $y_1,{\ldots},y_N\in\mathbb{R}^n$ from $Y_{\theta^*}$ for some unknown $\theta^*$ , which we want to approximate by computing the maximum log-likelihood estimator

$\begin{align*} \hat \theta = \mathop{\mathrm{arg\,max}}_{\theta\in\Theta} \mathcal L\left(\theta\right),\quad \mathcal L\left(\theta\right): = \frac1N\sum_{i = 1}^N \log\left(p_{\theta}\left(y_i\right)\right).\end{align*}$

In the literature, the term $\log(p_\theta(y))$ is also called evidence of y under θ. As in many applications it is hard to maximize $\mathcal L$ , we introduce an absolute continuous d-dimensional auxiliary random variable X such that the joint density $p_{X,Y_\theta}$ exists and is easy to evaluate. Then, it holds by the law of total probability and Jensen's inequality, for any probability density function q on $\mathbb{R}^d$ , that

$\begin{align} \log\left(p_\theta\left(y\right)\right)& = \log\left(\int_{\mathbb{R}^d}p_{X,Y_\theta}\left(x,y\right)\mathrm{d} x\right) = \log\left(\int_{\mathbb{R}^d}\frac{p_{X,Y_\theta}\left(x,y\right)}{q\left(x\right)} q\left(x\right)\mathrm{d} x\right)\end{align}$

$\begin{align} &\unicode{x2A7E}\int_{\mathbb{R}^d}\log\left(\frac{p_{X,Y_\theta}\left(x,y\right)}{q\left(x\right)}\right) q\left(x\right)\mathrm{d} x = : \mathcal F\left(q,\theta|y\right).\end{align}$

We call the random variable X the hidden variable and the expression $\mathcal F(q,\theta|y)$ the evidence lower bound (ELBO). Now, instead of maximizing the log-likelihood function directly, the EM algorithm is a maximization-maximization algorithm for the ELBO, i.e. starting with an initial estimate $\theta^{(0)}$ , it consists of the following two steps:

$\begin{align} \text{E-step:} &\quad q_i^{\left(r+1\right)} = \mathop{\mathrm{arg\,max}}_{q\in \mathrm{pdf}\left(d\right)} \mathcal F\left(q,\theta^{\left(r\right)}|y_i\right),\quad i = 1,{\ldots},N,\end{align} \tag{ 3 }$

$\begin{align} \text{M-step:} &\quad \theta^{\left(r+1\right)} = \mathop{\mathrm{arg\,max}}_{\theta\in\Theta} \sum_{i = 1}^N\mathcal F\left(q_i^{\left(r+1\right)},\theta|y_i\right),\end{align} \tag{ 4 }$

where $\mathrm{pdf}$ is the space of d-dimensional probability density functions.

The E-step (3) can be solved based on the following standard lemma which can be found, e.g. in [9, section 9.4]. For convenience, we provide the simple proof. Recall that the Kullback–Leibler (KL) divergence of two probability measures $P,Q$ with densities $p,q$ is defined by

$\begin{align*} \text{KL}\left(P,Q\right) : = \int_{\mathbb{R}^d} p\left(x\right) \log \frac{p\left(x\right)}{q\left(x\right)} \, \mathrm{d} x ,\end{align*}$

if P is absolutely continuous with respect to Q, and $\text{KL}(P,Q) = + \infty$ otherwise. Further, we use the convention $0 \log 0 = 0$ .

Lemma 1. Let $X \in \mathbb{R}^d$ be a absolute continuous random variable and let Q be an absolutely continuous measure on $\mathbb{R}^d$ with probability density function q. Then it holds, for any $y\in\mathbb{R}^n$ , that

$\begin{align*} \log\left(p_\theta\left(y\right)\right) - \mathcal F\left(q,\theta|y\right) = \mathrm{KL}\left(Q,P_{X|Y_{\theta} = y}\right).\end{align*}$

Proof. By definition of the conditional distribution, we have

$\begin{align*} &\log\left(p_\theta\left(y\right)\right) - \mathcal F\left(q,\theta|y\right)\\ &\quad = \log\left(p_\theta\left(y\right)\right) - \int_{\mathbb{R}^d}q\left(x\right)\log\left(\frac{p_{X|Y_\theta = y}\left(x\right)p_{\theta}\left(y\right)}{q\left(x\right)}\right)\mathrm{d} x\\ & \quad= \log\left(p_\theta\left(y\right)\right)-\int_{\mathbb{R}^d}q\left(x\right)\log\left(p_{\theta}\left(y\right)\right)\mathrm{d} x+\int_{\mathbb{R}^d}q\left(x\right)\log\left(\frac{q\left(x\right)}{p_{X|Y_\theta = y}\left(x\right)}\right)\mathrm{d} x\\ & \quad= \log\left(p_\theta\left(y\right)\right)-\log\left(p_{\theta}\left(y\right)\right)+\mathrm{KL}\left(Q,P_{X|Y_{\theta} = y}\right) = \mathrm{KL}(Q,P_{X|Y_{\theta} = y}).\end{align*}$

□

As the KL divergence $\mathrm{KL}(Q,P)$ is minimal if and only if Q = P, the lemma implies that the solutions q_i of the E-step (3) are given explicitly by

$\begin{equation} \text{E-step:} \qquad q_i^{\left(r+1\right)} = p_{ X | Y_{\theta^{\left(r\right)}} = y_i},\quad i = 1,{\ldots},N.\end{equation} \tag{ 5 }$

For solving the M-step (4), we decompose the ELBO as

$\begin{equation} \mathcal F\left(q,\theta|y\right) = \mathbb{E}_{x\sim q}\left[\log\left(p_{X,Y_\theta}\left(x,y\right)\right)\right]-\mathbb{E}_{x\sim q}\left[\log\left(q\left(x\right)\right)\right].\end{equation} \tag{ 6 }$

Note, that only the first summand depends on θ. Using this decomposition and the explicit form (5) of the q_i in the M-step, we obtain the classical EM algorithm as proposed in [16]:

$\begin{align} \text{M-step:} \qquad \theta^{\left(r+1\right)} &= \mathop{\mathrm{arg\,max}}_{\theta\in\Theta} \mathcal Q\left(\theta,\theta^{\left(r\right)}\right),\end{align} \tag{ 7 }$

$\begin{align} \text{E-step:} \qquad \mathcal Q\left(\theta,\theta^{\left(r\right)}\right)& = \sum_{i = 1}^N \mathbb{E}_{x\sim P_{X|Y_{\theta^{\left(r\right)}} = y_i}}\left[\log\left(p_{X,Y_\theta}\left(x,y_i\right)\right)\right].\end{align}$

A convergence analysis of the EM algorithm based on KL proximal point algorithms was done in [13, 14]. In particular, we obtain the following convergence properties.

Proposition 2. Let the sequence $(q^{(r)},\theta^{(r)})$ be generated by the EM algorithm. Then, the following holds true.

(i)
The sequence of ELBO-values $(\mathcal F(q^{(r)},\theta^{(r)}))_r$ is monotone increasing.
(ii)
The sequence of likelihood values $\mathcal L(\theta^{(r)})$ is monotone increasing.

Remark 3 (generalized EM algorithms). Several papers propose so-called generalized EM algorithms [34, 42, 46, 52]. The key idea of these generalizations is to replace the maximization steps (3) and (4) by increase steps. More precisely, in each iteration the values $q_i^{(r+1)}$ and $\theta^{(r+1)}$ are chosen such that

$\begin{align*} \mathcal F\left(q_i^{\left(r+1\right)},\theta^{\left(r\right)}|y\right)\unicode{x2A7E} \mathcal F\left(q_i^{\left(r\right)},\theta^{\left(r\right)}|y\right), \quad \sum_{i = 1}^N \mathcal F\left(q_i^{\left(r+1\right)},\theta^{\left(r+1\right)}|y\right)\unicode{x2A7E}\sum_{i = 1}^N \mathcal F\left(q_i^{\left(r+1\right)},\theta^{\left(r\right)}|y\right).\end{align*}$

Using such increase steps, generalized EM algorithms often achieve simpler and faster steps, than the original EM algorithm even though they might require more steps until convergence. By construction, part (i) of proposition 2 remains for generalized EM algorithms, while part (ii) is not longer proved for certain of these algorithms.

3. Parameter estimation in Bayesian Inverse Problems

Now we consider the inverse problem (1), where we assume that X has density p_X . Given N observations $y_1,{\ldots},y_N$ of Y_θ, we aim to determine the parameter θ. We will derive an EM algorithm for this problem, where the hidden variable is given by the ground truth random variable X. In particular, we will deal with the noise model (2). Here the parameter $\theta = (a,b)$ can be updated in the M-step analytically.

3.1. E-Step: conditional NFs

As we have seen in (5), the E-step corresponds to finding the posterior densities

$\begin{align*} p_{X|Y_{\theta^{\left(r\right)}} = y_i} = \text{arg}\,\text{min}_q\mathcal F\left(q,\theta^{\left(r\right)}|y_i\right), \quad i = 1,\ldots,N,\end{align*}$

for given $\theta^{(r)}$ . We propose to approximate these posteriors by conditional NFs. This extends the so-called DeepGEM from [24] to the conditional case. We will see that our approach brings the forward KL instead of the reverse KL into the play which has several advantages, see remark 4.

A conditional NF is a mapping $\mathcal T_\phi\colon\mathbb{R}^n\times\mathbb{R}^d\to\mathbb{R}^d$ depending on some parameters φ such that $\mathcal T_\phi(y,\cdot)$ is invertible for any $y\in\mathbb{R}^n$ . In this paper, $\mathcal T_\phi$ is a neural network. There were several architectures for NFs proposed in the literature. They include GLOW [39], real NVP [18], invertible ResNets [8, 12, 33] and autoregressive Flows [15, 20, 36, 53]. They were extended to the conditional setting in [7, 17, 27, 40, 66]. The parameters φ are learned such that

$\begin{align*} P_Z \circ T_\phi\left(y_i,\cdot\right)^{-1} = {\mathcal T_\phi\left(y_i,\cdot\right)}_\#P_Z\approx P_{X|Y_{\theta^{\left(r\right)}} = y_i},\end{align*}$

for all $i = 1,{\ldots},N$ , where P_Z is some latent distribution, usually a standard Gaussian one. Once we have learned the conditional NF $\mathcal T_\phi$ for an appropriate parameter θ this provides us with a desired approximation of the posterior. In other words, given $y \in \mathbb{R}^n$ , we can sample z from Z and produce a sample from $P_{X|Y = y}$ by $\mathcal T_\phi(y,z)$ .

Now we could learn the conditional NF $\mathcal T_\phi$ by minimizing the loss function

$\begin{align*} \mathcal J_\text{reverse}\left(\phi\right) = \sum_{i = 1}^N\mathcal F\left(p_{\mathcal T_\phi\left(y_i,\cdot\right)_\#P_Z},\theta^{\left(r\right)}|y_i\right)\propto\sum_{i = 1}^N\mathrm{KL}\left(\mathcal T_\phi\left(y_i,\cdot\right)_\#P_Z,P_{X|Y_\theta = y_i}\right),\end{align*}$

where the last relation follows from lemma 1 and ' $\propto$ ' indicates equality up to a constant. In literature, this loss function is known as reverse or backward KL loss function. Applying the change of variable formula for push-forward measures and Bayes's formula this can be rewritten as

$\begin{align*} \mathcal J_\text{reverse}\left(\phi\right) &\propto\sum_{i = 1}^N\mathrm{KL}\left(P_{Z},\mathcal T_\phi\left(y_i,\cdot\right)^{-1}_\#P_{X|Y_{\theta^{\left(r\right)}} = y_i}\right)\\ &\propto - \sum_{i = 1}^N \mathbb{E}_{z\sim P_Z} \left[\log\left(p_{Y_{\theta^{\left(r\right)}}|X = \mathcal T_\phi\left(y_i,z\right)}\left(y_i\right)\right)+\log\left(p_X\left(\mathcal T_\phi\left(y_i,z\right)\right)\right) +\log\left(|\mathrm{det}\left(\nabla_z\mathcal T_\phi\left(y_i,z\right)\right)|\right) \right],\end{align*}$

see [2, 3, 40] for a detailed explanation and applications. In order to evaluate these terms we have to be able to evaluate the prior density p_X as well as the conditional densities $p_{Y_{\theta^{(r)}}|X = x}(y)$ , which contains the forward operator and the noise model for given parameters $\theta^{(r)}$ . Unfortunately, it is known from the literature that the reverse KL is prone to mode collapse, see [48]. That is, in the case that $P_{X|Y_\theta = y_i}$ is multimodal, it tends to generate only samples from one of the modes.

As a remedy, we interchange the arguments in the KL divergence in $\mathcal J_\text{reverse}$ and replace the sum over the y_i by the expectation over $P_{Y_{\theta^{(r)}}}$ . Then, we arrive at the so-called forward KL loss function

$\begin{align} \mathcal J_\text{forward}\left(\phi\right) & = \mathbb{E}_{y\sim P_{Y_{\theta^{\left(r\right)}}}} \left[\mathrm{KL}\left(P_{X|Y_{\theta^{\left(r\right)}} = y},\mathcal T_\phi\left(y,\cdot\right)_\#P_Z\right) \right]\end{align}$

$\begin{align} & = \mathbb{E}_{y\sim P_{Y_{\theta^{\left(r\right)}}}} \left[ \mathbb{E}_{x \sim P_{X|Y_{\theta^{\left(r\right)}} = y}} \left[ \log \left(\frac{p_{X|Y_{\theta^{\left(r\right)}} = y}\left(x\right) }{p_{\mathcal T_\phi\left(y,\cdot\right)_\# P_Z }\left(x\right)}\right) \right] \right]\end{align}$

$\begin{align} & \propto-\mathbb{E}_{\left(x,y\right)\sim P_{X,Y_{\theta^{\left(r\right)}}}}\left[\log\left(p_Z\left(\mathcal T_\phi\left(y,\cdot\right)^{-1}\left(x\right)\right)\right)+\log|\mathrm{det}\left(\nabla\mathcal T_\phi\left(y,\cdot\right)^{-1}\left(x\right)\right)|\right].\end{align}$

To compute these terms we need samples $(\tilde x_j,\tilde y_j)$ , $j = 1,{\ldots},\tilde N$ from the joint distribution $P_{X,Y_{\theta^{(r)}}}$ . Note that such samples can be generated from just knowing the $\tilde x_j$ by evaluating the forward operator and the noise model. In this setting, we do not need access to the prior density p_X or the conditional densities $p_{Y_{\theta^{(r)}}|X = x }$ . The forward KL is more standard in (conditional) generative modelling [7, 18, 66] due to these properties and is also known as maximum likelihood training [67].

Remark 4 (forward versus Reverse KL). Note that in the case that $\mathcal T_\phi$ is an universal approximator, we have for both loss functions that the optimal parameters $\hat \phi$ fulfills $\mathcal T_{\hat \phi}(y_i,\cdot)_\#P_Z = P_{X|Y_{\theta^{(r)}} = y_i}$ . This is important, as we propose to replace the reverse KL in the E-step by the forward KL. Moreover, the assumptions for training and the approximation properties differ. For the reverse KL, we have to be able to evaluate the density p_X of the prior distribution, while the forward KL needs samples from P_X . In practice it depends on the problem which assumption is more realistic. On the other hand, the forward KL loss function is not that prone to mode collapse. The universality of conditional NFs has been discussed in [44].

3.2. M-Step: inner EM for mixed noise model

As described in (7), the M-step is given by

$\begin{align} \theta^{\left(r+1\right)} & = \mathop{\mathrm{arg\,max}}_{\theta}\sum_{i = 1}^N\mathbb{E}_{x\sim P_{X|Y_{\theta^{\left(r\right)}} = y_i}}\left[\log\left(p_{X,Y_{\theta}}\left(x, y_i\right)\right)\right].\end{align} \tag{ 8 }$

Unfortunately, to the best of our knowledge, an analytic solution of (8) is not available. Therefore, we discretize the expectation in (8) by

$\begin{align} \theta^{\left(r+1\right)} = \mathop{\mathrm{arg\,max}}_{\theta}\sum_{i = 1}^N\sum_{k = 1}^M\log\left(p_{X,Y_{\theta}}\left(x_{i}^k,y_i\right)\right),\end{align} \tag{ 9 }$

where the $x_{i}^k$ , $k = 1,{\ldots},M$ are sampled from $P_{X|Y_{\theta^{(r)}} = y_i}$ . This can be solved by various iterative methods, e.g. by a stochastic gradient algorithm [38] as done in [24].

For our special noise model (2) with $\theta = (a,b)$ , we propose to use again an EM algorithm, since both the E- and M-step of the 'inner' EM can be computed analytically, which will be shown in the following paragraphs. We use that for this noise model we have

$\begin{align*} p_{Y_{\theta}|X = x}\left(y\right) = \mathcal N\left(y|F\left(x\right),a^2 I_n+b^2\mathrm{diag}\left(F\left(x\right)\right)\right).\end{align*}$

For simplicity, we assume that we have only M = 1 samples. The case M > 1 can be reduced to this case by considering M copies of y_i . In our EM algorithm for (9) we use $V_\theta \sim a \mathcal N(0,1)$ as hidden variable, which corresponds to the 'additive part' of the noise.

3.2.1. Inner E-step

We have to compute the conditional distribution $P_{V_\theta|(X,Y_\theta) = (x,y)}$ . Using Bayes' formula, we obtain

$\begin{align} \log\left(p_{V_\theta|\left(X,Y_\theta\right) = \left(x,y\right)}\left(v\right)\right) &\propto \log\left(p_{Y_\theta|V_\theta = v,X = x}\left(y\right)\right)+\log\left(p_{V_\theta|X = x}\left(v\right)\right)\end{align}$

$\begin{align} &\propto \sum_{j = 1}^n-\frac{\left(y_j-F_j\left(x\right)-v_j\right)^2}{2 b^2F_j\left(x\right)^2}-\frac{v_j^2}{2a^2}\end{align}$

$\begin{align} & \propto \log\left(\mathcal N\left(v\Big|\tfrac{a^2\left(y-F\left(x\right)\right)}{a^2+b^2F\left(x\right)^2},\mathrm{diag}\left(\tfrac{a^2b^2 F\left(x\right)^2}{a^2+b^2 F\left(x\right)^2}\right)\right)\right),\end{align}$

where the quotients in the last line are understood componentwise and ' $\propto$ ' indicates, that we have equality up to an additive constant independent of v. Consequently, the conditional distribution $P_{V_\theta|(X,Y_\theta) = (x,y)}$ is given by

$\begin{align*} P_{V_\theta|\left(X,Y_\theta\right) = \left(x,y\right)} = \mathcal N\left(\tfrac{a^2\left(y-F\left(x\right)\right)}{a^2+b^2F\left(x\right)^2},\mathrm{diag}\left(\tfrac{a^2b^2 F\left(x\right)^2}{a^2+b^2 F\left(x\right)^2}\right)\right).\end{align*}$

3.2.2. Inner M-step

We will just outline the final result. The quite technical proof is deferred to appendix A, where the M-step can be rewritten as

$\begin{align*} \theta^{\left(r+1\right)} = \mathop{\mathrm{arg\,max}}_{\theta = \left(a,b\right) \in \mathbb{R}^2_{\unicode{x2A7E} 0}}A_1\left(b\right)+A_2\left(a\right),\end{align*}$

where

$\begin{align*} A_1\left(b\right) = \frac{1}{2b^2}c_1^{\left(r\right)}-n+\log\left(b\right)+\mathrm{const},\quad A_2\left(a\right) = \frac{1}{2a^2}c_2^{\left(r\right)}-n\log\left(a\right)+\mathrm{const},\end{align*}$

with

$\begin{align} c_1^{\left(r\right)}& = -\frac{1}{N}\sum_{i = 1}^N\sum_{j = 1}^n\frac{\left(y_{ij}-F_j\left(x_i\right)\right)^2\left(b^{\left(r\right)}\right)^4F_j\left(x_i\right)^2}{\left(\left(a^{\left(r\right)}\right)^2+\left(b^{\left(r\right)}F_j\left(x_i\right)\right)^2\right)^2}+\frac{\left(a^{\left(r\right)}b^{\left(r\right)}\right)^2}{\left(a^{\left(r\right)}\right)^2+\left(b^{\left(r\right)}F_j\left(x_i\right)\right)^2}\end{align}$

$\begin{align} c_2^{\left(r\right)}& = -\frac{1}{N}\sum_{i = 1}^N \left(\sum_{j = 1}^n \left(\frac{\left(a^{\left(r\right)}\right)^2\left(y_{ij}-F_j\left(x_i\right)\right)}{\left(a^{\left(r\right)}\right)^2+\left(b^{\left(r\right)}\right)^2F_j\left(x_i\right)^2}\right)^2+\frac{\left(a^{\left(r\right)}b^{\left(r\right)}F_j\left(x_i\right)\right)^2}{\left(a^{\left(r\right)}\right)^2+\left(b^{\left(r\right)}F_j\left(x_i\right)\right)^2}\right),\end{align}$

and $y_i = (y_{i1},{\ldots},y_{in})$ and $F = (F_1,{\ldots},F_n)\colon\mathbb{R}^d\to\mathbb{R}^n$ . By setting the derivatives of A₁ and A₂ to zero, this is equivalent to

$\begin{align*} a^{\left(r+1\right)} = -\frac{c_2^{\left(r\right)}}{n},\qquad b^{\left(r+1\right)} = -\frac{c_1^{\left(r\right)}}{n},\end{align*}$

which are the update rules we will use.

3.3. Resulting algorithm

The summary of the two nested algorithms can be seen in algorithm 1. Here, both the E-steps and the M-steps are not run for 1 iteration, but several. In particular the analytical M-step is cheap, and therefore it is intuitive to make use of this. For the E-step we take usually 10 steps to perform posterior updates. The initialization of $a,b$ are done in such a way that we approximate the posterior distribution 'from above'. This is important so that the observed measurements are included in the distribution $P_{Y_{\theta}}$ , which is similar to the logdet schedule proposed in [63]. It indeed can be shown that making the logdet term larger corresponds to scaling the noise higher for additive Gaussian noise, which makes the estimated distributions broader and therefore prevents mode missing, although we still found that this does not solve the problem completely.

Algorithm 1. EM algorithm mixed noise estimation via CNFs.
Input: $y_1,{\ldots},y_N\in\mathbb{R}^{n}$ , conditional normalizing flow $\mathcal{T}$ and initial estimate $a^{(0)},b^{(0)}$ , number of x in total $K = 2000 \unicode{x2A7E} N$ .
for $r = 0,1,{\ldots}, R$ do
E-Step:
for $p = 0,1,{\ldots}, P$ do update $\mathcal{T}$ according to
$\text{arg}\,\text{min}_{\phi}\mathbb{E}_{y\sim P_{Y_\theta}}\left[\mathrm{KL}\left(p_{X\|Y_\theta = y},p_{\mathcal T_\phi\left(y,\cdot\right)_\#P_Z}\right)\right]\qquad\text{(reverse KL)}$
or
$\text{arg}\,\text{min}_{\phi}\mathbb{E}_{y\sim P_{Y_\theta}}\left[\mathrm{KL}\left(p_{\mathcal T_\phi\left(y,\cdot\right)_\#P_Z},p_{X\|Y_\theta = y}\right)\right]\qquad\text{(forward KL)}$
via a gradient descent method where $\theta = (a^r,b^r).$
end for
Repeat $\tilde{y} = (y_1,..,y_1,{\ldots},y_N,{\ldots}y_N) \in \mathbb{R}^{K\ n}$ .
Sample $x_i \sim \mathcal{T}(\tilde{y}_i,z_i)$ for standard normal z_i .
M-Step:
for $l = 0,1,..,L$ do
$c_1^{\left(r,l\right)} = -\frac{1}{K}\sum_{i = 1}^{K}\sum_{j = 1}^{n}\frac{\left(\tilde{y}_{ij}-F_j\left(x_i\right)\right)^2\left(b^{\left(r\right)}\right)^4F_j\left(x_i\right)^2}{\left(\left(a^{\left(r\right)}\right)^2+\left(b^{\left(r\right)}F_j\left(x_i\right)\right)^2\right)^2}+\frac{\left(a^{\left(r\right)}b^{\left(r\right)}\right)^2}{\left(a^{\left(r\right)}\right)^2+\left(b^{\left(r\right)}F_j\left(x_i\right)\right)^2}$
$c_2^{\left(r,l\right)} = -\frac{1}{K}\sum_{i = 1}^K \left(\sum_{j = 1}^n \left(\frac{\left(a^{\left(r\right)}\right)^2\left(\tilde{y}_{ij}-F_j\left(x_i\right)\right)}{\left(a^{\left(r\right)}\right)^2+\left(b^{\left(r\right)}\right)^2F_j\left(x_i\right)^2}\right)^2+\frac{\left(a^{\left(r\right)}b^{\left(r\right)}F_j\left(x_i\right)\right)^2}{\left(a^{\left(r\right)}\right)^2+\left(b^{\left(r\right)}F_j\left(x_i\right)\right)^2}\right).$
Update $(a^{(r,l)})^2 = -\frac{c_2^{(r,l)}}{n},\qquad (b^{(r,l)})^2 = -\frac{c_1^{(r,l)}}{n}$
end for
Set $(a^{(r+1)})^2 = (a^{(r,L)})^2,\qquad (b^{(r+1)})^2 = (b^{(r,L)})^2$
end for

4. Experiments

We will benchmark our algorithm on two problems from nano-optics, the first one being low-dimensional and the second one harder and more recent. The first was introduced in [29, 30] and the second one is part of a current research project. The goal of this is to learn both a reasonable posterior reconstruction as well as the error parameters $a,b$ jointly. To showcase the advantages of making the models conditional we also vary the number of measurements and hope that more measurements lead to better reconstructions.

Generally, we use the PyTorch framework [54] and use FrEia package for implementation of the conditional NFs [6]. The code is available on GitHub⁴ . We train our models using the Adam optimizer [38] and fix some hyperparameter choices across the experiments. In particular, we only use a learning rate of $1\times 10^{-3}$ , P = 10, set K = 2000, R = 5000 in 1 and L = 20. The choice of K is in particular constant no matter how many measurements N are used. This allows us to compare whether the information of many measurements is beneficial for the estimation of a and b. However, these hyperparameters were not optimized in a grid search and therefore it is likely that one can improve the performance. We generate synthetic measurements via surrogate forward operators with known noise levels $a_\textrm{true}, b_\textrm{true}$ , similar as in [3, section 3]. In both these experiments, we take these surrogate neural networks as forward operators. The extension to real world measurements and the relation to the true PDE inverse problem is left for future work. This allows us, given some noise parameters, to sample (x, y) data on the fly.

Then we apply our proposed algorithms to learn a and b as well as the posterior reconstructions. Then we are able to compare the models and error parameters on two metrics. The metrics and models evaluated are summarized below.

4.1. Models evaluated

We will evaluate two EM-based models, one is the conditional version of the DeepGEM method [24] which we combined with our M-step. Note that amounts to using the reverse KL divergence in algorithm 1. However we propose to use the forward KL divergence which we call conditional forward DeepGEM. To see the general comparison of EM algorithms, we also implement a grid search over (a, b) and save the 'best' model of this. The grid search is still possible since we are searching over a two dimensional space, but becomes quickly infeasible for higher dimensional noise models. We will call this grid conditional NF and also evaluate its forward and reverse KL version.

4.2. Metrics

We are going to benchmark the models using the following two metrics.

Distance to true a and b: we will consider synthetic data, where the $a,b$ the observations were generated with, are known. This metric is given by
$\begin{align*} D\left(\left(a,b\right), \left(a_\textrm{true}, b_\textrm{true}\right)\right) = \frac{\vert a - a_\textrm{true} \vert}{a_\textrm{true}} + \frac{\vert b - b_\textrm{true} \vert}{b_\textrm{true}}.\end{align*}$
However, this is a terrible metric as there can be other combinations of a and b which explain the observations equally well. However, we hope that for sufficient observations we will converge to the true a and b.
ELBO: from lemma 1 we see that maximizing the ELBO $\mathcal{F}$ leads to minimizing the KL distance to the true posteriors as well as maximizing the likelihoods of the observations under the estimated error parameters $a,b$ . This is a good proxy, as both the likelihood of the observations as well as the KL distance to the posteriors are intractable in high dimensions.

By the above discussion, we also obtain a suitable model selection criterion. We train all the models for the same amount of steps, but we validate it after every EM-step according to the ELBO for the measurements. We then load the best model for every run and evaluate our metrics.

4.3. Scatterometry

For chip manufacturing the control of nanopattern in the lithography process is essential and non-destructive measurement methods with high throughput are desirable. In addition to standard scanning electron microscopy (low throughput, destructive) scatterometry is gaining importance. Scatterometry is a non-destructive optical measurement technique for assessing lithography's periodic nanostructures' critical dimensions [37]. In this measuring method, nanostructured periodic surfaces are illuminated with light and refraction patterns are detected. From these patterns geometry parameters are reconstructed by solving an inverse problem. According to equation (1) observations are given by the refraction patterns, the forward operator is determined by time-harmonic Maxwell's equations and the noise is given by the instrument as well as the model error.

In the following, we consider two examples to demonstrate the performance of the developed algorithm for applications in nanometrology of chip production. The first example considers a typical photomask for extreme ultra violet light (EUV) and the second a line grating.

4.3.1. Photo mask

The EUV-photomask considered here consists of periodic absorber lines, capping layers and a multilayer stack functioning as a mirror for 13.4 nm wavelength waves (EUV range). Key geometry parameters include the line width, height and the angle of the sidewall (3 parameters). The refraction patterns comprise 23 intensities (maxima of the refractive orders) and the measurement/model noise is assumed to be distributed according to our mixed noise model.

The problem has x-dimension 3 and y-dimension 23 and therefore is well-suited for first experiments. Furthermore, by [30] it is known that the posterior is indeed multimodal. The prior is chosen uniformly in $[-1,1]$ and its density is approximated like in [26] for the reverse KL. For the example we train both the conditional DeepGEM as well as the conditional forward DeepGEM using data from the finite element method (FEM) based forward model [30], which is approximated by a surrogate neural network. The forward DeepGEM is a bit quicker to train. The true a and b used to generate simulated signals of the instrument were set to 0.005 and 0.1 respectively. We benchmark now the four methods, the conditional DeepGem with forward and reverse KL as well as the grid conditional NFs (gridCNF) with forward and reverse KL. For the grids we chose an equispaced grid with 8 points for a and for b. For a this ranged from 0.001 to 0.03 and for b from 0.01 to 0.2. Concerning training time, the forward and reverse conditional DeepGEM were similar with 9 min per run. The grid versions took approximately 13 min to train, where we took 1200 optimizer steps per grid point. Generally, we can see that the grid methods get outperformed by our EM versions, although they take a longer time. From table 1 and figure 1 we can see that the forward KL and the reverse KL have both similar performance in terms of distance to the true a and b, where the forward KL seems to have a slight edge in the case of many measurements. However, in terms of ELBO, we observe in table 2 that the forward KL performs favorably. This is somewhat remarkable, as the reverse KL is the ELBO objective when ignoring the parameters independent of the flow. Considering posterior measures obtained form simulated measurements we realize that the reverse KL does not exactly reproduce the modes in some of the examples, see figure 2 whereas the forward KL performs quite well. The inability of the reverse KL to detect the correct modes of the posterior can indeed explain the better performance of the forward conditional DeepGEM. Both algorithms are improving with more measurements, see figures 1(a) and (b).

Figure 1. Refer to the following caption and surrounding text. — **Figure 1.** Distance to the hyperparameters (a) and (b) for forward and reverse KL conditional DeepGEM.
Download figure:
Standard image High-resolution image

Figure 2. Refer to the following caption and surrounding text. — **Figure 2.** Posterior reconstructions for different measurements using forward/reverse conditional DeepGEM via one dimensional histograms on the diagonal and two dimensional on the offdiagonal. Ground truth x is depicted by the blue line.
Download figure:
Standard image High-resolution image

Table 1. Distance of estimated a and b to the true ones over 10 runs. The best values are indicated in bold.

Number measurements	1	2	4	8
Forward condDeepGEM	0.50	0.37	0.33	0.20
Reverse condDeepGEM	0.60	0.36	0.31	0.22
Forward gridCNF	0.65	0.41	0.44	0.32
Reverse gridCNF	0.53	0.34	0.31	0.42

Table 2. ELBO of the algorithms over 10 runs. Calculated for the measurements based on 2000 samples. The best values are indicated in bold.

Number measurements	1	2	4	8
Forward condDeepGEM	81.12	80.55	79.36	79.20
Reverse condDeepGEM	80.92	79.80	78.62	78.59
Forward gridCNF	77.54	76.89	76.33	76.61
Reverse gridCNF	78.99	77.55	76.32	76.50

4.3.2. Line grating with oxid layer

The second example involves a periodic line grating consisting of a silicon bulk and an oxide layer on top. Similar samples were investigated e.g. in [43]. In addition to the geometry parameters, as used in the previous example, the optical constants (OC) of the materials are assumed to be not accurately known. In practice this is often the case if the material composition was changed due to oxidation and contamination of the sample. So for each material, there are two parameters for the complex refractive index (real and imaginary part) [31], which depend on the material density. Hence we change the OC by varying the densities of the material, i.e. silicon and silicon-oxide. This results in two parameters changing the OC and five parameter describing the geometry of the line grating. The refraction pattern are detected for a single wavelength of the incoming light beam under an angle of incidence of 30^∘, from the sample plane and a set of seven azimuth angles between 0^∘ and 6^∘, the sample is rotated by in the plane. In sum we obtained 77 simulated intensities and hence end up with x-dimension seven and y-dimension 77. For simulations the forward model was solved with the software package JCMsuite⁵ , based on the FEM which solves a boundary value problem following from the Maxwell's equations [21]. In order to get a strong response for the OC of the oxide layer we used a wavelength of $12.99\,\text{nm}$ , right before the absorption edge [4, 31]. For this work we standardized the data from the forward simulation [11] on $[0,1]$ and chose a uniform prior for the x-data.

Again, we plot two example posterior distributions calculated. The distribution shapes seen in 3 clearly reflect the sensitivity of the forward operator against the line height (parameter 0), the silicon oxide density (parameter 4) and non-sensitive against the layer roughness (parameter 6). The true a and b were set to 0.03 and 0.25 respectively. Again as in the first scatterometry example we can see in tables 3 and 4 that the forward KL performs a bit better in distance to the true a and b as well as ELBO. Similarly, one can observe that the first x-component, the height can be multimodal, where the reverse KL can indeed miss the mode. This can be observed in figure 3. Similarly, the distance to the true a and b decreases by adding more simulated measurement values, which can be seen in figure 4.

Figure 3. Refer to the following caption and surrounding text. — **Figure 3.** Posterior reconstructions for different measurements using forward/reverse conditional DeepGEM.
Download figure:
Standard image High-resolution image

Figure 4. Refer to the following caption and surrounding text. — **Figure 4.** Distance to the hyperparameters (a), (b) for forward and reverse KL conditional DeepGEM.
Download figure:
Standard image High-resolution image

Table 3. Distance of estimated a and b to the true ones over 5 runs. The best values are indicated in bold.

Number measurements	1	2	4	8
Forward	0.93	0.19	0.09	0.08
Reverse	1.07	0.20	0.10	0.11

Table 4. ELBO of the algorithms over 5 runs. Calculated for the measurements based on 10 000 samples. The best values are indicated in bold.

Number measurements	1	2	4	8
Forward	195.9	196.0	190.5	188.1
Reverse	195.8	196.1	189.1	186.9

5. Conclusions and limitations

We developed a nested EM algorithms, one for estimating the posterior distribution via a conditional NF and a second one to solve the M-step within the former EM algorithm to estimate the error model parameters. For the special kind of non-additive noise appearing in our applications we derived analytic formulas for the inner E- and M-steps. We showed advantages of using the forward KL for modelling multimodal distributions. The reverse KL often led to mode collapse. However, there has been a plethora of literature tackling this issue of the reverse KL, namely [5, 45, 47, 64]. It would be interesting to compare these approaches to the forward KL. Moreover, we could replace the conditional NF by other methods for posterior sampling like score-based diffusion models [35, 60], conditional GANs [49] or posterior MMD flows [25]. Furthermore, we chose synthetic $a_\textrm{true}$ and $b_\textrm{true}$ . One of the next steps is to test these approaches on real world measurements. Even if the novel algorithm was applied to two specific real world experiments, it may have an impact to a wide range of applications where indirect measurements are involved. The extension of the algorithm to other noise distributions than Gaussian is analogous. An advantage over standard approaches like Markov-Chain Monte Carlo methods is the fact that once the network has been trained, further similar measurements can be evaluated very quickly. This benefit opens the possibility of scatterentry and similar measurement techniques for real time applications, e.g. important for process control. In terms of limitations, it would be also interesting to test the algorithm on other inverse problems. Intuitively, we believe that the scatterometric inverse problem is particularly well-suited for these estimations since the observed f(X)-data is living on a low-dimensional manifold in a nominal high-dimensional space. One can indeed easily think of an inverse problem, where recovering noise parameters is much harder if the observed data already lies in the full space.

Acknowledgments

P H acknowledges support from the DFG within the SPP 2298 'Theoretical Foundations of Deep Learning' (STE 571/17-1). J H acknowledges funding by the EPSRC programme grant 'The Mathematics of Deep Learning' with reference EP/V026259/1. M C and S H acknowledges the support of the EMPIR Project 22IND04-ATMOC. This Project (20IND04 ATMOC) has received funding from the EMPIR programme cofinanced by the Participating States and from the European Union's Horizon 2020 research and innovation programme.

Data availability statement

The code and data in the form of surrogate networks that support the findings of this study are openly available. This means, the surrogates and the code for running and reproducing the experiments is available under https://github.com/PaulLyonel/ConditionalDeepGEM. For the second photo mask problem we have training data from the forward simulation/simulation of the measurement under https://doi.org/10.5281/zenodo.10580011.

Appendix A: Derivation of the inner M-step

For the simplicity of the notation, we use the abbreviation

$\begin{align} Q_i = P_{V_{\theta^{\left(r\right)}}|\left(X,Y_{\theta^{\left(r\right)}}\right) = \left(x_i,y_i\right)} = \mathcal N\left(\tfrac{\left(a^{\left(r\right)}\right)^2\left(y_i-F\left(x_i\right)\right)}{\left(a^{\left(r\right)}\right)^2+\left(b^{\left(r\right)}\right)^2F\left(x_i\right)^2},\mathrm{diag}\left(\tfrac{\left(a^{\left(r\right)}\right)^2\left(b^{\left(r\right)}\right)^2 F\left(x_i\right)^2}{\left(a^{\left(r\right)}\right)^2+\left(b^{\left(r\right)}\right)^2 F\left(x_i\right)^2}\right)\right).\end{align} \tag{ A.1 }$

Using the decomposition (6) of the ELBO, and noting that the second summand within (6) does not depend on the parameters $\theta = (a,b)$ , we obtain that the optimization problem (4) is equivalent to

$\begin{equation} \begin{aligned} \theta^{\left(r+1\right)}& = \mathop{\mathrm{arg\,max}}_{\theta\in\Theta}\sum_{i = 1}^N \frac{1}{N}\mathbb{E}_{v\sim Q_i}\left[\log\left(p_{V_\theta,X,Y_\theta}\left(v,x_i,y_i\right)\right)\right]\\ & = \mathop{\mathrm{arg\,max}}_{\theta\in\Theta}\sum_{i = 1}^N \frac{1}{N}\mathbb{E}_{v\sim Q_i}\left[\log\left(p_{V_\theta,Y_\theta|X = x_i}\left(v,y_i\right)\right)\right]+\log\left(p_X\left(x_i\right)\right)\\ & = \mathop{\mathrm{arg\,max}}_{\theta\in\Theta}\sum_{i = 1}^N \frac{1}{N}\mathbb{E}_{v\sim Q_i}\left[\log\left(p_{V_\theta,Y_\theta|X = x_i}\left(v,y_i\right)\right)\right]. \end{aligned}\end{equation} \tag{ A.2 }$

Now, the objective function reads as

$\begin{align} &\sum_{i = 1}^N \frac{1}{N}\mathbb{E}_{v\sim Q_i}\left[\log\left(p_{V_\theta,Y_\theta|X = x_i}\left(v,y_i\right)\right)\right]\end{align}$

$\begin{align} &\quad = \sum_{i = 1}^N \frac{1}{N}\mathbb{E}_{v\sim Q_i}\left[\log\left(p_{Y_\theta|V_\theta = v,X = x_i}\left(y_i\right)\right)+\log\left(p_{V_\theta|X = x_i}\left(v\right)\right)\right].\end{align}$

As it holds by definition that $p_{Y_\theta|V_\theta = v,X = x_i}(y_i) = \mathcal N(y_i|F(x_i)+v,b^2\mathrm{diag}(F(x))^2)$ and $p_{V_\theta|X = x_i}(v) =$ $p_{V_\theta}(v) = \mathcal N(v|0,a^2 I_n)$ , this is equal to

$\begin{align} &\quad\sum_{i = 1}^N \frac{1}{N}\mathbb{E}_{v\sim Q_i}\left[\log(\mathcal N(y_i|F(x_i)+v,b^2\mathrm{diag}(F(x_i))^2)+\log(\mathcal N(v|0,a^2 I))\right] = A_1(b)+A_2(a),\end{align}$

where (leaving out constants with respect to a or b)

$\begin{align*} A_1(b)& = \frac{1}{N} \sum_{i = 1}^N\mathbb{E}_{v\sim Q_i}\left[\log(\mathcal N(y_i|F(x_i)+v,b^2\mathrm{diag}(F(x_i))^2)\right]\\ & = -\frac{1}{2Nb^2}\sum_{i = 1}^N\sum_{j = 1}^n\frac{1}{F_j(x_i)^2}\Big((y_{ij}-F_j(x_i))^2 -2\mathbb{E}_{v\sim Q_i}\left[v_j\right](y_{ij}-F_j(x_i))+\mathbb{E}_{v\sim Q_i}\left[v_j^2\right]\Big) -\frac{n}{2}\log(b^2),\end{align*}$

and

$\begin{align} A_2\left(a\right)& = \frac{1}{N}\sum_{i = 1}^N \mathbb{E}_{v\sim Q_i}\left[\log\left(\mathcal N\left(v|0,a^2 I\right)\right)\right] = -\frac{1}{2a^2\ N}\sum_{i = 1}^N\sum_{j = 1}^n\mathbb{E}_{v\sim Q_i}\left[v_j^2\right]-\frac{n}{2}\log\left(a^2\right).\end{align}$

Now, by (A.1), the expressions $\mathbb{E}_{v\sim Q_i}[v_j]$ and $\mathbb{E}_{v\sim Q_i}[v_j^2]$ are the first and second moment of certain normal distributions, such that

$\begin{align*} \mathbb{E}_{v\sim Q_i}\left[v_j\right] = \frac{\left(a^{\left(r\right)}\right)^2\left(y_{ij}-F_j\left(x_i\right)\right)}{\left(a^{\left(r\right)}\right)^2+\left(b^{\left(r\right)}\right)^2F_j\left(x_i\right)^2}\end{align*}$

and

$\begin{align*} \mathbb{E}_{v\sim Q_i}\left[v_j^2\right] = \left(\frac{\left(a^{\left(r\right)}\right)^2\left(y_{ij}-F_j\left(x_i\right)\right)}{\left(a^{\left(r\right)}\right)^2+\left(b^{\left(r\right)}\right)^2F_j\left(x_i\right)^2}\right)^2 +\frac{\left(a^{\left(r\right)}\right)^2\left(b^{\left(r\right)}\right)^2 F_j\left(x_i\right)^2}{\left(a^{\left(r\right)}\right)^2+\left(b^{\left(r\right)}\right)^2F_j\left(x_i\right)^2}.\end{align*}$

Putting everything together, we obtain that (A.2) is equivalent to

$\begin{align*} \theta^{\left(r+1\right)} = \mathop{\mathrm{arg\,max}}_{\left(a,b\right) \in \mathbb{R}^2_{\unicode{x2A7E} 0}} A_1\left(b\right)+A_2\left(a\right)\end{align*}$

with

$\begin{align*} A_1\left(b\right) = \frac{1}{2b^2}c_1^{\left(r\right)}-n \log\left(b\right)+\mathrm{const},\quad A_2\left(a\right) = \frac{1}{2a^2}c_2^{\left(r\right)}-n \log\left(a\right)+\mathrm{const},\end{align*}$

where $\mathrm{const}$ denotes an unspecified constant independent of a and b. Further, the $c_i^{(r)}$ are given by

$\begin{align} c_2^{\left(r\right)} = -\frac{1}{N}\sum_{i = 1}^N \left(\sum_{j = 1}^n \left(\frac{\left(a^{\left(r\right)}\right)^2\left(y_{ij}-F_j\left(x_i\right)\right)}{\left(a^{\left(r\right)}\right)^2+\left(b^{\left(r\right)}\right)^2F_j\left(x_i\right)^2}\right)^2+\frac{\left(a^{\left(r\right)}b^{\left(r\right)}F_j\left(x_i\right)\right)^2}{\left(a^{\left(r\right)}\right)^2+\left(b^{\left(r\right)}F_j\left(x_i\right)\right)^2}\right),\end{align}$

and

$\begin{align} c_1^{(r)}& = -\frac{1}{N}\sum_{i = 1}^N\sum_{j = 1}^n\frac{1}{F_j(x_i)^2} \left((y_{ij}-F_j(x_i))^2-2\frac{(a^{(r)})^2(y_{ij}-F_j(x_i))^2}{(a^{(r)})^2+(b^{(r)})^2F_j(x_i)^2} \right.\end{align}$

$\begin{align} &\quad \left.+\left(\frac{(a^{(r)})^2(y_{ij}-F_j(x_i))}{(a^{(r)})^2+(b^{(r)})^2F_j(x_i)^2}\right)^2 +\frac{(a^{(r)}b^{(r)}F_j(x_i))^2}{(a^{(r)})^2+(b^{(r)}F_j(x_i))^2}\right).\end{align}$

Bringing the first three terms onto one denominator, this can be simplified to

$\begin{align} c_1^{\left(r\right)} & = -\frac{1}{N}\sum_{i = 1}^N\sum_{j = 1}^n\frac{\left(y_{ij}-F_j\left(x_i\right)\right)^2\left(b^{\left(r\right)}\right)^4F_j\left(x_i\right)^2}{\left(\left(a^{\left(r\right)}\right)^2 +\left(b^{\left(r\right)}F_j\left(x_i\right)\right)^2\right)^2}+\frac{\left(a^{\left(r\right)}b^{\left(r\right)}\right)^2}{\left(a^{\left(r\right)}\right)^2+\left(b^{\left(r\right)}F_j\left(x_i\right)\right)^2}.\end{align}$

Note that by definition $c_i^{(r)}$ , $i = 1,2$ are non-positive. Thus, $a^{(r+1)}$ and $b^{(r+1)}$ are given by

$\begin{align*} a^{\left(r+1\right)} = \mathop{\mathrm{arg\,max}}_{a\unicode{x2A7E} 0} A_2\left(a\right), \qquad b^{\left(r+1\right)} = \mathop{\mathrm{arg\,max}}_{b\unicode{x2A7E} 0} A_1\left(b\right).\end{align*}$

By setting the derivatives of A₁ and A₂ to zero, this is equivalent to

$\begin{align*} a^{\left(r+1\right)} = -\frac{c_2^{\left(r\right)}}{n},\qquad b^{\left(r+1\right)} = -\frac{c_1^{\left(r\right)}}{n}.\end{align*}$

Appendix B: Convergence plots of a and b

Here we showcase the convergence of the parameters a and b in the first scatterometry photo mask example. For one particular run the convergence of a and b is shown, see figure B1. Note that both methods seem to converge to the true a and b values, however the reverse KL trains more unstably, which might be tackable with more careful hyperparameter selection or other stabilization techniques.

Figure B1. Refer to the following caption and surrounding text. — **Figure B1.** Convergence plots for a and b where we save every 20 EM-steps.
Download figure:
Standard image High-resolution image

Appendix C: Sensitivity analysis of the forward model

To verify our results for the reconstruction of the line grating in 4.3.2, we make a sensitivity analysis of the corresponding forward model. For the sensitivity analysis we have to determine the Sobol'indices, which are coefficients of a decomposition of the forward model and describe the impact of each parameter combination on the forward model. Those indices are normalized to $[0,1]$ and sum up to 1 [61, 62]. Making an approximation of the forward model in a polynomial basis using Polynomial Chaos (PC) [21, 65] makes it very easy to calculate the Sobol'indices. The indices in figure C1 come from a PC-approximation with a relative L₂-error of about 0.076 and show the dependence on each single parameter. It is clearly seen, that the height of the grating line and the density of the oxide layer and hence the OC for silicon-oxide has a huge impact on the forward model. In general the sensitivity analysis fits very well to the reconstruction of the line grating, since in figure 3 the distributions for parameter 0 and 4 are very sharp defined, while those which are very broad distributed also show a low impact on the forward model. For the sensitivity analysis the source software tool PyThia was used [28].

Figure C1. Refer to the following caption and surrounding text. — **Figure C1.** Barplot of Sobol' indices for PC-expansion of the forward model.
Download figure:
Standard image High-resolution image

Dates