Bayesian Optimization of Hyperparameters From Noisy Marginal Likelihood Estimates
1 Department of Statistics, Stockholm University
2 Department of Computer and Information Science, Linköping University
3 Sveriges Riksbank
Abstract
1 Introduction
The trend in econometrics is to use increasingly flexible models that give a richer description of the economy, particularly for prediction purposes. As model complexity increases, the estimation problems get more involved, and computationally costly MCMC methods are often used to sample from the posterior distribution.
Most models involve a relatively small set of hyperparameters that need to be chosen by the user. For example, consider the steady-state BVAR model (Villani, 2009), which is widely used among practitioners and professional forecasters (Karlsson, 2013), and used in Section 5 for illustration. The choice of the prior distribution in BVARs is often reduced to the selection of a small set of hyperparameters; for example, the steady state is usually given a rather informative subjective prior. Other prior hyperparameters control the smoothness/shrinkage properties of the model and are less easy to specify subjectively.
Giannone et al. (2015) proposed to treat these hard-to-specify prior hyperparameters as un-
known parameters and explore the joint posterior of the hyperparameters, the VAR dynamics, and
the shock covariance matrix. This is a statistically elegant approach which works well when the
marginal likelihood is available in closed form and is easily evaluated. However, the marginal like-
lihood is rarely available in closed form. The BVARs with conjugate priors considered in Carriero et al. (2012) and Giannone et al. (2015) are an exception, but already the steady-state VAR needs
MCMC methods to evaluate the marginal likelihood. It is of course always an option to sample
the hyperparameters jointly with the other model parameters using Metropolis-Hastings (MH) or
Hamiltonian Monte Carlo (HMC), but this likely leads to inefficient samplers since the parameter spaces are high-dimensional and the posterior of the hyperparameters is often quite complex.
Most practitioners also seem to prefer to fix the hyperparameters before estimating the other
model parameters. Carriero et al. (2012) propose a brute-force optimization approach in which a simulation-based method has to be used to compute the marginal likelihood. Since the marginal
likelihood in Giannone et al. (2015) is available in closed form, one can readily optimize it using
a standard gradient based optimizer with automatic differentiation, but this is again restricted to
models with conjugate priors. The vast majority of applications instead use so-called conventional
values for the hyperparameters, dating back to Doan et al. (1984), which were found to be optimal
on a specific historical dataset but are likely to be suboptimal for other datasets. Hence, there is
a real need for a fast method for optimizing the marginal likelihood over a set of hyperparameters
when every evaluation of the marginal likelihood is a noisy estimate from a computationally costly simulation algorithm. We propose to use Bayesian optimization (BO), an iterative optimization method from machine learning. BO is particularly suitable for optimization of costly noisy functions in small to moderate
dimensional parameter spaces (Brochu et al., 2010 and Snoek et al., 2012) and is therefore well
suited for marginal likelihood optimization. The method treats the underlying objective function
as an unknown object that can be inferred by Bayesian inference by evaluating the function at
a finite number of points. A Gaussian process prior expresses Bayesian prior beliefs about the
underlying function, often just containing the information that the function is believed to have
a certain smoothness. Bayes theorem is then used to sequentially update the Gaussian process
posterior after each new function evaluation. Bayesian optimization uses the most recently updated
posterior of the function to decide where to optimally place the next function evaluation. This
so-called acquisition strategy is a trade-off between: i) exploiting the available knowledge about
the function to improve the current maximum and ii) exploring the function to reduce the posterior uncertainty.
Our paper proposes a novel framework for Bayesian optimization when the user can control the
precision and computational cost of each function evaluation. The framework is quite general, but we
focus mainly on the situation when the noisy objective function is a marginal likelihood computed by
MCMC. This is a very common situation in econometrics using, for example, the estimators in Chib
(1995), Chib and Jeliazkov (2001), and Geweke (1999). The precision of the marginal likelihood
estimate at each evaluation point is then implicitly chosen by the user via the number of MCMC
iterations. This makes it possible to use occasional cheap noisy evaluations of the marginal likelihood
to quickly explore the marginal likelihood over hyperparameter space during the optimization. Our
proposed acquisition strategy can be seen as jointly deciding not only where to place the new evaluation but
also how much computational effort to spend in obtaining the estimate. We implement this strategy
by a stopping rule for the MCMC sampling combined with an auxiliary prediction model for the
computational effort at any new evaluation point; the auxiliary prediction model is learned during the optimization.
We apply the method to the steady-state BVAR (Villani, 2009) and the time-varying parameter
BVAR with stochastic volatility (Chan and Eisenstat, 2018) and demonstrate that the new acquisi-
tion strategy finds the optimal hyperparameters faster than traditionally used acquisition functions.
It is also substantially faster than a grid search and finds a better optimum.
The outline of the paper is as follows. Section 2 introduces the problem of inferring hyperparameters from an estimated marginal likelihood. Section 3 gives the necessary background on Gaussian
processes and Bayesian optimization and introduces our new Bayesian optimization framework. Sec-
tion 4 illustrates and evaluates the proposed algorithm in a simulation study. Sections 5 and 6 assess
the performance of the proposed algorithm in empirical examples, and the final section concludes.
2 Inferring hyperparameters from an estimated marginal likelihood

To introduce the problem of using an estimated marginal likelihood for learning a model's hyperparameters, we consider the selection of hyperparameters in the popular class of Bayesian vector autoregressions,

y_t = \sum_{k=1}^{K} \Pi_k y_{t-k} + \varepsilon_t, \qquad (1)
where {εt }Tt=1 are iid N (0, Σ). A simplified version of the Minnesota prior (see e.g. Karlsson (2013))
with
\Sigma \sim IW(S, \nu), \quad \mathrm{vec}(\Pi)\,|\,\Sigma \sim N\big(\mathrm{vec}(\Pi_0), \Sigma \otimes \Omega_\Pi\big), \qquad (3)

where vec(Π₀) and Ω_Π denote the prior mean and covariance of the coefficient matrix, and S is the prior scale matrix with prior degrees of freedom ν. The diagonal elements of Ω_Π are given by

\omega_{ii} = \frac{\lambda_1^2}{(l^{\lambda_3} s_r)^2}, \quad \text{for lag } l \text{ of variable } r, \; i = (l-1)p + r, \qquad (4)
where λ₁ controls the overall shrinkage and λ₃ the lag-decay shrinkage, both set by the user, and s_r denotes the estimated standard deviation of variable r. The fact that we do not use the additional cross-equation shrinkage hyperparameter, λ₂, makes this prior conjugate to the VAR likelihood, a fact that will be important in the following. It has been common practice to use standard values that date back to Doan et al. (1984), but there has been a renewed interest in finding values that are optimal for the given application (see e.g. Bańbura et al. (2010), Carriero et al. (2012) and Giannone et al. (2015)).
Two main approaches have been proposed. First, Giannone et al. (2015) proposed to sample from the joint posterior via the decomposition p(β, θ|y_{1:T}) = p(β|θ, y_{1:T}) p(θ|y_{1:T}), where β = (Π, Σ) and p(θ|y_{1:T}) is the marginal posterior distribution of the hyperparameters. The
algorithm samples from p(θ|y1:T ) using Metropolis-Hastings (MH) and then samples directly from
p(β|θ, y1:T ) for each θ draw by drawing Π and Σ from the Normal-Inverse Wishart distribution.
There are some limitations to using this approach. First, p(θ|y1:T ) can be multimodal (see e.g. the application in Section 6) and it can be hard to find a good MH proposal density, making the sampler inefficient. Second, many practitioners take a more pragmatic approach to model selection and want to determine a fixed value for θ once and for all early in the model building process.
Carriero et al. (2012) propose an exhaustive grid search to find the θ that maximizes p(θ|y1:T ) and then use that optimal θ throughout the remaining analysis. The obvious drawback here is that
a grid search is very costly, especially if we have non-conjugate priors and more than a couple of
hyperparameters.
A problem with both the approach in Giannone et al. (2015) and Carriero et al. (2012) is that
for most interesting models p(θ|y1:T ) is not available in closed form. Even the Minnesota prior
with cross-equation shrinkage is no longer a conjugate prior and p(θ|y1:T ) is intractable. In fact,
most Bayesian models used in practice have intractable p(θ|y1:T ), including the steady-state BVAR
(Villani, 2009) and the TVP-SV BVAR (Primiceri, 2005) used in Sections 5 and 6.
When p(θ|y1:T ) is intractable, MCMC or other simulation based methods like Sequential Monte
Carlo (SMC) are typically used to obtain a noisy estimate of p(θ|y_{1:T}). Since p(θ|y_{1:T}) ∝ p(y_{1:T}|θ) p(θ), where p(y_{1:T}|θ) = \int p(y_{1:T}|θ, β) p(β|θ)\, dβ is the marginal likelihood, this problem goes under the heading of (log) marginal likelihood estimation. We will therefore frame our method as maximizing the log marginal likelihood; maximization of p(θ|y_{1:T}) is achieved by simply adding the log prior. We use the method of Chib (1995) to estimate the log marginal likelihood when the posterior can be obtained via Gibbs sampling, which is the case for many econometric models. We refer to Chib (1995) for details about the marginal likelihood estimator.
Estimating the marginal likelihood for the TVP-SV BVAR is more challenging and we will adopt
the method suggested by Chan and Eisenstat (2018). The approach consists of four steps: 1) obtain a posterior sample via Gibbs sampling, 2) integrate out the time-varying VAR coefficients analytically, 3) integrate out the stochastic volatility using importance sampling to obtain the integrated likelihood, and 4) integrate out the static parameters using another importance sampler. Since the algorithm makes use of two nested importance samplers, it is a special case of importance sampling squared (IS²).
There are many alternative estimators that can be used in our approach, for example, the
extension of Chib’s estimator to Metropolis-Hastings sampling (Chib and Jeliazkov, 2001), and
estimators based on importance sampling (Geweke, 1999) or Sequential Monte Carlo (Doucet et al.,
2001).
All simulation-based estimators: i) give noisy evaluations of the marginal likelihood, ii) are time-consuming, and iii) have a precision that is controlled by the user in terms of the number of MCMC
or importance sampling draws. The next section explains how traditional Bayesian optimization is
well suited for points i) and ii), but lacks a mechanism for exploiting point iii). Taking the user
controlled precision into account brings a new perspective to the problem, and we propose a class of acquisition strategies that exploit it.

3 Gaussian processes and Bayesian optimization

A Gaussian process (GP) is a (possibly infinite) collection of random variables such that any
subset is jointly distributed according to a multivariate normal distribution, see e.g. Williams
and Rasmussen (2006). This process, denoted by f(x) ∼ GP(µ(x), k(x, x′)), can be seen as a probability distribution over functions f : X → R that is completely specified by its mean function, µ(x) ≡ E f(x), and its covariance function, Cov(f(x), f(x′)) ≡ k(x, x′), where x and x′ are two arbitrary input values to f(·). Note that the covariance function specifies the covariance between any two function values, f(x₁) and f(x₂). A popular covariance function is the squared exponential (SE):
k(x, x') = \sigma_f^2 \exp\left(-\frac{|x - x'|^2}{2\ell^2}\right), \qquad (7)
where |x − x′| is the Euclidean distance between the two inputs; the covariance function is specified by its two kernel hyperparameters, the scale parameter σ_f > 0 and the length scale ℓ > 0. The scale parameter σ_f governs the variability of the function and the length scale determines how fast the correlation between two function values tapers off with the distance |x − x′|, see Figure 1. The joint normality on R^N allows for the convenient conditioning and marginalization properties of the multivariate normal distribution. In particular, this makes it easy to compute the posterior distribution of f conditional on a set of function evaluations.
An increasingly popular alternative to the squared exponential kernel is the Matérn kernel, see
Figure 1: Illustration of two Gaussian processes with squared exponential kernel with different length scales and the
same variance σf2 = 0.252 . The figure shows the prior mean (dashed line) and 95% probability intervals (shaded) and
five realizations from each process. A smaller length scale gives more wiggly realizations.
e.g. Matérn (1960) and Williams and Rasmussen (2006). The Matérn kernel has an additional hyperparameter, ν > 0, in addition to the length scale ℓ and scale σ_f, such that the process is k times
mean square differentiable if and only if ν > k. Hence, ν controls the smoothness of the process
and it can be shown that the Matérn kernel approaches the SE kernel as ν → ∞ (Williams and
Rasmussen (2006)). Our approach is directly applicable to any valid kernel function, but we will use the Matérn kernel with ν = 5/2,

k_{\nu=5/2}(r) = \sigma_f^2 \left(1 + \frac{\sqrt{5}\,r}{\ell} + \frac{5r^2}{3\ell^2}\right) \exp\left(-\frac{\sqrt{5}\,r}{\ell}\right), \qquad (8)
where r = |x − x′|. The Matérn 5/2 kernel has two continuous derivatives, which is often a requirement for Newton-type optimizers (Snoek et al., 2012). The kernel hyperparameters, σ_f and ℓ, are found by maximizing the marginal likelihood, see e.g. Williams and Rasmussen (2006).
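For concreteness, the Matérn 5/2 kernel in (8) can be written in a few lines of Python. This is an illustrative sketch, not the paper's code; we write σ_f² for the scale, consistent with the variance shown in Figure 1:

```python
import numpy as np

def matern52(x, xprime, sigma_f, ell):
    """Matern nu=5/2: sigma_f^2 (1 + sqrt(5) r/ell + 5 r^2/(3 ell^2)) exp(-sqrt(5) r/ell)."""
    r = np.abs(np.asarray(x, dtype=float) - xprime)  # distance between the inputs
    a = np.sqrt(5.0) * r / ell                       # note: a^2 / 3 = 5 r^2 / (3 ell^2)
    return sigma_f**2 * (1.0 + a + a**2 / 3.0) * np.exp(-a)
```

As with the SE kernel, a smaller length scale ℓ makes the correlation between two function values decay faster with their distance.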
Consider function evaluations from the model

y_i = f(x_i) + \varepsilon_i, \quad \varepsilon_i \overset{iid}{\sim} N(0, \sigma^2), \quad i = 1, \ldots, n, \qquad (9)

and the prior f(x) ∼ GP(0, k(x, x′)). Given a dataset with n observations, the posterior distribution
of f(x*) at a new input x* is (Williams and Rasmussen, 2006)

f(x^\star)\,|\,\mathbf{y} \sim N\big(m(x^\star), s^2(x^\star)\big), \quad m(x^\star) = k(x^\star)^\top \big(K(X,X) + \sigma^2 I_n\big)^{-1}\mathbf{y}, \quad s^2(x^\star) = k(x^\star, x^\star) - k(x^\star)^\top \big(K(X,X) + \sigma^2 I_n\big)^{-1} k(x^\star), \qquad (10)

where y = (y₁, …, y_n)⊤, k(x*) is the n-vector with covariances between f at the test point x* and all other training inputs, and K(X, X) is the n × n matrix with covariances among the function values at all n training inputs in X. When the errors are heteroscedastic with variance σ_i² for the ith observation, σ²I_n is simply replaced by the diagonal matrix diag(σ₁², …, σ_n²).
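The posterior in (10), with the heteroscedastic extension, can be sketched in a few lines (illustrative Python, not the paper's Julia implementation; `kernel` stands for any valid covariance function):

```python
import numpy as np

def gp_posterior(xstar, X, y, noise_var, kernel):
    """Posterior mean and variance of f(x*) given noisy evaluations y at inputs X.

    noise_var: per-observation noise variances (a constant vector in the
    homoscedastic case); kernel: callable k(x, x') for a pair of scalar inputs.
    """
    n = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    Ky = K + np.diag(noise_var)           # sigma^2 I replaced by diag(sigma_i^2)
    kstar = np.array([kernel(xstar, X[i]) for i in range(n)])
    m = kstar @ np.linalg.solve(Ky, y)                              # posterior mean m(x*)
    s2 = kernel(xstar, xstar) - kstar @ np.linalg.solve(Ky, kstar)  # posterior variance s^2(x*)
    return m, s2
```

With very small noise variances the posterior mean interpolates the observations; larger variances pull it back toward the prior mean of zero.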
Bayesian optimization (BO) is an iterative optimization method that selects new evaluation points
using the posterior distribution of f conditional on the previous function evaluations. More specifi-
cally, BO uses an acquisition function, a(x), to select the next evaluation point (Brochu et al., 2010).
An intuitively sensible acquisition rule is to select a new evaluation point that maximizes the
probability of obtaining a higher function value than the current maximum, i.e. the Probability of
Improvement (PI ):
\mathrm{PI}(x) \equiv \Pr(f(x) > f_{\max}) = \Phi\left(\frac{m(x) - f_{\max}}{s(x)}\right), \qquad (11)
where fmax is the maximum value of the function obtained so far. The functions m(x) and s(x) are
the posterior mean and standard deviation of the estimated Gaussian process for f in the point x,
conditional on the available function evaluations (see (10)), and Φ denotes the cumulative standard
normal distribution.
The Expected Improvement (EI) also takes the size of the improvement into consideration:
\mathrm{EI}(x) = \big(m(x) - f_{\max}\big)\,\Phi\left(\frac{m(x) - f_{\max}}{s(x)}\right) + s(x)\,\phi\left(\frac{m(x) - f_{\max}}{s(x)}\right), \qquad (12)
where φ denotes the density function of the standard normal distribution. The first part of (12)
is associated with the size of our predicted improvement and the second part is related to the uncertainty of our function in that area. Thus, EI incorporates the trade-off between high expected
improvement (exploitation) and learning more about the underlying function (exploration). Opti-
mization of the acquisition function, to find the next evaluation point, is typically a fairly easy task
since it is noise-free and cheap to evaluate; however, it can have multiple optima, so some care has to be taken, for example by restarting the optimizer with different starting values over the hyperparameter surface. In this paper we use particle swarm optimization.
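The acquisition functions (11) and (12) are straightforward to evaluate from the GP posterior; a minimal Python sketch:

```python
import math

def normal_cdf(u):
    """Standard normal CDF, Phi(u)."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def normal_pdf(u):
    """Standard normal density, phi(u)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def probability_of_improvement(m, s, fmax):
    """PI in (11): Pr(f(x) > fmax) under the GP posterior N(m, s^2)."""
    return normal_cdf((m - fmax) / s)

def expected_improvement(m, s, fmax):
    """EI in (12): exploitation term plus exploration term."""
    u = (m - fmax) / s
    return (m - fmax) * normal_cdf(u) + s * normal_pdf(u)
```

Note that EI is strictly positive whenever s > 0, so even a point whose posterior mean lies below f_max can be worth evaluating if it is uncertain enough.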
The exploitation-exploration trade-off is illustrated in Figure 2, where the blue line shows the true
objective function, the black line denotes the posterior mean of the GP, and the blue-shaded regions
are 95% posterior probability bands for f . The black (small) dots are past function evaluations, and
the violet (large) dot is the current evaluation. The first and third row in Figure 2 show the true
objective function and the learned GP at four different iterations of the algorithm while the second
and fourth row show the EI acquisition function corresponding to the row immediately above. At
Iteration 2 in the top-left corner, we see that the EI acquisition function (second row) indicates
that there is a high expected improvement by moving to either the immediate left or right of the
current evaluation. At Iteration 5 the EI strategy suggests three regions worth evaluating, where
the two leftmost regions are candidates because of their high uncertainty. After seven iterations,
the algorithm is close to the global maximum and will now continue a narrow search for the exact location of the maximum.
Figure 2: Bayesian optimization illustrated with the expected improvement acquisition strategy. The graphs in Rows 1 and 3 depict the posterior distribution of f and the evaluation points (current evaluation in violet). Rows 2 and 4 show the corresponding acquisition functions.
Acquisition rules like PI or EI do not consider that different evaluation points can be more or less costly. To introduce the notion of cost into the acquisition strategy, Snoek et al. (2012) proposed the expected improvement per second, EIS(x) = EI(x)/c(x), where c(x) is a duration function that measures the evaluation time at input x in seconds. More generally, we can define a(x)/c(x) as an effort-aware acquisition function. The duration function is typically unknown, and Snoek et al. (2012) proposed to estimate it alongside f using an additional Gaussian process for log c(x).
EIS assumes that the duration (or cost) of a function evaluation is unknown, but fixed for a
given input x; once we visit x, the cost of the function estimate fˆ(x) is given. However, the user
can often choose the duration spent to obtain a certain precision in the estimate; for example by
increasing the number of MCMC iterations when the marginal likelihood is estimated by MCMC.
This novel perspective opens up for strategies that not only optimize for the next evaluation point,
but also optimize over the computational resources, or equivalently, the precision of the estimate
fˆ(x). We formally extend BO by modeling the function evaluations with a heteroscedastic GP,

\hat{f}(x) = f(x) + \varepsilon, \quad \varepsilon \sim N\big(0, \sigma^2(x, G)\big), \qquad (13)

where the noise variance σ²(x, G) is now an explicit function of the number of MCMC iterations, G, or some other duration measure. Hence the user can now choose both where to place the next evaluation and how precisely to estimate it, by maximizing an effort-aware acquisition function ã(x, G) with respect to both x and G, where a(x) is a baseline acquisition function, for example EI.
A complication with maximization of ã(x, G) is that while we typically know that σ(x, G) = O(1/√G) in Monte Carlo or MCMC, the exact numerical standard error depends on the integrated
autocorrelation time (IACT) of the MCMC chain. Note that the evaluation points can, for example,
be hyperparameters in the prior, where different values can give rise to varying degrees of well-
behaved posteriors, so we can not expect the IACT to be constant over the hyperparameter space,
hence the explicit dependence on x in σ 2 (x, G). Rather than maximizing ã(x, G) with respect to
both x and G directly, we propose to implement the algorithm in an alternative way that achieves a
similar effect. The approach includes stopping the evaluation early whenever the function evaluation
turns out to be hopelessly low, with a low probability of improvement over the current f_max. Concretely, the MCMC sampling at the current point x continues until

\Phi\left(\frac{\hat{m}^{(g)}(x) - f_{\max}}{s^{(g)}(x)}\right) < \alpha

for some small value, α, or until G reaches a predetermined upper bound, Ḡ; here m̂^{(g)}(x) and s^{(g)}(x)
denote the posterior mean and standard deviation of the GP evaluated at x after g MCMC iterations. Note that both the posterior mean m(x) and standard deviation s(x) are functions of the
noise variance, which in turn is a function of G. The posterior distribution for f (x) is hence con-
tinuously updated as G grows until 1 − α of the posterior mass in the GP for f (x) is concentrated
below fmax , at which point the evaluation stops. The optimization is insensitive to the choice of α,
as long as it is a relatively small number. We now propose to maximize the following acquisition function

\tilde{a}_\alpha(x) = \frac{a(x)}{\hat{G}_\alpha(x)}, \qquad (16)

where Ĝ_α(x) is a prediction of the number of MCMC draws needed at x before the evaluation stops,
with the probability α as the threshold for stopping. We emphasize that early stopping is here
used in a subtle way, not only as a simple rule to short-circuit useless computations, but also in the
planning of future computations; the mere possibility of early stopping can make the algorithm try
an x which does not have the highest a(x), but which is expected to be cheap and is therefore worth
a try. This effect, which comes via σ²(x, G), is not present in the EIS of Snoek et al. (2012), where the cost of evaluating at a given x is fixed.
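The early-stopping rule can be sketched as a loop that adds MCMC draws in batches and stops once the probability of improvement falls below α. In the Python sketch below, `run_batch` and `gp_update` are hypothetical placeholders for the marginal likelihood estimator and the GP refit; the default sizes are illustrative:

```python
import math

def std_normal_cdf(u):
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))

def evaluate_with_early_stopping(x, fmax, run_batch, gp_update,
                                 alpha=0.001, g_init=1100, batch=100, g_max=10000):
    """Sample until Pr(f(x) > fmax) < alpha under the GP posterior, or g = g_max.

    run_batch(x, g): draw g more MCMC samples, return the current estimate and
    its standard error (placeholder). gp_update(x, est, se): refit the GP and
    return its posterior mean and standard deviation at x (placeholder).
    """
    g = g_init
    est, se = run_batch(x, g)
    m, s = gp_update(x, est, se)
    while g < g_max and std_normal_cdf((m - fmax) / s) >= alpha:
        est, se = run_batch(x, batch)        # another batch of draws
        g += batch
        m, s = gp_update(x, est, se)
    return est, g
```

When the point is hopeless the loop exits after the first check at g = g_init; when it remains competitive the evaluation runs to the full g_max, producing the variable per-evaluation cost that the acquisition function in (16) anticipates.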
Although one can use any model to predict G, we will here fit a GP regression model to the logarithm of the number of MCMC draws, log G_j, for j = 1, …, J, in the J previous evaluations:

\log G_j = h(z_j) + \varepsilon_j, \quad \varepsilon_j \overset{iid}{\sim} N(0, \psi^2),
where zj is a vector of covariates. The hyperparameters, x1:J , themselves may be used as predictors
of Ĝ(x), but also D(x) = m̂(x) − fmax and s(x) are likely to have predictive power for G, as well as
u(x) = (m̂(x) − f_max)/s(x). We will use z_j = (x_j, D^{(j)}(x_j), s^{(j)}(x_j), u^{(j)}(x_j)) in our applications, where the superscript (j) denotes the BO iteration. The prediction for G is taken to be Ĝ =
exp (mG (z)), which corresponds to the median of the log-normal posterior for G.
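Any regression model can stand in for the auxiliary GP here. The Python sketch below uses ordinary least squares on log G_j, a simple hypothetical stand-in that still delivers the median prediction Ĝ = exp(m_G(z)):

```python
import numpy as np

def fit_log_effort(Z, G):
    """Fit log G_j = h(z_j) + eps_j by linear least squares (GP stand-in)."""
    A = np.column_stack([np.ones(len(G)), np.asarray(Z, dtype=float)])
    coef, *_ = np.linalg.lstsq(A, np.log(G), rcond=None)
    return coef

def predict_effort(coef, z):
    """Median of the implied log-normal predictive: G_hat = exp(m_G(z))."""
    return float(np.exp(coef[0] + np.asarray(z, dtype=float) @ coef[1:]))
```

In the paper's setting, z would collect x, D(x), s(x) and u(x) from the current GP fit, as described above.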
We will use the term Bayesian Optimization with Optimized Precision (BOOP) for BO methods that optimize ã_α(x) in (16), and more specifically BOOP-EI when EI is used as the baseline acquisition function.
Algorithm 1 Bayesian Optimization with Optimized Precision (BOOP)
input
• estimator fˆ(x) of the f(x) to be maximized, and its standard error function σ(G).
• j₀ initial points x_{1:j₀} ≡ (x₁, …, x_{j₀}), a vector of corresponding function estimates, fˆ(x_{1:j₀}), and standard errors σ²(G_{1:j₀}).
for j = j₀ + 1, j₀ + 2, … until the computational budget is exhausted:
a) Fit the heteroscedastic GP in (13) to the function evaluations collected so far.
b) Fit the GP regression in (17) to the effort data (z_{1:j−1}, log G_{1:j−1}), where the elements of z are functions of x. Return the point prediction Ĝ_α(x).
c) Find x_j by maximizing the acquisition function ã_α(x) in (16).
d) Compute the estimate fˆ(x_j) with early stopping at probability threshold α.
e) Update the datasets in a) with (x_j, fˆ(x_j), σ²(G_j)) and in b) with (z_j, log G_j).
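The steps of one iteration can be glued together schematically. Every callable in the Python sketch below is a placeholder for the corresponding component of the paper's Julia implementation, and the candidate-set maximization stands in for the particle swarm step:

```python
def boop_step(evals, effort_data, candidates, gp_fit, effort_fit, ei, evaluate):
    """One BOOP iteration (schematic).

    evals: list of (x, f_hat, noise_var) triples; effort_data: list of (z, log_G).
    gp_fit, effort_fit, ei, evaluate are user-supplied callables.
    """
    post = gp_fit(evals)                       # a) GP for f from the noisy evaluations
    g_hat = effort_fit(effort_data)            # b) auxiliary effort model, returns G_hat(x)
    # c) maximize the effort-aware acquisition a(x) / G_hat(x) over the candidates
    x_next = max(candidates, key=lambda x: ei(post, x) / g_hat(x))
    f_hat, noise_var, z, log_g = evaluate(x_next)   # d) estimate with early stopping
    evals.append((x_next, f_hat, noise_var))        # e) update both datasets
    effort_data.append((z, log_g))
    return x_next
```

With a flat acquisition and a cost function that increases in x, the step picks the cheapest candidate, which is exactly the effect of dividing by Ĝ_α(x).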
Note that (13) assumes that fˆ(x) is an unbiased estimator at any x. This can be ensured by using
enough MCMC/importance sampling draws in the first marginal likelihood evaluation of BOOP. We
performed a small simulation exercise that shows that the Chib estimator is approximately unbiased
after a small number of iterations for the medium-sized VAR model in Section 5.4. As expected, we had to
use more initial draws in the large-scale VAR in Section 5.5 and the time-varying parameter VAR
with stochastic volatility in Section 6 to drive down the bias. See also Section 7 for some ideas on
how to extend BOOP to estimators where the bias is still sizeable in large MCMC samples.
Figure 3 illustrates the early stopping part of BOOP in a toy example. The first row illustrates
the first BOOP iteration and the columns show increasingly larger MCMC sample sizes (G). We
can see that the 95% posterior interval after G = 10 MCMC draws at the current x includes f_max^{(1)} (dotted orange line), the highest posterior mean of the function values observed so far; it is therefore worthwhile to increase the number of simulations for this x. Moving one graph to the right we see that after G = 20 simulations the 95% posterior interval still includes f_max^{(1)}, and we move one more
graph to the right for G = 50. Here we conclude that the sampled point is almost certainly not
an improvement and we move on to a new evaluation point. The new evaluation point is found by
maximizing the BOOP-EI acquisition function in (16) with updated effort prediction function Ĝ(z)
in Equation 17, and is depicted by the violet dot in the leftmost graph in the second row of Figure 3.
Following the progress in the second row, we see that it takes only G = 20 samples to conclude
that the function value is almost certainly lower than the current maximum of the posterior mean
at the second BO iteration. Finally, in the third row, we can see that the point is sampled with
high variance at the beginning, but as we increase G it becomes clear that this x is indeed an
improvement.
The code used in this paper is written in the Julia computing language (Bezanson et al., 2017),
making use of the GaussianProcesses.jl package for estimation of the Gaussian processes and the
Figure 3: Illustration of BOOP-EI implemented with early stopping. The rows correspond to iterations in the
algorithm and the columns to different MCMC sample sizes. The blue line is the true f , the shaded regions are 95%
posterior probability bands for f based on the (noisy) evaluations (black for past and violet for current). The orange
crosshair marks the current maximum; see the text for details.
4 Simulation experiment
We will here assess the performance of the proposed BOOP-EI algorithm in a simulation experiment
for the optimization of a non-linear function in a single variable. The simple one-dimensional setup
is chosen for presentational purposes, and more challenging higher-dimensional settings are likely to
show even larger advantages of our method compared to regular Bayesian optimization.
where N (x|µ, σ 2 ) denotes the density function of a N (µ, σ 2 ) variable; the function is plotted in
Figure 4 (left). We further assume that f (x) can be estimated at any x from G noisy evaluations
\hat{f}^{(G)}(x) = f(x) + \epsilon, \quad \epsilon \sim N\left(0, \frac{g^2(x)}{G}\right), \qquad (19)
where g(x) = 0.1 + 0.15N (x|1, 0.52 ) + 0.15N (x|2.5, 0.252 ) + 0.5N (x|5, 0.752 ) is a heteroscedastic
standard deviation function that mimics the real world case where the variability of the marginal
likelihood estimate varies over the space of hyperparameters; g(x) is plotted in Figure 4 (middle).
Figure 4 (right) illustrates the noisy function evaluations (gray points) and the effect of Monte Carlo
averaging G = 3 evaluations for a given x (blue points). We assume for simplicity here that once
the algorithm decides to visit an x it will get access to a noise-free evaluation of the standard error
of the estimator.
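The noisy estimator in (19) is easy to simulate. The Python sketch below implements g(x) exactly as stated; f itself (a mixture of normal densities) is passed in as a callable, since its exact specification is given earlier:

```python
import math, random

def normal_pdf(x, mu, sigma):
    """Density N(x | mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def g(x):
    """Heteroscedastic standard deviation from the experiment:
    g(x) = 0.1 + 0.15 N(x|1, 0.5^2) + 0.15 N(x|2.5, 0.25^2) + 0.5 N(x|5, 0.75^2)."""
    return (0.1 + 0.15 * normal_pdf(x, 1.0, 0.5)
                + 0.15 * normal_pdf(x, 2.5, 0.25)
                + 0.5 * normal_pdf(x, 5.0, 0.75))

def noisy_estimate(f, x, G, rng=random):
    """One draw of f_hat^(G)(x) = f(x) + eps, eps ~ N(0, g(x)^2 / G)."""
    return f(x) + rng.gauss(0.0, g(x) / math.sqrt(G))
```

As G grows, the standard deviation of the estimate shrinks at the usual 1/√G Monte Carlo rate, mimicking averaging G noisy evaluations.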
Figure 4: The function f that we want to maximize (left) and the function g that controls the sampling variance over
x (middle). The figure to the right shows estimates for G = 1 samples (grey dots), G = 3 samples (blue dots), the
mean function (red line) and 2 standard deviation error bands (pink lines).
Figure 5: GP posterior after 25 Bayesian optimization iterations using BO-EI (left) and BOOP-EI (right) to optimize
f . The black lines are the posterior means of f (x).
Before we conduct a small Monte Carlo experiment it is illustrative to look at the results from
a single run of the EI and BOOP-EI algorithms. Figure 5 highlights the difference between the
algorithms by showing the GP posterior after 25 Bayesian optimization runs. The EI algorithm
has clearly wasted computations to needlessly drive down the uncertainty at (useless) low function values.
We now compare the performance of a standard EI approach using a heteroscedastic GP with our
BOOP-EI approach in a small simulation study. The methods will be judged by their ability to find
the optimum using as few evaluations as possible. The performance will be evaluated by a Monte
Carlo study where we simulate 1000 replications from each model under each simulation scenario.
We investigate Bayesian optimization with EI using 100 and 500 samples in each iteration; BOOP-EI is allowed to stop the sampling at any time before the number of samples for the EI is reached.
Figure 6: The evolution of fmax − max f (x), i.e. the difference between the current maximum fmax and the true
maximum of the function (vertical axis), as a function of the total number of the MCMC draws consumed up to the
current BO/BOOP iteration. Note that the number of MCMC draws per BOOP iteration is variable due to early
stopping. The lefthand graph uses a maximum of 100 MCMC draws per BO iteration while the righthand graph uses
500 draws. The shaded areas are the one standard deviation probability bands in the distribution of fmax − max f (x)
over the replicate runs, and the solid and dashed lines are the mean and median, respectively.
We can see from Figure 6 that BOOP finds the maximum using a smaller total number of samples
in both scenarios and that the difference increases with the number of samples used when forming
the estimator. We can also see from the median fmax that both algorithms can get stuck for a while
at the second-highest local maximum (which is approximately 0.05 lower than the global maximum).
However, BOOP gets out of the local optimum faster since it has the option to try cheap noisy
evaluations and will therefore explore other parts of the function earlier than basic BO-EI. This
effect seems to increase as we allow for a higher number of samples since it lowers the relative price
of cheaper evaluations.
5 Empirical application: the steady-state BVAR

In this section, we use BOOP-EI to estimate the prior hyperparameters of the steady-state BVAR
of Villani (2009). Giannone et al. (2015) show that finding the right values for the hyperparameters
in BVARs can significantly improve forecasting performance. Moreover, Bańbura et al. (2010) show
that different degrees of shrinkage (controlled by the hyperparameters) are necessary under different
model specifications.
5.1 The steady-state BVAR
\Pi(L)(y_t - \Psi x_t) = \varepsilon_t, \quad \varepsilon_t \overset{iid}{\sim} N(0, \Sigma), \qquad (20)
where E[yt ] = Ψxt . In particular, if we assume that xt = 1 for all t, then Ψ has the interpretation
as the overall mean of the process. We take the prior distribution to be:
p(\Sigma) \propto |\Sigma|^{-(n+1)/2}, \quad \mathrm{vec}(\Pi) \sim N(\theta_\Pi, \Omega_\Pi), \quad \Psi \sim N(\theta_\Psi, \Omega_\Psi), \qquad (21)
where θ_Ψ and Ω_Ψ are the prior mean and covariance matrix for the steady states. The prior covariance Ω_Π is specified as in (4), where ω_ii denotes its diagonal elements. We also assume prior independence, following Villani (2009). The hyperparameters that we optimize over are the overall-shrinkage parameter θ₁ and two further shrinkage hyperparameters, θ₂ and θ₃.
The posterior distribution of the steady-state BVAR model parameters can be sampled with a
simple Gibbs sampling scheme (Villani, 2009). The marginal likelihood, together with its empirical standard error, is estimated with the method of Chib (1995).
Table 1 describes the data used in our applications which are also used in Giannone et al. (2015). It
contains 23 macroeconomic variables for which two subsets are selected to represent a medium-sized
model with 7 variables and a large model that contains 22 of the variables (real investment is ex-
cluded). Before the analysis, the consumer price index and the five-year bond rate were transformed
from monthly to quarterly frequency. All series are transformed such that they become stationary
according to the augmented Dickey-Fuller test. This is necessary for the data to be consistent with
the prior assumption of a steady-state. The number of lags is chosen according to the HQ-criteria,
Hannan and Quinn (1979) and Quinn (1980). This resulted in p = 2 lags for the medium-sized
We set the prior mean of the coefficient matrix, Π, to values that reflect some persistence on the
first lag, but also that all the time series are stationary; e.g. the prior mean on the first lag of the
FED interest rate and the GDP-deflator is set to 0.6, while others are set to zero in the medium-sized
model. Lags longer than 1 and cross-lags all have zero prior means. The priors for the steady-states
are set informatively, centered on the values listed in Table 1; these values follow suggestions from the literature
for most variables, see e.g. Louzis (2019) and Österholm (2012). There were a few variables where
we could not find theoretical values for either the mean or the standard deviation, in those cases,
Table 1: Data Description
We consider three competing optimization strategies: (I) an exhaustive grid-search, (II) Bayesian
optimization with the EI acquisition function (BO-EI), and (III) our BOOP-EI algorithm. In each
approach, we use the restrictions θ1 ∈ (0, 5), θ2 ∈ (0, 1), and θ3 ∈ (0, 5). In the grid-search, θ1 and θ2
move in steps of 0.05 and θ3 moves in steps of 0.1, yielding in total 100,000 marginal likelihood
evaluations. For the Bayesian optimization algorithm, we set the number of evaluations to 250, and
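The grid-search budget can be verified with a short sketch (the exact endpoint handling of the open intervals is our assumption, chosen so that the grid sizes multiply to 100,000):

```python
import numpy as np

# Discretize each open interval starting from its first step.
theta1 = np.arange(0.05, 5.0 + 1e-9, 0.05)  # 100 values, step 0.05
theta2 = np.arange(0.05, 1.0 + 1e-9, 0.05)  # 20 values, step 0.05
theta3 = np.arange(0.10, 5.0 + 1e-9, 0.10)  # 50 values, step 0.1

# All combinations of the three hyperparameters, one row per grid point.
grid = np.array(np.meshgrid(theta1, theta2, theta3)).T.reshape(-1, 3)
print(len(grid))  # 100000 marginal likelihood evaluations
```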
For strategies (I) and (II) we use a total of 10000 Gibbs iterations with 1000 as a burn-in sample
in each model evaluation. For (III) we first draw 1100 Gibbs samples where we discard the first
1000 as burn-in and use the rest to calculate the probability of improvement PI, to ensure that the
estimated marginal likelihood will be approximately unbiased; Figure 7 shows that Chib’s estimate
is unbiased already after a few hundred samples. If PI < α we stop early and move on to the next
BO iteration; otherwise we generate a new batch (of size 100) of Gibbs samples and check the
PI criterion again. The total number of Gibbs iterations will therefore vary between 1100 and 10,000 in
each of the 250 BOOP-iterations for the medium-sized model. Note that Chib’s estimator uses an
estimate of the parameter for the so-called reduced Gibbs sampler run. This point estimate should
preferably have a high posterior density for efficiency reasons, see Chib (1995). The medium-sized
model uses only 100 posterior samples to obtain high-density parameters for calculating Chib’s log
marginal likelihood, which is enough in our set-up. For the large model, we use 5000 burn-in samples
and 500 simulations in the first batch, which gives between 5500 and 10000 MCMC iterations per
evaluation point. The 5000 burn-in is likely to be excessive but is used to be conservative; a small
number would make BOOP even faster in comparison to regular BO. The application is robust to
the choice of α as long as it is a reasonably small number; in this study we use α = 0.001.
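As a sketch, the batched evaluation with PI-based early stopping can be written as below (the function names `run_gibbs_batch` and `prob_of_improvement` are illustrative stand-ins, not the paper's code):

```python
import numpy as np

def boop_evaluate(theta, run_gibbs_batch, prob_of_improvement,
                  n_init=1100, n_burn=1000, batch=100,
                  n_max=10_000, alpha=0.001):
    """Evaluate one BOOP point: draw Gibbs batches until either the
    probability of improvement drops below alpha or the budget is spent.

    run_gibbs_batch(theta, n) is assumed to return n new Gibbs draws;
    prob_of_improvement(draws) the estimated PI at theta.
    """
    draws = run_gibbs_batch(theta, n_init)[n_burn:]  # discard burn-in
    n_total = n_init
    while n_total < n_max:
        if prob_of_improvement(draws) < alpha:
            break  # unpromising point: stop early, move to next BO iteration
        draws = np.concatenate([draws, run_gibbs_batch(theta, batch)])
        n_total += batch
    return draws, n_total  # between n_init and n_max Gibbs iterations
```

With stubs that never (always) signal improvement, `n_total` comes out as 1100 (10,000), matching the range reported above.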
Figure 7: Unbiasedness of Chib’s log marginal likelihood estimator in the steady-state BVAR application. The
horizontal axis denotes the number of MCMC draws (excluding 50 observations as burn-in), the blue dots are draws
from the sampling distribution of Chib’s estimator for a given MCMC sample size. The red line represents the mean
of the draws and the blue line represents the true log marginal likelihood, obtained from 100 000 MCMC iterations
with 5000 as a burn-in.
For comparison, we also use the standard values of the hyperparameters in e.g. the BEAR toolbox (Dieppe et al., 2016), θ1 = 0.1, θ2 = 0.5, and θ3 = 1, as a benchmark. The methods
are compared with respect to i) the obtained marginal likelihood, and ii) how much computational
5.4 Results for the medium-scale steady state VAR model
Table 2 summarizes the results from ten runs of the algorithms for the medium size BVAR model.
We see that all three optimization strategies find hyperparameters that yield substantially higher
log marginal likelihood than the standard values. We can also see that both Bayesian optimization
methods find hyperparameters as good as those from the grid search at only a small fraction of the
computational cost. It is also clear from Table 2 that a substantial amount of computations associated with
the MCMC are saved when using BOOP. It is interesting to note that the values for θ1 and θ2 are
similar for all three optimization approaches but that θ3 differs to some extent. This is due to the
Figure 8: Comparison of the convergence speed of the Bayesian optimization methods as a function of the number of
MCMC draws (left) and computing time (right).
The left graph of Figure 8 shows that BOOP-EI finds higher values of the log marginal likelihood
using much fewer MCMC iterations than plain BO with EI acquisitions. From Table 2 we can see
that BOOP-EI uses, on average, less than a fifth of the MCMC iterations compared to BO-EI for a
full run. It is interesting to note that BO-EI makes (on average) a larger number of improvements
on the way to the maximum, while BOOP-EI makes fewer but larger improvements; the strategy
of cheaply exploring new territory before optimizing the function locally pays off. The
graph to the right in Figure 8 shows that for this application BO-EI is quicker in terms of CPU time
to reach fairly high values for the log marginal likelihood. We see at least two explanations for this:
first, BOOP-EI tries to explore more unknown territories since they are presumed to be cheap while
BO-EI more greedily focuses on local optimization. Second, and more importantly, the overhead
cost associated with the BOOP-EI acquisition is relatively large in this medium-sized application
where the cost of evaluating the marginal likelihood itself is not excessive. The fact that BOOP
can heavily reduce the number of MCMC draws while still giving similar CPU time suggests that it is most useful in cases where each log marginal likelihood evaluation is expensive;
this will be demonstrated in the more computationally demanding models in Sections 5.5 and 6.
Figure 9 displays the log marginal likelihood surfaces over the grid of (θ1 , θ2 )-values used in
the grid search. Each sub-graph is for a fixed value of θ3 ∈ {0.76, 1, 2}. The red dot indicates
the predicted maximum log marginal likelihood for the given θ3 , and the black dot in the middle
sub-figure indicates the standard values. We can see that the standard values are located outside
the high-density region, relatively far away from the maximum. A comparison of Figures 9 and 10
shows that the GP's predicted log marginal likelihood surface is quite accurate after merely
250 evaluations; this is impressive considering that Bayesian optimization tries to find the
maximum in the fastest way, and does not aim to have high precision in low-density regions.
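The surface predictions of this kind are posterior means of a GP fitted to the noisy evaluations. A minimal sketch of that step on synthetic data follows; the squared-exponential kernel and its fixed hyperparameters here are illustrative choices, not the surrogate specification used in the paper:

```python
import numpy as np

def gp_posterior_mean(X, y, X_star, ell=0.5, sf2=1.0, noise=0.1):
    """GP posterior mean under a squared-exponential kernel
    (kernel choice and hyperparameters are illustrative)."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return sf2 * np.exp(-0.5 * d2 / ell ** 2)
    K = k(X, X) + noise * np.eye(len(X))
    return k(X_star, X) @ np.linalg.solve(K, y)

# 250 noisy evaluations of a smooth toy "log marginal likelihood" surface
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(250, 2))
y = -((X[:, 0] - 0.4) ** 2 + (X[:, 1] - 0.6) ** 2) + 0.05 * rng.normal(size=250)

# Predict over a regular grid and locate the predicted maximum
grid = np.stack(np.meshgrid(np.linspace(0, 1, 25),
                            np.linspace(0, 1, 25)), axis=-1).reshape(-1, 2)
surface = gp_posterior_mean(X, y, grid)
best = grid[surface.argmax()]  # close to the true maximizer (0.4, 0.6)
```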
Figure 9: Log marginal likelihood surfaces over a fine grid of (θ1 , θ2 )-values. The lag-decay hyperparameter is (a) θ3 = 0.76, (b) θ3 = 1, (c) θ3 = 2 (left to right). The red dot denotes the maximum log marginal likelihood value for the given θ3 and the black dot, in the middle plot, shows the standard values.
Figure 10: GP predictions of the hyperparameter surfaces in Figure 9 based on 250 evaluations for one BOOP-EI run. The lag-decay hyperparameter is θ3 = 0.76, 1, and 2 (left to right). The red dot indicates the highest predicted value in each sub-plot and the black dot, in the middle plot, shows the standard values.
5.5 Results for the large-scale steady state VAR model
We also optimize the hyperparameters of the more challenging large BVAR model containing the 22
different time series, using 250 iterations for both BO-EI and BOOP-EI. A complete grid search is
too costly here, so we instead compare with the parameters obtained from BOOP in the medium-sized model.
Table 3 shows that our method, again, finds optimal hyperparameters with dramatically larger
log ML than standard values, and also substantially better values than those that are optimal for the
medium-scale BVAR. Finally, note that the hyperparameters selected by BOOP-EI in the large-scale
BVAR are quite different from those in the medium-scale model. The optimal θ1 applies less baseline
shrinkage than before, but the lag decay (θ3 ) is higher, and in particular, the cross-lag shrinkage,
θ2 , is much closer to zero, implying much harder shrinkage towards univariate AR-processes. This
latter result strongly suggests that the computationally attractive conjugate prior structure is a
highly sub-optimal solution since such a prior requires that θ2 = 1. We can see that for this more
computationally demanding model BOOP-EI is much faster, finishing on average in a third of the
time of the regular BO-EI strategy. Figure 11 shows the predicted log marginal likelihood surface
obtained from the last GP in a BOOP run. The rightmost graph conditions on θ3 = 1.51, which is
optimal for BOOP-EI, so this graph has the GP with the highest accuracy.
Figure 11: GP predictions of the hyperparameter surfaces for the large BVAR based on 250 evaluations for one
BOOP-EI run. The lag-decay hyperparameter is θ3 = 0.76 (left graph, optimal in the medium-size BVAR),
θ3 = 1 (middle graph, standard value) and θ3 = 1.51 (right graph, optimal for BOOP-EI). The red dot indicates the highest
predicted value in each sub-plot. The orange dot in the leftmost plot shows the hyperparameters obtained from BOOP
in the medium-sized model and the white dot in the middle plot shows the standard values.
The time-varying parameter BVAR with stochastic volatility (TVP-SV BVAR) in Chan and Eisenstat (2018) is of the structural form

B0t yt = µt + B1t yt−1 + · · · + Bpt yt−p + εt ,    εt ∼ N (0, Σt ),

where µt is an n × 1 vector of time-varying intercepts, B1t , . . . , Bpt are n × n matrices of time-varying VAR coefficients, and B0t is an n × n lower triangular matrix with ones on the main diagonal. The evolution of
Σt = diag (exp(h1t ), . . . , exp(hnt )) is modelled by the vector of log volatilities, ht = (h1t , . . . , hnt )> , which follows the random walk

ht = ht−1 + ηt ,    ηt ∼ N (0, Σh ),

where Σh = diag(σ²h1 , . . . , σ²hn ) and the starting values in h0 are parameters to be estimated. Following Chan and Eisenstat (2018), we collect all parameters of µt and the Bit matrices in a kγ -dimensional vector γt , which also evolves as a random walk, γt = γt−1 + ζt with ζt ∼ N (0, Σγ ). The measurement equation can then be written as

yt = Xt γt + εt ,    εt ∼ N (0, Σt ),

where Xt contains both current and lagged values of y, Σγ = diag(σ²γ1 , . . . , σ²γkγ ), and the initial state γ0 is a parameter to be estimated.
For comparability we choose to use the same prior setup as in Chan and Eisenstat (2018)
where the variances of the state innovations follow independent inverse-gamma distributions: σ²γi ∼
IG(νγ0 , Sγ0 ) if γi is an intercept, σ²γi ∼ IG(νγ1 , Sγ1 ) if γi is a VAR coefficient, and σ²hi ∼ IG(νh , Sh )
for the innovations to the log variances. Following Chan and Eisenstat (2018) we set aγ = 0, and
we use Bayesian optimization to find the optimal values for the three key hyperparameters Sγ0 , Sγ1 and Sh , which
control the degree of time-variation in the states. We collect the three optimized hyperparameters in the
vector θ = (θ1 , θ2 , θ3 )> , where θ1 = Sγ0 , θ2 = Sγ1 and θ3 = Sh . We optimize over the domain
{θ : 0 ≤ θ1 ≤ 5, 0 ≤ θ2 ≤ 1, 0 ≤ θ3 ≤ 5}, which allows for all cases from no time variation in any of the states to substantial time variation in all of them.
To estimate the marginal likelihood, Chan and Eisenstat (2018) first obtain posterior draws
of γ, h, Σγ , Σh , γ0 , h0 using Gibbs sampling, which are then used to design efficient importance
sampling densities. The marginal likelihood is

p(y) = ∫ p(y|γ, h, ψ)p(γ|ψ)p(h|ψ)p(ψ) dγ dh dψ,    (26)
where ψ collects all the static parameters in Σγ , Σh , γ0 , h0 . The inner integral w.r.t. γ can be
solved analytically and, afterwards, we can integrate out h using importance sampling to obtain an
estimate of the integrated likelihood

p(y|ψ) = ∫ p(y|γ, h, ψ)p(γ|ψ)p(h|ψ) dγ dh.    (27)
The last step is to integrate out the fixed parameters from the integrated likelihood
p(y) = ∫ p(y|ψ)p(ψ) dψ,    (28)
which is done using another importance sampler. The two nested importance samplers put the
algorithm in the framework of importance sampling squared (IS² ; Tran et al., 2013). The Chan-
Eisenstat algorithm is elegantly designed, but necessarily computationally expensive with a single
estimate of the marginal likelihood taking 205 minutes in MATLAB on a standard desktop (Chan
and Eisenstat, 2018). We call their MATLAB code from Julia using the MATLAB.jl package,
illustrating that BOOP can plug in any marginal likelihood estimator. However, we found that the
standard errors in Chan and Eisenstat (2018) can be more robustly estimated using the bootstrap,
and we have done so here. The cost of bootstrapping the standard errors is incurred only once
per Bayesian optimization iteration, and it is negligible compared to the computation of the marginal likelihood estimate itself.
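A generic version of this bootstrap can be sketched as follows; the log importance weights are a stand-in for the output of the Chan-Eisenstat samplers, and `n_boot` is an illustrative choice:

```python
import numpy as np

def log_mean_exp(lw):
    """Numerically stable log of the mean of exp(lw)."""
    m = lw.max()
    return m + np.log(np.mean(np.exp(lw - m)))

def bootstrap_se(log_weights, n_boot=500, seed=0):
    """Bootstrap standard error of a log marginal likelihood estimated
    as the log-mean of importance weights."""
    rng = np.random.default_rng(seed)
    n = len(log_weights)
    reps = np.array([log_mean_exp(rng.choice(log_weights, size=n, replace=True))
                     for _ in range(n_boot)])
    return reps.std()

# Toy log weights from a reasonably well-behaved importance sampler
lw = np.random.default_rng(1).normal(-3.0, 1.0, size=5000)
se = bootstrap_se(lw)  # small standard error, as expected for 5000 draws
```

Resampling the weights with replacement avoids relying on asymptotic delta-method formulas for the standard error of the log estimate.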
We use quarterly data for the GDP-deflator, real GDP, and the short-term interest rate in the
USA from 1954Q3 to 2014Q4 from Chan and Eisenstat (2018) for comparability. In addition, we
make use of their MATLAB code (with minor adjustments) for computing the marginal likelihood.
This shows another strength of the BOOP approach: it works on top of existing code.
We fix the number of Gibbs sampling iterations and the burn-in period to 20,000 and 5,000, respectively, for both BO-EI and BOOP-EI in all evaluation points. This simplifies the implementation
and does not make a practical difference since the main part of the computational cost is spent on
the log marginal likelihood estimate from importance sampling. For BO-EI we use 5000 importance
sampling draws in each new evaluation point, while BOOP-EI starts with 1000 importance
sampling draws and then takes batches of size 100 until a maximum of 5000 samples has been reached.
The initial 1000 draws were enough to make the estimator approximately unbiased.
6.2 Results for the TVP-SV BVAR
Table 4 shows the optimized log marginal likelihood from three independent runs of BO-EI and
BOOP-EI; the hyperparameter values used in Chan and Eisenstat (2018) are shown for reference.
As expected, both BO and BOOP find better hyperparameters than the ones in Chan and Eisenstat
(2018); this is particularly true for BOOP which gives an increase in the marginal likelihood of
more than 10 units on the log scale on average. Interestingly, both BO and BOOP suggest that
the stochastic volatilities should be allowed to move more freely than in Chan and Eisenstat (2018),
but that there should be less time variation in the intercepts and VAR coefficients. This points
in the same direction as the results in Chan and Eisenstat (2018) who find that shutting down
the time variation in the intercept and VAR coefficients actually increases the marginal likelihood.
Our results indicate that when carefully selecting the shrinkage parameters by optimization, the
VAR-dynamics should in fact be allowed to evolve over time, but at a slower pace.
Table 4 shows a great deal of variability between runs, in particular for BO. Figure 12 shows
that this is probably because the hyperparameter surface is substantially more complicated and
multimodal than for the steady-state BVAR. We can also see that the log marginal likelihood is
relatively insensitive to changes in θ3 around the mode while it is very sensitive to changes in θ2 .
Figure 12: Predicted log marginal likelihood over the hyperparameters for stochastic volatility and the VAR dynamics
for θ1 = 0.0086 (left), 0.05 (middle) and 0.1 (right). The mode in each plot is marked by a red point. A distant
local optimum is also marked by an orange point.
Table 4: Result for TVP-BVAR with stochastic volatility for three independent runs of BO and BOOP. CE is taken
from Table 3 in Chan and Eisenstat (2018). The row named SE shows the numerical standard errors. Runs were
stopped if there was no improvement during the last 50 hours.
7 Concluding remarks
We propose a new Bayesian optimization method for finding optimal hyperparameters in econometric
models. The method can be used to optimize any noisy function where the precision is under the
control of the user. We focus on the common situation of maximizing a marginal likelihood evaluated
by MCMC or importance sampling, where the precision is determined by the number of MCMC or
importance sampling draws. The ability to choose the precision makes it possible for the algorithm
to take occasional cheap and noisy evaluations to explore the marginal likelihood surface, thereby reducing the overall computational cost of the optimization.
We assess the performance of the new algorithm by optimizing the prior hyperparameters in the
extensively used BVAR with stochastic volatility and time-varying parameters and the steady-state
BVAR model in both a medium-sized and a large-scale VAR. The method is shown to be practical
and competitive with other approaches in that it finds the optimum using a substantially smaller
computational budget, and it has the potential of becoming part of the standard toolkit for BVARs. We
have focused on optimizing the marginal likelihood, but the method is directly applicable to other
score functions, e.g. the popular log predictive score (Geweke and Keane, 2007; Villani et al., 2012).
Our approach builds on the assumption that the noisy estimates of the log marginal likelihoods
are approximately unbiased, which we verify is a reasonable assumption in the three applications
if the first BOOP evaluation is based on a marginal likelihood estimator from enough MCMC
draws. The unbiasedness of the log marginal likelihood will, however, depend on the combination
of MCMC sampler and marginal likelihood estimator, see Adolfson et al. (2007) for some evidence
from Dynamic Stochastic General Equilibrium (DSGE) models (An and Schorfheide, 2007). For
example, the simulations in Adolfson et al. (2007) suggest that sampling with the independence
Metropolis-Hastings combined with the Chib and Jeliazkov (2001) estimator is nearly unbiased,
whereas sampling with the random walk Metropolis algorithm combined with the modified harmonic
mean estimator (Geweke, 1999) can be severely biased, unless the posterior sample is extremely large.
It would therefore be interesting to extend the method to cases with biased evaluations, where the
marginal likelihood estimates are persistent and only slowly approach the true marginal likelihood.
Since the marginal likelihood trajectory over MCMC iterations is rather smooth (Adolfson et al.,
2007) one can try to predict its evolution and then correct the bias in the marginal likelihood
estimates.
References
Adolfson, M., Lindé, J., and Villani, M. (2007). Bayesian analysis of DSGE models—some comments. Econometric Reviews, 26(2-4):173–185.
An, S. and Schorfheide, F. (2007). Bayesian analysis of DSGE models. Econometric Reviews, 26(2-4):113–172.
Bańbura, M., Giannone, D., and Reichlin, L. (2010). Large Bayesian vector auto regressions. Journal of Applied Econometrics, 25(1):71–92.
Bezanson, J., Edelman, A., Karpinski, S., and Shah, V. B. (2017). Julia: A fresh approach to numerical computing. SIAM Review, 59(1):65–98.
Brochu, E., Cora, V. M., and De Freitas, N. (2010). A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599.
Carriero, A., Kapetanios, G., and Marcellino, M. (2012). Forecasting government bond yields with large Bayesian vector autoregressions. Journal of Banking & Finance, 36(7):2026–2047.
Chan, J. C. and Eisenstat, E. (2018). Bayesian model comparison for time-varying parameter VARs with stochastic volatility. Journal of Applied Econometrics, 33(4):509–532.
Chib, S. (1995). Marginal likelihood from the Gibbs output. Journal of the American Statistical
Association, 90(432):1313–1321.
Chib, S. and Jeliazkov, I. (2001). Marginal likelihood from the Metropolis-Hastings output. Journal of the American Statistical Association, 96(453):270–281.
Dieppe, A., Legrand, R., and Van Roye, B. (2016). The BEAR toolbox. ECB Working Paper No. 1934, European Central Bank.
Doan, T., Litterman, R., and Sims, C. (1984). Forecasting and conditional projection using realistic prior distributions. Econometric Reviews, 3(1):1–100.
Doucet, A., De Freitas, N., Gordon, N. J., et al. (2001). Sequential Monte Carlo methods in practice.
Springer.
Geweke, J. (1999). Using simulation methods for Bayesian econometric models: inference, development, and communication. Econometric Reviews, 18(1):1–73.
Geweke, J. and Keane, M. (2007). Smoothly mixing regressions. Journal of Econometrics, 138(1):252–290.
Giannone, D., Lenza, M., and Primiceri, G. E. (2015). Prior selection for vector autoregressions. Review of Economics and Statistics, 97(2):436–451.
Hannan, E. J. and Quinn, B. G. (1979). The determination of the order of an autoregression. Journal of the Royal Statistical Society: Series B, 41(2):190–195.
Matérn, B. (1960). Spatial variation. Meddelanden från Statens Skogsforskningsinstitut, 49(5). Reprinted in Lecture Notes in Statistics 36, Springer, 1986.
Österholm, P. (2012). The limited usefulness of macroeconomic Bayesian VARs when estimating the probability of a US recession. Journal of Macroeconomics, 34(1):76–86.
Primiceri, G. E. (2005). Time varying structural vector autoregressions and monetary policy. The Review of Economic Studies, 72(3):821–852.
Quinn, B. G. (1980). Order determination for a multivariate autoregression. Journal of the Royal Statistical Society: Series B, 42(2):182–185.
Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, volume 25, pages 2951–2959.
Tran, M.-N., Scharth, M., Pitt, M. K., and Kohn, R. (2013). Importance sampling squared for Bayesian inference in latent variable models. arXiv preprint arXiv:1309.3339.
Villani, M. (2009). Steady-state priors for vector autoregressions. Journal of Applied Econometrics,
24(4):630–650.
Villani, M., Kohn, R., and Nott, D. J. (2012). Generalized smooth finite mixtures. Journal of
Econometrics, 171(2):121–133.
Williams, C. K. and Rasmussen, C. E. (2006). Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA.