Bayesian Optimization of Hyperparameters from Noisy Marginal Likelihood Estimates

Oskar Gustafsson¹∗, Mattias Villani¹,² and Pär Stockhammar¹,³

¹Department of Statistics, Stockholm University
²Department of Computer and Information Science, Linköping University
³Sveriges Riksbank

Abstract

Bayesian models often involve a small set of hyperparameters determined by maximizing


the marginal likelihood. Bayesian optimization is a popular iterative method where a Gaussian
process posterior of the underlying function is sequentially updated by new function evaluations.
An acquisition strategy uses this posterior distribution to decide where to place the next function
evaluation. We propose a novel Bayesian optimization framework for situations where the user
controls the computational effort, and therefore the precision of the function evaluations. This is
a common situation in econometrics where the marginal likelihood is often computed by Markov
chain Monte Carlo (MCMC) or importance sampling methods, with the precision of the marginal
likelihood estimator determined by the number of samples. The new acquisition strategy gives the
optimizer the option to explore the function with cheap noisy evaluations and therefore find the
optimum faster. The method is applied to estimating the prior hyperparameters in two popular
models on US macroeconomic time series data: the steady-state Bayesian vector autoregressive
(BVAR) and the time-varying parameter BVAR with stochastic volatility. The proposed method
is shown to find the optimum much quicker than traditional Bayesian optimization or grid search.

Keywords: acquisition strategy, Bayesian optimization, importance sampling, MCMC, steady-state

BVAR, stochastic volatility, US macro.



∗Corresponding author: Oskar Gustafsson, Department of Statistics, SE-106 91 Stockholm, Sweden.
Email: [email protected]. Phone: (+46)739692774

1 Introduction

The trend in econometrics is to use increasingly more flexible models that give a richer description of

the economy, particularly for prediction purposes. As the model complexity increases, the estimation

problems get more involved, and computationally costly MCMC methods are often used to sample

from the posterior distribution.

Most models involve a relatively small set of hyperparameters that needs to be chosen by the user.

For example, consider the steady-state BVAR model (Villani, 2009), which is widely used among

practitioners and professional forecasters (Karlsson, 2013), and used in Section 5 for illustration.

The choice of the prior distribution in BVARs is often reduced to the selection of a small set of

prior hyperparameters. Some of these hyperparameters can be specified subjectively by experts,

for example, the steady-state is usually given a rather informative subjective prior. Other prior

hyperparameters control the smoothness/shrinkage properties of the model and are less easy to

specify subjectively.

Giannone et al. (2015) proposed to treat these hard-to-specify prior hyperparameters as un-

known parameters and explore the joint posterior of the hyperparameters, the VAR dynamics, and

the shock covariance matrix. This is a statistically elegant approach which works well when the

marginal likelihood is available in closed form and is easily evaluated. However, the marginal like-

lihood is rarely available in closed form. The BVARs with conjugate priors considered in Carriero

et al. (2012), and Giannone et al. (2015) are an exception, but already the steady-state VAR needs

MCMC methods to evaluate the marginal likelihood. It is of course always an option to sample

the hyperparameters jointly with the other model parameters using Metropolis-Hastings (MH) or

Hamiltonian Monte Carlo (HMC), but this likely leads to inefficient samplers since the parameter

spaces are high-dimensional and the posterior of the hyperparameters is often quite complex, see

e.g. the application in Section 6.

Most practitioners also seem to prefer to fix the hyperparameters before estimating the other

model parameters. Carriero et al. (2012) propose a brute force optimization approach where the

marginal likelihood is evaluated over a grid. This is computationally demanding, especially if a

simulation based method has to be used for computing the marginal likelihood. Since the marginal

likelihood in Giannone et al. (2015) is available in closed form, one can readily optimize it using

a standard gradient based optimizer with automatic differentiation, but this is again restricted to

models with conjugate priors. The vast majority of applications instead use so-called conventional

values for the hyperparameters, dating back to Doan et al. (1984), which were found to be optimal

on a specific historical dataset but are likely to be suboptimal for other datasets. Hence, there is

a real need for a fast method for optimizing the marginal likelihood over a set of hyperparameters

when every evaluation of the marginal likelihood is a noisy estimate from a computationally costly

full MCMC run.

Bayesian optimization (BO) is an iterative optimization technique originating from machine

learning. BO is particularly suitable for optimization of costly noisy functions in small to moderate

dimensional parameter spaces (Brochu et al., 2010 and Snoek et al., 2012) and is therefore well

suited for marginal likelihood optimization. The method treats the underlying objective function

as an unknown object that can be inferred by Bayesian inference by evaluating the function at

a finite number of points. A Gaussian process prior expresses Bayesian prior beliefs about the

underlying function, often just containing the information that the function is believed to have

a certain smoothness. Bayes' theorem is then used to sequentially update the Gaussian process

posterior after each new function evaluation. Bayesian optimization uses the most recently updated

posterior of the function to decide where to optimally place the next function evaluation. This

so-called acquisition strategy is a trade-off between: i) exploiting the available knowledge about

the function to improve the current maximum and ii) exploring the function to reduce the posterior

uncertainty about the objective function.

Our paper proposes a novel framework for Bayesian optimization when the user can control the

precision and computational cost of each function evaluation. The framework is quite general, but we

focus mainly on the situation when the noisy objective function is a marginal likelihood computed by

MCMC. This is a very common situation in econometrics using, for example, the estimators in Chib

(1995), Chib and Jeliazkov (2001), and Geweke (1999). The precision of the marginal likelihood

estimate at each evaluation point is then implicitly chosen by the user via the number of MCMC

iterations. This makes it possible to use occasional cheap noisy evaluations of the marginal likelihood

to quickly explore the marginal likelihood over hyperparameter space during the optimization. Our

proposed acquisition strategy can be seen as jointly deciding where to place the new evaluation and

also how much computational effort to spend in obtaining the estimate. We implement this strategy

by a stopping rule for the MCMC sampling combined with an auxiliary prediction model for the

computational effort at any new evaluation point; the auxiliary prediction model is learned during

the course of the optimization.

We apply the method to the steady-state BVAR (Villani, 2009) and the time-varying parameter

BVAR with stochastic volatility (Chan and Eisenstat, 2018) and demonstrate that the new acquisi-

tion strategy finds the optimal hyperparameters faster than traditionally used acquisition functions.

It is also substantially faster than a grid search and finds a better optimum.

The outline of the paper is as follows. Section 2 introduces the problem of inferring hyperparam-

eters from an estimated marginal likelihood. Section 3 gives the necessary background on Gaussian

processes and Bayesian optimization and introduces our new Bayesian optimization framework. Sec-

tion 4 illustrates and evaluates the proposed algorithm in a simulation study. Sections 5 and 6 assess

the performance of the proposed algorithm in empirical examples, and the final section concludes.

2 Hyperparameter estimation from an estimated marginal likelihood

To introduce the problem of using an estimated marginal likelihood for learning a model’s hyper-

parameters we consider the selection of hyperparameters in the popular class of Bayesian vector

autoregressive models (BVARs) as our running example.

2.1 Hyperparameter estimation

Consider the standard BVAR model

$$y_t = \sum_{k=1}^{K} \Pi_k y_{t-k} + \varepsilon_t, \qquad (1)$$

where $\{\varepsilon_t\}_{t=1}^{T}$ are iid $N(0, \Sigma)$. A simplified version of the Minnesota prior (see e.g. Karlsson (2013))

without cross-equation shrinkage is of the form

$$(\Pi, \Sigma) \sim MNIW(\underline{\Pi}, \Omega_\Pi, S, \nu), \qquad (2)$$

with

$$\Sigma \sim IW(S, \nu), \qquad \mathrm{vec}(\Pi') \mid \Sigma \sim N(\mathrm{vec}(\underline{\Pi}'), \Sigma \otimes \Omega_\Pi), \qquad (3)$$

where $\mathrm{vec}(\underline{\Pi})$ and $\Omega_\Pi$ denote the prior mean and covariance of the coefficient matrix, and $S$ is the prior scale matrix with prior degrees of freedom $\nu$. The diagonal elements of $\Omega_\Pi$ are given by

$$\omega_{ii} = \frac{\lambda_1^2}{(l^{\lambda_3} s_r)^2}, \quad \text{for lag } l \text{ of variable } r,\; i = (l-1)p + r, \qquad (4)$$

where $\lambda_1$ controls the overall shrinkage and $\lambda_3$ the lag-decay shrinkage, both set by the user, and $s_r$ denotes the estimated standard deviation of variable $r$. The fact that we do not use the additional cross-equation

shrinkage hyperparameter, λ2 , makes this prior conjugate to the VAR likelihood, a fact that will

be important in the following. It has been common practice to use standard values dating back to Doan et al. (1984), but there has been renewed interest in finding values that are optimal for the

given application (see e.g. Bańbura et al. (2010), Carriero et al. (2012) and Giannone et al. (2015)).

Two main approaches have been proposed. First, Giannone et al. (2015) proposed to sample from

the joint posterior distribution using the decomposition

p(β, θ|y1:T ) = p(β|θ, y1:T )p(θ|y1:T ), (5)

where β = (Π, Σ) and p(θ|y1:T ) is the marginal posterior distribution of the hyperparameters. The

algorithm samples from p(θ|y1:T ) using Metropolis-Hastings (MH) and then samples directly from

p(β|θ, y1:T ) for each θ draw by drawing Π and Σ from the Normal-Inverse Wishart distribution.

There are some limitations to using this approach. First, p(θ|y1:T ) can be multimodal (see e.g.

the application in Section 6) and it can be hard to find a good MH proposal density, making the

sampling time-consuming. Second, practitioners tend to view hyperparameter selection as similar

to model selection and want to determine a fixed value for θ once and for all early in the model

building process.

Carriero et al. (2012) propose an exhaustive grid search to find the θ that maximizes p(θ|y1:T )

and then use that optimal θ throughout the remaining analysis. The obvious drawback here is that

a grid search is very costly, especially if we have non-conjugate priors and more than a couple of

hyperparameters.

A problem with both the approach in Giannone et al. (2015) and Carriero et al. (2012) is that

for most interesting models p(θ|y1:T ) is not available in closed form. Even the Minnesota prior

with cross-equation shrinkage is no longer a conjugate prior and p(θ|y1:T ) is intractable. In fact,

most Bayesian models used in practice have intractable p(θ|y1:T ), including the steady-state BVAR

(Villani, 2009) and the TVP-SV BVAR (Primiceri, 2005) used in Section 5 and 6.

When p(θ|y1:T ) is intractable, MCMC or other simulation based methods like Sequential Monte

Carlo (SMC) are typically used to obtain a noisy estimate of p(θ|y1:T ). Since

p(θ|y1:T ) ∝ p(y1:T |θ)p(θ), (6)

where $p(y_{1:T}|\theta) = \int p(y_{1:T}|\theta, \beta)\, p(\beta|\theta)\, d\beta$ is the marginal likelihood, this problem goes under the

heading of (log) marginal likelihood estimation. We will therefore frame our method as maximizing

the log marginal likelihood; maximization of p(θ|y1:T ) is achieved by simply adding the log prior,

log p(θ), to the objective function.

Chib (1995) proposes an accurate way of computing a simulation-consistent estimate of the

marginal likelihood when the posterior can be obtained via Gibbs sampling, which is the case

for many econometric models. We refer to Chib (1995) for details about the marginal likelihood

estimator and its approximate standard error.
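For intuition, Chib's method rests on the basic marginal likelihood identity, obtained by rearranging Bayes' theorem at any fixed parameter point $\beta^*$ (in the notation of this section, conditional on the hyperparameters $\theta$); this is a one-line summary for orientation, not the full estimator:

$$\log p(y_{1:T}|\theta) = \log p(y_{1:T}|\beta^*, \theta) + \log p(\beta^*|\theta) - \log p(\beta^*|y_{1:T}, \theta).$$

The only unknown quantity on the right-hand side is the posterior ordinate $p(\beta^*|y_{1:T}, \theta)$, which is estimated from the Gibbs output, preferably at a high-density point $\beta^*$.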

Estimating the marginal likelihood for the TVP-SV BVAR is more challenging and we will adopt

the method suggested by Chan and Eisenstat (2018). The approach consists of four steps: 1) obtain a

posterior sample via Gibbs sampling, 2) integrate out the time-varying VAR coefficients analytically,

3) integrate out the stochastic volatility using importance sampling to obtain the integrated likelihood,

4) integrate out the static parameters using another importance sampler. Since the algorithm makes

use of two nested importance samplers, it is a special case of importance sampling squared (IS²)

(Tran et al., 2013); see Section 6 for more details.

There are many alternative estimators that can be used in our approach, for example, the

extension of Chib’s estimator to Metropolis-Hastings sampling (Chib and Jeliazkov, 2001), and

estimators based on importance sampling (Geweke, 1999) or Sequential Monte Carlo (Doucet et al.,

2001).
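To fix ideas, the basic importance sampling estimator of the log marginal likelihood takes only a few lines of Julia. This is a minimal sketch: `loglik`, `logprior`, `logq` and `drawq` are hypothetical placeholders for a concrete model's log-likelihood, log-prior, log proposal density and proposal sampler.

```julia
using Statistics

# Importance sampling estimate of log p(y|θ): draw βₘ ~ q and average the
# weights p(y|βₘ, θ) p(βₘ|θ) / q(βₘ), working on the log scale for stability.
function is_logml(loglik, logprior, logq, drawq, M)
    logw = [loglik(β) + logprior(β) - logq(β) for β in (drawq() for _ in 1:M)]
    c = maximum(logw)                       # log-sum-exp guard against underflow
    return c + log(mean(exp.(logw .- c)))   # log of the average importance weight
end
```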

All simulation-based estimators: i) give noisy evaluations of the marginal likelihood, ii) are time-

consuming, and iii) have a precision that is controlled by the user in terms of the number of MCMC

or importance sampling draws. The next section explains how traditional Bayesian optimization is

well suited for points i) and ii), but lacks a mechanism for exploiting point iii). Taking the user

controlled precision into account brings a new perspective to the problem and we propose a class of

algorithms that handle all three points above.

3 Bayesian optimization of hyperparameters

3.1 Gaussian processes

Since Bayesian optimization is a relatively unknown method in econometrics, we give an introduction

here to Gaussian processes and their use in Bayesian optimization.

A Gaussian process (GP) is a (possibly infinite) collection of random variables such that any

subset is jointly distributed according to a multivariate normal distribution, see e.g. Williams

and Rasmussen (2006). This process, denoted by $f(x) \sim \mathcal{GP}(\mu(x), k(x, x'))$, can be seen as a probability distribution over functions $f: \mathcal{X} \to \mathbb{R}$ that is completely specified by its mean function, $\mu(x) \equiv E f(x)$, and its covariance function, $C(f(x), f(x')) \equiv k(x, x')$, where $x$ and $x'$ are two arbitrary input values to $f(\cdot)$. Note that the covariance function specifies the covariance between any two function values, $f(x_1)$ and $f(x_2)$. A popular covariance function is the squared exponential

(SE):
$$k(x, x') = \sigma_f \exp\left( -\frac{|x - x'|^2}{2\ell^2} \right), \qquad (7)$$

where $|x - x'|$ is the Euclidean distance between the two inputs; the covariance function is specified by its two kernel hyperparameters, the scale parameter $\sigma_f > 0$ and the length scale $\ell > 0$. The scale parameter $\sigma_f$ governs the variability of the function and the length scale determines how fast the correlation between two function values tapers off with the distance $|x - x'|$, see Figure 1. The fact that any finite sampling of function values $\{f(x_n) : x_n \in \mathcal{X}\}_{n=1}^{N}$ constitutes a multivariate normal distribution on $\mathbb{R}^N$ allows for the convenient conditioning and marginalization properties of the multivariate normal distribution. In particular, this makes it easy to compute the posterior distribution for the function $f$ at any input $x_\star$.

An increasingly popular alternative to the squared exponential kernel is the Matérn kernel, see

Figure 1: Illustration of two Gaussian processes with squared exponential kernel with different length scales and the same variance $\sigma_f^2 = 0.25^2$. The figure shows the prior mean (dashed line) and 95% probability intervals (shaded) and five realizations from each process. A smaller length scale gives more wiggly realizations.

e.g. Matérn (1960) and Williams and Rasmussen (2006). The Matérn kernel has an additional hyperparameter, $\nu > 0$, in addition to the length scale $\ell$ and scale $\sigma_f$, such that the process is $k$ times mean square differentiable if and only if $\nu > k$. Hence, $\nu$ controls the smoothness of the process and it can be shown that the Matérn kernel approaches the SE kernel as $\nu \to \infty$ (Williams and Rasmussen (2006)). Our approach is directly applicable for any valid kernel function, but we will use the popular Matérn $\nu = 5/2$ kernel in our applications:

$$k_{\nu=5/2}(r) = \sigma_f \left( 1 + \frac{\sqrt{5}r}{\ell} + \frac{5r^2}{3\ell^2} \right) \exp\left( -\frac{\sqrt{5}r}{\ell} \right), \qquad (8)$$

where $r = |x - x'|$. The Matérn 5/2 has two continuous derivatives, which is often a requirement for Newton-type optimizers (Snoek et al., 2012). The kernel hyperparameters, $\sigma_f$ and $\ell$, are found by maximizing the marginal likelihood, see e.g. Williams and Rasmussen (2006).
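As a concrete illustration, the Matérn 5/2 kernel in (8) is a few lines of Julia. This is a minimal sketch with fixed kernel hyperparameters; the Euclidean norm makes it work for both scalar and vector inputs.

```julia
using LinearAlgebra

# Matérn ν = 5/2 kernel from Eq. (8); σf is the scale and ℓ the length scale.
function matern52(x, xp; σf = 1.0, ℓ = 1.0)
    r = norm(x - xp)                          # r = |x − x′|
    s = sqrt(5) * r / ℓ
    return σf * (1 + s + 5 * r^2 / (3 * ℓ^2)) * exp(-s)
end
```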

Consider the nonlinear/nonparametric regression model with additive Gaussian errors

$$y_i = f(x_i) + \epsilon_i, \quad \epsilon_i \overset{iid}{\sim} N(0, \sigma^2), \quad \text{for } i = 1, \ldots, n, \qquad (9)$$

and the prior $f(x) \sim \mathcal{GP}(0, k(x, x'))$. Given a dataset with $n$ observations, the posterior distribution

of $f(x_\star)$ at a new input $x_\star$ is (Williams and Rasmussen, 2006)

$$f(x_\star) \mid y_1, \ldots, y_n, x_1, \ldots, x_n \sim N\left( m(x_\star), s^2(x_\star) \right)$$
$$m(x_\star) = k(x_\star)^\top (K(X, X) + \sigma^2 I)^{-1} y$$
$$s^2(x_\star) = k(x_\star, x_\star) - k(x_\star)^\top (K(X, X) + \sigma^2 I)^{-1} k(x_\star), \qquad (10)$$

where $y = (y_1, \ldots, y_n)^\top$, $k(x_\star)$ is the $n$-vector with covariances between $f$ at the test point $x_\star$ and all other training inputs, and $K(X, X)$ is the $n \times n$ matrix with covariances among the function values at all $n$ training inputs in $X$. When the errors are heteroscedastic with variance $\sigma_i^2$ for the $i$th observation, the same formulas apply with $\sigma^2 I$ replaced by $\mathrm{diag}(\sigma_1^2, \ldots, \sigma_n^2)$.
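To make (10) concrete, the following sketch computes the posterior mean and variance at a test point directly with linear algebra, in the heteroscedastic form with one noise variance per evaluation. In the applications we rely on GaussianProcesses.jl instead; this block is purely illustrative, with a unit-scale SE kernel from Eq. (7) as the default.

```julia
using LinearAlgebra

# Posterior mean m(x⋆) and variance s²(x⋆) from Eq. (10), with σ²I replaced
# by diag(σ₁², …, σₙ²) to allow a different noise variance at each evaluation.
function gp_posterior(xstar, X, y, noisevar;
                      kernel = (a, b) -> exp(-norm(a - b)^2 / 2))
    K     = [kernel(xi, xj) for xi in X, xj in X]         # K(X, X), n × n
    kstar = [kernel(xi, xstar) for xi in X]               # k(x⋆), n-vector
    A     = K + Diagonal(noisevar)                        # add evaluation noise
    m     = dot(kstar, A \ y)                             # m(x⋆)
    s2    = kernel(xstar, xstar) - dot(kstar, A \ kstar)  # s²(x⋆)
    return m, s2
end
```

For example, `gp_posterior(0.5, [0.0, 1.0], [0.3, -0.1], [1e-4, 1e-2])` returns the posterior mean and variance at $x_\star = 0.5$ given two evaluations of different precision.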

3.2 Bayesian optimization

Bayesian optimization (BO) is an iterative optimization method that selects new evaluation points

using the posterior distribution of f conditional on the previous function evaluations. More specifi-

cally, BO uses an acquisition function, a(x), to select the next evaluation point (Brochu et al., 2010

and Snoek et al., 2012).

An intuitively sensible acquisition rule is to select a new evaluation point that maximizes the

probability of obtaining a higher function value than the current maximum, i.e. the Probability of

Improvement (PI ):
 
$$\mathrm{PI}(x) \equiv \Pr(f(x) > f_{\max}) = \Phi\left( \frac{m(x) - f_{\max}}{s(x)} \right), \qquad (11)$$

where fmax is the maximum value of the function obtained so far. The functions m(x) and s(x) are

the posterior mean and standard deviation of the estimated Gaussian process for f in the point x,

conditional on the available function evaluations (see (10)), and Φ denotes the cumulative standard

normal distribution.

The Expected Improvement (EI ) takes also the size of the improvement into consideration:

   
$$\mathrm{EI}(x) = (m(x) - f_{\max})\, \Phi\left( \frac{m(x) - f_{\max}}{s(x)} \right) + s(x)\, \phi\left( \frac{m(x) - f_{\max}}{s(x)} \right), \qquad (12)$$

where φ denotes the density function of the standard normal distribution. The first part of (12)

is associated with the size of our predicted improvement and the second part is related to the un-

certainty of our function in that area. Thus, EI incorporates the trade-off between high expected

improvement (exploitation) and learning more about the underlying function (exploration). Opti-

mization of the acquisition function to find the next evaluation point is typically a fairly easy task since it is noise-free and cheap to evaluate; however, it can have multiple optima, so some care has

to be taken. An easily implemented solution is to use a regular Newton-type algorithm initiated

with different starting values over the hyperparameter surface. In this paper we use particle swarm

optimization, a global optimization algorithm implemented in the Optim.jl package in Julia.
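As a minimal sketch (the paper's actual implementation optimizes these with particle swarm in Optim.jl), the two acquisition functions (11) and (12) are immediate given the GP posterior mean m and standard deviation s at a candidate point:

```julia
using Distributions

# Probability of Improvement, Eq. (11).
PI(m, s, fmax) = cdf(Normal(), (m - fmax) / s)

# Expected Improvement, Eq. (12): exploitation term plus exploration term.
function EI(m, s, fmax)
    z = (m - fmax) / s
    return (m - fmax) * cdf(Normal(), z) + s * pdf(Normal(), z)
end
```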

The exploitation-exploration trade-off is illustrated in Figure 2, where the blue line shows the true

objective function, the black line denotes the posterior mean of the GP, and the blue-shaded regions

are 95% posterior probability bands for f . The black (small) dots are past function evaluations, and

the violet (large) dot is the current evaluation. The first and third row in Figure 2 show the true

objective function and the learned GP at four different iterations of the algorithm while the second

and fourth row show the EI acquisition function corresponding to the row immediately above. At

Iteration 2 in the top-left corner, we see that the EI acquisition function (second row) indicates

that there is a high expected improvement by moving to either the immediate left or right of the

current evaluation. At Iteration 5 the EI strategy suggests three regions worth evaluating, where

the two leftmost regions are candidates because of their high uncertainty. After seven iterations,

the algorithm is close to the global maximum and will now continue a narrow search for the exact

location of the global maximum.

Figure 2: Bayesian optimization illustrated with the expected improvement acquisition strategy. The graphs in Rows 1 and 3 depict the posterior distribution of f and the evaluation points (current evaluation in violet). Rows 2 and 4 show the corresponding acquisition functions.

Acquisition rules like PI or EI do not consider that different evaluation points can be more or less

costly. To introduce the notion of cost into the acquisition strategy, Snoek et al. (2012) proposed

Expected Improvement per second, EIS(x) ≡ EI(x)/c(x), where c : X → R+ is a duration function

that measures the evaluation time at input x in seconds. More generally, we can define a(x)/c(x) as

an effort-aware acquisition function. The duration function is typically unknown and Snoek et al.

(2012) proposed to estimate it alongside f using an additional Gaussian process for log c(x).

3.3 Bayesian optimization with optimized precision

EIS assumes that the duration (or the cost) of a function evaluation is unknown, but fixed for a

given input x; once we visit x, the cost of the function estimate fˆ(x) is given. However, the user

can often choose the duration spent to obtain a certain precision in the estimate; for example by

increasing the number of MCMC iterations when the marginal likelihood is estimated by MCMC.

This novel perspective opens the door to strategies that not only optimize for the next evaluation point,

but also optimize over the computational resources, or equivalently, the precision of the estimate

$\hat f(x)$. We formally extend BO by modeling the function evaluations with a heteroscedastic GP

$$\hat f(x) = f(x) + \epsilon, \quad \epsilon \sim N(0, \sigma^2(x, G)) \qquad (13)$$
$$f \sim \mathcal{GP}(\mu(x), k(x, x')),$$

where the noise variance σ 2 (x, G) is now an explicit function of the number of MCMC iterations,

G, or some other duration measure. Hence the user can now choose both where to place the next

evaluation and the effort spent in computing it by maximizing

ã(x, G) ≡ a(x)/G, (14)

with respect to both x and G, where a(x) is a baseline acquisition function, for example EI.

A complication with maximization of $\tilde a(x, G)$ is that while we typically know that $\sigma(x, G) = O(1/\sqrt{G})$ in Monte Carlo or MCMC, the exact numerical standard error depends on the integrated

autocorrelation time (IACT) of the MCMC chain. Note that the evaluation points can, for example,

be hyperparameters in the prior, where different values can give rise to varying degrees of well-

behaved posteriors, so we cannot expect the IACT to be constant over the hyperparameter space,

hence the explicit dependence on x in σ 2 (x, G). Rather than maximizing ã(x, G) with respect to

both x and G directly, we propose to implement the algorithm in an alternative way that achieves a

similar effect. The approach stops the evaluation early whenever the function value turns out to be hopelessly low, with little probability of improvement over the current $f_{\max}$.

For a given x we let G increase, in batches of a fixed size, until


$$\mathrm{PI}(x) \equiv \Phi\left( \frac{\hat m^{(g)}(x) - f_{\max}}{s^{(g)}(x)} \right) < \alpha, \qquad (15)$$

for some small value $\alpha$, or until $G$ reaches a predetermined upper bound $\bar G$; here $\hat m^{(g)}(x)$ and $s^{(g)}(x)$ denote the posterior mean and standard deviation of the GP evaluated at $x$ after $g$ MCMC iterations. Note that both the posterior mean $m(x)$ and standard deviation $s(x)$ are functions of the noise variance, which in turn is a function of $G$. The posterior distribution for $f(x)$ is hence continuously updated as $G$ grows until $1 - \alpha$ of the posterior mass in the GP for $f(x)$ is concentrated below $f_{\max}$, at which point the evaluation stops. The optimization is insensitive to the choice of $\alpha$,

as long as it is a relatively small number. We now propose to maximize the following acquisition

function based on early stopping

ãα (x) = a(x)/Ĝα (x), (16)

where Ĝα (x) is a prediction of the number of MCMC draws needed at x before the evaluation stops,

with the probability α as the threshold for stopping. We emphasize that early stopping is here

used in a subtle way, not only as a simple rule to short-circuit useless computations, but also in the

planning of future computations; the mere possibility of early stopping can make the algorithm try

an x which does not have the highest a(x), but which is expected to be cheap and is therefore worth

a try. This effect, which comes via $\sigma^2(x, G)$, is not present in the EIS of Snoek et al. (2012), where the cost is fixed and is not influenced by the probability model on $f$.
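A sketch of how the early-stopped evaluation in (15) can be organised is given below. The two closures are hypothetical stand-ins: `run_batch!(x, n)` appends n more MCMC draws at x and returns the current estimate and its noise variance, and `gp_at(x, s2e)` returns the GP posterior mean and standard deviation at x under that noise variance.

```julia
using Distributions

# Early stopping at x, Eq. (15): enlarge the MCMC sample in batches until the
# probability of improving on fmax drops below α, or G reaches the cap Gmax.
function evaluate!(run_batch!, gp_at, x, fmax; α = 0.001, batch = 100, Gmax = 10_000)
    G, fhat, s2e = 0, NaN, Inf
    while G < Gmax
        fhat, s2e = run_batch!(x, batch)       # spend `batch` more draws at x
        G += batch
        m, s = gp_at(x, s2e)                   # GP posterior given current precision
        if cdf(Normal(), (m - fmax) / s) < α   # PI(x) < α: stop early
            break
        end
    end
    return fhat, s2e, G                        # estimate, its variance, draws spent
end
```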

Although one can use any model to predict $G$, we will here fit a GP regression model to the logarithm of the number of MCMC draws, $\log G_j$ for $j = 1, \ldots, J$, in the $J$ previous evaluations

$$\log G_j = h(z_j) + \varepsilon_j, \quad \varepsilon_j \overset{iid}{\sim} N(0, \psi^2)$$
$$h \sim \mathcal{GP}(m_G(z), k_G(z, z')), \qquad (17)$$

where $z_j$ is a vector of covariates. The hyperparameters, $x_{1:J}$, themselves may be used as predictors of $\hat G(x)$, but also $D(x) = \hat m(x) - f_{\max}$ and $s(x)$ are likely to have predictive power for $G$, as well as $u(x) = (\hat m(x) - f_{\max})/s(x)$. We will use $z_j = \big(x_j, D^{(j)}(x_j), s^{(j)}(x_j), u^{(j)}(x_j)\big)$ in our applications, where the superscript $(j)$ denotes the BO iteration. The prediction for $G$ is taken to be $\hat G = \exp(m_G(z))$, which corresponds to the median of the log-normal posterior for $G$.
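A sketch of this effort model using GaussianProcesses.jl (the package used in this paper) is shown below; the covariates $z_j$ are stacked as the columns of Z. The constructor and kernel names follow the package's API but may differ slightly between versions, so treat this as an assumption-laden illustration.

```julia
using GaussianProcesses, Statistics

# GP regression for the effort model in Eq. (17): log Gⱼ on covariates zⱼ.
function fit_effort_gp(Z::AbstractMatrix, logG::AbstractVector)
    gp = GP(Z, logG, MeanConst(mean(logG)), Mat52Ard(zeros(size(Z, 1)), 0.0), -2.0)
    optimize!(gp)    # ML-II: maximize the GP marginal likelihood over kernel parameters
    return gp
end

# Ĝ(z) = exp(m_G(z)): the median of the log-normal predictive distribution for G.
predict_effort(gp, z::AbstractVector) = exp(predict_y(gp, reshape(z, :, 1))[1][1])
```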

We will use the term Bayesian Optimization with Optimized Precision (BOOP) for BO meth-

ods that optimize ãα (x) in (16), and more specifically BOOP-EI when EI is used as the baseline

acquisition function, a(x). The whole procedure is described in Algorithm 1.

Algorithm 1 Bayesian Optimization with Optimized Precision (BOOP)

input

• estimator $\hat f(x)$ of the function $f(x)$ to be maximized, and its standard error function $\sigma(G)$.

• $j_0$ initial points $x_{1:j_0} \equiv (x_1, \ldots, x_{j_0})$, a vector of corresponding function estimates, $\hat f(x_{1:j_0})$, and noise variances $\sigma^2(G_{1:j_0})$.

• baseline acquisition function $a(x)$, and early stopping thresholding probability $\alpha$.

for $j$ from $j_0 + 1$ until convergence do:

a) Fit the heteroscedastic GP for $f$ based on past evaluations

$$\hat f(x_{1:(j-1)}) = f(x_{1:(j-1)}) + \epsilon, \quad \epsilon \sim N(0, \Sigma_{1:(j-1)})$$
$$f(x) \sim \mathcal{GP}(m(x), k(x, x')),$$

where $\Sigma_{1:(j-1)} \equiv \mathrm{Diag}(\sigma^2(G_1), \ldots, \sigma^2(G_{j-1}))$.

b) Fit the GP for $\log G$ based on past evaluations

$$\log G_{1:(j-1)} = h(z_{1:(j-1)}) + \varepsilon, \quad \varepsilon \sim N(0, \psi^2 I)$$
$$h(z) \sim \mathcal{GP}(m_G(z), k_G(z, z')),$$

where the elements of $z$ are functions of $x$. Return the point prediction $\hat G_\alpha(x)$.

c) Maximize $\tilde a_\alpha(x) = a(x)/\hat G_\alpha(x)$ to select the next point, $x_j$.

d) Compute $\hat f(x_j)$ and $\sigma^2(G_j)$ by early stopping at thresholding probability $\alpha$.

e) Update the datasets in a) with $(x_j, \hat f(x_j), \sigma^2(G_j))$ and in b) with $(z_j, \log G_j)$.

Note that (13) assumes that fˆ(x) is an unbiased estimator at any x. This can be ensured by using

enough MCMC/importance sampling draws in the first marginal likelihood evaluation of BOOP. We

performed a small simulation exercise that shows that the Chib estimator is approximately unbiased

after a small number of iterations for the medium-sized VAR model in Section 5.4. As expected, we had to

use more initial draws in the large-scale VAR in Section 5.5 and the time-varying parameter VAR

with stochastic volatility in Section 6 to drive down the bias. See also Section 7 for some ideas on

how to extend BOOP to estimators where the bias is still sizeable in large MCMC samples.

Figure 3 illustrates the early stopping part of BOOP in a toy example. The first row illustrates

the first BOOP iteration and the columns show increasingly larger MCMC sample sizes ($G$). We can see that the 95% posterior interval after $G = 10$ MCMC draws at the current $x$ includes $f_{\max}^{(1)}$ (dotted orange line), the highest posterior mean of the function values observed so far; it is therefore worthwhile to increase the number of simulations for this $x$. Moving one graph to the right we see that after $G = 20$ simulations the 95% posterior interval still includes $f_{\max}^{(1)}$, and we move one more

graph to the right for G = 50. Here we conclude that the sampled point is almost certainly not

an improvement and we move on to a new evaluation point. The new evaluation point is found by

maximizing the BOOP-EI acquisition function in (16) with updated effort prediction function Ĝ(z)

in Equation 17 and is depicted by the violet dot in leftmost graph in the second row of Figure 3.

Following the progress in the second row, we see that it takes only G = 20 samples to conclude

that the function value is almost certainly lower than the current maximum of the posterior mean

at the second BO iteration. Finally, in the third row, we can see that the point is sampled with

high variance at the beginning, but as we increase G it becomes clear that this x is indeed an

improvement.

The code used in this paper is written in the Julia computing language (Bezanson et al., 2017),

making use of the GaussianProcesses.jl package for estimation of the Gaussian processes and the

Optim.jl package for the optimization of the acquisition functions.

Figure 3: Illustration of BOOP-EI implemented with early stopping. The rows correspond to iterations in the algorithm and the columns to different MCMC sample sizes. The blue line is the true f, the shaded regions are 95% posterior probability bands for f based on the (noisy) evaluations (black for past and violet for current). The orange crosshair marks the current maximum; see the text for details.

4 Simulation experiment

4.1 Simulation setup

We will here assess the performance of the proposed BOOP-EI algorithm in a simulation experiment

for the optimization of a non-linear function in a single variable. The simple one-dimensional setup

is chosen for presentational purposes, and more challenging higher-dimensional settings are likely to

show even larger advantages of our method compared to regular Bayesian optimization.

The function to be maximized is

$$f(x) = N(x \mid 0, 0.8^2) + N(x \mid 4, 0.75^2) + N(x \mid 0, 0.8^2) + N(x \mid 2, 0.6^2) + 0.05\, N(x \mid 2.2, 0.05^2) + 0.075\, N(x \mid 1.25, 0.1^2), \qquad (18)$$

where N (x|µ, σ 2 ) denotes the density function of a N (µ, σ 2 ) variable; the function is plotted in

Figure 4 (left). We further assume that f (x) can be estimated at any x from G noisy evaluations

by a simple Monte Carlo average

$$\hat f^{(G)}(x) = f(x) + \epsilon, \quad \epsilon \sim N\left( 0, \frac{g^2(x)}{G} \right), \qquad (19)$$

where $g(x) = 0.1 + 0.15\, N(x \mid 1, 0.5^2) + 0.15\, N(x \mid 2.5, 0.25^2) + 0.5\, N(x \mid 5, 0.75^2)$ is a heteroscedastic

standard deviation function that mimics the real world case where the variability of the marginal

likelihood estimate varies over the space of hyperparameters; g(x) is plotted in Figure 4 (middle).

Figure 4 (right) illustrates the noisy function evaluations (gray points) and the effect of Monte Carlo

averaging G = 3 evaluations for a given x (blue points). We assume for simplicity here that once

the algorithm decides to visit an x it will get access to a noise-free evaluation of the standard error

of the estimator.
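The simulation design in (18) and (19) is easy to reproduce; a minimal sketch, where `normpdf` is the N(m, s²) density and `fhat(x, G)` draws one estimate whose standard error shrinks at the 1/√G Monte Carlo rate:

```julia
# Density of N(m, s²) evaluated at x.
normpdf(x, m, s) = exp(-0.5 * ((x - m) / s)^2) / (s * sqrt(2π))

# True objective f, Eq. (18), and the heteroscedastic noise sd function g, Eq. (19).
f(x) = normpdf(x, 0, 0.8) + normpdf(x, 4, 0.75) + normpdf(x, 0, 0.8) +
       normpdf(x, 2, 0.6) + 0.05 * normpdf(x, 2.2, 0.05) + 0.075 * normpdf(x, 1.25, 0.1)
g(x) = 0.1 + 0.15 * normpdf(x, 1, 0.5) + 0.15 * normpdf(x, 2.5, 0.25) +
       0.5 * normpdf(x, 5, 0.75)

# One noisy evaluation of f at x: averaging G draws gives noise sd g(x)/sqrt(G).
fhat(x, G) = f(x) + randn() * g(x) / sqrt(G)
```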

Figure 4: The function f that we want to maximize (left) and the function g that controls the sampling variance over x (middle). The figure to the right shows estimates for G = 1 sample (grey dots) and G = 3 samples (blue dots), the mean function (red line) and 2 standard deviation error bands (pink lines).

4.2 Illustration of a single run of the algorithms

Figure 5: GP posterior after 25 Bayesian optimization iterations using BO-EI (left) and BOOP-EI (right) to optimize
f . The black lines are the posterior means of f (x).

Before we conduct a small Monte Carlo experiment it is illustrative to look at the results from

a single run of the EI and BOOP-EI algorithms. Figure 5 highlights the difference between the

algorithms by showing the GP posterior after 25 Bayesian optimization iterations. The EI algorithm

has clearly wasted computations to needlessly drive down the uncertainty at (useless) low function

values, while BOOP-EI tolerates larger uncertainty in such function locations.

4.3 Simulation study

We now compare the performance of a standard EI approach using a heteroscedastic GP with our

BOOP-EI approach in a small simulation study. The methods will be judged by their ability to find

the optimum using as few evaluations as possible. The performance will be evaluated by a Monte

Carlo study where we simulate 1000 replications from each model under each simulation scenario.

We investigate Bayesian optimization with EI using 100 and 500 samples in each iteration; BOOP-EI is allowed to stop the sampling at any time before the number of samples for the EI is reached. In

each simulation we set an upper bound of 120 Bayesian optimization iterations.

Figure 6: The evolution of fmax − max f (x), i.e. the difference between the current maximum fmax and the true
maximum of the function (vertical axis), as a function of the total number of the MCMC draws consumed up to the
current BO/BOOP iteration. Note that the number of MCMC draws per BOOP iteration is variable due to early
stopping. The lefthand graph uses a maximum of 100 MCMC draws per BO iteration while the righthand graph uses
500 draws. The shaded areas are the one standard deviation probability bands in the distribution of fmax − max f (x)
over the replicate runs, and the solid and dashed lines are the mean and median, respectively.

We can see from Figure 6 that BOOP finds the maximum using a smaller total number of samples

in both scenarios and that the difference increases with the number of samples used when forming

the estimator. We can also see from the median fmax that both algorithms can get stuck for a while

at the second greatest local maximum (which is approximately 0.05 lower than the global maximum).

However, BOOP gets out of the local optimum faster since it has the option to try cheap noisy

evaluations and will therefore explore other parts of the function earlier than basic BO-EI. This

effect seems to increase as we allow for a higher number of samples since it lowers the relative price

of cheaper evaluations.

5 Application to the steady-state BVAR

In this section, we use BOOP-EI to estimate the prior hyperparameters of the steady-state BVAR

of Villani (2009). Giannone et al. (2015) show that finding the right values for the hyperparameters

in BVARs can significantly improve forecasting performance. Moreover, Bańbura et al. (2010) show

that different degrees of shrinkage (controlled by the hyperparameters) are necessary under different

model specifications.

5.1 The steady-state BVAR

The steady-state BVAR model of Villani (2009) is given by:

$$\Pi(L)(y_t - \Psi x_t) = \varepsilon_t, \quad \varepsilon_t \overset{iid}{\sim} N(0, \Sigma), \qquad (20)$$

where E[yt ] = Ψxt . In particular, if we assume that xt = 1 for all t, then Ψ has the interpretation

as the overall mean of the process. We take the prior distribution to be:

$$p(\Sigma) \propto |\Sigma|^{-(n+1)/2}$$
$$\mathrm{vec}(\Pi) \sim N(\theta_\Pi, \Omega_\Pi) \qquad (21)$$
$$\Psi \sim N(\theta_\Psi, \Omega_\Psi),$$

where θ Ψ and ΩΨ are the mean and covariance matrix for the steady states. The prior covariance

matrix for the dynamics, ΩΠ , is constructed using



$$\omega_{ii} = \begin{cases} \dfrac{\theta_1^2}{(l^{\theta_3})^2}, & \text{for own lag } l \text{ of variable } r,\; i = (l-1)n + r, \\[2ex] \dfrac{(\theta_1 \theta_2 s_r)^2}{(l^{\theta_3} s_j)^2}, & \text{for cross-lag } l \text{ of variable } r \neq j,\; i = (l-1)n + j, \end{cases} \qquad (22)$$

where $\omega_{ii}$ are the diagonal elements of $\Omega_\Pi$. We also assume prior independence, following Villani (2009). The hyperparameters that we optimize over are: the overall-shrinkage parameter $\theta_1$, the cross-lag shrinkage $\theta_2$, and the lag-decay parameter $\theta_3$.
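To make (22) concrete, here is a small sketch that returns the prior variance of the coefficient on lag l of variable j in the equation for variable r, given θ = (θ₁, θ₂, θ₃) and the vector s of estimated standard deviations (the function name is ours, for illustration):

```julia
# Prior variance ω for a VAR coefficient, Eq. (22): own lags get θ₁²/l^(2θ₃);
# cross-lags are further shrunk by θ₂² and scaled by relative standard deviations.
function prior_variance(θ1, θ2, θ3, l, r, j, s)
    if r == j                                        # own lag of variable r
        return θ1^2 / l^(2θ3)
    else                                             # cross-lag: variable j in equation r
        return (θ1 * θ2 * s[r])^2 / (l^θ3 * s[j])^2
    end
end
```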

The posterior distribution of the steady-state BVAR model parameters can be sampled with a

simple Gibbs sampling scheme (Villani, 2009). The marginal likelihood, together with its empirical

standard error, can be estimated by the method in Chib (1995).

5.2 Data, prior and model settings

Table 1 describes the data used in our applications, which are also used in Giannone et al. (2015). It

contains 23 macroeconomic variables for which two subsets are selected to represent a medium-sized

model with 7 variables and a large model that contains 22 of the variables (real investment is ex-

cluded). Before the analysis, the consumer price index and the five-year bond rate were transformed

from monthly to quarterly frequency. All series are transformed such that they become stationary

according to the augmented Dickey-Fuller test. This is necessary for the data to be consistent with

the prior assumption of a steady state. The number of lags is chosen according to the HQ criterion (Hannan and Quinn, 1979; Quinn, 1980). This resulted in p = 2 lags for the medium-sized model, which we also use for the large model.

We set the prior mean of the coefficient matrix, Π, to values that reflect some persistence on the

first lag, but also that all the time series are stationary; e.g. the prior mean on the first lag of the

FED interest rate and the GDP-deflator is set to 0.6, while others are set to zero in the medium-sized

model. Lags longer than 1 and cross-lags all have zero prior means. The priors for the steady-states

are set informatively to the values listed in Table 1; these values follow suggestions from the literature for most variables, see e.g. Louzis (2019) and Österholm (2012). There were a few variables where we could not find theoretical values for either the mean or the standard deviation; in those cases,

we set them close to their empirical counterparts.

Table 1: Data Description

Variable names and transformations

| Variables | Mnemonic (FRED) | Transform | Medium | Freq. | Prior |
| --- | --- | --- | --- | --- | --- |
| Real GDP | GDPC1 | 400 × diff-log | x | Q | (2.5; 3.5) |
| GDP deflator | GDPCTPI | 400 × diff-log | x | Q | (1.5; 2.5) |
| Fed funds rate | FEDFUNDS | - | x | Q | (4.3; 5.7) |
| Consumer price index | CPIAUCSL | 400 × diff-log | | M | (1.5; 2.5) |
| Commodity prices | PPIACO | 400 × diff-log | | Q | (1.5; 2.5) |
| Industrial production | INDPRO | 400 × diff-log | | Q | (2.3; 3.7) |
| Employment | PAYEMS | 400 × diff-log | | Q | (1.5; 2.5) |
| Employment, service sector | SRVPRD | 400 × diff-log | | Q | (2.5; 3.5) |
| Real consumption | PCECC96 | 400 × diff-log | x | Q | (2.3; 3.7) |
| Real investment | GPDIC1 | 400 × diff-log | x | Q | (1.5; 4.5) |
| Real residential investment | PRFIx | 400 × diff-log | | Q | (1.5; 4.5) |
| Nonresidential investment | PNFIx | 400 × diff-log | | Q | (1.5; 4.5) |
| Personal consumption expenditure, price index | PCECTPI | 400 × diff-log | | Q | (1.5; 4.5) |
| Gross private domestic investment, price index | GPDICTPI | 400 × diff-log | | Q | (1.5; 4.5) |
| Capacity utilization | TCU | - | | Q | (79.3; 80.7) |
| Consumer expectations | UMCSENTx | diff | | Q | (-0.5; 0.5) |
| Hours worked | HOANBS | 400 × diff-log | x | Q | (2.5; 3.5) |
| Real compensation/hour | AHETPIx | 400 × diff-log | x | Q | (1.5; 2.5) |
| One year bond rate | GS1 | diff | | Q | (-0.5; 0.5) |
| Five years bond rate | GS5 | diff | | M | (-0.5; 0.5) |
| SP 500 | S&P 500 | 400 × diff-log | | Q | (-2; 2) |
| Effective exchange rate | TWEXMMTH | 400 × diff-log | | Q | (-1; 1) |
| M2 | M2REAL | 400 × diff-log | | Q | (5.5; 6.5) |

The table shows the 23 US macroeconomic time series from the FRED database used in the empirical analysis. The column named Prior contains the steady-state mean ± one standard deviation.

5.3 Experimental setup

We consider three competing optimization strategies: (I) an exhaustive grid-search, (II) Bayesian

optimization with the EI acquisition function (BO-EI), and (III) our BOOP-EI algorithm. In each

approach, we use the restrictions θ1 ∈ (0, 5), θ2 ∈ (0, 1), and θ3 ∈ (0, 5). In the grid search, θ1 and θ2 move in steps of 0.05 and θ3 moves in steps of 0.1, yielding 100 × 20 × 50 = 100,000 marginal likelihood evaluations in total. For the Bayesian optimization algorithm, we set the number of evaluations to 250, and

we use three random draws as initial values for the GPs.

For strategies (I) and (II) we use a total of 10000 Gibbs iterations with 1000 as a burn-in sample

in each model evaluation. For (III) we first draw 1100 Gibbs samples where we discard the first

1000 as burn-in and use the rest to calculate the probability of improvement PI, to ensure that the

estimated marginal likelihood will be approximately unbiased; Figure 7 shows that Chib’s estimate

is unbiased already after a few hundred samples. If PI < α we stop early and move on to the next

BO iteration; otherwise we generate a new batch (of size 100) of Gibbs samples and again check the PI criterion. The total number of Gibbs iterations will therefore vary between 1100 and 10,000 in

each of the 250 BOOP-iterations for the medium-sized model. Note that Chib’s estimator uses an

estimate of the parameter for the so-called reduced Gibbs sampler run. This point estimate should

preferably have a high posterior density for efficiency reasons, see Chib (1995). The medium-sized

model uses only 100 posterior samples to obtain high-density parameters for calculating Chib’s log

marginal likelihood, which is enough in our set-up. For the large model, we use 5000 burn-in samples

and 500 simulations in the first batch, which gives between 5500 and 10000 MCMC iterations per

evaluation point. The 5000 burn-in is likely to be excessive but is used to be conservative; a small

number would make BOOP even faster in comparison to regular BO. The application is robust to

the choice of α, as long as it is a reasonably low number, in this study we use α = 0.001.


Figure 7: Unbiasedness of Chib’s log marginal likelihood estimator in the steady-state BVAR application. The
horizontal axis denotes the number of MCMC draws (excluding 50 observations as burn-in), the blue dots are draws
from the sampling distribution of Chib’s estimator for a given MCMC sample size. The red line represents the mean
of the draws and the blue line represents the true log marginal likelihood, obtained from 100 000 MCMC iterations
with 5000 as a burn-in.

For comparison, we will also use the standard values of the hyperparameters used in e.g. the

BEAR-toolbox, Dieppe et al. (2016), θ1 = 0.1, θ2 = 0.5, and θ3 = 1, as a benchmark. The methods

are compared with respect to i) the obtained marginal likelihood, and ii) how much computational

resources were spent in the optimization.

5.4 Results for the medium-scale steady state VAR model

Table 2: Optimization Results Medium Steady-State BVAR.

| | Standard | BO-EI | BOOP-EI | Grid |
| --- | --- | --- | --- | --- |
| Log ML | −3078.54 | −3052.13 | −3052.06 | −3052.08 |
| Gibbs iterations | | 2.75 × 10⁶ | 443,660 | 10⁹ |
| CPU-time (minutes) | | 74 | 56 | |
| Model evaluations | | 250 | 250 | 10⁵ |
| θ₁ | 0.1 | 0.26 | 0.27 | 0.3 |
| θ₂ | 0.5 | 0.37 | 0.41 | 0.4 |
| θ₃ | 1 | 0.69 | 0.76 | 0.9 |

The table compares different methods for hyperparameter optimization in the medium-scale steady-state BVAR. Each method is run 10 times and the reported hyperparameters for each method are the best ones over the 10 runs, rounded to two decimals. The marginal likelihoods of the selected models were re-estimated using 200,000 Gibbs iterations with 40,000 as a burn-in. The duration measure is an average over the 10 runs.

Table 2 summarizes the results from ten runs of the algorithms for the medium size BVAR model.

We see that all three optimization strategies find hyperparameters that yield substantially higher

log marginal likelihood than the standard values. We can also see that both Bayesian optimization

methods yield as good hyperparameters as the grid search at only a small fraction of the computa-

tional cost. It is also clear from Table 2 that a substantial amount of computations associated with

the MCMC are saved when using BOOP. It is interesting to note that the values for θ1 and θ2 are

similar for all three optimization approaches but that θ3 differs to some extent. This is due to the

flatness of the log marginal likelihood in that area.

Figure 8: Comparison of the convergence speed of the Bayesian optimization methods as a function of the number of
MCMC draws (left) and computing time (right).

The left graph of Figure 8 shows that BOOP-EI finds higher values of the log marginal likelihood

using much fewer MCMC iterations than plain BO with EI acquisitions. From Table 2 we can see

that BOOP-EI uses, on average, less than a fifth of the MCMC iterations compared to BO-EI for a

full run. It is interesting to note that BO-EI leads (on average) to a higher number of improvements on the way to the maximum, while BOOP-EI gives fewer improvements but of larger magnitude;

the strategy of cheaply exploring new territories before locally optimizing the function pays off. The

graph to the right in Figure 8 shows that for this application BO-EI is quicker in terms of CPU time

to reach fairly high values for the log marginal likelihood. We see at least two explanations for this:

first, BOOP-EI tries to explore more unknown territories since they are presumed to be cheap while

BO-EI more greedily focuses on local optimization. Second, and more importantly, the overhead

cost associated with the BOOP-EI acquisition is relatively large in this medium-sized application

where the cost of evaluating the marginal likelihood itself is not excessive. The fact that BOOP

can heavily reduce the number of MCMC draws while still giving similar CPU computational time

suggests that it is most useful in cases where each log marginal likelihood evaluation is expensive;

this will be demonstrated in the more computationally demanding models in Sections 5.5 and 6.

Figure 9 displays the log marginal likelihood surfaces over the grid of (θ1 , θ2 )-values used in

the grid search. Each sub-graph is for a fixed value of θ3 ∈ {0.76, 1, 2}. The red dot indicates

the predicted maximum log marginal likelihood for the given θ3 , and the black dot in the middle

sub-figure indicates the standard values. We can see that the standard values are located outside

the high-density region, relatively far away from the maximum. A comparison of Figures 9 and 10

shows that the GP’s predicted log marginal likelihood surface is quite accurate already after merely

250 evaluations; this is quite impressive considering that Bayesian optimization tries to find the

maximum in the fastest way, and does not aim to have high precision in low-density regions.

Figure 9: Log marginal likelihood surfaces over a fine grid of (θ₁, θ₂)-values. The hyperparameter values for the lag decay are θ₃ = 0.76, 1, and 2 (left to right). The red dot denotes the maximum log marginal likelihood value for the given θ₃ and the black dot, in the middle plot, shows the standard values.

Figure 10: GP predictions of the hyperparameter surfaces in Figure 9 based on 250 evaluations for one BOOP-EI run. The hyperparameter values for the lag decay are θ₃ = 0.76, 1, and 2 (left to right). The red dot indicates the highest predicted value in each sub-plot and the black dot, in the middle plot, shows the standard values.

5.5 Results for the large-scale steady state VAR model

We also optimize the parameters of the more challenging large BVAR model containing the 22

different time series, using 250 iterations for both BO-EI and BOOP-EI. A complete grid search is

too costly here, so we instead compare with parameters obtained from BOOP in the medium-sized

BVAR in Section 5.4, which is a realistic strategy in practical work.

Table 3: Optimization Results Large Steady-State BVAR.

| | Standard | BO-EI | BOOP-EI | Medium BVAR |
| --- | --- | --- | --- | --- |
| Log ML | −7576.31 | −7402.50 | −7401.09 | −7532.61 |
| Sd log ML | 0.54 | 0.81 | 0.16 | 0.49 |
| Gibbs iterations | | 3.75 × 10⁶ | 1.8 × 10⁶ | |
| CPU-time (hours) | | 64.90 | 20.22 | |
| θ₁ | 0.1 | 0.47 | 0.56 | 0.27 |
| θ₂ | 0.5 | 0.06 | 0.05 | 0.41 |
| θ₃ | 1 | 1.46 | 1.51 | 0.76 |

Hyperparameter optimization in the large-scale steady-state BVAR. The column named "Medium BVAR" contains the values obtained from using BOOP-EI for the medium size model. Both optimization methods were run 5 times and the reported hyperparameters for each method are the best ones over the 5 runs, rounded to two decimals. The marginal likelihoods of the selected models were re-estimated using 100,000 Gibbs iterations with 40,000 as a burn-in. The duration measures are averages over the 5 runs.

Table 3 shows that our method, again, finds optimal hyperparameters with dramatically larger

log ML than standard values, and also substantially better values than those that are optimal for the

medium-scale BVAR. Finally, note that the hyperparameters selected by BOOP-EI in the large-scale

BVAR are quite different from those in the medium-scale model. The optimal θ1 applies less baseline

shrinkage than before, but the lag decay (θ3 ) is higher, and in particular, the cross-lag shrinkage,

θ2 , is much closer to zero, implying much harder shrinkage towards univariate AR-processes. This

latter result strongly suggests that the computationally attractive conjugate prior structure is a

highly sub-optimal solution since such a prior requires that θ2 = 1. We can see that for this more

computationally demanding model BOOP-EI is much faster and finishes, on average, in a third of the time of the regular BO-EI strategy. Figure 11 shows the predicted log marginal likelihood surface

obtained from the last GP in a BOOP run. The rightmost graph conditions on θ3 = 1.51, which is

optimal for BOOP-EI, so this graph therefore has the GP with the highest accuracy.

Figure 11: GP predictions of the hyperparameter surfaces for the large BVAR based on 250 evaluations for one BOOP-EI run. The hyperparameter values for the lag decay are θ₃ = 0.76 (left graph, optimal in the medium-size BVAR), θ₃ = 1 (middle graph, standard value) and θ₃ = 1.51 (right graph, optimal for BOOP-EI). The red dot indicates the highest predicted value in each sub-plot. The orange dot in the leftmost plot shows the hyperparameters obtained from BOOP in the medium-sized model and the white dot in the middle plot shows the standard values.

6 Time-varying parameter BVAR with stochastic volatility

6.1 Model and setup

The time-varying parameter BVAR with stochastic volatility (TVP-SV BVAR) in Chan and Eisen-

stat (2018) is given by

$$B_{0t} y_t = \mu_t + B_{1t} y_{t-1} + \cdots + B_{pt} y_{t-p} + \varepsilon_t, \quad \varepsilon_t \sim N(0, \Sigma_t), \qquad (23)$$

where $\mu_t$ is a vector of time-varying intercepts, $B_{1t}, \ldots, B_{pt}$ are $n \times n$ matrices of VAR coefficients, and $B_{0t}$ is an $n \times n$ lower triangular matrix with ones on the main diagonal. The evolution of $\Sigma_t = \mathrm{diag}(\exp(h_{1t}), \ldots, \exp(h_{nt}))$ is modelled by the vector of log volatilities, $h_t = (h_{1t}, \ldots, h_{nt})^\top$, evolving as a random walk

$$h_t = h_{t-1} + \zeta_t, \quad \zeta_t \overset{iid}{\sim} N(0, \Sigma_h), \qquad (24)$$

where $\Sigma_h = \mathrm{diag}(\sigma_{h_1}^2, \ldots, \sigma_{h_n}^2)$ and the starting values in $h_0$ are parameters to be estimated. Following Chan and Eisenstat (2018), we collect all parameters of $\mu_t$ and the $B_{it}$ matrices in a $k_\gamma$-dimensional vector $\gamma_t$ and write the model in state space form as

$$y_t = X_t \gamma_t + \varepsilon_t, \quad \varepsilon_t \sim N(0, \Sigma_t)$$
$$\gamma_t = \gamma_{t-1} + \eta_t, \quad \eta_t \sim N(0, \Sigma_\gamma), \qquad (25)$$

where $X_t$ contains both current and lagged values of $y$, $\Sigma_\gamma = \mathrm{diag}(\sigma_{\gamma_1}^2, \ldots, \sigma_{\gamma_{k_\gamma}}^2)$, and the initial values for the state variables follow $\gamma_0 \sim N(a_\gamma, V_\gamma)$ and $h_0 \sim N(a_h, V_h)$.

For comparability we choose to use the same prior setup as in Chan and Eisenstat (2018), where the variances of the state innovations follow independent inverse-gamma distributions: $\sigma_{\gamma_i}^2 \sim IG(\nu_{\gamma_0}, S_{\gamma_0})$ if $\gamma_i$ is an intercept, $\sigma_{\gamma_i}^2 \sim IG(\nu_{\gamma_1}, S_{\gamma_1})$ if $\gamma_i$ is a VAR coefficient, and $\sigma_h^2 \sim IG(\nu_h, S_h)$ for the innovations to the log variances. Following Chan and Eisenstat (2018) we set $a_\gamma = 0$, $V_\gamma = 10 \cdot I_{k_\gamma}$, $a_h = 0$, $V_h = 10 \cdot I_n$, and $\nu_{\gamma_0} = \nu_{\gamma_1} = \nu_h = 5$, but we use Bayesian optimization to find the optimal values for the three key hyperparameters $S_{\gamma_0}$, $S_{\gamma_1}$ and $S_h$, which control the degree of time variation in the states. We collect the three optimized hyperparameters in the vector $\theta = (\theta_1, \theta_2, \theta_3)^\top$, where $\theta_1 = S_{\gamma_0}$, $\theta_2 = S_{\gamma_1}$ and $\theta_3 = S_h$. We optimize over the domain $\{\theta : 0 \le \theta_1 \le 5,\ 0 \le \theta_2 \le 1,\ 0 \le \theta_3 \le 5\}$, which allows for all cases from no time variation in any parameter to high time variation in all the model parameters.

To estimate the marginal likelihood, Chan and Eisenstat (2018) first obtain posterior draws

of γ, h, Σγ , Σh , γ 0 , h0 using Gibbs sampling, which are then used to design efficient importance

sampling proposals. The marginal likelihood is

$$p(y) = \int p(y|\gamma, h, \psi)\, p(\gamma|\psi)\, p(h|\psi)\, p(\psi)\, d\gamma\, dh\, d\psi, \qquad (26)$$

where ψ collects all the static parameters in Σγ , Σh , γ 0 , h0 . The inner integral w.r.t. γ can be

solved analytically and afterwards, we can integrate out h using importance sampling to obtain an

estimate of the integrated likelihood

$$p(y|\psi) = \int p(y|\gamma, h, \psi)\, p(\gamma|\psi)\, p(h|\psi)\, d\gamma\, dh. \qquad (27)$$

The last step is to integrate out the fixed parameters from the integrated likelihood

$$p(y) = \int p(y|\psi)\, p(\psi)\, d\psi, \qquad (28)$$

which is done using another importance sampler. The two nested importance samplers put the

algorithm in the framework of importance sampling squared (IS², Tran et al. (2013)). The Chan-

Eisenstat algorithm is elegantly designed, but necessarily computationally expensive with a single

estimate of the marginal likelihood taking 205 minutes in MATLAB on a standard desktop (Chan

and Eisenstat, 2018). We call their MATLAB code from Julia using the MATLAB.jl package,

illustrating that BOOP can plug in any marginal likelihood estimator. However, we found that the

standard errors in Chan and Eisenstat (2018) can be more robustly estimated using the bootstrap

and we have done so here. The cost of bootstrapping the standard errors only has to be taken once

for every Bayesian optimization iteration, and this cost is negligible compared to the computation

of the log marginal likelihood estimate.
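A sketch of the bootstrap we have in mind: given the vector of importance sampling weights behind the estimate log(mean(w)), resample the weights with replacement B times and take the standard deviation of the replicated log marginal likelihood estimates. The function is illustrative and not part of the Chan-Eisenstat code.

```julia
using Random, Statistics

# Bootstrap standard error of the log marginal likelihood estimate log(mean(w)),
# where w holds the (unnormalized) importance sampling weights on the raw scale.
function bootstrap_se_logml(w::AbstractVector; B = 1_000)
    n = length(w)
    reps = [log(mean(w[rand(1:n, n)])) for _ in 1:B]   # resample, re-estimate
    return std(reps)
end
```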

We use quarterly data for the GDP-deflator, real GDP, and the short-term interest rate in the

USA from 1954Q3 to 2014Q4 from Chan and Eisenstat (2018) for comparability. In addition, we

make use of their MATLAB code (with minor adjustments) for computing the marginal likelihood. This shows another strength of the BOOP approach: it works on top of existing code.

We fix the number of Gibbs sampling iterations and the burn-in period to 20000 and 5000 re-

spectively for both BO-EI and BOOP-EI in all evaluation points. This simplifies the implementation

and does not make a practical difference since the main part of the computational cost is spent on

the log marginal likelihood estimate from importance sampling. For BO-EI we use 5000 log marginal

likelihood evaluations in each new evaluation point, while BOOP-EI starts with 1000 importance

sampling draws and then takes batches of size 100 until a maximum of 5000 samples has been reached.

The initial 1000 draws were enough to make the estimator approximately unbiased.

6.2 Results for the TVP-SV BVAR

Table 4 shows the optimized log marginal likelihood from three independent runs of BO-EI and

BOOP-EI; the hyperparameter values used in Chan and Eisenstat (2018) are shown for reference.

As expected both BO and BOOP find better hyperparameters than the ones in Chan and Eisenstat

(2018); this is particularly true for BOOP which gives an increase in the marginal likelihood of

more than 10 units on the log scale on average. Interestingly, both BO and BOOP suggest that

the stochastic volatilities should be allowed to move more freely than in Chan and Eisenstat (2018),

but that there should be less time variation in the intercepts and VAR coefficients. This points

in the same direction as the results in Chan and Eisenstat (2018) who find that shutting down

the time variation in the intercept and VAR coefficients actually increases the marginal likelihood.

Our results indicate that when carefully selecting the shrinkage parameters by optimization, the

VAR-dynamics should in fact be allowed to evolve over time, but at a slower pace.

Table 4 shows a great deal of variability between runs, in particular for BO. Figure 12 shows

that this is probably because the hyperparameter surface is substantially more complicated and

multimodal than for the steady-state BVAR. We can also see that the log marginal likelihood is

relatively insensitive to changes in θ3 around the mode while it is very sensitive to changes in θ2 .

Figure 12: Predicted log marginal likelihood over the hyperparameters for the stochastic volatility and the VAR dynamics for θ₁ = 0.0086 (left), 0.05 (middle) and 0.1 (right). The mode in each plot is marked by a red point. A distant local optimum is also marked by an orange point.

Table 4: Results for the TVP-BVAR with stochastic volatility for three independent runs of BO and BOOP. CE is taken from Table 3 in Chan and Eisenstat (2018). The row labeled SE shows the numerical standard errors. Runs were stopped if there was no improvement during the last 50 hours.

                      CE        BO1       BO2       BO3       BOOP1     BOOP2     BOOP3
log ML             −1180.2   −1169.25  −1170.57  −1178.34  −1167.32  −1172.92  −1168.49
SE                    0.12      0.89      0.49      0.32      1.24      0.47      1.60
θ₁ × 10³             40        19.05      8.66     29.53      7.65     12.22     15.14
θ₂ × 10⁵             40         9.81     10.65     11.07     10.26      7.06      8.70
θ₃ × 10³             40        77.56    119.07     25.04     73.81     25.12    114.42
Iterations            –        67        35        46        81        44       157
CPU time (hours)      –        83.40     42.47     56.25     34.90     22.49     77.89

7 Concluding remarks

We propose a new Bayesian optimization method for finding optimal hyperparameters in econometric

models. The method can be used to optimize any noisy function where the precision is under the

control of the user. We focus on the common situation of maximizing a marginal likelihood evaluated

by MCMC or importance sampling, where the precision is determined by the number of MCMC or

importance sampling draws. The ability to choose the precision makes it possible for the algorithm

to take occasional cheap and noisy evaluations to explore the marginal likelihood surface, thereby

finding the optimum faster.

We assess the performance of the new algorithm by optimizing the prior hyperparameters in two widely used models: the time-varying parameter BVAR with stochastic volatility, and the steady-state BVAR in both a medium-sized and a large-scale version. The method is shown to be practical and competitive with other approaches in that it finds the optimum using a substantially smaller computational budget, and it has the potential to become part of the standard toolkit for BVARs. We have focused on optimizing the marginal likelihood, but the method is directly applicable to other score functions, e.g. the popular log predictive score (Geweke and Keane, 2007; Villani et al., 2012).

Our approach builds on the assumption that the noisy estimates of the log marginal likelihood are approximately unbiased, which we verify to be reasonable in the three applications, provided that the first BOOP evaluation is based on a marginal likelihood estimator using enough MCMC draws.
The unbiasedness of the log marginal likelihood estimate will, however, depend on the combination of MCMC sampler and marginal likelihood estimator; see Adolfson et al. (2007) for some evidence from Dynamic Stochastic General Equilibrium (DSGE) models (An and Schorfheide, 2007). For example, the simulations in Adolfson et al. (2007) suggest that the independence Metropolis-Hastings sampler combined with the Chib and Jeliazkov (2001) estimator is nearly unbiased, whereas the random walk Metropolis algorithm combined with the modified harmonic mean estimator (Geweke, 1999) can be severely biased unless the posterior sample is extremely large. It would therefore be interesting to extend the method to cases with biased evaluations, where the marginal likelihood estimates are persistent and only slowly approach the true marginal likelihood. Since the marginal likelihood trajectory over MCMC iterations is rather smooth (Adolfson et al., 2007), one could try to predict its evolution and then correct the bias in the marginal likelihood estimates.
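To illustrate the mechanics of this idea, a minimal sketch is given below: it fits a simple 1/n decay to the running log marginal likelihood estimates by least squares and uses the fitted asymptote as a bias-corrected value. Both the functional form and the numbers are assumptions made purely for illustration; nothing here is taken from Adolfson et al. (2007).

```julia
using LinearAlgebra  # least-squares solve with \

# Illustrative bias correction: fit logml(n) ≈ a + b/n to running log-ML
# estimates and report the asymptote a (the predicted value as n → ∞).
# The 1/n decay is an assumed functional form, not an established result.
function extrapolate_logml(ns::Vector{Int}, logmls::Vector{Float64})
    X = hcat(ones(length(ns)), 1 ./ Float64.(ns))  # design matrix [1  1/n]
    a, b = X \ logmls                              # least-squares coefficients
    return a
end

# Made-up running estimates at increasing numbers of posterior draws.
ns     = [1000, 2000, 5000, 10000, 20000]
logmls = [-1183.1, -1181.6, -1180.8, -1180.5, -1180.3]
println(extrapolate_logml(ns, logmls))             # ≈ bias-corrected log-ML
```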

References

Adolfson, M., Lindé, J., and Villani, M. (2007). Bayesian analysis of DSGE models-some comments.

Econometric Reviews, 26(2-4):173–185.

An, S. and Schorfheide, F. (2007). Bayesian analysis of DSGE models. Econometric Reviews, 26(2-4):113–172.

Bańbura, M., Giannone, D., and Reichlin, L. (2010). Large Bayesian vector auto regressions. Journal

of Applied Econometrics, 25(1):71–92.

Bezanson, J., Edelman, A., Karpinski, S., and Shah, V. B. (2017). Julia: A fresh approach to

numerical computing. SIAM review, 59(1):65–98.

Brochu, E., Cora, V. M., and De Freitas, N. (2010). A tutorial on Bayesian optimization of expensive

cost functions, with application to active user modeling and hierarchical reinforcement learning.

arXiv preprint arXiv:1012.2599.

Carriero, A., Kapetanios, G., and Marcellino, M. (2012). Forecasting government bond yields with

large Bayesian vector autoregressions. Journal of Banking & Finance, 36(7):2026–2047.

Chan, J. C. and Eisenstat, E. (2018). Bayesian model comparison for time-varying parameter VARs with stochastic volatility. Journal of Applied Econometrics, 33(4):509–532.

Chib, S. (1995). Marginal likelihood from the Gibbs output. Journal of the American Statistical

Association, 90(432):1313–1321.

Chib, S. and Jeliazkov, I. (2001). Marginal likelihood from the Metropolis-Hastings output. Journal

of the American Statistical Association, 96(453):270–281.

Dieppe, A., Legrand, R., and Van Roye, B. (2016). The BEAR toolbox.

Doan, T., Litterman, R., and Sims, C. (1984). Forecasting and conditional projection using realistic prior distributions. Econometric Reviews, 3(1):1–100.

Doucet, A., De Freitas, N., Gordon, N. J., et al. (2001). Sequential Monte Carlo methods in practice.

Springer.

Geweke, J. (1999). Using simulation methods for Bayesian econometric models: inference, development, and communication. Econometric Reviews, 18(1):1–73.

Geweke, J. and Keane, M. (2007). Smoothly mixing regressions. Journal of Econometrics,

138(1):252–290.

Giannone, D., Lenza, M., and Primiceri, G. E. (2015). Prior selection for vector autoregressions.

Review of Economics and Statistics, 97(2):436–451.

Hannan, E. J. and Quinn, B. G. (1979). The determination of the order of an autoregression. Journal

of the Royal Statistical Society: Series B (Methodological), 41(2):190–195.

Karlsson, S. (2013). Forecasting with Bayesian vector autoregression. In Handbook of Economic

Forecasting, volume 2, pages 791–897. Elsevier.

Louzis, D. P. (2019). Steady-state modeling and macroeconomic forecasting quality. Journal of

Applied Econometrics, 34(2):285–314.

Matérn, B. (1960). Spatial variation. Medd. fr. St. Skogsf. Inst. 49 (5). Reprinted in Lecture Notes

in Statistics no. 36.

Österholm, P. (2012). The limited usefulness of macroeconomic Bayesian VARs when estimating

the probability of a US recession. Journal of Macroeconomics, 34(1):76–86.

Primiceri, G. E. (2005). Time varying structural vector autoregressions and monetary policy. The

Review of Economic Studies, 72(3):821–852.

Quinn, B. G. (1980). Order determination for a multivariate autoregression. Journal of the Royal

Statistical Society: Series B (Methodological), 42(2):182–185.

Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951–2959.

Tran, M.-N., Scharth, M., Pitt, M. K., and Kohn, R. (2013). Importance sampling squared for Bayesian inference in latent variable models. arXiv preprint arXiv:1309.3339.

Villani, M. (2009). Steady-state priors for vector autoregressions. Journal of Applied Econometrics,

24(4):630–650.

Villani, M., Kohn, R., and Nott, D. J. (2012). Generalized smooth finite mixtures. Journal of

Econometrics, 171(2):121–133.

Williams, C. K. and Rasmussen, C. E. (2006). Gaussian Processes for Machine Learning, volume 2. MIT Press, Cambridge, MA.
