Bayesian Inference



Bayesian Inference
http://dx.doi.org/10.5772/66264
Edited by Javier Prieto Tejedor

Contributors
Claudia Ceci, Katia Colaneri, Sherif Sherif, Ishan Maduranga Wickramasingha, Michael Sobhy, Anamaria Berea, Daniel
Maxwell, Hafedh Ben Zaabza, Abderrahmen Ben Gara, Boulbaba Rekik, Solomon Tesfamicael, Steven Kim, Tai VoVan,
Poom Kumam, Plern Saipara, Cristiano Premebida, Diego Faria, Francisco Souza, Loc Nguyen, Alonso Ortega, Gorka
Navarrete, Xin Tong, Dingjing Shi, Mengxi Chen, Sien Chen, Jie Cheng, Junjiang Zhong, Wenqiang Huang, Hamid El
Maroufy, El Houcine Hibbah, Abdelmajid Zyad, Taib Ziad, Naveen Bansal, Jørund Gasemyr, Bent Natvig, Valeria
Sambucini, Pablo Emilio Verde

Published by InTech
Janeza Trdine 9, 51000 Rijeka, Croatia

© The Editor(s) and the Author(s) 2017


The moral rights of the editor(s) and the author(s) have been asserted.
All rights to the book as a whole are reserved by InTech. The book as a whole (compilation) cannot be reproduced,
distributed or used for commercial or non-commercial purposes without InTech's written permission. Enquiries
concerning the use of the book should be directed to InTech's rights and permissions department
([email protected]).
Violations are liable to prosecution under the governing Copyright Law.

Individual chapters of this publication are distributed under the terms of the Creative Commons Attribution 3.0
Unported License which permits commercial use, distribution and reproduction of the individual chapters, provided
the original author(s) and source publication are appropriately acknowledged. More details and guidelines
concerning content reuse and adaptation can be found at http://www.intechopen.com/copyright-policy.html.

Notice
Statements and opinions expressed in the chapters are those of the individual contributors and not necessarily those
of the editors or publisher. No responsibility is accepted for the accuracy of information contained in the published
chapters. The publisher assumes no responsibility for any damage or injury to persons or property arising out of the
use of any materials, instructions, methods or ideas contained in the book.

Publishing Process Manager Mirena Calmic


Technical Editor SPi Global
Cover InTech Design team

First published October, 2017


Printed in Croatia
Legal deposit, Croatia: National and University Library in Zagreb

Additional hard copies can be obtained from [email protected]

Bayesian Inference, Edited by Javier Prieto Tejedor


p. cm.
Print ISBN 978-953-51-3577-7
Online ISBN 978-953-51-3578-4
Contents

Preface IX

Section 1  Theoretical Foundations of Bayesian Inference 1

Chapter 1  Bayesian Inference Application 3
Wiyada Kumam, Plern Saipara and Poom Kumam

Chapter 2  Node-Level Conflict Measures in Bayesian Hierarchical Models Based on Directed Acyclic Graphs 23
Jørund I. Gåsemyr and Bent Natvig

Chapter 3  Classifying by Bayesian Method and Some Applications 39
Tai Vovan

Chapter 4  Hypothesis Testing for High-Dimensional Problems 63
Naveen K. Bansal

Chapter 5  Bayesian vs Frequentist Power Functions to Determine the Optimal Sample Size: Testing One Sample Binomial Proportion Using Exact Methods 77
Valeria Sambucini

Chapter 6  Converting Graphic Relationships into Conditional Probabilities in Bayesian Network 97
Loc Nguyen

Section 2  Applications of Bayesian Inference in Life Sciences 145

Chapter 7  Bayesian Estimation of Multivariate Autoregressive Hidden Markov Model with Application to Breast Cancer Biomarker Modeling 147
Hamid El Maroufy, El Houcine Hibbah, Abdelmajid Zyad and Taib Ziad

Chapter 8  Bayesian Model Averaging and Compromising in Dose-Response Studies 167
Steven B. Kim

Chapter 9  Two Examples of Bayesian Evidence Synthesis with the Hierarchical Meta-Regression Approach 189
Pablo Emilio Verde

Chapter 10  Bayesian Modeling in Genetics and Genomics 207
Hafedh Ben Zaabza, Abderrahmen Ben Gara and Boulbaba Rekik

Chapter 11  Bayesian Two-Stage Robust Causal Modeling with Instrumental Variables using Student's t Distributions 221
Dingjing Shi and Xin Tong

Chapter 12  Bayesian Hypothesis Testing: An Alternative to Null Hypothesis Significance Testing (NHST) in Psychology and Social Sciences 235
Alonso Ortega and Gorka Navarrete

Section 3  Applications of Bayesian Inference in Engineering 255

Chapter 13  Bayesian Inference and Compressed Sensing 257
Solomon A. Tesfamicael and Faraz Barzideh

Chapter 14  Sparsity in Bayesian Signal Estimation 279
Ishan Wickramasingha, Michael Sobhy and Sherif S. Sherif

Chapter 15  Dynamic Bayesian Network for Time-Dependent Classification Problems in Robotics 299
Cristiano Premebida, Francisco A. A. Souza and Diego R. Faria

Section 4  Applications of Bayesian Inference in Economics 311

Chapter 16  A Bayesian Model for Investment Decisions in Early Ventures 313
Anamaria Berea and Daniel Maxwell

Chapter 17  Recent Advances in Nonlinear Filtering with a Financial Application to Derivatives Hedging under Incomplete Information 325
Claudia Ceci and Katia Colaneri

Chapter 18  Airlines Content Recommendations Based on Passengers' Choice Using Bayesian Belief Networks 349
Sien Chen, Wenqiang Huang, Mengxi Chen, Junjiang Zhong and Jie Cheng
Preface

The range of Bayesian inference algorithms and their different applications has expanded greatly
since the first implementation of a Kalman filter by Stanley F. Schmidt for the Apollo program.
Extended Kalman filters, unscented Kalman filters, particle filters, and belief condensation
filters are just some examples of these algorithms, which have been applied to logistics, medical
services, search and rescue operations, and automotive safety, among other areas.
The essence of these algorithms is to explain how we should update our existing beliefs in the
light of new evidence. Stephen Senn defined a Bayesian as "one who, vaguely expecting a horse
and catching a glimpse of a donkey, strongly concludes he has seen a mule."
From the Bayesian perspective, both the parameter to estimate and the observations are random
variables, in contrast to the frequentist approach, where the parameter to estimate is an
unknown deterministic value. This Bayesian point of view leads to a common resolution
framework in which what we infer is a density function of the parameter conditioned on the
observation. In this context, the task is to determine the posterior distribution of the desired
state, from knowledge of the prior and the likelihood, by using Bayes' rule. This setting can be
modeled with a hidden Markov model (HMM), where the sequence of variables of interest is
called the hidden states and the sequence from which one can obtain realizations is called the
observations.
The Achilles' heel of Bayesian inference is the implementation of algorithms for nonlinear
and/or non-Gaussian system models, where the traditional Kalman approach is inaccurate.
Optimal algorithms exist in some restricted cases, while researchers have proposed many
practical implementations of Bayesian inference that rely on approximations. In addition, the
inclusion of relevant prior information has been addressed at length in the literature, where
there is no consensus about the existence of a truly non-informative prior. The choice of
appropriate prior information has extensively nourished research on Bayesian inference, and it
remains an element of intense discussion today.
This book takes a look at both the theoretical foundations of Bayesian inference and practical
implementations in the fields of life sciences, engineering, and economics. The book is
organized into four sections according to this twofold perspective.
Section 1 is dedicated to the theoretical foundations of Bayesian inference.
In Chapter 1, the authors review the components of Bayesian inference and its application to
game theory. The chapter includes definitions of the former (prior, likelihood, and posterior
distributions) and several examples of the latter (Bayesian and fuzzy games).
In Chapter 2, the authors review methods for checking the modeling assumptions at each node
of a directed acyclic graph that represents the understanding of the underlying structure of a
problem. The chapter shows how nodes in a graph correspond to data or parameters and how
directed edges between parameters correspond to conditional distributions.
In Chapter 3, the author presents a Bayesian classification method and the associated Bayes
error, examines its relationship with other measures, and proposes an algorithm for determining
prior probabilities that reduces the Bayes error. The chapter applies the proposed algorithm to
specific problems in three domains: biology, medicine, and economics.
In Chapter 4, the author introduces the problem of high-dimensional multiple hypothesis
testing using the Bayesian approach. The chapter demonstrates the practical application of the
Bayesian decision-theoretic approach by means of a real example of directional hypothesis
testing with skewed alternatives using gene expression data.
In Chapter 5, the author provides formal criteria to determine the adequate sample size in the
design of experiments based on combined frequentist-Bayesian or fully Bayesian approaches.
The chapter defines four power functions for sample size calculations, of which the Bayesian
predictive power is the one that adds the most flexibility.
In Chapter 6, the author addresses the problem of converting graphic relationships into
conditional probabilities in order to construct a simple Bayesian network from a graph. The
chapter applies this research in a learning context in which a Bayesian network is used to
assess students' knowledge.
Section 2 is dedicated to the applications of Bayesian inference in life sciences.
In Chapter 7, the authors propose a first-order autoregressive hidden Markov model as a
suitable model to characterize a marker of breast cancer disease progression. The chapter
shows how this model captures the complexity and dynamics of the evolution of breast cancer
by introducing latent states, and how it permits evaluating the efficacy of a treatment through
the transition probabilities.
In Chapter 8, the author deals with the limitations of the Bayesian framework applied to
dose-response studies relying on small samples and sparse data. The chapter addresses three
practical issues in small-sample dose-response studies: model sensitivity, disagreement in
prior knowledge, and conflicting perspectives in decision rules.
In Chapter 9, the author illustrates the application of Bayesian inference to two meta-analyses
in medical research in which indirect evidence is available for analysis. The chapter presents
the hierarchical meta-regression method for meta-analysis, an integrated approach for evidence
synthesis when multiple sources of bias, coming from indirect and disparate evidence, have to
be incorporated.
In Chapter 10, the authors provide a review of statistical methods applied to animal and plant
breeding programs, with a particular focus on Bayesian methods. The chapter illustrates the
flexibility of the Bayesian approaches and their high accuracy in predicting breeding values.
In Chapter 11, the authors propose four types of two-stage least squares (2SLS) models with
instrumental variables to model normal and non-normal data in causal inference research. The
chapter evaluates the performance of the robust method using Student's t-distributions in the
four distributional 2SLS models by means of Monte Carlo simulation.
In Chapter 12, the authors present Bayesian hypothesis testing and its application to psychology
and the social sciences as an alternative to traditional frequentist null hypothesis significance
testing (NHST). The chapter shows the advantages of this Bayesian approach over frequentist
NHST, providing examples that support its use.
Section 3 is dedicated to the applications of Bayesian inference in engineering.
In Chapter 13, the authors motivate the use of Bayesian inference in compressed sensing, a
signal processing method. The chapter provides three use cases of its applicability: magnetic
resonance imaging, remote sensing, and wireless communication systems, specifically
multiple-input multiple-output (MIMO) systems.
In Chapter 14, the authors present different methods to estimate an unknown signal from
its linear measurements where the number of measurements is less than the dimension of
the unknown signal. The chapter introduces the concept of signal sparsity and describes
how it could be used as prior information for either regularized least squares or Bayesian
signal estimation.
In Chapter 15, the authors explore the use of dynamic Bayesian networks (DBNs) for
time-dependent classification problems in mobile robotics and present some experiments in
semantic place recognition and daily activity classification. The chapter formulates the DBN as
a time-dependent classification problem and gives a general expression for a DBN in terms of
classifier priors and likelihoods through the time steps.
Section 4 is dedicated to the applications of Bayesian inference in economics.
In Chapter 16, the authors propose a new Bayesian model, informed by previous academic
literature on entrepreneurship and venture capital investment practices, to aid investment
decisions in early-stage start-ups and ventures. The chapter assesses this model in an
anonymized experiment in which reviewers with previous experience in entrepreneurship
and/or investment scored a list of 20 anonymous real companies.
In Chapter 17, the authors present some results on nonlinear filtering for Markovian partially
observable systems, where the state and observation processes are described by jump diffusions
with correlated Brownian motions and common jump times. The chapter applies this theory to
the financial problem of derivatives hedging for a trader who has limited information on the
market.
In Chapter 18, the authors illustrate how a Bayesian belief network (BBN) can enable airlines
to optimize passenger loyalty through dynamic recommendations of relevant content based on
predictions of passengers' choices. The chapter builds BBN models using the use case of China
Southern Airlines with real transaction data, including passengers' basic information, historical
decision options, and purchase characteristics.
Overall, this book is intended as an introductory guide to the application of Bayesian inference
in the fields of life sciences, engineering, and economics, as well as a source of fundamentals
for intermediate Bayesian readers.

Dr. Javier Prieto Tejedor


University of Salamanca,
Salamanca, Spain
Section 1

Theoretical Foundations of Bayesian Inference


Chapter 1

Bayesian Inference Application

Wiyada Kumam, Plern Saipara and Poom Kumam

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/intechopen.70530

Abstract

In this chapter, we introduce the concept of Bayesian inference and its application to
real-world problems such as game theory (Bayesian games). The chapter is organized as
follows. Sections 2 and 3 present model-based Bayesian inference and the components of
Bayesian inference, respectively. The last section contains some applications of Bayesian
inference.

Keywords: statistical inference, Frequentist inference, Bayesian inference

1. Introduction

In statistical inference, there are two main interpretations of probability: frequentist (or
classical) inference and Bayesian inference. The two differ in their view of the nature of
probability. Classical inference defines probability as the limit of an event's relative frequency
in a large number of trials, and only for well-defined random experiments. Bayesian inference,
on the other hand, can assign probabilities to statements even when no random process is
involved. In the Bayesian sense, probability is a way to express an individual's degree of belief
in a statement. Bayesian and frequentist inference thus rest on different interpretations of
probability, and the two approaches differ accordingly. Bayes' theorem relates two conditional
probabilities that are the reverse of one another. The theorem is named in honor of Reverend
Thomas Bayes and is also referred to as Bayes' law (see [1]). It expresses the conditional
(posterior) probability of an event A after B is observed in terms of the prior probability of A,
the prior probability of B, and the conditional probability of B given A. It is valid in all
interpretations of probability. Bayes' formula shows how to revise probability statements
using data. Bayes' law (or Bayes' rule) is

© 2017 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use,
distribution, and reproduction in any medium, provided the original work is properly cited.
P(A|B) = P(B|A) P(A) / P(B).    (1)

The definition of conditional probability can be written as

P(A ∩ B) = P(A|B) P(B) = P(B|A) P(A).    (2)

For example, suppose a die is thrown under a dice box. Under the standard model, all outcomes
have probability 1/6. Now the box is lifted slightly, and a random corner of the upper face
becomes visible; it contains a dot. The new probability distribution of the outcomes is derived
as follows. Let Ai be the outcome of the throw, for i = 1, 2, 3, 4, 5, 6, and let B be the event that
the randomly chosen corner contains a dot. Then P(Ai) = 1/6 and P(B) = 2/3, as in the following table:

Ai    P(Ai)    P(B|Ai)    P(Ai ∩ B)    P(Ai|B)
A1    1/6      0          0            0
A2    1/6      1/2        1/12         1/8
A3    1/6      1/2        1/12         1/8
A4    1/6      1          1/6          1/4
A5    1/6      1          1/6          1/4
A6    1/6      1          1/6          1/4

The fourth column is constructed by multiplying, for each Ai, P(Ai) by P(B|Ai). The fifth
column is then obtained by summing the fourth column and dividing each entry by that sum.
This final step is called scaling (or normalization) and corresponds to the formula

Σ_{i=1}^{6} P(B|Ai) P(Ai) = Σ_{i=1}^{6} P(Ai ∩ B) = P(B).

A simpler argument is that P(Ai|B) has to be a probability distribution, and thus must sum to
unity. As the scaling operation is trivial, Bayes' rule is often written as

P(A|B) ∝ P(A) P(B|A)

where P(A) is the prior (distribution), P(B|A) is the likelihood, and P(A|B) is the posterior
(distribution).
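As an illustrative sketch (not part of the original text), the posterior column of the table above can be reproduced directly from the prior and likelihood columns by the scaling step just described:

```python
# Dice-under-a-box example: B is the event that a randomly chosen corner
# of the upper face shows a dot. Likelihoods P(B|Ai) come from counting
# dotted corners on each face of a standard die.
prior = {i: 1 / 6 for i in range(1, 7)}                   # P(Ai)
likelihood = {1: 0, 2: 1/2, 3: 1/2, 4: 1, 5: 1, 6: 1}     # P(B|Ai)

joint = {i: prior[i] * likelihood[i] for i in prior}      # P(Ai ∩ B)
p_b = sum(joint.values())                                 # P(B): the scaling term
posterior = {i: joint[i] / p_b for i in joint}            # P(Ai|B)

for i in range(1, 7):
    print(i, round(posterior[i], 4))   # matches the table: 0, 1/8, 1/8, 1/4, 1/4, 1/4
```

Note that the posterior requires only the products P(Ai)P(B|Ai); the division by their sum is the trivial scaling step.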
A central result of Bayesian statistics is that statistical inference may rest on the simple device
posterior ∝ prior × likelihood. The dice-throwing example is not controversial; the disputes
concern the possibility of using Bayes' rule as

P(Truth|Data) = P(Data|Truth) P(Truth) / P(Data).    (3)

So, we get

P(Truth) = the prior.    (4)

The second ingredient we need is data, plus a specification of how the data relate to the truth,
which is nothing but the classical concept of specifying a random relationship:

P(Data|Truth) = the likelihood    (5)

for all relevant values of Truth. Note that P(Data|Truth) is not used as a probability distribution
over different data, but as the probability of the given data for different values of Truth. Various
authors reserve the term likelihood for P(Data|Truth) viewed in this way, to steer clear of this
misreading.
Now, writing T for Truth, the probability of the data, P(Data), can be written as

P(Data) = ∫ P(T) P(Data|T) dT    (6)

that is, as a function of P(T) and P(Data|T). It is then clear that the prior and likelihood enable
us, using (1), to construct a new probability statement about T given the data:

P(Truth|Data) = the posterior.    (7)

The purpose of this chapter is to introduce the concept of Bayesian inference and its application
to real-world problems such as game theory (Bayesian games). The chapter is organized as
follows. In Sections 2 and 3, we present model-based Bayesian inference and the components
of Bayesian inference, respectively. The last section contains some applications of Bayesian
inference.

2. Model-based Bayesian inference

Bayesian inference is driven by Bayes' theorem. Replacing B in (1) with observations y, A with
the parameter set Θ, and probabilities P with densities p gives

p(Θ|y) = p(y|Θ) p(Θ) / p(y)    (8)

where p(y) is the marginal likelihood of y, p(Θ) is the set of prior distributions of the parameter
set Θ before y is observed, p(y|Θ) is the likelihood of y under a model, and p(Θ|y) is the joint
posterior distribution of Θ, which expresses uncertainty about the parameter set Θ after taking
both the prior and the data into account. Because there are often multiple parameters, Θ
represents a set of j parameters:

Θ = θ1, θ2, …, θj.

The term

p(y) = ∫ p(y|Θ) p(Θ) dΘ    (9)

is the marginal likelihood (or prior predictive distribution) of y, introduced by Jeffreys [2]; it
may be set to c, where c is an unknown constant. This distribution describes what y should
look like given the model, before y has been observed. Only the prior probabilities and the
model's likelihood function are used for p(y). The presence of p(y) normalizes the joint
posterior distribution p(Θ|y), guaranteeing that it is a proper distribution that integrates to 1.
Replacing p(y) with a constant of proportionality c, Bayes' theorem becomes

p(Θ|y) = p(y|Θ) p(Θ) / c.    (10)

This yields

p(Θ|y) ∝ p(y|Θ) p(Θ)    (11)

where ∝ means "is proportional to".

Formulation (11) expresses the unnormalized joint posterior as proportional to the likelihood
multiplied by the prior. However, the aim of modeling is often not the unnormalized joint
posterior distribution itself, but rather the marginal distributions of the parameters. The set of
all parameters Θ can be partitioned as

Θ = {Φ, Λ}    (12)

where Φ denotes the sub-vector of interest and Λ the complementary sub-vector of Θ, usually
referred to as a vector of nuisance parameters. From a Bayesian perspective, the presence of
nuisance parameters poses no formal, theoretical problems. A nuisance parameter is a
parameter that appears in the joint posterior distribution of a model but is not a parameter of
interest. The marginal posterior distribution of Φ, the parameter of interest, can be written as

p(Φ|y) = ∫ p(Φ, Λ|y) dΛ.    (13)

In model-based Bayesian inference, Bayes' theorem is used to estimate the unnormalized joint
posterior distribution, and ultimately the user can evaluate and make inferences from the
marginal posterior distributions.
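The pipeline of (8)-(11) can be sketched numerically on a grid. The data and prior below are hypothetical, chosen for illustration: a binomial likelihood with y = 7 successes in n = 10 trials and a Beta(2, 2)-shaped prior, combined via posterior ∝ likelihood × prior and then normalized so that summaries can be read off:

```python
# Grid sketch of posterior ∝ likelihood × prior (hypothetical example):
# binomial data y = 7 successes in n = 10 trials, Beta(2, 2)-shaped prior.
n, y = 10, 7
grid = [i / 200 for i in range(1, 200)]   # theta values in (0, 1)

def prior(t):
    return t * (1 - t)                    # Beta(2, 2) density up to a constant

def likelihood(t):
    return t ** y * (1 - t) ** (n - y)    # binomial likelihood up to a constant

unnorm = [prior(t) * likelihood(t) for t in grid]   # unnormalized posterior
c = sum(unnorm)                                     # plays the role of p(y) on the grid
posterior = [u / c for u in unnorm]                 # normalized: sums to 1

post_mean = sum(t * p for t, p in zip(grid, posterior))
print(round(post_mean, 3))   # close to the exact conjugate Beta(9, 5) mean, 9/14
```

Because both numerator and denominator use the same grid, the unknown normalizing constants of the prior and likelihood cancel, exactly as in the replacement of p(y) by c above.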

3. The components of Bayesian inference

This section presents the components of Bayesian inference, which comprise the prior
distributions, the likelihood (or likelihood function), and the joint posterior distribution, as
follows.

1. p(Θ) is the prior distribution for the set Θ, and uses probability as a means of quantifying
uncertainty about Θ before taking the data into account.

2. p(y|Θ) is the likelihood function, in which all variables are related in a full probability
model.

3. p(Θ|y) is the joint posterior distribution, which expresses uncertainty about Θ after taking
both the prior and the data into account. If Θ is partitioned into a single parameter of interest φ
and the remaining parameters are considered nuisance parameters, then the marginal posterior
distribution of φ is denoted p(φ|y).

3.1. Prior distribution


The prior distribution is a central concept of Bayesian statistics: it encodes the information
about an uncertain Θ that is merged with the probability distribution of new data to yield the
posterior distribution, which in turn is used for future inferences and decisions about Θ. The
existence of a prior distribution for any problem can be justified by axioms of decision theory;
here we focus on how to set up a prior distribution for a given application. Generally, Θ will be
a vector, but for simplicity we will write p(Θ).

With well-identified models and large sample sizes, reasonable choices of p(Θ) will have minor
effects on posterior inferences. This advice might seem circular, but in practice one can check
the dependence on p(Θ) with a sensitivity analysis: comparing posterior inferences under
different reasonable choices of p(Θ).
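Such a sensitivity analysis can be sketched as follows (the data and the candidate Beta priors here are hypothetical). Conjugacy makes the comparison immediate: with y successes in n binomial trials and a Beta(a, b) prior, the posterior is Beta(a + y, b + n − y):

```python
# Prior sensitivity sketch (hypothetical data): posterior mean of a binomial
# proportion under several Beta(a, b) priors. The posterior is
# Beta(a + y, b + n - y), with mean (a + y) / (a + b + n).
n, y = 100, 60                      # a large, well-identified sample

results = {}
for a, b in [(1, 1), (2, 2), (5, 5)]:        # alternative reasonable priors
    post_mean = (a + y) / (a + b + n)
    results[(a, b)] = post_mean
    print((a, b), round(post_mean, 4))

spread = max(results.values()) - min(results.values())
print(round(spread, 4))   # small: with n = 100 the data dominate the prior
```

Rerunning with a small n (say n = 5, y = 3) widens the spread considerably, which is exactly the regime in which the choice of p(Θ) matters, as discussed next.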
If the sample size is small, or the available data provide only indirect information about the
parameters of interest, then p(Θ) becomes more important. In many cases, nevertheless, models
can be set up hierarchically, so that clusters of parameters share a common p(Θ), which can
itself be estimated from the data. Prior probability distributions have traditionally been
classified into two kinds: informative and uninformative priors. In this section, four kinds of
priors, namely informative, weakly informative, least informative, and uninformative, are
described according to the information they carry and the aim in using the prior.

3.1.1. Informative prior


If prior information about Θ is available, it should be included in p(Θ). If the current model is
analogous to a previous model, and the current model is intended to be an updated version
based on more recent data, then the posterior distribution of Θ from the previous model may
be used as p(Θ) for the current model.

In this way, each version of a model does not start from scratch, based only on the current
data; the cumulative effect of all data, past and current, can be taken into account. To ensure
that the current data do not dominate the prior, Ibrahim and Chen [3] presented the power
prior in 2000, a class of informative prior distributions that takes earlier data and results into
account. If the current data are very similar to the previous data, then the precision of the
posterior distribution increases as more information from previous models is included. If the
current data differ substantially, then the posterior distribution of Θ may lie in the tails of the
prior distribution for Θ, so that p(Θ) contributes less density in its tails.

Sometimes informative prior information is not immediately available, for example when it
resides with another person, such as an expert. In that case, the person's beliefs about the
probability of the event must be elicited in the form of a suitable probability density function;
this process is called prior elicitation.

3.1.2. Weakly informative prior


Weakly informative priors (WIPs) use prior information for regularization and stabilization,
providing enough prior information to prevent results that contradict our knowledge, for
example an algorithmic failure to explore the state space. Another aim of a WIP is to use less
prior information than is actually available. A WIP should provide some of the benefit of prior
information while avoiding some of the risk of using information that does not exist. WIPs are
the most common priors in practice and are favored by subjective Bayesians.

Selecting a WIP can be tricky. A WIP should change with the sample size, because the model
should have enough prior information to learn from the data, but the prior information must
also be weak enough for the data to dominate.

Here is an example of a WIP in practice. It is popular, for good reasons, to center and scale all
continuous predictors [4]. Although centering and scaling predictors is not discussed here, it
should be clear that the potential range of the posterior distribution of θ for a centered and
scaled predictor should be small. A popular WIP for a centered and scaled predictor is
θ ~ N(0, 10,000), where θ is normally distributed with a mean of 0 and a variance of 10,000.
Here, the density for θ is nearly flat. Nonetheless, the fact that it is not perfectly flat yields good
properties for numerical estimation algorithms. In both Bayesian and frequentist inference,
numerical estimation algorithms can become stuck in regions of flat density, which become
more common as sample size decreases or model complexity increases. Numerical estimation
algorithms in frequentist inference operate as though a flat prior were used, so they become
stuck more frequently than their counterparts in Bayesian inference. Prior distributions that
are not completely flat provide enough information for the numerical estimation algorithm to
continue to explore the target density, the posterior distribution.

After updating a model in which WIPs exist, the user should investigate the posterior. If the
posterior contradicts knowledge, then the WIP must be revised by including information that
will make the posterior consistent with knowledge [4]. A common objective Bayesian criticism
of WIPs is that there is no precise mathematical form for deriving the optimal WIP for a given
model and data.
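A small sketch of such a prior in action (the data below are hypothetical): in a conjugate normal model with known unit variance, a θ ~ N(0, 10,000) prior leaves the posterior mean essentially at the sample mean, while still being very slightly informative:

```python
# Conjugate normal-normal update under a weakly informative N(0, 10000)
# prior (hypothetical data, known data variance sigma2 = 1).
prior_mean, prior_var = 0.0, 10_000.0
data = [1.8, 2.1, 2.4, 1.9, 2.3]
n, sigma2 = len(data), 1.0
ybar = sum(data) / n

# Standard precision-weighted update for the posterior of the mean:
post_var = 1 / (1 / prior_var + n / sigma2)
post_mean = post_var * (prior_mean / prior_var + n * ybar / sigma2)

print(round(ybar, 4), round(post_mean, 4))   # nearly identical; tiny shrinkage to 0
```

The posterior mean is pulled toward the prior mean of 0 by only a few parts in a hundred thousand, which is the sense in which the nearly flat prior regularizes without overriding the data.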

3.1.2.1. Vague priors


A vague prior, also called a diffuse prior, is difficult to define precisely once WIPs are
considered. Lambert, Sutton, Burton, Abrams, and Jones made the first formal move from
vague to weakly informative priors in 2005. Since conjugate priors were introduced by Raiffa
and Schlaifer in 1961, most applied Bayesians have used vague priors, parameterized to
approximate the idea of uninformative priors.

Typically, a vague prior is a conjugate prior with a large scale parameter. However, if the
sample size is small, vague priors can be problematic. Most problems with vague priors and
small sample sizes involve scale rather than location. The problem can be particularly acute in
random-effects models, a term used rather loosely here to cover exchangeable, hierarchical,
and multilevel structures. A vague prior can thus be defined as a conjugate prior intended to
approximate an uninformative prior, without the two goals of regularization and stabilization.

3.1.3. Least informative prior


The term least informative prior (LIP) is used here to describe a class of priors in which the aim
is to minimize the amount of subjective information content and to use a prior determined only
by the model and the observed data. The rationale for using LIPs is often stated as letting the
data speak for themselves. LIPs are preferred by objective Bayesians. LIPs include flat priors
[12], hierarchical priors [4], Jeffreys priors [2], MAXENT priors [5], and reference priors [6–8],
among others.

3.1.4. Uninformative prior


Traditionally, most of the prior distributions described above were classified as uninformative
priors. However, truly uninformative priors do not exist (see [9]); all priors are informative
in some way. Moreover, various names have been associated with uninformative priors,
including diffuse, minimal, non-informative, objective, reference, uniform, vague, and perhaps
weakly informative.

3.1.5. Proper and improper priors


It is important for the prior distribution to be proper. A prior distribution p(θ) is improper
when ∫ p(θ) dθ = ∞.

For example, an unbounded uniform prior distribution is an improper prior distribution, since
p(θ) ∝ 1 for θ ∈ (−∞, ∞). An improper prior distribution may lead to an improper posterior
distribution, and if the posterior distribution is improper, inferences are invalid. To determine
the propriety of a joint posterior distribution, the marginal likelihood must be finite for all y,
where the marginal likelihood is p(y) = ∫ p(y|Θ) p(Θ) dΘ. Although improper prior
distributions can be applied, it is good practice to avoid them.
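A numerical sketch (Python; the standard normal prior is an illustrative choice) contrasts a proper prior, whose total mass stabilizes at 1 as the integration range grows, with the flat prior p(θ) ∝ 1, whose mass grows without bound:

```python
from math import exp, pi

# Numerical sketch of (im)propriety: a proper prior's total mass stabilizes
# at 1 as the range grows, while the mass of the flat "prior" p(θ) ∝ 1
# diverges.  The standard normal prior is an illustrative choice.
def trapezoid(f, lo, hi, m=20000):
    h = (hi - lo) / m
    return h * (sum(f(lo + k * h) for k in range(1, m)) + (f(lo) + f(hi)) / 2)

def std_normal(t):
    return exp(-t * t / 2) / (2 * pi) ** 0.5

for half_width in (10, 100, 1000):
    mass_normal = trapezoid(std_normal, -half_width, half_width)
    mass_flat = trapezoid(lambda t: 1.0, -half_width, half_width)
    print(round(mass_normal, 4), round(mass_flat, 4))  # normal stays ≈ 1
```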

3.2. Likelihood

To complete the definition of a Bayesian model, both the prior distributions and the likelihood
must be specified. The likelihood, p(y|Θ), contains the available information provided by the
sample. For independent observations, the likelihood is p(y|Θ) = ∏_{i=1}^{n} p(yi|Θ).

The data y affect the posterior distribution p(Θ|y) only through the likelihood p(y|Θ). In
this way, Bayesian inference obeys the likelihood principle, which states that for a given
sample of data, any two probability models p(y|Θ) that have the same likelihood yield the
same inference for Θ.
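The likelihood principle can be illustrated numerically (a Python sketch; the binomial and negative binomial sampling models are illustrative choices, not from the text): their likelihoods for θ are proportional, so with the same prior they give identical posteriors.

```python
from math import comb

# Sketch of the likelihood principle: binomial sampling (n fixed) and
# negative binomial sampling (stop at the k-th success) both give a
# likelihood proportional to θ^k (1-θ)^(n-k), hence the same posterior.
# The models and the data (n = 12, k = 3) are illustrative assumptions.
n, k = 12, 3
grid = [j / 100 for j in range(1, 100)]    # grid over θ ∈ (0, 1)
prior = [1.0] * len(grid)                  # flat prior on the grid

def posterior(lik):
    w = [p * lik(t) for p, t in zip(prior, grid)]
    total = sum(w)
    return [v / total for v in w]

binom = posterior(lambda t: comb(n, k) * t**k * (1 - t)**(n - k))
negbin = posterior(lambda t: comb(n - 1, k - 1) * t**k * (1 - t)**(n - k))
print(max(abs(a - b) for a, b in zip(binom, negbin)) < 1e-12)  # True
```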

3.3. Posterior distribution


For recent theoretical and applied overviews of Bayesian statistics, including many examples
and uses of posterior distributions, see [10–12]. Posterior distributions for decision-making
about home radon exposure are discussed in [13].
The posterior distribution summarizes the current state of knowledge about all the uncertain
quantities in a Bayesian analysis. Analytically, the posterior density is proportional to the
product of the prior density and the likelihood. In a complicated analysis, the joint posterior
distribution can be summarized by a set of L simulation draws of the vector of uncertain
quantities w1, w2, …, wJ, as illustrated in the following matrix:

l     w1    w2    …     wJ
1     .     .     …     .
2     .     .     …     .
…     …     …     …     …
L     .     .     …     .

The marginal posterior distribution for any unknown quantity wj can be summarized by its
column of L simulation draws. In many examples it is not necessary to construct the entire
table ahead of time; rather, one creates the L vectors of posterior simulations for the parameters
of the model and then uses these to construct posterior simulations for other unknown quan-
tities of interest, as necessary.
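The construction can be sketched as follows (Python; the normal model with known variance 1 and a flat prior on the mean is an illustrative assumption): each loop iteration produces one row of the L × J matrix, here with J = 2 (the mean μ and a posterior predictive draw ỹ).

```python
import random

# Sketch of the L × J simulation-draw matrix: draws of J = 2 unknowns
# (the mean μ and a predicted ỹ) from a normal posterior with known
# variance 1 and a flat prior; the model is an illustrative assumption.
random.seed(0)
y = [2.1, 1.7, 2.9, 2.4]
n, ybar = len(y), sum(y) / len(y)
L = 1000
draws = []
for _ in range(L):
    mu = random.gauss(ybar, (1 / n) ** 0.5)    # posterior draw of μ
    y_new = random.gauss(mu, 1)                # posterior predictive draw
    draws.append((mu, y_new))                  # one row of the matrix

# A marginal posterior is summarized by one column of the matrix.
mu_col = [row[0] for row in draws]
print(abs(sum(mu_col) / L - ybar) < 0.1)  # True: posterior mean ≈ ȳ
```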

4. Application to game theory

In this section, we present the application of Bayesian inference to real-world problems, in the
form of Bayesian games, as follows.

4.1. The classical games


The basic theory of the n-person game was presented by John Forbes Nash [14] in 1950. He
was also the first to show the existence of an equilibrium for this model when the players'
preferences are representable by continuous quasi-concave utilities and the strategy sets are
simplices. The definition of an n-person game can be written as below.

Definition 4.1

The normal form of an n-person game is (Xi, ri)_{i=1}^n, where for each i ∈ {1, 2, …, n}, Xi is
the non-empty set of individual strategies of player i and ri is the preference relation of player
i on X := ∏_{i∈I} Xi.

The individual preferences ri are usually represented by utility functions, i.e. for each i ∈ {1, 2,
…, n} there exists a real-valued function ui : X → R such that

x ri y ⇔ ui(x) ≥ ui(y), for all x, y ∈ X.

The normal form of the n-person game is then transformed into (Xi, ui)_{i=1}^n.

The solution concept for this game is the Nash equilibrium, defined below.


Definition 4.2

The Nash equilibrium for the game (Xi, ui)_{i=1}^n is a point x* ∈ X which satisfies, for each
i ∈ {1, 2, …, n}: ui(x*) ≥ ui(x*_{−i}, xi) for each xi ∈ Xi.

The following theorem offers sufficient conditions for the existence of a Nash equilibrium.
Theorem 4.3

Let Γ = (Xi, ui)_{i=1}^n be an n-person game and denote by f the real-valued function on
X × X defined by f(x, y) = Σ_{i=1}^n ui(x_{−i}, yi). Let us assume that

1. for each i ∈ {1, 2, …, n}, Xi is a non-empty compact convex subset of a Hausdorff linear topological
space;

2. for each i ∈ {1, 2, …, n}, ui(·, xi) is continuous on X_{−i} = ∏_{j≠i} Xj for each fixed xi ∈ Xi;

3. Σ_{i=1}^n ui is continuous on X;

4. f(x, ·) is quasi-concave on X, for each x ∈ X.

Then, Γ has an equilibrium.

Proof. See [34].


Next, we present some examples of Nash equilibria for two-person games.

Example 4.4

The battle of the sexes game has two Nash equilibria, (MT, FT) and (MS, FS), with payoffs
(3, 2) and (2, 3), where MT denotes "male plays tennis", MS "male goes shopping", FT "female
plays tennis" and FS "female goes shopping"; see Figure 1.
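Pure Nash equilibria of a two-player game can be found by checking every profile for profitable unilateral deviations. The sketch below (Python) verifies Example 4.4; since Figure 1 is not reproduced here, the off-diagonal payoffs (0, 0) are an assumption.

```python
from itertools import product

def pure_nash(payoffs, actions_row, actions_col):
    """Enumerate pure-strategy Nash equilibria of a two-player game.

    payoffs[(r, c)] = (row player's utility, column player's utility).
    A profile (r, c) is an equilibrium if neither player can gain by
    deviating unilaterally."""
    equilibria = []
    for r, c in product(actions_row, actions_col):
        u_row, u_col = payoffs[(r, c)]
        row_ok = all(payoffs[(r2, c)][0] <= u_row for r2 in actions_row)
        col_ok = all(payoffs[(r, c2)][1] <= u_col for c2 in actions_col)
        if row_ok and col_ok:
            equilibria.append((r, c))
    return equilibria

# Battle of the sexes: diagonal payoffs from the text; the off-diagonal
# payoffs (0, 0) are an assumption, since Figure 1 is not reproduced here.
bos = {("MT", "FT"): (3, 2), ("MT", "FS"): (0, 0),
       ("MS", "FT"): (0, 0), ("MS", "FS"): (2, 3)}
print(pure_nash(bos, ["MT", "MS"], ["FT", "FS"]))  # [('MT', 'FT'), ('MS', 'FS')]
```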
Example 4.5

The oligopoly behavior game has a unique Nash equilibrium (Aa, Ba), where Ad denotes
"coffee shop A does not advertise", Aa "coffee shop A advertises", Bd "coffee shop B does not
advertise" and Ba "coffee shop B advertises"; see Figure 2.

Figure 1. The battle of the sexes game.

Figure 2. The oligopoly behavior game.

4.2. The Bayesian games


So far, we have supposed that everything in the game is common knowledge for every player.
However, real players may have private information about their own payoffs, their type or
preferences, etc. The way to model this situation of asymmetric information is to use the
concept defined by Harsanyi in 1967. The key is to introduce a move by nature, which changes
the uncertainty by converting an asymmetric information problem into an imperfect
information problem. The idea is that nature moves by determining the players' types, a
concept that collects all the private information relevant to them (i.e. payoffs, preferences,
beliefs about other players, etc.).
Definition 4.6

The normal form of a Bayesian game with incomplete information includes:

1. the players i ∈ {1, 2, …, I};

2. a finite set of actions for each player, ai ∈ Ai;

3. a finite set of types for each player, θi ∈ Θi;

4. a probability distribution over types, p(θ);

5. utility functions ui : A1 × A2 × … × AI × Θ1 × Θ2 × … × ΘI → R.

It is important to discuss some parts of the definition. Players' types comprise all relevant
information about a player's private characteristics. The type θi is observed only by player i,
who uses this information both to make decisions and to update beliefs about the likelihood
of the opponents' types.

Combining actions and types for each player, it is possible to create the strategies. Strategies
are given by si : Θi → Ai, with elements si(θi), where Θi is the type space and Ai is the action
space. A strategy may assign different actions to different types. Lastly, utilities are computed
by each player by taking expectations over types using his own conditional beliefs about the
opponents' types. Hence, if player i uses the pure strategy si, the other players use the
strategies s_{−i} and player i's type is θi, the expected utility can be written as

Eui(si | s_{−i}, θi) = Σ_{θ_{−i} ∈ Θ_{−i}} ui(si, s_{−i}(θ_{−i}); θi, θ_{−i}) p(θ_{−i} | θi).
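As a sketch in Python, the expected-utility formula can be evaluated for the uninformed COL player of Example 4.9 below: if ROW plays U in Matrix I and D in Matrix II and both matrices are equally likely, COL's expected utilities of L and R are 0.5 and 1, so R is the best response.

```python
def expected_utility(u, own_action, opp_strategy, belief):
    """Eu_i(s_i | s_-i, θ_i): average the payoff over the opponents' types
    θ, weighted by the conditional beliefs p(θ | θ_i)."""
    return sum(u(own_action, opp_strategy[theta], theta) * p
               for theta, p in belief.items())

# COL's payoffs in Example 4.9: the type θ ∈ {"I", "II"} is the matrix
# chosen by nature, observed by ROW only.
col_payoff = {("L", "U", "I"): 1, ("L", "D", "I"): 0,
              ("R", "U", "I"): 0, ("R", "D", "I"): 0,
              ("L", "U", "II"): 0, ("L", "D", "II"): 0,
              ("R", "U", "II"): 0, ("R", "D", "II"): 2}

def u_col(a_col, a_row, theta):
    return col_payoff[(a_col, a_row, theta)]

row_strategy = {"I": "U", "II": "D"}   # ROW plays the strategy UD
belief = {"I": 0.5, "II": 0.5}         # COL's belief over nature's move
print(expected_utility(u_col, "L", row_strategy, belief))  # 0.5
print(expected_utility(u_col, "R", row_strategy, belief))  # 1.0
```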

A Bayesian Nash equilibrium (BNE) is basically the same concept as a Nash equilibrium,
with the addition that players need to take expectations over the opponents' types, as follows.

Definition 4.7

A Bayesian Nash equilibrium is a Nash equilibrium of a Bayesian game, i.e. Eui(si | s_{−i}, θi) ≥
Eui(s'i | s_{−i}, θi) for all s'i ∈ Si and for all types θi occurring with positive probability.

The following theorem establishes the existence of a Bayesian Nash equilibrium.

Theorem 4.8
Every finite Bayesian game has a Bayesian Nash equilibrium.
Example 4.9

Consider the Bayesian game defined as follows:

1. Nature decides whether the payoffs are as in Matrix I or in Matrix II, each with probability 1/2;

2. ROW is informed of the choice of nature, but COL is not;

3. ROW chooses U or D, COL chooses L or R, and these choices are made simultaneously;

4. Payoffs are as in the matrix chosen by nature.

For this Bayesian game, we will find all BNE, in pure as well as mixed behavioral strategies.
The payoff matrices are as follows.

Matrix I:

L R

U (1, 1) (0, 0)
D (0, 0) (0, 0)

Matrix II:

L R

U (0, 0) (0, 0)

D (0, 0) (2, 2)

4.2.1. Pure strategy BNE

First, we restate the incomplete information problem as a static extended game Γ̂ containing
all possible strategies. Following Harsanyi, it can be shown that the Nash equilibria of Γ̂
coincide with the equilibria of the imperfect information game presented above. The idea is
to build the extended game Γ̂ so that every way the game can unfold is taken into account.

The first step is to define the strategies of each player.

Since COL does not know in which matrix the game is played, COL has only two strategies,
L and R.

ROW knows in which matrix the game occurs, so ROW's strategies are UU, UD, DU and DD,
where, for example, UD means playing U in Matrix I and D in Matrix II.
Using the probabilities with which nature locates the game in each matrix, the new extended
game Γ̂ can be written as:

        L             R
UU    (1/2, 1/2)    (0, 0)
UD    (1/2, 1/2)    (1, 1)
DU    (0, 0)        (0, 0)
DD    (0, 0)        (1, 1)

Note that DU is a dominated strategy for ROW. After eliminating that possibility, the game
has three pure Nash equilibria: {(UU, L), (UD, R), (DD, R)}.
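This can be verified by enumerating best responses in the extended game (a Python sketch; the expected payoffs are those of the table for Γ̂ above):

```python
from itertools import product
from fractions import Fraction

half = Fraction(1, 2)
# Expected payoffs (ROW, COL) of the extended game Γ̂, averaging the two
# matrices with probability 1/2 each; e.g. UD vs R gives ½(0,0) + ½(2,2).
G = {("UU", "L"): (half, half), ("UU", "R"): (0, 0),
     ("UD", "L"): (half, half), ("UD", "R"): (1, 1),
     ("DU", "L"): (0, 0),       ("DU", "R"): (0, 0),
     ("DD", "L"): (0, 0),       ("DD", "R"): (1, 1)}

rows, cols = ["UU", "UD", "DU", "DD"], ["L", "R"]
nash = [(r, c) for r, c in product(rows, cols)
        if G[(r, c)][0] >= max(G[(r2, c)][0] for r2 in rows)
        and G[(r, c)][1] >= max(G[(r, c2)][1] for c2 in cols)]
print(nash)  # [('UU', 'L'), ('UD', 'R'), ('DD', 'R')]
```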

4.2.2. Mixed strategy BNE


Next, to obtain the mixed strategies, we carry out another kind of analysis and try to recover
the three pure BNE obtained before.

Suppose the probabilities of playing each action are as follows: y is the probability that COL
plays L; x is the probability that ROW plays U if the game is in Matrix I; and z is the
probability that ROW plays U if the game is in Matrix II.

4.2.3. Player’s best respones

• In Matrix I, ROW's best response is as follows.

ROW plays U (x = 1) if 1·y + 0·(1 − y) > 0, i.e. if y > 0, which can be summarized as:

a. if y > 0, then x = 1;

b. if y = 0, then x ∈ [0, 1].

• In Matrix II, ROW's best response is as follows.

ROW plays D (z = 0) if 0 < 2(1 − y), i.e. if y < 1, which can be summarized as:

c. if y < 1, then z = 0;

d. if y = 1, then z ∈ [0, 1].

• In Matrices I and II, COL's best response is as follows.

COL plays L (y = 1) if

(1/2)[1·x + 0·(1 − x)] + (1/2)[0·z + 0·(1 − z)] > (1/2)[0·x + 0·(1 − x)] + (1/2)[0·z + 2·(1 − z)],

i.e. if x > 2(1 − z), which can be summarized as:

e. if x = 2(1 − z), then y ∈ [0, 1];

f. if x > 2(1 − z), then y = 1;

g. if x < 2(1 − z), then y = 0.
Next, we can check each of these possibilities in order to find the Nash equilibria, i.e. those
strategies that are stable for every player. Let us start by checking COL's strategies, since
there are fewer combinations.

4.2.4. Mixed equilibrium


Case 1:

If y = 0, we have b. x ∈ [0, 1] and c. z = 0. We now check that this is an equilibrium from COL's
point of view. By g., if z = 0, then x < 2(1 − z) = 2, which always holds, so y = 0 is indeed a
best response.

This equilibrium set contains two of the three pure BNE found before: (DD, R), which
corresponds to y = 0, x = 0 and z = 0, and (UD, R), which corresponds to y = 0, x = 1 and z = 0.

Thus, we get the Nash equilibria y = 0, x ∈ [0, 1] and z = 0: there are many BNE in which COL
plays R and ROW plays xU + (1 − x)D, with x ∈ [0, 1], if Matrix I occurs and D if Matrix II occurs.

Case 2:

If y = 1, we have d. z ∈ [0, 1] and, from a., x = 1.

From f., we can see that when x = 1, it must be the case that z ≥ 1/2 in order for y = 1 to hold.
Hence, these BNE are restricted to y = 1, z ∈ [1/2, 1] and x = 1.

This set contains the third pure Nash equilibrium found before: (UU, L), which corresponds
to y = 1, x = 1 and z = 1.

There are many BNE in which COL plays L and ROW plays U if Matrix I occurs and
zU + (1 − z)D, where z ∈ [1/2, 1], if Matrix II occurs.

Case 3:

If y ∈ (0, 1), we have a. x = 1 and c. z = 0. By e., in order for y to lie in (0, 1), it must be the
case that x = 2(1 − z). However, this equality cannot hold when both z = 0 and x = 1.

Therefore, the case y ∈ (0, 1) yields no Bayesian Nash equilibrium.
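The case analysis can be cross-checked numerically with a grid search that applies the best-response conditions a.–g. directly (a Python sketch; the 0.1 grid step is an arbitrary choice):

```python
# Grid search cross-checking the case analysis: a profile (x, y, z) is a
# (mixed) BNE when each probability is a best response to the others.
def best_x(y):          # ROW's best responses in Matrix I (conditions a, b)
    return {1.0} if y > 0 else {k / 10 for k in range(11)}

def best_z(y):          # ROW's best responses in Matrix II (conditions c, d)
    return {0.0} if y < 1 else {k / 10 for k in range(11)}

def best_y(x, z):       # COL's best responses (conditions e, f, g)
    if x > 2 * (1 - z):
        return {1.0}
    if x < 2 * (1 - z):
        return {0.0}
    return {k / 10 for k in range(11)}

grid = [k / 10 for k in range(11)]
bne = [(x, y, z) for x in grid for y in grid for z in grid
       if x in best_x(y) and y in best_y(x, z) and z in best_z(y)]
# Every equilibrium found has y = 0 (with z = 0, any x) or y = 1 (with
# x = 1, z ≥ 1/2); none has 0 < y < 1.
print(all(y in (0.0, 1.0) for _, y, _ in bne))  # True
print(len(bne))  # 17
```

The search recovers exactly the two families of equilibria found above: y = 0 with z = 0 and any x, and y = 1, x = 1 with z ≥ 1/2.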

4.3. Abstract economy model


The existence of a social equilibrium was later proved by Debreu [15], and Arrow and Debreu
[16] proved the existence of a Walrasian equilibrium. The classical abstract economy game,
introduced by Shafer and Sonnenschein [17] and Borglin and Keiding [18], consists of a finite
set of agents, each characterized by certain constraints and preferences, described by
correspondences. Following the ideas of many previous authors, the existence of equilibria
for generalized games has been studied extensively (see, for example, [19–27] and the
references therein). We now give the definitions of an abstract economy model and of an
equilibrium of this model. Let the set of agents be the finite set {1, 2, …, n}, and for each
i ∈ {1, 2, …, n} let Xi be a non-empty set.
Definition 4.10

An abstract economy Γ = (Xi, Ai, Pi)_{i=1}^n is defined as a family of n ordered triplets
(Xi, Ai, Pi), where for each i ∈ I:

1. Ai : ∏_{i∈I} Xi → 2^{Xi} is the constraint correspondence and

2. Pi : ∏_{i∈I} Xi → 2^{Xi} is the preference correspondence.

Definition 4.11

An equilibrium for Γ is a point x* ∈ ∏_{i∈I} Xi which satisfies, for each i ∈ {1, 2, …, n}:

1. xi* ∈ Ai(x*);

2. Ai(x*) ∩ Pi(x*) = ∅.
Theorem 4.12

Let Γ = (Xi, Ai, Pi)_{i=1}^n be an abstract economy which satisfies, for each i ∈ {1, 2, …, n}:

1. Xi is a non-empty compact convex subset of R^l;

2. Ai is a continuous correspondence;

3. for each x ∈ X, Ai(x) is non-empty, compact and convex;

4. Pi has an open graph in X × Xi and, for each x ∈ X, Pi(x) is convex;

5. for each x ∈ X, xi ∉ Pi(x).

Then, Γ has an equilibrium.

Proof. See [34].

4.4. Fuzzy games


The concept of a fuzzy set was first introduced by Zadeh [28] in 1965. Fuzzy set theory has
been shown to be a useful tool for describing situations in which the data are imprecise or
vague, and it has become a well-established framework for studying fuzzy equilibrium
existence results for abstract fuzzy economies. The first study of a fuzzy abstract economy
(or fuzzy game) is due to Kim and Lee [29], who showed the existence of an equilibrium for a
1-person fuzzy game. Kim and Lee [29] also showed the existence of equilibria for generalized
games when the constraints or preferences are vague due to the agents' behavior. In 2009,
Patriche [30] studied the Bayesian abstract economy game and proved the existence of
equilibrium for an abstract economy game with differential information and a measure space
of agents. However, the existence of random fuzzy equilibria for fuzzy games had not been
studied until, in 2013, Patriche [31] defined the Bayesian abstract economy game and proved
the existence of the Bayesian fuzzy equilibrium for this game. Patriche [32] also defined a
new Bayesian abstract fuzzy economy game, characterized by a private information set, an
action fuzzy mapping, a random fuzzy constraint mapping and a random fuzzy preference
mapping, and proved the existence of the Bayesian fuzzy equilibrium for it. Recently, Patriche
[33] defined fuzzy games with applications to systems of generalized quasi-variational
inequalities. The Bayesian fuzzy equilibrium concept is an extension of the deterministic
equilibrium, and it generalizes and extends the former deterministic models introduced by
Debreu [15], Shafer and Sonnenschein [17] and Patriche [34]. Very recently, Saipara and
Kumam [35] introduced the model of a general Bayesian abstract fuzzy economy on product
measurable spaces and proved the existence of a Bayesian fuzzy equilibrium for this model,
as follows.

For each i ∈ I, let (Ωi, Zi) be a measurable space, let (Ω, Z) be the product measurable space,
where Ω := ∏_{i∈I} Ωi and Z := ⊗_{i∈I} Zi, and let μ be a probability measure on (Ω, Z). Let Y
denote the strategy or commodity space, where Y is a separable Banach space.

Let I be a non-empty finite set (the set of agents). For each i ∈ I, let Xi : Ωi → F(Y) be a fuzzy
mapping, and let zi ∈ (0, 1].

Let L_{Xi} = {xi ∈ S_{(Xi(·))_{zi}} : xi is Σi-measurable}. Denote L_X = ∏_{i∈I} L_{Xi} and
L_{X_{−i}} = ∏_{j≠i} L_{Xj}. An element xi of L_{Xi} is called a strategy for agent i. The typical
element of L_{Xi} is denoted by x̃i and that of (Xi(ωi))_{zi} by xi(ωi) (or xi). We can now define
a general Bayesian abstract fuzzy economy model on product measurable spaces as follows.
Definition 4.13

A general Bayesian abstract fuzzy economy model on product measurable spaces is defined as

Γ = {((Ωi, Zi)_{i∈I}, μ), (Xi, Σi, (Ai, ai), (Pi, pi), zi)_{i∈I}},

where I is a non-empty finite set (the set of agents) and:

a. Xi : Ωi → F(Y) is the action (strategy) fuzzy mapping of agent i;

b. Σi is a sub-σ-algebra of Z = ⊗_{i∈I} Zi, which denotes the private information of agent i;

c. for each ωi ∈ Ωi, Ai(ωi, ·) : L_X → F(Y) is the random fuzzy constraint mapping of agent i;

d. for each ωi ∈ Ωi, Pi(ωi, ·) : L_X → F(Y) is the random fuzzy preference mapping of agent i;

e. ai : L_X → (0, 1] is a random fuzzy constraint function and pi : L_X → (0, 1] is a random
fuzzy preference function of agent i;

f. zi ∈ (0, 1] is such that for all (ωi, x̃) ∈ Ωi × L_X, (Ai(ωi, x̃))_{ai(x̃)} ⊂ (Xi(ωi))_{zi} and
(Pi(ωi, x̃))_{pi(x̃)} ⊂ (Xi(ωi))_{zi}.

The Bayesian fuzzy equilibrium for a general Bayesian abstract fuzzy economy model on
product measurable spaces is defined as follows.

Definition 4.14

A Bayesian fuzzy equilibrium for Γ is a strategy profile x̃* ∈ L_X such that for all i ∈ I:

i. x̃i*(ωi) ∈ cl(Ai(ωi, x̃*))_{ai(x̃*)} μ-a.e.;

ii. (Ai(ωi, x̃*))_{ai(x̃*)} ∩ (Pi(ωi, x̃*))_{pi(x̃*)} = ∅ μ-a.e.
Theorem 4.15

Let I be a non-empty finite set, and let the family

Γ = {((Ωi, Zi)_{i∈I}, μ), (Xi, Σi, (Ai, ai), (Pi, pi), zi)_{i∈I}}

be a general Bayesian abstract fuzzy economy model on product spaces satisfying, for each
i ∈ I, the conditions (a)-(j) below. Then there exists a Bayesian fuzzy equilibrium for Γ.

a. Xi : Ωi → F(Y) is such that ωi → (Xi(ωi))_{zi} : Ωi → 2^Y is a non-empty convex weakly
compact-valued and integrably bounded correspondence;

b. Xi : Ωi → F(Y) is such that ωi → (Xi(ωi))_{zi} : Ωi → 2^Y is Σi-lower measurable;

c. for each (ωi, x̃) ∈ Ωi × L_X, (Ai(ωi, x̃))_{ai(x̃)} is convex and has a non-empty interior in the
relative norm topology of (Xi(ωi))_{zi};

d. the correspondence (ωi, x̃) → (Ai(ωi, x̃))_{ai(x̃)} : Ωi × L_X → 2^Y has a measurable graph,
i.e., {(ωi, x̃, y) ∈ Ωi × L_X × Y : y ∈ (Ai(ωi, x̃))_{ai(x̃)}} ∈ F_i ⊗ B(L_X) ⊗ B(Y), where B(L_X)
is the Borel σ-algebra for the weak topology on L_X and B(Y) is the Borel σ-algebra for the
norm topology on Y;

e. the correspondence (ωi, x̃) → (Ai(ωi, x̃))_{ai(x̃)} has weakly open lower sections, i.e., for each
ωi ∈ Ωi and each y ∈ Y, the set ((Ai(ωi, ·))_{ai(·)})^{−1}(ωi, y) = {x̃ ∈ L_X : y ∈ (Ai(ωi, x̃))_{ai(x̃)}}
is weakly open in L_X;

f. for each ωi ∈ Ωi, x̃ → cl(Ai(ωi, x̃))_{ai(x̃)} : L_X → 2^Y is upper semicontinuous, in the sense
that the set {x̃ ∈ L_X : cl(Ai(ωi, x̃))_{ai(x̃)} ⊂ V} is weakly open in L_X for every norm-open
subset V of Y;

g. the correspondence (ωi, x̃) → (Pi(ωi, x̃))_{pi(x̃)} : Ωi × L_X → 2^Y has open convex values
such that (Pi(ωi, x̃))_{pi(x̃)} ⊂ (Xi(ωi))_{zi} for each (ωi, x̃) ∈ Ωi × L_X;

h. the correspondence (ωi, x̃) → (Pi(ωi, x̃))_{pi(x̃)} : Ωi × L_X → 2^Y has a measurable graph;

i. the correspondence (ωi, x̃) → (Pi(ωi, x̃))_{pi(x̃)} : Ωi × L_X → 2^Y has weakly open lower
sections, i.e., for each ωi ∈ Ωi and each y ∈ Y, the set ((Pi(ωi, ·))_{pi(·)})^{−1}(ωi, y) =
{x̃ ∈ L_X : y ∈ (Pi(ωi, x̃))_{pi(x̃)}} is weakly open in L_X;

j. for each x̃i ∈ L_{Xi} and each ωi ∈ Ωi, x̃i ∉ (Ai(ωi, x̃))_{ai(x̃)} ∩ (Pi(ωi, x̃))_{pi(x̃)}.

Proof. See [35].


Moreover, in 1960, Fichera and Stampacchia first introduced the variational inequalities prob-
lem, this issue has been widely studied. Next, the basic concept of variational inequalities for
fuzzy mappings was first introduced by Chang and Zhu [36] in 1989. In the topic of variational
inequalities problem, there are many mathematicians who studied this topic (see, for example,
[37, 38]). In 1993, the concept of a random variational inequality was introduced by Noor and
Elsanousi [39]. Recently, Patriche [31] used the model of the Bayesian abstract fuzzy economy
to prove the existence of solution for the two types of random quasi-variational inequalities
with random fuzzy mappings.

5. Conclusion

The main objective of this chapter was to introduce the concept of Bayesian inference and its
application to some real-world problems. We presented the basic concepts of Bayesian
inference, which can be applied to the Bayesian game and to a general Bayesian abstract
fuzzy economy game (a fuzzy game). For the application to the Bayesian game, we showed
the Bayesian Nash equilibria (BNE) of a Bayesian game with examples. Finally, we showed
the existence of a Bayesian fuzzy equilibrium for a fuzzy game.

Acknowledgements

This project was supported by the Theoretical and Computation Science (TaCS) Center under
Computational and Applied Science for Smart Innovation Cluster (CLASSIC), Faculty of
Science, KMUTT. Moreover, Poom Kumam was supported by the Thailand Research Fund
(TRF) and the King Mongkut’s University of Technology Thonburi (KMUTT) under the TRF
Research Scholar Award (Grant No. RSA6080047).

Author details

Wiyada Kumam1, Plern Saipara2 and Poom Kumam3*


*Address all correspondence to: [email protected]

1 Rajamangala University of Technology Thanyaburi (RMUTT), Thailand


2 Rajamangala University of Technology Lanna Nan (RMUTL), Thailand
3 King Mongkut’s University of Technology Thonburi (KMUTT), Thailand

References

[1] Stigler S. Who discovered Bayes’s theorem. The American Statistician. 1983;37(4):290-296
[2] Jeffreys H. Theory of Probability. 3rd ed. Oxford, England: Oxford University Press; 1961
[3] Ibrahim J, Chen M. Power prior distributions for regression models. Statistical Science.
2000;15:46-60

[4] Gelman A. Scaling regression inputs by dividing by two standard deviations. Statistics in
Medicine. 2008;27:2865-2873

[5] Jaynes E. Prior probabilities. IEEE Transactions on Systems Science and Cybernetics.
1968;4(3):227-241
[6] Berger J, Bernardo J, Dongchu S. The formal definition of reference priors. Annals of
Statistics. 2009;37(2):905-938
[7] Bernardo J. Reference posterior distributions for Bayesian inference (with discussion).
Journal of the Royal Statistical Society, B. 1979;41:113-147
[8] Bernardo J. Reference analysis. In: Dey D, Rao C, editors. Handbook of Statistics. Vol. 25.
Amsterdam: Elsevier; 2005. p. 17-90

[9] Irony T, Singpurwalla N. Noninformative priors do not exist: A discussion with Jose M.
Bernardo. Journal of Statistical Inference and Planning. 1997;65:159-189
[10] Bernardo JM, Smith AFM. Bayesian Theory. New York: Wiley; 1994
[11] Carlin BP, Louis TA. Bayes and Empirical Bayes Methods for Data Analysis. London:
Chapman and Hall; 1996

[12] Gelman A, Carlin JB, Stern HS, Rubin DB. Bayesian Data Analysis. London: Chapman
and Hall; 1995
[13] Lin CY, Gelman A, Price PN, Krantz DH. Analysis of local decisions using hierarchical
modeling, applied to home radon measurement and remediation (with discussion). Sta-
tistical Science. 1999;14:305-337
[14] Nash J. Equilibrium points in n-person games. Proceedings of the National Academy of
Sciences of the United States of America. 1950;36(1):48-49
[15] Debreu G. A social equilibrium existence theorem. Proceedings of the National Academy
of Sciences of the United States of America. 1952;38:886-903
[16] Arrow KJ, Debreu G. Existence of an equilibrium for a competitive economy. Econometrica.
1954;22:265-290

[17] Shafer W, Sonnenschein H. Equilibrium in abstract economies without ordered prefer-


ences. Journal of Mathematical Economics. 1975;2:345-348

[18] Borglin A, Keiding H. Existence of equilibrium actions and of equilibrium: A note on the
“new” existence theorem. Journal of Mathematical Economics. 1976;3:313-316
[19] Huang NJ. Some new equilibrium theorems for abstract economies. Applied Mathemat-
ics Letters. 1998;11(1):41-45
[20] Kim WK, Tan KK. New existence theorems of equilibria and applications. Nonlinear
Analysis: Theory, Methods & Applications. 2001;47:531-542
[21] Lin LJ, Chen LF, Ansari QH. Generalized abstract economy and systems of generalized
vector quasi-equilibrium problems. Journal of Computational and Applied Mathematics.
2007;208:341-353
[22] Briec W, Horvath C. Nash points, Ky Fan inequality and equilibria of abstract economies in
Max-Plus and B-convexity. Journal of Mathematical Analysis and Applications. 2008;341:
188-199
[23] Ding XP, Wang L. Fixed points, minimax inequalities and equilibria of noncompact
abstract economies in FC-spaces. Nonlinear Analysis: Theory, Methods & Applications.
2008;69:730-746
[24] Kim WK, Kum SH, Lee KH. On general best proximity pairs and equilibrium pairs in free
abstract economies. Nonlinear Analysis: Theory, Methods & Applications. 2008;68:2216-2227

[25] Lin LJ, Liu YH. The study of abstract economies with two constraint correspondences.
Journal of Optimization Theory and Applications. 2008;137:41-52

[26] Ding XP, Feng HL. Fixed point theorems and existence of equilibrium points of
noncompact abstract economies for L∗F -majorized mappings in FC-spaces. Nonlinear
Analysis: Theory, Methods & Applications. 2010;72:65-76
[27] Wang L, Cho YJ, Huang NJ. The robustness of generalized abstract fuzzy economies in
generalized convex spaces. Fuzzy Sets and Systems. 2011;176:56-63

[28] Zadeh LA. Fuzzy sets. Information and Control. 1965;8:338-353


[29] Kim WK, Lee KH. Fuzzy fixed point and existence of equilibria of fuzzy games. Journal
of Fuzzy Mathematics. 1998;6:193-202
[30] Patriche M. Bayesian abstract economy with a measure space of agents. Abstract and
Applied Analysis. 2009;2009:1-11
[31] Patriche M. Equilibrium of Bayesian fuzzy economies and quasi-variational inequalities
with random fuzzy mappings. Journal of Inequalities and Applications. 2013;2013:374
[32] Patriche M. Existence of equilibrium for an abstract economy with private information
and a countable space of actions. Mathematical Reports. 2013;15(65)(3):233-242

[33] Patriche M. Fuzzy games with a countable space of actions and applications to systems of
generalized quasi-variational inequalities. Fixed Point Theory and Applications.
2014;2014(124)

[34] Patriche M. Equilibrium in Games and Competitive Economies. Bucharest: The Publish-
ing House of the Romanian Academy; 2011
[35] Saipara P, Kumam P. Fuzzy games for a general Bayesian abstract fuzzy economy model of
product measurable spaces. Mathematical Methods in the Applied Sciences. 2015;39(16):
4810-4819
[36] Chang SS, Zhu YG. On variational inequalities for fuzzy mappings. Fuzzy Sets and
Systems. 1989;32:359-367
[37] Noor MA. Variational inequalities for fuzzy mappings III. Fuzzy Sets and Systems.
2000;110:101-108
[38] Park JY, Lee SY, Jeong JU. Completely generalized strongly quasivariational inequalities
for fuzzy mapping. Fuzzy Sets and Systems. 2000;110:91-99

[39] Noor MA, Elsanousi SA. Iterative algorithms for random variational inequalities.
Panamerican Mathematical Journal. 1993;3:39-50
DOI: 10.5772/intechopen.70058

Chapter 2

Node-Level Conflict Measures in Bayesian Hierarchical Models Based on Directed Acyclic
Graphs

Jørund I. Gåsemyr and Bent Natvig

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/intechopen.70058

Abstract
Over the last decades, Bayesian hierarchical models defined by means of directed,
acyclic graphs have become an essential and widely used methodology in the analysis
of complex data. Simulation-based model criticism in such models can be based on
conflict measures constructed by contrasting separate local information sources about
each node in the graph. An initial suggestion of such a measure was not well calibrated.
This shortcoming has, however, to a large extent been rectified by subsequently pro-
posed alternative mutually similar tail probability-based measures, which have been
proved to be uniformly distributed under the assumed model under various circum-
stances, and in particular, in quite general normal models with known covariance
matrices. An advantage of this is that computationally costly precalibration schemes
needed for some other suggested methods can be avoided. Another advantage is that
noninformative prior distributions can be used when performing model criticism. In this
chapter, we describe the basic framework and review the main uniformity results.

Keywords: cross-validation, data splitting, information contribution, MCMC, model


criticism, pivotal quantity, preexperimental distribution, p-value

1. Introduction

Over the last decades, Bayesian hierarchical models have become an essential and widely used
methodology in the analysis of complex data. Computational techniques such as Markov
Chain Monte Carlo (MCMC) methods make it possible to treat very complex models and data
structures. Analysis of such models gives intuitively appealing Bayesian inference based on
posterior probability distributions for the parameters.

In the construction of such models, an understanding of the underlying structure of the problem
can be represented by means of directed acyclic graphs (DAGs), with nodes in the graph

© 2017 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use,
distribution, and reproduction in any medium, provided the original work is properly cited.


corresponding to data or parameters, and directed edges between parameters representing


conditional distributions. However, a perfect understanding of the underlying structure is
usually an unachievable goal, and there is always a danger of constructing inadequate models.
Box [1] suggests a pattern for the model building process where an initial candidate model is
assessed for adequacy, and if necessary modified and elaborated on, leading to a new candi-
date that again is checked for adequacy, and so on. As a tool in this model criticism process,
Ref. [1] suggests using the prior predictive distribution of some checking function or test
statistic as a reference for the observed value of this checking function, resulting in a prior
predictive p-value. This requires an informative and realistic prior distribution, which is not
always available or even desirable. Indeed, as pointed out in Ref. [2], in an early phase of the
model building process, it is often convenient to use noninformative or even improper priors
and thus avoid costly and time-consuming elicitation of prior information. Noninformative
priors may be used also for the inference because relevant prior information is unavailable.

There exist many other methods for checking the overall fit of the model or an aspect of the
model of special interest, based on locating a test statistic or a discrepancy measure in some
kind of a reference distribution. The posterior predictive p-value (ppp) of Ref. [3] uses the
posterior distribution as reference and does not require informative priors. But this method
uses data twice and can as a result be very conservative [2, 4–6]. Hjort et al. [5] suggest
remedying this by using the ppp value as a test statistic in a prior predictive test. The compu-
tation of the resulting calibrated cppp-value is, however, very computer intensive in the
general case, and again realistic, informative priors are needed. A node-level discrepancy
measure suggested in Ref. [7] is subject to the same limitations. The partial posterior predictive
p-value of Ref. [4] avoids double use of data and allows noninformative priors but may be
difficult to compute and interpret in hierarchical models.

Comparison with other candidate models through a technique for model comparison or model
choice, such as predictive methods, maximum posterior probability, Bayes factors or an infor-
mation criterion, can also serve as tools for checking model adequacy indirectly when alterna-
tive candidate models exist.

In this chapter, we will, however, focus on methods for criticizing models in the absence of any
particular alternatives. We will review methods for checking the modeling assumptions at
each node of the DAG. The aim is to identify parts or building blocks of the model that are in
discordance with reality, which may be in need of adjustment or further elaboration.
O’Hagan [8] regards any node in the graph as receiving information from two disjoint subsets
of the neighboring nodes. This information is represented as a conditional probability density
or a likelihood or as a combination of these two kinds of information sources. Adopting the
same basic perspective, our aim is to check for inconsistency between such subsets. The
suggestion in Ref. [8] is to normalize these information sources to have equal height 1 and to
regard the height of the curves at the point of intersection as a measure of conflict. However, as
shown in Ref. [2], this measure tends to be quite conservative. Dahl et al. [9] demonstrated that
it is also poorly calibrated, with false warning probabilities that vary substantially between
models. Dahl et al. [9] also identified the different sources of inaccuracy and modified the
measure of Ref. [8] to an approximately χ2-distributed quantity under the assumed model by
Node-Level Conflict Measures in Bayesian Hierarchical Models Based on Directed Acyclic Graphs 25
http://dx.doi.org/10.5772/intechopen.70058

instead normalizing the information sources to probability densities. In Ref. [10], these densi-
ties were instead used to define tail probability-based conflict measures. Gåsemyr and
Natvig [10] showed that these measures are uniformly distributed in quite general hierarchical
normal models with fixed variances/covariances. In Ref. [11], such uniformity results were
proved in various situations involving nonnormal and nonsymmetric distributions. These
uniformity results indicate that the measures of Refs. [9] and [10] have comparable interpreta-
tions across different models. Therefore, they can be used without computationally costly
precalibration schemes, such as the one suggested in Ref. [5]. Gåsemyr [12] focuses on some
situations where the conflict measure approach can be directly compared to the calibration
method of Ref. [5] and shows that the less computer-intensive conflict measure approach
performs at least as well in these situations. Moreover, the conflict measure approach can be
applied in models using noninformative prior distributions.
Focusing on the special problem of identifying outliers among the second-level parameters in a
random-effects model, Ref. [13] defines similar conflict measures. In this setting, the group-
specific means are the nodes of interest. In some models, there exist sufficient statistics for
these means. Then, outlier detection at the group level can also be based on cross validation,
measuring the tail probability beyond the observed value of the statistic in the posterior
predictive distribution given data from the other groups. In this context, the conflict measure
approach can be viewed as an extension of cross-validation to situations where sufficient
statistics do not exist. Also in Ref. [13] applications to the examination of exceptionally high
hospital mortality rates and to results from a vaccination program are given. In Ref. [14], this
methodology is used to check for inconsistency in multiple treatment comparison of random-
ized clinical trials. Presanis et al. [15] apply these conflict measures in complex cases of medical
evidence synthesis.

2. Directed acyclic graphs and node-specific conflict

2.1. Directed acyclic graphs and Bayesian hierarchical models


An example of a DAG discussed extensively in Ref. [8] is the random-effects model with
normal random effects and normal error terms defined by

Y_{i,j} ~ N(λ_i, σ²),   λ_i ~ N(μ, τ²),   j = 1, …, n_i,   i = 1, …, m.   (1)

In general, we identify the nodes or vertices of the graph with the unknown parameters θ and
the observed data y, the latter appearing as bottom nodes and being the realizations of the
random vector Y. In the Bayesian model, the parameters, the components of θ, are also
considered as random variables. In general, if there is a directed edge from node a to node b,
then a is a parent of b, and b is a child of a. We denote by Ch(a) the set of child nodes of a, and
by Pa(b) the set of parent nodes of b. More generally, b is a descendant of a if there is a directed
path from a to b. The set of descendants of a is denoted by Desc(a) and, for convenience, is
defined to contain a itself. The directed edges encode conditional independence assumptions,
indicating that, given its parents, a node is assumed to be independent of all other

nondescendants. Hence, writing θ = (ν, μ), with μ representing the vector of top-level nodes, the
joint density of (Y, θ) = (Y, ν, μ) is
p(y, ν, μ) = ∏_{y∈y} p(y | Pa(y)) ∏_{ν∈ν} p(ν | Pa(ν)) π(μ),   (2)

where π(μ) is the prior distribution of μ. The posterior distribution π(θ|y) is the basis for the
inference.
This setup can be generalized in various directions. The nodes may be allowed to represent
vectors, at both the parameter and the data levels [10]. Instead of DAGs, one may consider
chain graphs, as described in Ref. [16], with undirected edges representing mutual dependence
as in Markov random fields. Scheel et al. [17] introduce a graphical diagnostic for model
criticism in such models.

2.2. Information contributions


The representation of a Bayesian hierarchical model in terms of a DAG is often meant to reflect
an understanding of the underlying structure of the problem. By looking for a conflict associ-
ated with the different nodes in the DAG, we may therefore put our understanding of this
structure to test. We may also identify parts of the model that need adjustment.
The idea put forward in Ref. [8] is that for each node λ in a DAG one may in general think of each
neighboring node as providing information about λ and that it is of interest to consider the
possibility of conflict between different sources of information. For instance, one may want to
contrast the local prior information provided by the factor p(λ|Pa(λ)) with the likelihood informa-
tion source formed by multiplying the factors p(γ|Pa(γ)) for all child nodes γ ∈ Ch(λ). The full
conditional distribution of λ given all the observed and unobserved variables in the DAG, i.e.,
π(λ | (y, θ)_{−λ}) ∝ p(λ | Pa(λ)) ∏_{γ∈Ch(λ)} p(γ | Pa(γ))   (3)

is determined by these two types of factors. Here, (y, θ)_{−λ} denotes the vector of all components
of (y, θ) except for λ.
Dahl et al. [9] normalize the product ∏_{γ∈Ch(λ)} p(γ | Pa(γ)) to a probability density function denoted
by fc(λ), the likelihood or child node information contribution, whereas the local prior density is
denoted by fp(λ) and called the prior or parent node information contribution. These information
contributions are integrated with respect to posterior distributions for the unknown nuisance
parameters to form integrated information contribution (iic) denoted by gc and gp. In this
construction, a key to avoiding the conservatism of the measure suggested in Ref. [8] is to prevent
dependence between the two information sources by introducing a suitable data splitting
Y = (Y_p, Y_c) and conditioning the parameters of f_p on y_p and the parameters of f_c on y_c.
Definition 1 For a given parameter node λ, denote by β_p the vector whose components are Pa(λ), and
by β_c the vector whose components are

∪_{γ∈Ch(λ)} ({γ} ∪ Pa(γ)) − {λ} = Ch(λ) ∪ [Pa(Ch(λ)) − {λ}].   (4)

Let Y = (Y_p, Y_c) be a splitting of the data Y. Define the densities f_p, f_c, the prior and likelihood
information contributions, respectively, by

f_p(λ; β_p) = p(λ | β_p),   f_c(λ; β_c) ∝ ∏_{γ∈Ch(λ)} p(γ | Pa(γ)).   (5)

Define the integrated information contribution densities g_p, g_c by

g_p(λ) = ∫ f_p(λ; β_p) π(β_p | y_p) dβ_p,   g_c(λ) = ∫ f_c(λ; β_c) π(β_c | y_c) dβ_c,   (6)

and denote by G_p, G_c the corresponding cumulative distribution functions.

Note that βc may contain data nodes. The second integral in Eq. (6) is then taken only with
respect to the random components of βc, i.e., the parameters in βc. If βc contains no parameters,
then gc and fc coincide. Definition 1 may also be extended to the case when λ is a vector,
corresponding to a subset of parameter nodes.

Combining the set of information sources linked to a specific node in different ways leads to a
modification of Definition 1 where βc does not contain all child nodes of λ, the others being
instead included in βp together with their parent nodes. In this way, different types of conflict
about the node may be revealed. This is natural, e.g., in the context of outlier detection among
independent observations with a common mean. Note that βp and βc may then be overlapping,
containing common coparents with λ. The setup is illustrated in Figure 1 in the case when the
set of common components, by abuse of notation denoted by β_p ∩ β_c, is empty. For the general
setup, Definition 1 is modified as follows.

Figure 1. Part of a DAG showing information sources about λ.

Definition 2 Let γ be a vector whose components are a subset of Ch(λ), and define βc as in Eq. (4).
Denote by γ1 the rest of the child nodes of λ, and let βp consist of γ1 together with its parent nodes in the
same way as in Eq. (4), as well as Pa(λ). The information contributions are then given by

f_p(λ; β_p) ∝ p(γ_1 | Pa(γ_1)) p(λ | Pa(λ)),   (7)

f_c(λ; β_c) ∝ p(γ | Pa(γ)).   (8)

In Eq. (7), p(λ | Pa(λ)) is replaced by the prior density π(λ) if λ is a top-level parameter. The
corresponding iic densities are defined by Eq. (6) as before.

2.3. Node-specific conflict measures

The conflict measure c2_λ of Ref. [9] is defined as

c2_λ = (E_{G_p}(λ) − E_{G_c}(λ))² / (var_{G_p}(λ) + var_{G_c}(λ)).   (9)

The χ²_1-distribution is the reference distribution for this measure. For the conflict measures of
Ref. [10], the uniform distribution on [0, 1] is the reference distribution. They focus on tail
behavior but are based on the same iic distributions. The general distribution of information
sources given in Definition 2 is also introduced in Ref. [10]. For a given pair G_p, G_c of iic
distributions, let λ*_p and λ*_c be independent samples from G_p and G_c, respectively. Let G be the
cumulative distribution function for δ = λ*_p − λ*_c. Define

c3+_λ = G(0),   c3−_λ = Ḡ(0) := 1 − G(0)   (10)

and

c3_λ = 1 − 2 min(G(0), Ḡ(0)) = 2|G(0) − 1/2|.   (11)

The c3+_λ-measure and the P^conf_λ-measure of Ref. [13] are very similar. The latter measure is
aimed at detecting outlying groups or units in a three-level hierarchical model, with the
second-level parameters being location parameters for group-specific data. However, the measure
is interpreted as a p-value, with small values indicative of conflict. Gåsemyr and
Natvig [10] also define a measure based on a tail area in terms of the density g of G,
namely

c4_λ = P_G(g(δ) > g(0)),   (12)

applicable also when λ is a vector.
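In practice, G is rarely available in closed form, and the tail measures are estimated from Monte Carlo samples of δ = λ*_p − λ*_c. The following is a minimal Python sketch, not from the chapter; the normal sampling distributions below simply stand in for draws from G_p and G_c.

```python
import numpy as np

def c3_measures(lam_p, lam_c):
    """Estimate c3+, c3- and c3 from independent samples of Gp and Gc.

    Pairing equal-length independent draws gives an i.i.d. sample of
    delta = lam_p* - lam_c*; G(0) is estimated by the empirical CDF at 0.
    """
    delta = np.asarray(lam_p) - np.asarray(lam_c)
    G0 = np.mean(delta <= 0.0)                 # estimate of G(0)
    return G0, 1.0 - G0, 2.0 * abs(G0 - 0.5)   # c3+, c3-, c3

rng = np.random.default_rng(0)
# No conflict: both information contributions centered at the same value.
c3p, c3m, c3 = c3_measures(rng.normal(0, 1, 100_000), rng.normal(0, 1, 100_000))
# Clear conflict: the two sources disagree by three standard deviations.
_, _, c3_conflict = c3_measures(rng.normal(3, 1, 100_000), rng.normal(0, 1, 100_000))
```

Under no conflict, c3 is close to 0, while the shifted case gives a value close to 1. The c4 measure would additionally require a kernel density estimate of g(δ) from the same differences.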



Example 1. To illustrate the theory, consider the random-effects model (1), with the variance
parameters σ², τ² assumed known, and with μ having the improper prior π(μ) = 1. For
simplicity, assume n_i = n for all i. Suspecting the mth group of representing an outlier, let λ = λ_m
be the node of interest. Define the data splitting Y_p, Y_c by letting Y_c = Y_m = (Y_{m,1}, …, Y_{m,n}),
and let β_c = y_c, β_p = μ. Denoting the normal density function by φ, it is easy to see that
g_c(λ) = f_c(λ) = φ(λ; ȳ_c, σ²/n). Furthermore, f_p(λ; μ) = φ(λ; μ, τ²). Given y_p, μ has the density

π(μ | y_p) = φ(μ; Σ_{i=1}^{m−1} ȳ_i/(m − 1), (1/(m − 1))τ² + (1/(n(m − 1)))σ²).

By a standard argument,

g_p(λ) = ∫ f_p(λ; μ) π(μ | y_p) dμ = φ(λ; Σ_{i=1}^{m−1} ȳ_i/(m − 1), (1 + 1/(m − 1))τ² + (1/(n(m − 1)))σ²).

It follows that

g(δ) = φ(δ; Σ_{i=1}^{m−1} ȳ_i/(m − 1) − ȳ_c, (m/(m − 1))(τ² + σ²/n)).

The conflict measures (Eqs. (9), (10), (11), and (12)) can hence be calculated analytically, with no simulation
needed in this case.
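The closed-form expressions in Example 1 are easy to code directly. The sketch below (with made-up group means, purely for illustration) computes c3+ and c3 for the mth group; each group enters only through its mean ȳ_i, by sufficiency.

```python
import numpy as np
from scipy.stats import norm

def example1_c3(group_means, n, sigma2, tau2):
    """Analytic conflict measures for the last group in the random-effects
    model (1), with sigma^2 and tau^2 known and a flat prior on mu."""
    ybar = np.asarray(group_means, dtype=float)
    m = len(ybar)
    # delta ~ N(mean of other group means - ybar_m, (m/(m-1))(tau2 + sigma2/n))
    mu_delta = ybar[:-1].mean() - ybar[-1]
    sd_delta = np.sqrt(m / (m - 1) * (tau2 + sigma2 / n))
    G0 = norm.cdf(0.0, loc=mu_delta, scale=sd_delta)   # c3+
    return G0, 2.0 * abs(G0 - 0.5)                     # (c3+, c3)

# The last group mean sits far from the others: strong conflict expected.
c3_plus, c3 = example1_c3([0.10, -0.20, 0.05, 4.0], n=10, sigma2=1.0, tau2=0.25)
```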

In a simulation study of the c2_λ-measure in Ref. [9] using a warning level equal to the 95%
quantile of the χ²_1-distribution, a false warning probability of close to 5% is obtained for a
normal random-effects model with unknown variance parameters as in Eq. (1) and also in
similar random-effects models with heavy-tailed t- and uniformly distributed random effects.
Also with respect to detection power, this measure performs well when compared to a cali-
brated version of the measure given in Ref. [8], if an optimal data splitting is used. Refs. [10]
and [11] prove preexperimental uniformity of the conflict measures in various situations, i.e.,
their distributions as functions of a Y which is distributed according to the assumed model are
uniform, regardless of the true value of the basic parameter. Another way of stating this is that
we obtain a proper p-value by subtracting these measures from 1. These results are reviewed in
Section 5 of the present chapter.

2.4. Integrated information contributions as posterior distributions

In most cases, the conflict measures of Refs. [9] and [10] are based on simulated samples from
Gp and Gc. Definitions 1 and 2 suggest obtaining such samples by running an MCMC algo-
rithm to generate posterior samples of the unknown parameters in βp and βc and then generate
samples λ*_p and λ*_c from the respective information contributions for each such parameter
sample. If the information contributions are standard probability densities, this procedure is
straightforward. If not, one may instead often use the fact that, under certain conditions on the
data splitting, the distributions Gp and Gc are posterior distributions conditional on yp and yc,
respectively, the latter based on the improper prior π(λ) = 1, independently of the coparents.

Theorem 1 Suppose that the data splitting satisfies

Y_c = Y ∩ [∪_{γ∈Ch(λ)∩β_c} Desc(γ)],   Y_p = Y − Y_c,   (13)



the latter expression by abuse of notation meaning the components of Y not present in Y_c. Assume λ and
the coparents Pa(Ch(λ) ∩ β_p) − λ are independent. We then have

g_p(λ) = π(λ | y_p)

and, specifying as prior density

π(λ | Pa(Ch(λ) ∩ β_c) − λ) = 1,   (14)

g_c(λ) = π(λ | y_c).

The proof is given in Appendix A in the online supporting information for Ref. [11]. Specializing
to the standard setup of Definition 1, where Ch(λ) ⊆ β_c, we see that the requirement for Eq. (13) to
hold is that Y_c consists of all data descendant nodes of λ. In Ref. [9], this splitting was compared
with two other splittings for c2_λ and found to be optimal with respect to detection power. This
measure was also found to be well calibrated under this splitting.
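The two-stage sampling implied by Definitions 1 and 2 is straightforward once posterior draws of the nuisance parameters are available. In the sketch below, the posterior for μ given y_p is taken to be a known normal purely for illustration; in a real analysis, those draws would come from an MCMC run.

```python
import numpy as np

rng = np.random.default_rng(1)
tau = 1.0   # known random-effects standard deviation, as in model (1)

# Stage 1: posterior samples of the nuisance parameter beta_p = mu given y_p.
# Assumed to be N(0.3, 0.5^2) here; in practice these come from MCMC.
mu = rng.normal(0.3, 0.5, size=200_000)

# Stage 2: one draw lambda_p* ~ f_p(lambda; mu) = N(mu, tau^2) per sample.
lam_p = rng.normal(mu, tau)

# By Eq. (6), lam_p is then a sample from the integrated information
# contribution g_p, which here is N(0.3, 0.5^2 + tau^2).
```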

3. Noninvariance and reparametrizations

The iic distributions and the corresponding conflict measures are parametrization dependent.
Based on experience so far, the conflict measures seem to be fairly robust to changes in
parametrization. However, this noninvariance can be handled in a theoretically satisfactory
way under certain circumstances.

Let φ be the parameter, in a standard parametrization, corresponding to a specific node in the
DAG. Suppose for simplicity that Y_c = Ch(φ). Assume that there exists a sufficient statistic Y_c
and an alternative parametrization λ, being a strictly monotonic function λ(φ), such that Y_c − λ
is a pivotal quantity, i.e., the density for Y_c given λ is of the form

p(y_c | λ) = f_{Y_c}(y_c | λ) = f_0(y_c − λ)   (15)

for some known density function f_0. Such a parametrization will be considered as a canonical
or reference parametrization if it exists, as opposed to the standard parametrization involving
φ. Accordingly, the conflict measures given in Eqs. (9)–(12) are preferably based on this
reference parametrization.
By Theorem 1, samples λ*_c from G_c may be obtained by MCMC as posterior samples from
π(λ | y_c) when the splitting satisfies Eq. (13) and the prior for λ satisfies Eq. (14), i.e., equals 1.
According to an argument given in Section 1.3 of Ref. [18], such a prior expresses noninformativity
for likelihoods of the form (15). Computationally, we may, however, use the
standard parametrization. When generating φ*_c as posterior samples from π(φ | y_c), the prior
density |dλ/dφ| for φ must be used. Then, we may calculate λ*_c = λ(φ*_c). To represent the iic
distribution G_p(λ), we may calculate λ*_p = λ(φ*_p) for samples φ*_p from π(φ | y_p) according to the
given model. Now, the c4_λ-measure can be estimated from Eq. (12), using a kernel density
estimate of g(δ) based on corresponding samples δ* = λ*_p − λ*_c. However, if we limit attention
to the c3_λ-measure (11) and its one-sided versions (10), we may use the samples from
π(φ | y_c) and π(φ | y_p) directly. To see this, note that the condition λ*_p ≥ λ*_c is equivalent to the
condition φ*_p ≥ φ*_c (assuming that λ is increasing as a function of φ). Hence, the probability
G(0) that λ*_p − λ*_c ≤ 0 can be estimated as the proportion of sample values for which φ*_p ≤ φ*_c.
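Because the c3-type measures depend only on the ordering of λ*_p and λ*_c, the proportion estimate can be computed in whichever parametrization is convenient. A small illustration with hypothetical posterior draws (the gamma distributions are arbitrary stand-ins for posterior samples):

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical posterior samples in the standard parametrization phi:
phi_p = rng.gamma(shape=5.0, scale=1.0, size=50_000)
phi_c = rng.gamma(shape=8.0, scale=1.0, size=50_000)

# Estimate G(0) as the proportion with phi_p* <= phi_c*. Any strictly
# increasing reparametrization lam = lam(phi), e.g. lam = log(phi),
# preserves the ordering and hence the estimate.
G0_phi = np.mean(phi_p <= phi_c)
G0_log = np.mean(np.log(phi_p) <= np.log(phi_c))
```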

4. Extensions to deterministic nodes: relation to cross-validation, prediction and hypothesis testing
4.1. Cross-validation and data node conflict
The model variables Y are represented by the bottom nodes in the DAG describing the hierar-
chical model. The framework can be extended to also cover conflict concerning these nodes. In
this way, cross-validation can be viewed as a special case of the conflict measure approach.
Let Yc be an element in the vector Y of observable random variables. We define the prior iic
density gp(yc) exactly as in Eq. (6), with λ replaced by yc. The Dirac measure at the observed
value yc represents a degenerate iic information contribution about Yc. This leads to the
following definitions:

c3+_{y_c} = G_p(y_c),   c3−_{y_c} = Ḡ_p(y_c),   (16)

c3_{y_c} = 1 − 2 min(G_p(y_c), Ḡ_p(y_c)),   (17)

c4_{y_c} = P_{g_p}(g_p(Y_c) ≥ g_p(y_c)).   (18)

The measures (Eqs. (16)–(18)) are called data node conflict measures. To see that these definitions
are consistent with Eqs. (10)–(12), note that λ*_p corresponds to Y_c, and λ*_c is deterministic
and corresponds to y_c. We define X = Y_c − y_c, corresponding to δ. We then have
g(x) = g_p(x + y_c). Hence,

G(0) = ∫_{−∞}^{0} g(x) dx = ∫_{−∞}^{y_c} g_p(y) dy = G_p(y_c),

and accordingly, Ḡ(0) = Ḡ_p(y_c). It follows that Eqs. (16) and (17) are special cases of Eqs. (10)
and (11). Moreover,

P_g(g(X) ≥ g(0)) = P_{g_p}(g_p(Y_c) ≥ g_p(y_c)),

showing that Eq. (18) is a special case of Eq. (12).

Furthermore, this correspondence between the data node conflict measures (Eqs. (16) and (17))
and the parameter node conflict measures (Eqs. (10) and (11)) can be used to motivate these
latter measures. We will treat the c3+ measure as an example. Consider again a parameter node

λ. If λ were actually observable and known to take the value λ_c, the data node version of the c3+
measure could be used to measure deviations toward the right tail of G_p as

G_p(λ_c) = ∫_{−∞}^{λ_c} g_p(λ) dλ = ∫_{−∞}^{0} g_p(δ + λ_c) dδ.

Now λ is in reality not known, but we can take the expectation of this conflict with respect to
the distribution Gc, which reflects the uncertainty about λ when influence from data yp is
removed. The result is the following theorem:

Theorem 2

E_{G_c}(G_p(λ)) = c3+_λ.

Proof:

E_{G_c}(G_p(λ)) = ∫_{−∞}^{∞} g_c(λ) [∫_{−∞}^{0} g_p(δ + λ) dδ] dλ = ∫_{−∞}^{0} [∫_{−∞}^{∞} g_p(δ + λ) g_c(λ) dλ] dδ
              = ∫_{−∞}^{0} g(δ) dδ = G(0) = c3+_λ

by Eq. (10).
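Theorem 2 is easy to verify numerically in a simple normal setting, where both sides are available in closed form: with G_p = N(a, s²) and G_c = N(b, t²), δ is N(a − b, s² + t²), so c3+_λ = G(0) is an explicit normal probability. A sketch with arbitrary illustrative values:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
a, s = 1.0, 0.7   # G_p = N(a, s^2), illustrative values
b, t = 0.2, 1.1   # G_c = N(b, t^2)

# Left side of Theorem 2: E_{G_c}[ G_p(lambda) ], by Monte Carlo.
lam_c = rng.normal(b, t, size=200_000)
lhs = norm.cdf(lam_c, loc=a, scale=s).mean()

# Right side: c3+ = G(0), with delta ~ N(a - b, s^2 + t^2).
rhs = norm.cdf(0.0, loc=a - b, scale=np.hypot(s, t))
```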

4.2. Cross-validation and sufficient statistics


Suppose the node λ of interest is the parent of the subvector Y_c of Y. Suppose also that Y_c is a
sufficient statistic for Y_c. Evidently then, the measures c3+_λ and c3+_{Y_c} address the same kind of
possible conflict in the model. The following theorem, proved in Ref. [11], states that the two
measures agree under certain conditions. This is a generalization of a result in Ref. [13], which
also unnecessarily assumed symmetry for the conditional density of Y_c.

Theorem 3 Suppose the conditional density for the scalar variable Y_c given the parameter λ is of the
form f_{Y_c}(y | λ) = f_{c,0}(y − λ). Then,

c3+_{Y_c} = c3+_λ.

When a sufficient statistic exists, the cross-validatory p-value is considered by Ref. [13] as the
gold standard, and the aim of their construction is to provide a measure which is generally
applicable and matches cross-validation when a sufficient statistic exists.

4.3. Prediction

As mentioned in Section 2, the c4 measure can be used to assess conflict concerning vectors of
nodes. Applying this at the data node level, we may assess the quality of predictions of a
subvector Yc of Y based on a complementary subvector yp of observations. The relevant
Node-Level Conflict Measures in Bayesian Hierarchical Models Based on Directed Acyclic Graphs 33
http://dx.doi.org/10.5772/intechopen.70058

measure is given by Eq. (18), with Yc replaced by the vector Yc. This is particularly well suited
to models where data accumulate as time evolves. Such a conflict measure can be used to
assess the overall quality of the model. It can also be used as a tool for model comparison and
model choice.

4.4. Hypothesis testing


Suppose the top-level nodes μ appearing in Eq. (2) are assumed fixed and known according to
the model, so that π(μ) is a Dirac measure at these fixed values of the components of μ. Hence,
the DAG has deterministic nodes both at the top and at the bottom, namely the vectors μ and y,
respectively. We may then check for a conflict concerning a component λ of μ by introducing a
random version λ̃ of λ and contrasting the corresponding g_c(λ̃) with the fixed value λ. The
random λ̃ has the same children and coparents as λ, and the vector β_c, the information
contribution f_c(λ̃; β_c) and the iic density g_c are defined as in Eqs. (4), (5) and (6). The respective
conflict measures are defined as in Eqs. (16)–(18) with y_c replaced by λ and G_p and g_p replaced
by G_c and g_c. If the model is rejected when the conflict exceeds a certain predefined warning
level, this corresponds to a formal Bayesian test of the hypothesis λ̃ = λ. Using the conflict
measure (Eq. (18)), we may put the whole vector μ to test in this way.

5. Preexperimental uniformity of the conflict measures

In this section, we review some results concerning the distribution of the conflict measures. If c
is one of the measures (Eqs. (10), (11), (12), (16), (17) or (18)), then preexperimentally, i.e., prior
to observing the data y, c is a random variable taking a value in [0, 1]. A large value of c
indicates a possible conflict in the model, and uniformity of c corresponds to 1 – c being a
proper p-value. This does not mean that we propose a formal hypothesis testing procedure for
model criticism, possibly even adjusted for multiple testing, nor that we think that a fixed
significance level represents an appropriate criterion signaling the need for changing the
model. A relatively large value of c may be accepted if there are convincing arguments for
believing in a particular modeling aspect, while a less extreme value of c may indicate a need
for adjustments in modeling aspects that are considered questionable for other reasons. But the
terms “relatively large” and “less extreme” must refer to a meaningful common scale. In our
view, uniformity of the conflict measure under all sources of uncertainty is the natural ideal
criterion for being a well-calibrated conflict measure, the fulfillment of which ensures compa-
rable assessment of the level of conflict across models. This means that we aim for
preexperimental uniformity in cases where the prior distribution is highly noninformative,
and also, as discussed in the following subsection, in cases where an informative prior repre-
sents part of the randomness in the data-generating process (aleatory uncertainty) rather than
subjective (epistemic) uncertainty about the location of a fixed but unknown λ. In this chapter,
we limit attention to situations where exact uniformity is achieved. The pivotality condition
(Eq. (15)) turns out to be a key assumption needed to obtain such exact results. Refs. [10]
and [12] provide some examples where exact uniformity is achieved in other cases.

5.1. Data-prior conflict

Consider the model

Y ~ F_Y(y | λ),   λ ~ F_λ(λ),

where F_λ is an arbitrary informative prior distribution. Here, we think of this prior distribution
as representing aleatory rather than epistemic uncertainty. The corresponding densities are
denoted by f_Y and f_λ. If contrasting the prior density with the likelihood f_Y(y | λ) indicates a
conflict between the prior and likelihood information contributions, we consider this a data-prior
conflict. The following theorem, proved in Ref. [11], deals with this kind of conflict. Note
that in this situation, the Y_p part of the data splitting is empty.

Theorem 4 Suppose the conditional density for the scalar variable Y given the parameter λ is of the
form f_Y(y | λ) = f_0(y − λ) and that λ is generated from an arbitrary informative prior density f_λ(λ).
Then, the data-prior conflict measures about λ are preexperimentally uniformly distributed for both the
c3_λ- and c4_λ-measures.

The theorem obviously applies to the location parameter of normal and t-distributions with
fixed variance parameters, as well as the location parameter in the skew normal distribution [19].
If the vector Y consists of IID normal variables, the theorem also applies to the
location parameter, using as scalar variable the sufficient statistic Ȳ. If the n components of Y
are IID exponentially distributed with failure rate λ, their sum is a sufficient statistic that is
gamma distributed with shape parameter n and scale parameter λ. We may then use the fact
that for a variable Y which is gamma distributed with known shape parameter and unknown
scale parameter λ, the quantity log(Y) − log(λ) is a pivotal statistic, and uniformity is
obtained by combining Theorem 4 with the approach of Section 3. In the standard parametrization,
the appropriate prior distribution is π(λ) = 1/λ. Details are given in Ref. [11],
which also deals with the gamma, inverse gamma, Weibull and lognormal distributions in a
similar way.
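For the normal location case, the data-prior conflict has a closed form, and Theorem 4 can be checked by simulation: drawing λ from the prior and then Y given λ should make c3 uniform on [0, 1]. A sketch with illustrative parameter values:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
mu0, tau, sigma = 0.5, 1.2, 0.8   # prior N(mu0, tau^2), likelihood N(lambda, sigma^2)
N = 100_000

# Pre-experimental replications of the data-generating process.
lam = rng.normal(mu0, tau, N)
y = rng.normal(lam, sigma)

# Here delta ~ N(mu0 - y, tau^2 + sigma^2), so
# G(0) = Phi((y - mu0)/sqrt(tau^2 + sigma^2)) and c3 = 2|G(0) - 1/2|.
G0 = norm.cdf((y - mu0) / np.hypot(tau, sigma))
c3 = 2.0 * np.abs(G0 - 0.5)
```

Preexperimentally, Y ~ N(μ0, τ² + σ²), so G(0) is uniform on [0, 1] and hence so is c3, in line with Theorem 4.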

5.2. Data-data conflict

Suppose all components of Y have distributions determined by the same parameter λ.


Suppose we want to contrast information contributions from separate parts of Y about λ
and define the splitting (Y_p, Y_c) accordingly. Focusing on this kind of possible conflict, we
assume complete prior ignorance about λ and accordingly assume that λ has the improper
prior π(λ) = 1. Hence, recalling Eqs. (7) and (8), we contrast the information in f_c(λ; Y_c) with
that in f_p(λ; Y_p). We use the term data-data conflict in this context, since there is no prior
information incorporated in f_p, and the two information contributions play symmetric roles.
However, as a particular application, one may think of Y_c as a scalar variable representing a
possible outlier.

The following theorem is proved in Ref. [11].

Theorem 5 Suppose that the conditional densities for the scalar variables Y_p and Y_c given the
parameter λ are of the form f_{Y_p}(y | λ) = f_{p,0}(y − λ), f_{Y_c}(y | λ) = f_{c,0}(y − λ).

Assume λ has the improper prior π(λ) = 1. Then, the data-data conflict measures about λ are
preexperimentally uniformly distributed for both the c3_λ- and c4_λ-measures.

Theorem 5 can be applied if the components of Yc and Yp are normally or lognormally


distributed with known variance parameter, exponentially distributed, or gamma, inverse
gamma or Weibull with known shape parameter, since pivotal quantities based on sufficient
statistics exist for these distributions.

5.3. Normal hierarchical models with fixed covariance matrices


Allowing each y and ν appearing in Eq. (2) to be interpreted as vectors of nodes, we now
assume that each conditional distribution in the decomposition (Eq. (2)) is multinormal with
fixed and known covariance matrices. The random-effects model (Eq. (1)) is a simple example
of this. We also assume that the top-level parameter vector μ has the improper prior 1 and that
each linear mapping Pa(ν) → E(ν | Pa(ν)) has full rank.

Now let λ be any node in the model description. It is standard to verify that, regardless of how
the vector of neighboring and coparent nodes β is decomposed into β_p, containing Pa(λ), and
β_c, the densities f_p(λ; β_p) and f_c(λ; β_c) of Eqs. (5) and (8) are multinormal with fixed covariance
matrices. Furthermore, this is true also for the iic densities g_p and g_c of Eq. (6), regardless of the
data splitting. It follows that the density g of the difference δ between independent samples from
g_p and g_c is multinormal with expectation E_G(δ) = E_{G_p}(λ) − E_{G_c}(λ) and covariance matrix
cov_G(δ) = cov_{G_p}(λ) + cov_{G_c}(λ). It follows that (δ − E_G(δ))^t cov_G(δ)^{−1} (δ − E_G(δ)) is χ²-distributed
with n = dim(λ) degrees of freedom, and the probability under G that g(δ) > g(0) is easily seen
to be Ψ_n(E_G(δ)^t cov_G(δ)^{−1} E_G(δ)), where Ψ_n is the cumulative distribution function for the χ²_n-distribution.
The preexperimental uniformity of this quantity is proved in Ref. [10].

Theorem 6 Consider a hierarchical normal model as described above.

i. Let λ be an arbitrary scalar or vector parameter node. If the data splitting satisfies Eq. (13), then
c4_λ is uniformly distributed preexperimentally.

ii. Suppose the data splitting (Y_p, Y_c) satisfies Ch(Pa(Y_c)) = Y_c. Then, c4_{Y_c} is uniformly distributed
preexperimentally.

If λ in (i) or Y_c in (ii) is one dimensional, then G is symmetric and unimodal, and therefore, the
respective c3-measures are defined and coincide with the c4-measures. Gåsemyr and Natvig [10] also
show that in that case the c3+- and c3−-measures are uniformly distributed preexperimentally.
Example 2. Consider the following DAG model, a regression model with randomly varying regression coefficients:

$$ Y_{i,j} \sim N(X_{i,j}^t \xi_i, \sigma^2), \quad \xi_i \sim N(\xi, \Omega), \quad j = 1, \ldots, n, \; i = 1, \ldots, m, \quad \pi(\xi) \propto 1. \qquad (19) $$

The m units could be groups of individuals, with y_{i,j} the measurement for a group member with individual covariate vector X_{i,j}, or individuals, with the successive y_{i,j} representing repeated measurements over time. In this model, we could check for possibly exceptional behavior of the mth unit by means of the conflict measure c4_{ξ_m}. With a data splitting for which Y_c = Y_m = (Y_{m,1}, …, Y_{m,n}), the conditions of Theorem 6, part (i), are satisfied if dim(ξ) ≤ n, and the measure is preexperimentally uniformly distributed.
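The model of Eq. (19) is easy to simulate, which is useful for empirically exploring conflict measures such as c4_{ξ_m}. The following sketch (in Python rather than the authors' setting; a scalar covariate and all parameter values are invented for illustration) draws data from the random-coefficient regression:

```python
import random

def simulate_units(m, n, xi_mean, omega_sd, sigma, rng):
    """Draw data from the random-coefficient regression of Eq. (19),
    specialized to a scalar covariate:
    xi_i ~ N(xi_mean, omega_sd^2), Y_ij ~ N(x_ij * xi_i, sigma^2)."""
    units = []
    for _ in range(m):
        xi_i = rng.gauss(xi_mean, omega_sd)              # unit-level coefficient
        x = [rng.uniform(0.5, 1.5) for _ in range(n)]    # covariates
        y = [rng.gauss(xij * xi_i, sigma) for xij in x]  # repeated measurements
        units.append((x, y))
    return units

rng = random.Random(1)
units = simulate_units(m=5, n=100, xi_mean=2.0, omega_sd=0.5, sigma=0.1, rng=rng)

# Crude per-unit least-squares estimates recover the xi_i, the quantities
# whose conflict with the population level c4 would assess.
xi_hats = [sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
           for x, y in units]
```

Repeating such simulations and recomputing the conflict measure on each replicate is the kind of preexperimental sampling under which Theorem 6 guarantees uniformity.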

6. Concluding remarks

The assumption of fixed covariance matrices in the previous subsection is admittedly quite
restrictive. In general, the presence of unknown nuisance parameters, such as parameters
describing the covariance matrices in a normal model, makes the derivation of exact uniformity at least difficult and often impossible. Promising approximate results are reported in Ref. [9] for the closely related c2_λ measure. Further empirical studies are needed to examine to what extent the conflict measures are approximately uniformly distributed in other situations. As an informal tool to be used in conjunction with subject-matter insight, the conflict measure approach does not require exact uniformity in order to be useful.

Author details

Jørund I. Gåsemyr* and Bent Natvig


*Address all correspondence to: [email protected]

University of Oslo, Norway

References

[1] Box GEP. Sampling and Bayes’ inference in scientific modelling and robustness (with
discussion and rejoinder). Journal of the Royal Statistical Society. Series A. 1980;143:383-430

[2] Bayarri MJ, Castellanos ME. Bayesian checking of the second levels of hierarchical
models. Statistical Science. 2007;22:322-343
[3] Gelman A, Meng X-L, Stern H. Posterior predictive assessment of model fitness via
realized discrepancies (with discussion and rejoinder). Statistica Sinica. 1996;6:733-807
[4] Bayarri MJ, Berger JO. P values in composite null models (with discussion). The Journal
of the American Statistical Association. 2000;95:1127-1142
[5] Hjort NL, Dahl FA, Steinbakk GH. Post-processing posterior predictive p-values. The
Journal of the American Statistical Association. 2006;101:1157-1174
[6] Dahl FA. On the conservativeness of posterior predictive p-values. Statistics and Proba-
bility Letters. 2006;76:1170-1174
Node-Level Conflict Measures in Bayesian Hierarchical Models Based on Directed Acyclic Graphs 37
http://dx.doi.org/10.5772/intechopen.70058

[7] Dey D, Gelfand A, Swartz T, Vlachos P. A simulation-intensive approach for checking hierarchical models. Test. 1998;7:325-346
[8] O’Hagan A. HSSS model criticism (with discussion). In: Green PJ, Hjort NL, Richardson
S, editors. Highly Structured Stochastic Systems. Oxford: Oxford University Press; 2003.
pp. 423-444
[9] Dahl FA, Gåsemyr J, Natvig B. A robust conflict measure of inconsistencies in Bayesian
hierarchical models. Scandinavian Journal of Statistics. 2007;34:816-828
[10] Gåsemyr J, Natvig B. Extensions of a conflict measure of inconsistencies in Bayesian
hierarchical models. Scandinavian Journal of Statistics. 2009;36:822-838

[11] Gåsemyr J. Uniformity of node level conflict measures in Bayesian hierarchical models
based on directed acyclic graphs. Scandinavian Journal of Statistics. 2016;43:20-34

[12] Gåsemyr J. Alternatives to post-processing posterior predictive p-values. Submitted 2017


[13] Marshall EC, Spiegelhalter DJ. Identifying outliers in Bayesian hierarchical models.
A simulation based approach. Bayesian Analysis. 2007;2:409-444
[14] Dias S, Welton NJ, Caldwell DM, Ades AE. Checking consistency in mixed treatment
comparison meta-analysis. Statistics in Medicine. 2010;29:932-944

[15] Presanis AM, Ohlssen D, Spiegelhalter D, De Angelis D. Conflict diagnostics in directed acyclic graphs, with applications in Bayesian evidence synthesis. Statistical Science. 2013;28:376-397

[16] Lauritzen SL. Graphical Models. Oxford: Oxford University Press; 1996
[17] Scheel I, Green P, Rougier JC. A graphical diagnostic for identifying influential model choices in Bayesian hierarchical models. Scandinavian Journal of Statistics. 2011;38:529-550
[18] Box GEP, Tiao GC. Bayesian Inference in Statistical Analysis. New York: Wiley; 1992
[19] Azzalini A. A class of distributions which include the normal ones. Scandinavian Journal
of Statistics. 1985;12:171-178
DOI: 10.5772/intechopen.70052

Chapter 3

Classifying by Bayesian Method and Some Applications

Tai Vovan

Additional information is available at the end of the chapter

Abstract

This chapter summarizes and proposes some results related to the classification problem solved by the Bayesian method. We present the classification principle and the Bayes error, and we establish the relationships of the Bayes error with other measures. The computation of the Bayes error in practice, in one and several dimensions, is also considered. Based on the training set and the object to be classified, we propose an algorithm for determining the prior probabilities that can reduce the Bayes error. This algorithm has been implemented as a MATLAB procedure that works well with real data. The proposed algorithm is applied in three domains, biology, medicine, and economics, through specific problems. For applied data sets with different characteristics, the proposed algorithm always gives the best results in comparison with the existing ones. Furthermore, the examples show the feasibility and the potential for real applications of the studied problem.

Keywords: Bayesian method, classification, error, prior, application

1. Introduction

The classification problem is one of the main subdomains of discriminant analysis and is closely related to many fields of statistics. Classification assigns an element to the appropriate population in a set of known populations, based on certain observed variables. It is an important development direction of multivariate statistics and has applications in many different fields [25, 27]. Recently, this problem has attracted many statisticians, in both theory and applications [14–18, 22–25]. According to Tai [22], there are four main methods to solve the classification problem: the Fisher method [6, 12], the logistic regression method [8], the support vector machine (SVM) method [3], and the Bayesian method [17]. Because the Bayesian method does not require normality of the data and can classify two or more populations, it has many advantages [22–25]. Therefore, it has been used by many scientists in their research.

© 2017 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use,
distribution, and reproduction in any medium, provided the original work is properly cited.


Given k populations {w_i} with probability density functions (pdfs) {f_i} and prior probabilities {q_i}, i = 1, 2, …, k, where $q_i \in (0, 1)$ and $\sum_{i=1}^{k} q_i = 1$, Pham-Gia et al. [17] used the maximum function of the pdfs as a tool to study the Bayesian method and obtained important results. The classification principle and the Bayes error were established based on g_max(x) = max{q_1 f_1(x), q_2 f_2(x), …, q_k f_k(x)}. Upper and lower bounds for the Bayes error were established in terms of the L1-distance of the pdfs and of their overlap coefficient. The function g_max(x) plays a very important role in the classification problem solved by the Bayesian method, and Pham-Gia et al. [17] continued to study it. Using the MATLAB software, Pham-Gia et al. [18] succeeded in identifying g_max(x) for some cases of the bivariate normal distribution. In a similar direction, Tai [22] proposed the L1-distance of {q_i f_i(x)} and established its relationship with the Bayes error. This distance is also used to calculate the Bayes error as well as to classify new elements, and the research has been applied to classifying bank customers by their ability to repay debt. However, the survey of research related to the Bayesian approach is not yet complete: there are further relations between the Bayes error and other statistical measures.

The Bayesian method has many advantages. However, to our knowledge, the range of practical applications of this method is narrower than that of the other methods. We can find many applications in banking and medicine using the Fisher, SVM, and logistic methods [1, 3, 8, 12]. Recently, statistical software has been able to process the classification of large, multivariate data sets effectively and quickly with any of the three methods mentioned above, whereas the Bayesian method does not share this advantage. The causes are the ambiguity in determining the prior probabilities, the difficulty in estimating the pdfs, and the complexity in calculating the Bayes error. Although all these issues have been discussed by many authors, optimal methods have yet to be found [22, 25]. In this chapter, we consider estimating the pdfs and calculating the Bayes error with a view to real applications. We also address how to determine the prior probabilities. In the noninformative case, we normally choose the prior probabilities by the uniform distribution. If we have some past data or a training set, the prior probabilities are estimated either by the Laplace method, q_i = (n_i + n/k)/(N + n), or by the sample frequencies, q_i = n_i/N, where n_i and N are the numbers of elements in the ith population and in the training set, respectively, n is the number of dimensions, and k is the number of groups. These approaches have been studied and applied by many authors [14, 15, 22, 25]. We also propose an algorithm to determine the prior probabilities based on the training set, the object to be classified, and fuzzy cluster analysis. The proposed algorithm is applied to specific problems in biology, medicine, and economics and has advantages over the existing approaches. All calculations are performed by MATLAB procedures.
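The two training-set-based prior estimates just quoted are straightforward to compute. A minimal sketch (in Python rather than the chapter's MATLAB; the Laplace variant follows the formula quoted above):

```python
def frequency_priors(counts):
    """q_i = n_i / N: relative frequencies of the groups in the training set."""
    N = sum(counts)
    return [n_i / N for n_i in counts]

def laplace_priors(counts, n_dim):
    """Laplace-type estimate as quoted in the text: q_i = (n_i + n/k) / (N + n),
    with n the number of dimensions and k the number of groups."""
    N, k = sum(counts), len(counts)
    return [(n_i + n_dim / k) / (N + n_dim) for n_i in counts]

counts = [9, 11]                      # two groups of sizes 9 and 11
fr = frequency_priors(counts)         # [0.45, 0.55]
lp = laplace_priors(counts, n_dim=1)  # [9.5/21, 11.5/21]
```

Both vectors sum to one; the Laplace variant shrinks the frequencies slightly toward the uniform prior.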

The next section of this chapter is structured as follows. Section 2 presents the classification principle and the Bayes error; some results about the Bayes error are also established there. Section 3 resolves problems arising in real applications of the Bayesian method: the estimation of pdfs and the computation of the Bayes error in the one-dimensional and multidimensional cases. This section also proposes an algorithm to determine the prior probabilities. Section 4 applies the proposed algorithm to real problems and compares its results with those of the existing approaches. Section 5 concludes this chapter.

2. Classifying by Bayesian method

The classification problem solved by the Bayesian method has been presented in many documents [15, 16, 27], where the classification principle and the Bayes error are established from Bayes' theorem. In this section, we present them via the maximum function of the q_i f_i(x), i = 1, 2, …, k, a formulation that has advantages over the existing approaches in real applications [17, 18, 21–25]. This section also establishes upper and lower bounds for the Bayes error and the relationships of the Bayes error with other measures in statistical pattern recognition.

2.1. Classification principle and Bayes error

Given k populations w_1, w_2, …, w_k, let q_i ∈ (0, 1) and f_i(x) be the prior probability and the pdf of the ith population, respectively, i = 1, 2, …, k. According to Pham-Gia et al. [17], an element x_0 is assigned to w_i if

$$ g_i(x_0) = g_{\max}(x_0), \quad i = 1, 2, \ldots, k, \qquad (1) $$

where $g_i(x) = q_i f_i(x)$ and $g_{\max}(x) = \max\{ q_1 f_1(x), q_2 f_2(x), \ldots, q_k f_k(x) \}$.

The Bayes error is given by the formula

$$ Pe_{1,2,\ldots,k}^{(q)} = \sum_{i=1}^{k} \int_{R^n \setminus R_i^n} q_i f_i(x)\,dx = 1 - \sum_{i=1}^{k} \int_{R_i^n} q_i f_i(x)\,dx, \qquad (2) $$

where $R_i^n = \{ x \mid q_i f_i(x) > q_j f_j(x), \; \forall i \neq j, \; i, j = 1, 2, \ldots, k \}$ and $(q) = (q_1, q_2, \ldots, q_k)$.

From Eq. (2), we can prove the following result:

$$ \begin{aligned} Pe_{1,2,\ldots,k}^{(q)} &= \sum_{j=1}^{k} \int_{R^n \setminus R_j^n} q_j f_j(x)\,dx \\ &= \sum_{j=1}^{k} \left[ \int_{R^n} q_j f_j(x)\,dx - \int_{R_j^n} \max_{1 \le l \le k} \{ q_l f_l(x) \}\,dx \right] \\ &= \int_{R^n} \sum_{j=1}^{k} q_j f_j(x)\,dx - \sum_{j=1}^{k} \int_{R_j^n} \max_{1 \le l \le k} \{ q_l f_l(x) \}\,dx \\ &= 1 - \int_{R^n} \max_{1 \le l \le k} \{ q_l f_l(x) \}\,dx, \end{aligned} $$

or

$$ Pe_{1,2,\ldots,k}^{(q)} = 1 - \int_{R^n} g_{\max}(x)\,dx. \qquad (3) $$

The correct classification probability is determined by $Ce_{1,2,\ldots,k}^{(q)} = 1 - Pe_{1,2,\ldots,k}^{(q)}$.

For k = 2, we have

$$ Pe_{1,2}^{(q,1-q)} = \int_{R^n} \min\{ q f_1(x), (1-q) f_2(x) \}\,dx = \lambda_{1,2}^{(q,1-q)} = \frac{1}{2}\left( 1 - \| q f_1 - (1-q) f_2 \|_1 \right), \qquad (4) $$

where $\lambda_{1,2}^{(q,1-q)}$ is the overlap area measure of $q f_1(x)$ and $(1-q) f_2(x)$, and $\| q f_1 - (1-q) f_2 \|_1 = \int_{R^n} | q f_1(x) - (1-q) f_2(x) |\,dx$.
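For k = 2, the two expressions in Eq. (4) can be checked numerically. The sketch below (Python, with illustrative normal pdfs) evaluates the overlap integral and its L1 form on a fine grid and confirms they agree:

```python
import math

def npdf(x, mu, sigma):
    """Univariate normal density."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def bayes_error_two(q, f1, f2, lo=-20.0, hi=20.0, steps=200_000):
    """Eq. (4): Pe = integral of min(q f1, (1-q) f2) (midpoint rule),
    together with the equivalent form (1 - ||q f1 - (1-q) f2||_1) / 2."""
    h = (hi - lo) / steps
    overlap = l1 = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * h
        a, b = q * f1(x), (1 - q) * f2(x)
        overlap += min(a, b) * h
        l1 += abs(a - b) * h
    return overlap, 0.5 * (1.0 - l1)

f1 = lambda x: npdf(x, 0.0, 1.0)
f2 = lambda x: npdf(x, 2.0, 1.0)
pe_min, pe_l1 = bayes_error_two(0.5, f1, f2)
# For equal priors and unit variances the exact value is Phi(-1), about 0.1587.
```

The agreement of the two routes illustrates the identity min(a, b) = (a + b − |a − b|)/2 underlying Eq. (4).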

2.2. Some results about Bayes error

Theorem 1. Let f_i(x), i = 1, 2, …, k, k ≥ 3, be k pdfs defined on R^n, n ≥ 1, and let q_i ∈ (0, 1). We have the following relationships between the Bayes error and other measures:

i. $$ Pe_{1,2,\ldots,k}^{(q)} \le 1 - \frac{1}{k-1}\left( 1 - \prod_{j=1}^{k} q_j^{\alpha_j}\, D_T^{\alpha}(f_1, f_2, \ldots, f_k) \right), \qquad (5) $$

ii. $$ Pe_{1,2,\ldots,k}^{(q)} \le \sum_{i<j} q_i^{\beta} q_j^{1-\beta}\, D_T^{(\beta,1-\beta)}(f_i, f_j), \qquad (6) $$

iii. $$ \frac{1}{k}\left( (k-1) - \sum_{i<j} \| g_i - g_j \|_1 \right) \le Pe_{1,2,\ldots,k}^{(q)} \le 1 - \frac{1}{2} \max_{i<j} \| g_i - g_j \|_1 - \min_i \{ q_i \}, \qquad (7) $$

iv. $$ 0 \le Pe_{1,2,\ldots,k}^{(q)} \le 1 - \max_i \{ q_i \}, \qquad (8) $$

where α = (α_1, α_2, …, α_k); α_j, β ∈ (0, 1), $\sum_{j=1}^{k} \alpha_j = 1$, i, j = 1, 2, …, k, and

$$ D_T^{\alpha}(f_1, f_2, \ldots, f_k) = \int_{R^n} \prod_{j=1}^{k} \left[ f_j(x) \right]^{\alpha_j} dx $$

is the affinity of Toussaint [26].

Proof:

i. For each i = 1, 2, …, k, we have

$$ \left( \sum_{j=1}^{k} q_j f_j \right)^{\alpha_i} \ge (q_i f_i)^{\alpha_i}. $$

Therefore,

$$ \left( \sum_{j=1}^{k} q_j f_j \right)^{\alpha_1 + \alpha_2 + \cdots + \alpha_k} \ge \prod_{j=1}^{k} (q_j f_j)^{\alpha_j} \;\Leftrightarrow\; \sum_{j=1}^{k} q_j f_j \ge \prod_{j=1}^{k} (q_j f_j)^{\alpha_j}. \qquad (9) $$

On the other hand,

$$ \left( \min_{1 \le j \le k} \{ q_j f_j \} \right)^{\alpha_1} \le (q_1 f_1)^{\alpha_1}, \;\ldots,\; \left( \min_{1 \le j \le k} \{ q_j f_j \} \right)^{\alpha_k} \le (q_k f_k)^{\alpha_k}, $$

so

$$ \left( \min_{1 \le j \le k} \{ q_j f_j \} \right)^{\alpha_1 + \cdots + \alpha_k} \le \prod_{j=1}^{k} (q_j f_j)^{\alpha_j}, $$

or

$$ \min_{1 \le j \le k} \{ q_j f_j \} \le \prod_{j=1}^{k} (q_j f_j)^{\alpha_j}. \qquad (10) $$

Combining Eqs. (9) and (10), we obtain

$$ 0 \le \sum_{j=1}^{k} q_j f_j - \prod_{j=1}^{k} (q_j f_j)^{\alpha_j} \le \sum_{j=1}^{k} q_j f_j - \min_{1 \le j \le k} \{ q_j f_j \}. $$

Because $\sum_{j=1}^{k} q_j f_j - \min_{1 \le j \le k} \{ q_j f_j \}$ consists of (k − 1) terms, each bounded by the maximum, we have

$$ \sum_{j=1}^{k} q_j f_j - \min_{1 \le j \le k} \{ q_j f_j \} \le (k-1) \max_{1 \le j \le k} \{ q_j f_j \}. $$

Thus,

$$ 0 \le \sum_{j=1}^{k} q_j f_j - \prod_{j=1}^{k} (q_j f_j)^{\alpha_j} \le (k-1) \max_{1 \le j \le k} \{ q_j f_j \}. $$

Integrating this relation, we obtain

$$ 1 - \prod_{j=1}^{k} q_j^{\alpha_j}\, D_T^{\alpha}(f_1, f_2, \ldots, f_k) \le (k-1) \int_{R^n} g_{\max}(x)\,dx. \qquad (11) $$

Using $\int_{R^n} g_{\max}(x)\,dx = 1 - Pe_{1,2,\ldots,k}^{(q)}$ in Eq. (11), we obtain Eq. (5).

ii. From Eq. (2), we have

$$ Pe_{1,2,\ldots,k}^{(q)} = \sum_{j=1}^{k} \int_{R^n \setminus R_j^n} q_j f_j(x)\,dx = \sum_{j=1}^{k} \sum_{i \ne j} \int_{R_i^n} \min\{ q_i f_i(x), q_j f_j(x) \}\,dx = \sum_{i<j} \int_{R_i^n \cup R_j^n} \min\{ q_i f_i(x), q_j f_j(x) \}\,dx, $$

since on $R_i^n$ the term $q_i f_i(x)$ is the maximum, so that $q_j f_j(x) = \min\{ q_i f_i(x), q_j f_j(x) \}$ there. Since

$$ \left[ \min\{ q_i f_i(x), q_j f_j(x) \} \right]^{\beta} \le (q_i f_i)^{\beta} \quad \text{and} \quad \left[ \min\{ q_i f_i(x), q_j f_j(x) \} \right]^{1-\beta} \le (q_j f_j)^{1-\beta}, $$

it follows that

$$ \min\{ q_i f_i(x), q_j f_j(x) \} \le (q_i f_i)^{\beta} (q_j f_j)^{1-\beta}. $$

Integrating this inequality, we obtain

$$ Pe_{1,2,\ldots,k}^{(q)} \le \sum_{i<j} \int_{R^n} \left( q_i f_i(x) \right)^{\beta} \left( q_j f_j(x) \right)^{1-\beta} dx = \sum_{i<j} q_i^{\beta} q_j^{1-\beta}\, D_T^{(\beta,1-\beta)}(f_i, f_j). $$

iii. We have

$$ \int_{R^n} \max\{ g_1(x), g_2(x), \ldots, g_k(x) \}\,dx \ge \max_{i<j} \int_{R^n} \max\{ g_i(x), g_j(x) \}\,dx. $$

On the other hand, since max{a, b} = (|a − b| + a + b)/2,

$$ \max_{i<j} \int_{R^n} \max\{ g_i(x), g_j(x) \}\,dx = \max_{i<j} \left\{ \frac{1}{2} \| g_i - g_j \|_1 + \frac{1}{2}(q_i + q_j) \right\} \ge \frac{1}{2} \max_{i<j} \| g_i - g_j \|_1 + \min_{i<j} \left\{ \frac{1}{2}(q_i + q_j) \right\} \ge \frac{1}{2} \max_{i<j} \| g_i - g_j \|_1 + \min\{ q_1, q_2, \ldots, q_k \}. $$

Hence,

$$ \int_{R^n} g_{\max}(x)\,dx \ge \frac{1}{2} \max_{i<j} \| g_i - g_j \|_1 + \min_i \{ q_i \}. \qquad (12) $$

We also have, pointwise,

$$ \sum_{i<j} | g_i - g_j | \ge \sum_{j=1}^{k} \left[ \max\{ g_1, g_2, \ldots, g_k \} - g_j \right] = k \max\{ g_1, g_2, \ldots, g_k \} - \sum_{j=1}^{k} g_j. $$

Therefore,

$$ \max\{ g_1, g_2, \ldots, g_k \} \le \frac{1}{k} \sum_{i<j} | g_i - g_j | + \frac{1}{k} \sum_{j=1}^{k} g_j. \qquad (13) $$

Since $\int_{R^n} g_i(x)\,dx = q_i$ and $\sum_{i=1}^{k} q_i = 1$, integrating Eq. (13) gives

$$ \int_{R^n} g_{\max}(x)\,dx \le \frac{1}{k} \sum_{i<j} \| g_i - g_j \|_1 + \frac{1}{k}. \qquad (14) $$

Substituting $\int_{R^n} g_{\max}(x)\,dx = 1 - Pe_{1,2,\ldots,k}^{(q)}$ into Eqs. (12) and (14), we obtain Eq. (7).

iv. For all i = 1, …, k, we have

$$ q_i f_i(x) \le \max\{ q_1 f_1(x), q_2 f_2(x), \ldots, q_k f_k(x) \} \le \sum_{i=1}^{k} q_i f_i(x). $$

Integrating this relation, we obtain

$$ q_i \le \int_{R^n} g_{\max}(x)\,dx \le 1. $$

This inequality holds for all i = 1, …, k, so

$$ \max_i \{ q_i \} \le \int_{R^n} g_{\max}(x)\,dx \le 1. $$

Replacing $\int_{R^n} g_{\max}(x)\,dx = 1 - Pe_{1,2,\ldots,k}^{(q)}$ in this relation, we obtain Eq. (8).

From Eqs. (5) and (6) with α_1 = α_2 = ⋯ = α_k = 1/k, we obtain the relationship between the Bayes error and the affinity of Matusita [11]. In particular, when k = 2, we obtain the relationship between $Pe_{1,2}^{(q,1-q)}$ and Hellinger's distance.

In addition, we also have relations between the Bayes error and the overlap coefficients as well as the L1-distance of {g_1(x), g_2(x), …, g_k(x)} (see Ref. [22]). For the special case q_1 = q_2 = ⋯ = q_k = 1/k, expressions were established relating the Bayes error to the L1-distance of {f_1(x), f_2(x), …, f_k(x)}, and relating $Pe_{1,2,\ldots,k}^{(1/k)}$ to $Pe_{1,2,\ldots,k+1}^{(1/(k+1))}$ (see Ref. [17]).
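The bounds of Theorem 1, parts (iii) and (iv), can be verified numerically. The sketch below (Python, with three illustrative univariate normals and assumed priors) evaluates Pe via Eq. (3) on a grid and checks it against both bounds:

```python
import math

def npdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

q = [0.2, 0.3, 0.5]
params = [(0.0, 1.0), (2.0, 1.5), (5.0, 1.0)]
g = [lambda x, qi=qi, mu=mu, s=s: qi * npdf(x, mu, s)
     for qi, (mu, s) in zip(q, params)]

# Grid evaluation of the integral of gmax and of the pairwise L1 distances.
lo, hi, steps = -15.0, 20.0, 100_000
h = (hi - lo) / steps
int_gmax = 0.0
l1 = {(i, j): 0.0 for i in range(3) for j in range(i + 1, 3)}
for k in range(steps):
    x = lo + (k + 0.5) * h
    vals = [gi(x) for gi in g]
    int_gmax += max(vals) * h
    for (i, j) in l1:
        l1[i, j] += abs(vals[i] - vals[j]) * h

pe = 1.0 - int_gmax                            # Eq. (3)
lower = ((3 - 1) - sum(l1.values())) / 3       # left side of Eq. (7)
upper = 1.0 - 0.5 * max(l1.values()) - min(q)  # right side of Eq. (7)
```

Any choice of priors and densities must land between the two sides of Eq. (7) and below 1 − max q_i from Eq. (8).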

3. Related problems in applying the Bayesian method

To apply the Bayesian method in practice, we have to resolve three main problems: (i) determining the prior probabilities, (ii) computing the Bayes error, and (iii) estimating the pdfs. In this section, we propose an algorithm for (i), based on fuzzy cluster analysis and the object to be classified, that can reduce the Bayes error in comparison with the traditional approaches. For (ii), the Bayes error is established in closed form for some general cases, and it is determined by an algorithm that finds the maximum function of the g_i(x), i = 1, 2, …, k, in the one-dimensional case; the quasi-Monte Carlo method is proposed for computing the Bayes error in higher dimensions. For (iii), we review the estimation of pdfs by the kernel function method, where the bandwidth parameter and the kernel function are specified.

3.1. Prior probability

In the n-dimensional space, given N populations N^(0) = {W_1^(0), W_2^(0), …, W_N^(0)} with data set Z = [z_ij]_{n×N}, let U = [μ_ik]_{c×N}, where μ_ik is the probability that the kth element belongs to w_i. We have μ_ik ∈ [0, 1], satisfying the conditions

$$ \sum_{i=1}^{c} \mu_{ik} = 1, \quad 0 < \sum_{k=1}^{N} \mu_{ik} < N, \quad 1 \le i \le c, \; 1 \le k \le N. $$

We call

$$ M_{zc} = \left\{ U = [\mu_{ik}]_{c \times N} \;\middle|\; \mu_{ik} \in [0, 1], \forall i, k; \; \sum_{i=1}^{c} \mu_{ik} = 1, \forall k; \; 0 < \sum_{k=1}^{N} \mu_{ik}, \forall i \right\} \qquad (15) $$

the fuzzy partitioning space of the c populations, and

$$ D_{ikA}^2 = \| z_k - v_i \|_A^2 = (z_k - v_i)^T A (z_k - v_i) $$

the square of the distance from the object z_k to the representative of the ith population. This representative is computed by the formula

$$ v_i = \frac{ \sum_{k=1}^{N} (\mu_{ik})^m z_k }{ \sum_{k=1}^{N} (\mu_{ik})^m }, \quad 1 \le i \le c, \qquad (16) $$

where m ∈ [1, ∞) is the fuzziness parameter.

Given the data set Z of the c known populations w_1, w_2, …, w_c, assume that x_0 is an object that we need to classify. To identify the prior probabilities when classifying x_0, we propose the following prior-probability-by-fuzzy-clustering (PPC) algorithm.

Algorithm 1. Determining prior probability by fuzzy clustering (PPC)

Input: the data set Z = [z_ij]_{n×N} of c populations {w_1, w_2, …, w_c}; x_0; ε; m; and the initial partition matrix U = U^(0) = [μ_ij]_{c×(N+1)}, where μ_ij = 1 if the jth object belongs to w_i and μ_ij = 0 otherwise, i = 1, …, c, j = 1, …, N, and μ_ij = 1/c for j = N + 1.
Output: the prior probabilities μ_{i(N+1)}, i = 1, 2, …, c.

Repeat:
  Find the representative object of each w_i: v_i = Σ_{k=1}^{N} (μ_ik)^m z_k / Σ_{k=1}^{N} (μ_ik)^m, 1 ≤ i ≤ c.
  Compute the matrix [D_ik]_{c×(N+1)} (the pairwise distances between objects and representative objects).
  Update the new partition matrix U^(new) by the following principle:
    If D_ik > 0 for all i = 1, 2, …, c and k = 1, 2, …, N + 1, then
      μ_ik^(new) = 1 / Σ_{j=1}^{c} (D_ik / D_jk)^{2/(m−1)}
    else μ_ik^(new) = 0.
  End
  Compute S = ||U^(new) − U|| = max_{ik} |μ_ik^(new) − μ_ik|.
  Set U = U^(new).
Until S < ε.
Return the prior probabilities μ_{i(N+1)}, i = 1, 2, …, c (the final column of the matrix U).

In the above algorithm, we have:

i. ε is a small positive number, chosen arbitrarily. The smaller ε is, the more iterations are required. In the examples of this chapter, we choose ε = 0.001.

ii. The distance matrix [D_ik] depends on the norm-inducing matrix A. When A = I, [D_ik] is the matrix of Euclidean distances. There are also other choices of A, such as a diagonal matrix or the inverse of the covariance matrix. In this chapter, we chose the Euclidean distance in the numerical examples and applications.

iii. m is the fuzziness parameter. When m = 1, the fuzzy clustering becomes nonfuzzy clustering; when m → ∞, the partition becomes completely fuzzy (μ_ik = 1/c). Determining this parameter, which affects the analysis result, is difficult. Even though Yu et al. [28] proposed two rules to determine the supremum of m for clustering problems, the search for a specific m is done by a meshing method (see [2, 4, 5, 9] for more details). In this process, the best m among several given values is chosen. In this chapter, m is also identified by the meshing method for the classification problem; the best integer m between 2 and 10 is used.

At the end of the PPC algorithm, we obtain the prior probabilities of x_0 from the last column of the partition matrix U (μ_{i(N+1)}, i = 1, 2, …, c). The PPC algorithm determines the prior probabilities via the degree of closeness between the classified object and the populations, so each object receives its own suitable prior probabilities.

In this chapter, the Bayesian method with prior probabilities calculated by the uniform distribution approach, the ratio-of-samples approach, the Laplace approach, and the proposed PPC algorithm are called BayesU, BayesR, BayesL, and BayesC, respectively.
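The final membership update of the PPC algorithm, which yields the priors for x_0, can be sketched in isolation. The snippet below (Python; one-dimensional, with the representatives held fixed rather than re-estimated, so it only approximates one step of the full iteration, and the representative values are assumed, not the algorithm's converged ones):

```python
def ppc_membership(x0, reps, m=2.0):
    """One membership update for the object x0 given fixed representatives:
    mu_i = 1 / sum_j (D_i / D_j)^(2/(m-1)), Euclidean distance (A = I)."""
    d = [abs(x0 - v) for v in reps]
    if any(di == 0.0 for di in d):        # x0 coincides with a representative
        return [1.0 if di == 0.0 else 0.0 for di in d]
    return [1.0 / sum((di / dj) ** (2.0 / (m - 1.0)) for dj in d) for di in d]

# Representatives placed roughly at the two group centres of Example 1 below
# (assumed values for illustration).
priors = ppc_membership(4.3, [1.9, 7.6])
```

The memberships sum to one, and the closer x_0 lies to a representative, the larger its prior for that group.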
Example 1. Given the marks (on a 10-point grading scale) of 20 students: nine students have marks lower than 5 (w_1: fail the exam) and 11 students have marks higher than 5 (w_2: pass the exam). The data are given in Table 1.

Assume that we need to classify the ninth object, x_0 = 4.3, into one of the two populations. Using the PPC algorithm, we obtain the following final partition matrix:
 
0:957 0:973 0:981 0:993 1 0:997 0:997 0:830 0:321 0:290 0:158 0:1 0:1 0:01 0:009 0:037 0:045 0:054 0:062 0:724
0:043 0:027 0:019 0:007 0 0:003 0:003 0:170 0:679 0:710 0:842 0:9 0:9 0:99 0:991 0:963 0:955 0:946 0:938 0:276

This matrix shows that the prior probabilities when assigning the ninth object to w_1 and w_2 are 0.724 and 0.276, respectively. Meanwhile, the prior probabilities determined by BayesU, BayesR, and BayesL are (0.5; 0.5), (0.421; 0.579), and (0.429; 0.571), respectively.

From the data in Table 1, we estimate the pdfs f_1(x) and f_2(x) and compute the values q_1 f_1(x_0) and q_2 f_2(x_0), where q_1 and q_2 are the calculated prior probabilities. The results of classifying x_0 by the four approaches BayesU, BayesR, BayesL, and BayesC are given in Table 2.

Because the actual population of x_0 is w_1, only BayesC gives the true result, and the Bayes error of BayesC is also the smallest. Thus, in this example, the proposed method overcomes the drawback of the traditional methods in determining the prior probabilities.

Objects Marks Groups Objects Marks Groups

1 0.6 w1 11 5.6 w2
2 1.0 w1 12 6.1 w2
3 1.2 w1 13 6.4 w2
4 1.6 w1 14 6.4 w2
5 2.2 w1 15 7.3 w2
6 2.4 w1 16 8.4 w2
7 2.4 w1 17 9.2 w2
8 3.9 w1 18 9.4 w2
9 4.3 w1 19 9.6 w2
10 5.5 w2 20 9.8 w2

Table 1. The studied marks of 20 students and the actual classifications.



Methods Priors gmax(x0) Populations Bayes errors

BayesU (0.5; 0.5) 0.0353 2 0.0538


BayesR (0.421; 0.579) 0.0409 2 0.0558

BayesL (0.429; 0.571) 0.0403 2 0.0557


BayesC (0.724; 0.276) 0.0485 1 0.0241

Table 2. The results when classifying the ninth object.
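The decision pattern of Table 2 can be reproduced in outline. The sketch below (Python; normal densities fitted by moments rather than the chapter's kernel estimates, with the ninth object held out of the training data, so the numbers only approximate those in Table 2):

```python
import math

def fit_normal(data):
    """Moment fit of a univariate normal; returns its pdf."""
    n = len(data)
    mu = sum(data) / n
    s = math.sqrt(sum((v - mu) ** 2 for v in data) / (n - 1))
    return lambda x: math.exp(-0.5 * ((x - mu) / s) ** 2) / (s * math.sqrt(2 * math.pi))

w1 = [0.6, 1.0, 1.2, 1.6, 2.2, 2.4, 2.4, 3.9]                 # fail group, x0 held out
w2 = [5.5, 5.6, 6.1, 6.4, 6.4, 7.3, 8.4, 9.2, 9.4, 9.6, 9.8]  # pass group
f1, f2 = fit_normal(w1), fit_normal(w2)

def classify(x0, q1, q2):
    """Principle (1): assign to the population maximizing q_i * f_i(x0)."""
    g = [q1 * f1(x0), q2 * f2(x0)]
    return g.index(max(g)) + 1

u_decision = classify(4.3, 0.5, 0.5)       # BayesU priors
c_decision = classify(4.3, 0.724, 0.276)   # BayesC priors from Example 1
```

With uniform priors the ninth mark falls to w_2, while the PPC priors move the decision to w_1, the object's actual group, mirroring the conclusion of Table 2.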

3.2. Determining Bayes error

Theorem 2. Let f_i(x), i = 1, 2, …, k, k ≥ 3, be k pdfs defined on R^n, n ≥ 1, and let q_i ∈ (0, 1). Define

$$ \begin{aligned} R_1^n &= \{ x \in R^n : q_1 f_1(x) > q_j f_j(x), \; 2 \le j \le k \}, \\ R_k^n &= \{ x \in R^n : q_k f_k(x) > q_j f_j(x), \; 1 \le j \le k-1 \}, \\ R_l^n &= \{ x \in R^n : q_l f_l(x) > q_i f_i(x), \; 1 \le i \le k, \; i \ne l \}, \quad 2 \le l \le k-1. \end{aligned} \qquad (17) $$

The Bayes error is determined by

$$ Pe_{1,2,\ldots,k}^{(q)} = 1 - \int_{R_1^n} q_1 f_1(x)\,dx - \sum_{l=2}^{k-1} \int_{R_l^n} q_l f_l(x)\,dx - \int_{R_k^n} q_k f_k(x)\,dx. \qquad (18) $$

Proof:

To obtain Eq. (18), we need to prove the following results:

$$ R_i^n \cap R_j^n = \varnothing \; (1 \le i \ne j \le k), \quad \bigcup_{i=1}^{k} R_i^n = R_1^n \cup \left( \bigcup_{l=2}^{k-1} R_l^n \right) \cup R_k^n = R^n, \quad g_{\max}(x) = g_i(x), \; \forall x \in R_i^n. $$

For 1 ≤ i ≠ j ≤ k, let

$$ R_{ij} = \{ x \in R^n : q_i f_i(x) > q_j f_j(x) \}, \quad \overline{R}_{ij} = R^n \setminus R_{ij} = \{ x \in R^n : q_i f_i(x) \le q_j f_j(x) \}. $$

From Eq. (17), $R_l^n = \bigcap_{i \ne l} R_{li}$ for each l. For l ≠ m, we have $R_l^n \subset R_{lm}$ and $R_m^n \subset R_{ml} \subset \overline{R}_{lm}$, so $R_l^n \cap R_m^n \subset R_{lm} \cap \overline{R}_{lm} = \varnothing$.

Conversely, by De Morgan's law, the complement of $\bigcup_{i=1}^{k} R_i^n$ is $\bigcap_{i=1}^{k} \overline{R_i^n}$, the set of points at which no q_i f_i(x) is a strict maximum, that is, the set where the maximum of {q_1 f_1(x), …, q_k f_k(x)} is attained by two or more indices. For continuous pdfs this set has Lebesgue measure zero, so $\bigcup_{i=1}^{k} R_i^n = R^n$ up to a null set, which does not affect the integrals in Eq. (18).

Finally, from Eq. (17) we can directly see that

$$ g_{\max}(x) = g_i(x), \quad \forall x \in R_i^n, \; 1 \le i \le k. $$

For k = 2 and q_1 = q_2 = 1/2, we consider the following two special cases:

i. If f_1(x) and f_2(x) are two one-dimensional normal pdfs (N(μ_i, σ_i), i = 1, 2), then, supposing without loss of generality that μ_1 < μ_2 (when μ_1 ≠ μ_2) and σ_1 < σ_2 (when σ_1 ≠ σ_2),

$$ Pe_{1,2}^{(1/2,1/2)} = \begin{cases} \dfrac{1}{2} \left[ \displaystyle\int_{-\infty}^{x_1} f_2(x)\,dx + \int_{x_1}^{+\infty} f_1(x)\,dx \right], & \text{if } \sigma_1 = \sigma_2, \\[2ex] \dfrac{1}{2} \left[ \displaystyle\int_{-\infty}^{x_2} f_1(x)\,dx + \int_{x_2}^{x_3} f_2(x)\,dx + \int_{x_3}^{+\infty} f_1(x)\,dx \right], & \text{if } \sigma_1 < \sigma_2, \end{cases} $$

where

$$ x_1 = \frac{\mu_1 + \mu_2}{2}, \quad x_2 = \frac{ (\mu_1 \sigma_2^2 - \mu_2 \sigma_1^2) - \sigma_1 \sigma_2 \sqrt{ (\mu_1 - \mu_2)^2 + K } }{ \sigma_2^2 - \sigma_1^2 }, $$

$$ x_3 = \frac{ (\mu_1 \sigma_2^2 - \mu_2 \sigma_1^2) + \sigma_1 \sigma_2 \sqrt{ (\mu_1 - \mu_2)^2 + K } }{ \sigma_2^2 - \sigma_1^2 }, \quad K = 2 (\sigma_2^2 - \sigma_1^2) \ln\!\left( \frac{\sigma_2}{\sigma_1} \right) \ge 0. $$

For μ_1 = μ_2 = μ, the above result becomes

$$ Pe_{1,2}^{(1/2,1/2)} = \begin{cases} \dfrac{1}{2}, & \text{if } \sigma_1 = \sigma_2, \\[1ex] \dfrac{1}{2} \left[ \displaystyle\int_{-\infty}^{x_4} f_1(x)\,dx + \int_{x_4}^{x_5} f_2(x)\,dx + \int_{x_5}^{+\infty} f_1(x)\,dx \right], & \text{if } \sigma_1 < \sigma_2, \end{cases} $$

where $x_4 = \mu - \sigma_1 \sigma_2 \sqrt{E}$ and $x_5 = \mu + \sigma_1 \sigma_2 \sqrt{E}$, with $E = \dfrac{2}{\sigma_2^2 - \sigma_1^2} \ln\!\left( \dfrac{\sigma_2}{\sigma_1} \right) \ge 0$.

ii. If f_1(x) and f_2(x) are two n-dimensional normal pdfs (N(μ_i, Σ_i), n ≥ 2, i = 1, 2), then

$$ Pe_{1,2}^{(1/2,1/2)} = \frac{1}{2} \left[ \int_{R_1^n} f_2(x)\,dx + \int_{R_2^n} f_1(x)\,dx \right], $$

where

$$ R_1^n = \{ x : d(x) \le 0 \}, \quad R_2^n = \{ x : d(x) > 0 \}, $$

$$ d(x) = \left[ \mu_1^T \Sigma_1^{-1} - \mu_2^T \Sigma_2^{-1} \right] x - \frac{1}{2} x^T \left[ \Sigma_1^{-1} - \Sigma_2^{-1} \right] x - m, $$

$$ m = \frac{1}{2} \left( \ln \frac{ |\Sigma_1| }{ |\Sigma_2| } + \mu_1^T \Sigma_1^{-1} \mu_1 - \mu_2^T \Sigma_2^{-1} \mu_2 \right). $$

In the case n = 2, the boundary d(x) = 0 can consist of straight lines, a parabola, an ellipse, or a hyperbola.
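The closed form of case (i) can be cross-checked against direct numerical integration. The sketch below (Python; illustrative parameters, not taken from the chapter) computes the cut points x_2, x_3 for σ_1 < σ_2 and evaluates the error with the normal cdf:

```python
import math

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def pe_two_normals(m1, s1, m2, s2):
    """Closed-form Bayes error for q1 = q2 = 1/2 and s1 < s2 (case i)."""
    K = 2.0 * (s2 ** 2 - s1 ** 2) * math.log(s2 / s1)
    root = s1 * s2 * math.sqrt((m1 - m2) ** 2 + K)
    x2 = ((m1 * s2 ** 2 - m2 * s1 ** 2) - root) / (s2 ** 2 - s1 ** 2)
    x3 = ((m1 * s2 ** 2 - m2 * s1 ** 2) + root) / (s2 ** 2 - s1 ** 2)
    F1 = lambda x: Phi((x - m1) / s1)          # cdf of f1
    F2 = lambda x: Phi((x - m2) / s2)          # cdf of f2
    return 0.5 * (F1(x2) + (F2(x3) - F2(x2)) + (1.0 - F1(x3)))

pe = pe_two_normals(0.0, 1.0, 1.0, 2.0)

# Cross-check: integrate (1/2) min(f1, f2) on a fine grid, as in Eq. (4).
def npdf(x, mu, s):
    return math.exp(-0.5 * ((x - mu) / s) ** 2) / (s * math.sqrt(2 * math.pi))

h, acc = 30.0 / 200_000, 0.0
for k in range(200_000):
    x = -15.0 + (k + 0.5) * h
    acc += 0.5 * min(npdf(x, 0.0, 1.0), npdf(x, 1.0, 2.0)) * h
```

Between the crossing points the narrower density dominates, so the minimum switches from f_1 to f_2 and back, which is exactly the three-piece structure of the closed form.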

3.3. Maximum function in the classification problem

To classify a new element by principle (1) and to determine the Bayes error by formula (3), we must find g_max(x). Some authors, such as Pham-Gia et al. [15, 17] and Tai [21, 22], have surveyed relationships between g_max(x) and related quantities of the classification problem. A specific expression for g_max(x) has been found in some special cases [18]; however, a general expression covering all cases remains a complex, open problem. Given k pdfs f_i(x) and priors q_i, i = 1, 2, …, k, with q_1 + q_2 + ⋯ + q_k = 1, let g_i(x) = q_i f_i(x) and g_max(x) = max{g_i(x)}. We are now interested in determining g_max(x).

(a) For one dimension

In this case, we can find gmax(x) by the following algorithm:

Algorithm 2. Find the g_max(x) function

Input: g_i(x) = q_i f_i(x), where f_i(x) and q_i are the probability density function and the prior probability of w_i, i = 1, 2, …, k, respectively.
Output: the g_max(x) function.

Find all roots of the equations g_i(x) − g_j(x) = 0, i = 1, …, k − 1, j = i + 1, …, k.
Let B be the set of all roots.
For each x_lm ∈ B (a root of the equation g_l(x) − g_m(x) = 0) do
  For each p ∈ {1, 2, …, k} \ {l, m} do
    If g_l(x_lm) < g_p(x_lm) then B = B \ {x_lm}
  End
End
Arrange the elements of B in increasing order: B = {x_1, x_2, …, x_h}, x_1 < x_2 < … < x_h.

(Determine g_max(x) on the interval (−∞, x_1])
For i = 1 to k do
  If g_i(x_1 − ε_1) = max{g_1(x_1 − ε_1), g_2(x_1 − ε_1), …, g_k(x_1 − ε_1)} then
    g_max(x) = g_i(x) for all x ∈ (−∞, x_1]
  End
End
(Determine g_max(x) on the intervals (x_j, x_{j+1}], j = 1, …, h − 1)
For i = 1 to k do
  For j = 1 to h − 1 do
    If g_i(x_j + ε_2) = max{g_1(x_j + ε_2), g_2(x_j + ε_2), …, g_k(x_j + ε_2)} then
      g_max(x) = g_i(x) for all x ∈ (x_j, x_{j+1}]
    End
  End
End
(Determine g_max(x) on the interval (x_h, +∞))
For i = 1 to k do
  If g_i(x_h + ε_3) = max{g_1(x_h + ε_3), g_2(x_h + ε_3), …, g_k(x_h + ε_3)} then
    g_max(x) = g_i(x) for all x ∈ (x_h, +∞)
  End
End

In the above algorithm, ε_1, ε_2, and ε_3 are positive constants chosen small enough that each test point lies inside the intended interval; in particular, x_j + ε_2 < x_{j+1} for j = 1, …, h − 1.

From this algorithm, we have written MATLAB code to find g_max(x). Once g_max(x) is determined, we can easily calculate the Bayes error using formula (3), as well as classify a new element by principle (1).
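A numerical stand-in for Algorithm 2 is to sweep a fine grid and record where the argmax among the g_i changes; each change point approximates one of the surviving roots x_j. A sketch (Python; three illustrative normals with equal priors, not the seven of Example 2 below):

```python
import math

def npdf(x, mu, s):
    return math.exp(-0.5 * ((x - mu) / s) ** 2) / (s * math.sqrt(2 * math.pi))

def gmax_pieces(gs, lo, hi, steps=100_000):
    """Return [(argmax index, left end of interval), ...] describing the
    piecewise structure of gmax on [lo, hi]."""
    h = (hi - lo) / steps
    pieces, prev = [], None
    for k in range(steps + 1):
        x = lo + k * h
        i = max(range(len(gs)), key=lambda j: gs[j](x))
        if i != prev:
            pieces.append((i, x))
            prev = i
    return pieces

q = 1.0 / 3.0
params = [(0.0, 1.0), (2.5, 1.0), (6.0, 2.0)]
gs = [lambda x, mu=mu, s=s: q * npdf(x, mu, s) for mu, s in params]
pieces = gmax_pieces(gs, -10.0, 15.0)
# The widest density dominates both tails, so the argmax pattern is
# g3, g1, g2, g3, with boundaries near -6.22, 1.25, and 4.03.
```

This mirrors the structure of the example that follows, where g_max switches between the g_i at the retained roots.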
Example 2. Given seven populations having univariate normal pdfs {f_1, f_2, …, f_7} with the following parameters (Figure 1):

μ_1 = 0.3, μ_2 = 4.0, μ_3 = 9.1, μ_4 = 1.9, μ_5 = 5.3, μ_6 = 8.0, μ_7 = 4.8,
σ_1 = 1.0, σ_2 = 1.3, σ_3 = 1.4, σ_4 = 1.6, σ_5 = 2.0, σ_6 = 1.9, σ_7 = 2.3.

Using the code with q_i = 1/7 and g_i(x) = q_i f_i(x), i = 1, 2, …, 7, we obtain:

$$ g_{\max}(x) = \begin{cases} g_1(x) & \text{if } -1.28 < x \le 0.99, \\ g_2(x) & \text{if } 2.58 < x \le 4.89, \\ g_3(x) & \text{if } 8.30 < x \le 12.52, \\ g_4(x) & \text{if } \{-7.86 < x \le -1.28\} \cup \{0.99 < x \le 2.58\}, \\ g_5(x) & \text{if } 4.89 < x \le 6.65, \\ g_6(x) & \text{if } \{6.65 < x \le 8.30\} \cup \{12.52 < x \le 23.33\}, \\ g_7(x) & \text{if } \{x \le -7.86\} \cup \{x > 23.33\}. \end{cases} $$

(b) For multidimension

In multidimensional cases, it is very complicated to obtain a closed expression for g_max(x). The difficulty comes from the various forms of the intersection space curves between the pdf surfaces. This problem has interested many authors [17, 18, 21–25]. Pham-Gia

Figure 1. The graph of seven one-dimension normal pdfs, fmax(x) and gmax(x).

et al. [18] have attempted to find the function g_max(x); however, it has been established only for some cases of the bivariate normal distribution.
Example 3. Given four bivariate normal pdfs N(μ_i, Σ_i) with the following specific parameters [16]:

$$ \mu_1 = \begin{bmatrix} 40 \\ 20 \end{bmatrix}, \quad \mu_2 = \begin{bmatrix} 48 \\ 24 \end{bmatrix}, \quad \mu_3 = \begin{bmatrix} 43 \\ 32 \end{bmatrix}, \quad \mu_4 = \begin{bmatrix} 38 \\ 28 \end{bmatrix}, $$

$$ \Sigma_1 = \begin{pmatrix} 35 & 18 \\ 18 & 20 \end{pmatrix}, \quad \Sigma_2 = \begin{pmatrix} 28 & -20 \\ -20 & 25 \end{pmatrix}, \quad \Sigma_3 = \begin{pmatrix} 15 & 25 \\ 25 & 65 \end{pmatrix}, \quad \Sigma_4 = \begin{pmatrix} 5 & -10 \\ -10 & 7 \end{pmatrix}. $$

With q1 = 0.25, q2 = 0.2, q3 = 0.4, and q4 = 0.15, we have the graphs of gi(x) = qifi(x) and their
intersection curves as shown in Figure 2.
Here, we do not derive the expression of gmax(x). Instead, we compute the Bayes error by
integrating gmax(x) with a quasi-Monte Carlo method [17]. An algorithm for this computation
has been constructed, and the corresponding MATLAB procedure is used in Section 4.
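The chapter's quasi-Monte Carlo integration is implemented in MATLAB and is not listed; the following Python sketch illustrates the same idea with plain (not quasi-) Monte Carlo over a bounding box, on a hypothetical two-population example (all parameters below are ours, not the chapter's data):

```python
import math, random

def mvn2_pdf(x, y, mu, cov):
    """Density of a bivariate normal N(mu, cov) at the point (x, y)."""
    (a, b), (_, d) = cov                      # cov = [[a, b], [b, d]], symmetric
    det = a * d - b * b
    dx, dy = x - mu[0], y - mu[1]
    quad = (d * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

def bayes_error_mc(pops, priors, box, n=300_000, seed=1):
    """Estimate the Bayes error 1 - integral of gmax over a bounding box
    by plain Monte Carlo (the chapter uses a quasi-Monte Carlo rule)."""
    (x0, x1), (y0, y1) = box
    area = (x1 - x0) * (y1 - y0)
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(n):
        x, y = rng.uniform(x0, x1), rng.uniform(y0, y1)
        acc += max(qi * mvn2_pdf(x, y, mu, cov)
                   for qi, (mu, cov) in zip(priors, pops))
    return 1.0 - area * acc / n

# Hypothetical two-population check (not the chapter's data): with equal priors
# and unit covariances, the exact Bayes error is Phi(-||mu1 - mu2||/2), about 0.017.
pops = [((0.0, 0.0), [[1.0, 0.0], [0.0, 1.0]]),
        ((3.0, 3.0), [[1.0, 0.0], [0.0, 1.0]])]
err = bayes_error_mc(pops, [0.5, 0.5], box=((-5.0, 8.0), (-5.0, 8.0)))
```

A quasi-Monte Carlo rule would replace the uniform draws with a low-discrepancy sequence; the estimator itself is unchanged.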

3.4. Estimating the probability density function

There are many parametric and nonparametric methods for estimating pdfs. In the examples and
applications of Section 4, we use the kernel method, which is the most popular in practice.
It has the following form:

f̂(x) = (1 / (N h_1 h_2 … h_n)) Σ_{i=1}^{N} Π_{j=1}^{n} f_j((x_j − x_ij) / h_j),   (19)

where x_j, j = 1, 2, …, n, are the variables; x_ij, i = 1, 2, …, N, is the ith observation of the jth
variable; h_j is the bandwidth parameter for the jth variable; and f_j(·) is the kernel function of
the jth variable, usually the normal, Epanechnikov, biweight, or triweight kernel. In this method,
the choice of the smoothing parameter and the type of kernel function play an important role and
affect the result. Although Silverman [20], Martinez and Martinez [10], and some other authors
[7, 13, 27] have discussed this problem, an optimal choice has not yet been found. In this chapter,
the smoothing parameter follows the idea of Scott [19], and the kernel function is the Gaussian
one. We have also written MATLAB code to estimate pdfs in n-dimensional space using this method.

54 Bayesian Inference

Figure 2. The graph of the four bivariate normal pdfs and their gmax(x).
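A minimal Python sketch of the product-kernel estimator (19) with Gaussian kernels and Scott's rule-of-thumb bandwidths h_j = s_j N^(−1/(n+4)); this is our reading of the method, not the chapter's MATLAB code:

```python
import math

def scott_bandwidths(data):
    """Scott's rule-of-thumb bandwidth per coordinate: h_j = s_j * N**(-1/(n+4))."""
    N, n = len(data), len(data[0])
    hs = []
    for j in range(n):
        col = [row[j] for row in data]
        mean = sum(col) / N
        s = math.sqrt(sum((v - mean) ** 2 for v in col) / (N - 1))
        hs.append(s * N ** (-1.0 / (n + 4)))
    return hs

def kde(x, data, hs):
    """Product-Gaussian kernel density estimate of Eq. (19) at the point x."""
    N, n = len(data), len(data[0])
    total = 0.0
    for row in data:
        prod = 1.0
        for j in range(n):
            u = (x[j] - row[j]) / hs[j]
            prod *= math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
        total += prod
    return total / (N * math.prod(hs))
```

The estimated density integrates to one, which is a quick sanity check on any implementation of Eq. (19).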
We have written complete MATLAB code for the proposed algorithm; it is applied effectively
to the examples in Section 4.

4. Some applications

In this section, we consider three applications in three domains: biology, medicine, and
economics, to illustrate the theory presented above and to test the established algorithms.
The applications also show that the proposed algorithm has advantages over the existing ones.
Application 1. We consider classification of the well-known Iris flower data, which has been
presented in many documents, e.g., Ref. [17]. These data are often used to compare a new

method with existing ones. The three varieties of Iris, namely Setosa (Se),
Versicolor (Ve), and Virginica (Vi), are described by four attributes: X1 = sepal length, X2 = sepal
width, X3 = petal length, and X4 = petal width.
In this application, the cases of one, two, three, and four variables are considered in turn to
classify the three groups (Se), (Ve), and (Vi) by the Bayesian method with different prior
probabilities. The purpose is to compare the results of BayesC with those of BayesU, BayesR,
and BayesL. Because the sizes of the three groups are equal, the results of BayesU, BayesR,
and BayesL are the same. The correct classification probabilities are summarized in Table 3.

Table 3 shows that in almost all cases the proposed algorithm gives better results than the
other algorithms, and the case using the three variables X1, X2, and X3 gives the best result.
Application 2. This application considers thyroid gland disease (TGD). The thyroid is an
important gland, the largest endocrine gland in the body, and is responsible for the metabolism
of all cells. Some common diseases of the thyroid gland are hypothyroidism, hyperthyroidism,
thyroid nodules, and thyroid cancer; they are dangerous diseases. Recently, the rate of thyroid
gland disease has been increasing in some poor countries. The data include 3772 persons: 3541
in the ill group (I) and 231 in the non-ill group (NI). Details of these data are given at
http://www.cs.sfu.ca/wangk/ucidata/dataset/thyroid–disease. The surveyed variables are
Age (X1), Query on thyroxin (X2), Anti-thyroid medication (X3),
Sick (X4), Pregnant (X5), Thyroid surgery (X6), Thyroid Stimulating Hormone (X7),

Variables          BayesU = BayesL = BayesR    BayesC

X1                 0.667                       0.679
X2                 0.668                       0.579
X3                 0.903                       0.916
X4                 0.815                       0.827
X1, X2             0.715                       0.807
X1, X3             0.893                       0.895
X1, X4             0.807                       0.850
X2, X3             0.891                       0.898
X2, X4             0.809                       0.815
X3, X4             0.843                       0.866
X1, X2, X3         0.892                       0.919
X1, X2, X4         0.764                       0.810
X1, X3, X4         0.762                       0.814
X2, X3, X4         0.736                       0.822
X1, X2, X3, X4     0.725                       0.745

Table 3. The correct probability in classifying the Iris flower data.



Triiodothyronine (X8), Total thyroxin (X9), T4U measured (X10), and Referral source (X11). In
this application, a random 70% of the data (2479 elements of group I and 162 elements of
group NI) is used as the training set to determine the significant variables, to estimate the
pdfs, and to find a suitable model. The remaining 30% (1062 elements of group I and 69
elements of group NI) is used as the test set. The results of the Bayesian method are also
compared with those of other methods.
To assess the effect of the independent variables on TGD, we build the logistic regression
model for log(p/(1 − p)) with the variables Xi, i = 1, 2, …, 11, where p is the probability of
TGD. The analytical results are summarized in Table 4.

In Table 4, the three variables X1, X8, and X11 (in bold face) are statistically significant for
classifying the two groups (I) and (NI) at the 5% level, so we use them to classify TGD.
Applying the PPC algorithm for the cases of one, two, and three variables with all prior
probabilities, we obtain the results given in Table 5.

Table 5 shows that the correct probability is high, and BayesC gives the best result in all three
cases. With three variables, BayesC is almost exact. We also compare BayesC with existing
methods (Fisher, SVM, and logistic) for all three cases above; in every case, BayesC is more
advantageous than the others in reducing the Bayes error.

Variable    Sig.      Variable    Sig.

X1          0.000     X7          0.304
X2          0.279     X8          0.000
X3          0.998     X9          0.995
X4          0.057     X10         0.999
X5          0.997     X11         0.000
X6          0.997     Const       0.992

Table 4. Significance values (Sig.) of the logistic regression model.

Cases            Variables      BayesU    BayesR    BayesL    BayesC

One variable     X1             91.13     97.47     97.46     97.97
                 X8             90.72     98.51     98.50     98.65
                 X11            90.53     97.48     97.47     98.19
Two variables    X1, X8         98.73     98.77     98.77     99.78
                 X1, X11        98.11     98.65     97.65     99.44
                 X8, X11        98.71     98.77     98.77     99.82
Three variables  X1, X8, X11    98.35     98.89     98.89     99.96

Table 5. The correct probability (%) in classifying TGD by Bayesian method from training set.

Using the best result of each method for each case from Table 6 to classify the test set (1131
elements), we obtain the results given in Table 7.

From Table 7, we see that on the test set, BayesC also gives the best result.

Application 3. This application considers the problem of repaying bank debt (RBD) by customers.
In bank credit operations, determining the repayment ability of customers is very
important: if lending is too easy, the bank may face bad-debt problems; otherwise, the
bank may miss good business. Therefore, in recent years, the classification of credit
applications by assessing the ability to repay bank debt has been specially studied, and it
remains a difficult problem in Vietnam. In this section, we appraise this ability for companies in
Can Tho city (CTC), Vietnam, using the proposed approach. We collected data on 214 enterprises
operating in key sectors such as agriculture, industry, and commerce, including 143 cases of good
debt (G) and 71 cases of bad debt (B). The data were provided by the responsible organizations of
CTC. Each company is evaluated on 13 independent variables based on expert opinion; the
specific variables are given in Table 8.

Because of the sensitivity of the data, the author has to conceal the real data and work with a
training data set. The steps performed in this application are similar to those in Application 2.
The training set has 100 elements of group G and 50 elements of group B, and the test set has 43
elements of group G and 21 elements of group B. With the training set, the logistic regression
model shows that only the three variables X1, X4, and X7 are statistically significant at the 5%
level, so we use these three variables to perform BayesU, BayesR, BayesL, and BayesC. The
results are given in Table 9.

From Table 9, we see that BayesC gives the highest probability in all cases. We also use the
logistic, Fisher, and SVM methods on the training set to find their best results. The correct
probabilities are given in Table 10.

Methods    One variable    Two variables    Three variables

Logistic   93.90           93.90            93.90
Fisher     72.30           73.60            71.70
SVM        93.87           93.87            93.87
BayesC     98.65           99.82            99.96

Table 6. The correct probability (%) for optimal models of methods in classifying TGD.

Methods    Correct numbers    False numbers    Correct probability

Logistic   835                296              73.8
Fisher     835                296              73.8
SVM        1062               69               90.9
BayesC     1062               69               93.9

Table 7. Comparison of the correct probability (%) in classifying TGD from the test set.

Using the best model of each method from Table 10 to classify the test set (64
elements), we obtain the results given in Table 11.

Once again, from Table 11, we see that with the test data, BayesC also gives the best result.

Xi     Independent variables    Detail

X1     Financial leverage       Total debt/total equity
X2     Reinvestment             Total debt/total equity
X3     Roe                      Net profit/equity
X4     Interest                 (Net income + depreciation)/total assets
X5     Floating capital         (Current assets − current liabilities)/total assets
X6     Liquidity                (Cash + short-term investments)/current liabilities
X7     Profits                  Net profit/total assets
X8     Ability                  Net sales/total assets
X9     Size                     Logarithm of total assets
X10    Experience               Years in business activity
X11    Agriculture              Agricultural and forestry sector
X12    Industry                 Industry and construction
X13    Commerce                 Trade and services

Table 8. The surveyed independent variables.

Cases            Variables     BayesU    BayesR    BayesL    BayesC

One variable     X1            86.21     86.14     84.13     87.13
                 X4            81.12     82.91     86.16     88.19
                 X7            83.21     84.63     83.14     84.52
Two variables    X1, X4        87.25     88.72     87.19     89.06
                 X1, X7        88.16     88.34     83.26     89.56
                 X4, X7        89.25     89.04     89.02     91.34
Three variables  X1, X5, X7    91.15     91.53     90.17     93.18

Table 9. The correct probability (%) in classifying RBD by Bayesian method from training set.

Methods    One variable    Two variables    Three variables

Logistic   84.04           88.29            88.69
Fisher     84.73           80.73            79.32
SVM        82.34           82.03            83.07
BayesC     88.19           91.34            93.18

Table 10. The correct probability (%) for optimal models of methods in classifying RBD.

Methods    Correct numbers    False numbers    Correct probability

Logistic   53                 11               82.81
Fisher     52                 12               81.25
SVM        53                 11               82.81
BayesC     57                 7                89.06

Table 11. Comparison of the correct probability (%) in classifying RBD from the test set.

5. Conclusion

This chapter presents the classification algorithm based on the Bayesian method in both its
theoretical and applied aspects. We establish relations between the Bayes error and other
measures and consider the problem of computing it in real applications in one and several
dimensions. An algorithm to determine prior probabilities that may decrease the Bayes error is
proposed. The researched problems are applied in three different domains: biology, medicine,
and economics. The applications show that the proposed approach has advantages over existing
ones. In addition, a complete MATLAB procedure has been developed and is effectively used in
some real applications. These examples show that our work offers potential for research on real
problems.

Author details

Tai Vovan
Address all correspondence to: [email protected]

College of Natural Sciences, Can Tho University, Can Tho City, Vietnam

References

[1] Altman DG. Statistics in medical journals: Development in 1980s. Statistical Medicine.
1991;10:546-551. DOI: 10.1002/sim.4780101206
[2] Bora DJ, Gupta AK. Impact of exponent parameter value for the partition matrix on the
performance of fuzzy C means Algorithm. International Journal of Scientific Research in
Computer Science Applications and Management Studies. 2014;3:1-6. DOI: arXiv:1406.4007

[3] Cristiani S, Shawe TJ. An introduction to support vector machines and other kernel-based
learning method. 2nd ed. London: Cambridge University; 2000. p. 204. DOI: 10.1108/
k.2001.30.1.103.6

[4] Cannon RL, Dave JV, Bezdek JC. Efficient implementation of the fuzzy c-means clustering
algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1986;2:
248-255. DOI: 10.1109/TPAMI.1986.4767778
60 Bayesian Inference

[5] Fadili MJ, et al. On the number of clusters and the fuzziness index for unsupervised FCA
application to BOLD fMRI time series. Medical Image Analysis. 2001;5(1):55-67. DOI:
10.1016/S1361-8415(00)00035-9
[6] Fisher RA. The statistical utilization of multiple measurements. Annals of Eugenic. 1936;7:
376-386. DOI: 10.1111/j.1469-1809.1938.tb02189
[7] Ghosh AK. Classification using kernel density estimates. Technometrics. 2006;48:120-132.
DOI: 10.1198/004017005000000391
[8] Jan YK, Cheng CW, Shih YH. Application of logistic regression analysis of home mortgage
loan prepayment and default. ICIC Express Letters. 2010;2:325-331. DOI: 10.12783/
ijss.2015.03.014
[9] Hall LO, et al. A comparison of neural network and fuzzy clustering techniques in
segmenting magnetic resonance images of the brain. IEEE Transactions on Neural Net-
works. 1992;3(5):672-682. DOI: 10.1109/72.159057

[10] Martinez WL, Martinez AR. Computational statistics handbook with Matlab. 1st ed. Boca
Raton: CRC Press; 2007. DOI: 1198/tech.2002.s89

[11] Matusita K. On the notion of affinity of several distributions and some of its applica-
tions. Annals of the institute of Statistical Mathematics. 1967;19:181-192. DOI: 10.1007/
BF02481018
[12] Marta E. Application of Fisher's method to materials that only release water at high
temperatures. Portugaliae Electrochimica Acta. 2001;15:301-311. DOI: 10.1016/S0167-7152
(02)00310-3
[13] McLachlan GJ, Basford KE. Mixture Models: Inference and Applications to Clustering. 1st
ed. New York: Marcel Dekker; 1988. DOI: 10.2307/2348072
[14] Miller G, Inkret WC, Little TT. Bayesian prior probability distributions for internal
dosimetry. Radiation Protection Dosimetry. 2001;94:347-352. DOI: 10.1093/oxfordjour
nals.rpd.a006509
[15] Pham-Gia T, Turkkan T. Bounds for the Bayes error in classification: A Bayesian approach
using discriminant analysis. Statistical Methods and Applications. 2006;16:7-26. DOI:
10.1007/s10260-006-0012-x
[16] Pham–Gia T, Turkkan N, Bekker A. Bayesian analysis in the L1–norm of the mixing
proportion using discriminant analysis. Metrika. 2008;64:1-22. DOI: 10.1007/s00184-006-
0027-1
[17] Pham–Gia T, Turkkan N, Tai VV. Statistical discrimination analysis using the maximum
function. Communications in Statistics-Simulation and Computation. 2008;37:320-336.
DOI: 10.1080/03610910701790475
[18] Pham–Gia T, Nhat ND, Phong, NV. Statistical classification using the maximum function.
Open Journal of Statistics. 2015;15:665-679. DOI: 10.4236/ojs.2015.57068
Classifying by Bayesian Method and Some Applications 61
http://dx.doi.org/10.5772/intechopen.70052

[19] Scott DW. Multivariate density estimation: Theory, practice, and visualization. 1st ed.
New York: Wiley; 1992. DOI: 10.1002/9780470316849
[20] Silverman BW. Density Estimation for Statistics and Data Analysis. 1st ed. Boca Raton:
CRC Press; 1986. DOI: 10.1007/978-1-4899-3324-9
[21] Tai VV, Pham–Gia T. Clustering probability distributions. Journal of Applied Statistics.
2010;37:1891-1910. DOI: 10.1080/02664760903186049

[22] Tai VV. L1–distance and classification problem by Bayesian method. Journal of Applied
Statistics. 2017; 44(3):385-401. DOI: 10.1080/02664763.2016.1174194

[23] Tai VV, Thao NT, Ha CN. The prior probability in classifying two populations by Bayesian
method. Applied Mathematics Engineering and Reliability. 2016;6:35-40. DOI: 10.1201/
b21348-7

[24] Thao NT, Tai VV. Fuzzy clustering of probability density functions. Journal of Applied
Statistics. 2017;44(4):583-601. DOI: 10.1080/02664763.2016.117750
[25] Thao NT, Tai VV. A new approach for determining the prior probabilities in the classifi-
cation problem by Bayesian method. Advances in data analysis and classification. Forth-
coming. DOI: 10.1007/s11634-016-0253
[26] Toussaint GT. Some inequalities between distance: Measures for feature evaluation. IEEE
Transactions on Computers. 1972;21:405-410. DOI: 10.1109/TC1972.5008990
[27] Webb AR. Statistical Pattern Recognition. 1st ed. New York: Wiley; 2003. DOI: 10.1109/
34.824819
[28] Yu J, Cheng Q, Huang H. Analysis of the weighting exponent in the FCM. IEEE Trans-
actions on Systems, Man, and Cybernetics, Part B. 2004;34(1):634-639
DOI: 10.5772/intechopen.70210

Chapter 4

Hypothesis Testing for High-Dimensional Problems

Naveen K. Bansal

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/intechopen.70210

Abstract
For high-dimensional hypothesis testing problems, new approaches have emerged in
recent years. The most promising of them use the Bayesian approach. In this chapter, we
review some of the past approaches, applicable only to low-dimensional hypothesis
testing, and contrast them with the modern approaches of high-dimensional hypothesis
testing. We review some new results based on Bayesian decision theory and show
how the Bayesian approach can accommodate directional hypothesis testing and
skewness in the alternatives. A real example of gene expression data is used to demonstrate
a Bayesian decision-theoretic approach to directional hypothesis testing with
skewed alternatives.

Keywords: multiple directional hypotheses, false discovery rate, familywise error rate,
gene expression, skew-normal distribution

1. Introduction

In today’s world, most statistical inference problems involve high-dimensional multiple
hypothesis testing. Whenever we collect data, we collect data on multiple features, in some
cases involving very high-dimensional variables. For example, gene expression data
consist of expression levels on thousands of genes; image data consist of image intensities
on multiple voxels. The statistical analysis of these types of data involves multiple hypothesis
testing (MHT). It is well known that univariate methods cannot be applied to simultaneously
test hypotheses on the multiple features. The reason for this is that the error rates of
the univariate analyses get compounded under MHT, and as a result the actual error rate can be
very high. To understand the main issue of multiplicity, consider the following example.
Suppose there are, say, 100 misspelled words in a book, and each of these words occurs on 5%
of the pages. You pick a page at random. For each misspelled word, the probability is
certainly 0.05 of finding that word on the page. However, the probability is much higher that
you will find at least one of the 100 misspelled words. If these words were independently

© 2017 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use,
distribution, and reproduction in any medium, provided the original work is properly cited.

distributed, then the probability of finding at least one misspelled word is 1 − (0.95)^100 ≈ 0.994.
If the placements of the misspelled words were positively dependent, then the probability
would be lower than 0.994. For example, in the extreme case of dependence in which they all
occur together, the probability is 0.05. The same phenomenon occurs in MHT:
statistical inference based on the error rate of each individual hypothesis test can lead to a
very high error rate for the combined hypotheses. Thus, for MHT, an adjustment to the error
rate needs to be made. Note that the adjustment may depend on the dependence structure,
but due to the complexity of the dependence structure in high dimensions, dependency is
usually ignored in the literature [1].
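The numbers in this example are easy to verify; the Bonferroni-style comparison at the end is our addition, anticipating the procedures discussed below:

```python
# Each of 100 misspelled words appears on 5% of pages, independently.
p_at_least_one = 1.0 - 0.95 ** 100          # probability of finding at least one

# Bonferroni-style correction: testing each word at level 0.05/100 keeps the
# familywise chance of any finding below 0.05.
p_corrected = 1.0 - (1.0 - 0.05 / 100) ** 100
```

The uncorrected chance of at least one finding is about 0.994, while the corrected chance stays just under 0.05.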

The statistical inference depends on how we define the error rate for the combined hypothesis
tests. Suppose there are m hypothesis tests H_0^i vs. H_a^i, i = 1, 2, …, m. If we do not
want to make even one false discovery, then we should control the familywise error rate
(FWER), defined as

FWER = Pr{falsely reject H_0^i for at least one i, i = 1, 2, …, m}.   (1)

There are many methods for controlling FWER ≤ α_F (= 0.05, for example). The simplest is
Bonferroni's procedure. Let T_i be the test statistic for testing H_0^i vs. H_a^i with the
corresponding p-value p_i. Then Bonferroni's procedure rejects H_0^i if p_i < α_F/m. To see why
this controls the FWER, let I_0 be the set of all i for which H_0^i is true, and suppose
p_j < α_F/m for at least one j ∈ I_0. Then, using Boole's inequality, we have, from Eq. (1),

FWER = Pr{ ∪_{i ∈ I_0} (p_i < α_F/m) } ≤ Σ_{i ∈ I_0} Pr{p_i < α_F/m}.   (2)

Now, since p_i ~ U(0, 1) under H_0^i, Pr{p_i < α_F/m} = α_F/m. Then, assuming that I_0 has m_0
elements, we have, from Eq. (2),

FWER ≤ m_0 α_F / m ≤ α_F.

Holm [2] gave a modified version of Bonferroni's procedure that also controls the familywise
error rate. Holm's procedure is the following: first rank all the p-values,
p_(1) ≤ p_(2) ≤ … ≤ p_(m), and let H_0^(1), H_0^(2), …, H_0^(m) be the associated null
hypotheses. Let l be the smallest index such that p_(l) > α_F/(m − l + 1). Then reject only the
null hypotheses H_0^(1), H_0^(2), …, H_0^(l−1). Note that the selected hypotheses have p-values
with p_(1) < α_F/m, p_(2) < α_F/(m − 1), …, p_(l−1) < α_F/(m − l + 2); the procedure is thus more
powerful than Bonferroni's, since every hypothesis rejected by Bonferroni's procedure is
also rejected by Holm's procedure.
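Holm's step-down procedure can be sketched in a few lines; the following Python version (function name and return convention are ours) follows the description above:

```python
def holm(pvalues, alpha=0.05):
    """Holm's step-down procedure: return the set of indices whose nulls are rejected."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p-value
    rejected = set()
    for rank, idx in enumerate(order):                  # rank = 0, 1, ..., m-1
        if pvalues[idx] > alpha / (m - rank):           # first l with p_(l) > alpha/(m-l+1)
            break
        rejected.add(idx)
    return rejected
```

The threshold grows from alpha/m up to alpha as the ranks advance, which is where the extra power over Bonferroni comes from.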
The above Bonferroni-type procedures are not very satisfactory when m is very large. Suppose
m = 10,000 (actually not very high for most high-dimensional problems),
and suppose we want to control FWER at α_F = 0.05. Then, for Holm's procedure, the smallest

p-value has to be lower than 0.000005 in order to reject even one hypothesis, which may be
very hard to achieve. The problem is not really with Holm's procedure; the problem is with the
use of FWER as an error rate. For a high-dimensional problem, it is unrealistic to seek a
procedure that never makes even one false discovery. Benjamini and Hochberg [1]
proposed a new error rate called the false discovery rate (FDR) and a procedure that
works much better for high-dimensional MHT.
In Section 2, we review the FDR procedure and Bayesian procedures for two-sided alternatives.
An extension to directional hypotheses is presented in Section 3, where we also
discuss Bayesian procedures under skewed alternatives. In Section 4, the problem of directional
hypotheses is considered by converting p-values into normally distributed test statistics.
We also discuss, in Section 4, a Bayes procedure under skew-normal alternatives and an
application using real gene expression data. Some concluding remarks are made in Section 5.

2. False discovery rate (FDR), Benjamini and Hochberg's (BH) procedure, and Bayesian procedures

For each hypothesis test H_0^i vs. H_a^i, suppose a statistical procedure either rejects the
null hypothesis H_0^i or fails to reject it. For simplicity, we equate failing to reject H_0^i
with accepting the null H_0^i; however, for small sample sizes, it would be unwise to
conclude that H_0^i is accepted. From now on, rejections of the null will be called discoveries.
Table 1 shows the possible outcomes of a procedure, where, for example, V is the total number
of discoveries, among which V_0 is the number of false discoveries.
Thus, the proportion of false discoveries is V_0/max(V, 1). The FDR is defined as the expected
proportion of false discoveries, that is,

FDR = E[V_0 / max(V, 1)].   (3)

If, for example, FDR = 0.05, then we can expect on average 5% of all discoveries to be false.
In other words, under repeated experiments we make, on average, 5% false discoveries
(in a frequentist's sense). Note that FDR ≤ FWER = P(V_0 ≥ 1), as the following inequality
shows:

             Accept H0    Reject H0    Total

H0 is true   U0           V0           m0
Ha is true   Ua           Va           m − m0
Total        U            V            m

Table 1. Total number of decisions made.


66 Bayesian Inference

FDR = E[V_0 / max(V, 1)] = E[(V_0 / max(V, 1)) I(V_0 ≥ 1)] ≤ E[I(V_0 ≥ 1)] = P(V_0 ≥ 1).

Thus, we are likely to make a larger number of discoveries under the FDR approach than under
FWER, since if a procedure controls FWER (≤ α), then it also controls FDR (≤ α), but not vice
versa.

2.1. Benjamini and Hochberg's procedure

Benjamini and Hochberg [1] proposed the following BH procedure, which controls the FDR.

Let p_i be the p-value of the ith hypothesis under a test statistic T_i. Suppose T_1, T_2, …, T_m are
independently distributed. Let p_[1] < p_[2] < … < p_[m] be the ordered p-values, with the
corresponding null hypotheses denoted by H_0^(1), H_0^(2), …, H_0^(m). Let

i_0 = max{ i : p_[i] ≤ (i/m) α }.

Then reject H_0^(i) for all i ≤ i_0.

This procedure controls FDR ≤ (m_0/m) α ≤ α. Since m_0 is unknown, the upper bound (m_0/m) α is
not very useful by itself; if m_0 can be estimated reliably, a better bound is possible.

The above result was proven in [1] under independence of the test statistics. Benjamini and
Yekutieli [3] extended the result to positively correlated test statistics, and they also modified
the BH procedure with the new i_0 defined as

i_0 = max{ i : p_[i] ≤ (i / (m c(m))) α },

where c(m) = Σ_{i=1}^{m} 1/i.
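The BH step-up rule above can be sketched as follows; this Python version (names and return convention are ours) scans the ordered p-values and rejects up to the largest index i with p_[i] ≤ (i/m)α:

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """BH step-up procedure: reject H_0^(i) for all i <= i0, where
    i0 = max{ i : p_[i] <= (i/m) * alpha }.  Returns the rejected indices."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    i0 = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * alpha / m:
            i0 = rank                          # keep the LARGEST rank passing its threshold
    return set(order[:i0])
```

Note the step-up character: an ordered p-value can fail its own threshold and still be rejected because a later one passes.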

2.2. Bayesian procedures

Under the Bayesian setting, we assume that H_0^i and H_a^i, i = 1, 2, …, m, are generated
probabilistically with

P(H_0^i) = p and P(H_a^i) = 1 − p.

Under this setting, Ref. [4] developed the concept of the local false discovery rate (fdr). Let
T_i, i = 1, 2, …, m, be test statistics with pdfs T_i | H_0 ~ f_0(t) and T_i | H_a ~ f_a(t). Then,
marginally, T_i ~ f(t) = p f_0(t) + (1 − p) f_a(t), and

fdr(t) = P(H_0^i | T_i = t) = p f_0(t) / f(t).   (4)

The idea is that if T_i ∈ [t, t + δt], where δt → 0, then fdr(t) represents the proportion of times
H_0^i is true. If t is very large, then fdr(t) will be very small, indicating that the probability of
H_0^i is very small (i.e., the false discovery rate is very small). In Eq. (4), p and f(t) are
unknown, but they can be estimated (see [4]).
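As a toy illustration of Eq. (4), the following Python sketch evaluates fdr(t) for a hypothetical mixture of 90% null N(0, 1) and 10% alternative N(3, 1) densities (all names and parameters here are ours, not from the chapter):

```python
import math

def normal_pdf(t, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) at t."""
    z = (t - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def local_fdr(t, p, f0, fa):
    """fdr(t) = p*f0(t) / f(t), with f(t) = p*f0(t) + (1-p)*fa(t), as in Eq. (4)."""
    num = p * f0(t)
    return num / (num + (1.0 - p) * fa(t))

# Hypothetical mixture: 90% nulls N(0, 1), 10% alternatives N(3, 1).
fdr_at_0 = local_fdr(0.0, 0.9, normal_pdf, lambda t: normal_pdf(t, 3.0))
fdr_at_4 = local_fdr(4.0, 0.9, normal_pdf, lambda t: normal_pdf(t, 3.0))
```

Near t = 0 the null dominates and fdr(t) is close to 1; far in the tail, fdr(t) drops below 0.01, as the text describes.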

Storey [5] proposed the positive false discovery rate

pFDR = E[V_0 / V | V > 0],   (5)

where the expectation is with respect to the distribution of (T_i, θ_i), i = 1, 2, …, m. Under the
assumption that T_1, T_2, …, T_m are identically and independently distributed, Ref. [6] proved that

pFDR(Γ) = P(H_0 | T ∈ Γ)

for a procedure that rejects H_0^i when T_i ∈ Γ. Based on this, the q-value for multiple hypotheses
(analogous to the p-value for a single hypothesis) is defined as the smallest value of pFDR(Γ) such
that the observed T_i = t_i ∈ Γ; see [6]. In most cases, q-value(t_i) = P(H_0 | T_i > t_i). This gives a
multiple-hypothesis procedure that rejects H_0^i if q-value(t_i) < α.

3. Directional hypotheses testing

As described earlier, the null hypothesis H_0^i is either accepted or rejected. In many cases,
however, rejecting the null hypothesis is not sufficient: after rejecting H_0^i, finding the direction
of the alternative may also be important. A detailed discussion of directional hypotheses
can be found in [7].

Directional hypothesis testing involves testing H_0^i against the directional hypotheses H_−^i and H_+^i,
and the objective is to obtain a selection region {T_i ∈ Γ−} for selecting H_−^i and a selection region
{T_i ∈ Γ+} for selecting H_+^i. In other words, H_0^i is rejected if T_i ∈ Γ− or T_i ∈ Γ+, and the
direction H_−^i or H_+^i is determined according to whether T_i ∈ Γ− or T_i ∈ Γ+, respectively.
Analogous to Table 1, we now have Table 2.

Table 2 illustrates the numbers of possible cases when accepting H_0, selecting H_−, or selecting
H_+. For example, out of the V times H_− is selected, V_0 errors are made when in fact H_0 is

             Accept H0    Select H−    Select H+    Total

H0 is true   U0           V0           W0           m0
H− is true   U−           V−           W−           m−
H+ is true   U+           V+           W+           m+
Total        U            V            W            m

Table 2. Number of decisions under directional hypotheses.



true, and V_+ errors are made when in fact H_+ is true. In other words, when selecting H_−,
not only is H_0 falsely rejected V_0 times, but the direction is also falsely selected V_+ times. This
leads to the concept of the directional false discovery rate (DFDR), defined as

DFDR = E[(V_0 + V_+ + W_0 + W_−) / max(V + W, 1)].   (6)

This is analogous to the FDR for two-sided alternatives. For most cases, Ref. [8] showed that DFDR-
controlling procedures for directional hypotheses can be treated as FDR-controlling procedures
for two-sided multiple hypotheses, with the direction determined by the sign of the test
statistic.

Bansal and Miescke [9] considered a decision-theoretic formulation of multiple hypotheses
problems. The approach assumes parametric modeling. Suppose the model for the observed
data x is represented by P(x; θ, η), where θ = (θ_1, θ_2, …, θ_m)′ is the parameter vector of interest
and η is a nuisance parameter. The problem of interest is to test

H_0^i: θ_i = 0 vs. H_−^i: θ_i < 0 or H_+^i: θ_i > 0.   (7)

Let the loss function of a decision rule d(x) = (d_1(x), d_2(x), …, d_m(x)) be given by

L(θ, d(x)) = Σ_{i=1}^{m} l_i(θ, d_i(x)),   (8)

where l_i(θ, d_i(x)) is the individual loss of d_i. Here, d_i ∈ {−1, 0, 1}, with d_i = 0, d_i = −1, and d_i = 1
meaning accepting H_0^i, selecting H_−^i, and selecting H_+^i, respectively. Note that for the "0-1" loss,
that is, when l_i = 0 for a correct decision and l_i = 1 for an incorrect decision, L is the total number
of incorrect decisions. Thus, minimizing E[L(θ, d(X))] for the "0-1" loss amounts to minimizing
the expected number of incorrect decisions.

Now, suppose that under the Bayesian setting, θ_i, i = 1, 2, …, m, are generated from

π(θ) = p− π−(θ) + p0 I(θ = 0) + p+ π+(θ),   (9)

where π− is a prior density over (−∞, 0) and π+ is a prior density over (0, ∞). A special
case of prior (9) is π−(θ) = π+(−θ). In this case, p− and p+ reflect the skewness in the
alternative hypotheses. For example, if p− = p+, then we have a symmetric case; here,
selecting H_− or H_+ after rejecting H_0 based on the sign of the test statistic
makes sense. On the other hand, if p− < p+, then more of the θ_i are
positive than negative. For many gene expression data analyses, this presents a useful
case, since over-expressed genes may occur more frequently than under-expressed genes as
a result of gene mutation (naturally or as a result of external factors). For specific examples,
see [9, 10].
From now on, we focus on the "0-1" loss; the results can easily be extended to other loss
functions. The "0-1" loss can be written as

L(θ, d) = Σ_{i=1}^{m} [1 − Σ_{j=−1}^{1} I(d_i = j) I(ν_{θ_i} = j)],

where ν_{θ_i} ∈ {−1, 0, 1} is an indicator variable with ν_{θ_i} = −1 when θ_i < 0, ν_{θ_i} = 0 when
θ_i = 0, and ν_{θ_i} = 1 when θ_i > 0. It is easy to see that minimizing the posterior expected loss
yields the selection rule that selects H_−^i, H_0^i, or H_+^i according to max{v_i^(−), v_i^(0), v_i^(+)}, where

v_i^(−) = P(H_−^i | x),  v_i^(0) = P(H_0^i | x),  and  v_i^(+) = P(H_+^i | x).

3.1. The constrained Bayes rule

The Bayes procedure described earlier accommodates skewness in the prior, but it does not control any false discovery rate. In order to control a false discovery rate, we need a constrained Bayes rule that minimizes the posterior expected loss subject to a constraint on the false discovery rate.
The directional false discovery rate (6) is defined in a frequentist manner, in which the expectation is with respect to $X|\theta$. Let us define Eq. (6) as BDFDR when the expectation is first taken with respect to $X|\theta$ and then a further expectation is taken with respect to $\theta$. We define the posterior version of Eq. (6) as PDFDR when the expectation is taken with respect to the posterior distribution of $\theta|X = x$. It can be shown that

$$\mathrm{PDFDR} = 1 - \frac{\sum_{i=1}^{m}\left\{ I(d_i = -1)\, v_i^{(-)} + I(d_i = +1)\, v_i^{(+)} \right\}}{(|D_{-}| + |D_{+}|) \vee 1}. \qquad (10)$$

Here, $|D_{-}| = \sum_{i=1}^{m} I(d_i = -1)$ and $|D_{+}| = \sum_{i=1}^{m} I(d_i = 1)$.
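Eq. (10) can be computed directly from the decision vector and the posterior probabilities $v_i^{(-)}$ and $v_i^{(+)}$. A small sketch (the numbers are hypothetical, chosen only to exercise the formula):

```python
# Sketch of Eq. (10): posterior directional FDR of a decision vector d,
# given v_minus[i] = P(H_i- | x) and v_plus[i] = P(H_i+ | x).
# All numeric inputs are illustrative.
def pdfdr(d, v_minus, v_plus):
    """1 minus the average posterior probability of the selected signs,
    with the number of discoveries floored at 1 in the denominator."""
    num = sum(vm for di, vm in zip(d, v_minus) if di == -1)
    num += sum(vp for di, vp in zip(d, v_plus) if di == +1)
    n_disc = sum(1 for di in d if di != 0)
    return 1.0 - num / max(n_disc, 1)

d = [-1, 0, 1, 1, 0]                       # decisions for m = 5 hypotheses
v_minus = [0.93, 0.10, 0.01, 0.02, 0.20]
v_plus  = [0.02, 0.15, 0.96, 0.90, 0.30]
print(round(pdfdr(d, v_minus, v_plus), 4))
```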

A constrained Bayes rule can be obtained by minimizing the posterior expected loss subject to the constraint that PDFDR ≤ α. There can be many approaches to this constrained minimization. We present here an approach given in [9], which is as follows.

Consider the sets $D_{B-}$ and $D_{B+}$ of indices that select $H_{i-}$ and $H_{i+}$, respectively, according to the unconstrained Bayes rule, that is, when $v_i^{(-)} = \max\{v_i^{(-)}, v_i^{(0)}, v_i^{(+)}\}$ and $v_i^{(+)} = \max\{v_i^{(-)}, v_i^{(0)}, v_i^{(+)}\}$, respectively. Define $\xi_i = v_i^{(-)}$ for $i \in D_{B-}$ and $\xi_i = v_i^{(+)}$ for $i \in D_{B+}$, and then rank all $\xi_i$, $i \in D_{B-} \cup D_{B+}$, from the lowest to the highest. Let the ranked values be denoted by $\xi_{[1]} \le \xi_{[2]} \le \cdots \le \xi_{[\hat{k}]}$, where $\hat{k} = |D_{B-} \cup D_{B+}|$. Denote

$$\hat{i}_0 = \max\left\{ j \le \hat{k} : \frac{1}{j}\sum_{i=1}^{j} \xi_{[\hat{k}-i+1]} \ge 1 - \alpha \right\}.$$

Let $D_{\xi}$ denote the set of indices corresponding to $\xi_{[\hat{k}]} \ge \xi_{[\hat{k}-1]} \ge \cdots \ge \xi_{[\hat{k}-\hat{i}_0+1]}$. Now, select $H_{i-}$ for $i \in D_{B-} \cap D_{\xi}$, and $H_{i+}$ for $i \in D_{B+} \cap D_{\xi}$.
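The ranking step above keeps the largest top set of unconstrained discoveries whose average posterior probability of a correct directional call is at least $1-\alpha$. A sketch with illustrative ξ values (not taken from [9]):

```python
# Sketch of the constrained selection in [9]: rank the xi's (posterior
# probabilities of the selected sign), then keep the largest j such that
# the mean of the top-j xi's is >= 1 - alpha. Illustrative xi values.
def constrained_discoveries(xi, alpha):
    """Return the xi values surviving the PDFDR <= alpha constraint."""
    ranked = sorted(xi, reverse=True)      # xi_[k] >= xi_[k-1] >= ...
    best_j, running = 0, 0.0
    for j, x in enumerate(ranked, start=1):
        running += x
        if running / j >= 1.0 - alpha:     # mean of the top-j values
            best_j = j
    return ranked[:best_j]

xi = [0.99, 0.97, 0.90, 0.80, 0.60]
print(constrained_discoveries(xi, alpha=0.05))
```

The indices attached to the surviving ξ values form $D_{\xi}$; the final directional calls intersect it with $D_{B-}$ and $D_{B+}$ as described above.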


70 Bayesian Inference

3.2. Estimating mixture parameters

The above procedure requires estimation of the parameters $(p_{-}, p_{0}, p_{+})$ and of the nuisance parameter $\eta$. Note that, marginally,

$$X_i \sim p_{-}\, f_{-}(x_i|\eta) + p_{0}\, f_{0}(x_i|\eta) + p_{+}\, f_{+}(x_i|\eta),$$

where $f_0(x_i|\eta) = f(x_i|0, \eta)$,

$$f_{-}(x_i|\eta) = \int_{-\infty}^{0} f(x_i|\theta, \eta)\, \pi_{-}(\theta)\, d\theta, \qquad f_{+}(x_i|\eta) = \int_{0}^{\infty} f(x_i|\theta, \eta)\, \pi_{+}(\theta)\, d\theta,$$

and $X_1, X_2, \ldots, X_m$ are independently distributed. Estimates of the parameters of the mixture density can be obtained by using the EM algorithm. It is easy to see that the EM estimators of $(p_{-}, p_{0}, p_{+})$ follow the iterative scheme

$$p_{-}^{(j+1)} = \frac{1}{m}\sum_{i=1}^{m} \frac{p_{-}^{(j)}\, f_{-}(x_i|\eta)}{p_{-}^{(j)} f_{-}(x_i|\eta) + p_{0}^{(j)} f_{0}(x_i|\eta) + p_{+}^{(j)} f_{+}(x_i|\eta)},$$

$$p_{0}^{(j+1)} = \frac{1}{m}\sum_{i=1}^{m} \frac{p_{0}^{(j)}\, f_{0}(x_i|\eta)}{p_{-}^{(j)} f_{-}(x_i|\eta) + p_{0}^{(j)} f_{0}(x_i|\eta) + p_{+}^{(j)} f_{+}(x_i|\eta)},$$

$$p_{+}^{(j+1)} = \frac{1}{m}\sum_{i=1}^{m} \frac{p_{+}^{(j)}\, f_{+}(x_i|\eta)}{p_{-}^{(j)} f_{-}(x_i|\eta) + p_{0}^{(j)} f_{0}(x_i|\eta) + p_{+}^{(j)} f_{+}(x_i|\eta)}.$$

The nuisance parameter $\eta$ can also be estimated iteratively within the EM algorithm or by other means. See [9] for more details.
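The iterative scheme above can be sketched as follows, assuming the component densities $f_-$, $f_0$, $f_+$ have already been evaluated at each $x_i$ (with $\eta$ held fixed); the numeric values are illustrative placeholders, not the chapter's densities:

```python
# Sketch of the EM updates for the mixing proportions (p-, p0, p+).
# fm, f0, fp hold f-(x_i | eta), f0(x_i | eta), f+(x_i | eta) for i = 1..m;
# eta is treated as fixed here. All numbers are illustrative.
def em_proportions(fm, f0, fp, n_iter=200):
    m = len(f0)
    pm, p0, pp = 1/3, 1/3, 1/3             # initial mixing proportions
    for _ in range(n_iter):
        # denominator of the responsibilities, one value per observation
        tot = [pm*a + p0*b + pp*c for a, b, c in zip(fm, f0, fp)]
        pm = sum(pm*a/t for a, t in zip(fm, tot)) / m
        p0 = sum(p0*b/t for b, t in zip(f0, tot)) / m
        pp = 1.0 - pm - p0                 # responsibilities sum to one
    return pm, p0, pp

fm = [0.30, 0.05, 0.01, 0.02, 0.25]        # toy density values, m = 5
f0 = [0.10, 0.35, 0.05, 0.40, 0.10]
fp = [0.01, 0.05, 0.40, 0.03, 0.02]
print(tuple(round(p, 3) for p in em_proportions(fm, f0, fp)))
```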

4. Bayes rules by converting p-values to normally distributed test statistics

Let $T_i$, $i = 1, 2, \ldots, m$, be independently and identically distributed test statistics, and let $P_i = P(T_i \le t_i \,|\, H_{i0})$ be the corresponding p-values. Note that under $H_{i0}$, $P_i \sim U(0, 1)$. Let $X_i = \Phi^{-1}(P_i)$ be the corresponding z-score; then, under $H_{i0}$, $X_i \sim N(0, 1)$. Efron [11] suggested using $X_i \sim N(0, \sigma^2)$ under $H_{i0}$ with $\sigma^2$ appropriately estimated; he pointed out that, in practice, $\sigma^2$ may not be equal to 1 due to possible correlation among the multiple components. Under the alternative, we assume that $X_i \sim N(\theta_i, \sigma^2)$, where the $\theta_i$s are generated from the distribution described in Eq. (9). Admittedly, this assumption is a big leap; in practice, however, it can be tested, and when it holds it can lead to very powerful results. In [9] it is assumed that $\pi_{+}(\theta)$ is a truncated normal distribution $N(0, \sigma^2/\omega)$ and $\pi_{-}(\theta) = \pi_{+}(-\theta)$, where $\omega$ is some positive constant reflecting how inflated we believe the alternative $\theta_i$s are. It can be seen that

$$v_i^{(-)} \propto p_{-}\, T_{-}(x_i), \qquad v_i^{(+)} \propto p_{+}\, T_{+}(x_i), \qquad v_i^{(0)} \propto p_{0}, \qquad (11)$$

with proportionality constant $[p_{-} T_{-}(x_i) + p_{+} T_{+}(x_i) + p_{0}]^{-1}$. Also, $T_{-}(x_i) = T_{+}(-x_i)$, and

$$T_{+}(x_i) = \exp\left\{\frac{x_i^2}{2(1+\omega)\sigma^2}\right\} \Phi\left(\frac{x_i}{\sigma\sqrt{1+\omega}}\right). \qquad (12)$$

In order to apply the Bayes procedure as discussed in Section 3, all we need are Eqs. (11) and (12). For computation details, see [9].
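As a rough sketch of this computation (not the authors' code), the z-score transformation and $T_+$ of Eq. (12) can be implemented with standard-library tools; the inverse normal CDF below is a plain bisection and the p-value is hypothetical:

```python
# Sketch of the Section 4 pipeline: turn a p-value into a z-score
# X_i = Phi^{-1}(P_i) and evaluate T+ from Eq. (12).
# Standard-normal CDF via math.erf; inverse CDF by bisection.
import math

def phi_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def phi_inv(p):
    lo, hi = -10.0, 10.0
    for _ in range(80):                    # bisection on the increasing CDF
        mid = 0.5 * (lo + hi)
        if phi_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def t_plus(x, sigma=1.0, omega=1.0):
    """Eq. (12): exp{x^2/(2(1+omega)sigma^2)} * Phi(x/(sigma*sqrt(1+omega)))."""
    return math.exp(x * x / (2.0 * (1.0 + omega) * sigma ** 2)) * \
           phi_cdf(x / (sigma * math.sqrt(1.0 + omega)))

p_value = 0.975                            # hypothetical p-value P_i
x = phi_inv(p_value)                       # z-score, roughly 1.96
print(round(x, 3), round(t_plus(x), 3))
```

With $p_-$, $p_0$, $p_+$ estimated as in Section 3.2, the three posterior weights of Eq. (11) follow by normalizing $p_- T_-(x)$, $p_0$, and $p_+ T_+(x)$.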

4.1. Skew-normal alternatives

In the above discussion, we assumed that the $\theta_i$s are generated from the distribution with pdf (9). In [12], the case is considered in which the $\theta_i$s are generated from a skew-normal distribution under the alternative hypotheses. The skew-normal distribution was first introduced in [13]. It has the important property that if $(\xi_1, \xi_2)$ is bivariate normal with mean 0, then the conditional distribution of $\xi_1$ given $\xi_2 > 0$ is skew-normal, with pdf

$$g_{+}(\xi_1) = \frac{2}{\sigma_1}\, \varphi\left(\frac{\xi_1}{\sigma_1}\right) \Phi\left(\lambda \frac{\xi_1}{\sigma_1}\right),$$

denoted by $SN(0, \sigma_1, \lambda)$. Here, $\lambda$ is a skewness parameter; if $\lambda = 0$, this distribution is $N(0, \sigma_1^2)$. The implication of this result is the following: suppose that within a normal system an outcome follows a normal distribution, but a correlated factor starts exerting a positive effect; then the outcome variable will start following a skew-normal distribution. For example, consider RNA experiments and assume that the genes are in a normal state. Suppose a gene mutation occurs at a later stage and starts exerting a positive effect on the affected genes. In this case, based on the above property of the skew-normal distribution, we can assume that the expressions of the affected genes will follow a skew-normal distribution.

Under this formulation, we assume that $\theta_i$, $i = 1, 2, \ldots, m$, are generated from

$$\pi_{\lambda}(\theta_i) = p\, I(\theta_i = 0) + (1 - p)\, \frac{2}{\sigma_1}\, \varphi\left(\frac{\theta_i}{\sigma_1}\right) \Phi\left(\lambda \frac{\theta_i}{\sigma_1}\right).$$

Now, similar to Eq. (11), it can be seen that

$$v_i^{(-)} \propto (1 - p)\, T_{-}(x_i), \qquad v_i^{(+)} \propto (1 - p)\, T_{+}(x_i), \qquad v_i^{(0)} \propto p, \qquad (13)$$

with proportionality constant $[(1 - p)(T_{+}(x_i) + T_{-}(x_i)) + p]^{-1}$, where

$$T_{+}(x_i) = \frac{2}{\sigma_1} \int_{0}^{\infty} \exp\left\{\frac{x_i \theta}{\sigma^2}\right\} \varphi\left(\sqrt{\frac{1}{\sigma_1^2} + \frac{1}{\sigma^2}}\; \theta\right) \Phi\left(\frac{\lambda\theta}{\sigma_1}\right) d\theta,$$

and

$$T_{-}(x_i) = \frac{2}{\sigma_1} \int_{-\infty}^{0} \exp\left\{\frac{x_i \theta}{\sigma^2}\right\} \varphi\left(\sqrt{\frac{1}{\sigma_1^2} + \frac{1}{\sigma^2}}\; \theta\right) \Phi\left(\frac{\lambda\theta}{\sigma_1}\right) d\theta.$$

The sets $D_{B-}$ and $D_{B+}$ can be written as

$$D_{B-} = \{i : x_i < -c_1\} \quad \text{and} \quad D_{B+} = \{i : x_i > c_2\},$$

where $c_1 > 0$ and $c_2 > 0$ are determined, as shown in Figure 1, by the points of intersection of $y = p/(1-p)$ with $y = T_{-}(x)$ and with $y = T_{+}(x)$, respectively. Note that when $\lambda > 0$, the intersection point Q (as shown in the figure) lies to the left of $x = 0$, and when $\lambda < 0$, Q lies to the right of $x = 0$. Thus, when $\lambda > 0$, $c_1 > c_2$, and the opposite holds when $\lambda < 0$. When $\lambda = 0$, $T_{-}(x) = T_{+}(-x)$ and thus $c_1 = c_2$. If $\lambda \to \infty$, then $T_{-}(x) \to 0$ and $D_{B-}$ is an empty set, which is equivalent to a one-tailed test. As discussed in Section 3, the procedure based on Eq. (13) by itself does not control the BDFDR. However, $c_1$ and $c_2$ can be further shrunk so that the resulting procedure achieves BDFDR ≤ α; see [12] for details.
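The integrals $T_+$ and $T_-$ have no closed form, but the cutoff $c_2$ can be found numerically as the intersection of $T_+(x)$ with $y = p/(1-p)$. A sketch under illustrative parameter values (not the chapter's HIV estimates), using the trapezoidal rule on a truncated grid:

```python
# Sketch: evaluate the skew-normal T+ integral by the trapezoidal rule and
# locate c2 as the solution of T+(c2) = p/(1-p). Parameter values are
# illustrative, not the chapter's estimates.
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def t_plus_skew(x, sigma=1.0, sigma1=1.5, lam=0.2, upper=30.0, steps=3000):
    s = math.sqrt(1.0 / sigma1**2 + 1.0 / sigma**2)
    h = upper / steps
    total = 0.0
    for k in range(steps + 1):
        th = k * h
        w = 0.5 if k in (0, steps) else 1.0          # trapezoid weights
        total += w * math.exp(x * th / sigma**2) * norm_pdf(s * th) \
                   * norm_cdf(lam * th / sigma1)
    return (2.0 / sigma1) * h * total

def cutoff_c2(p):
    """T+ is increasing in x, so bisect for T+(c2) = p/(1-p)."""
    target, lo, hi = p / (1.0 - p), 0.0, 10.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if t_plus_skew(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(round(cutoff_c2(p=0.9), 3))
```

$T_-(x)$ is handled analogously by integrating over $(-\infty, 0)$, and $c_1$ comes from its intersection with the same horizontal line $y = p/(1-p)$.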
To illustrate the above procedure, and to compare it with the standard FDR procedure (BY) of [8], which selects the direction based on the sign of the test statistic, we consider the HIV data described in [14]. For a detailed analysis, see [12]; here, we describe the analysis very briefly. The data consist of eight microarrays, four from cells of HIV-infected subjects and four from uninfected subjects, each with expression levels of 7680 genes. For each gene, we obtained a two-sample t-statistic comparing the infected versus the uninfected subjects, which is then transformed to a z-value $z_i = \Phi^{-1}\{F_6(t_i)\}$. Here, $F_6(\cdot)$ denotes the cumulative distribution

Figure 1. Graph of $T_{+}(x)$ and $T_{-}(x)$ with cutoff values $-c_1$ and $c_2$ such that $T_{+}(x) \ge p/(1-p)$ and $T_{-}(x) \ge p/(1-p)$.

Figure 2. Histogram of the HIV data with cutoff points by BY and the Bayes method under skew-normal prior.

function (cdf) of the t-distribution with six degrees of freedom. Figure 2 shows the histogram of the z-values with a skew-normal fit. The null distribution of $Z_i$ should be $N(0, 1)$; however, as suggested in [11], we use $N(-0.11, 0.75^2)$ as the null distribution. Thus, we formulate our problem as testing the hypotheses (7) with test statistics $Z_i \sim N(-0.11 + \theta_i, 0.75^2)$. The BY procedure resulted in cutoffs $(-3.94, 3.94)$, giving 18 total discoveries, with two genes declared under-expressed and 16 over-expressed. For the constrained Bayes rule, we first used the EM algorithm to obtain the parameter estimates $\hat{p} = 0.9$, $\hat{\sigma} = 0.79$, $\hat{\sigma}_1 = 1.54$, and $\hat{\lambda} = 0.22$. The Bayes procedure ended up with cutoff points $(-2.82, 2.70)$ and a total of 86 discoveries (23 under-expressed genes and 63 over-expressed genes). Note that the number of discoveries by the Bayes rule is much higher than by the BY procedure.

5. Concluding remarks

There are many different methods for testing multiple hypotheses; the methodology, however, depends on the criterion we choose. When the dimension of the multiple hypotheses is not very high, the familywise error rate (FWER) is an appropriate criterion, which safeguards against making even one false discovery. However, when the dimension of the multiple hypotheses is very high, the FWER is not very useful; instead, a false discovery rate (FDR) criterion is a good approach. Although the FDR was originally defined as a frequentist concept, it can be re-interpreted in a Bayesian framework. The Bayesian framework brings many advantages: for example, a decision-theoretic formulation is easy to implement, directional hypotheses are easy to handle,

and skewness in the alternatives is easy to accommodate. A drawback is that we need to make an assumption about the prior distributions under the alternatives. Some work has been done based on nonparametric priors; however, much more work is needed.

Author details

Naveen K. Bansal
Address all correspondence to: [email protected]
Department of Mathematics, Statistics, and Computer Science, Marquette University,
Milwaukee, WI, USA

References

[1] Benjamini Y, Hochberg Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society B. 1995;57(1):289-300
[2] Holm S. A simple sequentially rejective multiple test procedure. Scandinavian Journal of
Statistics. 1979;6(2):65-70
[3] Benjamini Y, Yekutieli D. The control of the false discovery rate in multiple testing under dependency. Annals of Statistics. 2001;29(4):1165-1188

[4] Efron B, Tibshirani R, Storey JD, Tusher V. Empirical Bayes analysis of a microarray
experiment. Journal of the American Statistical Association. 2001;96(456):1151-1160

[5] Storey JD. A direct approach to false discovery rates. Journal of the Royal Statistical
Society B. 2002;64(3):479-498
[6] Storey JD. The positive false discovery rate: A Bayesian interpretation and the q value.
The Annals of Statistics. 2003;31(6):2013-2035
[7] Shaffer JP. Multiplicity, directional (Type III) errors, and the null hypothesis. Psychologi-
cal Methods. 2002;7(3):356-369
[8] Benjamini Y, Yekutieli D. False discovery rate controlling confidence intervals for selected parameters. Journal of the American Statistical Association. 2005:71-80

[9] Bansal NK, Miescke KJ. A Bayesian decision theoretic approach to directional multiple
hypotheses problems. Journal of Multivariate Analysis. 2013:205-215

[10] Bansal NK, Jiang H, Pradeep P. A Bayesian methodology for detecting targeted genes
under two related experiments. Statistics in Medicine. 2015;34(25):3362-3375
[11] Efron B. Correlation and large-scale simultaneous significance testing. Journal of the
American Statistical Association. 2007:93-103

[12] Bansal NK, Hamedani GG, Maadooliat M. Testing multiple hypotheses with skewed alter-
natives. Biometrics. 2016;72(2):494-502
[13] Azzalini A. A class of distributions which includes the normal ones. Scandinavian Jour-
nal of Statistics. 1985;12(2):171-178
[14] van't Wout AB, Lehrman GK, Mikheeva SA, O'Keeffe GC, Katze MG, Bumgarner RE,
Mullins JI. Cellular gene expression upon human immunodeficiency virus type 1 infec-
tion of CD4+-T-cell lines. Journal of Virology. 2003;77(2):1392-1402
DOI: 10.5772/intechopen.70168

Chapter 5

Bayesian vs Frequentist Power Functions to Determine the Optimal Sample Size: Testing One Sample Binomial Proportion Using Exact Methods

Valeria Sambucini

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/intechopen.70168

Abstract
In order to avoid the drawbacks of sample size determination procedures based on
classical power analysis, it is possible to define analogous criteria based on ‘hybrid
classical-Bayesian’ or ‘fully Bayesian’ approaches. We review these conditional and
predictive procedures and provide an application, when the focus is on a binomial
model and the analysis is performed through exact methods. The distinction between
analysis and design prior distributions is essential for the practical implementation of
the criteria: some guidelines for choosing these priors are discussed, and their impact on
the required sample size is examined.

Keywords: analysis and design prior distributions, binomial proportion, Bayesian


power functions, conditional and predictive approach, sample size determination,
saw-toothed behaviour of power

1. Introduction

The calculation of an adequate sample size is a crucial aspect in the design of experiments.
Researchers need to select the appropriate number of participants required to ensure ethically
and scientifically valid results. If samples are too large, time and resources are wasted, often
for minimal gain. On the other hand, too small samples may lead to inaccurate results.
Therefore, sample size determination (SSD) plays a very important role in the design aspect
of studies in many fields, especially in the context of clinical trials where, in addition to
economical problems, investigators have to deal with important ethical implications.
Sample size determination (SSD) methods, when the focus is on hypothesis testing, are typi-
cally related to the concept of power function. Let us denote the parameter of interest by θ and

© 2017 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use,
distribution, and reproduction in any medium, provided the original work is properly cited.


let us assume that we are interested in testing H0 : θ ∈ Θ0 versus H1 : θ ∈ Θ1, where Θ0 and Θ1 form a partition of the parameter space Θ. The most widely used frequentist SSD criterion consists in choosing the minimal sample size that guarantees a given power, for a fixed type I error rate, under the assumption that θ is equal to a suitable design value, θD ∈ Θ1. In practice, the idea is to ensure a sufficiently large probability of obtaining a statistically significant result (i.e. of rejecting the null hypothesis) when the true value of θ belongs to the alternative hypothesis and is equal to θD. Many textbooks (see [1–3], among others) provide sample size formulas derived using this procedure for commonly occurring situations, under different hypothesis tests and based on both categorical and quantitative data.

In the frequentist criterion described above, a crucial role is played by the design value that the trial is designed to detect with high probability, whose uncertainty is not accounted for. In fact, this local optimality is one of the most criticized aspects of the method. Moreover, the frequentist procedure does not allow the researcher to take into account pre-experimental information about θ, for instance information available from previous studies. By adopting a 'hybrid classical-Bayesian approach' or a 'fully Bayesian approach', it is possible to define analogous criteria for sample size selection that allow the researcher to avoid the problem of local optimality and/or to introduce possible prior information into the SSD process.
In this chapter, we illustrate how to construct frequentist and Bayesian power functions, based
on both conditional and predictive approaches, and how to use them to determine the optimal
sample size. An essential element of the method is the use of two different prior distributions
for the parameter of interest, which play two distinct roles in the criteria. The importance of this
distinction in sample size determination problems has been stressed by several authors (see, for
instance, [4–9] among others). The rest of the chapter is organized as follows: in Section 2, we
review both the frequentist conditional and predictive procedures based on power analysis to
determine the optimal sample size. Section 3 provides a description of analogous methods
based on Bayesian power functions. Then, in Section 4, we formalize different SSD criteria that
depend on the shape of the power curves as a function of the sample size and, as a conse-
quence, on the nature of the data distributions. Furthermore, in Section 5, we illustrate an
application of the frequentist and Bayesian SSD procedures, when the parameter of interest is
a single binomial proportion. Finally, Section 6 contains a brief final discussion.

2. Frequentist power functions and SSD methods

Let us consider a parameter of interest θ and assume that we are interested in testing H0 : θ ∈ Θ0
versus H1 : θ ∈ Θ1, where Θ0 and Θ1 form a partition of the parameter space Θ. Moreover, let
Yn be the random result of the experiment that is typically a suitable statistic used to summa-
rize the data relevant to the parameter θ. In the notation, we have highlighted that Yn depends
on the sample size n. Finally, we denote by fn(yn|θ) the sampling distribution of Yn.
The power function is defined as the probability of obtaining a statistically significant result that leads to the rejection of the null hypothesis H0, when the actual value of the parameter is θ. In a frequentist approach, the investigator is first required to specify a fixed level α for the type

I error probability that one is willing to tolerate. This significance level is typically set equal
to 0.05 and is used to obtain the rejection region of H0, denoted by RH0, that represents an
appropriate subset of outcomes that—if observed—lead to the rejection of H0. Therefore, given
a frequentist test of size α, Yn is considered a statistically significant result if it belongs to RH0 .
Consequently, in general terms, the power function is defined as

ηðn; θÞ ¼ Pθ ðYn ∈ RH0 Þ; (1)

where Pθ is the probability measure associated with a suitable distribution of Yn.


In order to exploit the frequentist power function in Eq. (1) for sample size determination
purposes, investigators can adopt two different approaches: the conditional and the predictive
one. The conditional approach is certainly the most widely known and used, when performing
sample size calculations based on pre-study power analysis. It requires the specification of a
suitable design value for θ, denoted by θD, that belongs to the alternative hypothesis and is
considered a relevant value important to detect. By assuming that the true value of the
parameter is equal to θD, we obtain the frequentist conditional power given by
$$\eta_F^C(n; \theta^D) = P_{f_n(\cdot|\theta^D)}(Y_n \in R_{H_0}), \qquad (2)$$

where $P_{f_n(\cdot|\theta^D)}$ is the probability measure associated with the sampling distribution of $Y_n$ when θ = θD. Since θD has to be selected within the subspace Θ1, the conditional frequentist power can be interpreted as the probability of correctly rejecting H0 when the true value of the parameter belongs to the alternative hypothesis and is exactly equal to θD. Then, the sample size determination criterion consists in choosing the minimal sample size that guarantees a desired level for $\eta_F^C(n; \theta^D)$. In practice, the idea is to ensure a sufficiently large probability of rejecting H0 when the true θ belongs to the alternative hypothesis and, more specifically, is equal to θD ∈ Θ1.

The SSD procedure based on the power function in Eq. (2) is strongly affected by the choice of θD. In order to account for uncertainty in the specification of the design value and to avoid local optimality, it is natural to incorporate Bayesian concepts into the sample size determination process. By adopting a 'hybrid classical-Bayesian approach', it is possible to model uncertainty on the appropriate design value for θ through the elicitation of a prior distribution, denoted by πD(θ) and called the design prior. This prior is used to compute the marginal or prior predictive distribution of the data by averaging the sampling distribution as follows:

$$m_n^D(y_n) = \int_{\Theta} f_n(y_n|\theta)\, \pi^D(\theta)\, d\theta. \qquad (3)$$

Therefore, the design prior cannot be a non-informative improper distribution, in order for $m_n^D(y_n)$ to be well defined. In any case, the elicitation of a non-informative πD(θ) would not be a reasonable choice. In fact, the design prior is used to introduce uncertainty on the suitable design value for θ that we need to specify when using the SSD procedure previously described, and the possible guessed values have to belong to the subspace Θ1. Thus, πD(θ) serves to describe a design scenario of interest that supports values of θ under the alternative hypothesis:

it has to be an informative distribution that assigns a negligible probability to values of θ


under the null hypothesis.
Once the design prior has been elicited, the idea is to average the conditional frequentist power
with respect to it by computing
ð ð "ð #
� �
ηCF ðn; θÞπD ðθÞdθ ¼ f n yn jθ dyn πD ðθÞdθ
Θ Θ RH0
ð (4)
D
� �
¼ mn yn dyn :
RH 0

This leads to the frequentist predictive power, given by

$$\eta_F^P(n; \pi^D) = P_{m_n^D(\cdot)}(Y_n \in R_{H_0}), \qquad (5)$$

where $P_{m_n^D(\cdot)}$ is the probability measure associated with the marginal distribution of $Y_n$ obtained using πD(θ). The power function in Eq. (5) expresses the probability of making a correct decision by rejecting H0 when θ actually belongs to the subspace defined under the alternative hypothesis, where we can assume that it is distributed according to the design prior. Therefore, the corresponding SSD criterion requires selecting the minimum n to achieve a desired level for $\eta_F^P(n; \pi^D)$.

Note that if πD(θ) is chosen as a point mass distribution centred on θD, no uncertainty on the relevant design values is taken into account and the marginal distribution coincides with the sampling one. In this case, there is no difference between the frequentist power functions obtained under the conditional and the predictive approach.
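To make the conditional/predictive distinction concrete, consider a case not treated in this chapter: a one-sided test H0: θ ≤ 0 on a normal mean with known unit variance, for which the conditional power has the closed form Φ(√n·θD − z_{1−α}); the predictive power (5) is then obtained by Monte Carlo averaging over a design prior, here a hypothetical N(0.5, 0.1²):

```python
# Illustration of Eqs. (2) and (5) for a one-sided normal-mean test with
# known unit variance (not the chapter's binomial example). The design
# prior N(0.5, 0.1^2) and all numbers are hypothetical.
import math
import random

Z_95 = 1.6448536269514722                  # z_{0.95}, for alpha = 0.05

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def conditional_power(n, theta):
    """eta_F^C(n; theta) = Phi(sqrt(n)*theta - z_{1-alpha})."""
    return norm_cdf(math.sqrt(n) * theta - Z_95)

def predictive_power(n, n_draws=100_000, seed=1):
    """eta_F^P(n; pi^D): average the conditional power over design-prior draws."""
    rng = random.Random(seed)
    total = sum(conditional_power(n, rng.gauss(0.5, 0.1))
                for _ in range(n_draws))
    return total / n_draws

print(round(conditional_power(25, 0.5), 3))    # power at the design value
print(round(predictive_power(25), 3))          # averaged over the prior
```

Here the predictive value falls slightly below the conditional one, since the design prior spreads mass over θ values on both sides of 0.5 in a region where the power curve is concave.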

3. Bayesian power functions and SSD methods

In the previous section, we described how to select the sample size through power functions by assuming that a frequentist analysis will be performed at the end of the study. In both the frequentist conditional and predictive powers, the decision about the two hypotheses is based on the construction of the rejection region of H0 for a classical test of fixed size α. A major limitation of the fully classical and the hybrid classical-Bayesian approaches previously introduced is the inability to incorporate past experience and information about the unknown parameter, as well as expert prior opinions. The use of a 'fully Bayesian approach' allows important knowledge and beliefs about θ to be taken into account when planning the study.

It is well known that the information available before starting the study can be expressed by introducing a prior distribution for θ, πA(θ), which in this context is typically called the analysis prior to distinguish it from the design prior. It is worth pointing out that πA(θ) is the usual prior distribution employed in a Bayesian analysis: it formalizes pre-experimental knowledge, often represented by historical data, together with subjective opinions of experts, and is used to compute the posterior distribution of the parameter, $\pi_n^A(\theta|y_n) \propto f_n(y_n|\theta)\, \pi^A(\theta)$. Moreover, it is often chosen

as a non-informative distribution to avoid the inclusion of external evidence in the posterior inference.

Let us recall that, in general terms, a power function is defined as the probability of obtaining a significant result, i.e. a result that leads to the rejection of the null hypothesis. Then, to exploit this function as a useful tool to determine the optimal sample size, we need to compute it under the assumption that the alternative hypothesis is true. In practice, we have to consider a design scenario where the true θ belongs to Θ1, so that the power function represents the probability of making a correct decision. Therefore, to define power functions from a Bayesian point of view, we first need to decide when we reject the null hypothesis in a Bayesian setting, that is, we have to establish the condition for 'Bayesian significance'. Following Spiegelhalter et al. [10], we define the result Yn as 'significant from a Bayesian perspective' if the corresponding posterior probability that θ belongs to the alternative hypothesis is sufficiently large, that is, if

$$P_{\pi_n^A(\cdot|Y_n)}(\theta \in \Theta_1) > \lambda, \qquad (6)$$

where $P_{\pi_n^A(\cdot|Y_n)}$ denotes the probability measure associated with the posterior distribution of θ computed using the analysis prior, and λ ∈ (0, 1) represents a suitably specified threshold. Let us stress that, since we are dealing with a pre-experimental problem, the posterior probability in Eq. (6) is a random variable, depending on a random result that has not yet been observed. In order to construct Bayesian power functions, we need to compute the probability of obtaining a Bayesian significant result. Similarly to the frequentist case, we can use two alternative distributions of the data, according to the approach we decide to adopt.

The conditional approach realizes the pre-experimental assumption that the alternative hypothesis is true by fixing a design value θD ∈ Θ1, which is considered relevant and important to detect. Then the sampling distribution of Yn conditional on θD, fn(·|θD), is used to compute the probability of attaining Bayesian significance. In this way, we obtain the Bayesian conditional power

$$\eta_B^C(n; \theta^D) = P_{f_n(\cdot|\theta^D)}\left( P_{\pi_n^A(\cdot|Y_n)}(\theta \in \Theta_1) > \lambda \right). \qquad (7)$$

The predictive approach, instead, aims at avoiding the problem of local optimality in the SSD procedure by introducing a design prior for θ, πD(θ), that accounts for the additional uncertainty involved in the choice of the design value θD. Then, the prior predictive distribution of Yn, $m_n^D(\cdot)$, is computed and used in place of the sampling distribution conditional on θD. This leads to the Bayesian predictive power

$$\eta_B^P(n; \pi^D) = P_{m_n^D(\cdot)}\left( P_{\pi_n^A(\cdot|Y_n)}(\theta \in \Theta_1) > \lambda \right). \qquad (8)$$

Both the power functions in Eqs. (7) and (8) express the probability of rejecting H0 under a Bayesian framework, assuming that the true θ actually belongs to H1. In fact, we assume either that θ is equal to a specific value under the alternative hypothesis (conditional approach) or that θ lies in the subspace defined under the alternative hypothesis, where we can assume that it is distributed according to the design prior (predictive approach). The sample size determination criteria, therefore, require selecting the minimal sample size that ensures a sufficiently large level for $\eta_B^C(n; \theta^D)$ or $\eta_B^P(n; \pi^D)$. Moreover, note that, when the specified design prior distribution assigns the whole probability mass to θD, the two Bayesian power functions coincide, leading to the same optimal sample size.
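As an illustration outside this chapter's binomial example, consider a one-sided test on a normal mean with known unit variance and a flat analysis prior: then θ|ȳ ~ N(ȳ, 1/n), condition (6) becomes Φ(√n·ȳ) > λ, and the Bayesian conditional power (7) has the closed form Φ(√n·θD − z_λ). A sketch (all numbers hypothetical):

```python
# Illustration of Eq. (7) outside the chapter's binomial case: one-sided
# normal-mean test, known unit variance, flat analysis prior, so that
# theta | ybar ~ N(ybar, 1/n) and condition (6) reads Phi(sqrt(n)*ybar) > lambda.
import math

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def norm_ppf(p):
    lo, hi = -10.0, 10.0
    for _ in range(80):                    # bisection on the increasing CDF
        mid = 0.5 * (lo + hi)
        if norm_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def bayesian_conditional_power(n, theta_d, lam=0.95):
    """eta_B^C(n; theta^D) = Phi(sqrt(n)*theta^D - z_lambda)."""
    return norm_cdf(math.sqrt(n) * theta_d - norm_ppf(lam))

print(round(bayesian_conditional_power(25, 0.5, lam=0.95), 3))
```

In this special case, choosing λ = 1 − α makes the Bayesian conditional power coincide with the frequentist one, since both reduce to Φ(√n·θD − z_{1−α}).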

4. SSD criteria according to the nature of the distribution of Yn

In this section, we explicitly formalize the SSD criteria based on frequentist and Bayesian power functions, according to the nature of the random result Yn. When Yn has a continuous distribution, each of the power functions previously introduced is monotonically increasing as a function of n. In this case, the SSD criteria sensibly select the minimum sample size that guarantees the desired level of power, that is,

$$n_F^C = \min\left\{ n \in \mathbb{N} : \eta_F^C(n; \theta^D) > \gamma \right\}, \qquad (9)$$
$$n_F^P = \min\left\{ n \in \mathbb{N} : \eta_F^P(n; \pi^D) > \gamma \right\}, \qquad (10)$$
$$n_B^C = \min\left\{ n \in \mathbb{N} : \eta_B^C(n; \theta^D) > \gamma \right\}, \qquad (11)$$
$$n_B^P = \min\left\{ n \in \mathbb{N} : \eta_B^P(n; \pi^D) > \gamma \right\}, \qquad (12)$$

for a conveniently chosen threshold γ ∈ (0, 1]. Let us remark that, in the notation for the optimal sample sizes as well as in the notation for the power functions, the subscripts specify the approach (frequentist or Bayesian) adopted at the analysis stage, while the superscripts indicate the approach (conditional or predictive) used to represent the design expectations. An application of the criteria formalized above is provided by Gubbiotti and De Santis [11], where it is assumed that the statistic Yn follows a normal distribution with mean θ and known variance.
       
However, it may happen that ηCF n; θD , ηPF n; πD , ηCB n; θD and ηPB n; πD are not monoton-
ically increasing functions of the sample size: this occurs when dealing with discrete distribu-
tions of Yn. In these cases, the power functions show a basically increasing behaviour as a
function of n, but with some small fluctuations. A suitable SSD criterion has to take into
account this kind of behaviour. For instance, instead of selecting the smallest sample size that
attains the condition of interest, it can be considered more appropriate to select the smallest
sample size in such a way that the condition is fulfilled also for all the sample size values
greater than it. Given a threshold γ ∈ (0, 1), the corresponding SSD criteria are
   
nCF ¼ min n� ∈ N: ηCF n; θD > γ, ∀n ≥ n� ; (13)
   
nPF ¼ min n� ∈ N: ηPF n; πD > γ, ∀n ≥ n� ; (14)
   
nCB ¼ min n� ∈ N: ηCB n; θD > γ, ∀n ≥ n� ; (15)
   
nPB ¼ min n� ∈ N: ηPB n; πD > γ, ∀n ≥ n� : (16)
Bayesian vs Frequentist Power Functions to Determine the Optimal Sample Size: Testing One Sample Binomial… 83
http://dx.doi.org/10.5772/intechopen.70168

In this way, it is possible to avoid the paradox of having the condition of interest fulfilled for the selected sample size, but no longer satisfied for some larger values of n.

5. Single binomial proportion using exact methods

In this section, we focus on exact procedures for one-sample testing problem with binary
response. For instance, in a clinical context, we could be interested in evaluating the efficacy of a
new experimental treatment or drug that is received at the same dose by all the n patients enrolled
in the trial. No comparisons with other therapies are involved. A binary response variable, which
assumes value 1 if clinicians classify the patient as a responder to the therapy and 0 otherwise, is
considered and, therefore, the parameter of interest θ is the true response rate (i.e. an unknown
proportion). In these one-arm studies, θ is compared with a fixed target value, say θ0, that should
ideally represent the response rate for the current ‘gold standard’ therapy and that is typically
obtained through historical data. Values of θ greater than θ0 suggest that the experimental drug
can be considered sufficiently effective and, therefore, the following hypotheses are considered

$$H_0 : \theta = \theta_0 \quad \text{and} \quad H_1 : \theta > \theta_0. \quad (17)$$

This kind of single-arm study is typically conducted in phase II of clinical trials, whose primary goal is not to definitively assess the efficacy of new drugs, but to screen out those that are ineffective. In practice, in the clinical development process of a new drug, phase II aims at preventing insufficiently promising treatments from reaching phase III, where randomized controlled trials, based on large patient groups, are generally conducted.
It is important to point out that the power functions based on exact procedures usually do not
have explicit forms. Hence, exact formulas for sample size calculations cannot be obtained.
However, it is possible to proceed numerically by evaluating the conditions of interest for
different increasing or decreasing values of the sample size, until reaching the optimal one. In
the following sections, we provide the expressions of the frequentist and Bayesian power
functions for non-comparative studies with binary responses. The saw-toothed shape of the
power curves as a function of n is shown and, hence, the conservative criteria illustrated in the
previous section are adopted. All the graphical and numerical results have been obtained by
using the R programming language [12].

5.1. Frequentist conditional power


In the statistical context described above, the number of responders out of the n patients
treated with the new drug (i.e. the number of successes in n trials) is the natural statistic Yn
we have to consider and its sampling distribution is
  
$$f_n(y_n \mid \theta) = \mathrm{bin}(y_n; n, \theta), \quad \text{for } y_n = 0, \ldots, n; \quad (18)$$

where $\mathrm{bin}(\cdot\,; n, \theta)$ denotes the probability mass function of a binomial distribution with parameters n and θ.

Let us consider the two hypotheses in Eq. (17). For a fixed significance level α and assuming
that H0 is true, there exists a non-negative integer r between 0 and n such that
$$\sum_{i=r}^{n} \mathrm{bin}(i; n, \theta_0) \le \alpha \quad \text{and} \quad \sum_{i=r-1}^{n} \mathrm{bin}(i; n, \theta_0) > \alpha. \quad (19)$$

Then, the rejection region at level α is $R_{H_0} = \{ y_n \in \{0, 1, \ldots, n\} : y_n \ge r \}$, where the critical value r can be expressed in symbols by

$$r = \min\left\{ k \in \{0, 1, \ldots, n\} : \sum_{i=k}^{n} \mathrm{bin}(i; n, \theta_0) \le \alpha \right\}. \quad (20)$$

For a given design value $\theta^{D}$, which has to be specified under the alternative hypothesis, the frequentist conditional power is provided by

$$\eta_{CF}(n; \theta^{D}) = P_{f_n(\cdot \mid \theta^{D})}(Y_n \in R_{H_0}) = \sum_{y_n = r}^{n} \mathrm{bin}(y_n; n, \theta^{D}). \quad (21)$$

In practice, $\eta_{CF}(n; \theta^{D})$ is obtained as the sum of the probabilities of all the outcomes that belong to $R_{H_0}$, when we assume that the true θ is equal to the design value.
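The quantities in Eqs. (20) and (21) are simple to reproduce numerically. The chapter's results were produced in R [12]; as a language-agnostic illustration, here is a minimal Python sketch using only the standard library (the function names are my own, not from the chapter):

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability mass function of a Binomial(n, p) distribution."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def critical_value(n, theta0, alpha):
    """Eq. (20): smallest k such that P(Y_n >= k | theta0) <= alpha."""
    for k in range(n + 1):
        if sum(binom_pmf(i, n, theta0) for i in range(k, n + 1)) <= alpha:
            return k
    return n + 1  # the test cannot reject at level alpha for this n

def conditional_power(n, thetaD, theta0, alpha):
    """Eq. (21): frequentist conditional power, P(Y_n >= r | thetaD)."""
    r = critical_value(n, theta0, alpha)
    return sum(binom_pmf(y, n, thetaD) for y in range(r, n + 1))
```

For n = 35, θ0 = 0.2, θD = 0.4 and α = 0.05, this reproduces the values r = 12 and power ≈ 0.8048 reported in Table 1.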

Figure 1 shows the behaviour of the frequentist conditional power as a function of n, when θ0 = 0.2, θD = 0.4 and α = 0.05. It is evident that $\eta_{CF}(n; \theta^{D})$ is not a monotonically increasing function of the sample size, because of the discrete nature of the sampling distribution of Yn.

Figure 1. Behaviour of $\eta_{CF}(n; \theta^{D})$ as a function of n, when θ0 = 0.20, θD = 0.4 and α = 0.05.

The reasons for this saw-toothed behaviour can be clarified by the numerical results presented
in Table 1. Here, for all the possible values of the sample size between 3 and 50, we provide not
only the level of the frequentist conditional power used to obtain Figure 1, but also the
corresponding critical value r and the actual value for the type I error probability. Obviously,
this latter value is always below the fixed threshold 0.05. Note that whenever the sample size is increased by one unit, the corresponding critical value r may either increase or remain constant. In the second case, both the actual type I error rate and the frequentist conditional power increase; otherwise, if the critical value also grows by one unit, they both get smaller.
To help in reading the table, the colours white and grey are used alternately to highlight blocks

   
 n    r   η_CF(n; θ^D)   Actual type I        n    r   η_CF(n; θ^D)   Actual type I
                         error rate                                   error rate

 3    3     0.0640         0.0080            27   10     0.6913         0.0304
 4    3     0.1792         0.0272            28   10     0.7412         0.0391
 5    4     0.0870         0.0067            29   10     0.7853         0.0493
 6    4     0.1792         0.0170            30   11     0.7085         0.0256
 7    4     0.2898         0.0333            31   11     0.7546         0.0327
 8    5     0.1737         0.0104            32   11     0.7954         0.0411
 9    5     0.2666         0.0196            33   12     0.7242         0.0216
10    5     0.3669         0.0328            34   12     0.7669         0.0274
11    6     0.2465         0.0117            35   12     0.8048         0.0344
12    6     0.3348         0.0194            36   12     0.8380         0.0424
13    6     0.4256         0.0300            37   13     0.7783         0.0231
14    6     0.5141         0.0439            38   13     0.8136         0.0288
15    7     0.3902         0.0181            39   13     0.8446         0.0355
16    7     0.4728         0.0267            40   13     0.8715         0.0432
17    7     0.5522         0.0377            41   14     0.8219         0.0242
18    8     0.4366         0.0163            42   14     0.8509         0.0298
19    8     0.5122         0.0233            43   14     0.8762         0.0362
20    8     0.5841         0.0321            44   14     0.8979         0.0436
21    8     0.6505         0.0431            45   15     0.8570         0.0250
22    9     0.5460         0.0201            46   15     0.8807         0.0304
23    9     0.6116         0.0273            47   15     0.9012         0.0366
24    9     0.6721         0.0362            48   15     0.9187         0.0437
25    9     0.7265         0.0468            49   16     0.8851         0.0256
26   10     0.6358         0.0232            50   16     0.9045         0.0308

Table 1. Numerical calculations related to Figure 1: sample sizes, corresponding critical values, frequentist conditional
power and actual values for the type I error rate, when θ0 = 0.20, θD = 0.4 and α = 0.05.

of sample sizes with the same critical value: within each block, both the power and the actual type I error rate monotonically rise as n increases. But, in correspondence with the first sample size of the subsequent block, they both decrease. This determines the basically increasing behaviour of the power as a function of n, with some small fluctuations, which is represented in Figure 1. For additional discussion about the saw-toothed shape of the frequentist power function, the reader is referred to Chernick and Liu [13].
Now, the problem of which sample size we should select arises because of the non-monotonic behaviour of $\eta_{CF}(n; \theta^{D})$. If we set the desired threshold γ for the power equal to 0.8, we have that the smallest sample size that meets the power requirement is n = 35. At that sample size, the critical value is 12 and the power level is 0.8048. Then for n = 36, the critical value is still 12 and the power increases to 0.8380. However, the power drops below 0.8 to 0.7783 when n = 37, at which r = 13, and rises again over 0.8 when n = 38. Then $\eta_{CF}(n; \theta^{D})$ never decreases below 0.8 for sample sizes greater than 38. Therefore, instead of selecting the smallest n that attains the power condition, it can be more appropriate to consider the more conservative sample size criterion formalized in Section 4, according to which the optimal sample size is selected as

$$n_{CF} = \min\left\{ n^{*} \in \mathbb{N} : \eta_{CF}(n; \theta^{D}) > \gamma,\ \forall n \ge n^{*} \right\}. \quad (22)$$

The criterion ensures that the power will not decrease below the desired threshold for any
larger sample size: in our specific case, it consists in selecting n = 38, instead of n = 35.

5.2. Frequentist predictive power

In order to model uncertainty in the specification of the design value, we need to adopt the
hybrid classical-Bayesian approach described previously. We introduce a beta design prior
density for θ, πD(θ) = beta(θ; αD, βD), that is used to obtain the prior predictive distribution of
the data. It is well known that by averaging the binomial sampling fn(yn|θ) with respect to the
beta design prior, we obtain the following marginal distribution
   
$$m_n^{D}(y_n) = \text{beta-bin}(y_n; \alpha^{D}, \beta^{D}, n), \quad \text{for } y_n = 0, \ldots, n; \quad (23)$$

where $\text{beta-bin}(\cdot\,; \alpha^{D}, \beta^{D}, n)$ denotes the probability mass function of a beta-binomial distribution with parameters $(\alpha^{D}, \beta^{D}, n)$.
The design prior πD(θ) can be elicited in many different ways. One useful possibility consists in
(i) setting the prior mode equal to the fixed design value θD, which investigators would choose
within the subset under H1 when using the conditional approach, and (ii) regulating the concen-
tration of the distribution around its mode according to the degree of uncertainty one wishes to
express. This can be done by using for the hyperparameters of πD(θ) the following expressions:
 
$$\alpha^{D} = n^{D}\theta^{D} + 1 \quad \text{and} \quad \beta^{D} = n^{D}(1 - \theta^{D}) + 1; \quad (24)$$

where θD is the prior mode and nD is a design parameter that can be interpreted as prior sample
size. The larger the nD, the smaller the variance of the beta design prior. Therefore, we need to

increase $n^{D}$ if we want to reduce uncertainty on the guessed values of θ. More specifically, if we set $n^{D} = \infty$, the design prior of θ assigns all the probability mass to $\theta^{D}$: in this case, no uncertainty is involved and the marginal distribution of the data coincides with the sampling distribution conditional on $\theta^{D}$. We thus must set $n^{D} < \infty$ to distinguish between conditional and predictive approaches. In particular, once a prior mode $\theta^{D}$ has been selected, the researcher can choose $n^{D}$ by assuring a large level (say very close to 1) for $P_{\pi^{D}(\cdot)}(\theta > \theta_0)$, that is the probability assigned by $\pi^{D}(\theta)$ to the event $\theta > \theta_0$. Let us assume, for instance, that θ0 = 0.2 and consider three possible choices for $\theta^{D}$ (i.e. 0.3, 0.4 and 0.5). For each of them, we compute the smallest $n^{D}$ such that $P_{\pi^{D}(\cdot)}(\theta > \theta_0)$ is about equal to 0.999, and the behaviour of the corresponding design priors is shown in Figure 2(a). Clearly, if the prior mode approaches θ0, we need to increase $n^{D}$ to guarantee that $P_{\pi^{D}(\cdot)}(\theta > \theta_0) \simeq 0.999$. Moreover, for a fixed prior mode $\theta^{D}$, if we decided to decrease the value of $n^{D}$ with respect to the one used in the graph, $P_{\pi^{D}(\cdot)}(\theta > \theta_0)$ would decrease. In fact, $n^{D}$ has been specified in order to express the minimum degree of prior enthusiasm about the efficacy of the treatment necessary to have the prior probability that θ exceeds the target θ0 at least equal to the chosen level 0.999. An alternative way of proceeding consists in choosing $n^{D}$ by ensuring a fixed level for the prior probability assigned to a symmetrical interval around the prior mode. For instance, if we set $\theta^{D} = 0.4$, we can find that 255, 111 and 60 are the values of $n^{D}$ such that the probability that $\pi^{D}(\theta)$ assigns to the intervals (0.3, 0.5), (0.25, 0.55) and (0.2, 0.6), respectively, is about equal to 0.999. The corresponding design prior distributions are shown in Figure 2(b). It is important to point out that all the design densities, represented in both graphs of Figure 2, express uncertainty about the design value that is worth taking into account when applying the SSD criteria based on power analysis. Thus, all the distributions assign a negligible probability to values of θ smaller than θ0, which are those values specified under H0.
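The probability $P_{\pi^{D}(\cdot)}(\theta > \theta_0)$ used in this elicitation is a beta survival probability; in R it is simply `1 - pbeta(theta0, aD, bD)`. A rough self-contained Python sketch (the crude midpoint-rule integrator below stands in for a library routine, and the helper names are my own):

```python
from math import lgamma, log, exp

def beta_prob_gt(theta0, a, b, m=20000):
    """P(theta > theta0) under a Beta(a, b) density, midpoint rule on [theta0, 1]."""
    ln_B = lgamma(a) + lgamma(b) - lgamma(a + b)   # log of the beta function B(a, b)
    h = (1.0 - theta0) / m
    total = 0.0
    for i in range(m):
        t = theta0 + (i + 0.5) * h
        total += exp((a - 1) * log(t) + (b - 1) * log(1.0 - t) - ln_B)
    return total * h

def design_hyperparameters(thetaD, nD):
    """Eq. (24): prior mode thetaD and prior sample size nD of the beta design prior."""
    return nD * thetaD + 1, nD * (1 - thetaD) + 1
```

For example, each of the three priors of Figure 2(a), i.e. (θD, nD) = (0.3, 163), (0.4, 43) and (0.5, 20), yields a probability very close to 0.999 of the event θ > 0.2, and for a fixed mode the probability grows with nD.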

Figure 2. Possible choices of the design prior distribution, when θ0 = 0.2.



Once $\pi^{D}(\theta)$ has been specified, the frequentist predictive power can be obtained by computing the probability of rejecting the null hypothesis at level α with respect to $m_n^{D}(y_n)$. Hence, we have

$$\eta_{PF}(n; \pi^{D}) = P_{m_n^{D}(\cdot)}(Y_n \in R_{H_0}) = \sum_{y_n = r}^{n} \text{beta-bin}(y_n; \alpha^{D}, \beta^{D}, n); \quad (25)$$

where r is the critical value provided in Eq. (20). In practice, $\eta_{PF}(n; \pi^{D})$ is given by the sum of the probabilities of all the outcomes inside $R_{H_0}$, computed under a design scenario according to which the true θ belongs to the interval (θ0, 1), where it is distributed according to the design prior density. Let us remark again that if the design prior is a point-mass distribution on θD (i.e. nD = ∞), the conditional and predictive frequentist power functions coincide.
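The beta-binomial sum of Eq. (25) is easy to evaluate once the pmf is written with log-gamma functions for numerical stability. A hypothetical Python sketch (helper names are my own assumptions, not from the chapter):

```python
from math import comb, lgamma, exp

def ln_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def betabin_pmf(k, n, a, b):
    """Beta-binomial pmf: C(n, k) * B(k + a, n - k + b) / B(a, b)."""
    return comb(n, k) * exp(ln_beta(k + a, n - k + b) - ln_beta(a, b))

def critical_value(n, theta0, alpha):
    """Eq. (20): smallest k with P(Y_n >= k | theta0) <= alpha."""
    pmf = lambda i: comb(n, i) * theta0**i * (1 - theta0)**(n - i)
    return min(k for k in range(n + 1)
               if sum(pmf(i) for i in range(k, n + 1)) <= alpha)

def predictive_power(n, thetaD, nD, theta0, alpha):
    """Eq. (25): frequentist predictive power under the design prior of Eq. (24)."""
    a, b = nD * thetaD + 1, nD * (1 - thetaD) + 1
    r = critical_value(n, theta0, alpha)
    return sum(betabin_pmf(y, n, a, b) for y in range(r, n + 1))
```

As a check, with θ0 = 0.2, α = 0.05, θD = 0.4 and nD = 60, the power at n = 46 exceeds γ = 0.8, consistent with the optimal value reported below for Figure 3; a design prior with very large nD essentially recovers the conditional power.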
Similarly to the frequentist conditional power, the predictive one also presents a saw-toothed shape as a function of n, since $m_n^{D}(y_n)$ is a discrete distribution. Therefore, we suggest adopting the conservative approach previously described and selecting

$$n_{PF} = \min\left\{ n^{*} \in \mathbb{N} : \eta_{PF}(n; \pi^{D}) > \gamma,\ \forall n \ge n^{*} \right\}; \quad (26)$$

for a fixed desired threshold γ. Figure 3 shows the behaviour of the frequentist predictive
power as a function of n for different choices of the design prior, when θ0 = 0.2 and α = 0.05.
More specifically, we consider the three πD(θ) plotted in Figure 2(b) that are all centred on
θD = 0.4, but with different degrees of concentrations regulated by the nD value. In each graph,
we highlight which is the optimal sample size obtained according to the criterion in Eq. (26)
when γ = 0.8. Note that the larger the nD, the smaller the degree of uncertainty we introduce
through the design prior and, as a consequence, the smaller the optimal sample size. In fact, we
obtain the optimal values 46, 42 and 39, for nD equal to 60, 111 and 255, respectively. If we set
nD = ∞, we would retrieve the conditional criterion in Eq. (22), where no uncertainty is
considered in specifying the design value, and the optimal n would be equal to 38 (see

Figure 3. Behaviour of $\eta_{PF}(n; \pi^{D})$ as a function of n for different choices of the design prior distribution, when θ0 = 0.2 and α = 0.05.

Figure 1). Moreover, let us fix again θ0 = 0.2, α = 0.05 and γ = 0.8 and consider the three design
prior distributions in Figure 2(a), which are characterized by different prior modes. The
evident difference between the prior scenarios represented by these design priors clearly
affects the optimal sample size: we obtain the optimal values 157, 46 and 23, for (θD, nD) =
(0.3, 163), (θD, nD) = (0.4, 43) and (θD, nD) = (0.5, 20), respectively.

5.3. Bayesian conditional power

When we decide to adopt a Bayesian approach to establish the statistical significance of the
result, we need to introduce an analysis prior distribution for θ. In our specific case, it is
computationally convenient to specify a beta analysis prior, πA(θ) = beta(θ; αA, βA): in this
way, from conjugate analysis we obtain that the corresponding posterior distribution is still a
beta density with updated parameters,
    
$$\pi_n^{A}(\theta \mid y_n) = \mathrm{beta}(\theta; \alpha^{A} + y_n, \beta^{A} + n - y_n). \quad (27)$$

Through πA(θ), the researcher can incorporate in the SSD procedure pre-experimental knowl-
edge, as well as sceptical or enthusiastic expert prior opinions about the efficacy of the
experimental treatment. However, one of the most common ways of proceeding is to choose a non-informative density, or one based on very weak information, to let the posterior distribution be based almost entirely on the evidence in the data. We could, therefore, specify πA(θ) = beta(θ; 1, 1) or consider the non-informative Jeffreys prior. Alternatively, if we want to use informative analysis prior distributions, we can express the hyperparameters in terms of the prior mode θA and the prior sample size nA, that is

$$\alpha^{A} = n^{A}\theta^{A} + 1 \quad \text{and} \quad \beta^{A} = n^{A}(1 - \theta^{A}) + 1. \quad (28)$$

In this way, for instance, it is possible to express scepticism or optimism about large treatment effects by setting θA smaller or larger than the target θ0, respectively. Obviously, when θA < θ0, the larger the nA, the larger the degree of scepticism we wish to express; while, when θA > θ0, larger values of nA are used to increase the degree of enthusiasm we desire to take into account. However, the value nA = 1 is often used to obtain a weakly informative prior distribution. The upper panel of Figure 4 shows three possible choices for the analysis prior when θ0 = 0.2. These distributions are obtained by fixing the prior mode θA and then selecting nA so that $P_{\pi^{A}(\cdot)}(\theta > \theta_0)$ (i.e. the probability assigned by πA(θ) to the event θ > θ0) is about equal to a desired level. More specifically, we have considered (i) a sceptical prior mode θA = 0.1 with $P_{\pi^{A}(\cdot)}(\theta > \theta_0) \simeq 0.4$, (ii) a neutral prior mode θA = 0.2 with $P_{\pi^{A}(\cdot)}(\theta > \theta_0) \simeq 0.6$ and, finally, (iii) an enthusiastic prior mode θA = 0.3 with $P_{\pi^{A}(\cdot)}(\theta > \theta_0) \simeq 0.8$. The corresponding values of nA are 7, 14 and 4, respectively. These densities will be used to illustrate how the optimal sample sizes based on Bayesian powers are affected by the information formalized through the analysis priors.

The random result Yn is defined as ‘significant’ from a Bayesian perspective, if the corres-
ponding posterior probability that θ > θ0 is sufficiently large. In symbols, we decide to reject
the null hypothesis, on the basis of the result Yn, if the following condition is satisfied.

Figure 4. Upper panel: possible choices of the analysis prior distribution, when θ0 = 0.2. Lower panel: behaviour of $\eta_{CB}(n; \theta^{D})$ as a function of n for each of the analysis prior distributions represented in the upper panel, when θ0 = 0.2, θD = 0.4 and λ = 0.9.

$$P_{\pi_n^{A}(\cdot \mid Y_n)}(\theta > \theta_0) > \lambda; \quad (29)$$

where $P_{\pi_n^{A}(\cdot \mid Y_n)}$ is the probability measure associated with the posterior distribution in Eq. (27) and λ ∈ (0, 1) is a pre-specified threshold. It is worth noting that, for a given value of n, the posterior quantity $P_{\pi_n^{A}(\cdot \mid Y_n)}(\theta > \theta_0)$ is an increasing function of Yn. As a consequence, we can find a non-negative integer $\tilde{r}$ between 0 and n, such that

$$P_{\pi_n^{A}(\cdot \mid \tilde{r})}(\theta > \theta_0) > \lambda \quad \text{and} \quad P_{\pi_n^{A}(\cdot \mid \tilde{r}-1)}(\theta > \theta_0) \le \lambda; \quad (30)$$

and we can claim that H0 is rejected if the observed number of responders yn is equal to or greater than $\tilde{r}$. In practice, $\tilde{r}$ represents the smallest number of successes such that the condition for Bayesian significance is satisfied, and in symbols it can be expressed by

$$\tilde{r} = \min\left\{ k \in \{0, 1, \ldots, n\} : P_{\pi_n^{A}(\cdot \mid k)}(\theta > \theta_0) > \lambda \right\}. \quad (31)$$

By considering a fixed design value θD greater than θ0, the Bayesian conditional power is
therefore obtained as

$$\eta_{CB}(n; \theta^{D}) = P_{f_n(\cdot \mid \theta^{D})}\left( P_{\pi_n^{A}(\cdot \mid Y_n)}(\theta > \theta_0) > \lambda \right) = \sum_{y_n = \tilde{r}}^{n} \mathrm{bin}(y_n; n, \theta^{D}). \quad (32)$$

Essentially, it is given by the sum of the probabilities of all the Bayesian significant results,
computed assuming that the true θ is equal to θD.
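The threshold $\tilde{r}$ of Eq. (31) can be found by scanning k and evaluating the posterior probability that θ > θ0, which is a beta survival value (in R one would call `pbeta`; the sketch below uses a simple midpoint-rule integral instead, and all helper names are my own). For a uniform beta(θ; 1, 1) analysis prior, this posterior probability has the well-known closed form P(Bin(n + 1, θ0) ≤ k), which provides a convenient check.

```python
from math import comb, lgamma, log, exp

def beta_prob_gt(theta0, a, b, m=20000):
    """P(theta > theta0) under Beta(a, b), midpoint rule on [theta0, 1]."""
    ln_B = lgamma(a) + lgamma(b) - lgamma(a + b)
    h = (1.0 - theta0) / m
    total = 0.0
    for i in range(m):
        t = theta0 + (i + 0.5) * h
        total += exp((a - 1) * log(t) + (b - 1) * log(1.0 - t) - ln_B)
    return total * h

def r_tilde(n, theta0, lam, aA, bA):
    """Eq. (31): smallest k whose posterior Beta(aA + k, bA + n - k) assigns
    probability greater than lam to the event theta > theta0."""
    for k in range(n + 1):
        if beta_prob_gt(theta0, aA + k, bA + n - k) > lam:
            return k
    return None  # no outcome would be declared Bayesian-significant

def bayesian_conditional_power(n, thetaD, theta0, lam, aA, bA):
    """Eq. (32): probability of a Bayesian-significant result when theta = thetaD."""
    rt = r_tilde(n, theta0, lam, aA, bA)
    if rt is None:
        return 0.0
    return sum(comb(n, y) * thetaD**y * (1 - thetaD)**(n - y)
               for y in range(rt, n + 1))
```

For instance, with the uniform analysis prior, n = 10, θ0 = 0.2 and λ = 0.9, the scan gives r̃ = 4.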
Since we are dealing with discrete data, this power function is also not monotonically increasing as a function of n. Let us assume that θ0 = 0.20, θD = 0.4 and λ = 0.9. The detailed calculations shown in Table 2 can help to understand why $\eta_{CB}(n; \theta^{D})$ has the typical saw-toothed behaviour. For each sample size between 3 and 50, the table provides the corresponding value of $\tilde{r}$, the level of the Bayesian conditional power and the posterior probability that θ exceeds θ0 conditional on the result $\tilde{r}$. Clearly, these latter values are always larger than the threshold λ, that is 0.9. The white and grey colours are used alternately to highlight blocks of sample sizes associated with the same value of $\tilde{r}$. When the sample size grows but $\tilde{r}$ remains constant, $P_{\pi_n^{A}(\cdot \mid \tilde{r})}(\theta > \theta_0)$ decreases, while $\eta_{CB}(n; \theta^{D})$ increases. However, when both n and $\tilde{r}$ are simultaneously increased by one unit, $P_{\pi_n^{A}(\cdot \mid \tilde{r})}(\theta > \theta_0)$ jumps up, while the Bayesian power drops.

Because of the saw-toothed nature of the power curve, for a fixed threshold γ, the optimal sample size is selected using the conservative criterion, that is

$$n_{CB} = \min\left\{ n^{*} \in \mathbb{N} : \eta_{CB}(n; \theta^{D}) > \gamma,\ \forall n \ge n^{*} \right\}. \quad (33)$$

The lower panel of Figure 4 shows the behaviour of the Bayesian conditional power as a function of n for each of the three analysis prior densities plotted in the upper panel, when θ0 = 0.2, θD = 0.4 and λ = 0.9. Each graph indicates the optimal sample size according to the criterion in Eq. (33) for γ = 0.8. As expected, as we move from sceptical prior opinions towards more enthusiastic beliefs about the efficacy of the experimental treatment, the required sample size decreases.

5.4. Bayesian predictive power

Besides introducing pre-experimental information, if we also wish to model uncertainty on the design value, we have to consider the Bayesian predictive power. Therefore, as described in Section 5.3, we elicit an analysis prior distribution to obtain the beta posterior density $\pi_n^{A}(\theta \mid y_n)$. Moreover, following the indications provided in Section 5.2, we introduce a design prior distribution to construct the marginal distribution $m_n^{D}(y_n)$.

The Bayesian predictive power is computed by adding the probabilities of all the Bayesian
significant results, computed under the design scenario expressed through the design prior.
Thus, we have
$$\eta_{PB}(n; \pi^{D}) = P_{m_n^{D}(\cdot)}\left( P_{\pi_n^{A}(\cdot \mid Y_n)}(\theta > \theta_0) > \lambda \right) = \sum_{y_n = \tilde{r}}^{n} \text{beta-bin}(y_n; \alpha^{D}, \beta^{D}, n); \quad (34)$$

 n   r̃   η_CB(n; θ^D)   P_{π_n^A(·|r̃)}(θ > θ0)      n   r̃   η_CB(n; θ^D)   P_{π_n^A(·|r̃)}(θ > θ0)

 3   3     0.0640            0.9263                 27   9     0.8161            0.9077
 4   4     0.0256            0.9703                 28  10     0.7412            0.9464
 5   4     0.0870            0.9558                 29  10     0.7853            0.9354
 6   4     0.1792            0.9377                 30  10     0.8237            0.9230
 7   4     0.2898            0.9159                 31  10     0.8566            0.9092
 8   5     0.1737            0.9618                 32  11     0.7954            0.9460
 9   5     0.2666            0.9476                 33  11     0.8310            0.9356
10   5     0.3669            0.9304                 34  11     0.8617            0.9239
11   5     0.4672            0.9102                 35  11     0.8877            0.9110
12   6     0.3348            0.9559                 36  12     0.8380            0.9460
13   6     0.4256            0.9422                 37  12     0.8667            0.9362
14   6     0.5141            0.9260                 38  12     0.8911            0.9252
15   6     0.5968            0.9075                 39  12     0.9118            0.9131
16   7     0.4728            0.9518                 40  13     0.8715            0.9464
17   7     0.5522            0.9388                 41  13     0.8945            0.9371
18   7     0.6257            0.9237                 42  13     0.9140            0.9267
19   7     0.6919            0.9065                 43  13     0.9305            0.9153
20   8     0.5841            0.9491                 44  13     0.9441            0.9028
21   8     0.6505            0.9367                 45  14     0.9164            0.9381
22   8     0.7102            0.9226                 46  14     0.9320            0.9284
23   8     0.7627            0.9067                 47  14     0.9450            0.9176
24   9     0.6721            0.9474                 48  14     0.9558            0.9059
25   9     0.7265            0.9357                 49  15     0.9336            0.9394
26   9     0.7745            0.9225                 50  15     0.9460            0.9301

Table 2. Numerical calculations to explain the saw-toothed behaviour of $\eta_{CB}(n; \theta^{D})$ as a function of n: sample sizes, the corresponding value of $\tilde{r}$, the Bayesian conditional power and the posterior probability that θ > θ0 when the observed result is equal to $\tilde{r}$ successes, for θ0 = 0.20, θD = 0.4 and λ = 0.9.
where $\tilde{r}$ is given in Eq. (31). Obviously, $\eta_{PB}(n; \pi^{D})$ also shows the typical saw-toothed behaviour as a function of n, because of the discrete nature of the beta-binomial marginal distribution of yn. Therefore, given a desired threshold γ and according to the suitable conservative approach previously used, we select the optimal sample size as

$$n_{PB} = \min\left\{ n^{*} \in \mathbb{N} : \eta_{PB}(n; \pi^{D}) > \gamma,\ \forall n \ge n^{*} \right\}. \quad (35)$$

θ^D    n^D    θ^A = 0.1 (n^A = 7)    θ^A = 0.2 (n^A = 14)    θ^A = 0.3 (n^A = 4)

(a) Design prior distributions in Figure 2(a)
0.3    163           120                    109                     94
0.4     43            37                     31                     22
0.5     20            21                     18                     11

(b) Design prior distributions in Figure 2(b)
0.4     60            37                     31                     22
0.4    111            33                     31                     22
0.4    255            33                     27                     22

Table 3. $n_{PB}$ for different choices of the analysis and the design priors, when θ0 = 0.2 and λ = 0.9.

In Table 3, we provide the values of $n_{PB}$ for different choices of the analysis and the design prior densities. More specifically, we consider the three analysis priors plotted in the upper panel of Figure 4 and the design prior distributions represented in both panels of Figure 2, when θ0 = 0.2 and λ = 0.9. Similarly to what we have seen for the Bayesian conditional power, the sample sizes obtained under the sceptical analysis prior are uniformly larger than those obtained under the more enthusiastic distributions. As regards the impact of the design priors, it is straightforward to see that the stronger the degree of uncertainty on the appropriate design value expressed by πD(θ), the larger the required sample size. For instance, for a fixed prior mode of the design prior, $n_{PB}$ increases as nD gets smaller (see Table 3(b), where θD = 0.4). However, let us note that more evident changes in the sample size can be appreciated when we compare the effects of design priors based on different prior modes (see the results in Table 3(a), where the design priors represent very distant design scenarios).

These Bayesian predictive SSD procedures, which include the conditional ones as a special case, have been exploited in Ref. [8] to construct single-arm two-stage designs for phase II clinical trials based on binary data. In Ref. [14], instead, an extension to the randomized case has been presented, while in Ref. [15] the same procedures have been implemented with the added possibility of taking into account uncertainty in the historical response rate.
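Putting the pieces together, Eq. (34) differs from the conditional version only in replacing the binomial sum with a beta-binomial one. A self-contained Python sketch (hypothetical helper names; posterior probabilities are computed by a simple midpoint rule in place of a library beta cdf):

```python
from math import comb, lgamma, log, exp

def ln_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_prob_gt(theta0, a, b, m=20000):
    """P(theta > theta0) under Beta(a, b), by the midpoint rule."""
    lb = ln_beta(a, b)
    h = (1.0 - theta0) / m
    return sum(exp((a - 1) * log(theta0 + (i + 0.5) * h)
                   + (b - 1) * log(1.0 - theta0 - (i + 0.5) * h) - lb)
               for i in range(m)) * h

def betabin_pmf(k, n, a, b):
    """Beta-binomial pmf with parameters (a, b, n)."""
    return comb(n, k) * exp(ln_beta(k + a, n - k + b) - ln_beta(a, b))

def bayesian_predictive_power(n, theta0, lam, aA, bA, aD, bD):
    """Eq. (34): beta-binomial probability of all Bayesian-significant results."""
    # r-tilde of Eq. (31): smallest number of successes declared significant
    rt = next((k for k in range(n + 1)
               if beta_prob_gt(theta0, aA + k, bA + n - k) > lam), None)
    if rt is None:
        return 0.0
    return sum(betabin_pmf(y, n, aD, bD) for y in range(rt, n + 1))
```

As nD grows, the design prior collapses onto θD, the beta-binomial marginal approaches the binomial sampling distribution, and this quantity approaches the Bayesian conditional power, as remarked in the text.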

6. Conclusions

Especially in clinical research, the pre-experimental power analysis is one of the most commonly
used methods for sample size calculations. It is tacitly implied that the power function is
constructed under a frequentist framework. However, it is possible to introduce Bayesian con-
cepts in the power analysis to provide more flexibility to the sample size determination process.

When the power function is used as a tool to obtain the appropriate sample size, the general
idea is to ensure a large probability of correctly rejecting the null hypothesis H0, when it is
actually false because the true θ belongs to H1. Therefore, the conjecture that the alternative
hypothesis is true represents an essential element of the method. It can be realized by assum-
ing that the true θ is equal to a fixed design value θD, suitably selected inside H1 (conditional
approach); alternatively, we can introduce uncertainty on the guessed design value by intro-
ducing a design prior distribution that assigns negligible probability to values of θ under H0
(predictive approach). Moreover, the decision about the rejection of H0 can be made under a
frequentist framework or by performing a Bayesian analysis. In the latter case, it is possible to
incorporate in the methodology pre-experimental information possibly available through the
specification of an analysis prior distribution. By combining frequentist and Bayesian pro-
cedures of analysis, with both the conditional and predictive approaches, we obtain the four
power functions described in this chapter. Let us remark that the Bayesian predictive power is the one that adds the most flexibility to the sample size calculations. At the same time, it lets the researcher take into account prior knowledge, as well as uncertainty on the design value. However, no design uncertainty is involved if a point-mass design distribution is considered. On the other hand, if no information is available, it is possible to elicit a non-informative analysis prior and let the analysis be based entirely on the data.

Author details

Valeria Sambucini

Address all correspondence to: [email protected]


Department of Statistical Sciences, Sapienza Università di Roma, Roma, Italy

References

[1] Ryan TP. Sample Size Determination and Power. Haboken: Wiley; 2013
[2] Chow SC, Wang H, Shao J. Sample Size Calculations in Clinical Research. 2nd ed. Boca
Raton: Chapman and Hall/CRC; 2008
[3] Julious SA. Sample Sizes for Clinical Trials. Boca Raton: Chapman and Hall/CRC; 2010.

[4] Wang F, Gelfand AE. A simulation-based approach to Bayesian sample size determina-
tion for performance under a given model and for separating models. Statistical Science.
2002;17(2):193-208. DOI: 10.1214/ss/1030550861

[5] De Santis F. Sample size determination for robust Bayesian analysis. Journal of the Amer-
ican Statistical Association. 2006;101(473):278-291. DOI: 10.1198/016214505000000510
[6] Sahu SK, Smith TMF. A Bayesian method of sample size determination with practical
applications. Journal of the Royal Statistical Society: Series A. 2006;169:235-253. DOI:
10.1111/j.1467-985X.2006.00408.x

[7] Brutti P, De Santis F, Gubbiotti S. Robust Bayesian sample size determination in clinical
trials. Statistics in Medicine. 2008;27(13):2290-2306. DOI: 10.1002/sim.3175
[8] Sambucini V. A Bayesian predictive two-stage design for phase II clinical trials. Statistics
in Medicine. 2008;27(8):1199-1224. DOI: 10.1002/sim.3021
[9] Sambucini V. A Bayesian predictive strategy for an adaptive two-stage design in phase II
clinical trials. Statistics in Medicine. 2010;29(13):1430-1442. DOI: 10.1002/sim.3800

[10] Spiegelhalter DJ, Abrams KR, Myles JP. Bayesian Approaches to Clinical Trials and
Health-Care Evaluation. New York: Wiley; 2004

[11] Gubbiotti S, De Santis F. Classical and Bayesian power functions: Their use in clinical
trials. Biomedical Statistics and Clinical Epidemiology. 2008;2(3):201-211. DOI: 10.1198/
016214505000000510

[12] R Core Team. R: A Language and Environment for Statistical Computing, R Foundation
for Statistical Computing, Vienna, Austria. 2016. Available from: http://www.R-project.org
[13] Chernick MR, Liu CY. The saw-toothed behavior of power versus sample size and
software solutions: Single binomial proportion using exact methods. The American Stat-
istician. 2002;56(2):149-155. DOI: 10.1198/000313002317572835
[14] Cellamare M, Sambucini V. A randomized two-stage design for phase II clinical trials
based on a Bayesian predictive approach. Statistics in Medicine. 2015;34(6):1059-1078.
DOI: 10.1002/sim.6396
[15] Matano F, Sambucini V. Accounting for uncertainty in the historical response rate of the
standard treatment in single-arm two-stage designs based on Bayesian power functions.
Pharmaceutical Statistics. 2016;15(6):517-530. DOI: 10.1002/pst.1788
DOI: 10.5772/intechopen.70057
Chapter 6

Converting Graphic Relationships into Conditional Probabilities in Bayesian Network

Loc Nguyen

Additional information is available at the end of the chapter


http://dx.doi.org/10.5772/intechopen.70057

Abstract
Bayesian network (BN) is a powerful mathematical tool for prediction and diagnosis applications. A large Bayesian network can be composed of many simple networks, which in turn are constructed from simple graphs. A simple graph consists of one child node and
many parent nodes. The strength of each relationship between a child node and a parent
node is quantified by a weight and all relationships share the same semantics such as
prerequisite, diagnostic, and aggregation. The research focuses on converting graphic
relationships into conditional probabilities in order to construct a simple Bayesian net-
work from a graph. Diagnostic relationship is the main research object, in which sufficient
diagnostic proposition is proposed for validating diagnostic relationship. Relationship
conversion is adhered to logic gates such as AND, OR, and XOR, which are essential
features of the research.

Keywords: diagnostic relationship, Bayesian network, transformation coefficient

1. Introduction

Bayesian network (BN) is a directed acyclic graph (DAG) consisting of a set of nodes and a set of arcs. Each node is a random variable. Each arc represents a relationship between two nodes.
The strength of a relationship in a graph can be quantified by a number called weight. There are
some important relationships such as prerequisite, diagnostic, and aggregation. The difference
between BN and normal graph is that the strength of every relationship in BN is represented by
a conditional probability table (CPT) whose entries are conditional probabilities of a child node
given parent nodes. There are two main approaches to construct a BN, which are as follows

© 2017 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use,
distribution, and reproduction in any medium, provided the original work is properly cited.


• The first approach aims to learn the BN from training data by machine learning algorithms.
• The second approach is that experts define some graph patterns according to specific relationships, and then the BN is constructed based on such patterns along with determined CPTs.

This research focuses on the second approach, in which relationships are converted into CPTs. Essentially, relationship conversion aims to determine conditional probabilities based on weights and meanings of relationships. We will have different ways to convert graphic weights into CPTs for different relationships. It is impossible to convert all relationships, but some of them, such as diagnostic, aggregation, and prerequisite, are mandatory ones that we must specify as computable CPTs of the BN. Especially, these relationships are governed by logic X-gates [1] such as AND-gate, OR-gate, and SIGMA-gate. The X-gate inference in this research is derived from and inspired by the noisy OR-gate described in the book “Learning Bayesian Networks” by Neapolitan ([2], pp. 157–159). Díez and Druzdzel [3] also researched OR/MAX, AND/MIN, and noisy XOR inferences, but they focused on canonical models, deterministic models, and ICI models, whereas I focus on logic gates and graphic relationships. So, their research is different from mine, but we share the same result, which is the AND-gate model. In general, my research focuses on applied probability attached to Bayesian networks, logic gates, and Bayesian user modeling [4]. The scientific results are shared with Millán and Pérez-de-la-Cruz [4].

A factor graph [5] represents the factorization of a global function into many partial functions. If the joint distribution of a BN is considered as the global function and the CPTs are considered as partial functions, the sum-product algorithm [6] of the factor graph can be applied to calculating posterior probabilities of variables in the BN. Pearl's propagation algorithm [7] is very successful in BN inference. The application of factor graphs to BNs is only realized if all CPTs of the BN are already determined, whereas this research focuses on defining such CPTs first. I did not use factor graphs for constructing BNs. The concept “X-gate inference” only implies how to convert a simple graph into a BN. However, the arrangement sum with a fixed variable mentioned in this research is the “not-sum” ([6], p. 499) of the factor graph. Essentially, the X-gate probability shown in Eq. (10) is the same as the λ message in Pearl's algorithm ([6], p. 518), but I use the most basic way to prove the X-gate probability.

By default, the research is applied in the learning context, in which a BN is used to assess students' knowledge. Evidences are tests, exams, exercises, etc., and hypotheses are learning concepts, knowledge items, etc. Note that the diagnostic relationship is very important to Bayesian evaluation in the learning context because it is used to evaluate a student's mastery of concepts (knowledge items) over the entire BN. Now, we start relationship conversion with a research on the diagnostic relationship in the next section.

2. Diagnostic relationship

In some opinions like mine, the diagnostic relationship should be from hypothesis to evidence. For example, a disease is a hypothesis and a symptom is evidence. The symptom must be conditionally dependent on the disease. Given a symptom, calculating the posterior probability of
Converting Graphic Relationships into Conditional Probabilities in Bayesian Network 99
http://dx.doi.org/10.5772/intechopen.70057

disease is essentially to diagnose the likelihood of such disease ([8], p. 1666). Inversely, the arc from evidence to hypothesis implies prediction, where evidence and hypothesis represent observation and event, respectively. Given an observation, calculating the posterior probability of the event is essentially to predict/assert such event ([8], p. 1666). Figure 1 shows diagnosis and prediction.
The weight w of the relationship between X and D is 1. Figure 1 depicts the simplest graph with two random variables. We need to convert the diagnostic relationship into conditional probabilities in order to construct the simplest BN from the simplest graph. Note that the hypothesis is binary, but the evidence can be numerical. In the learning context, evidence D can be a test, exam, exercise, etc. The conditional probability of D given X (the likelihood function) is P(D|X). The posterior probability of X is P(X|D), which is used to evaluate the student's mastery of concept (hypothesis) X given evidence D. Eq. (1) specifies the CPT of D when D is binary (0 and 1):

P(D|X) = \begin{cases} D & \text{if } X = 1 \\ 1 - D & \text{if } X = 0 \end{cases} \qquad (1)

Eq. (1) is our first relationship conversion. It implies

P(D|X = 0) + P(D|X = 1) = D + (1 − D) = 1

Evidence D can be used to diagnose hypothesis X if the so-called sufficient diagnostic proposition is satisfied, as seen in Table 1.

The concept of sufficient evidence is borrowed from the concept of sufficient statistics, and it is inspired by the equivalence of variables T and T' in the research ([4], pp. 292–295). The proposition can be restated as follows: evidence D is only used to assess hypotheses if it is sufficient evidence. As a convention, the proposition is called the diagnostic condition, and hypotheses have uniform distribution. The assumption of hypothetic uniform distribution (P(X = 1) = P(X = 0)) implies that we cannot assert whether or not a given hypothesis is true before we observe its evidence.

In the learning context, D can be totally used to assess a student's mastery of X if the diagnostic condition is satisfied. Derived from such condition, Eq. (2) specifies the transformation coefficient k given uniform distribution of X.

Figure 1. Diagnosis and prediction with hypothesis X and evidence D.



D is equivalent to X in diagnostic relationship if P(X|D) = kP(D|X) given uniform distribution of X and the transformation
coefficient k is independent from D. In other words, k is constant with regards to D and so D is called sufficient evidence.

Table 1. Sufficient diagnostic proposition.

k = \frac{P(X|D)}{P(D|X)} \qquad (2)

We need to prove that Eq. (1) satisfies the diagnostic condition. Suppose the prior probability of X is uniform:

P(X = 0) = P(X = 1)

we have

P(X|D) = \frac{P(D|X)P(X)}{P(D)} = \frac{P(D|X)P(X)}{P(D|X=0)P(X=0) + P(D|X=1)P(X=1)}

(due to Bayes' rule)

= \frac{P(D|X)P(X)}{P(X)\big(P(D|X=0) + P(D|X=1)\big)}

(due to P(X = 0) = P(X = 1))

= \frac{P(D|X)}{P(D|X=0) + P(D|X=1)} = P(D|X)

(due to P(D|X = 0) + P(D|X = 1) = 1) ■

It is easy to infer that the transformation coefficient k is 1 if D is binary. In practice, evidence D is often a test whose grade ranges within an interval {0, 1, 2, …, η}. Eq. (3) specifies the CPT of D in this case:
P(D|X) = \begin{cases} \dfrac{D}{S} & \text{if } X = 1 \\ \dfrac{\eta}{S} - \dfrac{D}{S} & \text{if } X = 0 \end{cases} \qquad (3)

where

D \in \{0, 1, 2, \dots, \eta\}, \quad S = \sum_{D=0}^{\eta} D = \frac{\eta(\eta+1)}{2}

As a convention, P(D|X) = 0, ∀D ∉ {0, 1, 2, …, η}. Eq. (3) implies that if the student has mastered the concept (X = 1), the probability that she/he completes the exercise/test D is proportional to her/his mark on D, namely P(D|X) = D/S. We also have

P(D|X=0) + P(D|X=1) = \frac{D}{S} + \frac{\eta - D}{S} = \frac{\eta}{S} = \frac{2}{\eta + 1}

\sum_{D=0}^{\eta} P(D|X=1) = \sum_{D=0}^{\eta} \frac{D}{S} = \frac{S}{S} = 1

\sum_{D=0}^{\eta} P(D|X=0) = \sum_{D=0}^{\eta} \frac{\eta - D}{S} = \frac{\sum_{D=0}^{\eta} \eta - \sum_{D=0}^{\eta} D}{S} = \frac{\eta(\eta+1) - S}{S} = \frac{2S - S}{S} = 1

We need to prove that Eq. (3) satisfies the diagnostic condition. Suppose the prior probability of X is uniform:

P(X = 0) = P(X = 1)

The assumption of a prior uniform distribution of X implies that we have not yet determined whether the student has mastered X. Similarly, we have

P(X|D) = \frac{P(D|X)P(X)}{P(D)} = \frac{P(D|X)}{P(D|X=0) + P(D|X=1)} = \frac{\eta + 1}{2} P(D|X) ■

So, the transformation coefficient k is (η + 1)/2 if D ranges in {0, 1, 2, …, η}.

In the most general case, discrete evidence D ranges within an arbitrary integer interval {a, a+1, a+2, …, b}. In other words, D is a bounded integer variable whose lower bound and upper bound are a and b, respectively. Eq. (4) specifies the CPT of D, where D ∈ {a, a+1, a+2, …, b}.

P(D|X) = \begin{cases} \dfrac{D}{S} & \text{if } X = 1 \\ \dfrac{b + a}{S} - \dfrac{D}{S} & \text{if } X = 0 \end{cases} \qquad (4)

where

D \in \{a, a+1, a+2, \dots, b\}, \quad S = a + (a+1) + (a+2) + \dots + b = \frac{(b+a)(b-a+1)}{2}

Note that P(D|X) = 0, ∀D ∉ {a, a+1, a+2, …, b}. According to the diagnostic condition, we need to prove the equality P(X|D) = kP(D|X), where

k = \frac{b - a + 1}{2}

Similarly, we have

P(X|D) = \frac{P(D|X)P(X)}{P(D)} = \frac{P(D|X)}{P(D|X=0) + P(D|X=1)} = \frac{b - a + 1}{2} P(D|X) ■

If evidence D is continuous in the real interval [a, b], with the note that a and b are real numbers, Eq. (5) specifies the probability density function (PDF) of continuous evidence D ∈ [a, b]. The PDF p(D|X) replaces the CPT in case of a continuous random variable.

p(D|X) = \begin{cases} \dfrac{2D}{b^2 - a^2} & \text{if } X = 1 \\ \dfrac{2}{b - a} - \dfrac{2D}{b^2 - a^2} & \text{if } X = 0 \end{cases} \qquad (5)

where

D \in [a, b] \text{ with } a, b \text{ real numbers}, \quad S = \int_a^b D \, \mathrm{d}D = \frac{b^2 - a^2}{2}
a

As a convention, [a, b] is called the domain of continuous evidence, which can be replaced by open or half-open intervals such as (a, b), (a, b], and [a, b). Of course, we have p(D|X) = 0, ∀D ∉ [a, b]. In the learning context, evidence D is often a test whose grade ranges within the real interval [a, b].

Functions p(D|X = 1) and p(D|X = 0) are valid PDFs due to

\int_D p(D|X=1) \, \mathrm{d}D = \int_a^b \frac{2D}{b^2 - a^2} \, \mathrm{d}D = \frac{1}{b^2 - a^2} \int_a^b 2D \, \mathrm{d}D = 1

\int_D p(D|X=0) \, \mathrm{d}D = \int_a^b \frac{2}{b - a} \, \mathrm{d}D - \frac{1}{b^2 - a^2} \int_a^b 2D \, \mathrm{d}D = 2 - 1 = 1
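The same validity check can be done numerically with a midpoint-rule integral. This Java sketch is mine, not from the chapter; the class and method names and the interval [0, 10] are arbitrary choices. It integrates both branches of Eq. (5):

```java
public class ContinuousPDF {

    // p(D | X) from Eq. (5) on [a, b]: 2D/(b^2 - a^2) if X = 1,
    // 2/(b - a) - 2D/(b^2 - a^2) if X = 0
    static double p(double d, double a, double b, int x) {
        if (d < a || d > b) return 0.0;
        double dens = 2.0 * d / (b * b - a * a);
        return (x == 1) ? dens : 2.0 / (b - a) - dens;
    }

    // Midpoint-rule approximation of the integral of p(D | X = x) over [a, b]
    static double integral(double a, double b, int x, int steps) {
        double h = (b - a) / steps, sum = 0.0;
        for (int i = 0; i < steps; i++) {
            sum += p(a + (i + 0.5) * h, a, b, x) * h;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Both branches of Eq. (5) integrate to 1 on [a, b] = [0, 10]
        System.out.println(integral(0, 10, 1, 100000));
        System.out.println(integral(0, 10, 0, 100000));
    }
}
```

Since both densities are linear in D, the midpoint rule is exact up to rounding, and both integrals return 1 as Eq. (5) requires.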

According to the diagnostic condition, we need to prove the equality

P(X|D) = k p(D|X)

where

k = \frac{b - a}{2}

When D is continuous, its probability is calculated in an ε-vicinity, where ε is a very small number. As usual, ε is the bias if D consists of measured values produced by equipment. The probability of D given X, where D + ε ∈ [a, b] and D − ε ∈ [a, b], is

P(D|X) = \int_{D-\varepsilon}^{D+\varepsilon} p(D|X) \, \mathrm{d}D = \begin{cases} \displaystyle\int_{D-\varepsilon}^{D+\varepsilon} \frac{2D}{b^2 - a^2} \, \mathrm{d}D & \text{if } X = 1 \\ \displaystyle\int_{D-\varepsilon}^{D+\varepsilon} \left( \frac{2}{b - a} - \frac{2D}{b^2 - a^2} \right) \mathrm{d}D & \text{if } X = 0 \end{cases}

= \begin{cases} \dfrac{4\varepsilon D}{b^2 - a^2} & \text{if } X = 1 \\ \dfrac{4\varepsilon}{b - a} - \dfrac{4\varepsilon D}{b^2 - a^2} & \text{if } X = 0 \end{cases} = 2\varepsilon \, p(D|X)

In fact, we have

P(X|D) = \frac{P(D|X)P(X)}{P(D|X=0)P(X=0) + P(D|X=1)P(X=1)} = \frac{P(D|X)}{P(D|X=0) + P(D|X=1)}

(due to Bayes' rule and the assumption P(X = 0) = P(X = 1))

= \frac{b - a}{4\varepsilon} P(D|X) = \frac{b - a}{2} p(D|X) = k p(D|X) ■

The second-to-last step holds because P(D|X=0) + P(D|X=1) = 2ε(p(D|X=0) + p(D|X=1)) = 4ε/(b − a).

In general, Eq. (6) summarizes the CPT of evidence of a single diagnostic relationship.

P(D|X) = \begin{cases} \dfrac{D}{S} & \text{if } X = 1 \\ \dfrac{M}{S} - \dfrac{D}{S} & \text{if } X = 0 \end{cases}, \qquad k = \frac{N}{2}

where

N = \begin{cases} 2 & \text{if } D \in \{0, 1\} \\ \eta + 1 & \text{if } D \in \{0, 1, 2, \dots, \eta\} \\ b - a + 1 & \text{if } D \in \{a, a+1, a+2, \dots, b\} \\ b - a & \text{if } D \text{ continuous and } D \in [a, b] \end{cases}

M = \begin{cases} 1 & \text{if } D \in \{0, 1\} \\ \eta & \text{if } D \in \{0, 1, 2, \dots, \eta\} \\ b + a & \text{if } D \in \{a, a+1, a+2, \dots, b\} \\ b + a & \text{if } D \text{ continuous and } D \in [a, b] \end{cases}

S = \sum_D D = \frac{NM}{2} = \begin{cases} 1 & \text{if } D \in \{0, 1\} \\ \dfrac{\eta(\eta+1)}{2} & \text{if } D \in \{0, 1, 2, \dots, \eta\} \\ \dfrac{(b+a)(b-a+1)}{2} & \text{if } D \in \{a, a+1, a+2, \dots, b\} \\ \dfrac{b^2 - a^2}{2} & \text{if } D \text{ continuous and } D \in [a, b] \end{cases} \qquad (6)

In general, if the conditional probability P(D|X) is specified by Eq. (6), the diagnostic condition will be satisfied. Note that the CPT P(D|X) becomes the PDF p(D|X) in case of continuous evidence. The diagnostic relationship will be extended with more than one hypothesis. The next section mentions how to determine the CPTs of a simple graph with one child node and many parent nodes based on X-gate inferences.

3. X-gate inferences

Given a simple graph consisting of one child variable Y and n parent variables Xi, as shown in Figure 2, each relationship from Xi to Y is quantified by a normalized weight wi where 0 ≤ wi ≤ 1. A large graph is an integration of many simple graphs. Figure 2 shows the DAG of a simple BN. As aforementioned, the essence of constructing a simple BN is to convert the graphic relationships of the simple graph into the CPTs of the simple BN.

The child variable Y is called the target, and the parent variables Xis are called sources. Especially, these relationships are governed by X-gates such as AND-gate, OR-gate, and SIGMA-gate. These gates originate from logic gates [1]. For instance, AND-gate and OR-gate represent the prerequisite relationship, and SIGMA-gate represents the aggregation relationship. Therefore, relationship conversion is to determine the X-gate inference. The simple graph shown in Figure 2 is also called X-gate graph or X-gate network. Please distinguish the letter “X” in the term “X-gate inference”, which implies logic operators (AND, OR, XOR, etc.), from the variable X.

All variables are binary and they represent events. The probability P(X) indicates that event X occurs. Thus, P(X) implicates P(X = 1) and P(not(X)) implicates P(X = 0). Eq. (7) specifies the simple NOT-gate inference.

Figure 2. Simple graph or simple network.



 
P(not(X)) = P(X = 0) = 1 − P(X = 1) = 1 − P(X)
P(not(not(X))) = P(X) \qquad (7)

X-gate inference is based on three assumptions mentioned in Ref. ([2], p. 157), which are as follows:
• X-gate inhibition: Given a relationship from source Xi to target Y, there is a factor Ii that inhibits Xi from being integrated into Y. Factor Ii is called the inhibition of Xi. That the inhibition Ii is turned off is a prerequisite for Xi being integrated into Y.
• Inhibition independence: Inhibitions are mutually independent. For example, inhibition I1 of X1 is independent from inhibition I2 of X2.
• Accountability: The X-gate network is established by accountable variables Ai for Xi and Ii. Each X-gate inference owns a particular combination of Ais.
Figure 3 shows the extended X-gate network with accountable variables Ais ([2], p. 158).

The strength of each relationship from source Xi to target Y is quantified by a weight 0 ≤ wi ≤ 1. According to the assumption of inhibition, the probability of Ii = OFF is pi, which is set to be the weight wi:

pi = wi

If the notation wi is used, we focus on the strength of the relationship; if the notation pi is used, we focus on the probability of OFF inhibition. In probabilistic inference, pi is also the prior probability of Xi = 1. However, we will assume each Xi has uniform distribution later on. Eq. (8) specifies the probabilities of inhibitions Iis and accountable variables Ais.

Figure 3. Extended X-gate network with accountable variables Ais.



P(Ii = OFF) = pi = wi
P(Ii = ON) = 1 − pi = 1 − wi
P(Ai = ON | Xi = 1, Ii = OFF) = 1
P(Ai = ON | Xi = 1, Ii = ON) = 0
P(Ai = ON | Xi = 0, Ii = OFF) = 0
P(Ai = ON | Xi = 0, Ii = ON) = 0
P(Ai = OFF | Xi = 1, Ii = OFF) = 0
P(Ai = OFF | Xi = 1, Ii = ON) = 1
P(Ai = OFF | Xi = 0, Ii = OFF) = 1
P(Ai = OFF | Xi = 0, Ii = ON) = 1 \qquad (8)

According to Eq. (8), given P(Ai = ON | Xi = 1, Ii = OFF) = 1, it is assured with 100% confidence that the accountable variable Ai is turned on if source Xi is 1 and inhibition Ii is turned off. Eq. (9) specifies the conditional probability of accountable variables Ais given Xis, which is a corollary of Eq. (8).

P(Ai = ON | Xi = 1) = pi = wi
P(Ai = ON | Xi = 0) = 0
P(Ai = OFF | Xi = 1) = 1 − pi = 1 − wi
P(Ai = OFF | Xi = 0) = 1 \qquad (9)
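Eq. (9) is just Eq. (8) with the inhibition Ii marginalized out, which can be replayed in a few lines of Java (a sketch of mine; the class and method names are not from the chapter):

```java
public class AccountableVariable {

    // P(Ai = ON | Xi, Ii) from Eq. (8): ON only when Xi = 1 and Ii = OFF
    static double pOnGiven(int xi, boolean inhibitionOff) {
        return (xi == 1 && inhibitionOff) ? 1.0 : 0.0;
    }

    // Eq. (9) by marginalizing the inhibition Ii out of Eq. (8):
    // P(Ai = ON | Xi) = sum over Ii of P(Ai = ON | Xi, Ii) * P(Ii)
    static double pOn(int xi, double p) {
        return pOnGiven(xi, true) * p + pOnGiven(xi, false) * (1.0 - p);
    }

    public static void main(String[] args) {
        double p = 0.7;   // p_i = w_i, the probability that Ii is OFF
        System.out.println(pOn(1, p));   // equals p_i
        System.out.println(pOn(0, p));   // equals 0
    }
}
```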

Appendix A1 is the proof of Eq. (9). As a definition, the set of all Xis is complete if and only if

P(X_1 \cup X_2 \cup \dots \cup X_n) = P(\Omega) = \sum_{i=1}^{n} w_i = 1

The set of all Xis is mutually exclusive if and only if

Xi ∩ Xj = ∅, ∀i ≠ j

For each Xi, there is only one Ai and vice versa, which establishes a bijection between Xis and Ais. Obviously, the fact that the set of all Xis is complete is equivalent to the fact that the set of all Ais is complete. We will prove by contradiction that “the fact that the set of all Xis is mutually exclusive is equivalent to the fact that the set of all Ais is mutually exclusive.” Suppose Xi ∩ Xj = ∅, ∀i ≠ j, but ∃i ≠ j: Ai ∩ Aj = B ≠ ∅. Let B⁻¹ ≠ ∅ be the preimage of B. Due to B ⊆ Ai and B ⊆ Aj, we have B⁻¹ ⊆ Xi and B⁻¹ ⊆ Xj, which causes Xi ∩ Xj = B⁻¹ ≠ ∅. This is a contradiction, and so we have

Xi ∩ Xj = ∅, ∀i ≠ j ⇒ Ai ∩ Aj = ∅, ∀i ≠ j

By similar proof, we have



Ai ∩ Aj = ∅, ∀i ≠ j ⇒ Xi ∩ Xj = ∅, ∀i ≠ j ■

The extended X-gate network shown in Figure 3 is an interpretation of the simple network shown in Figure 2. Specifying the CPT of the simple network is to determine the conditional probability P(Y = 1 | X1, X2,…, Xn) based on the extended X-gate network. The X-gate inference is represented by such probability P(Y = 1 | X1, X2,…, Xn), specified by Eq. (10) ([2], p. 159).
P(Y|X_1, X_2, \dots, X_n) = \sum_{A_1, A_2, \dots, A_n} P(Y|A_1, A_2, \dots, A_n) \prod_{i=1}^{n} P(A_i|X_i) \qquad (10)

Appendix A2 is the proof of Eq. (10). It is necessary to introduce some mathematical notations because Eq. (10) is complicated and relevant to arrangements of Xis. Given the set Ω = {X1, X2,…, Xn} where all variables are binary, Table 2 specifies binary arrangements of Ω.
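Eq. (10) can be evaluated directly by enumerating all 2^n combinations of the accountable variables. The sketch below is mine, not the chapter's code; the class name, method names, and the idea of passing the gate condition P(Y = 1 | A) as a predicate on the vector A are all my choices. For illustration, it plugs in the OR-gate condition:

```java
import java.util.function.Predicate;

public class XGateInference {

    // Eq. (10): P(Y = 1 | X1..Xn) = sum over all combinations of the
    // accountable variables A of P(Y = 1 | A) * prod_i P(Ai | Xi).
    // The gate condition P(Y = 1 | A) is supplied as a 0/1 predicate on A.
    static double infer(int[] x, double[] p, Predicate<boolean[]> gate) {
        int n = x.length;
        double total = 0.0;
        for (int mask = 0; mask < (1 << n); mask++) {
            boolean[] a = new boolean[n];   // a[i] = true means Ai = ON
            double prob = 1.0;
            for (int i = 0; i < n; i++) {
                a[i] = ((mask >> i) & 1) == 1;
                // Eq. (9): P(Ai = ON | Xi) = p_i if Xi = 1, else 0
                prob *= a[i] ? (x[i] == 1 ? p[i] : 0.0)
                             : (x[i] == 1 ? 1.0 - p[i] : 1.0);
            }
            if (gate.test(a)) total += prob;
        }
        return total;
    }

    public static void main(String[] args) {
        int[] x = {1, 1, 0};
        double[] p = {0.8, 0.5, 0.9};
        // OR-gate condition: Y = 1 when some Ai is ON;
        // closed form gives 1 - (1 - 0.8)(1 - 0.5) = 0.9
        double or = infer(x, p, a -> { for (boolean b : a) if (b) return true; return false; });
        System.out.println(or);
    }
}
```

The enumeration costs O(2^n) and is only meant to make Eq. (10) concrete; the closed-form gate probabilities derived below avoid it.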

Given Ω = {X1, X2,…, Xn} where |Ω| = n is the cardinality of Ω.

Let a(Ω) be an arrangement of Ω, which is a set of n instances {X1 = x1, X2 = x2,…, Xn = xn} where each xi is 1 or 0. The number of all a(Ω) is 2^|Ω|. For instance, given Ω = {X1, X2}, there are 2² = 4 arrangements as follows:

a(Ω) = {X1 = 1, X2 = 1}, a(Ω) = {X1 = 1, X2 = 0}, a(Ω) = {X1 = 0, X2 = 1}, a(Ω) = {X1 = 0, X2 = 0}.

Let a(Ω:{Xi}) be the arrangement of Ω with fixed Xi. The number of all a(Ω:{Xi}) is 2^(|Ω|−1). Similarly, for instance, a(Ω:{X1, X2, X3}) is an arrangement of Ω with fixed X1, X2, X3, and the number of all a(Ω:{X1, X2, X3}) is 2^(|Ω|−3).

Let c(Ω) and c(Ω:{Xi}) be the numbers of arrangements a(Ω) and a(Ω:{Xi}), respectively. Such c(Ω) and c(Ω:{Xi}) are called arrangement counters. As usual, the counters c(Ω) and c(Ω:{Xi}) are equal to 2^|Ω| and 2^(|Ω|−1), respectively, but they will vary according to specific cases.

Let \sum_a F(a(Ω)) and \prod_a F(a(Ω)) denote the sum and product of values generated from a function F acting on every a(Ω). The number of arrangements on which F acts is c(Ω).

Let x denote the X-gate operator; for instance, x = ⊙ for AND-gate, x = ⊕ for OR-gate, x = not ⊙ for NAND-gate, x = not ⊕ for NOR-gate, x = ⊗ for XOR-gate, x = not ⊗ for XNOR-gate, x = ⊎ for U-gate, and x = + for SIGMA-gate. Given an x-operator, let s(Ω:{Xi}) and s(Ω) be the sums of all P(X1 x X2 x … x Xn) through every arrangement of Ω with and without fixed Xi, respectively:

s(Ω) = \sum_a P(X1 x X2 x … x Xn | a(Ω)) = \sum_a P(Y = 1 | a(Ω))

s(Ω:{Xi}) = \sum_a P(X1 x X2 x … x Xn | a(Ω:{Xi})) = \sum_a P(Y = 1 | a(Ω:{Xi}))

For example, s(Ω) and s(Ω:{Xi}) for OR-gate are:

s(Ω) = \sum_a P(X1 ⊕ X2 ⊕ … ⊕ Xn | a(Ω))

s(Ω:{Xi}) = \sum_a P(X1 ⊕ X2 ⊕ … ⊕ Xn | a(Ω:{Xi}))

Such s(Ω) and s(Ω:{Xi}) are called arrangement sums; they are generated by the acting function F.
Note that Ω can be any set of binary variables.

Table 2. Binary arrangements.



It is not easy to produce all binary arrangements of Ω. Table 3 shows a code snippet written in the Java programming language for producing all such arrangements.

Each element of the list “arrangements” is a binary arrangement a(Ω) presented by an array of bits (0 and 1). The recursive method “create(int[] a, int i)” is the main one that generates arrangements. The method call “ArrangementGenerator.parse(2, n)” will list all possible binary arrangements.

Eq. (11) specifies the connection between s(Ω:{Xi = 1}) and s(Ω:{Xi = 0}), and between c(Ω:{Xi = 1}) and c(Ω:{Xi = 0}).

s(Ω:{Xi = 1}) + s(Ω:{Xi = 0}) = s(Ω)
c(Ω:{Xi = 1}) + c(Ω:{Xi = 0}) = c(Ω) \qquad (11)

It is easy to draw Eq. (11) because the set of all arrangements a(Ω:{Xi = 1}) is the complement of the set of all arrangements a(Ω:{Xi = 0}).
Let K be the set of Xis whose values are 1 and let L be the set of Xis whose values are 0. K and L are mutually complementary. Eq. (12) determines the sets K and L.

K = \{i : X_i = 1\}, \quad L = \{i : X_i = 0\}, \quad K \cap L = \emptyset, \quad K \cup L = \{1, 2, \dots, n\} \qquad (12)

The AND-gate inference represents a prerequisite relationship satisfying the AND-gate condition specified by Eq. (13).

P(Y = 1 | Ai = OFF for some i) = 0 \qquad (13)

From Eq. (10), we have

P(Y = 1 | X_1, X_2, \dots, X_n) = \sum_{A_1, A_2, \dots, A_n} P(Y = 1 | A_1, A_2, \dots, A_n) \prod_{i=1}^{n} P(A_i | X_i)

= \prod_{i=1}^{n} P(A_i = \mathrm{ON} | X_i)

(due to P(Y = 1 | Ai = OFF for some i) = 0)

= \left( \prod_{i \in K} P(A_i = \mathrm{ON} | X_i = 1) \right) \left( \prod_{i \notin K} P(A_i = \mathrm{ON} | X_i = 0) \right)

= \left( \prod_{i \in K} p_i \right) \left( \prod_{i \notin K} 0 \right) = \begin{cases} \prod_{i=1}^{n} p_i & \text{if all } X_i \text{ are 1} \\ 0 & \text{if there exists at least one } X_i = 0 \end{cases}

(due to Eq. (9))



import java.util.ArrayList;

public class ArrangementGenerator {

    private ArrayList<int[]> arrangements;
    private int n;
    private int r;

    private ArrangementGenerator(int n, int r) {
        this.n = n;
        this.r = r;
        this.arrangements = new ArrayList<int[]>();
    }

    private void create(int[] a, int i) {
        for (int j = 0; j < n; j++) {
            a[i] = j;
            if (i < r - 1)
                create(a, i + 1);
            else if (i == r - 1) {
                int[] b = new int[a.length];
                for (int k = 0; k < a.length; k++) b[k] = a[k];
                arrangements.add(b);
            }
        }
    }

    public int[] get(int i) {
        return arrangements.get(i);
    }

    public long size() {
        return arrangements.size();
    }

    public static ArrangementGenerator parse(int n, int r) {
        ArrangementGenerator arr = new ArrangementGenerator(n, r);
        int[] a = new int[r];
        for (int i = 0; i < r; i++) a[i] = -1;
        arr.create(a, 0);
        return arr;
    }
}

Table 3. Code snippet generating all binary arrangements.
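For comparison, the same enumeration can be written more compactly with bit masks; this is an alternative sketch of mine, not the chapter's code, and the class and method names are arbitrary:

```java
public class BinaryArrangements {

    // Lists all 2^n binary arrangements a(Omega) of n variables,
    // equivalent in output to ArrangementGenerator.parse(2, n) in Table 3.
    static int[][] enumerate(int n) {
        int count = 1 << n;
        int[][] all = new int[count][n];
        for (int mask = 0; mask < count; mask++) {
            for (int i = 0; i < n; i++) {
                all[mask][i] = (mask >> i) & 1;   // value of Xi in this arrangement
            }
        }
        return all;
    }

    public static void main(String[] args) {
        // 2^2 = 4 arrangements for Omega = {X1, X2}, as in Table 2
        System.out.println(enumerate(2).length);
    }
}
```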

In general, Eq. (14) specifies the AND-gate inference.

P(X_1 \odot X_2 \odot \dots \odot X_n) = P(Y = 1 | X_1, X_2, \dots, X_n) = \begin{cases} \prod_{i=1}^{n} p_i & \text{if all } X_i \text{ are 1} \\ 0 & \text{if there exists at least one } X_i = 0 \end{cases}

P(Y = 0 | X_1, X_2, \dots, X_n) = \begin{cases} 1 - \prod_{i=1}^{n} p_i & \text{if all } X_i \text{ are 1} \\ 1 & \text{if there exists at least one } X_i = 0 \end{cases} \qquad (14)
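A direct Java transcription of Eq. (14) (a sketch of mine; the class and method names are not from the chapter):

```java
public class AndGate {

    // Eq. (14): P(X1 ⊙ ... ⊙ Xn) = prod of p_i when every Xi = 1, else 0
    static double probability(int[] x, double[] p) {
        double prod = 1.0;
        for (int i = 0; i < x.length; i++) {
            if (x[i] == 0) return 0.0;   // one absent source kills the AND
            prod *= p[i];
        }
        return prod;
    }

    public static void main(String[] args) {
        double[] p = {0.8, 0.9};
        System.out.println(probability(new int[]{1, 1}, p));   // ~ 0.72
        System.out.println(probability(new int[]{1, 0}, p));   // 0.0
    }
}
```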

The AND-gate inference was also described in ([3], p. 33). Eq. (14) varies according to two
cases whose arrangement counters are listed as follows

L = ∅:

c(Ω:{Xi = 1}) = 1, c(Ω:{Xi = 0}) = 0, c(Ω) = 1.

L ≠ ∅:

c(Ω:{Xi = 1}) = 2^{n−1} − 1, c(Ω:{Xi = 0}) = 2^{n−1}, c(Ω) = 2^n − 1.

The OR-gate inference represents a prerequisite relationship satisfying the OR-gate condition specified by Eq. (15) ([2], p. 157).

P(Y = 1 | Ai = ON for some i) = 1 \qquad (15)

The OR-gate condition implies

P(Y = 0 | Ai = ON for some i) = 0

From Eq. (10), we have ([2], p. 159)

P(Y = 0 | X_1, X_2, \dots, X_n) = \sum_{A_1, A_2, \dots, A_n} P(Y = 0 | A_1, A_2, \dots, A_n) \prod_{i=1}^{n} P(A_i | X_i)

= \prod_{i=1}^{n} P(A_i = \mathrm{OFF} | X_i)

(due to P(Y = 0 | Ai = ON for some i) = 0)

= \left( \prod_{i \in K} P(A_i = \mathrm{OFF} | X_i = 1) \right) \left( \prod_{i \notin K} P(A_i = \mathrm{OFF} | X_i = 0) \right)

= \left( \prod_{i \in K} (1 - p_i) \right) \left( \prod_{i \notin K} 1 \right) = \begin{cases} \prod_{i \in K} (1 - p_i) & \text{if } K \neq \emptyset \\ 1 & \text{if } K = \emptyset \end{cases}

(due to Eq. (9))

In general, Eq. (16) specifies the OR-gate inference.

P(X_1 \oplus X_2 \oplus \dots \oplus X_n) = 1 - P(Y = 0 | X_1, X_2, \dots, X_n) = \begin{cases} 1 - \prod_{i \in K} (1 - p_i) & \text{if } K \neq \emptyset \\ 0 & \text{if } K = \emptyset \end{cases}

P(Y = 0 | X_1, X_2, \dots, X_n) = \begin{cases} \prod_{i \in K} (1 - p_i) & \text{if } K \neq \emptyset \\ 1 & \text{if } K = \emptyset \end{cases} \qquad (16)
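Eq. (16) is the classical noisy-OR combination, transcribed below (a sketch of mine; the class and method names are not from the chapter):

```java
public class OrGate {

    // Eq. (16): P(X1 ⊕ ... ⊕ Xn) = 1 - prod over K of (1 - p_i),
    // where K is the set of sources with Xi = 1; the value is 0 when K is empty
    static double probability(int[] x, double[] p) {
        boolean kEmpty = true;
        double prod = 1.0;
        for (int i = 0; i < x.length; i++) {
            if (x[i] == 1) {
                kEmpty = false;
                prod *= 1.0 - p[i];
            }
        }
        return kEmpty ? 0.0 : 1.0 - prod;
    }

    public static void main(String[] args) {
        double[] p = {0.8, 0.5, 0.9};
        System.out.println(probability(new int[]{1, 1, 0}, p));   // ~ 0.9
        System.out.println(probability(new int[]{0, 0, 0}, p));   // 0.0
    }
}
```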

where K is the set of Xis whose values are 1. The OR-gate inference was mentioned in Refs. ([2],
p. 158) and ([3], p. 20). Eq. (16) varies according to two cases whose arrangement counters are
listed as follows

K ≠ ∅:

c(Ω:{Xi = 1}) = 2^{n−1}, c(Ω:{Xi = 0}) = 2^{n−1} − 1, c(Ω) = 2^n − 1.

K = ∅:

c(Ω:{Xi = 1}) = 0, c(Ω:{Xi = 0}) = 1, c(Ω) = 1.

According to De Morgan's rule with regard to AND-gate and OR-gate, we have

P(\mathrm{not}(X_1 \odot X_2 \odot \dots \odot X_n)) = P(\mathrm{not}(X_1) \oplus \mathrm{not}(X_2) \oplus \dots \oplus \mathrm{not}(X_n)) = \begin{cases} 1 - \prod_{i \in L} \big(1 - (1 - p_i)\big) & \text{if } L \neq \emptyset \\ 0 & \text{if } L = \emptyset \end{cases}

(due to Eq. (16))

According to Eq. (14), we also have

P(\mathrm{not}(X_1 \oplus X_2 \oplus \dots \oplus X_n)) = P(\mathrm{not}(X_1) \odot \mathrm{not}(X_2) \odot \dots \odot \mathrm{not}(X_n))

= \begin{cases} \prod_{i=1}^{n} P(\mathrm{not}(X_i)) & \text{if all } \mathrm{not}(X_i) \text{ are 1} \\ 0 & \text{if there exists at least one } \mathrm{not}(X_i) = 0 \end{cases} = \begin{cases} \prod_{i=1}^{n} (1 - p_i) & \text{if all } X_i \text{ are 0} \\ 0 & \text{if there exists at least one } X_i = 1 \end{cases}

In general, Eq. (17) specifies the NAND-gate inference and the NOR-gate inference derived from AND-gate and OR-gate:

P(\mathrm{not}(X_1 \odot X_2 \odot \dots \odot X_n)) = \begin{cases} 1 - \prod_{i \in L} p_i & \text{if } L \neq \emptyset \\ 0 & \text{if } L = \emptyset \end{cases}

P(\mathrm{not}(X_1 \oplus X_2 \oplus \dots \oplus X_n)) = \begin{cases} \prod_{i=1}^{n} (1 - p_i) & \text{if } K = \emptyset \\ 0 & \text{if } K \neq \emptyset \end{cases} \qquad (17)

where K and L are the sets of Xis whose values are 1 and 0, respectively.

Suppose the number of sources Xis is even. Let O be the set of Xis whose indices are odd, and let O1 and O2 be the subsets of O in which all Xis are 1 and 0, respectively. Let E be the set of Xis whose indices are even, and let E1 and E2 be the subsets of E in which all Xis are 1 and 0, respectively.

E = \{2, 4, 6, \dots, n\} \qquad O = \{1, 3, 5, \dots, n-1\}
E_1 \subseteq E \qquad O_1 \subseteq O
E_2 \subseteq E \qquad O_2 \subseteq O
E_1 \cup E_2 = E \quad \text{and} \quad O_1 \cup O_2 = O
E_1 \cap E_2 = \emptyset \qquad O_1 \cap O_2 = \emptyset
X_i = 1, \forall i \in E_1 \qquad X_i = 1, \forall i \in O_1
X_i = 0, \forall i \in E_2 \qquad X_i = 0, \forall i \in O_2

Thus, O1 and E1 are subsets of K. Sources Xis and target Y follow XOR-gate if one of the two XOR-gate conditions specified by Eq. (18) is satisfied.

P(Y = 1 | {Ai = ON for i ∈ O, Ai = OFF for i ∉ O}) = P(Y = 1 | A1 = ON, A2 = OFF,…, An−1 = ON, An = OFF) = 1

P(Y = 1 | {Ai = ON for i ∈ E, Ai = OFF for i ∉ E}) = P(Y = 1 | A1 = OFF, A2 = ON,…, An−1 = OFF, An = ON) = 1 \qquad (18)

From Eq. (10), we have

P(Y = 1 | X_1, X_2, \dots, X_n) = \sum_{A_1, A_2, \dots, A_n} P(Y = 1 | A_1, A_2, \dots, A_n) \prod_{i=1}^{n} P(A_i | X_i)

If both XOR-gate conditions are not satisfied, then

P(Y = 1 | X_1, X_2, \dots, X_n) = 0

If the first XOR-gate condition is satisfied, we have

P(Y = 1 | X_1, X_2, \dots, X_n) = P(Y = 1 | A_1 = \mathrm{ON}, A_2 = \mathrm{OFF}, \dots, A_{n-1} = \mathrm{ON}, A_n = \mathrm{OFF}) \prod_{i=1}^{n} P(A_i | X_i)

= \left( \prod_{i \in O} P(A_i = \mathrm{ON} | X_i) \right) \left( \prod_{i \in E} P(A_i = \mathrm{OFF} | X_i) \right)

We have

\prod_{i \in O} P(A_i = \mathrm{ON} | X_i) = \left( \prod_{i \in O_1} P(A_i = \mathrm{ON} | X_i = 1) \right) \left( \prod_{i \in O_2} P(A_i = \mathrm{ON} | X_i = 0) \right)

= \left( \prod_{i \in O_1} p_i \right) \left( \prod_{i \in O_2} 0 \right) = \begin{cases} \prod_{i \in O_1} p_i & \text{if } O_2 = \emptyset \\ 0 & \text{if } O_2 \neq \emptyset \end{cases}

(Due to Eq. (9))

We also have

\prod_{i \in E} P(A_i = \mathrm{OFF} | X_i) = \left( \prod_{i \in E_1} P(A_i = \mathrm{OFF} | X_i = 1) \right) \left( \prod_{i \in E_2} P(A_i = \mathrm{OFF} | X_i = 0) \right)

= \left( \prod_{i \in E_1} (1 - p_i) \right) \left( \prod_{i \in E_2} 1 \right) = \begin{cases} \prod_{i \in E_1} (1 - p_i) & \text{if } E_1 \neq \emptyset \\ 1 & \text{if } E_1 = \emptyset \end{cases}

(Due to Eq. (9))

Given the first XOR-gate condition, it implies

P(Y = 1 | X_1, X_2, \dots, X_n) = \left( \prod_{i \in O} P(A_i = \mathrm{ON} | X_i) \right) \left( \prod_{i \in E} P(A_i = \mathrm{OFF} | X_i) \right)

= \begin{cases} \left( \prod_{i \in O_1} p_i \right) \left( \prod_{i \in E_1} (1 - p_i) \right) & \text{if } O_2 = \emptyset \text{ and } E_1 \neq \emptyset \\ \prod_{i \in O_1} p_i & \text{if } O_2 = \emptyset \text{ and } E_1 = \emptyset \\ 0 & \text{if } O_2 \neq \emptyset \end{cases}

Similarly, given the second XOR-gate condition, we have

P(Y = 1 | X_1, X_2, \dots, X_n) = \left( \prod_{i \in E} P(A_i = \mathrm{ON} | X_i) \right) \left( \prod_{i \in O} P(A_i = \mathrm{OFF} | X_i) \right)

= \begin{cases} \left( \prod_{i \in E_1} p_i \right) \left( \prod_{i \in O_1} (1 - p_i) \right) & \text{if } E_2 = \emptyset \text{ and } O_1 \neq \emptyset \\ \prod_{i \in E_1} p_i & \text{if } E_2 = \emptyset \text{ and } O_1 = \emptyset \\ 0 & \text{if } E_2 \neq \emptyset \end{cases}

If one of the XOR-gate conditions is satisfied, then

P(Y = 1 | X_1, X_2, \dots, X_n) = \left( \prod_{i \in O} P(A_i = \mathrm{ON} | X_i) \right) \left( \prod_{i \in E} P(A_i = \mathrm{OFF} | X_i) \right) + \left( \prod_{i \in E} P(A_i = \mathrm{ON} | X_i) \right) \left( \prod_{i \in O} P(A_i = \mathrm{OFF} | X_i) \right)

This implies Eq. (19), which specifies the XOR-gate inference.



P(X_1 \otimes X_2 \otimes \dots \otimes X_n) = P(Y = 1 | X_1, X_2, \dots, X_n)

= \begin{cases} \left( \prod_{i \in O_1} p_i \right) \left( \prod_{i \in E_1} (1 - p_i) \right) + \left( \prod_{i \in E_1} p_i \right) \left( \prod_{i \in O_1} (1 - p_i) \right) & \text{if } O_2 = \emptyset \text{ and } E_2 = \emptyset \\ \left( \prod_{i \in O_1} p_i \right) \left( \prod_{i \in E_1} (1 - p_i) \right) & \text{if } O_2 = \emptyset \text{ and } E_1 \neq \emptyset \text{ and } E_2 \neq \emptyset \\ \prod_{i \in O_1} p_i & \text{if } O_2 = \emptyset \text{ and } E_1 = \emptyset \\ \left( \prod_{i \in E_1} p_i \right) \left( \prod_{i \in O_1} (1 - p_i) \right) & \text{if } E_2 = \emptyset \text{ and } O_1 \neq \emptyset \text{ and } O_2 \neq \emptyset \\ \prod_{i \in E_1} p_i & \text{if } E_2 = \emptyset \text{ and } O_1 = \emptyset \\ 0 & \text{if } O_2 \neq \emptyset \text{ and } E_2 \neq \emptyset \\ 0 & \text{if } n < 2 \text{ or } n \text{ is odd} \end{cases} \qquad (19)

where

O = \{1, 3, 5, \dots, n-1\}, \quad O_1 \subseteq O, \quad O_2 \subseteq O, \quad O_1 \cup O_2 = O, \quad O_1 \cap O_2 = \emptyset, \quad X_i = 1 \ \forall i \in O_1, \quad X_i = 0 \ \forall i \in O_2

E = \{2, 4, 6, \dots, n\}, \quad E_1 \subseteq E, \quad E_2 \subseteq E, \quad E_1 \cup E_2 = E, \quad E_1 \cap E_2 = \emptyset, \quad X_i = 1 \ \forall i \in E_1, \quad X_i = 0 \ \forall i \in E_2
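Because Eq. (18) makes Y = 1 for exactly two patterns of the accountable variables, Eq. (19) can be computed as a sum of two products. The sketch below is mine, not the chapter's code; array indices are 0-based, so the chapter's odd positions 1, 3,… become indices 0, 2,…:

```java
public class XorGate {

    // P(Ai = ON | Xi) and P(Ai = OFF | Xi) per Eq. (9)
    static double pOn(int xi, double p)  { return xi == 1 ? p : 0.0; }
    static double pOff(int xi, double p) { return xi == 1 ? 1.0 - p : 1.0; }

    // Eq. (19) evaluated through its two XOR-gate conditions (Eq. (18)):
    // Y = 1 only for the pattern "odd Ai ON, even Ai OFF" and its mirror
    static double probability(int[] x, double[] p) {
        int n = x.length;
        if (n < 2 || n % 2 != 0) return 0.0;   // last case of Eq. (19)
        double first = 1.0, second = 1.0;
        for (int i = 0; i < n; i++) {
            boolean oddPosition = (i % 2 == 0);   // position i + 1 is odd
            first  *= oddPosition ? pOn(x[i], p[i]) : pOff(x[i], p[i]);
            second *= oddPosition ? pOff(x[i], p[i]) : pOn(x[i], p[i]);
        }
        return first + second;
    }

    public static void main(String[] args) {
        double[] p = {0.8, 0.5};
        // X1 = 1, X2 = 1: p1*(1 - p2) + p2*(1 - p1) = 0.4 + 0.1 = 0.5
        System.out.println(probability(new int[]{1, 1}, p));
    }
}
```

Working through the six cases of Eq. (19) shows that they collapse to exactly this two-term sum, with the zero cases produced automatically by P(Ai = ON | Xi = 0) = 0.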
Given n ≥ 2 and n even, Eq. (19) varies according to six cases whose arrangement counters are listed as follows:

O2 = ∅ and E2 = ∅:

c(Ω:{Xi = 1}) = 1, c(Ω:{Xi = 0}) = 0, c(Ω) = 1.

O2 = ∅ and E1 ≠ ∅ and E2 ≠ ∅:

c(Ω:{Xi = 1}) = 2^{n/2} − 2, c(Ω:{Xi = 0}) = 0, c(Ω) = 2^{n/2} − 2.

O2 = ∅ and E1 = ∅:

c(Ω:{Xi = 1}) = 1, c(Ω:{Xi = 0}) = 0, c(Ω) = 1.

E2 = ∅ and O1 ≠ ∅ and O2 ≠ ∅:

c(Ω:{Xi = 1}) = 2^{n/2−1} − 1, c(Ω:{Xi = 0}) = 2^{n/2−1} − 1, c(Ω) = 2^{n/2} − 2.

E2 = ∅ and O1 = ∅:

c(Ω:{Xi = 1}) = 0, c(Ω:{Xi = 0}) = 1, c(Ω) = 1.

O2 ≠ ∅ and E2 ≠ ∅:

c(Ω:{Xi = 1}) = (2^{n/2−1} − 1)(2^{n/2} − 1), c(Ω:{Xi = 0}) = 2^{n/2−1}(2^{n/2} − 1), c(Ω) = (2^{n/2} − 1)².

Suppose the number of sources Xis is even. According to XNOR-gate inference [1], the output is on if all inputs get the same value, 1 (or 0). Sources Xis and target Y follow XNOR-gate if one of the two XNOR-gate conditions specified by Eq. (20) is satisfied.

P(Y = 1 | Ai = ON, ∀i) = 1
P(Y = 1 | Ai = OFF, ∀i) = 1 \qquad (20)

From Eq. (10), we have

P(Y = 1 | X_1, X_2, \dots, X_n) = \sum_{A_1, A_2, \dots, A_n} P(Y = 1 | A_1, A_2, \dots, A_n) \prod_{i=1}^{n} P(A_i | X_i)

If both XNOR-gate conditions are not satisfied, then

P(Y = 1 | X_1, X_2, \dots, X_n) = 0

If Ai = ON for all i, we have

P(Y = 1 | X_1, X_2, \dots, X_n) = P(Y = 1 | A_i = \mathrm{ON}, \forall i) \prod_{i=1}^{n} P(A_i = \mathrm{ON} | X_i)

= \prod_{i=1}^{n} P(A_i = \mathrm{ON} | X_i) = \begin{cases} \prod_{i=1}^{n} p_i & \text{if } L = \emptyset \\ 0 & \text{if } L \neq \emptyset \end{cases}

(Please see similar proof in AND-gate inference)

If Ai = OFF for all i, we have


(Y
n
Y ð1 � pi Þ if K 6¼ ∅
PðY ¼ 1jX1 , X2 , …, Xn Þ ¼ PðAi ¼ OFFjXi Þ ¼ i∈K
i¼1 1 if K ¼ ∅

(Please see similar proof in OR-gate inference)

If one of the XNOR-gate conditions is satisfied, then

P(Y = 1 | X_1, X_2, \dots, X_n) = \prod_{i=1}^{n} P(A_i = \mathrm{ON} | X_i) + \prod_{i=1}^{n} P(A_i = \mathrm{OFF} | X_i)

This implies Eq. (21), which specifies the XNOR-gate inference.



P(\mathrm{not}(X_1 \otimes X_2 \otimes \dots \otimes X_n)) = P(Y = 1 | X_1, X_2, \dots, X_n) = \begin{cases} \prod_{i=1}^{n} p_i + \prod_{i=1}^{n} (1 - p_i) & \text{if } L = \emptyset \\ \prod_{i \in K} (1 - p_i) & \text{if } L \neq \emptyset \text{ and } K \neq \emptyset \\ 1 & \text{if } L \neq \emptyset \text{ and } K = \emptyset \end{cases} \qquad (21)

where K and L are the sets of Xis whose values are 1 and 0, respectively. Eq. (21) varies according to three cases whose arrangement counters are listed as follows:

L = ∅:

c(Ω:{Xi = 1}) = 1, c(Ω:{Xi = 0}) = 0, c(Ω) = 1.

L ≠ ∅ and K ≠ ∅:

c(Ω:{Xi = 1}) = 2^{n−1} − 1, c(Ω:{Xi = 0}) = 2^{n−1} − 1, c(Ω) = 2^n − 2.

L ≠ ∅ and K = ∅:

c(Ω:{Xi = 1}) = 0, c(Ω:{Xi = 0}) = 1, c(Ω) = 1.
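Eq. (21) admits the same two-pattern implementation as the XOR derivation above (a sketch of mine; the class and method names are not from the chapter):

```java
public class XnorGate {

    // Eq. (21): the two XNOR conditions (all Ai ON, all Ai OFF) give
    // P = prod p_i + prod (1 - p_i) when L is empty (all Xi = 1),
    // P = prod over K of (1 - p_i) when K and L are both nonempty,
    // P = 1 when K is empty (all Xi = 0)
    static double probability(int[] x, double[] p) {
        double allOn = 1.0, allOff = 1.0;
        for (int i = 0; i < x.length; i++) {
            allOn  *= x[i] == 1 ? p[i] : 0.0;         // P(Ai = ON | Xi)
            allOff *= x[i] == 1 ? 1.0 - p[i] : 1.0;   // P(Ai = OFF | Xi)
        }
        return allOn + allOff;
    }

    public static void main(String[] args) {
        double[] p = {0.8, 0.5};
        // L empty: 0.8*0.5 + 0.2*0.5 = 0.5
        System.out.println(probability(new int[]{1, 1}, p));
        // K empty: allOn = 0, allOff = 1
        System.out.println(probability(new int[]{0, 0}, p));
    }
}
```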

Let U be a set of indices such that Ai = ON, and let α ≥ 0 and β ≥ 0 be predefined numbers. The U-gate inference is defined based on α, β, and the cardinality of U. Table 4 specifies four common U-gate conditions.

Note that the U-gate condition on |U| can be arbitrary, and it is only relevant to the Ais (ON or OFF) and the way to combine the Ais. For example, AND-gate and OR-gate are specific cases of U-gate with |U| = n and |U| ≥ 1, respectively. XOR-gate and XNOR-gate are also specific cases of U-gate with specific conditions on the Ais. However, it must be assured that there is at least one combination of Ais satisfying the predefined U-gate condition, so that the U-gate probability is not always equal to 0. In this research, U-gate is the most general nonlinear gate, whose probability contains products of weights (see Table 5). Later on, we will research a so-called SIGMA-gate that contains only a linear combination of weights (sum of weights, see Eq. (23)). Shortly, each X-gate is a pattern owning a particular X-gate inference, that is, the X-gate probability P(X1 x X2 x … x Xn). Each X-gate inference is based on particular X-gate condition(s) relevant to only the variables Ais.

From Eq. (10), we have

P(Y = 1 | X_1, X_2, \dots, X_n) = \sum_{A_1, A_2, \dots, A_n} P(Y = 1 | A_1, A_2, \dots, A_n) \prod_{i=1}^{n} P(A_i | X_i)

Let U be the set of all possible U(s); we have



|U| = α:       P(Y = 1 | A1, A2, …, An) = 1 if there are exactly α variables Ai = ON; otherwise P(Y = 1 | A1, A2, …, An) = 0.
|U| ≥ α:       P(Y = 1 | A1, A2, …, An) = 1 if there are at least α variables Ai = ON; otherwise P(Y = 1 | A1, A2, …, An) = 0.
|U| ≤ β:       P(Y = 1 | A1, A2, …, An) = 1 if there are at most β variables Ai = ON; otherwise P(Y = 1 | A1, A2, …, An) = 0.
α ≤ |U| ≤ β:   P(Y = 1 | A1, A2, …, An) = 1 if the number of variables Ai = ON is from α to β; otherwise P(Y = 1 | A1, A2, …, An) = 0.

Table 4. U-gate conditions.
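The four conditions in Table 4 can be sketched as predicates on the number of ON causes; the function and predicate names below are illustrative, not from the chapter.

```python
def u_gate_cpt(a, condition):
    """P(Y = 1 | A1..An) for a U-gate: 1 when the number of A_i = ON
    satisfies the chosen condition on |U|, otherwise 0."""
    return 1.0 if condition(sum(a)) else 0.0

# The four common U-gate conditions of Table 4 as predicates on |U|:
def exactly(alpha):       return lambda u: u == alpha
def at_least(alpha):      return lambda u: u >= alpha
def at_most(beta):        return lambda u: u <= beta
def between(alpha, beta): return lambda u: alpha <= u <= beta
```

For example, AND-gate corresponds to `exactly(n)` and OR-gate to `at_least(1)`, matching the specific cases mentioned in the text.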

P(Y = 1 | X1, X2, …, Xn) = Σ_{A1, A2, …, An} P(Y = 1 | A1, A2, …, An) Π_{i=1}^{n} P(Ai | Xi)
= Σ_{U∈U} Π_{i∈U} P(Ai = ON | Xi) Π_{j∉U} P(Aj = OFF | Xj)

If Xi = 0 for some i ∈ U then P(Ai = ON | Xi) = 0 and the corresponding addend vanishes:

Σ_{U∈U} 0 × Π_{j∉U} P(Aj = OFF | Xj) = 0

This implies all sets U(s) must be subsets of K. The U-gate probability is rewritten as follows:

P(Y = 1 | X1, X2, …, Xn) = Σ_{U∈U} Π_{i∈U} P(Ai = ON | Xi = 1) Π_{j∉U} P(Aj = OFF | Xj)
= Σ_{U∈U} Π_{i∈U} pi Π_{j∉U} P(Aj = OFF | Xj)
= Σ_{U∈U} Π_{i∈U} pi Π_{j∈K\U} P(Aj = OFF | Xj = 1) Π_{j∉K} P(Aj = OFF | Xj = 0)
= Σ_{U∈U} Π_{i∈U} pi Π_{j∈K\U} (1 − pj) Π_{j∉K} 1
= Σ_{U∈U} Π_{i∈U} pi Π_{j∈K\U} (1 − pj)

(Due to Eq. (9))

Let PU be the U-gate probability; Table 5 specifies U-gate inference and cardinality of U where
U is the set of subsets (U) of K.
Note that the notation C(n, j) denotes the number of combinations of j elements taken from n elements:

C(n, j) = n! / (j! (n − j)!)

Arrangement counters relevant to U-gate inference and the set K are listed as follows

Let S_U = Σ_{U∈U} Π_{i∈U} pi Π_{j∈K\U} (1 − pj)

P_U = P(X1 ⊎ X2 ⊎ … ⊎ Xn) = P(Y = 1 | X1, X2, …, Xn)

As a convention, Π_{i∈U} pi = 1 if |U| = 0 and Π_{j∈K\U} (1 − pj) = 1 if |U| = |K|.

Case |U| = 0:
P_U = Π_{j=1}^{n} (1 − pj) if |K| > 0; P_U = 1 if |K| = 0.
The cardinality of U is 1.

Case |U| ≥ 0:
P_U = S_U if |K| > 0; P_U = 1 if |K| = 0.
The cardinality of U is 2^|K|. The case |U| ≥ 0 is the same as the case |U| ≤ n.

Case |U| = n:
P_U = Π_{i=1}^{n} pi if |K| = n; P_U = 0 if |K| < n.
The cardinality of U is 1 if |K| = n; 0 if |K| < n.

Case |U| = α (0 < α < n):
P_U = S_U if |K| ≥ α; P_U = 0 if |K| < α.
The cardinality of U is C(|K|, α) if |K| ≥ α; 0 if |K| < α.

Case |U| ≥ α (0 < α < n):
P_U = S_U if |K| ≥ α; P_U = 0 if |K| < α.
The cardinality of U is Σ_{j=α}^{|K|} C(|K|, j) if |K| ≥ α; 0 if |K| < α.

Case |U| ≤ β (0 < β < n):
P_U = S_U if |K| > 0; P_U = 1 if |K| = 0.
The cardinality of U is Σ_{j=0}^{min(β,|K|)} C(|K|, j) if |K| > 0; 1 if |K| = 0.

Case α ≤ |U| ≤ β (0 < α < n, 0 < β < n):
P_U = S_U if |K| ≥ α; P_U = 0 if |K| < α.
The cardinality of U is Σ_{j=α}^{min(β,|K|)} C(|K|, j) if |K| ≥ α; 0 if |K| < α.

Table 5. U-gate inference.
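Table 5's closed forms can be cross-checked by brute force: summing Π_{i∈U} pi Π_{j∈K\U} (1 − pj) over the subsets U of K whose size satisfies the condition on |U|. A sketch with hypothetical names, assuming the weights pi are given:

```python
from itertools import combinations

def u_gate_prob(p, x, condition):
    """Brute-force S_U: sum over subsets U of K = {i : X_i = 1} whose size
    satisfies the condition, of prod(p_i for i in U) * prod(1 - p_j for j in K\\U)."""
    K = [i for i, xi in enumerate(x) if xi == 1]
    total = 0.0
    for r in range(len(K) + 1):
        if not condition(r):
            continue                      # subset size r fails the |U| condition
        for U in combinations(K, r):
            prob = 1.0
            for i in K:
                prob *= p[i] if i in U else 1.0 - p[i]
            total += prob
    return total
```

For |U| = n this reduces to the AND-gate product when |K| = n, and for |U| ≥ 0 the sum over all subsets of K is 1 when K ≠ ∅, matching the table.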



|K| = 0:

c(Ω:{Xi = 1}) = 0, c(Ω:{Xi = 0}) = 1, c(Ω) = 1.

|K| = 1:

c(Ω:{Xi = 1}) = 1, c(Ω:{Xi = 0}) = 0, c(Ω) = 1.

|K| = α and α > 0:

c(Ω:{Xi = 1}) = C(n − 1, α − 1), c(Ω:{Xi = 0}) = C(n − 1, α), c(Ω) = C(n, α).

|K| ≤ α and α > 0:

c(Ω:{Xi = 1}) = Σ_{j=1}^{α} C(n − 1, j − 1), c(Ω:{Xi = 0}) = Σ_{j=0}^{α} C(n − 1, j), c(Ω) = Σ_{j=0}^{α} C(n, j).

|K| ≥ α and α > 0:

c(Ω:{Xi = 1}) = Σ_{j=α}^{n} C(n − 1, j − 1), c(Ω:{Xi = 0}) = Σ_{j=α}^{n−1} C(n − 1, j), c(Ω) = Σ_{j=α}^{n} C(n, j).
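The closed-form counters for the case |K| = α can be checked by enumerating arrangements; a small sketch (names are illustrative, not from the chapter):

```python
from itertools import product
from math import comb

def counters_exact_k(n, alpha):
    """Arrangement counters when |K| = alpha: counts over all 0/1 tuples
    (X1..Xn) having exactly alpha ones, split by the fixed value of X1."""
    arrangements = [a for a in product((0, 1), repeat=n) if sum(a) == alpha]
    c_on = sum(1 for a in arrangements if a[0] == 1)   # c(Omega:{Xi = 1})
    c_off = sum(1 for a in arrangements if a[0] == 0)  # c(Omega:{Xi = 0})
    return c_on, c_off, len(arrangements)              # last item is c(Omega)
```

The three counts agree with C(n − 1, α − 1), C(n − 1, α), and C(n, α), respectively.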

The SIGMA-gate inference [9] represents aggregation relationship satisfying the SIGMA-gate condition specified by Eq. (22).

P(Y) = P(Σ_{i=1}^{n} Ai)

where the set of Ai(s) is complete and mutually exclusive:

Σ_{i=1}^{n} wi = 1
Ai ∩ Aj = ∅, ∀i ≠ j     (22)

The sigma sum Σ_{i=1}^{n} Ai indicates that Y is the exclusive union of the Ai(s); here, it does not express arithmetical addition.

Y = Σ_{i=1}^{n} Ai = ⋃_{i=1}^{n} Ai

This implies

P(Y) = P(Σ_{i=1}^{n} Ai) = P(⋃_{i=1}^{n} Ai) = Σ_{i=1}^{n} P(Ai)

The sigma sum Σ_{i=1}^{n} P(Ai) now expresses arithmetical addition of the probabilities P(Ai).

SIGMA-gate inference requires that the set of Ai(s) be complete and mutually exclusive, which means that the set of Xi(s) is complete and mutually exclusive too. The SIGMA-gate probability is [9]

P(Y | X1, X2, …, Xn) = P(Σ_{i=1}^{n} Ai | X1, X2, …, Xn)
(due to SIGMA-gate condition)
= Σ_{i=1}^{n} P(Ai | X1, X2, …, Xn)
(because the Ai(s) are mutually exclusive)
= Σ_{i=1}^{n} P(Ai | Xi)
(because Ai is only dependent on Xi)

It implies

P(Y = 1 | X1, X2, …, Xn) = Σ_{i=1}^{n} P(Ai = ON | Xi)
= Σ_{i∈K} P(Ai = ON | Xi = 1) + Σ_{i∉K} P(Ai = ON | Xi = 0)
= Σ_{i∈K} wi + 0 = Σ_{i∈K} wi

(Due to Eq. (9))


In general, Eq. (23) specifies the theorem of SIGMA-gate inference [9]. The base of this theorem was mentioned by Millán and Pérez-de-la-Cruz ([4], pp. 292-295).

P(X1 + X2 + … + Xn) = P(Σ_{i=1}^{n} Xi) = P(Y = 1 | X1, X2, …, Xn) = Σ_{i∈K} wi
P(Y = 0 | X1, X2, …, Xn) = 1 − Σ_{i∈K} wi = Σ_{i∈L} wi

where the set of Xi(s) is complete and mutually exclusive:

Σ_{i=1}^{n} wi = 1
Xi ∩ Xj = ∅, ∀i ≠ j     (23)

The arrangement counters of SIGMA-gate inference are c(Ω:{Xi = 1}) = c(Ω:{Xi = 0}) = 2^(n−1), c(Ω) = 2^n.
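Eq. (23) reduces to a weighted count of the ON sources; a minimal sketch with hypothetical names, assuming the weights form a complete set:

```python
def sigma_gate_prob(w, x):
    """Eq. (23): P(X1 + ... + Xn) = sum of w_i over i with X_i = 1.
    Requires sum(w) = 1 (complete, mutually exclusive sources)."""
    assert abs(sum(w) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(wi for wi, xi in zip(w, x) if xi == 1)
```

P(Y = 0 | X1, …, Xn) is then the complementary sum over L, i.e. 1 minus this value.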

Eq. (9) specifies the "clockwise" strength of relationship between Xi and Y. Event Xi = 1 causes event Ai = ON with "clockwise" weight wi. There is a question: "given Xi = 0, how likely is the event Ai = OFF?". In order to solve this problem, I define a so-called "counterclockwise" strength of relationship between Xi and Y denoted ωi. Event Xi = 0 causes event Ai = OFF with "counterclockwise" weight ωi. In other words, each arc in simple graph is associated with a clockwise weight wi and a counterclockwise weight ωi. Such graph is called bi-weight simple graph, shown in Figure 4.

With bi-weight simple graph, all X-gate inferences are extended as so-called X-gate bi-inferences. Derived from Eq. (9), Eq. (24) specifies conditional probability of accountable variables with regard to bi-weight graph.

P(Ai = ON | Xi = 1) = pi = wi
P(Ai = ON | Xi = 0) = 1 − ρi = 1 − ωi
P(Ai = OFF | Xi = 1) = 1 − pi = 1 − wi
P(Ai = OFF | Xi = 0) = ρi = ωi     (24)

The probabilities P(Ai = ON | Xi = 0) and P(Ai = OFF | Xi = 1) are called the clockwise adder di and the counterclockwise adder δi, respectively. As usual, di and δi are smaller than wi and ωi. When di = 0, bi-weight graph becomes normal simple graph.

di = P(Ai = ON | Xi = 0) = 1 − ρi = 1 − ωi
δi = P(Ai = OFF | Xi = 1) = 1 − pi = 1 − wi

The total clockwise weight or total counterclockwise weight is defined as the sum of clockwise weight and clockwise adder or the sum of counterclockwise weight and counterclockwise adder. Eq. (25) specifies such total weights Wi and W̄i. These weights are also called relationship powers.

Wi = wi + di
W̄i = ωi + δi

where

di = 1 − ρi = 1 − ωi
δi = 1 − pi = 1 − wi     (25)

Given Eq. (25), the set of all Ai(s) is complete if and only if Σ_{i=1}^{n} wi = 1.

Figure 4. Bi-weight simple graph.



By extending the aforementioned X-gate inferences, we get bi-inferences for AND-gate, OR-gate, NAND-gate, NOR-gate, XOR-gate, XNOR-gate, and U-gate as shown in Table 6.

The largest cardinalities of K (L) are 2^(n−1) and 2^n with and without fixed Xi. Thus, it is possible to calculate arrangement counters. As a convention, the product of probabilities is 1 if the indices set is empty:

Π_{i∈I} fi = 1 if I = ∅

With regard to SIGMA-gate bi-inference, the sum of all total clockwise weights must be 1 as follows:

Σ_{i=1}^{n} Wi = Σ_{i=1}^{n} (wi + di) = Σ_{i=1}^{n} (wi + 1 − ωi) = 1

Derived from Eq. (23), the SIGMA-gate probability for bi-weight graph is

P(X1 + X2 + … + Xn) = Σ_{i=1}^{n} P(Ai = ON | Xi)
= Σ_{i∈K} P(Ai = ON | Xi = 1) + Σ_{i∈L} P(Ai = ON | Xi = 0)
= Σ_{i∈K} wi + Σ_{i∈L} di

Shortly, Eq. (26) specifies SIGMA-gate bi-inference.

P(X1 + X2 + … + Xn) = Σ_{i∈K} wi + Σ_{i∈L} di

where the set of Xi(s) is complete and mutually exclusive:

Σ_{i=1}^{n} Wi = 1
Xi ∩ Xj = ∅, ∀i ≠ j     (26)
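Eq. (26) can be sketched the same way as plain SIGMA-gate inference, now picking wi for ON sources and the clockwise adder di for OFF ones (illustrative names; assumes the total weights Wi = wi + di sum to 1):

```python
def sigma_gate_bi_prob(w, d, x):
    """Eq. (26): P(X1 + ... + Xn) = sum_{i in K} w_i + sum_{i in L} d_i,
    where K = {i : X_i = 1}, L = {i : X_i = 0}; needs sum_i (w_i + d_i) = 1."""
    assert abs(sum(w) + sum(d) - 1.0) < 1e-9, "total weights W_i must sum to 1"
    return sum(wi if xi == 1 else di for wi, di, xi in zip(w, d, x))
```

With all di = 0 the bi-inference degrades to Eq. (23).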

The next section studies the diagnostic relationship, which adheres to X-gate inference.

4. Multihypothesis diagnostic relationship

Given a simple graph shown in Figure 2, if we replace the target source Y by an evidence D, we get a so-called multihypothesis diagnostic relationship whose property adheres to X-gate inference. There may be other diagnostic relationships in which X-gate inference is not concerned; however, this research focuses on X-gate inference, and so the multihypothesis diagnostic relationship is called X-gate diagnostic relationship. Sources X1, X2,…, Xn become hypotheses. As a convention, these hypotheses have prior uniform distribution.

According to the aforementioned X-gate network shown in Figures 2 and 3, the target variable must be binary whereas evidence D can be numeric. It is impossible to establish the evidence D as direct target variable. Thus, the solution of this problem is to add an augmented binary target variable Y; the evidence D is then connected directly to Y. In other words, the X-gate diagnostic network has n sources {X1, X2,…, Xn}, one augmented hypothesis Y, and one evidence D. As a convention, X-gate diagnostic network is called X-D network. The CPTs of the entire network are determined based on combination of diagnostic relationship and X-gate inference mentioned in previous sections. Figure 5 depicts the augmented X-D network. Note that variables X1, X2,…, Xn, and Y are always binary.

Appendix A3 is the proof that the augmented X-D network is equivalent to X-D network with regard to variables X1, X2,…, Xn and D. As a convention, augmented X-D network is considered the same as X-D network.

The simplest case of X-D network is NOT-D network having one hypothesis X1 and one
evidence D, equipped with NOT-gate inference. NOT-D network satisfies diagnostic condition
because it essentially represents the single diagnostic relationship. Inferred from Eqs. (1)
and (7), the conditional probability P(D|X1) and posterior probability P(X1|D) of NOT-D
network are
P(D | X1) = 1 − D if X1 = 1; P(D | X1) = D if X1 = 0

P(X1 | D) = P(D | X1) P(X1) / (P(X1)(P(D | X1 = 0) + P(D | X1 = 1)))

(Due to Bayes' rule and uniform distribution of X1)

= P(D | X1) / (P(D | X1 = 0) + P(D | X1 = 1)) = P(D | X1)

(due to P(D | X1 = 0) + P(D | X1 = 1) = 1)

Figure 5. Augmented X-D network.

It implies NOT-D network satisfies diagnostic condition. Let

Ω = {X1, X2, …, Xn}
n = |Ω|

We will validate whether the CPT of diagnostic relationship, P(D|X) specified by Eq. (6), still satisfies diagnostic condition within the general case, X-D network. In other words, X-D network is the general case of single diagnostic relationship.

Recall from dependencies shown in Figure 5 that Eq. (27) specifies the joint probability of X-D network.

P(Ω, Y, D) = P(X1, X2, …, Xn, Y, D) = P(D | Y) P(Y | X1, X2, …, Xn) Π_{i=1}^{n} P(Xi)     (27)

where Ω = {X1, X2, …, Xn}.

Eq. (28) specifies the conditional probability of D given Xi (likelihood function) and the posterior probability of Xi given D.

P(D | Xi) = P(Xi, D)/P(Xi) = Σ_{{Ω,Y,D}\{Xi,D}} P(Ω, Y, D) / Σ_{{Ω,Y,D}\{Xi}} P(Ω, Y, D)

P(Xi | D) = P(Xi, D)/P(D) = Σ_{{Ω,Y,D}\{Xi,D}} P(Ω, Y, D) / Σ_{{Ω,Y,D}\{D}} P(Ω, Y, D)     (28)

where Ω = {X1, X2,…, Xn} and the sign "\" denotes the subtraction (excluding) operator in set theory [10]. Eq. (29) specifies the joint probability P(Xi, D) and the marginal probability P(D) given uniform distribution of all sources. Appendix A4 is the proof of Eq. (29).

P(Xi, D) = (1/(2^n S)) ((2D − M) s(Ω:{Xi}) + 2^(n−1)(M − D))

P(D) = (1/(2^n S)) ((2D − M) s(Ω) + 2^n (M − D))     (29)

where s(Ω) and s(Ω:{Xi}) are specified in Table 2. From Eqs. (28) and (29), Eq. (30) specifies conditional probability P(D|Xi), posterior probability P(Xi|D), and transformation coefficient for X-gate inference.

P(D | Xi = 1) = P(Xi = 1, D)/P(Xi = 1) = ((2D − M) s(Ω:{Xi = 1}) + 2^(n−1)(M − D)) / (2^(n−1) S)

P(D | Xi = 0) = P(Xi = 0, D)/P(Xi = 0) = ((2D − M) s(Ω:{Xi = 0}) + 2^(n−1)(M − D)) / (2^(n−1) S)

P(Xi = 1 | D) = P(Xi = 1, D)/P(D) = ((2D − M) s(Ω:{Xi = 1}) + 2^(n−1)(M − D)) / ((2D − M) s(Ω) + 2^n (M − D))     (30)

P(Xi = 0 | D) = 1 − P(Xi = 1 | D) = ((2D − M) s(Ω:{Xi = 0}) + 2^(n−1)(M − D)) / ((2D − M) s(Ω) + 2^n (M − D))

k = P(Xi | D) / P(D | Xi) = 2^(n−1) S / ((2D − M) s(Ω) + 2^n (M − D))

The transformation coefficient is rewritten as follows:

k = 2^(n−1) S / (2D (s(Ω) − 2^(n−1)) + M (2^n − s(Ω)))

Note that S, D, and M are abstract symbols and there is no proportional connection between 2^(n−1)S and D for all D, specified by Eq. (6). Assume such a proportional connection 2^(n−1)S = aD^j exists for all D, where a is an arbitrary constant. Given the binary case when D = 0 and S = 1, we have

2^(n−1) = 2^(n−1) × 1 = 2^(n−1)S = aD^j = a × 0^j = 0

There is a contradiction, which implies that it is impossible to reduce k into the following form:

aD^j / (bD^j)

Therefore, if k is constant with regard to D then

2D (s(Ω) − 2^(n−1)) + M (2^n − s(Ω)) = C ≠ 0, ∀D

where C is constant. We have

Σ_D (2D (s(Ω) − 2^(n−1)) + M (2^n − s(Ω))) = Σ_D C
⇒ 2S (s(Ω) − 2^(n−1)) + NM (2^n − s(Ω)) = NC
⇒ 2^n S = NC

It is implied that

P(X1 ⊙ X2 ⊙ … ⊙ Xn) = Π_{i∈K} pi Π_{i∈L} di

P(X1 ⊕ X2 ⊕ … ⊕ Xn) = 1 − Π_{i∈K} δi Π_{i∈L} ρi

P(not(X1 ⊙ X2 ⊙ … ⊙ Xn)) = 1 − Π_{i∈L} ρi Π_{i∈K} δi

P(not(X1 ⊕ X2 ⊕ … ⊕ Xn)) = Π_{i∈L} di Π_{i∈K} pi

P(X1 ⊗ X2 ⊗ … ⊗ Xn) = Π_{i∈O1} pi Π_{i∈O2} di Π_{i∈E1} δi Π_{i∈E2} ρi + Π_{i∈E1} pi Π_{i∈E2} di Π_{i∈O1} δi Π_{i∈O2} ρi

P(not(X1 ⊗ X2 ⊗ … ⊗ Xn)) = Π_{i∈K} pi Π_{i∈L} di + Π_{i∈K} δi Π_{i∈L} ρi

P(X1 ⊎ X2 ⊎ … ⊎ Xn) = Σ_{U∈U} (Π_{i∈U∩K} pi Π_{i∈U∩L} di) (Π_{i∈Ū∩K} δi Π_{i∈Ū∩L} ρi)

There are four common conditions of U: |U| = α, |U| ≥ α, |U| ≤ β, and α ≤ |U| ≤ β. Note that Ū is the complement of U:

Ū = {1, 2, …, n}\U

The largest cardinality of U is 2^n.

Table 6. Bi-inferences for AND-gate, OR-gate, NAND-gate, NOR-gate, XOR-gate, XNOR-gate, and U-gate.

k = 2^(n−1) S / (2D (s(Ω) − 2^(n−1)) + M (2^n − s(Ω))) = NC/(2C) = N/2

This holds

2^n S = N (2D (s(Ω) − 2^(n−1)) + M (2^n − s(Ω))) = 2ND (s(Ω) − 2^(n−1)) + 2S (2^n − s(Ω))
⇒ 2ND (s(Ω) − 2^(n−1)) − 2S (s(Ω) − 2^(n−1)) = 0
⇒ (ND − S)(s(Ω) − 2^(n−1)) = 0

Assuming ND = S, we have

ND = S = 2NM ⇒ D = 2M

There is a contradiction because M is the maximum value of D. Therefore, if k is constant with regard to D then s(Ω) = 2^(n−1). Inversely, if s(Ω) = 2^(n−1) then k is

k = 2^(n−1) S / (2D (2^(n−1) − 2^(n−1)) + M (2^n − 2^(n−1))) = N/2

Given X-D network is combination of diagnostic relationship and X-gate inference:

P(Y = 1 | X1, X2, …, Xn) = P(X1 x X2 x … x Xn)

P(D | Y) = D/S if Y = 1; P(D | Y) = M/S − D/S if Y = 0

The diagnostic condition of X-D network is satisfied if and only if

s(Ω) = Σ_a P(Y = 1 | a(Ω)) = 2^(|Ω|−1), ∀Ω ≠ ∅

At that time, the transformation coefficient becomes:

k = N/2

Note that weights pi = wi and ρi = ωi, which are inputs of s(Ω), are abstract variables. Thus, the equality s(Ω) = 2^(|Ω|−1) implies all abstract variables are removed and so s(Ω) does not depend on weights.

Table 7. Diagnostic theorem.

In general, the event that k is constant with regard to D is equivalent to the event s(Ω) = 2^(n−1). This implies the diagnostic theorem stated in Table 7.

The diagnostic theorem is the optimal way to validate the diagnostic condition.

Eq. (30) becomes simple with AND-gate inference. Recall that Eq. (14) specified AND-gate inference as follows:

P(X1 ⊙ X2 ⊙ … ⊙ Xn) = P(Y = 1 | X1, X2, …, Xn) = Π_{i=1}^{n} pi if all Xi(s) are 1; 0 if there exists at least one Xi = 0

Due to only one case X1 = X2 = … = Xn = 1, we have

s(Ω) = s(Ω:{Xi = 1}) = Π_{i=1}^{n} pi

Due to Xi = 0, we have

s(Ω:{Xi = 0}) = 0

Derived from Eq. (30), Eq. (31) specifies conditional probability P(D|Xi), posterior probability P(Xi|D), and transformation coefficient according to X-D network with AND-gate inference, called AND-D network.

P(D | Xi = 1) = ((2D − M) Π_{i=1}^{n} pi + 2^(n−1)(M − D)) / (2^(n−1) S)

P(D | Xi = 0) = (M − D)/S

P(Xi = 1 | D) = ((2D − M) Π_{i=1}^{n} pi + 2^(n−1)(M − D)) / ((2D − M) Π_{i=1}^{n} pi + 2^n (M − D))     (31)

P(Xi = 0 | D) = 2^(n−1)(M − D) / ((2D − M) Π_{i=1}^{n} pi + 2^n (M − D))

k = 2^(n−1) S / ((2D − M) Π_{i=1}^{n} pi + 2^n (M − D))
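Eq. (31) can be evaluated directly; a hypothetical sketch assuming uniform sources, with `p` the weights pi, `D` the observed evidence value, and `M` its maximum:

```python
from math import prod

def and_d_posterior(p, D, M):
    """P(X_i = 1 | D) for the AND-D network per Eq. (31)."""
    n = len(p)
    s = prod(p)  # s(Omega) = s(Omega:{Xi = 1}) = product of all p_i
    denom = (2 * D - M) * s + 2 ** n * (M - D)
    return ((2 * D - M) * s + 2 ** (n - 1) * (M - D)) / denom

def and_d_posterior_off(p, D, M):
    """P(X_i = 0 | D) per Eq. (31); complements and_d_posterior."""
    n = len(p)
    s = prod(p)
    return 2 ** (n - 1) * (M - D) / ((2 * D - M) * s + 2 ** n * (M - D))
```

At D = M the posterior P(Xi = 1 | D) reaches 1, and the two posteriors always sum to 1.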

For convenience, we validate diagnostic condition with a case of two sources Ω = {X1, X2}, p1 = p2 = w1 = w2 = 0.5, D ∈ {0, 1, 2, 3}. According to the diagnostic theorem stated in Table 7, if s(Ω) ≠ 2 for a given X-gate then such X-gate does not satisfy diagnostic condition.

Given AND-gate inference, by applying Eq. (14), we have

s(Ω) = (0.5 × 0.5) + 0 + 0 + 0 = 0.25

Given OR-gate inference, by applying Eq. (16), we have

s(Ω) = (1 − 0.5 × 0.5) + (1 − 0.5) + (1 − 0.5) + 0 = 0.75 + 0.5 + 0.5 = 1.75

Given XOR-gate inference, by applying Eq. (19), we have

s(Ω) = (0.5 × 0.5 + 0.5 × 0.5) + 0.5 + 0.5 + 0 = 1.5

Given XNOR-gate inference, by applying Eq. (21), we have

s(Ω) = (0.5 × 0.5 + 0.5 × 0.5) + 0.5 + 0.5 + 1 = 2.5

Given SIGMA-gate inference, by applying Eq. (23), we have

s(Ω) = (0.5 + 0.5) + 0.5 + 0.5 + 0 = 2

It is asserted that AND-gate, OR-gate, XOR-gate, and XNOR-gate do not satisfy diagnostic condition, and so they should not be used to assess hypotheses. However, it is not asserted whether U-gate and SIGMA-gate satisfy such diagnostic condition. It is necessary to extend the equation for SIGMA-gate diagnostic network (called SIGMA-D network) in order to validate it.
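These arrangement sums can be replayed numerically by summing P(Y = 1 | X1, X2) over the four arrangements; the gate definitions below hard-code the example's values p1 = p2 = 0.5 and are illustrative only:

```python
from itertools import product

# Each lambda returns P(Y = 1 | X1, X2); values are hard-coded from
# Eqs. (14), (16), (19), (21), and (23) at p = 0.5 (illustrative sketch).
gates = {
    "AND":   lambda x1, x2: 0.25 if (x1, x2) == (1, 1) else 0.0,
    "OR":    lambda x1, x2: 1 - 0.5 ** (x1 + x2),
    "XOR":   lambda x1, x2: {0: 0.0, 1: 0.5, 2: 0.5}[x1 + x2],
    "XNOR":  lambda x1, x2: {0: 1.0, 1: 0.5, 2: 0.5}[x1 + x2],
    "SIGMA": lambda x1, x2: 0.5 * x1 + 0.5 * x2,
}

def s_omega(gate):
    """s(Omega): sum of P(Y = 1 | X1, X2) over all arrangements of (X1, X2)."""
    return sum(gate(x1, x2) for x1, x2 in product((0, 1), repeat=2))
```

Only SIGMA reaches 2^(n−1) = 2 here, in line with the diagnostic theorem.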

In case of SIGMA-gate inference, by applying Eq. (23), we have

Σ_i wi = 1

s(Ω) = 2^(n−1) Σ_i wi = 2^(n−1)

s(Ω:{Xi = 1}) = 2^(n−1) wi + 2^(n−2) Σ_{j≠i} wj = 2^(n−1) wi + 2^(n−2)(1 − wi) = 2^(n−2)(1 + wi)

s(Ω:{Xi = 0}) = s(Ω) − s(Ω:{Xi = 1}) = 2^(n−2)(1 − wi)

It is necessary to validate SIGMA-D network with SIGMA-gate bi-inference. By applying Eq. (26), we recalculate these quantities as follows:

s(Ω) = 2^(n−1) Σ_i wi + 2^(n−1) Σ_i di = 2^(n−1) Σ_i (wi + di) = 2^(n−1)

(due to Σ_i (wi + di) = 1)

s(Ω:{Xi = 1}) = 2^(n−1) wi + 2^(n−2) Σ_{j≠i} wj + 2^(n−2) Σ_i di = 2^(n−2) wi + 2^(n−2) Σ_i (wi + di) = 2^(n−2)(1 + wi)

s(Ω:{Xi = 0}) = s(Ω) − s(Ω:{Xi = 1}) = 2^(n−2)(1 − wi)

Obviously, quantities s(Ω), s(Ω:{Xi = 1}), and s(Ω:{Xi = 0}) are kept intact. According to the diagnostic theorem, we conclude that SIGMA-D network does satisfy diagnostic condition due to s(Ω) = 2^(n−1). Thus, SIGMA-D network can be used to assess hypotheses.

Eq. (32), an immediate consequence of Eq. (30), specifies conditional probability P(D|Xi), posterior probability P(Xi|D), and transformation coefficient for SIGMA-D network.

P(D | Xi = 1) = ((2D − M) wi + M) / (2S)

P(D | Xi = 0) = ((M − 2D) wi + M) / (2S)

P(Xi = 1 | D) = ((2D − M) wi + M) / (2M)     (32)

P(Xi = 0 | D) = ((M − 2D) wi + M) / (2M)

k = N/2
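A direct evaluation of Eq. (32); hypothetical names, uniform hypotheses assumed:

```python
def sigma_d_posterior(w_i, D, M):
    """Eq. (32): P(X_i = 1 | D) = ((2D - M) w_i + M) / (2M) for SIGMA-D."""
    return ((2 * D - M) * w_i + M) / (2 * M)
```

The posterior grows linearly in D, from (1 − wi)/2 at D = 0 to (1 + wi)/2 at D = M, and the two posteriors of Eq. (32) always sum to 1.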

In case of SIGMA-gate, the augmented variable Y can be removed from X-D network. The
evidence D is now established as direct target variable. Figure 6 shows a so-called direct
SIGMA-gate diagnostic network (direct SIGMA-D network).

Derived from Eq. (23), the CPT of direct SIGMA-D network is determined by Eq. (33).

P(D | X1, X2, …, Xn) = Σ_{i∈K} wi D/S + Σ_{j∈L} wj (M − D)/S

where the set of Xi(s) is complete and mutually exclusive:

Σ_{i=1}^{n} wi = 1
Xi ∩ Xj = ∅, ∀i ≠ j     (33)

Eq. (33) specifies valid CPT due to

Σ_D P(D | X1, X2, …, Xn) = (1/S) Σ_{i∈K} wi Σ_D D + (1/S) Σ_{j∈L} wj Σ_D (M − D)
= (1/S) Σ_{i∈K} S wi + (1/S) Σ_{j∈L} wj (NM − S) = (1/S) Σ_{i∈K} S wi + (1/S) Σ_{j∈L} S wj = Σ_{i=1}^{n} wi = 1

From dependencies shown in Figure 6, Eq. (34) specifies the joint probability of direct SIGMA-D network.

P(X1, X2, …, Xn, D) = P(D | X1, X2, …, Xn) Π_{i=1}^{n} P(Xi)     (34)

Inferred from Eq. (29), Eq. (35) specifies the joint probability P(Xi, D) and the marginal probability P(D) of direct SIGMA-D network, given uniform distribution of all sources.

P(Xi, D) = (1/2^n) s(Ω:{Xi})

P(D) = (1/2^n) s(Ω)     (35)

where s(Ω) and s(Ω:{Xi}) are specified in Table 2.


By browsing all variables of direct SIGMA-D network, we have

Figure 6. Direct SIGMA-gate diagnostic network (direct SIGMA-D network).



s(Ω:{Xi = 1}) = 2^(n−1) wi D/S + 2^(n−2) Σ_{j≠i} wj D/S + 2^(n−2) Σ_{j≠i} wj (M − D)/S

= (2^(n−2)/S)(2D wi + M Σ_{j≠i} wj) = (2^(n−2)/S)(2D wi + M(1 − wi))

(Due to Σ_{i=1}^{n} wi = 1)

= (2^(n−2)/S)((2D − M) wi + M)

Similarly, we have

s(Ω:{Xi = 0}) = 2^(n−1) wi (M − D)/S + 2^(n−2) Σ_{j≠i} wj (M − D)/S + 2^(n−2) Σ_{j≠i} wj D/S = (2^(n−2)/S)((M − 2D) wi + M)

s(Ω) = 2^(n−1) Σ_i wi D/S + 2^(n−1) Σ_i wi (M − D)/S = 2^(n−1) M/S

By applying Eq. (35), s(Ω:{Xi = 0}), s(Ω:{Xi = 1}), and s(Ω), we get the same result as Eq. (32).

P(D | Xi = 1) = ((2D − M) wi + M) / (2S)

P(D | Xi = 0) = ((M − 2D) wi + M) / (2S)

P(Xi = 1 | D) = ((2D − M) wi + M) / (2M)

P(Xi = 0 | D) = ((M − 2D) wi + M) / (2M)

k = N/2

Therefore, it is possible to use direct SIGMA-D network to assess hypotheses. It is asserted that SIGMA-D network satisfies diagnostic condition; the single diagnostic relationship, NOT-D network, and direct SIGMA-D network are specific cases of SIGMA-D network. There is a question: does an X-D network that is different from SIGMA-D network and not aforementioned exist such that it satisfies diagnostic condition?

Recall that each X-D network is a pattern owning a particular X-gate inference, which in turn is based on particular X-gate condition(s) relevant only to the variables Ai(s). The most general nonlinear X-D network is U-D network whereas SIGMA-D network is the linear one. The U-gate inference given arbitrary condition on U is

P(X1 ⊎ X2 ⊎ … ⊎ Xn) = Σ_{U∈U} (Π_{i∈U∩K} pi Π_{i∈U∩L} (1 − ρi)) (Π_{i∈Ū∩K} (1 − pi) Π_{i∈Ū∩L} ρi)

Let f be the arrangement sum of U-gate inference.

f(pi, ρi) = Σ_{a(Ω)} Σ_{U∈U} (Π_{i∈U∩K} pi Π_{i∈U∩L} (1 − ρi)) (Π_{i∈Ū∩K} (1 − pi) Π_{i∈Ū∩L} ρi)

The function f is sum of many large expressions and each expression is product of four possible sub-products (Π) as follows:

Expr = Π_{i∈U∩K} pi Π_{i∈U∩L} (1 − ρi) Π_{i∈Ū∩K} (1 − pi) Π_{i∈Ū∩L} ρi

In any case of degradation, there always exist expressions Expr having at least 2 sub-products (Π), for example,

Expr = Π_{i∈U∩K} pi Π_{i∈U∩L} (1 − ρi)

Consequently, there always exist Expr(s) having at least 5 terms relevant to pi and ρi if n ≥ 5, for example,

Expr = p1 p2 p3 (1 − ρ4)(1 − ρ5)

Thus, the degree of f will be larger than or equal to 5 given n ≥ 5. According to the diagnostic theorem, U-gate network satisfies diagnostic condition if and only if f(pi, ρi) = 2^(n−1) for all n ≥ 1 and for all abstract variables pi and ρi. Without loss of generality, each pi or ρi is the sum of a variable x and a variable ai or bi, respectively. Note that all pi, ρi, ai, and bi are abstract variables.

pi = x + ai
ρi = x + bi

The equation f − 2^(n−1) = 0 becomes equation g(x) = 0 whose degree is m ≥ 5 if n ≥ 5.

g(x) = −x^m + C_1 x^(m−1) + … + C_(m−1) x + C_m − 2^(n−1) = 0

where coefficients C_i(s) are functions of ai(s) and bi(s). According to Abel-Ruffini theorem [11], equation g(x) = 0 has no algebraic solution when m ≥ 5. Thus, abstract variables pi and ρi cannot be eliminated entirely from g(x) = 0, which causes that there is no specification of U-gate inference P(X1 x X2 x … x Xn) so that diagnostic condition is satisfied.

It is concluded that there is no nonlinear X-D network satisfying diagnostic condition, but a new question is raised: does there exist a general linear X-D network satisfying diagnostic condition? Such linear network is called GL-D network, and SIGMA-D network is a specific case of GL-D network. The GL-gate probability must be a linear combination of weights:

P(X1 x X2 x … x Xn) = C + Σ_{i=1}^{n} αi wi + Σ_{i=1}^{n} βi di

where C is an arbitrary constant.

The GL-gate inference is singular if αi and βi are functions of only Xi as follows:

P(X1 x X2 x … x Xn) = C + Σ_{i=1}^{n} hi(Xi) wi + Σ_{i=1}^{n} gi(Xi) di

The functions hi and gi are not relevant to Ai because the final equation of GL-gate inference is only relevant to Xi(s) and weights. Because GL-D network is a pattern, we only survey singular GL-gate. Mentioned GL-gate is singular by default and it is dependent on how to define functions hi and gi. The arrangement sum with regard to GL-gate is

s(Ω) = Σ_a (C + Σ_{i=1}^{n} hi(Xi) wi + Σ_{i=1}^{n} gi(Xi) di)
= 2^n C + 2^(n−1) Σ_{i=1}^{n} (hi(Xi = 1) + hi(Xi = 0)) wi + 2^(n−1) Σ_{i=1}^{n} (gi(Xi = 1) + gi(Xi = 0)) di

Suppose hi and gi are probability mass functions with regard to Xi. For all i, we have

0 ≤ hi(Xi) ≤ 1
0 ≤ gi(Xi) ≤ 1
hi(Xi = 1) + hi(Xi = 0) = 1
gi(Xi = 1) + gi(Xi = 0) = 1

The arrangement sum becomes

s(Ω) = 2^n C + 2^(n−1) Σ_{i=1}^{n} (wi + di)

GL-D network satisfies diagnostic condition if

s(Ω) = 2^n C + 2^(n−1) Σ_{i=1}^{n} (wi + di) = 2^(n−1)
⇒ 2C + Σ_{i=1}^{n} (wi + di) = 1

Suppose the set of Xi(s) is complete:

Σ_{i=1}^{n} (wi + di) = 1

This implies C = 0. Shortly, Eq. (36) specifies the singular GL-gate inference so that GL-D network satisfies diagnostic condition.

P(X1 x X2 x … x Xn) = Σ_{i=1}^{n} hi(Xi) wi + Σ_{i=1}^{n} gi(Xi) di

where hi and gi are probability mass functions and the set of Xi(s) is complete:

Σ_{i=1}^{n} Wi = 1     (36)

Functions hi(Xi) and gi(Xi) are always linear due to Xi^m = Xi for all m ≥ 1 when Xi is binary. It is easy to infer that SIGMA-D network is GL-D network with the following definition of functions hi and gi:

hi(Xi) = 1 − gi(Xi) = Xi, ∀i

According to Millán and Pérez-de-la-Cruz [4], a hypothesis can have multiple evidences as
seen in Figure 7. This is multi-evidence diagnostic relationship opposite to aforementioned multi-
hypothesis diagnostic relationship.

Figure 7 depicts the multi-evidence diagnostic network called M-E-D network in which there
are m evidences D1, D2,…, Dm and one hypothesis Y. Note that Y has uniform distribution.
In the simplest case where all evidences are binary, the joint probability of M-E-D network is

P(Y, D1, D2, …, Dm) = P(Y) Π_{j=1}^{m} P(Dj | Y) = P(Y) P(D1, D2, …, Dm | Y)

The product Π_{j=1}^{m} P(Dj | Y) is denoted as likelihood function as follows:

P(D1, D2, …, Dm | Y) = Π_{j=1}^{m} P(Dj | Y)

The posterior probability P(Y | D1, D2,…, Dm) given uniform distribution of Y is

Figure 7. Diagnostic relationship with multiple evidences (M-E-D network).

Figure 8. M-HE-D network.

P(Y | D1, D2, …, Dm) = P(Y, D1, D2, …, Dm) / (P(Y = 1, D1, D2, …, Dm) + P(Y = 0, D1, D2, …, Dm))

= 1 / (Π_{j=1}^{m} P(Dj | Y = 1) + Π_{j=1}^{m} P(Dj | Y = 0)) × P(D1, D2, …, Dm | Y)

The possible transformation coefficient is

1/k = Π_{j=1}^{m} P(Dj | Y = 1) + Π_{j=1}^{m} P(Dj | Y = 0)

M-E-D network will satisfy diagnostic condition if k = 1; because all hypotheses and evidences are binary, this requires that the following equation, specified by Eq. (37), have 2m real roots P(Dj|Y) for all m ≥ 2.

Π_{j=1}^{m} P(Dj | Y = 1) + Π_{j=1}^{m} P(Dj | Y = 0) = 1     (37)

Eq. (37) has no real root given m = 2, according to the following proof. Suppose Eq. (37) has 4 real roots as follows:

a1 = P(D1 = 1 | Y = 1)
a2 = P(D2 = 1 | Y = 1)
b1 = P(D1 = 1 | Y = 0)
b2 = P(D2 = 1 | Y = 0)

From Eq. (37), it holds

a1 a2 + b1 b2 = 1
a1 (1 − a2) + b1 b2 = 1
(1 − a1) a2 + b1 b2 = 1
a1 a2 + b1 (1 − b2) = 1
a1 a2 + (1 − b1) b2 = 1

⇒ a1 = a2, b1 = b2, a1² + b1² = 1, a1 + 2b1² = 2, b1 + 2a1² = 2

⇔ {a1 = a2 = 0, b1 = b2, a1² + b1² = 1, b1 = 2} or {a1 = a2 = 0.5, b1 = b2, a1² + b1² = 1, b1 = 1.5}
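The inconsistency of the system can also be seen numerically: a coarse grid search over (a1, a2, b1, b2) ∈ [0, 1]^4 never drives all five equations' residuals to zero. An illustrative sketch:

```python
from itertools import product

def max_residual(a1, a2, b1, b2):
    """Largest absolute deviation of the five m = 2 equations from 1."""
    eqs = (
        a1 * a2 + b1 * b2,
        a1 * (1 - a2) + b1 * b2,
        (1 - a1) * a2 + b1 * b2,
        a1 * a2 + b1 * (1 - b2),
        a1 * a2 + (1 - b1) * b2,
    )
    return max(abs(e - 1) for e in eqs)

grid = [i / 20 for i in range(21)]  # step 0.05 over [0, 1]
best = min(max_residual(a1, a2, b1, b2)
           for a1, a2, b1, b2 in product(grid, repeat=4))
```

`best` stays well above zero, consistent with the claim that no probability-valued roots exist.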

The final equation leads to a contradiction (b1 = 2 or b1 = 1.5), and so it is impossible to apply the sufficient diagnostic proposition to M-E-D network. Such proposition is only used for one-evidence network. Moreover, X-gate inference absorbs many sources and then produces one targeted result, whereas the M-E-D network essentially splits one source into many results. It is impossible to model M-E-D network by X-gates. The potential solution for this problem is to group many evidences D1, D2,…, Dm into one representative evidence D which in turn is dependent on hypothesis Y, but this solution will be inaccurate in specifying conditional probabilities because directions of dependencies become inconsistent (relationships from Dj to D and from Y to D), except that all Dj(s) are removed and D becomes a vector. However, an evidence vector does not simplify the hazardous problem; it changes the current problem into a new problem.

Another solution is to reverse the direction of relationship, in which the hypothesis is dependent on evidences so as to take advantage of X-gate inference as usual. However, the reversion method violates the viewpoint in this research where diagnostic relationship must be from hypothesis to evidence. In other words, we would have to change the viewpoint.
Another solution is based on a so-called partial diagnostic condition that is a loose case of diagnostic condition for M-E-D network, which is defined as follows:

P(Y | Dj) = k P(Dj | Y)

where k is constant with regard to Dj. The joint probability is

P(Y, D1, D2, …, Dm) = P(Y) Π_{j=1}^{m} P(Dj | Y)

M-E-D network satisfies partial diagnostic condition. In fact, given all variables are binary, we have
we have

P(Y | Dj) = Σ_{Ψ\{Y,Dj}} P(Y, D1, D2, …, Dm) / Σ_{Ψ\{Dj}} P(Y, D1, D2, …, Dm)

(Let Ψ = {D1, D2,…, Dm})

= P(Dj | Y) Π_{k=1, k≠j}^{m} (Σ_{Dk} P(Dk | Y)) / (Π_{k=1, k≠j}^{m} (Σ_{Dk} P(Dk | Y = 1)) + Π_{k=1, k≠j}^{m} (Σ_{Dk} P(Dk | Y = 0)))

(Due to uniform distribution of Y)

= P(Dj | Y) Π_{k=1, k≠j}^{m} 1 / (Π_{k=1, k≠j}^{m} 1 + Π_{k=1, k≠j}^{m} 1) = (1/2) P(Dj | Y)

(Due to Σ_{Dk} P(Dk | Y) = P(Dk = 0 | Y) + P(Dk = 1 | Y) = 1)

Partial diagnostic condition expresses a different viewpoint. It is not an optimal solution because, for example, we cannot test a disease based on only one symptom while ignoring other obvious symptoms. The equality P(Y|Dj) = 0.5P(Dj|Y) indicates that the accuracy is halved. However, Bayesian network provides an inference mechanism based on personal belief; it is subjective. You can use partial diagnostic condition if you think that such condition is appropriate to your application.
If we are successful in specifying conditional probabilities of M-E-D network, it is possible
to define an extended network which is constituted of n hypotheses X1, X2,…, Xn and m
evidences D1, D2,…, Dm. Such extended network represents multi-hypothesis multi-evidence
diagnostic relationship, called M-HE-D network. Figure 8 depicts M-HE-D network.
The M-HE-D network is the most general case of diagnostic network, which was mentioned in
Ref. ([4], p. 297). We can construct any large diagnostic BN from M-HE-D networks and so the
research is still open.

5. Conclusion

In short, relationship conversion is to determine conditional probabilities based on logic gates


that are adhered to semantics of relationships. The weak point of logic gates is to require that
all variables must be binary. For example, in learning context, it is inconvenient for expert to
create an assessment BN with studying exercises (evidences) whose marks are only 0 and 1. In
order to lessen the impact of such weak point, the numeric evidence is used for extending
capacity of simple Bayesian network. However, combination of binary hypothesis and
138 Bayesian Inference

numeric evidence leads to errors or biases in inference. For example, given a student who gets
the maximum grade for an exercise, the built-in inference may conclude that he/she has not
fully mastered the associated learning concept (hypothesis). Therefore, I propose the sufficient
diagnostic proposition to confirm that numeric evidence is adequate for making complicated
inference tasks in a BN; probabilistic reasoning based on such evidence then remains accurate.
Applications of this research can go beyond the learning context whenever probabilistic deduction
relevant to the constraints of semantic relationships is required. A large BN can be constituted of
many simple BNs. Inference in a large BN is a hard problem, and there are many optimal
algorithms for solving it. In the future, I will research effective inference methods for
the special BN that is constituted of the X-gate BNs mentioned in this research, because X-gate
BNs have precise and useful features of which we should take advantage. For instance, their
CPTs are simple in some cases, and the meanings of their relationships are mandatory in
many applications. Moreover, I will try my best to research the M-E-D network and M-HE-D
network more deeply, whose problems I cannot yet solve completely.
Two main documents that I referred to in this research are the book "Learning Bayesian
Networks" [2] by Richard E. Neapolitan and the article "A Bayesian Diagnostic
Algorithm for Student Modeling and its Evaluation" [4] by Eva Millán and José Luis
Pérez-de-la-Cruz. In particular, the SIGMA-gate inference is based on and derived from the work
of Eva Millán and José Luis Pérez-de-la-Cruz. This research originated from my PhD
research "A User Modeling System for Adaptive Learning" [12]. Other references relevant
to user modeling, the overlay model, and Bayesian networks are [13–16]; the reader may
consult these references.

Appendices

A1. Following is the proof of Eq. (9)

$$
\begin{aligned}
P(A_i = \mathrm{ON} \mid X_i) &= P(A_i = \mathrm{ON} \mid X_i, I_i = \mathrm{ON})\,P(I_i = \mathrm{ON}) + P(A_i = \mathrm{ON} \mid X_i, I_i = \mathrm{OFF})\,P(I_i = \mathrm{OFF}) \\
&= 0 \cdot (1 - p_i) + P(A_i = \mathrm{ON} \mid X_i, I_i = \mathrm{OFF})\,p_i && \text{(by applying Eq. (8))} \\
&= p_i\,P(A_i = \mathrm{ON} \mid X_i, I_i = \mathrm{OFF})
\end{aligned}
$$

It implies

$$P(A_i = \mathrm{ON} \mid X_i = 1) = p_i\,P(A_i = \mathrm{ON} \mid X_i = 1, I_i = \mathrm{OFF}) = p_i$$

$$P(A_i = \mathrm{ON} \mid X_i = 0) = p_i\,P(A_i = \mathrm{ON} \mid X_i = 0, I_i = \mathrm{OFF}) = 0$$

$$P(A_i = \mathrm{OFF} \mid X_i = 1) = 1 - P(A_i = \mathrm{ON} \mid X_i = 1) = 1 - p_i$$

$$P(A_i = \mathrm{OFF} \mid X_i = 0) = 1 - P(A_i = \mathrm{ON} \mid X_i = 0) = 1 \quad \blacksquare$$


Converting Graphic Relationships into Conditional Probabilities in Bayesian Network 139
http://dx.doi.org/10.5772/intechopen.70057

A2. Following is the proof of Eq. (10)

$$
\begin{aligned}
P(Y \mid X_1, X_2, \dots, X_n) &= \frac{P(Y, X_1, X_2, \dots, X_n)}{P(X_1, X_2, \dots, X_n)} && \text{(due to Bayes' rule)} \\
&= \frac{\sum_{A_1, \dots, A_n} P(Y, X_1, \dots, X_n \mid A_1, \dots, A_n)\,P(A_1, \dots, A_n)}{P(X_1, \dots, X_n)} && \text{(due to total probability rule)} \\
&= \sum_{A_1, \dots, A_n} P(Y \mid A_1, \dots, A_n)\,P(X_1, \dots, X_n \mid A_1, \dots, A_n)\,\frac{P(A_1, \dots, A_n)}{P(X_1, \dots, X_n)} \\
&\qquad \text{(because } Y \text{ is conditionally independent from the } X_i \text{ given the } A_i\text{)} \\
&= \sum_{A_1, \dots, A_n} P(Y \mid A_1, \dots, A_n)\,\frac{P(X_1, \dots, X_n, A_1, \dots, A_n)}{P(X_1, \dots, X_n)} \\
&= \sum_{A_1, \dots, A_n} P(Y \mid A_1, \dots, A_n)\,P(A_1, \dots, A_n \mid X_1, \dots, X_n) && \text{(due to Bayes' rule)} \\
&= \sum_{A_1, \dots, A_n} P(Y \mid A_1, \dots, A_n)\prod_{i=1}^{n} P(A_i \mid X_1, \dots, X_n) && \text{(because the } A_i \text{ are mutually independent)} \\
&= \sum_{A_1, \dots, A_n} P(Y \mid A_1, \dots, A_n)\prod_{i=1}^{n} P(A_i \mid X_i) && \text{(because each } A_i \text{ depends only on } X_i\text{)} \quad \blacksquare
\end{aligned}
$$
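Eq. (10) reduces the CPT of Y to per-cause probabilities P(A_i | X_i) combined through a gate function P(Y | A_1,…,A_n). The following Python sketch enumerates all configurations of the A_i to build one CPT entry; the OR gate and the numeric values of the p_i are illustrative assumptions, and `p_a_given_x` encodes the A1 result P(A_i=ON | X_i=1) = p_i, P(A_i=ON | X_i=0) = 0:

```python
from itertools import product

def p_a_given_x(a, x, p):
    # From Appendix A1: P(A=ON|X=1) = p and P(A=ON|X=0) = 0.
    on = p if x == 1 else 0.0
    return on if a == 1 else 1.0 - on

def cpt_entry(xs, ps, gate):
    # Eq. (10): P(Y|X) = sum over all A of P(Y|A) * prod_i P(A_i|X_i).
    total = 0.0
    for As in product([0, 1], repeat=len(xs)):
        w = 1.0
        for a, x, p in zip(As, xs, ps):
            w *= p_a_given_x(a, x, p)
        total += gate(As) * w
    return total

# OR gate: Y = ON iff at least one augmented variable is ON (noisy-OR case).
or_gate = lambda As: 1.0 if any(As) else 0.0
ps = [0.9, 0.8, 0.7]
val = cpt_entry((1, 1, 0), ps, or_gate)
closed = 1.0 - (1 - 0.9) * (1 - 0.8)  # only active causes can fail to fire
print(round(val, 6), round(closed, 6))  # both 0.98
```

For the OR gate, the enumeration reproduces the familiar noisy-OR closed form 1 − Π over active causes of (1 − p_i), which serves as a quick sanity check of Eq. (10).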

A3. Following is the proof that the augmented X-D network (shown in Figure 5) is equivalent
to the X-D network (shown in Figures 2 and 3) with regard to variables X1, X2,…, Xn,
and D.

The joint probability of the augmented X-D network shown in Figure 5 is

$$P(X_1, X_2, \dots, X_n, Y, D) = P(D \mid Y)\,P(Y \mid X_1, X_2, \dots, X_n)\prod_{i=1}^{n} P(X_i)$$

The joint probability of the X-D network is

$$P(X_1, X_2, \dots, X_n, D) = P(D \mid X_1, X_2, \dots, X_n)\prod_{i=1}^{n} P(X_i)$$

By applying the total probability rule to the X-D network, we have

$$
\begin{aligned}
P(X_1, X_2, \dots, X_n, D) &= \frac{P(D, X_1, X_2, \dots, X_n)}{P(X_1, X_2, \dots, X_n)}\prod_{i=1}^{n} P(X_i) && \text{(due to Bayes' rule)} \\
&= \frac{\sum_{Y} P(D, X_1, \dots, X_n \mid Y)\,P(Y)}{P(X_1, \dots, X_n)}\prod_{i=1}^{n} P(X_i) && \text{(due to total probability rule)} \\
&= \sum_{Y}\left( P(D, X_1, \dots, X_n \mid Y)\,\frac{P(Y)}{P(X_1, \dots, X_n)} \right)\prod_{i=1}^{n} P(X_i) \\
&= \sum_{Y}\left( P(D \mid Y)\,\frac{P(X_1, \dots, X_n \mid Y)\,P(Y)}{P(X_1, \dots, X_n)} \right)\prod_{i=1}^{n} P(X_i) \\
&\qquad \text{(because } D \text{ is conditionally independent from all } X_i \text{ given } Y\text{)} \\
&= \sum_{Y}\left( P(D \mid Y)\,\frac{P(Y, X_1, \dots, X_n)}{P(X_1, \dots, X_n)} \right)\prod_{i=1}^{n} P(X_i) \\
&= \sum_{Y} P(D \mid Y)\,P(Y \mid X_1, \dots, X_n)\prod_{i=1}^{n} P(X_i) && \text{(due to Bayes' rule)} \\
&= \sum_{Y} P(X_1, X_2, \dots, X_n, Y, D) \quad \blacksquare
\end{aligned}
$$
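The equivalence proved above can also be checked numerically: marginalizing Y out of the augmented X-D joint must reproduce P(X_1,…,X_n, D). A small brute-force sketch; all probability tables below are arbitrary illustrative values, not taken from the chapter:

```python
from itertools import product
import random

random.seed(0)
n = 3
pX = [0.6, 0.3, 0.8]                                            # P(X_i = 1)
pY = {xs: random.random() for xs in product([0, 1], repeat=n)}  # P(Y = 1 | X)
pD = {0: 0.2, 1: 0.7}                                           # P(D = 1 | Y)

def prior_x(xs):
    px = 1.0
    for x, p in zip(xs, pX):
        px *= p if x == 1 else 1 - p
    return px

def augmented(xs, y, d):
    # Augmented X-D network: P(X, Y, D) = P(D|Y) P(Y|X) prod_i P(X_i)
    py = pY[xs] if y == 1 else 1 - pY[xs]
    pd = pD[y] if d == 1 else 1 - pD[y]
    return pd * py * prior_x(xs)

def xd_network(xs, d):
    # X-D network with Y marginalized: P(D|X) = sum_y P(D|y) P(y|X)
    pdx = sum((pD[y] if d == 1 else 1 - pD[y]) *
              (pY[xs] if y == 1 else 1 - pY[xs]) for y in (0, 1))
    return pdx * prior_x(xs)

for xs in product([0, 1], repeat=n):
    for d in (0, 1):
        assert abs(xd_network(xs, d) - sum(augmented(xs, y, d) for y in (0, 1))) < 1e-12
print("X-D network and augmented X-D network agree on P(X, D)")
```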

A4. Following is the proof of Eq. (29)

Given the uniform distribution of the $X_i$, we have

$$P(X_1) = P(X_2) = \dots = P(X_n) = \frac{1}{2}$$

The joint probability becomes

$$P(\Omega, Y, D) = \frac{1}{2^n}\,P(Y \mid X_1, X_2, \dots, X_n)\,P(D \mid Y)$$

The joint probability of $X_i$ and $D$ is obtained by summing $P(\Omega, Y, D)$ over all variables except $X_i$ and $D$, i.e., over $Y$ and all $2^{n-1}$ assignments $a$ of the variables in $\Omega \setminus \{X_i\}$; due to Eq. (6),

$$P(X_i, D) = \sum_{\{\Omega, Y, D\}\setminus\{X_i, D\}} P(\Omega, Y, D)
= \frac{1}{2^n}\frac{D}{S}\sum_{a} P\big(Y = 1 \mid a(\Omega:\{X_i\})\big)
+ \frac{1}{2^n}\frac{M-D}{S}\sum_{a} P\big(Y = 0 \mid a(\Omega:\{X_i\})\big)$$

The marginal probability of $D$ is obtained in the same way, summing over $Y$ and all $2^{n}$ assignments $a$ of the variables in $\Omega$:

$$P(D) = \sum_{\{\Omega, Y, D\}\setminus\{D\}} P(\Omega, Y, D)
= \frac{1}{2^n}\frac{D}{S}\sum_{a} P\big(Y = 1 \mid a(\Omega)\big)
+ \frac{1}{2^n}\frac{M-D}{S}\sum_{a} P\big(Y = 0 \mid a(\Omega)\big)$$

By applying Table 2, the joint probability $P(X_i, D)$ is determined as follows:

$$
\begin{aligned}
P(X_i, D) &= \frac{1}{2^n S}\left( D\sum_{a} P\big(Y = 1 \mid a(\Omega:\{X_i\})\big) + (M-D)\sum_{a} P\big(Y = 0 \mid a(\Omega:\{X_i\})\big) \right) \\
&= \frac{1}{2^n S}\left( D\sum_{a} P\big(Y = 1 \mid a(\Omega:\{X_i\})\big) + (M-D)\sum_{a}\Big(1 - P\big(Y = 1 \mid a(\Omega:\{X_i\})\big)\Big) \right) \\
&= \frac{1}{2^n S}\Big( (2D - M)\,s(\Omega:\{X_i\}) + 2^{n-1}(M-D) \Big)
\end{aligned}
$$

Similarly, the marginal probability $P(D)$ is

$$P(D) = \frac{1}{2^n S}\Big( (2D - M)\,s(\Omega) + 2^{n}(M-D) \Big) \quad \blacksquare$$
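The closed form for P(D) can be checked by brute-force summation. This sketch assumes, following the derivation above, that P(D | Y=1) = D/S and P(D | Y=0) = (M−D)/S per Eq. (6), and that s(Ω) = Σ_a P(Y=1 | a(Ω)); the numeric values of n, S, M, D and the table P(Y=1 | X) are illustrative:

```python
from itertools import product
import random

random.seed(1)
n, S, M, D = 3, 10.0, 6.0, 4.0  # illustrative values only
pY1 = {xs: random.random() for xs in product([0, 1], repeat=n)}  # P(Y=1|X)

def joint(xs, y):
    # P(Omega, Y, D) = (1/2^n) P(Y|X) P(D|Y), with P(D|Y=1) = D/S, P(D|Y=0) = (M-D)/S
    py = pY1[xs] if y == 1 else 1 - pY1[xs]
    pd = (D / S) if y == 1 else ((M - D) / S)
    return py * pd / 2 ** n

# Brute-force marginal P(D)
pD_brute = sum(joint(xs, y) for xs in product([0, 1], repeat=n) for y in (0, 1))

# Closed form: P(D) = ((2D - M) s(Omega) + 2^n (M - D)) / (2^n S)
s_omega = sum(pY1[xs] for xs in product([0, 1], repeat=n))
pD_closed = ((2 * D - M) * s_omega + 2 ** n * (M - D)) / (2 ** n * S)
print(abs(pD_brute - pD_closed) < 1e-12)  # True
```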

Author details

Loc Nguyen

Address all correspondence to: [email protected]


Sunflower Soft Company, An Giang, Vietnam

References

[1] Wikipedia. Logic gate. Wikimedia Foundation [Internet]. 2016. [Online]. Available from:
https://en.wikipedia.org/wiki/Logic_gate [Accessed June 4, 2016]
[2] Neapolitan RE. Learning Bayesian Networks. Upper Saddle River, New Jersey: Prentice
Hall; 2003. p. 674
[3] Díez FJ, Druzdzel MJ. Canonical Probabilistic Models. Madrid: Research Centre on Intel-
ligent Decision-Support Systems; 2007

[4] Millán E, Pérez-de-la-Cruz JL. A Bayesian diagnostic algorithm for student modeling and
its evaluation. User Modeling and User-Adapted Interaction. 2002;12(2-3):281–330

[5] Wikipedia. Factor graph. Wikimedia Foundation [Internet]. 2015. [Online]. Available
from: https://en.wikipedia.org/wiki/Factor_graph [Accessed: February 8, 2017]
[6] Kschischang FR, Frey BJ, Loeliger HA. Factor graphs and the sum-product algorithm.
IEEE Transactions on Information Theory. 2001;47(2):498–519
[7] Pearl J. Fusion, propagation, and structuring in belief networks. Artificial Intelligence.
1986;29(3):241–288
[8] Millán E, Loboda T, Pérez-de-la-Cruz JL. Bayesian networks for student model engineering.
Computers & Education. 2010;55(4):1663–1683

[9] Nguyen L. Theorem of SIGMA-gate inference in Bayesian network. Wulfenia Journal.
2016;23(3):280–289
[10] Wikipedia. Set (mathematics), Wikimedia Foundation [Internet]. 2014. [Online]. Available
from: http://en.wikipedia.org/wiki/Set_(mathematics) [Accessed: October 11, 2014]
[11] Wikipedia. Abel-Ruffini theorem. Wikimedia Foundation [Internet]. 2016. [Online]. Available
from: https://en.wikipedia.org/wiki/Abel%E2%80%93Ruffini_theorem [Accessed: June 26,
2016]
[12] Nguyen L. A User Modeling System for Adaptive Learning. Abuja, Nigeria: Standard
Research Journals; 2014

[13] Fröschl C. User modeling and user profiling in adaptive E-learning systems [master
thesis]. Graz, Austria: Graz University of Technology; 2005

[14] De Bra P, Smits D, Stash N. The Design of AHA!. In Proceedings of the Seventeenth ACM
Hypertext Conference on Hypertext and hypermedia (Hypertext ’06); 22-25 August 2006;
Odense, Denmark. New York, NY: ACM; 2006. pp. 133–134
[15] Murphy KP. A Brief Introduction to Graphical Models and Bayesian Networks. Univer-
sity of British Columbia; 1998. [Online]. Available from: http://www.cs.ubc.ca/~murphyk/
Bayes/bnintro.html [Accessed: 2008]
[16] Heckerman D. A Tutorial on Learning With Bayesian Networks. Redmond: Microsoft
Research; 1995
Section 2

Applications of Bayesian Inference in Life Sciences
Chapter 7

Bayesian Estimation of Multivariate Autoregressive Hidden Markov Model with Application to Breast Cancer Biomarker Modeling

Hamid El Maroufy, El Houcine Hibbah, Abdelmajid Zyad and Taib Ziad

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/intechopen.70053

Abstract
In this work, a first-order autoregressive hidden Markov model (AR(1)HMM) is proposed. It
is a suitable model to characterize a marker of breast cancer disease progression, essentially
progression that follows from a reaction to a treatment or from natural developments. The
model supposes we have observations that increase or decrease in relation to a hidden
phenomenon. We would like to discover whether the information in those observations can
let us learn about the progression of the phenomenon and permit us to evaluate the transitions
between its states (supposed discrete here). The hidden states governed by the Markovian
process would be the disease stages, and the marker observations would be the dependent
observations. The parameters of the autoregressive model are selected at the first level
according to a Markov process, and at the second level, the next observation is generated
from a standard first-order autoregressive model (unlike other models, which consider the
successive observations to be independent). A Markov Chain Monte Carlo (MCMC) method is
used for the parameter estimation, where we develop the posterior density for each parameter
and use a joint estimation of the hidden states, or block update of the states.

Keywords: autoregressive hidden Markov model, breast cancer progression marker, Gibbs sampler, hidden states joint estimation, Markov Chain Monte Carlo

1. Introduction

The main motivation behind this work is to characterize progression in breast cancer. In fact,
disease progression cannot be assessed correctly without the use of biomarkers, which would

© 2017 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use,
distribution, and reproduction in any medium, provided the original work is properly cited.


effectively monitor the evolution of the patient's health state; this is the case for breast cancer.
The major challenge in this matter for researchers and clinicians is to unravel the stage of the
disease, so as to tailor the treatment for each patient and to monitor the response of a patient to a
treatment.

Currently, studies have shown that there is a correlation between the levels of certain markers,
such as cancer antigen CA15-3, carcinoembryonic antigen (CEA), and serum HER2 Neu, and
the stage of the disease [1]. This gives an opportunity to use a hidden Markov model (HMM)
to predict the stage of the disease based on biomarker data and to address the effectiveness of
treatments through their influence on the transition of the cancer from one state to another. In
an HMM, we have two constituents: the Markovian hidden process, suitable for representing the
breast cancer stage, and the observation process, given by the biomarker data. In this way, we
can learn about the disease transition rates and how it progresses from primary breast cancer
to an advanced cancer stage, for example.

Indeed, HMM is a useful tool for tackling numerous concrete problems in many fields; some
possible applications of HMM are in speech processing [2], biology [3], disease progression
[4], economics [5, 6], and gene expression [7]. For a complete review of HMM, the
reader is referred to Zucchini and MacDonald [8], in which properties and definitions of HMM
are presented in a plausible way, with both classical estimation by the maximum likelihood method
and the expectation maximization (EM) algorithm, and the newer Bayesian inference is addressed.

The model we consider here is a variation of the regular hidden Markov model, since we use
extensions to incorporate dependence among successive observations, suggesting autoregressive
dependence among continuous observations. Consequently, we have relaxed the conditional
independence assumption of a standard HMM, because we would like to add some dynamics
to the patient's disease progression and because, in reality, the current patient biomarker observation
depends on the past one. In fact, the autoregressive assumption in HMM has shown its
advantage over the regular HMM, which cannot capture the strong dependence between successive
observations (e.g., Ref. [9]). A model similar to ours can be found in Ref. [10]. This kind of
model, first proposed in Ref. [11] to describe econometric time series, is a generalization of
both HMMs and autoregressive models; it is effective in representing multiple heterogeneous
dynamics, such as disease progression dynamics, and can even be generalized to
regime-switching ARMA models, as in Ref. [12].

Moreover, our model can also be viewed as an extension of the multivariate double-chain
Markov model (DCMM) developed in Ref. [13], where there are two discrete first-order Markov
chains: the first Markov chain is observed and the second one is hidden. In contrast to
this DCMM, our multivariate first-order autoregressive hidden Markov model (MAR(1)HMM)
leads to continuous observations, where each observation, conditional on the hidden process,
depends on the previous observation according to a first-order autoregressive process.
This dynamic is promising for continuously observed disease biomarkers.
Parameter estimation is very challenging for models in the HMM family, since the likelihood is not
available in closed form most of the time. Thus, we use a Markov Chain Monte Carlo
(MCMC) procedure instead of a maximum likelihood-based approach. This choice arises from
the fact that the Bayesian analysis uses prior knowledge about the process being measured,
Bayesian Estimation of Multivariate Autoregressive Hidden Markov Model with Application to Breast Cancer… 149
http://dx.doi.org/10.5772/intechopen.70053

and it allows direct probability statements and an approximation of posterior distributions for
the parameters. In the maximum likelihood approach, by contrast, we cannot declare a prior
or have an exact distribution for the parameters when the likelihood is intractable or when we
have missing data (e.g., Refs. [14–16]).
Since the realization of an HMM includes two separate entities, the parameters and the hidden
states, the Bayesian computation is carried out after augmenting the likelihood with the missing
hidden states [17]. The hidden states are sampled using a Gibbs sampler adopting a joint estimation
of the hidden states, or block update of the states (instead of a single update of each state
separately), by means of a forward filtering/backward smoothing algorithm. Given the hidden
states, we can compute the autoregressive parameters and the transition probabilities of the
Markov chain by Gibbs sampling from their posterior densities, after specifying conjugate priors
for the parameters. Hence, the MCMC algorithm alternates between simulating the hidden
states and the parameters. Finally, we can obtain posterior statistics such as means, standard
deviations, and confidence intervals after assessing the convergence of the MCMC algorithm.

This chapter is organized as follows: after a preliminary on HMM, a description of the model is
given in Section 3. In Section 4, we give the Bayesian estimation of the parameters and the
hidden states and provide the details of the MCMC algorithm, before presenting the results of
simulation studies in Section 5; we finish with a conclusion.

2. Preliminary

Since the model suggested is of the HMM type, we will describe HMM in more detail. An
HMM is a stochastic process $\{X_t, Y_t\}_{t=0}^{T}$, where $\{X_t\}_{t=0}^{T}$ is a hidden (unobservable) Markov chain
and $\{Y_t\}_{t=0}^{T}$ is a sequence of observable independent random variables such
that $Y_t$ depends only on $X_t$ for the time $t = 0, 1, \dots, T$. Here the process $\{X_t\}_{t=0}^{T}$ evolves independently
of $\{Y_t\}_{t=0}^{T}$ and is supposed to be a homogeneous finite Markov chain with probability
transition matrix $\Pi$ of dimension $a \times a$, where $a$ indicates the number of hidden states and
$\Pi_0 = (\Pi_{01}, \dots, \Pi_{0a})$ is the initial state distribution.

We denote the probability density function of $Y_t = y_t$ given $X_t = k$, for $k \in \{1, \dots, a\}$, by $P_{x_t}(y_t, \theta_k)$,
where $\theta_k$ refers to the parameters of $P$ when $X_t = k$. We suppose further that the processes $Y_t \mid X_t$
and $Y_{t'} \mid X_{t'}$ are independent for $t \neq t'$. Let $\Theta = (\theta_1, \dots, \theta_a)$ and $\theta = (\Pi_0, \Pi, \Theta)$; then the HMM
can be described as follows. First, the likelihood of the observations and the hidden states can be
decomposed as

$$P(y_0, \dots, y_T, x_0, \dots, x_T, \theta) = P(y_0, \dots, y_T \mid x_0, \dots, x_T, \theta)\,P(x_0, \dots, x_T, \theta).$$

Since $\{X_t\}_{t=0}^{T}$ is a Markov chain,

$$P(x_0, \dots, x_T, \theta) = \Pi_0(x_0)\prod_{t=1}^{T}\Pi(x_t \mid x_{t-1}).$$

Under the conditional independence of the observations given the hidden states,

$$P(y_0, \dots, y_T \mid x_0, \dots, x_T, \theta) = P_{x_0}(y_0 \mid \theta_{x_0})\prod_{t=1}^{T} P_{x_t}(y_t \mid \theta_{x_t}).$$

Consequently, the likelihood function for the hidden states and the observations is given by

$$P(y_0, y_1, \dots, y_T, x_0, x_1, \dots, x_T, \theta) = \Pi_0(x_0)\,P_{x_0}(y_0 \mid \theta_{x_0})\prod_{t=1}^{T}\Pi(x_t \mid x_{t-1})\,P_{x_t}(y_t \mid \theta_{x_t}).$$
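The complete-data likelihood factorization above can be sketched directly in code. A minimal illustration with two states and Gaussian emissions; all parameter values are assumptions for the example, not taken from the chapter:

```python
import numpy as np

# Complete-data likelihood of an HMM, following the factorization above:
# P(y, x) = Pi0(x0) P(y0|x0) * prod_{t>=1} Pi(x_t | x_{t-1}) P(y_t | x_t).
# Illustrative 2-state model with Gaussian emissions (means mu, common sd).
Pi0 = np.array([0.5, 0.5])
Pi = np.array([[0.9, 0.1],
               [0.2, 0.8]])
mu, sd = np.array([0.0, 3.0]), 1.0

def emission(y, k):
    # Density of N(mu[k], sd^2) at y.
    return np.exp(-0.5 * ((y - mu[k]) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def complete_likelihood(y, x):
    lik = Pi0[x[0]] * emission(y[0], x[0])
    for t in range(1, len(y)):
        lik *= Pi[x[t - 1], x[t]] * emission(y[t], x[t])
    return lik

y = [0.1, 0.3, 2.9, 3.2]
x = [0, 0, 1, 1]
print(complete_likelihood(y, x))
```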

3. Model description and specification

The MAR(1)HMM model we consider in this work is a hidden Markov model where, conditionally
on the latent states, the observations are not independent as they are in a regular
hidden Markov model. Instead, the current observation is allowed to depend on the previous
observation according to a first-order autoregressive model. As in an HMM, the latent
states evolve according to a discrete first-order time-homogeneous Markov model. We consider
data of $n$ continuous random variables observed over time, each of potentially different length;
i.e., for each individual $i = 1, 2, \dots, n$, we observe a vector $y_{i,\cdot} = (y_{i,u_i}, \dots, y_{i,m_i})^{T}$, with $u_i < m_i$.

Define $u_0 = \min_{1 \le i \le n}\{u_i\}$ and $M = \max_{1 \le i \le n}\{m_i\}$, and note that the times $u_i$ and $m_i$ may vary over the
entire observation period from $u_0$ to $M$, with the restriction that $m_i - u_i \ge 1$ for $i = 1, 2, \dots, n$.

We assume, for $i = 1, 2, \dots, n$ and integer time $t = u_i, \dots, m_i$, that the random variable $Y_{i,t}$, taking
nonnegative values, depends only on the state $X_t$ and the previous observation $Y_{i,t-1}$;
based on the model developed by Farcomeni and Arima [10], we get the following model:

$$Y_{i,t} \mid X_t = x_t \;=\; \beta^{(x_t)}\,Y_{i,t-1} + \mu^{(x_t)} + \varepsilon_{i,t}. \tag{1}$$

The choice of the autoregressive part of the model is motivated by the fact that successive
biomarker observations are most of the time correlated for many diseases, unlike the hypothesis
of independence between observations in HMMs.

We interpret $x$ as the vector of the hidden health states of the patients; in the case of breast cancer,
those states would be, for example, localized or advanced metastatic breast cancer, while $y$ is the
vector of the biomarkers observed and measured for the patients. The $\varepsilon_{i,t}$ are normal variables
with mean 0 and variance $\sigma^2$ such that $\varepsilon_{i,t}$ and $\varepsilon_{i',t'}$ are uncorrelated for $(i, t) \neq (i', t')$.

The parameters $\beta^{(x_t)}$ and $\mu^{(x_t)}$ take values in $\mathbb{R}$ for each hidden state, and $\sigma^2 \in \mathbb{R}^{+}$.
Similar to Ref. [13], the transition matrix $\Pi$ of the Markov chain is time homogeneous with
dimension $a \times a$, where $a$ is the number of hidden states, and $\Pi = (\Pi_{gh},\, g = 1, \dots, a;\ h = 1, \dots, a)$,
where $\Pi_{gh} = P(X_t = h \mid X_{t-1} = g)$ for $g, h = 1, 2, \dots, a$ and $t = u_0+1, \dots, M$. We let the first state
$X_{u_0}$ be selected from a discrete distribution with vector of probabilities $r = (r_1, \dots, r_a)$. We also
consider the time of initial observation $u_i$, the initial observed value $y_{i,u_i}$, and the number of
consecutive time points that were observed, $m_i - u_i + 1$. Let $\mu = (\mu^{(1)}, \dots, \mu^{(a)})$, $\beta = (\beta^{(1)}, \dots, \beta^{(a)})$, and
$\theta = (\mu, \beta, \sigma^2, r, \Pi)$ be the set of all parameters in the model. We suppose that the individuals, i.e., the
$Y_{i,t}$, behave independently conditionally on $X$. Therefore, for $i = 1, \dots, n$,

$$P(y_{i,\cdot} \mid y_{i,u_i}, x, \theta) = \prod_{t=u_i+1}^{m_i} P(y_{i,t} \mid y_{i,t-1}, x_t, \Theta)
\qquad\text{and}\qquad
P(x \mid \theta) = P(x_{u_0})\prod_{t=u_0+1}^{M} P(x_t \mid x_{t-1}, \Pi),$$

where $P(x_t \mid x_{t-1}, \Pi) = P(X_t = x_t \mid X_{t-1} = x_{t-1}, \Pi) = \Pi_{x_{t-1}, x_t}$. Then the likelihood density for the observations
of all individuals $y = (y_1, \dots, y_n)$, given the first-time vector of observations $y_0 = (y_{1,u_1}, \dots, y_{n,u_n})$, $x$, and $\theta$, is

$$P(y \mid y_0, x, \theta) = \prod_{i=1}^{n} P(y_{i,\cdot} \mid y_{i,u_i}, x, \theta),$$

This is due to the conditional independence of the $y_i$ given $x$ and $\theta$. The joint mass of each $y_{i,\cdot}$
and $x$ given $y_{u_i}$ and $\theta$ can be written as follows:

$$P(y_{i,\cdot}, x \mid y_{i,u_i}, \theta) = P(y_{i,\cdot} \mid y_{i,u_i}, x, \theta)\times P(x \mid y_{u_i}, \theta).$$

Using the Markov property of the hidden process, we have after simplification

$$P(x \mid y_{i,u_i}, \theta) \propto P(y_{i,u_i} \mid x, \theta)\,P(x \mid \theta) = P(y_{i,u_i} \mid x_{u_i}, \theta)\,r_{x_{u_0}}\,\Pi_{x_{u_0}, x_{u_0+1}}\times\cdots\times\Pi_{x_{M-1}, x_M}.$$

In addition, $P(y_{i,\cdot} \mid y_{i,u_i}, x, \theta) = \prod_{t=u_i+1}^{m_i} P(y_{i,t} \mid y_{i,t-1}, x, \theta)$, and consequently,

$$P(y_{i,\cdot}, x \mid y_{i,u_i}, \theta) \propto r_{x_{u_0}}\,P(y_{i,u_i} \mid x_{u_i}, \theta)\prod_{t=u_0+1}^{M}\Pi_{x_{t-1}, x_t}\prod_{t=u_i+1}^{m_i} P(y_{i,t} \mid y_{i,t-1}, x, \theta).$$

Finally, under the hypothesis of a normal error distribution for the autoregressive part of the model
(Eq. (1)) and the Chapman-Kolmogorov property, the joint distribution of $y_{i,\cdot}$ and $x$ given $y_{i,u_i}$ and $\theta$
simplifies to

$$P(y_{i,\cdot}, x \mid y_{i,u_i}, \theta) \propto P(y_{i,u_i} \mid x_{u_i}, \theta)\prod_{h=1}^{a} r_h^{\chi_{\{x_{u_0}\}}(h)}\prod_{t=u_0+1}^{M}\prod_{g=1}^{a}\prod_{h=1}^{a}\Pi_{g,h}^{\chi_{\{x_{t-1}, x_t\}}(g, h)}
\times\prod_{t=u_i+1}^{m_i}\prod_{h=1}^{a}\left[\frac{1}{\sigma}\,\varphi\!\left(\frac{y_{i,t}-\mu^{(h)}-\beta^{(h)} y_{i,t-1}}{\sigma}\right)\right]^{\chi_{\{x_t\}}(h)},$$

where $\varphi$ denotes the density of a standard normal distribution $\mathcal{N}(0, 1)$ and $\chi_{\{A\}}(x)$ is the usual
indicator function of a set $A$. Finally, the joint distribution of $y$ and $x$ has the following form:

$$P(y, x \mid y_0, \theta) \propto \prod_{h=1}^{a} r_h^{\chi_{\{x_{u_0}\}}(h)}\prod_{t=u_0+1}^{M}\prod_{g=1}^{a}\prod_{h=1}^{a}\Pi_{g,h}^{\chi_{\{x_{t-1}, x_t\}}(g, h)}
\times\prod_{i=1}^{n}\prod_{l=1}^{a}\left[\frac{1}{\sigma}\,\varphi\!\left(\frac{y_{i,u_i}-\mu^{(l)}}{\sigma}\right)\right]^{\chi_{\{x_{u_i}\}}(l)}
\prod_{i=1}^{n}\prod_{t=u_i+1}^{m_i}\prod_{h=1}^{a}\left[\frac{1}{\sigma}\,\varphi\!\left(\frac{y_{i,t}-\mu^{(h)}-\beta^{(h)} y_{i,t-1}}{\sigma}\right)\right]^{\chi_{\{x_t\}}(h)}. \tag{2}$$
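The generative model of Eq. (1), together with the Markov dynamics of the hidden chain, can be sketched as a simulator for a single individual. All parameter values below are illustrative assumptions, and the initial observation is drawn as N(μ^(x_{u0}), σ²), matching the first normal term of Eq. (2):

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate one individual from the MAR(1)HMM of Eq. (1):
# hidden chain x_t ~ Markov(r, Pi); y_t = beta[x_t] * y_{t-1} + mu[x_t] + eps_t.
a = 2
r = np.array([0.7, 0.3])            # initial state probabilities
Pi = np.array([[0.95, 0.05],
               [0.10, 0.90]])       # transition matrix
mu = np.array([1.0, 4.0])           # state-specific intercepts
beta = np.array([0.5, 0.8])         # state-specific AR coefficients
sigma = 0.3

def simulate(T):
    x = np.empty(T, dtype=int)
    y = np.empty(T)
    x[0] = rng.choice(a, p=r)
    y[0] = mu[x[0]] + sigma * rng.standard_normal()  # initial observation
    for t in range(1, T):
        x[t] = rng.choice(a, p=Pi[x[t - 1]])
        y[t] = beta[x[t]] * y[t - 1] + mu[x[t]] + sigma * rng.standard_normal()
    return x, y

x, y = simulate(200)
print(x[:10], np.round(y[:5], 2))
```

Simulated paths like this are what the MCMC scheme of Section 4 is later run on in the simulation studies.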

4. Bayesian estimation of the model parameters

We use a Bayesian approach to estimate the model parameters. Inference in the Bayesian
framework is obtained through the posterior density, which is proportional to the prior multiplied
by the likelihood. The posterior distribution for our model, as in most cases, cannot be
derived analytically, and we approximate it through MCMC methods specifically
designed for working with the likelihood augmented with the hidden states. In fact, MCMC
methods start by specifying the prior density $\Pi(\theta)$ for the parameters. Since the data $y$ are
available, the general sampling method works recursively by alternating between simulating
from the full conditional distributions of $X$ given $(y, \theta)$ and of $\theta$ given $(x, y)$.

4.1. Prior distributions

Under the assumption of independence between the parameters $\theta = (\mu, \beta, \sigma^2, r, \Pi)$, the prior
density can be written as $P(\theta) = P(r)P(\Pi)P(\mu)P(\beta)P(\sigma^2)$. $r$ is the parameter vector of a multinomial
distribution; hence, the natural choice for the prior is a Dirichlet distribution
$r \sim \mathcal{D}(\alpha_{01}, \dots, \alpha_{0a})$. Likewise, $\sum_{j=1}^{a}\Pi_{ij} = 1$, and we assume that $\Pi_i \sim \mathcal{D}(\delta_{i1}, \dots, \delta_{ia})$ for each row $i$
of the transition matrix. This choice of Dirichlet prior can even be the default $\mathcal{D}(1, \dots, 1)$, as
recently discussed in Ref. [18]. In fact, a Dirichlet prior is justified because the posterior density
of each row of the transition matrix is proportional to the density of a Dirichlet distribution;
hence, choosing a Dirichlet prior gives a Dirichlet posterior. This can be justified as
follows for a given set of parameters $\lambda = (\lambda_1, \dots, \lambda_a)$ from a discrete or multinomial
density:

$$\pi(x_1, \dots, x_a, \lambda_1, \dots, \lambda_a) = \frac{n!}{x_1!\cdots x_a!}\,\lambda_1^{x_1}\cdots\lambda_a^{x_a}\quad\text{for nonnegative integers } x_1, \dots, x_a \text{ with } \sum_{i=1}^{a} x_i = n.$$

This probability mass function can be expressed, using the gamma function $\Gamma$, as

$$\pi(x_1, \dots, x_a, \lambda_1, \dots, \lambda_a) = \frac{\Gamma\!\left(\sum_{i=1}^{a} x_i + 1\right)}{\prod_{i=1}^{a}\Gamma(x_i + 1)}\prod_{i=1}^{a}\lambda_i^{x_i}.$$

This form shows its resemblance to the Dirichlet distribution, and by supposing the prior
$\lambda \propto \mathcal{D}(\alpha_1, \dots, \alpha_a)$, the posterior is

$$P(\lambda \mid x) \propto P(\lambda)P(x \mid \lambda) \propto \prod_i \lambda_i^{x_i}\prod_i \lambda_i^{\alpha_i - 1} \propto \prod_i \lambda_i^{x_i + \alpha_i - 1} \propto \mathcal{D}(x_1 + \alpha_1, \dots, x_a + \alpha_a).$$

Furthermore, concerning the priors for the parameters of the autoregressive model, we suppose
for $h = 1, \dots, a$: $\mu^{(h)} \sim \mathcal{N}(\alpha_h, \tau_h)$, $\beta^{(h)} \sim \mathcal{N}(b_h, c_h)$, and an inverse gamma (IG) prior for
$\sigma^2 \sim IG(\varepsilon, \zeta)$. Here $\alpha_h, \tau_h, b_h, c_h, \varepsilon, \zeta$ are hyperparameters to be specified. For more details on Bayesian
inference and prior selection in HMM, the reader is referred to Ref. [19]. In our case, prior
distributions for the autoregressive parameters were proposed in Ref. [20] for a mixture
autoregressive model, where it is pointed out that they are conventional prior choices for mixture
models.

4.2. Sampling the posterior distribution for the hidden states

Chib [21] developed a method for simulating the hidden states from their full joint
distribution in the univariate hidden Markov model case. We describe his full Bayesian
algorithm for the univariate hidden Markov model before generalizing it to our MAR(1)HMM.

4.2.1. Chib’s algorithm for the univariate hidden Markov model for estimation of the states

Suppose we have an observed process $Y_n = (y_1, \dots, y_n)$ and hidden states $X_n = (x_1, \dots, x_n)$, and let
$\theta$ be the parameters of the model. For simplicity we write $X_t = (x_1, \dots, x_t)$ for the history of the
states up to time $t$ and $X^{t+1} = (x_{t+1}, \dots, x_n)$ for the future from $t+1$ to $n$; we use the same notation
for $Y_t$ and $Y^{t+1}$.

For each state $x_t \in \{1, 2, \dots, a\}$, $t = 1, 2, \dots, n$, the hidden model can be described by a conditional
density given the hidden states, $\pi(y_t \mid Y_{t-1}, x_t = k) = \pi(y_t \mid Y_{t-1}, \theta_k)$, $k = 1, \dots, a$, with $x_t$
depending only on $x_{t-1}$ and having transition matrix $\Pi$ and initial distribution $\Pi_0$; the
parameters of $\pi(\cdot)$ are $\theta = (\theta_1, \dots, \theta_a)$.
Chib [21] shows that it is preferable to simulate the full latent data $X_n = (x_1, \dots, x_n)$ from the
joint distribution of $x_1, \dots, x_n \mid Y_n, \theta$, in order to improve the convergence properties of the
MCMC algorithm: instead of the $n$ additional blocks needed when each state is simulated separately,
only one additional block is required. First, we write the joint conditional density as

$$P(X_n \mid Y_n, \theta, \Pi) = P(x_n \mid Y_n, \theta)\,P(x_{n-1} \mid Y_n, x_n, \theta, \Pi)\times\cdots\times P(x_1 \mid Y_n, X^{2}, \theta, \Pi).$$

For sampling, it is sufficient to consider the sampling of $x_t$ from $P(x_t \mid Y_n, X^{t+1}, \theta, \Pi)$. Moreover,

$$P(x_t \mid Y_n, X^{t+1}, \theta, \Pi) \propto P(x_t \mid Y_t, \theta, \Pi)\,P(x_{t+1} \mid x_t, \Pi).$$

This expression has two ingredients: the first, $P(x_{t+1} \mid x_t, \Pi)$, comes from the transition matrix of
the Markov chain; the second, $P(x_t \mid Y_t, \theta, \Pi)$, is obtained recursively, starting at $t = 1$.

The mass function $P(x_{t-1} \mid Y_{t-1}, \theta, \Pi)$ is transformed into $P(x_t \mid Y_t, \theta, \Pi)$, which is in turn
transformed into $P(x_{t+1} \mid Y_{t+1}, \theta, \Pi)$, and so on. The update is as follows: for $k = 1, \dots, a$,

$$P(x_t = k \mid Y_t, \theta, \Pi) = \frac{P(x_t = k \mid Y_{t-1}, \theta, \Pi)\,\pi(y_t \mid y_{t-1}, \theta_k)}{\sum_{l=1}^{a} P(x_t = l \mid Y_{t-1}, \theta, \Pi)\,\pi(y_t \mid y_{t-1}, \theta_l)}.$$

These calculations are initialized at $t = 0$ by setting $P(x_1 \mid Y_0, \theta)$ to be the stationary distribution
of the Markov chain. Precisely, the simulation proceeds for $k = 1, \dots, a$ recursively by first
simulating $P(x_1 = k \mid Y_0, \theta)$ from the initial distribution $\Pi_0(k)$, with

$$P(x_1 = k \mid Y_1, \theta, \Pi) \propto P(x_1 = k \mid Y_0, \theta, \Pi)\,\pi(y_1 \mid Y_0, \theta_k).$$

Then we get, by forward calculation for each $t = 2, \dots, n$,

$$P(x_t = k \mid Y_{t-1}, \theta) = \sum_{l=1}^{a}\Pi_{lk}\,P(x_{t-1} = l \mid Y_{t-1}, \theta),$$

where $\Pi_{lk}$ is the transition probability, and

$$P(x_t = k \mid Y_t, \theta) \propto P(x_t = k \mid Y_{t-1}, \theta, \Pi)\,\pi(y_t \mid Y_{t-1}, \theta_k).$$

The last term in the forward computation, $P(x_n = k \mid Y_n, \theta)$, serves as the start of the backward
pass, and we get recursively, for each $t = n-1, \dots, 1$,

$$P(x_t = k \mid Y_n, X^{t+1}, \theta) \propto P(x_t = k \mid Y_t, \theta, \Pi)\,P(x_{t+1} \mid x_t = k, \Pi),$$

which permits obtaining $X_n = (x_1, \dots, x_n)$.
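The block update above can be sketched as a forward filtering/backward sampling routine. This is a generic illustration, not the chapter's exact implementation: the unnormalized Gaussian emission, the toy parameter values, and passing a user-supplied initial distribution `p0` (in place of the stationary distribution) are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def ffbs(y, Pi, p0, emission):
    """Forward filtering / backward sampling of hidden states (block update).

    emission(t, k) -> observation density pi(y_t | ..., theta_k), up to a constant.
    Returns one joint draw of the full state path x_1..x_n.
    """
    n, a = len(y), Pi.shape[0]
    filt = np.empty((n, a))
    pred = p0.copy()                       # P(x_1 | Y_0)
    for t in range(n):                     # forward pass: P(x_t = k | Y_t)
        f = pred * np.array([emission(t, k) for k in range(a)])
        filt[t] = f / f.sum()
        pred = filt[t] @ Pi                # P(x_{t+1} = k | Y_t)
    x = np.empty(n, dtype=int)             # backward pass: draw x_n, then x_t | x_{t+1}
    x[-1] = rng.choice(a, p=filt[-1])
    for t in range(n - 2, -1, -1):
        w = filt[t] * Pi[:, x[t + 1]]      # P(x_t | Y_t) P(x_{t+1} | x_t)
        x[t] = rng.choice(a, p=w / w.sum())
    return x

# Toy run: 2 states, Gaussian emissions with well-separated means.
Pi = np.array([[0.9, 0.1], [0.1, 0.9]])
p0 = np.array([0.5, 0.5])
y = np.array([0.0, 0.1, -0.2, 5.1, 4.9, 5.3])
mu = np.array([0.0, 5.0])
em = lambda t, k: np.exp(-0.5 * (y[t] - mu[k]) ** 2)
path = ffbs(y, Pi, p0, em)
print(path)
```

With these well-separated means, the sampled path almost always assigns the first three observations to state 0 and the last three to state 1.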

4.2.2. Simulating the hidden states for the MAR(1)HMM

Returning to our model, and adopting the notation and algorithm developed by Fitzpatrick and
Marchev, let $f$ denote the observation density for the MAR(1)HMM, and for $u_0 < t < M$ define

$$x_{,t} = (x_{u_0}, \dots, x_t),\qquad x^{t} = (x_t, \dots, x_M),\qquad y(t) = (y_{i,t},\; i = 1, 2, \dots, n),$$
$$y_{,t} = \bigcup_{i:\,u_i < t}\big\{y_{i,u_i}, \dots, y_{i,\min\{t, m_i\}}\big\},\qquad y^{t} = \bigcup_{i:\,t < m_i}\big\{y_{i,\max\{t+1, u_i\}}, \dots, y_{i,m_i}\big\}.$$

The posterior distribution of the hidden states can be written as

$$P(x_{,M} \mid y_{,M}, \theta) = P(x_M \mid y_{,M}, \theta)\times\cdots\times P(x_{u_0} \mid y_{,M}, x^{u_0+1}, \theta),$$

so we can sample the whole sequence of states by sampling from $P(x_t \mid y_{,M}, x^{t+1}, \theta)$. Hence, the
estimation of the hidden states is performed recursively by first initializing

$$P(x_{u_0} \mid y_{,u_0}, \theta) \propto P(y_{,u_0} \mid x_{u_0})\,P(x_{u_0} \mid r),\qquad y_{,u_0} = \big\{y_{i,u_i} :\; u_i = u_0,\; i = 1, \dots, n\big\},$$
$$P(x_{u_0+1} = k \mid y_{,u_0}, \theta) = \sum_{l=1}^{a}\Pi_{lk}\,P(x_{u_0} = l \mid y_{,u_0}),\qquad k = 1, \dots, a,$$
$$P(x_{u_0+1} = k \mid y_{,u_0+1}, \theta) \propto P(x_{u_0+1} = k \mid y_{,u_0}, \theta)\,f\big(y(u_0+1) \mid y_{,u_0}, \theta_k\big).$$

We perform a similar calculation for every state at time $t$, and we conclude by calculating

$$P(x_M = k \mid y_{,M-1}, \theta) = \sum_{l=1}^{a}\Pi_{lk}\,P(x_{M-1} = l \mid y_{,M-1}, \theta)$$

and

$$P(x_M = k \mid y_{,M}, \theta) \propto P(x_M = k \mid y_{,M-1}, \theta)\,f\big(y(M) \mid y_{,M-1}, \theta_k\big).$$

This yields $P(x_M = k \mid y_{,M}, \theta)$, which permits the simulation of $P(x_M \mid y_{,M}, \theta)$. Finally, by
backward calculation, we simulate from the probabilities

$$P(x_t \mid y_{,M}, x^{t+1}, \theta) \propto P(x_{t+1} \mid x_t, \Pi)\,P(x_t \mid y_{,t}, \theta)$$

for each time $t = M-1, \dots, u_0$. These backward probabilities permit the simulation of the latent states.

4.3. Sampling from P(θ|x, y)

4.3.1. Sampling Π
Under the assumption of a Dirichlet prior for each row of the transition matrix,
$P(\Pi_i) \propto \mathcal{D}(\delta_{i1}, \dots, \delta_{ia})$, and the independence assumption between those rows, the posterior
distribution for $\Pi_i$ can be developed using Eq. (2) as follows. Let $n_{ij}$ denote the number of
single transitions from state $i$ to state $j$; then

$$P(\Pi_i \mid y, x) \propto P(\Pi_i)\prod_{t=u_0+1}^{M}\prod_{j=1}^{a}\Pi_{ij}^{\chi_{\{x_{t-1}, x_t\}}(i, j)} \propto P(\Pi_i)\prod_{j=1}^{a}\Pi_{ij}^{n_{ij}} \propto \prod_{j=1}^{a}\Pi_{ij}^{\delta_{ij}+n_{ij}-1} \propto \mathcal{D}(\delta_{i1}+n_{i1}, \dots, \delta_{ia}+n_{ia}).$$

4.3.2. Sampling posterior distribution for initial distribution


Let $n_{0l} = \chi_{\{x_{u_0}\}}(l)$ for $l = 1, \dots, a$. Using Eq. (2), under a Dirichlet prior $\mathcal{D}(\delta_{01}, \dots, \delta_{0a})$ for the
parameter $r$, we obtain

$$P(r \mid x, y) \propto P(r)\prod_{l=1}^{a} r_l^{\chi_{\{x_{u_0}\}}(l)} \propto \prod_{l=1}^{a} r_l^{\delta_{0l}+n_{0l}-1} \propto \mathcal{D}(\delta_{01}+n_{01}, \dots, \delta_{0a}+n_{0a}).$$
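Both Dirichlet updates above amount to "count, add prior, draw". A minimal sketch for the transition-matrix rows; the flat D(1,…,1) prior, the helper name `sample_transition_rows`, and the toy state path are illustrative assumptions (with several individuals, the transition counts would be pooled across them):

```python
import numpy as np

rng = np.random.default_rng(7)

def sample_transition_rows(x, a, delta):
    """One Gibbs draw of the transition matrix: row i ~ Dirichlet(delta_i + n_i),
    where n_ij counts the single transitions i -> j in the state path x."""
    counts = np.zeros((a, a))
    for g, h in zip(x[:-1], x[1:]):
        counts[g, h] += 1
    return np.vstack([rng.dirichlet(delta[i] + counts[i]) for i in range(a)])

# Toy state path and flat Dirichlet(1, ..., 1) prior for each row.
x = np.array([0, 0, 0, 1, 1, 0, 1, 1, 1, 0])
a = 2
delta = np.ones((a, a))
Pi_draw = sample_transition_rows(x, a, delta)
print(Pi_draw, Pi_draw.sum(axis=1))  # each row sums to 1
```

The same pattern with `rng.dirichlet(delta0 + n0)` applies to the initial distribution r, using the counts of initial states.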

4.3.3. Sampling posterior distribution for the autoregressive parameters μ, β, σ²

When a complete conditional distribution is known such as the normal distribution or beta
distribution, we use the Gibbs sampler to draw the random variable. This is the case for our
n
X X mi
n X a
X
model. Let us define nui ðlÞ ¼ χfxu ¼lg , nl ¼ χfxt ¼lg , N ¼ nl , n0l ¼ χfxu g ðlÞ: So for
i 0
i¼1 i¼1 t¼ui þ1 l¼1
l ¼ 1; 2;…, a; by supposing N ðαl , τl Þ as prior distribution and using Eq. (2), the conditional
posterior distribution of μ(l) is:
Bayesian Estimation of Multivariate Autoregressive Hidden Markov Model with Application to Breast Cancer… 155
http://dx.doi.org/10.5772/intechopen.70053

n �
Y � ��χ ðlÞ mi �
n Y
Y �χ
fxt g ðlÞ
1
yi, u �μðlÞ fxui g 1 yi, t �μðlÞ �βðlÞ yi, t�1
PðμðlÞ jy, xÞ ∝ PðμðlÞ Þ σφ
i
σ � σ φð σ Þ :
i¼1 i¼1 t¼ui þ1
8 !2 !2 9
�1 <ðμðlÞ � αl Þ2 yi, t � μðlÞ � βðlÞ yi, t�1 =
n n mi
X yi, ui � μðlÞ X X
∝ exp þ þ :
2 : τl i¼1;x ¼l
σ i¼1 t¼u þ1;x ¼l
σ ;
ui i t

nui ðlÞþnl
then μðlÞ =y, x � N ð~τ l , α
~ l Þ with inverse mean τ~ l �1 ¼ σ2
þ τ1l and variance

0 X n Xn X mi 1
B yi, ui þ ðyi, t � βðlÞ yi, t�1 Þ C
Bi¼1;xui ¼l i¼1 t¼ui þ1;xt ¼l αl C
C:
~ l ¼ τ~ l B
α þ
B
@ σ2 τl C
A

For β^(l), l = 1, …, a, and similarly to μ^(l), N(b_l, c_l) was proposed as prior choice to obtain

P(β^(l) | y, x) ∝ P(β^(l)) ∏_{i=1}^{n} ∏_{t=u_i+1}^{m_i} [ (1/σ) φ((y_{i,t} − μ^(l) − β^(l) y_{i,t−1})/σ) ]^{χ_{{x_t}}(l)},

and therefore β^(l) | y, x ∼ N(b̃_l, c̃_l) with inverse variance c̃_l^{−1} = 1/c_l + ( Σ_{i=1}^{n} Σ_{t=u_i+1; x_t=l}^{m_i} y²_{i,t−1} ) / σ² and mean

b̃_l = c̃_l [ b_l/c_l + ( Σ_{i=1}^{n} Σ_{t=u_i+1; x_t=l}^{m_i} (y_{i,t} − μ^(l)) y_{i,t−1} ) / σ² ].

For the posterior distribution of σ², by supposing IG(ε, ζ) as prior, we deduce from Eq. (2)

P(σ² | y, x) ∝ (σ²)^{−(ε+1)} exp(−ζ/σ²) ∏_{i=1}^{n} [ (1/σ) φ((y_{i,u_i} − μ^{(x_{u_i})})/σ) ] × ∏_{i=1}^{n} ∏_{t=u_i+1}^{m_i} [ (1/σ) φ((y_{i,t} − μ^{(x_t)} − β^{(x_t)} y_{i,t−1})/σ) ],

and consequently σ² | y, x ∼ IG(ε̃, ζ̃) with parameters ε̃ = (n_{u_i} + N)/2 + ε and

ζ̃ = ( Σ_{i=1}^{n} (y_{i,u_i} − μ^{(x_{u_i})})² + Σ_{i=1}^{n} Σ_{t=u_i+1}^{m_i} (y_{i,t} − μ^{(x_t)} − β^{(x_t)} y_{i,t−1})² ) / 2 + ζ.

Finally, the algorithm is run for d = 1, …, D iterations by alternating between the following steps, where in each step we sample from the conditional posterior of the given parameter:

The MCMC algorithm:

1. For h = 1, 2, …, a, give reference values for the hyperparameters α_h, τ_h, a_h, b_h, δ_{0h}, and δ_{ih} for i = 1, 2, …, a.

2. Initialization (step d = 1 of the MCMC iterations): initialize Π^(1), r^(1), μ^(1), β^(1), and σ^{2(1)}.

3. Simulation of the hidden states:

   a. Initialization of forward simulation: P(x^{(d)}_{u₀} | y_{≤u₀}, θ) ∝ P(y_{≤u₀} | x^{(d)}_{u₀}) P(x^{(d)}_{u₀} | r^{(d)}), with y_{≤u₀} = {y_{i,u_i} : u_i = u₀, i = 1, …, n}.

   b. Forward simulation: for k = 1, …, a and t = u₀ + 1, …, M:

      P(x^{(d)}_t = k | y_{≤t−1}, θ) = Σ_{l=1}^{a} Π^{(d)}_{lk} P(x^{(d)}_{t−1} = l | y_{≤t−1}, θ) and

      P(x^{(d)}_t = k | y_{≤t}, θ) = P(x^{(d)}_t = k | y_{≤t−1}, θ) f(y(t) | y_{≤t−1}, θ_k) / Σ_{l=1}^{a} P(x^{(d)}_t = l | y_{≤t−1}, θ) f(y(t) | y_{≤t−1}, θ_l).

   c. Initialization of backward simulation: for k = 1, …, a, given P(x^{(d)}_M = k | y_{≤M}, θ) from the forward simulation, we simulate x^{(d)}_M.

   d. Backward simulation: for t = M − 1, …, u₀:

      P(x^{(d)}_t | y_{≤M}, x^{(d)}_{t+1}, θ) ∝ P(x^{(d)}_{t+1} | x^{(d)}_t, Π) P(x^{(d)}_t | y_{≤t}, θ).

4. Estimation of the initial distribution and the transition distribution:

   a. For l = 1, …, a and k = 1, …, a, calculate n_{0l} = χ_{{x^{(d)}_{u₀}}}(l) and n_{kl} = Σ_{t=u₀+1}^{M} χ_{{x^{(d)}_{t−1}, x^{(d)}_t}}(k, l).

   b. Sample (r^{(d+1)}_1, …, r^{(d+1)}_a) ∼ D(δ_{01} + n_{01}, …, δ_{0a} + n_{0a}).

   c. For i = 1, …, a, sample (Π^{(d+1)}_{i1}, …, Π^{(d+1)}_{ia}) ∼ D(δ_{i1} + n_{i1}, …, δ_{ia} + n_{ia}).

5. Simulation of μ: for l = 1, …, a,

   a. τ̃_l^{−1} = (n_{u_i}(l) + n_l)/σ^{2(d)} + 1/τ_l.

   b. α̃_l = τ̃_l [ ( Σ_{i=1; x_{u_i}=l}^{n} y_{i,u_i} + Σ_{i=1}^{n} Σ_{t=u_i+1; x_t=l}^{m_i} (y_{i,t} − β^{(l)(d)} y_{i,t−1}) ) / σ^{2(d)} + α_l/τ_l ].

   c. Simulate μ^{(l)(d+1)} | y, x ∼ N(α̃_l, τ̃_l).

6. Simulation of β: for l = 1, …, a,

   a. c̃_l^{−1} = 1/c_l + ( Σ_{i=1}^{n} Σ_{t=u_i+1; x_t=l}^{m_i} y²_{i,t−1} ) / σ^{2(d)}.

   b. b̃_l = c̃_l [ b_l/c_l + ( Σ_{i=1}^{n} Σ_{t=u_i+1; x_t=l}^{m_i} (y_{i,t} − μ^{(l)(d+1)}) y_{i,t−1} ) / σ^{2(d)} ].

   c. Simulate β^{(l)(d+1)} | y, x ∼ N(b̃_l, c̃_l).

7. Simulation of σ²:

   a. ε̃ = (n_{u_i} + N)/2 + ε.

   b. ζ̃ = ( Σ_{i=1}^{n} (y_{i,u_i} − μ^{(x_{u_i})(d+1)})² + Σ_{i=1}^{n} Σ_{t=u_i+1}^{m_i} (y_{i,t} − μ^{(x_t)(d+1)} − β^{(x_t)(d+1)} y_{i,t−1})² ) / 2 + ζ.

   c. Simulate σ^{2(d+1)} | y, x ∼ IG(ε̃, ζ̃).
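The parameter updates in steps 4–7 can be sketched in code for a single observed series (so u₀ = 0, and for brevity we drop the initial-observation term of steps 5 and 7); all names and the toy inputs are illustrative, not the authors' implementation:

```python
import numpy as np

def gibbs_sweep(y, x, a, hyper, sigma2, beta, rng):
    """One sweep of steps 4-7 for one series y with current states x (0..a-1)."""
    M = len(y)
    # Step 4: Dirichlet posteriors for the initial and transition distributions
    counts = np.zeros((a, a))
    for t in range(1, M):
        counts[x[t - 1], x[t]] += 1
    r = rng.dirichlet(hyper["delta0"] + np.eye(a)[x[0]])
    Pi = np.vstack([rng.dirichlet(hyper["delta"][i] + counts[i]) for i in range(a)])
    mu_new, beta_new = np.empty(a), np.empty(a)
    for l in range(a):
        idx = np.where(x[1:] == l)[0] + 1              # times t >= 1 with x_t = l
        y_prev = y[idx - 1]
        # Step 5: mu^(l) | rest ~ N(alpha_tilde_l, tau_tilde_l)
        tau_t = 1.0 / (len(idx) / sigma2 + 1.0 / hyper["tau"][l])
        alpha_t = tau_t * ((y[idx] - beta[l] * y_prev).sum() / sigma2
                           + hyper["alpha"][l] / hyper["tau"][l])
        mu_new[l] = rng.normal(alpha_t, np.sqrt(tau_t))
        # Step 6: beta^(l) | rest ~ N(b_tilde_l, c_tilde_l)
        c_t = 1.0 / (1.0 / hyper["c"][l] + (y_prev ** 2).sum() / sigma2)
        b_t = c_t * (hyper["b"][l] / hyper["c"][l]
                     + ((y[idx] - mu_new[l]) * y_prev).sum() / sigma2)
        beta_new[l] = rng.normal(b_t, np.sqrt(c_t))
    # Step 7: sigma^2 | rest ~ IG(eps_tilde, zeta_tilde), drawn as 1 / Gamma
    res = y[1:] - mu_new[x[1:]] - beta_new[x[1:]] * y[:-1]
    eps_t = (M - 1) / 2.0 + hyper["eps"]
    zeta_t = (res ** 2).sum() / 2.0 + hyper["zeta"]
    sigma2_new = 1.0 / rng.gamma(eps_t, 1.0 / zeta_t)
    return r, Pi, mu_new, beta_new, sigma2_new

rng = np.random.default_rng(1)
y = rng.normal(size=50)
x = rng.integers(0, 2, size=50)
hyper = dict(alpha=np.zeros(2), tau=np.full(2, 10.0),
             b=np.zeros(2), c=np.full(2, 10.0),
             delta0=np.ones(2), delta=np.ones((2, 2)),
             eps=0.001, zeta=0.001)
r, Pi, mu, bet, s2 = gibbs_sweep(y, x, 2, hyper, 1.0, np.zeros(2), rng)
```

A complete sampler would alternate such a sweep with the forward-backward draw of the hidden states (step 3) and loop over all n individuals.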

5. Simulation study

In this section, we apply our results to the breast cancer model discussed earlier. The main motivation for our work is that the progression of breast cancer cannot be observed directly; instead we use disease-related observations that characterize its progression. Those observations are measurable quantities called biomarkers, where the word biomarker designates any objective, measurable indication of a biological process or disease condition, including during treatment. Furthermore, biomarkers are increasingly used in the management of breast cancer patients. One example is reported in Ref. [22], stating that there is a correlation between elevation of CEA and/or CA15-3 and disease progression in breast cancer patients. We also use the autoregressive dependence among the observations to add more dynamics to the model, unlike conventional HMMs where the successive observations given the Markov process are independent. We used the classification of breast cancer into three states: the local stage, where the disease is confined within the breast; the regional stage, when the lymph nodes are involved; and the distant stage, where the cancer is found in other parts of the body. We restrict ourselves to these three stages, unlike other classifications that divide the progression into more than three stages, such as the TNM (tumor, node, and metastasis) system. For lack of available data on breast cancer biomarkers, we confine ourselves to simulating an MAR(1)HMM model with observation time M = 24, a number of individuals n = 210, and a = 3 Markov states, with the length of observation time for each individual selected uniformly between 2 and M. The simulation supposes autoregressive means μ = (μ^(1), μ^(2), μ^(3)) = (12, 24, 36), since markers such as CA15-3 increase as the disease advances toward metastatic breast cancer. In addition, CA15-3 increases rapidly between successive observations, and thus we take in the simulation the parameters β = (β^(1), β^(2), β^(3)) = (0.2, 0.4, 0.8).

The simulation algorithm works as follows:

1. For each individual i = 1, …, n, choose m_i, the length of observation for that individual.

2. Generate each discrete disease state x_t, for t = u₀ + 1, …, M, using the transition matrix

   Π = ( 0.7  0.2  0.1
         0.1  0.6  0.3
         0.2  0.3  0.5 ).

3. Generate the observations y_{i,t} for all individuals using our model in Eq. (1).
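A minimal sketch of this generative scheme for a single individual (the initial state is drawn uniformly here since the text does not specify r; σ² = 2 matches the true value in Table 1; all names are illustrative):

```python
import numpy as np

def simulate_mar1hmm(m, Pi, mu, beta, sigma, rng):
    """Simulate hidden states and observations for one individual of length m."""
    a = len(mu)
    x = np.empty(m, dtype=int)
    y = np.empty(m)
    x[0] = rng.integers(a)                   # initial state (uniform here)
    y[0] = rng.normal(mu[x[0]], sigma)       # first observation has no AR term
    for t in range(1, m):
        x[t] = rng.choice(a, p=Pi[x[t - 1]])                        # Markov step
        y[t] = rng.normal(mu[x[t]] + beta[x[t]] * y[t - 1], sigma)  # AR(1) emission
    return x, y

rng = np.random.default_rng(42)
Pi = np.array([[0.7, 0.2, 0.1],
               [0.1, 0.6, 0.3],
               [0.2, 0.3, 0.5]])
mu = np.array([12.0, 24.0, 36.0])
beta = np.array([0.2, 0.4, 0.8])
x, y = simulate_mar1hmm(24, Pi, mu, beta, sigma=np.sqrt(2.0), rng=rng)
```

Repeating this for n = 210 individuals, each with m_i drawn uniformly between 2 and M, yields the simulated data set used in this study.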

We choose a prior σ2 � IGð0:001; 0:001Þ, a Dð1;…; 1Þ prior for each row of Π, and Gaussian
noninformative priors for the μs and the βs. Having the hidden states and the observations,
we ran our algorithm for 8000 MCMC iterations. MCMC convergence was assessed by analyzing the mixing plots of the MCMC iterations shown in Figure 1, by checking the sample autocorrelation graphs illustrated in Figure 2, and by inspecting histograms of the posterior densities of the model parameters in Figure 3. All parameters show good mixing of chains, autocorrelations that decay after only a few lags, and well-behaved posterior density fits. The Gelman [23] potential scale reduction factor (PSRF) was also plotted. The PSRF is measured over more than two MCMC chains (three chains are considered in this work) and for each parameter of the model; it should show that the chains have forgotten their initial values and that the output from all chains is indistinguishable. It is based on a comparison of within-chain and between-chain variances and is similar to a classical analysis of variance; when the PSRF is high (perhaps greater than 1.1 or 1.2), we should run our chains longer to improve convergence to the stationary distribution. Each PSRF declines to 1 as the number of iterations approaches infinity, confirming convergence. All the parameters showed a PSRF less than 1.1 as the number of iterations increased, a good sign of convergence (Figure 4). Moreover, we should point out that the family of Markov switching models suffers from the so-called label switching problem (e.g., Ref. [24]), which raises an identifiability problem, and hence we might not estimate the parameters perfectly. In addition, the posterior densities could show evidence of multimodality. Some authors postprocess the output of the MCMC to deal with the issue (e.g., [25]), while others use a random permutation of the parameters in each iteration of the MCMC algorithm (e.g., [26]), or one can call for an invariant loss function method (e.g., [27]). In our case, no identifiability issue was noticed since we used well-separated prior hyperparameters. Even when we start from different initial values for the parameters, our algorithm converges after only a few iterations.
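The within/between-chain comparison behind the PSRF can be computed directly; the following sketch uses the standard Gelman–Rubin formula on synthetic chains (not the chapter's code):

```python
import numpy as np

def psrf(chains):
    """Gelman-Rubin potential scale reduction factor.

    chains : (m, n) array of m parallel MCMC chains, each of length n.
    """
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)          # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # within-chain variance
    var_hat = (n - 1) / n * W + B / n        # pooled variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(0)
mixed = rng.normal(size=(3, 4000))                     # three well-mixed chains
separated = mixed + np.array([[0.0], [5.0], [10.0]])   # chains stuck apart
print(round(psrf(mixed), 2), round(psrf(separated), 2))  # near 1 vs. much larger
```

A PSRF near 1 for every parameter, as observed here, is the "indistinguishable chains" condition described above.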

Finally, before giving our results, we should report that the simulation of the Dirichlet posterior was carried out following ([28, p. 22], [29, p. 155]), who reported that the posterior Dirichlet parameters should be simulated using the beta distribution approach. Table 1 shows that the posterior values estimated from the algorithm are very close to the true ones.

Figure 1. Markov chain mixing for each parameter through MCMC algorithm simulation.

Figure 2. Autocorrelation sample plots for parameters of the model.



Figure 3. Posterior densities for the parameters of the model (after 8000 iterations).

Figure 4. Potential scale reduction factor convergence to less than 1.02 with more iterations.

Parameter   True value   Posterior mean   Standard deviation   Confidence interval (5%)

μ1          12           11.929           0.047                (11.851–12.005)
μ2          24           23.923           0.047                (23.847–24.000)
μ3          36           35.843           0.070                (35.729–35.959)
β1          0.2          0.2016           0.0012               (0.1997–0.2035)
β2          0.4          0.4018           0.0009               (0.4004–0.4032)
β3          0.8          0.8022           0.0010               (0.8005–0.8038)
π11         0.7          0.688            0.068                (0.5715–0.797)
π12         0.2          0.223            0.062                (0.129–0.332)
π13         0.1          0.090            0.042                (0.032–0.17)
π21         0.1          0.091            0.035                (0.041–0.154)
π22         0.6          0.607            0.059                (0.507–0.701)
π23         0.3          0.302            0.055                (0.214–0.397)
π31         0.2          0.153            0.053                (0.075–0.250)
π32         0.3          0.368            0.071                (0.257–0.488)
π33         0.5          0.479            0.073                (0.358–0.599)
σ2          2            2.023            0.032                (1.970–2.077)

Table 1. Posterior inference for the parameters of the MAR(1)HMM model.

6. Conclusion

We have extended the method of Chib [21] for block-update estimation of the states to an MAR(1)HMM model. Furthermore, we would like to point out that our model can easily be extended to include missing observations, as we would only need to add an extra step in each MCMC iteration to estimate the missing observations. Also, we can estimate the autoregressive model for different values of the autoregressive order, p ≥ 1, by evaluating the Bayesian information criterion to select the order that best fits the observations. Our model captures the complexity and the dynamics of the evolution of breast cancer by introducing the latent states; the probabilities of transition between the latent states allow one to compare the effects of treatments on slowing or accelerating the transition of the disease from one health stage to another. The autoregressive parameter mean values corresponding to different stages of the disease would guide medical doctors and scientists in monitoring patients in different phases of the disease. The model also accommodates individual observations with different lengths.

Last but not least, we would like to mention the utility of switching diffusion processes in addressing and analyzing many complicated applications, such as in finance and risk management. Our future work will be to apply these processes to explore disease progression, because they are characterized by the coexistence of continuous dynamics and discrete events as well as their interactions.

Acknowledgements

We would like to thank the editorial staff for the comments that helped in improving this
work. Also, we would like to thank the supporters of this work: The Lalla Salma Foundation
Prevention and Treatment of Cancer, Rabat, Morocco; and the Germano-Morrocan Program
for Scientific Research PMARS 2015-060.

Author details

Hamid El Maroufy1*†, El Houcine Hibbah1†, Abdelmajid Zyad2† and Taib Ziad3

*Address all correspondence to: [email protected]


1 Department of Mathematics, Faculty of Sciences and Technics, Sultan Moulay Slimane
University, Béni Mellal, Morocco
2 Biological Engineering Laboratory, Team of Natural Substances, Cell and Molecular
Immuno-Pharmacology, Sultan Moulay Slimane University, Morocco
3 Early Clinical Development, AstraZeneca R&D, Gothenburg, Mölndal, Sweden

The first three authors acknowledge the financial support of the Lalla Salma Foundation of Cancer: Prevention and Treatment, Project 09/2013.

References

[1] Samy N, Ragab HM, El Maksoud NA, Shaalan M. Prognostic significance of serum Her2/
neu, BCL2, CA15-3 and CEA in breast cancer patients: A short follow up. Cancer Bio-
markers. 2009;6:63-72
[2] Benmiloud B, Pieczynski W. Estimation des paramètres dans les chaînes de Markov cachées et segmentation. Traitement du Signal. 1995;12:433–454
[3] Boys R, Handerson D. A Bayesian approach to DNA sequence segmentation (with dis-
cussion). Biometrics. 2004;60:573–588

[4] Guihenneuc-Jouyaux C, Richardson S, Longini IM Jr. Modeling disease progression by a


hidden Markov process: Application to characterizing CD4 cell decline. Biometrics.
2000;56:733–741

[5] Albert J, Chib S. Bayes inference via Gibbs sampling of autoregressive time series subject
to Markov mean and variance shifts. Journal of Business and Economic Statistics.
1993;11:1–15

[6] Korolkiewickz M, Elliot J. A hidden Markov model of credit quality. Journal of Economic
Dynamics and Control. 2008;32:3807–3819

[7] Zeng Y, Frias J. A novel HMM-based clustering algorithm for the analysis of gene expres-
sion time-course data. Computational Statistics and Data Analysis. 2006;50:2472–2494

[8] Zucchini W, MacDonald I. Hidden Markov Models for Time Series: An Introduction
Using R. New York: Springer; 2009

[9] Ailliot P, Monbet V. Markov-switching autoregressive models for wind time series. Envi-
ronmental Modelling & Software. 2012;30:92–101

[10] Farcomeni A, Arima S. A Bayesian autoregressive three state hidden Markov model for
identifying switching monotonic regimes in microarray time course data. Statistical
Applications in Genetics and Molecular Biology. 2013;23:467–480

[11] Hamilton J. A new approach to the economic analysis of nonstationary time series and
the business cycle. Econometrica. 1989;57:357–384

[12] Kim C, Kim J. Bayesian inference in regime-switching ARMA models with absorbing
states: The dynamics of the ex-ante real interest rate under regime shifts. Journal of
Business and Economic Statistics. 2015;33:566–578

[13] Fitzpatrick M, Marchev D. Efficient Bayesian estimation of the multivariate double chain Markov model. Statistics and Computing. 2013;23(4):467–480

[14] Lindley D. The philosophy of statistics. The Statistician. 2000;49:293–337


[15] Bolstad WM. Introduction to Bayesian Statistics. Hoboken, NJ: John Wiley and Sons; 2007
[16] Gelman A, Shalizi CR. Philosophy and the practice of Bayesian statistics in the social
sciences. British Journal of Mathematical and Statistical Psychology. 2013;66:8–38
[17] Hobert JP. The data augmentation algorithm: Theory and methodology. In: Brooks S,
Gelman A, Jones GL, Meng X-L, editors. The Handbook of Markov Chain Monte Carlo.
Boca Raton, FL: Chapman and Hall/CRC; 2011. p. 253
[18] Tuyl F, Gerlach R, Mengersen K. Posterior predictive arguments in favor of the Bayes-
Laplace priors as the consensus prior for the binomial and multinomial parameters.
Bayesian Analysis. 2013;4:151–158
[19] Cappe O, Moulines E, Ryden T. Inference in Hidden Markov Models. New York:
Springer-Verlag; 2005

[20] Sampietro S. Mixture of autoregressive components for modeling financial market vola-
tility. LIUC Papers. Serie Metodi quantitativi 16. 2005

[21] Chib S. Calculating posterior distributions and model estimates in Markov mixtures
models. Journal of Econometrics. 1996;75:79–98
[22] Laessig D, Nagel D, Heinemann V, Untch M, Kahlert S, Bauerfeind I, Stieber P. Impor-
tance of CEA and CA15-3 during disease progression in metastatic breast cancer patients.
Anticancer Research. 2007;27:1963–1968

[23] Gelman A. Inference and monitoring convergence. In: Gilks W, Richardson S, Spiegelhalter
D, editors. Markov Chain Monte Carlo in Practice. London: Chapman and Hall/CRC; 1995.
p. 131
[24] Fruhwirth-Schnatter S. Finite Mixture and Markov Switching Models. New York: Springer;
2006
[25] Celeux G. Bayesian inference for mixture: The label switching problem. In: Payne R,
Green P, editors. Proceedings in Computational Statistics. Heidelberg: Physica; 1998. pp.
227–232
[26] Fruhwirth-Schnatter S. Markov Chain Monte Carlo estimation of classical and dynamic
switching and mixture models. Journal of the American Statistical Association. 2001;
96:194–209
[27] Hurn M, Justel A, Robert C. Estimating mixtures of regressions. Journal of Computational
and Graphical Statistics. 2003;79:55–79

[28] Kim C, Nelson C. State-Space Models with Regime Switching: Classical and Gibbs Sam-
pling Approaches with Applications. Cambridge, MA: MIT Press; 1999

[29] Krozlig H. Markov-Switching Vector Autoregressions. Berlin, Heidelberg: Springer-


Verlag; 1997
DOI: 10.5772/intechopen.68786

Chapter 8

Bayesian Model Averaging and Compromising in Dose-Response Studies

Steven B. Kim

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/intechopen.68786

Abstract
Dose-response models are applied to animal-based cancer risk assessments and human-
based clinical trials usually with small samples. For sparse data, we rely on a parametric
model for efficiency, but posterior inference can be sensitive to an assumed model. In
addition, when we utilize prior information, multiple experts may have different prior
knowledge about the parameter of interest. When we make sequential decisions to
allocate experimental units in an experiment, an outcome may depend on decision rules,
and each decision rule has its own perspective. In this chapter, we address the three
practical issues in small-sample dose-response studies: (i) model-sensitivity, (ii) disagree-
ment in prior knowledge and (iii) conflicting perspective in decision rules.

Keywords: dose-response models, model-sensitivity, model-averaging, prior-sensitivity,


consensus prior, Bayesian decision theory, individual-level ethics, population-level
ethics, Bayesian adaptive designs, sequential decisions, continual reassessment method,
c-optimal design, Phase I clinical trials

1. Introduction

Dose-response modeling is often used to learn about the effect of an agent on a particular
outcome with respect to dose. It is widely applied to animal-based cancer risk assessments and
human-based clinical trials. A sample size is typically small; so many statistical issues can arise
from a limited amount of data. The issues include the impact of a misspecified model, prior-
sensitivity, and conflicting ethical perspectives in clinical trials. In this chapter, we focus on
cases when an outcome variable of interest is binary (a predefined event happened or not)
when an experimental unit is exposed to a dose. Main ideas are preserved for cases when an
outcome variable is continuous or discrete.

© 2017 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use,
distribution, and reproduction in any medium, provided the original work is properly cited.


There are two different approaches to statistical inference. One approach is called frequentist
inference. In this framework, we often rely on the sampling distribution of a statistic and large-
sample theories. Another approach is called Bayesian inference. It is founded on Bayes’ Theo-
rem, and it allows researchers to express prior knowledge independent of data. In a small-
sample study, Bayesian inference can be more useful than frequentist inference because we can
incorporate both researcher’s prior knowledge and observed data to make inference for the
parameter of interest. Bayesian ideas are briefly introduced for dose-response modeling with a
binary outcome in Section 2.

In a small-sample study, we often rely on a parametric model to gain statistical efficiency


(i.e., less variance in parameter estimation), but our inference can be severely biased by the
use of a wrong model. To account for model uncertainty, it is reasonable to specify multiple
models and make inference based on “averaged-inference.” In this regard, Bayesian model
averaging (BMA) is a useful method to gain robustness [1]. The BMA method has a wide
range of application, and we focus its application to animal-based cancer risk assessments in
Section 3.
In clinical trials, study participants are real patients, and therefore, we need to carefully
consider ethics. There are conflicting perspectives of individual- and population-level ethics
in early phase clinical trials. Individual-level ethics focuses on the benefit of trial partici-
pants, whereas population-level focuses on the benefit of future patients, which may require
some level of sacrifice from trial participants. We compare the two conflicting perspectives in
clinical trials based on Bayesian decision theory, and we discuss a compromising method in
Section 4 [2, 3].
A sample size for an early phase (Phase I) clinical trial is often less than 30 subjects. Dose
allocations for first few patients and statistical inference for future patients heavily depend on
researcher’s prior knowledge in sparse data. When multiple researchers have different prior
knowledge about a parameter of interest, one compromising approach is to combine their
prior elicitations and average them (i.e., consensus prior) [4, 5]. When we average the prior
elicitations, there are two different approaches to determine the weight of each prior elicita-
tion, weights determined before observing data and after observing data. We discuss operating
characteristics of the two different weighting methods in the context of Phase I clinical trials
in Section 5.

2. Bayesian inference

In statistics, we address a research question by a parameter, which is often denoted by θ. We begin Bayesian inference by modeling the prior knowledge about θ. A function which models the prior knowledge about θ is called the prior density function of θ, and we denote it by f(θ). It is a non-negative function which satisfies ∫_Ω f(θ) dθ = 1, where Ω is the set of all possible values of θ (i.e., the parameter space). We then model data y = (y₁, …, y_n) given θ. The likelihood function, denoted by f(y|θ), quantifies the likelihood of observing a particular sample y = (y₁, …, y_n) under an assumed probability model. By Bayes' Theorem, we update our knowledge about θ after observing data y as

f(θ|y) = f(y|θ) f(θ) / f(y).   (1)

The function f(θ|y) is called the posterior density function of θ given data y. Since we treat the observed data y = (y₁, …, y_n) as fixed numbers, we often express Eq. (1) as

f(θ|y) ∝ f(y|θ) f(θ) = k f(y|θ) f(θ),   (2)

where k is the normalizing constant which makes ∫_Ω f(θ|y) dθ = 1. We can often recognize f(θ|y) based on the prior density function f(θ) and the likelihood function f(y|θ) without considering the denominator f(y) = ∫_Ω f(y|θ) f(θ) dθ in Eq. (1), which is called the marginal likelihood.

2.1. Example

Suppose we observe n = 20 rats for 2 years. Let π be the parameter of interest, which is interpreted as the probability of developing some type of tumor. Suppose a researcher models the prior knowledge about π using the prior density function

f(π) = [Γ(a + b) / (Γ(a) Γ(b))] π^{a−1} (1 − π)^{b−1}, 0 < π < 1.   (3)

It is known as the beta distribution with shape parameters a > 0 and b > 0. We often denote the beta distribution by π ∼ Beta(a, b), and the values of a and b must be specified by the researcher independently of the data. Let y = (y₁, …, y_n) denote the observed data, where y_i = 1 if the ith rat developed tumor and y_i = 0 otherwise. Assuming y₁, …, y_n are independent observations, the likelihood function is

f(y|π) = ∏_{i=1}^{n} π^{y_i} (1 − π)^{1−y_i} = π^s (1 − π)^{n−s},   (4)

where s = Σ_{i=1}^{n} y_i is the total number of rats that developed tumor. By Eq. (2), the posterior density function of π is

f(π|y) = k π^{a+s−1} (1 − π)^{b+n−s−1},   (5)

where k = Γ(a + b + n) / (Γ(a + s) Γ(b + n − s)) is the normalizing constant which makes ∫₀¹ f(π|y) dπ = 1. We can recognize that π|y ∼ Beta(a + s, b + n − s).

If the researcher fixed a = 2 and b = 3 and observed s = 9 from a sample of size n = 20, the prior density function is f(π) = k π (1 − π)² with k = Γ(5)/(Γ(2) Γ(3)) = 12, and the posterior density function is f(π|y) = k π^{10} (1 − π)^{13} with k = Γ(25)/(Γ(11) Γ(14)) = 27457584. The prior and posterior distributions are shown in Figure 1. The knowledge about π becomes more certain (less variance) after observing the data.
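The conjugate arithmetic above is easy to verify directly (a plain-Python sketch using the chapter's numbers):

```python
a, b = 2, 3          # prior Beta(2, 3)
n, s = 20, 9         # 9 of 20 rats developed tumor
a_post, b_post = a + s, b + n - s      # posterior Beta(11, 14)

# Mean and variance of a Beta(p, q) are p/(p+q) and mean*(1-mean)/(p+q+1)
post_mean = a_post / (a_post + b_post)                       # 11/25 = 0.44
post_var = post_mean * (1 - post_mean) / (a_post + b_post + 1)
prior_mean = a / (a + b)
prior_var = prior_mean * (1 - prior_mean) / (a + b + 1)

print(post_mean)             # 0.44
print(post_var < prior_var)  # True: less variance after observing the data
```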

2.2. Example

This example is simplified from Shao and Small [6]. In dose-response studies, we model π as a function of dose x. There are many link functions between π and x used in practice. In this example, we focus on the link function

π_x = e^{β₀ + β₁x} / (1 + e^{β₀ + β₁x}),   (6)

which is known as a logistic regression model. It is commonly assumed that a dose-response curve increases with respect to dose, so we assume β₁ > 0 (and β₀ can be any real number). There are two regression parameters in Eq. (6), β₀ and β₁, and we denote them as β = (β₀, β₁).

Figure 2 presents two dose-response curves. The solid curve is generated by β = (−1, 2), and the dotted curve is generated by β = (−2, 5). As β₀ increases, the background risk π₀ = e^{β₀}/(1 + e^{β₀}) increases, where π₀ is interpreted as the probability of tumor development at dose x = 0. The dose-response curve increases when β₁ > 0, and it decreases when β₁ < 0. The rate of change in the dose-response curve is determined by |β₁|.

To express prior knowledge about β, we need to find an appropriate prior density function f(β). It is not simple because it is difficult to express one's knowledge on the two-dimensional parameter β = (β₀, β₁). For mathematical convenience, some practitioners use a flat prior density function f(β) ∝ 1. Another way of expressing a lack of prior knowledge about β is

f(β) ∝ (1/(2πσ²)) exp(−(β₀² + β₁²)/(2σ²)) I_{β₁>0}   (7)

with an arbitrarily large value of σ [6]. When a reliable source of prior information is available, there is a practical method known as the conditional mean prior [7], which will be discussed in a later section (see Section 4.2). In an experiment, the experimental doses x = (x₁, …, x_n) are fixed, and we observe random binary outcomes y = (y₁, …, y_n). Given y (and fixed x), the likelihood function is

f(y|β) = ∏_{i=1}^{n} ( e^{β₀ + β₁x_i} / (1 + e^{β₀ + β₁x_i}) )^{y_i} ( 1 / (1 + e^{β₀ + β₁x_i}) )^{1−y_i} = e^{β₀s₁ + β₁s₂} / ∏_{i=1}^{n} (1 + e^{β₀ + β₁x_i}),   (8)

Figure 1. The prior f(π) in the dotted curve and the posterior f(π|y) in the solid curve.

where s₁ = Σ_{i=1}^{n} y_i and s₂ = Σ_{i=1}^{n} x_i y_i. By incorporating both prior and data, the posterior density function is

f(β|y) ∝ f(β) e^{β₀s₁ + β₁s₂} / ∏_{i=1}^{n} (1 + e^{β₀ + β₁x_i}).   (9)

In animal-based studies, one parameter of interest is the median effective dose, which is denoted by ED50. It is the dose which satisfies

π_{ED50} = e^{β₀ + β₁ED50} / (1 + e^{β₀ + β₁ED50}) = .5,   (10)

and it can be shown by algebra that ED50 = −β₀/β₁. In the case of β₀ = −2 and β₁ = 5, we have ED50 = .4, as shown by the dotted curve in the figure. In the case of β₀ = −1 and β₁ = 2, we have ED50 = .5, as shown by the solid curve.
In 1997, International Agency for Research on Cancer classified 2,3,7,8-Tetrachlorodibenzo-p-
dioxin (known as TCDD) as a carcinogen for humans based on various empirical evidence [8].

Figure 2. Two dose-response curves using the logistic link.

In 1978, Kociba et al. presented data on male Sprague-Dawley rats at four experimental doses: 0, 1, 10 and 100 nanograms per kilogram per day (ng/kg/day) [9]. In the control dose group, nine of 86 rats developed tumor (known as hepatocellular carcinoma); three of 50 rats developed the tumor at dose 1; 18 of 50 rats at dose 10; and 34 of 48 rats at dose 100 [6]. Without loss of generality, we let x_i = 0 for i = 1, …, 86; x_i = 1 for i = 87, …, 136; x_i = 10 for i = 137, …, 186; and x_i = 100 for i = 187, …, 234. The given information is sufficient to calculate s₁ = Σ_{i=1}^{n} y_i = 64 and s₂ = Σ_{i=1}^{n} x_i y_i = 3583. By the use of the flat prior f(β) ∝ 1 with the restriction β₁ > 0, given the observed sample of size n = 234, we can generate random draws of β = (β₀, β₁) from the posterior density function

f(β|y) ∝ [ e^{β₀s₁ + β₁s₂} / ∏_{i=1}^{n} (1 + e^{β₀ + β₁x_i}) ] I_{β₁>0},   (11)

where I_{β₁>0} = 1 if β₁ > 0 and I_{β₁>0} = 0 otherwise. Using a Markov chain Monte Carlo (MCMC) method, we can approximate the posterior distribution of β as shown in the left panel of Figure 3. By transforming (β₀, β₁) to ED50 = −β₀/β₁, we can approximate the posterior
Figure 3. Approximate posterior distributions of (β₀, β₁) and ED50 = −β₀/β₁.

distribution of the median effective dose ED50, as shown in the right panel. The posterior mean of ED50 is E(ED50|y) = 64.9 with 95% credible interval (50.8, 82.5), the 2.5th and 97.5th percentiles of the posterior distribution.
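A random-walk Metropolis sketch of this computation, using the Kociba counts from the text (the starting values, step sizes, and burn-in length are arbitrary choices of ours, not from the chapter):

```python
import numpy as np

# Kociba data: (dose, number of rats, number with tumor)
data = [(0, 86, 9), (1, 50, 3), (10, 50, 18), (100, 48, 34)]
doses = np.concatenate([np.full(n, x) for x, n, _ in data])
y = np.concatenate([np.r_[np.ones(s), np.zeros(n - s)] for _, n, s in data])

def log_post(b0, b1):
    """Log of Eq. (11): logistic log-likelihood, flat prior restricted to b1 > 0."""
    if b1 <= 0:
        return -np.inf
    eta = b0 + b1 * doses
    return np.sum(y * eta) - np.sum(np.log1p(np.exp(eta)))

rng = np.random.default_rng(7)
b = np.array([-1.0, 0.01])
lp = log_post(*b)
draws = []
for i in range(20000):
    prop = b + rng.normal(scale=[0.15, 0.003])   # random-walk proposal
    lp_prop = log_post(*prop)
    if np.log(rng.uniform()) < lp_prop - lp:     # Metropolis accept/reject
        b, lp = prop, lp_prop
    if i >= 5000:                                # discard burn-in
        draws.append(b.copy())

draws = np.array(draws)
ed50 = -draws[:, 0] / draws[:, 1]                # posterior draws of ED50
```

The `ed50` draws can then be summarized by their mean and their 2.5th/97.5th percentiles, which should land near the values reported above.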

3. Bayesian model averaging

In a small sample, we borrow the strength of a parametric model to gain efficiency in param-
eter estimation. However, an assumed model may not describe the true dose-response rela-
tionship adequately. The impact of model misspecification is not negligible particularly in a
poor experimental design. In such a limited practical situation, Bayesian model averaging
(BMA) can be a useful method to account for model uncertainty. It is widely applied in
practice, and in this section, we focus on the application to cancer risk assessment for the
estimation of a benchmark dose [1, 6, 10, 11].
Let θ denote a parameter of interest. Suppose we have a set of K candidate models denoted by M = {M1, …, MK}. Let $\vec{\beta}_k$ denote the vector of regression parameters under model Mk for k = 1, …, K. Suppose θ is a function of $\vec{\beta}_k$, and the interpretation of θ must be common across all models. Let $f(\vec{\beta}_k \mid M_k)$ and $f(\vec{y} \mid \vec{\beta}_k, M_k)$ denote the prior density function and the likelihood function, respectively, under Mk. By the Law of Total Probability, the posterior density function of θ is as follows

$$f(\theta \mid \vec{y}) = \sum_{k=1}^{K} f(\theta \mid M_k, \vec{y})\, P(M_k \mid \vec{y}). \qquad (12)$$

In Eq. (12), the posterior density function $f(\theta \mid M_k, \vec{y})$ depends on model Mk, and the posterior model probability $P(M_k \mid \vec{y})$ quantifies the plausibility of model Mk after observing data, which is given by

$$P(M_k \mid \vec{y}) = \frac{f(\vec{y} \mid M_k)\, P(M_k)}{\sum_{j=1}^{K} f(\vec{y} \mid M_j)\, P(M_j)}. \qquad (13)$$

174 Bayesian Inference

In Eq. (13), the prior model probability P(Mk) is determined before observing data such that P(Mk) > 0 for k = 1, …, K and $\sum_{k=1}^{K} P(M_k) = 1$. The marginal likelihood under Mk requires the integration

$$f(\vec{y} \mid M_k) = \int f(\vec{y} \mid \vec{\beta}_k, M_k)\, f(\vec{\beta}_k \mid M_k)\, d\vec{\beta}_k. \qquad (14)$$

In the BMA method, all K models contribute to inference of θ through the averaged posterior density function in Eq. (12), and the weight of contribution is determined by Bayes' Theorem in Eq. (13).
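In practice, Eq. (13) is usually computed from log marginal likelihoods. A small sketch (the numeric values below are hypothetical placeholders, not results from this chapter):

```python
import math

def posterior_model_probs(log_marglik, prior_probs):
    """Eq. (13) computed from log marginal likelihoods, using the
    log-sum-exp trick to avoid numerical underflow."""
    logs = [lm + math.log(p) for lm, p in zip(log_marglik, prior_probs)]
    m = max(logs)
    unnorm = [math.exp(l - m) for l in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Hypothetical log marginal likelihoods for K = 2 models
probs = posterior_model_probs([-120.3, -117.4], [0.5, 0.5])
print(probs)
```

The model with the larger marginal likelihood receives most of the posterior weight.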

3.1. Example
This example is continued from the example in Section 2.2. Recall that πx is interpreted as the probability of a toxic event (tumor development) at dose x. In many cancer risk assessments, a parameter of interest is θγ at a fixed risk level γ, which is defined as follows

$$\gamma = \frac{\pi_{\theta_\gamma} - \pi_0}{1 - \pi_0} \qquad (15)$$

or equivalently $\pi_{\theta_\gamma} = \pi_0 + (1 - \pi_0)\,\gamma$. In words, θγ is the dose corresponding to a fixed increase in the risk level. In the frequentist framework, Crump defined a benchmark dose as a lower confidence limit for θγ [12]. In the Bayesian framework, an analogous definition would be a lower credible bound (i.e., a fixed low percentile of the posterior distribution of θγ). The definition is widely applied to public health protection [13].

In practice, γ is fixed between 0.01 and 0.1. Often, the estimation of θγ is highly sensitive to an assumed dose-response model because we lack information at low doses. Shao and Small fixed γ = 0.1 and applied BMA with K = 2 models, the logistic model and the quantal-linear model [6]. In the quantal-linear model, the probability of tumor development is modeled by

$$\pi_x = \beta_0 + (1 - \beta_0)(1 - e^{-\beta_1 x}) \qquad (16)$$

with the restrictions 0 < β0 < 1 and β1 > 0 under the monotonicity assumption. The logistic model was given in Eq. (6) of Section 2.2.
Let M1 denote the logistic model, and let M2 denote the quantal-linear model. Assume the uniform prior model probabilities P(M1) = P(M2) = .5 and flat priors on the regression parameters. By posterior sampling, we can approximate the posterior model probabilities $P(M_1 \mid \vec{y}) = .049$ and $P(M_2 \mid \vec{y}) = .951$. Under M1, the posterior mean of θ0.1 is 20.95 with the 5th percentile 16.74. Under M2, the posterior mean is 8.25 with the 5th percentile 5.95. These

Figure 4. Posterior distributions of θ0.1 from the logistic model (left panel), the quantal-linear model (middle panel), and the Bayesian model averaging (right panel).

results are very similar to the results reported by Shao and Small [6]. From these model-specific statistics, we can calculate the model-averaged posterior mean

$$E(\theta_{0.1} \mid \vec{y}) = \sum_{k=1}^{2} E(\theta_{0.1} \mid M_k, \vec{y})\, P(M_k \mid \vec{y}) = 20.95\,(.049) + 8.25\,(.951) = 8.87. \qquad (17)$$

However, we are not able to calculate the 5th percentile of the model-averaged posterior distribution based on the given statistics. In fact, we need to approximate the posterior distribution $f(\theta_{0.1} \mid \vec{y})$, which is a mixture of $f(\theta_{0.1} \mid M_1, \vec{y})$ and $f(\theta_{0.1} \mid M_2, \vec{y})$ weighted by $P(M_1 \mid \vec{y}) = .049$ and $P(M_2 \mid \vec{y}) = .951$, respectively, as shown in Figure 4. In the figure, the left panel shows an approximation of $f(\theta_{0.1} \mid M_1, \vec{y})$, the middle panel shows an approximation of $f(\theta_{0.1} \mid M_2, \vec{y})$, and the right panel shows an approximation of the averaged posterior $f(\theta_{0.1} \mid \vec{y})$. The averaged posterior density $f(\theta_{0.1} \mid \vec{y})$ is bimodal, but it is very close to $f(\theta_{0.1} \mid M_2, \vec{y})$ because the quantal-linear model M2 fits the data better than the logistic model M1 by a Bayes factor of $P(M_2 \mid \vec{y}) / P(M_1 \mid \vec{y}) = .951/.049 = 19.4$. The 5th percentile of the model-averaged posterior distribution is approximately 5.97, and it is a BMA-BMD based on the BMA method proposed by Raftery et al. [1] and the BMD estimation method suggested by Crump [12].
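A sketch of how the mixture in Eq. (12) can be approximated by sampling. The model-specific posterior samples below are hypothetical normal draws standing in for real MCMC output; only the mixing step mirrors the text:

```python
import numpy as np

rng = np.random.default_rng(1)

def bma_mixture(samples_by_model, post_probs, size=100_000):
    """Draw from the model-averaged posterior in Eq. (12): pick a model
    with probability P(Mk | y), then draw from that model's posterior."""
    k = rng.choice(len(samples_by_model), size=size, p=post_probs)
    draws = np.empty(size)
    for i, s in enumerate(samples_by_model):
        mask = k == i
        draws[mask] = rng.choice(s, size=int(mask.sum()))
    return draws

# Hypothetical model-specific posterior samples of theta_0.1 (means and
# SDs chosen for illustration only)
theta_m1 = rng.normal(20.95, 2.5, size=5000)   # logistic model M1
theta_m2 = rng.normal(8.25, 1.4, size=5000)    # quantal-linear model M2
mix = bma_mixture([theta_m1, theta_m2], [0.049, 0.951])
print(mix.mean(), np.percentile(mix, 5))
```

The mixture mean matches the weighted average of Eq. (17), while the 5th percentile must be read off the mixture itself, as the text explains.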

4. Application of Bayesian decision theory to Phase I trials

In a Phase I cancer trial, the main objectives are to study the safety of a new chemotherapy
and to determine an appropriate dose for future patients. Since trial participants are cancer
patients, dose allocations require ethical considerations. Whitehead and Williams discussed
several Bayesian approaches to dose allocations [14]. One decision rule is devised from the
perspective of trial participants (individual-level ethics), and another decision rule is devised
from the perspective of future patients (population-level ethics). However, a decision rule,
which is devised from the population-level ethics, is not widely accepted in current prac-
tice [15]. Instead, there are some proposed decision rules, which compromise between the
individual- and population-level perspectives [3, 16]. In this section, we discuss the two
conflicting perspectives in Phase I clinical trials and a compromising method based on Bayesian decision theory.
Assume a dose-response relationship follows a logistic model

$$\pi_x = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}, \qquad (18)$$

where x is a dose in the logarithmic scale (base e) and πx is the probability of observing an adverse event due to the toxicity of a new chemotherapy at dose x. The logarithmic transformation of the dose ensures that the probability of an adverse event approaches 0 as the raw dose approaches 0 (i.e., as x → −∞). Let $\vec{x}_n = (x_1, \ldots, x_n)$ denote a series of decisions for n patients (i.e., allocated doses) and $\vec{y}_n = (y_1, \ldots, y_n)$ denote a series of observed responses, where yi = 1 indicates an adverse event and yi = 0 otherwise. Let $L(\vec{\beta}, x_{n+1})$ denote the loss incurred by allocating the next patient at xn+1. Based on Bayesian decision theory, we want to find the xn+1 which minimizes the posterior mean of $L(\vec{\beta}, x_{n+1})$. If we let A denote an action space, the set of all possible dose allocations for the next patient, the decision rule can be written as follows:

$$x_{n+1}^{*} = \operatorname{argmin}_{x_{n+1} \in A}\, E\left[ L(\vec{\beta}, x_{n+1}) \mid \vec{y}_n \right]. \qquad (19)$$

A choice of L has a substantial impact on the operating characteristics of a Phase I trial, including (i) the degree of under- and over-dosing in the trial, (ii) the observed number of adverse events at the end of the trial, and (iii) the quality of estimation at the end of the trial.
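The decision rule in Eq. (19) can be sketched by averaging a loss over posterior draws and minimizing over a grid of candidate doses. The posterior sample below is a hypothetical normal approximation, and the squared-error loss anticipates the individual-level loss discussed in Section 4.4:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical posterior draws of (beta0, beta1), standing in for MCMC output
beta = rng.multivariate_normal([-3.0, 0.8], [[0.25, 0.0], [0.0, 0.01]], size=2000)
gamma = 0.2
# MTD draws via the transformation given in Section 4.1 (Eq. (20))
theta = (np.log(gamma / (1 - gamma)) - beta[:, 0]) / beta[:, 1]

# Eq. (19) with a squared-error loss: average the loss over posterior draws
# for each candidate dose, then take the minimizer over the action space
action_space = np.linspace(-2.0, 8.0, 101)
risk = [np.mean((x - theta) ** 2) for x in action_space]
x_next = action_space[int(np.argmin(risk))]
print(x_next)   # the grid point nearest the posterior mean of theta
```

Under squared-error loss, the minimizer of the continuous problem is exactly the posterior mean of θγ, so the grid search returns the closest grid point to it.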

4.1. Parameter of interest: maximum tolerable dose

Let N denote an available sample size for a Phase I clinical trial. A typical sample size is N ≤ 30.
Let γ denote a target risk level, the probability of an adverse event. In a cancer study, a typical
target risk level γ is fixed between .15 and .35 depending on the severity of an adverse event.
Then, the dose corresponding to γ is called a maximum tolerable dose (MTD) at level γ, and
we denote it by θγ in the logarithmic scale. Under the logistic model in Eq. (18), it is defined as
follows
 
$$\theta_\gamma = \frac{\log\left(\frac{\gamma}{1-\gamma}\right) - \beta_0}{\beta_1}. \qquad (20)$$

At the end of a trial (after observing N responses), we estimate θγ by the posterior mean $\hat{\theta}_{\gamma,N} = E(\theta_\gamma \mid \vec{y}_N)$ for future patients.

4.2. Prior density function: conditional mean priors


A consequence of sequential decisions heavily depends on the prior density function $f(\vec{\beta})$. In particular, the first decision x1 must be made based on prior knowledge only because empirical evidence has not yet been observed. In addition, the later decisions x2, x3, … and the final inference of θγ are substantially affected by $f(\vec{\beta})$ because a Phase I study is typically based on a small sample. In this regard, we want to carefully utilize researchers' prior knowledge about $\vec{\beta}$, but it may be difficult to express this knowledge directly through $f(\vec{\beta})$. In this section, we discuss a method of eliciting prior knowledge which is more tractable than prior elicitation directly on $\vec{\beta}$.

Suppose a researcher selects two arbitrary doses, say x−1 < x0. Then, the researcher may express prior knowledge through two independent beta distributions

$$\pi_{x_i} = \frac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}} \sim \mathrm{Beta}(a_i, b_i), \quad i = -1, 0. \qquad (21)$$

Using the Jacobian transformation from $(\pi_{x_{-1}}, \pi_{x_0})$ to $\vec{\beta} = (\beta_0, \beta_1)$, it can be shown that the prior density function of $\vec{\beta}$ is given by

$$f(\vec{\beta}) \propto (x_0 - x_{-1}) \prod_{i=-1}^{0} \left( \frac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}} \right)^{a_i} \left( \frac{1}{1 + e^{\beta_0 + \beta_1 x_i}} \right)^{b_i}. \qquad (22)$$

This is known as the conditional mean prior under the logistic model [7].
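A small sketch evaluating the log of the prior kernel in Eq. (22); the function name is ours, and the doses and hyper-parameters match Prior 2 used later in Section 4.6:

```python
import math

def log_cmp_prior(b0, b1, x_lo, x_hi, a, b):
    """Log of the conditional mean prior kernel in Eq. (22); x_lo < x_hi
    are the two elicitation doses, a = (a_-1, a_0), b = (b_-1, b_0)."""
    out = math.log(x_hi - x_lo)
    for x, ai, bi in zip((x_lo, x_hi), a, b):
        eta = b0 + b1 * x
        log_pi = eta - math.log1p(math.exp(eta))    # log pi_x
        log_1mpi = -math.log1p(math.exp(eta))       # log (1 - pi_x)
        out += ai * log_pi + bi * log_1mpi
    return out

# Prior 2 of Section 4.6: x_-1 = 0, x_0 = 8, a = (1, 3), b = (3, 1)
print(log_cmp_prior(-3.0, 0.8, 0.0, 8.0, (1, 3), (3, 1)))
```

Such a function can be added to the log-posterior of an MCMC routine in place of a flat prior.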

4.3. Posterior density function: conjugacy


For notational convenience, we let yi = ai and ni = ai + bi for the two pseudo-observations i = −1, 0 (and ni = 1 for the observed patients i = 1, …, n). By conjugacy, the posterior density function of $\vec{\beta}$ can be concisely written as follows

$$f(\vec{\beta} \mid \vec{y}_n) \propto \frac{e^{\beta_0 s_1 + \beta_1 s_2}}{\prod_{i=-1}^{n} (1 + e^{\beta_0 + \beta_1 x_i})^{n_i}}, \qquad (23)$$

where $s_1 = \sum_{i=-1}^{n} y_i$ and $s_2 = \sum_{i=-1}^{n} x_i y_i$. After observing n responses, the decision rule for the next patient is as follows

$$x_{n+1}^{*} = \operatorname{argmin}_{x_{n+1} \in A} \int L(\vec{\beta}, x_{n+1})\, f(\vec{\beta} \mid \vec{y}_n)\, d\vec{\beta}. \qquad (24)$$

4.4. Loss functions for individual- and population-level ethics

A loss function, which reflects the perspective of individual-level ethics, is as follows:


$$L_I(\vec{\beta}, x_{n+1}) = (x_{n+1} - \theta_\gamma)^2. \qquad (25)$$

This loss function is analogous to the original continual reassessment method proposed by
O’Quigley et al. [17]. The square error loss attempts to treat a trial participant at θγ, and the
expected square error loss is minimized by the posterior mean of θγ.

From the perspective of population-level ethics, Whitehead and Brunier proposed a loss
function, which is equal to the asymptotic variance of the maximum likelihood estimator for
θγ [18]. The Fisher expected information matrix with a sample of size n + 1 is given by

$$I(\vec{\beta}) = \begin{pmatrix} \sum_{i=1}^{n+1} \tau_i & \sum_{i=1}^{n+1} \tau_i x_i \\ \sum_{i=1}^{n+1} \tau_i x_i & \sum_{i=1}^{n+1} \tau_i x_i^2 \end{pmatrix}, \qquad (26)$$

where $\tau_i = \pi_{x_i}(1 - \pi_{x_i})$. Then, the loss function (the asymptotic variance) is given by

$$L_P(\vec{\beta}, x_{n+1}) = \left[ \nabla h(\vec{\beta}) \right]^T \left[ I(\vec{\beta}) \right]^{-1} \left[ \nabla h(\vec{\beta}) \right], \qquad (27)$$

where

$$\nabla h(\vec{\beta}) = \begin{pmatrix} \dfrac{\partial \theta_\gamma}{\partial \beta_0} \\ \dfrac{\partial \theta_\gamma}{\partial \beta_1} \end{pmatrix} = -\frac{1}{\beta_1} \begin{pmatrix} 1 \\ \theta_\gamma \end{pmatrix} \qquad (28)$$

is the gradient vector, the partial derivatives of θγ with respect to β0 and β1. Kim and Gillen
decomposed the population-level loss function as follows
$$L_P(\vec{\beta}, x_{n+1}) = \frac{\tau_{n+1}(x_{n+1} - \theta_\gamma)^2 + s_n^{(0)}\left[(\theta_\gamma - \mu_n)^2 + \sigma_n^2\right]}{\left[s_n^{(0)} s_n^{(2)} - \left(s_n^{(1)}\right)^2\right] + s_n^{(0)}\, \tau_{n+1}\left[(x_{n+1} - \mu_n)^2 + \sigma_n^2\right]}, \qquad (29)$$

where

$$s_n^{(m)} = \sum_{i=1}^{n} \tau_i x_i^m \ \ (m = 0, 1, 2), \qquad \mu_n = \sum_{i=1}^{n} w_i x_i, \qquad \sigma_n^2 = \sum_{i=1}^{n} w_i x_i^2 - \left( \sum_{i=1}^{n} w_i x_i \right)^2, \qquad (30)$$

with the weight defined as $w_i = \tau_i / \sum_{j=1}^{n} \tau_j$ [3]. Eq. (29) merits the following important remarks. In fact, $L_P(\vec{\beta}, x_{n+1})$ considers individual-level ethics by including $L_I(\vec{\beta}, x_{n+1}) = (x_{n+1} - \theta_\gamma)^2$ in the numerator. By including $(x_{n+1} - \mu_n)^2$ in the denominator, where $\mu_n = \sum_{i=1}^{n} w_i x_i$, the population-level loss function reduces the loss by allocating the next patient further away from the weighted average of previously allocated doses (i.e., it is devised to favor information gain). In the long run, $L_P(\vec{\beta}, x_{n+1})$ is devised from a compromise between individual- and population-level

ethics, but the compromising process is rather too slow to be implemented in a small-sample
Phase I clinical trial [3].
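As a numerical sanity check (with illustrative parameter values and previously allocated doses of our choosing), the matrix form of Eqs. (26)-(28) and the decomposition in Eq. (29) can be compared; they agree up to the constant factor 1/β1², which does not depend on xn+1 and therefore does not change the minimizing dose:

```python
import numpy as np

def lp_matrix(beta0, beta1, xs, x_next, gamma):
    """Population-level loss via Eqs. (26)-(28): grad' I(beta)^{-1} grad."""
    theta = (np.log(gamma / (1 - gamma)) - beta0) / beta1
    x = np.append(xs, x_next)
    p = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))
    tau = p * (1 - p)
    info = np.array([[tau.sum(), (tau * x).sum()],
                     [(tau * x).sum(), (tau * x ** 2).sum()]])
    grad = -np.array([1.0, theta]) / beta1
    return float(grad @ np.linalg.solve(info, grad))

def lp_decomposed(beta0, beta1, xs, x_next, gamma):
    """The same loss via the decomposition in Eq. (29), which drops the
    constant factor 1/beta1^2 (it does not depend on x_next)."""
    xs = np.asarray(xs, dtype=float)
    theta = (np.log(gamma / (1 - gamma)) - beta0) / beta1
    p = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * xs)))
    tau = p * (1 - p)
    p1 = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x_next)))
    tau1 = p1 * (1 - p1)
    s0, s1, s2 = tau.sum(), (tau * xs).sum(), (tau * xs ** 2).sum()
    w = tau / s0
    mu = (w * xs).sum()
    var = (w * xs ** 2).sum() - mu ** 2
    num = tau1 * (x_next - theta) ** 2 + s0 * ((theta - mu) ** 2 + var)
    den = (s0 * s2 - s1 ** 2) + s0 * tau1 * ((x_next - mu) ** 2 + var)
    return float(num / den)

xs = [0.5, 1.0, 1.5, 2.5]                  # hypothetical allocated doses
a = lp_matrix(-3.0, 0.8, xs, 2.0, 0.2)
b = lp_decomposed(-3.0, 0.8, xs, 2.0, 0.2)
print(a, b / 0.8 ** 2)                      # the two agree
```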

4.5. Loss function for compromising the two perspectives


Kim and Gillen proposed to accelerate the compromising process by modifying $L_P(\vec{\beta}, x_{n+1})$ of Eq. (29) as follows

$$L_{B,\lambda}(\vec{\beta}, x_{n+1}) = \frac{a_n(\lambda)\, \tau_{n+1}(x_{n+1} - \theta_\gamma)^2 + s_n^{(0)}\left[(\theta_\gamma - \mu_n)^2 + \sigma_n^2\right]}{\left[s_n^{(0)} s_n^{(2)} - \left(s_n^{(1)}\right)^2\right] + s_n^{(0)}\, \tau_{n+1}\left[(x_{n+1} - \mu_n)^2 + \sigma_n^2\right]}, \qquad (31)$$

where

$$a_n(\lambda) = \left(1 + \frac{n}{N}\right)^{\lambda \left(1 + \frac{\sum_{i=1}^{n} y_i}{N\gamma}\right)} \qquad (32)$$

is an accelerating factor [3]. It has two implications. First, the compromising process is accelerated toward the individual-level ethics as the trial proceeds (i.e., as n increases). Second, the compromising process toward the individual-level ethics is accelerated at a faster rate when an adverse event is observed (i.e., when $\sum_{i=1}^{n} y_i$ increases). The tuning parameter λ controls the rate of acceleration. It imposes more emphasis on population-level ethics as λ → 0 and more emphasis on individual-level ethics as λ → ∞. The choice of λ shall depend on the severity level of an adverse event.
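The behavior of the accelerating factor in Eq. (32) is easy to explore numerically; a small sketch with the trial settings of Section 4.6 as illustrative inputs:

```python
def accel(n, n_events, N, gamma, lam):
    """Accelerating factor a_n(lambda) of Eq. (32)."""
    return (1 + n / N) ** (lam * (1 + n_events / (N * gamma)))

# N = 20, gamma = .2 as in the simulation of Section 4.6
print(accel(10, 0, 20, 0.2, 0.0))   # lambda = 0: always 1 (population focus)
print(accel(10, 0, 20, 0.2, 1.0))   # grows with n ...
print(accel(10, 3, 20, 0.2, 1.0))   # ... and faster once adverse events occur
```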

4.6. Simulation
To study the operating characteristics of LB,λ with respect to λ, we assume the logistic model with β0 = −3 and β1 = .8 as the true dose-response relationship, as shown in the left panel of Figure 5. The target risk level is fixed at γ = .2, so the true MTD is θ.2 = 2.02 in the logarithmic scale. We consider three different priors based on the conditional mean priors given in Eq. (22). For simplicity, we set a−1 = 1, b−1 = 3, a0 = 3 and b0 = 1 for all three priors. Then, we let x−1 = −4 and x0 = 4 for Prior 1; x−1 = 0 and x0 = 8 for Prior 2; and x−1 = 4 and x0 = 12 for Prior 3. The right panel of Figure 5 shows an approximated f(θ.2) for each prior. Prior 1 significantly underestimates the true θ.2 = 2.02 with prior mean E(θ.2) = −1.70, Prior 3 overestimates the truth with E(θ.2) = 5.38, and Prior 2 has a prior estimate relatively close to the truth with E(θ.2) = 1.40.

Let N = 20 be a fixed sample size. Let Yi = 1 denote an adverse event observed from the ith patient (Yi = 0 otherwise), so $\sum_{i=1}^{N} Y_i$ denotes the total number of adverse events observed at the end of a trial. The sum $\sum_{i=1}^{N} Y_i$ varies from trial to trial, and we want it to behave like Binomial(20, .2), which is the case when we treat all N = 20 patients at the true MTD θ.2. Figure 6 shows three simulated trials under the loss function LB,λ with λ = 0, 1, 5. When λ = 0,
Figure 5. The true dose-response relationship $\pi_x = e^{\beta_0 + \beta_1 x}/(1 + e^{\beta_0 + \beta_1 x})$ with β0 = −3 and β1 = .8 (where x is the dose in the logarithmic scale) in the simulation (left panel) and the three prior distributions of θ.2 approximated by kernel density (right panel).

Figure 6. Three simulated trials using the loss function LB,λ with λ = 0 (left), λ = 1 (middle) and λ = 5 (right) with a sample of size N = 20 and assumed parameter values β0 = −3, β1 = .8 and θ.2 = 2.02. (Each panel plots the allocated dose in the logarithmic scale against patient index, marking adverse events (AE) and non-AE responses.)

the up-and-down scheme has a high degree of fluctuation in order to maximize information about θ.2. When λ = 1, the up-and-down scheme stabilizes after the first few adverse events, and the stabilization occurs quickly when λ = 5 so as to treat trial participants near an estimated θ.2.

Let $\hat{\theta}_{.2} = E(\theta_{.2} \mid \vec{y}_N)$ denote the posterior estimate of θ.2 at the end of a trial, so that $\pi_{\hat{\theta}_{.2}}$ is the true probability of an adverse event at the estimated MTD. We focus on the following criteria: (i) $E(\pi_{\hat{\theta}_{.2}})$, which we desire to be close to γ = .2 for future patients; (ii) $V(\pi_{\hat{\theta}_{.2}})$, which we desire to be as low as possible for future patients; (iii) $E[(\pi_{\hat{\theta}_{.2}} - .2)^2]$, which we desire to be as low as possible for future patients; (iv) $E(\sum_{i=1}^{20} Y_i)$, which we desire to be close to Nγ = 4 for trial participants; and (v) $P(3 \le \sum_{i=1}^{20} Y_i \le 5)$, which we desire to be close to one for trial participants.

Prior  λ    E(π_{θ̂.2})  V(π_{θ̂.2})  E[(π_{θ̂.2} − .2)²]  E(ΣYi)  P(3 ≤ ΣYi ≤ 5)
1      0    0.0964       0.0019       0.0126               2.4353  0.4318
1      .5   0.1034       0.0024       0.0118               2.0997  0.2298
1      1    0.1082       0.0028       0.0113               1.8969  0.1714
1      2    0.1100       0.0031       0.0112               1.6929  0.1211
1      5    0.1157       0.0035       0.0106               1.3128  0.0596
2      0    0.1665       0.0054       0.0065               4.1217  0.9889
2      .5   0.1705       0.0056       0.0065               3.9598  0.9877
2      1    0.1727       0.0060       0.0068               3.9025  0.9670
2      2    0.1751       0.0066       0.0072               3.8707  0.9291
2      5    0.1763       0.0067       0.0073               3.8442  0.9068
3      0    0.2743       0.0048       0.0103               6.1875  0.1600
3      .5   0.2673       0.0048       0.0093               6.3954  0.1430
3      1    0.2606       0.0046       0.0083               6.6194  0.1165
3      2    0.2562       0.0045       0.0077               6.8035  0.1020
3      5    0.2499       0.0044       0.0068               7.0274  0.0760

Table 1. Simulation results of 10,000 replicates for λ = 0, .5, 1, 2, 5 and each prior (ΣYi is the total number of adverse events among the N = 20 patients).

Table 1 summarizes simulation results of 10,000 replicates for each prior. For all three priors, we observe similar tendencies. First, $E(\pi_{\hat{\theta}_{.2}})$ gets closer to γ = .2 as λ increases. Second, $V(\pi_{\hat{\theta}_{.2}})$ decreases as λ decreases to zero. The average squared distance between $\pi_{\hat{\theta}_{.2}}$ and γ = .2 measures a balance between $|E(\pi_{\hat{\theta}_{.2}}) - .2|$ and $V(\pi_{\hat{\theta}_{.2}})$, and the superiority depends on the priors. Lastly, as λ → 0, we have a larger $P(3 \le \sum_{i=1}^{20} Y_i \le 5)$ and an $E(\sum_{i=1}^{20} Y_i)$ that is more robust to prior elicitation.

In summary, when we emphasize population-level ethics more, we have a smaller variance in the estimation for future patients (with a greater absolute bias, potentially due to Jensen's Inequality), and the distribution of $\sum_{i=1}^{n} Y_i$ becomes more robust to prior elicitations. When we emphasize individual-level ethics more, we have a larger variance in the estimation, and the distribution of $\sum_{i=1}^{n} Y_i$ becomes more sensitive to prior elicitations.

5. Consensus prior

In Bayesian inference, researchers are able to utilize information which is independent of the observed data. This allows researchers to incorporate any form of information, such as one's experience and the existing literature, which may be particularly useful in a small-sample study. On the other hand, we are concerned about subjectivity and prior sensitivity in sparse data. Furthermore, it is possible to have disagreement among multiple researchers' prior elicitations about a parameter θ.
Suppose there are K researchers with their own prior density functions, say f(θ|Qk) for k = 1, …, K, and they share the same likelihood function $f(\vec{y} \mid \theta)$. Each prior elicitation leads to a unique Bayes estimator

$$\hat{\theta}_k = E(\theta \mid \vec{y}, Q_k) = \int \theta\, f(\theta \mid \vec{y}, Q_k)\, d\theta, \qquad (33)$$

where $f(\theta \mid \vec{y}, Q_k) \propto f(\vec{y} \mid \theta)\, f(\theta \mid Q_k)$ is the posterior density function of θ given data $\vec{y}$ and the kth prior elicitation Qk. For posterior estimation, one reasonable approach to compromise is a weighted average $\sum_{k=1}^{K} w_k \hat{\theta}_k$, where wk > 0 for k = 1, …, K and $\sum_{k=1}^{K} w_k = 1$. In this section, we discuss two different weighting methods. The first method is to fix wk before observing data (referred to as the prior weighting scheme). The second method is to determine $w_k(\vec{y})$ after observing data $\vec{y}$ so that $w_k(\vec{y})$ increases when the kth prior elicitation Qk is better supported by the observed data $\vec{y}$ (referred to as the posterior weighting scheme) [5].

For the prior weighting scheme, we denote wk = P(Qk), which quantifies the credibility of the kth prior elicitation. For the posterior weighting scheme, we consider

$$w_k(\vec{y}) = P(Q_k \mid \vec{y}) = \frac{f(\vec{y} \mid Q_k)\, P(Q_k)}{\sum_{j=1}^{K} f(\vec{y} \mid Q_j)\, P(Q_j)} = \frac{w_k\, f(\vec{y} \mid Q_k)}{\sum_{j=1}^{K} w_j\, f(\vec{y} \mid Q_j)}, \qquad (34)$$

where $f(\vec{y} \mid Q_k) = \int f(\vec{y} \mid \theta)\, f(\theta \mid Q_k)\, d\theta$ is the marginal likelihood from the kth prior elicitation. This formulation is similar to the BMA method discussed in Section 3. It can be shown that $\sum_{k=1}^{K} w_k(\vec{y})\, \hat{\theta}_k$ is the Bayes estimator (the posterior mean of θ) when a consensus prior $f(\theta) = \sum_{k=1}^{K} w_k f(\theta \mid Q_k)$ is used with wk = P(Qk) [5].

Samaniego discussed self-consistency when compromised inference is used through the prior weighting scheme $\sum_{k=1}^{K} w_k \hat{\theta}_k$ [4]. Let θ denote a parameter of interest and

$$E(\theta) = \int \theta\, f(\theta)\, d\theta = \theta^{*} \qquad (35)$$

be the prior expectation, the mean of the prior density function f(θ). Let $\tilde{\theta}$ denote a sufficient statistic which serves as an unbiased estimator for θ. When $E(\theta \mid \tilde{\theta} = \theta^{*}) = \theta^{*}$ is satisfied, it is called self-consistency [4].
Self-consistency can be achieved under simple models. For example, let $\vec{Y} = (Y_1, \ldots, Y_n)$ be a random sample, where Yi ~ Bernoulli(θ), and assume the prior θ ~ Beta(a, b). It can be shown that the maximum likelihood estimator $\tilde{\theta} = \frac{1}{n}\sum_{i=1}^{n} Y_i$ is a sufficient statistic and an unbiased estimator for θ. The posterior mean is a weighted average between θ* and $\tilde{\theta}$ as follows

$$E(\theta \mid \tilde{\theta}) = c\, \theta^{*} + (1 - c)\, \tilde{\theta}, \qquad (36)$$

where $c = \frac{a+b}{a+b+n}$. If we observe $\tilde{\theta} = \theta^{*}$, we achieve self-consistency because $E(\theta \mid \tilde{\theta} = \theta^{*}) = \theta^{*}$. In words, when the prior estimate and the maximum likelihood estimate are identical, the posterior estimate must be consistent with both. Self-consistency can also be achieved in the prior weighting scheme under certain conditions, as illustrated in the following example.
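A tiny exact-arithmetic check of Eq. (36) and the self-consistency property under the Beta-Bernoulli model (the hyper-parameter values are arbitrary):

```python
from fractions import Fraction

# Beta(a, b) prior; prior mean theta* = a / (a + b)
a, b, n = Fraction(3), Fraction(1), 8
theta_star = a / (a + b)                  # 3/4
s = theta_star * n                        # data whose sample mean equals theta*
c = (a + b) / (a + b + n)
post_mean = (a + s) / (a + b + n)         # posterior mean of Beta(a + s, b + n - s)
assert post_mean == c * theta_star + (1 - c) * (s / n)   # Eq. (36)
print(post_mean == theta_star)            # self-consistency holds: True
```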

5.1. Binomial experiment


Let Yi ~ Bernoulli(π) for i = 1, …, n and assume Y1, …, Yn are independent. Suppose the kth researcher specifies the prior distribution π | Qk ~ Beta(ak, bk) for k = 1, …, K. For the prior weighting scheme, let wk = P(Qk), the prior probability for the kth prior elicitation (fixed before observing data). Since $E(\pi \mid Q_k) = \frac{a_k}{a_k + b_k}$ and the expectation E(·) is a linear operator, the mean of the "consensus prior" is

$$E(\pi) = \int_0^1 \pi f(\pi)\, d\pi = \int_0^1 \pi \sum_{k=1}^{K} f(\pi \mid Q_k)\, P(Q_k)\, d\pi = \sum_{k=1}^{K} w_k \int_0^1 \pi f(\pi \mid Q_k)\, d\pi = \sum_{k=1}^{K} w_k\, E(\pi \mid Q_k). \qquad (37)$$

Let E(π) = π* and suppose the K researchers observed the consistent result $\tilde{\pi} = \frac{1}{n}\sum_{i=1}^{n} Y_i = \pi^{*}$. The individual-specific Bayes estimator is as follows

$$\hat{\pi}_k = E(\pi \mid \tilde{\pi} = \pi^{*}, Q_k) = c_k\, E(\pi \mid Q_k) + (1 - c_k)\, \pi^{*} \qquad (38)$$

for the kth researcher, where $c_k = \frac{a_k + b_k}{a_k + b_k + n}$. The compromised Bayes estimator is as follows

$$E(\pi \mid \tilde{\pi} = \pi^{*}) = \sum_{k=1}^{K} w_k \hat{\pi}_k = \sum_{k=1}^{K} w_k \left[ c_k\, E(\pi \mid Q_k) + (1 - c_k)\, \pi^{*} \right]. \qquad (39)$$

If we allow individual-specific prior elicitations ak and bk with the restriction ak + bk = m for all K researchers (i.e., the same strength of prior elicitation), the value $c_k = \frac{m}{m+n}$ is constant over all researchers. By letting the constant ck = c,

$$E(\pi \mid \tilde{\pi} = \pi^{*}) = c \sum_{k=1}^{K} w_k\, E(\pi \mid Q_k) + (1 - c)\, \pi^{*} \sum_{k=1}^{K} w_k = c\, E(\pi) + (1 - c)\, \pi^{*} = \pi^{*}, \qquad (40)$$

so the self-consistency is satisfied.



For the posterior weighting scheme given data $\vec{y} = (y_1, \ldots, y_n)$, the marginal likelihood from the kth prior elicitation is as follows

$$f(\vec{y} \mid Q_k) = \int_0^1 f(\vec{y} \mid \pi)\, f(\pi \mid Q_k)\, d\pi = \frac{\Gamma(a_k + b_k)}{\Gamma(a_k)\, \Gamma(b_k)} \cdot \frac{\Gamma(a_k + s)\, \Gamma(b_k + n - s)}{\Gamma(a_k + b_k + n)}, \qquad (41)$$

where $s = \sum_{i=1}^{n} y_i$ is an observed sufficient statistic. Then, the posterior weighting scheme becomes $\sum_{k=1}^{K} w_k(\vec{y})\, \hat{\pi}_k$ with

$$w_k(\vec{y}) = \frac{w_k\, f(\vec{y} \mid Q_k)}{\sum_{j=1}^{K} w_j\, f(\vec{y} \mid Q_j)}, \qquad \hat{\pi}_k = \frac{a_k + s}{a_k + b_k + n}. \qquad (42)$$

If we desire an equal strength from each researcher's prior elicitation, we may fix ak + bk = m and wk = 1/K. In the posterior weighting scheme, it is difficult to achieve self-consistency.
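A sketch of the posterior weighting scheme under the binomial model, following Eqs. (41)-(42); the prior pair is Case 2 from the text below, while the data s = 8 successes out of n = 10 are a hypothetical choice:

```python
import math

def log_marglik(a, b, s, n):
    """Log marginal likelihood in Eq. (41) for a Beta(a, b) prior
    and s successes in n Bernoulli trials."""
    return (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
            + math.lgamma(a + s) + math.lgamma(b + n - s)
            - math.lgamma(a + b + n))

def posterior_weights(priors, s, n, w=None):
    """Posterior weighting scheme of Eq. (42) for K researchers."""
    K = len(priors)
    w = w or [1.0 / K] * K
    logs = [math.log(wk) + log_marglik(a, b, s, n)
            for wk, (a, b) in zip(w, priors)]
    m = max(logs)
    u = [math.exp(l - m) for l in logs]
    return [x / sum(u) for x in u]

# Case 2 of the text: (a1, b1) = (2, 6) and (a2, b2) = (6, 2), n = 10
priors = [(2, 6), (6, 2)]
s, n = 8, 10                        # data favoring the second prior
wts = posterior_weights(priors, s, n)
est = sum(wk * (a + s) / (a + b + n) for wk, (a, b) in zip(wts, priors))
print(wts, est)
```

The data-dependent weights shift almost all of the mass to the prior that the data support, which is the bias-reducing behavior discussed next.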

Whether or not self-consistency is satisfied, the practical concern is the quality of estimation, such as bias, variance and mean square error. Assuming K = 2 researchers have disagreeing prior knowledge and a sample of size n = 10, let us consider three cases. Suppose the two researchers express relatively mild disagreement as (a1, b1) = (1, 3) and (a2, b2) = (3, 1) in Case 1, relatively strong disagreement as (a1, b1) = (2, 6) and (a2, b2) = (6, 2) in Case 2, and even stronger disagreement as (a1, b1) = (3, 9) and (a2, b2) = (9, 3) in Case 3. For each case, Figure 7 provides the relative bias, variance and mean square error (MSE) for comparing the posterior weighting scheme $\sum_{k=1}^{2} w_k(\vec{y})\, \hat{\pi}_k$ to the prior weighting scheme $\sum_{k=1}^{2} w_k \hat{\pi}_k$. When a relative MSE is smaller than one, it implies a smaller MSE for the posterior weighting scheme. When the true value of π is well between the two prior guesses E(π|Q1) = .25 and E(π|Q2) = .75, the posterior weighting scheme shows a greater MSE due to greater variance. When the true value of π deviates away from either prior guess, the posterior weighting scheme shows a smaller MSE due to smaller bias. The tendency is stronger when the two disagreeing prior elicitations are stronger (i.e., stronger prior disagreement). The bottom line is a clear bias-variance tradeoff

Figure 7. Comparing prior and posterior weighting schemes for different degrees of disagreements (relative bias, relative variance and relative MSE as functions of the true value of π for Cases 1-3).

when we compare the two weighting schemes. The posterior weighting scheme $\sum_{k=1}^{2} w_k(\vec{y})\, \hat{\pi}_k$ is able to reduce bias when there is a strong discrepancy between the "consensus prior" and the data, but it has a larger variance than $\sum_{k=1}^{2} w_k \hat{\pi}_k$ because $w_k(\vec{y})$ depends on the random data.

5.2. Applications to Phase I trials under logistic regression model


In this section, we apply the prior weighting scheme and the posterior weighting scheme to Phase I clinical trials under the logistic regression model. We consider the three priors considered in Section 4.6. We denote Priors 1, 2 and 3 by Q1, Q2 and Q3, respectively. The three priors had the same hyper-parameters a−1,k = 1, b−1,k = 3, a0,k = 3, b0,k = 1, but they differed by x−1,k = −4, 0, 4 and x0,k = 4, 8, 12 for k = 1, 2, 3, respectively. By the use of the conditional mean prior in Eq. (22), the prior density function of $\vec{\beta}$ for prior Qk is given by

$$f(\vec{\beta} \mid Q_k) \propto (x_{0,k} - x_{-1,k}) \prod_{i=-1}^{0} \left( \frac{e^{\beta_0 + \beta_1 x_{i,k}}}{1 + e^{\beta_0 + \beta_1 x_{i,k}}} \right)^{a_{i,k}} \left( \frac{1}{1 + e^{\beta_0 + \beta_1 x_{i,k}}} \right)^{b_{i,k}}. \qquad (43)$$

The prior means were E(θ.2|Q1) = −1.70, E(θ.2|Q2) = 1.40 and E(θ.2|Q3) = 5.38 for Priors 1, 2 and 3, respectively.
For the simulation study, we consider three simulation scenarios with sample size N = 20. In Scenario 1, we assume β0 = −5 and β1 = .6, so the true MTD is θ.2 = 6.02, which deviates significantly from all three prior means. In Scenario 2, we assume β0 = −3 and β1 = .8 as in Section 4.6, so θ.2 = 2.02 is well surrounded by the three prior means. In Scenario 3, we assume β0 = −1 and β1 = 1.2, so θ.2 = −.32 is close to the most conservative prior mean E(θ.2|Q1) = −1.70. We consider the loss function $L_I(\vec{\beta}, x_{n+1}) = (x_{n+1} - \theta_{.2})^2$ discussed in Section 4.4, which focuses on individual-level ethics. We use the uniform prior probabilities wk = P(Qk) = 1/3 for k = 1, 2, 3 for implementing both the prior and posterior weighting schemes.

Table 2 provides the simulation results of 10,000 replicates for each scenario under the prior weighting scheme and under the posterior weighting scheme. Since the posterior weighting scheme adaptively updates $w_k(\vec{y})$ based on empirical evidence, it can reduce bias, but it has greater variance in the estimation of θ.2. As a consequence, when the true MTD was close to one extreme prior estimate (Scenarios 1 and 3), the use of the posterior weighting scheme yields a smaller $E[(\pi_{\hat{\theta}_{.2}} - .2)^2]$, an $E(\sum_{i=1}^{20} Y_i)$ closer to Nγ = 4, and a $P(3 \le \sum_{i=1}^{20} Y_i \le 5)$ closer to one when compared to the use of the prior weighting scheme. In Scenario 3, the average number of adverse events was 4.6 for the posterior weighting scheme, but it was as high as 7.1 in the prior weighting scheme. On the other hand, when the true MTD was well surrounded by the three prior estimates (Scenario 2), the use of the prior weighting scheme yielded more plausible results.
estimates (Scenario 2), the use of the prior weighting scheme yielded more plausible results.

The simulation results are analogous to the simpler model in Section 5.1. When the true
parameter is not well surrounded by prior guesses, the posterior weighting scheme is prefer-
able with respect to mean square error due to smaller bias. When the true parameter is well
surrounded by prior guesses, the prior weighting scheme is beneficial with respect to mean
square error due to smaller variance.

Scenario  Method               E(π_{θ̂.2})  V(π_{θ̂.2})  E[(π_{θ̂.2} − .2)²]  E(ΣYi)  P(3 ≤ ΣYi ≤ 5)
1         Prior weighting      0.0967       0.0014       0.0121               1.1090  0.0398
1         Posterior weighting  0.1853       0.0073       0.0075               2.7304  0.5900
2         Prior weighting      0.2018       0.0059       0.0059               3.8432  0.9042
2         Posterior weighting  0.2048       0.0110       0.0110               4.2848  0.8920
3         Prior weighting      0.2929       0.0071       0.0157               7.1090  0.0568
3         Posterior weighting  0.1951       0.0133       0.0133               4.6036  0.8646

Table 2. Simulation results of 10,000 replicates for the prior and posterior weighting schemes.

As a final comment, we shall be careful about the strength of individual prior elicitations when we implement the posterior weighting scheme in Phase I clinical trials. The strength of individual prior elicitations depends on (i) the hyper-parameters ai,k and bi,k, (ii) the prior weight wk = P(Qk), as well as (iii) the distance between the two arbitrarily chosen doses x0,k − x−1,k. This can be seen through the expression

$$f(\vec{\beta}) = \sum_{k=1}^{K} f(\vec{\beta} \mid Q_k)\, P(Q_k) \propto \sum_{k=1}^{K} w_k\, (x_{0,k} - x_{-1,k}) \prod_{i=-1}^{0} \left( \frac{e^{\beta_0 + \beta_1 x_{i,k}}}{1 + e^{\beta_0 + \beta_1 x_{i,k}}} \right)^{a_{i,k}} \left( \frac{1}{1 + e^{\beta_0 + \beta_1 x_{i,k}}} \right)^{b_{i,k}}. \qquad (44)$$

When researchers determine consensus prior elicitations before initiating a trial, the multiplicative term $w_k (x_{0,k} - x_{-1,k})$ shall be carefully considered together with the hyper-parameters ai,k and bi,k [5].

6. Concluding remarks

In this chapter, we have discussed Bayesian inference with averaging, balancing, and compromising in sparse data. In cancer risk assessment, we have observed that low-dose inference can be
very sensitive to an assumed parametric model (Section 3.1). In this case, the Bayesian model
averaging can be a useful method. It provides robustness by using multiple models and posterior
model probabilities to account for model uncertainty. In the application of Bayesian decision
theory to Phase I clinical trials, we have observed that the sequential sampling scheme heavily
depends on a loss function. A loss function, which is devised from individual-level ethics, focuses
on the benefit of trial participants, and a loss function, which is devised from population-level
ethics, focuses on the benefit of future patients. It is possible to balance the two conflicting perspectives, and we can adjust the point of focus with the tuning parameter (Sections 4.5 and 4.6). Finally, the use of a weighted posterior estimate can be a compromising method
when two or more researchers have prior disagreement. We have compared the prior and
posterior weighting schemes in a small-sample binomial problem (Section 5.1) and in a small-
sample Phase I clinical trial (Section 5.2). The prior weighting scheme (data-independent weights)
outperforms when prior estimates surround the truth, and the posterior weighting scheme (data-
dependent weights) outperforms when the truth is not well surrounded by prior estimates. One
method does not outperform the other method for all parameter values, so it is important to be
aware of their bias-variance tradeoff.
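The prior and posterior weighting schemes summarized above can be sketched for a small-sample binomial problem. The Python code below is an illustrative sketch, not the chapter's simulation: the two Beta priors, the weights, and the data are invented, and Beta-binomial conjugacy is used to obtain posterior means and marginal likelihoods in closed form.

```python
import numpy as np
from scipy.special import betaln

def marginal_likelihood(y, n, a, b):
    """Beta-binomial marginal likelihood of y successes in n trials,
    up to the binomial coefficient (common to all priors)."""
    return np.exp(betaln(a + y, b + n - y) - betaln(a, b))

def combined_estimates(y, n, priors, w):
    """Posterior means of p under each Beta(a, b) prior, combined with
    (i) fixed prior weights and (ii) data-dependent posterior weights."""
    post_means = np.array([(a + y) / (a + b + n) for a, b in priors])
    m = np.array([marginal_likelihood(y, n, a, b) for a, b in priors])
    w = np.asarray(w, dtype=float)
    w_post = w * m / np.sum(w * m)        # weights updated by the data
    return float(w @ post_means), float(w_post @ post_means)

# Hypothetical disagreement: a pessimistic and an optimistic Beta prior on p
priors = [(2.0, 8.0), (8.0, 2.0)]
prior_w, posterior_w = combined_estimates(y=3, n=10, priors=priors, w=[0.5, 0.5])
print(prior_w, posterior_w)
```

With data of 3 successes in 10 trials, the posterior-weighted estimate is pulled toward the prior that agrees with the observed proportion, which is exactly the data-dependent behavior discussed above.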
Bayesian Model Averaging and Compromising in Dose-Response Studies 187
http://dx.doi.org/10.5772/intechopen.68786

Author details

Steven B. Kim

Address all correspondence to: [email protected]


Department of Mathematics and Statistics, California State University, CA, United States

References

[1] Raftery AE, Madigan D, Hoeting JA. Bayesian model averaging for linear regression
models. Journal of the American Statistical Association. 1997;92:171-191

[2] Whitehead J, Williamson D. Bayesian decision procedures based on logistic regression models for dose-finding studies. Journal of Biopharmaceutical Statistics. 1998;8:445-467
[3] Kim SB, Gillen DL. A Bayesian adaptive dose-finding algorithm for balancing individual- and population-level ethics in Phase I clinical trials. Sequential Analysis. 2016;35(4):423-439
[4] Samaniego FJ. A Comparison of the Bayesian and Frequentist Approaches to Estimation.
New York: Springer; 2010
[5] Kim SB, Gillen DL. An alternative perspective on consensus priors with applications to
Phase I clinical trials. Jacobs Journal of Biostatistics. 2016;1(1):006

[6] Shao K, Small MJ. Potential uncertainty reduction in model-averaged benchmark dose
estimates informed by an additional dose study. Risk Analysis. 2011;31:1156-1175

[7] Bedrick EJ, Christensen R, Johnson W. A new perspective on priors for generalized linear
models. Journal of the American Statistical Association. 1996;91(436):1450-1460
[8] International Agency for Research on Cancer. IARC Monographs on the Evaluation of
Carcinogenic Risks to Humans. Vol. 69. Lyon: IARC; 1997. ISBN 92-832-1269-X
[9] Kociba RJ, Keyes DG, Beyer JE, Carreon RM, Wade CE, Dittenber DA, et al. Results of a two-year chronic toxicity and oncogenicity study of 2,3,7,8-tetrachlorodibenzo-p-dioxin in rats. Toxicology and Applied Pharmacology. 1978;46(2):279-303

[10] Hoeting JA, Madigan D, Raftery AE, Volinsky CT. Bayesian model averaging: a tutorial.
Statistical Science. 1999;14(4):382-417

[11] Simmons SJ, Chen C, Li X, Wang Y, Piegorsch WW, Fang Q, Hu B, Dunn GE. Bayesian model averaging for benchmark dose estimation. Environmental and Ecological Statistics. 2015;22(1):5-16

[12] Crump KS. A new method for determining allowable daily intakes. Fundamental and
Applied Toxicology. 1984;4:854-871
[13] EPA (US Environmental Protection Agency). Benchmark dose technical guidance, EPA/100/R-
12/001, Risk Assessment Forum. Washington, DC: U.S. Environmental Protection Agency; 2012

[14] Whitehead J, Williamson D. Bayesian decision procedures based on logistic regression models for dose-finding studies. Journal of Biopharmaceutical Statistics. 1998;8:445-467
[15] O’Quigley J, Conaway M. Continual reassessment and related dose-finding designs.
Statistical Science. 2010;25:202-216
[16] Bartroff J, Lai TL. Incorporating individual and collective ethics into Phase I cancer trial
designs. Biometrics. 2011;67:596-603

[17] O’Quigley J, Pepe M, Fisher L. Continual reassessment method: A practical design for
Phase 1 clinical trials in cancer. Biometrics. 1990;46:33-48

[18] Whitehead J, Brunier H. Continual reassessment method: Bayesian decision procedures for dose determining experiments. Statistics in Medicine. 1995;14:885-893
Chapter 9

Two Examples of Bayesian Evidence Synthesis with the Hierarchical Meta-Regression Approach

Pablo Emilio Verde

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/intechopen.70231

Abstract
This is the Information Age. We can expect that for a particular research question that is
empirically testable, we should have a collection of evidence which indicates the best
way to proceed. Unfortunately, this is not the case in several areas of empirical research
and decision making. Instead, when researchers and policy makers ask a specific ques-
tion, such as “What is the effectiveness of a new treatment?”, the structure of the
evidence available to answer this question may be complex and fragmented (e.g.
published experiments may have different grades of quality, observational data, subjec-
tive judgments, etc.).

Meta-analysis is a branch of statistical techniques that helps researchers to combine


evidence from a multiplicity of indirect sources. A main hurdle in meta-analysis is that we not only combine results from a diversity of sources but also their multiplicity of biases. Therefore, commonly applied meta-analysis methods, e.g., random-effects models, could be misleading.

In this chapter we present a new method for meta-analysis that we have called the “Hierarchical Meta-Regression” (HMR). The HMR is an integrated approach for evidence synthesis when a multiplicity of biases, coming from indirect and disparate evidence, has to be incorporated in a meta-analysis.

Keywords: Bayesian hierarchical models, meta-analysis, multi-parameters evidence


synthesis, conflict of evidence, randomized control trials, retrospective studies

1. Introduction

In today’s information age one can expect that the digital revolution can create a knowledge-
based society surrounded by global communications that influence our world in an efficient
and convenient way. It is recognized that never in human history have we accumulated such

an astronomical amount of data, and we keep generating data at an alarming rate. A new term, “big data,” was coined to indicate the existence of “oceans of data” where we may expect to extract useful information for any problem of interest.

© 2017 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
In this technological society, one could expect that for a particular research question we should have a collection of high-quality evidence indicating the best way to proceed. Paradoxically, this is not the case in several areas of empirical research and decision making. Instead, when researchers and policy makers ask a specific and important question, such as “What is the effectiveness of a new treatment?”, the structure of the evidence available to answer this question may be complex and fragmented (e.g., published experiments may have different
grades of quality, observational data, subjective judgments, etc.). The way researchers interpret this multiplicity of evidence forms the basis of their understanding of reality and determines their future decisions.

Bayesian meta-analysis, which has its roots in the work of Eddy et al. [1], is a branch of
statistical techniques for interpreting and displaying results of different sources of evidence,
exploring the effects of biases and assessing the propagation of uncertainty into a coherent
statistical model. A gentle introduction to this area can be found in Chapter 8 of Spiegelhalter et al. [2], and a recent review in Verde and Ohmann [3].
In this chapter we present a new method for meta-analysis that we have called the “Hierarchical Meta-Regression” (HMR). The aim of the HMR is to provide an integrated approach for bias modeling when disparate pieces of evidence are combined in a meta-analysis, for instance randomized and non-randomized studies, or studies of different quality. This is a different application of Bayesian inference from those we may be more familiar with, such as an intricate regression model where the available data bear directly upon the question of interest.
We are going to discuss two recent meta-analyses in clinical research. The reason for highlighting these two cases is that they illustrate a main problem in evidence synthesis: the presence of a multiplicity of biases in systematic reviews.

1.1. An example of meta-analysis of therapeutic trials


The first example is a meta-analysis of 31 randomized controlled trials (RCTs) comparing two groups of heart disease patients, where the treatment group received bone marrow stem cells and the control group a placebo treatment (Nowbar et al. [4]). The data of this meta-analysis appear in Table 1 in the Appendix. Figure 1 presents the forest plot of these 31 trials, where the treatment effect is measured as the difference in ejection fraction between groups, which measures the improvement of left ventricular function in the heart.
At the bottom of Figure 1 we see average summaries represented by two diamonds. The first one corresponds to the fixed-effect meta-analysis model, which is based on the assumption that the studies are identical and that the between-study variability is zero. The widest diamond represents the results of a random-effects meta-analysis model, which assumes substantial heterogeneity between studies. In this meta-analysis both models confirmed a positive treatment effect, with mean differences of 3.95 (95% CI [3.43, 4.47]) and 2.92 (95% CI [1.47, 4.36]), respectively.

Figure 1. Meta-analysis results of studies applying treatments based on bone marrow stem cells to improve the left
ventricular function.

Could we conclude that we have enough evidence to demonstrate the efficacy of the treatment? Unfortunately, these apparently confirmatory results are completely misleading. The problem is that these 31 studies are very heterogeneous, which results in a wide 95% prediction interval [−4.33, 10.16] covering the no-treatment effect, and in the large amount of contradictory evidence displayed in Figure 1.
In order to explain the sources of heterogeneity in this area, Nowbar et al. [4] investigated whether detected discrepancies in published trials might account for the variation in reported effect sizes. They define a discrepancy in a trial as two or more reported facts that cannot both be true because they are logically or mathematically incompatible. In other words, the term discrepancy is a polite way to indicate that a published study suffers from poor reporting, could be implausible, or has had its results manipulated. For example, as we see at the
bottom of Table 1 in the appendix, it would be difficult to believe the results of a study with 55 discrepancies. In Section 2 we present an HMR model to analyze a possible link between the risk of biased results and the number of discrepancies.

1.2. An example of meta-analysis of diagnostic trials


The topic of Section 3 is the meta-analysis of diagnostic trials. These trials play a central role in personalized medicine, policy making, healthcare, and health economics. Figure 2 presents our example in this area. The scatter plot shows the diagnostic summaries of a meta-analysis investigating the diagnostic accuracy of computer tomography scans in the diagnosis of appendicitis [5]. Each circle identifies the true positive rate vs. the false positive rate of one study, where the different circle sizes indicate different sample sizes. One characteristic of this meta-analysis is the combination of disparate data: of 51 studies, 22 were retrospective and 29 were prospective, which is indicated by the different grey scales of the circles.
The main problem in this area is the multiple sources of variability behind those diagnostic
results. Diagnostic studies are usually performed under different diagnostic setups and patients’
populations. For a particular diagnostic technique we may have a small number of studies which
may differ in their statistical design, their quality, etc. Therefore, the main question in meta-
analysis of diagnostic tests is: How can we combine the multiplicity of diagnostic accuracy rates

Figure 2. Display of the meta-analysis results of studies performing computer tomography scans in the diagnosis of appendicitis. Each circle identifies the true positive rate vs. the false positive rate of one study. Different colors are used for different study designs and different diameters for sample sizes.

in a single coherent model? A possible answer to this question is the HMR presented in Section 3. This model was introduced by Verde [5] and is available in the R package bamdit [6].

2. A Hierarchical Meta-Regression model to assess reported bias

Figure 3 shows the reported effect sizes and the 95% confidence intervals of the 31 trials from [4] against the number of discrepancies (on a logarithmic scale). The authors reported a positive, statistically significant correlation between effect size and the number of discrepancies detected in the papers. However, a direct correlation analysis of aggregated results is threatened by ecological bias and may lead to misleading conclusions. The amount of variability presented by the 95% confidence intervals is too large to accept a positive correlation at face value. In this section we present an HMR model to link the risk of reporting bias with the number of reported discrepancies. This model assumes that the connection between discrepancies and effect size could be much more subtle.
The starting point of any meta-analytic model is the description of a model for the pieces of evidence at face value; in statistical terms, this means the likelihood of the parameter of interest. Let y1, …, yN and SE1, …, SEN be the reported effect sizes and their corresponding standard errors. We assume a Normal likelihood for θi, the treatment effect of study i:

Figure 3. Relationship between effect size and number of discrepancies. The vertical axis corresponds to the effect size; the treatment group received a treatment based on bone marrow stem cells and the control group a placebo. The horizontal axis corresponds to the number of discrepancies (on the logarithmic scale) found in the publication.

$$ y_i \mid \theta_i \sim N\left(\theta_i,\, SE_i^2\right), \qquad i = 1, \ldots, N. \tag{1} $$

If a prior assumption of exchangeability is considered reasonable, a random-effects Bayesian model incorporates all the studies into a single model, where θ1, …, θN are assumed to be a random sample from a prior distribution with unknown parameters; this is known as a hierarchical model.

In this section we assume that exchangeability is unrealistic, and we wish to learn how the unobserved treatment effects θ1, …, θN are linked with an observed covariate xi.

Let xi be the number of observed discrepancies on the logarithmic scale. We propose to model the association between the treatment effect θi and the observed discrepancies xi with the following HMR model:
$$ \theta_i \mid I_i, x_i \sim I_i\, N\left(\mu_{\text{biased}},\, \tau^2\right) + (1 - I_i)\, N\left(\mu,\, \tau^2\right), \tag{2} $$

where the non-observable variable Ii indicates if study i is at risk of bias:


$$ I_i \mid x_i = \begin{cases} 1 & \text{if study } i \text{ is biased} \\ 0 & \text{otherwise.} \end{cases} \tag{3} $$

The parameter μ corresponds to the mean treatment effect of studies with low risk of bias. We
assume that in our context of application biased studies could report higher effect sizes and the
biased mean μbiased can be expressed as:

$$ \mu_{\text{biased}} = \mu + K, \quad \text{with } K > 0. \tag{4} $$

In this way, K measures the average amount of bias with respect to the mean effect μ. Eq. (4)
also ensures that μ and μbiased are identifiable parameters in this model. The parameter τ
measures the between-studies variability in both components of the mixture distributions.

We model the probability that a study is biased as a function of xi as follows:

$$ \operatorname{logit}\left(\Pr(I_i = 1 \mid x_i)\right) = \alpha_0 + \alpha_1 x_i. \tag{5} $$

In Eq. (5) positive values of α1 indicate that an increase in the number of discrepancies is
associated with an increased risk of study bias.
In this HMR model the conditional mean is given by
$$ E(\theta \mid x_i) = \Pr(I_i = 1 \mid x_i)\,\mu_{\text{biased}} + \left(1 - \Pr(I_i = 1 \mid x_i)\right)\mu. \tag{6} $$

Eqs. (5) and (6) can be calculated as functional parameters over a grid of values of x; their posterior intervals are calculated at each value of x.

This HMR not only quantifies the average bias K and the relationship between bias and discrepancies in Eq. (5), but it also allows the treatment effect θi to be corrected by its propensity of being biased:

$$ \theta_i^{\text{corrected}} = (\theta_i - K)\,\Pr(I_i = 1 \mid x_i) + \theta_i\left(1 - \Pr(I_i = 1 \mid x_i)\right), \tag{7} $$

where the amount (θi − K) measures the bias of study i and Pr(Ii = 1 | xi) its propensity of being biased.
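Given the model parameters, Eqs. (5) and (7) amount to a deterministic shrinkage of each reported effect. A minimal Python sketch follows; the parameter values are invented for illustration, whereas in the chapter they carry posterior distributions.

```python
import numpy as np
from scipy.special import expit

def bias_correct(theta, x, alpha0, alpha1, K):
    """Bias-corrected treatment effects via Eqs. (5) and (7).

    theta : reported study effect sizes
    x     : log number of discrepancies per study
    alpha0, alpha1, K : model parameters (illustrative fixed values here)
    """
    theta = np.asarray(theta, dtype=float)
    p_bias = expit(alpha0 + alpha1 * np.asarray(x, dtype=float))  # Eq. (5)
    corrected = (theta - K) * p_bias + theta * (1.0 - p_bias)     # Eq. (7)
    return corrected, p_bias

# Two hypothetical studies with 21 and 18 discrepancies
theta = [5.4, 7.8]
x = np.log([21.0, 18.0])
corrected, p_bias = bias_correct(theta, x, alpha0=-3.0, alpha1=1.2, K=4.0)
print(corrected, p_bias)
```

The corrected effect is simply the reported effect minus K times the propensity of bias, so studies with few discrepancies are left almost untouched.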
The HMR model presented above is completed by the following vague hyper-priors: for the regression parameters, α0, α1 ∼ N(0, 100); for the mean, μ ∼ N(0, 100); and for the bias parameter, K ∼ Uniform(0, 50). Finally, for the between-study variability we use τ ∼ Uniform(0, 100), which represents a vague prior within the range of possible study deviations.
The model presented in this section is not analytically tractable. We approximated the posterior distributions of the model parameters with Markov chain Monte Carlo (MCMC) techniques implemented in OpenBUGS.

BUGS stands for Bayesian inference Using Gibbs Sampling. The OpenBUGS software constructs a Directed Acyclic Graph (DAG) representation of the posterior distribution of all model parameters. This representation allows the DAG to be automatically factorized as a product of each node (parameters or data) conditional on its parents and children. The software scans each node and proposes a sampling method; the kernel of the Gibbs sampler is built upon this algorithm.

Computations were performed with the statistical language R, and the MCMC computations were linked to R with the package R2OpenBUGS. We used two chains of 20,000 iterations and discarded the first 5000 as the burn-in period. Convergence was assessed visually using the R package coda.
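As a self-contained illustration of this computation, the sketch below runs a random-walk Metropolis sampler (rather than OpenBUGS' Gibbs sampler) on the marginalized posterior of (μ, K), with θi integrated out of Eqs. (1)-(2); the data and the fixed values of α0, α1, and τ are invented, not the meta-analysis data.

```python
import numpy as np
from scipy.special import expit
from scipy.stats import norm

rng = np.random.default_rng(1)

# Toy data standing in for the 31 trials (effect, SE, discrepancy counts)
y = np.array([0.5, 1.1, 6.8, 7.2, 0.2, 5.9])
se = np.array([1.0, 0.8, 1.2, 1.5, 0.9, 1.1])
x = np.log(np.array([2.0, 3.0, 21.0, 30.0, 1.0, 25.0]))

alpha0, alpha1, tau = -3.0, 1.2, 1.0   # held fixed to keep the sketch small

def log_post(mu, K):
    """Log posterior of (mu, K), with theta_i integrated out of Eqs. (1)-(2)."""
    if K <= 0.0 or K >= 50.0:          # K ~ Uniform(0, 50)
        return -np.inf
    p = expit(alpha0 + alpha1 * x)     # Eq. (5): Pr(study i is biased)
    sd = np.sqrt(tau**2 + se**2)
    lik = p * norm.pdf(y, mu + K, sd) + (1.0 - p) * norm.pdf(y, mu, sd)
    return np.sum(np.log(lik)) + norm.logpdf(mu, 0.0, 10.0)   # mu ~ N(0, 100)

def metropolis(n_iter=20000, burn_in=5000, step=0.5):
    draws = np.empty((n_iter, 2))
    cur = np.array([0.0, 1.0])
    cur_lp = log_post(*cur)
    for t in range(n_iter):
        prop = cur + step * rng.standard_normal(2)
        prop_lp = log_post(*prop)
        if np.log(rng.uniform()) < prop_lp - cur_lp:   # accept/reject
            cur, cur_lp = prop, prop_lp
        draws[t] = cur
    return draws[burn_in:]     # discard the burn-in period

draws = metropolis()
print(draws.mean(axis=0))      # posterior means of (mu, K)
```

The same burn-in logic (discarding early iterations before summarizing) is what the chapter applies to its OpenBUGS chains.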
The diagonal panels of Figure 4 summarize the resulting posterior distributions of μ, K, τ, α0, and α1. The posterior of μ clearly covers zero, indicating that the stem cell treatment is not effective. The bias parameter K indicates a considerable over-estimation of the treatment effects reported in some trials. The posterior of α1 is concentrated on positive values, which indicates that an increase in discrepancies is associated with an increased risk of reporting bias. The posteriors of α0 and α1 also present a large variability, which is expected when a hidden effect is modeled.

Further results of the Hierarchical Meta-Regression model appear in Figure 5, where posterior 95% intervals are plotted against the number of discrepancies. On the left panel, we can see the relationship between the number of discrepancies and the probability that a study is biased. We can observe an increase in probability with an increasing number of discrepancies, but also a large amount of variability. On the right panel appears the conditional mean of the effect size as a function of the number of discrepancies, which corresponds to Eq. (6). Our analysis shows that the 95% posterior intervals of the conditional mean cover the zero effect in
most of the range of discrepancies. Only for studies with more than 33 (exp(3.5)) discrepancies does the model predict a positive effect.

Figure 4. Posterior distributions for the hyper-parameters of the HMR model. The diagonal displays the posterior distributions, the upper panels the pairwise correlations, and the lower panels the pairwise posterior densities.

One interesting result of this analysis is that a horizontal
line which may represent a zero correlation is also predicted by the model. This means that the
regression calculated directly from the aggregated data contains an ecological bias and it is
misleading. We have added this regression line to the plot to highlight this issue.
The results presented so far indicate that an increase in the number of discrepancies increases the propensity of bias. The question is: how can we correct a particular study for its bias? Eq. (7) gives the bias correction of the treatment effect in this HMR model.
In Figure 6 we can see the HMR bias correction in action. We display two studies, with 21 and 18 discrepancies, respectively. The solid lines correspond to the likelihood functions of these studies; these likelihoods represent the information on the effect size at face value. The dashed lines correspond to the posterior treatment effects after bias correction. Clearly, we can see a strong bias correction, with the conclusion of no treatment effect.

Figure 5. Results of the Hierarchical Meta-Regression model. The posterior median and 95% intervals are displayed as solid lines. Left panel: relationship between the number of discrepancies and the probability that a study is biased. Right panel: conditional mean of the effect size as a function of the number of discrepancies.

Figure 6. Bias correction for two studies with 21 and 18 discrepancies respectively. The solid lines correspond to the
likelihood functions of effect sizes. The dashed lines represent the posteriors for treatment effect after bias correction.

3. Hierarchical Meta-Regression analysis for diagnostic test data

In meta-analysis of diagnostic test data, the pieces of evidence that we aim to combine are the
results of N diagnostic studies, where results of the ith study (i = 1, …, N) are summarized in a
2 × 2 table as follows:

                          Patient status
                      With disease    Without disease
Test outcome   +          tpi              fpi
               −          fni              tni
Sum:                     ni,1             ni,2

where tpi and fni are the number of patients with positive and negative diagnostic results from
ni,1 patients with disease, and fpi and tni are the positive and negative diagnostic results from
ni,2 patients without disease.

Assuming that ni,1 and ni,2 have been fixed by design, we model the tpi and fpi outcomes with
two independent Binomial distributions:

$$ tp_i \sim \text{Binomial}\left(TPR_i,\, n_{i,1}\right) \quad \text{and} \quad fp_i \sim \text{Binomial}\left(FPR_i,\, n_{i,2}\right), \tag{8} $$

where TPRi is the true positive rate or sensitivity, Sei, of study i and FPRi is the false positive
rate or complementary specificity, i.e., 1 − Spi.

At face value, the diagnostic performance of each study is summarized by the empirical true positive rate and true negative rate (or specificity),

$$ \widehat{TPR}_i = \frac{tp_i}{n_{i,1}} \quad \text{and} \quad \widehat{TNR}_i = \frac{tn_i}{n_{i,2}}, \tag{9} $$

and by the complementary empirical false positive and false negative rates,

$$ \widehat{FPR}_i = \frac{fp_i}{n_{i,2}} \quad \text{and} \quad \widehat{FNR}_i = \frac{fn_i}{n_{i,1}}. \tag{10} $$

In this type of meta-analysis we could separately model TPRi and FPRi (or Spi), but this
approach ignores that these rates could be correlated by design. Therefore, it is more sensible
to handle TPRi and FPRi jointly.
We define the random effect Di, which represents the study effect associated with the diagnostic discriminatory power:

$$ D_i = \log\left(\frac{TPR_i}{1 - TPR_i}\right) - \log\left(\frac{FPR_i}{1 - FPR_i}\right). \tag{11} $$

However, diagnostic results are sensitive to diagnostic settings (e.g., the use of different
thresholds) and to populations where the diagnostic procedure under investigation is applied.
These issues are associated with the external validity of diagnostic results. To model external
validity bias we introduce the random effect Si:
$$ S_i = \log\left(\frac{TPR_i}{1 - TPR_i}\right) + \log\left(\frac{FPR_i}{1 - FPR_i}\right). \tag{12} $$

This random effect quantifies the variability produced by patients' characteristics and by the diagnostic setup, which may induce a correlation between the observed TPRs and FPRs. In short, we call Si the threshold effect of study i; it represents an adjustment for external validity in the meta-analysis.
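The transformations in Eqs. (11) and (12) are simple to compute from a study's 2 × 2 counts. A small Python sketch follows; the counts and the 0.5 continuity correction are illustrative choices, not taken from the chapter.

```python
import numpy as np

def study_effects(tp, fp, fn, tn):
    """Empirical D (Eq. 11) and S (Eq. 12) from a study's 2x2 table,
    with a 0.5 continuity correction to avoid infinite logits."""
    tpr = (tp + 0.5) / (tp + fn + 1.0)
    fpr = (fp + 0.5) / (fp + tn + 1.0)
    logit = lambda p: np.log(p / (1.0 - p))
    D = logit(tpr) - logit(fpr)    # discriminatory power
    S = logit(tpr) + logit(fpr)    # threshold effect
    return D, S

# Hypothetical study: 45/50 diseased and 5/50 healthy patients test positive
D, S = study_effects(tp=45, fp=5, fn=5, tn=45)
print(D, S)
```

For this symmetric table the threshold effect S is essentially zero, while D captures all of the discriminatory power.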
We could assume exchangeability of the pairs (Di, Si), but study quality is known to be an issue in diagnostic studies. For this reason we model the internal validity of a study by introducing random weights w1, …, wN. Conditionally on a study weight wi, the study effects Di and Si are modeled as exchangeable between studies, and they follow a scale mixture of bivariate Normal distributions with mean and variance

$$ E\left[\left.\begin{pmatrix} D_i \\ S_i \end{pmatrix}\right|\, w_i\right] = \begin{pmatrix} \mu_D \\ \mu_S \end{pmatrix} \quad \text{and} \quad \operatorname{var}\left[\left.\begin{pmatrix} D_i \\ S_i \end{pmatrix}\right|\, w_i\right] = \frac{1}{w_i}\begin{pmatrix} \sigma_D^2 & \rho\,\sigma_D\sigma_S \\ \rho\,\sigma_D\sigma_S & \sigma_S^2 \end{pmatrix} = \Sigma_i, \tag{13} $$

and scale mixing density

$$ w_i \sim \text{Gamma}\left(\frac{\nu}{2},\, \frac{\nu}{2}\right). \tag{14} $$

The inclusion of the random weights wi into the model was proposed by [5]. This approach was generalized in [6] in two ways: firstly, by splitting wi into two weights w1,i and w2,i corresponding to the components Di and Si, respectively; secondly, by putting a prior on the degrees-of-freedom parameter ν, which yields an adaptive robust distribution for the random effects.
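Marginally over the Gamma weights in Eq. (14), the scale mixture in Eq. (13) is a bivariate Student-t distribution with ν degrees of freedom. The following Python sketch samples study effects this way; all parameter values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

def sample_study_effects(n, mu_D, mu_S, sd_D, sd_S, rho, nu):
    """Draw (D_i, S_i) from the scale mixture of bivariate Normals, Eqs. (13)-(14)."""
    cov = np.array([[sd_D**2,           rho * sd_D * sd_S],
                    [rho * sd_D * sd_S, sd_S**2          ]])
    # w_i ~ Gamma(nu/2, rate = nu/2), i.e., scale = 2/nu
    w = rng.gamma(shape=nu / 2.0, scale=2.0 / nu, size=n)
    z = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    return np.array([mu_D, mu_S]) + z / np.sqrt(w)[:, None]

effects = sample_study_effects(5000, mu_D=2.5, mu_S=0.0,
                               sd_D=0.8, sd_S=0.6, rho=-0.3, nu=8.0)
print(effects.mean(axis=0))
```

Small sampled weights wi inflate the covariance of the corresponding draws, which is exactly how the model accommodates unusually heterogeneous (low internal validity) studies.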
The Hierarchical Meta-Regression representation of the model introduced above is based on the conditional distribution of (Di | Si = x) and the marginal distribution of Si. This HMR model was introduced by [7], who followed the stepping stones of the classical Summary Receiver Operating Characteristic (SROC) curve [8].
The conditional mean of (Di | Si = x) is given by

$$ E(D_i \mid S_i = x) = A + Bx, \tag{15} $$

where the functional parameters A and B are

$$ A = \mu_D - B\,\mu_S \quad \text{and} \quad B = \rho\,\frac{\sigma_D}{\sigma_S}. \tag{16} $$
σS

We define the Bayesian SROC curve (BSROC) by transforming results back from (S, D) to (FPR, TPR) with

$$ \text{BSROC}(FPR) = g^{-1}\left(\frac{A}{1 - B} + \frac{B + 1}{1 - B}\, g(FPR)\right), \tag{17} $$

where g(p) is the logit transformation, i.e., g(p) = logit(p) = log(p/(1 − p)).


The BSROC curve is obtained by calculating TPR in a grid of values of FPR which gives a
posterior conditionally on each value of FPR. Therefore, it is straightforward to give credibility
intervals for the BSROC for each value of FPR.
One important aspect of the BSROC is that it incorporates the variability of the model’s
parameters, which influences the width of its credibility intervals. In addition, given that FPR
is modeled as a random variable, the curve is corrected by measurement error bias in FPR.
Finally, we can define a Bayesian Area Under the SROC Curve (BAUC) by numerically integrat-
ing the BSROC for a range of values of the FPR:
ð1
BAUC ¼ BSROCðxÞdx: (18)
0

In some applications it is recommend to use the limits of integration within the observed
d
values of FPRs.
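Given values of A and B (e.g., posterior draws), Eqs. (17) and (18) are direct to evaluate. The Python sketch below uses illustrative values of A and B, not the appendicitis results.

```python
import numpy as np
from scipy.special import expit, logit

def bsroc(fpr, A, B):
    """Bayesian SROC curve, Eq. (17): TPR as a function of FPR, given A and B."""
    return expit(A / (1.0 - B) + (B + 1.0) / (1.0 - B) * logit(fpr))

def bauc(A, B, lo=0.001, hi=0.999, n=2000):
    """Area under the BSROC curve, Eq. (18), by trapezoidal integration."""
    x = np.linspace(lo, hi, n)
    y = bsroc(x, A, B)
    return float(np.sum((y[1:] + y[:-1]) / 2.0 * np.diff(x)))

# Illustrative values for A and B (e.g., posterior medians)
print(bsroc(np.array([0.05, 0.10, 0.20]), A=3.0, B=0.2))
print(bauc(A=3.0, B=0.2))
```

Applying these two functions to each MCMC draw of (A, B) yields pointwise credibility bands for the curve and a full posterior for the BAUC.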

In order to make this complex HMR model applicable in practice, we have implemented it in the R package bamdit, which uses the following set of hyper-priors:

$$ \mu_D \sim \text{Logistic}(m_1, v_1), \quad \mu_S \sim \text{Logistic}(m_2, v_2) \tag{19} $$

and

$$ \sigma_D \sim \text{Uniform}(0, u_1), \quad \sigma_S \sim \text{Uniform}(0, u_2). \tag{20} $$

The correlation parameter ρ is transformed using the Fisher transformation,

$$ z = \operatorname{logit}\left(\frac{\rho + 1}{2}\right), \tag{21} $$

and a Normal prior is used for z:

$$ z \sim N(m_r, v_r). \tag{22} $$

Modeling the priors in this way guarantees that in each MCMC iteration the variance-covariance matrix of the random effects (Di, Si) is positive definite. The values of the constants m1, v1, m2, v2, u1, u2, mr, and vr have to be given. They can be used to include valid prior information which might be empirically available, or they could be the result of expert elicitation. If such information is not available, we recommend setting these parameters to values

that represent weakly informative priors. In this work, we use m1 = m2 = mr = 0, v1 = v2 = 1, u1 = u2 = 5, and vr = √1.7 as a weakly informative prior setup.
These values are fairly conservative, in the sense that they induce uniform prior distributions for TPRi and FPRi. They give locally uniform distributions for μD and μS, uniform distributions for σD and σS, and a symmetric distribution for ρ centered at 0.
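The transformation in Eqs. (21) and (22) can be checked numerically: any real z maps back to a correlation in (−1, 1), so the sampled covariance matrix in Eq. (13) stays positive definite. A small Python sketch follows; the prior spread is an arbitrary illustrative value.

```python
import numpy as np
from scipy.special import expit, logit

def rho_to_z(rho):
    """Eq. (21): map a correlation in (-1, 1) to the real line."""
    return logit((rho + 1.0) / 2.0)

def z_to_rho(z):
    """Inverse map: a Normal prior on z, Eq. (22), induces a prior on rho in (-1, 1)."""
    return 2.0 * expit(z) - 1.0

# Sampling z from a Normal prior (spread chosen for illustration) always
# yields a valid correlation, keeping the covariance in Eq. (13) positive definite.
z = np.random.default_rng(0).normal(0.0, 1.7, size=10000)
rho = z_to_rho(z)
print(rho.min(), rho.max())
```

This is the same design choice made in bamdit: the prior is placed on the unconstrained scale, and the constraint on ρ is enforced by the back-transformation rather than by rejection.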
Figure 7 summarizes the meta-analysis results of fitting the bivariate random-effects model to the computer tomography diagnostic data. The Bayesian predictive surface is presented by contours at different credibility levels, and these curves are compared with the observed data, represented by circles with diameters varying according to the sample size of each study. The scattered points are samples from the predictive posteriors, and the histograms correspond to the posterior predictive marginals. This result was generated using the functions metadiag() and plot() in the R package bamdit.

Figure 7. Results of the meta-analysis: Bayesian Predictive Surface by contours at different credibility levels.

Figure 8 displays the posteriors of each component's weights. The left panel shows that the prospective studies number 25 and 33 deviate with respect to the prior mean of 1, while on the right panel we see that one prospective study (number 47) and five retrospective studies (numbers 1, 3, 4, 8, and 29) have substantial variability.

Figure 8. Posterior distributions of the component weights: it is expected that the posterior is centered at 1. Studies with
retrospective design tend to present deviations in FPR.

Figure 9. Hierarchical Meta-Regression model: the left panel shows the BSROC curve; the central line corresponds to the posterior median, and the upper and lower curves correspond to the 2.5% and 97.5% quantiles, respectively. The right panel displays the posterior distribution of the area under the BSROC curve.

An important aspect of wi is its interpretation as an estimated bias correction. A priori, all studies included in the review have a mean of E(wi) = 1. We can expect that studies which are unusually heterogeneous will have posteriors substantially greater than 1. Unusual study results could be produced by factors that affect the quality of the study, such as errors in recording diagnostic results, confounding factors, loss to follow-up, etc. For that reason, the study weights wi can be interpreted as an adjustment for a study's internal validity bias.
The BSROC curve and its area under the curve are presented in Figure 9. The left panel shows this HMR as a meta-analytic summary of these data. On the right panel, the posterior distribution of the BAUC shows quite a high diagnostic ability of computer tomography scans in the diagnosis of appendicitis.

4. Conclusions

In this work we have seen the HMR in action. This approach to meta-analysis is based on a simple strategy: two sub-models are defined in the meta-analysis, one which models the problem of interest, for instance the treatment effect, and one which handles the multiplicity of bias. The meta-analysis is summarized by understanding how these components interact with each other.

The examples presented in this work have shown that we could reach misleading conclusions from indirect evidence if it were analyzed as directly contributing to the problem of interest.

For instance, in the first example (Section 2), we have seen in Figure 1 that pooling the studies gave a wrong conclusion about the effect of the stem cell treatment. The positive correlation between the aggregated effect size and the number of discrepancies exaggerates their relationship.

Actually, in Figure 5 the HMR has shown that it is possible to simultaneously have a zero
correlation between effect size and discrepancies while still having a risk of reporting bias. In
addition, the HMR allows us to extract the amount of bias in the meta-analysis and to correct the
treatment effect at the level of the study (Figure 6).

In the second example, Section 3, biases come from the external validity of diagnostic studies
and the internal validity due to their quality. In this example the HMR showed that it was
possible to simultaneously model these two types of subtle biases.

To account for internal validity bias, the application of a scale mixture of normal distributions allows us to detect conflicting studies, which can be considered as outliers. The Bayesian summary receiver operating characteristic (BSROC) curve accounts for the external validity bias due to changes in factors that affected the diagnostic results. In addition, the posterior for its area under the curve (AUC) summarizes the results of the meta-analysis.

Acknowledgements

This work was supported by the German Research Foundation project DFG VE 896 1/1.

Appendix: Source Data for Sections 1.1 and 2

Trial ID   Effect size   SE (effect size)   Sample size   Number of discrepancies   Author or principal investigator   Year   Country
t01        1.5           3.67               21            17                        Quyyumi                            2011   USA
t02        1.1           2.09               100           7                         Lunde                              2007   Norway
t03        1.7           2.91               23            7                         Srimahachota                       2011   Thailand
t05        0.8           2.78               60            4                         Meyer                              2006   Germany
t06        7             0.63               40            4                         Meluzín                            2006   Czech Republic
t09        7.8           2.76               38            21                        Piepoli                            2010   Italy
t11        14            4.05               20            13                        Suárez de Lezo                     2007   Spain
t12        5.4           2.44               77            18                        Huikuri HV                         2008   Finland
t13        2.7           1.2                82            16                        Perin                              2012   USA
t15        4.1           0.98               46            0                         Assmus                             2006   Germany
t16        2.2           0.88               79            27                        Assmus                             2013   Germany
t17        2.5           1.01               187           11                        Assmus                             2010   Germany
t18        2.5           3.96               20            2                         Hendrikx                           2006   Belgium
t19        0.2           1.17               127           3                         Hirch                              2011   The Netherlands
t20        2             2.09               20            0                         Perin                              2012   USA
t23        3             2.03               81            0                         Traverse                           2011   USA
t24        3.6           2.52               17            2                         Ang                                2008   UK
t25        4             1.28               40            3                         Rodrigo                            2012   The Netherlands
t26        1.5           1.75               66            8                         Herbolts                           2009   Belgium
t29        8.8           6.07               10            6                         Castellani                         2010   Italy
t30        4             3.89               14            2                         Maureira                           2012   France
t31        0.2           1.54               183           19                        Ribero dos Santos                  2012   Brazil
t32        3.2           3.63               40            7                         Traverse                           2010   USA
t35        10.1          1.21               20            11                        Patel                              2004   USA
t38        5.4           2.54               27            15                        Tse                                2007   Hong Kong
t42        3.5           1.04               86            6                         Cao                                2009   China
t45        1.25          1.56               118           9                         Sürder                             2013   Switzerland
t46        6.7           3.16               20            2                         Ge                                 2006   China
t47        0.1           2.03               112           2                         Traverse                           2012   USA
t48        3.9           2.62               40            1                         Wöhrle                             2010   Germany
t49        10.4          1.01               116           55                        Yousef (Strauer)                   2009   Germany
Table 1. Results from 31 randomized controlled trials of heart disease patients, where the treatment group received bone
marrow stem cells and the control group a placebo treatment. The source of this table is Nowbar et al. [4].

Author details

Pablo Emilio Verde


Address all correspondence to: [email protected]

Coordination Center for Clinical Trials, University of Duesseldorf, Duesseldorf, Germany

References

[1] Eddy DM, Hasselblad V, Shachter R. Meta-Analysis by the Confidence Profile Method:
The Statistical Synthesis of Evidence. San Diego, CA: Academic Press; 1992

[2] Spiegelhalter DJ, Abrams KR, Myles JP. Bayesian Approaches to Clinical Trials and Health-
Care Evaluation. The Atrium, Southern Gate, Chichester, West Sussex PO19 8SQ, England:
John Wiley & Sons, Ltd.; 2004
[3] Verde PE, Ohmann C. Combining randomized and non-randomized evidence in clinical
research: A review of methods and applications. Research Synthesis Methods. Vol. 6. 2014.
DOI: 10.1002/jrsm.1122
[4] Nowbar AN, Mielewczik M, Karavassilis M, Dehbi HM, Shun-Shin MJ, Jones S, Howard
JP, Cole GD, Francis DP. Discrepancies in autologous bone marrow stem cell trials and
enhancement of ejection fraction (damascene): Weighted regression and meta-analysis.
BMJ. 2014;348:1-9
[5] Verde PE. Meta-analysis of diagnostic test data: A bivariate Bayesian modeling approach.
Statistics in Medicine. 2010;29(30):3088-3102
[6] Verde PE. bamdit: An R package for Bayesian meta-analysis of diagnostic test data.
Journal of Statistical Software. 2017, in press
[7] Verde PE. Meta-analysis of diagnostic test data: Modern statistical approaches. PhD The-
sis, University of Düsseldorf. Deutsche Nationalbibliothek. July, 2008. Available from:
http://docserv.uni-duesseldorf.de/servlets/DocumentServlet?id=8494

[8] Moses LE, Shapiro D, Littenberg B. Combining independent studies of a diagnostic test
into a summary roc curve: Data-analytic approaches and some additional considerations.
Statistics in Medicine. 1993;12:1293-1316
Chapter 10

Bayesian Modeling in Genetics and Genomics

Hafedh Ben Zaabza, Abderrahmen Ben Gara and Boulbaba Rekik

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/intechopen.70167

Abstract
This chapter provides a critical review of statistical methods applied in animal and plant
breeding programs, especially Bayesian methods. Classical and Bayesian procedures are
presented in pedigree-based and marker-based models. The flexibility of the Bayesian
approaches and their high accuracy of prediction of the breeding values are illustrated.
We show a tendency of the superiority of Bayesian methods over best linear unbiased
prediction (BLUP) in accuracy of selection, but some difficulties on elicitation of some
complex prior distributions are investigated. Genetic models including marker and
pedigree information are more accurate than statistical models based on markers or
pedigree alone.

Keywords: accuracy of prediction, breeding value, Bayesian methods, BLUP, pedigree,


markers

1. Introduction

Quantitative genetics results from the combination of statistics and the principles of animal and plant breeding. In quantitative genetics, selection for economically important traits relies on the phenotypic values of individuals and pedigree information. Genomic selection is based on the use of dense markers through the whole genome to predict the breeding value of the individuals [1]. Linear models (univariate and multivariate) are of fundamental
importance in applied and theoretical quantitative genetics [2]. In animal breeding, two
major methods were particularly applied, restricted maximum likelihood (REML) and
Bayesian methods. REML has emerged as the method of choice in animal breeding for
variance component estimation [3]. Bayesian analysis is gaining popularity because of its
more comprehensive assumptions than those of classical approaches and its flexibility in

© 2017 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use,
distribution, and reproduction in any medium, provided the original work is properly cited.

resolving a wide range of biological problems [4, 5]. In the Bayesian approach, the idea is to
combine what is known about the statistical ensemble before the data are observed (prior
probability distributions) with the information coming from the data, to obtain a posterior
distribution from which inferences are made using the standard probability calculus tech-
niques [2, 6]. In recent years, Bayesian methods were broadly used to solve many of the
difficulties faced by conventional statistical methods and extend the applicability of statistics
on animal and plant breeding data [7]. Furthermore, Markov chain Monte Carlo (MCMC)
has an important impact in applied statistics, especially from Bayesian perspective for the
estimation of genetic parameters in the linear mixed effect model [2, 5]. The specific objective
of this chapter is to illustrate applications of Bayesian inference in quantitative genetics and genomics. First, Bayesian models in quantitative genetics theory are examined. Second, in the context of genomic selection, we present the details of statistical modeling using BLUP and Bayesian analyses. Third, a critical review with a focus on the prior distributions is presented. Finally, genomic predictions from several methods used in many countries are discussed.

2. A brief introduction to Bayesian analyses

In Bayesian inference, the idea is to combine what is known about the statistical ensemble
before the data are observed (prior probability distributions) with the information coming
from the data, to obtain a posterior distribution from which inferences are made using the
standard probability calculus techniques.

P(θ|y) ∝ P(y|θ) P(θ)    (1)

P(θ) is the prior distribution, which reflects the relative uncertainty about the possible values of θ before the data are seen. P(y|θ) is the likelihood function of the data given the parameter, which represents the contribution of y to knowledge about the parameter θ. P(θ|y) is the posterior distribution of the parameter θ given the data.
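The proportionality in Eq. (1) can be made concrete with a small numerical sketch (the data and prior here are invented for illustration, not taken from the chapter): a grid approximation multiplies the likelihood by the prior pointwise and normalizes, and for a binomial likelihood with a uniform prior the result can be checked against the known conjugate Beta posterior.

```python
import numpy as np

# Grid approximation of Eq. (1): P(theta | y) ∝ P(y | theta) P(theta).
# Illustrative example: estimating an allele frequency theta from
# 7 carriers among n = 20 genotyped animals, with a uniform Beta(1, 1) prior.
theta = np.linspace(1e-4, 1 - 1e-4, 2001)       # grid over the parameter
prior = np.ones_like(theta)                     # P(theta): uniform prior
likelihood = theta**7 * (1 - theta)**13         # P(y | theta): binomial kernel
posterior = likelihood * prior                  # unnormalized posterior
d = theta[1] - theta[0]
posterior /= posterior.sum() * d                # normalize to integrate to 1

post_mean = (theta * posterior).sum() * d
# Conjugacy check: the exact posterior is Beta(1 + 7, 1 + 13), mean 8/22.
print(round(post_mean, 4))                      # close to 8/22 ≈ 0.3636
```

The same pointwise multiply-and-normalize scheme works for any one-dimensional prior on the grid, which is why the grid sketch is a useful sanity check before moving to MCMC.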

3. Bayesian analyses of linear models

3.1. The mixed linear model


The mixed linear model is of great importance in genetics and is one of the most used statistical
models. Arguably, variance components and genetic parameters are important because they
give an indication of the ability of species to respond to selection and thus the potential of that
species to evolve. The mixed linear model is the simplest method for estimating the variance components for quantitative traits in a population. In the "frequentist" view, a mixed linear model is one that linearly includes both fixed and random effects. In the Bayesian context, there is no distinction between fixed and random effects. Detailed Bayesian analyses of models with two or more variance components will be discussed.

3.1.1. The univariate linear additive genetic model


The mixed linear model is one that includes fixed and random effects.

Consider the linear model:

y = Xβ + Za + e    (2)

y is an n×1 vector of records on a trait; β is the vector of fixed effects affecting records; a is the vector of additive genetic effects; e is a vector of residual effects. X and Z are incidence matrices relating records to fixed effects and additive genetic effects, respectively. Data are assumed to be generated from the following distribution:

y | β, a, σ²e ~ N(Xβ + Za, I σ²e)

e ~ N(0, I σ²e)

where I is an identity matrix of order n×n and σ²e is the residual variance. Independence of
various effects was assumed for the sake of simplicity in implementation. We assume a genetic
model in which genes act additively within and between loci, and there are effectively an
infinite number of loci. Under this infinitesimal model, and assuming further initial Hardy-
Weinberg and linkage equilibrium, the distribution of additive genetic values conditional on
the additive genetic covariance is multivariate normal.

a | A, σ²a ~ N(0, A σ²a)

where A is the numerator relationship matrix of order q×q; β is assumed to have a uniform distribution with bounds βmin and βmax.
P(σ²i | νi, S²i) ∝ (σ²i)^(−(νi/2 + 1)) exp(−νi S²i / (2σ²i)),  i = a, e

where νe, S²e and νa, S²a are interpreted as degrees of belief and a priori values for the residual and additive genetic variances. Posterior conditional distributions derived from the likelihood and the prior distributions for these parameters are, for example,

bi | b−i, a, σ²a, σ²e, y ~ N(b̂i, (x′i xi)⁻¹ σ²e), where (x′i xi) is the ith diagonal element of X′X
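In a Gibbs sampler for this model, the variance components are drawn from scaled inverse chi-square full conditionals. The sketch below (simulated residuals and invented hyperparameters, not the chapter's data) shows the residual-variance step: combine the residual sum of squares with the prior degrees of belief and scale, then draw from the resulting scaled inverse chi-square.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_scaled_inv_chi2(rng, df, scale_sq):
    """One draw from the scaled inverse chi-square: df * scale_sq / chi2(df)."""
    return df * scale_sq / rng.chisquare(df)

# Residual-variance step of a Gibbs sampler (residuals simulated here;
# hyperparameters nu_e, S2_e are invented):
#   sigma2_e | ... ~ scaled-inv-chi2(nu_tilde, S2_tilde),
#   nu_tilde = n + nu_e,  S2_tilde = (e'e + nu_e * S2_e) / nu_tilde.
n, true_var = 5000, 4.0
e = rng.normal(0.0, np.sqrt(true_var), n)   # stand-in for e = y - Xb - Za
nu_e, S2_e = 4, 1.0                         # prior degrees of belief and scale

nu_tilde = n + nu_e
S2_tilde = (e @ e + nu_e * S2_e) / nu_tilde
draws = np.array([sample_scaled_inv_chi2(rng, nu_tilde, S2_tilde)
                  for _ in range(2000)])
print(round(draws.mean(), 2))               # concentrates near true_var = 4.0
```

With many records the data dominate the prior, so the posterior draws concentrate near the empirical residual variance; with few records the prior values νe and S²e pull the draws toward the a priori scale.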

3.1.2. The univariate linear additive genetic model with permanent and genetic group effects
The model equation [8] used to estimate genetic parameters and genetic breeding value for
milk yield was as follows:

y = Xβ + Za + ZQg + Wp + e    (3)

where y is the vector of milk yield, b is the vector of fixed effects, a is the vector of additive genetic effects, g is the vector of genetic group effects, p is the vector of random permanent environmental effects, and e is the vector of residual effects. X, Z, W, and ZQ are incidence matrices relating a record to the fixed environmental effects in b, to the random animal effects in a, to the random permanent environment effects in p, and to the genetic groups in g, respectively. ĝ is the vector of estimated genetic group effects, â is the vector of estimated breeding values, and A is the numerator relationship matrix, where â* = Qĝ + â.

The conditional distribution of observed yield is defined by:

y | b, p, a*, σ²e ~ N(Xb + Za* + Wp, I σ²e)

with the assumption of P(b) being a constant; a* | A*, σ²a ~ N(Qg, A* σ²a);

p | σ²p ~ N(0, I σ²p); and P(σ²i | νi, S²i) ∝ (σ²i)^(−(νi/2 + 1)) exp(−νi S²i / (2σ²i))

where S²i are prior values for the variances, χ⁻²νi are inverted chi-square distributions, and νi are the degrees of freedom of the parameters.

3.1.2.1. Management and environmental effects

The distribution of a fixed effect is:

bi | b−i, a*, σ²a, σ²p, σ²e, y ~ N(b̂i, (x′i xi)⁻¹ σ²e)

with (x′i xi) b̂i = x′i y − x′i X−i b−i − x′i Wp − x′i Za*,

where (x′i xi) is the ith diagonal element of X′X.

3.1.2.2. Permanent environmental effects

The distribution of a permanent effect is:

pi | b, p−i, a*, σ²a, σ²p, σ²e, y ~ N(p̂i, (w′i wi + δ)⁻¹ σ²e)

with (w′i wi + δ) p̂i = w′i y − w′i Xb − (w′i W−i + δ) p−i − w′i Za*,

where w′i wi is the ith diagonal element of W′W.

3.1.2.3. Breeding values

The distribution of a breeding value is:

a*i | b, p, a*−i, σ²a, σ²p, σ²e, y ~ N(â*i, (z′i zi + A*⁻¹i,i α)⁻¹ σ²e)

with (z′i zi + A*⁻¹i,i α) â*i = z′i y − z′i Xb − z′i Wp − A*⁻¹i,−i α a*−i,

where z′i zi is the ith diagonal element of Z′Z and α = σ²e/σ²a.



3.1.2.4. Variance components


The additive genetic variance is defined by:

σ²a | b, p, a*, σ²p, σ²e, y ~ Ṽa S̃²a χ⁻²Ṽa

with Ṽa = na + Va, S̃²a = (a*′ A*⁻¹ a* + Va S²a)/Ṽa, and na the number of animals being evaluated.

The variance of permanent environmental effects is given by:

σ²p | b, p, a*, σ²a, σ²e, y ~ Ṽp S̃²p χ⁻²Ṽp

with Ṽp = np + Vp, S̃²p = (p′p + Vp S²p)/Ṽp, and np the number of animals being evaluated.

The residual variance:

σ²e | b, p, a*, σ²a, σ²p, y ~ Ṽe S̃²e χ⁻²Ṽe

with Ṽe = ne + Ve,

S̃²e = [(y − Xb − Wp − Za*)′(y − Xb − Wp − Za*) + Ve S²e]/Ṽe,

and ne the total number of records.

Comparing genetic value predictions based on a polygenic model in the Tunisian Holstein population using BLUP and Bayesian analyses, Ref. [8] reported that the rankings of animals with Bayesian methods are similar to those obtained by the BLUP method. Spearman's rank correlation between genetic values estimated from Bayesian procedures and genetic values estimated from BLUP methods was high (0.99). Again, Bayesian and best linear unbiased estimator (BLUE) solutions for the fixed effects (month of calving, herd-year, and age-parity) showed the same patterns. The same result is reported by Ref. [9]. However, Ref. [8] illustrated different correlation estimates between the two methods (Bayesian and BLUP) for cows' and bulls' breeding values.
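Spearman's rank correlation used in comparisons like the one above is simply the Pearson correlation of the ranks. A minimal sketch with hypothetical breeding values (the numbers are invented for illustration, not the estimates of Ref. [8]):

```python
import numpy as np

def spearman(x, y):
    """Spearman's rank correlation: the Pearson correlation of the ranks."""
    rx = np.argsort(np.argsort(x))   # rank of each animal under method 1
    ry = np.argsort(np.argsort(y))   # rank of each animal under method 2
    return np.corrcoef(rx, ry)[0, 1]

# Hypothetical breeding values of 8 animals from two evaluations.
ebv_blup  = np.array([2.1, -0.5, 1.3, 0.7, -1.2, 3.0, 0.1, -0.8])
ebv_bayes = np.array([2.0, -0.4, 1.5, 0.6, -1.1, 2.8, 0.2, -0.9])

print(round(spearman(ebv_blup, ebv_bayes), 2))   # 1.0: identical rankings
```

A value of 1 means the two methods would select exactly the same animals at any selection intensity, even if the estimated values themselves differ.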

4. Genomic selection

A massive quantity of genomic data is now available in animal and plant breeding with the
revolutionary development in sequencing and genotyping. The cost of genotyping has been dramatically reduced. Consequently, practices of genomic selection are nowadays possible with the high
number of single nucleotide polymorphism (SNP) markers available. Therefore, it is feasible to
perform analysis of the genome at a level that was not possible before [10–13]. The concept of
genomic selection was introduced by Ref. [1]. The latter suggested that a set of markers covering
the whole genome explains all the genetic variance: each marker is likely to be associated with a quantitative trait locus (QTL), and each QTL is in linkage disequilibrium with the markers. The number of effects per QTL to be estimated is very small. The estimated effects of all markers are summed in order to obtain the genetic value of the individual. Using simulation, Ref. [1] showed that with high-density SNP markers it is possible to predict the
breeding value with an accuracy of 0.85 (where accuracy is the correlation between the estimated
breeding value and true breeding value). The challenge in genomic evaluation is to find the best
prediction method to obtain accurate genetic values of candidates. Many genomic evaluation
methods have been proposed [14, 15]. The main objective of this section is to compare Bayesian
methods to other methods used in genomic selection based on their predictive abilities. The
study reported by Ref. [1] was considered an influential paper on dairy cattle breeding pro-
grams. First, the methods suggested correspond well to the data structures where the number of
SNPs substantially exceeds the number of observations. Second, the methods of Ref. [1] consti-
tute a logical evolution of the BLUP methodology, which is the reference method in animal
genetics, by considering specific variances of SNPs at the different loci. Third, the Bayesian approaches used in Ref. [1], which take into account unknown effects (measuring prior uncertainty) in a model and are combined with the capabilities of Markov chain Monte Carlo, can be applied to the majority of parametric statistical models.

4.1. Genomic BLUP (GBLUP)


The GBLUP method assumes that effects of all SNPs are sampled from the same normal
distribution; the effects of all markers are assumed to be small with equal variance. Genomic
BLUP was defined by the model:

y = 1μ + Zg + e    (4)

where y is the data vector; μ is the overall mean; 1 is a vector of n ones; Z is an incidence matrix allocating records to the marker effects; g is a vector of SNP effects assumed to be normally distributed, g ~ N(0, G σ²g), where σ²g is the additive genetic variance and G is the genomic relationship matrix; e is the vector of normal errors, e ~ N(0, σ²e), where σ²e is the error variance.
The genomic relationship matrix was defined as G = XX′ / Σi=1..m 2pi(1 − pi), where X is the matrix of specified SNP genotype coefficients at each locus and pi is the rare allele frequency of SNP i.
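A small sketch of this construction (dimensions, allele frequencies, and the variance ratio are all invented for illustration, not values from the chapter): build G from a centered 0/1/2 genotype matrix, then obtain the genomic values with the BLUP shrinkage formula ĝ = G(G + λI)⁻¹(y − μ), the closed form implied by the model of Eq. (4) when each animal has one record.

```python
import numpy as np

rng = np.random.default_rng(7)

# Invented sizes: 50 animals, 200 SNP markers.
n, m = 50, 200
p = rng.uniform(0.1, 0.5, m)                    # allele frequencies
M = rng.binomial(2, p, size=(n, m)).astype(float)  # 0/1/2 genotype matrix

X = M - 2 * p                                   # center each SNP by 2 * p_i
G = X @ X.T / (2 * np.sum(p * (1 - p)))         # G = XX' / sum_i 2 p_i (1 - p_i)

# Genomic values: g_hat = G (G + lambda I)^{-1} (y - mean),
# with lambda = sigma2_e / sigma2_g (assumed equal to 1 here).
y = rng.normal(0.0, 1.0, n)
lam = 1.0
g_hat = G @ np.linalg.solve(G + lam * np.eye(n), y - y.mean())
print(G.shape)                                  # (50, 50)
```

Note the symmetry of the construction: G is an n×n similarity matrix among animals, so its size depends on the number of animals, not on the (much larger) number of markers.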

4.2. Bayesian approaches

In Bayesian estimation, the information from the data is combined with the information from
the prior distribution of the variances of the markers. Several Bayesian statistical analyses have
been used in genomic evaluation, which differ in the hypotheses of distributions of marker
effects. At the level of the modeling of the variances of the marker effects, Meuwissen et al. [1] proposed different a priori distributions, which distinguish the Bayes A and Bayes B methods.

4.2.1. Bayes A

The Bayes A method assumes that the variance of marker effects differs among loci (e.g., σ²gj differs across the segments j) [16]. The variances are modeled according to the scaled inverted chi-square distribution. The a priori distribution of the variances of the SNP effects is written:
P(σ²gj) ~ χ⁻²(ν, S), where S is the scale parameter and ν is the number of degrees of freedom. This has the advantage, if we consider a normal distribution of the data, of leading to an a posteriori conditional distribution that is also a scaled χ⁻²:

P(σ²gj | gj) ~ χ⁻²(ν + nj, S + g′j gj),

where nj is the number of marker effects at segment j. The posterior distribution combines both the information provided by the data and the a priori distribution.
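The Bayes A update above can be sketched numerically (ν, S, and the current marker effects are invented for illustration): with a single effect per segment (nj = 1), a draw from χ⁻²(ν + nj, S + g′j gj) is (S + gj²)/χ²(ν + nj), so loci with larger current effects receive larger posterior variances and are shrunk less in the next round of the sampler.

```python
import numpy as np

rng = np.random.default_rng(3)

# Invented hyperparameters and current effects of three markers.
nu, S = 4.2, 0.01
g = np.array([0.001, 0.05, 0.6])

def draw_var(rng, g_j, n_j=1):
    # One draw from chi^{-2}(nu + n_j, S + g_j' g_j) for a single effect.
    return (S + g_j * g_j) / rng.chisquare(nu + n_j)

draws = np.array([[draw_var(rng, gj) for gj in g] for _ in range(4000)])
means = draws.mean(axis=0)
# The marker with the largest current effect gets the largest variance.
print(np.round(means, 3))
```

This per-locus behavior is exactly what distinguishes Bayes A from GBLUP, where a single common variance shrinks all markers equally.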

4.2.2. Bayes B
In a genomic evaluation context, the Bayes B method [1, 17] assumes different variances of SNP effects, with many SNPs contributing a zero effect and a few contributing large effects on the trait. Meuwissen et al. [1] propose a model in which a proportion π (arbitrarily fixed at 0.95) of the markers has zero effect. The a priori distribution of the variances of the marker effects is then written:

σ²g = 0 with probability π; P(σ²gj) ~ χ⁻²(ν, S) with probability (1 − π). Gibbs sampling cannot be used to estimate the effects and variances of the Bayes B model because of the high probability that some markers have zero variance. A Metropolis-Hastings algorithm is therefore used, which allows the simultaneous estimation of σ²gj and gj. On the basis of the results of Ref. [1] and many subsequent works, the Bayes B method is often considered the "benchmark" in terms of genomic prediction efficiency, but it is extremely costly in computational time. However, Meuwissen [18] proposes an alternative to the Bayes B method which relies on a fast algorithm.

4.2.3. Bayesian lasso

Legarra et al. [19] proposed a model of Bayesian lasso (BL) with different variances for residual
and SNP effects, which they termed BL2Var. It is therefore assumed that a large number of SNPs have a practically zero effect and that very few have large effects. Tibshirani [20] showed that the prior distribution corresponding to the lasso estimators can be written:

P(gj | λ) = (λ/2) exp(−λ |gj|)

He suggests that the lasso estimators can be interpreted as the a posteriori mode of a model in which the regression parameters are independent and identically distributed according to a prior double-exponential distribution. Park and Casella [21] propose a fully Bayesian approach by assuming an a priori distribution of the regression coefficients such as:

P(gj | σ², λ) = (λ / (2√σ²)) exp(−λ |gj| / √σ²)

where σ² represents both the variance of the residual effects of the model and the variance of the SNP effects. Applications of the Bayesian lasso to genomic selection proposed by Refs. [22, 23] use the same variance σ² to model both the distribution of SNP effects and the residuals. De los Campos et al. [22] showed that the Bayesian lasso is close in terms of precision of prediction to
the Bayes B method but with a significant reduction in the complexity of the calculations. In
addition, these authors suggested using the Bayesian lasso given the large number of markers
included in regression models, which is typically larger than the number of records.
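The double-exponential assumption can be illustrated numerically (a generic contrast, not an analysis from the chapter): at equal variance, the Laplace density places more mass than the normal both near zero and in the tails, which is exactly the "many near-zero effects, few large effects" shape the Bayesian lasso encodes.

```python
import numpy as np

rng = np.random.default_rng(11)

# Compare tail mass of a Laplace and a normal prior at the same variance.
# Sample sizes and sigma are arbitrary choices for the illustration.
n = 200_000
sigma = 1.0
normal = rng.normal(0.0, sigma, n)
laplace = rng.laplace(0.0, sigma / np.sqrt(2), n)   # scale b: var = 2 b^2

tail_n = np.mean(np.abs(normal) > 3 * sigma)   # ≈ 0.003 for the normal
tail_l = np.mean(np.abs(laplace) > 3 * sigma)
print(tail_l > tail_n)   # True: heavier tails at the same variance
```

Under such a prior, large marker effects are shrunk less severely than under a normal prior, while the sharp peak at zero shrinks small effects strongly toward zero.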

4.2.4. The Bayes C method


Bayesian methods such as Bayes A and Bayes B [1] have been widely used for genomic evaluation. Similar methods with similar performances exist, developed in order to reduce computation times and to simplify statistical modeling. The Bayes C method [24] differs from Bayes B by assuming the variance associated with SNPs to be common to all markers. In Bayes C, as in Bayes B, the probability π that an SNP has a nonzero effect is assumed to be known. The model is similar to the Bayes B model but with a homogeneous variance of effects across all loci: σ²g = 0 with probability 1 − π; P(σ²g) ~ χ⁻²(ν, S) with probability π. The main problem with the Bayes C method is that the proportion π of SNPs with a nonzero effect is assumed to be known. With the Bayes A method, the parameter π is equal to 1, which implies that all the markers have an effect. For the Bayes B method, π is strictly less than 1 in order to take into account the hypothesis that some SNPs may have a zero effect, but it is fixed arbitrarily even though the intensity of the selection of variables is controlled by this parameter. Habier et al. [25] propose to modify the Bayes C method by estimating the parameter π: the parameter π is assumed to be unknown. Thus, the a priori distribution of π becomes uniform over [0, 1]. SNP modeling is the same as with Bayes C: P(gj | π, σ²g) = 0 with probability 1 − π; P(gj | π, σ²g) ~ N(0, σ²g), where P(σ²g) ~ χ⁻²(ν, S), with probability π. The various parameters of this model are estimated by Markov chain Monte Carlo (MCMC) methods [6, 26], as proposed by Ref. [25]. The common SNP variance is written as a function of the additive genetic variance σ²a:

σ²g = σ²a / ((1 − π) Σj=1..p 2pj(1 − pj)), where pj is the allelic frequency of SNP j.
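With a Uniform(0, 1) = Beta(1, 1) prior on π, its Gibbs update given the current inclusion indicators is a Beta draw. A sketch with invented indicator states (the 3% inclusion rate is an assumption for illustration, not an estimate from the chapter):

```python
import numpy as np

rng = np.random.default_rng(5)

# Bayes C-pi update of the unknown mixture proportion: with delta_j = 1
# if SNP j currently has a nonzero effect, the full conditional is
#   pi | delta ~ Beta(1 + k, 1 + m - k),  k = sum_j delta_j.
m = 10_000
delta = rng.random(m) < 0.03        # invented state: ~3% of SNPs included
k = int(delta.sum())

pi_draws = rng.beta(1 + k, 1 + m - k, size=5000)
print(round(pi_draws.mean(), 3))    # posterior mass concentrates near k/m
```

Because m is large, the Beta posterior is sharply peaked, so π is effectively learned from the data rather than fixed arbitrarily as in Bayes B and Bayes C.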

4.3. A critique
The extreme speed at which the field is moving hampers the process of linking new developments to extant theory and to the understanding of the statistical models suggested up until now [27]. The latter authors criticize the theoretical and statistical concepts followed by Ref. [1] on three levels. The first is the connection between parameters (additive genetic variances in the Bayesian view) from infinitesimal models and those from marker-based models. The second is the relationship between molecular marker genotypes and similarity between relatives. The third is the connection between infinitesimal genetic models and marker-based regression models. Gianola et al. [27] argued that the Bayes A and Bayes B methods proposed by Ref. [1] require specifying parameters. The latter used formulas for obtaining the variance of SNP effects based on some knowledge of the additive genetic variance in the population. Their development begins with the assumption that the effects of the markers are fixed, while elsewhere they consider them random, without a clear demonstration. Meuwissen et al. [1] explained that assigning a priori a value σ²g = 0 with probability π means that the specific SNP does not have an effect on the trait. By contrast, Ref. [27] illustrated that a parameter having zero variance does not necessarily imply that the parameter takes the value zero. The parameter could have any value, but it would be known with certainty. Gianola et al. [27] suggested the use of nonparametric methods as developed by Refs. [22, 28], because these methods do not impose hypotheses about the mode of inheritance as the Bayes A and Bayes B methods do.

5. Applications in genomics

Major dairy breeding countries are now using genomic evaluation [27]. Several results have
been reported around the world. Several authors reported that the reliabilities of genomic estimated breeding values (GEBV) were substantially greater than those of estimated breeding values (EBV) based on pedigree information [29]. The accuracy of selection
was different between countries [12]. The accuracy was dependent on the size of reference
population, the heritability of the trait studied, the statistical models and approaches used for
prediction of genetic values for quantitative traits, and the method used to estimate the accuracy [12, 27, 29]. Ref. [14] assessed the reliability of GEBV for bulls of the Canadian and American Holstein population. Genotypes at 39,416 molecular markers for 3576 Holstein bulls were used to establish the prediction equations.
The prediction methods contained a linear model, in which marker effects are assumed to be
normal, and a nonlinear model with a heavier tailed prior distribution to account for major genes
as described by Ref. [1]. VanRaden et al. [14] reported that the combination of the polygenic effects based on pedigree information with the genomic predictions can improve the reliability to 23% greater than the reliability of polygenic effects only. The same study showed that the nonlinear model had a small advantage in reliability over the linear model for all traits except for fat and
protein percentages. Genomic breeding values of 25 traits in New Zealand dairy cattle were
estimated by Ref. [30]. The reference population consisted of 4500 bulls genotyped using the
BovineSNP50 BeadChip, containing 44,146 SNPs. Harris and Johnson [31] reported an increase in accuracy when using Bayesian approaches compared to BLUP methods. In Ref. [31],
genomic breeding values (GBVs) for young bulls with no daughter information had accuracies
ranging from 50 to 67% for milk traits, live weight, fertility, somatic cell, and longevity, versus an
average 34% for progeny test. Meuwissen et al. [1] compared least squares method with BLUP
and two Bayesian methods (Bayesian A and Bayesian B). The latter authors estimated the effects
of 50,000 marker haplotypes from a limited number of observations (2200). Using least squares
method, it is not possible to estimate all effects simultaneously. For this reason, different steps
have been adopted to incorporate the effects of markers. First, they performed regression on
markers for every segment of 1 cm each. Second, they calculated a Log-likelihood, which
assumed to be normal at every segment of chromosome. Third, they summed all segments
corresponding to a likelihood peak into multiple regression models. Using BLUP analyses, Ref.
[1] considered that all SNP effects were independent and identically distributed with a known
variance. Bayes A method was as BLUP at the level of the data, but differs in the variance of the
chromosome segments, which assumed to have an inverted chi-square distribution. A mixture
prior distribution of genetic variances was used in Bayes B method. Table 1 shows the accuracy
of selection obtained by Ref. [1] from the GBLUP methods, the least squares regression and the
Methods         ρ       b
Least squares   0.318   0.285
GBLUP           0.732   0.896
Bayes A         0.798   0.827
Bayes B         0.848   0.946

Table 1. Comparing estimated versus true breeding values [1].

Bayes A and Bayes B approaches. The predictive abilities of the different methods are estimated
by calculating the correlation (ρ) between true and estimated breeding values and the regression
(b) of true on estimated breeding value.
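The two criteria of Table 1 are straightforward to compute; a sketch with invented true and estimated breeding values (not the simulation of Ref. [1]):

```python
import numpy as np

# rho is the accuracy; b, the regression of true on estimated breeding
# values, measures inflation (b < 1 means the estimates are over-dispersed).
tbv = np.array([1.2, -0.3, 0.8, 2.0, -1.1, 0.4])   # true breeding values
ebv = np.array([1.5, -0.1, 0.9, 2.6, -1.4, 0.7])   # estimated breeding values

rho = np.corrcoef(tbv, ebv)[0, 1]
b = np.cov(tbv, ebv)[0, 1] / np.var(ebv, ddof=1)   # slope of TBV on EBV
print(round(rho, 3), round(b, 3))
```

In this invented example the estimates are slightly over-dispersed relative to the true values, so b comes out below 1 even though the correlation is high, mirroring the pattern of the Bayesian rows of Table 1.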

The least squares method is the least efficient because it overestimates the QTL effects [32]. The Bayes B approach is the most accurate both in terms of correlation and regression. However, the regression coefficient obtained by the Bayesian methods was still less than 1, probably because the a priori χ⁻² distribution assumed in Bayes A and Bayes B differs from the simulated distribution of the variances. Goddard and Hayes [11]
compared the correlation of 0.85 as reported by Ref. [1] to results obtained on real data by
Refs. [14, 33, 34]. VanRaden et al. [14] produced a mean correlation over several characters of
0.71 from a reference population of more than 3500 bulls. Studies have shown the superiority of genomic evaluation [35] or marker-assisted selection in France [36] over the classical infinitesimal model of quantitative genetics. Several authors have applied the first genomic evaluation methods described by Ref. [1], or methods derived from them, to real data. The Bayes A and
Bayes B approaches have found results that are often similar or slightly superior to GBLUP
in terms of accuracy of genetic value prediction for the Australian Holstein-Friesian cattle
breed (+0.02 to +0.07 of correlation gain between predicted and observed values), for exam-
ple [12] and New Zealand (+2% correlation gain, [31]). However, the GBLUP method
required less computing time than the Bayes A method [32, 37]. Gredler et al. [38] demonstrated the superiority of the Bayes B method, in terms of the accuracy of genomic estimates, over a modified Bayes A method that integrates a polygenic effect [39]. Thus, although the
Bayes B method seems slightly more efficient than the Bayes A method, numerous studies
showed that the Bayes B method is not so much better in terms of accuracy of the genomic
estimates than a GBLUP model [40]. Again, all researches indicate that the Bayesian
approaches, which assume an a priori distribution of SNPs, increase the reliability of breed-
ing values over traditional BLUP methods [1, 12, 14]. A common conclusion is that for most
quantitative traits, the hypothesis of the traditional BLUP method, that all markers are
associated with equal variances, is far from reality. By comparing the results obtained in the
various populations around the world, clearly, the accuracies of GEBVs were greater than
breeding values estimated from progeny test based on pedigree information. Several
researches suggested combining the progeny test based on pedigree information with the
breeding value from genomic to calculate the final GEBV [5, 25]. Accuracy based on model-
ing molecular marker and pedigree information was generally superior to that of the model
including only genomic or pedigree information. Hayes et al. [12] reported that a main
Bayesian Modeling in Genetics and Genomics 217
http://dx.doi.org/10.5772/intechopen.70167

advantage of using the both sources of information coming from polygenic breeding values
and genomic information is that any QTL not detected by the marker effects may be detected
by the progeny test based on pedigree information. A significant reduction in posterior mean
of residual variance component was reported by Ref. [22] when pedigree and markers were
considered jointly compared to pedigree-based model. In the same study, Spearman’s rank
correlation of estimated breeding value between model including marker information and
pedigree-based model was close to 1.

6. Conclusion

The standard quantitative genetic model based on phenotypic and pedigree information has
been very successful in terms of genetic value prediction. Moreover, the availability of
genome-wide dense markers enables researchers to perform advanced genetic evaluation of
quantitative traits with a high accuracy of prediction of genetic value. However, a main
problem is how this information should be included in statistical genetic models. Bayesian
MCMC methods appear to be convenient for genetic value prediction, with a focus on the
precise choice of prior distribution for the different parameters.

Author details

Hafedh Ben Zaabza1*, Abderrahmen Ben Gara2 and Boulbaba Rekik2

*Address all correspondence to: [email protected]

1 Institut National Agronomique, Tunis-Mahrajène, Tunisia

2 Département des productions animales, Ecole supérieure d'Agriculture de Mateur, Mateur, Tunisia

References

[1] Meuwissen THE, Hayes BJ, Goddard ME. Prediction of total genetic value using genome-
wide dense marker maps. Genetics. 2001;157:1819-1829

[2] Sorensen D, Gianola D. Likelihood, Bayesian, and MCMC Methods in Quantitative
Genetics. 1st ed. New York: Springer-Verlag; 2002. p. 740
[3] Neumaier A, Groeneveld E. Restricted maximum likelihood estimation of covariances in
sparse linear models. Genetics Selection Evolution. 1997;30(1):3-26
[4] Waldmann P. Easy and flexible Bayesian inference of quantitative genetic parameters.
Evolution. 2009;63(6):1640-1643. DOI: 10.1111/j.1558-5646.2009
218 Bayesian Inference

[5] Hallander J, Waldmann P, Chunkao W, Sillanpaa MJ. Bayesian inference of genetic
parameters based on conditional decompositions of multivariate normal distributions.
Genetics. 2010;185:645-654. DOI: 10.1534/genetics.110.114249
[6] Robert CP. Le choix bayésien: Principes et pratique. 1st ed. Paris: Springer-Verlag France;
2006. p. 638
[7] Ben Zaabza H, Ben Gara A, Hammami H, Ferchichi MA, Rekik B. Estimation of variance
components of milk, fat, and protein yields of Tunisian Holstein dairy cattle using Bayes-
ian and REML methods. Archives Animal Breeding. 2016;59:243-248. DOI: 10.5194/aab-
59-243-2016
[8] Ben Gara A, Rekik B, Bouallègue M. Genetic parameters and evaluation of the Tunisian
dairy cattle population for milk yield by Bayesian and BLUP analyses. Livestock Science.
2006;100:142-149. DOI: 10.1016/j.livsci.2005.08.012
[9] Schenkel FS, Schaeffer LR, Boettcher PJ. Comparison between estimation of breeding
values and fixed effects using Bayesian and empirical BLUP estimation under selection
on parents and missing pedigree information. Genetic Selection Evolution. 2002;34:41-59.
DOI: 10.1051/gse:2001003
[10] Gianola D, Fernando RL, Stella A. Genomic-assisted prediction of genetic value with semi-
parametric procedures. Genetics. 2006;173(3):1761-1776. DOI: 10.1534/genetics.105.049510
[11] Goddard ME, Hayes BJ. Genomic selection. Journal of Animal Breeding and Genetics.
2007;124:323-330. DOI: 10.1111/j.1439-0388.2007

[12] Hayes BJ, Bowman PJ, Chamberlain AJ, Goddard ME. Genomic selection in dairy cattle:
Progress and challenges. Journal of Dairy Science. 2009;92:433-443. DOI: 10.3168/jds.
2008-1646

[13] Wittenburg D, Melzer N, Reinsch N. Including non-additive genetic effects in Bayesian
methods for the prediction of genetic values based on genome-wide markers. BMC
Genetics. 2011;12(74):14

[14] VanRaden PM, Van Tassell CP, Wiggans GR, Sonstegard TS, Schnabel RD, Taylor JF, et al.
Reliability of genomic predictions for North American Holstein bulls. Journal of Dairy
Science. 2009;92:16-24. DOI: 10.3168/jds.2008-1514

[15] Colombani C, Croiseau P, Fritz S, Guillaume F, Legarra A, Ducrocq V, Robert-Granié C. A
comparison of partial least squares (PLS) and sparse PLS regressions in genomic selection in
French dairy cattle. Journal of Dairy Science. 2012;95:2120-2131. DOI: 10.3168/jds.2011-4647
[16] Su G, Guldbrandtsen B, Gregersen VR, Lund MS. Preliminary investigation on reliability
of genomic estimated breeding values in the Danish Holstein population. Journal of
Dairy Science. 2010;93(3):1175-1183. DOI: 10.3168/jds.2009-2192
[17] Villumsen TM, Janss L, Lund MS. The importance of haplotype length and heritability
using genomic selection in dairy cattle. Journal of Animal Breeding and Genetics.
2009;126(1):3-13. DOI: 10.1111/j.1439-0388.2008

[18] Meuwissen THE. Accuracy of breeding values of “unrelated” individuals predicted by dense
SNP genotyping. Genetics Selection Evolution. 2009;41:35. DOI: 10.1186/1297-9686-41-35
[19] Legarra A, Robert-Granié C, Croiseau P, Guillaume F, Fritz S. Improved lasso for genomic
selection. Genetics Research. 2011;93(1):77-87. DOI: 10.1017/S0016672310000534

[20] Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society Series B. 1996;58(1):267-288

[21] Park T, Casella G. The Bayesian lasso. Journal of the American Statistical Association.
2008;103(482)681-686. DOI: 10.1198/016214508000000337

[22] De los Campos G, Naya H, Gianola D, Crossa J, Legarra A, Manfredi E, et al. Predicting
quantitative traits with regression models for dense molecular markers and pedigree.
Genetics. 2009;182:375-385. DOI: 10.1534/genetics.109.101501

[23] Weigel KA, De los Campos G, González-Recio O, Naya H, Wu XL, Long N, et al. Predictive
ability of direct genomic values for lifetime net merit of Holstein sires using selected subsets
of single nucleotide polymorphism markers. Journal of Dairy Science. 2009;92(10):
5248-5257. DOI: 10.3168/jds.2009-2092

[24] Kizilkaya K, Fernando RL, Garrick DJ. Genomic prediction of simulated multi-breed and
purebred performance using observed fifty thousand single nucleotide polymorphism
genotypes. Journal of Animal Science. 2010;88(2):544-551. DOI: 10.2527/jas.2009-2064

[25] Habier D, Fernando RL, Kizilkaya K, Garrick DJ. Extension of the Bayesian alphabet for
genomic selection. BMC Bioinformatics. 2011;12:12
[26] Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E. Equations of state
calculations by fast computing machines. Journal of Chemical Physics. 1953;21:1087-1092
[27] Gianola D, Manfredi E, Fernando RL. Additive genetic variability and the Bayesian
alphabet. Genetics. 2009;183:347-363. DOI: 10.1534/genetics.109.103952
[28] Gianola D, van Kaam JBCHM. Reproducing kernel Hilbert spaces regression methods for
genomic assisted prediction of quantitative traits. Genetics. 2008;178(4):2289-2303. DOI:
10.1534/genetics.107.084285
[29] Su G, Madsen P, Nielsen US, Mäntysaari EA, Aamand GP, Christensen OF, et al. Genomic
prediction for Nordic Red Cattle using one-step and selection index blending. Journal of
Dairy Science. 2012;95:909-917. DOI: 10.3168/jds.2011-4804

[30] Harris BL, Johnson DL, Spelman RJ. Genomic selection in New Zealand and the implications
for national genetic evaluation. In: Proceedings of the Interbull Meeting, held with the 36th
International Committee for Animal Recording (ICAR) Session; June 16-20, 2008; Niagara
Falls, Canada; 2008
[31] Harris BL, Johnson DL. Genomic predictions for New Zealand dairy bulls and integra-
tion with national genetic evaluation. Journal of Dairy Science. 2009;93(3):1243-1252.
DOI: 10.3168/jds.2009-2619

[32] Moser G, Tier B, Crump RE, Khatkar MS, Raadsma HM. A comparison of five methods to
predict genomic breeding values of dairy bulls from genome-wide SNP markers. Genetics
Selection Evolution. 2009;41(56). DOI: 10.1186/1297-9686-41-56
[33] Legarra A, Misztal I. Technical note: Computing strategies in genome-wide selection.
Journal of Dairy Science. 2008;91(1):360-366. DOI: 10.3168/jds.2007-0403
[34] González-Recio O, Gianola G, Rosa GJM, Weigel KA, Kranis A. Genome-assisted prediction
of a quantitative trait measured in parents and progeny: Application to food conversion
rate in chickens. Genetics Selection Evolution. 2009;41(3):10. DOI: 10.1186/1297-9686-41-3
[35] VanRaden P. Efficient methods to compute genomic predictions. Journal of Dairy Science.
2008;91(11):4414-4423. DOI: 10.3168/jds.2007-0980
[36] Boichard D, Fritz S, Rossignol MN, Bosher MY, Malafosse A, Colleau JJ. Implementation
of marker-assisted selection in French dairy cattle. In: 7th World Congress on Genetics
Applied to Livestock Production; 19-23 August 2002; Montpellier, France. 2002. Session
22. Exploitation of molecular information in animal breeding. Electronic communication
22-03. p. 4
[37] Solberg TR, Sonesson AK, Woolliams JA, Meuwissen THE. Reducing dimensionality for
prediction of genome-wide breeding values. Genetics Selection Evolution. 2009;41(29):8.
DOI: 10.1186/1297-9686-41-29
[38] Gredler B, Nirea KG, Solberg TR, Egger-Danner C, Meuwissen THE, Solkner J. Genomic
selection in Fleckvieh/Simmental—First results. In: Proceedings of the Interbull Meeting;
21-24 August 2009; Barcelona, Spain. Interbull Bulletin. 2009;40:209-213
[39] Hayes BJ. Genomic selection in the era of the $1000 genome sequence. In: Symposium on
Statistical Genetics of Livestock for the Post-Genomic Era; Wisconsin-Madison, USA; 2009

[40] Habier DJ, Tetens J, Seefried FR, Lichtner P, Thaller G. The impact of genetic relationship
information on genomic breeding values in German Holstein cattle. Genetics Selection
Evolution. 2010;42(5). DOI: 10.1186/1297-9686-42-5
DOI: 10.5772/intechopen.70393

Chapter 11

Bayesian Two-Stage Robust Causal Modeling with Instrumental Variables using Student's t Distributions

Dingjing Shi and Xin Tong

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/intechopen.70393

Abstract
In causal inference research, the issue of treatment endogeneity is commonly addressed
using two-stage least squares (2SLS) modeling with instrumental variables (IVs), where
the local average treatment effect (LATE) is the causal effect of interest. Because practical
data are usually heavy tailed or contain outliers, using traditional 2SLS modeling based on
normality assumptions may result in inefficient or even biased LATE estimates. This study
proposes four types of Bayesian two-stage robust causal models with IVs to model normal
and nonnormal data, and evaluates the performance of the four types of models. The
Monte Carlo simulation results show that Bayesian two-stage robust causal modeling
produces reliable parameter estimates and model fits. In particular, among the four types of
two-stage robust models with IVs, the models that take outliers into consideration and
use Student's t distributions in the second stage to model heavy-tailed data or data
containing outliers provide more accurate and efficient LATE estimates and better model
fits than the other distributional models when data are contaminated. The preferred models
are recommended for general use in two-stage causal modeling with IVs.

Keywords: Bayesian methods, two-stage causal modeling with instrumental variables,
nonnormal data, robust method using Student's t distributions

1. Introduction

Causal inference and experimental researchers are often interested in the average treatment
effect (ATE), measured by the outcome difference between participants who are assigned to the
treatment and those assigned to the control. The estimation of the ATE for the whole population
is neither reliable nor feasible when certain conditions are not met or assumptions are
violated [6, 9]. Instead, the treatment effect for only a subset of participants is estimated, which
is called the local average treatment effect (LATE) [2, 13]. Different studies may have different

© 2017 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use,
distribution, and reproduction in any medium, provided the original work is properly cited.


LATEs, depending on the subgroup of interest. Often the subgroup of interest is those who have
been assigned to the treatment and have actually received the treatment [3]. One way to estimate
the LATE is to incorporate instrumental variables (IVs), which are correlated with the
endogenous regressors but uncorrelated with the error terms, when the exogeneity assumption
of the traditional linear model is violated and the endogenous regressors are correlated with
the errors. Instrumental variables are incorporated in the analysis to estimate the LATE, that is,
the part of the treatment effect whose estimation is not contaminated by this violation.
Two-stage least squares (2SLS) modeling [1] is widely used to estimate the LATE with IVs. In
the first stage, IVs are used to predict the partial treatment effect that can be explained by the
variations of IVs, and in the second stage, the fitted treatment values are used to predict the
experimental outcome, and to estimate the LATE. In estimating the LATE in traditional 2SLS
modeling with IVs, it is typically assumed that the measurement errors at both stages are
normally distributed. However, practical data in social and behavioral research usually violate
the normality assumption and often have heavy tails or contain outliers [25]. Failure to take the
nonnormal data into consideration but instead treating the heavy-tailed data or data
containing outliers as if they were normally distributed may result in unreliable parameter
estimates and inflated type I error rates [35, 38–40], which will eventually lead to misleading
statistical inference.
Routine methods to accommodate heavy-tailed data or data with outliers include data trans-
formation and data truncation. However, transformed data are often difficult to interpret
especially when the raw scores have meaningful scales [17], and the exclusion of outliers may
lead to underestimated standard errors and reduced efficiency [14, 32]. Alternatively, different
robust procedures have been developed to provide reliable parameter estimates, the associated
standard errors, and statistical tests. The rationale of most robust procedures is to weigh each
observation according to its distance from the center of the majority of the data, so that outliers
that are far from the center of the data are downweighted [10, 11, 37]. In recent research, more
and more robust methods have been used to estimate complex models, such as linear and
generalized linear mixed-effects models [19, 26], structural equation models [15, 31], and
hierarchical linear and nonlinear models [20, 29].
Over the past decades, robust procedures based on Student’s t distributions have been devel-
oped and advanced to model heavy-tailed data or data containing outliers [14, 33]. For exam-
ple, Student’s t distributions have been applied under the structural equation modeling
framework and were found to produce reliable parameter estimates and inferences [15, 16]; in
robust mixture models, Wang et al. [30] used the multivariate t distribution to fit heavy-tailed
data and data with missing information, Shoham [24] implemented a robust clustering algo-
rithm in mixture models by modeling data that are contaminated by outliers using multivari-
ate t distributions, Seltzer et al. [21] and Seltzer and Choi [22] conducted sensitivity analysis
employing Student’s t distributions in robust multilevel models and downweighted outliers in
level two (the between-subject level), and Tong and Zhang [28] and Zhang et al. [36] advanced
the Student’s t distributions to robust growth curve models and provided online software to
Bayesian Two-Stage Robust Causal Modeling with Instrumental Variables using Student's t Distributions 223
http://dx.doi.org/10.5772/intechopen.70393

carry out the analysis. Although robust methods based on Student's t distributions have been
used in different modeling frameworks, few have been adopted in the causal modeling, where
heavy-tailed data or data containing outliers are not uncommon [18].
Recently, Shi and Tong [23] implemented a robust Bayesian estimation method using Student's
t distributions to the two-stage causal modeling with IVs to fit data that contain outliers or are
normally distributed concurrently at both stages. However, in two-stage causal models
with IVs, the data at either stage may contain outliers or be nonnormally distributed.
Previous studies have noticed such a situation. For example, Pinheiro et al. [19] used a robust
estimation to the linear mixed-effects model and applied the multivariate t distribution to both
the random effects and intraindividual errors simultaneously. Tong and Zhang [28] conducted
a robust estimation to growth curve modeling and modeled the measurement errors and
random effects separately with t distributions or normal distributions rather than the same
distribution for the two effects. Therefore, this article extends the study of Shi and Tong [23]
and proposes four possible types of two-stage causal models with IVs to the data. The study
evaluates the performance of the robust method in four types of models. In the following
section, the robust method based on Student's t distributions is reviewed. Then, the two-stage
causal models with IVs, the associated LATE, and the corresponding four types of models are
introduced. Next, a Monte Carlo simulation study is conducted to evaluate the performance of
the robust method in four possible types of two-stage causal models with IVs. In the end,
conclusions are summarized and discussions are provided.

2. Robust methods based on Student’s t distributions

As a robust procedure, the fundamental idea of using Student's t distributions to model heavy-
tailed data or data containing outliers is to assign a weight to each case and properly downweight
cases that are far from the center of the majority of the data [10, 11, 37]. Suppose a vector of k
random variables, y, follows a multivariate t distribution with mean vector μ, scale matrix Ψ, and
degrees of freedom ν, denoted by t (μ, Ψ, ν). The probability density function of y can be
expressed as:

$$p(y \mid \mu, \Psi, \nu) = \frac{|\Psi|^{-\frac{1}{2}}\, \Gamma\!\left(\frac{\nu + k}{2}\right)}{\Gamma\!\left(\frac{1}{2}\right)^{k}\, \Gamma\!\left(\frac{\nu}{2}\right)\, \nu^{\frac{k}{2}}} \left(1 + \frac{(y - \mu)^{T} \Psi^{-1} (y - \mu)}{\nu}\right)^{-\frac{\nu + k}{2}}. \qquad (1)$$

The maximum likelihood estimates of model parameters under the model with t distribution
assumptions satisfy
$$\sum_{i=1}^{n} w_i A_i \Psi_i^{-1} (y_i - \mu) = 0, \qquad (2)$$

where n is the total sample size, yi is a sample from y, Ai is the partial derivatives of μ, and

$$w_i = \frac{\nu + \tau_i}{\nu + \sigma_i^2} \qquad (3)$$

is the weight assigned to case i. In the equation for $w_i$, $\tau_i$ is the dimension of the parameter for
each i and $\sigma_i^2$ is the squared Mahalanobis distance, $\sigma_i^2 = (y_i - \mu)^T \Psi^{-1} (y_i - \mu)$. Note that
$(y_i - \mu)$ is the distance between each observation and the population mean, and a large $(y_i - \mu)$
indicates a potential outlier as well as a large squared Mahalanobis distance $\sigma_i^2$. The outliers
are downweighted in the analysis because the weight $w_i$ decreases with increasing squared
Mahalanobis distance $\sigma_i^2$, given fixed degrees of freedom ν and dimension $\tau_i$ [14].
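As a quick numerical illustration of the weights in Eq. (3), the following Python sketch (not part of the original study, which used R and OpenBUGS) computes $w_i$ for univariate data with one planted outlier; the function name and the choice ν = 4 are illustrative assumptions:

```python
import numpy as np

def t_weights(y, mu, psi, nu):
    """Weights from Eq. (3) for univariate data (k = 1):
    w_i = (nu + k) / (nu + d_i^2), where d_i^2 = (y_i - mu)^2 / psi
    is the squared Mahalanobis distance."""
    d2 = (y - mu) ** 2 / psi
    return (nu + 1) / (nu + d2)

rng = np.random.default_rng(0)
y = rng.normal(0.0, 1.0, size=100)
y[0] = 8.0                            # plant one gross outlier
w = t_weights(y, mu=0.0, psi=1.0, nu=4)

# The outlier is strongly downweighted relative to a typical point.
print(f"outlier weight: {w[0]:.3f}, median weight: {np.median(w):.3f}")
```

The planted point at 8 standard deviations receives weight $(4 + 1)/(4 + 64) \approx 0.07$, while points near the center receive weights close to or above 1, which is exactly the downweighting behavior described above.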

The shape of a t distribution is controlled by its degrees of freedom ν, and ν can be set a priori
or estimated in the analysis. Under certain conditions, setting the degrees of freedom a priori
has been recommended. Lange et al. [14] and Zhang et al. [36] suggested fixing the value
for the degrees of freedom of Student's t distributions when sample size is small, as small
sample sizes could lead to biased degrees of freedom estimate. Moreover, Tong and Zhang [28]
argued that by fixing the degrees of freedom, more accurate parameter estimates and credible
intervals can be obtained when model specification is built on solid substantive theories. In
contrast, estimating the degrees of freedom can make the model more flexible. When the
degrees of freedom ν are freely estimated, Student's t distributions have an additional param-
eter ν, compared with normal distributions. As the degrees of freedom ν increase, the Student's
t distribution approaches a normal distribution.
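The convergence of the t distribution to the normal as ν grows can be checked numerically; this small Python sketch (an illustration, not from the chapter) measures the maximum gap between the two densities on a grid:

```python
import numpy as np
from math import gamma, pi, sqrt

def t_pdf(x, nu):
    """Density of the standard Student-t distribution with nu df."""
    c = gamma((nu + 1) / 2) / (gamma(nu / 2) * sqrt(nu * pi))
    return c * (1 + x**2 / nu) ** (-(nu + 1) / 2)

def normal_pdf(x):
    """Standard normal density."""
    return np.exp(-x**2 / 2) / sqrt(2 * pi)

x = np.linspace(-4.0, 4.0, 401)
gaps = [np.max(np.abs(t_pdf(x, nu) - normal_pdf(x))) for nu in (3, 30, 300)]
for nu, gap in zip((3, 30, 300), gaps):
    print(f"nu = {nu:3d}: max |t - normal| = {gap:.4f}")
```

The gap shrinks steadily as ν increases, consistent with the statement that the Student's t distribution approaches a normal distribution for large degrees of freedom.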
There are several advantages in using Student's t distributions for robust data analysis [28].
First, unlike the nonparametric robust analysis, Student's t distributions have parametric
forms, and inferences based on them can be carried out relatively easily through maximum
likelihood estimation or Bayesian estimation methods. Second, the degrees of freedom of
Student's t distributions control the weight given to outliers and can be flexibly set a priori or estimated.
Third, when data have heavy tails or contain outliers, considering Student's t distribution as a
natural extension of the normal distribution is rather intuitive.

3. Bayesian two-stage robust causal modeling with IVs

In causal Ordinary Least Squares (OLS) regression, when the error terms are related to some
regressors, the estimated ATE is biased due to the violation of the exogeneity assumption.
Variables that are related to the endogenous regressors but not to the errors can be used as
instruments to break the correlation between endogenous regressors and errors, leaving only
the part of the treatment effect that has not been contaminated by the violation to be
estimated; such variables are called instrumental variables (IVs). The ATE of interest then
becomes a LATE. For example, Currie and Yelowitz [8] studied the effect of a public housing
voucher program granting a larger housing unit on housing quality and educational
attainment. Because some families in the voucher program trade off physical housing
amenities for reductions in rental payments, which has negative effects on housing quality
and on their children, some regressors are correlated with the errors and become endogenous.
Because a household with an extra number of kids is entitled to a larger housing unit, whether
there are extra kids in the household and the sex composition of the extra kids were chosen as
the IVs to study the voucher program effect for participants who have one girl and one boy
(i.e., a mixed sex composition) in the household. It was found that the voucher program
participants with a mixed sex composition in the household are more likely to have better
housing quality and educational attainment. The example shows that when IVs are introduced,
external validity is traded for an improvement in internal validity, and the ATE (i.e., over all of
the voucher program participants) becomes a LATE (i.e., over program participants who have
extra kids of mixed sex composition).

One commonly used framework to estimate LATE is the 2SLS modeling with IVs. Let di and yi be
the treatment and the outcome for individual i, respectively, and $Z_i = (z_{i1}, …, z_{iJ})'$ be a vector of
instrumental variables for individual i (i = 1, …, N). Here, N is the sample size and J is the total
number of instrumental variables. In the first stage of the 2SLS model, the IVs Z are used to
predict the treatment d. In other words, the portion of variations in the treatment d is identified
and estimated by the IVs Z; and then the second stage relies on the estimated exogenous portion
of treatment variations in the form of the predicted treatment values to estimate the treatment
effect on the outcome y. A typical form of the 2SLS model with IVs can be expressed as:

$$d_i = \pi_{10} + \pi_{11} Z_i + e_{1i}, \qquad (4)$$

$$y_i = \pi_{20} + \pi_{21} \hat{d}_i + e_{2i}, \qquad (5)$$

where $\pi_{10}$ and $\pi_{11} = (\pi_{11}, …, \pi_{1J})'$ are the intercept and regression coefficients for the linear
model in which the treatment d is regressed on the IVs Z, and $\pi_{20}$ and $\pi_{21}$ are the
intercept and slope for the linear model in which the outcome y is regressed on the predicted
treatment values $\hat{d}$. The IVs help estimate the treatment effect in two steps: the
causal effect of the IVs on the treatment is first estimated in Eq. (4), and the causal effect of this
estimated partial treatment effect on the outcome is then estimated in Eq. (5). In the model,
$\pi_{11}$ is the causal effect of the IVs Z on the treatment d, and $\pi_{21}$ is the treatment effect on the
outcome y for the subset of participants whose treatment effect has been partialled out and
explained by the IVs Z. $\pi_{21}$ is the causal effect of interest and is called the LATE. There are several
advantages in using 2SLS modeling to estimate the LATE. First, unlike point-estimate methods
such as the Wald estimator [4], 2SLS modeling also provides standard error estimates and
confidence intervals for the LATE, making statistical inference more efficient. Second, when 2SLS
models are used, covariates can be controlled simultaneously at both stages, when the effect
of Z on d and the effect of $\hat{d}$ on y are estimated. Mathematically, the
estimated LATE $\hat{\pi}_{21}$ in 2SLS can be derived as:
$$\hat{\pi}_{21} = \frac{\mathrm{cov}\big(y_i, \hat{d}_i\big)}{\mathrm{var}\big(\hat{d}_i\big)} = \frac{\mathrm{cov}\big(y_i,\ \hat{\pi}_{10} + \hat{\pi}_{11} z_{i1} + \cdots + \hat{\pi}_{1J} z_{iJ}\big)}{\mathrm{var}\big(\hat{\pi}_{10} + \hat{\pi}_{11} z_{i1} + \cdots + \hat{\pi}_{1J} z_{iJ}\big)} = \frac{\hat{\pi}_{11}\,\mathrm{cov}(y_i, z_{i1}) + \cdots + \hat{\pi}_{1J}\,\mathrm{cov}(y_i, z_{iJ})}{\hat{\pi}_{11}^2\,\mathrm{var}(z_{i1}) + \cdots + \hat{\pi}_{1J}^2\,\mathrm{var}(z_{iJ})}. \qquad (6)$$
Traditional causal 2SLS models with IVs are commonly estimated using OLS methods or
maximum likelihood estimation from the frequentist approach. The measurement errors at
both stages, $e_{1i}$ and $e_{2i}$, are assumed to be normally distributed as $e_{1i} \sim N(0, \sigma_{e_1}^2)$ and
$e_{2i} \sim N(0, \sigma_{e_2}^2)$. Because practical data usually violate the normality assumption, it was
proposed from a Bayesian approach that the normal distributions can be replaced by
Student's t distributions for heavy-tailed data or data containing outliers [23, 28, 36]. In the
two-stage causal model with IVs, data at either stage are equally likely to be nonnormal or
to contain outliers. Therefore, we propose four possible types of Bayesian two-stage causal
models for data with (a) normal measurement errors at both stages, denoted the Bayesian
normal model; (b) t measurement errors in the first stage and normal measurement errors in
the second stage, denoted the Bayesian nonnormal-s1 model; (c) normal measurement errors in
the first stage and t measurement errors in the second stage, denoted the Bayesian nonnormal-s2
model; and (d) t measurement errors at both stages, denoted the Bayesian nonnormal-both
model. The four types of Bayesian two-stage causal models have the same mathematical
model expressions as those from the frequentist approach. Namely, for the Bayesian normal
model, the measurement errors are assumed to be distributed as $e_{1i} \sim N(0, \sigma_{e_1}^2)$ and
$e_{2i} \sim N(0, \sigma_{e_2}^2)$; for the Bayesian nonnormal-s1 model, $e_{1i} \sim t(0, \sigma_{e_1}^2, \nu_1)$ and $e_{2i} \sim N(0, \sigma_{e_2}^2)$;
for the Bayesian nonnormal-s2 model, $e_{1i} \sim N(0, \sigma_{e_1}^2)$ and $e_{2i} \sim t(0, \sigma_{e_2}^2, \nu_2)$; finally, for
the Bayesian nonnormal-both model, $e_{1i} \sim t(0, \sigma_{e_1}^2, \nu_1)$ and $e_{2i} \sim t(0, \sigma_{e_2}^2, \nu_2)$. All four types of
models are estimated using Bayesian methods.

In the Bayesian approach, we obtain the joint posterior distributions of the parameters based
on the prior distributions of the parameters and the likelihood of the data information. Making
statistical inferences directly from the joint posterior distributions is usually difficult. Gibbs
sampling, a Markov chain Monte Carlo (MCMC) method is a widely used algorithm to draw a
sequence of samples from the joint posterior distribution of two or more random variables,
given that the conditional posterior distributions of the model parameters can be obtained [7].
Specifically, Gibbs sampling alternately samples parameters one at a time from their conditional
posterior distributions given the current values of the other parameters, which are treated as known.
After a sufficient number of iterations, the sequence of samples constitutes a Markov chain that
converges to a stationary distribution. This stationary distribution is the sought-after joint
posterior distribution of the parameters [12].

The Gibbs sampling algorithm is used to obtain the LATE estimate for the two-stage causal
model with IVs. Because the t distribution can be viewed as a normal distribution with
variance weighted by a Gamma distribution, the data augmentation method is used here to
simplify the posterior distribution. Specifically, a Gamma random variable ω is augmented
 
with a normal random variable, because if $\omega_i \sim \mathrm{Gamma}\!\left(\frac{\nu}{2}, \frac{\nu}{2}\right)$ and $y_i \mid \omega_i \sim N(\mu, \Psi/\omega_i)$, then $y_i \sim t$
(μ, Ψ, ν). The detailed steps of the Gibbs sampling algorithm for the Bayesian nonnormal-
s2 model are given below. The Gibbs sampling procedures for the other models are
similar.
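This scale-mixture identity can be verified by simulation; the following Python sketch (illustrative; note that NumPy's Gamma sampler is parameterized by scale = 1/rate) draws t-distributed values through the augmentation and checks the implied variance ν/(ν − 2):

```python
import numpy as np

rng = np.random.default_rng(1)
nu, n = 5, 200_000

# Augmentation: omega_i ~ Gamma(nu/2, rate = nu/2); NumPy's gamma
# sampler takes scale = 1/rate.  Then y_i | omega_i ~ N(0, 1/omega_i)
# is marginally Student-t with nu degrees of freedom.
omega = rng.gamma(shape=nu / 2, scale=2.0 / nu, size=n)
y = rng.normal(0.0, np.sqrt(1.0 / omega))

# A standard t_nu variable has variance nu / (nu - 2) = 5/3 here.
print(f"sample variance: {y.var():.3f} (theory: {nu / (nu - 2):.3f})")
```

The sample variance closely matches ν/(ν − 2), which is what makes the Gamma augmentation a convenient device inside the Gibbs sampler: conditional on the weights ω, every update is a standard normal-theory step.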
 
1. Start with initial values $\left(\pi_1^{(0)}, \pi_2^{(0)}, \sigma_{e_1}^{2(0)}, \sigma_{e_2}^{2(0)}, \nu^{(0)}, \omega_i^{(0)}\right)$, where $\pi_1^{(0)} = \left(\pi_{10}^{(0)}, \pi_{11}^{(0)}\right)'$ and $\pi_2^{(0)} = \left(\pi_{20}^{(0)}, \pi_{21}^{(0)}\right)'$.

2. Assume that at the jth iteration we have $\left(\pi_1^{(j)}, \pi_2^{(j)}, \sigma_{e_1}^{2(j)}, \sigma_{e_2}^{2(j)}, \nu^{(j)}, \omega_i^{(j)}\right)$, where $\pi_1^{(j)} = \left(\pi_{10}^{(j)}, \pi_{11}^{(j)}\right)'$ and $\pi_2^{(j)} = \left(\pi_{20}^{(j)}, \pi_{21}^{(j)}\right)'$.

At the (j+1)th iteration,

3. Step 3

3.1 Sample $\pi_1^{(j+1)}$ from $p\left(\pi_1 \mid \sigma_{e_1}^{2(j)}, d_i, Z_i, i = 1, …, N\right)$;

3.2 Sample $\sigma_{e_1}^{2(j+1)}$ from $p\left(\sigma_{e_1}^2 \mid \pi_1^{(j+1)}, d_i, Z_i, i = 1, …, N\right)$;

3.3 Sample $\sigma_{e_2}^{2(j+1)}$ from $p\left(\sigma_{e_2}^2 \mid \pi_2^{(j)}, \hat{d}_i, y_i, \omega_i^{(j)}, i = 1, …, N\right)$;

3.4 Sample $\nu^{(j+1)}$ from $p\left(\nu \mid \omega_i^{(j)}, i = 1, …, N\right)$;

3.5 Sample $\omega_i^{(j+1)}$, $i = 1, …, N$, from $p\left(\omega_i \mid \nu^{(j+1)}, \sigma_{e_2}^{2(j+1)}, \hat{d}_i, y_i, \pi_2^{(j)}, i = 1, …, N\right)$;

3.6 Sample $\pi_2^{(j+1)}$ from $p\left(\pi_2 \mid \omega_i^{(j+1)}, \sigma_{e_2}^{2(j+1)}, \hat{d}_i, y_i, i = 1, …, N\right)$.

4. Repeat Step 3.
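A minimal Python sketch of the robust steps of this sampler (3.3, 3.5, and 3.6) is given below for the Bayesian nonnormal-s2 model. It is a simplified illustration, not the authors' R/OpenBUGS implementation: the degrees of freedom ν are fixed rather than sampled (so step 3.4 is skipped), flat priors are assumed so the stage-1 posterior mean reduces to the OLS fit, and the conditional distributions follow standard conjugate forms:

```python
import numpy as np

rng = np.random.default_rng(7)

# --- Simulated data: heavy-tailed errors in the second stage only ---
N = 1000
z = rng.normal(size=N)
d = 1.0 + 0.8 * z + rng.normal(size=N)
y = 2.0 + 0.5 * d + rng.standard_t(df=3, size=N)   # true LATE = 0.5

# Stage 1 (normal errors, flat prior): posterior mean = OLS fit.
Z1 = np.column_stack([np.ones(N), z])
d_hat = Z1 @ np.linalg.lstsq(Z1, d, rcond=None)[0]
X = np.column_stack([np.ones(N), d_hat])

# --- Gibbs steps 3.3, 3.5, 3.6 for the nonnormal-s2 model, nu fixed ---
nu = 3.0
pi2, sig2 = np.zeros(2), 1.0
draws = []
for it in range(2000):
    # 3.5: omega_i | rest ~ Gamma((nu+1)/2, rate = (nu + r_i^2/sig2)/2)
    r = y - X @ pi2
    omega = rng.gamma((nu + 1) / 2, 2.0 / (nu + r**2 / sig2))
    # 3.6: pi2 | rest is a weighted least-squares normal (flat prior)
    V = np.linalg.inv(X.T @ (omega[:, None] * X)) * sig2
    m = V @ (X.T @ (omega * y)) / sig2
    pi2 = rng.multivariate_normal(m, V)
    # 3.3: sig2 | rest ~ Inverse-Gamma(N/2, sum(omega_i r_i^2)/2)
    r = y - X @ pi2
    sig2 = 1.0 / rng.gamma(N / 2, 2.0 / np.sum(omega * r**2))
    if it >= 500:                       # discard burn-in draws
        draws.append(pi2[1])

late = float(np.mean(draws))
print(f"posterior mean LATE: {late:.2f} (true value 0.5)")
```

Even with t3 errors in the second stage, the posterior mean of $\pi_{21}$ stays close to the true value, because observations with large residuals receive small weights ω and contribute little to the regression update.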

4. Evaluation of four types of distributional 2SLS models

In this section, the performance of the four types of two-stage robust causal models is evaluated
through a Monte Carlo simulation study. Data are generated from a general causal inference
model as presented in Eq. (7). Full Bayesian methods are used for the estimation of all four types
of two-stage causal models. In specific, noninformative priors are applied to all model parameters,
conditional posterior distributions of all model parameters are obtained and Markov chains are
generated through Gibbs sampling algorithm, convergence tests are conducted and finally statis-
tical inferences for the model parameters are made. Free software (R Development Core Team,
2011) R [41] and OpenBUGS [42] (Thomas, O’Hara, Ligges, & Sturtz, 2006) were used for the
implementation of MCMC algorithms and model estimation. A total of 20,000 iterations was
conducted for each simulation condition, with the first 10,000 iterations as the burn-in period.

4.1. Study design

Data are generated from a general causal inference model

yi = 3 + 0.5xi + ei,    (7)

where yi is the causal outcome, xi is the causal treatment, and ei is the measurement error.
Three potentially influential factors are considered. First, the sample size (N) is either 200 or 600.
Second, the correlation between x and e (Φ) is manipulated to be either 0.3 or 0.7, reflecting
a relatively weak or strong linear relationship between the treatment and the measurement
error. Third, the proportion of observations that contain outliers is manipulated. The proportion
of outliers (OP) is set to 0, 5, or 10%. When the OP is 0%, the data contain no outliers
and the measurement errors ei are normally distributed. When the OP is above zero, the data
contain outliers: their measurement errors are generated from a different normal distribution
with the same standard deviation but a larger mean (eight times the standard deviation).
An IV is also generated from a normal distribution and correlated with x, with a
correlation coefficient of 0.6.
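One way to generate data consistent with this design is sketched below; building x as a linear combination of the instrument, the error, and independent noise is our own assumption about the construction, chosen so that the stated correlations hold:

```python
import math
import random

random.seed(11)

def generate(n, phi, op):
    """Generate (y, x, z, e) with corr(x, e) ≈ phi, corr(x, z) ≈ 0.6,
    and a proportion op of outlying errors whose mean is shifted by 8 SDs."""
    ys, xs, zs, es = [], [], [], []
    c = math.sqrt(1 - 0.6 ** 2 - phi ** 2)  # requires 0.36 + phi^2 <= 1
    for _ in range(n):
        e = random.gauss(0, 1)
        if random.random() < op:            # outlier: same SD, mean 8 * SD
            e = random.gauss(8, 1)
        z = random.gauss(0, 1)              # instrument, independent of e
        u = random.gauss(0, 1)              # independent noise
        x = 0.6 * z + phi * e + c * u       # endogenous treatment
        y = 3 + 0.5 * x + e                 # Eq. (7)
        ys.append(y); xs.append(x); zs.append(z); es.append(e)
    return ys, xs, zs, es

def corr(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / n
    va = sum((ai - ma) ** 2 for ai in a) / n
    vb = sum((bi - mb) ** 2 for bi in b) / n
    return cov / math.sqrt(va * vb)

ys, xs, zs, es = generate(20_000, phi=0.3, op=0.0)
print(round(corr(xs, es), 2), round(corr(xs, zs), 2))  # near 0.3 and 0.6
```

With Φ = 0.3 the weights satisfy 0.6² + 0.3² < 1, so the noise coefficient is real; the same construction also works for Φ = 0.7.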
If we fit a linear regression to the generated data, we will immediately notice that the residuals
and the regressors are not independent. Therefore, we adopt the two-stage causal model with
IVs. The four types of two-stage models (normal model, nonnormal-s1 model, nonnormal-s2
model, and nonnormal-both model) are used to fit the data. In the first stage, the IV is used to
predict the endogenous treatment, and the estimated treatment is then used in the second
stage to estimate the LATE. Based on Eq. (6), the theoretical LATE is 5/6.

As discussed previously, Bayesian methods using the Gibbs sampling algorithm are used to obtain
the LATE estimates in the four types of two-stage causal models. The bias and standard error (SE)
of the LATE estimate for each of the four distributional models are assessed. In addition, the
deviance information criterion (DIC) [27] for each condition is examined to study the model fit.
A lower value of DIC indicates a better model fit.

4.2. Results

The bias and SEs of the LATE estimates from the four types of models when Φ = 0.3 are presented
in Table 1.
In almost all cases, models that use normal distributions for normal data and Student's t
distributions for data with outliers provide the best estimates, with smaller bias and SEs than
the other types of two-stage causal models. For example, when N = 200, the normal model
provides the smallest bias and SE for normal data; similarly, the nonnormal-s2 and
nonnormal-both models lead to smaller bias and SEs when they are used to fit data
containing outliers. This shows that using Student's t distributions to model data containing

                       Normal model     Nonnormal-s1     Nonnormal-s2     Nonnormal-both
N    Data       OP     Bias    SE       Bias    SE       Bias    SE       Bias    SE
200  Normal     0%     0.001   0.154    0.004   0.154    0.004   0.155    0.003   0.155
     Nonnormal  5%     0.210   0.283    0.200   0.281    0.022   0.177    0.020   0.171
                10%    0.342   0.379    0.341   0.378    0.060   0.157    0.050   0.155
600  Normal     0%     0.021   0.076    0.023   0.076    0.023   0.077    0.023   0.076
     Nonnormal  5%     0.180   0.168    0.170   0.167    0.064   0.099    0.060   0.096
                10%    0.390   0.230    0.380   0.210    0.077   0.099    0.070   0.098

Table 1. Bias and SEs of the LATE estimates for all the conditions when Φ = 0.3.

outliers is an effective way to accommodate heavy-tailed data or data containing outliers, and
this finding is consistent with previous research [34, 36]. In causal inference studies, because
practical data at either stage of the two-stage causal model with IVs may equally well contain
outliers or be normally distributed, we fit all four types of distributional models and try to
decide which one fits best. The results show that modeling heavy-tailed data or data
containing outliers with the nonnormal-both model provides more reliable parameter estimates
than traditional methods that ignore the data distributions and model all data exclusively with
normal distributions.

Although it is always a good choice to model normal data with normal distributions and
heavy-tailed data or data containing outliers with Student’s t distributions, in practice,
researchers may not know whether the first stage or the second stage of the model should
account for the nonnormality. The simulation results show that when data contain outliers, the
nonnormal-s2 model and nonnormal-both model that use t distributions in the second stage
produce the smallest bias and SEs of the LATE estimates. This is probably because the causal
effect of interest, LATE, is housed in the second stage, and using Student's t distribution to
model outliers in that stage is effective in capturing the LATE. On the contrary, in the normal
model or the nonnormal-s1 model, the normal distribution is being used to model the second
stage data that are heavy tailed or contain outliers. For example, for all the nonnormal data
that contain outliers (i.e., OP = 5 or 10%), the nonnormal-s2 model and the nonnormal-both
model, both of which use t distributions to model data in the second stage, outperform other
models, providing smaller bias and SEs of the LATE estimates regardless of sample size (N)
and proportion of outliers (OP). Comparing the nonnormal-s2 and nonnormal-both
models, the nonnormal-both model performs slightly better. Taking N = 600 and OP = 10% as
an example, the bias and SE for the nonnormal-s2 model are 0.077 and 0.099, whereas those
for the nonnormal-both model are slightly smaller, at 0.070 and 0.098, showing that fitting the
nonnormal data with Student's t distributions at both stages gives the best performance in
terms of accuracy and efficiency of the LATE estimate.

Table 2 presents the results for DICs for the four types of two-stage causal models when
Φ = 0.3.
In practice, the DIC can be used as a model selection criterion. To select the best-fitting parsimonious
model, we first fit all four types of models to the data and then select the model with the

N    Data       OP     Normal model    Nonnormal-s1 model    Nonnormal-s2 model    Nonnormal-both model
200  Normal     0%     1145.09         1145.83               1145.82               1146.54
     Nonnormal  5%     1380.18         1380.82               1241.48               1242.04
                10%    1488.71         1489.43               1315.18               1315.87
600  Normal     0%     3418.20         3419.25               3419.23               3420.53
     Nonnormal  5%     4126.86         4128.00               3705.62               3706.93
                10%    4448.88         4450.07               3922.32               3923.73

Table 2. DICs of all the distributional models when Φ = 0.3.



smallest DIC. Notice that for normal data, all four types of models have similar DIC values.
When the data contain outliers, the nonnormal-s2 and nonnormal-both models provide the smallest
DICs, indicating that these models fit the data better. In all data conditions in the study,
the DICs of the nonnormal-s2 model and the nonnormal-both model are very similar, and
either model can be adopted.
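The DIC combines the posterior mean deviance D̄ (deviance = −2 × log-likelihood) with a complexity penalty pD = D̄ − D(θ̄), the effective number of parameters, so DIC = D̄ + pD [27]. The toy sketch below computes it for a one-parameter normal-mean model, where pD should come out near 1 (the model and data are illustrative, not from the simulation study):

```python
import math
import random

random.seed(3)

# Toy data: y_i ~ N(2, 1) with known variance and unknown mean (illustrative).
data = [random.gauss(2, 1) for _ in range(50)]
n = len(data)
ybar = sum(data) / n

def deviance(mu):
    # Deviance = -2 * log-likelihood of the data under N(mu, 1).
    ll = sum(-0.5 * math.log(2 * math.pi) - 0.5 * (y - mu) ** 2 for y in data)
    return -2 * ll

# With a flat prior and known variance, mu | data ~ N(ybar, 1/n),
# so we can draw from the posterior directly instead of running MCMC.
draws = [random.gauss(ybar, math.sqrt(1 / n)) for _ in range(5_000)]

dbar = sum(deviance(m) for m in draws) / len(draws)  # posterior mean deviance
dhat = deviance(sum(draws) / len(draws))             # deviance at the posterior mean
p_d = dbar - dhat                                    # effective number of parameters
dic = dbar + p_d
print(round(p_d, 2))  # close to 1 for this one-parameter model
```

In the chapter's setting the deviance is evaluated over the MCMC draws of all stage-specific parameters, but the bookkeeping is the same: a model is penalized both for fitting poorly (large D̄) and for spending many effective parameters (large pD).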
The proportion of outliers contained in the data affects the performance of the
nonnormal-s2 model and the nonnormal-both model. Specifically, the larger the proportion
of outliers, the more salient the advantages of the nonnormal-s2 and nonnormal-both models.
For example, for the nonnormal data with N = 200 and OP = 5%, the bias from the normal
model, the nonnormal-s2 model and the nonnormal-both model is 0.210, 0.022, and 0.020,
respectively; when OP becomes 10%, the bias from the normal model jumps to 0.342, whereas
the bias from the nonnormal-s2 model changes slightly to 0.060 and that from the
nonnormal-both model is 0.050. Similarly, the preferred models provide less biased LATE
estimates when sample size is small, and the advantage of the preferred models is more
apparent under small sample conditions (e.g., [23]).
The results when Φ = 0.7 are consistent with those when Φ = 0.3: when the data
have outliers, using Student's t distributions to model the data provides more accurate and
efficient LATE estimates and better model fit than using normal distributions to model the
data. The advantage of using t distributions is more obvious when the sample size is small and the
proportion of outliers is large.

5. Discussion

In causal inference research, the issue of treatment endogeneity is commonly addressed
in the 2SLS model with IVs, where the LATE is the causal effect of interest. Because practical
data often violate the normality assumption, using normal distributions to model heavy-
tailed data or data containing outliers may result in inefficient or even biased LATE estimates.
In the 2SLS model with IVs, data at either stage may equally well contain outliers or be
normally distributed. To address this problem, this study proposes four possible types of
Bayesian two-stage robust causal models with IVs and evaluates the performance
of the robust method using Student's t distributions in causal modeling. The
Monte Carlo simulation results show that modeling normal data with normal distributions,
and heavy-tailed data or data containing outliers with Student's t distributions,
gives good performance in terms of accuracy, efficiency, and model fit. When data are
normally distributed, methods using either normal or Student's t distributions perform
equally well, providing similar bias, SEs, and DICs. In the presence of outliers, the
nonnormal-s2 and nonnormal-both models, which take outliers into account by using
Student's t distributions in the second stage, outperform the models that use normal
distributions to model either all the data or the second-stage data, yielding smaller bias
and higher efficiency. In addition, the nonnormal-
s2 model and the nonnormal-both model have smaller DICs than the other two models,

suggesting evidence of better model fit. The nonnormal-s2 and nonnormal-both models are
especially preferred when sample size is small and the proportion of outliers is large as they
produce more accurate and efficient LATE estimates.
Note that fitting the nonnormal-both model to data may require longer Markov chains, as the
degrees of freedom of the t distributions at both stages need to be estimated. We also caution
against simply using Student's t distributions to model all data, as this approach is not always
numerically optimal and is computationally time-consuming [28]. Additionally,
Student's t distributions are sensitive to skewness, so some nonnormally distributed data
may not be well modeled by them. If data are highly skewed, alternative robust methods, such as
those based on skew-t distributions, may be considered [5].

Author details

Dingjing Shi and Xin Tong*

*Address all correspondence to: [email protected]


Department of Psychology, University of Virginia, Charlottesville, Virginia, USA

References

[1] Angrist JD, Imbens G. Two stage least squares estimation of average causal effects in
models with variable treatment intensity. Journal of the American Statistical Association.
1995;90:431-442
[2] Angrist JD, Imbens G, Rubin D. Identification of causal effects using instrumental vari-
ables. Journal of the American Statistical Association. 1996;91:444-455
[3] Angrist JD, Pischke J. Mastering Metrics: The Path from Cause to Effect. Princeton, NJ:
Princeton University Press; 2014
[4] Angrist JD, Pischke J. Mostly Harmless Econometrics: An Empiricist's Companion.
Princeton, NJ: Princeton University Press; 2008

[5] Azzalini A, Genton MG. Robust likelihood methods based on the skew-t and related
distributions. International Statistical Review. 2008;76:106-129

[6] Baiocchi M, Cheng J, Small D. Instrumental variable methods for causal inference. Statis-
tics in Medicine. 2014;33:2297-2340
[7] Casella G, George EI. Explaining the Gibbs sampler. The American Statistician. 1992;46:
167-174
[8] Currie J, Yelowitz A. Are public housing projects good for kids? Journal of Public Eco-
nomics. 2000;75:99-124

[9] Gerber AS, Green DP. Field Experiments: Design, Analysis and Interpretation. New York,
NY: W.W.Norton & Company; 2011
[10] Hampel FR, Ronchetti EM, Rousseeuw PJ, Stahel WA. Robust Statistics: The Approach
Based on Influence Functions. New York: John Wiley & Sons, Inc; 1986
[11] Huber PJ. Robust Statistics. New York: John Wiley & Sons, Inc; 1981

[12] Geman S, Geman D. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration
of images. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1984;6:721-741
[13] Imbens G, Angrist JD. Identification and estimation of local average treatment effects.
Econometrica. 1994;62:467-475
[14] Lange KL, Little RJ, Taylor JM. Robust statistical modeling using the t distribution.
Journal of the American Statistical Association. 1989;84:881-896
[15] Lee SY, Xia YM. Maximum likelihood methods in treating outliers and symmetrically heavy-
tailed distributions for nonlinear structural equations. Psychometrika. 2006;71:565-585
[16] Lee SY, Xia YM. A robust Bayesian approach for structural equation models with missing
data. Psychometrika. 2008;73:343-364

[17] Osborne JW. Notes on the use of data transformation. Practical Assessment, Research &
Evaluation. 2002;8(6)
[18] Osborne JW, Overbay A. The power of outliers (and why researchers should always check
for them). Practical Assessment, Research & Evaluation. 2004;9(6):1-12
[19] Pinheiro JC, Liu C, Wu Y. Efficient algorithms for robust estimation in linear mixed-
effects models using the multivariate t distribution. Journal of Computational and Graph-
ical Statistics. 2001;10:249-276
[20] Rachman-Moore D, Wolfe RG. Robust analysis of a nonlinear model for multilevel edu-
cational survey data. Journal of Educational Statistics. 1984;9:277-293
[21] Seltzer M, Novak J, Choi K, Lim N. Sensitivity analysis for hierarchical models employing
t level-1 assumptions. Journal of Educational and Behavioral Statistics. 2002;27:181-222
[22] Seltzer M, Choi K. Sensitivity analysis for hierarchical models: Downweighting and
identifying extreme cases using the t distribution. Multilevel Modeling: Methodological
Advances, Issues, and Applications. 2003;1:25-52

[23] Shi D, Tong X. Robust Bayesian estimation in causal two-stage least squares modeling
with instrumental variables. In: van der Ark LA, Culpepper S, Douglas JA, Wang W-
C, Wiberg M, editors. Quantitative Psychology Research. Springer: New York; 2017

[24] Shoham S. Robust clustering by deterministic agglomeration EM of mixtures of multivar-
iate t-distributions. Pattern Recognition. 2002;35:1127-1142

[25] Simmons JP, Nelson LD, Simonsohn U. False-positive psychology: Undisclosed flexibility
in data collection and analysis allows presenting anything as significant. Psychological
Science. 2011;22:1359-1366

[26] Song P, Zhang P, Qu A. Maximum likelihood inference in robust linear mixed-effects
models using multivariate t distribution. Statistica Sinica. 2007;17:929-943
[27] Spiegelhalter D, Best N, Carlin B, van der Linde A. Bayesian measures of model complex-
ity and fit (with discussion). Journal of the Royal Statistical Society, Series B. 2002;64:583-639
[28] Tong X, Zhang Z. Diagnostics of robust growth curve Modeling using Student's t distri-
bution. Multivariate Behavioral Research. 2012;47:493-518

[29] Wang J, Lu Z, Cohen AS. The sensitivity analysis of two-level hierarchical linear models
to outliers. Quantitative Psychology Research. New York: Springer; 2015. 307-320

[30] Wang H, Zhang Q, Luo B, Wei S. Robust mixture modelling using multivariate
t-distribution with missing information. Pattern Recognition Letters. 2004;25:701-710
[31] Yuan K-H, Bentler PM. Structural equation modeling with robust covariances. Sociolog-
ical Methodology. 1998;28:363-396
[32] Yuan K-H, Bentler PM. On normal theory based inference for multilevel models with
distributional violations. Psychometrika. 2002;67:539-561
[33] Yuan K-H, Zhang Z. Structural equation modeling diagnostics using R package semdiag
and EQS. Structural Equation Modeling. 2012;19:683-702

[34] Yuan K-H, Bentler PM, Chan W. Structural equation modeling with heavy tailed distri-
butions. Psychometrika. 2004;69:421-436

[35] Yuan K-H, Lambert PL, Fouladi RT. Mardia's multivariate kurtosis with missing data.
Multivariate Behavioral Research. 2004;39:413-437
[36] Zhang Z, Lai K, Lu Z, Tong X. Bayesian inference and application of robust growth curve
models using Student's t distribution. Structural Equation Modeling. 2013;20:47-78
[37] Zhong X, Yuan K-H. Weights. In: Salkind NJ, editors. Encyclopedia of Research Design.
Thousand Oaks, CA: Sage; 2010. pp. 1617-1620
[38] Zimmerman D. A note on the influence of outliers on parametric and nonparametric
tests. Journal of General Psychology. 1994;121:391-401
[39] Zimmerman D. Invalidation of parametric and nonparametric statistical tests by concur-
rent violation of two assumptions. Journal of Experimental Education. 1998;67:55-68

[40] Zu J, Yuan K-H. Local influence and robust procedures for mediation analysis. Multivar-
iate Behavioral Research. 2010;45:1-44
[41] R Development Core Team. R: A Language and Environment for Statistical Computing.
Vienna, Austria: R Foundation for Statistical Computing; 2011
[42] Thomas A, O’Hara B, Ligges U, Sturtz S. Making BUGS open. R News. 2006:6;12-17
DOI: 10.5772/intechopen.70230
Chapter 12

Bayesian Hypothesis Testing: An Alternative to Null Hypothesis Significance Testing (NHST)
in Psychology and Social Sciences

Alonso Ortega and Gorka Navarrete

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/intechopen.70230

Abstract
Since the mid-1950s, there has been a clear predominance of the Frequentist approach to
hypothesis testing, both in psychology and in social sciences. Despite its popularity in
the field of statistics, Bayesian inference is barely known and used in psychology.
Frequentist inference, and its null hypothesis significance testing (NHST), has been
hegemonic through most of the history of scientific psychology. However, the NHST
has not been exempt of criticisms. Therefore, the aim of this chapter is to introduce a
Bayesian approach to hypothesis testing that may represent a useful complement, or
even an alternative, to the current NHST. The advantages of this Bayesian approach over
Frequentist NHST will be presented, providing examples that support its use in psy-
chology and social sciences. Conclusions are outlined.

Keywords: Bayesian inference, Bayes factor, NHST, quantitative research

1. Introduction
“Scientific honesty then requires less than had been thought: it consists in
uttering only highly probable theories: or even in merely specifying, for
each scientific theory, the evidence, and the probability of the theory in
the light of this evidence”. Lakatos [1, p. 208].

The nature and role of experimentation in science found its origins in the rise of natural
sciences during the sixteenth and seventeenth centuries [2]. Since then, knowledge has meant that
theories must be corroborated either by the power of the intellect or by the evidence of the
senses [1]. However, until the mid-to-late 1800s, "psychological experiments had been performed,
but the science was not yet experimental" [3, p. 158]. It was not until 1875 that—either at

© 2017 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use,
distribution, and reproduction in any medium, provided the original work is properly cited.


Wundt's laboratory in Leipzig or James's laboratory at Harvard—experimental procedures
were introduced and contributed to the development of psychology as an independent
science [3]. For almost one and a half centuries, scientific research has mostly relied on empirical
findings to provide support for its hypotheses, models, or theories. From this point of view,
psychology and the social sciences must distance themselves from rhetorical speculation, desist
from unproven statements, and build their knowledge on the basis of empirical evidence [1, 4].
Almost a decade ago, Curran reemphasized that the aim of any empirical science is to pursue the
construction of a cumulative base of knowledge [5]. However, it has also been emphasized that
such cumulative knowledge—for a true psychological science—is not possible through the
current and widespread paradigm of hypothesis testing [5–9]. Over approximately the last two
decades, explicit claims have appeared in peer-reviewed articles, such as "Psychology
will be a much better science when we change the way we analyze data”[7], “We need statistical
thinking, not statistical rituals” [10], “Why most research findings are false” [11] or “Yes, psycholo-
gists must change the way they analyze their data…” [12]. Most critiques have been directed
toward the current—and still predominant—approach to hypothesis testing (i.e., NHST) and
its overreliance on p-values and significance levels [6, 11, 13], emphasizing its pervasive conse-
quences against the construction of a cumulative base of knowledge in psychological sci-
ence [8]. These warnings, however, seem not to have generated a noteworthy echo in the
scientific community, even though "it is evident that the current practice of focusing exclu-
sively on a … decision strategy of null hypothesis testing can actually impede scientific
progress" [14, p. 100]. Therefore, it seems reasonable to suggest that there is a need to make
considerable changes to how we usually carry out research, especially if the goal is to ensure
research integrity [6]. Regarding this matter, a frequently proposed alternative has been mov-
ing from the exclusive focus on p-values to incorporate other existing techniques such as
“power analysis” [15] and “meta-analysis” [16], or to report and interpret “effect sizes” and
“confidence intervals” [7]. However, in our view, a sounder alternative would be to move from
a Frequentist paradigm to a Bayesian approach, which allows us not only to provide evidence
against the null hypothesis but also in favor of it [17]. Furthermore, Bayesian analysis allows
us to compare two (or more) competing models in light of the existing data, and not only based
on "theoretical probability distributions," as in the Frequentist approach to hypothesis
testing [18].
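As a simple illustration of how a Bayesian comparison can quantify evidence for the null, consider a binomial experiment: the Bayes factor BF01 compares the likelihood of the data under H0: θ = 0.5 with their average likelihood under H1, here with a uniform prior on θ (the numbers are a made-up example, not data from the chapter):

```python
from math import comb

def bf01_binomial(k, n):
    """Bayes factor BF01 for H0: theta = 0.5 vs H1: theta ~ Uniform(0, 1),
    given k successes in n binomial trials."""
    p_h0 = comb(n, k) * 0.5 ** n  # likelihood under the point null
    # Under a uniform prior, the marginal likelihood of any k is 1/(n+1),
    # since integral of C(n,k) theta^k (1-theta)^(n-k) over [0,1] = 1/(n+1).
    p_h1 = 1 / (n + 1)
    return p_h0 / p_h1

# 10 heads in 20 tosses: the data support the 'fair coin' null.
print(round(bf01_binomial(10, 20), 2))  # ≈ 3.70
```

A BF01 of about 3.7 says the data are roughly 3.7 times more likely under the fair-coin null than under the alternative, a graded statement in favor of H0 that a non-significant p-value cannot provide.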
A Bayesian approach would offer some interesting possibilities for both individual psychology
researchers and the research endeavor in general. First, Bayesian analysis allows us to move
from a dichotomous way of reasoning about results (e.g., either an effect exists or it does not) to
a less artificial view that interprets results in terms of the magnitude of evidence (e.g., the data are
more likely under H0 than under Ha) and, therefore, allows us to better depict to what extent a
phenomenon may occur. Second, a Bayesian approach naturally allows us to directly test the
plausibility of both the null and the alternative hypothesis, which the current NHST paradigm
does not. In fact, when a researcher does not reach a desired p-value, it is often—falsely—
assumed that the effect "does not exist." As a consequence, the researcher's chances of getting
his or her results published decrease dramatically, which brings us to our third argument. As is
broadly known, most peer-reviewed scientific journals do not show much interest in
results that are not statistically significant. This common practice—or scientific standard
—sadly reinforces the idea of thinking in terms of relevant or irrelevant findings. In our view,

such standards do not promote scientific advance and quickly lead us to ignore some promis-
ing but "non-significant" findings that could be further explored, fed into meta-analyses, or
simply considered by other researchers in the field. Of course, systematically ignoring a portion of
the research undermines the primary goal of scientific inquiry, which is to collect evidence and
not merely to reject hypotheses. The facts and ideas exposed in this introductory section set forth
the necessity of reanalyzing the way in which scientific evidence has been conceived during the
NHST era.
The following sections will: (a) concisely address the NHST procedure, (b) introduce a Bayes-
ian framework to hypothesis testing, (c) provide an example that highlights the advantages of
a Bayesian approach over the current NHST in terms of the way in which scientific evidence is
quantified, and (d) briefly summarize and discuss the benefits of a Bayesian approach to
hypothesis testing.

2. Null hypothesis significance testing (NHST)


“Never use the unfortunate expression: accept the null hypothesis.” Wilkin-
son and the Task Force on Statistical Inference APA Board of Scientific
Affairs [19, p. 602].

The most influential methods behind modern null hypothesis significance testing (NHST) were
developed by Fisher, and by Neyman and Pearson, in the early and mid-1900s [20]. Since then,
the NHST has been broadly used to provide an association between empirical evidence and
models or theories [21]. In the traditional NHST procedure, two hypotheses are postulated: a null
hypothesis (i.e., H0) and a research hypothesis, also called alternative (i.e., Ha), which describe
two contrasting conceptions about some phenomenon [22]. When conducting an NHST,
researchers usually seek to reject the null hypothesis (H0) on the basis of a p-value. When the
observed p-value is lower than a predetermined significance level (i.e., alpha, usually
α = 0.05), the conclusion is that such a p-value constitutes supporting evidence favoring the
plausibility of the alternative hypothesis [23]. However, a more important feature of this
procedure, which remains unknown to most scientists, including psychology researchers, is
that the NHST constitutes an amalgamation of two irreconcilable schools of thought in modern
statistics: the Fisher test of significance and the Neyman and Pearson hypothesis test [24, 25]. In
this respect, Goodman stated that "it is not generally appreciated that the p-value, as conceived
by Fisher, is not compatible with the Neyman and Pearson hypothesis test in which it has become
embedded" [25, p. 485]. In this synthesized NHST, the Fisherian approach contributes the test of
significance of p-values obtained from the data, whereas the Neyman and Pearson method
incorporates the notion of error probabilities of the test (i.e., Type I and Type II).

2.1. Origins and rationale of NHST


First, in the early 1900s, Fisher [26, 27] developed a method that tested a single hypothesis (i.e.,
null or H0), which has been mainly referred to as a hypothesis of “no effect” between variables
(e.g., relationship, difference). The null hypothesis, as conceived by Fisher, has a known
distribution for the test statistic t. Thus, as the test statistic moves away from its expected value,
the null hypothesis becomes progressively less plausible; in other words, the observed result
appears less likely to have occurred by chance. If the result has a probability of occurrence
under H0 sufficiently lower than the significance level (i.e., a small p-value), then H0 should be
rejected. Otherwise, no conclusion can be reached. The question that logically arises is: what
p-value is sufficiently small to reject H0? Fisher addressed this question clearly when he stated
that the threshold should be determined by the context of the problem, and it was not until the
1950s that Fisher presented the first significance tables for establishing rejection thresholds [22].
However, Fisher [28] rejected the idea of establishing a conventional significance level and, in
its place, recommended reporting the exact p-value (e.g., p = 0.019, but not p < 0.05; see [10]).
Similarly, May et al. indicated that the choice of a significance level should depend on the
consequences of rejecting or failing to reject the null hypothesis [29]. Despite these
recommendations about threshold determination, most scientists from different research fields
adopted standard significance levels (i.e., α = 0.05
or α = 0.01), which have been used—or misused—regardless of the hypotheses being tested.
Later, in 1933, Neyman and Pearson proposed a procedure in which two explicitly stated rival
hypotheses are contrasted, one of which is still considered the "null" hypothesis, as in
the Fisher test [30]. Neyman and Pearson rejected Fisher's idea of only testing the null hypoth-
esis. In this scenario, there are now two hypotheses (i.e., the null and the alternative), and
based on the observed p-value, the researcher has to decide whether or not to reject the
null hypothesis. This decision rule confronts the researcher with the probability of committing two
kinds of errors: Type I and Type II. As defined by Neyman and Pearson, the Type I error is the
probability of falsely rejecting H0 when H0 is true [30]. Conversely, the probability of
failing to reject H0 when H0 is false is the Type II error. For the sake of simplicity, an analogy for
both kinds of errors can be found in the classic fairy tale "The boy who cried wolf!" When the
young shepherd, called Peter, shouted out "Help! The wolf is coming!", the villagers
believed the young boy's warning and quickly came to help him. However, when they found
out that it was all a joke, they got angry. Believing the boy's false alarm can be considered a
Type I error. Peter repeated the same joke a couple of times and, when the wolf actually
appeared, the villagers did not believe the young shepherd's desperate calls. This situation is
analogous to committing a Type II error [31].
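The long-run meaning of the Type I error rate can be seen in a small simulation: when H0 is true and we reject at α = 0.05, we should "cry wolf" in about 5% of experiments. A minimal sketch using a two-sided z-test with known variance (purely illustrative):

```python
import math
import random

random.seed(42)

def two_sided_p(z):
    # p-value of a two-sided z-test via the normal survival function:
    # p = 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    return math.erfc(abs(z) / math.sqrt(2))

alpha, n, n_experiments = 0.05, 30, 10_000
rejections = 0
for _ in range(n_experiments):
    # H0 is true: samples come from N(0, 1), so every rejection is a Type I error.
    sample = [random.gauss(0, 1) for _ in range(n)]
    z = (sum(sample) / n) / (1 / math.sqrt(n))
    if two_sided_p(z) < alpha:
        rejections += 1

type1_rate = rejections / n_experiments
print(round(type1_rate, 3))  # close to alpha = 0.05
```

Estimating the Type II error rate works the same way, except that the samples are drawn under a specific alternative (e.g., a nonzero mean) and one counts the failures to reject.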
Within this NHST framework, the Fisher’s p-value is then used to dichotomize effects into two
categories: significant and non-significant results [21]. Consequently, on one hand, obtaining
significant results led us to assume that the phenomenon under investigation can be consid-
ered as “existing” and, therefore, can be used as supporting evidence for a particular model or
theory. On the other hand, non-significant results are usually (and erroneously) considered as
“noise,” implying the nonexistence of an effect [21]. In this last case, there are no findings
that could be reported. From this view, the evidence in favor of a research finding is then solely
judged on the ability to reject H0 when a sufficiently low p-value is observed. This simple and
appealing decision rule may constitute a very seductive way of thinking about results, that is:
A phenomenon either exists or it does not. However, thinking in this fashion is fallacious, leads
to misinterpretations of results and findings, and, more importantly, “it can distract us from a
higher goal of scientific inquiry. That is, to determine if the results of a test have any practical
value or not” [32, p. 7].
Bayesian Hypothesis Testing: An Alternative to Null Hypothesis Significance Testing (NHST) in Psychology… 239
http://dx.doi.org/10.5772/intechopen.70230

2.2. NHST: Common misconceptions and criticisms

As previously stated, most problems and criticisms to the current NHST paradigm appear as a
result of the mismatch of these essentially incompatible statistical approaches [10, 33, 34]. In
this line, Nickerson stated that “A major concern expressed by critics is that such testing is
misunderstood by many of those who use it” [35, p. 241]. Some of these misconceptions are
common among researchers and are interpretative in nature. As a matter of fact, Badenes-
Ribera et al. recently reported the results of a survey conducted with 164 academic psychologists
who were questioned about the meaning of p-values [36]. Results confirmed previous findings
regarding the occurrence of wrongful interpretations of p-values. For instance, the false belief
that the p-value indicates the conditional probability of the null hypothesis given certain data
(i.e., p (H0|D)), instead of the probability of witnessing a given result, assuming that the null
hypothesis is true [37]. This wrong interpretation of a p-value is known as “the inverse proba-
bility” fallacy. Another common misconception regarding p-values is that they provide direct
information about the magnitude of an effect, that is, a p-value of 0.00001 represents evidence
of a bigger effect than a p-value of 0.01. This conclusion is wrong because the only way to
estimate the magnitude of an effect is to calculate the value of the effect size with the appro-
priate statistic and its confidence interval (e.g., Cohen’s d; see [38]). This erroneous interpreta-
tion of a p-value is known as “the effect size” fallacy. A comprehensive review of these and
other common misconceptions is out of the scope of this chapter, but several resources on these
topics are available for the interested readers (see [14, 35, 37–40]).

Likewise, the rationale under the NHST has been largely criticized. Most criticisms against
NHST are focused on the way in which data are (unsoundly) analyzed and interpreted, for
example:
a. NHST only provides evidence against the plausibility of H0, but does not provide
probabilistic evidence in favor of the plausibility of Ha.
b. NHST uses inference procedures based on hypothetical data distributions, instead of
being based on actual data.
c. NHST does not provide clear rules for stopping data collection; therefore, as sample
size increases, any H0 can eventually be rejected (see [9, 18]).
However, an issue that is of particular interest for this chapter is related to the use of p-values as
a way to quantify statistical evidence [13, 41]. As previously stated in this chapter, rejecting H0
does not provide evidence in favor of the plausibility of Ha, and all that can be concluded is
that H0 is unlikely [9]. Conversely, failing to reject H0 simply allows us to state that—given the
evidence at hand—one cannot make an assertion about the existence of some effect or phe-
nomenon [42]. Hence, rejecting H0 is not a valid indicator of the magnitude of evidence of a
result [43]. In Schmidt’s words: “… reliance on statistical significance testing in psychology and
the other social sciences has led to frequent serious errors in interpreting the meaning of data,
errors that have systematically retarded the growth of cumulative knowledge” [16, p. 120].
Despite the existence of scientific literature that highlights the weaknesses of NHST [9, 16, 21, 22,
39, 43–46], it is still considered the “sine qua non of the scientific method” [10, p. 199].
Moreover, NHST is arguably the most widely used method of data analysis in psychology
since the mid-1950s and still governs the interpretation of quantitative data in social science
240 Bayesian Inference

research [35, 47]. In Krueger’s words: “NHST is the researcher’s workhorse for making induc-
tive inferences” [45, p. 16]. An immediate matter of concern is that most scientific discover-
ies, in a wide range of research fields, are based on a procedure that still generates controversy
(see [12, 48–50]). Since the focus of research should be on what data tell us about the magni-
tude of effects, it seems necessary to shift from our reliance on NHST to more robust alterna-
tives [14]. Some recommended practices include estimates based on effect sizes, confidence
intervals, and meta-analysis [6]. However, a sounder alternative comes from the Bayesian
paradigm through the use of a simple estimate of the magnitude of evidence called Bayes
factor (BF) [17]. This approach to hypothesis testing has shown several benefits. First, it is not
oriented to pursue the rejection of H0; on the contrary, it provides a way to obtain evidence for
and against H0. Second, it does not use arbitrary thresholds (i.e., significance levels) to reach
dichotomous decisions about the plausibility or implausibility of H0; on the contrary, it directly
contrasts the magnitude of evidence for and against both H0 and Ha. Third, it permits the
continuous update of evidence as long as new data are available, which is in line with the
nature of scientific inquiry. Bayesian methods have been largely suggested as a practical
alternative to NHST [9, 17, 23, 51], but—until now—they have not received enough attention
from researchers in psychology and social sciences.

3. Bayesian hypothesis testing: An alternative to NHST


“(…) prior and posterior are relative terms, referring to the data. Today’s
posterior is tomorrow’s prior.” Lindley [52, p. 301].

In the field of statistics, probabilities can be interpreted under two predominant paradigms:
Frequentist inference and Bayesian inference. The former makes predictions about experi-
ments whose outcomes depend basically upon random processes [53]. The latter assigns
probabilities to any statement, even when a random process is not involved [54]. In a Bayesian
framework, a probability is a way to embody an individual’s degree of belief in a statement.
Since the mid-1950s, there has been a clear predominance of the Frequentist approach to
hypothesis testing, both in psychology and social sciences. The hegemony of Frequentist
inference and its null hypothesis significance testing (NHST) might be partially attributed to
the massive incorporation of such approaches in psychology undergraduate programs [9] and
also to the fact that the Neyman and Pearson approach had the most well-developed compu-
tational software to conduct statistical inference [18]. However, the current scenario has dras-
tically changed, and the development of sampling techniques like Markov-Chain Monte Carlo
(MCMC; see [55, 56]) along with the availability and improvement of specifically developed
software (e.g., WinBUGS, see [57, 58]; JAGS, see [59, 60]; JASP, see [61]) makes exact Bayesian
inferences possible even in very complex models. As a result, “Bayesian applications have
found their way into most social science fields” [22, p. 665], and psychologists can now easily
implement Bayesian analysis for many common experimental situations (see for example JASP
Statistics: https://jasp-stats.org/).

3.1. Bayes in a nutshell

In Bayesian inference, our degrees of belief about a set of hypotheses are quantified by proba-
bility distributions over those hypotheses [47, 62], which makes the Bayesian approach funda-
mentally different from the Frequentist approach, which relies on sampling distributions of
data [47]. A Bayesian analysis usually involves the updating of prior knowledge or informa-
tion in light of newly available experimental data [63]. The latter clearly reflects the aim of any
empirical science, which is to strive for the elaboration of a cumulative base of knowledge. Any
Bayesian analysis implies the combination of three sources of information as follows:
a. a model that specifies how latent parameters (e.g., θ) generate data (e.g., D);

b. prior information about those parameters (i.e., prior distribution); and


c. the observed data (i.e., likelihood).
The prior, denoted p(θ), represents our degree of uncertainty about the parameters included
in the model; equivalently, it may be read as our degree of knowledge about those same
parameters. The more informative the prior distribution, the lower our degree of uncertainty
about the parameters. The likelihood is
the conditional probability of observing the data under some latent parameter (i.e., p(D|θ)).
Following the Bayes theorem [64], the combination of these three elements produces an updated
knowledge about the model parameters after the data have been observed, which is also known
as the posterior distribution. The change from the prior to the posterior distribution reflects what
has been learned from the data (see Figure 1). Thus, within a Bayesian framework, a researcher
can invest more effort in the specification of prior distributions by translating existing knowl-
edge about the phenomenon under study into prior distributions [65]. As suggested by Lee and
Wagenmakers “such knowledge may be obtained by eliciting prior beliefs from experts, or by
consulting the literature for earlier work on similar problems” [65, p. 110].
As shown in Figure 1, the strength of each source of information is indicated by the narrow-
ness of its curve. A narrower curve is more informative about the value of parameters, whereas
a wider one is less informative.
Bayes’ rule specifies how the prior information p(θ) and the likelihood p(D|θ) are combined to
arrive at the posterior distribution denoted by p(θ |D), in Eq. (1):

p(θ|D) = p(D|θ) p(θ) / p(D)    (1)

Eq. (1) is usually paraphrased as:

p(θ|D) ∝ p(D|θ) p(θ)    (2)

which means, “the posterior is proportional (i.e., ∝) to the likelihood times the prior.”

Figure 1. Prior, likelihood and posterior probability distributions.

In other words, the observed data (i.e., likelihood) increase our previous degree of knowledge (i.e.,
prior) in a proportional way to its informative strength, producing a new state of knowledge
about the parameters of the model (i.e., posterior). One of the benefits of the Bayesian
approach is that the prior (i.e., p(θ)), our present knowledge about the model parameters,
moderates the influence provided by the data (i.e., p(D|θ)). This compromise leads to less
pessimism when data are unexpectedly bad and less optimism when they are unexpectedly
good [66]. Both influences are beneficial and help us to make more realistic inferences and take
better decisions. For more detailed information on Bayesian inference, see, for instance,
O’Hagan and Forster [54], Kruschke [59], and Jackman [67].
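The prior-to-posterior update in Eqs. (1) and (2) is easy to see in a conjugate beta-binomial example. The following minimal Python sketch (all numbers hypothetical, chosen purely for illustration) updates a Beta prior on a proportion with binomial data; in the spirit of Lindley's quote, the resulting posterior can serve as the prior for the next batch of data:

```python
# Conjugate beta-binomial illustration of Bayes' rule:
# posterior ∝ likelihood × prior (Eq. (2)).
# All numbers are hypothetical and chosen only for illustration.

def update_beta(a_prior, b_prior, successes, n):
    """Return the Beta posterior parameters after observing
    `successes` out of `n` binomial trials."""
    return a_prior + successes, b_prior + (n - successes)

# A mildly informative prior centred on 0.5: Beta(2, 2) ...
a0, b0 = 2.0, 2.0
# ... updated with 7 successes in 10 trials (the likelihood).
a1, b1 = update_beta(a0, b0, successes=7, n=10)

prior_mean = a0 / (a0 + b0)      # 0.5
posterior_mean = a1 / (a1 + b1)  # 9/14, about 0.64

# "Today's posterior is tomorrow's prior": Beta(9, 5) would be
# the prior for the next experiment.
print(f"prior mean = {prior_mean:.2f}, posterior mean = {posterior_mean:.2f}")
```

The narrowing from Beta(2, 2) to Beta(9, 5) mirrors Figure 1: the posterior curve is narrower, and hence more informative, than the prior.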

3.2. Bayes factor

Bayesian approaches for hypothesis testing are comparative in nature. Different models often
represent competing theories or hypotheses, and the focus of interest is on which one is more
plausible and better supported by the data [65]. Therefore, the Bayesian approach allows to
quantify the plausibility of a given model or hypothesis (i.e., H0) against that of an alternative
model (i.e., Ha). For any comparison of two competing models or hypotheses (e.g., Ha vs. H0),
we can rely on an estimate of evidence known as the Bayes factor [52]. One of the attractive
features of the Bayes factor is that it follows the principle of parsimony: When two models fit
the data equally well, the Bayes factor prefers the simple model over the more complex
one [68]. Nonetheless, in contrast to the NHST approach, “Bayesian statistics assigns no special

status to the null hypothesis, which means that Bayes factors can be used to quantify evidence
for the null hypothesis just as for any other hypothesis” [65, p. 108].
Before observing the data, the prior odds of Ha over, e.g., H0, are p(Ha)/p(H0), and after having
observed the data we have the posterior odds p(Ha|D)/p(H0|D). Therefore, the ratio of the
posterior odds and the prior odds is defined as the Bayes factor:
BF(Ha, H0) = p(D|Ha) / p(D|H0) = [p(Ha|D) / p(H0|D)] / [p(Ha) / p(H0)] = posterior odds / prior odds    (3)

Eq. (3) shows the Bayes factor for given data D and two competing hypotheses (i.e., H0 vs. Ha),
which is a measure of the evidence for Ha against H0 provided by the data. In other words, the
Bayes factor is the probability of the data under one hypothesis relative to the other. For instance,
a BF(Ha, H0) = 3 indicates that Ha is three times more plausible relative to H0 than it was a priori.
From this view, the Bayes factor may be considered as analogous to the Frequentist likelihood
ratio. Nevertheless, in the Bayesian context there is no reference at all to theoretical probability
distributions as it is customary in a Frequentist approach. In a Bayesian framework, all inferences
are made conditional on the observed data, and therefore, the Bayes factor has to be interpreted
as a summary measure of the information provided by the data about the relative plausibility of
two models or hypotheses (e.g., Ha vs. H0). Jeffreys [52] suggests the following scale for
interpreting the Bayes factor (Table 1), although some people argue against the use of thresholds,
lest we fall into a different version of the old p < 0.05 ritual (see, for instance, [69]).

Bayes factor     Interpretation
> 100            Extreme evidence for Ha
30 – 100         Very strong evidence for Ha
10 – 30          Strong evidence for Ha
3 – 10           Moderate evidence for Ha
1 – 3            Anecdotal evidence for Ha
1                No evidence
1/3 – 1          Anecdotal evidence for H0
1/10 – 1/3       Moderate evidence for H0
1/30 – 1/10      Strong evidence for H0
1/100 – 1/30     Very strong evidence for H0
< 1/100          Extreme evidence for H0

Adapted from Jeffreys [52, p. 433], and Lee and Wagenmakers [65, p. 105].

Table 1. Evidence categories for the Bayes factor.
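The categories in Table 1 can be applied mechanically once a Bayes factor is in hand. The sketch below (Python; the marginal likelihood values are hypothetical) computes BF(Ha, H0) as the ratio p(D|Ha)/p(D|H0) from Eq. (3) and maps it to Jeffreys' labels:

```python
def bayes_factor(marg_lik_ha, marg_lik_h0):
    """BF(Ha, H0) = p(D | Ha) / p(D | H0), per Eq. (3)."""
    return marg_lik_ha / marg_lik_h0

def jeffreys_label(bf):
    """Map a Bayes factor to the evidence categories of Table 1."""
    bounds = [(100, "extreme"), (30, "very strong"), (10, "strong"),
              (3, "moderate"), (1, "anecdotal")]
    # BF < 1 favours H0; the scale is symmetric around 1.
    favoured, value = ("Ha", bf) if bf >= 1 else ("H0", 1 / bf)
    for threshold, label in bounds:
        if value > threshold:
            return f"{label} evidence for {favoured}"
    return "no evidence"

# Hypothetical marginal likelihoods, for illustration only:
bf = bayes_factor(0.018, 0.004)   # BF = 4.5
print(jeffreys_label(bf))         # moderate evidence for Ha
print(jeffreys_label(1 / bf))     # moderate evidence for H0
```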



4. Bayesian vs. Frequentist approaches to hypothesis testing: An example

Bayes factors to evaluate the amount of evidence in favor of or against H0 and Ha are one of the
big selling points of the Bayesian framework.1 As stated in the previous section, the core idea is
that the magnitude of evidence in favor of the null hypothesis compared to that of the
alternative hypothesis can be estimated (or vice-versa). As we have seen, this approach has
multiple advantages, such as departing from a hit-or-miss approach to results reporting, or
being able to show evidence in favor of the null. The possibility of providing evidence in favor
of both the null and the alternative hypotheses has some important advantages. One of them is
that it helps to overcome one of the most common issues behind the well-known file-drawer
effect, in that results do not suddenly become meaningless when the p-value is over a certain
threshold. Another advantage is that it gives us more freedom when establishing hypotheses,
particularly in topics where hypothesizing the absence of differences may be necessary for
theoretical advance.
In this section, an example from a field known as Bayesian reasoning will be presented, which
deals with how people update their beliefs when new evidence is available (e.g., when receiv-
ing a positive result in a medical test, how likely is it that I have the disease?). There is a
long-standing debate in the field about why people are unable to solve medical screening problems
such as the one shown in Table 2 when the information is shown in a standard probability
format (i.e., single-event probabilities; for instance, 1% have cancer), but have a comparatively
better time when the same information is shown in a standard frequency format (i.e., natural
frequencies; for instance, 10 in 1000 have cancer). As it is often the case, the debate about these
issues is very complex (for a review, see [71]), and the present example will focus on a single
unnuanced aspect with the goal of showing the usefulness of the Bayesian statistics paradigm.

Standard probability format


The probability of breast cancer is 1% for women at age 40 who participate in routine screening. If a woman has breast
cancer, the probability is 80% that she will get a positive mammography. If a woman does not have breast cancer, the
probability is 9.6% that she will also get a positive mammography.
A woman in this age group had a positive mammography in a routine screening. What is the probability that she actually
has breast cancer? _____%
Standard frequency format

Ten out of every 1000 women at age 40 who participate in routine screening have breast cancer. Eight of every 10 women
with breast cancer will get a positive mammography. Ninety-five out of every 990 women without breast cancer will also
get a positive mammography.
Here is a new representative sample of women at age 40 who got a positive mammography in routine screening. How
many of these women do you expect to actually have breast cancer? ____out of____

Table 2. Standard probability and standard frequency format problems, as shown by Gigerenzer and Hoffrage [72].

1 However, we recommend that the interested reader review a recent paper by Lakens [70], which describes an approach to
test for equivalence within a Frequentist framework.

Some authors [73, 74] argue that the crucial factor explaining the differences between the two
versions is not the representation format (i.e., probabilities or natural frequencies), but the
reference class or, more specifically, the computational complexity caused by the reference
class of the problems [75]. In brief, the probability version has a relative reference class: all
the numbers refer to the group above them (e.g., 80% of the 1% who have breast cancer
will get a positive mammography). To solve the problem, we need to use the base-rates (in this
example, percentage of women with and without breast cancer; 1 and 99%), and the percent-
age of women who got a positive mammography amongst those two groups (e.g., 80 and 9.6%;
see Eq. (4)). In the frequency version, the reference class is absolute and all numbers can be
seen as referring to the 1000 women, so we can ignore the base-rates and directly use the positive
mammographies for women with and without cancer (8 and 95; see Eq. (5)). The above-
mentioned authors hypothesized that when reference class and computational complexity are
taken into account, there is no difference between probabilities and natural frequencies. In
other words, they expect the null hypothesis to be true (Figure 2).

p(H|D) = (1% × 80%) / (1% × 80% + 99% × 9.6%) = 0.077    (4)

p(H|D) = 8 / (8 + 95) = 0.077    (5)
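Eqs. (4) and (5) can be verified directly. The sketch below reproduces both computations for the mammography problem of Table 2; writing the 9.6% false-positive rate as 95/990 makes explicit that the two formats give exactly the same posterior:

```python
# Probability format (Eq. (4)): the base-rates are needed.
base_rate = 0.01            # P(breast cancer)
sensitivity = 0.80          # P(positive | cancer)
false_pos_rate = 95 / 990   # P(positive | no cancer), about 9.6%

p_prob_format = (base_rate * sensitivity) / (
    base_rate * sensitivity + (1 - base_rate) * false_pos_rate)

# Frequency format (Eq. (5)): the base-rates drop out.
true_positives = 8    # of 10 women with cancer, per 1000 screened
false_positives = 95  # of 990 women without cancer, per 1000 screened
p_freq_format = true_positives / (true_positives + false_positives)

# Both equal 8/103 (≈ 0.077, as in Eqs. (4) and (5)).
print(f"{p_prob_format:.4f}  {p_freq_format:.4f}")
```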

Now, imagine two PhD students, a Frequentist (i.e., Student 1) and a Bayesian (i.e., Student 2).
After reading a critical but often ignored Fiedler’s paper [73], they had the idea that computa-
tional complexity class (and not representation format) is the key issue when trying to under-
stand how people solve Bayesian reasoning problems. They devise a very simple experiment
where two different groups of people will be asked to solve one Bayesian reasoning problem
that will be shown either in single-event probabilities or in natural frequencies. In both cases,
the arithmetic complexity (i.e., number of arithmetic steps required to solve the problem) will
be exactly 2. That is, to solve the problems, participants would need to do two arithmetic
operations, a sum and a division. They use a test with 100% sensitivity and 0% specificity,
which could not have any clinical application, but it is useful to get a few arithmetic steps out
of the probability format and check if computational complexity underlies Bayesian reasoning.
With this manipulation, the algorithms to solve the probability and frequency versions become
Eqs. (6) and (7), respectively. It is easy to see how both have now become roughly equivalent in
terms of arithmetic complexity.

Figure 2. Relative and absolute reference classes represented by the reference of the last row (test results). In the relative
reference class, the information about the test, for example, 80% positive (+) and 20% negative results (−), refers to the 1%
of women with BC, but not to 100% of the women (it is not 80% of the 100%!). However, in the absolute reference
class, the same information, 8+ and 2−, refers to the women with BC, but also to the 1000 women directly. This translates
into the need to use Eq. (4) for relative probabilities and Eq. (5) for absolute frequencies.

p(H|D) = (10% × 100%) / (10% × 100% + 90% × 100%) = 10% / (10% + 90%) = 0.1    (6)

p(H|D) = 10 / (10 + 90) = 0.1    (7)

As it can be deduced, Student 1 would have a Fisherian approach to statistics and Student 2 a
Bayesian approach. Both run an experiment with a total of 62 participants (31 per group),2 and
have the following results:

Contingency table (columns: representation format)

Accuracy    Natural frequencies    Probabilities    Total
0           23                     24               47
1           8                      7                15
Total       31                     31               62

4.1. PhD Student 1—Frequentist


Student 1, as most good NHST practitioners would do, conducts a Chi-square test and
reports that he did not obtain a significant effect of representation format when arithmetic
steps were equal (χ2 = 0.088, p = 0.767). He is happy, because this is congruent with his
hypothesis. He then writes a brief report detailing his idea and experimental results and sends
the manuscript draft to his advisor. A few days later, he receives his advisor’s feedback, telling
him that his non-significant results could be due to a number of reasons and, as a conse-
quence, the non-significant results are hard to interpret.

Chi-square tests

        Value    df    p
χ2      0.088    1     0.767
N       62
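Student 1's result is easy to reproduce. A minimal pure-Python check (no continuity correction, which is what matches the reported values) computes Pearson's chi-square for the 2 × 2 table and its p-value for one degree of freedom:

```python
import math

# Accuracy (rows: 0 = incorrect, 1 = correct) by representation
# format (columns: natural frequencies, probabilities).
observed = [[23, 24],
            [8, 7]]

row_totals = [sum(row) for row in observed]        # [47, 15]
col_totals = [sum(col) for col in zip(*observed)]  # [31, 31]
n = sum(row_totals)                                # 62

# Pearson's chi-square: sum over cells of (O - E)^2 / E,
# with E = (row total × column total) / N.
chi2 = sum((observed[i][j] - row_totals[i] * col_totals[j] / n) ** 2
           / (row_totals[i] * col_totals[j] / n)
           for i in range(2) for j in range(2))

# For df = 1, P(X > chi2) = erfc(sqrt(chi2 / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}")  # chi2 = 0.088, p = 0.767
```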

2 Of course, the sample size and manipulation for this experiment are more congruent with a pilot experiment than a real
one that could be sent to a journal on its own. As a side note, take into account that one of the advantages of the Bayesian
framework some authors propose is a sequential sampling rule, where sampling stops when the evidence (BF) is over a
predetermined threshold (e.g., BF10 >10 | <0.1), see Lindley [76].

His advisor suggests carrying out a few more experiments using variations of the task and
decent sample-sizes, to be able to perform a meta-analysis that could convince the editorial
board of a journal that their endeavor is noteworthy, as they would probably have a hard time
publishing those non-significant results by themselves.

4.2. PhD Student 2—Bayesian


Student 2, instead of performing a Chi-square test, prefers to use a well-known analysis among
Bayesian statisticians called the Bayes factor (BF; see [17, 65]). He uses a simple-to-use
software package called JASP [61], which incorporates Bayesian contingency tables and outputs BF
results in ready-to-use APA-formatted tables. He finds that when arithmetic steps are equal,
there is a BF01 of 4.656, that is, there is roughly 4.7 times more evidence in favor of the null
hypothesis than the alternative hypothesis. Together with his advisor, they send the manuscript to a journal,
pushing for the relative importance of arithmetic complexity over representation format. In
practical terms, it is more likely that the editor will be willing to publish this interesting result,
although the amount of evidence in favor of the null would be considered moderate by some
standards (see [53]).

Bayesian contingency tables tests

                                    Value
BF0+ independent multinomial        4.656
N                                   62

Note: For all tests, the alternative hypothesis specifies that group Natural-Frequencies is greater than group Probabilities.
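JASP's one-sided independent-multinomial BF0+ of 4.656 is laborious to reproduce by hand, but a closely related quantity is not. The sketch below is an illustrative stand-in, not the statistic JASP computes: it obtains a two-sided BF01 for the same counts, assuming independent binomials with uniform Beta(1, 1) priors. It yields a somewhat smaller value but points in the same direction, evidence for the null:

```python
from math import lgamma, exp

def log_beta(a, b):
    """Natural log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def bf01_two_proportions(y1, n1, y2, n2):
    """BF01 for H0: theta1 = theta2 vs. H1: independent thetas,
    with uniform Beta(1, 1) priors on every proportion.
    (Binomial coefficients cancel in the ratio and are omitted.)"""
    # Marginal likelihood under H0: one common theta.
    log_m0 = log_beta(y1 + y2 + 1, (n1 - y1) + (n2 - y2) + 1)
    # Marginal likelihood under H1: a separate theta per group.
    log_m1 = (log_beta(y1 + 1, n1 - y1 + 1)
              + log_beta(y2 + 1, n2 - y2 + 1))
    return exp(log_m0 - log_m1)

# 8/31 correct answers with natural frequencies, 7/31 with probabilities.
bf01 = bf01_two_proportions(8, 31, 7, 31)
print(f"BF01 = {bf01:.2f}")  # roughly 3.6: moderate evidence for the null
```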

As the evidence for the null effect is not very strong, they would need to run a few more
studies with variations to replicate the finding and show, using BF, how much more evidence
there is for the null hypothesis compared to the alternative hypothesis. Alternatively, they
could increase the sample size in their experiment until the stopping rule threshold (e.g., BF10
< 0.1) is reached.
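The sequential stopping rule just mentioned (see also footnote 2) can be sketched with a toy one-proportion example: H0: theta = 0.5 against H1: theta ~ Uniform(0, 1), stopping as soon as BF10 leaves the interval [1/10, 10]. Everything here, the data stream and the threshold alike, is hypothetical and serves only to illustrate the mechanics:

```python
from math import lgamma, exp, log

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def bf10_binomial(successes, n):
    """BF10 for H1: theta ~ Uniform(0, 1) vs. H0: theta = 0.5,
    after `successes` out of `n` Bernoulli trials."""
    # Marginal likelihood under H1 is the integral of
    # theta^s (1 - theta)^(n - s), i.e. B(s + 1, n - s + 1).
    log_m1 = log_beta(successes + 1, n - successes + 1)
    log_m0 = n * log(0.5)
    return exp(log_m1 - log_m0)

def sequential_test(data, threshold=10.0):
    """Accumulate observations until BF10 > threshold (stop for H1)
    or BF10 < 1/threshold (stop for H0)."""
    successes = 0
    for n, x in enumerate(data, start=1):
        successes += x
        bf = bf10_binomial(successes, n)
        if bf > threshold or bf < 1 / threshold:
            return n, bf
    return len(data), bf  # threshold never reached

# A hypothetical, strongly biased stream of binary observations:
stream = [1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1]
n_used, bf = sequential_test(stream)
print(f"stopped after {n_used} observations with BF10 = {bf:.1f}")
# stopped after 11 observations with BF10 = 15.5
```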

This example was aimed at describing (in a very simplified manner) one of the practical advan-
tages of the Bayesian framework, that is, being able to present the amount of evidence for and
against both the null and alternative-hypotheses. This, combined with the incremental nature
of the Bayesian inference process, allows us to move further from the hit-or-miss approach
generally reinforced by the NHST framework, in which significant results are seen as more
valuable than non-significant ones.

5. Conclusion

During the past 70 years, the NHST has dominated the way in which knowledge is produced
and interpreted and still governs the way in which researchers analyze their data, reach

conclusions, and report results [10, 45]. This approach has been largely criticized [9, 16, 21, 22,
39, 43–46], and “a major concern expressed by critics is that such testing is misunderstood by
many of those who use it” [35, p. 241]. Some authors [9, 13] emphasized that one of the most
pervasive influences of the NHST approach has been its overreliance on p-values, and in
particular, in the way that p-values have been interpreted (see, for instance [35, 36, 77]). One of
the most common misinterpretations of p-values has been to consider a p-value as a valid
indicator of the magnitude of evidence of a result (i.e., effect size fallacy). Regarding this point,
Cohen emphasized that the only way to estimate the magnitude of an effect is to calculate the
value of the effect size with the appropriate statistic and its confidence interval [38]. The correct
way to interpret p-values is twofold. On the one hand, rejecting H0 only allows us to conclude that H0
is unlikely. On the other hand, failing to reject H0 simply allows us to state that—given the
evidence at hand—one cannot make an assertion about the existence of some effect or phenom-
enon [42]. An immediate consequence of the wrong way in which a big number of researchers
interpret p-values is that null results have been usually considered as the absence of evidence of
the existence of an effect. This perspective regarding the decisions made when a given p-value
threshold is not reached (i.e., p < 0.05) does not promote scientific advance and quickly leads us to
a systematic bias toward ignoring promising but “non-significant” findings that may be further
explored, fed into meta-analyses, or just be considered by other researchers in the field. This fact
runs against the pursuit of any empirical science and may be harmful to the construction of a
cumulative base of knowledge [5].
As a way to provide a complementary (or alternative) method to deal with the current NHST
practice, we described here a Bayesian approach to hypothesis testing. A Bayesian approach
allows us to think about phenomena in terms of the magnitude of evidence that supports the
existence of an effect, instead of a dichotomous and artificial way of thinking in which an effect
either exists or does not exist [21]. As described in previous sections, a Bayesian approach
provides us a measure of evidence for and against both the null and the alternative hypotheses
(i.e., Bayes factor, BF; see [17]). The use of Bayes factors helps to overcome one of the most
common issues behind the well-known file-drawer effect, reducing the existing bias through
which results suddenly become meaningless when the p-value is over a certain threshold (e.g., p
> 0.05). A straightforward feature of this approach is that “Bayesian statistics assigns no
special status to the null hypothesis, which means that Bayes factors can be used to quantify
evidence for the null hypothesis just as for any other hypothesis” [65, p. 108]. Therefore, a
Bayesian approach gives us more freedom when establishing hypotheses, for example, in topics
where hypothesizing the absence of differences may be necessary for theoretical advance.
However, a major problem with Bayesian statistics has historically been that they required
complex and intricate mathematical calculations that were analytically intractable, at least
without the appropriate techniques and specialized software. This scenario changed
dramatically during the 1990s with the development of sampling techniques like Markov-
Chain Monte Carlo (MCMC; see [55]) along with the availability and improvement of specifi-
cally developed software (e.g., WinBUGS, see [57, 58]; JAGS, see [59, 60]) that makes exact
Bayesian inferences possible even in very complex models. Nowadays, the relatively recent
implementation and availability of Bayesian analysis in “easy-to-use” and open software such
as JASP [61], R packages such as BayesFactor [78], or more specialized ones like WinBUGS,

JAGS, or Stan (http://mc-stan.org/) makes Bayesian statistics more accessible to all researchers,
academics and students. This widespread availability, paired with the advantages of the
Bayesian approach described in this chapter, and several times elsewhere [79–82], should help
establish the Bayesian paradigm as a viable and popular alternative to NHST.
Despite all the important advantages of the Bayesian paradigm, as always, there is potential for
misuse. As pointed out by Morey, the interpretation of the Bayes factor is very natural (i.e., as the
amount of evidence in favor of one hypothesis in comparison to another), and does not need
specific decision thresholds, as is the case with p-values [83].
could help to communicate BF results have been proposed (see [53]) and may be helpful to
people that are not familiar with them. Nonetheless, the introduction of these labels also
creates an opportunity for misuse, as they could be misinterpreted as decision boundaries. It
is very important to be aware of this fact, and be careful when using them, to avoid making
“BF > 3” the new “p < 0.05.”

To sum up, the main goal of this chapter has been to increase the degree of awareness
regarding the limitations of the NHST approach and highlight the advantages of the Bayesian
approach. We expect that the inclusion of an easy-to-understand example of a specific case
where the Bayesian paradigm shows its practical utility may offer readers new to this
matter a glimpse of the usefulness of this alternative to the way in which they can analyze and
interpret their data. As a final remark, we would like to point out an often-heard recommendation
for people interested in starting to use BF, which is to introduce them alongside p-values and
effect size measures, to ease the transition to the new paradigm and make them comprehensible
to people not yet familiar with them.

Author details

Alonso Ortega1* and Gorka Navarrete2


*Address all correspondence to: [email protected]

1 School of Psychology, Universidad Adolfo Ibáñez, Chile


2 Center for Social and Cognitive Neuroscience (CSCN), School of Psychology, Universidad
Adolfo Ibáñez, Chile

References

[1] Lakatos I. Falsification and the methodology of scientific research programmes. In:
Harding S, editor. Can Theories be Refuted? Dordrecht: Holland: D. Reidel Publishing
Company; 1976. pp. 205-259

[2] Radder H. Toward a more developed philosophy of scientific experimentation. In:
Radder H, editor. The Philosophy of Scientific Experimentation. Pittsburgh: University
of Pittsburgh Press; 2003. pp. 1-18
250 Bayesian Inference

[3] Harper RS. The first psychological laboratory. Isis. 1950;41(2):158-161


[4] Popper KR. Degree of confirmation. The British Journal for the Philosophy of Science.
1954;5(18):143-149
[5] Curran PJ. The seemingly quixotic pursuit of a cumulative psychological science: Intro-
duction to the special issue. Psychological Methods. 2009;14(2):77-80

[6] Cumming G. The new statistics why and how. Psychological Science. 2013;25(1):7-29
[7] Loftus GR. Psychology will be a much better science when we change the way we
analyze data. Current Directions in Psychological Science. 1996;5(6):161-171
[8] Rossi JS. A case study in the failure of psychology as a cumulative science: The spontane-
ous recovery of verbal learning. In: Harlow L, Mulaik S, Steiger J, editors. What If There
Were No Significance Tests. Mahwah, NJ: Erlbaum Associates Publishers; 1997. pp. 175-197
[9] Wagenmakers E-J. A practical solution to the pervasive problems of p values.
Psychonomic Bulletin & Review. 2007;14(5):779-804
[10] Gigerenzer G. We need statistical thinking, not statistical rituals. Behavioral and Brain
Sciences. 1998;21(2):199-200

[11] Ioannidis JP. Why most published research findings are false. PLOS Medicine. 2005;2(8):
e124
[12] Wagenmakers EJ, Wetzels R, Borsboom D, Van Der Maas HL. Why psychologists must
change the way they analyze their data: The case of psi: Comment on Bem (2011). Journal
of Personality and Social Psychology. 2011;100(3):426-432
[13] Llobell JP, Dolores M, Navarro F, et al. Usos y abusos de la significación estadística:
propuestas de futuro (“Necesidad de nuevas normativas editoriales”). Metodologia de
las Ciencias del Comportamiento, 2004; Volumen Especial: 465-469
[14] Kirk RE. The importance of effect magnitude. In: Davis SF, editor. Handbook of Research
Methods in Experimental Psychology. Malden, MA: Blackwell Publishing; 2003. pp. 83-105
[15] Cohen J. A power primer. Psychological Bulletin. 1992;112(1):155-159
[16] Schmidt FL. Statistical significance testing and cumulative knowledge in psychology:
Implications for training of researchers. American Psychological Association. 1996;1(2):
115-129

[17] Kass RE, Raftery AE. Bayes factors. Journal of the American Statistical Association.
1995;90(430):773-795

[18] Dienes Z. Bayesian versus Orthodox statistics: Which side are you on? Perspectives on
Psychological Science. 2011;6(3):274-290
[19] Wilkinson L, Task Force on Statistical Inference APA Board of Scientific Affairs. Statistical
methods in psychology journals: Guidelines and explanations. American Psychologist.
1999;54:594-604
Bayesian Hypothesis Testing: An Alternative to Null Hypothesis Significance Testing (NHST) in Psychology… 251
http://dx.doi.org/10.5772/intechopen.70230

[20] Levine TR, Weber R, Hullett C, Park HS, Lindsey LLM. A critical assessment of null
hypothesis significance testing in quantitative communication research. Human Commu-
nication Research. 2008;34(2):71-187

[21] Dixon P. The p-value fallacy and how to avoid it. Canadian Journal of Experimental
Psychology/Revue Canadienne de Psychologie Experimentale. 2003;57(3):189-202

[22] Gill J. The insignificance of null hypothesis significance testing. Political Research Quar-
terly. 1999;52(3):647-674

[23] Rouder JN, Speckman PL, Sun D, Morey RD, Iverson G. Bayesian t tests for accepting and
rejecting the null hypothesis. Psychonomic Bulletin & Review. 2009;16(2):225-237

[24] Christensen R. Testing Fisher, Neyman, Pearson, and Bayes. The American Statistician.
2005;59(2):121-126

[25] Goodman SN. P values, hypothesis tests, and likelihood: Implications for epidemiology
of a neglected historical debate. American Journal of Epidemiology. 1993;137(5):485-496
[26] Fisher RA. Two new properties of mathematical likelihood. Proceedings of the Royal
Society of London. Series A, Containing Papers of a Mathematical and Physical Charac-
ter. 1934;144(852):285-307
[27] Fisher RA. Statistical Methods for Research Workers. Edinburgh: Genesis Publishing Pvt
Ltd; 1925
[28] Fisher RA. Statistical methods and scientific induction. Journal of the Royal Statistical
Society. Series B (Methodological). 1955;17:69-78
[29] May RB, Masson MJ, Hunter MA. Application of Statistics in Behavioral Research. NY:
Harper & Row; 1990

[30] Neyman J, Pearson ES. On the problem of the most efficient tests of statistical hypotheses.
Philosophical Transactions of the Royal Society of London. 1933;A231:289-337

[31] Singh VB. Don't Confuse Type I and Type II errors. 2015. Available from: https://www.
linkedin.com/pulse/dont-confuse-type-i-ii-errors-bhaskar-vijay-singh-frm?articleId=
6077308381431951360 [Accessed: June 21, 2017]

[32] Nix TW, Barnette JJ. The data analysis dilemma: Ban or abandon. A review of null
hypothesis significance testing. Research in the Schools. 1998;5(2):3-14

[33] Gigerenzer G. The superego, the ego, and the id in statistical reasoning. In: A Handbook
for Data Analysis in the Behavioral Sciences: Methodological Issues. Hillsdale, NJ:
L. Erlbaum Associates; 1993. pp. 311-339

[34] Sedlmeier P, Gigerenzer G. Teaching Bayesian reasoning in less than two hours. Journal
of Experimental Psychology General. 2001;130(3):380-400

[35] Nickerson RS. Null hypothesis significance testing: A review of an old and continuing
controversy. Psychological Methods. 2000;5(2):241-301

[36] Badenes-Ribera L, Frias-Navarro D, Iotti B, Bonilla-Campos A, Longobardi C. Misconceptions of the p-value among Chilean and Italian Academic Psychologists. Frontiers in
Psychology. 2016;7:1247
[37] Kline RB. Beyond Significance Testing, Statistics Reform in the Behavioral Sciences. 2nd
ed. Washington, DC: American Psychological Association; 2013
[38] Cohen J. The earth is round (p < .05). American Psychologist. 1994;49:997-1003

[39] Carver R. The case against statistical significance testing. Harvard Educational Review.
1978;48(3):378-399

[40] Rozeboom WW. The fallacy of the null-hypothesis significance test. Psychological Bulle-
tin. 1960;57(5):416
[41] Wetzels R, Raaijmakers JG, Jakab E, Wagenmakers E-J. How to quantify support for and
against the null hypothesis: A flexible WinBUGS implementation of a default Bayesian t
test. Psychonomic Bulletin & Review. 2009;16(4):752-760
[42] Cohen J. The statistical power of abnormal-social psychological research: A review. The
Journal of Abnormal and Social Psychology. 1962;65(3):145
[43] Shaver JP. What statistical significance testing is, and what it is not. The Journal of
Experimental Education. 1993;61(4):293-316
[44] Carver RP. The case against statistical significance testing, revisited. The Journal of Exper-
imental Education. 1993;61(4):287-292

[45] Krueger J. Null hypothesis significance testing: On the survival of a flawed method.
American Psychologist. 2001;56(1):16

[46] Meehl PE. Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress
of soft psychology. Journal of Consulting and Clinical Psychology. 1978;46(4):806-834
[47] Wetzels R, Matzke D, Lee MD, Rouder JN, Iverson GJ, Wagenmakers E-J. Statistical
evidence in experimental psychology an empirical comparison using 855 t tests. Perspec-
tives on Psychological Science. 2011;6(3):291-298
[48] Wagenmakers EJ, Wetzels R, Borsboom D, van der Maas H. Yes, Psychologists Must
Change the Way They Analyse Their Data: Clarifications for Bem, Utts, and Johnson
(2011). 2011. Available from: http://web.stanford.edu/class/psych201s/psych201s/papers/
ClarificationsForBemUttsJohnson.pdf [Accessed: July 26, 2017]
[49] Bem DJ. Feeling the future: Experimental evidence for anomalous retroactive influences
on cognition and affect. Journal of Personality and Social Psychology. 2011;100(3):407-425
[50] Bem DJ, Utts J, Johnson WO. Must psychologists change the way they analyze their data?
Journal of Personality and Social Psychology. 2011;101(4):716-719
[51] Bernardo JM. A Bayesian analysis of classical hypothesis testing. Trabajos de estadística y
de investigación operativa. 1980;31(1):605-647

[52] Lindley DV. The philosophy of statistics. Journal of the Royal Statistical Society: Series D
(The Statistician). 2000;49:293-337

[53] Jeffreys H. Theory of Probability. Oxford: Clarendon Press; 1961

[54] O'Hagan A, Forster JJ. Kendall's Advanced Theory of Statistics. Vol. 2B. Bayesian Infer-
ence. London: Arnold; 2004

[55] Gamerman D, Lopes HF. Markov Chain Monte Carlo: Stochastic Simulation for Bayesian
Inference. Boca Raton: CRC Press; 2006

[56] Gilks WR, Richardson S, Spiegelhalter DJ. Introducing Markov Chain Monte Carlo,
Markov Chain Monte Carlo in Practice. London: Chapman & Hall; 1996

[57] Lunn D, Spiegelhalter D, Thomas A, Best N. The BUGS project: Evolution, critique and
future directions. Statistics in Medicine. 2009;28(25):3049-3067

[58] Lunn DJ, Thomas A, Best N, Spiegelhalter D. WinBUGS-a Bayesian modelling frame-
work: Concepts, structure, and extensibility. Statistics and Computing. 2000;10(4):325-337
[59] Kruschke JK. Introduction: Credibility, Models, and Parameters, Doing Bayesian Data
Analysis: A Tutorial with R, JAGS, and Stan. Boston: Academic Press; 2015. pp. 15-30
[60] Plummer M. JAGS: A program for analysis of Bayesian graphical models using Gibbs
sampling. In: Proceedings of the 3rd International Workshop on Distributed Statistical
Computing. Vienna: TU Wien; 2003. p. 125
[61] Love J, Selker R, Marsman M, Jamil T, Dropmann D, Verhagen A, Wagenmakers E. JASP
(Version 0.7) [Computer Software]. Amsterdam, the Netherlands: JASP Project; 2015
[62] Griffiths TL, Tenenbaum JB, Kemp C. Bayesian inference. In: Holyoak K, Morrison R,
editors. The Oxford Handbook of Thinking and Reasoning. New York: Oxford University
Press; 2012. pp. 22-35
[63] Samaniego F. A Comparison of the Bayesian and Frequentist Approaches to Estimation.
New York: Springer; 2010

[64] Bayes T. An essay toward solving a problem in the doctrine of chances. Philosophical
Transactions of the Royal Society of London. 1763;53:370-418

[65] Lee MD, Wagenmakers E-J. Bayesian Cognitive Modeling: A Practical Course. Cam-
bridge, New York: Cambridge University Press; 2014
[66] Berger JO, Moreno E, Pericchi LR, Bayarri MJ, Bernardo JM, Cano JA, De la Horra J,
Martín J, Ríos-Insúa D, Betrò B. An overview of robust Bayesian analysis. Test. 1994;3(1):
5-124
[67] Jackman S. Bayesian Analysis for the Social Sciences. West Sussex: Wiley Chichester; 2009
[68] Myung IJ, Pitt MA. Applying Occam’s razor in modeling cognition: A Bayesian approach.
Psychonomic Bulletin & Review. 1997;4(1):79-95

[69] Bigler ED. Symptom validity testing, effort, and neuropsychological assessment. Journal
of the International Neuropsychological Society. 2012;18(04):632-640
[70] Lakens D. Equivalence tests: A practical primer for t-tests, correlations, and meta-ana-
lyses. Social Psychological and Personality Science. 2017;March 4:1-21
[71] Barbey AK, Sloman SA. Base-rate respect: From ecological rationality to dual processes.
Behavioral and Brain Sciences. 2007;30(03):241-254

[72] Gigerenzer G, Hoffrage U, Mellers BA, et al. How to improve Bayesian reasoning without
instruction: Frequency formats. Psychological Review. 1995;102:684-704

[73] Fiedler K, Brinkmann B, Betsch T, Wild B. A sampling approach to biases in conditional
probability judgments: Beyond base rate neglect and statistical format. Journal of Experimental Psychology: General. 2000;129(3):399-418

[74] Lesage E, Navarrete G, De Neys W. Evolutionary modules and Bayesian facilitation: The
role of general cognitive resources. Thinking & Reasoning. 2013;19(1):27-53
[75] Ayal S, Beyth-Marom R. The effects of mental steps and compatibility on Bayesian
reasoning. Judgment and Decision Making. 2014;9(3):226-242
[76] Lindley DV. Bayesian statistics: A review. Society for Industrial and Applied Mathemat-
ics; 1972
[77] Gliner JA, Leech NL, Morgan GA. Problems with null hypothesis significance testing
(NHST): What do the textbooks say? The Journal of Experimental Education. 2002;71(1):
83-92
[78] Morey RD, Rouder JN. Bayes Factor: Computation of Bayes Factors for Common Designs.
R package version 0.9.12-2. 2015. Available from: https://cran.r-project.org/package=
BayesFactor [Accessed: June 21, 2017]

[79] Berry DA. Bayesian clinical trials. Nature Reviews Drug Discovery. 2006;5(1):27-36
[80] Briggs AH. A Bayesian approach to stochastic cost-effectiveness analysis. Health Eco-
nomics. 1999;8(3):257-261
[81] Ortega A, Wagenmakers E-J, Lee MD, Markowitsch HJ, Piefke M. A Bayesian latent group
analysis for detecting poor effort in the assessment of malingering. Archives of Clinical
Neuropsychology. 2012;27(4):453-465
[82] Stegmueller D. How many countries for multilevel modeling? A comparison of Frequentist
and Bayesian approaches. American Journal of Political Science. 2013;57(3):748-761
[83] Morey RD. On verbal categories for the interpretation of Bayes factors. 2015. Available
from: http://bayesfactor.blogspot.cl/2015/01/on-verbal-categories-for-interpretation.html
[Accessed: June 21, 2017]
Section 3

Applications of Bayesian Inference in Engineering
Chapter 13

Bayesian Inference and Compressed Sensing

Solomon A. Tesfamicael and Faraz Barzideh

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/intechopen.70308

Abstract
This chapter presents the use of Bayesian inference in compressive sensing (CS), a
method in signal processing. Among the recovery methods used in the CS literature, the
convex relaxation methods are reformulated using the Bayesian framework, and
this method is applied in different CS applications such as magnetic resonance imaging
(MRI), remote sensing, and wireless communication systems, specifically multiple-
input multiple-output (MIMO) systems. The robustness of the Bayesian method in
incorporating prior information, such as sparsity and structure among the sparse entries,
is shown in this chapter.

Keywords: Bayesian inference, compressive sensing, sparse priors, clustered priors,
convex relaxation

1. Introduction

In order to estimate parameters in a signal, one can apply the wisdom of the two schools of
thought in statistics, the classical (also called frequentist) school and the Bayesian school. These
methods of computing are at times in competition with each other. The basic difference arises
from the definition of probability. The frequentist defines P(A) as the long-run relative
frequency with which A occurs in identical repeats of an experiment, whereas the Bayesian defines
P(A|B) as a real-number measure of the probability of a proposition A, given the truth of the
information represented by proposition B. Under Bayesian theory, probability is thus considered
an extension of logic [1, 2]. Probabilities represent the investigator's degree of belief, and hence
are subjective. This is not acceptable under classical theory, which makes the classical approach
inflexible. To add to the differences: under classical inference, parameters are not random, they are fixed,
and prior information is absent; under the Bayesian view, parameters are random variables, and
prior information is an integral part of the inference, for which the Bayesian makes no apology. Since one is free

© 2017 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use,
distribution, and reproduction in any medium, provided the original work is properly cited.

© The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use,
distribution, and eproduction in any medium, provided the original work is properly cited.

to invent new estimators, confidence intervals, or hypothesis tests, ad hockery exists, and hence
the frequentist approach lacks consistency, whereas Bayesian theory is flexible and consistent [1–9].
Therefore, Bayesian inference, applied to a special paradigm in signal
processing, is our main focus in this chapter.
After presenting the theoretical frameworks (Bayesian theory, CS, and convex relaxation
methods) in Section 2, we show in Section 3 the use of Bayesian inference in the CS problem by
considering two priors modeling sparsity and clusteredness. In Section 4, we present three
examples of applications that show the connection between the two theories, Bayesian inference
and compressive sensing. In Section 5, a brief conclusion is given.

2. Theoretical framework
2.1. Bayesian framework

For two random variables A and B, the product rule gives

P(A, B) = P(A|B) P(B) = P(B|A) P(A)   (1)

and the famous Bayes' theorem provides

P(B|A) = P(A|B) P(B) / P(A).   (2)

Using the same framework, consider a model M_j and a vector of parameters θ. We infer what the
model's parameters θ might be, given the data D and prior information I. Using Bayes'
theorem, the probability of the parameters θ given model M_j, data D, and information I is
given by
P(θ|D, M_j, I) = P(D|θ, M_j, I) P(θ|M_j, I) / P(D|M_j, I),   (3)

where P(θ|D, M_j, I) is the posterior probability, P(θ|M_j, I) is the non-data information about θ,
called the prior probability distribution function, and P(D|θ, M_j, I) is the density of the data
conditional on the parameters of the model, called the likelihood. P(D|M_j, I) is called the evidence
of model M_j, or the normalizing constant, given by:
P(D|M_j, I) = ∫_θ P(θ|M_j, I) P(D|θ, M_j, I) dθ.   (4)

P(θ|D, M_j, I) is the fundamental interest for the first level of inference, called model fitting. It is
the task of inferring what the model parameters might be given the model and the data.
Further, we can do inference at a higher level, which is comparing the models M_j: in the light of
prior information I and data D, we ask which member of a given set of models {M_1, M_2, ⋯, M_n} is most likely to be the

correct one. Now focusing on the first level of inference, we can ignore the normalizing
constant in (3) since it has no relevance at this level of inference about the parameters θ. Hence
we get:
P(θ|D, M_j, I) ∝ P(D|θ, M_j, I) P(θ|M_j, I).   (5)

The posterior probability is proportional to the prior probability times the likelihood. Eq. (5) is
called the updating rule [1, 3]: the data allow us to update our prior views about θ, and
as a result we get the posterior, which combines both the data and the non-data information about θ.
For a binomial trial, for example, taking a beta distribution as the prior yields a
posterior that is again a beta distribution. Figure 1 shows that the posterior density is
taller and narrower than the prior density. It therefore strongly favors a smaller range of θ
values, reflecting the fact that we now have more information. That is why inference based on
the posterior distribution is superior to inference based only on the likelihood.
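The beta-binomial update just described can be sketched numerically; the prior parameters and data counts below are purely illustrative:

```python
from scipy import stats

# Illustrative prior: Beta(2, 2), a mild belief that theta is near 0.5.
a, b = 2.0, 2.0

# Illustrative data: 7 successes in 10 binomial trials.
successes, failures = 7, 3

# Conjugacy: a beta prior combined with a binomial likelihood
# gives a beta posterior with updated counts.
a_post, b_post = a + successes, b + failures

prior = stats.beta(a, b)
posterior = stats.beta(a_post, b_post)

# The posterior is narrower than the prior (more information), and its
# mean is a compromise between the prior mean 0.5 and the sample
# proportion 0.7.
print(prior.std(), posterior.std(), posterior.mean())
```

Here the conjugate pair makes the update a closed-form bookkeeping of counts; with non-conjugate priors the same updating rule holds, but the posterior must be computed numerically.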

Now, we first find the maximum of the posterior distribution, called the maximum a posteriori
(MAP) estimate. It defines the most probable value for the parameters, denoted θ̂_MP. MAP is related to
Fisher's method of maximum likelihood estimation (MLE), θ̂_ML. If f is the sampling distribution
of D, then the likelihood function of D is θ ↦ f(D|θ) and the maximum likelihood estimate
of θ is

θ̂_ML(D) = arg max_θ f(D|θ).   (6)

But under Bayesian inference, let g be a prior distribution of θ; then the posterior distribution
of θ becomes

θ ↦ f(D|θ) g(θ) / f(D)   (7)

and the maximum a posteriori estimate of θ:

θ̂_MP = arg max_θ f(D|θ) g(θ) / ∫_ϑ f(D|ϑ) g(ϑ) dϑ
     = arg max_θ f(D|θ) g(θ).   (8)

Inference based on the posterior is not an easy task, since it involves multiple integrals, which
are at times cumbersome to solve. However, it can be computed in several ways: numerical
optimization (like the conjugate gradient method, Newton's method, …), modification of an
expectation-maximization algorithm, and others. As we can see from (6) and (8), the difference
between MLE and MAP is the prior distribution. The latter can be considered as a

Figure 1. The updating rule: the posterior synthesizes and compromises by favoring values between the maxima of the
prior density and the likelihood. The prior we had is challenged to shift by the arrival of a small amount of data.

regularization of the former. Here we can summarize the posterior distribution by the value of
the best-fit parameters θ_MP and error bars (confidence intervals) on the best-fit parameters.
Error bars can be found from the curvature of the posterior. To proceed further, we replace the
random variables D and θ by the vectors y and x, and we assume prior distributions on x in the next
section.
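As a small worked instance of (6) and (8), consider estimating a Gaussian mean θ from data D with a Gaussian prior on θ, where both the MLE and the MAP estimate have closed forms. All numerical values below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative model: n observations from N(theta_true, sigma^2),
# with a Gaussian prior theta ~ N(0, tau^2).
theta_true, sigma, n = 2.0, 1.0, 20
D = rng.normal(theta_true, sigma, size=n)

theta_ml = D.mean()  # MLE: arg max_theta f(D|theta)

def theta_map(tau):
    # Closed-form MAP for this conjugate Gaussian model: the posterior
    # mean, whose precision is n/sigma^2 + 1/tau^2.
    return (n / sigma**2 * D.mean()) / (n / sigma**2 + 1 / tau**2)

# The prior acts as a regularizer: the MAP estimate is shrunk toward the
# prior mean 0, and approaches the MLE as the prior flattens (tau large).
print(theta_ml, theta_map(0.5), theta_map(1e6))
```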

2.2. Compressive sensing


Compressive sensing (CS) is a paradigm to capture information at a lower rate than the Nyquist-
Shannon sampling rate when signals are sparse in some domain [10–13]. CS has recently
gained a lot of attention due to its exploitation of signal sparsity. Sparsity, an inherent
characteristic of many natural signals, enables a signal to be stored in a few samples and
subsequently be recovered accurately.
As a signal processing scheme, CS follows a similar fashion: encoding, transmission/storing,
and decoding. Focusing on the encoding and decoding of such a system with noisy measurements,
the block diagram is given in Figure 2. At the encoding side, CS combines the sampling
and compression stages of traditional signal processing into one step by measuring a few
samples that contain maximum information about the signal. This measurement/sampling is
done by linear projections using random sensing transformations, as shown in the landmark
papers by the authors mentioned above. Having said this, let us define the CS problem
formally as follows:

Figure 2. Block diagram for CS-based reconstruction.

Definition 1. (The standard CS problem)

Find the k-sparse signal vector x ∈ ℝ^N provided the measurement vector y ∈ ℝ^M, the measurement
matrix A ∈ ℝ^{M×N}, and the under-determined set of linear equations

y = Ax,   (9)

where k ≪ M ≪ N.
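A minimal numerical instance of Definition 1 can be set up as follows; the dimensions and the Gaussian choice of A are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative dimensions with k << M << N.
N, M, k = 200, 50, 5

# A k-sparse signal: k nonzero entries at random positions.
x = np.zeros(N)
support = rng.choice(N, size=k, replace=False)
x[support] = rng.normal(size=k)

# A random Gaussian sensing matrix, a common choice in the CS
# literature, scaled so that E[||Ax||^2] is close to ||x||^2.
A = rng.normal(size=(M, N)) / np.sqrt(M)

# Compressed measurements: M numbers carry the information of the
# N-dimensional (but k-sparse) signal.
y = A @ x
print(y.shape)  # (50,)
```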
One can ask two questions here in relation to the standard CS problem. First, how
should we design the matrix A to ensure that it preserves the information in the signal x?
Second, how can we recover the original signal x from the measurements y [14]? To address the
first question, the solution for the CS problem presented here depends on the design of A.
This matrix can be considered as a transformation of the signal from the signal space to the
measurement space, Figure 3 [15]. There have been different criteria that the matrix A should
satisfy to allow meaningful reconstruction. One of the main criteria is given in [11]. The authors
defined the sufficient condition that the matrix A should satisfy for the reconstruction of the signal x.
It is called the Restricted Isometry Property (RIP) and is defined below.

Figure 3. Transformation from the signal-space to the measurement-space.



Definition 2. (Restricted Isometry Property)

For all x ∈ ℝ^N such that ‖x‖₀ ≤ k, if there exists 0 ≤ δ_k < 1 such that

(1 − δ_k)‖x‖₂² ≤ ‖Ax‖₂² ≤ (1 + δ_k)‖x‖₂²   (10)

is satisfied, then A fulfills the RIP of order k with radius δ_k.


An equivalent description of the RIP is to say that all subsets of k columns taken from A are
nearly orthogonal (the columns of A cannot be exactly orthogonal, since we have more columns
than rows) [16]. For example, if a matrix A satisfies the RIP of order 2k, then we can interpret
(10) as saying that A approximately preserves the distance between any pair of k-sparse vectors.
For a random matrix A, the following theorem is one of the results in relation to the RIP for the
noiseless CS problem, provided that the entries of the random matrix A are drawn from some
distributions which are given later.
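Verifying (10) over all k-sparse supports is combinatorial, but the property can be probed by Monte Carlo; the following sketch (illustrative dimensions) records the spread of ‖Ax‖₂²/‖x‖₂² over random k-sparse vectors:

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, k = 200, 50, 5
A = rng.normal(size=(M, N)) / np.sqrt(M)

# Sample random k-sparse vectors and record ||Ax||^2 / ||x||^2; by (10),
# for a RIP matrix of order k these ratios lie in [1 - delta_k, 1 + delta_k].
ratios = []
for _ in range(2000):
    x = np.zeros(N)
    idx = rng.choice(N, size=k, replace=False)
    x[idx] = rng.normal(size=k)
    ratios.append(np.sum((A @ x) ** 2) / np.sum(x ** 2))

ratios = np.array(ratios)
delta_hat = max(1.0 - ratios.min(), ratios.max() - 1.0)  # empirical radius
print(ratios.mean(), delta_hat)
```

Note that this only lower-bounds the true δ_k over the sampled vectors; it does not certify the RIP, which would require checking every k-column subset.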
Theorem 1. (Perfect Recovery Condition, Candès and Tao [13])

If A satisfies the RIP of order 2k with radius δ_2k, then for any k-sparse signal x sensed by y = Ax, x is
with high probability perfectly recovered by the ideal program

x̂ = arg min_x ‖x‖₀  subject to  y = Ax   (11)

and it is unique, where ‖x‖₀ = k ≜ #{i ∈ {1, 2, ⋯, N} | x_i ≠ 0}.

This means that if A satisfies the RIP of order k with radius δ_k, then for any k′ < k, A satisfies the RIP
of order k′ with constant δ_k′ < δ_k [?]. Note that this theorem is stated for the noiseless CS
problem, and it is possible to extend it to the noisy CS system. The proofs of these theorems are
deferred to the cited literature [13] for the sake of space.
Under the conventional sensing paradigm, the dimensions of the original signal and of the
measurement should be at least equal. But in CS, the measurement vector can be far smaller than the
original. At the decoding side, reconstruction is done using nonlinear schemes. Consequently,
the reconstruction is more cumbersome than the encoding, which was only a projection
from a large space to a smaller one. Moreover, finding a unique solution that
satisfies the constraint that the signal itself is sparse, or sparse in some domain, is complex in
nature. Fortunately, there are many algorithms to solve the CS problem, such as greedy
iterative algorithms [17] and iterative thresholding algorithms [18]. This
chapter focuses merely on the convex relaxation methods [12, 13]. The regularizing terms in
these methods can be reinterpreted as prior information under Bayesian inference. We consider
a noisy measurement and apply convex relaxation algorithms for robust reconstruction.

2.3. Convex relaxation methods for CS


Various methods for estimating x may be used. We have the least squares (LS) estimator, in
which no prior information is applied:

x̂ = (Aᵀ A)⁻¹ Aᵀ y,   (12)

which performs very badly for the CS estimation problem we are considering. In order to
introduce the methods called convex relaxation, let us define an important concept first.

Definition 3. (Unit Ball)

A unit ball in the l_p-space of dimension N can be defined as

B_p ≜ {x ∈ ℝ^N : ‖x‖_p ≤ 1}.   (13)

The unit balls corresponding to p = 0, p = 1/2, p = 1, p = 2, and p = ∞ for N = 2 are shown in Figure 4.

The exact solution for the noiseless CS problem is given by

min_x ‖x‖₀, such that y = Ax.   (14)

However, minimizing the l0-norm is a non-convex optimization problem, which is NP-hard [19].
By relaxing the objective function to a convex one, it is possible to get a good approximation. That
is, by replacing the l0-norm with the l1-norm, one obtains a problem which is tractable. Note that it
is also possible to use other lp-norms to relax the condition given by l0. However, keeping our
focus on the l1-norm, consider the following minimization problem instead of (14):

min_x ‖x‖₁, such that y = Ax   (15)

The solution of the relaxed problem (15) is the same as that of (14); this equivalence was
provided by Donoho and Huo in [20].

Figure 4. Different lp-balls in different lp-spaces for N = 2; only the balls with p ≥ 1 are convex.

Theorem 2. (l0–l1 Equivalence [13])

If A satisfies the RIP of order 2k with radius δ_2k < √2 − 1, then

x̂ = arg min_x ‖x‖₁  subject to  y = Ax   (16)

is equivalent to (11) and will find the same unique x̂.

Justified by this theorem, (15) is an optimization problem which can be solved in polynomial
time, and the fact that it gives the exact solution to problem (14) under some circumstances
has been one of the main reasons for the recent developments in CS. There is a simple
geometric intuition on why such an approach gives good approximations. Among the
lp-norms that can be used in the construction of CS-related optimization problems, only those
which are convex give rise to a convex optimization problem, which is more feasible than its
non-convex counterparts; this means only lp-norms with p ≥ 1 satisfy such a condition. On
the other hand, lp-norms with p > 1 do not favor sparsity; for example, l2-norm minimization
tends to spread the reconstruction across all coordinates even if the true solution is sparse. But the
l1-norm is able to enforce sparsity. The intuition is that the l1-minimization solution is most likely
to occur at corners or edges, not faces [21, 45]. That is why the l1-norm became famous for CS.
Further, in the CS literature, convex relaxation is presented as either l2-penalized l1-minimization,
called Basis Pursuit Denoising (BPDN) [22], or l1-penalized l2-minimization, called the least
absolute shrinkage and selection operator (LASSO) [45], which are equivalent and effective in
estimating high-dimensional data.
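For the noise-free problem, (15) can be solved exactly as a linear program by splitting x into nonnegative parts, x = u − v with u, v ≥ 0, so that ‖x‖₁ = 1ᵀ(u + v). A sketch using SciPy's linprog, with illustrative dimensions well inside the regime where l1 recovery succeeds:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)
N, M, k = 60, 30, 4

# Illustrative k-sparse ground truth and Gaussian sensing matrix.
x_true = np.zeros(N)
x_true[rng.choice(N, size=k, replace=False)] = rng.normal(size=k)
A = rng.normal(size=(M, N)) / np.sqrt(M)
y = A @ x_true

# (15) as an LP: minimize 1^T (u + v) subject to A(u - v) = y, u, v >= 0.
c = np.ones(2 * N)
A_eq = np.hstack([A, -A])
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None))
x_hat = res.x[:N] - res.x[N:]

print(np.linalg.norm(x_hat - x_true))  # recovery error
```

For dimensions like these, the l1 solution typically coincides with the sparse one, as Theorem 2 predicts under the RIP.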
Usually, real-world systems are contaminated with noise, w, and in this chapter the focus is on
such problems. The noisy recovery problem becomes a simple extension of (15),

min_x ‖x‖₁, such that ‖y − Ax‖₂ ≤ ε,   (17)

where ε is a bound on ‖w‖₂. The real issue for (17) is stability: introducing small changes in the
observations should result in small changes in the recovery. We can visualize this using the
balls shown in Figure 5.

Both the l0- and l1-norms give exact solutions for the noise-free CS problem while giving a close
solution for the noisy problem. However, the l2-norm gives the worst approximation in both cases
compared to the other lp-norms with p < 2 (see Figure 5). Moreover, (17) is equivalent to an
unconstrained quadratic programming problem,

min_x (1/2)‖y − Ax‖₂² + γ‖x‖₁,   (18)

as will be shown later as LASSO, where γ is a tuning parameter. The equivalence of (17) and
(18) is shown in [23, 24]. In this chapter, the generalized form of the minimization problem in
(18) with different lp-norm regularizations is considered, that is,

Figure 5. lp-norm approximations: the constraints for the noise-free CS problem is given by the bold line while the shaded
region is for the noisy one.

min_x (1/2)‖y − Ax‖₂² + γ‖x‖_p.   (19)

Further, this chapter presents the use of the Bayesian framework in compressive sensing by
incorporating two different priors, modeling the sparsity and the possible structure among the
sparse entries of a signal. Basically, it is a summary of the recent works [2, 25–27].
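One standard solver for (18) is the iterative soft-thresholding algorithm (ISTA), an instance of the iterative thresholding methods cited above as [18]; the sketch below uses illustrative dimensions, noise level, and tuning parameter γ:

```python
import numpy as np

rng = np.random.default_rng(4)
N, M, k = 100, 40, 5
x_true = np.zeros(N)
x_true[rng.choice(N, size=k, replace=False)] = rng.normal(size=k)
A = rng.normal(size=(M, N)) / np.sqrt(M)
y = A @ x_true + 0.01 * rng.normal(size=M)  # noisy measurements

# ISTA: a gradient step on 0.5*||y - Ax||_2^2 followed by soft
# thresholding, which is the proximal map of gamma*||x||_1.
gamma = 0.05
L = np.linalg.norm(A, 2) ** 2  # Lipschitz constant of the gradient
x = np.zeros(N)
for _ in range(500):
    z = x - A.T @ (A @ x - y) / L
    x = np.sign(z) * np.maximum(np.abs(z) - gamma / L, 0.0)

print(np.linalg.norm(x - x_true))
```

The l1 penalty both selects a sparse support and shrinks the surviving coefficients, so the estimate is slightly biased toward zero.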

3. Bayesian inference used in CS problem

Under Bayesian inference, consider two random variables x and y with probability density functions
(pdfs) p(x) and p(y), respectively. The product rule gives us p(x, y) = p(x|y)p(y) = p(y|x)p(x), and
Bayes' theorem provides

p(x|y) = p(y|x) p(x) / p(y).   (20)

Further, the maximum a posteriori (MAP) estimate, x̂_MP, is defined as

x̂_MP = arg max_x p(y|x) p(x) / ∫ p(y|x̃) p(x̃) dx̃
     = arg max_x p(y|x) p(x).   (21)

MAP is related to Fisher’s method of maximum likelihood estimation (MLE), \hat{x}_{ML}:

\hat{x}_{ML} = \arg\max_x p(y|x).    (22)
266 Bayesian Inference

As we can see from (21) and (22), the difference between MAP and MLE is the prior distribution: the former can be considered a regularized form of the latter. Since we apply Bayesian inference, we further assume different prior distributions on x.
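A scalar toy model makes this relation concrete. For y = x + w with w ~ N(0, σ²), the MLE is just y itself, while a Gaussian N(0, τ²) prior turns the MAP estimate into a shrunk, i.e., regularized, version of the MLE. This closed form is standard, but the example is our own illustration rather than one from the chapter’s references.

```python
def x_mle(y):
    """MLE for y = x + w, w ~ N(0, sigma2): the observation itself."""
    return y

def x_map(y, sigma2, tau2):
    """MAP with a Gaussian N(0, tau2) prior on x: shrinks the MLE toward 0."""
    return (tau2 / (tau2 + sigma2)) * y
```

As the noise variance goes to zero, the likelihood dominates the prior and the MAP estimate approaches the MLE.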

3.1. Sparse prior


The estimators of x resulting from (19) for the sparse problem considered in this chapter can be presented as maximum a posteriori (MAP) estimators under the Bayesian framework, as in [28]. We show this by defining a prior probability distribution for x of the form

p(x) = \frac{e^{-u f(x)}}{\int_{x \in \mathbb{R}^N} e^{-u f(x)}\, dx},    (23)

where the regularizing function f: χ → ℝ is some scalar-valued, non-negative function with χ ⊆ ℝ, which can be extended to a vector argument by

f(x) = \sum_{i=1}^{N} f(x_i),    (24)

such that, for sufficiently large u, the integral

\int_{x \in \mathbb{R}^N} \exp(-u f(x))\, dx

is finite. Furthermore, let the assumed variance of the noise be given by σ² = λ/u, where λ is the system parameter, so that λ = σ²u.
Since the pdf of the noise w is Gaussian, the likelihood function of y given x is given by

p_{y|x}(y|x) = \frac{1}{(2\pi\sigma^2)^{M/2}}\, e^{-\frac{1}{2\sigma^2}\|y - Ax\|_2^2}.    (25)

Together with (20) and (23), this now gives


p_{x|y}(x|y, A) = \frac{\exp\left(-\frac{u}{\lambda}\left(\frac{1}{2}\|y - Ax\|_2^2 + \lambda f(x)\right)\right)}{\int_{x \in \mathbb{R}^N} \exp\left(-\frac{u}{\lambda}\left(\frac{1}{2}\|y - Ax\|_2^2 + \lambda f(x)\right)\right) dx}.

The MAP estimator, (21), becomes

\hat{x}_{MP} = \arg\min_{x \in \mathbb{R}^N} \frac{1}{2}\|y - Ax\|_2^2 + \lambda f(x).    (26)

Now, as we choose different regularizing functions, we get different estimators, as listed below [28]:

1. Linear estimator: when f(x) = ‖x‖₂², (26) reduces to

\hat{x}_{\text{Linear}} = A^T (A A^T + \lambda I)^{-1} y,    (27)

which is the LMMSE estimator. We ignore this estimator in our analysis, however, since its results are not sparse. The following two estimators are more interesting for CS problems, since they enforce sparsity in the vector x.

2. LASSO estimator: when f(x) = ‖x‖₁, we get the LASSO estimator and (26) becomes

\hat{x}_{\text{LASSO}} = \arg\min_{x \in \mathbb{R}^N} \frac{1}{2}\|y - Ax\|_2^2 + \lambda \|x\|_1.    (28)

3. Zero-norm regularization estimator: when f(x) = ‖x‖₀, we get the zero-norm regularization estimator and (26) becomes

\hat{x}_{\text{Zero-Norm}} = \arg\min_{x \in \mathbb{R}^N} \frac{1}{2}\|y - Ax\|_2^2 + \lambda \|x\|_0.    (29)

As mentioned earlier, (29) is the best formulation for estimating the sparse vector x, but it is NP-complete. The worst approximation for the sparse problem considered is the L2-regularized solution given by (27), whereas the best tractable approximation is given by Eq. (28) and its equivalent forms. In our simulations we have used algorithms from the literature that are considered equivalent to this approximation, such as Bayesian compressive sensing (BCS) [29] and L1-norm regularized least squares (L1-LS) [11–13].
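The linear (L2-regularized) estimator of Eq. (27) has a direct closed form. As a sanity check, it coincides with the equivalent form (A^T A + λI)^{-1} A^T y by the push-through identity. A minimal sketch, assuming NumPy:

```python
import numpy as np

def linear_estimator(A, y, lam):
    """L2-regularized (LMMSE-type) closed form of Eq. (27)."""
    m = A.shape[0]
    return A.T @ np.linalg.solve(A @ A.T + lam * np.eye(m), y)
```

Note that this estimate is generally dense, which is why it is set aside in favor of the sparsity-enforcing estimators above.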

3.2. Clustering prior


The entries of the sparse vector x may have some special structure (clusteredness) among themselves. This can be modeled by modifying the previous prior distribution.¹ We define the clustering in terms of the distance between consecutive entries of the sparse vector x,

D(x) = \sum_{i=1}^{N} |x_i - x_{i-1}|,

and use a regularizing parameter γ to represent the clusteredness in the data. Hence, we define the clustering prior to be

q(x) = \frac{e^{-\gamma D(x)}}{\int_{x \in \mathbb{R}^N} e^{-\gamma D(x)}\, dx}.    (30)

The new posterior involving this prior under the Bayesian framework is proportional to the
product of the three pdfs:

¹ In [30], a hierarchical Bayesian generative model for sparse signals is presented, in which a full Bayesian analysis is applied by assuming prior distributions for each parameter appearing in the analysis. We follow a different approach.
268 Bayesian Inference

p(x|y) ∝ p(y|x)\,p(x)\,q(x).    (31)

By similar arguments to those used in Section 3.1, we arrive at the clustered LASSO estimator

\hat{x}_{\text{Clu-LASSO}} = \arg\min_{x \in \mathbb{R}^N} \frac{1}{2}\|y - Ax\|_2^2 + \lambda \|x\|_1 + \gamma \sum_{i=1}^{N} |x_i - x_{i-1}|.    (32)

Here, λ and γ are our tuning parameters for the sparsity in x and for the way its entries are clustered, respectively.
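The two penalties in (32) can be evaluated directly. The sketch below implements the clustering distance with first-order differences (the i = 1 boundary term, which would involve an undefined x₀, is omitted in this illustration) and shows that a clustered support is penalized less than a scattered one:

```python
import numpy as np

def D(x):
    """Clustering distance: sum of |x_i - x_(i-1)| over consecutive entries."""
    return np.sum(np.abs(np.diff(x)))

def clustered_lasso_objective(A, y, x, lam, gamma):
    """Objective of Eq. (32): data fit + sparsity penalty + clustering penalty."""
    return 0.5 * np.sum((y - A @ x) ** 2) + lam * np.sum(np.abs(x)) + gamma * D(x)
```

A run of equal non-zero values contributes to D only at its two edges, so the prior q(x) favors grouped supports over scattered ones with the same sparsity level.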

4. Bayesian inference in CS applications

The compressed sensing paradigm has been applied to many signal processing areas [31–41]. However, hardware that can translate CS theory into practical use is still very limited. Nonetheless, the demand for cheaper, faster, and more efficient devices will motivate the use of the CS paradigm in real-time systems in the near future.

So far, in image processing, one can mention single-pixel imaging via compressive sampling [31], magnetic resonance imaging (MRI) with reduced scan time and improved image quality [32], seismic imaging [33], and radar systems with simplified hardware design and high resolution [34, 35]. In communications and networking, CS theory has been studied for sparse channel estimation [36], for underwater acoustic channels, which are inherently sparse [37], for spectrum sensing in cognitive radio networks [38], for large wireless sensor networks (WSNs) [39], as a channel coding scheme [40], for localization [41], and so on. A good review of the CS application literature is provided in [21], which essentially summarizes the bulk of the literature available at http://dsp.rice.edu/cs.
In this chapter, there are examples of CS theory applications using Bayesian inference in imaging, namely magnetic resonance imaging (MRI); in communications, namely multiple-input multiple-output (MIMO) systems; and in remote sensing. First, let us see the impact of the estimators derived above, LASSO and clustered LASSO, in MRI.

4.1. Magnetic resonance imaging (MRI)


MRI signals are usually very weak, both because of the presence of noise and because of the weak nature of the signal itself. The compressed sensing (CS) paradigm can be applied to boost the recovery of such signals. We applied the CS paradigm via the Bayesian framework, that is, incorporating different prior information, such as sparsity and the special structure found among the sparse entries, to different MRI images.

4.1.1. Angiogram image


Angiogram images are already sparse in the pixel representation. An angiogram image taken
from the University Hospital Rechts der Isar, Munich, Germany [42] is used for our analysis.

Figure 6. Comparison of reconstruction schemes together with performance comparison using mean square error (MSE)
in dB: (a) original image x; (b) LMMSE (35.1988 dB); (c) LASSO (53.6195 dB); and (d) clustered Lasso (63.6889 dB).

The image we took is sparse and clustered even in the pixel domain. The original signal after vectorization is x of length N = 960. Taking 746 measurements and a maximum number of non-zero elements k = 373, we applied different reconstruction schemes; the results are shown in Figure 6.
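The quality figures in Figure 6 are reported as "MSE in dB", with larger values indicating better reconstructions. The snippet below assumes the common convention of reporting −10·log10(MSE); this matches the ordering of the reported numbers but is our assumption, since the exact formula is not given in the chapter.

```python
import numpy as np

def recon_quality_db(x_true, x_hat):
    """Reconstruction quality in dB; assumes the convention -10*log10(MSE),
    so that larger values mean better reconstructions (an assumption)."""
    mse = np.mean((np.asarray(x_true, float) - np.asarray(x_hat, float)) ** 2)
    return -10.0 * np.log10(mse)
```

Under this convention, every 10 dB of improvement corresponds to a tenfold reduction in mean square error.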

4.1.2. Phantom image


Another MRI image considered is the Shepp-Logan phantom, which is not sparse in the spatial domain. However, we sparsified it in k-space by zeroing out small coefficients. We then measured the sparsified image and added noise. The original signal after vectorization is x of length N = 200. Taking 94 measurements, that is, y of length M = 94, and a maximum number of non-zero elements k = 47, we applied the different reconstruction algorithms used above. The results show that clustered LASSO does well compared to the others, as can be seen in Figure 7.

4.1.3. fMRI image

Another example of applying clustered LASSO-based image reconstruction using the Bayesian framework to medical images is functional magnetic resonance imaging (fMRI), a non-invasive brain mapping technique that is crucial in the study of brain activity. Taking many slices of fMRI data, we saw that these data sets are sparse in the Fourier domain. This is


Figure 7. Comparison of reconstruction schemes together with performance comparison using mean square error (MSE) in dB: (a) original image x; (b) sparsified image; (c) least squares (LS) (21.3304 dB); (d) LMMSE (27.387 dB); (e) LASSO (37.9978 dB); and (f) clustered LASSO (40.0068 dB).


Figure 8. The five column images represent the real and imaginary parts of the Fourier transform representation of the data set we have chosen to present further, which in general shows that the fMRI image has a sparse and clustered representation.

shown in Figure 8. We observed the whole data set in this domain for the whole brain image. All slices share the characteristics on which we have based our analysis, i.e., sparsity and clusteredness. We then took some consecutive slices and used different values of N, k, and M = 2k on these slices. The two numbers at the top of each panel of Figure 9 represent k and N, respectively.

In fMRI, results are compared using image intensity, which gives a good basis for a health practitioner to observe and decide in accordance with the available information. The more prior knowledge one has about how brain regions work in human beings or pets, the better the priors one can incorporate to analyze the data. This makes it an interesting tool for future research.

4.2. MIMO systems


Multiple-input multiple-output (MIMO) systems are integrated into modern wireless communications due to their advantages with respect to many performance metrics. One such advantage is the ability to transmit multiple streams using spatial multiplexing, but channel state information (CSI) at the transmitter is needed to obtain optimal system performance.
Consider a frequency division duplex (FDD) MIMO system consisting of Nt transmit and Nr receive antennas. Assume that the channel is a flat-fading, temporally correlated channel denoted by a matrix H[n] ∈ ℂ^{Nr×Nt}, where n indicates a channel feedback time index, with block fading assumed during the feedback interval. The singular value decomposition (SVD) of H[n] gives

H[n] = U[n] Σ[n] V^H[n],

where U ∈ ℂ^{Nr×r} and V ∈ ℂ^{Nt×r} are unitary matrices and Σ ∈ ℂ^{r×r} is a diagonal matrix consisting of r = min(Nt, Nr) singular values. In the presence of perfect channel state information (CSI), a MIMO system model can be given by the equation

\tilde{y} = U^H[n] H[n] V[n] \tilde{x} + U^H[n] n,    (33)

where \tilde{x} ∈ ℂ^{r×1} is the transmitted vector, V[n] is used as the precoder at the transmitter, U^H[n] is used as the decoder at the receiver, n ∈ ℂ^{Nr×1} denotes a noise vector whose entries are i.i.d. and distributed according to CN(0, 1), and \tilde{y} ∈ ℂ^{Nr×1} is the received vector.
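The diagonalization behind (33) can be checked numerically: precoding with V and decoding with U^H turns the channel into parallel subchannels whose gains are the singular values. A sketch with a randomly drawn 2×2 channel, assuming NumPy:

```python
import numpy as np

# SVD precoding/decoding sketch for the model in (33), random 2x2 channel.
rng = np.random.default_rng(0)
Nr, Nt = 2, 2
H = rng.standard_normal((Nr, Nt)) + 1j * rng.standard_normal((Nr, Nt))
U, s, Vh = np.linalg.svd(H)              # H = U @ diag(s) @ Vh
# Precoding with V and decoding with U^H yields a diagonal effective channel:
effective = U.conj().T @ H @ Vh.conj().T
```

The effective channel equals diag(s), so each stream sees an independent scalar gain, which is what makes spatial multiplexing with per-stream receivers possible.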

Channel-adaptive transmission requires knowledge of channel state information at the transmitter. In temporally correlated MIMO channels, the correlation can be utilized to reduce feedback overhead and improve performance. CS methods and rotative quantization were used in [43] to compress and feed back the CSI for MIMO systems, as an extension of the work in [44]. Simulations show that the CS-based method reduces feedback overhead while delivering the same performance as the direct quantization scheme.

Three methods are compared in the simulations: perfect CSI, without CS, and with CS, using matched filter (MF) and minimum mean square error (MMSE) receivers for different total feedback bits, B = 10 and B = 5. In Figure 10, sum rates are compared against signal-to-noise ratio (SNR); using CS, half of the number of bits can be saved. In Figure 11, where the bit error rate is plotted against SNR, the CS method achieves better bit error rate performance using the same number of bits as the case without CS. These two figures demonstrate the clear advantage of using CS for feedback of singular vectors in the rotative-based method. Details are deferred to [43].

4.3. Remote sensing


Remote sensing satellites provide a repetitive and consistent view of the Earth and they offer a
wide range of spatial, spectral, radiometric, and temporal resolutions. Image fusion is applied
to extract all the important features from various input images. These images are integrated to
form a fused image which is more informative and suitable for human visual perception or
computer processing. Sparse representation has been applied to fuse image to improve the
quality of fused image [45].

Figure 9. Application of the sparse and cluster priors, LASSO and clustered LASSO (CL. LASSO), to fMRI data analysis for N = 80, k > 50, σ² = 0.1, and λ = 0.1, where LMMSE is the L2-regularized one.

[Figure 10 plot: sum rate (bits/s/Hz) versus SNR (dB), 0–20 dB, for perfect CSI, without CS (B = 10), and with CS (B = 5), with MMSE and matched filter receivers.]

Figure 10. Sum rate vs. SNR for a 2×2 MIMO system with and without CS, with two streams. We can observe that the performance of the CS method is almost equal to that of the method without CS, while saving half the number of bits.

[Figure 11 plot: bit error rate (log scale, 10⁻³ to 10⁻¹) versus SNR (dB) for perfect CSI, without CS (B = 10), and with CS (B = 10).]

Figure 11. Bit error rate vs. SNR using a matched filter receiver for a 2×2 MIMO system with one stream.

Figure 12. Comparison of image fusion methods for remote sensing applications using Brovey, DWT, PCA, FDCT, and
the sparse representation methods [46].

To improve the quality of the fused image, a remote sensing image fusion method based on sparse representation is proposed in [46]. In this method, the source images are first represented with sparse coefficients. Then, the larger values of the sparse coefficients of the panchromatic (Pan) image are set to 0. Thereafter, the coefficients of the panchromatic (Pan) and multispectral (MS) images are combined with a linear weighted averaging fusion rule. Finally, the fused image is reconstructed from the combined sparse coefficients and the dictionary. The proposed method is compared with intensity-hue-saturation (IHS), Brovey transform (Brovey), discrete wavelet transform (DWT), principal component analysis (PCA), and fast discrete curvelet transform (FDCT) methods on several pairs of multifocus images. The proposed method using sparse representation outperforms the usual methods listed here (see Figure 12). We believe that our method of clustered compressed sensing can further improve this result.

5. Conclusions

In this chapter, a Bayesian way of analyzing data in the CS paradigm is presented. The method assumes prior information, such as the sparsity and clusteredness of signals, in the analysis of the data. Among the different reconstruction methods, the convex relaxation methods are redefined using Bayesian inference. Further, three CS applications are presented: MRI imaging, MIMO systems, and remote sensing. For MRI imaging, the two different priors are incorporated, while for MIMO systems and remote sensing, only the sparse prior is applied in the analysis. We suggest that the special structure among the sparse elements of the data be included in the analysis to further improve the results.

Author details

Solomon A. Tesfamicael1* and Faraz Barzideh2


*Address all correspondence to: [email protected]
1 Department of Education, Norwegian University of Science and Technology (NTNU),
Trondheim, Norway

2 Department of Electrical Engineering and Computer Science, University of Stavanger (UiS),


Stavanger, Norway

References

[1] Jaynes ET. Probability Theory: The Logic of Science. Cambridge University Press; 2003

[2] Tesfamicael SA, Barzideh F. Clustered compressed sensing in fMRI data analysis using a
Bayesian framework. International Journal of Information and Electronics Engineering.
2014;4(2):74-80
[3] MacKay DJC. Information Theory, Inference, and Learning Algorithms. Cambridge University Press; 2003. ISBN: 978-0-521-64298-9
[4] O’Hagan A, Forster J. Kendall’s Advanced Theory of Statistics, Volume 2B: Bayesian Inference. Arnold, a member of the Hodder Headline Group; 2004. ISBN: 0 340 807520
[5] Berger JO. Bayesian and Conditional Frequentist Hypothesis Testing and Model Selec-
tion. VIII C:L:A:P:E:M; La Havana, Cuba; November 2001
[6] Efron B. Modern Science and the Bayesian-Frequentist Controversy; 2005-19B/233.
January 2005

[7] Botje MRA. Fisher on Bayes and Bayes’ Theorem, Bayesian Analysis; 2008
[8] Moreno E, Javier Giron F. On the Frequentist and Bayesian approaches to hypothesis
testing. January-June 2006;3-28
[9] Friston KJ, Penny W, Phillips C, Kiebel S, Hinton G, Ashburner J, Classical and Bayesian
inference in neuroimaging: Theory. NeuroImage. June 2002;16(2):465-483
[10] Donoho D. Compressed sensing. IEEE Transactions on Information Theory. 2006;52(4):1289–1306
[11] Candes EJ, Tao T. Decoding by linear programming. IEEE Transactions on Information
Theory. December 2005;51(12)

[12] Candès E, Romberg J, Tao T. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory. February 2006;52(2):489–509
[13] Candès EJ, Tao T. Near-optimal signal recovery from random projections: Universal encoding strategies? IEEE Transactions on Information Theory. December 2006;52:5406–5425
[14] Eldar YC, Kutyniok G. Compressed Sensing: Theory and Applications. Cambridge Uni-
versity Press; 2012
[15] Eldar YC, Kutyniok G. Compressed Sensing: Algorithms and Applications. KTHKTH,
Communication Theory. ACCESS Linnaeus Centre; 2012

[16] Candes EJ. The restricted isometry property and its implications for compressed sensing.
Comptes Rendus Mathematique. 2008

[17] Guan X, Gao Y, Chang J, Zhang Z. Advances in Theory of Compressive Sensing and
Applications in Communication. 2011 First International Conference on Instrumentation,
Measurement, Computer, Communication and Control; 2011
[18] Blumensath T, Davies ME. Iterative hard thresholding for compressed sensing. Applied
and Computational Harmonic Analysis. 2009
[19] Natarajan BK. Sparse approximate solutions to linear systems. SIAM Journal on Computing. 1995
[20] Donoho DL, Huo X. Uncertainty principles and ideal atomic decomposition. IEEE Transactions on Information Theory. 2001;47:2845–2862
[21] Qaisar S, Bilal RM, Iqbal W, Naureen M, Lee S. Compressive sensing: From theory to applica-
tions, a survey. Journal of Communications and Networks. October 2013;15(5):443-456

[22] Figueiredo MAT, Nowak RD, Wright SJ. Gradient projection for sparse reconstruction:
Application to compressed sensing and other inverse problems. Journal of Selected
Topics in Signal Processing. 2007;1(4):586-597

[23] Schniter P, Potter LC, Ziniel J. Subspace pursuit for compressive sensing signal recon-
struction. Information Theory and Applications Workshop. February 2008. pp. 326-333
[24] Teixeira FCA, Bergen SWA, Antoniou A. Robust signal recovery approach for compres-
sive sensing using unconstrained optimization. Proceedings of 2010 IEEE International
Symposium on Circuits and Systems (ISCAS). May 2010. pp. 3521-3524
[25] Tesfamicael SA, Barzideh F. Clustered compressed sensing via Bayesian framework. IEEE
UKSim-AMSS 17th International Conference on Computer Modelling and Simulation.
UKSim2015-19.S.Image, Speech and Signal Processing; Cambridge, United Kingdom.
2015. pp. 25-27
[26] Tesfamicael SA, Barzideh F. Clustered compressive sensing: Application on medical imag-
ing. International Journal of Information and Electronics Engineering. 2015;5(1):48-50

[27] Tesfamicael SA. Compressive sensing in signal processing: Performance analysis and
applications [doctoral thesis]. NTNU; 2016. p. 182
[28] Rangan S, Fletcher AK, Goyal VK. Asymptotic Analysis of MAP Estimation via the
Replica Method and Applications to Compressed Sensing. 2009. arXiv:0906.3234v1
[29] Ji S, Xue Y, Carin L. Bayesian compressive sensing. IEEE Transactions on Signal
Processing. June 2008;56(6):2346-2356

[30] Yu L, Sun H, Pierre Barbot J, Zheng G. Bayesian compressive sensing for clustered sparse
signals. IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP). 2011

[31] Duarte MF, Davenport MA, Takhar D, Laska JN, Sun T, Kelly KF, Baraniuk RG. Single-
pixel imaging via compressive sampling. IEEE Signal Processing Magazine. March
2008;25(2):83-91

[32] Lustig M, Donoho D, Pauly JM. Sparse MRI: The application of compressed sensing for
rapid MR imaging. Magnetic Resonance in Medicine. 2007;58(6):1182-1195
[33] Hennenfent G, Herrmann FJ. Simply denoise: Wavefield reconstruction via jittered undersampling. Geophysics. 2008;73(3):V19–V28
[34] Baraniuk R, Steeghs P. Compressive radar imaging. 2007 IEEE Radar Conference; April 2007. pp. 128-133
[35] Herman MA, Strohmer T. High-resolution radar via compressed sensing. IEEE Trans-
actions on Signal Processing. June 2009;57(6):2275-2284
[36] Bajwa WU, Haupt J, Sayeed AM, Nowak R. Compressed channel sensing: A new
approach to estimating sparse multipath channel. Proceedings of the IEEE. June 2010;98(6):
1058-1107

[37] Berger CR, Zhou S, Preisig JC, Willett P. Sparse channel estimation for multicarrier
underwater acoustic communication: From subspace methods to compressed sensing.
IEEE Transactions on Signal Processing. March 2010;58(3):1708-1721

[38] Zeng F, Zhi T, Chen L. Distributed compressive wideband spectrum sensing in coopera-
tive multi-hop cognitive networks. IEEE International Conference on Communications
(ICC). 1-5 May, 2010

[39] Qing L, Zhi T. Decentralized sparse signal recovery for compressive sleeping wireless
sensor networks. IEEE Transactions on Signal Processing. July 2010;58(7):3816-3827
[40] Charbiwala Z, Chakraborty S, Zahedi S, Kim Y, Srivastava MB, He T, Bisdikian C.
Compressive oversampling for robust data transmission in sensor networks. IEEE
INFOCOM Proceedings. 2010. pp. 1-9
[41] Chen F, Au WSA, Valaee S, Zhenhui T. Compressive sensing based positioning using RSS
of WLAN access points. IEEE INFOCOM Proceedings. 1-9 March 2010

[42] Image. Depiction of Vessel Diseases with a Wide Range of Contrast and Non-Contrast
Enhanced Techniques. Munich, Germany: University Hospital Rechts der Isar; 2014
[43] Tesfamicael SA, Lundheim L. Compressed sensing based rotative quantization in tempo-
rally correlated MIMO channels. Recent Developments in Signal Processing; 2013
[44] Godana SBE, Ekman T. Rotative quantization using adaptive range for temporally corre-
lated MIMO channels. 2013 IEEE 24th International Symposium on Personal Indoor and
Mobile Radio Communications (PIMRC). 2013. pp. 1233-1238
[45] Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological). 1996;58(1):267-288

[46] Yu X, Gao G, Xu J, Wang G. Remote sensing image fusion based on sparse representation.
2014 IEEE Geoscience and Remote Sensing Symposium
DOI: 10.5772/intechopen.70529

Chapter 14

Sparsity in Bayesian Signal Estimation

Ishan Wickramasingha, Michael Sobhy and Sherif S. Sherif

Additional information is available at the end of the chapter

Abstract
In this chapter, we describe different methods to estimate an unknown signal from its
linear measurements. We focus on the underdetermined case where the number of
measurements is less than the dimension of the unknown signal. We introduce the
concept of signal sparsity and describe how it could be used as prior information for
either regularized least squares or Bayesian signal estimation. We discuss compressed
sensing and sparse signal representation as examples where these sparse signal estima-
tion methods could be applied.

Keywords: inverse problems, signal estimation, regularization, Bayesian methods,


signal sparsity

1. Introduction

In engineering and science, a system typically refers to a physical process whose outputs are
generated due to some inputs [1, 2]. Examples of systems include measuring instruments,
imaging devices, mechanical and biomedical devices, chemical reactors and others. A system
could be abstracted as a block diagram,

where x and y represent the inputs and outputs of the system, respectively. The block, A,
formalizes the relation between these inputs and the outputs using mathematical equa-
tions [2, 3]. Depending on the nature of the system, the relation between its inputs and outputs

© 2017 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use,
distribution, and reproduction in any medium, provided the original work is properly cited.

could be either linear or nonlinear. For a linear relation, the system is called a linear system and
it would be represented by a set of linear equations [3, 4]

y = A x.    (1)

In this chapter, we will restrict our attention to linear systems, as they could adequately
represent many actual systems in a mathematically tractable way.
When dealing with systems, two typical types of problems arise, forward and inverse problems.

1.1. Forward problems

In a forward problem, one would be interested in obtaining the output of a system due to a
particular input [5, 6]. For linear systems, this output is the result of a simple matrix-vector
product, Ax. Forward problems usually become more difficult as the number of equations
increases or as uncertainties about the inputs, or the behavior of the system, are present [6].

1.2. Inverse problems

In an inverse problem, one would be interested in inferring the inputs to a system x that resulted in
observed outputs, i.e., measured y [5, 6]. Another formulation of an inverse problem is to identify
the behavior of the system, i.e., construct A, from knowledge of different input and output values.
This problem formulation is known as system identification [1, 7, 8]. In this chapter, we will only
consider the input inference problem. The nature of the input x to be inferred further leads to two
broad categories of this problem: estimation, and classification. In input estimation, the input could
assume an infinite number of possible values [4, 9], while in input classification the input could
assume only a finite number (usually small) of possible values [4, 9]. Accordingly, in input
classification, one would like to only assign an input to a predetermined signal class. In this
chapter, we will only focus on estimation problems, particularly on restoring an input signal x
from noisy data y that is obtained using a linear measuring system represented by a matrix A.

2. Signal restoration as example of an inverse problem

To solve the above signal restoration problem, we need to estimate input signal x through the
inversion of matrix A. This could be a hard problem because in many cases the inverse of A might
not exist, or the measurement data, y, might be corrupted by noise. The existence of the inverse of
A depends on the number of acquired independent measurements relative to the dimension of
the unknown signal [5, 10]. The conditions for the existence of a stable solution of any inverse
problem, i.e., for an inverse problem to be well-posed, have been addressed by Hadamard as:
• Existence: for measured output y there exists at least one corresponding input x.
• Uniqueness: for measured output y there exists only one corresponding input x.
• Continuity: as the input x changes slightly, the output y changes slightly, i.e., the relation
between x and y is continuous.

These conditions can be applied to linear systems as conditions on the matrix A. Let A ∈ ℝ^{n×m}, where ℝ^{n×m} denotes the set of matrices of dimension n × m with real-valued elements. The matrix equation y_{n×1} = A_{n×m} x_{m×1} is equivalent to n linear equations in m unknowns. The matrix A is a linear transformation that maps input signals from its domain D(A) = ℝ^m to its range R(A) ⊆ ℝ^n [4, 5, 10]. For any measured output signal y ∈ ℝ^n, we can identify three cases based on the values of n and m.

2.1. Underdetermined linear systems


In this case, n < m, i.e., the number of equations is less than the number of unknowns,
A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1m} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nm} \end{bmatrix}.    (2)

If these equations are consistent, Hadamard’s Existence condition will be satisfied. However, Hadamard’s Uniqueness condition is not satisfied because Null Space(A) ≠ {0}, i.e., there exists z ≠ 0 ∈ Null Space(A) such that

A(x + z) = y.    (3)

This linear system is called under-determined because its equations, i.e., system constraints, are
not enough to uniquely determine x [4, 5]. Thus, the inverse of A does not exist.
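The non-uniqueness in (3) is easy to exhibit numerically: for a toy 2×3 system, any null-space vector z added to a solution x explains the same measurement y. A sketch with hypothetical numbers, assuming NumPy:

```python
import numpy as np

# Underdetermined system: n = 2 equations, m = 3 unknowns (toy numbers).
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
x = np.array([1.0, 0.0, -1.0])
y = A @ x
# The last right-singular vector spans Null(A) here, since rank(A) = 2 < 3:
_, _, Vh = np.linalg.svd(A)
z = Vh[-1]
```

Because A z = 0, the vectors x and x + z are indistinguishable from the measurements alone; this is exactly why prior information is needed later in the chapter.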

2.2. Overdetermined linear systems


In this case, n > m, i.e., the number of equations is more than the number of unknowns,

A = \begin{bmatrix} a_{11} & \cdots & a_{1m} \\ a_{21} & \cdots & a_{2m} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nm} \end{bmatrix}.    (4)

If these equations are inconsistent, Hadamard’s Existence condition will not be satisfied. However, Hadamard’s Uniqueness condition will be satisfied if A has full rank. In this case, Null Space(A) = {0}, i.e.,

A(x + 0) = Ax = y.    (5)

This linear system is called over-determined because its equations, i.e., system constraints, are too many for an exact solution x to exist in general [4, 5]. Also, the inverse of A does not exist.

2.3. Square linear systems


In the case where m = n, the number of equations is equal to the number of unknowns,

A = \begin{bmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nn} \end{bmatrix}.    (6)

If A has full rank, its Null Space(A) = {0} and both Hadamard’s Existence and Uniqueness
conditions will be satisfied. In addition, if A has a small condition number, the relation
between x,y will be continuous, and Hadamard’s Continuity condition will be satisfied [4, 5, 10].
In this case, the inverse problem formulated by this system of linear equations is well-posed.

3. Methods for signal estimation

In this section, we will focus on the estimation of an input signal x from a noisy measurement y
of the output of a linear system A.

The linear system shown in Figure 1, could be modeled as,

y = Ax + v,    (7)

where v is additive Gaussian noise. As a consequence of the Central Limit Theorem, this
assumption of Gaussian distributed noise is valid for many output measurement setups.

Statistical estimation theory allows one to obtain an estimate \hat{x} of a signal x that is input to a known system A from a measurement y (see Figure 2) [11, 12]. However, this estimate \hat{x} is not unique, as it depends on the choice of estimator among the different ones available. In addition to the measurement y, if other information about the input signal is available, it can be

Figure 1. Linear system with noisy output measurement.

Figure 2. Signal estimation using prior information.



used as prior information to constrain the estimator to produce a better estimate of x. Signal
estimation for overdetermined systems could be achieved without any prior information about
the input signal. However, for underdetermined systems, prior information is necessary to
ensure a unique estimate.

3.1. Least squares estimation


If there is no information available about the statistics of the measured data,

y = Ax + v,  (8)

least squares estimation could be used. The least squares estimate is obtained by minimizing
the square of the L2 norm of the error between the measurement and the linear model,
v = y − Ax. It is given by

\hat{x} = \arg\min_x \| y - Ax \|_2^2.  (9)

The L2 norm is a special case of the p-norm of a vector, where p = 2, defined as
\| x \|_p = \left( \sum_{i=1}^{m} |x_i|^p \right)^{1/p}. In Eq. (9), the unknown x is considered
deterministic, so its statistics are not required. The noise v in this formulation is implicitly
assumed to be white noise with variance σ² [13, 14]. Least squares estimation is typically used
to estimate input signals x in overdetermined problems. Since x̂ is unique in this case, no prior
information, i.e., additional constraints on x, is necessary.
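As an illustrative sketch of Eq. (9) (using NumPy; the system dimensions, signal values, and
noise level are invented for the example), the least squares estimate for an overdetermined
system can be computed directly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Overdetermined system: m = 20 measurements, n = 3 unknowns.
m, n = 20, 3
A = rng.standard_normal((m, n))
x_true = np.array([1.0, -2.0, 0.5])

# Noisy measurement y = A x + v with white Gaussian noise v.
y = A @ x_true + 0.01 * rng.standard_normal(m)

# Least squares estimate of Eq. (9).
x_hat, residuals, rank, _ = np.linalg.lstsq(A, y, rcond=None)

print("estimate:", x_hat)     # close to x_true
print("rank(A):", rank)       # full column rank, so the estimate is unique
```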

3.2. Weighted least squares estimation


If the noise v in Eq. (8) is not necessarily white and its second-order statistics, i.e., mean and
covariance matrix, are known, then weighted least squares estimation could be used to further
improve the least squares estimate. In this estimation method, measurement errors are not
weighted equally; instead, a weighting matrix C explicitly specifies these weights. The weighted
least squares estimate is given by

\hat{x} = \arg\min_x \| C^{-1/2} (y - Ax) \|_2^2.  (10)

We note that the least squares problem, Eq. (9), is a special case of the weighted least squares
problem, Eq. (10), when C = σ²I.
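A minimal numerical sketch of Eq. (10) (NumPy; the heteroscedastic noise model below is
assumed for illustration). Whitening the data by C^{−1/2} reduces weighted least squares to an
ordinary least squares problem:

```python
import numpy as np

rng = np.random.default_rng(1)

m, n = 50, 2
A = rng.standard_normal((m, n))
x_true = np.array([2.0, -1.0])

# Non-white noise: each measurement has its own variance.
variances = rng.uniform(0.01, 1.0, size=m)
v = np.sqrt(variances) * rng.standard_normal(m)
y = A @ x_true + v

# Whitening by C^{-1/2} (C is the diagonal noise covariance matrix),
# then ordinary least squares on the whitened system.
C_inv_sqrt = np.diag(1.0 / np.sqrt(variances))
x_wls, *_ = np.linalg.lstsq(C_inv_sqrt @ A, C_inv_sqrt @ y, rcond=None)

# Ordinary least squares for comparison (ignores the noise statistics).
x_ls, *_ = np.linalg.lstsq(A, y, rcond=None)

print("WLS estimate:", x_wls)
print("LS estimate: ", x_ls)
```

Both estimates are consistent, but the weighted estimate downweights the noisy measurements
and is therefore typically closer to the true signal.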

3.3. Regularized least squares estimation

In underdetermined problems, the introduction of additional constraints on x, also known as
regularization, could ensure the uniqueness of the obtained solution. Standard least squares
estimation could be extended, through regularization, to solve underdetermined estimation
problems. The regularized least squares estimate is given by

\hat{x} = \arg\min_x \| y - Ax \|_2^2 + \lambda \| Lx \|_2^2,  (11)

where L is a matrix specifying the additional constraints and λ is a regularization parameter
whose value determines the relative weights of the two terms in the objective function. If the
combined matrix \begin{bmatrix} A \\ L \end{bmatrix} has full rank, the regularized least squares estimate x̂ is unique [4]. In
this optimization problem, the unknown x is once again considered deterministic, so its
statistics are not required. It is worth noting that while regularization is necessary to solve
underdetermined inverse problems, it could also be used to improve numerical properties,
e.g., condition number, of either linear overdetermined or linear square inverse problems.
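When L = I, Eq. (11) reduces to classical Tikhonov (ridge) regularization, whose minimizer
satisfies the normal equations (AᵀA + λLᵀL) x̂ = Aᵀy. A hedged NumPy sketch for an
underdetermined system (the sizes and the value of λ are chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(2)

# Underdetermined system: m = 5 measurements, n = 10 unknowns.
m, n = 5, 10
A = rng.standard_normal((m, n))
y = rng.standard_normal(m)

lam = 0.1        # regularization parameter (arbitrary here)
L = np.eye(n)    # identity constraint matrix: standard Tikhonov/ridge

# Closed-form regularized least squares: (A^T A + lam L^T L) x = A^T y.
x_hat = np.linalg.solve(A.T @ A + lam * (L.T @ L), A.T @ y)

# The stacked matrix [A; sqrt(lam) L] has full column rank, so the
# regularized solution is unique although A alone is underdetermined.
stacked = np.vstack([A, np.sqrt(lam) * L])
print("rank of stacked matrix:", np.linalg.matrix_rank(stacked))
```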

3.4. Maximum likelihood estimation


If the probability distribution function (pdf) of the measurement y, parameterized by an
unknown deterministic input signal x, is available, then the maximum likelihood estimate of x
is given by

\hat{x} = \arg\max_x f(y|x).  (12)

This maximum likelihood estimate x̂ is obtained by assuming that measurement y is the most
likely measurement to occur given the input signal x. This corresponds to choosing the value of
x for which the probability of the observed measurement y is maximized. In maximum
likelihood estimation, the negative log of the likelihood function, f(y|x), is typically used to
transform Eq. (12) into a simpler minimization problem. When f(y|x) is a Gaussian distribution,
N(Ax, C), minimizing the negative log of the likelihood function is equivalent to solving the
weighted least squares estimation problem.

3.5. Bayesian estimation

If the conditional pdf of the measurement y given an unknown random input signal x is known,
and the marginal pdf of x, representing prior information about x, is also given, then Bayesian
estimation methods become possible. The first step to obtain one of the many possible Bayesian
estimates of x is to use Bayes' rule to obtain the a posteriori pdf,

f(x|y) = \frac{f(y|x)\, f(x)}{\int f(y|x)\, f(x)\, dx}.  (13)

Once this a posteriori pdf is known, different Bayesian estimates x̂ could be obtained. For
example, the minimum mean square error estimate is the mean of the a posteriori pdf,

\hat{x}_{MMSE} = E[x|y] = \int x \, \frac{f(y|x)\, f(x)}{\int f(y|x)\, f(x)\, dx} \, dx,  (14)

while the maximum a posteriori (MAP) estimate is given by

\hat{x}_{MAP} = \arg\max_x f(x|y) = \arg\max_x f(y|x)\, f(x).  (15)

We note that the maximum likelihood estimate, Eq. (12), is a special case of the MAP estimate,
when f(x) is a uniform pdf over the entire domain of x. The use of prior information is essential
to solve underdetermined inverse problems, but it also improves the numerical properties,
e.g., condition number, of either linear overdetermined or linear square inverse problems.

3.5.1. Bayesian least squares estimation

In least squares estimation, the vector x is assumed to be an unknown deterministic variable.
However, in Bayesian least squares estimation, it is considered a vector of scalar random
variables that satisfies statistical properties given by an a priori probability distribution
function [5]. In addition, in least squares estimation, the L2 norm of the measurement error is
minimized, while in Bayesian least squares estimation, it is the estimation error, e = x̂ − x, not
the measurement error, that is used [5]. Since x is assumed to be a random vector, the estimation
error e will also be a random vector. Therefore, the Bayesian least squares estimate could be
obtained by minimizing the conditional mean of the square of the estimation error, given the
measurement y,

\hat{x} = \arg\min_{\hat{x}} E\left[ (\hat{x} - x)^T (\hat{x} - x) \mid y \right].  (16)

When x has a Gaussian distribution and A represents a linear system, the measurement y will
also have a Gaussian distribution. In this case, the Bayesian least squares estimate given by
Eq. (16) could be reinterpreted as a regularized least squares estimate given by

\hat{x} = \arg\min_x \| y - Ax \|_2^2 + \| \mu - x \|_2^2,  (17)

where μ is the mean of the a priori distribution of x [5]. Therefore, a Bayesian least squares
estimate is analogous to a regularized least squares estimate, where a priori information about
x is expressed as additional constraints on x in the form of a regularization term.
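The equivalence between the Bayesian estimate and a regularized least squares estimate of the
form of Eq. (17) can be checked numerically. In this sketch (NumPy; the prior mean, the unit
covariances, and the sizes are invented for the example), the posterior mean under x ~ N(μ, I)
and v ~ N(0, I) matches the stacked least squares solution:

```python
import numpy as np

rng = np.random.default_rng(3)

# Underdetermined problem: m = 4 measurements, n = 8 unknowns.
m, n = 4, 8
A = rng.standard_normal((m, n))
y = rng.standard_normal(m)
mu = np.ones(n)   # prior mean of x (assumed for the example)

# Posterior mean for x ~ N(mu, I) and v ~ N(0, I):
#   x_bayes = (A^T A + I)^{-1} (A^T y + mu)
x_bayes = np.linalg.solve(A.T @ A + np.eye(n), A.T @ y + mu)

# Regularized least squares of the form of Eq. (17), solved via the
# stacked system [A; I] x ~ [y; mu] in the least squares sense, which
# minimizes ||y - A x||^2 + ||mu - x||^2.
A_stacked = np.vstack([A, np.eye(n)])
y_stacked = np.concatenate([y, mu])
x_reg, *_ = np.linalg.lstsq(A_stacked, y_stacked, rcond=None)

print("max difference:", np.max(np.abs(x_bayes - x_reg)))
```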

3.5.2. Advantages of Bayesian estimation over other estimation methods

Bayesian estimation techniques could be used, given that a reliable a priori distribution is
known, to obtain an accurate estimate of a signal x, even if the number of available
measurements is smaller than the dimension of the signal to be estimated. In this
underdetermined case, Bayesian estimation could accurately estimate a signal while
unregularized least squares estimation or maximum likelihood estimation could not. The use
of prior information in Bayesian estimation could also improve the numerical properties, e.g.,
condition number, of either linear overdetermined or linear square inverse problems. This
could be understood by keeping in mind the mathematical equivalence between obtaining one
scalar measurement related to x and specifying one constraint that x has to satisfy. Therefore,
as the number of available measurements significantly increases, both Bayesian and maximum
likelihood estimates would converge to the same estimate.

Bayesian estimation could also be easily adapted to estimate dynamic signals that change over
time. This is achieved by sequentially using past estimates of a signal, e.g., x_{t−1}, as prior
information to estimate its current value x_t. More generally, Bayesian estimation could be
easily adapted for data fusion, i.e., the combination of multiple partial measurements to
estimate a complete signal in remote sensing, stereo vision and tomographic imaging, e.g.,
positron emission tomography (PET), magnetic resonance imaging (MRI), computed
tomography (CT) and optical coherence tomography (OCT). Bayesian methods could also
easily fuse all available prior information to provide an estimate based on measurements, in
addition to all known information about a signal.

Bayesian estimation techniques could be extended in straightforward ways to estimate output
signals of nonlinear systems or signals that have complicated probability distributions. In these
cases, numerical Bayesian estimates are typically obtained using Monte Carlo methods.

3.5.3. Sparsity as prior information for underdetermined Bayesian signal estimation


Sparse signal representation means the representation of a signal in a domain where most of its
coefficients are zero. Depending on the nature of the signal, one could find an appropriate
domain where it would be sparse. This notion is useful in signal estimation because the
assumption that the unknown signal x is sparse could be used as prior information to obtain an
accurate estimate of it, even if only a small number of measurements are available. The rest of
this chapter will focus on using signal sparsity as prior information for underdetermined
Bayesian signal estimation.

4. Sparse signal representation

As shown in Figure 3, a sinusoid is a dense signal in the time domain. However, it could be
represented by a single value, i.e., it has a sparse representation, in the frequency domain.
We note that any signal could have a sparse representation in a suitable domain [15]. A sparse
signal representation means a representation of the signal in a domain where most of its
coefficients are zero. Sparse signal representations have many advantages including:
1. A sparse signal representation requires less memory for its storage. Therefore, it is a
fundamental concept for signal compression.

Figure 3. A sinusoid in time and frequency domains.



2. A sparse signal representation could lead to simpler signal processing algorithms. For
example, signal denoising could be achieved by simple thresholding operations in a
domain where the signal is known to be sparse.

3. Sparse signal representations have fewer coefficients than dense signal representations.
Therefore, the computational cost for sparse representations would be lower than for
dense representations.
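The sinusoid example of Figure 3 can be reproduced numerically (a NumPy sketch; the signal
length and frequency are chosen arbitrarily). The signal is dense in time, but essentially all of
its energy falls into two DFT coefficients, one positive-frequency and one negative-frequency
bin:

```python
import numpy as np

N = 256
t = np.arange(N)
f0 = 8                                    # integer number of cycles: no leakage
signal = np.sin(2 * np.pi * f0 * t / N)

# Time domain: most samples are nonzero (a dense representation).
n_dense = int(np.count_nonzero(np.abs(signal) > 1e-12))

# Frequency domain: only the bins at +/- f0 are nonzero (a sparse representation).
spectrum = np.fft.fft(signal)
n_sparse = int(np.count_nonzero(np.abs(spectrum) > 1e-9))

print("nonzero time-domain samples:     ", n_dense)
print("nonzero frequency-domain samples:", n_sparse)
```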

4.1. Signal representation using a dictionary


A dictionary D is a collection of vectors {φ_n}_{n∈Γ}, indexed by a parameter n ∈ Γ, where the
size of Γ equals the dimension of a signal f, such that we could represent f as a linear
combination [16],

f = \sum_{n \in \Gamma} c_n \varphi_n.  (18)

If the vectors {φ_n}_{n∈Γ} are linearly independent, then such a dictionary is called a basis.
Representing a signal as a linear combination of sinusoids, i.e., using a Fourier dictionary, is
very common. Wavelet dictionaries and chirplet dictionaries are also common dictionaries for
signal representation. Dictionaries could be combined to obtain a larger dictionary, where the
size of Γ is larger than the dimension of the signal f, that is called an overcomplete dictionary or
a frame.

4.1.1. Signal representation using a basis

A set of vectors forms a basis for R^n if they span R^n and are linearly independent. A basis in a
vector space V is a set X of linearly independent vectors such that every vector in V is a linear
combination of elements of X. A vector space V is finite-dimensional if it has a finite number of
basis vectors [17].

Depending on the properties of {φ_n}_{n∈Γ}, bases could be classified into different types, e.g.,
orthogonal basis, orthonormal basis, biorthogonal basis, global basis and local basis. For an
orthogonal basis, its basis vectors in the vector space V are mutually orthogonal,

\langle \varphi_m, \varphi_n \rangle = 0 \quad \text{for } m \neq n.  (19)

For an orthonormal basis, its basis vectors in the vector space V are mutually orthogonal and
have unit length,

\langle \varphi_m, \varphi_n \rangle = \delta(m - n),  (20)

where δ(m − n) is the Kronecker delta function. For a biorthogonal basis, its basis vectors are not
orthogonal to each other, but they are orthogonal to the vectors of another basis, {φ̃_n}_{n∈Γ},
such that

\langle \varphi_m, \tilde{\varphi}_n \rangle = \delta(m - n).  (21)

In addition, depending on the domain (support) on which these basis vectors are defined, we
could also classify a basis as either global or local. Sinusoidal basis vectors used for the discrete
Fourier transform are defined on the entire domain (support) of f, so they are considered a
global basis. Many wavelet basis vectors used for the discrete wavelet transform are defined on
only part of the domain (support) of f, so they are considered a local basis.

4.1.2. Signal representation using a frame


A frame is a set of vectors {φ_n}_{n∈Γ} that spans R^n and could be used to represent a signal f
from the inner products {⟨f, φ_n⟩}_{n∈Γ}. A frame allows the representation of a signal as a set of
frame coefficients, and its reconstruction from these coefficients in a numerically stable way,

f = \sum_{n \in \Gamma} \langle f, \varphi_n \rangle \varphi_n.  (22)

Frame theory analyzes the completeness, stability, and redundancy of linear discrete signal
representations [18]. A frame is not necessarily a basis, but it shares many properties with
bases. The most important distinction between a frame and a basis is that the vectors that
comprise a basis are linearly independent, while those comprising a frame could be linearly
dependent. Frames are also called overcomplete dictionaries. The redundancy in the representa-
tion of a signal using frames could be used to obtain sparse signal representations.

4.2. Sparse signal representation as a regularized least squares estimation problem

If designed to concentrate the energy of a signal in a small number of dimensions, an
orthogonal basis would be the minimum-size dictionary that could yield a sparse representation
of this signal [15]. However, finding an orthogonal basis that yields a highly sparse representa-
tion for a given signal is usually difficult or impractical. To allow more flexibility, the orthog-
onality constraint is usually dropped, and overcomplete dictionaries (frames) are usually used.
This idea is well explained in the following quote by Stephane Mallat:
This idea is well explained in the following quote by Stephane Mallat:

“In natural languages, a richer dictionary helps to build shorter and more precise sentences. Similarly,
dictionaries of vectors that are larger than bases are needed to build sparse representations of complex
signals. Sparse representations in redundant dictionaries can improve pattern recognition, compression,
and noise reduction but also the resolution of new inverse problems. This includes super resolution, source
separation, and compressed sensing” [15].

Thus, representing a signal using a particular overcomplete dictionary has the following goals [16]:

• Sparsity—this representation should be more sparse than other representations.


• Super resolution—the resolution of the signal when represented using this dictionary
should be higher than when represented in any other dictionary.
• Speed—this representation should be computed in O(n) or O(n log(n)) time.
A simple way to obtain an overcomplete dictionary A is to use a union of bases A_i, resulting in
the following representation of a signal y,

y = \underbrace{[\, A_1 \;\; A_2 \;\; A_3 \;\; A_4 \;\; A_5 \,]}_{A} \, x \;\Rightarrow\; y = Ax,  (23)

where A is an n × m matrix representing the dictionary and x is the vector of coefficients
representing y in the domain defined by A. Since A represents an overcomplete dictionary, the
number of its rows will be less than the number of its columns. Eq. (23) is a formulation of the
signal representation problem as an underdetermined inverse problem.

To obtain a sparse solution for Eq. (23), one needs to find an m × 1 coefficient vector x̂ such
that

\hat{x} = \arg\min_x \| y - Ax \|_2^2 + \lambda \| x \|_0,  (24)

where ‖x‖₀ is the cardinality of vector x, i.e., its number of nonzero elements, and λ > 0 is a
regularization parameter that quantifies the tradeoff between the signal representation error,
‖y − Ax‖₂², and its sparsity level, ‖x‖₀ [19]. The cardinality of vector x is sometimes referred
to as the L0 norm of x, even though ‖x‖₀ is actually a pseudo-norm that does not satisfy the
requirements of a norm in R^m. This sparse signal representation problem, Eq. (24), has a
form similar to the regularized least squares estimation problem, Eq. (11), that would be
underdetermined in the case of an overcomplete dictionary. Because of the correspondence
between regularized least squares estimation and Bayesian estimation, the problem of finding
a sparse representation of a signal could be formulated as a Bayesian estimation problem.
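A hedged sketch of the union-of-bases construction in Eq. (23) (NumPy; the sizes are invented,
and the two bases, spikes and a hand-built orthonormal DCT-II, are chosen only for
illustration). A "spike plus cosine" signal is exactly 2-sparse in the combined dictionary,
although it is dense in either basis alone:

```python
import numpy as np

n = 64

# Basis 1: identity (each column is a spike in time).
A1 = np.eye(n)

# Basis 2: an orthonormal DCT-II basis built by hand.
t = np.arange(n)
A2 = np.cos(np.pi * (2 * t[:, None] + 1) * t[None, :] / (2 * n))
A2[:, 0] *= np.sqrt(1.0 / n)
A2[:, 1:] *= np.sqrt(2.0 / n)

# Overcomplete dictionary: the n x 2n union of the two bases, as in Eq. (23).
A = np.hstack([A1, A2])

# A signal made of one spike plus one cosine: 2-sparse in the dictionary A,
# though dense in either basis alone.
x = np.zeros(2 * n)
x[10] = 1.0        # spike at t = 10 (a column of A1)
x[n + 5] = 1.0     # the 5th DCT atom (a column of A2)
y = A @ x

print("dictionary shape:", A.shape)
print("nonzeros in the sparse representation:", np.count_nonzero(x))
```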

5. Compressed sensing

Compressed sensing involves the estimation of a signal using a number of measurements that
are significantly less than its dimension [20]. By assuming that the unknown signal is sparse in
the domain where the measurements were acquired, one could use this sparsity constraint as
prior information to obtain an accurate estimate of the signal from relatively few measurements.

Compressed sensing is closely related to signal compression, which is routinely used for
efficient storage or transmission of signals. Compressed sensing was inspired by this question:
instead of the typical signal acquisition followed by signal compression, is there a way to
acquire (sense) the compressed signal in the first place? If possible, it would significantly
reduce the number of measurements and the computation cost [20]. In addition, this possibility
would allow acquisition of signals that require extremely high, hence impractical, sampling
rates [21]. As an affirmative answer to this question, compressed sensing was developed to
combine signal compression with signal acquisition [20]. This is achieved by designing the
measurement setup to acquire signals in the domain where the unknown signal is assumed to
be sparse.
In compressed sensing, we consider the estimation of an input signal x ∈ R^n from m linear
measurements, where m ≪ n. As discussed above, this problem could be written as an
underdetermined linear system,

y = Ax,  (25)

where y ∈ R^m and A ∈ R^{m×n} represent the measurements and the measurement (sensing)
matrix, respectively.
Assuming that the unknown signal x is s-sparse, i.e., x ∈ Σ_s has only s nonzero elements, in the
domain specified by the measurement (sensing) matrix A, and assuming that A satisfies the
restricted isometry property (RIP) of order 2s, i.e., there exists a constant δ_{2s} ∈ (0, 1) such that

(1 - \delta_{2s}) \| z \|_2^2 \leq \| Az \|_2^2 \leq (1 + \delta_{2s}) \| z \|_2^2,  (26)

for all z ∈ Σ_{2s}, then x could be reconstructed from m ≥ s measurements by different
optimization algorithms [20]. When the measurements y are noiseless, x could be exactly
estimated from

\min_x \| x \|_0 \quad \text{subject to} \quad Ax = y.  (27)

However, when the measurements y are contaminated by noise, x could be obtained as the
regularized least squares estimate,

\hat{x} = \arg\min_x \| Ax - y \|_2^2 + \lambda \| x \|_0.  (28)

This minimization problem could also be mathematically reformulated and solved as a Bayesian
estimation problem.
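An end-to-end compressed sensing sketch (NumPy): a greedy, orthogonal-matching-pursuit-style
solver stands in for the recovery algorithms referenced above, and the sizes, seed, and
amplitudes are all invented for the example. A 4-sparse signal of length 256 is recovered from
100 noiseless random Gaussian measurements:

```python
import numpy as np

def omp(A, y, s):
    """Greedy sparse recovery: at each step, pick the dictionary column most
    correlated with the residual, then re-fit by least squares on the support."""
    support, residual = [], y.copy()
    for _ in range(s):
        j = int(np.argmax(np.abs(A.T @ residual)))
        if j not in support:
            support.append(j)
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x_hat = np.zeros(A.shape[1])
    x_hat[support] = coef
    return x_hat

rng = np.random.default_rng(4)
n, m, s = 256, 100, 4

A = rng.standard_normal((m, n)) / np.sqrt(m)    # random Gaussian sensing matrix
x_true = np.zeros(n)
x_true[rng.choice(n, size=s, replace=False)] = [1.0, -1.0, 1.5, -1.5]

y = A @ x_true                                  # m << n noiseless measurements
x_hat = omp(A, y, s)

print("recovery error:", np.linalg.norm(x_hat - x_true))
```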

6. Obtaining sparse solutions for signal representation and signal estimation problems

From Sections 4 and 5, we note that the problem of obtaining a sparse signal representation,
Eq. (24), and the problem of sparse signal estimation in compressed sensing, Eq. (28), both
have the same mathematical form [11, 22],

\hat{x} = \arg\min_x \| y - Ax \|_2^2 + \lambda \| x \|_0.  (29)

In this section, we describe different approaches to solving this minimization problem. From
Eq. (29), we note that the first term of its RHS, ‖y − Ax‖₂², represents either the signal
reconstruction error (sparse signal representation problem) or the measurement fitting error
(sparse signal estimation in compressed sensing), while the second term of its RHS, ‖x‖₀,
represents the cardinality (number of nonzero coefficients) of the unknown signal. The
regularization parameter λ specifies the tradeoff between these two terms in the objective
function. The selection of an appropriate value of λ to balance the reconstruction, or fitting,
error and signal sparsity is very important. Regularization theory and Bayesian approaches
could provide ways to determine optimal values of λ [23–26].

Convex optimization problems are a class of optimization problems that are significantly easier
to solve than nonconvex problems [34]. Another advantage of convex optimization problems is
that any local solution, e.g., a local minimum, is guaranteed to be a global solution. We note
that obtaining an exact solution for the minimization problem in Eq. (29) is difficult because it
is nonconvex. Therefore, one could either seek an approximate solution to this nonconvex
problem or approximate this problem by a convex optimization problem whose exact solution
could be obtained easily.
Considering the general regularized least squares estimation problem,

\hat{x} = \arg\min_x \| y - Ax \|_2^2 + \lambda \| x \|_p,  (30)

we note that it is a nonconvex optimization problem for 0 ≤ p < 1 and a convex optimization
problem for p ≥ 1. One alternative is to approximate Eq. (29) by a convex optimization problem,
relaxing the strict condition of minimizing the cardinality of the signal, ‖x‖₀, by replacing it
with the sparsity-promoting condition of minimizing the L1 norm of the signal, ‖x‖₁. Another
alternative is to approximate Eq. (29) by another nonconvex optimization problem that is easier
to solve than the original problem using a Bayesian formulation, replacing ‖x‖₀ by ‖x‖_p with
0 < p < 1. The minimization of Eq. (30) using ‖x‖_p, 0 < p < 1, would result in a higher degree
of signal sparsity than when ‖x‖₁ is used. This could be understood visually by examining
Figure 4, which shows the shapes of two-dimensional unit balls using (pseudo)norms with
different values of p.
We explain further details in the following subsections.

6.1. Obtaining a sparse signal solution using L0 minimization


The sparsest solution of the regularized least squares estimation problem, Eq. (29), would be
obtained when p = 0 in ‖x‖_p. As shown in Figure 5, the solution of the regularized least
squares problem, x̂, is given by the intersection of the circles, possibly ellipses, representing the
solution of the unconstrained least squares estimation problem and the unit ball of L0
representing the constraint of minimizing L0. In this case of minimizing L0, the unconstrained
least squares solution will always intersect the unit ball at an axis, thus yielding the sparsest
possible solution. However, as mentioned earlier, this L0 minimization problem is difficult to
solve because it is nonconvex. Approximate solutions for this problem could be obtained using
greedy optimization algorithms, e.g., Matching Pursuits [27] and Least Angle Regression
(LARS) [28].

Figure 4. Two-dimensional unit ball using different (pseudo)norms. (a) L0, (b) L0–1, and (c) L1.

Figure 5. Regularized least squares using L0.

6.2. Obtaining a sparse signal solution using L1 minimization


On relaxing the nonconvex L0-regularized least squares minimization problem by setting
p = 1, we obtain the convex L1 minimization problem. As shown in Figure 4(c), the unit ball of
the L1 norm covers a larger area than the unit ball of the L0 pseudo-norm, shown in
Figure 4(a). Therefore, as shown in Figure 6, the solution of the regularized least squares
problem using L1 minimization would be sparse, but it should not be expected to be as sparse
as that of the L0 minimization problem.

This L1 minimization problem could be solved easily using various algorithms, e.g., Basis
Pursuit [16], Method of Frames (MOF) [29], Lasso [30, 31], and Best Basis Selection [32, 33]. A
Bayesian formulation of this L1 minimization problem is also possible by assuming that the a
priori probability distribution of x is Laplacian, p(x) ∝ e^{−|x|}.
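One concrete L1 solver is the iterative shrinkage-thresholding algorithm (ISTA), a proximal
gradient method; it is not prescribed by the text and is shown here only as a hedged sketch
(NumPy; λ, the iteration count, and the problem sizes are invented):

```python
import numpy as np

def ista(A, y, lam, n_iter=2000):
    """Iterative shrinkage-thresholding for min_x ||y - A x||_2^2 + lam * ||x||_1."""
    L = np.linalg.norm(A, 2) ** 2             # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = x - A.T @ (A @ x - y) / L         # gradient step on the quadratic term
        x = np.sign(z) * np.maximum(np.abs(z) - lam / (2 * L), 0.0)  # soft threshold
    return x

rng = np.random.default_rng(5)
m, n = 40, 100
A = rng.standard_normal((m, n)) / np.sqrt(m)

x_true = np.zeros(n)
x_true[[7, 23, 64]] = [1.5, -2.0, 1.0]        # a 3-sparse signal
y = A @ x_true

x_hat = ista(A, y, lam=0.05)
print("support found:", np.flatnonzero(np.abs(x_hat) > 0.1))
```

The soft-thresholding step is the proximal operator of the L1 penalty; it drives small
coefficients exactly to zero, which is how the L1 relaxation produces sparse solutions.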

6.3. Obtaining a sparse signal solution using L0–1 minimization


As discussed above, solving the regularized least squares problem with L0 minimization
should yield the sparsest signal solution. However, only approximate solutions are available
for this difficult nonconvex problem. Alternatively, solving the regularized least squares
problem with L1 minimization yields an exact sparse solution that would be less sparse than in
the L0 case, but it is considerably easier to obtain.

Figure 6. Regularized least squares using L1.

The regularized least squares problem could also be formulated as an L0–1 minimization
problem. As ‖x‖_p, 0 < p < 1, which we abbreviate as L0–1, is not an actual norm, this
optimization problem would be nonconvex [34]. The advantage of using L0–1 minimization is
that, as shown in Figure 4(b), compared to the unit ball of the L1 norm, the unit ball of the
L0–1 pseudo-norm has a narrower area that is concentrated around the axes. Therefore, as
shown in Figure 7, the L0–1 minimization problem should yield a sparser solution than the L1
minimization problem.

Figure 7. Regularized least squares using L0–1.

Figure 8. Product of two student-t probability distributions.

Another advantage of using L0–1 minimization is that this nonconvex optimization problem
could be easily formulated as a Bayesian estimation problem that could be solved using
Markov Chain Monte Carlo (MCMC) methods. As shown in Figure 8, the product of student-t
probability distributions has a shape similar to the unit ball of the L0–1 pseudo-norm, so
student-t distributions could be used as a priori distributions to approximate the L0–1
pseudo-norm.

6.4. Bayesian method to obtain a sparse signal solution using L0–1 minimization

As mentioned in Section 3.5, the first step to obtaining one of the many possible Bayesian
estimates of x is to use Bayes' rule to obtain the a posteriori pdf,

f(x|y) = \frac{f(y|x)\, f(x)}{\int f(y|x)\, f(x)\, dx}.  (31)

Using this a posteriori distribution, one could obtain a sparse signal solution using L0–1
minimization as the maximum a posteriori (MAP) estimate given by Eq. (15). Compared to
other Bayesian estimates, the MAP estimate could be easier to obtain because the calculation
of the normalizing constant, ∫ f(y|x) f(x) dx, would not be needed. The maximization of the
product of the conditional probability distribution of y given x and the a priori distribution of
x is equivalent to minimizing the sum of their negative logarithms,

\hat{x}_{MAP} = \arg\min_x \left[ -\log p(y|x) - \log p(x) \right].  (32)

In the case of white Gaussian measurement noise, p(y|x) = N_x(Ax, σ²I), where −log p(y|x) ∝
‖y − Ax‖₂², which is the first term of the RHS of Eq. (30). As discussed in the previous section,
the a priori probability p(x) corresponding to L0–1 minimization could be represented as a
product of univariate student-t probability distribution functions [14],

p(x) = \prod_{i=1}^{M} \mathrm{stud}_{x_i}[0, 1, \vartheta]
     = \prod_{i=1}^{M} \frac{\Gamma\left(\frac{\vartheta + 1}{2}\right)}{\sqrt{\vartheta \pi}\, \Gamma\left(\frac{\vartheta}{2}\right)} \left( 1 + \frac{x_i^2}{\vartheta} \right)^{-\frac{\vartheta + 1}{2}},  (33)

where Γ is the Gamma function and ϑ is the number of degrees of freedom of the student-t
distribution. Since this a priori distribution function is not an exponential function, we would
use Eq. (15) instead of Eq. (32) to obtain the MAP estimate.
Because the prior is not a Gaussian distribution, there is no simple closed-form expression for
the posterior p(x|y) with a student-t a priori probability distribution. However, we could
express each student-t distribution as an infinite weighted sum of Gaussian distributions,
where the hidden variables h_i determine their variances [14],

p(x) = \prod_{i=1}^{M} \int N_{x_i}(0, 1/h_i)\, \mathrm{Gam}_{h_i}[\vartheta/2, \vartheta/2]\, dh_i
     = \int N_x(0, H^{-1}) \prod_{i=1}^{M} \mathrm{Gam}_{h_i}[\vartheta/2, \vartheta/2]\, dH,  (34)

where the matrix H contains the hidden variables \{h_i\}_{i=1}^{M} on its diagonal and has zeros
elsewhere, and \mathrm{Gam}_{h_i}[\vartheta/2, \vartheta/2] is the gamma probability distribution function with
parameters (ϑ/2, ϑ/2). Using this representation, the a posteriori pdf could be written as

p(x|y) ∝ p(y|x)\, p(x) = N_x(Ax, \sigma^2 I) \int N_x(0, H^{-1}) \prod_{i=1}^{M} \mathrm{Gam}_{h_i}\!\left[ \frac{\vartheta}{2}, \frac{\vartheta}{2} \right] dH
       = \int N_x(Ax, \sigma^2 I)\, N_x(0, H^{-1}) \prod_{i=1}^{M} \mathrm{Gam}_{h_i}\!\left[ \frac{\vartheta}{2}, \frac{\vartheta}{2} \right] dH.  (35)

The product of two Gaussian distributions is also a Gaussian distribution [35],

N_x(\mu_1, \Sigma_1)\, N_x(\mu_2, \Sigma_2) = k \cdot N_x(\mu, \Sigma),  (36)

where the mean and covariance (μ, Σ) of the new Gaussian distribution in Eq. (36) are given by

\mu = \left( \Sigma_1^{-1} + \Sigma_2^{-1} \right)^{-1} \left( \Sigma_1^{-1} \mu_1 + \Sigma_2^{-1} \mu_2 \right) \quad \text{and} \quad \Sigma = \left( \Sigma_1^{-1} + \Sigma_2^{-1} \right)^{-1},  (37)

and k is a constant. Therefore, we could simplify the product of the two Gaussian distributions
in Eq. (35) as

N_x(Ax, \sigma^2 I) \cdot N_x(0, H^{-1}) = k \cdot N_x\!\left( \left( \sigma^{-2} I + H \right)^{-1} \sigma^{-2} A x, \; \left( \sigma^{-2} I + H \right)^{-1} \right).  (38)

From Eqs. (35) and (38), we could write p(x|y) as

p(x|y) = k \int N_x\!\left( \left( \sigma^{-2} I + H \right)^{-1} \sigma^{-2} A x, \; \left( \sigma^{-2} I + H \right)^{-1} \right) \prod_{i=1}^{M} \mathrm{Gam}_{h_i}\!\left[ \frac{\vartheta}{2}, \frac{\vartheta}{2} \right] dH.  (39)

We still could not compute the integral in Eq. (39) in closed form. However, we could
maximize the integrand of Eq. (39) over the hidden variables H to obtain an approximation of
the a posteriori probability distribution function,

p(x|y) \approx \max_H \left[ N_x\!\left( \left( \sigma^{-2} I + H \right)^{-1} \sigma^{-2} A x, \; \left( \sigma^{-2} I + H \right)^{-1} \right) \prod_{i=1}^{M} \mathrm{Gam}_{h_i}\!\left[ \frac{\vartheta}{2}, \frac{\vartheta}{2} \right] \right].  (40)
2 2

Eq. (40) would be a good approximation of p(x|y) if the actual distribution over the hidden
variables is concentrated tightly around its mode [14]. When h_i has a large value, the
corresponding i-th component of the a priori probability distribution function p(x) would have
a small variance, 1/h_i, so that this i-th component of p(x) could be set to zero. Therefore, this
i-th dimension of the prior p(x) would not contribute to the solution of Eq. (30), thus increasing
its sparsity.

Since both the Gaussian and gamma pdfs in Eq. (40) are members of the exponential family of
probability distributions, we could obtain x̂_MAP by maximizing the sum of their logarithms.
Section 3.5 in [11] and Section 8.6 in [14] describe an iterative optimization method to obtain
x̂_MAP from the approximate a posteriori probability distribution function given by Eq. (40).
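A small numerical illustration of why the student-t product prior promotes sparsity (NumPy;
ϑ = 1 is chosen arbitrarily). Under Eq. (33), a sparse vector receives a higher prior density than
a dense vector of equal energy, whereas an isotropic Gaussian prior, which depends only on
‖x‖₂, cannot distinguish them:

```python
import numpy as np
from math import lgamma, log, pi

def log_student_t_prior(x, nu=1.0):
    """Log of the product of univariate student-t pdfs, as in Eq. (33)."""
    c = lgamma((nu + 1) / 2) - lgamma(nu / 2) - 0.5 * log(nu * pi)
    return float(np.sum(c - (nu + 1) / 2 * np.log1p(x ** 2 / nu)))

# Two vectors with identical L2 norm: one sparse, one dense.
x_sparse = np.array([2.0, 0.0, 0.0, 0.0])
x_dense = np.array([1.0, 1.0, 1.0, 1.0])

lp_sparse = log_student_t_prior(x_sparse)
lp_dense = log_student_t_prior(x_dense)

# An isotropic Gaussian prior would assign both vectors the same density;
# the student-t product assigns more mass to the sparse one.
print("log prior (sparse):", lp_sparse)
print("log prior (dense): ", lp_dense)
```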

7. Conclusion

In this chapter, we described different methods to estimate an unknown signal from its linear
measurements. We focused on the underdetermined case where the number of measurements
is less than the dimension of the unknown signal. We introduced the concept of signal sparsity
and described how it could be used as prior information for either regularized least squares or
Bayesian signal estimation. We discussed compressed sensing and sparse signal representation
as examples where these sparse signal estimation methods could be applied.

Author details

Ishan Wickramasingha1, Michael Sobhy2 and Sherif S. Sherif1*


*Address all correspondence to: [email protected]
1 Department of Electrical and Computer Engineering, University of Manitoba, Winnipeg,
Canada

2 Biomedical Engineering Graduate Program, University of Manitoba, Winnipeg, Canada

References

[1] Keesman KJ. System Identification: An Introduction. London: Springer Science & Business
Media; 2011
[2] Von Bertalanffy L. General System Theory. New York; 1968

[3] Chen C-T. Linear System Theory and Design. New York, NY: Oxford University Press,
Inc.; 1999
[4] Moon TK, Stirling WC. Mathematical Methods and Algorithms for Signal Processing.
Upper Saddle River, NJ: Prentice Hall; 2000
[5] Fieguth P. Statistical Image Processing and Multidimensional Modeling. New York, NY:
Springer Science+Business Media, LLC; 2011
[6] Tarantola A. Inverse problem theory and methods for model parameter estimation.
Philadelphia, PA: Society for Industrial and Applied Mathematics; 2005. p. 1-37
[7] Ljung L. Perspectives on system identification. Annual Reviews in Control. 2010 Apr
30;34(1):1-2
[8] Wellstead PE. Non-parametric methods of system identification. Automatica. 1981 Jan
1;17(1):55-69
[9] Shanmugan KS, Breipohl AM. Random Signals: Detection, Estimation, and Data Analysis.
New York, NY: Wiley; 1997
[10] Saad Y. Iterative Methods for Sparse Linear Systems. Philadelphia, PA: Society for Indus-
trial and Applied Mathematics; 2003
[11] Bishop CM. Pattern Recognition and Machine Learning. New York, NY: Springer; 2006
[12] Mendel JM. Lessons in Estimation Theory for Signal Processing, Communications, and
Control. Englewood Cliffs, N.J.: Prentice-Hall; 1995
[13] Sorenson HW. Least-squares estimation: From Gauss to Kalman. IEEE Spectrum. 1970
Jul;7(7):63-68
[14] Prince SJ. Computer Vision: Models, Learning, and Inference. Cambridge: Cambridge
University Press; 2012
[15] Mallat S. A Wavelet Tour of Signal Processing: The Sparse Way. Amsterdam: Academic
Press; 2009
[16] Chen SS, Donoho DL, Saunders MA. Atomic decomposition by basis pursuit. SIAM
Review. 2001;43(1):129-159
[17] Halmos PR. Finite-Dimensional Vector Spaces. Mineola, NY: Dover Publications; 2017
[18] Mallat S. A Wavelet Tour of Signal Processing. San Diego: Academic Press; 1999
[19] Shannon CE. A mathematical theory of communication. ACM SIGMOBILE Mobile Com-
puting and Communications Review. 2001;5(1):3-55
[20] Eldar YC, Kutyniok G, editors. Compressed Sensing: Theory and Applications. Cam-
bridge: Cambridge University Press; 2012
[21] Asif MS. Dynamic compressive sensing: Sparse recovery algorithms for streaming signals
and video [Doctoral dissertation]. Georgia Institute of Technology

[22] Huang K, Aviyente S. Sparse representation for signal classification. In: NIPS. Vol. 19;
2006. pp. 609-616
[23] Poggio T, Torre V, Koch C. Computational vision and regularization theory. Nature. 1985
Sep 26;317(6035):314-319
[24] Tikhonov AN, Arsenin VI. Solutions of Ill-posed Problems. Washington, DC: Winston;
1977 Jan

[25] Wahba G, Wendelberger J. Some new mathematical methods for variational objective
analysis using splines and cross validation. Monthly Weather Review. 1980 Aug;108(8):
1122-1143

[26] Lin Y, Lee DD. Bayesian L1-Norm Sparse Learning. In: 2006 IEEE International Confer-
ence on Acoustics Speech and Signal Processing Proceedings. Toulouse, France: vol. 5;
2006. p. V–V.

[27] Mallat SG, Zhang Z. Matching pursuits with time-frequency dictionaries. IEEE Trans-
actions on Signal Processing. 1993 Dec;41(12):3397-3415
[28] Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression. The Annals of Statis-
tics. 2004 Apr;32(2):407-499
[29] Daubechies I. Time-frequency localization operators: A geometric phase space approach.
IEEE Transactions on Information Theory. 1988 Jul;34(4):605-612
[30] Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society. Series B (Methodological). 1996 Jan 1;58(1):267–288
[31] Yuan M, Lin Y. Model selection and estimation in regression with grouped variables.
Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2006 Feb 1;68(1):
49-67

[32] Coifman RR, Wickerhauser MV. Entropy-based algorithms for best basis selection. IEEE
Transactions on Information Theory. 1992 Mar;38(2):713-718

[33] Rao BD, Kreutz-Delgado K. An affine scaling methodology for best basis selection. IEEE
Transactions on Signal Processing. 1999 Jan;47(1):187-200
[34] Boyd S, Vandenberghe L. Convex Optimization. New York: Cambridge University Press;
2004.
[35] Bromiley P. Products and convolutions of Gaussian probability density functions. Tina-
Vision Memo. 2003;3(4):1–13.
DOI: 10.5772/intechopen.70059

Chapter 15

Dynamic Bayesian Network for Time-Dependent Classification Problems in Robotics

Cristiano Premebida, Francisco A. A. Souza and Diego R. Faria

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/intechopen.70059

Abstract
This chapter discusses the use of dynamic Bayesian networks (DBNs) for time-dependent
classification problems in mobile robotics, where Bayesian inference is used to infer the
class, or category of interest, given the observed data and prior knowledge. Formulating
the DBN as a time-dependent classification problem, and by making some assumptions, a
general expression for a DBN is given in terms of classifier priors and likelihoods through
the time steps. Since multi-class problems are addressed, and because of the number
of time slices in the model, additive smoothing is used to prevent the values of priors
from being close to zero. To demonstrate the effectiveness of DBN in time-dependent
classification problems, some experimental results are reported regarding semantic place
recognition and daily-activity classification.

Keywords: dynamic Bayesian network, Bayesian inference, probabilistic classification,


mobile robotics, social robotics

1. Introduction

Bayesian inference finds applications in many areas of engineering, and mobile robotics is not
an exception. When time is a variable to be considered, the dynamic Bayesian network (DBN)
[1–5] is a powerful approach to be considered. Due to its graphical representation and model-
ling versatility, DBN facilitates the problem-solving process in probabilistic time-dependent
applications. Therefore, DBNs provide an effective way to model time-based (dynamic) prob-
abilistic problems and also enable a suitable and intuitive representation by means of a
graph structure.

© 2017 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use,
distribution, and reproduction in any medium, provided the original work is properly cited.

Depending on the structure of the DBN, the joint probability distribution that governs a
given system can be decomposed into a tractable product of probabilities, where the
conditional terms depend only on their directly linked nodes. This chapter concentrates on
inference problems using DBNs where the variable to be inferred from a feature vector
(data) represents a set of semantic classes 𝒞 = {c1, c2, … , cnc} or categories, in the context of
intelligent perception systems for mobile robotics applications. Namely, we will address
problems where 𝒞 denotes semantic places in a given indoor environment [6, 7], e.g. 𝒞 =
{'corridor', 'office', … , 'kitchen'}, and also problems where the classes of interest are daily-life
activities, 𝒞 = {'drinking', 'talking', … , 'walking'} [8, 9].

The principle of Bayesian inference basically depends on two elements: the prior and the
likelihood; in practical problems, the evidence probability acts 'only' as a normalization
that guarantees that the posterior sums to one. In this chapter, we will deal with problems of
the classical Bayesian form posterior ∝ likelihood · prior, but the incorporation of (past) time will
be modelled explicitly on a discrete-time basis, with the past information assumed to be
contained in the prior probabilities. Inference will be considered beyond the first-order Markov
assumption, which means that a DBN with a finite number of time slices (T) will be addressed.
The current time step t and previous/past time steps will be considered in the formulation of
the DBN; thus, the time interval is {t, t − 1, … , t − T}.

The observed data enters the DBN in the form of a vector of features X calculated from
sensory data; examples of sensors are laser scanners (or 2D Lidar) and RGB-D camera, as
shown in Figure 1. Later, in the formulation of the DBN, we will consider that the feature
vector at a given time step (Xt) is conditionally independent of previous time steps; there-
fore, P(X t | X t−1) = P(X t ).

The use of Bayesian inference in mobile robotics for purposes of localization, simultaneous
localization and mapping (SLAM), object detection, path planning and navigation has been
addressed in many scientific works; see Ref. [10] for a review. The majority of those
applications involve stochastic filtering, such as the Kalman filter (KF), the particle filter (PF),
Monte Carlo techniques and the hidden Markov model (HMM) [11, 12]. However, when the
parameter of interest has to be inferred from multidimensional feature vectors (e.g. feature
vectors with hundreds of elements), and also when the distribution from which the observed
data were drawn is not known (i.e. in unseen or testing scenarios), a DBN can be used to
handle such complex problems. In robotics, semantic place classification [6, 7] and activity
recognition [8, 9] are examples of such problems and belong to the research area of pattern
recognition. For

Figure 1. Sensors commonly used in mobile robotics for perception systems.



these application cases, the class-conditional probabilities (or likelihoods) can be mod-
elled using machine learning techniques, for example, naive Bayes classifier (NBC), sup-
port vector machines (SVMs) and artificial neural networks (ANNs) [13, 14].
The remainder of this chapter is organized as follows: a brief review of the DBN is given in
Section 2. Section 3 addresses inference in DBN, formulated for purposes of pattern recog-
nition in robotics, followed by the use of additive smoothing on the prior distributions. In
Section 4, experimental results on semantic place classifications and activity recognition are
presented. Finally, Section 5 presents our conclusions.

2. Preliminaries on DBN

Basically, a DBN is used to express the joint probability of the events that characterize a time-
based (dynamic) system, where the relationships between events are expressed by conditional
probabilities. Given evidence (observations) about events of the DBN, and prior
probabilities, statistical inference is accomplished using the Bayes theorem. Inference in pat-
tern recognition applications is the process of estimating the probability of the classes/catego-
ries given the observations, the class-conditional probabilities, and the priors [15, 16]. When
time is involved, the system is usually assumed to evolve according to the first-order Markov
assumption and, as a consequence, a single time slice is considered.

In this chapter, we address DBN structures with more than one time slice. Moreover, the
conditional probabilities of the DBN will be modelled by supervised machine learning tech-
niques (also known as classifier or classification method). Two case studies will be particularly
discussed: activity recognition for human-robot interaction and semantic place classification
for mobile robotics navigation.
The observed data variable, denoted by X = { X1, … , Xnx}, enters into the DBN in the form of
conditional probabilities P(X|C ), where the values of X are feature vectors. To give an idea
of the dimensionality of X, in semantic place classification [6], the number of features can be
nx = 50, while in activity recognition we have 51 features [8]. Given such dimensionalities,
which can be even higher, it becomes infeasible to estimate the probability distribution that
characterizes P(X|C ) without the use of advanced algorithms. Although a simple Naïve Bayes
classifier can be incorporated in a DBN to model P(X|C ), more powerful solutions, such as the
ensemble of classifiers in the DBMM approach introduced in Ref. [8], tend to achieve higher
classification performance.
In summary, a DBN is a directed acyclic graph (DAG) that consists of a finite set of events (the
nodes or vertices) connected through edges (or arcs) that model the dependencies among
the events and also the time variable. Here, the nodes are given by the variables {X, C}, and
the dynamic (time-based) behaviour of the DBN is considered to be governed by the current
time t and by a finite set of previous time slices {t − 1, t − 2, … , t − T}; future time slices will
not be considered. Figure 2 shows the structure of the DBN, with T + 1 time slices, that will be
considered in the problem formulation presented in the sequel.

Figure 2. An example of a DBN with T + 1 time slices and two nodes {C, X}.

3. Inference with DBN

The problem is formulated by considering P(X t, X t−1, … , X t−T, C t, C t−1, … , C t−T ) i.e., the joint dis-
tribution of the nodes over the time up to T. The goal is to infer the current-time value of the
class Ct given the data X t:t−T = { X t, X t−1, … , X t−T} and the prior knowledge of the class, which is
attained by the a-posteriori probability P(C t | C t−1:t−T, X t:t−T). The superscript notation denotes
the set of values over a time interval: {t : t − T} = {t, t − 1, t − 2, … , t − T}.

The simplest case is for a single time slice where the posterior reduces to P(C t | X t) ∝ P(X t | C t)P(C t ).
For two time slices, we have

P(C t | C t−1, X t:t−1) ∝ P(X t | X t−1, C t:t−1)P(X t−1 | C t:t−1)P(C t | C t−1)P(C t−1). (1)

As the number of time slices increases, the problem of inferring the class becomes more
complex; therefore, some assumptions can be made in order to find a tractable solution. As a
first assumption, let the nodes be independent of later (subsequent in time) nodes. As a
consequence, and taking the example for T = 1, the probability P(X t−1 | C t:t−1) = P(X t−1 | C t−1),
that is, the node X t−1 does not depend on the node C t, which occurs one time slice later. The
second, stronger, assumption is that the feature-vector node X is independent across all time
slices; hence, following the previous example, P(X t | X t−1, C t:t−1) becomes P(X t | C t:t−1). Given
these two assumptions, we can state the general problem of calculating the posterior
probability of a DBN with T + 1 time slices by the expression

P(C t | C t−1:t−T, X t:t−T) = (1/β) ∏_{k=t−T}^{t} P(X k | C k) P(C k).   (2)

where β is the scale (normalization) factor to guarantee that the values of the a-posteriori sum
to one. The class-conditional probabilities P(X k| C k) come from a supervised classifier or from
an ensemble of classifiers as in Ref. [8], while P(C k)  assumes the value of the previous posterior
probability; thus, P(C t) ← posterior t−1.

This strategy for ‘updating’ the values of the prior by taking the values of previous posteriors
is a very common and effective technique used in Bayesian sequential systems. The steps
involved in the calculation of the posterior probability, as expressed in Eq. (2), are illustrated
in Figure 3.
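As a toy sketch of these steps (not the authors' implementation; the likelihood values below are invented for illustration), the posterior of Eq. (2) with the recursive prior update P(C t) ← posterior t−1 can be computed as:

```python
import numpy as np

def dbn_posterior(likelihoods, prior):
    """Posterior of Eq. (2) over a window of T+1 time slices.

    likelihoods : array (T+1, nc); row k holds P(X^k | C_i) for each
                  class i, as produced by a probabilistic classifier.
    prior       : array (nc,); P(C) at the oldest slice, k = t-T.
    """
    p = np.asarray(prior, dtype=float)
    for lik in likelihoods:       # k = t-T, ..., t
        p = lik * p               # P(X^k | C^k) P(C^k)
        p /= p.sum()              # 1/beta normalization
        # the prior for the next slice is this posterior: P(C^{k+1}) <- posterior^k
    return p

# toy example: three classes, T + 1 = 3 time slices
liks = np.array([[0.7, 0.2, 0.1],
                 [0.6, 0.3, 0.1],
                 [0.5, 0.4, 0.1]])
posterior = dbn_posterior(liks, np.full(3, 1 / 3))
print(posterior)   # sums to one; the first class dominates
```

Because the per-slice terms multiply, classes with near-zero prior are quickly suppressed, which is the effect that motivates the additive smoothing discussed later in this section.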
Selection of the class-conditional model to express P(X|C) is an important part of the approach
and can be achieved by well-known probabilistic machine learning methods. Although gener-
ative methods (e.g. Naïve Bayes, GMM and HMM) provide direct probabilistic interpretation
and, therefore, constitute appropriate choices, discriminative methods (e.g. SVM, random for-
est and ANN) tend to have better classification performance. However, to be a suitable model,
a given discriminative method has to be of a probabilistic form; this implies, at least, that the
outcomes from the classifier sum to one. A more advanced method can be used to model P(X|C)
in a DBN, such as the dynamic Bayesian mixture model (DBMM) [8], where a mixture of n
classifiers is used to model the conditional probability, which assumes the form
P(X|C) = ∑_{j=1}^{n} ω_j P_j(X|C), where ω_j are the weighting parameters and P_j(X|C) are the
probabilities from the classifiers. Further details are provided in Ref. [6].
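A minimal sketch of such a mixture is shown below; the base-classifier outputs and the weights ω_j are assumed values for illustration, not those learned in Refs. [6, 8]:

```python
import numpy as np

def dbmm_likelihood(classifier_probs, weights):
    """DBMM class-conditional term: P(X|C) = sum_j w_j P_j(X|C),
    a convex combination of n base-classifier outputs.

    classifier_probs : array (n, nc); row j is P_j(X|C) over nc classes
    weights          : array (n,); non-negative and summing to one
    """
    return np.asarray(weights, float) @ np.asarray(classifier_probs, float)

# assumed outputs of three probabilistic base classifiers over four classes
P = np.array([[0.6, 0.2, 0.1, 0.1],
              [0.5, 0.3, 0.1, 0.1],
              [0.4, 0.3, 0.2, 0.1]])
w = np.array([0.5, 0.3, 0.2])   # e.g. entropy-based global weights (assumed)
mix = dbmm_likelihood(P, w)
print(mix)   # a convex combination of distributions is still a distribution
```

Since each row of P sums to one and the weights are convex, the mixture also sums to one, so it can be used directly as the class-conditional term P(X|C) in the DBN.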
The product of likelihoods and priors in the expression of the a-posteriori, Eq. (2), has the
consequence of penalizing the classes that are less likely to occur. In other words, the classes
with low probability, i.e. close to zero, will have even lower posterior values; this effect is
intensified as the number of time slices increases. Because the priors are recursively assigned
the values of the previous posteriors, we suggest using additive smoothing to prevent the
prior values from getting very close to zero.
Additive smoothing, also called Lidstone smoothing, adds a term (α) to the prior distribution
and can be expressed as

Figure 3. This figure illustrates the DBN, with T + 1 time slices, as formulated according to the assumptions presented
in Section 3. The product of likelihoods and priors, over the time interval [t – T, t], becomes the posterior probability as
expressed in Eq. (2).

P̂(Ci) = (P(Ci) + α) / (1 + α · nc),  i = 1, … , nc   (3)

where α is the additive smoothing factor and nc is the number of classes. The influence of α on
the smoothed prior P̂(Ci) has to be such that the values of P̂(Ci) are greater than zero
(P̂(Ci) > 0, ∀ i) and, moreover, the smoothed prior distribution should remain consistent (the
values of P̂(Ci) must of course sum to one). A practical range is 0 < α < 0.1.

Figure 4 provides an example of the impact of α on a given prior, with values of α equal to
{0, 0.01, 0.05 and 0.1}. As the value of α increases, the prior distribution tends to lose its initial
definiteness due to the uniform ‘bias’ introduced by α. In the example shown in Figure 4, we
have considered a five-class case (nc = 5).
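The smoothing of Eq. (3) can be sketched as follows; the five-class prior below is an assumed example in the spirit of Figure 4:

```python
import numpy as np

def smooth_prior(prior, alpha):
    """Additive (Lidstone) smoothing of Eq. (3):
    P_hat(C_i) = (P(C_i) + alpha) / (1 + alpha * nc)."""
    prior = np.asarray(prior, dtype=float)
    return (prior + alpha) / (1.0 + alpha * prior.size)

# an assumed five-class prior (nc = 5) with one class collapsed to zero
prior = np.array([0.0, 0.05, 0.10, 0.15, 0.70])
for alpha in (0.0, 0.01, 0.05, 0.1):
    p = smooth_prior(prior, alpha)
    print(alpha, p.round(3))   # for alpha > 0, every entry is strictly positive
```

The denominator keeps the smoothed distribution normalized, so P̂ still sums to one for any α ≥ 0, while larger α pulls the prior toward the uniform distribution.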

Figure 4. An example of the influence of the additive factor (α) on a given P̂(Ci),  i = 1, … , 5.

4. Experiments on classification: mobile robotics case studies

In order to demonstrate the use of the DBN as formulated above, we will consider two clas-
sification problems that find applications in mobile robotics: semantic place recognition [6]
and activity classification [8].

4.1. Semantic place recognition

Figure 5 illustrates a probabilistic system for semantic place recognition where data comes
from a laser scanner sensor. In a practical application, the sensor is mounted on-board a
mobile robot [6, 7]. Based on Figure 5, we can make a direct correspondence with the DBN
discussed above by verifying that the feature vector is X, the probabilistic classifier outputs
the class-conditional probability P(X|C ) and the priors transmit the time-based information
through the network.

As an example of the DBN application in semantic place classification, let us report some
results from Ref. [6], where a DBN was applied to the image database for robot localization
(IDOL) dataset, available at http://www.cas.kth.se/IDOL/. In this context, the problem of
semantic place classification can be stated as follows: ‘given a set of features, calculated on

Figure 5. Illustration of a time-dependent probabilistic system applied to semantic place recognition. In this system,
data are obtained from a laser scanner.

data from laser scanner sensors (installed on-board a mobile robot), determine the semantic
robot location (‘corridor’, ‘room’, ‘office’, etc) by using a classification method’. The experi-
ments in Ref. [6] use a mixture of classifiers to model the class-conditional probability in the
DBN; such approach is called DBMM [8].

Figure 6 shows recognition results in a sequence of nine frames from the IDOL dataset,
where the first row depicts images of indoor places as captured by a camera mounted on-
board a mobile robot. The second row provides classification results without time slices (i.e.
time-based prior probabilities are not incorporated into the DBN), and the subsequent rows
show classification probabilities for a DBN with time-slices up to three. In the figure, the
vertical line (in red) indicates the transition between classes: from the class ‘kitchen’ (KT) to
the class ‘corridor’ (CR).

4.2. Activity classification

In the case of the activity classification problem described here, the objective is to classify the
human’s daily activity based on spatiotemporal skeleton-based features. In such a case, mobile
robots equipped with appropriate cameras can make use of such classification models to

Figure 6. Classification results on a five-class semantic place recognition problem, extracted from reference [6], using a
DBN with mixture models of three classifiers (DBMM [6, 8]).

improve the quality of life of, for example, elderly people, by assisting them in their daily life
or detecting anomalous situations. Similar to the semantic place recognition problem, the activity
classification problem can also be seen as a time-dependent probabilistic system, where the
feature vector X is the skeleton-based features. From Ref. [8], we report some results on the
activity classification.

Figure 7 exhibits an activity classification framework, based on Ref. [8], which uses a DBN
with mixture models (the DBMM approach as previously described in the semantic place
classification problem), where the data is acquired by using an RGB-D sensor, followed
by the skeleton detection step and the feature extraction process, where the latter is based
on geometrical features. From the training stage, global weights are computed using an
uncertainty measure (e.g. entropy) as a confidence level for each base classifier based on
their performance on the training set. During the test, given the input data (i.e. skeleton
features for the current activity), base classifiers are used and merged as mixture mod-
els with time slices (using previous time instant classification) to reinforce the current
classification.

The well-known Cornell Activity Dataset (CAD-60) [9, 17] for activity recognition was
used to evaluate the proposed framework in Refs. [8, 18]. The CAD-60 dataset comprises
video sequences and skeleton data of human daily activities acquired from an RGB-D sensor.
There are 12 human daily activities performed by four different subjects (two male and two
female, one of them left-handed), grouped in five different environments: office, kitchen,

Figure 7. Illustration of a time-dependent probabilistic system applied to activity classification. In this system, data
are obtained from an RGB-D camera, which provides the spatiotemporal skeleton-based features.

bedroom, bathroom and living room. Additionally, the CAD-60 dataset has two more
activities (random movements and still), which are used for classification assessment on test
sets, in order to evaluate the precision and generalization capacity of the approaches, since
these activities encompass movements similar to some of the other activities. We have adopted
the same strategy described in Ref. [17], so we present the classification results in terms of
precision (Prec) and recall (Rec) for each scenario. The evaluation was carried out using
leave-one-out cross-validation. The idea is to verify the generalization capacity of the classifier
by using the ‘new person’ strategy, i.e. learning from different persons and testing with an
unseen person. The classification is made frame-by-frame to account for the accuracy of the
frames correctly classified.
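As a small sketch of this frame-by-frame evaluation (the frame labels below are hypothetical, not taken from CAD-60):

```python
import numpy as np

def per_class_precision_recall(y_true, y_pred, classes):
    """Frame-by-frame precision and recall for each activity class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    stats = {}
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))   # correctly labelled frames
        prec = tp / max(np.sum(y_pred == c), 1)      # fraction of predicted frames
        rec = tp / max(np.sum(y_true == c), 1)       # fraction of ground-truth frames
        stats[c] = (float(prec), float(rec))
    return stats

# hypothetical per-frame labels for one 'new person' test sequence
y_true = ['drink', 'drink', 'talk', 'talk', 'talk', 'still']
y_pred = ['drink', 'talk',  'talk', 'talk', 'talk', 'still']
print(per_class_precision_recall(y_true, y_pred, ['drink', 'talk', 'still']))
```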

Results show that the DBMM approach obtained better classification performance than
other state-of-the-art methods presented in the ranked table in Ref. [17]. The overall results
were precision: 94.83%; recall: 94.74% and accuracy: 94.74%. Figure 8 presents the classifica-
tion performance (i.e. precision and recall) for the ‘new person’ tested in each scenario. For
comparison purposes, Table 1 summarizes the results in terms of accuracy of state-of-the-art
single classifiers and a simple averaged ensemble compared with the proposed DBMM for
the bedroom (scenario with more misclassification), showing that our approach outperforms
other classifiers. The classification performance in terms of overall accuracy, precision and
recall has shown that our proposed framework outperforms state-of-the-art methods that use
the same datasets [17].
In this section, we have shown the DBMM [8, 18] performance using an offline dataset.
Additionally, further tests using a mobile platform with an on-board RGB-D sensor running
on-the-fly in an assisted-living context were also successfully validated, with accuracy above

Figure 8. Performance on the CAD-60 (‘new person’). Results are reported in terms of precision (Prec) and recall (Rec)
and an average (AV) per scenario. Overall AV: precision 94.83%; recall: 94.74%. Activities in (a): Act1—rinsing water;
Act2—brushing teeth; Act3—wearing lens; Act4—random + still; activities in (b): Act1—talking on phone; Act2—
drinking water; Act3—opening container; Act4—random + still; activities in (c): Act1—talking on phone; Act2—drinking
water; Act3—talking on couch; Act4—relaxing on couch; Act5—random + still; activities in (d): Act1—drinking water;
Act2—cooking chopping; Act3—cooking stirring; Act4—opening container; Act5—random + still; activities in (e):
Act1—talking on phone; Act2—writing on whiteboard; Act3—drinking water; Act4—working on computer; Act5—
random + still.

Location   Activity   Bayes     ANN       SVM       AV        DBMM

Bedroom    1          79.90%    74.70%    74.90%    76.50%    84.10%
           2          72.70%    76.60%    81.40%    76.90%    86.40%
           3          79.60%    91.10%    93.10%    87.90%    98.30%
           4          65.70%    93.50%    92.60%    83.90%    97.40%
           Average    74.48%    83.98%    85.50%    81.30%    91.55%

Activity: 1—talk.on phone, 2—drink.water, 3—open.container, 4—random + still.

Table 1. Results in terms of accuracy on the bedroom scenario of the CAD-60 dataset (‘new person’) using single
classifiers, a simple averaged ensemble (AV) and the DBMM.

90%, as reported in Ref. [18]. More details about the DBMM using a mobile robot for activity
recognition and a video showing the classification performance can be found in Ref. [18].

5. Conclusion

In this chapter, the authors have presented a DBN formulation for the classification of time-
dependent problems, together with experimental results on two mobile robotics applications:
the first regarding semantic place classification and the second activity classification. In both
formulations, the DBN was used as the basis to compose the DBMM [6, 8, 18], a more complex
structure used to handle more complex scenarios. In both applications, the DBMM has shown
to be a powerful choice for modelling time-dependent scenarios.

When it comes to semantic place classification, the model could detect class transitions during
robot navigation, thanks to the use of several time slices (i.e. more than two) and the additive
smoothing used in the model. In the case of activity recognition, since the activities in the
dataset do not have class transitions, i.e. only one activity is performed during a task, a simple
version of the DBMM using only one time slice is enough to correctly classify all activities.
For real-time applications using a mobile robot, and in accordance with the experimental
results reported in Ref. [6], it is suggested to use more than two time slices in the model.

Author details

Cristiano Premebida1*, Francisco A. A. Souza1 and Diego R. Faria2


*Address all correspondence to: [email protected]
1 Institute of Systems and Robotics, University of Coimbra, Coimbra, Portugal
2 School of Engineering & Applied Science, Aston University, Birmingham, UK

References

[1] Friedman N, Murphy K, Russell S. Learning the structure of dynamic probabilistic


networks. In: Proceeding of the Fourteenth Conference on Uncertainty in Artificial
Intelligence (UAI’98); 24-26 July 1998; Madison, Wisconsin. San Francisco, CA, USA:
Morgan Kaufmann Publishers Inc.; 1998

[2] Korb KB, Nicholson AE. Bayesian Artificial Intelligence. 2nd ed. Boca Raton, FL: CRC
Press, Inc. 2010

[3] Koller D, Friedman N. Probabilistic Graphical Models: Principles and Techniques.
Adaptive Computation and Machine Learning series. Cambridge, MA: The MIT Press; 2009

[4] Murphy KP. Dynamic Bayesian networks: Representation, inference and learning. Ph.D.
Dissertation. University of California, Berkeley; 2002

[5] Mihajlovic V, Petkovic M. Dynamic Bayesian Networks: A State of the Art. Technical
Report, Computer Science Department, University of Twente, Netherlands; 2001

[6] Premebida C, Faria D, Nunes U. Dynamic Bayesian network for semantic place classifi-
cation in mobile robotics. Autonomous Robots (AURO), Springer; 2016

[7] Rottmann A, Mozos OM, Stachniss C, Burgard W. Semantic place classification of indoor
environments with mobile robots using boosting. In: Proceeding of the 20th National
Conference on Artificial Intelligence (AAAI’05); 9-13 July 2005; Pittsburgh, Pennsylvania:
AAAI Press; 2005
[8] Faria DR, Premebida C, Nunes U. A probabilistic approach for human everyday
activities recognition using body motion from RGB-D images. In: Proceedings of
the IEEE RO-MAN’14: International Symposium on Robot and Human Interactive
Communication; 25-29 August 2014; Edinburgh, UK. IEEE; 2014

[9] Sung J, Ponce C, Selman B, Saxena A. Unstructured human activity detection from
RGBD images. In: Proceedings of the IEEE International Conference on Robotics and
Automation (ICRA); May 2012; Saint Paul, MN. pp. 842-849

[10] Thrun S, Burgard W, Fox D. Probabilistic Robotics. Cambridge, MA: MIT Press; 2005

[11] Li T, Prieto J, Corchado JM, Bajo J. On the use and misuse of Bayesian filters. In:
Proceeding of the IEEE 18th Int. Conference on Information Fusion (Fusion); 6-9 July
2015; Washington, DC, USA. IEEE; 2015

[12] Chen Z. Bayesian filtering: From Kalman filters to particle filters and beyond. Statistics.
2003;182(1):1-69

[13] Bishop CM. Pattern Recognition and Machine Learning. New York, NY: Springer; 2006

[14] Duda RO, Hart PE, Stork DG. Pattern Classification. 2nd ed. New York, NY: John Wiley
& Sons; 2001

[15] Neapolitan RE. Learning Bayesian Networks. Upper Saddle River, NJ, USA: Prentice-
Hall, Inc.; 2003

[16] Russell S, Norvig P. Artificial Intelligence: A Modern Approach. 3rd ed. Upper Saddle
River, NJ: Prentice Hall; 2010

[17] Cornell Activity Datasets: CAD-60 [Internet]. Available from: http://pr.cs.cornell.edu/
humanactivities/data.php [Accessed: January 2017]
[18] Faria DR, Vieira M, Premebida C, Nunes U. Probabilistic human daily activity
recognition towards robot-assisted living. In: Proceedings of the IEEE RO-MAN’15: IEEE
International Symposium on Robot and Human Interactive Communication; September
2015; Kobe, Japan. IEEE; 2015
Section 4

Applications of Bayesian Inference in Economics


DOI: 10.5772/intechopen.70051

Chapter 16

A Bayesian Model for Investment Decisions in Early Ventures

Anamaria Berea and Daniel Maxwell

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/intechopen.70051

Abstract
In this research, we present a Bayesian model to aid the investment decision in early
stage start-ups and ventures. This model addresses both the venture and the angel
investing markets. The model is informed both by previous academic literature on
entrepreneurship and by venture capital investment practices. The model is validated
through an anonymized experiment where reviewers with previous experience in entre-
preneurship or investment or both scored a list of 20 anonymous real companies for
which we knew the outcome a priori. The experiment revealed that the model and
online scoring platform that we built provide an accuracy of 83% in identifying compa-
nies that would later on fail and where the investments would be lost. The model also
performs fairly well in identifying companies where the investors would not lose their
money but would either have to wait a very long time for their returns or would not
receive a large return on investment (ROI), and we also show that the model
performs modestly in identifying “big exit” companies or companies where the inves-
tors would receive high ROI and in a fairly short amount of time.

Keywords: Bayesian networks, investment, start-up, entrepreneurship, decision models

1. Introduction

One of the biggest challenges facing early stage investors is a lack of actionable data and
effective analytics. Most investment decisions are made based on the instinct (heuristics) of
the investor who may or may not have experience in the sector and decisions are often
inherently biased. The investment environment is increasingly complex, and investors cannot
process all of the factors that are critical to the success of a potential investment and make a
well-informed decision. Research suggests that well-built analytic models make better deci-
sions than human experts across virtually every field [1].

© 2017 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use,
distribution, and reproduction in any medium, provided the original work is properly cited.

314 Bayesian Inference

Some of the newest data on the returns on angel investment show that these are about
2.5 times the value of the initial investment and the average period of recovery of investment
is 3.6 years [2].

In general, there is little literature with respect to automated techniques or models for invest-
ment decisions. A very recently published paper presents an interesting risk analysis model
that reduces the risk of investing in early entrepreneurs [3]. This research takes a similar
approach—reducing the "bad" investment decisions—but uses a different, Bayesian-network-
based model, which performs well in identifying the future failures of new ventures.

While there is understandably little academic literature on forecasting future start-up success
and its relationship to investment decision-making, due to the confidentiality of the data in
this business, decision-making practice in the venture capital and angel investment indus-
tries relies heavily on the experience of the investors and on the "collective" thinking of the
investors that gather together to rate or assess the pitches or business proposals for various
funding rounds of investment.

Therefore, this chapter presents a model for investment decision-making that is informed
mainly by the practitioners and is intended to be applied to investment practice. Its aim is
to be a tool that helps the process of rating seed and start-up ventures become more informa-
tive and transparent both for investors and for entrepreneurs.

The model built for this research is mainly informed by the interviews and discussions
conducted with investors during the summer of 2014. The nodes of the model and the depen-
dencies between the nodes have been created based on these interviews, while the distribu-
tions of the prior probabilities have been informed by the academic literature where such
information could be found; otherwise, normal distributions were assumed.

This research describes the model in general terms, how it has been implemented in practice
and the results of two experiments that have been run to validate its forecasting
accuracy. The construction, implementation, and validation of the model, as well as a discus-
sion of findings are presented in the following sections below.

The rest of this chapter is structured as follows: Section 2 describes the model and the rationale
behind building it; Section 3 describes the experiments that were conducted using this model,
mainly with the purpose of validating its accuracy; Section 4 presents the results from the
experiments and an analysis of the accuracy of the model; and Section 5 summarizes succinctly
the conclusions of this research.

2. The Bayesian investment decision model

We used Bayesian networks modeling to build a probabilistic assessment model of early stage
companies or ventures. We based our selection of nodes/factors on a series of interviews and
working closely with practitioners in venture capital funding. We afterwards implemented this
model on an online platform, available at www.exogenius.net (see Figure 1).

Figure 1. The Bayesian model for investment decision.

The Bayesian model scores on a scale of [0, 100] the potential performance of a company/start-up
by identifying three key measures: business execution, value proposition, and exit potential (see
Figure 2). These measures are aggregated (nonlinearly) into an overall score of performance.
Each of these three important measures scores the future potential of a project or start-up in
regard to their proposition (which may be a technological innovation, a social value, or any
business value that the entrepreneur presents as the core proposition), their ability to sustain,
carry out, and fulfill their proposition (business execution) and the potential of this new venture
to exit (either through IPO, buy-out, or in any manner that would be satisfactory for the investor).

Each of these three measures is a child of five subnetworks in the model, which are represented
by more granular parent-children nodes each. These five subnetworks are business/entrepre-
neurship factors or indicators that are measuring the new venture on the following aspects of
the business proposal: technical difficulty, uniqueness of innovation, readiness for market,
customer engagement, team performance, entrepreneurial and managerial experience, foun-
ders and incorporation of the company, and many more. Each of the granular nodes in the
model is represented by three to five states and they are informed either by the evidence from
published literature (as described below) or otherwise by uniform prior distributions [4].
The conditional tables of each node have been readjusted after sensitivity analysis was performed,
based on data and facts previously published in the entrepreneurship and high-growth compa-
nies literature [5–7].

Figure 2. Example of one of five subnetworks of the model—the technology offering is represented by three granular
nodes.

For example, the states of the technology (marginal versus breakthrough) node are defined
according to the literature on entrepreneurship [7–9]; the number of founders is also determined
based on these prior findings, i.e., the state of 2–4 founders has the highest positive impact on the
final score, while the other states have low impact or negative impact (having more than 5 founders
lowers the chances of success significantly) [5, 6].
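To illustrate how such a discrete node enters the network, the sketch below builds a hypothetical "number of founders" node with the states discussed above and marginalizes it out by enumeration; all numeric probabilities are invented for illustration and are not the model's actual priors or conditional tables.

```python
# Hypothetical discrete Bayesian-network fragment: P(success) = sum_f P(f) * P(success | f).
# The founder states mirror the chapter's discussion (2-4 founders scoring best);
# every numeric value below is illustrative, not the published model's.

founder_prior = {          # P(founders = f), illustrative prior
    "1":   0.30,
    "2-4": 0.50,
    ">=5": 0.20,
}

success_given_founders = {  # P(success | founders = f), illustrative CPT
    "1":   0.15,
    "2-4": 0.35,            # highest positive impact, per the cited literature
    ">=5": 0.10,            # more than 5 founders lowers the chances
}

def marginal_success(prior, cpt):
    """Marginalize the parent out: P(success) = sum_f P(f) P(success | f)."""
    return sum(prior[f] * cpt[f] for f in prior)

def posterior_founders_given_success(prior, cpt):
    """Bayes' rule: P(f | success) = P(f) P(success | f) / P(success)."""
    z = marginal_success(prior, cpt)
    return {f: prior[f] * cpt[f] / z for f in prior}

p_success = marginal_success(founder_prior, success_given_founders)
posterior = posterior_founders_given_success(founder_prior, success_given_founders)
```

The same enumeration generalizes to the full network: each granular node contributes a factor, and the three key measures are obtained by summing out all parents.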
The nodes representing the team complementarity, coordination, and learning are based on the
findings of the Startup Genome Project, which was run at Berkeley and Stanford
Universities [5, 10, 11]. In other words, since the findings show that team complementarity and learning
are critically important for the success of the early ventures, the team node in the model reflects
these findings through the distribution of prior in its states.
Similarly, the nodes that are assessing the infrastructure of the start-up (broadly construed as
not only physical requirements to develop the proposed technology, but also legislative,
financial, or logistic infrastructure) are informed by probabilistic values published in
previous studies on organizational emergence [12].
The placement of the new venture in the current market is also assessed, and this is done based
on the assessment of the projected growth of the company relative to the projected growth of
the market or of the industry [7, 12].
For the development of the model, we used both UnBBayes [13] and GeNIe/SMILE [14], open-
source software packages dedicated to Bayesian modeling. After the model was built, tested, and
developed, it was migrated to the online platform, with an easy-to-use user interface, where
we ran our experiments.
The implementation of the model on an online platform facilitated experimentation for fore-
casting accuracy. The nodes of the model that provide new evidence, specific to each venture,
are represented as a series of 23 questions in a user-friendly interface. For example, the evidence

node in the model that represents the uniqueness of the offering became the question “How
unique is the proposed offering (idea/innovation/technology/product/service)?” in the online
platform. The nodes that were not evidence in the model have obviously not been represented
as questions in the online implementation. The reviewers/users can see the
progression of the three key scores (value proposition, business execution, and exit potential) as
well as the final score as they answer the individual assessment questions.

3. The experimental design for model validation

In order to validate the accuracy of the model scores, an anonymized experiment was designed,
where 20 case studies of companies were recreated from real, historical companies. These case
studies included the state of funding and potential of various companies while they were start-
ups, before their first or second seed funding, and the aim of the experiment was to show whether
the exit or the overall scores of the model align statistically with what happened in real life.
For the experiment, we randomly picked 20 historical cases for which we knew the ground
truth about their financial history (how they started, how much their initial funding was, and
how much their exit was), using publicly available information from the CrunchBase website,
Wikipedia, and various failed-start-up postmortem case studies. The companies in the sample
had either high exits (were bought for more than $500 million), medium exits (were bought for
$100–1000K or took a very long time to exit, i.e., 20 years), or no exits (they shut down or went
bankrupt soon after their launch).
Each of these 20 case studies in the sample was recreated as an anonymous business proposal,
given the information at the time when the company was seeking initial funding (e.g., 2010).
Each anonymized case study included the following information: the year to which the
reviewer had to "travel back in time" (e.g., 2010), with a hyperlink to the most important
published business and technological events of that year (e.g., The Economist); the company
location; the number of founders; the type of incorporation; anonymized information about
the founders' experience; information about the market and industry at that time; information
about the customers, the team, and the infrastructure; information about the financial past of
the company, if it existed; and, most importantly, information about the product or technology
without disclosing its brand name. The reviewers were also free to look for additional
information on the web regarding the state of technology and business at that particular time
in the past. The oldest case study was placed in 1999 and the newest one in 2014.

In other words, we included all possible information about a company prior to the time of its
initial funding request, as long as it could be anonymized.
We conducted two experiments: one with experts in business or investing and the other with
MBA students at the University of Maryland.
The first experiment was carried out by 24 volunteer reviewers, each of whom reviewed five of
the anonymous case studies by answering the questions on the online platform that fronts our
model. The reviewers in the experiment are

experienced as either entrepreneurs or investors; therefore, they are a panel of experts that
completed the experiment.
The second experiment was carried out by MBA students at the University of Maryland, in a
one-hour session. The students were also randomly assigned five case studies each and answered
the same questions from the online platform as the experts did.

4. Results and accuracy analysis

The first experiment started on March 22, 2016 and by April 13, 2016, 54% of reviewers
completed their reviews. We collected 68 (reviews) × 4 (scores) data points. The second
experiment was carried out during 1 day in October 2016.

Figure 3. The distribution of overall reviewing scores in the expert experiment. This figure shows the scores on a scale of
0–100 that were given by the professional reviewers (investors and entrepreneurs) in the overall rating for the companies
in each of the three groups—high-exits; medium exits; and no exits. The distributions of the reviewers scores show that
low exits were scored between 0 and 40 with most scores around a value of 20; medium exits were scored between 0 and
80 with most scores around 40; and the high exits were scored either with scores around 20 or scores around 60.

A reviewer provides the observations for the evidence nodes/questions in the model. The
model then provides a distribution on all scores as output, conditional on these observations.
Thus, the Bayesian model here is a three-layer model where the metrics are at the top level in
the network and the observations (market evaluation, team evaluation, etc.) are at the bottom
layer of granular nodes.

Both the measures in the model and the observations are discrete.

The data from the anonymized experiments were rematched with the ground-truth data from
the real case studies, and the experimental results were compared with the evidence on three
groups of companies (high exits, medium exits, no exits). The distributions of the exit scores and the
overall scores from the experiment for each of these groups are plotted on the following figures
(see Figures 3–7).

Figure 4. The distribution of overall reviewing scores in the MBA students experiment. Similarly to the plot above, this
figure shows the scores on a scale of 0–100 that were given by the University of Maryland students in the overall rating for
the companies in each of the three groups—high exits; medium exits; and no exits. The distributions of the reviewers
scores show that low exits were scored between 0 and 40 with most scores around a value below 20; medium exits were
scored between 0 and 80 with most scores around either 20 or 40; and the high exits were scored
between 20 and 60.

Figure 5. The distribution of exit reviewing scores in the expert experiment. This figure is similar to Figure 3, except that
these are the scores of the professional reviewers for the exit node and not the overall score. The low exits were scored
mainly with values close to 0, medium exits with scores between 10 and 60, and the high-exit scores were very close to a
uniform distribution.

Figure 6. The distribution of exit reviewing scores in the MBA students experiment. As above, this figure shows
the distribution of the exit scores for the student reviewers. The scores of the low exit companies were close to zero, the
ones of the medium exits around 30 and the ones of the high exits exhibit a much larger range of scores, from 0 to 100.

Figure 7. The overall accuracy of the Bayesian model in the expert panel experiment.

Figure 8. The overall accuracy of the Bayesian model in both experiments.

We can observe from these distributions that the “no exits” or “failures” scored low in both
experiments, that the medium exits had medium scores in both experiments, and that the high
exits had low, medium, and high scores in both experiments, whether we look at the final
overall score or only at the exit key intermediate score (see Figure 8).

                           Failed companies   Medium-exit companies   High-exit companies
Experiment mean scores     0.20               0.31                    0.42
Experiment median scores   0.16               0.28                    0.46
Accuracy                   0.83               0.77                    0.41

Table 1. A summary of the model accuracy based on the experimental results.

In other words, there is consistency between the two groups of reviewers with respect to each
of the three groups of companies. Moreover, there is consistency between the reviewers'
responses and the ground-truth data with respect to low-exit and medium-exit companies, but
less so for high-exit companies. Thus, we can use this model to identify failures or low exits, but
less so to identify high exits; the model is therefore designed to prune out "bad" proposals
from a pool of varied investment opportunities.

Between the two experiments, we can also observe that the experts are still slightly better than
MBA students at identifying low and medium exits.
The responses from the experiment for the "no exits" had a mean exit score of 20%, a median
exit score of 16%, and a mean and median overall score of 27%, with a standard deviation of
16–17%. This means that the companies that failed in real life were reviewed with scores in the
range of 16–27% in our model.

The medium exits experimental data had a mean exit score of 31%, a median of 28% and an
overall mean and median of 34 and 36%, respectively, with standard deviations of 20 and 17%,
respectively. This means that the companies that had medium exits (either low in capital value
or took very long to exit) scored around the probabilities of 28–36% in our model.

The high exits had a mean and median exit score of 42%, an overall mean and median of 46%,
and standard deviations of 28 and 25%, respectively. This means that companies that were
bought for more than $500 million in real life scored around 42–46% in our model (see Table 1).

The accuracy performance of the model was analyzed by using simple quantitative forecasting
analysis. Specifically, the mean absolute deviation was used as a metric to calculate the fore-
casting error. The resolution value of 1 was considered for the companies with high exits, 0.5
for the medium exits, and 0 for the failed or no-exit companies. The difference between these
resolutions and the actual probabilities given by the reviewers was calculated as a mean
absolute deviation. Based on this calculation, the overall accuracy of the model is situated at
75%, the accuracy for the no exits is valued at 83% and the accuracy for the medium and high
exits is 77 and 41%, respectively (see Table 1).
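The accuracy metric just described can be reproduced mechanically. In the sketch below, the reviewer scores are invented placeholders rather than the experiment's data, but the resolution values (1 for high exits, 0.5 for medium, 0 for failed) and the accuracy = 1 − mean absolute deviation convention follow the text.

```python
# Accuracy as 1 - mean absolute deviation between reviewer probabilities (on [0, 1])
# and the ground-truth "resolution" of each company group.

RESOLUTION = {"high": 1.0, "medium": 0.5, "failed": 0.0}

def accuracy(scores_by_group):
    """Per-group and overall accuracy = 1 - mean |score - resolution|."""
    per_group, all_devs = {}, []
    for group, scores in scores_by_group.items():
        devs = [abs(s - RESOLUTION[group]) for s in scores]
        all_devs.extend(devs)
        per_group[group] = 1.0 - sum(devs) / len(devs)
    overall = 1.0 - sum(all_devs) / len(all_devs)
    return per_group, overall

# Placeholder reviewer scores, one list per ground-truth group (not the real data).
reviews = {
    "failed": [0.10, 0.20, 0.25],
    "medium": [0.30, 0.45, 0.60],
    "high":   [0.40, 0.50, 0.75],
}
per_group, overall = accuracy(reviews)
```

With the placeholder scores above, the failed group scores well because its reviews cluster near the resolution of 0, mirroring the pattern reported in Table 1.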

5. Conclusions

In this research, a probabilistic model that assesses the potential for exit and overall performance
of new ventures (start-ups) is presented, from building it based on practice and published
statistical data, to its implementation in a readily available online platform that can be used by

entrepreneurs and investors alike. The model is designed to quantitatively assess the potential of
businesses while they are still at the very initial stages. The model is well informed with facts that
we know from previous academic literature on entrepreneurship and high-growth companies,
as well as informed in detail by venture capital experience and practices through close work
with practitioners during the development phase of the model.
The model is validated using two anonymized experiments with experts in the field and MBA
students and is currently being translated into a commercial product. The results of these
experiments and the details of the model are presented in this chapter as both a validation
method and as a viable metric or indicator that can detect ahead of time the future failures and
"bad investments." This model can thus also be used by entrepreneurs to self-assess and
identify points of weakness in their proposals and current seed ventures. Therefore, this
research presents a tool for investment decisions that can be easily automated and scaled
up for use by any potential investor, angel or venture, or any entrepreneur.

At the same time, these research efforts are also a good pathway to bring more transparency to
the investment road map.

Acknowledgements

The authors would like to thank Marco Rubin for his professional expertise and Professor
David Kirsch and his MBA students at the Smith School of Business for help with conducting
the experiments and very useful comments.

Author details

Anamaria Berea1* and Daniel Maxwell2


*Address all correspondence to: [email protected]
1 Center for Complexity in Business, University of Maryland, College Park, MD, USA

2 KaDSci, VA, USA

References

[1] Tetlock PE, Gardner D. Superforecasting: The Art and Science of Prediction. Random House; 2015
[2] Wiltbank R, Boeker W. Returns to Angel Investors in Groups [Internet]. 2007. Available
from: https://ssrn.com/abstract=1028592 or http://dx.doi.org/10.2139/ssrn.1028592
[3] Bodily S. Reducing risk and improving incentives in funding entrepreneurs. Decision
Analysis. 2015;13(2):101–116

[4] Pearl J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann; 1988
[5] Berea A. Essays in high-impact companies and high-impact entrepreneurship [thesis].
George Mason University; 2012
[6] Acs ZJ. Foundations of High Impact Entrepreneurship. Boston, MA; 2008

[7] Shane SA. The Illusions of Entrepreneurship: The Costly Myths that Entrepreneurs,
Investors and Policy Makers Live By. Yale University; 2008
[8] Arthur BW. The Nature of Technology. Free Press; 2009
[9] Auerswald P, Kauffman S, Lobo J, Shell K. The production recipes approach to modeling
technological innovation: An application to learning by doing. CAE Working Paper 98-10; 1998
[10] Marmer M, Herrmann BL, Dogrultan E, Berman R. Startup Genome Report. Technical report.
University of California, Berkeley and Stanford University; 2011

[11] Bottazzi G, Cefis E, Dosi G. Corporate growth and industrial structures: Some evidence from
the Italian manufacturing industry. Industrial and Corporate Change. 2002;11(4):705–723

[12] Woolley JL. Studying the emergence of new organizations: Entrepreneurship research
design. New Perspectives on Entrepreneurship Research. 2011;1(1)
[13] Carvalho RN, Onishi MS, Ladeira M. Development of the Java version of the UnBBayes
framework for probabilistic reasoning. In: Congresso de Iniciacao Cientifica da UnB. Brasilia,
DF, Brazil: University of Brasília; 2002
[14] GeNIe and SMILE, software developed at the Decision Systems Laboratory, School of
Information Sciences, University of Pittsburgh
Chapter 17

Recent Advances in Nonlinear Filtering with a Financial Application to Derivatives Hedging under Incomplete Information

Claudia Ceci and Katia Colaneri

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/intechopen.70060

Abstract
In this chapter, we present some recent results about nonlinear filtering for jump diffu-
sion signal and observation driven by correlated Brownian motions having common
jump times. We provide the Kushner-Stratonovich and the Zakai equation for the nor-
malized and the unnormalized filter, respectively. Moreover, we give conditions under
which pathwise uniqueness for the solutions of both equations holds. Finally, we study
an application of nonlinear filtering to the financial problem of derivatives hedging in an
incomplete market with partial observation. Precisely, we consider the risk-minimizing
hedging approach. In this framework, we compute the optimal hedging strategy for an
informed investor and a partially informed one and compare the total expected squared
costs of the strategies.

Keywords: nonlinear filtering, jump diffusions, risk minimization, Galtchouk-Kunita-Watanabe decomposition, partial information

1. Introduction

Bayesian inference and stochastic filtering are strictly related, since in both approaches, one
wants to estimate quantities which are not directly observable. However, while in Bayesian
inference, all uncertainty sources are considered as random variables, stochastic filtering refers
to stochastic processes. It also covers many situations, from the linear to the nonlinear case, with
various types of noise.
The objective of this chapter is to present nonlinear filtering results for Markovian partially
observable systems where the state and the observation processes are described by jump diffu-
sions with correlated Brownian motions and common jump times. We also aim at applying this

© 2017 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use,
distribution, and reproduction in any medium, provided the original work is properly cited.


theory to the financial problem of derivatives hedging for a trader who has limited informa-
tion on the market.

A filtering model is characterized by a signal process, denoted by X, which cannot be observed
directly, and an observation process, denoted by Y, whose dynamics depend on X. The natural
filtration of Y, F^Y = {F_t^Y, t ∈ [0, T]}, represents the available information. The goal of solving a
filtering problem is to determine the best estimate of the signal X_t from the knowledge of F_t^Y.
Similar to optimal Bayesian filtering, we seek the best estimate of the signal according to
the minimum mean-squared error criterion, which corresponds to computing the posterior
distribution of X_t given the available observations up to time t.
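In standard filtering notation (consistent with the setup above, though the chapter's own derivations are more general), the object to compute is the measure-valued process

```latex
\pi_t(f) \;=\; \mathbb{E}\!\left[\, f(X_t) \,\middle|\, \mathcal{F}_t^{Y} \right],
\qquad t \in [0, T],
```

for suitable test functions $f$; taking $f$ to be the identity yields the minimum mean-squared error estimate $\widehat{X}_t = \mathbb{E}[X_t \mid \mathcal{F}_t^Y]$.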

Historically, the first example of continuous-time filtering problem is the well-known Kalman-
Bucy filter which concerns the case where Y gives the observation of X in additive Gaussian
noise and both processes X and Y are modeled by linear stochastic differential equations. In this
case, one ends up with a filter having finite-dimensional realization. Since then, the problem
has been extended in many directions. To start, a number of authors including Refs. [1–3]
studied the nonlinear case in the setting of additive Gaussian noise. Other references in a
similar framework are given, for instance, by Refs. [4–8]. Subsequently, the case of counting
process or marked point process observations has also been considered (see Refs. [9–14] and
references therein). More recent literature covers the case of mixed-type observations (marked
point processes and diffusions or jump-diffusion processes); see, for example, Refs. [15–18].
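The finite-dimensional character of the linear-Gaussian case mentioned above can be made concrete with a discrete-time, scalar analogue of the Kalman-Bucy filter (a deliberate simplification of the continuous-time setting; the dynamics and noise levels below are invented for illustration): the conditional law stays Gaussian, so the filter carries only a mean and a variance.

```python
# Scalar discrete-time Kalman filter for the linear-Gaussian model
#   x_k = a x_{k-1} + w_k,   y_k = c x_k + v_k,   w ~ N(0, q), v ~ N(0, r).
# The posterior of x_k given y_1..y_k is Gaussian, so two numbers (m, p) suffice.

def kalman_filter(ys, a, c, q, r, m0, p0):
    m, p = m0, p0
    means = []
    for y in ys:
        # Predict through the signal dynamics.
        m_pred = a * m
        p_pred = a * a * p + q
        # Correct with the new observation (Kalman gain k).
        k = p_pred * c / (c * c * p_pred + r)
        m = m_pred + k * (y - c * m_pred)
        p = (1 - k * c) * p_pred
        means.append(m)
    return means, p

# Illustrative run: a near-constant signal observed in noise.
means, p_final = kalman_filter(ys=[1.1, 0.9, 1.0, 1.2], a=1.0, c=1.0,
                               q=0.01, r=0.1, m0=0.0, p0=1.0)
```

The posterior variance contracts as observations accumulate, which is exactly the finite-dimensional realization the text refers to; in the nonlinear and jump settings below, no such two-parameter summary exists.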

There are two major approaches to nonlinear filtering problems: the innovations method and
the reference probability method. The latter is usually employed when it is possible to find an
equivalent probability measure that makes the state X and the observations Y independent.
This technique may appear problematic when, for instance, signal and observation are corre-
lated and present common jump times. Therefore, in this chapter, we use the innovations
approach which allows circumventing the technical issues arising in the reference probability
method. By characterizing the innovation process and applying a martingale representation
theorem, we can derive the dynamics of the filter as the solution of the Kushner-Stratonovich
equation, which is a nonlinear stochastic partial integral differential equation. By considering
the unnormalized version of the filter, it is possible to simplify this equation and make it
at least linear. The resulting equation is called the Zakai equation, and due to its linear nature,
it is of particular interest in many applications. We also compute the dynamics of the
unnormalized filter, and we investigate pathwise uniqueness for the solutions of both equa-
tions. Normalized and unnormalized filters are probability measure and finite measure-valued
processes, respectively, and therefore in general infinite-dimensional. Due to this, various
recursive algorithms for statistical inference have been developed to address this intractability,
such as the extended Kalman filter, statistical linearization, or particle filters. These algorithms
intend to estimate both states and parameters. For parameter estimation, we also mention the
expectation maximization (EM) algorithm, which enables the estimation of parameters in
models with incomplete data; see, for example, Ref. [19].
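Since the normalized filter is in general infinite-dimensional, practical implementations rely on sampling approximations such as the particle filters just mentioned. The following is a generic bootstrap (sequential importance resampling) particle filter sketch for a scalar state-space model; the random-walk dynamics and Gaussian observation noise are invented placeholders, not the jump-diffusion model of this chapter.

```python
# Bootstrap particle filter: a sample-based approximation of the normalized filter.
import math
import random

def bootstrap_filter(ys, n_particles, init, propagate, loglik, rng):
    particles = [init(rng) for _ in range(n_particles)]
    estimates = []
    for y in ys:
        # 1. Propagate each particle through the signal dynamics.
        particles = [propagate(x, rng) for x in particles]
        # 2. Weight by the observation likelihood p(y | x).
        w = [math.exp(loglik(y, x)) for x in particles]
        z = sum(w)
        probs = [wi / z for wi in w]
        # Filter mean: Monte Carlo estimate of E[X_t | Y_1..t].
        estimates.append(sum(p * x for p, x in zip(probs, particles)))
        # 3. Resample (multinomial) to avoid weight degeneracy.
        particles = rng.choices(particles, weights=probs, k=n_particles)
    return estimates

rng = random.Random(0)
# Illustrative model: X a slow random walk, Y = X + N(0, 0.25) observation noise.
est = bootstrap_filter(
    ys=[1.0, 1.1, 0.9],
    n_particles=500,
    init=lambda r: r.gauss(0.0, 1.0),
    propagate=lambda x, r: x + r.gauss(0.0, 0.1),
    loglik=lambda y, x: -0.5 * ((y - x) / 0.5) ** 2,
    rng=rng,
)
```

The weighted empirical measure of the particles approximates the normalized filter, and the weighting step plays the role of the likelihood term appearing in the Kushner-Stratonovich dynamics.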

The success of the filtering theory over the years is due to its use in a great variety of problems
arising from many disciplines such as engineering, informational sciences and mathematical
finance. Specifically, in this chapter, we have a financial application in view. In real financial

markets, it is reasonable that investors cannot fully know all the stochastic factors that may
influence the prices of negotiated assets, since these factors are usually associated with eco-
nomic quantities which are hard to observe. Filtering theory represents a way to measure, in
some sense, this uncertainty. A considerable part of the literature in recent years has consid-
ered stochastic factor models under partial information for analyzing various financial prob-
lems, as, for example, pricing and hedging of derivatives, optimal investment, credit risk, and
insurance modeling. A list, definitely nonexhaustive, is given by Refs. [15, 16, 20–26].

In the following, we consider the problem of a trader who wants to determine the hedging
strategy for a European-type contingent claim with maturity T in an incomplete financial
market where the investment possibilities are given by a riskless asset, assumed to be the
numéraire, and a risky asset with price dynamics given by a geometric jump diffusion,
modeled by the process Y. We assume that the drift, as well as the intensity and the jump size
distribution of the price process, is influenced by an unobservable stochastic factor X, modeled
as a correlated jump diffusion with common jump times. By common jump times, we intend to
take into account catastrophic events which affect both the asset price and the hidden state
variable driving its dynamics. The agent knows the asset prices, since they are publicly
available, and trades on the market by using the available information F^Y.

Partial information easily leads to incomplete financial markets, since the number of
random sources is larger than the number of tradeable risky assets. Therefore, the existence of
a self-financing strategy that replicates the payoff of the given contingent claim at maturity is
not guaranteed. Here, we assume that the risky asset price is modeled under a martingale
measure, and we choose the risk-minimization approach as hedging criterion; see, for example,
Refs. [27, 28].

According to this method, the optimal hedging strategy is the one that perfectly replicates the
claim at maturity and has minimum cost in the mean-square sense. Equivalently, we say that it
minimizes the associated risk defined as the conditional expected value of the squared future
costs, given the available information (see Refs. [28, 29] and references therein).

The risk-minimizing hedging strategy under restricted information is strictly related to the
Galtchouk-Kunita-Watanabe decomposition of the random variable representing the payoff
of the contingent claim in a partial information setting. Here, we provide a characterization of
the risk-minimizing strategy under partial information via this orthogonal decomposition and
obtain a representation in terms of the corresponding risk-minimizing hedging strategy under
full information (see, e.g., Refs. [29, 30]) via predictable projections on the available informa-
tion flow by means of the filter. Finally, we investigate the difference of expected total risks
associated with the optimal hedging strategies under full and partial information.

The chapter has the following structure. In Section 2, we introduce the general framework. In
Section 3, we study the filtering equations: in particular, we derive the dynamics of both the
normalized and the unnormalized filters, and we investigate uniqueness of the solutions of the
Kushner-Stratonovich and the Zakai equations. In Section 4, we analyze a financial application
to risk minimization by computing the optimal hedging strategies for a European-type contingent
claim under full and partial information and providing a comparison between the
corresponding expected squared total costs.

2. The setting

We consider a pair of stochastic processes (X, Y), with values in R × R and càdlàg trajectories,
defined on a complete filtered probability space (Ω, F, 𝔽, P), where 𝔽 = {F_t, t ∈ [0, T]} is a filtration
satisfying the usual conditions of right continuity and completeness, and T is a fixed time
horizon. The pair (X, Y) represents a partially observable system, where X is a signal process
that describes a phenomenon which is not directly observable, and Y gives the observation of
X; the latter is modeled by a process correlated with the signal, possibly having common jump
times.
Remark 1. In view of the financial application discussed in Section 4, Y represents the price of some
risky asset, while X is an unknown stochastic factor, which may describe the activity of other markets,
macroeconomic factors or microstructure rules that influence the dynamics of the stock price process.

We define the observed history as the natural filtration of the observation process Y, that is,
𝔽^Y = {F^Y_t}_{t ∈ [0, T]}, where F^Y_t := σ(Y_s; 0 ≤ s ≤ t). The σ-algebra F^Y_t can be interpreted as the
information available from observations up to time t. We aim to compute the best estimate of the
signal X from the available information, in the quadratic sense. In other terms, this corresponds
to determining the filter, which furnishes the conditional distribution of X_t given F^Y_t, for
every t ∈ [0, T].
Let M(R) be the space of finite measures over R and P(R) the subspace of probability
measures over R. Given μ ∈ M(R), for any bounded measurable function f, we write

μ(f) = ∫_R f(x) μ(dx).   (1)

Definition 2. The filter is the 𝔽^Y-càdlàg process π taking values in P(R) defined by

π_t(f) := E[f(t, X_t) | F^Y_t] = ∫_R f(t, x) π_t(dx),   (2)

for all bounded and measurable functions f(t, x) on [0, T] × R.


In the sequel, we denote by π_{t−} the left version of the filter, and for all functions F(t, x, y) such
that E|F(t, X_t, Y_t)| < ∞ (resp. E|F(t, X_{t−}, Y_{t−})| < ∞) for every t ∈ [0, T], we use the notation
π_t(F) := π_t(F(t, ·, Y_t)) (resp. π_{t−}(F) := π_{t−}(F(t, ·, Y_{t−}))).
In this paper, we wish to consider the filtering problem for a partially observable system (X, Y)
described by the following pair of stochastic differential equations:

dX_t = b_0(t, X_t) dt + σ_0(t, X_t) dW^0_t + ∫_Z K_0(t, X_{t−}; ζ) N(dt, dζ),   X_0 = x_0 ∈ R,
dY_t = b_1(t, X_t, Y_t) dt + σ_1(t, Y_t) dW^1_t + ∫_Z K_1(t, X_{t−}, Y_{t−}; ζ) N(dt, dζ),   Y_0 = y_0 ∈ R,   (3)

where W^0 and W^1 are correlated (𝔽, P)-Brownian motions with correlation coefficient ρ ∈ [−1, 1],
and N(dt, dζ) is a Poisson random measure on R_+ × Z with intensity ν(dζ)dt, where ν is a
σ-finite measure on a measurable space (Z, 𝒵). Here, b_0, b_1, σ_0, σ_1, K_0, and K_1 are R-valued
measurable functions of their arguments. In particular, σ_0(t, x) and σ_1(t, y) are strictly
positive for every (t, x, y) ∈ [0, T] × R².

For the rest of the paper, we assume that strong existence and uniqueness for system (3)
hold. Sufficient conditions are collected, for instance, in Ref. [18, Appendix]. These assumptions
also imply Markovianity of the pair (X, Y).
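A discretized path of such a system can be generated with a basic Euler-Maruyama scheme. The sketch below is purely illustrative: the concrete coefficient choices (mean-reverting drift for X, an X-dependent drift for Y, Gaussian jump sizes, and a single Poisson clock producing common jumps) are assumptions for the example and are not specified in the chapter.

```python
import numpy as np

# Toy Euler-Maruyama simulation of the partially observable system (3).
# All coefficient choices are illustrative assumptions.
rng = np.random.default_rng(0)

T, n_steps = 1.0, 1000
dt = T / n_steps
rho = 0.5                        # correlation between W^0 and W^1
nu_total = 2.0                   # total jump intensity nu(Z)

def b0(t, x):    return -0.5 * x             # signal drift
def sig0(t, x):  return 0.3                  # signal volatility
def b1(t, x, y): return x * y                # observation drift depends on X
def sig1(t, y):  return 0.2 * (1 + abs(y))   # observation volatility (no X!)

X, Y = 0.0, 1.0
X_path, Y_path = [X], [Y]
for k in range(n_steps):
    t = k * dt
    dB = rng.normal(0.0, np.sqrt(dt), size=2)
    dW0 = dB[0]
    dW1 = rho * dB[0] + np.sqrt(1 - rho**2) * dB[1]   # Corr(W0, W1) = rho
    # Common jump times: one Poisson clock drives jumps in both X and Y.
    jump = rng.poisson(nu_total * dt)
    zeta = rng.normal(0.0, 0.1) if jump else 0.0
    X_new = X + b0(t, X) * dt + sig0(t, X) * dW0 + jump * zeta
    Y_new = Y + b1(t, X, Y) * dt + sig1(t, Y) * dW1 + jump * 0.5 * zeta
    X, Y = X_new, Y_new
    X_path.append(X)
    Y_path.append(Y)

X_path, Y_path = np.array(X_path), np.array(Y_path)
print(len(X_path), np.isfinite(Y_path).all())
```

Note that, consistently with Remark 3 below, the diffusion coefficient of Y in this sketch does not depend on X, while its drift and jumps do.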
Remark 3. Note that the quadratic variation process of Y, defined by

[Y]_t = Y_t² − Y_0² − 2 ∫_0^t Y_{u−} dY_u,   t ∈ [0, T],   (4)

is 𝔽^Y-adapted and [Y]_t = ∫_0^t σ_1²(u, Y_u) du + Σ_{u ≤ t} (ΔY_u)², where ΔY_t := Y_t − Y_{t−}. Therefore, it is
natural to assume that the signal X does not affect the diffusion coefficient in the dynamics
of Y. If Y describes the price of a risky asset, this implies that the volatility of the stock price
does not depend on the stochastic factor X.
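The integration-by-parts identity behind Eq. (4) has an exact discrete-time analogue, which can be checked on an arbitrary path (the data below are synthetic and purely illustrative):

```python
import numpy as np

# Discrete-time check of the integration-by-parts identity behind Eq. (4):
# Y_n^2 - Y_0^2 - 2 * sum(Y_{k-1} * dY_k) equals sum(dY_k^2), i.e. the
# realized quadratic variation of the path.
rng = np.random.default_rng(1)
Y = np.cumsum(rng.normal(0.0, 0.1, size=500)) + 1.0   # arbitrary path
dY = np.diff(Y)

lhs = Y[-1]**2 - Y[0]**2 - 2.0 * np.sum(Y[:-1] * dY)  # discrete analogue of (4)
rhs = np.sum(dY**2)                                    # realized quadratic variation
print(abs(lhs - rhs) < 1e-10)
```

The identity holds path by path because Y_k² − Y_{k−1}² = 2 Y_{k−1} ΔY_k + (ΔY_k)², so summing over k telescopes.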
The jump component of Y can be described in terms of the following integer-valued random
measure on [0, T] × R:

m(dt, dz) = Σ_{s: ΔY_s ≠ 0} δ_{(s, ΔY_s)}(dt, dz),   (5)

where δ_a denotes the Dirac measure at point a. Note that the following equality holds:

∫_0^t ∫_R z m(ds, dz) = ∫_0^t ∫_Z K_1(s, X_{s−}, Y_{s−}; ζ) N(ds, dζ).   (6)

For all t ∈ [0, T] and all A ∈ B(R), we define the following sets:

d_0(t, x) := {ζ ∈ Z : K_0(t, x; ζ) ≠ 0},   d_1(t, x, y) := {ζ ∈ Z : K_1(t, x, y; ζ) ≠ 0},   (7)

d^A(t, x, y) := {ζ ∈ Z : K_1(t, x, y; ζ) ∈ A \ {0}} ⊆ d_1(t, x, y),   (8)

D^A_t := d^A(t, X_{t−}, Y_{t−}) ⊆ D_t := d_1(t, X_{t−}, Y_{t−}),   D^0_t := d_0(t, X_{t−}).   (9)

Typically, we have D^0_t ∩ D_t ≠ ∅ P-a.s., which means that state and observation may have
common jump times. This feature is particularly meaningful in financial applications to
model catastrophic events that produce jumps in both the stock price and the underlying
stochastic factor that influences its dynamics.
To ensure existence of the first moment for the pair (X, Y) and non-explosiveness for the jump
process governing the dynamics of X and Y, we make the following assumption:

Assumption 4.

E[∫_0^T ( |b_0(t, X_t)| + σ_0²(t, X_t) + ∫_Z |K_0(t, X_{t−}; ζ)| ν(dζ) ) dt] < ∞,   (10)

E[∫_0^T ( |b_1(t, X_t, Y_t)| + σ_1²(t, Y_t) + ∫_Z |K_1(t, X_{t−}, Y_{t−}; ζ)| ν(dζ) ) dt] < ∞,   (11)

E[∫_0^T ν(D^0_t ∪ D_t) dt] < ∞.   (12)

Denote by η^P(dt, dz) the (𝔽, P)-compensator of m(dt, dz) (see, e.g., Refs. [9, 31] for the
definition). Then, in Ref. [14, Proposition 2.2], it is proved that

η^P(dt, dz) = λ(t, X_{t−}, Y_{t−}) φ(t, X_{t−}, Y_{t−}, dz) dt,   (13)

where

λ(t, x, y) φ(t, x, y, dz) = ∫_{d_1(t, x, y)} δ_{K_1(t, x, y; ζ)}(dz) ν(dζ),   (14)

and in particular λ(t, x, y) = ν(d_1(t, x, y)).

Remark 5. Observe that both local jump characteristics (λ(t, X_{t−}, Y_{t−}), φ(t, X_{t−}, Y_{t−}, dz))
depend on X and, for all A ∈ B(R), λ(t, X_{t−}, Y_{t−}) φ(t, X_{t−}, Y_{t−}, A) = ν(D^A_t) provides the (𝔽, P)-intensity
of the point process N_t(A) := m((0, t] × A). Accordingly, the process λ(t, X_{t−}, Y_{t−}) = ν(D_t) is the
(𝔽, P)-intensity of the point process N_t(R), which counts the total number of jumps of Y until time t.

2.1. The innovation process


To derive the filtering equation, we use the innovations approach. This method requires
introducing a pair (I, m^π), called the innovation process, consisting of the (𝔽^Y, P)-Brownian motion
and the (𝔽^Y, P)-compensated jump measure that drive the dynamics of the filter. The innovation
also represents the building block of (𝔽^Y, P)-martingales.

To introduce the first component of the innovation process, we assume that

E[ exp{ (1/2) ∫_0^T ( b_1(t, X_t, Y_t) / σ_1(t, Y_t) )² dt } ] < ∞,   (15)

and define

I_t := W^1_t + ∫_0^t ( b_1(s, X_s, Y_s)/σ_1(s, Y_s) − π_s(b_1)/σ_1(s, Y_s) ) ds,   t ∈ [0, T].   (16)

The process I is an (𝔽^Y, P)-Brownian motion (see, e.g., Ref. [4]), and the (𝔽^Y, P)-compensated
jump martingale measure is given by

m^π(dt, dz) = m(dt, dz) − π_{t−}(λφ(dz)) dt;   (17)

see, e.g., Ref. [14]. The following theorem provides a characterization of (𝔽^Y, P)-martingales
in terms of the innovation process.
Theorem 6 (A martingale representation theorem). Under Assumption 4 and the integrability
condition (15), every (𝔽^Y, P)-local martingale M admits the following decomposition:

M_t = M_0 + ∫_0^t ∫_R w_s(z) m^π(ds, dz) + ∫_0^t h_s dI_s,   t ∈ [0, T],   (18)

where w(z) = {w_t(z), t ∈ [0, T]} is an 𝔽^Y-predictable process indexed by z, and h = {h_t, t ∈ [0, T]} is an
𝔽^Y-adapted process such that

∫_0^T ∫_R |w_t(z)| π_{t−}(λφ(dz)) dt < ∞,   ∫_0^T h_t² dt < ∞   P-a.s.   (19)

Proof. The proof is given in Ref. [17, Proposition 2.4]. Note that here condition (15) implies that
E[∫_0^T ( b_1(t, X_t, Y_t)/σ_1(t, Y_t) )² dt] < ∞, and also that the process L defined by

L_t = exp( −∫_0^t ( b_1(s, X_s, Y_s)/σ_1(s, Y_s) ) dW^1_s − (1/2) ∫_0^t ( b_1(s, X_s, Y_s)/σ_1(s, Y_s) )² ds ),   (20)

for every t ∈ [0, T], is an (𝔽, P)-martingale.

3. The filtering equations

Theorem 7 (The Kushner-Stratonovich equation). Under Assumption 4 and condition (15), the
filter π solves the following Kushner-Stratonovich equation: for every f ∈ C_b^{1,2}([0, T] × R),

π_t(f) = f(0, x_0) + ∫_0^t π_s(L^X f) ds + ∫_0^t ∫_R w^π_s(f, z) m^π(ds, dz) + ∫_0^t h^π_s(f) dI_s,   t ∈ [0, T],   (21)

where

w^π_t(f, z) = (dπ_{t−}(λφ f)/dπ_{t−}(λφ))(z) − π_{t−}(f) + (dπ_{t−}(Lf)/dπ_{t−}(λφ))(z),   (22)

h^π_t(f) = σ_1^{−1}(t) [π_t(b_1 f) − π_t(b_1) π_t(f)] + ρ π_t(σ_0 ∂f/∂x).   (23)

Here, by (dπ_{t−}(λφ f)/dπ_{t−}(λφ))(z) and (dπ_{t−}(Lf)/dπ_{t−}(λφ))(z) we mean the Radon-Nikodym
derivatives of the measures π_{t−}(λ f φ(dz)) and π_{t−}(Lf)(dz) with respect to π_{t−}(λφ(dz)).
Moreover, the operator L, defined by L_t f(dz) := Lf(·, Y_{t−}, dz), is such that, for every A ∈ B(R),

Lf(t, x, y, A) = ∫_{d^A(t, x, y)} [f(t, x + K_0(t, x; ζ)) − f(t, x)] ν(dζ)   (24)

takes into account common jump times between the signal X and the observation Y.

Finally, the operator L^X, given by

L^X f(t, x) = ∂f/∂t + b_0(t, x) ∂f/∂x + (1/2) σ_0²(t, x) ∂²f/∂x² + ∫_Z {f(t, x + K_0(t, x; ζ)) − f(t, x)} ν(dζ),   (25)

denotes the generator of the Markov process X.

Proof. The theorem is proved in Ref. [17, Theorem 3.1].

Example 8 (Observation dynamics driven by independent point processes with unobservable
intensities). In the sequel, we provide an example where the Kushner-Stratonovich equation
simplifies and the Radon-Nikodym derivatives appearing in the dynamics of π(f) reduce to
ratios. Suppose that there exists a finite set of measurable functions K^i_1(t, y) ≠ 0, for all
(t, y) ∈ [0, T] × R and i ∈ {1, …, n}, such that the dynamics of Y is given by

dY_t = b_1(t, X_t, Y_t) dt + σ_1(t, Y_t) dW^1_t + Σ_{i=1}^n K^i_1(t, Y_{t−}) dN^i_t,   Y_0 = y_0 ∈ R,   (26)

where the N^i are independent counting processes with (𝔽, P)-intensities λ^i(t, X_{t−}, Y_{t−}).

For simplicity, in this example, we assume that X and Y have no common jump times. Then,
the filtering Eq. (21) reads as

π_t(f) = f(0, x_0) + ∫_0^t π_s(L^X f) ds + ∫_0^t ( σ_1(s)^{−1} [π_s(b_1 f) − π_s(b_1) π_s(f)] + ρ π_s(σ_0 ∂f/∂x) ) dI_s
   + Σ_{i=1}^n ∫_0^t 1_{{π_{s−}(λ^i) > 0}} [ (π_{s−}(λ^i f) − π_{s−}(f) π_{s−}(λ^i)) / π_{s−}(λ^i) ] ( dN^i_s − π_{s−}(λ^i) ds ),   t ∈ [0, T].   (27)

Note that Eq. (21) has an equivalent expression in terms of the operator L^X_0, given by

L^X_0 f(t, x, y) = L^X f(t, x) − Lf(t, x, y, R)
   = ∂f/∂t (t, x) + b_0(t, x) ∂f/∂x + (1/2) σ_0²(t, x) ∂²f/∂x² + ∫_{d_1(t, x, y)^c} {f(t, x + K_0(t, x; ζ)) − f(t, x)} ν(dζ),   (28)

where d_1(t, x, y)^c = {ζ ∈ Z : K_1(t, x, y; ζ) = 0}. Indeed, we get

dπ_t(f) = {π_t(L^X_0 f) + π_t(f) π_t(λ) − π_t(λf)} dt + h^π_t(f) dI_t + ∫_R w^π_t(f, z) m(dt, dz).   (29)

Moreover, the filter has a natural recursive structure. To show this, define the sequence
{T_n, Z_n}_{n ∈ N} of jump times and jump sizes of Y, that is, Z_n = Y_{T_n} − Y_{T_n−}. These are observable
data. Then, between two consecutive jump times the filter is governed by a diffusion process,
that is, for t ∈ (T_n ∧ T, T_{n+1} ∧ T),

π_t(f) = π_{T_n}(f) + ∫_{T_n}^t {π_s(L^X_0 f) + π_s(f) π_s(λ) − π_s(λf)} ds + ∫_{T_n}^t h^π_s(f) dI_s,   (30)

and at any jump time T_n occurring before time T, it is given by

π_{T_n}(f) = (dπ_{T_n−}(λφ f)/dπ_{T_n−}(λφ))(Z_n) + (dπ_{T_n−}(Lf)/dπ_{T_n−}(λφ))(Z_n),   (31)

which implies that π_{T_n}(f) is completely determined by the observed data (T_n, Z_n) and the
knowledge of π_t(f) on the time interval [T_{n−1}, T_n), since π_{T_n−}(f) = lim_{t→T_n−} π_t(f).
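The jump-ODE recursion above can be illustrated numerically in the simplest finite-state case, in the spirit of Example 8: a two-state hidden Markov chain observed only through a counting process whose intensity depends on the hidden state. All parameters (generator Q, intensities lam) are illustrative assumptions; the filter evolves by an Euler step of the prediction/compensation ODE between observed jumps and a Bayesian reweighting at each jump.

```python
import numpy as np

# Toy version of the recursion (30)-(31): X is a two-state Markov chain
# (generator Q) observed only through a counting process with intensity
# lam[X]. Parameters are illustrative.
rng = np.random.default_rng(2)

Q = np.array([[-0.5, 0.5],
              [0.8, -0.8]])          # generator of the hidden chain
lam = np.array([1.0, 4.0])            # observation intensity in each state
T, dt = 10.0, 1e-3
n = int(T / dt)

# simulate the hidden chain and the observed counting process
X = 0
jumps = []
for k in range(n):
    if rng.random() < -Q[X, X] * dt:  # chain switches state
        X = 1 - X
    if rng.random() < lam[X] * dt:    # an observed jump occurs
        jumps.append(k)

# run the filter: ODE between observation jumps, Bayes update at jumps
pi = np.array([0.5, 0.5])
jump_set = set(jumps)
for k in range(n):
    lam_bar = pi @ lam                # filtered intensity pi(lambda)
    # between jumps: prediction by Q plus the compensator correction term
    pi = pi + (Q.T @ pi) * dt + pi * (lam_bar - lam) * dt
    if k in jump_set:
        pi = pi * lam                 # at a jump: pi_i proportional to pi_i * lam_i
    pi = np.clip(pi, 1e-12, None)
    pi = pi / pi.sum()                # renormalize against discretization drift

print(pi.sum(), (pi >= 0).all())
```

The update at a jump is the finite-state counterpart of the ratio in Eq. (31), and the drift between jumps matches the compensated term in Eq. (27) when only point-process information is available.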

Note that the Kushner-Stratonovich equation is an infinite-dimensional nonlinear stochastic
differential equation. Often, it is possible to characterize the filter in terms of a simpler equation,
known as the Zakai equation, which provides the dynamics of the unnormalized version
of the filter. Although the Zakai equation is still infinite-dimensional, it has the advantage of
being linear.
The idea for getting the dynamics of the unnormalized filter consists of performing an
equivalent change of probability measure defined by

dP^0/dP |_{F_t} = Z_t,   t ∈ [0, T],   (32)

for a suitable strictly positive (𝔽, P)-martingale Z, in such a way that the so-called
unnormalized filter p is the M(R)-valued process defined by

p_t(f) := E^0[ Z_t^{−1} f(t, X_t) | F^Y_t ],   t ∈ [0, T].   (33)

Remark 9. By the Kallianpur-Striebel formula, we get that

π_t(f) = E^0[ f(t, X_t) Z_t^{−1} | F^Y_t ] / E^0[ Z_t^{−1} | F^Y_t ] = p_t(f) / p_t(1),   t ∈ [0, T],   (34)

where p_t(1) := E^0[ Z_t^{−1} | F^Y_t ]. This provides the relation between the filter and its unnormalized version.

In order to compute the Zakai equation, we make the following assumption.

Assumption 10. Suppose that there exists a transition function η^0(t, y, dz) such that the (𝔽^Y, P)-
predictable measure η^0(t, Y_{t−}, dz) is equivalent to λ(t, X_{t−}, Y_{t−}) φ(t, X_{t−}, Y_{t−}, dz) and

E[ ∫_0^T η^0(t, Y_{t−}, R) dt ] < ∞.   (35)

Remark 11. In Ref. [18], a weaker assumption is considered. That condition allows one to introduce an
equivalent probability measure on (Ω, F^Y_T) which is not necessarily the restriction to F^Y_T of an equivalent
probability measure on (Ω, F_T).

Remark 12. In the context of Example 8, Assumption 10 is satisfied if, for instance, λ^i(t, X_{t−}, Y_{t−}) > 0
P-a.s. for every t ∈ [0, T].

Assumption 10 equivalently means that there exists an (𝔽, P)-predictable process
Ψ(t, X_{t−}, Y_{t−}, z) such that

λ(t, X_{t−}, Y_{t−}) φ(t, X_{t−}, Y_{t−}, dz) dt = (1 + Ψ(t, X_{t−}, Y_{t−}, z)) η^0(t, Y_{t−}, dz) dt   (36)

and 1 + Ψ(t, X_{t−}, Y_{t−}, z) > 0 P-a.s. for every t ∈ [0, T], z ∈ R. Setting

U(t, z) := 1/(1 + Ψ(t, X_{t−}, Y_{t−}, z)) − 1,   (37)

we also assume that the following integrability condition holds:

E[ exp{ (1/2) ∫_0^T ( b_1(s, X_s, Y_s)/σ_1(s, Y_s) )² ds + ∫_0^T ∫_R U²(s, z) λ(s, X_{s−}, Y_{s−}) φ(s, X_{s−}, Y_{s−}, dz) ds } ] < ∞.   (38)

The subsequent proposition provides a version of the Girsanov theorem that fits our setting.

Proposition 13. Let Assumptions 4 and 10 and condition (38) hold, and define the process

Z_t := E( −∫_0^· ( b_1(s, X_s, Y_s)/σ_1(s, Y_s) ) dW^1_s + ∫_0^· ∫_R U(s, z) ( m(ds, dz) − λ(s, X_{s−}, Y_{s−}) φ(s, X_{s−}, Y_{s−}, dz) ds ) )_t

for every t ∈ [0, T], where E(M) denotes the Doléans-Dade exponential of a martingale M. Then, Z is a
strictly positive (𝔽, P)-martingale. Let P^0 be the probability measure equivalent to P given by

dP^0/dP |_{F_t} = Z_t,   t ∈ [0, T].   (39)

Then, the process

W̃^1_t := W^1_t + ∫_0^t ( b_1(s, X_s, Y_s)/σ_1(s, Y_s) ) ds,   t ∈ [0, T],   (40)

is an (𝔽, P^0)-Brownian motion, and the (𝔽, P^0)-predictable projection of the integer-valued random
measure m(dt, dz) is given by η^0(t, Y_{t−}, dz) dt.

Proof. [32, Theorem 9] ensures that Z is a martingale under Assumptions 4 and 10 and the
integrability condition (38). The proof then follows by Ref. [31, Chapter III, Theorem 3.24].
Note that, by Eq. (16), the process W̃^1 can also be written as

W̃^1_t = I_t + ∫_0^t π_s(b_1/σ_1) ds,   t ∈ [0, T],   (41)

which implies that W̃^1 is also an (𝔽^Y, P^0)-Brownian motion. Moreover, since η^0(t, Y_{t−}, dz) is 𝔽^Y-
predictable, it provides the (𝔽^Y, P^0)-predictable projection of the measure m(dt, dz), and the
observation process Y satisfies dY_t = σ_1(t, Y_t) dW̃^1_t + ∫_R z m(dt, dz). In particular, η^0_t(R) :=
η^0(t, Y_{t−}, R) is the (𝔽^Y, P^0)-intensity of the point process which counts the total number of jumps of Y until
time t.

Theorem 14 (The Zakai equation). Under Assumptions 4 and 10 and condition (38), let P^0 be the
probability measure defined in Proposition 13. For every f ∈ C_b^{1,2}([0, T] × R), the unnormalized filter
defined in Eq. (33) satisfies the equation

dp_t(f) = ( p_t(L^X_0 f) − p_t(λf) + η^0_t(R) p_t(f) ) dt + ( p_t(b_1 f)/σ_1(t, Y_t) + ρ p_t(σ_0 ∂f/∂x) ) dW̃^1_t
   + ∫_R ( p_{t−}(f Ψ)(z) + (dp_{t−}(Lf)/dη^0_t)(z) ) m(dt, dz).   (42)

See Ref. [18, Theorem 3.6] for the proof.

3.1. Uniqueness of the filtering equations


In this section, we show pathwise uniqueness for the solutions of the Kushner-Stratonovich and
the Zakai equations. The first result provides the equivalence of uniqueness of the solutions to
the filtering Eqs. (21) and (42).

Theorem 15. Let Assumptions 4 and 10 and condition (38) hold.

i. Assume strong uniqueness for the solution to the Zakai equation, and let μ be a P(R)-valued process
which is a strong solution of the Kushner-Stratonovich equation. Then μ_t = π_t P-a.s. for all t ∈ [0, T].

ii. Conversely, suppose that pathwise uniqueness for the solution of the Kushner-Stratonovich
equation holds, and let ξ be an M(R)-valued process which is a strong solution of the Zakai
equation. Then ξ_t = p_t P-a.s. for all t ∈ [0, T].

Proof. The proof follows by Ref. [18, Theorems 4.5 and 4.6]. Here, note that Assumption 10
implies that the measures μ_{t−}(λφ(dz)) and π_{t−}(λφ(dz)) are equivalent.

Finally, strong uniqueness for the solutions of both filtering equations is established in the
subsequent theorems.

Theorem 16. Let (X, Y) be the partially observed system defined in Eq. (3), and assume, in addition to
Assumptions 4 and 10 and condition (15), that

sup_{t, x, y} ∫_Z { |K_0(t, x; ζ)| + |K_1(t, x, y; ζ)| } ν(dζ) < ∞.   (43)

Let μ be a strong solution of the Kushner-Stratonovich equation. Then μ_t = π_t P-a.s. for every t ∈ [0, T].

Proof. See Ref. [17, Theorem 3.3].

Theorem 17. Let (X, Y) be the partially observed system in Eq. (3). Under Assumptions 4 and 10 and
conditions (38) and (43), let ξ be a strong solution to the Zakai equation; then ξ_t = p_t P-a.s. for every
t ∈ [0, T].

Proof. The proof follows by Ref. [18, Theorem 4.7], after noticing that under Assumption 10 the
measures ξ_{t−}(λφ(dz)) and p_{t−}(λφ(dz)) are equivalent.

4. A financial application to risk minimization

In the current section, we focus on a financial application. We consider a simple financial market
where agents may invest in a risky asset, whose price is described by the process Y given in Eq. (3),
and a riskless asset with price process B. Without loss of generality, we assume that B_t = 1 for
every t ∈ [0, T]. We also assume throughout the section the following dynamics for the process Y:

dY_t = Y_t ( σ(t, Y_t) dW^1_t + ∫_Z K(t, X_{t−}, Y_{t−}; ζ) ( N(dt, dζ) − ν(dζ) dt ) ),   Y_0 = y_0 ∈ R_+,   (44)

for some functions σ(t, y) and K(t, x, y; ζ) such that σ(t, y) > 0 and K(t, x, y; ζ) > −1.

This choice for the dynamics of Y has a double advantage. On one side, the geometric form,
together with the condition K(t, x, y; ζ) > −1, guarantees nonnegativity, which is
desirable when modeling prices. On the other hand, we are modeling Y directly under a
martingale measure, and by Assumption 18, it turns out to be a square integrable (𝔽, P)-
martingale.

Considering Eq. (44) corresponds to taking, in system (3),

b_1(t, x, y) = −y ∫_Z K(t, x, y; ζ) ν(dζ),
σ_1(t, y) = y σ(t, y),   K_1(t, x, y; ζ) = y K(t, x, y; ζ).   (45)
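The martingale property of the compensated-jump dynamics (44) can be sanity-checked by Monte Carlo: under the stated measure, E[Y_T] should stay close to Y_0. In the sketch below the constant volatility, the single jump size K, and freezing the factor X are simplifying assumptions made only for illustration.

```python
import numpy as np

# Monte Carlo check that the compensated-jump dynamics (44) give a
# martingale: E[Y_T] stays near Y_0. Parameters are illustrative and the
# stochastic factor X is frozen for simplicity.
rng = np.random.default_rng(5)

y0, T, n_steps, n_paths = 1.0, 1.0, 200, 20_000
dt = T / n_steps
sigma = 0.2
nu_total = 1.5                  # jump intensity nu(Z)
K = 0.3                         # jump size; K > -1 keeps Y nonnegative

Y = np.full(n_paths, y0)
for _ in range(n_steps):
    dW = rng.normal(0.0, np.sqrt(dt), size=n_paths)
    dN = rng.poisson(nu_total * dt, size=n_paths)
    # dY = Y * (sigma dW + K (dN - nu dt)): the jump part is compensated
    Y = Y * (1.0 + sigma * dW + K * (dN - nu_total * dt))
    Y = np.maximum(Y, 0.0)      # guard against Euler discretization overshoot

print(abs(Y.mean() - y0) < 0.02)
```

Each multiplicative factor has mean one, so the discrete scheme is a martingale path by path in expectation, and the sample mean deviates from y0 only by Monte Carlo error.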

In addition, we make the following assumption.

Assumption 18.

0 < c_1 < σ(t, y) < c_2,   |K(t, x, y; ζ)| < c_3,   ν(D_t) < c_4,   (46)

for every (t, x, y) ∈ [0, T] × R × R_+, ζ ∈ Z, and for some positive constants c_1, c_2, c_3, c_4.

Remark 19. In the sequel, it may be useful to specify the dynamics of Y also in terms of the jump
measure m(dt, dz). Recalling Eqs. (6) and (14), we have

dY_t = Y_t σ(t, Y_t) dW^1_t + ∫_R z ( m(dt, dz) − λ(t, X_{t−}, Y_{t−}) φ(t, X_{t−}, Y_{t−}, dz) dt ).   (47)

The stochastic factor X, which affects the intensity and the jump size distribution of Y, may represent
the state of the economy and is not directly observable by market agents. This is a typical
situation arising in real financial markets.

We model by 𝔽^Y the information available to investors. Since Y is 𝔽^Y-adapted, it is in particular
an (𝔽^Y, P)-martingale with the following decomposition:

Y_t = y_0 + ∫_0^t Y_s σ(s, Y_s) dI_s + ∫_0^t ∫_R z ( m(ds, dz) − π_{s−}(λφ(dz)) ds ),   t ∈ [0, T].   (48)

By Eqs. (14) and (45), in this setting the first component of the innovation process I defined in
Eq. (16) is given by

I_t = W^1_t + ∫_0^t (1/(Y_s σ(s, Y_s))) ∫_R z ( π_s(λφ(dz)) − λ(s, X_s, Y_s) φ(s, X_s, Y_s, dz) ) ds.
0 R

Suppose that we are given a European-type contingent claim whose final payoff is a square
integrable F^Y_T-measurable random variable ξ, that is, ξ ∈ L²(F^Y_T), where

L²(F^Y_T) := { random variables Γ, F^Y_T-measurable, such that E[Γ²] < ∞ }.   (49)

The objective of the agent is to find the optimal hedging strategy for this derivative. Since the
number of random sources exceeds the number of tradeable risky assets, the market is incomplete.
It is well known that in this setting, perfect replication by self-financing strategies is not
feasible. Then, we suppose that the investor pursues the risk-minimization approach.
Risk minimization is a quadratic hedging method that allows one to determine a dynamic
investment strategy that replicates the claim perfectly with minimal cost. Let us introduce
the objects of interest, starting with some notation. For any pair of 𝔽-adapted (respectively,
𝔽^Y-adapted) processes Ψ_1, Ψ_2, we write ⟨Ψ_1, Ψ_2⟩^𝔽 for the predictable covariation
computed with respect to the filtration 𝔽 (respectively, ⟨Ψ_1, Ψ_2⟩^{𝔽^Y} for the predictable covariation
computed with respect to the filtration 𝔽^Y). Note that

⟨Y⟩^𝔽_t = ∫_0^t Y_s² ( σ²(s, Y_{s−}) + ∫_Z K²(s, X_{s−}, Y_{s−}; ζ) ν(dζ) ) ds
   = ∫_0^t ( Y_s² σ²(s, Y_{s−}) + ∫_R z² λ(s, X_{s−}, Y_{s−}) φ(s, X_{s−}, Y_{s−}, dz) ) ds,   t ∈ [0, T],   (50)

and since Y is also 𝔽^Y-adapted, we also have

⟨Y⟩^{𝔽^Y}_t = ∫_0^t ( Y_s² σ²(s, Y_{s−}) + ∫_R z² π_{s−}(λφ(dz)) ) ds,   t ∈ [0, T].   (51)

We stress that, due to the presence of a jump component, the predictable quadratic variations
of Y with respect to the filtrations 𝔽 and 𝔽^Y are different.

We now introduce a technical definition of two spaces, Θ(𝔽) and Θ(𝔽^Y).

Definition 20. The space Θ(𝔽^Y) (respectively, Θ(𝔽)) is the space of all 𝔽^Y-predictable (respectively,
𝔽-predictable) processes θ such that

E[ ∫_0^T θ_u² d⟨Y⟩^{𝔽^Y}_u ] < ∞   ( respectively, E[ ∫_0^T θ_u² d⟨Y⟩^𝔽_u ] < ∞ ).   (52)

We observe that for every θ ∈ Θ(𝔽^Y), thanks to 𝔽^Y-predictability, we have

E[ ∫_0^T θ_u² d⟨Y⟩^{𝔽^Y}_u ] = E[ ∫_0^T θ_u² d⟨Y⟩^𝔽_u ] < ∞,   (53)

which implies that Θ(𝔽^Y) ⊆ Θ(𝔽).

Since we have two different levels of information, represented by the filtrations 𝔽 and 𝔽^Y, we
may define two classes of admissible strategies.

Definition 21. An 𝔽^Y-strategy (respectively, 𝔽-strategy) is a pair ψ = (θ, η) of stochastic processes,
where θ represents the amount invested in the risky asset and η the amount invested in the riskless
asset, such that θ ∈ Θ(𝔽^Y) (respectively, θ ∈ Θ(𝔽)) and η is 𝔽^Y-adapted (respectively, 𝔽-adapted).

This definition reflects the fact that the investor's choices should be adapted to her/his knowledge
of the market. The value of a strategy ψ = (θ, η) is given by

V_t(ψ) = θ_t Y_t + η_t,   t ∈ [0, T],   (54)

and its cost is described by the process

C_t(ψ) = V_t(ψ) − ∫_0^t θ_u dY_u,   t ∈ [0, T].   (55)

In other terms, the cost of a strategy is the difference between the value process and the gains
process. For a self-financing strategy, the value and the gains processes coincide up to the initial
wealth V_0, and therefore the cost is constant and equal to C_t = V_0 for every t ∈ [0, T]. We
continue by defining the risk process in the partial information setting.
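The value/cost relation (54)-(55) is easy to verify in discrete time: if the riskless holding η is chosen so that no money is injected or withdrawn (the self-financing case), the cost process stays equal to the initial wealth. The price path and holdings below are arbitrary toy data.

```python
import numpy as np

# Discrete sketch of the value/cost processes (54)-(55): the cost is the
# value minus the accumulated trading gains. For a self-financing strategy
# the cost stays equal to V_0. Path and holdings are arbitrary toy data.
rng = np.random.default_rng(3)
Y = np.cumsum(rng.normal(0.0, 0.5, size=100)) + 10.0   # risky asset path
theta = np.sin(np.arange(100))                          # arbitrary holdings

V0 = 5.0
gains = np.concatenate(([0.0], np.cumsum(theta[:-1] * np.diff(Y))))  # discrete ∫ theta dY

# self-financing: value = initial wealth + gains, so eta = V - theta * Y
V = V0 + gains
eta = V - theta * Y
cost = V - gains            # Eq. (55)

print(np.allclose(cost, V0))
```

For a non-self-financing strategy (e.g., one forced to end with V_T = ξ), the cost process fluctuates, and risk minimization controls exactly the conditional variance of these fluctuations.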
Definition 22. Given an 𝔽^Y-strategy (respectively, an 𝔽-strategy) ψ = (θ, η), we denote by R^{𝔽^Y}(ψ)
(respectively, R^𝔽(ψ)) the associated risk process, defined as

R^{𝔽^Y}_t(ψ) := E[ (C_T(ψ) − C_t(ψ))² | F^Y_t ]   ( respectively, R^𝔽_t(ψ) := E[ (C_T(ψ) − C_t(ψ))² | F_t ] ),   (56)

for every t ∈ [0, T].


Then, we have the following definition of risk-minimizing strategy under partial information.

Definition 23. An 𝔽^Y-strategy ψ is risk minimizing if

i. V_T(ψ) = ξ;

ii. for any other 𝔽^Y-strategy ψ̃, we have R^{𝔽^Y}_t(ψ) ≤ R^{𝔽^Y}_t(ψ̃) for every t ∈ [0, T].

The corresponding definitions of risk process and risk-minimizing strategy under full information
can be obtained by replacing 𝔽^Y and R^{𝔽^Y}_t with 𝔽 and R^𝔽_t in Definition 23. To differentiate,
when necessary, we use the terms 𝔽^Y-risk-minimizing strategy or 𝔽-risk-minimizing
strategy. Criterion (ii) in Definition 23 can also be written as

min_{ψ: θ ∈ Θ(𝔽^Y)} E[ (C_T(ψ) − C_t(ψ))² ],   t ∈ [0, T],   (57)

which intuitively means that a strategy is risk minimizing if it minimizes the variance of the
cost. This equivalent formulation yields a nice property of risk-minimizing strategies: they
turn out to be self-financing on average, that is, the cost process C is a martingale and
therefore has constant expectation (see, e.g., Ref. [27, Lemma 2] or [28, Lemma 2.3]).
In the sequel, we aim to characterize the optimal hedging strategy for the contingent claim ξ
under full and partial information, that is, the 𝔽- and the 𝔽^Y-risk-minimizing strategies. To this end,
we introduce two orthogonal decompositions, known as the Galtchouk-Kunita-Watanabe
decompositions under full and partial information (see, e.g., Ref. [30]). To better understand the
relevance of these decompositions, assume for a moment completeness of the market and
full information. Then, it is well known that for every European-type contingent claim with
final payoff ξ, there exists a self-financing strategy ψ = (θ, η) such that

ξ = V_0 + ∫_0^T θ_u dY_u,   P-a.s.,   (58)

that is, a replicating portfolio is uniquely determined by the initial wealth and the
investment in the risky asset. When the market is incomplete, decomposition (58)
does not hold in general. Intuitively, this implies that we might expect additional terms
in Eq. (58), and according to the risk-minimization criterion, these additional terms need
to be such that the final cost does not deviate too much from the average cost, in
the quadratic sense. Specifically, we have the following decomposition of the random
variable ξ:

ξ = V_0 + ∫_0^T θ_u dY_u + G_T,   P-a.s.,   (59)

where G_T is the value at time T of a suitable process G. The minimality criterion requires that
G is a martingale orthogonal to Y. We refer the reader to Ref. [28] for a detailed survey. Under
suitable hypotheses, the above decomposition takes the name of Galtchouk-Kunita-Watanabe
decomposition.
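A one-period numerical analogue makes the decomposition concrete: writing a payoff as initial capital plus a position in the traded increment plus an orthogonal residual is exactly a least-squares regression, and the quadratic-hedging position is the regression coefficient. The Gaussian payoff model below is a toy assumption, not the chapter's market model.

```python
import numpy as np

# One-period analogue of decomposition (59): xi = V_0 + theta * dY + G,
# with the residual G orthogonal (uncorrelated) to the traded increment dY.
# Toy Gaussian data; theta is the least-squares regression coefficient.
rng = np.random.default_rng(4)
n = 200_000
dY = rng.normal(0.0, 1.0, size=n)            # traded risk
eps = rng.normal(0.0, 1.0, size=n)           # untraded risk
xi = 2.0 + 0.7 * dY + 0.5 * eps              # claim payoff

theta = np.cov(xi, dY)[0, 1] / np.var(dY)    # quadratic-hedging position
V0 = xi.mean() - theta * dY.mean()           # initial capital
G = xi - V0 - theta * dY                     # non-hedgeable residual

orth = np.mean(G * dY)                       # sample analogue of E[G dY]
print(theta, orth)
```

By construction of the regression, the recovered theta is close to the hedgeable loading (0.7 here) and the residual is essentially uncorrelated with dY, mirroring the orthogonality ⟨G, Y⟩ = 0 required of the GKW decomposition.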

Now we wish to be more formal, and we introduce the following definitions. Consider a
random variable ξ ∈ L²(F^Y_T). Since F^Y_T ⊆ F_T, we can define the following decompositions
for ξ.

Definition 24. a. The Galtchouk-Kunita-Watanabe decomposition of ξ ∈ L²(F^Y_T) with respect to Y and
𝔽 is given by

ξ = U^𝔽_0 + ∫_0^T θ^𝔽_u dY_u + G^𝔽_T   P-a.s.,   (60)

where U^𝔽_0 ∈ L²(F_0), θ^𝔽 ∈ Θ(𝔽), and G^𝔽 is a square integrable (𝔽, P)-martingale, with G^𝔽_0 = 0,
orthogonal to Y, that is, ⟨G^𝔽, Y⟩^𝔽_t = 0 for every t ∈ [0, T].

b. The Galtchouk-Kunita-Watanabe decomposition of ξ ∈ L²(F^Y_T) with respect to Y and 𝔽^Y is given by

ξ = U^{𝔽^Y}_0 + ∫_0^T θ^{𝔽^Y}_u dY_u + G^{𝔽^Y}_T   P-a.s.,   (61)

where U^{𝔽^Y}_0 ∈ L²(F^Y_0), θ^{𝔽^Y} ∈ Θ(𝔽^Y), and G^{𝔽^Y} is a square integrable (𝔽^Y, P)-martingale, with
G^{𝔽^Y}_0 = 0, strongly orthogonal to Y, that is, ⟨G^{𝔽^Y}, Y⟩^{𝔽^Y}_t = 0 for every t ∈ [0, T].

In the sequel, we refer to Eqs. (60) and (61) as the Galtchouk-Kunita-Watanabe decompositions
under full information and under partial information, respectively. Since Y is a square integrable
martingale with respect to both filtrations 𝔽 and 𝔽^Y, decompositions (60) and (61) exist.
The next proposition provides a relation between the integrands θ^𝔽 and θ^{𝔽^Y} of decompositions
(60) and (61) in terms of predictable projections. For any (𝔽, P)-predictable process A of
finite variation, we denote by A^{p,𝔽^Y} its (𝔽^Y, P)-dual predictable projection.¹

Proposition 25. The integrands in decompositions (60) and (61) satisfy the following relation:

θ^{𝔽^Y}_t = d( ∫_0^t θ^𝔽_u d⟨Y⟩^𝔽_u )^{p,𝔽^Y} / d⟨Y⟩^{p,𝔽^Y}_t,   t ∈ [0, T].   (62)

Here, ⟨Y⟩^{p,𝔽^Y} denotes the (𝔽^Y, P)-dual predictable projection of ⟨Y⟩^𝔽, and it is given by

⟨Y⟩^{p,𝔽^Y}_t = ⟨Y⟩^{𝔽^Y}_t = ∫_0^t Y_s² σ²(s, Y_{s−}) ds + ∫_0^t ∫_R z² π_{s−}(λφ(dz)) ds,   t ∈ [0, T].   (63)

¹ We call the (𝔽^Y, P)-dual predictable projection of a process A the 𝔽^Y-predictable finite variation
process A^{p,𝔽^Y} such that for any 𝔽^Y-predictable bounded process φ we have

E[ ∫_0^T φ_s dA_s ] = E[ ∫_0^T φ_s dA^{p,𝔽^Y}_s ].

Proof. First note that the (𝔽^Y, P)-dual predictable projection of the process ⟨Y⟩^𝔽 coincides with
the predictable quadratic variation of the process Y itself, computed with respect to its internal
filtration and given in Eq. (51), since for any (𝔽^Y, P)-predictable (bounded) process φ we have
E[ ∫_0^T φ_t d⟨Y⟩^𝔽_t ] = E[ ∫_0^T φ_t d⟨Y⟩^{𝔽^Y}_t ]. This proves Eq. (63).

Let

θ_t := d( ∫_0^t θ^𝔽_u d⟨Y⟩^𝔽_u )^{p,𝔽^Y} / d⟨Y⟩^{p,𝔽^Y}_t,   t ∈ [0, T].   (64)

By the Galtchouk-Kunita-Watanabe decomposition (60), we can write

ξ = U^𝔽_0 + ∫_0^T θ_u dY_u + G^𝔽_T + G̃_T   P-a.s.,   (65)

where G̃_t := ∫_0^t (θ^𝔽_u − θ_u) dY_u for every t ∈ [0, T]. We observe that for every 𝔽^Y-predictable
process φ the following holds:

E[ ∫_0^T φ_u θ_u d⟨Y⟩^{𝔽^Y}_u ] = E[ ∫_0^T φ_u θ_u d⟨Y⟩^𝔽_u ]
   = E[ ∫_0^T φ_u ( θ^𝔽_u d⟨Y⟩^𝔽_u )^{p,𝔽^Y} ] = E[ ∫_0^T φ_u θ^𝔽_u d⟨Y⟩^𝔽_u ].   (66)

By choosing φ = θ and applying the Cauchy-Schwarz inequality, we obtain

E[ ∫_0^T θ_u² d⟨Y⟩^{𝔽^Y}_u ] ≤ E[ ∫_0^T (θ^𝔽_u)² d⟨Y⟩^𝔽_u ] < ∞.   (67)

This implies that θ ∈ Θ(𝔽^Y) ⊆ Θ(𝔽) and that G̃ is an (𝔽, P)-martingale. Taking the conditional
expectation with respect to F^Y_T in Eq. (65) leads to

ξ = E[U^𝔽_0 | F^Y_T] + ∫_0^T θ_u dY_u + E[G^𝔽_T + G̃_T | F^Y_T] = E[U^𝔽_0 | F^Y_0] + ∫_0^T θ_u dY_u + Ĝ^{𝔽^Y}_T   P-a.s.,   (68)

where

Ĝ^{𝔽^Y}_t := E[U^𝔽_0 | F^Y_t] − E[U^𝔽_0 | F^Y_0] + E[G^𝔽_T | F^Y_t] + E[G̃_T | F^Y_t],   t ∈ [0, T],   (69)
342 Bayesian Inference

which provides the Galtchouk-Kunita-Watanabe decomposition Eq. (61) if we can show that the $(\mathbb{F}^Y, P)$-martingale $\widehat G^{\mathbb{F}^Y}$ is strongly orthogonal to $Y$, that is, if for any $(\mathbb{F}^Y, P)$-predictable (bounded) process $\varphi$ the following holds:

$$E\Big[\widehat G_T^{\mathbb{F}^Y} \int_0^T \varphi_u\, dY_u\Big] = 0. \quad (70)$$

Note that orthogonality of the term $E\big[U_0^{\mathbb{F}} \mid \mathcal F_t^Y\big] - E\big[U_0^{\mathbb{F}} \mid \mathcal F_0^Y\big] + E\big[G_T^{\mathbb{F}} \mid \mathcal F_t^Y\big]$ follows from the orthogonality of $G^{\mathbb{F}}$ and $Y$. Moreover, we have

$$E\Big[E\big[\widetilde G_T \mid \mathcal F_T^Y\big] \int_0^T \varphi_u\, dY_u\Big] = E\Big[\widetilde G_T \int_0^T \varphi_u\, dY_u\Big] = E\Big[\int_0^T \varphi_u\,\big(\theta_u^{\mathbb{F}} - \theta_u\big)\, d\langle Y\rangle_u^{\mathbb{F}}\Big], \quad (71)$$

and by Eq. (64)

$$E\Big[\int_0^T \varphi_u\,\theta_u\, d\langle Y\rangle_u^{\mathbb{F}}\Big] = E\Big[\int_0^T \varphi_u\,\theta_u\, d\langle Y\rangle_u^{\mathbb{F}^Y}\Big] = E\Big[\int_0^T \varphi_u\, d\Big(\int_0^u \theta_r^{\mathbb{F}}\, d\langle Y\rangle_r^{\mathbb{F}}\Big)^{p,\mathbb{F}^Y}\Big] = E\Big[\int_0^T \varphi_u\,\theta_u^{\mathbb{F}}\, d\langle Y\rangle_u^{\mathbb{F}}\Big], \quad (72)$$

which proves strong orthogonality. □
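As a concrete illustration of the dual predictable projection used throughout this proof (a standard textbook example, not part of the chapter's model): for a Poisson process $N$ with constant intensity $\lambda > 0$ and natural filtration $\mathbb{F}^N$, the dual predictable projection of $N$ is its compensator:

```latex
% Compensator of a Poisson process as a dual predictable projection:
N_t^{p,\mathbb{F}^N} = \lambda t,
\qquad \text{since} \qquad
E\Big[\int_0^T \varphi_s \, dN_s\Big] = E\Big[\int_0^T \varphi_s \,\lambda\, ds\Big]
\quad \text{for every bounded } \mathbb{F}^N\text{-predictable } \varphi .
```

In Eq. (63), the same mechanism replaces the $\mathbb{F}$-dependent intensity of the jump part by its filtered estimate $\pi_{s^-}(\lambda\varphi(dz))$.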


Theorem 26 shows the relation between the Galtchouk-Kunita-Watanabe decompositions and the optimal strategies under full and partial information.

Theorem 26. i. Every contingent claim $\xi \in L^2(\mathcal F_T^Y, P)$ admits a unique $\mathbb{F}$-risk-minimizing strategy $\psi^{*,\mathbb{F}} = (\theta^{*,\mathbb{F}}, \eta^{*,\mathbb{F}})$, explicitly given by

$$\theta^{*,\mathbb{F}} = \theta^{\mathbb{F}}, \qquad \eta^{*,\mathbb{F}} = V(\psi^{*,\mathbb{F}}) - \theta^{*,\mathbb{F}}\, Y, \quad (73)$$

where $V_t(\psi^{*,\mathbb{F}}) = E[\xi \mid \mathcal F_t]$ for every $t \in [0,T]$, with minimal cost

$$C_t(\psi^{*,\mathbb{F}}) = U_0^{\mathbb{F}} + G_t^{\mathbb{F}}, \qquad t \in [0,T]. \quad (74)$$

Here, $\theta^{\mathbb{F}}$, $U_0^{\mathbb{F}}$, and $G^{\mathbb{F}}$ are given in Definition 24 part a.

ii. Moreover, it also admits a unique $\mathbb{F}^Y$-risk-minimizing strategy $\psi^{*,\mathbb{F}^Y} = (\theta^{*,\mathbb{F}^Y}, \eta^{*,\mathbb{F}^Y})$, explicitly given by

$$\theta^{*,\mathbb{F}^Y} = \theta^{\mathbb{F}^Y}, \qquad \eta^{*,\mathbb{F}^Y} = V(\psi^{*,\mathbb{F}^Y}) - \theta^{*,\mathbb{F}^Y}\, Y, \quad (75)$$

where $V_t(\psi^{*,\mathbb{F}^Y}) = E\big[\xi \mid \mathcal F_t^Y\big]$ for every $t \in [0,T]$, with minimal cost

$$C_t(\psi^{*,\mathbb{F}^Y}) = U_0^{\mathbb{F}^Y} + G_t^{\mathbb{F}^Y}, \qquad t \in [0,T], \quad (76)$$

and $\theta^{\mathbb{F}^Y}$, $U_0^{\mathbb{F}^Y}$, and $G^{\mathbb{F}^Y}$ are given in Definition 24 part b.

Proof. The proof of part i. is given, for example, in Ref. [28, Theorem 2.4]. For part ii., note that using the martingale representation of $Y$ with respect to its internal filtration given in Eq. (48) and the fact that $\xi \in L^2(\mathcal F_T^Y)$, it is possible to reduce the partial information case to the full information one and apply [28, Theorem 2.4] again. □

Proposition 25 helps us in the computation of the optimal strategy under partial information. Indeed, it is sufficient to compute the corresponding strategy $\theta^{*,\mathbb{F}}$ under full information and the Radon-Nikodym derivative given in Eq. (62). To get more explicit representations, we assume that the payoff of the contingent claim has the form $\xi = H(T, Y_T)$, for some function $H : [0,T] \times \mathbb{R}_+ \to \mathbb{R}$. Let $\mathcal L^{X,Y}$ denote the Markov generator of the pair $(X, Y)$, that is,

$$\mathcal L^{X,Y} f(t,x,y) = \frac{\partial f}{\partial t} + b_0(t,x)\,\frac{\partial f}{\partial x} + b_1(t,x,y)\,\frac{\partial f}{\partial y} + \frac{1}{2}\,\sigma_0^2(t,x)\,\frac{\partial^2 f}{\partial x^2} + \rho\,y\,\sigma_0(t,x)\,\sigma(t,y)\,\frac{\partial^2 f}{\partial x\,\partial y} + \frac{1}{2}\,y^2\,\sigma^2(t,y)\,\frac{\partial^2 f}{\partial y^2} + \int_Z \Delta f(t,x,y;\zeta)\,\nu(d\zeta) \quad (77)$$

for every $f \in C_b^{1,2,2}([0,T] \times \mathbb{R} \times \mathbb{R}_+)$, where

$$\Delta f(t,x,y;\zeta) := f\big(t,\, x + K_0(t,x;\zeta),\, y(1 + K(t,x,y;\zeta))\big) - f(t,x,y). \quad (78)$$

By the Markov property, we have that for any $t \in [0,T]$ there exists a measurable function $h(t,x,y)$ such that

$$h(t, X_t, Y_t) = E\big[H(T, Y_T) \mid \mathcal F_t\big]. \quad (79)$$

If the function $h$ is sufficiently regular, for instance $h \in C_b^{1,2,2}([0,T] \times \mathbb{R} \times \mathbb{R}_+)$, we can apply Itô's formula and get that

$$h(t, X_t, Y_t) = h(0, X_0, Y_0) + \int_0^t \mathcal L^{X,Y} h(s, X_s, Y_s)\, ds + M_t^h, \quad (80)$$

where $M^h$ is the $(\mathbb{F}, P)$-martingale given by

$$M_t^h = \int_0^t \frac{\partial h}{\partial x}(s, X_s, Y_s)\,\sigma_0(s, X_s)\, dW_s^0 + \int_0^t \frac{\partial h}{\partial y}(s, X_s, Y_s)\, Y_s\,\sigma(s, Y_s)\, dW_s^1 + \int_0^t\!\int_Z \Delta h(s, X_{s^-}, Y_{s^-};\zeta)\,\big(N(ds, d\zeta) - \nu(d\zeta)\,ds\big). \quad (81)$$

By Eq. (79), the process $\{h(t, X_t, Y_t),\, t \in [0,T]\}$ is an $(\mathbb{F}, P)$-martingale. Then, the finite variation term vanishes, which means that the function $h$ satisfies $\mathcal L^{X,Y} h(t, X_t, Y_t) = 0$, $P$-a.s. and for almost every $t \in [0,T]$. The next proposition provides the risk-minimizing strategy under partial information.
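In other words (a restatement added here for clarity, combining Eq. (79) with the vanishing of the finite variation term; it is implied by, rather than quoted from, the text above), the function $h$ solves a backward partial integro-differential equation with terminal condition given by the payoff:

```latex
% Backward PIDE implied by Eqs. (79)-(80):
\mathcal{L}^{X,Y} h(t,x,y) = 0,
  \qquad (t,x,y) \in [0,T) \times \mathbb{R} \times \mathbb{R}_+,
\qquad h(T,x,y) = H(T,y).
```

The terminal condition follows from $h(T, X_T, Y_T) = E[H(T, Y_T) \mid \mathcal F_T] = H(T, Y_T)$.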

Proposition 27. Assume $h \in C_b^{1,2,2}([0,T] \times \mathbb{R} \times \mathbb{R}_+)$. Then the first components $\theta^{*,\mathbb{F}}$ and $\theta^{*,\mathbb{F}^Y}$ of the risk-minimizing strategies under full and partial information are given by

$$\theta_t^{*,\mathbb{F}} = \frac{g(t, X_{t^-}, Y_{t^-})}{Y_{t^-}^2\,\sigma^2(t, Y_{t^-}) + \int_{\mathbb{R}} z^2\,\lambda(t, X_{t^-}, Y_{t^-})\,\varphi(t, X_{t^-}, Y_{t^-}, dz)}, \qquad t \in [0,T], \quad (82)$$

$$\theta_t^{*,\mathbb{F}^Y} = \frac{\pi_{t^-}(g)}{Y_{t^-}^2\,\sigma^2(t, Y_{t^-}) + \int_{\mathbb{R}} z^2\,\pi_{t^-}(\lambda\varphi(dz))}, \qquad t \in [0,T], \quad (83)$$

respectively, where the function $g(t,x,y)$ is

$$g(t,x,y) = \rho\,\sigma_0(t,x)\,y\,\sigma(t,y)\,\frac{\partial h}{\partial x} + y^2\,\sigma^2(t,y)\,\frac{\partial h}{\partial y} + \int_Z y\,K(t,x,y;\zeta)\,\Delta h(t,x,y;\zeta)\,\nu(d\zeta). \quad (84)$$

Proof. Consider decomposition Eq. (60) for $\xi = H(T, Y_T)$. Then, conditioning on $\mathcal F_t$, we get

$$h(t, X_t, Y_t) = U_0 + \int_0^t \theta_s^{*,\mathbb{F}}\, dY_s + G_t^{\mathbb{F}}. \quad (85)$$

Taking the covariation with respect to $Y$ and $\mathbb{F}$, we obtain

$$\langle h(\cdot, X, Y), Y\rangle_t^{\mathbb{F}} = \int_0^t \theta_s^{*,\mathbb{F}}\, d\langle Y\rangle_s^{\mathbb{F}}. \quad (86)$$

On the other hand, $h(t, X_t, Y_t) = h(0, X_0, Y_0) + M_t^h$; then, taking Eqs. (81) and (44) into account, we get that

$$\langle h(\cdot, X, Y), Y\rangle_t^{\mathbb{F}} = \int_0^t g(s, X_s, Y_s)\, ds, \quad (87)$$

where $g(t,x,y)$ is given in Eq. (84). Hence, by Eqs. (50) and (87), we may represent $\theta^{*,\mathbb{F}}$ as

$$\theta_t^{*,\mathbb{F}} = \frac{d\langle h(\cdot, X, Y), Y\rangle_t^{\mathbb{F}}}{d\langle Y\rangle_t^{\mathbb{F}}} = \frac{g(t, X_{t^-}, Y_{t^-})}{Y_{t^-}^2\,\sigma^2(t, Y_{t^-}) + \int_{\mathbb{R}} z^2\,\lambda(t, X_{t^-}, Y_{t^-})\,\varphi(t, X_{t^-}, Y_{t^-}, dz)}. \quad (88)$$

Note that, by Eq. (51) and

$$\Big(\int_0^t \theta_u^{*,\mathbb{F}}\, d\langle Y\rangle_u^{\mathbb{F}}\Big)^{p,\mathbb{F}^Y} = \Big(\int_0^t g(s, X_s, Y_s)\, ds\Big)^{p,\mathbb{F}^Y} = \int_0^t \pi_s(g)\, ds, \quad (89)$$

applying Eq. (62) we get representation Eq. (83). □


Our ultimate objective in this section is to investigate the relation between the costs of the $\mathbb{F}$-optimal strategy and the $\mathbb{F}^Y$-optimal strategy, or equivalently the associated risk processes.

It clearly holds that $\theta^{*,\mathbb{F}^Y} \in \Theta(\mathbb{F})$, and then the $\mathbb{F}^Y$-risk-minimizing strategy is also an $\mathbb{F}$-strategy. Considering the corresponding risks, we have

$$E\Big[\big(C_T(\psi^{*,\mathbb{F}^Y}) - C_t(\psi^{*,\mathbb{F}^Y})\big)^2 \,\Big|\, \mathcal F_t^Y\Big] = E\Big[E\big[\big(C_T(\psi^{*,\mathbb{F}^Y}) - C_t(\psi^{*,\mathbb{F}^Y})\big)^2 \mid \mathcal F_t\big] \,\Big|\, \mathcal F_t^Y\Big] \ge E\Big[E\big[\big(C_T(\psi^{*,\mathbb{F}}) - C_t(\psi^{*,\mathbb{F}})\big)^2 \mid \mathcal F_t\big] \,\Big|\, \mathcal F_t^Y\Big] = E\Big[\big(C_T(\psi^{*,\mathbb{F}}) - C_t(\psi^{*,\mathbb{F}})\big)^2 \,\Big|\, \mathcal F_t^Y\Big], \quad (90)$$

and then $E\big[R_t^{\mathbb{F}}(\psi^{*,\mathbb{F}})\big] \le E\big[R_t^{\mathbb{F}^Y}(\psi^{*,\mathbb{F}^Y})\big]$ for every $t \in [0,T]$. In the remaining part of the paper, we assume that $\mathcal F_0^Y = \mathcal F_0 = \{\Omega, \emptyset\}$, and we wish to measure the difference in the total risk taken by an informed investor, endowed with the filtration $\mathbb{F}$, and a partially informed investor, whose information is described by $\mathbb{F}^Y$. Precisely, we compute the difference $R_0^{\mathbb{F}^Y}(\psi^{*,\mathbb{F}^Y}) - R_0^{\mathbb{F}}(\psi^{*,\mathbb{F}})$. By decompositions Eqs. (60) and (61), we have that $C_T(\psi^{*,\mathbb{F}}) - C_0(\psi^{*,\mathbb{F}}) = G_T^{\mathbb{F}}$ and $C_T(\psi^{*,\mathbb{F}^Y}) - C_0(\psi^{*,\mathbb{F}^Y}) = G_T^{\mathbb{F}^Y}$, and also

$$G_T^{\mathbb{F}^Y} = U_0^{\mathbb{F}} - U_0^{\mathbb{F}^Y} + \int_0^T \big(\theta_r^{*,\mathbb{F}} - \theta_r^{*,\mathbb{F}^Y}\big)\, dY_r + G_T^{\mathbb{F}}, \quad (91)$$

where, since $\mathcal F_0^Y = \mathcal F_0 = \{\Omega, \emptyset\}$, we have $U_0^{\mathbb{F}^Y} = U_0^{\mathbb{F}}$. Then, computing the square of $G_T^{\mathbb{F}^Y}$ and taking the expectation, we get

$$E\big[(G_T^{\mathbb{F}^Y})^2\big] = E\big[(G_T^{\mathbb{F}})^2\big] + E\Big[\Big(\int_0^T \big(\theta_r^{*,\mathbb{F}} - \theta_r^{*,\mathbb{F}^Y}\big)\, dY_r\Big)^2\Big] + 2\,E\Big[G_T^{\mathbb{F}} \int_0^T \big(\theta_r^{*,\mathbb{F}} - \theta_r^{*,\mathbb{F}^Y}\big)\, dY_r\Big]. \quad (92)$$

It follows from the Itô isometry and the fact that $G^{\mathbb{F}}$ is orthogonal to $Y$ that

$$E\big[(G_T^{\mathbb{F}^Y})^2\big] = E\big[(G_T^{\mathbb{F}})^2\big] + E\Big[\int_0^T \big(\theta_r^{*,\mathbb{F}} - \theta_r^{*,\mathbb{F}^Y}\big)^2\, d\langle Y\rangle_r^{\mathbb{F}}\Big]. \quad (93)$$

Then, the difference that we want to evaluate becomes

$$R_0^{\mathbb{F}^Y}(\psi^{*,\mathbb{F}^Y}) - R_0^{\mathbb{F}}(\psi^{*,\mathbb{F}}) = E\big[(G_T^{\mathbb{F}^Y})^2\big] - E\big[(G_T^{\mathbb{F}})^2\big] = E\Big[\int_0^T \big(\theta_r^{*,\mathbb{F}} - \theta_r^{*,\mathbb{F}^Y}\big)^2\, d\langle Y\rangle_r^{\mathbb{F}}\Big] = E\Big[\int_0^T (\theta_r^{*,\mathbb{F}})^2\, d\langle Y\rangle_r^{\mathbb{F}}\Big] + E\Big[\int_0^T (\theta_r^{*,\mathbb{F}^Y})^2\, d\langle Y\rangle_r^{\mathbb{F}}\Big] - 2\,E\Big[\int_0^T \theta_r^{*,\mathbb{F}}\,\theta_r^{*,\mathbb{F}^Y}\, d\langle Y\rangle_r^{\mathbb{F}}\Big]. \quad (94)$$

Using Eq. (62) and the definition of $\mathbb{F}^Y$-dual predictable projections, we have that

$$E\Big[\int_0^t \theta_r^{*,\mathbb{F}}\,\theta_r^{*,\mathbb{F}^Y}\, d\langle Y\rangle_r^{\mathbb{F}}\Big] = E\Big[\int_0^t (\theta_r^{*,\mathbb{F}^Y})^2\, d\langle Y\rangle_r^{\mathbb{F}^Y}\Big] = E\Big[\int_0^t (\theta_r^{*,\mathbb{F}^Y})^2\, d\langle Y\rangle_r^{\mathbb{F}}\Big], \quad (95)$$

which implies

$$R_0^{\mathbb{F}^Y}(\psi^{*,\mathbb{F}^Y}) - R_0^{\mathbb{F}}(\psi^{*,\mathbb{F}}) = E\Big[\int_0^T (\theta_r^{*,\mathbb{F}})^2\, d\langle Y\rangle_r^{\mathbb{F}}\Big] - E\Big[\int_0^T (\theta_r^{*,\mathbb{F}^Y})^2\, d\langle Y\rangle_r^{\mathbb{F}}\Big]. \quad (96)$$

Plugging in the expressions for the optimal strategies given in Eqs. (82) and (83), respectively, and denoting $\Sigma(t, X_t, Y_t) := Y_t^2\,\sigma^2(t, Y_t) + \int_{\mathbb{R}} z^2\,\lambda(t, X_{t^-}, Y_{t^-})\,\varphi(t, X_{t^-}, Y_{t^-}, dz)$, we have

$$R_0^{\mathbb{F}^Y}(\psi^{*,\mathbb{F}^Y}) - R_0^{\mathbb{F}}(\psi^{*,\mathbb{F}}) = E\Big[\int_0^T \Big(\frac{g^2(t, X_t, Y_t)}{\Sigma(t, X_t, Y_t)} - \frac{\pi_t^2(g)}{\pi_t(\Sigma)}\Big)\, dt\Big] \le C\,E\Big[\int_0^T \big(g^2(t, X_t, Y_t) - \pi_t^2(g)\big)\, dt\Big] = C\,E\Big[\int_0^T \big(g(t, X_t, Y_t) - \pi_t(g)\big)^2\, dt\Big] \quad (97)$$

for some $C > 0$, where the inequality follows by Assumption 18 and, in the last equality, we used $E\big[\int_0^T 2\, g(t, X_t, Y_t)\,\pi_t(g)\, dt\big] = E\big[\int_0^T 2\,\pi_t(g)^2\, dt\big]$.

We can conclude by saying that we have found an upper bound for the expected difference between the total risks taken by an informed investor and a partially informed one, which is directly proportional to the mean-squared error between the process $\{g(t, X_t, Y_t),\, t \in [0,T]\}$ and its filtered estimate $\pi(g) = \{\pi_t(g),\, t \in [0,T]\}$.
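The bound in Eq. (97) says that the partially informed investor's excess risk is controlled by the mean-squared filtering error between $g$ and its filtered estimate. As a numerical illustration of such an error (a hedged sketch only: it replaces the chapter's jump-diffusion model with a discrete-time linear-Gaussian state-space model, for which the filter $\pi_t(\cdot)$ is the classical Kalman filter, and takes $g$ to be the unobserved state itself), the following estimates the mean-squared error between the signal and its filtered estimate by Monte Carlo:

```python
import numpy as np

def kalman_mse(a=0.9, c=1.0, q=0.04, r=0.25, n_steps=400, n_paths=200, seed=0):
    """Monte Carlo estimate of the mean-squared filtering error
    E[(X_k - E[X_k | Y_1,...,Y_k])^2] for the linear-Gaussian model
        X_{k+1} = a X_k + w_k,  w_k ~ N(0, q)   (unobserved state)
        Y_k     = c X_k + v_k,  v_k ~ N(0, r)   (observations).
    Returns the empirical MSE and the filter's final error variance.
    """
    rng = np.random.default_rng(seed)
    sq_err, count = 0.0, 0
    for _ in range(n_paths):
        x = 0.0            # true state (known exactly at time 0)
        m, p = 0.0, 1.0    # filter mean and (conservative) initial variance
        for _ in range(n_steps):
            x = a * x + rng.normal(0.0, np.sqrt(q))   # propagate the state
            y = c * x + rng.normal(0.0, np.sqrt(r))   # noisy observation
            m, p = a * m, a * a * p + q               # Kalman prediction
            k = p * c / (c * c * p + r)               # Kalman gain
            m, p = m + k * (y - c * m), (1.0 - k * c) * p  # update
            sq_err += (x - m) ** 2
            count += 1
    return sq_err / count, p

mse, filter_var = kalman_mse()
print(mse, filter_var)
```

Because the Kalman filter is exact in this linear-Gaussian setting, the empirical mean-squared error agrees with the filter's own steady-state error variance; in the chapter's model, the analogous mean-squared error is exactly the quantity appearing on the right-hand side of Eq. (97).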

Author details

Claudia Ceci1* and Katia Colaneri2

*Address all correspondence to: [email protected]


1 Department of Economics, University of Chieti-Pescara, Pescara, Italy
2 Department of Economics, University of Perugia, Perugia, Italy

References

[1] Kushner H. On the differential equations satisfied by conditional probability densities of


Markov processes, with applications. Journal of the Society for Industrial and Applied
Mathematics, Series A: Control. 1964;2(1):106-119
[2] Kushner H. Dynamical equations for optimal nonlinear filtering. Journal of Differential
Equations. 1967;3(2):179-190
[3] Zakai M. On the optimal filtering of diffusion processes. Probability Theory and Related
Fields. 1969;11(3):230-243

[4] Liptser RS, Shiryaev A. Statistics of Random Processes I. Berlin Heidelberg: Springer-Verlag; 1977
[5] Kallianpur G. Stochastic Filtering Theory. New York: Springer-Verlag; 1980
[6] Elliott RJ. Stochastic Calculus and Applications. Berlin Heidelberg New York: Springer; 1982

[7] Kurtz TG, Ocone D. Unique characterization of conditional distributions in nonlinear filtering. Annals of Probability. 1988;16:80-107
[8] Bhatt AG, Kallianpur G, Karandikar RL. Uniqueness and robustness of solution of measure-valued equations of nonlinear filtering. The Annals of Probability. 1995;23(4):1895-1938
[9] Brémaud P. Point Processes and Queues. New York: Springer-Verlag; 1980
[10] Kliemann WH, Koch G, Marchetti F. On the unnormalized solution of the filtering problem with counting process observations. IEEE Transactions on Information Theory. 1990;36:1415-1425
[11] Ceci C, Gerardi A. Filtering of a Markov jump process with counting observations.
Applied Mathematics & Optimization. 2000;42:1-18

[12] Frey R, Runggaldier W. A nonlinear filtering approach to volatility estimation with a


view towards high frequency data. International Journal of Theoretical and Applied
Finance. 2001;4(2):199-210

[13] Ceci C, Gerardi A. A model for high frequency data under partial information: A filtering
approach. International Journal of Theoretical and Applied Finance. 2006;9(4):1-22
[14] Ceci C. Risk minimizing hedging for a partially observed high frequency data model.
Stochastics: An International Journal of Probability and Stochastic Processes. 2006;78(1):
13-31
[15] Frey R, Runggaldier W. Pricing credit derivatives under incomplete information: A
nonlinear-filtering approach. Finance and Stochastics. 2010;14:495-526
[16] Frey R, Schimdt T. Pricing and hedging of credit derivatives via the innovation approach
to nonlinear filtering. Finance and Stochastics. 2011;16(1):105-133
[17] Ceci C, Colaneri K. Nonlinear filtering for jump diffusion observations. Advances in
Applied Probability. 2012;44(3):678-701

[18] Ceci C, Colaneri K. The Zakai equation of nonlinear filtering for jump-diffusion observa-
tions: Existence and uniqueness. Applied Mathematics & Optimization. 2014;69(1):47-82

[19] Elliott R, Malcolm W. Discrete-time expectation maximization algorithms for Markov-


modulated Poisson processes. IEEE Transactions on Automatic Control. 2008;53(1):247-256
[20] Björk T, Davis M, Landén C. Optimal investment under partial information. Mathemati-
cal Methods of Operations Research. 2010;71(2):371-399

[21] Ceci C, Colaneri K, Cretarola A. Local risk-minimization under restricted information to


asset prices. Electronic Journal of Probability. 2015;20(96):1-30
[22] Ceci C, Colaneri K, Cretarola A. Hedging of unit-linked life insurance contracts with
unobservable mortality hazard rate via local risk-minimization. Insurance: Mathematics
and Economics. 2015;60:47-60.
[23] Ceci C, Gerardi A. Pricing for geometric marked point processes under partial informa-
tion: entropy approach. International Journal of Theoretical and Applied Finance.
2009;12:179-207
[24] Frey R. Risk minimization with incomplete information in a model for high-frequency
data. Mathematical Finance. 2000;10(2):215-222
[25] Nagai H, Peng S. Risk-sensitive dynamic portfolio optimization with partial information
on infinite time horizon. Annals of Applied Probability. 2000;12:173-195

[26] Bäuerle N, Rieder U. Portfolio optimization with jumps and unobservable intensity
process. Mathematical Finance. 2007;17(2):205-224
[27] Föllmer H, Sondermann D. Hedging of non redundant contingent claims. In:
Hildenbrand W, Mas-Colell A, editors. Contribution to Mathematical Economics. North
Holland, Amsterdam New York Oxford Tokyo; 1986. pp. 205-223
[28] Schweizer M. A guided tour through quadratic hedging approaches. In: Jouini E,
Cvitanic J, Musiela M, editors. Option Pricing, Interest Rate and Risk Management.
Cambridge University Press; Cambridge, 2001. pp. 538-574
[29] Schweizer M. Risk minimizing hedging strategies under partial information. Mathemat-
ical Finance. 1994;4:327-342
[30] Ceci C, Cretarola A, Russo F. GKW representation theorem under restricted information.
An application to risk-minimization. Stochastics and Dynamics. 2014;14(2):1350019 (p. 23)
[31] Jacod J, Shiryaev A. Limit Theorems for Stochastic Processes. 2nd ed. Berlin: Springer;
2003

[32] Protter P, Shimbo K. No arbitrage and general semimartingales. In: Ethier SN, Feng J, Stockbridge RH, editors. Markov Processes and Related Topics: A Festschrift for Thomas G. Kurtz. Beachwood, Ohio: Institute of Mathematical Statistics; 2008. pp. 267-283
DOI: 10.5772/intechopen.70131

Chapter 18

Airlines Content Recommendations Based on Passengers' Choice Using Bayesian Belief Networks

Sien Chen, Wenqiang Huang, Mengxi Chen, Junjiang Zhong and Jie Cheng

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/intechopen.70131

Abstract
Faced with the increasingly fierce competition in the aviation market, the strategy of
consumer choice has gained increasing significance in both academia and practice. As
ever-increasing travel choices and growing consumer heterogeneity, how do airline com-
panies satisfy passengers' needs? With a vast amount of data, how do airline managers
combine information to excavate the relationship between independent variables to gain
insight about passengers' choices and value system as well as determining best personal-
ized contents to them? Using the real case of China Southern Airlines, this paper illus-
trates how Bayesian belief network (BBN) can enable airlines dynamically recommend
relevant contents based on predicting passengers' choice to optimize the loyalty. The
findings of this study provide airline companies useful insights to better understand the
passengers' choices and develop effective strategies for growing customer relationship.

Keywords: consumer choice, Bayesian belief network, recommendation system

1. Introduction

In a world of increasingly global competition, companies have to compete on the effectiveness


and efficiency of their marketing strategies to capture new opportunities to satisfy custom-
ers' needs. In other words, having the greatest product at the lowest price is not competi-
tive enough. Choice behavior is affected by a consumer's own preference for entire product
categories and particular brands, allowing companies to collect market and industry data,
learn about consumer preference, and change sales tactics. In general, companies must con-
sider consumer choice and offer their customers varieties of differentiated products and dif-
ferent types of choices to meet consumer demand when they formulate revenue decisions

© 2017 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

and marketing strategies. For instance, most airlines have different fare classes (e.g., economy
class versus first class) that differ in the level of services and facilities available for customers.
Companies have to understand the choices that consumers make when facing such a product
assortment and provide appropriate contents for each consumer. Once individual choice has
been modeled, the choice prediction would be of great value to managers for the estimation
of the impact of a change in product formulation [1, 2].

What one cares about most is choice: the selection of suitable content from a set of available alternatives. Given the growing diversity of purchasing channels and information media, companies are increasingly interested in modeling and understanding the actual process through which consumers choose products, in addition to measuring consumers' future choices. Better
understanding of consumer choices and predicting preference is important for enterprises to
introduce new products and implement target marketing. Preference prediction could also be
used more extensively by companies to guide decision optimization [3]. Researchers and managers are mostly interested in understanding human choice behavior, particularly the underlying choice mechanisms, and in revealing and investigating the fundamental reasons behind it. Choice behavior is complex yet rational; as a result, decision makers seek to simplify the formulation of the choice process. To capture consumers' choice decisions, a method that estimates direct and indirect effects, situation-specific variables, and clear causal relationships could better represent choice behavior mechanisms. Based on the choice mechanism, companies need to measure preferences and predict consumer decision making to conduct market
research to design products. In this chapter, we aim to investigate these concerns:

• How can we infer consumers' choice for content in the future?

• How to design a model using only current period choices to infer consumers' inter-tempo-
ral preferences?

• Is the process dynamic, allowing researchers to analyze the influence of changing effects?
With increasing awareness of competition in transportation, airline companies that hope to survive and make a profit have to realize that their customer base is their most valuable competitive advantage and do their best to satisfy the needs of their customers. Moreover, Mowen (1988) emphasized that managers can reset promotional strategies to satisfy different types of consumers' desires most effectively, through the best channels [7]. Based on the above analysis, we decided to introduce a personalized content recommendation system to satisfy Chinese air passengers' needs. A good relationship with customers is crucial for airline companies to maintain a competitive advantage and to make a profit in the long run. Using real historical data from China Southern Airlines, a Bayesian network can match the best personalized content to each individual passenger.

2. Consumer choice behavior and Bayesian belief network

This chapter introduces Bayesian belief networks (BBNs) for predicting air passengers' choice.
On the basis of these choices, airlines can recommend best relevant content to passengers,
Airlines Content Recommendations Based on Passengers' Choice Using Bayesian Belief Networks 351
http://dx.doi.org/10.5772/intechopen.70131

including products, service, tips, notices, feature introductions, and information sharing to
improve their travel experience, satisfaction, and loyalty. The remainder of this paper is organized as follows. Section 2 briefly reviews consumer choice behavior, provides some definitions, and illustrates the advantages of Bayesian belief networks. In the next section,
we establish BBN models by using the case of China Southern Airlines with real transaction
data, including passengers' basic information, history decision options, and purchase charac-
teristics to predict the possible contents which the consumer will choose, followed by model
results and discussion.

2.1. Consumer choice behavior

When consumers face multiple alternative products, brands, and services, they tend to
repeat the same choices that proved satisfactory in similar situations [4]. Information
integration theory offers a specific mechanism to describe how individuals integrate
separate pieces of available information into an overall index of preferences [5, 6]. The
theory proposes that in situations where information about the products and brands are
available in the marketplace, consumers tend to value and weight product attributes
more often at the time of making a purchase decision. We formulated a comprehensive evaluation by combining consumers' values and weights under certain rules. Marketing managers should carefully study these decision-making processes and results to understand where consumers collect relevant information, how consumers form beliefs, and what criteria consumers use to make product or service choices. As a result, companies can develop products that emphasize the appropriate attributes, and managers can reset promotional strategies to satisfy different types of consumers' desires most effectively, through the best channels [7]. Another issue of interest to academia is determining whether there are systematic differences in consumers' choice behavior. Identifying and understanding these differences are important for developing or formulating effective
marketing strategies.

Consumer choice behavior has been mainly conceptualized as a combination of socio-demographics and the attributes of alternatives [8]. Constructs such as utility, attitude, or cognition are used to map these attributes onto choice behavior. However, little recent research has examined choice process models that consider both consumers' socio-demographic characteristics and the attributes of their decision alternatives.

2.1.1. Consumer choice behavior in airline industry

In recent years, the airline industry has faced economic challenges, coupled with volatile fuel prices and pressure for environmental protection. In addition, with increasing competition caused by the development of other transportation alternatives, airline companies that hope to survive and make a profit have to realize that their customer base is their most valuable competitive advantage and try their best to satisfy the needs of their customers. Thus, a good relationship with customers is crucial for airline companies to maintain a competitive advantage and to make a profit in the long run. Domestic and international airline companies have long shifted their attention to customer relationship management [9].

Relationship with passengers has been taken as one of the most important goals for every
airline company to maximize passengers' loyalty and revenues. Besides good performance in
airline operation and business management, another important key for success is leveraging
power of customer relationship to attain superb performance. Those airline companies, who
could correctly estimate trends and risks in the airline market and take necessary actions to
satisfy their customers, could be much more successful in the industry.

Although understanding changing needs of passengers is of great importance for airlines,


passengers' decision-making processes have received relatively little managerial attention.
Therefore, further understanding of those decision-making processes is crucial for airline com-
panies to improve their operations and business models [10].

With today's ever-increasing travel choices and growing consumer heterogeneity, numer-
ous factors affect passengers' choices, for example, their socio-demographic status, decision-
making patterns, cultural background, ticket cost, travel objectives, time schedule, and so on.
The role of each factor is difficult to define, let alone the interaction between different factors.
Leisure travelers are becoming more and more supply-oriented, selecting airlines with the most convenient schedules and the best service experience. A shift from first selecting the travel destination and then seeking appropriate transportation to one in which a desirable airline service is chosen first and the trip is arranged around it will very likely alter the dominant trend in travel behavior.

2.2. Bayesian belief network

Belief networks are probabilistic graphical representations of models that capture the relationships between different variables. Belief networks use either directed or undirected graphs to represent a dependency model. The directed acyclic graph (DAG) is more flexible and expressive, and it can represent a wider range of probabilistic interdependencies than undirected graphs. For example, induced and transitive dependencies cannot be modeled accurately by undirected graphs, but can be easily represented in DAGs.

A causal belief network is made up of various types of nodes. An arc between two nodes represents a causal relation; the originating node of the arc is the parent node and the other is the child node. A root node has no parents, while a leaf node has no children. Each node has an underlying conditional probability table (CPT) that describes the probability distribution of the node's states for each possible combination of states of its parent nodes. A Bayesian belief network (BBN) is a specific type of causal belief network, consisting of a set of nodes, where each node represents a variable in the dependency model, and connecting arcs, which represent the causal relationships among the variables. Figure 1 shows a simple BBN example about heart disease and heartache patients; the CPTs of the nodes are also illustrated in Figure 1. As for any causal belief network, the nodes represent stochastic variables and the arcs identify direct causal influences among the linked variables [11]. Each node or variable may take one of a number of possible states. The certainty of these states is determined from the belief in each possible state of all the nodes. The belief in each state of a node is updated whenever the belief in each state of any directly connected node changes.

Figure 1. A Bayesian belief network depicting the relationship between heart disease and heartache patients.

The difference between Bayesian belief
networks and other causal belief networks is that BBNs use Bayesian calculus to process the
state probabilities of each node from the predetermined conditional and prior probabilities.
The belief network is dynamic, and its probabilities are subject to change.
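To make the Bayesian updating described above concrete, the sketch below encodes a minimal two-node network in the spirit of Figure 1 (Disease → Heartache). The CPT numbers are invented for illustration and are not taken from the chapter's figure; observing evidence at the child node updates the belief at the parent node via Bayes' rule:

```python
# Hypothetical CPTs for a two-node BBN: Disease -> Heartache.
p_disease = 0.15                       # prior P(Disease = true)
p_ache_given = {True: 0.80,            # P(Heartache=true | Disease=true)
                False: 0.10}           # P(Heartache=true | Disease=false)

def posterior_disease(heartache_observed: bool) -> float:
    """P(Disease=true | Heartache=heartache_observed) via Bayes' rule."""
    def likelihood(disease: bool) -> float:
        p = p_ache_given[disease]
        return p if heartache_observed else 1.0 - p
    num = likelihood(True) * p_disease
    den = num + likelihood(False) * (1.0 - p_disease)
    return num / den

print(posterior_disease(True))   # belief in disease rises after observing heartache
print(posterior_disease(False))  # and falls when heartache is absent
```

Whenever the evidence at the Heartache node changes, the belief at the Disease node is recomputed in exactly this way, which is the dynamic updating the text describes.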

A Bayesian belief network is a graphical representation of a Bayesian probabilistic depen-


dency within a knowledge domain [12], particularly appropriate for target recognition
problems, where the category, identity, and class of target groups are to be recognized [13].
Bayesian belief networks have proven to be very useful and are well suited to small and incomplete data collections. A Bayesian network offers, for example, considerable savings in storage space, explicit treatment of uncertainty, support for decision analysis and causal relationships, and fast responses. Bayesian networks are also suited to structural learning applications and to combining different sources of prior knowledge [14]. Moreover, the Bayesian approach finds the optimal model structure from data together with a priori knowledge, whereas a constraint-based approach finds the optimal model structure from the conditional dependences between each pair of variables. Given the ascertained information, Bayesian belief networks are used to determine or infer the posterior probability distributions for the variables of interest [11]. As such, they do not include decisions or utilities that typify the preferences of the users; rather, the user makes decisions based on these probability distributions [11]. The causal relationships in Bayesian belief networks allow the correlations between variables to be modeled and predictions to be made. Compared to classical statistical approaches, Bayesian belief networks have a distinct advantage [15]. The BBN has become not only a powerful tool for knowledge representation but also one for reasoning under conditions of uncertainty [16], and for several decades it has frequently been applied to real-world problems such as building medical diagnostic systems, forecasting, and manufacturing process control [17]. Nowadays, BBNs have been extended to other
applications including software risk management [18], ecosystem and environmental man-
agement [19], and transportation [20]. Key events have a great impact on long-term transport mode choice decisions, and Bayesian belief networks, more precisely Bayesian decision networks, have been suggested as a formalism for measuring, analyzing, and predicting dynamic travel mode choice in relation to key events and critical incidents [21]. However, few studies have applied BBNs to airline marketing management. This paper introduces Bayesian belief networks that use relative and contextual variables to estimate a logic relationship, test the causal mechanisms of current passengers' choices, and predict their future preferences.

3. Case study: BBN in China Southern Airlines

Air passengers make their choices using the prior information available as well as information they obtain from the internal and external environments. Passengers integrate all the information actually available to them (including prior information and any information affecting them) and turn it into preferences for a product. The basic aim is to support airline decision makers in
their analysis of the impact of variables on passenger demand in the future. Prediction of pas-
senger choice for the distant future is critical to guide managers in the specification of marketing
strategies to be used. Such distant future predictions necessitate large-scale models of passen-
ger choice but that pressing need contrasts sharply with the capabilities of traditional forecast-
ing and modeling techniques. In this study, both qualitative and quantitative approaches are
studied. Developed as such, the BBN is expected to guide airline managers in their future prod-
uct decisions, facilitating analysis of specific decisions based on predicting the choice modes
of passengers; highlighting the causal relationships among variables in the process and finally
showing the impact of changes. To represent the dynamic nature of the causal relationships and to draw inferences based on the uncertainty concerning the states of the variables, this part constructs a Bayesian belief network for an airline content recommendation model using a case study of a Chinese airline. A basic assumption of a BBN is that when the conditional probabilities for
each variable are multiplied, the joint probability distributions for all variables in the network
are then calculated [22]. The structure is determined based on experts' judgments on content
recommendation mode and a logic relationship.
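This multiplication assumption is simply the chain rule of Bayesian networks. A minimal sketch in Python illustrates it; the variable names and probabilities below are hypothetical placeholders for illustration, not the study's actual CPTs:

```python
# Toy illustration of the BBN chain rule: the joint probability of all
# variables is the product of each variable's conditional probability given
# its parents. All names and numbers here are hypothetical.

# P(TimePressure)
p_tp = {"yes": 0.4, "no": 0.6}
# P(TravelMode | TimePressure)
p_tm = {"yes": {"business": 0.8, "leisure": 0.2},
        "no":  {"business": 0.3, "leisure": 0.7}}
# P(Upgrade | TravelMode)
p_up = {"business": {"yes": 0.5, "no": 0.5},
        "leisure":  {"yes": 0.1, "no": 0.9}}

def joint(tp, tm, up):
    """Joint probability via the chain rule: P(tp) * P(tm|tp) * P(up|tm)."""
    return p_tp[tp] * p_tm[tp][tm] * p_up[tm][up]

# A valid joint distribution sums to 1 over all state combinations.
total = sum(joint(tp, tm, up)
            for tp in ("yes", "no")
            for tm in ("business", "leisure")
            for up in ("yes", "no"))
print(round(total, 10))  # -> 1.0
```

Any joint query, for instance P(TimePressure=yes, TravelMode=business, Upgrade=yes), is then a single product of CPT entries rather than an entry in an exponentially large table.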
Three components of a belief network are important: the nodes representing variables, the
links among nodes, and the states representing expected utilities or probabilities. Therefore,
the first step of the process is the development of a causal network, for which the relevant
variables and the logical relationships of the network must be determined. In the next stage,
the belief network explores how changes in the states of variables (nodes) influence consumers'
future choices and content needs. The static causal model is thus transformed into a dynamic
one through the calculation of the Bayesian belief network. The resulting network is subjected
to scenario analysis to help airline decision makers analyze future product designs.
Airlines Content Recommendations Based on Passengers' Choice Using Bayesian Belief Networks 355
http://dx.doi.org/10.5772/intechopen.70131

3.1. Determination of the basic variables and casual relations

To obtain a mutually exclusive and collectively exhaustive list of the airlines' basic variables,
interviews are conducted with airline domain experts, who are encouraged to identify the
variables that might be relevant to the research. Thereafter, 35 variables are generated, based
on the situation in China and weighted by expert judgment and estimation. The decision
variables are classified into four groups:

1. Personal characteristics

2. Experience and behavior characteristics

3. Preference characteristics

4. Individuals' perceptions

Personal characteristics include airline passengers' demographic status and membership infor-
mation related to air travel. Experience and behavioral characteristics include passengers'
purchase behavior, decisions in choosing products, and attributes of particular experiences.
Preference characteristics include consumer preferences and travel patterns. Individual per-
ception describes the evaluation of passengers' loyalty, satisfaction, and comfort. After the
identification of variables, the next step is the determination of the causal relations among all
the variables. The network is proposed to capture the knowledge and assumptions involved
and to understand the mechanism of consumer choice processes. The whole network is built
in Netica. Changes in the network are subjected to field tests using real-world data from
Chinese sources.
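The node-and-link skeleton that a tool such as Netica stores can be sketched as a plain parent map. The node names below are illustrative placeholders rather than the study's actual 35 variables, and the check mirrors the acyclicity requirement that every Bayesian network structure must satisfy:

```python
# Hypothetical sketch of a causal-network skeleton: each node maps to its
# list of parents. The real study elicited 35 variables from domain experts;
# these names are placeholders for illustration only.
parents = {
    "Gender": [], "Age": [], "Education": [], "Upgrade": [],
    "Occupation": ["Gender", "Age", "Education"],
    "TimePressure": ["Occupation"],
    "TravelMode": ["TimePressure"],
    "Feasibility": ["Upgrade", "TravelMode", "TimePressure"],
}

def is_acyclic(parents):
    """Verify the structure is a DAG, a requirement for any Bayesian network."""
    visited, in_stack = set(), set()
    def visit(node):
        if node in in_stack:
            return False          # back edge found -> cycle
        if node in visited:
            return True
        in_stack.add(node)
        ok = all(visit(p) for p in parents[node])
        in_stack.discard(node)
        visited.add(node)
        return ok
    return all(visit(n) for n in parents)

print(is_acyclic(parents))  # -> True
```

Running the check before attaching CPTs catches modeling mistakes early, since a cycle (for example, making "Occupation" both an ancestor and a descendant of "TimePressure") would make the chain-rule factorization invalid.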

3.2. Implementation of the BBN

Content recommendation is a new attempt within airlines' new marketing strategies. After
obtaining and integrating consumers' choice behavior, airlines forecast and measure passen-
gers' preferences to predict their intertemporal choices. Based on the predicted choices, airline
decision makers formulate relevant content and recommend it to target consumer groups.
Passengers can then get the information they want, which improves their loyalty, satisfaction,
and comfort with the airline; better customer relationships, in turn, mean more market share
amid fierce competition. Content recommendations include products, services, tips, notices,
introductions, and information sharing. Products include popular routes; international and
domestic hotels; duty-free gifts; etc. Services include special assistance, baggage inquiry,
online check-in, pre-paid luggage, and so on. Tips include travel guides, entertainment activi-
ties, lounge locations, and flight delays. News and promotions, mileage redemption, and
offers are included in notices. Introductions cover the frequent flyer program, activities, flights
and hotels, boarding and arrival procedures, and so on. Information sharing is a new measure
applied to web search with the rising popularity of social media; the airline industry has
started to realize that this platform can further improve the service experience. Passengers can
link their Weibo (a popular Chinese social media platform) or WeChat data to the flight reser-
vation process, making it easy to know who is on the same flight [23]. The network in
Figure 2 shows airline content recommendations for given passenger choices.

Figure 2. Network for content recommendation mode.

The first cluster illustrates personal information. Demographic data elements include gender,
age, and education; from these three attributes, one can infer individual occupation and time
pressure. Distinguishing leisure travelers from business travelers depends on time sensitivity.
The node 'Feasibility' indicates the air travel feasibility of each combination of upgrade (yes or
no), travel mode (leisure or business), and time pressure (yes or no). The second cluster depicts
experience and behavior characteristics; the experience characteristic describes passengers' trip
and destination experience. The third cluster represents passengers' preferences, such as fare
class preference, seat preference, flight time preference, and holiday preference. It is worth
emphasizing that distance and upgrades may lead passengers to change their class selections:
passengers choose more comfortable classes on longer-range flights, and when it comes to
membership upgrades, passengers are more likely to travel first class to accumulate qualifying
miles. The fourth cluster describes individuals' perceptions, evaluating the effect of variable
changes on passengers' loyalty, satisfaction, and degree of comfort. This cluster refers to benefit
variables intended to cover the most significant perceptions. One of the benefit variables,
loyalty, is affected by membership class: the higher the membership class, the stronger the
passenger's attachment to the airline. In this respect, the outcome should take the weights of
the benefit nodes into account.

3.3. Results and discussion

Data from the airline are used to complete all CPTs of the nature (chance) nodes. After com-
pleting all tables, we use the Netica software to compile the network and determine the prob-
abilities of the six contents. Figure 3 shows the compiled decision network.

Figure 3. Compiled decision network (Cluster 1).
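Compiling the network amounts to marginalizing the joint distribution down to each query node. A toy enumeration with a three-node chain and hypothetical CPTs (not the study's) illustrates the computation a tool like Netica performs at much larger scale:

```python
from itertools import product

# Toy enumeration: the marginal of a query node is obtained by summing the
# joint P(A)P(B|A)P(C|B) over all other variables. CPTs are hypothetical.
p_a = {"t": 0.3, "f": 0.7}                      # P(A)
p_b = {"t": {"t": 0.9, "f": 0.1},               # P(B | A)
       "f": {"t": 0.2, "f": 0.8}}
p_c = {"t": {"t": 0.6, "f": 0.4},               # P(C | B)
       "f": {"t": 0.1, "f": 0.9}}

def marginal_c():
    """P(C), computed by summing the joint over A and B."""
    dist = {"t": 0.0, "f": 0.0}
    for a, b, c in product("tf", repeat=3):
        dist[c] += p_a[a] * p_b[a][b] * p_c[b][c]
    return dist

print({k: round(v, 3) for k, v in marginal_c().items()})  # -> {'t': 0.305, 'f': 0.695}
```

Brute-force enumeration is exponential in the number of variables; production engines instead exploit the network structure (for example, via junction-tree or variable elimination algorithms), but the result is the same marginal.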

From Figure 3, the probability of tips is the highest among all the content decision options,
reaching 18.5%. Because tips contain travel guides, entertainment activities, lounge alerts, and
delay notifications, these varied reminders give passengers a considerably better experience.
The probabilities of products and notices are around 17%. The other three contents are similar,
around the average level of just over 15.4%. The beliefs and probabilities are updated when
evidence for certain nature nodes changes; we discuss some examples below.

After entering the evidence 'Yes' for the node 'Feasibility', colored gray in Figure 4, the beliefs
for the decision nodes are automatically updated and recalculated. We find that only the
probability of the 'Introduce' option changes in light of this new evidence. When air travel is
entirely feasible for passengers, whether their travel purpose is holiday or business, 'Introduce'
is less useful because it provides flight information and boarding and arrival procedures that
they already know. Such passengers care more about services, delay alerts, popular routes,
holiday destinations, and so on.

Figure 4. Compiled decision network (‘Feasibility’ = ‘yes’).
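Entering evidence and auto-updating beliefs is Bayesian conditioning: P(Q | e) = P(Q, e) / P(e). The sketch below, with hypothetical numbers rather than values from the compiled network, mimics how observing 'Feasibility' shifts the belief about 'Introduce':

```python
from itertools import product

# Toy belief update: condition a two-node network on observed evidence.
# Probabilities are hypothetical, for illustration only.
p_feas = {"yes": 0.55, "no": 0.45}               # P(Feasibility)
p_intro = {"yes": {"show": 0.10, "skip": 0.90},  # P(Introduce | Feasibility)
           "no":  {"show": 0.25, "skip": 0.75}}

def posterior_intro(evidence_feas=None):
    """P(Introduce), optionally conditioned on Feasibility = evidence_feas."""
    dist = {"show": 0.0, "skip": 0.0}
    for f, i in product(("yes", "no"), ("show", "skip")):
        if evidence_feas is not None and f != evidence_feas:
            continue                              # drop worlds inconsistent with evidence
        dist[i] += p_feas[f] * p_intro[f][i]
    z = sum(dist.values())                        # normalize by P(evidence)
    return {k: v / z for k, v in dist.items()}

print({k: round(v, 4) for k, v in posterior_intro().items()})       # prior belief
print({k: round(v, 4) for k, v in posterior_intro("yes").items()})  # after evidence
```

In this toy case the belief in showing 'Introduce' drops from 0.1675 to 0.10 once feasibility is observed, the same qualitative pattern reported for Figure 4.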



Figure 5 represents the influence of the evidence 'Yes' for the nature node 'Change' on the
decision option modes; the beliefs and probabilities update automatically. The results illustrate
that if passengers change their purchase behavior or trip modes, for example, by taking the
high-speed train, an effective method of retaining these customers is for airline managers to
recommend relevant tips and notices and provide them with more comfortable services.

We compiled the network with 'Long' for 'Distance', 'High' for 'TotalFlights', and 'High' for
'Frequency', respectively; the consequences are presented in Figure 6(a-c). The trends of the
three results are similar: the probabilities of 'Products', 'Introduce', and 'Shares' rise markedly.
What surprised us, however, is that the probability of 'Services' decreases sharply. This result
suggests to decision makers that passengers with high-frequency travel behavior need product
recommendations, destination introductions, and shared web links when they take long-range
journeys, whereas services are not as important as these other aspects.

Because each passenger has his or her own preferences, and the information about preferences
is highly diverse, we introduce a nature node 'Flexibility' to describe the overall variation in
consumers' preferences. Controlling the states of the individual preference nodes, including
seat, class, flight time, and holiday, has no obvious effect on the decision option modes; we
therefore set 'Flexibility' to 'Yes' in Figure 7. 'Shares' shows the biggest change across the
entire content recommendation option mode, which means 'Shares' is the most useful way to
address flexibility problems whose regular patterns are hard to capture. Facing this situation,
airline managers can share links with their passengers on social media to release a "meeting &
sitting in the same flight" service [23]. As the results in Figure 8 show, membership class has
a significant influence on customers' loyalty. Members with the highest qualification are the
most attached to their chosen airlines; the probability of high loyalty rises from 35.7% to 85.2%
when we set the state 'Gold' to 'Yes'. Moreover, more than half of silver card holders remain
loyal to their airlines. Airline managers should therefore serve loyal customers better, reduce
customer attrition, and mine for new customers.

In the first part, we posed three questions: How can we infer consumers' future content
choices? How can we design a model that uses only current-period choices to infer consumers'
intertemporal preferences? Is the process dynamic, allowing researchers to analyze the influ-
ence of changes?

Figure 5. (a) Compiled decision network (Cluster 2). (b) Compiled decision network (‘Change’ = ‘Yes’).

Figure 6. (a) Compiled decision network (‘Distance’ = ‘Long’). (b) Compiled decision network (‘TotalFlights’=‘High’). (c)
Compiled decision network (‘Frequency’ = ‘High’).

This paper uses the automatic updating process to explain the dynamics of belief networks.
The BBN model represents a complex network that constructs and models the consumer
choice process. The examples above show clearly how evidence of a change in the state of one
node affects the probabilities of the decision options. Based on China Southern Airlines'
historical data, we predict passengers' choices and help airline managers recommend relevant
contents to satisfy passengers' needs.

Figure 7. Compiled decision network (‘Flexibility’ = ‘Yes’).



Figure 8. (a–d) Compiled decision network of Cluster 4.

4. Conclusion and implications

This article measures air passengers' preferences and predicts their future choices from current
choice behavior using a Bayesian belief network. The network can represent complex choice
behavior and the causal relationships among different variables, and the probabilities of the
options can capture passengers' dynamic decision-making processes. The most powerful fea-
ture of the Bayesian network is that the probability obtained at each stage rests on a rigorous
mathematical foundation; in other words, the network will infer reasonable results if we
obtain enough information, grounded in statistical knowledge.

We illustrate this by conducting a detailed empirical study of a data set from a Chinese airline.
Our research demonstrates that understanding consumer choice behavior is beneficial for
airline managers' strategic decision making.

To help formulate better marketing strategies, airline companies may consider adopting the
following procedures.

1. To track detailed consumer behavior: inquiry of products, reservation, payment, ticket
issue, check-in, waiting, cabin service, luggage claim, and mileage accumulation.

2. To analyze consumer behavior: purchase behavior, tour experience, choice behavior, and
preference.

3. To set up high-level products: high-level customization, customized design and product
design, and relevant product support.

4. To use social media: share web links and extract information from social media and social
networks.

A good strategy should analyze passengers' trip behavior and preferences to conduct cross-
selling, filter out unnecessary information, present consumer recommendations, and offer the
most valuable product portfolio to customers.
We expect that, together with the need for more specific features, BBNs combined with
artificial intelligence and deep learning will be of great value for addressing uncertainty
problems and consumer choice behavior in the future.

Author details

Sien Chen1,2*, Wenqiang Huang3, Mengxi Chen4, Junjiang Zhong5 and Jie Cheng6
*Address all correspondence to: [email protected]
1 Alliance Manchester Business School, University of Manchester, Manchester, UK
2 Antai College of Economics and Management, Shanghai Jiao Tong University, Shanghai,
China
3 China Southern Airlines, Guangzhou, China
4 Shanghai Jiao Tong University, Shanghai, China
5 Xiamen University of Technology, Xiamen, China
6 Gausscode Technology Inc, CA, USA

References

[1] Bucklin RE, Srinivasan V. Determining interbrand substitutability through survey measurement of consumer preference structures. Journal of Marketing Research. 1991;28(February):58-71

[2] Wittink DR, Cattin P. Commercial use of conjoint analysis: An update. Journal of Marketing. 1989;53(July):91-96

[3] Rao VR. Applied Conjoint Analysis. New York: Springer; 2014

[4] Hansen F. Consumer choice behavior: An experimental approach. Journal of Marketing Research. 1969;6(4):436-443

[5] Anderson NH. Contributions to Information Integration Theory, Volume II: Social. New York: Lawrence Erlbaum Associates/Psychology Press; 1991

[6] Bettman J, Capon N, Lutz RJ. Cognitive algebra in multi-attribute attitude models. Journal of Marketing Research. 1975;12(May):151-164

[7] Mowen JC. Beyond consumer decision making. Journal of Consumer Marketing. 1988;5(1):15-25

[8] Wierenga B, van Raaij WF. Consumentengedrag. Leiden: Stenfert Kroese BV; 1987

[9] Davenport TH. At the Big Data Crossroads: Turning Towards a Smarter Travel Experience. 2013. Available from: http://www.bigdata.amadeus.com/assets/pdf/Amadeus_Big_Data.pdf (Accessed: March 2, 2018)

[10] Buhalis D, Law R. Progress in information technology and tourism management: 20 years on and 10 years after the Internet—The state of eTourism research. Tourism Management. 2008;28(4):587-590

[11] Suermondt HJ. Explanation in Bayesian belief networks [PhD thesis]. Palo Alto, CA: Medical Information Sciences, Stanford University; March 1992

[12] Jensen FV. An Introduction to Bayesian Networks. London, UK: UCL Press; 1996

[13] Stewart L, McCarty P Jr. The use of Bayesian belief networks to fuse continuous and discrete information for target recognition, tracking and situation assessment. Proceedings of the SPIE. 1992;1699:177-185

[14] Uusitalo L. Advantages and challenges of Bayesian networks in environmental modelling. Ecological Modelling. 2007;203(3/4):312-318

[15] Heckerman D. A Tutorial on Learning with Bayesian Networks. Technical Report MSR-TR-95-06. Redmond, WA: Microsoft Corporation; 1996

[16] Cheng J, Greiner R, Kelly J, Bell D, Liu W. Learning Bayesian networks from data: An information-theory based approach. Artificial Intelligence. 2002;137(1/2):43-90

[17] Heckerman D, Mamdani A, Wellman MP. Real-world applications of Bayesian networks. Communications of the ACM. 1995;38(3):24-26

[18] Fan C, Yu Y. BBN-based software project risk management. The Journal of Systems and Software. 2004;73(2):193-203

[19] Uusitalo L. Advantages and challenges of Bayesian networks in environmental modelling. Ecological Modelling. 2007;203(3/4):312-318

[20] Ulengin F, Onsel S, Topcu YI, Aktas E, Kabak O. An integrated transportation decision support system for transportation policy decisions: The case of Turkey. Transportation Research Part A: Policy and Practice. 2007;41(1):80-97

[21] Verhoeven M, Arentze TA, Timmermans HJP, van der Waerden PJHJ. Modeling the impact of key events on long-term transport mode choice decisions: A decision network approach using event history data. Transportation Research Record. 2005;1926:106-114. DOI: 10.3141/1926-13

[22] Ulengin F, Onsel S, Topcu YI, Aktas E, Kabak O. An integrated transportation decision support system for transportation policy decisions: The case of Turkey. Transportation Research Part A. 2007;41:80-97. DOI: 10.1016/j.tra.2006.05.010

[23] Peveto A. KLM surprise: How a little research earned 1,000,000 impressions on Twitter. 2011. Available from: http://www.digett.com/2011/01/11/klm-surprise-how-little-research-earned-1000000-impressions-twitter (Accessed: March 5, 2018)