Consider a regression problem where the dependent variable to be predicted is not a single real-valued scalar but an m-length vector of correlated real numbers. As in the standard regression setup, there are n observations, where each observation i consists of k − 1 explanatory variables, grouped into a vector {\displaystyle \mathbf {x} _{i}} of length k (where a dummy variable with a value of 1 has been added to allow for an intercept coefficient). This can be viewed as a set of m related regression problems for each observation i:
{\displaystyle {\begin{aligned}y_{i,1}&=\mathbf {x} _{i}^{\mathsf {T}}{\boldsymbol {\beta }}_{1}+\epsilon _{i,1}\\&\;\;\vdots \\y_{i,m}&=\mathbf {x} _{i}^{\mathsf {T}}{\boldsymbol {\beta }}_{m}+\epsilon _{i,m}\end{aligned}}}
where the set of errors {\displaystyle \{\epsilon _{i,1},\ldots ,\epsilon _{i,m}\}} are all correlated. Equivalently, it can be viewed as a single regression problem where the outcome is a row vector {\displaystyle \mathbf {y} _{i}^{\mathsf {T}}} and the regression coefficient vectors are stacked next to each other, as follows:
{\displaystyle \mathbf {y} _{i}^{\mathsf {T}}=\mathbf {x} _{i}^{\mathsf {T}}\mathbf {B} +{\boldsymbol {\epsilon }}_{i}^{\mathsf {T}}.}
The coefficient matrix B is a {\displaystyle k\times m} matrix where the coefficient vectors {\displaystyle {\boldsymbol {\beta }}_{1},\ldots ,{\boldsymbol {\beta }}_{m}} for each regression problem are stacked horizontally:
{\displaystyle \mathbf {B} ={\begin{bmatrix}{\boldsymbol {\beta }}_{1}&\cdots &{\boldsymbol {\beta }}_{m}\end{bmatrix}}={\begin{bmatrix}\beta _{1,1}&\cdots &\beta _{1,m}\\\vdots &\ddots &\vdots \\\beta _{k,1}&\cdots &\beta _{k,m}\end{bmatrix}}.}
The noise vector {\displaystyle {\boldsymbol {\epsilon }}_{i}} for each observation i is jointly normal, so that the outcomes for a given observation are correlated:
{\displaystyle {\boldsymbol {\epsilon }}_{i}\sim N(0,{\boldsymbol {\Sigma }}_{\epsilon }).}
We can write the entire regression problem in matrix form as:
{\displaystyle \mathbf {Y} =\mathbf {X} \mathbf {B} +\mathbf {E} ,}
where Y and E are {\displaystyle n\times m} matrices. The design matrix X is an {\displaystyle n\times k} matrix with the observations stacked vertically, as in the standard linear regression setup:
{\displaystyle \mathbf {X} ={\begin{bmatrix}\mathbf {x} _{1}^{\mathsf {T}}\\\mathbf {x} _{2}^{\mathsf {T}}\\\vdots \\\mathbf {x} _{n}^{\mathsf {T}}\end{bmatrix}}={\begin{bmatrix}x_{1,1}&\cdots &x_{1,k}\\x_{2,1}&\cdots &x_{2,k}\\\vdots &\ddots &\vdots \\x_{n,1}&\cdots &x_{n,k}\end{bmatrix}}.}
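For concreteness, the following NumPy sketch simulates data from this setup; the dimensions, coefficient values, and error covariance below are illustrative assumptions, not part of the model.

import numpy as np

rng = np.random.default_rng(0)
n, k, m = 200, 3, 2                       # observations, predictors (incl. intercept), outcomes

# Design matrix: a column of ones (the dummy intercept variable) plus k-1 explanatory variables.
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])

B_true = rng.normal(size=(k, m))          # true k x m coefficient matrix
Sigma_eps = np.array([[1.0, 0.6],         # error covariance: the m outcomes are correlated
                      [0.6, 2.0]])

E = rng.multivariate_normal(np.zeros(m), Sigma_eps, size=n)   # n x m error matrix
Y = X @ B_true + E                        # n x m outcome matrix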
The classical, frequentist linear least squares solution is simply to estimate the matrix of regression coefficients {\displaystyle {\hat {\mathbf {B} }}} using the Moore-Penrose pseudoinverse:
{\displaystyle {\hat {\mathbf {B} }}=(\mathbf {X} ^{\mathsf {T}}\mathbf {X} )^{-1}\mathbf {X} ^{\mathsf {T}}\mathbf {Y} .}
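Continuing the simulated data above, a minimal sketch of this estimate (assuming X has full column rank, so that X^T X is invertible):

# Classical least-squares estimate: each column of B_hat is the OLS fit of the
# corresponding column of Y on X.
B_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Equivalent computation via the Moore-Penrose pseudoinverse of X.
assert np.allclose(B_hat, np.linalg.pinv(X) @ Y)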
To obtain the Bayesian solution, we need to specify the conditional likelihood and then find the appropriate conjugate prior. As with the univariate case of Bayesian linear regression, we will find that we can specify a natural conditional conjugate prior (which is scale dependent).
Let us write our conditional likelihood as[1]
{\displaystyle \rho (\mathbf {E} |{\boldsymbol {\Sigma }}_{\epsilon })\propto |{\boldsymbol {\Sigma }}_{\epsilon }|^{-n/2}\exp \left(-{\tfrac {1}{2}}\operatorname {tr} \left(\mathbf {E} ^{\mathsf {T}}\mathbf {E} {\boldsymbol {\Sigma }}_{\epsilon }^{-1}\right)\right),}
writing the error {\displaystyle \mathbf {E} } in terms of {\displaystyle \mathbf {Y} ,\mathbf {X} ,} and {\displaystyle \mathbf {B} } yields
{\displaystyle \rho (\mathbf {Y} |\mathbf {X} ,\mathbf {B} ,{\boldsymbol {\Sigma }}_{\epsilon })\propto |{\boldsymbol {\Sigma }}_{\epsilon }|^{-n/2}\exp(-{\tfrac {1}{2}}\operatorname {tr} ((\mathbf {Y} -\mathbf {X} \mathbf {B} )^{\mathsf {T}}(\mathbf {Y} -\mathbf {X} \mathbf {B} ){\boldsymbol {\Sigma }}_{\epsilon }^{-1})).}
We seek a natural conjugate prior—a joint density {\displaystyle \rho (\mathbf {B} ,{\boldsymbol {\Sigma }}_{\epsilon })} which is of the same functional form as the likelihood. Since the likelihood is quadratic in {\displaystyle \mathbf {B} }, we re-write the likelihood so it is normal in {\displaystyle (\mathbf {B} -{\hat {\mathbf {B} }})} (the deviation from the classical sample estimate).
Using the same technique as with Bayesian linear regression, we decompose the exponential term using a matrix form of the sum-of-squares technique. Here, however, we will also need to use matrix differential calculus (the Kronecker product and vectorization transformations).
First, let us apply the sum-of-squares technique to obtain a new expression for the likelihood:
{\displaystyle \rho (\mathbf {Y} |\mathbf {X} ,\mathbf {B} ,{\boldsymbol {\Sigma }}_{\epsilon })\propto |{\boldsymbol {\Sigma }}_{\epsilon }|^{-(n-k)/2}\exp(-\operatorname {tr} ({\tfrac {1}{2}}\mathbf {S} ^{\mathsf {T}}\mathbf {S} {\boldsymbol {\Sigma }}_{\epsilon }^{-1}))|{\boldsymbol {\Sigma }}_{\epsilon }|^{-k/2}\exp(-{\tfrac {1}{2}}\operatorname {tr} ((\mathbf {B} -{\hat {\mathbf {B} }})^{\mathsf {T}}\mathbf {X} ^{\mathsf {T}}\mathbf {X} (\mathbf {B} -{\hat {\mathbf {B} }}){\boldsymbol {\Sigma }}_{\epsilon }^{-1})),}
where
{\displaystyle \mathbf {S} =\mathbf {Y} -\mathbf {X} {\hat {\mathbf {B} }}.}
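The decomposition is exact because the cross terms vanish: X^T S = X^T (Y − X B̂) = 0 by the normal equations. A quick numerical check of the exponent, continuing the simulated data above with an arbitrary coefficient matrix B:

# Verify tr((Y-XB)'(Y-XB) Sigma^{-1}) = tr(S'S Sigma^{-1}) + tr((B-B_hat)'X'X(B-B_hat) Sigma^{-1}).
B = rng.normal(size=(k, m))               # arbitrary test value for B
S = Y - X @ B_hat
Sig_inv = np.linalg.inv(Sigma_eps)

lhs = np.trace((Y - X @ B).T @ (Y - X @ B) @ Sig_inv)
rhs = (np.trace(S.T @ S @ Sig_inv)
       + np.trace((B - B_hat).T @ X.T @ X @ (B - B_hat) @ Sig_inv))
assert np.isclose(lhs, rhs)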
We would like to develop a conditional form for the priors:
{\displaystyle \rho (\mathbf {B} ,{\boldsymbol {\Sigma }}_{\epsilon })=\rho ({\boldsymbol {\Sigma }}_{\epsilon })\rho (\mathbf {B} |{\boldsymbol {\Sigma }}_{\epsilon }),}
where {\displaystyle \rho ({\boldsymbol {\Sigma }}_{\epsilon })} is an inverse-Wishart distribution and {\displaystyle \rho (\mathbf {B} |{\boldsymbol {\Sigma }}_{\epsilon })} is some form of normal distribution in the matrix {\displaystyle \mathbf {B} }. This is accomplished using the vectorization transformation, which converts the likelihood from a function of the matrices {\displaystyle \mathbf {B} ,{\hat {\mathbf {B} }}} to a function of the vectors {\displaystyle {\boldsymbol {\beta }}=\operatorname {vec} (\mathbf {B} ),{\hat {\boldsymbol {\beta }}}=\operatorname {vec} ({\hat {\mathbf {B} }})}.
Write
{\displaystyle \operatorname {tr} ((\mathbf {B} -{\hat {\mathbf {B} }})^{\mathsf {T}}\mathbf {X} ^{\mathsf {T}}\mathbf {X} (\mathbf {B} -{\hat {\mathbf {B} }}){\boldsymbol {\Sigma }}_{\epsilon }^{-1})=\operatorname {vec} (\mathbf {B} -{\hat {\mathbf {B} }})^{\mathsf {T}}\operatorname {vec} (\mathbf {X} ^{\mathsf {T}}\mathbf {X} (\mathbf {B} -{\hat {\mathbf {B} }}){\boldsymbol {\Sigma }}_{\epsilon }^{-1})}
Let
{\displaystyle \operatorname {vec} (\mathbf {X} ^{\mathsf {T}}\mathbf {X} (\mathbf {B} -{\hat {\mathbf {B} }}){\boldsymbol {\Sigma }}_{\epsilon }^{-1})=({\boldsymbol {\Sigma }}_{\epsilon }^{-1}\otimes \mathbf {X} ^{\mathsf {T}}\mathbf {X} )\operatorname {vec} (\mathbf {B} -{\hat {\mathbf {B} }}),}
where {\displaystyle \mathbf {A} \otimes \mathbf {B} } denotes the Kronecker product of matrices A and B, a generalization of the outer product which multiplies an {\displaystyle m\times n} matrix by a {\displaystyle p\times q} matrix to generate an {\displaystyle mp\times nq} matrix, consisting of every combination of products of elements from the two matrices.
Then
{\displaystyle {\begin{aligned}&\operatorname {vec} (\mathbf {B} -{\hat {\mathbf {B} }})^{\mathsf {T}}({\boldsymbol {\Sigma }}_{\epsilon }^{-1}\otimes \mathbf {X} ^{\mathsf {T}}\mathbf {X} )\operatorname {vec} (\mathbf {B} -{\hat {\mathbf {B} }})\\&=({\boldsymbol {\beta }}-{\hat {\boldsymbol {\beta }}})^{\mathsf {T}}({\boldsymbol {\Sigma }}_{\epsilon }^{-1}\otimes \mathbf {X} ^{\mathsf {T}}\mathbf {X} )({\boldsymbol {\beta }}-{\hat {\boldsymbol {\beta }}})\end{aligned}}}
which will lead to a likelihood which is normal in {\displaystyle ({\boldsymbol {\beta }}-{\hat {\boldsymbol {\beta }}})}.
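A quick numerical check of this vectorization step, continuing the variables above; NumPy's ravel(order='F') stacks columns, matching the usual definition of vec:

# Verify vec(X'X (B - B_hat) Sigma^{-1}) = (Sigma^{-1} kron X'X) vec(B - B_hat),
# i.e. that the trace term is a quadratic form in vec(B - B_hat).
D = B - B_hat
lhs = (X.T @ X @ D @ Sig_inv).ravel(order='F')
rhs = np.kron(Sig_inv, X.T @ X) @ D.ravel(order='F')
assert np.allclose(lhs, rhs)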
With the likelihood in a more tractable form, we can now find a natural (conditional) conjugate prior.
Conjugate prior distribution
The natural conjugate prior using the vectorized variable {\displaystyle {\boldsymbol {\beta }}} is of the form:[1]
{\displaystyle \rho ({\boldsymbol {\beta }},{\boldsymbol {\Sigma }}_{\epsilon })=\rho ({\boldsymbol {\Sigma }}_{\epsilon })\rho ({\boldsymbol {\beta }}|{\boldsymbol {\Sigma }}_{\epsilon }),}
where
{\displaystyle \rho ({\boldsymbol {\Sigma }}_{\epsilon })\sim {\mathcal {W}}^{-1}(\mathbf {V} _{0},{\boldsymbol {\nu }}_{0})}
and
{\displaystyle \rho ({\boldsymbol {\beta }}|{\boldsymbol {\Sigma }}_{\epsilon })\sim N({\boldsymbol {\beta }}_{0},{\boldsymbol {\Sigma }}_{\epsilon }\otimes {\boldsymbol {\Lambda }}_{0}^{-1}).}
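Since β | Σ_ε is normal with covariance Σ_ε ⊗ Λ_0^{-1}, the matrix B | Σ_ε is matrix normal with row covariance Λ_0^{-1} and column covariance Σ_ε. A sketch of one joint draw from this prior using SciPy; the hyperparameter values below are illustrative assumptions:

from scipy.stats import invwishart, matrix_normal

nu0 = m + 2                     # prior degrees of freedom (must exceed m - 1)
V0 = np.eye(m)                  # prior scale matrix for Sigma_eps
B0 = np.zeros((k, m))           # prior mean of the coefficient matrix
Lambda0 = 0.1 * np.eye(k)       # prior precision over the rows of B

Sigma_prior = invwishart(df=nu0, scale=V0).rvs()
B_prior = matrix_normal(mean=B0, rowcov=np.linalg.inv(Lambda0), colcov=Sigma_prior).rvs()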
Posterior distribution
Using the above prior and likelihood, the posterior distribution can be expressed as:[1]
{\displaystyle {\begin{aligned}\rho ({\boldsymbol {\beta }},{\boldsymbol {\Sigma }}_{\epsilon }|\mathbf {Y} ,\mathbf {X} )\propto {}&|{\boldsymbol {\Sigma }}_{\epsilon }|^{-({\boldsymbol {\nu }}_{0}+m+1)/2}\exp {(-{\tfrac {1}{2}}\operatorname {tr} (\mathbf {V} _{0}{\boldsymbol {\Sigma }}_{\epsilon }^{-1}))}\\&\times |{\boldsymbol {\Sigma }}_{\epsilon }|^{-k/2}\exp {(-{\tfrac {1}{2}}\operatorname {tr} ((\mathbf {B} -\mathbf {B} _{0})^{\mathsf {T}}{\boldsymbol {\Lambda }}_{0}(\mathbf {B} -\mathbf {B} _{0}){\boldsymbol {\Sigma }}_{\epsilon }^{-1}))}\\&\times |{\boldsymbol {\Sigma }}_{\epsilon }|^{-n/2}\exp {(-{\tfrac {1}{2}}\operatorname {tr} ((\mathbf {Y} -\mathbf {XB} )^{\mathsf {T}}(\mathbf {Y} -\mathbf {XB} ){\boldsymbol {\Sigma }}_{\epsilon }^{-1}))},\end{aligned}}}
where {\displaystyle \operatorname {vec} (\mathbf {B} _{0})={\boldsymbol {\beta }}_{0}}.
The terms involving {\displaystyle \mathbf {B} } can be grouped (with {\displaystyle {\boldsymbol {\Lambda }}_{0}=\mathbf {U} ^{\mathsf {T}}\mathbf {U} }) using:
{\displaystyle {\begin{aligned}&\left(\mathbf {B} -\mathbf {B} _{0}\right)^{\mathsf {T}}{\boldsymbol {\Lambda }}_{0}\left(\mathbf {B} -\mathbf {B} _{0}\right)+\left(\mathbf {Y} -\mathbf {XB} \right)^{\mathsf {T}}\left(\mathbf {Y} -\mathbf {XB} \right)\\={}&\left({\begin{bmatrix}\mathbf {Y} \\\mathbf {U} \mathbf {B} _{0}\end{bmatrix}}-{\begin{bmatrix}\mathbf {X} \\\mathbf {U} \end{bmatrix}}\mathbf {B} \right)^{\mathsf {T}}\left({\begin{bmatrix}\mathbf {Y} \\\mathbf {U} \mathbf {B} _{0}\end{bmatrix}}-{\begin{bmatrix}\mathbf {X} \\\mathbf {U} \end{bmatrix}}\mathbf {B} \right)\\={}&\left({\begin{bmatrix}\mathbf {Y} \\\mathbf {U} \mathbf {B} _{0}\end{bmatrix}}-{\begin{bmatrix}\mathbf {X} \\\mathbf {U} \end{bmatrix}}\mathbf {B} _{n}\right)^{\mathsf {T}}\left({\begin{bmatrix}\mathbf {Y} \\\mathbf {U} \mathbf {B} _{0}\end{bmatrix}}-{\begin{bmatrix}\mathbf {X} \\\mathbf {U} \end{bmatrix}}\mathbf {B} _{n}\right)+\left(\mathbf {B} -\mathbf {B} _{n}\right)^{\mathsf {T}}\left(\mathbf {X} ^{\mathsf {T}}\mathbf {X} +{\boldsymbol {\Lambda }}_{0}\right)\left(\mathbf {B} -\mathbf {B} _{n}\right)\\={}&\left(\mathbf {Y} -\mathbf {X} \mathbf {B} _{n}\right)^{\mathsf {T}}\left(\mathbf {Y} -\mathbf {X} \mathbf {B} _{n}\right)+\left(\mathbf {B} _{0}-\mathbf {B} _{n}\right)^{\mathsf {T}}{\boldsymbol {\Lambda }}_{0}\left(\mathbf {B} _{0}-\mathbf {B} _{n}\right)+\left(\mathbf {B} -\mathbf {B} _{n}\right)^{\mathsf {T}}\left(\mathbf {X} ^{\mathsf {T}}\mathbf {X} +{\boldsymbol {\Lambda }}_{0}\right)\left(\mathbf {B} -\mathbf {B} _{n}\right),\end{aligned}}}
with
{\displaystyle \mathbf {B} _{n}=\left(\mathbf {X} ^{\mathsf {T}}\mathbf {X} +{\boldsymbol {\Lambda }}_{0}\right)^{-1}\left(\mathbf {X} ^{\mathsf {T}}\mathbf {X} {\hat {\mathbf {B} }}+{\boldsymbol {\Lambda }}_{0}\mathbf {B} _{0}\right)=\left(\mathbf {X} ^{\mathsf {T}}\mathbf {X} +{\boldsymbol {\Lambda }}_{0}\right)^{-1}\left(\mathbf {X} ^{\mathsf {T}}\mathbf {Y} +{\boldsymbol {\Lambda }}_{0}\mathbf {B} _{0}\right).}
This now allows us to write the posterior in a more useful form:
{\displaystyle {\begin{aligned}\rho ({\boldsymbol {\beta }},{\boldsymbol {\Sigma }}_{\epsilon }|\mathbf {Y} ,\mathbf {X} )\propto {}&|{\boldsymbol {\Sigma }}_{\epsilon }|^{-({\boldsymbol {\nu }}_{0}+m+n+1)/2}\exp {(-{\tfrac {1}{2}}\operatorname {tr} ((\mathbf {V} _{0}+(\mathbf {Y} -\mathbf {XB_{n}} )^{\mathsf {T}}(\mathbf {Y} -\mathbf {XB_{n}} )+(\mathbf {B} _{n}-\mathbf {B} _{0})^{\mathsf {T}}{\boldsymbol {\Lambda }}_{0}(\mathbf {B} _{n}-\mathbf {B} _{0})){\boldsymbol {\Sigma }}_{\epsilon }^{-1}))}\\&\times |{\boldsymbol {\Sigma }}_{\epsilon }|^{-k/2}\exp {(-{\tfrac {1}{2}}\operatorname {tr} ((\mathbf {B} -\mathbf {B} _{n})^{\mathsf {T}}(\mathbf {X} ^{\mathsf {T}}\mathbf {X} +{\boldsymbol {\Lambda }}_{0})(\mathbf {B} -\mathbf {B} _{n}){\boldsymbol {\Sigma }}_{\epsilon }^{-1}))}.\end{aligned}}}
This takes the form of an inverse-Wishart distribution times a matrix normal distribution:
{\displaystyle \rho ({\boldsymbol {\Sigma }}_{\epsilon }|\mathbf {Y} ,\mathbf {X} )\sim {\mathcal {W}}^{-1}(\mathbf {V} _{n},{\boldsymbol {\nu }}_{n})}
and
{\displaystyle \rho (\mathbf {B} |\mathbf {Y} ,\mathbf {X} ,{\boldsymbol {\Sigma }}_{\epsilon })\sim {\mathcal {MN}}_{k,m}(\mathbf {B} _{n},{\boldsymbol {\Lambda }}_{n}^{-1},{\boldsymbol {\Sigma }}_{\epsilon }).}
The parameters of this posterior are given by:
{\displaystyle \mathbf {V} _{n}=\mathbf {V} _{0}+(\mathbf {Y} -\mathbf {XB_{n}} )^{\mathsf {T}}(\mathbf {Y} -\mathbf {XB_{n}} )+(\mathbf {B} _{n}-\mathbf {B} _{0})^{\mathsf {T}}{\boldsymbol {\Lambda }}_{0}(\mathbf {B} _{n}-\mathbf {B} _{0})}
{\displaystyle {\boldsymbol {\nu }}_{n}={\boldsymbol {\nu }}_{0}+n}
{\displaystyle \mathbf {B} _{n}=(\mathbf {X} ^{\mathsf {T}}\mathbf {X} +{\boldsymbol {\Lambda }}_{0})^{-1}(\mathbf {X} ^{\mathsf {T}}\mathbf {Y} +{\boldsymbol {\Lambda }}_{0}\mathbf {B} _{0})}
Λ
n
=
X
T
X
+
Λ
0
{\displaystyle {\boldsymbol {\Lambda }}_{n}=\mathbf {X} ^{\mathsf {T}}\mathbf {X} +{\boldsymbol {\Lambda }}_{0}}