A Multimodal Approach For Frequency Domain Independent Component Analysis With Geometrically-Based Initialization
Work supported by the Engineering and Physical Sciences Research Council (EPSRC) of the UK.

ABSTRACT

A novel multimodal approach for independent component analysis (ICA) of complex valued frequency domain signals is presented which utilizes video information to provide a geometrical description of both the speakers and the microphones. This geometric information, the visual aspect, is incorporated into the initialization of the complex ICA algorithm for each frequency bin; as such, the method is multimodal since two signal modalities, speech and video, are exploited. The separation results show a significant improvement over traditional frequency domain convolutive blind source separation (BSS) systems. Importantly, simulation results show that the inherent permutation problem in frequency domain BSS (with complex valued signals) is solved at the level of each frequency bin, together with an improvement in the rate of convergence for static sources. We also highlight that certain fixed point algorithms proposed by Hyvärinen et al., or their constrained versions, are not valid for complex valued signals.

1. INTRODUCTION

Convolutive blind source separation (CBSS) performed in the frequency domain, where the separation of complex valued signals is encountered, has remained a subject of much research interest due to its potentially wide applications, for example in acoustic source separation, and the associated challenging technical problems, the most important of which is perhaps the permutation problem. Generally, the main objective of BSS is to decompose the measurement signals into their constituent independent components as an estimation of the true sources, which are assumed a priori to be independent [1] [2].

CBSS algorithms have been conventionally developed in either the time [3] or frequency [1] [4] [5] [6] domains. Frequency domain convolutive blind source separation (FDCBSS) has however been a more popular approach, as the time-domain convolutive mixing is converted into a number of independent complex instantaneous mixing operations. The permutation problem inherent to FDCBSS presents itself when reconstructing the separated sources from the separated outputs of these instantaneous mixtures. It is more severe and destructive than for time-domain schemes, as the number of permutations grows geometrically with the number of instantaneous mixtures [7]. In unimodal BSS no a priori assumptions are typically made on the source statistics or the mixing system. On the other hand, in a multimodal approach a video system can capture the approximate positions of the speakers and the directions they face [8]. Such video information can thereby help to estimate the unmixing matrices more accurately and ultimately increase the separation performance. Following this idea, the objective of this paper is to efficiently use such information to mitigate the permutation problem. The scaling problem in CBSS is easily solved by matrix normalization [9] [10].

The convolutive mixing system can be described as follows: assume m statistically independent sources s(t) = [s1(t), . . . , sm(t)]^T, where [.]^T denotes the transpose operation and t the discrete time index. The sources are convolved with a linear model of the physical medium (mixing matrix), which can be represented in the form of a multichannel FIR filter H with memory length P, to produce n sensor signals x(t) = [x1(t), . . . , xn(t)]^T as

x(t) = ∑_{τ=0}^{P} H(τ) s(t − τ) + v(t)    (1)

where v(t) = [v1(t), . . . , vn(t)]^T and H = [H(0), . . . , H(P)]. In common with other researchers we assume n ≥ m. Using time domain CBSS, the sources can be estimated using a set of unmixing filter matrices W(τ), τ = 0, . . . , Q, such that

y(t) = ∑_{τ=0}^{Q} W(τ) x(t − τ)    (2)

where y(t) = [y1(t), . . . , ym(t)]^T contains the estimated sources, and Q is the memory of the unmixing filters. In FDCBSS the problem is transferred into the frequency domain using the short-time Fourier transform (STFT). Equations (1) and (2) then change respectively to

x(ω, t) ≈ H(ω) s(ω, t) + v(ω, t)    (3)

y(ω, t) ≈ W(ω) x(ω, t)    (4)

where ω denotes discrete normalized frequency. An inverse STFT is then used to find the estimated sources ŝ(t) = y(t); however, this will certainly be affected by the permutation effect due to the variation of W(ωi) with frequency bin ωi. In the following section we present a fast fixed-point algorithm for ICA of these complex valued signals, carefully motivate the choice of contrast function, and mention the local consistency of the algorithm. In Sec. 3 we examine the use of spatial information indicating the positions and directions of the sources using "data" acquired by a number of video cameras. In Sec. 4 we use this geometric information to initialize the fixed-point frequency domain ICA algorithm. In Sec. 5 the simulation results for real world data confirm the usefulness of the algorithm. Finally, conclusions are drawn.
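As an editorial illustration of the formulation in (1)-(4), the following minimal sketch (Python with NumPy/SciPy; the helper names are hypothetical and this is not the authors' implementation) forms the STFT of each sensor signal so that the convolutive mixture becomes an approximately instantaneous complex mixture in every frequency bin, and then applies one unmixing matrix per bin as in (4).

import numpy as np
from scipy.signal import stft

def to_frequency_bins(x_time, fs, nfft=1024, hop=256):
    """STFT of each sensor signal: returns X with shape (n_sensors, n_bins, n_frames).

    Within each bin k the convolutive mixture (1) is approximated by the
    instantaneous complex mixture (3): x(omega_k, t) ~ H(omega_k) s(omega_k, t).
    """
    X = [stft(xi, fs=fs, nperseg=nfft, noverlap=nfft - hop)[2] for xi in x_time]
    return np.array(X)

def apply_unmixing(X, W):
    """Per-bin separation y(omega, t) = W(omega) x(omega, t), as in (4).

    X : (n_sensors, n_bins, n_frames) complex STFT data
    W : sequence of per-bin unmixing matrices, W[k] of shape (m, n_sensors)
    """
    n_bins = X.shape[1]
    return np.stack([W[k] @ X[:, k, :] for k in range(n_bins)], axis=1)

The STFT window length (nfft) and hop size are placeholders; the paper does not specify the analysis parameters used for the recordings.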
2. A FAST FIXED-POINT ALGORITHM FOR ICA

Recently, ICA has become one of the central tools for BSS [1], [2]. In ICA, a set of original source signals s(t) in (1) are retrieved from their mixtures based on the assumption of their mutual statistical independence. Hyvärinen and Oja [2] [11] presented a fast fixed point algorithm (FastICA) for the separation of linearly mixed independent source signals. Unfortunately, these algorithms are not suitable for complex valued signals. The use of the algorithm of [12] in this paper is due to four main reasons: its suitability for complex signals, the proof of the local consistency of the estimator, more robustness against outliers, and the capability of deflationary separation of the independent component signals. In deflationary separation the components tend to separate in the order of decreasing non-Gaussianity. In [12] the basic concept of complex random variables is also provided and the fixed point algorithm for one unit is derived; for ease of derivation the algorithm updates the real and imaginary parts of w separately. Note that, for convenience, explicit use of the discrete time index is dropped and w represents one row of W used to extract a single source. Since the source signals are assumed zero mean, of unit variance and with uncorrelated real and imaginary parts of equal variances, the optima of E{G(|w^H x|²)} under the constraint E{|w^H x|²} = ‖w‖² = 1, where E{·} denotes the statistical expectation, (·)^H the Hermitian transpose, ‖·‖ the Euclidean norm, |·| the absolute value, and G(·) a nonlinear contrast function, satisfy, according to the Kuhn-Tucker conditions,

∇E{G(|w^H x|²)} − β ∇E{|w^H x|²} = 0    (5)

where the gradient, denoted by ∇, is computed with respect to the real and imaginary parts of w separately. The Newton method is used to solve this equation, for which the total approximative Jacobian [12] is

J = 2(E{g(|w^H x|²) + |w^H x|² g′(|w^H x|²)} − β) I    (6)

which is diagonal and therefore easily invertible, where I denotes the identity matrix, and g(·) and g′(·) denote the first and second derivatives of the contrast function G(·). Bingham and Hyvärinen obtained the following approximative Newton iteration:

w⁺ = w − [E{x (w^H x)* g(|w^H x|²)} − β w] / [E{g(|w^H x|²) + |w^H x|² g′(|w^H x|²)} − β]    (7)

Here κ is a constant representing the attenuation per unit length in a homogeneous medium. Similarly, τij, in terms of the number of samples, is proportional to the sampling frequency fs, the sound velocity C, and the distance dij as

τij = (fs / C) dij    (9)

which is independent of the directionality. However, in practical situations the speaker's direction introduces another variable into the attenuation measurement. In the case of electronic loudspeakers (not humans) the directionality pattern depends on the type of loudspeaker. Here, we approximate this pattern as cos(θij / r), where r > 2, which has a smaller value for highly directional speakers and vice versa (an accurate profile can easily be measured using a sound pressure level (SPL) meter). Therefore, the attenuation parameters become

αij = (κ / d²ij) cos(θij / r)    (10)

If, for simplicity, only the direct path is considered, the mixing filter has the form

Ĥ(t) = [ α11 δ(t − τ11)   α12 δ(t − τ12) ]
       [ α21 δ(t − τ21)   α22 δ(t − τ22) ]    (11)

where (ˆ·) denotes the approximation under this assumption.

Fig. 2. A two-speaker two-microphone layout for recording within a reverberant (room) environment. Room impulse response length is 130 ms.
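A minimal sketch of the direct-path model (9)-(11), again in Python/NumPy with hypothetical variable names: given speaker and microphone positions and the speaker facing directions (here assumed to come from the video system), it computes the delays τij and attenuations αij and evaluates the corresponding direct-path frequency response Ĥ(ω) in each bin. The default values of c, κ and r are placeholders, not values from the paper.

import numpy as np

def direct_path_response(mic_pos, spk_pos, spk_dir, fs, n_bins,
                         c=343.0, kappa=1.0, r=3.0):
    """Direct-path mixing response H_hat(omega) following (9)-(11).

    mic_pos : (n, 3) microphone positions [m]
    spk_pos : (m, 3) speaker positions [m]
    spk_dir : (m, 3) unit vectors giving the direction each speaker faces
    Returns H_hat of shape (n_bins, n, m), one complex matrix per frequency bin.
    """
    mic_pos, spk_pos, spk_dir = map(np.asarray, (mic_pos, spk_pos, spk_dir))
    n, m = len(mic_pos), len(spk_pos)
    tau = np.zeros((n, m))      # delays in samples, eq. (9)
    alpha = np.zeros((n, m))    # attenuations, eq. (10)
    for i in range(n):
        for j in range(m):
            v = mic_pos[i] - spk_pos[j]
            d = np.linalg.norm(v)
            tau[i, j] = fs * d / c
            # angle between the speaker's facing direction and the microphone direction
            cos_t = np.dot(spk_dir[j], v / d)
            theta = np.arccos(np.clip(cos_t, -1.0, 1.0))
            alpha[i, j] = kappa / d**2 * np.cos(theta / r)
    # delta(t - tau_ij) in (11) becomes exp(-j omega tau_ij) per frequency bin
    omega = np.pi * np.arange(n_bins) / n_bins        # normalized frequencies in [0, pi)
    return alpha[None, :, :] * np.exp(-1j * omega[:, None, None] * tau[None, :, :])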
The equivalence between frequency domain blind source separation and frequency domain adaptive beamforming has already been confirmed in [13].

Multiplying both sides of (7) by β − E{g(|w^H x|²) + |w^H x|² g′(|w^H x|²)}, we have the following update equation for each frequency bin:

w1⁺(ω) = E{z(ω)(w1(ω)^H z(ω))* g(|w1(ω)^H z(ω)|²)}
         − E{g(|w1(ω)^H z(ω)|²) + |w1(ω)^H z(ω)|² g′(|w1(ω)^H z(ω)|²)} w1(ω)    (14)
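The one-unit update (14) can be sketched as below (Python/NumPy). The sketch assumes z(ω) is the whitened per-bin observation arranged as an (n × T) complex matrix, and uses the contrast G(y) = log(b + y) quoted in the caption of Fig. 6, so g(y) = 1/(b + y) and g′(y) = −1/(b + y)²; the value of b is not given in the text, so b = 0.1 is an arbitrary placeholder. This follows the Bingham-Hyvärinen form of the update and is not the authors' code.

import numpy as np

def one_unit_update(w, Z, b=0.1):
    """One fixed-point iteration of (14) for a single unmixing vector w(omega).

    w : (n,) complex vector for this frequency bin
    Z : (n, T) whitened complex observations z(omega) for this bin
    """
    y = w.conj() @ Z                       # w^H z for every frame, shape (T,)
    ay = np.abs(y) ** 2
    g = 1.0 / (b + ay)                     # g(y) for G(y) = log(b + y)
    dg = -1.0 / (b + ay) ** 2              # g'(y)
    # E{ z (w^H z)* g(|w^H z|^2) } - E{ g + |w^H z|^2 g' } w   (eq. (14))
    w_new = (Z * (np.conj(y) * g)).mean(axis=1) - (g + ay * dg).mean() * w
    return w_new / np.linalg.norm(w_new)   # renormalize to the unit-norm constraint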
Within the deflationary scheme, each new vector wm(ω) is decorrelated from the previously estimated vectors as

wm(ω) ← wm(ω) − ∑_{j=1}^{m−1} (wm(ω)^H wj(ω)) wj(ω)    (16)

Fig. 4. Evaluation of permutation in each frequency bin for the Bingham and Hyvärinen algorithm at the top [12] and the proposed algorithm at the bottom, on the recorded signals with fixed iteration count = 7. [abs(G11 G22) − abs(G12 G21)] > 0 means no permutation.

Fig. 6. The convergence graph of the cost function of the proposed algorithm using contrast function G(y) = log(b + y); the results are averaged over all frequency bins.

In contrast, the performance indices and the evaluation of permutation obtained by the original FastICA algorithm [12] (MATLAB code available online) with random initialization, on the recorded mixtures, are shown in Figure 5. We highlight that thirty iterations are required for the performance level achieved in Figure 5(a), with no solution for the permutation, as shown in Figure 5(b). The permutation problem in frequency domain BSS degraded the SIR to approximately zero on the recorded mixtures.

The proposed algorithm starts with W(ω) = Q(ω)Ĥ(ω); if the estimate Ĥ(ω) is unbiased, then Wopt(ω) = Q(ω)Ĥ(ω). We assumed that the estimate of Ĥ(ω) (used in the above simulations, obtained from (12) with the directions of the sources provided by the video cameras) is unbiased and calculated the performance shown in Figure 7, which confirms that the estimate is in fact biased: since the environment is reverberant, Ĥ(ω) should include the sum of all echo paths, but in practice the directions of these reverberations cannot be measured by the video cameras. The convergence of the proposed algorithm within seven iterations, including the solution of the permutation, even with this biased estimate of Ĥ(ω), confirms that the multimodal approach is necessary to solve the cocktail party problem.
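The initialization W(ω) = Q(ω)Ĥ(ω) can be sketched as follows (Python/NumPy). Here Q(ω) is taken to be the per-bin whitening matrix, which is an assumption on our part since its definition (around (12)-(13)) is not reproduced above; Ĥ(ω) is the direct-path estimate sketched after (11), and each column of the resulting matrix is used as the initial one-unit vector w for the corresponding source in that bin.

import numpy as np

def whitening_matrix(X_bin):
    """Per-bin whitening matrix Q(omega) from the sample covariance of x(omega, t).

    X_bin : (n, T) complex observations in one frequency bin.
    """
    R = (X_bin @ X_bin.conj().T) / X_bin.shape[1]       # covariance estimate
    d, E = np.linalg.eigh(R)                            # R is Hermitian
    return E @ np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12))) @ E.conj().T

def init_unmixing(X, H_hat):
    """Geometric initialization W(omega) = Q(omega) H_hat(omega) for every bin.

    X     : (n, n_bins, T) STFT observations
    H_hat : (n_bins, n, m) direct-path estimates of the mixing matrices
    Each column of W0[k] serves as the starting vector w for one source.
    """
    n_bins = X.shape[1]
    W0 = [whitening_matrix(X[:, k, :]) @ H_hat[k] for k in range(n_bins)]
    return np.array(W0)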
Fig. 5. (a) Performance index at each frequency bin and (b) evaluation of permutation in each frequency bin for the Bingham and Hyvärinen FastICA algorithm [12], on the recorded signals after 30 iterations. A lower PI refers to a better separation and [abs(G11 G22) − abs(G12 G21)] > 0 means no permutation.

Fig. 7. (a) Performance index at each frequency bin and (b) evaluation of permutation in each frequency bin, assuming Ĥ(ω) is correct, i.e. Wopt(ω) = Q(ω)Ĥ(ω). A lower PI refers to a better separation and [abs(G11 G22) − abs(G12 G21)] > 0 means no permutation.
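The permutation indicator quoted in the captions of Figs. 4, 5 and 7 is straightforward to compute once the per-bin global matrix is available; the sketch below assumes G(ω) = W(ω)H(ω) is that 2 x 2 global (combined) matrix, which is the usual reading of G11, G12, G21 and G22 although it is not spelled out in the recovered text.

import numpy as np

def permutation_indicator(G):
    """abs(G11*G22) - abs(G12*G21) for one 2x2 global matrix G(omega).

    A positive value means the corresponding frequency bin is not permuted.
    """
    return abs(G[0, 0] * G[1, 1]) - abs(G[0, 1] * G[1, 0])

# evaluated over all bins, e.g.:
# indicator = np.array([permutation_indicator(G[k]) for k in range(n_bins)])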
Finally, the signal-to-interference ratio (SIR) was calculated as in [9], and the results are shown in Table 1 for the infomax (INFO), FDCBSS, constrained ICA (CICAu), Parra and Spence, and proposed GBFastICA algorithms, where the SIR is defined as

SIR = [ ∑_i ∑_ω |Hii(ω)|² ⟨|si(ω)|²⟩ ] / [ ∑_i ∑_{j≠i} ∑_ω |Hij(ω)|² ⟨|sj(ω)|²⟩ ]    (18)

where Hii and Hij represent, respectively, the diagonal and off-diagonal elements of the frequency domain mixing filter, and si is the frequency domain representation of the source of interest.

The results are summarized in Table 1 and confirm the objective improvement of our algorithm, which has also been confirmed subjectively by listening tests.

Table 1. Comparison of SIR-improvement between existing algorithms and the proposed method for different sets of mixtures.

Algorithm         SIR-Improvement / dB
Parra's Method    6.8
FDCBSS            9.4
INFO              11.1
CICAu             11.6
GBFastICA         18.8
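Equation (18) can be evaluated directly from the overall frequency-domain responses; a minimal sketch (Python/NumPy) is given below, assuming H[k] holds H(ω_k) and S_pow[k, i] holds the time-averaged source power ⟨|si(ω_k)|²⟩, and reporting the ratio in dB as in Table 1 (the dB conversion is an assumption, since (18) itself is a plain ratio).

import numpy as np

def sir_db(H, S_pow):
    """SIR of eq. (18), returned in dB.

    H     : (n_bins, m, m) overall frequency-domain responses H(omega)
    S_pow : (n_bins, m) time-averaged source powers <|s_i(omega)|^2>
    """
    num, den = 0.0, 0.0
    n_bins, m, _ = H.shape
    for k in range(n_bins):
        for i in range(m):
            num += abs(H[k, i, i]) ** 2 * S_pow[k, i]          # diagonal (desired) terms
            for j in range(m):
                if j != i:
                    den += abs(H[k, i, j]) ** 2 * S_pow[k, j]  # off-diagonal (interference) terms
    return 10.0 * np.log10(num / den)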
6. CONCLUSIONS
In this research a new multimodal approach for independent component analysis of complex valued frequency domain signals was proposed which exploits visual information to initialize a FastICA algorithm in order to mitigate the permutation problem. The advantage of our proposed algorithm was confirmed in simulations based on recordings from a real room environment. The location and direction information was obtained using a number of cameras equipped with a speaker tracking algorithm. The outcome of this approach paves the way for establishing a multimodal audio-video system for the separation of speech signals with moving sources.
REFERENCES

[8] W. Wang, D. Cosker, Y. Hicks, S. Sanei, and J. A. Chambers, "Video assisted speech source separation," Proc. IEEE ICASSP, pp. 425-428, 2005.

[9] S. Sanei, S. M. Naqvi, J. A. Chambers, and Y. Hicks, "A geometrically constrained multimodal approach for convolutive blind source separation," Proc. IEEE ICASSP, pp. 969-972, 2007.

[10] T. Tsalaile, S. M. Naqvi, K. Nazarpour, S. Sanei, and J. A. Chambers, "Blind source extraction of heart sound signals from lung sound recordings exploiting periodicity of the heart sound," Proc. IEEE ICASSP, Las Vegas, USA, 2008.

[11] A. Hyvärinen, "Fast and robust fixed-point algorithms for independent component analysis," IEEE Trans. Neural Netw., vol. 10, no. 3, pp. 626-634, 1999.

[12] E. Bingham and A. Hyvärinen, "A fast fixed-point algorithm for independent component analysis of complex valued signals," Int. J. Neural Syst., vol. 10, no. 1, pp. 1-8, 2000.

[13] S. Araki, S. Makino, Y. Hinamoto, R. Mukai, T. Nishikawa, and H. Saruwatari, "Equivalence between frequency domain blind source separation and frequency domain adaptive beamforming for convolutive mixtures," EURASIP J. Appl. Signal Process., no. 11, pp. 1157-1166, 2003.