A Multimodal Approach For Frequency Domain Independent Component Analysis With Geometrically-Based Initialization
Work supported by the Engineering and Physical Sciences Research Council (EPSRC) of the UK.

ABSTRACT

A novel multimodal approach for independent component analysis (ICA) of complex valued frequency domain signals is presented which utilizes video information to provide a geometrical description of both the speakers and the microphones. This geometric information, the visual aspect, is incorporated into the initialization of the complex ICA algorithm for each frequency bin; as such, the method is multimodal since two signal modalities, speech and video, are exploited. The separation results show a significant improvement over traditional frequency domain convolutive blind source separation (BSS) systems. Importantly, simulation results show that the inherent permutation problem in frequency domain BSS (with complex valued signals) is solved at the level of each frequency bin, together with an improvement in the rate of convergence for static sources. We also highlight that certain fixed point algorithms proposed by Hyvärinen et al., or their constrained versions, are not valid for complex valued signals.

1. INTRODUCTION

Convolutive blind source separation (CBSS) performed in the frequency domain, where the separation of complex valued signals is encountered, has remained a subject of much research interest due to its potentially wide applications, for example in acoustic source separation, and the associated challenging technical problems, the most important of which is perhaps the permutation problem. Generally, the main objective of BSS is to decompose the measurement signals into their constituent independent components as an estimation of the true sources, which are assumed a priori to be independent [1] [2].

CBSS algorithms have been conventionally developed in either the time [3] or frequency [1] [4] [5] [6] domains. Frequency domain convolutive blind source separation (FDCBSS) has however been a more popular approach, as the time-domain convolutive mixing is converted into a number of independent complex instantaneous mixing operations. The permutation problem inherent to FDCBSS presents itself when reconstructing the separated sources from the separated outputs of these instantaneous mixtures. It is more severe and destructive than for time-domain schemes, as the number of permutations grows geometrically with the number of instantaneous mixtures [7]. In unimodal BSS no a priori assumptions are typically made on the source statistics or the mixing system. On the other hand, in a multimodal approach a video system can capture the approximate positions of the speakers and the directions they face [8]. Such video information can thereby help to estimate the unmixing matrices more accurately and ultimately increase the separation performance. Following this idea, the objective of this paper is to efficiently use such information to mitigate the permutation problem. The scaling problem in CBSS is easily solved by matrix normalization [9] [10].

The convolutive mixing system can be described as follows: assume m statistically independent sources s(t) = [s1(t), . . . , sm(t)]^T, where [.]^T denotes the transpose operation and t the discrete time index. The sources are convolved with a linear model of the physical medium (mixing matrix), which can be represented in the form of a multichannel FIR filter H with memory length P, to produce n sensor signals x(t) = [x1(t), . . . , xn(t)]^T as

x(t) = ∑_{τ=0}^{P} H(τ) s(t − τ) + v(t)    (1)

where v(t) = [v1(t), . . . , vn(t)]^T and H = [H(0), . . . , H(P)]. In common with other researchers we assume n ≥ m. Using time domain CBSS, the sources can be estimated using a set of unmixing filter matrices W(τ), τ = 0, . . . , Q, such that

y(t) = ∑_{τ=0}^{Q} W(τ) x(t − τ)    (2)

where y(t) = [y1(t), . . . , ym(t)]^T contains the estimated sources, and Q is the memory of the unmixing filters. In FDCBSS the problem is transferred into the frequency domain using the short-time Fourier transform (STFT). Equations (1) and (2) then change respectively to

x(ω, t) ≈ H(ω) s(ω, t) + v(ω, t)    (3)

y(ω, t) ≈ W(ω) x(ω, t)    (4)

where ω denotes discrete normalized frequency. An inverse STFT is then used to find the estimated sources ŝ(t) = y(t); however, this will certainly be affected by the permutation effect due to the variation of W(ωi) with frequency bin ωi. In the following section we present a fast fixed-point algorithm for ICA of these complex valued signals, carefully motivate the choice of contrast function, and mention the local consistency of the algorithm. In Sec. 3 we examine the use of spatial information indicating the positions and directions of the sources using "data" acquired by a number of video cameras. In Sec. 4 we use this geometric information to initialize the fixed-point frequency domain ICA algorithm. In Sec. 5 the simulation results for real world data confirm the usefulness of the algorithm. Finally, conclusions are drawn.
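As an editorial illustration of the formulation in (1)-(4), the following minimal sketch (Python with NumPy/SciPy; the helper names are hypothetical and this is not the authors' implementation) forms the STFT of each sensor signal so that the convolutive mixture becomes an approximately instantaneous complex mixture in every frequency bin, and then applies one unmixing matrix per bin as in (4).

import numpy as np
from scipy.signal import stft

def to_frequency_bins(x_time, fs, nfft=1024, hop=256):
    """STFT of each sensor signal: returns X with shape (n_sensors, n_bins, n_frames).

    Within each bin k the convolutive mixture (1) is approximated by the
    instantaneous complex mixture (3): x(omega_k, t) ~ H(omega_k) s(omega_k, t).
    """
    X = [stft(xi, fs=fs, nperseg=nfft, noverlap=nfft - hop)[2] for xi in x_time]
    return np.array(X)

def apply_unmixing(X, W):
    """Per-bin separation y(omega, t) = W(omega) x(omega, t), as in (4).

    X : (n_sensors, n_bins, n_frames) complex STFT data
    W : sequence of per-bin unmixing matrices, W[k] of shape (m, n_sensors)
    """
    n_bins = X.shape[1]
    return np.stack([W[k] @ X[:, k, :] for k in range(n_bins)], axis=1)

The STFT window length (nfft) and hop size are placeholders; the paper does not specify the analysis parameters used for the recordings.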
2. A FAST FIXED-POINT ALGORITHM FOR ICA

Recently, ICA has become one of the central tools for BSS [1], [2]. In ICA, a set of original source signals s(t) in (1) are retrieved from their mixtures based on the assumption of their mutual statistical independence. Hyvärinen and Oja [2] [11] presented a fast fixed point algorithm (FastICA) for the separation of linearly mixed independent source signals. Unfortunately, these algorithms are not suitable for complex valued signals. The use of the algorithm of [12] in this paper is due to four main reasons: its suitability for complex signals, the proof of the local consistency of the estimator, more robustness against outliers, and the capability of deflationary separation of the independent component signals. In deflationary separation the components tend to separate in the order of decreasing non-Gaussianity. In [12] the basic concept of complex random variables is also provided and the fixed point algorithm for one unit is derived; for ease of derivation the algorithm updates the real and imaginary parts of w separately. Note that, for convenience, explicit use of the discrete time index is dropped and w represents one row of W used to extract a single source. Since the source signals are assumed zero mean, of unit variance and with uncorrelated real and imaginary parts of equal variances, the optima of E{G(|w^H x|²)} under the constraint E{|w^H x|²} = ‖w‖² = 1, where E{·} denotes the statistical expectation, (·)^H the Hermitian transpose, ‖·‖ the Euclidean norm, |·| the absolute value, and G(·) a nonlinear contrast function, satisfy, according to the Kuhn-Tucker conditions,

∇E{G(|w^H x|²)} − β ∇E{|w^H x|²} = 0    (5)

where the gradient, denoted by ∇, is computed with respect to the real and imaginary parts of w separately. The Newton method is used to solve this equation, for which the total approximative Jacobian [12] is

J = 2(E{g(|w^H x|²) + |w^H x|² g′(|w^H x|²)} − β) I    (6)

which is diagonal and therefore easily invertible, where I denotes the identity matrix, and g(·) and g′(·) denote the first and second derivatives of the contrast function G(·). Bingham and Hyvärinen obtained the following approximative Newton iteration:

w⁺ = w − [E{x (w^H x)* g(|w^H x|²)} − β w] / [E{g(|w^H x|²) + |w^H x|² g′(|w^H x|²)} − β]    (7)

Here κ is a constant representing the attenuation per unit length in a homogeneous medium. Similarly, τij, in terms of the number of samples, is proportional to the sampling frequency fs, the sound velocity C, and the distance dij as

τij = (fs / C) dij    (9)

which is independent of the directionality. However, in practical situations the speaker's direction introduces another variable into the attenuation measurement. In the case of electronic loudspeakers (not humans) the directionality pattern depends on the type of loudspeaker. Here, we approximate this pattern as cos(θij / r), where r > 2, which has a smaller value for highly directional speakers and vice versa (an accurate profile can easily be measured using a sound pressure level (SPL) meter). Therefore, the attenuation parameters become

αij = (κ / d²ij) cos(θij / r)    (10)

If, for simplicity, only the direct path is considered, the mixing filter has the form

Ĥ(t) = [ α11 δ(t − τ11)   α12 δ(t − τ12) ]
       [ α21 δ(t − τ21)   α22 δ(t − τ22) ]    (11)

where (ˆ·) denotes the approximation under this assumption.

Fig. 2. A two-speaker two-microphone layout for recording within a reverberant (room) environment. Room impulse response length is 130 ms.
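A minimal sketch of the direct-path model (9)-(11), again in Python/NumPy with hypothetical variable names: given speaker and microphone positions and the speaker facing directions (here assumed to come from the video system), it computes the delays τij and attenuations αij and evaluates the corresponding direct-path frequency response Ĥ(ω) in each bin. The default values of c, κ and r are placeholders, not values from the paper.

import numpy as np

def direct_path_response(mic_pos, spk_pos, spk_dir, fs, n_bins,
                         c=343.0, kappa=1.0, r=3.0):
    """Direct-path mixing response H_hat(omega) following (9)-(11).

    mic_pos : (n, 3) microphone positions [m]
    spk_pos : (m, 3) speaker positions [m]
    spk_dir : (m, 3) unit vectors giving the direction each speaker faces
    Returns H_hat of shape (n_bins, n, m), one complex matrix per frequency bin.
    """
    mic_pos, spk_pos, spk_dir = map(np.asarray, (mic_pos, spk_pos, spk_dir))
    n, m = len(mic_pos), len(spk_pos)
    tau = np.zeros((n, m))      # delays in samples, eq. (9)
    alpha = np.zeros((n, m))    # attenuations, eq. (10)
    for i in range(n):
        for j in range(m):
            v = mic_pos[i] - spk_pos[j]
            d = np.linalg.norm(v)
            tau[i, j] = fs * d / c
            # angle between the speaker's facing direction and the microphone direction
            cos_t = np.dot(spk_dir[j], v / d)
            theta = np.arccos(np.clip(cos_t, -1.0, 1.0))
            alpha[i, j] = kappa / d**2 * np.cos(theta / r)
    # delta(t - tau_ij) in (11) becomes exp(-j omega tau_ij) per frequency bin
    omega = np.pi * np.arange(n_bins) / n_bins        # normalized frequencies in [0, pi)
    return alpha[None, :, :] * np.exp(-1j * omega[:, None, None] * tau[None, :, :])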
The equivalence between frequency domain blind source separation and frequency domain adaptive beamforming has already been confirmed in [13].

Multiplying both sides of (7) by β − E{g(|w^H x|²) + |w^H x|² g′(|w^H x|²)}, we have the following update equation for each frequency bin:

w1⁺(ω) = E{z(ω)(w1(ω)^H z(ω))* g(|w1(ω)^H z(ω)|²)}
         − E{g(|w1(ω)^H z(ω)|²) + |w1(ω)^H z(ω)|² g′(|w1(ω)^H z(ω)|²)} w1(ω)    (14)
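The one-unit update (14) can be sketched as below (Python/NumPy). The sketch assumes z(ω) is the whitened per-bin observation arranged as an (n × T) complex matrix, and uses the contrast G(y) = log(b + y) quoted in the caption of Fig. 6, so g(y) = 1/(b + y) and g′(y) = −1/(b + y)²; the value of b is not given in the text, so b = 0.1 is an arbitrary placeholder. This follows the Bingham-Hyvärinen form of the update and is not the authors' code.

import numpy as np

def one_unit_update(w, Z, b=0.1):
    """One fixed-point iteration of (14) for a single unmixing vector w(omega).

    w : (n,) complex vector for this frequency bin
    Z : (n, T) whitened complex observations z(omega) for this bin
    """
    y = w.conj() @ Z                       # w^H z for every frame, shape (T,)
    ay = np.abs(y) ** 2
    g = 1.0 / (b + ay)                     # g(y) for G(y) = log(b + y)
    dg = -1.0 / (b + ay) ** 2              # g'(y)
    # E{ z (w^H z)* g(|w^H z|^2) } - E{ g + |w^H z|^2 g' } w   (eq. (14))
    w_new = (Z * (np.conj(y) * g)).mean(axis=1) - (g + ay * dg).mean() * w
    return w_new / np.linalg.norm(w_new)   # renormalize to the unit-norm constraint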
Within the deflationary scheme, each new vector wm(ω) is decorrelated from the previously estimated vectors as

wm(ω) ← wm(ω) − ∑_{j=1}^{m−1} (wm(ω)^H wj(ω)) wj(ω)    (16)

Fig. 4. Evaluation of permutation in each frequency bin for the Bingham and Hyvärinen algorithm at the top [12] and the proposed algorithm at the bottom, on the recorded signals with fixed iteration count = 7. [abs(G11 G22) − abs(G12 G21)] > 0 means no permutation.

Fig. 6. The convergence graph of the cost function of the proposed algorithm using contrast function G(y) = log(b + y); the results are averaged over all frequency bins.

In contrast, the performance indices and the evaluation of permutation obtained by the original FastICA algorithm [12] (MATLAB code available online) with random initialization, on the recorded mixtures, are shown in Figure 5. We highlight that thirty iterations are required for the performance level achieved in Figure 5(a), with no solution for the permutation, as shown in Figure 5(b). The permutation problem in frequency domain BSS degraded the SIR to approximately zero on the recorded mixtures.

The proposed algorithm starts with W(ω) = Q(ω)Ĥ(ω); if the estimate Ĥ(ω) is unbiased, then Wopt(ω) = Q(ω)Ĥ(ω). We assumed that the estimate of Ĥ(ω) (used in the above simulations, obtained from (12) with the directions of the sources provided by the video cameras) is unbiased and calculated the performance shown in Figure 7, which confirms that the estimate is in fact biased: since the environment is reverberant, Ĥ(ω) should include the sum of all echo paths, but in practice the directions of these reverberations cannot be measured by the video cameras. The convergence of the proposed algorithm within seven iterations, including the solution of the permutation, even with this biased estimate of Ĥ(ω), confirms that the multimodal approach is necessary to solve the cocktail party problem.
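The initialization W(ω) = Q(ω)Ĥ(ω) can be sketched as follows (Python/NumPy). Here Q(ω) is taken to be the per-bin whitening matrix, which is an assumption on our part since its definition (around (12)-(13)) is not reproduced above; Ĥ(ω) is the direct-path estimate sketched after (11), and each column of the resulting matrix is used as the initial one-unit vector w for the corresponding source in that bin.

import numpy as np

def whitening_matrix(X_bin):
    """Per-bin whitening matrix Q(omega) from the sample covariance of x(omega, t).

    X_bin : (n, T) complex observations in one frequency bin.
    """
    R = (X_bin @ X_bin.conj().T) / X_bin.shape[1]       # covariance estimate
    d, E = np.linalg.eigh(R)                            # R is Hermitian
    return E @ np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12))) @ E.conj().T

def init_unmixing(X, H_hat):
    """Geometric initialization W(omega) = Q(omega) H_hat(omega) for every bin.

    X     : (n, n_bins, T) STFT observations
    H_hat : (n_bins, n, m) direct-path estimates of the mixing matrices
    Each column of W0[k] serves as the starting vector w for one source.
    """
    n_bins = X.shape[1]
    W0 = [whitening_matrix(X[:, k, :]) @ H_hat[k] for k in range(n_bins)]
    return np.array(W0)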
Fig. 5. (a) Performance index at each frequency bin and (b) evaluation of permutation in each frequency bin for the Bingham and Hyvärinen FastICA algorithm [12], on the recorded signals after 30 iterations. A lower PI refers to a better separation and [abs(G11 G22) − abs(G12 G21)] > 0 means no permutation.

Fig. 7. (a) Performance index at each frequency bin and (b) evaluation of permutation in each frequency bin, assuming Ĥ(ω) is correct, i.e. Wopt(ω) = Q(ω)Ĥ(ω). A lower PI refers to a better separation and [abs(G11 G22) − abs(G12 G21)] > 0 means no permutation.
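The permutation indicator quoted in the captions of Figs. 4, 5 and 7 is straightforward to compute once the per-bin global matrix is available; the sketch below assumes G(ω) = W(ω)H(ω) is that 2 x 2 global (combined) matrix, which is the usual reading of G11, G12, G21 and G22 although it is not spelled out in the recovered text.

import numpy as np

def permutation_indicator(G):
    """abs(G11*G22) - abs(G12*G21) for one 2x2 global matrix G(omega).

    A positive value means the corresponding frequency bin is not permuted.
    """
    return abs(G[0, 0] * G[1, 1]) - abs(G[0, 1] * G[1, 0])

# evaluated over all bins, e.g.:
# indicator = np.array([permutation_indicator(G[k]) for k in range(n_bins)])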
Finally, the signal-to-interference ratio (SIR) was calculated as in [9], and the results are shown in Table 1 for the infomax (INFO), FDCBSS, constrained ICA (CICAu), Parra and Spence, and proposed GBFastICA algorithms, where the SIR is defined as

SIR = [ ∑_i ∑_ω |Hii(ω)|² ⟨|si(ω)|²⟩ ] / [ ∑_i ∑_{j≠i} ∑_ω |Hij(ω)|² ⟨|sj(ω)|²⟩ ]    (18)

where Hii and Hij represent, respectively, the diagonal and off-diagonal elements of the frequency domain mixing filter, and si is the frequency domain representation of the source of interest.

The results are summarized in Table 1 and confirm the objective improvement of our algorithm, which has also been confirmed subjectively by listening tests.

Table 1. Comparison of SIR-improvement between existing algorithms and the proposed method for different sets of mixtures.

Algorithm         SIR-Improvement / dB
Parra's Method    6.8
FDCBSS            9.4
INFO              11.1
CICAu             11.6
GBFastICA         18.8
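Equation (18) can be evaluated directly from the overall frequency-domain responses; a minimal sketch (Python/NumPy) is given below, assuming H[k] holds H(ω_k) and S_pow[k, i] holds the time-averaged source power ⟨|si(ω_k)|²⟩, and reporting the ratio in dB as in Table 1 (the dB conversion is an assumption, since (18) itself is a plain ratio).

import numpy as np

def sir_db(H, S_pow):
    """SIR of eq. (18), returned in dB.

    H     : (n_bins, m, m) overall frequency-domain responses H(omega)
    S_pow : (n_bins, m) time-averaged source powers <|s_i(omega)|^2>
    """
    num, den = 0.0, 0.0
    n_bins, m, _ = H.shape
    for k in range(n_bins):
        for i in range(m):
            num += abs(H[k, i, i]) ** 2 * S_pow[k, i]          # diagonal (desired) terms
            for j in range(m):
                if j != i:
                    den += abs(H[k, i, j]) ** 2 * S_pow[k, j]  # off-diagonal (interference) terms
    return 10.0 * np.log10(num / den)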
6. CONCLUSIONS
In this research a new multimodal approach for independent component analysis of complex valued frequency domain signals was proposed which exploits visual information to initialize a FastICA algorithm in order to mitigate the permutation problem. The advantage of our proposed algorithm was confirmed in simulations based on recordings from a real room environment. The location and direction information was obtained using a number of cameras equipped with a speaker tracking algorithm. The outcome of this approach paves the way for establishing a multimodal audio-video system for the separation of speech signals with moving sources.
REFERENCES

[8] W. Wang, D. Cosker, Y. Hicks, S. Sanei, and J. A. Chambers, "Video assisted speech source separation," Proc. IEEE ICASSP, pp. 425-428, 2005.

[9] S. Sanei, S. M. Naqvi, J. A. Chambers, and Y. Hicks, "A geometrically constrained multimodal approach for convolutive blind source separation," Proc. IEEE ICASSP, pp. 969-972, 2007.

[10] T. Tsalaile, S. M. Naqvi, K. Nazarpour, S. Sanei, and J. A. Chambers, "Blind source extraction of heart sound signals from lung sound recordings exploiting periodicity of the heart sound," Proc. IEEE ICASSP, Las Vegas, USA, 2008.

[11] A. Hyvärinen, "Fast and robust fixed-point algorithms for independent component analysis," IEEE Trans. Neural Netw., vol. 10, no. 3, pp. 626-634, 1999.

[12] E. Bingham and A. Hyvärinen, "A fast fixed-point algorithm for independent component analysis of complex valued signals," Int. J. Neural Syst., vol. 10, no. 1, pp. 1-8, 2000.

[13] S. Araki, S. Makino, Y. Hinamoto, R. Mukai, T. Nishikawa, and H. Saruwatari, "Equivalence between frequency domain blind source separation and frequency domain adaptive beamforming for convolutive mixtures," EURASIP J. Appl. Signal Process., no. 11, pp. 1157-1166, 2003.