Implementation of Sound Direction Detection and Mixed Source Separation in Embedded Systems

Wang, Jian-Hong; Le, Phuong Thi; Bee, Weng-Sheng; Putri, Wenny Ramadha; Su, Ming-Hsiang; Li, Kuo-Chen; Chen, Shih-Lun; He, Ji-Long; Pham, Tuan; Li, Yung-Hui; Wang, Jia-Ching

doi:10.3390/s24134351


by Jian-Hong Wang ¹,†, Phuong Thi Le ²,†, Weng-Sheng Bee ³, Wenny Ramadha Putri ³, Ming-Hsiang Su ⁴,*, Kuo-Chen Li ⁵,*, Shih-Lun Chen ⁶, Ji-Long He ¹, Tuan Pham ⁷, Yung-Hui Li ⁸ and Jia-Ching Wang ³

¹ School of Computer Science and Technology, Shandong University of Technology, Zibo 255000, China
² Department of Computer Science and Information Engineering, Fu Jen Catholic University, New Taipei City 242062, Taiwan
³ Department of Computer Science and Information Engineering, National Central University, Taoyuan City 320314, Taiwan
⁴ Department of Data Science, Soochow University, Taipei City 10048, Taiwan
⁵ Department of Information Management, Chung Yuan Christian University, Taoyuan City 320317, Taiwan
⁶ Department of Electronic Engineering, Chung Yuan Christian University, Taoyuan City 320314, Taiwan
⁷ Faculty of Digital Technology, University of Technology and Education—University of Đà Nẵng, Danang 550000, Vietnam
⁸ AI Research Center, Hon Hai Research Institute, New Taipei City 207236, Taiwan
* Authors to whom correspondence should be addressed.
† These authors contributed equally to this work.

Sensors 2024, 24(13), 4351; https://doi.org/10.3390/s24134351

Submission received: 12 May 2024 / Revised: 21 June 2024 / Accepted: 21 June 2024 / Published: 4 July 2024

(This article belongs to the Special Issue Sensors Network and Wearables for People Activities and Wellbeing Monitoring)


Abstract:

In recent years, embedded system technologies and products for sensor networks and wearable devices used for monitoring people’s activities and health have become the focus of the global IT industry. To enhance the speech recognition capabilities of wearable devices, this article discusses the implementation of audio positioning and enhancement in embedded systems using embedded algorithms for direction detection and mixed source separation. The two algorithms are implemented on different embedded systems: direction detection is developed on the TI TMS320C6713 DSK, and mixed source separation is developed on the Raspberry Pi 2. For mixed source separation, in the first experiment, the average signal-to-interference ratio (SIR) at distances of 1 m and 2 m was 16.72 dB and 15.76 dB, respectively. In the second experiment, when evaluated using speech recognition, the algorithm improved speech recognition accuracy to 95%.

Keywords:

embedded systems; position detection; hybrid sound source separation; signal-to-interference ratio (SIR); speech recognition

1. Introduction

In recent years, embedded system technology and products have become the focus of the global IT industry. As people pursue a more convenient and comfortable lifestyle, the information industry and smart homes are booming. Embedded systems are increasingly integrated into our daily lives in various forms, such as sensor networks and wearable devices for monitoring people’s activities and health. Although many people still do not fully understand what embedded systems are, they are closely related to our daily lives and have already permeated various fields, such as home applications [1,2,3,4,5,6,7], wireless communications [4,8], network applications [9,10], medical devices [4,11,12], consumer electronics, etc. Embedded systems encompass many applications, including smart homes, gaming consoles, electronic stethoscopes, automated teller machines (ATMs), and car-mounted Global Positioning Systems (GPSs).

This paper discusses the implementation of audio localization and enhancement in embedded systems, focusing on embedded algorithms for direction detection and mixed sound source separation. These two algorithms are implemented using different embedded systems: the TI TMS320C6713 DSK [13,14,15,16,17,18,19,20] for direction detection development and the Raspberry Pi 2 [21,22,23,24,25] for mixed sound source separation. The objective is to develop audio localization and noise reduction techniques applicable to intelligent living to bring convenience and comfort to users’ lives.

Direction detection entails capturing audio from a microphone array and determining the direction of the sound source through a specialized algorithm. Azimuth detection is used for audio tracking, and the time delay estimation (TDE) method [26] is employed for direction detection. By estimating the Cross-Power Spectral Density (XPSD) with the Generalized Cross-Correlation with Phase Transform (GCC-PHAT) and detecting the peak of the cross-correlation, we can identify the azimuthal relationship between the sound signal and the microphone array.

Current research on direction detection can be divided into two categories. The first category uses beamforming or subspace theory in conjunction with microphone arrays to determine the angle of the sound source. The most widely used method in this category is Multiple Signal Classification (MUSIC) [27]. The second category employs the Time Difference of Arrival (TDOA) to estimate the angle of the sound source from the time delay between the arrival of the sound at different microphones [28,29]. Among these, the Generalized Cross-Correlation (GCC) method proposed by Knapp and Carter [26] is considered one of the most common TDOA methods. The first category requires prior measurement of the impulse frequency response corresponding to each microphone in the array, resulting in significant computational complexity. Considering the real-time requirements of the system, this paper adopts the GCC-PHAT method [26,30].

Mixed sound source separation involves extracting the individual source signals from the mixed signal captured by the microphones. Since its inception, this problem has garnered significant attention from researchers. We aim to extract the desired sounds embedded within the observed signal using mixed sound source separation techniques. Blind source separation of mixed signals can be classified into two types according to the mixing model: the instantaneous mixing model [31,32] and the convolution mixing model [33,34,35,36]. This paper primarily explores the convolution mixing model, as it applies to real-world environments. Our method is as follows [37]. First, the Fourier transform converts the received mixed signal to the frequency domain; features are then extracted and input into K-Means to cluster the signal, where the k value is set to 2, as there are two mixed sound sources. Next, Binary Masking is applied to the mixed signal to reconstruct the sources in the frequency domain. Finally, the signal is converted back to the time domain using the inverse Fourier transform.

Due to the limited processing power and memory capacity of embedded system processors compared to PCs, efficient utilization of memory, computational resources, and program storage space become critical in the embedded system environment. Therefore, we need to further optimize the computational load in the algorithm and streamline the code to ensure smooth execution in embedded systems. Code Composer Studio (CCS) is an integrated development environment developed by TI. It provides an optimized compiler that compiles program code into efficient executable programs. CCS also provides a real-time operating system, DSP/BIOS, which can provide simple and effective management of programs. This study will utilize CCS to optimize the computational load and ensure smooth execution in embedded systems. To enhance the speech recognition capabilities of wearable devices, the contribution of this article is to realize audio positioning and enhancement in embedded systems, use embedded algorithms to perform direction detection and mixed source separation, and ultimately increase the speech recognition accuracy, further improving the practicality of wearable devices.

2. Embedded System Design for Direction Detection

2.1. Algorithm Flow and Overview

Firstly, we conduct voice activity detection (VAD) preprocessing on the received audio signal to identify segments containing speech [38]. Subsequently, we utilize the spectral subtraction method [39,40] to remove noise from the audio. Finally, the denoised audio is forwarded to the DOA (Direction of Arrival) recognizer for direction detection [26,30]. Figure 1 illustrates the architecture of the embedded system proposed in this paper for direction detection.

2.1.1. Voice Activity Detection (VAD)

We use conventional energy-based Voice Activity Detection (VAD) [38] to extract sound events. Let x_{t} (n) represent the received audio, where t is the audio frame index and n ranges from 1 to N samples, and let μ_{t} denote the average value of the audio frame x_{t}. The frame energy E_{t} is compared with a threshold value, T, resulting in either A = 1 for sound events or A = 0 for non-sound events. Since different microphones may require different threshold values, it is necessary to conduct testing to determine the appropriate threshold value.

E_{t} = \sum_{n = 1}^{N} {(x_{t} (n) - μ_{t})}^{2}

(1)

A = \begin{cases} 1, & E_{t} > T \\ 0, & \text{otherwise} \end{cases}

(2)
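Equations (1) and (2) can be illustrated with a minimal Python sketch. The function names and the threshold value used in the usage note are hypothetical and chosen only for illustration; in practice the threshold would be tuned per microphone, as noted above.

```python
from typing import Sequence


def frame_energy(frame: Sequence[float]) -> float:
    """Equation (1): sum of squared deviations from the frame mean."""
    mu = sum(frame) / len(frame)
    return sum((x - mu) ** 2 for x in frame)


def is_sound_event(frame: Sequence[float], threshold: float) -> int:
    """Equation (2): return A = 1 if E_t > T, otherwise A = 0."""
    return 1 if frame_energy(frame) > threshold else 0
```

For example, with an illustrative threshold of 0.1, the frame `[0.5, -0.4, 0.6, -0.5]` would be flagged as a sound event, while `[0.01, -0.01, 0.02, -0.02]` would not.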

2.1.2. Sound Enhancement—Spectral Subtraction

Sound enhancement is achieved using the spectral subtraction method [39,40]. The benefit of the spectral subtraction method over some machine learning-based sound enhancement methods [41,42,43] is its lower computational complexity. This method involves subtracting the averaged noise spectrum from the spectrum of the noisy signal to eliminate environmental noise. The averaged noise spectrum is obtained from the signals received during non-sound events.

If the noise, n (k), of one audio frame is added to the original signal, s (k), of the same audio frame, resulting in a noisy signal y (k) for that frame, we have the following equation:

y (k) = s (k) + n (k)

(3)

After performing the Fourier transform, we obtain the following:

Y (e^{j ω}) = S (e^{j ω}) + N (e^{j ω})

(4)

The general formula for the spectral subtraction method is as follows:

{|S_{S} (e^{j ω})|}^{2} = \begin{cases} {|Y (e^{j ω})|}^{2} - α {|μ (e^{j ω})|}^{2}, & \text{if } {|Y (e^{j ω})|}^{2} > α {|μ (e^{j ω})|}^{2} \\ β {|Y (e^{j ω})|}^{2}, & \text{otherwise} \end{cases}, \quad {|μ (e^{j ω})|}^{2} = E \{{|N (e^{j ω})|}^{2}\}

(5)

where μ (e^{j ω}) represents the average noise spectrum, α lies between 0 and 1, and β is either 0 or a small minimum value.

After subtracting the spectral energy, we obtain the denoised signal spectrum \hat{S} (e^{j ω}), where θ_{Y} (e^{j ω}) represents the phase of Y (e^{j ω}):

\hat{S} (e^{j ω}) = |S_{S} (e^{j ω})| e^{j θ_{Y} (e^{j ω})}

(6)

Alternatively, by obtaining the ratio H (e^{j ω}) of the energy-subtracted spectrum {|S_{S} (e^{j ω})|}^{2} to the spectrum of the noisy signal {|Y (e^{j ω})|}^{2}, we multiply it with Y (e^{j ω}) to obtain the denoised signal spectrum \hat{S} (e^{j ω}):

\hat{S} (e^{j ω}) = H (e^{j ω}) Y (e^{j ω})

(7)

H (e^{j ω}) = \frac{{|S_{S} (e^{j ω})|}^{2}}{{|Y (e^{j ω})|}^{2}}

(8)
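The per-bin subtraction rule of Equation (5) and the phase-preserving reconstruction of Equation (6) can be sketched in Python as follows. This is an illustrative sketch, not the DSK implementation: the function names are hypothetical, the spectra are assumed to be plain lists with one value per frequency bin, and the average noise power is assumed to have been estimated beforehand from non-sound frames.

```python
import cmath
from typing import List, Sequence


def subtract_power(noisy_power: Sequence[float],
                   noise_power: Sequence[float],
                   alpha: float = 1.0,
                   beta: float = 0.0) -> List[float]:
    """Equation (5), applied independently to each frequency bin."""
    out = []
    for y2, mu2 in zip(noisy_power, noise_power):
        if y2 > alpha * mu2:
            out.append(y2 - alpha * mu2)   # subtract the scaled noise power
        else:
            out.append(beta * y2)          # floor: zero or a small fraction
    return out


def denoise_spectrum(noisy_spectrum: Sequence[complex],
                     noise_power: Sequence[float],
                     alpha: float = 1.0,
                     beta: float = 0.0) -> List[complex]:
    """Equation (6): subtracted magnitude combined with the noisy phase."""
    noisy_power = [abs(y) ** 2 for y in noisy_spectrum]
    clean_power = subtract_power(noisy_power, noise_power, alpha, beta)
    return [cmath.rect(p ** 0.5, cmath.phase(y))
            for p, y in zip(clean_power, noisy_spectrum)]
```

Keeping the noisy phase avoids having to estimate the clean phase, which is part of what keeps the computational cost of the method low.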

2.1.3. Direction Detection—TDE-to-DOA Method

We follow related work that uses the GCC-PHAT [26,30] estimate of the XPSD (Cross-Power Spectral Density) and peak detection of the cross-correlation for direction detection with the Time Delay Estimation (TDE) method. We also refer to the research of Varma et al. [44], which uses cross-correlation-based TDE for direction-of-arrival (DOA) estimation with acoustic arrays in less reverberant environments. The TDE method determines the direction of a single sound source; it cannot distinguish multiple simultaneous sound sources. However, its advantage lies in its simplicity: it requires only two microphones and a relatively straightforward hardware architecture, making it suitable for real-time applications.

Firstly, we assume the presence of a sound source in the space. Under ideal conditions, the signals received by the two microphones can be represented as follows:

x_{1} (t) = s_{1} (t) + n_{1} (t)

(9)

x_{2} (t) = α s_{1} (t - D) + n_{2} (t)

(10)

where s_{1} (t) represents the sound source, x_{1} (t) and x_{2} (t) represent the signals received by the two microphones, and n_{1} (t) and n_{2} (t) are the noises present. We assume s_{1} (t), n_{1} (t), and n_{2} (t) to be wide-sense stationary (WSS) and mutually uncorrelated. Here, D represents the actual delay, and α represents the scale value for changing the magnitude. Furthermore, the changes in D and α are slow, and at this stage, the cross-correlation between the microphones can be expressed as follows:

R_{x 1, x 2} (τ) = E [x_{1} (t) x_{2} (t - τ)]

(11)

where E represents the expectation value, and τ, which maximizes Equation (11), is the time delay between the two microphones. Since the actual observation time is finite, the estimation of cross-correlation can be expressed as follows:

{\hat{R}}_{x 1, x 2} (τ) = \frac{1}{T - τ} \int_{τ}^{T} x_{1} (t) x_{2} (t - τ) d t

(12)

where T represents the observation time interval, and the relationship between cross-correlation and cross-power spectrum can be expressed in the following Fourier representation:

R_{x 1, x 2} (τ) = \int_{- \infty}^{\infty} G_{x 1, x 2} (f) e^{j 2 π f τ} d f

(13)

Now, let us consider the actual state of the physical space, where the sound signals received by the microphones undergo spatial transformations. Therefore, the actual cross-power spectrum between the microphones can be represented as follows:

G_{y 1, y 2} (f) = H_{1} (f) H_{2}^{*} (f) G_{x 1, x 2} (f)

(14)

where H_{1} (f) and H_{2} (f) represent the spatial transformation functions from the sound source to the first microphone and the second microphone, respectively. Therefore, we define the generalized correlation between the microphones as follows:

R_{x 1, x 2}^{(g)} (τ) = \int_{- \infty}^{\infty} Ψ_{g} (f) G_{x 1, x 2} (f) e^{j 2 π f τ} d f

(15)

wherein

Ψ_{g} (f) = H_{1} (f) H_{2}^{*} (f)

(16)

In practice, due to the limited observation time, we can only use the estimate {\hat{G}}_{x 1, x 2} (f) instead of G_{x 1, x 2} (f). Therefore, Equation (15) is rewritten as follows:

{\hat{R}}_{x 1, x 2}^{(p)} (τ) = \int_{- \infty}^{\infty} Ψ_{p} (f) {\hat{G}}_{x 1, x 2} (f) e^{j 2 π f τ} d f

(17)

Using Equation (17), we can estimate the time delay between the microphones. The choice of Ψ_{p} (f) also affects the estimation of the time delay. In this paper, we employ the PHAT (Phase Transform) method proposed by Carter et al. [30], which can be expressed as follows:

Ψ_{p} (f) = \frac{1}{|G_{x 1, x 2} (f)|}

(18)

This method works remarkably well when the noise distributions between the two microphones are independent. By employing the aforementioned approach, we can accurately detect the azimuth relationship between our sound signal and the microphones.
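A discrete-time sketch of Equations (17) and (18) is given below in plain Python. It is an illustration under stated assumptions rather than the DSK code: a naive DFT replaces an optimized FFT, the correlation is circular over one frame, and all function names are hypothetical. The cross-spectrum follows the sign convention of Equation (11), so the correlation peaks at lag -D and the function returns the negated peak lag. The delay-to-angle helper assumes a far-field source and measures the angle from the array axis pointing from microphone 2 towards microphone 1, so that the delay equals d cos(θ) / c; this geometry and the speed of sound of 343 m/s are assumptions that the text does not spell out.

```python
import cmath
import math
from typing import List, Sequence


def dft(x: Sequence[complex], inverse: bool = False) -> List[complex]:
    """Naive O(N^2) discrete Fourier transform (illustration only)."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def gcc_phat_delay(x1: Sequence[float], x2: Sequence[float],
                   eps: float = 1e-12) -> int:
    """Return the delay D in samples such that x2[n] ~ x1[n - D]."""
    n = len(x1)
    spec1, spec2 = dft(x1), dft(x2)
    # Estimated cross-power spectrum of x1 and x2.
    cross = [a * b.conjugate() for a, b in zip(spec1, spec2)]
    # PHAT weighting, Equation (18): keep only the phase of each bin.
    weighted = [g / (abs(g) + eps) for g in cross]
    # Generalized cross-correlation, Equation (17), via the inverse DFT.
    corr = [v.real for v in dft(weighted, inverse=True)]
    peak = max(range(n), key=lambda i: corr[i])
    lag = peak if peak <= n // 2 else peak - n  # map to a signed lag
    return -lag  # with the convention of Equation (11) the peak is at -D


def delay_to_angle_deg(delay_samples: float, sample_rate_hz: float,
                       spacing_m: float, speed_of_sound: float = 343.0) -> float:
    """Far-field angle (degrees) from the array axis; assumed geometry."""
    cos_theta = speed_of_sound * delay_samples / (sample_rate_hz * spacing_m)
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding
    return math.degrees(math.acos(cos_theta))
```

Under these assumptions a zero delay maps to 90°, which is consistent with a frame dominated by uncorrelated noise tending to produce readings near the broadside direction.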

2.2. Embedded System Hardware Devices

The embedded system for azimuth detection in this study utilizes the TI TMS320C6713 DSK [16,17,18,19,20,21,22,23,37] as the development platform, as shown in Figure 2. In the following sections, we provide a detailed introduction to the specification for the TI TMS320C6713 DSK, which is divided into three parts: peripheral equipment, DSP core, and multi-channel audio input expansion card.

3. Design of Embedded System for Separation of Mixed Audio Sources

3.1. Algorithm Flow and Introduction

We begin by applying the blind source separation algorithm [37] to the received audio signal for separating the mixed sources. Then, we upload the separated signals to the Google Speech API for recognition. Figure 3 illustrates the architecture diagram of the embedded system, proposed in this paper, for mixed source separation.

Hybrid Audio Source Separation

We set up the system with two receivers (microphones). Initially, the received mixed signal is transformed from the time domain to the frequency domain using Fourier transform to leverage its sparsity for further processing. Then, feature extraction is applied to the transformed signal and input into K-Means for clustering. During the clustering process, corresponding masks are generated, and a Binary Mask is adopted for implementation. Subsequently, Binary Masking is applied to the mixed signal to reconstruct the source signals in the frequency domain. Finally, the signals are transformed back to the time domain using the inverse Fourier transform. Figure 4 illustrates the flowchart of the hybrid audio source separation algorithm [37].
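The clustering and masking steps can be sketched in Python as follows. The sketch covers only one frame of frequency-domain data and omits the forward and inverse transforms. The feature used here, the inter-channel phase difference of each bin, is an assumption made for illustration, since the text does not state which features are extracted; the function names and the simple one-dimensional two-cluster K-Means are likewise hypothetical.

```python
import cmath
from typing import List, Sequence, Tuple


def two_means(values: Sequence[float], iterations: int = 20) -> List[int]:
    """One-dimensional K-Means with k = 2; returns a cluster label per value."""
    centroids = [min(values), max(values)]  # simple deterministic start
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: nearest centroid.
        labels = [0 if abs(v - centroids[0]) <= abs(v - centroids[1]) else 1
                  for v in values]
        # Update step: mean of each non-empty cluster.
        for c in (0, 1):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return labels


def separate_by_binary_mask(
        left: Sequence[complex],
        right: Sequence[complex]) -> Tuple[List[List[complex]], List[int]]:
    """Cluster bins into two sources and apply binary masks to one channel."""
    # Assumed feature: phase difference between the two microphone channels.
    features = [cmath.phase(l * r.conjugate()) for l, r in zip(left, right)]
    labels = two_means(features)
    # Binary mask: each bin is assigned entirely to one of the two sources.
    sources = [[x if lab == c else 0j for x, lab in zip(left, labels)]
               for c in (0, 1)]
    return sources, labels
```

Because the masks are binary and complementary, the two masked spectra sum back to the mixed spectrum of the channel they were applied to.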

3.2. Embedded System Hardware Devices

In this paper, the Raspberry Pi 2 [21,22,23,24,25] is the development platform for mixed audio source separation. The physical board is depicted in Figure 5. Here, we provide a detailed introduction to the specifications of both the Raspberry Pi 2 and the Cirrus Logic Audio Card [45,46] audio module.

Cirrus Logic Audio Card Audio Module

Figure 6 illustrates the Cirrus Logic Audio Card [45,46], an audio expansion board designed for the Raspberry Pi. Compatible with the Raspberry Pi models A+ and B+, it features a 40-pin GPIO interface that connects directly to the Raspberry Pi’s 40-pin GPIO header. The card supports high-definition audio (HD Audio) and incorporates two digital micro-electromechanical microphones (DMICs) and Class-D power amplifiers for directly driving external speakers. The analog signals include line-level input/output and headphone output/headset microphone input, while the digital signals comprise stereo digital audio input/output (SPDIF). Moreover, it includes an expansion header, enabling connections to devices beyond the Raspberry Pi. Figure 7 depicts the Raspberry Pi 2 connected to the Cirrus Logic Audio Card.

4. Experimental Results

4.1. Direction Detection of the Embedded System

4.1.1. Experimental Environment Setup

We utilized a classroom measuring 15 m × 8.5 m × 3 m for the experiment. Four omnidirectional microphones were strategically placed within the classroom. The sound source was positioned 2 m away from the center of the microphone array. To assess the azimuth detection capability, we tested 18 angles, ranging from 5° to 175°, with a 10° interval between each test angle (see Table 1 for details). Figure 8 depicts the setup for the azimuth detection experiment.

4.1.2. Experimental Environment Equipment

We utilized the CM503N omnidirectional microphone (depicted in Figure 9). For the equipment setup, the microphone was initially connected to the phantom power supply, and then the phantom power supply was connected to DSK AUDIO 4.

4.1.3. Experimental Results

The functionality was implemented on the development board, and the measured angles were satisfactory, with errors within 10 degrees (refer to Table 2). However, due to the limited memory and processor speed of the development board, real-time measurement is not yet feasible. As a compromise, the program segments with the longest execution time were placed in the internal memory, which is smaller but offers the fastest execution speed, while the remaining parts were stored in the external memory. This approach ensures a reasonably fast execution speed. Figure 10 depicts the experimental scenario of direction detection.

Based on the current execution results, most sound events that exceeded the threshold value during angle measurement yielded an error within 10 degrees. Occasionally, however, either no angle was obtained for a sound event or the measured result was close to 90 degrees. We speculate that the former occurs because the development board was engaged in other tasks when the sound was emitted, so no audio data were captured and the frame was classified as silent in the threshold test. As for the latter, we infer that the emitted sound exceeded the threshold value but was either too soft and masked by noise, or the captured segment was too short, so that noise was mistakenly identified as a measurable sound event.

4.2. Embedded System for Mixed Sound Source Separation

4.2.1. Experimental Environment Setup

Experimental Setup 1: We utilized a classroom with dimensions of 5.5 m × 4.8 m × 3 m for the experiment. The microphone setup included the two built-in, omnidirectional MEMS microphones from the Cirrus Logic Audio Card, with a distance of 0.058 m between them. The sound sources, denoted as S1 and S2, were positioned at distances of 1 m and 2 m, respectively, from the center of the microphones. S1 was a male speaker, and S2 was a female speaker. Figure 11 illustrates the environment for the mixed sound source separation experiment, and Table 3 provides the setup details for the mixed sound source separation environment.

Experimental Setup 2: We utilized a classroom with dimensions of 5.5 m × 4.8 m × 3 m for the experiment. The microphone setup included the two built-in, omnidirectional MEMS microphones from the Cirrus Logic Audio Card, with a distance of 0.058 m between them. The speaker was positioned at a distance of 0.2 m from the center of the microphones, while the interfering sound source (a news broadcast) was positioned at a distance of 1 m from the center of the microphones. Figure 12 illustrates the environment for the mixed sound source separation experiment, and Table 4 provides the setup details for the mixed sound source separation environment.

4.2.2. Experimental Environment Equipment

We utilized a Raspberry Pi 7-inch touch screen (Figure 13) for display purposes, which could be connected to the Raspberry Pi 2 using DSI as the output. Additionally, we utilized the USB-N10 wireless network card for internet access and Google speech recognition.

4.2.3. Experimental Results

For Experimental Setup 1, the signal-to-interference ratio (SIR) served as the performance evaluation metric. The formula for the SIR is as follows:

\mathrm{SIR} = 10 \log_{10} \frac{{‖y_{q, \mathrm{target}}‖}^{2}}{{‖e_{q, \mathrm{interf}}‖}^{2}}

(19)

where y_{q, \mathrm{target}} represents the components of the source signal in the separated signal, and e_{q, \mathrm{interf}} refers to the remaining interference components in the separated signal. Table 5 and Table 6 present the SIR obtained at different distances.
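The SIR computation can be sketched in Python as below. The sketch assumes the usual energy-ratio form, with squared Euclidean norms in both the numerator and the denominator, and assumes that the separated signal has already been decomposed into its target and interference components; the function name is hypothetical.

```python
import math
from typing import Sequence


def sir_db(target: Sequence[float], interference: Sequence[float]) -> float:
    """Signal-to-interference ratio in dB from the two signal components."""
    target_energy = sum(v * v for v in target)
    interference_energy = sum(v * v for v in interference)
    if interference_energy == 0.0:
        raise ValueError("interference energy is zero; SIR is unbounded")
    return 10.0 * math.log10(target_energy / interference_energy)
```

As a sanity check, a target component with ten times the amplitude of the interference component has one hundred times its energy, which corresponds to an SIR of about 20 dB.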

The sound source angles (30°, 30°) are measured relative to the 90° direction: S1 is shifted 30° to the left and S2 is shifted 30° to the right. Figure 14 displays the mixed signal for the left and right channels at a distance of 1 m, while Figure 15 shows the separated signals after mixed sound source separation at the same distance. Similarly, Figure 16 shows the mixed signal for the left and right channels at a distance of 2 m, and Figure 17 shows the separated signals after mixed sound source separation at that distance.

For Experimental Setup 2, we utilized the free Speech API provided by Google to evaluate the performance of the algorithm. We tested the algorithm using 20 common commands typically used in a general and simple smart home environment, such as “Open the window,” “Weather forecast,” “Turn off the lights,” “Increase the volume,” “Stock market status,” and so on.

From Table 7, we can observe that the recognition accuracy of the separated signals is lower than that of the mixed signals. However, the commands recognized correctly from the separated signals and those recognized correctly from the mixed signals do not completely overlap. Therefore, by combining the mixed signals and the separated signals (as shown in Figure 18) in a fusion set, we achieved a recognition accuracy of 95%.
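One way to read the fusion set is as a union of results: a command counts as recognized if either the mixed signal or the separated signal yields the correct transcription. The Python sketch below illustrates that reading only. It is an interpretation, since the text does not specify how the fusion is computed, and the function name is hypothetical.

```python
from typing import Sequence


def fusion_accuracy(mixed_ok: Sequence[bool],
                    separated_ok: Sequence[bool]) -> float:
    """Fraction of commands recognized from the mixed OR separated signal."""
    if len(mixed_ok) != len(separated_ok):
        raise ValueError("both result lists must cover the same commands")
    hits = sum(1 for m, s in zip(mixed_ok, separated_ok) if m or s)
    return hits / len(mixed_ok)
```

With toy inputs that are not taken from Table 7, such as `[True, True, False, False]` for the mixed signal and `[False, True, True, False]` for the separated signal, each input alone scores 0.5 while the fused result scores 0.75, which shows how partially overlapping results can raise the combined accuracy.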

5. Conclusions and Future Research Directions

This study implemented a direction detection system on the TI TMS320C6713 DSK development board and a mixed sound source separation system on the Raspberry Pi 2 development board. The experimental results show that combining the separated signals with the mixed signals yields better speech recognition than the mixed signals alone, which enhances the speech recognition capabilities of embedded systems in sensor networks and wearable devices for monitoring people’s activities and health.

The received audio signal underwent preprocessing in the direction detection system using voice activity detection (VAD) to identify speech segments. Spectral subtraction was then applied to denoise the noisy audio. Finally, the denoised audio was passed to the DOA (Direction of Arrival) estimator for direction angle detection.

For the mixed sound source separation algorithm, we designed a system with two microphones. The received mixed signal was first transformed from the time domain to the frequency domain using the Fourier transform to exploit its sparsity. Then, feature extraction was performed and the features were input into the K-Means algorithm for clustering. During the clustering process, a corresponding mask was generated; here, we used binary masks for separation. The binary-masked mixed signal was then used to reconstruct the source signals in the frequency domain, which were subsequently transformed back to the time domain using the inverse Fourier transform to obtain the separated audio.

In future research, our goal is to further optimize the algorithm on the embedded board, reduce the computational load of the embedded system, and improve its real-time performance. We also plan to improve the quality of the signals produced by the mixed sound source separation algorithm in order to raise speech recognition accuracy. In addition, we will test a wider range of scenarios, with sound sources placed at different distances, to evaluate the system performance more comprehensively.

Author Contributions

Conceptualization, J.-C.W.; Methodology, W.-S.B.; writing—original draft preparation, W.-S.B.; writing—review and editing, J.-H.W., P.T.L., W.R.P., M.-H.S., K.-C.L., S.-L.C., J.-L.H., T.P. and Y.-H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are openly available.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Wang, J.-C.; Lee, Y.-S.; Lin, C.-H.; Siahaan, E.; Yang, C.-H. Robust environmental sound recognition with fast noise suppression for home automation. IEEE Trans. Autom. Sci. Eng. 2015, 12, 1235–1242.
2. Wang, J.-C.; Lee, H.-P.; Wang, J.-F.; Lin, C.-B. Robust environmental sound recognition for home automation. IEEE Trans. Autom. Sci. Eng. 2008, 5, 25–31.
3. Lu, J.; Peng, Z.; Yang, S.; Ma, Y.; Wang, R.; Pang, Z.; Feng, X.; Chen, Y.; Cao, Y. A review of sensory interactions between autonomous vehicles and drivers. J. Syst. Archit. 2023, 141, 102932.
4. Abdulhameed, Z.N. Design and Implement a Wireless Temperature Monitoring System using Noncontact IR Sensor Based on Arduino. In Proceedings of the 2023 IEEE International Conference on ICT in Business Industry & Government (ICTBIG), Indore, India, 8–9 December 2023; IEEE: New York, NY, USA, 2023; pp. 1–5.
5. Wang, J.-C.; Chin, Y.-H.; Hsieh, W.-C.; Lin, C.-H.; Chen, Y.-R.; Siahaan, E. Speaker identification with whispered speech for the access control system. IEEE Trans. Autom. Sci. Eng. 2015, 12, 1191–1199.
6. Karunaratna, S.; Maduranga, P. Artificial Intelligence on Single Board Computers: An Experiment on Sound Event Classification. In Proceedings of the 2021 5th SLAAI International Conference on Artificial Intelligence (SLAAI-ICAI), Colombo, Sri Lanka, 6–7 December 2021; IEEE: New York, NY, USA, 2021; pp. 1–5.
7. Ahmad, M.; Amin, M.B.; Hussain, S.; Kang, B.H.; Cheong, T.; Lee, S. Health fog: A novel framework for health and wellness applications. J. Supercomput. 2016, 72, 3677–3695.
8. Liu, X.; Zhang, M.; Xiong, T.; Richardson, A.G.; Lucas, T.H.; Chin, P.S.; Etienne-Cummings, R.; Tran, T.D.; Van der Spiegel, J. A fully integrated wireless compressed sensing neural signal acquisition system for chronic recording and brain machine interface. IEEE Trans. Biomed. Circ. Syst. 2016, 10, 874–883.
9. Guo, Z.; Liu, Y.; Lu, F. Embedded remote monitoring system based on NBIOT. J. Phys. Conf. Ser. 2022, 2384, 012038.
10. Liu, W.; Xu, W.; Mu, X.; Cheng, J. Design of Smart Home Control System Based on Embedded System. In Proceedings of the 2020 IEEE 9th Joint International Information Technology and Artificial Intelligence Conference (ITAIC), Shanghai, China, 6–8 November 2020; IEEE: New York, NY, USA, 2020; pp. 2209–2213.
11. Sehatbakhsh, N.; Alam, M.; Nazari, A.; Zajic, A.; Prvulovic, M. Syndrome: Spectral Analysis for Anomaly Detection on Medical IOT and Embedded Devices. In Proceedings of the 2018 IEEE International Symposium on Hardware Oriented Security and Trust (HOST), Washington, DC, USA, 30 April–4 May 2018; IEEE: New York, NY, USA, 2018; pp. 1–8.
12. Zhou, W.; Liao, J.; Li, B.; Li, J. A Family Medical Monitoring System Based on Embedded uC/OS-II and GPRS. In Proceedings of the 2012 IEEE International Conference on Information and Automation, Shenyang, China, 6–8 June 2012; IEEE: New York, NY, USA, 2012; pp. 663–667.
13. Ilmi, M.; Huda, M.; Rahardhita, W. Automatic Control Music Amplifier Using Speech Signal Utilizing by TMS320C6713. In Proceedings of the 2015 International Electronics Symposium (IES), Surabaya, Indonesia, 29–30 September 2015; IEEE: New York, NY, USA, 2015; pp. 163–166.
14. Manikandan, J.; Venkataramani, B.; Girish, K.; Karthic, H.; Siddharth, V. Hardware Implementation of Real-Time Speech Recognition System Using TMS320C6713 DSP. In Proceedings of the 2011 24th International Conference on VLSI Design, Chennai, India, 2–7 January 2011; IEEE: New York, NY, USA, 2011; pp. 250–255.
15. Mustafa, N.B.A.; Gandi, S.; Sharrif, Z.A.M.; Ahmed, S.K. Real-Time Implementation of a Fuzzy Inference System for Banana Grading Using DSP TMS320C6713 Platform. In Proceedings of the 2010 IEEE Student Conference on Research and Development (SCOReD), Kuala Lumpur, Malaysia, 13–14 December 2010; IEEE: New York, NY, USA, 2010; pp. 324–328.
16. Singh, J.; Singh, H.P.; Singh, S. Implementation of FIR Interpolation Filter on TMS320C6713 for VOIP Analysis. In Proceedings of the 2010 2nd International Conference on Computational Intelligence, Communication Systems and Networks, Liverpool, UK, 28–30 July 2010; IEEE: New York, NY, USA, 2010; pp. 289–294.
17. Texas Instruments. SPRS186L-TMS320C6713 Floating-Point Digital Signal Processing; December 2001. Revised November 2005; Texas Instruments, Texas, USA. 2005. Available online: https://media.digikey.com/pdf/Data%20Sheets/Texas%20Instruments%20PDFs/TMS320C6713.pdf (accessed on 20 June 2024).
18. Spectrum Digital. TMS320C6713 DSK Technical Reference; November 2003. Spectrum Digital, Texas, USA. 2003. Available online: https://electrical.engineering.unt.edu/sites/default/files/Spectrum_Digital_TMS320C6713_User_Guide.pdf (accessed on 20 June 2024).
19. Texas Instruments. Code Composer Studio Development Tools v3. 3 Getting Started Guide; SPRU509H; May 2008. Texas Instruments, Texas, USA. 2008. Available online: https://spinlab.wpi.edu/courses/ece4703_2010/spru509h.pdf (accessed on 20 June 2024).
20. Texas Instruments. Code Composer Studio User’s Guide; Texas Instruments Literature Number SPRU328B. Texas Instruments, Texas, USA. 2000. Available online: https://software-dl.ti.com/ccs/esd/documents/users_guide/index.html (accessed on 20 June 2024).
21. Gupta, M.S.D.; Patchava, V.; Menezes, V. Healthcare Based on IOT Using Raspberry Pi. In Proceedings of the 2015 International Conference on Green Computing and Internet of Things (ICGCIoT), Greater Noida, India, 8–10 October 2015; IEEE: New York, NY, USA, 2015; pp. 796–799.
22. Hossain, N.; Kabir, M.T.; Rahman, T.R.; Hossen, M.S.; Salauddin, F. A Real-Time Surveillance Mini-Rover Based on OpenCV-Python-JAVA Using Raspberry Pi 2. In Proceedings of the 2015 IEEE International Conference on Control System, Computing and Engineering (ICCSCE), Penang, Malaysia, 27–29 November 2015; IEEE: New York, NY, USA, 2015; pp. 476–481.
23. Sandeep, V.; Gopal, K.L.; Naveen, S.; Amudhan, A.; Kumar, L.S. Globally Accessible Machine Automation Using Raspberry Pi based on Internet of Things. In Proceedings of the 2015 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Kochi, India, 10–13 August 2015; IEEE: New York, NY, USA, 2015; pp. 1144–1147.
24. Paul, S.; Antony, A.; Aswathy, B. Android based home automation using raspberry pi. Int. J. Comput. Technol. 2014, 1, 143–147.
25. Upton, E.; Halfacree, G. Raspberry Pi User Guide; John Wiley & Sons: Hoboken, NJ, USA, 2016.
26. Knapp, C.; Carter, G. The generalized correlation method for estimation of time delay. IEEE Trans. Acoust. Speech Signal Process. 1976, 24, 320–327.
27. Schmidt, R. Multiple emitter location and signal parameter estimation. IEEE Trans. Antennas Propag. 1986, 34, 276–280.
Kwon, B.; Kim, G.; Park, Y. Sound Source Localization Methods with Considering of Microphone Placement in Robot Platform. In Proceedings of the RO-MAN 2007-The 16th IEEE International Symposium on Robot and Human Interactive Communication, Jeju, Republic of Korea, 26–29 August 2007; IEEE: New York, NY, USA, 2007; pp. 127–130. [Google Scholar]
Lv, X.; Zhang, M. Sound Source Localization Based on Robot Hearing and Vision. In Proceedings of the 2008 International Conference on Computer Science and Information Technology, Singapore, 29 August–2 September 2008; IEEE: New York, NY, USA, 2008; pp. 942–946. [Google Scholar]
Carter, G.C.; Nuttall, A.H.; Cable, P.G. The smoothed coherence transform. Proc. IEEE 1973, 61, 1497–1498. [Google Scholar] [CrossRef]
Hyvärinen, A.; Oja, E. Independent component analysis: Algorithms and applications. Neural Netw. 2000, 13, 411–430. [Google Scholar] [CrossRef] [PubMed]
Roberts, S.; Everson, R. Independent Component Analysis: Principles and Practice; Cambridge University Press: Cambridge, UK, 2001. [Google Scholar]
Wang, J.-C.; Wang, C.-Y.; Tai, T.-C.; Shih, M.; Huang, S.-C.; Chen, Y.-C.; Lin, Y.-Y.; Lian, L.-X. VLSI design for convolutive blind source separation. IEEE Trans. Circ. Syst. II Express Briefs 2015, 63, 196–200. [Google Scholar] [CrossRef]
Belouchrani, A.; Amin, M.G. Blind source separation based on time-frequency signal representations. IEEE Trans. Signal Process. 1998, 46, 2888–2897. [Google Scholar] [CrossRef] [PubMed]
Winter, S.; Kellermann, W.; Sawada, H.; Makino, S. MAP-based underdetermined blind source separation of convolutive mixtures by hierarchical clustering and-norm minimization. EURASIP J. Adv. Signal Process. 2006, 2007, 24717. [Google Scholar] [CrossRef]
Bofill, P. Underdetermined blind separation of delayed sound sources in the frequency domain. Neurocomputing 2003, 55, 627–641. [Google Scholar] [CrossRef]
Ying-Chuan, C. VLSI Architecture Design for Blind Source Separation Based on Infomax and Time-Frequency Masking. Master’s Thesis, National Central University, Taoyuan, Taiwan, 2012. [Google Scholar]
Kinnunen, T.; Rajan, P. A Practical, Self-Adaptive Voice Activity Detector for Speaker Verification with Noisy Telephone and Microphone Data. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; IEEE: New York, NY, USA, 2013; pp. 7229–7233. [Google Scholar]
Boll, S. Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust. Speech Signal Process. 1979, 27, 113–120. [Google Scholar] [CrossRef]
Kim, W.; Kang, S.; Ko, H. Spectral subtraction based on phonetic dependency and masking effects. IEE Proc.-Vis. Image Signal Process. 2000, 147, 423–427. [Google Scholar] [CrossRef]
Wang, J.-C.; Lee, Y.-S.; Lin, C.-H.; Wang, S.-F.; Shih, C.-H.; Wu, C.-H. Compressive sensing-based speech enhancement. IEEE/ACM Trans. Audio Speech Lang. Process. 2016, 24, 2122–2131. [Google Scholar] [CrossRef]
Tan, H.M.; Liang, K.-W.; Wang, J.-C. Discriminative Vector Learning with Application to Single Channel Speech Separation. In Proceedings of the ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; IEEE: New York, NY, USA, 2023; pp. 1–5. [Google Scholar]
Tan, H.M.; Vu, D.-Q.; Wang, J.-C. Selinet: A Lightweight Model for Single Channel Speech Separation. In Proceedings of the ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; IEEE: New York, NY, USA, 2023; pp. 1–5. [Google Scholar]
Varma, K.; Ikuma, T.; Beex, A.A. Robust TDE-Based DOA Estimation for Compact Audio Arrays. In Proceedings of the Sensor Array and Multichannel Signal Processing Workshop Proceedings, Rosslyn, VA, USA, 6 August 2002; IEEE: New York, NY, USA, 2002; pp. 214–218. [Google Scholar]
Cirrus Logic. Cirrus Logic Audio Card User Documentation; Cirrus Logic: Austin, TX, USA, 2015. [Google Scholar]
Cirrus Logic. Cirrus Logic Audio Card for B+ and A+ Onwards Schematics; Cirrus Logic: Austin, TX, USA, 2014. [Google Scholar]

Figure 1. Architecture diagram of the embedded system for direction detection.

Figure 2. Photograph of the TI TMS320C6713 DSK.

Figure 3. Architecture diagram of the embedded system for hybrid audio source separation.

Figure 4. Flowchart of the hybrid audio source separation algorithm.

Figure 5. Photograph of the Raspberry Pi 2.

Figure 6. Photograph of the Cirrus Logic Audio Card.

Figure 7. Connection between Raspberry Pi 2 and Cirrus Logic Audio Card.

Figure 8. Setup for azimuth detection experiment.

Figure 9. Microphone CM503N.

Figure 10. Experimental scenario of direction detection.

Figure 11. Experimental Environment 1 for mixed sound source separation.

Figure 12. Experimental Environment 2 for mixed sound source separation.

Figure 13. Raspberry Pi 2, Cirrus Logic Audio Card, and 7-inch touch screen.

Figure 14. Mixed signal for left and right channels (1 m).

Figure 15. Separated signal after mixed sound source separation (1 m).

Figure 16. Mixed signal for left and right channels (2 m).

Figure 17. Separated signal after mixed sound source separation (2 m).

Figure 18. Speech recognition accuracy.

Table 1. Orientation detection under different environment settings.

Parameters	Value
angle	5°, 15°, 25°, 35°, 45°, 55°, 65°, 75°, 85°, 95°, 105°, 115°, 125°, 135°, 145°, 155°, 165°, 175°
microphone distance	0.22 m
distance between source and microphone center	2 m
sampling frequency	16 kHz
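
The parameters in Table 1 bound the time delays that a two-microphone array can observe. The following is a minimal sketch, not the article's implementation: it assumes a far-field source, a speed of sound of 343 m/s (not stated in the table), and the common relation θ = arccos(c·τ/d) with the angle measured from the array axis. The names `SPEED_OF_SOUND`, `max_lag_samples`, and `doa_from_lag` are illustrative.

```python
import math

SPEED_OF_SOUND = 343.0   # m/s, assumed value (not given in Table 1)
MIC_DISTANCE = 0.22      # m, from Table 1
SAMPLE_RATE = 16_000     # Hz, from Table 1


def max_lag_samples(d=MIC_DISTANCE, fs=SAMPLE_RATE, c=SPEED_OF_SOUND):
    """Largest whole-sample delay physically possible between the two microphones."""
    return math.floor(d * fs / c)


def doa_from_lag(lag, d=MIC_DISTANCE, fs=SAMPLE_RATE, c=SPEED_OF_SOUND):
    """Far-field arrival angle in degrees (0..180) for a delay of `lag` samples."""
    cos_theta = c * lag / (fs * d)
    # Clamp so that a noisy delay estimate cannot leave the domain of acos.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))


if __name__ == "__main__":
    limit = max_lag_samples()
    for lag in range(-limit, limit + 1):
        print(f"lag {lag:+3d} samples -> {doa_from_lag(lag):6.2f} deg")
```

Under these assumptions only ten whole-sample lags fit on either side of broadside, and because the arccosine is steep near its endpoints, angles close to the array axis are resolved more coarsely than angles near 90°. Sub-sample interpolation of the delay estimate is one common way to refine the result.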

Table 2. Test results for direction detection.

Parameters	Value
actual angle	5°	15°	25°	35°	45°	55°	65°	75°	85°
measured angle	8.53°	8.53°	19.13°	28.54°	41.77°	52.17°	64.12°	75.60°	89.68°
actual angle	95°	105°	115°	125°	135°	145°	155°	165°	175°
measured angle	100.50°	107.69°	118.72°	131.11°	141.17°	152.81°	165.21°	168.54°	180.00°
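
To make the results in Table 2 easier to compare, the per-angle error can be summarized. The sketch below only does arithmetic on the tabulated values; the helper name `summarize_errors` is illustrative and is not part of the article.

```python
# Values copied from Table 2 (degrees).
ACTUAL = [5, 15, 25, 35, 45, 55, 65, 75, 85,
          95, 105, 115, 125, 135, 145, 155, 165, 175]
MEASURED = [8.53, 8.53, 19.13, 28.54, 41.77, 52.17, 64.12, 75.60, 89.68,
            100.50, 107.69, 118.72, 131.11, 141.17, 152.81, 165.21, 168.54, 180.00]


def summarize_errors(actual, measured):
    """Return (mean absolute error, worst absolute error, angle of the worst error)."""
    errors = [abs(m - a) for a, m in zip(actual, measured)]
    mean_abs_error = sum(errors) / len(errors)
    worst_error, worst_angle = max(zip(errors, actual))
    return mean_abs_error, worst_error, worst_angle


if __name__ == "__main__":
    mae, worst, angle = summarize_errors(ACTUAL, MEASURED)
    print(f"mean absolute error: {mae:.2f} deg")
    print(f"largest error: {worst:.2f} deg at {angle} deg")
```

For the tabulated values, the mean absolute error is roughly 4.7°, and the largest deviation occurs at an actual angle of 155°.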

Table 3. Details for mixed sound source separation in Environment 1.

Parameters	Value
angle	30°, 60°, 90°
microphone distance	58 mm
distance between source and microphone center	1 m, 2 m
sampling frequency	8 kHz

Table 4. Setup details for mixed sound source separation in Environment 2.

Parameters	Value
speaker angle	0°, 45°, 90°, 135°, 180°
microphone distance	58 mm
distance between speaker and microphone center	0.2 m
distance between interference source and microphone center	1 m
sampling frequency	8 kHz

Table 5. SIR for sound source at 1 m.

Parameters	Value							Average
sound source angle	30°, 30°	30°, 60°	30°, 90°	60°, 30°	60°, 60°	60°, 90°	90°, 90°
SIR	17.35	17.13	15.64	17.87	17.21	15.76	16.04	16.72

Table 6. SIR for sound source at 2 m.

Parameters	Value						Average
sound source angle	30°, 30°	30°, 60°	60°, 30°	60°, 60°	60°, 90°	90°, 90°
SIR	16.68	16.01	18.76	14.74	15.61	15.74	15.76
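
Tables 5 and 6 report signal-to-interference ratios. The evaluation procedure itself is not restated here, so the following is only a minimal sketch under stated assumptions: it uses the common energy-ratio definition SIR = 10·log10(‖s_target‖² / ‖e_interf‖²) in decibels and assumes the separated output has already been decomposed into a target component and a residual interference component. The function name `sir_db` is illustrative.

```python
import math


def sir_db(target, interference):
    """Energy ratio of target to residual interference, in dB (assumed definition)."""
    target_energy = sum(x * x for x in target)
    interference_energy = sum(x * x for x in interference)
    if interference_energy == 0.0:
        return math.inf      # no residual interference at all
    if target_energy == 0.0:
        return -math.inf     # no target energy left in the output
    return 10.0 * math.log10(target_energy / interference_energy)


if __name__ == "__main__":
    # Toy example: interference amplitude is one tenth of the target amplitude.
    target = [1.0, -1.0, 1.0, -1.0]
    interference = [0.1, 0.1, -0.1, -0.1]
    print(f"SIR = {sir_db(target, interference):.2f} dB")
```

With this definition, a larger value means that less of the interfering source remains in the separated output.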

Table 7. Speech recognition accuracy for mixed sound source separation.

Signal	Speech Recognition Accuracy
mixed signal	90%
separated signal	83%
mixed signal and separated signal	95%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Wang, J.-H.; Le, P.T.; Bee, W.-S.; Putri, W.R.; Su, M.-H.; Li, K.-C.; Chen, S.-L.; He, J.-L.; Pham, T.; Li, Y.-H.; et al. Implementation of Sound Direction Detection and Mixed Source Separation in Embedded Systems. Sensors 2024, 24, 4351. https://doi.org/10.3390/s24134351

