CITISEN: A Deep Learning-Based Speech Signal-Processing Mobile Application

Chen, Yu-Wen; Hung, Kuo-Hsuan; Li, You-Jin; Kang, Alexander Chao-Fu; Lai, Ya-Hsin; Liu, Kai-Chun; Fu, Szu-Wei; Wang, Syu-Siang; Tsao, Yu

doi:10.1109/ACCESS.2022.3153469

Electrical Engineering and Systems Science > Audio and Speech Processing

arXiv:2008.09264 (eess)

[Submitted on 21 Aug 2020 (v1), last revised 25 Apr 2022 (this version, v5)]

Title: CITISEN: A Deep Learning-Based Speech Signal-Processing Mobile Application

Authors: Yu-Wen Chen, Kuo-Hsuan Hung, You-Jin Li, Alexander Chao-Fu Kang, Ya-Hsin Lai, Kai-Chun Liu, Szu-Wei Fu, Syu-Siang Wang, Yu Tsao


Abstract: This study presents CITISEN, a deep learning-based speech signal-processing mobile application. CITISEN provides three functions: speech enhancement (SE), model adaptation (MA), and background noise conversion (BNC). Together, these allow CITISEN to serve as a platform for using and evaluating SE models and for flexibly extending those models to various noise environments and users. For SE, a pretrained SE model downloaded from the cloud server is used to reduce noise components in real-time or saved recordings provided by users. When unseen noise or speaker conditions are encountered, the MA function is applied to improve CITISEN's performance: a few audio samples recorded in the noisy environment are uploaded and used to adapt the pretrained SE model on the server. Finally, for BNC, CITISEN first removes the background noise with an SE model and then mixes the processed speech with new background noise. The novel BNC function can be used to evaluate SE performance under specific conditions, conceal the user's actual surroundings, and provide entertainment. The experimental results confirmed the effectiveness of the SE, MA, and BNC functions. Compared with the noisy speech signals, the enhanced speech signals achieved improvements of about 6% and 33% in short-time objective intelligibility (STOI) and perceptual evaluation of speech quality (PESQ), respectively. With MA, STOI and PESQ were further improved by approximately 6% and 11%, respectively. Finally, the BNC experiment results indicated that speech signals converted from noisy and silent backgrounds have close scene identification accuracy and similar embeddings in an acoustic scene classification model. Therefore, the proposed BNC can effectively convert the background noise of a speech signal and can serve as a data augmentation method when clean speech signals are unavailable.
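The two-stage BNC flow described in the abstract (enhance first, then mix in a new background) can be illustrated with a minimal sketch. The abstract does not say how the mixing level is chosen, so scaling the new noise to a target signal-to-noise ratio is an assumption made here for illustration. The names `mean_power`, `mix_at_snr`, `convert_background`, and the `enhance` callable are hypothetical placeholders, not part of CITISEN; `enhance` stands in for whatever SE model is used.

```python
import math
from typing import Callable, List, Sequence


def mean_power(signal: Sequence[float]) -> float:
    """Mean squared amplitude of a signal (0.0 for an empty signal)."""
    return sum(x * x for x in signal) / len(signal) if signal else 0.0


def mix_at_snr(speech: Sequence[float], noise: Sequence[float],
               snr_db: float) -> List[float]:
    """Add `noise` to `speech`, scaled so the mix has the requested SNR in dB.

    Assumption: SNR-controlled mixing is one common choice; the source
    does not specify the mixing rule.
    """
    if not speech:
        return []
    if not noise:
        raise ValueError("noise must contain at least one sample")
    # Loop the noise so that it covers the whole speech segment.
    tiled = [noise[i % len(noise)] for i in range(len(speech))]
    p_speech = mean_power(speech)
    p_noise = mean_power(tiled)
    if p_speech == 0.0 or p_noise == 0.0:
        # SNR is undefined for an all-zero signal; return speech unchanged.
        return list(speech)
    # Choose the gain so that p_speech / (gain^2 * p_noise) = 10^(snr_db / 10).
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, tiled)]


def convert_background(noisy: Sequence[float],
                       enhance: Callable[[Sequence[float]], Sequence[float]],
                       new_noise: Sequence[float],
                       snr_db: float = 10.0) -> List[float]:
    """Stage 1: suppress the original background with `enhance`.
    Stage 2: mix the enhanced speech with the new background noise."""
    enhanced = enhance(noisy)
    return mix_at_snr(enhanced, new_noise, snr_db)
```

In this sketch, passing an identity function as `enhance` reduces the pipeline to plain noise mixing, which makes the mixing stage easy to check in isolation before a real SE model is plugged in.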

Subjects:	Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
Cite as:	arXiv:2008.09264 [eess.AS]
	(or arXiv:2008.09264v5 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2008.09264
Related DOI:	https://doi.org/10.1109/ACCESS.2022.3153469

Submission history

From: SyuSiang Wang
[v1] Fri, 21 Aug 2020 02:04:12 UTC (2,605 KB)
[v2] Sat, 14 Aug 2021 13:29:12 UTC (12,899 KB)
[v3] Thu, 26 Aug 2021 01:24:58 UTC (16,503 KB)
[v4] Sun, 20 Feb 2022 13:03:39 UTC (10,116 KB)
[v5] Mon, 25 Apr 2022 14:23:41 UTC (10,377 KB)
