SongTrans: An unified song transcription and alignment method for lyrics and notes

Wu, Siwei; He, Jinzheng; Yuan, Ruibin; Wei, Haojie; Wei, Xipin; Lin, Chenghua; Xu, Jin; Lin, Junyang

arXiv:2409.14619 (cs)

[Submitted on 22 Sep 2024 (v1), last revised 10 Oct 2024 (this version, v2)]

Abstract: The quantity of processed data is crucial for advancing singing voice synthesis. Tools exist for lyric transcription and for note transcription, but they all require pre-processed input, and that pre-processing (e.g., separating vocals from accompaniment) is relatively time-consuming. Moreover, most of these tools address a single task and struggle to align lyrics with notes (i.e., to identify the notes corresponding to each word in the lyrics). To address these challenges, we first design a pipeline by optimizing existing tools and annotate numerous lyric-note pairs of songs. Then, using the annotated data, we train a unified SongTrans model that directly transcribes lyrics and notes and aligns them at the same time, without requiring the songs to be pre-processed. SongTrans consists of two modules: (1) the autoregressive module predicts the lyrics, along with the duration and note number corresponding to each word; (2) the non-autoregressive module predicts the pitch and duration of the notes. Our experiments show that SongTrans achieves state-of-the-art (SOTA) results in both lyric and note transcription. Furthermore, it is the first model capable of aligning lyrics with notes. The results also show that SongTrans adapts effectively to different types of songs (e.g., songs with accompaniment), demonstrating its versatility for real-world applications.
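
The lyric-note alignment described in the abstract can be illustrated with a small Python sketch. It is not the authors' code or output format. It assumes that "note number" means the count of notes sung on a word, that pitches are MIDI numbers, and that durations are in seconds. The names `Word`, `Note`, and `align_words_to_notes` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    """One transcribed word (the kind of output attributed to the autoregressive module)."""
    text: str
    duration: float   # assumed unit: seconds
    note_count: int   # assumed meaning of "note number": notes sung on this word


@dataclass
class Note:
    """One transcribed note (the kind of output attributed to the non-autoregressive module)."""
    pitch: int        # assumed encoding: MIDI pitch number
    duration: float   # assumed unit: seconds


def align_words_to_notes(
    words: List[Word], notes: List[Note]
) -> List[Tuple[Word, List[Note]]]:
    """Give each word the next `note_count` notes, in order."""
    expected = sum(w.note_count for w in words)
    if expected != len(notes):
        raise ValueError(f"words expect {expected} notes, got {len(notes)}")

    aligned = []
    cursor = 0
    for word in words:
        aligned.append((word, notes[cursor:cursor + word.note_count]))
        cursor += word.note_count
    return aligned


if __name__ == "__main__":
    # Made-up values, for illustration only.
    words = [Word("shine", 0.9, 2), Word("on", 0.4, 1)]
    notes = [Note(64, 0.5), Note(66, 0.4), Note(62, 0.4)]
    for word, word_notes in align_words_to_notes(words, notes):
        print(word.text, [n.pitch for n in word_notes])
```

Under these assumptions, once each word carries a note count and the notes form one ordered sequence, alignment reduces to slicing that sequence word by word.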

Subjects:	Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2409.14619 [cs.SD]
	(or arXiv:2409.14619v2 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2409.14619

Submission history

From: Siwei Wu
[v1] Sun, 22 Sep 2024 23:06:15 UTC (286 KB)
[v2] Thu, 10 Oct 2024 13:39:19 UTC (286 KB)
