Speaker diarisation

Speaker diarisation (or diarization) is the process of partitioning an audio stream containing human speech into homogeneous segments according to the identity of each speaker.^[1] It can enhance the readability of an automatic speech transcription by structuring the audio stream into speaker turns and, when used together with speaker recognition systems, by providing the speaker’s true identity.^[2] It is used to answer the question "who spoke when?"^[3] Speaker diarisation is a combination of speaker segmentation and speaker clustering. The first aims at finding speaker change points in an audio stream. The second aims at grouping together speech segments on the basis of speaker characteristics.

With the increasing number of broadcasts, meeting recordings and voice mail collected every year, speaker diarisation has received much attention by the speech community, as is manifested by the specific evaluations devoted to it under the auspices of the National Institute of Standards and Technology for telephone speech, broadcast news and meetings.^[4] A leading list tracker of speaker diarization research can be found at Quan Wang's github repo.^[5]

Main types of diarisation systems

In speaker diarisation, one of the most popular methods is to use a Gaussian mixture model to model each of the speakers, and assign the corresponding frames for each speaker with the help of a Hidden Markov Model. There are two main kinds of clustering strategies. The first one is by far the most popular and is called Bottom-Up. The algorithm starts in splitting the full audio content in a succession of clusters and progressively tries to merge the redundant clusters in order to reach a situation where each cluster corresponds to a real speaker. The second clustering strategy is called top-down and starts with one single cluster for all the audio data and tries to split it iteratively until reaching a number of clusters equal to the number of speakers. A 2010 review can be found at [1].

More recently, speaker diarisation is performed via neural networks leveraging large-scale GPU computing and methodological developments in deep learning.^[6]

Open source speaker diarisation software

There are some open source initiatives for speaker diarisation (in alphabetical order):

ALIZE Speaker Diarization (last repository update: July 2016; last release: February 2013, version: 3.0): ALIZE Diarization System, developed at the University Of Avignon, a release 2.0 is available [2].
Audioseg (last repository update: May 2014; last release: January 2010, version: 1.2): AudioSeg is a toolkit dedicated to audio segmentation and classification of audio streams. [3].
pyannote.audio (last repository update: August 2022, last release: July 2022, version: 2.0): pyannote.audio is an open-source toolkit written in Python for speaker diarization. [4].
pyAudioAnalysis (last repository update: September 2022): Python Audio Analysis Library: Feature Extraction, Classification, Segmentation and Applications [5]
SHoUT (last update: December 2010; version: 0.3): SHoUT is a software package developed at the University of Twente to aid speech recognition research. SHoUT is a Dutch acronym for Speech Recognition Research at the University of Twente. [6]
LIUM SpkDiarization (last release: September 2013, version: 8.4.1): LIUM_SpkDiarization tool [7].

References

^ Sahidullah, Md; Patino, Jose; Cornell, Samuele; Yin, Ruiking; Sivasankaran, Sunit; Bredin, Herve; Korshunov, Pavel; Brutti, Alessio; Serizel, Romain; Vincent, Emmanuel; Evans, Nicholas; Marcel, Sebastien; Squartini, Stefano; Barras, Claude (2019-11-06). "The Speed Submission to DIHARD II: Contributions & Lessons Learned". arXiv:1911.02388 [eess.AS].
^ Zhu, Xuan; Barras, Claude; Meignier, Sylvain; Gauvain, Jean-Luc. "Improved speaker diarization using speaker identification". Retrieved 2012-01-25.
^ Kotti, Margarita; Moschou, Vassiliki; Kotropoulos, Constantine. "Speaker Segmentation and Clustering" (PDF). Retrieved 2012-01-25.
^ "Rich Transcription Evaluation Project". NIST. Retrieved 2012-01-25.
^ "Awesome Speaker Diarization". awesome-diarization. Retrieved 2024-09-17.
^ Park, Tae Jin; Kanda, Naoyuki; Dimitriadis, Dimitrios; Han, Kyu J.; Watanabe, Shinji; Narayanan, Shrikanth (2021-11-26). "A Review of Speaker Diarization: Recent Advances with Deep Learning". arXiv:2101.09624 [eess.AS].

Bibliography

Anguera, Xavier (2012). "Speaker diarization: A review of recent research". IEEE Transactions on Audio, Speech, and Language Processing. 20 (2). IEEE/ACM Transactions on Audio, Speech, and Language Processing: 356–370. CiteSeerX 10.1.1.470.6149. doi:10.1109/TASL.2011.2125954. ISSN 1558-7916. S2CID 206602044.
Beigi, Homayoon (2011). Fundamentals of Speaker Recognition. New York: Springer. ISBN 978-0-387-77591-3.

[1] Sahidullah, Md; Patino, Jose; Cornell, Samuele; Yin, Ruiking; Sivasankaran, Sunit; Bredin, Herve; Korshunov, Pavel; Brutti, Alessio; Serizel, Romain; Vincent, Emmanuel; Evans, Nicholas; Marcel, Sebastien; Squartini, Stefano; Barras, Claude (2019-11-06). "The Speed Submission to DIHARD II: Contributions & Lessons Learned". arXiv:1911.02388 [eess.AS].

[2] Zhu, Xuan; Barras, Claude; Meignier, Sylvain; Gauvain, Jean-Luc. "Improved speaker diarization using speaker identification". Retrieved 2012-01-25.

[3] Kotti, Margarita; Moschou, Vassiliki; Kotropoulos, Constantine. "Speaker Segmentation and Clustering" (PDF). Retrieved 2012-01-25.

[4] "Rich Transcription Evaluation Project". NIST. Retrieved 2012-01-25.

[5] "Awesome Speaker Diarization". awesome-diarization. Retrieved 2024-09-17.

[6] Park, Tae Jin; Kanda, Naoyuki; Dimitriadis, Dimitrios; Han, Kyu J.; Watanabe, Shinji; Narayanan, Shrikanth (2021-11-26). "A Review of Speaker Diarization: Recent Advances with Deep Learning". arXiv:2101.09624 [eess.AS].

[1]

[2]

[3]

[4]

[5]

[6]