A Multimodal Approach to Improve Performance Evaluation of Call Center Agent
Abstract
1. Introduction
2. Related Work
3. The Proposed Framework
3.1. CNNs and BiLSTMs
3.2. Attention Layer
3.3. Max Weights Similarity (MWS)
3.4. Multimodal Approach
4. The Experiment
4.1. The Data
4.2. Speech Processing
4.3. Text Processing
4.4. Multimodal Approach (Speech + Text)
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Acknowledgments
Conflicts of Interest
References
- Breuer, K.; Nieken, P.; Sliwka, D. Social ties and subjective performance evaluations: An empirical investigation. Rev. Manag. Sci. 2013, 7, 141–157.
- Dhanpat, N.; Modau, F.D.; Lugisani, P.; Mabojane, R.; Phiri, M. Exploring employee retention and intention to leave within a call center. SA J. Hum. Resour. Manag. 2018, 16, 1–13.
- Frederiksen, A.; Lange, F.; Kriechel, B. Subjective performance evaluations and employee careers. J. Econ. Behav. Organ. 2017, 134, 408–429.
- Gonzalez-Benito, O.; Gonzalez-Benito, J. Cultural vs. operational market orientation and objective vs. subjective performance: Perspective of production and operations. Ind. Mark. Manag. 2005, 34, 797–829.
- Echchakoui, S.; Baakil, D. Emotional Exhaustion in Offshore Call Centers: A Comparative Study. J. Glob. Mark. 2019, 32, 17–36.
- Ahmed, A.; Hifny, Y.; Toral, S.; Shaalan, K. A Call Center Agent Productivity Modeling Using Discriminative Approaches. In Intelligent Natural Language Processing: Trends and Applications; Springer: Berlin/Heidelberg, Germany, 2018; pp. 501–520.
- Ahmed, A.; Toral, S.; Shaalan, K. Agent productivity measurement in call center using machine learning. In International Conference on Advanced Intelligent Systems and Informatics; Springer: Berlin/Heidelberg, Germany, 2016; pp. 160–169.
- Ahmed, A.; Hifny, Y.; Shaalan, K.; Toral, S. End-to-End Lexicon Free Arabic Speech Recognition Using Recurrent Neural Networks. Comput. Linguist. Speech Image Process. Arab. Lang. 2018, 4, 231.
- Dave, N. Feature extraction methods LPC, PLP and MFCC in speech recognition. Int. J. Adv. Res. Eng. Technol. 2013, 1, 1–4.
- Bae, S.M.; Ha, S.H.; Park, S.C. A web-based system for analyzing the voices of call center customers in the service industry. Expert Syst. Appl. 2005, 28, 29–41.
- Karakus, B.; Aydin, G. Call center performance evaluation using big data analytics. In Proceedings of the 2016 International Symposium on Networks, Computers and Communications (ISNCC), Hammamet, Tunisia, 11–13 May 2016; pp. 1–6.
- Perera, K.N.N.; Priyadarshana, Y.; Gunathunga, K.; Ranathunga, L.; Karunarathne, P.; Thanthriwatta, T. Automatic Evaluation Software for Contact Centre Agents’ Voice Handling Performance. Int. J. Sci. Res. Publ. 2019, 5, 1–8.
- Sudarsan, V.; Kumar, G. Voice call analytics using natural language processing. Int. J. Stat. Appl. Math. 2019, 4, 133–136.
- Ahmed, A.; Hifny, Y.; Shaalan, K.; Toral, S. Lexicon free Arabic speech recognition recipe. In International Conference on Advanced Intelligent Systems and Informatics; Springer: Berlin/Heidelberg, Germany, 2016; pp. 147–159.
- Neumann, M.; Vu, N.T. Attentive convolutional neural network based speech emotion recognition: A study on the impact of input features, signal length, and acted speech. arXiv 2017, arXiv:1706.00612.
- Hifny, Y.; Ali, A. Efficient Arabic Emotion Recognition Using Deep Neural Networks. In Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 6710–6714.
- Cho, J.; Pappagari, R.; Kulkarni, P.; Villalba, J.; Carmiel, Y.; Dehak, N. Deep neural networks for emotion recognition combining audio and transcripts. arXiv 2019, arXiv:1911.00432.
- Li, P.; Jiang, Z.; Yin, S.; Song, D.; Ouyang, P.; Liu, L.; Wei, S. PAGAN: A Phase-Adapted Generative Adversarial Networks for Speech Enhancement. In Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–9 May 2020; pp. 6234–6238.
- Cleveland, B. Call Center Management on Fast Forward: Succeeding in the New Era of Customer Relationships; ICMI Press: Colorado Springs, CO, USA, 2012.
- Hayes, A.F.; Krippendorff, K. Answering the call for a standard reliability measure for coding data. Commun. Methods Meas. 2007, 1, 77–89.
- Trigeorgis, G.; Ringeval, F.; Brueckner, R.; Marchi, E.; Nicolaou, M.A.; Schuller, B.; Zafeiriou, S. Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 5200–5204.
- Li, L.; Xu, W.; Yu, H. Character-level neural network model based on Nadam optimization and its application in clinical concept extraction. Neurocomputing 2020, 414, 182–190.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017.
- Zhang, M.; Yang, Y.; Ji, Y.; Xie, N.; Shen, F. Recurrent attention network using spatial-temporal relations for action recognition. Signal Process. 2018, 145, 137–145.
- Ahmed, A.; Toral, S.; Shaalan, K.; Hifny, Y. Agent Productivity Modeling in a Call Center Domain Using Attentive Convolutional Neural Networks. Sensors 2020, 20, 5489.
- Eyben, F.; Wöllmer, M.; Schuller, B. openSMILE: The Munich versatile and fast open-source audio feature extractor. In Proceedings of the 18th ACM International Conference on Multimedia, Firenze, Italy, 25–29 October 2010; pp. 1459–1462.
- Palaz, D.; Collobert, R. Analysis of CNN-Based Speech Recognition System Using Raw Speech as Input; Technical Report; Idiap: Martigny, Switzerland, 2015.
- Norouzian, A.; Mazoure, B.; Connolly, D.; Willett, D. Exploring attention mechanism for acoustic-based classification of speech utterances into system-directed and non-system-directed. In Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 7310–7314.
- Gehring, J.; Auli, M.; Grangier, D.; Yarats, D.; Dauphin, Y.N. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning (ICML), Sydney, Australia, 6–11 August 2017; pp. 1243–1252.
- Bridle, J.S. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neurocomputing; Springer: Berlin/Heidelberg, Germany, 1990; pp. 227–236.
- Ngiam, J.; Khosla, A.; Kim, M.; Nam, J.; Lee, H.; Ng, A.Y. Multimodal deep learning. In Proceedings of the 28th International Conference on Machine Learning (ICML), Bellevue, WA, USA, 28 June–2 July 2011.
- Ramachandram, D.; Taylor, G.W. Deep multimodal learning: A survey on recent advances and trends. IEEE Signal Process. Mag. 2017, 34, 96–108.
- Kahou, S.E.; Bouthillier, X.; Lamblin, P.; Gulcehre, C.; Michalski, V.; Konda, K.; Jean, S.; Froumenty, P.; Dauphin, Y.; Boulanger-Lewandowski, N.; et al. EmoNets: Multimodal deep learning approaches for emotion recognition in video. J. Multimodal User Interfaces 2016, 10, 99–111.
- Baltrušaitis, T.; Ahuja, C.; Morency, L.P. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 423–443.
- Busso, C.; Bulut, M.; Lee, C.C.; Kazemzadeh, A.; Mower, E.; Kim, S.; Chang, J.N.; Lee, S.; Narayanan, S.S. IEMOCAP: Interactive emotional dyadic motion capture database. Lang. Resour. Eval. 2008, 42, 335–359.
- Broux, P.-A.; Desnous, F.; Larcher, A.; Petitrenaud, S.; Carrive, J.; Meignier, S. S4D: Speaker Diarization Toolkit in Python. In Proceedings of Interspeech 2018, Hyderabad, India, 2–6 September 2018.
- Schuller, B.; Steidl, S.; Batliner, A.; Hirschberg, J.; Burgoon, J.K.; Baird, A.; Elkins, A.; Zhang, Y.; Coutinho, E.; Evanini, K.; et al. The Interspeech 2016 computational paralinguistics challenge: Deception, sincerity & native language. In Proceedings of the 17th Annual Conference of the International Speech Communication Association (Interspeech 2016), San Francisco, CA, USA, 8–12 September 2016; pp. 2001–2005.
- Pennington, J.; Socher, R.; Manning, C.D. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543.
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805.
Acoustic Low-Level Descriptors (LLDs)

| LLD Group | Features |
|---|---|
| 54 spectral LLD | RASTA-style auditory spectrum; MFCC 1–14; spectral energy; spectral roll-off point; entropy, spectral flux, skewness, variance, kurtosis, slope, harmonicity, psychoacoustic sharpness; spectral centroid (pcm_fftMag_spectralCentroid_sma) |
| 7 voicing-related LLD | Probability of voicing; F0 by SHS with Viterbi smoothing; jitter; shimmer; logarithmic HNR |
| 4 energy-related LLD | Sum of auditory spectrum; sum of RASTA-style filtered auditory spectrum; RMS energy; zero-crossing rate |
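The 54 + 7 + 4 descriptors above form the 65-dimensional frame-level LLD set of the ComParE 2016 feature family shipped with openSMILE (Eyben et al.; Schuller et al.). As a minimal sketch of how such features can be extracted, assuming the audEERING `opensmile` Python package and a placeholder file name `agent_call.wav`:

```python
# Sketch: frame-level ComParE 2016 LLD extraction with openSMILE.
# Assumes the audEERING `opensmile` package (pip install opensmile);
# "agent_call.wav" is a placeholder file name, not the study data.
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,
    feature_level=opensmile.FeatureLevel.LowLevelDescriptors,
)

lld = smile.process_file("agent_call.wav")  # pandas DataFrame, one row per frame
print(lld.shape)  # (num_frames, 65): the 54 + 7 + 4 LLDs listed above
```

Each row of the returned DataFrame is one analysis frame; utterance-level statistics (functionals) can then be computed over these frames if a fixed-length representation is needed.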
Speech Accuracy % per Model Type

| Classification Method | Features | Accuracy |
|---|---|---|
| CNNs | MFCC | 82.70% |
| CNNs-Attention | MFCC | 84.27% |
| CNNs-BiLSTMs | MFCC | 83.55% |
| CNNs-BiLSTMs-Attention | MFCC | 83.54% |
| CNNs | LLD | 90.10% |
| CNNs-Attention | LLD | 92.48% |
| CNNs-Attention + MWS | LLD | 92.88% |
| CNNs-BiLSTMs | LLD | 92.67% |
| CNNs-BiLSTMs-Attention | LLD | 92.68% |
| CNNs-BiLSTMs-Attention + MWS | LLD | 92.25% |
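The CNNs-Attention rows correspond to convolutional layers followed by attention pooling over time. The Keras sketch below illustrates that general pattern only; the layer counts, filter sizes, and the number of classes are illustrative assumptions, not the exact topology evaluated above:

```python
# Illustrative CNN + attention-pooling classifier over LLD frame sequences.
# Dimensions and depths are assumptions, not the architecture from the table.
import tensorflow as tf
from tensorflow.keras import layers

NUM_FRAMES, NUM_LLD, NUM_CLASSES = 500, 65, 2  # placeholder dimensions

inputs = tf.keras.Input(shape=(NUM_FRAMES, NUM_LLD))
x = layers.Conv1D(128, 5, padding="same", activation="relu")(inputs)
x = layers.Conv1D(128, 5, padding="same", activation="relu")(x)

# Additive attention pooling: score each frame, softmax over the time
# axis, then take the attention-weighted sum of frame features.
scores = layers.Dense(1)(x)              # (batch, frames, 1)
alphas = layers.Softmax(axis=1)(scores)  # attention weights over time
context = layers.Lambda(
    lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([alphas, x])

outputs = layers.Dense(NUM_CLASSES, activation="softmax")(context)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

Inserting a `layers.Bidirectional(layers.LSTM(..., return_sequences=True))` block between the convolutions and the attention pooling gives the CNNs-BiLSTMs-Attention family of variants.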
Text Accuracy % per Model Type

| Classification Method | Features | Accuracy |
|---|---|---|
| Naive Bayes | Bag of Words | 67.30% |
| Logistic Regression | Bag of Words | 80.76% |
| Linear Support Vector Machine (LSVM) | Bag of Words | 82.69% |
| CNNs | Word Embedding | 90.73% |
| CNNs-Attention | Word Embedding | 90.98% |
| CNNs-Attention + MWS | Word Embedding | 91.40% |
| CNNs-BiLSTMs | Word Embedding | 89.87% |
| CNNs-BiLSTMs-Attention | Word Embedding | 91.19% |
| CNNs-BiLSTMs-Attention + MWS | Word Embedding | 91.12% |
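The classical baselines in the first three rows are straightforward to reproduce with scikit-learn. A minimal sketch of the LSVM row, where `texts` and `labels` are hypothetical stand-ins for the call transcripts and their productivity labels:

```python
# Bag-of-words + linear SVM baseline, as in the table's LSVM row.
# `texts` and `labels` are hypothetical placeholders, not the study data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [
    "thank you for calling how may i help you",
    "please hold the line",
    "your issue has been resolved have a nice day",
    "i cannot help you call back later",
]
labels = [1, 0, 1, 0]  # 1 = productive handling, 0 = not (illustrative)

clf = make_pipeline(CountVectorizer(), LinearSVC())
clf.fit(texts, labels)
print(clf.predict(["thanks for holding your issue is resolved"]))
```

Substituting `MultinomialNB` or `LogisticRegression` for `LinearSVC` gives the other two baselines; the word-embedding models instead replace the bag-of-words front end with dense vectors such as GloVe (Pennington et al.).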
Multimodal Accuracy % per Model Type

| Text Model | Speech Model | Accuracy |
|---|---|---|
| CNNs | CNNs | 90.44% |
| CNNs-Attention | CNNs | 90.10% |
| CNNs | CNNs-Attention | 92.63% |
| CNNs | CNNs-Attention + MWS | 92.90% |
| CNNs-Attention | CNNs-Attention | 91.76% |
| CNNs-Attention + MWS | CNNs-Attention + MWS | 93.10% |
| CNNs | CNNs-BiLSTMs-Attention | 91.80% |
| CNNs | CNNs-BiLSTMs-Attention + MWS | 91.90% |
| CNNs-Attention | CNNs-BiLSTMs | 90.36% |
| CNNs-Attention + MWS | CNNs-BiLSTMs | 91.10% |
| CNNs-Attention | CNNs-BiLSTMs-Attention | 91.00% |
| CNNs-Attention + MWS | CNNs-BiLSTMs-Attention + MWS | 91.10% |
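Each row above combines a text classifier with a speech classifier. As a sketch of one common combination strategy, score-level (late) fusion that averages per-class posteriors, where `p_text` and `p_speech` are hypothetical posterior arrays and the equal weighting is illustrative; Section 3.4 defines the fusion actually used:

```python
# Sketch: score-level (late) fusion of the text and speech classifiers.
# Assumes each model outputs class posteriors; the weight is illustrative.
import numpy as np

def late_fusion(p_text: np.ndarray, p_speech: np.ndarray, w: float = 0.5):
    """Weighted average of per-class posteriors from the two modalities."""
    fused = w * p_text + (1.0 - w) * p_speech
    return np.argmax(fused, axis=-1)

# Hypothetical posteriors for a batch of two calls, two classes.
p_text = np.array([[0.30, 0.70], [0.80, 0.20]])
p_speech = np.array([[0.40, 0.60], [0.55, 0.45]])
print(late_fusion(p_text, p_speech))  # fused class decision per call
```

In practice the weight `w` would be tuned on a validation set to favor the stronger modality.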
MWS vs. Softmax: Accuracy Improvement (%)

| Method | Speech Model | Text Model | Multimodal |
|---|---|---|---|
| Softmax | 92.68% | 90.98% | 91.76% |
| MWS | 92.88% | 91.40% | 93.10% |
| Delta | +0.20% | +0.42% | +1.34% |
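Softmax (Bridle) turns the output logits into class posteriors, and the decision is their argmax. MWS (Section 3.3) replaces that decision layer. The sketch below contrasts the softmax argmax with one plausible similarity-based reading, maximum cosine similarity between the final hidden representation and each class weight vector; this is an illustrative assumption, not a verbatim reimplementation of MWS, and `h`, `W`, and `b` are hypothetical model tensors:

```python
# Rough sketch contrasting a softmax decision with a similarity-based one.
# The cosine rule below is an *assumed* reading of MWS, not the paper's
# verbatim definition; h, W, b are hypothetical model tensors.
import numpy as np

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 128))  # final hidden representations, batch of 4
W = rng.normal(size=(128, 2))  # output-layer weights, 2 classes
b = np.zeros(2)

# Softmax decision: the argmax of the posteriors equals the argmax of
# the raw logits, so the exponentiation can be skipped here.
logits = h @ W + b
softmax_pred = np.argmax(logits, axis=-1)

# Similarity decision: argmax of cosine similarity between each hidden
# vector and each class weight column, discounting vector magnitudes.
h_norm = h / np.linalg.norm(h, axis=-1, keepdims=True)
W_norm = W / np.linalg.norm(W, axis=0, keepdims=True)
mws_pred = np.argmax(h_norm @ W_norm, axis=-1)

print(softmax_pred, mws_pred)
```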
Share and Cite
Ahmed, A.; Shaalan, K.; Toral, S.; Hifny, Y. A Multimodal Approach to Improve Performance Evaluation of Call Center Agent. Sensors 2021, 21, 2720. https://doi.org/10.3390/s21082720