Abstract
This study investigates the feasibility of employing artificial intelligence and large language models (LLMs) to customize closed captions/subtitles to match the personal needs of deaf and hard of hearing viewers. Drawing on recorded live TV samples, it compares user ratings of caption quality, speed, and understandability across five experimental conditions: unaltered verbatim captions, slowed-down verbatim captions, moderately and heavily edited captions via ChatGPT, and lightly edited captions by an LLM optimized for TV content by AppTek, LLC. Results across 16 deaf and hard of hearing participants show a significant preference for verbatim captions, both at original speeds and in the slowed-down version, over those edited by ChatGPT. However, a small number of participants also rated AI-edited captions as best. Despite the overall poor showing of AI, the results suggest that LLM-driven customization of captions on a per-user and per-video basis remains an important avenue for future research.
1 Introduction
Many studies have explored closed captions¹ for Deaf and Hard of Hearing (DHH) accessibility. Captions come in many forms, including captions generated by artificial intelligence (AI) and automatic speech recognition (ASR), human-generated captions produced by live captioners via steno keyboards or re-speaking, verbatim captions at natural speaking rates of over 180 words per minute, and captions at reduced speeds of up to 140 words per minute [7]. Captions also can benefit a diverse set of users, not just DHH viewers. For example, one study documented that captions can benefit children and adults watching in a non-native language, as well as children and adults learning to read [5]. Furthermore, the rise of AI and ASR has opened new possibilities for captioning and for making content more widely accessible than in the past.
Despite the technological advances, user dissatisfaction remains high. This was underscored by a recent mixed-methods study [3]. Although it found that DHH users liked verbatim captions better than typical live TV captions, participants still expressed high levels of dissatisfaction even with verbatim caption stimuli. Many felt overwhelmed by the captions. In some cases, they were perceived as too fast or having too much text.
This study explores how dissatisfaction with captions can be alleviated. The primary research question is whether custom-edited captions that adjust to an individual user’s reading speed might fare better than verbatim captions. Historically, such customized captions have been cost-prohibitive, since a human would have had to edit them for multiple levels of reading fluency and visual attention. However, with the advent of generative AI and large language models (LLMs), such as ChatGPT and GPT-4, the potential to summarize text and extract key information has already been demonstrated. Could LLMs be used to custom-tailor the content of captions as well?
Customizing captions via LLMs and AI faces many challenges. A key step is to determine how the content produced by these models fares against verbatim captions. Our experimental design employs two different LLMs to edit captions down to different degrees and compares their usability among DHH participants to that of verbatim captions, with slowed-down verbatim captions as an additional control condition, to assess whether using LLMs for customization is feasible in principle.
2 Related Work
A prior study on caption metrics [3] found no clear correlations between DHH user ratings of TV captions containing errors and several caption quality metrics. That study also questioned the extent to which verbatim captions can serve as a gold standard for the user experience, given the levels of dissatisfaction even with those. Caption customization has been proposed as a new avenue for accessible technologies in a paper on the perspectives of DHH caption viewers [2], as something users desire, especially for adjusting caption position [16] and color [15], as a means of user satisfaction and empowerment in a study on the preferred appearance of captions among DHH users [1], and as a method of user control in a study on user-generated captions on social media [12].
People’s abilities to keep up with caption speed may vary [7]. Some people report being able to follow fast-moving subtitles and find slow subtitles frustrating [14]. However, other studies argue that faster captions or subtitles may negatively affect reading comprehension [10] or overwhelm viewers outside the expected adult age range, such as younger viewers [4]. There are unresolved questions about whether fast or slow captions would generally be preferred by DHH viewers, further suggesting that customization may have a role to play [17].
Combining artificial intelligence with human effort can also improve captioning workflows. For example, Levin et al. [11] tested real-time editing of ASR-generated captions as a solution for the 2014 Paralympic Games in Sochi. Another example is the incorporation of avatars for speaker identification [16], where most evaluation results were positive. However, it should be emphasized that the underlying technology itself must be adequate. For example, a study by Kawas et al. [9] found that the accuracy and reliability of the technologies used for captioning are still important issues that need to be addressed. Another study found that while machines lack the reasoning ability of humans and are not reliable for creating acceptable subtitles on their own, they can support humans in creating higher-quality semi-automated captions [13].
Overall, the accuracy of AI-assisted captions still could be improved. Graham and Choo found that such captions did not meet legal requirements and industry standards [6]. Although AI-assisted subtitling improved accuracy and resulted in quicker production times, it occasionally struggled with speech recognition. This led participants to edit the subtitles, sometimes introducing new errors in the process. Another study examined visualizing transcript uncertainty alongside the captions [8], but this did not improve overall intelligibility. Given the current state of the art, there is a continued need for research into AI-generated and AI-assisted captions for a better user experience.
3 Methods
The study design was within-subjects repeated measures. There were five conditions designed to compare verbatim captions to three AI-edited alternatives. One provided unaltered verbatim captions with unaltered videos. Another served as a control, with video, audio, and verbatim captions slowed down by 25%, to explore alternatives to editing. The remaining three AI-based conditions offered different levels of editing, ranging from 5–10% condensation to 25% and 50%. One of the employed AI engines was the then-GPT-3-based ChatGPT; the other was AppTek’s LLM, which, unlike ChatGPT, has been specifically tuned for commercial TV captions (AppTek is a provider of AI captioning). The full details of the conditions were as follows:
1. Rev: Verbatim captioning by offline human captioners (rev.com) at normal speed.
2. Rev-75: Verbatim captioning by offline human captioners (rev.com) at 75% speed.
3. AppTek: Captions by offline human captioners (rev.com), edited by AppTek AI (reduced by 5% to 10% in content) at normal speed.
4. GPT-25: Captions by human captioners, edited by ChatGPT AI (reduced by 25% in content) at normal speed.
5. GPT-50: Captions by human captioners, edited by ChatGPT AI (reduced by 50% in content) at normal speed.
To generate the AI-edited captions, we used the human-generated verbatim captions as a starting point. The subtitle files (in SRT format) were fed to the LLMs with prompts to reduce the content and preserve the timestamps. However, neither of the LLMs was able to provide good timings for the edited captions. To eliminate timing as a confounder, all AI-generated captions were re-timed by humans to match the originals.
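The chapter does not include the exact prompts or tooling used. The following Python sketch illustrates what such a condensation step could look like, assuming the OpenAI Python client (openai ≥ 1.0) and an SRT file as input; the model name, prompt wording, and file names are placeholders rather than the study’s actual settings.

```python
# A minimal sketch of the caption-condensation step, assuming the OpenAI Python
# client (openai >= 1.0). Model name, prompt wording, and file names are
# illustrative placeholders, not the settings used in the study.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

CONDENSE_PROMPT = (
    "You will receive closed captions in SRT format. Rewrite the caption text "
    "so that the overall word count is reduced by about {pct}%, while preserving "
    "the meaning, the cue numbering, and the original timestamps. Return valid SRT."
)

def condense_srt(srt_text: str, pct: int = 25, model: str = "gpt-4o-mini") -> str:
    """Ask an LLM to condense SRT captions by roughly `pct` percent."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": CONDENSE_PROMPT.format(pct=pct)},
            {"role": "user", "content": srt_text},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    with open("episode_verbatim.srt", encoding="utf-8") as f:
        edited = condense_srt(f.read(), pct=25)
    # In the study, LLM-returned timestamps were unreliable, so output like this
    # still required manual re-timing before being shown to participants.
    with open("episode_gpt25_raw.srt", "w", encoding="utf-8") as f:
        f.write(edited)
```

As noted above, the timestamps returned by the LLMs were not reliable, so any output of a step like this still had to be re-timed by hand to match the verbatim originals.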
The videos were recorded from live TV broadcasts in the USA and featured a mix of news and financial programming (in part because the AppTek LLM was optimized for the latter). Participants watched three captioned videos per condition, for a total of 15 videos. The order of conditions was counterbalanced across participants, as were the video assignments and their order within each condition. Immediately after viewing each video, participants rated the quality, speed, and understandability of the captions on a Likert scale from 1 to 7 [3]. Participants were allowed to customize the caption position, audio volume, and caption color/size upfront on a test video before starting the session. This avoided suboptimal caption appearance becoming a confounding factor.
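The chapter does not specify the counterbalancing scheme in detail. One common approach for five conditions is a Williams (balanced Latin square) design, sketched below; the condition labels match the list above, but the assignment logic is illustrative rather than the study’s actual procedure.

```python
# Illustrative counterbalancing of the five caption conditions with a Williams
# (balanced Latin square) design; a sketch only, not the study's exact scheme.
CONDITIONS = ["Rev", "Rev-75", "AppTek", "GPT-25", "GPT-50"]

def williams_orders(items):
    """Return presentation orders in which every item appears once per order;
    for an odd number of items, first-order carryover is balanced by also
    including the mirror image of each order."""
    n = len(items)
    # First order follows the pattern 0, 1, n-1, 2, n-2, ...
    first, lo, hi, take_low = [0], 1, n - 1, True
    while len(first) < n:
        first.append(lo if take_low else hi)
        lo, hi = (lo + 1, hi) if take_low else (lo, hi - 1)
        take_low = not take_low
    orders = [[items[(x + i) % n] for x in first] for i in range(n)]
    if n % 2 == 1:
        orders += [list(reversed(o)) for o in orders]
    return orders

orders = williams_orders(CONDITIONS)      # 10 orders for 5 conditions
for p in range(16):                       # cycle the orders over 16 participants
    print(f"P{p + 1:02d}:", " -> ".join(orders[p % len(orders)]))
```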
Sixteen participants, half women and half men, were recruited. Eight identified as culturally Deaf, five as deaf, two as hard of hearing, and one as other. Most were white, but there were also Black, Native American, Asian, and Latina participants. Ages ranged from 18 to 64, with most between 25 and 34 years old. Participants were recruited from the Washington, DC area, as well as online. In all cases, they were guided through the experiment via Zoom and asked to share their screen with the experimenter while they viewed the videos. In-person participants were seated in front of a computer running Zoom on a 27″ monitor; remote participants used their own computers. Participants viewed the videos and answered questions through Qualtrics. They also were asked open-ended questions about captioning preferences after each condition, with their responses recorded live.
4 Results
Results are shown in Fig. 1, Fig. 2 and Fig. 3. Each participant watched three videos per condition, for a total of 3 × 16 = 48 video viewings per condition. The average ratings for verbatim captions (4.8) and slowed-down videos (4.9) were close. For verbatim vs. moderately AI-edited GPT-25, pairwise t-tests with Bonferroni correction determined that verbatim was rated significantly better (t(15) = 3.56, p < 0.05). Verbatim also was rated significantly better than heavily AI-edited GPT-50 (t(15) = 4.82, p < 0.001). The differences between verbatim and lightly AI-edited AppTek captions, and between verbatim and slowed-down videos, were not significant. An ANOVA indicated a significant effect of caption type (F(4, 60) = 9.26, p < 0.00001).
Figure 2 shows that users rated the captioning speed close to “just right” across all conditions; the amount of information in the captions did not have a large impact on how fast participants perceived them to be. There were no significant differences in speed ratings between verbatim captions and any of the other conditions. The ANOVA for speed ratings showed a marginally significant effect of caption type (F(4, 60) = 2.59, p = 0.045).
As with caption ratings, users judged subjective comprehension highest for verbatim captions (Fig. 3). For verbatim vs. heavily AI-edited GPT-50, pairwise t-tests with Bonferroni correction determined that verbatim was rated significantly better (t(15) = 3.33, p = 0.018). The ANOVA showed a significant effect of caption type (F(2.1, 31.53) = 4.79, p = 0.014). Because Mauchly’s test indicated that the assumption of sphericity had not been met (p < 0.05) for this ANOVA, a Greenhouse-Geisser correction was applied. Note that subjective comprehension was correlated with the participants’ caption quality ratings (r = 0.77), mirroring results from Arroyo Chavez et al. [3].
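For readers who want to run this kind of analysis on their own data, the sketch below shows a repeated-measures ANOVA with an automatic Greenhouse-Geisser correction and Bonferroni-corrected paired t-tests, using the pingouin and scipy libraries. The data file and column names are assumptions, not artifacts of this study.

```python
# Minimal sketch of the repeated-measures analysis reported above, assuming a
# long-format DataFrame with one quality rating per participant and condition.
# File name and column names are illustrative, not from the study.
import pandas as pd
import pingouin as pg
from scipy import stats

df = pd.read_csv("ratings_long.csv")   # columns: participant, condition, quality

# Repeated-measures ANOVA; with correction='auto', pingouin applies a
# Greenhouse-Geisser correction when sphericity is violated (Mauchly's test).
aov = pg.rm_anova(data=df, dv="quality", within="condition",
                  subject="participant", correction="auto")
print(aov)

# Pairwise comparisons of verbatim captions against each other condition,
# with a Bonferroni correction for the number of comparisons.
wide = df.pivot(index="participant", columns="condition", values="quality")
others = [c for c in wide.columns if c != "Rev"]
for cond in others:
    t, p = stats.ttest_rel(wide["Rev"], wide[cond])
    print(f"Rev vs {cond}: t = {t:.2f}, p_bonf = {min(p * len(others), 1.0):.4f}")
```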
Eleven participants stated a preference for verbatim captions, two for edited captions, and the remaining three for slowed-down captions. However, the top-rated condition matched the stated preference for only five participants, all of whom preferred unaltered verbatim captions. In terms of their actual ratings, most participants rated slowed-down captions highest, followed by unaltered verbatim captions (Fig. 4). Note that participants did not know which condition was which.
With respect to the open-ended questions, several interesting patterns emerged: (1) participants stated a preference for verbatim captions, but (2) many felt, without knowing which condition they were viewing, that the GPT-25 captions provided sufficient information; (3) many participants did not realize that some videos had been slowed down, although a few commented that something felt off; and (4) participants did not like the idea of summarizing the dialogue in captions.
5 Discussion and Limitations
Most participants preferred verbatim captions, either unaltered or slowed down. Unaltered verbatim captions fared better than both types of GPT-edited captions. They were comparable to the much more lightly edited AI captions from AppTek, which, however, did not offer any advantages over verbatim. Slowed down videos appear to be a viable alternative to edited captions.
Despite the underperformance of AI-edited captions, they should be investigated further. Notably, four participants rated AI-edited captions as best, and there were limitations that likely led to underestimating the true potential of AI-edited captions: we had a skewed participant sample with high literacy, and we used an older ChatGPT version. With newer LLMs, future work may allow participants to customize the level of editing on a per-video basis. However, implementing AI-edited captions comes with challenges regarding the accuracy of timestamping. Overcoming this technical hurdle will be important for the successful use of AI-generated captions.
This study has two additional limitations. The first is the relatively small sample size of 16 participants. While this number was sufficient to extract the main trends about LLM-edited vs. verbatim captioning, it was not sufficient to perform a more fine-grained assessment of the relative merits of each LLM. The second limitation is that participants were provided with predetermined levels of editing. They could not customize the editing on their own, or individually for each type of video. This means that even though some participants preferred LLM-edited captions overall, we cannot be sure that users would consistently choose AI-edited captions of their own volition.
6 Conclusions and Future Work
Overall, the data from the study indicates that most participants prefer watching videos with verbatim captions, either at normal speed or slightly slowed down, rather than with a reduced rate of information via AI-edited captions. When unaware of which condition was which, the majority rated slowed-down videos highest, indicating that the information overload induced by verbatim captions could be alleviated this way.
LLM-based editing of captions is not yet ready for unsupervised deployment. Even though a small number of users preferred AI-edited captions, on average they received much worse usability ratings than their verbatim counterparts. There also is no reliable way to have LLM editing preserve timestamps, which for now precludes editing captions fully automatically. Nevertheless, AI editing of captions holds promise: the idea of being able to customize captions to each individual is a very powerful one. Future work should refine AI-edited captions. Ultimately, individuals should be able to select their target reading level, including 6th, 8th, 10th, and 12th grade, as well as college level.
Future work should also explore LLMs in conjunction with customizing non-speech information, such as city traffic sounds and footsteps approaching [12]. These concepts also can be extended to emotions, and music types and moods. Future work also needs to examine how AI could assist with providing sign language access.
Notes
1. Closed Captions are used in the United States to describe subtitles for the deaf and hard of hearing that can be toggled on and off and customized in appearance.
References
Berke, L., Albusays, K., Seita, M., Huenerfauth, M.: Preferred appearance of captions generated by automatic speech recognition for deaf and hard-of-hearing viewers. In: Extended Abstracts of 2019 CHI Conference on Human Factors in Computing Systems, pp. 1–6 (2019)
Butler, J.: Perspectives of deaf and hard of hearing viewers of captions. Am. Ann. Deaf 163(5), 534–553 (2019). https://www.jstor.org/stable/26663593
Arroyo Chavez, M., et al.: How users experience closed captions on live television: quality metrics remain a challenge. In: 2024 CHI Conference on Human Factors in Computing Systems (2024, to appear). https://doi.org/10.1145/3613904.3641988
Fresno, N.: Watching accessible cartoons: the speed of closed captions for young audiences in the United States. Perspectives 26(3), 405–421 (2018). https://www.tandfonline.com/doi/abs/10.1080/0907676X.2017.1377264
Gernsbacher, M.A.: Video captions benefit everyone. Policy Insights Behav. Brain Sci. 2(1), 195–202 (2015). https://www.researchgate.net/publication/290395991_Video_Captions_Benefit_Everyon
Graham, R., Choo, J.: Preliminary research on AI-generated caption accuracy rate by platforms and variables. J. Technol. Pers. Disabil. 10, 33–53 (2022)
Jensema, C.: Viewer reaction to different television captioning speeds. Am. Ann. Deaf 143(4), 318–324 (1998). https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=221f424fb830672e8b16ff34999d0dad6981e4d1
Karlsson, F.: User-centered visualizations of transcription uncertainty in AI-generated subtitles of news broadcast (2020)
Kawas, S., Karalis, G., Wen, T., Ladner, R.E.: Improving real-time captioning experiences for deaf and hard of hearing students. In: Proceedings of the 18th International ACM SIGACCESS Conference on Computers and Accessibility, pp. 15–23 (2016)
Kruger, J.-L., Wisniewska, N., Liao, S.: Why subtitle speed matters: evidence from word skipping and rereading. Appl. Psycholinguist. 43(1), 211–236 (2022). https://doi.org/10.1017/S0142716421000503
Levin, K., et al.: Automated closed captioning for Russian live broadcasting. In: Fifteenth Annual Conference of the International Speech Communication Association (2014)
May, L., et al.: Unspoken sound: identifying trends in non-speech audio captioning on YouTube. In: 2024 CHI Conference on Human Factors in Computing Systems (2024, to appear)
Soe, T.H., Guribye, F., Slavkovik, M.: Evaluating AI assisted subtitling. In: ACM International Conference on Interactive Media Experiences, pp. 96–107 (2021)
Szarkowska, A., Gerber-Morón, O.: Viewers can keep up with fast subtitles: evidence from eye movements. PLoS ONE 13(6), e0199331 (2018). https://doi.org/10.1371/journal.pone.0199331
Shiver, B.N., Wolfe, R.J.: Evaluating alternatives for better deaf accessibility to selected web-based multimedia. In: Proceedings of the 17th International ACM SIGACCESS Conference on Computers & Accessibility, pp. 231–238 (2015)
Vy, Q.V., Fels, D.I.: Using avatars for improving speaker identification in captioning. In: Gross, T., et al. (eds.) INTERACT 2009, Part II. LNCS, vol. 5727, pp. 916–919. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-03658-3_110
Yuan, Y., Ma, L., Zhu, W.: Syntax customized video captioning by imitating exemplar sentences. IEEE Trans. Pattern Anal. Mach. Intell. 44(12), 10209–10221 (2021). https://arxiv.org/abs/2112.01062
Acknowledgments
The contents of this paper were developed under a grant from the National Institute on Disability, Independent Living, and Rehabilitation Research (NIDILRR grant number 90DPCP0002). Additional funding was provided by a National Science Foundation REU Site Grant (#2150429). Norman Williams supported the technical setup of the experiments, wrote the custom video player used for the stimuli, and organized the collection of TV recordings. James Waller and Matthew Seita consulted on the statistical analysis.
Ethics declarations
Christian Vogler partners with AppTek LLC on the NIDILRR grant that funded most of this work. He also has a separate partnership with AppTek on a joint venture with the goal of providing universal accessibility tools. The other authors have no competing interests to declare that are relevant to the content of this article.
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2024 The Author(s)