
1 Introduction

Many studies have explored closed captions for Deaf and Hard of Hearing (DHH) accessibility. Captions come in many forms: captions generated by artificial intelligence (AI) and automatic speech recognition (ASR), human-generated captions produced by live captioners via steno keyboards or re-speaking, verbatim captions at natural speaking rates of over 180 words per minute, and captions at reduced speeds of up to 140 words per minute [7]. Captions can also benefit a diverse set of users beyond DHH viewers. For example, one study documented that captions can benefit children and adults watching in a non-native language, as well as children and adults learning to read [5]. Furthermore, the rise of AI and ASR has opened new possibilities for captioning and for making content more widely accessible than in the past.

Despite these technological advances, user dissatisfaction remains high, as underscored by a recent mixed-methods study [3]. Although that study found that DHH users liked verbatim captions better than typical live TV captions, participants still expressed high levels of dissatisfaction even with the verbatim caption stimuli. Many felt overwhelmed by the captions, which in some cases were perceived as too fast or as containing too much text.

This study explores how dissatisfaction with captions can be alleviated. The primary research question is whether custom-edited captions that adjust to an individual user’s reading speed fare better than verbatim captions. Historically, such customized captions have been cost-prohibitive, since a human would have had to edit them for multiple levels of reading fluency and visual attention. However, generative AI and large language models (LLMs), such as ChatGPT and GPT-4, have already demonstrated the ability to summarize text and extract key information. Could LLMs therefore be used to custom-tailor the content of captions?

Customizing captions via LLMs and AI faces many challenges. A key first step is to determine how the content these models produce fares against verbatim captions. Our experimental design employs two different LLMs to condense captions to different degrees and compares their usability among DHH participants against verbatim captions, with slowed-down verbatim captions as an additional control condition, to assess whether using LLMs for customization is feasible in principle.

2 Related Work

A prior study on caption metrics [3] found no clear correlations between DHH user ratings of TV captions containing errors and several caption quality metrics. That study also questioned the extent to which verbatim captions can serve as a gold standard for the user experience, given the levels of dissatisfaction even with those. Caption customization has been proposed as a new avenue for accessible technologies in a paper on DHH viewers’ perspectives on captions [2]; as a user desire, especially for adjusting caption position [16] and color [15]; as a means of user satisfaction and empowerment in a study on preferred caption appearances among DHH users [1]; and as a means of user control in a study on user-generated captions on social media [12].

People’s ability to keep up with caption speed varies [7]. Some people report being able to follow fast-moving subtitles and find slow subtitles frustrating [14]. However, other studies argue that faster captions or subtitles may negatively affect reading comprehension [10] or overwhelm viewers outside the expected adult age range, such as younger viewers [4]. Whether DHH viewers generally prefer fast or slow captions remains an open question, further suggesting that customization may have a role to play [17].

Combining artificial intelligence with human work can also improve captioning workflows. For example, Levin et al. [11] tested real-time editing of ASR-generated captions as a solution for the 2014 Paralympic Games in Sochi. Another example is the incorporation of avatars for speaker identification [16], where most evaluation results were positive. However, the underlying technology must itself be adequate. A study by Kawas et al. [9] found that the accuracy and reliability of captioning technologies remain important issues that need to be addressed. Another study found that while machines lack human reasoning ability and are not reliable enough to create acceptable subtitles on their own, they can support humans in creating higher-quality semi-automated captions [13].

Overall, the accuracy of AI-assisted captions still leaves room for improvement. Graham and Choo found that such captions did not meet legal requirements and industry standards [6]. Although AI-assisted subtitling improved accuracy and shortened production times, it occasionally struggled with speech recognition, which led participants to edit the subtitles and introduce new errors. Another study examined visually expressing transcript uncertainties alongside the captions [8], but this did not improve overall intelligibility. Given the current state of the art, there is a continued need for research into AI-generated and AI-assisted captions to improve the user experience.

3 Methods

The study used a within-subjects, repeated-measures design with five conditions comparing verbatim captions to three AI-edited alternatives. One condition provided unaltered verbatim captions with unaltered videos. Another served as a control, with video, audio, and verbatim captions slowed down by 25%, to explore alternatives to editing. The remaining three AI-based conditions offered different levels of editing, ranging from 5–10% condensation to 25% and 50%. Two AI engines were employed: the then-GPT-3-based ChatGPT and AppTek’s LLM. The latter, unlike ChatGPT, has been specifically tuned for commercial TV captions (AppTek is a provider of AI captioning). The full details of the conditions were as follows:

  1. Rev: Verbatim captioning by offline human captioners (rev.com) at normal speed.

  2. Rev-75: Verbatim captioning by offline human captioners (rev.com) at 75% speed.

  3. AppTek: Captions by offline human captioners (rev.com), edited by AppTek AI (reduced by 5% to 10% in content) at normal speed.

  4. GPT-25: Captions by human captioners, edited by ChatGPT AI (reduced by 25% in content) at normal speed.

  5. GPT-50: Captions by human captioners, edited by ChatGPT AI (reduced by 50% in content) at normal speed.

To generate the AI-edited captions, we used the human-generated verbatim captions as a starting point. The subtitle files (in SRT format) were fed to the LLMs with prompts asking them to reduce the content while preserving the timestamps. However, neither LLM was able to produce good timings for the edited captions. To eliminate timing as a confounder, all AI-edited captions were re-timed by humans to match the originals.
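As an illustration of this step, the sketch below condenses an SRT file by prompting an LLM. It is a minimal example under stated assumptions (the OpenAI chat-completions API, a placeholder model name, and prompt wording of our own choosing), not the exact prompts or tooling used in the study.

```python
# Minimal sketch, assuming the OpenAI chat-completions API; the prompt wording,
# model name, and condense_srt helper are illustrative, not the study's tooling.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "You will receive captions in SRT format. Condense the text by roughly "
    "{percent}% while preserving the meaning, and keep every cue number and "
    "timestamp exactly as given. Return valid SRT only."
)

def condense_srt(srt_text: str, percent: int = 25, model: str = "gpt-4o") -> str:
    """Ask an LLM to condense SRT captions while (ideally) keeping the timings."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": PROMPT.format(percent=percent)},
            {"role": "user", "content": srt_text},
        ],
    )
    return response.choices[0].message.content

# In the study, the returned timings were not reliable, so the edited cues
# were re-timed by humans before being shown to participants.
```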

The videos were recorded from live TV broadcasts in the USA and featured a mix of news and finance-related programming (in part because the AppTek LLM was optimized for the latter). Participants watched three captioned videos per condition, for a total of 15 videos. The order of conditions was counterbalanced across participants, as were the video assignments and their order within each condition, as sketched below. Immediately after viewing each video, participants rated caption quality, speed, and understanding on a 1–7 Likert scale [3]. Before the session, participants could customize the caption position, audio volume, and caption color/size on a test video, which avoided suboptimal caption appearance as a confounding factor.
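The paper does not specify the exact counterbalancing scheme, so the following is only a sketch of one common approach, a balanced Latin square, using the condition labels from the list above.

```python
# Illustrative sketch of counterbalancing condition order across participants
# with a balanced Latin square; the study's actual scheme is not specified.
CONDITIONS = ["Rev", "Rev-75", "AppTek", "GPT-25", "GPT-50"]

def balanced_latin_square(n: int) -> list[list[int]]:
    """Return orderings balancing positions and first-order carryover effects
    (n rows for even n, 2n rows for odd n)."""
    first, low, high, take_low = [0], 1, n - 1, True
    while len(first) < n:
        first.append(low if take_low else high)
        low, high = (low + 1, high) if take_low else (low, high - 1)
        take_low = not take_low
    rows = [[(x + i) % n for x in first] for i in range(n)]
    if n % 2 == 1:  # odd n also needs the mirrored rows for full balance
        rows += [list(reversed(r)) for r in rows]
    return rows

orders = balanced_latin_square(len(CONDITIONS))
for participant in range(16):  # 16 participants cycle through the orderings
    order = orders[participant % len(orders)]
    print(participant + 1, [CONDITIONS[c] for c in order])
```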

Sixteen participants (half women, half men) were recruited. Eight identified as culturally Deaf, five as deaf, two as hard of hearing, and one as other. Most were white, but there were also Black, Native American, Asian, and Latina participants. Ages ranged from 18 to 64, with most between 25 and 34 years old. Participants were recruited from the Washington, DC area as well as online. In all cases, they were guided through the experiment via Zoom and asked to share their screen with the experimenter while viewing the videos. In-person participants were seated in front of a computer running Zoom on a 27″ monitor; remote participants used their own computers. Participants viewed the videos and answered questions through Qualtrics. After each condition, they were also asked open-ended questions about their captioning preferences, and their responses were recorded live.

4 Results

Results are shown in Figs. 1, 2 and 3. Each participant watched three videos per condition, for a total of 3 × 16 = 48 video viewings per condition. The average ratings for verbatim captions (4.8) and slowed-down videos (4.9) were close. Pairwise t-tests with Bonferroni corrections determined that verbatim was rated significantly better than the moderately AI-edited GPT-25 condition (t(15) = 3.56, p < 0.05) and than the heavily AI-edited GPT-50 condition (t(15) = 4.82, p < 0.001). The differences between verbatim and the lightly AI-edited AppTek condition, and between verbatim and the slowed-down videos, were not significant. An ANOVA indicated a significant effect of caption type (F(4, 60) = 9.26, p < 0.00001).

Figure 2 shows that users rated the captioning speed close to “just right” across all conditions; the amount of information in the captions did not have a large impact on how fast participants perceived them to be. There were no significant differences between the speed ratings of verbatim captions and any of the other conditions. For speed ratings, the ANOVA effect of caption type was only marginally significant (F(4, 60) = 2.59, p = 0.045).

As with caption ratings, users judged subjective comprehension highest for verbatim captions (Fig. 3). Pairwise t-tests with Bonferroni corrections determined that verbatim was rated significantly better than the heavily AI-edited GPT-50 condition (t(15) = 3.33, p = 0.018). The ANOVA effect of caption type was significant (F(2.1, 31.53) = 4.79, p = 0.014); because Mauchly’s test indicated that the assumption of sphericity had been violated (p < 0.05), a Greenhouse-Geisser correction was applied. Note that subjective comprehension was correlated with participants’ caption quality ratings (r = 0.77), mirroring the results of Arroyo Chavez et al. [3].
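For reference, the style of analysis reported above (pairwise paired t-tests with a Bonferroni correction and a quality–comprehension correlation) can be sketched as follows. The file name, column names, and per-participant averaging are assumptions for illustration, not the authors’ actual analysis code.

```python
# Illustrative analysis sketch (not the authors' code). Assumes a long-format
# table with columns: participant, condition, quality, comprehension.
import pandas as pd
from scipy import stats

ratings = pd.read_csv("ratings.csv")  # hypothetical per-video ratings file

# Average each participant's three videos per condition before paired testing.
per_pp = ratings.groupby(["participant", "condition"]).mean(numeric_only=True)
quality = per_pp["quality"].unstack("condition")

comparisons = ["GPT-25", "GPT-50", "AppTek", "Rev-75"]
for cond in comparisons:
    t, p = stats.ttest_rel(quality["Rev"], quality[cond])
    p_bonf = min(p * len(comparisons), 1.0)  # Bonferroni correction
    print(f"Rev vs {cond}: t(15) = {t:.2f}, corrected p = {p_bonf:.3f}")

# Per-video correlation between caption quality and subjective comprehension.
r, p = stats.pearsonr(ratings["quality"], ratings["comprehension"])
print(f"quality vs comprehension: r = {r:.2f}, p = {p:.3g}")
```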

Fig. 1.
Bar graph with error bars of caption ratings per condition. Approximate means (error-bar ranges): Rev 4.9 (4.8–5.0), GPT-25 3.7 (3.5–4.1), GPT-50 3.6 (3.3–3.9), AppTek 4.3 (4.2–4.5), Rev-75 5.0 (4.8–5.2).

User ratings of the captions. Higher is better. Generally, more condensed captions received worse ratings. The differences between Rev and both GPT conditions were statistically significant.

Fig. 2.
Bar graph with error bars of caption speed ratings per condition. Approximate means (error-bar ranges): Rev 4.1 (4.0–4.2), GPT-25 3.8 (3.7–3.9), GPT-50 3.8 (3.7–4.0), AppTek 4.1 (4.0–4.2), Rev-75 3.8 (3.7–3.9).

User ratings of caption speed. Users rated all types of captions as “just right”, which corresponds to the number 4. Lower means too slow, higher too fast.

Fig. 3.
Bar graph with error bars of subjective comprehension ratings per condition. Approximate means (error-bar ranges): Rev 5.7 (5.5–6.0), GPT-25 4.5 (4.1–5.0), GPT-50 4.5 (4.1–4.8), AppTek 5.0 (4.8–5.3), Rev-75 5.7 (5.4–5.9).

User ratings of how much of the content they subjectively understood. The difference between Rev and GPT-50 is statistically significant. Understanding is correlated with caption ratings (r = 0.77, n = 240).

When asked, eleven participants stated a preference for verbatim captions, two for edited captions, and the remaining three for slowed-down captions. However, the captions rated highest matched the stated preferences for only five participants, all of whom preferred unaltered verbatim captions. Based on the ratings, slowed-down captions were judged best most often, followed by unaltered verbatim captions (Fig. 4). Note that participants did not know which condition was which.

Fig. 4.
Bar graph of the caption types rated best by participants. Approximate counts: Rev 5, GPT-25 2, GPT-50 1, AppTek 1, Rev-75 7.

Distribution of the caption types rated best by each participant; participants did not know which condition was which.

With respect to the open-ended questions, several interesting patterns emerged: (1) participants stated a preference for verbatim captions, yet (2) many felt, without knowing which condition they had seen, that the GPT-25 captions provided sufficient information; (3) many participants did not realize some videos were slowed down, although a few commented that something felt off; and (4) participants did not like the idea of summarizing the dialogue in captions.

5 Discussion and Limitations

Most participants preferred verbatim captions, either unaltered or slowed down. Unaltered verbatim captions fared better than both types of GPT-edited captions and were comparable to the much more lightly edited AppTek captions, which, however, offered no advantage over verbatim. Slowed-down videos appear to be a viable alternative to edited captions.

Despite the underperformance of AI-edited captions, they should be investigated further. Notably, four participants rated AI-edited captions as best, and the study had limitations that likely lead to underestimating the true potential of AI editing: the participant sample was skewed toward high literacy, and an older ChatGPT version was used. With newer LLMs, future work may allow participants to customize the level of editing on a per-video basis. However, implementing AI-edited captions also comes with challenges regarding the accuracy of timestamping, and overcoming this technical hurdle will be important for the successful use of AI-generated captions.

This study has two additional limitations. The first is the relatively small sample size of 16 participants. While this number was sufficient to extract the main trends regarding LLM-edited versus verbatim captioning, it was not sufficient for a more fine-grained assessment of the relative merits of each LLM. The second limitation is that participants were provided with predetermined levels of editing; they could not customize the editing on their own, or individually for each type of video. Thus, even though some participants preferred LLM-edited captions overall, we cannot be sure that users would consistently choose AI-edited captions of their own volition.

6 Conclusions and Future Work

Overall, the data from this study indicate that most participants prefer watching videos with verbatim captions, at normal or slightly slowed-down speed, rather than with a reduced rate of information via AI-edited captions. When unaware of which condition was which, participants rated the slowed-down videos best most often, indicating that the information overload induced by verbatim captions could be alleviated this way.

LLM-based editing of captions is not yet ready for unsupervised deployment. Even though a small number of users preferred LLM-edited captions, they received, on average, much worse usability ratings than their verbatim counterparts. There is also no reliable way to have LLM editing preserve timestamps, which precludes fully automatic caption editing for now. Nevertheless, AI editing of captions holds promise: the ability to customize captions for each individual is a very powerful idea. Future work should refine AI-edited captions so that, ultimately, individuals can select their target reading level, such as 6th, 8th, 10th, or 12th grade, or college level.

Future work should also explore LLMs in conjunction with customizing non-speech information, such as city traffic sounds and approaching footsteps [12]. These concepts can also be extended to emotions, as well as music types and moods. Finally, future work needs to examine how AI could assist with providing sign language access.