Audio Match Cutting: Finding and Creating Matching Audio Transitions in Movies and Videos
Abstract
A “match cut” is a common video editing technique where a pair of shots that have a similar composition transition fluidly from one to another. Although match cuts are often visual, certain match cuts involve the fluid transition of audio, where sounds from different sources merge into one indistinguishable transition between two shots. In this paper, we explore the ability to automatically find and create “audio match cuts” within videos and movies. We create a self-supervised audio representation for audio match cutting and develop a coarse-to-fine audio match pipeline that recommends matching shots and creates the blended audio. We further annotate a dataset for the proposed audio match cut task and compare the ability of multiple audio representations to find audio match cut candidates. Finally, we evaluate multiple methods to blend two matching audio candidates with the goal of creating a smooth transition. Project page and examples are available at: https://denfed.github.io/audiomatchcut/
Index Terms— Self-Supervised Learning, Match Cuts, Audio Transitions, Audio Retrieval, Similarity Matching
1 Introduction
In movies and videos, the “cut” is a foundational editing technique that is used to transition from one scene or shot to the next [1]. The precise use of cuts often crafts the story being portrayed, whether it controls pacing, highlights emotions, or connects disparate scenes into a cohesive story [2]. There are many variations of cuts that are used across the film industry, including smash cuts, reaction cuts, J-cuts, L-cuts, and others. One specific cut is the “match cut”, which is a transition between a pair of shots that uses similar framing, composition, or action to fluidly bring the viewer from one scene to the next. Match cuts often match visuals across scenes through similar objects and their placement, colors, or camera movements [3]. However, match cuts can also match sound across scenes, where the sound of one scene transitions seamlessly into the sound of the next. These audio match cuts (also referred to as “sound bridges”) either blend together sounds or carry similar sound across scenes, often from different sound sources, to create a fluid audio transition between them [4]. Figure 1 shows examples of visual and audio match cuts found in movies.
Along with cutting, video editing as a whole is a time-consuming process that often involves a team of expert editors to create high-quality videos and movies. Tasks like match cutting often involve a manual search across a collection of recorded content to find strong candidates to transition to, which is time-consuming and tedious [5]. As a result, AI-assisted video editing has emerged as a promising area of research, with the goal of helping editors improve the speed and quality of editing. Recent works focus on improving the understanding of movies, from detecting events and objects within them, such as speakers [6], to recognizing video attributes such as shot angles, sequences, and locations [7], and understanding various cuts [8]. Beyond understanding videos, full editing tools have been proposed, including shot sequence ordering [7], automatic scene cutting [9], trailer generation [6], video transition creation [10], and audio beat matching [11]. Recently, [5] proposed a framework to automatically find frame and motion match cuts in movies. That work collects a large-scale dataset of match cuts found in movies and trains a classification network to retrieve match cut candidates, aiding video editors in finding and creating these match cuts. However, [5] focuses only on visual match cuts. In our work, we expand upon this area and focus on the ability to automatically find and create audio match cuts.
At its core, creating audio match cuts involves retrieving candidate audio clips that are able to create high-quality match cuts. Retrieving similar audio samples has been explored in the music domain with music information retrieval systems that retrieve full songs based on small song snippets, using signal processing techniques [14, 15] and, more recently, deep learning [16, 17]. Similarly, performing audio transitions in the music domain is an often-used technique, both in music mixing and live DJ performances. In the literature, both signal processing [18] and deep-learning-based [19] techniques have been introduced to automatically create these transitions. However, finding and creating matching audio transitions has been unexplored in the context of movies and videos across a diverse set of sounds beyond music.
In this paper, we explore this problem and propose a self-supervised retrieval-and-transition framework, shown in Figure 2, to automatically find and create high-quality audio match cuts. Our contributions in this paper are as follows:
- We introduce the problem of automatic audio match cut generation across diverse sounds and create two datasets for evaluating automatic audio match cutting methods.
- We propose a framework where a coarse-to-fine audio retrieval pipeline first recommends matched clips, then a fine-grained transition method creates audio match cuts that outperform multiple baselines.
2 Method
2.1 Problem Definition
To model the real-world task of creating audio match cuts, we formulate our proposed audio match cut problem as a unimodal audio retrieval task. Specifically, given a query video clip and a collection of other video clips, the goal is to retrieve a video clip and create an audio transition such that the query, the retrieved clip, and the transition together form a high-quality audio match cut. We formulate the retrieval as a maximum inner-product search (MIPS) between normalized feature representations extracted from the audio of the query and from the audio of a gallery of video clips. After retrieving the top-k most similar gallery clips, we perform a processing operation to blend the query with each retrieved clip to create the final audio match cuts, and a user selects which match cuts to use out of the k recommendations.
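As a rough illustration of the retrieval step, the following minimal sketch runs a brute-force top-k maximum inner-product search over L2-normalized feature vectors. The function names, the feature dimension, and the random stand-in features are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit Euclidean norm."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)

def top_k_mips(query_feat, gallery_feats, k=5):
    """Return indices and scores of the k gallery rows with the largest inner product."""
    q = l2_normalize(np.asarray(query_feat, dtype=np.float64))
    g = l2_normalize(np.asarray(gallery_feats, dtype=np.float64))
    scores = g @ q                      # one similarity per gallery clip
    k = min(k, scores.shape[0])
    order = np.argsort(-scores)[:k]     # sort by descending similarity
    return order, scores[order]

# Hypothetical data: 1000 gallery clips with 64-dimensional features.
rng = np.random.default_rng(0)
gallery = rng.normal(size=(1000, 64))
query = gallery[42] + 0.01 * rng.normal(size=64)   # near-duplicate of clip 42
idx, sims = top_k_mips(query, gallery, k=5)
```

For galleries with millions of clips, an approximate nearest-neighbor index would likely replace the brute-force matrix product, but the ranking criterion stays the same.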
2.2 Data Collection
As the audio match cut problem is unexplored, we developed evaluation sets based on subsets of publicly available datasets, Audioset [20] and Movieclips (collected from the Movieclips YouTube channel), to evaluate audio match cut generation methods. Audioset contains user-generated videos from YouTube and Movieclips contains high-quality movie snippets. For each dataset, we split each video into 1-second non-overlapping image-audio pairs where the image is the middle frame of the respective second of video, resulting in over 2M Audioset and 800k Movieclips samples. We chose to perform retrieval over 1-second pairs to balance granularity against search complexity.
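The splitting step can be sketched as follows. This is a minimal example assuming a mono waveform at 48 kHz; dropping any trailing partial second and the rule for picking the middle video frame are assumptions of this sketch, since the text does not specify them.

```python
import numpy as np

def split_into_seconds(audio, sample_rate=48_000):
    """Split a mono waveform into non-overlapping 1-second frames.

    Any trailing partial second is dropped (an assumption of this sketch).
    """
    audio = np.asarray(audio)
    n_frames = audio.shape[0] // sample_rate
    return audio[: n_frames * sample_rate].reshape(n_frames, sample_rate)

def middle_frame_indices(n_seconds, fps):
    """Index of the video frame at the midpoint of each second (hypothetical pairing rule)."""
    return [int((s + 0.5) * fps) for s in range(n_seconds)]

# A 3.002-second clip yields three 1-second audio frames.
frames = split_into_seconds(np.zeros(48_000 * 3 + 100))
image_ids = middle_frame_indices(len(frames), fps=24)
```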
Next, we collect a query set of samples of a variety of natural sounds and sound effects, including sounds like engines revving, impulsive sounds like a hammer striking, doorbells, campfires, and other unique sounds seen in videos and movies. For each query, we label a set of match candidates based on two criteria that constitute a positive audio match: i) the pair must sound plausible if the audio is swapped between the query and match images. ii) the pair must sound perceptually similar in terms of pitch, rhythm, timbre, etc.
As labeling random pairs across all samples results in an infeasible search space (over 4 trillion pairs), we use existing audio representations to help generate candidate audio matches. We hypothesize that, since the main characteristic of audio match cuts is that the audio of both scenes is perceptually similar, widely available audio representations may be used, as they are often trained with the goal of similar audio samples having high similarity. We use two simple representations, the MFCC and Mel-Spectrogram, and two deep representations, the audio encoders from CLAP [21] and ImageBind [22]. For MFCC and Mel-Spectrogram, we use a window of 2048 samples and a hop length of 1024 samples. We flatten both representations along the time steps and use the resulting feature vectors for retrieval. For CLAP [21] and ImageBind [22], we use their respective spectrogram generation parameters and use the resulting audio encoder feature vectors for retrieval. We use the MIPS operation described in Section 2.1 to create the audio match cut candidates for labeling. All audio used in this work is sampled at 48 kHz.
Since we used audio representations to collect audio match candidate pairs and label only those pairs, our evaluation set tends to favor the most similar candidates of each representation. To address this bias and create a more comprehensive evaluation, we randomly sample 100 negative matches for each query in the Audioset and Movieclips datasets. Because these samples are drawn at random from millions of 1-second samples, it is very unlikely that any of them is in fact a positive audio match. The resulting Audioset evaluation set has a gallery of 12,350 labeled samples spread across 102 queries, and the Movieclips evaluation set has a gallery of 8,289 labeled samples spread across 66 queries. Each query has an average of 123 labeled samples and 10 positive matches.
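To make the "window of 2048, hop of 1024, then flatten" recipe concrete, here is a simplified stand-in that uses a plain log-magnitude spectrogram. It deliberately omits the mel filterbank and the DCT step that the Mel-Spectrogram and MFCC features would need, and the Hann window and the lack of edge padding are assumptions of this sketch.

```python
import numpy as np

def flat_log_spectrogram(audio, n_fft=2048, hop=1024):
    """Flattened log-magnitude spectrogram of a mono clip (no edge padding)."""
    audio = np.asarray(audio, dtype=np.float64)
    n_frames = 1 + (len(audio) - n_fft) // hop
    window = np.hanning(n_fft)
    # Slice the clip into overlapping, windowed analysis frames.
    frames = np.stack(
        [audio[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    mag = np.abs(np.fft.rfft(frames, axis=1))   # (n_frames, n_fft // 2 + 1)
    spec = np.log1p(mag).T                      # (freq_bins, time_steps)
    return spec.reshape(-1)                     # one feature vector per clip

# One second of 48 kHz audio gives 45 time steps of 1025 frequency bins each.
feature = flat_log_spectrogram(np.zeros(48_000))
```

Because every 1-second clip produces a vector of the same length, the flattened features can be fed directly into the inner-product search from Section 2.1.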
2.3 Audio Match Cut Representation Learning
In Section 2.2, we utilize existing audio representations to generate audio match cut candidates. However, existing audio representations are not directly aligned for the audio match cut task, which aims to retrieve perceptually-similar audio from different scenes, differing from existing retrieval tasks. As a result, existing audio representations may produce sub-optimal audio match cut candidates.
Lacking labeled data for audio match cutting, we propose a self-supervised learning objective to create an audio representation that effectively retrieves high-quality audio match cut candidates. Our objective leverages already-edited videos based on the notion that given a query audio frame of a video, an audio frame that results in a high-quality match cut is the next successive frame in the same video, as the entire video has been previously edited to have continuous audio. We model this characteristic as “Split-and-Contrast”, shown in Figure 2b, where adjacent audio frames in two splits of a video are trained to have high similarity, while contrasting away other non-adjacent audio frames.
Given a batch of audio samples, each consisting of a fixed number of audio frames, we extract a fixed-size feature representation from every frame of every sample. We then randomly select an index at which to split each sample's sequence of frame features into a left section and a right section, whose lengths sum to the number of frames in the sample. For each left/right pair of sections, the adjacent frames are the last frame in the left section and the first frame in the right section. Then, we define a contrastive learning formulation over the batch to learn a representation that produces high similarity only for the adjacent frames in the split sections, and low similarity for all other pairs:
[Equation 1: Split-and-Contrast contrastive loss]
In Equation 1, similarity is the inner product of normalized feature vectors, and a temperature parameter scales the softmax. This formulation is similar to the InfoNCE [23] and N-Pair [24] losses, modified to allow multiple positives in a single loss computation. By maximizing the similarity of two adjacent frames in a split audio sample, we expect the model to learn to retrieve perceptually similar audio frames that result in high-quality transitions. We use the pretrained CLAP [21] audio encoder, based on the HTSAT [25] architecture, and the CLAP [21] linear projection layers. We also use the spectrogram creation and preprocessing steps defined in [21], with an audio frame size of 1 second. We found that fine-tuning the CLAP [21] projection layers with a frozen encoder works better than end-to-end fine-tuning, suggesting that “Split-and-Contrast” is better suited for aligning existing audio feature representations for the audio match cut task.
We train the projection layers using 200k random Audioset samples for 20 epochs using the Adam [26] optimizer, learning rate of , batch size of 2048, and temperature of 0.1. Each sample has ten 1-second audio frames, such that .
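One plausible reading of the Split-and-Contrast objective is sketched below in NumPy. It is not the paper's exact loss: it assumes a single split index shared by the whole batch, compares every left-section frame with every right-section frame in the batch, and treats the adjacent pairs as the positives of one InfoNCE-style softmax. The function names are hypothetical.

```python
import numpy as np

def logsumexp(x):
    """Numerically stable log(sum(exp(x))) over all entries."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def split_and_contrast_loss(feats, split, temperature=0.1):
    """InfoNCE-style loss with multiple positives (a sketch, not the paper's formula).

    feats: (B, T, D) frame features; split: shared index in [1, T - 1].
    """
    B, T, D = feats.shape
    assert 1 <= split <= T - 1
    z = feats / np.linalg.norm(feats, axis=-1, keepdims=True)
    left = z[:, :split].reshape(-1, D)      # B * split rows
    right = z[:, split:].reshape(-1, D)     # B * (T - split) rows
    logits = (left @ right.T) / temperature
    # Positives: last left frame and first right frame of the same sample.
    pos = np.zeros_like(logits, dtype=bool)
    b = np.arange(B)
    pos[b * split + (split - 1), b * (T - split)] = True
    # -log( sum over positives / sum over all pairs )
    return logsumexp(logits) - logsumexp(logits[pos])

# Hypothetical batch: 4 samples, ten 1-second frames each, 16-dim features.
rng = np.random.default_rng(0)
loss = split_and_contrast_loss(rng.normal(size=(4, 10, 16)), split=5)
```

The loss is zero only when the adjacent pairs carry all of the softmax mass, which matches the stated goal of high similarity for adjacent frames and low similarity elsewhere.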
2.4 Audio Transition
One common method of transitioning between two audio samples is the crossfade, where the first audio clip fades out while simultaneously fading the second clip in, resulting in a smooth transition [27]. However, creating high-quality transitions using crossfade often requires manual tuning of the crossfade length, based on the audio characteristics [28]. In this section, we describe our audio transition method that improves upon simple crossfade by i) first finding a specific transition point within the 1-second clip, and ii) adaptively selecting a more optimal crossfade length based on the audio characteristics.
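For reference, a fixed-length crossfade baseline can be sketched as below. The square-root fade curves follow the window choice described later in this section; the function name and the choice to express the fade length in samples are assumptions of this sketch.

```python
import numpy as np

def crossfade(a, b, fade_samples):
    """Fade out the tail of `a` while fading in the head of `b`."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    n = int(min(fade_samples, len(a), len(b)))
    if n <= 0:
        return np.concatenate([a, b])        # hard cut: plain concatenation
    t = (np.arange(n) + 0.5) / n             # fade position in (0, 1)
    fade_in, fade_out = np.sqrt(t), np.sqrt(1.0 - t)   # squares sum to 1
    overlap = a[-n:] * fade_out + b[:n] * fade_in
    return np.concatenate([a[:-n], overlap, b[n:]])

# 0.25-second crossfade between two 1-second clips at 48 kHz.
sr = 48_000
mixed = crossfade(np.ones(sr), np.ones(sr), fade_samples=sr // 4)
```

The output is shorter than the two inputs combined by the length of the overlap, which is why the fade length has to be chosen before the cut is placed on a timeline.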
Since we perform retrieval using 1-second audio clips, the matched clips may be overall strong candidates for an audio match cut, but the exact borders of each still may not align well for a direct transition. Therefore, we propose an operation to find an optimal transition point between the query and matched clip at the spectrogram time-step-level, named “Max Sub-Spectrogram” similarity search, shown in Figure 2a.
Given Mel-spectrogram representations of the query audio and the matched audio, each with one axis of frequency bins and one axis of time steps, we calculate the inner product of the two spectrograms across time steps, yielding a similarity matrix with one entry per pair of query and match time steps. We then find the most similar time-step pair, i.e., the arg max of this matrix. We hypothesize that this pair yields a strong point to transition between the query and match, as the audio spectra are most aligned there.
After finding the transition point, we perform a crossfade to further blend together the query and match audio clips. However, as previously mentioned, certain types of audio may benefit from crossfades of different lengths [28]. Figure 3 shows two examples of audio matches that require different crossfades. When matching the strikes of a hammer and knife, long-duration crossfades cause the impacts to overlap each other, resulting in a blurry, low-quality transition. When matching a blender and motorcycle, the audio exhibits more noise throughout the sample, which benefits from longer crossfades as both noisy samples blend into each other slowly.
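The transition-point search can be sketched in a few lines of NumPy. The spectrograms are assumed to be arrays of shape (frequency bins, time steps); the helper `cut_at`, which maps time steps back to sample offsets using the hop length, is a hypothetical addition for illustration and is not described in the text.

```python
import numpy as np

def max_sub_spectrogram(query_spec, match_spec):
    """Return the most similar (query step, match step) pair and the similarity matrix."""
    S = query_spec.T @ match_spec            # (T_query, T_match) dot products
    i, j = np.unravel_index(np.argmax(S), S.shape)
    return int(i), int(j), S

def cut_at(query_audio, match_audio, i, j, hop=1024):
    """Keep the query up to step i, then the match from step j (hypothetical mapping)."""
    return np.concatenate([query_audio[: i * hop], match_audio[j * hop :]])

# Toy spectrograms: a single energetic time step in each clip.
q = np.zeros((4, 5)); q[:, 3] = [1, 2, 3, 4]
m = np.zeros((4, 6)); m[:, 1] = [1, 2, 3, 4]
i, j, S = max_sub_spectrogram(q, m)          # the energetic steps are selected
```

Because the dot product is un-normalized, high-energy time steps dominate the arg max, so the selected cut tends to land on loud events rather than quiet passages.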
We model this characteristic based on the variance of the spectrogram similarity matrix based on the hypothesis that audio pairs that exhibit high variance in their similarity matrix (e.g. impulsive sounds) require little-to-no crossfading, while audio pairs that exhibit low variance in their similarity matrix (e.g. noisy, static sounds) benefit from longer crossfades, as they have plausible transition points across multiple time steps. We use the inverse-variance of the computed pair similarity matrix to adaptively determine crossfade length, named “Adaptive Crossfade”:
[Equation 2: Adaptive Crossfade length as a function of the inverse variance of the similarity matrix]
Table 1: Audio match cut retrieval performance on the AudioSet and MovieClips evaluation sets.
Retrieval Methods | Dataset | R-mAP | HR@1 | HR@2 | HR@5 | P@5 | P@10 |
---|---|---|---|---|---|---|---|
Random | AudioSet [20] | .1093 | .0392 | .1373 | .3235 | .0804 | .0794 |
MFCC | AudioSet [20] | .4111 | .3725 | .5196 | .6961 | .3510 | .3206 |
Mel-Spectrogram | AudioSet [20] | .3318 | .3529 | .5392 | .6569 | .3157 | .2882 |
ImageBind [22] | AudioSet [20] | .5623 | .6471 | .7745 | .9314 | .5137 | .4696 |
CLAP [21] | AudioSet [20] | .7225 | .7843 | .9314 | .9608 | .6765 | .5990 |
Split-and-Contrast (Ours) | AudioSet [20] | .7656 | .8333 | .9608 | .9804 | .7216 | .6069 |
Random | MovieClips | .1176 | .1061 | .1364 | .3030 | .0727 | .0742 |
MFCC | MovieClips | .3266 | .3636 | .5000 | .6667 | .2576 | .2197 |
Mel-Spectrogram | MovieClips | .3337 | .3485 | .5606 | .7273 | .3758 | .3258 |
ImageBind [22] | MovieClips | .5209 | .4697 | .6212 | .7576 | .4939 | .4955 |
CLAP [21] | MovieClips | .7729 | .7424 | .8939 | .9848 | .7636 | .7136 |
Split-and-Contrast (Ours) | MovieClips | .7995 | .8788 | .9394 | 1.000 | .7758 | .7227 |
Here, controls the scaling of the relationship of the similarity matrix variance to the crossfade length. We use a value of . For the crossfade, we use a square-root window for fade-in and fade-out, with length and overlap of seconds. We use the same Mel-Spectrogram parameters described in Section 2.2. Note we use dot product to find the most similar time step pair and use cosine similarity in calculating the matrix variance to keep values bounded in a defined range. We found that using dot product takes the spectrogram magnitude into account (via un-normalized features) and as a result the transition point often occurs on time steps with higher energies, like strikes and impacts rather than quiet portions, which aligns well with many real-world audio match cuts.
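The inverse-variance idea can be sketched as follows. The exact formula, the scaling value, and any upper bound on the fade length are not recoverable from the text, so `alpha`, `max_seconds`, and `eps` below are hypothetical values chosen only to make the example run; the use of cosine similarity for the variance follows the description above.

```python
import numpy as np

def cosine_similarity_matrix(q, m, eps=1e-12):
    """Cosine similarity between every query time step and every match time step."""
    qn = q / np.maximum(np.linalg.norm(q, axis=0, keepdims=True), eps)
    mn = m / np.maximum(np.linalg.norm(m, axis=0, keepdims=True), eps)
    return qn.T @ mn                          # values bounded in [-1, 1]

def adaptive_crossfade_seconds(q, m, alpha=0.01, max_seconds=1.0, eps=1e-8):
    """Crossfade length proportional to the inverse variance of the similarity matrix.

    alpha, max_seconds, and eps are hypothetical choices for this sketch.
    """
    var = np.var(cosine_similarity_matrix(q, m))
    return float(min(alpha / (var + eps), max_seconds))

# Static, noise-like pair: every time step looks alike, so variance is ~0.
steady = np.ones((4, 5))
long_fade = adaptive_crossfade_seconds(steady, steady)     # hits the cap

# Impulsive pair: each time step matches only one other step.
spiky = np.eye(4)
short_fade = adaptive_crossfade_seconds(spiky, spiky)      # much shorter
```

The resulting length in seconds would then be converted to samples and used as both the window length and the overlap of a square-root crossfade.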
Transition Methods | Transition Score (0-3) |
---|---|
Concatenation | 0.821 |
Crossfade (0.25s) | 1.750 |
Crossfade (0.5s) | 1.714 |
Max-Sub-Spectrogram (Max-SS) (Ours) | 1.107 |
Max-SS + Adaptive Crossfade (Ours) | 2.143 |
3 Experiments
3.1 Evaluation Metrics
To evaluate audio retrieval performance, we use multiple standard metrics that are widely used across various retrieval tasks. Specifically, we measure retrieval mean average precision (R-mAP), HR@k, and P@k metrics. These metrics align well with the real-world use case of our proposed framework, where an editor is provided a set of audio match cuts to choose from, with the goal of the recommendations being high-quality audio match cuts.
For evaluating transition quality, we construct criteria to grade the overall quality of the audio transition of a positive audio match pair. We create four criteria, scored from 0 to 3 in order of increasing transition quality: 0) Transition is poor and directly noticeable. 1) Transition is noticeable but is still a fluid transition. 2) Transition is high-quality that strongly matches either rhythm or timbre/pitch. 3) Transition is imperceptible; the transition point cannot be directly heard.
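A minimal sketch of per-query ranking metrics is given below, using the conventional definitions: HR@k is read here as a hit rate (1 if any positive appears in the top k) and P@k as precision at k. The paper's exact R-mAP definition is not spelled out in the text, so standard average precision is used as a stand-in; all three readings are assumptions of this sketch.

```python
def hit_rate_at_k(ranked_labels, k):
    """1.0 if at least one positive appears in the top k results, else 0.0."""
    return 1.0 if any(ranked_labels[:k]) else 0.0

def precision_at_k(ranked_labels, k):
    """Fraction of the top k results that are positives."""
    return sum(ranked_labels[:k]) / k

def average_precision(ranked_labels):
    """Mean of the precision values at the rank of each positive."""
    hits, total = 0, 0.0
    for rank, is_positive in enumerate(ranked_labels, start=1):
        if is_positive:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

# Labels of one query's ranked gallery (1 = positive match, 0 = negative).
labels = [0, 1, 1, 0, 1]
hr1 = hit_rate_at_k(labels, 1)      # 0.0: the top result is a negative
hr2 = hit_rate_at_k(labels, 2)      # 1.0: a positive appears at rank 2
p5 = precision_at_k(labels, 5)      # 0.6: three of five are positives
ap = average_precision(labels)
```

Dataset-level numbers would be obtained by averaging each metric over all queries.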
3.2 Retrieval Evaluation
Table 1 shows quantitative audio match cut retrieval performance of multiple baseline methods against our proposed method. As shown, both the MFCC and Mel-Spectrogram representations outperform random selection of audio matches, showing that simple non-learnable representations are able to retrieve audio match cut candidates. However, the large-scale deep audio representations ImageBind [22] and CLAP [21] significantly outperform the non-learnable representations, with CLAP outperforming ImageBind [22] across all metrics. Although models like CLAP are trained for other tasks like language-audio alignment, the learned representations are still effective in audio-to-audio retrieval, as highly similar samples are often also perceptually similar, which is the main criterion for creating audio match cuts. Finally, we see that our “Split-and-Contrast” scheme outperforms CLAP [21] and all other methods across all retrieval metrics, showing that our self-supervised objective is effective for better aligning audio representations for the audio match cut task.
3.3 Transition Evaluation
To evaluate the quality of transition methods once an audio match is retrieved, we score the transition quality of 27 Audioset and 41 Movieclips positive matches. Table 2 shows the average transition scores for multiple baseline transition methods and our proposed method, with and without crossfading. Simple concatenation of the query and match audio often results in artifacts and audible discontinuities, which degrade the transition quality as the exact borders of the audio may not be perfectly aligned. Crossfading at multiple time lengths produces significantly higher-quality match cuts, as discontinuities and slight differences in spectra are blended away by the crossfade. Our method of selecting a specific transition point, named “Max-SS”, outperforms concatenation, showing that selecting a more optimal transition point within the 1-second query and match often results in a higher-quality transition. Adding our proposed adaptive crossfading yields the best transition performance, showing that the combination of selecting the optimal transition point and adaptively fading based on the audio characteristics outperforms each baseline.
We highlight that the performance of the transition methods is often a function of how perceptually similar the retrieved audio match is. The more closely the query and retrieved audio match, the less the need for advanced transition methods, as they already transition into one another well. For very high-quality match retrievals, simple transitions may result in audio match cuts of similar perceptual quality to our proposed method. However, our method allows for the alignment of the cut on specific sound events, like impacts and strikes. Therefore, the specific transition method is left to the user's choice, depending on the type of audio match cut that is desired.
4 Conclusion
In this paper, we introduce a framework to automatically find and create audio match cuts, an advanced video editing technique used in videos and movies. Analogous to visual match cutting [5], this work can be used to aid in the automatic creation of trailers, edits, montages, and other videos by creating high-quality audio match cuts that are interesting and appealing to viewers. In the future, we hope to explore more advanced audio blending methods beyond crossfading, in addition to creating audio-visual match cuts by incorporating the visual modality, with the ability to control specific audio-visual characteristics of the desired match cut.
References
- [1] James E Cutting, “The evolution of pace in popular movies,” Cognitive research: principles and implications, vol. 1, pp. 1–21, 2016.
- [2] Anton Karl Kozlovic, “Anatomy of film,” Kinema: A Journal for Film and Audiovisual Media, vol. 1, 2007.
- [3] John S Douglass and Glenn P Harnden, “The art of technique: An aesthetic approach to film and video production,” 1996.
- [4] Roy Thompson and Christopher J Bowen, Grammar of the Edit, Taylor & Francis, 2013.
- [5] Boris Chen, Amir Ziai, Rebecca S Tucker, and Yuchen Xie, “Match cutting: Finding cuts with smooth visual transitions,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 2115–2125.
- [6] Go Irie, Takashi Satou, Akira Kojima, Toshihiko Yamasaki, and Kiyoharu Aizawa, “Automatic trailer generation,” in Proceedings of the 18th ACM international conference on Multimedia, 2010, pp. 839–842.
- [7] Dawit Mureja Argaw, Fabian Caba Heilbron, Joon-Young Lee, Markus Woodson, and In So Kweon, “The anatomy of video editing: a dataset and benchmark suite for ai-assisted video editing,” in European Conference on Computer Vision. Springer, 2022, pp. 201–218.
- [8] Alejandro Pardo, Fabian Caba Heilbron, Juan León Alcázar, Ali Thabet, and Bernard Ghanem, “Moviecuts: A new dataset and benchmark for cut type recognition,” in European Conference on Computer Vision. Springer, 2022, pp. 668–685.
- [9] Alejandro Pardo, Fabian Caba, Juan León Alcázar, Ali K Thabet, and Bernard Ghanem, “Learning to cut by watching movies,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 6858–6868.
- [10] Yaojie Shen, Libo Zhang, Kai Xu, and Xiaojie Jin, “Autotransition: Learning to recommend video transition effects,” in European Conference on Computer Vision. Springer, 2022, pp. 285–300.
- [11] Sen Pei, Jingya Yu, Qi Chen, and Wozhou He, “Automatch: A large-scale audio beat matching benchmark for boosting deep learning assistant video editing,” arXiv preprint arXiv:2303.01884, 2023.
- [12] Stanley Kubrick and Arthur C. Clarke, “2001: A space odyssey,” 1968.
- [13] Andrew Adamson, “The chronicles of narnia: The lion, the witch and the wardrobe,” 2005.
- [14] Joren Six and Marc Leman, “Panako: a scalable acoustic fingerprinting system handling time-scale and pitch modification,” in 15th International society for music information retrieval conference (ISMIR-2014), 2014.
- [15] Sébastien Fenet, Gaël Richard, Yves Grenier, et al., “A scalable audio fingerprint method with robustness to pitch-shifting.,” in ISMIR, 2011, pp. 121–126.
- [16] Adhiraj Banerjee and Vipul Arora, “wav2tok: Deep sequence tokenizer for audio retrieval,” in The Eleventh International Conference on Learning Representations, 2022.
- [17] Sungkyun Chang, Donmoon Lee, Jeongsoo Park, Hyungui Lim, Kyogu Lee, Karam Ko, and Yoonchang Han, “Neural audio fingerprint for high-specific audio retrieval based on contrastive learning,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 3025–3029.
- [18] Len Vande Veire and Tijl De Bie, “From raw audio to a seamless mix: creating an automated dj system for drum and bass,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2018, no. 1, pp. 1–21, 2018.
- [19] Bo-Yu Chen, Wei-Han Hsu, Wei-Hsiang Liao, Marco A Martínez Ramírez, Yuki Mitsufuji, and Yi-Hsuan Yang, “Automatic dj transitions with differentiable audio effects and generative adversarial networks,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 466–470.
- [20] Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2017, pp. 776–780.
- [21] Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, and Shlomo Dubnov, “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
- [22] Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra, “Imagebind: One embedding space to bind them all,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 15180–15190.
- [23] Aaron van den Oord, Yazhe Li, and Oriol Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
- [24] Kihyuk Sohn, “Improved deep metric learning with multi-class n-pair loss objective,” Advances in neural information processing systems, vol. 29, 2016.
- [25] Ke Chen, Xingjian Du, Bilei Zhu, Zejun Ma, Taylor Berg-Kirkpatrick, and Shlomo Dubnov, “Hts-at: A hierarchical token-semantic audio transformer for sound classification and detection,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 646–650.
- [26] Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
- [27] Fitzgerald J Archibald, “Cross fade of digital audio streams.”
- [28] Lucian Lupsa-Tataru, “Audio fade-out profile shaping for interactive multimedia,” 2020.