Audio Match Cutting: Finding and Creating Matching
Audio Transitions in Movies and Videos

Abstract

A “match cut” is a common video editing technique where a pair of shots that have a similar composition transition fluidly from one to another. Although match cuts are often visual, certain match cuts involve the fluid transition of audio, where sounds from different sources merge into one indistinguishable transition between two shots. In this paper, we explore the ability to automatically find and create “audio match cuts” within videos and movies. We create a self-supervised audio representation for audio match cutting and develop a coarse-to-fine audio match pipeline that recommends matching shots and creates the blended audio. We further annotate a dataset for the proposed audio match cut task and compare the ability of multiple audio representations to find audio match cut candidates. Finally, we evaluate multiple methods to blend two matching audio candidates with the goal of creating a smooth transition. Project page and examples are available at: https://denfed.github.io/audiomatchcut/

Index Terms—  Self-Supervised Learning, Match Cuts, Audio Transitions, Audio Retrieval, Similarity Matching

1 Introduction

In movies and videos, the “cut” is a foundational editing technique that is used to transition from one scene or shot to the next [1]. The precise use of cuts often crafts the story being portrayed, whether it controls pacing, highlights emotions, or connects disparate scenes into a cohesive story [2]. There are many variations of cuts that are used across the film industry, including smash cuts, reaction cuts, J-cuts, L-cuts, and others. One specific cut is the “match cut”, which is a transition between a pair of shots that uses similar framing, composition, or action to fluidly bring the viewer from one scene to the next. Match cuts often match visuals across scenes through similar objects and their placement, colors, or camera movements [3]. However, match cuts can also match sound across scenes, where the sound of one scene transitions seamlessly into the sound of the next. These audio match cuts (also referred to as “sound bridges”) either blend together sounds or carry similar sound across scenes, often from different sound sources, to create a fluid audio transition between them [4]. Figure 1 shows examples of visual and audio match cuts found in movies.

Along with cutting, video editing as a whole is a time-consuming process that often involves a team of expert editors to create high-quality videos and movies. Creating match cuts often requires a manual search across a collection of recorded content to find strong candidates to transition to, which is tedious and slow [5]. As a result, AI-assisted video editing has emerged as a promising area of research, with the goal of helping editors improve the speed and quality of editing. Recent works focus on improving the understanding of movies, including detecting events and objects within them, like speakers [6], recognizing video attributes like shot angles, sequences, and locations [7], and understanding various cuts [8]. Beyond understanding videos, full editing tools have been proposed, including shot sequence ordering [7], automatic scene cutting [9], trailer generation [6], video transition creation [10], and audio beat matching [11]. Recently, [5] proposed a framework to automatically find frame and motion match cuts in movies. [5] collects a large-scale dataset of match cuts found in movies and further trains a classification network to retrieve match cut candidates, aiding video editors in finding and creating these match cuts. However, [5] only focuses on visual match cuts. In our work, we expand upon this area and focus on the ability to automatically find and create audio match cuts.

Fig. 1: Example match cuts in movies. In 2001: A Space Odyssey [12] (top), two different visuals transition fluidly based on the similar size and shape of the objects. In The Chronicles of Narnia: The Lion, the Witch and the Wardrobe [13] (bottom), the sound of a sword clinking within its sheath is matched to the strike of a hammer in the next scene, creating a seamless audio match across scenes.

Fig. 2: a) Proposed Framework. Given a query video, we retrieve an audio match cut candidate from a video gallery and find the optimal transition point using a sub-spectrogram similarity search. Using the variance of the created similarity matrix, we adaptively select the crossfade length to blend both the query and match audio into a fluid audio match cut. b) Proposed “Split-and-Contrast” contrastive objective. Each audio sample is split at a randomly selected frame, then the adjacent frames of the split are contrasted towards each other.

At its core, creating audio match cuts involves retrieving candidate audio clips that are able to create high-quality match cuts. Retrieving similar audio samples has been explored in the music domain with music information retrieval systems that retrieve full songs based on small song snippets, using signal processing techniques [14, 15] and, more recently, deep learning [16, 17]. Similarly, performing audio transitions is an often-used technique in the music domain, both in music mixing and in live DJ performances. In the literature, both signal processing [18] and deep-learning-based [19] techniques have been introduced to automatically create these transitions. However, finding and creating matching audio transitions has been unexplored in the context of movies and videos across a diverse set of sounds beyond music.

In this paper, we explore this problem and propose a self-supervised retrieval-and-transition framework, shown in Figure 2, to automatically find and create high-quality audio match cuts. Our contributions in this paper are as follows:

  • We introduce the problem of automatic audio match cut generation across diverse sounds and create two datasets for evaluating automatic audio match cutting methods.

  • We propose a framework where a coarse-to-fine audio retrieval pipeline first recommends matched clips, then a fine-grained transition method creates audio match cuts that outperform multiple baselines.

2 Method

2.1 Problem Definition

To model the real-world task of creating audio match cuts, we formulate our proposed audio match cut problem as a unimodal audio retrieval task. Specifically, given a query video clip $V_{q}$ and a collection of $n$ other video clips $G=\{i_{1},i_{2},...,i_{n}\}$, the goal is to retrieve a video clip $G_{i}$ and create an audio transition such that $V_{q}\Rightarrow G_{i}$ creates a high-quality audio match cut. We formulate the retrieval as a maximum inner-product search (MIPS) between extracted $L^{2}$-normalized feature representations of the query and a gallery of the audio of video clips, $z_{V_{q}},z_{G_{i}}\in\mathbb{Z}^{d}$, denoted by $z^{*}_{G_{i}}=\text{argmax}_{i}(z_{V_{q}}^{T}z_{G_{i}})$. After retrieving the top-$k$ most similar gallery clips $\{G^{*}_{i}\}_{i=1}^{k}$, we perform a processing operation $f_{p}$ to blend the query and retrieved clips to create the final audio match cuts $\{f_{p}(V_{q},G^{*}_{i})\}_{i=1}^{k}$, where a user selects which match cuts to use out of $k$ recommendations.
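As an illustration of the retrieval step, the following is a minimal NumPy sketch of a brute-force maximum inner-product search over $L^{2}$-normalized features. The function names `l2_normalize` and `retrieve_top_k` and the toy 2-dimensional features are hypothetical and chosen only for this example; the feature extractor itself is not shown.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit L2 norm."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)

def retrieve_top_k(query_feat, gallery_feats, k=5):
    """Return indices and scores of the k gallery rows with the largest inner product."""
    q = l2_normalize(np.asarray(query_feat, dtype=np.float64))      # (d,)
    g = l2_normalize(np.asarray(gallery_feats, dtype=np.float64))   # (n, d)
    scores = g @ q                           # inner products, one per gallery clip
    k = min(k, scores.shape[0])
    top = np.argsort(-scores, kind="stable")[:k]   # highest similarity first
    return top, scores[top]

# Toy gallery of four 2-dimensional features (illustrative values only).
gallery = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query = np.array([2.0, 0.1])
top_idx, top_scores = retrieve_top_k(query, gallery, k=2)
print(top_idx, top_scores)
```

For galleries with millions of 1-second clips, an approximate nearest-neighbor index would likely replace the exhaustive matrix product, but the ranking criterion stays the same.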

2.2 Data Collection

As the audio match cut problem is unexplored, we developed evaluation sets based on subsets of two publicly available sources, Audioset [20] and Movieclips (collected from the Movieclips YouTube channel), to evaluate audio match cut generation methods. Audioset contains user-generated videos from YouTube and Movieclips contains high-quality movie snippets. For each dataset, we split each video into 1-second non-overlapping image-audio pairs where the image is the middle frame of the respective second of video, resulting in over 2M Audioset and 800k Movieclips samples. We chose to perform retrieval over 1-second pairs to balance granularity and search complexity.
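The splitting step can be sketched as follows. This is a simplified illustration, not the actual preprocessing code: it assumes a mono waveform already loaded as a NumPy array, a known video frame rate `fps` (hypothetical value of 24 below), and that any trailing partial second is dropped.

```python
import numpy as np

def split_into_seconds(waveform, sample_rate, fps):
    """Split a mono waveform into non-overlapping 1-second segments.

    Returns a list of (audio_segment, middle_frame_index) pairs, where the
    index points at the video frame in the middle of that second.
    """
    n_seconds = len(waveform) // sample_rate       # drop a trailing partial second
    pairs = []
    for s in range(n_seconds):
        audio = waveform[s * sample_rate:(s + 1) * sample_rate]
        middle_frame = int((s + 0.5) * fps)        # frame at the midpoint of second s
        pairs.append((audio, middle_frame))
    return pairs

sr, fps = 48_000, 24                                # 48 kHz audio; fps is an assumption
wave = np.zeros(3 * sr + 1000, dtype=np.float32)    # a little over 3 seconds of audio
pairs = split_into_seconds(wave, sr, fps)
print(len(pairs), [m for _, m in pairs])
```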

Next, we collect a query set of samples of a variety of natural sounds and sound effects, including sounds like engines revving, impulsive sounds like a hammer striking, doorbells, campfires, and other unique sounds seen in videos and movies. For each query, we label a set of match candidates based on two criteria that constitute a positive audio match: i) the pair must sound plausible if the audio is swapped between the query and match images. ii) the pair must sound perceptually similar in terms of pitch, rhythm, timbre, etc.

As labeling random pairs across all samples results in an infeasible search space (over 4 trillion pairs), we use existing audio representations to help generate candidate audio matches. We hypothesize that, since the main characteristic of audio match cuts is that the audio of both scenes is perceptually similar, widely available audio representations may be used, as they are often trained with the goal of giving similar audio samples high similarity. We use two simple representations, the MFCC and Mel-Spectrogram, and two deep representations, the audio encoders from CLAP [21] and ImageBind [22]. For MFCC and Mel-Spectrogram, we use a window of 2048 samples and a hop length of 1024 samples. We flatten both representations along the time steps and use the resulting feature vectors for retrieval. For CLAP [21] and ImageBind [22], we use their respective spectrogram generation parameters and use the resulting audio encoder feature vectors for retrieval. We use the MIPS operation described in Section 2.1 to create the audio match cut candidates for labeling. All audio used in this work is sampled at 48 kHz.
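To make the flattened-representation idea concrete, the sketch below frames a 1-second, 48 kHz clip with a 2048-sample window and 1024-sample hop and flattens the result into one feature vector. Note the substitutions: it uses a plain magnitude STFT computed with NumPy as a stand-in for the Mel-Spectrogram and MFCC, a Hann window, and no padding. All of these are assumptions made for a self-contained example, and the test signals are synthetic.

```python
import numpy as np

def magnitude_spectrogram(audio, n_fft=2048, hop=1024):
    """Magnitude STFT (Hann window, no padding) of shape (n_fft // 2 + 1, n_frames)."""
    n_frames = 1 + (len(audio) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([audio[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T

def flat_feature(audio):
    """Flatten the spectrogram along time and L2-normalize the resulting vector."""
    v = magnitude_spectrogram(audio).reshape(-1)
    return v / max(np.linalg.norm(v), 1e-12)

sr = 48_000
t = np.arange(sr) / sr
tone_a = np.sin(2 * np.pi * 440.0 * t)                    # synthetic 440 Hz tone
tone_b = np.sin(2 * np.pi * 445.0 * t)                    # slightly detuned tone
noise = np.random.default_rng(0).standard_normal(sr)      # white noise

spec = magnitude_spectrogram(tone_a)
sim_tones = flat_feature(tone_a) @ flat_feature(tone_b)   # inner product as in Section 2.1
sim_noise = flat_feature(tone_a) @ flat_feature(noise)
print(spec.shape, sim_tones, sim_noise)
```

Under these assumptions each clip yields 45 time steps, and the two similar tones score a much higher inner product than the tone-noise pair, which is the property the candidate generation relies on.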

Since we used audio representations to collect audio match candidate pairs and labeled only those pairs, our evaluation set tends to favor the most similar candidates of each representation. To address this bias and create a more comprehensive evaluation, we randomly sample 100 negative matches for each query in the Audioset and Movieclips datasets. Because these samples are drawn at random from millions of 1-second samples, it is very unlikely that any of them is in fact a positive audio match. The resulting Audioset evaluation set has a gallery of 12,350 labeled samples spread across 102 queries, and the Movieclips evaluation set has a gallery of 8,289 labeled samples spread across 66 queries. Each query has an average of 123 labeled samples and 10 positive matches.

2.3 Audio Match Cut Representation Learning

In Section 2.2, we utilize existing audio representations to generate audio match cut candidates. However, existing audio representations are not directly aligned for the audio match cut task, which aims to retrieve perceptually-similar audio from different scenes, differing from existing retrieval tasks. As a result, existing audio representations may produce sub-optimal audio match cut candidates.

Lacking labeled data for audio match cutting, we propose a self-supervised learning objective to create an audio representation that effectively retrieves high-quality audio match cut candidates. Our objective leverages already-edited videos based on the notion that given a query audio frame of a video, an audio frame that results in a high-quality match cut is the next successive frame in the same video, as the entire video has been previously edited to have continuous audio. We model this characteristic as “Split-and-Contrast”, shown in Figure 2b, where adjacent audio frames in two splits of a video are trained to have high similarity, while contrasting away other non-adjacent audio frames.

Given a batch of $N$ audio samples that have $n$ audio frames, for every sample, we extract a feature representation $z$ from each frame, $\{z_{k}\}_{k=1}^{n}\in\mathbb{Z}^{d\times n}$, where $d$ is the feature representation size. We then randomly select an index to split the $N$ sets of features into left/right sections $z_{\alpha}\in\mathbb{Z}^{d\times n_{\alpha}}$ and $z_{\beta}\in\mathbb{Z}^{d\times n_{\beta}}$, of length $n_{\alpha}$ and $n_{\beta}$, such that $n_{\alpha}+n_{\beta}=n$. For each left/right section in $N$, we denote the adjacent frames as $z_{k_{l}}=z_{\alpha_{n_{\alpha}}}$ and $z_{k_{r}}=z_{\beta_{0}}$, corresponding to the last frame in the left section and the first frame in the right section, respectively. Then, we define a contrastive learning formulation for a batch of $N$ samples to learn a representation that produces high similarity for only the adjacent frames in the split sections, and low similarity for all other pairs:

$$\mathcal{L}_{S\&C}(N)=-\text{log}\left(\frac{\sum_{k=1}^{N}\text{exp}(z_{k_{l}}^{T}z_{k_{r}}/\tau)}{\sum_{i=1}^{N\cdot n_{\alpha}}\sum_{j=1}^{N\cdot n_{\beta}}\text{exp}(z_{\alpha_{i}}^{T}z_{\beta_{j}}/\tau)}\right)$$ (1)

Here, $z_{a}^{T}z_{b}$ denotes the inner product of $L^{2}$-normalized vectors, and $\tau$ denotes a temperature parameter for the softmax. This formulation is similar to the InfoNCE [23] and N-Pair [24] losses, modified to allow multiple positives in a single loss computation. By maximizing the similarity of two adjacent frames in a split audio sample, we expect the model to learn to retrieve perceptually similar audio frames that result in high-quality transitions. We use the pretrained CLAP [21] audio encoder, based on the HTSAT [25] architecture, and the CLAP [21] linear projection layers. We also use the spectrogram creation and preprocessing steps defined in [21], with an audio frame size of 1 second. We found that fine-tuning the CLAP [21] projection layers with a frozen encoder works better than end-to-end fine-tuning, suggesting that “Split-and-Contrast” is better suited for aligning existing audio feature representations for the audio match cut task.
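The objective in Eq. (1) can be sketched in NumPy as below. This is an illustrative forward computation only, under stated assumptions: frame features are given as an array of shape $(N, n, d)$, one split index `n_alpha` is shared by the whole batch, and the helper names are hypothetical. It does not include the encoder, the projection layers, or gradient computation.

```python
import numpy as np

def _logsumexp(x):
    """Numerically stable log(sum(exp(x))) over all entries of x."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def split_and_contrast_loss(feats, n_alpha, tau=0.1):
    """Eq. (1) for frame features of shape (N, n, d), split after n_alpha frames."""
    N, n, d = feats.shape
    if not 1 <= n_alpha <= n - 1:
        raise ValueError("n_alpha must leave at least one frame on each side")
    z = feats / np.linalg.norm(feats, axis=-1, keepdims=True)   # L2-normalize each frame
    left, right = z[:, :n_alpha], z[:, n_alpha:]                # (N, n_alpha, d), (N, n_beta, d)
    # Positives: last frame of the left section vs. first frame of the right section.
    pos = np.sum(left[:, -1] * right[:, 0], axis=-1) / tau      # (N,)
    # Denominator: every left frame against every right frame across the batch.
    all_pairs = left.reshape(-1, d) @ right.reshape(-1, d).T / tau
    return _logsumexp(all_pairs) - _logsumexp(pos)              # -log(numerator / denominator)

rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 10, 16))     # N=4 samples, n=10 frames, d=16 (toy sizes)
n_alpha = int(rng.integers(1, 10))           # random split index in [1, 9]
loss = split_and_contrast_loss(feats, n_alpha)
print(n_alpha, loss)
```

Because the positive pairs are a subset of the pairs in the denominator, the loss is non-negative and approaches zero only when the adjacent-frame pairs dominate all other left/right pairs.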

We train the projection layers using 200k random Audioset samples for 20 epochs using the Adam [26] optimizer, a learning rate of $10^{-4}$, a batch size of 2048, and a temperature $\tau$ of 0.1. Each sample has ten 1-second audio frames, such that $n_{\alpha}+n_{\beta}=10$.

2.4 Audio Transition

One common method of transitioning between two audio samples is the crossfade, where the first audio clip fades out while simultaneously fading the second clip in, resulting in a smooth transition [27]. However, creating high-quality transitions using crossfade often requires manual tuning of the crossfade length, based on the audio characteristics [28]. In this section, we describe our audio transition method that improves upon simple crossfade by i) first finding a specific transition point within the 1-second clip, and ii) adaptively selecting a more optimal crossfade length based on the audio characteristics.

Fig. 3: Example sub-spectrogram similarities of audio match cuts: A forging hammer striking matched with a knife chopping (left) exhibits high similarity on each strike occurrence. A blender matched with a motorcycle revving (right) shows a smoother similarity matrix, allowing for plausible transitions across multiple time steps.

Since we perform retrieval using 1-second audio clips, the matched clips may be overall strong candidates for an audio match cut, but the exact borders of each still may not align well for a direct transition. Therefore, we propose an operation to find an optimal transition point between the query and matched clip at the spectrogram time-step-level, named “Max Sub-Spectrogram” similarity search, shown in Figure 2a.

Given a Mel-spectrogram representation of the query audio $S_{Q}\in\mathbb{R}^{f\times t}$ and matched audio $S_{M}\in\mathbb{R}^{f\times t}$, where $f$ and $t$ denote the frequency bins and time steps, respectively, we calculate the inner product of the two spectrograms across time steps, yielding a similarity matrix $M=S_{Q}^{T}S_{M}\in\mathbb{R}^{t\times t}$. We then find the most similar time step pair, $\text{argmax}_{i,j}(M)$. We hypothesize that the most similar time step pair in $M$ yields a strong point to transition between the query and match, as the audio spectra are most aligned.
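A minimal sketch of the “Max Sub-Spectrogram” search is given below, assuming the two spectrograms are already available as NumPy arrays of shape $(f, t)$. The toy spectrograms are synthetic, and the conversion from time steps to sample offsets via the hop length is an assumption about how the selected pair would be applied, not a detail stated above.

```python
import numpy as np

def max_sub_spectrogram(spec_q, spec_m):
    """Return (i, j, M): the most similar time-step pair and M = S_Q^T S_M."""
    sim = spec_q.T @ spec_m                       # (t, t) un-normalized dot products
    i, j = np.unravel_index(np.argmax(sim), sim.shape)
    return int(i), int(j), sim

# Toy spectrograms with f = 4 frequency bins and t = 6 time steps.
spec_q = np.full((4, 6), 0.1)
spec_m = np.full((4, 6), 0.1)
spec_q[:, 4] = [0.0, 3.0, 3.0, 0.0]   # loud event at query time step 4
spec_m[:, 1] = [0.0, 2.0, 2.0, 0.0]   # spectrally similar event at match time step 1

i, j, sim = max_sub_spectrogram(spec_q, spec_m)
hop = 1024                             # hop length from Section 2.2
cut_q, cut_m = i * hop, j * hop        # assumed mapping from time steps to sample offsets
print(i, j, cut_q, cut_m)
```

Because the dot product is not normalized, the search favors energetic time steps such as the two synthetic events here.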

After finding the transition point, we perform a crossfade to further blend together the query and match audio clips. However, as previously mentioned, certain types of audio may benefit from crossfades of different lengths [28]. Figure 3 shows two examples of audio matches that require different crossfades. When matching the strikes of a hammer and a knife, long crossfades cause the impacts to overlap each other, resulting in a blurry, low-quality transition. When matching a blender and a motorcycle, the audio is noisy throughout the sample and benefits from longer crossfades, as both noisy samples blend into each other slowly.

We model this characteristic using the variance of the spectrogram similarity matrix. Our hypothesis is that audio pairs with high variance in their similarity matrix (e.g., impulsive sounds) require little to no crossfading, while audio pairs with low variance in their similarity matrix (e.g., noisy, static sounds) benefit from longer crossfades, as they have plausible transition points across multiple time steps. We use the inverse variance of the pair's similarity matrix to adaptively determine the crossfade length, which we name “Adaptive Crossfade”:

$$l_{\text{crossfade}}=\frac{1}{Var(\overline{M})\phi};\quad\overline{M}=\frac{S_{Q_{i}}^{T}S_{M_{j}}}{||S_{Q_{i}}||\,||S_{M_{j}}||}\;\forall i,j\in\{1,...,t\}$$ (2)

Retrieval Methods            Dataset          R-mAP   HR@1    HR@2    HR@5    P@5     P@10
Random                       AudioSet [20]    .1093   .0392   .1373   .3235   .0804   .0794
MFCC                         AudioSet [20]    .4111   .3725   .5196   .6961   .3510   .3206
Mel-Spectrogram              AudioSet [20]    .3318   .3529   .5392   .6569   .3157   .2882
ImageBind [22]               AudioSet [20]    .5623   .6471   .7745   .9314   .5137   .4696
CLAP [21]                    AudioSet [20]    .7225   .7843   .9314   .9608   .6765   .5990
Split-and-Contrast (Ours)    AudioSet [20]    .7656   .8333   .9608   .9804   .7216   .6069
Random                       MovieClips       .1176   .1061   .1364   .3030   .0727   .0742
MFCC                         MovieClips       .3266   .3636   .5000   .6667   .2576   .2197
Mel-Spectrogram              MovieClips       .3337   .3485   .5606   .7273   .3758   .3258
ImageBind [22]               MovieClips       .5209   .4697   .6212   .7576   .4939   .4955
CLAP [21]                    MovieClips       .7729   .7424   .8939   .9848   .7636   .7136
Split-and-Contrast (Ours)    MovieClips       .7995   .8788   .9394   1.000   .7758   .7227
Table 1: Audio retrieval results on the labeled audio match cut evaluation set from AudioSet [20] and MovieClips.

Here, $\phi$ controls the scaling of the relationship between the similarity matrix variance and the crossfade length. We use a value of $\phi=8$. For the crossfade, we use a square-root window for fade-in and fade-out, with length and overlap of $l_{\text{crossfade}}$ seconds. We use the same Mel-Spectrogram parameters described in Section 2.2. Note that we use the dot product to find the most similar time step pair, and use cosine similarity when calculating the matrix variance to keep values bounded in a defined range. We found that the dot product takes the spectrogram magnitude into account (via un-normalized features); as a result, the transition point often occurs on time steps with higher energy, like strikes and impacts, rather than on quiet portions, which aligns well with many real-world audio match cuts.
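The following NumPy sketch illustrates Eq. (2) together with a square-root crossfade. It is a simplified illustration under several assumptions that are not specified above: the function names are hypothetical, the variance is floored by a small `eps` to avoid division by zero, the crossfade length is clamped to the available audio, and the spectrograms and waveforms are synthetic.

```python
import numpy as np

def adaptive_crossfade_length(spec_q, spec_m, phi=8.0, eps=1e-12):
    """Crossfade length in seconds from Eq. (2): 1 / (Var(M_bar) * phi)."""
    q = spec_q / np.maximum(np.linalg.norm(spec_q, axis=0, keepdims=True), eps)
    m = spec_m / np.maximum(np.linalg.norm(spec_m, axis=0, keepdims=True), eps)
    m_bar = q.T @ m                     # cosine similarity of every time-step pair
    return 1.0 / (max(np.var(m_bar), eps) * phi)

def sqrt_crossfade(first, second, n_overlap):
    """Blend the tail of `first` into the head of `second` with square-root fades."""
    n_overlap = int(min(n_overlap, len(first), len(second)))   # clamp (assumption)
    if n_overlap == 0:
        return np.concatenate([first, second])
    ramp = np.linspace(0.0, 1.0, n_overlap)
    fade_in, fade_out = np.sqrt(ramp), np.sqrt(1.0 - ramp)
    mixed = first[-n_overlap:] * fade_out + second[:n_overlap] * fade_in
    return np.concatenate([first[:-n_overlap], mixed, second[n_overlap:]])

# Impulsive pair: time steps alternate between two orthogonal spectra.
spec_a = np.array([1.0, 0.0, 0.0, 0.0])
spec_b = np.array([0.0, 1.0, 0.0, 0.0])
impulsive = np.stack([spec_a, spec_b] * 4, axis=1)            # shape (4, 8)
# Stationary pair: nearly identical spectra at every time step.
stationary = np.random.default_rng(0).uniform(0.9, 1.1, size=(4, 8))

len_impulsive = adaptive_crossfade_length(impulsive, impulsive)
len_stationary = adaptive_crossfade_length(stationary, stationary)

sr = 48_000
q_audio, m_audio = np.ones(sr), np.zeros(sr)                  # synthetic 1-second clips
n_overlap = int(min(len_impulsive, 1.0) * sr)                 # clamp to the clip length
out = sqrt_crossfade(q_audio, m_audio, n_overlap)
print(len_impulsive, len_stationary, len(out))
```

In this toy setting the alternating (impulsive) pair has a similarity variance of 0.25 and therefore a short crossfade, while the near-constant (stationary) pair has a very small variance and a much longer one, matching the intended behavior.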

Transition Methods                       Transition Score (0-3)
Concatenation                            0.821
Crossfade (0.25s)                        1.750
Crossfade (0.5s)                         1.714
Max-Sub-Spectrogram (Max-SS) (Ours)      1.107
Max-SS + Adaptive Crossfade (Ours)       2.143
Table 2: Transition scores on audio transition methods.

3 Experiments

3.1 Evaluation Metrics

To evaluate audio retrieval performance, we use multiple standard metrics that are widely used across various retrieval tasks. Specifically, we measure retrieval mean average precision (R-mAP), hit-rate@$K$, and precision@$K$ metrics. These metrics align well with the real-world use case of our proposed framework, where an editor is provided $K$ audio match cuts to choose from, with the goal of the $K$ recommendations being high-quality audio match cuts.
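For reference, these metrics can be sketched in plain Python as below. Each query is represented by its ranked list of binary labels (1 for a positive match, 0 otherwise). The average precision shown is the common textbook definition computed over the full ranked list, which is an assumption about how R-mAP is computed here, and the two ranked lists are made-up examples.

```python
def hit_rate_at_k(ranked_labels, k):
    """1.0 if at least one of the top-k results is a positive match, else 0.0."""
    return 1.0 if any(ranked_labels[:k]) else 0.0

def precision_at_k(ranked_labels, k):
    """Fraction of the top-k results that are positive matches."""
    return sum(ranked_labels[:k]) / k

def average_precision(ranked_labels):
    """Mean of precision@rank taken at the rank of every positive match."""
    hits, total = 0, 0.0
    for rank, is_positive in enumerate(ranked_labels, start=1):
        if is_positive:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def mean_over_queries(metric, all_ranked_labels, *args):
    """Average a per-query metric over all queries."""
    return sum(metric(r, *args) for r in all_ranked_labels) / len(all_ranked_labels)

# Two hypothetical queries with five ranked gallery results each.
rankings = [[1, 0, 1, 0, 0], [0, 0, 1, 1, 0]]
mean_ap = mean_over_queries(average_precision, rankings)
hr_at_1 = mean_over_queries(hit_rate_at_k, rankings, 1)
p_at_5 = mean_over_queries(precision_at_k, rankings, 5)
print(mean_ap, hr_at_1, p_at_5)
```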

For evaluating transition quality, we construct criteria to grade the overall quality of the audio transition of a positive audio match pair. We define four grades, from 0 to 3 in order of increasing transition quality: 0) Transition is poor and directly noticeable. 1) Transition is noticeable but is still a fluid transition. 2) Transition is high-quality and strongly matches either rhythm or timbre/pitch. 3) Transition is imperceptible; the transition point cannot be directly heard.

3.2 Retrieval Evaluation

Table 1 shows quantitative audio match cut retrieval performance of multiple baseline methods against our proposed method. As shown, both the MFCC and Mel-Spectrogram representations outperform random selection of audio matches, showing that simple non-learnable representations are able to retrieve audio match cut candidates. However, the large-scale deep audio representations ImageBind [22] and CLAP [21] significantly outperform the non-learnable representations, with CLAP outperforming ImageBind [22] across all metrics. Although models like CLAP are trained for other tasks like language-audio alignment, the learned representations are still effective in audio-to-audio retrieval, as highly similar samples are often also perceptually similar, which is the main criterion for creating audio match cuts. Finally, we see that our “Split-and-Contrast” scheme outperforms CLAP [21] and all other methods across all retrieval metrics, showing that our self-supervised objective is effective for better aligning audio representations for the audio match cut task.

3.3 Transition Evaluation

To evaluate the quality of transition methods once an audio match is retrieved, we score the transition quality of 27 Audioset and 41 Movieclips positive matches. Table 2 shows the average transition scores for multiple baseline transition methods and our proposed method, with and without crossfading. Simple concatenation of the query and match audio often results in artifacts and audible discontinuities, which degrade the transition quality, as the exact borders of the audio may not be perfectly aligned. Crossfading at multiple time lengths produces significantly higher-quality match cuts, as discontinuities and slight differences in spectra are blended away by the crossfade. Our method of selecting a specific transition point, named “Max-SS”, outperforms concatenation, showing that selecting a more optimal transition point within the 1-second query and match often results in a higher-quality transition. Adding our proposed adaptive crossfading yields the best transition performance, showing that the combination of selecting the optimal transition point and adaptively fading based on the audio characteristics outperforms each baseline.

We highlight that the performance of the transition methods is often a function of how perceptually similar the retrieved audio match is. The more closely the query and retrieved audio match, the less the need for advanced transition methods, as they already transition into one another well. For very high-quality match retrievals, simple transitions may result in audio match cuts of similar perceptual quality to our proposed method. However, our method allows for the alignment of the cut on specific sound events, like impacts and strikes. Therefore, the specific transition method is left to the user's choice, depending on the type of audio match cut that is desired.

4 Conclusion

In this paper, we introduce a framework to automatically find and create audio match cuts, an advanced video editing technique used in videos and movies. Analogous to visual match cutting [5], this work can be used to aid in the automatic creation of trailers, edits, montages, and other videos by creating high-quality audio match cuts that are interesting and appealing to viewers. In the future, we hope to explore more advanced audio blending methods beyond crossfading, in addition to creating audio-visual match cuts by incorporating the visual modality, with the ability to control specific audio-visual characteristics of the desired match cut.

References

  • [1] James E Cutting, “The evolution of pace in popular movies,” Cognitive research: principles and implications, vol. 1, pp. 1–21, 2016.
  • [2] Anton Karl Kozlovic, “Anatomy of film,” Kinema: A Journal for Film and Audiovisual Media, vol. 1, 2007.
  • [3] John S Douglass and Glenn P Harnden, “The art of technique: An aesthetic approach to film and video production,” 1996.
  • [4] Roy Thompson and Christopher J Bowen, Grammar of the Edit, Taylor & Francis, 2013.
  • [5] Boris Chen, Amir Ziai, Rebecca S Tucker, and Yuchen Xie, “Match cutting: Finding cuts with smooth visual transitions,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 2115–2125.
  • [6] Go Irie, Takashi Satou, Akira Kojima, Toshihiko Yamasaki, and Kiyoharu Aizawa, “Automatic trailer generation,” in Proceedings of the 18th ACM international conference on Multimedia, 2010, pp. 839–842.
  • [7] Dawit Mureja Argaw, Fabian Caba Heilbron, Joon-Young Lee, Markus Woodson, and In So Kweon, “The anatomy of video editing: a dataset and benchmark suite for ai-assisted video editing,” in European Conference on Computer Vision. Springer, 2022, pp. 201–218.
  • [8] Alejandro Pardo, Fabian Caba Heilbron, Juan León Alcázar, Ali Thabet, and Bernard Ghanem, “Moviecuts: A new dataset and benchmark for cut type recognition,” in European Conference on Computer Vision. Springer, 2022, pp. 668–685.
  • [9] Alejandro Pardo, Fabian Caba, Juan León Alcázar, Ali K Thabet, and Bernard Ghanem, “Learning to cut by watching movies,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 6858–6868.
  • [10] Yaojie Shen, Libo Zhang, Kai Xu, and Xiaojie Jin, “Autotransition: Learning to recommend video transition effects,” in European Conference on Computer Vision. Springer, 2022, pp. 285–300.
  • [11] Sen Pei, Jingya Yu, Qi Chen, and Wozhou He, “Automatch: A large-scale audio beat matching benchmark for boosting deep learning assistant video editing,” arXiv preprint arXiv:2303.01884, 2023.
  • [12] Stanley Kubrick and Arthur C. Clarke, “2001: A space odyssey,” 1968.
  • [13] Andrew Adamson, “The chronicles of narnia: The lion, the witch and the wardrobe,” 2005.
  • [14] Joren Six and Marc Leman, “Panako: a scalable acoustic fingerprinting system handling time-scale and pitch modification,” in 15th International society for music information retrieval conference (ISMIR-2014), 2014.
  • [15] Sébastien Fenet, Gaël Richard, Yves Grenier, et al., “A scalable audio fingerprint method with robustness to pitch-shifting,” in ISMIR, 2011, pp. 121–126.
  • [16] Adhiraj Banerjee and Vipul Arora, “wav2tok: Deep sequence tokenizer for audio retrieval,” in The Eleventh International Conference on Learning Representations, 2022.
  • [17] Sungkyun Chang, Donmoon Lee, Jeongsoo Park, Hyungui Lim, Kyogu Lee, Karam Ko, and Yoonchang Han, “Neural audio fingerprint for high-specific audio retrieval based on contrastive learning,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 3025–3029.
  • [18] Len Vande Veire and Tijl De Bie, “From raw audio to a seamless mix: creating an automated dj system for drum and bass,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2018, no. 1, pp. 1–21, 2018.
  • [19] Bo-Yu Chen, Wei-Han Hsu, Wei-Hsiang Liao, Marco A Martínez Ramírez, Yuki Mitsufuji, and Yi-Hsuan Yang, “Automatic dj transitions with differentiable audio effects and generative adversarial networks,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 466–470.
  • [20] Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2017, pp. 776–780.
  • [21] Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, and Shlomo Dubnov, “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
  • [22] Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra, “Imagebind: One embedding space to bind them all,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 15180–15190.
  • [23] Aaron van den Oord, Yazhe Li, and Oriol Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
  • [24] Kihyuk Sohn, “Improved deep metric learning with multi-class n-pair loss objective,” Advances in neural information processing systems, vol. 29, 2016.
  • [25] Ke Chen, Xingjian Du, Bilei Zhu, Zejun Ma, Taylor Berg-Kirkpatrick, and Shlomo Dubnov, “Hts-at: A hierarchical token-semantic audio transformer for sound classification and detection,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 646–650.
  • [26] Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [27] Fitzgerald J Archibald, “Cross fade of digital audio streams.”
  • [28] Lucian Lupsa-Tataru, “Audio fade-out profile shaping for interactive multimedia,” 2020.