Learning Sonata Form Structure on Mozart’s String Quartets

Pierre Allegraud; Louis Bigo; Laurent Feisthauer; Mathieu Giraud; Richard Groult; Emmanuel Leguy; Florence Levé

1 Introduction

1.1 Sonata form

The large-scale structure referred to as sonata form is a post-hoc formalization of a compositional practice widely used since the middle of the 18th century. It is built on a piece-level tonal path involving both a primary thematic zone (P) and a contrasting secondary thematic zone (S) (Figure 1). This creates a polarization between two tonalities and gives the piece a dramatic turn. The sonata form can be viewed as an evolution of both aria and concerto Baroque forms (; ). Greenberg () investigated how sonata-form recapitulation may have come from both the double return of the tonic key and the parallel endings in a two-part movement.

Figure 1

Andante con moto of the String Quartet #16 in E♭ Major, K 428, 2nd movement. Encoded in Lilypond by Maurizio Tomasi for the Mutopia Project. This slow movement has a sonata form, as detailed in Section 2.1. Following notations of Hepokoski and Darcy (), the primary themes (P/P’) are followed by transitions (TR/TR’), ended with Medial Caesuras (MC/MC’) – they are here Half Cadences (HC) in the main tonality (I). In the exposition, the secondary theme (S) and the conclusion (C) are here in the tonality of the dominant (V, E♭ major). In the recapitulation, both S’ and C’ come back to the main tonality. In the exposition, the S theme ends with a perfect authentic cadence (PAC) named essential expositional closure (EEC), whereas, in the recapitulation, the S’ theme ends with an essential structural closure (ESC). Between the exposition and the recapitulation, the development (Dev) moves to other keys and is concluded by a retransition (RT) focusing on the dominant of the primary key.

A number of works composed by Haydn, Mozart and Beethoven are recognized as being in sonata form, especially first movements of string quartets, concerti, symphonies, and piano sonatas. However, the theories about the classical sonata form were introduced almost fifty years after its early golden era (; ; ). One of its earliest formalizations seems to be the grande coupe binaire that Reicha () described 30 years after Mozart died. The sonata form finally became a normative structure for several generations of romantic composers, being transmitted through both explicit teaching and implicit exposure.

Nowadays, sonata forms are still taught in music analysis, music history and composition lectures. They are also the focus of recent academic studies (; ; ; , ; ; ; ; ; ; ). The past decades have seen a revival of the Formenlehre tradition in the classical era (). In Caplin ()’s theory of formal functions, small functional units at the idea level (e.g., basic idea, contrasting idea) are combined to form units at the phrase level (e.g., presentation, antecedent), which in turn are combined to form units at the theme level (e.g., sentence, period, etc.). This bottom-up approach builds up to the whole sonata form, paving the way to the three large-scale functions that are characteristic of sonata form: Exposition, Development, and Recapitulation, possibly including two other functions, Introduction and Coda. In this study, we rather follow the Sonata Theory of Hepokoski and Darcy (), where sonata form is viewed as an “ordered system of generically available options permitting the spanning of ever larger expanses of time” (ibid., p. 15). Their detailed formalization of the successive sections of the sonata form seems well suited to developing computational models.

1.2 MIR, high-level structure, and sonata form

On the one hand, “analyzing a sonata form”, which implies identifying the boundaries of its successive sections, often requires a number of musicological judgments that are piece-specific, which makes its automation difficult. Being strongly linked to music history, music analysis may indeed include ideas that involve the singularity of the piece, a comparison between composers as well as some aesthetic considerations. On the other hand, music analyses are often built upon specific analytical elements, like themes or patterns that structure the harmony and the texture of the piece. Analyses can therefore be modelled with Music Information Retrieval (MIR) algorithms that can be properly evaluated. Finally, the identification of a large-scale structure such as the sonata form requires the combination of these local features to reach a piece-level analysis, which is itself a challenge for MIR research. We previously reviewed research on computational analysis of musical form (). Chen et al. () proposed to segment the musical piece into sections called “sentences”, clustering phrases predicted by the LBDM algorithm by Cambouropoulos (). Rafael and Oertl () built a global structure from patterns extracted by the algorithm from Hsu et al. (). Some studies, such as by Hamanaka et al. (), have attempted to compute large-scale structures as theorized by Schenker () or later by the Generative Theory of Tonal Music (GTTM) of Lerdahl and Jackendoff (). Other works also modeled specific large-scale features, such as tonal tension (; ).

MIR modeling of high-level structures has also been employed in the field of music generation, wherein algorithms often have difficulties in producing long-term coherence. Herremans and Chew () proposed to formulate this task as a combinatorial optimization problem. Nika et al. () used harmonic scenarios to produce structured music improvisation. Medeot et al. () elaborated a Recurrent Neural Network trained on a dataset of structural elements.

Finally, some research in the MIR community specifically targets sonata form structure: Jiang and Müller () detected exposition/recapitulation pairs in Beethoven piano sonatas with self-similarity matrices. They also traced transpositions and harmonic changes through the different parts. Weiß and Müller () proposed a model of “tonal complexity” and mapped it on sections of sonata forms. Baratè et al. () introduced a model of sonata form structure based on Petri Nets. We previously proposed a model based on a Hidden Markov Model (HMM) emitting analytical features (). This model relied on human expertise, following the layout of sonata form as presented by Hepokoski and Darcy (). This previous approach was applied to a small set of pieces and the parameters of the model were hard-coded, based on music theory assumptions.

1.3 Contributions

Reproducible MIR research needs to be grounded on publicly available datasets. Here, we systematically study a corpus containing most of the sonata-form movements in Mozart’s string quartets, and we release an open dataset providing two independent analyses of each movement, encoded manually, based on formal modeling of sonata form (Section 2). Extending the approach we introduced before (), we propose several models of sonata form using Hidden Markov Models for which parameters, emission probabilities, and transition probabilities are automatically learned on the corpus. The states of the HMMs represent the different sections of a sonata form and the observations consist of binary analytical features computed through the pieces (Section 3). We discuss the relationship between the occurrences of these features and the sonata form sections.

The results show that the sonata form is better identified when the parameters are learned rather than manually set up. We also study how the granularity of the model (i.e. the number of possible states) influences the success of the detection (Section 4).

2 The Mozart Sonata-Form String Quartet Corpus

2.1 Annotating sonata form

Annotating musical structure is challenging, subjective, and may involve different hypotheses from the analyst. Although different analysts might model sonata forms differently, there are points of consensus. In this work, we follow the notations of Hepokoski and Darcy (). Basically, a sonata form is built by following a piece-level tonal path involving a primary thematic zone (P) and a contrasting secondary thematic zone (S). This is illustrated in Figure 1 on a specific movement.

More precisely, the structure goes through the following parts:

possibly an introduction (Intro);
an exposition (Exp), including a thematic zone P in the main tonality (denoted by I), and a thematic zone S in an auxiliary tonality (usually the tonality of the dominant of I, denoted by V, for major-mode sonata movements). A transition (TR) bridges the two themes and triggers the modulation between the two tonalities. The transition ends with a perfect authentic or half cadence called the Medial Caesura (MC) (), with “a decisive change of texture” (). The S zone generally concludes with a Perfect Authentic Cadence (PAC) called the Essential Expositional Closure (EEC). It is followed by a closing zone (C) rounding off the exposition by reinforcing the key of the EEC. The exposition is generally repeated once;
a development (Dev) characterized by tonal instability, in which the existing themes are transformed and new themes can be introduced, possibly closed by a retransition (RT), that modulates back to the main tonality;
a recapitulation (Rec) of P and S themes, now both in the tonality of the tonic, possibly including elements that were added throughout the development. Recapitulation follows a layout analogous to the exposition (P’, TR’ ended with MC’, S’ ended with an Essential Structural Closure (ESC), C’). The transition TR’ is generally the section that varies the most, in comparison with the exposition, as it no longer needs to include a modulation. One can often hear a move to the subdominant degree that remains in the home key, and thus resolves a “large-scale dissonance” (as called by Rosen ()) created by the exposition and intensified by the development;
and possibly a coda (Coda).

Figure 2 displays layouts of sonata form at different granularity, including the sections described above along with short transitional sections. Some of these sections or transitional states may be skipped, leading to forward transitions between non-adjacent states. These models are seen as topologies of Hidden Markov Models, detailed in Section 3.

Figure 2

Model topologies describing the most common sonata form structure at several resolutions. The set of states Q_n has n states. Q₃ and Q₇ model the basic sections of the sonata form. Q₁₄ (used by the model of Bigo et al. ()) and Q₁₈ further model Intro, TR, RT and Coda sections as well as transitional states between these sections, represented with squares: the medial caesuras MC and MC’, but also short transitions between the end of the closing zone and the complete end of the exposition (transition after the closing zone, TC), between the exposition and the development (d), between the development and the recapitulation (r), and between the recapitulation and the Coda (TC’). Initial and final states are circled twice.

2.2 The corpus

The corpus used in this work includes 32 sonata-form movements of string quartets composed by Mozart. The pieces are encoded as .krn Humdrum files () downloaded from http://github.com/musedata/humdrum-mozart-quartets. These files were originally available from http://kern.humdrum.org and encoded by Edmund Correia, Jr. and Frances Bennion.

Between 1770 and 1790, Mozart composed 23 string quartets totaling 86 movements (). We denote by K171.4 the 4th movement of K171. Out of these 86 movements, 42 are in sonata form, including 4 rondo sonata movements (K171.4, K173.1, K465.4, and K499.4), and 6 movements with special forms (K155.2, K168.2, K170.3, K171.1, K458.1, and K499.1). Special forms may include sections in unusual places, as for example the introduction and a “written” repeat of P’ and TR’ before the Coda in K171.1, or a strong bithematic unity (K168.2, continuous exposition in K458.1 “The Hunt” and K499.1). Ten out of these 42 sonata forms were left out because of unavailable clean encoding (K158.2, K160.1, K160.2, K160.3, K169.2, K170.3, K458.4, K464.1, K499.4, K575.1). Note that the dataset does not include pieces with an unusual sonata-form structure, such as K387.2, which is a minuet in sonata form without development, or K387.4, which is a fugue-sonata.

The corpus finally includes 19 first movements, 10 slow movements, and 3 final movements; 26 movements are in a major key and 6 are in a minor key.

2.3 Reference analyses

A reference annotation requires an agreement on a set of sections that need to be identified but also on the location of their boundaries. Some structural elements, such as the location of the cadences or the boundaries of the S theme, are especially subject to debate, and some of them may even be non-pertinent. For instance, there may be no precise border between P and TR. Reference datasets with divergent analyses may thus be particularly helpful. Following the above notations, we encoded two sets of analyses of the 32 sonata forms included in the corpus (Figure 3):

The set F is an encoding of elements found in Mozarts Streichquartette by Marius Flothuis (). This book contains complete analyses of the quartets, including descriptions of P/TR/S/C sections in exposition and recapitulation that we formally encoded. Flothuis did not use the notations of Hepokoski and Darcy () and took some liberties with the names of the sections. We freely interpreted his writings to match as much as possible the proposed model.
The set A is our own analysis written following the notations described by Hepokoski and Darcy (). These analyses were checked by two curators. Like Flothuis, we encoded P/TR/S/C section boundaries, but also MC, EEC and ESC cadences, notable structures in the development and RT, as well as some patterns and some harmonic progressions. Figure 4 shows how these analyses map onto some of the 18 possible sections. They have between 8 and 16 (average 11.9) of these 18 sections.

The two encodings were done independently. They total 1939 labels, including more than 600 section labels and more than 500 cadences.

Figure 3

Extract of the reference analysis for the second movement of the String Quartet #16 in E♭ major (K428.2), as viewed on http://www.dezrann.net/ (left) and represented as a json file (right). The Primary theme (P) ends with a half cadence in the primary key (I:HC). Here a Transition zone (TR) begins, which stops on different beats according to the references. The A analysis starts the secondary (S) theme after the HC in the primary key on measure 10, whereas the F analysis rather starts it on measure 14 (HC on the dominant key). Onsets in the json file are expressed in quarter notes.

Figure 4

Reference analysis A of 32 sonata-form and sonata-form-like movements in Mozart string quartets. The analyses are projected on the 18 states of Q₁₈. Vertical lines show cadences.

Despite some divergences (see Figure 3), 77% of the P/TR/S/C labels of A start at the same location in F. The majority of the differences between A and F occur when annotating the start of C. Indeed, Flothuis usually identifies the end of the S section on the first encountered PAC. On the contrary, Caplin () usually extends S until a last strong PAC providing a conclusion to the theme or to a group of themes, and keeps in C only post-cadential material called codettas. We follow here the first-PAC rule as stated and nuanced by ():

“(…) one could not consider S to be completed if either it or its cadential material is immediately restated. The PAC that ends the first statement of S proposes an EEC: by repeating the melody or a portion thereof, the composer reopens the PAC and shifts the EEC forward to the next PAC.”

Indeed, Mozart frequently “reopens” PACs by repeating S material. He often restates the immediately preceding cadential progression and sometimes expands it. Thus, we identify an EEC when we encounter a PAC if what follows has not been heard shortly before.

Finally, 3 out of these 32 movements are annotated differently in the two sets of analyses: we see these movements as sonata forms, whereas Flothuis favors the loosened two-part form (K155.2, K168.2, K172.2). Moreover, he did not consider forms that include a continuous exposition without a medial caesura (K458.1, K499.1).

2.4 Corpus availability

The annotation sets described above are distributed as Supplementary Files and at http://www.algomus.fr/data/ under the Open Database License (ODbL v1.0). These analyses are encoded as json files containing labels, each label being defined by a type (Structure/Cadence/Harmony), by an onset and possibly by a duration (Figure 3, right). Moreover, they are available through Dezrann, an interactive web platform for music annotation and analysis (, http://www.dezrann.net/).

3 Detection and Learning Strategy

As in (), we consider a finite alphabet of binary analysis features $\mathcal{A} = \{\alpha_1, \alpha_2, \ldots\}$ that may be present or absent at each quarter note and a Hidden Markov Model predicting the structure based on these features. Analysis features describe harmony, melody, or other local elements. In this section, we present the different models used in our experiments (section 3.1), the analysis features selected for this study (section 3.2), and the learning method used to set up the parameters of the model (section 3.3).

3.1 Hidden Markov Models to match sonata form structure

A Hidden Markov Model $\mathcal{M}_n = (Q_n, \pi, \tau, T, E)$ on $\mathcal{A}$ is defined by a set of n states Q_n = {q₁, …, q_n} corresponding to the successive sections of sonata form. We experimented with different sets of states targeting several model topologies (Figure 2):

The 3 states Q₃ = {Exp, Dev, Rec} and the 7 states Q₇ = {P, S, C, Dev, P′, S′, C′}, where the exposition and recapitulation parts of Q₃ are decomposed into thematic parts, match the most recognizable sections of sonata form;
The 14 states Q₁₄ and the 18 states Q₁₈ are closer to sonata form structure as described by Hepokoski and Darcy (). They add the transitions TR, RT, TR’, and (for Q₁₈) the Intro and Coda sections, and also model as short-lasting states the transitions between larger sections (MC, TC, d, r, MC’, TC’, see details in Figure 2).

The probabilities of the initial state and of the final state are respectively represented by π = (π₁, …π_n) and τ = (τ₁, …τ_n). T(i, j) is the transition probability – i.e. the probability that the state q_i goes to the state q_j, and E(i, α_k) is the emission probability – i.e. the probability that the state q_i emits the feature α_k.

Since several features can be predicted at the same step, any state may output simultaneously a set of symbols $A \subset \mathcal{A}$. If these emissions are independent events, the probability that the state q_i outputs the set A is

\[ E(i, A) = \prod_{\alpha \in A} E(i, \alpha) \cdot \prod_{\alpha \in \mathcal{A} \setminus A} \bigl(1 - E(i, \alpha)\bigr) \]
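As an illustration of the product formula above, the following minimal Python sketch computes the probability that one state emits exactly a given set of features under the independence assumption. The function name, the three-feature alphabet, and the numerical probabilities are hypothetical and chosen only for the example; they are not values learned on the corpus.

```python
from math import prod


def set_emission_probability(emission, observed, alphabet):
    """Probability that a state emits exactly the features in `observed`.

    emission: dict mapping each feature to E(i, feature) for one state i
    observed: set of features present at this quarter note (the set A)
    alphabet: list of all features (the alphabet of analysis features)
    """
    # Features that are present contribute E(i, a)
    present = prod(emission[a] for a in alphabet if a in observed)
    # Features that are absent contribute 1 - E(i, a)
    absent = prod(1.0 - emission[a] for a in alphabet if a not in observed)
    return present * absent


# Hypothetical emission probabilities for a single state (illustrative only)
ALPHABET = ["pat:P", "ton:I", "cad:PAC"]
E_STATE = {"pat:P": 0.5, "ton:I": 0.8, "cad:PAC": 0.1}

p = set_emission_probability(E_STATE, {"pat:P", "ton:I"}, ALPHABET)
```

Because every feature contributes either E(i, α) or 1 − E(i, α), the probabilities of all possible subsets of the alphabet sum to one for each state.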

Given an integer t, we define a path in $\mathcal{M}$ by a t-tuple of integers P = (p₁, …, p_t) ∈ [1, n]^t, meaning that the path goes through the t states $q_{p_1}, \ldots, q_{p_t}$. We also consider a sequence of sets of symbols $A_1 \ldots A_t \in \mathcal{P}(\mathcal{A})^t$, where $\mathcal{P}(\mathcal{A})$ is the set of subsets of $\mathcal{A}$.

The probability that the model $\mathcal{M}$ follows a path P = (p₁, …, p_t), entering by an input state p₁ and exiting from an output state p_t, while outputting the sequence A₁…A_t, one state outputting some symbols at each step, is given by:

\[ \mathrm{prob}(P, A_1 \ldots A_t) = \pi_{p_1} \cdot E(p_1, A_1) \cdot \prod_{i=2}^{t} \bigl( T(p_{i-1}, p_i) \cdot E(p_i, A_i) \bigr) \cdot \tau_{p_t} \]

Starting from a sequence of sets of symbols A₁…A_t, the Viterbi algorithm (; ) finds the path P that maximizes prob(P, A₁…A_t).
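The decoding step can be sketched as follows. This is a minimal log-space Viterbi in plain Python, assuming set-valued emissions with independent features and explicit initial (π) and final (τ) probabilities; it is not the implementation used in our experiments. The three-state toy model, its probabilities, and the two-feature alphabet are hypothetical.

```python
import math


def log(x):
    """Logarithm that maps zero probabilities to minus infinity."""
    return math.log(x) if x > 0 else float("-inf")


def set_emission(e_state, observed, alphabet):
    """E(i, A) for independent binary features."""
    p = 1.0
    for a in alphabet:
        p *= e_state[a] if a in observed else 1.0 - e_state[a]
    return p


def viterbi(observations, n, pi, tau, T, E, alphabet):
    """Most probable state path (0-based indices) for a sequence of feature sets."""
    t = len(observations)
    score = [[float("-inf")] * n for _ in range(t)]
    back = [[0] * n for _ in range(t)]
    for j in range(n):
        score[0][j] = log(pi[j]) + log(set_emission(E[j], observations[0], alphabet))
    for s in range(1, t):
        for j in range(n):
            emit = log(set_emission(E[j], observations[s], alphabet))
            best_i = max(range(n), key=lambda i: score[s - 1][i] + log(T[i][j]))
            score[s][j] = score[s - 1][best_i] + log(T[best_i][j]) + emit
            back[s][j] = best_i
    # The final-state probabilities tau constrain where the path may end
    last = max(range(n), key=lambda j: score[t - 1][j] + log(tau[j]))
    path = [last]
    for s in range(t - 1, 0, -1):
        path.append(back[s][path[-1]])
    return path[::-1]


# Hypothetical toy model on three states (illustrative values only)
STATES = ["Exp", "Dev", "Rec"]
ALPHABET = ["ton:I", "seq"]
pi = [1.0, 0.0, 0.0]
tau = [0.0, 0.0, 1.0]
T = [[0.8, 0.2, 0.0],
     [0.0, 0.8, 0.2],
     [0.0, 0.0, 1.0]]
E = [{"ton:I": 0.9, "seq": 0.05},
     {"ton:I": 0.1, "seq": 0.6},
     {"ton:I": 0.9, "seq": 0.05}]
observations = [{"ton:I"}, {"ton:I"}, {"seq"}, {"seq"}, {"ton:I"}, {"ton:I"}]

path = viterbi(observations, len(STATES), pi, tau, T, E, ALPHABET)
labels = [STATES[k] for k in path]
```

Working with logarithms avoids numerical underflow on long pieces, where the product of several thousand probabilities would otherwise round to zero.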

3.2 Analysis features

In (), we selected binary features “according to whether their presence or absence could be characteristic of (…) sections in a sonata form”. We first included these features:

Pattern features: repeated candidate P pattern (pat:P) and candidate S pattern (pat:S) that may be characteristic for P and S. These patterns are extracted from the highest voice (first violin), but successive occurrences may be found in other voices. The P candidate pattern is searched by a relatively strict variant of the Mongeau and Sankoff () algorithm forbidding any transposition, whereas the S candidate pattern is searched with some transposition between the first occurrence and a next one – thus targeting a pattern that should appear in S’ rather than again in S. Additional length and position constraints account for the balance of the sonata form, such as ending the candidate P pattern and starting the candidate S pattern before one-third of the length of the piece ().
Harmonic features: local tonalities on 2-measure windows (2 × 7 ton:x features, minor and major) based on the algorithm of Krumhansl and Kessler () using pitch class profiles adapted from Temperley (), heuristic detection of Perfect Authentic Cadences (cad:PAC), Imperfect Authentic Cadences where both chords are in root position (cad:rIAC), and pedals (ped), with the rule-based algorithms of Giraud et al. (), and finally features possibly involved in the preparation of half-cadences, such as chromatic upward bass movements (harm:#) and diminished seventh or augmented second intervals (harm:7).
Features combining melody and/or harmony and/or rhythm: full rests (rest), unisons (unison), and finally long harmonic sequences (seq) where at least two voices repeat a pattern consecutively in different tonalities, the voices following the same (possibly diatonic) transpositions, for a duration of at least twenty quarter notes ().

We added the following two new features that may match more closely particular sections of the sonata form, like the Medial Caesura (Figure 5):

Rhythm break. In both exposition and recapitulation, the end of the transition between the primary and the secondary theme is often enhanced by a dense and repetitive rhythm that is broken by the half-cadence of the Medial Caesura to enhance its closure effect (). The feature break detects the interruption of repetitive rhythms, in any voice, that consist of at least 15 consecutive notes that have the same duration.
Triple hammer blow. This striking event generally consists of three strongly repeated onsets preceding a rest that separates the MC from the secondary theme (). The feature hammer detects, in any voice, three repeated notes followed by a rest.
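The two detection rules above can be sketched on a single voice as follows. This is a simplified illustration and not the feature code used in the experiments: a voice is assumed to be a list of (pitch, duration) pairs with pitch set to None for a rest, "repeated notes" is interpreted here as three consecutive notes of identical pitch, and the function names and the toy voice are hypothetical.

```python
def rhythm_breaks(voice, min_run=15):
    """Indices of events that interrupt a run of at least `min_run`
    consecutive notes sharing the same duration.

    voice: list of (pitch, duration) pairs; pitch is None for a rest.
    """
    breaks = []
    run = 0          # length of the current run of equal-duration notes
    run_dur = None   # duration shared by the notes of the current run
    for idx, (pitch, dur) in enumerate(voice):
        if pitch is not None and dur == run_dur:
            run += 1
            continue
        if run >= min_run:
            breaks.append(idx)  # this event ends a long repetitive run
        # Start a new run on a note, or reset the run on a rest
        run, run_dur = (1, dur) if pitch is not None else (0, None)
    return breaks


def hammer_blows(voice):
    """Indices of rests preceded by three notes of identical pitch."""
    hits = []
    for idx in range(3, len(voice)):
        if voice[idx][0] is None:
            pitches = [p for p, _ in voice[idx - 3:idx]]
            if None not in pitches and len(set(pitches)) == 1:
                hits.append(idx)
    return hits


# Hypothetical voice: sixteen eighth notes, three repeated quarter notes, a rest
voice = [(60 + k % 5, 0.5) for k in range(16)] + [(62, 1.0)] * 3 + [(None, 1.0)]

break_positions = rhythm_breaks(voice)
hammer_positions = hammer_blows(voice)
```

In a full implementation the event indices would be converted to quarter-note onsets and the detection repeated over the four voices, so that the feature is active whenever any voice triggers it.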

All the features consider only information on note pitches and durations as well as on rests. They do not look at any other information such as annotation marks, dynamics, or repeat bars. In particular, in almost all the pieces of the corpus, repeat bars are found at the end of the exposition and could ease the analysis. However, even without this repeat bar, this boundary is almost always unambiguous and can be predicted by automated methods.

Figure 5

Medial Caesura in Allegro K80.2, measure 15. This half cadence (HC) has a very simple but very efficient tonic/dominant schema. It is reinforced by the sudden change of texture (break) between the unison in eighth notes and the triple hammer blow (hammer) that accentuates the dominant chord on D.

The absence or presence of each feature is computed at every quarter note in every piece of the corpus. Features occurring at the limit between two sections are counted in both sections.

Note that all features are somewhat heuristic and may not be perfect. Nevertheless, the next section will show that some of them are significantly present or absent in some sections of the sonata form and that they may be used to learn the sonata-form structure.

3.3 Maximum likelihood parameter estimation

The parameters of the HMM can be learned by relating the section boundaries that are manually annotated in the whole corpus and the analysis features that are computed at each quarter note.

Let $\mathbb{T}(i, j)$ and $\mathbb{E}(i, \alpha)$ be the observed counts of transitions and emissions on the learning corpus, and $duration(i) = \sum_{k=1}^{n} \mathbb{T}(i, k)$ the total duration of the section i on the learning corpus. Any transition or emission probabilities can be computed by the following ratios:

\[ T(i, j) = \frac{\mathbb{T}(i, j)}{duration(i)} \qquad E(i, \alpha) = \frac{\mathbb{E}(i, \alpha)}{duration(i)} \]

To prevent zero probabilities, pseudo-counts with a very small ϵ are added to every value of $\mathbb{E}$ as well as to every value $\mathbb{T}(i, j)$ with i ≤ j (preventing backward transitions). Note that we considered that the features are independent both in the learning phase and when using the models. This is not true in the general case, especially for features that are mutually exclusive such as the tonality features, but this nevertheless allows for a practical approximation.
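The counting procedure can be sketched as below, under stated assumptions. Each annotated piece is assumed to be given as one state index and one feature set per quarter note. Two choices are specific to this sketch and are not taken from the experiments: emissions are normalized by the number of quarter notes labelled with the state (which coincides with duration(i) except for the last quarter note of a piece, which has no outgoing transition), and the emission denominator receives 2ϵ so that the smoothed probabilities stay strictly between 0 and 1. The function name, the toy piece, and the value of ϵ are hypothetical.

```python
def estimate_parameters(pieces, n, alphabet, eps=1e-6):
    """Estimate transition and emission probabilities from annotated pieces.

    pieces: list of (states, observations) pairs, one entry per quarter note;
            states are 0-based indices, observations are sets of features.
    Returns (T, E): T[i][j] transition probabilities, E[i][a] emission probabilities.
    """
    t_counts = [[0.0] * n for _ in range(n)]
    e_counts = [{a: 0.0 for a in alphabet} for _ in range(n)]
    quarters = [0] * n  # number of quarter notes labelled with each state
    for states, observations in pieces:
        for state, observed in zip(states, observations):
            quarters[state] += 1
            for a in observed:
                e_counts[state][a] += 1
        for src, dst in zip(states, states[1:]):
            t_counts[src][dst] += 1
    T, E = [], []
    for i in range(n):
        # Pseudo-counts only on forward transitions (i <= j): backward ones stay at zero
        row = [t_counts[i][j] + (eps if j >= i else 0.0) for j in range(n)]
        total = sum(row)
        T.append([c / total for c in row])
        # Smoothed emission ratio, kept strictly inside (0, 1)
        E.append({a: (e_counts[i][a] + eps) / (quarters[i] + 2 * eps)
                  for a in alphabet})
    return T, E


# Hypothetical single-piece corpus with three states (illustrative only)
ALPHABET = ["ton:I", "seq"]
pieces = [([0, 0, 1, 1, 2],
           [{"ton:I"}, {"ton:I"}, set(), {"seq"}, {"ton:I"}])]

T, E = estimate_parameters(pieces, 3, ALPHABET)
```

Since no pseudo-count is added below the diagonal, a transition that is never observed backwards keeps probability zero, which enforces the left-to-right topology during decoding.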

4 Evaluation and Results

Our experiments, including the computation of the analysis features and the HMM parameters, and the implementation of the Viterbi algorithm were done in Python 3 within the music21 framework (), extended with analytic labels (). Every analytical feature was computed at each quarter note of every piece included in the corpus. Their occurrences in the corpus are discussed below.

To avoid overfitting, the learning strategy was evaluated with a Leave-One-Piece-Out cross-validation strategy. The sonata-form structure was predicted on each of the 32 pieces by the four HMMs described above, their parameters being learned on the 31 remaining pieces of the corpus. The cross-validation process was conducted on the whole corpus as the size and the heterogeneity of the corpus did not allow for a separate test set dedicated to a final evaluation. Note that we did not identify any hyperparameter in the model that we tried to optimize, apart from the various topologies and feature subsets that are discussed below.
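The Leave-One-Piece-Out loop can be sketched generically as follows. The `train` and `predict` callables stand for the parameter estimation and the Viterbi decoding; here they are replaced by trivial placeholders so that the sketch runs on its own. The piece identifiers and the `length` field are hypothetical.

```python
def leave_one_piece_out(pieces, train, predict):
    """Predict each piece with a model trained on all the other pieces.

    pieces:  list of dicts, each with at least an "id" key
    train:   callable taking a list of pieces and returning a model
    predict: callable taking (model, piece) and returning a prediction
    """
    predictions = {}
    for held_out in pieces:
        # The held-out piece never contributes to the learned parameters
        training = [p for p in pieces if p is not held_out]
        model = train(training)
        predictions[held_out["id"]] = predict(model, held_out)
    return predictions


# Placeholder corpus and model: the "model" is the mean length of the training pieces
toy_pieces = [{"id": "A", "length": 10},
              {"id": "B", "length": 20},
              {"id": "C", "length": 30}]


def toy_train(training):
    return sum(p["length"] for p in training) / len(training)


def toy_predict(model, piece):
    return model


toy_predictions = leave_one_piece_out(toy_pieces, toy_train, toy_predict)
```

With 32 pieces, this loop trains 32 models per topology, each on the 31 remaining pieces.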

The results of the computation of the analysis features, as well as the learned probabilities, can be downloaded from http://www.algomus.fr/data/.

4.1 Discussion on feature statistics

Table 1 shows the number of occurrences of the computed features within the 18 sections of the sonata form as indicated by the annotation set A. Comparing occurrences of features or other elements against their expected number in “random” situations helps to evaluate their significance (). For example, the first primary zones (P) span 1130 quarter notes, that is 7.9% of the 14318 quarter notes of the corpus. Over the whole corpus, ton:I is activated on 4491 quarter notes. Were this feature randomly distributed, ton:I would be activated on about 354 ≈ 4491 × 7.9% quarter notes in P. However, there are actually 553 quarter notes out of these 1130 quarter notes in P where ton:I is activated.
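The expected count used in this example is a simple proportion. The short sketch below reproduces the arithmetic with the figures quoted above; the function name is an assumption.

```python
TOTAL_QUARTERS = 14318  # quarter notes in the corpus

def expected_count(feature_total, section_quarters, corpus_quarters=TOTAL_QUARTERS):
    """Occurrences expected in a section if the feature were spread uniformly."""
    return feature_total * section_quarters / corpus_quarters

expected = expected_count(4491, 1130)  # ton:I in P
observed = 553
ratio = observed / expected
```

Here `expected` rounds to 354, so ton:I is observed in P about one and a half times more often than a uniform distribution would predict.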

Table 1

Feature tallies for sections of the sonata form on the 32 movements of the corpus of Mozart string quartet movements in sonata form (see Section 4.1). The table shows, for each feature, the number of quarter notes where this feature occurs followed by its number of occurrences on quarter notes labeled as each of the sections in the reference annotation A, as well as, in gray, its expected number should the feature be random or uniformly distributed across the quarter notes. Bold, italic, and the ≫ and ≪ symbols indicate an estimation of the significance of their presence or absence compared to all the other sections as well as to adjacent sections. The total numbers of quarter notes can differ slightly from the sums of the different sections due to rounding of non-integer lengths on some sections (see Figure 3).

		States

Features	quarters	Intro			P			TR			MC			S			C			TC			d			Dev			RT		…

pat:P	2448	35	20	≪	858*	193	≫	341*	235		15	13	≫	23*	250	≫	0*	200		0*	13		0*	12		10*	385		8*	50
pat:S	3008	0*	25		0*	237	≪	482*	289		36*	16		686*	308	≫	304*	246		6	16		0*	14		8*	473		0*	61

ton:I	4491	41	38		553*	354	≫	*311**	432		22	23		*229**	460		*222**	367		18	24	≫	0*	22	≪	*360**	707	≪	115	91
ton:II	510	0	4		18*	40	≪	74	49		12*	2		72	52		79*	41		1	2		14*	2	≫	107	80		2	10
ton:III	479	0	4		27	37		54	46		4	2		70	49		64*	39		8	2		5	2		125*	75		9	9
ton:IV	1734	31*	14		186*	136	≫	*104**	166		1	9		34*	177		36*	141	≪	21	9		4	8		*168**	273		25	35
ton:V	2514	0*	21		41*	198	≫	467*	242		36*	13		683*	257		468*	205	≫	1*	13		4	12		*327**	396		70	51
ton:VI	479	6	4	≫	12*	37	≫	54	46		2	2		98*	49		66*	39		0	2		2	2	≫	90	75	≫	0*	9
ton:VII	386	9	3		20	30		30	37		0	2		14*	39		23	31		4	2		0	1		94*	60		1	7

ton:i	892	3	7		84	70		52*	85		3	4		16*	91		23*	72	≪	12	4		22*	4		148	140	≪	42*	18
ton:ii	534	4	4		33	42	≫	10*	51		0	2		13*	54		17*	43		1	2		0	2		156*	84		8	10
ton:iii	356	6	3		5*	28	≪	70*	34		4	1		68*	36		46	29		0	1	≪	12*	1	≫	48	56		14	7
ton:iv	349	12*	2		6*	27		0*	33		0	1		16	35	≪	51*	28		0	1		8	1		114*	54		18	7
ton:v	460	0	3		22	36	≪	69	44		5	2		38	47		8*	37		3	2		1	2		162*	72	≫	2	9
ton:vi	1052	0	8		46*	83		73	101		2	5		112	107		51*	86	≪	14	5	≫	0	5	≪	368*	165	≫	0*	21
ton:vii	187	3	1		0*	14	≪	22	18		0	0		21	19		36*	15		0	1		4	0		14	29		0	3

cad:PAC	416	4	3		20	32		22	40	≪	9	2		48	42		72*	34		1	2		0	2		29*	65		4	8
cad:rIAC	142	2	1	≫	16	11		8	13		3	0		15	14		9	11		0	0		0	0		29	22		1	2
harm:#	144	2	1		13	11		18	13		6	0	≫	7	14		5	11		1	0		1	0		27	22		1	2
harm:7	1122	4	9		49*	88	≪	116	108		0	5		68*	115		86	91	≪	18*	6		3	5		271*	176		17	22

ped	971	10	8		116*	76	≫	76	93		0	5		42*	99		66	79		1	5		0	4		186	152		20	19
rest	331	6	2		35	26	≫	11*	31	≪	12*	1	≫	15	33	≪	38	27		4	1		2	1		39	52	≪	18	6
seq	1254	24	10		57*	99		61*	120		2	6		95	128		60*	102		3	6	≪	29*	6	≫	420*	197	≫	0*	25
unison	685	16	5		91*	54	≫	43	65		7	3		43	70		59	56	≪	24*	3		27*	3	≫	68*	107		12	13

break	482	1	4		24	38		50	46	≪	12*	2	≫	36	49		47	39		5	2		1	2		74	75		9	9
hammer	268	0	2		14	21		8*	25	≪	14*	1	≫	49*	27		20	21		0	1		0	1		52	42		7	5

Total	14318	122			1130			1378			76			1468			1171			78			71			2255			292		…

Table 1 (continued)

Feature tallies for sections of the sonata form on the 32 movements of the corpus of Mozart string quartet movements in sonata form.

		States

Features	quarters	…	r			P’			TR’			MC’			S’			C’			TC’			Coda

pat:P	2448		5	5	≪	770*	184	≫	317*	243		15	12	≫	20*	264	≫	0*	218		0*	17		32*	125
pat:S	3008		0	6		1*	227	≪	444*	299		34*	15		692*	324	≫	307	269	≫	0*	20		7*	153

ton:I	4491		20	10		582*	339	≫	524*	447		44*	23		640*	484		497*	401	≪	17	31	≪	295*	229
ton:II	510		0	1		16*	38		28	50		5	2		25*	54	≪	56	45		0	3		0*	26
ton:III	479		1	1		18	36		24*	47		0	2		16*	51	≪	46	42		6	3	≫	1*	24
ton:IV	1734		0	3		156	130		230*	172		9	9		269*	187		256*	155		16	12		188*	88
ton:V	2514		4	5		36*	189	≪	*132**	250		13	13		*112**	271		77*	224		10	17		34*	128
ton:VI	479		3	1		18	36		35	47		0	2		58	51		27	42		0	3		7*	24
ton:VII	386		0	0		17	29	≪	80*	38		3	2		35	41		27	34	≪	12*	2	≫	18	19

ton:i	892		5	2		115*	67		148*	88		10	4		109	96		46*	79	≪	21*	6	≫	32	45
ton:ii	534		0	1		44	40		64	53		2	2		81	57		36	47		8	3		58*	27
ton:iii	356		0	0		9*	26		18	35		0	1		26	38		28	31		2	2		0*	18
ton:iv	349		0	0		9*	26		9*	34		0	1		23	37	≪	78*	31		0	2		4	17
ton:v	460		0	1		24	34		43	45		0	2		33	49		27	41		2	3		21	23
ton:vi	1052		0	2		57	79		77	104		2	5		104	113		68	94		2	7		77	53
ton:vii	187		0	0		0*	14	≪	34	18		1	0		23	20		17	16		7	1		5	9

cad:PAC	416		0	0		26	31		25	41		8	2		46	44		73*	37		1	2		28	21
cad:rIAC	142		0	0		18	10		5	14		1	0		17	15		9	12		0	0		9	7
harm:#	144		0	0		14	10		14	14		4	0		16	15		7	12		2	1		6	7
harm:7	1122		1	2		53*	84		93	111		0	5		100	121	≪	196*	100		12	7		35	57

ped	971		1	2		131*	73	≫	67	96		0	5		47*	104	≪	84	86		2	6	≪	120*	49
rest	331		5	0		35	24	≫	14	32	≪	14*	1	≫	16	35		33	29		3	2		31	16
seq	1254		0	2		58*	94	≪	172*	125		3	6		118	135		120	112		0	8		32*	64
unison	685		8*	1		96*	51	≫	44	68		7	3		43*	73		49	61	≪	27*	4	≫	22	35

break	482		3	1		24	36		52	48	≪	14*	2	≫	37	52		52	43		5	3		36	24
hammer	268		1	0		14	20		8*	26	≪	12*	1	≫	46	28		16	23		0	1		10	13

Total	14318	…	32			1081			1427			74			1545			1280			99			732

For each feature and each section, p-values are estimated by an exact Fisher test computed by the Python scipy package. Fisher tests are computed independently. To account for the large number of tests, both on features and on sections, only features with p-values under 10^–4 are considered as significant, either by their presence (bold, *) or their absence (italic, *). For example, as expected, the feature ton:I is significantly present in P and significantly absent in S (both times p < 10^–30). The ≫ and ≪ symbols between two adjacent columns show the features which can be considered as significant to distinguish these two states, again with a 10^–4 threshold on another Fisher test. For example, the feature ton:II is significantly more present in TR than in P (p < 10^–9), even if it is not significantly present in TR compared to all sections.
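To make the test concrete, here is a minimal sketch of a one-sided exact Fisher test written with the standard library only. The authors report using scipy; this re-implementation, the choice of a one-sided alternative, and the function name are assumptions made for illustration. The counts are those of ton:I in P.

```python
from math import comb

def fisher_exact_greater(k, n, K, N):
    """One-sided exact p-value P(X >= k) for a hypergeometric variable X.

    k: quarter notes of the section where the feature is active
    n: quarter notes of the section
    K: quarter notes of the corpus where the feature is active
    N: quarter notes of the corpus
    """
    upper = min(n, K)
    numerator = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1))
    return numerator / comb(N, n)

# ton:I in P: 553 of 1130 quarter notes, against 4491 of 14318 in the corpus.
p_value = fisher_exact_greater(553, 1130, 4491, 14318)
significant = p_value < 1e-4
```

Exact integer arithmetic avoids the underflow that a naive floating-point product of binomial coefficients would cause for counts of this size.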

Although most features are not specific to a section, many of them differ significantly from one section to another, which confirms their pertinence for the task of sonata form detection. A first observation is that the expected tonal path is confirmed by the ton:x features. Indeed, ton:I is met on about half of the P quarter notes (553 out of 1130) while ton:V and ton:III (dominant and relative major tonalities) are significantly present in S. This highlights the opposition between the two tonal zones of the exposition. As expected, this “large-scale dissonance” is resolved by the recapitulation. Indeed, both P’ and S’ are characterized by a high prevalence of ton:I.

Another result considering the tonality features is the symmetry between TR and TR’. Whereas TR usually induces an ascending fifth move from ton:I to ton:V, our results confirm that, in TR’, Mozart often moves to ton:IV (called a tonal adjustment by Caplin () or a feint by Rosen () and Hepokoski and Darcy ()) in order to reach S’ in ton:I with a move of the same interval.

The Perfect Authentic Cadences (PAC) are significantly present in C and C’, and only there. Indeed, S and S’ generally end with a strong structural EEC and ESC although the rest of S and S’ do not significantly contain cadences.

The thematic pattern pat:P is significantly present for P and P’, but also for TR and TR’. This is because the starts of TR and TR’ are often the same. The thematic pattern pat:S is significantly present for S and S’, but also for TR, C, TR’ and C’. This is because the part of the exposition that is exactly transposed often starts (contrary to Figure 1) inside TR and continues through S’ and C’.

Features break, hammer, and rest are especially significant on MC and MC’. Some of these features are triggered by the themes in P/P’ or S/S’ at relevant places. Long harmonic sequences and pedals significantly appear in the developments, but they are also present in other sections. In the small transitional sections before the development (TC, d), before the recapitulation (r), and before the Coda (TC’), many unisons are encountered, but again they are significantly found at other places as well.

4.2 Ability to retrieve the sonata-form structure

We evaluate the performance of the four HMMs with learned parameters $\mathcal{M}_3$, $\mathcal{M}_7$, $\mathcal{M}_{14}$, and $\mathcal{M}_{18}$, as well as the HMM with hard-coded parameters proposed previously () that we call $\mathcal{M}_{14}^*$.

4.2.1 Evaluation measures

Tables 2 (focus on quarter notes) and 3 (focus on boundaries) show the performance of the five HMMs using the cross-validation process described above on the 32 pieces of the corpus.

Table 2 shows F₁-measures for all the considered classifiers and for each predicted label. The top table further shows the confusion matrix for $\mathcal{M}_{18}$, which details, for each predicted label (rows), the number of corresponding quarter notes in the reference annotation (columns). For example, the second row shows that 36 quarter notes are predicted as P but are labeled Intro in the reference annotation (false positives), whereas 751 quarter notes predicted as P are indeed labeled P (true positives).
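As an illustration of how an F₁-measure follows from the confusion matrix, the sketch below recomputes the value for P from the second row of the matrix and from the 1130 reference quarter notes of P. The function and variable names are assumptions.

```python
def f1_score(true_positives, predicted_total, reference_total):
    """Harmonic mean of precision and recall."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / predicted_total
    recall = true_positives / reference_total
    return 2 * precision * recall / (precision + recall)

# Row "P" of the confusion matrix (quarter notes predicted as P), non-empty cells.
predicted_as_P = {"Intro": 36, "P": 751, "TR": 238, "MC": 12, "S": 4}
reference_P = 1130  # quarter notes labeled P in the reference annotation

f1_P = f1_score(predicted_as_P["P"], sum(predicted_as_P.values()), reference_P)
```

Rounded to two decimals, `f1_P` gives 0.69, the value reported for P in the F₁ row of the 18-state model.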

Table 2

Classification results, with F₁-measures of the five studied HMMs as well as of baseline models on the 14318 quarter notes of the corpus against the reference A. The confusion matrix is detailed for $\mathcal{M}_{18}$: each column denotes the quarter notes of a section in the reference analysis, and the rows show how these quarter notes are classified (after cross-validation (c-val.)) by $\mathcal{M}_{18}$. Underlined values are discussed in the text.

Q₁₈	Intro	P	TR	MC	S	C	TC	d	Dev	RT	r	P’	TR’	MC’	S’	C’	TC’	Coda

Intro	0	154	30	3	·	·	·	·	·	·	·	·	·	·	·	·	·	·
P	36	*751*	238	12	4	·	·	·	·	·	·	·	·	·	·	·	·	·
TR	1	86	*175*	10	121	47	·	28	35	·	·	16	32	4	29	·	·	·
MC	1	4	19	3	6	·	·	·	·	·	·	·	·	·	1	·	·	·
S	1	·	608	27	*588*	357	2	·	11	·	·	·	·	·	30	9	·	·
C	1	2	40	6	364	*355*	10	·	202	·	5	6	·	·	·	38	·	0
TC	·	23	68	·	5	1	0	21	114	·	·	12	·	·	·	·	·	·
d	3	·	29	·	5	2	6	6	101	·	·	9	·	·	·	·	·	·
Dev	49	85	134	11	268	353	60	16	*1320*	87	3	67	56	2	62	110	32	36
RT	30	24	20	·	30	12	·	·	393	*141*	14	57	51	3	12	·	·	5
r	·	·	·	·	·	·	·	·	20	25	2	49	2	·	·	·	·	·
P’	·	·	1	·	1	·	·	·	0	·	7	*713*	282	11	35	·	·	14
TR’	·	·	·	·	·	·	·	·	·	·	·	46	*161*	4	174	3	·	1
MC’	·	·	1	·	1	·	·	·	·	·	·	2	18	8	6	·	·	3
S’	·	·	14	3	73	14	·	·	·	·	·	7	549	20	*471*	393	·	16
C’	·	·	·	·	·	25	·	·	21	15	·	58	197	10	353	*213*	32	58
TC’	·	·	·	·	·	4	·	·	34	·	·	9	45	3	42	49	11	12
Coda	·	·	·	·	·	·	·	·	2	24	·	28	32	8	328	463	24	*587*

quarter notes	122	1130	1378	76	1468	1171	78	71	2255	292	32	1081	1427	74	1545	1280	99	732

F₁ ($\mathcal{M}_{18}$, c-val.)	0.00	0.69	0.18	0.05	0.38	0.32	0.00	0.05	0.53	0.26	0.03	0.66	0.18	0.15	0.30	0.19	0.07	0.53
F₁ (equal)	0.00	0.56	0.14	0.04	0.29	0.24	0.00	0.00	0.30	0.12	0.20	0.42	0.02	0.00	0.19	0.15	0.00	0.26
F₁ (fixed)	0.02	0.15	0.18	0.01	0.19	0.15	0.01	0.01	0.27	0.04	0.00	0.14	0.18	0.01	0.19	0.16	0.01	0.10

Q₁₄	P	TR	MC	S	C	d	Dev	RT	r	P’	TR’	MC’	S’	C’

quarter notes	1130	1378	76	1468	1250	71	2255	292	32	1081	1427	74	1562	2095

F₁ ($\mathcal{M}_{14}$, c-val.)	0.76	0.17	0.05	0.38	0.28	0.05	0.58	0.25	0.03	0.66	0.18	0.15	0.28	0.56
F₁ ($\mathcal{M}_{14}^*$)	0.66	0.35	0.03	0.27	0.26	0.04	0.16	0.14	0.02	0.29	0.33	0.09	0.29	0.61
F₁ (equal)	0.40	0.05	0.00	0.20	0.04	0.00	0.16	0.08	0.11	0.23	0.00	0.00	0.12	0.31
F₁ (fixed)	0.15	0.18	0.01	0.19	0.16	0.01	0.27	0.04	0.00	0.14	0.18	0.01	0.20	0.26

Q7	P	S	C	Dev	P’	S’	C’

quarter notes	2582	1471	1321	2580	2580	1565	2095

F₁ ($\mathcal{M}_7$, c-val.)	0.65	0.37	0.25	0.68	0.54	0.33	0.54
F₁ (equal)	0.50	0.36	0.23	0.44	0.39	0.18	0.37
F₁ (fixed)	0.31	0.19	0.17	0.31	0.31	0.20	0.26

Q3	Exp	Dev	Rec

quarter notes	5374	2580	6240

F₁ ($\mathcal{M}_3$, c-val.)	0.76	0.57	0.85
F₁ (equal)	0.41	0.30	0.68
F₁ (fixed)	0.55	0.31	0.61

To assess whether the model benefits from learning transition probabilities, we also compared the learned models to HMMs with “equal” transition probabilities (restricted to forward transitions) but with learned emission probabilities. We also show the best F₁-measure for “fixed” classifiers always predicting the same section. For example, the “fixed” classifier for Q₁₈ on P always predicts P on the 14318 quarter notes of the corpus and has an F₁-measure of 0.15, far below the F₁-measure of 0.69 obtained by $\mathcal{M}_{18}$.
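The “fixed” baseline has a closed form: a classifier that always predicts one section has a recall of 1 for that section and a precision equal to the share of the corpus that the section covers. The sketch below (function name assumed) recomputes two of the baseline values of Table 2.

```python
def fixed_baseline_f1(section_quarters, corpus_quarters=14318):
    """F1 of a classifier that labels every quarter note with the same section."""
    precision = section_quarters / corpus_quarters
    recall = 1.0
    return 2 * precision * recall / (precision + recall)

f1_fixed_P = fixed_baseline_f1(1130)    # P covers 1130 quarter notes
f1_fixed_Dev = fixed_baseline_f1(2255)  # Dev covers 2255 quarter notes
```

Rounded to two decimals, these give 0.15 for P and 0.27 for Dev, matching the “fixed” row for Q₁₈.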

In Table 3, the first four columns (main boundaries) show the results of the evaluation on four boundaries (starts of sections S, Dev, P’ and S’) corresponding to milestones in the tonal path of sonata form. The last four columns (all boundaries) show results of the evaluation while considering the boundaries of all modeled sections. In what follows, the prediction of a section boundary is considered as “correct” (+ or =) if its distance from the corresponding boundary in the reference annotation is at most 3 measures.
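The tolerance rule can be written as a small classifier. This is a sketch under assumptions: distances are taken as absolute differences in measures, a missing prediction is represented by `None`, and any distance above 1 and up to 3 measures is mapped to “=”. The measure numbers in the usage lines are made up.

```python
def classify_boundary(predicted_measure, reference_measure):
    """Map a predicted boundary to the symbols used in Table 3."""
    if predicted_measure is None:
        return "!"  # the section start was not predicted at all
    distance = abs(predicted_measure - reference_measure)
    if distance <= 1:
        return "+"
    if distance <= 3:
        return "="
    return "–"

def is_correct(label):
    """A boundary counts as correct when it lies within 3 measures."""
    return label in {"+", "="}

# Toy usage with made-up measure numbers.
labels = [
    classify_boundary(40, 40),
    classify_boundary(43, 40),
    classify_boundary(50, 40),
    classify_boundary(None, 40),
]
n_correct = sum(is_correct(label) for label in labels)
```

Counting the “+” and “=” labels over all pieces yields the number of correctly predicted boundaries.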

Table 3

Number of boundaries predicted exactly or within one measure (+), within between 2 and 3 measures (=), beyond 3 measures (–) or not predicted (!), compared to the reference analysis A. The bottom part of the table shows results obtained with a subset of features.

	main boundaries (total: 124)				all boundaries
	+	=	–	!	+	=	–	!

$\mathcal{M}_{14}^*$	23	4	54	43	68	21	154	115

$\mathcal{M}_{18}$	34	17	53	20	90	45	147	104
$\mathcal{M}_{14}$	31	16	56	21	87	38	146	87
$\mathcal{M}_7$	35	12	61	16	70	15	101	30
$\mathcal{M}_3$	16	8	40	0	46	8	42	0

$\mathcal{M}_{18}$, no pat:P/pat:S	13	7	97	7	32	29	229	96
$\mathcal{M}_{18}$, no ton:*	3	11	100	10	32	31	236	87
$\mathcal{M}_{18}$, no cad:*	35	16	57	16	90	40	159	97
$\mathcal{M}_{18}$, only ton:*	3	8	104	9	24	27	247	88
$\mathcal{M}_{18}$, no break features	33	12	61	18	85	36	168	97

4.2.2 Prediction evaluation

For the majority of the sections, the learned HMMs have much better F₁-measures than HMMs with equal transition probabilities, showing that the model can benefit from learned transitions.

The HMM $\mathcal{M}_{14}^*$ with hard-coded parameters successfully predicted 27 main boundaries (22%) and 89 out of all boundaries (25%). Table 3 shows that learning parameters using the very simple $\mathcal{M}_3$ model gives a poor prediction, with 24 main boundaries correctly predicted. Indeed, as $\mathcal{M}_3$ merges P and S themes, even most tonality features are not very significant.

Better predictions are achieved by $\mathcal{M}_7$, $\mathcal{M}_{14}$, and $\mathcal{M}_{18}$. The model $\mathcal{M}_{14}$ correctly predicts 47 main boundaries (38%) and 125 (35%) out of all boundaries, improving the results obtained by the HMM with hard-coded parameters. F₁-measures are also improved for most of the sections. Even better results are obtained with $\mathcal{M}_{18}$ (41% and 38%).
However, $\mathcal{M}_{18}$ models many sections. Some of the 18 corresponding states appear too rarely in the pieces of the corpus to be learned consistently by the model, as shown by the very low F₁-measure on sections Intro, TC, d, RT, and TC’. For example, the Intro section is found in only two movements in the whole corpus, leading to incorrect predictions between Intro and P sections.

Note that many false positives reported in the confusion matrix for $\mathcal{M}_{18}$ come from only a few pieces. Indeed, 132 of the 134 = 49 + 85 quarter notes predicted as Dev instead of Intro or P come from the wrong prediction on K465.1 (see below and Figure 7), and 60 out of the 61 = 25 + 21 + 15 quarter notes predicted as C’ instead of C, Dev, or RT come from the wrong prediction of K171.1 (data not shown).

Table 3 also shows the results on $\mathcal{M}_{18}$ while restricting the set of features. This confirms that pat:P and pat:S features are important to ground the prediction, but other features also contribute, even if the cadence features do not appear to improve the detection.

Finally, Figure 6 details the success of the prediction for the start of each section. Apart from the trivial start of P, the boundary being the best predicted is the start of P’, that is the start of the recapitulation.

Figure 6

Detection precision (relative to the reference analysis A) of the five HMMs. Boundaries are predicted exactly or within 1 measure (green, + on Table 3), within between 2 and 3 measures (blue, =), more than 3 measures (red, –), or not predicted at all (gray, !). The lines at the left show the numbers of the spurious sections falsely predicted by the models.

Whereas the hard-coded $\mathcal{M}_{14}^*$ predicts 9 starts of P’ exactly or within 1 measure compared to A, models $\mathcal{M}_3$, $\mathcal{M}_7$, $\mathcal{M}_{14}$, and $\mathcal{M}_{18}$ respectively predict 10, 15, 17, and 18 such boundaries. As P’ always appears in the reference, no spurious P’ is predicted. This success in detecting the start of P’ is likely to come from the correlation between this section and features representing both the thematic patterns pat:P and the tonality ton:I, which is strongly captured by the model as Table 1 attests. TR and TR’ sections are badly predicted, especially on their start, which may be caused by the blend between P/P’ and TR/TR’ in our model.

As a global result, $\mathcal{M}_{18}$ correctly predicts the sections of 8 movements, predicts only some sections correctly in 20 movements, and fails on the sections of 4 movements.

4.3 Discussion on representative movements

Figure 7 illustrates 6 representative predictions performed by $\mathcal{M}_{18}$.

Figure 7

Comparison between the reference analysis A (top) and the predicted analysis by $\mathcal{M}_{18}$ (bottom) on six string quartet movements.

The structure of the Adagio K172.2 is almost perfectly predicted. Almost all sections in the reference analysis are found (7 out of 10, since the model does not predict C, C’ nor Coda) and their starts are estimated on the correct beat or within 1 measure. The prediction for the Andante con moto K428.2 (see again Figure 1) is good in the exposition. The results in the recapitulation are degraded by the missing S’ section in the prediction, the C’ section starting far too early.

The model $\mathcal{M}_{18}$ predicts spurious Intro and/or Coda sections in different pieces such as in K428.1 or K428.2. This is due to the rarity of these sections in the corpus. These artifacts are not seen on $\mathcal{M}_7$ or $\mathcal{M}_{14}$. In K428.1 and K465.1, both $\mathcal{M}_{14}$ and $\mathcal{M}_{18}$ globally fail in predicting a pertinent structure, especially because they predict an overly long development. Using a feature on the repeat bars would improve these predictions.

The Allegro K458.1 “The Hunt” is an example of a continuous exposition (), with no MC/MC’ or S/S’ sections. The model nevertheless predicts these sections, and fails on many subsequent sections. Note that the reference F identifies an S section, but not at the same place as the one estimated by the model.

The Allegro K465.4 has a rondo sonata form: the movement follows the typical tonal path of sonata form, but the first theme P acts like a chorus that may be reused at other places, here also in Dev and Coda. It is another example of a well-predicted form: the model correctly predicts the occurrence of 7 of the 15 sections annotated in the reference, on the right beat or in its neighborhood (P/MC/S/Dev/P’/MC’/S’). The end of S (and the start of C) is predicted at measure 104, whereas both reference analyses A and F place it at measure 70, at the most satisfying and conclusive PAC. Since the conclusions C and C’ are very long and group several units, other analysts could reasonably agree with the model by including such thematic parts in S and S’. As with K465.4, the four rondo sonata forms in the corpus show satisfactory results, even though the models have difficulty correctly estimating the start of C.

5 Conclusions

We presented a new set of sonata-form annotations on 32 movements of Mozart string quartets and described how thematic, harmonic and rhythmic features are distributed across this corpus. Connecting the computed features with the manual section annotations makes it possible to learn the parameters of Hidden Markov Models, which then retrieve some section boundaries of sonata form with better precision than models with manually set parameters.
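As an illustration of what learning HMM parameters from annotated scores can mean, the sketch below estimates transition and emission probabilities by counting over labeled sequences, with additive smoothing. It is a minimal sketch under stated assumptions, not the procedure used in this study: the function `estimate_hmm`, the single categorical observation per beat, the smoothing constant, and the toy data are all hypothetical.

```python
from collections import Counter, defaultdict


def estimate_hmm(sequences, smoothing=1.0):
    """Estimate HMM probabilities by counting over labeled sequences.

    sequences: list of movements, each a list of (section, observation) pairs.
    smoothing: additive constant (assumed) so unseen events keep some mass.
    Returns (transitions, emissions) as nested dicts of probabilities.
    """
    states = sorted({s for seq in sequences for s, _ in seq})
    symbols = sorted({o for seq in sequences for _, o in seq})
    trans_counts = defaultdict(Counter)
    emit_counts = defaultdict(Counter)
    for seq in sequences:
        for state, obs in seq:
            emit_counts[state][obs] += 1
        for (prev, _), (nxt, _) in zip(seq, seq[1:]):
            trans_counts[prev][nxt] += 1

    def normalize(counts, keys):
        # Turn smoothed counts into one probability distribution per state.
        table = {}
        for s in states:
            total = sum(counts[s].values()) + smoothing * len(keys)
            table[s] = {k: (counts[s][k] + smoothing) / total for k in keys}
        return table

    return normalize(trans_counts, states), normalize(emit_counts, symbols)


# Toy annotated data (illustrative only): one (section, feature) pair per beat.
annotated = [
    [("P", "theme"), ("P", "none"), ("S", "theme"), ("S", "cadence")],
    [("P", "theme"), ("S", "none"), ("S", "cadence")],
]
transitions, emissions = estimate_hmm(annotated)
print(round(transitions["P"]["S"], 2))  # 0.6 with these toy counts
```

With so few annotated movements, the smoothing term matters: without it, any transition or feature absent from the training sequences would receive probability zero.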

Large music corpora can thus be analyzed by combining human knowledge with learning from annotated scores. In some ways, this may resemble how composers learned and refined sonata form over a period of more than 150 years. On the one hand, learning emission and transition probabilities from annotations might reflect the human process of learning sonata form through instruction. On the other hand, modeling sonata form with unsupervised machine learning methods could be compared to the human process of learning sonata form through exposure, without being aware of it.

Future directions of research include modeling sonata form with other learning approaches: either supervised, following other theories of sonata form (e.g. Caplin ()), or unsupervised, for instance by training HMMs with the Baum-Welch algorithm. Recurrent neural networks may also provide better results, especially with architectures able to learn the positions at which features tend to appear within a section. However, the relatively small size of the corpus will be challenging for any such learning method.

Improvements might be obtained by enlarging the corpus and the set of selected features, including features using additional score elements, other than just notes. Pattern features could be extended. In particular, one may look for candidate patterns playing roles not only in the themes but also in the development. The impact of taking into account features at other resolutions than quarter notes could also be studied, especially when the tactus is not on quarter notes. Note also that most of our corpus is in the major mode. Further data could lead to the training of different models for major and minor keys.

Finally, other model topologies could analyze elaborate variations of sonata form with more flexibility (especially continuous expositions, as mentioned above) or focus on specific parts, such as the rotations in the development ().

Transactions of the International Society for Music Information Retrieval
