Multimodal Transcription and Text Analysis Anthony Baldry
Equinox
Published by
UK: Equinox Publishing Ltd., 1 Chelsea Manor Studios, Flood Street, London
SW3 5SR
USA: DBBC, 28 Main Street, Oakville, CT 06779
www.equinoxpub.com
Multimodal transcription and text analysis by Anthony Baldry and Paul J. Thibault
Acknowledgements xiii
Preface xv
2.0 Introduction 57
2.1 The printed page and its evolution 57
2.2 The resource integration principle in the scientific page 61
2.2.1 How can we study tables systematically? 64
2.2.2 How does the page communicate? 68
2.3 Science textbooks and multimodal meaning making 70
2.4 Visual, verbal and actional semiotic resources in a table 71
2.4.1 Visual and verbal resources 71
2.4.2 Thematic development of the page: hierarchies of textual periodicity 74
2.4.3 Actional semiotic resources 78
2.5 Blood under the microscope: multimodality in a photographic display 78
2.6 Integration of scientific photographs and verbal text 80
2.6.1 The textual metafunction 80
References 251
Appendices
Appendix I I - XII
Appendix II 261
Index 265
List of Figures
Figure 1.8: The metafunctions .... and a bear hug from ‘Boo’ 40
Figure 1.9: A mini-genre analysis of the Boo Bear text 41
Figure 1.10: Intertextual relationships in websites: the role of frames 45
Figure 2.1: Leaf movements: in Darwin (top); in a modern textbook (bottom) 59
Figure 2.2: A typical multimodal page in Marx’s Capital; and the top part
of an equivalent page in The Economist 62
Figure 2.3: An example of the use of the table in Marx 66
Figure 2.4: A typical use of charts in The Economist 67
Figure 2.5: An example of the use of vectors 68
Figure 2.6: Table and related verbal text pages 60 and 61 in Australian Biology 72
Figure 2.7: Blood under the microscope and related verbal text (pages 62-63) 73
Figure 2.8: Relation of elaboration: attribution between visual image and verbal label 87
Figure 2.9: Visual-verbal thematic formation; white corpuscles 88
Figure 2.10: 1982 Italian text 94
Figure 2.11: 1985 Italian text 95
Figure 2.12: The cline between ideational and interpersonal sourcing 97
Figure 2.13: Sequence of clauses 98
Figure 2.14: Sequence of clauses highlighting interpersonal negotiation 99
Figure 3.1: Web page genre schema 115
Figure 3.2: Nasa Kids Home Page: cluster analysis 121
Figure 3.3: Nasa Kids home page (focus on NasaToons object illuminated) 124
Figure 3.4: A Far Out Pioneer page 126
Figure 3.5: NasaToons menu of options page 128
Figure 3.6: The British Museum’s Children's COMPASS home page:
cluster analysis 131
Figure 3.7: The British Museum: Search the Museum page 132
Figure 3.8: Daily life in Asia page 133
Figure 3.9: The web page: Women sewing, a print 134
Figure 3.10: Covariate tie between verbal and visual semiotic modalities, creating
a cross-modal thematic relation in an airline magazine text 139
Figure 3.11: Search subcluster before and during mouse rollover 150
Figure 3.12: Interpersonal meaning potential of linked objects; three simultaneous
parameters 151
Figure 3.13: Objects that are clicked on to reveal a link 153
Figure 3.14: Layered structure of textual objects 155
Figures 4.1a and 4.1b: Transitivity frames in the Eskimo advertisement 168-9
Figure 4.2: A (revised) preliminary network for gaze in visual texts;
primary delicacy only 171
Figure 4.3: Waves relating to the soundtrack in the first phase of the Westpac
advertisement 183
Figure 4.4: Camera position relative to depicted world of image and visual
kinaesthesis of viewer: main options 194
Figure 4.5: System network of basic options for gaze in video texts 196
Figure 4.6: Notational conventions used in the transcription of the soundtrack 215
List of Tables
Table 1.1: Examples of possible temporal and causal expansions of the event
sequence in the dog-chews-shoe cartoon 12
Table 1.2: Example of the temporal sequencing of events in the dog-chews-shoe
cartoon showing change (two shoes then one shoe) resulting from
the transition from one moment to the next 13
Table 1.3: Three levels of semiotic organisation in the dog-chews-shoe
cartoon 14
Table 1.4: Distribution of semiotic resources in dog-chews-shoe cartoon 15
Table 1.5: The Mitsubishi Carisma text: Summary analysis of shots,
phases and macrophases 48
Table 1.6: The Mitsubishi Carisma advertisement: thematically salient
transitivity frames 49
Table 1.7: Phonetic and prosodic features in man's spoken voice 52
Table 1.8: Some salient meaning oppositions indexed by contrasting phonetic
and prosodic features in the speaking voices of the man and the woman 53
Table 2.1: Reconstruction of ellipted clauses relating to red corpuscles in the first
column of the table in Figure 2.6 75
Table 2.2: The multimodal thematic development of the page about red and
white blood corpuscles; integrating table and verbiage 76
Table 2.3: The co-articulation of the page into subregions showing
top-bottom and left-right organisation 82
Table 3.1: Types of web pages according to social activity 104
Table 3.2: Transcription of an unfolding hypertext pathway 129
Table 3.3: British Museum Children's COMPASS activity sequence 135
Table 3.4: Interaction potential of objects on the Nasa Kids home page 152
Table 3.5: Nasa Kids home page: comparison of three objects 154
Table 3.6: Textual links and functions in linked objects on Nasa Kids home page 156
Table 4.1: Stratification of video texts, showing both the relationship between the
expression (display) and content strata (depiction) of visual signs 226
Table 4.2: Some transformations in the delimited optic array of Phase 1 of the
Mitsubishi Carisma advertisement 229
Table 4.3: Some visual process types and their modes of realization in the
Mitsubishi Carisma advertisement 231
Table 4.4: Visual participant chains in Phase 1 of the Mitsubishi Carisma
advertisement 233
Table 4.5: Dependency relation between Shots 1 and 2 in Phase 1 of the
Mitsubishi Carisma advertisement 238
Table 4.6: Three sources of visual-textual ideational coherence in the
Mitsubishi Carisma advertisement: Phase 1 only 242
Foreword
Multimodal Transcription and Text Analysis is a book that many of us have been looking for:
a readable how-to manual for analyzing images, websites, video and film, cartoons, magazine
layouts, advertisements, textbooks, television programs, and computer games. Paul Thibault and
Anthony Baldry strike just the right balance between rich examples and accessible explanations
of the concepts that lie behind their practical methods. This book is, however, far more than a
how-to manual: it is a comprehensive introduction to the field of multimedia analysis.
Why is multimedia analysis so important today? Partly because multimedia themselves,
combining language with visual images, animations, video, music, and sound effects (at least!)
are becoming the dominant forms of communication in our society, not just for commercial
purposes, but also in our daily lives and personal activities. No one doubts that this is because
computers make working with multimedia (almost) easy and (almost) cheap. But that is using
multimedia. Why analyze multimedia?
Because today we understand, as never before, the power of media to influence how we
think and what we believe. We need to know how media produce their effects, how we interpret
and make sense of them and with them, and how to design media that will both influence people
and empower them to create and express their own insights and points of view. Multimedia
analysis is the foundation for both designing and criticizing media and their messages.
Multimodal Transcription and Text Analysis develops a systematic approach to understanding
how combinations of words, images, and sounds, whether sitting on a page or flashing
past in real time, make more meanings together than any one of them can make alone. Baldry
and Thibault have been working towards this for years. They belong to a growing community
of researchers in media studies, communication, education, linguistics, semiotics, sociology,
anthropology, and political science who have been developing methods to analyze media for the
last two decades or longer. Their approach has its origins in functional linguistics, an alternative
to the very abstract and formal theories of syntax that most people still associate with linguistics.
Thibault studied with one of the great linguists of the last half-century, Michael Halliday, who
developed a method of analyzing purely verbal text in terms of the available choices we have
in putting words together, and the differences in meaning that different choices of wording
make. Thibault and Baldry, like Theo van Leeuwen, Gunther Kress, myself and many others,
have now found ways to generalize Halliday's method to combinations of words with images,
sounds, actions, and more. That is what you will discover in this book.
Read this book and use it! Take advantage of the online web-based analysis tools that
the book prepares you to use. That is the best way to understand why transcription is not just
a boring task to be left to someone else (if you can afford to pay them), but the place where
theory meets data head-on and multimedia materials are re-framed for analysis in the way that
you decide. It is also the best way to start understanding what media are already doing to you,
and what you can do with them for your own purposes. That's important, and so is this book.
Jay Lemke,
Professor, Educational Studies,
University of Michigan, Ann Arbor
Acknowledgements
The authors and publisher wish to thank the following individuals and agencies for
permission to use copyright material. All possible care has been taken to trace and contact
the owners and copyright holders of the materials included and to provide full
acknowledgement of the use we have made of them.
All the texts we have presented have been analysed in a way which we feel is entirely
supportive of the goals of the various authors. Despite intense efforts, we have been
unable to trace and make suitable arrangements with the copyright holders of some of the
texts reproduced in this book. The authors and publisher would like to hear from them so
that this can be rectified. Below we acknowledge our debt and thanks to those copyright
holders whom we have managed to contact.
Chapter 1 investigates a variety of texts including cartoons, leaflets and websites. The
Marmaduke cartoons in Figures 1.2 and 1.3 have been reproduced with the permission of
United Media (UFS Inc.), New York and their Italian distributors Adnkronos, while the Lupo
Alberto cartoon in Figure 1.7 has been reproduced with the permission of McKenzie
Syndicate, MCK S.r.l, Milan. The LT leaflet in Figures 1.4, 1.5a and 1.5b which dates from
January 2000 is reproduced with kind permission of Transport for London. Equally, thanks
go to the Governing Body of the Chesapeake Bay Bridge-Tunnel Commission for permission to
reproduce parts of a Chesapeake Bay Bridge-Tunnel leaflet in Figures 1.6a and 1.6b. Finally,
the British Library has given permission to reproduce a website page from their Leonardo
Notebook online presentation (see Figure 1.10).
Chapter 2 investigates the printed page, the scientific page in particular, from
various standpoints, including the evolutionary one. Services (Special Collections), University
College London, are thanked for assistance regarding provision of the originals of Darwin's
The Power of Movement in Plants in the top part of Figure 2.1. W.H. Freeman and
Company/Worth Publishers are thanked for permission to reproduce the text in the bottom
part of Figure 2.1, which is from Helena Curtis's Biology. Similarly, The Economist is thanked
for permissions relating to Figures 2.2 and 2.4. Although we have been unable to trace the
authors of the text reproduced in Figure 2.10, which is taken from an Italian school science
textbook, we owe special thanks to Emilio Delmastro, who supervised the book's production
and layout, for his kindness in allowing us to reproduce the graphic layout of the page
in question. Similarly, we wish to thank Carlo Signorelli Editore for permission to reproduce
the text in Figure 2.11 taken from a 1985 Italian school science textbook. We also thank
Prof. GianLuigi Borgato of Unipress, Padua for permission to reproduce much of the
second half of Chapter 2 which was originally published as part of the CITATAL project
in a volume edited by a team from the University of Padua led by our colleague and friend
Carol Taylor Torsello, whom we thank profusely.
Chapter 3 investigates two children's websites. We thank the British Museum for permissions
relating to Figures 3.6, 3.7, 3.8, 3.9 and 3.13. We would be glad to hear from the copyright
holders of the NASA Kids website, whom we have been unable to contact despite
repeated efforts in relation to Figures 3.2, 3.3, 3.4 and 3.5. The snapshots which appear at
the bottom of Inset 11 have been supplied on condition of anonymity. We wish to thank
the anonymous donor whose snow removal efforts have not gone unrewarded.
Chapter 4 investigates three TV advertisements: the Eskimo, Mitsubishi Carisma and
Westpac commercials. The Eskimo text was shot in Whitehorse, Yukon Territory in 1997.
The grandfather and the young boy were respectively played by Cliff Solomon and Yudii
Mercredi. We thank the DDB Agency, Milan, Italy and the McKinney Agency, Durham,
North Carolina for permission to reproduce stills from this advert and their kind assistance
in contacting the Vancouver offices of The Characters Talent Agency and Kirk Talent, in
relation to permissions given by the actors. We also thank Auto-Germa, Verona, Italy, the
distributors for Audi cars in Italy for whom the advert was commissioned. With regard to
the Westpac text, the Westpac Banking Corporation has kindly agreed to the reproduction of
frames from a 1983 advertisement that celebrates Australia's identity as a nation and
Westpac's role in consolidating this identity. The authors are grateful to Equinox, the
publishers, for agreeing to reproduce frames from the Westpac advertisement in Appendix I in
colour. This has enabled us to provide a detailed description of the nature and functions
of colour in what is, from every standpoint, a superbly crafted advertisement. The
Mitsubishi Carisma text in Appendix II is an entertaining linguistic, visual and musical
spoof of James Bond films. Despite many efforts, we have not been able to trace the actors
or the agency who produced this delightful commercial, which we recorded from British
TV in the late 90s, and would like to hear from them.
We wish to take this opportunity to thank the many advertising agencies and companies,
in particular in Italy, who have provided us with a constant supply of TV commercials
and permissions, mainly for cars and drinks, for many years. Without this support,
this book would have been virtually impossible and research and teaching in relation to
multimodal texts would have been so much the poorer.
Finally, we wish to thank Gino Palladino, his brothers and the staff of Palladino
Editore, Molise, Italy for their permission to reproduce various parts of Multimodality and
Multimediality in the distance learning age, a thousand copies of which were printed and
published by them in 2000. This volume, now a collector's item, was the very first book on
multimodality to appear entirely in colour at a price that defies belief. We have fond
memories of the times we spent in Palladino's ultra-modern printing works tucked away in the
mountains of the Molise region in Central Italy. Multimodalists everywhere will be for ever
grateful to them for their courage and excellence in colour printing. We also wish to thank
Nicola Prozzo of IRRE Molise for his dedication and unfailing kindness in helping us to
further the multimodal cause, in a whole series of ways: photo calls, logistics, telephone
contacts, computer services and video recordings. To Gino and Nicola and our many
friends in the Molise region our heartfelt thanks.
Preface
The study of multimodal texts and multimodal meaning-making practices has
developed and matured considerably since the early 1990s. Unlike the pioneering
works of an earlier generation (e.g. Bateson, Birdwhistell, Scheflen), who were concerned
above all with behavioural and paralinguistic units of various kinds (e.g.
gesture, movement, posture), with a concomitant focus on the material dimension
of the behavioural units so described, the current focus is on both the material
dimension and the meaning in a unified and semiotically informed perspective. The
focus has shifted to the multimodal text as the site of meaning-making activity.
The present book participates in this shift.
Our analyses and transcriptions in this book are concerned, for example,
with what a particular pattern of movement means as a form of action involving
participants and taking place in a particular setting, or with how a particular choice
of colour, in combination with selections of other features, indexes a particular
attitudinal or evaluative stance. The examples can easily be multiplied and we will
leave further discussion of these to the chapters that follow. The point is that
specific choices and combinations of choices (e.g. movement, colour, and so on)
realise or express meanings (e.g. actions, evaluations) in multimodal texts. The
focus is on the meaning of different kinds of units and their functions in larger-
scale patterns of discourse organisation that cannot be described in terms of
small-scale units per se. By the same token, we also explore the importance of the
material dimension of these texts as making its own distinctive and important
contribution to the overall meaning-making process.
This book brings together our research efforts and findings on multimodal
text analysis and transcription and proposes a novel and distinctive approach which
is both meaning-based and functional. We aim to present a systematic account of
what multimodality is in relation to contemporary discourse practices in a variety
of social and cultural contexts. The term multimodality does not designate a pre-
given entity or text-type. Rather, it is a diversity of meaning-making activities that
are undergoing rapid change in the contemporary cultural context. Moreover, the
concept of multimodality is a useful yardstick for measuring and assessing the
diversity of ways in which texts and their associated meaning-making practices are
the results of the ways in which semiotic resources of various kinds work in
partnership to create the meanings that we attribute to texts. Multimodality therefore
invites us to reassess many older assumptions and prejudices at the same time
that it opens up new fields of enquiry and understanding.
The term multimodality covers a diversity of perspectives, ways of thinking
and possible approaches. It is not a single principle or approach. It is a multipurpose
toolkit, not a single tool for a single purpose. Multimodal text and discourse
multimodal texts, a set of tools is needed that are kept together for this purpose as
part of an overall kit. In this spirit, we invite our readers to take up and further
develop these techniques and principles for their own purposes, modifying and
adding to the toolkit in the process.
We wish to extend a special debt of gratitude to Malcolm Coulthard, Michael
Halliday, Ruqaiya Hasan, Jay Lemke, Ole Letnes, Eva Maagerø, Jim Martin, Kay
O’Halloran, Maria Pavesi, Carol Taylor Torsello, Chris Taylor, Elise Seip Tønnessen,
Gordon Tucker, Theo van Leeuwen and Eija Ventola for critically important
discussions as well as for their encouragement and support. Alessandra Varasi’s
unfailing professionalism and dedication to the task ensured that the very highest
standards were maintained during the final and often difficult preparation of the
manuscript with its rather special and demanding technical requirements. Grazie di
cuore, Alessandra! To the Equinox team, Janet Joyce, Val Hall and David Graddol, a very
special thanks for their commitment to this project and their willingness to provide
constant editorial support and advice whenever called upon. We also acknowledge the
generosity of the many students and colleagues, too numerous to mention here, who
volunteered their time and energy to read the manuscript and to offer many useful
suggestions and corrections that have helped us to improve the quality and readability
of the final text. A special word of thanks in this respect goes to Claire Archibald,
Patti Grunther, Sheila McVeigh and Robert Ponzini. To Maggie and Marisa, a very
special thanks for making it all possible in more ways and in more modalities than we
could possibly do justice to! Finally, texts are always the products and/or records of
activities at the same time that they constitute and organise the potential for other
activities. The present book is no exception. We hope that the fruits of our own
activities, as documented here, will encourage others to explore and to apply this
many-sided and always fascinating area of research to their own areas of interest in
the multimodal analysis of discourse in all its manifestations.
28th April 2005, Anthony Baldry and Paul J. Thibault
1.0 Introduction
• Malinowski (1923: 306) coined the term context of situation in order to broaden, as
he put it, the notion of context. He pointed out that the meaning of any single word
is to a very high degree dependent on its context (Malinowski, 1923: 306), in the
sense that its meaning is determined by the whole utterance in which it occurs. He
further pointed out that the utterance itself becomes only intelligible when it is
placed within its context of situation (Malinowski, 1923: 306). In coining this term,
he proposed both that the conception of context must be broadened beyond that
of the utterance to the situation and that the situation in which words are uttered
can never be passed over as irrelevant to the linguistic expression (Malinowski, 1923:
306). He also argued that the concept of context itself must burst the bonds of
mere linguistics and be carried over into the analysis of the general conditions under
which a language is spoken (Malinowski, 1923: 306). The study of language must be
therefore conducted in conjunction with the study of [the] culture and of [the]
environment of people who live under conditions different from our own and possess
a different culture (Malinowski, 1923: 306).
• Malinowski's ethnographic and anthropological perspectives on culture led him to
propose the notion of context of culture in order to connect language to the activities
through which human needs are satisfied and the forms of cultural organisation giving
rise to these activities. His definition of culture hinges on the notions of function
and organisation. He posited that there is a functional relation between a performance
or activity and a human need and that culture implies the organisation of human
behaviour so that any particular purpose can be achieved (Malinowski, 1944: 38-39).
The larger significant whole of utterances is an integral part of the meaning of
utterances and includes both the context of situation and the context of culture.
• Firth (1957 [1934]; 1957 [1950]) turned context of situation into a construct concerned
more with typical contexts of situation and the typical functions of language in these
contexts than with the thick ethnographic description that Malinowski brought to bear.
In our approach to multimodal text analysis and transcription, Malinowski's detailed
description of specific instances and Firth's concentration on the typical features of
different types of context of situation are equally important. Both have important lessons
to teach us. Malinowski (1923, 1935, 1944) shows the need to develop forms of
analysis and transcription that relate language and other semiotic modalities to each
other, to the activities they help to constitute, the meanings and functions of these in
their context of situation and how these relate to the context of culture. Firth's emphasis
on types of language functions in relation to types of context, on the other hand,
highlights the need to make generalisations and encourages us to connect text analysis
and transcription to questions of genre (see Inset 6, p. 43). They also show in a principled
way how different units and their relations on different scalar levels of textual
organisation (see Inset 12, p. 144) are all functional in some way to the meaning of
the whole. What both these early thinkers have in common is a clear understanding
of the contextual significance of units and functions at all levels: it is contextualisation
all the way up and all the way down. Context is not extrinsic to semiotic form
and function; rather, it is an integral part of it on all levels of textual organisation.
Inset 2: Text
• In Inset 1 we went back to the early insights of Malinowski and Firth regarding
context of situation and context of culture because they still remain startlingly fresh and
relevant to our present concerns. They not only provide a historical touchstone for
our own efforts in the tradition of systemic-functional linguistics but also invite us
to renew the connection, as Firth put it, with the full range of semiotic modalities
that function in partnership with each other when we analyse texts of all kinds and
seek to relate their forms of organisation to the contexts of situation and the contexts
of culture in which they function and make their meanings. With Firth's and
Malinowski's thinking in mind, we now examine Halliday's more recent definition of
text (see p. 4) from the perspective of systemic-functional linguistics.
• Halliday's functional definition of text helps us to see that text is a constitutive part
of some meaning-making event or activity in which the text participates. As
explained in Inset 13: System and instance (pp. 172-173) texts involve many interact-
ing systems of different kinds on different levels of textual organisation. Halliday
also shows that the definition of text readily extends to multimodal texts and even
to texts in which there is no language whatsoever. The important point is that texts
are embedded in, and help to constitute, the contexts in which they function. Texts
are thus inseparable parts of the meaning-making activities in which they take part.
A functional and semiotic definition of text seeks to understand the ways in which
the intrinsic properties of texts and their organisation enable them to be coupled to
their contexts. As we saw in Inset 1 on the facing page, context is not something
extrinsic to text. Rather, it is created when text users' knowledge of culture and
society interacts with the internal features of the text's organisation during the making
and interpreting of texts.
• Texts themselves may recontextualise meanings and practices in one modality to some
other modality. For example, a film version of a novel is a recontextualisation of other
semiotic modalities in this sense. A novel is a recontextualisation of the speech
genres of everyday life and many other semiotic modalities, practices and perceptual
experiences including, for example, many non-linguistic social activity-types and
many forms of auditory, sartorial, gustatory, bodily and other experiences and
practices. Consider the following literary example: "Hours later, the cart climbed the last
hill that hid Immortal Heart. I could hear the crowing of cocks, the yowling of dogs, all the
familiar sounds of our village." In this quotation from Amy Tan's novel, The Bonesetter's
Daughter (2001: 196), a physical event (the cart climbing the hill) and various familiar
sounds are recontextualised by the linguistic semiotic through specific choices in the
lexicogrammar. All of these events (the movement of the cart and the sounds of
the village cocks, dogs and so on) are themselves familiar types of experience that
are meaningful in the context that the writer connects them to. By the same token,
the indexical-symbolic resources of the linguistic semiotic allow for the possibility that
these sights and sounds can be indexically evoked in the mind's eye, so to speak, of
the reader as off-line perceptual experiences that readers may undergo. Texts of all kinds
allow for this constant criss-crossing of semiotic and perceptual modalities.
term the resource integration principle (Inset 3, pp. 18-19), which lies at the heart of
multimodality. We do this mainly in relation to cluster analysis (Inset 5, p. 31) and
phasal analysis (Inset 7, p. 47). In particular, this will help us understand the
relationship between individual multimodal texts and multimodal genres, or, to put
the matter in slightly different terms, between instance and type (see Inset 13, pp.
172-173). In this respect it is appropriate, as our very first step, to examine the
relationship between text and society in terms of the links between context of
situation and context of culture (Inset 1, p. 2) and text (Inset 2, p. 3), a step which
will help us subsequently to characterise the relevance of metafunctions (Inset 4, pp.
22-23) and primary and secondary genres (Inset 6, p. 43) in our approach to
multimodal text analysis and transcription.
What is a text? And what is a multimodal text? As we can see from Inset 2 on the
preceding page, in this book a text is a technical term which follows Halliday in
considering texts to be meaning-making events whose functions are defined by
their use in particular social contexts.
As Halliday points out, texts are not limited to the spoken and written media
of language. Instead there are many other resources that can be used to create texts
in addition to the spoken and written word. In this book, we shall explore these
other possibilities. As a starting point, we need to point out that different semiotic
modalities adopt different organisational principles for creating meanings. Different
semiotic modalities make different meanings in different ways according to the
different media of expression they use. When studying multimodal texts, it is all too
easy, for example, to underestimate the significance of the codeployment of space
with hand-arm movements as a meaning-making resource. To see this we may
examine the highly selective type of multimodal transcription in Figure 1.1 which
is designed specifically to reconstruct the relationship between hand-arm movements
and space. The transcription reconstructs the text in terms of phases and
Columns: Action; Visual Image (+ camera position); Movement and/or gesture; Space; Meaning

Subphase 1.3: Third Participant, the Zombie leader, introduced
  Visual image: Shot 3: Close-up shot of Zombie's head & arms popping out from beneath ground
  Movement and/or gesture: Zombie shakes body like an animal to remove loose earth
  Space: Above ground with implicit contrast to space below ground
  Meaning: Threat confirmed: underground creatures emerge threatening couple

Subphase 1.4: Final Participants, a large group of Zombies, introduced
  Visual image: Shot 4: Distant shot from raised position of car & trees showing Zombies encircling car
  Movement and/or gesture: Zombies' hands are outstretched
  Space: Space is represented as a diminishing circle with car as centre
  Meaning: as before

Subphase 1.5: Couple screams
  Visual image: Shot 5: Very close-up shot of couple's mouths from outside (Zombies' view of couple); Shot 6: From inside car over couple's shoulders (Couple's view of Zombies)
  Movement and/or gesture: Ø
  Space: The car is now completely encircled: focus on space inside car followed by focus on space inside and outside car. Only one space now exists
  Meaning: as before

Phase 2: Counter-measures

Subphase 2.1: Man operates car door locks
  Visual image: Shot 7: Very close-up shot of dashboard; Shot 8: Very close-up shot of door locking
  Movement and/or gesture: Pressing a button stops Zombies entering
  Space: Car shown as a space providing protection from outside
  Meaning: Safety: Couple's safe occupancy of the car's space can be reassured thanks to technology (the engine and ignition weren't even on)

Subphase 2.2: Zombie leader fails to get inside car
  Visual image: Shot 9: Very close-up shot of Zombie's hand on door handle
  Movement and/or gesture: Zombie's hand on door handle from outside car
  Space: Distinction between spaces is disappearing
  Meaning: Threat and safety: the two opposing forces are in the balance

Subphase 2.3: Other Zombies also fail
  Visual image: Shot 10: Close-up of two Zombies with raised hands attempting to touch car; Shot 11: Distant shot of heads and hands
  Movement and/or gesture: Number of outstretched hands gradually grows until nothing else is visible
  Space: All distinction between spaces has been lost
  Meaning: Threat re-confirmed

Phase 3: Zombies give up

Subphase 3.1: Couple sit out attack and express boredom
  Visual image: Shot 12: Close-up of man; Shot 13: Close-up of woman
  Movement and/or gesture: Man's hand stifles yawn as woman paints her nails
  Space: Separation between Zombies' space and car's space gradually increases
  Meaning: Safety reaffirmed

Subphase 3.2: Zombies leave
  Visual image: Shot 14: Medium close-up shot of Zombies leaving
  Movement and/or gesture: Zombies leave
  Space: as before
  Meaning: Drama over

Subphase 3.3: Zombie leader leaves after replacing windscreen wiper
  Visual image: Shots 15-18: Various shots of Zombie leader
  Movement and/or gesture: Zombie leader's replacement of wiper is a conciliatory gesture
  Space: as before
  Meaning: as before

Phase 4: End phase with logo
  Visual image: Logo appears superimposed on frozen Shot 18
  Movement and/or gesture: Ø
  Space: as before
  Meaning: as before
Multimodal texts and the resource integration principle
how the resource integration principle (see Inset 3, pp. 18-19) contributes to the rapid
alternation of expected and unexpected in a very short time span (a 30-second
advertisement). Our understanding of the kissing couple's predicament (or from
another point of view the Zombies' predicament) is heavily dependent on our
knowledge of film genres, the TV car advertisement included (see in this respect
Inset 8: Intertextuality, p. 55). This is the same thing as saying that the Zombie text,
like all texts, is dependent on, and partly creates, a particular context of situation and
a particular context of culture.
We will return to the question of expected and unexpected patterns in
multimodal texts on many occasions in this book, for example in the second part
of Chapter 2 (see 2.4 to 2.8, pp. 71-102), and in particular, in Chapter 4 when discussing
the ways in which soundtracks contribute to these patterns in film texts.
This relationship is further characterised in the associated online course (see
Preface, p. xv) as are the relationships, for example, between video tracks and
soundtracks in film texts. The exercises and text analyses in the associated course
which relate to printed media, websites and film texts are designed to further
analyse these relationships as well as to provide further contexts in which to
explore and apply the theoretical statements made throughout the book, for
example in the Insets.
Visual cluster | (one cartoon segment per column)
Phase in Narrative Event Structure | Visual setting: location and participants | the first man tried to enter the house, but was refused entry | the dog bites his shoe off and prevents him from entering the house | the dog chews up his shoe | the man, in pain, gives up and leaves the house with just one shoe | on the stairway, he encounters a second man who intends to enter the house
Resource integration and the transcription of printed cartoons
cartoons. Generally speaking, the frame that surrounds a picture separates the
depicted world of the picture inside the frame from that which is outside the frame (in
a manner which is partly analogous to the division of space inside and outside the car
in the Zombie text). The frame itself is not part of the depicted world of the picture,
but stands outside of it. The frame provides some implicit indication as to how the
picture is to be viewed. In doing so, it provides a metacomment on the depicted world
of the picture or, to put the matter in slightly different terms, it specifies a metarule
concerning how the things inside the frame are to be taken (Bateson, 1973 [1972]: 159-161).
In the first cartoon in Figure 1.2, the Marmaduke-in-a-playful-mood-text, the
sentence of direct speech occurring outside the frame specifies what belongs to the
depicted world at the same time that the woman in the foreground of the picture
constitutes its deictic centre. In other words, the words outside the frame are to be
attributed to one of the participants inside the frame at the same time that they
characterise the point of view of the woman inside the frame rather than the standpoint
of the cartoonist or the reader/viewer, who are outside the frame. Thus, the
reference point for the direct speech is the woman and not the outside observer or
the cartoonist who created the world depicted inside the frame.
How do we know that the words outside the frame are to be attributed to the
woman? Why aren't the words placed inside the frame, for example in a speech
bubble linked to the woman? The use of quotation marks to signal direct speech, on
the one hand, and the person deixis, mood and tense, on the other, all tell us that the
utterance represents the point of view of the woman rather than someone else outside
the frame. This is unusual insofar as items that are placed outside the frame are
normally taken to represent the reference point of an observer of the scene rather
than one of the participants in the scene. However, by presenting the words from
the point of view of the woman, the cartoonist is able to distance himself from
judgements concerning the truth or validity of the words attributed to the woman,
especially given that the depicted scene is a fictional one. Judgements of truth and
so on can therefore be suspended or, if you like, left to the cartoon characters
themselves in their fictive world.
At the same time, the placing of the words outside the frame can indicate
that the cartoonist wishes to adopt a particular affective or other interpersonal, e.g.
evaluative stance of, say, solidarity with the woman and her words. In this way, this
multimodal text uses the combined resources of written language and depiction to
present some aspects of the situation from the point of view of the participants in
the depicted scene and other aspects of it from the point of view of the external
observer of the scene so that the latter is drawn into a particular kind of
interpersonal relation with it or with some aspect of it. In the present case, it is the
woman's assessment of the situation, specifically of the man's plight after his
unhappy encounter with the dog in the background, which is salient. However, it
is the external observer who is able to view the whole depicted scene from his or
her reference point outside the frame. Such an observer will note the ironic discrepancy
between the woman's assessment of the dog's playful mood, the man's
plight and the immense power and energy of the dog as his owners struggle to
restrain him from overwhelming the retreating man.
As mentioned above, the frame functions as an implicit metacomment, in
Bateson's sense, on the depicted world of the cartoon and therefore signals that it
is to be interpreted as a cartoon world and not as a feature of the world outside the
frame. By the same token, the direct speech of the woman is attributed to a
participant of the depicted world inside the frame at the same time that it is used to
frame the interpersonal evaluative stance of an observer outside the frame. The
humour of the cartoon may in part be due to the paradoxical effects which derive
from this.
As readers of the text, we establish a link between the inverted commas and
the woman as part of the process of deducing that it is the woman who is speaking
since, apart from Marmaduke (a dog who cannot be expected to speak) and the
possible exception of the man (who is too stunned to speak), she is the only character
with an open mouth. Had there been more than one open mouth, the process
of associating a particular utterance with a particular speaker would have been
resolved in other ways, most probably through the presence of a speech bubble, with
an explicit link between a speaker and their words or thoughts. In such circumstances,
the utterance would almost certainly have been inside the framing border,
in contrast to the external position in this text.
A speech bubble is itself a partly prefabricated multimodal unit, a cluster (see
Inset 5: Clusters and cluster analysis, p. 31), made up of various resources including
language, curved and straight lines and space, that is ready to be pressed into
service once specific context customisation has been enacted. As Figure 1.7 in 1.3
(pp. 34-38) indicates, this customisation relates to the choice of specific, contextually
appropriate words and the decision to portray them as words (a link achieved
through lines) or as thoughts (a link achieved through circles). In our approach to
multimodal text analysis and transcription, clusters are groupings of resources that
form recognisable textual subunits that carry out specific functions within a
specific text. Multimodal transcription typically serves to identify the components
of each cluster and the function that each specific cluster plays within a text. A
further function of multimodal transcription is to identify the relations between
clusters in the same text and the relationship between specific multimodal clusters
and cluster types (see the discussion on primary genres in Inset 6, p. 43).
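The idea of a speech bubble as a partly prefabricated cluster that is customised in context can be pictured, very roughly, as a small data structure. The Python below is only an illustrative sketch: the names `Link`, `BubbleCluster` and `customise` are hypothetical and are not part of the transcription method, and the utterance passed in is a placeholder. It records just the two customisation choices mentioned above, namely the words chosen and whether the link to the speaker is drawn with lines (words) or with circles (thoughts).

```python
from dataclasses import dataclass
from enum import Enum


class Link(Enum):
    LINES = "lines"      # link to the speaker drawn with lines
    CIRCLES = "circles"  # link to the speaker drawn with circles


@dataclass(frozen=True)
class BubbleCluster:
    words: str   # the contextually chosen wording
    link: Link   # how the bubble is tied to the speaker
    # Resources named in the text as making up the prefabricated unit.
    resources: tuple = ("language", "curved and straight lines", "space")

    @property
    def portrays(self):
        """Lines portray the content as words, circles as thoughts."""
        return "words" if self.link is Link.LINES else "thoughts"


def customise(words, link):
    """Press the prefabricated cluster into service for a specific context."""
    return BubbleCluster(words=words, link=link)
```

For instance, `customise("placeholder utterance", Link.CIRCLES).portrays` evaluates to `"thoughts"`, while the same call with `Link.LINES` gives `"words"`; the list of resources stays the same in both cases, which is the sense in which the unit is prefabricated.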
Clusters are thus a prime indication of the localised effects of the resource
integration principle (see Inset 3, pp. 18-19). The variation in the complexity of the
codeployment of resources in any specific cluster is closely linked to social evolution
and technological developments. Contemporary society has unquestionably
12 Multimodal Transcription and Text Analysis: Chapter 1
instantiated more recorded multimodal texts, in particular dynamic film texts, than any
previous society (Baldry, 2000b: 28-38). Nevertheless, a rock painting in the
Australian desert, a 15th-century musical score, Leonardo's Notebook (see Figure 1.10
in 1.5, pp. 44-46) and the latest feature film with special effects all share a basic
feature in that they are units of meaning which carry out a specific function in a
specific social context, deploying various resources to this end. They all typically
contain clusters of related items. As such, though otherwise very different, they are
all multimodal texts. We should not overlook, however, as mentioned above (Inset 2:
Text, p. 3), the absence or restricted use of language in many multimodal texts. A
multimodal text may well be something written, spoken or a combination of written
and oral discourse, but it may also extend beyond the linguistic semiotic to include
other meaning-making modalities and, in so doing, may not necessarily include language.
The likelihood, however, is that in some genres, language will be pared back
as much as possible but not entirely excluded. The cartoons in this chapter are good
examples of this.
The simultaneity of visual presentation in the cartoon scenes in Figure 1.2
should not distract us from the way in which they tell a story involving a sequence of
events. Nor should it distract us from the changes or transformations that are
brought about as these events unfold in time. How can a simultaneously presented
configuration of events in the depicted world of a single picture tell a story? How
is succession in time and consequent change communicated? How can the reader
infer a narrative sequence on the basis of the depicted scene? In the remainder of the
current section, we will propose answers to these questions by exploring the ways
in which a discourse level of narrative organisation can be unpacked from the visual
and other resources used in the cartoon.
Reference to the semiotic resources used to create the second cartoon in
Figure 1.2 and its meanings helps provide some answers to these questions. The
man attempts to enter the house with both shoes on but leaves with just one shoe.
The other shoe is apparently being chewed up by the dog (part of whose head can
just be seen). The static depiction nevertheless manages to convey movement, a
temporal situation involving different moments in time as well as the change which
occurs with the passage from earlier moments in time to later ones. These three
factors together constitute an event, and events, regardless of their modality of realisation,
are the hallmark of narrative. Narratives, including cartoon narratives
Time 1 | Time 2
1. the dog took the man's shoe | and then the man left the house with no shoe on his right foot
2. the dog took the man's shoe | so that the man left the house with no shoe on his right foot
Table 1.2: Example of the temporal sequencing of events in the dog-chews-shoe cartoon showing change (two
shoes then one shoe) resulting from the transition from one moment to the next
1992: 20-21; Martin, Rose, 2003: 3-7). The event structure is a level of semiotic
organisation which is highly condensed in the visual organisation of the depicted
scene. Nevertheless, the temporal succession of events can be reconstructed or reactivated
in ways which partially detach it from the visual forms themselves. This shows
the need to distinguish the narrative event structure as a level of meaning which is
realised by, though not reducible to, the resources of the visual grammar used in the
cartoon drawing. It also suggests that a visual image such as the dog-chews-shoe
cartoon in Figures 1.2b and 1.3 can be analysed in terms of the three levels of
semiotic organisation presented in Table 1.3, which provides a synoptic reconstruction
with a focus on the experiential metafunction and which suggests some of the
kinds of meanings that are realised by choices from the visual grammar and their
combinations. Narrativity can be generated in a visual text like the shoe-chewing one
shown in Figures 1.2b and 1.3 on the basis of genre-related considerations such as
those listed below:
• contrary to appearances, the depicted scene in this cartoon is not
a single moment in time, though it can, of course, also be seen as
such. Rather, the scene implies a timeline comprised of actions and
events that take place in a given chronological order, which can be
deduced from the visual depiction;
• the chronological order of these events therefore corresponds to a
sequence of events in time;
• the participants who take part in the sequence of events maintain
their identity from one action or event to the next in the sequence;
• the transition from one action or event to another in the sequence
also entails change or transformation in some aspect of one or the
other of the participants but in a way that maintains their identity.
The above points summarise the conditions for the activation of narrative
discourse in a text. Figure 1.3 shows the narrative sequence that can be
reconstructed on the basis of the visual cues provided in the cartoon drawing. The
man's dress and the briefcase he is holding index his likely participant status as a
salesman who was hoping to gain entry to the house in order to discuss a business
transaction with the house owner. The reader can assume that prior to the moment
which is shown in the cartoon, he had knocked on the door, encountered the
savage dog, and that the dog took his shoe off. We see him retreating from the
house in obvious pain and discomfort. The reconstructed sequence of events
entails both continuity of participant roles (man and dog) over successive moments
in the sequence at the same time that some change occurs in the man at Time 3
and Time 4 as shown below when he loses his shoe:
(1) Time 1: man knocks on door of house with both shoes on;
(2) Time 2: man encounters savage dog;
(3) Time 3: dog bites one of his shoes off;
(4) Time 4: man leaves house without one of his shoes.
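The four time points above, together with the conditions for narrativity listed earlier (chronological order, continuity of participants, change in a participant), can be sketched as a small data model. The Python below is offered only as an illustration under stated assumptions: the names `Event`, `shoes_on_man`, `is_narrative` and the other helpers are hypothetical, and reducing "change in a participant" to a count of the man's shoes is a deliberate simplification rather than part of the analysis itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    time: int                # position on the reconstructed timeline
    description: str         # wording taken from the list above
    participants: frozenset  # participants involved in the event
    shoes_on_man: int        # hypothetical attribute recording the man's state


EVENTS = [
    Event(1, "man knocks on door of house with both shoes on", frozenset({"man"}), 2),
    Event(2, "man encounters savage dog", frozenset({"man", "dog"}), 2),
    Event(3, "dog bites one of his shoes off", frozenset({"man", "dog"}), 1),
    Event(4, "man leaves house without one of his shoes", frozenset({"man"}), 1),
]


def is_chronological(events):
    """Events follow one another in a strict temporal order."""
    return all(a.time < b.time for a, b in zip(events, events[1:]))


def participants_persist(events):
    """Each event shares at least one participant with the event before it."""
    return all(a.participants & b.participants for a, b in zip(events, events[1:]))


def changes(events):
    """Transitions (from_time, to_time) at which the man's state changes."""
    return [(a.time, b.time) for a, b in zip(events, events[1:])
            if a.shoes_on_man != b.shoes_on_man]


def is_narrative(events):
    """All three conditions hold: order, participant continuity, change."""
    return (is_chronological(events)
            and participants_persist(events)
            and bool(changes(events)))
```

On this sketch, `changes(EVENTS)` picks out the transition from Time 2 to Time 3, the point at which the man loses his shoe, and `is_narrative(EVENTS)` holds. Taking only the first two events, where nothing has yet changed, does not satisfy the conditions, which mirrors the point that change across moments in time is part of what makes the depicted scene readable as a story.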
Semiotic Resources: Language | Depiction | Sound | Movement
1.1.2. Multimodal transcription of cartoon narratives and the question of the metafunctions
What resources and what kinds of meanings made by these resources contribute
to the discourse level of narrative organisation? The contribution of the different
metafunctions (see Inset 4: Metafunctions, pp. 22-23) is discussed below with this
question in mind. In this respect, we need to understand that the meaning-making
processes of a text need to be defined in terms of four different but general and
concomitant types of meaning.
1) Logical Meaning. It is on the discourse level that the reader of the cartoon activates
the potential narrative meaning of the text by raising questions and providing
answers to them. For example, Why did the man want to enter the house? What was he
expecting to achieve? Who/what is he? What went wrong?, and so on. In this way,
relations of cause, time, comparison and so on between events in the sequence can
be postulated and possible answers provided. In the case of texts such as the one
above, the raising of such questions and the providing of answers to them generates
narrativity by seeking to find reasons for the changes which occur in
participants during the unfolding event sequence (see 4.11.6, in particular pp. 238-239).
For example, why did the man go to the house? What happened when the
door was opened? What will happen to the second man?
3) Experiential Meaning. A further factor that is important here for the activation of
the narrative discourse meaning is the set of expectations that apply to the two
different participant roles (salesman and savage dog), along with the ways in which
they are expected to interact with each other. Cartoons of this kind draw upon
stereotypical representations of social roles and the expectations that are associated
with these roles. Thus, salesmen do certain kinds of things such as knocking on the
doors of houses in the hope of finding potential clients, they dress in a certain way,
they usually carry a briefcase with their wares and so on. Likewise, savage dogs in suburban
houses make it difficult for strangers such as salesmen to enter, they are likely
to be aggressive and imposing, they may bite or chase such intruders and so on. In
this kind of relation, the reader associates a set of features with each of the
participants which characterise a specific role, the way that individuals (dogs or
humans) in the given role can be expected to behave in particular situations and so on.
(a) The resource integration principle views a semiotic resource as something used for the purposes
of making meaning and which accordingly functions in the texts in which these resources are used
to this end. Semiotic modalities such as language, gesture, depiction, gaze and so on, can be formalised
and described as resource systems in this sense. A semiotic resource system is thus a
system of semiotic forms that we can use for the purposes of making texts. The forms have
particular functions in the texts in which they are used. The notion of resource therefore captures
these two aspects, use and function, of the relevance of semiotic systems to the texts which
these systems make possible. This does not mean that the system pre-exists use in some abstract,
Platonic sense. More accurately, semiotic resource systems are distributed across many different
individuals in a particular context of culture of a community, which makes use of particular
semiotic resources. Different individuals winnow their own way through the culture they live in
and, in doing so, define and accumulate their own semiotic resource systems on the basis of their
own experience and participation in different social contexts in the course of their lives, encounters
with texts, educational and professional experience and so on. A system of semiotically salient
differences in some community, i.e. the differences that potentially make a difference in the
meaning-making practices of that community, is thus a resource for making meanings.
• Multimodal texts integrate selections from different semiotic resources to their principles of
organisation. For example, the printed page makes use of the resources of depiction, written language,
lexicogrammar, spatial positioning and arrangement of items, among other things. These
resources are not simply juxtaposed as separate modes of meaning making but are combined and
integrated to form a complex whole which cannot be reduced to, or explained in terms of, the
mere sum of its separate parts. The organisational principles of the whole (e.g. the page as a
visual unit) cannot be understood in terms of the different resources used, taken separately. The
resource integration principle refers to the ways in which the selections from the different semiotic
resource systems in multimodal texts relate to, and affect each other, in many complex ways across
many different levels of organisation. Multimodal texts are composite products of the combined
effects of all the resources used to create and interpret them. Lemke (1998) uses the term multiplying
effect to capture the way in which different semiotic modalities co-contextualise each other
in ways that are not predictable on the basis of the different semiotic resources seen as separate
modalities. The separation of different resources into different modalities is an analytical
abstraction. Different resources are analytically, but not constitutively, separable in actual texts.
• A semiotic resource system is thus a system of possible meanings and forms typically used to make
meanings in particular contexts. A system is always a theoretical abstraction from very many
instances (see Inset 13, pp. 172-173). An act of abstraction of this kind is an attempt to reconstruct
the possible forms and their typical patterns of combination in a given semiotic system. Language,
or rather some languages, has/have been extensively theorised as a system of semiotically salient differences
that social agents use in contextually constrained ways. The lexicogrammar of natural language
is thus to date the most studied case of a semiotic resource system from this standpoint.
However, the advent of computerised multimodal corpora will, in time, change this (Baldry,
Thibault, 2001, 2005) given that there is no reason, in principle, why this kind of thinking cannot be
extended to the complex systems of topological differentiation that characterise the grammar of
visual semiosis (see Kress, Van Leeuwen, 1996 and the discussion of gaze in 4.1, pp. 167-173).
Sources of meaning in multimodal texts & Inset 3 19
Inset 3: (continued)
• According to the resource integration principle, texts are never monomodal. Monomodality is
the result of a certain way of thinking of separate, distinct semiotic resources, abstracted from
use, as existing in their own right. In practice, texts of all kinds are always multimodal, making
use of, and combining, the resources of diverse semiotic systems in ways that show both generic
(i.e. standardised) and text-specific (i.e. individual, even innovative) aspects. This is so of even the
seemingly limiting case of the telephone conversation where no visual contact between the two
speakers (i.e. no videophone) features in the conversation. On the telephone, we attend to many
aspects of the other person's spoken voice that are not necessarily part of language (e.g. its
lexicogrammar and its phonology) in the narrow sense. Such resources include voice quality,
breath control, rate of speaking, hesitations and pauses. Speakers and listeners are not always
aware of these resources or even that they may be considered as meaning-making resources. Yet,
just like the choice of words, intonation and so on, these resources can be, and indeed often are,
modulated variously by speakers to create specific meaning effects just as listeners can attend to
the speaker's use of them, again to varying degrees of conscious awareness, as they interpret the
speaker's meanings in relation to what is said and how. It is no accident that in many call centres
telephone operators are trained to attend to and interpret the significance of often subtle cues
in the voices of the potential clients whom they never see. In this sense, the telephone voice is a
multimodal semiotic resource that some people learn to cultivate and use to great effect. Thus,
from the perspective of both the system and the instance, the resource integration principle is
essential when attempting to understand how meanings are created in multimodal texts. A
further example will suffice here: the joint verbal-visual thematic relations created by multimodal
displays, tables and diagrams in school science textbooks are a semiotic-cognitive resource
through which the specialised meanings of the scientific topic can be stored, accessed, activated
and further developed by users of these books in teaching and learning activities (see Chapter 2).
(b) The meaning-compression principle refers to the effect of the interaction of smaller-scale
semiotic resources on higher-scalar levels where meaning is observed and interpreted. Take the
London Transport (LT) leaflet in Figure 1.4. Familiar shapes such as the fried egg, the mushroom
and so on, and the rhythmical, patterned relations among them reduce and compress
more complex problems on larger space-time scales to a set of patterned relations between
familiar visual shapes and minimal verbal text. Visual scanning of these patterns may take mere
seconds and places no burden on processing. These patterns are, in turn, contextually integrated
with the complex task of encompassing in one's mind the vast reality of the city of London
and the fare structure of its urban transport system. The meaning-compression principle makes
this task manageable in this text by compressing and reducing the complexity of the higher-
scalar reality which is being interpreted to a series of rhythmically patterned and interrelated
visual shapes and images on the here-now scale of visual scanning. Readers are thus able, quickly
and effortlessly, to process these visual patterns obtaining the necessary information about the
LT system and its fare structure. The meaning-compression principle is a principle of economy
whereby patterned multimodal combinations of visual and verbal resources on the small, highly
compressed scale of the leaflet provide semiotic models of the larger, more complex realities
that individuals have to engage with. In this way, a given combination of resources compresses,
in its patterned arrangements, meanings which can be unpacked and integrated into a more
specified semiotic configuration on a higher level of textual organisation.
In keeping with the view presented here, it may well turn out to be the case that language
cannot be adequately described and theorised as a system in its own right.
Rather, language and other semiotic resource systems, such as gesture, body movement
and gaze, are likely to be parts of a still larger system which may well turn out
to look very different from any of these components taken separately. Many transcriptions
of texts seem to have more in common with literary rather than linguistic
traditions, resembling playwrights' asides of the in-a-soft-voice-glancing-at and beckoning-off-stage
type. This type of stage instruction is itself an instruction for a text's
recontextualisation in other semiotic modalities in a performance text (see Inset 14,
pp. 175-177). The transcription procedures discussed in this book seek to reveal the
multimodal basis of a text's meaning in a systematic rather than an ad hoc way. They
are truly part of a discourse analysis, rather than a literary analysis, tradition. In this
sense, multimodality refers to the diverse ways in which a number of distinct
semiotic resource systems are both codeployed and co-contextualised in the making of
a text-specific meaning. Rather than separate communicative channels which are
ancillary to, or which in some way supplement a primary linguistic meaning, the guiding
assumption is that the meaning of the text is the result of the various ways in
which elements from different classes of phenomena (words, actions, objects,
visual images, sounds and so on) are related to each other as parts functioning in
some larger whole.
Meaning making is the process, the activity of making and construing such
patterned relations among different classes of such elements. The term multimodal
thus recognises that, from an analytical standpoint, it is important and necessary to
distinguish different classes of meaning-making resources rather than group them
together as members of some more general class which fails to specify their individual
characteristics. Such a class would be too general to be really useful. By the same
token, the term multimodal recognises that different kinds of resources are combined
to produce an overall textual meaning. As the Marmaduke cartoons show, the
meaning of the text is not the result of merely adding the meanings of one resource
(language, say) to those of another, such as the visual image. Meaning is multiplicative
rather than additive (Bateson, 1987 [1951]: 175; Lemke, 1998). This fundamental
property emerges clearly when we examine texts in detail through
transcription. In this respect, we may now turn our attention to leaflets designed to give
information on transport services to the public. This might be assumed to be a field
where ideological manipulations are absent. Nothing could be farther from the truth.
Identifying multimodal clusters (see Inset 5: Clusters and cluster analysis, p. 31) is particularly
useful when describing multimodal texts, not least because it helps
exemplify some of the principles we have so far described with reference to some
Inset 4: Metafunctions
• Halliday (e.g. 1979) posits that the content stratum of language, its lexicogrammar and
semantics, is internally organised in terms of a small number of very general
functional regions that are simultaneously interwoven and configured in the internal
organisation of lexicogrammatical form, corresponding, respectively, to the
experiential, interpersonal, textual and logical dimensions of linguistic meaning.
Inset 4: (continued)
that other semiotic systems such as depiction, gesture, sign, movement, music and so on, have
metafunctional characteristics. This does not mean that the very different characteristics and meanings
of these systems are being reduced to forms of analysis more appropriate to language. Rather, it
shows that all semiotic systems have in common some very general kinds of meanings though
the specific meanings and their [...] organisation which provide a basis for the integration of different
modalities in multimodal texts.
[Figure labels: Logical; Dependency; RESULT/CONSEQUENCE; [then]; Experiential; Senser; Process: mental: cognition; Phenomenon; Verbiage; Interpersonal; Mood element; Finite: modal operator; Theme; Rheme; (co-reference tie); clause items: can, learn]
sample texts and some sample multimodal transcriptions. We have already considered
one type of multimodal cluster, the speech and thought bubbles of cartoons,
and suggested how this type of cluster is customised according to context in keeping
with Halliday's principle that texts are units of meaning in specific contexts (see
Inset 2: Text, p. 3). Bearing this in mind, we can now take a close look at the London
Transport (LT) public service leaflet shown in Figure 1.4 (top and bottom part). This
text was used to guide and assist Londoners as regards fare structures at the turn of
the millennium (see also www.tfl.gov.uk/tfl/).
How can we go about describing it in the light of what we have said so far?
Could we, for example, simply say that it takes the form of a six-page leaflet which,
when folded, presents two identical cover pages (Pages 2 and 3) functioning as the
front and back covers (see top part of Figure 1.4)? Or should we focus on the fact
that the cover page in question announces the text’s basic thematic content: a new
two-fare structure for London buses? When the leaflet is unfolded, the reader discovers
the details of the fare structure, which are expanded and contextualised on
the reverse side of the leaflet (Pages 4, 5 and 6 in the bottom part of Figure 1.4).
The final page on the cover side of the leaflet (Page 1 in Figure 1.4, top part) provides
a second thematic expansion with the description of Saver 6, a new but
different kind of fare. In answer to our question, we may take as our starting point
the observation that this text is organised in terms of a series of multimodal spherical
or hemispherical clusters containing some striking combinations of visual,
verbal and spatial resources to explain and justify a new simplified fare structure
for London buses in the new millennium. The fried egg, the mushroom, the
wastepaper basket, the weight, the cup of tea are thus more than eye-catching add-ons
and cannot be eliminated without substantially changing the text’s meaning.
To see this, put your hand over the various textual objects on the cover page
and you will notice that, without the tight integration between the visual, the verbal
and the spatial, in particular the lines linking abstract numbers to concrete objects,
it would not be possible either to grasp the principle of the division of London
into two new tariff zones or to accept the social message that such a division is
consistent with life and travel in London. Still unconvinced that the reading process is
guided by the meaning-multiplying effects of the resource integration principle?
Then try rewriting the leaflet as a piece of written discourse in a way that matches
the concision of the fried egg, the mushroom, the wastepaper basket as visual
metaphors for London as a physical, economic and social entity. In saying this we
are looking at the resource integration principle from a slightly different perspective,
the perspective (further discussed in Chapter 2) of meaning compression. By
meaning compression (see Inset 3 part b, p. 19) we mean the power of multimodal
texts to allow users to identify meanings from combinations of resources in
context with the utmost efficiency, and, in particular, with much greater efficiency
than would have been the case if a different set of resources had been used.
Cluster analysis and the transcription of static multimodal texts 25
Figure 1.4: A London Transport leaflet (top part: cover side; bottom part: reverse side)
transcription is thus to treat the page as a composite of six primary objects or clusters
(Figure 1.5b) and to look at their composition, including a characterisation of the
relationships existing within these objects. The cluster-oriented macro-transcription in
Figure 1.5a, on the other hand, records the relationships between the primary objects.
A different approach could, in theory, have been adopted in which the notion
of cluster is eliminated. This, however, would go against a basic principle explored in
this chapter, namely that multimodal texts are typically made up of partly
prefabricated meaning-making units or primary genres (see Inset 6: Bakhtin’s
distinction between primary genres and secondary genres, p. 43). Of these, the first
cluster, recognisable as a title, and the last, recognisable as a combination of a slogan
and a logo (a textual subunit sometimes called a slogo), are all but obligatory in a
publicity leaflet, as they respectively announce the text’s basic theme and identify a
particular company, association or institutional body, the equivalent in a novel of the
author’s name and other aspects of its identity, such as the title and name of the
publisher. The remaining objects have no traditional name and are less obviously
prefabricated vis-à-vis the others, in the sense that it would be difficult to cite other
texts or types of text in which these specific combinations appear (but see Figure
1.5a). Like the title and the slogo, they are immediately recognisable as functional
units within the text, i.e. units which, though analysable in terms of subcomponents
with many potential meanings, nevertheless share the characteristic that they are a
basic unit of meaning in a specific text.
This does not mean, of course, that the components of each primary object
cannot be the source of a specific meaning in another text, i.e. function as a
primary object in their own right. Thus, in a different context, one of the lines that
links the fare to one of the objects representing London might, for example, rep-
resent a marker of a division between various sections in a chapter or in a web
page. Nor does it mean that the components are meaningless but rather that in
this text their meaning is not primary. Instead a component is subordinated in such
a way as to function as part of a cluster of resources. It is the cluster and the rela-
tionships between clusters, rather than the individual parts of individual clusters,
that make meaning in a specific context.
The transcription given in Figure 1.5b also helps make explicit other choices
that have been made in the construction of the LT text and in the way it makes its
meaning. Most obviously, repetition: four similarly-shaped round objects have been
selected which systematically change their size as we move through the text: they
get bigger and bigger as we go down the cover page, all of which gives the reader
a visual clue as to the fact that there is a vertical reading path to follow. In other
words, when defining the relationships between the various objects, the
transcription helps cement the idea that the reading process on the cover page is
intended to follow both a vertical and a horizontal path. The lines linking the fare
structure to the concentric circles follow a zig-zag path through the text, moving
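[Figure 1.5a: cluster-oriented macro-transcription of the LT leaflet. Numbered boxes recovered from the figure: 1, 7, 8a, 2, 3, 5, 9, 8b, 6, 10]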
This macroanalysis of one part of the LT leaflet reconstructs the links between clusters and the
cluster hopping that the reading/viewing process involves when attempting to decipher the text’s
meaning chains. The transcription thus uses numbered boxes to identify the various clusters as well
as dotted lines to indicate the links between them. In the central panel, corresponding to the cover
page, Clusters 1 and 6 are respectively the start and end of the first meaning chain. In terms of
cluster type, they are respectively a title and a slogo whereas Clusters 2 to 5 are theme-expansion
clusters, i.e. clusters which serve to develop the two-fare theme. They are also thematically
interrelated metaphors for London. Clusters 2 and 4 are, in part, repeated on the reverse side (see Figure
1.4) where their indexical relationship (they stand in for the map of London) is made explicit.
Cluster 7 in the right-hand panel is a derived cluster in the sense that, by following the meaning
chain, the reader/viewer comes to understand that the collective function of Clusters 1-6 is to spell
out the details of the two-fare structure. This derived cluster occupies a central empty space in the
panel, a position that helps cement the fare details in the reader/viewer’s mind. Cluster 8 in the
left-hand panel introduces a second meaning chain consisting of three clusters. It is, however,
discontinuous, divided into two by Cluster 9, a reworking and partial repetition of Cluster 5. The function
of Cluster 9 is to link up the two meaning chains, thereby underscoring the relevance of the primary
thematic to the secondary one. Cluster 10 provides information about the production of the leaflet.
When unfolded, the three panels (or pagelets) form a macropage read both vertically (in which case
the regularity and details of the fare structure are foregrounded) and horizontally (in which case the
different types of social venues reached by buses come more into focus). The same processes are
at work on the reverse macropage (see Figure 1.4). The text differs from the Chesapeake Bay text
(Figure 1.6) where the clusters link up to create a more traditional, top-down reading path.
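The macro-transcription described above can also be held as a small data structure, which may help when two leaflets are compared cluster by cluster. What follows is a minimal sketch in Python and not part of the original transcription. The names Cluster, CLUSTERS, MEANING_CHAINS, REWORKS, links and chains_through are hypothetical; the order of Clusters 2 to 5 inside the first chain is assumed to be numerical; the second chain is assumed to run 8a, 9, 8b; and a panel is recorded only where the description states it.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Cluster:
    label: str            # box number used in the macro-transcription
    kind: str             # cluster type, paraphrased from the description
    panel: Optional[str]  # pagelet holding the cluster; None where not stated


CLUSTERS: Dict[str, Cluster] = {
    c.label: c
    for c in [
        Cluster("1", "title", "central"),
        Cluster("2", "theme-expansion", "central"),
        Cluster("3", "theme-expansion", "central"),
        Cluster("4", "theme-expansion", "central"),
        Cluster("5", "theme-expansion", "central"),
        Cluster("6", "slogo", "central"),
        Cluster("7", "derived", "right-hand"),
        Cluster("8a", "second chain", "left-hand"),
        Cluster("8b", "second chain", "left-hand"),
        Cluster("9", "linking", None),
        Cluster("10", "production information", None),
    ]
}

# Assumed orderings: see the lead-in paragraph.
MEANING_CHAINS: Dict[str, List[str]] = {
    "first": ["1", "2", "3", "4", "5", "6"],
    "second": ["8a", "9", "8b"],
}

# Cluster 9 is described as a reworking and partial repetition of Cluster 5.
REWORKS: Dict[str, str] = {"9": "5"}


def links(chain: List[str]) -> List[Tuple[str, str]]:
    """Return the consecutive cluster-to-cluster hops along one meaning chain."""
    return list(zip(chain, chain[1:]))


def chains_through(label: str) -> List[str]:
    """Name every meaning chain that passes through the given cluster."""
    return [name for name, chain in MEANING_CHAINS.items() if label in chain]


if __name__ == "__main__":
    for name, chain in MEANING_CHAINS.items():
        print(name, links(chain))
```

Keeping the chains separate from the cluster inventory mirrors the distinction drawn in the text between what a cluster is (its type) and how the reader hops from one cluster to the next.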
4. The basket
   Wordings: Callouts: as above; Overlay (1) 251,176 offices (2) as above
   Visual Image: wastepaper basket consisting, as above, of 3 concentric rings:
      (1) External ring: partly as above but with colour differences: silver raised rim
          marked off by black border
      (2) Central ring: wire mesh containing, as above, Overlays 1 and 2
      (3) Inner ring: screwed-up paper
   Viewing position: as above
   Ellipsis: (1) as above but with the rightmost part deleted
      (2) slightly ellipted at the top due to overlapping (Cluster 2)
   Vectors and their mutual disposition: as above
   Spatial disposition: (1) rightward
      (2) callout: as above, overlay on right
   Cluster size: as above
5. The fried egg
   Wordings: Callouts: (1) 1.819 cafés (2) as above
   Visual Image: fried egg consisting of 3 concentric rings:
      (1) External ring: partly as above but with colour differences: golden brown rim
      (2) Central ring: albumen containing, as above, Overlays 1 and 2
      (3) Inner ring: egg yolk
   Viewing position: as above
   Ellipsis: (1) as above but leftmost part is deleted
      (2) as above, slightly ellipted at the top (Cluster 3)
   Vectors and their mutual disposition: as above
   Spatial disposition: (1) leftward
      (2) callout: as above, overlay on left
   Cluster size: as above
6. The slogo
   Wordings: Making London simple
   Visual Image: The LT logo
   Spatial disposition: rightward orientation of textual objects
first to the right of the first cluster, then to the left of the second cluster, and so
on. When we see this, we begin to realise that the cover page, however asymmetri-
cally, resembles a table in which the columns and rows are functionally, though not
formally, present. What is special about this table is that the most important column,
the central one, contains wording that seems to extend from the leftmost and right-
most parts of the page. In fact, the linguistic elements are organised in such a way
that the central part of the text provides the details of the actual fares while the top-
most, leftmost and rightmost parts provide the principle of a two-fare structure.
Thus the part of the LT text transcribed in Figure 1.5b, and to a large extent
the entire LT text, may be construed as a pseudo-table that can be read in a variety
of orders that combine the vertical and horizontal readings that a table makes avail-
able. We will discuss tables and reading paths in much greater detail in Chapter 2 in
relation to economics and science texts and in Chapter 3 in relation to the web
page. For the moment we will simply observe that readers can go down the cen-
tral column of the LT text or can read it from left to right along the rows.
Alternatively, the text can be read in a stepping-stone fashion, jumping in a zig-zag
way following a pathway that respects the default way of reading in Western cul-
ture, namely from left to right and from top to bottom, but requiring the reader to
do so in a series of jumps. It is no coincidence that the layout of the page, which
includes the fact that four round objects are grouped together in pairs (two on the
left, two on the right), is carefully arranged to encourage the reader to be aware that
the text will not make its meaning entirely through language but that other reading
skills need to be at work. Not all leaflets are inspired by the principle of tabular or
pseudo-tabular reading. Nevertheless, it is surprising just how many are.
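The three reading orders just mentioned (down a column, along a row, and in stepping-stone jumps) can be made concrete with a small sketch. The Python below is an illustration only, under stated assumptions: the grid PAGE is a hypothetical pseudo-table whose cell labels are placeholders rather than a transcription of the LT cover page, and the function names are ours.

```python
from typing import List, Optional

Grid = List[List[Optional[str]]]

# Hypothetical pseudo-table: rows run top to bottom; columns are left, centre, right.
# None marks a position where no cluster is present. Labels are placeholders.
PAGE: Grid = [
    [None, "title", None],
    ["object A", "fare 1", None],
    [None, "fare 2", "object B"],
    ["object C", "fare 3", None],
    [None, "fare 4", "object D"],
]


def column_path(grid: Grid, col: int) -> List[str]:
    """Vertical reading: go down one column, skipping empty positions."""
    return [row[col] for row in grid if row[col] is not None]


def row_path(grid: Grid, row: int) -> List[str]:
    """Horizontal reading: go along one row from left to right."""
    return [cell for cell in grid[row] if cell is not None]


def stepping_stone_path(grid: Grid) -> List[str]:
    """Left-to-right, top-to-bottom reading that jumps between filled positions."""
    return [cell for row in grid for cell in row if cell is not None]


if __name__ == "__main__":
    print(column_path(PAGE, 1))
    print(row_path(PAGE, 2))
    print(stepping_stone_path(PAGE))
```

Because the side columns are only partly filled, the stepping-stone path lands alternately on the left and on the right of the central column, which is one way of picturing the zig-zag described above.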
One descriptive principle followed in this chapter is that there will be many
occasions where, because of the limitations of the page structure, the meaning-
making processes of a text cannot be captured in a single transcription and will
instead require a series of transcriptions to be made that give different zooms of
the text. Thus, while Figure 1.5a functions as a macro-transcription of a single page,
Figure 1.5b is a micro-transcription, in that it reconstructs the micro-structure of the
same page in all its manifold detail. All this reflects the fact that the LT text, like all
the texts we analyse in this book, is a multimodal text which integrates many
meaning-making resources. It is not a linguistic text with pretty visual add-ons but
one in which visual, spatial and linguistic elements are carefully and tightly integrated.
Our use of the term cluster refers to a local grouping of items, in particular, on a
printed or web page (but also other texts such as manuscripts, paintings and films).
The items in a particular cluster may be visual, verbal and so on and are spatially prox-
imate thereby defining a specific region or subregion of the page as a whole. The
items in a cluster are functionally related both to each other and to the whole to which
they belong as parts. For example, in the Nasa Kids website in Chapter 3, the Nasa
logo and the masthead (see Figure 3.2) are two functionally related components in this
sense. The logo specifies the institutional source of the meanings of the website and
the masthead extends these meanings by connecting the institutional logo to the more
specific, children-based concerns of the Nasa Kids home page. Another example
from the same page is Cluster 17 which consists of two functionally related parts of
a larger whole, i.e. the two components of the activity sequence which is realised
when the user inserts a search item in the search engine and then clicks on ‘go’. Once
again, Cluster 17 is characterised by the spatial proximity of the two items in keeping
with the definition of cluster given above (which may differ from those given by others
e.g. Kok, 2004: 135-136). Clusters are often partly prefabricated structures and as such
frequently enact primary genres (see Inset 6: Bakhtin’s distinction between primary
genres and secondary genres, p. 43). This is the case with the multimodal search engine
cluster/primary genre implementing a Question ^ Response sequence through visual and
kinetic resources as well as linguistic ones.
Cluster analysis helps us to see how larger-scale items and the relationships in the
visual field contain smaller-scale ones just as smaller-scale ones such as clusters are
contained within larger ones. A cluster is a locus of inclusion for a small-scale
functional arrangement of items included in some larger-scale arrangement (including
superclusters: see Chapter 3, e.g. Figure 3.1). Thus, when we use the term cluster to
define a local grouping of multimodal items which are part of a larger unit in which
they function, our use of the term presupposes that clusters are in some way func-
tionally related to each other. As mentioned above, clusters of items and objects on,
for example, a web page are small-scale arrangements of items which are nested
within larger wholes. Some clusters or some items in some clusters may move while
others may remain inanimate. Other clusters may create intertextual relationships (see
Inset 8: Intertextuality, p. 55) between texts as happens with the virtual magnifying
glass (Figure 1.10) which, by sliding over and magnifying parts of Leonardo’s Notebook,
contributes to the goal of linking various parts of a rare manuscript to an illustrative
web-based commentary. The notion of cluster thus aims to show that the items and
objects displayed on screen, the web page or on the printed page are not separate
items but are instead connected to other items. They are nested within larger struc-
tures and have relationships with some items with which they are proximally
connected more than others. The visual field that is displayed by the screen or the page
as a whole can be subdivided into various parts such as top, bottom, right and left. The
visual field displayed on the screen or the page does not consist of discrete, separate
items but is instead completely filled. Cluster analysis is one tool for understanding the
ways in which it is filled.
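The idea that clusters are spatially bounded groupings nested within larger wholes lends itself to a simple tree of regions. The following Python sketch is offered as one possible way of recording such nesting, not as the method used in this book. The class Unit, the function is_nested and all coordinates are hypothetical, and the grouping of the logo with the masthead under a single parent is an assumption made for the sake of the example.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple

Box = Tuple[float, float, float, float]  # left, top, right, bottom (hypothetical page units)


@dataclass
class Unit:
    name: str
    box: Box
    parts: List["Unit"] = field(default_factory=list)

    def encloses(self, other: "Unit") -> bool:
        """True if the other unit's region lies wholly inside this unit's region."""
        left, top, right, bottom = self.box
        o_left, o_top, o_right, o_bottom = other.box
        return left <= o_left and top <= o_top and o_right <= right and o_bottom <= bottom

    def walk(self, depth: int = 0) -> Iterator[Tuple[int, str]]:
        """Yield (nesting depth, name) for this unit and everything nested in it."""
        yield depth, self.name
        for part in self.parts:
            yield from part.walk(depth + 1)


def is_nested(unit: Unit) -> bool:
    """Check that every part sits inside its parent, at every level of the tree."""
    return all(unit.encloses(part) and is_nested(part) for part in unit.parts)


# Hypothetical layout loosely modelled on the two examples given in the text.
page = Unit("page", (0, 0, 100, 100), [
    Unit("logo and masthead", (0, 0, 100, 15), [
        Unit("logo", (0, 0, 15, 15)),
        Unit("masthead", (15, 0, 100, 15)),
    ]),
    Unit("Cluster 17 (search)", (60, 20, 100, 30), [
        Unit("search box", (60, 20, 90, 30)),
        Unit("go button", (90, 20, 100, 30)),
    ]),
])

if __name__ == "__main__":
    for depth, name in page.walk():
        print("  " * depth + name)
    print("properly nested:", is_nested(page))
```

A tree of this kind records only containment and proximity; the functional relations between the items in a cluster still have to be stated by the analyst.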
Chesapeake Bay leaflet (dating from 1989) shown in Figure 1.6a bears a number of
striking similarities to the LT text: in just the same way that the latter explains how
the London traveller is spared the hassle of a complex fare system, so travellers on
the East Coast of the United States are shown how they can be spared the hassle
of a long inland detour between New York and Virginia Beach (see www.cbbt.com).
Although slightly larger and consisting of 12 as opposed to 6 pagelets, it is also
folded to provide identical cover pages, or rather almost identical cover pages,
since, as the panel in Figure 1.6b shows, the photograph in the central cluster has
changed: the front cover depicts the bridge aspect of the bridge-tunnel complex
while the reverse cover depicts the tunnel. The question arises as to whether we
can use the multimodal transcription to characterise the differences and similarities
between specific instances of this subgenre, i.e. whether we can compare two
or more texts in an approach that tends to focus on recurrent features in
textual genres. The macro approach to multimodal transcription certainly helps bring
out subsidiary thematics; in the case of the leaflets examined here, it brings out the
social aspects associated with the transport service offered. Thus, while there are
similarities in the thematics of the texts, there are, on the other hand, striking
differences in the way that these thematics are presented, most obviously the extensive
use of concrete contextualised and illustrative maps and photographs in the
Chesapeake Bay text, while the map and the photographs are much more abstract and
decontextualised in the LT text.
Figure 1.6a: Parts of a leaflet for the Chesapeake Bay tunnel-bridge complex
In the LT text, the map is the final step in the progression of circular objects
(e.g. the mushroom) which continue on the reverse side of the leaflet. In this case,
the map has a primarily textual function in that it helps clarify that the dual-circle
objects (corresponding to the two rings of inner and outer London) contribute to
the text’s overall symmetry, whereas in the Chesapeake Bay text the function of the
map (not shown) is primarily interpersonal in nature: it is designed to orient the
reader on his or her journey and to persuade him or her that the route is a time-
saver. The multimodal transcriptions of the two leaflets thus show how the
organisation of different multimodal clusters and the relations between them pri-
oritise different goals. Both leaflets are designed to inform travellers but differ in
their mix of essential public service and personal entertainment.
In discussing the different goals of these two texts we have hinted at the fact
that meaning making in multimodal texts can be characterised in terms of different
overall functions (O’Toole, 1994, Kress, Van Leeuwen, 1996, Baldry, 2000,
Thibault, 2000a, Baldry, Thibault, 2001). In 1.4, pp. 38-44 we will see how texts of
all kinds can be systematically described in terms of the metafunctionally-organised
meanings that they make (see Inset 4: Metafunctions, pp. 22-23). In other
words, we will explore the interplay between the metafunctions more rigorously than we
have so far, given the need in subsequent chapters to use transcription techniques that
link up the relationship between resources and metafunctions in a systematic and
highly detailed way. However, before we do this it is time to take another, more detailed
look at cartoons. Specifically, we will take a closer look at the conventions cartoons
use and the way their textual properties can contribute to interpersonal meaning.
with the printed cartoon, have a much more limited set of resources available than,
say, a 15th century artist from Central Italy or a 16th century Dutch portrait painter,
where the very nature of the genre means that the artist can faithfully represent the
tiniest of meaningful details. This constraint means that special sets of conventions
have arisen which experienced readers of the cartoon genre have learnt to recog-
nise as having special meanings. Some of these are apparent in the Lupo Alberto
cartoon (Figure 1.7). The object above the henhouse in the first vignette, for
example, is, following Peirce (1985), both iconic in that it resembles the crescent
shape of the moon and indexical in that it implies a cause-effect relationship,
namely that if the moon is present, then the action will probably be taking place at
night. Indeed, it may be said that a general trend in cartoons is towards iconicity and
indexicality, in that one or more resources systematically stand in for others: in
printed cartoons, ambient sounds are typically expressed through a combination of
language and visual and graphic resources, the latter instantiated not just in the shape
of specialised lettering (zigzagged, curved and inclined as opposed to straight or hor-
izontal) but above all, as the example of clang in the third vignette shows, in terms
of amusing chain-like crescendos (Thibault, 2002).
Vectors constitute one important subset of these specialised conventions. A
vector is essentially a line which has properties such as dynamic force, directionality
and orientation, like those emerging from Lupo Alberto’s body. Some of these are
straight lines, others a series of curves; but, by transforming space and movement-in-
space into visual objects, both suggest the position that his body had occupied a
split second before; others, instead, take the form of drops of sweat suggesting not
only the direction and speed of his movements but also the emotional stress
involved. Other lines are not physically present but are instead implied. Thus a gaze
vector such as the one in the second vignette, which links the hen to Lupo Alberto,
is an invisible line; in this case the hen, who wants to go dancing, is looking at Lupo
Alberto checking to see that this is also his intention; her pupils are positioned in
the bottom part of her eyes and are dilated, suggesting the intensity of her desire
(she loves him intensely). Whether these feelings are reciprocated or not is beside
the point (the reader might well think that his real purpose, as a wolf, is to gobble
her up, though fans of this cartoon know that Lupo Alberto has in fact lost his
carnivorous instincts). Instead, the important point to note is that the combination
of the hen’s discourse, the position and dilation of the pupils of her eyes and the
intensity and direction of her gaze are indicative not just of a physical relationship
but are in fact a means through which attitudinal stance is expressed. Similarly, the
reader who does not know about Lupo Alberto’s peculiar status as a wolf might
understand that the hen’s feelings for Lupo Alberto are not reciprocated, by virtue
of the fact that his gaze is firmly fixed on his escape route and not on her. The reader
who does know all about Lupo Alberto, as a result of his or her experience of other
Lupo Alberto cartoons (see Inset 8: Intertextuality, p. 55; see also www.lupoalberto.it),
Interpersonal aspects of the cartoon: modalisation is not just a linguistic feature since attitudinal stance can be built onto gaze. The
transcription given below of the three salient central clusters involving either the wolf and the hen or the wolf alone accounts for this.
© Silver/McK
Sneakiness: wolf’s movement is modalised as stealthy through gaze (eyes in corner)
Discrepancy: non-amorous stance of the wolf (gaze forward); amorous stance of hen (pupils dilated, gaze vector on wolf)
Orientation of Wolf: off screen: indeterminate
Orientation of Hen: Depicted World: Engaged: Object: Inside personal space
interprets the clenched teeth, the sweat pouring off his back and the forward-looking
gaze as resources that indicate that he is scared stiff that he is about to be shot.
By saying that the various parts of the wolf’s and the hen’s bodies exist in an
attitudinal, as well as a physical, relationship with each other, we are effectively saying
that the various visual elements in the text are modalised to indicate attitudinal
and evaluative stances (Kress, Van Leeuwen, 1996). Visual elements can be
modalised just as much as linguistic elements. While in the grammar of the English
language it is often pointed out that attitudes are expressed typically, though by no
means exclusively, through modal verbs (I do love you; I can’t go on running like
this), in visual grammar the judgements made by the protagonists are modalised,
for example, through the caricatural manipulation of body parts: Lupo Alberto’s
angled eyes in Vignette 1 change from a confident and even mischievous
I’m-going-to-sneak-up-without-anybody-seeing-me attitude to his
whoops-I’m-in-the-shit-now attitude in Vignette 3.
The visual features indicating movement produce different attitudinal
stances in cartoons (sneaky, cautious, confident: think of the way Asterix is
portrayed), while the modulation of the visual image produces a different feel:
sensuous in the case of a perfume advertisement, naturalistic in the case of the
maps and photographs in the Chesapeake Bay document, abstract in the case of the
map in the London Transport leaflet, hyperreal in the case of films which portray
dream sequences (see Inset 15: Gibson’s optic array, p. 192).
A study of modality thus needs to take into consideration how semiotic
resources other than language, such as movement, gaze and depiction, contribute
to the expression of attitudinal and evaluative meanings by increasing the range of
possibilities. Indeed, for the analyst, much of the pleasure in analysing cartoons lies
in understanding how the cartoonist handles the process of maximising meaning
through caricature while keeping detail to a minimum (see 4.8.2, p. 206).
A further and slightly more abstract appeal derives from how the cartoonist
is able to provide a coherent depiction, on the one hand, of the flow of time and
continuity of events (see 1.1.1, pp. 7-15) and, on the other, movements within a
place or from one place to another and the cause-and-effect relationships that this
entails. In this respect, precisely because they are narrative structures, an important
property exhibited by printed cartoons is their dynamic nature. As we have seen
from the Marmaduke cartoons, they typically display an imaginative use of visual
devices to represent the constantly changing states of:
Textual metafunction:
There is left-right organisation: the bear is on the left, presented as the Given, i.e. known,
familiar or taken for granted. The verbal text is the New, presented as that which the text is
about. There is also top-bottom organisation: this has to do with the Ideal/Real distinction.
The child’s imaginary world of the teddy bear and the moral appeal to save bears in the main
heading is the Ideal. More specific concrete information is presented as the Real in the
bottom bar.
Figure 1.8: The metafunctions .... and a bear hug from Boo
Printed advertisements and their exemplification of the metafunctions 41
(VERBAL GENRES)

Cluster 1: Main heading
1) Main heading (mainly linguistic but font size and type are salient). The grammar in this
text is typical of the grammar of the mini-genre of the LITTLE TEXT (Halliday, 1994
[1985]: 392-397) as exemplified by newspaper headlines, telegrams, captions and headings.
There are no finite verbs, but instead truncated grammatical structures. This little text
has an appeal function (an APPEAL is a genre in its own right).

Cluster 2: Main verbal text
2) Main verbal text. The linguistic text draws and combines features from a number of
different linguistic genres:
a) the genre of PERSONAL PRESENTATION. The text starts off presenting a bear as an
individual, giving him a name, classifying him as an orphan, attributing personal qualities.
b) some elements of the genre of RECOUNT in which some specific events in Boo’s life are
recounted in chronological sequence.
c) INFORMATION REPORT giving general facts, information not about Boo Bear but about
bears in general and their plight. It uses the universalising present tense, all part of a
concern with general rather than specific detail.
d) an EXHORTATION genre concerned with making an appeal and persuading people to adopt a
desired course of action. This is marked by the addressee-directed imperative Help us,
followed up by reason/motivation in the clause it’s a small price to pay for a life, the word
small being contrasted with the priceless nature of a life.

(VISUAL GENRES)

Cluster 3: Photograph
3) Photograph. One type of visual image is the PHOTOGRAPH, a genre in its own right
distinct from, for example, graphs, diagrams and tables. This photograph is not an uncoded
representation of the real world, but is itself semiotically coded. What the eye perceives
is closer to what the photograph represents than to a diagram, which is concerned with
abstract and general tendencies in the scientific coding orientation. This is in clear
contrast to the specific concrete detail of the photograph which adopts the naturalistic
coding orientation (Kress, Van Leeuwen, 1996: 170-71).

Cluster 4: Bottom bar
4) Bottom bar. A BOTTOM BAR is another visual mini-genre in which visual, linguistic and
graphic resources are used in typical ways. In this text, we see an imperative clause
concerned with very specific concrete information which is important for the reader to
know, being highlighted by the specific font size and choice (contrasting with both the main
heading and the main verbal text).

Cluster 5: Logo
5) Logo. In advertisements, logos are typically in the bottom right corner, whereas in
letters they tend to be at the top. A LOGO is a visual mini-genre, similar in status to the
LITTLE TEXT genre. It indexes a specific corporate or organisational identity as the
addresser of the text and ties this identity to the meanings in the text. In other words, it
has an anchoring function which grounds the text in relation to a particular company or
cause which the reader or consumer of the text can identify with.
to the idealised or desired essence of something; the Real to more specific, concrete
information and detail. As the Boo Bear example clearly shows, advertisements
are one text genre which often makes use of the latter resource. The desirable,
the fantastic and the sensual are placed in the top part of the page; specific,
more detailed or realistic printed information appears in the lower half of the page.
In providing a brief analysis of the Boo Bear advertisement in terms of the meta-
functions, we have, together with the examples given in the previous sections, the
beginnings of a multimodal transcription for the printed page, a matter that will be
further explored in Chapter 2.
Figure 1.9 focuses on the notion of mini-genres, which is not dissimilar to
the notion of cluster type that we have described briefly in 1.2, pp. 21-34. As the
transcription makes clear, a cluster is essentially associated with specific instances
whereas a mini-genre is concerned with types. The two overlap insofar as a
specific instance is also a manifestation of type. The notion of mini-genres derives
from a distinction made by Bakhtin (1986) between primary and secondary genres
(see Inset 6: Bakhtin’s distinction between primary genres and secondary genres on the
next page). Primary genres, sometimes called mini-genres, are basic prefabricated
text-making resources, such as Question-Response, Command-Compliance, Problem-
Solution, Definition-Explanation. Secondary genres, of which advertisements, novels,
and scientific articles are examples, select and combine the resources of these mini-
genres for their own purposes, thus forming more complex genres, the secondary
genres. Figure 1.9 shows how we can make a start, by means of a rudimentary
multimodal transcription, on the work of demonstrating how primary genres are
typically combined to form secondary genres, in this case an advertisement. The
transcription distinguishes visual from verbal mini-genres in a clear-cut way but makes
clear once more that the visual and the verbal are not simply added to each other. As
mentioned above, the multimodal integration of the two produces a multiplying
effect such that the one contextualises the other to produce an overall text meaning.
We may give two examples: first, we see how the linguistic genre of
PERSONAL PRESENTATION at the beginning of the text indexes Boo in the visual
image at the same time that the features that we have analysed in the visual
semiotic create a synergy between the two, all to do with creating interpersonal
closeness between the bear and us, and integrating the bear into our social world.
The bear is not in the wilderness, but is turned into your familiar, cuddly teddy.
Second, we see how the use of direct gaze, realising a visual demand for goods and
services, ties in with the EXHORTATION genre in the verbal text.
The transcription makes clear how the Boo Bear text works in terms of the
meaning-compression principle (see Inset 3: The resource integration and meaning-compression
principles, pp. 18-19). It shows how we use our knowledge of genre,
both primary and secondary, to understand the relationships between the various
clusters that typically make up a printed text.
Inset 6: Bakhtin's distinction between primary genres and secondary genres
• Bakhtin (1986: 61-62) makes a useful distinction between primary and secondary
genres. Primary genres are elementary genres that occur in what Bakhtin calls unmediated
speech communion; they are the basic generic forms characteristic of a wide range
of social situations encountered in daily life. They include dialogic exchange units such
as: Command ^ Response, Statement ^ Response-to-Statement, Question ^ Answer,
Greetings ^ Reciprocate Greetings and so on. They also include written variants such as
Personal Letters, Memos, Explanations, Instructions, Recounts and Arguments.
• Secondary genres are more complex and include complex scientific, artistic, legal,
journalistic, political, bureaucratic, technocratic and other complex forms of discourse.
They assimilate a wide variety of primary genres to their own purposes and principles
of organisation. Primary genres thus absorbed and digested (Bakhtin, 1986: 62) are
recontextualised by the secondary genre, losing any immediate contact with the everyday
situations in which they function. Their functions and modes of organisation are
mediated by the more complex secondary forms to which they are assimilated.
Note in passing that some of the various clusters in the Boo Bear text might at first
sight appear to be monomodal
in terms of the resource integration principle. However, this is not the case. Even
the photograph of Boo is more than a photograph: the photograph
cluster integrates space in terms of orientation (Boo is at a 45-degree angle to the
horizontal plane) and visual salience (Boo's body overshadows the text), all part of
the process of creating an interpersonal bond between the reader and the bear.
1. 5. Web pages and their transcription
Chapter 3 of this book contains a detailed analysis, perhaps the first of its kind, of
the way web pages and websites typically make their meanings. The special
characteristics of web pages are described there. In this section, however, we will
briefly outline one of the characteristics that needs to be entertained when
describing web pages, namely the ways in which intertextual relationships (see
Inset 8: Intertextuality, p. 55) are created in web pages and how we can analyse and
transcribe these relationships. We will do this with reference to a page from the
British Library's website (www.bl.uk) relating to Leonardo's Notebook as represented
in Figure 1.10. In the preceding sections, we discussed the nature and role of
clusters as meaning-making structures. The discussion included a description of
the frame as a resource that can be pressed into service in many texts to separate
a cluster, or, more often, a group of (sub)clusters, from another cluster or
(sub)cluster group. In a comic or a film storyboard, for example, frames allow clusters
to be built into higher-level meaning-making sequences that enact particular sce-
narios, or as we prefer to call them phases (see Inset 7, p. 47 and 1.6, pp. 46-54).
The very notion of sequence implies a time-based, chronological ordering
of events in a narrative and/or cause-effect structuring. However, clusters can
also be related to each other in other ways. They can, for example, at least in part,
transcend a linear time-based sequence and enact constantly changing positions
relative to each other. In this sense, clusters are more like stars moving through
space in patterned and predictable ways in relation to each other, thereby consti-
tuting part of a larger, dynamic whole.
This is often the case with web-based animations. The website in Figure
1.10 is an interesting case in point (www.bl.uk/onlinegallery/ttp/digitisation3.html).
It consists of a display area (top part) in which Leonardo's Notebook is shown and
a bottom bar which partly governs the way the Notebook is displayed. Thus,
although users can turn the pages of the Notebook by clicking on them directly,
they can also use the slider in the bottom bar. This is in itself a cluster, a grouping
of resources: three visual objects (a line/bar, a circular sliding button and two end
triangles/arrowheads) plus the possibility of moving the button with the mouse pointer,
i.e. an action potential (see in this respect Inset 14: Material object text and semiotic
action text: two sides of the same textual coin, pp. 175-177). A related cluster is the
page-display bar, part of the bottom bar supercluster, which displays the results of
the movement of the sliding button in terms of page numbers. The mouse pointer
is a further related cluster combining visual and actional resources. It is, of course,
essential in realising the potential of the display and bottom bar superclusters.
Part of this potential relates to overcoming the time-based consultation of
texts by allowing a greater set of selection possibilities. Indeed, with this device,
the reader can flip backwards and forwards through the pages at will, whereas
there is no such random sequencing when the Text and Audio buttons on the right
of the bottom bar are activated; that is, there are no fast-forwards: you either read or
listen to the entire text or you switch these options off. A more striking example is
the Magnify option which allows the reader/viewer to randomly float a virtual mag-
nifying glass over the virtual Notebook, effectively zooming in on the individual
clusters that Leonardo wrote or drew. The Magnify option is essentially a frame with
a specialised action potential (see 3.9, p.146). It acts as a text-access tool but with very
different functions as compared with other text-access tools such as search engines.
The page shown in Figure 1.10 relates to Leonardo's design for a new town
in France that he had been commissioned to work on by the King of France. It is
striking in the way that it consists of an apparently random set of jottings which,
on closer inspection, prove to be a series of interconnected and often merging
visual and verbal clusters. The last option, the Mirror option, can be used with the
Magnify function; it reverses Leonardo's mirror writing (he wrote from right to left
in many of his works) thus making the text easier to comprehend. From one stand-
point, the effect of all these textual interactions is to make a precious text available
for detailed perusal by the many who would be otherwise barred from consulta-
tion. From another standpoint, we may say that by looking at these various aspects
in which clusters interact in websites, we are in part preparing the reader for the
model of textual organisation of websites that is put forward in Chapter 3.
1. 6. Film texts and their transcription
This section is concerned with film texts. It deals with their definition and the methods
by which they can be analysed. Specifically, it is concerned with outlining some of the
analytical tools (phase, transition and transitivity frames) that we will analyse in much
greater detail in Chapter 4. Examples of film media are those projected in cinemas
(films, newsreels and cartoons), broadcast on TV (films, documentaries, news and
sports broadcasts etc.), distributed as home video (DVD and VHS films) or those relating
to events such as lectures, public meetings and conferences which have been video-recorded
and reproduced in videotape or digital format for non-commercial reasons.
The definition covers a large range of film texts and genres including those relating to
the general public (cinema, DVD, TV or web-based films) and those intended for more
restricted audiences (company training films, recordings of university lectures,
recordings of children telling stories).
Inset 7: Phases and their transcription
• The basic unit of textual sequencing and, hence, of global or macro level
organisation of a text is the phase. Following Gregory (1995, 2002), a phase may
be defined as a set of copatterned semiotic selections that are codeployed in a
consistent way over a given stretch of text. Phases are text-analytical units in terms of
which the text as a whole can be segmented and analysed. However, these units do
not in themselves realise or constitute relations between semiotic forms and the
meanings the forms realise. Phases are instead the enactment of the locally
foregrounded selections of options which realise the meaning which is specific to
a given phase of the text. It is the task of a multimodal text analysis to specify both
which selections are made from which semiotic modalities and how they are
combined to produce a given, phase-specific meaning.
• Phasal analysis has also been extended to multimodal action texts in the work of
Martinec (1998, 2000) and in our own work (Baldry, Thibault, 2001, 2005). In this
approach, the text is segmented into a number of phases and the points of transition
between phases. A given phase is characterised by a high level of
metafunctional consistency or homogeneity among the selections from the various
semiotic systems that comprise that particular phase in the text. In this way, the
specific selections in that phase and their modes of copatterning yield an internal
consistency which characterises a given phase and which distinguishes that phase
from other phases in the same text. The temporal unfolding of a given phase is a
wave-like pattern or, rather, a series of interacting waves. It follows that phases and
subphases refer to salient local moments in the global development of the text as
it unfolds in time. The transcription must reveal the patterns of use of choices
from different systems in the real-time unfolding of the text. The Prague school
concept of foregrounding is crucial when showing which selections from which
semiotic resource systems are relevant to the instantiation of a given phase.
Such texts are both multimodal and dynamic: as they
unfold in time, they display different and constantly varying constellations of sound,
image, gesture, text and language (Baldry, 2000a; Thibault, 2000a). They can be
analysed as individual texts using the multimodal transcription technique or alterna-
tively as collections of texts (text corpora) using multimodal concordancing techniques
(Baldry, Beltrami, 2005; Baldry, Thibault, 2005; Taylor Torsello, Baldry, 2005).
Through analysis a much more detailed definition and understanding of film
texts can be provided. Frame-based dissections of texts such as the multimodal
transcription of advertisements in Appendix I and Appendix II play an important
role in describing the meaning-making blocks that make up a specific film text
(Baldry, 2000a, 2004; Thibault, 2000a). They necessarily go beyond cluster analysis,
which presupposes a fixed relationship between resources, and involve dynamic
sequences of resource integration, what we have termed phases (see Inset 7:
Phases and their transcription, on the previous page) following, though also
modifying, Gregory's notions of phase and transition (Gregory, 1995, 2002).
Film genres are varied in their nature and their discussion requires a canvas that
goes far beyond the confines of this book. We will thus limit our exemplification in
this book to the analysis of one genre, the TV advertisement genre, for which we
Table 1.5: The Mitsubishi Carisma text: Summary analysis of shots, phases and macrophases
Table 1.6: The Mitsubishi Carisma advertisement: thematically salient transitivity frames
within that phase exhibit a high degree of sameness, at the same time that the
meanings made in other phases of the same text are different. A phase will there-
fore make use of a distinctive copatterning of meaning options in order to create
the meanings of that phase and to distinguish a particular phase from other,
different phases in the same overall text.
The transition from one phase to another is matched by a shift in the kinds
of meaning options which are selected and combined in that phase. Different
phases in a text are functional units which make their own specific contribution to
the meaning and organisation of the text as a whole. At the same time, the
different kinds of meanings made in different phases of a text are related to the
meanings made in other phases of the same text as well as to the meanings that are
made by the text as a whole. Phasal analysis is useful and revealing because it shows
how small-scale units such as the shot in a video text can be related to larger-scale
textual units such as the phase. The ways in which smaller-scale units are related to,
and are integrated with, larger-scale units is important for establishing the meaning
and function of small-scale units such as the shot in video texts. Texts are organised
on many different organisational scales; units on different scales all play their part
in creating the meaning of the text as a whole, at the same time that they make their
own specific contribution to the meaning of the whole. The small-scale meanings
of a unit such as the shot or the phase cannot simply be added together to derive
the meaning of the whole text. Different units on different scales in a text relate to
units on the same scale and on smaller and larger scales in different ways (see Inset
12: Scalar levels , p. 144). The meaning of the whole depends on the many different
ways in which units relate to each other.
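The idea that a phase is internally consistent in its selections, and that a transition coincides with a shift in those selections, can be illustrated with a minimal Python sketch. It is not the procedure used in this book: the feature labels, the use of set overlap (Jaccard similarity) as a stand-in for copatterning, and the threshold value are all assumptions made purely for illustration. The sketch also assigns each shot to one phase only, so it does not model transitional shots that belong to two phases at once.

```python
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Share of selections that two shots have in common (1.0 = identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def segment_into_phases(shots: List[Set[str]], threshold: float = 0.5) -> List[List[int]]:
    """Group consecutive shots into phases.

    A new phase starts whenever the overlap with the previous shot
    falls below the (hypothetical) threshold.
    """
    phases: List[List[int]] = []
    for index, selections in enumerate(shots):
        if phases and jaccard(shots[index - 1], selections) >= threshold:
            phases[-1].append(index + 1)  # shots are numbered from 1
        else:
            phases.append([index + 1])
    return phases


# Hypothetical selections per shot; the labels are illustrative only.
shots = [
    {"telephone", "night", "orchestral-music"},
    {"telephone", "night", "orchestral-music", "voice"},
    {"telephone", "voice", "orchestral-music"},
    {"car", "road", "fast-music"},
    {"car", "road", "fast-music"},
]

phases = segment_into_phases(shots)
print(phases)  # → [[1, 2, 3], [4, 5]]
```

In practice the decision about where one phase ends and the next begins rests on the analyst's reading of several semiotic systems at once; a numerical threshold of this kind can only approximate that judgement.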
For the purposes of the present discussion, we shall focus on the following
units: shot, phase and macrophase in relation to the Mitsubishi Carisma advertisement.
The Mitsubishi Carisma text has been analysed into a total of twenty-nine shots,
eight phases and three macrophases. This analysis is summarised in Table 1.5 above.
The complete text is transcribed in Appendix II. Since this text will be further
analysed in Chapter 4, we will not undertake a complete analysis of this text as yet.
Instead, we shall make some informal observations on the overall organisation of
the text into shots, phases and macrophases. Our purpose is to illustrate the impor-
tance of phasal analysis and its role in text analysis and transcription. The first part
of the discussion will look closely at Phase 1 with these questions in mind. Phase 1
consists of six shots, the last of which is transitional:
• Shot 1 establishes the time (night time) and the urban location, at
the same time that it introduces the telephone box and the car,
which is seen approaching the telephone box.
• Shot 2 cuts to the telephone receiver inside the telephone box and
shows a hand grasping the receiver.
• Shot 3 shows a young woman inside the box talking on the telephone.
• Shot 4 cuts to the middle-aged man with whom she is talking. He is
holding the telephone receiver to his ear and is talking to the woman.
• Shot 5 is a camera pan to the left which tracks the man's gaze to
his hostage, a young man suspended above a shark-infested
waterway by a rope hung from a bridge.
• Shot 6 is a transitional shot: it marks the end of Phase 1 at the same
time that it begins Phase 2. This shot shows the woman's car, a
Mitsubishi Carisma, viewed from inside the telephone box as the
woman leaves the telephone box and walks back to the car.
The six shots in Phase 1 combine to produce a thematically homogeneous phase
which is focused on the use of the telephone to bring about this first contact
between the woman and the man. Shot 6 marks the end of the first telephone con-
versation between the woman and the man, at the same time that it is the begin-
ning of the first car-drive phase when the woman drives to Vienna. In other words,
shots and phases function on different meaning levels (see Inset 12: Scalar levels, p.
144). The visual scene combines thematic material from both Phase 1 and Phase 2,
namely the telephone box and the woman's car.
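The relation between shots and the phase they make up can be recorded in a simple data structure. The following Python sketch is only one possible encoding: the class names `Shot` and `Phase`, their fields and the wording of the summaries are hypothetical and are paraphrased from the description of Phase 1 given above, not taken from the transcription in Appendix II.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Shot:
    number: int
    summary: str
    transitional: bool = False  # True if the shot also opens the next phase


@dataclass
class Phase:
    number: int
    theme: str
    shots: List[Shot] = field(default_factory=list)

    def transitional_shots(self) -> List[int]:
        """Return the numbers of the shots that bridge to another phase."""
        return [shot.number for shot in self.shots if shot.transitional]


# Phase 1 of the advertisement, summarised from the shot list above.
phase_1 = Phase(
    number=1,
    theme="first telephone contact between the woman and the man",
    shots=[
        Shot(1, "night-time urban location; car approaches the telephone box"),
        Shot(2, "hand grasps the telephone receiver"),
        Shot(3, "young woman talks on the telephone"),
        Shot(4, "middle-aged man talks to the woman"),
        Shot(5, "pan left following the man's gaze to his hostage"),
        Shot(6, "car seen from inside the telephone box", transitional=True),
    ],
)

print(len(phase_1.shots))            # → 6
print(phase_1.transitional_shots())  # → [6]
```

Marking Shot 6 with a flag rather than listing it under both Phase 1 and Phase 2 is a design choice of this sketch; an encoding that lets one shot belong to two phases would reflect its double function more directly.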
Table 1.6 presents a linguistic gloss on the thematically salient transitivity
frames (Baldry, Thibault, 2005) in Phase 1. The analysis in Table 1.6 focuses on the
participants, the actions they engage in, and other relevant circumstantial details such
as location. However, very many details of the visual text and the soundtrack remain
unaccounted for in such an analysis, which is not meant to be exhaustive in any case
(see Chapter 4 for a more detailed analysis). Instead, Table 1.6 shows some of the
resources which contribute to the thematic homogeneity of this phase: we see the
participants, the activities they engage in, and the objects that they interact with.
1. 6. 1. The soundtrack
At the same time, what we see is also related to what we hear. A few words on the
significance of the soundtrack are therefore in order at this point. In Phase 1, the
soundtrack is comprised of three components: (1) the orchestral music; (2) the
voices of the man and the woman; (3) the sound of the telephone box door clos-
ing when the woman returns to her car after making the telephone call to the man.
The music starts off at a relatively low volume and slow tempo. Both the volume
and the tempo gradually increase until the beginning of Phase 2, when there is a
marked quickening of the tempo as the woman begins the drive to Vienna in that
phase. The music in Phase 1 is a contextual ground (see Ground in Inset 16: Perspective
in sound: Van Leeuwen on Figure, Ground and Field, p. 212) to the scene which we
observe. The music supplies mood and ambience and is reminiscent of the music
from spy thrillers such as James Bond films. The two occurrences of the spoken
voice in this phase are the dominant sound motif, whereas the orchestral music
stays in the background.
The voice is the acoustically dominant figure in the soundtrack in Phase 1. In
other words, the acoustic focus is on the voices of the two speakers, rather than the
orchestra. However, the orchestra gains in prominence with the entry of the
trumpet solo as soon as the man finishes saying his line "bring it tomorrow" at the same
time that the camera tracks the shift in the man's gaze by panning to his hostage
(Shot 5), who is seen suspended above the shark-inhabited canal. The motif played
by the trumpet clearly associates with the man and his dark intentions; it introduces
a note of dramatic tension into the previously more melodic flow of the music.
In contrast to the woman's standard middle class British accent and steady modulation,
the man's voice is more heavily accented as well as exhibiting a fair degree
of resonance and reverberation. His speech tempo is also slower than the
woman's more normal talking speed. Table 1.7 shows some of the salient phonetic
and prosodic features of the man's voice. The man's voice and the woman's voice
clearly contrast in ways which index significant meaning contrasts. These oppositional
contrasts are summarised in Table 1.8. The contrasting phonetic and
prosodic features of the two speaking voices index a series of oppositions
between categories such as nationality and language and, through these, the moral
opposition between the good intentions of the woman and the evil ones of the man.
Given that the woman is aligned with the car, it is hardly surprising that the viewer
is asked to align with the values of the female protagonist and later the car itself in
opposition to the values indexed by the man and the faintly parodistic and absurd
links that his voice indexes. These features of the man's voice suggest a sinister
quality and index the morally contorted character of the villain in many spy movies,
such as those found in many of Ian Fleming's James Bond stories and the feature
films based on these. The contrasting qualities of the two speaking voices serve, in
partnership with other modalities such as the visual scene, to position the two char-
acters in very different ways in relation to the viewer of the advertisement. Thus,
the woman is visually connected to the car, to the urban world outside, and more
generally with the values of the reader/viewer that the text seeks to target. In the
telephone box, she is shown frontally and the use of colour is naturalistic. The
woman belongs to our world and viewers are asked to identify with her. In contrast,
the man is associated with a sinister subterranean setting, with criminal activity
(extorting ransom money for the hostage he is holding), and is generally shown as
not belonging to the world and the values of the targeted viewer. His face is seen
to be tilted at an oblique angle in contrast to the frontal perspective and vertical position
used to show the woman's face when she speaks on the telephone. Furthermore,
the woman's face is fully illuminated, whereas the man's face is partly obscured by
the interplay of light and shade on his face.
As we have pointed out above, Shot 6 is a transitional shot : it brings Phase 1
to an end at the same time that it introduces elements of Phase 2. In this shot, the
young woman leaves the telephone box and walks back to her car. The shot also
features one of the few occurrences of an ambient sound, in this case made by the
door of the telephone box when it shuts behind her. The sound is given a fairly
high degree of prominence as well as being fairly low on the absorption scale, to
the extent that it is heard as reverberant or resonant.
A similar use of an ambient sound is made in Shot 15, Phase 3 when the
woman replaces the telephone receiver after talking to the man for the second time.
In both instances, an ambient sound takes on qualities and a potential significance
that might not normally apply to such a sound in a more naturalistic acoustic
context. In both cases, the sounds mark the end of a textual phase in the
advertisement. They therefore function textually in concert with other features to
indicate a transition from one phase to the next. Furthermore, the low degree of
absorption in both cases means that these sounds take on non-naturalistic qualities
in keeping with many other features of this text.
At a still higher level, the text comprises three macrophases, one of which
is the end phase (Baldry, 2004), which integrate the smaller-scale phases into their own
principles of organisation. The justification for the inclusion of this level lies in large
measure in the role that the soundtrack plays in binding the text into three higher
level macrophases. The first macrophase extends from Phases 1 to 4 and is charac-
terised by the orchestral music which binds all four of these phases into a distinc-
tive larger-scale unit. Phases 1 to 4 focus on the two telephone conversations
between the woman and the man (Phases 1 and 3 ) and the two car-drive phases when
the woman drives to Vienna (Phase 2 ) and to Prague (Phase 4 ). The music, which
starts off slowly and quietly in Phase 1, builds up to a very quick tempo and rhythm
in Phase 2 when the woman starts the journey to Vienna. This surge in musical
intensity carries through to the end of Phase 4. With the onset of Phase 5, the music
is abruptly faded out and the voice of the off-screen male presenter is heard for the
first time, at the same time that Phases 5 to 7 switch from the crime thriller genre
that is foregrounded in Phases 1 to 4 to the movie set where an overpaid director is
shown to be responsible for making what we now see to be a hackneyed movie in
Phases 5 to 7, before the car is moved centre stage and shown to be the real star of
the show. In the short end phase, different visual and acoustic resources function to
conclude the advertisement and to connect its meanings to the maker of the car.
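The grouping of phases into macrophases described above can be written down as a simple mapping and checked for completeness. In this Python sketch the dictionary `MACROPHASES`, its labels and the helper `macrophase_of` are hypothetical names, and the identification of the end phase with Phase 8 is an assumption: the discussion states only that there are eight phases, that Phases 1 to 4 and Phases 5 to 7 pattern together, and that the third macrophase is the end phase.

```python
from typing import Dict, List

# Macrophase label -> phase numbers (labels are illustrative only).
MACROPHASES: Dict[str, List[int]] = {
    "macrophase 1: crime thriller, orchestral music": [1, 2, 3, 4],
    "macrophase 2: movie set, off-screen presenter": [5, 6, 7],
    "end phase": [8],  # assumption: Phase 8 is the end phase
}


def macrophase_of(phase: int) -> str:
    """Return the label of the macrophase that contains the given phase."""
    for label, phases in MACROPHASES.items():
        if phase in phases:
            return label
    raise ValueError(f"phase {phase} is not assigned to a macrophase")


# Every one of the eight phases should belong to exactly one macrophase.
covered = sorted(p for phases in MACROPHASES.values() for p in phases)
print(covered == list(range(1, 9)))  # → True
print(macrophase_of(3))              # → macrophase 1: crime thriller, orchestral music
```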
The discussion in the preceding paragraphs suggests some of the ways in
which a multimodal video text such as a television advertisement can be analysed
in terms of a number of different, though interrelated levels of textual organisa-
tion. The analysis here is neither exhaustive nor technical (see Chapter 4 ). By intro-
ducing the notion of phasal analysis, we have drawn attention to the ways in which
different parts of the same overall text may use different semiotic resources and dif-
ferent combinations of semiotic resources to create meanings which are specific to
a particular part of a text, which we have called a phase, following and adapting
Gregory’s original insights as regards this level of textual meaning (Gregory, 1995,
2002). At the same time, the various phases combine units on lower levels, such as
the shot, while also being integrated to higher-scalar units such as the macrophases
proposed here, or to the text as a whole.
1. 7. Conclusion
In this chapter, we have introduced some of the basic aspects of the scalar model
of multimodality that we are seeking to establish in this book. This model views
texts as consisting of multiple, interacting textual levels that make their meaning
through the constant interplay of smaller and larger textual units. In this respect,
we have placed particular attention on the intermediate levels of textual composi-
tion of multimodal texts, such as clusters and phases, showing how they link up with
Inset 8: Intertextuality
2. 0. Introduction
How does the page communicate? How does the visual image organise the per-
sons, objects, the actions they perform, and the settings in which these occur into
a structured set of relations? How does the page give structure to the relations
between the represented world of the text and the viewer of the text? And, more
generally speaking, how does it indicate to the reader/viewer the possible ways of
reading the text and the relative information priority to be assigned to the different
component parts of the overall visual composition? How can we talk about,
analyse and theorise these various aspects of the way visual texts communicate?
In this chapter we link some of the general principles outlined in the previous
one to the analysis and transcription of the printed page, the scientific printed page
in particular, in an approach which moves from general to increasingly specialist con-
siderations. Specifically, in the first part of the chapter we consider the evolution of
the page as a textual unit in its own right in contemporary society, characterising the
role played by such resources as tables, charts and diagrams in this process and com-
paring texts from the 19th century with their counterparts from the end of the 20th
century. The later sections of the chapter deal with the way in which scientific mean-
ings are communicated to children in biology textbooks. In the course of the chap-
ter, we will provide a detailed account, both in terms of text analysis and multimodal
transcription, of the ways in which a metafunctional framework (see 1.4, pp. 38-44)
can help us understand many aspects of the multimodal printed page, such as the
spatial arrangement of items on the page, the relation between images and linguistic
text, the relations between reader and multimodal text, and so on.
2. 1. The printed page and its evolution
The very notion of the evolution of the page may at first seem ludicrous. How can a
page evolve? Yet in modern society the page is an important textual unit and a com-
parison of virtually any page from contemporary publications (whether newspapers,
school textbooks or scientific journals) with those of previous generations will show
that this change in status is due mainly to the rise of the multimodal page in the last
fifty years (Kress, Van Leeuwen, 1996). There is, of course, no such thing as a
monomodal page: there never has been and never will be. Like any other textual
unit, a page cannot create meaning through the use of language alone but relies
instead on a combination of several meaning-making resources: linguistic, graphic
and spatial at the very least (Thibault, 1998a). While all pages are by definition
multimodal, some are more obviously multimodal than others, combining traditional
semiotic resources such as language and layout with more modern resources such
as colour and photographs. Under the influence of technology, computer technol-
ogy in particular, these developments have accelerated to such an extent that in
recent years our conception of the page has changed significantly. What was essen-
tially a linguistic unit 100 years ago has now become primarily a visual unit. The page
is no longer, as it was predominantly in the 19th century, simply a convenient divi-
sion for the purposes of printing. In Western culture, it is increasingly looked upon
as a textual unit in its own right, a matter clearly reflected in the growing list of
expressions that identify the page in terms of different social functions: index pages,
glossy pages, financial pages, yellow pages, teletext pages and, of course, web pages.
Today's page thus incorporates the principle of multimodal textual design and
organisation (Kress, Van Leeuwen, 1996, 2001).
An awareness of the page and its evolution would appear to be a signifi-
cant aspect of language studies and language education at all levels, particularly
when we recall that Internet pages are not just read but are instead listened to,
looked at and even watched, increasingly by young children, many of whom are tra-
ditionally assigned to the pre-reading age group. The scientific page is no exception
to the general trend towards the integration of diverse semiotic resources sketched
out above. A glance at any page from a modern textbook and one from the previ-
ous century will confirm that the conception of the page as a visual unit affects
specialist sectors of society, and the genres they use, just as much as it affects
popular genres, such as newspapers, magazines, directories, circulars and ency-
clopaedias, directed to a much wider public. Compare for example the
representation of plant movements in a twenty-four hour cycle in Figure 2.1, as
exemplified in Darwin's The Power of Movement in Plants published in 1880
(Darwin, 1989 [1880]), with its modern counterpart taken from Curtis's Biology
(Curtis, 1975 [1972]). You will see that many of the meaning-compressing devices
that characterise the modern multimodal text in general are also employed in the
scientific text: colour, use of spatial disposition that arranges text blocks horizon-
tally as well as vertically, lines that have evolved to acquire both a metatextual
labelling function and a dynamic function, which acts to create a relationship between
movement and change in time.
Figure 2.1: Leaf movements: in Darwin (top); in a modern textbook (bottom). From BIOLOGY by Helena Curtis. © 1968, 1975 by Worth Publishers. Used with permission.
Following on from what we stated in the previous
chapter, the term meaning-compressing (see Inset 3: The resource integration and
meaning-compression principles, pp. 18-19) is used here in relation to the changes that
have taken place in the course of time which allow the same processes to be described
in scientific texts in a shorter space; these changes are typically the result of:
(i) the greater integration of visual and verbal resources;
(ii) the often concomitant process of greater abstraction in
representation that brings about a collapsing of the different
time scales that occurred in the actual process of experimenta-
tion into a single, more hypothetical point in time usually corre-
sponding to the point in time when the experiment/research was
reconstructed in terms of a written report by the scientist.
Interpreting this in relation to Darwin’s work on the movement of plants, we
notice how there is very little abstraction in the visual, to the point where the drawings
are reproductions of the actual lines that the movement of leaves drew on a cylinder.
On the other hand, the half-page reproduction of the same experiment in a modern
biology textbook (see Figure 2.1, bottom part) collapses all this into two
interlinked diagrams labelled (a) and (b). The first visual, Cluster (a), summarises the
entire experiment in terms of a labelled and abstract cluster which pares the experi-
ment down to a visualisation of the MATERIAL process in which a leaf draws a line
on a drum by means of a thread tied to the leaf at one end and to a lever at the other
end. The second visual, Cluster (b), suggests the second stage of the experiment,
in that it represents the now completed line at the end of the 24-hour cycle and pro-
vides a blow-up of the complete cycle of leaf movements showing the regularity of
the cycle. The two visuals thus express a cause-effect relationship requiring very little
linguistic expansion; language is in fact used in what, de facto, is Cluster (c) in the
left-hand margin of Figure 2.1. By contrast, Darwin’s original experiment (Figure 2.1,
top part), on which this modern version is based, only expresses the first part of
this equation (the actual material drawing of the line) but stops short of
visualising the consequences, relying instead on language to describe these effects. Moreover,
the drawings in Darwin’s text typically show only a low level of abstraction. The
leaves are recognisably clover leaves whereas in the modern text they are abstract
representations of leaves; Darwin’s lines are real lines drawn by the leaves themselves
with the addition of an arrow mark suggesting directionality; in the modern text, on
the other hand, the lines are vectors of a totally abstract nature with a symmetry not
backed up by the original experiment. We may add that the contemporary scientific
page, with its sophisticated representation of movement, often has a dynamic, as
opposed to static, feel to it, which enables it to express different states. This is exem-
plified by the cartoon-like sequence of drawings, typical of modern school science text-
books that indicate stages in cycles: the water-evaporation-rain cycle, the seasons cycle, the
phases-in-human-evolution sequence, and so on (cf. the notion of temporal-analytical
processes in Kress, Van Leeuwen, 1996: 95). The dynamic potential of scientific
drawings is often exploited in the animations used in scientific film media. Video
recordings can also capture the process of drawing a diagram, a special form of
animation that is used, for example, in a lecture to explain a process or a phenomenon
in a stage-like manner. As we shall see later, in Chapter 3 (e.g. 3.5, pp. 120-125), this
kind of animation is often closely associated with an awareness and use of the
dynamic properties of the multimodal web page.
2.2 The resource integration principle in the scientific page
A full account of the history of the scientific page as a meaning-making unit is not
possible here. A brief comparative examination of the codeployment of visual and
linguistic resources in the scientific page will, however, contribute to our under-
standing of the multimodal scientific page. Of special interest, in this respect, is
the focus that can be placed on how the various resources contribute in their
different ways to the various dimensions of a texts meaning and how the growing
integration between resources may be regarded as an important social achievement.
To illustrate this point, and to provide partial answers to the questions posed
above, we may examine some specimen texts from different ages.
We have already seen some texts from the field of biology, and below we
analyse biology texts in more detail. But we may also extend our enquiry to the field
of economics, again comparing texts from two distinct historical phases, in this case
taken from Volumes I and II of Marx’s Capital as presented in the first American
edition dating back to 1908 and an issue of The Economist magazine dated September
5th 1998. Albeit in different periods and within a different world, these texts essen-
tially belong to the same discourse type, namely political economy, a genre effectively
created by Marx in Capital and which marks a significant shift in the deployment of
semiotic resources vis-à-vis earlier economics texts such as Adam Smith’s The Wealth
of Nations (1776) and John Stuart Mill’s Principles of Political Economy (1848). It is
a discourse type characterised by a combination of other discourse types:
mathematical, statistical, journalistic, political as well as pure economics. In Marx (but much less
frequently and less obviously in The Economist) a particular discourse type is often
associated with a particular modality. Thus, in Figure 2.2, for example, we can see that
the part of the page relating to Lincolnshire is divided into three parts:
(a) a narrative section reflecting a written modality;
(b) an economic and statistical section (after the words ‘The following’),
reflecting a combined visual/verbal modality;
(c) a journalistic section (embedded in the first section and marked
off by quotation marks) which reflects an oral modality.
Figure 2.2: A typical multimodal page in Marx’s Capital; and the top part of an equivalent page in The Economist. © The Economist Newspaper Limited, London (03/09/1998)
Whatever definition of economics as a science is given (and today’s economics is
obviously concerned with many issues, such as inflation and the production of
services, that were not the primary focus for Marx), the concept of measurement
remains an important part of all scientific texts. From this point of view, Capital and
The Economist share common ground in their contribution to the development of
semiotic devices that express measurement. Although the modern text shows a
much stronger tendency to integrate visual and linguistic resources, the text type
developed by Marx already moves in this direction in some very interesting ways,
notably in its use of the vector. Nevertheless, in reading Capital, one is constantly
aware of a struggle to make meaning emerge from a limited array of resources,
principally the table and various graphic and spatial devices. Take, for example,
Chapter X (Vol. I), which is concerned with economic and social aspects of the
working day and the exploitation of the labourer by Capital. Here Marx adopts a
journalistic standpoint which foregrounds the idea of hearing different voices, literally as
well as metaphorically: the voices of the labourer, the factory inspector and of
Capital. Marx writes not ‘Let us read the Factory Inspectors’ report’ but ‘Let us hear
the Factory Inspectors’, ‘Let us listen to Factory Inspectors’. Capital is heard in the
House of Commons, as is the voice of the labourer. The written report, invariably
Marx’s source of information, is thus reconstrued in terms of an oral modality
in order to make a greater impact on readers regarding the social injustice that
surrounds the workers’ plight, a journalistic recontextualisation of the original text
achieved not through sound bites but through a combination of language and such
visual means as quotation marks, different font size and different spatial dispositions
vis-à-vis the rest of the page. Like many scientists of his time, Marx was hampered,
but not defeated, by the resources he had at his disposal. He did not have colour
photographs of the conditions of men, women, children and animals in coal mines,
live broadcasts from Parliament or complex charts capable of representing
development over time that are a staple part of modern textbooks. In some cases, he
chose not to use some resources which he did have: the political cartoon, for
example. The Economist (Figure 2.2) uses all these resources and many more, some
of which, as we shall see, include dynamic properties that are the result not just of
technological innovation but also of society’s learning about how to represent
meaning in increasingly complex and often abstract combinations of the visual and
verbal.
Such learning (Halliday, 1978: 192) often translates into a greater capacity in
the modern page to condense meaning. This process of compression is apparent
in the plant movements example (Figure 2.1) and may be associated with the
greater abstraction of the modern scientific page. As suggested above, Darwin’s
use of abstraction in the visual is limited: the verbal text describes the original
naturalistic tracing at length while the movements in the diagrams are the result of
the movement of a real pen moved by real leaves. Only the representation of the
direction of the movement is abstract, in that arrowheads have been added on in
such a way as to reflect the direction of the movement as closely as possible. The
actual pen is not represented, while the tracings and the leaves are represented
separately. The modern text, on the contrary, is more abstract in its encoding of the
visual semiotic (Kress, Van Leeuwen, 1996: 170): not real leaves, pens or pen
marks, but artist’s representations of the leaves and the devices that produce the
tracings. This abstraction (in the modern author’s words, a ‘representative recording’;
Figure 2.1, Cluster (c), running text) allows the tracing process and the notion of
a cycle that repeats itself to be expressed in just half a page. This level of abstraction
is made possible by the integration of various semiotic resources; or, to see the
matter from a slightly different perspective, through the rise of the multimodal page.
The two capitals I and II remain entirely separate. But in order to represent them
thus as separate, we had to tear apart their actual interrelations and intersections, and
thus also to change the amount of turnover. For according to the above
diagram, the amounts turned over would be
[...]
But this is not correct, for we shall see that the actual periods of production and
circulation do not absolutely coincide with the above diagrams (Capital p. 311, Vol II, our italics)
Since the time of Darwin and Marx, our ability to construe temporality has
evolved. We have a greater awareness of the difference between the diagram and the
table and the different ways in which they can make meanings. As may be seen in
Figure 2.3, table and diagram are synonymous in Marx’s text, whereas in modern
times the table is partly embedded in language (it integrates language and visual
semiotic resources), as opposed to a diagram, which tends to be much more
abstracted from the thematics of language. This enables the diagram to do things,
i.e. make meanings, in significantly different ways from the table, as shown by an
instance from a page from The Economist (Figure 2.4), which, apart from the
running text (omitted here), contains four charts, each of which proposes graphic,
tabular, diagrammatic and linguistic information in differing proportions. The
left-hand chart in Figure 2.4 is about ‘Slumping’, that is, the danger of a worldwide
economic recession. It is divided into two parts: ‘World GDP growth’, which shows the
ups and downs of GDP over several decades, and ‘GDP shares’, which shows the
distribution of GDP in the world’s economies in 1997. The top right-hand chart, ‘A Raw
deal’, is about the decline in time of commodity prices, while the bottom right-hand
chart, ‘Drying up’, is about the decline in real as opposed to expected capital flows to
emerging countries. The key factor to note is that in the top two charts in Figure 2.4
we have a vector, such that the foregrounded emphasis is on the horizontal as con-
trasted with the vertical (which is what we found in Marx), and which we also find in
the bottom two charts. A vector is, in its simplest form, an arrow (Kress, Van Leeuwen,
1996). In the grammar of visual meaning, a vector signifies activity, action, movement,
direction and dynamism in time. There is a single vector in the top left-hand chart while
two are present in the right-hand one (with a joint starting point). In contrast to the
plant movements texts shown in Figure 2.1, none of these vectors need have the form
of arrows: no such explicit representation is necessary because, owing to the inclusion
of an explicit time scale, the vector successfully carries out the function of allowing
readers to make up their own minds about potential future developments.
This is a significant shift with respect to the visual semiotic resources avail-
able to Marx and other theorists of that era, who were struggling to represent time
in terms of static vertical structures of classification (cf. Figure 2.3) or who, alter-
natively, made use of rather primitive vectors. If we take the opening page of
Chapter X of Capital (Figure 2.5), we notice that, among its other functions, the
vector is used to measure the working day. The B-C lines in Figure 2.5 make
their meaning through their differing length, in this case, as explained in the text,
to represent the distinction between necessary working time and surplus working
time, the first a constant, the latter a variable. In itself, this is an interesting and
important development because it shows the beginnings of the use of the line to
represent change in time. However, even the most cursory glance at the properties
of vectors in a modern scientific text will clarify that the line has now acquired
many new functions. In computerised science texts, vectors are manifested in
many ways.
Whether as call-outs in word-processing packages or as indicators of the
presence of drop-down menus in websites, they play an important role in the
creation of pagelets appended in various ways to other pages. They are thus an
important resource in the evolution of the page from a static, printed medium to
the dynamic, multiple-dimension phenomenon of the distance communication age as
represented by the web page (see 3.7, pp. 130-136 and 3.8, pp. 136-146) and the
video text (4.11, pp. 223-248). A comparative study of the modern scientific page
and that of a previous era helps us understand the role that various resources, such
as vectors, play in a page while at the same time providing an important arena in
which to explore a further question that we posed at the outset, namely: how does
the page communicate?
ognise and understand. So far we have discussed this with regard to the integration
of visual and verbal resources through the use of resources such as diagrams, tables
and vectors. A further step relates to the question of intertextuality (see Inset 8, p. 55).
The dependency of texts on each other and the way a text incorporates other texts
has been explored, among others, by Lemke (1985), Thibault (1994b) and Bakhtin
(1986); the latter, as we have mentioned previously (Chapter 1), with regard to the
question of heteroglossia and the distinction between primary and secondary
genres (see Inset 6, p. 43). These writers essentially adopt a position whereby the
speaker is situating himself or herself in an entire field of social relations and the
conflicts which are necessarily generated within this field (Thibault, 1989: 202).
Hence any personal opinion or view or interpretation is inevitably connected to the
social and collective nature of meaning making.
In the case of Marx we can see clearly how Capital is a composite of texts
that function to give different perspectives and points of view on the same themat-
ics: the labourer’s, the parliamentarian’s, the mine owner’s, the factory inspector’s.
Indeed, the explicit attempt to represent a plurality of voices that we find in Marx’s
work raises the question of Marx’s intuitive awareness of the dynamic processes of
social heteroglossia and intertextuality. Thibault (1989: 183, 203) has pointed out that
all texts, whether spoken or written, have the meanings they do in relation to other
texts within a given speech community as well as historically prior texts with the re-
sult that texts do not stand in given, static or neutral relationships with other texts.
Instead, the relationships between texts are constructed by text users. Speech is
constituted out of the heteroglossic interplay of both consensus-oriented and con-
flicting discourse voices in our social meaning-making practices. Capital exemplifies
all this very well. In one sense Capital is a rearticulation of discourse voices
arranged in such a way that the reader is encouraged to perceive new meanings and
new relationships between them. Thus Marx goes beyond merely quoting written
reports and instead often invites the reader to look at them in a different light from
that which was originally intended. As this book attempts to suggest (see, for
example, the analysis of the web page in the second part of Chapter 3 and the analysis of
the Westpac text in Chapter 4), recontextualisations (Inset 17, p. 213) and realignments
typify modern discourse, whether in informal oral discourse or the more formal scien-
tific page. As Lemke writes:
The voice of a text is heard only against the background of the voices of other
texts, within some relatively stable social system of heteroglossia. A text voice
orchestrates the available social voices of heteroglossia, speaking some in its
own name, and putting others in the mouths of those it defines as opponents,
allies, or bystanders. It construes, directly or indirectly, value-orienting relations
among these voices, evaluating some favorably and others unfavorably, and
modifying, reproducing, or presupposing the prevailing relations of the voic-
es in a community.
(Lemke, 1989: 40)
We may briefly summarise what we have so far stated or implied. Scientific texts
have always combined and integrated language and visual images in the making of
the specialist meanings of scientific discourse (Lemke, 1998). In science texts,
words and images are typically not, and possibly never have been, strongly insulated
from each other.
Moreover, the conventions of scientific discourse do not assume that lan-
guage has necessary priority over the visual. Importantly, this distinguishes
scientific discourse from the privileged forms of verbal literacy which are derived
from the literary traditions that have informed almost all thinking about language
in the Western European tradition since the early linguistic theorising of Plato and
Aristotle. Instead, science texts exhibit a close interaction between verbal and visual
semiotic modalities. The two semiotic modalities work together in synergy to
create the overall meaning of the textbook page. The two modalities are seen as
closely linked according to shared principles of composition which are based on
the visual design or layout of the page as a whole.
Thus science textbooks are visual semiotic artifacts as much as they are
linguistic ones. This does not mean that writing now ceases to play an important
role. Rather than adding pictures to words, words are integrated with visual images.
Moreover, writing is itself becoming increasingly integrated with principles of
visual design as the graphological resources of the written language increasingly
exploit the possibilities afforded by other visual semiotic resources.
The codependence of visual and linguistic semiotic modalities in science
textbooks is based on multimodal conventions of meaning making which have
tended to elude the predominantly language-based forms of literacy which have,
until recently, shaped most of our thinking about educational practice and intellec-
tual activity. An important component in the study of such conventions is the
analysis of the semiotic principles whereby verbal and visual resources are com-
bined to produce the multimodal meanings that are characteristic of science (see
also Kress, 1998).
2.4 Visual, verbal and actional semiotic resources in a table
Figure 2.6: Table and related verbal text: pages 60 and 61 in Australian Biology
Figure 2.7: Blood under the microscope and related verbal text (pages 62-63)
example, each item is thematically subordinate to, and hence homogeneous with
respect to, the superordinate heading. The table can be read to obtain information
about an item (e.g. red corpuscles), or to contrast them, item by item, with the
information that is arranged in the second column under the heading White
Corpuscles. In this table, the items in each row in the first column, Red Corpuscles,
are elliptical with respect to their corresponding full clauses. The latter have been
reconstructed in Table 2.1. The linguistic thematic in the table in Figure 2.6 is only
minimally condensed. Overall, the foregrounded meaning of the table amounts to:
Table 2.1, row 3: CARRIER ^ PROCESS ^ ATTRIBUTE (type-quality): They are about 1/3200 inches wide
macro-Theme. Thus, the main title of the chapter, viz. Blood and its functions,
specifies the most salient thematic items in a highly condensed form. It does so
here through the interaction of Theme and New in the following way. The noun
blood in the title establishes the point of departure for the thematic development
of the page as a whole by referring back to meanings which were made on the
previous page, where the topic of blood had already been introduced (left-hand page
in Figure 2.6). It recapitulates these meanings in condensed form at the same
time that it anchors the further development of the meanings of the current page
in relation to meanings that have already been developed. The second part of the
nominal group complex in the title, viz. its functions, is New because it anticipates
meanings which have yet to be made in the further development of the meanings
of this (and the following) page. This dual function of the main title is also
highlighted both visually and spatially by the choice of upper case letters and the spatial
centring of this textual unit near the top of the page. As an element of the
macro-Theme, the main title of this page functions to predict the hyper-Themes which are
derived from it in both the main text and the table. A hyper-Theme may also predict
the pattern of thematic development in the table as a whole, though this is not the
case here. Themes function to predict the pattern of thematic development at the
level of the clause or the intersection of row and column in a table (see
Martin, 1992: Chap. 6 for these terms).
Table 2.2: Verbiage (macro-Theme, macro-New; hyper-Theme, hyper-New) and Table (An orange-red colour. / Colourless.; About 5,000,000 in 1 cubic mm (see ruler) of blood. / About 7,000 to 10,000 per cubic mm.; etc.)
The first paragraph is a further component of the macro-Theme: it too
predicts the hyper-Themes developed in subsequent paragraphs. This paragraph
contains three macro-News. These are highlighted by the graphological resources of
indentation, bold type, and numbered sequencing. The first two of these
macro-News are, in turn, further developed as the hyper-Themes of the two paragraphs
which follow. However, the third of these macro-News, i.e. Cells, called corpuscles,
red and white, is not taken up by one or more paragraphs of main text as
hyper-Themes which are, in turn, further developed as hyper-News. Instead, this
macro-New is further developed in the table. In this case, the headings of the two
columns (i.e. Red Corpuscles and White Corpuscles) in the table function as
hyper-Themes which predict the parallel and simultaneous development of two sets of
hyper-Theme and/or hyper-New relations in the form of the numbered sequences
of ellipted clauses that constitute the rows in each of the two columns.
Each column of the table concludes with a further wave of periodicity in the
form of extended verbiage concerning the theme of work. This subheading may be
construed as a cohyponym of the word functions in the main title at the head of the
page. In this way, the hyper-Themes Red Corpuscles and White Corpuscles, which
establish the pattern of thematic development in the two columns of the table, are
further developed experientially in terms of the specific functions of the two types
of blood corpuscles. They thus constitute a further parallel set of hyper-News which
unpack and elaborate the highly condensed thematics of the title of the page, Blood
and its Functions. If, on the other hand, these meanings had been developed exclu-
sively in the verbal text, the simultaneous construal of thematically homogeneous and
thematically heterogeneous items, which the visual-graphological resources of the
table afford, would not have been possible. The method of thematic development of
the page discussed above is illustrated in Table 2.2.
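By way of illustration, the two reading paths that the table affords (down a column under one hyper-Theme, or across a row to contrast the two types of corpuscle) can be sketched in a few lines of Python. This is only a sketch under our own assumptions: the cell wordings are those of the table in Figure 2.6 as quoted in this chapter, whereas the attribute labels colour and count, and the names read_column and read_row, are hypothetical labels introduced here and do not appear in the original table.

```python
# Cell wordings follow the table fragments quoted in this chapter.
# The attribute keys ("colour", "count") are hypothetical labels added
# for illustration; the original table numbers its rows instead.
table = {
    "Red Corpuscles": {
        "colour": "An orange-red colour.",
        "count": "About 5,000,000 in 1 cubic mm (see ruler) of blood.",
    },
    "White Corpuscles": {
        "colour": "Colourless.",
        "count": "About 7,000 to 10,000 per cubic mm.",
    },
}


def read_column(heading):
    """Homogeneous reading: every item listed under one column heading."""
    return list(table[heading].values())


def read_row(attribute):
    """Contrastive reading: the same attribute under both column headings."""
    return {heading: cells[attribute] for heading, cells in table.items()}
```

Calling read_column("Red Corpuscles") follows a single hyper-Theme downwards, while read_row("colour") sets the two hyper-Themes side by side. Running verbal text would have to serialise these two paths one after the other, whereas the table makes both available at once.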
the process, the important semantic relations between processes, participants and cir-
cumstances in the clause are correspondingly downplayed by the linguistic folk-theory
implicitly at work here. Such intraclausal relations are an important means through
which thematic systems are developed. The use of the microscope further implies an
intertextual/indexical link to the practices of the laboratory and highlights the cross-
coupling of semiotic and material relations and processes in the making of these
scientific meanings. These practices and the expanded perceptual capacities afford-
ed by the microscope operate as prosthetic extensions of our perceptual systems
which expand the human Umwelt (Harré, 1990: 301). The Umwelt is that part of the
physical-material world which humans have adaptively modified as their living space.
Thus, we see that technical instruments, no less than linguistic and pictorial signs,
mediate human activity and shape the processes of abstract thought and reasoning
(see Chapter 3 for further discussion of this important concept).
The three photographs that are displayed on page 63 of Australian Biology
(Figure 2.7) do not, of course, operate according to the modality of realism. First,
they index a referent situation which is not available to ordinary human perception
on the scale of reality on which humans normally act and perceive, i.e. the every-
day material and social world that we perceive through our senses, live in and in
which we interact with others. Oddly enough, photographs are often considered as
true pictures of an objective reality. However, graphs and diagrams also function
to interpret the world as it truly is. Yet, they lack the characteristics that confer on
a photograph its sense of reality. Graphs and diagrams are abstract and general;
photographs are detailed and specific. Nevertheless, both make claims about the
way things really are. They do so in different ways, according to different
conventions as to how we relate visual images to the world. Photographs are inter-
preted according to naturalistic conventions. That is, the greater the congruence
between what you perceive with the eye and what the photograph represents, the
higher the reality modality of the image. Clearly, photographs, like other visual texts
in other domains, apply this principle to varying degrees.
The photographs shown here are oriented to scientific truth and objectivity.
The scientific definition of reality and truth operates according to a different reality
modality from the naturalistic one. In this case, it aims to reveal deeper, more general
principles that go beyond the detail and characteristics of the specific instance. In the
examples analysed here, there is no background detail; colour and depth are also
absent. Features like background detail, colour and depth are signifiers of reality in the
naturalistic domain of the photograph. However, in the scientific domain of the graph
and the diagram, such details are considered unimportant. Again, these principles may
be applied to various degrees and this will depend on the goals of a particular text.
Moreover, the three images are concerned with conceptual-relational, rather than
actional, meanings (Kress, Van Leeuwen, 1996: 79-118). Thus, they function to
analyse the given phenomenon into its component parts, or to classify related
phenomena by grouping them together for the purposes of comparison. This would
be one motivation for the juxtaposition of the three photographs on this page.
How are linguistic and visual semiotic resources integrated in multimodal
texts? As discussed in detail in Chapter 1, Halliday (e.g. 1978, 1994 [1985]), holds that
language form, i.e. lexicogrammar, is simultaneously organised in terms of a number
of overlapping semantic-pragmatic functions (see Inset 4, pp. 22-23). Lexicogrammar
simultaneously:
• categorially construes experience, including naming and referring;
• enacts interpersonal relationships;
• construes relations of logical (spatial, temporal, causal, and so on)
dependency between items;
• provides the means whereby language coheres textually as
discourse which is operational in context.
As we saw in Chapter 1, similar proposals can be made about other, non-linguistic
semiotic systems (see Baldry, 2000a; Kress, Van Leeuwen, 1990, 1996; Lemke, 1998;
Martinec, 1998; Nalon, 1997, 2000; O’Halloran, 2004, 2005; O’Toole, 1994;
Thibault, 1994a, 1998b, 2000a; Ventola et al., 2004). The fact that different semiotic
modalities such as language and depiction share a common metafunctional basis
thus becomes an important principle enabling their integration in multimodal
semiosis (see Inset 4, pp. 22-23). It does not necessarily follow, however, that all the
metafunctions are equally distributed across all the semiotic modalities which are at
play in a given instance. Moreover, different weightings and distributions of the meta-
functions may come to the fore in different phases of the same text (see Inset 7, p. 47).
The full meaning of the particular page under consideration can only be
understood by integrating experiential, logical, interpersonal and textual contribu-
tions from: the visual semiotic; the linguistic semiotic of the captions; the relations
of this page to the main verbal text; the actional semiotic of laboratory practices
and other technical procedures such as using the microscope and measuring with
the ruler. However, it is important in this respect to notice once more how the
resource integration principle (Inset 3, pp. 18-19) is at work: the visual text is not
simply an illustration of a more important verbal text. Rather, the visual text adds
dimensions of meaning that are not, and cannot, be made in the linguistic text; the
visual text is complementary to, rather than subordinate to, the linguistic text, as
2.6 explains.
photographic display from Australian Biology, the frog and human blood examples
are positioned in the top part of the page, on the left and right respectively. Both
images show instances of blood belonging to animals of the superordinate class
[BACKBONED ANIMALS], as distinct from lower animals such as molluscs, and so on
(see p. 60 in Figure 2.6). However, amphibia are positioned as being lower on a per-
fection scale, as we shall now see, and this would motivate their positioning in the
overall visual space with respect to the more highly valued human case. The image
showing the human white corpuscles (top right) is therefore accorded maximum
visual salience, as well as being rated highest on a scale of visual importance. This
is consistent with the thematics of the verbal text, e.g. This [double circulation] reaches
perfection in mammals, so again the human will be taken as a typical example, important
to us (p. 60 in Figure 2.6). The covariate thematic tie which can be construed between
these visual and verbal elements also works to enact a joint visual-verbal axiological
orientation which positions the human case as high on both the usuality and importance
modal and evaluative scales. The centring of the image of the frog’s capillaries
in the lower part of the page in relation to its greater overall size suggests that it
is being positioned about midway along a cline between the low and median points
with respect to the Given-New and salience parameters (see Table 2.3).
nucleus and surrounding protoplasm link to Row 5 of the same column. Moreover, all
three of these labels are connected by arrows to selected features in the image
which they indicate. More precisely, we would argue that what is being constructed
here is a joint verbal-visual thematic system whereby a number of resources and inter-
texts are integrated in order to construe the superordinate thematic meaning
glossed here as [WHITE BLOOD CELL]. The use of upper case letters in square
brackets is a convention for specifying a higher-order, more abstract thematic
formation, rather than specific linguistic or visual instantiations of this (see Lemke,
1983; Thibault, 1986 for the earliest formulations and developments of these
notions). The meaning white blood cell is not simply built up and developed through
the experiential resources of language, i.e. through clause level selections, patterns
of lexicosemantic collocation and so on. No less importantly, there is also an
experiential dimension to this overall thematic formation in the visual semiotic.
The choice of the arrow which connects the various clause or nominal group labels
to selected features of the second (top right) image is, of course, a visual semiotic
resource which functions as a directional vector. As we shall see below, it functions
to construe an intersemiotic relationship between the verbal and visual modalities.
The third image (bottom part of p. 63) links back in with the thematic item
fine-walled capillaries, linking arteries and veins on p. 60 (Figure 2.6), as well as with
the caption Blood capillaries in a frog’s foot. Here we see a number of different kinds
of basic relations between the verbal and the visual semiotics. In particular, these
are: main verbal text to figure and label or caption to figure. Such relations are
quite typical of multimodal scientific texts. The meaning of the text cannot be
restricted to either of these components considered separately. Rather, the caption
or label is co-contextualised with the visual elements of the figure or some specific
aspect of this. The one cannot be reduced to the other, since the two different
semiotic systems make meanings in quite different ways. The combination of the
resources from the two systems results in a new set of meaning relations which is
not reducible to the sum of its parts. This is yet a further example of the resource
integration principle (Inset 3, pp. 18-19) that we illustrated in Chapter 1. The meaning
is the composite product of their combination, rather than the mere addition of
the one (e.g. the verbal) to the other (e.g. the visual) (Bateson, 1987 [1951]: 175;
Lemke, 1998).
For example, the three verbal labels pointing to various aspects of the visual
text Blood and Circulation cannot specify the simultaneity of the global relationships
among, say, the different light and dark areas in this figure, or of the varying shapes
of these same dark areas. Language is specialised, if not uniquely so, to construe
phenomena in terms of typological-categorial distinctions. Moreover, it is not well-
suited to the depiction of visual phenomena, as Bühler (1990 [1934]: 220-41) has
shown. The topological-continuous relations mentioned here are, on the other hand,
characteristic of the visual semiotic. The visual image Blood and Circulation is a
continuous topological field which, however, may be broken down into specific subre-
gions within this overall topological space (see Saint-Martin, 1985: 47-8).
The meaning of this multimodal biology text is a result of the composite
relations among its verbal and visual elements. However, this does not preclude the
likelihood that the meanings of the visual elements may be immediately apparent
to the trained biologist without the mediation of the linguistic dimension. Indeed,
this only enhances the fully-fledged semiotic status of the visual, as well as show-
ing that the intersemiotic links that may be construed between the verbal and the
visual can be achieved on the basis of varying reading paths according to factors
such as the specific interest or the level of expertise of the reader.
In the school science textbook, on the other hand, the verbal meanings of
the labels and the vectors (the arrows) linking these to selected aspects of the visual
text become more specified when cross-linked to a visual element. The first of
these labels, human white corpuscles take in bacteria, is a clause of the type
[Actor ^ Process: Material ^ Goal]. This is linked by an arrow to a specific relation-
ship between a relatively large dark area (white blood corpuscle) with respect to sev-
eral adjacent smaller ones (bacteria). The relationship between these two classes of
entities constitutes a subregion in the visual field. Such subregions are established by
relations such as proximity, separation and envelopment (Saint-Martin, 1985: 47). In
this particular case, the relationships of both proximity and envelopment give rise
to a subregion in which the larger entity (the white corpuscle) and the smaller ones
(the bacteria) are seen as interrelated.
Visually, the precise nature of this relationship is probably not readily inter-
pretable for the person who is unfamiliar with the meaning-making practices asso-
ciated with the perceptually-enhanced means of interpretation afforded by the
microscope. However, the verbal text makes it clear that this particular subregion in
the visual field realises a Process-Participant relation analogous to the one men-
tioned above. In this case, the verbal text gives specific experiential meaning to
visual relations which would not necessarily be so interpretable by the non-special-
ist. It does so by construing the visual pattern in terms of a particular kind of
Process-Participant configuration. In other words, a thematic tie is construed
between the linguistic clause and a selected aspect of the image. Note, too, that
both nominal participants in this clause have generic rather than specific reference.
That is, the verbal text abstracts from the specificity of this particular white corpus-
cle taking in bacteria and construes the visual image in terms of the generic mean-
ings that are typical of technicality and abstraction in the discourse of science
(Martin, 1991, 1993). In any case, the visual image itself operates in the modal
domain of generic truths abstracted from everyday perceptual experience (i.e. nat-
uralism), as indicated by the absence of depth and the use of black-and-white.
In the case of the other two labels, viz. irregular nucleus and surrounding pro-
toplasm, the vector linking these to selected aspects of the visual image establishes
a cross-modal relationship between the verbal and the visual elements which is,
semantically speaking, analogous to an attributive predication. In the first instance,
the verbal elements specify the visual elements more precisely. In other words, a
type-category, as specified by the nominal group in the verbal label, is instantiated by
a visual element. In these two cases, the arrow is analogous to the copula be of
relational be-clauses. It functions to establish an intrastratal relationship of instantiation
– i.e. schematicity – between the two units. A schematic category is both more
general and more abstract than its lower level instantiations. The relationship of
schematicity can be explained as follows: the nominal group irregular nucleus
specifies a schematic category which the image instantiates as a more detailed
instance of the category. Importantly, the relationship between the two units – one
visual and the other verbal – is, logically speaking, one-way and irreversible. That
is, the type-category which is construed in the nominal label is schematic for the
visual element, whereas the visual element cannot be schematic for the type-category
in the nominal group (Davidse, 1992: 101).
The image instantiates in a more detailed and more specific way the
schematic category which is symbolised by the nominal group. This relationship is
one-way and irreversible because a particular instance does not specify the more
schematic criteria which define a particular category. Instead, instances conform to
varying degrees to the higher-order schema. This suggests that the precise semantic
relationship is more like attribution than identification. The attributing item in each
case is the nominal group, whereas the visual element is the item to which the
attribute is assigned (cf. Carrier in the grammar of attributive clauses; Halliday,
1994 [1985]: 120-2; Halliday, Matthiessen, 2004: 219-226). This semantic non-
reversibility is also illustrated by the fact that the spatial relations between the visual
and verbal elements in this multimodal syntagm are not interchangeable, as is further
suggested by the unidirectionality of the vector linking the two elements. Thus, the
vector moves unidirectionally from the schematic category in the verbiage to its
instantiation in the image. The image semantically elaborates the verbiage. The
notion of elaboration alluded to here is not, however, of the logical type (Halliday,
1994 [1985]: 225-9; Halliday, Matthiessen, 2004: 396-405) because there is no sug-
gestion of tactic dependency between the two items in the syntagm. Rather, elabora-
tion is intended as a more schematic category of ideation which is superordinate to
both attribution and identification. There is, therefore, a topological proximity
between the following two possible construals of these multimodal syntagms: (1) a
visual Carrier instantiating a nominalised Attribute; and (2) a visual Token/Identified
being decoded as (signifying) a nominalised Value/Identifier. The topological close-
ness of these two interpretations argues for a more superordinate ideational gloss
such as elaboration or exemplification as the best means for revealing the
indeterminacy between the two possibilities. The semantic relationship between the
visual and the verbal components of this syntagm may be modelled as in Figure 2.8.
The vector also has an interpersonal function, since it draws attention to a spe-
cific feature of the overall visual field and, hence, accords it a degree of importance.
In so doing, it organises and structures the visual field such that the visual feature so
designated demonstrates the linguistic category (Goodwin, 1994: 629) in ways some-
what similar to pointing with one’s finger. Thus, the spatial organisation of elements
on the page can now be seen not as a static display but, rather, as a material-semiotic
artifact which is positioned within a dynamic social-discursive field of specific profes-
sional practices and competencies concerning questions such as what is visually
salient, who is authorised to draw attention to it, to name it and to make specific
knowledge claims about it. The vector, as realised by the arrow, also has a textual or
linking (cf. phoric) function such that the arrow is the means whereby the two units –
one visual, the other verbal – are textually linked to each other. Moreover, the unidi-
rectionality of the arrow suggests an indexical function as well. That is, the arrow
points to the specific visual item which the linguistic item elaborates. Unlike vectors
in visual material processes (Kress, Van Leeuwen, 1996: 56-64; Thibault, 1997a: 330-
3), the vectors under consideration here do not have experiential meaning. Instead,
their meaning is centrally concerned with the textual metafunction. Experiential
meaning, on the other hand, is to be found in the lower level units – visual and lin-
guistic – which are linked by the vector in this multimodal syntagm.
The three images may, overall, be said to realise meanings in the semantic areas
of classification and analysis. The juxtaposition of the three images on the same page
follows the classification pattern mentioned earlier. In this case, we are given two
types of blood – frog and human – as well as an example of the means by which
blood is circulated in capillaries. Kress and Van Leeuwen (1996: 81) point out that
such visual classificational processes relate participants to each other in terms of a
taxonomy. Visual taxonomies are characterised by a decontextualising ‘objectivity’, as
evidenced by minimal or no background detail, lack of depth and the use of frontal
perspective. The elements which comprise the taxonomy tend to be arranged symmet-
rically and at a roughly equal distance from each other within the visual field of the
image. The single page arrangement of the three images in the Australian text is a
good example of a visual classification, as is the inset showing different types of white
blood cells in the 1982 Italian text (Figure 2.10).
Kress and Van Leeuwen (1996: 79-89) argue that the elements comprising a
visual classification are (experiential) participants in a relationship of the experiential
kind. In our view, the elements in a visual classification are better seen as standing in
some kind of superordinate ideational relation as discussed above. In the two exam-
ples referred to here, no experiential situation is being construed at the level of the
display as a whole. Rather, each of the visual images elaborates or exemplifies the
implicit superordinate term which relates each of the constituent images – the sub-
ordinate terms – in a relationship of schematicity. In the Australian text, the superor-
dinate term may simply be glossed as [BLOOD], of which the human and frog cases
are instances. In the inset featuring the white blood cells in the 1982 Italian text, the
superordinate term is [WHITE BLOOD CELL] and the individual images are different
subtypes of this.
Now, there is nothing natural or given about these classifications. This raises an
important question as to the role of visual images in the processes of learning
specialised scientific meanings. To the apprentice reader untrained in the specialised
scientific meanings, the images in the two displays under discussion here afford the
recognition of a number of basic visual patterns. In the Australian text, this may be
glossed, somewhat crudely, as small dark regions contrasting with a uniform grey back-
ground. In the inset of the white blood cells from the Italian text (Figure 2.10), we
may discern, among other things, the existence of a central area – a nucleus – in each
of these types, in spite of differences in shape. However, visual pattern recognition
alone is not a sufficient basis for inferring more specific scientific meanings, though
the recognition of visual patterns is an important starting point (St. Julien, 1997: 275).
As patterns which we perceive through our visual perceptual system, these
patterns are material entities arranged on a material surface (the page in a book).
However, these material entities are themselves cross-coupled to the multimodal
meaning-making practices of some discourse community such as the classroom
practices associated with the teaching and learning of science (see Lemke, 1990b)
as well as a particular culture’s conventions for making visual meanings. The visual
grammar of classification, as described above, is not therefore a perceptual given.
Instead, it serves to add structure and meaning to visual percepts. Like the grammar
of natural language, it does so for the most part implicitly and unconsciously. The
deployment of the visual grammar adds structure and meaning to otherwise vague
and unspecified percepts by organising and interpreting these according to the
conventions of a visual grammar.
Figure 2.8: Relation of elaboration: attribution between visual image and verbal label
Overall, the three images play their part in the building up and development of
a wider thematic formation that can be glossed as [BLOOD IN BACKBONED ANIMALS
AND ITS TRANSPORTATION]. Again, this thematic formation is jointly constructed on
the basis of verbal and visual semiotic resources. This helps us to clarify the
relationship between the top two images and the bottom one, which features capillar-
ies in a frog’s foot. The relationship is not, therefore, simply one of spatial juxtaposi-
tion on the same page. Each image, along with its caption and/or labels, and the links
between these and the main verbal text, plays its part in the development of a joint
verbal-visual thematic formation. Thus, the third image – frog’s capillaries – selectively
instantiates and develops part of the thematic concerned with blood transportation in
backboned animals (including frogs and humans). In doing so, it forms a thematic link
with the specifically verbal development of this on p. 60, especially in the three final
paragraphs on this page (Figure 2.6). The frog and human instances are cohyponyms
in relation to the superordinate thematic formation, as glossed above. This is equally true
of both the verbal and the visual contributions to this thematic formation (see
Thibault, 2004b: 306-310). With respect to the photograph showing the white blood
corpuscles (Figure 2.7), this joint visual-verbal thematic formation may be modelled
as in Figure 2.9. Figure 2.9 shows how the thematic formation [WHITE CORPUSCLE] is
built up on the basis of a trade-off between verbal and visual elements such that the
two semiotic modalities jointly create these thematics. For example, the clause white
corpuscles fight and destroy germs and bacteria in the table (p. 61) is cohyponymic to the
clause human white corpuscles take in bacteria, one of the labels in the microscope slide
labelled Blood and Circulation (p. 63). The latter also stands in a relationship of elabo-
ration/exemplification to the visual image, as indicated by the vector (see above). Each
of these thematic items and the links that are created between them, both individual-
ly and jointly, instantiate some aspect of the superordinate intertextual thematics
[WHITE CORPUSCLE].
[Figure 2.9 (diagram not reproduced): three images linked, under the label Cl-Th, to the thematic items [WHITE CORPUSCLES ... (DESTROY) ... GERMS] and [WHITE CORPUSCLES ... (TAKE IN) ... BACTERIA], each analysed as Ac-Pr-G]
Similar remarks may also be made concerning (1) the thematic links between
the nominal group white corpuscles and the nominal groups irregular nucleus and sur-
rounding protoplasm, which label selected aspects of the visual image on p. 63, on the
one hand, and (2) white corpuscles and the nominal groups nucleus and protoplasm,
which occur in the first paragraph of the main text (p. 61), on the other. All of these
nominal groups are cohyponyms in relation to the more superordinate thematic item
[BLOOD], which is retrievable from the main text (p. 61). However, irregular nucleus
and nucleus are also hyponymic to the superordinate thematics [WHITE CORPUSCLE],
which is the focus in the present example.
Usuality: typical/expected. Finally, the juxtaposition and comparison of frog and
human blood, along with the tie to the verbal thematics of p. 60, which establishes a
link to a superordinate thematic item, [BLOOD IN BACKBONED ANIMALS], sug-
gests that the images reveal what is typical or usual about blood in such animals despite
specific differences. Even though they obviously show us specific blood samples under
laboratory conditions, the visual images are not concerned to highlight any specific or
individual features, assuming this were possible, or to show the blood as that of an indi-
vidual frog or human. Rather, the images abstract away from the specific to foreground
what is typical and expected. Moreover, the top two images in particular, by virtue of
their juxtaposition, imply a relationship of comparison which serves to highlight the
common or shared features of frog and human blood. Neither the verbal nor the visual
text explicitly mentions how these might differ. Therefore, possible differences
between the two images, which only an expert reader could appropriately assign
meaning to in any case, are not foregrounded. Thus, we see that the top two images
show a similar – not identical – visual-perceptual pattern, i.e. blood cells depicted as
small dark regions in plasma fluid, depicted as a uniform background grey.
Stance towards the user. In the present example, the users are 12-year-olds in
1950s Australia. Overall, the text works to introduce new terms and to define and
explain them. The stance towards the reader may be summed up as one of author-
ity mediating for, and in some sense condescending towards, the young school
reader, in the sense that the text has to talk down the terms, explain and define
them for its young readers. For example, the second (top right) image on p. 63
(Figure 2.7) would not require the labels and vectors linking these to selected
features of the image if the text were addressed to expert readers.
Stance towards other texts. The text defers to, and bases its knowledge claims
on, the authoritative and high-prestige discourse of biology and its intertexts at the
same time that it mediates between these and the more familiar, everyday language
of its young readers. In other words, the text situates itself in, and participates in, a
system of possible social viewpoints and evaluations regarding its own thematics as
well as those of other social discourses. This is the system of social heteroglossia as
first theorised by Mikhail Bakhtin (1981; see also Thibault, 1991b: Chaps. 5-6). In the
everyday domain, experiential meaning is mainly tacit and based on commonsense
understandings of reality, as well as on personal experience. Interpersonal interaction
is grounded in family, friends and peers and the discourse positions associated with
these interactions. In the scientific-technical domain, on the other hand, abstract and
objectified forms of knowledge are learned through specialised processes of train-
ing and apprenticeship vis-à-vis the social practices and ways of making meaning of
the specialised discipline. Interpersonally, this entails specialised interactional roles with
specific claims to knowledge and authority. It also entails an objectivating stance
towards the self and the self’s relationship to the ideational field of knowledge in
question. These questions will be further discussed in the next section.
The Italian texts: differences with respect to the Australian texts 91
2.7.1 Reading paths
We shall now turn our attention to the two Italian school textbooks dealing with
the same thematic system – blood and its circulation in the body. In both cases, we
have a double-page display comprising verbal and visual elements. The first thing to
strike the reader on a first reading of the relevant parts of the two texts under
discussion here – published in 1982 (Figure 2.10) and 1985 (Figure 2.11),
respectively – is the overall stability of the thematic meanings, in spite of the fact
that more than twenty years have elapsed since the publication of the Australian
text in 1957. Moreover, the fact that this stability is exhibited across two different
languages – English and Italian – indicates that the discourse of science and its
intertexts is interlinguistic, as well as being intersemiotic. The two Italian texts, in
relation to the Australian text, also show that the multimodal construal of scientific
meanings is not something new to our own historical period. What is new in the
two Italian texts with respect to the Australian text is the increased integration of the
verbal and the visual modalities in the overall design of the page. That is, the
boundaries between words and pictures are now much more weakly insulated; the
two modalities interpenetrate each other considerably more and in ways that complement
and add to each other’s meaning-making possibilities (Kress, 1998: 59-66).
A further difference is the use of colour, as distinct from the use of black-and-
white photographs and line drawings in the Australian textbook. While these devel-
opments are also due to technological changes over the period from the 1950s to
the 1980s, the more important point is that new technology affords new
possibilities for multimodal meaning making.
Above all, the compositional textual structure of the more recent texts
foregrounds linearity to a lesser extent. This is so because in both writing and depic-
tion, it is spatial organisation which enables complex arrangements of elements in
syntagms which are often non-linear, as are the reading paths they make possible (see
also Harris, 1995: 46). Let us consider once more page 61 (Figure 2.6) in the
Australian text. In this case, the table is embedded in the verbal text and, as we
showed in 2.4 (pp. 71-78), it is strongly tied to this on account of the thematic links
between the two. Here, the compositional layout of the page, which is part of the
overall verbal-visual textual metafunction, more or less compels the reader to adopt
a linear, top-to-bottom reading path. There is less motivation to view the table as a text
in its own right (which to some extent it is), which the reader can hop to if desired,
and read independently of the verbal text (see Inset 5, p. 31). In the 1957 text, the
top-bottom layout strongly compels the reader to start with the linguistic semiotic and
then to read the table as being tightly integrated with it, both visually and spatially, as
well as thematically. This is less true of the photographs on p. 63 of the Australian
text (Figure 2.7), though even here these are all placed together on a separate page and
are relatively strongly insulated from the main verbal text. In both of the Italian texts,
linearity is less foregrounded. The visual images are either more closely integrated with
the verbal text (the 1982 text) or more explicitly cross-linked by means of indexical
pointers such as Figure 2.11 and so on, in the case of the 1985 text.
In the double page presentation of the 1982 Italian text shown in Figure 2.10,
there are three insets showing, respectively, the heart on the top left, two different
visual perspectives on red blood cells in the bottom right of the first page and a number
of different subtypes of white blood cells in the large inset at the top left of the
second page. Unfortunately, it is not possible to analyse these in detail here. The visual
semiotic is a prominent feature of the spatial design and the meaning of this double
page. Furthermore, all three insets are accompanied by their own verbal texts in italics
– itself a typographical-visual contrast with the main verbal text – which function to
emphasise the relative autonomy of the insets and the combined verbal-visual
thematics that they operate. In other words, each inset functions like a quasi-
autonomous text to which the reader can refer in any order he or she prefers, without
following a single preferred reading path based on an a priori top-down and left-right
directional organisation. Two of the insets – those concerned with red and white
blood cells – are thematically tied to material that occurs several pages earlier (p. 95,
not reproduced here) in the main verbal text. This, too, says something about the
‘looser’ nature of the links between any given part of the text and some other part.
The reader is freer to hop back and forth, and this suggests that multimodal scientific
texts are a precursor form of hypertext, with multiple reading paths and multiple links
between different parts of the overall text (see 3.6, pp. 126-129 and Lemke, 1998).
fact that colour, for mainly interpersonal reasons, is usually the unmarked choice in
modern science textbooks for school pupils can be explained as follows. Line
drawings, graphs and diagrams, as well as photographs obtained by technologically
enhanced means such as the microscope, are, in their different ways, abstractions
from or departures from the visual percepts that we are accustomed to in everyday
reality. This is so because the phenomena under consideration here exist on a scale of
reality that is very much smaller than that of the ecosocial scale on which individuals
interact with and act on both their material and social environments. This microscop-
ic scale is not available to normal human perceptual experience. Furthermore, the
colours used in the line drawings of the white blood cells in the 1982 text are, strictly
speaking, not accurate. This is so because, as the main text tells the reader on p. 95,
sono incolori ... e non hanno in genere una forma propria (they are colourless ... and in general
are formless). Rather than a question of the text contradicting itself, this is a case of
the use of colour belonging to the sensory modality, as distinct from a realistic or
objective scientific one (Kress, Van Leeuwen, 1996: 168-71). The use of colour therefore
functions to mediate the interpersonal relations between the scientific thematics
and the apprentice readers by appealing to the domain of sensory experience, which is
characterised by rich, saturated colours. From this point of view, the electron micro-
scope photograph of white blood cells in Picture 9 on p. 425 in Figure 2.11 more
closely corresponds to the sensory modality. In other respects, the line drawings under
discussion here are oriented to the scientific modality insofar as they seek to represent
the phenomenon according to its essential characteristics. This suggests that the visual
semiotic is negotiating between two distinct contextual domains and their respective
value orientations. The resulting modality hybrid highlights the way in which the
‘objective’ scientific attitude abstracts away from perceptual phenomena and bodily
experience in order to get at the underlying truth or essence of things, while the more
subjective sensory modality draws attention to our own embodied viewpoint from
which we observe and interact with things in all their sensual immediacy, including the
activities of the science classroom, along with the embodied reading, handling,
pointing to and touching of textbook pages.
platelets are made in red bone marrow. In this example, the modalised knowledge claim
is attributed to the I of the particular person who asserts the claim in a particular
interpersonal context such as the author of a textbook addressing the reader. A claim
of this kind is interpersonalised. Explicit interpersonal sourcing of this kind in which
the writer directly addresses the reader is rare in school science textbooks. Instead,
claims to scientific expertise and authority are abstracted away from the interpersonal
contexts of authors and readers and relocated in third-person contexts in which the
scientific meanings are attributed to third-person authorities and experts – both indi-
vidual and institutional. Here are some examples: (1) King claims that platelets are
made in red bone marrow; (2) Research findings show that platelets are made in red bone
marrow. In such cases, interpersonal meanings and contexts are transferred to
ideational contexts. The negotiation of these meanings therefore takes place
between ideationalised orders of discourse and their intertexts. These orders of
discourse and their (third-person) spokespersons, rather than particular persons on
particular occasions of speaking and writing, are construed as the sources of
scientific knowledge and the authority claims associated with this knowledge.
Prototypically, ideational negotiation makes use of the lexicogrammatical resources
of projection (see Inset 9: Projection, p. 101). Projection enables others’ texts to be
negotiated as the projected texts of third-person sources of acts of saying and
thinking (Halliday, 1994 [1985]: 250-73; Halliday, Matthiessen, 2004: 441-82;
Thibault, 1991b: 73-89; Thibault 1999a: 574-80). This strategy allows for varying
degrees of incorporation of the other’s discourse – cf. Bakhtin’s alien word – into
the context of the symbolic source (Sayer or Senser) of the text that is projected
from this source. Figure 2.12 illustrates this with respect to the clause, Platelets are
probably made in red bone marrow, in Australian Biology (p. 61 in Figure 2.6).
2.8.1 Linguistic resources
Both the Australian and Italian texts examined here adduce very little explicit sourc-
ing of ideational negotiation between projecting and projected discourse contexts.
Instead, there is a foregrounded predominance of unsourced semantic variation.
That is, the authoritative discourse of science – i.e. the discourse of the other in the
present case with respect to the apprentice readers – is reformulated or reconstrued
in the ideational context of the projecting context of the textbook writer. The source
of these meanings thus becomes increasingly assimilated into the projecting context
to the extent that it is nothing more than unsourced semantic variation which is not
explicitly differentiated from the projecting context of the textbook writer. This is
the semantic region where heteroglossically distinct discourse voices are ‘translated’
or reformulated as semantic variation rather than as heteroglossic diversity or differ-
ence between particular value stances held by individuals and social groups (Thibault,
1991b: 99-103; 1997a: 269-72; Fuller, 1998). Consequently, there is little or no use of
the lexicogrammatical resources of projecting, assigning and attributing. In other
words, there are few explicit Sayers, Sensers, Assigners and Attributors which
specify the source of the text of the other. These functional semantic labels derive
from Halliday (1994 [1985], see also Halliday, Matthiessen, 2004: Chap. 5). In mental
process clauses, the Senser is the participant who thinks, perceives or feels, e.g. John
believed Mary was coming (Halliday, 1994 [1985]: 114); in verbal process clauses, the
Sayer is the one who emits a verbal signal, as in John said Mary was coming. Assigners
and Attributors are additional functions which pertain to identifying and attributive
processes, respectively. Thus, in the identifying clause, The meeting elected him
President, the meeting is the semantic source, the Assigner, which is construed as
being responsible for assigning the role of President to him. Similarly, in the clause
her words made John angry, her words, as Attributor, is construed as the semantic
source responsible for attributing the type-quality angry to John. In all four cases, we
have different ideational grammatical resources for the sourcing of the meaning of
another in relation to the speaker or writer. Instead of explicitly sourcing the text of
the other, unsourced relational predications of identification and attribution play a
prominent role in the ways in which one grammatical unit in the relationship elab-
orates or further specifies the other. Frequently, this may function to construe
relations of identity or attribution – seen as instantiations of the more schematic
category elaboration (Thibault, 1997a: 314) – between semantic registers deriving
from different social domains such as the everyday and the technical-scientific. In the
discourse of school science, definitions and exemplifications, as realised by relational
(attributive and identifying) processes, frequently have this function. The examples
given in Figure 2.13, taken from the texts under discussion, are typical of this general
pattern. The examples instantiate the schematic pattern [CARRIER^ ELABORATION/
EXEMPLIFICATION^ ATTRIBUTE] (see 2.6.2, pp. 82-89).
[Figure 2.12: King thinks / says platelets are made in red bone marrow; I think that platelets are made in red bone marrow; It’s probable that platelets are made in red bone marrow; Platelets are probably made in red bone marrow]

The negotiation between the claims of authority and expertise, on the one
hand, and notions of accessibility and apprenticeship, on the other, does, however,
implicate both interpersonal and ideational resources. Whilst ideational resources
construe negotiation between different discursive domains, interpersonal resources
enact addresser and addressee positions and relations. Interpersonal resources thus
ground the negotiation between addresser and addressee in terms of time, place,
modal orientation and the person deixis of writing and reading positions. In the
above instances, interpersonal negotiation is mainly evidenced in specific lexical
choices which realise particular axiological orientations. In example (2), the Epithet
strong implies an evaluative judgement; in (4) not enough indicates a negative judge-
ment as to the quantity of red corpuscles; in (5) the morphemic suffix -ish indi-
cates a judgement regarding the degree to which the type-quality blue may appro-
priately qualify the noun organ; in (6) some entails a judgement concerning quantity;
and in (7) the diminutive giallino [pale yellow] specifies that plasma is evaluated as
only relatively weakly conforming to the type-quality giallo [yellow]. In the
examples given in Figure 2.14, relating to clauses that follow on from those
presented in Figure 2.13, interpersonal negotiation often, though not always,
comes to the fore at a number of points where the everyday world of the reader
is the thematic focus. In such moments, the interactive relationship between writer
and reader is more focal. In example (10), the writer is not explicitly sourced in
the lexicogrammar of the clause, though she is implicit in the modalised
orientation (probably) which she adopts towards the proposition in this clause. In
the remaining examples (11-13), the use of person deixis (I, you) and the
orientation to metaphorical commands (it may be necessary ..., occorre provvedere)
enacts the interpersonal space in which the writer/reader relationship is negotiated.
(1) In the backboned animals, blood is confined to vessels, called arteries (leaving the heart), veins (entering the heart), and fine-walled capillaries, linking arteries and veins. (p. 60 in Figure 2.6)
(3) The platelets are small, colourless cells without nuclei. (p. 61 in Figure 2.6)
(4) ANAEMIA is the condition of a person who has not enough red corpuscles. (p. 61 in Figure 2.6)
(5) The spleen is a bluish-red oblong, flattened organ, below the stomach, ... (p. 62 in Figure 2.8)
(6) When blood is shed some of it will set or harden on the wound. This is called a clot, ... (p. 62 in Figure 2.8)
(8) I globuli rossi sono cellule senza nucleo, tondeggianti, schiacciate al centro e rialzate ai bordi. [Red blood cells are cells without a nucleus, roundish, flattened in the centre and raised at the edges.] (p. 424 in Figure 2.13)
(9) I globuli bianchi (o leucociti) sono cellule incolori, più grandi di globuli rossi e provviste di nucleo. [White blood cells (or leucocytes) are colourless cells, larger than red blood cells and provided with a nucleus.] (p. 424 in Figure 2.13)
2.8.2 Visual resources
Visual semiotic resources also possess their own means for negotiating the
relations between the ideational and interpersonal sourcing of the meanings
which are negotiated in texts. In the linguistic semiotic, this is done, as we saw
above, by projection and related lexicogrammatical systems (Halliday, 1994 [1985]:
248-69). Projection functions to recontextualise the discourse or the meaning of
some other as projected text in relation to the projecting context of the text (the
writer’s text in the present case) which is projecting it. In this way, the projecting text
adopts a metasemiotic stance on the projected text and thus comments on it or re-eval-
uates it in some way (Thibault, 1991: Chaps. 2-4; 1999a: 574-80).
Likewise, the use of colour may also function to reframe the scientific
discourse and to ground it in the interpersonal perspective of the writer-reader
relationship. In so doing, it (a) enacts interpersonal negotiation between the writer
and the reader and (b) construes ideational negotiation between the scientific and
sensory coding orientations. The visual interpersonal deictic frame is thus shifted
back to the writer/reader domain, whereas the thematic content is that of the
scientific domain. The former functions to relocate the latter into its own context
and to comment on it from that context, i.e. the writer’s perspective. It is a negotiation
between different ideational-thematic orders of discourse at the same time
that this is implicated in the interpersonal negotiations between writer and reader.
The use of the rich, saturated colours that are typical of the sensory coding
orientation in the visual semiotic (Kress, Van Leeuwen, 1996: 168-71) thus serves
to circumscribe the interpersonal domain of the visual image. This is analogous
to the ways in which comment adjuncts such as zoologically, clinically, technically, and
so on, serve to delimit the domain of validity of the propositions which they hold
in their scope. They do so by relativising the proposition according to the values of
the interpersonal domain of validity in which it is articulated. For example, in the
clause: zoologically, the Bluetongue is a Tiliqua (see Worrell, 1963: 61), the would-
be speaker/writer highlights semantic differences across different discursive
(10) Platelets are probably made in red bone marrow (p. 61 in Figure 2.6)
(11) You all know that when blood is shed some of it will set or harden on the wound. This is called a clot, and you may read of this as the coagulation of blood (p. 62 in Figure 2.6)
(12) It is foolish to rush to a running tap when you cut your finger, for this washes away the fibrin, hindering clotting. Certainly, it may be necessary to bathe dirt away from a wound, but do not use much water (p. 62 in Figure 2.6)
(13) Per salvare la vita di un uomo che in seguito ad un infortunio abbia perduto una parte del suo sangue, occorre provvedere ad una trasfusione, cioè ad immettere nei vasi sanguigni dell’infortunato il sangue di un altro [To save the life of a man who has lost part of his blood following an accident, it is necessary to provide a transfusion, that is, to introduce the blood of another into the injured person’s blood vessels] (p. 97 in Figure 2.12)
Inset 9: Projection
All languages appear to have resources for indicating some stretch of discourse as having
been spoken, written or thought by some person in a given discourse context. There are
two aspects to this relation. First, there is the context of the discourse of the person who
is quoted or reported. Secondly, there is the context in which the person who is quoting
or reporting the discourse of some other speaks or writes. The relationship between the
two contexts is called projection. The quoted or reported discourse of someone is the pro-
jected context; the projected context is projected by the projecting context of the person who
quotes or reports the other person’s words. It is also possible for the person in the projecting
context to quote or report their own words or thoughts. The principle is the same in
all cases. The distinction between the projecting and projected contexts is as follows
There are two main ways of representing the speech or thoughts of others. First, someone’s
speech or thought may be grammatically construed as it is supposed to have been
actually said or thought in some context. Such is the case with the direct quoting of
someone’s speech, writing or thought. In such cases, the discourse of the other is
presented from the point of view of the projected context. Secondly, the other’s speech,
act of writing or thinking may be construed from the viewpoint of the person in the
present situation of utterance who interprets the discourse of the other. That is, from
the point of view of the projecting context. This is what is known as the indirect reporting
of another’s discourse. In both cases, the speaker or writer of the utterance attributes
some utterance, thought, perception or feeling to a sentient being who is construed as
being the source of the utterance, thought, perception or feeling in question. With ref-
erence to English, the basic possibilities are as follows:
           Direct                         Indirect
Speech     He said, "I am coming"         He said that he was coming
Thought    He thought, "I am coming"      He thought that he was coming

This illustrates the basic distinction between the two ways of projecting speech and thought.
In direct speech and thought, tense and person deixis index the situation of the person
responsible for saying the utterance or thinking the thought in the projected (quoted)
clause. In indirect speech and thought, tense and person deixis shift to the discourse situation
of the speaker or writer who says or writes the utterance, i.e., towards the projecting
context. This shows that two points of view and the relationship between them are
involved in this type of linguistic structure. In direct speech and thought, the quoted
utterance or thought (the projected clause) is presented from the standpoint of the
person who says or thinks it. The relationship between the projecting (quoting) and project-
ed (quoted) clauses is one in which there is a congruence or non-discrepancy between the
two perspectives. This is the case when, for example, the present knowledge or point of
view from the speaker/writer’s perspective is in accord with the perspective of the quoted
utterance or thought. In the case of indirect speech and thought, on the other hand,
the semantic construal of the projected utterance or thought from the point of view of
the speaker or writer indexes a non-congruence of the two perspectives. In other words, a
contrast or discrepancy between the knowledge or point of view of the speaker or writer
in the present time of utterance and the projected speech or thought is suggested.
scopal because its scope extends over a certain region or subregion of the
topological space of the visual field, in the process interacting with other features
in the visual field and thereby creating a particular interpersonal (affective,
evaluative, etc.) orientation to them. The use of colour therefore constructs a
metasemiotic frame of reference which organises interpersonal orientation. It thus
anchors or grounds the experiential meaning of the visual text by defining the
modalised intersubjective space in which the image is to be negotiated. However,
this entails more than simply an interpersonal negotiation between writer and reader.
It also serves to locate this interpersonal negotiation in a still wider intertextual field
of heteroglossic relations among texts and reading and writing positions.
2.9 Conclusion
Science textbooks are a type of knowledge object (Bereiter, 1997: 298) which can be
used in a variety of ways (enabling and constraining) which go beyond the
situations in which they were produced. They are not simply objects which students
and teachers passively decode. They, too, are participants in the dynamics of the
processes (both material and semiotic) which link human agents, their tools and
artifacts, and semiotically mediated activities in still more extended networks of
meaning making across diverse temporal and spatial scales. As knowledge objects,
science textbooks are hybrids, to borrow Bruno Latour’s notion: they are
simultaneously artifacts and activities, both material and semiotic, local and global,
natural and cultural. It is in this way that scientific theories and explanations, far
from being universal laws, are kept alive by the networks of classroom activities,
others’ texts, measuring instruments, perceptual devices, laboratory experiments
and so on, with which the textbook is linked in, and through, all the work which is
required to keep the whole network going (Latour, 1993 [1991]: 121).
Moreover, the activity of reading, the constructing of multiple pathways
between the verbal and visual resources that are codeployed and, hence, the con-
strual of joint verbal-visual actional systems of thematic meanings and their
associated axiological orientations, allows for the emergence of scientific mean-
ings, knowledge and associated affective investments. The recognition of visual
patterns, on the one hand, and the deployment of multimodal meaning-making
practices, on the other, are not, in the final analysis, constitutively separate activities.
The one does not follow on from, or cause, the other. Rather, they are all
simultaneously cross-coupled in the time-bound processes of building up scientific
meaning and knowledge in, and through, the social practices associated with the use
of textbooks. An understanding of these links and the role of the multimodal
science textbook page in these can help us to develop a truly multimodal literacy
which is adequate to the world in which we and our children live and make meaning.
Chapter 3
How can we go about describing a website? Are web pages in fact pages? Or is the idea
of the page just a metaphor? How do we account for the web page in terms of
resources and the kinds of meanings that people make? And what about the web page
as genre? Can we describe web pages in terms of different genres? How can we
describe a particular pathway through a website and how can we relate a particular path-
way to the virtual resources of a specific website as a whole? And what transcription
techniques can we develop? How do websites work? How does this connect back to
the notion of transcription? To what extent does the transcription speak for itself as a
description of the web page?
As we can see from these questions, there are many possible starting points in
the analysis of web pages. Our chosen approach is to make some general observations
about the nature of websites, which we illustrate with some examples. We then discuss
a number of websites in detail applying the technique of multimodal transcription and
text analysis to them. For reasons of space, we have restricted our discussion to the first
of the Information website types categorised, on the following page, in Table 3.1,
namely edutainment websites for children. Naturally, children’s edutainment websites
include textual objects implicitly associated with, or, indeed, explicitly linked to other
types of Information websites (e.g. museums, special interest sites). Internet is a technol-
ogy that is especially good at merging disparate entities, at crossing and realigning the
boundaries between diverse discourse genres, social activities and domains, and, as we
shall see below (e.g. 3.7, pp. 130-136), at constantly relocating and recontextualising
agents and objects that we would expect to find in one setting or category into others.
In this respect, we are concerned with detailed analyses of specific web pages. Our
analyses take into account the nature of the different semiotic resources that are used
and the way they are used to create a particular website. This includes the reading path-
ways (see 3.6, 3.7, 3.8, pp. 126-146) that can be taken through a particular website as
the user creates and negotiates the meanings afforded by that website along a particular
meaning-making trajectory (see Inset 19: Negotiation, pp. 245-247; Inset 10: The
trajectory, p. 116). Our analysis will look in particular at home pages and the way
they are linked to subsequent pages in the website to which they relate. Other
starting points could include discussions of the cultural and social functions of
websites. Our approach is certainly compatible with such approaches, but our con-
cern is with how textual resources function in distinctive ways to create web pages
and, in the process, make it possible to distinguish a web page from a printed page
and from other kinds of multimodal texts such as films. Fundamentally, we are con-
cerned with the way a user interacts with a web page, and we base our analysis on a
detailed account of the textual resources used in particular websites, rather than
relying on the subjective accounts of particular users.
What is distinctive about the web page as a space for meaning making? A web
page is a visual-spatial unit displayed on a computer screen. It makes use of written
resources such as language and the resources of depiction, including the spatial
juxtaposition of objects. In this respect, a web page is similar to a printed page (see
Chapters 1 and 2). However, the web page goes beyond the printed page because
of its hypertextual nature and the action potential that this affords (see 3.9, pp. 146-155).
We need to see the web page with a dual focus: as a visual-spatial unit and as
action potential. The web page in this sense is a hybrid: on the one hand, it shares features of the static
page; on the other hand, it also has a dynamic potential for action; the user can
act on the page and obtain responses to his or her actions. A printed page cannot be
dynamically reorganised unless you take a pair of scissors to it, so to speak. Of
course, the reading of a printed page is itself a form of activity at the same time
that the written text enables the creation of indexical ties with activities that the
reader can perform as a consequence of reading the page.
(3) Institutions and associations, whether international bodies (e.g. Unesco), government sites (e.g. relating to consular services);
(4) Personal websites: these present individual people and their work, such as musicians, film stars, academics and so on;
(6) Goods and services websites that may include shopping for a) goods: books, food and b) services: car hire, train and plane tickets, e-banking, buying a theatre ticket;
(7) Individual company sites ranging from the global (e.g. car manufacturers) to small businesses (e.g. restaurants in a particular town).

A feature of the web page is its capacity to be reorganised. It is possible, for
example, within a children’s website, to physically move objects around. Examples of
this include the games, such as adventure games, found in a website. It is also
possible to reorganise the page by making selections either by passing the mouse
over an object (known as a rollover), by clicking an object (e.g. drop-down menus) or
by double clicking so as to select a link. Examples of such reorganisation include: (1)
a change in part of the page on the screen, for example, the activation of a video
(relating to a news report, an advertisement or a demonstration of an instrument),
or a drop-down menu giving a list of cities and their temperatures on a particular
day; (2) the transformation of the entire page, or at least most of it, as in the pro-
duction of a timetable; (3) access to a new page. This capacity for reorganisation is
implicit in the questions that we have listed above. We will provide answers to these
questions in what follows.
3.1 Page or screen?
between the computer user and the possibilities of the website as a whole. The web
page which is dynamically assembled and displayed on the computer screen mediates
the user’s relations with particular objects which the user can interact with. Such
objects influence the behaviour of the user, who can obtain responses from them (see 3.9,
pp. 146-155). Moreover, the web page enables a user to make links with other
objects, other web pages and other websites. In this third perspective, the screen is:
(1) a receiver of information from remote sources; (2) the means of sending information
to remote targets; (3) a field of on-screen possibilities (images, objects and so on) that
the user interacts with and which mediate the previously mentioned points (1) and
(2).
The computer user therefore enters into an active relationship with the
virtual screen world that is created by his or her interactions with the web page.
This means that the user is able to take part in forms of virtual modelling of, and
virtual participation in, the virtual hypertextual world that is so created. The
computer user becomes a participant in processes that are dynamically assembled
on the screen at the same time that the here-now events on the screen link to other
times and places in other websites. From this third perspective, Internet is a network
which enables interactions between persons and between persons and virtual
hypertextual worlds and their participants that may be widely separated in time and
space. The apparent stability and the artifactual character of the printed page, the
written text or the book which is integrated into the activities in which it
participates, is thus dynamically transformed. The web page as it appears on the
screen can be seen as being linked to other sites, other possible pages, other objects
by lines of connectivity which make it a participant, along with the user, in a network
topology of such connections. By contrast the adjacent ordering of elements on the
printed page into visual-spatial forms of organisation that are traced onto a treated
surface, puts the emphasis on a flat, two-dimensional form of organisation.
The visual patterns displayed on the screen only become a text when they are
integrated with activities which assign meaning to them. What are these activities?
How are they related to each other? Some of the relevant activities include:
Visual scanning ^ Select specific object [^ = followed by];
Point mouse at object ^ Object responds (tells me something, changes form, lights up);
Click on object ^ Create link to thematic area, functional unit;
Click on object ^ Expand thematic area/interact with virtual object to create activity.
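The activity sequences listed above can also be written down as data. The following is a minimal sketch in Python, offered only as an illustration: the names ACTIVITIES, transcribe and responses_to are our own assumptions and do not belong to any transcription tool. Each pair records a user action and what follows it, with the caret standing for "followed by" as in the list.

```python
# Each pair is (user action, what follows it). The wording is taken from the
# list above; the variable and function names are hypothetical.
ACTIVITIES = [
    ("Visual scanning", "Select specific object"),
    ("Point mouse at object", "Object responds"),
    ("Click on object", "Create link to thematic area, functional unit"),
    ("Click on object",
     "Expand thematic area/interact with virtual object to create activity"),
]


def transcribe(pairs):
    """Render each pair in the 'action ^ outcome' notation used in the text."""
    return [f"{action} ^ {outcome}" for action, outcome in pairs]


def responses_to(action, pairs):
    """Collect every outcome that can follow a given user action."""
    return [outcome for act, outcome in pairs if act == action]


if __name__ == "__main__":
    for line in transcribe(ACTIVITIES):
        print(line)
    # The same action (a click) can open more than one pathway.
    print(responses_to("Click on object", ACTIVITIES))
```

Written in this way, the notation makes it easy to see that a single action, such as a click, does not determine a single outcome.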
The user enters into an active relationship with the screen world and its
objects. However, the virtual world of the text that is projected onto the screen for
the user to access is an abstraction away from natural objects as we perceive them
no hits are recorded (in the case of a search engine) or where a particular website no
longer exists. Even so this is obviously a Result. This is the basic pattern which exists
also within websites, though less guesswork is likely to be involved.
We need to recall that Internet users typically find home pages using a search
engine so that the home page is likely to contain a keyword which allows the site in
question to be easily retrieved, e.g. the word recipes. A search of this type will return
vast numbers of sites relating to recipes, some of which are commercially inspired and
others not. The latter category includes sites set up by individuals and by groups of
people forming associations. These sites vary in their overall characteristics, but most
are concerned with providing a recipe and little else; in other words, information
restricted to the special-interest category (see Table 3.1). Commercial websites, on the
other hand, are typically emanations of food companies and food magazines whose
major concern is with marketing and establishing ways of online selling. Typically,
these websites will have textual features including surveys and links concerned with the
sale of kitchen equipment, cookbooks and subscriptions to magazines. A comparison
of these two types of website is instructive as regards the way virtual communities are
built up. A comparative multimodal transcription can help pin down these differences.
Although Internet has only made a big social impact in recent years, it has itself
undergone major changes. Specifically, there is a growth of websites in which the
page is produced dynamically, i.e. generated by a computer program. For example,
train and plane timetables nowadays rarely take the form of pre-existing pages.
Instead the user has to build his or her own timetable from a series of parameters
including place and time of departure, place and time of arrival, return journey, pre-
ferred cost category, number of passengers and special requirements (e.g. seating
arrangements, food requirements and special assistance due to mobility problems).
Consequently, the nature of the web page has itself changed from one dominated by
a series of links between pregiven pages to one in which the page is dynamically
created on the screen through parameter selection, providing the illusion that the
page in question has been transformed into something else. When the user attempts
to return to the previous page, a notice will frequently appear stating that the page is
no longer available. This is because the web page does not pre-exist; rather, it is
assembled from a database by a computer program. In other words, the ongoing
transformation taking place within Internet is that the metaphor of browsing and
navigation that characterised the first stages of Internet and led to definitions in
terms of links between pages produced by pre-existing off-the-shelf, take-it-or-leave-it
items is giving way to authorship. Authorship is located within the space of the same
page, which transforms itself through parameter selection on the part of the user,
one reason why home pages have become increasingly significant in websites and
why many complex websites consist of a series of home pages. Rather than moving
from one virtual site to another, the web page is a modifiable site that changes its state
through the modification of its parameters by the computer user.
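The point that a timetable page does not pre-exist but is assembled from a database through parameter selection can be illustrated with a short sketch in Python. It is a simplified illustration under stated assumptions, not a description of how any real timetable site works: the records in DEPARTURES are invented, and the names Query and build_page are hypothetical.

```python
from dataclasses import dataclass

# Invented records standing in for a timetable database (illustrative only).
DEPARTURES = [
    {"origin": "Pavia", "destination": "Milan", "time": "08:15", "fare": "standard"},
    {"origin": "Pavia", "destination": "Milan", "time": "09:40", "fare": "first"},
    {"origin": "Pavia", "destination": "Genoa", "time": "10:05", "fare": "standard"},
]


@dataclass(frozen=True)
class Query:
    """The parameters selected by the user (a hypothetical, reduced set)."""
    origin: str
    destination: str
    fare: str


def build_page(query: Query) -> str:
    """Assemble the text of a page from the records that match the query.

    No page is stored anywhere: the same function yields a different page
    for each different selection of parameters.
    """
    rows = [
        d for d in DEPARTURES
        if d["origin"] == query.origin
        and d["destination"] == query.destination
        and d["fare"] == query.fare
    ]
    lines = [f"Timetable: {query.origin} to {query.destination} ({query.fare})"]
    lines += [f"  departs {d['time']}" for d in rows] or ["  no services found"]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_page(Query("Pavia", "Milan", "standard")))
    print(build_page(Query("Pavia", "Rome", "standard")))
```

Because the page exists only as the result of a query, there is nothing for the browser to return to once the parameters are gone, which is consistent with the page no longer available notice mentioned above.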
Decoupling of material support and information on the computer screen
The printed page qua material artifact is a synoptic entity. The visual tracings on a
treated surface constitute a frozen array of visual invariants which, when integrated
with the activities of visual scanning and interpretation of the reader/viewer, can
be understood as meaningful signs. In part, the illusion of permanency of the
written page and of the written text that it materially supports is due to the way in
which the relationship between the material surface (paper, plastic and so on) and
the visual tracings that are made on this surface by means of some tool or machine
(printing, engraving, writing, drawing) is fixed for as long as the surface itself or the
tracings on it do not materially decay, fade away, become erased and so on.
The material support and the tracings on it are hard-coupled for as long as the
material relationship between them endures. On the other hand, the visual images,
the linguistic texts, the audio files and so on that are dynamically assembled on the
computer screen are not hard-coupled to their material support in the same way.
Instead, the diverse sources of stimulus information (visual, auditory, kinesic and so
on) are reduced to a single abstract form, the byte, consisting of ones and zeroes,
which cannot be picked up by our perceptual systems.
The digital processing of the information that is stored in this form by
means of computer software through processes of selecting and editing means that
the data which is stored in these bytes can be dynamically assembled in newly con-
tingent ways according to the choices made by the computer user. This happens
before the data that is so elaborated is projected onto the computer screen or is
saved as a permanent record on a CD, DVD or hard disk. The digital technology
of the multimedia computer allows the computer itself to carry out part of the
process of elaboration of the data so that the material object-text which we see and
hear through the screen and the audio system of the computer is linked in real-time
to the processes of elaboration (selecting, editing, assembling) of the data (the
bytes) that are carried out by computer programs rather than by the computer user.
The web page is not the product of the hard coupling of a material support (e.g. a
treated surface such as a sheet of paper) and data (e.g. visual tracings made by a
drawing implement such as pen, crayon, chalk or by mechanical means such as
printing, engraving, photographic reproduction). Instead, the web page is charac-
terised by the soft coupling of material support (screen, CD, DVD, and so on) and
data (digital bytes).
This has two important consequences. First, some of the processes of data
elaboration are allocated to the computer's internal processing. Secondly, this gives
the computer user the possibility of creating a dynamic and flexible web page,
rather than interacting with a pregiven one (see 3.1, pp. 105-108). The page can be
modified and updated through the actions of the computer user when he or she
interacts with the texts and objects that are displayed on the computer screen. In
semiotic terms, we can say that the relationship between the data and its material sup-
port is unhinged. Data is coded in digital form as bytes and is dynamically assembled
into newly contingent patterns by programs internal to the computer. These
processes occur prior to the processes which subsequently convert this data into a
form which we can perceive on the computer screen.
These observations pertain to what linguists and semioticians refer to as the
expression stratum of language and other semiotic systems (see Inset 18:
Stratification, pp. 236-237). The expression stratum is the semiotically organised
material means of embodiment of a semiotic system; it is the dimension of
semiosis that we apprehend with our perceptual systems. Consider, for example,
the relationship between the tracing activity of the writer and the visual-graphic
traces that are produced by this activity on a surface and picked up by the reader
as visual stimulus information about that tracing activity. The two phenomena,
tracing activity and visible traces, are relatively hard-coupled to each other. That
is, the stimulus information provides the reader with information about environ-
mental objects and events such as the marks on the page and the activities involved
in putting them there. The sequence of activities involved in the elaboration of the
expression stratum of the written or printed page can be schematised as follows:
The radical difference lies in the way in which the digital processing of abstract
combinations of bytes by the computer is separated from the material means of its
support, i.e. as stimulus information displayed on the screen that can be picked up
by our perceptual systems and interpreted as signs of objects, events and so on.
Many commentators on the digital age talk about the information that is
elaborated, processed, stored, transferred, exchanged and so on, by means of the
computer. Bytes and combinations of bytes are information. Information, as
distinct from meaning, is defined in statistical or probabilistic terms without reference
to the categories of the observer/interpreter. Moreover, the information that is
coded in combinations of bytes entails a fixed relationship between the combina-
tions of bytes and the information that is contained in them; it is information that
is read by a machine and interpreted according to the fixed relations established by
a computer program. It is correct to say that this information is coded in combi-
nations of bytes. However, the abstract coding of information in the digital form
of bytes is not in itself meaningful to a human interpreter.
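The point can be illustrated with a minimal Python sketch. The four byte values below are arbitrary and chosen purely for illustration, and the three reading conventions are our own assumptions rather than anything prescribed by the discussion above. The same bytes yield a string of characters, a number or a row of grey levels depending solely on the fixed relation that the program applies.

```python
import struct

# Four arbitrary byte values, chosen only for illustration.
raw = bytes([72, 105, 33, 0])

# Fixed relation 1: read each byte as one Latin-1 character.
as_text = raw.decode("latin-1")

# Fixed relation 2: read all four bytes as one little-endian
# unsigned 32-bit integer.
(as_int,) = struct.unpack("<I", raw)

# Fixed relation 3: read each byte as an 8-bit grey level
# (0 = black, 255 = white), i.e. a row of four pixels.
as_pixels = list(raw)

print(repr(as_text), as_int, as_pixels)
```

None of the three readings is privileged by the bytes themselves; the program's convention decides which one is produced, and a human interpreter is still needed to treat the result as a meaningful sign.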
The patterns of sound and light that the user picks up and interacts with
through the multimedia resources of the computer are, on the other hand, potentially
meaningful to a human interpreter. In this case, the meanings that the user creates by
interacting with this information are not coded in the patterns of light and sound
that are perceived; rather, the user interprets them as meaningful signs by integrating
them into the semiotic categories of a system of interpretance. Human semiotic
systems are not codes. A code, e.g. Morse, is based on a fixed relationship between
the information which is coded and the means of its coding. Language, gesture,
depiction and other human systems of meaning making are not codes in this sense;
they do not exhibit fixed relationships between information and the means by
which this information is coded. The idea that language, fashion and so on, are codes
was popular in early theories of semiotics in the 1960s and 1970s, but such models
are neither realistic nor informative for theorizing and describing semiotic systems.
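What a fixed relationship amounts to can be sketched in a few lines of Python. The table below is a small, deliberately partial subset of International Morse code, and the function names are our own hypothetical choices. Because each letter maps to exactly one signal and each signal back to exactly one letter, coding and decoding are mechanical look-ups that require no interpretation.

```python
# Partial Morse table (six letters only), for illustration.
MORSE = {"S": "...", "O": "---", "E": ".", "T": "-", "A": ".-", "N": "-."}

# The inverse table exists because the relation is one-to-one.
DECODE = {signal: letter for letter, signal in MORSE.items()}


def encode(word):
    """Replace each letter with its fixed signal, separated by spaces."""
    return " ".join(MORSE[letter] for letter in word.upper())


def decode(signals):
    """Replace each space-separated signal with its fixed letter."""
    return "".join(DECODE[signal] for signal in signals.split())


print(encode("sos"))
print(decode(encode("notes")))
```

No comparable pair of look-up tables can be written for language, gesture or depiction, which is the sense in which these systems are not codes.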
The information coded in combinations of bytes has to be translated into a
form that is accessible to a human user and the system of interpretation that he or
she uses. The information encoded in combinations of bytes has to be reorganised
by the computer's own operations as a new type of information on a higher-scalar
level that the human user can access through his or her perceptual systems. The
computer programs that read this information and transform it into a form which
is accessible to human interpretation are, of course, designed by humans. However,
the computer programs which have these functions perform the task of comput-
ing this information (bytes) into a qualitatively different form on the scalar level of
the human interpreter with his or her categories, interests and systems of inter-
pretation. In this sense, the computer programs that perform these tasks constitute
an intermediate level of organisation in a human-computer social system of relations.
The semiotic potentiality of this hybrid human-machine system can be modelled
as a hierarchical system of relations on three levels, as follows:
3.3 The relationship between web page, website, web users and web genres
In any approach to the study of websites and their component parts, such as web pages
and the multimodal clusters of objects and images found on web pages, it is
important to understand the overriding significance of genre as an organisational
principle. Genres regulate and mediate the ways we interact with each other in
society, and websites and web pages are no exception. The website as a whole has
generic features at the same time that it comprises many more specific genres. For
example, the home page is a functional component within the larger-scale structure of
the website as a whole. The home page also has the characteristics of a superordinate
genre in its own right at the same time that many of its component parts are
themselves distinctive mini-genres: linguistic, visual, musical and so on.
In many linguistic accounts, genre refers to the most global level of
organisation of a given text-structure type or activity-structure type (Hasan, 1978;
Martin, 1985a; Ventola, 1987). A genre is defined in terms of a typical beginning-
middle-end structure, as a series or configuration of stages through which texts
belonging to the given genre typically progress. Each stage is a functioning
component both in relation to the larger whole to which it belongs and in relation
to the other component parts of that whole. Some examples include:
RECOUNT: Orientation ^ Eventₙ ^ (Coda)
ARGUMENT: Thesis ^ Argumentₙ ^ (Recommendation)
[N.B. The following notational conventions apply: ^ = followed
by; a subscript n (ₙ) indicates the element is recursive, i.e. can occur
more than once; round brackets indicate the item so enclosed is
optional; unbracketed items are obligatory]
A genre in this definition is a sequence of optional and obligatory elements
through which texts progress from their beginning to their end in order to fulfil some
social or communicative purpose. Recounts have the purpose of telling a chronologi-
cal sequence of events that someone experienced. A Recount typically begins with an
Orientation. This initial stage indicates who took part in the event sequence, when,
where, and so on. The Orientation is followed by a chronological sequence of actions
and/or events that are usually told in the past tense. These two stages are obligatory.
A third stage, which is optional, is the Coda. The Coda provides some retrospective
evaluation of the events recounted, and therefore provides them with some wider sig-
nificance or value. Arguments seek to persuade or convince readers or listeners to
adopt a certain point of view and perhaps to act on it or be prepared to act on it. An
Argument begins with a Thesis, which states the position to be defended. The Thesis
is followed by a series of Arguments providing information and evidence in support
of the Thesis. Arguments often, though not always, conclude with a Recommendation
to act or behave in a certain way on the basis of the arguments presented.
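The notational conventions given above can be made concrete with a minimal Python sketch. The class and function names are hypothetical, and the checker is only one possible reading of the notation: it assumes that adjacent stages in a schema have different names, so that a simple left-to-right match is sufficient.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    optional: bool = False   # round brackets in the notation
    recursive: bool = False  # subscript n in the notation


RECOUNT = [Stage("Orientation"), Stage("Event", recursive=True),
           Stage("Coda", optional=True)]
ARGUMENT = [Stage("Thesis"), Stage("Argument", recursive=True),
            Stage("Recommendation", optional=True)]


def conforms(text_stages, schema):
    """Return True if the labelled stages of a text follow the schema."""
    position = 0
    for stage in schema:
        matched = 0
        while (position < len(text_stages)
               and text_stages[position] == stage.name):
            matched += 1
            position += 1
            if not stage.recursive:
                break  # a non-recursive stage may occur only once
        if matched == 0 and not stage.optional:
            return False  # an obligatory stage is missing
    # Every stage of the text must have been accounted for.
    return position == len(text_stages)


print(conforms(["Orientation", "Event", "Event", "Coda"], RECOUNT))
print(conforms(["Orientation", "Coda"], RECOUNT))
```

The first call describes a Recount with two Events and a Coda and is accepted; the second lacks the obligatory Event stage and is rejected.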
In what way can the notion of genre be applied to the home page? Does the
notion of genre described here readily apply to the website? In this section, we
shall try to give some preliminary answers to these questions.
One significant difference with respect to linguistic genres such as those
mentioned above is that the notion of sequential organisation does not apply in the
same way. The web page is, in the first instance, a visual-spatial unit which is dis-
played on the computer screen. Consequently, it is important that we attempt to
understand and describe its generic features with this in mind. Given that there is
considerable variation in the way web pages are organised, how can we go about
identifying generic features which are common to the web page in general? What
are the component parts of the web page? How do they relate to each other? What
semiotic and material resources are typically codeployed? And how?
In Figure 3.1, we have shown, in a highly schematised way, the layout of a
typical home page. The diagram suggests some of the typical elements in this type of
web page as well as a typical combination of these elements. At the same time, we
want to stress that the positioning of the elements in Figure 3.1 is not intended to
suggest that these elements always occur or that they necessarily occur in this
particular combination. We are trying to describe in a highly schematic way some of
the typical components of a web page and their relations to each other. Even a cursory
survey of web pages will show that there is considerable variation in the way that
elements are arranged on the screen. Notwithstanding this diversity, we feel that it is
possible to identify a number of elements and combinations of elements that are
typical of web pages. In this respect, the home pages from two sites relating to
children, Nasa Kids and British Museum Children's COMPASS, that we discuss later in
this chapter conform to this schema to a lesser extent than many other websites but
nevertheless exhibit many of the elements featured.
The idea of genre as a staged, goal-oriented schema comprising a particular
sequencing of functional components, both optional and obligatory, is less
appropriate for talking about web page genres. This view very much puts the locus
of control in the genre schema; the writer or speaker is required to create a text
which conforms to the requirements of the schema, though this does not exclude
variability and creativity. The genre is a metadiscursive construct which functions as a
point of reference for the activity; for example, it specifies the sequential ordering
of items in a determinate beginning-middle-end type of structure and also represents
the global organisation of the particular text-type. In this view, generic structure
potential is a pedagogical device which teachers and learners in classroom writing
activities can use as a model or a template for controlling their own writing activity.
It is a device which enables both teacher and learner to look at their textual
processes and make them accessible to conscious awareness and control (Martin,
1985b; Thibault, 1989). In this view, genre is a metalevel structure which enables the
[Figure 3.1: schematic layout of a typical home page. The text of each panel follows.]
Left panel. This often gives access to the latest information in the form of:
1. Search engines
2. A tabulated index creating links to pages
Top Bar. This is typically a menu which is thematically related to the site name and
which provides links to further pages, including other home pages relating to subsites.
Top centre-right panel. This forms a grid on thematic grounds with verbal and visual
clusters. Typically, the clusters will form a repeating pattern arranged as a grid-like
structure or as a vertical list with links to other pages. The clusters form a
supercluster identified through a heading at the top. The clusters in the superclusters
are hyponyms to the heading at the same time that they are cohyponyms of each other.
A cluster, like a supercluster, is characterised by a semantic homogeneity of the
component items. Sometimes this supercluster will contain a search engine, typically
in the bottom right-hand corner.
Bottom Bars. This panel usually contains a combination of clickable and non-clickable
information. This area typically displays the website's small print such as FAQs, privacy
statements, legal disclaimers and related notices, copyright, troubleshooting advice, site
map information, webmaster and 'contact us' information. It can also contain menu bars
that are alternative or additional to the Top Bar.
[Inset 10]
• The term trajectory is used in this book to refer in particular to the meaning-making
pathways that are created when users of websites create links from one
web page to another, from one website to another, and so on, as they navigate or
author their way through a website or from one website to another. A meaning-making
trajectory in this sense refers to the progressive integration over time of the
semiotic resources that are encountered as the website user progresses from one
linked object, one text, one web page, one website to another. A trajectory may
last mere seconds or minutes or it may occur over much longer periods of time,
as well as being picked up and resumed across separate occasions.
• From the analytical perspective, the notion of trajectory is used to reconstruct
users' pathways through websites and to investigate the organisational principles
of such pathways as a form of multimodal text. In this sense, a trajectory
is also a textual record or a trace of the progressive integration over time of the
meaning-making resources that the website user encounters. It is the entextualisation
of the web user's meaning-making activity. As such, it displays properties
of continuity and coherence qua meaning-making trajectory. The
examples given in 3.6 and 3.7 (see pp. 126-136) indicate that transcription
of trajectories will include descriptions of local resource configurations
such as clusters (see Inset 5, p. 31) and phases (see Inset 7, p. 47).
• The multimodal analysis and transcription of such trajectories can reveal the
ways in which the trajectory integrates diverse semiotic resources to itself as it
develops and unfolds in time. Possible trajectories are afforded by the
resources, both technological and semiotic, of websites. By the same token,
the recording and analysis of trajectories will provide insights into the ways in
which users experience websites and their possible meanings. It will also be
able to show the extent to which trajectories have generic and individual
characteristics in their semiotic make-up.
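One way of recording a trajectory as a textual trace is sketched below in Python. This is not the transcription method developed in this book; the field names, the timings and the sample session are all hypothetical and serve only to show that a trajectory can be stored as a time-ordered series of steps from which the sequence of pages and the overall duration can be recovered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    seconds: float  # time since the start of the session (assumed unit)
    page: str       # page on which the step takes place
    cluster: str    # cluster or object involved
    action: str     # e.g. "rollover", "click", "view"


def pages_in_order(trajectory):
    """List the pages visited, in time order, merging consecutive repeats."""
    pages = []
    for step in sorted(trajectory, key=lambda s: s.seconds):
        if not pages or pages[-1] != step.page:
            pages.append(step.page)
    return pages


def duration(trajectory):
    """Seconds elapsed between the first and the last recorded step."""
    times = [step.seconds for step in trajectory]
    return max(times) - min(times) if times else 0.0


# A hypothetical session: invented timings and labels.
session = [
    Step(0.0, "home", "Saturn", "rollover"),
    Step(2.5, "home", "Saturn", "click"),
    Step(3.0, "Space & Beyond", "page", "view"),
    Step(40.0, "home", "Search engine", "click"),
]

print(pages_in_order(session), duration(session))
```

A record of this kind captures only the order and timing of steps; the description of the clusters and phases encountered at each step would still have to be added by the analyst.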
In this view, the genre schema is not prior to or a cause of the emergent system
of relations or of any of its components. The genre schema is not a prior cognitive
representation in the head which computes global order for texts. Instead, the genre
schema is a semiotic tool, and therefore just one of the components which interacts
with all the other components that are involved in a distributed network of activities
in time which give rise to texts. In the case of relatively stable and fixed genre
schemas such as those mentioned above, the generic structure potential that is invoked
in the writing activity acts as a locus of control over the construction of the required
text-type. Genre is a technology with which the writer interacts in the process of
creating a text and which exerts its own agency on that process. For example, the
meaning-making trajectory of a given instance of a genre such as Narrative or
Argument must pass through a certain number of stages in a determinate sequence
and it must reach some kind of semantic closure.
The genre schema thus conceived is a stable attractor space which imposes its
own constraints on both the text-producing activity and the outcome of that
activity. In a particular context, activity is assembled in response to the various con-
straints imposed by the subsystems and their components that come together and
constitute that context. A stable genre schema is robust to local change and insta-
bility; stable global configurations of textual elements in particular instances con-
form to varying degrees to particular genre schemas. Web pages and websites not
only give rise to new genres, new combinations of semiotic modalities, new inte-
grations of meaning-making activity with technologies; they also signal a loss of sta-
bility of many of the precursor genres and forms and the relations among these that
we still find in websites. Hypertext and its conventions can be seen as newly emer-
gent solutions to this instability in the system. There are no a priori hypertextual
schemas or templates which provide solutions in advance. Instead, new solutions,
and therefore new genres and new relations between old genres, are created through
the navigation of this space.
When we encounter the home page of a website, we have before us a much
more open-ended set of possibilities as compared to the stable generic forms men-
tioned above. Whereas these forms and their integration to writing activities require
the writing activity to be organised around stable global solutions in the form of
staged, goal-directed sequences of schematic structure elements that exert principles
of strong classification and strong framing (Bernstein, 1990 [1981]) over the activity, the
web page shifts the locus of control back to the computer user. The emphasis is to
a much greater extent on variability, flexibility, multitasking, the fluidity of semiotic
resources and context-sensitive local effects arising, for example, from rolling the
mouse over or clicking one object on the page rather than some other. There is no
single, determinate starting point or sequential organisation. There is no single or
privileged causal or other factor which initiates or controls the activity. Rather, there
is a redistribution of component subsystems and their related resources such that
the locus of control shifts away from a stable genre schema to the computer user as
the maker and improviser of solutions. This does not mean that hypertext and its
associated activities are unstructured. What hypertext brings to the fore (some
might say celebrates) is that there are no a priori structures that cause or guide
meaning-making activity from start to finish. Instead, there is a multiple, parallel,
open-ended, backlooping interplay of texts, genres, semiotic modalities,
technologies and the user's perceptions and actions. It is the interaction among all
these factors that gives rise to stable solutions in time.
The writing of traditional written genres is also a form of distributed activity
in time, although, as we saw above, the locus of control is more strongly focused on
the genre schema as a stable attractor, such that regularity and purposefulness are
emphasised. Hypertext turns this emphasis on its head. In the real-time navigating of
a trajectory through a website, there is an interaction of all the factors mentioned
above; a definable meaning-making pathway emerges and takes shape. Hypertext
brings to the fore these processes in often highly self-conscious ways: local variability,
fluidity and context specificity are emphasised over global order and stability.
Many commentators on hypertext have drawn attention to the features we have
mentioned here. We are not original in mentioning them again. What is absent in most
discussions, however, is any detailed account of the meaning-making process and the
codeployment of semiotic and material resources that takes place during this process.
The process is simply taken for granted and celebrated as such. And yet the premises
for a better understanding of web pages and hypertext are often ill-defined, without
any basis in the detailed analysis of hypertext and its associated activities. Questions
such as the following remain unaddressed: How are semiotic resources integrated
along a hypertext trajectory? What kinds of organisation does this trajectory embody?
How do technologies and semiotic resources co-operate in the development of
hypertext? In this chapter, we seek to give substance to these questions and to indi-
cate analytical solutions to the analysis and transcription of hypertext.
A home page is home to the other pages in a website; it provides the links which
enable users to access the other pages in the website. The home page is the gateway,
and therefore the user's point of entry, to a website and its meanings. For these
reasons, the designers of web pages place a lot of emphasis on the construction of
the semiotic space in which textual objects are displayed and arranged in relation to
each other, as well as in relation to the viewer. Home pages also often function to
evoke cultural institutions, places and so on. The design of the web page and the way
it presents itself to the viewer are important considerations which also relate to the
ways in which the viewer navigates the virtual space of the website in moving from
the home page to other connected pages within a website or across websites. The
visual, spatial, auditory and linguistic features that contribute to the design of a home
page and its meanings convey more than just information. They also contribute to its
interpersonal appeal, to the evocation of affective responses, to the indexing of social
values and to the creation of atmosphere. Colour, spatial perspective, the depiction of
natural landscapes, architectural sites, persons and so on, can all function to obtain
an interpersonal orientation to the website on the part of the viewer.
This emphasis on the interpersonal appeal of many websites is also a reflex of
the increasing commercialisation of the Internet along with the shift in emphasis from
production to consumption in post-Fordist economies and the concomitant new
emphasis on the relationship with the client that results from this. In such an
environment, the semiotic engineering of meanings and texts itself becomes a primary
economic imperative in the new information age. This also means that the World Wide
Web is currently the site of a struggle between, on the one hand, the economic and
political interests of advanced capitalism, which see in the Web a vast market to be
manipulated and exploited through the buying and selling of products online along
with the semiotic strategies of persuasion and manipulation that this necessarily entails,
and, on the other, those who use the Web as a means of creating and sustaining new
forms of community, new ways of making meanings and new identities that constitute
alternatives to the vertical hierarchies of traditional media such as television and the
patterns of consumption of goods, services and meanings demanded by the post-Fordist
economy. As suggested in Table 3.1, many websites reflect the tension between
the two tendencies in their efforts to negotiate between a concern with, on the one
hand, knowledge and learning and, on the other, the need for institutions to be
accessible to the general public in this age of mass consumption and mass entertainment.
As we shall see below, in connection with the Nasa Kids home page, the use
of contrasting bright and dark colours, the sense of opening out to new hitherto
unexplored horizons in outer space, and the grounding of the visual perspective of
the viewer in a particular physical location all function in semiotic partnership with
each other to create a sense of openness and the need to move away from the
familiar in order to explore new horizons. The semiotic design of the web page
therefore requires careful consideration of the ways in which viewers will feel wel-
comed and comfortable about the site, the institution or person it may give voice
to, not to speak of the technology of the web page for those who are first-time
users or who have had little experience in navigating their way through websites.
This last point also draws attention to the need to give website users easily
accessible (user-friendly) points of access for finding and engaging with the objects,
the texts and the links that constitute a website, so that they will come to feel at
home with its meanings and practices.
At this stage, we can ask how the Nasa Kids and British
Museum Children's COMPASS home pages organize their potential meanings, their
relations to their targeted users, the kinds of reading pathways they afford and their
potential for particular kinds of interaction between user and page.
Figure 3.2 presents the Nasa Kids home page (http://www.nasakids.com/) which
invites the viewer to enter into a virtual world that we immediately make sense of
even though it is a world relating to extraterrestrial space and space travel that is not
part of the everyday experience of most of us. Nevertheless, it is a world that we can
make sense of without feeling disoriented. We do so by making the information that
is presented by the visual scene converge into interobjectivity (Latour, 1994) on the
basis of our previous experiences of the objects presented and the relations among
them. Our previous experience and knowledge of these objects is, above all, an
intertextual one (see Inset 8, p. 55). We are familiar with, or are assumed to be familiar
with, photographs and film clips of the Apollo missions to the Moon between 1968
and 1972, close-up images of planets (e.g. the planet Saturn with its rings as seen
through the eyes of space probes such as Cassini and telescopes such as Hubble),
science fiction stories, films and comic strips of journeys into outer space as well as
artists' conceptions of proposed future manned bases on the Moon's surface, and so
on. These are all texts, verbal and visual, which we have all encountered in other
contexts: in the media, in the school science classroom, in science fiction books and
films, and so on.
On the basis of this knowledge, we are able to make the various objects and
locations that are presented converge into a coherent visual presentation of a
virtual world. The use of schematic, cartoon-like images without too much detail
and without any attempt at scientific accuracy helps here. We are given just
enough information to bring about a plausible interobjectivity; we accept the scene
as it is presented to us without feeling the need for more detail to determine the
acceptability to our senses and our intellect of the world presented. When we
view this world we do not feel disoriented even if the true scale of the
relationship between, for example, the Earth and Saturn is seriously misrepre-
sented in terms of both relative size and proximity to each other in our solar
system. The depicted scene is not concerned with presenting a scientifically true
representation of the scalar relationships between these objects. Nor is it con-
cerned with presenting these objects as we would truly see them under normal
perceptual conditions on the surface of the Earth. In other words, the visual
modality of the scene is neither scientific truth nor perceptual realism.
The cartoon-like character of the overall scene and its objects projects a fan-
tasy world in which actuality, futurology and playfulness are blended. The scene fore-
grounds the perspective of someone looking back towards the Earth from the
Moons surface. There is a clear visual intertextual tie to the many widely publicised
photographs from the Apollo missions to the Moon showing the Earth as seen from
the lunar surface or from lunar orbit. By the same token, there is also a
recontextualisation of the visual genre of the realistic photograph to the cartoon genre
The Nasa Kids home page 121
1
3 20
4
7
12
16 18
8
9
10
14 13
5
11 17
19
6
15
A. Clusters responding only to mouse click: C. Self-activating clusters unresponsive to rollover and mouse
Cluster 1: This is a Nasa Logo + Masthead cluster. The masthead click (compare positions and presence with Figure 3.3 ):
is the name of the site. This cluster consists of the
two objects, the logo and the masthead. The cluster Cluster 12: The geyser : a discontinuous cluster with three geysers, one on
anchors the meaning of the page as a whole in which the left, one central, one on the right, alternately spouting
the visually prominent colours tie the notion of a site from the Moons surface in a centre-left-right sequence;
for kids with the Nasa space agency. The Masthead is Cluster 13: The astronaut : exits from the Moon base moves to the left-
repeated throughout the page. The choice of bright wards to the foreground before returning to the base in a
arresting colours and animated, cartoon-type images (e.g. the rocket, the space creature) foreground the interpersonal dimension, rather than naturalistic representations;
Cluster 2: Nasa News: link to other NASA websites, not just those for kids;
Cluster 3: Stories by kids: as above;
Cluster 4: Solar Flare: link to NASA KIDS CLUB Art Gallery;
Cluster 5: Hey Kids...coming soon: link to Rockets page;
Cluster 6: Teachers Corner: link to teachers' resource site.
B. Clusters responding to rollover and mouse click:
Cluster 7: Saturn: on rollover, becomes yellow with Saturn's rings pulsating; the wording Space & Beyond appears in red indicating link to page with the same name; when clicked, goes to the Space & Beyond home page;
Cluster 8: The Rocket: on rollover, becomes yellow; the wording Rockets & Airplanes appears in red indicating a link to page with the same name; when clicked, goes to the Rockets home page;
Cluster 9: Nasa: on rollover, as above but with the wording NASAtoons; when clicked, goes to the NASA Toons home page, dealing with animations;
Cluster 10: (Moon base) on rollover, as above, but with the wording Astronauts, living in space; when clicked goes to the Pioneers home page;
Cluster 11: (Extraterrestrial) on rollover, as above, but with the wording Projects & Games; when clicked goes to the home page of Games website.
... clockwise circle that shifts from the foreground to the middle ground;
Cluster 14: The Moon buggy: moves on the rim formed by the Moon's surface in a clockwise direction becoming larger when foregrounded and smaller when backgrounded.
D. Self-activating clusters responsive to mouse click:
Cluster 15: The plane and its banner: moves across the screen from right to left; when the plane is clicked, the user is linked to the NASA KIDS CLUB Art Gallery home page; when the banner is clicked, the user is linked to the Connect the Stars home page (a game).
E. Self-activating clusters responsive to rollover and mouse click:
Cluster 16: The Earth: this cluster rotates simulating the real Earth's rotation; like Clusters 7-11, on rollover, the cluster becomes yellow with red wording, in this case: Our Earth. Similarly when clicked, the user is linked to the Earth home page.
Note: Clusters 12-16 all interpret different cycles of movement.
F. Clusters responding to type-in and mouse click:
Cluster 17: Search engine: when a word such as Moon is typed in, and the word Go is clicked, a database is searched and an appropriate page is returned.
G. Inactive Clusters (dotted to show they represent a much larger area):
Cluster 18: Milky Way: dark grey with white spots representing stars;
Cluster 19: The Moon's surface: dark yellow colour;
Cluster 20: Deep Space: represented by pure black.
of the Nasa Kids home page, with its blend of fantasy, play and science. The use of
the cartoon genre is motivated by the following criteria: (1) it appeals to the sense of
fun and enjoyment of the young reader; (2) it can hybridise diverse visual domains such
as the realistic photos from the Apollo missions and the fantasy world of the
cartoon; (3) it can negotiate the shift away from the real world to the virtual world
that the home page projects. Some of the visual intertexts that are evoked and
recontextualised in this process include:
• Nasa photographs and films taken during the Apollo missions to the
Moon;
• the space rocket in classic science fiction stories and their film versions,
e.g. Jules Verne, Buck Rogers, and so on;
• photographs of distant planets taken by space probes and telescopes
such as Hubble;
• the stereotyped comic strip or cartoon alien (friendly green, bug-eyed
monster);
• artists' conceptions of future Moon bases;
• the familiar sight of an airplane towing a display banner for advertising
purposes.
The viewer is thus positioned as imaginer and traveller in a virtual world. The depicted
scene, with its unrealistic scalar representations, positions the viewer as located on the
lunar surface (here you are) looking back towards the Earth (where you've come from)
and beyond towards Saturn and the Milky Way galaxy in the more distant background
(where you're headed to). The initial positioning, along with these shifts in perspective,
functions to position the viewer variously as traveller on an imaginary journey, as
knowledge seeker, as imaginer of hypothetical worlds beyond the actual, as adventurer
and as someone to be entertained, who wants to have fun, along the way. As we have
mentioned above, the depicted scene raises questions about scale; for example, in
relation to the relative sizes of Earth and Saturn, their positioning in the solar system,
and their distance from each other. However, the real value of the scene lies in the
imaginary world which it projects. Moreover, this world is projected from a particular vantage point,
i.e. the Moon's surface. The message seems to be: our Earth is not the only vantage
point from which to view both ourselves and the Universe; therefore, we need to
expand our earthbound perspective to take in those of other worlds.
The visual semiosis of the Nasa Kids home page therefore decouples perceptual
invariants (see Inset 11: Visual transitivity frames on the facing page and Inset 15:
Gibson's optic array, p. 192) from the stimulus flux of the perception of real-world
events under natural conditions and manipulates and rearranges them as possible or
imaginary scenes. In such cases, as here, visual deixis nevertheless functions to ground
the viewer and the viewer's perspective, i.e. standing on and observing from the lunar
Figure 3.3: Nasa Kids home page (focus on NasaToons object illuminated; the cluster is highlighted after mouse rollover)
NasaToons. In this way, the given object specifies a superordinate thematic item in
such a way that the object both grabs the viewer's attention and invites him or her to
click on the object to explore its thematic potential.
The dominant colour contrast is between the bright yellow of the Moon's
surface and the black of outer space as it recedes into the background away
from the observer's perspective on the Moon's surface. Outer space is sometimes
depicted as a lonely and hostile environment, e.g. in the science fiction movie Alien
(1979). In this film, the vastness and darkness of outer space are seen as a hostile
environment in which humans are exposed to great dangers. The Nasa Kids home page
uses the bright yellow of the Moon in the foreground to evoke
a sense of emotional warmth and openness (cf. for example, the use of colour in: the
Children's COMPASS website in 3.7, pp. 130-136; the Eskimo text in 4.1, pp. 167-
173; the Westpac text in 4.2, pp. 174-181 and Appendix I). The Moon is depicted as
a home-away-from-home, as shown by the presence of an astronaut's living quarters,
the Moon buggy roving its surface, and the astronaut who is walking in close proximity
to the Moon base. The receding blackness of outer space is itself populated with
brightly coloured objects such as the Earth and the planet Saturn. Rather than evoking
fear and desolation, this space invites the viewer to feel safe and secure in exploring
it (see 1.1, pp. 4-21).
Interpersonally, the Nasa Kids home page codeploys visual resources such as
colour, perspective, the juxtaposing of objects and so on. In this way, it positions itself
and the viewer in a complex heteroglossic space in which the meanings and values of
diverse domains of social practice are negotiated and hybridised (see Inset 19, pp. 245-
247). In particular, these domains include: education; entertainment; creating public
interest in and recruiting people, especially young people, to Nasa's scientific and
technological mission and values. The use of colour as a semiotic resource and a
cartoon-like genre of visual depiction in contrast to, say, a realistic one are functional
choices in this particular semiotic environment, fulfilling functions such as the
following:
• make the depicted world appealing to/enjoyable for the viewer;
• conjoin the viewer to a set of shared community values and meaning
orientations;
• inform the viewer about Nasa's achievements and discoveries;
• evaluate Nasa's achievements and discoveries;
• interact with the viewer by answering his or her questions, responding
to his or her search inquiries and so on;
• direct the actions of the viewer in relation to the manipulation of
textual objects, the use of web page resources (audio, video), and
how to go to other sites of interest or relevance to the viewer.
A single mouse click can initiate a semiotic cascade on account of the multiplying
potential created by the synergy among diverse multimedia genres. As the hypertext
trajectory unfolds (see Inset 10, p. 116), it expands into a much larger-scale semiotic
formation in which diverse genres, modalities, web pages and websites are pro-
gressively integrated with each other. As we shall see, this process is very different
from the linear or sequential character of the generic structure potential of many lin-
guistically realised genres such as Recount, Narrative, Argument and so on (see 3.3,
pp. 113-118). The logogenesis of a particular pathway through a website or from one
site to another can be described as an activity structure that collects the effects of its
own cascading along the duration of the pathway. Thus, a hypertext pathway collects
thematic meanings, diverse semiotic modalities and genres and integrates them to its
own activity as meaning is accumulated along its logogenetic (meaning-making)
trajectory. Let us take the Nasa Kids home page as an example. Figure 3.3 on the pre-
vious page shows the starting point of our chosen pathway, the NasaToons linked
object (marked by an arrow). The NasaToons icon is a superordinate item (see 2.6, pp.
80-90) which indexes and, if clicked, allows access to a series of animated films
about various Nasa activities such as the one shown in Figure 3.4. By clicking on the
NasaToons icon, a pathway is created to another page (see Figure 3.5 on the following
page) in the form of a NasaToons menu of options.
This menu of options is a set of thematic nodes in the form of a visual icon
and a corresponding verbal caption. This page in effect expands the NasaToons icon
on the home page into a much wider set of thematic areas and associated activities
in both the verbal and visual semiotic modalities. Each of these nodes itself provides
access to a thematic area that the given higher-order node specifies. By selecting and
clicking on the verbal-visual node A Far Out Pioneer in the top right-hand corner
that we have blown up in the inset (Figure 3.5), access can be gained to the page,
shown in Figure 3.4, about the Pioneer 10 space probe, which was launched in March
1972. This page deploys the following genres and semiotic modalities:
• Cluster 1: The generic title of the page as a whole: New Science;
• Cluster 2: The specific title: A Far Out Pioneer + icon indicating the availability of audio;
• Cluster 3: The date of posting of the page, i.e. 25th August, 2001;
• Cluster 4a: A short verbal text in the form of a Recount which chronicles the history of
the Pioneer 10 space probe from its launch in March 1972 till the recent redetection
of its signal in interstellar space; this text foregrounds the temporal
staging of key events in the thirty-year history of the probe; overall, the verbal
Recount functions as an Orientation for the meanings of the page as a whole;
• Cluster 4b: The verbal Recount is closely integrated with a visual image depicting an artist's
conception of the Pioneer 10 space probe in deep space; the visual image
can be seen as a hypotactic extension of the meanings of the verbal text,
which are primary in this case; in other words, the visual image adds further
dimensions of meaning in the naturalistic visual modality by showing how
the spacecraft really looks out there in deep space;
• Cluster 5: A second verbal text, comprising the imperative clause Check out this
NasaToon for more about Pioneer 10, both specifies a particular procedure to
follow and indexically points to the embedded NasaToons animated
video clip which immediately follows this clause;
• Cluster 6: If we click on the animated film clip, we can watch a seven-minute educational
video in which narration, animation, graphics and diagrams are used both to
explain the scientific principles underlying the redetection of Pioneer 10 in
Deep Space well beyond the orbit of the planet Pluto and to give some more
historical detail; in any case, it is the film clip, in contrast to the verbal Recount,
which foregrounds the scientific meanings associated with Pioneer 10;
• Cluster 7: The left-hand column of the page specifies links to other video texts about
Pioneer 10, the Pioneer 10 Home Page of Nasa's Ames Research Center, and so
on.
The above analysis shows how the unfolding activity sequence involves the progres-
sive expansion and integration of meanings and actions along its trajectory. The
various stages in this sequence are schematised in Table 3.2, which transcribes the
main stages in the unfolding activity sequence. In the present example, the analysis
stops with the New Science page. However, it is possible to continue the pathway in
any number of ways; for example, by clicking on the Pioneer 10 Home Page icon in
order to go to that page, or going back to the NasaToons menu or the Nasa Kids
home page in order to select other objects which enable us to open other thematic
domains and their related activities. Nevertheless, this brief analysis is enough to show
that the meanings made along a given trajectory differ in important ways from
the sequential unfolding of the generic structure potential of verbal texts such
as Recounts, Narratives, Arguments and so on. For a start, the hypertext trajectory is
much more open-ended; it does not have a definite beginning-middle-end type of struc-
ture, and for this reason it does not feature the same kind of semantic closure that is
characteristic of these linguistic genres. Instead, there is a progressive opening up of
thematic regions and genre possibilities as the developing trajectory navigates a path-
way across genres, semiotic modalities, activities and meanings.
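The open-ended, accumulating character of such a trajectory can be made concrete with a small sketch. What follows is our own illustration, not a notation used in Table 3.2: the class names `Phase` and `Trajectory` are hypothetical, and the genre and modality labels attached to each phase are simplified readings of the analysis above. The sketch shows only that a pathway is a sequence of location-plus-action phases with no required final stage, and that it collects the genres and modalities it passes through.

```python
from dataclasses import dataclass, field


@dataclass
class Phase:
    """One stage of a hypertext trajectory: where the user is and what they do."""
    location: str
    action: str  # the user action performed at this location
    genres: list[str] = field(default_factory=list)
    modalities: list[str] = field(default_factory=list)


@dataclass
class Trajectory:
    """A pathway that accumulates meanings rather than closing on an end stage."""
    phases: list[Phase] = field(default_factory=list)

    def extend(self, phase: Phase) -> None:
        # Open-ended: any phase may always be followed by another one.
        self.phases.append(phase)

    def collected(self, attr: str) -> list[str]:
        # Everything gathered so far, in order of first appearance.
        seen: list[str] = []
        for phase in self.phases:
            for item in getattr(phase, attr):
                if item not in seen:
                    seen.append(item)
        return seen


# Illustrative labels only; they simplify the cluster analysis given above.
pathway = Trajectory()
pathway.extend(Phase("Nasa Kids home page", "click NasaToons icon",
                     genres=["cartoon"], modalities=["visual"]))
pathway.extend(Phase("NasaToons menu", "click A Far Out Pioneer",
                     modalities=["visual", "verbal"]))
pathway.extend(Phase("New Science page", "watch the NasaToons film clip",
                     genres=["Recount", "animated film"],
                     modalities=["verbal", "visual", "audio"]))

print(pathway.collected("genres"))
print(pathway.collected("modalities"))
```

Nothing in the structure marks the third phase as final: a further call to `extend`, for instance to the Pioneer 10 Home Page, continues the pathway in the same way.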
This page exhibits a fairly high degree of semiotic condensation, which is typical
of many web pages. This is so in two related senses. First, there is a spatial juxtaposi-
tion of different texts, different genres of text, different semiotic modalities and
different associated technologies on the same page in a screen environment which
affords the user interaction with, and manipulation of, these texts and the objects that
are embedded in them. Secondly, the processes referred to in this first point also mean
that many processes on much longer space-time scales, e.g. the thirty-plus years of
Pioneer 10's journey in space, impinge upon, and produce effects on, the short
timescale activities of the user when he or she interacts with the page, acts on the
page, and so on (e.g. reading text, watching the video, clicking objects, printing, down-
loading and so on). These two features of the web page draw attention to the princi-
ples of weak classification and weak framing (Bernstein, 1990 [1981]) that characterise
the relationships among semiotic resources and genres and their spatial arrangement
and codeployment on the same web page or along a hypertext trajectory. The material
affordances of the computer screen make this possible at the same time that the
user is positioned as an agent who can intervene in texts and create personalised
hypertextual pathways through websites and between websites. The personal computer
affords the dynamic assembling of, and intervention in, a diversity of semiotic resources
and genres that were previously strongly insulated from each other as domains of
specialised practices and competencies.
Importantly, the meaningful whole that is created by a particular web page is not
defined by the physical space alone of the page on the screen, or by the spatial arrange-
ment of the elements in this space. Instead, it is defined by semiotic-material functional
relations and networks of relations. This is also true of the static printed page and its
meanings. The difference lies in the kind and degree of the functional relations
involved. Rather than a vertical hierarchy of elements and their functions, based on
the criterion of spatial juxtaposition and relative size with respect to the compositional
whole, there is also an increased emphasis on the horizontal multiplicity of functional
pathways. These pathways potentially relate the different texts and objects on the web
page not only to each other, but also to other pathways, other actions, other connec-
tions to other pages and websites and their texts and objects. Pathways of this kind
enact a whole network of connections and interactions that cannot be defined or
analysed in vertical hierarchical or compositional terms alone.
Cluster 1: Masthead + Subheading: The Masthead is the name of the British Museum website as a whole (the Portal). Cluster 1 consists of the overall name of the website, which situates this particular web page within that overall site. The use of upper case font and its position in the top-left part of the page foreground the Masthead's superordinate position with respect to the specific web page. The subtext: The British Museum: Illuminating world cultures hypotactically extends the meaning of the masthead. The meaning of subordination is conveyed by the change in typography: the size of the font is much smaller, underscoring the idea of a subordinate relationship to the masthead.
Cluster 2: The four items in Cluster 2 each comprise a visual icon relating to a noun, for example, Search. An icon of two little feet is closely linked physically and thematically with the noun Tours to form a cohesive tie.
Cluster 3: Link to the National Grid for Learning
Cluster 4: This cluster consists of the cartoon-like lion, Alfred, and part of a pale blue compass on a deeper blue background. Alfred's central position on the page is a good example of bilateral symmetry, creating a sense of balance between the left and right sides of the page. The lion gazes directly at the viewer, inviting him or her to come in. The point of the compass is a vector which links Alfred, in Cluster 4, to the masthead, Cluster 1. The cartoon-like representation of the lion is predominantly interpersonal, rather than experiential in its orientation: it suggests user-friendliness, as is appropriate in a site designed for children. There is obviously a relationship between the lion and the compass, indicating that the lion will be the visitor's guide to the museum.
Cluster 5: Interaction is the main feature of this cluster. The ask-the-expert and contact-us elements function as ways for the user to interact with the site. Each of these elements is, as before (see Cluster 2), partly linguistic and partly visual, though in this case we have imperative clauses instead of nouns. The lion's head and paw print identify the lion as the expert-cum-companion, who will guide us around this virtual museum site. The visual aspects of this cluster, the lion's head and paw print, create a textual tie with the central and visually salient cluster, Cluster 4.
Cluster 6: A going-on-a-museum-tour cluster. In the top right of the page there are two pictures entitled Object and Tour, each with a caption below it. These are both linked objects leading to links on the next page. The two photos are naturalistic, unlike the lion. They foreground the educational value of the site with the shift from the ludic to the scientific domains. The superordinate item This month's ties this month's object to the tour highlighting the museum function of the website.
Cluster 7: The Competition winners noticeboard cluster. In this case, two similar images relate back to the superordinate heading. The organisational principle is similar to Cluster 6: two same-sized images stand in a hyponymic relation to a superordinate linguistic caption.
Cluster 8: This is the masthead plus logo for the Children's Compass site page. Cluster 8 and Cluster 1 stand in a relation of complementarity and contrast. They are complementary as they are both about the British Museum at the same time that they both coexist within the bigger website of the British Museum, the Portal. The Children's COMPASS home page both identifies and gives access to this specific website within the overall British Museum portal.
Cluster 9: The bottom bar comprises four items each consisting of an icon and a verbal text.
Figure 3.6: The British Museum's Children's COMPASS home page: cluster analysis
surrounds the compass and that the higher luminosity of the compass creates the effect
of the compass receding away from the viewer. This choice in colour and luminosity
is co-contextualised with the directional vector of the compass, which points away
from Alfred towards the top right of the visual field. The compass thus functions as a
visual metaphor of the British Museum as educator. The Museum’s researchers explore
far away regions of knowledge and history at the same time that they make this
knowledge accessible to the general public, thanks to the friendly British Museum lion.
If the Search icon is clicked, the user gains access to another page entitled The British
Museum. Search the Museum. This page consists of a menu of options (Figure 3.7).
As with the NasaToons example, this particular menu of options consists of
a set of thematic nodes in the form of a visual icon and a corresponding verbal
caption. In the present example, this menu requires the reader to make a selection
from two thematic areas – one geographical and/or temporal (blue), the other fea-
turing domains of daily life (pink) – before submitting the combined selection to the
Find button. As with the NasaToons example, the menu of options expands the
Search icon on the home page into a much more diverse set of thematic regions and
their possible intersections. Having made, for example, the selection Asia + Daily
Life, the user clicks on the Find button and creates a link to the page headed Daily
life in Asia, which is the more specific thematic area resulting from the combination
of the two selections from the menu of options on the previous page (Figure 3.7).
Figure 3.8 shows the page, Daily life in Asia, which results from this choice.
This page consists of a brief information report giving some factual information
about the topic, with particular reference to objects used by ordinary people in
their daily lives. On the left side of the page, opposite the verbal text, a number of
such objects, with identifying verbal caption, take the form of clickable icons refer-
ring to specific Museum objects which can be viewed in the website. When the user
clicks on the icon entitled Women sewing, a print, he or she creates a link to a page
which focuses on this print and its meanings (Figure 3.9). This page consists of the
following texts and objects:
• a colour plate of the print, Women sewing, which is a display item in
the Department of Japanese Antiquities;
• a short verbal text providing a brief historical orientation to the
print in the first sentence, a brief description of each of the three
scenes depicted in the print, and an instructional text suggesting
other persons and objects in the depicted scenes that the viewer
may also wish to attend to;
• below the reproduction of the print are four icons + verbal captions
of other linked and clickable objects belonging to the same
thematic area as the print that is featured on this page.
The page overall is a recontextualisation of a display item from a particular
Museum exhibition. In other words, the page recontextualises the three-dimensional
display of an arrangement of items in space accompanied by verbal text explaining
the items to the two-dimensional modalities of the web page. The Museum object
(the Japanese print) has been decontextualised from its grouping with other objects
in a British Museum display and recontextualised in another modality (the
photographic plate embedded in a web page) such that it functions as a metonym
of the Museum display. Moreover, the spatial juxtaposition of objects in a Museum
display is itself transformed by the web page into the possibility of creating links
with other objects that are thematically related (see, for example, the Linked Objects
menu in the Women sewing, a print web page shown in Figure 3.9 below the print
itself), though not necessarily coming from the same area of the Museum. Once
again, we can see how the unfolding activity sequence involves the progressive
expansion and integration of meanings and actions along its trajectory. The various
stages in this sequence are schematised in Table 3.3.
In Phase 4 of this unfolding hypertext trajectory, the reproduction of the
print, Women sewing, is surrounded by a white border in the top-left part of the
page. It is as if the print is being displayed in an autonomous space, removed from
the context of its relationships with other items in a museum display. The
recontextualisation of the print in this way tends to minimise both its historical
context and its links to other times and places beyond that depicted in the print itself.
This possibility is further emphasised by the two clickable objects, Add to folder, and
Larger picture, which are placed on the right-hand side of the print. Both of these
clickable objects specify actions which the viewer can perform on the reproduction
of the print. They allow the viewer to save the print to the computers hardisk and
to enlarge the size of the print so that it can be seen in greater detail. Both of these
actions enable the viewer to appropriate a copy of the print to a private sphere, and
therefore afford possibilities for further uses of the print in other contexts beyond
that of the museum in which it is located and the web page itself. The print can be
linked to other kinds of activities that would not be possible in the case of the
original print in its museum display context.
A hypertext trajectory unfolds in time at the same time that it integrates texts
and activities in the virtual locations, in the form of web pages and their objects,
that the viewer encounters at different stages along the trajectory. The objects that
can be clicked on, and accessed, on the Daily life in Asia page are all presented in
a way which foregrounds their thematic homogeneity, in spite of the differences in
historical time, geographical provenance and their housing in different
Departments of the British Museum. For example, the object Salt bag (Figure 3.9),
in comparison with the Women sewing print, comes from Pakistan, dates from the
19th century, and is housed in the Department of Ethnography. However, the mode
of presentation of its web page is the same as that for the Japanese print. The
compositional homogeneity of the pages pertaining to these objects and the fact that
they are all construed as joint verbal-visual cohyponyms of each other at the same time
that they are hyponyms of the superordinate thematic item Daily life in Asia both
decontextualise these items as displayed museum objects and recontextualise them
as the coarticulated parts of a joint verbal-visual thematic formation in which the
diversity of times and places partly gives way to the newly contingent meanings of
this hybrid thematic formation. The original display items, now recontextualised as
visual images and verbal text in the new formation, are treated as entextualised
hypertextual objects, rather than material ones, which can be integrated to other
forms of activity in the virtual hypertextual environment of the website.
Table 3.3 (extract), Phase 2
LOCATION: Search the Museum menu; expanded set of thematic nodes as hyponyms of
superordinate Search the Museum icon; the blue and the pink nodes are
cohyponyms of each other;
ACTION: Select Objects: Asia + Daily Life ^ Click Find button; go to
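The hyponymic relations described above can be sketched as a simple lookup structure. This is only an illustration under our own assumptions: the names `MuseumObject`, `THEMATIC_FORMATION`, `superordinate_of` and `are_cohyponyms` are hypothetical, only the two objects discussed in the text are listed, and the provenance given for Women sewing is read off its description as a Japanese print. The point of the sketch is that cohyponymy depends solely on a shared superordinate item, so that differences of provenance and Department play no part in the relation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MuseumObject:
    title: str
    provenance: str
    department: str


# Superordinate thematic item mapped to its hyponyms (the linked object pages).
# Only two of the page's objects are listed, with details taken from the text.
THEMATIC_FORMATION = {
    "Daily life in Asia": [
        MuseumObject("Women sewing", "Japan", "Japanese Antiquities"),
        MuseumObject("Salt bag", "Pakistan", "Ethnography"),
    ],
}


def superordinate_of(title: str) -> Optional[str]:
    """Return the superordinate thematic item under which an object is filed."""
    for theme, objects in THEMATIC_FORMATION.items():
        if any(obj.title == title for obj in objects):
            return theme
    return None


def are_cohyponyms(first: str, second: str) -> bool:
    """Two distinct objects are cohyponyms if they share a superordinate item."""
    theme = superordinate_of(first)
    return (first != second
            and theme is not None
            and theme == superordinate_of(second))


print(superordinate_of("Salt bag"))
print(are_cohyponyms("Women sewing", "Salt bag"))
```

Note that `are_cohyponyms` never consults `provenance` or `department`: in this sketch, as on the web page, the thematic formation overrides the original museum classification.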
The visual presentation of the objects removes them from their contexts of use
and transforms them into aestheticised objects for contemplation and appreciation.
Whilst the verbal text in each case foregrounds meanings and activities concerned
with some aspect of the daily life of the people who used the object, the visual image
gives rise to another set of meanings that are not entirely supported by the verbal text.
The tension between the different meaning orientations of the two modalities, the
visual and the verbal, in this context suggests a complex process of negotiation
between the meanings that pertain to the scientific (historical, paleontological, etc.)
activities of the Museum qua research community and the requirement that the
Museum both educate and entertain a wider public of non-specialist visitors. The
latter can more readily appreciate the value of the objects displayed with reference to
formalist criteria of aesthetic appreciation and its appropriation to a private sphere of
consumption which typically emphasises the autonomy of the object rather than its
relations to the material conditions of its manufacture and use. In this way, we can
see how the meanings of the web page under consideration here are themselves a
further recontextualisation of the meanings and activities of the Museum itself. We
will now consider in more detail the ways in which joint verbal-visual thematic
relations and meanings are built up during the development of a hypertext pathway.
text. In a given thematic system, the use of a particular transitivity pattern (e.g.
Actor-Process: Material-Goal) in some clause in a given text may be a typical use
of this pattern in some intertextual set of texts in a community and not merely a
selection which is specific to a single text. Such a system is called an intertextual
thematic system. Thus, a particular clause may be the instantiation of a more abstract
thematic relation which the particular text shares with other texts, even if the text-
specific lexicogrammatical realisations of the common thematic pattern may vary
from one instance to another. The notion of thematic system was developed as a tool
for representing the salient and common meaning relations that are shared by any-
thing from a very restricted number of texts to a potentially indefinite number
(Lemke, 1983: 162).
To demonstrate the basic principles of this kind of analysis, let us now
consider an example of visual-verbal thematic relations between a photograph and its
verbal caption in a magazine.
Thematic relations are non-linear; they are diagrammatically represented in a
network notation consisting of nodes (thematic items) and the connections among
these (thematic relations). The specific thematic relations which occur in a text can
be seen as such or they can be connected to other texts sharing the same system of
thematic relations. An intertextual thematic relation, which is more abstract than a
text-specific one, can be represented in terms of typical transitivity patterns in the
experiential semantics of the clause (e.g. Actor-Process-Goal), of typical nominal
group patterns (e.g. Epithet-Thing-Post Qualifier) and of the clause complex
relations that link clauses into larger-scale semantic units.
Thematic relations in verbal texts are created through the lexicogrammatical
resources of the clause and the clause complex. Consider the clause Young romance
on the subway: what was out of the question for their parents' generation is quite normal
today. This clause was the verbal caption of a photograph (not shown here) in an
inflight magazine of an airline. The experiential grammatical semantics of this clause
is of the Relational: Attribution type. In clauses of this type, an Attribute is attrib-
uted to a Carrier. The Carrier instantiates the type-quality or the type-class of quality
or thing specified by the Attribute. In the example, the Carrier-Attribute relationship
in the clause functions to construct a thematic relationship between the two items,
the Carrier, what was out of the question for their parents' generation, and the Attribute,
quite normal today. Specifically, this grammatico-semantic relationship is used to con-
strue a thematic relation of opposition between the different values of the older and
the younger generations. The transitivity relation in this clause is the grammatical
means for relating the two items as part of a more abstract thematic pattern con-
cerned with the conflicting values of the older and the younger generations. This is
more than just a semantic relationship specific to this clause. Instead, the
lexicogrammatical resources of the clause are the means for constructing a more
abstract thematic relation that is shared with other texts.
In this and the subsequent analysis, the abstract intertextual pattern is repre-
sented in capitals as a notational convention to show that its meaning relation refers
to a more abstract and general class of thematic relation that is common to, or
shared by, some intertextual set, small or large, of texts. In the present case, this
may be represented as /GENERATION CONFLICT: OLD-UNACCEPTABLE VS. YOUNG-
ACCEPTABLE/. This shows how the locutions out of the question and quite normal are
assimilated to the meaning of an intertextual thematic pattern which is concerned
with the clash, the contrast, or the opposition between the values of parents and
their offspring, especially with regard to issues such as courtship, romantic love, sex-
uality and their display in public places such as subway trains.
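The step from a text-specific relation to its abstract intertextual pattern can be sketched as follows. This is a minimal illustration under our own assumptions, not the network notation itself: the names `ThematicRelation`, `ABSTRACT_ITEM`, `abstract` and `notation` are hypothetical, and the lookup table simply records the assimilation of the two locutions stated above. Nodes are thematic items and the labelled connection between them is the thematic relation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThematicRelation:
    source: str    # thematic item (node)
    relation: str  # label on the connection between the two nodes
    target: str    # thematic item (node)


# Text-specific relation construed by the caption's Carrier-Attribute structure.
text_specific = ThematicRelation(
    source="what was out of the question for their parents' generation",
    relation="OPPOSITION",
    target="quite normal today",
)

# Hypothetical lookup: locutions assimilated to items of the abstract pattern.
ABSTRACT_ITEM = {
    "out of the question": "OLD-UNACCEPTABLE",
    "quite normal": "YOUNG-ACCEPTABLE",
}


def abstract(relation: ThematicRelation) -> ThematicRelation:
    """Replace text-specific wordings with their abstract thematic items."""
    def lift(wording: str) -> str:
        for locution, item in ABSTRACT_ITEM.items():
            if locution in wording:
                return item
        return wording.upper()  # fall back to capitals for unknown wordings
    return ThematicRelation(lift(relation.source), relation.relation,
                            lift(relation.target))


def notation(relation: ThematicRelation, theme: str) -> str:
    """Render an oppositional pattern in the capitals convention used here."""
    return f"/{theme}: {relation.source} VS. {relation.target}/"


print(notation(abstract(text_specific), "GENERATION CONFLICT"))
```

Other captions that realise the same pattern with different wordings would need further entries in `ABSTRACT_ITEM`; the abstract relation itself would remain unchanged, which is what makes it intertextual rather than text-specific.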
The present example shows that the bringing together of the choices in both
language and depiction creates the possibility of a joint verbal-visual thematic relation
(see 2.6.2, pp. 82-89). However, it is important to go beyond the mere possibility
of a cross-modal thematic relation and show which ties are foregrounded, which
ties are typical patterns in some wider multimodal intertextual thematic relation, and
how these ties are established and developed in texts. In the analysis below of the
thematic development strategies along a hypertextual pathway in the British Museum
Children's COMPASS website, we shall see that the action potential of Linked Objects
(as in Figure 3.9), cross-modal covariate ties, the resources of deixis and transitivity
relations in both clause grammar and depiction, are some of the means whereby a
network of hypertextual multimodal thematic relations is built up. In the present
analysis, we shall focus in particular on some of the thematic ties between verbal and
visual patterns as well as the strategies that are deployed to create these ties.
The present example occurred as the verbal caption which accompanied a
colour photograph of a young couple in affectionate embrace on a subway train in
Seoul, South Korea, in the inflight magazine of a well-known airline company. The
verbal caption and the participants and the actions they are engaged in (the visual
transitivity) in the photograph construe a multimodal thematic relation on the basis
of the ties between ideational-grammatical relations in the clause and the visual
transitivity relations in the photograph. We would say that many cases of the verbal
anchoring of the meaning of a visual image are in fact multimodal verbal-visual
thematic relations which themselves have an intertextual, and not merely a text-
specific, basis.
A young couple, seated between another young couple on the left side of the
photograph and a young woman on the right side, are depicted embracing each
other and involved in affectionate hand play. The visual scene therefore depicts a
visual transitivity relation involving two participants and the actions they perform
together (e.g. embracing, holding hands). While the scene is specific both to this
photograph and this couple on that particular occasion, it is also the instantiation of
a typical pattern that is common to very many images, not to speak of the scenes
from everyday life that these images derive from.
In the verbal caption, the nominal group young romance on the subway indexi-
cally points to the image and verbally identifies the depicted scene. It is as if the nom-
inal group could be expanded to a clause-level structure such as This is young romance
on the subway in which the demonstrative pronoun this, which exophorically points to
the picture as its discourse referent, is the Identified in an Identified-Identifier
transitivity relation in its clause. This does not mean that the picture is without
meaning until the words supply it with one. As we have said above, the visual
transitivity in this text is part of a common pattern which is shared by many images
and, on this basis, with or without language, it is capable of generating its own meanings.
Likewise, the Classifier-Thing structure in young romance uses the grammatical
resources of the nominal group to create a thematic tie between the items young and
romance which is shared by very many linguistic texts.
The combination of the photograph and the nominal group instantiates a joint
verbal-visual intertextual thematic pattern across the two semiotic systems. The point
is that the codeployment of the two systems, along with the linking together of
thematic choices in the form of the covariate tie between the visual transitivity pattern
in the photograph and the Classifier-Thing relation in the nominal group, strengthens
the thematic link between the two modalities: not only is the process-participant
relation in the visual transitivity a typical pattern, but the Classifier-Thing semantics of
the nominal group also defines its Thing as being of a certain type (young romance is
a standard cultural-semantic pattern and a typical collocation pattern). In this way, a
covariate tie links the two modalities on the basis of some shared higher-order
meaning relation. This thematic relation may be diagrammatically represented as in
Figure 3.10.
Figure 3.10: Covariate tie between verbal and visual semiotic modalities,
creating a cross-modal thematic relation in an airline magazine text
The relations that are internal to the visual transitivity pattern and the nominal
group are structural relations (Participants^Vector and Classifier^Thing) that
belong to the experiential meaning relations in the two different semiotic modalities.
The relationship that is construed between them is a non-structural or covariate
relation in which the two items are linked on the basis of some meaning relation that
they are construed as sharing. It is the interplay between the two kinds of relations
(structural and non-structural), along with the deictic or indexical link already men-
tioned, that constitutes the typical strategies for building up multimodal thematic
relations. We shall now explore how this applies to an instance of a hypertextual pathway
between pages in the British Museum's Children's COMPASS website.
(1) Many objects survive which tell us about the daily life of people living in Asia.
(2) These include paintings and books, pottery, glass, clothing and burial
goods which all contain important information about the past.
(3) Whilst many objects tell us about the lives of the wealthy, we can also
learn about the lives of ordinary people.
The first sentence in the verbal text, Daily life in Asia, creates a semantic link
between many objects and survive, because in this thematic formation there is a typical
semantic relation /OBJECTS-SURVIVE/ (Medium-Process) when /OBJECTS/ are from
the past or can tell us something interesting or important about the past. In the second
clause of this sentence, /OBJECTS/ is in the semantic role of Sayer in a verbal process
clause of the type Sayer-Process-Recipient-Circ:Matter: Verbiage, i.e. /OBJECTS TELL
US ABOUT THE LIFE OF PEOPLE LIVING IN ASIA/. Thus, in a larger set of texts which
belong to the same intertextual thematic formation (see Inset 8: Intertextuality, p. 55),
/OBJECTS/ are Sayers which speak to us from the past. The Circumstance of Matter
(what about) further specifies the ideational content of the discourse that is attributed
to the Sayer, /OBJECTS/. Thus, in this first sentence a crucial thematic link is made
between /OBJECTS-SURVIVE/, on the one hand, and /OBJECTS-TELL-ABOUT LIFE IN
THE PAST/, on the other. In this case, the life of the people living in Asia is assimilable
to the abstract thematic item /LIFE IN THE PAST/.
Moreover, the items the daily life of people living in Asia, the lives of the wealthy,
and the lives of ordinary people are assignable to the superordinate thematic pattern
/PEOPLE-LIVE-IN ASIA/ which is introduced here and further developed on the specific
pages related to the Linked Objects (see below). These three items all occur in
Circumstances of Matter in relation to verbal processes (objects tell us) in their
respective clauses. In this way, the thematic pattern /PEOPLE-LIVE-IN ASIA/ is
coarticulated with the /OBJECTS-SURVIVE/ pattern described above by means of the
clause level experiential pattern Sayer-Process: Verbal-Recipient-Verbiage.
The Linked Objects displayed to the left of the verbal text, each comprising a
visual icon and a verbal caption, are therefore assimilable to the superordinate
thematic item /OBJECTS SURVIVED FROM PAST/ at the same time that the joint verbal-
visual Linked Objects can be seen as specific instantiations of the superordinate item.
Again, we see how the thematic formation is based on the inter-semiotic
complementarity between verbal and visual resources. For example, the Linked Object
[VISUAL ICON + VERBAL CAPTION: Women sewing, a print] can be seen as a specific
thematic expansion in the two modalities of the verbal item goods in the main verbal
text. At the same time, the Linked Object also creates the possibility of a coactional
tie between verbal text and Linked Object. The user can further explore the thematic
possibilities of the Object and how it develops the thematic meanings and relations that
are built up on this introductory page in relation to the more specific pages that can
be accessed via the Linked Objects.
The first clause of the second sentence can be assimilated to the abstract
pattern /GENERALISATION: EXAMPLE(S)/, as in the clause These (i.e. objects) include paintings
and books, pottery, glass, clothing and burial goods, which is a clause of the type Carrier-Process:
Relational-Attribute: Type-category. The second clause of this sentence, i.e. (they) all
contain important information about the past, further elaborates the previous mention of
the same abstract thematic pattern /OBJECTS-TELL-ABOUT LIFE IN THE PAST/ in
Sentence 1. This is so even though the lexicogrammatical choices in this occurrence
of the thematic item are of the type Carrier-Process: Relational: Attribute-
Circumstance: Matter. Thus, /OBJECTS CONTAIN INFORMATION ABOUT THE PAST/
and /OBJECTS TELL US ABOUT THE PAST/ are lexicogrammatical variants on the same
more general thematic relation in their intertextual thematic formation. Significantly,
the final clause in this sentence also gives voice to an evaluation of the importance
of the information that these objects contain, rating this positively on a scale
IMPORTANT-UNIMPORTANT.
The thematic meaning of the third sentence in part hinges on an implied
opposition between the thematic items /WEALTHY PEOPLE/ and /NOT SO WEALTHY
PEOPLE/, where the nominal group ordinary people can be assimilated to the meaning
of the more abstract item /NOT SO WEALTHY PEOPLE/. The clause many objects tell us
about the lives of the wealthy is assimilable to the abstract thematic item /OBJECTS-TELL-ABOUT
LIFE IN THE PAST/ that was evidenced in Sentence 2 above. In the present case, the
meaning of the wealthy is therefore interpretable as a hyponym of the superordinate
thematic item /LIFE IN THE PAST/; it is seen as a specific exemplification of the superordinate
item. The second clause, we can also learn about the lives of ordinary people, is a
clause of the type Senser-Process: Mental-Phenomenon-Circumstance: Matter. The
two clauses in this sentence together imply a thematic relation which is partially
implicit. This relationship may be expanded into the full set of thematic items and
the relations among them as follows:
MANY OBJECTS TELL US ABOUT THE WEALTHY
THEREFORE
WE LEARN ABOUT THE WEALTHY
MANY OBJECTS (ALSO) TELL US ABOUT THE NOT SO WEALTHY
THEREFORE
WE LEARN ABOUT THE NOT SO WEALTHY.
The pivotal thematic relationship here is that between /TELL/ and /LEARN/, where it
is assumed that the latter is a consequence of the former: someone tells us something
therefore we learn something. We are able to supply the missing thematic items in order
to build up an expanded thematic formation, as shown above. The muted
heteroglossic opposition between wealthy and ordinary people, which in some other
thematic relation could imply a negative evaluation of the latter, is in this case evaluated
positively, such that objects pertaining to the lives of less wealthy and poor people are
seen as being just as interesting and important as those having to do with the wealthy.
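The expansion set out above rests on one assumed rule: for each topic, a TELL relation and a LEARN relation go together. Purely as an illustration, the following Python sketch starts from the two relations that the third sentence states explicitly and supplies the two that are left implicit. The names ThematicRelation and expand_formation are hypothetical and form no part of the analysis itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThematicRelation:
    """One thematic item: a participant, a process and a matter (what it is about)."""
    participant: str
    process: str
    matter: str


def expand_formation(explicit):
    """Return the full TELL/LEARN formation for every matter named in `explicit`.

    Assumption for illustration: 'someone tells us something,
    therefore we learn something', so each matter gets both relations.
    """
    matters = []
    for rel in explicit:
        if rel.matter not in matters:  # keep the order of first mention
            matters.append(rel.matter)
    expanded = []
    for matter in matters:
        expanded.append(ThematicRelation("MANY OBJECTS", "TELL US ABOUT", matter))
        expanded.append(ThematicRelation("WE", "LEARN ABOUT", matter))
    return expanded


# The two relations that Sentence 3 states explicitly.
explicit = [
    ThematicRelation("MANY OBJECTS", "TELL US ABOUT", "THE WEALTHY"),
    ThematicRelation("WE", "LEARN ABOUT", "THE NOT SO WEALTHY"),
]

expanded = expand_formation(explicit)
# The thematic items that the reader has to supply.
supplied = [rel for rel in expanded if rel not in explicit]

for rel in expanded:
    print(rel.participant, rel.process, rel.matter)
```

In this sketch the list supplied holds exactly the two items that the text describes as missing and that the reader is able to fill in.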
Appendix I: Multimodal Transcription of the Westpac advertisement (T = time in seconds)

Columns: (1) T; (2) Visual frame; (3) Visual image; (4) Kinesic action; (5) Soundtrack; (6) Metafunctional interpretation, phases and subphases

Shot 1 (PHASE 1a)

T 1
Visual image: CP: stationary; HP: frontal; VP: median; D: VLS; VC: sheep, eucalyptus tree, utility van, sheep dog; VS: progressive magnification of form of herdsman (1-10); CO: naturalistic
Kinesic action: [Herdsman starts walking from car towards viewer; sheep dog goes to left; Herdsman starts rolling up left sleeve]; Tempo: M
Soundtrack: [silence]

T 2
Kinesic action: Herdsman bends down and twice slaps thighs to recall dog to his side; Tempo: M
Soundtrack: {RG} [ ] Solo keyboard (pp, TWO CHORDS ^ [sheep]: SI; Volume: p; Tempo: S
Interpretation: EXP: Actor; action (Herdsman walks towards viewer)

T 3
Kinesic action: (^ Dog returns to herdsman). Herdsman starts rolling up right sleeve; Tempo: M
Interpretation: INT: Viewer positioned as belonging to depicted world and its shared values; Herdsman/dog; low volume, slow tempo of music: intimate communion

T 5
Kinesic action: [Herdsman continues rolling up left sleeve; dog runs ahead]; Tempo: M

T 7
Kinesic action: [Herdsman continues; Tempo: M
Soundtrack: them //(#); [ ]
Interpretation: TEX: hyperthematic status of Phase 1a functioning

T 10
Visual image: VS: maximum magnification of visual contour of herdsman; VC: hat, shirt typical of rural worker; VF: far; off-screen
Kinesic action: Herdsman rolls up right sleeve; Tempo: M
Soundtrack: them //(#)
Interpretation: (3) thematic condensation of major themes to be developed in subsequent text

Shot 2 (PHASE 1b)

T 11
Visual image: CP: stationary; HP: oblique; VP: median; D: MCS; VC: indices of work (glasses, desk, lamp); VS: draughtswoman; CO: naturalistic; CR: red; VF: near; hands
Kinesic action: [Draughtswoman rolls up left sleeve, while sitting at desk]; Tempo: M
Soundtrack: {RG} [ ] (*) roll

Shot 3

T 13, 14
Visual image: CP: stationary; HP: frontal; VP: low; D: MCS; VC: blue singlet of driver; interior of truck cabin; VS: truck driver; CO: naturalistic; VF: median; viewer
Kinesic action: [Truck driver enters cabin of truck, moving towards viewer, prior to driving truck; mouth: smiles]; Tempo: M
Interpretation: INT: identification of viewer with depicted scenes; solidarity (smile, gaze); chorus: imperative exhortation of viewer

T 15
Visual image: VC: background out of focus; nurse's uniform; VS: nurse; CO: naturalistic
Kinesic action: smiles at someone out of field of vision]; Tempo: F
Soundtrack: (NO MUSICAL
Interpretation: location

T 19 (PHASE 2b)
Soundtrack: {RG} [ chorus]: R; (*) roll them (*) up; Volume: ff; Tempo: F
Interpretation: EXP: Actor; Action As above; visual thematics: intergenerational relations (father/son, old/young and paternalistic ideology

T 24
Tempo: M

Shot 9 (PHASE 3a)

Visual image: As in Shot 5; Logo as in Shot 5
Kinesic action: Westpac logo moves towards the viewer; Tempo: M

T 26
Visual image: HP: frontal; VP: median; D: MCS; VC: school uniform of girl, study materials, desk; VS: contour of girl; CO: naturalistic; VF: near; viewer
Kinesic action: grasps pen with right hand]; [Schoolgirl leans forward; mouth: smiles]; Tempo: M
Soundtrack: //SLOW
Interpretation: female soloist as role model: advisory speech act; TEX: covariate ties: smiling, sleeve rolling, moving forward; cut from logo to schoolgirl on moving

Shot 12

T 28
Visual image: CP: panning; HP: slightly oblique; VP: median; D: MLS; VC: shop window, bread, baker's uniform; VS: shop window, contour of baker; CO: naturalistic; VF: median; off-screen
Kinesic action: [Baker stands outside bakery; rolls up left sleeve]; Tempo: M
Soundtrack: a job
Interpretation: INT: chorus: declarative clause and explicit we provides reason/motivation for prior exhortation in dyadic response to soloist

T 29, 30
Visual image: VF: distance: median; orientation: viewer; otherwise as above
Soundtrack: to (*) do (!!) //SLOW
Interpretation: TEX: chorus extends from businessman to baker, or linking them in a joint we as defined above.

Shot 13 (PHASE 4a)

T 31
Visual image: CP: stationary; HP: frontal; VP: median; D: MCS; VC: brick construction wall, bricklayer occluded by wall; VS: brick wall; CO: naturalistic
Kinesic action: Bricklayer crouches behind wall
Soundtrack: {RG} [ ±]: I; Volume: n; Tempo: M

Visual image: As above; bricklayer now partially visible behind wall; VF: near; directed to brick
Kinesic action: Bricklayer places brick in wall; Tempo: M
Soundtrack: this (*) country
Interpretation: EXP: Actor; Action (Shots 13-14); discourse of male speaker and visual enact joint thematics of building Australia through work; thematic condensation of this thematic in bricklayer and dishwasher

T 32
Visual image: As above; VC: blue work-singlet and trowel
Kinesic action: [Bricklayer continues positioning brick in wall; raises trowel]; Tempo: M
Soundtrack: was (NA) built

T 33
Kinesic action: Bricklayer taps brick with trowel; Tempo: M
Soundtrack: on a tradition
Interpretation: INT: viewer identification with depicted world; constant tempo and volume of speaker's discourse: leadership

Shot 15

T 37
Visual image: As in Shots 5 and 9
Kinesic action: Westpac logo moves towards the viewer; Tempo: M
Soundtrack: {RG} we see (*) our job
Interpretation: EXP: Actor; Action (Shots 16-20); joint verbal-visual thematic of Westpac services and tie to prior thematic of getting on with the job

Shot 16

T 38
Visual image: CP: panning; HP: oblique; VP: low; D: MCS; VC: pilot's jacket, helmet, helicopter; VS: pilot running left to right to helicopter; CO: naturalistic; VF: off-screen
Kinesic action: Helicopter pilot runs towards helicopter; Tempo: F
Soundtrack: {RG} as (NA)backing

T 39
Visual image: As above; VF: near; directed inside cabin of helicopter
Kinesic action: Helicopter pilot enters helicopter; Tempo: F
Soundtrack: that way of thinking
Interpretation: INT: As above; identification of we of Westpac with corporate symbol of logo (Shot 15)

T 40
Visual image: NO VF
Soundtrack: as much; Tempo: M

T 44
Soundtrack: {RG} with ad(NA)vice //(#)

Shot 21

T 47
Visual image: CP: stationary; HP: frontal; VP: median; D: MLS; DT: nun's dress, cricket bat and wicket, white sports uniform of boys; VS: nun and boys, nun's atypical way of holding a cricket bat; CO: naturalistic; VF: median; viewer (nun)
Kinesic action: [Nun holds cricket bat; rolls sleeves up: boys grouped together near nun]
Soundtrack: {RG} and a (*) brand new (*) spirit
Interpretation: EXP: Actor; Action; thematics of bank merger made explicit for first time; joint verbal-visual thematics of competition; heteroglossic alliance of spiritual (nun), youth (boys), and economic competition

T 48
Visual image: VF: boy bowling facing viewer
Kinesic action: [Nun holds cricket bat; boys turn towards viewer; one boy bowls ball]
Soundtrack: of (*) competition
Interpretation: INT: As above

T 49
Soundtrack: {RG} that will bring (NA) more and (*) more (*) benefits

Shot 23

T 53
Visual image: CP: panning left to right as secretary exits office; HP: oblique; VP: median; D: MCS; VC: minimal background indicates (Westpac) executive office, business dress of executives; VS: contours of executives, secretary in shadow; CO: sensory, low detail; CR: red (tie); VF: directed to secretary
Kinesic action: [Westpac executive ushers woman out of his office; his arm extends to her back; woman moves right as she leaves]; Tempo: M
Soundtrack: (NA) our sleeves up (*) too // (#)

T 60 (PHASE 5c)
The next stage in the pathway entails selecting and clicking on the Linked Object
called Women sewing, a print on the Daily life in Asia page. In making this choice, we
go to the page entitled Women sewing, a print. We shall now discuss how the move to
this page relates to, and further develops, the thematic meanings of the Daily life in
Asia page. The verbal text on this page is as follows (Sentences 1-4 belong to Paragraph 1,
Sentence 5 to Paragraph 2):
(1) Sewing was done by Japanese women, who often spent their time
apart from the men of the family.
(2) This print shows three women working together on their sewing on
a hot summer's day.
(3) On the right, two of them stretch and fold a red silk sash with a tie-
dyed pattern of white starfishes.
(4) On the left, one woman is holding up a sash, maybe to check the
repair she has just made.
(5) Look for other members of the family: a teenage girl peering into
a small cage, which holds her pet insect; a little boy teasing a cat with
its reflection in a mirror; and a baby playing with its mother's fan.
The first clause in the first sentence of the verbal text, Sewing was done by Japanese
women, is assimilable to the more general superordinate thematic pattern /PEOPLE-
LIVE-IN ASIA/ in the introductory text on the previous page. At the same time, the
new item is also a further semantic specification of the developing thematic
relations across the two texts. The Actor-Process: Material-Goal pattern of this
clause specifies the activity which these Japanese women undertook. Its
experiential semantics belong to a wider set of texts about /HUMAN ACTIVITY/, or
/PEOPLE DO THINGS/, as part of the way they live. From this point of view, this
clause is assimilable to the superordinate thematic item at the same time that
it is its further specification. The second clause develops the same basic pattern.
Thus, /JAPANESE WOMEN SPEND TIME APART FROM MEN/ is itself a further instan-
tiation of the /PEOPLE LIVE IN ASIA: PEOPLE DO THINGS/ thematic pattern.
The clause (Sentence 2), This print shows three women working together on their sewing
on a hot summer's day, indexes a cothematic tie with the visual scenes which are
depicted in the print. The new clause also further develops the /PEOPLE DO
THINGS/ thematic relation in the non-finite clause three women working together on their
sewing on a hot summer's day by extending the thematics of the text into the domain
of work, which is a subset of what people do. This clause again instantiates the
Actor-Process: Material pattern that was noted above, in the process further
developing the thematics of this paragraph as /PEOPLE DO THINGS: WORK/. In
repeating the transitivity pattern of the second clause in Sentence 1, a covariate
semantic tie is established between the two: both clauses are about what the women
did (Actor-Process: Material). The remaining clauses in this paragraph continue the
• In the present study, these units refer to different kinds of meaning relations
on their respective levels. Thus, there are different scales of meaning making
and their relationships in multimodal texts. The units and relations that are
described at any given level usually apply only to that level. In other words,
each level is characterised by the distinctive units and their relationships on
that level. This does not alter the fact that on occasions a lower level unit
such as a visual transitivity frame may coincide with a single shot; on other
occasions, it may be distributed across more than one shot. In such cases, we
can say that the two levels have been conflated in that text. Such observations
draw attention to the tight linkages across the different levels in the given
hierarchy of semiotic relations. Levels are not therefore autonomous. The
same point applies to the relations between subclusters, clusters and superclusters
of items to be found on, for example, printed pages and web pages
(see Inset 5: Clusters and cluster analysis, p. 31).
• In one sense, as noted above, a system of scalar levels is a hierarchical structure.
The units and relations on a given level, such as the shot in video texts,
are parts of larger wholes such as the subphase or the phase on higher levels.
However, the notion of a hierarchy in which larger-scale units contain
smaller-scale ones (e.g. a shot contains a visual transitivity frame) can also be
misleading. There are two points to be made here:
(1) larger-scale units provide integrating contexts for smaller-scale ones;
(2) the different levels mutually interact with and constrain each other;
for this reason, they are not completely separable.
Smaller-scale units are not simply smaller parts or building blocks in larger
wholes. Leakage across levels is part of the way in which a hierarchy of
meaningful units and relations functions in discourse.
multimodal links between verbal and visual genres are made on the basis of joint
verbal-visual thematic relations. The building of a relation of this kind means that
both semiotic systems have resources that enable this linkage to occur. In the
discussion of the Women sewing print, we saw above how such intersemiotic
complementarity is created between visual transitivity and linguistic transitivity
relations. Of the many different things that could be focused on, this example selects
patterns of Process-Participant relations in the two semiotic systems as the basis for
the creation of a joint visual-verbal thematic system of relations which can be built up
in different ways as one moves between image and text on this page, or from this page
to some linked object such as Salt bag in the same website. There is nothing which
inherently connects these texts and images except the hypertextual pathways from
page to page and the specific arrangements and co-contextualisations of semiotic
modalities on different scalar levels of organisation, ranging from a particular cluster
of items on a page, through the relations among clusters and a particular juxtaposition
of image and verbal text, to the pathway taken from one page to another, and so on.
A hypertext object has an ambivalent status: it is a visual image at the same time that
it is more than that. The fact that the term object is used is in no way fortuitous. In this
section, we will explore the ambivalent character of some of the linked objects on the
Nasa Kids home page in order to better understand their dual status as picture and
object. A linked object on a web page has a potential for action. It is useful to think in
terms of the layered nature of such objects (see below). We can explore the space
between any given layers in terms of a potential pathway: every time we click on an
object, we get a new set of objects and a new set of relationships which link both
backwards and forwards to previous objects and to relationships in the past and in the
future. Some of these are determined by the author and some by the reader. On the
Nasa Kids home page, objects of this kind include the rocket and the Earth. When we
click on such an object, we activate various possibilities for action. Each new
possibility is a potential pathway which can itself be described in metafunctional
terms. Let us follow up this idea in more detail.
A pathway from one object to some other is a functional unit of meaning and
action; it can be related to higher or lower scale units as well as to other units on its
own scale. A unit of this kind is not a formal unit, but a unit of action and meaning.
Meaning is made through activity. Activity and action are
the fundamental meaning-making units; text is derived from meaning-making activity.
The creation of a link from one object to another, from one web page to another, and
so on, is a form of activity which arises from the interaction between the computer
user and the resources of the given web page or website that the user selectively
draws on. Activity of this kind integrates objects, texts, images, web pages, and so
on, along its trajectory. Meaning is created according to the kinds of relationships that
the computer user construes among the objects, texts, and so on, that are so integrated
along this trajectory. A hypertext pathway therefore entails the selective
recontextualisation of the resources of the website as the pathway unfolds in time. As
we shall now see, some aspects of the activities performed are in the hands of the
computer user whilst others are controlled by the computer program.
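The notion of a pathway as a unit of action can be made concrete with a small sketch. The Python fragment below treats a pathway as the sequence of pages visited when the user clicks a series of Linked Objects. The page and object names are taken from the COMPASS example discussed earlier, but the link table LINKS and the function follow_pathway are hypothetical; in particular, the link from Women sewing, a print to Salt bag is assumed for illustration only.

```python
# Hypothetical link table: page title -> Linked Objects that can be clicked on it.
# Only the names come from the text; the exact link structure is an assumption.
LINKS = {
    "Daily life in Asia": ["Women sewing, a print"],
    "Women sewing, a print": ["Salt bag"],
    "Salt bag": [],
}


def follow_pathway(start, clicks):
    """Return the pages visited when the user clicks the given Linked Objects in order.

    The user chooses what to click; the page determines what can be clicked.
    A click on an object that the current page does not offer raises ValueError.
    """
    pathway = [start]
    current = start
    for target in clicks:
        if target not in LINKS[current]:
            raise ValueError(f"{target!r} is not a Linked Object on {current!r}")
        pathway.append(target)
        current = target
    return pathway


print(follow_pathway("Daily life in Asia", ["Women sewing, a print", "Salt bag"]))
```

The sketch separates the two sources of control mentioned above: the list of clicks stands for the choices made by the computer user, while the link table stands for what the program makes available at each step.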
On a web page, action can take various forms. There are objects which move
on the web page in relation to the other objects in the depicted scene. Such objects
move autonomously; they have locomotion; they move from place to place and their
movement is not influenced by the computer user. Examples on the Nasa Kids home
page include the Moon buggy and the rocket pulling the display banner. Another kind
of object is that which changes its state when clicked by the user. Examples of this
type include the rocket on the Moon’s surface, the Earth and Saturn. The two kinds
of objects on the Nasa Kids home page are two different classes of objects with
distinct functions. For example, the Moon buggy is a moving object which does not
respond to mouse rollover or to mouse clicking. On the other hand, the rocket flying
overhead with the display banner in tow does respond to clicking, though it accesses
a different kind of activity (playing the game of drawing the constellations) as com-
pared to the more thematically oriented objects such as Saturn, Earth, and the rocket
on the lunar surface (see below).
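The two classes of objects just described can be summarised as two independent properties: whether an object moves by itself, and whether it responds to clicking. The following Python sketch records the objects named above in these terms. The class PageObject and its field names are hypothetical, and the assumption that Earth, Saturn and the rocket on the Moon's surface do not move by themselves goes beyond what the description states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageObject:
    name: str
    moves_autonomously: bool  # moves without any action by the user
    responds_to_click: bool   # clicking gives access to some further activity
    click_result: str = ""    # what a click gives access to, if anything


OBJECTS = [
    PageObject("Moon buggy", True, False),
    PageObject("Rocket towing display banner", True, True, "constellation-drawing game"),
    # Assumed not to move by themselves; the description does not say.
    PageObject("Rocket on the Moon's surface", False, True, "thematic content"),
    PageObject("Earth", False, True, "thematic content"),
    PageObject("Saturn", False, True, "thematic content"),
]


def click(obj):
    """Return what a click gives access to, or None if the object ignores clicks."""
    if not obj.responds_to_click:
        return None
    return obj.click_result


for obj in OBJECTS:
    print(obj.name, "->", click(obj))
```

Keeping the two properties separate allows for the rocket with the display banner, which both moves autonomously and responds to clicking.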
The metafunctional character of the linked object Rockets & Airplanes and other
linked objects is described in the following three subsections.
3.9.1 Experiential meaning
The activity sequence can be analysed in terms of the functionally related parts which
comprise the whole. In this perspective, the activity-sequence is describable as a config-
uration of Participant roles, a Process (the action performed), and a Result, as follows:
PARTICIPANT: AGENT: COMPUTER USER ^ PROCESS: ACTION: MOUSE CLICK ^
PARTICIPANT: GOAL: WEB PAGE OBJECT ^ RESULT: CHANGE OF STATE VIS-À-VIS
AN OBJECT (e.g. THE ROCKET, THE EARTH, THE ASTRONAUTS LIVING IN SPACE).
This activity structure comprises four functionally related parts; each part has a
functional role to play both in relation to the whole activity structure and in relation to
the other parts in the whole. The computer user-cum-mouse clicker is an agent whose
action is directed towards a particular object. When the object is clicked, the agent’s
action instigates a certain result, viz. a change of state in the object. A structure of this
kind is said to be a multivariate structure. That is, it comprises different kinds of parts
which, in combination, make their distinctive contribution to the whole in which they
function. In cases like these, the activity structure is a hybrid of real and virtual
elements. The computer user and his or her manipulation of the mouse belong to the
real world of physical actions performed by the body. This activity of the body has
its virtual extension on the screen in the form of the mouse arrow, which responds to
the actions of the computer user. Objects on the screen such as the rocket on the
Moon's surface are virtual objects; in many respects the traditional separation between
visual image and the real object represented by that image has been replaced by, or
realigned in, a new type of functional relationship in which visual image and manipulable
object are now merged. In this new relationship, the
visual image of the rocket is not so much the representation of something else (a
rocket) that is not present, but the presentation or creation of a (virtual) object which
has specific kinds of reality effects.
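The four-part activity structure set out at the start of this subsection (Agent ^ Process ^ Goal ^ Result) can be sketched as a simple record in which each part has its own functional role. The Python fragment below is a minimal illustration under stated assumptions: the names WebPageObject, ActivityRecord and mouse_click, and the state labels "initial" and "activated", are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WebPageObject:
    """A virtual object on the screen that can change state when acted on."""
    name: str
    state: str = "initial"  # hypothetical state label


@dataclass(frozen=True)
class ActivityRecord:
    """One instance of the multivariate activity structure."""
    agent: str    # PARTICIPANT: AGENT
    process: str  # PROCESS: ACTION
    goal: str     # PARTICIPANT: GOAL
    result: str   # RESULT: CHANGE OF STATE


def mouse_click(user, target, new_state):
    """Apply the user's click to the target object and record the four functional parts."""
    old_state = target.state
    target.state = new_state  # the change of state instigated by the agent's action
    return ActivityRecord(user, "mouse click", target.name, f"{old_state} -> {new_state}")


rocket = WebPageObject("rocket")
record = mouse_click("computer user", rocket, "activated")
print(record)
```

The record is multivariate in the sense used above: its four fields are different kinds of parts, and none of them can stand in for another.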
The computer user can act on and manipulate this virtual object such that in
the virtual world of the computer screen the distinction between visual image-as-
representation and represented object is dissolved. The result is a new kind of hybrid
entity whose reality status lies somewhere between the two. Rather than indexically
presupposing its represented object, the image indexically creates an object which
participates in a field of action with other participants, real and virtual, such as the
computer user and the other objects that populate the screen world. The conjoining
of image and object in this virtual environment restores to the image some of its
original appeal as a direct incarnation of the world of objects before the discourse of
representation prised them apart and made of the image an object of contemplation
and distancing, rather than of action and involvement.
This may also help to explain the immense appeal and efficacy of the virtual
world of hypertext objects for so many people: the objects of this virtual world are
malleable and manipulable in varying ways and to varying degrees such that we can
say that human actors submit to them, their actions, and their possibilities for emo-
tional involvement and subjective investment. These aspects will be discussed in the
next section in relation to interpersonal meaning.
3.9.2 Interpersonal meaning
The object has the potential for dialogic engagement with the reader. We do not wish
to suggest that this is the same as dialogue between, say, two humans in conversation.
Rather, the integration of human and computer gives rise to forms of dialogic inter-
action and co-ordination of joint human-computer activity that are modelled on
human-human interaction at the same time that they constitute its further specifica-
tion along some parameters by virtue of the fact that the computer is made by humans
and is an extension of human activity in a social and cultural context. For these
reasons, the dialogue of computers or between computers and humans must in some
ways resemble their precursor forms of human-human dialogue, at the same time that
this dialogue is a qualitatively new development that has its own characteristics, not
reducible to the precursor level. The implications of this possibility are still little
understood at this stage. The reader orients to and engages with the object in the
following way:
The modification of the object, e.g. the rocket, over the temporal duration of
the rollover both indexes the object’s interactive potential and attracts or engages the
attention of the reader (see Figure 3.13). This involves a multimodal coordination of
the following kinds of stimulus information: visual + auditory + kinesic: object:
movement: pulsating; kinesic: movement: hand-arm-eye movement of user.
McGregor (1997: 79) points out that a sign can be modified or reshaped in order
to achieve interactional ends. The processes of sign modification or reshaping are con-
jugational relations: the reshaping or deformation of the sign spreads across or holds
the whole sign in its scope. Conjugational relations are thus scopal in nature in contrast
to the particulate or part-whole relations based on constituency that are characteristic
of experiential relations in the clause. Interpersonal meanings are created by the
reshaping or the modification of the signs in order to achieve interactive ends. In lan-
guage, the different mood categories (e.g. declarative, interrogative) reshape the same
proposition for different interactive purposes, i.e. asserting or interrogating the
proposition in the declarative or interrogative clause, as in ‘semiosis is a dialogic activity’
vs. ‘is semiosis a dialogic activity?’ The reshaping of the sign according to the
interactive purpose iconically expresses the interpersonal meaning. In these
examples, the interpersonal meanings declarative and interrogative are not con-
stituents in their clauses; instead, they hold the entire clausal proposition in their
scope and shape the clause accordingly. We can now relate these observations to our
example. To do so, it is necessary to distinguish between a number of distinct layers
of organisation that comprise the object in question, namely the rocket.
On Layer 1 (see the following page), the rocket is depicted as a participant
in an overall scene. The viewer orients to it from this perspective. On Layer 2, on
the other hand, when mouse rollover occurs, it is still a rocket, of course, but the
modification of this image through the change of colour (uniform bright yellow)
and texture (e.g. loss of specific detail) and the appearance of the superimposed
verbal caption, require the viewer to orient to this image from a different perspec-
tive and for a different interactional purpose with respect to Layer 1. These changes
spread across the entire object and constitute a reshaping or modification of the
object for interactive-interpersonal purposes, as described above. They iconically
signal both a change in the way the viewer is required to orient to the object and a
change in the interactive purpose to be attained by engaging with the object. The
changes that are manifested on Layer 2 when mouse rollover occurs signal a
different way of orienting to the given object in order to attain a different interac-
tional purpose. The different ways of interpersonally orienting to, and interacting
with, objects such as the rocket can be summarised as follows.
Layer 1:
ORIENTATION: explore overall visual field; view object as a component of the depicted scene;
ACTION: explore objects with mouse (some respond, others don’t);
Layer 2:
ORIENTATION: focus on/attend to object as salient, appealing, now foregrounded against the depicted scene on Layer 1; view object as having action potential;
ACTION: mouse rollover ^ object responds: change form;
Layer 3:
ORIENTATION: focus on action potential of object as link to a new page;
ACTION: respond to change of form on Layer 2 ^ left click object ^ create link to new page.
The two parameters ORIENTATION and ACTION specified here represent two
aspects of the interpersonal meaning potential of the object on each of the three
layers in the analysis. These two parameters are simultaneously present in the
interpersonal meaning of the object on each layer.
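The three layers and the mouse actions that move the viewer between them can be sketched, very roughly, as a small state machine. The sketch below is an illustration only and not part of the original analysis: the event names (rollover, rollout, click) and the function next_layer are hypothetical labels, and the assumption that moving the mouse off the object returns it to Layer 1 rests on the observation that the second layer occurs only during rollover.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    """One layer of a linked object, with its two interpersonal parameters."""
    number: int
    orientation: str
    action: str


# Wording of ORIENTATION and ACTION is abridged from the summary above.
LAYERS = {
    1: Layer(1, "view object as a component of the depicted scene",
             "explore objects with mouse"),
    2: Layer(2, "view object as having action potential",
             "mouse rollover ^ object responds: change form"),
    3: Layer(3, "focus on action potential of object as link to a new page",
             "left click object ^ create link to new page"),
}

# Hypothetical event names; any (layer, event) pair not listed leaves
# the layer unchanged, which is an assumption of this sketch.
TRANSITIONS = {
    (1, "rollover"): 2,  # object responds by changing form
    (2, "rollout"): 1,   # Layer 2 exists only during rollover
    (2, "click"): 3,     # link to the new page is created
}


def next_layer(current: int, event: str) -> int:
    """Return the layer reached from `current` when `event` occurs."""
    return TRANSITIONS.get((current, event), current)
```

On this reading, Layer 2 is the only layer from which two different continuations are available, which is one way of picturing its transitional status.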
A third parameter, APPEAL, will be briefly discussed below. On each layer, the
changes in the object induced by [+mouse contact] or [-mouse contact] constitute
modifications of the object in order to indicate how the viewer is to orient to and
interact with the object in the way described above. The role of the mouse with
respect to the three layers may be described as follows:
[Figure 3.11: BEFORE and DURING mouse rollover]
Similar observations can be made, for example, about objects such as Search on
the British Museum Children’s COMPASS home page (Cluster 2). The subcluster
comprises three items:
(1) a picture of a torchlight; superimposed on...;
(2) a brown disc with a light coloured outer edge;
(3) the word Search.
Figure 3.11 shows this subcluster before and during mouse-roll in order to demon-
strate the interpersonal modification which occurs in this object in the transition
from Layer 1 to Layer 2. On mouse rollover, the disc both rotates on its axis around
the torch in its centre and simultaneously gyrates in an up-and-down movement.
During this movement, the brown coloured face of the disc which is apparent prior
to mouse rollover is alternately visible and not visible to the viewer during the rota-
tion of the disc. The same pattern characterises all four items in Cluster 2 in Figure
3.6. In this case, the combination of movement and the alternating loss of colour
detail whilst the disc rotates on its axis constitute a reshaping of this object for inter-
actional purposes. The combination of these two features achieves the same kinds of
interpersonal and interactive purposes as do those described above in relation to the
rocket on the Nasa Kids home page. In both these cases and many others like them,
visual and kinesic changes or modifications that spread across the domain of the whole
object act as interpersonal signs of the ways in which viewers orient to, and interact
with, objects of this kind.
[Figure 3.12: APPEAL; ORIENTATION; ACTION-POTENTIAL]
Overall, there are three aspects to the interpersonal meaning of such objects,
as in Figure 3.12. The feature APPEAL refers to the way in which changes in the
object function to attract the viewer’s attention; ORIENTATION refers to the
perspective or stance the viewer is required to adopt on the object, e.g. is the object
part of a depicted scene or does it afford action potential?; ACTION refers to the
specific purposes that can be achieved by performing determinate actions on the
object, e.g. object change on mouse roll indicates the object can be clicked to link to
another web page.
Interpersonally, these changes index the interactive potential of the object. The
visual and kinesic changes that occur spread across the object as a whole; they enact
a kind of visual-kinesic prosody and serve to orient the user to it as an object that the
user can do something with. Clearly, the bright colours (the uniformly
bright yellow) and catchy movement (e.g. the pulsating of Saturn’s rings) are far from
realistic. Instead, they have an appeal function; they seek to catch the user’s attention
and to orient the user to act in a certain way. The deformation of the object described
above also constitutes a remodalisation of the object, or the creation of a new point
of view on the object for the user/observer. The change in form goes hand-in-hand
with a shift in the observer’s perspective. The object is no longer part of an
overall scene, but an object to be acted upon and engaged with. Its potential for
action is now focal in this remodalisation. This example serves to illustrate a more
general point. The virtual world of hypertext is as much about the virtual points of
view of observers on the objects of this world as it is about the objects themselves.
Objects on the screen can be analysed in terms of their interaction potential
along a cline of possibilities ranging from the user having control over the object’s
action to the object acting independently of the user according to a computer
program. There are different degrees of user control and object autonomy in relation
to the object’s potential for locomotion and/or change of state, as set out in Table
3.4. In showing the correlation between the variables of user control and change of
state, Table 3.4 also shows objects which respond to clicking and to rollover, and
which change their state accordingly. Other objects do not respond in this way.
Table 3.4: Interaction potential of objects on the Nasa Kids home page
Example: Moon rocket; Banner rocket; Moon buggy; Extraterrestrial creature
The mouse can be used to perform the following kinds of actions on objects:
(1) point to object;
(2) roll over object;
(3) click object.
In the case of (2) and (3), we can say that the object responds to the user’s
action. There is at least the simulation of a dialogic coordination of the two actions
in a larger-scale syntagm, which can be schematised as follows: Mouse Click ^ Object
Responds: Go to new page and Mouse Roll Over ^ Object Responds: Change form/colour.
Objects of this kind therefore have a response potential. This is so in two senses.
Their presence on the screen elicits the attention of the user and possible actions
on the part of the user. At the same time, the object itself has a repertoire of
responses when rolled over or clicked. These responses can, in turn, lead to the
development of further courses of action.
3.9.3 Textual meaning
Clickable objects can be described in terms of different layers as we demonstrated in
the previous subsection; they have a multi-layered type of organisation. Different
layers of an object are nested within each other. On the Nasa Kids home page, given
objects (e.g. the rocket on the Moon’s surface, the Earth, or the Moon base in Figure
3.2) are, on this first layer of their organisation, component parts of a larger visual
scene, i.e. the visual composition on the home page. As parts in this larger visual scene,
they have relative salience/weight and position and various kinds of relations to other
objects. However, on mouse rollover, another layer of the same object is revealed.
Figure 3.13 reveals the second layer in relation to the Moon rocket and the Earth.
This second layer is textually internally homogeneous, though less integrated
with, and partially demarcated from, the compositional whole to which the first
layer belongs. The three objects mentioned above (the Earth, the rocket on the
Moon’s surface, and Saturn) show this very well. However, each does so in slightly
different ways, at the same time that the three objects also have some features in
common with each other when the mouse is rolled over them, as shown in Table 3.5.
This partial demarcation with respect to the visual field of the first layer
accords the object a newly contingent autonomy; it now stands out from, and is
partially separated from, the surrounding visual field of the first layer by virtue of
the new characteristics that it takes on when rolled over. This second layer of the
same object is a transitional layer in two related senses:
(1) it only occurs during mouse rollover;
(2) it constitutes the potential of the object to open up a pathway to
another page.
This second layer is anaphoric to the previous state of the object (it links
back to Layer 1) and cataphoric to the next page (it points forward to Layer 3). A
transitional layer of this kind is textually ambivalent because the reader can hover,
so to speak, between clicking and not clicking before deciding which way to go.
A transitional layer of this kind and the change of state of the object that marks its
transitional status is therefore a decision-making moment on a potential pathway.
The three layers of the rocket as linked object can be described as follows:
[LAYER 1: LOCATION: HOME PAGE; OBJECT: ROCKET: COMPONENT IN
PRIMARY VISUAL FIELD; COLOUR: PRESENT; DETAIL: PRESENT] →
[LAYER 2: OBJECT: COLOUR: UNIFORM YELLOW; DETAIL: SCHEMATIC;
+ SUPERIMPOSED VERBAL CAPTION: Rockets & Airplanes] →
[LAYER 3: Rockets menu page].
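The feature bundles that distinguish Layer 1 from Layer 2 can also be written out as a simple data structure. What follows is a minimal sketch under our own assumptions: the names ObjectState and on_rollover are hypothetical, and the feature values are copied from the description of the rocket above rather than read from the web page itself.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ObjectState:
    """Feature bundle for one layer of a linked object (illustrative only)."""
    layer: int
    colour: str
    detail: str
    caption: Optional[str] = None


def on_rollover(state: ObjectState, caption: str) -> ObjectState:
    """Layer 1 -> Layer 2: uniform colour, schematic detail, caption added."""
    return replace(
        state,
        layer=2,
        colour="uniform yellow",
        detail="schematic",
        caption=caption,
    )


# The rocket as a component in the primary visual field (Layer 1).
rocket = ObjectState(layer=1, colour="present", detail="present")

# The same object during mouse rollover (Layer 2).
rolled = on_rollover(rocket, "Rockets & Airplanes")
```

Because the Layer 1 state is left untouched, the sketch keeps both states available for comparison, much as the transitional layer links back to the previous state of the object.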
Figure 3.14 schematises the layered organisation of web page objects such
as the one described above. Objects like the ones mentioned above undergo
change when rolled over in the way described in Table 3.5, which illustrates the
behaviour of objects with and without mouse rollover. Experientially, they are
bleached of the detail that is significant in relation to their place in the visual com-
position of the home page as a whole (Layer 1). At the same time, the verbal
caption is a further specification of the visual image (e.g. the rocket in Figure 3.13)
qua thematic node.
Textually, these objects, on account of the changes which occur in them
when rolled over, take on a new kind of visual salience. They are no longer just com-
ponents of an overall scene; the changes in them foreground their visual salience in
contrast to the visual scene to which they belonged in Layer 1. At the same time, the
objects tend to be separated off to some extent from the visual field to which they
previously belonged (Layer 1), thereby enhancing their cataphoric potential for link-
ing forward to new thematic domains in anticipation of a hypertext pathway which
can be accessed.
The superimposed verbal caption is especially important in this regard. Some
of the changes recorded (e.g. the uniform yellow on rollover or the
appearance of the red of the verbal caption) also serve to
link these objects on the basis of features that they all have in common. That
is, the objects have covariate ties with each other in Layer 2 that are different from
those that tie them together on Layer 1. Textually, the objects exhibit the kinds of
textual linking relations shown in Table 3.6 on the following page.
[Figure panels: Layer 1; Layer 2a; Layer 2b]
Hypertext, as the above examples from the Nasa Kids and the British Museum
Children’s COMPASS websites show, is very much a hybrid of precursor genres such
as verbal text, visual images, and multimodal combinations of these, on the one
hand, and the new meaning-making possibilities of the virtual environment of
hypertext, on the other. This is by no means surprising, though it has at times been
a source of critical confusion and disorientation. Furthermore, we must also face
the fact that many existing forms of hypertext are banal, commercialised recyclings
of already existing textual forms which are simply uploaded to a website.
Hypertext is a newly evolving system of semiotic possibilities. Yet this can only
come into being by taking over, reorganizing and reintegrating to its new forms of
organisation at least some of the possibilities of previously existing semiotic forma-
tions. This is so of all evolving systems.
In particular, hypertext foregrounds the virtual field of possibilities within
which the user can navigate and create multiple lines of connectivity among diverse
texts, web pages and websites. All acts of meaning making are created against a back-
ground of unrealised possibilities. A given choice in language, in depiction, in gesture,
and so on, and its combination with other choices from the same modality and from
other modalities is made against sets of alternative choices that were not actualised in
a given instance. Choices which are actualised constitute the syntagmatic environment
of a text; the sets of unactualised alternatives (the contrast sets) constitute the
paradigmatic environment of text (Saussure, 1993: 356).
The paradigmatic dimension of meaningful choice is a virtual system or field of
possible choices. Every choice that is actualised or realised in a text is made in relation
to, and always implies, this wider field of unactualised possibilities. The term virtual
therefore refers to the entire field of possibilities, which influences the actual at
the same time that it is itself a field of action and manipulation beyond the
merely actual.
Table 3.6: Textual links and functions in linked objects on Nasa Kids home page
Saussure defined the internal constitution of the language system (la langue) in
terms of two orders of relations, i.e. syntagmatic and associative relations. Of asso-
ciative relations, he made the following observation:
The sum of the relations with the words which the mind
associates with words which are present is a virtual series,
a series formed in memory, a mnemonic series, as
opposed to enchainment, to the syntagm which is
formed by two units which are both present. It is an
effective series in opposition to the virtual series which
engenders other relations
(Saussure, 1993: 356).
Associative relations are a virtual series in memory because they are not
actualised as syntagmatic combinations of actually present linguistic units in a given
chain or sequence of units. Nevertheless, they influence the actually present
syntagmatic relation at the same time that they constitute a latent field of
possibilities which is able to engender other possible relationships. Hjelmslev (1961
[1943]: 39-40) postulated the notion of a language ‘without a text constructed in
that language’. Hjelmslev points out that such a language would be a possible
system, ‘but that no process belonging to it is present as realised. The textual process
is virtual.’ (1961 [1943]: 40). Again, we see how the virtual, as a field of possibilities,
is contrasted with what is actually realised.
The virtual environment of hypertext spatialises this field of possibilities as
one in which the user’s powers for creating both vertical and horizontal lines of
connection among the objects on a single web page, across pages and across web-
sites, and the textual objects and the multimodal combinations of texts, objects, and
actions that these connections afford are massively enhanced. The space of cyberspace
is not, of course, the same as the three-dimensional physical space in which
we live. Rather, it is a metaphorical space projected on the screen but which extends
beyond the screen, and within which the user performs actions and navigates path-
ways from one point in this space to another. In this perspective, hypertext is a
virtual field of possibilities which affords enhanced semiotic possibilities for:
• interaction with textual objects;
• personalising texts, combinations of texts, and so on according to
individual preferences and requirements;
audio and visual stimulus information about these objects, the ways in which objects
index activities and participant roles for the user, the user’s taking up of these roles
and activities and the possibility of modelling new forms of experience, and of com-
bining and pooling experiences in new ways, interweaves the user’s embodied engage-
ment with the screen on the here-now scale, the making of a particular hypertext path-
way over minutes, hours, days and so on, and the virtual, forever expanding resources
of the web as a whole into a newly emergent level of organisation in our ecosocial
semiotic system. This level is made possible by the technological infrastructure which
supports it, though it is neither reducible to, nor caused by, this level alone. Rather, the
web is a newly emergent level of social semiotic organisation; it lies between
previously existing levels of social, cultural and technological organisation, as shown
in the three-level hierarchy of relations presented below.
The middle level, L, does not simply emerge from anywhere. Instead, it brings
about the reorganisation and integration of precursor meanings, texts and genres to
its own forms of organisation and the new possibilities that these afford. At the same
time, a technological infrastructure and the social and cultural conditions which enable
this must already be in place on a still higher level of organisation (L+1) in order to
provide a higher-scalar environment for the intermediate level. The lower level (L-1),
i.e. the already existing practices and technologies, constitutes both the initiating and
enabling conditions for the new intermediate level as well as the affordances, both
material and semiotic, which make the newly emergent level possible.
It is important to distinguish between the directionality and sequentiality of a
reading path and the nonlinear character of many of the meaning relations that the
reader construes when reading a written text. The visual-spatial organisation of
written text gives structure to determinate, not necessarily linear reading paths. At the
same time, it affords the reader multiple possibilities for accessing the page and its
potential meanings in a variety of places and ways. The visual-spatial organisation of
the page, in other words, enables jumping around (cf. cluster hopping as discussed in
Chapter 1). Typically, written genres have a default beginning-middle-end type of
organisation which structures the directionality of the reading path (3.4, pp. 118-
119). Hypertext, on the other hand, has a much looser, more open-ended type of
Community or social network of users and practices?
What are the ways in which websites build up a community of users? A community
is created and defined by the practices of its members, the links made between
practices, and the meanings that are made through these practices and the connec-
tions between different practices. The participants in a practice take up and per-
form various participant roles in the activities which they engage in. This combi-
nation of an activity and its associated participant roles may suffice as a first
definition of a practice. The notion of a community has often been seen in fairly ide-
alised terms as implying a homogeneous or uniform set of shared values and
meanings. To avoid this implication, we would propose the term social network in
order to explore the ways in which the website brings different people together in
a network of practices and meanings which tie these people or institutions in ways
that may be transient or stable to varying degrees over time. A social network
requires some kind of technological infrastructure which makes it possible for people
to connect with each other as members of the social network and participate in its
practices and to share in its meanings. The technological infrastructure of the web
enables the individual PC user to create links across space and time in the real-time
interaction between the user and his or her computer screen.
The web page as a site for linking participants as members of a wider social
network, however ephemeral or ongoing the commitment of individual
participants may be, can be understood in terms of the following parameters:
• the practices, activities, and participant roles that are internal to the
website and its semiotic resources;
• the ways in which the website is a recontextualisation of
practices, activities, and participant roles that derive from outside
the web page itself in other domains of social life;
• the ways in which the web page builds connections with other
texts and other media (for example, a website may guide the user
to books, magazines, TV programmes and so on, which relate to
the meanings and activities of the website itself).
The website displays considerable heteroglossia (Bakhtin, 1981 [1975]). That is, it
gives voice to a rich diversity of meanings and social values and the ways in which
participants may orient to and deploy these in their meaning-making activity when
they engage with a website and its resources. In this respect, at the beginning of
this chapter we referred the reader to various types of information websites (Table
3.1). Using the analytical tools that we have provided above, the reader may now
care to consider the ways in which the reading pathways found in other types of
information websites can be analysed. In some cases, the transcription will highlight
the relatively fixed nature of trajectories in which the steps are preordained (e.g.
those sites whose provision of information is closely linked to specific details, such as
publication details of a particular book or trains between London and Glasgow on
Sunday mornings) while in other cases (e.g. special interest sites) the likelihood of
this happening is much smaller.
The Internet and the World Wide Web afford the integration of texts and meanings
across space and time on an unprecedented scale. The Internet and the networked
multimedia personal computer together provide the technological infrastructure
for new forms of accumulation of knowledge and experience through
3.13 Conclusion
analysed in this chapter, the Audi Eskimo advertisement dating from 1997, is only 27
seconds long but still tells a complete story. It is by far the shortest of the three texts
analysed. The second text, the Westpac television advertisement, which dates from
1983, lasts for 60 seconds. Though very different, as we shall see, in its internal
organisation, its narrative complexity is roughly comparable to the final text, the
Mitsubishi Carisma advertisement that we discussed briefly in 1.6, pp. 46-54. The
latter text, which dates from roughly the same period as the Eskimo text, is again
much shorter, with a total air time of 40 seconds.
In our analysis of the three texts we follow the practice partly adopted in the
preceding chapters of giving different types of transcriptions according to the
analytical goals being pursued. In some cases these goals presuppose a low magnifi-
cation that endeavours to capture the basic structures of the text. This requires a
macro-analytical approach to transcription, i.e. one which attempts to capture the
meaning-making processes of complete texts in terms of the links between the
various subunits that make up a text: principally clusters, phases and transitivity frames.
Thus, in the Eskimo text the transcription is concerned with reconstructing the
text’s transitivity structure (see Baldry, Thibault, 2005). It takes into account the
relationships existing between two or three shots to produce a transitivity frame and in
turn relates the transitivity frame to the text’s phasal organisation.
However, the analytical goals will often require a much higher level of magni-
fication. This is the case with the analysis of the remaining two texts, where we are
concerned to provide a more exhaustive study that brings together a macro-transcription,
concerned with the interplay between the text’s phases, and a micro-transcription,
concerned instead with a detailed description of the semiotic resources
used in the meaning-making process (cf. Figures 1.5a and 1.5b). The two types of
transcription fulfil different functions but are complementary and can be interwoven.
In other words, the distinction between various types of transcription is only a ques-
tion of methodological convenience. Rather, as our analysis, in particular of the third
text, demonstrates, the purpose of this concluding chapter is to show, albeit partially,
how multimodal text analysis and multimodal transcription can be combined in order
to develop insights concerning the ways in which meaning-making resources on the
levels of Hjelmslev’s (1961 [1943]) expression and content strata are integrated to the
discourse level of organisation in multimodal texts (see Inset 18, pp. 236-237).
A multimodal transcription of a television advertisement is an entextualized
artifact which the analyst extracts from the prior discourse practices in which the
broadcast text is embedded at the same time that the analyst embeds it in the new
discourse practices of transcription and analysis. The transcription represents an
attempt to make claims about a prior discourse at the same time that it is embedded
in, and appropriated by, new discourses. The video recording of a television
advertisement is itself an entextualized artifact that we extract and appropriate
from the original context of broadcasting. The video text is then transferred to
and transformed by the transcription text and its associated practices. Rather than
say that texts hook up to a situational context, we see how the entextualization of dis-
courses as text artifacts, including transcriptions, affords the lifting of texts out of
one context and their appropriation by, and transference to, other (con)texts and
practices (Thibault, 1994). Transcription, which is both analytical activity and
textual record of this activity, must always keep this in mind when addressing the
question as to what it is we are studying. Practices create the context, not the formal
patterns and meaning relations in the text per se. In this view, the discourse level of
organisation is the point of intersection or the interface between practices and the
formal patterns and meaning relations that we identify as the global level of discourse
organisation (see Table 4.1, p. 226).
We have discussed the relevance of the systemic-functional tradition in this book (see
also Baldry, 2000b, 2004; Thibault, 2004) to describe short sequences of dynamic video
texts in terms of the relationship between phases and metafunctions (see 1.4, pp. 38-
44 and 1.6, pp. 46-54). The transcription in Figures 4.1a and 4.1b of the Audi Quattro
Eskimo advertisement again shows the relevance of the metafunctions in two principal
ways. First, a given modality (e.g. gaze) can be described in metafunctional terms.
Thus, gaze has experiential, interpersonal and textual dimensions of organisation and
meaning. Figures 4.1a and 4.1b describe the experiential dimension of gaze in the
text in terms of transitivity frames (1-8). They foreground the notion of a transitivity
frame as a functional semiotic unit in which the relations between participants,
process and circumstances are realised in gaze (Baldry, Thibault, 2005). They also
illustrate ways in which options in gaze are integrated with other meaning-making
resources. However, gaze can also be used, interpersonally, to engage an
interlocutor. Textually, it typically indexes a phoric (indexical) relation between the
gazer and the object of the gaze in ways which can be interpreted by an observer
such that the gazer’s intentions can be inferred. Though not found in the present
example, this option is, nevertheless, attested in the Westpac text (see 4.4, pp. 184-186
and 4.7, pp. 191-202). In the current text, instead, all of the instances analysed
are used to align the TV viewer to the Phenomenon to which the Gazer (man or boy)
directs his gaze. Gaze can also be modulated by facial expressions, eyebrow move-
ments and other factors to signal attitudinal and affective modification of the gaze
syntagm. Gaze, as noted above, expresses textual meaning, serving, in particular, to
create phoric links to relevant objects in the perceptual purview of interlocutors,
either alone or in conjunction with other resources such as pointing (see Frame 5a
in Figure 4.1b). Second, as we saw in Chapter 1, the metafunctions can guide us in
terms of the resource integration principle (see Inset 3, pp. 18-19). They function as
an integrating principle showing how different semiotic resources are codeployed.
TRANSITIVITY FRAMES: (1) ENGAGE: panorama | (2) REACT TO: object | (3) ENGAGE: other person | (4) ENGAGE: Object
Visual Image
Experiential meaning in Gaze: Phenomenon: Arctic landscape | Gazer: old man; Process: gaze vector: unfocused: non-specific: surveys landscape | Phenomenon: wolf print in snow | Gazer: boy; Process: gaze vector: focused: specific: downwards to paw print in snow | Gazer: boy; Process: gaze vector: focused: specific: engage other: upwards towards man (off-screen) | Phenomenon: man | Gazer: old man; Process: gaze vector: focused: specific object: downwards; Phenomenon: implied wolf print
Distance: Far | close-up head and shoulders | close | very close: head-and-shoulders | close: head-and-shoulders | very close: head only | very close: head only
Vertical/Horizontal Angle: Median/Frontal | High: looking down/oblique | Oblique high (looking down)/frontal | Median/frontal | Median/Oblique | Median/frontal | Median/frontal
Body Movement: none | slow head turn left to right | none | boy turns head towards old man | none | none
Multimodal Transcription and Text Analysis: Chapter 4
Transition cut to 1b; cut aligns viewer to old man
.
.and his cut to 2b: cut cut to 3b: cut
aligns viewer to
Visual
Image
Experiential Gazer: man Phenomenon: off- Gazer: man Phenomenon: Gazer: man Phenomenon: Gazer: man Gazer: man Gazer: boy
Process: gaze screen, then shown Process: gaze hand reaching to Process: gaze snow in man’s Process: vector + Process: gaze Process: gaze
meaning in
vector: focused: as they are walking vector: focused: pick up snow + vector: focused: hand head turn: engage vector: engage vector + nod:
Gaze specific: away specific: downwards Audi tyre tracks in specific object: held with boy with boy react to +
downwards to snow in hand: scrutinising: Phenomenon: boy Phenomenon: recognition
bear print downwards boy: off-screen Phenomenon:
man (off-screen
Textual cut from Gazer to Phenomenon zoom in from Gazer to Phenomenon cut from Gazer to Phenomenon Very cut from Gazer to Phenomenon
Close Shot: close identification with man; Very Close Shot: focus on key role of man as teacher
meaning in
high intensity of gaze
Gaze
medium close: far: man and boy in medium close very close: less than very close close: head-and- very close: head close: head and
face and upper distance walking whole face shoulders + face only shoulders
Distance
body away from bear
print in foreground
Vertical/ Median/frontal Median/frontal Median/oblique high: looking Median/frontal high: looking Median/oblique Median/frontal Median frontal
down/frontal down/ frontal
Horiz. Angle
Body hand point to man and boy man bends down man’s hand picks up none hand holds snow man turns towards none boy nods in
movement Phenomenon walking away from and reaches towards snow in hand for boy response to
of Gaze Vector bear print snow on ground inspection by man’s saying
man ‘Audi Quattro’;
Gaze Focus implied; bear None implied: Audi hand picking up implied: snow held as in 7a man to boy + boy man to boy: Implied (8b):
print (shot 5b) Quattro track in snow in hand (7b) to man implied (8a + 8c) boy to man
snow (6b)
utterance: ‘Audi Quattro’:
Language ‘hanou’; + subtitle +
Italian subtitle: utterance ‘Audi
‘orso’ [‘bear’] Quattro’
of reaching for
In the Eskimo text, an adult Eskimo identifies animals for a small Eskimo boy
from paw prints left in the snow, the final paw print being that of the Audi Quattro
car. The teacher-learner relationship between old man and young boy is
foregrounded. Selections and combinations from the resources mentioned above, as
presented in the transcription, highlight the participant roles of the man and boy as
well as functioning to mark the status of the two participants in relation to the
activities which they are presented as performing together. The subphase comprising
Frames 2, 3, and 4 illustrates this. The gaze transitivity frame in 2-3 has established
the boy as the discoverer of the paw print (Frame 2) in the snow. The close camera
distance and the high vertical angle project the boy as looking down on, and engaging
with, the paw print in the snow as an object of interest. In Frame 3, the close shot
of the boy’s face, shown frontally even though the boy’s gaze is directed at the paw
print and not at the viewer, creates a close interpersonal involvement between the
TV viewer and the boy, and by implication with the object with which the boy is
engaging. The viewer is asked to enter into the boy’s world, through his eyes.
In Frame 3a, the boy’s gaze is directed at the man. The oblique (not frontal)
horizontal angle and the fact that his gaze is directed upwards towards the man
index the differential status of the two participants, i.e. the boy is marked as the
subordinate or apprentice participant who defers to the man’s superior knowledge.
The very close shot of the man in Frame 3b, the frontal horizontal angle, and his
disengagement from the viewer by virtue of the downwards direction of the gaze
vector (to the implied wolf print) and his intense engagement with the Phenomenon
of his gaze vector, as shown by the partly closed eyelids and the fixed facial expression,
all function to mark his teacher status as one who knows at the same time
that he pronounces the word for ‘wolf’ in Frame 4 in the process of identifying the
paw print that the boy had found.
In Frame 8a-c, the interpersonal dimension of gaze is more prominent. In
Frame 8a, the gaze transitivity combines with the man’s head turn, as the man shifts
his focus from the type of track in Frame 7b to the boy in Frame 8a. The close distance
between viewer and man, the median vertical angle, and the oblique horizontal
angle, the foregrounding of the man and the backgrounding of the boy all work
together to indicate the interpersonal significance of this gaze frame at the same
time that the status relation is also clearly marked. In Frame 8a, this combination
of features reinforces the dyadic link between man and boy and hence the significance
of gaze in enacting the interpersonal relation between the two participants,
at the same time that the viewer is signalled as being an onlooker and not a party
to this. The cut to the very close shot of the man’s face in Frame 8b as he utters the
words ‘Audi Quattro’ focuses on his role as authority figure as he passes on the
results of his finding to the boy. The boy’s role as learner and apprentice is shown,
in Frame 8c, both through the reciprocal contact established by the jointly shared
gaze vector which was initiated in Frame 8a and by the boy’s nodding in
acknowledgement of the man’s words. In Frame 8a-c, the close distance and the
frontal horizontal angle, in conjunction with the fact that the textual participants do
not directly address the viewer in any modality (e.g. gaze, language), here function
cross-modally to create a high degree of interpersonal rapport between the man and
the boy. Once again, the viewer is an onlooker and not a participant in this exchange.

GAZE
  1.1 Control
    1.1.1 Engage with
      1.1.1.1 Gazer: Agent^Process^Phenomenon
      1.1.1.2 Gazer: Affected^Process^Phenomenon
    1.1.2 React to
      1.1.2.1 Gazer^Process^Phenomenon: Agent
      1.1.2.2 Gazer^Process^Phenomenon: Affected
  1.2 Direction
    1.2.1 Up
    1.2.2 Level
    1.2.3 Down
  1.3 Distance
    1.3.1 Far
    1.3.2 Medium
    1.3.3 Close
Insert
  Frame 1: 1.1.1.1 + 1.2.1 + 1.3.1  A
  Frame 2: 1.1.2.1 + 1.2.3 + 1.3.2  B
  Frame 3: 1.1.1.1 + 1.2.1 + 1.3.2  C
  Frame 4: 1.1.1.1 + 1.2.3 + 1.3.2  B
  Frame 5: 1.1.1.1 + 1.2.3 + 1.3.2  B
  Frame 6: 1.1.1.1 + 1.2.3 + 1.3.3  D
  Frame 7: 1.1.1.1 + 1.2.3 + 1.3.3  D
  Frame 8: 1.1.1.1 + 1.2.3 + 1.3.3  D
Notes: 4 different types of gaze combination are instantiated in this text. In Frame 4,
the phenomenon, present in Frame 2, is now off-screen and implied.
Figure 4.2: A (revised) preliminary network for gaze in visual texts; primary delicacy only
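The selection expressions attached to Figure 4.2 can be read as a small data structure. The following is a minimal sketch in Python, not part of the original analysis: the names `GAZE_NETWORK`, `FRAME_SELECTIONS`, `parse_selection`, `describe` and `direction_distance` are illustrative assumptions, and the option codes, labels and type letters are simply copied from the figure. It expands each frame's expression into labelled options and groups the frames by their direction and distance options.

```python
# Option codes and labels copied from Figure 4.2 (primary delicacy only).
GAZE_NETWORK = {
    "1.1.1.1": ("Control", "Engage with", "Gazer: Agent^Process^Phenomenon"),
    "1.1.1.2": ("Control", "Engage with", "Gazer: Affected^Process^Phenomenon"),
    "1.1.2.1": ("Control", "React to", "Gazer^Process^Phenomenon: Agent"),
    "1.1.2.2": ("Control", "React to", "Gazer^Process^Phenomenon: Affected"),
    "1.2.1": ("Direction", "Up"),
    "1.2.2": ("Direction", "Level"),
    "1.2.3": ("Direction", "Down"),
    "1.3.1": ("Distance", "Far"),
    "1.3.2": ("Distance", "Medium"),
    "1.3.3": ("Distance", "Close"),
}

# Selection expression and type letter for each frame, as listed in the figure.
FRAME_SELECTIONS = {
    1: ("1.1.1.1 + 1.2.1 + 1.3.1", "A"),
    2: ("1.1.2.1 + 1.2.3 + 1.3.2", "B"),
    3: ("1.1.1.1 + 1.2.1 + 1.3.2", "C"),
    4: ("1.1.1.1 + 1.2.3 + 1.3.2", "B"),
    5: ("1.1.1.1 + 1.2.3 + 1.3.2", "B"),
    6: ("1.1.1.1 + 1.2.3 + 1.3.3", "D"),
    7: ("1.1.1.1 + 1.2.3 + 1.3.3", "D"),
    8: ("1.1.1.1 + 1.2.3 + 1.3.3", "D"),
}


def parse_selection(expression):
    """Split 'a + b + c' into option codes, rejecting codes not in the network."""
    codes = tuple(part.strip() for part in expression.split("+"))
    unknown = [code for code in codes if code not in GAZE_NETWORK]
    if unknown:
        raise ValueError(f"unknown option codes: {unknown}")
    return codes


def describe(frame):
    """Expand a frame's selection expression into its labelled options."""
    expression, letter = FRAME_SELECTIONS[frame]
    labels = [" > ".join(GAZE_NETWORK[code]) for code in parse_selection(expression)]
    return f"Frame {frame} (type {letter}): " + "; ".join(labels)


def direction_distance(frame):
    """Return the (direction, distance) option codes selected in a frame."""
    _control, direction, distance = parse_selection(FRAME_SELECTIONS[frame][0])
    return direction, distance


# Group the type letters by direction-plus-distance pair.
combination_types = {}
for frame, (_expression, letter) in FRAME_SELECTIONS.items():
    combination_types.setdefault(direction_distance(frame), set()).add(letter)

# Distinct full selections (control + direction + distance).
full_selections = {parse_selection(expr) for expr, _ in FRAME_SELECTIONS.values()}

for frame in FRAME_SELECTIONS:
    print(describe(frame))
```

On the figure as reproduced here, the letters A to D each line up with one direction-plus-distance pair. Frames 2 and 4 share the letter B although their control options differ (1.1.2.1 against 1.1.1.1), so counting full selections gives a different total from the four types noted in the figure. This is an observation about the listed codes only, not a claim made by the figure itself.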
As the focus on transitivity frames in this presentation suggests (see Inset 11,
p. 122), we are increasingly concerned with multimodal transcription as a text
analysis tool capable of embracing both system and instance (Baldry, Thibault,
2001; 2005). For example, a number of other options, which we have grouped
together under body movement (e.g. pointing, walking, head turning, nodding)
typically interact with gaze in relevant ways, as is the case in this advertisement.
Moreover, we have drawn attention to some of the ways in which camera position
(CP), interpersonal distance (i.e. the viewer’s distance from participants in the text),
and the means for effecting transitions between shots would typically appear to
interact with gaze. In this way, we begin to see how meaning-making resources that
belong to the production and editing of video texts (camera position and distance,
cutting between shots) interact in patterned ways with the meaning-making resources of
the human body (e.g. gaze, movement, speaking, pointing).
Figure 4.2 presents some basic distinctions in the system of gaze. It focuses in
particular on transitivity relations expressed by gaze with a view to specifying the ways
in which some basic gaze schemas are realised. Figure 4.5, which is also concerned with
options in the system of gaze, is more general in focus and sets out a range of parameters
relevant to gaze, without, however, attempting to specify the realizations that
characterise particular gaze syntagms. In 4.2 (on the next page), we propose a detailed
microlevel multimodal transcription of the Westpac advertisement (see Appendix I).

Inset 13: System and instance
The system and instance perspectives on language and other semiotic systems are not
a matter of the two poles of a simple dichotomy. The two perspectives are not
opposed to each other in this way. Instead, they refer to two very different time-scales
and, therefore, two different perspectives from the point of view of observers
on very different time-scales (Halliday, 1992). Language-as-system evolves on the
evolutionary time-scale; change on this scale therefore accumulates and becomes evident
on time-scales that may greatly exceed that of the individual’s lifetime. Change
and forms of organisation on this scale are nevertheless the cumulative result of
very many interactions among components on smaller scales, such as the instance
perspective. The latter perspective is the human scale of text, the human activities in
which texts are created and used, and the participants in those activities.
Moreover, texts are not simply the instantiations of choices from an abstract language
system. Texts are also resources which we use to make meanings in different
contexts, to create links with other times and places, with other texts and so on.
Furthermore, the production and interpretation of texts involve a constant dialectic
between the virtual constraints of the language system and the dynamic constraints
of the text as it unfolds and develops in real time (see also Beaugrande, 1997: 11).
Constraints of the latter kind relate to the kind of social situation (the context)
which the participants understand themselves to be participating in.
Finally, texts themselves are not well served by a flat model of their organisation.
They are organised on a number of different scalar levels of organisation. With
respect to video texts such as television advertisements, we have proposed in this
book a number of such levels, e.g. the visual transitivity frame, the shot, the phase,
the macrophase, and the global organisation of the text as a whole, including its
generic structure. This means that selections from paradigmatic systems of alternative
choices are made on many different scalar levels, and in ways which affect choices
both on their own level and on other levels.
For example, choices in visual transitivity in video texts interact with choices in shots
and the sequencing of shots to form still larger phases as the text develops in time.
The choices made on one level, at one point in the unfolding text, affect and anticipate
choices still to be made, just as they may alter the significance of previously
made choices. For these reasons, the relationship between system and instance is a
complex one involving many different interacting factors. Neither system nor
instance is a static entity: both involve change and dynamic processes on their
respective time scales. At the same time, texts are always linked to the larger-scale
processes of the system, its history and the culture with which the system has
coevolved and in which it is embedded.
Inset 14: Material object text and semiotic action text
A text is a material object or process as well as a semiotic one. These two dimensions
of a text are equally important for understanding how texts function and have meaning
in particular communities. We can say therefore that texts are dually material and semiotic
entities and processes; that is, the material and the semiotic dimensions are fully
integrated in the one overall contextualizing activity (Lemke, 1995: Chap. 6; Thibault,
2004: Chap. 5). For this reason, the material and the semiotic dimensions are the two
sides of the same textual coin: the one cannot be fully understood and cannot exist
without the other. In this inset, we shall explore this point in greater detail.
The visual, auditory and other patterns which are picked up by our perceptual systems
are the material basis for construing patterns of meaning relations in a text. In this per-
spective, the text-as-material-object participates in the dynamic physical and biological
processes of the community and has material relations and connections to other
material processes in the same or some other community. A textual object in this sense
can be integrated to sensori-motor activity of the body (e.g. a book can be held in the
hand of the reader and its pages visually scanned). A treated surface, such as the paper
on which written signs or visual images are traced or otherwise installed by
technological means, is a material object text in this sense. The surface supports visual-
graphic patterns which have been traced on it and which provide visual information
about something other than the surface in the form of an arrested optic array of visual
invariants (see Inset 15: Gibson’s optic array, p. 192).
In the first instance, the optic array provides information about the tracing
(articulatory) activity or the technological processes (e.g. photographic) that put the
traces on the surface. The material surface used for this purpose affords the material
installation of these tracings in ways which enhance their durability as well as their inte-
gration to social activities and objects relevant to their interpretation in socially mean-
ingful ways. The material object text can be transported, transmitted, manipulated by
different users on different occasions, stored and retrieved etc. It may be relatively per-
manent (e.g. a book) or as ephemeral as the message someone writes in the sand on
the beach before the text is washed away by the incoming tide.
The physical-material object text is maintained by matter and energy processes that ensure
its structural integrity over some time span, long or short, at the same time that it participates
in a larger-scale system of semiotic-discursive processes. In this sense, the material object-
text constitutes an environment of matter-energy transactions and processes that afford or
enable the object text to be integrated to, and contextualised by, the meaning-making
practices of some community. Object texts are perceivable and manipulable material entities
or processes. They may be dually an extra-somatic surface of some kind and the tracings
displayed on it. The articulatory (somatic) processes of the body, e.g. vocal tract activity in
speaking, hand-arm gestures, facial expressions and other dynamic neuromuscular processes
of the body constitute the material means of installation of audible or visible patterns of
sensori-motor activity that are projected into the environment as an optic or auditory array
that can be picked up by the perceptual systems of others attending to these patterns. Again,
the auditory or visual patterns that are so detected provide information about their source
at the same time that they provide information about things other than that source.
A text is also a semiotic object or process; its material processes and characteristics afford
the possibility of its being integrated to, and made meaningful in, some community as a
semiotic text. It can be related to the patterns of contextualizing relations that typically
operate in some community. This means that the physical-material tracings on a surface
or the patterns of sound created by vocal tract activity can be construed as semiotically
salient patterns that are potentially meaningful for the participants in some social group.
Furthermore, these semiotic patterns are the means whereby we can interpret and assign
meaning to the phenomena we experience in the world around us, e.g. the events, things
and actions of other objects and participants in the world we live in.
Semiotic patterns of this kind include the semantic relations in the lexicogrammar and
discourse organisation of linguistic text, transitivity relations construing participants and
the processes in which they are involved in pictures and gestures, the attitudinal and
evaluative meanings conveyed by facial expressions and so on. In this perspective, texts
qua semiotic processes can be contextualised by, and can participate in, social activities in
and through which their users recognise and interpret patterns of semiotic relations that
are meaningful in that activity. They can also be used to relate that activity to some other
activity in some other time and place and to interpret or otherwise assign meaning to other
activities. Semiotic texts are both made and interpreted in and through social activities at
the same time that they participate in these or other activities and help to constitute them.
Take the example of the fire extinguisher illustrated below. As a physical object, it affords
its use in determinate material processes. It can be picked up, carried to the scene of the
fire and used to extinguish a fire by virtue of the capacity it affords for spraying a chemical
substance it contains onto the source of the fire. This object also provides the means for
the installation on its treated surface of an arrested optic array of visual invariants that
provide information about something other than the painted metal surface of the object.
In this sense, these visual patterns can be interpreted as semiotically salient linguistic and
visual patterns in an instructional text concerning the use of the object.
The multimodal verbal-visual text that we interpret on the basis of these patterns can
therefore be related to the lexicogrammatical patterns in language and the conventions for
interpreting visual patterns as depicting socially meaningful actions and sequences of
action (e.g. pulling out the safety pin, aiming at the fire and so on). The combined verbal-
visual instructional text can thus be used to make sense of these already meaningful and
purposeful physical actions by connecting the user of the fire extinguisher, the fire
extinguisher, and specific circumstances (outbreak of a fire) in a particular kind of socially
recognisable activity- or event-type. Moreover, the physical object and not just the verbal-
visual text installed on its surface is itself a meaningful semiotic artifact; it has its typical
socially recognised uses and the participant roles that these entail. Its size, shape and
colour together provide cues as to how it is to be interpreted as a certain kind of social
artifact which is imbued with social significance. This significance is further enhanced by
the use of the colour red in such contexts. The choice of this colour from among a set
of other possible colour choices is a meaningful choice: it indexes, in those communities
which recognise its significance, a meaning such as ‘danger; be alert’ and, therefore, the
kinds of actions and responses that are required in dangerous situations.
The colour red does not have this meaning in all the situations in which it is used (e.g. my red
sweater). However, in some contexts such as the one described here, it stands in certain kinds
of typical relationships to certain types of situations, meanings, and activities and configura-
tions of these in the communities in which the colour red has the meaning mentioned above.
The relationships it has to all of these factors in combination create a set of contextualizing
relations. Thus, red has a semiotically salient relation to certain kinds of actions and situations.
It has the meaning that it does in relation to this larger whole or contextual configuration rather
than on its own. The meaning may be a highly standardised one, as in this case, but it still has
to be connected to a wider system of contextualizing relations to have the meaning it does.
Texts can also be used to make sense of other activities, and therefore to create contextualiz-
ing relations between the text and these activities. A text realised by one particular configura-
tion of modalities or semiotic resources may be used to interpret and make sense of a
semiotic action or event in some other modality. For example, the action of physically stand-
ing before an audience in an auditorium and producing a lot of vocal tract activity for the
benefit of that audience is a socially meaningful action or event. There is a contextualizing
relationship between/among the sounds the audience hears and, for example, the social role
of lecturer, the architectural configuration of the space in which the event takes place, and
the subject matter (e.g. exoplanets) to which the audience connects the audible sound patterns
and the person whom the members of the audience recognise as embodying that social role
on that particular occasion. Both the meanings related to the topic of exoplanets and the
social-participant roles of lecturer and audience are abstractions from the audible and visi-
ble features of the event; in this way, they can be connected to other similar events and to
other similar ways of talking about the topic or
related topics as well as to other individuals who
embody the same roles on other occasions.
The event as such is dually a material and a semiotic
event in the way described above. However, this
event can in turn be talked about afterwards by
other people in the texts that they create through
their conversation about the event or, for example,
in the form of a journalistic text consisting of
verbal text and photograph of the famous visiting
lecturer in the next day’s newspaper. Both of these
texts – conversation and newspaper story – make
sense of the original event. They do so both by
creating meaningful links to that event and by using
the resources of other semiotic systems (e.g. word
and gesture in conversation or written text and
photograph in the journalistic story) to make sense
of and to recontextualise the original event using
the resources of other semiotic modalities and the
meanings these afford.
Fire extinguisher used to show its dual status as a material object text and semiotic action text
often does all three at the same time. An example is the nurse in Shot 4, Row 15.
Significantly, this shot occurs at the end of Phase 1 and is the first point in the text
where all three chains come together in the video track. This in itself emphasises
the culminative character of text-linking items, which here reach a culminative peak
at the end of Phase 1, in conjunction with that of the chorus in the soundtrack.
The partial and at times full interaction of all three chains is a striking and signifi-
cant feature of this text. Each feature in its own chain of cohesive elements may
be assigned to a specific superordinate intertextual thematic relation along with its
associated evaluative orientation. However, it is the interaction of all three chains of
elements that provides grounds for saying that these constitute foregrounded
meaning relations that link the different participants on the basis of a shared
intertextual system (Lemke, 1985). In the Westpac advertisement, this may be glossed
as [CORPORATE CAPITALISM ON THE MOVE + POSITIVE EVALUATION/AFFECTIVE
IDENTIFICATION], where CORPORATE CAPITALISM ON THE MOVE refers to the wider
thematic context in, and through which, the individual elements are assigned their
meaning, and POSITIVE EVALUATION/AFFECTIVE IDENTIFICATION refers to the axiological/affective
orientation which the text adopts in relation to this thematic at the
same time that it seeks to persuade viewers to adopt a similar value stance.
Column 5, Soundtrack, refers to all aspects of the soundtrack. Here language,
music, and other sounds are considered as parts of a more unified phenomenon
and are not separated out. There are two main reasons for this. First, the
multimodal basis of the transcription and concomitant analysis presume no
necessary priority for the linguistic semiotic in the making of the text’s meaning.
Secondly, and while recognising that each has its distinctive qualities, speech, song,
music, ambient and other sounds also have many features in common which
provide a basis for their potential semiotic integration in multimodal texts (Van
Leeuwen, 1999). Again, the emphasis here is not on mathematical criteria of
acoustic physics, but on criteria which are perceptually and semiotically salient.
Significantly, the Westpac text does not present the various participants in their
work settings from the point of view of a naturalistic or realistic auditory modality
of how things really sound. The soundtrack does not, for example, make available
to us the ambient sounds of the carpenters working at the house construction site
(Shot 18, Row 42), the sounds of the street outside the baker’s shop (Shot 12, Rows
28-30), the sounds of the boys playing cricket with the nun (Shot 21, Rows 47-50),
or the sound of the helicopter preparing to take off (Shot 16, Rows 38-40).
Instead, these visual images and associated body movements are variously integrated
with the sounds of a musical band, a female chorus, a female soloist and an off-
screen male speaker. The only possible exception to this relates to the sounds of the
sheep in the first scene. However, these sounds, too, play their role in the overall
meaning-making process. Without going into all the details here, the soundtrack
itself combines a number of different sound genres broadly defined that interact
with each other and with other semiotic modalities in the text in order to create their
own specific evaluative and affective orientations to the text’s thematics.
Column 6, Metafunctional Interpretation, represents an attempt to specify the
multifunctional basis of all acts of semiosis. The left-to-right visual organisation of
the table is not without consequences for the ways in which the makers and users
of transcriptions perceive the relationships among the various components of the
transcription and, by implication, of the text transcribed. Ochs (1979: 49) draws
attention to the left-to-right bias which derives from the Western tradition of visual
literacy. In this tradition, left is perceived as signifying both temporal and logical
priority. That is, that which is placed on the left of the transcription is – probably
unconsciously – doubly privileged on account of these organisational principles in
the grammar of visual semiosis in Western cultures. Typically, transcribers place the
verbal or linguistic component of the transcription on the left. If other semiotic
modalities are referred to at all, they tend to be placed to the right of the verbal
component. In the present transcription three interrelated strategies have been
adopted to overcome this problem. First, the leftmost column is that which dually
specifies the temporal progression of the text in seconds and the row number with
which this correlates. The numbers in Column 1 therefore have this dual function.
The important integrating and cross-referencing functions of the rows, as discussed
above, justify the priority which is accorded this column by virtue of its being placed
in the leftmost position in the table. Secondly, the verbal or linguistic dimension of
the textual transcription is located in Column 5, thus militating against any tendency
to treat it as more significant than the other columns. Thirdly, Column 5 includes all
relevant aspects of the soundtrack – speech, song, music – as different dimensions
of a single phenomenon. This does not mean that it is always appropriate to treat
speech in this way vis-à-vis other semiotic modalities based on sound. In the present
case, and probably in many other generically related texts, it does make sense to adopt
the approach undertaken here. The reasons for this are explained in 4.7, pp. 191-202.
Rather than saying that the advertisement is a sequence of discrete shots along with
the techniques of cutting which mark the transitions between shots or sequences
of shots, it is possible to analyse the text as a series of dialogic moves. In this view,
each dialogic move culminates in the peak of a wave, whereas the implicit response
on the part of the viewer coincides with the trough of the wave (see 4.11.7, pp. 239-42).
The question then becomes one of asking which particular copatternings of
selections co-occur in relation to either the wave peak or the wave trough at any
given stage of the text. In other words:
(1) how does the wave-like patterning contribute to the dialogic
organisation of the text as an interactive event?
(2) which dialogic moves and their responses are associated with
which particular participants and speaking and listening positions
in the discourse event?
Answers to these questions would enable the analyst to account for the ways in
which variations in the kind or degree of selections in any given semiotic modality
may impact upon the overall wave patterning. Figure 4.3 refers to Phase 1 of the
text, which lasts approximately 16 seconds. This figure, which presents the sound-
track as a visual display, was derived from the Timeline window of Adobe Premiere.
It shows the acoustic intensity of the soundtrack relative to its progression in time,
indicated in seconds. With specific reference to the chorus, which starts singing at
3.75 seconds, it shows that each speech act move of the chorus, from ‘roll them’ (three
times) to ‘roll them up’ (twice), in Phase 1 is represented as a clearly discernible
periodicity, i.e., a pulse or a surge of acoustic energy. Each such periodicity alternates
with a lull or a low point in the overall wave-cycle, corresponding to the brief
pauses between the singing of each clause. The approximate temporal scope of
each periodicity is also indicated in Figure 4.3. The five periodicities referred to are
as follows: (1) 3.75 to 6.50 seconds; (2) 8.0 to 10.0 seconds; (3) 10.5 to 14.0 sec-
onds; (4) 14.5 to 15.0 seconds; (5) 15.5 to 16.0 seconds.
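The alternation of pulse and lull can be made explicit with a short sketch. The Python below is purely illustrative and is not part of the original transcription: the names PERIODICITIES and lulls are hypothetical, and the interval values are simply the approximate figures listed above.

```python
# Approximate temporal scope (in seconds) of the five chorus periodicities
# in Phase 1, as listed in the text. The values are approximations only.
PERIODICITIES = [
    (3.75, 6.50),
    (8.0, 10.0),
    (10.5, 14.0),
    (14.5, 15.0),
    (15.5, 16.0),
]


def lulls(pulses):
    """Return the gaps between consecutive pulses as (start, end) pairs."""
    return [
        (prev_end, next_start)
        for (_, prev_end), (next_start, _) in zip(pulses, pulses[1:])
        if next_start > prev_end
    ]


if __name__ == "__main__":
    for start, end in lulls(PERIODICITIES):
        print(f"lull from {start:.2f} to {end:.2f} s ({end - start:.2f} s)")
```

On these figures there are four lulls between the five pulses, each lying between the end of one sung clause and the onset of the next.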
These figures are approximations only due to the limitations of the visual
display and concomitant time analysis. The lulls may thus be seen to correspond to
points where a change of interlocutor can potentially occur. In other words, each
lull is a potential response point to the imperatively realised command, as sung by
the chorus. Overall, each clause as sung in Phase 1 is best seen as a submove in an
overall dialogic move which receives its text-internal response with the transition to
Phase 2, when the female soloist sings her turn in response to the chorus. However,
the soloist does not only address the chorus. She addresses both the members of
the chorus and the viewer. For this reason, we can say that each lull, as described
above, correlates with a potential dialogic response, implicit or explicit, on the part
of, for example, the viewer. Of course, in this text the viewer is present only vir-
tually. The point is that the semiotically projected ideal response of the viewer is syn-
chronised with the rhythmic alternation of wave pulse and lull. Thus, we see that
the overall relationship between text and viewer is organised into a wave cycle. This
wave cycle is not, however, a mere acoustic property of the expression plane. It also
organises waves of meaning and potential courses of action on the content stratum.
That is, the flow of acoustic energy is also a flow of meaning through the entire
system of relations. This is also emphasised by the ways in which we see a number
of participants who actually do roll their sleeves up as the chorus sings (e.g. Shot 1,
Rows 3-10; Shot 2, Rows 11-12; Shot 4, Rows 15-16).
Figure 4.3 shows that the soundtrack can be seen as a series of cumulative
waves with their respective peaks and troughs. This raises a further question con-
Figure 4.3: Waves relating to the soundtrack in the first phase of the Westpac advertisement
coincide with the rhythmic accents of speech or music. An example is the sequence
showing the supervisor walking from right to left and the industrial plant in the
background (Shot 19, Rows 43-44). The camera distance is quite close, featuring head
and shoulders, which is typically synonymous with familiar interpersonal relations
(4.7.2, p. 195). The closeness of participant to viewer serves to accentuate the
natural movements of the head. At first, the head is roughly centre frame. This cor-
responds to the male speaker in the voiceover uttering the unaccented with. The
supervisor’s head then swings to the right of the frame, coinciding both with the
next word, the accented first syllable of the word money and the supervisor’s smile.
There is at this point a significant pause or juncture in the speech rhythm. In Visual
Frame/Row 44, the head swings to the left of the visual frame, a movement which
coincides with the manner circumstance with advice, again following the pattern of
accented syllables mentioned before. Thus, the salient accent here falls on the sylla-
ble -vice in the word advice. The supervisor’s smile prosodically extends across both
adverbial groups spoken by the male narrator and both visual frames, starting on
one accented syllable and ending on another before the cut to the next shot. Here
we have a microscopic slice of this synergetic co-operation among different inter-
acting variables, some originating from the natural rhythms of the participant’s body
as he walks, others synchronised in postproduction in such a way that a combination of
visual trimming and synchronisation of the rhythms of the male narrator’s off-scene
voice ensures that the centre-left-right swing of the head in walking, the accented
syllables in the male speaker’s speech, the rhythmic juncture or pause between the
two prepositional phrases – with money, with advice – in the speech of the male narra-
tor and the onset and subsequent development of the supervisor’s smile are all syn-
chronised, not on the basis of a single master plan but on the basis of these variables
fluctuating in a stable way so as to generate the local pattern described here.
As pointed out in Inset 7 (p. 47), a discoursal phase, following Gregory (1995,
2002), is a set of copatterned semiotic selections that are codeployed in a consis-
tent way over a given stretch of text. In the Westpac advertisement, there are five
main phases which exhibit an overall pattern either increasing towards a peak or
decreasing towards a trough. The text is not simply structured as a sequence of
alternating shots, but as a sequence of alternating turns between different voices,
sung, spoken and instrumental. The five phases in the Westpac text can be specified
in this way (see also 4.6.1, pp. 187-188).
The start of a given phase is indicated at the appropriate point in Column 6,
labelled Metafunctional Interpretation. Thus, the first phase is indicated by upper case
letters, and a number indicates which phase it is and its position in the text. The
subphases within any given phase are further specified by a lower case letter of the
alphabet in subscript. For example, the stretch of text which is headed by Phase 1b
refers to the second subphase of the first phase of the text, which extends from
Rows 11 to 14. The reason why the phase labelling is placed in Column 6 has to do
with the fact that any decision concerning where to draw the boundaries between one
phase, or subphase, and another is always motivated by criteria which involve all meta-
functions. As indicated in Column 1, Phase 1 extends over the first sixteen seconds of
the text. It is characterised by an overall increase towards a culminating peak. This may
be described as follows: Visual Frame/Row 1 shows the lone sheep herdsman with his
dog in the mythical vastness of the Australian outback. In this frame there is silence –
no sounds of any kind are heard during the first second of the soundtrack. The sheep
herdsman is seen as far away from the viewer. Implicit is the notion that the only dyadic
interaction is that between herdsman and dog, as then evidenced in Visual
Frames/Rows 2-4, when the herdsman beckons the dog back to his side. The solitary
life of the herdsman is then accompanied by the solo keyboard which interacts, in
contrapuntal fashion, with the sounds of the sheep in Column 5 (Rows 2-3). All of this
is initially offset by the vastness and silence of the Australian outback in this scene.
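The labelling convention for phases and subphases described earlier in this section can be sketched in code. The Python below is an illustration under stated assumptions rather than part of the transcription itself: the function name parse_phase_label and the regular expression are hypothetical, and the subscript letter is assumed to be typed inline, as in "Phase 1b".

```python
import re

# Hypothetical pattern for labels such as "PHASE 1" or "Phase 1b":
# a phase number followed by an optional lower case subphase letter.
LABEL = re.compile(r"^phase\s+(\d+)([a-z]?)$", re.IGNORECASE)


def parse_phase_label(label):
    """Return (phase_number, subphase_index or None) for a phase label."""
    match = LABEL.match(label.strip())
    if match is None:
        raise ValueError(f"not a phase label: {label!r}")
    number, letter = match.groups()
    # 'a' is the first subphase, 'b' the second, and so on.
    subphase = ord(letter.lower()) - ord("a") + 1 if letter else None
    return int(number), subphase
```

Read in this way, "Phase 1b" yields phase 1, subphase 2, which matches the gloss given above (the second subphase of the first phase).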
The transitions, boundaries or junctures between phases may be signalled in a
variety of ways on both the content and the expression planes (see Inset 18, pp. 236-
7). On the expression plane, a change, a break, or a pause in the rhythm of music,
speech, body movement, or cutting between shots coincides, generally speaking, with
the transition to a new phase or subphase. The same can be said of tempo, whether
visual, as in the movement of the camera, kinesic, having to do with the locomotory,
gestural, facial and other body movements of participants, or in the speech and
musical and other sounds of the soundtrack. On the content plane, there may be a cor-
responding shift in, for example, the visual or linguistic thematics, the
evaluative/interpersonal orientation (Lemke, 1988), or in the specific textual voice that
constitutes a given move in the text.
Perceptually speaking, transitions between phases are not always clear cut. This
means that it may be difficult to decide exactly where one phase ends and the other
begins as the boundaries between phases, rather than being segmental in character, are
continuous and, hence, blurred. This is in keeping with the wave-like or periodic char-
acter of the phase itself. Thus, the transition point may be characterised by a gradual
merging of features from the two phases in question as one phase decays or fades out
and the other comes into being. In the Westpac text, the transitions between phases
tend, overall, to be quite clear cut. Thus, the cut from one visual shot to another coin-
cides with a shift to a different musical or spoken voice in the turntaking sequencing
of, say, chorus, and female soloist. For example, the cut to the first appearance of the
Westpac logo moving forward at the beginning of Phase 2 (Row 17) perfectly coincides
with the first entry of the female soloist, singing and let’s get …, whereas this
imperative clause is not completed until the cut to Shot 6, Row 18, featuring the
schoolgirl at her desk. Her leaning forward copatterns with the word moving. The
typical of manual workers). Thus we see how the principal variant structures within
each individual shot are a text-developing strategy whereby global continuity and
coherence are enacted on the basis of local variation and change. Shot 1 also serves
as an orientational or establishment shot, even though its participants (herdsman and
dog) and spatial location are not seen in the remainder of the text. This may be
explained as follows.
On analogy with hyper-Theme in linguistic texts (see 2.4.2, pp. 74-77; see
also Daneš, 1974, 1989; Martin, 1992: Chap. 6), Shot 1 serves an important anchoring
function for the shots that follow it. In the visual semiotic, a textual hyper-Theme
is an introductory shot which functions to predict a particular pattern of thematic
development in successive shots in some phase or subphase or sequence of sub-
phases. This shot is hyper-Thematic in the sense that it functions to establish a
particular pattern of interaction among the other shots which realise this textual
subphase with respect to thematic choices and their development. Shot 1 thus has
an anchoring function because it serves to establish a global visual thematic meaning
which provides a textual basis for the development of the shots which follow it in
Phase 1. It is, therefore, prospective or anticipatory in character. In Subphase 1a, it
provides a thematic anchoring point whereby the shots that follow are linked to a
shared and developing network of (inter)textual thematic relations. Thus, in this
text the apparent problem of the lack of visual invariants that are common to each
successive shot is solved in a different way, namely the thematic continuity from shot
to shot is developed on the basis of each shot’s local contribution to a higher-order
visual thematic system. It is the visual salience accorded to the primary participant
– herdsman, draughtswoman, truck driver, nurse – along with performance indexes
(stereotypical work clothes, work setting, implements) that links all these shots
– Shots 1 to 4 – on the basis of a common thematic relation that may be glossed as
[TYPICAL OCCUPATIONAL ROLES]. In this respect, Shot 1 is interesting because it
instantiates in the form of the herdsman the archetype of the early pioneering hero
on which the historical myth of the Australian outback was founded.
Before concluding the current section, a brief comment on the relationship
between shot and phase seems to be in order. A shot, as defined above, is a con-
stituent in the visual semiotic. Given that the shot is specific to the visual semiotic
of video texts, it is legitimate to say that shots and the relations between shots are
intra-semiotic in character. A phase or subphase, on the other hand, is an inter-semiotic
notion. As defined in the present book, phases and their subphases are an intermediate
level of textual organisation that integrates microlevel selections of resources
from diverse semiotic modalities in a consistent way (see 1.6, pp. 46-54, 4.4, pp.
184-186 and Inset 7, p. 47) so as to achieve a global text organisation. The notion
of shot is subordinate to that of phase, given that shots are, in the Westpac text, just
one of the semiotic resources that combine with others to produce determinate
phases.
his sleeves up – that fall within the scope of this overall movement prosody con-
stitutes the New in this shot. Importantly, it is this combination of features that
constitutes the principal informational variant or transformation in the delimited optic
array of this shot. In contrast, the landscape, the sky, the sheep and the trees are
invariant structures throughout the duration of the shot. They can thus be con-
strued as Given. In these terms, the New information unit is constituted by the
prosodic modulating of salient informational variants, or transformations, against a
background of informational invariants that are construable as Given (see 4.11.3, pp.
228-230).
As far as progressive pictures are concerned, left-right horizontal structuring
per se proves in any case to be too static a notion to be really useful. On the other
hand, the equating of salience with a dynamic informational variant in the visual
topology of the text provides a semiotically (and perceptually) better motivated
criterion for specifying what a given text treats or presents as New for the viewer.
It thus seems more reasonable to consider each shot as a quantum of information –
with both variant and invariant factors – which can be organised in terms of one
or more salient or focal informational units of variable prosodic scope rather than a
fixed geometry of left versus right.
as he or she orients to this and moves within it. It is the ground – the earth’s surface –
which provides the viewer with his or her primary means of support and his or her
main reference point with respect to all other surfaces when orienting to and sampling
the ambient optical array (Gibson, 1986 [1979]: 16, 33; Thibault, 2004b: 26-30). Thus,
the ground may be seen as a principle of congruency with respect to the metaphor-
ical transformations that the head-body system undergoes in the virtual visual
kinaesthesis of all forms of visual text – drawings, paintings, photographs, scientific
diagrams, films, video games, CD-Roms, flight simulators, and so on.
In the case of a television advertisement such as the Westpac text, the total
visual system is constituted by the interactions among:
(1) the delimited optic array of the video screen, which changes over time
and which is projected to a potential point of observation;
(2) the information that the surface of the screen contains about
phenomena other than the physical surface itself, comprising (i) the
depicted world of the visual image projected on the screen’s
surface; and (ii) the camera movements – analogous to our
head-body movements – that orient to the depicted world;
(3) the viewer who occupies a point of observation in relation to the
video (TV) screen and the information that it projects.
[Figure: system network of options for camera position. Labels recoverable from the original: stationary; moving; panning; sideways; dolly; sagittal/tilting; forwards; perpendicular; backwards.]
4.7.2 Perspective
Perspective will be transcribed in terms of two basic possibilities, viz. horizontal and
vertical angle (Kress, Van Leeuwen, 1996: 140-8). Horizontal angles have to do with
degree of involvement in, or empathy with, the participants and so on in the depicted
world. There are two main options: the viewer is positioned directly in front of the
depicted world, or obliquely, i.e. at an angle. The former possibility increases the
viewer’s empathy with, and direct involvement in, the actions, events, and participants
of the depicted world; the latter suggests detachment, lack of involvement.
Horizontal perspective will be transcribed as follows: [HP: direct], [HP:
oblique], where, for example, [HP: oblique] indicates that the viewer is positioned
so as to view the depicted world from an oblique angle, as in Shot 23, Row 53.
Vertical perspective is concerned with the power, status and solidarity relations
between the viewer and the depicted world. There are three main options. The
viewer may: a) be positioned so as to look down at the depicted world as if from on
high. In this case, the viewer may be positioned as having power over the
participants in the depicted world, or as viewing this world from a detached,
depersonalised or objectified perspective, as is the case in aerial and bird’s eye
views; b) be placed on the same level as the depicted world in a relationship of
equality or solidarity; c) view the depicted world from below such that the viewer is
placed in a position of inferiority.
There are, of course, many gradations possible between these three points,
which are, therefore, best seen as located on a graded continuum of possibilities.
Further, the references to notions such as power, status, solidarity, objectification,
empathy and so on, are themselves interpretations which cannot be assigned to
these options on a one-to-one basis. Instead, such interpretations will need to be
made and justified on the basis of the copatternings of these options with others
in the text. For this reason, vertical perspective has been transcribed in terms of
the three basic possibilities of high, median and low, rather than on the basis of
specific interpretations. Thus, [VP: low], for example, means that in the vertical
perspective the viewer is positioned so as to view the depicted world from below
(e.g. Shot 3, Rows 13-14), and so on, as described above.
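The two perspective annotations can be modelled as a small record with closed sets of values. The sketch below is an illustration only; the class name Perspective and its fields are hypothetical, and the permitted values are the transcription options named above ([HP: direct], [HP: oblique]; [VP: high], [VP: median], [VP: low]). In keeping with the caveat above, the code records the option chosen and deliberately assigns no interpretation such as power or solidarity to it.

```python
from dataclasses import dataclass

# Transcription options named in the text. Vertical perspective is a graded
# continuum; these three values are only the basic transcription categories.
HORIZONTAL = ("direct", "oblique")
VERTICAL = ("high", "median", "low")


@dataclass(frozen=True)
class Perspective:
    """Horizontal and vertical perspective for one shot (hypothetical model)."""

    horizontal: str
    vertical: str

    def __post_init__(self):
        if self.horizontal not in HORIZONTAL:
            raise ValueError(f"unknown horizontal perspective: {self.horizontal!r}")
        if self.vertical not in VERTICAL:
            raise ValueError(f"unknown vertical perspective: {self.vertical!r}")

    def annotation(self):
        """Render the pair in the bracketed notation used in the transcription."""
        return f"[HP: {self.horizontal}] [VP: {self.vertical}]"


if __name__ == "__main__":
    # An invented combination, not a claim about any particular shot.
    print(Perspective("oblique", "low").annotation())
```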
4.7.3 Distance
A further important visual parameter which functions to orient the perspective of
the viewer is that of the virtual or simulated distance between viewer and the
depicted world of the image. This is a further aspect of the way in which the positioning
of the camera relative to the depicted world simulates visual kinaesthesis in
relation to the head-body position of the observer who occupies a point of observation.
[Figure 4.5: System network of basic options for gaze in video texts. Labels recoverable from the original network: close; self; depicted world; aversion; disengaged; self-involvement; orientation; viewer; mental process; off-screen; indeterminate (monitoring, etc.).]
Obviously, the viewer’s actual physical distance from the video screen and
the virtual distance as constructed by the camera in relation to the depicted world
are not to be confused here. The two extremes of maximally near and maximally far
relative to the position of the observer do not refer to the mathematical
abstractions of Cartesian geometry, where maximally far implicates the mathemat-
ical notion of infinity. Instead, these two extremes have to do with the here of the
nose and the there of the horizon relative to an observer who is located on the
ground (Gibson, 1986 [1979]: 117). Distance is an embodied notion for expressing
the relations between the nose-here and the horizon-there parameters and their trans-
formations and virtual simulations in visual texts. Distance is neither an objective
property of the physical world per se nor is it simply a matter of individual per-
ception. Instead, it is a meaning-making resource (Van Leeuwen, 1996: 90).
With these considerations in mind, we can postulate a cline of possibilities
from maximally close to maximally far, relative to the embodied perspective of the
observer. Visual images may simulate interpersonal closeness or distance between
viewer and the participants in the text (Kress, Van Leeuwen, 1996: 130-5). In visual
semiosis, these are transformations of the proxemic resources which regulate social-
interpersonal relations between interactants (Hall, 1972 [1963]). As we pointed out
in 1.4.1, close shots express intimacy and personality while distance depersonalises
and objectifies. A scale of degrees of closeness and distance can be postulated on
the basis of the following transcription conventions:
MAXIMALLY CLOSE
VCS = Very close shot (less than head and shoulders);
CS = Close shot (head and shoulders);
MCS = Medium close shot (human figure cut off at waist);
MLS = Medium long shot (full length of human figure);
LS = Long shot (human figure occupies approximately half the height of the image);
VLS = Very long shot (the distance is even greater);
MAXIMALLY DISTANT
In the transcription, distance relative to the nose-here perspective of the viewer, as
simulated by the camera relative to the depicted world of the text, will be annotated
as follows. For example, [D: CS] indicates a close shot (head and shoulders in the
case of a human participant, as in Shot 17, Row 41). In the Westpac text, most of
the depicted participants are human. For this reason, the basic notational
conventions as presented here should suffice. Clearly, some modifications may
have to be contemplated in the case of non-human participants (e.g. buildings,
landscapes) although the basic principles remain the same. Distance also interacts
with motion perspective as participants move towards or away from the observer.
Importantly, the Westpac text makes considerable use of the first of these
possibilities. This shows very clearly that distance and its virtual simulations in
visual semiosis are not a matter of abstract Cartesian geometrical space. Rather, they
are, in the first instance, a question of the number of paces along the ground
(Gibson, 1986 [1979]: 117) between some object, person, and so on, and the
observer, as specified by the interaction between the optical information that
specifies the camera/observer and the information that specifies the depicted world.
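Because the distance conventions form an ordered scale from maximally close to maximally distant, they lend themselves to a simple ordered representation. The following Python is a minimal sketch under that assumption; the names DISTANCE_SCALE, annotate and is_closer are hypothetical, and the six codes are those listed in the conventions above.

```python
# Shot-distance codes ordered from maximally close to maximally distant,
# with the glosses given in the transcription conventions.
DISTANCE_SCALE = {
    "VCS": "very close shot (less than head and shoulders)",
    "CS": "close shot (head and shoulders)",
    "MCS": "medium close shot (human figure cut off at waist)",
    "MLS": "medium long shot (full length of human figure)",
    "LS": "long shot (figure occupies about half the height of the image)",
    "VLS": "very long shot (the distance is even greater)",
}

# dicts preserve insertion order, so this list runs from close to distant.
_ORDER = list(DISTANCE_SCALE)


def annotate(code):
    """Render a distance code in the bracketed notation, e.g. [D: CS]."""
    if code not in DISTANCE_SCALE:
        raise ValueError(f"unknown distance code: {code!r}")
    return f"[D: {code}]"


def is_closer(first, second):
    """Return True if `first` simulates a closer distance than `second`."""
    return _ORDER.index(first) < _ORDER.index(second)
```

Treating the codes as positions on one scale makes it straightforward to compare the simulated distance of successive shots, although the scale would need adapting for non-human participants, as noted above.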
4.7.4 Visual collocation
The Westpac advertisement devotes considerable attention to details which may in
some way collocate with or otherwise index the performance role of the
participant(s) in a given shot. The term visual collocation (VC) is intended to indicate
those secondary items which do not have participant status, but which function to
specify either the role of the participant or the activity which he or she is perform-
ing. In the topological space of the visual field, such objects form, in relation to the
main participant(s), a distributionally associated set of relations. In the Westpac text,
their use borders on the stereotypical, insofar as each shot is characterised by a num-
ber of such objects which function to index some aspect of the participant, his/her
role, or the socially relevant location in which the depicted scene takes place. A
further subcategory is the use of dress, again quite stereotypical, which serves to
index the social role, class and gender of the participant. In transcribing such
objects, tools, and other performance indicators, the aim is not to write down every-
thing which appears in a given shot, which would be pointless and self-defeating.
Rather, the aim should be to note down with a fair degree of parsimony only those
items which are strictly relevant to the purposes of the transcription and subsequent
analysis.
The transcription of visual collocation will be illustrated here with reference
to the truck driver in Shot 3 (Rows 13-14). Thus, [VC: (body) tattoos on left arm;
(dress) blue work singlet; (location) cabin of truck; (role) truck driver]. This
example may be read as follows. The items in round brackets designate particular
subcategories of VC which co-occur within the same shot. In the Westpac text,
body, dress, location, and occupational role are especially relevant. Further, it is the
collocation of such features in a given visual field which together serves to index the
relevant situation or situation-type.
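The bracketed notation for visual collocation is regular enough to be read mechanically. The sketch below is one possible reading, offered as an illustration only: the function name parse_vc and the regular expression are hypothetical, and the code assumes that subcategories appear in round brackets and that items are separated by semicolons, as in the truck driver example.

```python
import re

# One item of a VC annotation: "(subcategory) description".
ITEM = re.compile(r"\((?P<category>[^)]+)\)\s*(?P<value>.+)")


def parse_vc(annotation):
    """Split a [VC: ...] annotation into {subcategory: [descriptions]}."""
    text = annotation.strip()
    if not (text.startswith("[VC:") and text.endswith("]")):
        raise ValueError(f"not a VC annotation: {annotation!r}")
    body = text[len("[VC:"):-1]
    result = {}
    for part in body.split(";"):
        match = ITEM.fullmatch(part.strip())
        if match is None:
            raise ValueError(f"malformed VC item: {part!r}")
        category = match["category"].strip()
        result.setdefault(category, []).append(match["value"].strip())
    return result


if __name__ == "__main__":
    example = ("[VC: (body) tattoos on left arm; (dress) blue work singlet; "
               "(location) cabin of truck; (role) truck driver]")
    for category, values in parse_vc(example).items():
        print(category, "->", ", ".join(values))
```

Grouping items by subcategory keeps co-occurring features of the same shot together, which is the relation the notion of collocation is meant to capture.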
The notion of collocation as used in Firthian and neo-Firthian approaches to
language (see Sinclair, 1991) has thus been adapted here to suggest the ways in
which, for example, given objects, ways of dressing, occupational roles and insti-
tutional locations have their typical patterns of distribution in a visual field even
though they can also occur independently of each other, with different functions in
different contexts, or in different distributionally associated relations. Therefore, any
given feature referred to in this way may have functions other than those which are
relevant to the analysis in the current text.
4.7.5 Visual salience
In a given visual text, some features will be perceived to be more salient than others
and, hence, to have greater informational prominence in the text or some part of
it. Visual salience (VS) is related to the articulation of the relationship between fig-
ure and ground, as studied by researchers of visual perception in the Gestalt tradi-
tion. Kanizsa (1980: 41) points out that in a given visual field a figure emerges with
respect to a ground on the basis of a number of interacting factors. The most
important of these include the relative size of the parts, their topological relations,
their types of margins, as well as spatial orientation (Kanizsa, 1980: 41-3). Salient
objects tend overall to occupy a smaller proportion of the total volume of the
visual field than does the background. Furthermore, salient objects tend to be
more substantial and distinct with respect to their background, both in terms of
solidity and colour. The background, by contrast, may be relatively indistinct,
lacking in detail, and exhibiting less compactness of colouring. These are generali-
sations only, and each individual case may make use of these in different ways, or
it may make use of only some of these possibilities. In order to transcribe visually
salient items in the image, a very simple notation [VS: draughtswoman] may be
adopted, which simply identifies the salient feature(s), namely the draughtswoman,
with reference to Shot 2 (Rows 11-12).
4.7.6 Colour
In a text such as the one to hand, there is a need to transcribe specific colours
which have a special salience or significance in the text – hence the reproduction
of Appendix I in colour. In the Westpac text, the colours red and blue are particu-
larly significant. This is so because of the function of these colours in tying various
features of the text together on the basis of shared covariate ties. For example, the
occurrence of the colour red clearly associates with, and indexes, Westpac, and
serves textually to link various participants and circumstances to Westpac. This is
evident, for example, in the red Westpac logo on the helicopter pilot’s flying suit in
Shot 16 (Row 38) and in the red ties worn by the business executives in Shot 23,
Rows 53-55. However, it should be emphasised that colour is not an isolate – it is
not a question of a pure chromatic quality – but has its significance in relation to
other features of the visual field with which it is integrated. A good example in the
text is the Westpac logo, which appears three times (Shots 5, 9, 15, corresponding
to Rows 17, 24-25, and 37, respectively). Whereas the red of the W-shaped logo
exhibits qualities of surface and texture, as is typical of the surface colours of a well-
defined object located in three dimensional space, the two shades of blue which,
respectively, characterise the sky and the sea in these shots, are conspicuously less
substantial and less consistent, thus denoting something more fluid, without density
and precise contours. Respectively, they characterise colours of surfaces (e.g. the
logo) and colours of films or sheets (e.g. the sky and the ocean). A further possibility
4.7.7 Coding orientation
Bernstein’s (1990 [1981]) notion of coding orientation (CO) has been used by Kress
and Van Leeuwen (1996) to distinguish a number of different orientations to
reality in visual semiosis. They distinguish three main coding orientations – the
naturalistic, the sensory/sensual, and the hyperreal. These make different validity
claims with respect to the truthfulness or degree of correspondence to reality as
we normally perceive it in everyday perception. Thus, the coding orientation of a
visual text is related to the extent to which, and the way in which, it is abstracted away
from our everyday ecology of ambient visual perception. Different visual genres and
texts may exploit or even combine these possibilities in different ways.
The Westpac text is no exception to this. Generally speaking, advertisements
tend to prefer the saturated colours typical of the sensory/sensual coding orientation,
along with an appeal to the hyperreal world of our dreams, desires and fantasies. In
the latter, colours are less dense, less consistent, more misty. However, neither of
these excludes the naturalistic, and all may be used in varying ways in different parts
of the same text. In the naturalistic coding orientation, colour and other features are
deemed to correspond closely to our everyday perception of the world under nor-
mal conditions. Coding orientation is transcribed in this way: [CO: naturalistic], as
in the case of Shot 1, Rows 1-10. On the other hand, the logo (Shots 5, 9, 15) is
transcribed as follows: [CO: sensory; hyperreal]. In this case, there are elements of
both orientations.
Kress, Van Leeuwen, 1996: 121-2). In this way, participants who look directly at the
viewer simulate an interactive relation with the viewer. This may be accompanied
by other kinesic features such as a smile, a wink of the eye, a sarcastic expression
on the face, knitted eyebrows, chin thrust forward and so on (see 4.8.2, pp. 206-209).
The absence of direct eye contact correspondingly suggests the absence of
an interactive or interpersonal relation between viewer and textual participant. In
this case, textual participants are like third person participants and, hence, are seen
as not directly implicated in a dialogic relationship with the viewer. Figure 4.5 pro-
poses a system network for the system of gaze.
The Visual Focus (VF) of a given participant may also serve to establish a
gaze vector with another participant. This is the case at the end of Shot 6 (Row 20),
where father and son establish mutual contact in this way, i.e. by looking directly at
each other. The vector that links the two sets of eyes is easily discernible in this case.
Here, the function would be to support the affiliative bond of solidarity between
them. However, the primary purpose of the transcription is to establish on formal
grounds the nature of the participant’s gaze on the basis of several interacting variables.
The first of these has to do with the specific focus of the participant’s gaze.
That is, to what is the gaze vector directed or extended? Gaze vectors may extend to
the eyes of another participant, as described above. Alternatively, they may focus on
some other part of the other’s body or some aspect of their clothing.
Furthermore, the participant’s gaze may be directed to some aspect of the self,
such as the hands, in order to suggest self-involvement, self-enclosure, or submission
(see Goffman, 1985 [1976]: 65). This is the case of the boy in Shot 6, Row 18. Gaze
may also be directed to some object within the immediate purview of the
participant’s personal body space or, alternatively, to some more remote object outside
this space. A participant’s gaze may also be disengaged from the immediate scene
in order to suggest either withdrawal of one’s participation or inner cognition analogous
to mental process verbs in language. The gaze vector of the participant may
also extend to some indeterminate point outside the visual field of the video screen in
order to suggest a monitoring function, a sense of readiness, or expectation.
Another variable is that of distance. In this case, the basic possibilities are
close, median, and far. The two sets of variables – focus or direction of gaze vector,
and distance – will need to be accounted for in the transcription.
In the transcription, the various possibilities outlined above will be
annotated as presented in Figure 4.5. Visual focus will thus be transcribed as shown
with reference to Shot 1. Thus, [VF: distance: far; orientation: off-screen]. With
specific reference to Visual Frame/Row 10, this tells us that the gaze vector of the
herdsman extends off-screen to something indeterminate in the distance and not
seen by the viewer.
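The bracketed visual focus annotation lends itself to mechanical checking, which is relevant to the criterion of computer retrievability. The following is a minimal sketch only, assuming that every annotation follows the pattern of the example given above; the names `VisualFocus` and `parse_visual_focus` are hypothetical and are not part of the transcription system itself.

```python
import re
from dataclasses import dataclass

# The three distance values named in the text. Orientation is left as free
# text because only "off-screen" is attested in the example above.
DISTANCES = {"close", "median", "far"}


@dataclass(frozen=True)
class VisualFocus:
    distance: str
    orientation: str


# Assumed shape: [VF: distance: <value>; orientation: <value>]
_VF_PATTERN = re.compile(
    r"\[VF:\s*distance:\s*(?P<distance>[^;\]]+?)\s*;"
    r"\s*orientation:\s*(?P<orientation>[^\]]+?)\s*\]"
)


def parse_visual_focus(annotation: str) -> VisualFocus:
    """Parse one visual focus annotation into its two variables."""
    match = _VF_PATTERN.fullmatch(annotation.strip())
    if match is None:
        raise ValueError(f"not a visual focus annotation: {annotation!r}")
    distance = match.group("distance").lower()
    if distance not in DISTANCES:
        raise ValueError(f"unknown distance value: {distance!r}")
    return VisualFocus(distance=distance, orientation=match.group("orientation"))


if __name__ == "__main__":
    print(parse_visual_focus("[VF: distance: far; orientation: off-screen]"))
```

Keeping orientation as unrestricted text is a deliberate choice here: the possible gaze targets described above (eyes, body, self, near or remote object, off-screen) are not given fixed labels in this section, so none are invented.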
Two additional transcription conventions can also be noted with respect to
the visual frame (Column 2). With reference to any given shot, in some cases more
than one visual frame has been inserted in order to better illustrate a specific
micro-level development that would be inadequately presented by just one frame. Shot 21,
Visual Frames 47-50, is an example of this in that Row 47 contains two visual
frames. Finally, mentions of salient colours such as red in the Westpac text are
colour coded in Column 3 in order to indicate that this feature constitutes a
significant covariate tie in the overall texture of the text. Thus, such references to the
colour red are printed in red to highlight this aspect (see Appendix I).
movement such as Actor; Action; Goal and Agent/Initiator; Action; Reactor and
so on. The first of these configurations designates a movement which is construed
as intentionally performed by an Actor – the performer of the movement – and
which is directed towards some other participant (the Goal). The second refers to
an Agent who performs a primary movement which causes or instigates a
secondary movement in the Reactor. In this ergative perspective, the focus is on
causality rather than on intentionality. This second possibility implicates a hierarchy
in which an Agent performs a higher-order movement which causally brings about
a lower-order movement in a second participant. Furthermore, the given move-
ment may entail the drawing near of one participant to another to the point where
prolonged contact and even conjunction of the participants may occur as they
form a new unity. On the other hand, a participant may move away from or distance
him- or herself from another participant. This may also entail the dissolution of
previously existing structures or, in other words, a relationship of disjunction
between previously conjoined participants. However, the analogies to experiential
clause grammar should not be carried too far for reasons that are discussed below.
The two conditions of CONJUNCTION and DISJUNCTION are best seen as the
two polar extremes of a topological region in which various combinations and
gradings of these possibilities may occur. Thus, two movements performed by two
distinct participants in the same visual-spatial field may be related to each other
along a number of different parameters:
(1) simultaneity – immediate succession – succession after interval;
(2) concord – discord (in direction and/or orientation);
(3) sameness – difference (of speed of movement);
(4) sameness – difference (of type of movement);
(5) contact – spatial separation (of movements and respective participants).
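For the purposes of a machine-readable transcription, the five parameters listed above could be recorded as a single structured entry for each pair of movements. The sketch below is illustrative only. The class and field names are hypothetical, and the tally in `conjunctive_count` rests on the assumption (not stated in the text) that simultaneity, concord, sameness and contact are the poles that lean towards CONJUNCTION.

```python
from dataclasses import dataclass
from enum import Enum


class Timing(Enum):
    """Parameter (1): temporal relation between the two movements."""
    SIMULTANEITY = "simultaneity"
    IMMEDIATE_SUCCESSION = "immediate succession"
    SUCCESSION_AFTER_INTERVAL = "succession after interval"


@dataclass(frozen=True)
class MovementRelation:
    timing: Timing      # (1)
    concord: bool       # (2) concord in direction and/or orientation
    same_speed: bool    # (3) sameness of speed of movement
    same_type: bool     # (4) sameness of type of movement
    contact: bool       # (5) contact rather than spatial separation

    def conjunctive_count(self) -> int:
        """Number of parameters (0 to 5) at the pole assumed to favour conjunction.

        This is a crude grading device for illustration, not a measure
        proposed in the text.
        """
        return sum(
            [
                self.timing is Timing.SIMULTANEITY,
                self.concord,
                self.same_speed,
                self.same_type,
                self.contact,
            ]
        )


if __name__ == "__main__":
    relation = MovementRelation(Timing.SIMULTANEITY, True, True, False, True)
    print(relation.conjunctive_count())
```

A simple count of this kind flattens what the text describes as a topological region of combinations and gradings; it is offered only to show that intermediate cases between the two polar extremes can be represented at all.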
Bodily movement is not simply a passive movement in the geometric space of
classical physics. Rather, it actively assumes and appropriates both space and time in
the service of its own projects (Merleau-Ponty, 1992 [1962]: 102). In other words it,
too, is a meaning-making resource. Merleau-Ponty further shows that movement is
capable of creating an abstract space above and beyond the concrete physical space in
which the movement takes place in space and time. This space is a virtual projection
by the agent of the movement whereby the resources of the body are deployed so as
to creatively enact meanings which are not physically present in concrete physical
space in the Newtonian sense. Thus, the agent's body is the semiotic source of
meanings that are directed towards the other and which seek to engage the other.
Movement syntagms are organised on the basis of both their relationship to
the ecosocial environment in which they occur and which they help to constitute and
the spatial deployment of the body and/or body parts that perform the movement.
This implies some important differences in the way experiential meanings are realised
in movement as compared to the particulate or constituency-based organisation of
experiential meanings in language. In language, the participants, process and
circumstances in the clause are analysed as parts (constituents) which have specific functions
or roles in the overall syntagmatic structure – e.g. the clause – which functions as a
semiotic construal of reality. Language users thereby draw on the experiential
resources of grammar to analyse the phenomena of experience into the participants,
the process and the circumstances that comprise the given phenomenon. In language,
clauses are experientially organised in terms of more central process-participant
relations based on constituency and more peripheral circumstances based on relations
of interdependency (see also Tesnière, 1965: 40-6; McGregor, 1997: 168-9).
In movement, the equivalent of the case markings which ground the
experiential meaning in a certain way are the following: the body or body part per-
forming the movement and whether this is an instigator or a reactor; the move-
ment performed; the body or object, etc. which instigates, initiates or reacts to
some other body or object; the spatial location of the movement; the directionality
and the orientation of the movement; the time of occurrence of the movement;
the duration of the movement. In movement, simultaneity and spatiality rather
than linear succession in time and particulateness (constituency) are important in
the realisation of experiential event and action configurations.
For example, in Shot 19 (Rows 43-44) the process (walking) and the
participant performing the process (the supervisor) and the circumstance of spatial
location (the work site) are not linearly segmented as in the clause: the supervisor
(Actor) walked (Process) through the work site (Circumstance). Instead, process,
participant and circumstance are conflated into a single indissoluble configuration
in the visual-spatial field of simultaneous relations unfolding in time. In this
topological visual-spatial field, relations of interdependency among the different
components of the experiential configuration are established on a different basis
from that of the linguistic semiotic of clause grammar. In Shot 19 location is a sta-
ble visual invariant which remains in the background. In this sense, it is a periph-
eral circumstantial function rather than a central process-participant function. The
question of the relationship between informational variants and invariants has to
do with the ways in which the visual field is (1) segmented into distinct objects and
(2) categorised or attributed with a meaning by the perceiver (Kanizsa, 1991: 113-
5). These two logically, though not temporally, distinct operations lend support to
the stratified nature of visual semiosis (Thibault, 1997a: 229-30). The supervisor in
Shot 19 is a dynamic variant; as a participant function it perturbs the invariant back-
ground by virtue of the fact that he moves. It is the syntagmatic bringing together
of diverse visual forms so as to constitute a determinate visual field which makes
this possible. That is, the structured relations among the various forms – variant
and invariant – in this field mean that there are connections among them. On the
Secondly, movement may describe or designate some situation and evaluate this
situation by adopting a particular interpersonal orientation towards it, e.g. parody,
disgust, pleasure, disapproval and so on. For example, Shot 4 (Rows 15-6) shows the
nurse walking briskly towards the viewer. The interpersonal orientation may be
described as that of commitment and seriousness with respect to the task to hand (see
also Birdwhistell's (1972 [1961]: 96) category of self-possession/self-containment).
Thirdly, movement may indicate the performer's affective disposition not
only to the specific action that is being performed, but it may also index a more
general emotional state or state of mind. In the Westpac text, the movements of
the participants may be seen as indexing enthusiasm, willingness and confidence in
what they are doing.
Fourthly, a social agent may perform a given movement in order to represent
or otherwise recontextualise some action that was performed by another. The use
of movement in this sense is analogous to quoting and reporting speech and may
turn out to be a better way of explaining what people do when they use movement
to imitate the actions of others in their own discourse (Inset 9: Projection, p. 101).
No examples of this function of movement are attested in the Westpac text.
Fifthly, a movement may be evaluated according to whether it is performed
naturally, awkwardly, artificially, stiltedly, appropriately, gracefully and so on.
Analogous to the notion of vocal registers of singing and speaking (see 4.9.9, pp.
218-219), such variations in the kinesic dynamics of movement may be seen as
different movement registers. In the Westpac text, the participants express move-
ment registers of naturalness and appropriateness with respect to the circumstance
in which the movement occurs and/or the interpersonal orientation of the
participant to some other, such as another participant or the viewer.
Interpersonal modification of movement entails the shaping or deforming of
the movement according to the meaning it has in a specific interactional context.
Specific corporeal schemas, which are highly abstract in character and which have
a neurological basis, would appear to lie at the basis of this. Such schemas have a
predictive function which enables the individual to orient and adapt his or her bodily
movements to specific semiotic and material circumstances. The kinds of
interpersonal meanings adumbrated above cannot be reduced to the schema per se.
Rather, the schema is activated in a specific context in relation to other features –
other semiotic modalities, the addressee, selected aspects of the material world and
so on – all of which function contextually to ground the schema in meaningful
ways and hence to deform the individual's body according to a specific
interpersonal orientation. This means that the schema is a kind of embodied
movement grammar of a very abstract kind that can be modified so as to produce
particular contextual meanings. The deformation and shaping of bodily movement
is far from limited to human beings, but is also shown in the different ways in
which dogs and other animals deform their bodies according to varying interac-
created by the movement and built into its overall texture. A given movement
sequence is also structured in terms of peaks of prominence as well as various types
of boundary phenomena. There are, in this sense, onset phenomena, movement focus
phenomena, offset phenomena, and inter-movement phenomena which have a textual
function in demarcating the beginning-middle-end structure of the movement sequence
as a wave with peaks of prominence alternating with less prominent phases.
ing about them and transcribing them in a unified way rather than as entirely sepa-
rate phenomena. The guiding assumption is that the acoustic flux of the soundtrack
is a perceptual continuum constituting a delimited auditory array, which, however, lis-
teners can analyse or parse into different components of information that tell us
about a given source. The soundtrack is delimited rather than ambient because it
derives from a specific point source coming from a particular direction – the loud-
speakers, say – rather than the ambient auditory array which surrounds us and comes
from all directions as we move through and orient to a natural or urban environment.
here established between the linguistic and the visual construes a positive evaluative
orientation of the one in relation to the other. This suggests an interpersonal
dimension to such events and serves to orient the viewer's own evaluation of this
relationship and its implications for him or her.
Finally, acoustic events may form parts of larger wholes on the basis of
relations of foregrounding, backgrounding, spatial location, distance from the
listener, relations of dependency with other events, and so on (see Inset 16 on the
following page). They therefore have properties of textuality as part of a larger
Gestalt to which they belong. An example of this is the relationship between male
speaker and musical accompaniment in relation to Shots 13 to 23 (Rows 31-53). In
this long sequence, the instrumental music accompanies the male speaker. The latter
is always more prominent, whereas the former is backgrounded so as to perform a
supportive role rather than a primary or dominant one. In this sequence, the music
continuously accompanies the speaker and thus provides one important principle of
textual cohesion in this sequence. A further striking feature of the Westpac text is that
with the exception of the sounds of the sheep at the beginning, the listener does not
hear any of the ambient sounds which are typical of the various work sites, street
scenes, and so on that are depicted in the visual track. This suggests that the various
scenes which are depicted in the visual track are recontextualised (see Insets 1, p. 2 and
17, p. 213) along the acoustic dimension as well as along the visual and linguistic
dimensions.
slow, gradually increasing in volume and tempo as Phase 1 progresses. The female
chorus directly addresses the listener in a way that the dialogue between the sheep
sounds and the keyboard does not. This is reinforced both by the rising, crescendo-
like development of the chorus, as well as the imperative mood of the sung text. It
is significant that the appearance of the chorus is cued in by the drum beat.
Moreover, the crescendo-like development of the chorus, along with its rep-
etition and increasing tempo, as it sings Roll them, roll them, roll them up, ensures
that the chorus quickly becomes the dominant sound voice for the remainder of
Shot 1. The initial drum beat would appear to have two functions in relation to this.
First, the drum stands for power and dynamism, in contrast to the lonely and
introspective quality of the dialogue between sheep sounds and keyboard.
This is after all what the chorus and its sung text are all about.
Secondly, the drum beat is also the initial cue for the instrumental accompaniment
which underlies and supports the chorus throughout the remainder of Shot 1.
This sets up a relationship between the dominant sound voice of the chorus
and the non-dominant voice of the instrumental accompaniment that continues
Inset 16: Perspective in sound: Van Leeuwen on Figure, Ground and Field
• Sound events are hierarchically related to each other. In the soundtrack of a film,
different elements in the overall event are organised into subgroups which are
coarticulated in relation to each other. Some sounds are in the foreground and are focal;
others are in the background; others are somewhere in between these possibilities. Van
Leeuwen (1999: 23) refers to the hierarchical grouping of sounds and their coarticula-
tion in terms of a three-way distinction between Figure, Ground and Field.
• The Figure is the sound which is the focus of interest. It is the sound or sound group
that is treated as the most salient and which the listener is most required to engage
with or attend to. The Figure tends to stand out against both the Ground and the
Field. The Ground functions as the setting or context; it is a minor, non-salient
component of the listener's social world, though the listener is still able to orient to
it and to evaluate it as interpersonally significant. The Field is the physical place
where the observation takes place. The Field consists of sounds which index or in
some way characterise the soundscape of the listener, though the listener is not
expected to orient to them or to take up a particular evaluative stance on them.
• The above distinction also shows that sounds that are heard simultaneously are parsed
into groups and related to each other hierarchically as different sound events with
different locations and sources as well as different degrees of salience and relevance to
the listener. This further implies that sounds can be coarticulated in relation to each
other in some overall soundscape in, for example, the soundtrack of a television
advertisement.
• In Phase 1 of the Mitsubishi Carisma advertisement, the voices of the man and
woman are the Figure; the orchestra is the Ground. The sound of the telephone box
door closing at the end of this phase is the only example of a sound event which
functions as Field in this phase (see 1.6.1, pp. 51-54).
throughout Phase 1a (Column 5, Rows 4-10). In other words, the mixing of the
sounds of the musical instrumentation and the chorus is done in such a way as to
ensure that the chorus is always the dominant voice on account of its relative loud-
ness with respect to the musical accompaniment. This is so at all stages including
the very soft and slow way in which the chorus starts, only to become considerably
louder and quicker towards the end of Phase 1a. What is the significance of this?
The quiet, almost mystical, way in which the chorus begins singing suggests a
prayer-like communion with one's surroundings – cf. the natural landscape in the
visual track – or a mystical union with nature. Gradually, this gives way to the
quicker, more rhythmic character of the choral singing when the initial roll them, roll
them is expanded to the complete clause roll them up. The chorus is an all-female one
and the individual singing voices are highly blended to produce a markedly
homogeneous or unified sound quality. In such cases, the individual differences in pitch,
rhythm, voice dynamics and so on, of the individual members of the chorus tend
to be attracted to an average pitch whereby the individual differences are minimised
in the service of orchestral or choral unity (Schoenberg, 1975: 151).
• The purpose of this inset is to draw attention to the ways in which a text
recontextualises material social practices and activities from the social world known
to television viewers or website users. In each case, the initial practice in the social
world is inserted into, and recontextualised by, another set of practices. Most of the
scenes in the Westpac advertisement, for example, refer to typical social practices in
the working world of Australians. For instance, bakers typically make bread and sell
it to the public. However, the advertisement recontextualises the practices of the
baker and combines these with many other such recontextualisations.
• Most importantly, it does so in ways which transform the original social practices in
accordance with the goals and values of the recontextualising practices of
advertising agents and their clients (e.g. the Westpac Banking Corporation). The text is not so
much concerned with what bakers, nurses, bricklayers do, but with a very different
strategy in which the often diverse and conflicting social viewpoints and values which
are represented by the many different categories of participant in a given community
(young, old, male, female, workers, bosses, rural, city, religious, secular, corporate,
individual, etc.) are removed and transformed into a single viewpoint which elimi-
nates or downplays the differences among them. In the Westpac advertisement, this
has to do with the manufacturing of consent about two main issues:
(1) ideologically justifying the new merger of banks, which in 1983 led to the
creation of the Westpac corporation in Australia;
(2) constructing a corporate national identity founded on a work ethic and on cer-
tain national myths and archetypes.
All this is significant to the meaning of the text. The quasi-mystical start to
the chorus, along with its slow crescendo-like development, suggests a gradual
emerging from one's individuality as this is harmonised with the wider social world
in the service of a larger actional project which is meant to involve all Australians.
The chorus is the dominant sound voice here because it is that which directly enters
into a dialogic relationship with the listener, exhorting him or her to be part of this
wider project. In the transcription of the soundtrack, no attempt is made to use
musical notation. There are three reasons for this. First, aside from the problems of
accessibility for those who do not read music, there is also the important question
of finding common ground that can be adapted to all of the various components
of the soundtrack: speech, music, other sounds. Secondly, the transcription is
interested in revealing the semiotic integration of different acoustic phenomena.
Thirdly, it is important to preserve the criterion of computer retrievability.
is important that multimodal text transcription show how meaningful units of the
text are chunked and, hence, recognised by observers on the basis of their organi-
sation into rhythmic units, the integration of such units into still higher units, and
the transition points or the boundaries between units. In multimodal transcription,
the emphasis is not on speech or other sources of rhythm per se, but rather on the
multimodal integration of different sources of rhythm in a given text. The fact of
their integration does not change the important point that a particular rhythmic
source may be dominant. Moreover, a number of researchers have independently
shown that there is a good deal of common ground in the organisational principles
which subtend rhythm in, say, speech and gesture.
For example, McNeill (1992: 85) draws attention to the parallels between the
hierarchy of units that comprises the phonological structure of a given language and
an analogous hierarchy of units in the kinesic structure of gesture. This is hardly sur-
prising given that the basis of their synchronisation in discourse lies in the sensori-
motor activities of the body and its natural rhythms. Furthermore, both kinds of
body rhythm are experienced as movement. Abercrombie (1967: 97) talks about how
both speakers and listeners enter into reciprocally felt phonetic empathy on the basis of
8   [ ]          = source of spoken voice off-screen, not shown in depicted world (e.g. Row 31, Column 5 ff); the symbol is used at the start and end of the sequence, in this case Row 31 and Row 57, respectively
9   [ sheep]     = other non-speech or non-musical sounds, including silence, followed by a brief verbal specification of the specific sound: in the present example, the sun symbol followed by the word ‘sheep’ designates the (non-linguistic/non-musical) sounds of the sheep, as in e.g. Row 2, Column 5
10  [ silence]   = silence other than rhythmic pause or juncture in speech and/or music (e.g. Row 1, Column 5)
the speech rhythms they experience as embodied movements. Thus, listeners extract
information about the speaker's articulatory movements from the speech sounds
which they hear and, on this basis, are able to enter into a relation of felt rhythmic
empathy with the speaker. This empathy is an important, if largely intuitive, con-
tributing factor to the synchronisation of speaker and listener in spoken interaction
(see Gumperz and Berenz, 1993: 106). On the basis of such observations, rhythmic units,
the transitions between these and related phenomena such as rhythmic accents in
speech, music and bodily action will be transcribed on the basis of a single notation.
4.9.7. Rhythm groups
Van Leeuwen (1985: 225) points out that a given sequence of accented and
unaccented units is organised into a higher-order unit on the basis of a perceived
rhythmic regularity within the sequence. When this regularity is perceived to be per-
turbed by a pause or slowing down, then the given movement sequence is felt to
come to an end. On this basis, it is possible to establish what Van Leeuwen calls
rhythm groups and the boundaries or transitions between these. In the present
transcription, such boundaries will be indicated by a double-slash followed by a spec-
ification of the type of transition (e.g. pause, change in tempo) in the following man-
ner: [//PAUSE], [//SLOW], and so on, where the double-slash indicates a boundary
between rhythm groups and where the linguistic gloss in upper case subcategorises
this according to type of boundary, i.e. pause, slowing of tempo, etc. An example of
this notation may be seen in Row 43, Column 5, where the sign [//(#)] indicates a
pause or juncture after the word money. The most prominent rhythmic unit – cf. the
nuclear accent in the tradition of phonological analysis (Crystal 1972: 111; 1982: 11) –
is the nuclear accent and constitutes the nucleus of the rhythmic group. Nuclear
accent is shown in the transcription by placing the following notation before the unit
in question. Thus: (NA) as in Rows 32, 34, 36, and 38.
In Column 5, rhythm groups are indicated by heading the group in question
with the following notational symbol: {RG}, which is placed before any other nota-
tional symbols at the beginning of the group in question. Two examples in Column
5 are to be found in Rows 25 and 54. In turn, rhythm groups are, generally speaking,
integrated into still higher-order units which tend to correspond to the subphases and
the phases of the text. In the Westpac text, the soundtrack plays a critically important
role in specifying the shifts from one subphase or phase to another in the text. For
example, Phase 1c is the culmination of a wave-like development that has gradually
developed in Phase 1 as a whole on the basis of a number of rhythm cycles or
periodicities in the soundtrack (see 4.9, pp. 209-222).
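Because the rhythm notation is made up of a small set of bracketed signs, a Column 5 entry can in principle be split into its signs and its text by a simple tokeniser. What follows is a sketch under stated assumptions: it covers only the three signs introduced in this section, that is {RG}, (NA) and the double-slash boundary; the function name `tokenize` and the sample line used with it are hypothetical and are not taken from Appendix I.

```python
import re
from typing import List, Tuple

# One alternative per sign described in 4.9.7; anything else that is not
# whitespace or a bracket character is treated as transcribed text.
_TOKEN = re.compile(
    r"(?P<group>\{RG\})"            # heads a rhythm group
    r"|(?P<nuclear>\(NA\))"         # precedes the unit bearing nuclear accent
    r"|\[//(?P<boundary>[^\]]+)\]"  # boundary plus its type, e.g. PAUSE, SLOW
    r"|(?P<text>[^\s{}()\[\]]+)"    # ordinary transcribed material
)


def tokenize(entry: str) -> List[Tuple[str, str]]:
    """Split one transcription entry into (kind, value) pairs."""
    tokens = []
    for match in _TOKEN.finditer(entry):
        kind = match.lastgroup
        tokens.append((kind, match.group(kind)))
    return tokens


if __name__ == "__main__":
    # Invented sample entry, for illustration only.
    for token in tokenize("{RG} (NA) money [//PAUSE]"):
        print(token)
```

For a boundary, only the gloss inside the brackets is kept as the value, so that [//PAUSE] and [//(#)] can later be compared or counted by type.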
In the final subphase – Phase 1c – the full clause roll them up is sung twice; the
tempo is fast and the volume loud. With respect to the previous singing of the chorus,
this is a distinct shift in the text's dynamics. This shift coincides with Shot 4, Rows 15-6.
In relation to the visual text, Shot 4 could be seen as a continuation and further
microlevel development of the visual thematics of workers on the job, as also shown
in Shots 2 and 3.
However, it is the marked shift in the dynamics of the soundtrack which justifies
the analytical decision to see Shot 4 as belonging to a new subphase, rather than as
continuing the previous one. This also illustrates how the codeployment of different
semiotic resources provides principles of both continuity and change in textual
dynamics. In this case, the visual thematics is based on continuity, whereas the change
in the rhythm in the soundtrack in Phase 1c corresponds to criteria of change.
4.9.8. Degree of loudness
Degree of loudness has to do with what Abercrombie (1967: 95) calls, with reference
to speech, the degree of force with which air is expelled from the lungs during phona-
tion. While loudness is clearly a relative notion in the sense that different speakers of
the same language and even speakers of different languages may have a typical range
which is characteristic of that speaker, it is possible to notate degree of loudness in
the transcription by getting a feel for the overall volume range of a given speaker,
singer, musical performance and so on. It is also possible to postulate a continuum
of possibilities ranging, for example, from sub-vocalising, whispering, speaking softly,
speaking normally, speaking loudly, shouting and yelling to screaming. Furthermore,
volume can be controlled by electronic and mechanical means so as to produce the
desired effect in a specific context. Abercrombie's claim that loudness has little
linguistic importance (1967: 96) can therefore be questioned. That is, we need to
reconstitute such notions within a much more embodied notion of what linguistic
and other modalities of meaning-making are and how they function in context.
4.9.10. Tempo
In speech, tempo refers to rate of syllable-succession and has to do with the num-
ber of syllables per chest-pulse, also called breath-pulse or syllable-pulse (Abercrombie,
1967: 96). These pulses are periodic or wave-like in character and occur on cycles of
greater and lesser muscular activity when, in the former case, more effort than usual
is expended to expel air from the lungs, producing stressed syllables. A cycle is
defined by the alternation of phases of less effort with a greater effort in order to
produce a stressed syllable. Speakers vary the tempo of their speech considerably and
this variation in speech tempo may be seen as one index, along with others, of the
specific organismic and contextual variables that are in operation.
In the transcription, tempo will be indicated by a simple three-way distinction
between slow, median, fast, as follows: Tempo: S, M, F, as in Column 5, Rows 25 and
31. These signs will be placed immediately prior to the stretch of text or the item
in question. Tempo is also a relevant factor in body movement, and the same signs
will be used indifferently to specify tempo in both the auditory and kinesic dimen-
sions. It should also be emphasised that tempo is a relative factor which varies in
concert with other factors and has no fixed meaning of its own. In the Westpac text,
for example, the tempo of the chorus in Phase 1 starts quite slowly, to become very
quick at the end of this phase. We might say that the increase in both tempo and
volume that characterises this development constitutes one dimension of the over-
all textual work that is undertaken to exhort people to act in a certain way. In the
case of the male speaker in Phase 4 (Column 5, Rows 31-53), the tempo of his voice
is quite regular, in ways that are atypical of spontaneous conversation, where
tempo fluctuates considerably. In this case, the male speaker is the voice of Westpac
itself. The tempo is consistently moderately fast, with little variation or fluctuation,
and this contributes to the authoritative positioning of the speaker as one who
speaks with confidence, leadership and assertiveness.
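The three-way tempo distinction, and the rule that the sign is placed immediately before the stretch it qualifies, can be expressed very compactly. This is a minimal sketch only: the names `Tempo` and `mark_tempo` are hypothetical, and the exact spelling of the sign in Column 5 is assumed from the wording "Tempo: S, M, F" above rather than checked against Appendix I.

```python
from enum import Enum


class Tempo(Enum):
    """Relative tempo; the same signs serve for sound and for body movement."""
    SLOW = "S"
    MEDIAN = "M"
    FAST = "F"


def mark_tempo(tempo: Tempo, stretch: str) -> str:
    """Place the tempo sign immediately before the stretch of text or item."""
    return f"Tempo: {tempo.value} {stretch}"


if __name__ == "__main__":
    print(mark_tempo(Tempo.FAST, "roll them up"))
```

No numerical thresholds are attached to the three values, in keeping with the point made above that tempo is a relative factor with no fixed meaning of its own.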
develops from the subdued, quasi-mystical tone at the beginning through to the
rousing, ecstatic quality of the final roll them up, which is loud and fast.
Intertextually, this references a tradition of religious choral music, as in the choral
works of Bach and Händel. The overall meaning is the celebration of, and the iden-
tification with, the meanings of the chorus so that both the individual members of
the chorus and the audience are collectively bound to these meanings and values in
the celebration of something more exalted.
The lead singer responds to the chorus not by way of mere reaction. Instead,
her contribution to the dialogue constitutes a further development of the meaning
of the chorus, as also highlighted by the paratactic conjunction of extension ‘and’
(Halliday, 1994 [1985]: 230-2). The conjunction construes an explicit link of the
additive type between the dialogic move of the chorus and that of the lead singer.
As befits her role as exemplar, she extends the more formulaic meaning of the
chorus at the same time that she proposes to them (and to the listener/viewer) an
exemplary role model to follow.
In the Westpac text, the dialogue between chorus and lead singer is orderly
and sequential. There is no overlap or interruption. This in itself suggests the har-
monising of their purposes rather than conflict and competition. Importantly, the
chorus is supported by an instrumental accompaniment, whereas the soloist sings
alone. Later in the text, the male speaker – the voice of Westpac, of authority – is
also accompanied by a simultaneous instrumental support (Rows 31-53, Column 5).
In the first case, the accompaniment is another textual voice which harmonises with
that of the chorus in conformity with its overall modal orientation. The exhortative
orientation of the chorus receives further support from the musical accompani-
ment, which does not compete with it. In the second case, the male speaker is clearly
dominant and assertive – the voice of power and leadership – and the instrumental
support again serves to reinforce this role by remaining in the background and in
no way creating discord with the speaker (see Van Leeuwen 1991: 76). Perhaps it is
possible to say here that the instrumental support has resolved the tension between
chorus and lead singer into a single (purely instrumental) voice which has now been
harmonised to the goals of Westpac. In other words, both voices have now been
fully subordinated to the dominant discursive voice of Westpac.
For the purposes of the transcription, the salient distinctions are as follows:
sequential (SE), simultaneous (SI), initiating (I), and responding (R). Thus, in Row 17,
Column 5, for example, the female soloist is glossed as (R) to indicate that she is
responding to the previous turn, which is sung by the chorus.
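The four labels above can be sketched as a small data structure. This is only an illustrative encoding, not part of the transcription method itself: the class and field names (`SoundtrackGloss`, `voice`, and so on) are assumptions, and the single record reproduces the one case the text specifies (Row 17, the female soloist responding to the chorus, within a dialogue described as sequential).

```python
from dataclasses import dataclass
from enum import Enum


class Timing(Enum):
    SEQUENTIAL = "SE"    # turns follow one another without overlap
    SIMULTANEOUS = "SI"  # voices or instruments sound together


class Move(Enum):
    INITIATING = "I"
    RESPONDING = "R"


@dataclass(frozen=True)
class SoundtrackGloss:
    row: int        # row number as given in Column 1
    voice: str      # free-text participant label (hypothetical field)
    move: Move
    timing: Timing

    def label(self) -> str:
        """Return the shorthand written into a Column 5 cell, e.g. '(R)'."""
        return f"({self.move.value})"


# Row 17: the female soloist responds to the preceding chorus turn.
soloist = SoundtrackGloss(row=17, voice="female soloist",
                          move=Move.RESPONDING, timing=Timing.SEQUENTIAL)
print(soloist.label())  # (R)
```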
99). Thus, in singing there are said to be upper, middle, and lower registers. It is in its
original musical sense that the term is used here. Abercrombie also suggests that
the speaking voice has different registers for expressing a range of
different emotions: anger, tenderness, impatience, and so on. Moreover, speakers
may switch speaking (or singing) registers as they emotionally modulate their
discourse in different ways.
The above labels are somewhat impressionistic, but they can give us some
clues as to how we might gloss changes in the register of the speaking and singing
voice when we are transcribing multimodal texts, though without necessarily going
into the articulatory details which underlie this. Thus, the chorus may be said to
move from a register of the mystical to the ecstatic in which all voices are united in
the service of a higher cause. The lead singer sings at a higher pitch level than the
chorus in a tradition of pop and rock female singers, as distinct from the allusions
to the tradition of choral music in the chorus. Thus, the register here is a folksy,
individualistic one, as distinct from the upper, yet dark, registers of the tragic
heroine (e.g. Brünnhilde or Isolde) in a Wagnerian opera. The register of the male
speaker is that of the radio or television commentator who provides an authoritative
interpretation of events: fast-paced, assertive and monologic in orientation,
not amenable to dialogic interrogation or interruption.
Typically, speakers and singers have a range of voice dynamics which they
variously deploy according to the context, the discourse genre, as well as more sub-
jective factors to do with their physical or psychological states.
The assumption that the metafunctions (see Inset 4, pp. 22-23) are spread across all
the resources used constitutes a unifying principle for thinking about multimodal-
ity. The decision to include this sixth Column also draws attention to the fact that
transcription and textual notation are never theory-neutral. Rather, they always make
assumptions both about the meaning of the text and about which meanings and
their modes of expression to foreground in a given analysis. The inclusion of this
column should help transcribers both to make this explicit and to provide
shorthand glosses on each successive phase of the text’s unfolding meaning.
and their metafunctional significance glossed with reference to the notation referred
to in the current section. The main purpose of Column 6 is, then, to provide a brief
summary of the metafunctional salience of the particular semiotic modalities that are
codeployed in a given subphase. Column 6 has, then, an important integrating
function, though in a different way from Column 1. Importantly, Column 6 consti-
tutes a departure from the integration of columns and rows which characterises the
rest of the table (see 4.2, pp. 174-181). This means that the metafunctional analysis
in Column 6 does not correlate with a row number as specified in Column 1. For this
reason, no rows are featured in Column 6. The motivation for this choice lies in the
fact that the information in Column 6 is specific to an entire subphase. Therefore, the
analytical scope of the commentary in Column 6 necessarily extends over the
equivalent of several rows. For example, the metafunctional analysis which refers to
Phase 1a extends over the section of text that stretches from Rows 1 to 10.
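The point that Column 6 is keyed to subphases rather than to individual rows can be sketched as follows. This is a minimal illustration under stated assumptions: the names `SubphaseCommentary` and `commentary_for_row` are hypothetical, the gloss text is a placeholder, and only the Phase 1a span (Rows 1 to 10) comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubphaseCommentary:
    subphase: str    # e.g. "1a"
    first_row: int   # first Column 1 row covered by the subphase
    last_row: int    # last Column 1 row covered (inclusive)
    gloss: str       # metafunctional summary (placeholder here)

    def covers(self, row: int) -> bool:
        return self.first_row <= row <= self.last_row


def commentary_for_row(row, commentaries):
    """Return the Column 6 entry whose subphase contains the given row."""
    for entry in commentaries:
        if entry.covers(row):
            return entry
    return None


# Phase 1a stretches from Rows 1 to 10, as stated in the text.
column_6 = [SubphaseCommentary("1a", 1, 10, "<metafunctional gloss for Phase 1a>")]
print(commentary_for_row(7, column_6).subphase)  # 1a
```

Looking a row up by range, rather than storing one gloss per row, mirrors the design choice described above: the commentary belongs to the subphase as a whole.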
4.11. Display and depiction: two sides of the same semiotic coin in visual texts
4.11.2. From delimited optic array to visual text: the stratification of the visual sign
The basic reality of the visual image is a delimited optic array (Gibson, 1986 [1979]:
270-273) that is projected onto the video screen by some sort of electronic device
such as a modulated scanning beam. In contrast to the ambient optic array that
affords the perceptual pickup of information about events in the environment of
the observer, the optic array that is projected onto the television screen is delimited
because it is confined to the screen rather than belonging to the environment that
surrounds the organism. The screen is a surface onto which optic information
about something other than the surface is projected. The surface of the screen dis-
plays to the viewer visual invariants and their transformations in time. That is, the
structure of the array undergoes change and transformation in time, and it is this
change and transformation which creates the effect of movement. Thus, changes
in the structure of the delimited optic array of the screen can provide stimulus
information that affords both the pickup of information about the movement of
persons, objects and so on, in the depicted world of the film and the pickup of
information about the movement of the viewer in relation to the depicted world
(Gibson, 1986 [1979]: 294). In other words, the optic array affords both visual event
perception and visual kinaesthesis.
The expression stratum of a video text consists of visual resources such as
lines, dots, the interplay of light and shade, colour, and so on. However,
information about visual invariants is not contained in the lines, the dots, or the
light and shade per se, but in the ways in which these are connected to and nested
within each other so as to create information about shapes, surfaces, textures and
many other features. Lines, dots and so on, are analogous to the phonetic
dimension of speech sounds. Hjelmslev (1961 [1943]) referred to this as the level
of expression substance. When lines, dots and so on, are connected to each other,
they display information about visual invariants and changes in these invariants in
the optic array. The many degrees of freedom of expression substance are shaped by
and entrained to the forms and categories of expression form. In spoken language,
this level is equivalent to the phonological system of a particular language. In visual
semiosis, expression form is equivalent to the stimulus information in the delimited
optic array that the viewer picks up with his or her perceptual systems. The
information that is picked up is in the form of structured ambient light which
specifies an environment for the observer (Gibson, 1986 [1979]: 51). Gibson calls
structured ambient light in the environment of the observer the ambient optic array
because it is ambient – i.e. it surrounds the observer – and affords the observer the
possibility of picking up structured stimulus information about the environment
that surrounds the observer. The notion of structure further implies that the stim-
ulus information has pattern, texture and configuration (Gibson, 1986 [1979]).
In the case of visual texts, the optic array is not ambient, but delimited, for the
reasons mentioned above. In any case, it is the information about structure and
pattern which the optic array – ambient or delimited – affords the observer and which
corresponds to the expression stratum of visual texts. An optic array therefore has
component parts – it is not homogeneous – and specifies information about objects,
events and relations between these in the environment of the observer or about
something other than the surface on which the delimited optic array of visual texts
is projected. The delimited optic array projected onto the video screen displays
information about (1) transformations, substitutions, nullifications of structure in
the optic array and (2) visual kinaesthesis in the observer. As shown in Table 4.1,
which is only concerned with visual semiosis, this information has properties of
metafunctional organisation which resonate with those on the content stratum.
Table 4.1 illustrates the relationship between the expression and content strata in
visual texts. It also shows how the two strata exhibit metafunctional forms of
organisation. The distinction between display and depiction can be explained, with ref-
erence to Table 4.1, as follows. The expression stratum of visual semiosis is based on
the display of visual invariants and their transformation on, say, a video screen; the
content stratum is based on the depiction of a visual scene consisting of actions, events,
persons, objects and so on in the depicted world. Display and depiction therefore per-
tain to the expression and content strata, respectively. They are, of course, two sides
of the same semiotic coin (see Inset 14, pp. 175-177 and Table 4.1 on the following
page). Display is concerned with getting the optic invariants and their transformations
to the viewer in a perceptible form as stimulus information. Stimulus information
takes the form of the visual invariants and their transformations that are traced or
projected onto a surface, such as a sheet of paper or a video screen, as a delimited optic
array (Inset 15: Gibson’s optic array, p. 192). Depiction is concerned with the interpretation
of this stimulus information as a visual scene consisting of meaningful actions,
events, participants, settings and so on. Actions and events and their associated
participant roles are realised by the resources of the visual grammar (content form) in
the form of shapes or volumes and the connections (e.g. vectors) between these.
In this book, we have used the term visual transitivity frame to talk about this
aspect of visual grammar and the meanings realised by it. Visual transitivity frames
consist of participants, which are realised by volumes or shapes, and processes, which
are realised by vectors and other dynamic features such as those proposed in Table
4.2. Visual transitivity pertains to the content stratum of depiction. The observations
made here have to do, above all, with the experiential metafunction, though it is
important to keep in mind, as Table 4.1 shows, that visual texts are organised along
metafunctional lines. Interpersonal, textual and logical meanings are also involved (see
also Baldry, 2000a; Baldry, Thibault, 2005; Kress, Van Leeuwen, 1996; Lemke, 1998;
O’Toole, 1994; Martin, Rose, 2003: 255-262; Martinec, 2000; Thibault, 2000a).
In Table 4.1, the discourse stratum (cf. Hjelmslev’s content substance; see Inset
18: Stratification, pp. 236-237) is the higher-scalar level that contextually integrates
and organises to its own level the units and their relations on the other lower-scalar
levels specified in Table 4.1. It is the global level of the text as a meaning-making
(communicative) event in a given social and cultural context. The discourse stratum
involves the entextualization of selections on the other levels of organisation as a
Expression purport: Perceptual pick up of stimulus information in ambient optic array about environmental events
Expression substance: Delivery of delimited optic array to a surface (screen); array contains information about things other than that surface
Content substance/discourse: Construal of visual grammar and its integration to social activities and practices; processes of entextualization in multimodal texts
Table 4.1: Stratification of video texts, showing both the relationship between the expression (display) and content strata (depiction) of visual signs
4.11.3. Transformations in the optic array: some examples from the Mitsubishi Carisma text
Transformations in the delimited optic array that is displayed on the screen involve
the following kinds of operations on features of the optic array: magnification;
diminishment; nullification; deletion; movement; accretion; addition; slippage; and
substitution. Table 4.2 sets out some of the main possibilities in Phase 1 of the
Mitsubishi Carisma advertisement.
In Shot 1, the movement of the car is a transformation of a visual invariant,
i.e. the position and the relative prominence of the car qua visual shape in the
optic array that is displayed in this scene. This is the salient change that occurs in
this shot against a background of other features (the telephone box and the urban
setting) that remain invariant. The transformation involves both movement and mag-
nification as the car increases in size relative to the total volume of the screen space
as it moves along the road from the background towards the centre of the screen
space. The telephone box is the salient object in Shot 1. The viewer’s visual focus
tends to be drawn towards this object on account of its visual salience: the tele-
phone box dominates the left top-to-bottom area of the screen and its bright red
colour contrasts with the subdued dark colours of the setting. This salience anchors
the shot relative to the movement of the car and suggests that there may be a
relationship between the two objects.
The cut to Shot 2 involves deletion of the car and the urban setting in Shot 1
at the same time that there is also magnification of the interior of the telephone
box. The transition from Shot 1 to Shot 2 thus involves the deletion of some invari-
ants and changes in others. In Shot 2, the salient transformation is the movement
of the woman’s hand when it picks up the telephone receiver. The telephone apparatus
itself remains invariant. In Shot 2, the visually salient object is the slightly
out-of-focus telephone apparatus inside the telephone box and, momentarily, the
woman’s gloved hand as it reaches towards and grasps the receiver.
In Shot 3, the invariant features are the interior of the telephone box and the
woman. The variant features are the movement transformations observable on her
face, head, and upper torso. The salient object is the woman in contrast to the non-
salient interior of the telephone box.
The transition to Shot 4 involves the deletion of all invariant features shown
in the preceding three shots and their substitution with a new set of invariant
features, viz. the head and face of the man and the out-of-focus background detail
of the subterranean setting. As in Shot 3, the variant feature here involves the head
and facial movements of the man. In this shot, the salient feature is the head and
face of the man in relation to the indistinct and non-salient background.
Shot 5 features the same invariant features that are seen in Shot 4 (the man’s
head and face) at the same time that further detail is added to the background
setting (the shark, the waterway, the bridge, the man suspended from the bridge).
The principal variants here all involve movements of various kinds. These are as
follows: (1) the movement of the man’s head as he turns his gaze to the hostage;
(2) the movement of the shark’s fin in the water below the hostage; and (3) the
leftward movement of the camera pan as it tracks the movement of the man’s head
and synchronises his gaze vector with the viewer’s. In Shot 5, the salient feature
remains the man’s head and face in relation to the less salient details of his setting.
Optic Array, Shot 1
Variant and Invariant Features: Invariant: telephone booth + setting
Salience within Shot: +salient: phone booth; –salient: setting
Location and Disposition of Shapes on Screen: Background + car + setting; Foreground: phone booth
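The shot-by-shot description above can be sketched as data. The following is an illustrative encoding rather than the authors' own notation: the `Shot` class, its field names and the feature labels are assumptions, and treating the magnified interior of the telephone box in Shot 2 as the same feature as the "telephone box" of Shot 1 is a simplification. The nine operation names are those listed at the start of this section.

```python
from dataclasses import dataclass, field
from enum import Enum


class Transformation(Enum):
    MAGNIFICATION = "magnification"
    DIMINISHMENT = "diminishment"
    NULLIFICATION = "nullification"
    DELETION = "deletion"
    MOVEMENT = "movement"
    ACCRETION = "accretion"
    ADDITION = "addition"
    SLIPPAGE = "slippage"
    SUBSTITUTION = "substitution"


@dataclass
class Shot:
    number: int
    invariant: set                               # features constant within the shot
    variant: dict = field(default_factory=dict)  # feature -> set of Transformation
    salient: str = ""

    def features(self) -> set:
        return self.invariant | set(self.variant)


def deleted_at_cut(before: Shot, after: Shot) -> set:
    """Features displayed before the cut that are no longer displayed after it."""
    return before.features() - after.features()


shot1 = Shot(1, invariant={"telephone box", "urban setting"},
             variant={"car": {Transformation.MOVEMENT, Transformation.MAGNIFICATION}},
             salient="telephone box")
shot2 = Shot(2, invariant={"telephone box", "telephone apparatus"},
             variant={"woman's hand": {Transformation.MOVEMENT}},
             salient="telephone apparatus")

print(sorted(deleted_at_cut(shot1, shot2)))  # ['car', 'urban setting']
```

The set difference recovers what the text says about the cut from Shot 1 to Shot 2: the car and the urban setting are deleted, while the telephone box persists.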
A number of features that are best described as part of the visual texture of
Phase 1 set up contrasting patterns that relate to the woman and the
man, respectively. For example, Shots 1-3, which are associated with the woman,
feature the use of oblique lines that incline towards the left (\) as well as the illu-
mination of the woman and the car she is driving. These features contrast with the
oblique lines inclining to the right (/) and the partial occlusion of the man’s face on
account of the interplay of light and dark in Shots 4-5.
The repetition and interplay of these features and their association with
specific participant functions and locations undoubtedly contribute to the visual
texture of this phase at the same time that the contrasting patterns index the unre-
solved conflict between the values of the two positions that the man and woman rep-
resent. Shot 6, which concludes Phase 1, combines some aspects of both patterns:
(1) the illumination of the car contrasts with the darkness of the frame of the door-
way of the telephone box as viewed from its interior; and (2) the oblique character of
the intersecting vertical and horizontal lines formed by the frame of the doorway
signifies asymmetry, imbalance and lack of visual harmony.
In 4.11.4, we shall be concerned with transitivity frames in relation to the
Mitsubishi Carisma advertisement (see Appendix II).
Table 4.3: Some visual process types and their modes of realization
[Table body only partly legible. Recoverable entries: road sign; vector: action; connection: hand-telephone; continuous change: right camera pan [1→2], orient viewer; continuous change: left camera pan [1→2] + head turn + man’s gaze vector, behavioural: gaze, align viewer gaze to participant gaze; balance + centering + vector: frontal orientation towards viewer, engage viewer’s gaze, present object]
not warrant the ad hoc creation of a new word or a new semantic category. The
articulatory space in which vowels are produced permits many degrees of
topological-continuous variation in the production of a particular vowel sound
(Thibault, 2004a: Chap. 3). However, this does not give rise to a semantic gradation
lying between the categories singular and plural. The distinction between the two
categories is a typological-categorial one. The lexicogrammar and semantics of natural
languages are almost completely typological in their mode of semiosis (Lemke, 1998).
In visual depiction, a visual image likewise represents some phenomenon
which is spatially and temporally grounded in some (real or imaginary) referent
situation. In this perspective, a visual image can also be analysed into two central
components: a visual unit in the form of a vector signifying a process; and one or
more visual units in the form of volumes or shapes which signify the participants
in the process. Volumes signify participants; vectors signify processes.
Unlike the lexicogrammar and semantics of natural languages, the mode of
visual semiosis is mainly topological. It is concerned with continuous change and
variation and is tied to visual kinaesthesis of the observer in a way that language is
not. Visual semiosis is concerned with the topological and dynamic characteristics of
phenomena. This can be illustrated with respect to the kinds of process types that
are typical of the visual transitivity frames that we find in television advertisements
such as the Mitsubishi Carisma text.
Visual processes are realized through a wider range of resources than vec-
tors per se. A process connects participants through a range of topological relations
such as the following: increase or decrease in quantity; deformation of body
surface; continuous change; movement; nearness and farness; connectedness;
interpenetration of domains; visual kinaesthesis.
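A visual transitivity frame of this kind can be sketched as a small data structure: volumes realise participants and a process links them through one of the topological relations just listed. The class names are illustrative assumptions. The schema GAZER ^ GAZE VECTOR ^ TARGET is the one the chapter gives for Shots 4 and 5; classing the gaze vector under "continuous change" is an assumption made only for the sake of the example.

```python
from dataclasses import dataclass
from enum import Enum


class TopologicalRelation(Enum):
    QUANTITY_CHANGE = "increase or decrease in quantity"
    SURFACE_DEFORMATION = "deformation of body surface"
    CONTINUOUS_CHANGE = "continuous change"
    MOVEMENT = "movement"
    PROXIMITY = "nearness and farness"
    CONNECTEDNESS = "connectedness"
    INTERPENETRATION = "interpenetration of domains"
    VISUAL_KINAESTHESIS = "visual kinaesthesis"


@dataclass(frozen=True)
class Participant:
    role: str    # function in the frame, e.g. "GAZER"
    volume: str  # the shape or volume that realises the participant


@dataclass(frozen=True)
class Process:
    label: str
    realised_by: TopologicalRelation


@dataclass(frozen=True)
class TransitivityFrame:
    first: Participant
    process: Process
    second: Participant

    def schema(self) -> str:
        """Render the frame as ROLE ^ PROCESS ^ ROLE."""
        return f"{self.first.role} ^ {self.process.label} ^ {self.second.role}"


gaze_frame = TransitivityFrame(
    Participant("GAZER", "the man's head and face"),
    # Assumption: the gaze vector is treated here as continuous change.
    Process("GAZE VECTOR", TopologicalRelation.CONTINUOUS_CHANGE),
    Participant("TARGET", "the hostage suspended from the bridge"),
)
print(gaze_frame.schema())  # GAZER ^ GAZE VECTOR ^ TARGET
```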
Table 4.3, on the previous page, illustrates some of the visual process types,
their associated participants and the grammatical features that realize them with
reference to the Mitsubishi Carisma advertisement.
In 4.11.5, we shall discuss some of the ways in which the visual transitivity
frames are integrated to and function in larger-scale identity chains in the Mitsubishi
Carisma advertisement. In doing so, we are drawing attention to some of the ways
in which small-scale features such as the visual transitivity frame contribute to the
discourse level organisation of multimodal texts.
a particular participant chain. The rows show the patterns of interaction on a shot-
by-shot basis among the different participant chains. With the exception of the
single occurrence of the hostage, the other two participants consistently interact
with the telephone chain, which has a pivotal role in Phase 1 in connecting the two
main participants. Much of the thematic, and therefore the phase-specific homo-
geneity of Phase 1, can be attributed to the consistency of patterning in the shot-
by-shot development of each individual participant chain along with the patterns
of interaction among the chains.
Chains are linked to each other by visual processes (movement and
connection vectors) in different kinds of visual transitivity frames. The thematic
continuity and development that Table 4.4 shows is not only based on the percep-
tion of invariants from one shot to the next on the expression stratum; it is also
based on the ways in which the progression from one shot to the next is integrated
to a narrative logic of transformation from one shot to the next on the discourse
stratum. Thus, the movement of the car in Shot 1 does not entail transformation in
this sense, whereas the transition from Shot 1 to Shot 2 does. In this case, the move-
ment of the car and the salience of the telephone box in Shot 1 and the hand grasp-
ing the telephone receiver inside the telephone box in Shot 2 are seen as integrated
components of a larger-scale structure of meaning.
The individual shots and the sequential relations between them presume the
transformation of some features from one shot to the next. In the present case, the
spatial relocation of the woman from car to telephone box is one such
transformation which potentially generates narrative meaning, in the process raising
questions, for example, as to the reasons for the woman’s action. On the basis of Shot
1, we can say, for instance, that ‘the car moved’. The integration of the two shots
enables us to say something along the lines of ‘the woman drove to the telephone box to
make a call’. The kind of question raised here concerning the ways in which shots are
integrated with each other to form larger-scale units leads us to consider the kinds of
meanings that are created by the relations between shots in the next section.
Table 4.4: Visual participant chains in Phase 1 of the Mitsubishi Carisma advertisement
4.11.6. Dependency relations in the Mitsubishi Carisma text: implications for visual texts
The relations between the shots in a video text can be viewed in terms of dependency
or part-part relations. Dependency is a fundamental kind of meaning relation which
is not specific to language. In a dependency relation there is a direct relationship
between the units involved rather than a relationship which is mediated by a shared
higher-order unit. The latter type of relationship is usually referred to as constituency
in linguistics. Constituency involves part-whole relations; the parts are defined in
terms of their relationship to the whole structure to which they belong and in which
they function. Visual transitivity frames involve part-whole relations in this sense. In
the case of visual semiosis, this may at first glance appear to contradict the fact that
visual semiosis is based on topological-continuous variation. Constituency in lan-
guage is based on discrete typological-categorial distinctions. Nevertheless, the visual
image can be analytically broken down into such elements as volumes (shapes), vectors
(lines of directionality and force), and the functional relations among these (Kress,
Van Leeuwen, 1996: 56-64). In visual transitivity frames, volumes function as
participants, while vectors function as processes.
It is on this basis that we can determine different kinds of visual transitivity
frames both within and between shots in video texts (see Baldry, Thibault, 2005).
Constituency relations are analytical: a syntagmatic relation is broken down or analysed
into its parts and the functions of the parts in the whole are determined. The whole
mediates the relations among its parts. Dependency relations are synthetic: a
syntagmatic relation is synthesised between two or more units when the relationship
between them is not dominated by some superordinate constituency relation.
In linguistics, two principal types of dependency relation are recognised.
These are parataxis and hypotaxis. A paratactic relation involves equality between
the units: the units in the dependency relation are of equal status. The relationship
between Shots 24 and 25 in the Mitsubishi Carisma advertisement is of this kind. Shot
25 paratactically extends the meaning of Shot 24 by giving an alternative to it. The
switch from the angry female actor on the film set to the focus on the car in these
two shots does precisely this. In doing so, the car replaces the director and the actors
in the hackneyed thriller plot in the film studio as the centre of attention.
Hypotaxis involves a relationship of inequality: one unit is dominated by and
in some way dependent on the other unit. One example in the Mitsubishi Carisma
advertisement is the relationship between Shots 4 and 5. In this sequence, Shot 5 is
dependent on Shot 4. Shot 4 is the dominant or more important shot here. Shot 4 is
the Head; it is the point of origin and the initiation of the dependency relation
between these two shots. Shot 5 is dependent on it because it is Shot 4 that could
stand alone whereas Shot 5 modifies Shot 4 and cannot stand by itself. Specifically,
Shot 5 modifies Shot 4 by extending its meaning, adding to the meaning of Shot 4.
The camera pan to the left in Shot 5 extends and adds to the meaning of the man’s
head turn and gaze vector initiated in Shot 4. In this way, the specific focus of his
gaze vector is shown in Shot 5 to be the hostage suspended from the bridge. It is in
this sense that we can say that Shot 5 hypotactically extends the meaning of Shot 4.
According to Halliday (1994 [1985]: 225-241; Halliday, Matthiessen, 2004: 395-
441), there are three ways in which the dependency relations between clauses in a
clause complex may be expanded. These are: elaboration, extension and enhancement.
Elaboration (=)
The elaborating relation is indicated by the equals sign (=). In a paratactic elaborat-
ing complex, an initial unit is restated, exemplified or further specified by another
unit. Shot 26 elaborates Shot 25 in this way. Shot 25 takes the viewer from the film
set to the car as it is being withdrawn from the film set. Shot 26, however, focuses the
viewer’s attention squarely on the car for the first time in the entire advertisement. It
restates and further specifies the meaning of the Mitsubishi car in Shot 25 by switch-
ing from the oblique angle of the car being directed by the man beckoning with his
hands to a frontal view of the car in which the car is now the salient object without
any human in view. In this way, the visual significance of the car and its relative
salience in Shot 25 is restated and further specified in this shot.
Extension (+)
The basic meanings of the extending relation are those of addition (including the
adversative relation) and variation. That is, one unit adds to or varies the meaning
of the other, thereby extending its meaning. The relationship between Shot 17 and
Shot 18 is an example of this type. Shot 18, which features the (woman’s) passport
being held by a hand, is superimposed on and merges with Shot 17, which is of her
car travelling on the road to Prague from Vienna. Shot 18 extends the meaning of
Shot 17, which shows the car on the road and a road sign pointing in the direction
of Prague, by adding (quite literally) a further layer of meaning to Shot 17.
Enhancement (X)
The meanings included under enhancement tend to be circumstantial. One unit
enhances the meaning of another in terms of time, manner, place, cause,
condition, result and so on. The relationship between Shots 21 and 22 is of this
kind. Shot 21, which shows the film director shouting the word ‘cut’ on the film set,
provides the reason or the motivation for the woman’s anger in Shot 22. Shots 23-
24, which show the telephone box being removed from the film set, enhance the
meaning of Shot 22, in which the woman is speaking on the telephone, by indicat-
ing the changed circumstances in which her conversation is taking place.
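The examples in this section can be collected into a small table of shot-to-shot relations. This is a sketch under stated assumptions: the class and field names and the `gloss` format are hypothetical, the symbols follow the section headings above (with a lower-case `x` standing in for the enhancement sign), and the taxis is recorded only where the text states it.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Taxis(Enum):
    PARATAXIS = "parataxis"  # units of equal status
    HYPOTAXIS = "hypotaxis"  # one unit depends on the other


class Expansion(Enum):
    ELABORATION = "="
    EXTENSION = "+"
    ENHANCEMENT = "x"


@dataclass(frozen=True)
class ShotRelation:
    expanding: str                   # shot(s) doing the expanding
    expanded: str                    # shot whose meaning is expanded
    expansion: Expansion
    taxis: Optional[Taxis] = None    # None where the text does not state it

    def gloss(self) -> str:
        taxis = self.taxis.value if self.taxis else "taxis not stated"
        kind = self.expansion.name.lower()
        return f"{self.expanding} -> {self.expanded}: {kind} ({self.expansion.value}), {taxis}"


relations = [
    ShotRelation("25", "24", Expansion.EXTENSION, Taxis.PARATAXIS),
    ShotRelation("5", "4", Expansion.EXTENSION, Taxis.HYPOTAXIS),
    ShotRelation("26", "25", Expansion.ELABORATION, Taxis.PARATAXIS),
    ShotRelation("18", "17", Expansion.EXTENSION),
    ShotRelation("21", "22", Expansion.ENHANCEMENT),
    ShotRelation("23-24", "22", Expansion.ENHANCEMENT),
]

for relation in relations:
    print(relation.gloss())
```

Keeping taxis and expansion type as separate fields reflects the fact that the two dimensions vary independently: Shots 24-25 and Shots 4-5 are both extensions, but the first pair is paratactic and the second hypotactic.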
(e.g. cut) and camera movement (e.g. camera pan). Shots 4 and 5 also realise a visual
transitivity frame conforming to the schema GAZER ^ GAZE VECTOR ^ TARGET. The
fact that these units can also be functioning parts in a larger whole – the visual
transitivity frame – does not in any way preclude the fact that the two shots can
also enter into a dependency relation of the kind described here.
expression-purport: the body (e.g. the vocal tract and oral cavity)
content-form: lexicogrammar
The dependency relations in our two examples relate the two shots involved
in temporal or logical relations of the kinds indicated above. In narrative discourse,
the principal type of dependency relation is the type that Halliday (1994 [1985]: 230-
232; Halliday, Matthiessen, 2004: 405-410) refers to as EXTENSION, i.e. one unit
extends the meaning of another unit by adding to it, proposing an alternative, replac-
ing it, and so on. Shot 2 extends Shot 1 by adding to Shot 1’s meaning. Shot 2 follows
Shot 1 in a temporal sequence: first the event in Shot 1 occurred, then the action
performed in Shot 2 took place.
Narrative discourse tends simultaneously to express temporal and causal/con-
sequential relations between the units in a sequence and to question these. Thus, the
most general type of narrative dependency relation may be glossed as:
COMPLICATION ^ RESOLUTION. This is a schematic relationship which has many
more specific kinds of instantiations in the grammar of natural languages as well as
in visual semiosis. The basic idea is as follows: in dependency relations of this kind,
temporality and causality are both affirmed (COMPLICATION) and questioned (RESOLUTION).
In simpler terms, one thing causes, follows, or otherwise leads to another
(COMPLICATION), and this requires further explanation (RESOLUTION) (see Thibault,
1988-1989: Chap. 12). Dependency relations such as those between shots in video
texts are an important aspect of the ways in which texts are organised on the
discourse stratum (see Table 4.1, p. 226). However, dependency relations are much
more than just a set of syntagmatic relations of the part-part kind. Above all, they
are a type of meaning relation which relates to the dynamic and active process of
making meaning on the discourse stratum. In the case of narrative as a mode of
discourse whatever the semiotic modality, we see this process operating on many
different scalar levels of organisation, e.g. between clauses or between larger phases
and generic stages of linguistic narratives and between shots or between the larger
phases to which shots belong in video texts.
In the Mitsubishi Carisma text, this can be illustrated as follows: Shot 1
implicitly raises or generates questions in the viewer’s mind. In this case, questions
of the kind where is the car going?, will it stop at the telephone box?, why is it where it
is?, spring to mind. This does not mean that all of these questions or others that
might also be posed are equally relevant. The point is that Shot 1 in this case sets
in motion a process of raising questions that require resolution. Narrativity as
meaning making process is in evidence on this relatively small-scale level.

Table 4.5: Dependency relation between Shots 1 and 2 in Phase 1 of the Mitsubishi Carisma advertisement

Shot 1:                                Shot 2:
Affirm:                                Affirm:
a car approaches a phone box           inside the phone box, a hand reaches for the receiver
Question:                              Answer:
a. Temporal: what will happen next?    a. Temporal: the woman driver goes into the phone box
b. Causal: why is the car there?       b. Causal: the woman wants to make a phone call
                                       Question:
                                       who is she ringing? why? etc.

Some sources of coherence in the Mitsubishi Carisma advertisement: Phase 1 239
Shot 1 is therefore a sort of mini-complication. Shot 2 provides answers to
the questions raised in Shot 1. The car stops at the telephone box, the driver makes
a telephone call, and so on. Shot 2 therefore provides a local resolution to the kinds
of questions and problems raised by Shot 1. In doing so, Shot 2 in turn raises
further questions that will in their turn require answers, e.g. who is she ringing up?
why?, and so on. In the case of the dependency relation between Shots 1 and 2, we
can see how both temporality and causality are simultaneously affirmed and ques-
tioned, as outlined in Table 4.5.
The two logics operate simultaneously, though one or the other may be
foregrounded in any particular instance. Narrative discourse requires both tempo-
rality and causality in order that the event sequence brings about transformation or
change as the text develops. This criterion operates on many different levels and
lies at the heart of the Complication^ Resolution type of organisation which is
characteristic of narrative. It also shows that the Complication and the Resolution
are not localised constituents or segments that occur in just one part of a narrative
text and its generic structure, as the Labov and Waletzky (1967) generic schema and
its development in systemic-functional linguistics may suggest (Martin, 1992: 564-
565). Rather, these units are a pervasive feature of narrative on different scalar
levels of its discourse organisation. The raising of questions and the providing of
answers to them satisfies the most fundamental requirement of transformation in
narrative discourse. The Complication^ Resolution schema can have many more
specific instantiations, e.g. Action^ Action, Action^ Reason, Action^ Result,
Action^ Consequence, and so on. In all these cases, it is not merely the temporal
sequencing of events that is important, but the causal logic whereby features of the
text undergo change from one action or event to the next. As the discussion in the
current section has shown, there is a close connection between the logical
organisation of texts and the fundamentally dialogical principles that underlie all
forms of semiosis. We would like to suggest that the tie up between the logical
organisation of the temporal and causal relations between textual units and the dia-
logical basis of all semiosis is far from coincidental.
In 4.11.7 we offer some further suggestions concerning the ways in which
logical coherence and interactional coherence are closely related to each other.
are colourful and noisy affairs which impact upon and appeal to our senses. They
attract our attention largely on the basis of visual and acoustic waves of rhythmic
patterning which we perceive as episodic. The coherence of a communicative event
is the result of the ways we make meaningful connections among different compo-
nents of what we interpret as the same overall event or text. In the first instance,
events of this kind announce themselves as interactive events. The television viewer
is not simply a processor of abstract visual and other information, but is placed in an
interactive relationship with the text. The viewer is positioned as an addressee who
plays an active role in the interpretation of the text.
In the visual semiotic, the division of the text into distinct, though interrelated
shots, and the ways in which the viewer is positioned to adopt certain perspectives or
viewing positions with respect to the depicted world as mediated by choices in
camera angle (horizontal and vertical), camera distance, and camera movement all
play a role in constituting the viewer as an active and responsive addressee. Choices
in these systems interact with other visual resources, such as the gaze vectors of the
participants in the depicted world as well as choices in other semiotic systems (e.g.
the off-camera presenter in Phase 2 who directly addresses the viewer), in ways that
create addressee roles and positions for the viewer and therefore possible or pre-
ferred ways of responding to the advertisement.
Many approaches to coherence in discourse start with various kinds of
cognitive operations, whereby principles of logical continuity can be derived.
However, it seems to us that coherence is founded in the first instance on principles
of interactional coherence before principles of logical continuity come into the
picture. The wavelike patterning of visual, musical, and spoken rhythms provides the
platform on which other more specified forms of interactional coherence, such as
those mentioned above, come into play. This is analogous in some ways to the dia-
logic organization of turn-taking and the associated speech roles in spoken discourse.
The text’s rhythmic patterns provide a basis for synchronizing the viewer’s own body
rhythms with those of the text. The ability to attend to these patterns provides the
basis for the further recognition of interactive units whereby the viewer is able to
take up specific interactive roles in relation to the text. This in turn leads to the abil-
ity to recognise event-like or episodic units in the overall flow of the text and to con-
strue various kinds of semantic or other meaningful relations among different parts
of the text. Thus, we can construct meaningful relations among separate shots as
parts in larger, meaningful wholes, rather than seeing each shot as a discrete unit
which is unrelated to what comes before it and what goes after it.
With reference to the transcription of this advertisement, we can see how
camera position and camera movement function to position the viewer in relation to the
depicted world of the text. In Shot 1, the approaching car is seen from a distance; its
movement towards the viewer indicates a potential for some kind of interactional
involvement with the viewer. Shot 2 is a very close shot of the telephone receiver as
the woman’s hand picks it up in order to make a phone call. The close nature of this
shot in contrast to the previous shot invites the viewer to enter into the interpersonal
space that is created by the close-up shot of the woman’s action. Shot 3 cuts to the
woman speaking on the phone; the woman is shown frontally in medium close-up.
For the first time, she is individuated as a result of these choices in camera distance
and angle. The viewer is able to enter into her world and, potentially, to identify with
it. Shot 4a cuts to the man with whom she is speaking on the phone. He too is indi-
viduated by a medium close shot; however, the more oblique angle here and the inter-
play of light and shade entail a different kind of interpersonal positioning for the
viewer with respect to the man. In this case, the man is positioned as being remote
from the world of the viewer and the latter is being asked to evaluate him in this light.
In Shot 4b, the camera pan tracks the man’s gaze vector so that the viewer’s position is
aligned with that of the man as he looks at his hostage, who is seen suspended above
the shark-infested waterway. In this case, the combined effect of the camera pan and
the gaze vector enables the viewer to interpret the man’s intentions with respect to his
hostage at the same time that a negative evaluation of his intentions is implied.
The observations made in the previous paragraph in connection with camera
position and camera movement show some of the ways in which each shot is an
interactive unit which functions interpersonally to create a viewing position for the
viewer and, on that basis, a potential evaluative or other response on the part of the
viewer. Selections of choices in various interpersonal systems such as those men-
tioned here create an interactional relationship between viewer and different parts of
the text at the same time that the viewer is invited to adopt different kinds of
evaluative positions and stances vis-à-vis the participants and their actions. Each shot
therefore orients the viewer and the viewer’s potential responses to the world of the
text. It is the shift from one perspective and potential response point to another
which creates an overall sense of interactional or interpersonal coherence. In this
way, the episodic nature of the text as a series of interrelated events becomes evi-
dent. This in turn paves the way for the more explicit forms of ideational and logical
coherence which more cognitively oriented approaches tend to focus on.
Table 4.6, on the following page, presents three sources of ideational
coherence in Phase 1 of the Mitsubishi Carisma text. Column 1 shows the number of
each of the shots in Phase 1. The other three columns each suggest a source of
coherence in the top row; the rows corresponding to the various shots in turn indi-
cate how this principle is instantiated in each shot. Thus, Column 2 shows that each
shot features a particular instantiation of the same general class of participant, i.e.
the telephone. However, we know that the different instantiations across the
various shots index two telephones, the one used by the woman and the one used
by the man, as they speak to each other. In this case, the coherence is provided by
the fact that the different instantiations are of the same class of participant and
that this is a textual means of connecting the woman and the man to each other.
Column 3 shows that different appearances from shot to shot of the woman and the
man are linked by the sameness of their appearance even though these visual forms
may have different participant roles in different visual transitivity frames. For
example, the unseen woman in Shot 1 can be retroactively construed as the Actor
who drives the car in that shot. In Shot 2, she is likewise interpretable as the Actor
who moves her hand to pick up the telephone. In Shot 3, where the woman’s face is
shown for the first time, we nevertheless connect this appearance of the woman to
the previous ones, even though in this shot she is now Sayer (not Actor) in the visual
transitivity frame which is shown here. Finally, Column 4 reconstructs the principle
of temporal continuity whereby the different visual processes across the different
shots and associated visual transitivity frames are seen as related actions or events in
a chronological sequence of events. The interplay between logical coherence and
interactional coherence in video texts shows some of the ways in which video texts
such as television advertisements have hypertextual characteristics not unlike those
exhibited by websites (Chapter 3). In 4.11.8, we shall consider this issue.
linear unfolding of the text. We shall also suggest that this is not unlike the processes of
textual development that we examined in relation to hypertext trajectories in Chapter 3.
Shots 18-20 feature the superimposition of the car on the road driving to
Prague (18) and the display of the words ‘Prague, Wednesday’ informing the viewer
of the car’s arrival in Prague (19) and the elapsed time since the previous car drive
sequence and arrival in Vienna on Tuesday. Shot 19 is in turn also superimposed with
Shot 20. Shot 20 shows the woman inside another phone box and her car parked near
the phone box. The sequential unfolding of shots in the text contributes to the
development of a textual trajectory. However, the linearity of the sequence does not
mean that the meaning relations constructed through these relations are necessarily
linear. The superimposition or overlapping of shots in the sequence of shots creates
a high degree of thematic compression in the visual semiotic. A shot is always a
choice point or a node in an unfolding textual trajectory or syntagm at the same time
that a given choice is made in relation to a paradigmatic class of possible alternative
choices in some contrast set. The superimposition of shots foregrounds to a greater
extent the paradigmatic potential of the shot as a choice point in the unfolding visual
text. Again, we see the dually paradigmatic and syntagmatic dimensions of meaning-
making being foregrounded in ways that reveal the hypertextual characteristics of
video texts such as the Mitsubishi Carisma advertisement (see Appendix II).
The superimposition of these shots is part of a paradigmatic pattern which had
been established by the similar – not identical – patterns in Shots 8 to 11 and Shots
18-19. Thus, similar situations – talking to the man from different phone boxes and
driving to Vienna then to Prague – create a paradigmatic set based on sameness with
variation. This pattern in turn conditions the viewer’s expectations concerning the
further development of the storyline. The transition from Shot 20 to Shot 21, which
shows the director on the film set holding a megaphone to his mouth, breaks with the
expectation that has been created. A different alternative, one which was not predict-
ed by the previous choices in the text, is made at this point, as the text unexpectedly
switches from the thriller plot to the director who is making the film, thereby intro-
ducing a new participant into the text who was not in any way part of the previous
storyline up to Shot 20. At this point (Shot 21), the text has jumped to a new, unex-
pected storyline, which features both new participants (e.g. the film director) and new
roles for old ones (e.g. the angry woman actor who walks off the film set in protest).
In ways that are similar to the hypertextual strategies analysed in relation to the web
page in Chapter 3, the textual processes described here call attention to the dialecti-
cally dual character of the text as system and process. The television viewer is made
aware of the process through highly self-conscious metadiscursive strategies which fore-
ground the dialectic of system and process as the text unfolds along its trajectory in
ways that are similar to the examples of hypertext discussed in 3.10, pp. 156-161.
The choice of these particular patternings creates a pattern of similarity with
variation in the two car drive sequences. The choices made at these two points in the
text therefore condition the viewer’s expectations concerning later choices on various
scalar levels, such as visual transitivity frames, shots, sequences of shots, and so on.
The compression and superimposition in the two sequences create expectations as to
what will happen next when the woman gets into the phone box on arrival in Prague:
Will she meet the man and secure the release of her boyfriend? Will she be asked to
continue the car journey?, and so on. This highlights the dynamic character of both
system and text. Textual meanings are made over time, never entirely according to our
expectations. Probabilities in the system are reweighted for each situation and each
text at the same time that they dynamically shift in the process of text production.
Hypertext draws attention to this process, though it is not unique to hypertext.
Expectations that are set up by a pattern of choices at some point in the text can be
confirmed or frustrated by choices made at some later stage of the text’s development.
The typical conjunctions of situations that are achieved by the superimposition
of shots in the two sequences are a visual resource that functions in ways similar to
conjunctions in language. Just as conjunctive relations in language can provide indi-
cations concerning what to expect in the development of discourse (Martin, Rose,
2003: 128-133), these visual resources function in the same way to lead us to expect
a certain pattern in the further development of the event sequence (Van Leeuwen,
1990), only to see this frustrated by the unexpected twist in the plot. In the Mitsubishi
Carisma advertisement, the expectation that is created by the visual strategy of
compression and superimposition of paradigmatically related situations in the event
sequence leads the viewer to expect that the text will continue to make choices from
the same contrast set in the further development of the action. In this perspective,
the congruency of phase-specific copatternings of choices made up to this point can
be expected to function to maintain and develop the same situation, e.g. bringing the
story to a typical happy ending such as girl reunited with boy. The counter-expectation
results from the selection of a different choice from some alternative contrast set in
relation to the different situation of the film studio which occurs in Shot 21, and it
is developed from that point in the text.
Viewers’ expectations concerning the way the text develops (e.g. its plot
structure) are created by establishing a paradigmatic pattern in the syntagmatic
development of the text. In the present example, this is maximally foregrounded by
the superimposition and partial merging of different shots in the two sequences
referred to above. The visual semiotic of the film shot does not have the resources
of conjunction that are characteristic of discourse-level relations between clauses in
linguistic text. However, other specifically visual resources can be deployed to create
relations of expectation and counter-expectation in the development of a sequence
of shots and the transitions between shots. The superimposition or partial merging
of shots in the two sequences in the Mitsubishi Carisma advertisement is an example
of this. These two sequences show typical conjunctions of expected events in the
two car drive sequences. In Shots 8-11, the road to Vienna and the woman at the
steering wheel in her car are superimposed. In Shots 18-19, there is a similar super-
imposition of car driving on the road to Prague and passport control, as the woman
crosses the border from Austria into Hungary. The two sequences compress in this
way typical turns of events concerning the temporal development of the story line.
In each case, we see situations being merged that we would normally expect to occur
close together at that particular stage of the unfolding story. Expectation is, however,
thwarted in the present example by the change in the event line and participant roles
in the cut from Shot 20 to 21. The new features in Shot 21 replace those in Shot 20
and break the previous (and expected) continuity in the development of the narra-
tive event line. The transition between these two shots creates a discontinuity in
terms of both location (i.e. car journey from one city to another then switching to
film set in studio) and in terms of narrative event line (i.e. thriller plot then switch-
ing to the making of the film in the studio).

Inset 19
In the above excerpt, the negotiation hinges on the meaning of Captain Medina’s words
‘those people were slowing me down, waste them’ as reported in (10) by Calley. In (11), the
questioner’s ellipted interrogative clause suggests one possible interpretation for the
meaning of this locution and asks Calley to provide his own perspective on the
proposition in the questioner’s interrogative clause, i.e. to verify whether these words
correspond to Calley’s own interpretation of what Captain Medina is reported in (10)
to have said or whether they correspond to what Captain Medina actually said on the
radio to Calley as they spoke to each other while in different parts of the Vietnamese
village in question. In (12), Calley’s negative reply refutes that interpretation. In (13),
the questioner then returns to the point previously made in (11) by adopting a different
discourse strategy, which is realized by the selection [declarative clause + polar inter-
rogative clause]. This choice puts the focus on Calley’s own interpretation of the
meaning of the words referred to before and therefore on his own responsibility in
making that interpretation. Rather than formulating the question so as to put the focus
on Captain Medina, e.g. what did Captain Medina mean? or what did Captain Medina say?,
the declarative clause in (13) focuses on Calley’s interpretation of the meaning of what
was previously said, viz. ‘you interpreted it to mean’
This declarative utterance begins
with the conjunction ‘So’, which connects the clause back to Calley’s negative confirma-
tion in (12) that Medina was responsible for the interpretation suggested by the
questioner in (11). (13) is a complex discourse move realized by two clauses. The first
part of (13) takes the form of a logical inference which the questioner draws at this
point on the basis of the immediately prior exchange with Calley. The selection [polar
interrogative; positive polarity] in the second part functions to seek Calley’s confirma-
tion of the meaning of the declarative clause as the questioner seeks to attribute this
meaning to Calley at this point in the exchange (i.e. in 13).
The transcriptions and text analyses presented in this book are a first step towards
the formulation of better multimodal transcription practices and the development
of computer-assisted tools for the storage, retrieval, processing and analysis of
multimodal texts. A central goal is the construction of multimodal corpora with a
view to the development of new categories of text analysis and description, a
necessary stage in the construction of the next generation of text-based corpora
(Baldry, Thibault, 2005).
In spite of the important advances made in the past thirty or so years in the
development of linguistic corpora and related techniques of analysis, a central and
unexamined theoretical problem remains, namely that the methods adopted for
collecting and coding texts isolate the linguistic semiotic from the other semiotic
modalities with which language interacts. In other words, linguistic corpora as so
far conceived remain intra-semiotic in orientation. In contrast, multimodal corpora
are, by definition, inter-semiotic in their analytical procedures and theoretical
orientations.
This entails new methods for the collecting, coding, storing and analysing of
textual data. There are, of course, many practical and technical difficulties that will
need to be overcome in order to realise this objective. Two central requirements in
such an enterprise will be (1) transparency of cross-modal coding criteria whatever
the modality in question; and (2) retrievability of inter-semiotic relations such as,
for example, the copatterning of written text and visual image or spoken language
and body kinesics among others. At the present stage of computer technology,
some kind of language-based and/or visual coding system remains the most
feasible procedure. Nevertheless, the data so coded will need to be referenced both
to specific transcriptions and to electronically stored databases comprising,
for example, video clips, video texts and multimodal printed pages, multimodal
transcriptions like the prototype presented in this book and so on (Baldry, 2000b).
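
By way of illustration only, the two requirements just mentioned might be sketched in Python along the following lines. The record layout, the field names, the clip identifier and the timings are all hypothetical and are not taken from the transcriptions in this book. The sketch simply assumes that each coded unit is stored with an explicit modality, a time span and a reference to a stored clip, and that an inter-semiotic relation can be approximated, very crudely, as a temporal overlap between units coded in two different modalities.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Annotation:
    """One coded unit in a single semiotic modality (hypothetical schema)."""
    clip_id: str    # reference to an electronically stored clip or page
    modality: str   # e.g. "camera", "action", "speech"
    start: float    # start time in seconds
    end: float      # end time in seconds
    label: str      # label drawn from an explicitly documented coding scheme


def overlaps(a: Annotation, b: Annotation) -> bool:
    """True when two units on the same clip share some stretch of time."""
    return a.clip_id == b.clip_id and a.start < b.end and b.start < a.end


def copatterns(corpus: List[Annotation], mod_a: str,
               mod_b: str) -> List[Tuple[Annotation, Annotation]]:
    """Retrieve every pair of overlapping units from two modalities."""
    first = [x for x in corpus if x.modality == mod_a]
    second = [x for x in corpus if x.modality == mod_b]
    return [(a, b) for a in first for b in second if overlaps(a, b)]


# Invented timings, loosely modelled on Shots 1 and 2 of Phase 1.
corpus = [
    Annotation("carisma_phase1", "camera", 0.0, 2.0, "distant shot"),
    Annotation("carisma_phase1", "camera", 2.0, 3.5, "close-up"),
    Annotation("carisma_phase1", "action", 2.2, 3.4, "hand lifts receiver"),
]

pairs = copatterns(corpus, "camera", "action")
for camera_unit, action_unit in pairs:
    print(camera_unit.label, "<->", action_unit.label)
```

Because every record names its modality and its source clip, the coding criteria stay inspectable and each retrieved pair can be traced back to the stored material. Counting such pairs over a large corpus would be one crude way of beginning to quantify copatterning, though temporal overlap is only a stand-in for the much richer relations discussed in this chapter.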
Only in this way can we begin to quantify on a sufficiently large scale the sys-
tematic relations between language and the other semiotic modalities with which it
is co-contextualised in the making of genre- and context-specific meanings. If lan-
guage form and function are themselves shaped by the kinds of intersemiotic
relations into which language typically enters, then it may be argued that those con-
cordancing practices which ignore this fundamental fact about language will fail in
the longer run to provide entirely adequate explanations of language itself and the
ways in which language, too, is changing under pressure from the newly emergent
forms of multimodal and multimedia meaning-making practices with which lan-
guage is codeployed and with which it has always coevolved.
Neither the present chapter nor the current book claims to provide fully
worked out solutions to all of the problems addressed. The core of the enterprise
is to provide a dynamic account of multimodal meaning making which integrates
both micro- and macro-level processes in ways that bring about and facilitate plausi-
ble and logical accounts of the processes of the co-contextualisation of diverse
semiotic modalities and of the data that is accumulated and stored in some corpus
of multimodal texts. In spite of the pioneering work of Bateson, Birdwhistell, Hall
and others in the 1950s, the multimodal study of human social meaning-making
remains in its infancy. The main concern in this book has been to provide a
methodological and theoretical starting point on the basis of which further work
towards the goals outlined above might be undertaken. It necessarily follows that
the tentative and as yet incomplete proposals made here will undergo further
development and modification.
References
(a) Primary sources for text analyses
Bargellini, Alberto, Fratello, Mario and Monfroni, Luciana (1985). Scienze per il 2000:
Introduzione interdisciplinare allo studio delle scienze chimiche, fisiche e naturali. Vol.
2. Milan: Signorelli.
Curtis, Helena (1975 [1972]). Biology. New York: Worth Publishers Inc.
Darwin, Charles (1989 [1880]). The Power of Movement in Plants. London: Pickering.
Originally published: London: Murray.
Guglielmi, Bona and Ferrari, Ercole (1982 [1980]). Scienza. Natura. Società: Corso di
scienze chimiche, fisiche e naturali per la scuola media Vol. 2. First edition, sixth
reprinting. Turin: Paravia.
King, Beryl A. (1957). Australian Biology for High School Junior Classes. Fifth revised edi-
tion. Sydney and Brisbane: William Brooks & Co.
Marx, Karl (1909 [1908]). Capital: a critique of political economy, Vol. 1: The process of cap-
italist production. Translated from the third German edition, by S. Moore and E.
Aveling. Edited by Friedrich Engels. Revised and amplified according to the 4th
German ed. by Ernest Untermann. Chicago: Charles Kerr & Co.
The Economist, Sept. 5th -11th 1998, London (Volume 348, Number 8084).
Tan, Amy (2001). The Bonesetter’s Daughter. London: Flamingo.
Worrell, Eric. (1963). Reptiles of Australia. Sydney and London: Angus and Robertson.
Bühler, Karl (1990 [1934]). Theory of Language: The representational function of language.
Donald Fraser Goodwin (trans.). Amsterdam/Philadelphia: Benjamins.
Cheong, Yin Yuen (2004). “The construal of ideational meaning in print advertise-
ments”. In Kay O’Halloran (ed.), Multimodal Discourse Analysis: Systemic functional
perspectives. London and New York: Continuum, pp. 163-195.
Cook, Guy (2001 [1992]). The Discourse of Advertising. London and New York: Routledge.
Couper-Kuhlen, Elizabeth (1992). “Contextualizing discourse: the prosody of interac-
tive repair”. In Peter Auer and Aldo di Luzio (eds.), The Contextualization of
Language. Amsterdam/Philadelphia: John Benjamins, pp. 337-64.
Crystal, David (1972). “The intonation system of English”. In Dwight Bolinger (ed.),
Intonation: Selected readings. Harmondsworth, Middlesex: Penguin, pp. 110-36.
Crystal, David (1982). Profiling Linguistic Disability. London: Arnold.
Daneš, František (1974). “Functional sentence perspective and the organisation of the
text”. In František Daneš (ed.), Papers on Functional Sentence Perspective. The
Hague: Mouton, pp. 106-28.
Daneš, František (1989). “Functional sentence perspective and text connectedness”. In
Maria Elizabeth Conte, János Petöfi and Emel Sözer (eds.), Text and Discourse
Connectedness. Amsterdam/Philadelphia: John Benjamins, pp. 23-31.
Darwin, Charles (1955 [1872]). The Expression of the Emotions in Man and Animals.
New York: Philosophical Library.
Davidse, Kristin (1992). “A semiotic approach to relational clauses”. In Occasional
Papers in Systemic Linguistics 6: 99-131.
Echard, William (1996). “Working paper on the notion of style, by way of auditory stream-
ing and social semiotics”. Department of English, York University, Toronto: Mimeo.
Firth, John R. (1957 [1934]). “The use and distribution of certain English sounds: pho-
netics from a functional point of view”. In John R. Firth, Papers in Linguistics
1934-1951. London and Oxford: Oxford University Press, pp. 34-46.
Firth, John R. (1957 [1950]). “Personality in language and society”. In John R. Firth,
Papers in Linguistics 1934-1951. London and Oxford: Oxford University Press,
pp. 177-189.
Fuller, Gillian (1998). “Cultivating science: negotiating discourse in the popular texts of
Stephen Jay Gould”. In James R. Martin and Robert Veel (eds.), Reading Science:
Critical and functional perspectives on discourses of science. London and New York:
Routledge, pp. 35-62.
Gibson, James J. (1986 [1979]). The Ecological Approach to Visual Perception. Hillsdale, NJ
and London: Lawrence Erlbaum.
Goffman, Erving (1985 [1976]). Gender Advertisements. London and Basingstoke:
Macmillan.
Goodman, Sharon (1996). “Visual English”. First part of Chapter 2 in Sharon
Goodman and David Graddol (eds.), Redesigning English: New texts, new identities.
London and New York: Routledge, pp. 38-72.
Goodwin, Charles (1994). “Professional vision”. In American Anthropologist 96 (3): 606-
33.
Goodwin, Charles, Goodwin, Marjorie Harness (1992). “Context, activity and partici-
pation”. In Peter Auer and Aldo di Luzio (eds.), The Contextualization of
Language. Amsterdam/Philadelphia: John Benjamins, pp. 77-99.
Gregory, Michael (1995). “Generic expectancies and discoursal surprises: John Donne’s
The Good Morrow”. In Peter H. Fries and Michael Gregory (eds.), Discourse in
Society: Systemic functional perspectives. Meaning and choice in language: Studies for Michael
Halliday. Norwood, NJ: Ablex, pp. 67-84.
Gregory, Michael (2002). “Phasal analysis within communication linguistics: two con-
trastive discourses”. In Peter Fries, Michael Cummings, David Lockwood, and
William Spruiell (eds.), Relations and Functions in Language and Discourse. London:
Continuum, pp. 316-345.
Greimas, Algirdas Julien (1966). Sémantique structurale. Paris: Larousse.
Gumperz, John J., Berenz, Norine (1993). “Transcribing conversational exchanges”. In
Jane A. Edwards and Martin D. Lampert (eds.), Talking Data: Transcription and
coding in discourse research. Hillsdale, NJ: Lawrence Erlbaum, pp. 91-121.
Hall, Edward T. (1972 [1963]). “A system for the notation of proxemic behaviour”. In
John Laver and Sandy Hutcheson (eds.), Communication in Face to Face Interaction.
Harmondsworth: Penguin, pp. 247-273.
Halliday, M.A.K. (1978). Language as Social Semiotic: The social interpretation of language and
meaning. London: Arnold.
Halliday, M.A.K. (1979). “Modes of meaning and modes of expression: types of
grammatical structure and their determination by different semantic functions”. In David
J. Allerton, Edward Carney and David Holdcroft (eds.), Function and Context in
Linguistic Analysis: A Festschrift for William Haas. Cambridge: Cambridge
University Press, pp. 57-79.
Halliday, M.A.K. (1989). “Part A.”. In M.A.K. Halliday and Ruqaiya Hasan (eds.),
Language, Context and Text: Aspects of language in a social-semiotic perspective. Oxford:
Oxford University Press, pp. 1-49.
Halliday, M.A.K. (1992). “How do you mean?”. In Martin Davies and Louise Ravelli
(eds.), Advances in Systemic Linguistics: Recent theory and practice. London and New
York: Pinter, pp. 20-35.
Halliday, M.A.K. (1994 [1985]). An Introduction to Functional Grammar. Second edition.
London and Melbourne: Arnold.
Halliday, M.A.K., Matthiessen, Christian (2004). An Introduction to Functional Grammar.
Third edition. London: Arnold.
Handel, Stephen (1993 [1989]). Listening: An introduction to the perception of auditory
events. Cambridge, MA: The MIT Press.
Harré, Rom (1990). “Exploring the human Umwelt”. In Roy Bhaskar (ed.), Harré and
His Critics: Essays in honor of Rom Harré with his commentary on them. Oxford:
Blackwell, pp. 297-364.
Harris, Roy (1995). Signs of Writing. London and New York: Routledge.
Hasan, Ruqaiya (1978). “Text in the systemic-functional model”. In Wolfgang Dressler (ed.),
Current Trends in Textlinguistics. Berlin & New York: Walter de Gruyter, pp. 228-46.
O’Toole, Michael (1994). The Language of Displayed Art. London: Leicester University
Press.
Peirce, Charles Sanders (1985). “Logic as semiotic: The theory of signs”. In Robert E.
Innis (ed.), Semiotics: An introductory anthology. London: Hutchinson, pp. 4-23.
Pike, Kenneth L. (1967). Language in Relation to a Unified Theory of the Structure of
Human Behavior. Second, revised edition. The Hague and Paris: Mouton.
Poynton, Cate (1985). Language and Gender: Making the difference. Geelong, Victoria:
Deakin University Press.
Rumelhart, David E. (1975). “Notes on a schema for stories”. In Daniel G. Bobrow
and Allan Collins (eds.), Representation and Understanding: Studies in cognitive science.
New York: Academic Press, pp. 211-236.
Saint-Martin, Fernande (1985). Introduction to a Semiology of Visual Language. Victoria
University, Toronto: Monographs, Working Papers and Prepublications of the
Toronto Semiotic Circle, Vol. 3.
St. Julien, John (1997). “Explaining learning: the research trajectory of situated
cognition and the implications of connectionism”. In David Kirshner and James A.
Whitson (eds.), Situated Cognition: Social, semiotic, and psychological perspectives.
Mahwah, NJ and London: Lawrence Erlbaum, pp. 261-79.
Salthe, Stanley N. (1993). Development and Evolution: Complexity and change in biology.
Cambridge, MA and London: The MIT Press.
Saussure, Ferdinand de (1993). Eisuke Komatsu (ed.), Cours de Linguistique Générale:
Premier et troisième cours d’après les notes de Riedlinger et Constantin. Collection
Recherches Université Gakushuin no 24. Tokyo: Université Gakushuin.
Schank, Roger, Abelson, Robert (1977). Scripts, Plans, Goals, and Understanding.
Hillsdale, NJ: Lawrence Erlbaum.
Scheflen, Albert E. (1972). Body Language and Social Order: Communication as behavioral
control. Englewood Cliffs, NJ: Prentice-Hall.
Scheflen, Albert E. (1973). Communicational Structure: Analysis of a psychotherapy
transaction. Bloomington, IN: Indiana University Press.
Schoenberg, Arnold (1975). Style and Idea: Selected writings of Arnold Schoenberg.
Leonard Stein (ed.), Leo Black (trans.). London: Faber & Faber.
Silverstein, Michael (1992). “The indeterminacy of contextualization: when is enough
enough?”. In Peter Auer and Aldo di Luzio (eds.), The Contextualization of
Language. Amsterdam/Philadelphia: John Benjamins, pp. 55-76.
Silverstein, Michael, Urban, Greg (eds.) (1996). Natural Histories of Discourse. Chicago:
University of Chicago Press.
Sinclair, John (1991). Corpus, Concordance, Collocation. Oxford and Singapore: Oxford
University Press.
Taylor Torsello, Carol, Baldry, Anthony (2005). “SFL in text-based, web-enhanced
language study”. In Ruqaiya Hasan, Christian Matthiessen and Jonathan Webster
(eds.), Continuing Discourse On Language: A functional perspective, Volume 1.
London: Equinox, pp. 311-42.
Tesnière, Lucien (1965). Éléments de Syntaxe structurale. Paris: Klincksieck.
Thibault, Paul J. (1986). “Thematic system analysis and the construction of knowledge
and belief in discourse: the headlines in two Italian newspaper texts”. In Text,
Discourse, and Context: A social semiotic perspective. Victoria University, Toronto:
Monographs, Working Papers and Prepublications of the Toronto Semiotic
Circle, Vol. 3, pp. 44-91.
Thibault, Paul J. (1988-1989). Grammar, Text, and Discourse Genre: An advanced
introduction to the systemic-functional approach. Department of Linguistics, University of
Sydney: Mimeo.
Thibault, Paul J. (1989). “Genres, social action, and pedagogy: towards a critical social
semiotic account”. Southern Review (Australia) 22, 3: 338-62.
Thibault, Paul (1990). “Questions of genre and intertextuality in some Australian
television advertisements”. In Rema Rossini Favretti (ed.), The Televised Text.
Bologna: Pàtron, pp. 89-131.
Thibault, Paul (1991a). “Grammar, technocracy, and the noun: technocratic values and
cognitive linguistics”. In Eija Ventola (ed.), Functional and Systemic Linguistics:
Approaches and uses. Berlin and New York: Mouton de Gruyter, pp. 281-305.
Thibault, Paul (1991b). Semiotics as Social Praxis: Text, social meaning-making and
Nabokov’s “Ada”. Theory and History of Literature series, Vol. 74. Minneapolis
and Oxford: University of Minnesota Press.
Thibault, Paul J. (1994a). “Text and/or context?”. State-of-the-Art article. In The
Semiotic Review of Books (Toronto) 4, 1 (May 1994): 10-12.
Thibault, Paul J. (1994b). “Intertextuality”. In R. E. Asher and J. M. Y. Simpson (eds.),
The Encyclopedia of Language and Linguistics, Volume 4, pp. 1751-54.
Thibault, Paul J. (1997a). Re-reading Saussure: The dynamics of signs in social life. London and
New York: Routledge.
Thibault, Paul J. (1997b). “Contextualization and social meaning-making practices”.
Discussing Communication Analysis 1: 31-47.
Thibault, Paul J. (1998a). “Graphology and visual semiosis”. Dip. di Studi Linguistici e
Letterari, Europei Postcoloniali, University of Venice: Mimeo.
Thibault, Paul J. (1998b). “Multimodality”. In Paul Bouissac (ed.), The Encyclopedia of
Semiotics. New York and Oxford: Oxford University Press, pp. 427-9.
Thibault, Paul J. (1999a). “Communicating and interpreting relevance through
discourse negotiation: an alternative to relevance theory”. Journal of Pragmatics 31,
5: 557-94.
Thibault, Paul J. (1999b). “Putting Humpty Dumpty’s theory of meaning back together
again: Can Saussure help?”. Belgian Essays on Language and Literature (BELL) 9:
7-34. [Liège: Belgian Association of Anglicists in Higher Education].
Thibault, Paul J. (2000a). “The multimodal transcription of a television advertisement:
theory and practice”. In Anthony Baldry (ed.), Multimodality and Multimediality in
the Distance Learning Age. Campobasso: Palladino Editore, pp. 311-385.
Thibault, Paul J. (2000b). “The dialogical integration of the brain in social semiosis:
Edelman and the case for downward causation”. Mind, Culture, and Activity 7, 4:
291-311.
02
CD: close; HA: frontal; VA: medium; CM: stationary: tracks movement of car along road

11
Pa: Agent: car: implied; Pr: Movement Vector: L-R; Pa: Goal: balloons
CD: very close; HA: frontal; VA: medium; CM: vector: movement through balloons
Fade; Shot 10

12
Pa: Actor: woman; Pr: Movement Vector: L-R; Pa: Goal: telephone booth: implied
CD: distant; HA: frontal; VA: medium; CM: stationary
Cut; Phase 3; Shot 11

14
Pa: Actor: woman: arm; Pr: Movement Vector: L-R; Pa: Goal: telephone receiver: stationary
CD: close; HA: oblique; VA: medium; CM: stationary
Cut; Shot 12

21
CD: distant; HA: frontal; VA: medium; CM: stationary

22
Pa: Actor: hand; Pr: Connection Vector: hold; Pa: Goal: passport
CD: very close; HA: frontal; VA: medium; CM: stationary
Merge; Shot 18 (superimposed on 17)

23
Fade

28
Pa: Goal: phone booth; Pr: Movement Vector; Agent: implied
CD: very close; HA: frontal; VA: medium; CM: stationary
voice: male: off screen presenter: plot
Cut; Shot 23

Pa: Goal: phone booth; Pr: Movement Vector; Agent: men in film studio
CD: distant; HA: frontal; VA: medium; CM: stationary
voice: male: off screen presenter: shot by an overpaid
Cut; Shot 24

37
Shot 29; blackout

38
Index
A
Abercrombie, David 215, 217, 219, 221, 222
accessibility 93, 97
accent 52, 184, 208, 216-7
acoustic focus 52
action potential 44, 46, 104, 138, 140, 146-7, 150, 152
active and passive 39
Adobe Premiere 182
advertisement(s) 5, 7, 17, 20, 33, 37, 38-43, 48-55, 105, 107, 122, 165-248
ambient sound(s) 35, 53, 178, 180, 211
animation 43, 60, 105, 127
Asterix 37
apprenticeship 93, 97
attitudinal and evaluative meanings and/or stances 37, 38, 176
attractors 117-8, 183
attribution 85, 97, 137
Audi Quattro 167, 170
audio files 109
auditory array 175, 210
authority 93

B
backgrounding 38, 121, 205, 211
Bakhtin, Mikhail 27, 31, 39, 42, 43, 69, 90, 96, 162, 211
Baldry, Anthony 12, 18, 34, 47, 48, 49, 51, 54, 80, 122, 166, 167, 172, 225, 234, 248-9
Bateson, Gregory 10, 17, 21, 83, 249
body movement and/or position 21, 37, 172, 178, 180, 183, 185, 193, 202, 218, 219
Boo Bear text 38-44
bottom bar 40-1, 44-6, 115, 131
British Museum 114, 119, 130-5, 138, 140, 151, 156, 159
broadcasting 165, 166
Bühler, Karl 83

C
call-outs 67
camera: movement and/or position 172, 173, 187, 191, 193, 194, 202, 240, 241
Capital 61-9, 210
car advertisement(s) 5-7, 48-55, 104, 122, 170, 228-35, 238-44, 248
caricature 34, 37-8
Carrier (attributive clauses) 75, 85, 87, 97, 100, 137, 142
cartoon(s) 7-17, 20-1, 24, 34-8, 46, 60, 63, 120-1, 123, 125, 130, 131, 223
cartoonist(s) 34, 37
causal sequences and/or relations 12, 13, 22, 80, 117, 190, 238, 239
charts 57, 63, 65-7
Chesapeake Bay text 28, 32-4, 37
children 31, 48, 57, 58, 63, 78, 102, 103, 104, 114, 116, 119, 125, 130-1, 135, 138, 140, 151, 156, 159
cinema 38, 46, 107, 192
classroom 77, 87, 93, 102, 114, 116, 120
classification and analysis 86
clause(s) and/or clause groups 22, 41, 55, 64, 71, 74-78, 82, 84, 88, 96-99, 100-1, 127, 136-143, 145, 149, 182, 185, 189, 203-5, 210, 213, 217, 230, 235, 245-6
closeness and close shot(s) 5, 24, 39, 40, 42, 70, 85, 120, 125, 170, 172, 184, 187, 196, 197, 201, 231, 232, 239, 240, 241, 248
clothes 20, 188
cluster(s) 1, 4, 9, 11, 12, 21-34, 36, 38-44, 46, 48, 54, 60, 63, 81, 115, 113, 121, 124, 127, 131, 145, 146, 151, 160, 166, 174, 214, 218
cluster analysis 4, 11, 21, 24-31, 39, 48, 121, 131
co-contextualisation 21, 83, 132, 249
codeployment of resources 4, 5, 6, 11, 20, 21, 47, 55, 61, 102, 114, 118, 129, 139, 167, 178, 184, 217, 223, 247, 249
coding orientation 41, 99, 200
coherence 116, 187, 188, 205, 218, 239-42
cohesion 22, 179, 193, 211
cohesive chains, elements and/or semantic ties 179-80, 187
cohyponyms 81, 88, 115, 136
collocation 198
colour 29, 53, 58, 63, 76, 79, 91-3, 98, 99, 100, 102, 119, 124, 125, 130-5, 138, 140, 149, 151, 153-4, 159, 187, 198-202
columns 30, 64, 65, 71, 74, 75, 77, 82, 127, 174, 178, 180-7, 190-4, 201, 202, 209, 210, 213, 214, 216-223, 232, 241-2
comprehensibility 93
constituency relations 22, 234
content stratum 22, 166, 182, 223-7, 237
content-form 236-7
content-substance 236-7
context 1, 2, 3, 4, 6, 7, 11, 12, 18, 22, 24, 27, 53, 56, 80, 93, 96, 99, 100, 101, 111-2, 117-8, 124, 134-6, 148, 165-7, 172, 178, 180, 207, 211, 212, 217, 218, 222, 225, 227, 246-7, 249
context of culture 1-4, 6, 7, 18, 56, 111, 112, 165
context of situation 2-7, 56, 111, 124, 165
copatterning 50, 181, 183, 187, 195, 244, 247
corpora 18, 48, 248
covariate ties 16, 138-140, 155, 187, 199
cross-coupling 102
cuts 50, 51, 190, 241

D
Daneš, František 74, 188
Darwin, Charles 58, 59, 60, 63, 65, 208
Davidse, Kristin 74, 85
deformation of body 207, 232
deixis and/or deictic frames 10, 99
dependency relations 22, 234-9
depicted scene 7, 10, 11, 12, 14, 16, 120, 123, 124, 139, 145, 147, 150, 151, 152, 187, 198, 223
depicted world 10, 11, 12, 17, 36, 125, 189, 191, 193, 194, 195, 196, 197, 198, 202, 205, 224-5, 227, 240
depiction 10, 12, 14, 18, 23, 37, 55, 80, 83, 91, 104, 111, 119, 124, 125, 138, 156, 205, 206, 223, 225-7, 232
diagrams 19, 41, 57, 60, 63, 65, 68, 79, 92-3, 127, 114, 193
dialogic acts, activity, moves and/or responses 22, 149, 179, 181, 182, 220
161, 163, 173-5, 178, 180, 189, 191-9, 205, 223, 227, 232, 234, 248
incoherence 187
indeterminacy and/or indeterminate points 201
indexical functions and/or relations 3, 28, 35, 77, 78, 86, 91, 104, 140, 167, 202, 205
inequality 234
informational invariants 190
instance 3, 4, 18, 19, 30, 39, 42, 55, 65, 78, 79, 80, 84-5, 87, 114, 117, 137, 140, 145, 156, 172-5, 186, 198, 202, 208, 213, 220, 227, 234, 239, 240
instance and type 4, 39
integration principle 1, 4-5, 7, 11, 17-9, 24, 44, 61-3, 80, 83, 167
interactants 39, 43, 197, 247
interaction and/or interactional 19, 22, 26, 37, 70, 75, 78, 80, 90, 106, 112, 118-9, 128, 131, 146, 148, 152, 157, 158, 162-3, 180, 185, 186-8, 193, 198, 205, 211, 216, 218, 227, 232-3, 247
interactional encounter 7
intermediate levels 43, 54, 56
Internet 58, 103, 106, 108, 112, 119, 162, 163, 164
interpersonal meaning and/or metafunction 10, 11, 17, 22, 34, 36, 39, 40-2, 44, 69, 80, 85, 89, 90, 92-102, 119, 121, 130-1, 148-52, 167, 170-2, 179, 184-5, 197, 201, 206-8, 211, 222, 225, 241
interpersonal bond 44
interpersonal closeness 42, 197
interpersonal negotiation 98, 99, 102
interpersonal relations 10, 170, 179, 201
interplay of light and shade 53, 224, 241
interrelated levels 54, 236
intersemiosis 71, 83, 84, 91, 146
intertexts 64, 82, 90-1, 96, 123, 247
intertextual links 56
intertextual thematic relations and/or systems 64, 137-8, 180
intertextual/indexical links 78
intertextuality 6, 31, 35, 44, 55, 68-9, 141
invariant elements and/or structures 187, 190, 204, 228-9
inverted commas 11

J
Jakobson, Roman 107
James Bond films 52
John Stuart Mill 61

K
kinesic elements, resources and/or structures 43, 109-10, 149, 151-2, 174, 178-9, 185, 191, 201, 202, 207-9, 215-6, 219
Kress, Gunther 18, 34, 37, 39, 41, 58, 60, 63, 66, 70, 79, 80, 81, 86, 91, 92, 93, 99, 122, 189, 195, 197, 200, 201, 225, 234

L
language 1-4, 7, 10, 11, 12, 20, 58, 63-5, 70, 78, 80-3, 87, 90, 104, 110-1, 136, 138, 139, 149, 156-7, 172, 178, 180, 189, 198, 201, 204-6, 215, 217, 224, 230, 232, 234, 244, 248, 249
larger-scale textual features and/or units 38, 50, 54, 122, 234, 246
layout 30, 58, 70, 91, 114
leaflets 21, 30, 34
left-right organisation 40, 81, 83, 105
Lemke, Jay 16, 18, 20, 21, 55, 64, 65, 68, 69, 70, 71, 80, 82, 83, 87, 89, 92, 112, 136, 137, 175, 179, 180, 185, 205, 225, 232
Leonardo’s Notebook 12, 31, 44
lexicogrammatical resources and/or structures 22, 43, 71, 74, 96, 99, 137, 142, 173, 176, 218, 230, 237, 246, 247
line(s) 3, 7, 11, 29, 35, 44, 52, 60, 64, 67, 74, 91-3, 107-8, 140, 248
linear/typological character of language 65
linearity 26, 91, 158, 242, 243
linguistic meaning-making resources 61
local discontinuity 187
local variation 187, 188
location 49, 50, 51, 100, 119, 129, 135, 139, 140, 154, 187, 188, 198, 204, 205, 211, 248
logical meaning and/or metafunction 16, 190
logo 5, 26, 27, 29, 31, 33, 41, 121, 131, 174, 185, 186, 199, 200, 209
London Transport (LT) text 19, 24, 25, 26, 27, 30, 34, 37
long shot 39, 197
Lupo Alberto text 35, 36, 37, 38

M
macro-analytical approach 166, 167, 169, 171, 173
macro-New(s) 74, 75, 76, 77
macrophase(s) 48, 50, 54, 56, 173
macro-Theme 74, 75, 76, 77
make-up 7, 116
Marmaduke cartoons 7-15, 20, 21, 37, 38
Martin, James 13, 14, 16, 74, 75, 77, 78, 83, 84, 100, 113, 114, 188, 225, 237, 239, 244
Martinec, Radan 47, 49, 80, 225
Marx, Karl 61, 62, 63, 64, 65, 66, 69
material object text 21, 44, 109, 173, 175, 177
material surface 87, 109, 175
McGregor, William 22, 38, 100, 149, 204, 206
meaning compression (principle) 1, 19, 24-6, 42, 56, 58, 64, 165
meaning relations 55, 71, 83, 137, 140, 158, 159, 160, 167, 175, 180, 227, 242, 243
meaning-making activities 3, 78, 116, 227, 245
meaning-making process 6, 118, 166, 180, 183
meaning-making resources 4, 6, 19, 20, 21, 30, 58, 61, 64, 116, 162, 163, 166, 167, 173, 197, 203, 223
meaning-making units 27, 49, 146, 164, 223
medium close shot 39, 197, 241
metacomment(s) 10, 11
metadiscursive elements, resources and/or structures 17, 114, 159
metafunctional interpretation 174, 181, 184, 222
metafunctional organisation 49, 225
metafunctions 1, 4, 16, 17, 22, 34, 38-42, 80-3, 85, 87, 167, 185, 222
metaphorical contrasts 6
metarule 10
metasemiotic status 99, 100, 102, 195, 214, 216
mini-genres 11, 42, 113
mirror writing 46
Mitsubishi Carisma text 48-54, 122, 166, 212, 223-248
modern scientific page(s) 63, 64, 67
mood 10, 11, 22, 51, 149, 205, 212
mouse (pointer) 44, 46, 105-6, 117, 121, 124-6, 130, 140, 147-55, 159, 174
movement 1, 3, 7, 12, 20, 21, 35, 37, 58, 60, 63, 66, 105, 147-54, 167, 172, 173, 178, 184, 185, 187, 189, 190, 193, 194, 202-8
movement proposition 206
movie set 54
moving element(s), resource(s) and/or structure(s) 27, 44, 105, 108, 118, 147, 174, 178, 179, 185, 187, 189, 191-4, 218, 219, 220
multimodal concordancing 48
multimodal genre(s) 1, 4, 26, 38
multimodal narrative(s) 34
multimodal page(s) 57, 58, 62, 64, 75
multimodal scientific text(s) 83, 92
multimodal syntagm 85, 86
multimodal textual design 58
multiple pathways 102
multiplying effect 18, 42
multitasking 117
music 1, 20, 23, 51-2, 54, 178, 180-1, 184-5, 209, 211, 214, 216, 218, 220-2
musical intensity 54
musical score 12

N
Nalon, Elena 80
narrative(s) 12, 16, 17, 34, 128, 238
narrative discourse 15, 16, 238-9
narrative meaning 15, 16, 233
narrative organisation 12, 16
narrative sequence 12-15
narrative structures 37
narrative timeline 16
narrativity 14, 16, 239
Nasa Kids 31, 114, 119-29, 140, 146, 147, 151-6, 159
NasaToons 121, 124-9, 132
naturalistic representations and/or tracings 37, 41, 53, 63, 79, 121, 127, 131, 180, 200, 206
negotiation 96-7, 99
New (information) 1, 24, 40, 67, 69, 74, 75, 76, 77, 81, 82, 83, 90, 91, 105, 111, 117, 119, 128, 136, 143, 145, 146, 148, 150-6, 158, 160, 162, 164, 185, 186, 189, 190, 203
nominal groups 64, 74, 82, 88, 145, 230
nominalisation 64
non-linguistic resources 20
non-naturalistic qualities 53
non-salience 189, 212, 228
non-verbal accompaniments 20
nose-here perspective 197, 202, 205

O
O’Halloran, Kay 80
O’Toole, Michael 34, 80, 225
observers 47, 112, 152, 172, 183, 192, 215
optic array 37, 123, 175-6, 189-93, 205, 210, 223-9
oral discourse 12, 69
oral modality 61, 63
orchestral music 51, 52, 54
organisational principles 4, 18, 26, 116, 181, 215
organisational scales 50
orientation 29, 35, 36, 41, 44, 55, 82, 98-102, 113, 119, 124, 127, 131, 133, 150-2, 158, 180, 185, 188, 196, 199-208, 211, 221, 222, 248
original broadcasting 165

P
page-display bar 46
pagelets 28, 32, 67
panning 52, 193, 194
paradigmatic relations 156-9, 173, 243-4
paralinguistic elements 20
parataxis 234
participant chains 232, 233
participant roles 15, 16, 147, 159-62, 170, 176, 177, 202, 225, 242, 248
part-part relations 22, 234, 238
parts functioning in some larger whole 21
part-whole relations 56, 149, 234
patterned relations 19, 21, 178, 179, 183
peaks and troughs 182
perceptual purview 167
perceptual realism 120
perceptual simultaneity 13
periodicity 26, 74-7, 182
person deixis 10, 98, 101
perspective 1, 24, 39, 40, 53, 64, 86, 99, 106, 119, 120, 123-5, 147, 149, 152, 157, 163, 195, 197, 202-6
phasal analysis 4, 47, 50, 54
phasal organisation 166
phase(s) 1, 6, 43-4, 46-51, 53, 54, 60, 61, 80, 116, 122, 134, 166-7, 173-4, 180-6, 188, 209, 212-3, 217-9, 222-3, 228, 230, 232-3, 238-41, 244
phonetic empathy 215
phonological prosodies 20
photograph(s) 32, 34, 37, 41-4, 58, 63, 69, 75, 79-81, 88, 91-3, 120, 123, 137-9, 177, 193
photographic display 78, 79, 81
physical surface 193
plant movements 58, 63, 66
playwrights’ asides 21
pointing 7, 26, 74, 83, 85, 93, 167, 172, 173, 235
political cartoon(s) 63
postproduction 183, 184
posture 20, 122, 179
potential significance 53
precursor forms 92
primary and/or secondary genres 4, 11, 27, 31, 39, 42-3, 68
printed page(s) and/or text(s) 9, 13, 15, 17, 18, 19, 21, 23, 31, 37, 38, 39, 42, 49, 53, 57, 59, 61, 63, 103, 104, 105, 106, 109, 110, 111, 112, 113, 115, 117, 119, 121, 123, 125, 127, 129, 133, 153, 155, 157, 267, 269, 271, 273, 275, 277, 279, 281
progressive picture(s) 189, 191
projection 46, 96, 99, 101, 106, 109, 157, 175, 182, 192, 193, 206, 223-5
prominence 22, 47, 52, 53, 78, 121, 199, 209, 216, 228
prosodic features 52, 53

Q
quotation marks 7, 10, 61, 63

R
recontextualisation(s) 3, 21, 63, 69, 120, 133-4, 136, 147, 158, 162, 165, 191, 213
reference point 10, 11, 193
thematic-semantic condensation 64, 71
Thibault, Paul 1, 18, 20, 34, 35, 47, 48, 49, 51, 55, 58, 64, 65, 68, 69, 71, 80, 82, 86, 88, 90, 96, 97, 99, 112, 114, 122, 166, 167, 172, 175, 183, 193, 204, 205, 225, 232, 234, 237, 238, 247, 248
thought bubbles 24, 37
timeline 14, 16, 182
title(s) 26, 27, 28, 29, 33, 75, 76, 77, 127, 140
tools 46, 102, 122, 162, 164, 198, 248
top-bottom organisation 40, 81, 83, 91, 105
topological elements, space and/or values 18, 64, 65, 83, 85, 100, 178, 189, 198-9, 203-4, 232, 234, 236
transformations 12, 189, 190, 191, 193, 197, 224-5, 228-9
transition(s) 13, 14, 46, 47, 48, 49, 50, 53, 151, 173, 181, 182, 184, 185, 186, 187, 190, 193, 215, 216, 228, 233, 243, 244, 248
transitivity frame(s) 1, 16, 46, 49, 51, 55, 122, 123, 166-73, 225, 230-234, 237, 242, 244
transitivity relations 55, 138, 146, 173, 176
TV advertisements 4, 7, 48-9, 54, 165-6, 193, 212, 223
typological-categorial relationships 64-5, 74, 83, 189, 232-4

U
unfolding text 173, 187, 222
use of speech 7
utterance 2, 10, 11, 22, 101, 218, 245, 246

V
Van Leeuwen, Theo 7, 18, 20, 34, 37, 39, 41, 51, 58, 60, 63, 66, 79, 80, 81, 86, 92, 93, 99, 122, 180, 183, 189, 195, 197, 200, 201, 212, 216, 221, 225, 234, 244
variability 114, 117-8
vector(s) 29, 35-6, 39, 40, 49, 63, 66-8, 83-8, 90, 122, 131, 132, 140, 170, 186, 201, 225, 230-7, 241
Ventola, Eija 80, 113
verbal and visual resources 7, 70, 80, 102, 141
verbal genres 41
verbal text 19, 40-2, 63-4, 71-4, 77, 80-4, 87, 89, 91-2, 127, 131-3, 136, 140-6, 149, 156, 177-8
vertical hierarchies and/or structures 26, 27, 30, 39, 40, 53, 64, 66, 71, 81, 115, 119, 129, 157, 170, 174, 178, 186, 195, 230, 240
very close/long shot 39, 170, 197, 240
VHS films 46
video recordings and/or texts 50, 60, 105, 122, 127, 165, 166, 167, 173, 174, 188, 189, 196, 226, 234, 236, 238, 242, 243, 249
virtual magnifying glass 31, 46
virtual world 106, 120, 123, 124, 148, 152, 156, 157, 159, 163
visual and actional resources 46
visual and verbal resources 19, 46, 60, 64, 68, 70
visual collocation 198
visual cues 15
visual cuts 190
visual design 70
visual devices 37
visual focus 200, 201, 228
visual forms 14, 43, 68, 204, 223, 242
visual frame 174, 178, 184, 185, 186, 187, 189, 190, 193, 201, 202, 209
visual genres 39, 41, 71, 146, 200
visual image(s) 14, 20-1, 29, 37, 39, 41-2, 57, 65, 68, 70, 79, 83-8, 90-1, 99, 109, 127, 136, 138, 145-6, 148, 155-6, 161, 174-5, 178, 180, 189, 191, 193, 197, 205, 223, 227, 232-4, 248
visual information 39, 175, 189, 191
visual kinaesthesis 191, 193-5, 202, 224-5, 232
visual parameter 195
visual percepts 87, 93
visual process 39, 40, 231, 232
visual resources 7, 55, 64, 70, 80, 98, 99, 102, 125, 141, 224, 236, 240, 242, 244
visual salience 44, 82, 155, 188, 199, 228
visual scene 51, 52, 120, 138, 153, 155, 225
visual semiotic 42, 63, 65, 66, 70, 74, 79-83, 87, 92, 93, 98-9, 127, 139, 188, 191, 210, 240, 243-4
visual strategies 190
visual text(s) 14, 51, 57, 68, 79, 80, 83, 84, 90, 102, 122, 171, 174, 176, 178, 186, 189, 193, 197, 199, 200, 217, 223-5, 227, 235, 237, 243
visual-graphological resources 71, 77
visual-spatial units 104, 105, 114
voice(s) 19, 21, 51, 52, 53, 54, 55, 63, 69, 96, 119, 142, 162, 184, 185, 211, 212, 213, 214, 218, 219, 220, 221, 222

W
wave cycle 182
wave-like patterning 181
weak classification and/or framing 129, 159
web users 112-3, 115, 117, 159
web-based animations 44
web-based films 46
Westpac text 69, 125, 130, 165, 166, 167, 174-223
whole-part relationships 56
whole-whole relationships 38
wipes 190
word-processing packages 67

Z
Zombie text 5-7, 10, 20