Corpus Linguistics Part 1
Corpus Linguistics:
Method, Analysis, Interpretation
The corpus approach harnesses the power of computers, allowing analysts to produce
machine-aided analyses of large bodies of language data - so-called corpora.
Computers allow us to do this on a scale and with a depth that would typically defy
analysis by hand and eye alone. In doing so, we gain unprecedented insights into the use
and manipulation of language in society.
What is Corpus Linguistics?
Corpus linguistics, broadly, is a collection of methods for studying language. It begins with
collecting a large set of language data – a corpus - which is made usable by computers.
Corpora (the plural of corpus) are usually so large that it would be impossible to analyse
them by hand, so software packages (often called concordancers) are used to
study them. It is also important that a corpus is built from data well matched to the
research question it is designed to investigate. To investigate language use in an academic
context, for example, it would be appropriate to collect data from academic contexts such as
academic journals or lectures. Collecting data from the sports pages of a tabloid
newspaper would make much less sense.
Software:
A number of software packages are available, with varying functionalities and price tags.
Some can be downloaded and used for free; others cost money or are
available only online but include built-in reference corpora. This table gives an idea of the
variety of software currently available:
Glossary
Use this glossary as a handy reference when you come
across any terminology on the course that you do not
understand.
Annotation
Codes used within a corpus that add information about features such as grammatical
category. Also refers to the process of adding such information to a corpus.
Balance
A property of a corpus (or, more precisely, of a sampling frame).
A corpus is said to be balanced if the relative sizes of each of its subsections have been
chosen with the aim of adequately representing the range of language that exists in the
population of texts being sampled (see also, sample).
Colligation
More generally, colligation is co-occurrence between grammatical categories (e.g. verbs
colligate with adverbs) but can also mean a co-occurrence relationship between a word
and a grammatical category.
Collocation
A co-occurrence relationship between words or phrases. Words are said to collocate with
one another if one is more likely to occur in the presence of the other than elsewhere.
Comparability
Two corpora or sub-corpora are said to be comparable if their sampling frames are similar
or identical.
Concordance
A display of every instance of a specified word or other search term in a corpus, together
with a given amount of preceding and following context for each result or ‘hit’
Concordancer
A computer program that can produce a concordance from a specified text or corpus.
Modern concordance software can also facilitate more advanced analyses
Corpus
From the Latin for ‘body’ (plural corpora), a corpus is a body of language representative
of a particular variety of language or genre which is collected and stored in electronic
form for analysis using concordance software.
Corpus construction
The process of designing a corpus, collecting texts, encoding the corpus, assembling and
storing the metadata, marking up (see markup) the texts where necessary and possibly
adding linguistic annotation.
Corpus-based
Where corpora are used to test preformed hypotheses or exemplify existing linguistic
theories. Can mean either:
(a) Any approach to language that uses corpus data and methods.
(b) An approach to linguistics that uses corpus methods but does not subscribe to corpus-
driven principles.
Corpus-driven
An inductive process where corpora are investigated from the bottom up and patterns
found therein are used to explain linguistic regularities and exceptions of the language
variety/genre exemplified by those corpora.
Diachronic
Diachronic corpora sample (see sampling frame) texts across a span of time or from
different periods in time in order to study the changes in the use of language over time.
Compare: synchronic.
Encoding
The process of representing the structure of a text using markup language and
annotation
Frequency list
A list of all the items of a given type in a corpus (e.g. all words, all nouns, all four-word
sequences) together with a count of how often each occurs
Key word in context (KWIC)
A way of displaying a node word or search term in relation to its context within a text;
this usually means the node is displayed centrally in a table with co-text displayed in
columns to its left and right. Here, ‘key word’ means ‘search term’ and is distinct from
a keyword in the statistical sense (see: keyword).
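The layout can be sketched in a few lines of Python. This is an illustrative toy, not how a real concordancer is implemented; the sample sentence and window width are invented for the example:

```python
# Toy sketch of a KWIC display: the node word is centred, with
# co-text shown to its left and right. Real concordancers add
# sorting, wildcards and frequency information.
def kwic(tokens, node, width=3):
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left:>30} | {tok} | {right}")
    return lines

text = "the cat sat on the mat while the dog watched the cat".split()
for line in kwic(text, "cat"):
    print(line)
```

Each hit appears on its own line, which is what makes scanning for patterns around the node so quick.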
Keyword
A word that is more frequent in a text or corpus under study than it is in some (larger)
reference corpus. The difference in the word’s frequency between the two corpora must
be statistically significant (see: statistical significance) for it to count as a keyword.
Lemma
A group of words related to the same base word differing only by inflection. For example,
walked, walking, and walks are all part of the verb lemma WALK.
Lemmatisation
A form of annotation where every token is labelled to indicate its lemma
Lexis
The words and other meaningful units (such as idioms) in a language; the lexis or
vocabulary of a language is usually viewed as being stored in a kind of mental dictionary,
the lexicon.
Markup
Codes inserted into a corpus file to indicate features of the original text other than the
actual words of the text. In a written text, for example, markup might include paragraph
breaks, omitted pictures, and other aspects of layout.
Markup language
A system or standard for incorporating markup (and, sometimes, annotation and
metadata) into a file of machine-readable text; the standard markup language today is
XML.
Metadata
The texts that make up a corpus are the data. Metadata is data about that data - it gives
information about things such as the author, publication date, and title for a written text.
Monitor corpus
A corpus that grows continually, with new texts being added over time so that the dataset
continues to represent the most recent state of the language as well as earlier periods
Node
In the study of collocation - and when looking at a key word in context (KWIC) - the node
word is the word whose co-occurrence patterns are being studied.
Reference corpus
A corpus which, rather than being representative of a particular language variety,
attempts to represent the general nature of a language by using a sampling frame
emphasising representativeness.
Representativeness
A representative corpus is one sampled (see, sample) in such a way that it contains all the
types of text, in the correct proportions, that are needed to make the contents of the
corpus an accurate reflection of the whole of the language or variety of language that it
samples (also see: balance).
Sample
A single text, or extract of a text, collected for the purpose of adding it to a corpus. The
word sample may also be used in its statistical sense by corpus linguists. In this latter
sense, it means groups of cases taken from a population that will, hopefully, represent
that population such that findings from the sample can be generalised to the population.
In this statistical sense, a corpus is itself a sample of language.
Sample corpus
A corpus that aims for balance and representativeness within a specified sampling frame
Sampling frame
A definition, or set of instructions, for the samples (see: sample) to be included in a
corpus. A sampling frame specifies how samples are to be chosen from the population of
text, what types of texts are to be chosen, the time they come from and other such
features. The number and length of the samples may also be specified.
Significance test
A mathematical procedure to determine the statistical significance of a result
Statistical significance
A quantitative result is considered statistically significant if there is a low probability
(usually lower than 5%) that the figures extracted from the data are simply the result of
chance. A variety of statistical procedures can be used to test statistical significance.
Synchronic
Relating to the study of language or languages as they exist at a particular moment in
time, without reference to how they might change over time (compare: diachronic). A
synchronic corpus contains texts drawn from a single period - typically the present or very
recent past.
Tagging
An informal term for annotation, especially forms of annotation that assign an analysis to
every word in a corpus (such as part-of-speech or semantic tagging).
Text
As a count noun: a text is any artefact containing language usage - typically a written
document or a recorded and/or transcribed spoken interaction. As a non-count noun: collected
discourse, on any scale.
Token
Any single, particular instance of an individual word in a text or corpus
Compare: lemma, type.
Type
(a) A single particular wordform. Any difference of form (e.g. spelling) makes a word a
different type. All tokens comprising the same characters are considered to be examples
of the same type.
(b) Can also be used when discussing text types.
Type-token ratio
A measure of vocabulary diversity in a corpus, equal to the total number of types divided
by the total number of tokens; the closer the ratio is to 1 (or 100%), the more varied the
vocabulary is. This statistic is not comparable between corpora of different sizes.
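As a rough sketch, assuming a deliberately naive definition of a ‘word’ as a run of letters, the ratio can be computed like this:

```python
# Minimal sketch of a type-token ratio calculation. Real corpus
# tools tokenise more carefully; here a "word" is simply a
# lowercased alphabetic string.
import re

def type_token_ratio(text: str) -> float:
    tokens = re.findall(r"[a-z]+", text.lower())  # every running word
    types = set(tokens)                           # distinct word forms
    return len(types) / len(tokens)

sample = "the cat sat on the mat and the dog sat too"
# 11 tokens, 8 types -> ratio of roughly 0.727
print(round(type_token_ratio(sample), 3))
```

Note that longer texts inevitably repeat common words more often, which is why the ratio is not comparable between corpora of different sizes.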
XML
A markup language which is the contemporary standard for use in corpora as well as for
a range of data-transmission purposes on the Internet. In XML, tags are indicated by
<angle> <brackets>.
Part One:
An Introduction to
Corpus Linguistics
Introduction to this part's activities
Warm up activity
Part 1: why use a corpus?
Part 2: annotation and mark-up
Part 3: types of corpora
Part 4: Frequency Data, Concordances and Collocation
Part 5: Corpora and Language Teaching
Test your Knowledge (Quiz)
Why do I need special software?
Brown and LOB
Downloads
Introduction to AntConc
AntConc - concordancing
AntConc - using advanced search to explore the Brown corpus
AntConc - creating and using a wordlist
Practical activity - a question
Further Reading
Discussion question for Part 1
Introduction to this part’s activities
In this part, we begin by looking at the background to corpus linguistics –
the types of things you can do using a corpus and some of the technical
details of how corpora are built.
In the ‘how to’ part of this part, we introduce you to the concordance
package available free with this course – AntConc, authored by Laurence
Anthony of Waseda University.
Take notes as you go and use the ‘pop quiz’ to test your comprehension.
Undertake the readings for the part and contribute to the discussion.
Warm up activity
A quick activity to get started
Think of something you would like to find out about language. As you attend
the lecture, reflect back on your own interests – what types of corpora might
help you and what type of design issues would you have to consider if you
were to put together your own corpus to investigate language as you would
wish?
Part 1: why use a corpus?
The lecturer gives a brief review of why you might want to use a corpus and
decisions to make when building a corpus.
Please see:
Week 1 Lectures (part 1)
Week 1 Slides (Part 1)
Week 1 Videos (Part 1)
Part 2: annotation and mark-up
The Lecturer gives a brief overview of how corpus texts may be enriched
with additional information to ease analysis.
Note that this type of additional information may be called ‘mark up’,
‘annotation’, or ‘tagging’. All three terms are near synonyms. Annotation
usually refers to linguistic information encoded in a corpus - however, the
encoding is achieved using a mark-up language. Similarly, the annotation
itself is usually undertaken by inserting so-called tags - short codes that indicate
some linguistic feature - into a text. Hence, while the terms can be
separated, they can also be used interchangeably!
One final note - the slash in an XML tag is a forward slash, not a backslash: a
closing tag is written </tag>, never <\tag>.
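As a toy illustration, the snippet below invents a tiny XML-annotated sentence (the <s> and <w> tag names and the pos attribute are made up for the example, not a real corpus standard) and reads the annotation back with Python's standard library:

```python
# A minimal sketch of XML-style annotation: each word carries a
# part-of-speech attribute, and the closing tags begin with a
# forward slash. Tag names here are illustrative only.
import xml.etree.ElementTree as ET

annotated = '<s><w pos="DET">the</w> <w pos="NOUN">cat</w> <w pos="VERB">sat</w></s>'
sentence = ET.fromstring(annotated)
words = [(w.text, w.get("pos")) for w in sentence.iter("w")]
print(words)  # [('the', 'DET'), ('cat', 'NOUN'), ('sat', 'VERB')]
```

Because the markup is machine-readable, software can search the annotation (e.g. find every NOUN) as easily as it searches the words themselves.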
Please see:
Week 1 Lectures (part 2)
Week 1 Slides (Part 2)
Week 1 Videos (Part 2)
Part 3: Types of Corpora
The Lecturer looks at a range of different types of corpora.
Please see:
Week 1 Lectures (part 3)
Week 1 Slides (Part 3)
Week 1 Videos (Part 3)
Part 4: Frequency Data,
Concordances and Collocation
The Lecturer explores the value of frequency data in corpus linguistics and
takes a first look at a key concept in corpus linguistics - collocation.
This lecture mentions the idea of normalised frequencies per million. What
are these? Imagine you have two corpora, one of two million words and
another of three million words. You look in each for the word ‘dalek’ and find
20 examples in the first and 30 examples in the second. That does not mean
that the word is more frequent in the second corpus - remember it is bigger.
One way of making this issue apparent, and making the numbers more
comparable, is to normalise the frequencies. To normalise per million, you
are in essence asking the question ‘if my corpus was only one million words,
how many examples would I expect to find?’.
Our first corpus is two million words - so to normalise the frequency of ‘dalek’
to one million words, we would divide by two, giving us 20/2=10.
The second corpus is three times as large as one million, so to normalise per
million we would divide the results from the second corpus by three giving
30/3=10. This shows clearly that we have no reason to claim that the word
‘dalek’ is more frequent in one of the corpora than the other.
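The same arithmetic can be written as a short function; a minimal sketch reproducing the ‘dalek’ figures above:

```python
# Normalised frequency per million words: scale the raw count
# as if the corpus contained exactly one million words.
def per_million(raw_count: int, corpus_size: int) -> float:
    return raw_count / (corpus_size / 1_000_000)

print(per_million(20, 2_000_000))  # 10.0
print(per_million(30, 3_000_000))  # 10.0
```

Both corpora give 10 occurrences per million words, confirming that the raw counts of 20 and 30 reflect corpus size, not a real difference in frequency.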
Please see:
Week 1 Lectures (part 4)
Week 1 Slides (Part 4)
Week 1 Videos (Part 4)
Part 5: Corpora and Language
Teaching
The Lecturer takes a brief look at a major application area for corpus
linguistics - language teaching.
After the video, don’t forget to update your journal! Keep a record of what
you are learning. You will find it really helps as the course proceeds if you
keep clear, structured notes of what you have learnt.
Test your Knowledge (Quiz)
What is a corpus?
- A theory of language
- A collection of texts stored on a computer
- An electronic database similar to a dictionary
- Any large collection of words such as a collection of books, newspapers or magazines

What is annotation?
- Adding an extra layer of information to the text to allow for more sophisticated searches
- Separating text into sentences
- Manual coding of text for parts of speech
- Adding critical comments to a text

Which of the following is not a type of corpus?
- Multilingual corpus
- Learner corpus
- Diachronic corpus
- Observer corpus

Which of the following statements about monitor corpora is not true?
- It is frequently updated
- The Bank of English is an example of a monitor corpus
- The BNC is an example of a monitor corpus
- It is used to monitor rapid change in language

What is a concordance?

What is collocation?
- The tendency of speakers to talk over each other
- The tendency of words to co-occur with one another
- The tendency of words to appear in unique, different contexts each time
- The tendency of sentences to create meaning
Why do I need special software?
As you will discover, software like AntConc allows you to do so much more
than a word processor does. Even for something as simple as searching for a
word, it presents the results in a format that is more suitable for those
interested in studying language; the standard concordance view of one
example per line with left and right context allows you to rapidly browse data
looking for patterns of usage.
Yet beyond this the software allows you to do a number of things that no
word processor does, such as undertaking keyword analyses and looking for
collocations. By the time you have finished learning to use AntConc, you will
have developed a full appreciation of the need to use such software to study
language in use.
Brown and LOB
These corpora are sometimes referred to as ‘snapshot’ corpora - their
design is such that they try to represent a broad range of genres of
published, professionally authored, English. Their goal is to capture the
language at one moment in time, hence the term ‘snapshot’.
Of course, as with any snapshot there are things you see and things you do
not see. So, in this case, we are looking at professionally authored written
English - not speech and not writing of a more informal variety. We are also
only looking at certain genres. As with any snapshot, it was taken at a certain
point of time in a certain place - Brown is America in the early 1960s, LOB is
the UK in the early 1960s. Such corpora are often used to compare and
contrast varieties of a language - in this case two varieties of English. They
can also be looked at on their own to explore either variety of English in its
own right.
Back to the snapshot metaphor! The two corpora can be compared because
they are composed in the same way - the subject is the same, if you like. They
look at broadly the same genres. Those genres are represented by similar numbers of
similarly sized chunks of data. Also, of course, the data was gathered
in roughly the same time period.
The genres covered in the two corpora are outlined below. Note the letter
code for each genre - that is important, as it shows you which genre is
associated with which file in the corpus. Following the letter code is a
description of the type of data in the category, followed by two numbers in
parentheses - the first is the number of chunks of data in that category in
Brown, the second is the number of chunks of data in that category in LOB.
There are five hundred chunks of data in each corpus. Each chunk is
approximately 2,000 words in size, giving a rough overall corpus size of
1,000,000 words each.
A Press: reportage (44, 44)
B Press: editorial (27, 27)
C Press: reviews (17, 17)
D Religion (17, 17)
E Skills, trades and hobbies (36, 38)
F Popular lore (48, 44)
G Belles lettres, biography, essays (75, 77)
H Miscellaneous: government documents, reports (30, 30)
J Learned and scientific writings (80, 80)
K General fiction (29, 29)
L Mystery and detective fiction (24, 24)
M Science fiction (6, 6)
N Adventure and western fiction (29, 29)
P Romance and love story (29, 29)
R Humour (9, 9)
Downloads
(The instructor will provide students with the different software
packages and corpora)
Instructions on how to download AntConc and the Brown and LOB corpora
for analysis
Choose the version you want to run (i.e. for Windows, Mac or Linux) and
click the link for version 3.4.3
If you are using a Windows computer, you will download a single executable
(.exe) file. Put this on your desktop or in some other area that is easy for you
to access. Double click to start.
If you are using a Linux computer, you will download a tar.gz folder that you
need to decompress first. Inside the folder, you will find the AntConc
executable file, an icon, and a simple setup guide. Set the permissions of the
executable file and double click to start.
If you are using a Macintosh computer, you will download a zip file that you
need to unzip first. Put the unzipped AntConc application on your desktop or
in some other area which is easy for you to access. Double click to start. (At
this point, you may get one or two security warnings. AntConc is completely
virus free, so you can ignore these warnings or, if necessary, disable them via
the System Preferences.)
Click this link to download a zip file containing the two corpora.
To use the corpora, first, unzip the file (see below), and then drag the two
folders inside (“brown_corpus_untagged” and “lob_corpus_untagged”) to
a convenient place on your computer. We suggest you place them in a new
folder called “corpora”. You can then delete the original zip file if you want.
If you are using a Windows computer, you can unzip the file by right clicking
on the file name and selecting “Extract All”. The unzipped file will open in a
new window where you can see the two corpora.
If you are using a Macintosh computer, you can unzip the file by simply
double-clicking on it. You can then open the unzipped file and see the two
corpora inside.
If you are using a Linux computer, unzip the file using your preferred zip
program. On most systems you can simply double click the file and then
move the two corpora inside to a convenient place.
This includes showing you how to build a wordlist from a corpus. As part of
this, you will hear the terms type and token. A token is any running word in the
corpus; a type is a distinct word form. The number of types in a corpus is therefore
the number of unique word forms it contains.
Note that we can, of course, quibble about the definition of a word! Consider
the word ‘gonna’ - some may argue this is two words, others one.
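A minimal sketch of the type/token distinction and of a frequency wordlist, using Python's standard library (the sample sentence is invented, and the tokenisation is deliberately naive):

```python
# Tokens are running words; types are distinct word forms.
# A frequency wordlist is simply a count of tokens per type.
from collections import Counter

text = "gonna go or gonna stay the plan is gonna change"
tokens = text.split()          # naive tokenisation on whitespace
wordlist = Counter(tokens)     # type -> frequency

print(len(tokens))             # 10 tokens (running words)
print(len(wordlist))           # 8 types (distinct word forms)
print(wordlist.most_common(1)) # [('gonna', 3)]
```

Note how the ‘gonna’ quibble above matters here: splitting it into ‘going’ and ‘to’ would change both the token count and the type count.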
Please see:
AntConc Videos (1)
AntConc Transcript (1)
AntConc - concordancing
Laurence Anthony looks at some of the basic features of the AntConc
concordance tool.
Topics covered include how to load a corpus, how to search for words in a
corpus, how to order the results of a search and how to search for parts of
words.
Please see:
AntConc Videos (2)
AntConc Transcript (2)
AntConc - using advanced search to explore
the Brown corpus
Laurence Anthony looks at some of the advanced features of the AntConc
program.
Please see:
AntConc Videos (3)
AntConc Transcript (3)
AntConc - creating and using a wordlist
Laurence Anthony shows you how to build a frequency wordlist from a
corpus.
In addition, he covers some related issues such as sorting the list and
searching it.
Please see:
AntConc Videos (4)
AntConc Transcript (4)
Practical activity - a question
Take the LOB corpus and build a word list. Look at the top thirty words. How
would you characterise these words? Do the same with the Brown corpus. Is
it similar? Are there any differences between LOB and Brown? Feel free to
concordance the words to inform your analysis.
If you have the time, do the same with the subsections of LOB and Brown.
Might wordlists help to determine genre?
Further Reading:
Our readings this week come to us courtesy of Edinburgh University Press
and Routledge
It is chapter one of this book. It will help you broaden your understanding of
the background to corpus linguistics and will place in historical context the
move away from, and return to, corpus data in linguistics.
This book will be of great assistance to you throughout this course. Each time
you hear or see a type of annotation discussed, you should be able to use
this book as a useful reference guide to find out what that type of annotation
is and how it is undertaken. While published in 1997, this book is still a good
reference guide. For this week, read chapter 1 of the book - Leech’s outline
of the principles of corpus annotation is as relevant today as it was
when it was written.
Discussion question for Part 1
When you have completed the lecture and the associated readings, consider
and discuss the following statement:
“Noam Chomsky is one of the most influential figures in corpus linguistics. His
ideas have shaped corpus linguistics while he has also, paradoxically, sought to
deny its value”.
Reflect back on the warm up activity and your readings this week. Think
about what you would like to use corpora for and consider the types of
corpora you would need to use.
Discuss the design aspects of your proposed work. For example, what type
of corpus would you have to use? How large do you think it would have to
be? Would annotation help you and if so what sort?
Discuss these and any other questions related to your proposed use of
corpus data.