Brown Corpus

The Brown University Standard Corpus of Present-Day American English (or just Brown Corpus) was compiled in the 1960s by Henry Kučera and W. Nelson Francis at Brown University, Providence, Rhode Island, as a general corpus (text collection) in the field of corpus linguistics. It contains 500 samples of English-language text, totaling roughly one million words, drawn from works published in the United States in 1961.

History

In 1967, Kučera and Francis published their classic work Computational Analysis of Present-Day American English, which provided basic statistics on what is known today simply as the Brown Corpus. The Brown Corpus was a carefully compiled selection of then-current American English, totaling about a million words drawn from a wide variety of sources. Kučera and Francis subjected it to a variety of computational analyses, from which they compiled a rich and variegated opus combining elements of linguistics, psychology, statistics, and sociology. The corpus has been very widely used in computational linguistics, and was for many years among the most-cited resources in the field.

Shortly after publication of the first lexicostatistical analysis, Boston publisher Houghton Mifflin approached Kučera to supply a million-word, three-line citation base for its new American Heritage Dictionary. This groundbreaking dictionary, which first appeared in 1969, was the first to be compiled using corpus linguistics for word frequency and other information.

The initial Brown Corpus had only the words themselves, plus a location identifier for each. Over the following several years, part-of-speech tags were added. The Greene and Rubin tagging program (see part-of-speech tagging) helped considerably in this, but its high error rate meant that extensive manual proofreading was required.

The tagged Brown Corpus used a selection of about 80 parts of speech, as well as special indicators for compound forms, contractions, foreign words and a few other phenomena, and formed the basis for many later corpora such as the Lancaster-Oslo-Bergen Corpus. The tagged corpus enabled far more sophisticated statistical analysis, much of it carried out by graduate student Andrew Mackie. Some of the analysis appears in Frequency Analysis of English Usage: Lexicon and Grammar, by Winthrop Nelson Francis and Henry Kučera, Houghton Mifflin (January 1983), ISBN 0-395-32250-2.

One interesting result is that even for quite large samples, graphing words in order of decreasing frequency of occurrence yields a hyperbola: the frequency of the n-th most frequent word is roughly proportional to 1/n. Thus "the" constitutes nearly 7% of the Brown Corpus, and "to" and "of" more than another 3% each, while about half of the total vocabulary of about 50,000 words consists of hapax legomena: words that occur only once in the corpus.[1] This simple rank-versus-frequency relationship was noted for an extraordinary variety of phenomena by George Kingsley Zipf (for example, see his The Psychobiology of Language), and is known as Zipf's law.
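The rank-frequency table and the hapax legomena described above can be computed for any tokenized text. The following is a minimal sketch using only the Python standard library; the function names `rank_frequency` and `hapax_legomena` and the toy token list are illustrative assumptions, not part of any Brown Corpus distribution. Under Zipf's law the product of rank and frequency stays roughly constant, but a sample this small is far too short to show that.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, count, share) tuples, most frequent word first."""
    counts = Counter(t.lower() for t in tokens)
    total = sum(counts.values())
    return [
        (rank, word, count, count / total)
        for rank, (word, count) in enumerate(counts.most_common(), start=1)
    ]


def hapax_legomena(tokens):
    """Return the set of words that occur exactly once."""
    counts = Counter(t.lower() for t in tokens)
    return {word for word, count in counts.items() if count == 1}


# Toy input (an assumption for illustration); a real check would run over
# the corpus's roughly one million tokens.
tokens = "the cat saw the dog and the dog saw a bird".split()
table = rank_frequency(tokens)
hapaxes = hapax_legomena(tokens)

for rank, word, count, share in table[:3]:
    print(rank, word, count, f"{share:.1%}")
```

On the full corpus, the `share` column for the top-ranked word would correspond to the "nearly 7%" figure quoted for "the".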

Although the Brown Corpus pioneered the field of corpus linguistics, typical corpora today (such as the Corpus of Contemporary American English, the British National Corpus or the International Corpus of English) tend to be much larger, on the order of 100 million words.

Sample distribution

The corpus consists of 500 samples, distributed across 15 genres in rough proportion to the amount published in 1961 in each of those genres. All works sampled were published in 1961; as far as could be determined, they were first published then and were written by native speakers of American English.

Each sample began at a random sentence-boundary in the article or other unit chosen, and continued up to the first sentence boundary after 2,000 words. In a very few cases miscounts led to samples being just under 2,000 words.
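The sampling rule just described can be sketched as follows. This is only an illustration of the rule as stated, not the compilers' actual procedure: the function name `draw_sample`, its parameters, and the representation of a source unit as a list of tokenized sentences are all assumptions.

```python
import random


def draw_sample(sentences, target_words=2000, rng=None):
    """Start at a random sentence boundary and stop at the first sentence
    boundary at or after `target_words` words.

    `sentences` is a list of sentences, each a list of word tokens.
    Returns (start_index, chosen_sentences, word_count).
    """
    rng = rng or random.Random()
    start = rng.randrange(len(sentences))
    sample, n_words = [], 0
    for sentence in sentences[start:]:
        sample.append(sentence)
        n_words += len(sentence)
        if n_words >= target_words:
            break
    # If the unit ends before the target is reached, the sample is short.
    return start, sample, n_words


# Hypothetical source unit: 1,000 sentences of 7 words each.
unit = [["word"] * 7 for _ in range(1000)]
start, sample, n_words = draw_sample(unit, rng=random.Random(0))
print(start, len(sample), n_words)
```

Because a sample always ends on a sentence boundary, its length normally overshoots 2,000 words by less than one sentence.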

The original data entry was done on upper-case only keypunch machines; capitals were indicated by a preceding asterisk, and various special items such as formulae also had special codes.
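A minimal sketch of how such an encoding might be decoded is given below. It assumes that an asterisk capitalizes exactly the next character and that every other letter should be lowercased; the function name `decode_capitals` and the sample string are hypothetical, and the special codes for formulae and other items are not handled.

```python
def decode_capitals(encoded):
    """Lowercase upper-case-only text, capitalizing the character after '*'.

    Assumption: '*' is never a literal character in the encoded text.
    """
    out = []
    capitalize_next = False
    for ch in encoded:
        if ch == "*":
            capitalize_next = True
            continue
        out.append(ch.upper() if capitalize_next else ch.lower())
        capitalize_next = False
    return "".join(out)


# Made-up input for illustration, not a line from the corpus files.
decoded = decode_capitals("*THE CORPUS WAS COMPILED AT *BROWN *UNIVERSITY")
print(decoded)
```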

The corpus originally (1961) contained 1,014,312 words sampled from 15 text categories:

A. PRESS: Reportage (44 texts)
- Political
- Sports
- Society
- Spot News
- Financial
- Cultural
B. PRESS: Editorial (27 texts)
- Institutional Daily
- Personal
- Letters to the Editor
C. PRESS: Reviews (17 texts)
- Theatre
- Books
- Music
- Dance
D. RELIGION (17 texts)
- Books
- Periodicals
- Tracts
E. SKILLS AND HOBBIES (36 texts)
- Books
- Periodicals
F. POPULAR LORE (48 texts)
- Books
- Periodicals
G. BELLES-LETTRES - Biography, Memoirs, etc. (75 texts)
- Books
- Periodicals
H. MISCELLANEOUS: US Government & House Organs (30 texts)
- Government Documents
- Foundation Reports
- Industry Reports
- College Catalog
- Industry House Organ
J. LEARNED (80 texts)
- Natural Sciences
- Medicine
- Mathematics
- Social and Behavioral Sciences
- Political Science, Law, Education
- Humanities
- Technology and Engineering
K. FICTION: General (29 texts)
- Novels
- Short Stories
L. FICTION: Mystery and Detective Fiction (24 texts)
- Novels
- Short Stories
M. FICTION: Science (6 texts)
- Novels
- Short Stories
N. FICTION: Adventure and Western (29 texts)
- Novels
- Short Stories
P. FICTION: Romance and Love Story (29 texts)
- Novels
- Short Stories
R. HUMOR (9 texts)
- Novels
- Essays, etc.
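The per-category text counts listed above can be cross-checked against the stated totals of 500 samples and 1,014,312 words. The short sketch below does this arithmetic; the variable names are illustrative, and the grouping of categories K, L, M, N and P as "fiction" simply follows their labels in the list.

```python
# Text counts per category letter, copied from the list above.
texts_per_category = {
    "A": 44, "B": 27, "C": 17, "D": 17, "E": 36, "F": 48, "G": 75,
    "H": 30, "J": 80, "K": 29, "L": 24, "M": 6, "N": 29, "P": 29, "R": 9,
}

total_texts = sum(texts_per_category.values())
total_words = 1_014_312  # word count stated for the original corpus

# Average sample length; slightly over 2,000 because each sample runs on
# to the next sentence boundary.
mean_words_per_text = total_words / total_texts

# Categories whose label begins with "FICTION".
fiction_texts = sum(texts_per_category[c] for c in "KLMNP")

print(total_texts, round(mean_words_per_text, 1), fiction_texts)
```

The counts sum to 500, giving a mean of about 2,029 words per sample.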

Part-of-speech tags used

Tag	Definition
.	sentence (. ; ? *)
(	left paren
)	right paren
*	not, n't
--	dash
,	comma
:	colon
ABL	pre-qualifier (quite, rather)
ABN	pre-quantifier (half, all)
ABX	pre-quantifier (both)
AP	post-determiner (many, several, next)
AT	article (a, the, no)
BE	be
BED	were
BEDZ	was
BEG	being
BEM	am
BEN	been
BER	are, art
BEZ	is
CC	coordinating conjunction (and, or)
CD	cardinal numeral (one, two, 2, etc.)
CS	subordinating conjunction (if, although)
DO	do
DOD	did
DOZ	does
DT	singular determiner/quantifier (this, that)
DTI	singular or plural determiner/quantifier (some, any)
DTS	plural determiner (these, those)
DTX	determiner/double conjunction (either)
EX	existential there
FW	foreign word (hyphenated before regular tag)
HL	word occurring in the headline (hyphenated after regular tag)
HV	have
HVD	had (past tense)
HVG	having
HVN	had (past participle)
HVZ	has
IN	preposition
JJ	adjective
JJR	comparative adjective
JJS	semantically superlative adjective (chief, top)
JJT	morphologically superlative adjective (biggest)
MD	modal auxiliary (can, should, will)
NC	cited word (hyphenated after regular tag)
NN	singular or mass noun
NN$	possessive singular noun
NNS	plural noun
NNS$	possessive plural noun
NP	proper noun or part of name phrase
NP$	possessive proper noun
NPS	plural proper noun
NPS$	possessive plural proper noun
NR	adverbial noun (home, today, west)
NRS	plural adverbial noun
OD	ordinal numeral (first, 2nd)
PN	nominal pronoun (everybody, nothing)
PN$	possessive nominal pronoun
PP$	possessive personal pronoun (my, our)
PP$$	second (nominal) possessive pronoun (mine, ours)
PPL	singular reflexive/intensive personal pronoun (myself)
PPLS	plural reflexive/intensive personal pronoun (ourselves)
PPO	objective personal pronoun (me, him, it, them)
PPS	3rd. singular nominative pronoun (he, she, it, one)
PPSS	other nominative personal pronoun (I, we, they, you)
QL	qualifier (very, fairly)
QLP	post-qualifier (enough, indeed)
RB	adverb
RBR	comparative adverb
RBT	superlative adverb
RN	nominal adverb (here, then, indoors)
RP	adverb/particle (about, off, up)
TL	word occurring in title (hyphenated after regular tag)
TO	infinitive marker to
UH	interjection, exclamation
VB	verb, base form
VBD	verb, past tense
VBG	verb, present participle/gerund
VBN	verb, past participle
VBP	verb, non 3rd person, singular, present
VBZ	verb, 3rd. singular present
WDT	wh- determiner (what, which)
WP$	possessive wh- pronoun (whose)
WPO	objective wh- pronoun (whom, which, that)
WPS	nominative wh- pronoun (who, which, that)
WQL	wh- qualifier (how)
WRB	wh- adverb (how, where, when)

Note that some versions of the tagged Brown Corpus contain combined tags. For instance, the word "wanna" is tagged VB+TO, since it is a contracted form of the two words want/VB and to/TO. Some tags may also be negated: "aren't" would be tagged BER*, where * signifies the negation. Additionally, tags may be hyphenated: the suffix -HL is attached to the regular tags of words in headlines, the suffix -TL to the regular tags of words in titles, and the suffix -NC to the regular tags of cited words. A tag may also carry an FW- prefix, which marks a foreign word.
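The conventions in the preceding paragraph can be unpacked mechanically. The sketch below is one possible way to do so; the function name `parse_tag` and the shape of the returned dictionary are assumptions made for illustration, and the code treats the bare tags -- and * as ordinary tags rather than as markers.

```python
def parse_tag(tag):
    """Split a Brown-style tag into prefix, suffix markers, and base parts."""
    info = {"foreign": False, "markers": [], "parts": []}
    if tag in {"--", "*"}:  # the dash tag and the "not, n't" tag stand alone
        info["parts"] = [{"base": tag, "negated": False}]
        return info
    if tag.startswith("FW-"):  # foreign-word prefix
        info["foreign"] = True
        tag = tag[len("FW-"):]
    # Strip trailing -HL / -TL / -NC markers, in case more than one is attached.
    while tag[-3:] in {"-HL", "-TL", "-NC"}:
        info["markers"].insert(0, tag[-2:])
        tag = tag[:-3]
    # Combined tags such as VB+TO are joined with "+".
    for part in tag.split("+"):
        negated = part.endswith("*") and part != "*"
        base = part.rstrip("*") if negated else part
        info["parts"].append({"base": base, "negated": negated})
    return info


for example in ["BER*", "VB+TO", "FW-NN-TL", "NN$-HL", "--"]:
    print(example, parse_tag(example))
```

Checking for the standalone -- and * tags first matters, because a naive split on hyphens or a blanket strip of asterisks would otherwise destroy them.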

References

[1] Kirsten Malmkjær, The Linguistics Encyclopedia, 2nd ed., Routledge, 2002, ISBN 0-415-22210-9, p. 87.
