AntConc: Design and Development of a Freeware Corpus Analysis
Laurence Anthony
Waseda University
[email protected]
Authorized licensed use limited to: IEEE Xplore. Downloaded on October 20, 2008 at 01:37 from IEEE Xplore. Restrictions apply.
2005 IEEE International Professional Communication Conference Proceedings
In Section two, I will describe the background to AntConc and give a summary of its features. In Sections three to seven, I will give an overview of each of its tools, and explain their value to learners. Then, I will detail the current limitations of the software in Section eight, before explaining how these will be addressed in the future in Section nine.

2 Background and Summary of Features

AntConc was first released in 2002. At the time, it was a simple KWIC (Key Word in Context) concordancer program designed for use by over 700 students in a scientific and technical writing course at the Osaka University Graduate School of Engineering. AntConc was developed in a Windows environment using the Perl 5.8 programming language, and the graphical user interface (GUI) was developed using the Perl/Tk 8.0 toolkit. This enabled the program to be easily ported to a Linux/Unix environment, which was necessary as the course was initially taught in a Linux-based CALL (Computer Assisted Language Learning) laboratory before being moved to a Windows-based CALL laboratory the following year.

Following the release of AntConc 1.0, the program was uploaded to the author’s website, from which researchers, teachers, and learners around the world could easily download and use the software free of charge for non-profit use. This generated wide interest in the program, and many users reported on successes, problems and features they would like to see added, resulting in new, improved versions of the software. Interest in the program increased further after it was chosen to be included in Morphix NLP (see Note 6), a CD-based Linux distribution containing a wide range of natural language processing (NLP) tools.

At the time of print, the latest version is AntConc 3.0. It was released in December 2004, and includes numerous tools and features, as summarized in Table 1.

3 Concordancer Tool

The central tool used in most corpus analysis software, including AntConc, is the concordancer. As Sun & Wang [6] describe, concordancers have been shown to be an effective aid in the acquisition of a second or foreign language, facilitating the learning of vocabulary, collocations, grammar and writing styles. For example, research has shown that new vocabulary can only be acquired through meeting words in diverse natural contexts [7] and in varied situations.[8] Based on intuition alone, it is almost impossible to find a sufficient number of examples of a specific word or phrase to satisfy these conditions. On the other hand, using a reasonably large corpus, a concordance program can find and display a huge number of examples in varied contexts and situations quickly and efficiently.

Figure 1 shows a screenshot of AntConc while a
[Figure 1: screenshot of AntConc. Callout labels: Total Hits, File Window, KWIC, Hit File Name, Progress Report.]
1. Search terms can be substrings, words, or phrases, and can be either case sensitive or insensitive. They can be embedded with a wide range of wildcards that the user can assign to any particular character or string of characters via a menu option.

4 Concordance Search Term Plot Tool

The main purpose of the Concordancer Tool is to show how a search term is used in a target corpus. For users who want to see where a search term appears, AntConc offers the Concordance Search Term Plot Tool, shown in Figure 2.
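
The difference between showing how a search term is used (KWIC lines) and showing where it appears (a plot of relative positions) can be illustrated with a minimal Python sketch. This is not AntConc's own code, which is written in Perl. The function names `find_hits`, `kwic_lines`, `plot_positions` and `plot_bar`, the whole-word matching rule, and the 40-slot text bar are all assumptions made for illustration.

```python
import re


def find_hits(text, term, case_sensitive=False):
    """Return (start, end) character offsets of each whole-word match of term."""
    flags = 0 if case_sensitive else re.IGNORECASE
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", flags)
    return [m.span() for m in pattern.finditer(text)]


def kwic_lines(text, term, width=20, case_sensitive=False):
    """Build KWIC lines: the hit is aligned in a fixed column with context either side."""
    lines = []
    for start, end in find_hits(text, term, case_sensitive):
        left = text[max(0, start - width):start].rjust(width)
        right = text[end:end + width].ljust(width)
        lines.append(f"{left} {text[start:end]} {right}")
    return lines


def plot_positions(text, term, case_sensitive=False):
    """Return the relative position (0.0 up to, but not including, 1.0) of each hit."""
    return [start / len(text) for start, _ in find_hits(text, term, case_sensitive)]


def plot_bar(text, term, slots=40):
    """Render one file as a text bar in which '|' marks a slot containing a hit."""
    bar = ["."] * slots
    for position in plot_positions(text, term):
        bar[min(int(position * slots), slots - 1)] = "|"
    return "".join(bar)


if __name__ == "__main__":
    sample = "In this paper we describe a tool. We then show results, and we conclude."
    for line in kwic_lines(sample, "we", width=15):
        print(line)
    print(plot_bar(sample, "we"))
```

The same list of hits feeds both views, which is one plausible reason the two tools can share their search term options: only the presentation of the hits differs.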
[Figure 2: screenshot of the Concordance Search Term Plot Tool. Callout labels: File Window, Plot Results, Plot Statistics, Progress Report.]
The Concordance Search Term Plot Tool offers the same functionality as the Concordancer Tool in terms of search term options. However, the results are displayed in a quite different way. Here, each box represents a file in which multiple lines represent the relative positions at which search term hits can be found. From this display, it is easy to see where and in what distribution a search term appears in the file. This can be an effective aid, for example, in determining where phrases such as “we” or “in this paper” are used in research articles, or determining which research articles use a particular keyword or phrase.

5 View Files Tool

The View Files Tool of AntConc is shown in Figure 3. As described above, when a user clicks on a search term in the results display of the Concordancer Tool, the View Files Tool is used to display the search term in the original file. However, the View Files Tool can be used independently to search for any substring, word, phrase or regular expression in a target file, offering the user a very powerful text search engine.

All resulting hits are displayed in a user-definable highlight color, and buttons and keyboard shortcuts can be used to jump to a specified hit anywhere in the file. If the user clicks on one of the highlighted search terms, all KWIC lines based on the term are automatically shown using the Concordancer Tool.

6 Word List / Keyword List Tools

One of the first things that a user will do when analyzing a new corpus is to generate a list of all the words in the corpus. Word lists are useful as they suggest interesting areas for investigation and highlight problem areas in a corpus. Bowker & Pearson [10] describe how word lists can also be used to find families of related word forms and lemmas in a corpus. The Word List Tool is shown in Figure 4.

Hockey [11] states that an ideal word list generation program should be able to sort words into alphabetical or frequency order. The Word List Tool offers these features and the added features of reverse ordering and the ability to count words based on their ‘stem’ forms. Usually, it is important to use a stop list to avoid counting high frequency function words when generating a word list. In the Word List Tool, this can be done via the preferences window. In addition, users can specify the reverse of a stop list, i.e., a list of only the words that should be counted. These can be specified either by direct input from the keyboard or from a separate file.
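
The word list options described above (alphabetical or frequency ordering, reverse ordering, a stop list, and its reverse) can be sketched in a few lines of Python. This is an illustrative sketch only, not AntConc's implementation. The names `tokenize` and `word_list`, the parameter names, and the simple lowercase tokenization rule are assumptions, and counting by ‘stem’ forms is left out.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def word_list(text, stop_words=None, only_words=None,
              sort_by="frequency", reverse=False):
    """Count words and return (word, count) pairs in the requested order.

    stop_words: words to exclude from the count (a stop list).
    only_words: if given, the only words to count (the reverse of a stop list).
    sort_by:    "frequency" (highest first) or "alphabetical".
    reverse:    flip the resulting order.
    """
    counts = Counter(tokenize(text))
    if stop_words is not None:
        counts = Counter({w: c for w, c in counts.items() if w not in stop_words})
    if only_words is not None:
        counts = Counter({w: c for w, c in counts.items() if w in only_words})

    if sort_by == "frequency":
        # Ties in frequency are broken alphabetically so the order is stable.
        items = sorted(counts.items(), key=lambda wc: (-wc[1], wc[0]))
    elif sort_by == "alphabetical":
        items = sorted(counts.items())
    else:
        raise ValueError("sort_by must be 'frequency' or 'alphabetical'")

    if reverse:
        items.reverse()
    return items


if __name__ == "__main__":
    sample = "The corpus tool lists the words in the corpus."
    print(word_list(sample, stop_words={"the", "in"}))
    print(word_list(sample, only_words={"corpus", "tool"}))
```

In this sketch the word sets are passed in directly, which corresponds to keyboard input; reading them from a separate file would only require loading one word per line into a set first.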
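[Figure 3: screenshot of the View Files Tool. Callout labels: Total Hits, File Window, File Display, Progress Report.]

[Figure 4: screenshot of the Word List Tool. Callout labels: Wordlist Preferences Window, File Window, Word List Results.]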
Experienced users of corpus analysis tools will know that word lists usually tell us little about how important a word is in a corpus. Therefore, AntConc offers a Keyword List Tool (Figure 5), which identifies words that appear unusually frequently in a corpus compared with the same words in a reference corpus that must also be specified by the user. The Keyword List Tool operates in an almost identical way to the KeyWords tool in
[Figure 5: screenshot of the Keyword List Tool. Callout labels: File Window, Progress Reports, Keyword List, Save Current Window, Keyword List Options, Preferences Window.]
WordSmith Tools, calculating the ‘keyness’ of words using either the chi-squared or log likelihood statistical measures [12], and offering the user the option of displaying or hiding unusually infrequent keywords (or negative-keywords) in the preferences window.

7 Word Clusters / Bundles Tool

Research has shown that collocations and other multi-word units such as phrasal verbs and idioms are particularly difficult for learners to acquire.[13] Their importance is even greater if the learner is working with texts in a highly technical or scientific field, as the lexical unit is very often longer than a single word.[10] Surprisingly, collocations and similar units have received little attention in most CALL programs [13], perhaps due to the difficulty in identifying and ordering them in a systematic way for the learner.

In AntConc, multi-word units can be investigated using the Word Clusters Tool (Figure 6). This tool displays clusters of words centered on a search term and orders them alphabetically or by frequency. The search terms can be specified as a substring, word, phrase or regular expression as in the Concordancer, Plot and View Files tools, and the number of additional words to the left and right of the search term can also be specified. It is also possible to set a minimum frequency threshold for the clusters generated.

An alternative way to search for multi-word sequences is to find lexical bundles [14], which are equivalent to n-grams, where n usually varies between two and five words. Few corpus analysis programs offer this feature [1], but AntConc includes lexical bundle searches as an option in the Word Clusters Tool. Of course, calculating all the lexical bundles for a particular set of criteria can take a great deal of time. Therefore, as in all other tools in the program, the processing can be halted by clicking on the ‘Stop’ button at any time.

8 Limitations of AntConc

Concordancers can be divided into two main types: 1) those that first build an index which is used for subsequent search operations, and 2) those that act directly on the raw text.[11] The first type has the advantage that it can operate on large corpora. On the other hand, it tends to be less flexible than the second type, especially if the user is often switching or modifying the target corpus
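
The two types of concordancer distinguished above can be contrasted in a short Python sketch. It is a simplified illustration under stated assumptions, not a description of how any particular program works: the names `build_index`, `search_index` and `search_raw`, the in-memory dictionary index, and the sample file names are all hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(files):
    """Type 1: build an index once, mapping each word to (file name, token position)."""
    index = defaultdict(list)
    for name, text in files.items():
        for position, word in enumerate(tokenize(text)):
            index[word].append((name, position))
    return index


def search_index(index, word):
    """Look a word up in the prebuilt index; no text is rescanned."""
    return index.get(word.lower(), [])


def search_raw(files, pattern):
    """Type 2: scan the raw text of every file on each search.

    Returns (file name, character offset) pairs. Any regular expression
    can be used, and edits to the files are picked up immediately.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    return [(name, match.start())
            for name, text in files.items()
            for match in regex.finditer(text)]


if __name__ == "__main__":
    corpus = {
        "a.txt": "The tool builds an index.",
        "b.txt": "A tool scans raw text.",
    }
    index = build_index(corpus)
    print(search_index(index, "tool"))
    print(search_raw(corpus, r"\bt\w+"))
```

In this sketch, the indexed search answers a whole-word query with a single dictionary lookup, but the index has to be rebuilt whenever the files change. The raw-text search repeats the scan each time, yet accepts arbitrary patterns and always reflects the current files.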
[Figure 6: screenshot of the Word Clusters Tool. Callout labels: File Window, Cluster/Bundle Results, Progress Report.]
Most corpus analysis programs offer users the ability to see the collocates of a search term in a table, where the frequency of the most common words to the left or right of the search term is indicated. Learners often find such tables difficult to interpret and so the current version of AntConc offers no implementation of this feature.

Some programs also offer detailed statistics related to the corpus and search results. Again, it was felt that these would overwhelm many learners and so the advice given by Hockey [11] was followed. The program should not include such statistics but instead offer an easy way to copy and paste results

9 Conclusions and Future Developments

AntConc is a lightweight, simple and easy-to-use corpus analysis toolkit that has been shown to be extremely effective in the technical writing classroom.[17] Although it does not include all the tools and features of the popular commercial applications, it offers many of the essential tools needed for the analysis of corpora, with the added benefit of an intuitive interface and a freeware license.

To date, there have been 19 releases of the program since its launch in 2002, including three major upgrades. There are also plans to release a
new version of the software in the near future that addresses some of the limitations described in the previous section. The first improvement will be a redesign of the View Files Tool, making it operate with far greater speed. The current tool is able to handle files with ambiguous line endings but this comes with a heavy loss in speed. The next release will also include a tool to view collocates, and the ability to sort word lists alphabetically from both the beginning and end of words, which is a feature recommended by Hockey.[11]

In a later release, it is hoped that AntConc will be improved to handle annotated data, in particular XML, in a much more powerful and intuitive way. XML data includes header definitions that, if extracted, can be used as part of search criteria. If this extraction can be carried out automatically, it will enable users to access these definitions without any knowledge of the annotation method.

Finally, a detailed user manual and accompanying tutorial video are planned for the software, where the operation of each tool will be explained with concrete examples and a step-by-step guide.

Acknowledgements

This research was supported by a Grant-in-aid for Scientific Research by the Japan Society for the Promotion of Education, Science, Sports and Culture, Japan (No. 16700573), and by a Waseda University Grant for Special Research Projects, Japan (No. 2004B-861).

Notes

1. Information and download instructions available at: http://www.lexically.net/wordsmith/
2. Information and download instructions available at: http://www.monoconc.com/
3. Information and download instructions available at: http://home.ust.hk/~autolang/whatis_WP.htm
4. Information and download instructions available at: p://vlc.polyu.edu.hk/concordance/aboutweb.htm
5. Information and download instructions available at: http://www.antlab.sci.waseda.ac.jp/
6. Information and download instructions available at: http://morphix-nlp.berlios.de/

References

[1] D. Coniam, “Concordancing oneself: Constructing individual textual profiles,” International Journal of Corpus Linguistics, vol. 9, no. 2, pp. 271-298, 2004.

[2] C. A. Chapelle, Computer applications in second language acquisition: Foundations for teaching, testing, and research. Cambridge, England: Cambridge University Press, 2001.

[3] S. Hunston, Corpora in Applied Linguistics. Cambridge, England: Cambridge University Press, 2002.

[4] T. Johns, “Contexts: the Background, Development and Trialling of a Concordance-based CALL Program,” in Teaching and Language Corpora. A. Wichmann, S. Fligelstone, T. McEnery, and G. Knowles. London, England: Longman, 1997, pp. 100-115.

[5] J. M. Swales, Research Genres. Cambridge, England: Cambridge University Press, 2004.

[6] Y. C. Sun and L. Y. Wang, “Concordancers in the EFL Classroom: Cognitive Approaches and Collocation Difficulty,” Computer Assisted Language Learning, vol. 16, no. 1, pp. 83-94, 2003.

[7] T. Cobb, “Breadth and depth of lexical acquisition with hands-on concordancing,” Computer Assisted Language Learning, vol. 12, no. 4, pp. 345-360, 1999.

[8] K. E. Nitsch, “Structuring decontextualized forms of knowledge,” Unpublished Ph.D., Vanderbilt University; Nashville, TN, 1978.

[9] C. Lonfils and J. Vanparys, “How to design user-friendly CALL interfaces,” Computer Assisted Language Learning, vol. 14, no. 5, pp. 405-417, 2001.

[10] L. Bowker and J. Pearson, Working with Specialized Language: A Practical Guide to Using Corpora. London, England/New York, NY: Routledge, 2002.

[11] S. Hockey, “Concordance Programs for Corpus Linguistics,” in Corpus Linguistics in North America: Selections from the 1999 Symposium. R. C. Simpson and J. M. Swales. Ann Arbor, MI: University of Michigan Press, 2001, pp. 76-97.