LancsBox 5.0 Manual


#LancsBox 5.0 manual

Citation for #LancsBox:


Brezina, V., Weill-Tessier, P., & McEnery, A. (2020). #LancsBox v. 5.x. [software package]
Brezina, V., Timperley, M., & McEnery, A. (2018). #LancsBox v. 4.x. [software package]
Brezina, V., McEnery, T. & Wattam, S. (2015). Collocations in context: A new perspective on collocation
networks. International Journal of Corpus Linguistics, 20(2), 139-173.

Innovation in corpus linguistics

#LancsBox
@Lancaster University

Contents

1 Downloading and running #LancsBox version 5.0
2 Loading and importing data
2.1 Visual summary of Corpora tab
2.2 Load your corpora and wordlists
2.3 Supported file formats
2.4 Download #LancsBox corpora and wordlists
2.5 Working with corpora and wordlists
2.6 Saving corpora
2.7 Pre-processing of corpora (Advanced users)
3 Key functionalities
3.1 Mouse clicks
3.2 Shortcut Keys
3.3 Tools and Tabs
3.4 Split screen
3.5 Saving results
3.6 Copy/pasting selected results
4 KWIC tool (key word in context)
4.1 Visual summary of KWIC tab
4.2 Searching and displaying results
4.3 Settings and full text pop-up
4.4 Sorting, randomising and filtering
4.5 Statistical analysis
5 Whelk tool
5.1 Visual summary of Whelk tab
5.2 Top panel: KWIC
5.3 Bottom panel: Frequency distribution
5.4 Statistical analysis
6 GraphColl
6.1 Visual summary of GraphColl tab
6.2 Producing a collocation graph
6.3 Reading collocation table
6.4 Reading collocation graph
6.5 Extending graph to a collocation network
6.6 Shared collocates
6.7 Problems with graphs: overpopulated graphs
6.8 Reporting collocates: CPN
7 Words tool
7.1 Visual summary
7.2 Producing frequency list
7.3 Visualizing frequency and dispersion
7.4 Producing keywords
7.5 Producing corpus statistics
8 Ngram tool
8.1 Visual summary
9 Text
9.1 Visual summary
9.2 Searching in Text
9.3 Settings
10 Wizard
10.1 Visual summary
10.2 Selecting settings and running Wizard
10.3 Data analysis
10.4 Research report
11 Searching in #LancsBox
12 Statistics in #LancsBox
12.1 Frequency measures
12.2 Dispersion measures
12.3 Keyword measures
12.4 Collocation measures
13 Glossary
14 Messages.Properties

#LancsBox v.5.0: License

#LancsBox is licensed under the Creative Commons BY-NC-ND license. #LancsBox is free for non-commercial
use. The full license is available from: http://creativecommons.org/licenses/by-nc-nd/4.0/legalcode

#LancsBox uses the following third-party tools and libraries: Apache Tika, Gluegen, Groovy, JOGL,
minlog, QuestDB, RSyntaxTextArea, smallseg, TreeTagger. Full credits are available at
http://corpora.lancs.ac.uk/lancsbox/credits.php

When you report research carried out using #LancsBox, please cite the following:

☐ Brezina, V., McEnery, T. & Wattam, S. (2015). Collocations in context: A new perspective on
collocation networks. International Journal of Corpus Linguistics, 20(2), 139-173.
☐ Brezina, V., Weill-Tessier, P., & McEnery, A. (2020). #LancsBox v. 5.x. [software package]
☐ Brezina, V., Timperley, M., & McEnery, A. (2018). #LancsBox v. 4.x. [software package].

Statistical help

Brezina, V. (2018). Statistics for corpus linguistics: A practical guide. Cambridge: Cambridge University Press.

If you are interested in finding out details about statistical procedures used in corpus
linguistics, refer to Brezina (2018); visit also Lancaster Stats Tools online at
http://corpora.lancs.ac.uk/stats

Further reading and materials


Brezina, V. (2016). Collocation Networks. In Baker, P. & Egbert, J. (eds.) Triangulating Methodological
Approaches in Corpus Linguistic Research. Routledge: London.
Brezina, V. (2018). Statistical choices in corpus-based discourse analysis. In Taylor, Ch. & Marchi, A.
(eds.) Corpus approaches to discourse: a critical review. Routledge: London.
Brezina, V. & Gablasova, D. (2017). The corpus method. In: Culpeper, J, Kerswill, P., Wodak, R., McEnery,
T. & Katamba, F. (eds). English Language (2nd edition). Palgrave.
Brezina, V., McEnery, T. & Wattam, S. (2015). Collocations in context: A new perspective on collocation
networks. International Journal of Corpus Linguistics, 20(2), 139-173.
Brezina, V., & Meyerhoff, M. (2014). Significant or random. A critical review of sociolinguistic
generalisations based on large corpora. International Journal of Corpus Linguistics, 19(1), 1-28.
Gablasova, D., Brezina, V., & McEnery, T. (2017). Collocations in corpus‐based language learning
research: Identifying, comparing, and interpreting the evidence. Language Learning, 67 (S1), 155–
179.
Gablasova, D., Brezina, V., & McEnery, T. (2017). Exploring learner language through corpora:
comparing and interpreting corpus frequency information. Language Learning, 67 (S1), 130–154.

▪ More materials (video lectures, exercises, slides etc.) are available on the #LancsBox website:
http://corpora.lancs.ac.uk/lancsbox/materials.php

1 Downloading and running #LancsBox version 5.0

#LancsBox is a new-generation corpus analysis tool. Version 5 has been designed primarily for 64-bit
operating systems (Windows 64-bit, Mac and Linux), on which the tool performs best. #LancsBox
also runs on older 32-bit systems, but its performance is somewhat limited. Version 5 of #LancsBox
comes with an installer, which makes installation even easier.

1. Select and download: Select the version suitable for your operating system and download the installer to
your computer.

2. Run the installer: Agree to security warnings on your machine (#LancsBox is safe to run) and follow the
steps in the installer. Always install #LancsBox to a folder where the tool has ‘read and write’ privileges, such as
the User folder or Desktop. On Windows, never install #LancsBox to Program Files.

2 Loading and importing data

Data can be loaded and imported into #LancsBox on the ‘Corpora’ tab. This tab opens automatically when you run
#LancsBox. #LancsBox works with corpora in different formats (.txt, .xml, .doc, .docx, .pdf, .odt, .xls, .xlsx, .zip etc.) and
with wordlists (.csv). There are two options for loading corpora and wordlists: i) load (your own) data and ii) download
corpora and wordlists that are distributed with #LancsBox.

2.1 Visual summary of Corpora tab

Top panel: Importing corpora and wordlists


You can:
▪ Select your corpus or wordlist to load.
▪ Download corpora and wordlists distributed with #LancsBox.
▪ Select language.
▪ Review POS tags.
▪ Review punctuation marks and sentence delimiters.
▪ Set up pre-processing via customisable scripts.
▪ Define basic categories (token, lemma, POS group, punctuation).

Bottom panel: Working with corpora and wordlists

You can:
▪ Activate or delete imported corpora or wordlists.
▪ Review corpus and text size (tokens, types, lemmas).
▪ Preview texts.
▪ Save processed corpora with POS tags etc.

2.2 Load your corpora and wordlists

#LancsBox allows you to work easily with your own corpora and wordlists, that is, those stored on your
computer or at a location accessible from your computer (memory stick, shared drive, Dropbox, cloud etc.).

1. In the Corpora tab, left-click on ‘Corpus’ or ‘Word List’ under ‘Load data’, depending on whether you want to
load a corpus or a wordlist.
2. This will open a window where you can navigate to the location (folder) where your corpus or wordlist is
stored.
3. You can select a specific file, select multiple files by holding down Ctrl and left-clicking on your chosen files, or
select all files in the folder by pressing Ctrl + A.
4. Left-click ‘Open’ to load your files.
5. Select the language of your corpus or wordlist. #LancsBox supports automatic lemmatisation and POS tagging
in multiple languages. This is done using TreeTagger. If your language is not listed, select ‘Other’; in this case,
automatic lemmatisation and POS tagging will be disabled.
6. [Optional: You can review/change the import options by left-clicking on a bar with three triangles (▲▲▲). In
most cases, you can use the default options.]

7. Left-click ‘Import!’ to import your corpus into #LancsBox. By default, #LancsBox automatically adds POS tags
to the corpus.

2.3 Supported file formats

#LancsBox supports corpus files in many different formats (.txt, .xml, .doc, .docx, .pdf, .odt, .xls, .xlsx, .zip and many others).
#LancsBox automatically extracts and processes the text available in corpus files. For wordlists, #LancsBox assumes the
comma-delimited file format (.csv).

1. Corpus formats: .txt, .xml, .doc, .docx, .pdf, .odt, .xls, .xlsx, .zip; for the full list, see Apache Tika.
2. Wordlist format: .csv (see example below).

2.4 Download #LancsBox corpora and wordlists

#LancsBox allows you to work with existing corpora that are freely distributed with #LancsBox under a
specific license. Two modes for corpus sharing are available: i) open access and ii) restricted access. We
are constantly adding more corpora to this list.

1. In the Corpora tab, left-click on ‘Corpus’ or ‘Word List’ under ‘Download’.


2. This will open a window where you can select corpora or wordlists distributed with #LancsBox. By
left-clicking on a corpus, you will be shown additional information about the corpus or wordlist,
including the language, date, text type, license etc.
3. Review and agree with the corpus license.
4. Left-click ‘Download’ to download the selected corpus or wordlist.
5. Left-click ‘Import!’ to import your corpus into #LancsBox. By default, #LancsBox automatically
adds POS tags to the corpus.

Note: To switch between open and restricted access corpora, use the ‘Switch access’ button in the
bottom left corner. Restricted access corpora are distributed in encrypted form and have
several display and usage restrictions. For example, they cannot be displayed in the Text tool or saved
to the local computer.

2.5 Working with corpora and wordlists

All corpora and wordlists that have been imported into #LancsBox are displayed in the bottom panel on
the ‘Corpora’ tab. This panel allows reviewing corpora, previewing files and fast reloading of corpora and
wordlists when #LancsBox is closed and re-opened.

1. If you have imported a corpus or wordlist, it will appear in the bottom panel, alongside
any other corpora or wordlists you have already imported. These can be removed by left-clicking
‘delete’. In the bottom-right section, you can view the corpus structure: the individual text files
that the corpus is composed of.
2. In the bottom panel (bottom left window), the default corpus can also be specified. The default
corpus is a corpus that #LancsBox offers as a default choice in the individual modules. The default
corpus can be specified by left-double-clicking on the name of the corpus; a filled rectangle
will appear next to the name of the default corpus.
3. If #LancsBox is closed, the corpora and wordlists will remain imported but will be unloaded. To
activate (reload) the corpora or wordlists for use, left-double-click on the corpora or wordlists.
4. You can also preview the files by right-clicking on them. They will appear in the Text tool (see
Section 9). The list of files (including the info about their size) can also be copied (Ctrl/Command+C)
and pasted (Ctrl/Command+V) into a spreadsheet or text document.
5. Corpora are now ready to be analysed using six modules: KWIC, Whelk, GraphColl, Words, Ngrams and
Text. Wordlists can be used in the Words tool.

2.6 Saving corpora

#LancsBox saves corpora in the horizontal or the vertical format.

1. Right-click on the corpus which you wish to save.


2. Select appropriate options.

3. Click ‘Save’.

2.7 Pre-processing of corpora (Advanced users)

#LancsBox allows pre-processing data as part of the import procedure. This is set up in the ‘Import options’
under ‘Pre-processing’. Data can be modified in different ways using a variety of Groovy scripts, which are
fully customisable.

1. Under ‘Pre-processing’ three options are available: ‘Download’, ‘Use’ and ‘Edit’.
2. ‘Download’ allows the user to download scripts and their newest versions available from the
#LancsBox website.
3. ‘Use’ displays a list of currently available scripts and a checkbox next to each script for the user to
indicate which scripts will be used in the pre-processing stage.
4. ‘Edit’ displays the scripts in a script editor, which allows modifying existing scripts and creating
new scripts.

5. The structure of a script is as follows. More information about the Groovy scripting
language can be found at http://groovy-lang.org.

    // Scripts run via the command line.
    public void runCL(){
        println "Ran on the command line."
    }

    // Scripts run when the files are being loaded. This allows splitting files,
    // deleting or changing texts or structuring elements, e.g. xml tags.
    public void runBefore(){
        println "Ran as a pre-process script."
    }

    // Scripts run after part-of-speech tagging. This allows modifying the output
    // of the TreeTagger, e.g. correcting tagging errors.
    public void runAfter(Token token){
        println "Ran after the tagging step."
    }

    // An example of a simple script deleting the header indicated by
    // <header></header> tags in text.
    public void removeHeader(){
        println "Reading " + inPath
        text = text -~ /(?s)<header>.*?<\/header>/
        autoOutput = true
    }
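
The regular expression used in removeHeader can be tried out on sample text before it is added to a pre-processing script. The following is a minimal Python sketch of the same idea, not #LancsBox code: the function name remove_header and the sample text are made up for illustration, and the sketch assumes that the Groovy expression removes only the first match.

```python
import re

# Same pattern as in removeHeader: (?s) lets '.' match line breaks,
# and '.*?' stops at the first closing </header> tag.
HEADER = re.compile(r"(?s)<header>.*?</header>")

def remove_header(text):
    """Delete the first <header>...</header> block from the text."""
    return HEADER.sub("", text, count=1)

sample = "<header>\nTitle: Sample\n</header>\nThe new life looks promising."
cleaned = remove_header(sample)
print(repr(cleaned))
```

The non-greedy `.*?` matters here: with a greedy `.*`, a text containing two header blocks would lose everything between the first opening tag and the last closing tag.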

Did you know?

The Brown corpus and the LOB (Lancaster-Oslo/Bergen) corpus are among the first modern
corpora stored and processed on computers. Each consists of one million running words (tokens),
a size that was very ambitious at the time of their compilation. Brown was compiled in the 1960s
by Henry Kučera and W. Nelson Francis at Brown University (US). It was originally stored and
processed on IBM punch cards. In the early 1970s, a British counterpart to the Brown corpus was
compiled as a collaboration between Lancaster University (UK) and two Norwegian universities:
Oslo and Bergen. The project was initiated by Geoffrey Leech from Lancaster University.

3 Key functionalities

This section reviews key functionalities of #LancsBox that are common to multiple #LancsBox modules.

3.1 Mouse clicks

#LancsBox doesn’t use drop-down menus. Instead, all commands are literally just one mouse click away.

Hover with the mouse pointer for tooltips (brief contextual explanation of key functionalities/terms)
to appear.

Left-click: ‘select and sort’
▪ Select items or lines (all modules).
▪ +Ctrl: Multiple select.
▪ Sort tables and concordances (all modules).

Left-double-click: ‘go inside’
▪ Randomise concordances (KWIC).
▪ Go to Text (KWIC, Whelk).
▪ Expand collocation networks (GraphColl).
▪ Expand visualizations of corpora (Words and Ngrams).

Right-click: ‘additional info’
▪ Filters on tables (Whelk, GraphColl, Words and Ngrams), concordances (KWIC and Whelk) and text display (Text).
▪ Concordances for collocates and wordlists (GraphColl, Words).

Wheel: ‘zoom’
▪ Zoom (GraphColl and Words).

Note: Mac users need to review their specific setup of the mouse clicks. By default, right-click is defined
as Control + click. Alternatively, a standard two-button mouse with a wheel can be connected to a Mac
machine.

3.2 Shortcut Keys

#LancsBox allows changing the size of the text for easy readability. This works both in graphs and tables.

Make all text bigger Ctrl and +

Make all text smaller Ctrl and -

3.3 Tools and Tabs

#LancsBox supports multiple simultaneous analyses and multiple corpora. #LancsBox has six main
modules (tools): KWIC, Whelk, GraphColl, Words, Ngrams and Text. Each tool can be called multiple times on
separate tabs. The modules in #LancsBox are interconnected: they can be launched as pop-ups inside a
module.

1. The figure below shows the top bar in #LancsBox with buttons for individual modules and multiple
tabs open.

2. The modules in #LancsBox have the following functionalities:

KWIC produces concordances.


Whelk shows distribution of the search term in corpus files.
GraphColl identifies and visualizes collocations.
Words produces wordlists and identifies and visualizes keywords.
Ngrams produces lists of ngrams and identifies and visualizes key ngrams.
Text displays a full context of a search term.

3.4 Split screen

#LancsBox supports split-screen comparisons that allow displaying two separate analyses, one in the top
and one in the bottom panel.

1. To use split screen, left-click on a bar with three triangles: ▲▲▲. This brings up the bottom
panel.
2. To activate the bottom (or the top) panel in the split-screen view, left-click on the panel. An active
panel is indicated by a light blue border.
3. To close the split-screen view, left-click on the bar with three triangles: ▼▼▼. This will hide the
bottom panel but will not clear the results, so the bottom panel can be brought back later, if
needed.

3.5 Saving results

#LancsBox supports easy saving of results. It saves concordances, wordlists, tables and graphics.

1. To save the results that #LancsBox produces, left-click on the save icon in the top right-hand
corner.
2. Select the location where you wish to save the results.
3. Click ‘Save’.

3.6 Copy/pasting selected results

#LancsBox supports easy copy/pasting of selected results.

1. Select results which you wish to copy/paste by left-clicking on them; the results will be highlighted.
To select discontinuous results, hold down Ctrl while selecting. To select all results, press Ctrl + A
[Mac: Command + A].

2. Press Ctrl + C [Mac: Command + C].


3. In the new location (e.g. text file, spreadsheet) press Ctrl + V [Mac: Command + V].

4 KWIC tool (key word in context)

The KWIC tool generates a list of all instances of a search term in a corpus in the form of a concordance.
It can be used, for example, to:
■ Find the frequency of a word or phrase in a corpus.
■ Find frequencies of different word classes such as nouns, verbs, adjectives.
■ Find complex linguistic structures such as passives, split infinitives etc. using ‘smart searches’.
■ Sort, filter and randomise concordance lines.
■ Perform statistical analysis comparing the use of a search term in two corpora.

4.1 Visual summary of KWIC tab

Elements of the KWIC tab:
▪ Save results.
▪ Statistical analysis.
▪ Left-double-click ‘Index’ to randomise concordance lines.
▪ Left-click concordance header to sort.
▪ Right-click concordance header to use advanced filter.
▪ Left-double-click concordance display to see text.
▪ Right-click inside to apply filter.
▪ Pull up the bottom panel.

Simple search

You can:
▪ Search for a word or phrase.
▪ Search for number ranges, e.g. >1930&<=1945
▪ Use * wildcards, e.g. new*
▪ Use case sensitive regular expressions, e.g. /[abc].*/
▪ Use case insensitive regular expressions, e.g. /dog|cat/i
▪ Search for punctuation, e.g. /.*\./p
▪ Use ‘smart searches’, e.g. PASSIVES, NOUNS

Advanced search

You can:
▪ Search at different levels of annotation.
▪ Combine search terms at various levels.
▪ Use regular expressions, e.g. /N.*/
▪ Define batch searches.
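
To make the simple-search patterns above more concrete, the following Python sketch imitates three of them (the * wildcard, a case insensitive regular expression and a number range) on a short list of tokens. This is an illustration only: the helper names are invented, and it assumes that a search term must match a whole token and that wildcard searches ignore case, which may not correspond exactly to how #LancsBox evaluates searches.

```python
import re

def wildcard_to_regex(term):
    """Translate a wildcard term such as 'new*' into a whole-token regex.
    Case-insensitive matching is an assumption made for this sketch."""
    pieces = [re.escape(piece) for piece in term.split("*")]
    return re.compile(".*".join(pieces), re.IGNORECASE)

def search(tokens, pattern):
    """Return the tokens that the pattern matches in full."""
    return [token for token in tokens if pattern.fullmatch(token)]

tokens = ["New", "news", "renew", "dog", "Cat", "1931", "1950"]

# new*  (wildcard)
wildcard_hits = search(tokens, wildcard_to_regex("new*"))
# /dog|cat/i  (case insensitive regular expression)
regex_hits = search(tokens, re.compile(r"dog|cat", re.IGNORECASE))
# >1930&<=1945  (number range)
range_hits = [t for t in tokens if t.isdigit() and 1930 < int(t) <= 1945]

print(wildcard_hits, regex_hits, range_hits)
```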

4.2 Searching and displaying results

#LancsBox supports powerful searching of corpora. The search box can be used for simple as well as advanced searches
at different levels of annotation.

1. Simple searches: type in the word or phrase of interest in the search box in the top left-hand corner and left-
click ‘Search’.
2. Advanced searches: click on the triangle inside the search box to activate advanced searches at different
levels of corpus annotation. You can type search terms as separate constraints into one or more advanced
search boxes. For example, the following advanced search finds the lemma ‘go’ used as a verb:
   [empty]   Text level is empty: no constraint.
   go        Headword is ‘go’.
   AND
   V*        POS is any verbal use.

3. A concordance is generated. The search term, called the ‘node’, is positioned in the centre and highlighted
(orange colour), with words displayed to the left and right of it.
4. KWIC displays basic information about the frequency of the search term and its distribution in texts; the
second example shows an application of a filter (see Section 4.4):
Read: The search term ‘research’ occurs 158 times in the
corpus with the relative frequency 1.57 per 10k words in
13 out of 15 texts.
Read: When a filter is applied (indicated by blue colour),
the search term ‘research’ occurs 7 times out of 158 in
the corpus with the relative frequency 0.07 per 10k
words in 3 out of 15 texts.

4.3 Settings and full text pop-up

KWIC settings include Corpus, Context and Display options. KWIC also allows full-text pop-ups.

1. Corpus: this setting changes the corpus which is being searched. Note that different corpora can be searched
in the top and bottom panel in split-screen view.
2. Context: this setting changes the number of words that are displayed in the concordance to the left and to the
right of the node.
3. Display: this setting changes the display type. The ‘Plain text’ default can be changed to ‘Text with POS’,
‘Lemmatized text’ and ‘All annotation’. The example below demonstrates these four display formats:
Plain text: The new life looks promising for Mr. Noyce.
Text with POS: The_DT new_JJ life_NN looks_VVZ promising_JJ for_IN Mr._NP Noyce._NP
Lemmatized text: the_DT new_JJ life_NN look_VVZ promising_JJ for_IN Mr_NP Noyce_NP
All annotation: [The{the}_DT] [new{new}_JJ] [life{life}_NN] [looks{look}_VVZ] [promising{promising}_JJ]
[for{for}_IN] [Mr.{Mr}_NP] [Noyce.{Noyce}_NP]

4. Full text pop-up: Double left-click on a concordance line to display the entire text with the
appropriate line highlighted.
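
The ‘All annotation’ display format in point 3 packs the word form, the lemma and the POS tag of each token into one bracketed unit, so text copied in this format can be taken apart again with a short script. The sketch below is one possible way of doing this in Python, based only on the example line shown above; the names TOKEN and parse_all_annotation are hypothetical.

```python
import re

# One bracketed unit looks like [looks{look}_VVZ]: word, {lemma}, _POS.
TOKEN = re.compile(r"\[([^{\]]+)\{([^}]*)\}_([^\]]+)\]")

def parse_all_annotation(line):
    """Return a list of (word, lemma, pos) tuples from an 'All annotation' line."""
    return TOKEN.findall(line)

line = "[The{the}_DT] [new{new}_JJ] [life{life}_NN] [looks{look}_VVZ]"
parsed = parse_all_annotation(line)

# Rebuild two of the other display formats from the parsed tuples.
text_with_pos = " ".join(f"{word}_{pos}" for word, lemma, pos in parsed)
lemmatized = " ".join(f"{lemma}_{pos}" for word, lemma, pos in parsed)

print(parsed)
print(text_with_pos)
print(lemmatized)
```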

4.4 Sorting, randomising and filtering

A KWIC concordance can be sorted alphabetically, randomised and filtered.

1. Alphabetical sorting: Left-click the concordance header (any column) to sort the column
alphabetically in the A-Z (ascending) order; click again to re-sort alphabetically in the Z-A
(descending) order. The sorting is indicated by arrows: A-Z (▲) and Z-A (▼).
2. Randomising: Left-double-click the header of the ‘Index’ column to randomise the concordance
lines. Randomisation is indicated by the tilde sign (~).
3. Simple filtering: Right-click anywhere inside the concordance to activate the simple filter on that
column. Input a word or phrase or a regular expression enclosed in forward slashes (/ /) and click
‘Apply’. Filtering is indicated by light blue colour of the filtered text. The filter also updates the
results (Occurrences and Texts) in the top display panel (see Section 4.2, point 4).
4. Advanced filtering: Right-click any part of the concordance header to activate the advanced filter.
Select an exact column or position for filtering (see below), enter value and click ‘Add’ and ‘Apply’.
Filtering is indicated by light blue colour on text and in the results display panel (Occurrences and
Texts).
An example of positions for advanced filtering:

L5   L4    L3       L2       L1   Node   R1      R2    R3   R4    R5
is   Mr.   Robert   Weaver   of   New    York.   One   of   his   tasks
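
The position labels used by the advanced filter can be thought of as offsets from the node. The following Python sketch shows this logic under simple assumptions (whitespace-separated tokens, exact string comparison); the functions kwic_positions and filter_lines are illustrative names, not part of #LancsBox.

```python
def kwic_positions(tokens, node_index, span=5):
    """Label the tokens around the node as L5..L1, Node, R1..R5."""
    positions = {}
    for offset in range(-span, span + 1):
        i = node_index + offset
        if 0 <= i < len(tokens):
            if offset == 0:
                label = "Node"
            elif offset < 0:
                label = f"L{-offset}"
            else:
                label = f"R{offset}"
            positions[label] = tokens[i]
    return positions

def filter_lines(lines, position, value):
    """Keep (tokens, node_index) lines whose token at `position` equals `value`."""
    return [
        (tokens, node)
        for tokens, node in lines
        if kwic_positions(tokens, node).get(position) == value
    ]

example = "is Mr. Robert Weaver of New York. One of his tasks".split()
labels = kwic_positions(example, node_index=5)
print(labels["L2"], labels["Node"], labels["R1"])

# Two concordance lines for the node 'New'; only the first has 'of' at L1.
lines = [(example, 5), ("a brand New car was parked outside".split(), 2)]
kept = filter_lines(lines, "L1", "of")
print(len(kept))
```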

4.5 Statistical analysis

KWIC connects to Lancaster Stats Tools online to perform statistical analysis of the data in split panels.

When search results appear in both the top and the bottom panel in split-screen, these can be compared
by clicking on the statistical analysis button. The tool automatically connects to Lancaster Stats Tools
online (Brezina 2018) and performs the t-test. The results are reported as follows:

Did you know?
In 1992, when reviewing the state of the art in corpus linguistics, Leech (1992) considered a
concordance program “[t]he simplest and the most widely-used tool for corpus-based research”
(p. 114). 25 years later, a concordance program such as KWIC still belongs to the essential toolkit
of a corpus linguist. The simple and direct access to data that a concordance program facilitates
combined with more sophisticated functions such as sorting, filtering and randomising provides
a powerful analytical technique.

Leech, G. (1992). Corpora and theories of linguistic performance. In: Directions in corpus linguistics, 105-122.

5 Whelk tool

The Whelk tool provides information about how the search term is distributed across corpus files.
It can be used, for example, to:
■ Find absolute and relative frequencies of the search term in corpus files.
■ Filter the results according to different criteria.
■ Sort files according to absolute and relative frequencies of the search term.

5.1 Visual summary of Whelk tab

Top panel: Searching corpora

You can:
▪ Search, sort and filter.
▪ Use simple and advanced searching functionality.
▪ Use ‘smart’ searches.

Bottom panel: Displaying distribution

You can:
▪ View the distribution of the search term in
individual files.
▪ Sort, filter and copy/paste.

5.2 Top panel: KWIC

The top panel in Whelk has the same powerful search, sort and filter functionalities as the KWIC tool (see
Section 4). It is directly connected to the bottom panel: any update in the top panel is immediately
reflected in the bottom panel.

5.3 Bottom panel: Frequency distribution

The bottom panel in Whelk provides detailed information about the distribution of the search term.

1. ‘File’ column lists the names of the individual files in the corpus.
2. ‘Tokens’ column provides the information about the size of each file in running words (tokens).
3. ‘Frequency’ column provides absolute frequencies of the search term, i.e. how many
instances of the search term there are in each file.
4. ‘Relative frequency per 10k’ provides relative frequency normalised to the basis of 10,000 tokens;
this value is comparable across files and corpora.
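
The relative frequency per 10k is the absolute frequency divided by the file size in tokens and multiplied by 10,000. A small Python sketch of this calculation follows; the file names and counts are invented for illustration.

```python
def relative_frequency(frequency, tokens, basis=10_000):
    """Relative frequency: occurrences of the search term per `basis` tokens."""
    return frequency * basis / tokens

# (file name, size in tokens, absolute frequency) - hypothetical values
files = [
    ("text_a.txt", 2_000, 4),
    ("text_b.txt", 50_000, 25),
]

rows = [
    (name, tokens, frequency, relative_frequency(frequency, tokens))
    for name, tokens, frequency in files
]
for row in rows:
    print(row)
```

In this made-up example, text_b.txt contains more instances of the search term (25 against 4), yet its relative frequency is lower (5 against 20 per 10k), which is why the normalised value is the one to compare across files of different sizes.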

5.4 Statistical analysis

Whelk connects to Lancaster Stats Tools online to perform statistical analysis of the data.

When search results appear, these can be visualised using a boxplot by clicking on the statistical analysis
button. The tool automatically connects to Lancaster Stats Tools online (Brezina 2018) and displays
the result:

Did you know?

The Whelk tool (both the name and the functionality) is inspired by Kilgarriff’s (1997: 138ff) notion
of the ‘whelks problem’. Imagine, says Kilgarriff, that you have a corpus which includes one text
(a book) about whelks, small snail-like sea creatures. In this text, the word whelks will
appear many times and hence will appear as a frequent word in the entire corpus, although its
use is limited to one specific context. To overcome the problem and present more accurate
information about word distribution, the Whelk tool shows the frequency distribution of search
terms in individual corpus files.
Kilgarriff, A. (1997). Putting frequencies in the dictionary. International Journal of Lexicography, 10(2), 135-155.

6 GraphColl

The GraphColl tool identifies collocations and displays them in a table and as a collocation graph or
network.
It can be used, for example, to:
■ Find the collocates of a word or phrase.
■ Find colligations (co-occurrence of grammatical categories).
■ Visualise collocations and colligations.
■ Identify shared collocates of words or phrases.
■ Summarise discourse in terms of its ‘aboutness’.

6.1 Visual summary of GraphColl tab

Elements of the GraphColl tab:
▪ Save results.
▪ View options.
▪ Change collocation settings.
▪ Display collocation graphs and networks.
▪ Display collocates in a table.
▪ Pull up the bottom panel.

6.2 Producing a collocation graph

GraphColl produces collocation graphs on the fly. After selecting the appropriate settings, you can start
searching for the node and its collocates.

1. Select the appropriate settings for the collocation search:


i) Span: how many words to the left (L) and to the right (R) of the node (search term) are being
considered when searching for collocates [default: 5L, 5R].
ii) Statistics: the association measure used to compute the strength of collocation [default:
frequency – no association measure is preferred because the choice depends on the
research question].

iii) Threshold: The minimum frequency and statistics cut-off values for an item (word, lemma,
POS) to be considered a collocate.
iv) Corpus: The corpus that is being searched.
v) Unit: The unit (type, lemma, part of speech [POS] tag) used for collocates.
2. Type the search term into the search box (top left) and left-click ‘Search’.
3. This will produce a collocation table (left) and a collocation graph (right).
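
The interplay of the Span and Threshold settings can be illustrated with a short Python sketch. It uses raw co-occurrence frequency as the statistic (the default mentioned above) and a deliberately tiny, invented token list; the function collocates and its parameters are assumptions made for this illustration and do not describe the internal implementation of GraphColl.

```python
from collections import Counter

def collocates(tokens, node, left=5, right=5, min_freq=1):
    """Count words within `left`/`right` tokens of each occurrence of the node
    and keep those that reach the minimum collocation frequency."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        window = tokens[max(0, i - left):i] + tokens[i + 1:i + 1 + right]
        counts.update(window)
    return {word: freq for word, freq in counts.items() if freq >= min_freq}

tokens = "i love my dog and my dog loves me but my cat ignores my dog".split()

# Span 1L, 1R with a frequency threshold of 2.
result = collocates(tokens, "dog", left=1, right=1, min_freq=2)
print(result)

# The same span without a threshold keeps the one-off neighbours as well.
unfiltered = collocates(tokens, "dog", left=1, right=1)
print(unfiltered)
```

Raising the threshold removes collocates that occur only once with the node (here ‘and’ and ‘loves’), which is also one way of thinning out a crowded graph.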

6.3 Reading collocation table

A collocation table is a traditional way of displaying collocates. In GraphColl, the table shows the following
pieces of information for each collocate: i) status, ii) position, iii) the collocate itself, iv) stat,
v) collocation frequency and vi) frequency of the collocate anywhere in the corpus. By default, the table
is sorted according to the selected collocation statistic (largest-smallest).

1. The following is a visual description of the collocation table.

▪ Right-click header: filter.
▪ Left-click header: sort.
▪ Status icon: node (expanded).
▪ Left-double-click: expand collocation network.
▪ Right-click: show concordance.

2. The meaning of the individual columns is:


i) Status: shows whether the collocate has been expanded; the status icon distinguishes a
non-expanded collocate from an expanded collocate (node) in a collocation network.
ii) Position: shows textual position of the collocate, which can be either left (L) of the node,
right (R) of the node or middle (M), i.e. with equal frequency L and R.
iii) Collocate: shows the collocate in question.
iv) Stat: displays the value of the selected association measure.
v) Freq (coll): displays the frequency of the collocation (combination of node + collocate).
vi) Freq (corpus): displays the frequency of the collocate anywhere in the corpus.
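
As a rough illustration of how the Position, Freq (coll) and Freq (corpus) columns relate to the text, the following Python sketch builds a simplified collocation table. It follows the definitions given above (L, R, or M when the left and right frequencies are equal) but leaves out the Status and Stat columns; the function name and the toy data are hypothetical.

```python
from collections import Counter

def collocation_table(tokens, node, span=5):
    """Return rows of (position, collocate, freq_coll, freq_corpus),
    sorted by collocation frequency (largest first)."""
    left, right = Counter(), Counter()
    corpus_freq = Counter(tokens)
    for i, token in enumerate(tokens):
        if token == node:
            left.update(tokens[max(0, i - span):i])
            right.update(tokens[i + 1:i + 1 + span])
    rows = []
    for word in set(left) | set(right):
        l, r = left[word], right[word]
        # M (middle) means the collocate is equally frequent left and right.
        position = "L" if l > r else "R" if r > l else "M"
        rows.append((position, word, l + r, corpus_freq[word]))
    rows.sort(key=lambda row: (-row[2], row[1]))
    return rows

tokens = "a b x c a x b".split()
table = collocation_table(tokens, "x", span=1)
for row in table:
    print(row)
```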

6.4 Reading collocation graph

The graph displays three dimensions: i) strength of collocation, ii) collocation frequency and iii) position of
collocates. To find out more about a collocate, right-click on it to obtain concordance lines (KWIC) in which
the collocate co-occurs with the node.

1. Strength: The strength of collocation as measured by the association measure is indicated by the
distance (length of line) between the node and the collocates. The closer the collocate is to the
node, the stronger the association between the node and the collocate (‘magnet effect’).
2. Frequency: Collocation frequency is indicated by the intensity of the colour of the collocate. The
darker the shade of colour, the more frequent the collocation is.
3. Position: The position of collocates around the node in the graph reflects the exact position of the
collocates in text: some collocates appear (predominantly) to the left of the node, others to the
right; others still appear sometimes left and sometimes right (middle position in the graph). For
the ease of display (if multiple collocates appear in a similar position and hence overlap), the tool
allows ‘spreading out’ collocates evenly around the node. This is done by clicking on the ‘Spread
out’ button (top right). When this is done, the collocates are dispersed evenly around the node
with an ‘L’ or ‘R’ index displayed above the collocate circle indicating their original position to the
left and to the right respectively.

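[Figure: collocation graph with callouts. Left (L) collocates appear to the left of the node and right (R) collocates to the right; collocates in the middle position (M) sit between the two. A small distance from the node marks a strong collocate; a darker colour marks a frequent collocate. Right-click a collocate to show its concordance.]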

6.5 Extending graph to a collocation network

A collocation network is an extended collocation graph that shows i) shared collocates and ii) cross-
associations between several nodes.

1. To expand a simple collocation graph (see above) into a collocation network, either search for
more nodes or left-double-click on a collocate in either the table or the graph.
2. A collocation network displays nodes with unique collocates (outer rim of the graph) and shared
collocates (middle of the graph). The links between nodes and shared collocates are indicated by
a dash-dot line.

[Figure: collocation network with two nodes, make and love, and a shared collocate between them. Callouts: zoom with the mouse wheel; enlarge the font with Ctrl and +.]

6.6 Shared collocates

Shared collocates are collocates shared by at least two nodes in the graph. Shared collocates are displayed
in the middle of the graph with links to the relevant nodes.
1. A full list of shared collocates can be obtained by clicking on the text ‘Shared collocates’.

2. The list of shared collocates is displayed in tabular form.

6.7 Problems with graphs: overpopulated graphs

If a collocation graph or network includes too many nodes and collocates, it becomes hard to interpret.
We call this type of graph/network an overpopulated graph/network. The solution is either to change the
graph’s settings, making the threshold values more restrictive (see Section 6.2), or to filter some of the
results based on a clearly specified criterion (e.g. function words, top n words).
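The effect of the two remedies can be sketched outside the tool. The snippet below is only an illustration, not #LancsBox code: the `Collocate` record mirrors the columns of the collocation table, and the words and numbers are made-up values chosen to show how a stricter cut-off and a stoplist shrink the result set.

```python
from typing import NamedTuple


class Collocate(NamedTuple):
    word: str
    stat: float       # value of the selected association measure
    freq_coll: int    # frequency of node + collocate
    freq_corpus: int  # frequency of the collocate anywhere in the corpus


def filter_collocates(rows, min_stat, min_coll_freq, stoplist=frozenset()):
    """Keep rows that pass both thresholds and are not on the stoplist."""
    return [
        r for r in rows
        if r.stat >= min_stat
        and r.freq_coll >= min_coll_freq
        and r.word.lower() not in stoplist
    ]


# Invented example rows (not taken from any real corpus).
rows = [
    Collocate("the", 3.2, 120, 60000),
    Collocate("spare", 7.9, 12, 40),
    Collocate("long", 5.4, 30, 900),
    Collocate("again", 3.1, 6, 700),
]

loose = filter_collocates(rows, min_stat=3, min_coll_freq=5)
strict = filter_collocates(rows, min_stat=5, min_coll_freq=5, stoplist={"the"})
print(len(loose), [r.word for r in strict])
```

Raising the statistic cut-off from 3 to 5 and removing one function word leaves two collocates instead of four, which is the same kind of change as the move from MI(3) to MI(5) in the settings.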

The following figure shows an overpopulated graph on the left and a graph that is more easily interpretable
on the right. Note the difference in settings recorded in CPN (see Section 6.8).

Figure captions: (a) Collocation graph of ‘time’ 3a-MI(3), L5-R5, C5-NC5. (b) Collocation graph of ‘time’ 3a-MI(5), L5-R5, C5-NC5.

To view n words in the whole graph, apply a filter (right-click) to the first column (Index).

6.8 Reporting collocates: CPN

It is important to realise that there is no single definitive set of collocates: different statistical procedures
and threshold values highlight different sets of collocates. We therefore need to report the statistical
choices involved in the identification of collocations using a standard notation called Collocation Parameters
Notation (CPN). When saving the results, GraphColl saves the settings in the form of CPN.
Brezina et al. (2015) propose CPN as a specific notation to be used for accurate description of the collocation
procedure and replication of the results. The following parameters are reported.

Parameter                          Example
Statistic ID                       4b
Statistic name                     MI2
Statistic cut-off value            3
L and R span                       L5-R5
Minimum collocate freq. (C)        5
Minimum collocation freq. (NC)     1
Filter                             function words removed

CPN: 4b-MI2(3), L5-R5, C5-NC1; function words removed
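For reporting outside the tool, a CPN string can be assembled from the seven parameters. The helper below is a minimal sketch; the function name and argument names are our own and are not part of #LancsBox, which writes the CPN itself when results are saved.

```python
def format_cpn(stat_id, stat_name, cutoff, left, right, min_c, min_nc, filter_note=""):
    """Build a Collocation Parameters Notation string from its parts."""
    core = f"{stat_id}-{stat_name}({cutoff}), L{left}-R{right}, C{min_c}-NC{min_nc}"
    # The filter is appended after a semicolon only when one was applied.
    return f"{core}; {filter_note}" if filter_note else core


print(format_cpn("4b", "MI2", 3, 5, 5, 5, 1, "function words removed"))
# prints: 4b-MI2(3), L5-R5, C5-NC1; function words removed
```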

Did you know?


The name GraphColl is an acronym for graphical collocations tool. GraphColl was the first module in
#LancsBox (v.1.0) with the other tools being added at a later stage. Graphical display of collocations
and collocation networks is inspired by the work of Phillips (1985), who demonstrated the concept of
lexical networks (Phillips’s term for ‘collocation networks’) with small specialised corpora. GraphColl
takes this notion further, offering different statistical choices and producing collocation networks on
the fly with both small and large corpora.
Phillips, M. (1985). Aspects of text structure: An investigation of the lexical organisation of text. Amsterdam: North-Holland.

7 Words tool

The Words tool allows in-depth analysis of frequencies of types, lemmas and POS categories as well as
comparison of corpora using the keywords technique.

It can be used, for example, to:


■ Compute frequency and dispersion measures for types, lemmas and POS tags.
■ Visualize frequency and dispersion in corpora.
■ Compare corpora using the keyword technique.
■ Visualize keywords.

7.1 Visual summary

[Figure: the Words tool. Left panel: creating frequency lists, computing dispersion and keywords. Right panel: visualizing frequencies, dispersions and keywords.]

■ Drag corpora together to produce keywords.
■ Right-click on the table header to activate filter.
■ Left-double-click on the corpus to see its internal structure.
■ Right-click on the corpus to see corpus statistics.
■ Right-click inside the table to activate a Whelk pop-up.

7.2 Producing frequency list

On start, Words produces a frequency list (table) based on the default corpus (see Section 2.5, point 2)
and default settings. These settings can be changed to produce a different frequency list.

1. The following are the settings for frequency lists:


i) Corpus: The corpus that is being used.
ii) Frequency: Absolute or relative frequency [default: absolute frequency].
iii) Dispersion: The dispersion statistic [default: coefficient of variation (CV)].
iv) Unit: The unit used in the frequency list (type, lemma or part of speech tag).
2. Changing any of these settings triggers re-computing of the frequency list.
3. Frequency lists can be searched using the search box (top left).
4. Frequency lists can be sorted by left-clicking on the header.
5. Frequency lists can be filtered by right-clicking on the header and applying a filter.
6. Two different frequency lists can be computed in the split-screen view, which is triggered by left-
clicking on a bar with three triangles: ▲▲▲. This brings up the bottom panel.

7.3 Visualizing frequency and dispersion

The Words module displays corpora and corpus files (when a corpus is left-double-clicked). It visualises
frequency and dispersion of words using intensity of colour and position of individual files displayed as
circles; the size of the circle indicates the relative size of the corpus/file.

Figure captions: (a) Display of frequency in the whole corpus on the scale of 0 - 68,349 (most frequent item). (b) Display of frequency per file (when the corpus is left-double-clicked).

1. To visualize frequency of an item in the table, left-click on the item in the frequency table. The
shade of the colour of the corpus will change according to the frequency value of this item. The
scale on the right offers a reference point for interpretation.
2. To visualize dispersion of an item in the table, left-double-click on the corpus (large circle). The
corpus will expand to display the individual files (small circles) of which the corpus consists. The size
of each circle is proportional to the size of the corpus subpart. The shade of the colour of the
small circles will change according to the frequency value of the item in the frequency list.
Crossed-out circles indicate that the item does not occur in the given corpus file. In addition,
the corpus files are ordered according to the relative frequency of the item, with the file with the
largest relative frequency of the item appearing at the 12 o'clock position and the other files
ordered clockwise according to decreasing relative frequency of the item.

7.4 Producing keywords

The Words module computes a comparison of frequencies between two corpora/wordlists using a
selected statistical measure. It identifies and visualizes positive keywords, negative keywords and
lockwords.

1. Left-click on ▲▲▲ to bring up the bottom panel.


2. In the bottom panel, select a comparison (reference) corpus, while in the top panel keep your
corpus of interest.
3. In the visualisation panel (right), drag the circles that represent the two corpora together.
Alternatively, press the space bar.
4. The resulting table will display frequency and dispersion info about the two corpora as well as the
keyword statistic; the graphics will identify top 10 positive keywords, top 10 negative keywords
and top 10 lockwords.
5. In the settings, you can change the i) keyword statistic and ii) threshold.
Keyword statistic: This is a measure that compares two frequency lists [default: simple maths with
constant k = 100].
Threshold: Threshold values for the identification of positive keywords, negative keywords and
(by implication) lockwords.

7.5 Producing corpus statistics

The Words module computes essential corpus statistics: i) Complexity stats and ii) Lexical stats.

1. Right-click on the corpus.
2. In the pop-up table, toggle between Complexity stats and Lexical stats.

Complexity stats: mean sentence length and standard deviation (SD).

Lexical stats: type-token ratio (TTR), standardised type-token ratio (STTR), moving average type-token ratio (MATTR).
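The three lexical statistics differ only in how the text is sliced before types are counted. The sketch below uses the textbook definitions; the chunk size of 1,000 tokens and the window of 50 tokens are assumptions for illustration, since the values #LancsBox uses are not stated here, and the tokens are assumed to be already lower-cased.

```python
def ttr(tokens):
    """Type-token ratio: unique word forms divided by running words."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def sttr(tokens, chunk=1000):
    """Standardised TTR: mean TTR of consecutive, non-overlapping full chunks."""
    chunks = [tokens[i:i + chunk] for i in range(0, len(tokens) - chunk + 1, chunk)]
    if not chunks:  # text shorter than one chunk: fall back to plain TTR
        return ttr(tokens)
    return sum(ttr(c) for c in chunks) / len(chunks)


def mattr(tokens, window=50):
    """Moving average TTR: mean TTR of every window of fixed length."""
    if len(tokens) <= window:
        return ttr(tokens)
    n_windows = len(tokens) - window + 1
    return sum(ttr(tokens[i:i + window]) for i in range(n_windows)) / n_windows


sample = "a b a b c d".split()
print(ttr(sample), sttr(sample, chunk=2), mattr(sample, window=3))
```

Plain TTR falls as texts get longer, which is why the chunked and windowed variants are usually preferred when corpora of different sizes are compared.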

Did you know?


The statistical technique of keyword analysis was originally developed by Mike Scott (1997) and
it was implemented in WordSmith Tools. It relied on corpus comparison using the chi-squared
test or the log-likelihood test. As Kilgarriff pointed out, the chi-squared test and the log-likelihood
test are not entirely appropriate for this type of comparison. Kilgarriff’s solution implemented in
Sketch Engine was to compare corpora using a ‘simple maths’ procedure, a simple ratio between
relative frequencies of words in the two corpora we compare. In addition to ‘simple maths’,
#LancsBox also offers other solutions for corpus comparison.
Scott, M. (1997). PC analysis of key words—and key key words. System, 25(2), 233-245.
Kilgarriff, A. (2009, July). Simple maths for keywords. In Proceedings of the Corpus Linguistics Conference. Liverpool, UK.

8 Ngram tool

The Ngram tool allows in-depth analysis of frequencies of n-grams (bigrams, trigrams etc.), which can
be defined as contiguous combinations of types, lemmas or POS tags. The tool also produces key ngrams by
comparing two corpora using a technique similar to keywords.
It can be used, for example, to:
■ Identify n-grams, lexical bundles and p-frames (also skip grams)
■ Compute frequency and dispersion measures for ngram types, lemmas and POS tags.
■ Visualize frequency and dispersion of ngrams in corpora.
■ Compare ngrams in two corpora using the keyword technique.
■ Visualize key ngrams.
8.1 Visual summary

[Figure: the Ngram tool. Left panel: creating frequency lists, computing dispersion and key ngrams. Right panel: visualizing frequencies, dispersions and key ngrams.]

■ Drag corpora together to produce key ngrams.
■ Right-click on the table header to activate filter.
■ Left-double-click on the corpus to see its internal structure.
■ Right-click on the corpus to see corpus statistics.
■ Right-click inside the table to activate a Whelk pop-up.

Did you know?
Multi-word expressions are extremely important when describing language. There are different
terms to describe multi-word expressions such as collocations (Brezina et al. 2015; Gablasova et
al. 2017), n-grams, lexical bundles and p-frames. While collocations, which are identified in the
GraphColl module, typically represent non-contiguous expressions, the n-gram type multi-word
expressions represent contiguous lexico-grammatical patterns. They are defined as follows.
▪ n-gram: a sequence of n types, lemmas, POS from a text or corpus.
▪ lexical bundle: an ngram with certain frequency and distributional (dispersion)
properties, e.g. relative freq. 10 per million and range > 5.
▪ p-frame (also skip gram): an n-gram that allows for variability at one or more positions
such as it would be * to.
All these types of multi-word expressions can be identified using the Ngram tool in #LancsBox.
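The three definitions can be made concrete with a few lines of Python. This is a sketch of the concepts only, not of how the Ngram tool is implemented: the function names are ours, and the bundle thresholds here use absolute frequency and a range counted in texts, whereas the example above uses relative frequency per million.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous sequences of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def p_frames(tokens, n, slot):
    """Count n-grams with the token at position `slot` replaced by '*'."""
    frames = Counter()
    for gram in ngrams(tokens, n):
        frames[gram[:slot] + ("*",) + gram[slot + 1:]] += 1
    return frames


def lexical_bundles(texts, n, min_freq, min_range):
    """N-grams meeting a frequency threshold and occurring in enough texts."""
    freq, text_range = Counter(), Counter()
    for tokens in texts:
        grams = ngrams(tokens, n)
        freq.update(grams)
        text_range.update(set(grams))  # each text counts once per n-gram
    return {g for g in freq if freq[g] >= min_freq and text_range[g] >= min_range}


tokens = "it would be nice to go and it would be good to stay".split()
frames = p_frames(tokens, n=5, slot=3)
print(frames[("it", "would", "be", "*", "to")])  # prints: 2

texts = [["it", "would", "be"], ["it", "would", "seem"], ["it", "is"]]
print(lexical_bundles(texts, n=2, min_freq=2, min_range=2))
```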

Brezina, V., McEnery, T. & Wattam, S. (2015). Collocations in context: A new perspective on collocation networks.
International Journal of Corpus Linguistics, 20(2), 139-173.
Gablasova, D., Brezina, V., & McEnery, T. (2017). Collocations in corpus‐based language learning research: Identifying,
comparing, and interpreting the evidence. Language Learning, 67 (S1), 155–179.

9 Text

The Text tool enables an in-depth insight into the context in which a word or phrase is used.

It can be used, for example, to:


■ View a search term in full context.
■ Preview a text.
■ Preview a corpus as a run-on text.
■ Check different levels of annotation of a text/corpus.

9.1 Visual summary

[Figure: the Text tool.]

■ All instances of a search term are highlighted in text.
■ Absolute and relative frequency (per 10k).
■ Up and down arrows to move between the occurrences.

9.2 Searching in Text

Texts and corpora can be searched easily using a simple search box.

1. Type the search term into the search box (top left). Left-click ‘Search’.
2. This will highlight in dark grey all lines in the text where the search term appears, with the search
term itself in red. To move between the highlighted lines, use the up and down arrows.
3. Frequency information (both an absolute and relative frequency per 10,000 tokens) will appear
under ‘Occurrences’.
4. A single line can be highlighted by left-clicking on the line. To highlight multiple lines, Ctrl
(Command) + Left-click the desired lines.
5. Highlighted lines can be copied (Ctrl/Command+C) and pasted (Ctrl/Command+V) into a text
editor.

9.3 Settings

The following settings are used in Text: i) Corpus, ii) Text and iii) Display.

1. Corpus: this setting allows changing the corpus which is being displayed and searched. Note that
different corpora can be searched in the top and the bottom panel in the split-screen view.
2. Text: this setting allows changing the text that is being displayed and searched.
3. Display: this setting allows changing the display format. The ‘Plain text’ default can be changed to
‘Text with POS’, ‘Lemmatized text’ and ‘All annotation’.

10 Wizard

The Wizard tool combines the power of all tools in #LancsBox, searches corpora and produces research
reports for print (docx) and web (html).

It can be used, for example, to:


■ Carry out simple or complex research.
■ Produce a draft report.
■ Download all relevant data.

10.1 Visual summary

[Figure: the Wizard tool.]

■ Modify the name of the report and its location.
■ Choose a corpus. Press Shift or Ctrl for selecting multiple corpora.
■ Choose tools and modify settings.
■ Enter search term(s), if KWIC, GraphColl, Whelk or Text are used.
■ Load search terms from a file.
■ Run the tool.

10.2 Selecting settings and running Wizard

Wizard produces research reports automatically. All you need to do is select the corpus/corpora and
procedures to use.

1. Select corpora in the ‘Corpora to use’ panel (left). For multiple adjacent corpora, hold Shift on the
keyboard while selecting; for multiple corpora that are not next to each other, hold Ctrl (or Control
on Mac).
2. If the corpus you want does not appear on the list, go to ‘Corpora’ tab and add it.
3. Choose the tools you want to employ. The choice of the tools depends on the type of analysis you
want to perform.

▪ KWIC: analysis of concordance lines.


▪ GraphColl: collocation analysis.
▪ Whelk: analysis of frequencies in individual texts.
▪ Words: analysis of frequencies and dispersions of individual lexical items and keyword
analysis.
▪ Ngrams: analysis of frequencies and dispersions of ngrams.
▪ Text: analysis of broader contexts.

4. Adjust the tool settings by clicking on the ‘Settings’ button next to the tool.
5. Enter search term(s) if KWIC, GraphColl, Whelk or Text is used. Alternatively, load search terms
from a text (txt) file.
6. Choose the location where the report and extracted data will be saved or leave the default (Desktop).
7. Press ‘Run’.

10.3 Data analysis

Wizard produces the data analysis in the background and informs the user about the progress. The
complete data set is saved together with a report.

1. The data set includes the following folders.

2. csv (comma separated files; open in Excel or Calc) include data for Lancaster Stats Tools online.
3. images (png; open in a standard graphics app) include graphs and other graphical output.
4. tsv (tab separated files; open in Excel or Calc) include complete data output from the individual
tools.
5. xml (extensible markup language; opens in a text editor) includes a complete Wizard data set,
which can be used for further computational processing.

10.4 Research report

Wizard produces a structured data report in two formats: docx and html.

1. The .docx report can be easily edited in Word, Writer or a similar word processor.

2. The length and contents of the report depend on the number of corpora and tools that were
selected.
3. The report follows the structure of an academic research report.

11 Searching in #LancsBox

Throughout the tool, #LancsBox offers powerful searches at different levels of corpus annotation using
i) simple searches, ii) wildcard searches, iii) smart searches, iv) regex searches and v) batch searches.

1. Simple searches are literal searches for a particular word (new) or phrase (New York Times). Simple
searches are case insensitive; this means that new, New, NEW, NeW etc. will return the same set
of results.
2. Wildcard searches are searches including one of four special characters: *, >, < and =.
Special character      Meaning                            Example of use
*                      0 or more characters               new* [new, news, newly, newspaper…]
* [with space]         any word                           new * [new car, New York, new ideas…]
>                      larger than
<                      smaller than
=                      equals [combined with < and >]

3. Smart searches are searches predefined in the tool to offer users easy access to complex searches;
smart searches are unique to #LancsBox. These searches are used for searching for word classes
(NOUNS, VERBS etc.), complex grammatical patterns (PASSIVES, SPLIT INFINITIVE etc.) and
semantic categories (PLACE ADVERBIALS).
Smart searches are defined specifically for a particular language inside the tool. Currently, a small
group of features is pre-defined in the resources folder: resources\languages\[name of
language]\Searches.txt. The user can edit this file by adding or deleting items.

The following smart searches are available for English:


NOUNS N*
PROPER NOUNS NP*
VERBS [VM].*
ADJECTIVES JJ*
ADVERBS W?R.*
MODALS *MD*
CONNECTORS /(IN|CC)/
PRONOUNS (PP\$?|WP\$?)
? /.*\?/pi
! /.*\!/pi
. /.*\./pi
"," "/.*\,/pi"
CONTRACTIONS /.*(\'(s|re|ve|d|m|em|ll)|n\'t)/ /.*\|[^P].*/
PASSIVES "/VB. (R.* ){0,3}V.N/"
COMPLEX NOUN PHRASE "/(JJ.? ){1,5}NN.? /"
PAST TENSE /V.D/
NOMINALIZATIONS "/.{3,}(tion|tions|ment|ments|ness|nesses|ity|ities)/i"

SPLIT INFINITIVE /TO R.* V.*/
PRESENT TENSE /V.[PZ]/
PLACE ADVERBIALS /aboard|above|abroad|across|ahead|alongside|around|ashore|astern|away|behind|below|beneath|beside|downhill|downstairs|downstream|east|far|hereabouts|indoors|inland|inshore|inside|locally|near|nearby|north|nowhere|outdoors|outside|overboard|overland|overseas|south|underfoot|underneath|uphill|upstairs|upstream|west/
TIME ADVERBIALS /afterwards?|again|earlier|early|eventually|formerly|immediately|initially|instantly|late|lately|later|momentarily|now|nowadays|once|originally|presently|previously|recently|shortly|simultaneously|soon|subsequently

4. Regex searches are advanced searches that allow searching for any combination of characters.
Any expression enclosed in forward slashes (//) is interpreted as a regular expression. #LancsBox
supports Perl-compatible regular expressions.
Regex        Explanation
/word/       A string of characters (case sensitive)
/word/i      A string of characters (case insensitive)
/word\./p    Punctuation search: a string of characters followed by full stop (case sensitive)
[abc]        A single character: either a, b or c
[^abc]       Any single character except a, b or c
[a-z]        Any single character in the range a-z
[a-zA-Z]     Any single character in the range a-z or A-Z
[0-9]        A single number in the range 0-9
.            Any single character
(a|b)        a or b
a?           Zero or one of a
a*           Zero or more of a
a+           One or more of a
a{3}         Exactly 3 of a
a{3,}        3 or more of a
a{3,6}       Between 3 and 6 of a
\d           Any digit
\D           Any non-digit
\w           Any word character (letter, number, underscore)
\W           Any non-word character
5. Batch searches allow searching for multiple search terms recursively and saving the results
automatically; #LancsBox supports both simple and complex batch searches. Batch searches can
be used in KWIC, GraphColl and Whelk modules when the corpora are tagged. Here is how batch
searches work.
a) Click on the down arrow in the search box to activate Advanced search options. The last
option is a batch search. Click on ‘Batch’.

b) Navigate to and load a text file with the appropriate search terms, one per line. Simple
search terms include a list of word forms to be searched; complex search terms are
defined via a combination of criteria such as word form, pos tag, headword etc…
Consecutive criteria need to be present on the same line separated by tab (\t) in the
following order: label – wordform – headword – pos – user tag. This is best achieved by
creating the file with advanced batch search terms in Excel or Calc. Examples of simple
and complex searches can be seen below.

Simple batch search (each search term on a separate line):
my
cat
go
went

Complex batch search (tab separated): label – wordform – headword – pos – user tag

c) Once the file with search terms is loaded, click on the ‘Search’ button and navigate
to the location where the results will be saved.
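As an alternative to a spreadsheet, a complex batch file can be written with a short script. The sketch below only illustrates the tab-separated column order given above (label, wordform, headword, pos, user tag). The row values are hypothetical, and leaving an unused criterion empty is an assumption that should be checked against a file known to work in your installation.

```python
import csv
import io

# Column order as described in step b); "user_tag" stands for the user tag column.
FIELDS = ("label", "wordform", "headword", "pos", "user_tag")


def batch_lines(rows):
    """Serialise search definitions as tab-separated lines, one search per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for row in rows:
        # Criteria that are not specified are written as empty fields (assumption).
        writer.writerow([row.get(field, "") for field in FIELDS])
    return buffer.getvalue()


# Hypothetical search definitions, for illustration only.
rows = [
    {"label": "go_verb", "headword": "go", "pos": "V.*"},
    {"label": "cat_noun", "wordform": "cat", "pos": "N.*"},
]
text = batch_lines(rows)
print(text)
```

The returned string can then be saved as a plain text (txt) file and loaded through the ‘Batch’ option.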

12 Statistics in #LancsBox

#LancsBox uses statistics for calculating measures of i) frequency, ii) dispersion, iii) keywords and iv)
collocation. The equations of these measures can be reviewed and modified on the ‘Stats’ tab, which is
called by clicking on the Σ button.

12.1 Frequency measures

1. absolute frequency = $O_{11}$
2. relative frequency = $\frac{O_{11}}{R_1} \times 10{,}000$
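As a worked example of the second formula, the sketch below normalises an absolute frequency to a rate per 10,000 tokens. It assumes that R1 is the total number of tokens in the corpus, following the Glossary definition of relative frequency; the function and parameter names are ours.

```python
def relative_frequency(o11, r1, basis=10_000):
    """Absolute frequency o11 normalised per `basis` tokens of a corpus of r1 tokens."""
    if r1 <= 0:
        raise ValueError("corpus size must be positive")
    return o11 / r1 * basis


# 250 hits in a corpus of one million tokens: about 2.5 per 10,000 tokens.
print(relative_frequency(250, 1_000_000))
```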

12.2 Dispersion measures

1. $CV = \frac{SD}{\text{mean}}$
2. $SD = \sqrt{\frac{\sum (x - \text{mean})^2}{n}}$
3. Range = number of files where the search term occurs at least once
4. Range % = $\frac{\text{Range}}{\text{number of files}} \times 100$
5. $D = 1 - \frac{CV}{\sqrt{\text{number of files} - 1}}$
6. $DP = \frac{\sum |\text{observed proportion} - \text{expected proportion}|}{2}$
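The six measures can be computed together from two lists: the absolute frequency of the item in each file and the size of each file in tokens. The sketch below is an illustration under stated assumptions, not the #LancsBox implementation: x is taken to be the relative frequency per file, and for DP the observed proportion is the share of all hits found in a file while the expected proportion is the share of all tokens found in that file.

```python
import math


def dispersion(freqs, sizes, basis=10_000):
    """Dispersion measures from per-file absolute frequencies and file sizes."""
    n = len(freqs)
    if n < 2 or sum(freqs) == 0:
        raise ValueError("need at least two files and at least one occurrence")
    rel = [f / s * basis for f, s in zip(freqs, sizes)]  # relative freq. per file
    mean = sum(rel) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in rel) / n)  # population SD
    cv = sd / mean
    file_range = sum(1 for f in freqs if f > 0)
    total_f, total_s = sum(freqs), sum(sizes)
    dp = sum(abs(f / total_f - s / total_s) for f, s in zip(freqs, sizes)) / 2
    return {
        "SD": sd,
        "CV": cv,
        "Range": file_range,
        "Range %": file_range / n * 100,
        "D": 1 - cv / math.sqrt(n - 1),
        "DP": dp,
    }


even = dispersion([10, 10, 10, 10], [1000, 1000, 1000, 1000])
skewed = dispersion([20, 0], [1000, 1000])
print(even["D"], even["DP"])
print(skewed["Range %"], skewed["DP"])
```

A perfectly even distribution gives D = 1 and DP = 0; an item concentrated in one of two equal-sized files gives D close to 0 and DP = 0.5.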

12.3 Keyword measures

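1. simple maths parameter = $\frac{\text{relative frequency of } w \text{ in } C + k}{\text{relative frequency of } w \text{ in } R + k}$
2. log likelihood short = $2 \times \left( O_{11} \times \log\frac{O_{11}}{E_{11}} + O_{21} \times \log\frac{O_{21}}{E_{21}} \right)$
3. % DIFF = $\frac{(\text{relative freq. in } C - \text{relative freq. in } R) \times 100}{\text{relative freq. in } R}$
4. Log Ratio = $\log_2\left(\frac{\text{relative freq. in } C}{\text{relative freq. in } R}\right)$
5. Cohen’s d = $\frac{\text{Mean in } C - \text{Mean in } R}{\text{pooled SD}}$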
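Three of these measures depend only on the two relative frequencies and are easy to check by hand. The sketch below implements them as written above, with C as the corpus of interest and R as the reference corpus; the function names are ours, and the input values are invented round numbers. Note that % DIFF and Log Ratio are undefined when the relative frequency in R is zero, which is one motivation for the constant k in simple maths.

```python
import math


def simple_maths(rel_c, rel_r, k=100):
    """Simple maths parameter with smoothing constant k."""
    return (rel_c + k) / (rel_r + k)


def percent_diff(rel_c, rel_r):
    """% DIFF; raises ZeroDivisionError when rel_r is 0."""
    return (rel_c - rel_r) * 100 / rel_r


def log_ratio(rel_c, rel_r):
    """Binary log of the ratio of relative frequencies."""
    return math.log2(rel_c / rel_r)


print(simple_maths(300, 100))  # (300 + 100) / (100 + 100) = 2.0
print(percent_diff(300, 100))  # 200.0
print(log_ratio(400, 100))     # 2.0
```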

12.4 Collocation measures

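ID   Statistic                  Equation
1    Freq. of co-occurrence     $O_{11}$
2    MU                         $\frac{O_{11}}{E_{11}}$
3    MI (Mutual information)    $\log_2 \frac{O_{11}}{E_{11}}$
4    MI2                        $\log_2 \frac{O_{11}^2}{E_{11}}$
5    MI3                        $\log_2 \frac{O_{11}^3}{E_{11}}$
6    LL (Log likelihood)        $2 \times \left( O_{11} \times \log\frac{O_{11}}{E_{11}} + O_{21} \times \log\frac{O_{21}}{E_{21}} + O_{12} \times \log\frac{O_{12}}{E_{12}} + O_{22} \times \log\frac{O_{22}}{E_{22}} \right)$
7    Z-score                    $\frac{O_{11} - E_{11}}{\sqrt{E_{11}}}$
8    T-score                    $\frac{O_{11} - E_{11}}{\sqrt{O_{11}}}$
9    DICE                       $\frac{2 \times O_{11}}{R_1 + C_1}$
10   LOG DICE                   $14 + \log_2 \frac{2 \times O_{11}}{R_1 + C_1}$
11   LOG RATIO                  $\log_2 \frac{O_{11} \times R_2}{O_{21} \times R_1}$
12   MS (Minimum sensitivity)   $\min\left(\frac{O_{11}}{C_1}, \frac{O_{11}}{R_1}\right)$
13   DELTA P                    $\frac{O_{11}}{R_1} - \frac{O_{21}}{R_2}$ ; $\frac{O_{11}}{C_1} - \frac{O_{12}}{C_2}$
14   Cohen’s d                  $\frac{\text{Mean in window} - \text{Mean outside window}}{\text{pooled SD}}$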
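Several of the equations above need only the observed co-occurrence frequency and the marginal totals. The sketch below computes them for one node-collocate pair. It assumes the usual contingency-table convention, which this section does not spell out: R1 and C1 are the row and column totals for the node and the collocate, N is the table total, and the expected frequency is E11 = R1 × C1 / N. How #LancsBox derives these totals from the collocation window is not covered here, and the input values are invented round numbers.

```python
import math


def association(o11, r1, c1, n):
    """Selected association measures from a 2x2 contingency table (needs o11 > 0)."""
    e11 = r1 * c1 / n  # expected co-occurrence frequency (assumed convention)
    return {
        "MU": o11 / e11,
        "MI": math.log2(o11 / e11),
        "MI2": math.log2(o11 ** 2 / e11),
        "MI3": math.log2(o11 ** 3 / e11),
        "Z-score": (o11 - e11) / math.sqrt(e11),
        "T-score": (o11 - e11) / math.sqrt(o11),
        "DICE": 2 * o11 / (r1 + c1),
        "LOG DICE": 14 + math.log2(2 * o11 / (r1 + c1)),
        "MS": min(o11 / c1, o11 / r1),
    }


scores = association(o11=8, r1=16, c1=32, n=1024)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

With these inputs E11 is 0.5, so MI is 4, MI2 is 7 and MI3 is 10: squaring or cubing O11 rewards frequent combinations, which is why the three measures rank the same collocates differently.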

13 Glossary

Absolute (or raw) frequency – The simple frequency with which a search term occurs in a corpus or its
part(s); a number of hits of a search term in a corpus.

Batch search – A batch search enables searching for multiple search terms recursively and saving the
results automatically; #LancsBox supports both simple and complex (i.e. defined via a combination of
criteria such as wordform, pos tag, headword etc.) searches.

Colligation – Systematic co-occurrence of grammatical categories (e.g. POS tags) in text identified
statistically.

Collocate – A word that systematically occurs with the node (word or phrase of interest, search term).

Collocation – Systematic co-occurrence of words in text identified statistically.

Collocation graph is a visual display of the association between a node and its collocates. See GraphColl.

Collocation network is a visual display of complex associations (collocations) in language and discourse. It
consists of multiple inter-connected collocation graphs. See GraphColl.

Concordance line – A single line in the KWIC display representing a node (search term) with the words
before and after it (the left and right context).

Concordance is a typical form of display of examples of language use found in a corpus with the node
(search term) centred in the middle and several words of context displayed left and right of the node.
Concordance is sometimes also called a 'KWIC (display)'.

Corpus (pl. corpora) – A collection of language data that can be searched by a computer.

Dispersion – The spread of values of a variable (e.g. relative frequencies of a search term) in a dataset
(corpus). Dispersion is measured statistically using metrics such as standard deviation (SD), coefficient of
variation (CV), range, Juilland’s D, DP etc. See Words.

Frequency – The number of times a search term occurs in the corpus. A distinction is made between
absolute (absolute number of hits) and relative frequency (proportional frequency per X number of
tokens).

Frequency distribution – Information about the frequencies of a word or
phrase in different parts of the corpus. See Whelk.

GraphColl is a module in #LancsBox which identifies collocations and builds collocation networks on the
fly.

Import – In #LancsBox, processing of corpus data and making it available to all modules in the package.

KWIC is an abbreviation for 'keyword in context'. This is a typical form of display of examples found in a
corpus with the node (word or phrase of interest) centred in the middle and several words of context
displayed left and right of the node. KWIC is sometimes also called a 'concordance'. KWIC is also the
name of a module in #LancsBox.

Left context – The words preceding a particular search term (node). Individual positions in the left-
context are referred to as L1 (position immediately preceding), L2, L3 etc.

Lemma – All inflected forms belonging to one stem; in #LancsBox by default, a combination of a
headword and a grammatical category (e.g. go + VERB). For example, a lemma ‘go’ includes the following
word forms (types): ‘go’, ‘goes’, ‘went’, ‘going’ and ‘gone’.

Lexical bundle – an n-gram with certain frequency and distributional (dispersion) properties, e.g. relative
freq. 10 per million and range > 5.

Loaded – In #LancsBox, when a corpus is loaded it is available to be analysed. To re-load a corpus,
double-left-click on the name of the corpus.

Module – A specific tool within #LancsBox offering particular analytical functionalities. #LancsBox
includes five different modules: KWIC, Whelk, GraphColl, Words and Text.

N-gram – a sequence of n types, lemmas, POS from a text or corpus.

Node – The word, phrase or grammatical structure of interest. See Search term.

Part of speech (POS) – A grammatical category, a word class. Part-of-speech is usually assigned
automatically using a process called part-of-speech tagging (see below). #LancsBox includes TreeTagger,
which performs part-of-speech tagging for a range of languages.

Part-of-speech tagging (POS tagging) – A process of adding information about the grammatical category
of each word in a text or corpus. For example, the following sentence was POS-tagged: Automatically_RB
annotates_VBZ data_NNS for_IN part-of-speech_NN.

P-frame (also skip gram) – an n-gram that allows for variability at one or more positions such as it would
be * to.

Regular expressions (regex) – A special meta-language that allows advanced users to search for any
combination of strings. In #LancsBox, regex searches are enclosed in forward slashes e.g. /.*ions?/

Relative (or normalized) frequency (RF) is calculated as the proportion of the absolute frequency of a
word we are interested in divided by the total number of words (tokens) in the corpus. This number is
usually multiplied by an appropriate basis for normalization (e.g. 10,000).

Right context – The words following a particular search term (node). Individual positions in the right-
context are referred to as R1 (position immediately following), R2, R3 etc.

Split screen – A comparison option in #LancsBox where the screen can be split into two panels; each
panel can display a different type of analysis. #LancsBox allows a second panel to be opened and
minimised by left-clicking on the three small triangles (▲▲▲/▼▼▼).

Tab – A further ‘page’ that can be opened in #LancsBox to run multiple analytical procedures
simultaneously. Each module in #LancsBox can run on an unlimited number of tabs.

Tagging – The process of adding linguistic information to the words in a text or corpus, automatically or
semi-automatically. See Part-of-speech tagging.

Text – A basic unit of a corpus; a corpus is a collection of multiple texts. Text is also the name of a module
in #LancsBox that displays and searches texts in corpora.

Threshold – Setting options in GraphColl and Words to display only relevant collocates or keywords
respectively.

Token is a single occurrence of a word form in a text or corpus.

TreeTagger is a part-of-speech tagger developed by Helmut Schmid, which performs part-of-speech
tagging for a range of languages.

Type is a unique word form in a text or corpus.

Whelk is a module in #LancsBox which provides information about how the search term is distributed
across corpus files.

Words is a module in #LancsBox which allows in-depth analysis of frequencies of types, lemmas and POS
categories as well as comparison of corpora using the keywords technique.

14 Messages.Properties

How to configure #LancsBox for advanced users

The Messages.properties file lets you customise #LancsBox. The things you change in here will change how the program operates and looks.
15 Making Changes
To change a setting in Messages.properties: First, look for the setting you want. This will look something like: an.interesting.setting = value. We will call the first
part (before the =) the key and the second part (after the =) the value.
Changing the value will change the setting. Each of the values has its own type. This just means that when you change a colour it should be for another colour, not
for a word. We will now introduce the value types used in Messages.properties.
• Path – This tells #LancsBox where on your computer to look for something. This could be where to find an icon, or where to find other information that #LancsBox relies on. It will look like this: /resources/path/to/a/file. Any path starting with /resources refers to something within the resources folder.
• Integer (This just means a whole number)
◦ As a number – Some settings want an actual number, like the default span for KWIC searches.
◦ As a selector – Sometimes we use an integer to pick between several options. The options will be described in comments so you will know what your choices are.
◦ As a true/false – Some settings turn things on or off in #LancsBox. If you want the setting on, the value should be 1; 0 turns it off.
• Colour – You can change many of the colours used in #LancsBox. These follow a particular format. We recommend using an online colour picker to find out what the value should be. It will begin with a #.
• Regular expression – These are like the regular expressions that you use within #LancsBox, with two changes. Firstly, the enclosing slashes (/ /) are omitted. Secondly, the options (like i) are in the Java format: they are in round brackets, start with a question mark and precede the expression itself, like this: (?i)expression
• Literal text – Whatever you type as the value here will be used directly. However, please note that the UI for #LancsBox uses an English font, which may limit the usefulness of these settings.
• Number format – Java has its own way of defining number formats. Changing these in Messages.properties will change how #LancsBox displays numbers and can be useful for altering the number of decimal places in tables.
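The value types above can be illustrated with a minimal Java sketch. It assumes the file follows the standard java.util.Properties syntax; the class ValueTypesDemo and its helper methods are hypothetical names introduced for illustration and are not part of #LancsBox. The sample keys and values are copied from the default file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Arrays;
import java.util.Properties;

// Hypothetical helper class, not part of #LancsBox.
final class ValueTypesDemo {

    // A few key = value lines in the style of Messages.properties.
    static final String SAMPLE = String.join("\n",
            "database.dir = resources/corpora",
            "kwic.left.def =5",
            "slider.lock = 0",
            "colours.highlight =#00A4FF",
            "statusbar.welcome_message = Welcome to #LancsBox");

    // Parse the text using the standard Java properties syntax.
    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return props;
    }

    // Integer used as a plain number or as a selector.
    static int asInt(Properties p, String key) {
        return Integer.parseInt(p.getProperty(key).trim());
    }

    // Integer used as true/false: 1 is on, 0 is off.
    static boolean asFlag(Properties p, String key) {
        return asInt(p, key) == 1;
    }

    // Colour written as # followed by six hex digits; returns {red, green, blue}.
    static int[] asColour(Properties p, String key) {
        int rgb = Integer.parseInt(p.getProperty(key).trim().substring(1), 16);
        return new int[] {(rgb >> 16) & 0xFF, (rgb >> 8) & 0xFF, rgb & 0xFF};
    }

    public static void main(String[] args) {
        Properties p = load(SAMPLE);
        System.out.println(p.getProperty("database.dir"));               // path
        System.out.println(asInt(p, "kwic.left.def"));                   // integer as a number
        System.out.println(asFlag(p, "slider.lock"));                    // integer as true/false
        System.out.println(Arrays.toString(asColour(p, "colours.highlight"))); // colour
        System.out.println(p.getProperty("statusbar.welcome_message"));  // literal text
    }
}
```

Note that a # only starts a comment at the beginning of a line, so colour values and the text "Welcome to #LancsBox" are read as ordinary values.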

Messages.properties file with explanations


# This is the Messages.properties file
# tagger
tagger.dir = resources/tagger
tagger.langs = resources/tagger/models

Tagger
.dir and .langs both have path values and refer to parts of a Tree-Tagger installation. The root directory of the Tree-Tagger is defined by .dir, and .langs is a folder containing the language-specific files (.par extension).

# database
# 0 - RAM database, basic persistence
# 1 - QuestDB (recommended) (64bit)

database.use = 1
database.dir = resources/corpora
database.cache.size = 2000000

Database
.dir lets you change where corpora will be stored and is a path value. The number of tokens held in RAM can be limited or expanded by changing .size, which is an integer value. A number of different databases can be used within #LancsBox and .use lets you change which one is being used. The integer value can be one of the options listed in the comment above the setting. Note that you can’t load a corpus using a database if it wasn’t created by the same database.

#download locations
downloads.corpora = resources/downloads/corpora
downloads.wordlists = resources/downloads/wordlists

Download locations
When you download wordlists and corpora, they are saved in the downloads folder in resources prior to being imported into a corpus. Changing the path values of these settings lets you change where they get saved.

# language settings
langs.dir = resources/languages

Language settings
The language-specific settings are stored in the languages folder in resources. Changing the .dir setting lets you change this location.

# default tokenizer settings
defaults.punctuation =.,:;?!\u00e3\u0080\u0082\u00ef\u00bc\u008c\u00ef\u00bc\u009b\u00ef\u00bc\u009a\u00ef\u00bc\u009f\u00ef\u00bc\u0081\u00e2\u0080\u009a\u00c2\u00bf\u00c2\u00a1\u00e2\u0080\u00a6'\"\u00e2\u0080\u0098\u00e2\u0080\u0099`\u00e2\u0080\u009c\u00e2\u0080\u009d\u00e2\u0080\u009e()<=>[]{}\u00e2\u0080\u00b9\u00e2\u0080\u00ba\u00e3\u0080\u008a\u00e3\u0080\u008b-\u00e2\u0080\u0093\u00e2\u0080\u0094\u00e4\u00b8\u0080*
defaults.segmentation =\\t\\n \\r
defaults.sentence_boundary =(?s).*[\\.|!|\\?|。|?|!].*

Default tokenizer settings
The tokenizer can be configured on the corpora panel in #LancsBox. The default values that appear in those boxes come from here. Making the change in Messages.properties means that you only have to make the change once. Please note that .punctuation and .segmentation are literal values (which include an additional escape character, \) whereas .sentence_boundary is a regular expression. The sentence boundary is used for calculating average sentence length and similar metrics.
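As an illustration of the Java regular expression format, the following sketch compiles the default sentence_boundary expression with java.util.regex. It assumes that the doubled backslashes in the file are reduced to single ones when the file is loaded, and that the expression is tested against a whole token with matches(); how #LancsBox actually applies it is not documented here. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of #LancsBox.
final class SentenceBoundaryDemo {

    // The default value of defaults.sentence_boundary as a Java string literal.
    // \u3002 is the ideographic full stop that appears in the default value.
    static final Pattern SENTENCE_BOUNDARY =
            Pattern.compile("(?s).*[\\.|!|\\?|\u3002|?|!].*");

    // True if the whole text matches, i.e. it contains one of the bracketed characters.
    static boolean containsBoundary(String text) {
        return SENTENCE_BOUNDARY.matcher(text).matches();
    }

    // Java-style option: (?i) placed before the expression makes it case-insensitive.
    static boolean matchesIgnoringCase(String expression, String text) {
        return Pattern.compile("(?i)" + expression).matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(containsBoundary("end."));                  // true
        System.out.println(containsBoundary("word"));                  // false
        System.out.println(containsBoundary("line one\nreally?"));     // true: (?s) lets . match a newline
        System.out.println(matchesIgnoringCase("corpus", "CORPUS"));   // true
    }
}
```

In Java, a | inside square brackets is a literal character rather than an alternation, so the default expression as written also matches text containing a vertical bar.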

# Script directory
stats.dir = resources/stats
stats.threshold = resources/groovy/default_threshold.groovy
stats.dir.collocate = resources/stats/collocate
stats.dir.keyword.frequency = resources/stats/keyword/frequency
stats.dir.keyword.dispersion = resources/stats/keyword/dispersion
stats.dir.keyword.statistic = resources/stats/keyword/statistic
shaders.dir = resources/shaders

Script directory
#LancsBox calculates statistics using a number of external scripts, which you can also edit. Each group of scripts lives in a different folder. These path values let you change where they are read from.

# Tool logo
icons.logo = resources/images/logo.png

Tool logo
The path value can be changed to replace the #LancsBox logo with another image.

# Fonts
fonts.all.size = 12
fonts.table.size = 12
fonts.2d.scale = 0.25
fonts.graph.size = 84
fonts.graph.size.scale = 0.125
fonts.keyword.size = 84
fonts.keyword.size.scale = 0.4

Fonts
The font sizes used in #LancsBox can be altered here. The graph and keyword fonts have large sizes which use the .scale values to shrink them. This gives high-resolution text at a good size. To improve the quality of text in the graph and words tools, make the appropriate .size values larger and the .scale values smaller.

# Select the fonts to use when there is no custom font installed.
# The custom font is the first .ttf file found in the resources/fonts folder.
# Java logical font options:
# 1 - Dialog
# 2 - DialogInput
# 3 - Monospaced
# 4 - Serif
# 5 - SansSerif
fonts.default.ui = 5
fonts.default.3d = 1

Java uses what it calls logical fonts. The #LancsBox UI uses different logical fonts by default. You can change the default font options by changing the .ui and .3d integer values. A comment precedes the settings to inform you of the available options. The data font can be overridden by placing a single .ttf file in the resources/fonts folder.

# Misc
window.size.width = 1024
window.size.height = 768
slider.lock = 0
tokeniser.allowRtoL =0
display.default.RtoL =0
numbers.format.integer =###,###,###,###,###
numbers.format.real =####0.000000
numbers.format.real_short =####0.00

Misc
.lock allows you to either lock or unlock (true/false) the slider which lets you resize the tables in graph and words. Most right-to-left corpora are actually stored in left-to-right format in the files, but are displayed in reverse. If the actual data is stored as right to left (very unlikely), then .allowRtoL can be enabled and a new checkbox will appear in the import options on the corpora tab. In the much more likely event that the data is left to right but should be displayed right to left, you can change the default display direction of #LancsBox using the .RtoL setting. Both of these settings are also true/false values. The format of numbers in tables can be changed using the .integer, .real and .real_short settings.
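The three numbers.format values look like java.text.DecimalFormat patterns. Assuming that is how they are interpreted, the sketch below shows what each default pattern produces and how shortening the pattern reduces the number of decimal places. The class name is hypothetical, and the sketch fixes English number symbols so that its output does not depend on the computer's locale; #LancsBox itself may behave differently in this respect.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

// Hypothetical demo class, not part of #LancsBox.
final class NumberFormatDemo {

    // Format a value with a DecimalFormat pattern, using English separators.
    static String format(String pattern, double value) {
        DecimalFormatSymbols symbols = DecimalFormatSymbols.getInstance(Locale.ENGLISH);
        return new DecimalFormat(pattern, symbols).format(value);
    }

    public static void main(String[] args) {
        // numbers.format.integer: thousands separators, no decimals
        System.out.println(format("###,###,###,###,###", 1234567)); // 1,234,567
        // numbers.format.real: six decimal places
        System.out.println(format("####0.000000", 3.14159));        // 3.141590
        // numbers.format.real_short: two decimal places
        System.out.println(format("####0.00", 3.14159));            // 3.14
        // A shorter pattern gives fewer decimal places
        System.out.println(format("####0.0", 3.14159));             // 3.1
    }
}
```

In a DecimalFormat pattern, 0 marks a digit that is always shown and # a digit that is shown only when needed, so the number of zeros after the decimal point sets the number of decimal places.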

# General program colours
colours.bar =#4B4B4B
colours.highlight =#00A4FF
colours.text_highlight =#ff6600
colours.text =#4B4B4B
colours.advanced_arrow =#B3B1B0

General program colours
Some of the more widely used colours in #LancsBox can be changed here.

# General UI paths and settings
icons.frame = resources/images/icon.png
icons.tabs.attach = resources/images/pin1.png
icons.tabs.close = resources/images/cross.png
icons.generic.right_arrow = resources/images/right-arrow.gif
icons.corpora = resources/images/corpora.png
icons.save = resources/images/save.png
icons.stats = resources/images/stats.png
icons.about = resources/images/about.png
icons.help = resources/images/help.png
icons.kwic = resources/images/kwic.png
icons.graph = resources/images/graph.png
icons.compare = resources/images/compare.png
icons.compare.disabled = resources/images/compareDisabled.png
statusbar.welcome_message = Welcome to #LancsBox

General UI paths and settings
The path values can be changed to load custom icons in the #LancsBox UI. The default message on the status bar can be changed by altering the .welcome_message value, which is a literal string.

# table icons
icons.sort.ascending = resources/images/upArrow.png
icons.sort.descending = resources/images/downArrow.png
icons.sort.ascending.filtered = resources/images/upArrowSquare.png
icons.sort.descending.filtered = resources/images/downArrowSquare.png
icons.sort.filter = resources/images/square.png
icons.sort.random = resources/images/random.png

Table icons
Changing these path values lets you change the icons that appear in #LancsBox tables.

# The tooltips for various buttons
buttons.tooltip.corpora = Corpora
buttons.tooltip.save = Save
buttons.tooltip.graph = Collocation graphs and networks tool
buttons.tooltip.kwic = <html>Concordance tool</html>
buttons.tooltip.whelk = Dispersion tool
buttons.tooltip.keywords = Wordlists and keywords tool
buttons.tooltip.ngram = N-Gram tool
buttons.tooltip.text = Text tool
buttons.tooltip.help = Help
buttons.tooltip.stats = Statistics
buttons.tooltip.about = About
buttons.popup.close = Apply

The tooltips for various buttons
The tooltips are customisable for the main buttons. These are the toolbar and status bar buttons that you first see when loading #LancsBox.

# Generic button labels, reused throughout
buttons.generic.browse = Load data
buttons.generic.delete = Delete
buttons.generic.clear = Clear
buttons.generic.run = Run
buttons.generic.new = New
buttons.generic.load = Load
buttons.generic.save = Save
buttons.generic.close = Close
#buttons.generic.stop = Stop

Generic button labels, reused throughout
The text of some buttons in the UI can be changed by altering these literal string values. This includes the apply button on some popups.

# Load pane
labels.load.prompt_name = Name:
labels.load.corpus_name = Corpus
labels.load.case = Clamp types to lowercase
labels.load.punctuation = Store punctuation
buttons.load.new = Import!
buttons.load.reset = Reset to defaults
icons.load.corpus = resources/images/corpus.png
icons.load.wordlist = resources/images/wordlist.png

Load pane
The main corpora pane uses some literal strings that can be changed here.

# Stats pane text
tabs.name.stats = Statistics
labels.stats.name = Name:
buttons.stats.commit = Save
buttons.stats.save = Save as...
buttons.stats.load = Open...
buttons.stats.remove = Remove
buttons.stats.revert = Revert

Stats pane text
The stats panel uses some literal strings that can be changed here.

# n-gram settings
defaults.ngrams = 2

N-gram settings
The n-grams tool defaults to being a bigram tool. This can be changed by changing this integer value.

# keywords renderer
colours.keywords.corpus_name_dark = #000000
colours.keywords.corpus_name_light = #bababa
colours.keywords.text = #000000
colours.keywords.target = #c60db8
colours.keywords.target.max = #c60db8
colours.keywords.reference = #2e3131
colours.keywords.reference.max = #2e3131
colours.keywords.highlight = #ff6600
colours.keywords.table = #d1d1d1
colours.scroll = #5f5f5f
colours.no_scroll = #d1d1d1

Keywords renderer
The Words / Ngrams tools have a number of colours which can be changed. Those which have a corresponding .max value denote a colour range. The frequency colours will be interpolated using these ranges.
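To make the idea of a colour range concrete, here is a minimal sketch of interpolating between a colour and its .max counterpart. It assumes simple linear interpolation of the red, green and blue components, which may not be the method #LancsBox actually uses; the class and method names are hypothetical.

```java
// Hypothetical demo class, not part of #LancsBox.
final class ColourRangeDemo {

    // Split a colour such as #c60db8 into {red, green, blue}.
    static int[] parse(String hex) {
        int rgb = Integer.parseInt(hex.trim().substring(1), 16);
        return new int[] {(rgb >> 16) & 0xFF, (rgb >> 8) & 0xFF, rgb & 0xFF};
    }

    // Linear interpolation (an assumption): t = 0 gives 'from', t = 1 gives 'to'.
    static String interpolate(String from, String to, double t) {
        int[] a = parse(from);
        int[] b = parse(to);
        int[] out = new int[3];
        for (int i = 0; i < 3; i++) {
            out[i] = (int) Math.round(a[i] + (b[i] - a[i]) * t);
        }
        return String.format("#%02x%02x%02x", out[0], out[1], out[2]);
    }

    public static void main(String[] args) {
        // With the defaults, target and target.max are identical, so every step is the same colour.
        System.out.println(interpolate("#c60db8", "#c60db8", 0.3)); // #c60db8
        // A wider, made-up range produces a gradient.
        System.out.println(interpolate("#000000", "#ffffff", 0.5)); // #808080
    }
}
```

Because the default target and reference colours are the same as their .max values, a visible gradient would only appear after one end of a range is changed.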

# Graph pane text
buttons.graph.export = Export
buttons.graph.export.dot = .dot File
buttons.graph.export.img = .png Image
buttons.graph.labels = Labels
buttons.graph.run = Search
buttons.graph.layout = Layout
buttons.graph.kwic = KWIC
buttons.graph.threshold = Threshold
buttons.graph.stat = Stat

Graph pane text
The literal strings used in the GraphColl tool are given here and can be changed.

# Graph Renderer
colours.graph.node =#c60db8
colours.graph.collocate_light =#e6f7f9
colours.graph.collocate_dark =#000000
colours.graph.highlight =#ff6600
colours.graph.edge =#d1d1d1
colours.graph.text =#000000
colours.graph.shared =#ff6600
colours.graph.shared_background =#583e82
renderer.screenshot.width =7680
renderer.screenshot.height =4320
renderer.default.sphere_size =6
renderer.default.sphere_resolution =50
colours.toggle.free = #31c831
colours.toggle.hybrid = #ffc200
colours.toggle.positional = #ff0040
colours.toggle.word_class = #cd00cd

Graph renderer
The GraphColl tool has colours and colour ranges which can be changed using the colour values given here. Additionally, the size of screenshots can be changed here (this also applies to the words tool) by changing the integer values of .width and .height. The number of sides a sphere has (in all 3D tools) can be changed by altering the integer value of the .sphere_resolution setting. This can drastically speed up crowded graphs, but not all numbers will work on all computers. The size of graph spheres can also be changed using .sphere_size. This gives you even greater control than just changing the font size.

# Whelk searches
whelk.window.span = 100

Whelk searches
STTR and MATTR searches can be performed in #LancsBox. These use a window size of a number of tokens. This number can be changed by setting the value of the .span setting to a different integer.

#KWIC pane colours
colours.kwic.node =#ff6600
colours.kwic.highlight =#00a4ff
colours.kwic.highlight_not =#5e626b

KWIC pane colours
The colours used in the KWIC tool can be changed by altering the colour values of these settings.

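To show what the whelk.window.span value above controls, the following sketch computes a moving-average type-token ratio (MATTR) in its usual textbook form: the type-token ratio is calculated for every window of the given number of consecutive tokens and the results are averaged. This is an illustration of the general measure under that assumption, not #LancsBox's own code; the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class, not part of #LancsBox.
final class MattrDemo {

    // Average of (distinct tokens / window) over every window of 'window' consecutive tokens.
    static double mattr(List<String> tokens, int window) {
        if (window <= 0 || tokens.size() < window) {
            throw new IllegalArgumentException("window must be between 1 and the number of tokens");
        }
        Map<String, Integer> counts = new HashMap<>(); // token counts inside the current window
        double sum = 0.0;
        int windows = 0;
        for (int i = 0; i < tokens.size(); i++) {
            counts.merge(tokens.get(i), 1, Integer::sum);
            if (i >= window) {
                // Drop the token that has just left the window.
                counts.computeIfPresent(tokens.get(i - window), (k, v) -> v == 1 ? null : v - 1);
            }
            if (i >= window - 1) {
                sum += (double) counts.size() / window;
                windows++;
            }
        }
        return sum / windows;
    }

    public static void main(String[] args) {
        // Windows of 2 tokens: (a, a) has ratio 0.5 and (a, b) has ratio 1.0, so the average is 0.75.
        System.out.println(mattr(List.of("a", "a", "b"), 2)); // 0.75
    }
}
```

With the default span of 100, each window would cover 100 tokens; a larger span smooths the measure, while a smaller one makes it more sensitive to local repetition.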
# KWIC window size settings
kwic.left.min =3
kwic.left.def =5
kwic.left.max =20
kwic.right.min =3
kwic.right.def =5
kwic.right.max =20

KWIC window size settings
The default span settings for KWIC searches can be set here. The integer values only define the defaults; you can still change them in the program.

# POS group colours
colours.group.1 = #0080ff
colours.group.2 = #ff0080
colours.group.3 = #00cd67
colours.group.4 = #ff6500
colours.group.5 = #cd00cd
colours.group.6 = #d5ff00
colours.group.7 = #a6a6a6
colours.group.8 = #00e6e6
colours.group.9 = #ff4dff
colours.group.10 = #006200

POS group colours
The POS groups / aliases can be defined in the import options. The first ten of them will be assigned these colours when viewing lemma graphs in the word class mode.
