Key Terms in Corpus Processing



ASCII character set A group which contains the 256 symbols of the IBM-PC extended character set. This set includes all the letters for English and the major European languages (though not those written with the Cyrillic alphabet). Strictly speaking, ASCII itself defines only the codes 0 to 127; the term is used here in the looser sense which includes the extended set. The symbols from 128 upwards may be problematic in some cases, e.g. when transferring data between computers (see *7 bit).

ASCII text file A file which consists of a series of lines terminated by a hard carriage return and which contains no formatting information and can thus be read by any text editor. Most word processors have a mode for reading and writing ASCII text files, i.e. for suppressing the formatting information, page layout, page breaks, typeface attributes, etc. which they normally include in the text files they create.

backup Refers to a file which is the last version processed and kept for security purposes; a copy of an original diskette; a procedure for generating such files/copies. The program Corpus Presenter File Manager will do this quickly and easily.

batch mode An operating mode in which a program requires no input from the user apart from the initial command which actually starts the program. The information necessary for the operation of a program in this mode must be gleaned from an initialisation file which contains all the commands to perform in a specified order.

binary The adjective referring to the numerical system based on units of 2 as opposed to the decimal system, based on 10, or the hexadecimal system, based on 16.

bit A single data item which can have one of two values, 0 or 1. In hardware the two values correspond to two voltage levels: a threshold is set above which the value 1 is assigned to the bit and below which the value 0 is assigned. Bit stands for binary digit.

blank A blank space is a character for the computer, ASCII 32. This fact must be borne in mind when editing texts or entering commands. For example, it serves as a delimiter when programs parse user input. One can achieve a similar visual effect by using an underscore, for instance in file names, e.g. by writing MY_FILE.INI instead of MY FILE.INI, which would usually be interpreted as two input components, i.e. MY and FILE.INI.

bmp file (= bitmap) A graphics file format which is favoured by Windows but which results in files which are unnecessarily large. A much more compact format is found with JPG files.

browser A program which enables you to view internet files (in HTML format, possibly with Java extensions). This can happen online (via the internet) or offline (when viewing local files on your computer).

byte A series of eight bits joined together to give an ordered sequence which represents a number; the leftmost bit in a byte is thus the most significant bit and the rightmost one the least significant bit in a numerical sense. It is the basic unit of organisation in the PC. Nearly all operations on the PC refer to entire bytes; occasionally one can access individual bits.

cancel To terminate a program, to break off an action. Usually, pressing the Escape key has the function of cancelling whatever operation is pending or currently being carried out.

carriage return An ASCII code value, No. 13. It can be used in programs for the purpose of affirming the choice of an option or saving data. It also causes the head of the printer to move back to the left-hand margin of a page (this is where the element carriage comes from). In corpus linguistics the carriage return character is important as it signals the end of a line of text in a so-called *ASCII text file.

case sensitive Refers to the fact that upper and lower case letters are not treated as the same in the execution of commands or instructions. If a program interprets alphanumeric data differently according to whether it is written in capital letters (upper-case) or small letters (lower-case) then it is said to be case-sensitive. Frequently only the interpretation of certain types of data input is case-sensitive, such as search strings used in a text editor or database manager.

character 1) A term used to refer to a symbol used for screen display or print-out or both. There are 256 characters in the IBM set used on the PC. 2) A term used to designate a data type, for example in a database management system. The data type character thus refers to any symbols which are treated just as a screen display symbol and not as a symbol with arithmetical or logical value.

clipboard An area of system memory which can be used as a temporary buffer to which data can be written and from which it can be retrieved easily without exiting the program currently running.

client A term to refer to the computer and hence the individual using the internet. See server.

Cocoa A format for parameters concerning the contents of a text file in a linguistic corpus. It has been used widely, for instance to classify the contents of the files in The Helsinki Corpus of English Texts, which uses 26 of these parameters. Corpus processing software such as the Oxford Concordance Program and Corpus Presenter can access information in the Cocoa header, placed at the beginning of each file, and assess settings found there during retrieval.

collation A process in the critical editing of texts whereby different versions are compared by special software, highlighting differences and thus facilitating the preparation of a definitive edition of a text.

concordance A file generated by special software in which the words of the input text are highlighted, frequently by centring these on each line as a so-called KWIC (keyword in context), or by extracting these and placing them in a column on the left of each line as a so-called KWOC (keyword out of context). A concordance may also contain statistical information on the input file or files.
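The KWIC layout described under concordance can be illustrated with a minimal Python sketch. It is not the method used by Corpus Presenter or the Oxford Concordance Program; the function name kwic, the width parameter and the sample sentence are assumptions made purely for illustration.

```python
import re

def kwic(text, keyword, width=30):
    """Return one line per hit, with the keyword centred between
    fixed-width left and right contexts (hypothetical helper)."""
    lines = []
    # Match the keyword as a whole word, ignoring case.
    pattern = r"\b" + re.escape(keyword) + r"\b"
    for m in re.finditer(pattern, text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so that all keywords line up.
        lines.append(f"{left:>{width}}[{m.group(0)}]{right:<{width}}")
    return lines

sample = "The cat sat on the mat. The dog saw the cat run."
for line in kwic(sample, "cat", width=12):
    print(line)
```

Because the left context is padded to a fixed width, every keyword starts in the same column, which is what makes a KWIC listing easy to scan.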

cookie A small file which is written to a local computer by a website. This file can be used to determine, for instance, how often a website has been visited.

corpus Any body of coherent data collected according to criteria laid down in advance. There are many types of corpora of which linguistic corpora are a subset. For the latter a corpus could be a selection of a language variety at some particular chronological stage, e.g. the English of the United States in 1961 (taken as the basis for the Brown Corpus). In some cases several stages may be incorporated as with The Helsinki Corpus of English Texts. Genres and styles are often criteria for the compilation of a corpus.

current directory The directory on disk which is used when loading or saving files, unless you specify another one. Most programs have the option of determining this by an internal command. The current directory is not normally the directory in which the program itself is to be found.

customised Created, altered or set up by the user. A customised version of a program is thus one which has been tailored by the user to his/her personal needs. A customised screen layout, for example, is one which has been created or altered by a user to suit personal requirements.

data directory A directory in which data is stored. Normally you move to this directory to process some file(s) in it. The programs used to process data are not kept in the same directory as the data. See *current directory.

data input The means by which data is passed to the computer. There are various possibilities here: 1) directly from the keyboard, 2) from another file which is read for this purpose, 3) from the internet by downloading data, etc.

data structure The way data is encoded by a computer. Data structure is not usually the concern of the user. However, in certain cases, e.g. when the user tries to read a text file produced by a different word processor, he or she may become aware of the fact that similar screen display does not necessarily mean identical data structure for the computer.

data transfer A procedure whereby data is moved from one type of processing environment to another. With the Corpus Presenter suite it is possible to extract data from a corpus and deposit this in a database which can in its turn be processed, e.g. in lexical analysis. Data can also be retransferred from a database to a text if necessary. Textual data can also be stored in Rich Text Format which allows for the transfer of formatting features across different platforms, e.g. between operating systems.

database A file type in which data is arranged in a structure which consists of fields, each with a specified name, length and type. The entire database consists of a number of such structures which are termed records. In addition the database has a header at the beginning in which information on its structure is contained. This type of organisation facilitates the accurate retrieval of information and the selective extraction of data.

data set A small text file, used by the program Corpus Presenter, which contains a list of the files to be displayed by the program and other items of information, such as the files to be associated with the nodes of the tree with which a corpus is presented. A data set can be created and edited with the supplied utility Corpus Presenter Make Tree. Alternatively, it can be generated on the directory level of Corpus Presenter by choosing the option Make data set.

dedicated software Any type of software which is designed for a specific purpose. A dedicated word processor is thus a program which is designed as a word processor and not, say, as a module within a spreadsheet package.

desktop A figurative term used to describe the area at the user interface where one arrives on starting one's computer and from which one can undertake various operations.

directory A division on a disk which has a name and usually contains several files which belong together. Directories are arranged hierarchically in tree form (with branches and sub-branches) and can be displayed in this form on a certain program level like that found in Corpus Presenter when you are choosing a file to load. In Windows the term folder is often used for directory. There is a separate level for directory listings in the programs of the Corpus Presenter suite.

display Refers to representation of data on the computer screen.

download The action of loading files from a remote computer to a local one, e.g. when working with the internet. The term is also used for sending fonts from a computer to a printer.

Escape-key A key at the upper left-hand corner of most keyboards. It normally has the function of backing out of a command at the last moment (hence the name). It is also used in a quasi-standard fashion for retracing one level back through a program to a previous level.

export To send data from one program to another program of a different type. Because any one item of application software cannot normally read files created with another, special export and import facilities have to be offered to allow the user to continue processing files started in another program.

extension The letters (traditionally three) after the last dot in a file name. There are typical extensions for text files, such as .TXT or .RTF. Application programs often use characteristic extensions for the files they create, e.g. .HTM for internet files (also found in four-letter form as .HTML).

FAQs (= frequently asked questions) In general a text file, supplied with a piece of software, which answers the questions its developers expect to arise with end-users.

file A sequence of bytes produced by a program, stored on disk and which can normally be retrieved by a program for editing. A file has a beginning and an end. An entry in the directory tells the user that the file exists and an entry in the file allocation table of the disk tells the system how the file is physically distributed over the disk medium. In the narrower sense a file is always the product of a program (word processor, spreadsheet, etc.). In a more general sense it is any coherent sequence of bytes with a beginning and an end and a name in the directory of the disk, e.g. it can be a program file.

file list A list of file names which is used as input for the operation of some program. Alternatively, a *file template can be used, but there are occasions when the set of files to be operated on cannot be captured satisfactorily by a template, hence the use of a file list.

file manager Either a part of a program (as with Corpus Presenter) or an independent program which serves the function of file handling, i.e. listing, copying, moving, renaming, erasing, etc. A file manager is not normally used for processing files. There is a supplied file manager in the current program suite: Corpus Presenter File Manager.

file template This is a specification which is passed to the operating system for matching or searching. For instance, entering *.TXT will list all files which have the extension .TXT. There are two legal wildcards: the question mark ?, which stands for a single non-specified character, and the asterisk *, which stands for any number of characters (including none), e.g. *.TXT, LETTER?.DOC. The template *.* encompasses all files.

folder See *directory.

font A set of symbols which are available as a group and which share characteristics such as character set (ASCII or ANSI, for instance), basic shape (of which there are many, Times Roman and Helvetica being two common types), size (specified in points or characters per inch), weight (bold or plain), direction (upright or right-slanting, i.e. italics), etc.

formatting The act of specifying information about screen attributes (italics, boldface, etc.) and page layout (left and right margins, spacing, length of page) for a particular document in a program (usually a word processor).

formatting data The data which indicates how a document has been formatted with a particular program and which is responsible for such matters as page layout and fonts. This data can usually be removed from a document to yield a plain text file (or ASCII file) which can then be read by any other program without being converted in advance. See *RTF.

front end An impressionistic term for an interface between the user and some software which processes data in the background.

function key One of the twelve keys positioned as a row on the top of the PC keyboard. These keys are non-alphanumeric keys which are assigned command values in programs. Precisely what values they have depends on what program one is using. There is little or no standardisation in this area apart from the use of F1 for online help.

gigabyte A unit of measurement, typically referring to the capacity of a hard disk in a computer and totalling roughly one billion (one thousand million) bytes. In the binary sense a gigabyte is precisely 1,073,741,824 bytes, i.e. 1024 * 1024 * 1024. Abbreviated to GB.

gif file (= graphics interchange format) A graphics file format which is very common on the internet. Gif files can also be animated, i.e. appear like a small cartoon.

homepage An internet address at which a set of data is to be found which is associated with an individual, institution, company, etc.

HTML (= Hypertext Markup Language) A markup language which specifies how text and graphics are to be displayed. This language has enjoyed an enormous increase in popularity because it is the standard used for internet files. There are several versions of HTML, as well as related markup languages such as XML; both derive from the superordinate form SGML (= standard generalised markup language).

ico file (= icon) A graphics file format which is used for very small images which serve as a hint to what a certain software function actually does. For instance, the function of closing a program often has an associated icon showing a door leading out of a room.

import To take data into one program which has been processed in another, e.g. when transferring data between word processors. There is usually a conversion facility for allowing this importing to take place. Alternatively, a format can be used which can be read by both programs, e.g. the *Rich Text Format.

information retrieval An area of data processing which is concerned with the swift and accurate finding of information in large bodies of data. Information retrieval is thus a central area of corpus processing and catered for by a variety of software types.

INI-file This is a small ASCII text file which is read by one of the major programs of the Corpus Presenter suite on starting. The values for many parameters of these programs are gained from such a file and used for the subsequent work session. Note that an initialisation file must match the program which consults it exactly. If not, the program may behave erratically.

interface The common boundary between two devices, systems, subsystems, or the user and a system.

Java A programming language developed by Sun Microsystems which has been used to provide additional functionality to files distributed via the internet. It is not an extension to HTML but a separate programming language which requires additional software for its code to be executed. Where a Java interpreter is preinstalled, users are not aware of it as a separate item of software.

JavaScript A scripting language, unrelated to Java despite the name, whose code is included in an internet file to perform some function not covered by HTML.

JPG file A graphics file format commonly used as a replacement for bitmap files as it requires only a fraction of the space for storing the same information.

key combination A key setting which consists of more than one key, usually two. It is realised by pressing a first key, a so-called status key, followed by another key (an alphabetic or function key) which is struck while the first key is still held down (this fact is essential to the functioning of a key combination). The status keys of the PC keyboard are Shift, Ctrl and Alt, and typical key combinations would be Shift-F1, Ctrl-F, Alt-0.

keyness This term has gained a specific meaning in recent years by which it refers to the extent to which a text or texts show a specific stylistic profile when compared with another set of texts. Specifically, the term has been used when analysing the lexical profile of an author or authors in direct comparison with that of another group. By this means it has been possible to show how the style of an author or authors is distinctive vis-à-vis others of his/her time. This distinctiveness is the keyness of the texts by the individual(s) in question. Keyness is quantified by measuring the positive or negative difference between the lexis of the author(s) and that of a reference corpus with which the former are compared.

lemma A term in linguistic data processing which refers to the label used to link a set of word forms together. It is the equivalent of a lexeme in linguistics and represents a formal manner of identifying word forms as belonging to a given lexeme.

lemmatise The act of assigning word forms to a given lemma, usually by a process of tagging. This can be done with Corpus Presenter Text Tool.

lexical cluster analysis A retrieval technique which takes any number of words, usually between two and eight, and scans an entire text returning each consecutive set of words starting from the first and moving towards the end. By this means one can determine if certain collocations, i.e. lexical clusters, occur in a text and hence analyse an author's style.

lexical density A statistic on the relative frequency of word forms in an input text file or files.

lower ASCII area The symbols from 0 to 127. They consist of control codes (0-31), printable characters (32-126), i.e. the blank, letters, digits and punctuation, and the delete character (127). There is a high degree of standardisation in the use of these characters as opposed to that of the upper ASCII area. Systems which can only process this set of characters (i.e. not the upper area as well) are referred to as 7-bit systems as the eighth bit cannot be set to identify another 128 characters.

menu bar A row on the top of the screen which contains the names of certain command groups. These can be activated via the keyboard or the mouse, whereupon a pull-down menu appears which contains the individual commands of a group.

menu-driven Refers to one of two main kinds of command structure. With this type the user is presented with a window which contains a list of options; by moving the highlight bar to the option he or she wishes to choose and pressing Return, he or she activates the particular function. The second type of command structure is one where dedicated keystrokes activate commands, e.g. a function key or a combination of the Ctrl or Alt key with an alphanumeric key. In the Corpus Presenter suite both command structures are available simultaneously.

node A point in a tree at which a branch begins. Trees with nodes on different levels are used frequently in software to display information in a hierarchical fashion, making it easier to grasp the organisation of the information.

normalisation A procedure whereby variant spellings in any set of texts are replaced by a standardised, i.e. normalised, spelling. The advantage of this is that retrieval tasks can be carried out more quickly and possibly more accurately, assuming that normalisation is done in a structured and organised manner. This can be achieved by Corpus Presenter Text Tool.

ocr (= optical character recognition) A reference to software and hardware which is used to transfer information on a printed page (typically text) to digital form which can be processed by a computer. Professional OCR software can recognise well over 90% of text of good printed quality and thus can be very useful in the initial phase of the compilation of a corpus.

offline Not connected to a network.

online Connected to a network, i.e. when actively linked to the internet.

operating system A set of program files which are loaded by the computer on starting and which represent the software part of that information which the computer requires to be able to function in the first place, e.g. to load other programs, to read and save files, to deal with the keyboard and the screen, etc.

PC An abbreviation for personal computer. Now it has come to mean any computer which derives its architecture from the original Intel 8088 processor (developed in the late 1970s) and all that this implies in terms of hardware and software.

pdf file (= portable document format) A format, developed by the American software firm Adobe for their product Acrobat, which is intended to specify in a hardware-independent manner how complex texts are to be encoded so as to be readable on any computer supporting the format.

portable A reference to the degree to which a program or files can be transferred from one type of computer to another or from one operating system to another without too much disruption or loss of information and/or functionality. For corpora the question of portability is important as the goal is to allow their use on different platforms (computer systems). To ensure portability of texts, they should only include characters with numerical values above (and including) 32 and below 128.

provider A term referring to a computer service which provides the link between individual customers and the internet. The provider is the firm or institution which enables you to access the internet and to which you log on when starting an internet session.

protocol A set of conventions which regulates the exchange of data across communication lines.

retrieval An operation which involves finding information for a program or directly for the user. The flexibility and scope of retrieval software is a major yardstick for judging computer systems.

Return-key The most commonly used key on the keyboard. Its function during word processing is to enter hard carriage returns (paragraph breaks) into text; it has a further equally important function as the key denoting acceptance of a command suggestion or terminating an entry. The synonymous term Enter key is also used.
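The portability rule given under portable above (only characters with values from 32 up to, but not including, 128) can be checked mechanically. The following Python sketch is an illustration only: the function name non_portable_chars is hypothetical, and treating carriage return, line feed and tab as acceptable is an assumption, made because text files necessarily contain line terminators.

```python
def non_portable_chars(text, allowed_controls="\r\n\t"):
    """List (position, character, code) for every character that falls
    outside the range 32..127 and is not an allowed control character."""
    return [
        (i, ch, ord(ch))
        for i, ch in enumerate(text)
        if not (32 <= ord(ch) < 128) and ch not in allowed_controls
    ]

# An accented letter lies in the upper area and is therefore reported.
for position, ch, code in non_portable_chars("caf\u00e9 au lait\r\n"):
    print(position, repr(ch), code)
```

A corpus compiler could run such a check over every file before distribution and then decide how to transliterate or encode whatever it reports.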

root directory The main directory of a disk; the top directory below which all others are arranged. It is displayed in the operating system as a single backslash, \.

RTF A specification for text layout and formatting which has been developed to allow texts to be interchangeable between programs and environments. The contents of databases can be exported from Corpus Presenter Database Manager as RTF files, which means that formatting is carried over to word processors, e.g. Corpus Presenter Word Processor or Microsoft Word, used to process these text files afterwards. RTF stands for Rich Text Format.

sentence A syntactic unit not recognised by computers; contrast this with the line, which is the basic (unstructured) unit of the text file. Any attempt to define the sentence for corpus processing purposes must rest on formal criteria. This can be done in Corpus Presenter by specifying sentence delimiters.

server A term to refer to the provider of internet services. See client.

setup A number of parameters which are specified either on starting the computer or on starting a program and which reflect choices for a set of options. Setup parameters can often be determined by an initialisation file.

software Any data or programs which are only available in electronic form and which need to be read into system memory before being processed or run.

sorting An action whereby a series of items, records in a database or lines in a text file, are sorted in a specified order, e.g. alphabetically ascending or descending, in a case-sensitive manner or not.

spreadsheet A type of software in which data is arranged in table form, i.e. as rows and columns. It is closely related to the database because data must show an internal structure as opposed to texts.

stop words Very common words, information on which is not required in the linguistic analysis of a corpus. To exclude these and thus speed up processing considerably one passes a list of them to the processing software, the latter then ignoring the words in this list.

string Any sequence of bytes. A string is frequently used in contradistinction to a word, which is formally defined for corpus processing software as being delimited by certain characters. Searches usually refer to either strings or words.

subdirectory On a disk, a directory which is located below the root directory. On hard disks there are usually many subdirectories which are arranged in a hierarchy which is intended to reflect the data which is kept in them.

system memory See *random access memory.
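The use of a stop word list, as described under stop words above, can be sketched in Python as follows. The list itself, the function name content_word_counts and the simple letter-based definition of a word are all assumptions chosen for illustration; real corpus software would take its list from the user.

```python
import re
from collections import Counter

# A deliberately tiny, hypothetical stop word list.
STOP_WORDS = {"the", "a", "of", "and", "in", "to"}

def content_word_counts(text, stop_words=STOP_WORDS):
    """Count word frequencies, ignoring every word in the stop list."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in stop_words)

counts = content_word_counts(
    "The history of the corpus and the design of the corpus"
)
print(counts.most_common())
```

Passing the list as a parameter keeps the filter independent of any one language or research question.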

table An alternative name for a database deriving from the frequent practice of displaying contents in the form of rows and columns, i.e. like a table.

tagging A procedure in corpus linguistics whereby a label is attached to word forms in a set of texts. The goal of this is to allow words which belong to grammatical categories to be formally marked in a similar manner and thus to allow their quick and accurate retrieval with appropriate software afterwards.

template 1) A skeleton file which has certain parameters set but does not contain any data. 2) A specification containing one or more wildcards which the operating system then attempts to match with actual files, e.g. *.*, MY???.TXT.

temporary file Any file which is created by a program for some transitory purpose but deleted later. There are typical extensions for temporary files so that users can recognise these, e.g. .TMP or .$$$. Users should never create files with these extensions as many programs delete such files without asking or even informing users.

text editor A type of editor which deliberately makes no provision for formatting data (page layout, character attributes, etc.) as the files you edit with such a program are not primarily intended for printing. The output of a plain text editor is always ASCII text. Some more flexible text editors allow for both plain ASCII text and formatted text (for printing). This is the case with Corpus Presenter Text Editor.

text file Any file which consists of a series of lines each of which is terminated by two special characters, carriage return (ASCII 13) and line feed (ASCII 10). A text file is often termed an *ASCII file or a DOS file. In all cases there is no formatting information included which cannot be treated as simple text or which can only be read by a specific word processor.

toggle Any key which inverts a state, i.e. turning something on if it was off and off if it was on. Many commands in the Corpus Presenter suite are toggles.

tree display A manner of display which is generally regarded as intuitively easy to grasp as it shows data in a hierarchy realised by nodes and branches arranged in a descending order.

Unicode A means of encoding characters for processing on personal computers. The extended ASCII character set uses just one byte per character and so can only encode 256 characters. The Unicode set, on the other hand, was originally designed around two bytes per character (and has since been extended further) and can thus encode a much greater number of characters, dealing easily with writing systems not based on the Latin alphabet, such as those for Russian, Arabic, Korean, Chinese or Japanese.

unique word list A list generated by examining corpus files and which contains one and only one instance of each word. Such lists can be generated with Corpus Presenter and Corpus Presenter Text Tool.

upper-case Lettering which is entirely in capitals.
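A unique word list of the kind described above can be approximated in a few lines of Python. This is a sketch under stated assumptions, not the procedure used by Corpus Presenter: the function name unique_word_list is hypothetical, and a word is taken here to be any unbroken run of the letters A to Z.

```python
import re

def unique_word_list(texts, case_sensitive=False):
    """Return a sorted list containing each word of the input texts once."""
    seen = set()
    for text in texts:
        if not case_sensitive:
            text = text.lower()
        # Assumption: a word is an unbroken run of unaccented letters.
        seen.update(re.findall(r"[A-Za-z]+", text))
    return sorted(seen)

print(unique_word_list(["The cat sat.", "the Cat ran"]))
```

Whether upper- and lower-case forms count as the same word is left to the caller, in line with the entry on case sensitivity.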

upper ASCII area The area of ASCII characters from 128 to 255.

update A recent version of a program or any set of data. Program updates generally ensure that data written with previous versions is still readable, i.e. the later versions of the program in question are backwards compatible with the earlier ones.

upload The process of transferring data away from a local computer to a server as when one loads the data of a webpage up to a provider.

URL An acronym for uniform resource locator. This is the technical term for an internet address. Such addresses normally begin with http://www. and continue with the details of the location in question. The last element (after the last dot) is a reference to a country, e.g. .ie = Ireland, .uk = United Kingdom, or a superordinate set of internet addresses, e.g. .com, .org or .net.

USB An acronym for universal serial bus. A hardware interface which is simple in design and intended for a wide variety of devices. The USB interface is also quite fast, for instance outstripping the parallel port used for printing on personal computers. Version 2.0 superseded version 1.1.

user interface A term which refers to the way a program presents itself to a user, what it looks like on the screen, the commands it puts at his/her disposal, or the level at which he or she can communicate with the program.

user-specified Any aspect of a program which the user can decide on him/herself; frequently the opposite of system-determined.

white space A general term for blanks and tabs, i.e. characters which you cannot see.

wildcards and jokers Synonymous terms for symbols used to leave information unspecified which can then be matched by the system, for instance, when searching for files. For the operating system there are two wildcards, ? and *. The first is used to leave one character unspecified and the latter to leave any number of characters unspecified. This is very useful if one wants to increase the scope of a command to include any files which match a certain file template with a wildcard, e.g. entering DIR *.TXT will result in a listing of all files with the extension .TXT, irrespective of what comes before it. An example of the question mark wildcard would be COPY LETTER?.DOC A:\ which would copy to drive A: all the files whose names begin with LETTER, have one further character, e.g. a number, and carry the extension .DOC.

window A framed section of screen display, typically used by a program to show data or convey information to users. The main window is that for the program itself (and may cover the entire screen) with other windows containing sets of data such as texts loaded with a word processor.

word A particular type of string which is delimited by certain characters such as spaces, punctuation, beginning or end of a line, etc. For corpus processing software there is no way of defining a word on semantic or morphological grounds.
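The behaviour of the two wildcards described under wildcards and jokers can be tried out with Python's standard fnmatch module, whose ? and * have the same basic meaning. The sketch below is an approximation rather than a reimplementation of the operating system's matching: the function name match_template and the file names are invented for the example, and folding names to upper case is an assumption made to mimic case-insensitive file names.

```python
from fnmatch import fnmatchcase

def match_template(names, template):
    """Return the names which match a template containing ? and *."""
    # Compare in upper case so that matching ignores case.
    return [n for n in names if fnmatchcase(n.upper(), template.upper())]

files = ["LETTER1.DOC", "LETTER2.DOC", "LETTER10.DOC", "NOTES.TXT", "readme.txt"]
print(match_template(files, "*.TXT"))        # all files with the extension .TXT
print(match_template(files, "LETTER?.DOC"))  # exactly one character after LETTER
```

Note that LETTER10.DOC is not matched by LETTER?.DOC, because the question mark stands for exactly one character.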

word processor A type of editor whose prime aim is to produce neatly printed text. Such programs use a lot of additional information (apart from actual text) to control the output on the printer. A consequence of this latter fact is that files produced with one word processor cannot be read automatically by another unless they are first converted to ASCII text (with the loss of formatting information). An alternative, which saves the formatting, is to store files in the Rich Text Format (RTF) or Hypertext Markup Language (HTML) format.

website A location on the internet used to display data which stem from one source such as a university, a company or in some cases a single individual. A website is accessed via an address, often with the form http://www.address.organisation.country.

zap The operation of removing something totally, e.g. all the records of a database or the contents of all fields in a record.
