Structure and function in retrieval

Alan Gilchrist (Cura Consortium and TFPL Ltd, Brighton, UK)

Journal of Documentation

ISSN: 0022-0418

Article publication date: 1 January 2006


Abstract

Purpose

This paper forms part of the series “60 years of the best in information research”, marking the 60th anniversary of the Journal of Documentation. It aims to review the influence of Brian Vickery's 1971 paper, “Structure and function in retrieval languages”. The paper is not an update of Vickery's work, but a comment on a greatly changed environment, in which his analysis still has much validity.

Design/methodology/approach

A commentary on selected literature illustrates the continuing relevance of Vickery's ideas.

Findings

Generic survey and specific reference are still the main functions of retrieval languages, with minor functional additions such as relevance ranking. New structures are becoming increasingly significant, through developments such as XML. Future developments in artificial intelligence hold out further prospects still.

Originality/value

The paper shows the continuing relevance of “traditional” ideas of information science from the 1960s and 1970s.


Citation

Gilchrist, A. (2006), "Structure and function in retrieval", Journal of Documentation, Vol. 62 No. 1, pp. 21-29. https://doi.org/10.1108/00220410610642020

Publisher: Emerald Group Publishing Limited

Copyright © 2006, Emerald Group Publishing Limited


Introduction

This brief paper is one of a series to mark the 60th anniversary of the Journal of Documentation by harking back to an earlier paper in the journal. The paper chosen here is “Structure and function in retrieval languages”, written by B.C. Vickery and published in 1971 (Vickery, 1971). It offered a useful typology of retrieval languages, appearing four years after the Cranfield experiments (Cleverdon, 1967), which had given rise to many debates on the different approaches used by different information retrieval systems. Vickery's argument was that retrieval languages served many purposes, and that it was necessary to identify the structures best suited to each purpose. Structure and function are still important issues, but the digital revolution, starting with full text searching, has had some obvious (and some perhaps less obvious) effects on retrieval, all of them profound. This paper is not intended to be an update of Vickery's, but a comment on a vastly changed world in which his analysis still has much validity.

Vickery suggested that:

… initially, two functions were recognized: to identify, and hence to select items on a specific topic (the task of the alphabetic index); to group in proximity items on similar subjects (the task of classification). The two functions (specific reference and generic survey) seemed opposed.

But while Ranganathan had “urged the provision of both by a helpful symbiosis between classified and alphabetic arrangement”, the structures of both types of language began to grow more complex. Vickery identified the structural elements in his paper “as recall devices: confounding of true synonyms, of near synonyms and of word forms, and use of more generic terms; and as precision devices: co‐ordination of terms, interlocking of terms in a topic, expressing relations between terms in a topic, and weighting index terms. Hierarchical linkage of generic and specific terms aids both recall and precision.” He then tentatively (his word) summed up his discussion on structure and function in Table I.
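Vickery's distinction between recall devices and precision devices rests on the two standard retrieval measures. A minimal sketch, using invented data, makes the trade-off concrete:

```python
# Illustrative only: recall and precision for a toy retrieval run.
# A recall device (e.g. confounding synonyms) tends to enlarge the retrieved
# set; a precision device (e.g. co-ordination of terms) tends to shrink it.
relevant = {1, 2, 3, 4}      # documents actually relevant to the query
retrieved = {3, 4, 5}        # documents the system returned

hits = relevant & retrieved
recall = len(hits) / len(relevant)       # 2/4 = 0.5
precision = len(hits) / len(retrieved)   # 2/3, about 0.67

print(recall, round(precision, 2))       # 0.5 0.67
```

The document identifiers and set sizes here are arbitrary; only the definitions are standard.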

It is clear, in this paper, that the computer had arrived and that new techniques were being developed. Vickery had already taken a focused look at the implications of the change from manual to computerised systems in an earlier paper in this journal (Vickery, 1968) which was devoted to an examination of bibliographic records. The abstract read:

The design of bibliographic records for computer input is explored. The elements of a record provide bibliographic description, serve as retrieval keys, facilitate ordered filing, and indicate locations. The effect of each of these functions on the form of a record is discussed. Problems are raised that must be resolved before an optimal record can be designed.

This recognition of the power of what are now called metadata was reinforced by Vickery's statement: “A machine record is not simply a different physical means of recording a traditional bibliographic entry, for use in a traditional way”. (This was illustrated by two figures, the first, looking quaint in 2005, a 5 × 3 catalogue card, and the second a NASA Master File Record.) He continued:

The only reason for creating it, in fact, is that the machine offers us the possibility of using it in ways hitherto untried because of their expense. The record becomes wide open to far more flexible ways of manipulation.

Indeed; and not only have computers become less expensive, they have also become far more powerful. But Vickery could not have foreseen that, with widespread disintermediation, the end user is not only left to search for himself or herself, but is asked to index the documents he or she has generated. This is a very real problem that calls for action in the area of information literacy.

Search strategies have attracted a large number of studies (see, for example, Bates, 2002; Järvelin and Wilson, 2003). In a less academic vein, Koll (2000) has put forward a list of possible search types (functions), saying that:

Finding a needle in a haystack can mean:

  • A known needle in a known haystack

  • A known needle in an unknown haystack

  • An unknown needle in an unknown haystack

  • Any needle in a haystack

  • The sharpest needle in a haystack

  • Most of the sharpest needles in a haystack

  • All the needles in a haystack

  • Affirmation of no needles in the haystack

  • Things like needles in any haystack

  • Let me know whenever a new needle turns up

  • Where are the haystacks

  • Needles, haystacks – whatever.

The digital revolution

The first obvious result of the digital revolution is, of course, information overload. Text, data, images and voice can now all be reduced to binary code, transmitted and stored cheaply and in enormous quantities. In addition, traditional barriers between different repositories, different departments and even organisations are being removed or, at least, being rendered permeable. In principle, all of this information can now be accessed from the desktop.

At the macro level, Lyman and Varian at the University of California, Berkeley conduct an annual survey of world‐wide information production (SIMS, 2003), producing astronomical figures. They estimate that in 2002 print, film, magnetic and optical storage media produced about five exabytes of new information. Of the new information, 92 per cent was stored on magnetic media, mostly in hard disks. (A kilobyte is 10³ bytes and an exabyte is 10¹⁸ bytes; five exabytes would be equivalent to 37,000 Library of Congress collections, each of 17 million books.) The figure for 2002 has doubled over the past three years. The internet has tripled over the past three years, and now contains 167 terabytes in the surface web and an estimated 91,850 terabytes in the deep web (a terabyte is 10¹² bytes). Emails account for 440,606 terabytes, 31 billion being sent every day.
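As a sanity check on these figures: the Berkeley study valued a Library of Congress print collection at roughly 136 terabytes (that figure is the study's, and is an assumption here), which squares with the equivalence quoted above:

```python
# Rough check: 5 exabytes against "37,000 Library of Congress collections",
# assuming ~136 TB per collection (the Berkeley study's own estimate).
EXABYTE = 10**18
TERABYTE = 10**12

collections = (5 * EXABYTE) / (136 * TERABYTE)
print(round(collections))   # ~36,765, i.e. close to the quoted 37,000
```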

Another important barrier to be affected is that between different information professionals, and while some notable advances are being made the convergence may not be completed for some time. Mindsets tend to be too rigid, even with training through continuing professional development, which itself tends to concentrate on building on traditional skills rather than “poaching” on other professional domains. However, the digital revolution will force, for example, librarians, records managers and webmasters to work much more closely together. And it is here that structure and function in retrieval systems will need to be better understood and defined if there is to be greater cohesion and interoperability between different areas of the organisation.

Metadata

The most widely heard definition of metadata is the unhelpful “data about data”. It has even been claimed that the text of a document being hunted is itself metadata, on the grounds that, in a full text search, a string of words can be regarded as an attribute of the document. A more acceptable definition in the area of information management is provided by the United Kingdom Office for Library and Information Networking (UKOLN):

Metadata is data which describes attributes of a resource. Typically, it supports a number of functions: location, discovery, documentation, evaluation, selection and others. These activities may be carried out by human end‐users, or their (human or automated) agents (quoted in an extensive list of comparable definitions from the Visual Resources Association, 2005).

Perhaps the best known metadata schema in the text world is the Dublin Core, created by OCLC and named after its headquarters town of Dublin, Ohio. It has been adopted and adapted by the UK Government, through the Office of the e‐Envoy (now the e‐Government Unit) in the Cabinet Office, as part of the e‐GMS programme: a mandatory standard for all public bodies embracing both the Dublin Core and, initially, the Government Category List (GCL), the latter now superseded by the Integrated Public Sector Vocabulary (IPSV). The purpose of e‐GMS is to facilitate the location of public bodies likely to hold the answers to specific questions. All public bodies are obliged to use part, or all, of the e‐GMS version of the Dublin Core in processing documents for publication on the internet, and to classify each document with at least one category taken from the GCL. The e‐GMS contains 25 elements, some of which are subdivided into refinements, while each element or refinement has attributes and each attribute has values. For example, the subject element is subdivided into the attributes: category (mandatory, with the GCL providing the values), keyword (organisations being invited to map their local schemes, often a thesaurus, onto the GCL), person, process identifier, programme, and project. Other elements are also accompanied by notes and recommendations regarding encoding schemes (such as the GCL). The Government Category List itself is a high‐level classification containing over 1,000 categories arranged at three levels and covering the interests of central government. Recently, an initiative of the Office of the Deputy Prime Minister has resulted in the merging of the GCL with the similar Local Government Category List (LGCL), a scheme more slanted to the interests of local government.
The result is the Integrated Public Sector Vocabulary, which is a larger scheme and tending towards a thesaurus format rather than a classification, albeit one that is clearly structured. At the time of writing it is not entirely clear how this will be used with the Dublin Core standard: is it a move from generic survey to specific reference? Both these schemes can be accessed through the GovTalk web site (Govtalk, 2005a).
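What such a record looks like in practice can be sketched with the standard library. This is an illustration only: the title, creator and category values are invented, and the representation of the mandatory subject category is modelled loosely on the e‐GMS description above rather than taken from the standard itself.

```python
# Illustrative sketch of a minimal Dublin-Core-style record of the kind
# e-GMS mandates. Element names follow the Dublin Core namespace; the
# "scheme" attribute marking the controlled vocabulary is an assumption.
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC)

record = ET.Element("record")
ET.SubElement(record, f"{{{DC}}}title").text = "Planning a school admission appeal"
ET.SubElement(record, f"{{{DC}}}creator").text = "Example Borough Council"
# The mandatory category: its value would be drawn from the GCL (now IPSV).
subject = ET.SubElement(record, f"{{{DC}}}subject", scheme="GCL")
subject.text = "Schools"

print(ET.tostring(record, encoding="unicode"))
```

The point of the exercise is only that each element carries a named attribute whose values may be tied to a controlled scheme, which is precisely where generic survey (the category) and specific reference (the keyword) meet.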

Meanwhile, The National Archives (TNA) has produced its own version of the Dublin Core, aimed at records managers, and there is an interesting version that has been circulated for consultation (Govtalk, 2005b). This version has one or two divergences from that produced by the Office of the e‐Envoy. First, the subject element is not mandatory; second, TNA has introduced a new element called function. This raises some interesting questions regarding the relationship between document management and records management. A record can contain one or more documents, the purpose of records management being to keep the chronological stories of transactions and other activities, in order to meet the demands of accountability and to impose retention and disposal schedules. This is effected by use of a file plan supported by a “business classification”. Traditionally, this was based on the structure F‐A‐T, standing for function – activity – transaction, but TNA has produced a useful document reviewing alternative structures (TNA, 2005). One of these alternatives acknowledges that the traditional records management classification is not effective in locating information content at the document level because, in the language of this paper, it is a structure created for a different function. Some have also attempted to superimpose other procedural instructions on the business classification, such as following the steps in programme and project management laid down by schemes such as PRINCE2. It will be interesting to see whether the records management community manages to combine the generic survey and specific reference functions in its work. It might also be observed that one could view an attribute of a document as belonging to a record, in which case metatagging could start at the document level.
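The F‐A‐T structure can be pictured as a three‐level hierarchy with a disposal rule attached at the transaction level. The functions, activities and retention periods below are invented for illustration and are not TNA's actual scheme:

```python
# A toy function-activity-transaction (F-A-T) business classification,
# with a retention period attached at the transaction level.
file_plan = {
    "Human Resources": {                                  # function
        "Recruitment": {                                  # activity
            "Interview records": {"retention_years": 1},  # transactions
            "Contracts of employment": {"retention_years": 6},
        },
    },
}

def retention(function, activity, transaction):
    """Look up the retention period for a record filed at F/A/T."""
    return file_plan[function][activity][transaction]["retention_years"]

print(retention("Human Resources", "Recruitment", "Contracts of employment"))  # 6
```

Notice that nothing in this structure describes the subject content of the documents inside a transaction, which is exactly the limitation acknowledged above: it is a structure built for accountability, not for document-level retrieval.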

Much has been written about the power behind the use of metadata, but there is a snag. Despite the availability of detailed metadata schemas, such as the Dublin Core, the bibliographic record presented as a screen template is a shadow of the computer entry records of the 1960s, and even at this simplified level many find the burden of metatagging too much. As Aitchison and Dextre Clarke (2004) say:

We face a paradox. Ostensibly, the need and the opportunity to apply thesauri to information retrieval are greater than ever before. On the other hand, users resist most efforts to persuade them to apply one.

Structured vocabularies

When Vickery wrote his article, the debate was largely between classification and the thesaurus, with the observation that the “classificatory structure was explored and developed by Ranganathan and other advocates of the faceted approach”. One advocate, Jean Aitchison, pioneered this approach with the Thesaurofacet (Aitchison et al., 1969), which combined a faceted classification and a thesaurus, with a one‐to‐one correspondence between facets and descriptors. Not long after publication of this scheme, full text searching arrived on the scene, and the process of disintermediation got under way. The vendors of search engines proclaimed that the end user was now liberated, free to conduct his or her own searches using the battery of Boolean operators, word stemming and other devices made available. It was, however, suggested that a thesaurus (or possibly a small collection of thesauri) might still be useful as an aide‐mémoire in thinking of useful search terms. The next development was the “search thesaurus” which, looking back to Roget, might be little more than a collection of synonyms. The result of full text searching and search thesauri was, of course, to increase recall. Users became frustrated and bewildered by devices such as Boolean operators, so that single word searches became the norm and the frustration increased.
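The recall-raising effect of a search thesaurus is easy to demonstrate. The documents, the thesaurus entry and the crude plural stemming below are all invented for illustration:

```python
# How a "search thesaurus" raises recall: expanding a query term with
# confounded synonyms (plus crude plural stemming) matches more documents,
# at some cost to precision.
docs = {
    1: "maintenance of diesel engines",
    2: "servicing of diesel motors",
    3: "history of steam power",
}
search_thesaurus = {"engine": {"engine", "motor"}}

def matches(term, text):
    # crude stemming: accept the term or its simple plural
    return any(w in (term, term + "s") for w in text.split())

def search(term, expand=False):
    terms = search_thesaurus.get(term, {term}) if expand else {term}
    return {d for d, text in docs.items() if any(matches(t, text) for t in terms)}

print(search("engine"))               # {1}
print(search("engine", expand=True))  # {1, 2} - recall has gone up
```

Note that if document 3 had mentioned “outboard motors” in passing, the expanded search would have retrieved it too: the same device that raises recall erodes precision, which is the structural tension Vickery described.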

The Delphi Research Group (2005) has conducted a survey of user attitudes, concluding that 70 per cent of search time was spent browsing, and that 75 per cent of respondents preferred browsing to searching. Findings such as these have given rise to, or appeared to validate, the use of taxonomies, pioneered by the likes of Yahoo! These offer the user a hierarchy of menus, each containing some 20 choices, allowing the user to recognize, rather than try to think of, a likely term, and to “drill down”, without worrying about Boolean operators, until a result is found. These taxonomies are a hybrid between classification and thesauri, though, being nearer to the former, they serve generic survey rather than specific reference. They do not follow the normal practices of either classification or thesauri and, as yet, no clear guidelines have been written for their construction.

However, work is under way on the revision of the British Standards BS 5723 and BS 6723, combining these into a single standard in five parts, under the title “Structured vocabularies for information retrieval”:

  1. definitions, symbols and abbreviations;

  2. thesauri;

  3. vocabularies other than thesauri (provisional title);

  4. interoperability between vocabularies (provisional title); and

  5. protocols and formats needed for exchange of vocabulary data (provisional title).

Two important features of this new standard are, first, that Part 3 gives consideration to classifications (including business classifications), subject heading lists, taxonomies and ontologies (see also Gilchrist, 2003); and, second, that the importance of interoperability is stressed. This is, of course, an old problem, reviewed by this author more than 30 years ago (Gilchrist, 1972). One of the research projects reviewed in that paper was named the “Intermediate Lexicon”, and where a number of retrieval languages are involved this approach is still the most logical (see, for example, Nicholson, 2003). The problems of interoperability are still significant, in particular switching between a classification and a thesaurus, or between two schemes with different levels of specificity. Once again, we are back to the differences between generic survey and specific reference.

One approach which claims to overcome this old apparent dichotomy is the topic map, discussed by Garshol (2004). Topic maps originated in work on the merging of electronic indexes. In a topic map, each topic is used to represent some real‐world thing. The topics represent concepts (as they do in thesauri) and these concepts are called subjects. There is an International Standard for topic maps (ISO, 1999), which has given them a semblance of authority with regard to the resource description framework, the vehicle preferred by the W3C for the Semantic Web. Topic maps integrate subject description and “document” location, where a subject, according to the Standard itself, can be “anything whatsoever”. Garshol describes topic maps as having “three constructs provided for describing the subjects represented by the topics: names, occurrences and associations”. These are built into a network linking the names and the associations between them, and often the URLs (the occurrences). This description is a simplification that fails to do justice to such an innovative technique. Topic maps are extremely flexible: in effect a dynamic network of metadata types together with their attributes and values, capable of combining the virtues of generic survey and specific reference. Digitisation is facilitating the merging of content and metadata into single structures, aided by languages such as XML.
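A toy fragment shows how the three constructs Garshol names fit together. The topics, the occurrence URL and the association type below are all invented; this is a sketch of the idea, not of any topic map syntax:

```python
# A toy topic map fragment: topics carry names and occurrences (locations),
# and associations link topics, each playing a named role.
from dataclasses import dataclass, field

@dataclass
class Topic:
    name: str
    occurrences: list = field(default_factory=list)  # e.g. URLs of resources

topics = {
    "vickery": Topic("Brian Vickery"),
    "retrieval": Topic("Retrieval languages",
                       occurrences=["https://example.org/vickery-1971"]),
}

# An association of type "wrote-about", with two role-playing topics.
associations = [("wrote-about", {"author": "vickery", "subject": "retrieval"})]

for assoc_type, roles in associations:
    players = {role: topics[t].name for role, t in roles.items()}
    print(assoc_type, players)
```

Because a query can enter this network either at a subject (and survey everything associated with it) or at an occurrence (and retrieve one specific resource), the same structure serves both generic survey and specific reference.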

Social networks

The latest phenomenon to arrive on the retrieval scene is the collaborative tagging system. Examples can be seen on web sites such as del.icio.us (http://del.icio.us/doc/about), described as a “social bookmark manager”, and Flickr (www.flickr.com), which provides “photo sharing”. The innovative feature of these two sites is that anybody can lodge their favourite bookmarks or photos, adding their own completely uncontrolled subject tags or, indeed, any metadata. This fusion of postmodernism and disintermediation has been justified by reference to the “accuracy of crowds”, noted and described by Galton in the early years of the twentieth century (Galton, 1907). Galton collected guesses from 787 participants at a country fair who were invited to guess the weight of an ox. He found that the median guess was within 0.8 per cent of the correct weight, and that the mean of the guesses was within 0.01 per cent. Interesting as this finding is, it can hardly be compared to an “accuracy” of metatagging achieved by multiple “guesses”. In fact, this exercise, useful and interesting as it may be, is merely another example of the search thesaurus offering a range of unstructured alternative keywords.

A web site that comes nearer to the Galton study is provided by Wikipedia (http://en.wikipedia.org/wiki/Main_Page). This is a free collaborative encyclopaedia to which anyone can contribute either original items or corrections and additions to existing work. The items are broadly categorised and accessible by keyword search, the whole system being maintained by a very small number of people whose main job seems to be to monitor input for material that may be obscene or libellous. To date, there are an astonishing 680,044 articles and 379,467 registered users. The owners have plans to extend the facility to cater for other collaborative exercises.
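As an aside, the 0.8 per cent figure quoted above for Galton's ox can be reproduced from the values reported in his Nature note: a dressed weight of 1,198 lb against a middlemost (median) estimate of 1,207 lb:

```python
# Checking the median-guess error quoted from Galton (1907).
true_weight = 1198    # dressed weight of the ox, in lb
median_guess = 1207   # the "middlemost" estimate of the 787 guesses

error = abs(median_guess - true_weight) / true_weight
print(f"{error:.1%}")  # 0.8% - matching the figure quoted in the text
```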
Golder and Huberman (2006) have studied collaborative tagging, and conclude that “because stable patterns emerge in tag proportions, minority opinions can coexist alongside extremely popular ones without disrupting the nearly stable consensus choices made by many users”. What might be interesting, if feasible, would be to collect these end user views and combine them into a structure, perhaps in the form of a topic map.

Conclusions

It seems that generic survey and specific reference are still the main functions of retrieval languages, with minor functional additions such as relevance ranking; but new structures are increasingly made possible through the use of powerful computing and XML, and, in future, by improvements in artificial intelligence.

Table I  The relations between function and retrieval language elements


Corresponding author

Alan Gilchrist can be contacted at: [email protected]

References

Aitchison, J. and Dextre Clarke, S. (2004), “The thesaurus: a historical view with a look to the future”, in Roe, S.K. and Thomas, A.R. (Eds), The Thesaurus: Review, Renaissance and Revision, Haworth Information Press, New York, NY.

Aitchison, J., Gomersall, A. and Ireland, R. (1969), Thesaurofacet: Thesaurus and Faceted Classification for Engineering and Related Subjects, English Electric Company Ltd, Whetstone.

Bates, M.J. (2002), “Toward an integrated model of information seeking and searching”, paper presented at the 4th International Conference on Information Needs, Seeking and Use, Lisbon, September, available at: www.gseis.ucla.edu/faculty/bates/articles/info_SeekSearch‐i‐030329.html.

Cleverdon, C.W. (1967), “The Cranfield tests on index language devices”, Aslib Proceedings, Vol. 19 No. 6, pp. 173‐94.

Delphi Research Group (2005), “Information intelligence: content classification and the enterprise taxonomy practice”, available at: www.delphigroup.com.

Galton, F. (1907), “Vox populi”, Nature, Vol. 75 No. 1949, 7 March, pp. 450‐1, available at: www.mugu.com/galton/index.html.

Garshol, L.M. (2004), “Metadata? Thesauri? Taxonomies? Topic maps!”, Journal of Information Science, Vol. 30 No. 4, pp. 378‐91.

Gilchrist, A. (1972), “Intermediate languages for switching and control”, Aslib Proceedings, Vol. 24 No. 7, pp. 387‐99.

Gilchrist, A. (2003), “Thesauri, taxonomies and ontologies – an etymological note”, Journal of Documentation, Vol. 59 No. 1, pp. 7‐18.

Golder, S.A. and Huberman, B.A. (2006), “The structure of collaborative tagging systems”, Journal of Information Science, available at: www.hpl.hp.com/research/idl/papers/tags/tags.pdf (in press).

Govtalk (2005a), Schemas and Standards, available at: www.govtalk.gov.uk/schemasstandards.

Govtalk (2005b), Metadata documents, available at: www.govtalk.gov.uk/schemasstandards/metadata_document.asp?docnum=672.

ISO (1999), Topic Maps, ISO/IEC 13250, International Standards Organisation, Geneva.

Järvelin, K. and Wilson, T.D. (2003), “On conceptual models for information seeking and searching”, Information Research, Vol. 9 No. 1, paper 163, available at: http://InformationR.net/ir/9‐1/paper163.html.

Koll, M. (2000), “Track 3: information retrieval”, Bulletin of the American Society for Information Science, Vol. 26 No. 2, pp. 16‐18.

Nicholson, D. (2003), “The HILT Pilot Terminologies Server”, paper presented at the JISC Joint Programme Meeting for Shared Services and Portals, Radcliffe House, University of Warwick, Coventry, May, available at: http://hilt.cdr.strath.ac.uk/hilt2web/Dissemination/presentations/presentations.html.

SIMS (2003), How Much Information?, School of Information Management and Systems, University of California at Berkeley, Berkeley, CA, available at: www.sims.berkeley.edu/research/projects/how‐much‐info‐2003.

TNA (2005), Records Management, The National Archives, available at: www.pro.gov.uk/recordsmanagement.

Vickery, B.C. (1968), “Bibliographic description, arrangement and retrieval”, Journal of Documentation, Vol. 24 No. 1, pp. 1‐15.

Vickery, B.C. (1971), “Structure and function in retrieval languages”, Journal of Documentation, Vol. 27 No. 2, pp. 69‐82.

Visual Resources Association (2005), Metadata, Visual Resources Association, available at: www.vraweb.org/metadata.html.
