Collection Development
LRTS 52(3)
Approaches to Selection, Access, and Collection Development in the Web World:
A Case Study with Fugitive Literature
By Karen Schmidt, Wendy Allen Shelburne,
and David Steven Vess
Academic and research libraries are well versed in collecting material from the
print world. The collections now being produced on the Web, and those still to come,
urgently need to be acquired, preserved, and made accessible
for future research. Many of the skills that librarians have honed through years
of collecting in the print-based world are applicable to digital collection development, but librarians will need to strengthen their technical skills and actively embrace digital
content in current and future collection-development work. This paper reports on
an exploratory project that aims to apply existing skills and knowledge to collect
materials from the Internet and lay the groundwork for collection development
in the future.
In the print world, the acquisition and selection of materials for libraries is a
well-defined and well-known system, developed over decades of work in the
profession. The bibliographic output is generally controlled, and librarians can
rely on their agents or vendors to obtain the books and journals that are required.
This system of identifying and procuring known items also translates well into
the controlled digital domain of electronic resources: databases, e-books, and
e-journals. Likewise, archivists have developed a refined way of identifying and
acquiring specialized collections of letters and diaries, memorabilia, and primary
literature that form the basis for social and historic research.
A significant and growing shadow world of material of equal importance is
exploding on the Internet and now deserves attention. This fugitive literature
contains important manifestations of present-day social and political history, art,
and literature, and primary cultural output. In every way, this literature is contemporary primary source material upon which research in the future will rely.
Its existence raises the question of how subject specialists and collection development librarians can take the selection and procurement skills already mastered and
refine or expand them to address the new and growing population of material on
the Web.
The research presented here reflects efforts to understand the challenges of
collecting from the Web. The questions this project sought to answer are
Literature Review
A review of literature in collection development includes the
standard collection development texts that detail how items
are identified, selected, obtained, and processed (cataloged).
Bonk and Magrill's Building Library Collections, Gorman's
Collection Management for the 21st Century, and Johnson's
Fundamentals of Collection Development and Management
provide the rubric for acquisition and collection-development activities in most libraries.1 This historic professional
framework enabled a subject-based approach, matching the
goals of this project to the standards in our profession. While
this traditional library literature helped set the stage, the literature of archives, especially recent research with archiving
Web documents, helped us understand current efforts to
capture collections on the Web. While not yet a widely
embraced area of research, some seminal writings have
been produced. Pearce-Moses and Kaczmarek examine the
challenges of a state library managing its mandate to collect
and provide access to official reports and documents.2 This
article is particularly helpful with highlighting the steps for
identifying Web sites, handling acquisition, and addressing
metadata and access issues. Other government-centered
projects are reflected in articles that describe the work of
the National Library of Australia, as well as the collaborative work of many other countries.3 These articles allude to
the challenges of crawling Web pages, suggest specific tools,
and address the problems found in saving highly dynamic
Web pages, though without offering much procedural detail.
Certainly the work that the Library of Congress is undertaking through the National Digital Information Infrastructure
and Preservation Program (NDIIPP, www.digitalpreservation.gov/index.html) is of great importance to describing
the technical dimensions of capturing and preserving our
digital culture and developing a methodology on which substantial aggregations of digital produce can be curated.4
With regard to current efforts to capture the literature emerging on the Internet, Brewster Kahle's Internet
Research Method
To launch the collecting of fugitive literature, the project
team focused on the topic of hate literature primarily emanating from individuals and groups in the Midwest. Any
theme might have been chosen to understand how to
develop a Web collection; hate literature was selected on the
assumption that there might be more unstructured linking
among individuals and groups and thus more of a challenge
to understanding the variety of communications, and
because the topic has some relevance to special collections
already at the University of Illinois at Urbana-Champaign
Library. The library includes the papers of Ewing C.
Baskette, a lawyer, librarian, and bibliographer. The focus of
this special collection is letters and manuscripts dealing with
anarchism, freedom of expression, and censorship, among
other items. The library also holds a related book collection
on censorship and intellectual freedom. The project team
reasoned that the hate literature that was gathered could
link to and enhance the Baskette collection.
Findings
The eight Web sites we selected afforded the opportunity to
consider a number of unanticipated issues. We decided to
crawl to the third level of each page, but found pages that
Figure 4. Wayback Machine search result for New Black Panthers
Lessons Learned
The principles upon which Internet-collection development can be based (identifying the subject thoroughly and
thoughtfully, understanding the publishing habits of the
subject area, committing to collecting for a period of time,
understanding how it fits with the rest of the collection,
describing it and making it accessible, and preserving it)
make this work accessible to the subject bibliographer and
the research library. A powerful symmetry exists between
the process of developing print collections and that of developing digital collections from the Internet. Subject specialists
and bibliographers have the skills at hand, but many lack
the technical skills to understand how to capture what we
find and how to make undeniably labor-intensive and often
repetitive work less so. The lessons learned are simple and
straightforward:
• This work cannot be automated; it requires excellent
subject specialist skills and the willingness to continually evaluate for content and follow up on new sites
and organizations on a regular basis.
• The work of projects such as the Internet Archive
Wayback Machine is useful as a finding aid but requires
the professional oversight of a bibliographer. The Archive-It
subscription program takes this to a higher level of
oversight and control that is moving in the right direction, although numerous technical issues related to
curation, many of which are noted here, will continue
to arise in the next few years.
• The Internet world is highly compatible with and
complementary to our print world; as we discover
and harvest Internet material, we are able to dig more
deeply into print. The resulting collection is potentially very deep and rich for our present and future
research community.
• Archivists and copyright experts have much to teach
librarians about collecting in this domain, including
issues of identifying and describing item-level (site-level) material, preservation, and ethical issues.
Just as librarians have always needed to think about how
much shelf space is needed for print collections, so too must
server space for housing these collections be considered.
The eight Web sites that were crawled produced a collection
of 6.49 gigabytes, with 56,453 files in 3,110 folders. Thus
even a targeted crawling system launched over a period of a
few months creates storage challenges that cannot be overlooked. Likewise, it is essential to consider the preservation
of these data and assure that there are adequate procedures
in place for backing up what has been captured.
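The storage figures above lend themselves to a quick back-of-the-envelope calculation. The following is a minimal sketch, not part of the project itself: it takes the three totals reported in the text and derives averages from them. The byte convention (1 GB = 10^9 bytes) and the linear projection are assumptions made for illustration, and the function names are hypothetical.

```python
# Totals reported in the text for the eight crawled Web sites.
TOTAL_GB = 6.49
TOTAL_FILES = 56_453
TOTAL_FOLDERS = 3_110
SITES = 8

# Assumption: 1 GB = 10**9 bytes; the text does not say which convention it uses.
BYTES_PER_GB = 10**9


def average_file_kb() -> float:
    """Mean file size in kilobytes (1 KB = 1,000 bytes)."""
    return TOTAL_GB * BYTES_PER_GB / TOTAL_FILES / 1_000


def average_files_per_folder() -> float:
    """Mean number of files held in each folder."""
    return TOTAL_FILES / TOTAL_FOLDERS


def average_gb_per_site() -> float:
    """Mean storage consumed by one crawled site."""
    return TOTAL_GB / SITES


def projected_gb(site_count: int) -> float:
    """Naive linear projection; assumes new sites resemble the original eight."""
    return average_gb_per_site() * site_count


if __name__ == "__main__":
    print(f"average file size: {average_file_kb():.1f} KB")
    print(f"files per folder:  {average_files_per_folder():.1f}")
    print(f"storage per site:  {average_gb_per_site():.2f} GB")
    print(f"100 similar sites: {projected_gb(100):.1f} GB")
```

Under these assumptions the collection averages roughly 115 KB per file and about 0.81 GB per site. Real sites vary widely in size, so a projection of this kind is only a starting point for planning server and backup capacity.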
Setting forth the metadata terminology, and building on
it from the beginning, is important in bringing structure to
the collection, and also to the subsequent ongoing collection
development work. It provides a template against which the
subject specialist can gauge how well a new site, blog, or
PHP fits into the existing collection. The metadata description is imperative both as a finding aid to the archive and to
establishing rhetoric upon which to base future search and
crawl work.
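One way to picture such a template is as a small site-level record compared against the collection's established vocabulary. The sketch below is purely illustrative: the article does not list the metadata fields the project used, so the field names, the `shared_terms` helper, and the sample values are all hypothetical. The comparison only surfaces overlapping terms; the judgment about fit remains with the subject specialist.

```python
from dataclasses import dataclass, field


@dataclass
class SiteRecord:
    """Hypothetical site-level description; field names are illustrative only."""
    title: str
    url: str
    subject_terms: set[str] = field(default_factory=set)
    crawl_depth: int = 3  # the project crawled to the third level


def shared_terms(record: SiteRecord, vocabulary: set[str]) -> set[str]:
    """Return the record's subject terms that already appear in the vocabulary.

    Terms are compared case-insensitively after trimming whitespace.
    """
    record_terms = {term.strip().lower() for term in record.subject_terms}
    known_terms = {term.strip().lower() for term in vocabulary}
    return record_terms & known_terms


# Hypothetical vocabulary and candidate site, for illustration only.
vocabulary = {"Censorship", "Freedom of expression", "Anarchism"}
candidate = SiteRecord(
    title="Example candidate site",
    url="https://example.org/",
    subject_terms={"censorship", "Music"},
)
overlap = shared_terms(candidate, vocabulary)
```

A candidate with no shared terms is not necessarily out of scope; it may instead signal that the vocabulary itself needs to grow.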
As a case study, this research centered upon focused
identification of a subject-based topic: hate groups in the
Midwestern states. The focus was on producers of content
and material outside the standard realms of publishing
output, and it was necessary to adjust our thinking to collect under this rubric. The parameters for developing this
collection established depths of crawls, timelines, and strategies to acquire print material that was discovered along
the way. Technological challenges required consultations
with other library and information technology specialists. In
short, many collection development principles and practices
were readily adapted to this project, but new techniques and
technologies needed to be brought in.
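The idea of a crawl depth can be made concrete with a small sketch. It is not the tool the project used, which the text here does not name; it walks a toy in-memory link graph rather than the live Web. It also assumes that the seed page counts as level 0, so that a "third level" crawl follows links up to three hops from the seed. The text does not state how levels were counted, and `crawl_to_depth` is a hypothetical name.

```python
from collections import deque


def crawl_to_depth(
    links: dict[str, list[str]], seed: str, max_depth: int = 3
) -> dict[str, int]:
    """Breadth-first walk of a link graph, recording each page's depth from the seed.

    Pages deeper than max_depth are never visited, and each page is
    recorded once, at the shallowest depth at which it is found.
    """
    depths = {seed: 0}
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        if depths[page] == max_depth:
            continue  # do not follow links beyond the depth limit
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


# Toy link graph standing in for one site; page names are made up.
toy_links = {
    "home": ["about", "news"],
    "about": ["history"],
    "history": ["archive"],
    "archive": ["too-deep"],
    "news": ["home"],
}
captured = crawl_to_depth(toy_links, "home")
```

In this toy graph the page named "too-deep" sits four hops from the seed and is left out, which mirrors the trade-off a depth parameter imposes: a shallower crawl is cheaper to store but may miss material buried further down a site.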
For today's users, this collection provides an opportunity to examine the activities of a core set of hate groups from
the area in the early twenty-first century, and it will continue
to serve that need well into the future. For the library profession, this case study provides an opportunity to consider
one way in which our collecting processes should change
and how we might build a framework around this new kind
of process. The next challenges lie in making this and other
gatherings of Web-based collections searchable and accessible using current search and retrieval technologies and
metadata coding schemas, and ensuring its preservation.
Future Implications
Why does collection development and management of
Internet material matter? We know that the first e-mail message was sent in 1964 from Cambridge, or perhaps Carnegie Mellon, or MIT; we cannot be certain because no record of
this momentous occasion exists, unlike our careful recording
of the first moments of the call between Alexander Graham
Bell and his assistant. Our digital heritage is fragile and the
challenges to identifying and preserving it are enormous. As
librarians, it is incumbent upon us to collect and preserve this
as part of our cultural and literary heritage, as we have done
for millennia with other types of material.
Research libraries have achieved significance in scholarship because of the extraordinary special collections
amassed over centuries. The commitment to finding and
preserving the record of human experience is the role of
the library and librarian. The challenges faced by those
who built our print-based specialized collections provide
inspiration and guidance in continuing that same commitment for the future: specialized digital collections of
online diaries, Web sites, games, and ephemera. Research
libraries need only look to the printed items in their collections that might well have seemed frivolous at the time of
acquisition (the Collyer's Eye streetwise sporting weekly
or the penny novels of the 1800s) to understand the rich
research value today of the publications that existed on the
fringe at the time of publication.
For any number of reasons (lack of funding, lack of staffing, lack of training, etc.), our research libraries currently
are challenged by missed collection opportunities from the