From the Digitized to the Digital Library [2001]
Manfred Thaller*
Reprint of: Manfred Thaller. 2001. From the Digitized to the Digital Library. D-Lib Magazine February 2001.
150 pages per day.4 Parallel to the CEEC model project, a conceptual project, the
"Codex Electronicus Colonensis" (CEC), is at work on the definition of an abstract
model for the representation of medieval codices in digital form.
The following paper has grown out of the design considerations for the men-
tioned CEC project. The paper reflects a growing concern of the authors that some
of the recent advances in digital (research) libraries are being diluted because it is
not clear whether the advances really reach the audience for whom the projects
would be most useful. Many, if not most, digitization projects have aimed at exist-
ing collections as individual servers. A digital library, however, should be more
than a digitized one. It should be built according to principles that are not necessari-
ly the same as those employed for paper collections, and it should be evaluated
according to different measures which are not yet totally clear.
The paper takes the form of six theses on various aspects of the ongoing transi-
tion to digital libraries. These theses have been presented at a forum on the German
"retrodigitization" program.5 The program aims at the systematic conversion of
library resources into digital form, concentrates for a number of reasons on material
primarily of interest to the Humanities, and is funded by the German research coun-
cil. As such this program is directly aimed at improving the overall infrastructure of
academic research; other users of libraries are of interest, but are not central to the
Thesis: The primary audience for a digital library is neither the leading specialist
in the respective field, nor the freshman, but the advanced student or young
researcher and the "almost specialist". The primary topic of digitization projects
should not be the absolute top range of the "treasures" of a collection, but those
materials that we always have wanted to promote if they were just marginally more
important. Whether we effectively serve them to the appropriate community of
serious users can only be measured according to criteria that have yet to be developed.
4 This site is still in the state of a beta test. Permanent accessibility is not guaranteed at the
moment. Debugging messages of the underlying DBMS may appear in the dynamically cre-
ated pages at short notice.
5 <>. In this English version of the paper
given at the conference, we have cut down theses five and six to the bare minimum. They
are either understandable only for the reader of German, as they relate directly to German
language material, or are directly connected to current funding programs within Germany.
Thesis four, which in our opinion is central to the future relationship between digital collec-
tions across the boundaries of national library systems and infoscapes, has been expanded.
Well-established academic staff have access to research assistants (RAs). Whether
such RAs carry books to a copy machine or print from a screen is not an important
concern. Well-established academic staff also usually have access to travel money.
And academic travel, in most cases, has intellectual and professional side effects,
including personal contacts with academic colleagues at institutions visited, that go
considerably beyond the value of library resources read during such travel. It is an
illusion, therefore, to assume that even the most advanced and elaborate digital
systems will entice researchers to stay at home when they can afford to visit collections
where they can be sure to meet colleagues whom they would not otherwise meet.
Similarly, the well-established specialists within a field will usually already have
large collections of copies of all but the most marginal types of resources. Digitiza-
tion of resources that they already know cannot serve them. Every researcher, how-
ever, is usually aware of a large number of library resources that are relatively
peripheral to his or her primary concern, but which would be important if they
could be accessed easily. Therefore, making such material accessible digitally will
increase the opportunities of a researcher.
As a corollary, resources that have been edited in print in the 18th century,
lithographed at the beginning and photographed at the end of the 19th century,
rephotographed in color in the middle of the 20th century, and reproduced as
high-end facsimile towards the end of the century are not the appropriate subjects
of digitization. Materials that always had to be left out of past reformatting projects because it
was too expensive to include them are appropriate subjects for digitization.
When we abandon the most obvious strategy, however, of digitizing well-known
and frequently reproduced items, we need to find serious and reliable criteria to
evaluate the digitization project's success. Nobody has to justify the digitization of
the Magna Carta, even if nothing is gained by it (since its text is already
ubiquitous). But who actually profits, and by how much, if we provide a digital version of
the less well-known cartae of the same period?
As a first step in developing new evaluation criteria, we should drop the most
obvious mark of success. There are few things in the world that are as totally irrele-
vant and meaningless for the success of a digitization project, as the raw hit rate at
its server. Hit rates are, to be precise, as meaningful for a library as the number of
tourists gaping at the glorious murals in the entrance hall: a good way to make
friends, but no real guarantee that the local chamber of commerce will not propose
to close down the costly remainder of the building.
In one of the projects with which the author has been involved, we have quite
carefully analyzed the logs of the server, which boasted some five- or six-digit hit
rates. When we looked more closely, we tried to decide what exactly a criterion for
"real use" could be. We assumed that the only people who could be considered to
be "real readers" were those who accessed a minimum of three successive pages in
a book. Furthermore, there had to be a sufficient temporal interval between each
page access so that it was plausible that each page was actually read and
understood. Using these criteria reduced the spectacular numbers quite drastically:
by this measure, the special collection may have gained only one serious user per
day, albeit one who typically reads for a few hours. This should be augmented with at least two more
users each day who, according to the logs, have consulted the transcribed tables of
content systematically, but have not qualified as "serious readers" under the rather
rigid criterion applied above. This may come as a shock, if we compare it with the
"countless hits" which are usually quoted in such cases.
But these are real users, some of whom would have had to travel from Japan,
Turkey or the US to Germany for a number of weeks to consult the same literature.
And if you consider the travel costs that the (fictitious) international research
community would otherwise have to fund, digitization projects suddenly become less
expensive than they may look otherwise. One of the problems with this argument
may be that in many cases those profiting the most from digitization of research
material are, in the short run, members of research communities other than the
community that funded the project. Fortunately, German research funding tradition-
ally has been reasonably un-parochial in its perspectives.
Recommendation: The ongoing digitization projects should systematically de-
velop clear and open criteria for the actual usage of the resources created. These
criteria should be conservatively defined. Evaluations of the cost effectiveness of a
digitization project should compare the cost of digitization with the costs incurred if
access to the material were by other means.
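The "real reader" criterion used in the log analysis above can be stated precisely enough to be automated. The following is a minimal sketch, not the analysis actually run on the project logs: the record layout (visitor, book, page number, timestamp) and the 30-second minimum reading interval are assumptions made for illustration, since the text specifies only "three successive pages" and "a sufficient temporal interval".

```python
from collections import defaultdict
from datetime import datetime, timedelta

MIN_PAGES = 3                      # from the text: three successive pages
MIN_DWELL = timedelta(seconds=30)  # assumption: the text gives no figure


def serious_readers(hits):
    """Return the visitors who qualify as "real readers".

    hits: iterable of (visitor, book, page_number, timestamp) tuples;
    this record layout is hypothetical.
    """
    accesses_by_reader = defaultdict(list)
    for visitor, book, page, timestamp in hits:
        accesses_by_reader[(visitor, book)].append((timestamp, page))

    readers = set()
    for (visitor, _book), accesses in accesses_by_reader.items():
        accesses.sort()  # chronological order
        run = 1          # length of the current run of successive pages
        for (t0, p0), (t1, p1) in zip(accesses, accesses[1:]):
            if p1 == p0 + 1 and t1 - t0 >= MIN_DWELL:
                run += 1
            else:
                run = 1  # a jump, a reload, or too fast to have been read
            if run >= MIN_PAGES:
                readers.add(visitor)
                break
    return readers
```

A criterion of this kind is deliberately conservative: it undercounts users who consult tables of contents or single pages, which is why such users were reported separately above.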
Thesis: Digital collections need a critical, minimal size to make their access
worthwhile. In the end, users want to access information, not metadata or gimmicks.
It is a well-known truism that the Internet as a whole creates an information glut.
Therefore, a corollary goes, one of the primary challenges for the evolving world-
wide infoscape is to define means to prevent the user from being overwhelmed by
the flood. This in turn leads to the further claim that the primary challenge for
digital libraries is the development of metadata standards.
For libraries this is undoubtedly true. To interconnect the OPACs of a national
or trans-national library system, agreement about descriptive standards is obviously
central. From this baseline, however, conclusions have been drawn for digital col-
lections that may easily turn out to be very counterproductive.
Some background assumptions, before we go on:
The creation of digital collections does not have to be particularly expensive
anymore. One of the more spectacular technical developments in recent years has
been the drop in the pricing of digital cameras, where the resolution achievable by a
$1,000 camera has been climbing sharply. At the other end, cameras like the 4096 x
4096 pixel camera offered by Kodak, with an observed workflow of ca. 5 / 10
seconds per exposure, are today still in the six-digit price range. With an emerging
mass market of digital hobby photographers, it seems to be a safe bet that high
speed digital cameras at a professional resolution will become achievable for rou-
tine projects in less than ten, presumably within the next five years.6 This means
that with 2,000 exposures per campaign day - roughly the speed possible with
analog cameras today, the handling of the object being a serious barrier for quite
some time - 1,000,000 page digitization projects will be possible with a limited
budget and over a two-year or 500-day time frame.
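The arithmetic behind this estimate can be written out as a rough check. The sketch below is illustrative only: the function name is hypothetical, and the figure of 250 campaign days per year is an assumption that merely restates what "a two-year or 500-day time frame" implies.

```python
def campaign_days(total_pages, exposures_per_day):
    """Number of campaign days needed, rounded up to whole days."""
    return -(-total_pages // exposures_per_day)  # integer ceiling division


CAMPAIGN_DAYS_PER_YEAR = 250  # assumption implied by "two-year or 500-day"

days = campaign_days(1_000_000, 2_000)  # figures taken from the text
years = days / CAMPAIGN_DAYS_PER_YEAR
print(days, years)
```

With the figures given in the text, this yields 500 campaign days, that is, two years under the stated assumption.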
For reasons that are beyond the scope of this paper and which have to do with
the intricacies of the impact of the IT revolution on management hierarchies, the
organization of document servers is today still believed to be a major technical
challenge, an assumption understandably supported by the software industry profit-
ing from it. The author would, however, like to emphasize that the reference sys-
tems he is involved with have all been created at rather low cost and within very
short periods. Indeed, part of the training plans at the University of Cologne aims to
bring the requirements for the creation of digital libraries to a level where the im-
plementation of a digital library of the technical scope of the reference systems
quoted initially can be given within the next 12 months as a seminar assignment.
We propose, therefore, to base consideration of the required access tools for dig-
ital collections on the assumption that one million page collections can be produced
reliably and cheaply within the near future.
One million pages seems like a lot. In the case of printed books, however, that
number of pages represents a collection of something like 2,000-3,000 volumes.
This is substantial enough that a user will profit from such a collection, and there-
fore be willing to learn how to use it. On the other hand, 2,000-3,000 volumes are
usually not random collections of information with arbitrary levels of authority,
quality and subject matter, as would be the case with an equivalent 1,000,000 web
pages. If a researcher is interested in the development of religious doctrine in the
early eighteenth century, he or she will probably be very willing to increase his or
her understanding of the matter by browsing through a reasonably pre-ordered
library of volumes with an overall relationship to the subject. If somebody, as it
happens, should not be interested in early eighteenth century religiosity, even the
most elaborate access system based on a highly sophisticated markup system will
probably not seduce her or him into the depths of the collection.
Collections of this size would be eminently useful for research. And a collection
of such size does not need to be made accessible beyond the levels of metadata that
traditional library catalogues provide. A million-page collection with catalogue
metadata is, to be precise, considerably more useful than a collection of ten vol-
umes with complete transcriptions encoded according to an elaborate markup
scheme developed for the occasion.
All of the above should, of course, be seen under a mutatis mutandis reservation.
In the case of classical antiquity, one million pages of text probably come pretty
close to the complete corpus of surviving texts. In a large number of cases, im-
portant ones, maximum accessibility of a relatively small number of items will be
6 "Professional resolution" in this context means the resolution required to produce a digital
object that can replace the functionality of the original on a screen today, and has a suffi-
cient quality reserve to remain useful for the foreseeable future, i.e., 20-30 years.
extremely worthwhile. And, of course, there exist special collections where a small-
er number of digital items constitutes an important tool. But one has to emphasize
that there exist even more cases where one of the primary promises offered by
digitization is the ability to have access to large bodies of material beyond what is
accessible today.
For collections in which increased access to the content matters the most, the
data itself, and not the metadata, is important. The further creation of small pilot
projects with shiny interfaces that lure the user in, but which later on frustrate
him or her because the project does not contain practically useful amounts of data,
is increasingly detrimental to the acceptance of digital libraries as serious tools.
Ultimately, it is the content of a library that counts, not the architecture of the
building housing it.
Recommendation: A model for the creation of digital collections should be de-
veloped that allows for the creation of digital libraries in the one million primary
digitization object (roughly: pages) range at minimal cost. Costs can most easily be
minimized by cutting down on the effort invested in the creation of access infor-
mation for the individual item.
Thesis: If digital library resources are to be integrated into the daily work of the
research community, they must appear on the screen of the researcher in a quality
that is useful in actual work.
Digital objects can be characterized by the degree to which they allow functional
replacement of the original. Four levels of digital objects can be differentiated. The
names of the four levels have been derived from discussions around the creation of
manuscript servers.
A digital object is called illustrative if its quality is sufficient to allow a user to
make an informed decision about whether access to the original is worthwhile. This
level of digital quality is usually employed by museum systems, since the impres-
sion of an original piece of art still goes beyond any impression that can be created
on any screen available to the humanities research community.
A digital object is called readable if its quality allows the user to access all the
information that the creator of the original object wanted to convey to the user.
Digitized pages of a printed book, for example, have to be clearly readable on the
screen and not strain the eyes. It is not necessary, however, to be able to decipher
the notes that a few generations of college students in final examination frenzy have
left in the margins.
A digital object is called paleographic in our terminology if the quality allows
the user to access all the information that can be derived from the original with the
unaided eye. In medieval codices it is important to be able to read the text. It is also
important, however, to be able to see if in the lettering there is a recognizable
change in the way the pen was held, thus indicating a change of authorship.
Finally, a digital object is enhanceable if the digital version provides access to
information that cannot be extracted from the original with the unaided eye. Image
enhancement may, for example, make erasures legible again.
Any quality much below "readable" is pointless in library applications. Thumb-
nails, for example, are eminently useful in museum systems, but are rarely useful by
themselves in library systems.
If end user requirements, rather than absolute numbers, determine the appropri-
ate quality of images in a digital system, it follows that a digital library has to speci-
fy - and discuss - a specific platform it expects its users to have, a specific purpose
they are expected to follow in looking at the material, and the actual resolutions
derived from these assumptions.
The following model for a manuscript library is offered for discussion:
a) Professional manuscript work cannot be done on screens with a resolution of
less than 1024 x 768. No specific support is given for analytic work on screens
below that resolution.
b) 1024 x 768, however, defines the lower limit. 1200 x 1024 is considerably
superior as soon as we go beyond the plain reading of manuscripts.
From these assumptions the following resolutions, with corresponding examples,
have been derived:
The lowest resolution, which is displayed while browsing or searching within
the framework of the main interface of the digital manuscript library, is defined
as Visual summary.7 It is high enough that a meaningful decision can be made
regarding whether one of the higher resolutions should be loaded. It also provides,
in exceptional situations, some support for 800 x 600 screens.
For standard work, however, two higher resolutions are provided: Working cop-
ies 8 are at a resolution that presents the horizontal dimension without scrolling on
1024 x 768 screens and preserves most of the optical properties of the origi-
nal. Optimized working copies 9 are at a still higher resolution, and present their
horizontal dimensions without scrolling on 1200 x 1024 screens. The area of the
page that actually contains the writing will in most cases also fit into the horizontal
dimension of 1024 x 768 screens. These pages are optimized in a mechanical way,
i.e., some contrast enhancement and similar operations have been applied that
optimize the readability as far as possible, without analyzing the characteristics of
the pages individually. The price to be paid for this optimization is a distortion of,
in particular, the colors.
For rare cases of detailed professional work, specifically in the area of paleogra-
phy, a pretty high resolution 10 image, close to 4491 x 3480 in size, is presented. We
are proud to bring that resolution to the world of complete digital collections, which
has so far been available only for CD-ROM-based facsimiles of individual manu-
scripts, as in the case of the Beowulf project.
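The relationship between screen size and the resolution served by default can be sketched as a simple lookup. This is a hypothetical illustration, not the logic of the Cologne server: the function name and the rule "serve the largest version designed for a screen no wider than the user's" are assumptions, while the tier names and screen widths are taken from the description above. The highest resolution is left out because, as described, it is intended for rare cases and would presumably be loaded only on explicit request.

```python
# (tier name, narrowest screen width in pixels the tier is designed for)
TIERS = [
    ("visual summary", 800),
    ("working copy", 1024),
    ("optimized working copy", 1200),
]


def default_tier(screen_width):
    """Pick the largest tier that fits the screen without horizontal scrolling.

    Screens narrower than 800 pixels fall back to the visual summary.
    """
    chosen = TIERS[0][0]
    for name, min_width in TIERS:
        if screen_width >= min_width:
            chosen = name
    return chosen
```

For example, a 1024 x 768 screen would be served working copies by default, while a 1200 x 1024 screen would receive optimized working copies.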
Recommendation: Digital libraries that do not offer the resolutions needed for
professional work are useless. An explicit definition of the qualities offered, based
on discussions with the potential customers, should therefore be included in the
specifications of all digital resources.
Thesis: While digital libraries are self-contained bodies of information, they are not
the basic unit that most users want to access. Users are, as a rule, more interested
in the individual objects in the library and need a straightforward way to access them.
Digital libraries, particularly large ones, are still seen today as unique and signif-
icant projects. As a result, they are frequently constructed as self-contained systems,
where the separation between the interface of the library and its contents is not as
clear-cut as one would wish. This means that many digital libraries expect that a
user will enter through the interface of the library. This is an example of when the
implementation of a traditional metaphor is counterproductive.
In our reference project, we are experimentally creating a functionally complete
linkage interface that allows one to access the content of the library completely
independent of its own user interface. While this specification is not yet fully stabi-
lized and public, it is partially available, and the following ways of addressing are
guaranteed to be as persistent as the floating discussion of persistent basic identifi-
ers allows. A researcher who intends to refer to the content of the Cologne manu-
script library will have a mechanism that allows him or her to address reliably and
persistently the following:
1) A digital object that represents a conventional unit of reference within a given
discipline. In our case it is a medieval codex.
2) A digital object that represents the same object at a finer level of granularity,
reflecting the usage of a given discipline. In our case individual pages are at the
finer level of granularity.
Note: We refer intentionally to "units of reference" and "granular objects" instead
of "codices" and "pages," not to introduce an additional level of complexity, but to
prepare for the generalization of such addressing schemes to other cultural heritage
material. Ultimately codices can be seen as particularly simple cases, where only
one level of subdivision exists and the granular objects are ordered linearly, as
opposed to, e.g., museum objects, where a number of hierarchical levels for digiti-
zation of details exist, and intuitive schemes for the naming of granular entities are
considerably more complex. The basic problems remain the same, however.
The two types of reference above are necessary for two reasons:
3) From the end user's point of view, it is important to be able to include a refer-
ence to a digitally stored manuscript directly in a text. This will become much
more important in the future when the results of research are themselves pre-
sented on digital media. In such cases it would be almost absurd to have an end
user directed from a footnote to the search engine of a digital library, instead of
the digital object itself, the address of which was obviously accessible to the au-
thor at the time of writing.
4) From the conserving institution's point of view, a clear tendency towards virtual
libraries / archives / museums seems to exist. The most obvious way to construct
such virtual collections is to envisage them as access platforms that hide from
users the fact that the individual objects accessed are stored under different ad-
ministrative and technical conditions. This is achieved most easily if an access
machine can access individual digital objects in different holdings directly, that
is, without a negotiation process with the access tools of the specific institution
holding the object.
It would be highly impractical to rely on a central body, operating worldwide, to
create a new set of identifiers for all existing objects of cultural heritage. All exist-
ing collections of manuscripts, archives, museums, etc., would have to agree upon a
common system of shelfmarking for their objects. This is not only impractical, but
also directly damaging, as the reference systems within collections of cultural herit-
age material that have grown historically usually represent by themselves a specific
intellectual view of that material.
We envisage, therefore, a solution that divides the general problem into three
subproblems:
1) A persistent addressing scheme for collections, which by necessity must be
organized nationally, with national (or regional) solutions being coordinated by
appropriate international bodies.
2) A persistent addressing scheme for digital objects within individual collections
that is under the control of the individual institution, but which guarantees a
common functionality and interoperability of the different collections.
3) A mapping scheme that allows referencing a granule of a digital object by a
specific numbering scheme, which is then translated into the actual names of in-
dividual digital components, like page images. Such a mapping scheme is ad-
ministered by the individual collection and should even exist if the names of the
digital objects - file names - also reflect the traditional references directly. The
order of access to granules of digital objects - "next page", for example - is a
matter of interpretation. To allow operations like "virtual rebinding" of a digital
codex, we strongly propose to differentiate clearly between this level and the
preceding one.
An implementation of an addressing scheme for digital objects based on the preced-
ing analysis would look as follows:
<collection-reference> <object-reference> <granule-reference>
where <granule-reference> is either a <direct-granule-reference> or a <mapped-granule-reference>.
CEC has a processing model for <object-reference> and <granule-reference>.
For the discussion / definition of the concept of a <collection-reference> we seek
the support of appropriate library institutions. CEEC has a working implementation
for <object-reference> and <direct-granule-reference>, and a working implementa-
tion for the concept of a <mapped-granule-reference> is expected soon.
Discussion of Individual Access Models
A complete Cologne codex can currently be reached via a WWW address like:
Ignoring the "%22" (the CGI wrap-up for the quotation marks), this means our
previous definitions are realized as follows:
<collection-reference> =
<object-reference> = ceec-cgi/kleioc/0010/exec/katk/%22kn28-0083ii%22
To access an individual page of a Cologne codex, a WWW address like the follow-
ing can be used:
Here the following is applicable:
<collection-reference> =
<object-reference> = ceec-cgi/kleioc/0010/exec/pagemed/
<granule-reference> = %22kn28-0083ii_164.jpg%22
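How the three parts combine into a single address can be sketched as follows. This is a minimal illustration under stated assumptions: the collection reference used here, "http://collection.example/", is a placeholder and not the actual address of the Cologne server, and both function names are hypothetical. The object and granule references are those of the page example above.

```python
from urllib.parse import unquote

COLLECTION_REF = "http://collection.example/"  # placeholder, not the real address
OBJECT_REF = "ceec-cgi/kleioc/0010/exec/pagemed/"
GRANULE_REF = "%22kn28-0083ii_164.jpg%22"


def build_address(collection_ref, object_ref, granule_ref=""):
    """Concatenate the three parts of the addressing scheme into one address."""
    return collection_ref + object_ref + granule_ref


def granule_name(granule_ref):
    """Undo the CGI wrap-up: decode %22 and drop the surrounding quotation marks."""
    return unquote(granule_ref).strip('"')


print(build_address(COLLECTION_REF, OBJECT_REF, GRANULE_REF))
print(granule_name(GRANULE_REF))
```

Keeping the collection reference as a separate argument reflects the point of the scheme: it is the one part expected to be replaced by a persistent identifier, while the object and granule references remain under the control of the individual institution.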
In the example, is obviously a URL. This is where we
seek the support of existing library institutions. Obviously a persistent identifier for
the individual collections should replace the URL. It would be particularly helpful
if the identifiers would incorporate existing schemes for the unambiguous reference
to institutions. For example, it would be helpful if the identifier above could be
replaced by something that contains a reference to "kn28", the (within Germany)
traditional unambiguous reference to the library in question.
Less formal than the rest of these proposals: Within the WWW the question of
top-level domains is very much open to discussion, and with ".museum", at least
one type of cultural heritage institution has reached top-level status. Considering the
fact that libraries in many ways are the nodes of the information network, when we
consider the actual amount of information handled, it is fair to wonder if there are
any discussions underway that would lead to the creation of a library top-level
domain and references like "". If not, why not? This could be a
very good starting point for persistent implementations and, with the library com-
munity directly responsible for the administration of its domains, would do away
with an entire level of problems.
<object-reference>
Once the problem of the persistency of the basic identifier is resolved, we consider
a robust technical solution reasonably simple.
We have implemented the following scheme:
<object-reference> = <interface> <access-mode> <resource-id>
with the following considerations:
The <interface> of a CEC <object-reference> is a series of one or more identifiers
separated by slashes. They represent a software system existing at a given point in
time, in our example: kleioc/0010.
1) The only thing about which we can be reasonably sure regarding the further
development of net-oriented information access is that it will develop considera-
bly beyond the current stage. It is very likely, therefore, that future access sys-
tems to digital resources will make them accessible in new ways.
2) On the other hand, it is unlikely that a typical preserving institution will make
fundamental changes to its software platform more regularly than, say, every ten years.
We would therefore assume that a given institution, when exchanging a software
platform 'x' with a software platform 'y' will provide scripts or their future equiva-
lents that will direct all references to the software interface 'x' to methods provided
by the new software which closely resemble the previous picture, while at the same
time a reference to the new interface 'y' can be provided, making full use of any
additional capabilities the new software provides.
As such changes will be infrequent, it is not unreasonable to ask an institution to
provide the level of legacy support represented by such scripts.
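The legacy support described here amounts to a small rewriting table. The following is a minimal sketch of the idea, not a description of any existing script: the replacement interface "platform-y/0001" and the function name are hypothetical, and only "kleioc/0010" comes from the example above.

```python
# Old interface identifier -> interface of the new platform.
# The right-hand side is invented for illustration.
LEGACY_INTERFACES = {"kleioc/0010": "platform-y/0001"}


def rewrite_interface(path, table=LEGACY_INTERFACES):
    """Redirect a reference that names an old interface to the new one.

    Matching is done on whole slash-separated identifiers, so a partial
    match inside a longer identifier is never rewritten. Paths that name
    no known legacy interface are returned unchanged.
    """
    segments = path.split("/")
    for old, new in table.items():
        old_segments = old.split("/")
        n = len(old_segments)
        for i in range(len(segments) - n + 1):
            if segments[i:i + n] == old_segments:
                return "/".join(segments[:i] + new.split("/") + segments[i + n:])
    return path
```

Under this sketch, references already published in the literature keep working after a platform change, while new references can name the new interface directly.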
Almost all imaginable systems for the administration of digital objects will provide
access to them according to different qualities, resolutions, access privileges and the
like. The <access-mode> of a CEC <object-reference> provides a means to differentiate
between different combinations of such properties. Like the interface, it is a
series of one or more identifiers separated by slashes.
A direct granule reference consists of a string that can be used directly to access
digitized information on a specific server. It may be necessary to break the refer-
ence up into components that represent different levels of a storage hierarchy,
and/or into components that map logical names onto physical storage addresses. It
does not allow for any conceptual interpretation, however. A collection guarantee
indicates that the <direct-granule-reference> of a digitized page or other atomic unit
of digitization will never change throughout its existence. In our example, kn28-
0083ii_164.jpg is a direct-granule reference.
<mapped-granule-reference>
A mapped granule reference consists of a string that is separated by a dividing
character. The CEEC implementation of the CEC concepts uses a vertical line "|".
The first of the two parts is the identifier of a mechanism that allows the second part
of the string to be mapped to a direct granule reference, according to a specific set
of rules, which may be changed over the lifespan of the digital object or, indeed, be
dropped as obsolete. If a mapped granule reference starts with the vertical line, it
maps to a default mechanism that will exist for the complete lifespan of the object
and is called a "canonical reference".
In our example: |kn28-0083ii_82r will map to the file which represents page 82
recto of the manuscript according to the canonical references given in the literature
referring to it. Miller|kn28-0083ii_insertion4-3r may map to the file containing
page 3 recto of the fourth insertion into a hypothesized original manuscript
proposed by researcher Miller. This interpretation may be adapted according
to the researcher's progress or, indeed, be dropped if he or she turns out to
be mistaken.
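The resolution of mapped granule references can be sketched as a two-level lookup. This is an illustration under stated assumptions, not the CEEC implementation: the function name is hypothetical and the file names on the right-hand side of the tables are invented, since the actual correspondence between folio numbers and image files is not given here.

```python
# mechanism identifier -> {mapped reference -> direct granule reference}
# The empty identifier stands for the default, "canonical" mechanism.
# All target file names below are invented for illustration.
MAPPINGS = {
    "": {"kn28-0083ii_82r": "kn28-0083ii_163.jpg"},
    "Miller": {"kn28-0083ii_insertion4-3r": "kn28-0083ii_201.jpg"},
}


def resolve_granule(reference, mappings=MAPPINGS):
    """Translate a granule reference into a direct granule reference."""
    if "|" not in reference:
        return reference  # already a direct granule reference
    mechanism, key = reference.split("|", 1)
    try:
        return mappings[mechanism][key]
    except KeyError:
        # The mechanism may have been dropped as obsolete, or the key is unknown.
        raise LookupError(f"no mapping for {reference!r}") from None
```

Because the canonical table is just one entry among several, an operation like "virtual rebinding" only requires registering a further mechanism; the direct granule references themselves never change.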
Recommendation: To make digital repositories useful within digital publica-
tions, each digital repository should include a publicly accessible interface for
access to its component items, which provides a persistent mode of quoting the
content of that collection.
6. Library and Teaching
