Ontology-Based Information Retrieval For Historical Documents
Abstract— This article presents an ontology-based approach to designing and developing a new representation for an IR system, in place of the conventional keyword-based approach. Such a representation improves the precision and recall of document retrieval. Experiments comparing the ontology-based approach with a keyword-based approach demonstrate the effectiveness of the proposed approach.

Keywords-ontology; information retrieval; semantic retrieval;

I. INTRODUCTION
Information Retrieval (IR) research in various fields gives researchers many new ideas for improving existing approaches. Recently, however, the field that has received special attention is history. As discussed in [1], historians still expect a better approach for more accurate access to historical documents. For example, a recent study of the Australian National Library found that the number of visitors increased radically when historical documents were provided through a searchable full-text index [2, 3]. Hence, IR for historical documents is an essential issue to be studied.

A historical document can be defined as one that keeps information related to the time instant at which it was published and that, at the same time, is still useful in the future [4]. Searching and retrieving documents from a large historical archive proves challenging for the IR field, as historians typically employ their knowledge, experience and intuition to decide which information they need to find and study, and attempt to locate sources that contain that information [8]. Hence, Elena et al. [1] suggest that historians need historical source repositories and tools that enable them to access comprehensive information rapidly. Conventional IR approaches are mostly based on a simple Bag-of-Words (BOW) approach, in which term order is ignored, so that many texts with very different meanings are conflated into a single form. As a result, searching and ranking historical documents with the BOW approach does not suffice, as the documents contain rich semantic information relating to important entities such as events, times, and people.

Therefore, in this paper we propose an ontology-based approach to indexing and ranking [5] semantically rich historical documents. The ontology developed centres on the event-related elements that are important to the historical domain. An ontology-based approach to document retrieval is not new, as demonstrated by the work in [16-18]. However, applications of such an approach to historical documents are still scarce and remain open for further research and development. Apart from the ontology-based approach, we also propose a simple ontology-based weighting mechanism derived mainly from the classic tf-idf scoring scheme. We evaluated our proposed approach against the BM25 probabilistic model on 133 documents.

The paper is organized as follows. First, an overview of the environment in which ontologies have been used is presented. Section 2 describes related work on IR for historical documents. Section 3 illustrates the overall process of semantic retrieval, while Section 4 discusses the results obtained from the evaluation of the approach. Finally, Section 5 concludes.

II. RELATED WORK
Applications of IR to historical documents are mostly concerned with spelling issues: users expect modern keywords to match the word forms and spellings found in historical documents [3, 6, 7]. This is because large collections of historical texts contain many spelling variants [3]. Full-text indexing of such documents does not suffice, as the modern words used in users’ queries fail to match the index. Two popular approaches to this problem are special matching procedures and lexica for historical language.

Keyword matching procedures, although non-trivial, still do not fully represent the fundamental characteristics of historical documents. A historical document can be defined as one that keeps information related to the time instant at which it was published and that, at the same time, is still useful in the future [4]. Elena et al. [1] state that historians employ their knowledge, experience and intuition to decide which information they need to find and study, and attempt to locate sources that contain that information. They also state that historians need historical source repositories and tools that enable them to access comprehensive information rapidly. The 20th and
Authorized licensed use limited to: Missouri State University. Downloaded on November 26,2022 at 08:06:41 UTC from IEEE Xplore. Restrictions apply.
2016 Third International Conference on Information Retrieval and Knowledge Management
early 21st centuries have transformed the way people access information.

Hence, users expect that a wealth of historical information can be shared and reused through digital libraries that provide the best-matched documents for any search request, answering competency questions as well as supporting a selected scenario [8, 9]. To fulfil such requests, Mirzaee, Iverson [8] and Corda [9] suggested that the semantics of a historical document should be captured, allowing a richer representation of its embedded knowledge than is possible with standard text manipulation tools. The use of semantics could be more effective if it is simplified by defining time-based relations. Furthermore, the work by Schockaert, Cock [10] suggested that documents should be sorted according to the temporal aspects of their context to improve IR systems. Alonso, Gertz [11], in turn, note that recognizing and using temporal information is an important feature that can improve the functionality of search applications.

However, the aforementioned works mainly suggest the type of knowledge that should be extracted and modelled for describing historical documents. As such, the application of such ontological knowledge to support semantic retrieval of historical documents is still open for further research.

III. THE PROPOSED APPROACH TO SEMANTIC HISTORICAL DOCUMENT RETRIEVAL
The focus of this work is on historical documents. We chose to scope our work to an ontology and documents relating to the Vietnam War.

A. The Domain Ontology
The development of our history ontology focused mainly on events, reflecting the opinion of various researchers that events are the most important elements in history. With this ontology, historical documents can be retrieved and analysed based on events or other elements related to the events.

In our work, the ontology was developed semi-automatically and formalized by domain experts and ontology developers. We reused the existing Simple News and Press Ontologies (SNaP) ontology and expanded it based on our vocabulary, as shown in Figure 1. The SNaP ontology comprises several ontologies, which describe assets (text, images, video) and the events and entities (people, places, organisations, abstract concepts etc.) that appear in news content. Although it is meant for news documents, it was found to be suitable in our case as it contains a detailed representation of events as well as documents (i.e. assets). The event ontology inherits fully from the public domain Event Ontology. The object property subEventOf is an rdfs:subPropertyOf event:sub_event with the addition of transitivity. Events are considered compound entities in our domain (i.e. they are rich entities made through their relations with other entities, namely people, organisations, locations and things, both tangible and intangible). Figure 1 shows all the classes that were customized using TopBraid Composer. We imported the SNaP ontology into TopBraid Composer and customized it based on our vocabulary, i.e. the historical domain. The basic classes matched to our domain were event, factor, person, spatial thing (location), and time and date. We then expanded the ontology by adding classes such as country and stuff. The country class was added to record the country involved in each war, whereas the stuff class includes both tangible and intangible entities and is used to associate the people involved in a war with their country and organization.

Figure 1. Classes based on historical domain.

B. The Semantic Retrieval Framework
The overall framework, shown in Figure 2, describes the retrieval process. The prototype takes as input a formal SPARQL query. The query is evaluated against the knowledge base, and the output is a list of semantic entities (instances) that meet the requirements of the query. The prototype then retrieves documents based on the matched entities.

The semantic retrieval framework has a knowledge base that is associated with the information sources (the document base) through one or several domain ontologies describing the concepts that appear in document text. The concepts and instances in the knowledge base are linked to the documents explicitly and stored in the form of annotations. These annotations are used to create an initial representation for the retrieval and ranking processes.

Figure 3 illustrates the annotation mechanism, which starts with the system taking as input a set of documents from Wikipedia for annotation and indexing. The resulting annotations are then stored in the knowledge base. The document annotation process consists of the following steps:

a. Load the information (the document base) of basic terms, that is, extract the textual representation of the selected entity. The basic terms were extracted from Wikipedia on Battles and operations of the Vietnam War. Table 1 shows the list of basic terms.
b. The linguistic analysis is used to filter basic terms and to identify those sets of terms that
documents in the repository that have been annotated by semantic entity. Once the document list is complete, the search engine calculates the semantic similarity value between the query and each document using the classic vector space IR model. Finally, we sort and rank the documents in descending order of similarity value.

IV. EVALUATION
We compare the proposed approach against the BM25 IR model using a corpus of 133 documents from Wikipedia and a total of five queries. The BM25 IR model is considered state of the art in the IR community and has been widely used by IR researchers to improve search engine relevance [12]. The documents relate to the Battles and operations of the Vietnam War. This is an initial evaluation, which we plan to extend with more documents and queries. The queries are listed in Table 2. For the ontology-based approach, the queries were translated into the corresponding SPARQL queries. For example, the query “What was the sub-event of the Battle of Ap Bau Bang II?” was translated to:
SELECT *
WHERE {
  event:BattleofApBauBangII pne:subEventOf ?stuff.
}

TABLE III. EVALUATION RESULT FOR KEYWORD-BASED APPROACH

          BM25 model
Query#    Ret.    Rel. ∩ Ret.    Prec.     Avg. Prec.    Rec.
Q1        119     3              0.0252    0.6905        1.0000
Q2        106     26             0.2453    0.5358        0.9286
Q3        128     23             0.1797    0.3983        0.9200
Q4        117     97             0.8290    0.8987        0.8981
Q5        112     5              0.0446    0.6113        0.8333
MAP                                        0.6269

TABLE IV. EVALUATION RESULT FOR ONTOLOGY BASED IR APPROACH

          Ontological model
Query#    Ret.    Rel. ∩ Ret.    Prec.     Avg. Prec.    Rec.
Q1        4       3              0.750     1.0000        1.000
Q2        27      27             1.000     1.0000        0.964
Q3        63      23             0.365     0.7312        0.920
Q4        125     108            0.864     0.8622        1.000
Q5        6       6              1.000     1.0000        1.000
MAP                                        0.918
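The precision and MAP figures in Tables III and IV can be recomputed from the reported counts. The following is a minimal sketch in Python, not the authors' evaluation code: the names QueryRow, precision, mean_average_precision and average_precision are hypothetical, the per-query average precision values are copied from the tables rather than recomputed (the ranked result lists are not given), and average_precision assumes the common non-interpolated definition, which the paper does not state explicitly.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class QueryRow:
    """One table row (hypothetical container, not from the paper)."""
    retrieved: int            # "Ret." column
    relevant_retrieved: int   # "Rel. ∩ Ret." column
    avg_precision: float      # "Avg. Prec." column, copied from the tables


def precision(row: QueryRow) -> float:
    """Fraction of the retrieved documents that are relevant."""
    return row.relevant_retrieved / row.retrieved


def mean_average_precision(rows: Sequence[QueryRow]) -> float:
    """Arithmetic mean of the per-query average precision values."""
    return sum(r.avg_precision for r in rows) / len(rows)


def average_precision(ranked_relevance: Sequence[bool], total_relevant: int) -> float:
    """Non-interpolated AP (assumed definition): precision at each rank that
    holds a relevant document, summed and divided by the number of relevant
    documents for the query."""
    hits, total = 0, 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank
    return total / total_relevant if total_relevant else 0.0


# Rows Q1..Q5 as reported in Table III (BM25) and Table IV (ontological model).
BM25_ROWS = [
    QueryRow(119, 3, 0.6905),
    QueryRow(106, 26, 0.5358),
    QueryRow(128, 23, 0.3983),
    QueryRow(117, 97, 0.8987),
    QueryRow(112, 5, 0.6113),
]
ONTOLOGY_ROWS = [
    QueryRow(4, 3, 1.0000),
    QueryRow(27, 27, 1.0000),
    QueryRow(63, 23, 0.7312),
    QueryRow(125, 108, 0.8622),
    QueryRow(6, 6, 1.0000),
]

if __name__ == "__main__":
    for name, rows in (("BM25", BM25_ROWS), ("Ontological", ONTOLOGY_ROWS)):
        per_query = ", ".join(f"{precision(r):.4f}" for r in rows)
        print(f"{name}: precision per query = {per_query}")
        print(f"{name}: MAP = {mean_average_precision(rows):.4f}")
```

Recomputed this way, the per-query precision and the two MAP values agree with the tables to within about 0.001, the remaining differences being consistent with rounding or truncation of the last digit.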
REFERENCES
[1] T. Elena, A. Katifori, C. Vassilakis, G. Lepouras, and C. Halatsis, "Historical research in archives: user methodology and supporting tools," International Journal on Digital Libraries, vol. 11(1), 2010, pp. 25-36.
[2] A. Gotscharek, A. Neumann, U. Reffle, C. Ringlstetter, and K.U. Schulz, "Enabling information retrieval on historical document collections: the role of matching procedures and special lexica," Proc. The Third Workshop on Analytics for Noisy Unstructured Text Data, ACM, Barcelona, Spain, 2009, pp. 69-76.
[3] A. Gotscharek, U. Reffle, C. Ringlstetter, K.U. Schulz, and A. Neumann, "Towards information retrieval on historical document collections: the role of matching procedures and special lexica," International Journal on Document Analysis and Recognition (IJDAR), vol. 14(2), 2011, pp. 159-171.