Ontology-Based Information Retrieval For Historical Documents


2016 Third International Conference on Information Retrieval and Knowledge Management

Ontology-Based Information Retrieval for Historical Documents


Fatihah Ramli1, Shahrul Azman Noah2, Tri Basuki Kurniawan3
1 Faculty of Computer Science & Information Technology, Universiti Malaysia Sarawak, 94300 Kota Samarahan, Sarawak, Malaysia.
2, 3 Faculty of Information Science & Technology, Universiti Kebangsaan Malaysia, 43600 UKM Bangi, Selangor, Malaysia.
[email protected], [email protected], [email protected]

Abstract— This article presents an ontology-based approach to designing and developing a new representation for an IR system instead of the conventional keyword-based approach. Such a representation improves the precision and recall of document retrieval. Experiments carried out on the ontology-based approach and the keyword-based approach demonstrate the effectiveness of the proposed approach.

Keywords-ontology; information retrieval; semantic retrieval;

I. INTRODUCTION
Information Retrieval (IR) research in various fields gives researchers many new ideas for improving existing approaches. Recently, however, history is a field that has received special attention. As discussed in [1], historians still expect a better approach for more accurate access to historical documents. For example, a recent study of the Australian National Library found that the number of visitors increased radically when historical documents were provided as a searchable full-text index [2, 3]. Hence, IR for historical documents is an essential issue to be studied.
Historical documents can be defined as those that keep information related to the time instant at which the documents were published and that, at the same time, are still useful in the future [4]. Searching and retrieving documents from large historical archives proves challenging for the IR field, as historians typically employ their knowledge, experience and intuition to decide which information they need to find and study, and attempt to locate sources that contain that information [8]. Hence, Elena et al. [1] suggest that historians need historical source repositories and tools that will enable them to access comprehensive information rapidly. Conventional IR approaches are mostly based on a simple Bag-of-Words (BOW) approach in which term order is ignored, so that many texts with very different semantic meanings are conflated into a single form. As a result, searching and ranking historical documents based on the BOW approach does not suffice, as the documents contain rich semantic information relating to important entities such as events, times, and people.
Therefore, in this paper we propose an ontology-based approach to index and rank [5] semantically rich historical documents. The ontology developed is centred on the event-related elements that are important to the historical domain. An ontology-based approach to document retrieval is not new, as demonstrated by the work in [16-18]. However, applications of such an approach to historical documents are still scarce and remain open for further research and development. Apart from the ontology-based approach, we also propose a simple ontology-based weighting mechanism derived mainly from the classic tf-idf scoring scheme. We evaluated our proposed approach against the BM25 probabilistic model on 133 documents.
The remainder of the paper is organized as follows. Following this overview of the setting in which ontologies have been used, Section 2 describes in detail the related work on IR for historical documents. Section 3 illustrates the overall process of semantic retrieval, while Section 4 discusses the results obtained from the evaluation of the approach. Finally, Section 5 concludes.

II. RELATED WORK
Applications of IR to historical documents mostly concern spelling issues, whereby users expect modern keywords to match the word forms and spellings found in historical documents [3, 6, 7]. This is because there are too many spelling variants in large collections of historical texts [3]. Full-text indexing of such documents does not suffice, as the modern words used in users’ queries fail to match the index. Two popular approaches to this problem are special matching procedures and lexica for historical language.
Keyword matching procedures, although non-trivial, still do not fully represent the fundamental characteristics of historical documents. Historical documents can be defined as those that keep information related to the time instant at which the documents were published and that, at the same time, are still useful in the future [4]. Elena et al. [1] state that historians employ their knowledge, experience and intuition to decide which information they need to find and study, and attempt to locate sources that contain that information. They also state that historians need historical source repositories and tools that will enable them to access comprehensive information rapidly. The 20th and

978-1-5090-2954-9/16/$31.00 ©2016 IEEE


early 21st centuries have transformed the way people access information.
Hence, users expect that a wealth of historical information can be shared and reused through digital libraries that provide the best-matched document for any search request, answering competency questions as well as supporting a selected scenario [8, 9]. To fulfil such requests, Mirzaee et al. [8] and Corda [9] suggested capturing the semantics of a historical document, which allows a richer representation of its embedded knowledge than is possible with standard text manipulation tools. The use of semantics could be more effective if it is simplified by defining time-based relations. Furthermore, Schockaert et al. [10] suggested that documents should be sorted according to the temporal aspects of their context to improve IR systems. Alonso et al. [11], in turn, note that recognizing and using temporal information is an important feature that can improve the functionality of search applications.
However, the aforementioned works mainly suggest the type of knowledge that should be extracted and modeled for describing historical documents. As such, the application of such ontological knowledge to support semantic retrieval of historical documents is still open for further research.

III. THE PROPOSED APPROACH TO SEMANTIC HISTORICAL DOCUMENT RETRIEVAL
The focus of this work is on historical documents. We chose to scope our work to an ontology and documents relating to the Vietnam War.

A. The Domain Ontology
The development of our history ontology focused mainly on events, following the opinion of various researchers that events are the most important elements in history. With this ontology, historical documents can be retrieved and analysed based on events or other elements related to the events.
In our work, the ontology development was carried out semi-automatically and formalized by domain experts and ontology developers. We reused the existing Simple News and Press (SNaP) ontology and expanded it based on our vocabulary, as shown in Figure 1. The SNaP ontology comprises several ontologies, which describe assets (text, images, video) and the events and entities (people, places, organisations, abstract concepts, etc.) that appear in news content. Although it is meant for news documents, it was found to be suitable in our case as it contains a detailed representation of events as well as documents (i.e. assets). The event ontology inherits fully from the public domain Event Ontology. The object property subEventOf is an rdfs:subPropertyOf event:sub_event with the addition of transitivity. Events are considered compound entities in our domain (i.e. they are rich entities formed through relations with other entities, namely people, organisations, locations and things, both tangible and intangible). Figure 1 shows all the classes that were customized using TopBraid Composer. We imported the SNaP ontology into TopBraid Composer and customized it based on our vocabulary, i.e. the historical domain. The basic classes that matched our domain were event, factor, person, spatial thing (location), and time and date. We then expanded the ontology by adding classes such as country and stuff. The country class was added to identify the country involved in each war, whereas the stuff class includes both tangible and intangible entities and is used to associate the people involved in a war with their country and organization.

Figure 1. Classes based on historical domain.

B. The Semantic Retrieval Framework
The overall framework, shown in Figure 2, describes the overall retrieval process. As shown in the framework, the prototype takes a formal SPARQL query as input. The query is executed against the knowledge base, and the output consists of a list of semantic entities (instances) that meet the requirements of the query. The prototype then retrieves documents based on the matched entities.
The semantic retrieval framework has a knowledge base that is associated with the information sources (the document base) through one or several domain ontologies that describe the concepts appearing in the document text. The concepts and instances in the knowledge base are linked to the documents explicitly and stored in the form of annotations. These annotations are used to create an initial representation for the retrieval and ranking processes.
Figure 3 illustrates the annotation mechanism, which starts with the system taking a set of documents from Wikipedia as input for annotation and indexing. The resulting annotations are then stored in the knowledge base. The document annotation process consists of the following steps:

a. Load the information (the document base) of basic terms by extracting the textual representation of the selected entity. The basic terms were extracted from Wikipedia articles on battles and operations of the Vietnam War. Table 1 shows the list of basic terms.
b. The linguistic analysis is used to filter basic terms and to identify those sets of terms that


can operate as concepts, instances and properties.
c. From the filtered terms, obtain the subset of semantic entities to annotate.
d. The annotations are weighted according to the semantic entity frequencies within individual documents and the whole collection.
e. The annotations are added to a relational database to produce the index list.
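
The annotation steps can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the label-to-entity map, the entity identifiers, the table layout and the two toy documents are all hypothetical, the linguistic analysis is reduced to lower-casing and whitespace normalization, and raw frequencies are stored in place of the final weights. An in-memory SQLite database stands in for the relational database.

```python
import re
import sqlite3
from collections import Counter

# Hypothetical label -> entity map; in the paper these entities come from the ontology.
ENTITY_LABELS = {
    "battle of hamburger hill": "event:BattleOfHamburgerHill",
    "operation apache snow": "event:OperationApacheSnow",
    "a shau valley": "location:AShauValley",
}

def annotate(doc_text):
    """Count how often each known entity label occurs in one document."""
    # Stand-in for linguistic analysis: lower-case and collapse whitespace.
    text = re.sub(r"\s+", " ", doc_text.lower())
    counts = Counter()
    for label, entity in ENTITY_LABELS.items():
        hits = len(re.findall(r"\b" + re.escape(label) + r"\b", text))
        if hits:
            counts[entity] += hits
    return counts

def build_index(docs):
    """Store (entity, document, frequency) annotations in a relational table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE annotation (entity TEXT, doc_id TEXT, freq INTEGER)")
    for doc_id, text in docs.items():
        for entity, freq in annotate(text).items():
            conn.execute("INSERT INTO annotation VALUES (?, ?, ?)", (entity, doc_id, freq))
    conn.commit()
    return conn

# Toy document base (illustrative text, not the evaluation corpus).
docs = {
    "d1": "The Battle of Hamburger Hill was fought during Operation Apache Snow.",
    "d2": "Operation Apache Snow took place in the A Shau Valley. "
          "Operation Apache Snow began in May 1969.",
}
conn = build_index(docs)
rows = conn.execute(
    "SELECT entity, doc_id, freq FROM annotation ORDER BY entity, doc_id"
).fetchall()
for row in rows:
    print(row)
```

Keeping one row per (entity, document) pair means that the list of documents annotated with an entity returned by a query can be fetched with a single lookup on the entity column.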

Figure 3. Document annotation

TABLE I. BASIC TERMS OF HISTORICAL DOMAIN

Items   Basic terms
1       event
2       sub event
3       related event
4       location
5       person
6       date
7       time
8       cause
9       unit
10      belligerent

Figure 2. The semantic retrieval framework

The weighting is based on an adaptation of the classic information retrieval vector space model. In this model, keywords appearing in a document are assigned weights reflecting their importance for describing the document content. Similarly, in this study, annotations are assigned weights that reflect the importance of instances with respect to the documents. Weights are computed automatically by an adaptation of the tf-idf algorithm, based on the frequency of occurrence of the instances in each document. In detail, the weight d_x of an instance x for a document d is computed as in (1):

$$ d_x = \frac{freq_{x,d}}{\max_y freq_{y,d}} \cdot \log \frac{|D|}{n_x} \qquad (1) $$

where freq_{x,d} is the number of occurrences in d of the keywords attached to x, max_y freq_{y,d} is the frequency of the most repeated instance in d, n_x is the number of documents annotated with x, and D is the set of all documents in the search space.
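
To make the weighting concrete, here is a small sketch under stated assumptions. It assumes the weight takes the usual normalized tf-idf form implied by the symbol definitions, that is, freq_{x,d} divided by max_y freq_{y,d}, multiplied by log(|D| / n_x). The natural logarithm, the function name and the toy frequency counts are assumptions made for illustration only.

```python
import math

def annotation_weights(freq):
    """Compute tf-idf style weights for annotations.

    freq maps a document id to {instance id: occurrence count}.
    Every document is assumed to carry at least one annotation.
    """
    n_docs = len(freq)  # |D|
    # n_x: number of documents annotated with instance x
    doc_count = {}
    for counts in freq.values():
        for x in counts:
            doc_count[x] = doc_count.get(x, 0) + 1

    weights = {}
    for d, counts in freq.items():
        max_freq = max(counts.values())  # frequency of the most repeated instance in d
        weights[d] = {
            x: (f / max_freq) * math.log(n_docs / doc_count[x])
            for x, f in counts.items()
        }
    return weights

# Hypothetical annotation frequencies for three documents.
freq = {
    "d1": {"BattleOfHamburgerHill": 2, "OperationApacheSnow": 1},
    "d2": {"OperationApacheSnow": 3},
    "d3": {"BattleOfSaigon1968": 1},
}
weights = annotation_weights(freq)
for d, w in weights.items():
    print(d, {x: round(v, 4) for x, v in w.items()})
```

In this form an instance that annotates every document receives a weight of zero, since log(|D| / n_x) vanishes when n_x equals |D|. Instances that are rare in the collection but frequent within a document receive the highest weights.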
The query execution produces a set of tuples that satisfy the SPARQL query. The semantic entities are then extracted from the tuples and used to access the semantic index to collect all the


documents in the repository that have been annotated with those semantic entities. Once the document list is complete, the search engine calculates the semantic similarity between the query and each document using the classic vector space IR model. Finally, we sort and rank the documents in descending order of their similarity values.

IV. EVALUATION
We compare the proposed approach against the BM25 IR model using a corpus of 133 documents from Wikipedia and a total of five queries. The BM25 IR model is considered state of the art in the IR community and has been widely used by IR researchers to improve search engine relevance [12]. The documents relate to the battles and operations of the Vietnam War. This is an initial evaluation, which we plan to extend with more documents and queries. The queries are listed in Table 2. For the ontology-based approach, the queries were translated into corresponding SPARQL queries. For example, the query “What was the sub-event of the Battle of Ap Bau Bang II?” was translated to:

SELECT *
WHERE {
?event:BattleofApBauBangII pne:subEventOf ?stuff.
}

TABLE II. BASIC TERMS OF HISTORICAL DOMAIN

Query#  Query
Q-1     Find sub event, start date and end date for Battle of Ap Bau Bang II.
Q-2     Find related event and person involved in Battle of Hamburger Hill.
Q-3     Find related event and location for Operation Apache Snow.
Q-4     Find sub event and belligerent involved in Battle of Saigon 1968.
Q-5     Find related event and unit involved in Bombing of Tan Son Nhut Air Base.

Table 3 and Table 4 show the results of the evaluation using the above queries. The MAP results in Table 3 and Table 4 show that semantic retrieval outperforms the conventional keyword-based approach, with MAP=0.9187 compared to MAP=0.6269 for the conventional approach. The results also show that the ontology-based approach retrieved fewer documents, but most of them are relevant. For example, three out of four retrieved documents for Q1 are relevant, and 27 out of 27 for Q2. The results suggest the preciseness of the ontology-based approach.
Figure 4 provides an overall performance comparison between both approaches. It shows the better performance of the proposed ontology-based approach for historical documents.

TABLE III. EVALUATION RESULT FOR KEYWORD-BASED APPROACH (BM-25 model)

Query#  Ret.  Rel. ∩ Ret.  Prec.   Avg. Prec.  Rec.
Q1      119   3            0.0252  0.6905      1.0000
Q2      106   26           0.2453  0.5358      0.9286
Q3      128   23           0.1797  0.3983      0.9200
Q4      117   97           0.8290  0.8987      0.8981
Q5      112   5            0.0446  0.6113      0.8333
MAP                                0.6269

TABLE IV. EVALUATION RESULT FOR ONTOLOGY BASED IR APPROACH (Ontological model)

Query#  Ret.  Rel. ∩ Ret.  Prec.   Avg. Prec.  Rec.
Q1      4     3            0.750   1.0000      1.000
Q2      27    27           1.000   1.0000      0.964
Q3      63    23           0.365   0.7312      0.920
Q4      125   108          0.864   0.8622      1.000
Q5      6     6            1.000   1.0000      1.000
MAP                                0.918

Figure 4. Document annotation.

V. CONCLUSION AND FUTURE WORKS
In this paper, we have discussed an approach to designing and developing a new representation for IR systems that uses ontologies. Experiments were performed on the keyword-based approach and the ontology-based approach to validate the retrieval of documents using five queries.
Preliminary experimental results show that the proposed ontologies improve the precision and recall of document retrieval. In conclusion, we would like to highlight that the semantic retrieval approach can provide better search capabilities, thus achieving an improvement over keyword-based retrieval through the introduction and exploitation of ontologies.
Future research includes further experiments with a larger number of documents and an improved number and coverage of queries. It would also be interesting to have a


generic ontology and document processing approach that can be used for various other event-related documents.
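
As an arithmetic check on the evaluation figures, the sketch below recomputes per-query precision and the mean average precision (MAP) from the values reported in Table 3 and Table 4. The per-query average precision values are taken as given from the tables. The function and variable names are illustrative and not part of the authors' system.

```python
# Per-query results copied from Table 3 (BM25) and Table 4 (ontology-based):
# query -> (retrieved, relevant-and-retrieved, average precision)
BM25 = {
    "Q1": (119, 3, 0.6905),
    "Q2": (106, 26, 0.5358),
    "Q3": (128, 23, 0.3983),
    "Q4": (117, 97, 0.8987),
    "Q5": (112, 5, 0.6113),
}
ONTOLOGY = {
    "Q1": (4, 3, 1.0000),
    "Q2": (27, 27, 1.0000),
    "Q3": (63, 23, 0.7312),
    "Q4": (125, 108, 0.8622),
    "Q5": (6, 6, 1.0000),
}

def precision(retrieved, hits):
    """Fraction of retrieved documents that are relevant."""
    return hits / retrieved

def mean_average_precision(results):
    """MAP is the mean of the per-query average precision values."""
    aps = [ap for _, _, ap in results.values()]
    return sum(aps) / len(aps)

map_bm25 = mean_average_precision(BM25)
map_onto = mean_average_precision(ONTOLOGY)
print(f"MAP BM25     = {map_bm25:.4f}")   # 0.6269
print(f"MAP ontology = {map_onto:.4f}")   # 0.9187

for q in BM25:
    p_kw = precision(*BM25[q][:2])
    p_on = precision(*ONTOLOGY[q][:2])
    print(f"{q}: precision BM25 = {p_kw:.4f}, ontology = {p_on:.4f}")
```

The recomputed means agree with the MAP values quoted in the evaluation, and the per-query precision values agree with the Prec. columns of both tables.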
ACKNOWLEDGMENT
This research was partially supported by the Malaysia
Ministry of Education Grant
LRGS/TD/2011/UITM/ICT/01/02 awarded to the Center for
Artificial Intelligence Technology at the Universiti
Kebangsaan Malaysia.

REFERENCES
[1] T. Elena, A. Katifori, C. Vassilakis, G. Lepouras, and C. Halatsis,
"Historical research in archives: user methodology and supporting
tools," International Journal on Digital Libraries, vol 11(1), 2010, p.
25-36.
[2] A. Gotscharek, A. Neumann, U. Reffle, C. Ringlstetter, and K.U.
Schulz, "Enabling information retrieval on historical document
collections: the role of matching procedures and special lexica,"
Proc. The Third Workshop on Analytics for Noisy Unstructured
Text Data, ACM: Barcelona, Spain, 2009, p. 69-76.
[3] A. Gotscharek, U. Reffle, C. Ringlstetter, K.U. Schulz, and A.
Neumann, "Towards information retrieval on historical document
collections: the role of matching procedures and special lexica,"
International Journal on Document Analysis and Recognition
(IJDAR), vol 14(2), 2011, p. 159-171.
[4] M. Cabo and R. Llavori, "A retrieval language for historical
documents," in Database and Expert Systems Applications, G.
Quirchmayr, E. Schweighofer, and T.M. Bench-Capon, Editors.
Springer Berlin Heidelberg, 1998, p. 216-225.
[5] W. Frakes, "Introduction to information storage and retrieval
systems," Space, vol 14, 1992, p. 10.
[6] M. Koolen, F. Adriaans, J. Kamps, and M. De Rijke, "A Cross-Language
Approach to Historic Document Retrieval," in Advances in Information
Retrieval, Springer Berlin Heidelberg, 2006, p. 407-419.
[7] T. Pilz, W. Luther, N. Fuhr, and U. Ammon, "Rule-based Search in
Text Databases with Nonstandard Orthography," Literary and
Linguistic Computing, vol 21(2), 2006, p. 179-186.
[8] V. Mirzaee, L. Iverson, and B. Hamidzadeh, "Towards ontological
modelling of historical documents," in The 16th International
Conference on Software Engineering and Knowledge Engineering
(SEKE), 2004.
[9] I. Corda, "Ontology-based representation and reasoning about the
history of science," The University of Leeds, 2007.
[10] S. Schockaert, M. Cock, and E. Kerre, "Reasoning about fuzzy
temporal information from the web: towards retrieval of historical
events," Soft Computing, vol 14(8), 2010, p. 869-886.
[11] O. Alonso, M. Gertz, and R. Baeza-Yates, "On the value of
temporal information in information retrieval," SIGIR Forum, vol
41(2), 2007, p. 35-41.
[12] J. Pérez-Iglesias, J. R. Pérez-Agüera, V. Fresno, and Y. Z.
Feinstein, "Integrating the probabilistic models BM25/BM25F into
Lucene," arXiv preprint arXiv:0911.5046, 2009.
