A Semantic Search Engine for Historical Handwritten Document Images

Ngo, Vuong M.; Munnelly, Gary; Orlandi, Fabrizio; Crooks, Peter; O’Sullivan, Declan; Conlan, Owen

doi:10.1007/978-3-030-86324-1_7

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12866)

Included in the following conference series:

International Conference on Theory and Practice of Digital Libraries

Abstract

A very large number of historical manuscript collections are available in image formats and require extensive manual processing before they can be searched. We therefore propose and build a search engine that automatically stores, indexes and efficiently searches manuscript images. First, a handwritten text recognition technique converts the images into textual representations. We then apply named entity recognition and a historical knowledge graph to build a semantic search model that can understand the user’s intent in the query and the contextual meaning of concepts in documents, so that the correct transcriptions and their corresponding images are returned to users.

1 Introduction

Every year, large collections of historical handwritten manuscripts held by museums, libraries and other organisations are digitised as electronic images. Digitisation makes the manuscripts available to a wider audience and preserves cultural heritage. However, automatically recognising text and named entities in medieval and early-modern manuscript sources with high accuracy remains a challenge [2, 20, 22]. To be accessed and searched, manuscript images are often processed through keyword spotting or word recognition [4, 8, 14, 17, 18]. Several papers build search systems for handwritten images [1, 5, 15, 16, 21, 23]. However, these systems only offer keyword search.

Unlike keyword search, semantic search improves search precision and recall by understanding the user’s intent and the contextual meaning of concepts in documents and queries [3, 12, 19, 24]. This paper proposes a semantic search engine for full-text retrieval of historical handwritten document images based on named entities (NEs), keywords (KWs) and a knowledge graph (KG). The engine not only processes, stores and indexes manuscripts automatically, but also allows users to access and retrieve them quickly and efficiently.

2 System Architecture

The Public Record Office of Ireland (PROI) was destroyed on 30 June 1922, resulting in the loss of 700 years of Irish history. The Beyond 2022 Project (https://beyond2022.ie) is combining historical research, archival discovery, and technical innovation to create a virtual reconstruction of the PROI. There are over 300 volumes of surviving and collected handwritten copies of lost documents, with some 100,000 pages containing 25 million words of text.

The architecture of our search engine is illustrated in Fig. 1. It has four separate processing modules: Handwritten Text Recognition, NE Recognition, KW-NE Indexing and the KW-NE-Based IR Model. First, the historical handwritten document images are converted into transcriptions by the Handwritten Text Recognition module. Then, the transcriptions are annotated with NEs by the NE Recognition module, which connects to the Knowledge Graph to extract the classes and identifiers of the NEs. Next, the KWs and NEs of the annotated transcriptions, together with the respective original images, are represented and indexed by the KW-NE Indexing module and stored in the KW-NE Annotated Text and Image Repository. The raw text query is likewise annotated with NEs by the NE Recognition module to become a KW-NE annotated query. Finally, the KW-NE-Based IR Model module compares the annotated query with the annotated documents and returns the ranked transcriptions and images.
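The data flow among the four modules can be outlined in a few lines of Python. This is only an illustrative sketch: the function names, the toy lookup tables and the overlap-count ranking are assumptions standing in for the HTR, NER, indexing and IR components, and are not the system's actual code.

```python
from dataclasses import dataclass, field

# Hypothetical stand-ins: a real system would call an HTR model and an NER/EL tool.
FAKE_HTR = {
    "page_001.jpg": "the sheriff of Meath",
    "page_002.jpg": "pay silver to the clerk",
}
FAKE_KG = {"sheriff": "occu_sheriff", "meath": "coun_meath", "clerk": "occu_clerk"}
STOP_WORDS = {"the", "to", "of", "we", "you"}


@dataclass
class AnnotatedDoc:
    image: str
    transcription: str
    concepts: set = field(default_factory=set)


def recognise_text(image):
    """Module 1 (stub): image -> transcription."""
    return FAKE_HTR[image]


def annotate(text):
    """Module 2 (stub): text -> concepts; known entities become KG identifiers."""
    concepts = set()
    for token in text.lower().split():
        if token in STOP_WORDS:
            continue
        concepts.add(FAKE_KG.get(token, token))
    return concepts


def build_index(images):
    """Module 3 (stub): keep each image with its transcription and concepts."""
    index = []
    for image in images:
        text = recognise_text(image)
        index.append(AnnotatedDoc(image, text, annotate(text)))
    return index


def search(query, index):
    """Module 4 (stub): rank documents by the number of shared concepts."""
    q = annotate(query)
    scored = [(len(q & doc.concepts), doc) for doc in index]
    scored = [(s, doc) for s, doc in scored if s > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]
```

Calling `search("sheriff Meath", build_index(list(FAKE_HTR)))` annotates the query with the same function used for documents, so query and documents are compared in the same concept space.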

3 Image Representation and Knowledge Graph

Transkribus [13] is used for training and deploying Handwritten Text Recognition (HTR) models to derive text transcriptions from image scans. Given the rate at which transcriptions can be generated, NE Recognition (NER) and Entity Linking (EL) are required to automatically annotate all instances of entities occurring in the transcription text. We used SpaCy [11] for NER and obtained good results on 18th-century English text. To provide flexibility, an NLP pipeline has been implemented as a thin layer over a number of standard NLP tools. The output of the pipeline is in the NLP Interchange Format [10], in which a NER tool has annotated the classes of entities and, where possible, an EL tool has connected the recognised entities to the KG.

The KG collects structured data from various historical sources. Part of the data is manually curated by historians through spreadsheets. Other data sources (e.g. geographical data from OSi [6]) are imported automatically as RDF for direct insertion into the KG. The schema (or ontology) used to structure the KG is mainly based on the popular CIDOC-CRM ontology [7]. A short excerpt of the KG is depicted in Fig. 2. It shows a few main entities and relationships related to a person (of type CIDOC-CRM:E21_Person) named “William Sutton”, who was a member of a few relevant offices in Ireland.
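To make the graph structure concrete, the sketch below stores a few triples as plain Python tuples and looks up what is known about one entity. Only the class CIDOC-CRM:E21_Person and the name "William Sutton" come from the text; the identifiers `pers_william_sutton` and `offi_example` and the predicates `has_name` and `member_of` are hypothetical placeholders, not the actual vocabulary of the KG.

```python
# Triples are (subject, predicate, object). All identifiers and predicates
# other than the E21_Person class are hypothetical placeholders.
TRIPLES = [
    ("pers_william_sutton", "rdf:type", "CIDOC-CRM:E21_Person"),
    ("pers_william_sutton", "has_name", "William Sutton"),
    ("pers_william_sutton", "member_of", "offi_example"),
    ("offi_example", "has_name", "Example Office"),
]


def describe(subject, triples):
    """Group all predicate/object pairs that share the given subject."""
    description = {}
    for s, p, o in triples:
        if s == subject:
            description.setdefault(p, []).append(o)
    return description


def find_by_name(name, triples):
    """Return the subjects whose name matches, ignoring case."""
    return [s for s, p, o in triples if p == "has_name" and o.lower() == name.lower()]
```

A lookup such as `find_by_name("william sutton", TRIPLES)` is the kind of step an entity linker needs in order to turn a recognised name into an identifier and a class.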

4 Information Retrieval Model and Demo

A search engine needs not only to return the best documents, but also to be fast. We implemented the index and search functions on top of Elasticsearch to obtain a real-time search engine [9]. The Okapi BM25 model is used to find and rank the handwritten manuscripts relevant to a query. In the model, documents and queries are represented by sets of concepts, each of which is an NE or a KW. Figure 3 presents an image of a handwritten medieval historical manuscript, its transcription and its concept set d, as used in the model. In the transcription, our NER tool distinguishes three kinds of words: (1) stop-words, namely the, to, of, we and you; (2) NEs, namely sheriff, Meath, clerk and William Sutton; and (3) KWs, namely king, &c, greeting, direct, pay, shilling and silver. The stop-words are not added to the concept set d.
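For reference, Okapi BM25 scores a document against a query by summing, over the query concepts, an inverse document frequency weight multiplied by a saturated, length-normalised concept frequency. The sketch below implements one common variant of that formula over concept lists. The parameter values k1 = 1.2 and b = 0.75 and the three toy documents are assumptions for illustration, not values or data reported for this system.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document (a list of concepts) against the query concepts."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each concept.
    df = Counter(c for d in docs for c in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for c in set(query):
            if c not in tf:
                continue
            idf = math.log(1 + (n_docs - df[c] + 0.5) / (df[c] + 0.5))
            norm = tf[c] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[c] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy concept lists mixing NE identifiers and keywords (illustrative only).
docs = [
    ["occu_sheriff", "coun_meath", "king", "pay", "shilling", "silver"],
    ["occu_clerk", "king", "greeting"],
    ["coun_meath", "direct", "pay"],
]
query = ["coun_meath", "silver"]
scores = bm25_scores(query, docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Here the first document matches both query concepts, the third matches only coun_meath, and the second matches none, so `ranking` orders them first, third, second.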

Figure 4 presents the interface of our search engine (see Note 1) and the concept sets of \(q_1\) and \(q_2\). Here, coun_meath is the identifier of an entity named Meath with class Country, as determined by our NER algorithm, whereas silver and shilling are keywords. To exploit the features of NEs for semantic search, an NE must be represented by its most specific meaning in the concept set d. That is, for each NE in the transcription:

If our NER can determine its identifier, the NE is represented by that identifier in d. For example, occu_sheriff, coun_meath and occu_clerk are the identifiers of the entities named sheriff, Meath and clerk, and are added to d.
If our NER can determine only its most specific class, the NE is represented by a combination of its name and class. For example, the entity named William Sutton does not exist in our historical KG, so its identifier cannot be extracted. However, the NER determines that its most specific class is Person, so william_sutton/person is added to d.
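The two rules above can be sketched as a small mapping function. The KG lookup table, the function names, the input format (a list of recognised entities with name and class) and the class label Occupation are assumptions made for this illustration; the identifiers and the william_sutton/person form follow the examples in the text.

```python
# Toy KG lookup: lower-cased entity name -> identifier (assumed format).
KG_IDENTIFIERS = {
    "sheriff": "occu_sheriff",
    "meath": "coun_meath",
    "clerk": "occu_clerk",
}
STOP_WORDS = {"the", "to", "of", "we", "you"}


def entity_concept(name, entity_class):
    """Rule 1: use the KG identifier if known. Rule 2: fall back to name/class."""
    key = name.lower()
    if key in KG_IDENTIFIERS:
        return KG_IDENTIFIERS[key]
    return f"{key.replace(' ', '_')}/{entity_class.lower()}"


def concept_set(entities, keywords):
    """Build the concept set d from recognised entities and remaining keywords."""
    concepts = {entity_concept(name, cls) for name, cls in entities}
    concepts.update(k.lower() for k in keywords if k.lower() not in STOP_WORDS)
    return concepts


entities = [
    ("sheriff", "Occupation"),  # class label is an assumption
    ("Meath", "Country"),
    ("clerk", "Occupation"),
    ("William Sutton", "Person"),  # not in the toy KG, so rule 2 applies
]
keywords = ["king", "&c", "greeting", "direct", "pay", "shilling", "silver", "the"]
d = concept_set(entities, keywords)
```

Because William Sutton has no entry in the lookup table, it is stored as `william_sutton/person`, which still lets a query for a person of that name match more precisely than the bare keywords would.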

5 Conclusion

We proposed a novel semantic full-text search system for images of historical handwritten manuscripts. Unlike existing approaches, which use only KWs extracted from images, we exploited NEs, KWs and a KG to increase search performance. To this end, HTR and NER tools were built to recognise transcriptions and NEs from the manuscript images. In addition, a historical KG was designed to increase the precision of our NER tool. We then implemented the index and search functions for transcriptions based on Elasticsearch and Okapi BM25 so that images can be searched in real time. Finally, the semantic search engine was implemented and deployed.

Notes

1. https://by2022.adaptcentre.ie/conf_demo

References

1. Aghbari, Z., Brook, S.: HAH manuscripts: a holistic paradigm for classifying and retrieving historical Arabic handwritten documents. Expert Syst. Appl. 36(8), 10942–10951 (2009)
2. Ahmed, R., Al-Khatib, W., Mahmoud, S.: A survey on handwritten documents word spotting. Int. J. Multimed. Inf. Retr. 6(1), 31–47 (2017). https://doi.org/10.1007/s13735-016-0110-y
3. Cao, T., Ngo, V.: Semantic search by latent ontological features. Int. J. New Gener. Comput. 30(1), 53–71 (2012). https://doi.org/10.1007/s00354-012-0104-0
4. Cheikhrouhou, A., Kessentini, Y., Kanoun, S.: Multi-task learning for simultaneous script identification and keyword spotting in document images. Pattern Recogn. 113, 107832 (2021)
5. Colutto, S., Kahle, P., Guenter, H., Muehlberger, G.: Transkribus. A platform for automated text recognition and searching of historical documents. In: Proceedings of the 15th International Conference on eScience (eScience), pp. 463–466 (2019)
6. Debruyne, C., et al.: Ireland’s authoritative geospatial linked data. In: d’Amato, C., et al. (eds.) ISWC 2017. LNCS, vol. 10588, pp. 66–74. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-68204-4_6
7. Doerr, M.: The CIDOC conceptual reference module: an ontological approach to semantic interoperability of metadata. AI Mag. 24(3), 75–92 (2003)
8. Frinken, V., Palakodety, S.: Handwritten keyword spotting in historical documents. In: Handwritten Historical Document Analysis, Recognition, and Retrieval: State of the Art and Future Trends, Series in MP&AI, vol. 89, pp. 81–99. World Scientific Publishing (2021)
9. Gheorghe, R., Hinman, M., Russo, R.: Elasticsearch in Action, 1st edn. Manning Publications Co., Shelter Island (2015)
10. Hellmann, S., Lehmann, J., Auer, S., Brümmer, M.: Integrating NLP using linked data. In: Alani, H., et al. (eds.) ISWC 2013. LNCS, vol. 8219, pp. 98–113. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-41338-4_7
11. Honnibal, M., Montani, I., Van Landeghem, S., Boyd, A.: SpaCy: industrial-strength natural language processing in Python (2020). https://doi.org/10.5281/zenodo.1212303
12. Jiang, Y.: Semantically-enhanced information retrieval using multiple knowledge sources. Clust. Comput. 23(4), 2925–2944 (2020). https://doi.org/10.1007/s10586-020-03057-7
13. Kahle, P., Colutto, S., Hackl, G., Mühlberger, G.: Transkribus - a service platform for transcription, recognition and retrieval of historical documents. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 04, pp. 19–24 (2017). https://doi.org/10.1109/ICDAR.2017.307
14. Kang, L., Riba, P., Villegas, M., Fornés, A., Rusiñol, M.: Candidate fusion: integrating language modelling into a sequence-to-sequence handwritten word recognition architecture. Pattern Recogn. 112, 107790 (2021)
15. Lang, E., Puigcerver, J., Toselli, A.H., Vidal, E.: Probabilistic indexing and search for information extraction on handwritten German parish records. In: Proceedings of 16th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 44–49 (2018)
16. Leydier, Y., Lebourgeois, F., Emptoz, H.: Text search for medieval manuscript images. Pattern Recogn. 40(12), 3552–3567 (2007)
17. Li, Z., Wu, Q., Xiao, Y., Jin, M., Lu, H.: Deep matching network for handwritten Chinese character recognition. Pattern Recogn. 107, 107471 (2020)
18. Martínek, J., Lenc, L., Král, P.: Building an efficient OCR system for historical documents with little training data. Neural Comput. Appl. 32(23), 17209–17227 (2020). https://doi.org/10.1007/s00521-020-04910-x
19. Ngo, V., Cao, T.: Discovering latent concepts and exploiting ontological features for semantic text search. In: Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP-2011), pp. 571–579. ACL (2011)
20. Nozza, D., Manchanda, P., Fersini, E., Palmonari, M., Messina, E.: LearningToAdapt with word embeddings: domain adaptation of named entity recognition systems. Inf. Process. Manag. 58(3), 102537 (2021)
21. Stauffer, M., Fischer, A., Riesen, K.: Filters for graph-based keyword spotting in historical handwritten documents. Pattern Recogn. Lett. 134, 125–134 (2020)
22. Toledo, J., Carbonell, M., Fornés, A., Lladós, J.: Information extraction from historical handwritten document images with a context-aware neural model. Pattern Recogn. 86, 27–36 (2019)
23. Vidal, E., et al.: The carabela project and manuscript collection: large-scale probabilistic indexing and content-based classification. In: The 17th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 85–90 (2020)
24. Wang, J., et al.: A pseudo-relevance feedback framework combining relevance matching and semantic matching for information retrieval. Inf. Process. Manag. 57(6), 102342 (2020)

Acknowledgment

Beyond 2022 is funded by the Government of Ireland, through the Department of Culture, Heritage and the Gaeltacht, under the Project Ireland 2040 framework. The project is also partially supported by the ADAPT Centre for Digital Content Technology under the SFI Research Centres Programme (Grant 13/RC/2106_P2).

Author information

Authors and Affiliations

ADAPT Centre, SCSS, Trinity College Dublin, Dublin, Ireland
Vuong M. Ngo, Gary Munnelly, Fabrizio Orlandi, Declan O’Sullivan & Owen Conlan
Department of History, Trinity College Dublin, Dublin, Ireland
Peter Crooks

Corresponding author

Correspondence to Vuong M. Ngo.

Editor information

Editors and Affiliations

OsloMet – Oslo Metropolitan University, Oslo, Norway
Gerd Berget
The Open University, Milton Keynes, UK
Mark Michael Hall
Martin Luther University Halle-Wittenberg, Halle, Germany
Daniel Brenn
Tampere University, Tampere, Finland
Sanna Kumpulainen

Rights and permissions

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

About this paper

Cite this paper

Ngo, V.M., Munnelly, G., Orlandi, F., Crooks, P., O’Sullivan, D., Conlan, O. (2021). A Semantic Search Engine for Historical Handwritten Document Images. In: Berget, G., Hall, M.M., Brenn, D., Kumpulainen, S. (eds) Linking Theory and Practice of Digital Libraries. TPDL 2021. Lecture Notes in Computer Science, vol 12866. Springer, Cham. https://doi.org/10.1007/978-3-030-86324-1_7

DOI: https://doi.org/10.1007/978-3-030-86324-1_7
Published: 07 September 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86323-4
Online ISBN: 978-3-030-86324-1
eBook Packages: Computer Science, Computer Science (R0)
