Research Paper:
Understanding Cultural Similarities of Archaeological Sites from Excavation Reports Using Natural Language Processing Technique
Fumihiro Sakahira*, Yuji Yamaguchi**, and Takao Terano***
*Faculty of Information Science and Technology, Osaka Institute of Technology
1-79-1 Kitayama, Hirakata-City, Osaka 573-0196, Japan
Corresponding author
**Research Institute for the Dynamics of Civilizations, Okayama University
3-1-1 Tsushimanaka, Kita-Ku, Okayama-City, Okayama 700-8530, Japan
***Platform for Arts and Science, Chiba University of Commerce
1-3-1 Konodai, Ichikawa-City, Chiba 272-8512, Japan
In this study, we applied natural language processing (NLP) techniques to the texts of excavation reports on buried cultural properties, calculating the degree of similarity between reports in order to identify archaeological sites that are highly similar to one another. Specifically, we validated whether the similarity of sentence embeddings of the excavation reports is consistent with the existing classification of the sites. Four archaeological sites classified in existing archaeological research papers were used. For validation, 128 excavation reports from the four sites were used; sentence embeddings were obtained using Doc2Vec. We obtained the following results: 1) When applying NLP to excavation reports to determine the similarities of archaeological sites, merging the texts for each site into a single document before processing was preferable to processing each volume of the excavation report separately. 2) The similarity based on Doc2Vec sentence embeddings of the excavation reports was more consistent with the classification of the characteristics of the archaeological sites than the similarity based on term frequency–inverse document frequency (TF-IDF). 3) When targeting a specific period, the sentence embedding obtained exclusively from the text of the relevant period is consistent with the classification of the characteristics of the archaeological sites based on the artifacts and structural remains of that period. 4) When a specific period is targeted, the period-exclusive sentence embeddings obtained through the additive compositionality of sentence embeddings can be used to classify the characteristics of archaeological sites based on the artifacts and structural remains of that period. Consequently, the similarities of texts based on NLP can reflect the similarities of archaeological sites. This holds true even for excavation reports that include spelling inconsistencies, optical character reader misrecognition, and garbled words.
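As a rough illustration of result 1 and of the TF-IDF baseline named in result 2, the following minimal sketch merges the report volumes of each site into a single document and compares sites by cosine similarity. It is not the authors' pipeline: the site names, the tokens, the smoothed IDF formula, and the helper functions `merge_volumes`, `tfidf_vectors`, and `cosine` are all assumptions made for this example. The study itself obtained sentence embeddings with Doc2Vec, which is not reproduced here.

```python
import math
from collections import Counter

# Hypothetical toy data: already-tokenized report volumes, grouped by site.
reports_by_site = {
    "site_A": [["pit", "dwelling", "pottery", "yayoi"], ["pottery", "ditch", "yayoi"]],
    "site_B": [["pit", "dwelling", "pottery"], ["yayoi", "ditch", "pottery", "moat"]],
    "site_C": [["kiln", "roof", "tile"], ["tile", "temple", "foundation"]],
}


def merge_volumes(volumes):
    """Concatenate all report volumes of one site into a single token list."""
    return [token for volume in volumes for token in volume]


def tfidf_vectors(docs):
    """Map each document name to a sparse {term: TF-IDF weight} vector.

    Uses a smoothed IDF, log((1 + N) / (1 + df)) + 1, chosen for this sketch.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = {}
    for name, tokens in docs.items():
        term_freq = Counter(tokens)
        vectors[name] = {
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in term_freq.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# One merged document per site, as preferred in result 1.
site_docs = {site: merge_volumes(vols) for site, vols in reports_by_site.items()}
vectors = tfidf_vectors(site_docs)

sim_ab = cosine(vectors["site_A"], vectors["site_B"])  # overlapping vocabulary
sim_ac = cosine(vectors["site_A"], vectors["site_C"])  # no shared terms
print(f"A-B: {sim_ab:.3f}, A-C: {sim_ac:.3f}")
```

In this toy setup the two sites with overlapping vocabulary score higher than the pair with none. With real reports, a Japanese tokenizer and a trained embedding model would replace the hand-made token lists and the TF-IDF weighting.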
This article is published under a Creative Commons Attribution-NoDerivatives 4.0 International License.