
Detecting Ink from Ancient Documents

Using Deep Neural Networks

M.Sc. in Artificial Intelligence and Machine Learning

Student Name: Wishmitha Samadhi Mendis
Student ID: 2451386
Supervisor: Dr. Alexander Krull
Date: 18th September, 2023
Academic Year: 2023/2024

I confirm that the work was solely undertaken by myself and that no help was provided from
any other sources than those permitted. All sections of the thesis that use quotes or describe
an argument or concept developed by another author have been referenced, including all
secondary literature used, to show that this material has been adopted to support my work.
Contents
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x

1 Introduction 1
1.1 Project Aim and Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Thesis Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2 Background and Related Work 3


2.1 Research Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.1 Historical Background . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.2 Technical Background . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.3 Kaggle Competition . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Deep Learning and Computer Vision Techniques . . . . . . . . . . . . . . . 5
2.2.1 Classical Image Segmentation Methods . . . . . . . . . . . . . . . . 5
2.2.2 Deep Learning Models for Image Segmentation . . . . . . . . . . . . 5
2.2.3 Autoencoders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.4 Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.5 Other Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3.1 EduceLab Scrolls Dataset . . . . . . . . . . . . . . . . . . . . . . . 8
2.3.2 Oxyrhynchus Papyri Image Data . . . . . . . . . . . . . . . . . . . . 8
2.3.3 ALPUB Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.5 Research Gap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

3 Methodology 13
3.1 Character Autoencoder Module . . . . . . . . . . . . . . . . . . . . . . . . 13
3.1.1 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.1.2 Intensity Based Segmentation . . . . . . . . . . . . . . . . . . . . . 15
3.1.3 Character Autoencoder Network . . . . . . . . . . . . . . . . . . . . 16
3.2 Ink Detection Module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2.1 Data Preprocessing and Augmentation . . . . . . . . . . . . . . . . 19
3.2.2 Ink Detection Network . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.3 Handwritten Character Segmentation Network . . . . . . . . . . . . . . . . 21

4 Results and Evaluation 23


4.1 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.1.1 Confusion Matrix for Ink Mask Prediction . . . . . . . . . . . . . . . 23
4.1.2 Accuracy Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.1.3 Precision Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.1.4 Recall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.1.5 Intersection over Union (IoU Score) . . . . . . . . . . . . . . . . . . 24
4.1.6 F-Beta Score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

4.2 Character Autoencoder Module Results . . . . . . . . . . . . . . . . . . . . 25
4.2.1 Intensity Based Segmentation Results . . . . . . . . . . . . . . . . . 25
4.2.2 Character Autoencoder Results . . . . . . . . . . . . . . . . . . . . 27
4.3 Ink Detection Module Results . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.3.1 Loss Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.3.2 Quantitative Results . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.3.3 Visual Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.3.4 Analyzing Generalization Effect for Ink Detection Models . . . . . . . 34
4.4 Handwritten Character Segmentation Results . . . . . . . . . . . . . . . . . 35

5 Discussion and Future Works 37


5.1 Character Autoencoder Module . . . . . . . . . . . . . . . . . . . . . . . . 37
5.1.1 Alternative Workaround for Segmentation Failed Cases . . . . . . . . 37
5.1.2 Modification for the Segmentation Algorithm . . . . . . . . . . . . . 37
5.1.3 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.2 Ink Detection Module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.2.1 Network Architecture and Receptive Field . . . . . . . . . . . . . . . 39
5.2.2 Usage of 3D Convolutions . . . . . . . . . . . . . . . . . . . . . . . 40
5.2.3 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.3 Handwritten Character Segmentation Network . . . . . . . . . . . . . . . . 41

6 Conclusion 42

A GitLab Repository 48

List of Figures
2.1 Main Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Original U-Net Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Fragment Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4 Oxyrhynchus Image Data . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.5 ALPUB Character Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.6 3DCNN Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.7 3DCNN Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

3.1 Proposed Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14


3.2 Different Character Formations . . . . . . . . . . . . . . . . . . . . . . . . 14
3.3 Damaged Region of a Papyrus Scroll . . . . . . . . . . . . . . . . . . . . . 15
3.4 Effect of Small Damaged Regions . . . . . . . . . . . . . . . . . . . . . . . 16
3.5 Intensity Differences in Papyri Images . . . . . . . . . . . . . . . . . . . . . 17
3.6 Network Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.7 Crop Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.8 Fragment Masks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.9 Character Segmentation Network . . . . . . . . . . . . . . . . . . . . . . . 22

4.1 Confusion Matrix for Ink Mask Prediction . . . . . . . . . . . . . . . . . . . 23


4.2 Intensity Based Segmentation General Case Result . . . . . . . . . . . . . . 25
4.3 Dilation Effect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.4 Damaged Region Handling . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.5 Segmentation Failed Case . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.6 Character Autoencoder Training . . . . . . . . . . . . . . . . . . . . . . . . 28
4.7 Character Autoencoder Quantitative Results for the Test Set . . . . . . . . . 29
4.8 Character Autoencoder Visual Results for the Test Set . . . . . . . . . . . . 29
4.9 Loss for Ink Detection Models . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.10 Ink Detection Results for the Training Set . . . . . . . . . . . . . . . . . . . 31
4.11 Character Autoencoder Quantitative Results for the Validation Set . . . . . 32
4.12 The Effect of Output Thresholding . . . . . . . . . . . . . . . . . . . . . . 32
4.13 Visual Results of Ink Detection Networks for the Validation Set - Part 1 . . . 33
4.14 Visual Results of Ink Detection Networks for the Validation Set - Part 2 . . . 34
4.15 Visual Results of Ink Detection Networks for the Test Set . . . . . . . . . . 34
4.16 Loss Variation in Generalization Analysis . . . . . . . . . . . . . . . . . . . 35
4.17 Base Model vs. Pre-trained model Results for the Validation Set . . . . . . . 36
4.18 Character Segmentation Network Results for the Test Set . . . . . . . . . . 36

5.1 Failed Case Handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

List of Tables
2.1 Fragment Data Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 ALPUB Data Character Counts . . . . . . . . . . . . . . . . . . . . . . . . 10

4.1 Autoencoder Reconstruction Results for the Test Set . . . . . . . . . . . . . 27


4.2 Ink Detection Results for the Training Set . . . . . . . . . . . . . . . . . . . 31
4.3 Ink Detection Results for the Validation Set . . . . . . . . . . . . . . . . . . 31
4.4 Base Model vs. Pre-trained model Results for the Validation Set . . . . . . . 35

Acronyms
3DCNN 3D Convolutional Neural Network

Adam Adaptive Moment Estimation

BCE Binary Cross Entropy

CAE Convolutional Autoencoder

CT Computed Tomography

MSE Mean Squared Error

ReLU Rectified Linear Unit

Abstract
This research dives into the intricacies of a formidable challenge: the non-invasive extraction
of ink signals from ancient papyri scrolls, a complex and technically demanding procedure. The
overarching objective is to present scholars with legible text while safeguarding the integrity of
the original scrolls. This research focuses on Herculaneum papyri, which were buried under the
ashes from the eruption of Mount Vesuvius and preserved in a carbonized state for nearly two
millennia. The carbon-based ink used in these papyri poses a unique challenge, as it remains
invisible in X-ray CT scans. With recent advancements in Deep Learning and Computer Vision,
supervised learning models have emerged to capture these elusive ink signals from X-ray CT
images. This research concentrates on unveiling ink signals concealed within Herculaneum
papyri fragments, employing cutting-edge deep-learning neural networks. A notable innovation
in this research is the incorporation of historical document data from diverse sources, via
Transfer Learning, to improve the resolution of this ink detection problem; this benefit is
supported by the results of this research. The contributions of this research not only unravel the
hidden historical context of ancient texts but also contribute to the broader knowledge base
concerning the application of advanced technologies to the restoration of historical documents.

Acknowledgements
I want to express my heartiest gratitude to my supervisor, Dr. Alexander Krull, for providing
valuable insights and guidance to complete this research. I would also like to thank my project
inspector, Dr. Rajesh Chitnis, for his valuable feedback during the project proposal stage and
the demonstration. I would also like to thank the University of Birmingham as an institute and
all the lecturers who taught us throughout the course of M.Sc. in Artificial Intelligence and
Machine Learning for providing quality education and supporting and facilitating us with the
required resources to complete this research. Without the subject knowledge which was taught
by them, completing this research would have been impossible. I would also like to thank
my lecturers from the University of Moratuwa, Sri Lanka, for equipping me with fundamental
Computer Science knowledge to embark on this journey successfully. Finally, all of this would
be impossible without the constant encouragement and support from my parents, family and
friends.

CHAPTER 1
Introduction
Reading texts from ancient scrolls in a non-invasive manner is a difficult problem because it involves a
technically challenging sequence of procedures. The ultimate objective of this process should
be to deliver a readable text to the scholars without damaging the original scrolls. Some
ancient scrolls, which were damaged or delicate in nature, cannot be physically read directly
by humans. These scrolls therefore require a non-invasive method to read what is written on them.
[1], [2]

This project focuses on detecting ink and producing readable text for the writings of Herculaneum
papyri. [3] These papyri scrolls were rolled and buried under the ashes of the ruined city
of Herculaneum after the eruption of Mount Vesuvius and preserved in a solidified carbonized
state for almost 2000 years. These scrolls were initially discovered over two hundred years ago,
but attempts to unroll and read the written text damaged the scrolls and broke them into
fragments. Thus, non-invasive techniques such as X-ray Computed Tomography (CT) [4] were
proposed to figure out the writings of the scrolls. Due to the carbon-based ink used in the
Herculaneum papyri, the contrast between the writing substrate and the written ink in X-ray
CT imaging is not visible to the naked eye, thus making it unreadable. [1] However, with
the advancement of Deep Learning and Computer Vision techniques, these ink signals can
be captured from X-ray CT images with proposed supervised learning models.[1], [3] These
results are still far from providing full readability of these scrolls, and there are numerous
directions to be explored in this research to improve upon the current state-of-the-art results
with the availability of the data.[3]

This research primarily focuses on revealing the ink signals hidden in Herculaneum papyri
fragments by incorporating deep-learning network models. The novelty of this research is
that the possibility of incorporating other available historical document data [5], [6] to tackle
this image segmentation problem has been explored with quantitative experimentations. This
technique is called transfer learning [7] in the machine learning context, and it is being widely
used in scenarios where labelled training data is limited, similar to this problem. [8] The results
of this research show that transfer learning can be applied to this problem to achieve improved
ink detection from the papyri fragment data. [3]

1.1 | Project Aim and Objectives


This project aims to develop a deep learning-based ink detection network for Herculaneum papyri
fragments, which surpasses the current state-of-the-art benchmark results. [1] This trained
model can be applied to detect the ink of rolled Herculaneum scrolls to generate readable texts
for scholars. This will be significant from both technical and historical standpoints because
these findings can reveal a lot of documented knowledge about that era.

Following are the main objectives/deliverables of the project to achieve the aim described
above.

1. Providing conclusive quantitative results based on controlled experimentation that transfer


learning can be applied to the ink detection problem: This includes identifying usable
datasets [5], [6] for this task which match the characteristics of the original dataset [3]
and incorporating features and knowledge learnt from these data into the final model.

2. Developing a deep neural network model that can produce state-of-the-art ink detection
results: [1] This includes designing the optimal network architecture with available
hardware resources and evaluating the model on provided datasets.

3. Producing readable ink prediction results for the Herculaneum papyri fragments: Even
though quantitative metrics are important for evaluation, the visual results
are equally important in this research because readability is an important factor. [3]

1.2 | Thesis Structure


This report is divided into six chapters, starting from the broader knowledge and explanations
required for the reader to understand the context of the project to its specific implementations
and results. Chapter 2 will introduce the reader to the required background regarding the
project, including historical and technical context along with related work and datasets. In
Chapter 3, the detailed implementation of the research is explained procedurally. The visual and
quantitative results of this research are presented and deeply analyzed in Chapter 4. Chapter 5
dives deep into further analyzing the results and proposed methods, providing future research
directions. Finally, the report is concluded with Chapter 6.

CHAPTER 2
Background and Related Work
Related background knowledge required for the reader will be discussed in this section. This
includes historical context, research background, deep learning techniques and datasets used in
this research, a literature review of state-of-the-art related works and the research gap addressed in
this project.

2.1 | Research Background


As this research is significant in both historical and technical aspects, it is
important to analyze the project briefly from both facets.

2.1.1 Historical Background


The outcome of this research will be immensely beneficial for scholars and historians. Current
studies have discovered that these manuscripts contain philosophical work authored by Philodemus
of Gadara, an Epicurean philosopher and poet. Further, some fragments contain historical
information about an ancient Hellenistic dynasty. [3] Historically, there is evidence that Greek
culture had a huge influence in the southern regions of ancient Italy, where Herculaneum was
located during the Roman Empire era. These include Greek colonization remains found in these
regions. [9] This is one of the major reasons why the Herculaneum papyri fragments
contain Greek characters. [10] (see Figure 2.3) Hence, significant historical, philosophical,
social and scientific discoveries can be made by reading the Herculaneum papyri. That would
be a groundbreaking achievement in unearthing the history of ancient Rome and Greece that
was previously assumed to be lost.

2.1.2 Technical Background


Due to the delicate nature of the Herculaneum papyri described above in Chapter 1, directly
reading them is impossible without damaging the scrolls. Hence, a highly technical procedure
should be followed to read the written characters from Herculaneum papyri. [3] The main
steps of this process are represented in Figure 2.1. Each of the steps of this procedure with
state-of-the-art research related to each step is described in the following sections.

X-Ray Computed Tomography


X-Ray Computed Tomography (CT) is the leading technique for non-invasively extracting text
from ancient manuscripts. A volumetric scan of the manuscript can be achieved using X-ray
CT with the desired resolution. Mocella et al. [11] used X-ray phase contrast imaging to reveal
letters in the Herculaneum papyri. Even though X-ray CT imaging could not generate fully
readable texts for the Herculaneum papyri,[4] this technique was successfully utilized to detect
ink writings on other ancient manuscripts like En-Gedi scrolls. [2], [12] The main reason for

Fig. 2.1. Main Pipeline: These computational and technical steps should be followed
sequentially to reveal the ink signals from solidified rolled-up Herculaneum scrolls.

this breakthrough is that the contrast between the ink and the writing substrate in X-ray CT
imaging is sufficient to detect ink signals distinctly, unlike the case with Herculaneum papyri.
[1] For Herculaneum papyri, the extra computational step of ink detection should be performed
to reveal the ink, which is the main goal of this research.

Virtual Unwrapping
Virtual unwrapping is the computer imaging technique that converts 3D volumetric scans
from X-ray CT imaging to a flattened planar volume with detected ink as the texture. This
non-invasive technique involves mainly three steps: segmentation, texturing and flattening.
Segmentation involves identifying continuous layers of the writing substrate from the cross-
section of the volumetric scan automatically or semi-automatically. If the layers are too
complex to segment using an algorithmic approach, a user can guide the segmentation process
interactively. [2] These extracted layers are transformed geometrically into a flattened planar
volume in this unwrapping process. [13]–[16] Even though this is an important step in
revealing ink signals from the rolled Herculaneum papyri scrolls, this step is not required
for the Herculaneum papyri fragments (see Figure 2.3) which are the focus of this research.
However, the ink detection research findings of this research can be directly applied to the
virtually unwrapped Herculaneum papyri scrolls. Hence, the contribution of this project is still
significant to the overall process.

Ink Detection
The ink detection process of Herculaneum papyri directly from X-ray CT scanning is impossible
due to the carbon-based ink used in the writings. The primary initiative of incorporating
supervised learning for ink detection in Herculaneum papyri was carried out by Parker et al. [1]
A 3D Convolutional Neural Network (3DCNN) [17] architecture is used to train a model which
takes a subvolume of a volumetric scan of the scroll fragment as input and predicts whether the
subvolume contains ink or not as a binary classification problem. Further, they also developed
a model to generate a photorealistic rendering for the predicted ink signal by changing the
output softmax layer [18] of the above binary classification model. Further experimentations
were carried out in this research on how the thickness of the ink layer affects the ink detection
accuracy using a handcrafted carbon phantom scroll. [1] The follow-up work on this [3] uses this
3DCNN trained on fragment data to predict ink signals on rolled Herculaneum scrolls. The
results are still far from providing full readability to scholars, suggesting huge room for further
improvement. This machine learning-based process is the main focus of this research. Further
state-of-the-art literature regarding this task is analyzed in Section 2.4.

2.1.3 Kaggle Competition
This problem and the related data were made publicly available through a competition organized
on the Kaggle platform. [19] This work is not eligible for the competition mainly due to the
usage of external data, [5], [6] which is not allowed in the competition. Thus, the direction of
this research was not explored by the proposed solutions of the competition. However, there
were valuable insights for this research in the discussion forums and the proposed solutions.
Some of those solutions are described in detail in Section 2.4.

2.2 | Deep Learning and Computer Vision Techniques


Since this research is mainly based on deep learning and computer vision, it is important to
explore the background of these related techniques and models. The ink detection problem in
this research is approached as an image segmentation problem, which is a classical computer
vision problem. Image segmentation can be defined as the process of dividing an image into
different regions such that each region is homogeneous, but the union of any two adjacent
regions is not homogeneous.[20] These homogeneous regions can be defined based on image
characteristics such as intensity, colour, or semantic meaning of the region. [21] Image
segmentation is one of the major components of this research because both classical and deep
learning-based segmentation methods are utilized. (see Chapter 3.1)

2.2.1 Classical Image Segmentation Methods


Unlike machine and deep learning-based approaches, classical image segmentation methods
do not require any labelled data to perform segmentation. These approaches can use image
characteristics such as colour information, intensity histograms, image texture or edges and
boundaries to perform segmentation. These segmentation methods can be divided into two
main categories based on these used characteristics: region-based methods and boundary-based
methods. [22] Popular region-based methods include thresholding [23], region growing [24]
and clustering. [25] Boundary-based methods include classical edge-detecting algorithms [26]
and energy minimizing methods such as active contours. [27] In this research, one of the
widely used region-based thresholding algorithms, Otsu’s thresholding [28], has been used for
the manual segmentation of papyri images as described in Chapter 3. (see Algorithm 2 to see
the implementation of this algorithm for the specified case)

2.2.2 Deep Learning Models for Image Segmentation


With the rapid usage of deep learning techniques in computer vision and the availability
of labelled image data, deep learning techniques have dominated state-of-the-art image
segmentation techniques in recent years. [29] One of the primary groundbreaking works in
this field is the fully convolutional segmentation network introduced by Long et al. [30] Then
one of the most influential image segmentation networks, U-Net, was introduced in 2015. [31]
A modified version of this architecture is widely used in this research and is described in detail in
Chapter 3. However, it is important to analyze the original architecture first in order to understand
these modifications.
U-Net Model
U-Net features an encoder-decoder structure with skip connections, addressing precise object
localization while maintaining contextual understanding. The encoder extracts high-level
features through convolutional and pooling layers, while the decoder reconstructs segmentation
masks with a high level of detail using the latent representation of the bottleneck layer. The

skip connections directly link features from the encoder to the decoder, enabling the model to
capture fine-grained details and broader context. [31] The detailed architecture of the U-Net
model is represented in Figure 2.2. This encoder-decoder structure of the network makes it
possible to exploit transfer learning, since the encoder and decoder can be trained independently
from each other when the skip connections are omitted. [32]

Fig. 2.2. Original U-Net Architecture [31]: Two 3x3 convolution operations with ReLU in
each layer and 2x2 max-pooling and up-convolution between the layers are similarly used in the
proposed architecture in Chapter 3. (see Figure 3.6).
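For illustration, a minimal PyTorch sketch of this skip-connection idea is given below; the channel counts and depth are illustrative assumptions and do not reproduce the original U-Net configuration:

import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal two-level U-Net-style network illustrating skip connections.
    Channel counts are illustrative, not those of the original paper."""
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
        # The decoder sees upsampled features concatenated with matching encoder features.
        self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 1, 1))

    def forward(self, x):
        e1 = self.enc1(x)              # high-resolution features
        e2 = self.enc2(self.pool(e1))  # lower-resolution, higher-level features
        d = self.up(e2)                # upsample back to the e1 resolution
        d = torch.cat([d, e1], dim=1)  # skip connection: concatenate encoder features
        return self.dec(d)

out = TinyUNet()(torch.rand(1, 1, 64, 64))  # output shape: (1, 1, 64, 64)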

Recent Advances
Recent advancements in image segmentation networks have seen the emergence of novel
architectures, such as SegFormer, which leverages transformers and attention mechanisms for
improved contextual understanding. [33] DeepLabV3 has continued to evolve, introducing
atrous spatial pyramid pooling and dilated convolutions to enhance semantic segmentation
with pixel-level labelling. [34] U-Net++ is a direct extension of the original U-Net and has
gained traction for its improved skip pathway design, further boosting feature fusion across
multiple scales. [35] These developments reflect the continuous dominance of deep learning
architectures in image segmentation in recent years.

2.2.3 Autoencoders
The character autoencoder module, which is described in detail in Chapter 3, plays a major role
in this research, specifically when it comes to feature extraction and learning. Autoencoders
are a fundamental network type in deep learning, especially in unsupervised learning. These
are designed to encode data into a lower latent dimension and decode data by reconstructing
them back to the original data as closely as possible. This requires the network to learn the
most important features from the original data and the optimal representation in the latent

dimension. This enables tasks like data compression, dimensionality reduction, denoising,
explicit clustering, feature extraction, and learning. [18] This ability to learn and extract
features of the autoencoders is mainly exploited in this research. Autoencoder architecture
has evolved into Convolutional Autoencoder (CAE), which is particularly effective for image
data. Researchers have leveraged CAEs in numerous domains, including image reconstruction,
denoising, and generative modelling. [36]

2.2.4 Transfer Learning


Transfer learning is the technique utilized to transfer the extracted features of written character
data from the autoencoder to the ink detection model. (see Chapter 3) Transfer learning is
a fundamental concept in machine learning that involves using knowledge gained from one
task or domain to improve the performance of a related but different task or domain. [7], [8]
Deep learning typically entails using pre-trained neural network models, often trained on large
datasets, as a starting point for new tasks. The model can expedite learning in the target task
by reusing the lower-level features and representations learned in the source task, especially
when labelled data for the latter is limited. [8] This has also been utilized with the U-Net architecture
by Iglovikov et al. [32] Here, a VGG-11 [37] encoder pre-trained on the ImageNet dataset [38] was
used to improve the segmentation performances for small-scaled datasets. This is similar to
the concept used in this research to initialize the decoder weights of the ink detection network
using the pre-trained character autoencoder. (see Chapter 3)
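As a hedged illustration of reusing pre-trained weights in this way (following the idea of Iglovikov et al. [32]; the torchvision calls shown assume a recent torchvision version and are not taken from this thesis):

import torch.nn as nn
from torchvision import models

# Reuse ImageNet-pretrained VGG-11 convolutional features as a segmentation encoder,
# as in [32]; the surrounding decoder is omitted here and would be task-specific.
vgg = models.vgg11(weights=models.VGG11_Weights.IMAGENET1K_V1)
encoder = vgg.features

# Optionally freeze the pre-trained encoder so that only the decoder is trained at first.
for p in encoder.parameters():
    p.requires_grad = False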

2.2.5 Other Concepts


Generalization and Overfitting
Generalization refers to the ability of the model to perform well on data it has never seen
before, indicating that it has learned to capture underlying patterns and features rather than
simply memorizing the training data. On the other hand, overfitting occurs when a model
becomes excessively complex and fits the training data noise, leading to poor performance on
new data. [39] This aspect of the trained models and the effect of transfer learning on this is
experimentally evaluated in Chapter 4. Regularization is the set of techniques that can be
used to reduce overfitting and improve generalization.[40]
Batch Normalization
Batch normalization is a pivotal technique in deep learning that significantly enhances the
training and convergence of neural networks. It works by normalizing the activations of each
layer in a mini-batch of training samples. This process mitigates the internal covariate shift
problem, leading to faster and more stable training and better model generalization. [41] This
is being used in the proposed architecture of this research described in Chapter 3.
Dropout
Dropout is a conceptually simple regularization technique in deep learning which works by
randomly deactivating a fraction of neurons during each forward and backward pass, effectively
making the network more robust and preventing it from relying too heavily on any single
neuron or feature. This stochastic dropout process encourages the neural network to learn
more robust and generalizable features, ultimately improving its ability to generalize to unseen
data. [42] This technique is also utilized in the proposed architecture of this research. (see
Chapter 3)

Adam Optimizer
The Adam optimizer, short for Adaptive Moment Estimation (Adam), is a widely used opti-
mization algorithm in deep learning. Adam maintains two moving averages: the first moment
(mean) and the second moment (uncentered variance) of the gradients. These moving averages
are used to adaptively adjust the learning rates for each parameter during training. This
adaptive learning rate helps stabilize training, accelerates convergence, and often results in
faster training times than traditional stochastic gradient descent (SGD). [43] Adam optimizer
was used in the training process of all the models in this research.
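For reference, the Adam update rules from [43] can be written as follows, where $g_t$ is the gradient at step $t$, $\beta_1$ and $\beta_2$ are the decay rates of the moving averages, $\alpha$ is the learning rate and $\epsilon$ is a small stability constant:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2$$

$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}, \qquad \theta_t = \theta_{t-1} - \alpha\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon}$$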

2.3 | Datasets
Mainly, three different datasets were utilized in this research. They are,

1. EduceLab Scrolls Dataset

2. Oxyrhynchus Papyri Image Data

3. ALPUB Dataset

2.3.1 EduceLab Scrolls Dataset


This data consists of the fragments of Herculaneum Papyri provided as volumetric X-ray CT
scan data along with corresponding ink labels as represented in Figure 2.3. These volumes
contain 65 layers provided in .tiff image format. The data extraction and scanning process
is described in detail by Parsons et al. [3]. The dataset consists of three training fragments
with corresponding ink labels and two test fragments without the ink labels. The quantitative
details of the fragment data are represented in Table 2.1. It can be seen that Fragment 2 is
considerably larger than the other two fragments in terms of resolution and size. The dataset
adds up to 37 GB in total. However, this is a compressed version of the
original fragment data [3], totalling up to 1.8 TB in size. Due to the hardware limitations,
these data had to be downsized, as explained later in Chapter 3.

Table 2.1
Fragment Data Details

Fragment Type Resolution (width x height) Average Size per Volume Slice
Fragment 1 Training 6330 x 8181 98.7 MB
Fragment 2 Training 9506 x 14830 268.2 MB
Fragment 3 Training 5249 x 7606 76.1 MB
Fragment a Testing 6330 x 2727 32.9 MB
Fragment b Testing 6330 x 5454 65.8 MB
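For illustration, such a fragment volume could be loaded as in the following sketch; the numbered .tiff file layout, the tifffile library and the function name are assumptions rather than details given in this chapter:

import numpy as np
import tifffile

# Hedged sketch: assumes each fragment directory holds 65 numbered .tiff slices
# (e.g. fragment1/00.tif ... fragment1/64.tif).
def load_fragment_volume(fragment_dir: str, num_layers: int = 65) -> np.ndarray:
    slices = [tifffile.imread(f"{fragment_dir}/{i:02d}.tif") for i in range(num_layers)]
    return np.stack(slices, axis=0)  # resulting shape: (layers, height, width)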

2.3.2 Oxyrhynchus Papyri Image Data


The Oxyrhynchus papyri dataset consists of 158 volumes of papyri fragments [5] found in
ancient Egypt, dating back to the early centuries CE, relatively close to the period of the Herculaneum
papyri. Sixty-seven of these volumes are publicly available online, each consisting of many
papyri fragments. This research uses volume XV to volume XX of the Oxyrhynchus dataset to
create a small dataset consisting of more than 250 papyri image crops with minimal damaged
regions. Figure 2.4 represents such a papyri fragment sample.

Fig. 2.3. Fragment Data: Consists of surface volume data (top) and corresponding ink labels
(bottom). A slice of volumetric data is represented in the top row for each fragment.

2.3.3 ALPUB Dataset


The ALPUB dataset consists of separate handwritten Greek characters extracted from Oxyrhynchus
papyri. It contains 205,797 cropped images of characters from 12,070 manuscript fragments
from the Oxyrhynchus Papyri collection. [6] This dataset covers all 24 characters of the Greek
alphabet, including upper case and lower case variations. (see Figure 2.5) The exact counts
for each character are presented in Table 2.2. By analyzing these quantities, it can be seen
that vowels like alpha, epsilon, iota, omicron, and upsilon have a higher number of samples.
The distribution of these characters is not uniform because characters such as vowels occur
more often than other characters in papyri. This dataset is mainly used for the character
segmentation network described in Chapter 3.

2.4 | Related Work


Since this is a novel research problem, direct previous related work is very limited. Even though
related background work has been discussed in the above sections, diving deep into direct

Fig. 2.4. Oxyrhynchus Image Data: The red region from the full fragment (right) is cropped out
as a sample (left) for the dataset as it contains clear handwritten characters with few damaged
regions.

Fig. 2.5. ALPUB Character Data: Dataset consists of such image data covering all 24 Greek
alphabet characters for both lowercase and uppercase.

Table 2.2
ALPUB Data Character Counts

Greek Character No. of Samples Greek Character No. of Samples


Alpha 21200 Nu 20056
Beta 1296 Omega 7640
Chi 4329 Omicron 22756
Delta 5689 Phi 2930
Epsilon 14909 Pi 9031
Eta 7348 Psi 513
Gamma 3721 Rho 10453
Iota 13830 Tau 14817
Kappa 8110 Theta 3717
Lambda 6906 Upsilon 8129
Lunate Sigma 10151 Xi 641
Mu 6924 Zeta 701

previous works is also very important. Even though there have been attempts to reveal ink
signals from Herculaneum papyri using X-ray CT scanning, [4], [11] the only direct previous work
which utilized machine or deep learning techniques was the research carried out by Parker et
al. [1] The basic idea of this research was discussed briefly in a previous section. But here,
this work is analyzed in depth.

Contrary to the proposed approach in this research, which formulates the ink detection
problem as a segmentation problem, Parker et al. [1] approached it as a binary classification
problem. As represented in Figure 2.6, the network takes a subvolume from the fragment
volume as the input and predicts whether the voxel contains ink or not by predicting the
corresponding probabilities. This approach does not take the character shape or any other
high-level external features into context when making the prediction. This is evident in the
results of this approach, [1], [3], which are represented in Figure 2.7. These results are noisy
compared to the results of the proposed approach (see Chapter 4) because the ink signal
for each voxel of the fragment volume is predicted independently of the surrounding voxels and
without considering any other high-level features. Further, many false positive ink signals appear in
the non-ink regions, which is not the case in the proposed solution. (see Chapter 4) However,
this research provides insights into the problem, including which layers of the subvolume should
be selected and how the input subvolume size affects the prediction accuracy, etc. [1]

Fig. 2.6. 3DCNN Architecture [1]: 34x34x34 input subvolume taken from the input X-ray
CT scan is subjected to convolution operations and a fully connected layer which outputs the
probability of having ink or not having ink for the input subvolume.

The limitations of the 3DCNN [1] are addressed in most of the proposed solutions in the
Kaggle competition by approaching this as an image segmentation problem. [19] The model
which achieved the highest score in the competition [44] proposed a model with a 3D to
2D UNeter encoder [45] and a 2D SegFormer decoder. [33] Another very recent publication
regarding ink detection from Herculaneum papyri fragments is done by Quattrini et al. [46],
which should be analyzed in future studies.

2.5 | Research Gap


All of the proposed solutions in the Kaggle competition [19] and the previous work [1], [3],
[46] strictly limit themselves to work on the problem using original EduceLab Scrolls fragment
dataset. None of the previous works have explored the opportunity to incorporate similar
handwritten ancient papyri datasets [5], [6] to solve this problem. With the wide usage of the
deep learning techniques used in this problem, transfer learning [8] can be the breakthrough
factor due to the limited amount of labelled data available for this problem. Hence, in this
research, similar handwritten papyri image datasets have been used to extract features
that are then incorporated in the ink detection process. This process is described in detail
in the next chapter.

Fig. 2.7. 3DCNN Results [1], [3]: Ink signal prediction for fragment data using the 3DCNN.
The results are noisy and have a lot of false positives in non-ink regions.

CHAPTER 3
Methodology
The main experimental setup of this research and the main procedural steps are discussed in
this section. The proposed methodology is designed to achieve the following two main tasks
using the datasets described in Section 2.3. They are,

1. Ink Detection

2. Character Encoding

The basic idea of the proposed pipeline is to use the learned features during character encoding
in the ink detection process. This will allow the proposed ink detection model to learn from
these features and make better predictions. This is a classic example of the usage of transfer
learning in the same domain to enhance the prediction performance of the target task. [47]
Even though the input modalities of the two tasks are different, the tasks performed by the
backend decoder of both proposed models are similar. They both produce binary segmented
character images from a latent input data representation, thus allowing the opportunity to
perform transfer learning. A basic overview of the proposed methodology is represented in
Figure 3.1.

3.1 | Character Autoencoder Module


The character autoencoder module utilizes the Oxyrhynchus papyri [5] data and ALPUB
dataset [6] to learn character representation using an autoencoder model. [48] However, the
output modality of the proposed network should be the same as the ink detection network,
which is a binary segmented image. The input and the target of an autoencoder model should
be the same. Hence, the autoencoder model input should also be a binary segmented image.
Thus, intensity-based thresholding can be utilized to generate binary segmented images from
RGB images of the dataset.

3.1.1 Data Preprocessing


As described in the above section, the handwritten Greek character data can be found in two
formations. The first is the character-separated dataset provided in the ALPUB data, and the
other is the full papyri scroll images provided in the Oxyrhynchus papyri data. One of the
significant disadvantages of using character-separated datasets like ALPUB is that the autoencoder
model will not be able to learn the shape representations of in-between regions of the characters.
This is represented in Figure 3.2.
Therefore, Oxyrhynchus papyri image data were utilized to train the autoencoder model due
to the completeness of its data. However, due to the raw nature of the data, extensive
preprocessing should be done on Oxyrhynchus papyri images before performing segmentation
on them.

Fig. 3.1. Proposed Methodology: Consists of two modules, Ink Detection Module (top) and
Character Autoencoder Module (bottom), focusing on achieving corresponding tasks.

Fig. 3.2. Different Character Formations: The images of the ALPUB dataset (top) have
separated handwritten Greek characters, whereas even in-between regions of the characters
can be obtained from Oxyrhynchus papyri images. (bottom)

Manual Cropping of Less Damaged Regions
Original Oxyrhynchus papyri images contain extensively damaged regions due to the age
of the scrolls. The handwritten characters in those regions are not visible and cannot be used
for segmentation or the training of the autoencoder model. This kind of damaged region of a
papyrus scroll is represented in Figure 3.3.

Fig. 3.3. Damaged Region of a Papyrus Scroll: This represents two non-damaged (top) and
damaged (bottom) regions of the same papyrus scroll obtained from the Oxyrhynchus papyri
dataset.

Hence, these damaged regions should be cropped out when obtaining image data from the
Oxyrhynchus papyri dataset. Over 250 undamaged regions were manually
cropped out of selected Oxyrhynchus papyri to create a dataset for segmentation and for training
the autoencoder model.
Small Damaged Region Filling
Even though extensively damaged regions were excluded from the papyri image dataset,
small damaged regions can still appear on these images. The main problem with these small
damaged regions is that they can affect the intensity-based binary thresholding segmentation
output results. This is represented in Figure 3.4.

As explained in Figure 3.4, the damaged region-filling algorithm represented in Algorithm 1


should be performed to obtain the expected result. The basic procedure of the algorithm
is performing binary thresholding on the smoothened input image to identify the damaged
regions and label them. Here, the threshold can be hard predefined since the intensity of the
damaged regions is very high compared to the rest of the regions. Finally, the mean intensity
value of the non-damaged regions is calculated, and the damaged regions are replaced with
that intensity value.

3.1.2 Intensity Based Segmentation


After the damaged region filling, intensity-based binary segmentation was performed on
images to separate handwritten characters from the background. Figure 3.4 represents one
of the results of this process. Here, Otsu’s thresholding [28] method was utilized to perform

Fig. 3.4. Effect of Small Damaged Regions: A small damaged region of a papyrus image
is represented in the first row. If intensity-based binary thresholding is applied to this image
directly, it will lead to a failed result. (second row) The result of the region-filling algorithm is
represented in the third row. Intensity-based binary segmentation on the region-filled image
leads to the expected result. (fourth row)

segmentation. The detailed steps of the algorithm are represented in Algorithm 2. The basic
idea of the algorithm is finding the threshold that separates the image into background and
foreground such that it minimizes the intra-class intensity variance while maximizing the
inter-class intensity variance.

This approach works better than selecting a hard threshold value for segmentation because
the threshold value will be adaptive across images. There is an intensity difference between the
handwritten ink and the papyri background across images. This is represented in Figure
3.5. Once a unique threshold that can separate the background papyri and the foreground
handwritten characters is calculated, further postprocessing steps, including erosion/dilation
and removing disconnected regions, are done on the segmented image to produce cleaner
results.

3.1.3 Character Autoencoder Network


The character autoencoder utilizes the segmented handwritten characters from the previous step
to learn to reconstruct them. The network architecture for the autoencoder is represented
in Figure 3.6. This is a convolutional autoencoder similar to a classical U-Net architecture [31]
without skip connections. 256x256 crops from binary segmented images are provided as the

Algorithm 1: Region Filling Algorithm
Data: Papyrus Image with Damaged Region
Result: Region Filled Image
1 image ← convertRGBtoGray(inputImage);
2 image ← medianBlur(image);
3 image ← gaussianBlur(image);
4 segImage ← image;
5 for x, y in segImage do
6     if segImage[x][y] ≥ threshold then
7         segImage[x][y] ← 1;
8     else
9         segImage[x][y] ← 0;
10 meanBackgroundIntensity ← mean(image where segImage[x][y] = 0);
11 for x, y in image do
12     if segImage[x][y] == 1 then
13         image[x][y] ← meanBackgroundIntensity;
14 return image;
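For illustration, Algorithm 1 could be implemented as the following Python/OpenCV sketch; the fixed threshold value and the blur kernel sizes are assumptions rather than the exact values used in this work:

import cv2
import numpy as np

def fill_damaged_regions(input_image: np.ndarray, threshold: int = 200) -> np.ndarray:
    # Convert to grayscale and smooth to suppress noise before thresholding.
    gray = cv2.cvtColor(input_image, cv2.COLOR_RGB2GRAY)
    gray = cv2.medianBlur(gray, 5)
    gray = cv2.GaussianBlur(gray, (5, 5), 0)

    # Damaged regions are much brighter than the papyrus background (assumed threshold of 200).
    damaged_mask = gray >= threshold

    # Replace damaged pixels with the mean intensity of the undamaged background.
    mean_background = gray[~damaged_mask].mean()
    filled = gray.copy()
    filled[damaged_mask] = int(mean_background)
    return filled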

Fig. 3.5. Intensity Differences in Papyri Images: Due to the intensity differences across the
images, a hard global threshold will not work across all the images in the dataset. Thus, an
adaptive thresholding method like Otsu’s thresholding [28] is required.

input and the target for the autoencoder, which learns the reconstructions using Binary Cross
Entropy (BCE) loss represented in Equation 3.1. BCE loss is used instead of Mean Squared
Error (MSE) because the segmented images are binary.

$$\text{Binary Cross-Entropy Loss} = -\frac{1}{N}\sum_{i=1}^{N}\left[\,y_i \log(p_i) + (1-y_i)\log(1-p_i)\,\right] \qquad (3.1)$$

Diving deeper into the network architecture, it consists of an encoder that reduces the spatial
dimensionality of the original 256x256 image to 4x4x512 latent space and reconstructs the
original image using the decoder from this latent representation. Rectified Linear Unit (ReLU)
activation [49] represented in Equation 3.2 is used after each convolution operation with
additional dropout in the decoder to avoid overfitting. The output layer of the decoder consists
of a Sigmoid activation represented in Equation 3.3 to predict the probability of a given pixel
being within the region of the written character.

Algorithm 2: Otsu’s Thresholding Algorithm
Data: Input Image
Result: Binary Segmentation Image
1 image ← convertRGBtoGray(inputImage);
2 histogram ← computeHistogram(image);
3 totalPixels ← number of pixels in image;
4 minVariance ← ∞;
5 threshold ← 0;
6 for t in 0 to 255 do
7     backgroundPixels ← Σ_{i=0..t} histogram[i];
8     foregroundPixels ← Σ_{i=t+1..255} histogram[i];
9     backgroundMean ← (Σ_{i=0..t} i · histogram[i]) / backgroundPixels;
10    foregroundMean ← (Σ_{i=t+1..255} i · histogram[i]) / foregroundPixels;
11    backgroundVariance ← (Σ_{i=0..t} (i − backgroundMean)² · histogram[i]) / backgroundPixels;
12    foregroundVariance ← (Σ_{i=t+1..255} (i − foregroundMean)² · histogram[i]) / foregroundPixels;
13    withinClassVariance ← (backgroundPixels · backgroundVariance + foregroundPixels · foregroundVariance) / totalPixels;
14    if withinClassVariance < minVariance then
15        minVariance ← withinClassVariance;
16        threshold ← t;
17 for x, y in image do
18     if image[x][y] ≥ threshold then
19         image[x][y] ← 1;
20     else
21         image[x][y] ← 0;
22 return image;
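In practice, an equivalent result can be obtained with OpenCV's built-in Otsu thresholding; a brief hedged sketch follows, where the variable filled is assumed to be the 8-bit grayscale, region-filled image:

import cv2

# cv2.THRESH_OTSU picks the threshold automatically from the histogram; the polarity
# flag (THRESH_BINARY vs. THRESH_BINARY_INV) controls whether ink or background is
# labelled 1 in the output mask.
t, binary = cv2.threshold(filled, 0, 1, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)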

$$f(x) = \begin{cases} x, & \text{if } x > 0 \\ 0, & \text{otherwise} \end{cases} \qquad (3.2)$$

$$\sigma(x) = \frac{1}{1 + e^{-x}} \qquad (3.3)$$
The training was done using the Adam optimizer [43] with a learning rate of 0.001 and a batch
size of 16 for 30 epochs. These hyperparameters were optimized using the grid search method
along with network architecture-related parameters such as convolution kernel sizes, latent
space dimensions and dropout probabilities.
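A minimal PyTorch sketch of a convolutional autoencoder of this kind is given below; the channel widths, the use of 4x4 transposed convolutions and the exact block counts are illustrative assumptions and do not reproduce the network in Figure 3.6 exactly:

import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    """Illustrative convolutional autoencoder: 256x256x1 -> 4x4x512 latent -> 256x256x1."""
    def __init__(self, in_channels: int = 1):
        super().__init__()
        def down(c_in, c_out):
            return nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                                 nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))
        def up(c_in, c_out):
            return nn.Sequential(nn.ConvTranspose2d(c_in, c_out, 4, stride=2, padding=1),
                                 nn.ReLU(inplace=True), nn.Dropout(0.1))
        # Encoder: six stride-2 blocks take the 256x256 input down to a 4x4x512 latent tensor.
        self.encoder = nn.Sequential(down(in_channels, 32), down(32, 64), down(64, 128),
                                     down(128, 256), down(256, 512), down(512, 512))
        # Decoder: mirrors the encoder without skip connections and ends with a Sigmoid.
        self.decoder = nn.Sequential(up(512, 512), up(512, 256), up(256, 128),
                                     up(128, 64), up(64, 32),
                                     nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
                                     nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = ConvAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.BCELoss()
x = torch.rand(16, 1, 256, 256)   # a batch of binary character crops (dummy data here)
loss = criterion(model(x), x)     # the input is also the reconstruction target
loss.backward(); optimizer.step()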

3.2 | Ink Detection Module


The ink detection module also consists of a U-Net network architecture similar to the architecture
represented in Figure 3.6. The only difference in this architecture is the input shape, where
the 256x256x1 input of the autoencoder network is now changed to 256x256xℓ. Here, 256
is the crop size of the input, and ℓ is the number of cross-sectional layers selected for the

Fig. 3.6. Network Architecture: The Network Architecture is inspired by U-Net architecture
[31]. It consists of two sub-networks: encoder and decoder. The encoder takes a 256x256x1
image (for character autoencoder) or 256x256xℓ surface volume (for ink detection) and
reduces it to a latent dimension of 4x4x512 through 3x3 convolutions represented in blue.
The neuron activation of the subsequent layer is represented in red. ReLU activations and
2D Batch Normalizations have been used between each encoder layer. The decoder takes
the 4x4x512 latent representation of the original data as the input and upsamples it to an
output segmentation map of 256x256x1 for both the character autoencoder and the ink
detection network. This common decoder architecture allows the feature transfer between
the character autoencoder and the ink detection network. The decoder performs 3x3 up-
convolution operations in each layer with a 0.1 dropout probability.

training. This input volume is convolved to the same latent space dimensionality of the above
architecture, and the decoder part is kept the same without any difference. This allows the
incorporation of the weights learned in the character autoencoder in the decoder of the ink
detection network.

3.2.1 Data Preprocessing and Augmentation


Data Size Reduction
One of the major challenges of this project is handling large-sized image data with limited
hardware resources. As described in Chapter 2.3, one scroll fragment consists of 65 layers of
papyri with different ink signals across them. Previous works by Parker et al. [1] and Parsons
et al. [3] showed that not all layers contribute equally to the ink detection process due to
the varying ink signals of each layer. Thus, incorporating all 65 layers for ink detection is not
necessary. These studies also show that subsurface layers of the papyri fragment contain less
ink due to the ink seeping through the papyri layers, and the middle layers contain significant
ink signals. Because of these reasons, the middle 32 (ℓ = 32) layers of the surface volume data
of the fragments are selected for the training process. This helps reduce the input data size by
half. However, the spatial resolution of the input surface volume data, as described in
Chapter 2.3, was still too large for the available hardware resources. Thus, the spatial dimensions
were also reduced by a quarter of the original size.
Image Cropping
Even though the size reduction was performed on the input data, the resized fragment data
are still too large to be loaded onto the main and GPU memory. Thus, the fragments should

be cropped into segments and used as batches for the training process. Here, a crop size of
256 is selected to match the dimensionality of the input of the autoencoder network. Further,
a 256x256 square can roughly capture a single written character from the resized fragments from
the above step, as represented in Figure 3.7. The cropping process is done for both surface
volumes and corresponding ink labels. The crops outside the fragment regions are ignored
using the provided fragment masks represented in Figure 3.8.
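A hedged sketch of this cropping step is given below; the variable names and the rule for discarding crops outside the mask are illustrative assumptions:

import numpy as np

def extract_crops(volume, ink_label, fragment_mask, crop=256):
    # volume: (layers, H, W); ink_label and fragment_mask: (H, W)
    crops = []
    _, h, w = volume.shape
    for y in range(0, h - crop + 1, crop):
        for x in range(0, w - crop + 1, crop):
            if fragment_mask[y:y + crop, x:x + crop].any():  # skip regions outside the fragment
                crops.append((volume[:, y:y + crop, x:x + crop],
                              ink_label[y:y + crop, x:x + crop]))
    return crops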

Fig. 3.7. Crop Size: The crop size of 256x256 is selected such that it can completely capture
a single written character of the resized fragments.

Fig. 3.8. Fragment Masks: These masks can be utilized to separate the regions within (white)
and outside (black) the fragments.

Data Augmentation
Subtle transforms were applied to these crops to augment more data samples and enhance
the model’s generalisation. These transforms consist of 0 to 10 degrees of slight rotations
and small zooming-out transforms to capture more regions within a single crop. Extreme
transforms cannot be applied because that will lead to disorientation of the written characters
and a mismatch with character autoencoder inputs.
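A hedged sketch of such paired augmentations using torchvision's functional transforms is shown below; the zoom range and the use of torchvision itself are assumptions, and the label is assumed to be a (1, H, W) tensor:

import random
import torchvision.transforms.functional as TF

def augment(volume, label):
    # Apply the same mild rotation and zoom to the surface-volume crop and its ink label
    # so the pair stays aligned; the rotation range follows the text (0 to 10 degrees).
    angle = random.uniform(0.0, 10.0)
    scale = random.uniform(0.9, 1.0)   # slight zoom-out, assumed range
    volume = TF.affine(volume, angle=angle, translate=[0, 0], scale=scale, shear=[0.0])
    label = TF.affine(label, angle=angle, translate=[0, 0], scale=scale, shear=[0.0])
    return volume, label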

3.2.2 Ink Detection Network
The network is trained with a 256x256x32 surface volume as the input and the corresponding
256x256x1 ink label as the target. The 32 surface cross-sections of the input volume are
considered separate input channels in this network. Like the character autoencoder, BCE loss
with ReLU activations and Sigmoid activation in the final layer is used. The network encoder
reduces the dimensionality of the input 256x256x32 volume to 4x4x512 latent space, and
the decoder will predict the binary ink labels using this latent space, similar to the character
autoencoder. Here, three models were trained based on their weight initialization and the
training procedure.

1. Base Model: The decoder weights are randomly initialized and trained.

2. Freezed Model: The decoder weights are initialized using the decoder weights of the
character autoencoder. However, the decoder weights are frozen and not updated during
training. Only the encoder is trained.

3. Pre-trained Model: The decoder weights are initialized using the decoder weights of the
character autoencoder and trained.

The effect of feature incorporation can be determined by analyzing the results of these three
models. The training was also done using the Adam optimizer with a learning rate of 0.001
and batch size of 16 for 150 epochs. However, the learning rate was increased to 0.01 for the
freezed model since the loss decrement was too low for the initial learning rate of 0.001. The
batch size of 16 was selected to make it consistent with the character autoencoder. Other
hyperparameters were tuned using the grid search method. Five percent of the training samples
were separated out as the validation set, and the model parameters which performed best
on the validation set were saved after 150 epochs for the three aforementioned models. The
results for these models are quantitatively analysed in Chapter 4.
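A hedged PyTorch sketch of these three initialization strategies is given below; InkDetectionNet and char_autoencoder are assumed names, and both networks are assumed to expose a decoder submodule as in the autoencoder sketch above:

import copy

# Base model: decoder weights are left at their random initialization.
base_model = InkDetectionNet(in_channels=32)

# Pre-trained model: copy the character autoencoder's decoder weights, then train everything.
pretrained_model = InkDetectionNet(in_channels=32)
pretrained_model.decoder.load_state_dict(char_autoencoder.decoder.state_dict())

# Freezed model: same initialization, but the decoder is frozen so only the encoder trains.
freezed_model = copy.deepcopy(pretrained_model)
for p in freezed_model.decoder.parameters():
    p.requires_grad = False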

3.3 | Handwritten Character Segmentation Network


Even though this network was not directly incorporated into the final pipeline of this research
to predict ink signals from the papyri fragments, the outcome of this implementation would be
useful for ink detection from ancient documents in general. This network utilizes the ALPUB
dataset [6], where the previously described intensity-based thresholding segmentation process is
applied to generate ink masks. Then, these ink masks are used as the target, and the ALPUB
character images are used as input to train the network with the same U-Net architecture
represented in Figure 3.6. The total of 205,797 samples in the ALPUB dataset is split into a
train:validation:test ratio of 0.8:0.1:0.1. The model is trained for ten epochs with a batch
size of 64 with BCE Loss and Adam optimizer. [43] The best model is selected based on the
validation loss. The results for this model are represented in Chapter 4.
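For illustration, the split described above could be produced as in the following sketch; alpub_dataset is an assumed PyTorch Dataset of (character image, ink mask) pairs, and the fixed seed is an assumption:

import torch
from torch.utils.data import random_split

n = len(alpub_dataset)
n_train, n_val = int(0.8 * n), int(0.1 * n)
train_set, val_set, test_set = random_split(
    alpub_dataset, [n_train, n_val, n - n_train - n_val],
    generator=torch.Generator().manual_seed(0))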

Fig. 3.9. Character Segmentation Network: The network is trained using Greek character
images as input and corresponding segmented ink mask as the target.

CHAPTER 4
Results and Evaluation
The results and the corresponding visual and quantitative evaluation for the experimental steps
explained in Chapter 3 are discussed in this chapter. Further, an insight into the evaluation
metrics used is also provided.

4.1 | Evaluation Metrics


As described in Chapter 3, both the character autoencoder and ink detection modules produce
binary segmented images as the final output. Thus, evaluation metrics for binary segmentation
[50] are used to analyze the produced results quantitatively. The following are the main
evaluation metrics used in this research.

4.1.1 Confusion Matrix for Ink Mask Prediction


In the problem of ink mask prediction, which is equivalent to a binary segmentation/classification
task, each classified pixel of the image falls into one of the four cases represented in Figure 4.1. If
a pixel is classified as ink by the model and it is also ink in the ground truth, it is considered
a True Positive (TP). If a pixel is classified as not ink but is ink in the ground truth, it is
considered a False Negative (FN). If a pixel is classified as ink but is not ink in the ground truth,
it is considered a False Positive (FP). Finally, if a pixel is classified as not ink and it is not
ink in the ground truth, it is considered a True Negative (TN).

Fig. 4.1. Confusion Matrix for Ink Mask Prediction: This matrix covers all the possible scenarios
which can occur in the ink mask prediction, which is equivalent to a binary segmentation
scenario.

These measures will be utilized when defining the evaluation metrics used for the quantitative
result analysis of this research. In those definitions, these terms have the following meaning.
TP ← True Positive Pixel Count of the Predicted Image
FP ← False Positive Pixel Count of the Predicted Image
FN ← False Negative Pixel Count of the Predicted Image
TN ← True Negative Pixel Count of the Predicted Image
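For clarity, these four counts can be computed directly from a predicted binary mask and its ground truth with a few lines of NumPy, as in the small helper below (an illustrative sketch consistent with the definitions above, not code taken from the implementation).

import numpy as np

def confusion_counts(pred, target):
    pred, target = pred.astype(bool), target.astype(bool)
    tp = int(np.sum(pred & target))      # predicted ink, actually ink
    fp = int(np.sum(pred & ~target))     # predicted ink, actually background
    fn = int(np.sum(~pred & target))     # predicted background, actually ink
    tn = int(np.sum(~pred & ~target))    # predicted background, actually background
    return tp, fp, fn, tn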

4.1.2 Accuracy Score
Accuracy defined in Equation 4.1 quantifies the proportion of correctly classified pixels for a
given predicted and ground truth image pair. It has a value range of 0 to 1, and a higher
accuracy score indicates better overall pixel-wise classification performance. Accuracy may
not be suitable for imbalanced datasets. [50] Thus, better evaluation metrics are required to
quantify the results properly.

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} = \frac{TP + TN}{\mathrm{Total\;Pixel\;Count}} \qquad (4.1)$$
4.1.3 Precision Score
Precision defined in Equation 4.2 measures the proportion of correctly predicted positive pixels
out of all predicted positives. It has a value range of 0 to 1, and a higher precision score
implies fewer false positives. This is important when minimizing false positives.

$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (4.2)$$
4.1.4 Recall
Recall defined in Equation 4.3 quantifies the proportion of correctly predicted positive pixels
out of all true positives. It has a value range of 0 to 1, and a higher recall means better
detection of actual positive pixels. This is useful in minimizing false negatives.

$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (4.3)$$
4.1.5 Intersection over Union (IoU Score)
IoU score, defined in Equation 4.4, measures the ratio of the intersection of predicted and
ground truth regions to their union for a given predicted and ground truth image pair. [50] It
has a value range of 0 to 1, and a higher IoU indicates better overlap between predicted and
true regions. IoU of 0 means no overlap, while 1 implies a perfect match.

$$\mathrm{IoU} = \frac{TP}{TP + FP + FN} \qquad (4.4)$$
4.1.6 F-Beta Score
F-Beta score, defined in Equation 4.5, is a metric used in binary classification and segmentation
tasks that combine precision and recall into a single score. It is derived from the harmonic
mean of precision and recall. [50] F-beta score ranges from 0 to 1. A score of 1 indicates
perfect precision and recall, meaning all positive predictions are correct, and all actual positives
are correctly predicted. A score of 0 indicates the worst performance, where either precision or
recall (or both) is zero.

$$F_\beta = (1 + \beta^2) \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\beta^2 \cdot \mathrm{precision} + \mathrm{recall}} \qquad (4.5)$$
Based on 𝛽 value, more evaluation metrics can be defined. The F1 score (also known as Dice
Coefficient [51]) corresponds to the F-beta score with 𝛽 = 1, and the F0.5 score corresponds to
the F-beta score with 𝛽 = 0.5. Both of these metrics are used in this research for quantitative
analysis. The corresponding mathematical equations can be found in Equation 4.6 and 4.7.

$$F_1\;\mathrm{Score} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} = \mathrm{Dice\;Coefficient} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN} \qquad (4.6)$$

$$F_{0.5}\;\mathrm{Score} = \frac{1.25 \cdot \mathrm{precision} \cdot \mathrm{recall}}{0.25 \cdot \mathrm{precision} + \mathrm{recall}} \qquad (4.7)$$
All of the evaluation metrics described above will be used in the quantitative analysis of the
results.
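The sketch below shows how Equations 4.1-4.7 translate into code, computing every metric from the four confusion-matrix counts; the small epsilon guarding against division by zero is an implementation convenience, not part of the definitions.

def segmentation_metrics(tp, fp, fn, tn, beta=1.0, eps=1e-9):
    accuracy = (tp + tn) / (tp + fp + fn + tn + eps)    # Equation 4.1
    precision = tp / (tp + fp + eps)                    # Equation 4.2
    recall = tp / (tp + fn + eps)                       # Equation 4.3
    iou = tp / (tp + fp + fn + eps)                     # Equation 4.4
    f_beta = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall + eps)  # Eq. 4.5
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "iou": iou, f"f{beta:g}": f_beta}

# Example: F1 (Dice) and F0.5 for a prediction with 90 TP, 10 FP, 5 FN and 895 TN pixels.
f1 = segmentation_metrics(90, 10, 5, 895, beta=1.0)["f1"]      # ~0.923
f05 = segmentation_metrics(90, 10, 5, 895, beta=0.5)["f0.5"]   # ~0.909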

4.2 | Character Autoencoder Module Results


4.2.1 Intensity Based Segmentation Results
General Case
A quantitative evaluation of this process is impossible because segmented ground truth
images for the Oxyrhynchus papyri data are unavailable. Thus, only a visual evaluation is possible.
The segmentation result for one of the general cases is represented in Figure 4.2. Here, the
ink characters and the papyri background are separated as expected. The effect of dilation is
quite evident in this result, as the horizontal inkless lines in the original image are not visible
in the segmented image. Further, small inkless noisy regions are also segmented as ink regions
in the segmented image, providing better continuity for the segmented characters. However,
this dilation process can also have a negative effect on the character details, as represented
in Figure 4.3. Even though the subsequent erosion step can trade off such effects, some fine
details will be lost, such as in the above case.

Fig. 4.2. Intensity Based Segmentation General Case Result: The ink characters and papyri
background are separated as expected in most regions.

Damaged Region Handling


As discussed in Chapter 3, handling damaged regions is an important step in the segmentation
process, as small damaged regions appear in most papyri images. The effect of these damaged
regions on the segmentation process is represented in Figure 3.4. Figure 4.4 represents more
results on damaged region handling. The detailed algorithm of this process is described in
Algorithm 1. This algorithm works across all the image samples without any failed cases
because the intensity values of the damaged regions are always high (white in colour) for all
images. This also justifies the decision to fill these regions with the mean colour of the
non-damaged regions (steps 10-13 in Algorithm 1), as it allows the subsequent Otsu's
thresholding algorithm [28] (Algorithm 2) to perform the binary classification accurately
without any erroneous artefacts.

Fig. 4.3. Dilation Effect: The fine character details (middle part of the highlighted character)
of the highlighted regions are lost because of the dilation effect.

Fig. 4.4. Damaged Region Handling: None of the highlighted extensively damaged regions
(red boxes in left) or other smaller damaged regions appear in the segmented result. (right)
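As an illustration of this procedure, a minimal sketch is given below: bright, damaged pixels are filled with the mean intensity of the intact papyrus before Otsu's threshold is computed. The damage cut-off of 240 and the ink-equals-one mask convention are assumptions made for the sketch, not values taken from Algorithm 1.

import numpy as np
from skimage.filters import threshold_otsu

def segment_ink(gray, damage_cutoff=240):
    gray = gray.astype(float)
    damaged = gray >= damage_cutoff              # near-white, damaged/missing papyrus
    filled = gray.copy()
    filled[damaged] = gray[~damaged].mean()      # fill with the mean of the intact regions
    t = threshold_otsu(filled)                   # single global threshold (Algorithm 2)
    return (filled < t).astype(np.uint8)         # darker ink pixels labelled 1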

Failed Cases
The implemented methodology for intensity-based segmentation fails in certain cases; Figure
4.5 represents one of them. Of the 257 papyri images in the dataset, 23 samples are such failed
cases. Thus, the proposed segmentation approach has a failure rate of approximately 9%. The
main reason for this is the varying background intensity of the papyri image. By further
analyzing Figure 4.5, it can be observed that there is a diagonal colour gradient across the
image: the top right part of the papyri image has a darker background colour compared to
the rest of the regions, while the bottom region has a lighter background colour. This leads to
more than two colour-intensity regions in the image. Two of those are the aforementioned
background regions, and the other is the written ink character region.

Fig. 4.5. Segmentation Failed Case: The segmentation fails due to the intensity variation in
the papyri background. The top right corner has a darker background colour intensity, while
the bottom part has a lighter colour intensity.

The Otsu's thresholding algorithm used here can only divide the image into two regions, the
foreground and the background (see steps 6-16 in Algorithm 2). Having more than two
colour-intensity regions, as explained above, forces the algorithm to combine two of these
regions, since it only selects a single threshold to perform the binary segmentation. In the
case of Figure 4.5, it can be seen that the darker background region pixels of the top right
are combined with the ink character region pixels when performing binary segmentation, thus
leading to a failed scenario. An alternative workaround to overcome this issue is discussed in
Chapter 5.

4.2.2 Character Autoencoder Results


The training process of the character autoencoder is described in Chapter 3. The autoencoder
was trained for 30 epochs, and the BCE loss for training and validation sets are displayed in
Figure 4.6. The training was limited to 30 epochs because the network achieved a very low
average training loss, nearing 0.05. Reconstruction results of the autoencoder for the test
set are represented below. The model was trained on 34,832 image crops from segmented
Oxyrhynchus papyri images, and a validation set of 4,352 crops was used to pick the best model
parameters. The quantitative and visual results were evaluated on the remaining 4,352 image
crops of the test set. The quantitative results are represented in Table 4.1 and Figure 4.7.
Table 4.1
Autoencoder Reconstruction Results for the Test Set

Model Accuracy Precision Recall IoU Score F1 Score F0.5 Score


Autoencoder 0.9024 0.9557 0.9562 0.7387 0.9559 0.9558

Figure 4.8 represents the visual results of the character autoencoder. By analyzing the
quantitative results, it can be seen that the IoU Score is lower than other evaluation scores.
One of the possible reasons for this is that the autoencoder generates much smoother results
than the input image, as can be seen in Figure 4.8. Due to the dimensionality reduction within
the autoencoder, it does not capture enough features to reconstruct the finer noisy details
around the ink mask edges in the input (these details occur as a result of the intensity-based
thresholding segmentation). This is visible in the second column of Figure 4.8, where the ink
reconstruction around the edges is blurry, meaning that the network is not confident in
reconstructing those finer details. Thus, a sharper image with smooth details can be obtained
by applying a threshold of 0.5 to the original output. This thresholded image is the final
reconstruction result of the character autoencoder. These smoother details are in fact better
suited to handwritten ink characters, since that is the nature of the real ink strokes.

Fig. 4.6. Character Autoencoder Training: Both the training and validation loss decrease with
the epoch number.

These results show that the autoencoder can learn the relevant features and reconstruct the
handwritten characters. This allows the learned knowledge, in the form of trained parameters,
to be transferred to the ink detection network.

4.3 | Ink Detection Module Results


4.3.1 Loss Analysis
As explained in Chapter 3, three models were trained for the ink detection based on the
feature transferring of the character autoencoder. They are the base model, freezed model and
pre-trained model. Here, the visual and quantitative results for these three separate models
are analyzed. The models were trained for 150 epochs, and the BCE loss variation with epoch
number for all the models is represented in Figure 4.9. The pre-trained model achieved the
lowest validation loss, achieving a BCE loss of around 0.15, while both models were unable
to achieve a loss below 0.2. This is a major indication proving that weight initialization

28
Fig. 4.7. Character Autoencoder Quantitative Results for the Test Set: All the metrics able
to achieve a score over 0.9 apart from the IoU score.

Fig. 4.8. Character Autoencoder Visual Results for the Test Set: The first row represents the
inputs for the autoencoder. The second row represents the actual reconstructed output of the
autoencoder. In the third row, the actual output is thresholded to generate a sharper image.
This filtered output is considered as the final reconstructed output of the model and subjected
to the quantitative analysis above.

using character autoencoder (parameter transfer) helps achieve better loss. Another major

29
observation is that the loss of the base model stagnated until about the 70 epoch mark. This
indicates that the random weight initialization performs poorly than the weight transferring
from the character autoencoder since this kind of pattern does not appear in the other two
models. [32], [52], [53] Another observation is validation loss variation is high for the pre-trained
model even though it achieves the best loss. This may be due to finding the right balance
between the pre-trained decoder and the untrained encoder. However, the loss decrement rate
of the initial phases (from epoch 10 to 50) is much higher for the pre-trained model compared
to the other two models. The training and validation losses for the freezed model are much
more stable, even in the long term. However, the loss reduction rate is very low for this model,
even with the higher learning rate than the other models. This is mainly due to only the
encoder of this model being trained. Another clear observation is that the base model has an
overfitting effect after about 130 epochs. This proves that transfer learning can help achieve
better generalization, especially for smaller datasets. [52] A further analysis of this is done in
the subsequent section. Finally, the model states with the least validation loss are selected for
all three models and chosen for quantitative analysis.

Fig. 4.9. Loss for Ink Detection Models: BCE loss for the base model (left), pre-trained
model (middle) and freezed model (right)

4.3.2 Quantitative Results


Quantitative results for the ink detection networks are represented below. First, the results
are analyzed for the training set with 851 image crops of the fragment data. [3] Then, the
results are analyzed for the unseen validation set with a sample size of 45. Even though
cross-validation is a better option for such a small-scaled dataset [54], it was not feasible in
this research to train a model multiple times due to the very long training time of the networks.

Analyzing the quantitative results for the training set (see Table 4.2 and Figure 4.10) shows
that the pre-trained model has the highest scores for most of the metrics, and the base model
has the lowest scores. This is a very interesting observation since the decoder of the freezed
model is only initialized and not trained, yet it reproduces better results for the training set
than the base model.
It also has the highest precision score of all the models, meaning it rarely predicts ink-labelled
pixels outside the actual ink regions. In other words, it has a low false positive (FP) rate.
However, the pre-trained model has a significantly higher IoU score with a margin over 0.1,
meaning the result reproduction for the training set is significantly better. This lines up with
the transfer learning results of Iglovikov et al. [32] and Raghu et al. [53]

Quantitative results for the validation set are represented in Table 4.3 and Figure 4.11. A similar
results pattern to the training set can be seen here. Since this image set is not used during
the training, these scores are more important than the training set scores when evaluating the
models.

Table 4.2
Ink Detection Results for the Training Set

Model Accuracy Precision Recall IoU Score F1 Score F0.5 Score


Base Model 0.9153 0.6741 0.7534 0.5687 0.7059 0.6853
Freezed Model 0.9473 0.8342 0.7336 0.6537 0.7692 0.7996
Pretrained Model 0.9604 0.8168 0.897 0.7612 0.8536 0.8308

Fig. 4.10. Ink Detection Results for the Training Set: The base model has the lowest score
for all metrics while the pre-trained model has the highest scores except for Precision. The
pre-trained model also achieved scores over 0.75 for all the metrics.

Table 4.3
Ink Detection Results for the Validation Set

Model Accuracy Precision Recall IoU Score F1 Score F0.5 Score


Base Model 0.9031 0.7055 0.7736 0.5941 0.734 0.7159
Freezed Model 0.9407 0.8456 0.7805 0.6926 0.8065 0.8279
Pretrained Model 0.9531 0.8368 0.923 0.7839 0.8764 0.8519

4.3.3 Visual Results


The same output thresholding technique used in the character autoencoder (see Figure 4.8) is
also used in ink detection networks. All the output values from the final sigmoid layer greater
than 0.5 are rounded to 1, and below are rounded to 0. The effect of this thresholding is
represented in Figure 4.12. The final output of the network is this thresholded/filtered output.
The above quantitative analysis is also done for this model output.

Fig. 4.11. Ink Detection Results for the Validation Set: This has a similar pattern to the
training set results. The base model has the lowest score for all metrics, while the pre-trained
model has the highest scores except for Precision.

Fig. 4.12. The Effect of Output Thresholding: Output thresholding produces much sharper
edges in the final output and consistent intensities in the ink regions.

Visual results for the validation set for the trained models are represented in Figures 4.13 and
4.14. The first row of each figure represents the middle slice of the input fragment volume,
while the second row represents the corresponding ground truth ink labels. The next three
rows represent the results for the base model, pre-trained model and freezed model, respectively.
By analyzing the results, it can be seen that the pre-trained model (fourth row) captured
the finer details of the characters. This is mainly due to the feature transferring from the
character autoencoder and the subsequent training process of the network. [32], [53] However,
the freezed model cannot capture fine details of the characters like the pre-trained model
(see columns 4 and 5 of Figure 4.13). Another observation that can be made is that the pre-trained model
predictions also have higher connectivity in the predicted ink regions. The model is able to
introduce the natural connectivity for the predicted characters even if there are disconnections
in the actual ground truth label. This nature is evident in columns 3,4,8 in Figure 4.13 and
columns 1,2,3 in Figure 4.14. This is a useful characteristic when it comes to producing
more readable results. [3] The base model can capture finer details better than the freezed
model. However, it tends to have more false positives in non-ink regions (see columns 3, 7
and 8 in Figure 4.13). In the last four columns of Figure 4.14, non-character-like regions are
provided as inputs to the models. Here, it can be seen that the freezed model generates very
poor results because its decoder weights are based on character-like features from the character
autoencoder. Thus, the pre-trained model found a balance between the freezed and base models
through the training process applied to the transferred weights.

Finally, the visual results of the models for the test set are represented in Figure 4.15. For
the test set, corresponding ground truth ink labels are not available. [3] Therefore, the visual
results cannot be compared or evaluated quantitatively. However, Greek letter-like shapes
appearing in the test set are described in Figure 4.15. Here also, the pre-trained model achieves
the best visual results with a high level of detail where the other models fail. The freezed model
fails significantly in most cases, showing that weight initialization of the decoder without
training does not generalize to every scenario. These are only the regions where character-like
ink masks are predicted by the models; in most other cases, the predicted ink signals do not
have character-like shapes, underscoring the need to improve upon these results in future research.

Fig. 4.13. Visual Results of Ink Detection Networks for the Validation Set - Part 1: The ink
prediction results for each input slice (first row) are represented in each column. The ground
truth label (second row), base model (third row), pre-trained model (fourth row) and freezed
model (fifth row) are represented, respectively.

Fig. 4.14. Visual Results of Ink Detection Networks for the Validation Set - Part 2: The ink
prediction results for each input slice (first row) are represented in each column. The ground
truth label (second row), base model (third row), pre-trained model (fourth row) and freezed
model (fifth row) are represented, respectively.

Fig. 4.15. Visual Results of Ink Detection Networks for the Test Set: The ink prediction results
for each input slice (first row) are represented in each column. The base model (second row),
pre-trained model (third row) and freezed model (fourth row) are represented, respectively.
Following character can be seen in the columns from left to right: Kappa/𝜅 (column 1),
Lunate Sigma (column 2), Delta/Δ (column 3), Omega/𝜔 or Mu/𝜇 (column 4), Omicron/𝑜
or Upsilon/𝜐 (column 5), Rho/𝜌 (column 6), Alpha/𝛼 (column 7), Epsilon/𝜖 (column 8),
Chi/𝜒 (column 9), Upsilon/Υ (column 10), Theta/𝜃 or Omicron/𝑜 (column 11)

4.3.4 Analyzing Generalization Effect for Ink Detection Models


The main aim of this experiment is to analyse the generalization of the ink detection models.
Transfer learning can help to increase the generalization of the model with proper parameter
transferring. [52] Since parameter transfer was done in the pre-trained model, the generalization
effect was compared between the base and pre-trained models. The previous loss analysis

showed (see Figure 4.9) that the base model tends to overfit with more epochs. To test this
further, the number of training samples was reduced by 20%, and the number of training
epochs was increased to 300.

The loss variation for both models is represented in Figure 4.16. A slight upward trend in
the validation loss can be seen after about 60 epochs for the base model, signifying overfitting.
However, the pre-trained model maintains a slight downward trend throughout the 300 epochs,
indicating better generalization. [52] Finally, the model state with the least validation loss is
selected for each of the two models and used for the quantitative analysis. The quantitative results
are represented in Table 4.4 and Figure 4.17. For most of the evaluation metrics, the score
achieved by the pre-trained model is significantly higher than that of the base model, except for
the precision score. The results obtained by the pre-trained model are almost identical to the
evaluation scores of the previous base model (see Figure 4.11 and Table 4.3), which utilized all
the samples in the training set. These results indicate that transfer learning can help achieve
better generalization, similar to previous studies. [52] This is a very useful development in
ink detection from ancient documents since this experiment showed that utilising other data
sources can overcome the lack of original data through transfer learning.

Fig. 4.16. Loss Variation in Generalization Analysis: BCE loss for the base model (left),
pre-trained model (right)

Table 4.4
Base Model vs. Pre-trained model Results for the Validation Set

Model Accuracy Precision Recall IoU Score F1 Score F0.5 Score


Base Model 0.8701 0.8083 0.378 0.3458 0.4877 0.6131
Pretrained Model 0.8976 0.7187 0.7066 0.5567 0.7069 0.7125

4.4 | Handwritten Character Segmentation Results


The visual results for the test set of the character segmentation network are presented below
in Figure 4.18. Quantitative analysis for the segmentation is not feasible because the ground
truth ink masks are not available in the ALPUB dataset. [6] The ink masks used as the
target for this network are inferred from the intensity-based thresholding segmentation process
described in Chapter 3. Thus, those ink masks cannot be considered the actual ground truth
due to the previously described errors and limitations of the process.

Fig. 4.17. Base Model vs. Pre-trained model Results for the Validation Set: All the evaluation
metric scores are significantly lower for the base model except for Precision.

By analyzing the results in Figure 4.18, it can be seen that the segmentation process works well
in most cases, and character-like ink regions were segmented in the results. However, the
network fails when the image has high-intensity noisy regions (see samples 6 and 13 of the first
row, samples 3 and 4 of the second row, and the corresponding results in Figure 4.18). These
failed cases mainly occur due to the limitations of the intensity-based thresholding segmentation
method, which was described previously. However, these results show that this network can
segment handwritten characters from ancient documents.

Fig. 4.18. Character Segmentation Network Results for the Test Set: Character-like ink
regions are segmented for most of the input samples.

CHAPTER 5
Discussion and Future Works
This chapter provides further discussion of the achieved results and the modifications required
to overcome the failed cases. Suggestions for future research directions are also discussed in
this chapter.

5.1 | Character Autoencoder Module


5.1.1 Alternative Workaround for Segmentation Failed Cases
The intensity-based segmentation algorithm implemented in this research is described in detail
in Chapter 3.1 (see Algorithms 1 and 2), and one of its failed scenarios was described in
Chapter 4 (see Figure 4.5). However, a workaround exists to manage these failed cases
without modifying the existing algorithm: filtering out the image crops that come from the
failed regions. Several image crops from the above failed case in Figure 4.5 are represented
in Figure 5.1. As explained in Figure 5.1, a simple filtering process can separate failed regions
from correctly segmented regions by analyzing the foreground and background pixel
percentages in the image crops. Then, only the crops with a majority of background pixels
are fed to the network for training, which allows the correctly segmented regions of failed
cases to still be utilized. The required modification for the algorithm to
overcome this scenario is discussed in the next section. However, in this research, the described
workaround is chosen because of the availability of sufficient samples to train the autoencoder
even after removing the failed sample crops.
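The filtering itself reduces to a one-line check on each binary crop, as in the sketch below, assuming the crop labels foreground (ink-classified) pixels as 1; the 0.5 cut-off on the foreground share is an illustrative choice rather than a value taken from the thesis implementation.

import numpy as np

def keep_crop(binary_crop, max_foreground_ratio=0.5):
    # Keep only crops dominated by background pixels; foreground-dominated crops
    # are assumed to come from failed segmentation regions and are discarded.
    foreground_ratio = float(np.mean(binary_crop > 0))
    return foreground_ratio < max_foreground_ratio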

5.1.2 Modification for the Segmentation Algorithm


Even though an alternative workaround for these failed cases exists, the potential modifications
to the algorithm should also be discussed; they are described in detail here. These changes
were not implemented at this stage of the research since the failure rate was low and the
workaround approach was sufficient to generate an adequate number of samples. However,
this might not be the case for a different image dataset.

As described in Chapter 4, the failed case occurs when there are more than two colour-intensity
regions in the input papyri image. This mainly happens due to colour differences in the papyri
background within the same image (see Figure 4.5). The calculated threshold for the binary
segmentation is then not sufficient to separate the darker papyri regions from the ink character
regions, since both are classified as foreground. This can be overcome by introducing additional
segmentation levels into the algorithm. The modified algorithm is represented in Algorithm 3.
The basic procedure of the algorithm is initially similar to the previous algorithm (Algorithm 2),
but here the initially segmented foreground is subjected to a further series of a user-defined
number (l) of segmentation operations until the ink and dark papyri regions are completely
separated.

Fig. 5.1. Failed Case Handling: Image crops from failed regions are represented in the top
row, while correctly segmented regions are in the bottom row. The clear contrast between
the two cases is that failed cases have a majority of foreground pixels in the crops, whereas
correctly segmented regions have a majority of background pixels.

At each level, the threshold is updated correspondingly, and finally, this updated threshold is
applied to the input image to perform the binary segmentation. This algorithm would be useful
to implement in a future iteration of this research.

Algorithm 3: Modified Segmentation Algorithm


Data: Input Image, Number of Levels
Result: Binary Segmentation Image
 1  image ← convertRGBtoGray(inputImage);
 2  threshold, foreground ← performOtsuThresholding(image);
 3  for l ← 1 to numberOfLevels do
 4      threshold, foreground ← performOtsuThresholding(foreground);
 5  for x, y in image do
 6      if image[x][y] ≥ threshold then
 7          image[x][y] ← 1;
 8      else
 9          image[x][y] ← 0;
10  return image;
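A possible Python realisation of Algorithm 3 is sketched below (an illustration only, not the thesis implementation), using scikit-image's threshold_otsu. Here the foreground is taken to be the darker side of the threshold (ink plus dark papyrus background), matching the discussion above, while the returned mask follows the algorithm's convention of labelling pixels at or above the final threshold as 1.

import numpy as np
from skimage.filters import threshold_otsu

def multilevel_otsu_segmentation(gray, levels=1):
    gray = gray.astype(float)
    threshold = threshold_otsu(gray)
    foreground = gray[gray < threshold]           # darker pixels: ink + dark papyrus background
    for _ in range(levels):
        if np.unique(foreground).size < 2:        # nothing left to split further
            break
        threshold = threshold_otsu(foreground)    # re-estimate the threshold on the foreground only
        foreground = foreground[foreground < threshold]
    return (gray >= threshold).astype(np.uint8)   # final binary segmentation with the refined threshold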

5.1.3 Limitations
One of the major limitations of the autoencoder module was the collection and processing
of papyri image data. As mentioned in Chapter 3, undamaged regions should be manually
cropped from the available data sources (see Figure 3.3). This process was time-consuming,
and only 257 papyri image crops were collected in this research. With more time, more data
could have been collected, thus leading to better training results in the autoencoder and ink
detection.

In this research, the crop size from the papyri images was selected only to capture a single
character within a crop (see the second row of Figure 5.1). However, training the network with
different crop views can improve the performance of the model. This will allow a single crop
to capture more than a single character or a part of the character, allowing the model to learn
features from different scale levels instead of one. [55] This is an aspect of the character
autoencoder training process which can be improved upon in future iterations of the research.

5.2 | Ink Detection Module


5.2.1 Network Architecture and Receptive Field
Network Architecture Design Decisions
Figure 3.6 represents the network architecture used in the ink detection network. As described
in Chapter 3, even though the architecture was inspired by U-Net [31], some design features
of the U-Net were omitted. The main such feature was skip connections. The primary
reason for avoiding skip connections in the current architecture was to keep the decoder
features independent of the encoder. This was necessary for transferring features from the
character autoencoder module because the encoders of the two models take different data
forms as input, and skip connections would make the decoder conditional on the input data.
However, removing skip connections from the architecture has disadvantages.
According to the studies about U-Net by Ronneberger et al. [31] and Zhou et al. [56], skip
connections allow the learning of multi-scale features for image segmentation. So, these kinds
of advantages will be missed in the current architecture. Thus, U-Net architecture with skip
connections should also be experimented with in future research.

The convolution and pooling layers of the current architecture are designed exactly according
to the original U-Net architecture. [31] At each layer in the encoder, two 3x3 convolution
operations were followed by a max-pool operation. [57] In the decoder, an up-convolution
operation [58] replaces max-pooling. In the original U-Net architecture, a 572x572 image
is taken as the input, producing a segmentation map of size 388x388. But in this problem,
the input and output image sizes should be the same. Because of this, a crop size of 256, a
power of 2, is selected. With a power-of-2 crop size, an output of the same size can be
generated after symmetric max-pool and up-convolution operations, provided the 3x3
convolutions maintain the spatial dimensions by adding padding of size 1 (see Figure 3.6).
Further, in the original U-Net architecture, even though the spatial dimensions of the 572x572
(327,184 pixels) input image are reduced to 30x30 in the bottleneck layer, the overall
dimensionality (921,600) is not reduced once the depth dimension of 1024 is considered. In
the architecture of this research, the 256x256 (65,536) dimensionality of the input image is
reduced to 4x4x512 (8,192) in the latent space, which is significantly lower than the original
input. This dimensionality reduction allows the U-Net to perform as an autoencoder, learning
the features required to reconstruct character shapes from this latent space.

Another modification from the original U-Net architecture [31] was the usage of batch
normalization between layers. [41] Batch normalization can provide stable training and faster
convergence for deeper neural networks. Further, it can help reduce the sensitivity to
parameter initialization, a key consideration in this research due to the parameter transfer
from the autoencoder, and it can also provide additional generalization and regularization.
[41], [59] Thus, these modifications to the original U-Net architecture were required in this
research.

Receptive Field
In convolutional neural networks, the receptive field refers to the region of the input data
that a feature map of a particular convolutional layer responds to. [60] This can be calculated
theoretically using the equations provided by Araujo et al. [61] According to those calculations,
deeper architectures tend to have a very large receptive field. Thus, the network architecture
in this research also has a very large receptive field, with many convolution layers in the
encoder and decoder combined with max-pool and up-convolution operations.
However, due to hardware limitations, the full capability of this high receptive field cannot be
utilized because the input image size is limited to 256x256. If larger image crops capturing
more area of the fragments can be utilized, the advantage of this high receptive field can be
exploited. Thus, this is an important aspect of the research that can be addressed in future
works.
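For reference, the recurrence from Araujo et al. [61] is easy to evaluate: for a layer with kernel size k and stride s, the receptive field grows as r ← r + (k - 1)·j while the cumulative stride ("jump") grows as j ← j·s. The small sketch below applies it to an encoder like the one sketched above (six stages of two 3x3 convolutions plus 2x2 max-pooling); this layer configuration is an assumption for illustration, not the exact network.

def receptive_field(layers):
    r, j = 1, 1
    for kernel, stride in layers:
        r += (kernel - 1) * j     # growth contributed by this layer
        j *= stride               # cumulative stride ("jump") so far
    return r

encoder_layers = [(3, 1), (3, 1), (2, 2)] * 6   # two 3x3 convs + 2x2 max-pool, six times
print(receptive_field(encoder_layers))          # 316, larger than the 256-pixel crop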
Model Ensembling
Since different models for ink detection were implemented, it might be interesting to see
how a combination of these models performs. As described in Chapter 4, each model has
strengths and weaknesses when comparing the above evaluation metrics. For instance, the
freezed model could not capture the fine-level details of the characters but was excellent at
capturing high-level details such as the general shape. Hence, there is a possibility of balancing
out these trade-offs when the models are combined. Various model ensembling techniques
like bagging, boosting, stacking, voting and bootstrapping can be tested on these models to
determine the most optimal model. [62] This is an interesting direction for this research to
explore in the future.
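As a simple starting point (one of many possible ensembling schemes, and not something evaluated in this work), the per-pixel ink probabilities of the three models could be averaged before the 0.5 threshold is applied, as in the sketch below.

import numpy as np

def ensemble_ink_mask(prob_maps, threshold=0.5):
    # prob_maps: list of per-pixel sigmoid outputs, one array per model
    mean_prob = np.mean(np.stack(prob_maps), axis=0)
    return (mean_prob > threshold).astype(np.uint8)

# e.g. ensemble_ink_mask([base_probs, freezed_probs, pretrained_probs])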

5.2.2 Usage of 3D Convolutions


As described in Chapter 3, the input for the ink detection network is a 3D volumetric voxel
containing scanned surfaces of papyri layers. In the input layer of the proposed network
architecture, these layers are considered different input channels, and convolutions are applied.
Even though this approach was able to generate good results as described in Chapter 4,
implementing 3D convolutions would allow the network to learn across these surface-level layers
as well. [17] This is the approach of the most directly related previous work on ink detection in
Herculaneum papyri fragments by Parker et al. [1], which utilized 3D convolutional neural
networks. 3D convolutions are also widely used in medical image segmentation tasks
where segmentation should be performed on 3D volumetric data like MRI and CT scans. [63]
Hence, incorporating 3D convolutions can certainly improve the performance of the network
due to the nature of the input data. Thus, this should be further tested in future research.
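The difference between the two formulations can be illustrated with a short PyTorch sketch: the current design treats the 32 scanned surface layers as input channels of a 2D convolution, whereas a 3D convolution keeps the depth axis explicit and convolves across it as well (channel widths here are illustrative).

import torch
import torch.nn as nn

x2d = torch.randn(1, 32, 256, 256)       # 32 surface layers treated as input channels
conv2d = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, padding=1)
print(conv2d(x2d).shape)                 # torch.Size([1, 64, 256, 256])

x3d = torch.randn(1, 1, 32, 256, 256)    # the same volume with an explicit depth axis
conv3d = nn.Conv3d(in_channels=1, out_channels=64, kernel_size=3, padding=1)
print(conv3d(x3d).shape)                 # torch.Size([1, 64, 32, 256, 256])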

5.2.3 Limitations
Apart from the above suggestions that should be incorporated into future research, the
implemented ink detection module has some challenges and limitations. The major challenge
was to test out other segmentation models to compare the performances of those models
with the current implementation. Since the model training times were high due to hardware
limitations, other state-of-the-art models could not be trained within the limited time.
Since this is a relatively novel problem, previously implemented solutions are limited, and the
availability of pre-trained models that could be used directly is also limited. Even though
there are existing solutions in the related Kaggle competition [19], the provided solutions must
be trained from scratch, and the hardware requirements for the training process are quite
intensive and time-consuming. However, a proper ablation study and evaluation should
be conducted by comparing the results with these related works. Another reason this type of
comparison could not be conducted is that the original dataset was downsized in this research
due to hardware limitations. Thus, the results are not comparable with those provided in the
leaderboard of the Kaggle competition. [19]

There are avenues in this research to explore and experiment with the latest state-of-the-art
segmentation networks, since this research was limited mainly to U-Net. [31] These networks
include SegFormer, which is based on the attention mechanism [33], DeepLabv3+, which is
based on atrous separable convolutions [34], and UNet++, which is a direct improvement on
the U-Net architecture. [31], [35] Hence, exploring the effect of transferring learned parameters
from another dataset using these networks is interesting because these networks also have the
underlying encoder-decoder architecture of a U-Net.

5.3 | Handwritten Character Segmentation Network


Currently, this network is not incorporated into the main pipeline of this research, which is
ink detection from papyri fragments. However, it could be incorporated in the future,
especially to replace the current intensity-based thresholding segmentation method used on
the papyri image data described in Chapter 3. This is also an area where little previous work
has been done, so there are many avenues to explore regarding this problem. Since labelled
segmented data for this task is limited, it would be useful to explore semi-supervised or
unsupervised learning-based mechanisms for this segmentation problem. [64]–[67]

CHAPTER 6
Conclusion
In conclusion, this research experimentally proved that incorporating historical handwritten
data from a different source and applying transfer learning principles can significantly enhance
ink detection from the fragments of the Herculaneum papyri, which is the primary objective
of this research. This approach is particularly valuable when labelled training data is scarce,
mirroring the challenges posed by this unique problem. The fusion of deep-learning network
models with the concept of transfer learning has opened up new avenues for non-invasive text
recovery from fragile artefacts.

The proposed and implemented models in this research proved that they can produce readable
results for the test data. Even though the results are far from providing full readability, the ink
signals predicted by the networks would be vital for the papyrologists and language experts
to decipher the written texts. These results, combined with the proposed future directions,
can significantly contribute to uncovering the hidden written text of the Herculaneum papyri
scrolls.

Bibliography
[1] C. S. Parker, S. Parsons, J. Bandy, C. Chapman, F. Coppens & W. B. Seales, “From
invisibility to readability: Recovering the ink of herculaneum,” PloS one, vol. 14, no. 5,
e0215775, 2019.
[2] W. B. Seales, C. S. Parker, M. Segal, E. Tov, P. Shor & Y. Porath, “From damage to
discovery via virtual unwrapping: Reading the scroll from en-gedi,” Science advances,
vol. 2, no. 9, e1601247, 2016.
[3] S. Parsons, C. S. Parker, C. Chapman, M. Hayashida & W. B. Seales, “Educelab-
scrolls: Verifiable recovery of text from herculaneum papyri using x-ray ct,” arXiv e-prints,
arXiv–2304, 2023.
[4] W. B. Seales, J. Griffioen, R. Baumann & M. Field, “Analysis of herculaneum
papyri with x-ray computed tomography,” in International Conference on nondestructive
investigations and microanalysis for the diagnostics and conservation of cultural and
environmental heritage, 2011.
[5] A. S. Hunt, “The oxyrhynchus papyri,” The Classical Review, vol. 12, no. 1, pp. 34–35,
1898.
[6] M. I. Swindall, G. Croisdale, C. C. Hunter, B. Keener, A. C. Williams, J. H. Brusuelas,
N. Krevans, M. Sellew, L. Fortson & J. F. Wallin, “Exploring learning approaches
for ancient greek character recognition with citizen science data,” in 2021 IEEE 17th
International Conference on eScience (eScience), IEEE, 2021, pp. 128–137.
[7] S. J. Pan & Q. Yang, “A survey on transfer learning,” IEEE Transactions on knowledge
and data engineering, vol. 22, no. 10, pp. 1345–1359, 2009.
[8] C. Tan, F. Sun, T. Kong, W. Zhang, C. Yang & C. Liu, “A survey on deep transfer
learning,” in Artificial Neural Networks and Machine Learning–ICANN 2018: 27th
International Conference on Artificial Neural Networks, Rhodes, Greece, October 4-7,
2018, Proceedings, Part III 27, Springer, 2018, pp. 270–279.
[9] H. Rathmann, B. Kyle, E. Nikita, K. Harvati & G. Saltini Semerari, “Population history
of southern italy during greek colonization inferred from dental remains,” American
Journal of Physical Anthropology, vol. 170, no. 4, pp. 519–534, 2019.
[10] W. A. Johnson & H. N. Parker, Ancient literacies: the culture of reading in Greece
and Rome. Oxford University Press, 2009.
[11] V. Mocella, E. Brun, C. Ferrero & D. Delattre, “Revealing letters in rolled herculaneum
papyri by x-ray phase-contrast imaging,” Nature communications, vol. 6, no. 1, p. 5895,
2015.
[12] G. Hoffmann Barfod, J. M. Larsen, A. Lichtenberger & R. Raja, “Revealing text in
a complexly rolled silver scroll from jerash with computed tomography and advanced
imaging software,” Scientific reports, vol. 5, no. 1, pp. 1–10, 2015.
[13] S. Stabile, F. Palermo, I. Bukreeva, D. Mele, V. Formoso, R. Bartolino & A. Cedola,
“A computational platform for the virtual unfolding of herculaneum papyri,” Scientific
Reports, vol. 11, no. 1, p. 1695, 2021.

[14] D. Baum, N. Lindow, H.-C. Hege, V. Lepper, T. Siopi, F. Kutz, K. Mahlow & H.-E.
Mahnke, “Revealing hidden text in rolled and folded papyri,” Applied Physics A, vol. 123,
pp. 1–7, 2017.
[15] O. Samko, Y.-K. Lai, D. Marshall & P. L. Rosin, “Virtual unrolling and information
recovery from scanned scrolled historical documents,” Pattern Recognition, vol. 47,
no. 1, pp. 248–259, 2014.
[16] D. Allegra, E. Ciliberto, P. Ciliberto, G. Petrillo, F. Stanco & C. Trombatore, “X-
ray computed tomography for virtually unrolling damaged papyri,” Applied Physics A,
vol. 122, pp. 1–7, 2016.
[17] D. Tran, L. Bourdev, R. Fergus, L. Torresani & M. Paluri, “Learning spatiotemporal
features with 3d convolutional networks,” in Proceedings of the IEEE international
conference on computer vision, 2015, pp. 4489–4497.
[18] G. E. Hinton, S. Osindero & Y.-W. Teh, “A fast learning algorithm for deep belief
nets,” Neural computation, vol. 18, no. 7, pp. 1527–1554, 2006.
[19] AlexLourenco, B. Seales, C. Chapman, D. Havir, I. Janicki, J. Posma, N. Friedman,
R. Holbrook, S. P., S. Parsons & W. Cukierski, Vesuvius challenge - ink detection. 2023.
[Online]. Available: https://kaggle.com/competitions/vesuvius-challenge-ink-detection.
[20] H.-D. Cheng, X. H. Jiang, Y. Sun & J. Wang, “Color image segmentation: Advances
and prospects,” Pattern recognition, vol. 34, no. 12, pp. 2259–2281, 2001.
[21] Y. Guo, Y. Liu, T. Georgiou & M. S. Lew, “A review of semantic segmentation using
deep neural networks,” International journal of multimedia information retrieval, vol. 7,
pp. 87–93, 2018.
[22] N. M. Zaitoun & M. J. Aqel, “Survey on image segmentation techniques,” Procedia
Computer Science, vol. 65, pp. 797–806, 2015.
[23] S. S. Al-Amri, N. V. Kalyankar, et al., “Image segmentation by using threshold tech-
niques,” arXiv preprint arXiv:1005.4020, 2010.
[24] J. Tang, “A color image segmentation algorithm based on region growing,” in 2010 2nd
international conference on computer engineering and technology, IEEE, vol. 6, 2010,
pp. V6–634.
[25] M.-N. Wu, C.-C. Lin & C.-C. Chang, “Brain tumor detection using color-based k-means
clustering segmentation,” in Third international conference on intelligent information
hiding and multimedia signal processing (IIH-MSP 2007), IEEE, vol. 2, 2007, pp. 245–
250.
[26] D. Ziou, S. Tabbone, et al., “Edge detection techniques-an overview,” Pattern Recogni-
tion and Image Analysis C/C of Raspoznavaniye Obrazov I Analiz Izobrazhenii, vol. 8,
pp. 537–559, 1998.
[27] M. Kass, A. Witkin & D. Terzopoulos, “Snakes: Active contour models,” International
journal of computer vision, vol. 1, no. 4, pp. 321–331, 1988.
[28] N. Otsu, “A threshold selection method from gray-level histograms,” IEEE transactions
on systems, man, and cybernetics, vol. 9, no. 1, pp. 62–66, 1979.
[29] S. Minaee, Y. Boykov, F. Porikli, A. Plaza, N. Kehtarnavaz & D. Terzopoulos, “Image
segmentation using deep learning: A survey,” IEEE transactions on pattern analysis and
machine intelligence, vol. 44, no. 7, pp. 3523–3542, 2021.

[30] J. Long, E. Shelhamer & T. Darrell, “Fully convolutional networks for semantic
segmentation,” in Proceedings of the IEEE conference on computer vision and pattern
recognition, 2015, pp. 3431–3440.
[31] O. Ronneberger, P. Fischer & T. Brox, “U-net: Convolutional networks for biomedical
image segmentation,” in Medical Image Computing and Computer-Assisted Intervention–
MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015,
Proceedings, Part III 18, Springer, 2015, pp. 234–241.
[32] V. Iglovikov & A. Shvets, “Ternausnet: U-net with vgg11 encoder pre-trained on
imagenet for image segmentation,” arXiv preprint arXiv:1801.05746, 2018.
[33] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez & P. Luo, “Segformer: Simple
and efficient design for semantic segmentation with transformers,” Advances in Neural
Information Processing Systems, vol. 34, pp. 12 077–12 090, 2021.
[34] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff & H. Adam, “Encoder-decoder with
atrous separable convolution for semantic image segmentation,” in Proceedings of the
European conference on computer vision (ECCV), 2018, pp. 801–818.
[35] Z. Zhou, M. M. Rahman Siddiquee, N. Tajbakhsh & J. Liang, “Unet++: A nested u-net
architecture for medical image segmentation,” in Deep Learning in Medical Image Analysis
and Multimodal Learning for Clinical Decision Support: 4th International Workshop,
DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with
MICCAI 2018, Granada, Spain, September 20, 2018, Proceedings 4, Springer, 2018,
pp. 3–11.
[36] J. Masci, U. Meier, D. Cireşan & J. Schmidhuber, “Stacked convolutional auto-
encoders for hierarchical feature extraction,” in Artificial Neural Networks and Machine
Learning–ICANN 2011: 21st International Conference on Artificial Neural Networks,
Espoo, Finland, June 14-17, 2011, Proceedings, Part I 21, Springer, 2011, pp. 52–59.
[37] K. Simonyan & A. Zisserman, “Very deep convolutional networks for large-scale image
recognition,” arXiv preprint arXiv:1409.1556, 2014.
[38] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy,
A. Khosla, M. Bernstein, A. C. Berg & L. Fei-Fei, “Imagenet large scale visual recognition
challenge,” International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252,
2015. doi: 10.1007/s11263-015-0816-y.
[39] B. Neyshabur, S. Bhojanapalli, D. McAllester & N. Srebro, “Exploring generalization
in deep learning,” Advances in neural information processing systems, vol. 30, 2017.
[40] I. Goodfellow, Y. Bengio & A. Courville, “Regularization for deep learning,” Deep
learning, pp. 216–261, 2016.
[41] S. Ioffe & C. Szegedy, “Batch normalization: Accelerating deep network training by
reducing internal covariate shift,” in International conference on machine learning, pmlr,
2015, pp. 448–456.
[42] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever & R. Salakhutdinov, “Dropout:
A simple way to prevent neural networks from overfitting,” The journal of machine
learning research, vol. 15, no. 1, pp. 1929–1958, 2014.
[43] D. P. Kingma & J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint
arXiv:1412.6980, 2014.

[44] R. Chesler, T. Kyi, A. Loftus & A. Tersol, Solution for kaggle vesuvius ink detection
challenge, version 0, 2023. [Online]. Available: https://github.com/ainatersol/Vesuvius-InkDetection.
[45] A. Hatamizadeh, Y. Tang, V. Nath, D. Yang, A. Myronenko, B. Landman, H. R. Roth
& D. Xu, “Unetr: Transformers for 3d medical image segmentation,” in Proceedings of
the IEEE/CVF winter conference on applications of computer vision, 2022, pp. 574–584.
[46] F. Quattrini, V. Pippi, S. Cascianelli & R. Cucchiara, “Volumetric fast fourier
convolution for detecting ink on the carbonized herculaneum papyri,” arXiv preprint
arXiv:2308.05070, 2023.
[47] D. Karimi, S. K. Warfield & A. Gholipour, “Transfer learning in medical image
segmentation: New insights from analysis of the dynamics of model parameters and
learned representations,” Artificial intelligence in medicine, vol. 116, p. 102 078, 2021.
[48] G. E. Hinton & R. R. Salakhutdinov, “Reducing the dimensionality of data with neural
networks,” science, vol. 313, no. 5786, pp. 504–507, 2006.
[49] V. Nair & G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,”
in Proceedings of the 27th international conference on machine learning (ICML-10),
2010, pp. 807–814.
[50] M. Thoma, “A survey of semantic segmentation,” arXiv preprint arXiv:1602.06541,
2016.
[51] T. Sorensen, “A method of establishing groups of equal amplitude in plant sociology
based on similarity of species content and its application to analyses of the vegetation
on danish commons,” Biologiske skrifter, vol. 5, pp. 1–34, 1948.
[52] B. Neyshabur, H. Sedghi & C. Zhang, “What is being transferred in transfer learning?”
Advances in neural information processing systems, vol. 33, pp. 512–523, 2020.
[53] M. Raghu, C. Zhang, J. Kleinberg & S. Bengio, “Transfusion: Understanding transfer
learning for medical imaging,” Advances in neural information processing systems, vol. 32,
2019.
[54] A. Lopez-del Rio, A. Nonell-Canals, D. Vidal & A. Perera-Lluna, “Evaluation of
cross-validation strategies in sequence-based binding prediction using deep learning,”
Journal of chemical information and modeling, vol. 59, no. 4, pp. 1645–1657, 2019.
[55] A. Mikolajczyk & M. Grochowski, “Data augmentation for improving deep learning
in image classification problem,” in 2018 international interdisciplinary PhD workshop
(IIPhDW), IEEE, 2018, pp. 117–122.
[56] Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh & J. Liang, “Unet++: Redesigning skip
connections to exploit multiscale features in image segmentation,” IEEE transactions on
medical imaging, vol. 39, no. 6, pp. 1856–1867, 2019.
[57] Y. LeCun, L. Bottou, Y. Bengio & P. Haffner, “Gradient-based learning applied to
document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[58] H. Noh, S. Hong & B. Han, “Learning deconvolution network for semantic segmenta-
tion,” in Proceedings of the IEEE international conference on computer vision, 2015,
pp. 1520–1528.
[59] N. Bjorck, C. P. Gomes, B. Selman & K. Q. Weinberger, “Understanding batch
normalization,” Advances in neural information processing systems, vol. 31, 2018.

[60] W. Luo, Y. Li, R. Urtasun & R. Zemel, “Understanding the effective receptive field in
deep convolutional neural networks,” Advances in neural information processing systems,
vol. 29, 2016.
[61] A. Araujo, W. Norris & J. Sim, “Computing receptive fields of convolutional neural
networks,” Distill, 2019. [Online]. Available: https://distill.pub/2019/computing-receptive-fields.
doi: 10.23915/distill.00021.
[62] X. Dong, Z. Yu, W. Cao, Y. Shi & Q. Ma, “A survey on ensemble learning,” Frontiers
of Computer Science, vol. 14, pp. 241–258, 2020.
[63] S. Niyas, S. Pawan, M. A. Kumar & J. Rajan, “Medical image segmentation with
3d convolutional neural networks: A survey,” Neurocomputing, vol. 493, pp. 397–413,
2022.
[64] X. Luo, J. Chen, T. Song & G. Wang, “Semi-supervised medical image segmentation
through dual-task consistency,” in Proceedings of the AAAI conference on artificial
intelligence, vol. 35, 2021, pp. 8801–8809.
[65] J. Peng, G. Estrada, M. Pedersoli & C. Desrosiers, “Deep co-training for semi-supervised
image segmentation,” Pattern Recognition, vol. 107, p. 107 269, 2020.
[66] X. Xia & B. Kulis, “W-net: A deep model for fully unsupervised image segmentation,”
arXiv preprint arXiv:1711.08506, 2017.
[67] A. Kanezaki, “Unsupervised image segmentation by backpropagation,” in 2018 IEEE
international conference on acoustics, speech and signal processing (ICASSP), IEEE,
2018, pp. 1543–1547.

APPENDIX A
GitLab Repository
All the code and implementation for the project can be found on the GitLab server of the
University of Birmingham at the following URL:
https://git.cs.bham.ac.uk/projects-2022-23/wxm286

There are two folders containing the implementations of the two main modules of this research.

1. alpub autoencoder: referring to the implementation of the Character Autoencoder module.

2. ink detection: referring to the implementation of the ink detection module.

Instructions for setting up the environment, setting up the datasets and running the code will
be explained in the readme.md file.

The live GitHub repository with commit history can be accessed from the following URL:
https://github.com/Wishmitha/Vesuvius
