Sensor Data Understanding

Marcin Grzegorzek

λογος

Bibliographic information published by the Deutsche Nationalbibliothek

The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.d-nb.de.

© Copyright Logos Verlag Berlin GmbH 2017


All rights reserved.

ISBN 978-3-8325-4633-5

Logos Verlag Berlin GmbH
Comeniushof, Gubener Str. 47
10243 Berlin
Tel.: +49 (0)30 42 85 10 90
Fax: +49 (0)30 42 85 10 92
Internet: http://www.logos-verlag.de

Contents

Preface

I Introduction

1 Fundamental Concept
  1.1 Motivation
  1.2 Active and Assisted Living
  1.3 Digital Medicine
  1.4 Outline and Contribution
  References

II Visual Scene Analysis

2 Large-Scale Multimedia Retrieval
  2.1 Hierarchical Organisation of Semantic Meanings
  2.2 Concept Detection
    2.2.1 Global versus Local Features
    2.2.2 Feature Learning
  2.3 Event Retrieval
    2.3.1 Event Retrieval within Images/Shots
    2.3.2 Event Retrieval over Shot Sequences
  2.4 Conclusion and Future Trends
    2.4.1 Reasoning
    2.4.2 Uncertainties in Concept Detection
    2.4.3 Adaptive Learning
  References

3 Shape-Based Object Recognition
  3.1 Problem Statement and Motivation
  3.2 Shape Representation
    3.2.1 Survey of Related Methods
    3.2.2 Coarse-grained Shape Representation
    3.2.3 Fine-grained Shape Representation
  3.3 Shape Matching
    3.3.1 Survey of Related Methods
    3.3.2 Shape Matching using Coarse-grained Features
    3.3.3 Shape Matching using Fine-grained Features
  3.4 Experiments and Results
    3.4.1 Shape Retrieval using Coarse-grained Features
    3.4.2 Shape Retrieval using Fine-grained Features
  3.5 Conclusion and Future Trends
  References

4 Moving Object Analysis for Video Interpretation
  4.1 Object Tracking in 2D Video
    4.1.1 Survey of Related Approaches
    4.1.2 Tracking-Learning-Detection
    4.1.3 Tracking in Omnidirectional Video
    4.1.4 Experiments and Results
  4.2 3D Trajectory Extraction from 2D Video
    4.2.1 RJ-MCMC Particle Filtering
    4.2.2 Convoy Detection in Crowded Video
    4.2.3 Experiments and Results
  4.3 Conclusion and Future Trends
  References

III Human Data Interpretation

5 Physical Activity Recognition
  5.1 Atomic Activity Recognition
    5.1.1 Survey of Related Approaches
    5.1.2 Codebook Approach for Classification
    5.1.3 Experiments and Results
  5.2 Gait Recognition
    5.2.1 Survey of Related Approaches
    5.2.2 Spatiotemporal Representation of Gait
    5.2.3 Experiments and Results
  5.3 Conclusion and Future Trends
  References

6 Cognitive Activity Recognition
  6.1 Definition, Taxonomy, Impact on Health
  6.2 Sensing the Brain Activity
    6.2.1 Electroencephalography
    6.2.2 Electrooculography
    6.2.3 Functional Magnetic Resonance Imaging
    6.2.4 Functional Near-Infrared Spectroscopy
  6.3 Survey of Related Methods
  6.4 Electrooculography-Based Approach
    6.4.1 Cognitive Activity Recognition Method
    6.4.2 Investigating Codewords
  6.5 Application and Validation
    6.5.1 Collecting a Dataset
    6.5.2 Implementation Details
    6.5.3 Results for Cognitive Activity Recognition
    6.5.4 Results for Codewords Investigation
  6.6 Conclusion and Future Trends
  References

7 Emotion Recognition
  7.1 Automatic Recognition of Emotions
    7.1.1 Definition and Taxonomy of Emotions
    7.1.2 Existing Techniques for Emotion Recognition
    7.1.3 Emotion Recognition Challenges
  7.2 Multimodal Emotion Recognition
    7.2.1 Arousal/Valence Estimation
    7.2.2 Basic Emotion Recognition
  7.3 Approaches Based on Physiological Data
    7.3.1 Stress Detection Using Hand-crafted Features
    7.3.2 Codebook Approach for Feature Generation
    7.3.3 Deep Neural Networks for Feature Generation
  7.4 Conclusion and Future Trends
  References

IV Conclusion

8 Summary and Future Vision
  8.1 Visual Scene Analysis
  8.2 Human Data Interpretation
  8.3 Data-Driven Society
  References

List of Figures

List of Tables

Preface

The rapid development in the area of sensor technology has been responsible for a number of societal phenomena. For instance, the increased availability of imaging sensors integrated into digital video cameras has significantly stimulated the User Generated Content (UGC) movement beginning in 2005. Another example is the groundbreaking innovation in wearable technology leading to a societal phenomenon called Quantified Self (QS), a community of people who use the capabilities of technical devices to gain a profound understanding of collected self-related data. Machine learning algorithms benefit greatly from the availability of such huge volumes of digital data. For example, new technical solutions for challenges caused by demographic change (an ageing society) can be proposed in this way, especially in the context of healthcare systems in industrialised countries. The decision-making process is often supported, or even fully taken over, by machine learning algorithms. We live in a data-driven society and contribute to it significantly by voluntarily generating terabytes of data every day. This societal transformation can no longer be stopped. Our objective should be to gain as much benefit from this movement as possible while limiting the risks connected to it.

The goal of this book is to present selected algorithms for Visual Scene Analysis (VSA, processing UGC) as well as for Human Data Interpretation (HDI, using data produced within the QS movement) and to expose a joint methodological basis between these two scientific directions. While VSA approaches have reached impressive robustness towards human-like interpretation of visual sensor data, HDI methods still have limited semantic abstraction power. Using selected state-of-the-art examples, this book shows the maturity of approaches towards closing the semantic gap in both areas, VSA and HDI.

Another objective of this book is to sketch a scientific vision of a generic platform for holistic human condition monitoring. Based on the data delivered by sensors integrated in wearables (time series) and, if available, also images, the algorithms will continuously analyse humans' physical, cognitive, emotional and social states and activities. Integrated into a single module for holistic human health monitoring, the software platform will perform a long-term analysis of human data on a very large scale. Intelligent algorithms will automatically detect "interesting events" in these data. Both real-time data analysis and cumulative assessments will be possible with the platform. The conceptualisation and development of these machine learning algorithms for the recognition of patterns in humans' physiological and behavioural data will happen on different levels of abstraction between methodology and application.

This book is intended for an interdisciplinary audience who would like to use machine learning techniques to solve problems from the areas of visual scene analysis and human data interpretation. Ideally, the book will provide helpful background and guidance to researchers, undergraduate or graduate students, and practitioners who want to incorporate its ideas into their own work. On the one hand, it aims to show the technical feasibility of machine learning techniques for the automatic interpretation of multimodal sensory data. On the other hand, it urges society to carefully monitor the implications of the rapid developments in this area.

I would like to thank all members of the Research Group for Pattern Recognition at the University of Siegen for proofreading this book and for the valuable discussions which helped me to improve it. My special thanks go to Zeyd Boukhers, Ahmad Delforouzi, Muhammad Hassan Khan, Kristin Klaas, Lukas Köping, Frédéric Li, Przemyslaw Lagodziński, Kimiaki Shirahama, and Cong Yang. Last but not least, I would like to thank my family for being unfailingly supportive of this effort.

Marcin Grzegorzek

Part I

Introduction

Chapter 1

Fundamental Concept

Sensors are everywhere. By the early 2020s, their number will have exceeded one trillion [5]. This growth is driven by falling sensor costs and new fabrication techniques enabling significant miniaturisation. For example, the startup company mCube (www.mcubemems.com) creates motion sensors that are "smaller than a grain of sand" and envisions a world where motion sensors are embedded in "everything that moves".

The rapid development in the area of sensor technology has been responsible for a number of societal phenomena. For instance, the increased availability of imaging sensors integrated into digital video cameras has significantly stimulated the User Generated Content (UGC) movement beginning in 2005¹. Another example is the groundbreaking innovation in wearable technology leading to a societal phenomenon called Quantified Self (QS), a community of people who use the capabilities of technical devices to gain a profound understanding of collected self-related data.

¹ The video-sharing platform www.youtube.com was launched in February 2005.

Huge and continuously increasing volumes of digital sensor data are collected every day. For example, in June 2016, YouTube users were uploading 400 hours of new video content to the platform per minute². However, the digital sensor data themselves do not provide the users with any added value; they need to be semantically interpreted (understood) in a particular application context to become useful.

² https://www.domo.com/blog/data-never-sleeps-4-0

The abstraction of digital sensor data towards their semantic understanding using automated algorithms is a challenging scientific problem. To achieve it, the so-called semantic gap, the lack of coincidence between automatically extractable data features and human-perceivable semantic meanings [17], must be bridged. A person's everyday life requires an immense amount of knowledge about the world. Much of this knowledge is subjective and intuitive, and therefore difficult to articulate in a formal way. Computers need to capture the same knowledge in order to behave intelligently. One of the key challenges in artificial intelligence is how to get this informal knowledge into a computer [6]. In contrast to human experts from a certain application area (e.g., medical doctors), computers do not possess the context knowledge to interpret low-level digital data at a high level of semantic abstraction (e.g., early diagnosis in medicine) [7].

One approach towards closing the semantic gap aims at integrating knowledge bases called ontologies into the process of low-level data analysis [19]. However, the ontology generation process has been automated only to a limited degree, which makes this strategy very time-consuming. In addition, the integration of high-level ontology-based reasoning techniques into low-level data analysis algorithms usually requires the pattern recognition software to be customised towards the context model (application ontology) currently used [2]. This hinders the portability of such solutions across application domains [7].

Currently, the most widely investigated family of approaches aiming to reach high-level interpretations from low-level digital data is called deep learning [4, 6]. Generally, deep learning algorithms allow computers to learn from experience and understand the world in terms of a hierarchy of concepts, with each concept defined in terms of its relation to simpler concepts. By gathering knowledge from experience, this approach avoids the need for human operators to formally specify all of the knowledge that the computer needs. The hierarchy of concepts allows the computer to learn complicated concepts by building them out of simpler ones [7].

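To make the notion of a concept hierarchy concrete, the following minimal sketch (not taken from this book; the framework choice, layer sizes and class count are illustrative assumptions) stacks convolutional layers in PyTorch so that early layers respond to simple patterns such as edges, while later layers compose them into increasingly abstract concepts:

    import torch
    import torch.nn as nn

    # A minimal convolutional network illustrating a hierarchy of concepts:
    # each stage builds more abstract representations out of simpler ones.
    model = nn.Sequential(
        # Stage 1: low-level patterns (edges, colour blobs)
        nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        # Stage 2: mid-level parts (corners, textures, object parts)
        nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        # Stage 3: high-level structures approaching whole objects
        nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Flatten(),
        # Final mapping from the learned representation to concept scores
        nn.Linear(64 * 8 * 8, 10),
    )

    scores = model(torch.randn(1, 3, 64, 64))  # one 64x64 RGB image
    print(scores.shape)                        # torch.Size([1, 10])
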
In this book, selected state-of-the-art approaches for Visual Scene Analysis (Part II) and for Human Data Interpretation (Part III), all aiming at reaching the highest possible level of semantic interpretation, are presented and discussed. The author contributed substantially to most of the scientific results described in this book.

This chapter is structured as follows. In Section 1.1, the book is motivated on the application level and from the methodological point of view. Afterwards, the two main applications addressed by this book and in its author's current research, namely Active and Assisted Living (Section 1.2) and Digital Medicine (Section 1.3), are introduced. Section 1.4 presents the overall structural concept of the book, identifying its author's contributions to the particular chapters.

Figure 1.1: The trend towards a digital society results in a huge volume of sensor data generated every day. These data improve the performance of machine learning algorithms. In this way, new technical solutions to challenges caused by demographic change (an ageing society) can be proposed.

1.1 Motivation

The selection of applications (Active and Assisted Living as well as Digital Medicine, see Figure 1.4) addressed in this book and in its author's current research is motivated by two main phenomena of modern societies. On the one hand, the demographic change leading to an ageing society, alongside the shortage of medical staff (especially in rural areas), critically challenges healthcare systems in industrialised countries in their conventional form [7]. On the other hand, the trend towards a digital society (digitalisation) progresses with tremendous speed, so that more and more health-related data is available in digital form. As large volumes of data improve the performance of machine learning algorithms, new technical solutions for problems caused by the demographic change (ageing society) can be proposed (Figure 1.1).

From the methodological point of view, this book presents and reviews selected state-of-the-art algorithms for automatic sensor data understanding. While in the area of image and video analysis (Part II: Visual Scene Analysis) the semantic gap has already been closed to an impressive degree, the semantic interpretation of human-centred data recorded by sensors embedded in wearable devices (Part III: Human Data Interpretation) has still not reached a satisfactory level [17]. However, the analysis of visual data (2D, 2.5D, or 3D images or videos) and the processing of human-centred sensor data (mostly 1D time series) share the same methodological fundament. The difference lies in the heterogeneity of data sources. While algorithms for visual scene analysis can usually be built under the assumption of a stable and constant dimensionality of the data, in the case of human data interpretation the number of sensors available to the system can dynamically change over time. Moreover, the labelling process in the supervised training phase is more objective for visual scene analysis (e.g., manual naming of objects in a scene) than for human data interpretation, since humans' physiological, emotional, and behavioural states are not always clearly distinguishable. Therefore, the main methodological motivation for this book is to present selected algorithms for Visual Scene Analysis (Part II) and Human Data Interpretation (Part III) and to discuss their difference in semantic interpretation power.

1.2 Active and Assisted Living

The well-established concept of Ambient Assisted Living (AAL) aims at³:

• extending the time people can live in their preferred environment by increasing their autonomy, self-confidence and mobility;
• supporting the preservation of health and functional capabilities of the elderly;
• promoting a better and healthier lifestyle for individuals at risk;
• enhancing security, preventing social isolation and supporting the preservation of the multifunctional network around the individual;
• supporting carers, families and care organisations;
• increasing the efficiency and productivity of the resources used in ageing societies.

³ Source: www.aal-europe.eu

In recent years, the term AAL has been extended to Active and Assisted Living to emphasise the importance of physical, cognitive, and social activities for preserving the health and functional capabilities of the elderly. According to the Survey of Health, Ageing and Retirement in Europe (SHARE, www.share-project.org), retirement accelerates physical, cognitive and mental decline and, therefore, has a negative effect on personal well-being. Staying active and social in retirement are important ingredients of healthy ageing. For seniors who no longer head out to work every day, it is more important than ever to find ways to stay active and to maintain social relationships, and doing so may help them ward off a number of health problems. However, finding opportunities for meaningful physical and cognitive activities within interesting social networks becomes increasingly difficult after retirement, especially in rural areas.

The relevance of technical solutions for AAL has continuously been increasing over the last years, especially due to the rapid development in the area of sensor and wearable technology. An example can be seen in Figure 1.2. The users of such sensor-based miniaturised systems are in a closed loop with the technology. Humans' physiological and behavioural data can be continuously recorded by wearables and automatically analysed by machine learning algorithms to provide the users with real-time guidance as well as recommendations for follow-up activities. In this way, the users can benefit from individualised training programmes optimised in terms of improving their physical, cognitive, mental, emotional and social well-being.

Figure 1.2: Continuous feedback loop between the user and the technology, leading to personalised follow-up recommendations and individualised training. Photograph source: www.shutterstock.com.

The author of this book has participated in several research projects
related to Active and Assisted Living. One of them is summarised below.

In the project Cognitive Village⁴ [15], funded by the German Federal Ministry of Education and Research and coordinated by Marcin Grzegorzek, technological, economic and social innovations as well as a participatory design approach are integrated into technical assistance systems enabling long-term independent living of elderly and diseased people in their own homes, even in rural areas where well-developed infrastructure is often missing. Under careful consideration of ethical, legal and social implications as well as the users' real needs, the technical system collects digital data about the elderly's daily life provided by sensors voluntarily distributed in their homes as well as by wearables such as smartwatches, intelligent glasses and smartphones. These sensory data are then automatically processed, analysed, classified and interpreted by adaptive machine learning algorithms. The objective is to automatically achieve a high-level semantic interpretation of activities as well as physical and cognitive states of the elderly for the detection of emergency situations with different criticality grades. Equipping the algorithms with adaptive properties (different users, behaviour changes over time) belongs to the most prominent scientific contributions of Cognitive Village from the machine learning and pattern recognition point of view. In addition, the system is required to cope with a dynamically reconfigurable sensor system delivering the data. The semantic gap in automatic data processing is reduced here by applying probabilistic methods for sensory data fusion, introducing adaptive learning mechanisms, integrating ontological background knowledge, and probabilistically modelling and automatically detecting extreme events in the data. Deep learning strategies are also used in the Cognitive Village system.

⁴ www.cognitive-village.de

Figure 1.3: Steps of healthcare.

1.3 Digital Medicine

Currently, patient care is conducted in functionally and geographically isolated medical facilities. This causes a fragmentation of medical processes, leading to media and technology gaps in the information flow; the missing interoperability of devices and data transfer interfaces is only one example of the reasons for this. Digital and patient-centred care, consistently defined along all of its steps, would improve both its medical quality and its economic efficiency [7].

Considering the current degree of digitalisation over the healthcare stages depicted in Figure 1.3, digitalisation has mainly been established in diagnostics. Especially the modern medical imaging modalities and molecular approaches demonstrate the huge amount of digital data generated in today's healthcare systems for diagnostics. In the remaining healthcare steps, such as prevention or therapy, the degree of digitalisation in the treatment procedures has recently been gradually increasing [7].

The demographic change leading to an ageing society, alongside the shortage of medical staff (especially in rural areas), critically challenges healthcare systems in industrialised countries in their conventional form. For this reason, less cost-intensive forms of data-driven, algorithmically supported treatments will receive an extremely high scientific, societal and economic priority in the near future. Fortunately, the digitalisation of our society progresses with tremendous speed, so that more and more health-related data is available in digital form. For instance, people wear intelligent glasses and/or smartwatches, provide digital data with standardised medical devices (e.g., blood pressure and blood sugar meters following the standard ISO/IEEE 11073), and/or deliver personal behavioural data via their smartphones.

This huge amount of personal data generated every day significantly improves the accuracy of machine learning and pattern recognition algorithms aiming at a holistic assessment of human health. A better understanding of the human physical, mental and cognitive condition makes personalised and preventive interventions possible. However, the ethical, legal and social implications (ELSI for short) of this trend must be analysed very carefully. For instance, data-driven, precise medical profiles of patients may lead to ethically and legally completely unacceptable pricing models in health insurance.

The health-related digital data voluntarily generated by patients/users on a daily basis are automatically processed, analysed, classified and medically interpreted with the support of semi-automatic machine learning and pattern recognition algorithms in a number of projects currently conducted by the author of this book. Two of them are briefly summarised below.

My-AHA⁵ (My Active and Healthy Ageing) is an EU-funded project [14] which aims at preventing the cognitive and functional decline of older adults through early risk detection and tailored intervention. A multinational and multidisciplinary consortium is developing an innovative ICT-based platform to detect subtle changes in the physical, cognitive, psychological and social domains of older adults that indicate an increased risk of a subsequent vicious cycle of disability and diseases, including dementia, depression, frailty and falls. For this, we develop, apply and investigate machine learning approaches for multimodal data understanding in the context of healthy ageing. Our activities follow an increasing level of semantic abstraction. On the low data classification level, we apply and extend multiple existing approaches targeting concrete tasks such as sleep quality estimation, speech emotion analysis, gait analysis, indoor/outdoor localisation, etc. The outcomes of these low-level classifiers are then fused on the middle data analysis level to assess the cognitive, social and physical states of the elderly. On the high level of semantic interpretation, the outcomes of the middle layer are fused and jointly analysed towards a general multimodal state description of the elderly in the context of healthy ageing. These multidimensional description profiles subsequently serve as inputs for a generic intervention model that, using concrete parameter values of a particular profile, provides a specific intervention programme optimised for a particular individual. The high heterogeneity of data sources is the main challenge for the pattern recognition software developed in My-AHA.

⁵ www.activeageing.unito.it

In the project SenseVojta [8], funded by the German Federal Ministry of Education and Research and conducted in collaboration with the Children's Hospital in Siegen (Kinderklinik Siegen⁶), a sensor-based system for the support of diagnostics, therapy and aftercare following the so-called Vojta Principle is being developed [10, 11]. The Vojta Principle starts out from what is known as reflex locomotion. While looking for a treatment for children with cerebral palsy, Prof. Vojta observed that these children responded to certain stimuli in certain body positions with recurring motor reactions in the torso and the extremities. The effects of this activation were astonishing: afterwards, the children with cerebral palsy could first speak more clearly, and after a short time they could stand up and walk more assuredly⁷. In Vojta Therapy, the therapist administers goal-directed pressure to defined zones on the body of a patient who is in a prone, supine or side-lying position. In everyone, regardless of age, such stimuli lead automatically and involuntarily, i.e. without actively willed cooperation on the part of the person concerned, to two movement complexes: reflex creeping in a prone position and reflex rolling from a supine or side-lying position. Through the therapeutic use of reflex locomotion, the involuntary muscle functions necessary for spontaneous movements in everyday life are activated in the patient, particularly in the spine, but also in the arms and legs, the hands and feet, as well as in the face. In this project, we develop a technical solution to support both professional therapists and relatives (e.g., children's parents) performing the therapy. For this, different sensors (e.g., a Kinect camera visually observing the scene as well as wearables measuring the acceleration of the extremities) are applied. The data acquired by these sensors are analysed and interpreted by the pattern recognition algorithms conceptualised and implemented in this project. The automatic sensor-based therapy interpretation provides real-time guidance to the therapists/parents. It also cumulatively monitors the therapy progress.

⁶ www.drk-kinderklinik.de
⁷ Source: www.vojta.com

1.4 Outline and Contribution

The overall structural concept of the book relates to the research areas investigated by its author in recent years and is depicted in Figure 1.4. The table of contents is aligned to the methodological level (Level: Algorithms) and, apart from the Introduction (Part I) and the Conclusion (Part IV), is divided into two parts, Visual Scene Analysis (Part II) and Human Data Interpretation (Part III). The sensor data analysed by the algorithms described in this book are acquired by the mentioned cameras and wearable devices (Level: Sensors). From the application point of view (Level: Applications), Active and Assisted Living as well as Digital Medicine have played a crucial role in the author's research over the last years.

Figure 1.4: The overall concept of the book relates to the research areas investigated by its author in recent years. The table of contents is aligned to the methodological level (Level: Algorithms) and, apart from the Introduction (Part I) and the Conclusion (Part IV), is divided into two parts, Visual Scene Analysis (Part II) and Human Data Interpretation (Part III). The sensor data analysed by the algorithms described in this book are acquired by the mentioned cameras and wearable devices (Level: Sensors). From the application point of view (Level: Applications), Active and Assisted Living as well as Digital Medicine have played a crucial role in the author's research over the last years.

Part II on Visual Scene Analysis is divided into three chapters. In Chapter 2, the scientific area of Large-Scale Multimedia Retrieval (LSMR) is reviewed. It is based on the survey article by Shirahama and Grzegorzek published in 2016 in the Multimedia Tools and Applications journal [17]. Chapter 3 provides an overview of shape-based object recognition. Its contents extend the Pattern Recognition journal article by Yang, Tiebe, Shirahama, and Grzegorzek published in 2016 [20]. In Chapter 4, video interpretation techniques based on the analysis of moving objects are described. The contents of this chapter have their origins in [1], recently accepted for publication in the IEEE Transactions on Circuits and Systems for Video Technology, as well as in [3], published in the proceedings of the International Conference on Pattern Recognition 2016, both co-authored by Marcin Grzegorzek.

Part III on Human Data Interpretation also consists of three chapters. Chapter 5 deals with the topic of physical activity recognition using sensors embedded in wearable devices. It is partly based on three articles co-authored by Grzegorzek [9, 12, 18]. In Chapter 6, selected algorithms for cognitive activity recognition are described. Its content extends an article co-authored by Grzegorzek and recently accepted for publication in the Computers in Biology and Medicine journal [13]. Chapter 7 addresses the scientific area of emotion recognition and partly originates from [16], co-authored by Marcin Grzegorzek.

References

[1] Z. Boukhers, K. Shirahama, and M. Grzegorzek. Example-based 3D Trajectory Extraction of Objects from 2D Videos. IEEE Transactions on Circuits and Systems for Video Technology, 2017 (accepted for publication).

[2] T. Declerck, M. Granitzer, M. Grzegorzek, M. Romanelli, S. Rüger, and M. Sintek. Semantic Multimedia. Springer LNCS 6725, Heidelberg, Dordrecht, London, New York, 2011.

[3] A. Delforouzi, A. Tabatabaei, K. Shirahama, and M. Grzegorzek. Unknown Object Tracking in 360-Degree Camera Images. In International Conference on Pattern Recognition, pages 1799–1804, Cancun, Mexico, December 2016. IEEE.

[4] L. Deng and D. Yu. Deep Learning: Methods and Applications. 2013.

[5] J. Ebner. How Sensors Will Shape Big Data and the Changing Economy. www.dataconomy.com, January 2015.

[6] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.

[7] M. Grzegorzek. Medical Data Understanding. In J. Goluchowski, M. Pańkowska, C. Barry, M. Lang, H. Linger, and C. Schneider, editors, International Conference on Information Systems Development, Katowice, Poland, August 2016.

[8] Project. SenseVojta: Sensor-based Diagnosis, Therapy and Aftercare According to the Vojta Principle. German Federal Ministry of Education and Research (BMBF), 12/2016 – 11/2019.

[9] M. H. Khan, M. S. Farid, and M. Grzegorzek. Gait Recognition Based on Spatiotemporal Features of Human Motion. Pattern Recognition, 2017 (accepted for publication).

[10] M. H. Khan, J. Helsper, Z. Boukhers, and M. Grzegorzek. Automatic Recognition of Movement Patterns in the Vojta-Therapy Using RGB-D Data. In The 23rd IEEE International Conference on Image Processing (ICIP 2016), pages 1235–1239, Phoenix, US, September 2016. IEEE.

[11] M. H. Khan, J. Helsper, C. Yang, and M. Grzegorzek. An Automatic Vision-based Monitoring System for Accurate Vojta-Therapy. In The 15th IEEE/ACIS International Conference on Computer and Information Science (ICIS 2016), pages 1–6, Okayama, Japan, June 2016. IEEE.

[12] L. Köping, K. Shirahama, and M. Grzegorzek. A General Framework for Sensor-based Human Activity Recognition. Computers in Biology and Medicine, 2017 (accepted for publication).

[13] P. Lagodzinski, K. Shirahama, and M. Grzegorzek. Codebook-based Electrooculography Data Analysis Towards Cognitive Activity Recognition. Computers in Biology and Medicine, 2017 (accepted for publication).

[14] Project. My-AHA: My Active and Healthy Ageing. Website: www.activeageing.unito.it, European Commission (Horizon 2020), 01/2016 – 12/2019.

[15] Project. Cognitive Village: Adaptively Learning Technical Support System for Elderly. Website: www.cognitive-village.de, German Federal Ministry of Education and Research (BMBF), 09/2015 – 08/2018.

[16] K. Shirahama and M. Grzegorzek. Emotion Recognition Based on Physiological Sensor Data Using Codebook Approach. In E. Pietka, P. Badura, J. Kawa, and W. Wieclawek, editors, 5th International Conference on Information Technologies in Biomedicine (ITIB 2016), pages 27–39, Kamien Slaski, Poland, June 2016. Springer.

[17] K. Shirahama and M. Grzegorzek. Towards Large-Scale Multimedia Retrieval Enriched by Knowledge about Human Interpretation – Retrospective Survey. Multimedia Tools and Applications, 75(1):297–331, January 2016.

[18] K. Shirahama, L. Köping, and M. Grzegorzek. Codebook Approach for Sensor-based Human Activity Recognition. In ACM International Joint Conference on Pervasive and Ubiquitous Computing, pages 197–200, Heidelberg, Germany, September 2016. ACM.

[19] S. Staab, A. Scherp, R. Arndt, R. Troncy, M. Grzegorzek, C. Saathoff, S. Schenk, and L. Hardman. Semantic Multimedia. In C. Baroglio, P. A. Bonatti, J. Maluszynski, M. Marchiori, A. Polleres, and S. Schaffert, editors, Reasoning Web, pages 125–170, San Servolo Island, September 2008. Springer LNCS 5224.

[20] C. Yang, O. Tiebe, K. Shirahama, and M. Grzegorzek. Object Matching with Hierarchical Skeletons. Pattern Recognition, 55:183–197, July 2016.

Part II

Visual Scene Analysis

Chapter 2

Large-Scale Multimedia Retrieval

Large-Scale Multimedia Retrieval (LSMR) is the task in which a large amount of multimedia data (e.g., images, videos and audio) is analysed to efficiently find the items relevant to a user-provided query. As described in many publications [55, 64, 69, 72], the most challenging issue is how to bridge the semantic gap, which is the lack of coincidence between raw data (i.e., pixel values or audio sample values) and the semantic meanings that humans perceive from this data. This chapter presents an overview of both traditional and state-of-the-art methods, which play principal roles in overcoming the semantic gap in LSMR.

2.1 Hierarchical Organisation of Semantic Meanings

First of all, by referring to Figure 2.1, let us define the semantic meanings targeted by LSMR. Since events are widely-accepted access units to multimedia data, semantic meanings are decomposed based on basic aspects of event descriptions [53, 89]. As shown in Figure 2.1 (a), meanings are organised using three components: concept, event and context. Based on [53, 89], concepts form the participation (or informational) aspect of objects in an event. That is, an event is derived by relating multiple objects. Contexts are the collection of part-of, causal and correlation aspects among events.

Figure 2.1: An illustration of decomposing meanings based on concepts, events and contexts [55].

More formally, concepts are defined as textual descriptions of meanings that can be perceived from images, shots or videos, such as objects like Person and Car, actions like Walking and Airplane Flying, and scenes like Outdoor and Nighttime [42, 66]. In other words, concepts are the most primitive meanings for multimedia data. Below, concept names are written in italics to distinguish them from other terms. An event is a higher-level meaning derived from the interaction of objects in a specific situation [26, 63]. In this chapter, an event is defined by a combination of concepts. For example, in Figure 2.1 (b), Shot 1 shows Cheese, Meat, Sausage and Grill, from which the event "barbecuing" is derived. Shot 2 displays Hand, Food Turner, Bread, Cheese and so on, where the event "putting Cheese etc. on Bread" is formed based on the movements of these concepts. Furthermore, as depicted by the bold arrow in Figure 2.1 (a), contexts are used to recursively define higher-level events based on part-of, causal and correlation relations among lower-level ones¹. In Figure 2.1 (b), based on the part-of relation, the events in Shots 1 and 2 are combined into the higher-level event "cooking a Hamburger". This event and the one in Shot 3 ("eating a Hamburger") are further abstracted into "eating the cooked Hamburger". Also, the correlation relation is used to connect two 'weakly-related' events, such as those which occur in separate locations but at the same time [53]. The final goal of LSMR is the above-mentioned organisation of semantic meanings based on concepts, events and contexts. To make the following discussion simple and clear, the term "example" is used to denote a single unit of multimedia data, such as an image, shot, video or audio clip. When the distinction among these data formats is not important, "examples" are used as their abstract name.

¹ In this chapter, contexts only include relations which are obtained from the multimedia data themselves, and exclude external data like geo-tags and Web documents.
However, an event is 'highly abstracted' in the sense that various objects interact with each other in different situations. In consequence, the visual appearances of examples relevant to a certain event can be completely different. In other words, these examples have a huge variance in the space of low-level features like colour, edge, and motion. One promising solution for this is a concept-based approach, which projects an example into a space where each dimension represents the detection result of a concept [66]. Owing to recent research progress, several concepts can be robustly detected irrespective of their sizes, directions and deformations in video frames. Thus, compared to the space of low-level features, where each dimension just represents a physical value of an example, in the space of concept detection results each dimension represents the appearance of a human-perceivable meaning. In such a space, the variation among examples relevant to an event becomes smaller and can be modelled more easily. That is, relevant examples that are dissimilar at the level of low-level features become more similar at the level of concepts.

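The following toy calculation (purely illustrative; all numbers are invented for demonstration) shows this effect in Python: two shots of the same "barbecuing" event that are far apart as colour histograms become close once each is represented by concept detection scores:

    import numpy as np

    def euclidean(a, b):
        return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

    # Low-level space: crude 3-bin colour histograms of two shots of the
    # same event, filmed indoors vs. outdoors (values are invented).
    hist_indoor  = [0.7, 0.2, 0.1]
    hist_outdoor = [0.1, 0.3, 0.6]

    # Concept space: detection scores for (Meat, Grill, Smoke) in the same
    # two shots. Both shots show the same concepts, so these vectors stay
    # close regardless of the lighting conditions.
    concepts_indoor  = [0.9, 0.8, 0.7]
    concepts_outdoor = [0.8, 0.9, 0.6]

    print(euclidean(hist_indoor, hist_outdoor))          # large distance
    print(euclidean(concepts_indoor, concepts_outdoor))  # small distance
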
Several publications have reported on the effectiveness of concept-based approaches. For example, Tešić et al. showed that, when using the same classifier (an SVM), concept detection scores lead to 50–180% higher event retrieval performance than colour and texture features [75]. In addition, Merler et al. reported that, compared to high-dimensional features (see the local features described in the next section), concept detection scores yield the best performance [40]. In particular, the example representation using detection scores for 280 concepts requires a 15 times smaller data space than high-dimensional features, and data sizes are crucial for the feasibility of LSMR. Furthermore, Mazloom et al. demonstrated that concept detection scores offer a 3.1–39.4% performance improvement compared to a high-dimensional feature [39].

Figure 2.2 shows an overview of concept-based LSMR. Although Figure 2.2 focuses on the "birthday party" event in videos, it is straightforward to apply the same approach to images or audio signals. First, each video is divided into shots. For this, there exist many accurate shot boundary detection methods. One popular approach is to detect a shot boundary as a significant difference between the colour histograms of two consecutive video frames [25]. At the bottom of Figure 2.2, each shot is represented by one video frame, and arranged from front to back based on its shot ID. Then, concept detection is conducted as a binary classification problem. For each concept, a detector is built using training shots annotated with the presence or absence of this concept. After that, the detector is used to associate each shot with a detection score, a value between 0 and 1 expressing the confidence in the presence of the concept. A larger detection score indicates a higher likelihood that the concept is present in the shot.

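As a concrete illustration of the first step, the sketch below (a simplification under stated assumptions, not necessarily the method of [25]; the correlation threshold of 0.5 is an arbitrary choice) detects shot boundaries with OpenCV by comparing the colour histograms of consecutive frames:

    import cv2

    def shot_boundaries(video_path, threshold=0.5):
        """Return frame indices where the colour histogram changes sharply."""
        cap = cv2.VideoCapture(video_path)
        boundaries, prev_hist, i = [], None, 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            # 8x8x8-bin BGR histogram, normalised to be comparable across frames
            hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                                [0, 256, 0, 256, 0, 256])
            hist = cv2.normalize(hist, hist).flatten()
            if prev_hist is not None:
                # Correlation close to 1 means similar frames; a drop below
                # the threshold is taken as a shot boundary.
                sim = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL)
                if sim < threshold:
                    boundaries.append(i)
            prev_hist, i = hist, i + 1
        cap.release()
        return boundaries
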
Figure 2.2: An overview of concept-based LSMR where "birthday party" is used as an example event.

Such detection scores are illustrated in the middle of Figure 2.2. For example, the first shot in the leftmost video shows an indoor scene where a person is bringing a birthday cake. Correspondingly, this shot is associated with the large detection scores 0.9, 0.7 and 0.7 for Person, Indoor and Food, respectively. Note that concept detection is uncertain, because small (or large) detection scores for a concept may be falsely assigned to shots where it is actually present (or absent). Nonetheless, representative concepts in shots are assumed to be successfully detected, and even if the detection of a concept fails on some shots, its contribution to an event can be appropriately evaluated by statistically analysing many shots. For example, even though the shot exemplified above does not display Crowd, a relatively large detection score of 0.4 is assigned to this shot. However, by checking the other shots in videos showing the event "birthday party", it can be revealed that Crowd is irrelevant to this event. The above concept detection allows us to represent each video as a multi-dimensional sequence in which each shot, defined as a vector of detection scores, is temporally ordered, as depicted in the middle of Figure 2.2.

A classifier is built to distinguish videos showing a certain event from the other videos, by comparing the multi-dimensional sequences for these videos. The classifier captures intra-/inter-shot concept relations that are specific to the event. For example, corresponding to candle-blowing scenes, videos relevant to "birthday party" often contain shots where Nighttime and Explosion Fire are detected with high detection scores. In addition, shots displaying Person are often followed by shots showing Singing or Dancing. Finally, the classifier is used to examine whether the event occurs in unknown videos.

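A minimal sketch of such an event classifier (illustrative only: it assumes the per-shot detection scores are simply average-pooled into one vector per video, which discards the temporal order of shots discussed above, and it trains on random stand-in data):

    import numpy as np
    from sklearn.svm import SVC

    # Each video: array of shape (num_shots, num_concepts) holding per-shot
    # concept detection scores (random stand-ins here).
    rng = np.random.default_rng(0)
    videos = [rng.random((rng.integers(5, 20), 300)) for _ in range(40)]
    labels = rng.integers(0, 2, size=40)  # 1 = "birthday party", 0 = other

    # Average-pool each shot sequence into one fixed-length vector per video.
    X = np.array([v.mean(axis=0) for v in videos])

    clf = SVC(kernel="rbf", probability=True).fit(X, labels)

    new_video = rng.random((12, 300))
    print(clf.predict_proba(new_video.mean(axis=0).reshape(1, -1)))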

In concept-based LSMR, one important issue is how to define a vocabulary of concepts. Such a vocabulary should be sufficiently rich to cover various events. One traditionally popular vocabulary is the Large-Scale Concept Ontology for Multimedia (LSCOM), which defines a standardised set of 1,000 concepts in the broadcast news video domain [42]. These concepts were selected based on their 'utility' for classifying content in videos, their 'coverage' for responding to a variety of queries, their 'feasibility' for automatic detection, and the 'availability' (observability) of large-scale training data. It is estimated that if the number of concepts in LSCOM reached 3,000, provided that the quality of the new concepts remained similar to that of the existing ones, the retrieval performance would approach that of the best search engines in text information retrieval [20]. The currently most popular concept vocabulary is ImageNet [13, 52]. It extends WordNet, a large lexical ontology in which concepts (called synonym sets or synsets) are interlinked based on their meanings [17]. ImageNet aims to assign on average 500 to 1,000 images to each WordNet concept. As of March 2017, 14,197,122 images were associated with 21,841 concepts through Amazon's Mechanical Turk, where the assignment of images is outsourced to Web users². The developers of ImageNet plan to assign 50 million images to 80,000 concepts in the near future. In what follows, concept-based LSMR is explained by focusing on its two main processes, concept detection in Section 2.2 and event retrieval in Section 2.3.

² http://image-net.org/

2.2 Concept Detection

Concept detection (including object detection, scene recognition, image and video classification, etc.) has been investigated for a long time. It can be formulated as a binary classification problem in machine learning, where for each concept a detector is trained to distinguish examples showing it from the others. This requires two types of training examples, positive examples and negative examples. The former and latter are examples that are annotated with the concept's presence and absence, respectively. By referring to these training examples, the detector examines test examples where neither the concept's presence nor absence is known. In accordance with this machine learning setting, Section 2.2.1 presents the basic framework by mainly focusing on representations of examples (i.e., features), and then Section 2.2.2 provides the state-of-the-art methods that extract useful representations by analysing a large number of examples.

2.2.1 Global versus Local Features


Classical methods cannot achieve accurate concept detection. One main reason is the weakness of global features, which are extracted from the whole region of an example. In other words, they only express the overall characteristics of an example. As an example of global features, Figure 2.3 shows a colour feature indicating the distribution of colours included in an image. This kind of overall representation loses a lot of information. For example, from the colour feature in Figure 2.3, the appearances of the car, road and vegetation cannot be deduced any more. In addition, the overall characteristics of an example can easily change depending on camera techniques and shooting environments. For instance, the colour distribution of the image in Figure 2.3 changes substantially if it is taken in a brighter or darker lighting condition.

To overcome the weakness of global features, Schmid and Mohr proposed to represent an example as a collection of local features, each of which is extracted from a local region of the example [54]. The top right of Figure 2.3 illustrates local features extracted from local regions circled in yellow. In addition, [36] developed a local feature called the Scale-Invariant Feature Transform (SIFT), which represents the shape in a local region and is reasonably invariant with respect to changes in illumination, rotation, scaling and viewpoint. By extracting a large number of such local features from an example, it can be ensured that at least some of them represent characteristic regions of a concept. More specifically, even if the car in Figure 2.3 is partially masked by other objects, local features that characterise a wheel, window or headlight are extracted from the visible part of the car. Van de Sande et al. developed extended SIFT features that are defined in different colour spaces and have unique invariance properties for lighting conditions [81]. Furthermore, local features can be defined around trajectories, each of which is obtained by tracking a sampled point in a video [86]. The resulting local features represent the displacement of a point, the derivative of that displacement, and the edges around a trajectory. Also, Speeded-Up Robust Features (SURF) are similar to SIFT features, but can be efficiently computed based on the integral image structure, which quickly identifies the sum of pixel values in any image region [3].

Figure 2.3: A comparison between a global feature and a local feature [56].

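A minimal sketch of local feature extraction (assuming an OpenCV build that includes the SIFT implementation, i.e. version 4.4 or later; the image path is a placeholder):

    import cv2

    # Load an image and detect SIFT keypoints with their 128-D descriptors.
    image = cv2.imread("car.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder path
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(image, None)

    # Each keypoint marks a characteristic local region (e.g., a corner);
    # each row of `descriptors` is one 128-dimensional SIFT vector.
    print(len(keypoints), descriptors.shape)  # e.g., 512 (512, 128)
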
Based on local features, Csurka et al. developed a simple and effective example representation called Bag of Visual Words (BoVW), in which each example is represented as a collection of characteristic local features, called visual words [10]. In BoVW, millions of local features are first grouped into clusters, where each cluster centre is a visual word representing a characteristic local region. Then, each local feature extracted from an example is assigned to the most similar visual word. As a result, as seen in the bottom right of Figure 2.3, the example is represented as a vector (histogram) where each dimension represents the frequency of a visual word. This way, the example is summarised into a single vector in which the detailed information is maintained by visual words (local features) that are robust with respect to varied visual appearances. The effectiveness of BoVW has been validated by many researchers [10, 27, 59, 81, 96].

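The two BoVW steps can be sketched as follows (illustrative: the vocabulary size of 1,000 and the random stand-in descriptors are assumptions, and real systems cluster millions of descriptors):

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)

    # Stand-in for the pool of 128-D SIFT descriptors from training data.
    training_descriptors = rng.random((10000, 128))

    # Step 1: cluster; each of the 1,000 cluster centres is a visual word.
    vocabulary = KMeans(n_clusters=1000, n_init=1).fit(training_descriptors)

    def bovw_histogram(descriptors, vocab):
        """Step 2: assign each local feature to its nearest visual word
        and count how often each word occurs."""
        words = vocab.predict(descriptors)
        hist = np.bincount(words, minlength=vocab.n_clusters)
        return hist / hist.sum()  # normalise so examples are comparable

    example = rng.random((512, 128))                  # one image's descriptors
    print(bovw_histogram(example, vocabulary).shape)  # (1000,)
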
Many extensions of BoVW have been proposed, such as soft assignment, which extracts a smoothed histogram by assigning each local feature to multiple visual words based on kernel density estimation [81]; sparse coding, which represents the distribution of a large number of basis functions used to sparsely approximate local features [91, 92]; the Gaussian Mixture Model (GMM) supervector, which estimates the distribution of local features using a GMM [22]; Fisher vector encoding, which considers the first- and second-order differences between the distribution of local features and a reference distribution [49]; and the Vector of Locally Aggregated Descriptors (VLAD), which concatenates vectors, each representing the accumulated difference between a visual word and its assigned local features [1, 24].

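Of these encodings, VLAD is simple enough to sketch compactly. The following minimal, unoptimised illustration assumes a k-means vocabulary like the one above and omits the intra-normalisation variants used in practice:

    import numpy as np

    def vlad_encode(descriptors, centres):
        """Accumulate, for every visual word, the residuals of the local
        features assigned to it, then concatenate and L2-normalise."""
        k, d = centres.shape
        # Assign each descriptor to its nearest centre.
        dists = np.linalg.norm(
            descriptors[:, None, :] - centres[None, :, :], axis=2)
        nearest = dists.argmin(axis=1)
        vlad = np.zeros((k, d))
        for word in range(k):
            assigned = descriptors[nearest == word]
            if len(assigned):
                vlad[word] = (assigned - centres[word]).sum(axis=0)
        vlad = vlad.ravel()
        return vlad / (np.linalg.norm(vlad) + 1e-12)

    rng = np.random.default_rng(0)
    centres = rng.random((64, 128))       # 64 visual words
    descriptors = rng.random((512, 128))  # one image's local features
    print(vlad_encode(descriptors, centres).shape)  # (8192,)
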
Another reason for the unsatisfactory performance of classical concept
detection is the insufficiency of training examples. Although local features
are useful for managing diverse visual appearances of a concept, instances
with significantly different appearances are included in the same concept
category. For example, the concept Car includes saloon cars, buses, trucks
and so on. Regarding this, a classifier can conduct accurate detection on
test examples where instances of a concept are similar to those in train-
ing examples. However, detection is not accurate on test examples where
instances have significantly different characteristics from those in train-
ing examples. Thus, a large number of training examples are required
to address the diversity attributed to the difference in instance types of
an object. In general, the detection performance is proportional to the
logarithm of the number of positive examples, although each concept has
its own complexity of recognition [43]. This means that each tenfold increase in the number of positive examples improves the performance by a roughly constant amount, about 10%. Considering this im-
portance of the number of training examples, researchers have developed
Web-based collaborative annotation systems where annotation of large-
scale multimedia data is distributed to many users on the Web [2, 83].
That is, these users collaboratively annotate a large number of examples
as positive or negative. In an extreme case, 80 million training images
yield accurate recognition performance [77].
However, regular users on the Web are unlikely to volunteer to annotate
when no benefit or no reason is given. In consequence, only researchers
participate in annotation, which makes it difficult to collect large-scale
annotation. Von Ahn and Dabbish proposed a Games With A Purpose
(GWAP) approach where users play a game, and as a side effect, a com-
putationally difficult task is solved [84, 85]. More concretely, users play a fun game without knowing that they conduct image annotation. Owing to the motivation that users want to have fun, as of July 2008, 200,000 users contributed to assigning more than 50 million labels to images on the
Web [85]. Another approach that motivates users is crowdsourcing that
outsources problems, traditionally performed by designated humans (employees), to users
on the Web [50]. In the field of multimedia annotation, one of the most fa-
mous crowdsourcing systems is Amazon’s Mechanical Turk where anyone
can post small tasks and specify prices paid for completing them [28]. ImageNet, currently the most popular large-scale concept vocabulary (see the previous section), was created via Mechanical Turk [13, 52].
A detector for a concept is built based on BoVW-based features and
large-scale training examples. In most cases, the detector is built as a
Support Vector Machine (SVM) [79], which constructs a classification
boundary based on the ‘margin maximisation’ principle so that it is placed
in the middle between positive and negative examples. This ‘moderate’
boundary which is biased toward neither positive nor negative examples
is suitable for BoVW. Specifically, many visual words (e.g., thousands of
visual words) are required to maintain the discrimination power of BoVW.
That is, an example is represented as a high-dimensional vector. This
renders the nearest neighbour classifier ineffective because many dimensions are irrelevant to the similarity calculation [6]. In contrast, the margin
maximisation makes the generalisation error of an SVM independent of
the number of dimensions, if this number is sufficiently large [79]. Ac-
tually, SVMs have been successfully applied to BoVW with thousands of
dimensions [10, 81, 27, 59].
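A minimal sketch of this detection step, with toy data standing in for real BoVW histograms and concept labels, might use scikit-learn's linear SVM:

    import numpy as np
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    X = rng.random((200, 1000))       # toy BoVW histograms (200 examples)
    y = rng.integers(0, 2, size=200)  # toy labels: 1 = present, 0 = absent

    detector = LinearSVC(C=1.0).fit(X, y)  # margin-maximising boundary

    X_test = rng.random((10, 1000))
    scores = detector.decision_function(X_test)  # signed distance to boundary

The continuous decision values, rather than hard class labels, are typically kept as concept detection scores for the retrieval steps discussed later.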
Below, two important issues for accurate concept detection are dis-
cussed. The first is how to sample local features. In general, local feature
extraction consists of two modules, a region detector and a region descrip-
tor [96]. The former detects regions useful for characterising objects, and
the latter represents each of the detected regions as a vector. For example,
SIFT features are typically extracted using the Harris-Laplace (or Harris-affine)
detector to identify regions where pixel values largely change in multiple di-
rections. Such regions are regarded as useful for characterising local shapes
of objects, like corners of buildings, vehicles and human eyes. Then, each
detected region is described as a 128-dimensional vector representing the
distribution of edge orientations. However, a concept is shown in signifi-
cantly different regions, and in videos, it does not necessarily appear in all
video frames. Considering this ‘uncertainty’ of concept appearances, it is
necessary to extract the BoVW representation of an example by exhaustively sampling local features in both the spatial and temporal dimensions.
Actually, the performance is improved as the number of sampled local
features increases [47]. In addition, Snoek et al. compared two meth-
ods [67]. One extracts features only from one video frame in each shot
(one shot contains more than 60 frames), and the other extracts features
every 15 frames. They found that the latter outperforms the former by 7.5 to 38.8%. The second issue is the expensive computational cost of processing
a large number of training examples and exhaustively sampled local fea-
tures. So far, many methods for reducing these computational costs have
been developed based on special hardware like computer clusters [90] and
General-Purpose computing on Graphics Processing Units (GPGPU) [82],
or based on algorithm sophistication with sub-problem decomposition [15],
tree structures [22] and matrix operations [61].
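To illustrate the exhaustive (dense) sampling strategy discussed above, the following sketch computes SIFT descriptors on a regular spatial grid instead of relying on a sparse region detector; OpenCV's SIFT implementation is assumed to be available, and the grid parameters are purely illustrative:

    import cv2

    def dense_sift(gray_image, step=8, size=16):
        """Compute SIFT descriptors at keypoints placed on a dense grid,
        instead of relying on a sparse region detector."""
        h, w = gray_image.shape
        keypoints = [cv2.KeyPoint(float(x), float(y), float(size))
                     for y in range(step, h - step, step)
                     for x in range(step, w - step, step)]
        sift = cv2.SIFT_create()
        keypoints, descriptors = sift.compute(gray_image, keypoints)
        return descriptors    # one 128-dim vector per grid point

For videos, the same grid sampling would simply be repeated over the sampled frames (e.g., every 15 frames, as in the comparison above).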

2.2.2 Feature Learning


Global and local features described in the previous section are ‘hand-
crafted’ or ‘human-crafted’ in the sense that their representations are man-
ually specified in advance [4]. For instance, a SIFT feature is described as a
128-dimensional vector where each dimension represents the frequency of
a certain edge orientation in a local region. However, such a hand-crafted
feature is insufficient for representing diverse concept appearances. This
is because all possible appearances can neither be anticipated in advance nor be appropriately represented by a fixed feature. Apart from this, the
human brain recognises concepts in a hierarchical fashion where simple
cells are gradually combined into more abstract complex cells [29]. This
hierarchical brain functionality has recently been implemented as deep learning
that constructs a feature hierarchy with higher-level features formed by
the composition of lower-level features [4, 5]. Such a feature hierarchy is
represented as a multi-layer neural network. In every layer, each of the
artificial neurons composes a more abstract feature based on outputs of
neurons in the previous layer.
Figure 2.4 shows a conceptual comparison between a traditional ma-
chine learning approach using a hand-crafted feature and a deep learning
approach. The former in Figure 2.4 (a) uses a ‘shallow architecture’ con-
sisting of two layers, where the first layer transforms an example into a
feature represented by a high-dimensional vector, and the second layer aggregates values of this feature into a detection result of a concept. On
the other hand, the deep learning approach in Figure 2.4 (b) first projects an example into the most primitive features at the bottom layer, and then these
features are projected into more abstract ones at the second layer. This
abstraction of features is iterated to obtain a detection result of the con-
cept. For example, features in the bottom and second layers correspond
to typical edges and their combinations, respectively. Moreover, features
in an upper layer represent parts of a car, and the ones in the top layer
indicate the whole car. In this way, the workflow from processing pixels to
recognising a concept is unified into a deep architecture, which is extracted
from large-scale data.

Figure 2.4: A conceptual comparison between traditional machine learning and deep learning approaches: (a) a traditional approach using a hand-crafted feature; (b) a deep learning approach.

Deep learning mainly offers the following three advantages (see [5] for
more detail): The first is its discrimination power compared to the shallow architecture of the traditional machine learning approach. A shallow architecture requires O(N) parameters to distinguish O(N) examples, while a deep architecture can distinguish up to O(2^N) examples using only O(N) parameters [5]. Intuitively, a
huge first layer (i.e., a very high-dimensional feature vector) is required for the traditional approach to discriminate diverse examples. In contrast, the
discrimination power of the deep architecture is exponentially increased
based on the combination of features in two consecutive layers. The second
advantage is the invariance property where more abstract features are
generally invariant to subtle changes in visual appearances. The last is explanatory power: the learnt feature hierarchy can capture valuable patterns or structures underlying raw images or videos. Finally, a
classifier for detecting a concept is created by using the learnt hierarchy
as initialisation of a multi-layer neural network, or by building a supervised
classifier by constructing the feature vector of each example based on the
hierarchy (this is called transfer learning [38]).
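A minimal sketch of the transfer-learning variant, assuming PyTorch and torchvision as dependencies (the exact weight-loading API depends on the torchvision version), uses a pretrained network as a fixed feature extractor whose outputs feed a new classifier:

    import torch
    import torch.nn as nn
    import torchvision.models as models

    # Load a network pretrained on large-scale data and drop its last
    # layer, so the remaining layers act as a learnt feature hierarchy.
    backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    backbone.fc = nn.Identity()   # expose the 512-dim penultimate features
    backbone.eval()

    with torch.no_grad():
        images = torch.randn(4, 3, 224, 224)  # dummy batch for illustration
        features = backbone(images)           # (4, 512) transferable features

    # These feature vectors can then be fed to any supervised classifier
    # (e.g., the SVM sketched earlier) for a new concept detection task.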
One of the most fundamental deep learning models, AlexNet, is
implemented as an eight-layer Convolutional Neural Network (CNN) which
iteratively conducts convolution or pooling of outputs by neurons in the
previous layer [30]. Convolution works as feature extraction using filters
each represented by weights of a neuron. On the other hand, pooling sum-
marises outputs of neighbouring neurons to extract more abstract features.
The parameter optimisation is conducted by stochastic gradient descent
which updates each weight of a neuron by backpropagating the derivative
of training errors with respect to this weight. In ILSVRC 2012, a worldwide competition on large-scale image classification [11], AlexNet with an error rate of 15.3% significantly outperformed the others (the second-best error rate was 26.1%). Also, Le et al. developed a nine-layer stacked
sparse autoencoder to train concept detectors from unlabelled images [32].
Each layer consists of three sub-layers, filtering, pooling and normalisation,
which respectively offer feature extraction from small regions of the pre-
vious layer, the invariance of features (neighbouring neurons’ outputs) to
local deformation of visual appearances, and the range adjustment of fea-
tures. The stacked sparse autoencoder is optimised layer-by-layer so that
sparse features constructed in a layer can be accurately converted back
into the ones in the previous layer. By training such a stacked autoen-
coder using 10 million unlabelled images with 16,000 cores, it was shown
that the highest-level neurons characterise concepts like Face, Cat Face
and Human Body. Moreover, compared to state-of-the-art methods, the
multi-layer classifier using the stacked autoencoder as the initialisation yields 15% and 70% performance improvements for 10,000 and 22,000 concept
detection tasks, respectively. Inspired by the above-mentioned research,
many improved deep learning models have been proposed such as VGGNet
which is a very deep CNN consisting of 16 (or 19) layers with small filters (3 × 3) [62], GoogLeNet which is a 22-layer CNN where mul-
tiple convolutions are performed in parallel [71], and ResNet which is a
152-layer CNN where neuron outputs in a layer are forwarded not only to the next layer but also, via shortcut connections, to more distant layers [21].
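The convolution-pooling pattern shared by these architectures can be sketched in PyTorch as follows; the layer sizes are illustrative and do not correspond to any specific published model:

    import torch
    import torch.nn as nn

    class TinyCNN(nn.Module):
        """Illustrative CNN: stacked convolution + pooling stages followed
        by a classifier, echoing the AlexNet/VGGNet design pattern."""
        def __init__(self, n_concepts=10):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 32, kernel_size=3, padding=1),   # filtering
                nn.ReLU(),
                nn.MaxPool2d(2),                              # pooling
                nn.Conv2d(32, 64, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(2),
            )
            self.classifier = nn.Linear(64 * 56 * 56, n_concepts)

        def forward(self, x):                # x: (batch, 3, 224, 224)
            x = self.features(x)
            return self.classifier(x.flatten(1))

    scores = TinyCNN()(torch.randn(2, 3, 224, 224))   # (2, n_concepts)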

2.3 Event Retrieval


There are two scenarios of event retrieval. In the first scenario, a ma-
chine learning setting similar to concept detection is adopted by regarding
concept detection scores for each example as its feature vector. To for-
mulate this, examples are re-defined as follows: Positive examples indicate
the ones showing a certain event, while all the other examples are regarded
as negative. Based on training examples consisting of these positive and
negative examples, a classifier is built to examine the occurrence of the
event in unknown test examples. The second scenario is called zero-shot
learning, which builds a classifier for an event with no training examples. Instead, the classifier is constructed by considering how the event is semantically composed of concepts [31, 19]. For example, a classifier for
the event “a person is playing guitar outdoors” is constructed so as to
assign high weights to detection scores for the concepts Person, Outdoors
and Playing Guitar, because these concepts are obviously important for
the event.
The following discussion mainly focuses on the first scenario where a
user provides a small number of positive examples for an event. It should
be noted that although a huge diversity of examples can be negative, it
is difficult or unrealistic for the user to provide such negative examples.
On the other hand, negative examples are necessary for accurately shap-
ing regions of examples relevant to the event [33, 60]. With respect to
this, Natsev et al. assumed that only a small number of examples in the
database are relevant to an event, and all the others are irrelevant [44].
Based on this, they proposed an approach which selects negative examples
as randomly sampled examples because almost all of them should be ir-
relevant to the query. This approach works well and has been validated in
many high performance retrieval systems [46, 65]. Keeping this prepara-
tion of training examples, existing event retrieval methods are summarised
by classifying them into two categories. The first described in Section 2.3.1
focuses on events within images/shots, and the second in Section 2.3.2
targets events over video shot sequences.
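The random sampling of pseudo-negative examples described above is straightforward to sketch (NumPy only; the score matrix and positive indices are toy stand-ins):

    import numpy as np

    rng = np.random.default_rng(0)
    database_scores = rng.random((10000, 500))  # toy (n_db, n_concepts)
    pos_idx = np.array([3, 17, 42])             # user-provided positives (toy)

    # Sample pseudo-negatives uniformly from the rest of the database;
    # under the assumption above, almost all of them are irrelevant.
    candidates = np.setdiff1d(np.arange(len(database_scores)), pos_idx)
    neg_idx = rng.choice(candidates, size=100, replace=False)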

2.3.1 Event Retrieval within Images/Shots

Given positive examples for an event, methods in this category construct a classifier that fuses concept detection scores for a test example (image
or shot) into a single relevance score. This score indicates the relevance
of the test example to the event. Existing methods are roughly classified
into four categories: linear combination, discriminative, similarity-based, and
probabilistic. Linear combination builds a classifier that computes the rel-
evance score of a test example by weighting detection scores for multiple
concepts. One popular approach to build such a classifier is to analyse con-
cept detection scores in positive examples. If the average detection score
for a concept in positive examples is large, this concept is regarded as
related to the query and associated with a large weight [45, 88]. Discrimi-
native methods construct a discriminative classifier (typically, an SVM) using
positive and negative examples for an event [45, 46]. The relevance score
of a test example is obtained as the classifier’s output. Similarity-based
methods compute the relevance score of a test example as the similarity
between positive examples and the test example in terms of concept detec-
tion scores. The method in [34] uses the cosine similarity and a modified
entropy as similarity measures. Probabilistic methods estimate a proba-
bilistic distribution of concepts using detection scores in positive examples,
and use it to compute the relevance score of a test example. In [51], the
relevance score of a test image is computed as the similarity between the
multinomial distribution of concepts estimated from positive examples and
the one estimated from the test image.
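The linear-combination and similarity-based strategies can both be sketched in a few lines of NumPy; the score matrices below are toy stand-ins for precomputed concept detection scores:

    import numpy as np

    rng = np.random.default_rng(0)
    pos_scores = rng.random((5, 500))    # toy: 5 positives, 500 concepts
    test_scores = rng.random((100, 500)) # toy: 100 test examples

    # Linear combination: concepts with high average scores in the
    # positive examples receive large weights.
    weights = pos_scores.mean(axis=0)
    relevance_linear = test_scores @ weights

    # Similarity-based: cosine similarity to the mean positive example.
    query = pos_scores.mean(axis=0)
    denom = (np.linalg.norm(test_scores, axis=1)
             * np.linalg.norm(query) + 1e-12)
    relevance_cosine = (test_scores @ query) / denom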
In the zero-shot learning scenario, one popular approach to classifier
construction is ‘text-based weighting’ where a concept is associated with a
large weight if its name is lexically similar to a term in the text description
of the query [45, 88]. The lexical similarity between a concept name and a
term can be measured by employing a lexical ontology like WordNet [17],
or recently by utilising their vector representations (word2vec [41]) ob-
tained by a neural network, which is trained on a large amount of text
data [94]. Another approach is to construct an embedding space between
visual features and text descriptions for training examples [19]. Given a
test example, its text description is estimated by projecting its visual fea-
tures into the embedding representation, which is then further projected
into a text description. Finally, the similarity between this description and
the text description of an event is computed.
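A toy sketch of text-based weighting follows; the tiny random embedding table merely stands in for vectors from a pretrained word2vec model:

    import numpy as np

    rng = np.random.default_rng(0)
    vocabulary = ['person', 'outdoors', 'guitar', 'car', 'dog']
    embeddings = {w: rng.standard_normal(300) for w in vocabulary}

    def cosine(u, v):
        return float(u @ v
                     / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

    def zero_shot_weights(query_terms, concept_names):
        """Weight each concept by its highest embedding similarity to any
        term in the textual event description."""
        return np.array([max(cosine(embeddings[c], embeddings[t])
                             for t in query_terms)
                         for c in concept_names])

    weights = zero_shot_weights(['person', 'guitar', 'outdoors'],
                                vocabulary)
    # The relevance of a test example is then, e.g., test_scores @ weights.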

2.3.2 Event Retrieval over Shot Sequences
This section focuses only on the usual machine learning setting to detect
an event over video shot sequences. One big problem is the difficulty of
annotating the relevance of each shot. The reasons are two-fold: First,
it is labour-intensive to annotate shots contained in each video. Second,
videos are known as continuous media where sequences of media quanta
(i.e., video frames and audio samples) convey semantic meanings when
continuously played over time [18]. Due to this temporal continuity, any
segment of a video can become a meaningful unit [73]. Specifically, hu-
mans tend to relate each shot in a video to surrounding ones. Let us
consider a video where the event “birthday party” is shown. One per-
son may think that the event occurs in a shot where a birthday cake is
brought to a table, followed by a shot showing a candle blowing scene.
But, another person may perceive that the surrounding shots displaying
participants’ chatting are also a part of the birthday party. This kind of
shot relation makes it difficult to determine the boundary of an event unambiguously.
Thus, objective annotation is only possible at the video level in terms of
whether each video contains an event or not. Hence, a classifier needs to
be built under this weakly supervised setting where even if a training video
is annotated as relevant to an event, it may include many irrelevant shots.
The simplest approach to build a classifier under the weakly supervised setting3 is to create a 'video-level vector' using max-pooling [9, 35] or average-pooling [76], which computes each dimension value as the maximum or average concept detection score over shots in a video. However, such video-level vectors are clearly too coarse, because max-pooling may over-estimate detection scores for concepts irrelevant to an event, and average-pooling may under-estimate the scores for relevant concepts.

3 Event detection under the weakly supervised setting is being explored in the TRECVID Multimedia Event Detection task [63]. Although some other methods (e.g., [23, 74, 80]) can treat the weakly supervised setting, they use low-level features, so they are excluded from the discussion.
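A minimal sketch of these two pooling schemes over a video's shot-level concept scores (NumPy only, with toy data):

    import numpy as np

    # Concept detection scores of one video, one row per shot (toy
    # random data stands in for real detector outputs).
    shot_scores = np.random.default_rng(0).random((30, 500))

    video_vec_max = shot_scores.max(axis=0)    # max-pooling
    video_vec_avg = shot_scores.mean(axis=0)   # average-pooling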
Shirahama et al. developed a more sophisticated method using a Hid-
den Conditional Random Field (HCRF) [57]. It is a probabilistic discrimi-
native classifier with a set of hidden states. These states are used as the
intermediate layer to discriminate between relevant and irrelevant shots to
an event. Specifically, each shot in a video is assigned to a hidden state
by considering its concept detection scores and transitions among hidden
states. Then, hidden states and transitions are optimised so as to max-

3
Event detection under weakly supervised setting is being explored in TRECVID
Multimedia Event Detection task [63]. Although some other methods (e.g., [23, 74,
80]) can treat weakly supervised setting, they use low-level features, so are excluded
from the discussion.

33
imise the discrimination between positive and negative videos. It is shown
that the optimised hidden states and transitions successfully capture con-
cepts and their temporal relations that are specific to the event. Sun and
Nevatia proposed a method which extracts temporal concept transitions
in an event using Fisher kernel encoding [70]. Using all training videos,
they first build an HMM which works as a prior distribution, representing
concept transitions in the general case. Then, the video-level vector of
a video is created by computing the difference between the actual transi-
tions of concept detection scores in the video, and the transitions predicted
by the HMM. Thereby, vectors of positive videos for an event represent
characteristic concept transitions by suppressing trivial transitions that are
observed in many negative videos. Finally, Lu and Grauman developed
a metric which can quantify the context between two events, by finding
concepts that appear in the first event and strongly influence the second
one [37]. Such influences are measured by performing a random walk on
the bipartite graph, which consists of event and concept nodes. A concept
is regarded as influential if ignoring it leads to a dramatic decrease of the
probability of transition between two event nodes. In [37], the metric was
used to create summaries consisting of events associated with semantically
consistent contexts.

2.4 Conclusion and Future Trends

This chapter presented a survey of traditional and state-of-the-art LSMR methods by mainly focusing on concept detection and event retrieval pro-
cesses. Regarding the former, thanks to the preparation of large-scale
datasets like ImageNet [13, 52] and the development of deep learning ap-
proaches described in Section 2.2.2, many concepts can be detected with acceptable
accuracies. One open issue for concept detection is how to successfully
extend deep learning approaches that have been successful for the im-
age (i.e., spatial) domain to the video (i.e., temporal) domain. Although
several methods use 3D convolutional neural networks [78] or Long Short-Term Memory (LSTM) [68], there is still significant room for improvement.
Compared to concept detection, event retrieval needs much more research
attention for both performance improvement and method innovation, as
discussed below.

2.4.1 Reasoning
Existing event retrieval approaches lack reasoning to precisely infer events
(higher-level semantic meanings) based on ontological properties and rela-
tions of concepts. Even though some works consider hierarchical relations
among concepts, they only use is-a (generalisation/specialisation) connec-
tions [12, 98]. Reasoning based on concept properties and
relations is necessary because concept detection itself has the following two
limitations: First, concepts are too general to identify examples that users
want to retrieve. Second, most of the existing methods use concepts in
isolation. For example, various events are displayed in examples where the
concepts Person, Hand and Ball are present. In other words, examples that
users really want to see cannot be identified by independently examining
presences of Person, Hand and Ball. Instead, if it is observed that the
Hand of a Person is moving and the Ball is separating from the Person,
the event “throwing a ball” can be derived.
Due to the poor performance of past concept detection methods, the
above kind of reasoning has received little research attention. However,
considering their recent improvements, it seems to be the right time for the
reasoning to be widely addressed in LSMR. For this, [8] developed an in-
teresting approach which optimally specialises detected concepts and their
relations, so that they are the most probable and ontologically-consistent.
This approach, which formulates reasoning as an optimisation problem
based on constraints defined by the ontology, can be considered as a
promising future direction of LSMR.

2.4.2 Uncertainties in Concept Detection


Reasoning requires overcoming the crucial problem of how to manage ‘un-
certainties’ in concept detection. There are still many concepts that can-
not be detected with high accuracies. In addition, real-world examples are
‘unconstrained’ in the sense that they can be taken by arbitrary camera
techniques and in arbitrary shooting environments [26]. Hence, even in the
future, it cannot be expected to detect concepts with 100% accuracy.
If one relies on uncertain concept detection results, detection errors for
some concepts can damage the whole reasoning process.
Shirahama et al. have developed a pioneering method which can han-
dle uncertainties based on Dempster-Shafer Theory (DST) [58]. DST is
a generalisation of Bayesian theory where a probability is not assigned to a single hypothesis only, but to a set of hypotheses.