Advanced Intelligent Virtual Reality Technologies


Smart Innovation, Systems and Technologies 330

Kazumi Nakamatsu · Srikanta Patnaik ·
Roumen Kountchev · Ruidong Li ·
Ari Aharari   Editors

Advanced
Intelligent Virtual
Reality Technologies
Proceedings of 6th International
Conference on Artificial Intelligence and
Virtual Reality (AIVR 2022)

Smart Innovation, Systems and Technologies

Volume 330

Series Editors
Robert J. Howlett, Bournemouth University and KES International,
Shoreham-by-Sea, UK
Lakhmi C. Jain, KES International, Shoreham-by-Sea, UK
The Smart Innovation, Systems and Technologies book series encompasses the topics
of knowledge, intelligence, innovation and sustainability. The aim of the series is to
make available a platform for the publication of books on all aspects of single and
multi-disciplinary research on these themes in order to make the latest results avail-
able in a readily-accessible form. Volumes on interdisciplinary research combining
two or more of these areas are particularly sought.
The series covers systems and paradigms that employ knowledge and intelligence
in a broad sense. Its scope is systems having embedded knowledge and intelligence,
which may be applied to the solution of world problems in industry, the environment
and the community. It also focusses on the knowledge-transfer methodologies and
innovation strategies employed to make this happen effectively. The combination
of intelligent systems tools and a broad range of applications introduces a need
for a synergy of disciplines from science, technology, business and the humanities.
The series will include conference proceedings, edited collections, monographs,
handbooks, reference books, and other relevant types of book in areas of science and
technology where smart systems and technologies can offer innovative solutions.
High quality content is an essential feature for all book proposals accepted for the
series. It is expected that editors of all accepted volumes will ensure that contributions
are subjected to an appropriate level of reviewing process and adhere to KES quality
principles.
Indexed by SCOPUS, EI Compendex, INSPEC, WTI Frankfurt eG, zbMATH,
Japanese Science and Technology Agency (JST), SCImago, DBLP.
All books published in the series are submitted for consideration in Web of Science.
Kazumi Nakamatsu · Srikanta Patnaik ·
Roumen Kountchev · Ruidong Li · Ari Aharari
Editors

Advanced Intelligent Virtual Reality Technologies
Proceedings of 6th International Conference
on Artificial Intelligence and Virtual Reality
(AIVR 2022)
Editors

Kazumi Nakamatsu
University of Hyogo
Kobe, Japan

Srikanta Patnaik
SOA University
Bhubaneswar, India

Roumen Kountchev
Technical University of Sofia
Sofia, Bulgaria

Ruidong Li
Kanazawa University
Kanazawa, Japan

Ari Aharari
Sojo University
Kumamoto, Japan

ISSN 2190-3018 ISSN 2190-3026 (electronic)


Smart Innovation, Systems and Technologies
ISBN 978-981-19-7741-1 ISBN 978-981-19-7742-8 (eBook)
https://doi.org/10.1007/978-981-19-7742-8

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature
Singapore Pte Ltd. 2023
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
AIVR 2022 Organization

Honorary Chair

Prof. Lakhmi C. Jain, KES International, UK

General Co-chairs

Assoc. Prof. Ruidong Li, Kanazawa University, Japan


Prof. Kazumi Nakamatsu, University of Hyogo, Japan

Conference Chair

Assoc. Prof. Ari Aharari, Sojo University, Japan

International Advisory Board

Srikanta Patnaik, SOA University, India


Xiang-Gen Xia, University of Delaware, USA
Shrikanth (Shri) Narayanan, University of Southern California, USA
Hossam Gaber, Ontario Tech University, Canada
Jair Minoro Abe, Paulista University, Brazil
Mario Divan, National University de la Pampa, Argentina
Chip Hong Chang, Nanyang Technological University, Singapore
Aboul Ela Hassanien, Cairo University, Egypt
Ari Aharari, Sojo University, Japan


Program Chairs

Mohd. Zaid Abdullah, Universiti Sains Malaysia, Malaysia


Minghui Li, The University of Glasgow, Singapore
Letian Huang, University of Electronic Science and Technology of China, China

Technical Program Committee

Michael R. M. Jenkin, York University, Canada


Georgios Albanis, Centre for Research and Technology, Greece
Nourddine Bouhmala, Buskerud and Vestfold University College, Norway
Pamela Guevara, University of Concepción, Chile
Joao Manuel R. S. Tavares, University of Porto, Portugal
Punam Bedi, University of Delhi, India
Der-Chyuan Lou, Chang Gung University, Taiwan
Chang Hong Lin, National Taiwan University of Science and Technology, Taiwan
Tsai-Yen Li, National Chengchi University, Taiwan
Zhang Yu, Harbin Institute of Technology, China
Yew Kee Wong, Jiangxi Normal University, China
Lili Nurliyana Abdullah, University Putra Malaysia, Malaysia
S. Nagarani, Sri Ramakrishna Institute of Technology, India
Liu Huaqun, Beijing Institute of Graphic Communication, China
Jun Lin, Nanjing University, China
Hasan Kadhem, American University of Bahrain, USA
Gennaro Vessio, University of Bari, Italy
Romana Rust, ITA Institute of Technology in Architecture, Switzerland
S. Anne Susan Georgena, Sri Ramakrishna Institute of Technology, India
Juan Gutiérrez-Cárdenas, Universidad de Lima, Peru
Devendra Kumar R. N., Sri Ramakrishna Institute of Technology, Coimbatore, India
Shilei Li, Naval University of Engineering, China
Jinglu Liu, The Open University of China, China
Aiman Darwiche, Instructor and Software Developer, USA
Alexander Arntz, University of Applied Sciences Ruhr West, Germany
Mariella Farella, University of Palermo, Italy
Daniele Schicchi, University of Palermo, Italy
Liviu Octavian Mafteiu-Scai, West University of Timisoara, Romania
Shivaram, Tata Consultancy Services, India
Niket Shastri, Sarvajnik College of Engineering and Technology, India
Gbolahan Olasina, University of KwaZulu-Natal, South Africa
Amar Faiz Zainal Abidin, Universiti Teknikal Malaysia Melaka, Malaysia
Muhammad Naufal Bin Mansor, Universiti Malaysia Perlis (UniMAP), Malaysia
Le Nguyen Quoc Khanh, Nanyang Technological University, Singapore

Organizer and Supporting Institutes

Beijing Huaxia Rongzhi Blockchain Technology Institute, China


Sojo University, Japan
Universiti Sains Malaysia, Malaysia
Universiti Teknologi Malaysia, Malaysia
Chang Gung University, China
Preface

The international conference series, Artificial Intelligence and Virtual Reality


(AIVR), has been bringing together researchers and scientists, both industrial and
academic, developing novel Artificial Intelligence and Virtual Reality outcomes.
Research in Virtual Reality (VR) is concerned with computing technologies that allow
humans to see, hear, talk, think, learn, and solve problems in virtual and augmented
environments. Research in Artificial Intelligence (AI) addresses technologies that
allow computing machines to mimic these same human abilities. Although these
two fields evolved separately, they share an interest in human senses, skills, and
knowledge production. Thus, bringing them together will enable us to create more
natural and realistic virtual worlds and develop better, more effective applications.
Ultimately, this will lead to a future in which humans and humans, humans and
machines, and machines and machines are interacting naturally in virtual worlds,
with use cases and benefits we are only just beginning to imagine.
The sixth International Conference on Artificial Intelligence and Virtual Reality
(AIVR 2022) was originally scheduled to be held in Kumamoto, Japan, on July
22–24, 2022. However, with the world still fighting the COVID-19 pandemic, there
is no doubt that the safety and well-being of our participants come first. Considering
the health and safety of everyone, we had to make a tough decision and convert
AIVR 2022 into a fully online conference held via the Internet.
Past AIVR conferences were held in Nagoya (2018), Singapore (2019), and as
virtual conferences (2020 and 2021). AIVR 2022, the latest event in this successful
conference series, provided an ideal opportunity to reflect on developments over
the last two decades and to focus on future developments.
The topics of AIVR 2022 focus on theory, design, development, testing, and
evaluation of all Virtual Reality intelligent technologies applicable/applied to
various systems and their infrastructures, and the major topics cover system tech-
niques, performance, and implementation; content creation and modeling; cognitive
aspects, perception, and user behavior in terms of Virtual Reality; AI technolo-
gies; interactions/interactive and responsive environments; and applications and case
studies.


We accepted one invited paper and 16 regular papers out of the 44 papers submitted
from China, Germany, Greece, Japan, Malaysia, Brazil, the UK, and other countries
to AIVR 2022. This volume presents all the accepted papers of AIVR 2022.
Lastly, we wish to express our sincere appreciation to all participants and to the
technical program committee for their reviews of all the submissions, which were vital
to the success of AIVR 2022, and also to the members of the organizing team who
dedicated their time and efforts to planning, promoting, organizing, and helping run
the conference. Special appreciation is extended to our keynote and invited speakers:
Prof. Xiang-Gen Xia, University of Delaware, USA; Prof. Shrikanth (Shri) Narayanan,
University of Southern California, USA; Prof. Chip Hong Chang, Nanyang Tech-
nological University, Singapore; and Prof. Minghui Li, University of Glasgow, UK,
who gave very beneficial speeches for the conference audience, and also to Prof. Jair
M. Abe, Paulista University, Sao Paulo, Brazil, who kindly contributed an invited
paper to AIVR 2022.

Kobe, Japan Kazumi Nakamatsu
Sofia, Bulgaria Roumen Kountchev
Bhubaneswar, India Srikanta Patnaik
Kanazawa, Japan Ruidong Li
Kumamoto, Japan Ari Aharari
July 2022
Contents

Part I Invited Paper


1 Paraconsistency and Paracompleteness in AI: Review Paper . . . . . . 3
Jair Minoro Abe, João I. da Silva Filho, and Kazumi Nakamatsu

Part II Regular Papers


2 Decision Support Multi-agent Modeling and Simulation
of Aeronautic Marine Oil Spill Response . . . . . . . . . . . . . . . . . . . . . . . . 19
Xin Li, Hu Liu, YongLiang Tian, YuanBo Xue, and YiXiong Yu
3 Transferring Dense Object Detection Models To Event-Based
Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Vincenz Mechler and Pavel Rojtberg
4 Diagnosing Parkinson’s Disease Based on Voice Recordings:
Comparative Study Using Machine Learning Techniques . . . . . . . . . 49
Sara Khaled Abdelhakeem, Zeeshan Mohammed Mustafa,
and Hasan Kadhem
5 Elements of Continuous Reassessment and Uncertainty
Self-awareness: A Narrow Implementation for Face and Facial
Expression Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Stanislav Selitskiy
6 Topic-Aware Networks for Answer Selection . . . . . . . . . . . . . . . . . . . . . 73
Jiaheng Zhang and Kezhi Mao
7 Design and Implementation of Multi-scene Immersive Ancient
Style Interaction System Based on Unreal Engine Platform . . . . . . . . 85
Sirui Yang, Qing Qing, Xiaoyue Sun, and Huaqun Liu
8 Auxiliary Figure Presentation Associated with Sweating
on a Viewer’s Hand in Order to Reduce VR Sickness . . . . . . . . . . . . . 101
Masaki Omata and Mizuki Suzuki


9 Design and Implementation of Immersive Display Interactive
System Based on New Virtual Reality Development Platform . . . . . . 119
Xijie Li, Huaqun Liu, Tong Li, Huimin Yan, and Wei Song
10 360-Degree Virtual Reality Videos in EFL Teaching: Student
Experiences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
Hui-Wen Huang, Kai Huang, Huilin Liu, and Daniel G. Dusza
11 Medical-Network (Med-Net): A Neural Network for Breast
Cancer Segmentation in Ultrasound Image . . . . . . . . . . . . . . . . . . . . . . 145
Yahya Alzahrani and Boubakeur Boufama
12 Auxiliary Squat Training Method Based on Object Tracking . . . . . . 161
Yunxiang Pang, Haiyang Sun, and Yiqun Pang
13 Study on the Visualization Modeling of Aviation Emergency
Rescue System Based on Systems Engineering . . . . . . . . . . . . . . . . . . . 173
Yuanbo Xue, Hu Liu, Yongliang Tian, and Xin Li
14 An AI-Based System Offering Automatic DR-Enhanced AR
for Indoor Scenes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
Georgios Albanis, Vasileios Gkitsas, Nikolaos Zioulis,
Stefanie Onsori-Wechtitsch, Richard Whitehand, Per Ström,
and Dimitrios Zarpalas
15 Extending Mirror Therapy into Mixed Reality—Design
and Implementation of the Application PhantomAR
to Alleviate Phantom Limb Pain in Upper Limb Amputees . . . . . . . . 201
Cosima Prahm, Korbinian Eckstein, Michael Bressler,
Hideaki Kuzuoka, and Jonas Kolbenschlag
16 An Analysis of Trends and Problems of Information
Technology Application Research in China’s Accounting Field
Based on CiteSpace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
Xiwen Li, Jun Zhang, Ke Nan, and Xiaoye Niu
17 Augmented Reality Framework and Application for Aviation
Emergency Rescue Based on Multi-Agent and Service . . . . . . . . . . . . 237
Siliang Liu, Hu Liu, and Yongliang Tian

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249


About the Editors

Kazumi Nakamatsu received the M.Eng. and Dr.Sci. degrees from Shizuoka University
and Kyushu University, Japan, respectively. His research interests encompass various
kinds of logic and their applications to Computer Science, especially paraconsistent
annotated logic programs and their applications. He has developed some paracon-
sistent annotated logic programs called ALPSN (Annotated Logic Program with
Strong Negation), VALPSN (Vector ALPSN), EVALPSN (Extended VALPSN) and
bf-EVALPSN (before-after EVALPSN) recently, and applied them to various intelli-
gent systems such as a safety verification based railway interlocking control system
and process order control. He is the author of over 180 papers, 30 book chapters,
and 20 edited books published by prominent publishers. Kazumi Nakamatsu has
chaired various international conferences, workshops, and invited sessions, and he
has been a member of numerous international program committees of workshops and
conferences in the area of Computer Science. He has served as the editor-in-chief of
the International Journal of Reasoning-based Intelligent Systems (IJRIS); he is now
the founding editor of IJRIS and an editorial board member of many international
journals. He has contributed numerous invited lectures at international workshops,
conferences, and academic organizations. He is also a recipient of numerous research
paper awards.

Dr. Srikanta Patnaik is presently working as the director of International Relation
and Publication of SOA University. He is a full professor in the Department
of Computer Science and Engineering, SOA University, Bhubaneswar, India. He
received his Ph.D. (Engineering) in Computational Intelligence from Jadavpur
University, India, in 1999. He has supervised more than 25 Ph.D. theses and 60 master's
theses in the area of computational intelligence, machine learning, soft computing
applications, and re-engineering. Dr. Patnaik has published around 100 research
papers in international journals and conference proceedings. He is the author of two
textbooks and 52 edited volumes, as well as a few invited book chapters, published
by leading international publishers such as Springer-Verlag and Kluwer Academic. Dr. Srikanta
Patnaik is the editor-in-chief of the International Journal of Information and Commu-
nication Technology and International Journal of Computational Vision and Robotics
published by Inderscience Publishing House, England, and the International Journal
of Computational Intelligence in Control, published by MUK Publication, the editor
of Journal of Information and Communication Convergence Engineering, and an
associate editor of Journal of Intelligent and Fuzzy Systems (JIFS), which are all
Scopus-indexed journals. He is also the editor-in-chief of the book series “Modeling
and Optimization in Science and Technology”, published by Springer, Germany,
and Advances in Computer and Electrical Engineering (ACEE) and Advances in
Medical Technologies and Clinical Practice (AMTCP), published by IGI Global,
USA. Dr. Patnaik has travelled to more than 20 countries across the globe to deliver
invited talks and keynote addresses. He is also a visiting professor at several
universities in China, South Korea, and Malaysia.

Prof. Roumen Kountchev, Ph.D., D.Sc., is a professor at the Faculty of Telecom-
munications, Department of Radio Communications and Video Technologies, Tech-
nical University of Sofia, Bulgaria. His areas of interest are digital signal and image
processing, image compression, multimedia watermarking, video communications,
pattern recognition and neural networks. Prof. Kountchev has 350 papers published
in magazines and proceedings of conferences; 20 books; 47 book chapters; and 21
patents. He has been a principal investigator of 38 research projects. At present, he
is a member of Euro Mediterranean Academy of Arts and Sciences and President
of Bulgarian Association for Pattern Recognition (member of Intern. Association
for Pattern Recognition). He is an editorial board member of: International Journal
of Reasoning-based Intelligent Systems; International Journal of Broad Research in
Artificial Intelligence and Neuroscience; KES Focus Group on Intelligent Decision
Technologies; Egyptian Computer Science Journal; International Journal of Bio-
Medical Informatics and e-Health, and International Journal of Intelligent Decision
Technologies. He has been a plenary speaker at: WSEAS International Conference
on Signal Processing, 2009, Istanbul, Turkey; WSEAS International Conference on
Signal Processing, Robotics and Automation, University of Cambridge 2010, UK;
WSEAS International Conference on Signal Processing, Computational Geometry
and Artificial Vision 2012, Istanbul, Turkey; International Workshop on Bioinfor-
matics, Medical Informatics and e-Health 2013, Ain Shams University, Cairo, Egypt;
Workshop SCCIBOV 2015, Djillali Liabes University, Sidi Bel Abbes, Algeria; Inter-
national Conference on Information Technology 2015 and 2017, Al Zayatoonah
University, Amman, Jordan; WSEAS European Conference of Computer Science
2016, Rome, Italy; The 9th International Conference on Circuits, Systems and
Signals, London, UK, 2017; IEEE International Conference on High Technology
for Sustainable Development 2018 and 2019, Sofia, Bulgaria; The 8th International
Congress of Information and Communication Technology, Xiamen, China, 2018;
General chair of the International Workshop New Approaches for Multidimensional
Signal Processing, July 2020, Sofia, Bulgaria.

Assoc. Prof. Ruidong Li is an associate professor at Kanazawa University, Japan.


Before joining this university, he was a senior researcher at the National Institute
of Information and Communications Technology (NICT), Japan. He serves as the
secretary of IEEE ComSoc Internet Technical Committee (ITC), is the founder and
chair of IEEE SIG on Big Data Intelligent Networking and IEEE SIG on Intelligent
Internet Edge, and the co-chair of young research group for Asia future internet
forum. He is the associate editor of IEEE Internet of Things Journal and also served
as the guest editors for a set of prestigious magazines, transactions, and journals,
such as IEEE Communications Magazine, IEEE Network Magazine, IEEE Trans-
actions. He also served as chairs for several conferences and workshops, such as
the general co-chair for AIVR2019, IEEE INFOCOM 2019/2020/2021 ICCN work-
shop, IEEE MSN 2020, BRAINS 2020, IEEE ICDCS 2019/2020 NMIC workshop
and IEEE Globecom 2019 ICSTO workshop, and publicity co-chair for INFOCOM
2021. His research interests include future networks, big data networking, intelligent
Internet edge, Internet of things, network security, information-centric network, arti-
ficial intelligence, quantum Internet, cyber-physical system, naming and addressing
schemes, name resolution systems, and wireless networks. He is a senior member of
IEEE and a member of IEICE.

Assoc. Prof. Ari Aharari (Ph.D.) received M.E. and Ph.D. in Industrial Science
and Technology Engineering and Robotics from Niigata University and Kyushu
Institute of Technology, Japan, in 2004 and 2007, respectively. In 2004, he joined
GMD-JAPAN as a research assistant. He was a research scientist and coordinator at
FAIS-Robotics Development Support Office from 2004 to 2007. He was a postdoc-
toral research fellow of the Japan Society for the Promotion of Science (JSPS) at
Waseda University, Japan, from 2007 to 2008. He served as a senior researcher of
Fukuoka IST involved in the Japan Cluster Project from 2008 to 2010. In 2010, he
became an assistant professor at the faculty of Informatics of Nagasaki Institute of
Applied Science. Since 2012, he has been an associate professor at the Department
of Computer and Information Science, Sojo University, Japan. His research inter-
ests are IoT, robotics, IT agriculture, image processing and data analysis (Big Data)
and their applications. He is a member of IEEE (Robotics and Automation Society),
RSJ (Robotics Society of Japan), IEICE (Institute of Electronics, Information and
Communication Engineers), and IIEEJ (Institute of Image Electronics Engineers of
Japan).
Part I
Invited Paper
Chapter 1
Paraconsistency and Paracompleteness
in AI: Review Paper

Jair Minoro Abe , João I. da Silva Filho , and Kazumi Nakamatsu

Abstract The authors analyse the contribution of the logical treatment of the
concepts of inconsistency and paracompleteness to better understand AI’s current
state of development. In particular, the relationship between Artificial Intelligence
and a new type of logic, called Paraconsistent Annotated Logic, which effectively
manipulates the above concepts, both computationally and in its use in Hardware, is
considered.

1.1 Introduction

1.1.1 Classical and Non-classical Logic

Logic, until very recently, was a single science, which progressed linearly, even after
its mathematisation by mathematicians, logicians and philosophers such as Boole,
Peano, Frege, Russell and Whitehead. The revolutionary developments in the 1930s,
such as those by Gödel and Tarski, still fall within what we can call classical or
traditional logic.
Despite all the advances in traditional logic, another parallel revolution took place
in the field of science created by Aristotle, of a very different nature. We refer to
the institution of non-classical logic. They produced, as in the case of non-Euclidean
geometry, a transformation of a profound nature in the scientific sphere, whose conse-
quences, of a philosophical nature, have not yet been investigated systematically and
comprehensively.

J. M. Abe (B)
Paulista University, São Paulo, Brazil
e-mail: [email protected]
J. I. da Silva Filho
Santa Cecília University, Santos, Brazil
K. Nakamatsu
University of Hyogo, Hyogo, Japan


We call classical or traditional logic the study of the calculus of the first-order
predicates, with or without equality, as well as some of its subsystems, such as clas-
sical propositional calculus, and some of its extensions, for example, traditional
higher-order logic (type theory) and the standard systems of set theory (Zermelo–Fraenkel,
von Neumann–Bernays–Gödel, Kelley–Morse, NF Quine-Rosser, ML Quine-Wang,
etc.). The logic under consideration is based on well-established syntax and seman-
tics; thus, the usual semantics of predicate calculus is based on Tarski’s concept of
truth.
Non-classical logic is characterised by amplifying, in some way, traditional logic
or by infringing or limiting its core principles or assumptions [1].
Among the first, called complementary logics of the classical, we will remember
the traditional logics of alethic modalities, deontic modalities, epistemic operators
and temporal operators. Among the second, called heterodox or rivals of classical,
we will cite paraconsistent, paracomplete and intuitionist logics without negation
(Griss, Gilmore and others).
Logic, we must stress, is much more than the discipline of valid forms of inference.
It would be difficult to fit, e.g. the theory of models in its current form and the theory of
recursion in a logic thus defined. However, for this article, we can identify (deductive)
logic as the discipline especially concerned with valid inference (or reasoning).
On the other hand, each deductive logic L is usually associated with an inductive
logic L’, which, under certain conditions, indicates how invalid inferences according
to L can still be used. The patterns for this to be legitimate are encoded in L’. Inductive
logics are perfectly placed among non-classical logics (perhaps as a replacement for
the corresponding deductive logics) [1].
Artificial Intelligence (AI) has contributed to the progress of several new logics
(non-monotonic logics, default logics, defeasible logics, and paraconsistent logics
in general).
This is because, particularly in the case of expert systems, we need non-traditional
forms of inference. Paraconsistency, e.g. is imposed as one regularly works with
inconsistent (contradictory) sets of information.
In this chapter, we outline the great relevance that AI is acquiring regarding the
deep understanding of the meaning of logicity (and, indirectly, for the very under-
standing of reason, its structure, limits and forms of application). To do so, we will
only focus on the case of paraconsistent logic and paracomplete logic; without a
doubt, it can be seen as one of the most heterodox among the heterodox logics,
although technically, it can be constructed as a complementary logic to classical
logic.

1.2 Paraconsistent and Paracomplete Logic

A (deductive) theory T, based on logic L, is said to be consistent if among its theorems
there are not two, such that one is the negation of the other; otherwise, T is called
inconsistent. Theory T is called trivial if all its language sentences (closed formulas)
are theorems; if not, T is nontrivial.
If L is one of the standard logics, such as classical and Brouwer-Heyting’s intu-
itionistic logic, T is named trivial if and only if it is inconsistent. In other words,
logic like these does not separate the concepts of inconsistency and triviality.
L is called paraconsistent if it can function as a foundation for inconsistent and
nontrivial theories. (Only in certain specific circumstances does the presence of
contradiction imply trivialisation.) In other words, paraconsistent logic can handle
inconsistent information systems without the danger of trivialisation.
The forerunners of paraconsistent logic were the Polish logician J. Łukasiewicz
and the Russian philosopher N. A. Vasiliev. None of them had, at the time, a broad
view of classical logic as we see it today; they treated it more or less through Aris-
totle’s prism in keeping with the then dominant trends in the field. Simultaneously,
around 1910, though independently, they aired the possibility of a paraconsistent
logic that would constrain, for example, the principle of contradiction, when formu-
lated as follows: Given two contradictory propositions, that is, one of which is the
negation of the other, then one of the propositions is false. Vasiliev even came to
articulate a certain paraconsistent logic, which he baptised imaginary, modifying the
Aristotelian syllogistic.
The Polish logician S. Jaśkowski, a disciple of Łukasiewicz, was the first logician
to structure a paraconsistent propositional calculus. In 1948, he published his ideas on
logic and contradiction, showing how one could construct a paraconsistent sentential
calculus with convenient motivation. Jaśkowski’s system, named by him discursive
logic, was developed later (from 1968 onwards) due to the works of authors such as
J. Kotas, L. Furmanowski, L. Dubikajtis, N. C. A. da Costa and C. Pinter. Thus, an
actual discursive logic was built, encompassing a calculus of the first-order predicate
and a higher-order logic (there are even discursive set theories, intrinsically linked
to the attribute theory, based on Lewis’ S5 calculus) [1].
The initial systems of paraconsistent logic, containing all logical levels, thus
involving propositional, predicate and description calculations and higher-order
logic, are due to N. C. A. da Costa (1954 onwards). This was carried out independently
of the inquiries of the authors, as mentioned earlier.
Today, there are even paraconsistent systems of set theories, strictly stronger
than the classical ones, as they contain them as strict subsystems and paraconsistent
mathematics. These mathematics are related to fuzzy mathematics, which, from a
certain point of view, fits into the list of the former.
As a result of the elaboration of paraconsistent logic, it has been proved that it
becomes possible to manipulate inconsistent and robust information systems without
eliminating contradictions and without falling into trivialisation.
Worthy of mentioning is that paraconsistent logic was born out of purely theo-
retical considerations, both logical-mathematical and philosophical. The first ones
refer, for example, to problems related to the concept of truth, the paradoxes of set
theory and the vagueness inherent in natural language and scientific ones. The second
is correlated with themes such as foundations of dialectics, notions of rationality and
logic and the acceptance of scientific theories.

Some of the consequences of structuring paraconsistent logic, which can be clas-
sified into two categories, ‘positive’ and ‘negative’, are as follows: Positive: (1)
Better elucidation of some central concepts of logic, such as negation and contra-
diction, as well as the role of the abstraction scheme in set theory (set theory anti-
nomies), (2) a deeper understanding of specific philosophical theories, especially
Meinong’s dialectic and object theory, (3) proof of the possibility of strong and incon-
sistent, though not trivial, theories (common paradoxes can be treated from a new
perspective) and (4) organisation of ontological schemes different from traditional
ontology. Negatives: (1) Demonstration that specific criticisms of dialectics appear to
be unfounded (e.g. Popper’s well-known remarks), (2) proof that the methodological
requirements imposed on scientific theories prove to be too restrictive and deserve to
be liberalised and (3) Evidence that the usual conception of truth as correspondence,
a la Tarski, does not entail the laws of classical logic, without extra assumptions,
usually kept implicit. Details on paraconsistent logic can be found in [1, 2].
In general, a paracomplete logic can be conceived as the underlying logic of
incomplete theory in the strong sense, i.e. theory according to which a proposition
and its negation are both false. The motivation for paracomplete systems is connected
with the classical requirement that at least one of a proposition and its negation be
true does not always fit our intuitions. For instance, if P is a vague predicate and a
is a borderline individual, we may feel that both P(a) and the negation of P(a) are
false.
In a technical sense, paracomplete logic can be considered to be dual to paracon-
sistent logic. Examples of paracomplete logic are intuitionistic logic, multivalued
logic, annotated logic, etc.
It is worth mentioning that after discovering paracomplete logic, it was found
that the notions of paraconsistent logic and paracomplete logic are independent.
There are paraconsistent logics that are not paracomplete, and there are paracomplete
logics that are not paraconsistent. Furthermore, some logics are paraconsistent and
paracomplete simultaneously, such as annotated logics [2].

1.3 AI and Formal Systems

Nowadays, in AI, we need to manipulate inconsistent information systems. We need
to process them in similar systems via paraconsistent programming. Trying to trans-
form these systems into consistent ones would be not only impractical but, above all,
theoretically pointless. Therefore, AI constitutes a field where paraconsistent logic
naturally encounters critical applications. Thus, computing, in general, is closely
linked to paraconsistency. From a certain angle, the non-monotonic and ‘default’
logics are included in the class of paraconsistent logics (broadly). For details, the
reader can consult, among others, the following references: [1] and [2].

In connection with the preceding exposition, here are some philosophically signif-
icant problems: (a) Are non-classical logics, logics? (b) Can there even be rival logic
to the classical one? (c) Ultimately, wouldn’t the logic called rivals be only comple-
mentary to the classical one? (d) What is the relationship between rationality and
logic? (e) Can reason be expressed through different logics, incompatible with each
other?
Obviously, within the limits of this article, we cannot address all these questions,
not even in a summarised way.
However, adopting an ‘operational’ position, if the logical system denotes a kind
of inference organon, AI contributes to lead us, inescapably, to the conclusion that
there are several kinds of logic, classical and non-classical, and among the latter,
complementary and rivals of classical logic.
Furthermore, AI corroborates the possibility and practical relevance of logic in
the category of paraconsistent, so far removed from the standards established for
logicity until recently. This is, without a doubt, surprising for those not used to the
latest advances in information technology.
It is worth remembering that numerous arguments weaken the position of those
who defend the thesis of the absolute character of classical logic. Here are four such
arguments as follows:
(1) Any given rational context is compatible with infinite logics capable of
appearing as underlying logics.
(2) Fundamental logical concepts, such as negation, have to be seen as ‘family
resemblance’ in Wittgenstein’s sense. There is no particular reason for refusing,
say, paraconsistent negation the dignity of negation: if one does so, one should
also maintain that the lines of non-Euclidean geometries are not, in effect, lines.
(3) Common semantics, e.g. restricted predicate calculus is based on set theory.
As there are several (classical) set theories, there are numerous possible inter-
pretations of such semantics, not equivalent to each other. Consequently, that
calculation is not as well defined as it appears at first sight.
(4) There is no sound and complete axiomatisation for traditional second-order (and
higher-order) logic. It, therefore, escapes (recursive) axiomatisation.
Thus, the answers to questions (a) and (b) are affirmative. A simple answer to ques-
tion (c) seems complicated: at the bottom, it is primarily a terminological problem.
However, in principle, as a result of the previous discussion, nothing prevents us from
accepting that there are rival logics, which are not included in the list of complemen-
tary ones to the traditional one. Finally, on (d) and (e), we will emphasise that we
have excellent arguments to demonstrate that reason remains reason even when it
manifests itself through non-classical logic (classical logic itself is not a well-defined
system).
From the above, we believe that the conclusions that are imposed are susceptible
to a summary, as follows:
Science is more a struggle, an advance, than a stage acquired or conquered, and
the fundamental scientific categories change over time. As Enriques [3] points out,
science appears imperfect in all its parts, developing through self-correction and self-
integration, to which new parts are gradually added; there is a constant back and forth
from the foundations to the most complex theories, correcting errors and eliminating
inconsistencies. However, history proves that every scientific theory contains some-
thing true: Newtonian mechanics, though surpassed by Einstein’s, evidently contains
traces of truth; if its field of application is conveniently restricted, it works, predicts
and therefore contains a bit of truth. Nevertheless, real science is a constant walk toward
the truth. This is the teaching of history, beyond any serious doubt.
Even more, logic is constituted through history, and it does not seem possible to
predict the vicissitudes of its evolution.
It is not just about progress in extension; the concept of logicity has changed.
An expert from the beginning of the century, although familiar with the works of
Frege, Russell and Peano, could hardly have foreseen the transformations that would
take place in logic in the last forty years. Today, heterodox logics have entered
the scene with great impetus: no one could predict where polyvalent, relevant and
paraconsistent logics will take us. Perhaps, in the coming years, a new alteration of
the idea of logicity is in store, impossible to imagine at the moment [1].
‘Reason, as defined…, is the faculty of conceiving, judging and reasoning.
Conceiving and reasoning are the exclusive patrimonies of reason, but judging is
also a rational activity in the precise sense of the word. Some primitive form of non-
rational intuition provides the basis for judgment; it is the reason that judges since it
alone manipulates and combines concepts.
Most common uses of the word ‘reason’ derive from reason conceptualised as
the faculty of conceiving, judging and reasoning. Thus, to discern well and adopt
rational norms of life, one must consider reason in a sense defined. Furthermore,
there is a set of rules and principles regulating the use of reason, primarily as it
manifests itself in rational contexts. It is also permissible to call this set of rules
and principles reason. When we ask whether reason transforms itself or remains
invariant, it is undoubtedly more convenient to interpret the question as referring
to reason as a set of rules and principles and not as a faculty. So formulated, the
problem has an immediate answer: reason has changed over time. For example, the
rational categories underlying Aristotelian, Newtonian and modern physics diverge
profoundly, ipso facto, the principles that govern these categories vary, from which
it can be concluded the reason itself has been transformed.’ (da Costa [1]).
Consequently, reason does not cease to be the reason, even if it is expressed
through a different logic.
AI is currently one of the pillars on which the considerations that have just been
made are based. So, it has a practical value of the technological application and a
theoretical value, contributing to a better solution to the problems of logic, reason
and culture.

1.4 Paraconsistent Annotated Evidential Logic Eτ

We focus on a particular paraconsistent and paracomplete logic, namely the paracon-
sistent annotated evidential logic Eτ—logic Eτ. The logic Eτ has a language such
that the atomic formulas are of the type p(μ, λ), where (μ, λ) ∈ [0, 1]² and [0, 1] is
the real unitary interval. The symbol p denotes a propositional variable in the usual
sense. The pair (μ, λ) is called the annotation constant. In the unitary real square [0, 1]
× [0, 1], an order relation is defined as follows: (μ₁, λ₁) ≤ (μ₂, λ₂) iff μ₁ ≤ μ₂ and
λ₂ ≤ λ₁. The pair [[0, 1]², ≤] constitutes a lattice symbolised by τ.
p(μ, λ) can be intuitively read (among others): ‘It is assumed that p’s favourable
evidence degree (or belief, probability, etc.) is μ and contrary evidence degree (or
disbelief, etc.) is λ’. Thus
• (1.0, 0.0) indicates total favourable evidence,
• (0.0, 1.0) indicates total unfavourable evidence,
• (1.0, 1.0) indicates total inconsistency, and
• (0.0, 0.0) indicates total absence of evidence (absence of information).
The operator ~: | τ | → | τ | defined by ~[(μ, λ)] = (λ, μ) is correlated as the
‘meaning’ of the logical negation of the logic Eτ.
The consideration of the values of the favourable degree and unfavourable degree
is made, for example, by experts who use heuristic knowledge, probability or
statistics.
We can consider several important concepts (all considerations are taken with 0
≤ μ, λ ≤ 1):
Segment DB—segment perfectly defined: μ + λ − 1 = 0.
Segment AC—segment perfectly undefined: μ − λ = 0.
Uncertainty degree: Gun (μ, λ) = μ + λ − 1;
Certainty degree: Gce (μ, λ) = μ − λ;
To fix ideas, by using the uncertainty and certainty degrees, we can define the
following 12 states: extreme states (false, true, inconsistent and paracomplete) and
non-extreme states (see Fig. 1.1 and Table 1.1). The standard Cartesian system can
represent such logical states.
The states can be described with the certainty degree and uncertainty degree
values. In this text, we have chosen the resolution 12 (number of the regions consid-
ered according to Fig. 1.2). However, the resolution is entirely dependent on the
precision of the analysis required in the output, and it can be externally adapted
according to the applications (Fig. 1.2).
So, such limit values, called control values, are as follows:
Vcic = maximum value of uncertainty control = C3.
Vcve = maximum value of certainty control = C1.
Vcpa = minimum value of uncertainty control = C4.
Vcfa = minimum value of certainty control = C2.
For the discussion in the present text, we used C1 = C3 = ½ and C2 = C4 = −½.
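As an illustration of how these degrees and control values can be used in software, the following minimal C sketch (our own illustration, not taken from [4]; the names, types and printed labels are only assumptions) computes Gce and Gun for an annotation (μ, λ) and reports which extreme state, if any, has been reached. This is essentially the decision rule of the para-analyser discussed below.

#include <stdio.h>

/* Annotation constant (mu, lambda) of logic Etau, with 0 <= mu, lambda <= 1. */
typedef struct { double mu; double la; } annotation;

/* Logical negation of Etau: ~(mu, lambda) = (lambda, mu). */
annotation etau_not(annotation a) { annotation r = { a.la, a.mu }; return r; }

double g_ce(annotation a) { return a.mu - a.la; }        /* certainty degree   Gce = mu - lambda     */
double g_un(annotation a) { return a.mu + a.la - 1.0; }  /* uncertainty degree Gun = mu + lambda - 1 */

/* Control values adopted in the text: C1 = C3 = 1/2, C2 = C4 = -1/2. */
#define VCVE  0.5   /* maximum value of certainty control   (C1) */
#define VCFA -0.5   /* minimum value of certainty control   (C2) */
#define VCIC  0.5   /* maximum value of uncertainty control (C3) */
#define VCPA -0.5   /* minimum value of uncertainty control (C4) */

/* Decides which extreme state, if any, the annotation falls into. */
const char *extreme_state(annotation a)
{
    if (g_ce(a) >= VCVE) return "true (V)";
    if (g_ce(a) <= VCFA) return "false (F)";
    if (g_un(a) >= VCIC) return "inconsistent (T)";
    if (g_un(a) <= VCPA) return "paracomplete";
    return "non-extreme state";
}

int main(void)
{
    annotation a = { 1.0, 1.0 };   /* total inconsistency        */
    annotation b = { 0.9, 0.1 };   /* strong favourable evidence */
    printf("%s\n", extreme_state(a));   /* prints: inconsistent (T) */
    printf("%s\n", extreme_state(b));   /* prints: true (V)         */
    return 0;
}

A finer resolution, such as the 12 regions of Fig. 1.2, is obtained by further subdividing the non-extreme case.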

Fig. 1.1 Representation of the extreme and non-extreme states

Table 1.1 Extreme and non-extreme states
Extreme states Symbol Non-extreme states Symbol
True V Quasi-true tending to inconsistent QV → T
False F Quasi-true tending to paracomplete QV → ⊥
Inconsistent T Quasi-false tending to inconsistent QF → T
Paracomplete ⊥ Quasi-false tending to paracomplete QF → ⊥
Quasi-inconsistent tending to true QT → V
Quasi-inconsistent tending to false QT → F
Quasi-paracomplete tending to true Q⊥ → V
Quasi-paracomplete tending to false Q⊥ → F

Fig. 1.2 Extreme and non-extreme states

Fig. 1.3 Prototype of the terrestrial mobile robot

With the decision states and the degrees of certainty and uncertainty, we obtain a
logic analyser called para-analyser [4]. Such an analyser materialised with electrical
circuits gave rise to a logic controller called para-control [4].
Below we describe some applications made with such controllers.

1.5 Application in Robotics

1.5.1 Description of the Prototype

This project was conceived based on the history of the application of paraconsis-
tent logic in predecessor robots [5] and the development of robotics in autonomous
navigation systems [6]. The prototype of the project implemented with the ATmega
2560 Microcontroller is observed in Fig. 1.3. The HC-SR04 ultrasonic sensors were
installed correctly. At the front of the robot, one observes traction motors controlled
by pulse width modulation (PWM). On the back can be seen the differential of this
prototype compared to the predecessors, which consists of a servomotor to control
the robot’s direction.
Another difference from the previous ones was the idea of using an LCD to
monitor the readings of ultrasonic sensors and observe the value of the angle pointed
out by the servomotor. All these observations were critical in the robot’s movement
tests [6].

1.5.2 Development of the Paraconsistent Annotated Logic Algorithm

By using concepts of paraconsistent annotated evidential logic Eτ, and with the
mechatronic prototype finalised, five possible positions of supposed static obstacles
in front of the robot were simulated. The criterion of this simulation was based on
models placed at different positions to the robot's left and right (Table 1.2).
Table 1.2 Simulation of obstacles in different positions
Front sensors
Situation Left (cm) μ Right (cm) λ Uncertainty degree Set point (º)
1 10 0.2 50 0 −0.8 −75.76
2 20 0.4 40 0.2 −0.4 −37.88
3 30 0.6 30 0.4 0 0
4 40 0.8 20 0.6 0.4 37.88
5 50 1 10 0.8 0.8 75.76

Next, a normalisation of the frontal sensors' readings to the values μ and λ of the
lattice was made, as can be observed in Eqs. (1.1) and (1.2).
The normalisation process involves adapting the distance values obtained from
the sensors and converting them to the range from 0 to 1 used by paraconsistent
annotated logic [2].

μ = Left Sensor / 200    (1.1)

λ = 1 − (Right Sensor / 200)    (1.2)
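For example, a left-sensor reading of 100 cm yields μ = 100/200 = 0.5 by Eq. (1.1), while a right-sensor reading of 150 cm yields λ = 1 − 150/200 = 0.25 by Eq. (1.2).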

Using the proposition p ‘The robot’s front is free’, paraconsistent logic concepts
were applied. The certainty and uncertainty degrees were calculated according to the
values μ and λ obtained by Eqs. (1.1) and (1.2).
It was noticed that the degree of uncertainty generated very peculiar values to be
used directly in the set point of the servomotor. Then, with other values, six new
tests were performed to define the robot’s behaviour concerning supposed obstacles,
simultaneously positioned at the same distance to the left and right frontal sensors.
Table 1.2 shows the development of simulations and the results obtained in each
case.
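As a quick numerical check against Table 1.2, take situation 1: Gun(μ, λ) = μ + λ − 1 = 0.2 + 0 − 1 = −0.8, which is exactly the uncertainty degree listed in the table and from which the corresponding servomotor set point is derived.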
Table 1.3 shows a gradual change in certainty degrees for these new cases that
ranged from 0.99 to -0.70. The control values obtained in the simulations were applied
to the paraconsistent algorithm programmed in the C language directly
in the Integrated Development Environment (IDE) of the Arduino ATmega 2560. These
values were used in decision-making for speed control and braking.
The program’s algorithm was divided into four main blocks to facilitate its imple-
mentation: the block of the frontal sensors, the block of paraconsistent logic, the
control block of the servomotor and the control block of speed and traction.
// Front Sensor Block
trigpulse_1(); //calls the function trigger of the right front
sensor
pulse_1 = pulsein (echo_1, high);
rt_ft_sr =pulse_1/58; //calculates obstacle distance to right
front sensor
1 Paraconsistency and Paracompleteness in AI: Review Paper 13

Table 1.3 Simulation of obstacles in equal positions simultaneously


Front sensors
Situation Left (cm) μ Right (cm) λ Certainty degree
1 180 1 180 0.01 0.99
2 150 0.75 150 0.25 0.50
3 120 0.60 120 0.40 0.20
4 90 0.45 90 0.55 −0.10
5 60 0.30 60 0.70 −0.40
6 30 0.15 30 0.85 −0.70

trigpulse_2();//calls the function trigger of the left front sensor


pulse_2 = pulsein (echo_2, high);
lt_ft_sr =pulse_2/58; //calculates obstacle distance to left front
sensor
if(rt_ft_sr >=50) { rt_ft_sr =50; } //limits distance measured at
200cm
if(lt_ft_sr >=50) { lt_ft_sr =50; } //limits distance measured at
200cm
//Paraconsistent Logic Block
mi = (sr_fe/50); //process of normalization of favorable evidence μ
la = (1-(sr_fd*0.02)); //normalization process of the contrary
evidence λ
deg_unc = ((mi+la)-1); //calculates the degree of uncertainty
deg_cer = (mi-la); //calculates the degree of certainty
//Servomotor Control Block
sv_set_pt = 538.42*gra_inc+551.5; //calculates the set point of the
servomotor
ser_pos = map (sv_set_pt, 0 , 1023, 0, 180); //positions the servo-
motor
//Speed and Traction Control Block
pwm_set_mt = deg_cer*105 + 150; //calculates the pwm of the traction
motor
analogwrite (rt_trc_mt, pmw_set_mt); //controls right motor trac-
tion
analogwrite (lt_trc_mt, pwm_set_mt); //controls left motor trac-
tion
if (deg_cer > -0.9) {
digitalwrite (in1_mot_dir, high); //traction motors follow forward
digitalwrite (in2_mot_dir, low);
digitalwrite (in3_mot_esq, high);
digitalwrite (in4_mot_esq, low); }
else if(deg_cer <= -0.9) {
digitalwrite (in1_mot_dir, high); //brake traction motors
digitalwrite (in2_mot_dir, high);
digitalwrite (in3_mot_esq, high);
digitalwrite (in4_mot_esq, high);}

1.6 Multi-criteria Decision Analysis (MCDA) in Health Care

Lately, the potential of multi-criteria decision analysis (MCDA) in health care has
been widely discussed. However, most MCDA methodologies pay little attention to
aggregating different individual stakeholder perspectives.
In [7], the para-analyser was applied to illustrate how a reusable MCDA frame-
work, based on paraconsistent logic, designed to aid (hospital-based) Health Tech-
nology Assessment (HTA) can be used to aggregate individual expert perspectives
when evaluating cancer treatments.
A proof-of-concept exercise focusing on identifying and evaluating the global
value of first-line treatments for metastatic colorectal cancer (mCRC) was undertaken
to further the development of the MCDA framework.
In consultation with hospital HTA committee members, 11 perspectives were included
on an expert panel: medical oncology, oncology surgery, radiation therapy, palliative
care, pharmacist, health economist, epidemiologist, public health specialist, health
media specialist, pharmaceutical industry and patient advocate. The criteria ‘overall
survival’ (0.22), ‘burden of disease’ (mean 0.21) and ‘adverse events’ (mean 0.20)
received the highest weights, and the lowest weights were ‘progression-free’ and
‘cost of treatment’ (mean of 0.18 for both). FOLFIRImFlox achieved the highest
overall value approval of 0.75, followed by mFOLFOX6 with an overall value rating
of 0.71. Last ranked was the mIFL with an overall value score of 0.62. Paraconsistent
analysis of six first-line treatments for mCRC indicated that FOLFIRI and mFlox
were appropriate options for non-study reimbursement.
The paraconsistent value framework was proposed as a step forward from current
MCDA practices to improve the means of dealing with hospital HTA specialists’
perspectives of cancer treatments.

1.7 Conclusions

Paraconsistent logic was born out of applications in philosophy and specific technical
questions in mathematics, but it has found significant applications in the last three
decades, mainly in AI and Robotics [8].
ANNs, deep learning, expert systems, big data, etc., based on paraconsistent
logic, deal directly with fuzzy, inconsistent and paracomplete data, as we
have to manipulate such data frequently. In the early days of AI, some theories elim-
inated or treated inconsistencies separately. They happen frequently: in medicine,
the same symptoms can indicate different illnesses and doctors can conflict in their
diagnoses. We also have the issue of inherent ambiguity regarding the resolution of
the treated image that can lead to hasty decisions such as data captured by radar in
expert systems, and experts can have different opinions on the same problem, mali-
cious data corrupting databases and other themes. To neglect inconsistent data is to
proceed anachronistically. Conflicting data can be as important as any other data.


Also, they can indicate extra information that requires special attention. For all these
reasons, the direct study of fuzzy, inconsistent and paracomplete concepts is vital in
AI.

References

1. da Costa, N. C. A.: Logiques Classiques et Non Classiques: Essai sur les fondements de la
logique, Masson, p. 275. ISBN-10: 2225852472, ISBN-13: 978-2225852473 (1997)
2. Abe, J.M., Akama, S., Nakamatsu, K.: Introduction to Annotated Logics—Foundations for Para-
complete and Paraconsistent Reasoning, Intelligent Systems Reference Library, vol. 88, p. 190.
Springer International Publishing (2015). https://doi.org/10.1007/978-3-319-17912-4, eBook
ISBN 978-3-319-17912-4, Hardcover ISBN 978-3-319-17911-7, Series ISSN 1868-4394
3. Enriques, F.: Per la Storia della Logica, Zanichelli, Bolonha (1922)
4. da Silva Filho, J.I.: Métodos de interpretação da Lógica Paraconsistente Anotada com anotação
com dois valores LPA2v com construção de Algoritmo e implementação de Circuitos Eletrônicos
(in Portuguese), University of São Paulo, Doctor Thesis, São Paulo (1999)
5. Torres, C.R., Abe, J.M., Lambert-Torres, G., da Silva Filho, J.I., Martins, H.G.: Autonomous
Mobile Robot Emmy iii, pp. 317–327. New Advances in Intelligent Decision Technologies,
Springer, Berlin, Heidelberg (2009)
6. Bernardini, F., da Silva, M., Abe J.M.: Application of Paraconsistent Annotated Evidential Logic
Eτ for a Terrestrial Mobile Robot to Avoid Obstacles, Procedia Computer Science, vol. 192,
pp. 1821–1830. ISSN 1877-0509 (2021)
7. Campolina, A.G., Estevez-Diz, M.D.P., Abe, J.M., de Soárez, P.C.: Multiple Criteria Decision
Analysis (MCDA) for Evaluating Cancer Treatments in Hospital-Based Health Technology
Assessment: The Paraconsistent Value Framework. PLoS ONE 17(5) (2022)
8. Abe, J.M.: Paraconsistent Intelligent-Based Systems: New Trends in the Applications of
Paraconsistency, p. 94. Springer (2015)
9. Akama, S.: Towards Paraconsistent Engineering, Intelligent Systems Reference Library, vol.
110, p. 234. Springer International Publishing (2016). ISBN: 978-3-319-40417-2 (Print) 978-3-
319-40418-9 (Online), Series ISSN 1868-4394
Part II
Regular Papers
Chapter 2
Decision Support Multi-agent Modeling
and Simulation of Aeronautic Marine Oil
Spill Response

Xin Li , Hu Liu, YongLiang Tian, YuanBo Xue , and YiXiong Yu

Abstract Modeling and simulation can provide decision support methods for marine
oil spill response, which can ensure the timeliness and effectiveness of the response
plan. This paper (1) proposes a hybrid modeling approach combining multi-agent
modeling and discrete event system (DEVS) to extract the oil spill response process
model; (2) abstracts the mathematical model of oil spill response plan based on
the multi-agent model; (3) constructs an aeronautic marine oil spill response virtual
simulation system; and (4) quantitatively evaluates the response plans of the marine
oil spillage. Furthermore, an instance is analyzed with simulation and evaluation
of two response plans, which proves that the modeling and simulation methods of
response plan can provide references for analysis and optimization of the response
plan for decision support.

2.1 Introduction

With the increasing frequency of maritime economic activities, cruise travel, and sea
transportation, maritime accidents occur more and more frequently, especially large-
scale oil spills with complex causes and serious environmental hazards. However, most countries lack the capacity and experience to deal with large oil spill emergencies. For example, the sinking of the Sanchi and the spillage of its oil cargo and fuel was one of the worst maritime collision accidents in recent years, with no precedent for an emergency response. The application of virtual simulation can provide decision support methods for large marine oil spills and improve emergency response capability.
As early as the 1990s, developed countries had used computer technology to research oil spill prediction and simulation systems, such as the OILMAP system [1] in the USA and the OSIS system [2] in the UK. With the maturity of oil spill
prediction technology and the development of computer virtual simulation, many

X. Li · H. Liu · Y. Tian (B) · Y. Xue · Y. Yu


Beihang University, Beijing 100083, China
e-mail: [email protected]


studies oriented to virtual simulation exercises for oil spill disposal have emerged. For example, in China, Zou Changjun [3] realized a three-dimensional exercise system for marine oil spill response and a virtual reality-based oil spill response exercise system; Yang Yu [4] constructed a real-time simulated training system supported by a virtual environment for undersea oil spill emergency response to train emergency workers. While skilled personnel are important, developing a response plan is extremely time-consuming during a real oil spill response, yet studies geared toward simulation for response plan development are still relatively few.
Current evaluation studies on oil spill response are process oriented, focusing on the comprehensive factors affecting the effectiveness of oil spill response, and multi-criteria evaluation methods have been adopted to evaluate marine emergency incidents [5, 6]. Hospital et al. [7] in Canada studied oil spill response evaluation, mainly considering the evaluation of booms and skimmers, and pointed out a way to develop a risk-informed, enhanced oil spill response capacity. Jin et al. [8] in China established an evaluation indicator system for marine oil spill emergency response capacity, including forecast, emergency support, and disposal capacity. Yet, while process-oriented assessments are comprehensive and integrated, they lack relevance and directionality, making it difficult to improve the efficiency of oil spill response at the decision-making level.
The response plan involves the response process and affects the efficiency of the response. Therefore, based on the above problems and the current research status, this paper focuses on decision support for the response plan. The evaluation framework in this paper follows the approach of [5] in maritime search and rescue (MSRA) and is newly applied to the more complex scenario of maritime oil spill disposal, taking into account the multiple missions of emergency monitoring and oil spill disposal with interactions of emergency response forces. In this paper, multi-agent modeling and discrete event system (DEVS) modeling are first combined to extract the oil spill response process model and to refine the mathematical model of the oil spill response plan; secondly, the evaluation indicator system is established, and the analytic network process (ANP) method is adopted as the evaluation calculation method; finally, based on these researches, a virtual simulation system is developed on the AnyLogic simulation platform to provide decision support for the formulation of response plans.

2.2 Modeling and Simulation

2.2.1 Marine Oil Spill Response Process

The process flow from oil spill occurrence to disposal can be summarized as follows:
surveillance and warning, emergency monitoring, and oil spill disposal, as shown in
Fig. 2.1.

Fig. 2.1 Schematic diagram of oil spill response process (flowchart: an oil spill accident occurs; general monitoring and emergency monitoring exchange information and dispatch forces; sampling and situation assessment lead to the disposal plan and force dispatch, with continuous monitoring until disposal is done)

The surveillance and warning are the preconditions of oil spill response. After
receiving the alarm, emergency monitoring is carried out to obtain detailed infor-
mation about the spillage. Then the emergency command center dispatches ships
and helicopters to the mission area for oil spill disposal according to the informa-
tion of emergency monitoring. Therefore, the critical steps of oil spill response are
emergency monitoring and oil spill disposal.
At the initial stage of marine oil spillage, rapid response and on-site emergency
monitoring are required. The emergency command center usually sends helicopters
to the scene to conduct on-site command, including investigating the emergency
site, sampling oil spills, and reporting the acquired information to the emergency
command center. The center receives the information and evaluates it to form an oil
spill response plan. In this paper, the oil spill response plan refers to the dispatching
plan and routes of response forces (as helicopters).

2.2.2 DEVS and Multi-agent Modeling

DEVS model
The marine oil spill response is constrained by the time boundary and the space
boundary. Under the constraint of these boundaries, the state variables, including oil
spill status information, monitoring status, oil area, indicators, etc., change at some

discrete time points. Thus, the marine oil spill response is a typical discrete event
system (DEVS).
The DEVS model is driven by a series of events and activities, where an event is
a behavior at a certain instantaneous time and an activity is the continuous state of
behavior, which is between two events in the DEVS model. The occurrence of an
event in the oil spill response process indicates a change in the state of the oil spill
response.
The events and the activities in marine oil spill response process are described in
Table 2.1.
A process is a collection of related events and activities that describes their logical relationship and time sequence. Considering the process interaction modeling strategy, the DEVS architecture diagram for the aeronautic marine oil spill response process is shown in Fig. 2.2, which is constructed based on Fig. 2.1.
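To make the event-driven mechanics concrete, the following minimal Python sketch shows how a discrete event system of this kind can be driven by a time-ordered event queue. It is an illustrative toy, not part of the authors' AnyLogic implementation; the event names and handler logic are assumptions loosely based on E1, E2, and A2 in Table 2.1.

```python
import heapq

class DEVSSimulator:
    """Minimal discrete-event loop: events are (time, name) pairs
    processed in time order; handlers may schedule follow-up events."""

    def __init__(self):
        self.queue = []          # priority queue ordered by event time
        self.clock = 0.0         # current simulation time

    def schedule(self, time, name):
        heapq.heappush(self.queue, (time, name))

    def run(self, handlers):
        while self.queue:
            self.clock, name = heapq.heappop(self.queue)
            for follow_up in handlers.get(name, lambda t: [])(self.clock):
                self.schedule(*follow_up)

# Illustrative handlers: E1 (spill reported) triggers activity A2 (plan
# development), which finishes 2.0 time units later and raises E2.
sim = DEVSSimulator()
handlers = {
    "E1": lambda t: [(t + 2.0, "E2")],   # develop response plan
    "E2": lambda t: [],                  # monitoring mission received
}
sim.schedule(0.0, "E1")
sim.run(handlers)
print("simulation finished at t =", sim.clock)
```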
Multi-agent DEVS model
It can be seen from the marine oil spill response process that modeling helicopters and units in distress is of great significance. For helicopter modeling, the movement logic in mission execution and the detection logic in target search need to be considered; the detection logic covers information interaction between helicopters and units in distress. For the unit in distress, it is necessary to analyze the self-drift and oil spill dispersion logic and the state evolution logic; the state evolution logic is then modeled according to the change of environment and the performance of the helicopter.
The above analysis demonstrates that the helicopter model and the distress ship model each have their own logic and influence each other. The response process is modeled by many interaction and communication behaviors between units with explicit behavioral logic and state transition characteristics.
Therefore, multi-agent modeling is adopted to describe complex situations, such
as multiple behaviors and interactions between different agents. Each agent is a unit
with some physical or abstract mathematical meaning that can not only act on itself
and its environment, but also interact with other agents in terms of information and
behavior.
The multi-agent model includes environment agents, behavioral agents, and data
agents. The environment agent is the simulation operation environment of the behav-
ioral agents and data agents. The behavioral agent is the subject-object that generates
behavior after the simulation starts, with state variables and several behavior patterns.
When a behavioral agent interacts with the environmental agents or other associated
behavioral agents, it triggers or is triggered by events to generate data. The agents are described in Table 2.2.
According to the methodology of non-uniform hybrid strategy [9], the DEVS
model can be established on the basis of the agent-based modeling method, which
will guarantee the model accuracy and computational efficiency to a certain extent,
thus improving the authenticity of the evaluation results. Based on the above two
models, the multi-agent DEVS model is obtained through agent description of the
events, activities, and processes, as shown in Fig. 2.3.

Table 2.1 Events and activities in oil spill response

Event | Description of events | Activity | Description of activities
E1 | Information on the occurrence of oil spills is received | A1 | Oil spill incident is awaiting an emergency response
E2 | Oil spill emergency monitoring missions are received | A2 | Develop a response plan
E3 | Emergency monitoring forces preparation is completed | A3 | Assign missions
E4 | Emergency monitoring forces arrive in the mission area | A4 | Emergency monitoring forces navigate the way to mission areas
E5 | Searching conditions are met | A5 | Determine if search conditions are met
E6 | Oil spill search is completed, and oil sampling feasibility is met | A6 | Search the oil spill area and obtain parameters
E7 | Continuous monitoring is required | A7 | Perform oil sampling and obtain detailed oil parameters
E8 | Emergency monitoring forces arrive at the base | A8 | Determine if the emergency monitoring is completed
E9 | Emergency monitoring conditions are met | A9 | Emergency monitoring forces return to the base
E10 | Oil spill disposal needs are identified, and the oil spill cleaning forces are ready | A10 | Continuous monitoring to obtain the continuous status of oil spill cleanup
E11 | Oil spill cleaning forces arrive at the mission area | A11 | Determine if continuous monitoring is completed
E12 | There are no other isolated oil areas | A12 | Determine if oil spill disposal is required
E13 | Oil spill disposal feasibility is met | A13 | Oil spill cleaning forces navigate the way to oil spill area
E14 | Oil spill cleaning forces arrive at the base | A14 | Perform oil disposal operation
E15 | Continuous monitoring conditions are met | A15 | Determine if there are other isolated oil areas
E16 | Search conditions are not met | A16 | Oil spill cleaning forces return to equipment depot
E17 | Emergency monitoring conditions are not met | A17 | Oil spill emergency response is completed
E18 | Oil spill disposal conditions are not met | |

Fig. 2.2 DEVS model of oil spill response (events E1–E18, activities A1–A17, judgment activities, and processes Pe, Pm1, Pm2, and P1–P8 connected according to the process flow of Fig. 2.1)

In the multi-agent DEVS, events can be generated by interactions among agents, involving both information and behavior interaction. Discrete events proceed based on specific conditions or rules, which in turn affect the activities of the agents involved.
Simulation system framework
In order to verify the validity and feasibility of the multi-agent DEVS model, as well
as to provide decision support for the evaluation, the framework of the simulation
system is constructed as shown in Fig. 2.4.
The environment agent, acting as the environment of the system, realizes human–computer interaction, provides oil spill information input, and formulates the response plan based on auxiliary decision-making functions such as drift trajectory prediction. The data agent links the environment agent and the behavioral agents, recording both the data input by users and the data generated in the evaluation process. The behavioral agent, as the execution unit of the simulation, realizes the discrete events based on its own behavior and interaction logic. The behavioral agent logic, using the ForceUnit_Monitor Agent as an example, is divided into three phases: staying in base, navigation, and command and control; a simplified sketch of this state logic is given below.
The simulation system is developed based on the above system architecture, realizing three layers of logic. First, the user performs ScenarioEditing to edit the oil spill emergency information, and the generated emergency information is stored in a local file. Second, the user performs DecisionMaking to load and display the edited oil spill emergency in the interface. An oil spill response plan is then developed based on the assisted decision-making function, which controls the behavioral agents in the form of rules to realize the simulation and interaction. Finally, the user carries out SimulationEvaluation. The simulation program loads the emergency information and the decision plan, starts the simulation according to the user's intention, and outputs the evaluation result after the simulation.

Table 2.2 Description of the multi-agent model

Agent type | Agent name | Description of this agent
A. Environment Agent | A1. Scenario Editing Agent | Agent for developing a simulation environment and inputting oil spillage information
A. Environment Agent | A2. Decision Making Agent | Agent for developing a simulation environment and generating a response plan
A. Environment Agent | A3. Simulation Evaluation Agent | Agent for developing a simulation environment for process rehearsal and evaluation
B. Behavioral Agent | B1. Emergency_Oil Spill Agent | Agent for characterizing the oil spillage with behavior interaction
B. Behavioral Agent | B2. Force Unit_Monitor Agent | Agent for characterizing the monitoring force unit with interaction
B. Behavioral Agent | B3. Force Unit_Cleaner Agent | Agent for characterizing the cleaning force unit with interaction
C. Data Agent | C1. Response Plan Agent | Agent for characterizing the response plan with information exchange
C. Data Agent | C2. Equipment Base Agent | Agent for characterizing the equipment base with information exchange
C. Data Agent | C3. Airline Agent | Agent for characterizing the airline with information exchange

2.3 Response Plan Evaluation

2.3.1 Response Plan Agent

In the process of oil spill response, the response plan is the most central part, which
can orderly dispatch response forces and assign various missions. In the multi-agent
DEVS model, the ResponsePlan Agent is constructed to record response plan data
and control agents of the virtual simulation.
It should be clear that the response plan works on the collection of missions and
response forces in two aspects: first, the matching of response forces, characterizing
the assignment of force units (as helicopters); second, the acting of force units,
characterizing the temporal properties of the act to the mission area to perform

Fig. 2.3 Multi-agent DEVS model (environment agents A1–A3, behavioral agents B1–B3, and data agents C1–C3, with action logic for developing monitoring and removal plans, navigating, searching, sampling, monitoring, cleanup, and evaluation)

Fig. 2.4 Framework of simulation system (environment agents, data agents, behavioral agents, and the behavioral agent logic)

the mission. The mathematical definition of the proposed disposition scheme is as follows.
First, determine the actions contained in the mission set Mission as Eq. (2.1), with
a total of n actions.

Mission = {action_1, action_2, ..., action_n}   (2.1)



Second, determine the called force set Force as Eq. (2.2), with a total of m force
units.

Force = {unit_1, unit_2, ..., unit_m}   (2.2)

Third, assign actions to force units and determine the mission assignment matrix
M as Eq. (2.3). For example, if there are n actions to be assigned to m force units,
the matrix M is expressed as follows:
$$ M = (M_{ij})_{m \times n} = \begin{bmatrix} M_{11} & M_{12} & \cdots & M_{1n} \\ M_{21} & M_{22} & \cdots & M_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ M_{m1} & M_{m2} & \cdots & M_{mn} \end{bmatrix} \qquad (2.3) $$

where M_{ij} ∈ {0, 1}; M_{ij} = 1 means that the i-th response force unit performs the j-th mission.
Fourth, determine the action matrix A_k as Eq. (2.4) for each force unit, characterizing the temporal attributes of the actions of sending the force unit to the mission area to perform its missions. For example, suppose the k-th force unit is matched to n_k missions:
$$ A_k = (a_{ij})_{n_k \times n_k} = \begin{bmatrix} 0 & 0 & \cdots & 0 \\ a_{21} & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ a_{n_k 1} & a_{n_k 2} & \cdots & 0 \end{bmatrix} \qquad (2.4) $$

where a_{ij} ∈ {0, 1}; a_{ij} = 1 means that the j-th mission must be completed before the i-th mission is executed. Note that the diagonal and the upper-triangular elements of the matrix are all 0.
The action matrices of all m force units are represented by the set A as Eq. (2.5).

A = {A_1, A_2, ..., A_m}   (2.5)

Finally, get the response plan as Eq. (2.6).

RP_Mission = ⟨M, A⟩   (2.6)
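As an illustration of Eqs. (2.1)–(2.6), the following Python/NumPy sketch encodes a response plan as the pair of an assignment matrix M and per-unit action matrices A_k. The concrete values mirror the shape of Response Plan 1 in Table 2.5; the consistency checks at the end are assumptions added for clarity and are not part of the paper's formulation.

```python
import numpy as np

# Mission set (n actions) and force set (m units), Eqs. (2.1)-(2.2).
missions = ["Searching", "Monitor", "Disposal", "Backing"]     # n = 4
forces   = ["B-0001(H-410)", "B-0002(H-410)"]                  # m = 2

# Assignment matrix M (m x n), Eq. (2.3): M[i, j] = 1 means force unit i
# performs mission j (values follow Response Plan 1 in Table 2.5).
M = np.array([[1, 1, 0, 1],
              [0, 0, 1, 1]])

# Per-unit precedence matrices A_k, Eq. (2.4): A[i, j] = 1 means mission j
# must finish before mission i starts (strictly lower-triangular).
A1 = np.array([[0, 0, 0],
               [1, 0, 0],
               [1, 1, 0]])
A2 = np.array([[0, 0],
               [1, 0]])

# Response plan, Eq. (2.6): the pair of the assignment matrix and the
# collection of per-unit action matrices.
response_plan = (M, [A1, A2])

# Simple consistency checks on the definitions above.
assert M.shape == (len(forces), len(missions))
assert all(np.allclose(np.triu(A), 0) for A in response_plan[1])
```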

2.3.2 Evaluation Indicators System

The evaluation of the response plan is essentially a multi-criteria comprehensive eval-


uation problem, for which the first and most important thing is to build a reasonable

evaluation system. Relevant studies [5, 6] reveal that safety and efficiency should be
considered when evaluating and screening response plans. In simple terms, safety
indicators consist of helicopter safety and environmental safety (mainly considering
the harm degree of oil spill to the environment), and efficacy indicators incorporate
both emergency monitoring efficacy and oil spill disposal efficacy.
Specifically, the safety indicators embody the remaining fuel of force units, the
offshore distance, the hazard degree of the oil spillage, and the duration of the oil
spill. Efficacy indicators include emergency dispatch time of monitoring force, total
detection time, oil spill disposal and response resource consumption, and response
completion. The evaluation indicator system is given in Table 2.3.
C1 and C2 are security criteria, which consider the safety of the force units and the environment. C3 and C4 are efficiency criteria, corresponding to the two phases of oil spill response: emergency monitoring and oil spill disposal.
Since there are certain intrinsic links between the indicators and their structure resembles a network, the analytic network process (ANP) was used to determine the weight coefficients of the indicators. See [5] for details of the ANP method.

Table 2.3 Description of the evaluation indicator system

Criteria | Criteria description | Indicator | Indicator description
C1 | Force units security | I11 | Residual fuel security. The higher indicator means the more abundant the remaining fuel
C1 | Force units security | I12 | Offshore distance security. The higher indicator means the maximum offshore distance is shorter and the aircraft is safer
C2 | Environmental security | I21 | Oil spill hazard level, which is determined by the scale and oil products. The higher indicator means the more difficulty with the response
C2 | Environmental security | I22 | Environmental hazard elimination efficiency, which is determined by oil and response effect. The higher indicator means the higher efficiency
C3 | Monitoring efficiency | I31 | Monitoring movement efficiency. The higher indicator means the shorter movement time and the higher efficiency
C3 | Monitoring efficiency | I32 | Total time of monitoring
C4 | Disposal efficiency | I41 | Disposal resource efficiency ratio. The higher indicator means fewer resources are consumed to remove the oil spill in the same area
C4 | Disposal efficiency | I42 | Disposal completion efficiency. The higher indicator means the shorter disposal completion time

Table 2.4 Oil spill scenario information

Accident information | Information description
Location of accident | Point A sea area
Latitude and longitude coordinates | (E118.93951, N24.13997)
Distance from coast | Greater than 30 nautical miles
Time of accident | April 28, 2020, 12:20:00 pm
Types of oil spills | Heavy crude oil
Oil spill mass (tons) | 38

In this paper, eight indicators are defined, and the set of their values is represented by the vector V as Eq. (2.7). The elements of V correspond to these eight indicators, respectively.

V = [V_1, V_2, V_3, ..., V_8]   (2.7)

Based on the evaluation requirements of the different indicators, the standardization process of the original data can be divided into two cases. First, if an indicator has a reference value, the original data are compared with the reference value during scoring. Second, if an indicator has a maximum value, the ratio of the original data to the maximum value is calculated during scoring. A small sketch of both cases is given below.
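The following Python sketch illustrates the two standardization cases just described. The exact scoring formulas of the paper are not given, so the clamping to [0, 1], the "smaller is better" convention, and the numeric examples are assumptions for illustration only.

```python
def standardize_with_reference(value, reference, smaller_is_better=True):
    """Case 1: the indicator has a reference value; the score is obtained
    by comparing the raw value against that reference (assumed form)."""
    ratio = reference / value if smaller_is_better else value / reference
    return min(max(ratio, 0.0), 1.0)   # clamp to [0, 1]

def standardize_with_maximum(value, maximum):
    """Case 2: the indicator has a known maximum; the score is the ratio
    of the raw value to that maximum."""
    return min(max(value / maximum, 0.0), 1.0)

# Illustrative usage: a monitoring time scored against a 120-minute
# reference, and residual fuel scored against the tank capacity.
print(standardize_with_reference(150.0, 120.0))   # 0.8
print(standardize_with_maximum(820.0, 1600.0))    # 0.5125
```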

2.4 Instance Analysis

2.4.1 Simulation Scenario

Suppose an oil spill emergency incident occurs in the sea area at point A. The scenario
information of the emergency is given in Table 2.4.

2.4.2 Simulation-Based Evaluation for Decision Support

Response plan development


First, the oil spill emergency information is input to calculate the oil spill dispersion
data. Then the oil spill dispersion prediction results are calculated based on wind
fluency data and oil properties, which can be used as the basis for developing the
mission area, namely the search and cleanup area for the response forces. Based on
available response forces, three response plans are developed as follows (Table 2.5).

Table 2.5 Response plan development

All plans: Mission Area = [118.93951, 24.13997, 14, 10, 60°); Missions = [Searching, Monitor, Disposal, Backing]

Response plan 1 | Forces = [B-0001(H-410), B-0002(H-410)] | M = [1, 1, 0, 1; 0, 0, 1, 1] | A1 = [0, 0, 0; 1, 0, 0; 1, 1, 0] | A2 = [0, 0; 1, 0]
Response plan 2 | Forces = [B-0001(H-410), B-0002(H-410), B-0003(Z-11)] | M = [1, 1, 0, 1; 0, 0, 1, 1; 0, 0, 1, 1] | A1 = [0, 0, 0; 1, 0, 0; 1, 1, 0] | A2 = [0, 0; 1, 0] | A3 = [0, 0; 1, 0]
Response plan 3 | Forces = [B-0002(H-410), B-0001(H-410)] | M = [1, 1, 0, 1; 0, 0, 1, 1] | A1 = [0, 0, 0; 1, 0, 0; 1, 1, 0] | A2 = [0, 0; 1, 0]

Response Plan 1 (RP1): Dispatch the emergency monitoring force from the Xiamen equipment depot, including the helicopter B-0001(H-410), and the oil spill cleaning force from the Quanzhou equipment depot, including the helicopter B-0002(H-410).
Response Plan 2 (RP2): Dispatch the emergency monitoring force from the Xiamen equipment depot, including the helicopter B-0001(H-410), and the oil spill cleaning force from the Quanzhou equipment depot, including the helicopters B-0002(H-410) and B-0003(Z-11).
Response Plan 3 (RP3): Dispatch the emergency monitoring force from the Quanzhou equipment depot, including the helicopter B-0002(H-410), and the oil spill cleaning force from the Xiamen equipment depot, including the helicopter B-0001(H-410).
Simulation and evaluation
After the response plans are developed, the simulation and evaluation analysis are
initiated.
The visualization process of the oil spill response plan is displayed on the GIS
map, as shown in Fig. 2.5, so that the plan maker can understand the plan derivation
process intuitively. The evaluation results of indicators of the three response plans
are output after simulation, as given in Table 2.6.
Then, the weight coefficients of the indicators are determined based on expert scores. The well-established ANP method is employed to determine the indicator relationship matrix and the indicator priority matrix through expert scoring, and the weight calculation results of the indicators are obtained, as given in Table 2.7.
Analysis and optimization
The comprehensive evaluation values of the three response plans are 0.5197, 0.5142, and 0.4925, calculated from the indicator results and their weighting factors. The results show that RP1 has the highest overall evaluation value; the monitoring effectiveness of RP1 and RP2 is higher than that of RP3, due to RP3's longer departure distance; and RP2 has a higher disposal efficiency than RP1, but it

Fig. 2.5 Simulation and evaluation display

Table 2.6 Results of evaluation indicators system

Indicator | Indicator description | Response plan 1 | Response plan 2 | Response plan 3
I11 | Residual fuel security | 0.5467 | 0.5503 | 0.5126
I12 | Offshore distance security | 0.5132 | 0.5014 | 0.5132
I21 | Oil spill hazard level | 0.5000 | 0.5000 | 0.5000
I22 | Environmental hazard elimination efficiency | 0.6874 | 0.7185 | 0.6447
I31 | Monitoring movement efficiency | 0.3224 | 0.3224 | 0.2521
I32 | Total time of monitoring | 0.4356 | 0.4356 | 0.3953
I41 | Disposal resource efficiency ratio | 0.5000 | 0.4000 | 0.5000
I42 | Disposal completion efficiency | 0.6874 | 0.7157 | 0.6492

Table 2.7 Results of evaluation indicators weights

Indicator | Standardized weight | Weight coefficient
I11 | 0.52436 | 0.123187
I12 | 0.47564 | 0.111740
I21 | 0.52174 | 0.135353
I22 | 0.47826 | 0.124074
I31 | 0.50000 | 0.124205
I32 | 0.50000 | 0.124205
I41 | 0.60041 | 0.154448
I42 | 0.39959 | 0.102788

has a lower disposal resource efficiency ratio than RP1, so RP1 has the best overall disposal effect.
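For reference, the comprehensive evaluation value is consistent with a simple weighted sum of the indicator scores in Table 2.6 and the weight coefficients in Table 2.7, as the following sketch shows for RP1 (reproducing approximately 0.5197). The weighted-sum aggregation itself is an assumption inferred from the two tables, not a formula stated in the paper.

```python
import numpy as np

# Indicator scores of Response Plan 1 (Table 2.6), ordered I11..I42.
rp1_scores = np.array([0.5467, 0.5132, 0.5000, 0.6874,
                       0.3224, 0.4356, 0.5000, 0.6874])

# ANP weight coefficients of the indicators (Table 2.7), same order.
weights = np.array([0.123187, 0.111740, 0.135353, 0.124074,
                    0.124205, 0.124205, 0.154448, 0.102788])

# Comprehensive evaluation value as a weighted sum of indicator scores.
comprehensive_value = float(rp1_scores @ weights)
print(round(comprehensive_value, 4))   # ~0.5197, matching the reported RP1 value
```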
Further analysis reveals that there is still some room for improvement for RP1.
Indicator I31 (monitoring movement efficiency) has the lowest score after weighting,
as does its related indicator I32 (total time of monitoring). This result suggests that the
developed response plan should give priority to optimizing the speed of emergency
monitoring force departure. In addition, the general principle should be to maximize
the effectiveness of the helicopter while ensuring the safety of the aircraft.
Based on the above, the Response Plan 4 (RP4) is proposed: dispatch emer-
gency monitoring force from Xiamen equipment depot, including the helicopter
B-0004(SC-76++) and oil spill cleaning force from Quanzhou equipment depot
including the helicopter B-0002(H-410) (Table 2.8).
The indicators I22, I31, and I32 of RP4 have all improved, with a comprehensive evaluation value of 0.5492, 5.67% higher than that of RP1, implying that RP4's emergency monitoring effectiveness has been significantly improved.
As can be seen from the above instances, the developed response plans are simu-
lated and evaluated in the multi-agent system and optimized based on the evaluation
results.

Table 2.8 Results of Response Plan 4


Indicator Indicator description Evaluation results
I 11 Residual fuel security 0.5269
I 12 Offshore distance security 0.5132
I 21 Oil spill hazard level 0.5000
I 22 Environmental hazard elimination efficiency 0.7242
I 31 Monitoring movement efficiency 0.4112
I 32 Total time of monitoring 0.4942
I 41 Disposal resource efficiency ratio 0.5000
I 42 Disposal completion efficiency 0.7754

2.5 Conclusion

In this paper, a virtual simulation model is constructed for each process of the oil
spill response plan, which fully considers the interaction between the elements and
thus is closer to the actual process of oil spill response decision and command. The
evaluation indicators system in this paper that quantitatively evaluates the response
plan through different dimensions can provide decision support for oil spill response.
(1) A simulation model based on a hybrid modeling approach combining multi-
agent and DEVS is proposed, which can accurately describe events, activities,
and processes of the oil spill response process, especially the interactions and
state changes therein.
(2) The response plan is defined and its evaluation indicator system is established
based on the multi-agent model, where the ANP method is applied to evaluate
the response plan.
(3) On the basis of the above research, this paper develops a virtual simulation system for simulation evaluation and analysis and conducts a preliminary validation of the model and the method. The decision support provided by the above research results is verified by specific cases. Nevertheless, more functions and algorithms for decision support are yet to be studied in depth.

Acknowledgements I would like to thank my tutors, Professor Hu Liu and Yongliang Tian for
their guidance and my dear friend Xiang He for her encouragement. This research did not receive
any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

References

1. Anderson, E.L., et al.: The OILMAPWin/WOSM oil spill model: application to hindcast a river spill. In: Proceedings of the 18th Arctic and Marine Oil Spill Program Technical Seminar, Edmonton, Alberta, Canada, pp. 793–817 (1995)
2. Leech, M., et al.: OSIS: a Windows oil spill information system. In: Proceedings of the 16th Arctic and Marine Oil Spill Program Technical Seminar, Calgary, Alberta, Canada, vol. 5, no. 1, pp. 27–30 (1993)
3. Zou, C.J., Yin, Y., Liu, X.W., et al.: Research and Implementation of a 3D exercise system for
offshore oil spill response. J. Syst. Simul. 030(003), 906–913 (2018)
4. Yu, Y., Mao, D., Yin, H., Zhang, X., Sun, C., Chu, G.: Simulated training system for undersea
oil spill emergency response. Aquatic Proc. 3, 173–179 (2015)
5. Liu, H., Chen, Z., Tian, Y., et al.: Evaluation method for helicopter maritime search and rescue
response plan with uncertainty. Chinese J. Aeron. 34(4), 493–507 (2021)
6. Guo, C., Zhang, S., Jiang, Y.: A multiple criteria decision method for selecting maritime search
and rescue scheme. Mech. Electr. Technol. 4, 2334–2338 (2012). Trans Tech Publications Ltd.
7. Hospital, A., Stronach, J.A., McCarthy, W., et al.: Spill response evaluation using an oil spill
model. Aquatic Proc. (2015)

8. Weiwei, J., Wei, A., Yupeng, Z., Zhaoyu, Q., Jianwei, L., Shasha, S.: Research on evaluation of
emergency response capacity of oil spill emergency vessels. Aquatic Proc. 3, 66–73 (2015)
9. Tian, Y.F., Liu, H., Huang, J.: Design space exploration in aircraft conceptual design phase based
on system-of-systems simulation. Int. J. Aeron. Space Sci. 16(4), 624–635 (2015)
Chapter 3
Transferring Dense Object Detection
Models To Event-Based Data

Vincenz Mechler and Pavel Rojtberg

Abstract Event-based image representations are fundamentally different from traditional dense images. This poses a challenge for applying current state-of-the-art object detection models, as they are designed for dense images. In this work we evaluate the YOLO object detection model on event data. To this end we replace dense convolution layers by either sparse convolutions or asynchronous sparse convolutions, which enables direct processing of event-based images, and compare the performance and runtime to feeding event histograms into dense convolutions. Hyper-parameters are shared across all variants to isolate the effect the sparse representation has on detection performance. We show that current sparse-convolution implementations cannot translate their theoretically lower computation requirements into an improved runtime.

3.1 Introduction

Event-based or neuromorphic cameras provide many advantages, such as high-frequency output, high dynamic range, and lower power consumption. However, their sensor output is a sparse, asynchronous image representation, which is fundamentally different from traditional, dense images (Fig. 3.1).
This hinders the use of convolutional layers, which are an essential building block of current state-of-the-art image processing networks. Classical convolutions on sparse data, as produced e.g. by event cameras, are inefficient, as a large part of the computed feature map defaults to zero. Furthermore, sparsity of the data is quickly lost, as the non-zero sites spread rapidly with each convolution. To alleviate this problem, changes to the convolutional layers were proposed.

V. Mechler · P. Rojtberg (B)


Fraunhofer IGD, Darmstadt, Germany
e-mail: [email protected]
V. Mechler
e-mail: [email protected]


Fig. 3.1 Object detection on the KITTI dataset [5]. Cyan boxes denote ground-truth. Pink boxes
denote predictions. Top: Using sparse event-histograms. Bottom: Using source RGB-image respon-
sible for the off events

Sparse convolutional layers [6] compute convolutions only at active (i.e. non-
zero) sites. The sub-type of ‘valid’ or ‘submanifold’ sparse convolutional layers
furthermore tries to preserve the sparsity of the data by only producing output sig-
nals at active sites, which makes them highly efficient at the cost of restricting signal
propagation. Non-valid sparse convolutions are semantically equivalent to dense convolution layers in that they compute the same result given identical inputs. Valid or submanifold sparse convolution layers, on the other hand, differ from dense convolutions, but still provide a good approximation for full convolutions on sparse data.
Messikommer et al. [8] further introduce asynchronicity into the network. This allows
for samples to be fed into the network in parts as they are produced by a sensor,
and thus to reduce the latency in real-time applications. Several small batches of
events from the same sample can be processed sequentially, producing identical
results to synchronous layers once the whole sample has been processed. However,
[8] only implemented a proof-of-concept. The project only includes asynchronous
submanifold sparse convolutional and batch-norm layers, whereas the sparseconvnet
(SCN) project [6] provides a full-fledged library. Furthermore, asynchronous models
cannot be trained, as the index_add operation used in the forward function is not
supported by PyTorch’s automatic gradient tracking. This, however, does not pose a
problem, as each layer is functionally equivalent to its SCN counterpart. Therefore, it
is possible to train an architecturally identical SCN network and transfer the weights.
As the asynchronous property is only relevant during inference, this does not pose a
limitation.

An alternative approach is to convert the sparse frame representation to dense


frames first, using a learning-based approach [9]. This way, however, one loses all
computational advantages that the sparse representation offers. Notably, it is also
possible to synthesize events from a dense frame-based representation [3].
Furthermore, a leaky surface layer was proposed by [1] which integrates the event-
to-frame conversion directly into the target network. This way the network becomes
stateful, and resembles a spiking model [7].

3.1.1 Contributions and Outline

In this work, we use the YOLO v1 model [10] as a simple but powerful dense object
recognition baseline. We model sparse networks architecturally identical to YOLO
v1 using the SCN [6] and asynet [8] frameworks. These serve as a case study to
evaluate the performance of sparse and asynchronous vs dense object detection.
We implement all variants in PyTorch and evaluate the predictive performance
and runtime requirements against a dense variant. To this end, we convert the KITTI
Vision dataset to events using [3]. This allows us to answer the question if these
novel technologies are a viable optimization over dense convolutional layers, or if
they fall short of the expectations in practice.
The remaining part of this work is structured as follows: First, Sect. 3.2 introduces
data formats required for the remainder of this work. Next, Sect. 3.3 details the major
changes and additions to the used frameworks. Section 3.4 evaluates the sparse and
dense YOLO versions w.r.t. performance, and Sect. 3.5 regarding runtime. Section
3.6 concludes our work by discussing our results and providing an outlook.

3.2 Dataset and Frame Representation

In this work we focus on frame-based object detection models. Dense video-frames


capture absolute light intensity of a scene at a specific moment in time. Event cameras,
on the other hand, capture discrete changes of brightness at specific locations in space
and time as a sequence of events. These are more akin to videos, or image sequences,
but without knowledge of the absolute brightness of any location.
Synthetic event sequences can be generated from videos by interpolating to infinite resolution in the time axis and then extracting each change of a pixel's value as a discrete event at the (estimated) time of this change. Vice versa, videos can be reconstructed from events if the absolute information of the starting frame is known. Without such information, one can still try to reconstruct a single image from events, e.g. by assuming the starting point to be an empty, grey image, which may yield good results if the amount of events available is large enough.

3.2.1 KITTI Vision Dataset

As we compare sparse with dense convolutional neural networks (CNNs), we require


parallel dense image and sparse event data. Therefore, we use the dense KITTI Vision
object detection dataset [5] and convert it to events using vid2e [3].
The dataset contains a large collection of urban traffic based scenes captured
by cameras mounted on the roof of a car. Each sample is manually annotated with
bounding boxes of different object classes like ‘Pedestrian’, ‘Car’, ‘Cyclist’, etc. The
intended use is autonomous driving and driver-assistance systems.

3.2.2 Optical Event Data and Histograms

Optical event data can be represented in various formats. A simple and lossless encoding is a sequence of discrete events storing the spatial and temporal location and the polarity of the change [2]; the magnitude of the change is usually assumed to be fixed within one dataset.
This format is, however, badly suited to processing with e.g. CNNs, as the
sequence length is variable across samples and unbounded. A common format that
overcomes this limitation are event-histograms, which accumulate all events into
a single frame similar to an image, but showing changes of brightness during the
defined interval instead of absolute brightness values at a single point in time.
In this work, we use the event-histogram representation from the asynet [8] frame-
work producing two channels, where each pixel value represents the sum of all
observed event changes of negative or positive polarity at this spatial location.
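A minimal NumPy sketch of this two-channel accumulation is shown below. It assumes events are given as (x, y, t, p) tuples with polarity p in {-1, +1}; it is an illustration of the histogram idea, not the asynet implementation itself.

```python
import numpy as np

def events_to_histogram(events, height, width):
    """Accumulate events into a 2-channel histogram: channel 0 counts
    negative-polarity events, channel 1 positive-polarity events.
    `events` is assumed to be an (N, 4) array of (x, y, t, p), p in {-1, +1}."""
    hist = np.zeros((height, width, 2), dtype=np.float32)
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    p = (events[:, 3] > 0).astype(int)          # 0 = negative, 1 = positive
    np.add.at(hist, (y, x, p), 1.0)             # unbuffered accumulation
    return hist

# Toy usage: three events, two at the same pixel with positive polarity.
ev = np.array([[5, 2, 0.01, 1],
               [5, 2, 0.02, 1],
               [7, 3, 0.03, -1]])
h = events_to_histogram(ev, height=8, width=10)
print(h[2, 5, 1], h[3, 7, 0])   # 2.0 1.0
```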

3.3 Implementation

In addition to the available dense PyTorch YOLO v1 implementation,1 we imple-


mented two more networks to be used in our final evaluation: A sparse version of
YOLO v1 implemented in the SCN framework and an asynchronous sparse version
implemented in the asynet framework. Training and evaluation of all networks is
performed using the asynet framework to ensure comparability of results. For repro-
ducibility, we make our implementation available as open-source on github.2
The sparse models follow the YOLO v1 architecture, but use specialized sparse or
asynchronous sparse layers in the convolutional block, followed by standard PyTorch
linear layers. In the SCN sparse model, the convolutional block is followed by a
sparse-to-dense layer that converts the sparse tensor into a dense representation for
further processing. In the asynet asynchronous model such a layer is not necessary,

1 https://github.com/zzzheng/pytorch-yolo-v1.
2 https://github.com/paroj/rpg_asynet.

as the model does not support training anyway and the dense feature map tensor is
passed through the network alongside the sparse events as part of the sparse repre-
sentation.
Both sparse models employ submanifold sparse convolution layers where the
dense network uses convolutions with stride 1 to achieve maximum performance.
We adapted the trainers for dense and sparse object detection models already imple-
mented in the asynet code for improved logging and debugging and added early
stopping. However, neither the SCN, nor the asynet framework contained sufficient
functionality to directly implement a YOLO v1 network.

3.3.1 Sparseconvnet Extensions

In the case of SCN, the deficit was minimal, as it only lacked the 'same'-padding feature in its sparse convolutional layer. To get around that limitation, we chose a rather inefficient but easy approach: converting the sparse tensor into a standard dense PyTorch tensor, padding this dense representation, and then converting it back into an SCN sparse tensor. This does not affect our evaluation, as it can easily be excluded from the runtime evaluation carried out via profiling, and it does not change the results of the layer computations.
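The following sketch outlines the idea of this workaround. The `to_dense` and `from_dense` arguments stand in for framework-specific conversion utilities and are assumptions, not SCN API calls; only the dense padding step is demonstrated concretely.

```python
import torch
import torch.nn.functional as F

def same_pad_workaround(sparse_tensor, kernel_size, to_dense, from_dense):
    """Emulate 'same' padding for a sparse convolution by densifying,
    zero-padding, and re-sparsifying. `to_dense` / `from_dense` are
    placeholders for the actual conversion utilities (assumed here)."""
    pad = kernel_size // 2
    dense = to_dense(sparse_tensor)               # (N, C, H, W) dense tensor
    padded = F.pad(dense, (pad, pad, pad, pad))   # pad the last two dims (W, H)
    return from_dense(padded)

# Dense-only demonstration of the padding step itself.
x = torch.randn(1, 2, 4, 4)
y = F.pad(x, (1, 1, 1, 1))
print(x.shape, "->", y.shape)   # torch.Size([1, 2, 4, 4]) -> torch.Size([1, 2, 6, 6])
```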

3.3.2 Asynet Extensions

The asynet framework, however, was missing a layer type. The existing 'asyncSparseConvolution2D' layer implements an asynchronous valid or submanifold sparse convolution only; the project does not contain an implementation of an asynchronous (non-valid) sparse convolution. We therefore implemented the asynNonValidSparseConvolution2D layer, based on the asyncSparseConvolution2D implementation. To ensure correctness, we again specified test cases to verify our implementation.
Additionally, the original code did not filter duplicate events within a sequence, causing each active site to be processed as often as the number of duplicate events (at the same spatial location) in the sequence. This behaviour caused the runtime to increase by several orders of magnitude, while also producing incorrect results in the case of duplicate events.
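One simple way to perform such filtering is sketched below; it keeps the first event per spatial location. This is a simplifying assumption for illustration and not necessarily the exact rule used in our implementation.

```python
import numpy as np

def filter_duplicate_locations(events):
    """Keep a single event per spatial location (x, y).
    `events` is an (N, 4) array of (x, y, t, p); the first occurrence per
    location is kept (an illustrative convention)."""
    _, first_idx = np.unique(events[:, :2], axis=0, return_index=True)
    return events[np.sort(first_idx)]

ev = np.array([[5, 2, 0.01, 1],
               [5, 2, 0.02, 1],    # duplicate location (5, 2)
               [7, 3, 0.03, -1]])
print(filter_duplicate_locations(ev).shape)   # (2, 4)
```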

3.3.3 Dataloader

We implemented a dataloader for the KITTI Vision dataset analogously to the already
available dataloaders for various other datasets (NCaltech101 among others). We
adapted code available through the dataset’s release site [4] for converting the ground
truth bounding boxes and labels into the commonly used format defined by the Pascal

VOC dataset. Each sample is converted to events at runtime because of the enormous
storage overhead of preprocessing the whole dataset. The spatial locations of the
events are then rescaled to the required tensor size, and finally accumulated into an
event histogram. Additionally, we implemented a version without the conversion into
events to be used to train the dense YOLO network on the original images.

3.4 Error Analysis

The intuition behind sparse CNNs is to speed up, and reduce energy consumption of,
dense CNNs by eliminating unnecessary computations. As such, we require sparse
CNNs to match the prediction performance of dense CNNs, while reducing resource
consumption.
We start by verifying the first condition, namely recognition performance match-
ing that of the dense model. Due to the high costs of training image detection networks
most evaluations were performed only with limited redundancy, as can be seen in
Table 3.1. As the goal of this evaluation is a qualitative comparison of different archi-
tectures, rather than trying to achieve state-of-the-art results, it is acceptable to omit
hyper-parameter tuning and use the same parameters for all models. A proper con-
vergence of each training run, as well as the absence of strong outliers within the per-
formed experiments, minimizes the risk of non-representative and non-reproducible
results.

3.4.1 Sparse YOLO Versus Dense YOLO

We first compared the dense YOLO network trained on dense images directly with
our structurally identical sparse implementation trained on 42ms event-windows.

Table 3.1 Median mAP and YOLO loss values over 3 training runs for different models, data, and
sparse event-window size
Model Data Window-size Med. mAP Med. loss
Dense YOLO Dense images N/A 0.1914 0.4465
Event-histograms 33 ms 0.1777 0.4448
Event-histograms 42 ms 0.2055 0.3921
Sparse YOLO Event-histograms 8 ms 0.2301 0.3410
16 ms 0.2332 0.3394
25 ms 0.2337 0.3413
33 ms 0.2115 0.3742
42 ms 0.2321 0.3443
50 ms 0.2311 0.3436
Best values highlighted

The mAP score shows the sparse model to perform approximately 20% better
than the dense baseline. This indicates that both the conversion from dense images
to sparse events and our model implementation work as intended.
The increase in performance can be explained by the availability of additional
information: The dense model is restricted to a single image per sample, the sparse
model, however, is trained on events synthesized from a sequence of images. While
events encode only the change and lose the information about absolute brightness,
it can be argued that the change, which effectively encodes moving edges, is more
beneficial to object recognition than colour and absolute brightness.
The maximum achieved mAP of about 23% is significantly lower compared to
values YOLO v1 reportedly achieved on other datasets. This is likely due to differ-
ences of the mAP calculation.3 However, we use the same calculation throughout
this work, which makes our results comparable to each other.
The ‘YOLO loss’, as presented in the original YOLO paper [10], shows consider-
ably larger differences between the sparse and dense variants. This metric, however,
constitutes a loss function to be optimized during training and is not necessarily
suited to compare the performance of different models.

3.4.2 Sparse Versus Dense Convolutions

Training the dense YOLO network on event histograms instead of dense images did not improve its performance notably. Consequently, it also performed approximately 15% worse than the sparse model trained on the same data. This indicates that the submanifold sparse convolutions, which are the only semantic difference from the dense model, contribute essentially to the ability of the model to process sparse event data. While similar results might be achieved using only dense convolutions by altering the model structure, it is remarkable that submanifold sparse convolutions enable the transfer of a model optimized for dense images to events without requiring additional hyper-parameter tuning.

3.4.3 Event-Window Size

Against our expectations, event-window size within a reasonable range has little to
no effect on the performance. We tested several configurations, starting from 42 ms,
which covers exactly one frame of the original dense dataset captured at a frame
rate of 24 fps. As this frame rate is, however, rather low, and 42 ms contains a huge
amount of events, we tested mainly values below that.
As each training run takes more than one day on our available hardware, we
decided to use fixed, pre-trained weights for initialization instead of averaging over

3 https://github.com/thtrieu/darkflow/issues/957.

multiple, randomly initialized runs to exclude effects of weight initialization from


this evaluation. All sparse test cases were trained from the same weights, pre-trained
for 121 epochs using a window size of 33 ms. There is no noticeable qualitative
difference in the evaluation loss and mAP scores between the tested window sizes
of 8, 16, 25, 33, 42 and 50 ms.

3.5 Runtime Analysis

Sparse CNNs claim to be more efficient in terms of number of operations and subse-
quently runtime and energy consumption. To verify this claim we profiled the three
different implementations using cProfile.4
For these experiments we chose to evaluate the full YOLO v1 network on real
data. This ensures a realistic ratio of active sites over locations in the tensor, as
well as the number of events per active site. While in the dense framework all convolution layers are essentially the same layer type, in the sparse frameworks all unstrided convolutions can be translated to more efficient submanifold sparse convolutions, whereas strided convolutions have to be implemented as non-valid sparse convolution layers.
Due to the high runtime of the asynchronous framework, all models are profiled
over 1042 fixed samples (14%) of the KITTI Vision dataset.
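A minimal example of how such cProfile measurements can be collected is shown below; the `model` and `loader` objects are placeholders, and the batch limit and the number of printed entries are assumptions for illustration.

```python
import cProfile
import pstats

def profile_inference(model, loader, n_batches):
    """Profile forward passes and print the functions with the largest
    cumulative runtime; `model` and `loader` are placeholders."""
    profiler = cProfile.Profile()
    profiler.enable()
    for i, (batch, _) in enumerate(loader):
        if i >= n_batches:
            break
        model(batch)                 # forward pass only (prediction)
    profiler.disable()
    stats = pstats.Stats(profiler).sort_stats("cumulative")
    stats.print_stats(20)            # top 20 entries by cumulative time
```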

3.5.1 Dense Versus Sparse

Although in theory more efficient on sparse data than dense convolutions, sparse
CNNs still show higher runtimes. Due to highly optimized code and better hard-
ware support for the massively vectorized operations of the standard dense convolu-
tions implemented in the PyTorch framework, the research oriented proof-of-concept
implementations of sparse convolutions, while also highly optimized in the SCN
framework, cannot compete in real use-cases yet.
Figure 3.2a shows the cumulative runtime of those layers that are not identi-
cal in dense and sparse networks. Convolutions in sparse networks are split into
(non-valid) sparse-convolutions and submanifold-sparse-convolutions. Here the per-
formance difference is most significant, with an increase in runtime of more than
three times. Furthermore, convolutional layers usually constitute the largest part of
CNNs.
Batch norm layers only experience a small loss in performance. I/O-Layers, which
set up additional data structures to be passed through the network with each sam-
ple to enable efficient computation of the convolutional layers, are only needed in
sparse networks and provide an additional notable overhead. As the YOLO v1 model

4 https://docs.python.org/3/library/profile.html.

Fig. 3.2 Cumulative runtime (over 1042 samples) of dissimilar layers of dense, sparse, and asyn-
chronous implementations of the YOLO v1 network during prediction

contains two strided convolutional layers and only one actual input layer, however,
about two thirds of this overhead can be attributed to our implementation of the
‘same’-padding option as described in Sect. 3.3.1.

3.5.2 Batch Size

Repeating this experiment with a smaller batch size of 1 (instead of 30 in the previous evaluation) revealed that sparse layers do not suffer as much overhead from smaller batch sizes as dense layers or, viewed in the reverse direction, do not gain as much from predicting more samples at the same time using higher batch sizes.
For convolutions, the gap between dense and sparse layers closes significantly, while sparse batch norm layers actually overtake their dense counterparts. For sparse networks, I/O layers show a similar overhead to convolutions (Fig. 3.3).

3.5.3 Synchronous Versus Asynchronous

The results of profiling the asynchronous sparse CNNs implemented in the asynet
framework are far from encouraging. Due to the experimental and little-optimized
implementation, the asynchronous convolution layers show an increase in runtime
of roughly three orders of magnitude, as seen in Fig. 3.2b. While the synchronous

Fig. 3.3 Cumulative runtime (over 1042 samples) over batch size for dense and sparse layers

framework essentially works on event histograms accumulating all events at the


same location, for the asynchronous framework multiple events per active site are
an important feature of the data. An asynchronous network processes a sample in
multiple sequences representing discrete time steps. An active site is only processed
in any distinct sequence if there exists an event at that active site at a time that falls
into the range of that sequence.

3.5.4 Asynchronous Sequence Count

To make use of the asynchronous nature of the network, a sample will usually be split into a number of event sequences, which are then processed in series. When examining the effect of the number of these sequences, we would expect an increase in runtime for larger sequence counts due to possibly duplicated active sites, bounded by the runtime of the 1-sequence baseline times the number of sequences. For small numbers of sequences, the number of active sites in each sequence will not decrease notably, as there usually is more than one event at most active sites in a sample, thus performing close to this upper bound. For large sequence counts, however, we expect a sub-linear increase in runtime due to a decreasing number of active sites per sequence.
While profiling the model using two sequences exactly matches our expectations, as shown in Fig. 3.4, the model exceeded this upper bound for three sequences.
Further analysis showed that the unexpected increase in runtime originates in low-level functions like tensor formatting. We believe this to result from overhead due to extremely high memory utilization. While profiling with three sequences required 360 GB of RAM, profiling of higher sequence counts exceeded our available resources and was thus omitted. Therefore, confirmation of our claim of a sub-linear increase in runtime for high sequence counts is left for future work.

Fig. 3.4 Normalized cumulative runtime (over 148 samples) of layers of the asynchronous YOLO v1 network at different sequence counts during prediction. Convolution increases super-linearly for sequence counts above two

3.5.5 Theoretical Runtime

Given the data is sufficiently sparse, the sparse convolution based method should
have a significant advantage.
• The dense convolution convolves the filter with every position of the input tensor,
yielding as many convolution operations as there are unique locations in the tensor.
• The sparse convolution convolves the filter with every active site, with the active
sites being a subset of the unique locations in the tensor.
Therefore, the complexity of the sparse convolution is generally lower than that of
the dense convolution. Additionally, submanifold sparse convolutions further reduce
complexity by only computing those parts of the convolution at each active site where
filter and active sites overlap. The main gain of submanifold sparse convolutions,
however, lies in preventing the increase in the number of active sites, additionally
reducing the complexity for the following layers.
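The following back-of-the-envelope sketch contrasts the multiply-accumulate counts implied by the two bullet points above. The feature-map size, channel counts, and 5% activity ratio are illustrative assumptions, not measurements from our experiments.

```python
def conv_mac_counts(height, width, k, c_in, c_out, n_active):
    """Rough multiply-accumulate counts for one convolutional layer.
    Dense: every location is convolved; sparse: only active sites are."""
    dense = height * width * k * k * c_in * c_out
    sparse = n_active * k * k * c_in * c_out
    return dense, sparse

# Illustrative numbers: a 224x224 feature map with 3x3 kernels,
# 64 -> 64 channels, and 5% of the locations active.
dense, sparse = conv_mac_counts(224, 224, 3, 64, 64, int(224 * 224 * 0.05))
print(f"dense: {dense:,}  sparse: {sparse:,}  ratio: {dense / sparse:.1f}x")
```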
In practice, sparse convolution layers require the construction of a so-called rule-
book for each sample to efficiently compute the necessary convolutions, as detailed
in [6]. While this creates some overhead, it is outweighed by the gains of only pro-
cessing the active sites. Furthermore, blocks of consecutive submanifold convolution
layers pass through the rulebook and ensure it stays valid, so that it only needs to be
computed once, further reducing the complexity.
Asynchronous-sparse-convolutions split the events into multiple sequences, and
within each sequence behave like synchronous sparse convolutions. Each active
site of a synchronous-sparse-convolution is caused by at least one event, but might
accumulate many events that happened at the same spatial location. Therefore, each
active site is processed in at least one sequence, but at worst case in all of them. Such
layers thus have at least as high a complexity as their synchronous counterparts, but
stay within a predictable margin.
However, the sparse nature of the data hinders SIMD processing and the use of
on-chip caches—two techniques that are crucial for reaching high performance on
current hardware.

3.6 Conclusion

In this work we have evaluated the prediction performance and runtime of sparse and
asynchronous-sparse CNNs with respect to classical dense CNNs. Our experiments
have shown that sparse CNNs can match the performance of their dense counterparts
without requiring additional hyperparameter tuning.
The approach works well with synthetically generated events from an existing
dense dataset, which we believe will be beneficial for the adoption of this technol-
ogy. Whereas the production of new high-quality datasets for specialised application
domains can be very expensive, dense datasets are quite abundant in comparison,
even with the constraint of requiring image-sequences to be applicable for conversion
to events.
We think that asynchronous-sparse CNNs are a promising new concept that may
find use in real-time applications due to the extremely low latency. In practice, how-
ever, these concepts are not yet sufficiently optimized.
We have extended the experimental asynet framework for asynchronous-sparse
CNNs and shown that sparse CNNs match classical dense CNNs in prediction per-
formance. On the other hand, we found that the runtime performance of the evalu-
ated frameworks cannot yet match dense networks, and especially the asynchronous
framework can at this point only be seen as a proof-of-concept. The ease of transfer
from dense to sparse networks and the potential gains in runtime will hopefully incite
further research into these promising technologies. We believe the evaluated sparse
CNN frameworks to be limited by code inefficiencies and lacking hardware support,
but in theory to be a viable optimization of CNNs.

References

1. Cannici, M., Ciccone, M., Romanoni, A., Matteucci, M.: Asynchronous convolutional networks
for object detection in neuromorphic cameras. In: Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition Workshops (2019)
2. Gallego, G., Delbruck, T., Orchard, G.M., Bartolozzi, C., Taba, B., Censi, A., Leutenegger, S.,
Davison, A., Conradt, J., Daniilidis, K., Scaramuzza, D.: Event-based vision: A survey. IEEE
Trans. Patt. Anal. Mach. Intell. (2020)
3. Gehrig, D., Gehrig, M., Hidalgo-Carrió, J., Scaramuzza, D.: Video to events: recycling video
datasets for event cameras. IEEE Conf. Comput. Vis. Patt. Recog. (CVPR) (2020)
4. Geiger, A.: The KITTI vision benchmark suite (2017). http://www.cvlibs.net/datasets/kitti/eval_object.php
5. Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? The KITTI vision benchmark suite. Conf. Comput. Vis. Patt. Recogn. (CVPR) (2012)
6. Graham, B., Engelcke, M., van der Maaten, L.: 3D semantic segmentation with submanifold sparse convolutional networks. CVPR (2018)
7. Maass, W.: Networks of spiking neurons: the third generation of neural network models. Neur.
Netw. 10(9), 1659–1671 (1997)
8. Messikommer, N., Gehrig, D., Loquercio, A., Scaramuzza, D.: Event-based asynchronous
sparse convolutional networks (2020). http://rpg.ifi.uzh.ch/docs/ECCV20_Messikommer.pdf

9. Rebecq, H., Ranftl, R., Koltun, V., Scaramuzza, D.: Events-to-video: bringing modern computer
vision to event cameras. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR) (2019)
10. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time
object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 779–788 (2016)
Chapter 4
Diagnosing Parkinson’s Disease Based
on Voice Recordings: Comparative Study
Using Machine Learning Techniques

Sara Khaled Abdelhakeem, Zeeshan Mohammed Mustafa, and Hasan Kadhem

Abstract Parkinson’s disease is a neurological disorder for which the symptoms


worsen overtime, making its treatment difficult. An early detection of Parkinson’s
disease can help patients get effective treatment before the disease becomes severe.
This paper focuses on applying and evaluating different machine learning techniques
to predict Parkinson’s disease based on patient’s voice data. The various algorithms
in MATLAB were used to train models, and the better performing models among
them were chosen. The chosen algorithms were logistic regression, SVM (linear
and quadratic), and weighted KNN. The quadratic SVM classifier performed best
among other classifiers to predict Parkinson’s disease. The findings of this study
could contribute to the development of better diagnostic tools for early prediction of
Parkinson’s disease.

4.1 Introduction

The development of many technological tools in the past century has helped with the
advancement of societies around the globe for a better and more sustainable standard
of living. One of these advancements was in the field of medicine. Technology
has greatly aided in the discovery of cures for many diseases, yet there are a few
diseases for which no cure has been discovered. One of them being Parkinson’s
disease (PD). More than 10 million people worldwide suffer from PD [1]. PD is the
second most common neurodegenerative disorder, with progressive motor symptoms
worsening over time [2]. It causes patients to experience shakiness and stiffness,
creating difficulty in walking, balancing, and coordination. Parkinson's disease has
experienced the fastest increase in prevalence and disability among neurological
disorders in recent years, and it has now become one of the leading causes of disability
worldwide [3].
A recurring problem in medicine is the late diagnosis of diseases such as PD.
This often causes problems, making treatments difficult. Although PD has no known
cure yet, early diagnosis of PD can be made by focusing on early motor symptoms to
reduce the severity of the disease as it worsens later [4]. One such early symptom in a patient suspected to have PD can be changes observed in their voice. The patient's voice can be used to detect symptoms such as loss of intensity, monotony of pitch and loudness, reduced stress, inappropriate silences, short rushes of speech, variable rate, imprecise consonant articulation, and dysphonia [5]. To aid in early
PD diagnosis, machine learning techniques can be utilized. Machine learning is the
study of self-improving computer algorithms that use experience and the analysis of patterns in data [6], taking a unique approach of applying algorithms to make predictions or decisions without being explicitly programmed to do so [7]. Hence, a machine
learning model can be a useful tool that aids in identifying the prevalence of the
above listed symptoms and diagnosing PD.
In this research, an extensive analysis is carried out to determine appropriate
machine learning algorithm(s) to predict PD based on patient’s voice records. The
primary goal is to analyze and observe the performance of various ML algorithms to assess how reliably voice records can be used for the early detection of Parkinson's disease. The experiments were based on the dataset from the IEEE Journal of Biomedical and Health Informatics [6]. Four popular ML algorithms, namely weighted KNN, quadratic SVM, linear SVM, and logistic regression, were used for PD prediction. Weighted KNN achieved the highest testing accuracy of 70.7%. However, after considering other performance measures such as prediction speed and training time, quadratic SVM was determined to be a better method with a comparable accuracy of 70.0%,
since it had the highest prediction speed and the lowest training time among others.
The rest of the paper is organized as follows. Section 4.2 discusses the related
work relevant to our research. We provide an in-depth overview of the dataset in
Sect. 4.3. Then, Sect. 4.4 illustrates the methodology of our research. Section 4.5
explains the experimental results. Finally, the conclusion and suggested future work
are presented in Sect. 4.6.

4.2 Related Research Work

Machine learning has been applied to PD prediction by many authors in previous research. In this section, a few of the algorithms previously used by other researchers to predict PD are discussed. Each research paper used many different algorithms, of which the following are the most common ones used by the researchers.
4 Diagnosing Parkinson’s Disease Based on Voice Recordings … 51

4.2.1 Logistic Regression

Logistic regression is an iterative algorithm in which each learning problem is described in terms of optimizing Bregman distances [8]. In the paper presented by
Challa et al. [9], a boosted logistic regression model was used, and it yielded a stag-
gering accuracy of 97.15% to predict PD. Boosting is the concept of turning a weak model into a stronger one to fix its weaknesses by applying a classification algorithm to reweighted versions of the training dataset [10]. Furthermore, another logistic regression model performed just as well in a research experiment done by Wang et al. [11], with the accuracy of the model for PD diagnosis being 95.73%. It is worth noting that the datasets used in both of the previously mentioned research papers were obtained
from the same source and had similar features. However, Ramadugu et al. [12] used
a dataset obtained from a different source and reported an accuracy of 84%. In this
paper, a similar dataset to that applied by T. J. Wroge et al. was used. This paper also
explored the application of logistic regression in predicting PD.

4.2.2 Random Forest

Random Forest (RF) is a relatively novel machine learning algorithm based on classifier combination: it is a classifier that combines a number of tree-structured classifiers [13]. Using similar datasets, both experiments done by Challa et al. [9]
and Wang et al. [11] showed a good performance of their RF models with reported
accuracies of 96.59% and 95.61%, respectively, in predicting PD. Wroge et al. [5]
reported two performance accuracies by using the algorithm to train two models using
different datasets. The highest accuracy they achieved using one of the datasets was
83%. However, the reported accuracy of the model using the dataset which is similar
to the one that was used in this paper was 81%. In addition, a RF model experimented
by Ramadugu et al. [12] yielded an accuracy of 89% for predicting PD.

4.2.3 SVM

The support vector machine (SVM) technique seeks a hyperplane in an N-dimensional space (N—the number of features) that distinguishes between data points [14]. The SVM technique was utilized in four of the research publications
points [14]. The SVM technique was utilized in four of the research publications
and achieved a fairly high accuracy in all of them. Wroge et al. [5] achieved the
lowest accuracy of 80%, which is still rather high, and Wang et al. [11] achieved the
maximum accuracy of 96.12%. Ramadugu et al. [12] and Hazan et al. [15] achieved
close accuracies of 93.8% and 94%, respectively, although the Hazan et al. [15] dataset was collected in the US and Germany while the Ramadugu
et al. [12] dataset was obtained from Kaggle. In their research article, Ramadugu
et al. [12] and Hazan et al. [15] obtained the maximum accuracy from the SVM
algorithm. Wang et al. [11] achieved the second greatest accuracy using the SVM
technique, with a 0.56% difference between SVM and deep learning. This demon-
strates how successful SVM algorithms are for predicting PD when compared to
other algorithms; thus, two types of SVM algorithms are used in this paper: linear
and quadratic SVM.

4.2.4 KNN

The K-nearest neighbors algorithm is a nonparametric classification and regression approach. The K-nearest neighbor (KNN) technique is frequently used in data
mining and machine learning because it is simple yet effective [16]. In the study
done by Wang et al. [11], KNN achieved an astonishing 94.94% accuracy. Mean-
while, Ramadugu et al. [12] reported an accuracy of 85% for their KNN model.
The weighted KNN algorithm was utilized in this paper and achieved the greatest testing accuracy of 70.7%. The distinction between KNN and weighted KNN is that weighted KNN additionally weights the contribution of each of the nearest neighbors, typically by its distance.

4.2.5 Gradient-boosted Decision Trees

Gradient-boosted decision trees are a machine learning technique which employs gradient descent in each boosting step to re-weight the original training sample [17]. Wroge et al. [5] and Ramadugu et al. [12] used the gradient-boosted decision trees approach; however, their accuracies were only 82% and 75%, respectively, which is low when compared to other algorithms. Although it had a 2% greater accuracy than SVM in the research of Wroge et al. [5], it had a considerably lower accuracy compared to all other algorithms in the research of Ramadugu et al. [12].
A summary of all explored algorithms in the research studies and their accuracies is
presented in Table 4.1. Several other algorithms, such as boosted logistic regressions
and Bayes nets, were used. Wang et al. [11] and Wroge et al. [5] also utilized deep learning methods such as artificial neural networks (ANN) in their research.

4.3 Dataset

The dataset was obtained on November 20, 2021, from the UCI Machine Learning
Repository database [6]. The downloaded file contains two datasets for training
and testing. The collection of data is as follows. The dataset records belong to 20
Parkinson’s disease (PD) patients and 20 healthy subjects. From all subjects, 26
4 Diagnosing Parkinson’s Disease Based on Voice Recordings … 53

Table 4.1 Summary of accuracies presented in related work

Algorithm                        Study                   Accuracy (%)
Logistic regression              Challa et al. [9]       97.15
                                 Wang et al. [11]        95.73
                                 Ramadugu et al. [12]    84
Random Forest                    Challa et al. [9]       96.59
                                 Wang et al. [11]        95.61
                                 Wroge et al. [5]        83
                                 Ramadugu et al. [12]    89
SVM                              Wroge et al. [5]        80
                                 Hazan et al. [15]       94
                                 Ramadugu et al. [12]    93.8
                                 Wang et al. [11]        96.12
KNN                              Wang et al. [11]        94.94
                                 Ramadugu et al. [12]    85
Gradient-boosted decision tree   Wroge et al. [5]        82
                                 Ramadugu et al. [12]    75

sound recordings are taken, resulting in 1,040 instances. Each record has 26 descriptive
features and 1 target feature. The descriptive features are continuous while the target
feature is binary. The dataset has no missing values. The training dataset is balanced
with a 50:50 ratio (the same number of subjects with and without PD). The testing
dataset also has the same number of descriptive features, 1 binary target feature,
and no missing values. However, the testing dataset only has 168 instances, all of which have the same outcome for the target feature (all patients diagnosed with PD). To avoid inaccurate results and measures, the training and testing data are combined into one dataset. The final combined dataset, of 1,208 instances, has a 58:42 ratio of patients with Parkinson's disease against healthy subjects.
A data quality report was generated for the combined dataset to identify potential
data quality issues. The generated data quality report indicates no missing values
or irregular cardinality. However, further observation of the difference between
maximum and third quartile range, as well as minimum and first quartile range
indicates that the following features may have outliers: Jitter local, Jitter absolute
local, Jitter rap, Jitter ppq5, Jitter ddp, Shimmer apq3, Shimmer apq5, Shimmer dda,
Minimum pitch, No. of pulses, and No. of voice breaks. This was further exam-
ined by observing the data distribution figures for the listed features. To tackle these
outliers, a data handling strategy known as clamp transformation was implemented.
Clamp transformation clamps all values above an upper threshold and below a lower
threshold to the pre-determined upper and lower threshold values. Figure 4.1 shows
the effect of clamp transformation on the data distributions of sample features with
outliers.
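A minimal sketch of such a clamp transformation is shown below. The chapter does not state the exact rule for the thresholds, so the sketch assumes the common quartile-based choice of Q1 − 1.5·IQR and Q3 + 1.5·IQR; the file name and column names are placeholders.

```python
# Clamp (winsorize) values outside assumed quartile-based thresholds.
import pandas as pd

def clamp(series: pd.Series, lower=None, upper=None) -> pd.Series:
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr if lower is None else lower   # assumed lower threshold
    upper = q3 + 1.5 * iqr if upper is None else upper   # assumed upper threshold
    return series.clip(lower=lower, upper=upper)

# usage on some of the features flagged above as having outliers
df = pd.read_csv("parkinson_voice.csv")                  # hypothetical file name
for col in ["Jitter local", "Shimmer apq3", "No. of pulses"]:
    df[col] = clamp(df[col])
```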

Fig. 4.1 Sample of data distribution before and after clamp transformation
4 Diagnosing Parkinson’s Disease Based on Voice Recordings … 55

4.4 Methodology

Figure 4.2 shows the sequence of steps for conducting the experiment. In step 1, the
dataset is imported to MATLAB, a programming and numeric computing platform
used by engineers and scientists to analyze data, develop, and create models [18]. The
combined dataset has numerical values ranging in different intervals for each feature.
Therefore, in step 2, all values for all features are normalized using Range Normal-
ization ranging between 0.0 and 1.0. Prior to inputting the dataset for model training
and testing, data exploration is necessary to check the reliability of the dataset. In
step 3, a data quality report is generated indicating the properties of all the features in
the dataset. Since all the features are of numeric data type, the properties included for
inspection are the count of instances, missing values, cardinality, minimum value,
first quartile, mean, median, third quartile, maximum value, and standard deviation
for each feature set. To further understand the relationship of features data, charts
are generated to see the trend and distribution of data. The data exploration proce-
dure outlined hitherto is key to identifying any potential data quality issues, namely
irregular cardinality, outliers, or missing values. Any significant data quality issues
will be handled using appropriate data handling strategies in step 4.
In step 5, the dataset is split into two sub-datasets with 70% assigned to the training
set and 30% assigned to the testing or holdout set using a non-stratified partitioning
method. Non-stratified partitioning does not consider the relative frequencies of the
levels of a feature in the partitioned dataset [19]. All models are cross validated

Fig. 4.2 Flow chart outlining the main steps of the experimental procedure

using holdout validation in step 6, considering the large size of the training set being
utilized. The models are then trained in step 7 using the training set and tested in
step 8 with the testing set for performance measures such as accuracy, training time,
and prediction speed. This procedure will be repeated for each model, from step 5
to step 8 for 20 iterations.
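The experiments themselves were carried out in MATLAB's Classification Learner; the following is only a rough scikit-learn equivalent of steps 2 and 5-8 (range normalization, a non-stratified 70/30 split, training, and accuracy/timing measurement over 20 iterations), with file names and hyperparameters as placeholders rather than the exact MATLAB settings.

```python
# Illustrative Python/scikit-learn analogue of the experimental loop (not the authors' code).
import time
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X, y = np.load("features.npy"), np.load("labels.npy")   # hypothetical combined dataset
X = MinMaxScaler().fit_transform(X)                      # step 2: range normalization to [0.0, 1.0]

accs, train_times = [], []
for _ in range(20):                                      # 20 iterations of steps 5-8
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=None)
    model = SVC(kernel="poly", degree=2)                 # quadratic SVM analogue
    t0 = time.time()
    model.fit(X_tr, y_tr)
    train_times.append(time.time() - t0)
    accs.append(accuracy_score(y_te, model.predict(X_te)))

print(f"mean test accuracy: {np.mean(accs):.3f}, mean training time: {np.mean(train_times):.2f}s")
```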
Model Selection:
As a trial, model selection over various classifiers is performed using the Classification Learner application in MATLAB. Different classifiers use linear and nonlinear algorithms, and since the relationship between features in the PD dataset is not known, the different classifiers available for model training are tried, and the accuracy is obtained for each one to compare among them and select the best model to use.
In this experiment, four classifiers, namely logistic regression, linear support vector machine (SVM), quadratic support vector machine, and weighted K-nearest neighbor (KNN), are chosen for dataset classification. Logistic regression is a
commonly used supervised classification algorithm when the output of the data in
question has a binary output, relevant to the dataset in this experiment (predicting
whether the subject has PD or not). Logistic regression makes use of a logistic, or a
sigmoid function, to draw an optimal separating hypothesis function to fit the binary
dataset separating the two classes in the hyperplane [19]. The sigmoid function is
represented in Eq. 4.1. The learning rate used by the classifier is set to 0.01.

1
Mw (d) = Logistic(w · d) = (4.1)
1 + e−w·d

In [7], the authors made use of an SVM classifier to train a model, that when tested
yielded an accuracy of 80% for the features similar to the features being used in this
experiment. This accuracy for SVM was better than most other models in their study
for GeMaps features. This became a basis to select this algorithm. SVM is a popular
and powerful classifier that aims to construct an optimal separating hyperplane in
the feature space between the two classes, much like logistic regression. However,
SVM makes use of the kernel trick which helps in accurately performing nonlinear
classification [7]. The kernel trick allows data to be linearly separable by projecting
them into higher dimensions. The use of linear SVM and quadratic SVM classifiers
are made, as both provided high accuracy during classifier selection. Although both
are SVM classifiers, the difference between these classifiers is the shape of the
decision boundary in the feature space. The kernel functions being used in the SVM
classifier are represented in Eqs. 4.2 and 4.3.
Linear kernel where c is an optional constant.

kernel(d, q) = d · q + c (4.2)

Polynomial kernel where p is the degree of a polynomial function.

kernel(d, q) = (d · q + 1)^p    (4.3)
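For illustration, the two kernels in Eqs. 4.2 and 4.3 translate directly into code; this is a small sketch, with c and p the constant and the polynomial degree from the text.

```python
# Direct transcription of the kernel functions in Eqs. 4.2 and 4.3.
import numpy as np

def linear_kernel(d, q, c=0.0):
    return np.dot(d, q) + c            # Eq. 4.2

def polynomial_kernel(d, q, p=2):
    return (np.dot(d, q) + 1) ** p     # Eq. 4.3; p = 2 gives the quadratic SVM kernel
```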
4 Diagnosing Parkinson’s Disease Based on Voice Recordings … 57

K-nearest neighbor (KNN) is often used as the first choice for classification study
since it is one of the most fundamental and simple classification methods that is used
when there is little or no prior knowledge on how the data is distributed [20], which
is the case with the dataset used in this study. It is a supervised learning algorithm
used in both regression and classification, calculating distances between the test data
and all training points. Then, the trained model predicts the target level with the
majority vote from the set of K-nearest neighbors [18]. There is uncertainty about
the distribution among the features in the dataset being used in this experiment,
hence the use of KNN classifier is made. However, in this experiment, the use of
weighted KNN is made since it provides better accuracy during the model selection
procedure. Unlike the classic KNN classification algorithm, weighted KNN assigns
different weights to the nearest neighbors according to the distance to the unclassified
sample [21]. The weighted KNN is represented in Eq. 4.4. By default, the number
of neighbors (k) is set to 1.

$M_k(q) = \underset{l \in \mathrm{levels}(t)}{\arg\max} \sum_{i=1}^{k} \frac{1}{\mathrm{dist}(q, d_i)^2} \times \delta(t_i, l)$    (4.4)
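A minimal NumPy sketch of the weighted-KNN vote in Eq. 4.4 is given below, using the squared-inverse distance weighting that is also used for the Weighted KNN model later in this chapter; the function name and the small epsilon guard are ours.

```python
# Distance-weighted KNN prediction as in Eq. 4.4 (squared inverse distance weights).
import numpy as np

def weighted_knn_predict(query, X_train, y_train, k=10):
    dists = np.linalg.norm(X_train - query, axis=1)   # Euclidean distances to all training points
    nearest = np.argsort(dists)[:k]                   # indices of the k nearest neighbors
    scores = {}
    for i in nearest:
        w = 1.0 / (dists[i] ** 2 + 1e-12)             # squared inverse distance weight
        scores[y_train[i]] = scores.get(y_train[i], 0.0) + w
    return max(scores, key=scores.get)                # argmax over the class levels
```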

In MATLAB, the step function, a useful tool that immediately plots the response
of a step input without the need to solve for the time response analytically, can be
used to optimize the models that are obtained after training for better accuracy [22].
Various inputs for the number of steps will be evaluated, considering the number of
features.

4.5 Results

The device used to conduct the experiment ran a 64-bit Windows 10 operating system with 16 GB RAM and an Intel Core i7-8750H CPU @ 2.20 GHz (2.21 GHz).
Table 4.2 provides information about the average performance measures of each
model that was tested using the test dataset. The kernel scale for the SVM models was
automatically set by the Classification Learner app while training. For the Weighted
KNN, the number of neighbors is set to 10, and the distance metric used is Euclidean
Distance. The distance weight is measured using the squared inverse and the data
was standardized.
The main metric of performance measure, known as accuracy, used for model
selection is defined in Eq. 4.5. Accuracy is described as the summation of true
positive and true negative predictions over the summation of all predictions made.

$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$    (4.5)

Table 4.2 Average performance measures for various classifiers used in the study

Model type            Accuracy train (%)   Accuracy test (%)   Prediction speed (obs/sec)   Training time (sec)
Weighted KNN          70.0                 70.7                24,700                       5.51079
Quadratic SVM         71.1                 70.0                55,435                       4.761055
Linear SVM            66.7                 65.8                54,750                       5.877515
Logistic regression   65.7                 65.1                44,700                       6.231795

Fig. 4.3 Performance measures results summary for all 20 iterations

The prediction speed is a performance measure that is focused on how fast the
model is making predictions, which can be a basis to compare the speeds of different
models and choose the fastest one which is more effective. The prediction speed is
measured in observations (predictions) per second. It is preferable to train models fast
for efficiency, hence the training per second is also measured. Figures 4.1, 4.2, and
4.3 show the statistical summary of the performance measure for all 20 iterations.
Summary of boxplot includes minimum, first quartile, median, third quartile, and
maximum [23].
Figure 4.3a shows the accuracy of the testing and training for each model. It is clearly shown that the first two models fall within the same range with a mean accuracy lower than the last two models. The last two models, quadratic SVM and weighted KNN, have a similar mean accuracy, although the testing accuracy mean and range for quadratic SVM are larger than those of weighted KNN. Quadratic SVM is the only model
without any outliers in accuracy. As shown in Fig. 4.3b, both SVM models have higher
prediction speed than the other two models with a mean speed of approximately 6000
observations per second. Weighted KNN had a very distinct low prediction speed. All
models had few outliers in the prediction speed noted. For the training time presented
in Fig. 4.3c, it can be noted that logistic regression, quadratic SVM, and weighted KNN have relatively similar mean training times of 5.4–5.8 s. In addition, quadratic SVM, along with logistic regression, had no outliers in the measured training times.
The average performance measures that were calculated indicate that almost all
the classifiers performed similarly with the accuracy ranging between 65 and 71%.
The least performing models were logistic regression and linear SVM with only an
4 Diagnosing Parkinson’s Disease Based on Voice Recordings … 59

accuracy of 65.1% and 65.8%, respectively. Furthermore, both these models have an
unfavorable high training time. Despite not having the best accuracy among other
classifiers, the Quadratic SVM proved to be the best performing model with an
accuracy of 70.0%, the highest prediction speed of 55,435 observations per second,
and the lowest training time of approximately 4.8 s. Although the weighted KNN has the best accuracy, it had a lower prediction speed and a higher training time compared to
quadratic SVM. It can be observed that the two nonlinear models performed better
than the linear models, suggesting that the distribution of data is nonlinear. The
step function command in MATLAB yielded a quadratic equation which aids in
explaining why the quadratic SVM might have performed better than other models.

4.6 Conclusion

A good-performing machine learning model can be trained by using various classification algorithms that will aid in solving real-life applications. In this experiment,
machine learning techniques proved that it is possible to make diagnostic predic-
tions of PD. According to the analysis of the experiment, the best performing model,
quadratic SVM, was suitable in predicting PD with the obtained dataset which was
nonlinear. These results are promising, as they prove that there are ample types of
classification algorithms that can be used to classify diverse types of data distri-
butions. Although the accuracies obtained in the results were not too high, it still
allowed the comparison of different models to see which performs better with the
dataset. This is a source of encouragement to explore more classification algorithms in the future to train more models and obtain better accuracy for PD diagnosis. In addition, more experimental changes such as addition/removal of features, different cross-validation techniques, and more performance measures can help obtain better
accuracy and select the best model.

References

1. Ball, N., Teo, S., Chandra, Chapman, J.: Parkinson’s Disease and the Environment. In: Frontiers
in Neurology, vol. 10. https://doi.org/10.3389/fneur.2019.00218 (2019). Last accessed 24 Feb
2022
2. Wong, S., Gilmour, H., Ramage-Morin, P.: Parkinson’s Disease: Prevalence, Diagnosis and
Impact, pp. 10–14 (2022)
3. Dauer, W., Przedborski, S.: Parkinson’s Disease. In: Neuron, vol. 39, no. 6, pp. 889–909. https://
doi.org/10.1016/s0896-6273(03)00568-3 (2003). Last accessed 24 Feb 2022
4. Stern, M.: Parkinson’s disease: early diagnosis and management. J. Family Pract.
26(4) (1993). https://link.gale.com/apps/doc/A13781209/AONE?u=anon~1081cb2b&sid=
googleScholar&xid=63168d10. Last accessed 24 Feb 2022
5. Wroge, T.J., Özkanca, Y., Demiroglu, C., Si, D., Atkins, D.C., Ghomi, R.H.: Parkinson’s
disease diagnosis using machine learning and voice. In: IEEE Signal Processing in Medicine
and Biology Symposium (SPMB), pp. 1–7 (2018)

6. Erdogdu Sakar, B., Isenkul, M., Sakar, C.O., Sertbas, A., Gurgen, F., Delil, S., Apaydin, H.,
Kursun, O.: Collection and analysis of a Parkinson speech dataset with multiple types of sound
recordings. IEEE J. Biomed. Health Inform. 17(4), 828–834 (2013)
7. Koza, J.R., Bennett, F.H., Andre, D., Keane, M.A.: Automated design of both the topology
and sizing of analog electrical circuits using genetic programming. In: Artificial Intelligence
in Design 96, pp. 151–170. Springer, Dordrecht (1996)
8. Collins, M., Schapire, R., Singer, Y.: Machine Learning, vol. 48, no. 13, pp. 253–285. https://
doi.org/10.1023/a:1013912006537 (2002). Last Accessed 24 Feb 2022
9. Challa, K.N.R., Pagolu, S., Panda, G., Majhi, B.: An improved approach for prediction of
Parkinson’s disease using machine learning techniques. In: International Conference on Signal
Processing, Communication, Power, and Embedded System (SCOPES), pp. 1446–1451 (2016)
10. Menezes, F., Liska, G., Cirillo, M., Vivanco, M.: Data classification with binary response
through the Boosting algorithm and logistic regression. Expert Syst. Appl. 69, 62–73 (2017).
https://doi.org/10.1016/j.eswa.2016.08.014
11. Wang, W., Lee, J., Harrou, F., Sun, Y.: Early detection of Parkinson’s disease using deep learning
and machine learning. In: IEEE Access, vol. 8, pp. 147635–147646 (2020)
12. Akhil, R., Rayyan Irbaz, M., Aruna, M.: Prediction of Parkinson’s disease using machine
learning. In: Annals of the Romanian Society for Cell Biology, pp. 5360–5367 (2021)
13. Liu, Y., Wang, Y., Zhang, J.: New machine learning algorithm: random forest. In: Infor-
mation Computing and Applications, pp. 246–252. https://doi.org/10.1007/978-3-642-34062-
8_32 (2012). Last Accessed 24 Feb 2022
14. Gandhi, R.: Support vector machine—introduction to machine learning algorithms. In:
Medium. https://towardsdatascience.com/support-vector-machine-introduction-to-machine-
learning-algorithms-934a444fca47?gi=459bd9a91d37 (2018). Last Accessed 16 Feb 2022
15. Hazan, H., Hilu, D., Manevitz, L., Ramig, L., Sapir, S.: Early diagnosis of Parkinson’s disease
via machine learning on speech data. In: IEEE 27th Convention of Electrical and Electronics
Engineers in Israel. (2012).
16. Pandey, J.A.: Comparative analysis of KNN algorithm using various normalization techniques.
Int. J. Comp. Netw. Inform. Secur. 9, 36–42. https://doi.org/10.5815/ijcnis.2017.11.04 (2017).
Last Accessed 24 Feb 2022
17. Keck, T.: FastBDT: a speed-optimized multivariate classification algorithm for the Belle II
experiment. Comput. Softw. Big Sci. 1(1) (2017)
18. Kelleher, J.D., Mac Namee, B., D'Arcy, A.: Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies. The MIT Press (2015)
19. Kambria, K.: Logistic regression for machine learning and classification. In: Kambria. https://
kambria.io/blog/logistic-regression-for-machine-learning/ (2021). Last Accessed 11 Dec 2021
20. Peterson, L.E.: K-nearest neighbor. Scholarpedia 4(2), 1883 (2009)
21. Zuo, W., Zhang, D., Wang, K.: On kernel difference-weighted k-nearest neighbor classification.
Patt. Anal Appl. 11, 247–257 (2008)
22. Control Tutorials for MATLAB and Simulink: Extras: Generating a Step Response
in MATLAB. https://ctms.engin.umich.edu/CTMS/index.php?aux=Extras_step (2021). Last
accessed 11 Dec 2021
23. Williamson, D.: The box plot: a simple visual method to interpret data. Ann. Inter. Med.
110(11), 916 (1989)
Chapter 5
Elements of Continuous Reassessment
and Uncertainty Self-awareness:
A Narrow Implementation for Face
and Facial Expression Recognition

Stanislav Selitskiy

Abstract Reflection on one’s thought process and making corrections to it if there


exists dissatisfaction in its performance is, perhaps, one of the important traits of intel-
ligence. However, such high-level abstract concepts mandatory for Artificial General
Intelligence can be modelled even at the low level of narrow Machine Learning algo-
rithms. Here, we present the self-awareness mechanism emulation in the form of an
artificial neural network (ANN) observing patterns in activations of another under-
lying ANN in a search for indications of the high uncertainty of the underlying
ANN and, therefore, the untrustworthiness of its predictions. The underlying ANN
is a CNN employed for the tasks of face recognition and facial expression recognition. The self-
awareness ANN has a memory region where its past performance information is
stored, and its learnable parameters are adjusted during the training to optimize the
performance. The same memory mechanism is used during the test phase for the
continuous reassessment of the learning parameters after each consecutive test run.

5.1 Introduction

Artificial intelligence (AI) is quite a vague terminology artefact that has been
overused so many times, sometimes even for describing narrow software imple-
mentations of simple mathematical concepts. It is understandable that to separate
high-level AI from the narrow level, such abbreviation as Artificial General Intelli-
gence (AGI) was introduced, and alternatives like "human-level AI" pop up periodically [1]. However, inherent terminological fuzziness will remain even if AI/AGI is reserved only for complex and sophisticated systems. The very founders of AI
research, such as A. Turing and J. McCarthy, who coined the very term AI, were
sceptical about the worthiness of the attempts to answer what AI is. Instead, they
suggested answering the question of how well AI can emulate human intelligence
and finding ways of quantifying the success of that imitation [11, 19]. N. Chomsky, in numerous lectures and publications (e.g., [6]), even more categorically elaborated that AI is a human linguistic concept rather than an independent phenomenon.
Suppose we accept discussing AI in the context of human-likeness. There still should be room for learning from simple and narrow machine learning (ML) algorithms if they could be used as "building blocks" and working approximations of human-like intelligence. In this work, we want to concentrate on two aspects of human-like intelligence functionality: continuous lifelong learning, and reflection on or awareness of the learning and its imperfection and uncertainty. In the narrow ML domain, parallels to the first process may be found in the concept of Lifetime Learning (LTL), and sometimes in the narrower Continuous or Online learning concepts. The second concept can be found in the meta-learning area of research.
The LTL concept was introduced in the mid-1990s in the context of the robot
learning process [17]. Instead of the standard approach of teaching a robot a particular
task in isolation from another, it was proposed to use invariants of learning one task
to help it learn another task. Unlike in many idealized lab-level ML applications,
in the case of robotics, expectations of perfect knowledge of the ever-changing
reality or precise modelling of the robot itself are unrealistic, and constant learning
“on the go” is essential.
The knowledge base in LTL was suggested to be the structured data and models trained on the data. The LTL learner could face a variety of tasks during its lifetime, and each new learning task may benefit from the saved successful models and examples of data they were trained on and applied to [16]. Models saved in the knowledge base may
also be accompanied by meta-data or “hints” about approximation transformations
they represent [2] that can be factored in a decision to include the previous model
into the solution for the new task. Another aspect of the LTL extends “hints” about
models to “hints” about processes they explain.
The idea of learning the ML processes was also introduced in the 90s by the same author [18]. There exist multiple flavours of "learning to learn" or meta-learning targeting narrower and more specific tasks, such as an extension of transfer learning [3, 8], model hyper-parameter optimization [4, 12], a wider-horizon "learning about learning" approach conjoint with explainable learning [9, 10], or augmenting artificial neural network (ANN) models with external resources, such as memory, knowledge bases, or other ANN models [13].
To bring general considerations into a practical, although narrow, perspective, we concentrate on making the meta-learning supervisor ANN model, which learns patterns of the functionality of the underlying CNN models that are associated with failed predictions for face recognition (FR) [15] and facial expression recognition (FER) tasks [14], self-adjusting based on previous experience during training, as well as test, time.
The reason to use FR and FER tasks is based not only on the fact that these are quite human-centric tasks. Although State-of-the-art (SOTA) CNN models passed the milestone of human-level face recognition accuracy a number of years ago under ideal laboratory conditions, in the case of Out-of-(training)-Data Distribution (ODD) inputs, for example, makeup and occlusions, accuracy significantly drops. The situation is even worse for FER algorithms and models, which perform far worse than FR. The reason may be that the idea that the whole spectre of emotion expressions can be reduced to six basic facial feature complexes [7] was challenged in the sense that human emotion recognition is context-based: the same facial feature complexes may be interpreted differently depending on the situational context [5].
Applying the continuous uncertainty and trustworthiness self-awareness algorithms to FR and FER models, with data sets built and partitioned to exaggerate and aggravate ODD conditions, is a reasonable setting for the algorithms' evaluation.
The paper is organized as follows. Section 5.2 proposes a solution for dynamically
adjusting the meta-learning trustworthiness estimating algorithm for predictions done
for the FR and FER tasks. Section 5.3 describes the data set used for experiments;
Sect. 5.4 outlines experimental algorithms in detail; Sect. 5.5 presents the obtained
results, and Sect. 5.6 discusses the results, draws practical conclusions, and states
directions of the research of not yet answered questions.

5.2 Proposed Solution

In [15], two approaches to assigning a trustworthiness flag to the FR prediction were proposed: statistical analysis of the distributions of the maximal softmax activation
value for correct and wrong verdicts and use of the meta-learning supervisor ANN
that uses the whole set of softmax activations for all FR classes (sorted into the
“uncertainty shape descriptor” (USD) to provide class-invariant generalization) as
an input and generates trusted or not-trusted flag.
This contribution “marries” these two approaches by collecting statistical informa-
tion about training results in the “loss layer memory” of the meta-learning supervisor
ANN. The information in the LL’s memory holds three parameters for each obser-
vation: prediction result yt , training label result lt , and trustworthiness threshold TT.
The latter parameter is the learnable one, and the derivative of the loss error, calcu-
lated from these statistical data, is used to auto-configure the TT to optimize the sum
of square errors loss: $SSE_{TT} = \sum_{t=1}^{K} SE_{TT_t}$, where K is the number of entries in the memory table:

$SE_{TT_t} = \begin{cases} (y_t - TT)^2, & (l_t < TT \wedge y_t > TT) \vee (l_t > TT \wedge y_t < TT) \\ 0, & (l_t > TT \wedge y_t > TT) \vee (l_t < TT \wedge y_t < TT) \end{cases}$    (5.1)

The input of the meta-learning supervisor ANN was built from the softmax activa-
tions of the ensemble of the underlying CNN models. The algorithm of building USD
can be described in a few words as follows: build the “uncertainty shape descriptor”
by sorting softmax activations inside each model vector, order model vectors by the
highest softmax activation, flatten the list of vectors, rearrange the order of activa-
tions in each vector to the order of activations in the vector with the highest softmax
activation.
Examples of the descriptor for the M = 7 CNN models in the underlying FR or
FER ensemble (M is a number of models in the ensemble), for the cases when none
of the models detected the face correctly, 4 models and 6 models detected the face
correctly, are presented in Fig. 5.2. It could be seen that shapes of the distribution of
the softmax activations are quite distinct and, therefore, can be subject to the pattern
recognition task which is performed by the meta-learning supervisor ANN.
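The following NumPy sketch is our reading of the descriptor-building procedure described above (order the model vectors by their highest softmax activation, rearrange every vector according to the class order of the most confident model, and flatten); it is illustrative rather than the authors' exact code.

```python
# Building an "uncertainty shape descriptor" (USD) from an ensemble's softmax outputs.
import numpy as np

def build_usd(softmax_stack):
    """softmax_stack: (M, |C|) softmax activations, one row per ensemble member."""
    model_order = np.argsort(-softmax_stack.max(axis=1))    # order models by their highest activation
    lead_perm = np.argsort(-softmax_stack[model_order[0]])  # class order of the most confident model
    usd = softmax_stack[model_order][:, lead_perm]          # rearrange every vector to that class order
    return usd.flatten()                                    # n = |C| * M input for the supervisor ANN
```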
However, unlike in the above-mentioned publication, for simplification reasons, the supervisor ANN does not categorize the predicted number of correct members of the underlying ensemble but instead performs a regression transformation. On the high level (ANN layer details are given in Sect. 5.4), the
transformation can be seen as Eq. 5.2, where n = |C|∗M is the dimensionality of
the ∀ USD ∈ X , |C|—cardinality of the set of FR or FER categories (subjects or
emotions) and M—the size of the CNN ensemble, Fig. 5.1.

$reg : X \subset \mathbb{R}^n \to Y \subset \mathbb{R}$    (5.2)

where $\forall x \in X, x \in (0 \ldots 1)^n$, and $\forall y \in Y, E(y) \in [0 \ldots M]$.

Fig. 5.1 Meta-learning supervisor ANN over underlying CNN ensemble



Fig. 5.2 Examples of the uncertainty shape descriptors (from left to right) for 0, 4, and 6 correct
FER predictions by the 7-model CNN ensemble

The loss function used for y is the usual one for regression tasks, the sum of squared errors: $SSE_y = \sum_{t=1}^{N_{mb}} (y_t - e_t)^2$, where e is the label (the actual number of members of the CNN ensemble with a correct prediction), and $N_{mb}$ is the mini-batch size.
From the trustworthiness categorization and ensemble vote point of view, the high-
level transformation of the combined CNN ensemble together with the meta-learning
supervisor ANN can be represented as Eq. 5.3:

$cat : I \subset \mathbb{I}^l \to C \times B$    (5.3)

where i are images, l is the image size, c are classifications, and b are binary trustworthy flags, such that $\forall i \in I, i \in (0 \ldots 255)^l$, $\forall c \in C, c \in \{c_1, \ldots, c_{|C|}\}$, $\forall b \in B, b \in \{1, 0\}$.

$b_i = \begin{cases} 1, & y_i > TT_t \\ 0, & y_i < TT_t \end{cases}$    (5.4)

where i is an index of the image at the moment t of the state of the loss function
memory.

$c_i = \arg\min_{c_i}(|y_i - e_i(c_i)|)$    (5.5)

Equations above describe the ensemble vote that chooses category ci , which
received the closest number of votes ei to the predicted regression number yi .
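In code, the combined decision of Eqs. 5.4 and 5.5 can be sketched as follows; variable names are ours, member_predictions holds the class label predicted by each of the M ensemble members, y_i is the supervisor's regression output, and TT is the learned threshold.

```python
# Ensemble vote closest to the supervisor's regression output, plus the trust flag.
import numpy as np

def vote_with_trust(member_predictions, y_i, TT):
    classes, counts = np.unique(member_predictions, return_counts=True)
    c_i = classes[np.argmin(np.abs(counts - y_i))]   # Eq. 5.5: class whose vote count is closest to y_i
    b_i = 1 if y_i > TT else 0                        # Eq. 5.4: trustworthiness flag
    return c_i, b_i
```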

5.3 Data Set

The BookClub artistic makeup data set contains images of E = |C| = 21 subjects.
Each subject’s data may contain a photo-session series of photos with no makeup,
various makeup, and images with other obstacles for facial recognition, such as wigs,
glasses, jewellery, face masks, or various headdresses. The data set features 37 photo
sessions without makeup or occlusions, 40 makeup sessions, and 17 sessions with
occlusions. Each photo session contains circa 168 JPEG images of the 1072 × 712
resolution of six basic emotional expressions (sadness, happiness, surprise, fear,
anger, and disgust), a neutral expression, and the closed eyes photoshoots taken
with seven head rotations at three exposure times on the off-white background.
The subjects’ age varies from their twenties to sixties. The race of the subjects
is predominately Caucasian and some Asian. Gender is approximately evenly split
between sessions.
The photos were taken over two months, and several subjects posed at multiple sessions over several weeks in various clothing and with changed hairstyles. The data set is downloadable from https://data.mendeley.com/datasets/yfx9h649wz/3. All subjects
gave written consent to use their anonymous images in public scientific research.

5.4 Experiments

The experiments were run on the Linux (Ubuntu 20.04.3 LTS) operating system with
two dual Tesla K80 GPUs (with 2×12 GB GDDR5 memory each) and one QuadroPro K6000 (with 12 GB GDDR5 memory as well), an X299 chipset motherboard, 256 GB DDR4 RAM, and an i9-10900X CPU. Experiments were run using MATLAB 2022a with the Deep Learning Toolbox. For FR
and FER experiments, the Inception v.3 CNN model was used. Out of the other SOTA
models applied to FR and FER tasks on the BookClub data set (AlexNet, GoogLeNet,
ResNet50, and Inception-ResNet v.2), Inception v.3 demonstrated overall the best
result over such accuracy metrics as trusted accuracy, precision, and recall [14, 15].
Therefore, the Inception v.3 model, which contains 315 elementary layers, was used
as an underlying CNN. Its last two layers were resized to match the number of classes
in the BookClub data set (21) and re-trained using the “Adam” learning algorithm
with 0.001 initial learning coefficient, “piecewise” learning rate drop schedule with 5
iterations drop interval, and 0.9 drop coefficient, mini-batch size 128, and 10 epochs
parameters to ensure at least 95% learning accuracy. The Inception v.3 CNN models
were used as part of the ensemble with a number of models M = 7 trained in parallel.
Meta-learning supervisor ANN models were trained using the “Adam” learning
algorithm with 0.01 initial learning coefficient, mini-batch size 64, and 200 epochs.
For online learning experiments, naturally, batch size was set to 1, as each consecutive
prediction was used to update meta-learning model parameters. The memory buffer
length, which collects statistics about previous training iterations, was set to K =
8192.
The r eg meta-learning supervisor ANN transformation represented in Eq. 5.2 is
implemented with two hidden layers with n + 1 and 2n + 1 neurons in the first and
second hidden layer, and the ReLU activation function. All source code and detailed
results are publicly available on GitHub (https://github.com/Selitskiy/StatLoss).
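The supervisor network itself was implemented in MATLAB's Deep Learning Toolbox; as a hedged illustration only, an equivalent regression network with the two hidden layers of n + 1 and 2n + 1 ReLU units described above could be written in PyTorch as follows, where n is the USD dimensionality |C|·M.

```python
# PyTorch sketch of the meta-learning supervisor regression ANN (not the authors' MATLAB code).
import torch.nn as nn

def make_supervisor(n):
    return nn.Sequential(
        nn.Linear(n, n + 1), nn.ReLU(),          # first hidden layer: n + 1 units
        nn.Linear(n + 1, 2 * n + 1), nn.ReLU(),  # second hidden layer: 2n + 1 units
        nn.Linear(2 * n + 1, 1),                 # regression output: expected number of correct members
    )
```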

5.4.1 Trusted Accuracy Metrics

Suppose only the classification verdict is used as a final result of the ANN model. In
that case, the accuracy of the target CNN model can be calculated only as the ratio
of the number of correctly identified test images by the CNN model to the number
of all test images:

$\mathrm{Accuracy} = \frac{N_{correct}}{N_{all}}$    (5.6)

When an additional dimension in classification is used, for example, the amending verdict of the meta-learning supervisor ANN (see Eq. 5.3), and $cat(i) : c \times b$, where $\forall i \in I$, $\forall (c, b) \in C \times B = \{(c_1, b_1), \ldots, (c_{|C|}, b_{|C|})\}$, $\forall b \in B = \{\mathrm{True}, \mathrm{False}\}$, then the trusted accuracy and other trusted quality metrics can be calculated as:

$\mathrm{Accuracy}_t = \frac{N_{correct:f=T} + N_{wrong:f=F}}{N_{all}}$    (5.7)

As a mapping to more usual notation, $N_{correct:f=T}$ can be seen as the True Positive (TP) number, $N_{wrong:f=F}$—True Negative (TN), $N_{wrong:f=T}$—False Positive (FP), and $N_{correct:f=F}$—False Negative (FN). Analogously to the trusted accuracy, metrics such as precision, recall, specificity, and F1 score were used for the models' evaluation.
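A short sketch of these trusted metrics, using the TP/TN/FP/FN mapping given above (correct-and-trusted, wrong-and-untrusted, wrong-and-trusted, correct-and-untrusted, respectively); the function name is ours.

```python
# Trusted accuracy, precision, and recall from correctness and trust flags.
import numpy as np

def trusted_metrics(correct, trusted):
    correct, trusted = np.asarray(correct, bool), np.asarray(trusted, bool)
    tp = np.sum(correct & trusted)      # correct prediction flagged as trusted
    tn = np.sum(~correct & ~trusted)    # wrong prediction flagged as untrusted
    fp = np.sum(~correct & trusted)     # wrong prediction flagged as trusted
    fn = np.sum(correct & ~trusted)     # correct prediction flagged as untrusted
    accuracy_t = (tp + tn) / len(correct)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy_t, precision, recall
```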

5.5 Results

Results of the FER experiments are presented in Table 5.1 (FR results are similar but with a smaller difference between un-trusted and trusted metrics). The first column holds accuracy metrics using the ensemble's maximum vote. The second column uses the ensemble vote closest to the meta-learning supervisor ANN prediction and the trustworthiness threshold learned only on the test set; see Eqs. 5.4 and 5.5. The next two columns
contain the results of the online learning experiments. The first of these columns
has data of the online learning on the randomized test data, and the last column
online learning on the images grouped by the photo session, i.e. groups of the same
person and same makeup or occlusion, but with different lighting, head position, and
emotion expression (also see Fig. 5.3). Figure 5.4 shows the relationship between the
average session trusted threshold and session-specific trusted recognition accuracy
for FR and FER cases of the grouped test sessions.

Table 5.1 Accuracy metrics for FER task

Metric                Maximal   Predicted   Pred. online   Pred. grouped
Untrusted accuracy    0.39425   0.35380     0.35380        0.35380
Trusted accuracy      0.68827   0.73339     0.64791        0.73303
Trusted precision     0.62136   0.63510     0.35043        0.66043
Trusted recall        0.53580   0.57927     0.64294        0.75818
Trusted F1 score      0.57542   0.60590     0.45362        0.70594
Trusted specificity   0.78751   0.81778     0.64937        0.71462

Maximal ensemble vote, meta-learning predicted vote, meta-learning with random online re-training vote, and meta-learning with session-grouped online re-training vote

Fig. 5.3 Trusted threshold learned during the training phase (blue, dashed line), online learning
changes for grouped test images (green), and shuffled test images (red). FR—left and FER—right

5.6 Discussion, Conclusions, and Future Work

Computational experiments with a CNN ensemble based on the Inception v.3 architecture and a data set with significant out-of-training data distribution in the form of makeup
and occlusions were performed. A meta-learning supervisor ANN was used as an
instrument of self-awareness of the model about the uncertainty and trustworthiness
of its predictions. Results demonstrate a noticeable increase of the accuracy metrics for the FR task (by tens of per cent) and a significant increase (roughly doubling) for the FER task. The proposed novel "loss layer with memory" architecture without online re-training increases key accuracy metrics by up to an additional 5 percentage points.
ness threshold learned using the “loss layer with memory” explains why prediction
for a given image was categorized as trusted or non-trusted.
However, prima facie, online re-training of the meta-learning supervisor ANN (while the underlying CNN stayed unchanged) after each tested image demonstrates poorer performance on most accuracy metrics except recall. Obviously, improving the online learning algorithms would be a part of future work. Still, what is fascinating is that the dynamically adjusted trustworthiness threshold informs the model not only about its uncertainty but also about the quality of the test session: for example, in Fig. 5.5, it can be seen that a low-threshold session has a poorly performing subject who struggles to play the anger emotion expression. In contrast, in the high-threshold session, the facial expression is much more apparent.

Fig. 5.4 Trusted accuracy against trusted threshold for grouped test images. FR—left and FER—right

Fig. 5.5 Examples of images for FER (anger expression) with low trusted threshold (bad acting)—left and high trusted threshold (better acting)—right

References

1. Post|LinkedIn: https://www.linkedin.com/posts/yann-lecun_i-think-the-phrase-agi-should-be-
retired-activity-6889610518529613824-gl2F/?utm_source=linkedin_share&utm_medium=
member_desktop_web, (Online Accessed 11 Apr 2022)
2. Abu-Mostafa, Y.S.: Learning from hints in neural networks. J. Complex. 6(2), 192–198 (1990)
3. Andrychowicz, M., Denil, M., Colmenarejo, S.G., Hoffman, M.W., Pfau, D., Schaul, T.,
Shillingford, B., de Freitas, N.: Learning to learn by gradient descent by gradient descent. In:
Proceedings of the 30th International Conference on Neural Information Processing Systems,
pp. 3988–3996. NIPS’16, Curran Associates Inc., Red Hook, NY, USA (2016)
4. Bergstra, J., Bardenet, R., Bengio, Y., Kégl, B.: Algorithms for hyper-parameter optimization.
In: Shawe-Taylor, J., Zemel, R., Bartlett, P., Pereira, F., Weinberger, K.Q. (eds.) Advances in
Neural Information Processing Systems, vol. 24. Curran Associates, Inc. (2011), https://procee
dings.neurips.cc/paper/2011/file/86e8f7ab32cfd12577bc2619bc635690-Paper.pdf
5. Cacioppo, J.T., Berntson, G.G., Larsen, J.T., Poehlmann, K.M., Ito, T.A., et al.: The psychophysiology of emotion. Handbook Emot. 2(01), 2000 (2000)
6. Chomsky, N.: Powers and Prospects: Reflections on Human Nature and the Social Order. South
End Press (1996)
7. Ekman, P., Friesen, W.V.: Constants across cultures in the face and emotion. J. Pers. Soc.
Psychol. 17(2), 124 (1971)
8. Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep
networks. In: Precup, D., Teh, Y.W. (eds.) Proceedings of the 34th International Conference
on Machine Learning. Proceedings of Machine Learning Research, vol. 70, pp. 1126–1135.
PMLR (06–11 Aug 2017), http://proceedings.mlr.press/v70/finn17a.html
9. Lake, B.M., Ullman, T.D., Tenenbaum, J.B., Gershman, S.J.: Building machines that learn and
think like people. Behav. Brain Sci. 40, e253 (2017).https://doi.org/10.1017/S0140525X160
01837
10. Liu, X., Wang, X., Matwin, S.: Interpretable deep convolutional neural networks via meta-
learning. In: 2018 International Joint Conference on Neural Networks (IJCNN), pp. 1–9 (2018).
https://doi.org/10.1109/IJCNN.2018.8489172
11. McCarthy, J., Minsky, M.L., Rochester, N., Shannon, C.E.: A proposal for the dartmouth
summer research project on artificial intelligence, August 31, 1955. AI Mag. 27(4), 12–12
(2006)
12. Ram, R., Müller, S., Pfreundt, F., Gauger, N., Keuper, J.: Scalable hyperparameter optimization
with lazy Gaussian processes. In: 2019 IEEE/ACM Workshop on Machine Learning in High
Performance Computing Environments (MLHPC), pp. 56–65 (2019)
13. Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., Lillicrap, T.: Meta-learning with
memory-augmented neural networks. In: Balcan, M.F., Weinberger, K.Q. (eds.) Proceedings
of The 33rd International Conference on Machine Learning. Proceedings of Machine Learning
Research, vol. 48, pp. 1842–1850. PMLR, New York, New York, USA (20–22 Jun 2016)
14. Selitskiy, S., Christou, N., Selitskaya, N.: Isolating Uncertainty of the Face Expression Recog-
nition with the Meta-Learning Supervisor Neural Network, pp. 104–112. Association for
Computing Machinery, New York, NY, USA (2021). https://doi.org/10.1145/3480433.3480447
15. Selitskiy, S., Christou, N., Selitskaya, N.: Using statistical and artificial neural networks meta-
learning approaches for uncertainty isolation in face recognition by the established convolu-
tional models. In: Nicosia, G., Ojha, V., La Malfa, E., La Malfa, G., Jansen, G., Pardalos,
P.M., Giuffrida, G., Umeton, R. (eds.) Machine Learning, Optimization, and Data Science,
pp. 338–352. Springer International Publishing, Cham (2022)

16. Thrun, S.: Is learning the n-th thing any easier than learning the first? Adv. Neural Inf. Process.
Syst. 8 (1995)
17. Thrun, S., Mitchell, T.M.: Lifelong robot learning. Robot. Auton. Syst. 15(1–2), 25–46 (1995)
18. Thrun, S.P.L.: Learning To Learn. Springer, Boston, MA (1998). https://doi.org/10.1007/978-
1-4615-5529-2
19. Turing, A.M.: I.—Computing machinery and intelligence. Mind LIX(236), 433–460 (1950).
https://doi.org/10.1093/mind/LIX.236.433
Chapter 6
Topic-Aware Networks for Answer
Selection

Jiaheng Zhang and Kezhi Mao

Abstract Answer selection is an essential task in the study of natural language processing, which is involved in many applications such as dialog systems, reading comprehension, and so on. It is the task of selecting the correct answer from a set of given candidates for a certain question. One of the challenging problems for this task is that traditional deep learning models for answer selection lack real-world background knowledge, which is crucial for answering questions in real-world applications. In
this paper, we propose a set of deep learning networks to enhance the traditional
answer selection models with topic modeling, so that we could use topic models
as external knowledge for the baseline models and improve the performance of the
model. Our topic-aware networks (TANs) are specially designed for answer selection
task. We propose a novel method to generate topic embeddings for questions and answers separately. We design two kinds of TAN models and evaluate them on two commonly used answer selection datasets. The results verify the advantages
of TAN in improving the performance of traditional answer selection deep learning
models.

6.1 Introduction

Answer selection is an essential part of modern dialogue systems. Dialogue systems, also known as chat systems or chatbots, are widely used in our daily life. Some famous mobile applications such as Apple's Siri or Google Assistant are dialogue systems. People can receive answers or responses from dialogue systems when they ask a question. To correctly answer people's questions, especially those with relatively standard answers, some dialogue systems use algorithms to first
select a set of possible and relevant answers from various and numerous resources, then use answer selection algorithms to sort the candidate answers, and finally send the most likely answer to the user. Thus, improving the performance of answer selection is crucial for dialogue systems. As shown in Table 6.1, given a question and a set of candidate answers, the goal of answer selection is to find the correct and best answer to that question among those candidates.

Table 6.1 Sample of answer selection task

Answer selection sample                                        Label
How are glacier caves formed?                                  –
A partly submerged glacier cave on Perito Moreno glacier       0
The ice facade is approximately 60 m high                      0
Ice formations in the Titlis glacier cave                      0
A glacier cave is a cave formed within the ice of a glacier    1
Various frameworks or methods have been proposed for answer selection. For
example, rule-based systems which focus on feature engineering with human summa-
rized rules, and deep learning models, more popularly adopted methods recently,
which use deep learning networks such as convolutional neural networks, recurrent
neural networks, or attentional networks to extract matching features automatically.
However, traditional deep learning models are purely data driven and feature driven; they not only face overfitting problems but also lack real-world background information and information beyond the features in the local contexts. To solve the
aforementioned issues, some knowledge-based methods were proposed, which use
external knowledge as a compensation for traditional deep learning models. In this
chapter, we propose to use specially designed topic-aware networks to enhance the
performance of traditional deep learning models with topic embeddings as external
knowledge references.
Topic modeling is a traditional machine learning technique that is used to model
the generation process of a set of documents. Each word in a document can be
assigned with a latent topic using topic modeling. Topic modeling is a good tool for
us to understand the nature of the documents. For each text in the document, the latent
topic tags for this text are a kind of external knowledge from a document level point
of view. Topic embeddings which is a numerical representation of latent topic tags
are proposed to make topic modeling convenient in helping deep learning models.
As shown in Fig. 6.1, we use Skip-gram techniques to generate topic embeddings.
Each word in a text is assigned with two tokens: the word token wi and the topic tag
token zi. These tokens are used as basic inputs in our proposed frameworks.
Our work is inspired by the following considerations. Firstly, we intuitively expect that the correct answer normally falls under the same topic as the question. For
example, if the question is asking about time-related information, the beginning of
the question may be “When…” or “What time…,” and the answer may contain texts
relating to topics about time such as “in the morning” or “at 8 am.” The questions

Fig. 6.1 Topic embedding generation processing

and the correct answers normally contain related words. By taking the latent topics of the answers and questions into consideration, we can restrict the selection of
the answers so as to further improve the generalization of the model. Secondly, topic
models are based on document-level information which reveals latent topics under
targeting documents. However, traditional deep learning models normally focus on
discovering local features that can be used to classify texts. Topic models can help
us understand how the texts are generated. The output of the topic models, which
is a set of topics that form the document and lists of words that describe the same
topic, is somehow like a knowledge base of certain datasets.
Motivated by the above considerations, we propose topic-aware networks for answer selection (TANs) that integrate topic models into answer selection architectures by using topic embeddings as external knowledge for baseline deep learning models.
As shown in Fig. 6.2, compared with traditional deep learning models for answer
selection, TNAS has one more topic embeddings module during the training stages.
The topic-aware module generates topic embeddings for both questions and answers.
This topic embeddings layer can help us determine the similarity about the question
and the answer from topic point of view. Eventually, we generate topic-aware vector
representations and concatenate them with baseline deep learning texts representa-
tions for both questions and answers and get their cosine distances as scoring function
for calculating the probability that the answer is a correct candidate.
To evaluate our model, we conduct experiments on two popular answer selection datasets in natural language processing. The results of our experiments show that
our model improved the performance of baseline deep learning models. The main
contributions of our work are summarized into four parts as follows:
• We propose an efficient way to generate topic embeddings for baseline deep learning models that can be easily integrated into their architectures.
• We propose to incorporate topic embeddings as external knowledge into baseline deep learning models for answer selection tasks by applying the LDA algorithm to both questions and answers.
• We propose two kinds of networks specially designed for answer selection tasks that incorporate topic information into baseline deep learning models to automatically match the topics of questions and answers.

Fig. 6.2 a Traditional deep learning framework for answer selections; b Our model

• We propose to use external databases with similar contexts in training topic embed-
dings for our topic-aware networks to further improve the performance of our
network.

6.2 Related Works

6.2.1 Topic Embedding Generation

To better extract semantic knowledge in texts for downstream NLP tasks, various
topic models have been introduced for generating topic embeddings. One influential
and classic research is the latent semantic indexing (LSI) [1]. LSI utilizes linear
algebra methods for mapping latent topics with singular value decomposition (SVD).
Subsequently, various methods for generating topic embedding have been proposed
on top of LSI. Among them include the latent Dirichlet allocation (LDA) [2], which
is introduced as a Bayesian probability model that generates document-topic and
word-topic distribution utilizing Dirichlet priors [3]. In comparison with prior topic
embeddings generation approaches such as LSI, LDA is more effective thanks to its
ability to capture hidden semantic structure within a given text through the correlated
words [4]. Dirichlet priors are leveraged to estimate document-topics density and
topic-word density in LDA, improving its efficacy in topic embedding generation.
Thanks to its superior performance, LDA has become one of the most commonly used approaches for topic embedding generation. In this work, we adopt LDA as the topic embedding generation method and use the resulting topic embeddings as an external knowledge base, bringing significant improvement to the result of answer selection.

6.2.2 Answer Selection

Answer selection has received increasing research attention thanks to its applica-
tions in areas such as dialog systems. A typical question selection model requires
the understanding of both the question as well as the candidate answer texts [5].
Previously, answer selection models typically relied on human-summarized rules with linguistic tools, feature engineering, and external resources. Specifically, Wang and Manning [6] utilize a tree-edit operation mechanism on dependency parse trees; Severyn and Moschitti [7] employ an SVM [8] with tree kernels to fuse feature engineering over parse trees for feature extraction, while lexical semantic features obtained from WordNet [9] have been used by Yih et al. [10] to further improve answer selection.
More recently, deep networks such as CNN [11, 12] and RNN [11, 13, 14] have
brought significant performance boost in various NLP tasks [15]. Deep learning-
based approach has also been predominant in the task of answer selection thanks
to their better performance. Among them, Yu et al. [16] transformed the answer
selection task into a binary classification problem [17] such that candidate sentences
are ranked based on the cross-entropy loss of the question-candidate pairs, while
constructing a Siamese-structured bag-of-words model. Subsequently, QALSTM
[18] was proposed which employs a bidirectional LSTM [19, 20] network to construct
sentence representations of questions and candidate answers independently, while
CNN is utilized in [21] as the backbone structure to generate sentence representation.
Further, HyperQA [22] is proposed where the relationship between the question and
candidate answers is modeled in the hyperbolic space [23] instead of the Euclidean
space. More recently, with the success of transformer [24] in a variety of NLP tasks
[25, 26], it has also been introduced to the answer selection task. More specifically,
TANDA [27] is proposed by transferring a pre-trained model into a model specialized
for answer selection through fine-tuning on large and high-quality dataset, improving
the stability of transformer for answer selection, while Matsubara et al. [28] improve
the efficiency of transformers by reducing the amount of sentence candidates through
neural re-rankers.
Despite the impressive progress made in deep learning-based approaches for
answer selection, these methods neglect the importance of topics in answer selection.
In this work, we propose to incorporate topic embeddings as external knowledge into
baseline deep learning models for answer selection and demonstrate its effectiveness.

6.3 Methodology

We present the detailed implementation of our model in this section. The overall
architecture of our proposed model is shown in Fig. 6.2. Our model is a multi-
channel deep learning model with two stages in training. Firstly, we use techniques
in word embedding generation to help generate topic embedding as our external

knowledge base for the next stage. Secondly, we set up our topic-aware network
for answer selections. We proposed two main topic-aware network architectures
based on traditional answer selection architectures. Lastly, we use triplet loss as our
objective function in our final training stage for our model.

6.3.1 Topic Embedding

To generate topic embeddings as external knowledge, we need to train a topic model


for the targeting documents first. Then, we use the topic model to label the input
texts to get topic token sequences. As shown in Fig. 6.1, we are using Skip-gram
algorithm to train our topic embeddings. To train word embeddings, we need to get
word token sequences. This is the same for training topic embeddings. To get topic
tokens, we first apply the latent Dirichlet allocation (LDA) algorithm to obtain a topic model using Gensim training tools, then use the results of LDA, which are a set of topics and a set of words under each topic, to assign each word w_i a latent topic z_i. The
objective function of the training process for these topic tokens is shown below.

L(D) = \frac{1}{M} \sum_{i=1}^{M} \sum_{-k \le c \le k,\ c \ne 0} \log \Pr(w_{i+c}, z_{i+c} \mid z_i)    (6.1)

where Pr(·) is the probability computed with the softmax function. The idea behind the above function is to use each topic token as a pseudo-word token to predict the words and topics around it. We thereby aim to encode not only the word information but also the topic information into the topic embeddings.
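To make the procedure above concrete, the following is a minimal sketch of topic-token generation and Skip-gram topic-embedding training with Gensim, assuming its LdaModel and Word2Vec APIs; the constants and helper names (NUM_TOPICS, tag_with_topics, etc.) are illustrative and not taken from the authors' implementation.

```python
# Sketch: assign each word a latent topic with Gensim LDA, then train
# Skip-gram embeddings over the mixed word/topic token stream (Fig. 6.1).
from gensim.corpora import Dictionary
from gensim.models import LdaModel, Word2Vec

NUM_TOPICS = 50  # the number of topics is fixed in this chapter (see Sect. 6.5)

def train_topic_model(tokenized_docs):
    dictionary = Dictionary(tokenized_docs)
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=NUM_TOPICS)
    return dictionary, lda

def tag_with_topics(doc, dictionary, lda):
    """Return the token sequence w_i interleaved with topic tags z_i."""
    tagged = []
    for word in doc:
        if word not in dictionary.token2id:
            continue
        # pick the most probable topic for this term
        topics = lda.get_term_topics(dictionary.token2id[word],
                                     minimum_probability=0.0)
        z = max(topics, key=lambda t: t[1])[0] if topics else 0
        tagged.extend([word, f"TOPIC_{z}"])
    return tagged

def train_topic_embeddings(tokenized_docs):
    dictionary, lda = train_topic_model(tokenized_docs)
    tagged_docs = [tag_with_topics(d, dictionary, lda) for d in tokenized_docs]
    # Skip-gram (sg=1): each token predicts its context, so topic tags learn
    # vectors alongside word tokens, in the spirit of Eq. (6.1).
    return Word2Vec(sentences=tagged_docs, vector_size=300, window=5, sg=1)
```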

6.3.2 Topic-Aware Networks

After generating the topic embeddings, we can use them as external knowledge in our deep learning architectures. We propose two main kinds of architecture, with four network designs in total, for topic-aware answer selection (Figs. 6.3–6.6). The first kind uses shared encoder weights: the encoders for questions and answers are trained together, and the weights are shared. The second kind uses none-shared encoder weights: the encoders are trained separately for questions and answers. The input text sequences are first separated into two sequences, one for the original text tokens and the other for the topic tokens.
TAN1: None-shared encoders for both text and topic tokens. As shown in Fig. 6.5, question texts and answer texts, which are transformed into text tokens
and topic tokens, are used as the inputs for both word embedding layers and topic
embedding layers. After getting the numerical representations for the input tokens,

Fig. 6.3 TAN3: None-shared encoders for text and shared encoders for topic

Fig. 6.4 TAN4: Shared encoders for text and none-shared encoders for topic

the outputs of each embedding layers are then processed with none-shared encoders
so that each encoder is trained separately with totally different weights inside.
TAN2: Shared encoders for both text and topic tokens. As shown in Fig. 6.6, different from TAN1, both the text-channel and topic-channel encoders are shared in TAN2.
TAN3: None-shared encoders for text and shared encoders for topic. As shown in Fig. 6.3, different from TAN1 and TAN2, this architecture uses none-shared encoders for text token embeddings and shared encoders for topic token embeddings.

Fig. 6.5 TAN1: None-shared encoders for both text and topic tokens

Fig. 6.6 TAN2: Shared encoders for both text and topic tokens

TAN4: Shared encoders for text and none-shared encoders for topic. As shown in Fig. 6.4, similar to TAN3, this is a mixed architecture which uses shared encoders for text token embeddings and none-shared encoders for topic token embeddings.
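The following is a minimal Keras sketch of a TAN2-style model (shared encoders for both channels) under the settings reported in Sect. 6.4.2 (300-dimensional embeddings, 1200 CNN filters, 300-dimensional dense layers, cosine scoring); the sequence lengths, vocabulary sizes, and variable names are illustrative assumptions, not the authors' code.

```python
# Sketch of a TAN2-style model: the text and topic encoders are each shared
# between the question and answer channels, and the two channel outputs are
# concatenated before cosine scoring (Fig. 6.2b).
from tensorflow import keras
from tensorflow.keras import layers

MAX_LEN, VOCAB, TOPICS = 40, 30000, 50  # illustrative sizes

def make_encoder(name):
    return keras.Sequential([
        layers.Conv1D(1200, 3, activation="relu"),
        layers.GlobalMaxPooling1D(),
        layers.Dense(300, activation="tanh"),
    ], name=name)

word_emb  = layers.Embedding(VOCAB, 300, name="word_embedding")
topic_emb = layers.Embedding(TOPICS, 300, name="topic_embedding")
text_enc, topic_enc = make_encoder("shared_text_enc"), make_encoder("shared_topic_enc")

def encode(prefix):
    w = keras.Input((MAX_LEN,), name=f"{prefix}_words")
    z = keras.Input((MAX_LEN,), name=f"{prefix}_topics")
    # concatenate the text and topic representations of one sentence
    rep = layers.Concatenate()([text_enc(word_emb(w)), topic_enc(topic_emb(z))])
    return (w, z), rep

(q_in, q_rep), (a_in, a_rep) = encode("question"), encode("answer")
score = layers.Dot(axes=-1, normalize=True)([q_rep, a_rep])  # cosine similarity
model = keras.Model(inputs=[*q_in, *a_in], outputs=score)
```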

6.3.3 Triplet Loss

For all the networks we proposed, we adopt the same training and testing mechanism. We use a triplet loss in our model. During the training stage, for each question text

Table 6.2 Statistics of the questions


Dataset Train Dev Test Total
WikiQA 2118 296 633 3047
TrecQA 1229 82 100 1411

Q, besides its ground truth answer A+, we randomly pair a negative answer A− for
it. Therefore, the input data are actually a triplet set (Q, A+, A−). Our goal is to
minimize this triplet loss for the answer selection task:

L(Q, A^{+}, A^{-}) = \max\bigl(0,\ m + d(Q, A^{+}) - d(Q, A^{-})\bigr),    (6.2)

where d(Q, A+) and d(Q, A−) are the Euclidean distances between the vector representation of the question text and those of the positive and negative answer texts, respectively, and m is the margin.
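A minimal sketch of the objective in Eq. (6.2) as a TensorFlow loss over encoded triplets is shown below; the margin value is an illustrative assumption, since the chapter does not report it.

```python
import tensorflow as tf

MARGIN = 0.2  # illustrative value; the chapter does not report the margin m

def triplet_loss(q_vec, pos_vec, neg_vec, margin=MARGIN):
    """Eq. (6.2): max(0, m + d(Q, A+) - d(Q, A-)) averaged over the batch."""
    d_pos = tf.norm(q_vec - pos_vec, axis=-1)  # Euclidean distance d(Q, A+)
    d_neg = tf.norm(q_vec - neg_vec, axis=-1)  # Euclidean distance d(Q, A-)
    # hinge: the loss is zero once the negative answer is at least m farther
    # from the question than the positive answer
    return tf.reduce_mean(tf.maximum(0.0, margin + d_pos - d_neg))
```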

6.4 Experiments

In this section, we present the experiments and the results of our proposed model. All the network architectures are implemented using Keras. We evaluate our model on two widely used answer selection datasets.

6.4.1 Dataset

The statistics of the datasets used in this paper are shown in Table 6.2. The tasks for
both datasets are to rank the candidate answers based on their relatedness to the
question. Brief descriptions of the datasets are as follows:
1. WikiQA: This is a benchmark for open-domain answer selection that was created
from actual Bing and Wikipedia searches. We only use questions with at least
one accurate response.
2. TrecQA: This is another answer selection dataset that comes from Text REtrieval
Conference (TREC) QA track data.

6.4.2 Experiment Settings

To evaluate the model, we implement a baseline system for comparison. The baseline model adopts CNN as the encoder, and the architecture of the baseline model is the
same as TAN1 and TAN4 but without the topic-aware module. The CNN used in
TAN1, 2, 3, 4 is the same as the baseline model. The other key settings of our models
are as follows:

1. Embeddings: We use GloVe with 300 dimensions to initialize our word embedding layer. For the topic embeddings, we use the Gensim package to generate the LDA model and its Word2Vec function to generate topic embeddings. We generate topic embeddings for questions and answers separately.
2. CNN as encoders: We set the CNN to 1200 filters and all the inner dense layers to 300 dimensions. We use Keras to set up the training and testing process. The optimizer is Adam.
3. TAN1 without topic: The first baseline model is a traditional answer selection architecture which uses none-shared encoders for question and answer tokens.
4. TAN4 without topic: The second baseline model has shared encoders for question and answer tokens.
5. Evaluation metrics: Our task is to rank the candidate answers by their correctness with respect to the question; thus, we adopt widely used measures from information retrieval and answer selection, namely mean average precision (MAP) and mean reciprocal rank (MRR), to evaluate the performance of our model (a minimal sketch of these metrics follows this list).
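The following is a minimal sketch of the MAP and MRR computations referred to in item 5; the input format (per-question lists of 0/1 relevance labels sorted by model score) and the function names are illustrative.

```python
# MAP: mean over questions of the average precision at each relevant answer.
# MRR: mean over questions of 1 / rank of the first relevant answer.
def mean_average_precision(all_ranked_labels):
    ap_sum = 0.0
    for labels in all_ranked_labels:
        hits, precisions = 0, []
        for rank, rel in enumerate(labels, start=1):
            if rel:
                hits += 1
                precisions.append(hits / rank)
        ap_sum += sum(precisions) / max(hits, 1)
    return ap_sum / len(all_ranked_labels)

def mean_reciprocal_rank(all_ranked_labels):
    rr_sum = 0.0
    for labels in all_ranked_labels:
        for rank, rel in enumerate(labels, start=1):
            if rel:
                rr_sum += 1.0 / rank
                break
    return rr_sum / len(all_ranked_labels)

# Example: two questions whose correct answers are ranked 1st and 3rd
print(mean_average_precision([[1, 0, 0], [0, 0, 1]]))  # ~0.667
print(mean_reciprocal_rank([[1, 0, 0], [0, 0, 1]]))    # ~0.667
```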

6.4.3 Results and Discussion

Table 6.3 shows the results of our models. From the results, we have the following findings.
Firstly, for baseline models, TAN4 without topic outperforms TAN1 without topic
in both WikiQA and TrecQA. This indicates that the shared encoders may be more
suitable for answer selection tasks. This is reasonable because, with shared encoders, the model can compare the representations of questions and answers in the same context; with none-shared encoders, the model has to learn twice as many parameters to compare the representations, which is harder with limited samples.
Secondly, compared with the baseline models, all of our models outperform the baseline to some extent. Adding the topic-aware module does improve the performance of the baseline models. Among all the networks, TAN2, which adopts shared encoders

Table 6.3 Model performance of topic-aware networks for answer selection task
Model WikiQA TrecQA
MAP MRR MAP MRR
TAN1 without topic 0.65 0.66 0.71 0.75
TAN4 without topic 0.67 0.68 0.73 0.77
TAN1 0.66 0.67 0.72 0.79
TAN2 0.69 0.70 0.79 0.80
TAN3 0.67 0.68 0.74 0.76
TAN4 0.68 0.69 0.72 0.78

for both text and topic tokens outperforms all the other networks. TAN1 has similar
performance to the best baseline model. TAN3 and TAN4 have similar performance.
These findings show that for both baseline and our proposed model, shared
encoders are more efficient in pairing the right answer to the question. Topic-
aware modules improved the performance of the baseline models. TAN2 is the best
architecture among all the architectures we have proposed.

6.5 Conclusion and Future Work

In this paper, we studied incorporating external knowledge into the traditional answer
selection deep learning models by using specially designed networks. The proposed
network is an automatic tool to extract useful information from the topic models and
use it in any deep learning baseline models. We designed the representation of external
knowledge as topic embeddings. The results show that our model can improve the
performance of the baseline deep learning model. Moreover, we identified the best
architectures among our designed networks.
For future works, we can apply two improvements. First, during the training stage
of topic modeling, we fixed the number of topics for topic models. However, we
will explore ways to automatically decide the number of topics we should use in the
model. Second, given that there are many question type classification datasets such
as TREC, we will investigate the use of transfer learning to obtain a pre-trained topic
embedding using the publicly available dataset and fine-tune the embedding using
the training data.

References

1. Lai, T., Bui, T., Li, S.: A review on deep learning techniques applied to answer selection. In:
Proceedings of the 27th International Conference on Computational Linguistics, pp. 2132–2144
(2018)
2. Hofmann, T.: Probabilistic latent semantic indexing. In: Proceedings of the 22nd Annual Inter-
national ACM SIGIR Conference on Research and Development in Information Retrieval,
pp. 50–57 (1999)
3. Kumari, R., Srivastava, S.K.: Machine learning: a review on binary classification. Int. J. Comput.
Appl. 160(7) (2017)
4. Yih, S.W., Chang, M.-W., Meek, C., Pastusiak, A.: Question answering using enhanced
lexical semantic models. In: Proceedings of the 51st Annual Meeting of the Association for
Computational Linguistics, (2013)
5. Miller, G.A.: Wordnet: a lexical database for english. Commun. ACM 38(11), 39–41 (1995)
6. Fenchel, W.: Elementary geometry in hyperbolic space. In: Elementary Geometry in Hyperbolic
Space. de Gruyter (2011)
7. Tan, M., dos Santos, C., Xiang, B., Zhou, B.: Lstm-based deep learning models for non-factoid
answer selection. arXiv preprint arXiv:1511.04108 (2015)
8. Yu, L., Hermann, K.M., Blunsom, P., Pulman, S.: Deep learning for answer sentence selection.
arXiv preprint arXiv:1412.1632 (2014)

9. Yogatama, D., Dyer, C., Ling, W., Blunsom, P.: Generative and discriminative text classification
with recurrent neural networks. arXiv preprint arXiv:1703.01898 (2017)
10. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł.,
Polosukhin, I.: Attention is all you need. Advan. Neural Inf. Process. Syst. 30 (2017)
11. Noble, W.S.: What is a support vector machine? Nature Biotechnol. 24(12), 1565–1567 (2006)
12. Matsubara, Y., Vu, T., Moschitti, A.: Reranking for efficient transformer-based answer selec-
tion. In: Proceedings of the 43rd International ACM SIGIR Conference on Research and
Development in Information Retrieval, pp. 1577–1580 (2020)
13. Zhang, X., Zhao, J., LeCun, Y.: Character-level convolutional networks for text classification.
Advan. Neural Inform. Process. Syst. 28 (2015)
14. Naseem, U., Razzak, I., Musial, K., Imran, M.: Transformer based deep intelligent contextual
embedding for twitter sentiment analysis. Futur. Gener. Comput. Syst. 113, 58–69 (2020)
15. Keskar, N.S., McCann, B., Varshney, L.R., Xiong, C., Socher, R.: Ctrl: a conditional transformer
language model for controllable generation. arXiv preprint arXiv:1909.05858 (2019)
16. Garg, S., Vu, T., Moschitti, A.: Tanda: transfer and adapt pre-trained transformer models for
answer sentence selection. In: Proceedings of the AAAI Conference on Artificial Intelligence,
vol. 34, pp. 7780–7788 (2020)
17. Severyn, A., Moschitti, A.: Automatic feature engineering for answer selection and extraction.
In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing,
pp. 458–467 (2013)
18. Melamud, O., Goldberger, J., Dagan, I.: Context2vec: learning generic context embedding with
bidirectional lstm. In: Proceedings of the 20th SIGNLL Conference on Computational Natural
Language Learning, pp. 51–61 (2016)
19. Likhitha, S., Harish, B.S., Keerthi Kumar, H.M.: A detailed survey on topic modeling for
document and short text data. Int. J. Comput. Appl. 178(39), 1–9 (2019)
20. Liu, P., Qiu, X., Huang, X.: Recurrent neural network for text classification with multi-task
learning. arXiv preprint arXiv:1605.05101 (2016)
21. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780
(1997)
22. Severyn, A., Moschitti, A.: Learning to rank short text pairs with convolutional deep neural
networks. In: Proceedings of the 38th International ACM SIGIR Conference on Research and
Development in Information Retrieval, pp. 373–382 (2015)
23. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. J. Mach. Learn. Res. 3(1),
993–1022 (2003)
24. Tay, Y., Tuan, L.A., Hui, S.C.: Hyperbolic representation learning for fast and efficient neural
question answering. In: Proceedings of the Eleventh ACM International Conference on Web
Search and Data Mining, pp. 583–591 (2018)
25. Wang, M., Manning, C.D.: Probabilistic tree-edit models with structured latent variables for
textual entailment and question answering. In: Proceedings of the 23rd International Conference
on Computational Linguistics (Coling 2010), pp. 1164–1172 (2010)
26. Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of CNN and RNN for natural
language processing. arXiv preprint arXiv:1702.01923 (2017)
27. Sethuraman, J.: A constructive definition of dirichlet priors. Statistica Sinica pp. 639–650
(1994)
28. Lai, S., Xu, L., Liu, K., Zhao, J.: Recurrent convolutional neural networks for text classification.
In: Twenty-ninth AAAI Conference on Artificial Intelligence (2015)
Chapter 7
Design and Implementation
of Multi-scene Immersive Ancient Style
Interaction System Based on Unreal
Engine Platform

Sirui Yang, Qing Qing, Xiaoyue Sun, and Huaqun Liu

Abstract This project, based on the narrative thread of flying and searching for Kongming lanterns, combines a novel and vivid interactive system with a traditional roaming system to create a multi-scene, immersive virtual world. The project uses the UE4 engine and 3ds Max modeling software to build the museum scene and the virtual ancient scene. After modeling is completed in 3ds Max, the models are imported into UE4, where colliders and plants are added, and the scene is continuously optimized in combination with the street layout of the Tang Dynasty. The scene is then tested so that the interactions and the scene match and merge well. Sequence is used to record cutscenes, blueprints connect animation with character operation, and particle systems are interspersed to realize scene roaming and the interaction of the Kongming lanterns. The project combines various technologies in the unreal engine, breaks the monotonous experience mode of the traditional roaming system, and reproduces the magnificent ancient city, hurrying us off to the appointment of a thousand lights.

7.1 Introduction

With the rapid development of science and technology, virtual reality technology has penetrated all aspects of people's lives as it has gradually moved from theory to industrialization. It is also loved by users because of its immersion,
interaction and imagination. Virtual reality, just as its name implies, is the combi-
nation of virtual and reality. Theoretically, virtual reality technology (VR) is a kind
of computer simulation system that can create and experience the virtual world. It
uses the computer to generate a simulation environment and immerse the user in the
environment. With the help of 3D modeling technology, realistic real-time rendering
technology, collision detection technology, and other key technologies of virtual
reality, the picture expressive force and atmosphere can be improved when the users
experience.

S. Yang · Q. Qing · X. Sun · H. Liu (B)


Beijing Institute of Graphic Communication, Beijing 102600, China
e-mail: [email protected]


In the virtual reality world, the most important features are the senses of "realism" and "interaction." Participants feel as if they were in the virtual world: the environment and figures appear just as they would in a real environment, in which various objects and phenomena interact. Objects and characteristics in the environment develop and change according to natural laws, and people in the environment have sensations such as vision, hearing, touch, motion, taste, and smell. Virtual reality technology can create all kinds of fabulous artificial environments, which are vivid and immersive, and users can interact with the virtual environment to the point that the artificial is mistaken for the real.
Based on the characteristics and theoretical foundation of virtual reality technology
mentioned above, this project designs a multi-scene strong immersion ancient wind
interactive project.
Through multi-scene interaction, this project presents a beautiful story of ancient
and modern travel. The visitor to the museum was immersed in the artistic atmosphere; looking at the ancient paintings, he imagined the charm of the Tang Dynasty. In a trance, he seemed to be in the Tang Dynasty, and suddenly he was pulled in through a crack in space–time. Then he opened his eyes and became a general of the Tang
Dynasty in armor. The story is set to show the idea that ancient cultures can still
move people. People’s living conditions and concepts are constantly changing, but
those cultural and architectural arts that can move people are permanent.

7.2 Technical Basis

Physical-based rendering technology has been widely used in the movie and game industry since the Disney Principled BRDF was introduced at SIGGRAPH 2012, due to its high ease of use and convenient workflow. Physical-based rendering
(PBR) refers to the concept of rendering using a coloring/lighting model based on
physical principles and microplane theory, as well as surface parameters measured
from reality to accurately represent real-world materials. The unreal engine is widely
used because of its excellent rendering technology. Next, we will describe the
theoretical and technical basis of the unreal engine for rendering scenes.

7.2.1 Lighting Model

Currently, the mainstream physical-based specular reflection BRDF model is the Microfacet Cook-Torrance BRDF, based on microfacet theory. Microfacet theory derives from the idea that microgeometry is modeled as a set of microfacets and is generally used to describe surface reflection from non-optically flat surfaces.
The basic assumption of microplane theory is the existence of microscopic geom-
etry (microgeometry). The scale of microscopic geometry is smaller than that of
observational scales (such as coloring resolution), but it is larger than that of visible

Fig. 7.1 When light interacts with a non-optical flat surface, the non-optical flat surface behaves
like a large collection of tiny optical flat surfaces

wavelengths (so the application of geometrical optics and wave effects such as diffrac-
tion can be ignored). The microplane theory was only used to derive the expression
of single-bounce surface reflection in 2013 and before. In recent years, with the
development of the field, there have been some discussions on the multiple bouncing
surface reflection using microfacet theory.
Each surface point can be considered as optical flat because the microscopic
geometric scale is assumed to be significantly larger than the visible light wavelength.
As mentioned above, an optical flat surface divides light into two directions: reflection
and refraction.
Each surface point reflects light from a given direction of entry into a single
direction of exit, which depends on the direction of the microgeometry normal m.
When calculating BRDF items, specify light direction l and view direction v. This
means that all surface points, only those small planes that are just pointing in the right
direction to reflect l to v, may contribute to the BRDF value (positive and negative
in other directions, after the integral, offset each other).
In Fig. 7.1, we can see that the surface normal m of these “correctly oriented”
surface points is located just in the middle between l and v. The vector between l and
v is called a half-vector or a half-angle vector. We express it as h.
Only the direction of the surface points m = h reflect light l to the direction of
line of sight v. Other surface points do not contribute to BRDF.
Not all m = h surface points contribute to reflection actively; some are blocked by
l direction (shadowing), v direction (masking), or other surface areas of both. Micro-
facet theory assumes that all shadowed light disappears from the mirror reflector. In
fact, some of these will eventually be visible due to multiple surface reflections, but
this is not generally considered in current microplane theory.
In Fig. 7.2, we see that some surface points are blocked from the direction of l,
so they are blocked and do not receive light (so they cannot reflect anything). In the
middle, we see that some surface points are not visible from the view direction v, so
of course, we will not see any light reflected from them. In both cases, these surface
points do not contribute to the BRDF. In fact, although shadow areas do not receive
any direct light from l, they do receive (and therefore reflect) light from other surface
areas (as shown in the right image). The microfacet theory ignores these interactions.

Fig. 7.2 Various types of light surface interactions

7.2.2 From Physical Phenomena to BRDF

Using these assumptions (a locally optical flat surface without mutual reflection), it
is easy to derive a general form of Specular BRDF called Microfacet Cook-Torrance
BRDF. This Specular BRDF takes the following form:

f(l, v) = \frac{D(h)\, F(v, h)\, G(l, v, h)}{4\,(n \cdot l)(n \cdot v)}    (7.1)

Among them:
• D(h): Normal Distribution Function, which describes the probability of the normal
distribution of micropatches, i.e., the concentration of the normal that is oriented
correctly. That is, the concentration relative to the surface area of the surface point
that reflects light from L to V with the correct orientation.
• F(l, h): The Fresnel Equation, which describes the proportion of light reflected
by a surface at different surface angles.
• G(l, v, h): Geometry Function, which describes the self-shading properties of a
microplane, i.e., the percentage of uncovered surface points M = H.
• Denominator 4(n · l)(n · v): A correction factor that accounts for transforming quantities between the local space of the microgeometry and the local space of the overall macro surface.

With regard to Cook-Torrance BRDF, two considerations need to be highlighted:


For the dot product in the denominator, simply avoiding negative values is not
enough, and zero values must also be avoided. This is usually done by adding very
small positive values after a regular clamp or absolute value operation.
Microfacet Cook-Torrance BRDF is the most widely used model in practice. In
fact, it is the simplest microplane model that people can think of. It only models
single scattering on a single micro-surface in a geometric optical system, without
considering multiple scattering, layered material, and diffraction. The microfacet
model actually has a long way to go.
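As a numerical illustration of Eq. (7.1), the sketch below evaluates the microfacet specular BRDF for one light/view pair. The chapter does not fix particular D, F, and G terms, so the GGX normal distribution, Schlick Fresnel, and a Smith-Schlick geometry term are used here only as one common choice; the parameter values are illustrative.

```python
# Minimal numerical sketch of the Cook-Torrance specular BRDF in Eq. (7.1):
# f(l, v) = D(h) * F(v, h) * G(l, v, h) / (4 (n.l)(n.v))
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def d_ggx(n_dot_h, alpha):
    # GGX / Trowbridge-Reitz normal distribution function
    a2 = alpha * alpha
    denom = n_dot_h * n_dot_h * (a2 - 1.0) + 1.0
    return a2 / (np.pi * denom * denom)

def f_schlick(v_dot_h, f0):
    # Schlick approximation of the Fresnel term
    return f0 + (1.0 - f0) * (1.0 - v_dot_h) ** 5

def g_smith_schlick(n_dot_l, n_dot_v, alpha):
    # Smith geometry term with a Schlick-style k = alpha / 2
    k = alpha / 2.0
    g1 = lambda x: x / (x * (1.0 - k) + k)
    return g1(n_dot_l) * g1(n_dot_v)

def cook_torrance_specular(n, l, v, alpha=0.3, f0=0.04):
    h = normalize(l + v)                  # half-vector (Fig. 7.1)
    n_dot_l = max(np.dot(n, l), 1e-4)     # clamp: avoid zero/negative denominators
    n_dot_v = max(np.dot(n, v), 1e-4)
    n_dot_h = max(np.dot(n, h), 0.0)
    v_dot_h = max(np.dot(v, h), 0.0)
    D = d_ggx(n_dot_h, alpha)
    F = f_schlick(v_dot_h, f0)
    G = g_smith_schlick(n_dot_l, n_dot_v, alpha)
    return D * F * G / (4.0 * n_dot_l * n_dot_v)   # Eq. (7.1)

n = np.array([0.0, 0.0, 1.0])
l = normalize(np.array([0.3, 0.0, 1.0]))
v = normalize(np.array([-0.2, 0.1, 1.0]))
print(cook_torrance_specular(n, l, v))
```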

7.2.3 Physical-based Environment Lighting

Lighting in a scene is the most critical and important part, and generally uses physical-
based ambient lighting. Common technical solutions for ambient lighting include
image-based lighting (IBL). For example, the diffuse reflective ambient lighting
part generally uses the Irradiance Environment Mapping technology in traditional
IBL. Based on physical specular ambient lighting, image-based lighting (IBL) is
commonly used in the industry. To use physical-based BRDF models with image-
based lighting (IBL), Radiance Integral (Radiance Integral) needs to be solved, and
Importance Sample is usually used to solve the Brightness Integral.
Importance sampling is based on known conditions (distribution functions). It is a strategy that concentrates samples in the regions where the integrand has high probability mass (the important areas) and then efficiently computes an accurate estimate. The following two terms are briefly summarized.
Split Sum Approximation. Based on the importance sampling method, substitute
the Monte Carlo integral formula into the rendering equation:

\int_{\Omega} L_i(l)\, f(l, v) \cos\theta_l \,\mathrm{d}l \approx \frac{1}{N} \sum_{k=1}^{N} \frac{L_i(l_k)\, f(l_k, v) \cos\theta_{l_k}}{p(l_k, v)}    (7.2)

The direct solution of the above form is complex, and it is not realistic to evaluate it in real-time rendering.
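To illustrate the estimator in Eq. (7.2) and why evaluating it per pixel and per frame is too costly for real time, here is a toy importance-sampled Monte Carlo sketch; the integrand is a stand-in for L_i(l) f(l, v) cos θ, not a real renderer.

```python
# Toy importance-sampled Monte Carlo estimate of a hemispherical integral,
# mirroring Eq. (7.2): average integrand(l_k) / p(l_k) over N sampled directions.
import numpy as np

rng = np.random.default_rng(0)

def toy_integrand(cos_theta):
    # stand-in for L_i(l) * f(l, v) * cos(theta_l); here simply cos^2(theta)
    return cos_theta ** 2

def estimate_radiance(n_samples):
    # cosine-weighted hemisphere sampling: p(l) = cos(theta) / pi
    cos_theta = np.sqrt(rng.uniform(size=n_samples))
    pdf = cos_theta / np.pi
    return np.mean(toy_integrand(cos_theta) / pdf)

# analytic value of this toy integral over the hemisphere is 2*pi/3 ~ 2.094
print(estimate_radiance(16))       # noisy with only a few samples
print(estimate_radiance(200_000))  # converges, but far too many samples per pixel
```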
At present, the mainstream practice in the game industry is to split the sum \frac{1}{N} \sum_{k=1}^{N} \frac{L_i(l_k) f(l_k, v) \cos\theta_{l_k}}{p(l_k, v)} in the above formula into two terms: the average brightness \frac{1}{N} \sum_{k=1}^{N} L_i(l_k) and the environment BRDF \frac{1}{N} \sum_{k=1}^{N} \frac{f(l_k, v) \cos\theta_{l_k}}{p(l_k, v)}.
Namely:

\frac{1}{N} \sum_{k=1}^{N} \frac{L_i(l_k)\, f(l_k, v) \cos\theta_{l_k}}{p(l_k, v)} \approx \left( \frac{1}{N} \sum_{k=1}^{N} L_i(l_k) \right) \left( \frac{1}{N} \sum_{k=1}^{N} \frac{f(l_k, v) \cos\theta_{l_k}}{p(l_k, v)} \right)    (7.3)

After splitting, the two terms are precomputed offline so that the result matches an offline-rendered reference. In real-time rendering, we look up the two precomputed terms of the Split Sum Approximation scheme and combine them as the rendering result of the real-time IBL physical environment lighting part.
The First Term: Pre-Filtered Environment Map (pre-filter). The first term is \frac{1}{N} \sum_{k=1}^{N} L_i(l_k), which can be understood as the mean value of the brightness L_i(l_k). Under the assumption n = v = r, it depends only on the surface roughness and the reflection vector. For this term, industry practice is relatively uniform (including UE4 and COD: Black Ops 2). The main scheme adopted is to pre-filter the environment texture and store the blurred environment highlights in multilevel blurred mipmaps.

\frac{1}{N} \sum_{k=1}^{N} L_i(l_k) \approx \mathrm{Cubemap.sample}(r, \mathrm{mip})    (7.4)

That is to say, the first term is obtained directly by sampling the corresponding mip level of the cubemap.
The Second Term: Environment BRDF. The second term, \frac{1}{N} \sum_{k=1}^{N} \frac{f(l_k, v) \cos\theta_{l_k}}{p(l_k, v)}, is the hemispherical-directional reflectance of the mirror reflector, which can be interpreted as the environment BRDF. It depends on the elevation angle θ, the roughness α, and the Fresnel term F. The Schlick approximation is often used to approximate F, which is parameterized only on a single value F_0, making R_spec a function of three parameters (the elevation θ (NdotV), the roughness α, and F_0).
UE4 proposed in [Real Shading in Unreal Engine 4, 2013] that, in the second summation term, F_0 can be factored out of the integral after applying the Schlick approximation:

\int_{\Omega} L_i(l)\, f(l, v) \cos\theta_l \,\mathrm{d}l = F_0 \int_{\Omega} \frac{f(l, v)}{F(l, v)} \bigl(1 - (1 - v \cdot h)^5\bigr) \cos\theta_l \,\mathrm{d}l + \int_{\Omega} \frac{f(l, v)}{F(l, v)} (1 - v \cdot h)^5 \cos\theta_l \,\mathrm{d}l    (7.5)


This leaves two inputs (roughness and cos θ_v) and two outputs (a scale and a bias applied to F_0), all of which are conveniently in the range [0, 1]. We precalculate the result of this function and store it in a 2D look-up texture (LUT).
Figure 7.3 is about the inherent mapping relationship between roughness, cos
θ , and the reflective intensity of the environmental BRDF mirror, which can be
precomputed offline.
Specifically, the lookup is:

\frac{1}{N} \sum_{k=1}^{N} \frac{f(l_k, v) \cos\theta_{l_k}}{p(l_k, v)} = \mathrm{LUT}.r \cdot F_0 + \mathrm{LUT}.g    (7.6)

Fig. 7.3 Red-green map: inputs roughness and cos θ, outputs intensity of specular reflection of the ambient BRDF

That is, UE4 factors F_0 out of the Fresnel formula to form F_0 * scale + offset, stores scale and offset in a 2D LUT, and looks them up by roughness and NdotV.
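A minimal sketch of how the two precomputed terms are combined at run time (Eqs. (7.3)–(7.6)) follows; the texture-sampling functions are placeholders standing in for engine texture fetches, not actual UE4 API calls.

```python
# Sketch: combine the pre-filtered cubemap (term 1) with the environment-BRDF
# LUT (term 2) for the IBL specular contribution. Sampling functions are dummies.
import numpy as np

def sample_prefiltered_cubemap(reflection_dir, roughness):
    # placeholder: the engine would sample a mip level chosen by roughness
    return np.array([1.0, 0.9, 0.8])  # dummy RGB radiance, Eq. (7.4)

def sample_env_brdf_lut(n_dot_v, roughness):
    # placeholder: the engine would fetch (scale, bias) from the red/green LUT (Fig. 7.3)
    return 0.95, 0.03

def ibl_specular(reflection_dir, n_dot_v, roughness, f0):
    prefiltered = sample_prefiltered_cubemap(reflection_dir, roughness)  # term 1
    scale, bias = sample_env_brdf_lut(n_dot_v, roughness)                # term 2, Eq. (7.6)
    return prefiltered * (f0 * scale + bias)                             # Eq. (7.3)

print(ibl_specular(reflection_dir=None, n_dot_v=0.8, roughness=0.3, f0=0.04))
```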

7.3 Conceive Preparation

7.3.1 Research on the Current Situation of UE4 Used


for Developing This Work

As the most open and advanced real-time 3D creation tool in the world, Unreal
Engine has been widely used in games, architecture, radio and film and television,
automobile and transportation, simulation training, virtual production, man–machine
interface, etc. Over the last decade, U3D has been very popular, with over 2 million games developed on it, but in recent years, UE4 has caught up and surprisingly surpassed it. Other virtual reality development platforms include VRP, CryEngine, ApertusVR, Amazon Sumerian, etc. Compared with them, the excellent picture quality, good lighting and physics effects, and simple and clear visual programming of UE4 make it the preferred development platform for this
project. Many of the instructional videos and documents posted on the UE4 website
are extremely friendly for beginners.

7.3.2 Storyboard

The storyboard for this project is divided into six parts. The first two parts show the player suddenly crossing over to another time after viewing the exhibition. The third, fourth, and fifth parts show the player roaming around the city after traveling back to the Tang Dynasty. When the player lights up the lantern, the lights of the
city also fly. The sixth part shows the player chasing after the Kong Ming lamp and
finally running up the mountain when it gets dark to see the beautiful scene of the
thousand lights rising in the valley (Fig. 7.4).

7.3.3 Design Process

First of all, collect relevant information to clarify the design scheme. The pavilion
scene completes the construction of the ground, architecture, and indoor scenes in
the unreal engine, then adds a TV screen to play videos, and finally optimizes the
relevant materials. The scene construction of ancient prosperous scenes and mountain
night scenes is modeled by 3dsmax. After adjustment, it is imported into the unreal

Fig. 7.4 Storyboard for this project

engine to adjust the architectural layout and add particle effects. Then add lighting
effects and collision bodies, design interactive functions, and finally test and output
(Fig. 7.5).

7.4 The Main Scene Building Part

7.4.1 The Idea of Building a Scene of the Tang Dynasty

The Tang Dynasty scene of this project depicts the impression of the Tang Dynasty
formed by players based on historical experience accumulation and observation of
ancient paintings. This city has prosperous street views and lively markets. The
magnificent architecture is the most dazzling scenery in the city, which reflects the
rich life of the people. Vendors selling a full range of goods are displayed along the
street, and the sacred palace is more mysterious and majestic. The architecture of
this scene strives to restore a fantastic and magnificent prosperous scene. In order
to make the scene closer to the real environment, the grass and trees in the scene
have added a dynamic effect of swinging with the wind, and the rivers in the scene
also present realistic effects. In addition, dynamic flying butterflies, flocks of white
pigeons flying out of the palace, and lovely lambs have been added to the scene.

Fig. 7.5 Design flowchart



7.4.2 Collider

The interior of the scene contains a variety of buildings and other elements, and the elements are placed relatively close together. During scene construction, to ensure that characters do not clip through geometry, it is necessary to set up colliders for the scene one by one. For example, when a building made in 3ds Max is imported in FBX format and placed into UE4, the character can pass through the model. To avoid this, when importing the model, we can double-click the static mesh and choose to add simplified collision. When importing the urban ground built in 3ds Max, the characters likewise clip through it and cannot stand on it, so we use a Landscape as the main ground of the scene so that the characters can stand on the ground.

7.4.3 Light Effect of Dusk

The whole scene simulates the light effect of dusk, and the gorgeous orange sun glow
adds some beauty to the magnificent ancient city. In order to achieve the desired effect,
we set up a dark pink sky box.
Use “Exponential Height Fog” to simulate the fog effect. Adjust “Max Opacity”
and “Start Distance” to make the fog effect “lighter.” Check “Volumetric Fog” here.
The comparison diagram is as follows (Fig. 7.6).
Sunlight is simulated using “Directional Light” to adjust the appropriate color and
intensity. Use “Skylight” to realize the reflection effect of sunlight, and finally add
“Light mass Importance Volume” and “Post Process Volume” to further optimize the
lighting effect.

Fig. 7.6 Lighting effect comparison map



7.4.4 Particle Effects

In order to enhance the visual impact when crossing between scenes, the project uses particle effects such as effects during scene transitions and effects for crossing into the ancient city. The author enabled the Niagara effects system in this project. Niagara builds particle effects on a modular visual-effects stack that is easy to reuse and inherit, combined with a timeline. At the same time, Niagara supports data interfaces for a variety of data in the unreal engine, such as using a data interface to obtain skeletal mesh data.

7.5 Functional Design

7.5.1 Sequence Animation in Scene

Character Animation. The actions of the character in the animation are realized by binding the character skeleton to the animation. The movement route of the character is determined by adding location keyframes; the character's movement up the stairs is realized by opening the Z-axis curve and adding key-point adjustments. The door closing in the animation is realized by adding rotation keyframes. Weight keyframes are added where necessary to realize smooth transitions between character actions.
Camera Animation Implementation. The camera in the animation uses the
“Cine Camera Actor.” Check “Enable Look at Tracking” and select the character
skeleton as the “Actor to Track” to realize the function of aiming the camera at the
character. Change the “Relative Offset” to adjust the desired perspective.
Transition Between Level Sequence. Add key frames to focus settings and aper-
ture in the sequence to achieve the zoom effect. The transition effect of fade in
and fade out is also added to the animation. Add fade tracks to the animation, and
complete the fade in and fade out crossing effect by setting keys.
Create a master sequence and use fade track transition to integrate various level
sequences.

7.5.2 Release Kongming Lanterns

The blueprint class of a Kongming lantern is used as the interactive object, and the Kongming lantern is lit while flying by means of a set-intensity node. A box collision is used as the interaction detection range. After the player enters the range, a HUD prompt to release the lantern is triggered. When the Kongming lantern is released, its world location is read every

Fig. 7.7 Release Kongming lantern blueprint

Fig. 7.8 Game player release lanterns

tick and its Z value is incremented automatically to achieve the effect of the Kongming lantern flying. After the character releases the Kongming lantern, the cutscene animation is played and the function that makes a large number of Kongming lanterns take off is triggered (Figs. 7.7 and 7.8).
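As a plain-language stand-in for the blueprint logic in Fig. 7.7, the following pseudocode (written in Python purely for illustration) mirrors the per-tick rise of a released lantern; the class, attribute names, and numeric values are assumptions, not Unreal API calls.

```python
# Pseudocode sketch of the "get world location -> add to Z -> set world location"
# tick logic described above; not an Unreal Engine API.
class KongmingLantern:
    def __init__(self, location, rise_speed=30.0, light_intensity=5000.0):
        self.location = list(location)          # [x, y, z] world position
        self.light_intensity = light_intensity  # lantern is "lit" while flying
        self.rise_speed = rise_speed            # assumed units per second

    def tick(self, delta_seconds):
        # every tick, increment Z so the lantern rises
        self.location[2] += self.rise_speed * delta_seconds

lantern = KongmingLantern(location=(0.0, 0.0, 120.0))
for _ in range(60):          # one second at 60 fps
    lantern.tick(1.0 / 60.0)
print(lantern.location)      # Z has risen by ~30 units
```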

7.5.3 Large Amounts of Kongming Lanterns Take off

In this function, in order to pursue a better visual effect, the Kongming lanterns are made larger and are destroyed after 30 s.

Fig. 7.9 Project framework diagram

Fig. 7.10 Museum scene display screenshots

7.6 Artwork Display and Rendering Resource Analysis

7.6.1 Project Framework Diagram and Interactive Display


Screenshots

The following Fig. 7.9 shows the project framework of the project (Figs. 7.9, 7.10,
7.11 and 7.12).

7.6.2 Project Performance Analysis

Performance is an omnipresent topic in making real-time games. In order to create


the illusion of moving images, we need a frame rate of at least 15 frames per second.

Fig. 7.11 Ancient city display screenshots

Fig. 7.12 Screenshots of lanterns lift off and mountain night scene display

Depending on the platform and game, 30, 60, or even more frames per second may
be the target (Fig. 7.13).
For large projects in particular, it is very important to know the usage of rendering resources. As can be seen from the above figure, the water body, as a non-key roaming part, occupies too many rendering resources and too much memory, and the
number of polys of palaces and trees is large. The number of polys of palace windows
and branches and trunks should be optimized, while the number of polys of buildings
on the street is small and the texture realism is insufficient. In the process of building

Fig. 7.13 Shader complexity and Quads



the main scene, the effect of the river did not meet the expectations. Later, the author
will try to restore the light effect and ripple effect of the river, so as to achieve better
results with a more resource-saving scheme.

7.6.3 Project Innovation

The player explores the palace, lights the palace lantern, and triggers the thousand lights to fly, and then runs after the lights. The process of running to the appointment of the thousand lanterns is omitted in the project: the player is transported to the mountain with its night view, which means that the player has chased the Kongming lanterns all the way, and
exemplify Unreal Engine’s power and detail in achieving dreamy scenes. In order to
achieve better 3D effects, the Niagara Particle Effects plug-in was enabled to produce
a lot of realistic particle effects. By using the timeline in Niagara, we visually control
the rhythm of particle effects and set keyframes for them to make them better. This
detail is modeled after the film and television editing industry, making the unreal
world more detailed and realistic. When simulating the dusk lighting effect in the
Tang Cheng scene part, this project combines the directional light source and the
sky light source that are extremely close to the sunlight effect to simulate the real
sunlight effect and its reflection effect. Coupled with the LightMass Importance
Volume, which controls the accuracy of the precomputation, more indirect lighting
cache points are generated inside the space, and well-controlled precomputed lighting
can make the scene beautiful while reducing the cost of dynamic lighting.

7.7 Conclusion

Relying on the unreal engine, the project realizes the matching of model and envi-
ronment light and shadow in the interactive experience, as well as the harmonious
unity of architecture, plants, terrain, fog effects, light effects, and particle effects through the construction, integration, and optimization of the ancient city scene.
Through various interactive functions, it also greatly enhances the image appeal in
the roaming process and creates a real-time and high quality 3D fantasy world. As
a more dynamic and intuitive form of expression, virtual reality has its unparalleled
advantages. It combines unreal engine visual programming and sequence, Niagara
particle effects, and other technologies to make it not only achieve better visual
effects, but also have a more immersive feeling.

Acknowledgements This work was supported by grant from: “Undergraduate teaching Reform
and Innovation” Project of Beijing Higher Education (GM 2109022005); Beijing College Students’
innovation and entrepreneurship training program (Item No: 22150222040, 22150222044). Key
project of Ideological and Political course Teaching reform of Beijing Institute of Graphics
Communication (Item No: 22150222063). Scientific research plan of Beijing Municipal Education
Commission (Item No: 20190222014).
Chapter 8
Auxiliary Figure Presentation Associated
with Sweating on a Viewer’s Hand
in Order to Reduce VR Sickness

Masaki Omata and Mizuki Suzuki

Abstract We propose a system that superimposes auxiliary figures on a VR scene according to the viewer's physiological signals, which respond to the viewer's VR sickness, in order to reduce VR sickness without interfering with the viewer's sense of immersion. We conducted an experiment to find a type of physiological signal that correlates strongly with VR sickness. The results showed that sweating on the viewer's hand was strongly correlated, and that the amount of sweating tended to increase as the VR sickness worsened. We then designed and developed the proposed system, which controls the degree of alpha blending of the color of dots used as an auxiliary figure to reduce VR sickness according to the amount of sweating, and conducted an experiment to evaluate it. The results showed that the system's effect in reducing VR sickness was effective for participants with less VR experience, although there was no statistically significant difference among the experimental conditions. Additionally, the results showed that the controlled presentation of the system was less distracting to immersion in a VR scene than the constant presentation of a previous system.

8.1 Introduction

One of the factors stalling the spread of virtual reality (VR) content is VR sickness, which refers to a deterioration of physical condition caused by viewing VR content. The
symptoms of VR sickness are similar to those of general motion sickness, such as
vomiting, cold or numb limbs, sweating, and headache [1, 2]. When the symptom
appears, the user cannot enjoy a VR content and may stop viewing or avoid viewing

M. Omata (B)
Graduate Faculty of Interdisciplinary Research Faculty of Engineering, University of Yamanashi,
Kofu, Japan
e-mail: [email protected]
M. Suzuki
Department of Computer Science and Engineering, University of Yamanashi, Kofu, Japan
e-mail: [email protected]


in the first place. This may hinder the future development of the VR market and field,
and it is essential to elucidate the cause of VR sickness and take preventive measures.
VR sickness is sometimes called visually induced motion sickness (VIMS) and
is considered to be one of motion sicknesses. Motion sickness is a deterioration of
physical condition caused by staying in a moving environment such as in a car or
on a ship for a long time. Although the cause of motion sickness is not completely
explained, the sensory conflict theory is the most popular theory. The theory states that
when the pattern of empirical vestibular, visual, and somatosensory information is
incompatible with the pattern of sensory information in the actual motor environment,
motion sickness occurs during the adaptation process to the situation [3]. The same
conflict is thought to occur in VR sickness. In other words, in VR sickness, the visual
system perceives motion, while the vestibular system does not.
Broadly speaking, two methods of reducing VR sickness have been studied: One is
to provide a user with actual motion sensation from outside the virtual world, and the
other is to provide some effect on a user’s field-of-view in the virtual environment.
Examples of the former are a method that applies wind to the user while viewing VR images [4] and a method that provides a pseudo-motion sensation by applying electricity to the vestibular system [5, 6]. However, these methods have
disadvantages such as needs for large-scale equipment and high cost. On the other
hand, examples of the latter are a method that displays gazing points on VR images [7] and a method that switches from the first-person view to the third-person view in situations where the user is prone to sickness. These methods have an
advantage that they can be solved within an HMD and are less costly than the former
methods because they only require processing of the images. The latter method is
more realistic in terms of the spread of VR contents. However, there are concerns
that superimposed images may not match a world view of a VR environment, or that
superimposed images may distract a user and make it difficult to concentrate on a
VR game, thereby diminishing the sense of immersion.
Therefore, we propose a system that is one of the latter methods, but instead
of constantly displaying the superimposed figures, it keeps detecting signs of VR
sickness from physiological signals and controls the display of the superimposed
figures in real time according to the detection results. We aim to reduce VR sickness
without lowering the sense of immersion.

8.2 Related Work

As a method of changing a user's field-of-view to reduce motion sickness, Bos et al. hypothesized that appropriate visual information on self-motion was beneficial in a naval setting and conducted an experiment in a ship's bridge motion simulator with three visual conditions: an Earth-fixed outside view, an inside view that moved with the subjects, and a blindfolded condition [8]. As a result, sickness was highest in the inside viewing condition, and no effect of sickness on task performance was
observed. Norouzi et al. investigated use of vignetting to reduce VR sickness when

using amplified head rotations instead of controller-based input and whether the
induced VR sickness is a result of the user’s head acceleration or velocity by intro-
ducing two different modes of vignetting, one triggered by acceleration and the other
by velocity [9]. The results generally indicate that the vignetting methods
did not succeed in reducing VR sickness for most of the participants and, instead,
led to a significant increase. Duh et al. suggested that an independent visual back-
ground (IVB) might reduce balance disturbance when visual and inertial cues conflict [10]. They
examined three levels of independent visual background with two levels of roll oscillation
frequency. As a result, there were statistically significant effects of IVB and a
significant interaction between IVB and frequency. Sargunam et al. compared three
common joystick rotation techniques: traditional continuous rotation, continuous
rotation with reduced field-of-view, and discrete rotation with fixed intervals for
turning [11]. Their goal was to investigate whether there are tradeoffs among different
joystick rotation techniques in terms of sickness and preference in a 3D environ-
ment. The results showed no evidence of differences in orientation, but sickness
ratings found discrete rotations to be significantly better than field-of-view reduc-
tion. Fernandes et al. explored the effect of dynamically, yet subtly, changing a
physically stationary person’s field-of-view in response to visually perceived motion
in a virtual environment [12]. They were able to reduce the degree of VR sickness
perceived by participants without decreasing their subjective level of presence, while
minimizing their awareness of the intervention. Budhiraja et al. proposed rotation
blurring, uniformly blurring the screen during rotational movements to reduce cyber-
sickness caused by character movements in a First Person Shooter game in virtual
environment [13]. The results showed that the blurring technique led to an overall
reduction in sickness levels of the participants and delayed its onset.
On the other hand, as a method of adding a figure to a user's field-of-view, Whittinghill
et al. placed a three-dimensional model of a virtual human nose in the center of the
field of view of the display in order to test whether a fixed visual reference
object within the user's field-of-view reduces simulator sickness
[14]. As a result, users in the nose experimental group were able, on average, to
operate the VR applications longer and with fewer instances of stop requests than
users in the no-nose control group. However, in the roller coaster game with
intense movements, the average play time was only about 2 s longer.
Cao et al. designed a see-through metal net surrounding users above and below as a rest frame
to reduce motion sickness in an HMD [15]. They showed that subjects
felt more comfortable and tolerated the experience better when the net was included than when there was
no rest frame. Buhler et al. proposed and evaluated two novel visual effects that can
reduce VR sickness with head-mounted displays [16]. The circle effect is that the
peripheral vision shows the point of view of a different camera. The border between
the outer peripheral vision and the inner vision is visible as a circle. The dot effect
adds artificial motion in peripheral vision that counteracts a virtual motion. The
results showed lower means of sickness in the two effects; however, the difference
is not statistically significant across all users.
In many studies, the entire view is changed, or figures are conspicuously superim-
posed. Some superimposed figures imitate the user's nose and are
not so obvious, but they are not effective in some situations or can only be used for a
first-person view. Therefore, Omata et al. designed a more discreet static figure
in virtual space and a scene-independent figure connecting the virtual world and
the real world [7]. The results show that the VR sickness tended to reduce by the
superimposition of Dots on the four corners of the field-of-view. At the same time,
however, they also showed that the superimposition of auxiliary figures reduced the
sense of immersion.
Based on the results of Omata et al.’s study, we investigated a method to reduce VR
sickness without unnecessarily lowering the sense of immersion by displaying the
Dots only when a symptom of the sickness appears. In addition, since hand sweating
was used as a physiological index of the degree of VR sickness in the study
by Omata et al., and nasal surface temperature was used as such an index
in the study by Ishihara et al. [17], we have proposed using these
physiological signals as indexes to quantify the degree of VR sickness. Additionally,
we have identified an appropriate type of physiological signal for this purpose and have
proposed using it as a parameter to emphasize the presentation
of an auxiliary figure when a user feels VR sickness.

8.3 Experiment to Select Physiological Signal that Correlates with VR Sickness

In this experiment, nasal surface temperature, hand blood volume pulse (BVP), and
hand sweating of experimental participants watching a VR scene were measured and
analyzed in order to find a type of physiological signals that were strongly corre-
lated with their VR sickness. A magnitude estimation method was used to measure
the degree of psychological VR sickness of the participants [18]. The participants were
instructed to treat the discomfort of their VR sickness in their normal
condition, wearing the HMD with no images presented, as 100, and
they were asked to verbally report their degree of discomfort relative to that 100 at 20 s
intervals while viewing a VR scene. As an experimental task to encourage head
movement, participants were asked to look for animals hidden in the VR scene.

8.3.1 Physiological Signals

Based on the physiological and psychological findings, we selected nasal surface


temperature, BVP, and sweating as candidates for the physiological responses to
VR sickness that could be expected. ProComp INFINITI system from Thought
Technology [19] was used as the encoder for the sensors.
The surface temperature sensor was a Temperature-Flex/Pro from Thought Tech-
nology and was attached to a tip of participant’s nose, as shown in Fig. 8.1. In

Fig. 8.1 Temperature sensor attached to a participant's nose

Fig. 8.2 Blood volume pulse (BVP) sensor attached on a participant's fingertip

this study, the Celsius temperature is used. The sensor can measure skin surface
temperature between 10 and 45 °C.
The BVP sensor was a BVP-Flex/Pro from Thought Technology and was attached
on index finger of participant’s dominant hand as shown in Fig. 8.2. The sensor
bounces infra-red light against a skin surface and measures the amount of reflected
light in order to measure heart rate and BVP amplitude. In this study, the BVP value
is a value averaged for each inter-beat interval (IBI).
The sweating sensor was an SC-Flex/Pro from Thought Technology, a skin
conductance (SC) sensor that measures conductance across the skin between two electrodes
on the fingers, and it was attached to the index and ring fingers of the participant's non-dominant
hand as shown in Fig. 8.3. The inverse of the electrical resistance between the fingers
is the skin conductance value.

8.3.2 Experiment Environment and Task

Virtual Scene. A three-dimensional amusement park with a Ferris wheel and a roller
coaster was presented as a VR scene. We used assets available on the Unity Asset
Store [20] in order to create the amusement park.

Fig. 8.3 Skin conductance (SC) sensor attached to a participant's hand

"Terrain Toolkit 2017" was used
to generate the land; “Animated Steel Coaster Plus” was used for the roller coaster,
and “Ferris Wheel” was used for the Ferris wheel. To generate and present the scene,
we used a computer (Intel Core i5-8400 2.80 GHz CPU, GeForce GTX 1050
Ti), Unity, an HMD (Acer AH101), and inner-ear earphones (Apple, MD827FE).
Task. All movement of the avatar in the virtual space was automatic, but an angle
of the view was changed according to an orientation of the participant’s head. The
first scene was a 45 s walk through the forest, followed by a 75 s ride on the Ferris
wheel, a 30 s walk through the grassland again, and finally a 150 s ride on the roller
coaster, for a total of 300 s.
The participants wore the HMD to view the movement in the scene and looked
for animals in the view linked to their head movements (however, there were no
animals in the scene). This kind of scene and task creates a situation that is more
likely to induce VR sickness. The expected degree of VR sickness in the VR scene
likely to induce VR sickness. The expected degree of VR sickness in the VR scene
was small for walking in the forest, medium for riding the Ferris wheel, and large
for riding the roller coaster.
Procedure. The participants were asked to refrain from drinking alcohol the day
before the experiment, in order to avoid sickness caused by factors other than the
task. Informed consent was obtained from the participants prior to the
experiment.
The participants were asked to spend 10 min before performing the task to famil-
iarize themselves with the room temperature of 21 °C in our laboratory, and then,
they were asked to wear the HMD, the skin temperature sensor, the BVP sensor,
and the SC sensor. Figure 8.4 shows a participant wearing the devices. Then, after
performing the experimental task, they were asked to answer a questionnaire about
their VR experience. After the experiment, the participants were asked to stay in the
laboratory until they recovered from VR sickness.
The participants were ten undergraduate or graduate students (six males and four
females) between the ages of 21 and 25 with no visual or vestibular sensory problems.

Fig. 8.4 Experimental apparatus and a participant during an experimental task

8.3.3 Result

Figures 8.5, 8.6, and 8.7 show the relationship between each of the three types
of physiological signals and the participants’ verbal responses of discomfort. The
measured physiological values are converted to normal logarithms in order to make
it easier to check the correlation between physical measurements and human psycho-
logical responses. The line segments on the graph represent exponential trend lines,
and each color represents an individual participant from A to J.
Table 8.1 shows the mean and standard deviation of the coefficients of determi-
nation in the correlation analysis between each type of physiological signals and
discomfort for each of the participants.
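As a rough sketch of this kind of analysis (the values below are hypothetical samples for a single participant, not the measured data summarized in Table 8.1), the coefficient of determination between a log-transformed physiological signal and the discomfort ratings can be computed as follows:

```python
import numpy as np

# Hypothetical 20-s samples for one participant: hand skin conductance (uS)
# and verbally reported discomfort (magnitude estimation, baseline = 100).
sweating   = np.array([2.0, 2.1, 2.3, 2.8, 3.4, 3.9, 4.7, 5.2])
discomfort = np.array([100, 104, 115, 133, 152, 170, 188, 205])

# Coefficient of determination (R^2) of a least-squares line between the
# log-transformed signal and the discomfort ratings.
r = np.corrcoef(np.log(sweating), discomfort)[0, 1]
print(f"R^2 = {r ** 2:.3f}")
```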

Fig. 8.5 Scatter plots and regression lines between nasal surface temperature and discomfort

Fig. 8.6 Scatter plots and regression lines between BVP and discomfort

Fig. 8.7 Scatter plots and regression lines between sweating and discomfort

Table 8.1 Coefficient of determination in correlation between type of physiological signal and discomfort

Type of physiological signal    Mean    S.D.
Nasal surface temperature       0.606   0.207
BVP                             0.338   0.225
Sweating                        0.723   0.148

8.3.4 Analyses and Discussions

Determination Coefficient. From Fig. 8.7 and Table 8.1, we found that there was a
strong correlation between hand sweating and discomfort due to VR sickness. From
the figure, it can be seen that for all participants, sweating increases as discomfort
increases, and the rate of increase follows the same trend across
participants.
Additional Confirmatory Experiment. The determination coefficient of the nasal
surface temperature also shows a rather strong correlation, and the graph shows that
the temperatures tend to increase as the discomfort increases. However, this tendency
is different from the result of Ishihara et al. that the temperature decreases with the

increase of discomfort [17]. Therefore, we conducted an additional experiment to


measure nasal surface temperature of two participants during a 5-min period when
they were wearing the HMD, and no images were shown.
As a result, the temperatures of both participants increased. The reason for this
could be that the heat was trapped between the HMD and their noses. Therefore, even
if there is a relationship between the nasal surface temperature and VR sickness, we
think it is difficult to actually use the temperature as an indicator of VR sickness for
this reason.
Time Difference between VR Sickness and Sweating Response. Since there was
a strong correlation between VR sickness and sweating, we analyzed the time differ-
ence between the onset of VR sickness and the sweating response. Figure 8.8 shows
the time series variation of the subjective evaluation of discomfort for each partici-
pant from A to J, and Fig. 8.9 shows the time series variation of sweating for each
participant from A to J. In Fig. 8.8, since the responses are based on the ME method,
some participants answered with a large difference. Therefore, we investigated the
time difference between the time series variation of the subjective discomfort and
the time series variation of the sweating of the two participants who answered with
the large differences, and found that there was almost no time difference within the
range of the sampling rate of every 20 s in the experimental tasks.
Limit of Discomfort. We asked, “Did you feel so uncomfortable that you wanted to
stop watching the VR video? If so, what was the degree of discomfort?” in the post-
task questionnaire. As a result, five of the participants answered
that they had been in that situation, and the average discomfort level of those five participants
at that time was 196 ± 26.5. Based on the result of the previous section, which
showed that there was no large time difference between the onset of discomfort and
the sweating response, the discomfort degree of the five participants 20 s prior to that
time was 162 ± 11.7. Moreover, according to Stevens’ power law [21], the mean of
the power indexes of the five participants is 1.13 ± 0.15.
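As an illustrative sketch only (the exact fitting procedure is not given here, and the values below are hypothetical, constructed to be roughly consistent with the exponent reported above), a power index of this kind can be estimated for one participant by a least-squares fit in log-log space:

```python
import numpy as np

# Hypothetical sweating (uS) and discomfort values for one participant.
sweating   = np.array([2.0, 2.2, 2.6, 3.1, 3.8, 4.6, 5.3])
discomfort = np.array([100, 111, 135, 164, 207, 256, 301])

# If discomfort is proportional to sweating**n (Stevens' power law), the slope
# of log(discomfort) against log(sweating) estimates the power index n.
n, log_k = np.polyfit(np.log(sweating), np.log(discomfort), 1)
print(f"estimated power index n = {n:.2f}")
```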

Fig. 8.8 Time series changes of discomfort



Fig. 8.9 Time series changes of sweating

8.4 Design of Auxiliary Figure Control System

We found the relationship between the amount of sweating and the degree of VR
sickness, as well as its limit and power index, from the experiment in the previous
section. Based on the results, this section explains the design of a system that controls
the degree of alpha blending of auxiliary figures that reduce VR sickness according to
the amount of sweating on the hand of a VR viewer.
Based on an assumption that Stevens’ law of Eq. (8.1) holds between a psycholog-
ical measure of discomfort of VR sickness and a physical measure of sweating, we
constructed an equation to derive a percentage of alpha value of the alpha blending.

R = kS^n (8.1)

where R is the physical quantity, S is the psychological quantity, n is the power index, and k is a
coefficient. The power index n is 1.13, which was obtained in the previous section.
The equation was derived so that the alpha value would be 0% when the discomfort
value was 100, which was the standard value of the experiment in the previous section,
and 100% when the discomfort value was 162, which was slightly lower than the limit
value in the previous section. By deriving this equation, the more the amount of
sweating increases as the discomfort value increases, the larger the alpha percentage
of the superimposed auxiliary figure becomes, and the more clearly the auxiliary
figure becomes visible. The derived equation is Eq. (8.2).
  
α = 161.3((x/z)^1.13 − 1) (8.2)

where α is the alpha percentage for alpha blending, z is the normal sweating volume
(µS) of a viewer, x is the real-time sweating volume (µS) at each sample time when
the viewer watches a VR scene, and the power index is 1.13.
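As a minimal sketch (not the authors' actual implementation), Eq. (8.2) can be evaluated and clamped to the usable 0-100% range as follows; consistent with the derivation above, the alpha is 0% at the baseline sweating level and reaches 100% around the sweating level corresponding to a discomfort value of 162.

```python
def alpha_percentage(x, z, n=1.13):
    """Eq. (8.2): map real-time sweating x (uS) to an alpha percentage for the
    auxiliary figure, given the viewer's normal (baseline) sweating volume z (uS)."""
    alpha = 161.3 * ((x / z) ** n - 1.0)
    return max(0.0, min(100.0, alpha))  # clamp to the displayable 0-100% range

# At baseline sweating (x == z) the Dots are invisible; as sweating rises,
# the figure fades in and saturates at full visibility.
print(alpha_percentage(2.0, 2.0))   # 0.0
print(alpha_percentage(3.1, 2.0))   # 100.0 after clamping (fully visible)
```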

Fig. 8.10 Dots auxiliary figure being alpha blended onto a VR scene

As a previous study, Omata et al. evaluated four types of auxiliary figures (Gazing
point, Dots, User’s horizontal line, and Real-world’s horizontal line) that aimed to
reduce VR sickness, and among them, the Dots was the one that reduced VR sickness
the most [7]. Therefore, we adopt the Dots design as the auxiliary figure in this
research. Dots are composed of four dots, as shown in Fig. 8.10. The four dots are
superimposed on the four corners of the view on a screen of an HMD. The dots do
not interfere with content viewing, and it is thought that the decline in a sense of
immersion can be reduced. In this study, we made the Dots design a little larger than
that of Omata et al. and changed the color from white to peach. The reason for the
larger size is to make it easier to perceive the dots as foreground than in the system of
Omata et al. The reason for changing the color is to avoid blending in with the white
roads and clouds in the VR scene of our experiment.
The overall flow of the auxiliary figure presentation system is as follows: First,
the skin conductance value of ProComp INFINITI, which is a biological amplifier,
is continuously measured, and then, the value is continuously imported into Unity,
which presents the VR scene, and is reflected in the alpha percentage of the color of the Dots on
the HMD. The normal sweating value for each viewer is the average of the viewer's
skin conductance acquired for 3 s after the system starts. The alpha value is updated
once every two seconds so that the figure does not flicker during
drawing.
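A rough sketch of this flow is shown below; read_skin_conductance and set_dots_alpha are hypothetical callbacks standing in for the ProComp INFINITI input and the Unity-side material update, and alpha_percentage is the Eq. (8.2) helper sketched earlier.

```python
import time

def baseline_sweating(read_skin_conductance, duration_s=3.0, interval_s=0.1):
    # Average the skin conductance over the first few seconds after start-up.
    samples = []
    end = time.time() + duration_s
    while time.time() < end:
        samples.append(read_skin_conductance())
        time.sleep(interval_s)
    return sum(samples) / len(samples)

def run_dots_controller(read_skin_conductance, set_dots_alpha, period_s=2.0):
    # Update the Dots alpha only once every two seconds to avoid visible flicker.
    z = baseline_sweating(read_skin_conductance)
    while True:
        x = read_skin_conductance()              # real-time sweating (uS)
        set_dots_alpha(alpha_percentage(x, z))   # Eq. (8.2), clamped to 0-100%
        time.sleep(period_s)
```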

8.5 Evaluation Experiment

We hypothesized that controlling the auxiliary figure according to sweating on a


hand would reduce VR sickness without losing a sense of immersion. Therefore,
we conducted an evaluation experiment to validate differences between the three
conditions described in the previous section: the condition where the alpha value
of the auxiliary figure is controlled according to a volume of sweating (hereinafter

called “controlled presentation”), the condition where the auxiliary figure is always
displayed without the alpha blending (hereinafter called “constant presentation”),
and the condition where the auxiliary figure is not displayed (hereinafter called
“no auxiliary figure”). The specific hypothesis was that the control presentation
was significantly less likely to cause VR sickness than the no auxiliary figure and
had the same sickness reduction effect as the constant presentation, and the control
presentation was less distracting to gain a sense of immersion than the constant
presentation and gave the same sense of immersion as the no auxiliary figure.
The general flow of this experiment was the same as in Sect. 8.3, but instead of
oral responses, the participants of this experiment were asked to answer Simulator
Sickness Questionnaire (SSQ) [22] before and after viewing a VR scene, and Game
Experience Questionnaire (GEQ) [23] at the end of the VR scene. The SSQ is an
index of VR sickness that can derive three sub-scores (Oculomotor-related sub-score,
nausea-related sub-score, and disorientation-related sub-score) and the total score by
rating 16 symptoms that are considered to be caused by VR sickness on a four-point
scale from 1 to 4. Since each sub-score is calculated in a different way, it is difficult
to compare them and to understand how large each sub-score is. Therefore, in this
experiment, each sub-score is expressed as a percentage of its maximum
value. In addition, to measure the worsening of VR sickness before and after an
experimental task, the SSQ was rated on a 4-point scale from 0 to 3 before the
task. The GEQ is a questionnaire to measure the user’s experience after gameplay.
Originally, 14 factors can be derived, but in this experiment, we focused on positive
and negative affect, tension, and immersion, and asked the participants to rate the 21
questions on a 5-point scale from 1 to 5.

8.5.1 Procedure and Task

The experimental environment was the same as in Sect. 8.3. The task was slightly
different. In Sect. 8.3, the participants were asked to search for animals in the space
where they were not actually placed. In this experiment, however, the participants
were asked to search for animals in a space where they were actually placed so that
they would not get bored.
The participants were asked to refrain from drinking alcohol the day before to
avoid sickness caused by factors other than VR. The room temperature was adjusted
to a constant 21 °C using an air conditioner. Informed consent was obtained from all
participants before the experiment. In order to counterbalance the order of the three
conditions, the participants were divided into three groups and asked to leave at least
one day between consecutive conditions to reduce habituation effects.
At the beginning of watching the VR scene in each of the three conditions, a
participant answered the SSQ to check his or her physical condition before the
start. After that, the participant wore the HMD and skin conductance sensor and
then watched the VR scene for 300 s like the task in Sect. 8.3. After watching it,
the participant answered the SSQ, GEQ, and questions about the condition. This

procedure was carried out as a within-subjects design, with three conditions for each
participant, with at least one day in between.
The participants were twelve undergraduate or graduate students (six males and
six females) between the ages of 21 and 25 with no visual or vestibular sensory
problems.

8.5.2 Results

SSQ. Figure 8.11 shows the results of the SSQ of the differences among the three
conditions on VR sickness. Each score shown here is the difference between each
participant’s post-task SSQ score minus his or her pre-task SSQ score. The error
bars indicate the standard errors. Therefore, the higher the value is, the worse the
degree of VR sickness became due to the task. The scores of all evaluation items
decreased in the control and constant presentation conditions compared to the no
auxiliary figure condition, but the results of the Wilcoxon signed-rank test (5%
level of significance) showed no significant difference among them.
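For reference, this kind of paired comparison can be run with a standard statistics library; the difference scores below are hypothetical, not the values plotted in Fig. 8.11:

```python
from scipy import stats

# Hypothetical post-minus-pre SSQ total scores for the same twelve participants
# under the no-auxiliary-figure and controlled-presentation conditions.
no_figure  = [22.4, 18.7, 30.0, 11.2, 26.2, 14.9, 37.4, 18.7, 22.4, 7.5, 29.9, 15.0]
controlled = [14.9, 11.2, 22.4, 11.2, 18.7, 7.5, 29.9, 14.9, 18.7, 7.5, 22.4, 11.2]

# Paired, non-parametric comparison at the 5% significance level.
stat, p = stats.wilcoxon(no_figure, controlled)
print(f"W = {stat:.1f}, p = {p:.3f}")
```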
In order to analyze a difference in the number of times the participants had expe-
rienced VR in the past, we divided the participants into two groups: one with more
than five VR experiences (four participants) and the other with less than five VR
experiences (eight participants). Figure 8.12 shows the SSQ scores of the group with
less experience, and Fig. 8.13 shows the SSQ scores of the group with more experi-
ence. As a result, in the group with less experience, although there was no significant
difference in the Wilcoxon’s-signed rank sum test (5% level of significance), it was
found that the control condition and the constant condition reduced VR sickness
compared to the no auxiliary figure condition. On the other hand, no such trend was
observed in the group with more VR experience.

Fig. 8.11 Results of SSQ for all participants



Fig. 8.12 Results of SSQ of the group with less experience

Fig. 8.13 Results of SSQ of the group with more experience

GEQ. Figure 8.14 shows the results of the GEQ of differences of the three conditions.
The larger the value, the more intense the experience was. The error bars also indicate
the standard errors. Since the scores of tension were low regardless of the condition,
it is considered that sweating due to tension did not occur. The positivity, negativity,
and immersion items showed little difference among the conditions. The results of
the Wilcoxon signed-rank test (5% level of significance) showed no significant
differences among the conditions in each item.
Impressions of Auxiliary Figure. As a result of the question “What did you think
about the display of Dots?” regarding the auxiliary figure in the control and constant
presentation conditions, 3 out of 12 participants answered “I didn’t mind” in the
constant condition, while 7 out of 12 participants answered “I didn’t mind” in the
control condition.

Fig. 8.14 Results of GEQ for all participants

8.5.3 Discussions

VR Sickness. We divided the groups according to the number of times the partici-
pants had experienced VR and analyzed the results of the responses to the SSQ for
each group. The results showed that there was no statistically significant difference
between the conditions for the group with less VR experience. However, from the
graph in Fig. 8.11, we infer that the SSQ scores were lower with the auxiliary figure
than without it. Therefore, we assumed that the difference in SSQ scores depended on
the number of VR experiences and summarized the relationship between the number
of VR experiences and SSQ scores in a scatter diagram (Fig. 8.15). From Fig. 8.15,
we found that the difference in SSQ scores decreased with the number of VR expe-
riences. In other words, this suggests that the auxiliary figure has a negative impact
on users with a large number of experiences, and therefore, it is appropriate to present
the auxiliary figure to users with little VR experience.

Fig. 8.15 Correlation between the number of VR experiences and the difference of SSQ scores

From another point of view, regarding the issue of the difficulty in reducing VR
sickness in a scene with intense motion in “Virtual nose” by Whittinghill et al. [14],
we infer that our proposed system was able to reduce VR sickness because the
increase in sweating was reduced even in the scene with intense motion in the latter
half of our experimental VR scene.
Sense of Immersion. The results of the GEQ were not able to provide an overall trend
on the participants’ senses of immersion because the scores varied widely and were
not statistically significant. This suggests that the superimposition of the auxiliary
figure does not have a significant negative impact on the sense of immersion. On
the other hand, it also indicates that there is no significant difference in the sense of
immersion between the proposed control presentation and the constant presentation.
Therefore, in the future, we consider it necessary to use or create more specific and
detailed indices to evaluate a sense of immersion.
Most of the participants answered that they were not bothered by the control
presentation of the auxiliary figure. The reason for this is that the auxiliary figure
gradually appeared by alpha blending in the control presentation condition, and thus,
they were blended into the VR image without being strongly gazed at by the viewer.

8.6 Conclusions

In this paper, we proposed a system that superimposes auxiliary figures on a VR


scene according to viewer’s physiological signals that are responses to the viewer’s
VR sickness, in order to reduce VR sickness but not interfere with viewer’s sense of
immersion. For the purpose, we conducted a physiological signal survey experiment
and an evaluation experiment of the proposed system.
In the first experiment to find the type of physiological signals that correlated
strongly with VR sickness, we found that sweating was strongly correlated among
nasal surface temperature, fingertip BVP, and hand sweating, and that the amount of
sweating tended to increase as the degree of VR sickness became stronger. Based on
this result, we developed an auxiliary figure-presentation-control system that controls
a degree of saliency of the auxiliary figure that reduces VR sickness by varying an
alpha value of alpha blending according to amount of sweating of a viewer’s hand
while viewing a VR scene.
As a result of the evaluation experiment on the system, we found that the
controlled presentation had the effect of reducing VR sickness, although there was no
significant difference in SSQ between the controlled presentation and the constant
presentation. The effect of the controlled presentation in reducing VR sickness was found
to be effective for the participants with less VR experience. In addition, through the
interviews with all the participants, it was found that the controlled presentation was
less distracting than the constant presentation.
Since it was stated in the previous research that there are individual differences in
the degree and tendency of reduction of VR sickness, in the future of our research,

we plan to analyze the effect of the proposed system in detail from the viewpoint
of differences between men and women, differences in SSQ scores at the beginning
of the task, or differences in eye movements during task execution, other than the
differences in VR experience shown in this paper. Then, based on the results of such
detailed analysis, in the future, we plan to develop a learning HMD that switches
a method to reduce VR sickness and its parameters based on the user’s time of
VR experience, time spent experiencing the same content, and variation of various
physiological reactions resulting from VR sickness in real time.

References

1. Jerald, J.: The VR book: human-centered design for virtual reality. In: Association for
Computing Machinery and Morgan & Claypool (2015)
2. Brainard, A., Gresham, C.: Prevention and treatment of motion sickness. Am. Fam. Phys. 90(1),
41–46 (2014)
3. Kariya, A., Wada, T., Tsukamoto, K.: Study on VR sickness by virtual reality snowboard.
Trans. Virtual Reality Soc. Japan 11(2), 331–338 (2006)
4. Hashilus Co, Ltd.: Business description. https://hashilus.com/business/. Last accessed
2022/01/20
5. Aoyama, K., Higuchi, D., Sakurai, K., Maeda, T., Ando, H.: GVS RIDE: providing a novel
experience using a head mounted display and four-pole galvanic vestibular stimulation. In ACM
SIGGRAPH 2017 Emerging Technologies (SIGGRAPH’17), Article 9, pp. 1–2. Association
for Computing Machinery, New York, NY, USA (2017)
6. Sra, M., Jain, A., Maes, P.: Adding proprioceptive feedback to virtual reality experiences using
galvanic vestibular stimulation. In: Proceedings of the 2019 CHI Conference on Human Factors
in Computing Systems (CHI’19), Paper 675, pp, 1–14. Association for Computing Machinery,
New York, NY, USA (2019)
7. Omata, M., Shimizu, A.: A proposal for discreet auxiliary figures for reducing VR sickness and
for not obstructing FOV. In: Proceedings of the 18th IFIP TC 13 International Conference on
Human-Computer Interaction, INTERACT 2021, Sequence number 7. Springer International
Publishing (2021)
8. Bos, J.E., MacKinnon, S.N., Patterson, A.: Motion sickness symptoms in a ship motion simu-
lator: effects of inside, outside, and no view. Aviat. Space Environ. Med. 76(12), 1111–1118
(2005)
9. Norouzi, N., Bruder, G., Welch, G.: Assessing vignetting as a means to reduce VR sickness
during amplified head rotations. In: Proceedings of the 15th ACM Symposium on Applied
Perception (SAP’18), Article 19, pp. 1–8. Association for Computing Machinery, New York,
NY, USA (2018)
10. Duh, H.B., Parker, D.E., Furness, T.A.: An “independent visual background” reduced balance
disturbance evoked by visual scene motion: implication for alleviating simulator sickness. In:
Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI’01),
pp. 85–89. Association for Computing Machinery, New York, NY, USA (2001)
11. Sargunam, S.P., Ragan, E.D.: Evaluating joystick control for view rotation in virtual reality
with continuous turning, discrete turning, and field-of-view reduction. In: Proceedings of the
3rd International Workshop on Interactive and Spatial Computing (IWISC’18), pp. 74–79.
Association for Computing Machinery, New York, NY, USA (2018)
12. Fernandes, A.S., Feiner, S.K.: Combating VR sickness through subtle dynamic field-of-view
modification. In: 2016 IEEE Symposium on 3D User Interfaces (3DUI), pp. 201–210 (2016)
13. Budhiraja, P., Miller, M.R., Modi, A.K., Forsyth, D.: Rotation blurring: use of artificial blurring
to reduce cybersickness in virtual reality first person shooters. arXiv:1710.02599[cs.HC] (2017)

14. Whittinghill, D.M., Ziegler, B., Moore, J., Case, T.: Nasum Virtualis: a simple technique for
reducing simulator sickness. In: Proceedings of Games Developers Conference (GDC), 74
(2015)
15. Cao, Z., Jerald, J., Kopper, R.: Visually-induced motion sickness reduction via static and
dynamic rest frames. In: IEEE Conference on Virtual Reality and 3D User Interfaces (VR),
pp. 105–112 (2018)
16. Buhler, H., Misztal, S., Schild, J.: Reducing VR sickness through peripheral visual effects. In:
IEEE Conference on Virtual Reality and 3D User Interfaces (VR), pp. 517–519 (2018)
17. Ishihara, N., Yanaka, S., Kosaka, T.: Proposal of detection device of motion sickness using
nose surface temperature. In: Proceedings of the Entertainment Computing Symposium 2015,
pp. 274–277. Information Processing Society of Japan (2015)
18. Narens, L.: A theory of ratio magnitude estimation. J. Math. Psychol. 40(2), 109–129 (1996)
19. Thought Technology Ltd.: ProComp infiniti system. https://thoughttechnology.com/procomp-
infiniti-system-w-biograph-infiniti-software-t7500m. Last accessed 2022/01/20.
20. Unity Asset Store.: https://assetstore.unity.com/. Last accessed 2022/01/20.
21. Stevens, S.S.: On the psychophysical law. Psychol. Rev. 64(3), 153–181 (1957)
22. Kennedy, R.S., Lane, N.E., Berbaum, K.S., Lilienthal, M.G.: Simulator sickness questionnaire:
an enhanced method for quantifying simulator sickness. Int. J. Aviat. Psychol. 3(3), 203–220
(1993)
23. IJsselsteijn, W.A., de Kort, Y.A.W., Poels, K.: The game experience questionnaire. Technische
Universiteit Eindhoven. Eindhoven (2013)
Chapter 9
Design and Implementation of Immersive
Display Interactive System Based on New
Virtual Reality Development Platform

Xijie Li, Huaqun Liu, Tong Li, Huimin Yan, and Wei Song

Abstract In order to address the single exhibition form of traditional museums, reduce the impact
of space and time on ice and snow culture, and expand the influence and dissemination
of ice and snow culture, we developed an immersive Winter Olympics virtual museum
based on Unreal Engine 4. We used 3ds Max to build virtual venues, import Unreal
Engine 4 through Datasmith, and explore the impact of lighting, materials, sound, and
interaction on virtual museums. This article gives users an immersive experience by
increasing the realism of the space, which provides a reference for the development
of virtual museums.

9.1 Introduction

With the rapid development of society and the improvement of people’s living stan-
dards, virtual reality technology has been widely used in medical, entertainment,
aerospace, education, tourism, museums and many other fields. The digitization of
museums is an important trend in the development of museums in recent years, and
the application of virtual reality technology in the construction of digital museums

X. Li · H. Liu (B) · H. Yan


School of New Media, Beijing Institute of Graphic Communication, Digital Art and Innovative
Design (National) Experimental Demonstration Center, Beijing, China
e-mail: [email protected]
X. Li
e-mail: [email protected]
H. Yan
e-mail: [email protected]
T. Li · W. Song
School of Information Engineering, Beijing Institute of Graphic Communication, Beijing, China
e-mail: [email protected]
W. Song
e-mail: [email protected]


is an important topic. The Louvre Museum is the first museum to move its collec-
tions from the exhibition hall to the Internet. The Palace Museum has also started
the process of VR digitalization. The V Palace Museum display platform connects
online and offline display content, breaking conventional rules and improving user stickiness,
"bringing the Forbidden City cultural relics home."
The successful hosting of the 2022 Beijing-Zhangjiakou Winter Olympics has
made Beijing a city that hosts both the Summer Olympics and the Winter Olympics.
However, the spread of ice and snow culture has certain limitations, being affected
by many factors such as the epidemic situation and region. Using virtual reality tech-
nology to build a virtual Winter Olympics museum can break the limitations of
physical museums, extend the museum's space, expand its functions,
and provide an effective way to meet the multi-level and multi-faceted needs
of the public. This is one of the directions for the future development of digital
museums and has broad prospects for development.

9.2 System Methodology

9.2.1 Demand Analysis

In the context of informatization and modernization, some exhibition halls such as


museums, as important media for the accumulation and dissemination of modern
culture, have a relatively important position in education research and social devel-
opment. At the same time, since the successful bid for the 2022 Beijing-Zhangjiakou
Winter Olympics, the construction and publicity of ice and snow Nagano have been
strengthened throughout the country. The traditional way of visiting cannot meet the
needs of modern audiences, and the virtual museum is particularly important due to
the impact of the region and the epidemic.
From the perspective of user experience, this paper presents the display informa-
tion to users in a multi-sensory, multi-layered and three-dimensional way, so that
users feel like they are in the scene of a virtual museum. The virtual museum can
break the limitations of traditional memorial halls in terms of time, space, region and
visiting form, so as to promote Beijing, the Winter Olympics and the Olympic spirit.

9.2.2 Analysis of Our Developing Platform

In recent years, Unreal Engine 4 has also been widely used with the rise of virtual
reality technology. Unreal Engine 4 has its own advantages over other engines: it not only offers
familiar operating conventions and realistic light and shadow relationships, but also flexible and
free interface design and minimalist interaction design implementation [1].

UE4 brings a new way of program development. The visual Blueprint scripting system
makes development more convenient through integrated code visualization, provides
more possibilities for realizing functions, and enhances the editability of
blueprints. Blueprint scripts are very easy to read, which not only enhances
development efficiency, but also allows nodes to be connected so that the
running process can be observed more intuitively, making it convenient to solve problems that arise
[2].

9.2.3 Development Processing

The project used 3ds Max modeling tools to build virtual scenes and optimize geom-
etry. The project used Adobe Photoshop to texture models and Adobe After Effects
and Adobe Audition to work on video and audio materials. This article exports the
3D model to Datasmith format and imports it into Unreal Engine for project scene
construction, reprocesses the materials and textures of some models, completes the
key processes such as the creation of model materials and lighting, and the preparation
of Blueprint interaction events to complete the production of the project (Fig. 9.1).

Fig. 9.1 System flowchart



9.3 Analysis of Key Technologies

9.3.1 Model Construction and Import

The quality of the model construction effect in the virtual museum has a great impact
on the implementation of the final system, because the model serves as a carrier for
functional implementation. First, we made a CAD basemap to determine the model
positioning criteria. Then, we built a scene model in 3ds Max based on the
drawn planar shape, paying special attention to the units, axial direction, and model
scale. When building the model, removing excess faces served
not only to improve texture utilization and reduce the face count of the
entire scene, but also to improve the speed of the interactive scene [3], as shown in the
example (Fig. 9.2).
After the scene was modeled, we used a file-based workflow to bring designs into
Unreal. Datasmith gets our design data into Unreal quickly and easily. Datasmith
is a collection of tools and plugins that bring entire pre-constructed scenes and
complex assets created in a variety of industry-standard design applications into
Unreal Engine. First, we installed a special plugin in 3ds Max, which we used to export
files with the .udatasmith extension. Then, we used the Datasmith Importer to
bring the exported file into the current Unreal Engine project (Fig. 9.3).
Using the Datasmith workflow, it is possible to achieve one-to-one restoration of
the scene, establish a single Unreal asset for all instantiated objects, maintain the
original position and orientation, realize layer viewing and automatically convert the
map, further narrowing the gap between the design results and the final product in
the actual experience.

9.3.2 Real Shading in Unreal Engine 4

The first thing we achieve in virtual reality is to imitate the effect of the eyes, to
achieve a realistic sense of space and immersion. Unreal Engine’s rendering system
is key to its industry-leading image quality and superior immersive experience. Real-
time rendering technology is an important part of computer graphics research [4].
The purpose of the application of this technology is to allow users to experience the
immersive feeling, according to the real situation of the scene’s shape, material and
light source distribution, to produce visual effects similar to the real scene and almost
indistinguishable.
Due to the limitation of space, the visual experience is dominant for people in the
virtual environment. In this project, the presentation effect of the model also greatly
affects the user experience, after completing the system’s scene construction, it is
necessary to further improve the model and rendering.

Fig. 9.2 CAD floor plan and scene building in 3ds Max

Illumination
In Unreal Engine 4, a few key properties have the greatest impact on
lighting in the world. Simulating the way light behaves in 3D worlds is handled in one
of two ways: using real-time lighting methods that support light movement and inter-
action of dynamic lights, or using precomputed (or baked) lighting information
that gets stored in textures applied to geometric surfaces [5]. Unreal Engine provides
both of these ways of lighting scenes, and they are not exclusive, as they
can be seamlessly blended with one another.

Fig. 9.3 Datasmith workflow

Physically Based Rendering (PBR) refers to the rendering concept of using a


shading/lighting model modeled based on physics principles and micro-plane theory,
and using surface parameters measured from reality to accurately represent real-
world materials.
PBR shading is mainly divided into two parts: Diffuse BRDF and Microfacet
Specular BRDF.
The BRDF describes reflectance of the surface for given combination of incoming
and outgoing light direction. In other words, it determines how much light is reflected
in given direction when certain amount of light is incident from another direction,
depending on properties of the surface. Note that BRDF does not distinguish between
direct and indirect incoming light, meaning it can be used to calculate contribution of
both virtual lights placed in the scene (local illumination), and indirect light reflected
one or more times from other surfaces (global illumination). This also means that
BRDF is independent of the implementation of lights which can be developed and
authored separately (BRDF only needs to know direction of incident light and its
intensity at shaded point) [6].
Lambert Model BRDF:

f(l, v) = C_diff / π (9.1)
This BRDF value states that the intensity of the reflected light is proportional to
the intensity of the incident light, regardless of the angle of reflection. So, no matter
what angle the material is viewed from, the final light intensity reflected into the
camera is the same.
Cook-Torrance Model BRDF:

f(l, v) = D(h) F(v, h) G(l, v) / (4 (n · l)(n · v)) (9.2)

• D: Normal Distribution Function (NDF)


• F: Fresnel Equation
• G: Geometry Function
• N is the material normal, and H is the angle bisector direction of the illumination
direction L and the line of sight direction V.
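UE4's exact shading functions are documented in [7]; purely as a hedged illustration of the D, F, G decomposition above (one common choice of terms, not necessarily the engine's own), Eqs. (9.1) and (9.2) can be evaluated as follows:

```python
import math

def ggx_ndf(n_dot_h, roughness):
    # D(h): Trowbridge-Reitz / GGX normal distribution function.
    a2 = (roughness * roughness) ** 2
    denom = n_dot_h * n_dot_h * (a2 - 1.0) + 1.0
    return a2 / (math.pi * denom * denom)

def schlick_fresnel(v_dot_h, f0):
    # F(v, h): Schlick approximation of the Fresnel term.
    return f0 + (1.0 - f0) * (1.0 - v_dot_h) ** 5

def smith_schlick_g(n_dot_v, n_dot_l, roughness):
    # G(l, v): Schlick-GGX geometry term, with k remapped for analytic lights.
    k = (roughness + 1.0) ** 2 / 8.0
    g1 = lambda x: x / (x * (1.0 - k) + k)
    return g1(n_dot_v) * g1(n_dot_l)

def lambert_diffuse(c_diff):
    # Eq. (9.1): f(l, v) = C_diff / pi.
    return c_diff / math.pi

def cook_torrance_specular(n_dot_l, n_dot_v, n_dot_h, v_dot_h, roughness, f0):
    # Eq. (9.2): f(l, v) = D(h) F(v, h) G(l, v) / (4 (n.l)(n.v)).
    d = ggx_ndf(n_dot_h, roughness)
    f = schlick_fresnel(v_dot_h, f0)
    g = smith_schlick_g(n_dot_v, n_dot_l, roughness)
    return d * f * g / max(4.0 * n_dot_l * n_dot_v, 1e-6)

# Example: a fairly rough, dielectric-like surface lit close to head-on.
print(lambert_diffuse(0.8))
print(cook_torrance_specular(0.9, 0.9, 0.95, 0.9, roughness=0.5, f0=0.04))
```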

The farther the light source is from the target object, the weaker its lighting
effect on the object will be, which is described by the light's falloff. Improving light falloff was
relatively straightforward: we adopted a physically accurate inverse-square falloff and switched
to the photometric brightness unit of lumens. We chose to window
the inverse square function in such a way that the majority of the light’s influence
remains relatively unaffected, while still providing a soft transition to zero. This has
the nice property whereby modifying a light’s radius does not change its effective
brightness, which can be important when lighting has been locked artistically, but a
light’s extent still needs to be adjusted for performance reasons [7].
falloff = saturate(1 − (distance/lightRadius)^4)^2 / (distance^2 + 1) (9.3)

The 1 in the denominator is there to prevent the function exploding at distances


close to the light source. It can be exposed as an artist-controllable parameter for
cases where physical correctness is not desired. The quality difference this simple
change made, particularly in scenes with many local light sources, means that it is
likely the largest bang for buck takeaway. It is worth exploring in depth [8] (Fig. 9.4).
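As a small sketch (assuming saturate means clamping to [0, 1], as in common shading languages), Eq. (9.3) can be written directly as:

```python
def saturate(x):
    # Clamp to [0, 1], as in common shading languages.
    return max(0.0, min(1.0, x))

def windowed_inverse_square_falloff(distance, light_radius):
    # Eq. (9.3): inverse-square falloff windowed so it reaches zero at light_radius.
    window = saturate(1.0 - (distance / light_radius) ** 4) ** 2
    return window / (distance * distance + 1.0)

# The windowing leaves most of the light's influence unchanged but gives a
# soft transition to zero at the light's radius.
print(windowed_inverse_square_falloff(1.0, 10.0))   # near the source
print(windowed_inverse_square_falloff(10.0, 10.0))  # exactly at the radius -> 0.0
```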
Materials
Materials control the appearance of surfaces in the world using shaders. A material is an
asset that can be applied to a mesh to control the visual look of the scene. In more
technical terms, when light from the scene hits the surface, a material is used to
calculate how that light interacts with that surface. These calculations are done using
incoming data that is input to the material from a variety of images (textures) and
math expressions, as well as from various property settings inherent to the material
itself. The base material model includes: base color, metallic, specular, roughness.
The project was able to better reflect snow sports through snow scenes,
using blueprints to make snow accumulate on the ground and on objects, as shown in
Fig. 9.5.
We use subsurface scattering in the snow materials. Subsurface scattering is the term
used to describe the lighting phenomenon where light scatters as it passes through a
translucent/semi-translucent surface [9]. We also use a subsurface profile in the snow material:
to achieve realistic rendering, Unreal Engine 4 (UE4) offers a shading method called
subsurface profile. While the subsurface profile shading model has similar properties
to the subsurface shading model, its key difference is in how it renders (Fig. 9.6).

Fig. 9.4 Lighting effect map

9.3.3 Making Interactive Experiences

In addition to providing Real Shading, Unreal Engine also has a variety of interaction
methods. At present, most virtual reality products use the buttons of the controller to
implement interactive functions, while accurate gesture recognition and eye movement
recognition are not yet fully mature. Based on the demand analysis of the virtual
museum, the following interactive methods are mainly adopted: users enter the virtual
museum of the Winter Olympics and roam the museum [10]. In order to give users
more perspectives and a better understanding of the virtual museum, two
roaming methods with different angles and forms have been specially set
up: first-person roaming and third-person roaming. In Unreal Engine 4, both
roaming methods are controlled by character blueprints.
We create a new pawn blueprint class, add it to the scene, implement camera switch events,
and add regular game perspective operations to the pawnCamera to achieve first-
person and third-person perspective switching (Fig. 9.7).
Box Trigger Interaction: Blueprints control the ON and OFF of videos by adding
box triggers and defining OnActorBeginOverlap and OnActorEndOverlap events:
When a character touches a box trigger, the video turns on playback. Video playback
stops when the character leaves the range of the box trigger. Such a design makes
the user's sense of experience and realism more complete. The results are as shown in
Fig. 9.8.
In a virtual museum, sound information can be superimposed on the real-world
virtual accompaniment in real time, producing a mixed effect of sight and hearing,
which can supplement the limitations of seeing. When the virtual scene changes, the

Fig. 9.5 Material of the snow on the ground

sound the user hears changes accordingly, rendering the experience immersive. We use audio volumes
to add reverb effects to the virtual museum, increase spatial realism, and adjust the dry/wet
mix of the level's sound to get a more realistic sense of distance in space. For
added realism, the sound usually moves rather than remaining static [11] (Fig. 9.9).

Fig. 9.6 Comparison chart using a subsurface profile

Fig. 9.7 Camera switch blueprint



Fig. 9.8 Box touch renderings

Fig. 9.9 Cone audio attenuation

9.4 Conclusion

Virtual museums have changed traditional concepts and broken the shackles of time
and space. The virtual museum increases the enthusiasm of visitors and enriches the
display form of the museum. Diverse interaction design will be the direction of the
future of immersive virtual museums [12]. The Winter Olympics virtual museum
integrates technology and sports, so that more people can understand Beijing and under-
stand the 2022 Winter Olympics through new forms. While disseminating Chinese
culture, it also enhances China's soft power and international influence. In the future,
combined with increasingly mature science and technology, we will
develop virtual museums at a higher level and with more significance.

Acknowledgements This work was supported by grant from: “Undergraduate teaching Reform
and Innovation” Project of Beijing Higher Education (GM 2109022005); Beijing College Students’
innovation and entrepreneurship training program (Item No: 22150222040, 22150222044). Key
project of Ideological and Political course Teaching reform of Beijing Institute of Graphics
Communication (Item No: 22150222063). Scientific research plan of Beijing Municipal Education
Commission (Item No: 20190222014).

References

1. Kersten, T.P.: The Imperial Cathedral in Königslutter (Germany) as an immersive experience in


virtual reality with integrated 360° panoramic photography. Appl. Sci. 10 (2020)
2. Libing, H.: Immersive somatosensory interaction system based on VR technology: design and
realization of “going to Nanyang” story. Software 42(12), 7 (2021)
3. Santos, B., Rodrigues, N., Costa, P., et al.: Integration of CAD models into game engines. In:
VISIGRAPP (1: GRAPP), pp. 153–160 (2021)
4. Burley, B., Studios, W.D.A.: Physically-based shading at Disney. In: ACM SIGGRAPH, pp. 1–7
(2012)
5. Wojciechowski, K., Czajkowski, T., Artur, B.K., et al.: Modeling and rendering of volumetric
clouds in real-time with unreal engine 4. Springer, Cham (2018)
6. Boksansky, J.: Crash course in BRDF implementation (2021)
7. Karis, B., Games, E.: Real shading in unreal engine 4. Proc. Phys. Shading Theory Pract. 4(3),
1 (2013)
8. Natephra, W., Motamedi, A., Fukuda, T., et al.: Integrating building information modeling
and virtual reality development engines for building indoor lighting design. Visualization Eng.
5(1), 1–21 (2017)
9. Yan, H., Liu, H., Lu, Y., et al.: "Dawn of south lake"—design and implementation of immersive
interactive system based on virtual reality technology (2021)
10. Jianhua, L.: Interactive design of virtual dinosaur museum based on VR technology. Comput.
Knowl. Technol. 16(13), 257–259 (2020). https://doi.org/10.14004/j.cnki.ckt.2020.1698
11. Group R.: Shadow Volumes in Unreal Engine 4 (2017)
12. Wang, S., Liu, H., Zhang, X.: Design and implementation of realistic rendering and immersive
experience system based on unreal engine4. In: AIVR2020: 2020 4th International Conference
on Artificial Intelligence and Virtual Reality (2020)
Chapter 10
360-Degree Virtual Reality Videos
in EFL Teaching: Student Experiences

Hui-Wen Huang, Kai Huang, Huilin Liu, and Daniel G. Dusza

Abstract In technology-enhanced language learning environments, virtual reality


(VR) has become an effective tool to support contextualized learning and promote
immersive experiences with learning materials. This case study, using the qualitative
research method, explored students’ attitudes and perceptions of engaging in VR
language learning. Twenty-eight Chinese sophomores attended this VR project in
a travel English unit. They wore VR headsets to watch 360 VR videos containing
famous tourist attractions from four countries during a 6-week project. Data were
collected from final reflections and interviews. Results of final reflections indicated
that students showed positive feedback on this new teaching method in English
learning. In addition, interview data present the advantages and disadvantages of
implementing immersive VR learning in EFL classrooms.

10.1 Introduction

Integrating technology into language classrooms has drawn numerous scholars’


attention [1–3]. It can support teachers’ teaching efficacy, promote student learning
engagement, and supplement traditional textbooks [4, 5]. Among the emerging tech-
nologies applied in education, virtual reality (VR) is one of the most valuable learning

H.-W. Huang (B)


Shaoguan University, Guangdong, China
e-mail: [email protected]
K. Huang · H. Liu
Fujian University of Technology, Fujian, China
e-mail: [email protected]
H. Liu
e-mail: [email protected]
D. G. Dusza
Hosei University, Tokyo, Japan
e-mail: [email protected]


tools because of its immersive experience [6, 7]. This VR immersion leads to engage-
ment or flow, which is beneficial for students to connect with the learning content
for producing deep positive emotional value, which consequently enhances learning
outcomes [8–10]. The nature of VR immersion enhances students’ active learning
with embodiment [11], long-term retention [12], and enjoyment [13]. Students expe-
rience contextualized learning when they actively engage in the 360 VR video
content. This learning experience is different from traditional classrooms with one-
way or didactic lectures, focusing on shared content, in which students tend to learn
passively [12].
VR technology has been employed in a variety of educational contexts. Together
with providing students with immersive engagement in learning new knowledge in
foreign language learning [6, 14], plant cells [13], and English public
speaking [9], these studies demonstrate that VR technologies facilitate promising
new learning results. Previous scholars indicated that applying VR in educa-
tion can increase students' authentic experience in contextualized learning, which
consequently empowers students to develop autonomy [4, 11, 15].
The impetus of this study is to provide Chinese EFL learners with an innovative
teaching method that promotes learner motivation in English learning. Although
previous studies have recorded positive feedback on VR integration in foreign
language classrooms [6, 9, 14], research focusing on exploring EFL learners’ experi-
ences after watching 360 VR videos in tourism English learning is scarce. Therefore,
this study applied online free 360 VR videos for students to experience foreign tourist
attractions after wearing VR headsets. Specifically, whether Chinese EFL learners
prefer this innovative learning tool to experience immersive and authentic foreign
contexts to learn English has not yet been investigated. Hence, the purpose of this
case study was to apply the VR tool in an English course, focusing on travel English
units.

10.2 Literature Review

10.2.1 Different VR Types

Parmaxi [2] classified three types of VR simulations. First, a non-immersive VR


context allows users to use a desktop computer system to go through the 3D simula-
tion, such as Second Life (https://secondlife.com). Second, semi-immersive VR has
a gesture recognition system that can track users' body movements, enabling
human–computer interactions, as with Microsoft Kinect. Finally, fully immersive VR
applies a “head-mounted system where users’ vision is fully enveloped, creating a
sense of full immersion” in simulation (p. 5).
The main difference between semi-immersive and fully immersive VR is the
degree to which the user is immersed in the experience. Kaplan-Rakowski and
Gruber [16] indicated that high-immersion VR offers a greater sense of presence

and authenticity than low-immersion VR; the former is experienced by wearing a head-mounted
device (HMD). In this paper, we focus on high-immersion VR with HMDs. We first
outline the learning theory behind VR learning and give a brief overview of integrating
VR in education. Next, we present the research method and results collected from
EFL learners in China. Finally, we give conclusions with suggestions for future work.

10.2.2 VR in Higher Education

Teaching methods in twenty-first century classrooms face the shift from the
Information Age to the Experience Age [17]. Young learners live in an experience
age, full of multimedia and digital technologies, and they prefer technology-mediated
learning environments for obtaining and sharing new knowledge. VR as a learning
tool in educational environments better matches their learning styles for constructing
new knowledge [13], and VR guides them to experience more engaging and exciting
learning materials [4, 15].
The integration of VR has become popular in higher education over the last two
decades [4, 10, 18, 19]. These studies indicate that students in VR learning contexts
show dynamic engagement and participation in classroom discussion and obtain
abstract concepts easily because they are more receptive to real-life contexts. For
example, Dalgarno and Lee [4] conducted a review of the affordances of 3D virtual
learning environments from the 1990s to the 2000s for enriching student motivation and
engagement. The 3D virtual environments create “greater opportunities for expe-
riential learning, increased motivation/engagement, improved contextualization of
learning, and richer/more effective collaborative learning” [4].
For students, the brand-new learning moments with VR technologies are unique
in educational settings. Their curiosity and motivation can be increased while being
involved in immersive VR learning contexts. This activated participation can be an
opportunity for igniting students’ interest and involvement in learning and maxi-
mizing the potential of VR learning experiences [13]. These unique VR learning
benefits make subject contents come alive because students feel a strong sense of
presence in a multi-sensory setting to create immersive and engaging learning [11].
For example, Allcoat and von Mühlenen [13] found that VR provided increased posi-
tive emotions and engagement compared with textbook and video groups learning
about plant cells. Furthermore, Parmaxi [2] indicated that VR contextual learning
particularly helped students who struggle to stay focused on English learning materials.
These positive results show that VR contexts give a better sense of being present
in a place, helping learners connect with the learning content.

10.2.3 VR Embedded Language Learning

Integrating VR technologies in language courses allows students to immerse themselves
deeply in learning materials [2, 14]. From the perspective of Vygotsky's sociocultural theory
[20], learning occurs when individuals interact meaningfully with their contexts.
The 360 VR videos, coupled with appropriate learning design, provide
students with learner-object interactions to enhance learning quality in a sociocultural
context.
Previous studies have revealed that students had positive feedback after experi-
encing VR language learning. Students indicated that VR language learning is more
enjoyable than conventional English learning [21]. Other scholars reported that their
participants had increased motivation and experienced high levels of engagement
and excitement when learning a foreign language in immersive VR environments
[6, 14].
In addition to the benefits of VR learning, there are some challenges that VR
encounters. First, students may feel dizzy if they wear VR HMDs for more than 3 min;
Berti et al. [15] therefore suggest wearing VR HMDs for no more than 2 min. The
quality of 360 VR videos is another issue when implementing VR in education.
Although the previous studies indicated that students are more engaged in VR
learning, there is a need to explore how the use of VR can be maximized in English
learning. This case study contributes to the literature by exploring students' learning
experiences and attitudes in a travel English unit in an EFL classroom in China. To
explore EFL learners’ attitudes and learning experiences towards the use of 360 VR
videos in an English course, the authors addressed the research questions below.
R.Q. 1: What were the overall perceptions of the students’ VR language learning experience?

R.Q. 2: What were students’ suggestions after experiencing the VR project?

R.Q. 3: What advantages and disadvantages of the VR learning project did the students
express in the interviews?

10.3 Research Methods

This study explored students’ perceptions of experiencing VR immersive environ-


ments through virtually “visiting” four different countries’ main tourism cities. In
order to experience a fully immersive VR environment, the students wore HMDs to
watch 360 videos. They were “teleported” into a 3D view of different pre-recorded
authentic contexts and viewed the virtual locales in a 360-degree surrounding by
moving their heads. This study used a qualitative research approach, collecting data
from students' final reflections and interview transcripts to explore EFL learners'
perceptions of applying 360 VR videos to English learning.
There was no control group in this study as the class teacher was expected to
investigate students’ views of this pilot study. The rationale of such a design was that

the university hoped to understand students’ learning experience in this innovative


teaching method.

10.3.1 Participants

The participants were English major sophomores (n = 28; 24 females and 4 males)
enrolled in a four-year public university in southern China, all aged between 20 and
21 years old. The participants were randomly divided into six groups of 4–5 students.
This study utilized a convenience sample because all participants enrolled in a cohort
programme. They were native Chinese speakers, had studied English as a foreign
language for at least six years in the Chinese school education system, and were
assessed to be at an intermediate English proficiency level based on the Chinese
National College Entrance Examination (equivalent to the B1 level of the Common
European Framework of Reference for Languages (CEFR) [22]). None of the
students had previous experience with VR learning. All the participants volunteered
to attend this study and had the right to withdraw at any time.

10.3.2 Instruments

The data collected included final reflections written by 28 students and focus-group
interviews with six participants. Students’ final reflections were collected to explore
their views about VR learning experience and suggestions regarding how VR can
be effectively used in future learning. Six volunteers (five females and one male)
attended the focus-group interviews, which explored their thoughts about the advantages
and challenges of using VR for language learning.

10.3.3 Procedures

The classroom teacher (the first author) announced the goals of the VR learning
programme to all participants before conducting the research. All participants in this
case study had similar opportunities to use the VR HMDs and participate in the
learning tasks. In particular, this study used advanced VR HMDs that are suitable
for users with myopia of less than 600 degrees. This is important because the majority
of Chinese students wear glasses, and the VR HMDs offer the best VR experience
under these conditions.
The research lasted for six weeks, and students had 100 min of class time each
week. The theme of the VR project was “Travel English”. Four countries were
selected by the class teacher for immersive VR learning; these countries were Turkey,
Spain, New Zealand, and Colombia.

Fig. 10.1 Screenshots of 360 VR videos from Spain, Colombia, New Zealand, and Turkey

The course design adopted the structure of flipped learning approach. All students
watched a 2D introductory video of the country to learn the basic concepts related
to the country before entering the class. When students came to the classroom, they
had group discussions and answered the teacher’s questions related to the 2D video.
Each student then wore a VR HMD to experience a 360 VR video of the country for
the weekly schedule and answered embedded questions (see Fig. 10.1). Afterwards,
students practised English conversation sharing what they saw in the 360 video with
group members.
Students’ final reflections were collected to explore their views of the VR learning
experience and analyse their suggestions regarding how VR can be more effectively
used in future teaching. Afterwards, six volunteers participated in semi-structured
interviews. The research assistants (RAs) interviewed the volunteers using interview
questions prepared by the first author. The interviews were conducted in the students'
native language, Chinese, allowing the students to express their views with less restriction
and without being constrained by second language limitations. To make students feel comfortable while
answering questions, the first author did not participate in the interview process, and
the RAs started a welcome talk to put the interviewees at ease. Finally, the RAs
conducted subsequent data analysis.

10.3.4 Data Analysis

The RAs conducted a content analysis of the students’ final reflections and catego-
rized them into different themes to answer RQ1 and 2. The content analysis steps
originated from Braun and Clarke [23]. We conducted six steps to categorize the
students’ final reflections: (a) familiarizing yourself with your data, (b) generating

initial codes, (c) searching for themes, (d) reviewing themes, (e) defining and naming
themes, and (f) producing the report (p. 87). Steps a, b, c, and e were performed by
the RAs. The authors reviewed the themes (step d) and produced the report (step f)
together with the RAs, and the final report was then reviewed once more to improve accuracy.
For the focus-group interviews, all interview data were collected from audio
recordings and transcribed into texts for corpus analysis. The accuracy of the
transcriptions was verified against the audio recordings, and the transcripts were then analysed as a Chinese corpus.
Corpus analysis was conducted using WEICIYUN (http://www.weiciyun.com), an
online tool that allows for Chinese basic corpus analysis and generates visualizations
of word occurrences based on the input text. Visualizing word occurrence frequencies
enables us to analyse key information from interview data.
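The study relied on the online WEICIYUN tool for this step. As an illustration of the same word-frequency idea, the sketch below uses the jieba and wordcloud Python packages, which are assumptions rather than the tools named here, and a hypothetical transcript file and stopword list.

```python
from collections import Counter

import jieba                     # Chinese word segmentation (assumed tool)
from wordcloud import WordCloud  # word-cloud rendering (assumed tool)

def word_frequencies(transcript, stopwords):
    """Segment a Chinese transcript and count content-word occurrences."""
    tokens = (t.strip() for t in jieba.cut(transcript))
    tokens = [t for t in tokens if len(t) > 1 and t not in stopwords]
    return Counter(tokens)

if __name__ == "__main__":
    # Hypothetical transcript file and stopword list, for illustration only.
    with open("interview_group1.txt", encoding="utf-8") as f:
        text = f.read()
    freqs = word_frequencies(text, stopwords={"我们", "就是", "然后"})
    cloud = WordCloud(font_path="SimHei.ttf", width=800, height=400)  # a Chinese font is required
    cloud.generate_from_frequencies(freqs)
    cloud.to_file("advantages_wordcloud.png")
```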

10.4 Results

The following results are presented for each research question.


R.Q. 1: What were the overall perceptions of the students’ VR language learning experience?

The final reflections aimed at exploring student perceptions of VR language


learning. Their responses to this question were grouped into two categories: real-
like contexts and immersive learning with technology. Regarding the category of
real-like contexts, the data indicated that the students felt that VR learning could
enhance the realism by wearing VR HMDs in the virtual scenes, and this experience
is helpful to learning retention. In general, students’ learning experience was positive
and full of novelty towards VR language learning. Over 85% (i.e. 24 out of 28) of the
students stated that they enjoyed this new learning method in experiencing foreign
tourist attractions through 360 VR videos. Some examples are detailed below.
S5: … Dr. Huang used virtual reality to immerse me in the beautiful scenery of
other countries, watching and appreciating the beautiful scenery while learning the
knowledge of various countries. The knowledge I learned has been applied to my
composition.
S16: … VR teaching gave us an interesting and realistic learning experience. We
learned about other countries in a more realistic atmosphere, and this gave us a deeper
memory.
S23: … VR learning allows me to experience more engagement in the sense of
virtually being there, which broke my traditional English learning mindset. This
unique learning opportunity is not offered in other courses.
The second category was immersive learning with technology. The data showed
that the students enjoyed the immersive VR language learning, which helped them
realize how technology supports English learning. Moreover, the students stated that
they could feel immersed in the virtual “real-world” environments after wearing VR
HMDs because they could “fly” to other countries without spending travel expenses.
Some responses are presented as follows:

S9: I immersed myself in learning about some foreign cultures rather than the
presentation of photos. The application of modern technology in the field of education
is of great benefit.
S17: VR learning, combining fun and high-tech, gave me a new definition of
English learning, which not only improved my English learning motivation but also
enabled me to have a sense of immersion while watching 360 videos.
S26: Technology has brought us new ways and opportunities to learn. We can have
virtual field trips to learn new knowledge and see tourism spots in other countries.
This is not possible in traditional English classrooms.
In summary, students’ final reflections indicated that VR technologies provide
learners with engaging learning opportunities and reform language learning expe-
riences in EFL classrooms. Students had VR tours in different countries, which is
an improvement on the traditional textbook or the 2D pictures experienced in other
English learning settings. Additionally, the VR tours allowed students to immerse
themselves in foreign tourist attractions, inspiring them to form a deeper connection
with the learning materials.
R.Q. 2: What were students’ suggestions after experiencing the VR project?

Suggestions from all 28 students' responses about the VR learning project were used to
answer research question two. The RAs categorized the collected responses according to
their similarities (see Fig. 10.2): (1) VR equipment and technology, (2) VR content,
and (3) others.
In the first category about VR equipment and technology, three students expressed
their views about the VR device itself. Their responses were categorized into (1)
use more advanced VR display devices; and (2) wear headphones to experience
panoramic sound.

Fig. 10.2 Overview of the classified suggestions



Table 10.1 Selected students’ suggestions about the VR learning project


A. VR content
• I hope next time VR experience can introduce other countries’ traditional clothing
• In the next VR immersive language learning class, we can enjoy different conditions. For
example, we can drive a car and enjoy the view of the city
• I hope the next VR experience can go into the streets to feel the local atmosphere
• It can include some interactive games or questions in VR
B. VR equipment and technology
• My suggestion is to remind the whole class to wear headphones. Wearing headphones allows
us to have a more immersive experience
• It is better to use more advanced VR equipment to improve immersion
C. Others
• I wish I could have buffered time in the video. It is easy to miss the beginning part because I
needed to set up the device with my smartphone
• I hope to increase the length of the video and introduce the history and culture of the country
in more aspects
• I hope the teacher can provide more time for us

The second category is about the VR content. Eighteen students (64%) mentioned
viewing content that could be sorted into: (1) enriching video types and content, espe-
cially more lifelike videos such as food, clothing, and street culture; (2) improving
the clarity of content; and (3) increasing interaction.
The third category is other suggestions that could not be identified in the previous
two categories. These suggestions included (1) extending the use time; and (2)
providing buffer time in the beginning to set up the device. Table 10.1 presents
the students’ suggestions categorized by the RAs.
R.Q. 3: What advantages and disadvantages of the VR learning project did the students
express in the interviews?

All volunteers were divided into two groups for the interviews. They expressed
positive attitudes towards the VR learning project. They indicated that VR learning
has many advantages, such as allowing students to focus more on the content while
learning with VR, developing the ability to learn actively, and exploring knowledge
by themselves. However, some students expressed some disadvantages of VR
learning. For example, the equipment had various problems and was not easy to
control. Additionally, using VR reduced teacher-student communication and wearing
the HMD caused dizziness.
The interviewees’ responses were transcribed into Chinese and then visualized
through a word cloud to present their perceptions of the advantages and disadvantages
of the VR learning project (see Fig. 10.3). The larger words in the word cloud indicate
more frequent use. The illustration also includes translations into English.

(a) advantages (b) disadvantages

Fig. 10.3 Word cloud results of student interview responses

10.4.1 Results of Two Interview Questions

Q: What benefits does VR learning have as a whole?


All six respondents agreed that VR learning has many benefits and can help students
learn more. Students believed that VR learning can reduce the impact of the COVID-
19 pandemic, provide an immersive learning environment, and enhance learning
efficiency (see Fig. 10.3a).
Below are example responses (translated) to this interview question from three
students.
Because the pandemic has brought us a lot of inconvenience in classroom learning, VR
learning project is not limited by time and space, which improves our learning efficiency.
(Student D, Group2)

The 360 VR videos can help me engage in an immersive environment, which is beneficial
for me to apply my imagination in this kind of digitalized learning materials. (Student B,
Group 1)

VR allows us to participate in class activities with interesting scenarios or dialogues, which


can raise our interest, make us more motivated, and learn new knowledge quickly. (Student
A, Group 1)

Q: Were there any disadvantages in VR learning? If yes, what are they?


Some students expressed several disadvantages of VR learning that they hope will be
improved in the future. Students mentioned that it was difficult to experience high-quality VR
content, and blurred VR content led to lower interaction. The VR teaching mode
also reduced interaction between teachers and students. Some students experienced
VR vertigo symptoms (see Fig. 10.3b).
Below are two example responses to this interview question.
In addition to the technical issue of VR equipment, the main thing is probably that VR is not
very popular now and VR resources may be relatively scarce. After students put on the VR
headsets, the teacher cannot monitor the students’ viewing, which may lead to absent-minded
learning. (Student E, Group2)

I felt dizzy and the recovery time may vary depending on each individual’s physical condition.
(Student C, Group1)

10.5 Discussion

This case study aimed to explore student experiences in engaging in immersive VR


learning in an EFL course in China. Several major findings emerged. First, students
experienced immersive learning in an almost lifelike context. When students wore VR
HMDs, they immediately became involved and immersed in the virtual surroundings
they were visiting. They paid full attention to the 360 video contents to learn new
knowledge and experience the feeling of being present in the scene before their
eyes. The findings are consistent with previous studies [2, 6, 14], indicating that VR
learning is a highly immersive and engaging learning experience.
Second, students’ suggestions written in the final reflections supported evidence
of the main themes of VR equipment and contents. The majority of the students
mentioned their expectation to visit local street views in different countries and
to see the world. The results are similar to the conclusions of Berti et al. [15],
who reported that providing students with VR HMDs can virtually teleport them to
any place in the world. Students can "visit" a famous travel spot that they
have longed for, and they can interact with the virtual context by walking around
there, which replicates a field trip. These affordances help students experience a new
learning method that goes beyond conventional role plays and physical field trips.
Additionally, teachers can search the uploaded 360 VR videos on the Internet and
download them for class teaching materials. Afterwards, teachers take students to
have virtual field trips with VR HMDs, along with headphones. Depending on the
teaching objectives, the 360 VR videos can show a museum, a famous foreign tourist
attraction, or even outer space.
Finally, students observed the advantages of VR language learning. They noticed
that the immersive VR learning experience helps their imagination to build new
knowledge. Using VR in EFL classrooms can enhance twenty-first century learners'
engagement and curiosity beyond classroom walls. The findings of the word cloud
are in line with those of previous studies stating that VR technologies can be creative
and powerful vehicles to stimulate student imagination and increase engagement
[4, 6]. When VR becomes an educational tool in schools, students experience a
feeling of “being there” in a virtual environment, not merely perusing pictures or
reading materials. Due to this virtual physical immersion, English learners benefit
from deeper engagement in language learning tasks. As for the disadvantages of VR
learning, some students felt dizzy while watching 360 videos. This could be due to
individual differences, because all the videos were kept within two minutes, as suggested
by Berti et al. [15].

10.6 Conclusions

Virtual reality has become a popular theme in educational contexts, and the low cost of
implementing 360 VR videos in EFL classrooms makes it a more attractive learning
tool. Although there were some technical issues related to video quality and motion
sickness, most students expressed their excitement and engagement in immersive VR
learning in an EFL course. Additionally, the responses supported previous research
that it is difficult for teachers to monitor students’ selection and attention to content
in the HMDs. However, by asking students to prepare for the lesson and use that
preparation to guide the VR experience, attention to content should be improved.
Furthermore, asking students to complete a survey, participate in an interview, and
submit a final reflection encourages attention during the task and enhances retention.
Finally, asking students to provide suggestions for future enhancements gives them
motivation to contribute to further learning. While this study did not include any
longitudinal data or quantitative language learning results, the overall impression is
that VR English learning increases participation, improves attention, and motivates
students to be critical about their learning and the learning material. For future
work, it is worth quantitatively evaluating variables such as students' speaking
performance under the VR course design and their emotional changes.

References

1. Godwin-Jones, R.: Augmented reality and language learning: from annotated vocabulary to
place-based mobile games. Language Learn. Technol. 20(3), 9–19 (2016). https://www.scopus.com/inward/record.uri?eid=2-s2.0-84994627515&partnerID=40&md5=6d3aec75cd7321c12aa0d2acef7c8ad9
2. Parmaxi, A.: Virtual reality in language learning: a systematic review and implications for
research and practice. Interact. Learn. Environ. (2020)
3. Warschauer, M.: Comparing face-to-face and electronic discussion in the second language
classroom. CALICO J. 13(2–3), 7–26 (1995)
4. Dalgarno, B., Lee, M.: What are the learning affordances of 3-D virtual environments? Br. J.
Edu. Technol. 41, 10–32 (2010)
5. Huang, H.W.: Effects of smartphone-based collaborative vlog projects on EFL learners’
speaking performance and learning engagement. Australas. J. Educ. Technol. 37(6), 18–40
(2021)
6. Berti, M.: Italian open education: virtual reality immersions for the language classroom. In:
Comas-Quinn, A., Beaven, A., Sawhill, B. (eds.), New Case Studies of Openness in and Beyond
the Language Classroom, pp. 37–47 (2019)
7. Makransky, G., Lilleholt, L.: A structural equation modeling investigation of the emotional
value of immersive virtual reality in education [Article]. Educ. Tech. Res. Dev. 66(5), 1141–
1164 (2018)
8. Chien, S.Y., Hwang, G.J., Jong, M.S.Y.: Effects of peer assessment within the context of spher-
ical video-based virtual reality on EFL students’ English-Speaking performance and learning
perceptions. Comput. Educ. 146 (2020)
9. Gruber, A., Kaplan-Rakowski, R.: The impact of high-immersion virtual reality on foreign
language anxiety when speaking in public. SSRN Electron. J. (2022)

10. Riva, G., Mantovani, F., Capideville, C., Preziosa, A., Morganti, F., Villani, D., Gaggioli, A.,
Botella, C., Alcañiz Raya, M.: Affective interactions using virtual reality: the link between
presence and emotions. Cyberpsychol. Behav. 10, 45–56 (2007)
11. Hu-Au, E., Lee, J.: Virtual reality in education: a tool for learning in the experience age. Int. J.
Innov. Educ. 4 (2017)
12. Qiu, X.-Y., Chiu, C.-K., Zhao, L.-L., Sun, C.-F., Chen, S.-J.: Trends in VR/AR technology-
supporting language learning from 2008 to 2019: a research perspective. Interact. Learn.
Environ. (2021)
13. Allcoat, D., von Mühlenen, A.: Learning in virtual reality: effects on performance, emotion
and engagement. Res. Learn. Technol. 26 (2018)
14. Lin, V., Barrett, N., Liu, G.-Z., Chen, N.-S., Jong, M.S.-Y.: Supporting dyadic learning of
English for tourism purposes with scenery-based virtual reality. Comput. Assisted Language
Learn. (2021)
15. Berti, M., Maranzana, S., Monzingo, J.: Fostering cultural understanding with virtual reality:
a look at students’ stereotypes and beliefs. Int. J. Comput. Assisted Language Learn. Teach.
10, 47–59 (2020)
16. Kaplan-Rakowski, R., Gruber, A.: Low-immersion versus high-immersion virtual reality: defi-
nitions, classification, and examples with a foreign language focus. In: Proceedings of the
Innovation in Language Learning International Conference 2019, pp. 552–555. Pixel (2019)
17. Wadhera, M.: The information age is over; welcome to the experience age. Tech Crunch
(2016, May 9). https://techcrunch.com/2016/05/09/the-information-age-is-overwelcome-to-
the-experience-age/
18. Hagge, P.: Student perceptions of semester-long in-class virtual reality: effectively using
“google earth VR” in a higher education classroom. J. Geogr. High. Educ. 45, 1–19 (2020)
19. Lau, K., Lee, P.Y.: The use of virtual reality for creating unusual environmental stimulation to
motivate students to explore creative ideas. Interact. Learn. Environ. 23, 3–18 (2012)
20. Vygotsky, L.: Mind in society: the development of higher psychological processes (1978)
21. Kaplan-Rakowski, R., Wojdynski, T.: Students’ attitudes toward high-immersion virtual reality
assisted language learning. In: Taalas, P., Jalkanen, J., Bradley, L., Thouësny, S. (eds.),
Future-Proof CALL: Language Learning as Exploration and Encounters—Short Papers from
EUROCALL 2018, pp. 124–129 (2018)
22. European Union and Council of Europe. Common European Framework of Reference for
Languages: Learning, Teaching, Assessment (2004). https://europa.eu/europass/system/files/
2020-05/CEFR%20self-assessment%20grid%20EN.pdf
23. Braun, V., Clarke, V.: Using thematic analysis in psychology. Qual. Res. Psychol. 3, 77–101
(2006)
Chapter 11
Medical-Network (Med-Net): A Neural
Network for Breast Cancer Segmentation
in Ultrasound Image

Yahya Alzahrani and Boubakeur Boufama

Abstract Breast tumor segmentation is an important image processing technique


for cancer diagnosis and treatment. Recently, deep learning models have shown sig-
nificant advances toward computer-aided diagnosis (CAD) systems. We propose a
novel neural network with attention modules to segment tumors from breast ultrasound
(BUS) images. Inspired by the way the human brain interprets a scene,
in this contribution, we focus only on the salient areas of the image while suppressing
other details. The network is built on a residual encoder and a dense-block decoder.
The generated feature maps comprise spatial as well as channel details, and fusing
them produces a more meaningful feature map with better discriminative
characteristics. The results show that the proposed model outperformed several recent
models and has potential for clinical practices.

Keywords Convolutional neural network · BUS Images · Breast tumor


segmentation · Deep learning

11.1 Introduction

Breast cancer is by far the most common breast mass among women [1]. Clinical diagno-
sis in primary care clinics is a crucial factor in decreasing the risk of breast cancer
and providing earlier treatment for more positive outcomes for patients. Although
the mammogram is a well-known and reliable image modality that is used in breast
cancer diagnosis, it can be costly and comes with radiation risks from the use of
X-rays. Mammograms also tend to produce a high number of false-positive results.
In contrast, ultrasound (US) is an appropriate alternative for early stage cancer detec-
tion. A mammogram or magnetic resonance imaging (MRI) can be used in conjunc-
tion with US to provide additional evidence. Various medical image segmentation
techniques have emerged in the last decade. However, recent studies have further


developed existing computer-aided methods as they are often helpful in combination


with machine learning and deep learning (DL) approaches.
Edges distinguish separate regions within an image, and their characteristics can
reveal the presence of a cancerous tumor [2]. However, a common challenge in US
medical images is that the high occurrence of noise in such images can obscure
noticeable edges, thus making successful boundary detection challenging. Auto-
matic segmentation is the ultimate goal of many methods; some algorithms show
sophisticated performance if incorporated with prior knowledge or human interac-
tion, such as active contour[3] and region growing (RG) [4]. In the former, the initial
contour is defined by the user, and the contour usually evolves toward homogeneous
regions. Similarly, in the latter, the initial seed of RG algorithms is chosen, and
the neighboring parts are integrated in the predefined region in an iterative process.
However, this process is prone to errors, and the initialization is subjective, meaning
that the results are initialization dependent. Recently, machine learning-based algo-
rithms have attracted the interest of many researchers [5–10] due to the availability
of graphic processing units (GPUs) and appropriate data and their ability to provide
sophisticated outcomes. The human breast, like other organs of the body, varies in
shape, appearance, and size, which means that any diagnostic tool based on human
prior knowledge must also display considerable flexibility.
Learning-based methods have shown their superiority over other segmentation
methods. However, the number of data samples available is a significant factor for
any DL model. Usually, the lack of data in a medical field is a major challenge,
which raises the need for models that are capable of generalizing well even from a
small dataset. Moreover, convolutional operations, which are widely used in state-of-
the-art computer vision networks, cannot discard the unavoidable locality imposed
by the nature of convolutional filters themselves because they can only examine a
concentrated part of an image and thus miss long-range dependencies. A possible
solution to this issue can be a good attention mechanism. Attention concepts in
computer vision are inspired from the way humans process visual information. The
human eye focuses on certain parts of a scene rather than an entire scene as a whole,
allowing an accurate recognition of objects even when the differences between the
classes are subtle, and the classes themselves contain diverse samples.
In this article, we propose a novel attention-based convolutional
neural network (CNN) for breast US segmentation that comprises both channel and
spatial information. We also preserve important features using residual learning. The
rest of this paper is organized as follows: Sect. 11.2 provides a literature review of the
existing methods for breast US segmentation. Section 11.3 describes the proposed
segmentation model. Section 11.4 discusses the implementation and evaluation of
the model. Finally, Sect. 11.5 concludes this article.

11.2 Related Work

In recent years, advances in deep learning and neural networks have contributed
toward achieving fully automated US image segmentation and other relevant tasks
by overcoming several persistent challenges that many previous methods could not
effectively handle. Various deep neural network architectures have been proposed
to perform efficient segmentation and the detection of abnormalities. For example,
convolutional neural networks have been used for fully automated medical image
segmentation: patch-based neural networks are trained on image patches, while fully
convolutional networks, such as U-nets [11], perform pixel-wise prediction to form
the final segmentation. Boundary ambiguity is one of the major issues when using
fully connected networks (FCNs) for automatic US image segmentation, resulting
in the need for more refined deep learning architectures. In this light, one study
[12] proposed the use of cascaded FCNs to perform multiscale feature extraction.
Moreover, spatial consistency can be enhanced by adding an auto-context scheme
to the main architecture. U-nets are one of the most popular and well-established
deep learning architectures for image segmentation. They deliver high performance
even with a limited amount of training data. They are primarily CNNs that consist
of a downsampling path, which reduces the image size by performing convolutional
and pooling operations on the input image and extracts contextual features, and an
upsampling path, which reconstructs the image to recover the image size and various
details [13]. However, U-net encoder-based maxpooling tends to lose some localiza-
tion information. Therefore, many studies show significant improvement when it is
replaced by more sophisticated architectures, such as in the visual geometry group
network (VGG) [14]. V-nets [15] are a similar architecture that are applied to 3D US
images. They also face the limitation of inadequate training data. They consist of a
compression and decompression path, in a manner similar to U-nets for 2D images.
The incorporation of a 3D supervision mechanism facilitates accurate segmentation
by exploiting a hybrid loss function that has shown fast convergence. Transfer learn-
ing [16] has gained the attention of practitioners for various tasks. This approach has
succeeded in many applications and is one of the popular current approaches. It is
a convenient solution for any limited data task as these models are usually trained
on relatively huge datasets, such as Google’s Open Images, ImageNet, and CIFAR-
10. In the U-net base model, the effective use of skip connections between the two
paths has some drawbacks, such as suboptimal feature reusability and a consequently
increased need for computational resources. Other versions, [5, 6], have used atten-
tion mechanisms incorporated in the U-net architectures to improve performance,
especially for detection tasks. The addition of a spatial attention gate (SAG) and a
channel attention gate (CAG) to a U-net helps in locating the region of interest (ROI)
and explaining the feature representation, respectively. This type of technique is uti-
lized in numerous applications, such as machine translation and natural language
processing (NLP). Non-local means [17] and its extended version of the non-local
neural network [18], as well as machine translation [19], can be optimized through
a back propagation process in the training iterations and therefore are considered

soft attention modules. These soft attention mechanisms are very efficient and can
be plugged into CNNs. In contrast, hard attention non-differentiable operations are
not commonly used with CNNs. Attention mechanisms have proven successful in
sequence modeling as they allow the effective transmission of past information,
an advancement that was not possible with older architectures based on recurrent
neural networks. Therefore, self-attention can substitute convolutional operations to
improve the performance of neural networks on visual tasks. The best performance
though has been reported when both attention and convolutions are combined [20].

11.3 Proposed Breast Ultrasound Segmentation Model


(Med-Net)

The most significant obstacle in breast tumor US image segmentation is the shape
variation because the size of a tumor can vary, and normally, the border of a tumor is
very close to the surrounding tissue. Frequently, the majority of the data points
belong to the background. Therefore, small tumors are quite difficult to identify. This raises
what is called a class imbalance problem. One of the popular ways to address this
problem is to force more weight on the minority class in an imbalanced dataset.
This can be achieved using a weighted objective function. Preprocessing may also
be exploited to manipulate the image data in a way that helps reduce the problem;
for example, shifting and scaling the image along its width and height may yield
some accuracy gains. In a similar classification task, oversampling the minority
class will balance the data and help to tackle the imbalance problem.
Inspiration from the human brain interpretation of visual perspective has influenced
deep learning researchers to adapt the same concepts to recognizing objects in CNNs
and related tasks. Many contributions in the literature have applied this concept in
applications, such as classification [21], detection [22], and segmentation [23].

11.3.1 Encoder Architecture

In this article, we present a neural network for breast ultrasound image segmentation
as shown in Fig. 11.1. Our solution is a general-purpose model and can be utilized on
similar vision tasks. When the network processes the data to downsample the spatial
dimensions, some meaningful details may be lost. Although pooling is a must in
CNNs, we employed residual blocks across the encoder of our network to
keep track of the previous activations of each layer and sum up the feature maps
before fusion, which appears to be a good way to address this issue. When encoding
the data, one of the keys is to maintain the dimension reductions and to exploit the
high-level information that carries spatial information while extracting the feature
vector.

Fig. 11.1 Proposed neural network architecture

To further enhance our network, we employed an attention module as described


in the next subsection. This is similar to [21] but with a more meaningful feature
map. However, the localization information can be preserved in a U-net-like archi-
tecture as in our proposal using residual blocks that add raw representations to the
refined feature map produced by each layer. The encoder’s residual block is shown
in Fig. 11.2. Each block in our encoder can be represented as follows:

$$\tilde{x}_l = A_l\big(f_n(C_{(K,n)}(C_{(k,n)}(x_l)))\big) + x_l \qquad (11.1)$$

where $\tilde{x}_l$ is the output of the $l$-th layer, $x_l$ is the input to the residual block, and $C_{(K,n)}$ is a convolution layer with a filter size of $K$ and $n$ filters ($n$ = 32, 64, 128, 256, 512, and 1024, respectively). $A$ denotes an attention unit; $K$ in layers $l_1$ and $l_2$ is of size $1 \times 7$ and $7 \times 1$, respectively, and of a symmetric size of five and three, respectively, in the subsequent layers in both residual blocks and attention units. However, the last residual block utilizes a $k = 1$ square filter.
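As a concrete reading of Eq. (11.1), the following Keras sketch shows one encoder residual block. The BatchNorm/ReLU placement, the interpretation of $f_n$ as normalization plus activation, and the $1 \times 1$ projection on the skip path (used here only to match channel counts for the summation) are assumptions, since the text does not spell them out; the attention callable corresponds to the unit of Sect. 11.3.2, sketched after Eq. (11.3).

```python
from tensorflow.keras import layers

def encoder_residual_block(x, n_filters, kernel=(3, 3), attention=None):
    """One encoder block in the spirit of Eq. (11.1): two convolutions, an
    optional attention unit A_l, and a skip connection back to the input x_l.
    For the first two blocks the text uses asymmetric kernels; pass
    kernel=(1, 7) or (7, 1) accordingly."""
    y = layers.Conv2D(n_filters, kernel, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(n_filters, kernel, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    if attention is not None:                 # A_l, e.g. the unit of Sect. 11.3.2
        y = attention(y, n_filters)
    # 1x1 projection so the skip matches n_filters channels (an assumption;
    # Eq. (11.1) writes the skip simply as + x_l).
    skip = layers.Conv2D(n_filters, 1, padding="same")(x)
    return layers.Add()([y, skip])
```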

11.3.2 Attention Module

Inspired by the work of Hu et al. [24], which is one of the early models that pro-
posed channel attention for vision tasks, a squeeze block employs a channel-wise
module that learns dependencies across the channels. However, in detection tasks,
this work may suffer from the lack of localization information needed. Similarly,
the work in [21] adds more spatial information that can be taken into account to
look into the salience map that comprises the channel and spatial details. A squeeze

Fig. 11.2 Residual blocks utilized in our proposed encoder to downsample the spatial dimensions

operator incorporates global feature information in an aggregated feature set across


the spatial axis using average pooling, followed by an excitation operator, which
assigns per channel weights given the respective feature channel. However, in this
work, we propose attention unit-based residual learning instead of global pooling,
which is meant for adding more importance to the spatial features incorporated with
the relevant channel features, and boosts the performance of the network.
Inspired by the two previously mentioned attention models, we propose an attention
unit with two branches, one channel-wise and one spatial, to improve the discriminative
characteristics of the incorporated feature maps. The channel-wise pooling path employs
a global maxpooling layer followed by a convolution operation. The feature vector is
shrunk around the channel axis, and the following 1 × 1 convolution further emphasizes
what has been captured. The other branch utilizes a residual block, as shown
in Fig. 11.3, to add more spatial contextual representation. Early representations of
the previous layer are used to produce a final feature map. Both branches are incor-
porated using element-wise summation and then multiplied by the attention input.
Let $M \in \mathbb{R}^{H \times W \times C}$ be the input feature map to the attention unit. We first downsize the input feature maps using a maxpooling layer so that the spatial details are grouped and represented by the matrix $F_{max} \in \mathbb{R}^{H \times W \times C}$. It is then squeezed around the channel axis to produce $F_{max} \in \mathbb{R}^{1 \times 1 \times C}$, which is then convolved by $1 \times 1$ filters. We also employed a residual block as another branch to preserve the localization details that may be lost from the first branch. This block is followed by a $1 \times 1$ convolution for more refined feature maps. Let $F^{\sim} \in \mathbb{R}^{H \times W \times C}$ denote the output of this branch; then, it can be written as:

Fig. 11.3 Proposed attention module

$$F^{\sim} = \sigma\big(C_{(K,n)}(R \otimes M)\big) \qquad (11.2)$$

where $K$ is a square filter of size $7 \times 7$; $n$ is the number of channels, which is equal to the input channels; $C$ is a convolution operation; $R$ represents the residual block; and $\sigma$ denotes the sigmoid function. The two feature descriptors from both branches are added together using element-wise summation and then multiplied by the input as follows:

$$M^{\sim} = \sigma(F_{max} \oplus F^{\sim}) \otimes M \qquad (11.3)$$
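A hedged Keras sketch of this attention unit follows. The text and Eq. (11.2) differ on the kernel size of the spatial branch ($1 \times 1$ vs. $7 \times 7$), so the sizes below are one plausible reading; the plain residual helper, the assumption that the input already has `n_filters` channels, and the broadcasting of the channel descriptor over spatial positions are ours rather than the authors'.

```python
from tensorflow.keras import layers

def plain_residual_block(x, n_filters, kernel=3):
    # Two convolutions plus an identity skip; assumes x already has n_filters channels.
    y = layers.Conv2D(n_filters, kernel, padding="same", activation="relu")(x)
    y = layers.Conv2D(n_filters, kernel, padding="same")(y)
    return layers.Add()([y, x])

def attention_unit(m, n_filters):
    """Channel and spatial branches fused in the spirit of Eqs. (11.2)-(11.3)."""
    # Channel branch: global max pooling squeezes H and W, then a 1x1 convolution.
    f_max = layers.GlobalMaxPooling2D()(m)                # (batch, C)
    f_max = layers.Reshape((1, 1, n_filters))(f_max)      # assumes C == n_filters
    f_max = layers.Conv2D(n_filters, 1, padding="same")(f_max)
    # Spatial branch: residual refinement, convolution, sigmoid (F~ of Eq. 11.2).
    r = plain_residual_block(m, n_filters)
    f_sp = layers.Conv2D(n_filters, 7, padding="same")(r)
    f_sp = layers.Activation("sigmoid")(f_sp)
    # Eq. (11.3): sigmoid(F_max + F~) gates the input M element-wise; the
    # channel descriptor is broadcast over the spatial positions.
    fused = layers.Lambda(lambda t: t[0] + t[1])([f_sp, f_max])
    gate = layers.Activation("sigmoid")(fused)
    return layers.Multiply()([gate, m])
```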

11.3.3 Decoder Architecture

In this work, we utilize four upsampling layers based on dense blocks. The low-level
features are first concatenated with the outputs of the encoder's residual and attention
units and passed to the dense block. To take advantage of the large-size feature maps
concatenated from early layers, we employ a dense block in the decoder, which
includes two convolutional layers with 12 filters, each prefixed with a batch normalization
layer and a rectified linear unit (ReLU) activation layer to add non-linearity. The
dense block [25] was utilized to feed forward input as well as the output of each
layer to the subsequent layers. The decoder path in our model consists of a 2×2
upsampling operation, batch normalization, ReLU activation, and 3×3 convolution.
This can be written as:

$$U_l = f_n\big(\delta(D_{(K,n)}(T_{l-1} \parallel A_l \parallel R_l))\big) \qquad (11.4)$$

where $U_l$ is the output of layer $l$, $T_{l-1}$ is the output of the $(l-1)$-th transposed layer, $f$ is a fully connected layer, $D_{(K,n)}$ denotes the dense block with $n = 12$ kernels of size $K = 7 \times 7$, $\delta$ denotes the ReLU function, $A_l$ denotes an attention unit, $\parallel$ represents concatenation, and $R_l$ is the output of the $l$-th encoder block.
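One decoder stage of Eq. (11.4) could then look like the sketch below. The two-layer dense block with 12 filters per layer follows the text; the ordering of the dense block relative to the 2 × 2 upsampling and the assumption that $T_{l-1}$, $A_l$, and $R_l$ already share the same spatial size are ours.

```python
from tensorflow.keras import layers

def dense_block(x, n_layers=2, growth=12, kernel=3):
    # DenseNet-style block: each layer receives the concatenation of all
    # previous feature maps (two layers with 12 filters each, as in the text).
    for _ in range(n_layers):
        h = layers.BatchNormalization()(x)
        h = layers.Activation("relu")(h)
        h = layers.Conv2D(growth, kernel, padding="same")(h)
        x = layers.Concatenate()([x, h])
    return x

def decoder_stage(t_prev, a_l, r_l, n_filters):
    """T_{l-1} || A_l || R_l -> dense block -> 2x2 upsampling -> BN -> ReLU -> 3x3 conv."""
    # All three inputs are assumed to share the same spatial size at this point.
    x = layers.Concatenate()([t_prev, a_l, r_l])
    x = dense_block(x)                                   # D_{(K,n)} of Eq. (11.4)
    x = layers.UpSampling2D(size=(2, 2))(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)                     # delta (ReLU)
    return layers.Conv2D(n_filters, 3, padding="same")(x)
```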

11.3.4 Implementation Details

In this study, the experiments were conducted using a Keras/TensorFlow 2.3.1 back-
end with Python 3.6 on Windows. The computer was equipped with an NVIDIA
GeForce 1080 Ti with 11 GB of GPU memory. We performed a five-fold cross-
validation to evaluate our model. The data were split randomly in each fold into
two sets with a ratio of 80% for training and 20% for the validation set. It is
worth mentioning that the model was trained on both datasets separately. First,
the images were resized to 256×256 spatial dimensions, and a preprocessing tech-
nique was applied to further enhance the performance. It involved several trans-
formations: horizontal flip ( p = 0.5); random brightness contrast ( p = 0.2); ran-
dom gamma [gamma_limit = (80, 120)]; adaptive histogram equalization ( p = 1.0,
threshold value for contrast limiting = 2.0); grid distortion ( p = 0.4); shift, scale,
and rotate (shift_limit = 0.0625, scale_limit = 0.1, rotate_limit = 15). Finally, p
is the probability of applying a transformation. In this work, these transformations
were applied on all the experiments, including the models which were used for com-
parison.
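The transformation names and parameters listed above match the albumentations API, so the sketch below assumes that library was the one used; the paper itself does not name the augmentation package.

```python
import albumentations as A

# The pipeline mirrors the transformations and probabilities listed in the text.
augment = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.2),
    A.RandomGamma(gamma_limit=(80, 120)),
    A.CLAHE(clip_limit=2.0, p=1.0),          # adaptive histogram equalization
    A.GridDistortion(p=0.4),
    A.ShiftScaleRotate(shift_limit=0.0625, scale_limit=0.1, rotate_limit=15),
])

# Applied jointly to each resized 256x256 image and its mask during training:
# out = augment(image=image, mask=mask); image, mask = out["image"], out["mask"]
```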
Our proposed model has 16 million trainable parameters and was optimized using
the adaptive moment estimation (Adam) optimizer [26]. We set the adaptive learning
rate initially at 0.0001 with a minimum rate of 0.000001 and used a batch size of 4 to train our
model. To prevent overfitting, we set all the experiments to terminate
the training if no improvement was recorded within 10 epochs. Due to its robustness
against the imbalanced class issue, in this work, we used the Dice loss function to
train the model given by:

$$\mathrm{Loss} = 1 - \frac{2\sum_{i}^{N} p_i \, q_i}{\sum_{i}^{N} p_i + \sum_{i}^{N} q_i} \qquad (11.5)$$

This loss function produces a value in $[0, 1]$, where $p_i$ is the predicted pixel and $q_i$ denotes the corresponding pixel of the true mask.
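A minimal Keras sketch of this Dice loss and of the training configuration described above is given below; the smoothing constant, the ReduceLROnPlateau factor and patience, and the epoch count in the commented usage lines are assumptions added for completeness, not settings stated in the text.

```python
import tensorflow as tf

def dice_loss(y_true, y_pred, smooth=1e-6):
    # Flatten both masks and compute the soft Dice loss of Eq. (11.5);
    # the smoothing term is an assumption for numerical stability.
    y_true = tf.reshape(tf.cast(y_true, tf.float32), [-1])
    y_pred = tf.reshape(tf.cast(y_pred, tf.float32), [-1])
    intersection = tf.reduce_sum(y_true * y_pred)
    return 1.0 - (2.0 * intersection + smooth) / (
        tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) + smooth)

# Hedged usage sketch:
# model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4), loss=dice_loss)
# callbacks = [
#     tf.keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True),
#     tf.keras.callbacks.ReduceLROnPlateau(patience=5, factor=0.5, min_lr=1e-6),
# ]
# model.fit(train_images, train_masks, batch_size=4, epochs=100, callbacks=callbacks)
```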

11.3.5 Dataset

Two BUS datasets named UDIAT [11] and BUSIS [27–30] were used for training
and validating the model. UDIAT has fewer samples, namely 163 images along with
their labels. This dataset was collected by the Parc Taulí Corporation Diagnostic Center,
Sabadell (Spain), in 2012. BUSIS has 562 images of benign and malignant tumors
along with the ground truth. The latter was collected by different institutions using
different scanners, which makes it a very reliable data source. Both datasets present a
single tumor in each image. Most images in these datasets present small tumors, in
which the background represents the majority of the data points. This situation
introduces the so-called class imbalance problem, which needs to be carefully handled.

11.4 Discussion

Tumor tissues in breast US images are of different shapes and can appear in different
locations. However, most of the tumors occupy only a small area of pixels. Therefore,
in the early layers, small kernels can capture local discrepancies, and there is also a
need for a large receptive field to cover more pixels to consider the semantic correla-
tions. This helps to preserve the location information before fusing the feature maps
in the subsequent layers. Moreover, the divergence of intensities in the vertical neigh-
boring pixels is very small. However, a large receptive field kernel creates overhead
regarding memory resources. To overcome this challenge, we utilized kernel dimensions of
1 × 7 and 7 × 1 in the first two layers, respectively. Then, the size was narrowed
down in the following layers as the dimensions of the feature maps increased. This
approach preserves the long-range dependencies with a significant improvement in
the produced features, thus providing better feature map representations.
In this article, we introduced a novel breast US image segmentation model that can
be utilized and extended for any segmentation task. Our model has been demonstrated
to be robust with imbalanced class data as seen from the results. The model was
evaluated using various metrics: Dice coefficient (DSC), Jaccard index (JI), true-
positive ratio (TPR), and false-positive ratio (FPR). In this work, we evaluated our
proposed model quantitatively and qualitatively, and the model proved to be stable
and robust for breast US image segmentation.
We compared our proposed model with four others; two of them were recent
successful models for medical image segmentation: M-net [31] and squeeze-U-Net
[32]. These two models were implemented and trained from scratch. We also utilized
selective kernel U-Net (SK-U-Net) [33], STAN [34], and U-Net-SA [35], originally trained on
UDIAT and BUSIS, for comparison only, as these models were designed for breast US
images. Therefore, the scores were taken as reported in their articles. The evaluation
metrics are given by the following equations:

$$DSC = \frac{2TP}{2TP + FP + FN} \qquad (11.6)$$

$$JI = \frac{TP}{TP + FN + FP} \qquad (11.7)$$

Table 11.1 Evaluation metrics for all models given by the average score of five-fold cross-validation
on (UDIAT) dataset
Model Dataset DSC JI TPR FPR
Proposed model UDIAT 0.794 0.673 0.777 0.007
Squeeze U-Net [32] UDIAT 0.721 0.585 0.701 0.008
M-Net [31] UDIAT 0.748 0.615 0.740 0.007
STAN [34] UDIAT 0.782 0.695 0.801 0.266
SK-U-Net [33] UDIAT 0.791 – – –

Table 11.2 Comparison and evaluation metrics for the models given by the average score of five-
fold cross-validation on (BUSIS) dataset
Model Dataset DSC JI TPR FPR
Proposed model BUSIS 0.920 0.854 0.906 0.007
Squeeze U-Net BUSIS 0.912 0.841 0.910 0.009
M-Net BUSIS 0.909 0.836 0.915 0.009
STAN BUSIS 0.912 0.847 0.917 0.093
U-Net-SA [35] BUSIS 0.905 0.838 0.910 0.089

Fig. 11.4 Training curves using five-fold cross-validation on the UDIAT dataset



$$TPR = \frac{TP}{TP + FN} \qquad (11.8)$$

$$FPR = \frac{FP}{FP + TN} \qquad (11.9)$$
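For reference, the four metrics can be computed from a binarized prediction and its ground-truth mask as in the following NumPy sketch; thresholding the network output into a binary mask is assumed to happen beforehand.

```python
import numpy as np

def segmentation_metrics(pred, truth):
    """DSC, JI, TPR, and FPR (Eqs. 11.6-11.9) from two binary masks of equal shape."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.sum(pred & truth)
    fp = np.sum(pred & ~truth)
    fn = np.sum(~pred & truth)
    tn = np.sum(~pred & ~truth)
    # Assumes the image contains both foreground and background pixels,
    # so no denominator is zero.
    return {
        "DSC": 2 * tp / (2 * tp + fp + fn),
        "JI": tp / (tp + fn + fp),
        "TPR": tp / (tp + fn),
        "FPR": fp / (fp + tn),
    }
```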

The results showed that our model outperformed the others that were examined.
Tables 11.1 and 11.2 show the obtained results, and Fig. 11.6 shows the performance
of the model on the BUSIS dataset.
In terms of scores on the UDIAT dataset, which had only a few samples, the model
has proven to be very efficient for a small-sized dataset, as a Dice score of 0.79 and
a JI score of 0.67 were achieved. The selective kernel U-Net (SK-U-Net) gained a very
close score, having the second highest Dice score on the dataset. However, it was
trained on a relatively large private dataset compared to our model, which was
trained on only 163 samples. Moreover, STAN achieved a JI score of 0.69 and had the
highest TPR and FPR scores, which indicates that it may identify some background
pixels as a tumor. In contrast, our model and M-Net scored the lowest FPR, and this

Fig. 11.5 Segmentation sample cases produced by different models used in this study and our
proposed network (Med-Net) using UDIAT dataset

Fig. 11.6 Performance curves using five-fold cross-validation on the BUSIS dataset

can be seen in Fig. 11.5, which shows very little positive area outside of the tumor
boundaries. Fig. 11.4 shows the performance of the model on the UDIAT dataset.
The other models that were implemented and trained in this study were M-net
and squeeze-U-Net. M-net had few parameters and showed decent performance with
both datasets, achieving Dice and JI scores of 0.74 and 0.61 on UDIAT, respectively.
Squeeze U-net, which was a modified version of U-Net [36] equipped with a squeeze
module [37], achieved Dice and JI scores of 0.72 and 0.58, respectively.
In contrast, our model scored the lowest FPR on BUSIS dataset, and this can be
seen in Fig. 11.7, which shows very little positive area outside of the tumor bound-
aries. Our proposed model also achieved the highest performance on BUSIS dataset

Fig. 11.7 Segmentation sample cases produced by different models used in this study and our
proposed network (Med-Net) using BUSIS dataset

with Dice and JI scores of 0.92 and 0.85, respectively. It is clear that our model has the
best FPR of all the models. STAN also gained the highest TPR score and the second
best JI. An adequate performance from all the models was shown with this dataset.
This is due to the fact that this dataset was collected and annotated by experts from
different institutions. It had a large number of samples and was also produced by
different devices, which makes it suitable for evaluating and justifying segmentation
tasks. Overall, our model proved its superiority over the other models in this study
when all the results are considered. Our model could be computationally expensive
with very large-scale image data. The Med-Net model can be further extended in the
future by adding more data and examining different types of images, such as computed
tomography (CT), MRI, and X-ray, of different organs.

11.5 Conclusion

In this article, we presented a novel U-Net-like CNN for breast US image segmen-
tation. The model was equipped with visual attention modules to focus only on the

salient features and suppress irrelevant details. The proposed network was able to
extract the important features while considering spatial and channel-wise informa-
tion. Dense blocks were used along the construction path to provide full connectivity
between the layers within the blocks. The model was validated on two breast US
image datasets and showed promising results and enhanced performance. Although
the model was meant for breast US images, it can be utilized for any computer vision
segmentation task with some modifications.

References

1. Sung, H., Ferlay, J., Siegel, R.L., Laversanne, M., Soerjomataram, I., Jemal, A., Bray, F.: Global
cancer statistics 2020: globocan estimates of incidence and mortality worldwide for 36 cancers
in 185 countries. CA Cancer J. Clin. 71(3), 209–249 (2021)
2. Nugroho, H., Khusna, D.A., Frannita, E.L.: Detection and classification of breast nodule on
ultrasound images using edge feature (2019)
3. Lotfollahi, M., Gity, M., Ye, J., Far, A.: Segmentation of breast ultrasound images based on
active contours using neutrosophic theory. J. Medical Ultrasonics 45, 1–8 (2017)
4. Kwak, J.I., Kim, S.H., Kim, N.C.: Rd-based seeded region growing for extraction of breast
tumor in an ultrasound volume. Comput. Intel. Secur. 799–808 (2005)
5. Khanh, T., Duy Phuong, D., Ho, N.H., Yang, H.J., Baek, E.T., Lee, G., Kim, S., Yoo, S.:
Enhancing u-net with spatial-channel attention gate for abnormal tissue segmentation in med-
ical imaging. Appl. Sci. 10 (2020)
6. Schlemper, J., Oktay, O., Chen, L., Matthew, J., Knight, C., Kainz, B., Glocker, B., Rueckert,
D.: Attention-gated networks for improving ultrasound scan plane detection (2018)
7. Suchindran, P., Vanithamani, R., Justin, J.: Computer aided breast cancer detection using ultra-
sound images. Mat. Today Proc. 33 (2020)
8. Suchindran, P., Vanithamani, R., Justin, J.: Computer aided breast cancer detection using ultra-
sound images. Mat. Today Proc. 33 (2020)
9. Nithya, A., Appathurai, A., Venkatadri, N., Ramji, D., Anna Palagan, C.: Kidney disease detec-
tion and segmentation using artificial neural network and multi-kernel k-means clustering for
ultrasound images. Measurement 149, 106952 (2020). https://www.sciencedirect.com/science/
article/pii/S0263224119308188
10. Alzahrani, Y., Boufama, B.: Biomedical image segmentation: a survey. SN Comput. Sci. 2(4),
1–22 (2021)
11. Yap, M.H., Pons, G., Martí, J., Ganau, S., Sentís, M., Zwiggelaar, R., Davison, A.K., Martí,
R.: Automated breast ultrasound lesions detection using convolutional neural networks. IEEE
J. Biomed. Health Inform. 22(4), 1218–1226 (2018)
12. Wu, L., Xin, Y., Li, S., Wang, T., Heng, P., Ni, D.: Cascaded fully convolutional networks for
automatic prenatal ultrasound image segmentation, pp. 663–666 (2017)
13. Almajalid, R., Shan, J., Du, Y., Zhang, M.: Development of a deep-learning-based method for
breast ultrasound image segmentation, pp. 1103–1108 (2018)
14. Nair, A.A., Washington, K.N., Tran, T.D., Reiter, A., Lediju Bell, M.A.: Deep learning to obtain
simultaneous image and segmentation outputs from a single input of raw ultrasound channel
data. IEEE Trans. Ultrasonics Ferroelectrics Freq. Control 67(12), 2493–2509 (2020)
15. Lei, Y., Tian, S., He, X., Wang, T., Wang, B., Patel, P., Jani, A., Mao, H., Curran, W., Liu,
T., Yang, X.: Ultrasound prostate segmentation based on multi directional deeply supervised v
net. Med. Phys. 46 (2019)
16. Liao, W.X., He, P., Hao, J., Wang, X.Y., Yang, R.L., An, D., Cui, L.G.: Automatic identification
of breast ultrasound image based on supervised block-based region segmentation algorithm and
features combination migration deep learning model. IEEE J. Biomed. Health Inform. 1 (2019)

17. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polo-
sukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems,
pp. 5998–6008 (2017)
18. Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794–7803 (2018)
19. Zhang, B., Xiong, D., Su, J.: Neural machine translation with deep attention. IEEE Trans.
Pattern Anal. Mach. Intel. 42(1), 154–163 (2020)
20. Bello, I., Zoph, B., Vaswani, A., Shlens, J., Le, Q.V.: Attention augmented convolutional
networks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.
3286–3295 (2019)
21. Woo, S., Park, J., Lee, J.Y., Kweon, I.S.: Cbam: convolutional block attention module. In:
Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19 (2018)
22. Ba, J., Mnih, V., Kavukcuoglu, K.: Multiple object recognition with visual attention (2014).
arXiv:1412.7755
23. Valanarasu, J.M.J., Oza, P., Hacihaliloglu, I., Patel, V.M.: Medical transformer: gated axial-
attention for medical image segmentation (2021). arXiv:2102.10662
24. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE Con-
ference on Computer Vision and Pattern Recognition, pp. 7132–7141 (2018)
25. Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional
networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recogni-
tion, pp. 4700–4708 (2017)
26. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization (2014). arXiv:1412.6980
27. Xian, M., Zhang, Y., Cheng, H.D., Xu, F., Huang, K., Zhang, B., Ding, J., Ning, C., Wang, Y.:
A benchmark for breast ultrasound image segmentation (BUSIS). Infinite Study (2018)
28. Xian, M., Zhang, Y., Cheng, H.D.: Fully automatic segmentation of breast ultrasound images
based on breast characteristics in space and frequency domains. Pattern Recogn. 48(2), 485–497
(2015)
29. Cheng, H.D., Shan, J., Ju, W., Guo, Y., Zhang, L.: Automated breast cancer detection and
classification using ultrasound images: a survey. Pattern Recogn. 43(1), 299–317 (2010)
30. Xian, M., Zhang, Y., Cheng, H.D., Xu, F., Zhang, B., Ding, J.: Automatic breast ultrasound
image segmentation: a survey. Pattern Recogn. 79, 340–355 (2018)
31. Mehta, R., Sivaswamy, J.: M-net: A convolutional neural network for deep brain structure
segmentation. In: 2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI
2017), pp. 437–440 (2017)
32. Beheshti, N., Johnsson, L.: Squeeze u-net: A memory and energy efficient image segmenta-
tion network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition Workshops, pp. 364–365 (2020)
33. Byra, M., Jarosik, P., Szubert, A., Galperin, M., Ojeda-Fournier, H., Olson, L., O’Boyle, M.,
Comstock, C., Andre, M.: Breast mass segmentation in ultrasound with selective kernel u-net
convolutional neural network. Biomed. Signal Process. Control 61, 102027 (2020)
34. Shareef, B., Xian, M., Vakanski, A.: Stan: small tumor-aware network for breast ultrasound
image segmentation. In: 2020 IEEE 17th International Symposium on Biomedical Imaging
(ISBI), pp. 1–5 (2020)
35. Vakanski, A., Xian, M., Freer, P.E.: Attention-enriched deep learning model for breast tumor
segmentation in ultrasound images. Ultrasound Med. Biol. 46(10), 2819–2833 (2020)
36. Ronneberger, O., Fischer, P., Brox, T.: U-net: convolutional networks for biomedical image seg-
mentation. In: International Conference on Medical Image Computing and Computer-assisted
Intervention, pp. 234–241. Springer (2015)
37. Iandola, F.N., Han, S., Moskewicz, M.W., Ashraf, K., Dally, W.J., Keutzer, K.: Squeezenet:
Alexnet-level accuracy with 50x fewer parameters and <0.5 mb model size (2016).
arXiv:1602.07360
Chapter 12
Auxiliary Squat Training Method Based
on Object Tracking

Yunxiang Pang, Haiyang Sun, and Yiqun Pang

Abstract Background: The deep squat is not only one of the basic movement
patterns of the human body, but also a compound movement that can directly train
the hip force and has a good exercise effect on the posterior chain muscle groups.
However, improper movement patterns can reduce the quality of the action. Research objective: In order to improve training efficiency and reduce sports injuries, a method that can optimize the technique and movements of the deep squat needs to be designed. Methods: A tracking algorithm based on template matching, combined with biomechanical knowledge, was used to analyze the movement separately in the sagittal and coronal planes. Emphasis is placed on analyzing the power chain and the imbalance between the two sides of the body. Accordingly, two force arms and two angles represent the power chain, and three line segments represent the balance of both sides of the body. Results: The performance was
more stable during the actual scenario test, and the motion information could be
accurately captured and analyzed. Conclusion: This method can obtain the force arm
and joint angle of the deep squat movement, and also assist in screening the balance
of both sides of the limb. Thus, the pattern and rhythm of the action can be adjusted
accordingly.

Y. Pang
Zibo Normal College, Zibo 255100, China
e-mail: [email protected]
H. Sun
School of Physics & Optoelectronic Engineering, Shandong University of Technology,
Zibo 255100, China
e-mail: [email protected]
Han Tang Power Lifting, Qufu 255100, China
Y. Pang (B)
Institute of Artificial Intelligence in Sports, Capital University of Physical Education and Sports,
Beijing 100191, China
e-mail: [email protected]


12.1 Introduction

Squatting is not only one of the basic movement patterns of the human body, but also a
compound movement that can directly train the hip power, which has a good exercise
effect on the posterior chain muscle group. The weighted squat is an important phys-
ical training action which is able to enhance one’s strength. However, poor movement
patterns often lead to reduced efficiency and even trigger injury [1]. Computer vision
is increasingly used in sports analysis because of its non-contact advantages. Soft-
ware such as Iron Path [2] and WL Analysis [3] captures the movement trajectory
of the center of the barbell plate and provides general-purpose solutions for weightlifting
technique analysis, but neither records information about the movement of the human
joints. It is a common research method to analyze the angle of joints in other sports,
such as race walking [4] and martial arts [5]. In order to record richer exercise infor-
mation and improve training efficiency to reduce sports injuries, we tried to design
a method to optimize the technical movements of squatting utilizing a computer
vision-based approach combined with knowledge of biomechanics. The code for
this study is now available [6]. We also produced a dataset that can be used to
train object detection models, which is now available [7].

12.2 Material and Methods

12.2.1 The Basic Force Arm of Deep Squat

Good technique in the deep squat refers to the ability to maintain a zero-force arm
between the barbell bar and the center balance point of the foot. If a force arm is
present between the bar and the foot center balance point, the lifter
wastes a lot of extra force. A proper deep squat will have some specific, recognizable
characteristics controlled by bone structure and muscle function. Any kind of deep
squat, whether it is a back squat or a front squat, should meet these conditions so that
the lifter can more easily determine if his or her posture and movement are correct.
At the top of the deep squat, all the skeletal parts supporting the barbell (knees, hips,
and spine) are locked in extension, so the muscles only need to exert enough
force to maintain the position, because the force acting on the bones at this point
is mainly compressive. In this state, the task of the muscles
is to keep the bones correctly aligned in a straight line so that they can support the
appropriate weight. At this point, the barbell bar is directly above the center of the
foot. The greater the weight, the more important this position becomes [8].
When the lifter begins to enter the centrifugal phase of the squat and gradually
moves toward the bottom, all the muscles that will eventually stretch the hip and
knee joints in the centripetal phase, as well as the erector spinae muscles that remain
isometrically contracted in this state, but under increased stress, are in a state of
stress, and at the same time have to contend with moments along with all parts of

the body. During the squat, the barbell bar must remain directly above the center
of the foot. We can confirm the correct bottom position with the help of anatomical
markers:
• The spine should remain rigid while the lumbar and thoracic spine in extension.
• The barbell bar is directly above the center of the feet.
• The feet are flat on the ground, maintaining the correct angle of abduction and
standing distance.
• The thighs are parallel to the feet.
• The hip joint is in a position below the top of the patella.
Any position that does not meet these points, and any movement that deviates from
this position during the squat and rise, contains poor technique. In fact, if the bar is
kept on a vertical plane, directly above the center of the foot, during the squatting
and standing up process, as if the bar is sliding in a narrow space perpendicular to
the center of the foot, the action is correct. The skeleton will work out on its own
how to most effectively use the muscles to complete the deep squat. It will complete
the deep squat within the constraints of the mechanism by which the barbell body
gravity system works.

12.2.2 Object Tracking

The template regions are selected on the initial frame, and the similarity between each
candidate region and the template is expressed using the normalized cross-correlation
matrix [9, 10], thus enabling the tracking of the barbell plate, hip, knee, and ankle.
Let the pixel size of the image I to be matched be M × N and the pixel size
of the template T be m × n. The coordinates of the upper left corner of a piece of
sub-image I x,y with pixel size m × n chosen arbitrarily from the image I are (x, y),
and the coordinates can be found in the range, x ∈ [0, M-m], y ∈ [0, N-n], where M
and N are the number of rows and columns of image pixels to be matched, and m
and n are the number of rows and columns of template pixels, respectively.
The normalized cross-correlation value R(x, y) [11] of the sub-image I_{x,y} and
the template T is defined as:

R(x, y) = \frac{\sum_{i=0}^{m-1}\sum_{j=0}^{n-1}\left(I_{x+i,y+j} - \bar{I}_{x,y}\right)\left(T_{i,j} - \bar{T}\right)}{\sqrt{\sum_{i=0}^{m-1}\sum_{j=0}^{n-1}\left(I_{x+i,y+j} - \bar{I}_{x,y}\right)^2 \, \sum_{i=0}^{m-1}\sum_{j=0}^{n-1}\left(T_{i,j} - \bar{T}\right)^2}}   (12.1)

In Eq. (12.1), i and j are the coordinates of the pixels in the template. All the
normalized cross-correlation values form the normalized cross-correlation matrix R.
The average pixel value of the sub-image I_{x,y} is

\bar{I}_{x,y} = \frac{1}{m \times n}\sum_{i=0}^{m-1}\sum_{j=0}^{n-1} I_{x+i,y+j}   (12.2)

The average pixel value of the template T is

\bar{T} = \frac{1}{m \times n}\sum_{i=0}^{m-1}\sum_{j=0}^{n-1} T_{i,j}   (12.3)

and R_T is defined as:

R_T = \sum_{i=0}^{m-1}\sum_{j=0}^{n-1}\left(T_{i,j} - \bar{T}\right)^2   (12.4)

Since the template T is known and R_T is a constant, positive value throughout
the search process, it does not affect the determination of the optimal solution and
need not be calculated, so the denominator part of Eq. (12.1) can be written as:

R_{den}(x, y) = \sqrt{\sum_{i=0}^{m-1}\sum_{j=0}^{n-1}\left(I_{x+i,y+j} - \bar{I}_{x,y}\right)^2}   (12.5)


Let T'(i, j) = T(i, j) − \bar{T}; then the numerator part of Eq. (12.1) can be simplified
as follows:

R_{num} = \sum_{i=0}^{m-1}\sum_{j=0}^{n-1}\left(I_{x+i,y+j} - \bar{I}_{x,y}\right)\left(T_{i,j} - \bar{T}\right) = \sum_{i=0}^{m-1}\sum_{j=0}^{n-1}\left(I_{x+i,y+j} - \bar{I}_{x,y}\right) T'(i, j)

= \sum_{i=0}^{m-1}\sum_{j=0}^{n-1} I_{x+i,y+j}\, T'(i, j) - \bar{I}_{x,y}\sum_{i=0}^{m-1}\sum_{j=0}^{n-1} T'(i, j)

Since \sum_{i=0}^{m-1}\sum_{j=0}^{n-1} T'(i, j) = 0 by the definition of \bar{T}, it follows that

R_{num} = \sum_{i=0}^{m-1}\sum_{j=0}^{n-1} I_{x+i,y+j}\, T'(i, j)   (12.6)

In this way, the normalized cross-correlation value R(x, y) of the sub-image I_{x,y}
and the template T in Eq. (12.1) is equivalent to

R'(x, y) = \frac{\sum_{i=0}^{m-1}\sum_{j=0}^{n-1} I_{x+i,y+j}\, T'(i, j)}{\sqrt{\sum_{i=0}^{m-1}\sum_{j=0}^{n-1}\left(I_{x+i,y+j} - \bar{I}_{x,y}\right)^2}}   (12.7)
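For clarity, Eq. (12.7) can be evaluated directly with NumPy. The sketch below is an exhaustive, unoptimized search over all template positions for illustration only; it assumes grayscale arrays and is not the implementation used in this study.

import numpy as np

def normalized_cross_correlation(image, template):
    """Evaluate Eq. (12.7) at every valid template position (illustrative, unoptimized)."""
    M, N = image.shape                      # image to be matched, M x N pixels
    m, n = template.shape                   # template, m x n pixels
    t_centered = template - template.mean() # T'(i, j) = T(i, j) - T_bar
    result = np.zeros((M - m + 1, N - n + 1))
    for x in range(M - m + 1):
        for y in range(N - n + 1):
            sub = image[x:x + m, y:y + n]
            denom = np.sqrt(np.sum((sub - sub.mean()) ** 2))   # Eq. (12.5)
            if denom > 0:
                result[x, y] = np.sum(sub * t_centered) / denom  # Eq. (12.7)
    return result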

Given the output of the matching module, the final coordinates are then obtained through
a program script.

import cv2


def template_demo(tpl, target, method=cv2.TM_CCOEFF_NORMED):
    # Get the height and width of the template.
    th, tw = tpl.shape[:2]
    # Matching_module is the paper's matching module, computing the normalized
    # cross-correlation of Eq. (12.7); the `method` argument is added here only so
    # that the call is well defined, with an illustrative default.
    result = Matching_module(target, tpl, method)
    # Locate the minimum and maximum correlation values and their positions.
    min_val, max_val, min_loc, max_loc = cv2.minMaxLoc(result)
    tl = max_loc                       # top-left corner of the best match
    br = (tl[0] + tw, tl[1] + th)      # bottom-right corner of the best match
    # If the best correlation is too low, the target is considered lost.
    if max_val < 0.45:
        lost = 1
    else:
        lost = 0
    return tl, br, lost
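A minimal usage sketch is given below. It assumes a local video file ("squat.mp4"), an illustrative template region around the barbell plate on the first frame, and cv2.matchTemplate standing in for the paper's Matching_module; all of these are assumptions for illustration only.

import cv2

# Stand-in for the paper's matching module (assumption for this sketch).
Matching_module = cv2.matchTemplate

cap = cv2.VideoCapture("squat.mp4")            # illustrative file name
ok, first_frame = cap.read()
template = first_frame[200:260, 300:360]       # illustrative region around the barbell plate

centers = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    tl, br, lost = template_demo(template, frame)
    if not lost:
        # Store the center of the tracked plate for later angle/force-arm analysis.
        centers.append(((tl[0] + br[0]) // 2, (tl[1] + br[1]) // 2))
cap.release()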

12.2.3 Action Analysis

Analyzing the deep squat movement from the sagittal plane requires calculation of
the hip angle, knee angle, necessary force arm, and undesirable force arm, while
analysis of the deep squat movement from the coronal plane requires calculation of
three lines, representing the shoulder girdle, hip girdle, and trunk, respectively. The
definitions of these indicators are shown in Fig. 12.1 and Table 12.1.
Specifically, the hip angle is calculated using Eq. (12.8) and the knee angle using
Eq. (12.9). The necessary force arm is calculated from Eq. (12.10), and the
undesirable force arm from Eq. (12.11).

Hip\_angle = \arccos\left(\frac{(X_{hip} - X_{plate})(X_{hip} - X_{knee}) + (Y_{hip} - Y_{plate})(Y_{hip} - Y_{knee})}{\sqrt{[(X_{hip} - X_{plate})^2 + (Y_{hip} - Y_{plate})^2]\cdot[(X_{hip} - X_{knee})^2 + (Y_{hip} - Y_{knee})^2]}}\right)   (12.8)

Knee\_angle = \arccos\left(\frac{(X_{knee} - X_{ankle})(X_{knee} - X_{hip}) + (Y_{knee} - Y_{ankle})(Y_{knee} - Y_{hip})}{\sqrt{[(X_{knee} - X_{ankle})^2 + (Y_{knee} - Y_{ankle})^2]\cdot[(X_{knee} - X_{hip})^2 + (Y_{knee} - Y_{hip})^2]}}\right)   (12.9)

Fig. 12.1 Biomechanical indicators for analysis of squatting techniques

Table 12.1 Definition of evaluation indicators

Sagittal plane:
  Hip_Angle: the angle between the vector from the center of the barbell plate to the hip joint and the vector from the knee to the hip
  Knee_Angle: the angle between the vector from the ankle joint to the knee joint and the vector from the hip to the knee
  Necessary force arm: the absolute value of the hip joint transverse coordinate minus the foot center transverse coordinate
  Undesirable force arm: the absolute value of the horizontal coordinate of the center of the barbell plate minus the horizontal coordinate of the center of the foot

Coronal plane:
  Line_shoulder: the line between the left wrist and the right wrist
  Line_hip: the line between the left hip and the right hip
  Line_trunk: the line between the midpoint of Line_shoulder and the midpoint of Line_hip


Necessary\_Force\_arm = \left| X_{hip} - X_{foot} \right|   (12.10)

Undesirable\_Force\_arm = \left| X_{plate} - X_{foot} \right|   (12.11)
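To illustrate how these indicators are computed from the tracked 2D coordinates, a minimal sketch is given below; the function and argument names are illustrative and not part of the paper's released code.

import numpy as np

def joint_angle(center, p1, p2):
    """Angle (in degrees) at `center` between vectors center->p1 and center->p2,
    per Eqs. (12.8)/(12.9)."""
    v1 = np.asarray(p1, dtype=float) - np.asarray(center, dtype=float)
    v2 = np.asarray(p2, dtype=float) - np.asarray(center, dtype=float)
    cos_theta = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

def squat_indicators(plate, hip, knee, ankle, foot_center):
    """Sagittal-plane indicators from tracked (x, y) points (Table 12.1)."""
    return {
        "hip_angle": joint_angle(hip, plate, knee),                # Eq. (12.8)
        "knee_angle": joint_angle(knee, ankle, hip),               # Eq. (12.9)
        "necessary_force_arm": abs(hip[0] - foot_center[0]),       # Eq. (12.10)
        "undesirable_force_arm": abs(plate[0] - foot_center[0]),   # Eq. (12.11)
    }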

12.3 Results

Two separate procedures examine motion in the sagittal plane and motion in the
coronal plane. Information on angular velocity and force arms of motion can be
obtained in the sagittal plane, and imbalances on both sides of the body can be
screened in the coronal plane.

12.3.1 Sagittal Plane Motion Analysis

In Fig. 12.2, the lifter on the left shows good movement posture with almost no
undesirable force arm; the lifter on the right has a more severe forward
displacement of the barbell and a significant undesirable force arm. This movement
pattern not only produces less strength but also places greater pressure on the lower back.
The lifter's hip angle is shown in Fig. 12.3, the knee angle in Fig. 12.4, the necessary
force arm in Fig. 12.5, and the undesirable force arm in Fig. 12.6.

Fig. 12.2 Example of sagittal plane motion analysis

Fig. 12.3 Visualization of hip angle

Fig. 12.4 Visualization of knee angle

12.3.2 Coronal Plane Motion Analysis

Figure 12.7 demonstrates a situation of imbalance between the two sides, which needs to be
analyzed in a bottom-up order. Although the barbell and upper extremity are tilted, the
problem does not lie with the shoulder girdle and thoracic spine; the key to resolving
the imbalance is to improve the balance of lower-extremity strength.

Fig. 12.5 Visualization of necessary force arm

Fig. 12.6 Visualization of undesirable force arm

12.4 Discussion

The weighted squat is a very functional movement that requires consideration of the
changes that occur between different muscle groups and between the muscles and
the barbell during the movement. Motion capture in sports is often done through
computer vision methods, and some studies [12, 13] have used deep learning-based
pose estimation, but for weighted deep squatting, it is not appropriate to capture
motion information with pose estimation. This is due to occlusion by the barbell
plate, which obscures a large area of the upper body and causes existing pose
estimation models to barely work. However, the weighted deep squat has no rotation

Fig. 12.7 Example of coronal plane motion analysis

and no scale change involved, and the tracking algorithm based on template matching
works well in this case instead. Meaningful analytical results can be obtained on this
basis. For example, in Figs. 12.3 and 12.4, the hip and knee angles vary roughly
periodically: the top of each curve indicates the lifter's stay in the upright position,
while the bottom of each curve is steeper, with the hip and knee angles changing
rapidly, indicating that the lifter can stand up quickly after squatting and that the muscles
switch rapidly from centrifugal to centripetal contraction. Because body structure
is constant for a given person, each peak of the necessary force arm in Fig. 12.5
is also essentially the same. As shown in Fig. 12.6, the undesirable force arm gradually
increased and reached a maximum at the 4th repetition, which may indicate that the
movement went out of shape due to central or peripheral fatigue.

12.5 Conclusion

In this study, the position coordinates of the hip, knee, ankle and barbell plate
centers were obtained using an object tracking algorithm for further visualizing the
joint angles, necessary force arms, and undesirable force arms at each moment of the
weightlifter's deep squat. The method is convenient, fast, and reliable and can be used as a

reference means to analyze the technical movements of the deep squat, improving
the safety of training. In the future work, we will try to use deep learning models to
obtain a more robust method of capturing information about the motion of weighted
deep squats.

Acknowledgements Sincere thanks to “Han Tang Power Lifting” for the technical support and
for the material and concepts that provided great help for this study. This work was supported
by Beijing college students’ innovation and entrepreneurship training program under Grant No.
S202210029010.

References

1. Diggin, D., Regan, C.O., Whelan, N., et al.: A biomechanical analysis of front and back squat:
injury implications. In: Isbs Conference in Vilas Boas Et Al (2011)
2. Kasovic, J., Martin, B., Zourdos, M.C., et al.: Agreement between the iron path app and a linear
position transducer for measuring average concentric velocity and range of motion of barbell
exercises. J Strength Condition Res (2020), Publish Ahead of Print
3. Hideyuki, N., Daichi, Y.: Validation of video analysis of marker-less barbell auto-tracking in
weightlifting. PloS One 17(1) (2022)
4. Wang, Y., Hu, G., Peng, X., Li, H.-l.: Biomechanics and engineering applications of race
walking. In: 2021 International Conference on Health Big Data and Smart Sports (HBDSS),
pp. 50–55 (2021)
5. Pang, Y., Wang, Q., Zhang, C., Wang, M., Wang, Y.: Analysis of computer vision applied in
martial arts. In: 2022 2nd International Conference on Consumer Electronics and Computer
Engineering (ICCECE), pp. 191–196 (2022)
6. Code https://github.com/pyqpyqpyqpyq789/Object-Tracking-for-Squat
7. Dataset https://aistudio.baidu.com/aistudio/datasetdetail/103531
8. Rippetoe, M.: Starting Strength: Basic Barbell Training, 3rd ed. (2016)
9. Wu, J., Yue, H.J., Cao, Y.Y., et al.: Video Object tracking method based on normalized cross-
correlation matching. In: Proceedings of the Ninth International Symposium on Distributed
Computing and Applications to Business, Engineering and Science. IEEE Computer Society
(2010)
10. Tsai, D.M., Lin, C.T., Chen, J.F.: The evaluation of normalized cross correlations for defect
detection. Pattern Recogn. Lett. 24(15), 2525–2535 (2003)
11. Sethmann, R., Burns, B.A., Heygster, G.C.: Spatial resolution improvement of SSM/I data with
image restoration techniques. IEEE Trans Geosci Remote Sens 32(6), 1144–1151
12. Pang, Y., Wang, Q., Zhang, C.: Time-frequency domain pattern analysis of Tai Chi 12 GONG FA
based on skeleton key points detection. In: 2021 International Conference on Neural Networks,
Information and Communication Engineering, International Society for Optics and Photonics,
vol. 11933, pp. 119331Y-1 (2021)
13. Guo, H.: Research and implementation of action training system based on key point detection.
Master’s thesis, Xi’an University of Technology (2021)
Chapter 13
Study on the Visualization Modeling
of Aviation Emergency Rescue System
Based on Systems Engineering

Yuanbo Xue , Hu Liu, Yongliang Tian, and Xin Li

Abstract Focusing on the need for establishing a more complete and implementable
aviation emergency rescue (AER) system, the study on system architecture and visu-
alization modeling of AER was carried out. Based on systems engineering, AER
system architecture contains four stages including disaster prevention and early
warning stage, disaster preparation stage, emergency response stage, recovery and
reconstruction stage, and corresponding six nodes including prevention and prepara-
tion node, command and control node, reconnaissance and surveillance (S&R) node,
scheduling planning node, search node, and rescue node. The AER system visu-
alization model comprises operational viewpoint (OV) DoDAF-described models
and capability viewpoint (CV) DoDAF-described models of the system architec-
ture. The OV DoDAF-described models describe the high-level operational concept,
operational elements, mission and resource flow exchanges, etc. The CV DoDAF-
described models capture the capability taxonomy and complex relationships. The
visualization model provides reference and guidance for AER system architecture
design and is suitable for the visual description of complex systems engineering
architecture for emergencies.

13.1 Introduction

As a complex systems engineering for emergencies, emergency management is a
management system that includes the whole process of prevention and preparation,
emergency rescue and recovery, etc. [1] Since emergencies seriously affect national
development and people’s lives, a well-established emergency management system
helps improve the efficiency and effectiveness of emergency rescue [2]. With the
development of the aviation industry, aviation emergency rescue (AER) has become
an important part of the emergency rescue system in many countries due to the
advantages of speed and efficiency [3].

Y. Xue · H. Liu · Y. Tian (B) · X. Li


School of Aeronautical Science & Engineering, Beihang University, Beijing 100083, China
e-mail: [email protected]


AER is widely used in many fields, such as firefighting [4], earthquake rescue
[5], wilderness search [6], etc. The present research includes: aviation emergency
management [7], quality structure model of new-type AER command talents [8], the
aviation emergency standard system of China [9], AER evaluation capability [10],
AER response assessment [11], civil AER system architecture [12], general aviation
rescue airport site selection [13], etc. On the visualization modeling application of
AER, the present researches focus on the rescue scene modeling [14], drill platform
[15], aviation emergency and rescue operation parameter calculation [16], etc., based
on virtual reality technique. The design and decision problem of AER system as a
complex and open systems engineering could be considered as a system architecture
design and decision support systems engineering [17, 18].
Model-based systems engineering (MBSE) is a method of describing organiza-
tional management systems using mathematical and logical models [19]. As one of
the most implementable complex engineering design methods in system science,
it is characterized by high complexity, wide range of application, and combination
of quantitative and qualitative methods [20–22]. As the major part of MBSE, system archi-
tecture needs to be adaptable to changing needs and requirements [23]. Faced with
system architecture requirements, the Department of Defense Architecture Frame-
work (DoDAF) is proposed for visualization infrastructure. DoDAF is organized
by various viewpoints which is suitable for large systems with complex integration
and interoperability challenges [24]. Several fields closely related to society, such
as civil airline transportation [25] and information platform construction [26], have
been modeled based on MBSE and DoDAF.
For the lack of visualization modeling of the AER system, the present study
analyzed the AER system architecture and constructed an AER visualization model
using DoDAF. Through different viewpoints, the operational activities and system
capabilities are visually described, completing a nonlinear mapping from system
analysis to actual architecture.

13.2 Aviation Emergency Rescue System Architecture Analysis

China’s AER system has made certain achievements at present. The AER system is
based on the emergency plan, with the emergency management mechanism as the
guarantee, supplemented by laws and regulations, and the support of science and
technology.
To further improve the emergency management capability, the present study
proposed an AER system architecture containing four stages and their corresponding
six nodes based on systems engineering. The four stages are disaster prevention
and early warning stage, disaster preparation stage, emergency response stage,
and recovery and reconstruction stage. Each stage consists of different activities.
According to the nature of the activities, the activities of the same nature that make

Fig. 13.1 Aviation emergency rescue system architecture

up the different phases are grouped into the same node in this present study. The
corresponding six nodes are: prevention and preparation node, command and control
node, reconnaissance and surveillance (R&S) node, scheduling planning node, search
node, and rescue node (Fig. 13.1).

13.2.1 Disaster Prevention and Early Warning Stage

Disaster prevention and early warning is a prerequisite for fruitful emergency rescue.
Disaster prevention mainly refers to strengthening people’s awareness of disaster
prevention and improving people’s ability to take the initiative to avoid disasters
through publicity, drills, and other means. Early warning includes real-time moni-
toring of urban weather and engineering construction, and analysis and processing
of information collected by the early warning system. Comprehensive disaster
prevention reduces the losses that would otherwise result from inadequate prevention. Expanding
the coverage of the monitoring system and improving the accuracy of early warning
information are the technical prerequisites for speeding up emergency response.

13.2.2 Disaster Preparation Stage

Disaster preparation is divided into two categories: technological disaster preparation
and engineering disaster preparation. Technological disaster preparation refers
to the use of science and technology, such as data analysis, to analyze causes of disas-
ters and to process warning information. Engineering disaster preparation refers to
improving urban disaster prevention standards, optimizing engineering layouts, and
strengthening industry supervision. Comparing these two types of disaster prepa-
ration, the technological disaster preparation is implemented by the control center,
while the engineering disaster preparation is promoted by government departments.
Together with the first stage, the disaster prevention and preparation efforts imple-
mented in the first two stages correspond together to the prevention and preparation
node.

13.2.3 Emergency Response Stage

Emergency response is the disposition process of the control center to handle situa-
tion information, formulate rescue plans, conduct real-time reconnaissance, dispatch
personnel and materials, and direct the implementation of search and rescue after a
disaster occurs or when a disaster is predicted to occur.
This stage mainly includes five types of activities, in chronological order: plan
formulation, reconnaissance and surveillance, decision-making, disposal implemen-
tation, and real-time reconnaissance.
1. Plan Formulation
Plan formulation is the formulation of an AER plan referring to the disaster situa-
tion, prevention, and preparedness corresponding to the scheduling planning node.
It includes the location and number of navigable airports available for rescue, the
type and number of rescue helicopters, the rules for dispatching rescue helicopters,
the selection of personnel resettlement points, and the evaluation method for rescue
plans.
2. Reconnaissance and Surveillance
After the completion of plan formulation, the disaster situation should be reconnoi-
tered and monitored. Through the reconnaissance by helicopter and the message back
from the monitor unit, the control center can confirm the site situation and adjust the
plan in real time.
3. Decision-Making
Through visualization models or other auxiliary decision-making means, the control
center makes and confirms all practical decisions in the rescue plan with reference
to the search information and the feasibility evaluation of the existing rescue plan
which corresponds to the command and control node.

4. Disposal Implementation
The disposal implementation is the process of advancing rescue plans and decisions,
including the completion of personnel rescue, goods and material transfer, and other
mission requirements. From the perspective of requirements, the disposal implemen-
tation is disaster-oriented aviation emergency search and rescue, corresponding to
search node and rescue node.
5. Real-Time Reconnaissance
During disposal implementation, the monitor unit conducts real-time reconnaissance
of the search and rescue process, considering the possibility of secondary disas-
ters and updated mission requirements. Mission demands and rescue plans are then
updated according to mission completion and the occurrence of secondary disasters.

13.2.4 Recovery and Reconstruction Stage

After the emergency rescue, this stage mainly includes the resettlement of personnel
and the reconstruction of infrastructure. And the experience of emergency rescue is
summarized to guide and iterate the system design process of the first three stages.
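For readers who prefer a compact summary, the stage-to-node grouping described above can be sketched as a simple mapping. The structure below is illustrative only (it is not part of the DoDAF model itself) and follows Fig. 13.1 and the preceding subsections; the recovery and reconstruction stage is not assigned a dedicated node in the text but instead feeds its lessons back into the earlier stages.

# Illustrative encoding of the four stages and the nodes whose activities they contain.
AER_STAGE_TO_NODES = {
    "disaster prevention and early warning": ["prevention and preparation"],
    "disaster preparation": ["prevention and preparation"],
    "emergency response": [
        "scheduling planning",               # plan formulation
        "reconnaissance and surveillance",   # reconnaissance, real-time reconnaissance
        "command and control",               # decision-making
        "search",                            # disposal implementation
        "rescue",                            # disposal implementation
    ],
    "recovery and reconstruction": [],       # experience is fed back to the first three stages
}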

13.3 Aviation Emergency Rescue System Visualization Model

The AER system composed of the above stages contains three elements: system
architecture, operational activities, and system capabilities. Correspondingly, two
viewpoints, operational viewpoint and capability viewpoint, are selected to construct
a complete structural framework of the AER system and a visualization DoDAF-
described model which takes system architecture, operational concept and process,
task tree, and capability as input. The modeling steps are shown in Fig. 13.2.
Step 1 Determine the operational concept of the AER system and construct the high-level operational concept graphic (OV-1 model).
Step 2 Establish the operational resource flow description diagram (OV-2 model)
combining the OV-1 model, the system architecture, and the analysis of the
operational process.
Step 3 Build the operational activity decomposition tree (OV-5a model) corre-
sponding to the overall task tree and nodes.
Step 4 Establish the dependency matrix (OV-3 matrix) between the OV-2 model and
the OV-5a model.

Fig. 13.2 Visualization modeling step-by-step diagram

Step 5 Combined with the analysis of the capability system, construct capability
taxonomy (CV-2) which provides visualizations of the evolving capabilities.
Step 6 Build the capability to operational activities mappings (CV-6 matrix) which
describes the mapping between the capabilities required and the activities that enable
those capabilities.

13.3.1 Operational Viewpoint DoDAF-Described Models

DoDAF-described models in the operational viewpoint (OV) consist of three models,
OV-1, OV-2, and OV-5a, which completely describe the high-level operational
concept, operational elements, mission and resource flow exchanges, etc., in AER.
OV-1 DoDAF-described Model
The OV-1 DoDAF-described model is the high-level operational concept graphic
and the entry point of model construction. By describing the mission and scenario of
AER, OV-1 graphically shows the main operational concepts and macroscopically
describes the interactions between the model elements.

The AER system visualization model consists of seven elements: control center,
monitor unit, airport, helicopter, personnel, goods and materials, and relevant points.
1. Control Center
The control center is responsible for handling early warning information, formulating
rescue plans, commanding and controlling rescue processes, and dispatching and
commanding rescue helicopters according to the type of emergency.
2. Monitor Unit
Monitor unit includes a disaster emergency monitoring system and an early warning
system, covering the whole emergency management process.
Monitor unit collects early warning information, monitors urban meteorology and
disaster situation in real time, and provides the control center with information for
analysis and processing.
3. Airport
In the emergency response stage, the control center shall formulate an emergency
rescue plan, including the location and number of airports available for rescue heli-
copter landing, refueling, and support. Rescue workers and relief materials could be
assembled at the airport according to the plan and wait for transfer.
4. Helicopter
According to the emergency rescue plan, the helicopters participating in the emer-
gency rescue gather at the designated airport, receive support, load disaster relief
materials or rescue workers, and wait for scheduling.
5. Personnel
Personnel includes rescue workers and trapped people. Rescue workers are those
who treat the wounded or transfer trapped people. Trapped people are those who are
trapped in place and need AER after emergencies.
6. Goods and Materials
In the present study, goods and materials include materials (living materials, food,
medicine, etc.) and disaster relief equipment, which are transferred to mission
demand points by helicopter.
7. Relevant Locations
The present study considers four types of relevant locations: mission demand point,
resettlement point, loading point, and unloading point.
Mission demand point refers to the place where relevant missions need to be
performed, including but not limited to the place where there is a need for rescue
workers or materials.
Resettlement point refers to the place where the trapped people can be properly
resettled.

Fig. 13.3 OV-1 high-level operational concept graphic

Loading point and unloading point, respectively, refer to the places where heli-
copters load and unload personnel or materials, which may be vacant sites temporarily
requisitioned or suitable for helicopter takeoff and landing (Fig. 13.3).
OV-2 DoDAF-described Model
The OV-2 DoDAF-described model is a further refinement of high-level opera-
tional concept which shows the flow of personnel, material, and information without
describing the flow mode. The resource flows are between the nodes included in the
AER system architecture which reflect the operational exchanges.
The operational exchange types between different nodes differ, comprising
information exchange, goods and materials exchange, and people exchange. Among
them, the information exchange includes six types of information: request informa-
tion, mission information, information tracked, control order, distress signal, and
situation information. In addition to the nodes corresponding to different stages, this
model contains five types of location and character elements. People in distress and
locations where rescue workers or supplies are needed send distress signals to the
control center for rescue plan formulation and helicopter scheduling. Resettlement
points, loading points, and unloading points send the on-site situation information to
the monitor unit with the aim of obtaining real-time information on disaster situation
(Fig. 13.4).

Fig. 13.4 OV-2 operational resource flow description diagram

OV-5a DoDAF-described Model


The OV-5a DoDAF-described model describes operational activities inside and
outside the architectural scope of system of systems. Combined with the OV-2 model,
the responsibility of activities and the internal activities of nodes are delineated (given
in Table 13.1).
Search and rescue are the most critical and include the most activities in the emer-
gency response phase. Search is the basis of rescue. Rescue includes the analysis and

Table 13.1 OV-3 dependency matrices


Operational viewpoint: OV-2
Operational viewpoint: Person in distress Rescue node Search node
OV-5 Receive distress signal  
Receive situation info 
Send distress signal 
Send situation info 

processing of searched information. Specifically, the operational activity exhibition
of search and rescue is constructed in the OV-5a DoDAF-described model. Corre-
sponding to the search node, the operational activity hierarchy of search process
includes receiving distress signals, finding trapped personnel, monitoring health,
and sending situation information to the control center. Corresponding to the rescue
node, the operational activity hierarchy of rescue process includes receiving distress
signals, processing situation information, and transiting trapped personnel.
The judgment and processing of the situation information monitored in real-time
during the search and rescue process are conducive to the subsequent rescue activities
such as the transfer of trapped people. Among them, the transfer process is carried out
after monitoring health, which is divided into two categories: transfer to resettlement
points with medical assistance and transfer to resettlement points without medical
assistance. After finding the trapped people, health monitoring is conducted, and a different
transfer process is carried out according to whether medical assistance is needed. In
addition to the transfer of trapped people, the transfer process includes the transfer
of rescue workers and the transfer of goods and materials as well (Fig. 13.5).

Fig. 13.5 OV-5a operational activity decomposition tree



13.3.2 Capability Viewpoint DoDAF-Described Models

DoDAF-described models in the capability viewpoint (CV) capture the capability
taxonomy and complex relationships in the emergency management architecture
including CV-2 capability taxonomy and CV-6 capability to operational activities
mappings.
In detail, the capability hierarchy in the CV-2 model describes the existing or
possibly required capabilities for emergency rescue and emergency management,
which provides a reference for the architecture.
In the present study, we summarize the capabilities into three categories: preven-
tion and preparation, command and control, search and rescue, which correspond to
different nodes, respectively. Because of their dependencies, search and rescue are closely
related to pre-disaster prevention and preparation, emergency response in disaster
relief, and post-disaster recovery. Therefore, we consider the following capabilities
as the components of search and rescue capability: reconnaissance and surveillance
capability, scheduling planning capability, search capability, assistance capability,
transfer capability, and recovery capability, where the scheduling planning capability
refers to the planning capability of dispatching rescue helicopters corresponding to
the rescue plan (Fig. 13.6).
To visualize use case relations and model scope between required capabilities (CV
model) and operational activities (OV model), the CV-6 capability to operational
activities mappings is established (Fig. 13.7).
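To make the idea of a CV-6 mapping concrete, a minimal capability-to-activity encoding might look like the sketch below. The entries are examples drawn from the OV-5a activities described above, not the complete mapping of Fig. 13.7, and the dictionary names are illustrative.

# Illustrative CV-6-style mapping: each capability is linked to operational
# activities that enable it (examples only; see Fig. 13.7 for the full matrix).
CV6_CAPABILITY_TO_ACTIVITIES = {
    "search capability": [
        "receive distress signal",
        "find trapped personnel",
        "monitor health",
        "send situation information",
    ],
    "assistance capability": [
        "process situation information",
        "transfer to resettlement points with medical assistance",
    ],
    "transfer capability": [
        "transfer trapped people",
        "transfer rescue workers",
        "transfer goods and materials",
    ],
}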

13.4 Conclusion

The present study summarizes and composes the architecture of the AER system,
introduces the systems engineering idea and DoDAF model design process, and
realizes the construction of the AER system visualization model. The model provides
specific descriptions in terms of capability viewpoint and operational viewpoint,
which provides reference and guidance for the design of the AER system using
visualization modeling means.
It should be noted that the present study aims at a complete description of the
AER system based on visual modeling. The helicopter scheduling rules and internal
activities are not considered in the present study and will be addressed in follow-up work.

Fig. 13.6 Capability taxonomy diagram



Fig. 13.7 CV-6 capability to operational activities mappings

Acknowledgements I am very grateful to my tutors, Hu Liu and Yongliang Tian, and my friend
Xin Li for the great help in my field of study. This research did not receive any specific grant from
funding agencies in the public, commercial, or not-for-profit sectors.

References

1. Bullock, J., Haddow, G., Coppola, D.P.: Introduction to emergency management. Butterworth-
Heinemann (2017)
2. Alexander, D.: Towards the development of a standard in emergency planning. Disaster Prev.
Manag. 14(2), 158–175 (2005). https://doi.org/10.1108/09653560510595164
3. Yuming, L.: Aviation rescue: enhance emergency hard power. J. Beijing Univ. Aeronautics
Astronautics Social Sci. 24(4), 15 (2011). https://doi.org/10.13766/j.bhsk.1008-2204.2011.
04.002
4. Bartolo, K., Furlonger, B.: Leadership and job satisfaction among aviation fire fighters in
Australia. J. Manag. Psychol. 15(1), 87–93 (2000). https://doi.org/10.1108/026839400103
05324
5. Shen, Y., Zhang, X., Guo, Y.: Discrete-event simulation of aviation rescue efficiency on
earthquake medical evacuation. In: Americas Conference on Information Systems (2018)

6. Grissom, C.K., Thomas, F., James, B.: Medical helicopters in wilderness search and rescue
operations. Air Med. J. 25(1), 18–25 (2006). https://doi.org/10.1016/j.amj.2005.10.002
7. Bearman, C., Rainbird, S., Brooks, B.P., et al.: A literature review of methods for providing
enhanced operational oversight of teams in emergency management. Int. J. Emergency Manage.
14(3), 254–274 (2018). https://doi.org/10.1504/IJEM.2018.094237
8. Fang-Zhong, Q.I.: Exploration on undergraduate education of new-type aviation emergency
rescue command talents. Fire Sci. Technol. 39(8), 1178 (2020). https://doi.org/10.3969/j.issn.
1009-0029.2020.08.037
9. Yanhua, L., Ran, L.: Construction of China aviation emergency rescue standard system. China
Safety Sci. J. 29(8), 178 (2019). https://doi.org/10.16265/j.cnki.issn1003-3033.2019.08.028
10. Zhu, H., Xie, N.: Aviation emergency rescue evaluation capability based on improved λρ
fuzzy measure. In: Proceedings of the 2017 IEEE International Conference on Smart Cloud
(SmartCloud), pp. 289–293. IEEE, New York, NY (2017). https://doi.org/10.1109/SmartCloud.
2017.54
11. Walker, K., Oeen, O.: A risk-based approach to the assessment of aviation emergency response.
In: Proceedings of the SPE International Conference and Exhibition on Health, Safety, Secu-
rity, Environment, and Social Responsibility, Abu Dhabi (2018). https://doi.org/10.2118/190
549-MS
12. Xia, Z.-H., Pan, W.-J., Lin, R.-C., et al.: Research on efficiency of aviation emergency rescue
under major disasters. Comput. Eng. Des. 33(3), 1251–1256 (2012). https://doi.org/10.16208/
j.issn1000-7024.2012.03.004
13. Hu, B., Pan, F., Zhang, Y.: Research on selection of general aviation rescue airports. In: Proceed-
ings of the Journal of Physics: Conference Series, vol. 1910, No. 1. IOP Publishing (2021).
https://doi.org/10.1088/1742-6596/1910/1/012023
14. Sun, X., Liu, H., Yang, C., et al.: Virtual simulation-based scene modeling of helicopter earth-
quake search and rescue. In: Proceedings of the AIP Conference Proceedings, vol. 1839, No.
1, p. 020140. AIP Publishing LLC (2017). https://doi.org/10.1063/1.4982505
15. Pan, W., Xu, H., Zhu, X.: Virtual drilling platform for emergency rescue of airport based on
VR technology. J. Saf. Sci. Technol. 16(2), 136–141 (2020). https://doi.org/10.11731/j.issn.
1673-193x.2020.02.022
16. Meleschenko, R.G., Muntyan, V.K.: Justification of the approach for calculating the parameters
of aviation emergency and rescue operations when using visual search (2017)
17. Sage, A.P.: Decision support systems engineering. Wiley-Interscience (1991)
18. Parnell, G.S., Driscoll, P.J., Henderson, D.L.: Decision making in systems engineering and
management. Wiley (2011)
19. Blanchard, B.S.: System engineering management. Wiley (2004)
20. Buede, D.M., Miller, W.D.: The engineering design of systems: models and methods (2016)
21. Kaslow, D., Anderson, L., Asundi, S., et al.: Developing a cubesat model-based system engi-
neering (mbse) reference model-interim status. In: Proceedings of the 2015 IEEE Aerospace
Conference, pp. 1–16 (2015). https://doi.org/10.1109/AERO.2015.7118965
22. Madni, A.M., Madni, C.C., Lucero, S.D.: Leveraging digital twin technology in model-based
systems engineering. Systems 7(1), 7 (2019). https://doi.org/10.3390/systems7010007
23. Weilkiens, T., Lamm, J.G., Roth, S., et al.: Model-based system architecture. Wiley (2015)
24. Miletić, S., Milošević, M., Mladenović, V.: A new methodology for designing of tactical inte-
grated telecommunications and computer networks for OPNET simulation. Sci. Tech. Rev.
70(2), 35–40 (2020)
25. Pan, X., Yin, B., Hu, J.: Modeling and simulation for SoS based on the DoDAF framework. In:
Proceedings of 2011 9th International Conference on Reliability, Maintainability and Safety,
pp. 1283–1287. IEEE (2011). https://doi.org/10.1109/ICRMS.2011.5979468
26. Tao, Z.-G., Luo, Y.-F., Chen, C.-X., et al.: Enterprise application architecture development
based on DoDAF and TOGAF. Enterprise Inf. Syst. 11(5), 627–651 (2017). https://doi.org/10.
1080/17517575.2015.1068374
Chapter 14
An AI-Based System Offering Automatic
DR-Enhanced AR for Indoor Scenes

Georgios Albanis, Vasileios Gkitsas, Nikolaos Zioulis,


Stefanie Onsori-Wechtitsch, Richard Whitehand, Per Ström,
and Dimitrios Zarpalas

Abstract In this work, we present an AI-based Augmented Reality (AR) system for
indoor planning and refurbishing applications. AR can be an important medium for
such applications, as it facilitates more effective concept conveyance and addition-
ally acts as an efficient and immediate designer-to-client communication channel.
However, since AR can only overlay content and cannot replace or remove existing objects, our system
relies on Diminished Reality (DR) to support deployment to real-world, already
furnished indoor scenes. Further, and contrary to the traditional mobile AR applica-
tion approach, our system offers on-demand Virtual Reality (VR) viewing, relying on
spherical (360°) panoramas, capitalizing on their user-friendliness for indoor scene
capturing. Given that our system is an integration of different AI services, we analyze
its performance differentials concerning the components comprising it. This analysis
is both quantitative and qualitative, with the latter realized through user surveys, and
provides a complete systemic assessment of an attempt for a user-facing, automatic
AR/DR system.

14.1 Introduction

Interior design can be a challenging and stressful process, requiring bidirectional


communication between users and experts. Experts usually express their ideas in
traditional 2D drawings produced by Computer-Aided Design (CAD) software,
making it difficult for the end users to comprehend them. AR is an emerging tech-
nology that allows users to superimpose computer-generated (CG) elements over
the real world. In the particular case of interior design, AR can be used for placing
virtual 3D objects within the real environment bridging the communication gap
between experts (designers) and non-experts (clients). In this way, AR serves as

G. Albanis · V. Gkitsas · N. Zioulis · R. Whitehand · P. Ström · D. Zarpalas (B)


Centre for Research and Technology Hellas, Information Technologies Institute, Thessaloniki, Greece
e-mail: [email protected]
S. Onsori-Wechtitsch · R. Whitehand · P. Ström
JOANNEUM RESEARCH, Graz, Austria


a medium between digitized concepts and the real scene, facilitating effective and
efficient communication and feedback between its users and improving the iterative
design process. Indicatively, an AR system for rearranging a furniture layout was
proposed in [15], while in [12] a system employing a dynamic user interface for
placing 3D virtual furniture models was developed. However, both aforementioned
systems required multiple QR markers to allow users to physically position the virtual
furniture.
Even though AR enables interaction with virtual objects inside real environments,
its nature is purely additive, and a practical problem arises when
working in occupied and filled indoor scenes, as is the case for AR home design
applications [20]. Concepts like redecoration cannot be delivered solely through
AR technology, as users would only be capable of superimposing CG elements on
top of the existing real-world objects, hindering understanding due to a conflicting
mental response. To overcome this, AR needs to be supported by DR which can
diminish existing objects prior to overlaying new virtual ones and provide users with
an enhanced view for assessing how furniture fits into their spaces. DR is an intriguing
technology that can enable novel concepts. One example is intercar see-through
vision, which aims at preventing accidents [14] and diminishes (i.e., “removes”)
the front car. In this particular case, DR is driven by multi-view observations and
view synthesis. There are cases though, where no view behind the removed object is
available, and then DR needs to hallucinate content, typically referred to as infilling
or inpainting [10]. Pioneering work in the DR domain was presented by [5], where
a patch-based image inpainting method was developed. Follow-up work [8] moved
beyond image-based diminishing and transitioned toward respecting scene geom-
etry by exploiting SLAM-based localization. More recently, an inpainting method
for non-planar scenes was developed [16] that considered both color and depth infor-
mation. Still, in both cases, manual selection of the region to be removed in the image
domain was required. To allow for easier selection of the object to be diminished in
indoor scenes for interior design, [17] used a manually positioned and scaled volume
to enclose the object of interest. In addition, the floor plane was identified by inserting
a marker into the scene. Real-time six degrees-of-freedom DR without manual object
selection is challenging [13] and requires a 3D reconstruction of the scene without
the object of interest but with the diminishing area annotated, limiting its flexibility.
When considering AR interior home refurnishing, where quickly prototyping ideas
is essential, minimizing user interactions is very important, as users will also need
to position the new elements into the scene [7].
Still, all of the aforementioned studies work on narrow field-of-view inputs,
limiting the amount of information available for each scene and thus degrading their
performance on big objects (e.g., furniture), while at the same time they do not strictly
respect the structure of the environment. To overcome this, moving cameras relying
on SLAM [16] or wider field-of-view captures [7] are employed, but they limit
user-friendliness and are more error prone.
addresses the challenges of cumbersome user diminished area selection and user
scanning, delivering DR-enhanced AR for indoor scene planning and design. To
achieve that, our system is AI based, operating on single monocular image capture,

exploiting recent advances in data-driven inpainting methods [6]. In addition, albeit


image based, it takes the scene’s structure into account, an important clue for the
targeted application domain. Our main contributions are summarized below:
• A novel AI-based DR-enhanced AR system with various data-driven components
connected in parallel and cascade structure using only monocular 360° images as
input.
• A holistic system evaluation including a systemic point of view analysis to identify
the weakest link in the system and a user study focused on the importance and
relevance of DR in indoor planning applications.

14.2 System Overview

Figure 14.2 shows a high-level overview of our system comprising two main sub-
systems, and the nominal data flow among the various components. Each component
is an AI model, trained on the Structured3D dataset [23].
As presented in Fig. 14.2, the two sub-systems operate in cascade, while the DR
sub-system also includes a parallel component connection. The DR sub-system first
processes the input panoramic image by estimating the scene’s layout and segmenting
the distinct objects inside the scene. Then, for each segmented object in the scene,
the inpainting component is invoked to diminish the object and prepare the input
for the AR superimposition. Since data-driven models typically operate in lower
resolutions than required for panorama viewing, the AR sub-system first invokes
a super-resolution component to rescale the diminished area back to 360° viewing
resolution. AR is user driven by positioning elements into the scene that interact with
the masked regions depending on their projection, to select the appropriate diminished
panorama. Still, users may simply wish to remove an object from the scene, which
is straightforwardly supported. In the following subsections, the different AI building
blocks comprising our automatic DR-enhanced AR system are presented.

Fig. 14.1 Imagine that you want to redesign your living space and replace existing furniture with
new ones. We propose a system consisting of various AI services for enabling next-generation AR
indoor re-planning and design experiences. Users only require a single 360° camera capture that
produces a spherical panorama of their indoor space. Then, our AI-based system automatically
generates a high-level understanding of the scene, both semantically and structurally, enabling
automatic selection of objects to be removed or replaced. This is driven by employing DR technology
that incorporates the inferred scene structural prior to generate plausible hallucinations, eventually
offering a compelling and effective AR experience. The top row shows the overall concept and higher-
level component connections, while the bottom row shows an actual example from the Structured3D
dataset, where a bed is replaced within a room

Fig. 14.2 Overview of the proposed automatic DR-enhanced 360° AR system. The system can be
dissected into two high-level sub-systems, the DR one on the left and the AR one on the right, operating
in cascade. The former is responsible for the automatic diminishing of the scene and the latter for
user-driven augmentation. Given an input panorama, the scene’s junctions L(x) and objects’ masks
S(x) are first estimated in parallel by the corresponding data-driven components, with L and S being
the layout and segmentation AI models, respectively. Then the data-driven inpainting component is
invoked as I(x, L(x), S(x)), with I being the respective AI model. Diminishing is achieved by inpainting
the object’s mask in a structure-aware manner using the dense layout map. The diminished panorama
y is up-sampled by invoking R(y), where R is a super-resolution model. Finally, the 3D object is
positioned in the scene, producing the DR-enhanced AR panorama image

14.2.1 Object Segmentation Component

In order to diminish an object from a residential indoor scene, the object’s pixel-
aligned area within the image must be available. For this purpose, we employ a
semantic segmentation network to infer object masks for a set of a priori selected
classes, commonly present in residential scenes. We use the DeepLabv3 architecture
[3] with a ResNet50 [4] backbone, which has shown reliable and robust results in
segmentation tasks, offering a great compromise between accuracy and speed. The
network was supervised using cross-entropy and trained for 133 epochs using the
Adam optimization algorithm [9], with default parameters, a learning rate of 0.0002,
and a scheduler halving it every 20 epochs.
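As an illustration of this training configuration, a minimal sketch along these lines could look as follows; the number of classes, the data loader and the exact torchvision calls are assumptions and do not reproduce the exact code used here.

```python
# Illustrative sketch of the described DeepLabv3/ResNet-50 training setup.
# NUM_CLASSES and the data loader are assumptions, not the authors' exact configuration.
import torch
from torch import nn, optim
from torchvision.models.segmentation import deeplabv3_resnet50

NUM_CLASSES = 6  # assumed: the a priori selected indoor object classes plus background

model = deeplabv3_resnet50(weights=None, num_classes=NUM_CLASSES)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=2e-4)                          # lr = 0.0002
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)    # halve every 20 epochs

def train(loader, epochs=133, device="cuda"):
    model.to(device).train()
    for _ in range(epochs):
        for panorama, mask in loader:            # mask: per-pixel class indices
            optimizer.zero_grad()
            logits = model(panorama.to(device))["out"]
            loss = criterion(logits, mask.to(device))
            loss.backward()
            optimizer.step()
        scheduler.step()
```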
14 An AI-Based System Offering Automatic DR-Enhanced AR for Indoor … 191

14.2.2 Layout Estimation Component

Another prerequisite of the inpainting component is the scene’s dense layout segmen-
tation (i.e., the per-pixel classification into the ceiling, wall, or floor classes). This is
required to preserve the scene’s structure during diminishing which is a very impor-
tant cue for the downstream applications (i.e., planning or designing). We use the
HorizonNet model [19] to estimate the locations of the scene’s junctions.

14.2.3 Inpainting Component

The core of our AI-based DR sub-system is the inpainting AI model which is respon-
sible for object diminishing. Apart from the input panorama, it additionally requires
an object mask and the scene’s layout segmentation map, as depicted in Fig. 14.2. The
latter provides the structure of the scene as corner positions, which are subsequently
reconstructed as the dense layout, while the former is a requisite for specifying
the object to be diminished. We adopt a structure-aware 360° inpainting model [2]
that uses SEAN residual blocks [24] to aid in hallucinating plausible content with
semantic coherency in the diminished region. SEAN blocks leverage the structural
information provided by the input semantic maps (the layout segmentation in our
case) and use it as structural guidance.
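To make this mechanism concrete, the following simplified sketch shows how a dense layout map can modulate normalized features in a SPADE/SEAN-like fashion; it is an illustrative reduction of the blocks used in [2, 24], with assumed channel widths, and not the actual inpainting network.

```python
# Simplified layout-conditioned modulation layer (SPADE/SEAN-like); shapes are assumptions.
import torch
from torch import nn
import torch.nn.functional as F

class LayoutModulatedNorm(nn.Module):
    def __init__(self, feat_channels, layout_classes=3, hidden=64):
        super().__init__()
        self.norm = nn.InstanceNorm2d(feat_channels, affine=False)
        self.shared = nn.Sequential(nn.Conv2d(layout_classes, hidden, 3, padding=1), nn.ReLU())
        self.gamma = nn.Conv2d(hidden, feat_channels, 3, padding=1)
        self.beta = nn.Conv2d(hidden, feat_channels, 3, padding=1)

    def forward(self, feat, layout_onehot):
        # layout_onehot: one-hot ceiling/wall/floor map, resized to the feature resolution
        layout = F.interpolate(layout_onehot, size=feat.shape[-2:], mode="nearest")
        h = self.shared(layout)
        return self.norm(feat) * (1 + self.gamma(h)) + self.beta(h)
```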

14.2.4 Super-Resolution Component

For alleviating the aforementioned issue concerning the low resolution of the
panoramas to be processed, we resort to a lightweight super-resolution model [22], to
upscale the diminished result up to (×4) times. That way, we offer results appropriate
for panorama viewers, without degrading their visual quality.

14.2.5 Implementation and Orchestration

Our models are trained with PyTorch [11] and delivered as services using TorchServe
[1]. Our components share a common communication interface that is built around
callback URLs, with all inputs and outputs delivered as end points to either retrieve
(GET) or submit (POST) data. This interface makes our system highly modular since
the communication interface is decoupled from the back-end functionality of each
component.
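Purely for illustration, a client-side call against such a callback-URL interface might be sketched as follows; the endpoint path, payload fields and URLs are hypothetical and only mirror the GET/POST pattern described above, not the system's actual service contract.

```python
# Hypothetical client for a callback-URL based component; names are illustrative assumptions.
import requests

def invoke_component(service_url, input_url, callback_url):
    """Submit a job whose input is fetched via GET and whose result is POSTed back."""
    job = {"input": input_url, "callback": callback_url}   # assumed payload layout
    response = requests.post(f"{service_url}/predictions/inpainting", json=job, timeout=30)
    response.raise_for_status()
    return response.json()

# Usage sketch: the orchestrator exposes GET endpoints for the panorama, mask and layout,
# and a POST endpoint on which the diminished result is delivered back.
# invoke_component("http://inpaint:8080", "http://orchestrator/panorama/42",
#                  "http://orchestrator/result/42")
```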
The system orchestration is realized as a web server, where each upload triggers
a chain of events as follows. At first, the object segmentation and layout estimation
192 G. Albanis et al.

models are invoked to estimate the object masks and the room layout. Since we rely
on semantic segmentation, we perform connected component analysis to resolve
potentially different instances and split each segmentation map into multiple per-
class and object masks. To improve robustness, we use the convex hull for each mask
in an attempt to decouple the diminished region shape from the result (the inpainting
model is trained similarly). Likewise, the junction estimates are post-processed to
generate a dense layout map by first connecting the top and bottom boundaries and
then identifying the corresponding structural labels across each column. Finally, for
all object masks, the inpainting service is called, with its result fed into the super-
resolution service and then composited on the original panorama. The outputs are
then ready to be queried by the AR component that positions the 3D object, whose
renders interact with the masks on the image domain to retrieve the appropriate result.
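A condensed sketch of this mask post-processing step, assuming OpenCV-style arrays, is given below; it illustrates the connected-component split and convex-hull step rather than reproducing the orchestration code itself.

```python
# Sketch: split a per-class mask into per-instance convex-hull masks (OpenCV assumed).
import cv2
import numpy as np

def split_instances(class_mask):
    """Return one convex-hull mask per connected component of a binary class mask."""
    num, labels = cv2.connectedComponents(class_mask.astype(np.uint8))
    instance_masks = []
    for i in range(1, num):                       # label 0 is background
        blob = (labels == i).astype(np.uint8)
        pts = cv2.findNonZero(blob)
        hull = cv2.convexHull(pts)
        filled = np.zeros_like(blob)
        cv2.fillConvexPoly(filled, hull, 1)       # decouple region shape from the object outline
        instance_masks.append(filled)
    return instance_masks
```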

14.3 Experimental Setup Methodology

The evaluation undertaken for the presented AI-based DR-enhanced AR system
follows two routes. On the one hand, we seek to assess the DR sub-system’s behavior
(Sect. 14.3.1), while on the other hand, we aim at validating the complete system’s
efficacy and goals (Sect. 14.3.2). For the former, we opt for an objective evaluation using
photo-consistency metrics using complete-diminished pairs, while for the latter, we
employ subjective scoring using pre-authored scenes.
Given that our system’s components have been trained on the Structured3D dataset
[23], we use samples from the corresponding test set for both the objective and
subjective experiments. Structured3D provides photo-realistic panoramic images of
residential rooms, room layout annotation, object segmentation masks, as well as
an empty room configuration of each scene that has all foreground (i.e., furniture)
removed. The latter data address the most challenging part of objectively evaluating
DR systems, which is the lack of paired data where the objects of interest are removed.
To simulate indoor (re-) planning/design settings, we focus our evaluation on the
{chair, bed, sofa, table, cabinet} class set.

14.3.1 Component Ablation

The DR sub-system comprises three different AI components. When considering it
as a sum of its parts, we only need to evaluate the result of the diminished output
against an empty scene. Yet the two layout and segmentation components that operate in
parallel and cascade their outputs to the inpainting component can also propagate their
errors. Those errors for each separate AI component can be accumulated, affecting
the overall performance of the system. Given that each part of this sub-system is an
AI model performing a distinct task, its performance can be evaluated in isolation

from the complete sub-system. Reasonably, as the performance of each component improves, it is
expected that the final result will also improve.
Nonetheless, from a system analysis perspective, it is important to identify the
weakest link, i.e., the component upon which the system’s performance relies the most
and which thus most strongly affects its outcomes. As a result, we ablate the
system’s components using differential analysis, where the component is bypassed,
and instead, a perfect prediction is used (the annotated metadata). Consequently,
Fig. 14.3 presents the component ablation setup for the DR sub-system, with the
layout estimation and segmentation components ablated in isolation and jointly. The
latter experiment allows us to assess the performance of the inpainting component
both absolutely, using the metrics, as well as relatively, with respect to the other
ablated components’ performance degradation.
We use the mean absolute error (MAE), the peak signal-to-noise ratio (PSNR), the
structured similarity index metric (SSIM), and the perceptual image patch similarity
(LPIPS) [21] metrics on the results and compare over the objects’ masked regions.
LPIPS measures the perceptual similarity between two images based on a VGG pre-
trained network [18]. It has been shown that it accounts for several parts of human
perception, in contrast to PSNR. Since DR needs to hallucinate realistic
content, we consider LPIPS our primary evaluation metric. For the pixel-wise metrics
(PSNR and MAE), the union of the ground truth and predicted masks was used
to more strictly penalize erroneous segmentations under these photo-consistency
metrics. Still, for the local (window-based SSIM) and global (CNN-based LPIPS)
metrics, the entire images were used.
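The masked pixel-wise metrics can be summarized by the following sketch; SSIM and LPIPS, which are computed on the full images with dedicated libraries, are omitted here, and the normalization range is an assumption.

```python
# Minimal sketch of the masked photo-consistency metrics (MAE and PSNR over the mask union).
import numpy as np

def masked_mae_psnr(pred, target, pred_mask, gt_mask, max_val=1.0):
    """Pixel-wise metrics restricted to the union of predicted and ground-truth masks."""
    union = (pred_mask | gt_mask).astype(bool)
    diff = pred[union] - target[union]
    mae = np.abs(diff).mean()
    mse = (diff ** 2).mean()
    psnr = 10.0 * np.log10((max_val ** 2) / max(mse, 1e-12))
    return mae, psnr
```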

Fig. 14.3 Component ablation experiments setup visualized with a vertical macro-view of the DR
sub-system of Fig. 14.2. a refers to the experiments where both room layout and object masks
are estimated by the system data-driven components, b the layout path is ablated, by replacing the
estimations with the annotated ground truth while preserving the segmentation mask estimates, c the
dual configuration to (b), with the segmentation path ablated and the layout estimations preserved,
and d where both components are replaced by the ground truth layout and object masks

14.3.2 User Study

While objective analysis can help in identifying critical components and assessing
the system’s overall performance, the end result cannot be quantitatively assessed.
This is either because ground truth is not necessarily available, or due to the subjec-
tivity of the results. Still, end user appreciation is the ultimate goal, and as a result, we
additionally performed a user survey for the entire system’s outputs. We used remote
questionnaires that were distributed to 38 users split into two sub-groups, one having
no knowledge regarding the system’s inner workings (i.e., Group A) and the other knowledgeable
regarding AI (i.e., Group B). The questionnaires required the participants to rate
the appearance of a masked area in each one of g different scenes.
An interactive panorama viewer was used, with the initial viewpoint oriented
toward the object to be removed. For each scene, users were first allowed to
freely navigate the entire scene in three degrees of freedom, and then an annotated
panorama with the object to be removed or replaced was presented to them. This
process ensures that users will not get lost within the 360° field-of-view and will
understand the task at hand. Afterward, users were asked to score the appearance of
the previously marked area, once presented with the object removed (i.e., pure DR),
and then once with a virtual object replacing the previous one (i.e., DR-enhanced
AR). After all, scenes were evaluated, and users were asked to rate the scenes again,
this time without DR, scoring the result of the pure virtual object superimposition on
the existing real object (i.e., pure AR). This last step was isolated from the previous
ones to remove any bias when scoring DR results. Scoring was based on a 5-point
Likert scale, resulting in aggregated mean opinion scores (MOS). Figure 14.4 depicts
samples used in the survey.
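For reference, aggregating the Likert ratings into MOS values with a simple 95% confidence interval can be sketched as follows; the data layout (one array per scene and condition) is an assumption.

```python
# Sketch: mean opinion score and a simple 95% confidence interval for one scene/condition.
import numpy as np

def mos(ratings):
    """ratings: 1D array of 5-point Likert scores from all participants."""
    r = np.asarray(ratings, dtype=float)
    mean = r.mean()
    ci95 = 1.96 * r.std(ddof=1) / np.sqrt(len(r))
    return mean, ci95
```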

Fig. 14.4 Example survey scene types. The first column depicts the original panorama, the second
column the panorama with the object removed (i.e., pure DR), the third column the one with the
virtual furniture added in the diminished scene (i.e., DR-enhanced AR), and the final column the
one with the virtual object added without previously removing the existing object (i.e., pure AR)

14.4 Results and Discussion

Before presenting the results of our experiments, it is worth noting the potential
sources of error. Since the inpainting component is dependent on the results of the
layout and segmentation models, it is expected that any errors in these components
will be accumulated in the final diminished result. Under-segmenting an object may
result in the erroneous diminishing of scenes since artifacts of the old object will
be present around the inpainted region. Similarly, over-segmenting may potentially
remove important relevant objects like chairs next to a table, resulting in uncanny
visuals.
Another potential source of error is the layout junction localization. The inpainting
model heavily depends on the layout of the input, as described in Sect. 14.2.3. Given
that the boundaries reconstructed from the junctions are used to generate the dense
layout segmentation map used to drive the SEAN decoding blocks, such errors will
propagate into both style code generation and the diminished area boundary sepa-
rating the different structural areas. As a consequence, even slight errors in the
junctions’ coordinates will translate to large misclassified regions, manifesting in
severe diminishing distortions.

14.4.1 Objective Evaluation

Table 14.1 shows the quantitative results for the experiments described in Sect. 14.3.1.
The first row, which showcases the best performance, is case (d) of Fig. 14.3,
where both models are replaced with perfect estimates. This is in contrast to the last
row, corresponding to case (a) of Fig. 14.3, which relies on all models’ predictions.
Cases (b) and (c) are the most interesting ones, as they reveal the weakest link
of the DR sub-system, namely the layout estimation model: when it is replaced with
the annotated layouts, performance consistently increases.
As the segmentation model produces reasonable results, the sparser junction local-
ization errors propagate deeper into the diminished result, which is reasonable as the
structural segmentation is responsible for both style code extraction and boundary
preservation.

Table 14.1 Quantitative results assessing the DR sub-system output by ablating its components
(configurations (a)–(d) of Fig. 14.3). Arrows denote the direction of better performance

Experiment                                   PSNR ↑   SSIM ↑   MAE ↓    LPIPS ↓
(d) annotated layout and masks               29.61    0.9393   0.0131   0.1127
(b) annotated layout, estimated masks        29.13    0.9353   0.0134   0.1149
(c) estimated layout, annotated masks        27.37    0.9126   0.0166   0.1259
(a) estimated layout and masks               27.86    0.9189   0.0158   0.1225

14.4.2 Subjective Evaluation

Figure 14.5 presents the results of the user survey. The left columns aggregate MOS
scores across all scenes, while the remaining columns present the results for each
scene in sequence. The top row presents the results for all subjects, while the bottom
row splits them into two different groups, those not familiar with AI (i.e., Group A)
and those experienced with it (i.e., Group B). From these results, it is evident that
purely diminished scenes were rated lower than diminished scenes with augmen-
tations overlaid. This is expected as superimposing content on the DR result may
potentially hide defects. Further, the final scenes without DR where the virtual object
was simply overlaid on the actual ones, without removing them, scored lower than the
scenes where the real objects had been diminished/removed. Nevertheless, the statis-
tical confidence is lower, and this is partly expected as not all scenes may require DR.
Indeed, there are cases when the objects are of similar size and shape that render DR
as not that important. The availability of the functionality, however, is very important
for the remaining cases and may even outweigh the need to deliver high-quality DR
results.
Regarding the two user groups, those familiar with AI presented larger
discrepancies between the different scene types, albeit the ranking across both groups
remained the same.

14.5 Conclusion

In this work, we present a system that can drive user-facing applications for interior
design. The focus of our system is on usability as it relies on 360° image acquisition
of scenes, compared to scanning processes that tax users and are more error prone.
Further, we lift the requirement for manually marking the diminished region and seek
to preserve the room structure during diminishing which is highly relevant for the
targeted application domain. Our system is purely AI based, a fact that introduces the
need for assessing error propagation between its different components. To that end,
we present a system ablation analysis, accompanied by a user survey that showcases
the need for DR in indoor AR planning. Nonetheless, our work operates directly on
the image domain (i.e., 2D), and besides the benefits this introduces, it inevitably
only offers perspective views and neglects occlusion effects.
Another limitation is that the current system has only been verified with synthetic
data. The Structured3D dataset offers annotations for all sub-tasks apart from the super-
resolution one, a trait that real-world datasets will not easily provide. Apart from that,
the application to in-the-wild real-world data is expected to reduce performance,
which will require revisiting our analysis. Future work will focus on overcoming
these challenges by integrating geometric inference (e.g., depth) to support more
advanced features like occlusions and lighting, and by transitioning to real-world domain
training data and validation.
Fig. 14.5 Results of the user survey. The first row depicts the total average rating for all three cases, i.e., pure DR (empty), DR-enhanced AR (DR), and pure
AR (AR), across all scenes (first column) as well as for each scene separately in the following columns, while the second row splits the results into the two user groups (A and B)

Acknowledgements We thank Werner Bailer (Joanneum Research), Georg Thallinger (Joanneum
Research), Vladimiros Sterzentsenko (Information Technologies Institute/Centre for Research and
Technology), and Suzana Farokhian (Usability Partners) for insightful discussion and feedback.
This work was supported by the EC funded H2020 project ATLANTIS [GA 951900].

References

1. Torchserve (2021). https://github.com/pytorch/serve


2. Gkitsas, V., Sterzentsenko, V., Zioulis, N., Albanis, G., Zarpalas, D.: PanoDR: spherical
panorama diminished reality for indoor scenes. In: Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pp. 3716–3726
3. Chen, L.-C., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for semantic
image segmentation (2017). arXiv:1706.05587
4. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceed-
ings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778
(2016)
5. Herling, J., Broll, W.: Advanced self-contained object removal for realizing real-time dimin-
ished reality in unconstrained environments. In: 2010 IEEE International Symposium on Mixed
and Augmented Reality, pp. 207–212. IEEE (2010)
6. Jam, J., Kendrick, C., Walker, K., Drouard, V., Hsu, J.G.-S., Yap, M.H.: A comprehensive
review of past and present image inpainting methods. Comput. Vision Image Understand.
103147 (2020)
7. Jiddi, S., Pugh, B., Dai, Q., Puig, L., Lianos, N., Gauthier, P., Totty, B., Dorbie, A., Yin, J.,
Wong, K.: An end-to-end mixed reality product for interior home furnishing. In: 2020 IEEE
International Symposium on Mixed and Augmented Reality (ISMAR) (2020)
8. Kawai, N., Sato, T., Yokoya, N.: Diminished reality based on image inpainting considering
background geometry. IEEE Trans. Visual Comput. Graphics 22(3), 1236–1247 (2015)
9. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization (2014). arXiv:1412.6980
10. Mori, S., Ikeda, S., Saito, H.: A survey of diminished reality: techniques for visually concealing,
eliminating, and seeing through real objects. IPSJ Trans. Comput. Vision Appl. 9(1), 1–14
(2017)
11. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z.,
Gimelshein, N., Antiga, L., et al.: Pytorch: an imperative style, high-performance deep learning
library (2019). arXiv:1912.01703
12. Phan, V.T., Choo, S.Y.: Interior design in augmented reality environment. Int. J. Comput. Appl.
5(5), 16–21 (2010)
13. Queguiner, G., Fradet, M., Rouhani, M.: Towards mobile diminished reality. In: 2018 IEEE
International Symposium on Mixed and Augmented Reality Adjunct (ISMAR-Adjunct),
pp. 226–231. IEEE (2018)
14. Rameau, F., Ha, H., Joo, K., Choi, J., Park, K., Kweon, I.S.: A realtime augmented reality
system to see-through cars. IEEE Trans. Visual Comput. Graphics 22(11), 2395–2404 (2016)
15. Reuksupasompon, P., Aruncharathorn, M., Vittayakorn, S.: AR development for room design.
In: 2018 15th International Joint Conference on Computer Science and Software Engineering
(JCSSE), pp. 1–6. IEEE (2018)
16. Schmalstieg, D., Mori, S., Erat, O., Kalkofen, D., Broll, W., Saito, H.: InpaintFusion:
incremental RGB-D inpainting for 3D scenes. IEEE Trans. Visual. Comput. Graph. (2020)
17. Siltanen, S.: Diminished reality for augmented reality interior design. Vis. Comput. 33(2),
193–208 (2017)
18. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image
recognition (2014). arXiv:1409.1556

19. Sun, C., Hsiao, C., Sun, M., Chen, H.: HorizonNet: learning room layout with 1D representation
and pano stretch data augmentation. In: IEEE Conference on Computer Vision and Pattern
Recognition, CVPR 2019, Long Beach, CA, USA, June 16–20, 2019, pp. 1047–1056 (2019)
20. Wong, K., Jiddi, S., Alami, Y., Guindi, P., Totty, B., Guo, Q., Otrada, M., Gauthier, P.: Exploiting
arkit depth maps for mixed reality home design. In: 2020 IEEE International Symposium on
Mixed and Augmented Reality (ISMAR) (2020)
21. Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness
of deep features as a perceptual metric. In: Proceedings of the IEEE conference on computer
vision and pattern recognition, pp. 586–595 (2018)
22. Zhao, H., Kong, X., He, J., Qiao, Y., Dong, C.: Efficient image superresolution using pixel
attention (2020). arXiv:2010.01073
23. Zheng, J., Zhang, J., Li, J., Tang, R., Gao, S., Zhou, Z.: Structured3d: a large photo-realistic
dataset for structured 3d modeling. In: Proceedings of The European Conference on Computer
Vision (ECCV) (2020)
24. Zhu, P., Abdal, R., Qin, Y., Wonka, P.: Sean: image synthesis with semantic region-adaptive
normalization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pp. 5104–5113 (2020)
Chapter 15
Extending Mirror Therapy into Mixed
Reality—Design and Implementation
of the Application PhantomAR
to Alleviate Phantom Limb Pain
in Upper Limb Amputees

Cosima Prahm, Korbinian Eckstein, Michael Bressler,
Hideaki Kuzuoka, and Jonas Kolbenschlag

Abstract Phantom limb pain (PLP) is a restrictive condition in which patients
perceive pain in a non-existent limb, incapacitating them from performing daily
activities. Mirror therapy, during which patients look into a mirror reflecting their
sound limb and imagine it as healthy on their amputated side, has proven to alle-
viate that pain. However, it is limited to unilateral movements which take place in a
seated position. We developed an assistive mixed reality (MR) tool on the Microsoft
HoloLens 2 to extend conventional mirror therapy by enabling users to freely explore
their environment and to perform bi-manual tasks. Thereby, the patient’s residual
limb was augmented by a superimposed virtual arm that is controlled by the residual
limb. We evaluated the usability of this system with ten able-bodied individuals and
two transradial amputees. Patients additionally rated the system for its motivational
aspect using the IMI questionnaire and a user-centered survey. PhantomAR showed
a high usability rate of 78.5%; immersion, positive affect and game flow were rated
most highly in both patients, while PLP slightly decreased after using the applica-
tion. We critically examined the use and implementation of a therapy environment
on HoloLens 2 and proposed how to address potential pitfalls in development. Based
on these findings, we expect the PhantomAR mixed reality therapy tool to positively
impact the outcomes on PLP scores and motivation to carry out the therapy even in
absence of a therapist.

C. Prahm (B) · M. Bressler · J. Kolbenschlag


BG Trauma Clinic, Department for Plastic and Reconstructive Surgery, University of Tuebingen,
Schnarrenbergstr. 95, 72076 Tuebingen, Germany
e-mail: [email protected]
K. Eckstein
School of Information Technology and Electrical Engineering, The University of Queensland,
Brisbane, Australia
H. Kuzuoka
Department of Mechano-Informatics, University of Tokyo, Tokyo, Japan


15.1 Introduction

After amputation, an estimated 80% of all amputees report a sensation
of a phantom limb and, additionally, phantom limb pain (PLP), a painful perception
in their non-existent body part, which can appear independently of the cause of
amputation and severely affects the quality of life [1–5]. It rarely regresses over
time, rendering it a chronic condition that requires treatment [6].
Both pharmacological and non-pharmacological treatments have been used to
mitigate PLP. However, the result of current treatments is still inadequate and rarely
results in complete reduction of the pain [5, 7, 8]. The development of a non-
pharmacological treatment that effectively reduces PLP would greatly improve the
quality of life of affected patients, especially since some of the main side effects
of pharmacological intervention include daytime tiredness and personality change.
The most common treatment for PLP in clinical use is mirror therapy, which entails
placing a mirror sagittally in front of the healthy limb and imagining its moving reflec-
tion to represent the contralateral amputated side. In this manner, the brain perceives
the amputated limb as healthy and creates the illusion of non-painful movement of the
missing limb [9]. Mirror therapy provides anthropomorphic visual feedback, which
has been cited as the main reason for its therapeutic effect [10, 11].
Modern, digital methods to address PLP shift the focus toward the use of commer-
cial virtual reality (VR) and augmented reality (AR) systems. VR has already been
an emerging field with commercial devices for home use such as Oculus Quest or
HTC Vive, which place the user in a virtual environment disconnected from reality.
In VR mirror therapy set-ups, the mirror image is replaced by a digital representation
of the lost limb. However, there is no possibility to interact with the virtual limb, as
it is still just mirrored [12].
In AR, additional virtual objects are superimposed on the view of the visible
reality, such as information on surrounding places by using a smartphone camera.
Some researchers developed a custom-made AR platform and augmented a VR
headset with cameras [13, 14]. Whereas in mixed reality (MR), virtual and real
objects are interacting with each other, such as a realistic 3D representation of a
flower sprouting on top of a table. Extended reality (XR) acts as an umbrella term for
human-computer interaction in 3D display devices [15].
AR applications in PLP research typically use cameras to project the augmented
image of an able-bodied person onscreen, with the virtual arm in the place of the
missing arm through attaching a QR code or a certain pattern to the stump [1, 16–19].
However, in MR, virtual objects can also be delivered in a first-person perspective via
commercially available see-through glasses, such as the Microsoft HoloLens, Magic
Leap or Google glasses [20–23]. Through XR, new experiences of interaction can
be created for the user, delivering a highly immersive environment and facilitating
embodiment. Though viewing oneself in the third-person perspective has been shown to
be beneficial to the user’s spatial awareness, the first-person perspective enables more
accurate interactions [24].

Previous studies in a screen-based augmented reality environment focused on
myoelectric prosthesis control and task transfer from AR to the real world in compar-
ative experiments with pick and place tasks [25] for pattern recognition control [26]
or motor skill enhancement [27].
Regarding gamification in health care, studies have shown that patients rarely
perform their rehabilitation exercises at home to the prescribed extent due to lack
of motivation or in the absence of a physiotherapist [28, 29]. The issue of poor
motivation and limited compliance is common in clinical everyday life [30]. Meta
reviews showed the positive potential of games and gamification to improve health
[31, 32]. Further studies discussed positive effects on therapy, motivation, training
and learning in dealing with diseases [33, 34]. Similarly, in a systematic review of the
literature on games for health, evidence is cited for the positive potential of game-
related interventions, but also point to the need for more methodologically sound
studies in this area [35].
In this study, we aimed to create a mixed reality extension of the conventional
mirror therapy: during mirror therapy, the patient is limited to unilateral move-
ments which, moreover, take place in a seated position and always provide the same
perspective. Our goal was to critically examine the technical possibilities of the
HoloLens 2 and to create an innovative assistive therapy based on a mixed reality
approach that would make full use of the unique capabilities of the AR glasses.
Specifically, we were seeking to (1) create immersion from a technical perspective,
(2) liberate the patients from their restrictive position at the table and to enable them
to freely and actively explore their environment, (3) to perform bi-manual tasks by
augmenting their residual limb with a virtual arm which is completely independent
of the movements of their sound limb and (4) to immerse them in a curiosity-driven
game-based environment.

15.2 Methods and Materials

All participants were recruited in accordance with the declaration of Helsinki and
based on the guidelines of the ethical approval by the University of Tuebingen,
Germany (181/2020BO1).
We evaluated the PhantomAR application on the HoloLens 2 in terms of usability
using the System Usability Scale (SUS) with ten able-bodied participants (7 male,
3 female, 30.4 ± 5.6 years) and two unilateral transradial (forearm) amputees (1 male,
56 years; 1 female, 36 years) as proof of concept. Thereby, the real arm of the able-bodied
participants was covered so the HoloLens would not recognize it. About 80% reported
no previous AR experience. The SUS consists of a 10-item questionnaire with a 5-
point Likert scale. It has become an industry standard and allows the evaluation of
a wide variety of products, including hardware, mobile devices and applications.
Additionally, we prepared a user-centered survey consisting of 10 questions such as
“Do you prefer storytelling to guide you through the game?” to evaluate graphics,
ownership, interaction with the virtual objects using both the virtual and real arm,

and comfort of wearing the HoloLens 2. We asked all participants to evaluate the
PhantomAR application regarding intrinsic motivation using the GEQ consisting
of 5 subscales (positive affect, negative affect, flow, challenge, immersion) and 2
additional subscales for control and non-anthropomorphic feedback on a 5-point
Likert scale with 1 meaning “completely disagree” and 5 meaning “completely agree”
[36].
Additionally, we asked the patients to rate PLP sensation before, during and after-
ward on the Numerical Rating Scale (NRS). Similarly to able-bodied participants,
we assessed their game experience with a user-centered survey and added specific
questions pertaining to prosthesis control and PLP. Prosthetic embodiment was eval-
uated by the Prosthesis Embodiment Scale consisting of 10 items and 3 subscales
(ownership, agency and anatomical plausibility) with a rating scale ranging from −3
(strongly disagree) to +3 (strongly agree) [37]. Evaluating PhantomAR was a one-
time intervention, in which all 4 interaction scenes were trialed twice in a random
order.
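For reference, the standard SUS scoring rule used to obtain such a percentage-like value can be sketched as follows; this restates the published SUS procedure, not the evaluation scripts used in this study.

```python
# Standard SUS scoring: odd items contribute (score - 1), even items (5 - score),
# and the sum is scaled by 2.5 to a 0-100 range.
def sus_score(item_scores):
    """item_scores: the ten 1-5 Likert responses of one participant, in questionnaire order."""
    assert len(item_scores) == 10
    total = sum((s - 1) if i % 2 == 0 else (5 - s)   # i is 0-based: even index = odd item
                for i, s in enumerate(item_scores))
    return total * 2.5

# Averaging sus_score(...) over all participants yields the overall score reported later.
```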

15.2.1 Study Setup and Protocol

The setup has been deliberately chosen to be minimal in order to ensure efficient
integration into daily clinical practice. The required devices were the Microsoft
HoloLens 2 and two discontinued Thalmic Myo armbands. The armbands, which
used to be commercially available, include a 9-axis inertial measurement unit (IMU)
(InvenSense MPU-9150) for positional tracking, 8 active EMG electrodes and a
vibration motor for haptic feedback (see Fig. 15.1). For real-time external monitoring,
a Windows computer running Unity was used. No further external sensors were
required for positional tracking and the complete setup can be performed wirelessly
and battery powered.
The created interaction scenes adapt to the room size available and are supposed
to be performed while moving in a given space of 10–20 m².
Participants were fitted with two Myo armbands on the upper and lower arm of the
residual limb, respectively. After donning the HoloLens 2, which required no cables
and was completely battery operated, the virtual arm and threshold control were
calibrated and a profile of the user containing scale and relative shoulder position
was saved once. All these preparatory steps took less than 5 min and only needed to
be performed once, before the first session.
No further information was given on the game, and all participants were naive.
The only instruction they received was to explore their environment with every means
available to them.
After exploring all four scenes twice, all participants took part in evaluating the
application. The average time to play all four scenes twice was 25 min (±4.3). None
of the participants experienced cyber (motion) sickness at any time during the use.

Fig. 15.1 Patients with a transradial (forearm) amputation without (upper images) and wearing
the PhantomAR system consisting of the mixed reality device Microsoft HoloLens 2 and two
myoelectric electrode armbands (Thalmic, lower images). The setup is completely wireless and
does not restrict movement

15.3 Gameplay and Technical Implementation

PhantomAR has been implemented using the game development platform Unity 3D
version 2019.4.20f and the Microsoft Mixed Reality Toolkit. The game was installed
on the Microsoft HoloLens 2 and connected via the Bluetooth low energy protocol
to the Thalmic Myo armbands and received already filtered IMU and EMG data.

15.3.1 Game Design and Interaction Scenes

The game design was focused toward increasing immersion and avoiding frustration
and discomfort for the patients, which would negatively impact PLP. Potential prob-
lems were identified as goals leading to mental stress, difficult tasks leading to high
muscle tension and failure to achieve a desired outcome leading to frustration.
Therefore, a curiosity-driven gameplay was chosen, where the patients can freely
explore an interesting and interactive environment without the possibility of failure
or underperforming. To ensure an immersive experience, the performance and
reactiveness of the game was closely monitored during development.
To allow patients to immerse themselves in the augmented reality experience, the
rehabilitative exercises were integrated into various playful scenes. However, there
is no task associated with these scenes. The patients should explore their environ-
ment curiously and discover for themselves what is possible in this specific scene or
environment. We built four different interaction scenes. All scenes used orientation
and acceleration of the hand and arm as well as different EMG signals as input. The
interaction with virtual objects could always take place with the virtual and the real
hand.
These scenes were:
(A) A fruit-picking scene where players could collect fruits spawning at random
locations in the actual room, such as on desks, on walls, in cabinets or on the
floor, therefore necessitating walking around the room to retrieve these fruits.
Once grabbed by the virtual or real hand, they can be interacted with, i.e.,
enlarged by dragging the contralateral edges of the fruit as seen in Fig. 15.2.
(B) A shooting game in which players could aim and shoot at flowers sprouting
on surfaces in the room which were recognized by the HoloLens 2 grid. Once
critically hit, they wilt and different flowers spawn at various locations.
(C) Drawing into the air or onto surfaces with certain EMG activity and arm
movements. The color palette and brush could be changed.

Fig. 15.2 A patient grasping a banana from the desk with their virtual hand (left image) and
proceeding to use his healthy limb to aid in a bi-manual interaction to enlarge the banana while still
keeping it firmly in their grasp (right image)

Fig. 15.3 A patient uses their virtually augmented limb to interact with a manipulable game
element. Collision with the game element at a certain speed disrupts its structural integrity, whereas
activation of an EMG signal above a certain threshold would trigger a change in the color

(D) A scene consisting of bubbles varying in color and size, which, when touched or
interacted with, lost their structural integrity and dispersed into smaller bubbles
or changed color or speed (see Fig. 15.3).

15.3.2 Spatial Mapping

The HoloLens supports automatic spatial mapping, scanning the floor, walls and
real objects like tables or boxes to allow the interaction of virtual content and the
real world. This feature is used for many game elements, like plants spawning on
surfaces, bullets colliding with walls and other objects, and also for placing virtual
objects on real objects while playing the game.

15.3.3 Virtual Arm and Non-Anthropomorphic Feedback

The virtual arm including the hand was a rigged 3D object, i.e., it consisted of a list of
bones connected via joints, which could be rotated to perform physiologically plausible
movements. The visible presentation of the virtual arm was a rendered mesh that was
connected to the underlying bone structure. In addition to controlling a human arm,
the arm object could be exchanged, and the subject could control a virtual tentacle
instead (see Fig. 15.4). The human arm and the tentacle were controlled via the same
set of degrees of freedom (DoFs), a 3D rotation of the upper and lower arm, 1D wrist
rotation and hand opening/closing, where hand opening and closing could switch
between different grip modes.

15.3.4 Virtual Arm Movement

To have a virtual arm that follows the exact movements of the actual residual arm,
the IMU data of two Thalmic Myo armbands was used, measured on the upper arm
and on the lower arm (transradial stump). The position of the shoulder was fixed in

Fig. 15.4 a Tentacle as seen during the video stream while using the PhantomAR application. b
Tentacle 3D model

relation to the head position and was adapted to match the individual subjects. The 3D
orientation received from the Thalmic Myo armbands was applied to the respective
arm segments representing the upper and lower arm. IMU sensors determine their
spatial orientation via accelerometers, gyroscopes and magnetometers; however, the
received data was affected by a horizontal drift over time and required calibration.
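Conceptually, driving the two virtual arm segments from the armband orientations can be sketched as below; the quaternion conventions, segment lengths and shoulder offset are assumptions, and the actual implementation resides in the Unity application rather than in Python.

```python
# Conceptual sketch: place shoulder relative to the head and chain the two IMU orientations.
from scipy.spatial.transform import Rotation as R
import numpy as np

SHOULDER_OFFSET = np.array([0.18, -0.25, 0.0])   # assumed offset from the head pose (metres)

def arm_pose(head_pos, upper_quat, lower_quat, calib_upper, calib_lower,
             upper_len=0.30, lower_len=0.28):
    """Return shoulder, elbow and wrist positions from the two armband orientations."""
    upper = R.from_quat(upper_quat) * R.from_quat(calib_upper).inv()   # remove calibration pose
    lower = R.from_quat(lower_quat) * R.from_quat(calib_lower).inv()
    shoulder = head_pos + SHOULDER_OFFSET
    elbow = shoulder + upper.apply([0, 0, -upper_len])   # segment axis assumed along -z
    wrist = elbow + lower.apply([0, 0, -lower_len])
    return shoulder, elbow, wrist
```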

15.3.5 Virtual Hand Movement

The virtual hand was controlled via myoelectric signals recorded at the muscles of
the lower arm (transradial stump) with the Thalmic Myo armband. To avoid poten-
tial frustration for the patient, we deliberately chose a simple and robust threshold
controller, similar to regular prostheses. Two electrodes on agonist/antagonist
muscles recorded the activation, and when exceeding a threshold, the virtual hand
either opened or closed with the speed proportional to the muscle activation. For
opening and closing, two poses had been defined for the virtual hand, one in the open
position and one in closed position. The path of the movement between the positions
was calculated as an interpolation of the rotation of the individual bones/joints. The
same movement logic of interpolating between two positions defined as endpoints
was used for the second rigged 3D model, the tentacle.
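A minimal sketch of such a threshold controller and endpoint-pose interpolation is shown below; the gain, update rate and pose representation are assumptions chosen for illustration rather than the deployed values.

```python
# Sketch: agonist/antagonist threshold control with speed proportional to activation,
# and linear interpolation between the open and closed hand poses.
import numpy as np

class ThresholdHandController:
    def __init__(self, open_pose, closed_pose, thr_open, thr_close, gain=2.0, dt=0.02):
        self.open_pose = np.asarray(open_pose)      # joint angles of the rigged hand, fully open
        self.closed_pose = np.asarray(closed_pose)  # joint angles, fully closed
        self.thr_open, self.thr_close = thr_open, thr_close
        self.gain, self.dt = gain, dt
        self.aperture = 1.0                          # 1 = open, 0 = closed

    def update(self, emg_open, emg_close):
        """emg_*: rectified activation of the opening/closing electrode."""
        if emg_open > self.thr_open:
            self.aperture += self.gain * emg_open * self.dt     # speed ~ activation
        elif emg_close > self.thr_close:
            self.aperture -= self.gain * emg_close * self.dt
        self.aperture = float(np.clip(self.aperture, 0.0, 1.0))
        # interpolate each joint between the closed and open endpoint poses
        return self.closed_pose + self.aperture * (self.open_pose - self.closed_pose)
```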

15.3.6 Calibration

The IMU sensors of the Thalmic Myo armbands were calibrated by instructing the
subject to extend their arm forward in a neutral position and place it in the same
space as the virtual arm, which was projected into their view by the HoloLens. The

calibration required repeating after several minutes of interacting in the augmented


environment, which was performed when the horizontal drift of the IMU data became
noticeable to the subject or visible in the video stream of the remote GUI. The
calibration was activated via the remote connection by the therapist, via the HoloLens
voice command “calibrate” or via a finger-tap onto one of Myo armbands.
The threshold EMG movement classification required initial calibration as well
to be optimally adapted to the available muscles and the strength of their myoelec-
tric signals. The subject was tasked to perform muscle activations corresponding
to opening and closing the virtual hand and hold each for 5 s. The electrodes with
highest mean activation for the respective movements were chosen, and the threshold
for classification was set to 15% of the maximum activation.
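The calibration rule for the EMG thresholds can be sketched as follows, assuming rectified per-electrode recordings of the two 5 s holds; the array layout is an assumption.

```python
# Sketch: pick the most responsive electrode per movement and set its threshold
# to 15% of the maximum activation recorded during the corresponding 5 s hold.
import numpy as np

def calibrate(open_recording, close_recording):
    """Each recording: array of shape (samples, 8) of rectified EMG from the 8 electrodes."""
    open_rec, close_rec = np.abs(open_recording), np.abs(close_recording)
    open_electrode = int(open_rec.mean(axis=0).argmax())
    close_electrode = int(close_rec.mean(axis=0).argmax())
    thr_open = 0.15 * open_rec[:, open_electrode].max()
    thr_close = 0.15 * close_rec[:, close_electrode].max()
    return open_electrode, close_electrode, thr_open, thr_close
```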

15.3.7 Interaction with Virtual Environment

Interaction with virtual objects was possible with both the virtual and the healthy
hand. The virtual hand had attached colliders that closely matched its shape, enabling
physical interactions with virtual objects, such as pushing a ball. Small objects, like
marbles, could pass between the virtual fingers creating an immersive interaction
experience.
Another mode of interaction was a grabbing motion, which was activated when
opposing fingers touched the virtual object. Successfully grabbing an object was
accompanied by a short vibration of the Thalmic Myo armband, which was shown to
reduce the time needed for grabbing [27]. The virtual object followed the movement
of the hand and could be carried around until released via an opening of the hand.
The scenes also consisted of special interaction types that could be triggered with
EMG activation. This included the release of paint on the fingertip, shooting a bullet
or changing the color of virtual objects.
The healthy hand was tracked by the HoloLens 2 and provided the position of the
fingers and palm. This made it possible for the real fingers to interact with the virtual
objects as well. Once a suitable object was grabbed with both hands, ambidextrous
interactions with the object were possible such as rotating and dragging it larger or
smaller.
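The grab and bimanual scaling logic can be illustrated with the following sketch; the contact tolerance and the signed-distance helper are assumptions standing in for the Unity collider queries.

```python
# Sketch: an object counts as grabbed when opposing fingertips both touch its collider,
# and a two-handed grasp scales it with the distance between the hands.
GRAB_CONTACT_DIST = 0.01   # metres; assumed contact tolerance

def is_grabbed(thumb_tip, index_tip, object_surface_distance):
    """object_surface_distance(p): distance from point p to the object's collider surface."""
    return (object_surface_distance(thumb_tip) < GRAB_CONTACT_DIST and
            object_surface_distance(index_tip) < GRAB_CONTACT_DIST)

def bimanual_scale(initial_hand_dist, current_hand_dist, initial_scale):
    """Scale the grabbed object proportionally to how far the two grasping hands move apart."""
    return initial_scale * (current_hand_dist / max(initial_hand_dist, 1e-6))
```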

15.3.8 Remote Connection for Therapeutic Intervention

In order to give the therapist the possibility to guide the patient and to control the
virtual scenarios, we have developed a remote control app that can be run on a
computer. The remote app is optional and it communicates with the HoloLens 2 via
a Wi-Fi connection. It provides a live video stream of the mixed reality as it is seen
by the patient. The virtual scenarios used in this study can also be controlled via the
remote app, for example, objects can be manually created or restored to their original

state. In addition, the remote app is used to make various configuration settings such
as establishing the Bluetooth connection to the wristbands or calibrating the EMG
controller.

15.4 Results

The application received an overall SUS score of 78.5% rated by all 12 participants,
indicating slightly above-average usability and user-friendliness.
The following responses were obtained from all 12 participants in the user-
centered survey: Wearing the HoloLens 2 felt comfortable and users had a posi-
tive experience while interacting with virtual objects within the actual environment,
which were perceived as real. Haptic feedback as provided by the Thalmic Myo
armbands supported the immersion of grasping objects and controlling the interface
was intuitive. However, more feedback mechanisms, apart from haptic feedback on
object interactions, should be incorporated. Ownership of the overlaid virtual arm
was rated highly, though agency could still be improved. The arm is perceived as a
controllable part of the user; however, the control algorithm could still be refined.
No one was interested in a storytelling approach that would guide the user through
the application.
The results of the game experience questionnaire are shown in Fig. 15.5. All
participants received the application very positively and had no negative feelings or
felt overwhelmed while playing. Both the immersion and the game flow were rated
highly. Control of the virtual arm was rated with an average of 3.5. The use of a
tentacle instead of a real-looking virtual arm did not pose a problem for the patients
and was also rated positively for the most part, even though during the survey, they
stated that they preferred an arm similar to their own.
PLP was rated 5 on the NRS by both patients before the game and 4 after using the
application. During the game, both patients reported in the user-centered survey that
their PLP decreased while participating in the application. However, on the NRS,
one patient actually reported an increase to 6.
The Prosthesis Embodiment Scale showed a high rating of agency, indicating that
the patients experienced congruent control of their own prostheses and considered the performed
movements as their own (see Fig. 15.6). During PhantomAR, a high agency related to their own
prosthetic control is beneficial to controlling the application.

15.5 Discussion

With PhantomAR, we wanted to develop a wearable assistive therapy tool for PLP
that not only liberates users from their restrictive position at a table, but also allows
them to perform bi-manual tasks and freely interact with virtual objects as well as
objects found in their actual environment. The key motivation for this project was

Fig. 15.5 The results of the game experience questionnaire show 5 subscales for positive and
negative affect, immersion, flow, and challenge, as well as 2 additional subscales for rating the
control over the virtual arm in the PhantomAR app and for rating the experience of operating a
prefabricated tentacle instead of the image of an arm

to investigate the potential of augmented reality and specifically the HoloLens 2 to


extend conventional mirror therapy.
Addressing complex phenomena such as PLP equally requires flexibility in the
treatment approach. The application is modularly built, so we can accommodate a
threshold control method and more sophisticated machine learning algorithms on
EMG data [38], or voice commands in case of poor EMG reception.
In an early stage of development, we tried to correct the drift with QR tags attached
to the wristbands. The idea was the automatic determination of the Myo armbands’
absolute position as soon as a QR tag was detected by the HoloLens. It turned out that
for this built-in tracking function, relatively large tags (> 5 × 5 cm) were required in
order to be recognized and the recognition of moving tags worked very unreliably.
In addition, the use of this function resulted in a noticeable performance loss.
Potential for improvement certainly lies in the control of the virtual hand, which
was rated with an average of 3.5.
Literature pertaining to neuroplastic hypotheses for alleviating PLP highlights
the relevance of prioritizing anthropomorphic visual feedback [11, 39]. The concept
of stochastic entanglement as hypothesized by Ortiz-Catalan, however, predicts
that pain reduction would be independent of the level of anthropomorphic visual
representation presented to the patient [10].

Fig. 15.6 Showing the 3 subscales of the prosthesis embodiment scale for both patients. The agency
subscale was rated highest, indicating a feeling of congruent control during movement of their own
prosthetic hand

Using a tentacle for a hand was a concept which was new to both patients, but
they embraced the idea and stated, that it did not necessarily need to be their hand,
or a hand. In fact, they thought it was fun to explore in the game; however, in real
life, they preferred an anthropomorphic prosthesis to a marine animal.
One patient was certain that PLP was lower during the mixed reality experience,
while the other patient described a lessening of pain during active play in the survey but
later reported on the NRS that pain had increased during play. Both patients agreed that PLP was lower
after using the application. Of course, this one-time proof of concept cannot provide
a statement about the alleviation of PLP. Therefore, increasing the sample size cannot
only provide more insight on PLP but also on embodiment and their progress over
time.
One of the challenges with AR glasses is the restrictive field of view, which might
lead to reduced immersion when not operating in the center of vision.
The Thalmic Myo armbands accumulated a tracking error that required calibrating
after 5–10 min of using the system to avoid a horizontal drift. As the Myo armbands
use a 9-axis IMU containing a magnetometer, this drift should be possible to avoid
without additional hardware. Other groups have shown that the Thalmic Myo IMU
data (without post-processing) has no drift [40].
The latency of the movement of the real arm to the visual representation of the
corresponding virtual arm was not directly measured, but for usual arm movements,

there is no noticeable lag. The latency is assumed to be below 50 ms, as the data
is received from the Thalmic Myo armbands with a latency of around 25 ms [40]
and translated to the virtual arm position within the next frame. This is a comparably low
latency that has not yet been reported in other studies, in which the latency was
500–800 ms when controlling a virtual arm using custom IMU sensors [27].
The finger-tap for periodic re-calibration can be unobtrusively integrated as a game
element, requiring the user to perform a task with the augmented arm stretched out
and tapping onto the Myo armband with the other arm.
PhantomAR was not designed to be goal-oriented, but curiosity driven. There is
no intended or evaluated task transfer from a virtual hand to an actual myoelectric
prosthesis. Such a transfer might occur; however, the idea of PhantomAR is to simply
use the hands, or hand-like representations, while moving through the room and exploring
the environment. Intrinsic curiosity about what one might be able to find out should
be the primary drive.
It was important to not only provide applications for research, but also transfer
them to the clinic. They should be as easy to use as possible, with separate user
interfaces for the clinician and the patient. Therefore, the patient only has to mount
the devices and can start interacting. In addition, the whole system is portable and
completely wireless and can thus be used anywhere in the clinic or even at home. The
system automatically detects the room; therefore, there are no special requirements
for the room in which it is used.

15.6 Conclusion

In this paper, we explored how conventional mirror therapy can be reflected and
extended in a mixed reality approach using the HoloLens 2.
Immersion could be increased from a technical perspective by creating a spatially
coherent experience of the virtual and real world that are responsively interacting
with each other and underlying it with haptic feedback. The virtual as well as the
real hand could perform independently from each other or together. Players could
move around freely and actively and safely explore their surroundings in a manner
driven by intrinsic motivation and curiosity.
Addressing complex health-related and quality of life impacting issues such as
PLP through novel technology requires interdisciplinary teamwork among therapists,
engineers and researchers. To gain further insight on the impact of XR mirror therapy,
we plan to conduct a four-week intervention study using the application four days
per week to compare the intensity, frequency and quality of PLP and embodiment.
Currently, PhantomAR is exclusively available for transradial (forearm) amputees,
but in the future, we plan to extend it to transhumeral (upper arm) amputees as well.

References

1. Trojan, J. et al.: An augmented reality home-training system based on the mirror training and
imagery approach. Behav. Res. Methods. (2014)
2. Mayer, Á., Kudar, K., Bretz, K., Tihanyi, J.: Body schema and body awareness of amputees.
Prosthetics Orthot. Int. 32(3), 363–382 (2008)
3. Clark, R.L., Bowling, F.L., Jepson, F., Rajbhandari, S.: Phantom limb pain after amputation in
diabetic patients does not differ from that after amputation in nondiabetic patients. Pain 154(5),
729–732 (2013)
4. Flor, H.: Phantom-limb pain: characteristics, causes, and treatment. Lancet Neurol. 1(3), 182–
189 (2002)
5. Rothgangel, A., Braun, S., Smeets, R., Beurskens, A.: Feasibility of a traditional and tele-
treatment approach to mirror therapy in patients with phantom limb pain: a process evaluation
performed alongside a randomized controlled trial. Clin. Rehabil. 33(10), 1649–1660 (2019)
6. Richardson, C., Crawford, K., Milnes, K., Bouch, E., Kulkarni, J.: A clinical evaluation
of postamputation phenomena including phantom limb pain after lower limb amputation in
dysvascular patients. Pain. Manag. Nurs. 16(4), 561–569 (2015)
7. Perry, B.N. et al.: Clinical trial of the virtual integration environment to treat phantom limb
pain with upper extremity amputation. Front. Neurol. 9(9) (Sept 2018)
8. Rothgangel, A., Bekrater-Bodmann, R.: Mirror therapy versus augmented/virtual reality applica-
tions: towards a tailored mechanism-based treatment for phantom limb pain. Pain Manag. 9(2),
151–159 (March 2019)
9. Foell, J., Bekrater-Bodmann, R., Diers, M., Flor, H.: Mirror therapy for phantom limb pain:
brain changes and the role of body representation. Eur. J. Pain 18(5), 729–739 (2014)
10. Tsao, J., Ossipov, M.H., Andoh, J., Ortiz-Catalan, M.: The stochastic entanglement and
phantom motor execution hypotheses: a theoretical framework for the origin and treatment
of phantom limb pain. Front. Neurol. 9, 748 (2018). www.frontiersin.org
11. Moseley, L.G., Gallace, A., Spence, C.: Is mirror therapy all it is cracked up to be? Current
evidence and future directions. Pain 138(1), 7–10 (2008)
12. Dunn, J., Yeo, E., Moghaddampour, P., Chau, B., Humbert, S.: Virtual and augmented reality
in the treatment of phantom limb pain: a literature review. NeuroRehabilitation 40(4), 595–601
(2017)
13. Thøgersen, M., Andoh, J., Milde, C., Graven-Nielsen, T., Flor, H., Petrini, L.: Individu-alized
augmented reality training reduces phantom pain and cortical reorganization in amputees: a
proof of concept study. J. Pain 21(11–12), 1257–1269 (2020)
14. Boschmann, A., Neuhaus, D., Vogt, S., Kaltschmidt, C., Platzner, M., Dosen, S.: Immersive
augmented reality system for the training of pattern classification control with a myoelectric
prosthesis. J. Neuroeng. Rehabil. 18(1), 1–15 (2021)
15. Andrews, C., Southworth, M.K., Silva, J.N.A., Silva, J.R.: Extended reality in medical practice.
Curr. Treat. Options Cardio. Med. 21, 18 (1936)
16. Ortiz-Catalan, M., et al.: Phantom motor execution facilitated by machine learning and
augmented reality as treatment for phantom limb pain: a single group, clinical trial in patients
with chronic intractable phantom limb pain. Lancet 388(10062), 2885–2894 (2016)
17. Lendaro, E., Middleton, A., Brown, S., Ortiz-Catalan, M.: Out of the clinic, into the home: the
in-home use of phantom motor execution aided by machine learning and augmented reality for
the treatment of phantom limb pain. J. Pain Res. 13, 195–209 (2020)
18. Bach, F., et al.: Using Interactive Immersive VR/AR for the Therapy of Phantom Limb Pain.
Hc’10 Jan, pp. 183–187 (2010)
19. Ambron, E., Miller, A., Kuchenbecker, K.J., Buxbaum, L.J., Coslett, H.B.: Immersive low-cost
virtual reality treatment for phantom limb pain: evidence from two cases. Front. Neurol. 9, 67
(2018)
20. Markovic, M., Karnal, H., Graimann, B., Farina, D., Dosen, S.: GLIMPSE: Google glass
interface for sensory feedback in myoelectric hand prostheses. J. Neural. Eng. 14(3) (2017)
15 Extending Mirror Therapy into Mixed Reality—Design … 215

21. Tepper, O.M., et al.: Mixed reality with hololens: where virtual reality meets augmented reality
in the operating room. Plast. Reconstr. Surg. 140(5), 1066–1070 (2017)
22. Saito, K., Miyaki, T., Rekimoto, J.: The method of reducing phantom limb pain using optical
see-through head mounted display. In: 2019 IEEE Conference on Virtual Reality and 3D User
Interfaces (VR), pp. 1560–1562 (2019)
23. Lin, G., Panigrahi, T., Womack, J., Ponda, D.J., Kotipalli, P., Starner, T.: Comparing order
picking guidance with microsoft hololens, magic leap, google glass XE and paper. In: Proceed-
ings of the 22nd International Workshop on Mobile Computing Systems and Applications, vol.
7, pp. 133–139 (2021)
24. Gorisse, G., Christmann, O., Amato, E.A., Richir, S.: First- and third-person per-spectives in
immersive virtual environments: presence and performance analysis of em-bodied users. Front.
Robot. AI 4, 33 (2017)
25. Nishino, W., Yamanoi, Y., Sakuma, Y., Kato, R.: Development of a myoelectric prosthesis
simulator using augmented reality. In: 2017 IEEE International Conference on Systems, Man,
and Cybernetics (SMC), pp. 1046–1051 (2017)
26. Ortiz-Catalan, M., Sander, N., Kristoffersen, M.B., Håkansson, B., Brånemark, R.: Treatment
of phantom limb pain (PLP) based on augmented reality and gaming controlled by myoelectric
pattern recognition: a case study of a chronic PLP patient. Front. Neurosci. 8(8), 1–7 (Feb
2014)
27. Sharma, A., Niu, W., Hunt, C.L., Levay, G., Kaliki, R., Thakor, N.V.: Augmented reality
prosthesis training setup for motor skill enhancement (March 2019)
28. Tatla, S.K., et al.: Therapists’ perceptions of social media and video game technologies in upper
limb rehabilitation. JMIR Serious Games 3(1), e2 (2015).
29. Lohse, K., Shirzad, N., Verster, A., Hodges, N.: Video games and rehabilitation: using design
principles to enhance engagement in physical therapy, pp. 166–175 (2013)
30. Arya, K.N., Pandian, S., Verma, R., Garg, R.K.: Movement therapy induced neural reorgani-
zation and motor recovery in stroke: a review. J. Bodyw. Mov. Ther. (2011)
31. Primack, B.A., et al.: Role of video games in improving health-related outcomes: a systematic
review. Am. J. Prev. Med. (2012)
32. Kato, P.M.: Video games in health care: closing the gap. Rev. Gen. Psychol. (2010)
33. Gamberini, L., Barresi, G., Majer, A., Scarpetta, F.: A game a day keeps the doctor away: a
short review of computer games in mental healthcare. J. Cyber Ther. Rehabil. (2008)
34. Gentles, S.J., Lokker, C., McKibbon, K.A.: Health information technology to facilitate commu-
nication involving health care providers, caregivers, and pediatric patients: a scoping review.
J. Med. Internet Res. (2010)
35. Johnson, D., Deterding, S., Kuhn, K.A., Staneva, A., Stoyanov, S., Hides, L.: Gamification for
health and wellbeing: a systematic review of the literature. Internet Interv. (2016)
36. Ijsselsteijn, W.A., Kort, Y.A.W.D., Poels, K.: The game experience questionnaire. In: Johnson,
M.J., VanderLoos, H.F.M., Burgar, C.G., Shor, P., Leifer, L.J. (eds) Eindhoven, vol. 2005, no.
2013, pp. 1–47 (2013)
37. Bekrater-Bodmann, R.: Perceptual correlates of successful body–prosthesis interaction in lower
limb amputees: psychometric characterisation and development of the prosthesis em-bodiment
scale. Sci. Rep. 10(1), (Dec 2020)
38. Prahm, C., Schulz, A., Paaßen, B., Aszmann, O., Hammer, B., Dorffner, G.: Echo state networks
as novel approach for low-cost myoelectric control. In: Artificial Intelligence in Medicine: 16th
Conference on Artificial Intelligence in Medicine, AIME 2017, June 21–24, 2017, Proceedings,
no. Exc 277, Vienna (pp. 338–342). Austria, Springer (2017)
39. Harris, A.J.: Cortical origin of pathological pain. Lancet 354(9188), 1464–1466 (1999)
40. Nyomen, K., Romarheim Haugen, M., Jensenius, A.R.: MuMYO—evaluating and exploring
the MYO armband for musical interaction. Proceedings International Conference New
Interfaces Musical Expression (2015)
Chapter 16
An Analysis of Trends and Problems
of Information Technology Application
Research in China’s Accounting Field
Based on CiteSpace

Xiwen Li, Jun Zhang, Ke Nan, and Xiaoye Niu

X. Li · J. Zhang · K. Nan (B) · X. Niu
School of Accounting, Hebei University of Economics and Business, Hebei, China
e-mail: [email protected]

J. Zhang
Hebei Zhongmei Xuyang Energy Co. LTD, Hebei, China

Abstract By using CiteSpace software and taking the number of publications, the
main authors and institutions, the research topics, and the research fronts as indexes,
a text mining and visual analysis of the existing literature in the domestic CNKI
database from 2000 to 2020 is conducted. In line with the development of practice, the
volume of research literature on the application of information technology in accounting
has increased year by year, but the quality has not improved correspondingly; big
data, management accounting, financial sharing, cloud accounting, and blockchain
technology have been in the spotlight of recent research; and the Ministry of Finance,
the National Accounting Institute, and financial support have played an important role.
However, there are still challenges, such as a lack of cross-institutional and cross-
regional cooperation among scholars, limited research on the accounting informatization
construction of SMEs, and inadequate literature on accounting education. Strengthening
guidance and support and promoting cooperation and exchanges can continuously
promote the mutual progress of theoretical research and practical innovation
in the application of information technology in the field of accounting.

16.1 Questions Posed

As science and technology have rapidly developed, accounting is no longer purely
manual. As a result of the application of information technology in accounting,
scholars are investigating the subject of accounting modernization, from the initial
research on accounting computerization to the current trend of accounting informatization.
By integrating theoretical research and practical application, technology for
accounting is being applied more effectively.


Accounting is undergoing significant changes with the development of information
technology, and practice places higher demands on accountants in the new era; in
particular, the continuous improvement of big data technology brings new challenges
to accounting work. How scholars should integrate information technology with
accounting practice in their research has become a topic of common concern in both
the academic and practitioner communities.
In this study, CiteSpace is used to conduct text mining and visual analysis of
manuscripts from domestic core journals and the CSSCI database from 2000 to 2020
that contain research on the application of information technology in the accounting
field, in order to determine the context and dynamics of current research.

16.2 Text Mining and Visual Analysis of Research Trends in Information Technology in the Field of Accounting in China

16.2.1 Research Tools

CiteSpace, a statistical and information visualization software program developed
by Professor Chen Chaomei of the College of Computing and Informatics, Drexel
University, USA, is used for the analysis [1]. By analyzing the number of articles,
keywords, authors, institutions, and time distribution, the study reveals the current
status, hotspots, and directions of the information technology literature in the field
of accounting, explores existing problems, and suggests potential future directions.

16.2.2 Source of Data

The literature selected for this paper comes from the CNKI database. In order to
ensure the representativeness and authority of the selected data, the literature source
is restricted to Peking University core journals and the CSSCI database through the
advanced search function. The collected content includes keywords, authors,
institutions, article titles, publication time, publications, and abstracts. The search
themes include financial sharing, big data accounting, Internet accounting, accounting
computerization, accounting informatization, accounting cloud computing, accounting
intelligence, blockchain, and artificial intelligence. The retrieval was performed on
February 1, 2021, covering the period 2000–2020. In total, 5501 documents were
retrieved and imported into the software; after duplicates were removed through the
data module, 4136 valid documents remained, cited 45,658 times in total with an
average citation frequency of 11.04.
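To illustrate the kind of data cleaning step described above, a minimal Python sketch of deduplicating exported bibliographic records is given below. CiteSpace performs this step internally through its data module; the file name and column names used here are assumptions for illustration only.

# Minimal sketch of deduplicating exported bibliographic records before analysis.
# CiteSpace performs this step internally; the CSV file name and column names
# ("title", "authors", "year") are assumptions for illustration only.
import pandas as pd

records = pd.read_csv("cnki_export.csv")          # hypothetical export file
before = len(records)

# Treat records with the same title, authors, and year as duplicates.
records = records.drop_duplicates(subset=["title", "authors", "year"])
print(f"{before} records loaded, {len(records)} kept after deduplication")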

16.2.3 Text Mining and Visual Analysis of Related Research Based on Classification

Visual Analysis of the Publication Volume. As shown in Fig. 16.1, the number
of applied research papers on information technology in accounting has increased year
by year, and the growth rate is fast, indicating that with the development of information
technology, the related theoretical research is also receiving attention from academic
circles. The total number of publications has increased significantly, and the growth
trend of non-core publications is roughly the same. However, the number of research
papers published in Peking University core and CSSCI journals has not changed
significantly, meaning that the quality of research is not improving markedly; this may
be related to the fact that empirical research is more prevalent in core journals. At the
same time, related research topics have also generated new branches, especially after
2013, when technologies such as financial sharing, cloud accounting, big data, and
intelligent finance emerged; yet, compared with accounting computerization and
accounting informatization, the number and quality of articles on these branch topics
are still insufficient at this stage. In conclusion, from 2000 to 2020, the number of
applied research papers on information technology in the accounting field has increased
year by year, but the quality of the research needs to be improved.

Fig. 16.1 Statistics of the number of articles issued
Visual Analysis Based on Research Themes. It should be noted that the threshold
algorithm is set in the initial processing parameters of CiteSpace, and c (minimum
citations), cc (co-citations in this slice), and CCV (co-citations after specification)

are, respectively, (4, 4, 20), (6, 5, 20), and (6, 5, 20), and the time slice is one year. In
addition, analysis of the keywords in the literature data yields 114 nodes (N = 114),
199 critical paths (E = 199), and a network density of 0.0309. Keywords with a
frequency greater than or equal to 30 are retained, and the co-occurrence network
(Fig. 16.2) and TimeZone view (Fig. 16.3) of the main keywords are obtained. In the
keyword co-occurrence network map, each node represents a keyword, and the size
of the node reflects the frequency of that keyword. The color of the node reflects the
publication date of the document in which the keyword appears, and a darker color
represents an earlier publication date.
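To make the reported network statistics concrete, the following minimal Python sketch (using the networkx library) shows how a keyword co-occurrence network can be built from per-article keyword lists and how its density is computed. The toy keyword lists are invented for illustration; this is not the authors' CiteSpace workflow itself.

# Minimal sketch: building a keyword co-occurrence network and computing its
# density, analogous to the figures reported above (N = 114, E = 199,
# Density = 0.0309). The per-article keyword lists are hypothetical placeholders.
from itertools import combinations
import networkx as nx

articles = [
    ["accounting informatization", "internal control", "ERP"],
    ["big data", "management accounting", "accounting informatization"],
    ["cloud accounting", "financial sharing", "big data"],
]

G = nx.Graph()
for keywords in articles:
    for a, b in combinations(sorted(set(keywords)), 2):
        if G.has_edge(a, b):
            G[a][b]["weight"] += 1          # co-occurrence frequency
        else:
            G.add_edge(a, b, weight=1)

n, e = G.number_of_nodes(), G.number_of_edges()
density = 2 * e / (n * (n - 1))             # undirected network density, as in nx.density(G)
print(n, e, round(density, 4))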
Analysis of Co-occurrence of Keywords. In Figs. 16.2 and 16.3, it can be seen that
in the past 20 years, China’s accounting computerization and accounting informati-
zation research have occupied the top two positions, respectively, with centralities
of 0.32 and 0.18. A finer breakdown shows that the two themes were at the core
of research during 2001–2010, after which the research focus shifted toward big data
accounting. Relevant research has transitioned from the study of accounting software
to accounting information systems and internal control, and then to the combined
application of information technology and management accounting. This has promoted
the rapid transformation of accounting from traditional accounting functions to
management and service functions. Management accounting is closely related to
informatization. To some extent, the management accounting boom is driven by
national policies. In 2013, the Ministry of Finance designated management accounting
as an important direction of accounting reform, and in 2014 it issued the "Guiding
Opinions on Comprehensively Promoting the Construction of Management Accounting
System," which ushered in a period of rapid development for management accounting
[2]. Before 2012, the focus of management accounting research was cost control and
theoretical exploration. After 2012, with the rise of big data and artificial intelligence,
the content of management accounting research gradually became richer, and in 2014
it rose to the level of a national strategy [3].
The powerful data mining, processing, and analysis capabilities of big data tech-
nology have expanded the information sources of management accounting, enabling
it to unearth the potential value from data to enhance the competitiveness of enter-
prises and maximize its benefits [4, 5]. Based on the above analysis, it is evident that
big data and management accounting are at the core of the research at this stage,
echoing the development of practice.
Timeline Chart Analysis. The first stage (2001–2010) covers the research topics of
domestic core journals and CSSCI papers. The initial processing parameters of
CiteSpace are set to k = 25, with the time span 2001–2010 and a one-year time slice;
the timeline mode in the visualization options is selected to draw the keyword
co-occurrence clustering timeline. According to Fig. 16.4, the graph has 487 nodes and
is divided into eight clusters, with a modularity value of 0.4321 and a silhouette value
of 0.7434.
Clusters 1, 3, 5, and 8 are computerized accounting subjects. The main
keywords are computerized accounting, computerized auditing, computerized
accounting system, accounting data, accounting software, accounting center, office
automation, commercialized accounting software, accounting reform, computerized
accounting, accounting computerization major, comprehensive professional ability,
practical teaching, accounting computerization teaching, and applied talents. Among
these keywords, accounting computerization, accounting computerization system,
accounting data, accounting software, and accounting computerization major have
strong centrality.

Fig. 16.2 Co-occurrence network of keywords

Fig. 16.3 Keyword time zone network

Fig. 16.4 2001–2010 timeline for keyword clusters
In Fig. 16.4, each node represents a keyword, and the size of the node indicates its
frequency. The color change of a node indicates the date when the keyword appeared
in the article, and a darker color indicates an earlier date. The lines between nodes
represent occurrences of keywords in the same article. Years are counted from left to
right on the timeline, and the position of a node indicates the date when the keyword
appeared for the first time.
Cluster 0 is accounting informatization. The main topics are accounting informa-
tization, information system, network technology, business process reorganization,
information society, internal control, information technology, accounting informa-
tion system, accounting system, accounting information resources, value chain
accounting management, information system reconstruction, intelligent agent, intel-
ligent financial software, and data mining technology, among which accounting
informatization, internal control, accounting information system, and information
technology have strong centrality. At this stage, accounting computerization and
accounting informatization are at the core of research. China’s accounting comput-
erization originated in the 1980s and completed the transition from “accounting”
to “management” at the beginning of the twenty-first century, and then ERP and
other management accounting software pushed accounting computerization into a
new stage. According to Liu Qin [6], "the sign of accounting informatization lies in
the widespread use of ERP systems." With the rapid development of big data, artificial
intelligence, mobile Internet, and cloud computing, research has gradually evolved
from accounting computerization to accounting informatization and on toward
accounting intelligence.
In the second phase (2011–2020), domestic core journals and CSSCI papers were
again studied by theme. CiteSpace was configured with the g-index algorithm and
k = 25, a time interval of 2011–2020, a one-year data slice, and the timeline mode
in the visualization options to build a keyword co-occurrence clustering timeline and
analyze the evolution of research themes and the interdependence between them.
According to Fig. 16.5, the graph covers 499 nodes, which are classified into
seven major clusters with a modularity value of 0.4617 and a silhouette value of
0.7545.
According to Fig. 16.5, in Cluster 0, which is the largest, the keywords
include accounting informatization, cloud computing, SaaS, IaaS, PaaS, cloud
accounting, financial sharing, Internet era, digital financial transformation, Internet+,
teaching model, big data accounting, and big data intelligence, among which
accounting informatization, cloud accounting, and big data accounting have greater
centrality. Big data technology is the basis for developing intelligent finance, but the
current level of data storage and data processing is not high, and there is a large
research area to explore [7]. Under the influence of big data technology, accountants
should become experts in data mining and visualization; future analysis and display
methods will consist not only of financial statements and texts but also of more
intuitive data graphs; simple and repetitive positions will be replaced by accounting
systems, and accountants will be transformed into data analysts [8].

Fig. 16.5 2011–2020 timeline for keyword clusters
In cluster 1, there is the CPA industry, which focuses on examining the develop-
ment of the accounting profession, the training of accounting talents, and the quali-
fications of accountants and related policies. The main keywords include accounting
information system, accounting management work, accounting firm, accounting
service market, international financial reporting standards, certified public accountant
industry, the accounting industry, accounting information standard system, small and
medium accounting firms, and non-auditing businesses. The application of information
technology in the accounting field has stimulated the development of the certified
public accountant industry, and research on auditing technology and methods is also
a hot topic in the new era. For instance, Xu Chao
classified auditing into three stages: computer-assisted audit, network audit, and big
data audit [9]. A recent study by Chen Wei et al. applied text mining and visual
analysis based on big data technology to the area of auditing, leading to an entirely
new field of research [10].
Cluster 2 focuses on computerized accounting. Keywords are computerized
accounting, reporting system, teaching video, vocational college, accounting major,
teaching reform, and open education. Currently, the number of studies on computerized
accounting has decreased significantly, and the finance function has developed from
accounting to a service orientation and is moving toward digitization and
artificial intelligence [11]. Big data, financial sharing, and artificial intelligence are
gradually being applied to the accounting field, and centralization is gradually being
achieved.
Cluster 3 is dedicated to shared services. The main keywords are shared service
center, shared service model, national audit quality, audit quality, transaction rules,
and risk management and control, among which shared service center, shared service
model, and risk management have stronger centrality. Financial sharing has been a
hot topic in recent years; as early as 1998, Haier Group began to explore the strategy
of financial information sharing [12]. At that time, owing to technical limitations, its
development was not yet mature, and related research did not attract nationwide
attention. In the new stage, accounting informatization is becoming more and more
mature, and financial sharing is no longer a simple online analysis tool; its value in
optimizing organizational structure, streamlining processes, and reducing costs has
been recognized. Currently, financial sharing is widely used by large group companies
and state-owned enterprises, and there is still room for other technologies to be
embedded within financial sharing. For example, combining RPA technology and
OCR scanning technology with financial sharing can dramatically enhance the
automation of corporate finance work, reduce the human error rate, and lower the
operating costs of the enterprise [13].
Cluster 5 is blockchain, and the main keywords are smart contract, consensus
mechanism, expendable biological assets, database technology, blockchain tech-
nology, business and finance integration, data mining, and surplus manipulation,
among which smart contract, blockchain technology, and consensus mechanism have
stronger centrality. Although blockchain has been widely applied to improve
information quality, most of the research focuses on audit investigation, and its value
for enterprise management has yet to be discovered. The application of blockchain in
financial sharing can promote the financial intelligence of enterprises, globalization
of management and control, shared services, and integration of business and finance
[14].
In the second stage, accounting informatization has evolved from the research of
concepts and systems to the application of information technology. In the accounting
field, the number of applied research projects on big data, financial sharing,
blockchain, robotic process automation, and other technologies has increased dramat-
ically, and the content and results have improved as well. In the context of manage-
ment accounting research, the application of different information technologies
combined with management has enriched the work of finance workers and promoted
the transformation of finance personnel from simple bookkeeping work to enter-
prise management [15], which has a far-reaching impact on accounting research and
practice.
Analysis of the Emergence of Research Frontiers. Emergent (burst) keywords are
commonly used to analyze the frontier or research trends in a certain research field.
As seen in Table 16.1, among the 25 keywords for which data were extracted in this
paper, a total of 7 keywords have an emergence degree greater than 20; in descending
order, they are accounting computerization (110.37), big data (62.5), management
accounting (37.74), blockchain (37.51), cloud accounting (33.77), financial sharing
(24.38), and industry-finance integration (20.23). During the period 2001–2008,
accounting computerization was the core theme of research with a prominence of 110,
followed by accounting software and accounting information systems. Since 2009,
the CPA industry became a hot topic and remained so until 2016. With the rapid
application of information technology to the accounting industry, cloud accounting in
2013, big data and management accounting in 2014, and blockchain, financial sharing,
and industry-finance integration in 2017 became hotspots in turn, and their emergence
degrees remained at a high level as of 2020.
Visual Analysis Based on the Lead Authors and Institutions. CiteSpace is set
up to use the g-index algorithm with k = 25 in the initial processing parameters, and
the time slice is one year. The authors and institutions in the literature are analyzed
at the same time, and the initial results show 967 nodes (N = 967), 861 critical
paths (E = 861), and a node density of 0.0018. Authors and institutions with a
frequency greater than or equal to 10 are retained, and the co-occurrence network of
the main authors and institutions is obtained. According to Fig. 16.6, in the
co-occurrence diagram of authors and institutions, each node represents an author or
a research institution. The size of the node reflects the number of published articles;
the color of the node reflects the time of publication, with a darker color indicating an
earlier date; the connections between nodes reflect cooperation between authors,
between authors and institutions, and between institutions; and the thickness of a
connection reflects the closeness of the cooperation.

Table 16.1 Research frontier keyword emergence degree

Keywords                                     Prominence   Start   End
Accounting computerization                   110.37       2001    2008
Computerized accounting                      18.96        2001    2008
Computerization                              18.63        2001    2008
Accounting computerization system            18.23        2001    2007
Accounting software                          13.98        2001    2004
Accounting information system                14.64        2005    2006
CPA industry                                 13.25        2009    2016
Accounting information technology            11.69        2010    2012
XBRL                                         11.21        2010    2016
Shared service center                        17.51        2012    2020
Cloud accounting                             33.77        2013    2020
Cloud computing                              18.82        2013    2018
Big data                                     62.5         2014    2020
Management accounting                        37.74        2014    2020
Big data era                                 11.52        2014    2020
Financial shared service center              19.82        2015    2020
Management accounting information technology 11.34        2015    2020
Financial shared services                    18.46        2016    2020
Internet+                                    12.39        2016    2020
Shared services                              11.11        2016    2020
Blockchain                                   37.51        2017    2020
Financial sharing                            24.38        2017    2020
Industry and finance integration             20.23        2017    2020
Blockchain technology                        15.15        2017    2020
Artificial intelligence                      19.37        2018    2020

In Fig. 16.6, the author with the most papers is Professor Cheng Ping and his
team from the School of Accounting at the Chongqing University of Technology,
whose research direction is the application of big data technology to accounting
[16], followed by Zhang Qinglong from Beijing National Accounting Institute [17],
whose research direction is financial sharing, and Liu Yuting from the Ministry of
Finance, whose research focuses on accounting reform in China [18], followed by
Wang Jun, Yang Jie, Huang Changyong, Ding Shuqin, Ying Limeng, and Liu Qin.
Among the four major research groups in the field of accounting informatization,
the School of Accounting of the Chongqing University of Technology is the most
active. The Accounting Department of the Ministry of Finance, Beijing National
Accounting Institute, and Shanghai National Accounting Institute are also important
research camps.
Figure 16.7 shows that, apart from the National Natural Science Foundation of
China and the National Social Science Foundation of China, the number of projects
funded by the Chongqing Municipal Education Commission is much higher than that
of other sources, indicating that the Chongqing Municipal Education Commission
has paid sufficient attention to applying technology to accounting.
In summary, the Ministry of Finance, the National Accounting Institute, and
funding support have played a major role in advancing this research. However, the
cooperation network of Chinese accounting scholars remains primarily internal, and
the lack of cross-institutional and cross-regional cooperation has had an adverse effect
on its progress.

Fig. 16.6 Co-presence network of major authors and institutions


Fig. 16.7 Distribution of funds supporting research

16.3 A Review of Major Issues Discovered by Research and Analysis Regarding the Application of Information Technology in Accounting

The previous analysis found that the number of relevant studies is basically consistent
with the trend of practice, but the quality of research has not kept pace, and the lack
of cross-boundary cooperation among researchers has become a weak point in current
research. Further examination also reveals two other prominent problems in the
research.

16.3.1 Few Studies Have Been Conducted on the Application of Information Technology to Accounting in SMEs

The level of informatization construction in small and medium-sized enterprises
(SMEs) is low, and few scholars have conducted in-depth research on these enterprises.
Keyword analysis also found that SMEs appeared as a keyword only 51 times,
accounting for only 1.18% of the total number of documents. Differences in research
content and conclusions were small, focusing mainly on low capital investment,
backward software and hardware, a lack of talent, and insufficient attention from
managers [19]. This shows that how to provide SMEs with sufficient funds for
modernization and attract interdisciplinary talent to design and develop financial
informatization application systems matching the development of SMEs is a topic
that urgently needs attention and research. SMEs account for 99% of the enterprises
in China and are the driving force behind the continued positive development of the
national economy. The report of the 19th Party Congress clearly calls for efforts to
"deepen the reform of the science and technology system, establish a technology
innovation system with enterprises as the main body, market-oriented, and deep
integration of industry, academia, and research, strengthen the support for SMEs'
innovation, and promote the transformation of scientific and technological
achievements." Limited by financing difficulties, simple organizational structures, a
lack of talent, and other factors, SMEs still lag behind in the application of information
technology in the field of accounting; in particular, the level of application and the
role of management accounting informatization in Chinese SMEs remain limited [20].

16.3.2 The Number and Quality of Information-Based Accounting Education Research Are Low and Declining

The construction and application of accounting information technology require
high-quality personnel training. Higher education plays an important role in the process
of training informatization talents. By combining 22 keywords such as "accounting
education," "accounting teaching," "practical teaching," and "cultivating students" into
one theme, "accounting education," a total of 176 articles were obtained. This
proportion suggests that information technology in accounting education does not
receive enough attention. Moreover, in the past 20 years of exploration of
information-based accounting education, most of the content concerns accounting
computerization and ERP curriculum design. Flipped classrooms and MOOCs have
been proposed many times [21], but innovative education models have rarely been
explored.
Figure 16.8 shows that the number of research articles on accounting education is
low and on a decreasing trend. Moreover, when the core journals were counted, it was
found that only 13 of the 176 accounting education articles were conference reviews
or book reviews, and high-quality research still needs to be strengthened. This is
basically consistent with the view found by Nian Yan [22] that "the number of core
journal papers on accounting talent training is also decreasing and the quality of
papers is declining."

Fig. 16.8 Statistics on the number of accounting education articles issued



16.4 Conclusions and Recommendations of the Research on the Application of Information Technology in Accounting

16.4.1 Conclusions of the Research

First, the volume of literature on the application of information technology in the field
of accounting is on the rise, but the number of papers published in Peking University
core journals and CSSCI journals has not changed significantly, and the quality of
research on the application of emerging technologies in accounting shows a decreasing
trend compared with research from the period of accounting computerization, which
indicates that the quality of relevant research needs to be further improved.
Second, the research themes and hotspots of information technology applica-
tion in the accounting field show obvious changes with the development of infor-
mation technology. During 2001–2011, accounting computerization and accounting
informatization were at the core of research on information technology in accounting,
and their literature quantity, centrality, and prominence were much higher than other
topics; with the gradual maturity of information technology development, big data
accounting became the hottest topic in 2013–2020, followed by financial sharing,
cloud accounting, and blockchain topics. Overall, at this stage, big data and manage-
ment accounting are at the core of research, big data has opened up new paths
for management accounting research, and management accounting innovation has
become a hot spot for current and future research.
Third, research on the application of information technology in the field of accounting
has benefited from significant contributions by the finance department, the National
Accounting Institute, and fund-supported projects, which fully demonstrates the
importance and leadership of the state in promoting the application of information
technology in the field of accounting. However, the research is mostly confined within
individual units, and the lack of cross-institutional and cross-regional cooperation also
limits the breadth and depth of the research.
Fourth, the research on the application of information technology in the field of
accounting for SMEs and the combination of information technology and accounting
education is obviously insufficient in quantity and generally low in quality, which
needs urgent guidance and attention.

16.4.2 Research Recommendations for Advancing the Application of Information Technology in Accounting

The 14th Five-Year Plan for Accounting Reform and Development has identified
“the application of new information technology to basic accounting work, manage-
rial accounting practice, financial accounting work, and the construction of unit
financial accounting information systems” as the main subject of research. To better
promote information technology application research and enhance the integration of
theoretical research and practical innovation, government departments, application
entities, and research institutions must engage in joint efforts.
First, government departments should continue to lead research on the application of
information technology in accounting: increase funding support, pay particular
attention to improving the quality of research results, and devote more attention to the
application of information technology in accounting for small and medium-sized
enterprises as well as to the combination of information technology and accounting
education. At the same time, government departments should attach great importance
to improving the soft power of sustainable enterprise development by enhancing
management accounting systems and internal control mechanisms. The application of
information technology in accounting should not be limited to a particular enterprise,
unit, or industry; only through systematic research that raises it to the theoretical level
and forms a scientific theoretical system for effectively combining information
technology and accounting can the construction of accounting informatization be
advanced in both breadth and depth, and can accounting play its full positive role in
enterprise management and even economic construction.
Second, accounting scholars should actively expand the scope of cooperation,
strengthen cooperation with government departments and enterprises, and make full
use of cross-institutional and cross-disciplinary collaboration to effectively solve
practical and difficult problems concerning the application of information technology
in the field of accounting, so as to develop, beyond their own narrow vision, a new
pattern of integrated development of accounting information technology application
and theoretical innovation.
Finally, government departments should also attach greater importance to the research
and transformation of accounting education informatization results and continue to
improve the collaborative education mechanism among industry, academia, and
research. Both the supply and demand sides of accounting informatization talent
training should raise awareness and strengthen communication. The development of
the digital economy has steadily raised the requirements for the training of accounting
professionals, including improving the comprehensive ability of teaching staff to apply
information technology to accounting teaching and promoting the transformation of
the training model. Meeting these requirements calls for scientifically promoting
collaboration and exchanges among businesses, schools, and research institutions, and
for enhancing the capacity to cultivate talent who integrate theory and application. The
above measures are conducive to cultivating higher-quality accounting informatization
talent for society and to consolidating the human resources foundation for accounting
to support the development of the information economy.
The impact of information technology application in the accounting field is
far-reaching, and accounting theory research and practice innovation are equally
important. Through visual analysis, this paper sorts out the development of the
research, summarizes and refines the characteristics of current research and some
outstanding problems, and puts forward corresponding suggestions, hoping to attract
the attention of the academic community; only through the joint efforts of government
departments, accounting scholars, and practitioners will the future use of information
technology in the field of accounting become more in-depth and constructive.

References

1. Yue, C., Chaomei, C., Zeyuan, L., et al.: Methodological functions of CiteSpace knowledge
graphs. Scientology Res. 33(2), 242–253 (2015)
2. Man, W., Xiaoyu, C., Haoyang, Y.: Reflections and outlook on the construction of management
accounting system in China. Finan. Acc. (22), 4–7 (2019)
3. Zhanbiao, L., Jun, B.: Bibliometric analysis of management accounting research in China
(2009–2018)-based on core journals of Nanjing university. Finan. Acc. Commun. 7, 12–18
(2020)
4. Maohua, J., Jiao, W., Jingxin, Z., Lan, Y.: Forty years of management accounting: a visual
analysis of research themes, methods, and theoretical applications. J. Shanghai Univ. Fin.
Econ. 22(01), 51–65 (2020)
5. Ting, W., Yinghua, Q.: Exploring the professional capacity building of management accounting
in the era of big data. Friends Account. 19, 38–42 (2017)
6. Qin, L., Yin, Y.: Accounting informatization in China in the forty years of reform and opening
up: review and prospect. Account. Res. 02, 26–34 (2019)
7. Qinglong, Z.: Next-generation finance: digitalization and intelligence. Financ. Account. Mon.
878(10), 3–7 (2020)
8. Weiguo, L., Guangjun, L., Shaobing, P.: The impact of data mining technology on accounting
and response. Financ. Account. Mon. 07, 68–74 (2020)
9. Chao, X., et al.: Research on auditing technology based on big data. J. Electron. 48(05),
1003–1017 (2020)
10. Chen, W., et al.: Research on audit trail feature mining method based on big data visualization
technology. Audit Res. 201(1), 16–21 (2018)
11. Shangyong, P.: On the development and basic characteristics of modern finance. Financ.
Account. Mon. 881(13), 22–27 (2020)
12. Zhijun, W.: Practice and exploration of financial information sharing in Haier group. Financ.
Account. Newslett. 1, 30–33 (2006)
13. Ping, C., Wenyi, W.: Research on the optimization of expense reimbursement based on RPA
in financial shared service centers. Friends Account. 589(13), 146–151 (2018)
14. Runhui, Y.: Application of blockchain technology in the field of financial sharing. Financ.
Account. Mon. 09, 35–40 (2020)
15. Gang, S.: Innovation of management accounting personnel training mechanism driven by big
data and financial integration. Financ. Account. Mon. 02, 88–93 (2021)
16. Ping, C., Jinglan, Z.: Performance management of financial sharing center based on cloud
accounting in the era of big data. Friends Account. 04, 130–133 (2017)

17. Qinglong, Z.: Financial sharing center of Chinese enterprise group: case inspiration and
countermeasure thinking. Friends Account. 22, 2–7 (2015)
18. Yuting, L.: Eight major areas of accounting reform in China are fully promoted. Financ.
Account. 01, 4–10 (2011)
19. Yumei, J.: Discussion on the construction of cloud computing accounting information tech-
nology for small and medium-sized enterprises. Financ. Account. Commun. 07, 106–109
(2018)
20. Xiaoyi, L.: Research on the application of management accounting informatization in small
and medium-sized enterprises in China. Econ. Res. Ref. 59, 64–66 (2016)
21. Weibing, Z., Hongjin, Z.: Exploration of the design and implementation of flipped classrooms
based on effective teaching theory. Financ. Account. 04, 85–86 (2020)
22. Yan, N., Chunling, S.: Visual analysis of accounting talent training research—based on the
data of CNKI from 2009–2018. Financ. Account. Commun. 15, 172–176 (2020)
Chapter 17
Augmented Reality Framework
and Application for Aviation Emergency
Rescue Based on Multi-Agent and Service

Siliang Liu, Hu Liu, and Yongliang Tian

S. Liu · H. Liu · Y. Tian (B)
Beihang University, Xueyuan Road 37, Beijing, China
e-mail: [email protected]

Abstract Aviation emergency rescue is one of the most efficient ways to rescue and
transport people and deliver supplies. Dispatching multiple aircraft for air rescue
covering a large area requires systematic planning. Given the complexity of such a
system, a framework is proposed for building an augmented reality system to present
the situation and assist in decision-making. An augmented reality simulation and
monitoring system for aviation emergency rescue based on multi-agent and service
architectures is completed to apply the framework.

17.1 Introduction

Aircraft, which include fixed-wing aircraft and helicopters, have the advantages of
rapid mobility, multi-type loading capability, and little restriction by terrain. Aircraft
have therefore been increasingly applied in the emergency rescue area. Missions such
as aviation firefighting [1], aeromedical rescue [2], aviation search and rescue [3], and
aviation transport can be collectively called aviation emergency rescue. With its
enormous scale of land and sea, China has a great need for aviation emergency rescue
when disasters occur. Yet maintaining a large aircraft fleet in every city is impossible
owing to economic constraints. Thus, how to deploy and dispatch aircraft within a
certain area becomes a problem. Wang et al. [4] studied the deployment and dispatch
of aviation emergency rescue. While methods to deploy and dispatch the aircraft have
been discussed, an intuitive way of displaying and commanding the process remains
to be developed.
Augmented reality provides an efficient and intuitive way to present a virtual
environment in the physical world. Augmented reality has been applied to the
aeronautic field for training and maintenance instruction [5], and models for large
areas of terrain [6] as well as agent-based models [7] have been developed in
augmented reality. Augmented reality device providers such as Microsoft and Magic
Leap, and game engines such as Unity3D and Unreal, have developed tools and
environments that allow developers to build augmented reality programs with little
concern for hardware adaptation, so that they can focus on the program itself.
In this paper, the advantages of augmented reality are taken into consideration for
aviation emergency rescue display and commanding. A framework for large-scale
aviation emergency rescue based on augmented reality is proposed. A system instance,
developed in Unity3D with MRTK and deployed on Microsoft's Hololens2, is built to
verify the framework.

17.2 Framework of Augmented Reality System

17.2.1 System Framework

Considering the demand for aviation emergency rescue on a large scale, the framework
of the augmented reality system consists of two main parts: the service part and the
multi-agent part. Besides the augmented reality system itself, the framework also
contains a development environment part and a hardware part. The development
environment includes a toolkit to develop services for augmented reality, 3D modeling
software to build models of aircraft and cities, and a game engine to visualize the
system. The whole system is installed on augmented reality devices with which users
can watch and interact. The framework is shown in Fig. 17.1.
The service part contains services that offer functions when needed and can be called
one or more times while the system is running. The multi-agent part contains two
main types of entities that are visualized in the system and are instantiated in the
application that applies the framework.

Fig. 17.1 Augmented reality system framework for aviation emergency rescue

17.2.2 Service-Oriented Architecture

The service-oriented architecture contains several services in two groups: the system
basic services and the scenario services. The system basic services begin to run when
the system initializes and keep running in the background, offering services to get
input from the user, send data to or receive data from other systems, and keep the
system's hologram anchored to a specific place in space. The scenario services, on the
other hand, are highly tied to aviation emergency rescue. Services in the scenario
group are called only once after the system is initialized; they function as the first step
to visualize the system or the last step when shutting down the system.

System Basic Service


System basic service contains three main services: the spatial anchor and share
service, the gesture recognition service, and the network transfer service.
The spatial anchor and share service is the key service for anchoring the hologram
onto the real world. Since users of the augmented reality system may wander around
the hologram to get a better view or may wander away to deal with other tasks, it is of
vital importance to maintain the hologram and anchor it in space. The spatial anchor
and share service can identify the spatial point or plane that the user specifies and
hold it until a change is required.
The gesture recognition service provides the user with a way to interact with the
augmented reality system. This service runs throughout the whole period of system
operation, acting as a port for receiving the user's input. In an augmented reality
system, users can use one or both hands to click, select, spin, and zoom in or out on
the hologram. The gesture recognition service monitors the positions of users' hands
and their gestures, and once a gesture fits a specific pattern, the service translates it
into an instruction and invokes the relevant functions.
The network transfer service is used for transferring data over the network with other
computers to synchronize the state of the system. Its function can be divided into two
parts: sending data and receiving data. Receiving data is called when the system is
used to display the state of aircraft carrying out emergency rescue tasks, and sending
data is called when the user gives instructions about where and what task an aircraft
will carry out.
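As an illustration of the send/receive split described above, a minimal Python sketch of state synchronization over UDP follows. The port number and the JSON message layout are assumptions; the actual service is implemented inside the Unity3D application.

# Minimal sketch of the send/receive split of the network transfer service.
# The actual system is built in Unity3D; this only illustrates synchronizing
# state over UDP. The port number and JSON message layout are assumptions.
import json
import socket

PORT = 9000  # hypothetical port

def send_instruction(host: str, aircraft_id: str, task: str, city: str) -> None:
    """Send a dispatch instruction (user side)."""
    msg = json.dumps({"aircraft": aircraft_id, "task": task, "city": city}).encode()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(msg, (host, PORT))

def receive_state(timeout: float = 1.0):
    """Receive one aircraft state update (display side); returns None on timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind(("", PORT))
        sock.settimeout(timeout)
        try:
            data, _addr = sock.recvfrom(4096)
            return json.loads(data.decode())
        except socket.timeout:
            return None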

Scenario Service
Scenario service is highly correlated with the system's specific task scenarios. It
contains services of three kinds, which perform before, during, and after the scene.
The mission generation service is used when the system functions as a simulation or
training system for users and is called before the scene starts. This service creates
missions for cities in the scene and allows aircraft to fulfill the missions.
The main event display service is called during the scene and acts as a notepad for
users to record commands; it also gives users a hint of which mission is being
accomplished.
The event log service is called after the scene, when all the missions are accomplished.
It records every arrangement of the aircraft, including the ID of the aircraft, the
mission it carries, and the time the arrangement is made. The event log service saves
the log as a file, which can be used to evaluate the efficiency of the aircraft and other
indicators.
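A minimal sketch of what such an event log entry might look like is given below. The field names, the CSV format, and the file name are illustrative assumptions rather than the system's actual log format.

# Minimal sketch of an event log entry as described above (aircraft ID, mission,
# and the time the arrangement is made), appended to a file for later evaluation.
# Field names, file name, and the CSV format are assumptions for illustration only.
import csv
from dataclasses import dataclass
from datetime import datetime

@dataclass
class LogEntry:
    aircraft_id: str
    mission: str
    timestamp: str

def append_log(path: str, aircraft_id: str, mission: str) -> None:
    entry = LogEntry(aircraft_id, mission, datetime.now().isoformat(timespec="seconds"))
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([entry.aircraft_id, entry.mission, entry.timestamp])

append_log("rescue_log.csv", "helicopter-01", "transport medical supplies to city A")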

17.2.3 Multi-Agent Architecture

The multi-agent architecture comprises two main types of agents in the system,
aircraft agents and city agents. Each type of agent has its own attributes and functions,
and the agents can interact with each other to update their attributes.

Aircraft Agent
The aircraft agent is the base class of all aircraft objects in the system. This agent
class has attributes and functions that mainly represent how the fixed-wing aircraft
and helicopters behave in the system.
Aircraft agent class has attributes including appearance, aircraft type, fuel load,
and loading ability. The appearance attribute is used for modeling the aircraft when
visualizing the system, and its hologram indicates the position and rotation of the
aircraft. The aircraft type, one of the basic attributes of an aircraft, is used for the main
event display and event logging. Fuel load represents the quantity of fuel that the
aircraft is carrying; this attribute is taken into consideration when users decide which
mission the aircraft will accomplish.
Loading ability measures how many people, how heavy the supplies, and what kind
of equipment the aircraft can carry. This attribute is another factor that should be
considered when making decisions.
Functions of the aircraft agent class include planning a route, flying to a destination,
executing a task, and updating the load. The route planning function generates the
waypoints toward the destination, depending on the aircraft type and the terrain.

The flying function can be called after the route is created; it controls the aircraft's
position and rotation so that they are consistent with the real situation. The task
execution function is called after the aircraft reaches the destination and works
together with the load updating function: together, these two functions accomplish the
task and update what the aircraft carries.

City Agent
The city agent is the base class of all city objects in the system. The attributes and
functions of the city agent class match those of the aircraft agent class.
City agent class has attributes including location, airport capacity, resource, and
resource demand. Location includes a location in the real world and a location in the
system, and either can be converted into the other when needed. The location is
needed when an aircraft plans a route. Airport capacity measures how many aircraft
can land and execute missions at the same time in the city; this attribute influences
which missions aircraft will take. Resource measures what kind of resource, and how
much of it, the city can offer so that aircraft can transport it to another city in need.
Resource demand, on the other hand, measures what kind of resource, and how much
of it, the city needs.
The functions of the city agent are accepting aircraft, offering resources, and updating
demand. Accepting aircraft updates the number of aircraft at the airport to decide
whether the city can accept more aircraft. Offering resources is called when an
aircraft arrives in the city and loads supplies; it decreases the city's resources
according to how much the aircraft loads. Updating demand is called after an aircraft
delivers supplies to the city; the resource demand is decreased by this function.
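To make the agent design concrete, the following minimal Python sketch outlines the two agent classes and one interaction between them, an aircraft delivering supplies to a city and both agents updating their attributes. The attribute names, units, and dictionary-based resource model are assumptions for illustration; the actual agents are implemented as Unity3D objects.

# Minimal sketch of the aircraft and city agents described above and of one
# interaction between them (delivering supplies). Attribute names, units, and
# the dictionary-based resource model are assumptions for illustration; the
# actual system implements these agents as Unity3D objects.
from dataclasses import dataclass, field

@dataclass
class CityAgent:
    name: str
    location: tuple              # (latitude, longitude), convertible to scene coordinates
    airport_capacity: int        # aircraft that can operate at the same time
    resource: dict = field(default_factory=dict)         # what the city can offer
    resource_demand: dict = field(default_factory=dict)  # what the city needs
    parked_aircraft: int = 0

    def accept_aircraft(self) -> bool:
        if self.parked_aircraft < self.airport_capacity:
            self.parked_aircraft += 1
            return True
        return False

    def update_demand(self, kind: str, amount: float) -> None:
        self.resource_demand[kind] = max(0.0, self.resource_demand.get(kind, 0.0) - amount)

@dataclass
class AircraftAgent:
    aircraft_id: str
    aircraft_type: str           # e.g., fixed-wing or helicopter
    fuel_load: float
    loading_ability: float       # maximum payload it can carry
    payload: dict = field(default_factory=dict)

    def execute_task(self, city: CityAgent, kind: str, amount: float) -> None:
        """Deliver supplies to a city and update both agents' attributes."""
        if not city.accept_aircraft():
            return                      # airport is full, task cannot start
        delivered = min(amount, self.payload.get(kind, 0.0))
        self.payload[kind] = self.payload.get(kind, 0.0) - delivered
        city.update_demand(kind, delivered)

city = CityAgent("City A", (29.1, 120.0), airport_capacity=2,
                 resource_demand={"medical supplies": 5.0})
aircraft = AircraftAgent("helicopter-01", "helicopter", fuel_load=800.0,
                         loading_ability=2.0, payload={"medical supplies": 2.0})
aircraft.execute_task(city, "medical supplies", 2.0)
print(city.resource_demand)  # {'medical supplies': 3.0}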

17.3 Construction of Aviation Emergency Rescue Augmented Reality System

An augmented reality system for aviation emergency rescue is constructed under the framework in Sect. 17.2. The development platform is the Unity3D engine, which provides a user-friendly environment for developing and visualizing the system. MRTK, the Mixed Reality Toolkit, is a package of tools that assists in developing augmented reality programs in Unity3D. The augmented reality device used to deploy the system is Microsoft's HoloLens 2, which provides the hardware and software to support the services designed in the system. The development OS is Windows 10 Professional; the CPU is an Intel(R) Core(TM) i7-9700KF @ 3.60 GHz with 16 GB of memory. The Unity3D version used to construct the system is 2019.4 (LTS), and the MRTK version is 2.7.2.0.
The map and terrain in the system are based on Zhejiang Province, China, which has a total area of 105,500 km². Fourteen types of aircraft are chosen as instances of the aircraft agent class, and fifteen cities are chosen as instances of the city agent class.

Fig. 17.2 Construction of augmented reality system

17.3.1 Construction of System

The construction of the augmented reality system comprises three parts: the development environment, services, and entities. MRTK and Unity3D together offer the fundamental environment for developing a program targeting the Universal Windows Platform, which can be released on HoloLens 2. Services are established in the development environment, and some of them use functions provided by MRTK or Unity3D. Entities are objects in the scene and are visualized by Unity3D. When data or instructions are transferred from outside systems by the network transfer service among the basic services, the states of aircraft and cities can be changed by the services. The construction of the augmented reality system is shown in Fig. 17.2.

17.3.2 Services Development

The basic system services are adapted from functions that already exist in MRTK or Windows. Some services can be realized in more than one way; this paper adopts one approach for each service and briefly introduces the alternatives.
The spatial anchor and share service is based on Unity3D's built-in XR SDK. In Unity3D, a component named "World Anchor" can be added to an object, linking that object to HoloLens 2's understanding of an exact point in the physical world. Unity also provides a mechanism named "World Anchor Transfer Batch" to transfer world anchors between devices. The flowchart of the spatial anchor and share service is shown in Fig. 17.3. Other options include using World Locking Tools, which is available in higher Unity3D versions, or using image recognition. World Locking Tools is similar to World Anchor and is likewise based on HoloLens 2's understanding of the real world. Image recognition uses a pre-placed picture in the physical world to locate the device and initialize the system; the user's observation position in the system is then based on data from the 6-DoF sensors.

Fig. 17.3 Procedure of spatial anchor and share service
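A minimal C# sketch of exporting and importing a world anchor along these lines is given below. It uses the legacy built-in XR namespaces available in Unity 2019.4 (UnityEngine.XR.WSA and UnityEngine.XR.WSA.Sharing); the object and identifier names are illustrative assumptions, and error handling is omitted.

```csharp
using System.Collections.Generic;
using UnityEngine;
using UnityEngine.XR.WSA;          // WorldAnchor (legacy built-in XR, Unity 2019.4)
using UnityEngine.XR.WSA.Sharing;  // WorldAnchorTransferBatch

// Sketch of sharing a world anchor between devices, following the approach described above.
public class AnchorShareExample : MonoBehaviour
{
    public GameObject mapRoot;                            // object pinned to a point in the physical world
    private readonly List<byte> exported = new List<byte>();

    public void ExportAnchor()
    {
        var anchor = mapRoot.AddComponent<WorldAnchor>(); // link the object to a real-world point
        var batch = new WorldAnchorTransferBatch();
        batch.AddWorldAnchor("mapRoot", anchor);
        WorldAnchorTransferBatch.ExportAsync(
            batch,
            data => exported.AddRange(data),              // collect serialized anchor bytes
            reason => Debug.Log("Anchor export finished: " + reason));
        // The exported bytes would then be sent to the other device, e.g. via the network transfer service.
    }

    public void ImportAnchor(byte[] data)
    {
        WorldAnchorTransferBatch.ImportAsync(data, (reason, batch) =>
        {
            if (reason == SerializationCompletionReason.Succeeded)
                batch.LockObject("mapRoot", mapRoot);     // re-attach the anchor on the receiving device
        });
    }
}
```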

Gesture recognition in the system uses MRTK's input services. In MRTK's configuration profile, the gestures part of the input section can be modified to switch to a different gesture setting; this paper uses the default gesture profile provided by MRTK. In addition, in the articulated hand tracking part, hand mesh visualization is set to "Everything" so that users can confirm that their hands are tracked by the device, and the teleport system is disabled in the teleport section because the system needs no teleportation.
The network transfer service uses a socket based on the UDP protocol. After the local IP address and port are bound, a new thread is started to listen to the local area network. This listening thread runs in parallel with the main thread so that it does not block the system's main logic. When the system is shut down, the listening thread is interrupted and aborted. Data transferred between systems is a byte array encoded from a struct, JSON, or a string. The flowchart of the network transfer service is shown in Fig. 17.4; the flow of the other systems is simplified there, with only the data transfer logic kept on the right of the network transfer service's workflow.

Fig. 17.4 Procedure of network transfer service
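The following C# sketch illustrates such a UDP-based transfer service with a parallel listening thread that is stopped on shutdown. The class name, default port, and UTF-8 encoding are assumptions for illustration, not the authors' implementation.

```csharp
using System;
using System.Net;
using System.Net.Sockets;
using System.Text;
using System.Threading;

// Minimal sketch of the UDP-based network transfer service described above.
public class NetworkTransferService
{
    private readonly UdpClient udp;
    private Thread listenThread;
    private volatile bool running;

    public event Action<byte[]> OnDataReceived;   // raw bytes decoded elsewhere (struct, JSON, or string)

    public NetworkTransferService(int port = 8888)
    {
        udp = new UdpClient(port);                // bind the local IP address and port
    }

    public void Start()
    {
        running = true;
        listenThread = new Thread(Listen) { IsBackground = true };
        listenThread.Start();                     // listen without blocking the system's main logic
    }

    private void Listen()
    {
        var remote = new IPEndPoint(IPAddress.Any, 0);
        while (running)
        {
            try
            {
                byte[] data = udp.Receive(ref remote);   // blocks until a datagram arrives
                OnDataReceived?.Invoke(data);
            }
            catch (SocketException) { break; }           // socket closed during shutdown
            catch (ObjectDisposedException) { break; }
        }
    }

    public void Send(string message, string host, int port)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(message);
        udp.Send(bytes, bytes.Length, host, port);
    }

    public void Shutdown()
    {
        running = false;
        udp.Close();              // interrupt the blocking Receive so the listening thread exits
        listenThread?.Join();
    }
}
```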

The mission generation service contains two steps. Step one allocates missions to cities randomly. Step two calculates the minimum resources that can fulfill the demand and allocates them to cities that do not need such resources, so that aircraft can transport them to the cities in need.
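A possible realization of these two steps is sketched below, reusing the CityAgent sketch from Sect. 17.2.3. The resource kinds, quantity ranges, and supplier-selection rule are placeholder assumptions; the paper does not specify them.

```csharp
using System.Collections.Generic;
using UnityEngine;

// Illustrative sketch of the two-step mission generation service described above.
public static class MissionGenerator
{
    // Step one: allocate missions (resource demands) to cities randomly.
    public static void AllocateDemands(List<CityAgent> cities, string[] resourceKinds)
    {
        foreach (var city in cities)
        {
            string kind = resourceKinds[Random.Range(0, resourceKinds.Length)];
            city.resourceDemand[kind] = Random.Range(10f, 100f);   // placeholder quantity range
        }
    }

    // Step two: place the minimum supply that fulfils each demand at cities
    // that do not demand that resource themselves, so aircraft can transport it.
    public static void AllocateSupplies(List<CityAgent> cities)
    {
        foreach (var demanding in cities)
            foreach (var pair in demanding.resourceDemand)
            {
                var supplier = cities.Find(c => c != demanding && !c.resourceDemand.ContainsKey(pair.Key));
                if (supplier != null)
                    supplier.resources[pair.Key] =
                        (supplier.resources.TryGetValue(pair.Key, out var held) ? held : 0f) + pair.Value;
            }
    }
}
```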
The main event display service uses C# events based on the publisher–subscriber model. When an aircraft or a city completes a certain event, it publishes that event; the display board, which subscribed to these events when the system started, responds to the publication and displays the news.
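A minimal sketch of this publisher–subscriber arrangement in C# is shown below; the static event holder and board class names are illustrative. Aircraft and city agents would call RescueEvents.Publish with a message when they complete an event.

```csharp
using System;
using UnityEngine;

// Sketch of the publisher-subscriber pattern used for the main event display.
public static class RescueEvents
{
    // Published by aircraft and cities when they complete an event.
    public static event Action<string> OnEventPublished;

    public static void Publish(string message) => OnEventPublished?.Invoke(message);
}

public class MessageBoard : MonoBehaviour
{
    private void OnEnable()  => RescueEvents.OnEventPublished += Display;   // subscribe at start-up
    private void OnDisable() => RescueEvents.OnEventPublished -= Display;

    private void Display(string message)
    {
        // In the real system the message would be shown on the holographic display board.
        Debug.Log("[Main event display] " + message);
    }
}
```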
The event log service uses the OS's I/O functions. A text-writing stream is instantiated after the system starts. Each time an event is published, the service writes it into a cache through the stream; when the system is shut down, the service flushes the cache into a text file.
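A short sketch of such a logging service follows, subscribing to the same event used in the display sketch above. The log file name, location, and timestamp format are assumptions.

```csharp
using System.IO;
using UnityEngine;

// Sketch of the event log service: a text-writing stream is opened at start-up,
// every published event is written through it, and the file is flushed on shutdown.
public class EventLogService : MonoBehaviour
{
    private StreamWriter writer;

    private void Start()
    {
        writer = new StreamWriter(Path.Combine(Application.persistentDataPath, "event_log.txt"));
        RescueEvents.OnEventPublished += Log;        // reuse the event from the sketch above
    }

    private void Log(string message) =>
        writer.WriteLine($"{System.DateTime.Now:HH:mm:ss} {message}");   // buffered until flushed

    private void OnDestroy()
    {
        RescueEvents.OnEventPublished -= Log;
        writer?.Close();                             // flush the cache into the text file on shutdown
    }
}
```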

Fig. 17.5 Map model in Unity3D (left) and real map model (right)

17.3.3 Entity Development

Entities in the aviation emergency rescue system include the map, aircraft, and cities. The 3D models of these entities are created in 3ds Max, and the models of aircraft and cities are visualized by Unity3D.
The map model is developed from a digital elevation map and a satellite map. The digital elevation map provides the height of the terrain, and the satellite map supplies the texture of the terrain. Since the map covers a large area of land, the map model has a large file size and consumes a large share of HoloLens 2's GPU rendering resources. Therefore, instead of visualizing the map model in augmented reality, this paper prints a physical map and bases the hologram on it. Figure 17.5 shows the map model in Unity3D's scene view and the physical map model made of resin. The physical map model keeps the other parts of the land and paints them in green.
The management of the aircraft and city models relies on the Unity3D package "Addressables." Aircraft models and city models are loaded asynchronously by label after the system starts and after services such as the spatial anchor and share service and the network transfer service are initiated. City models are instantiated once the loading process has finished. Aircraft models are instantiated only when an aircraft takes off from a city, and the model is set inactive after it arrives. Other objects in the system are also managed by the "Addressables" package (Fig. 17.6).
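The following C# sketch shows how such label-based asynchronous loading can be done with the Addressables API; the label strings ("city", "aircraft") and the manager class are illustrative assumptions.

```csharp
using System.Collections.Generic;
using UnityEngine;
using UnityEngine.AddressableAssets;

// Sketch of model management with the Addressables package, as described above.
public class ModelManager : MonoBehaviour
{
    private readonly List<GameObject> cityPrefabs = new List<GameObject>();
    private readonly List<GameObject> aircraftPrefabs = new List<GameObject>();

    private void Start()
    {
        // Load every asset tagged with the "city" label, then instantiate the cities once loading finishes.
        Addressables.LoadAssetsAsync<GameObject>("city", prefab => cityPrefabs.Add(prefab))
            .Completed += _ => cityPrefabs.ForEach(p => Instantiate(p));

        // Aircraft prefabs are only cached; an aircraft model is instantiated when it
        // takes off and set inactive after it arrives.
        Addressables.LoadAssetsAsync<GameObject>("aircraft", prefab => aircraftPrefabs.Add(prefab));
    }

    public GameObject SpawnAircraft(int index, Vector3 position) =>
        Instantiate(aircraftPrefabs[index], position, Quaternion.identity);

    public void ParkAircraft(GameObject aircraft) => aircraft.SetActive(false);
}
```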
Fig. 17.6 Procedure of models' management using Addressables package

The functions of aircraft and cities are organized in the following sequence: when the user chooses a particular aircraft and a city so that the aircraft will fly to that city and accomplish the mission, the aircraft first calls the "plan route" function. After the waypoints are calculated, the "fly to destination" function is called to instantiate the model and update the position and rotation of the aircraft. Once the aircraft's position is close to the city, the city calls the "accept aircraft" function to announce the event, add the aircraft to the city's current aircraft fleet, and destroy the model of the aircraft. The aircraft then calls the "execute task" function to transport resources between the aircraft and the city. The "update load" and "update demand" functions are called last, and the aircraft is then ready for another assignment from the user. The procedure of calling these functions is shown in Fig. 17.7.
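As a rough usage illustration, the coroutine below strings together the agent functions sketched earlier in this call order; the dispatcher class, speed value, and resource kind are placeholders. A caller would run it with StartCoroutine.

```csharp
using System.Collections;
using UnityEngine;

// Sketch of the call sequence between an aircraft and a city when the user assigns a mission.
public class MissionDispatcher : MonoBehaviour
{
    public IEnumerator Dispatch(AircraftAgent aircraft, CityAgent city, float speed = 50f)
    {
        aircraft.PlanRoute(city.transform.position);                // 1. plan route
        while (Vector3.Distance(aircraft.transform.position, city.transform.position) > 1f)
        {
            aircraft.FlyToDestination(Time.deltaTime, speed);       // 2. fly to destination
            yield return null;
        }
        if (city.AcceptAircraft(aircraft))                          // 3. city accepts the aircraft
        {
            RescueEvents.Publish(aircraft.aircraftType + " arrived");
            aircraft.ExecuteTask(city);                             // 4. execute task and update load
            city.UpdateDemand("supplies", aircraft.loadingAbility); // 5. update demand (placeholder values)
            // In the real system the aircraft model would also be destroyed or set inactive here.
        }
    }
}
```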

17.4 Simulation of Aviation Emergency Rescue

The system simulates aviation emergency rescue based on a task scenario in which Zhejiang Province, China, is struck by a flood. In this scenario, affected people need to be transported to settlement places, and supplies and large machinery need to be transported to the cities in need. The simulation runs in a single-device environment, while the data interfaces remain open for data exchange. The view of the simulation from the user's perspective is shown in Fig. 17.8: the map and other background objects are in the physical world, while the models of aircraft and cities, the message board, and the mesh on the hands are rendered by HoloLens 2.

Fig. 17.7 Procedure of calling functions between aircraft and city

Fig. 17.8 System running state of simulation of aviation emergency rescue

17.5 Conclusion

In this paper, a framework for large-scale aviation emergency rescue based on augmented reality is proposed. The framework contains two main parts: a service-oriented architecture and a multi-agent architecture. Services in the service-oriented architecture constitute the bottom functional layer, while agent classes in the multi-agent architecture realize the logic and the visualization parts of the system.
Based on the framework, an augmented reality system instance was developed. This system used the Unity3D engine and MRTK as its development foundation and implemented the functions of the service-oriented architecture as well as the aircraft and city agent classes.
Finally, a scenario was simulated to examine the framework and the augmented reality system for aviation emergency rescue.
