Kazumi Nakamatsu · Srikanta Patnaik ·
Roumen Kountchev · Ruidong Li ·
Ari Aharari Editors
Advanced
Intelligent Virtual
Reality Technologies
Proceedings of 6th International
Conference on Artificial Intelligence and
Virtual Reality (AIVR 2022)
Smart Innovation, Systems and Technologies
Volume 330
Series Editors
Robert J. Howlett, Bournemouth University and KES International,
Shoreham-by-Sea, UK
Lakhmi C. Jain, KES International, Shoreham-by-Sea, UK
The Smart Innovation, Systems and Technologies book series encompasses the topics
of knowledge, intelligence, innovation and sustainability. The aim of the series is to
make available a platform for the publication of books on all aspects of single and
multi-disciplinary research on these themes in order to make the latest results avail-
able in a readily-accessible form. Volumes on interdisciplinary research combining
two or more of these areas are particularly sought.
The series covers systems and paradigms that employ knowledge and intelligence
in a broad sense. Its scope is systems having embedded knowledge and intelligence,
which may be applied to the solution of world problems in industry, the environment
and the community. It also focusses on the knowledge-transfer methodologies and
innovation strategies employed to make this happen effectively. The combination
of intelligent systems tools and a broad range of applications introduces a need
for a synergy of disciplines from science, technology, business and the humanities.
The series will include conference proceedings, edited collections, monographs,
handbooks, reference books, and other relevant types of book in areas of science and
technology where smart systems and technologies can offer innovative solutions.
High quality content is an essential feature for all book proposals accepted for the
series. It is expected that editors of all accepted volumes will ensure that contributions
are subjected to an appropriate level of reviewing process and adhere to KES quality
principles.
Indexed by SCOPUS, EI Compendex, INSPEC, WTI Frankfurt eG, zbMATH,
Japanese Science and Technology Agency (JST), SCImago, DBLP.
All books published in the series are submitted for consideration in Web of Science.
Kazumi Nakamatsu · Srikanta Patnaik ·
Roumen Kountchev · Ruidong Li · Ari Aharari
Editors
Ari Aharari
Sojo University
Kumamoto, Japan
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature
Singapore Pte Ltd. 2023
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
AIVR 2022 Organization
Honorary Chair
General Co-chairs
Conference Chair
Program Chairs
Preface
At AIVR 2022, we accepted one invited paper and 16 regular papers out of 44 submissions from China, Germany, Greece, Japan, Malaysia, Brazil, the UK, and other countries. This volume presents all the accepted papers of AIVR 2022.
Lastly, we wish to express our sincere appreciation to all participants and to the technical program committee for their review of all the submissions, which was vital to the success of AIVR 2022, and also to the members of the organizing committee who dedicated their time and effort to planning, promoting, organizing, and running the conference. Special appreciation is extended to our keynote and invited speakers: Prof.
Xiang-Gen Xia, University of Delaware, USA; Prof. Shrikanth (Shri) Narayanan,
University of Southern California, USA; Prof. Chip Hong Chang, Nanyang Tech-
nological University, Singapore; and Prof. Minghui Li, University of Glasgow, UK,
who gave very valuable talks to the conference audience, and also to Prof. Jair
M. Abe, Paulista University, Sao Paulo, Brazil, who kindly contributed an invited
paper to AIVR 2022.
Kazumi Nakamatsu received the M.S. Eng. and Dr. Sci. degrees from Shizuoka University
and Kyushu University, Japan, respectively. His research interests encompass various
kinds of logic and their applications to Computer Science, especially paraconsistent
annotated logic programs and their applications. He has developed some paracon-
sistent annotated logic programs called ALPSN (Annotated Logic Program with
Strong Negation), VALPSN (Vector ALPSN), EVALPSN (Extended VALPSN) and
bf-EVALPSN (before-after EVALPSN) recently, and applied them to various intelli-
gent systems such as a safety verification based railway interlocking control system
and process order control. He is the author of over 180 papers, 30 book chapters,
and 20 edited books published by prominent publishers. Kazumi Nakamatsu has
chaired various international conferences, workshops, and invited sessions, and he
has been a member of numerous international program committees of workshops and
conferences in the area of Computer Science. He has served as the editor-in-chief of
the International Journal of Reasoning-based Intelligent Systems (IJRIS); he is now
the founding editor of IJRIS and an editorial board member of many international
journals. He has contributed numerous invited lectures at international workshops,
conferences, and academic organizations. He also is a recipient of numerous research
paper awards.
secretary of IEEE ComSoc Internet Technical Committee (ITC), is the founder and
chair of IEEE SIG on Big Data Intelligent Networking and IEEE SIG on Intelligent
Internet Edge, and the co-chair of young research group for Asia future internet
forum. He is an associate editor of the IEEE Internet of Things Journal and has also served as a guest editor for a set of prestigious magazines, transactions, and journals, such as IEEE Communications Magazine, IEEE Network Magazine, and IEEE Transactions. He has also served as a chair for several conferences and workshops, such as
the general co-chair for AIVR2019, IEEE INFOCOM 2019/2020/2021 ICCN work-
shop, IEEE MSN 2020, BRAINS 2020, IEEE ICDCS 2019/2020 NMIC workshop
and IEEE Globecom 2019 ICSTO workshop, and publicity co-chair for INFOCOM
2021. His research interests include future networks, big data networking, intelligent
Internet edge, Internet of things, network security, information-centric network, arti-
ficial intelligence, quantum Internet, cyber-physical system, naming and addressing
schemes, name resolution systems, and wireless networks. He is a senior member of
IEEE and a member of IEICE.
Assoc. Prof. Ari Aharari (Ph.D.) received M.E. and Ph.D. in Industrial Science
and Technology Engineering and Robotics from Niigata University and Kyushu
Institute of Technology, Japan, in 2004 and 2007, respectively. In 2004, he joined
GMD-JAPAN as a research assistant. He was a research scientist and coordinator at
FAIS-Robotics Development Support Office from 2004 to 2007. He was a postdoc-
toral research fellow of the Japan Society for the Promotion of Science (JSPS) at
Waseda University, Japan, from 2007 to 2008. He served as a senior researcher of
Fukuoka IST involved in the Japan Cluster Project from 2008 to 2010. In 2010, he
became an assistant professor at the faculty of Informatics of Nagasaki Institute of
Applied Science. Since 2012, he has been an associate professor at the Department
of Computer and Information Science, Sojo University, Japan. His research inter-
ests are IoT, robotics, IT agriculture, image processing and data analysis (Big Data)
and their applications. He is a member of IEEE (Robotics and Automation Society),
RSJ (Robotics Society of Japan), IEICE (Institute of Electronics, Information and
Communication Engineers), and IIEEJ (Institute of Image Electronics Engineers of
Japan).
Part I
Invited Paper
Chapter 1
Paraconsistency and Paracompleteness
in AI: Review Paper
Abstract The authors analyse the contribution of the logical treatment of the
concepts of inconsistency and paracompleteness to better understand AI’s current
state of development. In particular, we consider the relationship between Artificial Intelligence and a new type of logic, called Paraconsistent Annotated Logic, which handles these concepts effectively, both computationally and in its use in hardware.
1.1 Introduction
Logic, until very recently, was a single science, which progressed linearly, even after
its mathematisation by mathematicians, logicians and philosophers such as Boole,
Peano, Frege, Russell and Whitehead. The revolutionary developments in the 1930s,
such as those by Gödel and Tarski, still fall within what we can call classical or
traditional logic.
Despite all the advances in traditional logic, another parallel revolution, of a very different nature, took place in the field of science created by Aristotle: the establishment of non-classical logics. These produced, as in the case of non-Euclidean geometry, a transformation of a profound nature in the scientific sphere, whose philosophical consequences have not yet been investigated systematically and comprehensively.
J. M. Abe (B)
Paulista University, São Paulo, Brazil
e-mail: [email protected]
J. I. da Silva Filho
Santa Cecília University, Santos, Brazil
K. Nakamatsu
University of Hyogo, Hyogo, Japan
We call classical or traditional logic the study of the calculus of the first-order
predicates, with or without equality, as well as some of its subsystems, such as clas-
sical propositional calculus, and some of its extensions, for example, traditional higher-order logic (type theory) and the standard systems of set theory (Zermelo–Fraenkel,
von Neumann–Bernays–Gödel, Kelley–Morse, NF Quine-Rosser, ML Quine-Wang,
etc.). The logic under consideration is based on well-established syntax and seman-
tics; thus, the usual semantics of predicate calculus is based on Tarski’s concept of
truth.
Non-classical logic is characterised by amplifying, in some way, traditional logic
or by infringing or limiting its core principles or assumptions [1].
Among the first, called complementary logics of the classical, we may mention the traditional logics of alethic modalities, deontic modalities, epistemic operators and temporal operators. Among the second, called heterodox or rivals of the classical, we may cite paraconsistent, paracomplete and intuitionistic logics without negation (Griss, Gilmore and others).
Logic, we must stress, is much more than the discipline of valid forms of inference.
It would be difficult to fit, e.g., the theory of models in its current form and the theory of recursion into a logic thus defined. However, for this article, we can identify (deductive)
logic as the discipline especially concerned with valid inference (or reasoning).
On the other hand, each deductive logic L is usually associated with an inductive
logic L’, which, under certain conditions, indicates how invalid inferences according
to L can still be used. The patterns for this to be legitimate are encoded in L’. Inductive
logics are perfectly placed among non-classical logics (perhaps as a replacement for
the corresponding deductive logics) [1].
Artificial Intelligence (AI) has contributed to the progress of several new logics
(non-monotonic logics, default logics, defeasible logics, and paraconsistent logics
in general).
This is because, particularly in the case of expert systems, we need non-traditional
forms of inference. Paraconsistency, for example, is required because one regularly works with inconsistent (contradictory) sets of information.
In this chapter, we outline the great relevance that AI is acquiring regarding the
deep understanding of the meaning of logicity (and, indirectly, for the very under-
standing of reason, its structure, limits and forms of application). To do so, we will
only focus on the case of paraconsistent logic and paracomplete logic; without a
doubt, it can be seen as one of the most heterodox among the heterodox logics,
although technically, it can be constructed as a complementary logic to classical
logic.
A theory T is called inconsistent if it has two theorems such that one is the negation of the other; otherwise, T is consistent. T is called trivial if all the sentences of its language (closed formulas) are theorems; if not, T is nontrivial.
If L is one of the standard logics, such as classical logic or Brouwer–Heyting intuitionistic logic, T is trivial if and only if it is inconsistent. In other words, logics like these do not separate the concepts of inconsistency and triviality.
L is called paraconsistent if it can function as a foundation for inconsistent and
nontrivial theories. (Only in certain specific circumstances does the presence of
contradiction imply trivialisation.) In other words, paraconsistent logic can handle
inconsistent information systems without the danger of trivialisation.
The forerunners of paraconsistent logic were the Polish logician J. Łukasiewicz
and the Russian philosopher N. A. Vasiliev. Neither of them had, at the time, a broad
view of classical logic as we see it today; they treated it more or less through Aris-
totle’s prism in keeping with the then dominant trends in the field. Simultaneously,
around 1910, though independently, they aired the possibility of a paraconsistent
logic that would constrain, for example, the principle of contradiction, when formu-
lated as follows: Given two contradictory propositions, that is, one of which is the
negation of the other, then one of the propositions is false. Vasiliev even came to articulate a certain paraconsistent logic, which he baptised 'imaginary', modifying the
Aristotelian syllogistic.
The Polish logician S. Jaśkowski, a disciple of Łukasiewicz, was the first logician
to structure a paraconsistent propositional calculus. In 1948, he published his ideas on
logic and contradiction, showing how one could construct a paraconsistent sentential
calculus with convenient motivation. Jaśkowski’s system, named by him discursive
logic, was developed later (from 1968 onwards) due to the works of authors such as
J. Kotas, L. Furmanowski, L. Dubikajtis, N. C. A. da Costa and C. Pinter. Thus, an
actual discursive logic was built, encompassing a calculus of the first-order predicate
and a higher-order logic (there are even discursive set theories, intrinsically linked
to the attribute theory, based on Lewis’ S5 calculus) [1].
The initial systems of paraconsistent logic, containing all logical levels, thus
involving propositional, predicate and description calculi and higher-order
logic, are due to N. C. A. da Costa (1954 onwards). This was carried out independently
of the inquiries of the authors, as mentioned earlier.
Today, there are even paraconsistent systems of set theories, strictly stronger
than the classical ones, as they contain them as strict subsystems and paraconsistent
mathematics. These mathematics are related to fuzzy mathematics, which, from a
certain point of view, fits into the list of the former.
As a result of the elaboration of paraconsistent logic, it has been shown that it is possible to manipulate inconsistent yet robust information systems without eliminating contradictions and without falling into trivialisation.
Worthy of mentioning is that paraconsistent logic was born out of purely theo-
retical considerations, both logical-mathematical and philosophical. The first ones
refer, for example, to problems related to the concept of truth, the paradoxes of set
theory and the vagueness inherent in natural language and scientific ones. The second
is correlated with themes such as foundations of dialectics, notions of rationality and
logic and the acceptance of scientific theories.
In connection with the preceding exposition, here are some philosophically signif-
icant problems: (a) Are non-classical logics really logics? (b) Can there even be logics rival to the classical one? (c) Ultimately, wouldn't the logics called rivals be only complementary to the classical one? (d) What is the relationship between rationality and
logic? (e) Can reason be expressed through different logics, incompatible with each
other?
Obviously, within the limits of this article, we cannot address all these questions,
not even in a summarised way.
However, adopting an 'operational' position, in which a logical system denotes a kind of inference organon, AI leads us, inescapably, to the conclusion that
there are several kinds of logic, classical and non-classical, and among the latter,
complementary and rivals of classical logic.
Furthermore, AI corroborates the possibility and practical relevance of logics in the paraconsistent category, so far removed from the standards established for
logicity until recently. This is, without a doubt, surprising for those not used to the
latest advances in information technology.
It is worth remembering that numerous arguments weaken the position of those
who defend the thesis of the absolute character of classical logic. Here are four such
arguments as follows:
(1) Any given rational context is compatible with infinite logics capable of
appearing as underlying logics.
(2) Fundamental logical concepts, such as negation, have to be seen as ‘family
resemblance’ in Wittgenstein’s sense. There is no particular reason for refusing,
say, paraconsistent negation the dignity of negation: if one does so, one should
also maintain that the lines of non-Euclidean geometries are not, in effect, lines.
(3) The common semantics of, e.g., the restricted predicate calculus is based on set theory. As there are several (classical) set theories, there are numerous possible interpretations of such semantics, not equivalent to each other. Consequently, that calculus is not as well defined as it appears at first sight.
(4) There is no sound and complete axiomatisation for traditional second-order (and
higher-order) logic. It, therefore, escapes (recursive) axiomatisation.
Thus, the answers to questions (a) and (b) are affirmative. A simple answer to question (c) is more difficult: at bottom, it is primarily a terminological problem.
However, in principle, as a result of the previous discussion, nothing prevents us from
accepting that there are rival logics, which are not included in the list of complemen-
tary ones to the traditional one. Finally, on (d) and (e), we will emphasise that we
have excellent arguments to demonstrate that reason remains reason even when it
manifests itself through non-classical logic (classical logic itself is not a well-defined
system).
From the above, we believe that the conclusions that impose themselves can be summarised as follows:
Science is more a struggle, an advance, than a stage acquired or conquered, and
the fundamental scientific categories change over time. As Enriques [3] points out,
science appears imperfect in all its parts, developing through self-correction and self-integration, in which some parts are corrected and others gradually added; there is a constant back and forth from the foundations to the most complex theories, correcting errors and eliminating inconsistencies. However, history shows that every scientific theory contains something true: Newtonian mechanics, though surpassed by Einstein's, evidently contains traces of truth; if its field of application is conveniently restricted, it works, it predicts, and therefore it contains a bit of truth. Nevertheless, science is a constant walk towards the truth. This is the teaching of history, beyond any serious doubt.
Even more, logic is constituted through history, and it does not seem possible to
predict the vicissitudes of its evolution.
It is not just about progress in extension; the concept of logicity has changed.
An expert from the beginning of the century, although familiar with the works of
Frege, Russell and Peano, could hardly have foreseen the transformations that would
take place in logic in the last forty years. Today, heterodox logics have entered
the scene with great impetus: no one could predict where polyvalent, relevant and
paraconsistent logics will take us. Perhaps, in the coming years, a new alteration of
the idea of logicity is in store, impossible to imagine at the moment [1].
‘Reason, as defined…, is the faculty of conceiving, judging and reasoning.
Conceiving and reasoning are the exclusive patrimonies of reason, but judging is
also a rational activity in the precise sense of the word. Some primitive form of non-rational intuition provides the basis for judgement, but it is reason that judges, since it alone manipulates and combines concepts.
Most common uses of the word ‘reason’ derive from reason conceptualised as
the faculty of conceiving, judging and reasoning. Thus, to discern well and adopt
rational norms of life, one must consider reason in the sense just defined. Furthermore,
there is a set of rules and principles regulating the use of reason, primarily as it
manifests itself in rational contexts. It is also permissible to call this set of rules
and principles reason. When we ask whether reason transforms itself or remains
invariant, it is undoubtedly more convenient to interpret the question as referring
to reason as a set of rules and principles and not as a faculty. So formulated, the
problem has an immediate answer: reason has changed over time. For example, the
rational categories underlying Aristotelian, Newtonian and modern physics diverge
profoundly; ipso facto, the principles that govern these categories vary, from which it can be concluded that reason itself has been transformed.' (da Costa [1]).
Consequently, reason does not cease to be reason, even if it is expressed
through a different logic.
AI is currently one of the pillars on which the considerations that have just been
made are based. Thus, it has both a practical value, in technological applications, and a theoretical value, contributing to a better understanding of the problems of logic, reason and culture.
With the decision states and the degrees of certainty and uncertainty, we obtain a
logic analyser called para-analyser [4]. Such an analyser materialised with electrical
circuits gave rise to a logic controller called para-control [4].
Below we describe some applications made with such controllers.
This project was conceived based on the history of the application of paraconsis-
tent logic in predecessor robots [5] and the development of robotics in autonomous
navigation systems [6]. The prototype, implemented with the ATmega 2560 microcontroller, is shown in Fig. 1.3. The HC-SR04 ultrasonic sensors were installed appropriately. At the front of the robot are the traction motors, controlled by pulse width modulation (PWM). At the back can be seen the differential of this prototype compared to its predecessors: a servomotor that controls the robot's direction.
Another difference from the previous ones was the idea of using an LCD to
monitor the readings of ultrasonic sensors and observe the value of the angle pointed
out by the servomotor. All these observations were critical in the robot’s movement
tests [6].
Next, the frontal sensors' readings were normalised to the values of μ and λ of the lattice, as shown in Eqs. (1.1) and (1.2). The normalisation process adapts the distance values obtained from the sensors, converting them to the range from 0 to 1 used in paraconsistent logic [2].
μ = Left Sensor / 200    (1.1)

λ = 1 − Right Sensor / 200    (1.2)
Using the proposition p ‘The robot’s front is free’, paraconsistent logic concepts
were applied. The certainty and uncertainty degrees were calculated according to the
values μ and λ obtained by Eqs. (1.1) and (1.2).
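As an illustration of how these quantities drive the decision, the following minimal Python sketch computes the degree of certainty and the degree of uncertainty from the normalised readings and classifies the resulting decision state. The formulas Gc = μ − λ and Gun = μ + λ − 1 and the 0.5 threshold are the usual LPA2v conventions and are assumptions here, since the chapter does not reproduce them explicitly.

def para_analyser(mu, lam, threshold=0.5):
    # Degree of certainty and degree of uncertainty (usual LPA2v definitions, assumed)
    gc = mu - lam
    gun = mu + lam - 1.0
    if gc >= threshold:
        state = 'true'            # the robot's front is free
    elif gc <= -threshold:
        state = 'false'           # obstacle ahead
    elif gun >= threshold:
        state = 'inconsistent'    # the sensors contradict each other
    elif gun <= -threshold:
        state = 'paracomplete'    # not enough information
    else:
        state = 'undefined'       # border region of the lattice
    return gc, gun, state

# Example: left sensor reads 180 cm, right sensor reads 60 cm
mu = 180 / 200        # Eq. (1.1)
lam = 1 - 60 / 200    # Eq. (1.2)
print(para_analyser(mu, lam))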
It was noticed that the degree of uncertainty generated very peculiar values to be
used directly in the set point of the servomotor. Then, with other values, six new
tests were performed to define the robot’s behaviour concerning supposed obstacles,
simultaneously positioned at the same distance to the left and right frontal sensors.
Table 1.2 shows the development of simulations and the results obtained in each
case.
Table 1.3 shows a gradual change in certainty degrees for these new cases that
ranged from 0.99 to −0.70. The control values obtained in the simulations were applied to the paraconsistent algorithm programmed in the C language directly in the Arduino Integrated Development Environment (IDE) for the ATmega 2560. These
values were used in decision-making for speed control and braking.
The program’s algorithm was divided into four main blocks to facilitate its imple-
mentation: the block of the frontal sensors, the block of paraconsistent logic, the
control block of the servomotor and the control block of speed and traction.
// Front Sensor Block
trigpulse_1();                    // calls the trigger function of the right front sensor
pulse_1 = pulseIn(echo_1, HIGH);
rt_ft_sr = pulse_1 / 58;          // calculates the obstacle distance to the right front sensor
Lately, the potential of multi-criteria decision analysis (MCDA) in health care has
been widely discussed. However, most MCDA methodologies pay little attention to
aggregating different individual stakeholder perspectives.
In [7], the para-analyser was applied to illustrate how a reusable MCDA frame-
work, based on paraconsistent logic, designed to aid (hospital-based) Health Tech-
nology Assessment (HTA) can be used to aggregate individual expert perspectives
when evaluating cancer treatments.
A proof-of-concept exercise focusing on identifying and evaluating the global value of first-line treatments for metastatic colorectal cancer (mCRC) was undertaken
to further the development of the MCDA framework.
In consultation with hospital HTA committee members, 11 stakeholder perspectives were included on an expert panel: medical oncology, oncology surgery, radiation therapy, palliative
care, pharmacist, health economist, epidemiologist, public health specialist, health
media specialist, pharmaceutical industry and patient advocate. The criteria ‘overall
survival’ (0.22), ‘burden of disease’ (mean 0.21) and ‘adverse events’ (mean 0.20)
received the highest weights, and the lowest weights were ‘progression-free’ and
‘cost of treatment’ (mean of 0.18 for both). FOLFIRImFlox achieved the highest
overall value approval of 0.75, followed by mFOLFOX6 with an overall value rating
of 0.71. Last ranked was the mIFL with an overall value score of 0.62. Paraconsistent
analysis of six first-line treatments for mCRC indicated that FOLFIRI and mFlox
were appropriate options for non-study reimbursement.
The paraconsistent value framework was proposed as a step forward from current
MCDA practices to improve the means of dealing with hospital HTA specialists’
perspectives of cancer treatments.
1.7 Conclusions
Paraconsistent logic was born out of applications in philosophy and specific technical
questions in mathematics, but it has found significant applications in the last three
decades, mainly in AI and Robotics [8].
ANNs, deep learning, expert systems, big data, etc., based on paraconsistent logic, can deal directly with fuzzy, inconsistent and paracomplete data, which we have to manipulate frequently. In the early days of AI, some theories eliminated inconsistencies or treated them separately. Yet inconsistencies arise frequently: in medicine, the same symptoms can indicate different illnesses and doctors can conflict in their diagnoses; the inherent ambiguity in the resolution of a processed image, such as data captured by radar in expert systems, can lead to hasty decisions; experts can hold different opinions on the same problem; malicious data can corrupt databases; and so on. To neglect inconsistent data is to
References
1. da Costa, N.C.A.: Logiques Classiques et Non Classiques: Essai sur les fondements de la logique. Masson, p. 275. ISBN-10: 2225852472, ISBN-13: 978-2225852473 (1997)
2. Abe, J.M., Akama, S., Nakamatsu, K.: Introduction to Annotated Logics—Foundations for Paracomplete and Paraconsistent Reasoning. Intelligent Systems Reference Library, vol. 88, p. 190. Springer International Publishing (2015). https://doi.org/10.1007/978-3-319-17912-4. Hardcover ISBN 978-3-319-17911-7, eBook ISBN 978-3-319-17912-4, Series ISSN 1868-4394, Edition Number 1
3. Enriques, F.: Per la Storia della Logica. Zanichelli, Bologna (1922)
4. da Silva Filho, J.I.: Métodos de interpretação da Lógica Paraconsistente Anotada com anotação com dois valores LPA2v com construção de Algoritmo e implementação de Circuitos Eletrônicos (in Portuguese). Doctoral Thesis, University of São Paulo, São Paulo (1999)
5. Torres, C.R., Abe, J.M., Lambert-Torres, G., da Silva Filho, J.I., Martins, H.G.: Autonomous Mobile Robot Emmy III. In: New Advances in Intelligent Decision Technologies, pp. 317–327. Springer, Berlin, Heidelberg (2009)
6. Bernardini, F., da Silva, M., Abe, J.M.: Application of Paraconsistent Annotated Evidential Logic Eτ for a Terrestrial Mobile Robot to Avoid Obstacles. Procedia Computer Science 192, 1821–1830 (2021). ISSN 1877-0509
7. Campolina, A.G., Estevez-Diz, M.D.P., Abe, J.M., de Soárez, P.C.: Multiple Criteria Decision Analysis (MCDA) for Evaluating Cancer Treatments in Hospital-Based Health Technology Assessment: The Paraconsistent Value Framework. PLoS ONE 17(5) (2022)
8. Abe, J.M.: Paraconsistent Intelligent-Based Systems: New Trends in the Applications of Paraconsistency, p. 94. Springer (2015)
9. Akama, S.: Towards Paraconsistent Engineering. Intelligent Systems Reference Library, vol. 110, p. 234. Springer International Publishing (2016). ISBN 978-3-319-40417-2 (Print), 978-3-319-40418-9 (Online), Series ISSN 1868-4394
Part II
Regular Papers
Chapter 2
Decision Support Multi-agent Modeling
and Simulation of Aeronautic Marine Oil
Spill Response
Abstract Modeling and simulation can provide decision support methods for marine
oil spill response, which can ensure the timeliness and effectiveness of the response
plan. This paper (1) proposes a hybrid modeling approach combining multi-agent
modeling and discrete event system (DEVS) to extract the oil spill response process
model; (2) abstracts the mathematical model of oil spill response plan based on
the multi-agent model; (3) constructs an aeronautic marine oil spill response virtual
simulation system; and (4) quantitatively evaluates the response plans of the marine
oil spillage. Furthermore, an instance is analyzed through the simulation and evaluation of two response plans, which shows that the proposed modeling and simulation methods can provide references for the analysis and optimization of response plans for decision support.
2.1 Introduction
With the increasing frequency of maritime economic activities, cruise travel, and sea
transportation, maritime accidents occur more and more frequently, especially large-
scale oil spills with complex causes and serious environmental hazards. However,
most countries lack the capacity and experience to deal with large oil spill emer-
gencies. For example, the sinking of the Sanchi and the spillage of its oil cargo and
fuel has been one of the worst maritime collisions in recent years, with no precedent
for an emergency response. The application of virtual simulation can provide deci-
sion support methods for large marine oil spills and improve its emergency response
capability.
As early as the 1990s, using computer technology, various developed countries had researched oil spill prediction and simulation systems, such as the OILMAP system [1] in the USA and the OSIS system [2] in the UK. With the maturity of oil spill
prediction technology and the development of computer virtual simulation, many
studies oriented to oil spill disposal virtual simulation exercises have emerged. For
example, in China, Zou Changjun [3] realized a three-dimensional exercise system for
marine oil spill response and a virtual reality-based oil spill response exercise system;
Yang Yu [4] constructed a simulated training system supported by virtual environment
in real time for undersea oil spill emergency response to train emergency workers.
While skilled personnel are important, developing a response plan is extremely time
consuming during a real oil spill response. But studies geared toward response plan
development simulation are still relatively few.
The current evaluation study on oil spill response is process oriented, focusing on
the comprehensive factors affecting the effectiveness of oil spill response. And multi-
criteria evaluation methods are adopted to evaluate marine emergency incidents [5,
6]. Aurelien Hospital [7] in Canada studied oil spill response evaluation, mainly
considering the evaluation of booms and skimmers, which pointed out the way to
develop a risk-informed enhanced oil spill response capacity. Jin Weiwei [8] in China
established an evaluation indicator system of marine oil spill emergency response
capacity, including forecast, emergency support, and disposal capacity. Yet, while
process-oriented assessments are comprehensive and integrated, they lack relevance
and directionality, making it difficult to improve the efficiency of oil spill response
at the decision-making level.
The response plan determines the response process and affects the efficiency of the response.
Therefore, based on the above problems and current research status, this paper focuses
more on the decision support of the response plan. The evaluation framework in this
paper follows the approach of [5] in maritime search and rescue (MSRA) and is
novelly applied to a more complex scenario of maritime oil spill disposal, taking
into account multiple missions of emergency monitoring and oil spill disposal with
interactions of emergency response forces. In this paper, multi-agent modeling and
discrete event system (DEVS) modeling are firstly combined to extract the oil spill
response process model and to refine the mathematical model of the oil spill response
plan; secondly, the evaluation indicator system is established, and the analytic network process (ANP) method is adopted as the evaluation calculation method; and
finally, based on these researches, a virtual simulation system is developed in the
AnyLogic simulation platform to provide decision support for the formulation of
response plan.
The process flow from oil spill occurrence to disposal can be summarized as follows:
surveillance and warning, emergency monitoring, and oil spill disposal, as shown in
Fig. 2.1.
[Fig. 2.1: flowchart of the oil spill response process — continuous monitoring, emergency monitoring, developing the disposal plan, checking for new emergencies, and end]
The surveillance and warning are the preconditions of oil spill response. After
receiving the alarm, emergency monitoring is carried out to obtain detailed infor-
mation about the spillage. Then the emergency command center dispatches ships
and helicopters to the mission area for oil spill disposal according to the informa-
tion of emergency monitoring. Therefore, the critical steps of oil spill response are
emergency monitoring and oil spill disposal.
At the initial stage of marine oil spillage, rapid response and on-site emergency
monitoring are required. The emergency command center usually sends helicopters
to the scene to conduct on-site command, including investigating the emergency
site, sampling oil spills, and reporting the acquired information to the emergency
command center. The center receives the information and evaluates it to form an oil
spill response plan. In this paper, the oil spill response plan refers to the dispatching
plan and routes of response forces (as helicopters).
DEVS model
The marine oil spill response is constrained by the time boundary and the space
boundary. Under the constraint of these boundaries, the state variables, including oil
spill status information, monitoring status, oil area, indicators, etc., change at some
discrete time points. Thus, the marine oil spill response is a typical discrete event
system (DEVS).
The DEVS model is driven by a series of events and activities, where an event is
a behavior at a certain instantaneous time and an activity is the continuous state of
behavior, which is between two events in the DEVS model. The occurrence of an
event in the oil spill response process indicates a change in the state of the oil spill
response.
The events and the activities in marine oil spill response process are described in
Table 2.1.
A process is a collection of related events and activities that describes their
logical relationship and time sequence. Considering the process interaction modeling
strategy, the DEVS architecture diagram for the aeronautic marine oil spill response
process is shown in Fig. 2.2, which is constructed based on Fig. 2.1.
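As a minimal illustration of how these DEVS elements can be represented in software, the following Python sketch defines events, activities, and processes; the class and field names are ours and are not taken from the paper.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Event:
    # Instantaneous behaviour that changes the state of the oil spill response
    name: str
    time: float  # simulation time at which the event occurs

@dataclass
class Activity:
    # Continuous state of behaviour between two events
    name: str
    start: Event
    end: Event

@dataclass
class Process:
    # Ordered collection of related events and activities
    name: str
    events: List[Event] = field(default_factory=list)
    activities: List[Activity] = field(default_factory=list)

# Example: a simplified emergency-monitoring process
alarm = Event("alarm received", 0.0)
arrival = Event("helicopter arrives on scene", 0.6)
monitoring = Process(
    "emergency monitoring",
    events=[alarm, arrival],
    activities=[Activity("transit to mission area", alarm, arrival)],
)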
Multi-agent DEVS model
It can be seen from the marine oil spill response process that it is of great significance
to model helicopters and units in distress. For helicopters’ modeling, movement logic
in mission execution and detection logic in target search need to be concerned. The
detection logic considers information interaction between helicopters and units in
distress. For the unit in distress, it is necessary to analyze self-drift and oil spill
dispersion logic and state evolution logic. Then the state evolution logic is modeled
according to the change of environment and the performance of the helicopter.
The above analysis demonstrates that the helicopter model and distress ship model
have their own logic and influence each other. The response process is modeled
by many interaction and communication behaviors between units with explicit
behavioral logic and state migration characteristics.
Therefore, multi-agent modeling is adopted to describe complex situations, such
as multiple behaviors and interactions between different agents. Each agent is a unit
with some physical or abstract mathematical meaning that can not only act on itself
and its environment, but also interact with other agents in terms of information and
behavior.
The multi-agent model includes environment agents, behavioral agents, and data
agents. The environment agent is the simulation operation environment of the behav-
ioral agents and data agents. The behavioral agent is the subject-object that generates
behavior after the simulation starts, with state variables and several behavior patterns.
When a behavioral agent interacts with the environmental agents or other associated
behavioral agents, it triggers or is triggered by events to generate data. The agents are described in Table 2.2.
According to the methodology of non-uniform hybrid strategy [9], the DEVS
model can be established on the basis of the agent-based modeling method, which
will guarantee the model accuracy and computational efficiency to a certain extent,
thus improving the authenticity of the evaluation results. Based on the above two
models, the multi-agent DEVS model is obtained through agent description of the
events, activities, and processes, as shown in Fig. 2.3.
[Figure: DEVS architecture diagram of the oil spill response, showing the interconnection of events (E), activities (A), and processes (P, Pm)]
In multi-agent DEVS, the event can be generated by the interaction among agents,
containing information and behavior interaction. Discrete events proceed based on
specific conditions or rules, which in turn affect the activities of the agents involved.
Simulation system framework
In order to verify the validity and feasibility of the multi-agent DEVS model, as well
as to provide decision support for the evaluation, the framework of the simulation
system is constructed as shown in Fig. 2.4.
Environment agent, acting as the environment in the system, realizes human–
computer interaction, provides oil spill information input, and formulates response
plan based on auxiliary decision-making functions such as drift trajectory prediction.
Data agent links the environment agent and the behavioral agent, recording both the
data input by users and the data in the evaluation process. The behavioral agent, as the
execution unit of the simulation, realizes the discrete event based on its own behavior
and interaction logic. The behavioral agent logic, using the ForceUnit_Monitor Agent as an example, is divided into three phases: staying at base, navigation, and command and control (a sketch of this phase logic is given below).
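A minimal sketch of this three-phase behavioral-agent logic as a simple state machine is given below; the class, method, and state names are ours (the actual system is built on the AnyLogic platform), so this is only an illustrative approximation.

from enum import Enum, auto

class Phase(Enum):
    STAYING_AT_BASE = auto()
    NAVIGATION = auto()
    COMMAND_AND_CONTROL = auto()

class ForceUnitMonitorAgent:
    # Behavioral agent for an emergency-monitoring helicopter (illustrative only)
    def __init__(self, base_position, speed_kmh):
        self.phase = Phase.STAYING_AT_BASE
        self.position = base_position
        self.speed_kmh = speed_kmh
        self.mission_area = None

    def dispatch(self, mission_area):
        # Triggered by the DecisionMaking (environment) agent via the response plan
        self.mission_area = mission_area
        self.phase = Phase.NAVIGATION

    def step(self, dt_hours):
        # Advance the agent by one simulation time step
        if self.phase is Phase.NAVIGATION:
            arrived = self._move_towards(self.mission_area, dt_hours)
            if arrived:
                self.phase = Phase.COMMAND_AND_CONTROL  # start on-site monitoring

    def _move_towards(self, target, dt_hours):
        # Placeholder movement model; the real system uses routes and drift prediction
        self.position = target
        return True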
The simulation system is developed based on the above system architecture, thus
realizing three layers of logic. First, the user performs ScenarioEditing to edit the
oil spill emergency information. And the generated emergency information will be
stored in a local file. Second, the user performs DecisionMaking to load and display
the edited oil spill emergency in the interface. Then an oil spill response plan is devel-
oped based on the assisted decision-making function, which controls the Behavioral
Agent in the form of rules to realize the simulation and interaction. Finally, the
user carries out SimulationEvaluation. The simulation program loads the emergency
information and decision plan, starts the simulation according to the user’s intention,
and outputs the evaluation result after the simulation.
In the process of oil spill response, the response plan is the most central part, which
can orderly dispatch response forces and assign various missions. In the multi-agent
DEVS model, the ResponsePlan Agent is constructed to record response plan data
and control agents of the virtual simulation.
It should be clear that the response plan works on the collection of missions and
response forces in two aspects: first, the matching of response forces, characterizing
the assignment of force units (such as helicopters); second, the acting of force units, characterizing the temporal properties of sending the force units to the mission area to perform the missions.
[Fig. 2.4: framework of the simulation system — the environment agent (ScenarioEditing, DecisionMaking, SimulationEvaluation), the behavioral agents (ForceUnit_Monitor and ForceUnit_Cleaner), and the data agents (oil spill emergency unit, equipment base, airline/route, cleaning force unit), connected through the DEVS model]
Second, determine the called force set Force as Eq. (2.2), with a total of m force
units.
Third, assign actions to force units and determine the mission assignment matrix
M as Eq. (2.3). For example, if there are n actions to be assigned to m force units,
the matrix M is expressed as follows:
$$M = (M_{ij})_{m \times n} = \begin{bmatrix} M_{11} & M_{12} & \cdots & M_{1n} \\ M_{21} & M_{22} & \cdots & M_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ M_{m1} & M_{m2} & \cdots & M_{mn} \end{bmatrix} \quad (2.3)$$

where $M_{ij} \in \{0, 1\}$, and $M_{ij} = 1$ means that the i-th response force unit performs the j-th mission.
Fourth, determine the action matrix Ak as Eq. (2.4) for each force unit, character-
izing the temporal attributes of the actions of sending the force unit to the mission
area to perform the mission. For example, the k-th action is matched to a mission.
$$A_k = (a_{ij})_{n_k \times n_k} = \begin{bmatrix} 0 & 0 & \cdots & 0 \\ a_{21} & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ a_{n_k 1} & a_{n_k 2} & \cdots & 0 \end{bmatrix} \quad (2.4)$$
$$RP_{\text{Mission}} = \langle M, A \rangle \quad (2.6)$$
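For illustration, a small sketch (in Python/NumPy) of how the mission assignment matrix of a response plan can be represented and checked; the numbers below are invented for the example and are not taken from the paper.

import numpy as np

# m = 2 force units (e.g. helicopters), n = 3 missions
# M[i, j] = 1 means that force unit i performs mission j
M = np.array([
    [1, 0, 1],   # unit 0: emergency monitoring of two areas
    [0, 1, 0],   # unit 1: oil spill cleaning
])

assert set(np.unique(M)) <= {0, 1}     # entries are 0/1 only
assert (M.sum(axis=0) >= 1).all()      # every mission is assigned to some unit
print(np.nonzero(M[0])[0])             # missions of force unit 0 -> [0 2]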
evaluation system. Relevant studies [5, 6] reveal that safety and efficiency should be
considered when evaluating and screening response plans. In simple terms, safety
indicators consist of helicopter safety and environmental safety (mainly considering
the harm degree of oil spill to the environment), and efficacy indicators incorporate
both emergency monitoring efficacy and oil spill disposal efficacy.
Specifically, the safety indicators embody the remaining fuel of force units, the
offshore distance, the hazard degree of the oil spillage, and the duration of the oil
spill. Efficacy indicators include emergency dispatch time of monitoring force, total
detection time, oil spill disposal and response resource consumption, and response
completion. The evaluation indicator system is given in Table 2.3.
C1 and C2 are safety criteria, which consider the safety of the force units and the environment. C3 and C4 are efficiency criteria, corresponding to the two phases of oil spill response: emergency monitoring and oil spill disposal.
Since there are certain intrinsic links between the indicators and their structure resembles a network, the analytic network process (ANP) was used to determine the weight coefficients of the indicators. See [5] for details of the ANP method.
In this paper, eight indicators are defined, and the set of their values is represented
by the matrix V as Eq. (2.7). The elements in V correspond to these eight indicators,
respectively.
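As a rough sketch of how the weighted aggregation of the eight indicators can be computed once the ANP weights are available, consider the following Python example; the weights and indicator values are invented for illustration, and the exact normalisation used in the paper may differ.

import numpy as np

# ANP weights for the eight indicators (I11, I12, I21, I22, I31, I32, I41, I42)
w = np.array([0.12, 0.13, 0.14, 0.11, 0.13, 0.12, 0.13, 0.12])   # sums to 1
assert abs(w.sum() - 1.0) < 1e-9

# Indicator values of one simulated response plan, already normalised to [0, 1]
V = np.array([0.80, 0.62, 0.71, 0.55, 0.40, 0.45, 0.66, 0.58])

overall_value = float(w @ V)   # weighted sum used to compare response plans
print(round(overall_value, 4))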
Suppose an oil spill emergency incident occurs in the sea area at point A. The scenario
information of the emergency is given in Table 2.4.
has a lower disposal resource efficiency ratio than RP1, so RP1 has the best overall disposal effect.
Further analysis reveals that there is still some room for improvement for RP1.
Indicator I31 (monitoring movement efficiency) has the lowest score after weighting,
as does its related indicator I32 (total time of monitoring). This result suggests that the
developed response plan should give priority to optimizing the speed of emergency
monitoring force departure. In addition, the general principle should be to maximize
the effectiveness of the helicopter while ensuring the safety of the aircraft.
Based on the above, the Response Plan 4 (RP4) is proposed: dispatch emer-
gency monitoring force from Xiamen equipment depot, including the helicopter
B-0004(SC-76++) and oil spill cleaning force from Quanzhou equipment depot
including the helicopter B-0002(H-410) (Table 2.8).
The indicators I22, I31, and I32 of RP4 have all improved, with a comprehensive response plan evaluation value of 0.5492, 5.67% higher than that of RP1, implying
that RP4’s emergency monitoring effectiveness has been significantly improved.
As can be seen from the above instances, the developed response plans are simu-
lated and evaluated in the multi-agent system and optimized based on the evaluation
results.
2.5 Conclusion
In this paper, a virtual simulation model is constructed for each process of the oil
spill response plan, which fully considers the interaction between the elements and
thus is closer to the actual process of oil spill response decision and command. The
evaluation indicators system in this paper that quantitatively evaluates the response
plan through different dimensions can provide decision support for oil spill response.
(1) A simulation model based on a hybrid modeling approach combining multi-
agent and DEVS is proposed, which can accurately describe events, activities,
and processes of the oil spill response process, especially the interactions and
state changes therein.
(2) The response plan is defined and its evaluation indicator system is established
based on the multi-agent model, where the ANP method is applied to evaluate
the response plan.
(3) On the basis of the above research, this paper develops a virtual simulation
system for simulation evaluation and analysis and conducts preliminary valida-
tion on the model and the method. The decision support of the above research
results is verified by specific cases. Nevertheless, more functions and algorithms
for decision support are yet to be studied in depth.
Acknowledgements I would like to thank my tutors, Professor Hu Liu and Yongliang Tian for
their guidance and my dear friend Xiang He for her encouragement. This research did not receive
any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
References
1. Aderson, E.L, et al.: The OILMAPWin/WOSM oil spill model: application to hindcast a river
spill. In Proceeding of the 18th Arctic and Marine Oil Spill Program, Technical Seminar,
Edmonton, Alberta, Canada, pp. 793–817 (1995)
2. Leech, M., et al.: OSIS: a Windows oil spill information system. In: Proceedings of the 16th Arctic and Marine Oil Spill Program Technical Seminar, Calgary, Alberta, Canada, vol. 5, no. 1, pp. 27–30 (1993)
3. Zou, C.J., Yin, Y., Liu, X.W., et al.: Research and Implementation of a 3D exercise system for
offshore oil spill response. J. Syst. Simul. 030(003), 906–913 (2018)
4. Yu, Y., Mao, D., Yin, H., Zhang, X., Sun, C., Chu, G.: Simulated training system for undersea
oil spill emergency response. Aquatic Proc. 3, 173–179 (2015)
5. Liu, H., Chen, Z., Tian, Y., et al.: Evaluation method for helicopter maritime search and rescue
response plan with uncertainty. Chinese J. Aeron. 34(4), 493–507 (2021)
6. Guo, C., Zhang, S., Jiang, Y.: A multiple criteria decision method for selecting maritime search
and rescue scheme. Mech. Electr. Technol. 4, 2334–2338 (2012). Trans Tech Publications Ltd.
7. Hospital, A., Stronach, J.A., McCarthy, W., et al.: Spill response evaluation using an oil spill
model. Aquatic Proc. (2015)
8. Weiwei, J., Wei, A., Yupeng, Z., Zhaoyu, Q., Jianwei, L., Shasha, S.: Research on evaluation of
emergency response capacity of oil spill emergency vessels. Aquatic Proc. 3, 66–73 (2015)
9. Tian, Y.F., Liu, H., Huang, J.: Design space exploration in aircraft conceptual design phase based
on system-of-systems simulation. Int. J. Aeron. Space Sci. 16(4), 624–635 (2015)
Chapter 3
Transferring Dense Object Detection
Models To Event-Based Data
3.1 Introduction
Fig. 3.1 Object detection on the KITTI dataset [5]. Cyan boxes denote ground-truth. Pink boxes
denote predictions. Top: Using sparse event-histograms. Bottom: Using source RGB-image respon-
sible for the off events
Sparse convolutional layers [6] compute convolutions only at active (i.e. non-
zero) sites. The sub-type of ‘valid’ or ‘submanifold’ sparse convolutional layers
furthermore tries to preserve the sparsity of the data by only producing output sig-
nals at active sites, which makes them highly efficient at the cost of restricting signal
propagation. Non-valid sparse convolutions are semantically equivalent to dense convolution layers in that they compute the same result given identical inputs. Valid
or submanifold sparse convolution layers, on the other hand, differ from dense con-
volutions, but still provide a good approximation for full convolutions on sparse
data.
Messikommer et al. [8] further introduce asynchronicity into the network. This allows
for samples to be fed into the network in parts as they are produced by a sensor,
and thus to reduce the latency in real-time applications. Several small batches of
events from the same sample can be processed sequentially, producing identical
results to synchronous layers once the whole sample has been processed. However,
[8] only implemented a proof-of-concept. The project only includes asynchronous
submanifold sparse convolutional and batch-norm layers, whereas the sparseconvnet
(SCN) project [6] provides a full-fledged library. Furthermore, asynchronous models
cannot be trained, as the index_add operation used in the forward function is not
supported by PyTorch’s automatic gradient tracking. This, however, does not pose a
problem, as each layer is functionally equivalent to its SCN counterpart. Therefore, it
is possible to train an architecturally identical SCN network and transfer the weights.
As the asynchronous property is only relevant during inference, this does not pose a
limitation.
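A minimal sketch of this train-then-transfer workflow is shown below; the model classes and file path are hypothetical placeholders (not the actual project code), and the transfer only works because both models are architecturally identical, so their state dicts match.

import torch
from models import SparseYoloSCN, AsyncSparseYolo  # hypothetical, architecturally identical models

scn_model = SparseYoloSCN()            # trainable SCN variant
# ... train scn_model with a standard PyTorch training loop ...
torch.save(scn_model.state_dict(), "sparse_yolo_scn.pth")

async_model = AsyncSparseYolo()        # inference-only asynchronous variant
state = torch.load("sparse_yolo_scn.pth", map_location="cpu")
async_model.load_state_dict(state)     # requires matching parameter names and shapes
async_model.eval()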
In this work, we use the YOLO v1 model [10] as a simple but powerful dense object
recognition baseline. We model sparse networks architecturally identical to YOLO
v1 using the SCN [6] and asynet [8] frameworks. These serve as a case study to
evaluate the performance of sparse and asynchronous vs dense object detection.
We implement all variants in PyTorch and evaluate the predictive performance
and runtime requirements against a dense variant. To this end, we convert the KITTI
Vision dataset to events using [3]. This allows us to answer the question of whether these novel technologies are a viable optimization over dense convolutional layers, or whether they fall short of expectations in practice.
The remaining part of this work is structured as follows: First, Sect. 3.2 introduces
data formats required for the remainder of this work. Next, Sect. 3.3 details the major
changes and additions to the used frameworks. Section 3.4 evaluates the sparse and
dense YOLO versions w.r.t. performance, and Sect. 3.5 regarding runtime. Section
3.6 concludes our work by discussing our results and providing an outlook.
Optical event data can be represented in various formats. A simple and lossless encoding is a sequence of discrete events storing the spatial and temporal location and the polarity of the change [2], as the amount of change is usually assumed to be fixed within one dataset.
This format is, however, badly suited to processing with e.g. CNNs, as the
sequence length is variable across samples and unbounded. A common format that overcomes this limitation is the event-histogram, which accumulates all events into a single frame similar to an image, but shows changes of brightness during the defined interval instead of absolute brightness values at a single point in time.
In this work, we use the event-histogram representation from the asynet [8] frame-
work producing two channels, where each pixel value represents the sum of all
observed event changes of negative or positive polarity at this spatial location.
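A minimal sketch of this two-channel event-histogram construction is given below; the array layout and variable names are ours and do not reproduce the asynet implementation.

import numpy as np

def events_to_histogram(events, height, width):
    # events: array of shape (N, 4) with columns (x, y, t, polarity), polarity in {+1, -1}
    # returns a (2, height, width) histogram: channel 0 = positive, channel 1 = negative
    hist = np.zeros((2, height, width), dtype=np.float32)
    for x, y, _t, p in events:
        channel = 0 if p > 0 else 1
        hist[channel, int(y), int(x)] += 1.0   # count events at this pixel
    return hist

# Example: two positive events and one negative event
events = np.array([[10, 5, 0.001, +1],
                   [10, 5, 0.002, +1],
                   [3,  7, 0.004, -1]])
h = events_to_histogram(events, height=16, width=32)
print(h[0, 5, 10], h[1, 7, 3])   # -> 2.0 1.0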
3.3 Implementation
1 https://github.com/zzzheng/pytorch-yolo-v1.
2 https://github.com/paroj/rpg_asynet.
as the model does not support training anyway and the dense feature map tensor is
passed through the network alongside the sparse events as part of the sparse repre-
sentation.
Both sparse models employ submanifold sparse convolution layers where the
dense network uses convolutions with stride 1 to achieve maximum performance.
We adapted the trainers for dense and sparse object detection models already imple-
mented in the asynet code for improved logging and debugging and added early
stopping. However, neither the SCN, nor the asynet framework contained sufficient
functionality to directly implement a YOLO v1 network.
In the case of SCN, the deficit was minimal, as it only lacked the ‘same’-padding
feature in its sparse convolutional layer. To get around that limitation, we chose a rather inefficient but easy way: converting the sparse tensor into a standard dense PyTorch tensor, padding this dense representation, and then converting it back into an SCN
sparse tensor. This does not affect our evaluation, as it can be easily excluded from
the runtime evaluation carried out via profiling, and does not change the results of
the layer computations.
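A schematic version of this dense detour is shown below. The helpers `sparse_to_dense` and `dense_to_sparse` are hypothetical placeholders for the SCN conversion routines; the sketch only illustrates the idea, not the actual SCN code:

```python
import torch
import torch.nn.functional as F

def same_pad_sparse(sparse_tensor, pad, sparse_to_dense, dense_to_sparse):
    """Emulate 'same' padding for a sparse tensor via a dense detour.

    sparse_to_dense / dense_to_sparse are placeholder callables standing in
    for the SCN conversion utilities; pad is the per-side spatial padding.
    """
    dense = sparse_to_dense(sparse_tensor)        # (N, C, H, W) dense copy of the sparse tensor
    padded = F.pad(dense, (pad, pad, pad, pad))    # zero-pad the spatial dimensions
    return dense_to_sparse(padded)                 # convert back into the sparse format
```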
The asynet framework, however, was missing a layer type. The existing ‘asyncSpar-
seConvolution2D’ layer implements an asynchronous valid or submanifold sparse
convolution only. The project does not contain an implementation of an asynchronous
(non-valid) sparse convolution. We therefore implemented the asynNonValidSparseConvolution2D layer, based on the asyncSparseConvolution2D implementation.
To ensure correctness, we again specified test cases to verify our implementation.
Additionally, the original code did not filter duplicate events within a sequence,
causing each active site to be processed as often as the number of duplicate events
(at the same spatial location) in the sequence. This behaviour caused the runtime to
increase by several orders of magnitude, while also producing incorrect results in
case of duplicate events.
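Filtering duplicates amounts to keeping each spatial location only once per sequence; a minimal NumPy sketch of this idea (our own illustration, not the framework code):

```python
import numpy as np

def unique_active_sites(events):
    """Keep one event per spatial location within a sequence.

    events: array of shape (N, 4) with columns (x, y, t, polarity).
    Returns the first occurrence of each (x, y) pair, preserving order.
    """
    _, first_idx = np.unique(events[:, :2], axis=0, return_index=True)
    return events[np.sort(first_idx)]
```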
3.3.3 Dataloader
We implemented a dataloader for the KITTI Vision dataset analogously to the already
available dataloaders for various other datasets (NCaltech101 among others). We
adapted code available through the dataset’s release site [4] for converting the ground
truth bounding boxes and labels into the commonly used format defined by the Pascal
VOC dataset. Each sample is converted to events at runtime because of the enormous
storage overhead of preprocessing the whole dataset. The spatial locations of the
events are then rescaled to the required tensor size, and finally accumulated into an
event histogram. Additionally, we implemented a version without the conversion into
events to be used to train the dense YOLO network on the original images.
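The rescaling of event coordinates can be sketched as follows (a hypothetical helper assuming integer pixel coordinates and a target input-tensor size; the rescaled events would then be passed to the histogram accumulation shown earlier):

```python
import numpy as np

def rescale_events(xs, ys, src_hw, dst_hw):
    """Rescale event coordinates from the source resolution to the input tensor size."""
    src_h, src_w = src_hw
    dst_h, dst_w = dst_hw
    xs = np.clip((xs * dst_w) // src_w, 0, dst_w - 1)   # scale and clamp x coordinates
    ys = np.clip((ys * dst_h) // src_h, 0, dst_h - 1)   # scale and clamp y coordinates
    return xs, ys
```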
The intuition behind sparse CNNs is to speed up, and reduce energy consumption of,
dense CNNs by eliminating unnecessary computations. As such, we require sparse
CNNs to match the prediction performance of dense CNNs, while reducing resource
consumption.
We start by verifying the first condition, namely recognition performance match-
ing that of the dense model. Due to the high costs of training image detection networks, most evaluations were performed only with limited redundancy, as can be seen in
Table 3.1. As the goal of this evaluation is a qualitative comparison of different archi-
tectures, rather than trying to achieve state-of-the-art results, it is acceptable to omit
hyper-parameter tuning and use the same parameters for all models. A proper con-
vergence of each training run, as well as the absence of strong outliers within the per-
formed experiments, minimizes the risk of non-representative and non-reproducible
results.
We first compared the dense YOLO network trained on dense images directly with
our structurally identical sparse implementation trained on 42ms event-windows.
Table 3.1 Median mAP and YOLO loss values over 3 training runs for different models, data, and sparse event-window size

Model         Data               Window-size   Med. mAP   Med. loss
Dense YOLO    Dense images       N/A           0.1914     0.4465
Dense YOLO    Event-histograms   33 ms         0.1777     0.4448
Dense YOLO    Event-histograms   42 ms         0.2055     0.3921
Sparse YOLO   Event-histograms    8 ms         0.2301     0.3410
Sparse YOLO   Event-histograms   16 ms         0.2332     0.3394
Sparse YOLO   Event-histograms   25 ms         0.2337     0.3413
Sparse YOLO   Event-histograms   33 ms         0.2115     0.3742
Sparse YOLO   Event-histograms   42 ms         0.2321     0.3443
Sparse YOLO   Event-histograms   50 ms         0.2311     0.3436

Best values: mAP 0.2337 (Sparse YOLO, 25 ms); loss 0.3394 (Sparse YOLO, 16 ms)
The mAP score shows the sparse model to perform approximately 20% better
than the dense baseline. This indicates that both the conversion from dense images
to sparse events and our model implementation work as intended.
The increase in performance can be explained by the availability of additional
information: The dense model is restricted to a single image per sample, the sparse
model, however, is trained on events synthesized from a sequence of images. While
events encode only the change and lose the information about absolute brightness,
it can be argued that the change, which effectively encodes moving edges, is more
beneficial to object recognition than colour and absolute brightness.
The maximum achieved mAP of about 23% is significantly lower than the values YOLO v1 reportedly achieved on other datasets. This is likely due to differences in the mAP calculation.3 However, we use the same calculation throughout this work, which makes our results comparable to each other.
The ‘YOLO loss’, as presented in the original YOLO paper [10], shows consider-
ably larger differences between the sparse and dense variants. This metric, however,
constitutes a loss function to be optimized during training and is not necessarily
suited to compare the performance of different models.
Training the dense YOLO network on event histograms instead of dense images did not improve its performance notably. It consequently also performed approximately 15% worse than the sparse model trained on the same data. This indicates that the submanifold sparse convolutions, which are the only semantic difference from the dense model, contribute substantially to the ability of the model to process sparse event data. While similar results might be achieved using only dense convolutions by altering the model structure, it is remarkable that submanifold sparse convolutions enable the transfer of a model optimized for dense images to events without requiring additional hyper-parameter tuning.
Against our expectations, event-window size within a reasonable range has little to
no effect on the performance. We tested several configurations, starting from 42 ms,
which covers exactly one frame of the original dense dataset captured at a frame
rate of 24 fps. As this frame rate is, however, rather low, and 42 ms contains a huge
amount of events, we tested mainly values below that.
As each training run takes more than one day on our available hardware, we
decided to use fixed, pre-trained weights for initialization instead of averaging over
3 https://github.com/thtrieu/darkflow/issues/957.
Sparse CNNs claim to be more efficient in terms of number of operations and subse-
quently runtime and energy consumption. To verify this claim we profiled the three
different implementations using cProfile.4
For these experiments we chose to evaluate the full YOLO v1 network on real data. This ensures a realistic ratio of active sites to locations in the tensor, as well as a realistic number of events per active site. While in the dense framework all convolution layers are essentially the same layer type, in the sparse frameworks all unstrided convolutions can be translated to more efficient submanifold sparse convolutions, whereas strided convolutions have to be implemented as non-valid sparse convolution layers.
Due to the high runtime of the asynchronous framework, all models are profiled
over 1042 fixed samples (14%) of the KITTI Vision dataset.
Although in theory more efficient on sparse data than dense convolutions, sparse CNNs still show higher runtimes. The standard dense convolutions in PyTorch benefit from highly optimized code and better hardware support for their massively vectorized operations; the research-oriented, proof-of-concept implementations of sparse convolutions, although also well optimized in the SCN framework, cannot compete in real use cases yet.
Figure 3.2a shows the cumulative runtime of those layers that are not identi-
cal in dense and sparse networks. Convolutions in sparse networks are split into
(non-valid) sparse-convolutions and submanifold-sparse-convolutions. Here the per-
formance difference is most significant, with an increase in runtime of more than
three times. Furthermore, convolutional layers usually constitute the largest part of
CNNs.
Batch norm layers only experience a small loss in performance. I/O layers, which set up additional data structures to be passed through the network with each sample to enable efficient computation of the convolutional layers, are only needed in sparse networks and add further notable overhead.
4 https://docs.python.org/3/library/profile.html.
Fig. 3.2 Cumulative runtime (over 1042 samples) of dissimilar layers of dense, sparse, and asyn-
chronous implementations of the YOLO v1 network during prediction
However, as the YOLO v1 model contains two strided convolutional layers and only one actual input layer, about two thirds of this overhead can be attributed to our implementation of the 'same'-padding option described in Sect. 3.3.1.
Repeating this experiment with a smaller batch size of 1 (instead of 30 in the previous evaluation) revealed that sparse layers do not suffer as much overhead from smaller batch sizes as dense layers or, viewed the other way around, do not gain as much from predicting more samples at once with higher batch sizes.
For convolutions, the gap between dense and sparse layers closes significantly,
while sparse batch norm layers actually overtake their dense counterpart. For sparse
networks, I/O layers show a similar overhead to convolutions (Fig. 3.3).
The results of profiling the asynchronous sparse CNNs implemented in the asynet
framework are far from encouraging. Due to the experimental and little-optimized
implementation, the asynchronous convolution layers show an increase in runtime
of roughly three orders of magnitude, as seen in Fig. 3.2b. While the synchronous
Fig. 3.3 Cumulative runtime (over 1042 samples) over batch size for dense and sparse layers
To make use of the asynchronous nature of the network, a sample will usually be
split into a number of event-sequences, which are then processed in series. When
examining the effect of the number of these sequences, we would expect an increase
in runtime for larger sequence counts due to possibly duplicated active sites, bounded
by the runtime of the 1-sequence-baseline times the number of sequences. For small
numbers of sequences the number of active sites in each sequence will not decrease
notably, as there usually is more than one event at most active sites in a sample,
thus performing close to this upper bound. For large sequence counts, however, we
expect a sub-linear increase in runtime due to a decreasing number of active sites
per sequence.
While profiling the model using two sequences exactly matches our expectations,
as shown in Fig. 3.4, the model exceeded the upper bound for three sequences.
Further analysis showed that the unexpected increase in runtime is caused by low-level functions like tensor formatting. We believe this to result from overhead
due to extremely high memory utilization. While profiling with three sequences
required 360 GB of RAM, profiling of higher sequence counts exceeded our available
resources and was thus omitted. Therefore, confirmation of our claim of sub-linear
increase in runtime for high sequence counts is left for future work.
Given the data is sufficiently sparse, the sparse convolution based method should
have a significant advantage.
• The dense convolution convolves the filter with every position of the input tensor,
yielding as many convolution operations as there are unique locations in the tensor.
• The sparse convolution convolves the filter with every active site, with the active
sites being a subset of the unique locations in the tensor.
Therefore, the complexity of the sparse convolution is generally lower than that of
the dense convolution. Additionally, submanifold sparse convolutions further reduce
complexity by only computing those parts of the convolution at each active site where
filter and active sites overlap. The main gain of submanifold sparse convolutions,
however, lies in preventing the increase in the number of active sites, additionally
reducing the complexity for the following layers.
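The difference can be made concrete by counting multiply-accumulate operations for a single layer. The sizes below are purely illustrative assumptions, not measurements from our experiments:

```python
def conv_ops(height, width, k, c_in, c_out, n_active=None):
    """MAC count for a k x k convolution over a (height x width) feature map.

    If n_active is None, a dense convolution over every location is assumed;
    otherwise only the n_active sites are convolved (sparse case).
    """
    sites = height * width if n_active is None else n_active
    return sites * k * k * c_in * c_out

dense = conv_ops(192, 640, 3, 64, 128)                   # convolve every location
sparse = conv_ops(192, 640, 3, 64, 128, n_active=9000)   # e.g. roughly 7% active sites
print(dense / sparse)   # ratio of dense to sparse operation counts
```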
In practice, sparse convolution layers require the construction of a so-called rule-
book for each sample to efficiently compute the necessary convolutions, as detailed
in [6]. While this creates some overhead, it is outweighed by the gains of only pro-
cessing the active sites. Furthermore, blocks of consecutive submanifold convolution
layers pass through the rulebook and ensure it stays valid, so that it only needs to be
computed once, further reducing the complexity.
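A toy version of such a rulebook for a submanifold convolution can be built as follows (our own simplified illustration of the idea from [6], not the SCN implementation): for every active output site, we record which active input site feeds it through which filter offset.

```python
def build_rulebook(active_sites, k=3):
    """Build a rulebook for a k x k submanifold convolution (stride 1).

    active_sites: list of (y, x) tuples of active locations.
    Returns {filter_offset: [(input_index, output_index), ...]}.
    For submanifold convolutions the output sites equal the input sites.
    """
    index = {site: i for i, site in enumerate(active_sites)}
    r = k // 2
    rulebook = {}
    for out_idx, (y, x) in enumerate(active_sites):
        for dy in range(-r, r + 1):
            for dx in range(-r, r + 1):
                neighbour = (y + dy, x + dx)
                if neighbour in index:  # contribute only where an active site overlaps the filter
                    rulebook.setdefault((dy, dx), []).append((index[neighbour], out_idx))
    return rulebook
```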
Asynchronous-sparse-convolutions split the events into multiple sequences, and
within each sequence behave like synchronous sparse convolutions. Each active
site of a synchronous-sparse-convolution is caused by at least one event, but might
accumulate many events that happened at the same spatial location. Therefore, each
active site is processed in at least one sequence and, in the worst case, in all of them. Such
layers thus have at least as high a complexity as their synchronous counterparts, but
stay within a predictable margin.
However, the sparse nature of the data hinders SIMD processing and the use of
on-chip caches—two techniques that are crucial for reaching high performance on
current hardware.
3.6 Conclusion
In this work we have evaluated the prediction performance and runtime of sparse and
asynchronous-sparse CNNs with respect to classical dense CNNs. Our experiments
have shown that sparse CNNs can match the performance of their dense counterparts
without requiring additional hyperparameter tuning.
The approach works well with synthetically generated events from an existing
dense dataset, which we believe will be beneficial for the adoption of this technol-
ogy. Whereas the production of new high-quality datasets for specialised application
domains can be very expensive, dense datasets are quite abundant in comparison,
even with the constraint of requiring image-sequences to be applicable for conversion
to events.
We think that asynchronous-sparse CNNs are a promising new concept that may
find use in real-time applications due to the extremely low latency. In practice, how-
ever, these concepts are not yet sufficiently optimized.
We have extended the experimental asynet framework for asynchronous-sparse
CNNs and shown that sparse CNNs match classical dense CNNs in prediction per-
formance. On the other hand, we found that the runtime performance of the evalu-
ated frameworks cannot yet match dense networks, and especially the asynchronous
framework can at this point only be seen as a proof-of-concept. The ease of transfer
from dense to sparse networks and the potential gains in runtime will hopefully incite
further research into these promising technologies. We believe the evaluated sparse CNN frameworks are limited by code inefficiencies and a lack of hardware support, but are, in theory, a viable optimization of CNNs.
References
1. Cannici, M., Ciccone, M., Romanoni, A., Matteucci, M.: Asynchronous convolutional networks
for object detection in neuromorphic cameras. In: Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition Workshops (2019)
2. Gallego, G., Delbruck, T., Orchard, G.M., Bartolozzi, C., Taba, B., Censi, A., Leutenegger, S.,
Davison, A., Conradt, J., Daniilidis, K., Scaramuzza, D.: Event-based vision: A survey. IEEE
Trans. Patt. Anal. Mach. Intell. (2020)
3. Gehrig, D., Gehrig, M., Hidalgo-Carrió, J., Scaramuzza, D.: Video to events: recycling video
datasets for event cameras. IEEE Conf. Comput. Vis. Patt. Recog. (CVPR) (2020)
4. Geiger, A.: The kitti vision benchmark suite (2017). http://www.cvlibs.net/datasets/kitti/eval_
object.php
5. Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? the kitti vision bench-
mark suite. Conf. Comp. Vis. Patt. Recogn. (CVPR) (2012)
6. Graham, B., Engelcke, M., van der Maaten, L.: 3d Semantic segmentation with submanifold
sparse convolutional networks. CVPR (2018)
7. Maass, W.: Networks of spiking neurons: the third generation of neural network models. Neur.
Netw. 10(9), 1659–1671 (1997)
8. Messikommer, N., Gehrig, D., Loquercio, A., Scaramuzza, D.: Event-based asynchronous
sparse convolutional networks (2020). http://rpg.ifi.uzh.ch/docs/ECCV20_Messikommer.pdf
9. Rebecq, H., Ranftl, R., Koltun, V., Scaramuzza, D.: Events-to-video: bringing modern computer
vision to event cameras. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR) (2019)
10. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time
object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 779–788 (2016)
Chapter 4
Diagnosing Parkinson’s Disease Based
on Voice Recordings: Comparative Study
Using Machine Learning Techniques
4.1 Introduction
The development of many technological tools in the past century has helped with the
advancement of societies around the globe for a better and more sustainable standard
of living. One of these advancements was in the field of medicine. Technology
has greatly aided in the discovery of cures for many diseases, yet there are a few
diseases for which no cure has been discovered. One of them is Parkinson's disease (PD). More than 10 million people worldwide suffer from PD [1]. PD is the
second most common neurodegenerative disorder, with progressive motor symptoms
worsening over time [2]. It causes patients to experience shakiness and stiffness,
Machine learning has been applied to PD prediction by many authors in previous research. In this section, a few of the algorithms previously used by other researchers to predict PD are discussed. Each research paper used several different algorithms, of which the following are the most common.
Random Forest (RF) is an ensemble machine learning algorithm: a classifier that combines a number of tree-structured classifiers [13]. Using similar datasets, both experiments done by Challa et al. [9]
and Wang et al. [11] showed a good performance of their RF models with reported
accuracies of 96.59% and 95.61%, respectively, in predicting PD. Wroge et al. [5] reported two accuracies, obtained by training two models on different datasets. The higher of the two was 83%; however, the accuracy of the model trained on the dataset similar to the one used in this paper was 81%. In addition, a RF model experimented
by Ramadugu et al. [12] yielded an accuracy of 89% for predicting PD.
4.2.3 SVM
Ramadugu et al. [12] and Hazan et al. [15] obtained the maximum accuracy from the SVM
algorithm. Wang et al. [11] achieved the second greatest accuracy using the SVM
technique, with a 0.56% difference between SVM and deep learning. This demon-
strates how successful SVM algorithms are for predicting PD when compared to
other algorithms; thus, two types of SVM algorithms are used in this paper: linear
and quadratic SVM.
4.2.4 KNN
4.3 Dataset
The dataset was obtained on November 20, 2021, from the UCI Machine Learning
Repository database [6]. The downloaded file contains two datasets for training
and testing. The collection of data is as follows. The dataset records belong to 20 Parkinson's disease (PD) patients and 20 healthy subjects. From each subject, 26 sound recordings were taken, resulting in 1,040 instances. Each record has 26 descriptive
features and 1 target feature. The descriptive features are continuous while the target
feature is binary. The dataset has no missing values. The training dataset is balanced
with a 50:50 ratio (the same number of subjects with and without PD). The testing
dataset also has the same number of descriptive features, 1 binary target feature,
and no missing values. However, the testing dataset only has 168 instances with all
having the same outcome for the target feature (all patients diagnosed with PD) for
all instances. To avoid inaccurate results and measures, the training and testing data
are combined into one dataset. The final combined dataset of 1,208 instances has a 58:42 ratio of patients with Parkinson's disease to healthy subjects.
A data quality report was generated for the combined dataset to identify potential
data quality issues. The generated data quality report indicates no missing values
or irregular cardinality. However, further observation of the difference between
maximum and third quartile range, as well as minimum and first quartile range
indicates that the following features may have outliers: Jitter local, Jitter absolute
local, Jitter rap, Jitter ppq5, Jitter ddp, Shimmer apq3, Shimmer apq5, Shimmer dda,
Minimum pitch, No. of pulses, and No. of voice breaks. This was further exam-
ined by observing the data distribution figures for the listed features. To tackle these
outliers, a data handling strategy known as clamp transformation was implemented.
Clamp transformation clamps all values above an upper threshold and below a lower
threshold to the pre-determined upper and lower threshold values. Figure 4.1 shows
the effect of clamp transformation on the data distributions of sample features with
outliers.
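A minimal sketch of the clamp transformation is given below. The thresholds shown here use the common 1.5 x IQR rule as an assumption; the exact threshold values used in the study are not specified in the text.

```python
import numpy as np

def clamp_transform(values, lower=None, upper=None):
    """Clamp values to [lower, upper]; default thresholds follow the 1.5 * IQR rule."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr if lower is None else lower
    upper = q3 + 1.5 * iqr if upper is None else upper
    return np.clip(values, lower, upper)   # values above/below the thresholds are set to them
```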
Fig. 4.1 Sample of data distribution before and after clamp transformation
4.4 Methodology
Figure 4.2 shows the sequence of steps for conducting the experiment. In step 1, the
dataset is imported to MATLAB, a programming and numeric computing platform
used by engineers and scientists to analyze data, develop, and create models [18]. The
combined dataset has numerical values ranging in different intervals for each feature.
Therefore, in step 2, all values for all features are normalized using Range Normal-
ization ranging between 0.0 and 1.0. Prior to inputting the dataset for model training
and testing, data exploration is necessary to check the reliability of the dataset. In
step 3, a data quality report is generated indicating the properties of all the features in
the dataset. Since all the features are of numeric data type, the properties included for
inspection are the count of instances, missing values, cardinality, minimum value,
first quartile, mean, median, third quartile, maximum value, and standard deviation
for each feature set. To further understand the relationships within the feature data, charts are generated to observe the trends and distribution of the data. The data exploration proce-
dure outlined hitherto is key to identifying any potential data quality issues, namely
irregular cardinality, outliers, or missing values. Any significant data quality issues
will be handled using appropriate data handling strategies in step 4.
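Step 2's range normalization maps each feature by (x - min) / (max - min) so that all values lie between 0.0 and 1.0; a minimal per-column sketch (illustrative only, the study itself used MATLAB):

```python
import numpy as np

def range_normalize(X):
    """Normalize each feature column of X to the [0.0, 1.0] range."""
    mins = X.min(axis=0)
    maxs = X.max(axis=0)
    return (X - mins) / (maxs - mins)   # assumes no constant columns (max > min)
```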
In step 5, the dataset is split into two sub-datasets with 70% assigned to the training
set and 30% assigned to the testing or holdout set using a non-stratified partitioning
method. Non-stratified partitioning does not consider the relative frequencies of the
levels of a feature in the partitioned dataset [19]. All models are cross validated
Fig. 4.2 Flow chart outlining the main steps of the experimental procedure
using holdout validation in step 6, considering the large size of the training set being
utilized. The models are then trained in step 7 using the training set and tested in
step 8 with the testing set for performance measures such as accuracy, training time,
and prediction speed. This procedure will be repeated for each model, from step 5
to step 8 for 20 iterations.
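In outline, the repeated 70/30 non-stratified holdout procedure of steps 5-8 looks as follows. This is a Python sketch of the protocol only; the study used MATLAB, and `train_model` and `evaluate` are hypothetical placeholders for fitting and scoring a classifier.

```python
import numpy as np

def run_experiment(X, y, train_model, evaluate, n_iter=20, train_frac=0.7):
    """Repeat a random (non-stratified) 70/30 holdout split for n_iter iterations."""
    results = []
    for _ in range(n_iter):
        idx = np.random.permutation(len(X))          # random order; class ratios are ignored
        n_train = int(train_frac * len(X))
        train, test = idx[:n_train], idx[n_train:]
        model = train_model(X[train], y[train])       # step 7: fit on the training set
        results.append(evaluate(model, X[test], y[test]))   # step 8: measure on the holdout set
    return results
```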
Model Selection:
As a trial, model selection across various classifiers is carried out using the Classification Learner application in MATLAB. The available classifiers include linear and nonlinear algorithms, and since the relationship between features in the PD dataset is not known, the different classifiers available for training models are tried and the accuracy of each is obtained, so that they can be compared and the best model selected.
In this experiment, four classifiers, namely logistic regression, linear support vector machine (SVM), quadratic support vector machine, and weighted K-nearest neighbor (KNN), are chosen for dataset classification. Logistic regression is a
commonly used supervised classification algorithm when the output of the data in
question has a binary output, relevant to the dataset in this experiment (predicting
whether the subject has PD or not). Logistic regression makes use of a logistic, or a
sigmoid function, to draw an optimal separating hypothesis function to fit the binary
dataset separating the two classes in the hyperplane [19]. The sigmoid function is
represented in Eq. 4.1. The learning rate used by the classifier is set to 0.01.
M_w(d) = Logistic(w · d) = 1 / (1 + e^(−w·d))    (4.1)
In [7], the authors made use of an SVM classifier to train a model that, when tested, yielded an accuracy of 80% for features similar to those being used in this
experiment. This accuracy for SVM was better than most other models in their study
for GeMaps features. This became a basis to select this algorithm. SVM is a popular
and powerful classifier that aims to construct an optimal separating hyperplane in
the feature space between the two classes, much like logistic regression. However,
SVM makes use of the kernel trick which helps in accurately performing nonlinear
classification [7]. The kernel trick allows data to be linearly separable by projecting
them into higher dimensions. Linear SVM and quadratic SVM classifiers are used, as both provided high accuracy during classifier selection. Although both
are SVM classifiers, the difference between these classifiers is the shape of the
decision boundary in the feature space. The kernel functions being used in the SVM
classifier are represented in Eqs. 4.2 and 4.3.
Linear kernel, where c is an optional constant:

kernel(d, q) = d · q + c    (4.2)

Polynomial kernel of degree p (quadratic for p = 2):

kernel(d, q) = (d · q + 1)^p    (4.3)
K-nearest neighbor (KNN) is often used as the first choice for classification study
since it is one of the most fundamental and simple classification methods that is used
when there is little or no prior knowledge on how the data is distributed [20], which
is the case with the dataset used in this study. It is a supervised learning algorithm
used in both regression and classification, calculating distances between the test data
and all training points. Then, the trained model predicts the target level with the
majority vote from the set of K-nearest neighbors [18]. There is uncertainty about
the distribution among the features in the dataset being used in this experiment,
hence the KNN classifier is used. In this experiment, weighted KNN is used since it provided better accuracy during the model selection procedure. Unlike the classic KNN classification algorithm, weighted KNN assigns
different weights to the nearest neighbors according to the distance to the unclassified
sample [21]. The weighted KNN is represented in Eq. 4.4. By default, the number
of neighbors (k) is set to 1.
M_k(q) = argmax_{l ∈ levels(t)} Σ_{i=1}^{k} ( 1 / dist(q, d_i)² ) × δ(t_i, l)    (4.4)
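A compact sketch of the distance-weighted vote of Eq. 4.4 is shown below; it is illustrative only, since the study used MATLAB's Classification Learner rather than custom code:

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, query, k=10):
    """Predict the class of `query` by an inverse-squared-distance weighted vote of k neighbours."""
    d = np.linalg.norm(X_train - query, axis=1)
    nearest = np.argsort(d)[:k]
    weights = 1.0 / (d[nearest] ** 2 + 1e-12)        # squared-inverse distance weights
    scores = {}
    for idx, w in zip(nearest, weights):
        scores[y_train[idx]] = scores.get(y_train[idx], 0.0) + w
    return max(scores, key=scores.get)                # level with the largest weighted vote
```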
In MATLAB, the steps function, a useful tool that immediately plots the response
of a step input without the need to solve for the time response analytically, can be
used to optimize the models that are obtained after training for better accuracy [22].
Various inputs for the number of steps will be evaluated, considering the number of
features.
4.5 Results
The device used to conduct the experiment was a Windows 10 (64-bit) machine with 16 GB RAM and an Intel® Core™ i7-8750H CPU @ 2.20 GHz.
Table 4.2 provides information about the average performance measures of each
model that was tested using the test dataset. The kernel scale for the SVM models was
automatically set by the Classification Learner app while training. For the Weighted
KNN, the number of neighbors is set to 10, and the distance metric used is Euclidean distance. The distance weight is measured using the squared inverse, and the data
was standardized.
The main metric of performance measure, known as accuracy, used for model
selection is defined in Eq. 4.5. Accuracy is described as the summation of true
positive and true negative predictions over the summation of all predictions made.
Accuracy = (TP + TN) / (TP + TN + FP + FN)    (4.5)
Table 4.2 Average performance measures for various classifiers used in the study
Model type   Accuracy Train (%)   Accuracy Test (%)   Prediction speed (obs/sec)   Training time (sec)
The prediction speed measures how fast the model makes predictions and can be used to compare the speeds of different models and choose the fastest, and therefore most effective, one. It is measured in observations (predictions) per second. Fast training is also preferable for efficiency, hence the training time is measured as well. Figure 4.3a–c shows the statistical summary of the performance measures over all 20 iterations. A boxplot summary includes the minimum, first quartile, median, third quartile, and maximum [23].
Figure 4.3a shows the accuracy of testing and training for each model. It is clearly shown that the first two models fall within the same range, with a mean accuracy lower than that of the last two models. The last two models, quadratic SVM and weighted KNN, have a similar mean accuracy, although the testing accuracy mean and range for quadratic SVM are larger than for weighted KNN. Quadratic SVM is the only model
without any outliers in accuracy. As shown in Fig. 4.3b, both SVM models have higher
prediction speed than the other two models with a mean speed of approximately 6000
observations per second. Weighted KNN had a very distinct low prediction speed. All
models had a few outliers in the prediction speed. For the training times presented in Fig. 4.3c, it can be noted that logistic regression, quadratic SVM, and weighted KNN have similar mean training times of 5.4–5.8 s. In addition, quadratic SVM, as well as logistic regression, had no outliers in the measured training times.
The average performance measures that were calculated indicate that almost all
the classifiers performed similarly with the accuracy ranging between 65 and 71%.
The worst-performing models were logistic regression and linear SVM, with accuracies of only 65.1% and 65.8%, respectively. Furthermore, both these models have an
unfavorable high training time. Despite not having the best accuracy among other
classifiers, the Quadratic SVM proved to be the best performing model with an
accuracy of 70.0%, the highest prediction speed of 55,435 observations per second,
and the lowest training time of approximately 4.8 s. Although the weighted KNN has the best accuracy, it had a lower prediction speed and a higher training time compared to
quadratic SVM. It can be observed that the two nonlinear models performed better
than the linear models, suggesting that the distribution of data is nonlinear. The
step function command in MATLAB yielded a quadratic equation which aids in
explaining why the quadratic SVM might have performed better than other models.
4.6 Conclusion
References
1. Ball, N., Teo, S., Chandra, Chapman, J.: Parkinson’s Disease and the Environment. In: Frontiers
in Neurology, vol. 10. https://doi.org/10.3389/fneur.2019.00218 (2019). Last accessed 24 Feb
2022
2. Wong, S., Gilmour, H., Ramage-Morin, P.: Parkinson’s Disease: Prevalence, Diagnosis and
Impact, pp. 10–14 (2022)
3. Dauer, W., Przedborski, S.: Parkinson’s Disease. In: Neuron, vol. 39, no. 6, pp. 889–909. https://
doi.org/10.1016/s0896-6273(03)00568-3 (2003). Last accessed 24 Feb 2022
4. Stern, M.: Parkinson’s disease: early diagnosis and management. J. Family Pract.
26(4) (1993). https://link.gale.com/apps/doc/A13781209/AONE?u=anon~1081cb2b&sid=
googleScholar&xid=63168d10. Last accessed 24 Feb 2022
5. Wroge, T.J., Özkanca, Y., Demiroglu, C., Si, D., Atkins, D.C., Ghomi, R.H.: Parkinson’s
disease diagnosis using machine learning and voice. In: IEEE Signal Processing in Medicine
and Biology Symposium (SPMB), pp. 1–7 (2018)
6. Erdogdu Sakar, B., Isenkul, M., Sakar, C.O., Sertbas, A., Gurgen, F., Delil, S., Apaydin, H.,
Kursun, O.: Collection and analysis of a Parkinson speech dataset with multiple types of sound
recordings. IEEE J. Biomed. Health Inform. 17(4), 828–834 (2013)
7. Koza, J.R., Bennett, F.H., Andre, D., Keane, M.A.: Automated design of both the topology
and sizing of analog electrical circuits using genetic programming. In: Artificial Intelligence
in Design 96, pp. 151–170. Springer, Dordrecht (1996)
8. Collins, M., Schapire, R., Singer, Y.: Machine Learning, vol. 48, no. 13, pp. 253–285. https://
doi.org/10.1023/a:1013912006537 (2002). Last Accessed 24 Feb 2022
9. Challa, K.N.R., Pagolu, S., Panda, G., Majhi, B.: An improved approach for prediction of
Parkinson’s disease using machine learning techniques. In: International Conference on Signal
Processing, Communication, Power, and Embedded System (SCOPES), pp. 1446–1451 (2016)
10. Menezes, F., Liska, G., Cirillo, M., Vivanco, M.: Data classification with binary response
through the Boosting algorithm and logistic regression. Expert Syst. Appl. 69, 62–73 (2017).
https://doi.org/10.1016/j.eswa.2016.08.014
11. Wang, W., Lee, J., Harrou, F., Sun, Y.: Early detection of Parkinson’s disease using deep learning
and machine learning. In: IEEE Access, vol. 8, pp. 147635–147646 (2020)
12. Akhil, R., Rayyan Irbaz, M., Aruna, M.: Prediction of Parkinson’s disease using machine
learning. In: Annals of the Romanian Society for Cell Biology, pp. 5360–5367 (2021)
13. Liu, Y., Wang, Y., Zhang, J.: New machine learning algorithm: random forest. In: Infor-
mation Computing and Applications, pp. 246–252. https://doi.org/10.1007/978-3-642-34062-
8_32 (2012). Last Accessed 24 Feb 2022
14. Gandhi, R.: Support vector machine—introduction to machine learning algorithms. In:
Medium. https://towardsdatascience.com/support-vector-machine-introduction-to-machine-
learning-algorithms-934a444fca47?gi=459bd9a91d37 (2018). Last Accessed 16 Feb 2022
15. Hazan, H., Hilu, D., Manevitz, L., Ramig, L., Sapir, S.: Early diagnosis of Parkinson’s disease
via machine learning on speech data. In: IEEE 27th Convention of Electrical and Electronics
Engineers in Israel. (2012).
16. Pandey, J.A.: Comparative analysis of KNN algorithm using various normalization techniques.
Int. J. Comp. Netw. Inform. Secur. 9, 36–42. https://doi.org/10.5815/ijcnis.2017.11.04 (2017).
Last Accessed 24 Feb 2022
17. Keck, T.: FastBDT: a speed-optimized multivariate classification algorithm for the Belle II
experiment. Comput. Softw. Big Sci. 1(1) (2017)
18. Kelleher, MacNamee, B., D’Arcy, A.: Fundamentals of Machine Learning for Predictive Data
Analytics: Algorithms, Worked Examples, and Case Studies. The MIT Press (2015)
19. Kambria, K.: Logistic regression for machine learning and classification. In: Kambria. https://
kambria.io/blog/logistic-regression-for-machine-learning/ (2021). Last Accessed 11 Dec 2021
20. Peterson, L.E.: K-nearest neighbor. Scholarpedia 4(2), 1883 (2009)
21. Zuo, W., Zhang, D., Wang, K.: On kernel difference-weighted k-nearest neighbor classification.
Patt. Anal Appl. 11, 247–257 (2008)
22. Control Tutorials for MATLAB and Simulink: Extras: Generating a Step Response
in MATLAB. https://ctms.engin.umich.edu/CTMS/index.php?aux=Extras_step (2021). Last
accessed 11 Dec 2021
23. Williamson, D.: The box plot: a simple visual method to interpret data. Ann. Inter. Med.
110(11), 916 (1989)
Chapter 5
Elements of Continuous Reassessment
and Uncertainty Self-awareness:
A Narrow Implementation for Face
and Facial Expression Recognition
Stanislav Selitskiy
5.1 Introduction
Artificial intelligence (AI) is quite a vague terminology artefact that has been
overused so many times, sometimes even for describing narrow software imple-
mentations of simple mathematical concepts. It is understandable that to separate
high-level AI from the narrow level, such abbreviation as Artificial General Intelli-
gence (AGI) was introduced, and alternatives like “human-level AI” pops up peri-
odically [1]. However, inherent terminological fuzziness will remain if AI/AGI even
be reserved only for complex and sophisticated systems. The very founders of AI
research, such as A. Turing and J. McCarthy, who coined the very term AI, were
sceptical about the worthiness of the attempts to answer what AI is. Instead, they
suggested answering the question of how well AI can emulate human intelligence
S. Selitskiy (B)
School of Computer Science and Technology, University of Bedfordshire, Park Square, Luton LU1 3JU, UK
e-mail: [email protected]
and finding ways of quantifying the success of that imitation [11, 19]. N. Chomsky,
in numerous lectures and publications (f.e. [6]), even more categorically elaborated
that AI is a human linguistic concept rather than an independent phenomenon.
Suppose we accept discussing AI in the context of human-likeness. There should still be room for learning from simple and narrow machine learning (ML) algorithms if they can be used as "building blocks" and working approximations of human-like intelligence. In this work, we concentrate on two aspects of human-like intelligence functionality: continuous lifelong learning, and reflection on, or awareness of, the learning and its imperfection and uncertainty. In the narrow ML domain, parallels to the first may be found in the concept of (obviously) Lifelong Learning (LTL) and, sometimes, in the narrower concepts of Continuous or Online Learning. The second aspect can be found in the meta-learning area of research.
The LTL concept was introduced in the mid-1990s in the context of robot learning [17]. Instead of the standard approach of teaching a robot each particular task in isolation from the others, it was proposed to use invariants learned from one task to help the robot learn another task. Unlike in many idealized lab-level ML applications, in robotics, expecting perfect knowledge of the ever-changing reality or a precise model of the robot itself is unrealistic, and constant learning "on the go" is essential.
The knowledge base in LTL was suggested to consist of structured data and of models trained on that data. The LTL learner may face a variety of tasks during its lifetime, and each new learning task may benefit from the saved successful models and from examples of the data they were trained and applied to [16]. Models saved in the knowledge base may also be accompanied by meta-data or "hints" about the approximation transformations they represent [2], which can be factored into the decision to include a previous model in the solution for the new task. Another aspect of LTL extends "hints" about models to "hints" about the processes they explain.
The idea of learning the ML processes themselves was also introduced in the 1990s by the same author [18]. There exist multiple flavours of "learning to learn" or meta-learning targeting narrower and more specific tasks: as an extension of transfer learning [3, 8], as model hyper-parameter optimization [4, 12], as a wider-horizon "learning about learning" approach adjacent to explainable learning [9, 10], or as augmenting artificial neural network (ANN) models with external resources such as memory, knowledge bases, or other ANN models [13].
To bring these general considerations into a practical, although narrow, perspective, we concentrate on making the meta-learning supervisor ANN model self-adjusting based on previous experience during both training and test time. This supervisor model learns patterns in the behaviour of the underlying CNN models that are associated with failed predictions for face recognition (FR) [15] and facial expression recognition (FER) [14] tasks.
The reason for using FR and FER tasks is not only that they are quite human-centric. Although state-of-the-art (SOTA) CNN models passed the milestone of human-level face recognition accuracy a number of years ago under ideal laboratory conditions, in the case of Out-of-(training)-Data-Distribution (ODD) conditions, for example makeup and occlusions, accuracy drops significantly. The situation is even worse for FER algorithms and models, which perform far worse than FR. One reason may be that the idea that the whole spectrum of emotion expressions can be reduced to six basic facial feature complexes [7] has been challenged on the grounds that human emotion recognition is context-based: the same facial feature complexes may be interpreted differently depending on the situational context [5].
Applying the continuous uncertainty and trustworthiness self-awareness algorithms to FR and FER models, on data sets built and partitioned to exaggerate and aggravate ODD conditions, is therefore a reasonable setting for the algorithms' evaluation.
The paper is organized as follows. Section 5.2 proposes a solution for dynamically adjusting the meta-learning trustworthiness-estimating algorithm for predictions made on the FR and FER tasks. Section 5.3 describes the data set used for the experiments; Sect. 5.4 outlines the experimental algorithms in detail; Sect. 5.5 presents the obtained results; and Sect. 5.6 discusses the results, draws practical conclusions, and outlines directions for future research on open questions.
The input of the meta-learning supervisor ANN is built from the softmax activations of the ensemble of underlying CNN models. The algorithm for building the "uncertainty shape descriptor" (USD) can be described briefly as follows: sort the softmax activations inside each model vector, order the model vectors by their highest softmax activation, flatten the list of vectors, and rearrange the activations in each vector into the order of the activations in the vector with the highest softmax activation.
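A sketch of this construction is given below, assuming a matrix of softmax activations with one row per ensemble member; it reflects our reading of the description above, not the published code:

```python
import numpy as np

def uncertainty_shape_descriptor(softmax):
    """Build a USD from ensemble softmax outputs.

    softmax: array of shape (M, n_classes), one row per CNN ensemble member.
    Rows are ordered by their peak activation; the class permutation of the
    leading (most confident) row is applied to every row before flattening.
    """
    order = np.argsort(-softmax.max(axis=1))     # order model vectors by their highest activation
    ranked = softmax[order]
    perm = np.argsort(-ranked[0])                # class order of the most confident model vector
    return ranked[:, perm].flatten()             # same class order in every vector, then flatten
```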
Examples of the descriptor for the M = 7 CNN models in the underlying FR or
FER ensemble (M is a number of models in the ensemble), for the cases when none
of the models detected the face correctly, 4 models and 6 models detected the face
correctly, are presented in Fig. 5.2. It could be seen that shapes of the distribution of
the softmax activations are quite distinct and, therefore, can be subject to the pattern
recognition task which is performed by the meta-learning supervisor ANN.
However, unlike in the above-mentioned publication, for simplicity the supervisor ANN does not categorize the predicted number of correct members of the underlying ensemble but instead performs the regression task of the transformation. On a high level (ANN layer details are given in Sect. 5.4), the transformation can be seen as Eq. 5.2, where n = |C| · M is the dimensionality of every USD ∈ X, |C| is the cardinality of the set of FR or FER categories (subjects or emotions), and M is the size of the CNN ensemble (Fig. 5.1).
reg : X ⊂ R^n → Y ⊂ R    (5.2)
Fig. 5.2 Examples of the uncertainty shape descriptors (from left to right) for 0, 4, and 6 correct
FER predictions by the 7-model CNN ensemble
The loss function used for y is the usual one for regression tasks, the sum of squared errors: SSE_y = Σ_{t=1}^{N_mb} (y_t − e_t)², where e is the label (the actual number of members of the CNN ensemble with a correct prediction) and N_mb is the mini-batch size.
From the trustworthiness categorization and ensemble vote point of view, the high-
level transformation of the combined CNN ensemble together with the meta-learning
supervisor ANN can be represented as Eq. 5.3:
cat : I ⊂ I^l → C × B    (5.3)
where i is the index of the image at moment t of the state of the loss function memory.
The equations above describe the ensemble vote that chooses the category c_i which received the number of votes e_i closest to the predicted regression number y_i.
The BookClub artistic makeup data set contains images of E = |C| = 21 subjects.
Each subject’s data may contain a photo-session series of photos with no makeup,
various makeup, and images with other obstacles for facial recognition, such as wigs,
glasses, jewellery, face masks, or various headdresses. The data set features 37 photo
sessions without makeup or occlusions, 40 makeup sessions, and 17 sessions with
occlusions. Each photo session contains circa 168 JPEG images of the 1072 × 712
resolution of six basic emotional expressions (sadness, happiness, surprise, fear,
anger, and disgust), a neutral expression, and the closed eyes photoshoots taken
with seven head rotations at three exposure times on the off-white background.
The subjects' ages vary from their twenties to their sixties. The subjects are predominantly Caucasian, with some Asian subjects. Gender is approximately evenly split between sessions.
The photos were taken over two months, and several subjects posed in multiple sessions over several weeks in various clothing and with changed hairstyles. The data set is downloadable from https://data.mendeley.com/datasets/yfx9h649wz/3. All subjects gave written consent to the use of their anonymous images in public scientific research.
5.4 Experiments
The experiments were run on the Linux (Ubuntu 20.04.3 LTS) operating system with
two dual Tesla K80 GPUs (with 2×12 GB GDDR5 memory each) and one QuadroPro K6000 (with 12 GB GDDR5 memory as well), an X299 chipset motherboard, 256 GB
DDR4 RAM, and an i9-10900X CPU. The experiments were run using MATLAB 2022a with the Deep Learning Toolbox. For FR
and FER experiments, the Inception v.3 CNN model was used. Out of the other SOTA
models applied to FR and FER tasks on the BookClub data set (AlexNet, GoogLeNet,
ResNet50, and Inception-ResNet v.2), Inception v.3 demonstrated overall the best
result over such accuracy metrics as trusted accuracy, precision, and recall [14, 15].
Therefore, the Inception v.3 model, which contains 315 elementary layers, was used
as an underlying CNN. Its last two layers were resized to match the number of classes
in the BookClub data set (21) and re-trained using the “Adam” learning algorithm
with 0.001 initial learning coefficient, “piecewise” learning rate drop schedule with 5
iterations drop interval, and 0.9 drop coefficient, mini-batch size 128, and 10 epochs
parameters to ensure at least 95% learning accuracy. The Inception v.3 CNN models
were used as part of the ensemble with a number of models M = 7 trained in parallel.
Meta-learning supervisor ANN models were trained using the “Adam” learning
algorithm with 0.01 initial learning coefficient, mini-batch size 64, and 200 epochs.
For online learning experiments, naturally, batch size was set to 1, as each consecutive
prediction was used to update meta-learning model parameters. The memory buffer
length, which collects statistics about previous training iterations, was set to K =
8192.
The r eg meta-learning supervisor ANN transformation represented in Eq. 5.2 is
implemented with two hidden layers with n + 1 and 2n + 1 neurons in the first and
second hidden layer, and the ReLU activation function. All source code and detailed
results are publicly available on GitHub (https://github.com/Selitskiy/StatLoss).
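For orientation, the regression supervisor of Eq. 5.2 with the layer sizes described above can be written as the following sketch. It uses PyTorch purely for illustration; the actual models were implemented in MATLAB's Deep Learning Toolbox.

```python
import torch.nn as nn

def make_supervisor(n):
    """Regression supervisor ANN: USD of dimension n -> predicted number of correct ensemble members."""
    return nn.Sequential(
        nn.Linear(n, n + 1), nn.ReLU(),           # first hidden layer: n + 1 neurons
        nn.Linear(n + 1, 2 * n + 1), nn.ReLU(),   # second hidden layer: 2n + 1 neurons
        nn.Linear(2 * n + 1, 1),                  # scalar regression output y
    )
# trained against the number of correct CNN predictions with a sum-of-squared-errors loss
```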
Suppose only the classification verdict is used as a final result of the ANN model. In
that case, the accuracy of the target CNN model can be calculated only as the ratio
of the number of test images correctly identified by the CNN model to the number of all test images:
Accuracy = N_correct / N_all    (5.6)
Accuracy_t = (N_{correct: f=T} + N_{incorrect: f≠T}) / N_all    (5.7)
5.5 Results
Results of the FER experiments are presented in Table 5.1 (FR results are similar but with a smaller difference between un-trusted and trusted metrics). The first column holds accuracy metrics using the ensemble's maximum vote. The second column uses the ensemble vote closest to the meta-learning supervisor ANN prediction, with the trustworthiness threshold learned only on the test set (see Formulae 4 and 5). The next two columns
contain the results of the online learning experiments. The first of these columns
has data of the online learning on the randomized test data, and the last column
online learning on the images grouped by the photo session, i.e. groups of the same
person and same makeup or occlusion, but with different lighting, head position, and
emotion expression (also see Fig. 5.3). Figure 5.4 shows the relationship between the
average session trusted threshold and session-specific trusted recognition accuracy
for FR and FER cases of the grouped test sessions.
Fig. 5.3 Trusted threshold learned during the training phase (blue, dashed line), online learning
changes for grouped test images (green), and shuffled test images (red). FR—left and FER—right
Fig. 5.4 Trusted accuracy against trusted threshold for grouped test images. FR—left and FER—
right
Fig. 5.5 Examples of images for FER (anger expression) with low trusted threshold (bad acting)—
left and high trusted threshold (better acting)—right
performance on most accuracy metrics except recall. Obviously, improving the online
learning algorithms would be a part of future work. Still, it is fascinating that the dynamically adjusted trustworthiness threshold informs the model not only about
its uncertainty but also about the quality of the test session—for example, in Fig. 5.5,
it could be seen that a low-threshold session has a poorly performing subject who
struggles to play the anger emotion expression. In contrast, in the high-threshold
session, the facial expression is much more apparent.
References
1. Post|LinkedIn: https://www.linkedin.com/posts/yann-lecun_i-think-the-phrase-agi-should-be-
retired-activity-6889610518529613824-gl2F/?utm_source=linkedin_share&utm_medium=
member_desktop_web, (Online Accessed 11 Apr 2022)
2. Abu-Mostafa, Y.S.: Learning from hints in neural networks. J. Complex. 6(2), 192–198 (1990)
3. Andrychowicz, M., Denil, M., Colmenarejo, S.G., Hoffman, M.W., Pfau, D., Schaul, T.,
Shillingford, B., de Freitas, N.: Learning to learn by gradient descent by gradient descent. In:
Proceedings of the 30th International Conference on Neural Information Processing Systems,
pp. 3988–3996. NIPS’16, Curran Associates Inc., Red Hook, NY, USA (2016)
4. Bergstra, J., Bardenet, R., Bengio, Y., Kégl, B.: Algorithms for hyper-parameter optimization.
In: Shawe-Taylor, J., Zemel, R., Bartlett, P., Pereira, F., Weinberger, K.Q. (eds.) Advances in
Neural Information Processing Systems, vol. 24. Curran Associates, Inc. (2011), https://procee
dings.neurips.cc/paper/2011/file/86e8f7ab32cfd12577bc2619bc635690-Paper.pdf
5. Cacioppo, J.T., Berntson, G.G., Larsen, J.T., Poehlmann, K.M., Ito, T.A., et al.:
The psychophysiology of emotion. Handbook Emot. 2(01), 2000 (2000)
6. Chomsky, N.: Powers and Prospects: Reflections on Human Nature and the Social Order. South
End Press (1996)
7. Ekman, P., Friesen, W.V.: Constants across cultures in the face and emotion. J. Pers. Soc.
Psychol. 17(2), 124 (1971)
8. Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep
networks. In: Precup, D., Teh, Y.W. (eds.) Proceedings of the 34th International Conference
on Machine Learning. Proceedings of Machine Learning Research, vol. 70, pp. 1126–1135.
PMLR (06–11 Aug 2017), http://proceedings.mlr.press/v70/finn17a.html
9. Lake, B.M., Ullman, T.D., Tenenbaum, J.B., Gershman, S.J.: Building machines that learn and
think like people. Behav. Brain Sci. 40, e253 (2017).https://doi.org/10.1017/S0140525X160
01837
10. Liu, X., Wang, X., Matwin, S.: Interpretable deep convolutional neural networks via meta-
learning. In: 2018 International Joint Conference on Neural Networks (IJCNN), pp. 1–9 (2018).
https://doi.org/10.1109/IJCNN.2018.8489172
11. McCarthy, J., Minsky, M.L., Rochester, N., Shannon, C.E.: A proposal for the dartmouth
summer research project on artificial intelligence, August 31, 1955. AI Mag. 27(4), 12–12
(2006)
12. Ram, R., Müller, S., Pfreundt, F., Gauger, N., Keuper, J.: Scalable hyperparameter optimization
with lazy Gaussian processes. In: 2019 IEEE/ACM Workshop on Machine Learning in High
Performance Computing Environments (MLHPC), pp. 56–65 (2019)
13. Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., Lillicrap, T.: Meta-learning with
memory-augmented neural networks. In: Balcan, M.F., Weinberger, K.Q. (eds.) Proceedings
of The 33rd International Conference on Machine Learning. Proceedings of Machine Learning
Research, vol. 48, pp. 1842–1850. PMLR, New York, New York, USA (20–22 Jun 2016)
14. Selitskiy, S., Christou, N., Selitskaya, N.: Isolating Uncertainty of the Face Expression Recog-
nition with the Meta-Learning Supervisor Neural Network, pp. 104–112. Association for
Computing Machinery, New York, NY, USA (2021). https://doi.org/10.1145/3480433.3480447
15. Selitskiy, S., Christou, N., Selitskaya, N.: Using statistical and artificial neural networks meta-
learning approaches for uncertainty isolation in face recognition by the established convolu-
tional models. In: Nicosia, G., Ojha, V., La Malfa, E., La Malfa, G., Jansen, G., Pardalos,
P.M., Giuffrida, G., Umeton, R. (eds.) Machine Learning, Optimization, and Data Science,
pp. 338–352. Springer International Publishing, Cham (2022)
16. Thrun, S.: Is learning the n-th thing any easier than learning the first? Adv. Neural Inf. Process.
Syst. 8 (1995)
17. Thrun, S., Mitchell, T.M.: Lifelong robot learning. Robot. Auton. Syst. 15(1–2), 25–46 (1995)
18. Thrun, S.P.L.: Learning To Learn. Springer, Boston, MA (1998). https://doi.org/10.1007/978-
1-4615-5529-2
19. Turing, A.M.: I.—Computing machinery and intelligence. MindLIX(236), 433–460 (1950).
https://doi.org/10.1093/mind/LIX.236.433
Chapter 6
Topic-Aware Networks for Answer
Selection
6.1 Introduction
select a set of possible and relevant answers from various and numerous resources,
then use answer selection algorithms to sort the candidate answers, and finally send
the most likely answer to the user. Thus, improving the performance of answer
selection is crucial for dialogue systems. As shown in Table 6.1, given a question
and a set of candidate answers, the goal of answer selection is to find the correct and
the best answer to that question among those candidates.
Various frameworks and methods have been proposed for answer selection: for example, rule-based systems, which focus on feature engineering with human-summarized rules, and deep learning models, more popular recently, which use networks such as convolutional neural networks, recurrent neural networks, or attention networks to extract matching features automatically. However, traditional deep learning models are purely data- and feature-driven; they not only face overfitting problems but also lack real-world background information and information beyond the features in the local contexts. To solve the
mation and those information beyond the features in the local contexts. To solve the
aforementioned issues, some knowledge-based methods were proposed, which use
external knowledge as a compensation for traditional deep learning models. In this
chapter, we propose to use specially designed topic-aware networks to enhance the
performance of traditional deep learning models with topic embeddings as external
knowledge references.
Topic modeling is a traditional machine learning technique used to model the generation process of a set of documents. Each word in a document can be assigned a latent topic using topic modeling, which makes topic modeling a good tool for understanding the nature of the documents. For each text in a document, the latent topic tags of this text are a kind of external knowledge from a document-level point of view. Topic embeddings, which are numerical representations of latent topic tags, are proposed to make topic modeling convenient for helping deep learning models. As shown in Fig. 6.1, we use Skip-gram techniques to generate topic embeddings. Each word in a text is assigned two tokens: the word token w_i and the topic tag token z_i. These tokens are used as basic inputs in our proposed frameworks.
Our work is inspired by the following considerations. Firstly, we believe that the correct answer to a question should normally fall under the same topic as the question. For example, if the question asks about time-related information, it may begin with "When…" or "What time…," and the answer is likely to contain text related to time, such as "in the morning" or "at 8 am." Questions and their correct answers normally contain related words. By taking the latent topics of the answers and questions into consideration, we can restrict the selection of answers and thereby improve the generalization of the model. Secondly, topic models are based on document-level information that reveals the latent topics of the target documents, whereas traditional deep learning models normally focus on discovering local features that can be used to classify texts. Topic models help us understand how the texts are generated; their output, a set of topics that form the document together with lists of words describing each topic, acts somewhat like a knowledge base for a given dataset.
Motivated by the above considerations, we propose topic-aware networks for answer selection (TNAS), which integrate topic models into answer selection architectures by using topic embeddings as external knowledge for baseline deep learning models. As shown in Fig. 6.2, compared with traditional deep learning models for answer selection, TNAS has an additional topic embedding module during training. The topic-aware module generates topic embeddings for both questions and answers. This topic embedding layer helps determine the similarity between the question and the answer from a topic point of view. Eventually, we generate topic-aware vector representations, concatenate them with the baseline deep learning text representations for both questions and answers, and use their cosine distance as the scoring function for the probability that an answer is a correct candidate.
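The following is a minimal sketch of the scoring step just described: the text representation and the topic-aware representation of the question and of the answer are concatenated, and their cosine similarity serves as the matching score. The vector dimensions and the random placeholder encoder outputs are illustrative assumptions, not the actual encoder outputs.

```python
# Minimal sketch of the TNAS scoring step: concatenate the text representation
# and the topic-aware representation of the question and of the answer, then
# use their cosine similarity as the matching score.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def score(q_text_vec, q_topic_vec, a_text_vec, a_topic_vec):
    q = np.concatenate([q_text_vec, q_topic_vec])
    a = np.concatenate([a_text_vec, a_topic_vec])
    return cosine_similarity(q, a)

# Toy 300-d vectors standing in for the CNN and topic-module outputs
rng = np.random.default_rng(0)
print(score(rng.normal(size=300), rng.normal(size=300),
            rng.normal(size=300), rng.normal(size=300)))
```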
To evaluate our model, we conduct experiments on popular answer selection datasets in natural language processing. The results show that our model improves the performance of baseline deep learning models. The main contributions of our work are summarized in the following four parts:
• We propose an efficient way to generate topic embeddings for baseline deep learning models that can be easily integrated into their architectures.
• We propose to incorporate topic embeddings as external knowledge into baseline deep learning models for answer selection tasks by applying the LDA algorithm to both questions and answers.
• We propose two networks specially designed for answer selection tasks that incorporate topic information into baseline deep learning models to automatically match the topics of questions and answers.
• We propose to use external databases with similar contexts when training topic embeddings for our topic-aware networks, to further improve their performance.
Fig. 6.2 a Traditional deep learning framework for answer selection; b our model
6.2 Related Work
To better extract semantic knowledge from texts for downstream NLP tasks, various topic models have been introduced for generating topic embeddings. One influential and classic approach is latent semantic indexing (LSI) [1], which uses linear algebra, namely singular value decomposition (SVD), to map latent topics. Subsequently, various methods for generating topic embeddings have been proposed on top of LSI. Among them is latent Dirichlet allocation (LDA) [2], a Bayesian probabilistic model that generates document-topic and word-topic distributions using Dirichlet priors [3]. Compared with earlier topic embedding approaches such as LSI, LDA is more effective thanks to its ability to capture the hidden semantic structure of a given text through correlated words [4]. In LDA, Dirichlet priors are used to estimate the document-topic and topic-word densities, improving its efficacy in topic embedding generation. Thanks to its superior performance, LDA has become one of the most commonly used approaches for topic embedding generation. In this work, we adopt LDA to generate topic embeddings as an external knowledge base, bringing a significant improvement to answer selection results.
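As a rough illustration of this step, the following sketch assigns a latent LDA topic tag to each word of a text with Gensim's LdaModel. The toy corpus, the number of topics, and the tag format (z0, z1, …) are illustrative assumptions rather than the authors' exact configuration.

```python
# Minimal sketch: assign a latent LDA topic tag to each word of a text.
# The corpus, topic count, and preprocessing are illustrative assumptions.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [
    "when does the morning train leave".split(),
    "the train leaves at 8 am in the morning".split(),
    "what time is the meeting tomorrow".split(),
]

dictionary = Dictionary(docs)
bows = [dictionary.doc2bow(doc) for doc in docs]
lda = LdaModel(corpus=bows, id2word=dictionary, num_topics=2, passes=10, random_state=0)

def tag_words_with_topics(doc):
    """Return (word, topic_tag) pairs, picking the most probable topic per word."""
    bow = dictionary.doc2bow(doc)
    # per_word_topics=True also returns the topic ranking for every word id
    _, word_topics, _ = lda.get_document_topics(bow, per_word_topics=True)
    best = {wid: topics[0] if topics else 0 for wid, topics in word_topics}
    # out-of-vocabulary words fall back to topic 0 in this toy sketch
    return [(w, f"z{best.get(dictionary.token2id.get(w), 0)}") for w in doc]

print(tag_words_with_topics("when does the train leave".split()))
```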
Answer selection has received increasing research attention thanks to its applications in areas such as dialog systems. A typical answer selection model requires understanding both the question and the candidate answer texts [5]. Earlier answer selection models typically relied on human-summarized rules with linguistic tools, feature engineering, and external resources. Specifically, Wang and Manning [6] apply a tree-edit operation mechanism on dependency parse trees; Severyn and Moschitti [7] employ an SVM [8] with tree kernels to fuse feature engineering over parse trees for feature extraction, while lexical semantic features obtained from WordNet [9] have been used by Yih et al. [10] to further improve answer selection.
More recently, deep networks such as CNNs [11, 12] and RNNs [11, 13, 14] have brought significant performance gains in various NLP tasks [15]. Deep learning-based approaches have also become predominant in answer selection thanks to their better performance. Among them, Yu et al. [16] transformed answer selection into a binary classification problem [17] in which candidate sentences are ranked based on the cross-entropy loss of question-candidate pairs, using a Siamese-structured bag-of-words model. Subsequently, QALSTM [18] was proposed, which employs a bidirectional LSTM [19, 20] to construct sentence representations of questions and candidate answers independently, while a CNN is used in [21] as the backbone to generate sentence representations. Further, HyperQA [22] models the relationship between the question and candidate answers in hyperbolic space [23] instead of Euclidean space. More recently, with the success of the transformer [24] in a variety of NLP tasks [25, 26], it has also been introduced to answer selection. More specifically, TANDA [27] transfers a pre-trained model into a model specialized for answer selection by fine-tuning on a large, high-quality dataset, improving the stability of transformers for answer selection, while Matsubara et al. [28] improve the efficiency of transformers by reducing the number of sentence candidates through neural re-rankers.
Despite the impressive progress of deep learning-based approaches for answer selection, these methods neglect the importance of topics. In this work, we propose to incorporate topic embeddings as external knowledge into baseline deep learning models for answer selection and demonstrate its effectiveness.
6.3 Methodology
We present the detailed implementation of our model in this section. The overall architecture of our proposed model is shown in Fig. 6.2. Our model is a multi-channel deep learning model trained in two stages. First, we use word embedding generation techniques to build topic embeddings as the external knowledge base for the next stage. Second, we set up our topic-aware network for answer selection; we propose two main topic-aware network architectures based on traditional answer selection architectures. Finally, we use the triplet loss as the objective function in the final training stage of our model.
The topic embeddings are trained with a Skip-gram-style objective over word tokens and topic tokens:

$$L(D) = \frac{1}{M} \sum_{i=1}^{M} \sum_{-k \le c \le k,\ c \ne 0} \log \Pr\left(w_{i+c}, z_{i+c} \mid z_i\right) \qquad (6.1)$$
where Pr(·) is a softmax probability. The idea behind this objective is to use each topic token as a pseudo-word token that predicts the words and topics around it. In this way, we encode not only word information but also topic information into the topic embeddings.
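One way to realize the objective in Eq. (6.1) is to treat topic tags as pseudo-words and train an ordinary Skip-gram model over sequences in which word tokens and topic tokens are interleaved. The sketch below does this with Gensim's Word2Vec; the interleaving scheme and hyperparameters are assumptions for illustration, not necessarily the authors' implementation.

```python
# Minimal sketch: learn topic embeddings by treating topic tags as pseudo-words
# in a Skip-gram model, so each topic token predicts nearby words and topics.
from gensim.models import Word2Vec

# (word, topic_tag) pairs such as those produced by the LDA step above
tagged_sentences = [
    [("when", "z0"), ("does", "z0"), ("the", "z0"), ("train", "z1"), ("leave", "z1")],
    [("at", "z1"), ("8", "z1"), ("am", "z1"), ("in", "z0"), ("the", "z0"), ("morning", "z1")],
]

# Interleave word tokens and topic tokens into one sequence per sentence
sequences = [[tok for pair in sent for tok in pair] for sent in tagged_sentences]

model = Word2Vec(sentences=sequences, vector_size=300, window=5,
                 sg=1, min_count=1, epochs=50, seed=0)

topic_embedding_z1 = model.wv["z1"]   # 300-d topic embedding for topic tag z1
print(topic_embedding_z1.shape)
```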
Once the topic embeddings have been generated, we can use them as external knowledge for our deep learning architectures. We propose two main kinds of architecture, with four network designs in total, for topic-aware answer selection. The first kind uses shared encoder weights, as shown in Figs. 6.3 and 6.4: the encoders for questions and answers are trained together and their weights are shared. The second kind uses non-shared encoder weights, as shown in Figs. 6.5 and 6.6: the encoders are trained separately for questions and answers. The input text is first split into two sequences, one for the original text tokens and the other for the topic tokens.
TAN1: Non-shared encoders for both text and topic tokens. As shown in Fig. 6.5, question texts and answer texts, transformed into text tokens and topic tokens, are used as inputs to the word embedding layers and topic embedding layers. After the numerical representations of the input tokens are obtained, the outputs of the embedding layers are processed by non-shared encoders, so that each encoder is trained separately with entirely different weights.
Fig. 6.3 TAN3: Non-shared encoders for text and shared encoders for topic
Fig. 6.4 TAN4: Shared encoders for text and non-shared encoders for topic
TAN2: Shared encoders for both text and topic tokens. As shown in Fig. 6.6, in contrast to TAN1, both the text-channel and topic-channel encoders are shared in TAN2.
TAN3: Non-shared encoders for text and shared encoders for topic. As shown in Fig. 6.3, differing from TAN1 and TAN2, this architecture uses non-shared encoders for the text token embeddings and shared encoders for the topic token embeddings.
Fig. 6.5 TAN1: Non-shared encoders for both text and topic tokens
Fig. 6.6 TAN2: Shared encoders for both text and topic tokens
TAN4: Shared encoders for text and non-shared encoders for topic. As shown in Fig. 6.4, similar to TAN3, this is a mixed architecture that uses shared encoders for the text token embeddings and non-shared encoders for the topic token embeddings.
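The difference between the shared and non-shared variants can be sketched with the Keras functional API as follows. The filter count and dense size follow the settings reported in Sect. 6.4, but the sequence length, vocabulary size, pooling, and everything else are illustrative assumptions.

```python
# Minimal Keras sketch of the shared vs. non-shared encoder idea (TAN-style).
from tensorflow import keras
from tensorflow.keras import layers

MAX_LEN, VOCAB, EMB_DIM = 40, 20000, 300

def make_encoder(name):
    return keras.Sequential([
        layers.Conv1D(1200, kernel_size=3, activation="relu"),
        layers.GlobalMaxPooling1D(),
        layers.Dense(300, activation="tanh"),
    ], name=name)

q_in = layers.Input(shape=(MAX_LEN,), name="question_tokens")
a_in = layers.Input(shape=(MAX_LEN,), name="answer_tokens")
embed = layers.Embedding(VOCAB, EMB_DIM)          # word (or topic) embedding layer

# Shared variant (as in TAN2/TAN4): one encoder reused for both inputs
shared = make_encoder("shared_encoder")
q_vec, a_vec = shared(embed(q_in)), shared(embed(a_in))

# Non-shared variant (as in TAN1/TAN3): two encoders with independent weights
# q_vec = make_encoder("q_encoder")(embed(q_in))
# a_vec = make_encoder("a_encoder")(embed(a_in))

model = keras.Model([q_in, a_in], [q_vec, a_vec])
model.summary()
```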
For all the proposed networks, we adopt the same training and testing mechanism and use the triplet loss. During the training stage, for each question text Q, besides its ground-truth answer A+, we randomly pair a negative answer A− with it. The input data are therefore triplets (Q, A+, A−). Our goal is to minimize the following triplet loss for the answer selection task:
$$L(Q, A^+, A^-) = \max\left\{0,\ m + d(Q, A^+) - d(Q, A^-)\right\} \qquad (6.2)$$
where d(Q, A+) and d(Q, A−) are the Euclidean distances between the vector representation of the question text and those of the positive and negative answer texts, and m is a margin.
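A minimal sketch of Eq. (6.2) in TensorFlow/Keras terms is given below; the margin value and batch shapes are illustrative assumptions.

```python
import tensorflow as tf

def triplet_loss(q_vec, pos_vec, neg_vec, margin=0.2):
    """Eq. (6.2): max(0, m + d(Q, A+) - d(Q, A-)), averaged over the batch."""
    d_pos = tf.norm(q_vec - pos_vec, axis=-1)   # d(Q, A+)
    d_neg = tf.norm(q_vec - neg_vec, axis=-1)   # d(Q, A-)
    return tf.reduce_mean(tf.maximum(0.0, margin + d_pos - d_neg))

q, a_pos, a_neg = (tf.random.normal((8, 300)) for _ in range(3))
print(float(triplet_loss(q, a_pos, a_neg)))
```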
6.4 Experiments
In this section, we present the experiments and results of our proposed model. All network architectures are implemented using Keras. We evaluate our model on two widely used answer selection datasets.
6.4.1 Dataset
The statistics of the datasets used in this work are shown in Table 6.2. The task for both datasets is to rank the candidate answers based on their relatedness to the question. Brief descriptions of the datasets are as follows:
1. WikiQA: A benchmark for open-domain answer selection created from actual Bing and Wikipedia queries. We only use questions with at least one correct answer.
2. TrecQA: This is another answer selection dataset that comes from Text REtrieval
Conference (TREC) QA track data.
To evaluate the model, we implement baseline systems for comparison. The baseline models adopt a CNN as the encoder, and their architectures are the same as TAN1 and TAN4, respectively, but without the topic-aware module. The CNN used in TAN1–4 is the same as in the baselines. The other key settings of our models are as follows:
1. Embeddings: We use 300-dimensional GloVe vectors to initialize the word embedding layer. For the topic embeddings, we use the Gensim package to train the LDA model and its Word2Vec implementation to generate the topic embeddings. Topic embeddings are generated separately for questions and answers.
2. CNN encoders: We set the CNN to 1200 filters and all inner dense layers to 300 dimensions. We use Keras to set up the training and testing process, with the Adam optimizer.
3. TAN1 without topic: The first baseline is a traditional answer selection architecture that uses non-shared encoders for the question and answer tokens.
4. TAN4 without topic: The second baseline uses shared encoders for the question and answer tokens.
5. Evaluation metrics: Our task is to rank the candidate answers by their correctness with respect to the question; we therefore adopt the widely used information retrieval and answer selection measures, mean average precision (MAP) and mean reciprocal rank (MRR), to evaluate the performance of our models (a small sketch of these metrics follows below).
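The following minimal sketch shows how MAP and MRR can be computed for ranked candidate lists; the toy data and tie handling are illustrative assumptions.

```python
# Minimal sketch of the MAP and MRR metrics. Each query is a list of
# (score, is_correct) pairs; candidates are ranked by descending score.
def average_precision(ranked_labels):
    hits, precisions = 0, []
    for i, is_correct in enumerate(ranked_labels, start=1):
        if is_correct:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / max(hits, 1)

def reciprocal_rank(ranked_labels):
    for i, is_correct in enumerate(ranked_labels, start=1):
        if is_correct:
            return 1.0 / i
    return 0.0

def evaluate(queries):
    """queries: list of lists of (score, is_correct) pairs, one list per question."""
    ranked = [[lab for _, lab in sorted(q, key=lambda x: -x[0])] for q in queries]
    mean_ap = sum(average_precision(r) for r in ranked) / len(ranked)
    mrr = sum(reciprocal_rank(r) for r in ranked) / len(ranked)
    return mean_ap, mrr

print(evaluate([[(0.9, 0), (0.7, 1), (0.3, 0)], [(0.8, 1), (0.2, 0)]]))
```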
Table 6.3 shows the results of our models. From the results, we have the following findings.
Firstly, among the baseline models, TAN4 without topic outperforms TAN1 without topic on both WikiQA and TrecQA. This indicates that shared encoders may be more suitable for answer selection tasks, which is reasonable: with shared encoders, the model can compare the representations of the question and the answers in the same context, whereas with non-shared encoders the model has to learn twice as many parameters to make the comparison, which is harder with limited samples.
Secondly, compared with the baselines, all of our models improve performance to some extent; adding the topic-aware module does improve the baseline models. Among all the networks, TAN2, which adopts shared encoders for both text and topic tokens, outperforms all the other networks. TAN1 performs similarly to the best baseline model, and TAN3 and TAN4 perform similarly to each other.

Table 6.3 Model performance of topic-aware networks for the answer selection task

Model                  WikiQA            TrecQA
                       MAP     MRR       MAP     MRR
TAN1 without topic     0.65    0.66      0.71    0.75
TAN4 without topic     0.67    0.68      0.73    0.77
TAN1                   0.66    0.67      0.72    0.79
TAN2                   0.69    0.70      0.79    0.80
TAN3                   0.67    0.68      0.74    0.76
TAN4                   0.68    0.69      0.72    0.78

These findings show that, for both the baselines and our proposed models, shared encoders are more effective at pairing the right answer with the question, and that the topic-aware module improves the performance of the baseline models. TAN2 is the best of all the architectures we propose.
6.5 Conclusion
In this chapter, we studied incorporating external knowledge into traditional answer selection deep learning models using specially designed networks. The proposed networks are an automatic way to extract useful information from topic models and use it in any deep learning baseline model. We represented the external knowledge as topic embeddings. The results show that our model improves the performance of the baseline deep learning models, and we identified the best architecture among the designed networks.
For future work, we plan two improvements. First, during topic model training we fixed the number of topics; we will explore ways to decide the number of topics automatically. Second, given that there are many question-type classification datasets such as TREC, we will investigate using transfer learning to obtain pre-trained topic embeddings from publicly available datasets and fine-tune them on the training data.
References
1. Lai, T., Bui, T., Li, S.: A review on deep learning techniques applied to answer selection. In:
Proceedings of the 27th International Conference on Computational Linguistics, pp. 2132–2144
(2018)
2. Hofmann, T.: Probabilistic latent semantic indexing. In: Proceedings of the 22nd Annual Inter-
national ACM SIGIR Conference on Research and Development in Information Retrieval,
pp. 50–57 (1999)
3. Kumari, R., Srivastava, S.K.: Machine learning: a review on binary classification. Int. J. Comput.
Appl. 160(7) (2017)
4. Yih, S.W., Chang, M.-W., Meek, C., Pastusiak, A.: Question answering using enhanced
lexical semantic models. In: Proceedings of the 51st Annual Meeting of the Association for
Computational Linguistics, (2013)
5. Miller, G.A.: WordNet: a lexical database for English. Commun. ACM 38(11), 39–41 (1995)
6. Fenchel, W.: Elementary geometry in hyperbolic space. In: Elementary Geometry in Hyperbolic
Space. de Gruyter (2011)
7. Tan, M., dos Santos, C., Xiang, B., Zhou, B.: Lstm-based deep learning models for non-factoid
answer selection. arXiv preprint arXiv:1511.04108 (2015)
8. Yu, L., Hermann, K.M., Blunsom, P., Pulman, S.: Deep learning for answer sentence selection.
arXiv preprint arXiv:1412.1632 (2014)
9. Yogatama, D., Dyer, C., Ling, W., Blunsom, P.: Generative and discriminative text classification
with recurrent neural networks. arXiv preprint arXiv:1703.01898 (2017)
10. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł.,
Polosukhin, I.: Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017)
11. Noble, W.S.: What is a support vector machine? Nature Biotechnol. 24(12), 1565–1567 (2006)
12. Matsubara, Y., Vu, T., Moschitti, A.: Reranking for efficient transformer-based answer selec-
tion. In: Proceedings of the 43rd International ACM SIGIR Conference on Research and
Development in Information Retrieval, pp. 1577–1580 (2020)
13. Zhang, X., Zhao, J., LeCun, Y.: Character-level convolutional networks for text classification.
Adv. Neural Inf. Process. Syst. 28 (2015)
14. Naseem, U., Razzak, I., Musial, K., Imran, M.: Transformer based deep intelligent contextual
embedding for twitter sentiment analysis. Futur. Gener. Comput. Syst. 113, 58–69 (2020)
15. Keskar, N.S., McCann, B., Varshney, L.R., Xiong, C., Socher, R.: Ctrl: a conditional transformer
language model for controllable generation. arXiv preprint arXiv:1909.05858 (2019)
16. Garg, S., Vu, T., Moschitti, A.: Tanda: transfer and adapt pre-trained transformer models for
answer sentence selection. In: Proceedings of the AAAI Conference on Artificial Intelligence,
vol. 34, pp. 7780–7788 (2020)
17. Severyn, A., Moschitti, A.: Automatic feature engineering for answer selection and extraction.
In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing,
pp. 458–467 (2013)
18. Melamud, O., Goldberger, J., Dagan, I.: Context2vec: learning generic context embedding with
bidirectional lstm. In: Proceedings of the 20th SIGNLL Conference on Computational Natural
Language Learning, pp. 51–61 (2016)
19. Likhitha, S., Harish, B.S., Keerthi Kumar, H.M.: A detailed survey on topic modeling for
document and short text data. Int. J. Comput. Appl. 178(39), 1–9 (2019)
20. Liu, P., Qiu, X., Huang, X.: Recurrent neural network for text classification with multi-task
learning. arXiv preprint arXiv:1605.05101 (2016)
21. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780
(1997)
22. Severyn, A., Moschitti, A.: Learning to rank short text pairs with convolutional deep neural
networks. In: Proceedings of the 38th International ACM SIGIR Conference on Research and
Development in Information Retrieval, pp. 373–382 (2015)
23. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3(1), 993–1022 (2003)
24. Tay, Y., Tuan, L.A., Hui, S.C.: Hyperbolic representation learning for fast and efficient neural
question answering. In: Proceedings of the Eleventh ACM International Conference on Web
Search and Data Mining, pp. 583–591 (2018)
25. Wang, M., Manning, C.D.: Probabilistic tree-edit models with structured latent variables for
textual entailment and question answering. In: Proceedings of the 23rd International Conference
on Computational Linguistics (Coling 2010), pp. 1164–1172 (2010)
26. Yin, W., Kann, K., Yu, M., Schütze, H.: Comparative study of CNN and RNN for natural
language processing. arXiv preprint arXiv:1702.01923 (2017)
27. Sethuraman, J.: A constructive definition of Dirichlet priors. Statistica Sinica, pp. 639–650 (1994)
28. Lai, S., Xu, L., Liu, K., Zhao, J.: Recurrent convolutional neural networks for text classification.
In: Twenty-ninth AAAI Conference on Artificial Intelligence (2015)
Chapter 7
Design and Implementation
of Multi-scene Immersive Ancient Style
Interaction System Based on Unreal
Engine Platform
Abstract This project, based on the narrative thread of flying and searching for Kongming lanterns, combines a novel, vivid interactive system with a traditional roaming system to create a multi-scene, immersive virtual world. The project uses the UE4 engine and 3ds Max modeling software to build a museum scene and a virtual ancient scene. After modeling is completed in 3ds Max, the models are imported into UE4, where colliders and plants are added, and the scene is continually optimized with reference to the street layout of the Tang Dynasty. The scene is then tested so that the interaction and the scene match and merge better. Sequencer is used to record cutscenes, Blueprints connect animations with character operation, and particle systems are interspersed to realize scene roaming and interaction with the Kongming lanterns. The project combines various technologies in Unreal Engine, breaks the monotonous experience mode of the traditional roaming system, and reproduces the magnificent ancient city, hurrying us off to the appointment of a thousand lights.
7.1 Introduction
With the rapid development of science and technology, virtual reality technology has penetrated all aspects of people's lives as it has gradually moved from theory to industrialization. It is loved by users because of its immersion, interaction, and imagination. Virtual reality, as its name implies, is the combination of the virtual and the real. Theoretically, virtual reality (VR) technology is a kind of computer simulation system that can create, and let people experience, virtual worlds. It uses a computer to generate a simulated environment and immerses the user in that environment. With the help of 3D modeling, realistic real-time rendering, collision detection, and other key technologies of virtual reality, the expressiveness and atmosphere of the picture during the user's experience can be improved.
In the virtual reality world, the most important features are the senses of "realism" and "interaction." Participants feel as if they are in the virtual world; the environment and figures appear just as they would in a real environment, in which various objects and phenomena interact. Objects and characters in the environment develop and change according to natural laws, and people in the environment experience vision, hearing, touch, motion, taste, and smell. Virtual reality technology can create all kinds of fabulous artificial environments, vivid and immersive, with which users interact to the point that the artificial can be mistaken for the real. Based on the characteristics and theoretical foundation of virtual reality technology described above, this project designs a multi-scene, strongly immersive, ancient-style interactive experience.
Through multi-scene interaction, this project presents a beautiful story of travel between ancient and modern times. A visitor in the museum, immersed in the artistic atmosphere and looking at the ancient paintings, imagines the charm of the Tang Dynasty. In a trance, he seems to be in the Tang Dynasty, and suddenly he is drawn in by a crack in space–time. When he opens his eyes he has become an armored general of the Tang Dynasty. The story is set up to show that ancient cultures can still move people: people's living conditions and ideas change constantly, but the cultural and architectural arts that move people are permanent.
Physically based rendering technology has been widely used in the film and game industries since the Disney Principled BRDF was introduced by Disney at SIGGRAPH 2012, owing to its ease of use and convenient workflow. Physically based rendering (PBR) refers to rendering with a shading/lighting model based on physical principles and microfacet theory, together with surface parameters measured from reality, in order to represent real-world materials accurately. Unreal Engine is widely used because of its excellent rendering technology. Next, we describe the theoretical and technical basis on which Unreal Engine renders scenes.
Fig. 7.1 When light interacts with a non-optical flat surface, the non-optical flat surface behaves
like a large collection of tiny optical flat surfaces
The microscopic geometry is assumed to be large compared with visible light wavelengths (so geometrical optics applies and wave effects such as diffraction can be ignored). Until around 2013, microfacet theory was only used to derive expressions for single-bounce surface reflection; in recent years, with the development of the field, there has been some work on multiple-bounce surface reflection using microfacet theory.
Each surface point can be considered optically flat because the microscopic geometric scale is assumed to be significantly larger than the visible light wavelength. As mentioned above, an optically flat surface divides light into two directions: reflection and refraction.
Each surface point reflects light from a given incoming direction into a single outgoing direction, which depends on the direction of the microgeometry normal m. When evaluating the BRDF, the light direction l and view direction v are specified. This means that, of all surface points, only those microfacets oriented exactly right to reflect l to v can contribute to the BRDF value (contributions from other orientations cancel out after integration).
In Fig. 7.1, we can see that the surface normal m of these "correctly oriented" surface points lies exactly halfway between l and v. The vector halfway between l and v is called the half-vector or half-angle vector; we denote it h.
Only surface points with m = h reflect light from l toward the view direction v; other surface points do not contribute to the BRDF.
Not all surface points with m = h contribute actively to reflection; some are blocked from the l direction (shadowing), from the v direction (masking), or from both by other surface regions. Microfacet theory assumes that all shadowed light is lost to the specular reflection. In reality, some of it eventually becomes visible due to multiple surface reflections, but this is generally not considered in current microfacet theory.
In Fig. 7.2, we see that some surface points are blocked from the direction of l, so they are shadowed and do not receive light (and therefore cannot reflect anything). In the middle, we see that some surface points are not visible from the view direction v, so of course we will not see any light reflected from them. In both cases, these surface points do not contribute to the BRDF. In fact, although shadowed areas do not receive any direct light from l, they do receive (and therefore reflect) light from other surface areas (as shown in the right image); microfacet theory ignores these interactions.
Using these assumptions (a locally optically flat surface without mutual reflection), it is easy to derive a general form of the specular BRDF known as the microfacet Cook-Torrance BRDF. This specular BRDF takes the following form:

$$f(l, v) = \frac{D(h)\, F(v, h)\, G(l, v, h)}{4\,(n \cdot l)(n \cdot v)} \qquad (7.1)$$
Among them:
• D(h): the normal distribution function, which describes the probability distribution of the microfacet normals, i.e., the concentration of normals oriented correctly to reflect light from l to v, relative to the surface area.
• F(v, h): the Fresnel equation, which describes the proportion of light reflected by the surface at different angles.
• G(l, v, h): the geometry function, which describes the self-shadowing properties of the microfacets, i.e., the percentage of unoccluded surface points with m = h.
• The denominator 4(n · l)(n · v): a correction factor that accounts for the transformation between the local space of the microscopic geometry and the local space of the overall macro surface.
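The following minimal numerical sketch puts Eq. (7.1) and the terms above together, using common choices for the individual terms (GGX for D, Schlick for F, Smith-Schlick for G). These specific term choices and all numerical values are assumptions for illustration; the chapter does not fix them explicitly.

```python
# Minimal numerical sketch of the Cook-Torrance specular BRDF, Eq. (7.1),
# with assumed GGX / Schlick / Smith-Schlick terms.
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def d_ggx(n, h, roughness):
    a2 = roughness ** 4                       # alpha = roughness^2, squared
    ndoth = max(np.dot(n, h), 0.0)
    denom = ndoth * ndoth * (a2 - 1.0) + 1.0
    return a2 / (np.pi * denom * denom)

def f_schlick(v, h, f0):
    return f0 + (1.0 - f0) * (1.0 - max(np.dot(v, h), 0.0)) ** 5

def g_smith(n, v, l, roughness):
    k = (roughness + 1.0) ** 2 / 8.0
    def g1(x):
        ndotx = max(np.dot(n, x), 1e-4)
        return ndotx / (ndotx * (1.0 - k) + k)
    return g1(v) * g1(l)

def cook_torrance_specular(n, v, l, roughness=0.3, f0=0.04):
    h = normalize(v + l)                      # half-vector between l and v
    d, f, g = d_ggx(n, h, roughness), f_schlick(v, h, f0), g_smith(n, v, l, roughness)
    return d * f * g / (4.0 * max(np.dot(n, l), 1e-4) * max(np.dot(n, v), 1e-4))

n = np.array([0.0, 0.0, 1.0])
v = normalize(np.array([0.0, 0.3, 1.0]))
l = normalize(np.array([0.2, -0.1, 1.0]))
print(cook_torrance_specular(n, v, l))
```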
Lighting is the most critical part of a scene and generally uses physically based ambient lighting. A common technique for ambient lighting is image-based lighting (IBL). For example, the diffuse ambient lighting part generally uses the irradiance environment mapping technique of traditional IBL, and physically based specular ambient lighting also commonly uses IBL in industry. To use physically based BRDF models with image-based lighting, the radiance integral must be solved, and importance sampling is usually used to solve it.
Importance sampling is a strategy that, based on known distribution functions, concentrates samples in the regions where the integrand has high probability (the important areas), so that an accurate estimate can be computed efficiently. The following two terms are briefly summarized.
Split Sum Approximation. Based on the importance sampling method, substituting the Monte Carlo estimator into the rendering equation gives:

$$\int_H L_i(l)\, f(l, v) \cos\theta_l \, dl \approx \frac{1}{N} \sum_{k=1}^{N} \frac{L_i(l_k)\, f(l_k, v) \cos\theta_{l_k}}{p(l_k, v)} \qquad (7.2)$$

Solving this directly is complex, and it is not realistic for real-time rendering.
At present, the mainstream practice in the game industry is to split the sum $\frac{1}{N}\sum_{k=1}^{N} \frac{L_i(l_k)\, f(l_k, v) \cos\theta_{l_k}}{p(l_k, v)}$ in the above formula into two terms: the average brightness $\frac{1}{N}\sum_{k=1}^{N} L_i(l_k)$ and the environment BRDF $\frac{1}{N}\sum_{k=1}^{N} \frac{f(l_k, v) \cos\theta_{l_k}}{p(l_k, v)}$. Namely:

$$\frac{1}{N}\sum_{k=1}^{N} \frac{L_i(l_k)\, f(l_k, v) \cos\theta_{l_k}}{p(l_k, v)} \approx \left(\frac{1}{N}\sum_{k=1}^{N} L_i(l_k)\right) \left(\frac{1}{N}\sum_{k=1}^{N} \frac{f(l_k, v) \cos\theta_{l_k}}{p(l_k, v)}\right) \qquad (7.3)$$
After splitting, the two terms are precomputed offline so that the combined result matches offline rendering reference values. At run time, we simply look up the two precomputed terms and combine them to obtain the real-time physically based IBL environment lighting result.
The First Term: Pre-Filtered Environment Map. The first term is $\frac{1}{N}\sum_{k=1}^{N} L_i(l_k)$, which can be understood as the mean incoming brightness. Under the assumption n = v = r, it depends only on the surface roughness and the reflection vector. For this term, industry practice is fairly uniform (including UE4 and COD: Black Ops 2): the environment texture is pre-filtered, and the blurred environment highlights are stored in progressively blurrier mipmap levels:

$$\frac{1}{N}\sum_{k=1}^{N} L_i(l_k) \approx \mathrm{Cubemap.sample}(r, \mathrm{mip}) \qquad (7.4)$$

That is, the first term is obtained directly by sampling the cubemap at the appropriate MIP level.
The Second Term: Environment BRDF. The second term, $\frac{1}{N}\sum_{k=1}^{N} \frac{f(l_k, v)\cos\theta_{l_k}}{p(l_k, v)}$, is the hemispherical-directional reflectance of the specular reflector and can be interpreted as the environment BRDF. It depends on the elevation angle θ, the roughness α, and the Fresnel term F. The Schlick approximation is often used for F; it is parameterized by a single value F_0, making the specular reflectance a function of three parameters: the elevation θ (NdotV), the roughness α, and F_0.
UE4 proposed in [Real Shading in Unreal Engine 4, 2013] that, in the second summation term, F_0 can be factored out of the integral after applying the Schlick approximation:

$$\int_H f(l, v) \cos\theta_l \, dl = F_0 \int_H \frac{f(l, v)}{F(v, h)} \left(1 - (1 - v \cdot h)^5\right) \cos\theta_l \, dl + \int_H \frac{f(l, v)}{F(v, h)} (1 - v \cdot h)^5 \cos\theta_l \, dl \qquad (7.5)$$

This leaves two inputs (roughness and cos θ_v) and two outputs (a scale and a bias to F_0), all conveniently in the range [0, 1]. We precompute the result of this function and store it in a 2D look-up texture (LUT).
Figure 7.3 shows the inherent mapping between roughness, cos θ, and the specular reflection intensity of the environment BRDF, which can be precomputed offline.
Specifically, UE4 factors F_0 out of the Fresnel formula so that the integral becomes F_0 · scale + bias, stores scale and bias in a 2D LUT, and looks them up at run time by roughness and NdotV.
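A minimal sketch of the run-time combination described above is given below: the prefiltered environment color is multiplied by F_0 · scale + bias, with (scale, bias) fetched from the precomputed environment-BRDF LUT. The LUT contents, its resolution, and the lookup scheme here are placeholder assumptions.

```python
# Minimal sketch of the run-time split-sum combination:
# specular IBL ≈ prefiltered_color * (F0 * scale + bias),
# where (scale, bias) come from a 2D LUT indexed by roughness and NdotV.
import numpy as np

env_brdf_lut = np.random.default_rng(0).uniform(0.0, 1.0, size=(16, 16, 2))  # placeholder LUT

def sample_env_brdf(roughness, ndotv):
    """Nearest-neighbour lookup into the precomputed (scale, bias) LUT."""
    r = min(int(roughness * 15), 15)
    c = min(int(ndotv * 15), 15)
    return env_brdf_lut[r, c]          # (scale, bias)

def specular_ibl(prefiltered_color, f0, roughness, ndotv):
    scale, bias = sample_env_brdf(roughness, ndotv)
    return prefiltered_color * (f0 * scale + bias)

print(specular_ibl(prefiltered_color=np.array([0.8, 0.7, 0.6]),
                   f0=0.04, roughness=0.3, ndotv=0.7))
```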
As one of the most open and advanced real-time 3D creation tools in the world, Unreal Engine has been widely used in games, architecture, broadcasting, film and television, automotive and transportation, simulation training, virtual production, human–machine interfaces, and more. Over the past decade, Unity (U3D) has been very popular, with over 2 million games developed on it, but in recent years UE4 has caught up with and even surpassed it. Other virtual reality development platforms include VRP, CryEngine, ApertusVR, and Amazon Sumerian. Compared with these, the excellent picture quality, good lighting and physics effects, and simple, clear visual programming of UE4 make it the preferred development platform for this project. The many instructional videos and documents posted on the UE4 website are also extremely friendly for beginners.
7.3.2 Storyboard
The storyboard for this project is divided into six parts. The first two parts show the player suddenly traveling through time after visiting the exhibition. The third, fourth, and fifth parts show the player roaming around the city after arriving in the Tang Dynasty; when the player lights a lantern, the lights of the whole city take flight. The sixth part shows the player chasing the Kongming lanterns and finally running up the mountain at nightfall to see the beautiful scene of a thousand lights rising in the valley (Fig. 7.4).
First, relevant information was collected to clarify the design scheme. For the pavilion scene, the ground, architecture, and indoor scenes were constructed in Unreal Engine, a TV screen was added to play videos, and the relevant materials were optimized. The ancient prosperous scene and the mountain night scene were modeled in 3ds Max and, after adjustment, imported into Unreal Engine, where the architectural layout was adjusted and particle effects were added. Lighting effects and collision bodies were then added, interactive functions were designed, and finally the project was tested and output (Fig. 7.5).
The Tang Dynasty scene in this project depicts the impression of the Tang Dynasty that players form from accumulated historical knowledge and from viewing ancient paintings. The city has prosperous street views and lively markets; the magnificent architecture is its most dazzling scenery, reflecting the rich life of its people. Vendors selling a full range of goods line the street, and the sacred palace is all the more mysterious and majestic. The architecture of this scene strives to restore a fantastic and magnificent scene of prosperity. To bring the scene closer to a real environment, the grass and trees sway dynamically in the wind, and the rivers are rendered realistically. In addition, dynamic flying butterflies, flocks of white pigeons flying out of the palace, and lovely lambs have been added to the scene.
7.4.2 Collider
The scene contains a variety of buildings and other elements, and the distances between elements are relatively small. During scene construction, to ensure that there are no clipping problems (characters or objects penetrating models), the colliders of the scene must be set one by one. For example, when an FBX-format building made in 3ds Max is imported and placed into UE4, the character may be able to walk through the model. To avoid this, when importing the model we can double-click the static mesh and choose to add simplified collision. When the urban ground built in 3ds Max is imported, the characters may also clip through it and be unable to stand on it; we therefore use a Landscape as the main ground of the scene so that the characters can stand on it.
The whole scene simulates the lighting of dusk, and the gorgeous orange glow of the setting sun adds beauty to the magnificent ancient city. To achieve the desired effect, we set up a dark pink sky box.
"Exponential Height Fog" is used to simulate the fog effect; "Max Opacity" and "Start Distance" are adjusted to make the fog lighter, and "Volumetric Fog" is enabled. The comparison is shown in Fig. 7.6.
Sunlight is simulated with a "Directional Light," adjusted to an appropriate color and intensity. A "Skylight" realizes the reflected-light contribution of the sun, and finally a "Lightmass Importance Volume" and a "Post Process Volume" are added to further optimize the lighting.
To enhance the visual impact when crossing between scenes, the project includes particle effects such as those shown while changing scenes and while crossing into the ancient city. The Niagara effects system was enabled for this project. Niagara builds particle effects from a modular visual-effects stack that is easy to reuse, inherit, and combine with a timeline. At the same time, Niagara provides data interfaces for a variety of data in Unreal Engine, for example using the data interface to obtain skeletal mesh data.
A Kongming lantern Blueprint class is used as the interactive object, and the lantern is lit during flight via a Set Intensity node. A box collision defines the interaction detection range; when the player enters the range, a HUD prompt to release the lantern is triggered. When the lantern is released, its world location is read every tick and its Z value is automatically increased to achieve the effect of the lantern flying away. After the character releases the lantern, a cutscene animation plays and the function that makes a large number of Kongming lanterns take off is triggered (Figs. 7.7 and 7.8).
In this function, in order to achieve a better visual effect, the Kongming lantern is made larger and is destroyed after 30 s.
Figure 7.9 shows the framework of the project (Figs. 7.9, 7.10, 7.11 and 7.12).
Fig. 7.12 Screenshots of the lanterns lifting off and the mountain night scene
Depending on the platform and game, 30, 60, or even more frames per second may be the target (Fig. 7.13).
For large projects in particular, it is very important to know how rendering resources are being used. As can be seen from the figure, the water body, a non-essential part of the roaming experience, occupies too many rendering resources and too much memory, and the polygon counts of the palaces and trees are high. The polygon counts of the palace windows, branches, and trunks should be optimized, while the buildings along the street have low polygon counts and insufficient texture realism. During construction of the main scene, the river effect did not meet expectations; later, the authors will try to recreate the lighting and ripple effects of the river so as to achieve better results with a more resource-efficient scheme.
The player explores the palace, lights the palace lantern, and triggers the thousand lights to fly, and then runs after the lights. The run to the thousand-lantern appointment is omitted in the project: the player travels directly to the mountain night scene, which signifies that the player has chased the Kongming lanterns all the way and that time has passed from evening to night on the mountain. The multi-scene interactions exemplify Unreal Engine's power and attention to detail in realizing dreamlike scenes. To achieve better 3D effects, the Niagara particle effects plug-in was enabled to produce many realistic particle effects. Using the timeline in Niagara, we visually control the rhythm of the particle effects and set keyframes for them. This practice is modeled on the film and television editing industry and makes the virtual world more detailed and realistic. When simulating the dusk lighting in the Tang city scene, the project combines a directional light source and a sky light source that closely approximate sunlight to simulate the real sunlight and its reflections. Coupled with the Lightmass Importance Volume, which controls the precision of the precomputation, more indirect lighting cache points are generated inside the space, and well-controlled precomputed lighting can make the scene beautiful while reducing the cost of dynamic lighting.
7.7 Conclusion
Relying on Unreal Engine, the project achieves, in an interactive experience, the matching of models with environmental light and shadow, as well as the harmonious unity of architecture, plants, terrain, fog, lighting, and particle effects, through the construction, integration, and optimization of the ancient city scene. Through various interactive functions it also greatly enhances the visual appeal of the roaming process and creates a real-time, high-quality 3D fantasy world. As a more dynamic and intuitive form of expression, virtual reality has unparalleled advantages. Combining Unreal Engine visual programming (Blueprints), Sequencer, Niagara particle effects, and other technologies allows the project not only to achieve better visual effects but also to provide a more immersive experience.
Acknowledgements This work was supported by grants from the "Undergraduate Teaching Reform and Innovation" Project of Beijing Higher Education (GM 2109022005); the Beijing College Students' Innovation and Entrepreneurship Training Program (Item Nos. 22150222040, 22150222044); the Key Project of Ideological and Political Course Teaching Reform of Beijing Institute of Graphic Communication (Item No. 22150222063); and the Scientific Research Plan of Beijing Municipal Education Commission (Item No. 20190222014).
Chapter 8
Auxiliary Figure Presentation Associated
with Sweating on a Viewer’s Hand
in Order to Reduce VR Sickness
8.1 Introduction
One of the factors stalling the spread of virtual reality (VR) content is VR sickness, which refers to the deterioration of physical condition caused by viewing VR content. The symptoms of VR sickness are similar to those of general motion sickness, such as vomiting, cold or numb limbs, sweating, and headache [1, 2]. When symptoms appear, the user cannot enjoy VR content and may stop viewing, or may avoid viewing in the first place.
M. Omata (B)
Graduate Faculty of Interdisciplinary Research Faculty of Engineering, University of Yamanashi,
Kofu, Japan
e-mail: [email protected]
M. Suzuki
Department of Computer Science and Engineering, University of Yamanashi, Kofu, Japan
e-mail: [email protected]
This may hinder the future development of the VR market and field, and it is essential to elucidate the causes of VR sickness and take preventive measures.
VR sickness is sometimes called visually induced motion sickness (VIMS) and is considered a form of motion sickness. Motion sickness is a deterioration of physical condition caused by staying in a moving environment, such as a car or a ship, for a long time. Although the cause of motion sickness is not completely explained, sensory conflict theory is the most widely accepted account. The theory states that motion sickness occurs during the process of adapting to a situation in which the pattern of vestibular, visual, and somatosensory information learned from experience is incompatible with the pattern of sensory information in the actual motion environment [3]. The same conflict is thought to occur in VR sickness: the visual system perceives motion, while the vestibular system does not.
Broadly speaking, two kinds of method for reducing VR sickness have been studied: one provides the user with actual motion sensation from outside the virtual world, and the other applies some effect to the user's field-of-view in the virtual environment. Examples of the former include applying wind to the user while viewing VR images [4] and providing a pseudo-motion sensation by electrically stimulating the vestibular system [5, 6]. However, these methods have disadvantages such as the need for large-scale equipment and high cost. Examples of the latter include displaying gazing points on VR images [7] and switching from a first-person view to a third-person view in situations where the user is prone to sickness. These methods have the advantage that they can be implemented entirely within an HMD and are less costly than the former, because they only require processing of the images; they are therefore more realistic for the widespread deployment of VR content. However, there are concerns that superimposed images may not match the world view of a VR environment, or that they may distract the user and make it difficult to concentrate on a VR game, thereby diminishing the sense of immersion.
We therefore propose a system of the latter kind which, instead of constantly displaying superimposed figures, continuously detects signs of VR sickness from physiological signals and controls the display of the superimposed figures in real time according to the detection results. We aim to reduce VR sickness without lowering the sense of immersion.
using amplified head rotations instead of controller-based input, and whether the induced VR sickness results from the user's head acceleration or velocity, by introducing two modes of vignetting, one triggered by acceleration and the other by velocity [9]. The results generally indicate that the vignetting methods did not reduce VR sickness for most of the participants and instead led to a significant increase. Duh et al. suggested that an independent visual background (IVB) might reduce disturbance when visual and inertial cues conflict [10]. They examined three levels of independent visual background with two levels of roll oscillation frequency; there were statistically significant effects of IVB and a significant interaction between IVB and frequency. Sargunam et al. compared three common joystick rotation techniques: traditional continuous rotation, continuous rotation with reduced field-of-view, and discrete rotation with fixed turning intervals [11]. Their goal was to investigate whether different joystick rotation techniques involve tradeoffs in terms of sickness and preference in a 3D environment. The results showed no evidence of differences in orientation, but sickness ratings found discrete rotation to be significantly better than field-of-view reduction. Fernandes et al. explored the effect of dynamically, yet subtly, changing a physically stationary person's field-of-view in response to visually perceived motion in a virtual environment [12]. They were able to reduce the degree of VR sickness perceived by participants without decreasing their subjective level of presence, while minimizing their awareness of the intervention. Budhiraja et al. proposed rotation blurring, uniformly blurring the screen during rotational movements, to reduce cybersickness caused by character movements in a first-person shooter game in a virtual environment [13]. The results showed that the blurring technique led to an overall reduction in participants' sickness levels and delayed its onset.
On the other hand, as methods that add a figure to the user's field-of-view, Whittinghill et al. placed a three-dimensional model of a virtual human nose at the center of the display's field-of-view, observing that placing a fixed visual reference object within the user's field-of-view seems to somewhat reduce simulator sickness [14]. On average, users in the nose group were able to operate the VR applications longer and with fewer stop requests than users in the no-nose control group; however, in the roller coaster game with intense movements, the average play time was only about 2 s longer. Cao et al. designed a see-through metal net surrounding users above and below as a rest frame to reduce motion sickness in an HMD [15]. They showed that subjects feel more comfortable and tolerate the experience better when the net is included than when there is no rest frame. Buhler et al. proposed and evaluated two novel visual effects that can reduce VR sickness with head-mounted displays [16]. The circle effect shows the point of view of a different camera in the peripheral vision, with the border between the outer peripheral vision and the inner vision visible as a circle. The dot effect adds artificial motion in peripheral vision that counteracts the virtual motion. The results showed lower mean sickness for the two effects; however, the differences were not statistically significant across all users.
In many studies, the entire view is changed or figures are conspicuously superimposed. Some superimposed figures imitate the user's nose and are not so obvious, but they are not effective in some situations or can only be used with a first-person view. Therefore, Omata et al. designed a more discreet static figure in virtual space and a scene-independent figure connecting the virtual world and the real world [7]. Their results show that VR sickness tended to be reduced by superimposing Dots on the four corners of the field-of-view. At the same time, however, they also showed that superimposing auxiliary figures reduced the sense of immersion.
Based on the results of Omata et al.'s study, we investigated a method to reduce VR sickness without unnecessarily lowering the sense of immersion, by displaying the Dots only when a symptom of sickness appears. In addition, since hand sweating was used as a physiological index of the degree of VR sickness in the study by Omata et al., and nasal surface temperature was used in the study by Ishihara et al. [17], we propose using these physiological signals as indexes to quantify the degree of VR sickness. We further identify the most appropriate type of physiological signal for this purpose and propose using it as a parameter to emphasize the presentation of an auxiliary figure when a user feels VR sickness.
In this experiment, the nasal surface temperature, hand blood volume pulse (BVP), and hand sweating of participants watching a VR scene were measured and analyzed in order to find the type of physiological signal most strongly correlated with VR sickness. A magnitude estimation method was used to measure the degree of psychological VR sickness of the participants [18]. The participants were instructed to take the degree of discomfort in their normal condition, wearing the HMD with no images presented, as 100, and to verbally report their degree of discomfort relative to that baseline at 20 s intervals while viewing the VR scene. As an experimental task to encourage head movement, participants were asked to look for animals hidden in the VR scene.
In this study, the Celsius temperature scale is used; the sensor can measure skin surface temperatures between 10 and 45 °C.
The BVP sensor was a BVP-Flex/Pro from Thought Technology and was attached to the index finger of the participant's dominant hand, as shown in Fig. 8.2. The sensor bounces infrared light off the skin surface and measures the amount of reflected light in order to measure heart rate and BVP amplitude. In this study, the BVP value is averaged over each inter-beat interval (IBI).
The sweating sensor was an SC-Flex/Pro from Thought Technology, a skin conductance (SC) sensor that measures conductance across the skin between two electrodes on the fingers; it was attached to the index and ring fingers of the participant's non-dominant hand, as shown in Fig. 8.3. The inverse of the electrical resistance between the fingers is the skin conductance value.
Virtual Scene. A three-dimensional amusement park with a Ferris wheel and a roller coaster was presented as the VR scene. We used assets available on the Unity Asset Store [20] to create the amusement park: "Terrain Toolkit 2017" was used to generate the land, "Animated Steel Coaster Plus" for the roller coaster, and "Ferris Wheel" for the Ferris wheel. To generate and present the scene, we used a computer (Intel Core i5-8400 2.80 GHz CPU, GeForce GTX 1050 Ti), Unity, an HMD (Acer AH101), and in-ear earphones (Apple MD827FE).
Task. All movement of the avatar in the virtual space was automatic, but the viewing angle changed according to the orientation of the participant's head. The scene began with a 45 s walk through a forest, followed by a 75 s ride on the Ferris wheel, a 30 s walk through grassland, and finally a 150 s ride on the roller coaster, for a total of 300 s.
The participants wore the HMD to view the movement through the scene and looked for animals in the view linked to their head movements (there were, in fact, no animals in the scene). This kind of scene and task creates a situation likely to induce VR sickness. The expected degree of VR sickness was small for walking in the forest, medium for riding the Ferris wheel, and large for riding the roller coaster.
Procedure. The participants were asked to refrain from drinking alcohol the day before the experiment in order to avoid sickness caused by factors other than the task. Informed consent was obtained from all participants prior to the experiment.
The participants spent 10 min before the task acclimatizing to the 21 °C room temperature of our laboratory, and then put on the HMD, the skin temperature sensor, the BVP sensor, and the SC sensor. Figure 8.4 shows a participant wearing the devices. After performing the experimental task, they answered a questionnaire about their VR experience. After the experiment, the participants stayed in the laboratory until they had recovered from any VR sickness.
The participants were ten undergraduate or graduate students (six males and four females) between the ages of 21 and 25 with no visual or vestibular sensory problems.
8.3.3 Result
Figures 8.5, 8.6, and 8.7 show the relationships between each of the three types of physiological signals and the participants' verbal discomfort reports. The measured physiological values are converted to logarithms to make it easier to examine the correlation between the physical measurements and the human psychological responses. The line segments on the graphs are exponential trend lines, and each color represents an individual participant, A to J.
Table 8.1 shows the mean and standard deviation of the coefficients of determination from the correlation analysis between each type of physiological signal and discomfort, over the participants.
Determination Coefficient. From Fig. 8.7 and Table 8.1, we found a strong correlation between hand sweating and discomfort due to VR sickness. The figure shows that, for all participants, sweating increases as discomfort increases, and the rate of increase follows the same trend across participants.
Additional Confirmatory Experiment. The determination coefficient of the nasal surface temperature also shows a fairly strong correlation, and the graph shows that the temperature tends to increase as discomfort increases. However, this tendency differs from the result of Ishihara et al. that the temperature decreases with the
We found the relationship between the amount of sweating and the degree of VR sickness, as well as its limit and power index, from the experiment in the previous section. Based on these results, this section explains the design of a system that controls the degree of alpha blending of auxiliary figures for reducing VR sickness according to the amount of sweating on the hand of a VR viewer.
Based on the assumption that Stevens' law, Eq. (8.1), holds between the psychological measure of discomfort from VR sickness and the physical measure of sweating, we constructed an equation to derive the alpha percentage for the alpha blending:

$$R = k S^{n} \qquad (8.1)$$

where α is the alpha percentage for the alpha blending, Z is the normal sweating volume (µS) of a viewer, x is the real-time sweating volume (µS) at each sample time while the viewer watches a VR scene, and the power index n is 1.13.
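The chapter's equation that maps the sweating values Z and x to the alpha percentage α is not reproduced in this excerpt, so the sketch below only illustrates the underlying power-law relation of Eq. (8.1) with the reported exponent of 1.13; the constant k and the normalization of the stimulus by the baseline Z are assumptions made for illustration.

```python
def stevens_power_law(x_us, z_us, k=1.0, n=1.13):
    """Stevens' law R = k * S**n (Eq. 8.1), with the stimulus S taken as the
    real-time sweating volume x normalized by the viewer's normal sweating
    volume Z. The normalization and the value of k are assumptions; the
    chapter's exact alpha-derivation equation is not shown in this excerpt."""
    s = x_us / z_us
    return k * s ** n
```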
In a previous study, Omata et al. evaluated four types of auxiliary figures (gazing point, Dots, user's horizontal line, and real-world horizontal line) aimed at reducing VR sickness, and among them, Dots reduced VR sickness the most [7]. Therefore, we adopt the Dots design as the auxiliary figure in this research. Dots consists of four dots, as shown in Fig. 8.10, superimposed on the four corners of the view on the HMD screen. The dots do not interfere with content viewing, so the decline in the sense of immersion is expected to be small. In this study, we made the Dots design slightly larger than that of Omata et al. and changed the color from white to peach. The larger size makes the dots easier to perceive as foreground than in the system of Omata et al., and the new color avoids blending in with the white roads and clouds in the VR scene of our experiment.
The overall flow of the auxiliary figure presentation system is as follows. First, the skin conductance value from the ProComp INFINITI, a biological amplifier, is measured continuously; the value is then continuously imported into Unity, which presents the VR scene, and is reflected in the alpha percentage of the color of the Dots on the HMD. The normal sweating value for each viewer is the average of the viewer's skin conductance acquired for 3 s after the system starts. The alpha value is updated once every two seconds so that the figure does not flicker during drawing.
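The presentation flow described above can be summarized as a small control loop: average the skin conductance for the first 3 s to obtain the viewer's normal sweating value, then update the alpha of the Dots once every two seconds. The sketch below is Python pseudocode of that description; read_sc() and set_dots_alpha() are hypothetical callbacks standing in for the ProComp INFINITI input and the Unity-side update, and the mapping from the sweat ratio to an alpha percentage is a placeholder rather than the chapter's equation.

```python
import time

def auxiliary_figure_controller(read_sc, set_dots_alpha,
                                baseline_window_s=3.0, update_period_s=2.0):
    """Sketch of the presentation loop: a 3 s baseline of skin conductance,
    then an alpha update every 2 s so the Dots do not flicker."""
    samples, t0 = [], time.time()
    while time.time() - t0 < baseline_window_s:
        samples.append(read_sc())      # hypothetical sensor callback
        time.sleep(0.1)
    baseline = sum(samples) / len(samples)

    while True:                        # runs for the duration of the VR scene
        ratio = read_sc() / baseline
        alpha = min(max((ratio - 1.0) * 100.0, 0.0), 100.0)  # placeholder mapping
        set_dots_alpha(alpha)          # hypothetical Unity-side callback
        time.sleep(update_period_s)
```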
called "controlled presentation"), the condition where the auxiliary figure is always displayed without alpha blending (hereinafter called "constant presentation"), and the condition where the auxiliary figure is not displayed (hereinafter called "no auxiliary figure"). The specific hypotheses were that the controlled presentation would be significantly less likely to cause VR sickness than the no auxiliary figure condition and would have the same sickness-reduction effect as the constant presentation, and that the controlled presentation would be less distracting to the sense of immersion than the constant presentation and would give the same sense of immersion as the no auxiliary figure condition.
The general flow of this experiment was the same as in Sect. 8.3, but instead of oral responses, the participants answered the Simulator Sickness Questionnaire (SSQ) [22] before and after viewing the VR scene, and the Game Experience Questionnaire (GEQ) [23] at the end of the VR scene. The SSQ is an index of VR sickness from which three sub-scores (oculomotor-related, nausea-related, and disorientation-related) and a total score can be derived by rating 16 symptoms that are considered to be caused by VR sickness on a four-point scale from 1 to 4. Since each sub-score is calculated in a different way, it is difficult to compare them directly and to judge how large each sub-score is; therefore, in this experiment, each sub-score is normalized so that its maximum value corresponds to 100%. In addition, to measure the worsening of VR sickness between before and after the experimental task, the SSQ was also administered before the task, rated on a four-point scale from 0 to 3. The GEQ is a questionnaire for measuring the user's experience after gameplay. Originally, 14 factors can be derived, but in this experiment we focused on positive and negative affect, tension, and immersion, and asked the participants to rate the 21 questions on a five-point scale from 1 to 5.
The experimental environment was the same as in Sect. 8.3. The task was slightly different: in Sect. 8.3, the participants searched for animals in a space where none were actually placed, whereas in this experiment they searched for animals in a space where animals were actually placed, so that they would not get bored.
The participants were asked to refrain from drinking alcohol the day before, to avoid sickness caused by factors other than VR. The room temperature was kept at a constant 21 °C using an air conditioner. Informed consent was obtained from all participants before the experiment. To counterbalance the order of the three conditions, the participants were divided into three groups and were asked to leave at least one day between consecutive conditions to reduce habituation effects.
Before watching the VR scene in each of the three conditions, a participant answered the SSQ to check his or her physical condition before the start. The participant then wore the HMD and the skin conductance sensor and watched the VR scene for 300 s, as in the task in Sect. 8.3. After watching it, the participant answered the SSQ, the GEQ, and questions about the condition. This
procedure was carried out as a within-subjects design, with three conditions for each
participant, with at least one day in between.
The participants were twelve undergraduate or graduate students (six males and
six females) between the ages of 21 and 25 with no visual or vestibular sensory
problems.
8.5.2 Results
SSQ. Figure 8.11 shows the SSQ results comparing the three conditions with respect to VR sickness. Each score shown here is the difference between each participant's post-task SSQ score and his or her pre-task SSQ score, so the higher the value, the worse the VR sickness became due to the task. The error bars indicate standard errors. The scores of all evaluation items decreased in the controlled and constant presentation conditions compared to the no auxiliary figure condition, but a Wilcoxon signed-rank test (5% significance level) showed no significant difference among them.
To analyze the effect of the number of times the participants had experienced VR in the past, we divided the participants into two groups: one with more than five VR experiences (four participants) and the other with fewer than five (eight participants). Figure 8.12 shows the SSQ scores of the group with less experience, and Fig. 8.13 shows the SSQ scores of the group with more experience. In the group with less experience, although there was no significant difference in the Wilcoxon signed-rank test (5% significance level), the controlled and constant conditions reduced VR sickness compared to the no auxiliary figure condition. No such trend was observed in the group with more VR experience.
GEQ. Figure 8.14 shows the GEQ results for the three conditions. The larger the value, the more intense the experience. The error bars again indicate standard errors. Since the tension scores were low regardless of the condition, we consider that sweating due to tension did not occur. The positive affect, negative affect, and immersion items showed little difference among the conditions, and a Wilcoxon signed-rank test (5% significance level) showed no significant differences among the conditions for any item.
Impressions of the Auxiliary Figure. In response to the question "What did you think about the display of the Dots?" regarding the auxiliary figure in the controlled and constant presentation conditions, 3 out of 12 participants answered "I didn't mind it" in the constant condition, while 7 out of 12 answered "I didn't mind it" in the controlled condition.
8.5.3 Discussions
VR Sickness. We divided the participants into groups according to the number of times they had experienced VR and analyzed the SSQ responses for each group. The results showed no statistically significant difference between the conditions for the group with less VR experience. However, from the graph in Fig. 8.11, we infer that the SSQ scores were lower with the auxiliary figure than without it. We therefore assumed that the difference in SSQ scores depended on the number of VR experiences and summarized the relationship between the number of VR experiences and SSQ scores in a scatter diagram (Fig. 8.15). From Fig. 8.15, we found that the difference in SSQ scores decreased with the number of VR experiences. In other words, this suggests that the auxiliary figure has a negative impact on users with many VR experiences, and therefore it is appropriate to present the auxiliary figure to users with little VR experience.
From another point of view, regarding the difficulty of reducing VR sickness in scenes with intense motion reported for the "virtual nose" of Whittinghill et al. [14], we infer that our proposed system was able to reduce VR sickness because the increase in sweating was suppressed even in the intense-motion scene in the latter half of our experimental VR scene.
Sense of Immersion. The GEQ results could not provide an overall trend in the participants' sense of immersion because the scores varied widely and the differences were not statistically significant. This suggests that the superimposition of the auxiliary figure does not have a significant negative impact on the sense of immersion. On the other hand, it also indicates that there is no significant difference in the sense of immersion between the proposed controlled presentation and the constant presentation. Therefore, in the future, we consider it necessary to use or create more specific and detailed indices to evaluate the sense of immersion.
Most of the participants answered that they were not bothered by the controlled presentation of the auxiliary figure. We attribute this to the fact that the auxiliary figure appeared gradually through alpha blending in the controlled presentation condition, so it blended into the VR image without drawing the viewer's gaze.
8.6 Conclusions
We plan to analyze the effect of the proposed system in more detail from the viewpoints of differences between men and women, differences in SSQ scores at the beginning of the task, and differences in eye movements during task execution, in addition to the differences in VR experience shown in this paper. Based on the results of such detailed analysis, we then plan to develop a learning HMD that switches the method for reducing VR sickness and its parameters in real time, based on the user's amount of VR experience, time spent experiencing the same content, and variation in the physiological reactions resulting from VR sickness.
References
1. Jerald, J.: The VR Book: Human-Centered Design for Virtual Reality. Association for Computing Machinery and Morgan & Claypool (2015)
2. Brainard, A., Gresham, C.: Prevention and treatment of motion sickness. Am. Fam. Phys. 90(1),
41–46 (2014)
3. Kariya, A., Wada, T., Tsukamoto, K.: Study on VR sickness by virtual reality snowboard.
Trans. Virtual Reality Soc. Japan 11(2), 331–338 (2006)
4. Hashilus Co, Ltd.: Business description. https://hashilus.com/business/. Last accessed
2022/01/20
5. Aoyama, K., Higuchi, D., Sakurai, K., Maeda, T., Ando, H.: GVS RIDE: providing a novel
experience using a head mounted display and four-pole galvanic vestibular stimulation. In ACM
SIGGRAPH 2017 Emerging Technologies (SIGGRAPH’17), Article 9, pp. 1–2. Association
for Computing Machinery, New York, NY, USA (2017)
6. Sra, M., Jain, A., Maes, P.: Adding proprioceptive feedback to virtual reality experiences using
galvanic vestibular stimulation. In: Proceedings of the 2019 CHI Conference on Human Factors
in Computing Systems (CHI’19), Paper 675, pp. 1–14. Association for Computing Machinery,
New York, NY, USA (2019)
7. Omata, M., Shimizu, A.: A proposal for discreet auxiliary figures for reducing VR sickness and
for not obstructing FOV. In: Proceedings of the 18th IFIP TC 13 International Conference on
Human-Computer Interaction, INTERACT 2021, Sequence number 7. Springer International
Publishing (2021)
8. Bos, J.E., MacKinnon, S.N., Patterson, A.: Motion sickness symptoms in a ship motion simu-
lator: effects of inside, outside, and no view. Aviat. Space Environ. Med. 76(12), 1111–1118
(2005)
9. Norouzi, N., Bruder, G., Welch, G.: Assessing vignetting as a means to reduce VR sickness
during amplified head rotations. In: Proceedings of the 15th ACM Symposium on Applied
Perception (SAP’18), Article 19, pp. 1–8. Association for Computing Machinery, New York,
NY, USA (2018)
10. Duh, H.B., Parker, D.E., Furness, T.A.: An “independent visual background” reduced balance
disturbance evoked by visual scene motion: implication for alleviating simulator sickness. In:
Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI’01),
pp. 85–89. Association for Computing Machinery, New York, NY, USA (2001)
11. Sargunam, S.P., Ragan, E.D.: Evaluating joystick control for view rotation in virtual reality
with continuous turning, discrete turning, and field-of-view reduction. In: Proceedings of the
3rd International Workshop on Interactive and Spatial Computing (IWISC’18), pp. 74–79.
Association for Computing Machinery, New York, NY, USA (2018)
12. Fernandes, A.S., Feiner, S.K.: Combating VR sickness through subtle dynamic field-of-view
modification. In: 2016 IEEE Symposium on 3D User Interfaces (3DUI), pp. 201–210 (2016)
13. Budhiraja, P., Miller, M.R., Modi, A.K., Forsyth, D.: Rotation blurring: use of artificial blurring
to reduce cybersickness in virtual reality first person shooters. arXiv:1710.02599[cs.HC] (2017)
14. Whittinghill, D.M., Ziegler, B., Moore, J., Case, T.: Nasum Virtualis: a simple technique for
reducing simulator sickness. In: Proceedings of Games Developers Conference (GDC), 74
(2015)
15. Cao, Z., Jerald, J., Kopper, R.: Visually-induced motion sickness reduction via static and
dynamic rest frames. In: IEEE Conference on Virtual Reality and 3D User Interfaces (VR),
pp. 105–112 (2018)
16. Buhler, H., Misztal, S., Schild, J.: Reducing VR sickness through peripheral visual effects. In:
IEEE Conference on Virtual Reality and 3D User Interfaces (VR), pp. 517–519 (2018)
17. Ishihara, N., Yanaka, S., Kosaka, T.: Proposal of detection device of motion sickness using
nose surface temperature. In: Proceedings of the Entertainment Computing Symposium 2015,
pp. 274–277. Information Processing Society of Japan (2015)
18. Narens, L.: A theory of ratio magnitude estimation. J. Math. Psychol. 40(2), 109–129 (1996)
19. Thought Technology Ltd.: ProComp infiniti system. https://thoughttechnology.com/procomp-
infiniti-system-w-biograph-infiniti-software-t7500m. Last accessed 2022/01/20.
20. Unity Asset Store.: https://assetstore.unity.com/. Last accessed 2022/01/20.
21. Stevens, S.S.: On the psychophysical law. Psychol. Rev. 64(3), 153–181 (1957)
22. Kennedy, R.S., Lane, N.E., Berbaum, K.S., Lilienthal, M.G.: Simulator sickness questionnaire:
an enhanced method for quantifying simulator sickness. Int. J. Aviat. Psychol. 3(3), 203–220
(1993)
23. IJsselsteijn, W.A., de Kort, Y.A.W., Poels, K.: The game experience questionnaire. Technische
Universiteit Eindhoven. Eindhoven (2013)
Chapter 9
Design and Implementation of Immersive
Display Interactive System Based on New
Virtual Reality Development Platform
Xijie Li, Huaqun Liu, Tong Li, Huimin Yan, and Wei Song
Abstract In order to address the single exhibition form of traditional museums, reduce the impact of space and time constraints on ice and snow culture, and expand the influence and dissemination of that culture, we developed an immersive Winter Olympics virtual museum based on Unreal Engine 4. We used 3ds Max to build the virtual venues, imported them into Unreal Engine 4 through Datasmith, and explored the impact of lighting, materials, sound, and interaction on virtual museums. This article gives users an immersive experience by increasing the realism of the space and provides a reference for the development of virtual museums.
9.1 Introduction
With the rapid development of society and the improvement of people’s living stan-
dards, virtual reality technology has been widely used in medical, entertainment,
aerospace, education, tourism, museums and many other fields. The digitization of
museums is an important trend in the development of museums in recent years, and
the application of virtual reality technology in the construction of digital museums
is an important topic. The Louvre was the first museum to move its collections from the exhibition hall to the Internet. The Palace Museum has also started the process of VR digitalization: its "V Palace Museum" display platform connects online and offline exhibition content, breaking conventions, improving user engagement, and "bringing the Forbidden City's cultural relics home."
The successful hosting of the 2022 Beijing–Zhangjiakou Winter Olympics has made Beijing a city that has hosted both the Summer and the Winter Olympics. However, the spread of ice and snow culture has certain limitations, being affected by many factors such as the epidemic situation and geography. Using virtual reality technology to build a virtual Winter Olympics museum can break the limitations of physical museums, extend the museum's space, expand its functions, and provide an effective way to meet the multi-level and multi-faceted needs of the public. This is one of the directions for the future development of digital museums and has broad prospects.
In recent years, Unreal Engine 4 has been widely used with the rise of virtual reality technology. Unreal Engine 4 has advantages over other engines: it offers not only familiar operating conventions and realistic light-and-shadow relationships, but also flexible, free-form interface design and minimalist interaction design [1].
UE4 brings a new way of developing programs. Its visual Blueprint scripting makes development more convenient through integrated code visualization, provides more possibilities for implementing functionality, and enhances the editability of Blueprints. Blueprint scripts are easy to read, which not only improves development efficiency but also allows the execution flow to be followed intuitively through the node connections, making problems easier to diagnose [2].
The project used 3ds Max modeling tools to build the virtual scenes and optimize geometry, Adobe Photoshop to texture the models, and Adobe After Effects and Adobe Audition to process video and audio materials. The 3D model is exported to the Datasmith format and imported into Unreal Engine for scene construction; the materials and textures of some models are then reprocessed, key processes such as the creation of model materials and lighting are completed, and Blueprint interaction events are prepared to complete the production of the project (Fig. 9.1).
The quality of the model construction in the virtual museum has a great impact on the final system, because the model serves as the carrier for all functionality. First, we made a CAD basemap to determine the model positioning criteria. Then, we built the scene model in 3ds Max based on the drawn floor plan, paying special attention to units, axis directions, and model scale. When building the model, removing excess faces not only improves texture-map utilization and reduces the face count of the entire scene, but also improves the responsiveness of the interactive scene [3], as shown in the example (Fig. 9.2).
After the scene was modeled, we used a file-based workflow to bring the designs into Unreal. Datasmith gets design data into Unreal quickly and easily; it is a collection of tools and plugins that bring entire pre-constructed scenes and complex assets created in a variety of industry-standard design applications into Unreal Engine. First, we installed a dedicated plugin in 3ds Max and used it to export files with the .udatasmith extension. Then, we used the Datasmith Importer to bring the exported file into the current Unreal Engine project (Fig. 9.3).
Using the Datasmith workflow, it is possible to achieve a one-to-one restoration of the scene, establish a single Unreal asset for all instantiated objects, maintain the original positions and orientations, view the scene by layers, and automatically convert textures, further narrowing the gap between the design results and the final product in the actual experience.
The first thing we aim to achieve in virtual reality is to imitate the behavior of the eyes, producing a realistic sense of space and immersion. Unreal Engine's rendering system is key to its industry-leading image quality and immersive experience. Real-time rendering technology is an important part of computer graphics research [4]. The purpose of applying this technology is to let users feel immersed: based on the actual shape, materials, and light-source distribution of the scene, it produces visual effects similar to, and almost indistinguishable from, the real scene.
Because physical space is limited, the visual experience dominates for people in a virtual environment. In this project, the presentation quality of the model also greatly affects the user experience, so after completing the scene construction it is necessary to further refine the model and the rendering.
Fig. 9.2 CAD floor plan and scene building in 3ds Max
Illumination
In Unreal Engine 4, a few key properties have the greatest impact on lighting in the world. Simulating the way light behaves in 3D worlds is handled in one of two ways: using real-time lighting methods that support light movement and the interaction of dynamic lights, or using precomputed (baked) lighting information stored in textures applied to geometric surfaces [5]. Unreal Engine provides both ways of lighting scenes, and they are not exclusive of one another, as they can be blended seamlessly.
The farther a light source is from the target object, the weaker its lighting effect on that object; this is the light falloff. The lighting model adopts a physically accurate inverse-square falloff and switches to the photometric brightness unit of lumens, which makes improving light falloff relatively straightforward. We chose to window the inverse-square function in such a way that the majority of the light's influence remains relatively unaffected, while still providing a soft transition to zero. This has the nice property that modifying a light's radius does not change its effective brightness, which can be important when lighting has been locked artistically but a light's extent still needs to be adjusted for performance reasons [7]:
$$\mathrm{falloff} = \frac{\operatorname{saturate}\bigl(1 - (\mathrm{distance}/\mathrm{lightRadius})^{4}\bigr)^{2}}{\mathrm{distance}^{2} + 1} \qquad (9.3)$$
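A direct Python transcription of Eq. (9.3), useful for checking the behavior of the windowed inverse-square falloff; the function and variable names are ours.

```python
def saturate(x):
    """Clamp a value to [0, 1], as in HLSL's saturate()."""
    return max(0.0, min(1.0, x))

def light_falloff(distance, light_radius):
    """Windowed inverse-square falloff of Eq. (9.3):
    saturate(1 - (d / r)^4)^2 / (d^2 + 1)."""
    window = saturate(1.0 - (distance / light_radius) ** 4) ** 2
    return window / (distance ** 2 + 1.0)

# Close to the source the inverse-square term dominates; near the light
# radius the window term fades the contribution smoothly to zero.
print(light_falloff(distance=2.0, light_radius=10.0))
print(light_falloff(distance=10.0, light_radius=10.0))  # -> 0.0
```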
In addition to realistic shading, Unreal Engine also offers a variety of interaction methods. At present, most virtual reality products use the controller buttons to implement interactive functions, since accurate gesture recognition and eye-movement recognition are not yet fully mature. Based on the requirements analysis of the virtual museum, the following interaction method is mainly adopted: users enter the virtual Winter Olympics museum and roam through it [10]. To give users more viewpoints and a better understanding of the needs of virtual museums, two roaming modes with different angles and forms were set up: first-person roaming and third-person roaming. In Unreal Engine 4, both roaming methods are controlled by character Blueprints.
A new Pawn Blueprint class is created and added to the scene; camera-switch events and the usual game-perspective controls are added to the Pawn's camera to achieve switching between the first-person and third-person perspectives (Fig. 9.7).
Box Trigger Interaction: Blueprints control the playback of videos by adding box triggers and defining OnActorBeginOverlap and OnActorEndOverlap events. When a character touches a box trigger, the video starts playing; playback stops when the character leaves the range of the box trigger. This design improves the user's sense of experience and realism. The results are shown in Fig. 9.8.
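The overlap logic itself lives in Blueprints; the following Python pseudocode only mirrors that logic for clarity and does not use the actual Unreal Engine API. The media_player object and the is_player_character flag are hypothetical stand-ins.

```python
class VideoBoxTrigger:
    """Pseudocode mirror of the Blueprint box trigger: the video plays while
    the character overlaps the trigger and stops when the character leaves."""

    def __init__(self, media_player):
        self.media_player = media_player   # hypothetical media player object

    def on_actor_begin_overlap(self, other_actor):
        if other_actor.is_player_character:
            self.media_player.play()       # OnActorBeginOverlap -> start playback

    def on_actor_end_overlap(self, other_actor):
        if other_actor.is_player_character:
            self.media_player.stop()       # OnActorEndOverlap -> stop playback
```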
In a virtual museum, sound information can be superimposed on the virtual scene in real time, producing a combined effect of sight and hearing that compensates for the limitations of vision alone. When the virtual scene changes, the sound the user hears changes accordingly, which enhances immersion. Audio volumes are used to add reverb effects to the virtual museum, increase spatial realism, and adjust the wet/dry mix of the level's sound to obtain a more realistic sense of distance in space. For added realism, sounds are usually moving rather than static [11] (Fig. 9.9).
9.4 Conclusion
Virtual museums have changed traditional concepts and broken the shackles of time and space. The virtual museum increases visitors' enthusiasm and enriches the museum's forms of display. Diverse interaction design will be the direction of future immersive virtual museums [12]. The Winter Olympics virtual museum integrates technology and sports so that more people can get to know Beijing and the 2022 Winter Olympics through new forms. While disseminating Chinese culture, it also enhances China's soft power and international influence. In the future, as related science and technology continue to mature, we will develop virtual museums of higher quality and greater significance.
Acknowledgements This work was supported by grants from: the "Undergraduate Teaching Reform and Innovation" Project of Beijing Higher Education (GM 2109022005); the Beijing College Students' Innovation and Entrepreneurship Training Program (Item Nos. 22150222040, 22150222044); the Key Project of Ideological and Political Course Teaching Reform of Beijing Institute of Graphic Communication (Item No. 22150222063); and the Scientific Research Plan of the Beijing Municipal Education Commission (Item No. 20190222014).
References
10.1 Introduction
tools because of its immersive experience [6, 7]. This VR immersion leads to engagement or flow, which helps students connect with the learning content and produces deep positive emotional value, consequently enhancing learning outcomes [8–10]. The nature of VR immersion enhances students' active learning with embodiment [11], long-term retention [12], and enjoyment [13]. Students experience contextualized learning when they actively engage with 360 VR video content. This learning experience differs from traditional classrooms with one-way, didactic lectures focused on shared content, in which students tend to learn passively [12].
VR technology has been employed in a variety of educational contexts. By providing students with immersive engagement in learning new knowledge, for example in foreign language learning [6, 14], plant cells [13], and English public speaking [9], these studies demonstrate that VR technologies facilitate promising new learning results. Previous scholars have indicated that applying VR in education can increase students' authentic experience of contextualized learning, which consequently empowers students to develop autonomy [4, 11, 15].
The impetus of this study is to provide Chinese EFL learners with an innovative teaching method that promotes learner motivation in English learning. Although previous studies have recorded positive feedback on VR integration in foreign language classrooms [6, 9, 14], research exploring EFL learners' experiences after watching 360 VR videos in tourism English learning is scarce. Therefore, this study used free online 360 VR videos for students to experience foreign tourist attractions while wearing VR headsets. Specifically, whether Chinese EFL learners prefer this innovative learning tool for experiencing immersive and authentic foreign contexts to learn English has not yet been investigated. Hence, the purpose of this case study was to apply the VR tool in an English course focusing on travel English units.
Teaching methods in twenty-first-century classrooms face the shift from the Information Age to the Experience Age [17]. Young learners live in an experience age full of multimedia and digital technologies, and they prefer technology-mediated learning environments for obtaining and sharing new knowledge. VR as a learning tool in educational environments better suits their learning styles for constructing new knowledge [13], and VR guides them to experience more engaging and exciting learning materials [4, 15].
The integration of VR has become popular in higher education over the last two decades [4, 10, 18, 19]. These studies indicate that students in VR learning contexts show dynamic engagement and participation in classroom discussion and grasp abstract concepts easily because they are more receptive to real-life contexts. For example, Dalgarno and Lee [4] reviewed the affordances of 3D virtual learning environments from the 1990s to the 2000s for enriching student motivation and engagement. The 3D virtual environments create "greater opportunities for experiential learning, increased motivation/engagement, improved contextualization of learning, and richer/more effective collaborative learning" [4].
For students, the brand-new learning moments with VR technologies are unique in educational settings. Their curiosity and motivation can be increased while they are involved in immersive VR learning contexts. This activated participation can be an opportunity for igniting students' interest and involvement in learning and maximizing the potential of VR learning experiences [13]. These unique VR learning benefits make subject contents come alive, because students feel a strong sense of presence in a multi-sensory setting that creates immersive and engaging learning [11]. For example, Allcoat and von Mühlenen [13] found that VR provided increased positive emotions and engagement compared with textbook and video groups learning about plant cells. Furthermore, Parmaxi [2] indicated that VR contextual learning helped students who particularly struggle to stay focused on learning materials in English learning. These positive results show that VR contexts give a better sense of being present in a place for connecting with learning content.
R.Q. 3: What advantages and disadvantages of the VR learning project did the students
express in the interviews?
10.3.1 Participants
The participants were English-major sophomores (n = 28; 24 females and 4 males) enrolled in a four-year public university in southern China, all aged between 20 and 21 years old. The participants were randomly divided into six groups of 4–5 students. This study utilized a convenience sample because all participants were enrolled in a cohort programme. They were native Chinese speakers, had studied English as a foreign language for at least six years in the Chinese school education system, and were assessed to be at an intermediate English proficiency level based on the Chinese National College Entrance Examination (equivalent to the B1 level of the Common European Framework of Reference for Languages (CEFR) [22]). None of the students had previous experience with VR learning. All the participants volunteered to attend this study and had the right to withdraw at any time.
10.3.2 Instruments
The data collected included final reflections written by 28 students and focus-group interviews with six participants. Students' final reflections were collected to explore their views about the VR learning experience and their suggestions regarding how VR can be used effectively in future learning. Six volunteers (five females and one male) attended the focus-group interviews so that we could understand their thoughts about the advantages and challenges of using VR for language learning.
10.3.3 Procedures
The classroom teacher (the first author) announced the goals of the VR learning programme to all participants before conducting the research. All participants in this case study had similar opportunities to use the VR HMDs and participate in the learning tasks. In particular, this study used advanced VR HMDs that are suitable for users with myopia of less than 600 degrees. This is important because the majority of Chinese students wear glasses, and the VR HMDs offer the best VR experience under these conditions.
The research lasted for six weeks, and students had 100 min of class time each week. The theme of the VR project was "Travel English". Four countries were selected by the class teacher for immersive VR learning: Turkey, Spain, New Zealand, and Colombia.
Fig. 10.1 Screenshots of 360 VR videos from Spain, Colombia, New Zealand, and Turkey
The course design adopted a flipped-learning structure. All students watched a 2D introductory video about the country to learn its basic concepts before coming to class. In the classroom, they had group discussions and answered the teacher's questions related to the 2D video. Each student then wore a VR HMD to experience a 360 VR video of that week's country and answered embedded questions (see Fig. 10.1). Afterwards, students practised English conversation, sharing what they had seen in the 360 video with their group members.
Students' final reflections were collected to explore their views of the VR learning experience and analyse their suggestions regarding how VR can be used more effectively in future teaching. Afterwards, six volunteers participated in semi-structured interviews. The RAs interviewed the volunteers using interview questions prepared by the first author. The interviews were conducted in the students' native language, Chinese, allowing the students to express their views without being constrained by second-language limitations. To make the students feel comfortable while answering questions, the first author did not participate in the interview process, and the RAs started with a welcome talk to put the interviewees at ease. Finally, the RAs conducted the subsequent data analysis.
The RAs conducted a content analysis of the students' final reflections and categorized them into themes to answer RQ1 and RQ2. The content analysis steps followed Braun and Clarke [23]. We conducted six steps to categorize the students' final reflections: (a) familiarizing yourself with your data, (b) generating initial codes, (c) searching for themes, (d) reviewing themes, (e) defining and naming themes, and (f) producing the report (p. 87). Steps a, b, c, and e were performed by the RAs. We reviewed the themes (step d) and produced the report (step f) together with the RAs, and finally reviewed the final report to improve accuracy.
For the focus-group interviews, all interview data were collected as audio recordings and transcribed into text for corpus analysis. The accuracy of the transcriptions was verified against the audio files, and the text was analysed as a Chinese corpus. Corpus analysis was conducted using WEICIYUN (http://www.weiciyun.com), an online tool that performs basic Chinese corpus analysis and generates visualizations of word occurrences from the input text. Visualizing word occurrence frequencies enabled us to extract key information from the interview data.
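WEICIYUN is the online tool that was actually used; as an offline equivalent, the sketch below counts Chinese word frequencies with the jieba tokenizer, which is an assumption about tooling rather than part of the study.

```python
from collections import Counter
import jieba  # Chinese word segmentation

def word_frequencies(transcript, stopwords):
    """Count word occurrences in a Chinese interview transcript; the most
    frequent words correspond to the larger entries of the word cloud."""
    tokens = (w.strip() for w in jieba.lcut(transcript))
    return Counter(w for w in tokens if len(w) > 1 and w not in stopwords)

# Example (file name and stop-word list are illustrative only):
# text = open("interviews_zh.txt", encoding="utf-8").read()
# word_frequencies(text, {"我们", "可以", "然后"}).most_common(20)
```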
10.4 Results
S9: I immersed myself in learning about some foreign cultures rather than the
presentation of photos. The application of modern technology in the field of education
is of great benefit.
S17: VR learning, combining fun and high-tech, gave me a new definition of
English learning, which not only improved my English learning motivation but also
enabled me to have a sense of immersion while watching 360 videos.
S26: Technology has brought us new ways and opportunities to learn. We can have
virtual field trips to learn new knowledge and see tourism spots in other countries.
This is not possible in traditional English classrooms.
In summary, students' final reflections indicated that VR technologies provide learners with engaging learning opportunities and reshape language learning experiences in EFL classrooms. Students took VR tours in different countries, which is an improvement on the traditional textbook or the 2D pictures used in other English learning settings. Additionally, the VR tours allowed students to immerse themselves in foreign tourist attractions, inspiring them to form a deeper connection with the learning materials.
R.Q. 2: What were students’ suggestions after experiencing the VR project?
All 28 students' suggestions about the VR learning project were used to answer research question two. The RAs categorized the collected responses according to their similarities (see Fig. 10.2): (1) VR equipment and technology, (2) VR content,
and (3) others.
In the first category about VR equipment and technology, three students expressed
their views about the VR device itself. Their responses were categorized into (1)
use more advanced VR display devices; and (2) wear headphones to experience
panoramic sound.
The second category is about the VR content. Eighteen students (64%) mentioned
viewing content that could be sorted into: (1) enriching video types and content, espe-
cially more lifelike videos such as food, clothing, and street culture; (2) improving
the clarity of content; and (3) increasing interaction.
The third category is other suggestions that could not be identified in the previous
two categories. These suggestions included (1) extending the use time; and (2)
providing buffer time in the beginning to set up the device. Table 10.1 presents
the students’ suggestions categorized by the RAs.
R.Q. 3: What advantages and disadvantages of the VR learning project did the students express in the interviews?
The volunteers were divided into two groups for the interviews. They expressed positive attitudes towards the VR learning project, indicating that VR learning has many advantages, such as allowing students to focus more on the content while learning with VR, developing the ability to learn actively, and exploring knowledge by themselves. However, some students also noted disadvantages of VR learning: for example, the equipment had various problems and was not easy to control, using VR reduced teacher–student communication, and wearing the HMD caused dizziness.
The interviewees’ responses were transcribed into Chinese and then visualized
through a word cloud to present their perceptions of the advantages and disadvantages
of the VR learning project (see Fig. 10.3). The larger words in the word cloud indicate
more frequent use. The illustration also includes translations into English.
The 360 VR videos can help me engage in an immersive environment, which is beneficial
for me to apply my imagination in this kind of digitalized learning materials. (Student B,
Group 1)
I felt dizzy and the recovery time may vary depending on each individual’s physical condition.
(Student C, Group 1)
10.5 Discussion
10.6 Conclusions
Virtual reality has become a popular theme in educational contexts, and the low cost of implementing 360 VR videos in EFL classrooms makes it an attractive learning tool. Although there were some technical issues related to video quality and motion sickness, most students expressed excitement and engagement in immersive VR learning in an EFL course. Additionally, the responses supported previous research showing that it is difficult for teachers to monitor students' selection of and attention to content in the HMDs. However, by asking students to prepare for the lesson and use that preparation to guide the VR experience, attention to content should improve. Furthermore, asking students to complete a survey, participate in an interview, and submit a final reflection encourages attention during the task and enhances retention. Finally, asking students to provide suggestions for future enhancements gives them motivation to contribute to further learning. While this study did not include longitudinal data or quantitative language-learning results, the overall impression is that VR English learning increases participation, improves attention, and motivates students to be critical about their learning and the learning material. For future work, it is worth evaluating variables such as students' speaking performance under the VR course design and their emotional changes with quantitative data.
References
1. Godwin-Jones, R.: Augmented reality and language learning: from annotated vocabulary to
place-based mobile games. Language Learn. Technol. 20(3), 9–19 (2016). https://www.sco
pus.com/inward/record.uri?eid=2-s2.0-84994627515&partnerID=40&md5=6d3aec75cd73
21c12aa0d2acef7c8ad9
2. Parmaxi, A.: Virtual reality in language learning: a systematic review and implications for
research and practice. Interact. Learn. Environ. (2020)
3. Warschauer, M.: Comparing face-to-face and electronic discussion in the second language
classroom. CALICO J. 13(2–3), 7–26 (1995)
4. Dalgarno, B., Lee, M.: What are the learning affordances of 3-D virtual environments? Br. J.
Edu. Technol. 41, 10–32 (2010)
5. Huang, H.W.: Effects of smartphone-based collaborative vlog projects on EFL learners’
speaking performance and learning engagement. Australas. J. Educ. Technol. 37(6), 18–40
(2021)
6. Berti, M.: Italian open education: virtual reality immersions for the language classroom. In:
Comas-Quinn, A., Beaven, A., Sawhill, B. (eds.), New Case Studies of Openness in and Beyond
the Language Classroom, pp. 37–47 (2019)
7. Makransky, G., Lilleholt, L.: A structural equation modeling investigation of the emotional
value of immersive virtual reality in education [Article]. Educ. Tech. Res. Dev. 66(5), 1141–
1164 (2018)
8. Chien, S.Y., Hwang, G.J., Jong, M.S.Y.: Effects of peer assessment within the context of spher-
ical video-based virtual reality on EFL students’ English-Speaking performance and learning
perceptions. Comput. Educ. 146 (2020)
9. Gruber, A., Kaplan-Rakowski, R.: The impact of high-immersion virtual reality on foreign
language anxiety when speaking in public. SSRN Electron. J. (2022)
10. Riva, G., Mantovani, F., Capideville, C., Preziosa, A., Morganti, F., Villani, D., Gaggioli, A.,
Botella, C., Alcañiz Raya, M.: Affective interactions using virtual reality: the link between
presence and emotions. Cyberpsychol. Behav. 10, 45–56 (2007)
11. Hu-Au, E., Lee, J.: Virtual reality in education: a tool for learning in the experience age. Int. J.
Innov. Educ. 4 (2017)
12. Qiu, X.-Y., Chiu, C.-K., Zhao, L.-L., Sun, C.-F., Chen, S.-J.: Trends in VR/AR technology-
supporting language learning from 2008 to 2019: a research perspective. Interact. Learn.
Environ. (2021)
13. Allcoat, D., von Mühlenen, A.: Learning in virtual reality: effects on performance, emotion
and engagement. Res. Learn. Technol. 26 (2018)
14. Lin, V., Barrett, N., Liu, G.-Z., Chen, N.-S., Morris, Jong, S.-Y.: Supporting dyadic learning of
English for tourism purposes with scenery-based virtual reality. Comput. Assisted Language
Learn. (2021)
15. Berti, M., Maranzana, S., Monzingo, J.: Fostering cultural understanding with virtual reality:
a look at students’ stereotypes and beliefs. Int. J. Comput. Assisted Language Learn. Teach.
10, 47–59 (2020)
16. Kaplan-Rakowski, R., Gruber, A.: Low-immersion versus high-immersion virtual reality: defi-
nitions, classification, and examples with a foreign language focus. In: Proceedings of the
Innovation in Language Learning International Conference 2019, pp. 552–555. Pixel (2019)
17. Wadhera, M.: The information age is over; welcome to the experience age. Tech Crunch
(2016, May 9). https://techcrunch.com/2016/05/09/the-information-age-is-overwelcome-to-
the-experience-age/
18. Hagge, P.: Student perceptions of semester-long in-class virtual reality: effectively using
“google earth VR” in a higher education classroom. J. Geogr. High. Educ. 45, 1–19 (2020)
19. Lau, K., Lee, P.Y.: The use of virtual reality for creating unusual environmental stimulation to
motivate students to explore creative ideas. Interact. Learn. Environ. 23, 3–18 (2012)
20. Vygotsky, L.: Mind in society: the development of higher psychological processes (1978)
21. Kaplan-Rakowski, R., Wojdynski, T.: Students’ attitudes toward high-immersion virtual reality
assisted language learning. In: Taalas, P., Jalkanen, J., Bradley, L., Thouësny, S. (eds.),
Future-Proof CALL: Language Learning as Exploration and Encounters—Short Papers from
EUROCALL 2018, pp. 124–129 (2018)
22. European Union and Council of Europe. Common European Framework of Reference for
Languages: Learning, Teaching, Assessment (2004). https://europa.eu/europass/system/files/
2020-05/CEFR%20self-assessment%20grid%20EN.pdf
23. Braun, V., Clarke, V.: Using thematic analysis in psychology. Qual. Res. Psychol. 3, 77–101
(2006)
Chapter 11
Medical-Network (Med-Net): A Neural
Network for Breast Cancer Segmentation
in Ultrasound Image
11.1 Introduction
Breast cancer is by far the most common breast mass among women [1]. Clinical diagnosis in primary care clinics is a crucial factor in decreasing the risk of breast cancer and providing earlier treatment for more positive patient outcomes. Although the mammogram is a well-known and reliable imaging modality used in breast cancer diagnosis, it can be costly and carries radiation risks from the use of X-rays. Mammograms also tend to produce a high number of false-positive results. In contrast, ultrasound (US) is an appropriate alternative for early-stage cancer detection. A mammogram or magnetic resonance imaging (MRI) can be used in conjunction with US to provide additional evidence. Various medical image segmentation techniques have emerged in the last decade. However, recent studies have further
In recent years, advances in deep learning and neural networks have contributed toward achieving fully automated US image segmentation and other relevant tasks by overcoming several persistent challenges that many previous methods could not handle effectively. Various deep neural network architectures have been proposed to perform efficient segmentation and the detection of abnormalities. For example, convolutional neural networks have been used for fully automated medical image segmentation: patch-based neural networks are trained on image patches, and fully convolutional networks perform pixel-wise prediction to form the final segmentation, as do U-nets [11]. Boundary ambiguity is one of the major issues when using fully convolutional networks (FCNs) for automatic US image segmentation, resulting in the need for more refined deep learning architectures. In this light, one study [12] proposed the use of cascaded FCNs to perform multiscale feature extraction. Moreover, spatial consistency can be enhanced by adding an auto-context scheme to the main architecture. U-nets are among the most popular and well-established deep learning architectures for image segmentation and deliver high performance even with a limited amount of training data. They are primarily CNNs consisting of a downsampling path, which reduces the image size by performing convolution and pooling operations on the input image and extracts contextual features, and an upsampling path, which reconstructs the image to recover the image size and various details [13]. However, the max-pooling in the U-net encoder tends to lose some localization information, and many studies show significant improvement when it is replaced by more sophisticated architectures, such as the Visual Geometry Group network (VGG) [14]. The V-net [15] is a similar architecture applied to 3D US images; it also faces the limitation of inadequate training data and consists of a compression and a decompression path, in a manner similar to U-nets for 2D images.
The incorporation of a 3D supervision mechanism facilitates accurate segmentation by exploiting a hybrid loss function that has shown fast convergence. Transfer learning [16] has gained the attention of practitioners for various tasks and has succeeded in many applications. It is a convenient solution for any limited-data task, as the pretrained models are usually trained on relatively huge datasets, such as Google's Open Images, ImageNet, and CIFAR-10. In the U-net base model, the use of skip connections between the two paths has some drawbacks, such as suboptimal feature reusability and a consequently increased need for computational resources. Other variants [5, 6] have incorporated attention mechanisms into U-net architectures to improve performance, especially for detection tasks. The addition of a spatial attention gate (SAG) and a channel attention gate (CAG) to a U-net helps in locating the region of interest (ROI) and explaining the feature representation, respectively. This type of technique is utilized in numerous applications, such as machine translation and natural language processing (NLP). Non-local means [17], its extension to non-local neural networks [18], and attention in machine translation [19] can be optimized through backpropagation during training and are therefore considered
soft attention modules. These soft attention mechanisms are efficient and can be plugged into CNNs, whereas hard attention, which involves non-differentiable operations, is not commonly used with CNNs. Attention mechanisms have proven successful in sequence modeling, as they allow the effective transmission of past information, an advancement that was not possible with older architectures based on recurrent neural networks. Therefore, self-attention can substitute for convolutional operations to improve the performance of neural networks on visual tasks, although the best performance has been reported when attention and convolutions are combined [20].
The most significant obstacle in breast tumor US image segmentation is shape variation: the size of a tumor can vary, and the border of a tumor is normally very close to the surrounding tissue. Frequently, the majority of the data points belong to the background, so small tumors are quite difficult to identify. This raises what is called a class imbalance problem. One popular way to address this problem is to put more weight on the minority class of the imbalanced dataset, which can be achieved using a weighted objective function. Preprocessing may also be exploited to manipulate the image data in a way that helps reduce the problem; for example, scaling the image by shifting the width and the height may yield some accuracy enhancement. In a similar classification task, oversampling the minority class will balance the data and help to tackle the imbalance problem.
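One common realization of "more weight on the minority class" is a pixel-weighted binary cross-entropy; the Keras-style sketch below is illustrative only, the weight value is not taken from the chapter, and the chapter itself trains with the Dice loss described later.

```python
import tensorflow as tf

def weighted_bce(pos_weight=10.0):
    """Binary cross-entropy that up-weights foreground (tumor) pixels, a
    common remedy for class imbalance. The weight value is illustrative."""
    def loss(y_true, y_pred):
        # Per-pixel BCE; expects masks of shape (batch, H, W, 1).
        bce = tf.keras.losses.binary_crossentropy(y_true, y_pred)
        weights = tf.cast(y_true, y_pred.dtype) * (pos_weight - 1.0) + 1.0
        return tf.reduce_mean(weights[..., 0] * bce)
    return loss
```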
Inspiration from the human brain's interpretation of visual perspective has led deep learning researchers to adapt the same concepts to recognizing objects with CNNs and to related tasks. Many contributions in the literature have applied this concept in applications such as classification [21], detection [22], and segmentation [23].
In this article, we present a neural network for breast ultrasound image segmentation, as shown in Fig. 11.1. Our solution is a general-purpose model and can be utilized for similar vision tasks. When the network processes the data to downsample the spatial dimensions, some meaningful details may be lost. Although pooling is unavoidable in CNNs, we employed residual blocks across the encoder of our network to keep track of the previous activations of each layer and sum the feature maps before fusion, which addresses this issue. When encoding the data, one of the keys is to manage the dimension reductions and to exploit the high-level information that carries spatial information while extracting the feature vector.
where $\tilde{x}_l$ is the output of the $l$-th layer, $x_l$ is the input to the residual block, and $C_{(k,n)}$ is a convolution layer with filter size $k$ and $n$ filters ($n$ = 32, 64, 128, 256, 512, and 1024, respectively). $A$ denotes an attention unit; $k$ in $l_1$ and $l_2$ is of size 1 × 7 and 7 × 1, respectively, and of symmetric sizes five and three, respectively, in the subsequent layers of both the residual blocks and the attention units. However, the last residual block utilizes a $k$ = 1 square filter.
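A hedged Keras sketch of an encoder residual block with the factorized 1 × 7 / 7 × 1 convolution pair mentioned above; the exact layer ordering, normalization, and the wiring of the attention units in Med-Net are not fully specified in this excerpt, so those details are assumptions.

```python
from tensorflow.keras import layers

def residual_block(x, n_filters, k=7):
    """Residual block with a 1xk followed by a kx1 convolution and a
    shortcut branch (1x1 convolution to match channel counts).
    Ordering and normalization details are assumed, not taken from Med-Net."""
    shortcut = layers.Conv2D(n_filters, 1, padding="same")(x)
    y = layers.Conv2D(n_filters, (1, k), padding="same", activation="relu")(x)
    y = layers.Conv2D(n_filters, (k, 1), padding="same", activation="relu")(y)
    return layers.Activation("relu")(layers.Add()([shortcut, y]))
```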
Inspired by the work of Hu et al. [24], one of the early models proposing channel attention for vision tasks, a squeeze block employs a channel-wise module that learns dependencies across the channels. However, for detection tasks, this approach may suffer from a lack of the localization information needed. Similarly, the work in [21] adds spatial information, so that the salience map comprising both channel and spatial details can be taken into account. A squeeze
In this work, we utilize four upsampling layers based on a dense block. The low-level features are first concatenated with the outputs of the encoder's residual and attention units and then passed to the dense block. To take advantage of the large feature maps concatenated from the early layers, the dense block in the decoder includes two convolutional layers with 12 filters, each preceded by a batch normalization layer and a rectified linear unit (ReLU) activation layer to add non-linearity. The dense block [25] feeds the input, as well as the output of each layer, forward to the subsequent layers. The decoder path in our model consists of a 2×2 upsampling operation, batch normalization, ReLU activation, and a 3×3 convolution.
This can be written as:
where U is the output of layer l, T is the output of the (l−1)th transposed layer, f is a fully connected layer, D denotes the dense block with n = 12 kernels of size k = 7 × 7, δ denotes the ReLU function, A denotes an attention unit, a concatenation operator fuses the inputs, and R is the output of the lth encoder block.
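To make this concrete, the following is a minimal Keras sketch of one such decoder stage under the description above (2×2 upsampling, batch normalization, ReLU, 3×3 convolution, fusion with the encoder features, and a two-layer dense block with 12 filters of size 7×7). The function and argument names are ours, not the authors' released implementation.

```python
from tensorflow.keras import layers

def dense_block(x, n_filters=12, n_layers=2, k=7):
    """DenseNet-style block: each layer receives the concatenation of all previous outputs."""
    features = [x]
    for _ in range(n_layers):
        inp = features[0] if len(features) == 1 else layers.Concatenate()(features)
        y = layers.BatchNormalization()(inp)
        y = layers.ReLU()(y)
        y = layers.Conv2D(n_filters, k, padding="same")(y)
        features.append(y)
    return layers.Concatenate()(features)

def decoder_stage(x, skip):
    """One decoder stage: 2x2 upsampling, BN, ReLU, 3x3 conv, fusion with the skip, dense block."""
    x = layers.UpSampling2D(size=(2, 2))(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv2D(skip.shape[-1], 3, padding="same")(x)
    x = layers.Concatenate()([x, skip])   # low-level features fused with residual/attention outputs
    return dense_block(x)
```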
In this study, the experiments were conducted using a Keras/TensorFlow 2.3.1 backend with Python 3.6 on Windows. The computer was equipped with an NVIDIA GeForce 1080 Ti with 11 GB of GPU memory. We performed five-fold cross-validation to evaluate our model; in each fold, the data were split randomly into two sets, with 80% used for training and 20% for validation. It is worth mentioning that the model was trained on each dataset separately. First, the images were resized to 256×256 spatial dimensions, and a preprocessing (augmentation) step was applied to further enhance performance. It involved several transformations: horizontal flip (p = 0.5); random brightness contrast (p = 0.2); random gamma [gamma_limit = (80, 120)]; adaptive histogram equalization (p = 1.0, threshold value for contrast limiting = 2.0); grid distortion (p = 0.4); and shift, scale, and rotate (shift_limit = 0.0625, scale_limit = 0.1, rotate_limit = 15), where p denotes the probability of applying a transformation. These transformations were applied in all the experiments, including for the models used for comparison.
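The parameter names above (p, gamma_limit, shift_limit, scale_limit, rotate_limit) match the Albumentations library, so the following sketch of the augmentation pipeline assumes that library; the chapter does not name it explicitly.

```python
import albumentations as A

augment = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.2),
    A.RandomGamma(gamma_limit=(80, 120)),
    A.CLAHE(clip_limit=2.0, p=1.0),          # adaptive histogram equalization
    A.GridDistortion(p=0.4),
    A.ShiftScaleRotate(shift_limit=0.0625, scale_limit=0.1, rotate_limit=15),
])

# Applied identically to each image and its mask, e.g.:
# out = augment(image=image, mask=mask)
# image_aug, mask_aug = out["image"], out["mask"]
```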
Our proposed model has 16 million trainable parameters and was optimized using the adaptive moment estimation (Adam) optimizer [26]. The adaptive learning rate was initialized at 0.0001 with a minimum of 0.000001, and a batch size of 4 was used to train the model. To prevent overfitting, all experiments terminated training if no improvement was recorded within 10 epochs. Due to its robustness against class imbalance, we trained the model with the Dice loss, given by:
$$\mathrm{Loss} = 1 - \frac{2\sum_{i}^{N} p_i q_i}{\sum_{i}^{N} p_i + \sum_{i}^{N} q_i} \qquad (11.5)$$
This loss function produces a value in the range [0, 1], where p_i is the predicted pixel and q_i denotes the corresponding pixel of the true mask.
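A minimal Keras/TensorFlow sketch of Eq. (11.5) is shown below; the small epsilon term is our addition to avoid division by zero on empty masks and is not part of the equation.

```python
import tensorflow as tf

def dice_loss(q_true, p_pred, eps=1e-6):
    """Dice loss of Eq. (11.5): p_pred are predicted pixels, q_true is the ground-truth mask."""
    p = tf.reshape(tf.cast(p_pred, tf.float32), [-1])
    q = tf.reshape(tf.cast(q_true, tf.float32), [-1])
    overlap = tf.reduce_sum(p * q)
    return 1.0 - (2.0 * overlap + eps) / (tf.reduce_sum(p) + tf.reduce_sum(q) + eps)
```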
11.3.5 Dataset
Two BUS datasets, UDIAT [11] and BUSIS [27–30], were used for training and validating the model. UDIAT is the smaller of the two, with 163 images and their labels; it was collected by the Parc Taulí Corporation Diagnostic Center, Sabadell (Spain), in 2012. BUSIS has 562 images of benign and malignant tumors along with the ground truth; it was collected by different institutions using different scanners, which makes it a very reliable data source. Both datasets present a single tumor in each image. Most images in these datasets show small tumors, so the background represents the majority of the data points. This situation introduces the so-called class imbalance problem, which needs to be handled carefully.
11.4 Discussion
Tumor tissues in breast US images have different shapes and can appear in different locations, yet most tumors occupy only a small area of pixels. Therefore, in the early layers, small kernels can capture local discrepancies, while a large receptive field is also needed to cover more pixels and capture semantic correlations; this helps preserve location information before the feature maps are fused in subsequent layers. Moreover, the divergence of intensities between vertically neighboring pixels is very small. A kernel with a large receptive field, however, creates memory overhead. To overcome this challenge, we utilized kernels of 1 × 7 and 7 × 1 in the first two layers, respectively, and then narrowed the size down in the following layers as the dimensions of the feature maps increased. This approach preserves long-range dependencies and yields better feature map representations.
In this article, we introduced a novel breast US image segmentation model that can be utilized and extended for any segmentation task. As the results show, our model is robust to imbalanced class data. The model was evaluated using several metrics: the Dice coefficient (DSC), Jaccard index (JI), true-positive ratio (TPR), and false-positive ratio (FPR). We evaluated our proposed model both quantitatively and qualitatively, and it proved to be stable and robust for breast US image segmentation.
We compared our proposed model with several others. Two of them, M-Net [31] and squeeze U-Net [32], are recent successful models for medical image segmentation and were implemented and trained from scratch. We also included selective kernel U-Net (SK-U-Net) [33], STAN [34], and U-Net-SA [35], which were originally trained on UDIAT and BUSIS and were designed specifically for breast US images; their scores were therefore taken as reported in their articles. The evaluation metrics are given by the following equations:
$$\mathrm{DSC} = \frac{2TP}{2TP + FP + FN} \qquad (11.6)$$

$$\mathrm{JI} = \frac{TP}{TP + FN + FP} \qquad (11.7)$$
Table 11.1 Evaluation metrics for all models given by the average score of five-fold cross-validation
on (UDIAT) dataset
Model Dataset DSC JI TPR FPR
Proposed model UDIAT 0.794 0.673 0.777 0.007
Squeeze U-Net [32] UDIAT 0.721 0.585 0.701 0.008
M-Net [31] UDIAT 0.748 0.615 0.740 0.007
STAN [34] UDIAT 0.782 0.695 0.801 0.266
SK-U-Net [33] UDIAT 0.791 – – –
Table 11.2 Comparison and evaluation metrics for the models given by the average score of five-
fold cross-validation on (BUSIS) dataset
Model Dataset DSC JI TPR FPR
Proposed model BUSIS 0.920 0.854 0.906 0.007
Squeeze U-Net BUSIS 0.912 0.841 0.910 0.009
M-Net BUSIS 0.909 0.836 0.915 0.009
STAN BUSIS 0.912 0.847 0.917 0.093
U-Net-SA [35] BUSIS 0.905 0.838 0.910 0.089
$$\mathrm{TPR} = \frac{TP}{TP + FN} \qquad (11.8)$$

$$\mathrm{FPR} = \frac{FP}{FP + TN} \qquad (11.9)$$
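For illustration, the four metrics of Eqs. (11.6)–(11.9) can be computed from binary masks as in the short NumPy sketch below (our illustration, not the authors' evaluation script).

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """DSC, JI, TPR, and FPR from boolean prediction and ground-truth masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    tn = np.sum(~pred & ~gt)
    dsc = 2 * tp / (2 * tp + fp + fn)
    ji = tp / (tp + fn + fp)
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return dsc, ji, tpr, fpr
```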
The results show that our model outperformed the other models examined. Tables 11.1 and 11.2 list the obtained results, and Fig. 11.6 shows the performance of the model on the BUSIS dataset.
On the UDIAT dataset, which has only a few samples, the model proved very efficient for a small dataset, achieving a Dice score of 0.79 and a JI score of 0.67. SK-U-Net came very close, with the second highest Dice score on this dataset; however, it was trained on a relatively large private dataset, whereas our model was trained on only 163 samples. STAN achieved a JI score of 0.69 and had the highest TPR and FPR scores, which indicates that it may identify some background pixels as tumor. In contrast, our model and M-Net scored the lowest FPR, and this
can be seen in Fig. 11.5, which shows very little positive area outside of the tumor boundaries. Fig. 11.4 shows the performance of the model on the UDIAT dataset.

Fig. 11.5 Segmentation sample cases produced by the different models used in this study and by our proposed network (Med-Net) on the UDIAT dataset
The other models implemented and trained in this study were M-Net and squeeze U-Net. M-Net has relatively few parameters and showed decent performance on both datasets, achieving Dice and JI scores of 0.74 and 0.61 on UDIAT, respectively. Squeeze U-Net, a modified version of U-Net [36] equipped with a squeeze module [37], achieved Dice and JI scores of 0.72 and 0.58, respectively.
Fig. 11.7 Segmentation sample cases produced by the different models used in this study and by our proposed network (Med-Net) on the BUSIS dataset

In contrast, our model scored the lowest FPR on the BUSIS dataset, as can be seen in Fig. 11.7, which shows very little positive area outside of the tumor boundaries. Our proposed model also achieved the highest performance on the BUSIS dataset, with Dice and JI scores of 0.92 and 0.85, respectively; it clearly has the best FPR of all the models. STAN gained the highest TPR score and the second best JI. All models performed adequately on this dataset.
This is because the dataset was collected and annotated by experts from different institutions; it has a large number of samples and was produced by different devices, which makes it suitable for evaluating and validating segmentation methods. Overall, when all the results are considered, our model proved superior to the other models in this study. A limitation is that our model could be computationally expensive on very high-resolution image data. Med-Net can be further extended in the future by adding more data and by examining different types of images, such as computed tomography (CT), MRI, and X-ray, of different organs.
11.5 Conclusion
In this article, we presented a novel U-Net-like CNN for breast US image segmentation. The model was equipped with visual attention modules to focus only on the salient features and suppress irrelevant details. The proposed network was able to extract the important features while considering spatial and channel-wise information. Dense blocks were used along the decoding path to provide full connectivity between the layers within each block. The model was validated on two breast US image datasets and showed promising results and enhanced performance. Although the model was designed for breast US images, it can be utilized for any computer vision segmentation task with some modifications.
References
1. Sung, H., Ferlay, J., Siegel, R.L., Laversanne, M., Soerjomataram, I., Jemal, A., Bray, F.: Global
cancer statistics 2020: globocan estimates of incidence and mortality worldwide for 36 cancers
in 185 countries. CA Cancer J. Clin. 71(3), 209–249 (2021)
2. Nugroho, H., Khusna, D.A., Frannita, E.L.: Detection and classification of breast nodule on
ultrasound images using edge feature (2019)
3. Lotfollahi, M., Gity, M., Ye, J., Far, A.: Segmentation of breast ultrasound images based on
active contours using neutrosophic theory. J. Medical Ultrasonics 45, 1–8 (2017)
4. Kwak, J.I., Kim, S.H., Kim, N.C.: Rd-based seeded region growing for extraction of breast
tumor in an ultrasound volume. Comput. Intel. Secur. 799–808 (2005)
5. Khanh, T., Duy Phuong, D., Ho, N.H., Yang, H.J., Baek, E.T., Lee, G., Kim, S., Yoo, S.:
Enhancing u-net with spatial-channel attention gate for abnormal tissue segmentation in med-
ical imaging. Appl. Sci. 10 (2020)
6. Schlemper, J., Oktay, O., Chen, L., Matthew, J., Knight, C., Kainz, B., Glocker, B., Rueckert,
D.: Attention-gated networks for improving ultrasound scan plane detection (2018)
7. Suchindran, P., Vanithamani, R., Justin, J.: Computer aided breast cancer detection using ultra-
sound images. Mat. Today Proc. 33 (2020)
8. Suchindran, P., Vanithamani, R., Justin, J.: Computer aided breast cancer detection using ultra-
sound images. Mat. Today Proc. 33 (2020)
9. Nithya, A., Appathurai, A., Venkatadri, N., Ramji, D., Anna Palagan, C.: Kidney disease detec-
tion and segmentation using artificial neural network and multi-kernel k-means clustering for
ultrasound images. Measurement 149, 106952 (2020). https://www.sciencedirect.com/science/
article/pii/S0263224119308188
10. Alzahrani, Y., Boufama, B.: Biomedical image segmentation: a survey. SN Comput. Sci. 2(4),
1–22 (2021)
11. Yap, M.H., Pons, G., Martí, J., Ganau, S., Sentís, M., Zwiggelaar, R., Davison, A.K., Martí,
R.: Automated breast ultrasound lesions detection using convolutional neural networks. IEEE
J. Biomed. Health Inform. 22(4), 1218–1226 (2018)
12. Wu, L., Xin, Y., Li, S., Wang, T., Heng, P., Ni, D.: Cascaded fully convolutional networks for
automatic prenatal ultrasound image segmentation, pp. 663–666 (2017)
13. Almajalid, R., Shan, J., Du, Y., Zhang, M.: Development of a deep-learning-based method for
breast ultrasound image segmentation, pp. 1103–1108 (2018)
14. Nair, A.A., Washington, K.N., Tran, T.D., Reiter, A., Lediju Bell, M.A.: Deep learning to obtain
simultaneous image and segmentation outputs from a single input of raw ultrasound channel
data. IEEE Trans. Ultrasonics Ferroelectrics Freq. Control 67(12), 2493–2509 (2020)
15. Lei, Y., Tian, S., He, X., Wang, T., Wang, B., Patel, P., Jani, A., Mao, H., Curran, W., Liu,
T., Yang, X.: Ultrasound prostate segmentation based on multi directional deeply supervised v
net. Med. Phys. 46 (2019)
16. Liao, W.X., He, P., Hao, J., Wang, X.Y., Yang, R.L., An, D., Cui, L.G.: Automatic identification
of breast ultrasound image based on supervised block-based region segmentation algorithm and
features combination migration deep learning model. IEEE J. Biomed. Health Inform. 1 (2019)
17. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polo-
sukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems,
pp. 5998–6008 (2017)
18. Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794–7803 (2018)
19. Zhang, B., Xiong, D., Su, J.: Neural machine translation with deep attention. IEEE Trans.
Pattern Anal. Mach. Intel. 42(1), 154–163 (2020)
20. Bello, I., Zoph, B., Vaswani, A., Shlens, J., Le, Q.V.: Attention augmented convolutional
networks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.
3286–3295 (2019)
21. Woo, S., Park, J., Lee, J.Y., Kweon, I.S.: Cbam: convolutional block attention module. In:
Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19 (2018)
22. Ba, J., Mnih, V., Kavukcuoglu, K.: Multiple object recognition with visual attention (2014).
arXiv:1412.7755
23. Valanarasu, J.M.J., Oza, P., Hacihaliloglu, I., Patel, V.M.: Medical transformer: gated axial-
attention for medical image segmentation (2021). arXiv:2102.10662
24. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE Con-
ference on Computer Vision and Pattern Recognition, pp. 7132–7141 (2018)
25. Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional
networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recogni-
tion, pp. 4700–4708 (2017)
26. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization (2014). arXiv:1412.6980
27. Xian, M., Zhang, Y., Cheng, H.D., Xu, F., Huang, K., Zhang, B., Ding, J., Ning, C., Wang, Y.:
A benchmark for breast ultrasound image segmentation (BUSIS). Infinite Study (2018)
28. Xian, M., Zhang, Y., Cheng, H.D.: Fully automatic segmentation of breast ultrasound images
based on breast characteristics in space and frequency domains. Pattern Recogn. 48(2), 485–497
(2015)
29. Cheng, H.D., Shan, J., Ju, W., Guo, Y., Zhang, L.: Automated breast cancer detection and
classification using ultrasound images: a survey. Pattern Recogn. 43(1), 299–317 (2010)
30. Xian, M., Zhang, Y., Cheng, H.D., Xu, F., Zhang, B., Ding, J.: Automatic breast ultrasound
image segmentation: a survey. Pattern Recogn. 79, 340–355 (2018)
31. Mehta, R., Sivaswamy, J.: M-net: A convolutional neural network for deep brain structure
segmentation. In: 2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI
2017), pp. 437–440 (2017)
32. Beheshti, N., Johnsson, L.: Squeeze u-net: A memory and energy efficient image segmenta-
tion network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition Workshops, pp. 364–365 (2020)
33. Byra, M., Jarosik, P., Szubert, A., Galperin, M., Ojeda-Fournier, H., Olson, L., O’Boyle, M.,
Comstock, C., Andre, M.: Breast mass segmentation in ultrasound with selective kernel u-net
convolutional neural network. Biomed. Signal Process. Control 61, 102027 (2020)
34. Shareef, B., Xian, M., Vakanski, A.: Stan: small tumor-aware network for breast ultrasound
image segmentation. In: 2020 IEEE 17th International Symposium on Biomedical Imaging
(ISBI), pp. 1–5 (2020)
35. Vakanski, A., Xian, M., Freer, P.E.: Attention-enriched deep learning model for breast tumor
segmentation in ultrasound images. Ultrasound Med. Biol. 46(10), 2819–2833 (2020)
36. Ronneberger, O., Fischer, P., Brox, T.: U-net: convolutional networks for biomedical image seg-
mentation. In: International Conference on Medical Image Computing and Computer-assisted
Intervention, pp. 234–241. Springer (2015)
37. Iandola, F.N., Han, S., Moskewicz, M.W., Ashraf, K., Dally, W.J., Keutzer, K.: Squeezenet:
Alexnet-level accuracy with 50x fewer parameters and <0.5 mb model size (2016).
arXiv:1602.07360
Chapter 12
Auxiliary Squat Training Method Based
on Object Tracking
Abstract Background: The deep squat is not only one of the basic movement patterns of the human body, but also a compound movement that directly trains hip strength and effectively exercises the posterior chain muscle groups. However, improper movement patterns reduce the quality of the exercise. Research objective: To improve training efficiency and reduce sports injuries, a method that can optimize the technique and movements of the deep squat needs to be designed. Methods: A tracking algorithm based on template matching, combined with biomechanical knowledge, was used to analyze the movement separately in the sagittal and coronal planes. Emphasis is placed on analyzing the power chain and imbalance: two force arms and two joint angles represent the power chain, and three line segments represent the balance between both sides of the body. Results: The performance was stable in real-scenario tests, and the motion information could be accurately captured and analyzed. Conclusion: This method can obtain the force arms and joint angles of the deep squat movement and also assist in screening the balance of both sides of the limbs, so that the pattern and rhythm of the movement can be adjusted accordingly.
Y. Pang
Zibo Normal College, Zibo 255100, China
e-mail: [email protected]
H. Sun
School of Physics & Optoelectronic Engineering, Shandong University of Technology,
Zibo 255100, China
e-mail: [email protected]
Han Tang Power Lifting, Qufu 255100, China
Y. Pang (B)
Institute of Artificial Intelligence in Sports, Capital University of Physical Education and Sports,
Beijing 100191, China
e-mail: [email protected]
12.1 Introduction
Squatting is not only one of the basic movement patterns of the human body, but also a compound movement that directly trains hip power and effectively exercises the posterior chain muscle group. The weighted squat is an important physical training exercise for building strength. However, poor movement patterns often reduce efficiency and can even trigger injury [1]. Computer vision is increasingly used in sports analysis because of its non-contact nature. Software such as Iron Path [2] and WL Analysis [3] captures the movement trajectory of the center of the barbell plate and offers general-purpose solutions for weightlifting technique analysis, but neither records information about the movement of the human joints. Analyzing joint angles is a common research method in other sports, such as race walking [4] and martial arts [5]. To record richer exercise information, improve training efficiency, and reduce sports injuries, we designed a method for optimizing the technical movements of the squat using a computer vision-based approach combined with knowledge of biomechanics. The code for this study is available [6], and we also produced a dataset, also available [7], that can be used to train object detection models.
Good technique in the deep squat means maintaining a zero force arm between the barbell bar and the balance point at the center of the foot. If there is a force arm between the bar and this balance point, the lifter wastes a lot of extra force. A proper deep squat has specific, recognizable characteristics determined by bone structure and muscle function. Any kind of deep squat, whether a back squat or a front squat, should meet these conditions so that the lifter can more easily determine whether his or her posture and movement are correct. At the top of the deep squat, all the skeletal parts supporting the barbell (knees, hips, and spine) are locked in extension, so the muscles only need to exert enough force to maintain the position, because the force acting on the bones at this point is mainly compressive. In this state, the task of the muscles is to keep the bones correctly aligned in a straight line so that they can support the weight. At this point, the barbell bar is directly above the center of the foot, and the greater the weight, the more important this position becomes [8].
When the lifter enters the eccentric phase of the squat and gradually moves toward the bottom, all the muscles that will later extend the hip and knee joints in the concentric phase, as well as the erector spinae muscles that remain isometrically contracted under increased load, are under tension and must at the same time counteract the moments acting on all parts of the body. During the squat, the barbell bar must remain directly above the center of the foot. We can confirm the correct bottom position with the help of anatomical markers:
• The spine remains rigid, with the lumbar and thoracic spine in extension.
• The barbell bar is directly above the center of the feet.
• The feet are flat on the ground, maintaining the correct abduction angle and stance width.
• The thighs are parallel to the feet.
• The hip joint is below the top of the patella.
Any position that does not meet these points, and any movement that deviates from this position during the descent and rise, indicates poor technique. In fact, if the bar is kept in a vertical plane directly above the center of the foot throughout the squatting and standing-up process, as if sliding in a narrow channel perpendicular to the foot's center, the movement is correct. The skeleton will then work out on its own how to use the muscles most effectively to complete the deep squat, within the constraints imposed by the barbell-body-gravity system.
The template regions are selected on the initial frame, and the similarity between a candidate region and the template is expressed using the normalized cross-correlation matrix [9, 10], thus enabling tracking of the barbell plate, hip, knee, and ankle. Let the pixel size of the image I to be matched be M × N and the pixel size of the template T be m × n. The coordinates of the upper left corner of an arbitrarily chosen sub-image I_{x,y} of size m × n within I are (x, y), with x ∈ [0, M−m] and y ∈ [0, N−n], where M and N are the numbers of rows and columns of the image to be matched, and m and n are the numbers of rows and columns of the template, respectively.
The normalized cross-correlation value R(x, y) [11] of the sub-image I_{x,y} and the template T is defined as:

$$R(x, y) = \frac{\sum_{i=0}^{m-1}\sum_{j=0}^{n-1}\bigl(I_{x+i,y+j} - \bar{I}_{x,y}\bigr)\bigl(T_{i,j} - \bar{T}\bigr)}{\sqrt{\sum_{i=0}^{m-1}\sum_{j=0}^{n-1}\bigl(I_{x+i,y+j} - \bar{I}_{x,y}\bigr)^{2}\;\sum_{i=0}^{m-1}\sum_{j=0}^{n-1}\bigl(T_{i,j} - \bar{T}\bigr)^{2}}} \qquad (12.1)$$
In Eq. (12.1), i and j are the coordinates of the pixels in the template. All the normalized cross-correlation values form the normalized cross-correlation matrix R. The average pixel value of the sub-image I_{x,y} is

$$\bar{I}_{x,y} = \frac{1}{m \times n}\sum_{i=0}^{m-1}\sum_{j=0}^{n-1} I_{x+i,y+j} \qquad (12.2)$$
and the average pixel value of the template T is

$$\bar{T} = \frac{1}{m \times n}\sum_{i=0}^{m-1}\sum_{j=0}^{n-1} T_{i,j} \qquad (12.3)$$
We also define R_T:

$$R_T = \sum_{i=0}^{m-1}\sum_{j=0}^{n-1}\bigl(T_{i,j} - \bar{T}\bigr)^{2} \qquad (12.4)$$
Since the template T is known, R_T is a constant positive value throughout the search process; it does not affect the determination of the optimal solution and therefore need not be calculated. The denominator of Eq. (12.1) can thus be written as:
$$R_{den}(x, y) = \sqrt{\sum_{i=0}^{m-1}\sum_{j=0}^{n-1}\bigl(I_{x+i,y+j} - \bar{I}_{x,y}\bigr)^{2}} \qquad (12.5)$$
Let $T'(i, j) = T(i, j) - \bar{T}$; then the numerator of Eq. (12.1) can be simplified as follows:

$$R_{num} = \sum_{i=0}^{m-1}\sum_{j=0}^{n-1}\bigl(I_{x+i,y+j} - \bar{I}_{x,y}\bigr)\bigl(T_{i,j} - \bar{T}\bigr) = \sum_{i=0}^{m-1}\sum_{j=0}^{n-1}\bigl(I_{x+i,y+j} - \bar{I}_{x,y}\bigr)T'(i, j)$$

$$= \sum_{i=0}^{m-1}\sum_{j=0}^{n-1} I_{x+i,y+j}\,T'(i, j) - \bar{I}_{x,y}\sum_{i=0}^{m-1}\sum_{j=0}^{n-1} T'(i, j)$$
Since $\sum_{i=0}^{m-1}\sum_{j=0}^{n-1} T'(i, j) = 0$, the second term vanishes. Thus,

$$R_{num} = \sum_{i=0}^{m-1}\sum_{j=0}^{n-1} I_{x+i,y+j}\,T'(i, j) \qquad (12.6)$$
In this way, the normalized cross-correlation value R(x, y) of the sub-image I_{x,y} and the template T in Eq. (12.1) is equivalent, up to a constant positive factor, to

$$R'(x, y) = \frac{\sum_{i=0}^{m-1}\sum_{j=0}^{n-1} I_{x+i,y+j}\,T'(i, j)}{\sqrt{\sum_{i=0}^{m-1}\sum_{j=0}^{n-1}\bigl(I_{x+i,y+j} - \bar{I}_{x,y}\bigr)^{2}}} \qquad (12.7)$$
After the matching module returns the best-matching region, a program script extracts the final coordinates of each tracked point.
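The full implementation is available in the repository cited above [6]; the sketch below only illustrates the same normalized cross-correlation matching using OpenCV, whose TM_CCOEFF_NORMED method corresponds to the zero-mean formulation of Eq. (12.1). Using OpenCV here is our assumption, not necessarily the authors' implementation.

```python
import cv2

def track_template(frame_gray, template_gray):
    """Return the top-left corner (x, y) and score of the best match in one frame."""
    response = cv2.matchTemplate(frame_gray, template_gray, cv2.TM_CCOEFF_NORMED)
    _, max_val, _, max_loc = cv2.minMaxLoc(response)
    return max_loc, max_val

# Typical use: the template (barbell plate, hip, knee, or ankle region) is cropped
# from the initial frame and matched against every subsequent grayscale frame:
# template = first_frame_gray[y:y + n, x:x + m]
# (best_x, best_y), score = track_template(frame_gray, template)
```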
Analyzing the deep squat movement in the sagittal plane requires calculating the hip angle, knee angle, necessary force arm, and undesirable force arm, while analysis in the coronal plane requires calculating three lines, representing the shoulder girdle, hip girdle, and trunk, respectively. The definitions of these indicators are shown in Fig. 12.1 and Table 12.1. Specifically, the hip angle is calculated with Eq. (12.8), the knee angle with Eq. (12.9), the necessary force arm with Eq. (12.10), and the undesirable force arm with Eq. (12.11).
$$\mathrm{Hip\_angle} = \arccos\frac{(X_{hip} - X_{plate})(X_{hip} - X_{knee}) + (Y_{hip} - Y_{plate})(Y_{hip} - Y_{knee})}{\sqrt{\bigl[(X_{hip} - X_{plate})^{2} + (Y_{hip} - Y_{plate})^{2}\bigr]\cdot\bigl[(X_{hip} - X_{knee})^{2} + (Y_{hip} - Y_{knee})^{2}\bigr]}} \qquad (12.8)$$

$$\mathrm{Knee\_angle} = \arccos\frac{(X_{knee} - X_{ankle})(X_{knee} - X_{hip}) + (Y_{knee} - Y_{ankle})(Y_{knee} - Y_{hip})}{\sqrt{\bigl[(X_{knee} - X_{ankle})^{2} + (Y_{knee} - Y_{ankle})^{2}\bigr]\cdot\bigl[(X_{knee} - X_{hip})^{2} + (Y_{knee} - Y_{hip})^{2}\bigr]}} \qquad (12.9)$$
$$\mathrm{Necessary\_Force\_arm} = X_{hip} - X_{foot} \qquad (12.10)$$

$$\mathrm{Undesirable\_Force\_arm} = X_{plate} - X_{foot} \qquad (12.11)$$
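A short NumPy sketch of Eqs. (12.8)–(12.11) follows; the coordinate values are placeholders, and in practice the (X, Y) positions come from the template-matching tracker described above.

```python
import numpy as np

def joint_angle(center, a, b):
    """Angle (degrees) at `center` between the segments center-a and center-b."""
    u = np.asarray(a, float) - np.asarray(center, float)
    v = np.asarray(b, float) - np.asarray(center, float)
    cos_t = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(cos_t, -1.0, 1.0)))

# Placeholder (X, Y) pixel coordinates for one frame
hip, knee, ankle = (410, 320), (430, 420), (440, 520)
plate, foot = (405, 180), (445, 560)

hip_angle = joint_angle(hip, plate, knee)            # Eq. (12.8)
knee_angle = joint_angle(knee, ankle, hip)           # Eq. (12.9)
necessary_force_arm = hip[0] - foot[0]               # Eq. (12.10)
undesirable_force_arm = plate[0] - foot[0]           # Eq. (12.11)
```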
12.3 Results
Two separate procedures examine motion in the sagittal plane and in the coronal plane. Information on angular velocity and force arms can be obtained in the sagittal plane, and imbalances between the two sides of the body can be screened in the coronal plane.
In Fig. 12.2, the lifter on the left shows good movement posture with almost no undesirable force arm, while the lifter on the right shows a more severe forward displacement of the barbell and a significant undesirable force arm. The latter movement pattern not only produces less strength but also places greater pressure on the lower back. The hip angle of a weightlifter is shown in Fig. 12.3, the knee angle in Fig. 12.4, the necessary force arm in Fig. 12.5, and the undesirable force arm in Fig. 12.6.
12.4 Discussion
The weighted squat is a very functional movement that requires consideration of the interactions between different muscle groups, and between the muscles and the barbell, during the movement. Motion capture in sports is often done with computer vision methods, and some studies [12, 13] have used deep learning-based pose estimation; for the weighted deep squat, however, pose estimation is not an appropriate way to capture motion information. This is because the barbell plate occludes a large area of the upper body, so existing pose estimation models barely work. On the other hand, the weighted deep squat involves no rotation and no scale change, so a tracking algorithm based on template matching works well in this case. Meaningful analytical results can be obtained on this basis. For example, in Figs. 12.3 and 12.4 the hip and knee angles vary roughly periodically: the flat top of each curve corresponds to the lifter standing upright, while the bottom of each curve is steeper, with the hip and knee angles changing rapidly, indicating that the lifter can stand up quickly after squatting and that the muscles switch rapidly from eccentric to concentric contraction. Because body structure is constant for a given person, each peak of the necessary force arm in Fig. 12.5 is also essentially the same. As shown in Fig. 12.6, the undesirable force arm gradually increased and reached a maximum at the 4th repetition, which may indicate that the movement deteriorated due to central or peripheral fatigue.
12.5 Conclusion
In this study, the position coordinates of the hip, knee, ankle, and barbell plate center were obtained using an object tracking algorithm and then used to visualize the joint angles, necessary force arms, and undesirable force arms at each moment of the weightlifter's deep squat. The method is convenient, fast, and reliable, and can serve as a reference for analyzing the technical movements of the deep squat, improving the safety of training. In future work, we will try to use deep learning models to obtain a more robust way of capturing motion information for weighted deep squats.
Acknowledgements Sincere thanks to “Han Tang Power Lifting” for the technical support and
for the material and concepts that provided great help for this study. This work was supported
by Beijing college students’ innovation and entrepreneurship training program under Grant No.
S202210029010.
References
1. Diggin, D., Regan, C.O., Whelan, N., et al.: A biomechanical analysis of front and back squat:
injury implications. In: Isbs Conference in Vilas Boas Et Al (2011)
2. Kasovic, J., Martin, B., Zourdos, M.C., et al.: Agreement between the iron path app and a linear
position transducer for measuring average concentric velocity and range of motion of barbell
exercises. J Strength Condition Res (2020), Publish Ahead of Print
3. Hideyuki, N., Daichi, Y.: Validation of video analysis of marker-less barbell auto-tracking in
weightlifting. PloS One 17(1) (2022)
4. Wang, Y., Hu, G., Peng, X., Li, H.-l.: Biomechanics and engineering applications of race
walking. In: 2021 International Conference on Health Big Data and Smart Sports (HBDSS),
pp. 50–55 (2021)
5. Pang, Y., Wang, Q., Zhang, C., Wang, M., Wang, Y.: Analysis of computer vision applied in
martial arts. In: 2022 2nd International Conference on Consumer Electronics and Computer
Engineering (ICCECE), pp. 191–196 (2022)
6. Code https://github.com/pyqpyqpyqpyq789/Object-Tracking-for-Squat
7. Dataset https://aistudio.baidu.com/aistudio/datasetdetail/103531
8. Rippetoe, M.: Starting Strength: Basic Barbell Training, 3rd ed. (2016)
9. Wu, J., Yue, H.J., Cao, Y.Y., et al.: Video Object tracking method based on normalized cross-
correlation matching. In: Proceedings of the Ninth International Symposium on Distributed
Computing and Applications to Business, Engineering and Science. IEEE Computer Society
(2010)
10. Tsai, D.M., Lin, C.T., Chen, J.F.: The evaluation of normalized cross correlations for defect
detection. Pattern Recogn. Lett. 24(15), 2525–2535 (2003)
11. Sethmann, R., Burns, B.A., Heygster, G.C.: Spatial resolution improvement of SSM/I data with
image restoration techniques. IEEE Trans Geosci Remote Sens 32(6), 1144–1151
12. Pang, Y., Wang, Q., Zhang, C.: Time-frequency domain pattern analysis of Tai Chi 12 GONG FA
based on skeleton key points detection. In: 2021 International Conference on Neural Networks,
Information and Communication Engineering, International Society for Optics and Photonics,
vol. 11933, pp. 119331Y-1 (2021)
13. Guo, H.: Research and implementation of action training system based on key point detection.
Master’s thesis, Xi’an University of Technology (2021)
Chapter 13
Study on the Visualization Modeling
of Aviation Emergency Rescue System
Based on Systems Engineering
Abstract Focusing on the need to establish a more complete and implementable aviation emergency rescue (AER) system, a study on the system architecture and visualization modeling of AER was carried out. Based on systems engineering, the AER system architecture contains four stages (the disaster prevention and early warning stage, disaster preparation stage, emergency response stage, and recovery and reconstruction stage) and six corresponding nodes (the prevention and preparation node, command and control node, reconnaissance and surveillance (S&R) node, scheduling planning node, search node, and rescue node). The AER system visualization model comprises operational viewpoint (OV) and capability viewpoint (CV) DoDAF-described models of the system architecture. The OV models describe the high-level operational concept, operational elements, mission and resource flow exchanges, etc., while the CV models capture the capability taxonomy and its complex relationships. The visualization model provides reference and guidance for AER system architecture design and is suitable for the visual description of complex systems engineering architectures for emergencies.
13.1 Introduction
AER is widely used in many fields, such as firefighting [4], earthquake rescue [5], and wilderness search [6]. Existing research covers aviation emergency management [7], the quality structure model of new-type AER command talents [8], the aviation emergency standard system of China [9], AER capability evaluation [10], AER response assessment [11], civil AER system architecture [12], and general aviation rescue airport site selection [13]. Regarding the visualization modeling of AER, existing research focuses on rescue scene modeling [14], drill platforms [15], and the calculation of aviation emergency and rescue operation parameters [16], based on virtual reality techniques. The design and decision problem of an AER system, as a complex and open systems engineering problem, can be treated as system architecture design supported by decision support systems engineering [17, 18].
Model-based systems engineering (MBSE) is a method of describing organizational management systems using mathematical and logical models [19]. As one of the most implementable complex engineering design methods in systems science, it is characterized by high complexity, a wide range of applications, and a combination of quantitative and qualitative analysis [20–22]. As the major part of MBSE, the system architecture needs to be adaptable to changing needs and requirements [23]. To address system architecture requirements, the Department of Defense Architecture Framework (DoDAF) was proposed as a visualization infrastructure. DoDAF is organized into various viewpoints, which makes it suitable for large systems with complex integration and interoperability challenges [24]. Several fields closely related to society, such as civil airline transportation [25] and information platform construction [26], have been modeled based on MBSE and DoDAF.
To address the lack of visualization modeling of the AER system, the present study analyzes the AER system architecture and constructs an AER visualization model using DoDAF. Through different viewpoints, the operational activities and system capabilities are visually described, completing a nonlinear mapping from system analysis to the actual architecture.
China’s AER system has made certain achievements at present. The AER system is
based on the emergency plan, with the emergency management mechanism as the
guarantee, supplemented by laws and regulations, and the support of science and
technology.
To further improve the emergency management capability, the present study
proposed an AER system architecture containing four stages and their corresponding
six nodes based on systems engineering. The four stages are disaster prevention
and early warning stage, disaster preparation stage, emergency response stage,
and recovery and reconstruction stage. Each stage consists of different activities.
According to the nature of the activities, the activities of the same nature that make
13 Study on the Visualization Modeling of Aviation Emergency Rescue … 175
up the different phases are grouped into the same node in this present study. The
corresponding six nodes are: prevention and preparation node, command and control
node, reconnaissance and surveillance (S&R) node, scheduling planning node, search
node, and rescue node (Fig. 13.1).
Disaster prevention and early warning is a prerequisite for effective emergency rescue. Disaster prevention mainly refers to strengthening people's awareness of disasters and improving their ability to take the initiative to avoid them through publicity, drills, and other means. Early warning includes real-time monitoring of urban weather and engineering construction, together with the analysis and processing of the information collected by the early warning system. Comprehensive disaster prevention reduces the losses that would otherwise result from inadequate preparation. Expanding the coverage of the monitoring system and improving the accuracy of early warning information are the technical prerequisites for speeding up emergency response.
Emergency response is the process by which the control center handles situation information, formulates rescue plans, conducts real-time reconnaissance, dispatches personnel and materials, and directs the implementation of search and rescue after a disaster occurs or when one is predicted to occur.
This stage mainly includes five types of activities, in chronological order: plan formulation, reconnaissance and surveillance, decision-making, disposal implementation, and real-time reconnaissance.
1. Plan Formulation
Plan formulation is the drafting of an AER plan with reference to the disaster situation and to the prevention and preparedness work, and corresponds to the scheduling planning node. It covers the location and number of navigable airports available for rescue, the type and number of rescue helicopters, the rules for dispatching rescue helicopters, the selection of personnel resettlement points, and the evaluation method for rescue plans.
2. Reconnaissance and Surveillance
After the completion of plan formulation, the disaster situation should be reconnoi-
tered and monitored. Through the reconnaissance by helicopter and the message back
from the monitor unit, the control center can confirm the site situation and adjust the
plan in real time.
3. Decision-Making
Through visualization models or other auxiliary decision-making means, the control
center makes and confirms all practical decisions in the rescue plan with reference
to the search information and the feasibility evaluation of the existing rescue plan
which corresponds to the command and control node.
4. Disposal Implementation
The disposal implementation is the process of advancing rescue plans and decisions,
including the completion of personnel rescue, goods and material transfer, and other
mission requirements. From the perspective of requirements, the disposal implemen-
tation is disaster-oriented aviation emergency search and rescue, corresponding to
search node and rescue node.
5. Real-Time Reconnaissance
During disposal implementation, the monitor unit conducts real-time reconnaissance
of the search and rescue process, considering the possibility of secondary disas-
ters and updated mission requirements. Update mission demand and rescue plans
concerning mission completion and secondary disaster occurrences.
After the emergency rescue, this stage mainly includes the resettlement of personnel and the reconstruction of infrastructure. The experience of the emergency rescue is then summarized to guide and iterate the system design process of the first three stages.
The AER system composed of the above stages contains three elements: system
architecture, operational activities, and system capabilities. Correspondingly, two
viewpoints, operational viewpoint and capability viewpoint, are selected to construct
a complete structural framework of the AER system and a visualization DoDAF-
described model which takes system architecture, operational concept and process,
task tree, and capability as input. The modeling steps are shown in Fig. 13.2.
Step 1 Determine the operational concept of the AER system and construct the high-level operational concept graphic (OV-1 model).
Step 2 Establish the operational resource flow description diagram (OV-2 model)
combining the OV-1 model, the system architecture, and the analysis of the
operational process.
Step 3 Build the operational activity decomposition tree (OV-5a model) corre-
sponding to the overall task tree and nodes.
Step 4 Establish the dependency matrix (OV-3 matrix) between the OV-2 model and
the OV-5a model.
Step 5 Combined with the analysis of the capability system, construct capability
taxonomy (CV-2) which provides visualizations of the evolving capabilities.
Step 6 Build the capability to operational activities mappings (CV-6 matrix) which
describes the mapping between the capabilities required and the activities that enable
those capabilities.
The AER system visualization model consists of seven elements: control center,
monitor unit, airport, helicopter, personnel, goods and materials, and relevant points.
1. Control Center
The control center is responsible for handling early warning information, formulating rescue plans, commanding and controlling the rescue process, and dispatching and commanding rescue helicopters, according to the type of emergency.
2. Monitor Unit
Monitor unit includes a disaster emergency monitoring system and an early warning
system, covering the whole emergency management process.
Monitor unit collects early warning information, monitors urban meteorology and
disaster situation in real time, and provides the control center with information for
analysis and processing.
3. Airport
In the emergency response stage, the control center shall formulate an emergency
rescue plan, including the location and number of airports available for rescue heli-
copter landing, refueling, and support. Rescue workers and relief materials could be
assembled at the airport according to the plan and wait for transfer.
4. Helicopter
According to the emergency rescue plan, the helicopters participating in the emer-
gency rescue gather at the designated airport, receive support, load disaster relief
materials or rescue workers, and wait for scheduling.
5. Personnel
Personnel includes rescue workers and trapped people. Rescue workers are those
who treat the wounded or transfer trapped people. Trapped people are those who are
trapped in place and need AER after emergencies.
6. Goods and Materials
In the present study, goods and materials include materials (living materials, food,
medicine, etc.) and disaster relief equipment, which are transferred to mission
demand points by helicopter.
7. Relevant Locations
The present study considers four types of relevant locations: mission demand point,
resettlement point, loading point, and unloading point.
Mission demand point refers to the place where relevant missions need to be
performed, including but not limited to the place where there is a need for rescue
workers or materials.
Resettlement point refers to the place where the trapped people can be properly
resettled.
Loading point and unloading point, respectively, refer to the places where heli-
copters load and unload personnel or materials, which may be vacant sites temporarily
requisitioned or suitable for helicopter takeoff and landing (Fig. 13.3).
OV-2 DoDAF-described Model
The OV-2 DoDAF-described model is a further refinement of the high-level operational concept, showing the flow of personnel, materials, and information without describing the flow mode. The resource flows run between the nodes of the AER system architecture and reflect the operational exchanges.
The type of operational exchange differs between nodes and comprises information exchange, goods and materials exchange, and personnel exchange. The information exchange includes six types of information: request information, mission information, tracking information, control orders, distress signals, and situation information. In addition to the nodes corresponding to the different stages, this model contains five types of location and actor elements. People in distress, and locations where rescue workers or supplies are needed, send distress signals to the control center for rescue plan formulation and helicopter scheduling. Resettlement points, loading points, and unloading points send on-site situation information to the monitor unit so that real-time information on the disaster situation can be obtained (Fig. 13.4).
13.4 Conclusion
The present study summarizes and composes the architecture of the AER system, introduces the systems engineering approach and the DoDAF model design process, and constructs the AER system visualization model. The model provides specific descriptions from the capability viewpoint and the operational viewpoint, offering reference and guidance for the design of the AER system by means of visualization modeling.
It should be noted that the present study aims at a complete description of the AER system based on visual modeling; helicopter scheduling rules and internal activities are not considered here and will be addressed in follow-up work.
Acknowledgements I am very grateful to my tutors, Hu Liu and Yongliang Tian, and my friend
Xin Li for the great help in my field of study. This research did not receive any specific grant from
funding agencies in the public, commercial, or not-for-profit sectors.
References
1. Bullock, J., Haddow, G., Coppola, D.P.: Introduction to emergency management. Butterworth-
Heinemann (2017)
2. Alexander, D.: Towards the development of a standard in emergency planning. Disaster Prev.
Manag. 14(2), 158–175 (2005). https://doi.org/10.1108/09653560510595164
3. Yuming, L.: Aviation rescue: enhance emergency hard power. J. Beijing Univ. Aeronautics
Astronautics Social Sci. 24(4), 15 (2011). https://doi.org/10.13766/j.bhsk.1008-2204.2011.
04.002
4. Bartolo, K., Furlonger, B.: Leadership and job satisfaction among aviation fire fighters in
Australia. J. Manag. Psychol. 15(1), 87–93 (2000). https://doi.org/10.1108/026839400103
05324
5. Shen, Y., Zhang, X., Guo, Y.: Discrete-event simulation of aviation rescue efficiency on
earthquake medical evacuation. In: Americas Conference on Information Systems (2018)
6. Grissom, C.K., Thomas, F., James, B.: Medical helicopters in wilderness search and rescue
operations. Air Med. J. 25(1), 18–25 (2006). https://doi.org/10.1016/j.amj.2005.10.002
7. Bearman, C., Rainbird, S., Brooks, B.P., et al.: A literature review of methods for providing
enhanced operational oversight of teams in emergency management. Int. J. Emergency Manage.
14(3), 254–274 (2018). https://doi.org/10.1504/IJEM.2018.094237
8. Fang-Zhong, Q.I.: Exploration on undergraduate education of new-type aviation emergency
rescue command talents. Fire Sci. Technol. 39(8), 1178 (2020). https://doi.org/10.3969/j.issn.
1009-0029.2020.08.037
9. Yanhua, L., Ran, L.: Construction of China aviation emergency rescue standard system. China
Safety Sci. J. 29(8), 178 (2019). https://doi.org/10.16265/j.cnki.issn1003-3033.2019.08.028
10. Zhu, H., Xie, N.: Aviation emergency rescue evaluation capability based on improved λρ
fuzzy measure. In: Proceedings of the 2017 IEEE International Conference on Smart Cloud
(SmartCloud), pp. 289–293. IEEE, New York, NY (2017). https://doi.org/10.1109/SmartCloud.
2017.54
11. Walker, K., Oeen, O.: A risk-based approach to the assessment of aviation emergency response.
In: Proceedings of the SPE International Conference and Exhibition on Health, Safety, Secu-
rity, Environment, and Social Responsibility, Abu Dhabi (2018). https://doi.org/10.2118/190
549-MS
12. Xia, Z.-H., Pan, W.-J., Lin, R.-C., et al.: Research on efficiency of aviation emergency rescue
under major disasters. Comput. Eng. Des. 33(3), 1251–1256 (2012). https://doi.org/10.16208/
j.issn1000-7024.2012.03.004
13. Hu, B., Pan, F., Zhang, Y.: Research on selection of general aviation rescue airports. In: Proceed-
ings of the Journal of Physics: Conference Series, vol. 1910, No. 1. IOP Publishing (2021).
https://doi.org/10.1088/1742-6596/1910/1/012023
14. Sun, X., Liu, H., Yang, C., et al.: Virtual simulation-based scene modeling of helicopter earth-
quake search and rescue. In: Proceedings of the AIP Conference Proceedings, vol. 1839, No.
1, p. 020140. AIP Publishing LLC (2017). https://doi.org/10.1063/1.4982505
15. Pan, W., Xu, H., Zhu, X.: Virtual drilling platform for emergency rescue of airport based on
VR technology. J. Saf. Sci. Technol. 16(2), 136–141 (2020). https://doi.org/10.11731/j.issn.
1673-193x.2020.02.022
16. Meleschenko, R.G., Muntyan, V.K.: Justification of the approach for calculating the parameters
of aviation emergency and rescue operations when using visual search (2017)
17. Sage, A.P.: Decision support systems engineering. Wiley-Interscience (1991)
18. Parnell, G.S., Driscoll, P.J., Henderson, D.L.: Decision making in systems engineering and
management. Wiley (2011)
19. Blanchard, B.S.: System engineering management. Wiley (2004)
20. Buede, D.M., Miller, W.D.: The engineering design of systems: models and methods (2016)
21. Kaslow, D., Anderson, L., Asundi, S., et al.: Developing a cubesat model-based system engi-
neering (mbse) reference model-interim status. In: Proceedings of the 2015 IEEE Aerospace
Conference, pp. 1–16 (2015). https://doi.org/10.1109/AERO.2015.7118965
22. Madni, A.M., Madni, C.C., Lucero, S.D.: Leveraging digital twin technology in model-based
systems engineering. Systems 7(1), 7 (2019). https://doi.org/10.3390/systems7010007
23. Weilkiens, T., Lamm, J.G., Roth, S., et al.: Model-based system architecture. Wiley (2015)
24. Miletić, S., Milošević, M., Mladenović, V.: A new methodology for designing of tactical inte-
grated telecommunications and computer networks for OPNET simulation. Sci. Tech. Rev.
70(2), 35–40 (2020)
25. Pan, X., Yin, B., Hu, J.: Modeling and simulation for SoS based on the DoDAF framework. In:
Proceedings of 2011 9th International Conference on Reliability, Maintainability and Safety,
pp. 1283–1287. IEEE (2011). https://doi.org/10.1109/ICRMS.2011.5979468
26. Tao, Z.-G., Luo, Y.-F., Chen, C.-X., et al.: Enterprise application architecture development
based on DoDAF and TOGAF. Enterprise Inf. Syst. 11(5), 627–651 (2017). https://doi.org/10.
1080/17517575.2015.1068374
Chapter 14
An AI-Based System Offering Automatic
DR-Enhanced AR for Indoor Scenes
Abstract In this work, we present an AI-based Augmented Reality (AR) system for indoor planning and refurbishing applications. AR can be an important medium for such applications, as it facilitates more effective concept conveyance and additionally acts as an efficient and immediate designer-to-client communication channel. However, since AR can only overlay content and cannot replace or remove existing objects, our system relies on Diminished Reality (DR) to support deployment in real-world, already furnished indoor scenes. Further, and contrary to the traditional mobile AR application approach, our system offers on-demand Virtual Reality (VR) viewing based on spherical (360°) panoramas, capitalizing on their user-friendliness for indoor scene capturing. Given that our system is an integration of different AI services, we analyze how its performance depends on the components comprising it. This analysis is both quantitative and qualitative, with the latter realized through user surveys, and provides a complete systemic assessment of an attempt at a user-facing, automatic AR/DR system.
14.1 Introduction
a medium between digitized concepts and the real scene, facilitating effective and
efficient communication and feedback between its users and improving the iterative
design process. Indicatively, an AR system for rearranging a furniture layout was
proposed in [15], while in [12] a system employing a dynamic user interface for
placing 3D virtual furniture models was developed. However, both aforementioned
systems required multiple QR markers to allow users to physically position the virtual
furniture.
Even though AR enables interaction with virtual objects inside real environments, it is purely additive in nature, and a practical problem arises when working in occupied, already furnished indoor scenes, as is the case for AR home design applications [20]. Concepts like redecoration cannot be delivered solely through AR technology, as users would only be able to superimpose CG elements on top of the existing real-world objects, hindering understanding due to a conflicting mental response. To overcome this, AR needs to be supported by DR, which can diminish existing objects before overlaying new virtual ones and provide users with an enhanced view for assessing how furniture fits into their spaces. DR is an intriguing
technology that can enable novel concepts. One example is intercar see-through
vision, which aims at preventing accidents [14] and diminishes (i.e., “removes”)
the front car. In this particular case, DR is driven by multi-view observations and
view synthesis. There are cases though, where no view behind the removed object is
available, and then DR needs to hallucinate content, typically referred to as infilling
or inpainting [10]. Pioneering work in the DR domain was presented by [5], where
a patch-based image inpainting method was developed. Follow-up work [8] moved
beyond image-based diminishing and transitioned toward respecting scene geom-
etry by exploiting SLAM-based localization. More recently, an inpainting method
for non-planar scenes was developed [16] that considered both color and depth infor-
mation. Still, in both cases, manual selection of the region to be removed in the image
domain was required. To allow for easier selection of the object to be diminished in
indoor scenes for interior design, [17] used a manually positioned and scaled volume
to enclose the object of interest. In addition, the floor plane was identified by inserting
a marker into the scene. Real-time six degrees-of-freedom DR without manual object
selection is challenging [13] and requires a 3D reconstruction of the scene without
the object of interest but with the diminishing area annotated, limiting its flexibility.
When considering AR interior home refurnishing, where quickly prototyping ideas matters, minimizing user interactions is very important, as users also need to position the new elements in the scene [7].
Still, all of the aforementioned studies work on narrow field-of-view inputs, limiting the amount of information available for each scene and thus degrading their performance on large objects (e.g., furniture), while at the same time they do not strictly respect the structure of the environment. To overcome this, moving cameras relying on SLAM [16] or wider field-of-view captures [7] are employed, but these limit user-friendliness and are more error prone. In this work, we present a system that addresses the challenges of cumbersome user selection of the diminished area and of user scanning, delivering DR-enhanced AR for indoor scene planning and design. To achieve that, our system is AI based, operating on single monocular image capture,
Figure 14.2 shows a high-level overview of our system comprising two main sub-
systems, and the nominal data flow among the various components. Each component
is an AI model, trained on the Structured3D dataset [23].
As presented in Fig. 14.2, the two sub-systems operate in cascade, while the DR
sub-system also includes a parallel component connection. The DR sub-system first
processes the input panoramic image by estimating the scene’s layout and segmenting
the distinct objects inside the scene. Then, for each segmented object in the scene,
the inpainting component is invoked to diminish the object and prepare the input
for the AR superimposition.
Fig. 14.1 Imagine that you want to redesign your living space and replace existing furniture with new ones. We propose a system consisting of various AI services for enabling next-generation AR indoor re-planning and design experiences. Users only require a single 360° camera capture that produces a spherical panorama of their indoor space. Then, our AI-based system automatically generates a high-level understanding of the scene, both semantically and structurally, enabling automatic selection of objects to be removed or replaced. This is driven by employing DR technology that incorporates the inferred scene structural prior to generate plausible hallucinations, eventually offering a compelling and effective AR experience. The top row shows the overall concept and higher-level component connections, while the bottom row shows an actual example from the Structured3D dataset, where a bed is replaced within a room
Fig. 14.2 Overview of the proposed automatic DR-enhanced 360° AR system. The system can be dissected into two high-level sub-systems, the DR one on the left and the AR one on the right, operating in cascade. The former is responsible for the automatic diminishing of the scene and the latter for user-driven augmentation. Given an input panorama, the scene's junctions L(x) and objects' masks S(x) are first estimated in parallel by the corresponding data-driven components, with L and S being the layout and segmentation AI models, respectively. Then the data-driven inpainting component is invoked, I(x, L(x), S(x)), with I being the respective AI model. Diminishing is achieved by inpainting the object's mask in a structure-aware manner using the dense layout map. The diminished panorama y is up-sampled by invoking R(y), where R is a super-resolution model. Finally, the 3D object is positioned in the scene, producing the DR-enhanced AR panorama image
Since data-driven models typically operate in lower resolutions than required for panorama viewing, the AR sub-system first invokes a super-resolution component to rescale the diminished area back to 360° viewing resolution. AR is user driven by positioning elements into the scene that interact with
the masked regions depending on their projection to select the appropriate diminished
panorama. Still, users may simply wish to remove an object from the scene, which is straightforwardly supported. In the following subsections, the different AI building blocks comprising our automatic DR-enhanced AR system are presented.
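To make the cascade concrete, the sketch below treats each component of Fig. 14.2 as a callable, following the notation of the figure caption; the function is purely illustrative and does not correspond to the actual service implementations.

```python
# Minimal sketch of the nominal data flow of Fig. 14.2; L, S, I, R stand in for the
# layout, segmentation, inpainting, and super-resolution models, respectively.
def dr_pipeline(x, L, S, I, R):
    junctions = L(x)             # scene layout junctions, L(x)
    masks = S(x)                 # per-object segmentation masks, S(x)
    y = I(x, junctions, masks)   # structure-aware inpainting, I(x, L(x), S(x))
    return R(y)                  # up-sampled diminished panorama, R(y)
```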
In order to diminish an object from a residential indoor scene, the object’s pixel-
aligned area within the image must be available. For this purpose, we employ a
semantic segmentation network to infer object masks for a set of a priori selected
classes, commonly present in residential scenes. We use the DeepLabv3 architecture
[3] with a ResNet50 [4] backbone, which has shown reliable and robust results in
segmentation tasks, offering a great compromise between accuracy and speed. The
network was supervised using cross-entropy and trained for 133 epochs using the
Adam optimization algorithm [9], with default parameters, a learning rate of 0.0002,
and a scheduler halving it every 20 epochs.
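A minimal sketch of this training configuration, assuming the torchvision implementation of DeepLabv3 with a ResNet50 backbone, is given below; the class count, data loading, and epoch handling are illustrative placeholders rather than the exact setup used.

```python
# Hedged sketch of the segmentation training setup (PyTorch / torchvision).
import torch
import torchvision

NUM_CLASSES = 21  # assumption: number of a priori selected residential object classes

model = torchvision.models.segmentation.deeplabv3_resnet50(weights=None, num_classes=NUM_CLASSES)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)                        # lr = 0.0002, default betas
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)  # halve lr every 20 epochs

def train(loader, epochs=133):
    model.train()
    for _ in range(epochs):
        for images, targets in loader:       # targets: per-pixel class indices (B, H, W)
            optimizer.zero_grad()
            logits = model(images)["out"]    # (B, NUM_CLASSES, H, W)
            loss = criterion(logits, targets)
            loss.backward()
            optimizer.step()
        scheduler.step()
```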
Another prerequisite of the inpainting component is the scene’s dense layout segmen-
tation (i.e., the per-pixel classification into the ceiling, wall, or floor classes). This is
required to preserve the scene’s structure during diminishing, which is a very important cue for the downstream applications (i.e., planning or designing). We use the
HorizonNet model [19] to estimate the locations of the scene’s junctions.
The core of our AI-based DR sub-system is the inpainting AI model which is respon-
sible for object diminishing. Apart from the input panorama, it additionally requires
an object mask and the scene’s layout segmentation map, as depicted in Fig. 14.2. The
latter provides the structure of the scene as corner positions, which are subsequently
reconstructed as the dense layout, while the former is a requisite for specifying
the object to be diminished. We adopt a structure-aware 360° inpainting model [2]
that uses SEAN residual blocks [24] to aid in hallucinating plausible content with
semantic coherency in the diminished region. SEAN blocks leverage the structural
information provided by the input semantic maps (the layout segmentation in our
case) and use it as structural guidance.
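For intuition, the snippet below shows, in heavily simplified form, how a dense layout map can modulate decoder features as structural guidance, in the spirit of SEAN/SPADE-style blocks; the actual SEAN blocks of [24] additionally inject per-region style codes, which are omitted here.

```python
# Simplified semantic-region-adaptive modulation sketch (not the exact SEAN block of [24]).
import torch.nn as nn
import torch.nn.functional as F

class LayoutModulation(nn.Module):
    def __init__(self, channels, num_labels=3):          # ceiling / wall / floor
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.gamma = nn.Conv2d(num_labels, channels, kernel_size=3, padding=1)
        self.beta = nn.Conv2d(num_labels, channels, kernel_size=3, padding=1)

    def forward(self, feat, layout_onehot):
        # layout_onehot: (B, num_labels, H, W) one-hot dense layout map
        seg = F.interpolate(layout_onehot, size=feat.shape[-2:], mode="nearest")
        return self.norm(feat) * (1 + self.gamma(seg)) + self.beta(seg)
```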
To alleviate the aforementioned issue concerning the low resolution of the processed panoramas, we resort to a lightweight super-resolution model [22] to upscale the diminished result by a factor of up to ×4. That way, we offer results appropriate for panorama viewers, without degrading their visual quality.
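To illustrate what such a ×4 upscaling step entails architecturally, a generic sub-pixel convolution head is sketched below; the deployed component uses the lightweight pixel-attention model of [22], not this module.

```python
# Generic x4 sub-pixel (PixelShuffle) upsampler, shown only for illustration.
import torch.nn as nn

class X4Upsampler(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 3 * 16, 3, padding=1),  # 16 = 4**2 sub-pixel factor
            nn.PixelShuffle(4),                          # rearranges channels into a 4x larger grid
        )

    def forward(self, y):        # y: diminished panorama at model resolution
        return self.body(y)      # same content at 4x the spatial resolution
```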
Our models are trained with PyTorch [11] and delivered as services using TorchServe
[1]. Our components share a common communication interface that is built around
callback URLs, with all inputs and outputs delivered as end points to either retrieve
(GET) or submit (POST) data. This interface makes our system highly modular since
the communication interface is decoupled from the back-end functionality of each
component.
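A hedged sketch of this callback-URL convention is given below using the requests library; the endpoint handling and payload format are illustrative assumptions rather than the actual interface of our services.

```python
# Illustrative component wrapper: retrieve (GET) the input, run inference, submit (POST) the result.
import requests

def run_component(input_url, callback_url, predict):
    panorama_bytes = requests.get(input_url, timeout=30).content  # retrieve the input panorama
    result_bytes = predict(panorama_bytes)                        # back-end inference (e.g., a TorchServe model)
    requests.post(callback_url, data=result_bytes, timeout=30)    # submit the output to the callback endpoint
```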
The system orchestration is realized as a web server, where each upload triggers
a chain of events as follows. At first, the object segmentation and layout estimation
models are invoked to estimate the object masks and the room layout. Since we rely
on semantic segmentation, we perform connected component analysis to resolve
potentially different instances and split each segmentation map into multiple per-
class and object masks. To improve robustness, we use the convex hull for each mask
in an attempt to decouple the diminished region shape from the result (the inpainting
model is trained similarly). Likewise, the junction estimates are post-processed to
generate a dense layout map by first connecting the top and bottom boundaries and
then identifying the corresponding structural labels across each column. Finally, for
all object masks, the inpainting service is called, with its result fed into the super-
resolution service and then composited on the original panorama. The outputs are
then ready to be queried by the AR component that positions the 3D object, whose
renders interact with the masks on the image domain to retrieve the appropriate result.
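The mask and layout post-processing steps described above can be sketched as follows with OpenCV and NumPy; the function names, the area threshold, and the exact label encoding are illustrative assumptions and not the actual implementation.

```python
# Hedged sketch of the per-class mask splitting (connected components + convex hull)
# and of the per-column dense layout labeling described above.
import cv2
import numpy as np

def split_instance_masks(class_mask, min_area=500):
    """Split a binary per-class mask into per-instance convex-hull masks."""
    num, labels = cv2.connectedComponents(class_mask.astype(np.uint8))
    instance_masks = []
    for i in range(1, num):                        # label 0 is the background
        component = (labels == i).astype(np.uint8)
        if component.sum() < min_area:
            continue
        hull = cv2.convexHull(cv2.findNonZero(component))
        hull_mask = np.zeros_like(component)
        cv2.fillConvexPoly(hull_mask, hull, 1)     # decouple region shape from the object silhouette
        instance_masks.append(hull_mask)
    return instance_masks

def dense_layout_from_boundaries(top_y, bottom_y, width, height):
    """Per-column ceiling (0), wall (1), floor (2) labels from the two boundary curves."""
    layout = np.ones((height, width), dtype=np.uint8)   # wall by default
    rows = np.arange(height)[:, None]
    layout[rows < top_y] = 0                             # ceiling above the top boundary
    layout[rows >= bottom_y] = 2                         # floor below the bottom boundary
    return layout
```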
Fig. 14.3 Component ablation experiments setup visualized with a vertical macro-view of the DR
sub-system of Fig. 14.2. (a) refers to the experiments where both room layout and object masks are estimated by the system's data-driven components, (b) the layout path is ablated by replacing the estimations with the annotated ground truth while preserving the segmentation mask estimates, (c) the dual configuration to (b), with the segmentation path ablated and the layout estimations preserved, and (d) where both components are replaced by the ground truth layout and object masks
While objective analysis can help in identifying critical components and assessing
the system’s overall performance, the end result cannot be quantitatively assessed.
This is either because ground truth is not necessarily available, or due to the subjec-
tivity of the results. Still, end user appreciation is the ultimate goal, and as a result, we
additionally performed a user survey for the entire system’s outputs. We used remote
questionnaires that were distributed to 38 users split into two sub-groups, one having
no knowledge regarding its inner workings (i.e., Group A) and the other knowledge-
able regarding AI (i.e., Group B). The questionnaires required the participants to rate
the appearance of a masked area in each one of the different scenes.
An interactive panorama viewer was used, with the initial viewpoint looking at the object to be removed. For each scene, users were first allowed to
freely navigate the entire scene in three degrees of freedom, and then an annotated
panorama with the object to be removed or replaced was presented to them. This
process ensures that users will not get lost within the 360° field-of-view and will
understand the task at hand. Afterward, users were asked to score the appearance of
the previously marked area, once presented with the object removed (i.e., pure DR),
and then once with a virtual object replacing the previous one (i.e., DR-enhanced
AR). After all scenes were evaluated, users were asked to rate the scenes again,
this time without DR, scoring the result of the pure virtual object superimposition on
the existing real object (i.e., pure AR). This last step was isolated from the previous
ones to remove any bias when scoring DR results. Scoring was based on a 5-point
Likert scale, resulting in aggregated mean opinion scores (MOS). Figure 14.4 depicts
samples used in the survey.
Fig. 14.4 Example survey scene types. The first column depicts the original panorama, the second
column the panorama with the object removed (i.e., pure DR), the third column the one with the
virtual furniture added in the diminished scene (i.e., DR-enhanced AR), and the final column the
one with the virtual object added without previously removing the existing object (i.e., pure AR)
Before presenting the results of our experiments, it is worth noting the potential
sources of errors. Since the inpainting component is dependent on the results of the
layout and segmentation models, it is expected that any errors in these components
will be accumulated in the final diminished result. Under-segmenting an object may
result in the erroneous diminishing of scenes since artifacts of the old object will
be present around the inpainted region. Similarly, over-segmenting may potentially
remove important relevant objects like chairs next to a table, resulting in uncanny
visuals.
Another potential source of error is the layout junction localization. The inpainting
model heavily depends on the layout of the input, as described in Sect. 14.3. Given
that the boundaries reconstructed from the junctions are used to generate the dense
layout segmentation map used to drive the SEAN decoding blocks, such errors will
propagate into both style code generation and the diminished area boundary sepa-
rating the different structural areas. As a consequence, even slight errors in the
junctions’ coordinates will translate to large misclassified regions, manifesting in
severe diminishing distortions.
Table 14.1 shows the quantitative results for the experiments described above.
The first row which showcases the best performance is the case (d) of Fig. 14.3,
where both models are replaced with perfect estimates. This is in contrast to the last
row, corresponding to case (a) of Fig. 14.3, which relies on all models’ predictions.
Interestingly, cases (b) and (c) are the most interesting ones as they present us with
the weakest link of the DR sub-system, which is the layout estimation model, given
that when replaced with the annotated layouts, performance consistently increases.
As the segmentation model produces reasonable results, the sparser junction local-
ization errors propagate deeper into the diminished result, which is reasonable as the
structural segmentation is responsible for both style code extraction and boundary
preservation.
Table 14.1 Quantitative results assessing the DR sub-system output by ablating its components.
Arrows denote direction of better performance
Experiment                                     PSNR ↑   SSIM ↑   MAE ↓    LPIPS ↓
(d) GT layout, GT object masks                 29.61    0.9393   0.0131   0.1127
(b) GT layout, estimated object masks          29.13    0.9353   0.0134   0.1149
(c) Estimated layout, GT object masks          27.37    0.9126   0.0166   0.1259
(a) Estimated layout, estimated object masks   27.86    0.9189   0.0158   0.1225
Figure 14.5 presents the results of the user survey. The left columns aggregate MOS
scores across all scenes, while the remaining columns present the results for each
scene in sequence. The top row presents the results for all subjects, while the bottom
row splits them into two different groups, those not familiar with AI (i.e., Group A)
and those experienced with it (i.e., Group B). From these results, it is evident that
purely diminished scenes were rated lower than diminished scenes with augmen-
tations overlaid. This is expected as superimposing content on the DR result may
potentially hide defects. Further, the final scenes without DR where the virtual object
was simply overlaid on the actual ones, without removing them, scored lower than the
scenes where the real objects had been diminished/removed. Nevertheless, the statis-
tical confidence is lower, and this is partly expected as not all scenes may require DR.
Indeed, there are cases where the objects are of similar size and shape, rendering DR not that important. The availability of the functionality, however, is very important
for the remaining cases and may even outweigh the need to deliver high-quality DR
results.
Regarding the two user groups, those familiar with AI presented larger discrepancies between the different scene types, albeit the ranking across both groups
remained the same.
14.5 Conclusion
In this work, we present a system that can drive user-facing applications for interior
design. The focus of our system is on usability as it relies on 360° image acquisition
of scenes, compared to scanning processes that tax users and are more error prone.
Further, we lift the requirement for manually marking the diminished region and seek
to preserve the room structure during diminishing which is highly relevant for the
targeted application domain. Our system is purely AI based, a fact that introduces the
need for assessing error propagation between its different components. To that end,
we present a system ablation analysis, accompanied by a user survey that showcases
the need for DR in indoor AR planning. Nonetheless, our work operates directly in the image domain (i.e., 2D), and besides the benefits this introduces, it inevitably only offers perspective views and neglects occlusion effects.
Another limitation is that the current system has only been verified with synthetic data. The Structured3D dataset offers annotations for all sub-tasks apart from the super-resolution one, a trait that real-world datasets will not easily provide. Apart from that, the application to in-the-wild real-world data is expected to reduce performance, which will require revisiting our analysis. Future work will focus on overcoming these challenges by integrating geometric inference (e.g., depth) to support more advanced features like occlusions and lighting, and by transitioning to real-world training data and validation.
Fig. 14.5 Results of the user survey. The first row depicts the total average rating for all three cases, i.e., pure DR (empty), DR-enhanced AR (DR), and pure AR (AR), across all scenes (first column) as well as for each scene separately in the following columns
References
19. Sun, C., Hsiao, C., Sun, M., Chen, H.: HorizonNet: learning room layout with 1D representation
and pano stretch data augmentation. In: IEEE Conference on Computer Vision and Pattern
Recognition, CVPR 2019, Long Beach, CA, USA, June 16–20, 2019, pp. 1047–1056 (2019)
20. Wong, K., Jiddi, S., Alami, Y., Guindi, P., Totty, B., Guo, Q., Otrada, M., Gauthier, P.: Exploiting ARKit depth maps for mixed reality home design. In: 2020 IEEE International Symposium on
Mixed and Augmented Reality (ISMAR) (2020)
21. Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness
of deep features as a perceptual metric. In: Proceedings of the IEEE conference on computer
vision and pattern recognition, pp. 586–595 (2018)
22. Zhao, H., Kong, X., He, J., Qiao, Y., Dong, C.: Efficient image super-resolution using pixel
attention (2020). arXiv:2010.01073
23. Zheng, J., Zhang, J., Li, J., Tang, R., Gao, S., Zhou, Z.: Structured3D: a large photo-realistic dataset for structured 3D modeling. In: Proceedings of the European Conference on Computer
Vision (ECCV) (2020)
24. Zhu, P., Abdal, R., Qin, Y., Wonka, P.: SEAN: image synthesis with semantic region-adaptive
normalization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pp. 5104–5113 (2020)
Chapter 15
Extending Mirror Therapy into Mixed
Reality—Design and Implementation
of the Application PhantomAR
to Alleviate Phantom Limb Pain
in Upper Limb Amputees
15.1 Introduction
All participants were recruited in accordance with the Declaration of Helsinki and
based on the guidelines of the ethical approval by the University of Tuebingen,
Germany (181/2020BO1).
We evaluated the PhantomAR application on the HoloLens 2 in terms of usability
using the System Usability Scale (SUS) with ten able-bodied participants (7 male,
3 female, 30.4 ± 5.6 years) and two unilateral transradial (forearm) amputees (1 male, 56 years; 1 female, 36 years) as proof of concept. Thereby, the real arm of the able-bodied
participants was covered so the HoloLens would not recognize it. About 80% reported
no previous AR experience. The SUS consists of a 10-item questionnaire with a 5-
point Likert scale. It has become an industry standard and allows the evaluation of
a wide variety of products, including hardware, mobile devices and applications.
Additionally, we prepared a user-centered survey consisting of 10 questions such as
“Do you prefer storytelling to guide you through the game?” to evaluate graphics,
ownership, interaction with the virtual objects using both the virtual and real arm,
and comfort of wearing the HoloLens 2. We asked all participants to evaluate the
PhantomAR application regarding intrinsic motivation using the Game Experience Questionnaire (GEQ), consisting
of 5 subscales (positive affect, negative affect, flow, challenge, immersion) and 2
additional subscales for control and non-anthropomorphic feedback on a 5-point
Likert scale with 1 meaning “completely disagree” and 5 meaning “completely agree”
[36].
Additionally, we asked the patients to rate PLP sensation before, during and after-
ward on the Numerical Rating Scale (NRS). Similarly to able-bodied participants,
we assessed their game experience with a user-centered survey and added specific
questions pertaining to prosthesis control and PLP. Prosthetic embodiment was eval-
uated by the Prosthesis Embodiment Scale consisting of 10 items and 3 subscales
(ownership, agency and anatomical plausibility) with a rating scale ranging from −3
(strongly disagree) to +3 (strongly agree) [37]. Evaluating PhantomAR was a one-
time intervention, in which all 4 interaction scenes were trialed twice in a random
order.
The setup has been deliberately chosen to be minimal in order to ensure efficient
integration into daily clinical practice. The required devices were the Microsoft
HoloLens 2 and two discontinued Thalmic Myo armbands. The armbands, which
used to be commercially available, include a 9-axis inertial measurement unit (IMU)
(InvenSense MPU-9150) for positional tracking, 8 active EMG electrodes and a
vibration motor for haptic feedback (see Fig. 15.1). For real-time external monitoring, a Windows computer running Unity was used. No further external sensors were required for positional tracking, and the complete setup operates wirelessly and battery powered.
The created interaction scenes adapt to the room size available and are supposed
to be performed while moving in a given space of 10–20 m².
Participants were fitted with two Myo armbands on the upper and lower arm of the
residual limb, respectively. After donning the HoloLens 2, which required no cables
and was completely battery operated, the virtual arm and threshold control were
calibrated and a profile of the user containing scale and relative shoulder position
was saved once. All these preparatory steps took less than 5 min and only needed to be performed the first time.
No further information was given on the game, and all participants were naive.
The only instruction they received was to explore their environment with every means
available to them.
After exploring all four scenes twice, all participants took part in evaluating the
application. The average time to play all four scenes twice was 25 min (±4.3). None
of the participants experienced cyber (motion) sickness at any time during the use.
Fig. 15.1 Patients with a transradial (forearm) amputation without (upper images) and wearing
the PhantomAR system consisting of the mixed reality device Microsoft HoloLens 2 and two
myoelectric electrode armbands (Thalmic, lower images). The setup is completely wireless and
does not restrict movement
PhantomAR has been implemented using the game development platform Unity 3D
version 2019.4.20f and the Microsoft Mixed Reality Toolkit. The game was installed
on the Microsoft HoloLens 2 and connected via the Bluetooth Low Energy protocol to the Thalmic Myo armbands, receiving already filtered IMU and EMG data.
The game design was focused on increasing immersion and avoiding frustration
and discomfort for the patients, which would negatively impact PLP. Potential prob-
lems were identified as goals leading to mental stress, difficult tasks leading to high
muscle tension and failure to achieve a desired outcome leading to frustration.
Therefore, a curiosity-driven gameplay was chosen, where the patients can freely
explore an interesting and interactive environment without the possibility of failure
or underperforming. To ensure an immersive experience, the performance and
reactiveness of the game was closely monitored during development.
To allow patients to immerse themselves in the augmented reality experience, the
rehabilitative exercises were integrated into various playful scenes. However, there
is no task associated with these scenes. The patients should explore their environ-
ment curiously and discover for themselves what is possible in this specific scene or
environment. We built four different interaction scenes. All scenes used orientation
and acceleration of the hand and arm as well as different EMG signals as input. The
interaction with virtual objects could always take place with the virtual and the real
hand.
These scenes were:
(A) A fruit-picking scene where players could collect fruits spawning at random
locations in the actual room, such as on desks, on walls, in cabinets or on the
floor. This necessitated walking around the room to retrieve the fruits. Once grabbed by the virtual or real hand, they could be interacted with, i.e.,
enlarged by dragging the contralateral edges of the fruit as seen in Fig. 15.2.
(B) A shooting game in which players could aim and shoot at flowers sprouting
on surfaces in the room which were recognized by the HoloLens 2 grid. Once
critically hit, they wilt and different flowers spawn at various locations.
(C) Drawing into the air or onto surfaces with certain EMG activity and arm
movements. The color palette and brush could be changed.
Fig. 15.2 A patient grasping a banana from the desk with their virtual hand (left image) and
proceeding to use their healthy limb to aid in a bi-manual interaction to enlarge the banana while still
keeping it firmly in their grasp (right image)
Fig. 15.3 A patient uses their virtually augmented limb to interact with a manipulable game
element. Collision with the game element at a certain speed disrupts its structural integrity, whereas
activation of an EMG signal above a certain threshold would trigger a change in the color
(D) A scene consisting of bubbles varying in color and size, which, when touched or
interacted with, lost their structural integrity and dispersed into smaller bubbles
or changed color or speed (see Fig. 15.3).
The HoloLens supports automatic spatial mapping, scanning the floor, walls and
real objects like tables or boxes to allow the interaction of virtual content and the
real world. This feature is used for many game elements, like plants spawning on
surfaces, bullets colliding with walls and other objects, and also for placing virtual
objects on real objects while playing the game.
The virtual arm including the hand was a rigged 3D object, i.e., it consisted of a list of
bones connected via joints, which could be rotated to perform physiologically plausible movements. The visible presentation of the virtual arm was a rendered mesh that was
connected to the underlying bone structure. In addition to controlling a human arm,
the arm object could be exchanged, and the subject could control a virtual tentacle
instead (see Fig. 15.4). The human arm and the tentacle were controlled via the same
set of degrees of freedom (DoFs), a 3D rotation of the upper and lower arm, 1D wrist
rotation and hand opening/closing, where hand opening and closing could switch
between different grip modes.
To have a virtual arm that follows the exact movements of the actual residual arm,
the IMU data of two Thalmic Myo armbands was used, measured on the upper arm
and on the lower arm (transradial stump). The position of the shoulder was fixed in
Fig. 15.4 a Tentacle as seen during the video stream while using the PhantomAR application. b
Tentacle 3D model
relation to the head position and was adapted to match the individual subjects. The 3D
orientation received from the Thalmic Myo armbands was applied to the respective
arm segments representing the upper and lower arm. IMU sensors determine their
spatial orientation via accelerometers, gyroscopes and magnetometers; however, the
received data was affected by a horizontal drift over time and required calibration.
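Conceptually, the mapping from the two armband orientations to the virtual arm is a short forward-kinematic chain anchored at the fixed shoulder; the engine-agnostic sketch below is illustrative only (in the application this runs inside Unity), and the offset and segment lengths are placeholder values.

```python
# Illustrative forward-kinematic chain from two IMU orientations (quaternions in x, y, z, w order).
import numpy as np
from scipy.spatial.transform import Rotation as R

SHOULDER_OFFSET = np.array([0.18, -0.25, 0.0])   # fixed offset from the head pose (placeholder)
UPPER_ARM_LEN, LOWER_ARM_LEN = 0.30, 0.26        # per-subject scale from the saved profile (placeholder)

def arm_joint_positions(head_pos, q_upper, q_lower):
    """Shoulder is fixed relative to the head; elbow and wrist follow the two IMU orientations."""
    shoulder = head_pos + SHOULDER_OFFSET
    elbow = shoulder + R.from_quat(q_upper).apply([0.0, 0.0, -UPPER_ARM_LEN])
    wrist = elbow + R.from_quat(q_lower).apply([0.0, 0.0, -LOWER_ARM_LEN])
    return shoulder, elbow, wrist
```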
The virtual hand was controlled via myoelectric signals recorded at the muscles of
the lower arm (transradial stump) with the Thalmic Myo armband. To avoid poten-
tial frustration for the patient, we deliberately chose a simple and robust threshold
controller, similar to regular prostheses. Two electrodes on agonist/antagonist
muscles recorded the activation, and when exceeding a threshold, the virtual hand
either opened or closed with the speed proportional to the muscle activation. For
opening and closing, two poses had been defined for the virtual hand, one in the open
position and one in closed position. The path of the movement between the positions
was calculated as an interpolation of the rotation of the individual bones/joints. The
same movement logic of interpolating between two positions defined as endpoints was used for the second rigged 3D model, the tentacle.
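The threshold control and endpoint-pose interpolation can be summarized in the engine-agnostic sketch below; the threshold and gain values are placeholders, and joint rotations are treated as scalars for brevity, whereas the Unity implementation interpolates the rotations of the rigged joints.

```python
# Illustrative dual-electrode threshold controller and open/closed pose interpolation.
import numpy as np

THRESHOLD = 0.2   # normalized EMG activation threshold (placeholder)
GAIN = 2.0        # maps activation to opening/closing speed (placeholder)

def update_hand_state(state, emg_open, emg_close, dt):
    """state in [0, 1]: 0 = fully open endpoint pose, 1 = fully closed endpoint pose."""
    if emg_close > THRESHOLD:
        state += GAIN * emg_close * dt     # close with speed proportional to activation
    elif emg_open > THRESHOLD:
        state -= GAIN * emg_open * dt      # open with speed proportional to activation
    return float(np.clip(state, 0.0, 1.0))

def interpolate_pose(open_pose, closed_pose, state):
    """Interpolate each joint value between the two predefined endpoint poses."""
    return {joint: (1 - state) * open_pose[joint] + state * closed_pose[joint]
            for joint in open_pose}
```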
15.3.6 Calibration
The IMU sensors of the Thalmic Myo armbands were calibrated by instructing the
subject to extend their arm forward in a neutral position and place it in the same
space as the virtual arm, which was projected into their view by the HoloLens.
Interaction with virtual objects was possible with both the virtual and the healthy
hand. The virtual hand had attached colliders that closely matched its shape, enabling
physical interactions with virtual objects, such as pushing a ball. Small objects, like
marbles, could pass between the virtual fingers creating an immersive interaction
experience.
Another mode of interaction was a grabbing motion, which was activated when
opposing fingers touched the virtual object. Successfully grabbing an object was
accompanied by a short vibration of the Thalmic Myo armband, which was shown to
reduce the time needed for grabbing [27]. The virtual object followed the movement
of the hand and could be carried around until released via an opening of the hand.
The scenes also consisted of special interaction types that could be triggered with
EMG activation. This included the release of paint on the fingertip, shooting a bullet
or changing the color of virtual objects.
The healthy hand was tracked by the HoloLens 2 and provided the position of the
fingers and palm. This made it possible for the real fingers to interact with the virtual
objects as well. Once a suitable object was grabbed with both hands, ambidextrous
interactions with the object were possible such as rotating and dragging it larger or
smaller.
In order to give the therapist the possibility to guide the patient and to control the
virtual scenarios, we have developed a remote control app that can be run on a
computer. The remote app is optional and it communicates with the HoloLens 2 via
a Wi-Fi connection. It provides a live video stream of the mixed reality as it is seen
by the patient. The virtual scenarios used in this study can also be controlled via the
remote app, for example, objects can be manually created or restored to their original
state. In addition, the remote app is used to make various configuration settings such
as establishing the Bluetooth connection to the wristbands or calibrating the EMG
controller.
15.4 Results
The application received an overall SUS score of 78.5, rated by all 12 participants, indicating slightly above-average usability and user-friendliness.
The following responses were obtained from all 12 participants in the user-
centered survey: Wearing the HoloLens 2 felt comfortable and users had a posi-
tive experience while interacting with virtual objects within the actual environment,
which were perceived as real. Haptic feedback as provided by the Thalmic Myo
armbands supported the immersion of grasping objects, and controlling the interface was intuitive. However, more feedback mechanisms, apart from haptic feedback on
object interactions, should be incorporated. Ownership of the overlaid virtual arm
was rated highly, though agency could still be improved. The arm is perceived as a controllable part of the user; however, the control algorithm could still be refined.
No one was interested in a storytelling approach that would guide the user through
the application.
The results of the game experience questionnaire are shown in Fig. 15.5. All
participants received the application very positively and had no negative feelings or
felt overwhelmed while playing. Both the immersion and the game flow were rated
highly. Control of the virtual arm was rated with an average of 3.5. The use of a
tentacle instead of a real-looking virtual arm did not pose a problem for the patients
and was also rated positively for the most part, even though, during the survey, they stated that they preferred an arm similar to their own.
PLP was rated 5 on the NRS by both patients before the game and 4 after using the
application. During the game, both patients reported in the user-centered survey that
their PLP decreased while participating in the application. However, on the NRS,
one patient actually reported an increase to 6.
The Prosthesis Embodiment Scale showed a high rating of agency, indicating that the patients experienced congruent control of their own prostheses and considered the performed movements as their own (see Fig. 15.6). During PhantomAR, a high sense of agency related to their own prosthetic control is beneficial to controlling the application.
15.5 Discussion
With PhantomAR, we wanted to develop a wearable assistive therapy tool for PLP
that not only liberates users from their restrictive position at a table, but also allows
them to perform bi-manual tasks and freely interact with virtual objects as well as
objects found in their actual environment.
Fig. 15.5 The results of the game experience questionnaire show 5 subscales for positive and
negative affect, immersion, flow, and challenge, as well as 2 additional subscales for rating the
control over the virtual arm in the PhantomAR app and for rating the experience of operating a
prefabricated tentacle instead of the image of an arm
Fig. 15.6 Showing the 3 subscales of the prosthesis embodiment scale for both patients. The agency
subscale was rated highest, indicating a feeling of congruent control during movement of their own
prosthetic hand
Using a tentacle for a hand was a concept which was new to both patients, but
they embraced the idea and stated that it did not necessarily need to be their hand,
or a hand. In fact, they thought it was fun to explore in the game; however, in real
life, they preferred an anthropomorphic prosthesis to a marine animal.
One patient was certain that PLP was lower during the mixed reality experience, while the other patient described a lessening of pain during active play but later reported on the NRS that pain had increased during play. Both patients agreed that PLP was lower after using the application. Of course, this one-time proof of concept cannot provide a statement about the alleviation of PLP. Therefore, increasing the sample size can not only provide more insight on PLP but also on embodiment and its progress over time.
One of the challenges with AR glasses is the restrictive field of view, which might
lead to reduced immersion when not operating in the center of vision.
The Thalmic Myo armbands accumulated a tracking error that required calibrating
after 5–10 min of using the system to avoid a horizontal drift. As the Myo armbands
use a 9-axis IMU containing a magnetometer, this drift should be possible to avoid
without additional hardware. Other groups have shown that the Thalmic Myo IMU
data (without post-processing) has no drift [40].
The latency of the movement of the real arm to the visual representation of the
corresponding virtual arm was not directly measured, but for usual arm movements,
there is no noticeable lag. The latency is assumed to be below 50 ms, as the data is received from the Thalmic Myo armbands with a latency of around 25 ms [40] and translated to the virtual arm position within the next frame (a single frame adds roughly 17 ms at 60 fps). This is a comparably low latency that has not yet been reported in other studies, in which the latency was 500–800 ms when controlling a virtual arm using custom IMU sensors [27].
The finger-tap for periodic re-calibration can be unobtrusively integrated as a game
element, requiring the user to perform a task with the augmented arm stretched out
and tapping onto the Myo armband with the other arm.
PhantomAR was not designed to be goal-oriented, but curiosity driven. There is no intended or evaluated task transfer from a virtual hand to an actual myoelectric prosthesis. There might be one; however, the idea of PhantomAR is to simply use the hands, or hand-like representations, to move through the room and explore the environment. Intrinsic motivation, the curiosity about what one might be able to find out, should be the primary drive.
It was important to not only provide applications for research, but also transfer
them to the clinic. They should be as easy to use as possible, with separate user
interfaces for the clinician and the patient. Therefore, the patient only has to mount
the devices and can start interacting. In addition, the whole system is portable and
completely wireless and can thus be used anywhere in the clinic or even at home. The
system automatically detects the room; therefore, there are no special requirements
for the room in which it is used.
15.6 Conclusion
In this paper, we explored how conventional mirror therapy can be reflected and
extended in a mixed reality approach using the HoloLens 2.
Immersion could be increased from a technical perspective by creating a spatially coherent experience of the virtual and real world, responsively interacting with each other and underpinned by haptic feedback. The virtual as well as the real hand could perform independently from each other or together. Players could move around freely and actively and safely explore their surroundings, driven by intrinsic motivation and curiosity.
Addressing complex health-related and quality of life impacting issues such as
PLP through novel technology requires interdisciplinary teamwork among therapists,
engineers and researchers. To gain further insight on the impact of XR mirror therapy,
we plan to conduct a four-week intervention study using the application four days
per week to compare the intensity, frequency and quality of PLP and embodiment.
Currently, PhantomAR is exclusively available for transradial (forearm) amputees,
but in the future, we plan to extend it to transhumeral (upper arm) amputees as well.
References
1. Trojan, J. et al.: An augmented reality home-training system based on the mirror training and
imagery approach. Behav. Res. Methods. (2014)
2. Mayer, Á., Kudar, K., Bretz, K., Tihanyi, J.: Body schema and body awareness of amputees.
Prosthetics Orthot. Int. 32(3), 363–382 (2008)
3. Clark, R.L., Bowling, F.L., Jepson, F., Rajbhandari, S.: Phantom limb pain after amputation in
diabetic patients does not differ from that after amputation in nondiabetic patients. Pain 154(5),
729–732 (2013)
4. Flor, H.: Phantom-limb pain: characteristics, causes, and treatment. Lancet Neurol. 1(3), 182–
189 (2002)
5. Rothgangel, A., Braun, S., Smeets, R., Beurskens, A.: Feasibility of a traditional and tele-
treatment approach to mirror therapy in patients with phantom limb pain: a process evaluation
performed alongside a randomized controlled trial. Clin. Rehabil. 33(10), 1649–1660 (2019)
6. Richardson, C., Crawford, K., Milnes, K., Bouch, E., Kulkarni, J.: A clinical evaluation
of postamputation phenomena including phantom limb pain after lower limb amputation in
dysvascular patients. Pain. Manag. Nurs. 16(4), 561–569 (2015)
7. Perry, B.N. et al.: Clinical trial of the virtual integration environment to treat phantom limb
pain with upper extremity amputation. Front. Neurol. 9(9) (Sept 2018)
8. Rothgangel, A., Bekrater-Bodmann, R.: Mirror therapy versus augmented/virtual reality applica-
tions: towards a tailored mechanism-based treatment for phantom limb pain. Pain Manag. 9(2),
151–159 (March 2019)
9. Foell, J., Bekrater-Bodmann, R., Diers, M., Flor, H.: Mirror therapy for phantom limb pain:
brain changes and the role of body representation. Eur. J. Pain 18(5), 729–739 (2014)
10. Tsao, J., Ossipov, M.H., Andoh, J., Ortiz-Catalan, M.: The stochastic entanglement and
phantom motor execution hypotheses: a theoretical framework for the origin and treatment
of phantom limb pain. Front. Neurol. 9, 748 (2018). www.frontiersin.org
11. Moseley, L.G., Gallace, A., Spence, C.: Is mirror therapy all it is cracked up to be? Current
evidence and future directions. Pain 138(1), 7–10 (2008)
12. Dunn, J., Yeo, E., Moghaddampour, P., Chau, B., Humbert, S.: Virtual and augmented reality
in the treatment of phantom limb pain: a literature review. NeuroRehabilitation 40(4), 595–601
(2017)
13. Thøgersen, M., Andoh, J., Milde, C., Graven-Nielsen, T., Flor, H., Petrini, L.: Individualized
augmented reality training reduces phantom pain and cortical reorganization in amputees: a
proof of concept study. J. Pain 21(11–12), 1257–1269 (2020)
14. Boschmann, A., Neuhaus, D., Vogt, S., Kaltschmidt, C., Platzner, M., Dosen, S.: Immersive
augmented reality system for the training of pattern classification control with a myoelectric
prosthesis. J. Neuroeng. Rehabil. 18(1), 1–15 (2021)
15. Andrews, C., Southworth, M.K., Silva, J.N.A., Silva, J.R.: Extended reality in medical practice.
Curr. Treat. Options Cardio. Med. 21, 18 (2019)
16. Ortiz-Catalan, M., et al.: Phantom motor execution facilitated by machine learning and
augmented reality as treatment for phantom limb pain: a single group, clinical trial in patients
with chronic intractable phantom limb pain. Lancet 388(10062), 2885–2894 (2016)
17. Lendaro, E., Middleton, A., Brown, S., Ortiz-Catalan, M.: Out of the clinic, into the home: the
in-home use of phantom motor execution aided by machine learning and augmented reality for
the treatment of phantom limb pain. J. Pain Res. 13, 195–209 (2020)
18. Bach, F., et al.: Using Interactive Immersive VR/AR for the Therapy of Phantom Limb Pain.
Hc’10 Jan, pp. 183–187 (2010)
19. Ambron, E., Miller, A., Kuchenbecker, K.J., Buxbaum, L.J., Coslett, H.B.: Immersive low-cost
virtual reality treatment for phantom limb pain: evidence from two cases. Front. Neurol. 9, 67
(2018)
20. Markovic, M., Karnal, H., Graimann, B., Farina, D., Dosen, S.: GLIMPSE: Google glass
interface for sensory feedback in myoelectric hand prostheses. J. Neural. Eng. 14(3) (2017)
21. Tepper, O.M., et al.: Mixed reality with hololens: where virtual reality meets augmented reality
in the operating room. Plast. Reconstr. Surg. 140(5), 1066–1070 (2017)
22. Saito, K., Miyaki, T., Rekimoto, J.: The method of reducing phantom limb pain using optical
see-through head mounted display. In: 2019 IEEE Conference on Virtual Reality and 3D User
Interfaces (VR), pp. 1560–1562 (2019)
23. Lin, G., Panigrahi, T., Womack, J., Ponda, D.J., Kotipalli, P., Starner, T.: Comparing order
picking guidance with microsoft hololens, magic leap, google glass XE and paper. In: Proceed-
ings of the 22nd International Workshop on Mobile Computing Systems and Applications, vol.
7, pp. 133–139 (2021)
24. Gorisse, G., Christmann, O., Amato, E.A., Richir, S.: First- and third-person perspectives in immersive virtual environments: presence and performance analysis of embodied users. Front.
Robot. AI 4, 33 (2017)
25. Nishino, W., Yamanoi, Y., Sakuma, Y., Kato, R.: Development of a myoelectric prosthesis
simulator using augmented reality. In: 2017 IEEE International Conference on Systems, Man,
and Cybernetics (SMC), pp. 1046–1051 (2017)
26. Ortiz-Catalan, M., Sander, N., Kristoffersen, M.B., Håkansson, B., Brånemark, R.: Treatment
of phantom limb pain (PLP) based on augmented reality and gaming controlled by myoelectric
pattern recognition: a case study of a chronic PLP patient. Front. Neurosci. 8(8), 1–7 (Feb
2014)
27. Sharma, A., Niu, W., Hunt, C.L., Levay, G., Kaliki, R., Thakor, N.V.: Augmented reality
prosthesis training setup for motor skill enhancement (March 2019)
28. Tatla, S.K., et al.: Therapists’ perceptions of social media and video game technologies in upper
limb rehabilitation. JMIR Serious Games 3(1), e2 (2015).
29. Lohse, K., Shirzad, N., Verster, A., Hodges, N.: Video games and rehabilitation: using design
principles to enhance engagement in physical therapy, pp. 166–175 (2013)
30. Arya, K.N., Pandian, S., Verma, R., Garg, R.K.: Movement therapy induced neural reorgani-
zation and motor recovery in stroke: a review. J. Bodyw. Mov. Ther. (2011)
31. Primack, B.A., et al.: Role of video games in improving health-related outcomes: a systematic
review. Am. J. Prev. Med. (2012)
32. Kato, P.M.: Video games in health care: closing the gap. Rev. Gen. Psychol. (2010)
33. Gamberini, L., Barresi, G., Majer, A., Scarpetta, F.: A game a day keeps the doctor away: a
short review of computer games in mental healthcare. J. Cyber Ther. Rehabil. (2008)
34. Gentles, S.J., Lokker, C., McKibbon, K.A.: Health information technology to facilitate commu-
nication involving health care providers, caregivers, and pediatric patients: a scoping review.
J. Med. Internet Res. (2010)
35. Johnson, D., Deterding, S., Kuhn, K.A., Staneva, A., Stoyanov, S., Hides, L.: Gamification for
health and wellbeing: a systematic review of the literature. Internet Interv. (2016)
36. Ijsselsteijn, W.A., Kort, Y.A.W.D., Poels, K.: The game experience questionnaire. In: Johnson,
M.J., VanderLoos, H.F.M., Burgar, C.G., Shor, P., Leifer, L.J. (eds) Eindhoven, vol. 2005, no.
2013, pp. 1–47 (2013)
37. Bekrater-Bodmann, R.: Perceptual correlates of successful body–prosthesis interaction in lower
limb amputees: psychometric characterisation and development of the prosthesis embodiment
scale. Sci. Rep. 10(1), (Dec 2020)
38. Prahm, C., Schulz, A., Paaßen, B., Aszmann, O., Hammer, B., Dorffner, G.: Echo state networks
as novel approach for low-cost myoelectric control. In: Artificial Intelligence in Medicine: 16th
Conference on Artificial Intelligence in Medicine, AIME 2017, June 21–24, 2017, Proceedings,
no. Exc 277, Vienna (pp. 338–342). Austria, Springer (2017)
39. Harris, A.J.: Cortical origin of pathological pain. Lancet 354(9188), 1464–1466 (1999)
40. Nymoen, K., Romarheim Haugen, M., Jensenius, A.R.: MuMYO—evaluating and exploring the MYO armband for musical interaction. In: Proceedings of the International Conference on New Interfaces for Musical Expression (2015)
Chapter 16
An Analysis of Trends and Problems
of Information Technology Application
Research in China’s Accounting Field
Based on CiteSpace
Abstract Using CiteSpace software and taking the number of publications, the main authors and institutions, the research topics, and the research fronts as indexes, a text mining and visual analysis of the existing literature in the domestic CNKI from 2000 to 2020 is conducted. In step with the development of practice, the volume of research literature on the application of information technology in accounting has increased year by year, but its quality has not improved accordingly; big
data, management accounting, financial sharing, cloud accounting, and blockchain
technology have been in the spotlight of recent research; the Ministry of Finance,
the National Accounting Institute and financial support played an important role.
However, there are still challenges, such as a lack of cross-institutional and cross-
regional cooperation among scholars, limited research on accounting informatization
construction of SMEs, and inadequate literature on accounting education. Strengthening guidance and support and promoting cooperation and exchanges can continuously advance the mutual progress of theoretical research and practical innovation in the application of information technology in the field of accounting.
The literature selected for this paper comes from the CNKI database. In order to ensure the representativeness and authority of the selected data, the literature sources are restricted, through the advanced search function, to journals in the Peking University core list and the CSSCI database. The collected content includes keywords, authors, institutions, article titles, publication time, publications, and abstracts. The search themes include financial sharing, big data accounting, Internet accounting, accounting computerization, accounting informatization, accounting cloud computing, accounting intelligence, blockchain,
and artificial intelligence. The retrieval date is February 1, 2021, and the time period covered is 2000–2020. In total, 5501 documents were retrieved and imported into the software, and duplicates were removed through the data module, yielding 4136 valid documents cited 45,658 times, with an average citation frequency of 11.04.
Visual Analysis of the Publication Volume. As shown in Fig. 16.1, the number of applied research publications on information technology in accounting has increased year by year, and the growth rate is fast, indicating that with the development of information
technology, related theoretical research is also receiving attention from academic
circles. The number of publications has increased significantly, and the growth trend
of non-core publications is roughly the same. However, research papers published in
Peking University core and CSSCI journals have not changed significantly, meaning
that the quality of research is not improving significantly. It may be related to the
fact that empirical research is more prevalent in core journals. At the same time, related research topics have also generated new branches. Especially after 2013,
technologies such as financial sharing, cloud accounting, big data, and intelligent
finance have emerged. Likewise, there are no significant changes in the number of
articles published in core journals. Compared with accounting computerization and
accounting informatization, the number and quality of articles on various branch
topics are insufficient at this stage. In conclusion, from 2000 to 2020, the number
of applied research papers on information technology in the accounting field has
increased year by year. However, the quality of the research should be improved.
Fig. 16.1 Annual publication volume, 2000–2020 (all journals)
Visual Analysis Based on Research Themes. It should be noted that the threshold algorithm is set in the initial processing parameters of CiteSpace, and c (minimum citations), cc (co-citations in this slice), and CCV (co-citations after specification) are, respectively, (4, 4, 20), (6, 5, 20), and (6, 5, 20), and the time slice is one year. In
addition, keyword analysis of the literature data results in 114 nodes (N = 114), 199 critical paths (E = 199), and a network density of 0.0309 (Density = 0.0309). The frequency threshold is set to greater than or equal to 30, and the co-occurrence network (Fig. 16.2) and TimeZone view (Fig. 16.3) of the main keywords are obtained. In the
keyword co-occurrence network map, each node represents a keyword, and the size
of the node reflects the frequency of that keyword. The color of the node reflects the
publication date of the document where the keyword appears, and the darker color
represents an earlier publication date.
Analysis of Co-occurrence of Keywords. In Figs. 16.2 and 16.3, it can be seen that
in the past 20 years, China’s accounting computerization and accounting informati-
zation research have occupied the top two positions, respectively, with centralities
of 0.32 and 0.18. A finer breakdown shows that the two themes were at the core of research for the period 2001–2010, after which the research focus showed a trend of shifting toward big data accounting research. Relevant research has transitioned from the
study of accounting software to accounting information systems and internal control
and then to a combined application of information technology and management
accounting. This has promoted the rapid transformation of accounting from tradi-
tional accounting functions to management and service functions. Accounting for
management is closely related to informatization. To some extent, the management
accounting boom is driven by national policies. In 2013, the Ministry of Finance
designated management accounting as an important direction of accounting reform,
and in 2014, it issued the “Guiding Opinions on Comprehensively Promoting the
Construction of Management Accounting System,” which ushered in a period of rapid
development for management accounting [2]. Before 2012, the focus of management
accounting research was cost control and theoretical exploration. After 2012, with the
rise of big data and artificial intelligence, management accounting research content
gradually enriched, and in 2014, it became a national strategy [3].
The powerful data mining, processing, and analysis capabilities of big data tech-
nology have expanded the information sources of management accounting, enabling
it to unearth the potential value from data to enhance the competitiveness of enter-
prises and maximize its benefits [4, 5]. Based on the above analysis, it is evident that
big data and management accounting are at the core of the research at this stage,
echoing the development of practice.
Timeline Chart Analysis. The first stage (2001–2010) covers the research topics of domestic core journals and CSSCI papers. The initial processing parameters of CiteSpace are set to k = 25, with the time period 2001–2010 and a one-year time slice; the timeline mode in the visualization options is chosen to draw a sequence diagram of keyword co-occurrence clustering. According to Fig. 16.4, the graph has 487 nodes and is divided into eight clusters. The module value is 0.4321, and the contour value is 0.7434.
Clusters 1, 3, 5, and 8 are computerized accounting subjects. The main
keywords are computerized accounting, computerized auditing, computerized
accounting system, accounting data, accounting software, accounting center, office
automation, commercialized accounting software, accounting reform, computerized
data graphs, simple, and repetitive positions will be replaced by accounting systems,
and accountants will be transformed into data analysts [8].
In cluster 1, there is the CPA industry, which focuses on examining the develop-
ment of the accounting profession, the training of accounting talents, and the quali-
fications of accountants and related policies. The main keywords include accounting
information system, accounting management work, accounting firm, accounting
service market, international financial reporting standards, certified public accountant
industry, the accounting industry, accounting information standard system, small and
medium accounting firms, and non-auditing businesses. The application of informa-
tion technology to the accounting field has stimulated the development of the industry.
The application of information technology in the accounting field has stimulated the
development of the certified public accountant industry, and the research on auditing
technology and methods is also a hot topic in the new era. For instance, Xu Chao
classified auditing into three stages: computer-assisted audit, network audit, and big
data audit [9]. A recent study by Chen Wei et al. applied text mining and visual
analysis based on big data technology to the area of auditing, leading to an entirely
new field of research [10].
Cluster 2 focuses on computerized accounting. Keywords are computerized
accounting, reporting system, teaching video, vocational college, accounting major,
teaching reform, and open education. Currently, the number of studies on computerized accounting has been significantly reduced, and the finance function has developed from accounting to being service-oriented and is developing toward digitization and
artificial intelligence [11]. Big data, financial sharing, and artificial intelligence are
gradually being applied to the accounting field, and centralization is gradually being
achieved.
Cluster 3 is dedicated to shared services. The main keywords are shared service
center, shared service model, national audit quality, audit quality, transaction rules,
and risk management and control, among which shared service center, shared service model, and risk management have higher betweenness centrality. Financial sharing has been a
hot topic in recent years; as early as 1998, Haier Group began to explore the strategy of financial information sharing [12]. At that time, due to technical limitations, development was not yet mature, and related research did not attract nationwide attention. In the new stage, accounting informatization is becoming increasingly mature, and financial sharing is no longer a simple online analysis tool. Its value in optimizing organizational structure, optimizing processes, and reducing costs has
been recognized. Currently, financial sharing is widely used by large group compa-
nies and state-owned enterprises, and there is still room for other technologies to
be embedded within financial sharing. For example, combining RPA technology
and OCR scanning technology with financial sharing can dramatically enhance the
automation of corporate finance work. It can reduce the human error rate and reduce
the operating costs of the enterprise [13].
Cluster 5 concerns blockchain, and the main keywords are smart contract, consensus
mechanism, expendable biological assets, database technology, blockchain technology,
business and finance integration, data mining, and surplus manipulation, among which
smart contract, blockchain technology, and consensus mechanism have the highest
betweenness centrality. Although blockchain has been widely applied to improve
information quality, most of the research focuses on audit investigation, and its value
for enterprise management has yet to be discovered. The application of blockchain in
financial sharing can promote the financial intelligence of enterprises, globalization
of management and control, shared services, and integration of business and finance
[14].
In the second stage, accounting informatization evolved from research on concepts
and systems to the application of information technology. In the accounting field, the
number of applied research projects on big data, financial sharing, blockchain, robotic
process automation, and other technologies increased dramatically, and the content and
results improved as well. In the context of management accounting research, applying
different information technologies in combination with management has enriched the
work of finance staff and promoted the transformation of finance personnel from simple
bookkeeping to enterprise management [15], which has a far-reaching impact on
accounting research and practice.
Analysis of the Emergence of Research Frontiers. Emergent (burst) keywords are
commonly used to analyze the frontier or research trends of a field. As seen in
Table 16.1, among the 25 keywords extracted in this paper, seven keywords have an
emergence strength greater than 20; in descending order they are accounting
computerization (110.37), big data (62.5), management accounting (37.74), blockchain
(37.51), cloud accounting (33.77), financial sharing (24.38), and industry-finance
integration (20.23). During 2001–2008, computerized accounting was the core research
theme, with a strength of 110, followed by accounting software and accounting
information systems. Since 2009, the CPA profession has been a hot topic, continuing
until 2016. With the rapid application of information technology to the accounting
industry, cloud accounting in 2013, big data and management accounting in 2014, and
blockchain, financial sharing, and industry-finance integration in 2017 became hotspots
in turn, and their emergence strength remained high as of 2020.
Visual Analysis Based on the Lead Authors and Institutions. CiteSpace is configured
with the g-index algorithm (k = 25) in the initial processing parameters, and the time
slice is one year. Authors and institutions in the literature are analyzed together, and the
initial results show 967 nodes (N = 967), 861 links (E = 861), and a node density of
0.0018 (Density = 0.0018). With the frequency threshold set to 10 or more, the
co-occurrence network of the main authors and institutions is obtained. According to
Fig. 16.6, in the co-occurrence diagram of authors and institutions, each node represents
an author or a research institution. The size of a node reflects the number of published
articles; the color of a node reflects the time of publication, with darker colors indicating
earlier publication; the links between nodes reflect cooperation between authors,
between authors and institutions, and between institutions; and the thickness of a link
reflects the closeness of the cooperation.
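As a quick sanity check on the reported network statistics, the node density of an undirected co-occurrence network can be recomputed from the node and link counts. The sketch below is written in C#; the formula Density = 2E / (N(N − 1)) is the standard undirected-graph density and is assumed here to match CiteSpace's definition, which it reproduces for the reported figures.

```csharp
using System;

class NetworkDensityCheck
{
    // Density of an undirected graph: 2E / (N * (N - 1)).
    static double Density(int nodes, int edges) =>
        2.0 * edges / ((double)nodes * (nodes - 1));

    static void Main()
    {
        int n = 967;  // nodes reported by CiteSpace
        int e = 861;  // links reported by CiteSpace
        Console.WriteLine($"Density = {Density(n, e):F4}");  // prints 0.0018
    }
}
```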
In Fig. 16.6, the author with the most papers is Professor Cheng Ping and his team
from the School of Accounting at Chongqing University of Technology, whose research
direction is the application of big data technology to accounting [16]; followed by
Zhang Qinglong from the Beijing National Accounting Institute, whose research
direction is financial sharing [17]; and Liu Yuting from the Ministry of Finance, whose
research focuses on accounting reform in China [18]; followed by Wang Jun, Yang Jie,
Huang Changyong, Ding Shuqin, Ying Limeng, and Liu Qin. Among the four major
research groups in the field of accounting informatization, the School of Accounting of
Chongqing University of Technology is the most active. The Accounting Department of
the Ministry of Finance, the Beijing National Accounting Institute, and the Shanghai
National Accounting Institute are also important research bases.
Figure 16.7 shows that, apart from the National Natural Science Foundation of China
and the National Social Science Foundation of China, the number of funded studies
supported by the Chongqing Municipal Education Commission is much higher than that
of the other funders, indicating that the Chongqing Municipal Education Commission
has paid sufficient attention to applying technology to accounting.
In summary, the Ministry of Finance, the National Accounting Institutes, and funding
support have played a major role in advancing this research. However, the cooperation
network of Chinese accounting scholars remains primarily internal, and the lack of
cross-institutional and cross-regional cooperation has had an adverse effect on its
progress.
Fig. 16.7 Number of studies by funding source: National Natural Science Foundation of China (83), National Social Science Foundation of China (73), scientific research projects of the Chongqing Municipal Education Commission (35), China Postdoctoral Science Foundation (11), and six other funds with fewer than ten studies each
The previous analysis found that the number of relevant studies is basically consistent
with the trend of practice, but the quality of research has not kept pace, and the lack of
cross-institutional cooperation among researchers has become a weak point of current
research. In a further study, we also found the following prominent problems in the
research.
First, although the number of publications on the application of information technology
in the field of accounting is rising, the number published in the Peking University core
journals and CSSCI journals has not changed significantly, and the quality of research
on the application of emerging technologies in accounting shows a declining trend
compared with research from the computerized accounting period, which indicates that
the quality of relevant research needs to be further improved.
Second, the research themes and hotspots of information technology application in the
accounting field change markedly with the development of information technology.
During 2001–2011, accounting computerization and accounting informatization were at
the core of research on information technology in accounting, and their publication
volume, centrality, and emergence strength were much higher than those of other topics.
With the gradual maturation of information technology, big data accounting became the
hottest topic in 2013–2020, followed by financial sharing, cloud accounting, and
blockchain. Overall, at this stage, big data and management accounting are at the core
of research; big data has opened up new paths for management accounting research,
and management accounting innovation has become a hot spot for current and future
research.
Third, research on the application of information technology in accounting draws
significant contributions from the Ministry of Finance, the National Accounting
Institutes, and fund-supported literature, which fully demonstrates the importance and
leadership of the state in promoting such applications; however, the research is mostly
confined within individual units, and the lack of cross-institutional and cross-regional
cooperation limits the breadth and depth of the research.
Fourth, research on the application of information technology in accounting for SMEs
and on the combination of information technology and accounting education is clearly
insufficient in quantity and generally low in quality, and urgently needs guidance and
attention.
The 14th Five-Year Plan for Accounting Reform and Development identifies "the
application of new information technology to basic accounting work, managerial
accounting practice, financial accounting work, and the construction of unit financial
accounting information systems" as the main subject of research. To better promote
information technology application research and enhance the integration of theoretical
research and practical innovation, government departments, application entities, and
research institutions must make joint efforts.
First, government departments should continue to lead research on applying information
technology in accounting: increase fund support, pay particular attention to improving
the quality of research results, and give more attention to the application of information
technology in accounting for small and medium-sized enterprises as well as to the
combination of information technology and accounting education. At the same time,
government departments should attach great importance to improving the soft power of
sustainable enterprise development by enhancing management accounting systems and
internal control mechanisms. The application of information technology in accounting
should not be limited to a particular enterprise, unit, or industry. Only through systematic
research that raises it to the theoretical level and forms a scientific theoretical system
for the effective combination of information technology and accounting can we truly
advance the height and depth of accounting informatization construction and give full
play to the positive role of accounting in enterprise management and economic
construction.
Second, accounting scholars should actively expand the scope of cooperation, strengthen
collaboration with government departments and enterprises, and make full use of
cross-institutional and cross-disciplinary collaboration to effectively solve practical and
difficult problems concerning the application of information technology in accounting,
so as to develop, beyond their own narrow perspectives, a new pattern of integrated
development of accounting information technology application and theoretical
innovation.
Finally, government departments should also attach greater importance to the research
and transformation of results in accounting education informatization and continue to
improve the collaborative education mechanism among industry, academia, and research.
Both the supply and demand sides of accounting informatization talent training should
raise awareness and strengthen communication. The development of the digital economy
has steadily raised the requirements for training accounting professionals, including
improving the teaching staff's comprehensive ability to apply information technology to
accounting teaching and promoting the transformation of the training model. Meeting
these requirements calls for scientifically promoting collaboration and exchange among
businesses, schools, and research institutions, and for enhancing the ability to develop
talent that integrates theory and application. The above measures are conducive to
cultivating higher-quality accounting informatization talent for society and consolidating
the human resources foundation for accounting to support the development of the
information economy.
The impact of information technology application in the accounting field is far-reaching,
and accounting theory research and practice innovation are equally important. Through
visual analysis, this paper traces the development of the research, summarizes and
refines the characteristics of current relevant research and some outstanding problems,
and puts forward corresponding suggestions, hoping to attract the attention of the
academic community. Only through the joint efforts of government departments,
accounting scholars, and practitioners will the future use of information technology in
the field of accounting become deeper and more positive.
References
1. Yue, C., Chaomei, C., Zeyuan, L., et al.: Methodological functions of CiteSpace knowledge
graphs. Scientology Res. 33(2), 242–253 (2015)
2. Man, W., Xiaoyu, C., Haoyang, Y.: Reflections and outlook on the construction of management
accounting system in China. Finan. Acc. (22), 4–7 (2019)
3. Zhanbiao, L., Jun, B.: Bibliometric analysis of management accounting research in China
(2009–2018): based on core journals of Nanjing University. Finan. Acc. Commun. 7, 12–18
(2020)
4. Maohua, J., Jiao, W., Jingxin, Z., Lan, Y.: Forty years of management accounting: a visual
analysis of research themes, methods, and theoretical applications. J. Shanghai Univ. Fin.
Econ. 22(01), 51–65 (2020)
5. Ting, W., Yinghua, Q.: Exploring the professional capacity building of management accounting
in the era of big data. Friends Account. 19, 38–42 (2017)
6. Qin, L., Yin, Y.: Accounting informatization in China in the forty years of reform and opening
up: review and prospect. Account. Res. 02, 26–34 (2019)
7. Qinglong, Z.: Next-generation finance: digitalization and intelligence. Financ. Account. Mon.
878(10), 3–7 (2020)
8. Weiguo, L., Guangjun, L., Shaobing, P.: The impact of data mining technology on accounting
and response. Financ. Account. Mon. 07, 68–74 (2020)
9. Chao, X., et al.: Research on auditing technology based on big data. J. Electron. 48(05),
1003–1017 (2020)
10. Chen, W., et al.: Research on audit trail feature mining method based on big data visualization
technology. Audit Res. 201(1), 16–21 (2018)
11. Shangyong, P.: On the development and basic characteristics of modern finance. Financ.
Account. Mon. 881(13), 22–27 (2020)
12. Zhijun, W.: Practice and exploration of financial information sharing in Haier group. Financ.
Account. Newslett. 1, 30–33 (2006)
13. Ping, C., Wenyi, W.: Research on the optimization of expense reimbursement based on RPA
in financial shared service centers. Friends Account. 589(13), 146–151 (2018)
14. Runhui, Y.: Application of blockchain technology in the field of financial sharing. Financ.
Account. Mon. 09, 35–40 (2020)
15. Gang, S.: Innovation of management accounting personnel training mechanism driven by big
data and financial integration. Financ. Account. Mon. 02, 88–93 (2021)
16. Ping, C., Jinglan, Z.: Performance management of financial sharing center based on cloud
accounting in the era of big data. Friends Account. 04, 130–133 (2017)
17. Qinglong, Z.: Financial sharing center of Chinese enterprise group: case inspiration and
countermeasure thinking. Friends Account. 22, 2–7 (2015)
18. Yuting, L.: Eight major areas of accounting reform in China are fully promoted. Financ.
Account. 01, 4–10 (2011)
19. Yumei, J.: Discussion on the construction of cloud computing accounting information tech-
nology for small and medium-sized enterprises. Financ. Account. Commun. 07, 106–109
(2018)
20. Xiaoyi, L.: Research on the application of management accounting informatization in small
and medium-sized enterprises in China. Econ. Res. Ref. 59, 64–66 (2016)
21. Weibing, Z., Hongjin, Z.: Exploration of the design and implementation of flipped classrooms
based on effective teaching theory. Financ. Account. 04, 85–86 (2020)
22. Yan, N., Chunling, S.: Visual analysis of accounting talent training research—based on the
data of CNKI from 2009–2018. Financ. Account. Commun. 15, 172–176 (2020)
Chapter 17
Augmented Reality Framework
and Application for Aviation Emergency
Rescue Based on Multi-Agent and Service
Abstract Aviation emergency rescue is one of the most efficient ways to rescue and
transport people and to deliver supplies. Dispatching multiple aircraft for air rescue
over a large area requires systematic planning. Given the complexity of such a system,
a framework is proposed to build an augmented reality system that presents the situation
and assists decision-making. An augmented reality simulation and monitoring system
for aviation emergency rescue based on multi-agent and service is implemented to apply
the framework.
17.1 Introduction
Aircraft, including fixed-wing aircraft and helicopters, have the advantages of rapid
mobility, multi-type loading capability, and few terrain restrictions, and they have been
applied more and more to emergency rescue. Missions such as aviation firefighting [1],
aeromedical rescue [2], aviation search and rescue [3], and aviation transport can be
collectively called aviation emergency rescue. With its enormous land and sea area,
China has a great need for aviation emergency rescue when disasters strike. Yet
maintaining a large aircraft fleet in every city is impossible for economic reasons. Thus,
how to deploy and dispatch aircraft within a given area becomes a problem. Wang et al.
[4] studied the deployment and dispatch of aviation emergency rescue. While methods
to deploy and dispatch aircraft have been discussed, an intuitive way of displaying and
commanding the process remains to be developed.
Augmented reality provides an efficient and intuitive way to present a virtual
environment in the physical world. It has been applied in the aeronautic field for training
and maintenance instruction [5], and models for large areas of land [6] and agent-based
models [7] have been developed in augmented reality. Augmented reality device
providers such as Microsoft and Magic Leap, and game
engines such as Unity3D and Unreal have developed tools and environments that let
developers build their augmented reality programs while paying little attention to
hardware adaptation and focusing on the program itself.
In this paper, the advantages of augmented reality are exploited for displaying and
commanding aviation emergency rescue. A framework for large-scale aviation
emergency rescue based on augmented reality is proposed. A system instance, developed
in Unity3D with MRTK and deployed on Microsoft's Hololens2, is built to verify the
framework.
Considering the demand for large-scale aviation emergency rescue, the framework of
the augmented reality system consists of two main parts: the service part and the
multi-agent part. Besides the augmented reality system itself, the framework also
contains a development environment part and a hardware part. The development
environment includes a toolkit to develop services for augmented reality, 3D modeling
software to build models of aircraft and cities, and a game engine to visualize the
system. The whole system is installed on augmented reality devices with which users
can watch and interact. The framework is shown in Fig. 17.1.
The service part contains services that offer functions when needed and can be called
one or more times while the system is running. The multi-agent part contains the two
main types of entities that are visualized in the system and instantiated in any application
that applies the framework.
The service-oriented architecture contains services in two groups, the system basic
services and the scenario services. The system basic services begin to run when the
system initializes and keep running in the background, offering services to get input
from the user, to send or receive data from other systems, and to persist the hologram
of the system anchored to a specific place in space. The scenario services, on the other
hand, are closely tied to aviation emergency rescue; services in this group are called
only once after the system is initialized.
Fig. 17.1 Augmented reality system framework for aviation emergency rescue
Those services function either as the first step to visualize the system or as the last step
when shutting it down. Among the basic services, receiving data can be called to update
the states of aircraft carrying emergency rescue tasks, and sending data can be called
when the user gives instructions about where and what task an aircraft will carry.
Scenario Service
The scenario services are closely correlated with the system's specific task scenarios.
They cover three aspects and are performed before, during, and after the scene.
The mission generation service is called when the system functions as a simulation or
training system for users, and it is called before the scene starts. This service creates
missions for the cities in the scene and allows aircraft to fulfill the missions.
The main event display service is called during the scene and acts as a notepad for
users to record commands, or gives users a hint of which mission is being accomplished.
The event log service is called after the scene, when all the missions have been
accomplished. It records every arrangement of the aircraft, including the aircraft ID, the
mission it carries, and the time the arrangement was made. The event log service saves
the log as a file, which can be used to evaluate aircraft efficiency and other indexes.
The multi-agent architecture defines two main types of agents in the system, aircraft
agents and city agents. Each type of agent has its own attributes and functions, and the
agents interact with each other to update their attributes.
Aircraft Agent
The aircraft agent is the base class of all aircraft objects in the system. This class has
the attributes and functions that represent how fixed-wing aircraft and helicopters behave
in the system.
The aircraft agent class has attributes including appearance, aircraft type, fuel load,
and loading ability. The appearance attribute models the aircraft and is used when
visualizing the system; the hologram of the appearance indicates the position and
rotation of the aircraft. The aircraft type is used for the main event display and event
logging and is one of the basic attributes of an aircraft. Fuel load represents the quantity
of fuel the aircraft is carrying; this attribute is taken into consideration when users
decide which mission the aircraft will accomplish. Loading ability measures how many
people, how heavy the supplies, and what kind of equipment the aircraft can carry; it is
another factor to consider when making decisions.
Functions of the aircraft agent class include planning a route, flying to a destination,
executing a task, and updating the load. Planning a route generates the waypoints toward
the destination, depending on the aircraft type and the terrain.
Flying to the destination can be called after the route is created; this function drives the
aircraft's position and rotation so that they are consistent with the real situation.
Executing a task is called after the aircraft reaches the destination and works together
with the update-load function: the two functions accomplish the task and update what
the aircraft carries.
City Agent
The city agent is the base class of all city objects in the system. The attributes and
functions of the city agent class match those of the aircraft agent class.
The city agent class has attributes including location, airport capacity, resource, and
resource demand. Location includes a location in the real world and a location in the
system, and each can be converted into the other when needed; the location is needed
when an aircraft plans a route. Airport capacity measures how many aircraft can land
and execute missions at the same time in the city; this attribute influences which mission
an aircraft will take. Resource measures what kind of resource and how much the city
can offer, so that aircraft can transport it to another city in need. Resource demand, on
the other hand, measures what kind of resource and how much the city needs.
The functions of the city agent are accepting aircraft, offering resources, and updating
demand. Accepting aircraft updates the number of aircraft at the airport to decide
whether the city can accept more aircraft. Offering resources is called when an aircraft
arrives in the city and loads supplies; it decreases the city's resource according to how
much the aircraft loads. Updating demand is called after an aircraft delivers supplies to
the city; resource demand is decreased by this function.
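The attribute and function names below are taken directly from this description; the concrete class layout, field types, and the simplified resource bookkeeping are illustrative assumptions rather than the authors' implementation. A minimal C# sketch of the two agent classes and their interaction might look as follows.

```csharp
using System;
using System.Collections.Generic;

// Minimal sketch of the aircraft agent described above (field types are assumed).
class AircraftAgent
{
    public string Id;
    public string AircraftType;      // fixed-wing or helicopter
    public double FuelLoad;          // quantity of fuel carried
    public double LoadingAbility;    // people/supplies/equipment capacity
    public double CurrentLoad;

    // Plan route: generate waypoints toward the destination (terrain ignored here).
    public List<string> PlanRoute(CityAgent destination) =>
        new List<string> { "takeoff", $"waypoint->{destination.Name}", "landing" };

    public void FlyTo(CityAgent destination, List<string> route)
    {
        // In the real system this drives the position and rotation of the hologram.
        Console.WriteLine($"{Id} flying to {destination.Name} via {string.Join(", ", route)}");
        destination.AcceptAircraft(this);
    }

    // Executing a task and updating the load together accomplish the mission.
    public void ExecuteTask(CityAgent city, string resource, double amount)
    {
        double delivered = Math.Min(amount, CurrentLoad);
        city.UpdateDemand(resource, delivered);
        UpdateLoad(-delivered);
    }

    public void UpdateLoad(double delta) => CurrentLoad = Math.Max(0, CurrentLoad + delta);
}

// Minimal sketch of the city agent described above.
class CityAgent
{
    public string Name;
    public int AirportCapacity;
    public Dictionary<string, double> Resource = new Dictionary<string, double>();
    public Dictionary<string, double> ResourceDemand = new Dictionary<string, double>();
    private readonly List<AircraftAgent> parked = new List<AircraftAgent>();

    public bool AcceptAircraft(AircraftAgent a)
    {
        if (parked.Count >= AirportCapacity) return false;   // airport is full
        parked.Add(a);
        return true;
    }

    public double OfferResource(string resource, double requested)
    {
        double available = Resource.TryGetValue(resource, out var r) ? r : 0;
        double given = Math.Min(available, requested);
        Resource[resource] = available - given;
        return given;
    }

    public void UpdateDemand(string resource, double delivered)
    {
        double need = ResourceDemand.TryGetValue(resource, out var d) ? d : 0;
        ResourceDemand[resource] = Math.Max(0, need - delivered);
    }
}
```

In this sketch an aircraft would call PlanRoute, FlyTo, and ExecuteTask in sequence, which mirrors the calling procedure described later in connection with Fig. 17.7.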
The construction of the augmented reality system contains three parts: the development
environment, the services, and the entities. MRTK and Unity3D together offer the
fundamental environment to develop a program targeting the Universal Windows
Platform, which can be released on Hololens2. Services are established in the
development environment, and some of them use functions provided by MRTK or
Unity3D. Entities are objects in the scene and are visualized by Unity3D. When data or
instructions are transferred from outside systems by the network transfer service among
the basic services, the states of aircraft and cities can be changed by the services. The
construction of the augmented reality system is shown in Fig. 17.2.
The system basic services are adapted from functions that already exist in MRTK or
Windows. Some services can be realized in more than one way; this paper chose one of
them, while the others are still introduced.
The spatial anchor and share service is based on Unity3D's built-in XR SDK. In
Unity3D, a component named "World Anchor" can be added to an object, which links
the object to Hololens2's understanding of an exact point in the physical world. Unity
also provides a function to transfer world anchors between devices, named "World
anchor transfer batch." The flowchart of the spatial anchor and share service is shown
in Fig. 17.3. Other options include using the services provided by World Locking Tools,
which is available in higher Unity3D versions, or using image recognition. World
Locking Tools is similar to World Anchor and is based on Hololens2's understanding of
the real world. Image recognition uses a pre-placed picture in the physical world to
locate the device and initialize the system, and the user's observation position in the
system is then based on data from 6-DoF sensors.
Gesture recognition in the system uses MRTK's input services. In the MRTK
configuration profile, the gestures part of the input section can be changed to switch to
another gesture setting; in this paper, the default gestures profile provided by MRTK is
used. In addition, in the articulated hand tracking part, hand mesh visualization is set to
"everything" so that users can confirm that their hands are tracked by the device, and
the teleport system is disabled in the teleport section since the system needs no
teleportation.
The network transfer service uses sockets based on the UDP protocol. After the local IP
address and port are bound, a new thread is started to listen to the local area network.
This listening thread runs in parallel with the main thread so as not to block the system's
main logic. When the system is shut down, the listening thread is interrupted and
aborted. Data transferred between systems is a byte array encoded from a struct, JSON,
or string. The flowchart of the network transfer service is shown in Fig. 17.4, where the
flow of other systems is simplified and only their data transfer logic remains on the
right of the network transfer service's workflow.
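A minimal sketch of such a UDP listener is shown below; the port number, the UTF-8/JSON encoding, and the message handling are illustrative assumptions rather than the authors' code, and only standard .NET socket APIs are used.

```csharp
using System;
using System.Net;
using System.Net.Sockets;
using System.Text;
using System.Threading;

class NetworkTransferService
{
    private UdpClient udp;
    private Thread listenThread;
    private volatile bool running;

    public void Start(int port = 8888)   // the port is an assumed example value
    {
        udp = new UdpClient(new IPEndPoint(IPAddress.Any, port));
        running = true;
        // Listening runs on its own thread so the main (render) loop is not blocked.
        listenThread = new Thread(Listen) { IsBackground = true };
        listenThread.Start();
    }

    private void Listen()
    {
        var remote = new IPEndPoint(IPAddress.Any, 0);
        while (running)
        {
            try
            {
                byte[] data = udp.Receive(ref remote);          // blocking receive
                string json = Encoding.UTF8.GetString(data);    // e.g. JSON-encoded state
                Console.WriteLine($"Received from {remote}: {json}");
            }
            catch (SocketException) { /* socket closed on shutdown */ }
            catch (ObjectDisposedException) { /* socket disposed on shutdown */ }
        }
    }

    public void Send(string json, string host, int port)
    {
        byte[] data = Encoding.UTF8.GetBytes(json);
        udp.Send(data, data.Length, host, port);
    }

    public void Stop()
    {
        running = false;
        udp.Close();              // unblocks Receive so the thread can exit
        listenThread.Join();
    }
}
```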
The mission generation service contains two steps. Step one allocates missions to cities
randomly. Step two calculates the minimum resources that can fulfill the demand and
allocates them to cities that do not demand those resources, so that aircraft can transport
the supplies to the cities in need.
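Read literally, the two steps can be sketched as below, reusing the CityAgent class from the earlier sketch; the mission sizes, the random allocation, and the even split of supply among non-demanding cities are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class MissionGenerationService
{
    static readonly Random Rng = new Random();

    // Step one: randomly assign a resource demand (mission) to each city.
    public static void AllocateMissions(List<CityAgent> cities, string[] resources)
    {
        foreach (var city in cities)
        {
            string r = resources[Rng.Next(resources.Length)];
            city.ResourceDemand[r] = Rng.Next(1, 10);   // assumed demand size
        }
    }

    // Step two: place just enough supply in cities that do not demand that resource.
    public static void AllocateSupplies(List<CityAgent> cities)
    {
        foreach (var group in cities.SelectMany(c => c.ResourceDemand)
                                    .GroupBy(kv => kv.Key, kv => kv.Value))
        {
            double totalDemand = group.Sum();
            var suppliers = cities.Where(c => !c.ResourceDemand.ContainsKey(group.Key)).ToList();
            if (suppliers.Count == 0) continue;          // no city free to supply this resource
            foreach (var s in suppliers)
                s.Resource[group.Key] = totalDemand / suppliers.Count;   // split the minimum supply
        }
    }
}
```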
The main event display service uses C# events based on the publisher–subscriber model.
When aircraft and cities complete a certain event, they publish it, and the display board,
which subscribed to these events when the system started, responds to the publication
and displays the news.
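The text names C# events and a publisher–subscriber model; a stripped-down sketch of that pattern, with an assumed event signature and display board, is given below.

```csharp
using System;

// Publisher side: agents raise an event when something noteworthy happens.
class EventPublisher
{
    public static event Action<string> MainEvent;   // assumed: plain text messages

    public static void Publish(string message) => MainEvent?.Invoke(message);
}

// Subscriber side: the display board subscribes once when the system starts.
class DisplayBoard
{
    public DisplayBoard() => EventPublisher.MainEvent += Show;

    private void Show(string message) =>
        Console.WriteLine($"[{DateTime.Now:HH:mm:ss}] {message}");
}

class Demo
{
    static void Main()
    {
        var board = new DisplayBoard();
        EventPublisher.Publish("Aircraft A01 departed for city Hangzhou");
        EventPublisher.Publish("City Hangzhou accepted aircraft A01");
    }
}
```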
The event log service uses the operating system's IO functions. A text writing stream is
instantiated after the system starts. Each time an event is published, the service writes
it into the cache through the stream, and when the system is shut down, the service
turns the cache into a text file.
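A corresponding sketch of the event log service, subscribing to the assumed EventPublisher from the previous snippet and buffering lines until shutdown, could look like this; the file name and log format are illustrative.

```csharp
using System;
using System.IO;
using System.Text;

class EventLogService
{
    private readonly StringBuilder cache = new StringBuilder();

    public EventLogService() => EventPublisher.MainEvent += Record;

    // Each published event is appended to the in-memory cache with a timestamp.
    private void Record(string message) =>
        cache.AppendLine($"{DateTime.Now:O}\t{message}");

    // On shutdown the cache is flushed to a text file for later evaluation.
    public void Shutdown(string path = "event_log.txt")
    {
        EventPublisher.MainEvent -= Record;
        File.WriteAllText(path, cache.ToString());
    }
}
```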
Fig. 17.5 Map model in Unity3D (left) and real map model (right)
Entities in the aviation emergency rescue system include the map, aircraft, and cities.
3D models of these entities are created in 3ds Max, and the models of aircraft and cities
are visualized by Unity3D.
The map model is built from a digital elevation map and a satellite map. The digital
elevation map provides the height of the terrain, and the satellite map provides its
texture. Since the map covers a large area of land, the map model has a large file size
and consumes many of Hololens2's GPU rendering resources. So instead of visualizing
the map model in augmented reality, this paper chose to print a physical map and base
the hologram on it. Figure 17.5 shows the map model in Unity3D's scene view and the
physical map model made of resin. The physical map model retains the other parts of
the land and paints them green.
The management of the aircraft and city models relies on the Unity3D package
"Addressables." Aircraft and city models are loaded asynchronously by label after the
system starts and after services such as the spatial anchor and share service and the
network transfer service are initiated. City models are instantiated after the loading
process finishes. Aircraft models are instantiated only when an aircraft takes off from a
city, and the model is set inactive after it arrives. Other objects in the system are also
managed by the "Addressables" package (Fig. 17.6).
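The sketch below illustrates asynchronous loading by label with the Addressables API; the label names and the instantiation policy are assumptions, and the snippet is meant as a Unity-side illustration rather than the authors' actual loader.

```csharp
using System.Collections.Generic;
using UnityEngine;
using UnityEngine.AddressableAssets;

public class EntityLoader : MonoBehaviour
{
    // Assumed Addressables labels for the two entity types.
    [SerializeField] private string cityLabel = "city";
    [SerializeField] private string aircraftLabel = "aircraft";

    private readonly List<GameObject> aircraftPrefabs = new List<GameObject>();

    void Start()
    {
        // Cities are instantiated as soon as their prefabs finish loading.
        Addressables.LoadAssetsAsync<GameObject>(cityLabel, prefab => Instantiate(prefab));

        // Aircraft prefabs are only cached here; they are instantiated later,
        // when an aircraft actually takes off from a city.
        Addressables.LoadAssetsAsync<GameObject>(aircraftLabel,
            prefab => aircraftPrefabs.Add(prefab));
    }
}
```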
The functions of aircraft and cities are organized in the following sequence: when the
user chooses a particular aircraft and city so that the aircraft will fly to the city and
accomplish the mission, the aircraft first calls the "plan route" function. After the
waypoints are calculated, the "fly to destination" method is called to instantiate the
model and update the position and rotation of the aircraft. Once the aircraft's position
is close to the city, the city calls the "accept aircraft" function to announce the event,
add the aircraft to the city's current aircraft fleet, and destroy the model of the aircraft.
Then the aircraft calls the "execute task" function to transport resources
between the aircraft and the city. The "update load" and "update demand" functions are
called last, and the aircraft is then ready for another arrangement from the user. The
procedure of calling the functions is shown in Fig. 17.7.
The system simulates aviation emergency rescue based on a task scenario in which
Zhejiang province in China is struck by a flood. In this scenario, affected people need
to be transported to settlement places, and supplies and large machinery need to be
transported to the cities in need. The simulation runs in a single-device environment,
while the data interfaces remain open for data exchange. The view of the simulation
from the user's perspective can be seen in Fig. 17.8. The map and the other background
are in the physical world, while the models of aircraft, cities, the message board, and
the mesh on the hands are rendered by Hololens2.
17.5 Conclusion
In this paper, a framework for large-scale aviation emergency rescue based on
augmented reality is proposed. This framework contains two main parts, the service
part and the multi-agent part.
References
1. Goraj, Z., et al.: Aerodynamic, dynamic and conceptual design of a fire-fighting aircraft. Proc.
Inst. Mech. Eng., Part G: J. Aerosp. Eng. 215(3), 125–146 (2001)
2. Moeschler, O., et al.: Difficult aeromedical rescue situations: experience of a Swiss pre-alpine
helicopter base. J. Trauma 33(5), 754–759 (1992)
3. Grissom, C.K., Thomas, F., James, B.: Medical helicopters in wilderness search and rescue
operations. Air Med. J. 25(1), 18–25 (2006)
4. Wang, X., et al.: Study on the deployment and dispatching of aeronautic emergency rescue trans-
port based on virtual simulation. In: 2021 5th International Conference on Artificial Intelligence
and Virtual Reality (AIVR), pp. 29–35. Association for Computing Machinery (2021)
5. Brown, C., et al.: The use of augmented reality and virtual reality in ergonomic applications for
education, aviation, and maintenance. Ergon. Des. 10648046211003469 (2021)
6. Tan, S., et al.: Study on augmented reality electronic sand table and key technique. J. Syst. Simul.
20 (2007)
7. Guest, A., Bernardes, S., Howard, A.: Integration of an Agent-Based Model and Augmented
Reality for Immersive Modeling Exploration, p. 13. Earth and Space Science Open Archive
(2021)
Author Index
A
Abdelhakeem, Sara Khaled, 49
Abe, Jair Minoro, 3
Albanis, Georgios, 187
Alzahrani, Yahya, 145

B
Boufama, Boubakeur, 145
Bressler, Michael, 201

D
da Silva Filho I., João, 3
Dusza, Daniel G., 131

E
Eckstein, Korbinian, 201

G
Gkitsas, Vasileios, 187

H
Huang, Hui-Wen, 131
Huang, Kai, 131

K
Kadhem, Hasan, 49
Kolbenschlag, Jonas, 201
Kuzuoka, Hideaki, 201

L
Li, Tong, 119
Liu, Hu, 19, 173, 237
Liu, Huaqun, 85, 119
Liu, Huilin, 131
Liu, Siliang, 237
Li, Xijie, 119
Li, Xin, 19, 173
Li, Xiwen, 217

M
Mao, Kezhi, 73
Mechler, Vincenz, 35
Mustafa, Zeeshan Mohammed, 49

N
Nakamatsu, Kazumi, 3
Nan, Ke, 217
Niu, Xiaoye, 217

O
Omata, Masaki, 101
Onsori-Wechtitsch, Stefanie, 187

P
Pang, Yiqun, 161
Pang, Yunxiang, 161
Prahm, Cosima, 201

Q
Qing, Qing, 85
R
Rojtberg, Pavel, 35

S
Selitskiy, Stanislav, 61
Song, Wei, 119
Ström, Per, 187
Sun, Haiyang, 161
Sun, Xiaoyue, 85
Suzuki, Mizuki, 101

T
Tian, Yongliang, 19, 173, 237

W
Whitehand, Richard, 187

X
Xue, Yuanbo, 19, 173

Y
Yang, Sirui, 85
Yan, Huimin, 119
Yu, YiXiong, 19

Z
Zarpalas, Dimitrios, 187
Zhang, Jiaheng, 73
Zhang, Jun, 217
Zioulis, Nikolaos, 187