Communications in Computer and Information Science 1342

Artificial Intelligence Research
First Southern African Conference for AI Research, SACAIR 2020
Muldersdrift, South Africa, February 22–26, 2021
Proceedings
Editor
Aurona Gerber
University of Pretoria
Pretoria, South Africa
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
This volume of Springer CCIS (CCIS 1342) contains the revised accepted papers of the First Southern African Conference for Artificial Intelligence Research (SACAIR 2020).1
1 https://sacair.org.za/
2 The original date was November 30–December 4, 2020, but, due to the COVID-19 pandemic, the conference was pushed into 2021 in the hope of being able to retain its face-to-face format in the interest of building an AI community in Southern Africa.
3 https://www.cair.org.za/
I sincerely thank the technical chair, Aurona Gerber, for her hard work on the volume and the editorial duties performed. Thank you also to the program chairs (Aurona Gerber, Anne Gerdes, Giovanni Casini, Marelie Davel, Alta de Waal, Anban Pillay, Deshendran Moodley, and Sunet Eybers), the local and international panel of reviewers, our keynote speakers, and the authors and participants for their contributions. Last but not least, our gratitude goes to the members of the Organizing Committee (Aurona Gerber, Anban Pillay, and Alta de Waal), the student organizers (Karabo Maiyane, Emile Engelbrecht, Nirvana Pillay, and Yüvika Singh), and our sponsors, specifically the AIJ division of IJCAI, without whom this conference would not have been realized.
Dear readers,
This volume of CCIS contains the revised accepted papers of SACAIR 2020. We are
thankful that our first annual Southern African Conference for Artificial Intelligence
Research elicited the support it did during this challenging year with all the uncertainties due to the COVID-19 pandemic.
We received more than 70 abstracts, and after submission and a first round of
evaluation, 53 papers were sent out for review to our SACAIR Program Committee.
The 53 SACAIR submissions were solicited according to five topics: AI for Ethics and Society (9); AI in Information Systems, AI for Development and Social Good (3); Applications of AI (25); Knowledge Representation and Reasoning (8); and Machine Learning Theory (8).
The Program Committee comprised 72 members, 13 of whom were from outside
Southern Africa. Each paper was reviewed by at least three members of the Program
Committee in a rigorous, double-blind process whereby especially the following criteria were taken into consideration: Relevance to SACAIR, Significance, Technical
Quality, Scholarship, and Presentation that included quality and clarity of writing. For
this CCIS volume, 19 full research papers were selected, which translates to an
acceptance rate of 35.8%. The accepted full research papers per topic are: AI for Ethics and Society (3); AI in Information Systems, AI for Development and Social Good (1); Applications of AI (8); Knowledge Representation and Reasoning (4); and lastly, Machine Learning Theory (3).
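For readers who want the arithmetic behind the quoted acceptance rate, it is simply the number of accepted full papers over the number of reviewed submissions; a minimal check using only the counts reported above:

\[
\frac{19}{53} \approx 0.358 \approx 35.8\%
\]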
Thank you to all the authors and Program Committee members, and congratulations
to the authors whose work was accepted for publication in this Springer volume. We
wish our readers a fruitful reading experience with these proceedings!
SACAIR Sponsors
The sponsors of SACAIR 2020, The Journal of Artificial Intelligence and the Centre
for AI Research (CAIR), are herewith gratefully acknowledged.
Organization
General Chair
Emma Ruttkamp-Bloem University of Pretoria, Centre for AI Research (CAIR), South Africa
Program Committee
Etienne Barnard North-West University, South Africa
Sihem Belabbes LIASD, Université Paris, France
Sonia Berman University of Cape Town, South Africa
Jacques Beukes North-West University, South Africa
Willie Brink Stellenbosch University, South Africa
Arina Britz Stellenbosch University, South Africa
Michael Burke The University of Edinburgh, UK
Jan Buys University of Cape Town, South Africa
Giovanni Casini ISTI, CNR, Italy
Colin Chibaya Sol Plaatje University, South Africa
Olawande Daramola Cape Peninsula University of Technology,
South Africa
Jérémie Dauphin University of Luxembourg, Luxembourg
Marelie Davel North-West University, South Africa
Tanya de Villiers Botha Stellenbosch University, South Africa
Alta De Waal University of Pretoria, South Africa
Febe de Wet Stellenbosch University, South Africa
Iena Derks University of Pretoria, South Africa
Tiny Du Toit North-West University, South Africa
Andries Engelbrecht University of Stellenbosch, South Africa
Sunet Eybers University of Pretoria, South Africa
Inger Fabris-Rotelli University of Pretoria, South Africa
Sebastian Feld Ludwig Maximilian University of Munich, Germany
Eduardo Fermé Universidade da Madeira, Portugal
Anne Gerdes University of Southern Denmark, Denmark
Mandlenkosi Gwetu University of KwaZulu-Natal, South Africa
Shohreh Haddadan University of Luxembourg, Luxembourg
Henriette Harmse EMBL-EBI, UK
Applications of AI

Human-Robot Moral Relations

Cindy Friedman1,2(B)
1 Department of Philosophy, University of Pretoria, Pretoria, South Africa
[email protected]
2 Centre for AI Research (CAIR), Pretoria, South Africa
1 Introduction
This paper contributes to the debate in the ethics of social robots on how or whether to treat social robots morally by way of considering a novel perspective on the moral relations between human interactants and social robots: that human interactants are the actual moral patients of their agential moral actions
towards robots; robots are no more than perceived moral patients. This novel perspective is significant because it allows us to circumnavigate contentious debates surrounding the (im)possibility of robot consciousness and moral patiency, thus allowing us to address actual and urgent current ethical issues in relation to human-robot interaction (HRI).
Social robots are becoming increasingly sophisticated and versatile technologies. Their wide range of potential utilisations includes carer robots for the sick or elderly (see e.g. [50,53,60]), general companion robots (see e.g. [15,57]), teachers for children (see e.g. [36,51]), or (still somewhat futuristic but nonetheless morally relevant in human-robot interaction (HRI) contexts) sexual companions (see e.g. [21,37,48]).
Although social robots may take on a variety of forms – such as the AIBO robot, which takes the shape of a dog, or the Paro robot, which takes the shape of a baby seal – I will here be focusing on android social robots.1 This is because a combination of a human-like appearance and human-like sociability creates the potential for human interactants to relate to these robots in seemingly realistic human-like ways.
Given the possibility for human interactants to relate to these social robots in human-like ways,2 researchers have investigated not only the nature of these relations and how they may morally impact us – Turkle [57], for example, puts forward that some relations with robot companions may fundamentally change what it means to be human, and Nyholm & Frank [48] speculate that certain relations with robots may hinder us from forming bonds with other people – but also whether we have a moral relation to these robots that would require us to relate to them in a particular way. By this, I mean: should we treat them morally well? Bryson [12], for example, argues vehemently against the need for moral treatment of robots, whereas others, such as Levy [38] or Danaher [18], argue in various ways that we should consider the moral treatment of robots.
This paper will consider the issue of the moral treatment of social robots from an anthropocentric perspective (as opposed to a ‘robot perspective’) by considering
arguments that treating a robot immorally causes moral harm to its human
interactant. Given this possibility, I suggest that in this context, social robots
and human interactants have a unique moral relation: human interactants are
both the moral agents of their actions towards robots, as well as the actual moral
patients of those agential moral actions towards robots. Robots, in this case, are
no more than perceived moral patients.
Literature on robot ethics is less focused on patiency than on agency (with regard to both human interactants and robots in the HRI context) (see e.g.
1 Unless otherwise specified, any use of the term ‘social robot’ will specifically refer to android social robots.
2 It must be noted that social robots cannot genuinely reciprocate human sentiments; they cannot care for a human interactant the way in which a human interactant may care for them (e.g. [16]). Any emotions displayed by robots are functional in nature; thus, at least currently (or even in the near future), human interactants cannot have genuinely reciprocal or mutual bonds with robots (e.g. [55]). Any relation or bond formed with a social robot is therefore unidirectional in nature.
[27,38]), and where there is a focus on patiency as far as robots are concerned,
it most often discusses the notion of the moral treatment of robots from the
perspective of the current (im)possibility for robots to be actually conscious
and, thus, the (im)possibility for them to be actual moral patients (see Sect. 4).
However, in putting forward that it is human interactants who are the moral
patients of their own agential moral actions towards robots, we may circumnavigate the somewhat intractable debate of actual robot consciousness which arises
in relation to the (im)possibility for robots to be moral patients in the context of
questioning whether they warrant moral treatment. This is not to say that con-
cerns surrounding artificial robot consciousness are unimportant, but rather to
say that we should not become so detained by the concern as to whether robots
can be conscious or not (and thus moral patients or not) that we are misdirected
from addressing actual and urgent current ethical issues in relation to human-
robot interaction. My argument that it is human interactants who are the actual
moral patients of their agential moral actions toward social robots thus allows
us to seriously consider these actual and urgent current ethical issues.
I will first discuss two instances wherein human interactants are moral
patients in relation to the robots with which they interact: firstly, robots as
conduits of human moral actions towards other human moral patients; secondly,
humans as moral patients to the moral actions of robots. I will then introduce
a third perspective wherein a human interactant is, at the same time, both a
moral agent and a moral patient: human interactants as moral patients of their
own agential moral actions towards robots. I will firstly distinguish between the
actuality of robot consciousness and the perception of robot consciousness since
this is important for our understanding of robots as perceived moral patients,
and also for our understanding of why, in the context of this paper, the actuality
of robot consciousness is a non-issue. I will then put forward that treating social
robots immorally may cause moral harm to human interactants and I do so using
three sub-arguments: social robots are more than mere objects; the act of treating a social robot immorally is abhorrent in itself; and, due to these arguments, treating a social robot immorally may negatively impact upon the moral fibre of interactants. Finally, due to the perception of robot consciousness, and, thus, the perception of robot moral patiency, as well as the concern that treating social robots immorally may cause moral harm to human interactants, I argue that a human interactant is, at the same time, both the agent and patient of their moral actions towards robots: human interactants are the actual moral patients of their agential moral actions towards robots, whereas robots are perceived moral patients.
Let us now consider two ways in which human interactants may be moral
patients in the context of their interaction with robots so as to contextualise the
argument this paper makes, and make clear how and why my contribution is a
particularly novel one.
of action [2]. As such, the machines in question are machines that make decisions
and act autonomously (without human intervention) by way of “[combining]
environmental feedback with the system’s own analysis regarding its current
situation” [29]. Given this understanding of autonomous decision making systems
(ADM systems) that have the potential to be moral agents, we can then consider
the possibility that humans can be moral patients to the moral decisions and
actions of AI. Specifically, in our context, this potentiality means that robots
could harm humans.
The topic of the possibility for machines to be considered moral agents is a
broadly contested and complicated one, full discussion of which would go beyond
the confines of this paper. However, it is worth noting some arguments that
have been made concerning the topic. Generally speaking, the topic is one which
questions whether machines can be moral agents – is morality programmable? –
and what conditions they would have to fulfill in order to be considered moral
agents, as well as the impact that these agents would have on us.
Well-known researchers weighing in on the issue include Asaro [4], Bostrom
and Yudkowsky [7], Brundage [11], Deng [24], Lumbreras [41], McDermott [43],
Moor [46], Sullins [54], Torrance [56], Wallach and Allen [61], Wang & Siau [62],
and many others. Different sets of conditions for moral agency have been suggested: a combination of free will, consciousness, and moral responsibility [61]; a combination of the abilities to be interactive, autonomous, and adaptable [25]; and a combination of autonomy, responsibility, and intentionality [54]. Do we need to
ensure artificial moral agents (AMAs) are both ethically productive and ethically
receptive [56], or is the ability for rational deliberation all that is needed [39]?
Although it is debatable whether robots can or cannot truly be moral agents
given how philosophically loaded the topic is, it remains that, regardless of this
uncertainty, humans can still be moral patients of the actions of autonomous
machines that act without direct human intervention. For instance, and going back to the example mentioned above of autonomous weapons systems (AWSs), although we could debate endlessly about whether an AWS that acts without human intervention is a moral
agent, the fact remains that it can still ultimately make the moral decision to
kill a civilian or not, and this civilian would be the moral patient of this moral
decision – whether they lived or died.
This is not to say that were the AWS to kill a civilian, it would hold full moral
responsibility for the civilian’s death – this is another complex issue entirely3 – nor
is it to say that the AWS is, in and of itself, a moral agent. Rather, it is to say that
moral responsibility and agency aside, the civilian would have been killed due to a
decision ultimately made by the AWS (although the groundwork for the decision
would be based on its programming). At that moment, there is no direct human
intervention wherein a human is making the decision to kill the civilian or not.
Thus, as stated above, there is the potential for human beings to be harmed
by this technology.
3 The topic of moral responsibility is also a contentious one and there remains what can be termed a responsibility gap when it comes to who should be held responsible for the actions of autonomous systems (see e.g. [42]).
human. This is no futuristic prediction. Studies have found that people do tend
to apply social rules to the computers with which they interact [10,45]. The more
human-like something appears to be, the more likely we are to anthropomorphise
it. As such, given their android appearance, it is no leap in logic to then argue
that the tendency to anthropomorphise android social robots will likely be high.
Specifically in the context of social robots that may provide a form of companionship, it may also be the case that human interactants want to believe that the robot is conscious, because this will make their companionship with it seem all the more realistic (see e.g. [6,48]); human interactants may thus allow themselves to be deceived, perceiving the robot as conscious even though they may know that it is not actually conscious.
to distinguish between other social agents and various objects in the environment” [26]; (3) “Sociable: Robots that proactively engage with humans in order to satisfy internal social aims (drives, emotions, etc.) [are sociable robots]. These robots require deep models of social cognition” [8,9]; (4) “Socially intelligent: Robots that show aspects of human-style social intelligence, based on possibly deep models of human cognition and social competence” [22], are socially intelligent; (5) “Socially interactive: Robots for which social interaction plays a key role in peer-to-peer HRI [Human-Robot Interaction], different from other robots that involve ‘conventional’ HRI, such as those used in teleoperation scenarios” [26], are socially interactive. Given these definitions and conceptual understandings of social robots, it is clear that social robots are a versatile technology, and that there are various ways in which human interactants can socially relate to them. As such, social robots cannot be compared to just any object that we utilise on a daily basis; we do not socially relate to just any inanimate object the way in which we may relate to a social robot.
Given that human interactants can socially relate to social robots, there
is then the possibility for us to bond with them in seemingly realistic ways.
Although any type of bond with a social robot may be unidirectional, and no
type of reciprocation on the part of the robot truly indicates consciousness,
the robot still does mimic reciprocation on a human social level, which impacts the humans with whom it interacts. As such, I agree with Ramey [49] that there may be a unique social relationship (albeit possibly unidirectional as far as genuine reciprocation is concerned) between a human and a social robot that is qualitatively different from the way in which we relate to any other object that we utilise.
We have more than a physical relation to them. Yes, one can have more than
a physical relation to an inanimate object – children, for example, love their
stuffed toys and it can be argued that these toys are created to elicit an emotional
response from children. However, this type of interaction and emotional response
differs from that which we experience with social robots since stuffed toys do
not reciprocate emotion, whereas social robots do – even though this reciprocity
may be mere mimicry. Given this, interactants may begin to see social robots
as being on the same plane as human beings (see e.g. [38]). Therefore, although
they may not actually be conscious, we may view them as being such, given the
human-like way in which we are able to relate to them (see e.g. [31,44,57]). Given
this possibility, the superficial view that social robots should be treated as mere objects does not seem viable – there is more to them than that – although actually granting them consciousness and considering them deserving of moral treatment in the way humans are may be taking it a step too far, especially given the contentiousness of the consciousness debate (I will elaborate upon this point in a later section).
Given that I hold that social robots can be seen to be more than just any
inanimate object due to the way in which we interact with them, I will now
consider why the act of treating a social robot immorally is wrong in itself. This
is because not only may social robots be viewed as being more than mere objects,
but they can essentially be seen to be human simulacra in that they are being
mean that treating a social robot immorally may cause us to treat other humans
immorally, similarly to the way in which Kant argues that the cruel treatment
of animals may lead to us being “no less hardened towards men” [34].
“[T]o treat androids as humans is not to make androids actually human, but
it is to make oneself an expanded self” [49] and the way we treat robots will
affect ourselves and people around us. In light of this, Levy [38] argues that
we should treat robots in the same moral way that we would treat any human
because not doing so may negatively affect those people around us “by setting
our own behaviour towards those robots as an example of how one should treat
other human beings” [38].
Similar questions have been raised as far as the moral treatment of animals is
concerned. Kant [33] makes the argument that we have the duty to ourselves to
refrain from treating animals with violence or cruelty. This is because in treating
animals immorally (with violence or cruelty) we “[dull] shared feelings of their
suffering and so [weaken] and gradually [uproot] a natural predisposition that
is very serviceable to morality in one’s relations with other men” [33]. Thus,
immoral treatment of animals may negatively impact upon moral relations with
other humans. Similarly, Turner [59] states: “If we treat animals with contempt,
then we might start to do so with humans also. There is a link between the two
because we perceive animals as having needs and sensations – even if they do not
have the same sort of complex thought processes as we do. Essentially, animals
exhibit features which resemble humans, and we are biologically programmed to
feel empathy toward anything with those features”.
If there is concern that the way in which we treat animals may extend to the way in which we treat humans, then surely there should be even more concern regarding our moral treatment of social robots, which are realistic human simulacra, as opposed to animals, which may merely possess features that resemble human features? As such, going back to Levy [38], the main reason he argues we should not treat robots immorally is that, if we take their embodiedness seriously, treating them immorally would impact negatively on our social relations with humans. This argument stems from the potential for people to interact with social robots in seemingly realistic human-like ways, leading to the human interactant perceiving the robot
as being sociable, intelligent and autonomous and, as such, being on the same
plane as human beings. This being the case, if we do begin to perceive social
robots as being on the same plane as human beings, Levy’s [38] argument that
we should treat robots morally well, for our own sake, holds some weight.
One can, therefore, argue that since social robots are – in Levy’s [38] view –
embodied computers, in treating a social robot immorally, one is simulating the
immoral treatment of a human being (as I have discussed above). If we do come
to view these robots as being on the same plane as human beings, and yet not respect them as human beings, one can ask whether this will desensitise us towards immoral behaviour, thereby lowering the moral barriers to immoral acts. Would this potentially lead to human beings treating
one another in such immoral ways?
that the actuality of robot consciousness is a thorny issue and, therefore, I put forward that we focus our attention on perceived robot consciousness. Given the link between consciousness and moral patiency, we may consider that, should human interactants perceive a social robot as being conscious, they may then perceive it as a moral patient; because a social robot can act as if it is conscious, it can therefore act as if it is suffering, should it be treated immorally.
Moral patiency can be understood as the case of being a target of moral
action. In this instance, human interactants would not be direct targets of their
own actions, but rather indirect targets – like a bullet ricocheting off its direct
target and injuring an innocent bystander who becomes an indirect target of
the shooter. They (human interactants) are indirectly affected by way of their moral fibre being negatively impacted should they treat social robots immorally.
Where the robot is the direct target of the immoral treatment – and the
perceived moral patient – the human interactant is the indirect target – and
the actual moral patient. As such, we are indirect recipients of immoral action
because robots cannot actually be recipients. Robots are not really impacted (for
now leaving aside the possibility of robot phenomenal experience and consciousness, which, if it comes to pass, would of course add a layer of the robot as moral
patient to this discussion) – we (the human interactants) are. Moreover, Danaher
[19] describes a moral patient as “a being who possesses some moral status – i.e. is
owed moral duties and obligations, and is capable of suffering moral harms and
experiencing moral benefits – but who does not take ownership over the moral
content of its own existence”. As far as human interactants being moral patients of their own moral actions is concerned, and referring to Danaher’s [19] definition, human interactants can suffer and experience moral harms and benefits of their own agential actions; moral harm by way of their moral fibre being negatively impacted is an example of this kind of suffering.
Interestingly, Danaher [19] actually argues that the rise of robots could bring
about a decrease in our own moral agency: “That is to say, [the rise of robots]
could compromise both the ability and willingness of humans to act in the
world as responsible moral agents, and consequently could reduce them to moral
patients” [19]. Consider, for example, as elaborated upon in Danaher’s [19] article, an instance in which someone spends all their time with their sexbot. As a consequence, the human interactant loses motivation to do anything of real consequence – go out and meet new people, or spend time with a human partner – because it takes more effort. As such, this human interactant can spend all day at home, enjoying all the pleasure they desire [19]. As Danaher [19] states: “[T]he
rise of the robots could lead to a decline in humans’ willingness to express their
moral agency (to make significant moral changes to the world around them).
Because they have ready access to pleasure-providing robots, humans might
become increasingly passive recipients of the benefits that technology bestows”.
This is a compelling argument and worth consideration. However, I argue here not so much that our moral agency could itself ‘decrease’ due to our interaction with social robots, but rather that our moral agency could be negatively impacted in the sense that our moral fibre may be degraded, causing us, as moral agents, possibly to act immorally towards other human beings with whom we share the world, and towards ourselves.
Therefore, we may consider treating social robots morally well for our own
sakes. Although specifically speaking to the topic of robot rights, we may here
draw upon Gunkel’s [28] argument that a consideration of the descriptive and normative aspects of robot rights often seems to be amiss in current machine ethics literature. It is important to distinguish between these two aspects so as to avoid slipping from one to the other. As far as the moral consideration of robots is concerned, this article distinguishes between the descriptive and normative aspects of the moral consideration of robots by arguing that even though social robots are not capable of being actual moral patients (descriptive aspect), we should still grant them moral consideration (normative aspect).
Finally, most ethics are agent-oriented – hence Floridi & Sanders [25] refer to
this orientation as the ‘standard’ approach. As such, a patient-oriented approach
is ‘non-standard’ – “it focuses attention not on the perpetrator of an act but on
the victim or receiver of the action” [25]. Considering the possibility of human
interactants being both agents and patients in a given instance bridges such a
divide between a standard and non-standard approach. This is because human
interactants – as moral agents – have the capacity to treat robots in moral or
immoral ways. However, such treatment indirectly impacts human interactants
as moral patients – they, too, are indirect receivers or victims of their own moral actions, given that treating a robot immorally may negatively impact upon their own moral fibre.
5 Conclusion
This paper ultimately argued that given the perception of robot conscious-
ness and moral patiency, as well as the possibility that treating a social robot
immorally may cause moral harm to human interactants, we may consider that a
human interactant is, at the same time, both a moral agent and a moral patient
of their moral actions towards a social robot. That is, a human interactant (as a moral agent) is the actual moral patient of their moral actions, whereas the robot is a perceived moral patient.
This argument contributes to a perspective that is sorely lacking in machine
ethics literature: there is very little focus on moral patiency as compared to
moral agency (in the context of both humans and robots). Although there is
somewhat of a focus on moral agency in that I argue that a human interactant is, at the same time, both the moral agent and the actual moral patient, there was more focus on human interactants being moral patients, given that it is more relevant in the context of an anthropocentric perspective on the moral treatment of robots. Moreover, a novel contribution is made particularly regarding human moral patiency in the context of human-robot interaction. Where there
has been consideration that humans can be moral patients in terms of robots
being conduits of human moral action towards other human moral patients, as
well as consideration that humans can be moral patients to the moral actions
of robots, there has been no consideration of human interactants being moral
patients of their own agential moral actions towards robots (particularly android social robots), i.e. as indirect targets of their own moral actions, particularly in the
context of treating robots immorally.
This is an important consideration and contribution in the context of the
debate surrounding the moral treatment of robots, which also encompasses the
contentious subject of robot rights. It is important because analysing the moral
treatment of robots, and the possibility of robot rights, from an anthropocentric
perspective (thus not in terms of whether or not robots are harmed from a
robot perspective) as is suggested, may allow further research in this regard
that does not become so concerned with the actuality of robot consciousness
and moral patiency to such an extent that considerations concerning robot moral status and robot rights seem superfluous. The consideration of robot moral status and robot rights is definitely not superfluous from the perspective of human
interactants who may be morally harmed as a result of immoral interactions
with social robots who mimic human-likeness. The need to research the nature
and impact of HRI is high and often under-estimated even in AI ethics policy
making.
We cannot only consider the moral treatment of robots when, or if, they
become conscious. The very way in which we express ourselves as humans and
in which we situate ourselves in social spaces is in danger of changing rapidly
already in the case of human traits simply being mimicked. To be detained by the concern as to whether robots can be conscious or not will, for now, only misdirect us from moral issues that should be addressed immediately and that present more pressing ethical dangers, such as the degradation of our moral fibre due to not treating robots morally well for our own moral sakes.
As far as non-android social robots are concerned, further research may draw upon the arguments I have made in the context of android social robots so as to possibly generalise them to the impacts of non-android social robots, or other types of robots in general.
Further research may also draw upon the arguments made so as to consider
granting rights to robots. Specifically, we may consider granting negative rights
to robots, i.e. rights that will prevent human interactants from treating robots
immorally.
For now, the possibility of robots with full moral status who demand their
rights may seem a long way off. We cannot be certain when this will happen,
or if it will ever happen. Regardless of these possibilities, however, what we can
be certain of is that the moral fibre of human societies may be at risk if we do
not consider the moral treatment of social robots – at least, for now, from the
perspective of human interactants.
References
1. IEEE ethically aligned design (2019). https://standards.ieee.org/content/dam/
ieeestandards/standards/web/documents/other/ead1e.pdf
2. Anderson, M., Anderson, S.: Machine ethics: creating an ethical intelligent agent.
AI Mag. 28(4), 15–26 (2007)
3. Arnold, T., Scheutz, M.: HRI ethics and type-token ambiguity: what kind of robotic identity is most responsible? Ethics Inf. Technol. (2018)
4. Asaro, P.: What should we want from a robot ethic? Int. Rev. Inf. Ethics 6(12),
9–16 (2006)
5. Barquin, R.C.: Ten commandments of computer ethics (1992)
6. Boltuć, P.: Church-Turing Lovers. Oxford University Press, Oxford (2017)
7. Bostrom, N., Yudkowsky, E.: The ethics of artificial intelligence. In: Frankish, K., Ramsey, W. (eds.) The Cambridge Handbook of Artificial Intelligence, pp. 316–334. Cambridge University Press, Cambridge (2014)
8. Breazeal, C.: Designing Sociable Robots. MIT Press, Cambridge (2002)
9. Breazeal, C.: Towards sociable robots. Robot. Auton. Syst. 42, 167–175 (2003)
10. Broadbent, E.: Interactions with robots: the truths we reveal about ourselves.
Annu. Rev. Psychol. 68, 627–652 (2017)
11. Brundage, M.: Limitations and risks of machine ethics. J. Exp. Theor. Artif. Intell.
26(3), 355–372 (2014)
12. Bryson, J.: Robots should be slaves. Close Engage. Artif. Companions: Key Soc.
Psychol. Ethical Des. Issues (2009)
13. Chalmers, D.: Facing up to the problem of consciousness (1995). http://consc.net/
papers/facing.pdf. Accessed 7 May 2019
14. Chalmers, D.: Philosophy of Mind: Classical and Contemporary Readings. Oxford
University Press, Oxford (2002)
15. Coeckelbergh, M.: Artificial companions: empathy and vulnerability mirroring in
human-robot relations. Stud. Ethics Law Technol. 4(3, Article 2) (2010)
16. Coeckelbergh, M.: Health care, capabilities, and AI assistive technologies. Ethical
Theory Moral Pract. 13, 181–190 (2010)
17. Damiano, L., Dumouchel, P.: Anthropomorphism in human-robot co-evolution.
Front. Psychol. 9, 1–9 (2018)
18. Danaher, J.: The Symbolic-Consequences Argument in the Sex Robot Debate. MIT
Press, Cambridge (2017)
19. Danaher, J.: The rise of the robots and the crisis of moral patiency. AI & Soc. 34,
129–136 (2019)
20. Danaher, J., Earp, B., Sandberg, A.: Should We Campaign Against Sex Robots? The MIT Press, Cambridge (2017)
21. Danaher, J., McArthur, N.: Robot Sex: Social and Ethical Implications. The MIT
Press, Cambridge (2017)
22. Dautenhahn, K.: The art of designing socially intelligent agents - science, fiction, and the human in the loop. Appl. Artif. Intell. 12, 573–617 (1998)
23. Dautenhahn, K.: Socially intelligent robots: dimensions of human-robot interac-
tion. Philos. Trans. Roy. Soc. 362, 679–704 (2007)
24. Deng, B.: Machine ethics: the robot’s dilemma. Nat. News 523, 24–26 (2015)
25. Floridi, L., Sanders, J.: On the morality of artificial agents. Minds Mach. 14, 349–379 (2004)
26. Fong, T., Nourbakhsh, I., Dautenhahn, K.: A survey of socially interactive robots. Robot. Auton. Syst. 42, 143–166 (2003)
27. Gunkel, D.: Moral patiency. In: The Machine Question: Critical Perspectives on AI, Robots, and Ethics, pp. 93–157. The MIT Press, Cambridge (2012)
28. Gunkel, D.: The other question: can and should robots have rights? Ethics Inf.
Technol. 20, 87–99 (2017)
29. ICRC: Autonomy, artificial intelligence and robotics: technical aspects of human
control, Geneva (2019)
30. Jaworska, A., Tannenbaum, J.: The grounds of moral status. Stanford Encyclopedia
of Philosophy (2018)
31. Kanda, T., Freier, N., Severson, R., Gill, B.: Robovie, you’ll have to go into the
closet now: children’s social and moral relationships with a humanoid robot. Dev.
Psychol. 48(2), 303–314 (2012)
32. Kanda, T., Ishiguro, H., Imai, M., Ono, T.: Development and evaluation of interactive humanoid robots, 1839–1850 (2004)
33. Kant, I.: The Metaphysics of Morals. Cambridge University Press, Cambridge
(1996)
34. Kant, I.: Lectures on Ethics. Cambridge University Press, Cambridge (1997)
35. Kirk, R., Carruthers, P.: Consciousness and concepts. In: Proceedings of the Aris-
totelian Society, Supplementary, vol. 66, pp. 23–59 (1992)
36. Komatsubara, T.: Can a social robot help children’s understanding of science in
classrooms? In: Proceedings of the Second International Conference on Human-
Agent Interaction, pp. 83–90 (2014)
37. Levy, D.: Love and sex with robots: the evolution of human-robot relationships.
Harper (2007)
38. Levy, D.: The ethical treatment of artificially conscious robots. Int. J. Soc. Robot.
1(3), 209–216 (2009)
39. Lin, P., Abney, K., Bekey, G.: Robot Ethics: The Ethical and Social Implications
of Robotics. The MIT Press, Cambridge (2012)
40. Lin, P., Abney, K., Bekey, G.: Robotics, Ethical Theory, and Metaethics: A Guide
for the Perplexed. The MIT Press, Cambridge (2012)
41. Lumbreras, S.: The limits of machine ethics. Religions 8(100), 2–10 (2017)
42. Matthias, A.: The responsibility gap: ascribing responsibility for the actions of
learning automata. Ethics Inf. Technol. 6, 175–183 (2004)
43. McDermott, D.: Why ethics is a high hurdle for AI. In: North American Conference
on Computers and Philosophy, Bloomington, Indiana (2008)
44. Melson, G., Kahn, P., Beck, A., Friedman, B.: Robotic pets in human lives: impli-
cations for the human-animal bond and for human relationships with personified
technologies. J. Soc. Issues 65(3), 545–567 (2009)
45. Moon, Y., Nass, C.: Machines and mindlessness: social responses to computers. J.
Soc. Issues 56, 81–103 (2000)
46. Moor, J.: The nature, importance, and difficulty of machine ethics. IEEE Intell. Syst. 21(4), 18–21 (2006)
47. Müller, V.: Ethics of artificial intelligence and robotics. In: Stanford Encyclopedia of Philosophy (2020)
48. Nyholm, S., Frank, L.: It loves me, it loves me not: is it morally problematic to design sex robots that appear to love their owners? Techné: Research in Philosophy and Technology (2019)
49. Ramey, C.: ‘For the sake of others’: The ‘personal’ ethics of human-android interaction. Stresa, Italy (2005)
50. Sharkey, A., Sharkey, N.: Granny and the robots: ethical issues in robot care for
the elderly. Ethics Inf. Technol. 14(1), 27–40 (2010)
51. Sharkey, A.: Should we welcome robot teachers? Ethics Inf. Technol. 18, 283–297 (2016)
52. Sparrow, R.: Robots, rape, and representation. Int. J. Soc. Robot. 9(3), 465–477
(2017)
53. Sparrow, R., Sparrow, L.: In the hands of machines? The future of aged care. Minds
Mach. 16, 141–161 (2006)
54. Sullins, J.: When is a robot a moral agent? Int. Rev. Inf. Ethics 6(12) (2006)
55. Sullins, J.: Robots, love and sex: the ethics of building a love machine. IEEE Trans.
Affect. Comput. 3(4), 398–409 (2012)
56. Torrance, S.: Artificial agents and the expanding ethical circle. AI Soc. 28, 399–414
(2013)
57. Turkle, S.: A nascent robotics culture: new complicities for companionship. In:
AAAI Technical Report Series (2006)
58. Turkle, S.: Authenticity in the age of digital companions. Interact. Stud. 8(3),
501–517 (2007)
59. Turner, J.: Why robot rights? In: Robot Rules: Regulating Artificial Intelligence,
pp. 145–171. Palgrave Macmillan, Cham (2019)
60. Vallor, S.: Carebots and caregivers: sustaining the ethical ideal of care in the
twenty-first century. Philos. Technol. 24, 251–268 (2011)
61. Wallach, W., Allen, C.: Moral Machines: Teaching Robots Right from Wrong. Oxford University Press, New York (2009)
62. Wang, W., Siau, K.: Ethical and moral issues with AI: a case study on health-
care robots. In: Twenty-Fourth Americas Conference on Information Systems, New
Orleans (2018)
Nature, Culture, AI and the Common Good
– Considering AI’s Place in Bruno Latour’s
Politics of Nature
Jaco Kruger1,2(B)
1 St Augustine College of South Africa, Johannesburg, South Africa
[email protected]
2 Faculty of Theology, North West University, Potchefstroom, South Africa
Abstract. This paper considers the place and the role of AI in the pursuit of the
common good. The notion of the common good has a long and venerable history
in social philosophy, but this notion, so it is argued, becomes problematic with
the imminent advent of Artificial General Intelligence. Should AI be regarded as
being in the service of the common good of humanity, or should the definition of
the social common rather be enlarged to include non-human entities in general, and AIs, which in the future may include human-level and superhuman-level AIs, in particular? The paper aims to clarify the questions and the concepts involved by
interpreting Bruno Latour’s proposal for a politics of nature with specific reference
to the challenge posed by the imminent advent of human-level artificial general intelligence (AGI). The recent suggestion by eminent AI researcher Stuart Russell that the pursuit of AI should be re-oriented towards AI that remains in the service of the human good will be used as a critical interlocutor of Latour’s model. The
paper concludes with the suggestion that the challenge will be to steer a middle
ground between two unacceptable extremes. On the one hand, the extreme of a
“truth politics” that assumes there is a pure human nature and definite human
interests that must be protected against AI should be avoided. On the other hand,
the alternative extreme of a naked “power politics” must also be avoided because
there is a very real possibility that super AI may emerge victorious out of such a
power struggle.
1 Introduction
The modern world has been characterised by an intractable opposition between nature
and culture. This has been the longstanding thesis and arguably the primary underlying
concern in the work of French sociologist and philosopher, Bruno Latour. Latour rose
to prominence following the publication of his books Laboratory Life (1986), Science
in Action (1988) and We have never been modern (1993) in the final decades of the last
century. Especially in We have never been modern Latour describes and problematizes
the opposition between nature and culture that is, according to him, the defining charac-
teristic of modernity. On the one hand, in modern thought, nature came to be regarded
as the “objective reality out there” waiting to be discovered and faithfully described by
what has become known as modern science. Nature, in other words, simply ís what it
is. On the other hand, there is the realm of culture – the realm of human subjectivity
and freedom. In the realm of freedom, the incomparable dignity of human subjectivity
lies in its autonomy; its ability to freely decide and to take responsibility for action.
As is well known, this watertight distinction between nature and culture, necessity and
freedom, found an enormously influential articulation in the thought of Immanuel Kant,
who himself worked on philosophical problems already present in the work of René Descartes at the beginning of modern thought in the 17th century.
The dichotomy between nature and freedom gave rise to a whole series of analogous
oppositions along with interminable struggles to reconcile, or at least to relate them.
In his 2004 book, Politics of Nature, which will be the primary focus in this paper,
Latour takes up the nature-culture divide again and explains how it translates into the
oppositions between facts and values, between is and ought, between the common world
and the common good, between truth politics and power politics, and between different
viewpoints regarding the orienting transcendence of the world: is it the transcendence of
nature, the transcendence of freedom, or the transcendence of the political sovereign? The
Sisyphean labour of modern thought has been to police the borders between these oppositions, while ceaselessly drawing the borders again because they remain perpetually unclear, unstable, and porous.1
The questions and philosophical challenges brought on by the possibility of Artificial
Intelligence add further dimensions to the problem of the nature-culture divide. Artificial
Intelligence does not sit well within the opposition between nature and freedom. On
which side of the border should it be classified and maintained? Should it be regarded
as part of non-human nature, or should it be accorded aspects of agency and moral
responsibility that were hitherto reserved exclusively for human subjectivity? In his
latest book, Human Compatible, well-known Artificial Intelligence researcher Stuart Russell observes that the achievement of human-level Artificial General Intelligence is indeed not far off. However, he argues that there is something fundamentally wrong-headed about the way the achievement of Artificial Intelligence has been pursued thus
far. “From the very beginnings of AI, intelligence in machines has been defined in the
same way: Machines are intelligent to the extent that their actions can be expected
to achieve their objectives.” (Russell 2019:20) According to Russell this is wrong and indeed could be regarded as a huge threat to the future flourishing of humanity. There is a very real possibility that Artificial Intelligence will become superintelligent and that it will then pursue its own objectives to the detriment of its human creators. In this scenario Artificial Intelligence will not lead to the common good in society, but actively detract from it. Accordingly, Russell proposes that we should change our understanding
of Artificial Intelligence to the following: “Machines are beneficial to the extent that their
1 For a critical engagement with Latour’s deconstruction of the nature-culture opposition, see
Collins and Yearley (1992:301–326), Walsham (1997) or Pollini (2013), specifically with regard
to ecology.
actions can be expected to achieve our objectives.” (2019:22) Our pursuit of Artificial Intelligence must, in other words, be guided by the lodestar of the human good.
In the present paper I engage with Russell’s thesis from the perspective of Bruno
Latour’s politics of nature. I argue that Artificial Intelligence can and should be accommodated in the ongoing political process of constructing our common world. Artificial
Intelligence should be allowed to make presentations in the developing res publica –
public thing – that is our world. It is precisely because the hitherto watertight distinction
between the human and the non-human is untenable that the role of Artificial Intelligence
in the construction of the collective can in the future become less problematic and even
normal. An important implication of this argument would then be the deconstruction of
the opposition between the common world and the common good, and the highlighting
of the possible contribution of AI in this regard.
The argument develops along the following steps: in the next section I outline Latour’s deconstruction of the nature-culture dichotomy, as well as his proposal for a process of continually negotiating a common world. In the following section I argue that Artificial
Intelligence can make a vital contribution towards the efficacy and fairness of the two
powers that, according to Latour, shape the public domain – the power to take into
account and the power to arrange in rank order. To accord such a supportive role to
AI would, however, miss the opportunity to engage with the far greater challenge that human-level and superhuman-level AI poses: the challenge of non-human agency and
intelligence in general. In the final section of the paper I therefore argue that Russell’s
alarm about AI pursuing its own goals to the detriment of human goals may be understood
and philosophically critiqued in terms of the watertight dichotomy between nature and
culture. In this case “nature” is a purportedly pure human nature and autonomy that
must be safeguarded against the goals of autonomous AI. But, following Latour, it must
be conceded that there never has been a pure nature. In his words: we have never been
modern. We must accept that, just like other non-human actants, AI plays a role in the
continuous construction of the collective. The more this is recognised and normalised,
the less it will be possible to use AI for nefarious purposes in political processes. The
paper nevertheless ends with a concession to Russell that Latour’s politics of nature can
potentially reduce to a power politics, in which case a very powerful AI could indeed be
a threat to the human good.
has nothing to say in the political process. It is up to human actors to decide how we
should live together, what is moral and what is immoral, what is good and what is bad.
On these matters, science has nothing to say. Its role is restricted to simply presenting
the facts. In its essence science is and should be value free.
The watertight distinction between nature and culture can be associated with two
opposing traditions in modern political thought. Graham Harman, one of the foremost
English-language interpreters of Latour, formulates the two opposing traditions that Latour indicates in Politics of Nature, but seldom mentions in so many words, as the tradition of truth politics and that of power politics (Harman 2014 Kindle Loc. 201; see
also Harman 2009). The tradition of truth politics orients itself on what it regards as
objective truth. There are many variants of truth politics, also from premodern times,
but a salient modern example would be Marxism. After all, the history of all hitherto
existing society is the history of the development of an inexorable law – that of class
struggle, and the political process should be true to this law. Another example of truth
politics that Latour specifically treats in Politics of Nature is the politics of the so-called
Green movement. In this politics science is explicitly invoked as the touchstone of the
truth. The facts speak for themselves; we are destroying the environment and therefore
we must change our policies.
In contrast to the tradition of truth politics, according to Latour as interpreted by
Harman, we find in modern thought the tradition of power politics. Denying any objective
truth that should guide political action, power politics works on the principle that might
is right. Power is what structures society and what ultimately holds society together
(Harman 2014 Kindle Loc. 234–235). Here, of course, the salient exponent of such
an approach is Thomas Hobbes. The important point to realise, however, is that the
opposition between truth politics and power politics is only the surface effect of a deeper
agreement. Both of these approaches accept the unwavering separation of the realm of
objects, and the realm of subjects, or, in other words, of nature and freedom. They
only differ in where they place the emphasis: should the political process be guided by
objective facts or laws of nature, or should it be guided by human freedom? According to
Latour both traditions suffer from the same shortcoming: they seek to prematurely end
the political process. The strategy of truth politics is to cut off any further negotiation
by appealing to brute facts (Latour 2004:13). The strategy of power politics, on the other hand, is to short-circuit the political process by fiat (2004:54).
It is in the impasse between truth politics and power politics that Latour seeks to
make an intervention. He does this by demonstrating that the opposition between truth
politics and power politics does not hold, and that this is so because the opposition
between nature and culture does not hold. On the one hand, science can never be value
free. The presentation of scientific facts always has a persuasive character. The scientific enterprise has an agenda; it wants to nudge and cajole society in a specific direction. On
the other hand, human freedom simply must take into account the constraints posed by
certain stubborn realities that keep on thrusting themselves onto the agenda. Politically,
for example, a government can decide to open up schools and beaches and restaurants
in the midst of the Covid-19 pandemic, but eventually it can no longer be denied that
the virus keeps on spreading and people keep on dying.
deterministic rules of cause and effect. Furthermore, within this scheme a zero-sum
game is operative: the more entities are considered to be determined objects, the less
they can be considered as subjects, and vice versa (2004:76). But this distinction is
unhelpful according to Latour, and only serves to paralyze the political process. It is,
moreover, untenable. Because, if all entities are more and more treated as objects, we
can no longer count on the input of human actors with freedom and responsibility to
decide what must be done. Everyone is, after all, determined. Conversely, if the model
of free will is extended to everything, including the planet, there will no longer be “the
raw, unattackable nonhuman matters of fact that allow it to silence the multiplicity of
subjective viewpoints, each of which expresses itself in the name of its own interests”
(2004:73). To overcome this zero-sum deadlock, Latour proposes that we confess our
uncertainty about who is acting. Instead of talking about acting subjects and acted upon
objects, people and things, we should consistently talk of human and non-human actors.
All entities within the political collective act, simply by virtue of the fact that they
influence other actors. To rid our speech in this regard of any anthropomorphism, Latour
proposes that we talk of actants, instead of actors. An actant is an acting agent, an
intervener, an influencer. And, once again, in the political process we should keep on
enlarging the list of active actants in our commonwealth (2004:76).
The third dose of healthy agnosticism needed in a reconfigured political process is
uncertainty about what is real; what really exists. The nature-freedom divide often forces
us into a kind of materialistic naturalism on the one hand, or a constructivist idealism on
the other. But here, above all, the political process should be more pragmatic, according
to Latour. In dealing with those who speak, those who act and intervene in your world,
why not credit them with the properties you yourself hold dearest – in this case reality
(2004:77). Instead of taking external reality to be the simple “being there” of brute
facts, we should associate it with that which surprise us and interrupt the smooth flow
of our life with an insistence that it is there. “Actors are defined above all as obstacles,
scandals, as what suspends mastery, as what gets in the way of domination, as what
interrupts the closure and the composition of the collective.” (2004:81) While something
is stubbornly standing in the way of our definition of the common world, the res publica,
while something is recalcitrantly refusing to be ignored, we should accept its reality,
says Latour. And the more entities we admit as participants in our common enterprise
of forming our world, the better.
Now that we have undermined the old divide between nature and freedom, between the
common world and the common good, and have replaced it with a number of uncertainties
and a growing list of participants, we should think, along with Latour, about the possible
functioning of the political process in this new dispensation. In this section I would like to consider the role that Artificial Intelligence can play in this reconfigured political process, but in a restricted and still somewhat unsatisfactory way. In the last section of the paper I
then consider the deeper implications of Latour’s deconstruction of the nature-freedom
divide for our thinking of Artificial Intelligence and the common good and bring them
into discussion with Russell’s reservations.
Latour’s proposal for reconfiguring the political process involves “the rearrangement of the squares on the chess board” (2004:5). In other words, we should not merely rearrange the pieces on the board according to the same rules, but fundamentally reconceive how
the political process works. The blurring of the line, the constitutive uncertainty about
what is nature and what is culture, what is a subject and what an object remains the point
of departure in this regard. The point is, we cannot be certain of what should be regarded
as brute facts of nature, and what should be regarded as values of human freedom in
the political process. This does not mean, however, that we should not appreciate the rationale – paying attention to facts and being attentive to values – that animated the reference to facts and values in the first place.
According to Latour, in the notion of “fact” there are two legitimate imperatives at
work that are nevertheless confusedly held together within this one concept. Similarly,
in the notion of “value” there are two imperatives operating that are also legitimate, but
that are held together in a confused way. The first imperative within the confused notion
of fact is to be open to “external reality” (2004:110). As we have seen, actants stubbornly
establish their presence and demand to be acknowledged: they are there, whether we
like it or not. The second imperative confused within the notion of fact has to do with
acceptance or closure. At least until the next cycle of the political process (see below)
we should now accept that certain actants are part of the political process and that their
voices should be taken into account. Thus, contained in the erstwhile notion of fact,
there is on the one hand an imperative for openness to external reality, and on the other
hand an imperative for stabilizing and institutionalizing what is for the time being to be
accepted as part of the collective.
The first imperative rolled up into what was previously regarded as “values” in the
political process is the imperative to listen to and critically evaluate the voices of the
actants that stubbornly demand to be listened to. In Latour’s words, “it is necessary
to make sure that reliable witnesses, assured opinions, credible spokespersons have
been summoned up, thanks to a long effort of investigation and provocation (in the
etymological sense of ‘production of voices’)” (ibid.). Another way of describing this
imperative would be to talk of the requirement of openness or consultation.
The second imperative confusedly contained in the notion of value is the requirement
to weigh up and to decide where to position an actant within the hierarchy of importance
that functions in the body politic. If it is true that the political process is a clamour of
many voices all appealing for a place in the sun of the common world, then it is just as true
that some kind of hierarchy must be communally agreed upon, otherwise there will only
be chaos. The relative importance of a voice – an interest – within the commonwealth
must be established through a process of give and take. Here it can clearly be seen how
politics is the proverbial art of compromise.
Latour therefore unbundles the defunct opposition between facts and values into
four imperatives or requirements of the political process. First the requirement to pay
attention to actants that announce their intention to become part of the political process.
The body politic must be open – willing to become perplexed – by the possible reality
of voices that have hitherto not been recognized as real. Secondly there is a requirement
to critically evaluate the voices of the actants that are harrowing the body politic. This is the requirement of consultation: what are the new voices really saying? Who is speaking
on their behalf, using what means? The third requirement, then, is the requirement to
rank the importance of a voice within the hierarchy that is the body politic. Where
should the new actant that has been identified and listened to fit in? What is its relative
importance? And finally, there is the requirement of provisional closure: the imperative to institutionalize, at least for a time, the hierarchy that has been established so that the body politic can live and be a common world.
The perceptive reader will have noticed that from the unbundling of the fact-value
distinction to the enumeration of the four imperatives functioning in the political process
a subtle shift has taken place. The four imperatives have been grouped differently. Latour
does this to highlight that there are two powers at work in the political process. The first
power is a power of destabilization or unsettling. Far from being negative, this power is
necessary for the health of the process. The second power is a power of stabilization and
institutionalization – a power equally necessary for the health of the political process.
Latour names the first power (the destabilizing and unsettling power) the power to “take
into account”. Two imperatives energize this power – one from the erstwhile notion of
fact, and one from the erstwhile notion of value. The imperative to be open to becoming
perplexed by external reality, and the imperative to evaluatively engage with actants
that become visible together drive the power to take into account. This power opens up
and unsettles the body politic so that it can change and grow. The second power Latour
names the power to “arrange in rank order”. This power, similarly, is made up of two
imperatives – one from the erstwhile notion of fact, and the other from the erstwhile
notion of value. In the first place the imperative to decide where in the hierarchy an
actant should be positioned is what energizes the power to arrange in rank order. In the
second place the imperative to institutionalize or close down further discussion is what
energizes the power to arrange in rank order. This power then evidently stabilizes the
body politic so that it can live and function.
It is very important to note that what has been described above is what Latour calls a
single cycle in the political process. Once provisional closure has been reached through
the power to arrange in rank order, the process starts up again in a next iteration: the
perplexity caused by actants that have hitherto been excluded must be heeded as it
functions within the power to take into account. And so, in Latour’s conception, we have
a circular process where the two powers continually operate and balance each other out.
The point I would like to make now is that this conception of the political process gives
us the theoretical tools to think about the role of Artificial Intelligence in that process,
and specifically in pursuit of the common good (bearing in mind that this conception also
disturbs the strict border between the common world and the common good). Artificial Intelligence can play an auxiliary or amplifying role regarding all four of the imperatives,
and concomitantly, with regard to both the powers at work in the political process. In
line with a general insight regarding technology (cf. Ihde and Malafouris 2019), AI
can furthermore function in a positive way as well as in a destructive way in all these
processes.
In its present form AI is already functioning in service of the imperative to openness
in the political process. In this regard one can think of the many data analysing algorithms
at work today. Using these algorithms, trends and patterns are identified, and these then
become actants whose candidacy for reality and inclusion in the body politic must be
considered. Russell (2019:73) provides an excellent example of machines’ role in the
imperative to openness. At present thousands of satellites are continuously imaging
every square meter of the world’s surface. In Russell’s estimation, more than thirty
million human employees would be necessary to analyse all the images received from
satellites. The result is that much of the satellite data is never seen by human eyes.
However, computer vision algorithms process this data to produce searchable databases
“with visualizations and predictive models of economic activities, changes in vegetation,
migrations of animals and people, the effects of climate change, and so on” (ibid.). All of this results in an increased sensitivity towards new entities or phenomena that should
be taken into account in the construction of the common world.
Once an entity’s candidacy for citizenship has been registered, its claims must be
evaluated and weighed. It will be recalled that in Latour’s view the imperative here is to
make a case, and to be open to the case made. It is thus a matter of advocacy and of how
compelling a case can be made. Russell notes that AI will play a huge role in this regard,
in the sense that services previously open only to the super-rich will become accessible
to everyone. “And in the mundane world of daily life, an intelligent assistant and guide
would—if well designed and not co-opted by economic and political interests—empower
every individual to act effectively on their own behalf in an increasingly complex and
sometimes hostile economic and political system. You would, in effect, have a high-
powered lawyer, accountant, and political adviser on call at any time.” (Russell 2019:105)
On the other hand, algorithms are also already at work to strengthen the power to evaluate
and weigh up the appeals made by an actant in the public sphere. AI is already playing
a role in various fact checking services that monitor and moderate the many voices on
social media and news sites (Russell 2019:113). In this regard one can think of sites like
factcheck.org and snopes.com.
The second imperative at work in the power to arrange in rank order is the imperative
to establish a hierarchy of interests. It is the imperative to perform triage regarding
the relative importance of an actant’s demands. Here, as well, AI is already rendering
valuable service, and the expectation is that this will increase in the future as the capacity
of AI increases. Russell (2019:134) takes an example from the airline industry to illustrate
the decision-making power of AI. At first computers were only involved in the drawing
up of flight schedules. Then the booking of seats, the allocation of flight staff and
the booking of routine maintenance were also computerised. Next, airlines’ computers
were connected to international aviation networks to provide real-time status updates
on flights and situations at airports. At present algorithms are taking over the job of
managing disruption in the aviation workflow by “rerouting planes, rescheduling staff,
rebooking passengers and revising maintenance schedules” (ibid.). Would AI be able to
perform similar functions in the area of governance and the allocation of public funds?
Undoubtedly. This becomes even more apparent when the power of AI in scenario
planning is considered (cf. Sohrabi et al 2018).
The final imperative for the political process is again part of the power to arrange in rank order. But now it is an imperative towards provisional closure of the body politic. For the commonwealth to function, certain realities must be stabilised, at least for the time
being. In this regard two examples of the contribution of Artificial Intelligence should
suffice. In the first place, AI can play a role in understanding what the current state of
stability and preferences looks like. By looking at an initial state, learning algorithms
can now already infer the implicit preferences present in that state, and bring them to
light, thus accurately displaying the present state of affairs (Shah et al. 2019). The second
example pertains to the moderating role that AI plays in contemporary social media. As
body politic we have agreed amongst ourselves that it is not acceptable that the dignity of
certain actors should be jeopardised, for instance through the language used to describe
them, or the incitement of violence towards them, or the denial of their right to existence.
Algorithms monitor social media posts and are sensitive towards certain formulations.
This could result in posts being deleted and accounts being suspended. In such a way
a definite affirmation of the legitimacy of a particular social ordering is achieved. But, as Latour emphasises, this is only until the cycle of the political process starts up again, and the
voices of all actants, old and new, are taken into account again.
While taking note of the possible service that AI can render to the two powers at work
in the composition of the common world, the fundamental uncertainties that Latour takes
as his points of departure must again be emphasised. Misrepresentation and deception
are also possible and are certainly also actual in the political process. AI can also amplify
these forces, as has been amply illustrated in recent electoral processes. While noting
this, I will not elaborate on it, and rather return to the original question of AI’s place in
a social world where a clear distinction between nature and culture does not hold water.
could short circuit the process of consultation, of listening, of weighing up the claims of
humans as well as non-humans in their co-existence. We find ourselves with a constitutive
uncertainty regarding the common good, including the good of human beings. According
to Latour’s conception, Russell is short circuiting the political ecology by appealing to
a pure human nature that is simply given.
Interestingly, Russell acknowledges the uncertainty about what would constitute the
human good at various instances in his book (e.g. 2019:23), but he nevertheless maintains
that a practical, engineering kind of safety system must be put in place to ensure that
the design of artificial intelligence would always follow human preferences (2019:188).
Russell suggests that while humans are not always certain about what constitutes human
flourishing, all humans would agree that being subservient to an artificial superintelligence that is indifferent to human preferences would not be good. He therefore suggests that AI
should be designed to have a constitutive uncertainty about human preferences and to
always defer to humans about their preferences.
From his perspective of the common good as inextricably bound up with the common
world, Latour might conceivably counter that the circle of the political process be allowed
to take its course. Thus, Artificial Intelligence, just like any other actant, would arrive
on the radar of the common world through its recalcitrance – it refuses to go away. This
is definitely already the case with AI, and Russell admits as much in his book. Secondly,
following Latour’s imperatives, we would have to listen to and weigh up the case that
AI makes for its inclusion in our commonwealth. Latour’s generous understanding of
agency will initially make things easier: if rising sea levels or a virus can have a voice in
the political process, then AI certainly can as well. The advantage of Latour’s imperative
towards openness is also that it urges awareness. In the political process, we need to be
aware of AI’s presence. AI should not be allowed to become invisible and work in the
background. The more we become aware, for instance, that AI tracks our preferences and
tailors communication accordingly, the more we will weigh it up before accepting it.
The third imperative (part of the power to arrange in rank order) is to fit AI into the
hierarchy of importance in the political process. In this case as well it cannot be all or
nothing – either deny general AI a place in the hierarchy or capitulate and allow AI to
pursue its own interests unchecked. There must be an ongoing process of negotiation, one that keeps in mind the unique contributions that humans and other actants can bring to the body politic.
Finally, Latour urges that the political process be stabilized, at least provisionally.
In this regard one can think of the legislation and the various protocols and industry
standards that must be in place with regard to AI in its present form. When AI develops
into artificial general intelligence (AGI), this will have to be revisited and reformed for
the next cycle of the political process.
In the case of the last imperative, Russell, of course, is afraid that the stabilization
will be too little too late. Once a certain boundary is crossed, the development of AI
will be out of human control and will go ahead according to its own goals. Russell,
in other words, is worried that AI will become so powerful that it will take over the
whole political process. All other actants will be effectively powerless in the face of AI’s
power, with the result that there will really be only one actant in town. Dave Eggers’
novel The Circle provides a sketch of what the early stages of such a scenario could
look like: people are effectively forced to live completely transparent lives, because the
tiniest details of their lives are recorded and analysed and regulated (Eggers 2014; cf.
Horvat 2019:47–50).
Latour would, of course, insist that the political process must be continuously dis-
rupted. The circular process is an ongoing give-and-take process. It cannot be smoothed
over and managed by one sovereign. The smooth circle of Eggers’ dystopia, where AI becomes all-powerful but recedes into the background, should not be allowed to happen.
Rather, just like all other actants, AI’s functioning should be noticed and weighed in the
political process. The question, however, remains: what if AI becomes too powerful?
This is indeed where Latour’s proposal for a political ecology is vulnerable to critique.
It has been suggested that Latour’s model, if pressed to its consequences, falls back
into power politics (Harman 2014:19). If the political process is one of negotiation, of garnering support for one’s interests, of pressing others into service for one’s aims, then the interests of the strongest and most convincing will prevail. In Russell’s estimation there is
a very real possibility that AI might emerge as the strongest to the detriment of humans
in society.
In considering AI’s place in Latour’s Politics of Nature one is then, seemingly, left
with the challenge to move beyond the current opposition of two unacceptable extremes.
On the one hand, a truth politics that assumes there is a pure human nature and definite human interests that must be protected against AI should be avoided. On the other hand, the alternative of a naked power politics must also be avoided, because there is a very real possibility that super AI may emerge as the most powerful. Latour’s solution to the
dilemma is that the circular movement of the political process should never be allowed
to stall. The process cannot be short circuited by an appeal to a pure human nature and
a purely human good. But equally the process must not be allowed to be hijacked by
immensely powerful AGI. In this regard the question is whether humans can rediscover and optimize their own important and irreplaceable contributions to the common world, contributions which will ensure them a dignified and flourishing place in this commonwealth.
References
Collins, H.M., Yearley, S.: Epistemological chicken. In: Pickering, A. (ed.) Science as Practice
and Culture, pp. 301–326. The University of Chicago Press, Chicago and London (1992)
Eggers, D.: The Circle. Vintage, New York (2014)
Harman, G.: Prince of Networks: Bruno Latour and Metaphysics. Re.press, Melbourne (2009)
Harman, G.: Bruno Latour: Reassembling the Political, Kindle Edition. Pluto Press, London (2014)
Horvat, S.: Poetry from the Future. Penguin, London (2019)
Ihde, D., Malafouris, L.: Homo faber revisited: postphenomenology and material engagement
theory. Philos. Technol. 32, 195–214 (2019)
Latour, B., Woolgar, S.: Laboratory Life – The Construction of Scientific Facts. Princeton University Press, Princeton (1986)
Latour, B.: Science in Action – How to Follow Scientists and Engineers Through Society. Harvard
University Press, Cambridge (1988)
Latour, B.: We Have Never Been Modern, trans. Catherine Porter. Harvard University Press,
Cambridge (1993)
Latour, B.: Politics of Nature – How to Bring the Sciences into Democracy, trans. Catherine Porter.
Harvard University Press, Cambridge (2004)
Pollini, J.: Bruno Latour and the ontological dissolution of nature in the social sciences: a critical
review. Environ. Values 22(1), 25–42 (2013)
Rorty, R.: Philosophy and the Mirror of Nature – Thirtieth Anniversary Edition. Princeton University Press, Princeton (2017)
Russell, S.: Human Compatible – Artificial Intelligence and the Problem of Control. Viking,
London (2019)
Shah, R., et al.: Preferences implicit in the state of the world. In: Proceedings of
the 7th International Conference on Learning Representations (2019). Available at
iclr.cc/Conferences/2019/Schedule
Sohrabi, S., Riabov, A.V., Katz, M., Udrea, O.: An AI Planning Solution to Scenario Generation
for Enterprise Risk Management. Association for the Advancement of Artificial Intelligence
(2018)
Walsham, G.: Actor-Network Theory and IS research: current status and future prospects. In:
Lee, A.S., Liebenau, J., DeGross, J.I. (eds.) Information Systems and Qualitative Research.
Proceedings of the IFIP TC8 WG 8.2 International (1997)
The Quest for Actionable AI Ethics
Emma Ruttkamp-Bloem1,2
1 Department of Philosophy, University of Pretoria, Pretoria, South Africa
[email protected]
2 Centre for AI Research (CAIR), Pretoria, South Africa
1 Introduction
In this paper, I argue that in order to ensure that AI ethics is actionable, the approach to AI ethics should change in two novel ways. Firstly, AI ethics should be approached in a multi-disciplinary manner focused on concrete research in the discipline of the ethics of AI; secondly, it should be approached as a dynamic system on the basis of virtue ethics, in order to work towards enabling all AI actors to take responsibility for their own actions and to hold others accountable for theirs. In conclusion,
the paper emphasises the importance of understanding AI ethics as playing out
on a continuum of interconnected interests across academia, civil society, public
policy-making and the private sector (including private sector companies ranging
from start-ups to small- and medium-sized enterprises to large transnational compa-
nies). In addition, a novel notion of ‘AI ethics capital’ is put on the table as a
core ingredient of trustworthy AI and an outcome of actionable AI ethics.
In the face of the relative ineffectiveness of a host of recent policy guidelines,
including inter-governmental policies, national policies, professional policies, and
policies generated in the private sector, there is a growing call from the AI com-
munity to increase the effectiveness of AI ethics guidelines1. Luciano Floridi
[35] highlights the risks of not actionalising AI ethics guidelines in his article
Translating Principles into Practices of Digital Ethics. He identifies five dan-
gerous practices that may take root in a context in which AI ethics remains
idealistic and removed from the every day working reality of the technical com-
munity, and which ultimately may work against actionable AI ethics: (1) Ethics shopping: there is confusion, given the almost 100 sets of AI ethics policies available at present [3] and the absence of “clear, shared, and publicly accepted ethical standards” [35];
(2) ethics bluewashing: pretending to work, or working superficially together
towards establishing trustworthy AI instead of establishing “public, account-
able, and evidence-based transparency about good practices and ethical claims”
(ibid.) and ensuring AI and AI ethics literacy of all AI actors (including board
members of private sector companies and government officials); (3) ethics lob-
bying: promoting self-regulation instead of introducing enforceable ethical and
legal norms; (4) ethics dumping: “the export of unethical research practices to
countries where there are weaker . . . legal and ethical frameworks and enforcing
mechanisms” (ibid.) as opposed to establishing a culture of research and con-
sumption ethics; and (5) ethics shirking: weak execution of ethical duties given
a perception of low returns on ethical adherence, instead of establishing clear
lines of responsibility.
In his turn, Brent Mittelstadt [53] warns, in an article entitled Principles Alone Cannot Guarantee Ethical AI, that the “real” work of AI ethics only starts
now that we are faced with a multitude of policies. This work is “to . . . imple-
ment our lofty principles, and in doing so to begin to understand the real ethical
challenges of AI” (ibid.). Thilo Hagendorff [43], in an article entitled The Ethics
of AI Ethics: An Evaluation of Ethical Guidelines, concurs, and mentions the
lack of mechanisms AI ethics has to “reinforce its own normative claims” (ibid. p.
99), the view of AI ethics guidelines as coming from “ ‘outside’ the technical com-
munity” (ibid. p. 114)2 , and the lack of “distributed responsibility in conjunction
with a lack of knowledge about long-term or broader societal technological con-
sequences causing software developers to lack a feeling of accountability or a
view of the moral significance of their work” (ibid.) as serious obstacles towards
realising the ‘lofty principles’ of current AI ethics.
Some suggestions have been made to address the current ‘inactive’ status of AI ethics. These include advocating for and making hands-on, concrete suggestions for ethical machine learning from within the machine learning community itself3 in terms of technical methods of addressing concerns around bias, transparency and accountability (see e.g. [31,58,74]); warnings about the con-
1 See e.g. [18, 28, 35, 40, 43, 46, 51, 53, 58, 61, 72, 74, 83, 87] for discussions from various points of view of the current state of affairs of AI ethics.
2 For instance, 79% of tech workers would like practical guidance with considering, implementing and adhering to ethical guidelines [52].
3 Acknowledgment of the work of the ethics and society branch of DeepMind, the OpenAI initiative, and the FAT/ML association is important in this regard.
4 The AI system lifecycle is taken to range at least from research, design, development, deployment to use (“including maintenance, operation, trade, financing, monitoring and evaluation, validation, end-of-use, disassembly, and termination” [78]).
5 This definition is based on the one given in the UNESCO First Draft of the Recommendation on the Ethics of AI [78].
Part of why there are different approaches to defining the discipline of the
ethics of AI is the fact that it has crystallised into at least the (non-exclusive)
subfields of machine ethics, data or algorithm ethics, robot ethics, information
ethics, and neuro-ethics. Machine ethics focuses on the ethics of the design of
artificial moral decision making capacities and socio-moral analyses of the con-
cept of artificial morality (see e.g. [4,8,16,17,56,85]). Gunkel ([42] p. 101) dis-
tinguishes between computer ethics and machine ethics: “computer ethics . . .
is concerned . . . with questions of human action through the instrumentality
of computers and related information systems. In clear distinction from these
efforts, machine ethics seeks to enlarge the scope of moral agents by consider-
ing the ethical status and actions of machines”. In these terms, machine ethics
is concerned with “ethics for machines, for ‘ethical machines’, for machines as
subjects, rather than for the human use of machines as objects” [59], as the
latter is the focus of robot ethics and also relates to computer ethics as defined
above (see also [71]). Another option [69] is to refine machine ethics into thinking
separately about technical aspects of computational tractability (computational
ethics) and thinking about the ethics of machines with moral agency (machine
ethics).
Robot ethics, also known as the ethics of social robots, is focused on the
impact of social robots on society (e.g. [64]), on human-robot interaction (HRI),
on the anthropomorphisation of robots and the objectification of humans, and
robot rights (see e.g. [10,13,15,30,42,70]) and also may be broken into focusing
separately on AI-AI interaction, AI-human interaction and AI-society interaction
(see [71]). Furthermore, the ethics of social robots may also be incorporated into
robo-ethics, which is “concerned with the moral behaviour of humans as they
design, construct, use and interact with AI agents” (ibid.) (see also [82]). In his
turn, Asaro ([9] p. 10) argues that the field which he calls ‘robot ethics’ is focused
on the ethical systems built into robots (focuses on robots as ethical subjects
and relates to machine ethics and thus sometimes machine ethics is viewed as a
subset of robot ethics); the ethics of people who design and use robots (focuses
on humans as ethical subjects and relates to robo-ethics and computer ethics);
and the ethics of how people treat robots (focuses on ethical interaction and
relates to what is sometimes called the ethics of social robots). Asaro (ibid. p.
11) argues that the best approach to robot ethics is one that addresses all three
of these and that views robots as socio-technical systems.
Data ethics is centered on issues around fair, accountable and transparent
machine learning, or so-called ‘critical machine learning’, socio-technical analy-
ses of machine learning practices and their impact on society, and responsible
data governance (see e.g. [12,39,81]). As such, it is a “branch of ethics that
studies and evaluates moral problems related to algorithms (including artificial
intelligence, artificial agents, machine learning and robots) and corresponding
practices (including responsible innovation, programming, hacking and profes-
sional codes), in order to formulate and support morally good solutions (e.g.
right conducts or right values)” ([39] p. 1). Information ethics, in its turn,
relates to data and algorithm ethics on the one hand, and on ethical elements
6 This is basically the problem of why consciousness occurs at all, combined with the problem of explaining subjective experience, or the ‘feeling what it is like’.
One high level way in which to sensitise civil society to AI ethics, is to ensure
that the values and ethical standards embodied in AI ethics guidelines are shared
values. Focusing on ‘intrinsic’ values, as opposed to ‘extrinsic’ values may be a
good beginning. Judgements of intrinsic value are evaluations of things that have
value for their own sake, while extrinsic values get their value from their function
or how they fit into a bigger system (see e.g. [11,57,75]). Intrinsic values include
human life, freedom, peace, security, harmony, friendship, social justice, etc. The
rationale behind emphasising intrinsic values is that such values are respected
universally, given their intrinsic nature, but more importantly, that non-buy-
in to these values is detrimental to everyone, and is perhaps most felt at the
level of ordinary citizens as the most vulnerable of AI actors. And this is what
civil society should be sensitised to grasp. Furthermore, given the international
legal stature of international human rights law, principles, and standards, a
human rights perspective in AI ethics guidelines may not only strengthen the
potential for legal enforcement, but is also, again, a way in which to establish common ground for AI ethics standards (see e.g. [25,48,63]) and to ensure that every member of civil society understands the consequences of not adhering to AI
ethics guidelines. These perspectives alone are however not concrete enough.
What is needed in addition is to bring home to civil society that the disruptiveness of AI technology impacts every sphere of human life, that ‘being
human’ and enjoying fundamental freedoms are in danger of coming under
increased control of AI technologies, and, perhaps most importantly, to ensure
that there are safeguards against ‘moral de-skilling’ by technology. In an article
entitled Moral Deskilling and Upskilling in a New Machine Age: Reflections on
the Ambiguous Future of Character [79], Shannon Vallor warns that “. . . moral
skills appear just as vulnerable to disruption or devaluation by technology-driven
shifts in human practices as are professional or artisanal skills such as machining,
shoemaking, or gardening. This is because moral skills are typically acquired in
specific practices, which, under the right conditions and with sufficient opportu-
nity for repetition, foster the cultivation of practical wisdom and moral habitu-
ation that jointly constitute genuine virtue. . . . profound technological shifts in
human practices, if they disrupt or reduce the availability of these opportunities,
can interrupt the path by which these moral skills are developed, habituated,
and expressed” (ibid. p. 109).
On the one hand, this points to the need for strong campaigns driving both
AI fundamentals and AI ethics literacy given that society “has greater control
than it has ever had over outcomes related to (1) who people become; (2) what
people can [or may] do; (3) what people can achieve, and (4) how people can
interact with the world” ([58] p. 1). In other words, civil society should become
aware and have a basic understanding of the potential of some AI technologies
to threaten fundamental freedoms and change the moral fibre of societies.
On the other hand, we should ensure that trust in technology does not have
the upper hand, by ensuring that we can legitimately trust in humans and their
abilities. There is thus a responsibility that comes with protecting human dignity,
human oversight and human centeredness, i.e. of fighting for ‘AI with a human
9 Compare Floridi’s [34] argument that every actor who is “causally relevant for bringing about the collective consequences or impacts in question, has to be held accountable” ([43] p. 113).
honest rational deliberation on a case-by-case basis also means that this approach
can deal with the fluidity of changing societal and political structures as well as
the pace of AI technological advancement. In this way, AI ethics is then less about
disciplining AI actors to adhere to ethical guidelines, and more about positive
self-realisation of moral responsibilities as this model “emancipate[s AI actors]
from potential inabilities to act self-responsibly on the basis of comprehensive
knowledge, as well as empathy in situations where morally relevant decisions
have to be made” (ibid. p. 114).
Only if every AI actor understands why regulating the life cycle of AI systems
is necessary and sees their own role in this process, can the AI ethics project hope
to be successful. The potential for meeting these objectives within a participatory
virtue ethics approach to AI ethics as a dynamic ethical system should be clear.
4 Conclusion
The call for addressing the lack of impact of AI ethics on tech communities is real.
In this paper, a novel participatory model for AI ethics based on a virtue ethics
approach to AI ethics and underpinned by state of the art multi-disciplinary
research and collaboration concretely anchored in research in the discipline of
the ethics of AI has been suggested. Such an approach may do much to change
the negative conception of AI ethics as stifling innovation by “broadening the
scope of action, uncovering blind spots, promoting autonomy and freedom, and
fostering self-responsibility” ([43] pp. 112–113). In addition, this approach can
deal positively with the concern raised by Morley et al. [58] that, “in a digital
context, ethical principles are not simply either applied or not, but regularly
re-applied or applied differently, or better, or ignored as algorithmic systems are
developed, deployed, configured . . . tested, revised and re-tuned. . . ” (ibid. p.
18), as it allows for AI ethics as a dynamic adaptive ethical system within which it is the active cultivation of techno-moral virtues, rational deliberation among all AI actors, and mutual respect for concrete multi-disciplinary research that guide ethical decisions.
In conclusion, let us consider what the implications for the concept of trust-
worthy AI are, should we meet the quest for actionable AI in the terms described
above. First, trustworthiness becomes a socio-technical concept, focused as much
on the safety and robustness of AI technologies as it is on respect for every indi-
vidual human AI actor. In this context, given the active role of AI actors in the
AI ethics project, and their shared responsibility to action-alise AI ethics, trust
becomes a benchmark for the social acceptance of AI technologies. Thus, there
will be good reason to trust that AI technology brings benefits while adequate
measures are taken to mitigate risks, as the trust at issue is not only in technol-
ogy but trust in the actions of AI actors actively involved in contributing to the
dynamic model of AI ethics.10
10 See the first version of the UNESCO First Draft of the Recommendation on the Ethics of AI [78].
References
1. Abdul, A., Vermeulen, J., Wang, D.: Trends and trajectories for explainable,
accountable and intelligible systems: an HCI research agenda. In: Proceedings of
the 2018 CHI Conference on Human Factors in Computing Systems - CHI, vol. 18,
pp. 1–18 (2018). https://doi.org/10.1145/3173574.3174156
11 https://www.oecd.org/insights/humancapital-thevalueofpeople.htm.
2. Adams, F., Aizawa, K.: The Bounds of Cognition, 2nd edn. Blackwell, Oxford
(2010)
3. Algorithm-Watch: AI ethics global inventory. https://inventory.algorithmwatch.
org/. Accessed 20 Sept 2020
4. Allen, C., Varner, G., Zinser, J.: Prolegomena to any future artificial moral
agent. J. Exp. Theor. Artif. Intell. 12(3), 251–261 (2000). https://doi.org/10.1080/
09528130050111428
5. Alshammari, M., Simpson, A.: Towards a principled approach for engineering pri-
vacy by design. In: Schweighofer, E., Leitold, H., Mitrakas, A., Rannenberg, K.
(eds.) APF 2017. LNCS, vol. 10518, pp. 161–177. Springer, Cham (2017). https://
doi.org/10.1007/978-3-319-67280-9 9
6. Anabo, I., Elexpuru-Albizuri, I., Villardón-Gallego, L.: Revisiting the Belmont
report’s ethical principles in internet-mediated research: perspectives from dis-
ciplinary associations in the social sciences. Ethics Inf. Technol. 21(2), 137–149
(2019). https://doi.org/10.1007/s10676-018-9495-z
7. Ananny, M.: Toward an ethics of algorithms: Convening, observation, probability,
and timeliness. Sci. Technol. Hum. Values 41(1), 93–117 (2016)
8. Anderson, M., Anderson, S.: Machine ethics: creating an ethical intelligent agent.
AI Mag. 28(4), 15–26 (2007)
9. Asaro, P.: What should we want from a robot ethic? Int. Rev. Inf. Ethics 6(12),
9–16 (2006)
10. Asaro, P.: A body to kick, but still no soul to damn: legal perspectives. In: Lin, P.,
Abney, K., Bekey, G.A. (eds.) Robot Ethics: The Ethical and Social Implications
of Robotics, pp. 169–186. MIT Press, Cambridge (2012)
11. Audi, R.: Intrinsic value and reasons for action. Southern J. Philos. 41, 30–56
(2003)
12. Barocas, S., Selbst, A.: Big data’s disparate impact. Calif. Law Rev. 104, 671–732
(2016)
13. Bekey, A.: Current trends in robotics: technology and ethics. In: Lin, P., Abney, K.,
Bekey, G. (eds.) Robot Ethics: The Ethical and Social Implications of Robotics,
pp. 17–34. MIT Press, Cambridge (2012)
14. Benedikter, R., Siepmann, K., Reymann, A.: Head-transplanting’ and ‘mind-
uploading’: philosophical implications and potential social consequences of two
medico-scientific utopias. Rev. Contemp. Philos. 16, 38–82 (2017)
15. Boden, M., Bryson, J., Caldwell, D.: Principles of robotics: regulating robots in
the real world. Connect. Sci. 29(2), 124–129 (2017)
16. Bostrom, N., Yudkowsky, E.: The ethics of artificial intelligence. In: Frankish, K.,
Ramsey, W. (eds.) The Cambridge Handbook of Artificial Intelligence, pp. 316–
334. Cambridge University Press, Cambridge (2014)
17. Brundage, M.: Limitations and risks of machine ethics. J. Exp. Theor. Artif. Intell.
26(3), 355–372 (2014)
18. Campolo, A.: AI now 2017 report (2017). https://assets.ctfassets.net/8wprhhvnpfc0/1A9c3ZTCZa2KEYM64Wsc2a/8636557c5fb14f2b74b2be64c3ce0c78/_AI_Now_Institute_2017_Report_.pdf
19. Chalmers, D.: Facing up to the problem of consciousness. J. Consciousness Stud.
2, 200–19 (1995)
20. Chalmers, D.: The singularity: a philosophical analysis. J. Consciousness Stud.
17(9–10), 7–65 (2010)
21. Clark, A.: Natural-Born Cyborgs: Minds, Technologies, and the Future of Human
Intelligence. Oxford University Press, Oxford (2003)
22. Clark, A.: Intrinsic content, active memory and the extended mind. Analysis 65(1),
1–11 (2005)
23. Clark, A.: The frozen cyborg: a reply to selinger and engström. Phenomenol. Cogn.
Sci. 7, 343–346 (2008). https://doi.org/10.1007/s11097-008-9105-3
24. Clark, A., Chalmers, D.: The extended mind. Analysis 58, 7–19 (1998). https://
doi.org/10.1093/analys/58.1.7
25. Comninos, A.: Fabrics: Emerging AI readiness (2018)
26. Corabi, J., Schneider, S.: The metaphysics of mind uploading. J. Consciousness
Stud. 19(7–8), 26–44 (2012)
27. Couldry, N., Hepp, A.: The Mediated Construction of Reality. Polity Press, Cam-
bridge (2017)
28. Crawford, K.: The AI now report: the social and economic implications of artificial
intelligence technologies in the near-term (2016). https://artificialintelligencenow.
com
29. Crawford, K., Calo, R.: There is a blind spot in AI research. Nature 538(7625),
311–313 (2016)
30. Danaher, J.: The philosophical case for robot friendship. J. Posthuman Stud. 3(1),
5–24 (2019). https://doi.org/10.5325/jpoststud.3.1.0005
31. Diakopoulos, N.: Algorithmic accountability: journalistic investigation of compu-
tational power structures. Digit. Journal. 3(3), 398–415 (2015). https://doi.org/
10.1080/21670811.2014.976411
32. Eliasmith, C.: How to Build a Brain: A Neural Architecture for Biological Cogni-
tion. Oxford University Press, Oxford (2013)
33. Floridi, L.: The Online Manifesto: Being Human in a Hyper Connected Era.
Springer, Heidelberg (2015). https://doi.org/10.1007/978-3-319-04093-6
34. Floridi, L.: Faultless responsibility: on the nature and allocation of moral respon-
sibility for distributed moral actions. Philos. Trans. Ser. A Math. Phys. Eng. Sci.
374(2083), 1–13 (2016)
35. Floridi, L.: Establishing the rules for building trustworthy AI. Nat. Mach. Intell.
1, 261–262 (2019). https://doi.org/10.1038/s42256-019-0055-y
36. Floridi, L.: Translating principles into practices of digital ethics: five risks of being
unethical. Philos. Technol. 32, 185–193 (2019). https://doi.org/10.1007/s13347-
019-00354-x
37. Floridi, L., Cowls, J.: AI4People - an ethical framework for a good AI society:
opportunities, risks, principles, and recommendations. Minds Mach. 28(4), 689–
707 (2018)
38. Floridi, L., Cowls, J.: A unified framework of five principles for AI in society.
Harvard Data Sci. Rev. 1(1) (2019). https://doi.org/10.1162/99608f92.8cd550d1
39. Floridi, L., Taddeo, M.: What is data ethics? Philos. Trans. Roy. Soc. A Math.
Phys. Eng. Sci. 374(2083) (2016). https://doi.org/10.1098/rsta.2016.0360
40. Green, B.: Ethical reflections on artificial intelligence. Scientia et Fides 6(2) (2018).
https://doi.org/10.12775/SetF.2018.015
41. Greenhill, K., Oppenheim, B.: Rumor has it: the adoption of unverified information
in conflict zones. Int. Stud. Q. 61(3), 660–676 (2017). https://doi.org/10.1093/isq/
sqx015
42. Gunkel, D.: The Machine Question: Critical Perspectives on AI, Robots, and
Ethics. MIT Press, Cambridge (2012)
43. Hagendorff, T.: The ethics of AI ethics: an evaluation of guidelines. Minds Mach.
30, 99–120 (2020). https://doi.org/10.1007/s11023-020-09517-8
44. Hansell, G.: H+/-: Transhumanism and Its Critics. Xlibris Corporation (2011)
45. Innes, M., Dobreva, D., Innes, H.: Disinformation and digital influencing after ter-
rorism: spoofing, truthing and social proofing. Contemp. Soc. Sci. (2019). https://
doi.org/10.1080/21582041.2019.1569714
46. Jobin, A., Ienca, M., Vayena, E.: The global landscape of AI ethics guidelines. Nat.
Mach. Intell. 1, 389–399 (2019). https://doi.org/10.1038/s42256-019-0088-2
47. Kroll, J.: The fallacy of inscrutability. Philos. Trans. Roy. Soc. A Math. Phys. Eng.
Sci. 376(2133) (2018). https://doi.org/10.1098/rsta.2018.0084
48. Latonero, M.: Governing artificial intelligence: upholding human rights & dignity’.
Data and Society, USC (2018)
49. Leonelli, S.: Locating ethics in data science: responsibility and accountability in
global and distributed knowledge production systems. Philos. Trans. Roy. Soc. A
(2016). https://doi.org/10.1098/rsta.2016.0122
50. Lin, P., Abney, K., Bekey , G.A.: Robot Ethics. The Ethical and Social Implications
of Robot Ethics. MIT Press, Cambridge (2012)
51. McNamara, A., Smith, J., Murphy-Hill, E.: Does ACM’s code of ethics change
ethical decision making in software development? In: Leavens, G., Garcia, A.,
Păsăreanu, C. (eds.) Proceedings of the 2018 26th ACM Joint Meeting on European
Software Engineering Conference and Symposium on the Foundations of Software
Engineering-ESEC/FSE 2018, pp. 1–7. ACM Press, New York (2018)
52. Miller, C., Coldicott, R.: People: power and technology: the tech workers’ view
(2019). https://doteveryone.org.uk/report/workersview/. Retrieved from Dotev-
eryone website
53. Mittelstadt, B.: Principles alone cannot guarantee ethical AI. Nat. Mach. Intell. 1,
501–507 (2019). https://doi.org/10.1038/s42256-019-0114-4
54. Momčilović, A. (2020). https://www.ssbm.ch/blog/naic-foundations-is-human-
capital-the-only-thing-becoming-and-remaining-important-by-aco-momcilovic-
emba/. Accessed 21 Sept 2020
55. Moor, J.: What is computer ethics? Metaphilosophy 16(4), 266–275 (1985)
56. Moor, J.: The nature, importance, and difficulty of machine ethics. IEEE 21(4),
18–21 (2006)
57. Moore, G.: Philosophical Papers. Allen and Unwin (1959)
58. Morley, J., Floridi, L., Kinsey, L.: From what to how: an initial review of publicly
available AI ethics tools, methods and research to translate principles into prac-
tices. Sci. Eng. Ethics 26, 2141–2168 (2020). https://doi.org/10.1007/s11948-019-
00165-5
59. Müller, V.: Ethics of artificial intelligence and robotics. In: Zalta, E.N. (ed.) The
Stanford Encyclopedia of Philosophy (2020). https://plato.stanford.edu/archives/
fall2020/entries/ethics-ai/
60. Pearlberg, D., Schroeder, T.: Reasons, causes, and the extended mind hypothesis.
Erkenntnis 81, 41–57 (2015). https://doi.org/10.1007/s10670-015-9727-0
61. Pekka, A., Bauer, W.: The European commission’s high-level expert group on
artificial intelligence: ethics guidelines for trustworthy AI. Working document for
Stakeholders’ Consultation (2018)
62. Pigliucci, M.: Mind uploading: a philosophical analysis. In: Blackford, R., Brod-
erick, D. (eds.) Intelligence Unbound: Future of Uploaded and Machine Minds.
Wiley, Hoboken (2014). https://doi.org/10.1002/9781118736302.ch7
63. Raso, F.: AI and Human Rights. Opportunities and Risks. Berkman Klein Centre
for Internet and Society, Harvard (2018)
64. Royakkers, L., Est, R.: A literature review on new robotics: automation from love
to war. Int. J. Soc. Robot. 7, 549–570 (2015)
65. Royakkers, L., Timmer, J., Kool, L., Est, R.: Societal and ethical issues of digitiza-
tion. Ethics Inf. Technol. 20(2), 127–142 (2018). https://doi.org/10.1007/s10676-
018-9452-x
66. Sandberg, A.: Feasibility of whole brain emulation. In: Müller, V. (ed.) Philosophy
and Theory of Artificial Intelligence. Studies in Applied Philosophy, Epistemolog-
ical and Rational Ethics, vol. 5. Springer, Heidelberg (2013). https://doi.org/10.
1007/978-3-642-31674-6 19
67. Sandberg, A., Bostrom, N.: Whole brain emulation: a roadmap. Technical report
2008-3, Future of Humanity Institute, Oxford University (2008, online)
68. Schneider, S.: Mindscan: Transcending and Enhancing the Brain. Wiley, Hoboken
(2009)
69. Segun, S.: From machine ethics to computational ethics. AI Soc. (2020). https://
doi.org/10.1007/s00146-020-01010-1
70. Sharkey, A., Sharkey, N.: Granny and the robots: ethical issues in robot care for
the elderly. Ethics Inf. Technol. 14(1), 27–40 (2010)
71. Siau, K., Wang, W.: Artificial intelligence (AI) ethics: ethics of AI and ethical AI. J.
Database Manag. 31(2), 74–87 (2020). https://doi.org/10.4018/JDM.2020040105
72. Spielkamp, M., Matzat, L.: Algorithm watch 2019: the AI ethics guide-
lines global inventory (2019). https://algorithmwatch.org/en/project/ai-ethics-
guidelines-global-inventory/
73. Steffensen, S.: Language, languaging and the extended mind hypothesis. Pragmat-
ics Cogn. 17(3), 677–697 (2009). https://doi.org/10.1075/pc.17.3.10ste
74. Taddeo, M., Floridi, L.: How AI can be a force for good. Science 361(6404), 751–
752 (2018). https://doi.org/10.1126/science.aat5991
75. Taylor, P.: Normative Discourse. Prentice-Hall, New York (1961)
76. Turkle, S.: The Second Self: Computers and the Human Spirit. Simon and Schuster,
New York (1984)
77. Turkle, S.: Alone Together: Why We Expect More from Technology and Less from
Each Other. Basic Books, New York (2011)
78. UNESCO: Preliminary report on the first draft of the recommendation on
the ethics of artificial intelligence (2020). https://unesdoc.unesco.org/ark:/48223/
pf0000374266
79. Vallor, S.: Moral deskilling and upskilling in a new machine age: reflections on the
ambiguous future of character. Philos. Technol. (2015)
80. Vallor, S.: Technology and the Virtues: A Philosophical Guide to a Future Worth
Wanting. Oxford University Press, Oxford (2016)
81. Veale, M., Binns, R.: Mitigating Discrimination without Collecting Sensitive Data.
Big Data Soc. (2017)
82. Veruggio, G., Operto, F.: Roboethics: social and ethical implications of robotics.
In: Siciliano, B., Khatib, O. (eds.) Springer Handbook of Robotics. Springer, Hei-
delberg (2008). https://doi.org/10.1007/978-3-540-30301-5 65
83. Wachter, S., Mittelstadt, B., Floridi, L.: Why a right to explanation of automated
decision-making does not exist in the general data protection regulation. Int. Data
Priv. Law 7(2), 76–99 (2017). https://doi.org/10.1093/idpl/ipx005
84. Walker, M.: Personal identity and uploading. J. Evol. Technol. 22(1), 37–51 (2011)
85. Wallach, W., Allen, C.: Moral Machines: Teaching Robots Right from Wrong.
Oxford University Press, Oxford (2009)
86. Wiley, K., Wang, W.: A Taxonomy and Metaphysics of Mind Uploading. Human-
ity+ Press and Alautun Press, Seattle (2014)
87. Winfield, A.: An updated round up of ethical principles of robotics and AI (2019).
http://alanwinfield.blogspot.com/2019/04/an-upyeard-round-up-ofethical.html
AI in Information Systems, AI
for Development and Social Good
Dataset Selection for Transfer Learning
in Information Retrieval
1 Introduction
One may believe that Information Retrieval (IR) is a problem which has been
mostly solved, especially with the rise of state-of-the-art search engines like
Google. However, IR is an active research area that has garnered immense inter-
est over the years. In essence, Information Retrieval is the task of satisfying
an information need, expressed in the form of a query, by retrieving relevant
information from large collections.
Recently, deep neural networks have been the driving force behind several
performance breakthroughs in Information Retrieval. These models were first
introduced to IR in 2015 [11]. However, despite their relatively short lifespan in
the field, they have vastly outperformed non-neural retrieval systems on multiple
benchmark datasets [4,13,19]. Even Google, an industry leader, has announced
that they will be implementing neural systems to improve the quality of web
search [12]. Therefore, it is unsurprising that neural networks are seen as a
paradigm shift in IR.
Unfortunately, the hype surrounding neural systems has been met with criti-
cism from the IR community. Recent studies have shown that non-neural systems
outperform neural systems by a large margin when training data is insufficient or
non-existent [8,18]. This finding raised doubts about the effectiveness of neural
systems in practical applications, where the cost of obtaining large-scale train-
ing data exceeds the budgets of most organizations [19]. For example, during
the COVID-19 pandemic, rapid deployment of retrieval systems was needed to
manage the surge in scientific literature [15]. However, due to the limited supply
of training data from biomedical experts, non-neural systems proved to be more
effective than their neural counterparts [9].
To compensate for the lack of training data, researchers have turned to a
technique known as transfer learning. In our context, transfer learning involves
training a model in a domain with sufficient labelled data, then applying it to a
domain where training data is limited (target domain). Although the effective-
ness of transfer learning largely depends on the selected training data [16], there
are currently no guidelines on selecting the best training set for a given target
domain. Consequently, researchers are divided on what features to consider when
selecting a training set. Some authors argue in favour of a vocabulary alignment
between the training and target datasets [5,10,20], while others believe that it
is only necessary to consider the scale of a training set [4,19].
Understanding which features to consider when selecting a training set is
critical in transfer learning. For example, one could experience significant per-
formance gains after training on a large dataset compared to a small dataset.
In this case, scale is said to represent a principle feature. Similarly, if a medical-
related training set produces better results than a non-medical related training
set on COVID-19 literature, then vocabulary alignment may represent a princi-
ple feature. By definition, principle features are features which have the largest
influence on performance [17]. Hence, our strategy in developing a method for
optimal dataset selection relies on determining what these principle features are.
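To make the candidate feature of vocabulary alignment more concrete, the sketch below (our illustration, not a method proposed in any of the works reviewed here) shows one simple way the alignment between a candidate training corpus and a target-domain corpus could be quantified, namely as the Jaccard overlap of their term vocabularies. The corpora, function names and example texts are hypothetical.

```python
# Minimal, illustrative sketch: quantify "vocabulary alignment" between a
# candidate training corpus and a target-domain corpus as the Jaccard overlap
# of their token vocabularies. Example texts below are hypothetical.
import re


def vocabulary(texts):
    """Return the set of lower-cased word tokens occurring in the given texts."""
    return {tok for text in texts for tok in re.findall(r"[a-z]+", text.lower())}


def jaccard_alignment(training_texts, target_texts):
    """Jaccard similarity between the two vocabularies (1.0 = identical)."""
    train_vocab = vocabulary(training_texts)
    target_vocab = vocabulary(target_texts)
    union = train_vocab | target_vocab
    return len(train_vocab & target_vocab) / len(union) if union else 0.0


if __name__ == "__main__":
    covid_target = [
        "what is the incubation period of the coronavirus",
        "are current vaccines effective against new variants",
    ]
    medical_training = [
        "coronavirus transmission in hospital settings",
        "vaccine efficacy reported in clinical trials",
    ]
    news_training = [
        "election results announced after close contest",
        "stock markets rally on strong quarterly earnings",
    ]
    print("medical vs target:", jaccard_alignment(medical_training, covid_target))
    print("news vs target:   ", jaccard_alignment(news_training, covid_target))
```

On this crude measure, a higher score for the medical corpus than for the news corpus would indicate closer vocabulary alignment with the COVID-19 target domain; whether such alignment actually translates into better transfer performance is precisely the question examined in the remainder of this paper.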
We begin by analyzing existing literature in Sect. 2. In Sect. 3, we develop
a method to identify the principle features and report on our results in Sect. 4.
Finally, we discuss our findings and provide concluding remarks on future work
in Sects. 5 and 6, respectively.
2 Literature Review
There are several key concepts that underpin Information Retrieval; however, in
this paper, it is only necessary to understand the structure of the data that is
used to train neural models. For illustrative purposes, a training instance taken
from the Twitter retrieval task is shown in Fig. 1 [14]. As seen from the figure,
training data for retrieval systems consist of three parts: a query, a document,
and a ground truth label.
Fig. 1. Training instance from the Twitter retrieval task – a: query, b: document
(Tweet), c: label
As seen in Table 1, there are two types of user queries, namely, keyword type
queries and natural language type queries. The former consists only of the terms
which relate to the topic that the user is interested in, while the latter consists
of grammatically correct sentences that may include question words, verbs and
prepositions. A document, on the other hand, could represent any span of text,
including Tweets, scientific literature, or even websites. Lastly, a ground truth
label can be categorized as either a 1 or a 0. Under this labelling scheme, a 1
implies that the document is relevant to the query, while a 0 implies that it is
not.
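As a purely hypothetical illustration of this structure, the sketch below encodes two training instances as (query, document, label) triples, one with a keyword query and one with a natural language query; the class name and example values are our own and are not drawn from any of the datasets discussed in this paper.

```python
# Illustrative sketch: a training instance for a neural retrieval model,
# consisting of a query, a document and a binary relevance label.
# Example values are hypothetical.
from dataclasses import dataclass


@dataclass
class TrainingInstance:
    query: str     # keyword or natural language query
    document: str  # any span of text: a Tweet, an abstract, a web page, ...
    label: int     # 1 = document is relevant to the query, 0 = not relevant


training_data = [
    TrainingInstance(
        query="coronavirus origin",  # keyword query
        document="Scientists investigate the likely origin of the novel coronavirus.",
        label=1,
    ),
    TrainingInstance(
        query="What are the symptoms of COVID-19?",  # natural language query
        document="The city council approved funding for a new cycling lane.",
        label=0,
    ),
]

for instance in training_data:
    print(instance.query, "->", instance.label)
```

A real training set would of course contain many thousands, or even millions, of such triples.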
Table 1. Examples of keyword and natural language queries from TREC-COVID [15]
1 This result was achieved by the authors in [20], who re-implemented the work in [5].
in [20]. Hence, one can argue that vocabulary alignment may not represent a
principal feature.
Similar to the method proposed in [5], the authors in [10] focused on creating
a training set that aligned with the vocabulary of the target domain. In more
detail, they proposed to leverage the inherent relevance between the title and
content of news and Wikipedia articles. The retrieval model was trained on
a dataset where the headlines of these articles were used as queries and the
associated content as documents. Additionally, a filter was developed to eliminate
articles whose vocabulary did not align with documents from the target domain.
However, despite training on a dataset which was semantically aligned with the target domain, the model achieved an improvement of only 6.8% over the non-neural baseline (see Table 2)². It is also worth emphasizing that, despite being vastly dissimilar in content, both [5] and [10] yielded similar results. This finding provides compelling evidence in support of our previous claim that vocabulary alignment may not represent a principal feature.
Taking inspiration from early research [1], the authors in [20] proposed to
train a retrieval model on anchor links. An anchor link typically appears as blue, underlined text on a webpage that allows users to jump to a specific page on the internet. Briefly, to simulate the retrieval task, anchor links were
used as queries and the linked content as ground truth documents. Similar to
the approach discussed in [10], the authors developed a filter model. However,
instead of eliminating training instances based on their similarity to documents,
the filter model was designed to eliminate instances that did not align with the
vocabulary of queries from the target domain.
The model was able to achieve an improvement of 9.7% over the non-neural
baseline (see Table 2). Compared to the performance achieved in both [5] and
[10], this represents a statistically significant improvement. At face value, this
result suggests that a vocabulary alignment between queries from the training
and target datasets is more important than a vocabulary alignment between
documents. However, since vocabulary alignment may not represent a principal feature, the performance gain discussed here can instead be attributed to an unintentional alignment between the query types of the training and target datasets. Hence, this result implies that query type alignment may represent a principal feature.
Instead of focusing on a vocabulary alignment, the authors in [19] proposed
to train a retrieval model on the largest publicly available dataset. The selected
dataset was annotated by humans and consisted of more than a million training
instances. As evaluated in [20], the model was able to achieve an improvement
of 7.6% over the non-neural baseline (see Table 2). However, compared to the performance achieved by [20], this improvement is not statistically significant. Hence, one can argue that the scale of a dataset alone may not represent a principal feature.
2 This result was achieved by the authors in [20], who re-implemented the work in [10].
3 Methods
We aim to develop a method to select an optimal training set for a specific target
domain. To achieve this, we need to determine which feature should be prioritized
when selecting a training set. Based on our review of existing literature, we have
identified three candidate features that are worthy of investigation: vocabulary
alignment, scale, and query type alignment. To this end, the objective of our work is to determine which of these features represents a principal feature. Our proposed method to identify a principal feature is shown in Fig. 2.
As shown in Fig. 2, the first step involves selecting two datasets. These
datasets are then used to train two separate, but identical neural models. After
training, each neural model is evaluated on a retrieval benchmark. The results from the evaluation are compared using a right-tailed t-test in order to determine whether the given feature represents a principal feature.
3.1 T-Test
The alternative hypothesis for each feature is shown in Table 3. We omit the null hypothesis as it is the negation of the alternative hypothesis. For example, the null hypothesis for the scale feature would be: "training on a large dataset does not yield a statistically higher performance than training on a small dataset". Hence, in our context, a feature is said to represent a principal feature only if the null hypothesis is rejected. Conversely, if the null hypothesis is not rejected, the test is inconclusive. This means that there was not enough evidence to support the claim that the given feature represents a principal feature.
Constant Value
Significance level (alpha) 0.05
Degrees of freedom (number of test queries −1) 29
Critical value 1.6991
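To make the decision rule concrete, the sketch below shows one way such a right-tailed comparison could be carried out, assuming per-query NDCG@10 scores are available for the two trained models and that the comparison is paired over the 30 test queries. The scores, variable names and the use of NumPy are our own illustration, not the authors' code.

```python
import numpy as np

# Hypothetical per-query NDCG@10 scores for two models on the same 30 test
# queries (values invented for illustration).
rng = np.random.default_rng(0)
scores_a = rng.uniform(0.4, 0.8, size=30)
scores_b = scores_a - rng.uniform(0.0, 0.1, size=30)

# Paired, right-tailed t-test: H1 is "model A scores higher than model B".
diff = scores_a - scores_b
t_stat = diff.mean() / (diff.std(ddof=1) / np.sqrt(len(diff)))

critical_value = 1.6991          # alpha = 0.05, df = 29 (constants listed above)
reject_null = t_stat > critical_value
print(f"t = {t_stat:.3f}, reject H0 (principal feature supported): {reject_null}")
```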
answer key queries related to COVID-19 (see Table 1 for examples). Importantly,
due to the absence of training data, TREC-COVID provided an ideal scenario
to demonstrate the usefulness of our contribution. We focus on round 1, which
consisted of 30 test queries that were provided in both keyword and natural
language form. The documents for each query were constructed by concatenating
the title and abstract fields of the metadata file. As with most retrieval tasks,
TREC-COVID was characterized by a two-stage process, namely:
1. Given a query, retrieve the relevant literature from a large collection of doc-
uments
2. Rank the retrieved literature from most to least relevant based on the given
query
We selected BERT Large as the basis of our neural ranking model. In more detail, BERT is a general-purpose pre-trained language model that has achieved state-of-the-art results on several natural language processing tasks, including Information Retrieval [6]. The model used in this paper was prepared by the authors in [13] and consisted of an additional single-layer neural network.
The input to BERT was formed by concatenating the query and document into a sequence. Each sequence was then truncated to a maximum length of 509 tokens in order to accommodate BERT's [CLS] and [SEP] tokens. The final sequence had a maximum length of 512 tokens and was converted to TFRecord format before being fed into BERT.
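The sketch below illustrates how this truncation could be implemented; the tokenizer checkpoint name and helper function are our own assumptions, and the actual pipeline additionally wrote the sequences to TFRecord files.

```python
from transformers import BertTokenizer

# Sketch (our own illustration): build a query-document input for BERT, keeping
# the text within 509 tokens so that [CLS] and two [SEP] tokens fit into the
# 512-token limit. The checkpoint name is an assumption.
tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")

def build_input_ids(query: str, document: str, max_text_tokens: int = 509):
    query_tokens = tokenizer.tokenize(query)
    doc_tokens = tokenizer.tokenize(document)
    doc_tokens = doc_tokens[: max(0, max_text_tokens - len(query_tokens))]
    tokens = ["[CLS]"] + query_tokens + ["[SEP]"] + doc_tokens + ["[SEP]"]
    return tokenizer.convert_tokens_to_ids(tokens)   # length <= 512
```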
Fine-tuning began from the pre-trained BERT Large checkpoint. We fine-tuned BERT on Google Cloud TPUs for 400k iterations with a batch size of 32. This equated to approximately 40 h of training and exposed the model to 12.8M training samples (400k × 32). Additionally, training was conducted using cross-entropy loss with an initial learning rate of 3 × 10⁻⁶, learning rate warm-up over the first 10,000 iterations, and linear decay of the learning rate of 0.01.
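As a rough illustration of this schedule, the sketch below warms the learning rate up linearly over the first 10,000 iterations and then decays it linearly over the remaining iterations; the exact decay target is not fully specified in the text, so decaying to zero is an assumption.

```python
def learning_rate(step: int,
                  peak_lr: float = 3e-6,
                  warmup_steps: int = 10_000,
                  total_steps: int = 400_000) -> float:
    """Sketch of one plausible reading of the fine-tuning schedule: linear warm-up
    to peak_lr, then linear decay (here to zero) by the final iteration."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))
```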
3 The official evaluation software used by the organizers of TREC-COVID was trec_eval, which can be downloaded at https://trec.nist.gov/trec_eval/.
The [CLS] token of BERT's output was propagated through a single-layer neural network, which computed the probability of the document being relevant
to the query. Documents were then sorted in descending order based on their
probability score for the given query.
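A minimal sketch of this re-ranking step is shown below, assuming the sequences have already been tokenized and batched; the model handle, tensor shapes and the name of the scoring layer are our own assumptions (hidden size 1024 corresponds to BERT Large).

```python
import torch

# Sketch (assumed names and shapes): score each candidate document with a
# single linear layer on top of the [CLS] representation, then sort.
relevance_head = torch.nn.Linear(1024, 2)   # logits for {not relevant, relevant}

def rank_documents(bert_model, encoded_batch, doc_ids):
    cls = bert_model(**encoded_batch).last_hidden_state[:, 0, :]   # [num_docs, 1024]
    prob_relevant = relevance_head(cls).softmax(dim=-1)[:, 1]
    order = torch.argsort(prob_relevant, descending=True)
    return [doc_ids[i] for i in order]      # most to least relevant
```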
4 Results
4.1 Principal Features
Although vocabulary alignment has proven to be effective in related tasks such
as sentiment classification, our results suggest that this effectiveness does not
translate to retrieval. As seen in Table 5, the difference in NDCG@10 scores
between the COVID and Twitter datasets is not statistically significant, i.e.
test statistic < critical value. This means that there was not enough evidence
to support the claim that vocabulary alignment represents a principal feature.
It is worth emphasizing that the COVID dataset consisted of scientific queries
and documents which aligned with the scientific-based vocabulary of the target
domain. Despite this, it was unable to achieve a significant performance gain over
a Twitter dataset that consisted of random and noisy social media messages.
Based on our findings, query type alignment represents the feature to con-
sider when selecting a training set for a specific target domain. However, our
experiments have validated this claim for natural language type queries only.
To prove that our finding also applies to keyword type queries, we consider the
experiments conducted by the authors in [19]. As shown in Table 6, the authors
achieved a significant performance improvement on the Robust04 benchmark
after training on the Twitter dataset compared to training on the MS MARCO
dataset. At the time, the authors referred to this performance gain as ‘sur-
prising’. However, since both the Twitter and Robust04 datasets consisted of
keyword type queries, we can now attribute the performance gain to an align-
ment between query types from the training and target datasets. This example
highlights our contribution of knowledge to the field of Information Retrieval.
4.2 Benchmark
It is important to recognize that our experimental results have validated our theoretical findings. That is, they support our expectations of what the principal feature could be. Now, to demonstrate the value of our findings, we turn our
attention to the TREC-COVID challenge. In more detail, we train BERT Large
on the MS MARCO dataset and apply it to TREC-COVID using the provided
natural language test queries. As seen in Table 7, our system achieved first place on the round 1 leader board⁴ in terms of NDCG@10.
5 Discussion
Contrary to existing viewpoints, our results indicate that neither scale nor vocabulary alignment represents a principal feature. This means that it is not critical to consider these features when selecting a training set. On the other hand, we have discovered that, to achieve optimum performance, one should select a training set whose query type is aligned with that of the target domain. More concretely, we have statistically validated that query type alignment represents a principal feature.
To demonstrate the value of our work, we turned our attention to the TREC-
COVID challenge. As seen in Table 7, our model achieved the best performance in
round 1 with a score of 0.6493. More importantly, it was able to achieve this result
due to an alignment of query types between the training and test datasets, i.e.
both datasets utilized natural language type queries. Conversely, when training
4 TREC-COVID round 1 leader board: https://ir.nist.gov/covidSubmit.
on the same dataset but using keyword test queries instead, the performance
of the model dropped significantly. In fact, the model ranked 8th on the leader
board with an NDCG@10 score of 0.5580. This finding, therefore, highlights the
performance gains that can be achieved through query type alignment.
It is worth emphasizing that our system used an off-the-shelf BERT model,
while other participants used variants of BERT that were more suited to the
scientific domain. For example, one participant used SciBERT fine-tuned on a
medical-related dataset, while another participant used CovidBERT, which was
trained on the document set of TREC-COVID. The rationale behind using these models was to achieve a higher vocabulary alignment compared to the
standard BERT model. However, our system was able to outperform both of
these systems. This finding once again validates our claim that a query type
alignment is more important than a vocabulary alignment between the training
and target datasets.
Another interesting point worth discussing is the performance of non-neural
systems in the TREC-COVID challenge. In more detail, non-neural systems
accounted for 6 of the top 10 positions on the leader board. More importantly,
as shown in Table 7, a non-neural model was able to outperform all other neural
models except for ours. The reason for this was the absence of training data in TREC-COVID. As a result, most researchers opted to deploy non-neural
architectures.
However, by relying on non-neural systems, semantic understanding is
inevitably lost. This trade-off is undesirable, especially when dealing with com-
plex user queries. To solve this problem, one can now apply our dataset selection
method. As highlighted in Table 7, our method allows neural models to achieve
significant performance improvements over non-neural systems in applications
without any domain-specific training data.
Our discussion thus far has focused on the research community. Although
the contribution of knowledge was our primary objective, there are also practical implications of our work. For example, one could be designing a search engine
for an e-commerce website. In this context, customers typically use keyword
queries such as “iPhone Case”, or “Samsung TV”. Hence, based on our findings,
it would be best to select a training set that consists of keyword type queries.
Similarly, when designing a Chatbot, it would be best to select a training set
with natural language type queries. This is because
users typically interact with Chatbots using natural language, e.g. “How far is
the nearest petrol station?”
6 Conclusion
The TREC-COVID challenge required participants to perform full retrieval, i.e.
retrieve a set of documents and then rank those documents in descending order of
relevance. Although our system focused only on the ranking stage, we compared
it to systems that performed full retrieval. This comparison is fair for two reasons.
Firstly, the authors in [3] proved that having access to a greater number of
relevant documents does not result in a higher NDCG@k score. Secondly, and
more importantly, TREC compares systems against each other even though each
system retrieves a different number of relevant documents. Hence, our system
did not have an advantage over competing systems.
It is important to emphasize that our work can be applied to any neural
retrieval model. However, in this paper, we only investigated the BERT architec-
ture due to its popularity amongst researchers. As a result, we leave it to future
work to investigate other neural architectures such as Conv-KNRM. Another
interesting direction for future work is to compare the effectiveness of transfer
learning to fully supervised learning in Information Retrieval. An ideal bench-
mark for this would be MS MARCO, as it consists of more than a million domain-
specific training queries. In this context, our method would guide us in selecting
Google Natural Questions [7] as a training set since it aligns with the query type
of MS MARCO.
References
1. Asadi, N., Metzler, D., Elsayed, T., Lin, J.: Pseudo test collections for learning web
search ranking functions. In: Proceedings of the 34th International ACM SIGIR
Conference on Research and Development in Information Retrieval, Beijing, China,
pp. 1073–1082 (2011)
2. Bajaj, P., et al.: MS MARCO: a human generated machine reading comprehension
dataset. In: 30th Conference on Neural Information Processing Systems, Barcelona,
Spain (2016)
3. Craswell, N., Mitra, B., Yilmaz, E., Campos, D., Voorhees, E.: Overview of the
TREC 2019 deep learning track. arXiv:2003.07820 (2020)
4. Dai, Z., Callan, J.: Deeper text understanding for IR with contextual neural lan-
guage modeling. In: Proceedings of the 42nd International ACM SIGIR Conference
on Research and Development in Information Retrieval, Paris, France, pp. 985–988.
Association for Computing Machinery (2019)
5. Dehghani, M., Zamani, H., Severyn, A., Kamps, J., Croft, W.: Neural ranking
models with weak supervision. In: Proceedings of SIGIR 2017, Shinjuku, Tokyo,
Japan (2017)
6. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pretraining of deep bidirec-
tional transformers for language understanding. In: Proceedings of NAACL-HLT
2019, Minneapolis, Minnesota, pp. 4171–4186. Association for Computational Lin-
guistics (2019)
7. Kwiatkowski, T., et al.: Natural questions: a benchmark for question answering
research. In: Jiang, J. (ed.) Transactions of the Association for Computational
Linguistics, vol. 7, pp. 453–466 (2019)
8. Lin, J.: The neural hype and comparisons against weak baselines. ACM SIGIR
Forum 52(2), 40–51 (2019)
9. MacAvaney, S., Cohan, A., Goharian, N.: SLEDGE: A simple yet effective baseline
for coronavirus scientific knowledge search. arXiv:2005.02365 (2020)
10. MacAvaney, S., Yates, A., Hui, K., Frieder, O.: Content-based weak supervision
for AdHoc re-ranking. In: Proceedings of the 42nd International ACM SIGIR Con-
ference on Research and Development in Information Retrieval, Paris, France, pp.
993–996. Association for Computing Machinery (2019)
11. Marchesin, S., Purpura, A., Silvello, G.: Focal elements of neural information
retrieval models. An outlook through a reproducibility study. Inf. Process. Manage.
57, 102109 (2020)
12. Nayak, P.: Understanding searches better than ever before. https://www.blog.
google/products/search/search-language-understanding-bert/. Accessed 18 May
2020
13. Nogueira, R., Cho, K.: Passage re-ranking with BERT. arXiv:1901.04085 (2019)
14. Rao, J., Yang, W., Zhang, Y., Ture, F., Lin, J.: Multi-perspective relevance match-
ing with hierarchical ConvNets for social media search. In: The 33rd AAAI Con-
ference on Artificial Intelligence, AAAI-19, vol. 33, pp. 232–240 (2019)
15. Roberts, K., et al.: TREC-COVID: rationale and structure of an information
retrieval shared task for COVID-19. J. Am. Med. Inform. Assoc. 27, 1431–1436
(2020)
16. Sun, B., Feng, J., Saenko, K.: Return of frustratingly easy domain adaptation. In:
Proceedings of the 30th AAAI Conference on Artificial Intelligence. Association
for the Advancement of Artificial Intelligence (2016)
17. Wouter, M., Marco, L.: An introduction to domain adaptation and transfer learn-
ing. arXiv:1812.11806 (2019)
18. Yang, W., Lu, K., Yang, P., Lin, J.: Critically examining the “neural hype”: weak
baselines and the additivity of effectiveness gains from neural ranking models. In:
Proceedings of the 42nd International ACM SIGIR Conference on Research and
Development in Information Retrieval, Paris, France, pp. 1129–1132. Association
for Computing Machinery (2019)
19. Yilmaz, Z., Yang, W., Zhang, H., Lin, J.: Cross-domain modeling of sentence-
level evidence for document retrieval. In: 9th International Joint Conference on
Natural Language Processing, Hong Kong, China, pp. 3481–3487. Association for
Computational Linguistics (2019)
20. Zhang, K., Xiong, C., Liu, Z., Liu, Z.: Selective weak supervision for neural infor-
mation retrieval. In: International World Wide Web Conference, Creative Commons,
Taiwan (2020)
Applications of AI
StarGAN-ZSVC: Towards Zero-Shot
Voice Conversion in Low-Resource
Contexts
1 Introduction
Voice conversion is a speech processing task where speech from a source speaker
is transformed so that it appears to come from a different target speaker while
preserving linguistic content. A fast, human-level voice conversion system has
This work is supported in part by the National Research Foundation of South Africa
(grant number: 120409) and a Google Faculty Award for HK.
significant applications across several industries, from those in privacy and iden-
tity protection [16] to those of voice mimicry and disguise [10,37]. It can also be
essential for addressing downstream speech processing problems in low-resource
contexts where training data is limited: it could be used to augment training data
by converting the available utterances to novel speakers—effectively increasing
the diversity of training data and improving the quality of the resulting systems.
Recent techniques have improved the quality of voice conversion significantly,
in part due to the Voice Conversion Challenge (VCC) and its efforts to concen-
trate disparate research efforts [36]. Some techniques are beginning to achieve
near human-level quality in conversion outputs. However, many of these advances and improvements in quality are of limited practical usefulness because they fail to satisfy several requirements that would be necessary for practical use, particularly in low-resource settings.
First, a practical voice conversion system should be trainable on non-parallel
data. That is, training data should not need to contain utterances from multiple
speakers saying the same words – such a setting is known as a parallel data
setting. Non-parallel data is the converse, where the different utterances used to
train the model do not contain the same spoken words. Parallel data is difficult to
collect in general, and even more so for low-resource language (those which have
limited digitally stored corpora). Second, a practical system should be able to
convert speech to and from speakers which have not been seen during training.
This is called zero-shot voice conversion. Without this requirement, a system
would need to be retrained whenever speech from a new speaker is desired.
Finally, for a number of practical applications, a voice conversion system needs
to run at least in real-time. For data augmentation in particular, having the
system run as fast as possible is essential for it to be practical in the training of
a downstream speech model.
With these requirements in mind, we look to extend existing state-of-the-
art voice conversion techniques. We specifically extend the recent StarGAN-
VC2 [13] approach to the zero-shot setting, proposing the new StarGAN-ZSVC
model. StarGAN-ZSVC achieves zero-shot prediction by using a speaker encod-
ing network to generate speaker embeddings for potentially unseen speakers;
these embeddings are then used to condition the model at inference time.
Through objective and human evaluations, we show that StarGAN-ZSVC
performs better than simple baseline models and similar or better than the recent
AutoVC zero-shot voice conversion approach [24] across a range of evaluation
metrics. More specifically, it gives similar or better performance in all zero-shot
settings considered, and does so more than five times faster than AutoVC.
2 Related Work
A typical voice conversion system operates in the frequency domain, first con-
verting an input utterance into a spectrogram and then using some model to map
the spectrogram spoken by a source speaker to that of one spoken by a target
speaker. The output spectrogram is then converted to a waveform in the time-
domain using a vocoder [28]. In this paper, we denote spectrogram sequences as
More formally, the generator G is trained to minimize the loss L = λ_id L_id + λ_cyc L_cyc + L_G-adv. The first term, L_id, is an identity loss term. It aims to minimize the difference between the input and output spectrogram when the model is made to keep the same speaker identity, i.e. convert from speaker A to speaker A. It is defined by the L2 loss:
Finally, the adversarial loss term L_G-adv is added based on the LSGAN [18] loss. It defines two constants a and b, whereby G's loss tries to push D's output for converted utterances closer to a, while D's loss function tries to push D's output for converted utterances closer to b and its output for real utterances closer to a. Concretely, G's adversarial loss is defined as

L_G-adv = (D(G(X_src, s_src, s_trg), s_src, s_trg) − a)²    (3)
In [13], the authors set the scalar coefficients to be λid = 5, and λcyc = 10. The
original study [13] does not mention how a and b are set (despite these greatly
affecting training); we treat them as hyperparameters. Note that the true target
spectrogram Xtrg does not appear in any of the equations – this is what allows
StarGAN-VC2 to be trained with non-parallel data where the source utterance
Xsrc has no corresponding utterance from the target speaker.
StarGAN-VC2 uses a specially designed 2-1-2D convolutional architecture
for the generator, as well as a projection discriminator [19], which comprises a
convolutional network (to extract features) followed by an inner product with an
embedding corresponding to the source/target speaker pair. For the generator,
a new form of modulation-based conditional instance normalization was intro-
duced in [13]. This allows the speaker identity (which is provided as a one-hot
vector) to multiplicatively condition the channels of an input feature. According
to [13], this special layer is a key component in achieving high performance in
StarGAN-VC2.
We use these building blocks for our new zero-shot approach. Concretely, the
one-hot speaker vectors in StarGAN-VC2 are replaced with continuous embed-
ding vectors obtained from a separate speaker encoding network (which can be
applied to arbitrary speakers), as outlined in Sect. 3.
2.2 AutoVC
Zero-shot voice conversion was first introduced in 2019 with the AutoVC
model [24], which remains one of only a handful of models that can perform zero-
shot conversion (see e.g. [25] for another very recent example). For AutoVC, zero-
shot conversion is achieved by using an autoencoder with a specially designed
bottleneck layer which forces the network’s encoder to only retain linguistic con-
tent in its encoded latent representation. The model then uses a separate recur-
rent speaker encoder model E(X), originally proposed for speaker identification
[35], to extract a speaker embedding s from an input spectrogram. These speaker
embeddings are then used to supply the missing speaker identity information to
the decoder which, together with the linguistic content from the encoder, allows
the decoder to synthesize an output spectrogram for an unseen speaker.
Formally, the full encoder-decoder model is trained to primarily minimize
two terms. The first term is an L2 reconstruction loss between the decoder out-
put spectrogram Xsrc→src and input spectrogram Xsrc , with the source speaker’s
encoding (from the speaker encoder) provided to both the encoder and decoder.
The second term is an L1 loss between the speaker embedding of the decoder out-
put E(Xsrc→src ) and the original speaker embedding ssrc = E(Xsrc ). The encoder
and decoder consist of convolutional and Long Short-Term Memory (LSTM) [8]
recurrent layers which are carefully designed to ensure that no speaker identity
information is present in the encoder output.
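A minimal sketch of these two loss terms is given below; the tensor names, the weighting factor and the way the speaker encoder is called are our own assumptions and not the original AutoVC code.

```python
import torch.nn.functional as F

# Sketch of the two AutoVC loss terms described above (names are our own).
# x_src:        source spectrogram, shape [batch, frames, mels]
# x_src_to_src: decoder reconstruction of x_src using the source embedding
# speaker_encoder: the pretrained embedding network E
def autovc_losses(x_src, x_src_to_src, speaker_encoder, mu=1.0):
    s_src = speaker_encoder(x_src)                                  # original embedding
    recon_loss = F.mse_loss(x_src_to_src, x_src)                    # L2 reconstruction term
    embed_loss = F.l1_loss(speaker_encoder(x_src_to_src), s_src)    # L1 embedding term
    return recon_loss + mu * embed_loss                             # mu is an assumed weight
```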
As with StarGAN-VC and StarGAN-VC2, a corresponding parallel target
utterance Xtrg does not appear in any of the loss terms, allowing AutoVC to
be trained without parallel data. Zero-shot inference is performed by using the
speaker encoder to obtain embeddings for new utterances from unseen speakers,
which are then provided to the decoder instead of the embedding corresponding
to the source speaker, causing the decoder to return a converted output. We use
this same idea of using an encoding network to obtain embeddings for unseen
speakers in our new GAN-based approach, which we describe next.
3 StarGAN-ZSVC
Fig. 1. The StarGAN-ZSVC system framework. The speaker encoder network E and
the WaveGlow vocoder are pretrained on large speech corpora, while the generator
G and discriminator D are trained on a 9-min subset of the VCC dataset. During
inference, arbitrary utterances for the source and target speaker are used to obtain
source and target speaker embeddings, ssrc and strg .
The speed of the full voice conversion system during inference is bounded by
(a) the speed of the generator G; (b) the speed of converting the utterance
between time and frequency domains, consisting of the initial conversion from
time-domain waveform to Mel-spectrogram and the speed of the vocoder; and
(c) the speed of the speaker encoder E. To ensure that the speed of the full
system is at least real-time, each subsystem needs to be faster than real-time.
(c) Speaker Encoder Speed. The majority of research efforts into obtaining
speaker embeddings involve models using slower recurrent layers, often making
these encoder networks the bottleneck. We also make use of a recurrent stacked-
GRU network as our speaker embedding network E. However, we only need
to obtain a single speaker embedding to perform any number of conversions
involving that speaker. We therefore treat this as a preprocessing step where
we apply E to a few arbitrary utterances from the target and source speakers,
averaging the results to obtain target and source speaker embeddings, and use
those same embeddings for all subsequent conversions.
We also design the speaker embeddings to be 256-dimensional vectors of unit
length. If we were to use StarGAN-ZSVC downstream for data augmentation
(where we want speech from novel speakers), we could then simply sample ran-
dom unit-length vectors of this dimensionality to use with the generator.
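The sketch below illustrates this preprocessing step and the sampling of novel speaker embeddings; the encoder interface is assumed, and only the 256-dimensional, unit-length convention comes from the text.

```python
import torch
import torch.nn.functional as F

# Sketch (our own illustration): average a speaker's embedding over a few
# utterances as a preprocessing step, and sample a random unit-length vector
# when a novel "speaker" is wanted for data augmentation.
def speaker_embedding(encoder, utterance_specs):
    embs = torch.stack([encoder(x) for x in utterance_specs])   # [n_utterances, 256]
    return F.normalize(embs.mean(dim=0), dim=-1)                 # unit-length average

def random_speaker_embedding(dim=256):
    return F.normalize(torch.randn(dim), dim=-1)                 # random unit vector
```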
3.3 Architecture
With the previous considerations in mind, we design the generator G, discrim-
inator D, and encoder network E, as shown in Fig. 2. The generator and dis-
criminator are adapted from StarGAN-VC2 [13], while the speaker encoder is
adapted from the original model proposed for speaker identification [35]. Specifi-
cally, for E we use a simple stacked GRU model, while for D we use a projection
discriminator [19]. For G, we use the 2-1-2D generator from StarGAN-VC2 with
a modified central set of layers, denoted by the Conditional Block in the figure.
These conditional blocks are intended to provide the network with a way
to modulate the channels of an input spectrogram, with modulation factors
conditioned on the specific source and target speaker pairing. They utilize a
convolutional layer followed by a modified conditional instance normalization
layer [5] and a gated linear unit [4].
The modified conditional instance normalization layer performs the following
operation on an input feature vector f :
CIN(f, γ, β) = γ · (f − μ(f)) / σ(f) + β    (5)
where μ(f ) and σ(f ) are respectively the scalar mean and standard deviation of
vector f , while γ and β are computed using two linear layers which derive their
inputs from the speaker embeddings, as depicted in Fig. 2. The above is computed
separately for each channel when the input feature contains multiple channels.
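A minimal PyTorch sketch of such a layer is given below, following Eq. (5); the embedding dimensionality, the use of concatenated source and target embeddings, and the layer sizes are assumptions made for illustration rather than the exact StarGAN-ZSVC implementation.

```python
import torch
import torch.nn as nn

class ConditionalInstanceNorm(nn.Module):
    """Sketch of Eq. (5): per-channel instance normalisation whose scale (gamma)
    and shift (beta) are predicted from the speaker embeddings. Concatenating
    the source and target embeddings is an assumption."""
    def __init__(self, channels, emb_dim=256):
        super().__init__()
        self.to_gamma = nn.Linear(2 * emb_dim, channels)
        self.to_beta = nn.Linear(2 * emb_dim, channels)

    def forward(self, f, s_src, s_trg):                  # f: [batch, channels, time]
        cond = torch.cat([s_src, s_trg], dim=-1)
        gamma = self.to_gamma(cond).unsqueeze(-1)
        beta = self.to_beta(cond).unsqueeze(-1)
        mean = f.mean(dim=-1, keepdim=True)
        std = f.std(dim=-1, keepdim=True) + 1e-5
        return gamma * (f - mean) / std + beta
```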
For the discriminator, the source and target speaker embeddings are also
fed through several linear layers and activation functions to multiply with the
pooled output of D’s main branch.
4 Experimental Setup
We compare StarGAN-ZSVC to other voice conversion models using the voice
conversion challenge (VCC) 2018 dataset [17], which contains parallel record-
ings of native English speakers from the United States. Importantly, we do not
train StarGAN-ZSVC or the AutoVC model (to which we compare) using paral-
lel input-output examples. However, the traditional baseline models (below) do
require parallel data. All training and speed measurements are performed on a
single NVIDIA RTX 2070 SUPER GPU.
4.1 Dataset
The VCC 2018 dataset was recorded from 8 speakers, each speaking 81 utterances
from the same transcript. 4 speakers are used for training and 4 for testing. To
emulate a low-resource setting, we use a 9-min subset of the VCC 2018 training
dataset for StarGAN-ZSVC and AutoVC. This corresponds to 90% of the utter-
ances from two female (F) and two male (M) speakers (VCC2SF1, VCC2SF2,
VCC2SM1, and VCC2SM2). This setup is in line with existing evaluations on
VCC 2018 [13], allows for all combinations of inter- and intra-gender conversions,
and allows for zero-shot evaluation on the 4 remaining unseen speakers.
In contrast to StarGAN-ZSVC and AutoVC, some of the baseline models
only allow for one-to-one conversions, i.e. they are trained on parallel data and
can only convert from seen speaker A to seen speaker B. We therefore train the
baseline models on a single source-target speaker mapping (from VCC2SF1 to
VCC2SM2), using 90% of the parallel training utterances for this speaker pair.
All utterances are resampled to 22.05 kHz and then converted to log Mel-
spectrograms with a window and hop length of 1024 and 256 samples, respec-
tively. During training, for each batch we randomly sample a k-frame sequence
from each spectrogram, where k is randomly sampled from multiples of 32
between 96 and 320 (inclusive). This is done for all models to make them robust to utterance length, with the exception of StarGAN-ZSVC, which requires fixed-
size input for its discriminator. This leads to slightly worse performance for
StarGAN-ZSVC on long or silence-padded sequences. For a fair comparison, we
therefore only consider non-silent frames of the target utterance.
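A sketch of this preprocessing is shown below; librosa is our own choice of library, and the number of Mel bins and the log offset are assumptions, since the text only specifies the sample rate, the window and hop lengths, and the sampling of k.

```python
import librosa
import numpy as np

def log_mel_spectrogram(wav_path, n_mels=80):
    # Resample to 22.05 kHz and compute a log Mel-spectrogram with a window of
    # 1024 samples and a hop of 256 samples (n_mels and the log offset are assumptions).
    y, sr = librosa.load(wav_path, sr=22050)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=n_mels)
    return np.log(mel + 1e-5)

def sample_training_segment(spec, rng=np.random.default_rng()):
    # k is a random multiple of 32 between 96 and 320 (inclusive).
    k = int(rng.choice(np.arange(96, 321, 32)))
    start = int(rng.integers(0, max(1, spec.shape[1] - k)))
    return spec[:, start:start + k]
```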
inference speed of the remaining sub-networks needs to be well under 700 ms/s,
or preferably significantly faster if used for data augmentation.
The speaker encoder is trained on 90% of the utterances from a combined set
consisting of the VCTK [34], VCC 2018 [17], LibriSpeech [22], and the English
CommonVoice 2020-06 datasets.1 It is trained with the Adam optimizer [14] for
8 epochs with 8 speakers per batch, and 6 utterances per speaker in each batch.
We start with a learning rate of 4 × 10−4 and adjust it down to 3 × 10−7 in the
final epoch. Embeddings for speakers are precomputed by taking the average
embedding over 4 arbitrary utterances for each speaker.
be half of the generator's learning rate; (iii) the number of iterations training
the discriminator versus generator is updated every several hundred epochs to
ensure that the discriminator’s loss is always roughly a factor of 10 lower than
the adversarial term of the generator’s loss; and (iv) dropout with a probability
of 0.3 is added to the input of D after the first 3000 epochs (if added earlier it
causes artifacts and destabilizes training).
The loss function used is the same as that of StarGAN-VC2, with the excep-
tion that the term Lcyc (see Sect. 2.1) is squared in our model, which we found
to give superior results. We set a = 1, and b = 0 for the LSGAN constants,
and λcyc = 10, λid = 5 for loss coefficients, being adjusted downwards during
training in the same manner as in [13].
4.5 Evaluation
5 Experiments
We perform two sets of experiments. First we perform an evaluation on seen
speakers, where we compare StarGAN-ZSVC to all other models to obtain an
indication of both speed and performance. We then compare StarGAN-ZSVC
with AutoVC for zero-shot voice conversion, looking at both the output and
cyclic reconstruction error. We encourage the reader to listen to the demo sam-
ples2 for the zero-shot models.
In the first set of comparisons, we evaluate performance for test utterances where
other utterances from both the source and target speaker have been seen dur-
ing training. I.e., while the models have not been trained on these exact test
utterances, they have seen the speakers during training. There is, however, a
problem in directly comparing the one-to-one models (traditional baselines) to
the many-to-many models (AutoVC and StarGAN-ZSVC). The one-to-one mod-
els are trained on parallel data, always taking in utterances from one speaker
as input (VCC2SF1 in our case) and always producing output from a different
target speaker (VCC2SM2).
In contrast, the many-to-many models are trained without access to parallel
data, taking in input utterances from several speakers (4 speakers, including
VCC2SF1 and VCC2SM2 in our case, as explained in Sect. 4.1). This means
that the one-to-one and many-to-many models observe very different amounts
of data. Moreover, while the data for both the one-to-one and many-to-many
models are divided into a 90%–10% train-test split, the exact same splits are not
used in both setups; this is because the former requires parallel utterances, and
the split is therefore across utterance pairs and not just individual utterances.
To address this, we evaluate the many-to-many models in two settings: on the
exact same test utterances as those from the test split of the one-to-one models,
as well as on all possible source/target speaker utterance pairs where the source
utterance is in the test utterances for the 4 seen training speakers. In the former
case, it could happen that the many-to-many model actually observes one of the
test utterances during training. Nevertheless, reporting scores for both settings
allows for a meaningful comparison.
The results of this evaluation on seen speakers are given in Table 1. The
results indicate that AutoVC appears to be the best in this evaluation on seen
speakers. However, this comes at a computational cost: the linear and StarGAN-
ZSVC models are a factor of 5 or more faster than the models relying on recurrent
layers like DBLSTM and AutoVC.
2 https://rf5.github.io/sacair2020/.
Table 1. Objective evaluation results when converting between speakers where both
the source and target speaker are seen during training. For all metrics aside from cosine
similarity, lower is better. Speed is measured as the time (in milliseconds) required to
convert one second of input audio. The first StarGAN-ZSVC and AutoVC entries cor-
respond to evaluations on the one-to-one test utterances, while the final two starred
entries correspond to metrics computed when using test utterances from all seen train-
ing speakers for the many-to-many models.
Table 2. Objective evaluation results for zero-shot voice conversion for AutoVC
and StarGAN-ZSVC. The prediction metrics compare the predicted output to the
ground truth target, while the reconstruction metrics compare the cyclic reconstruc-
tion Xsrc→trg→src with the original source spectrogram. enorm indicates the vector norm
of the speaker embeddings for the compared spectrograms, with lower values indicating
closer speaker identities.
The performance for AutoVC and StarGAN-ZSVC are similar on most met-
rics for the unseen-to-seen case. But for the seen-to-unseen case and the unseen-
to-unseen case (where both the target and source speakers are new) StarGAN-
ZSVC achieves both better prediction and reconstruction scores. This, coupled
with its fast inference speed (Sect. 5.1), enables it to be used efficiently and
effectively for downstream data augmentation purposes.
Fig. 3. Mean opinion score for naturalness for AutoVC and StarGAN-ZSVC in various
source/target seen/unseen speaker pairings with 95% confidence intervals shown.
The results of the subjective evaluation are given in Fig. 3. To put the val-
ues into context, the MOS for the raw source utterances and vocoded source
utterances included in the analysis are 4.86 and 4.33 respectively – these serve
as an upper bound for the MOS values for both models. Figure 3 largely sup-
ports the objective evaluations, providing further evidence that StarGAN-ZSVC
outperforms AutoVC in zero-shot settings. Interestingly, StarGAN-ZSVC also appears more natural in the traditional seen-to-seen case.
This evaluation indicates that, for human listeners, StarGAN-ZSVC appears
more natural in the low-resource context considered in this paper.
6 Conclusion
This paper aimed to improve recent voice conversion methods in terms of speed,
the use of non-parallel training data, and zero-shot prediction capability. To this
end, we adapted the existing StarGAN-VC2 system by using a speaker encoder
to generate speaker embeddings which are used to condition the generator and
discriminator network on the desired source and target speakers. The result-
ing model, StarGAN-ZSVC, can perform zero-shot inference and is trainable
with non-parallel data. In a series of experiments comparing StarGAN-ZSVC to
the existing zero-shot voice conversion method AutoVC, we demonstrated that
StarGAN-ZSVC is at least five times faster than AutoVC, while yielding bet-
ter scores on objective and subjective metrics in a low-resource zero-shot voice
conversion setting.
For future work, we plan to investigate whether scaling StarGAN-ZSVC up to
larger datasets yields similar performance to existing high-resource voice conver-
sion systems, and whether the system could be applied to other tasks aside from
pure voice conversion (such as emotion or pronunciation conversion).
References
1. Cho, K., et al.: Learning phrase representations using RNN encoder-decoder for
statistical machine translation. In: EMNLP (2014)
2. Choi, Y., Choi, M., Kim, M., Ha, J.W., Kim, S., Choo, J.: StarGAN: unified
generative adversarial networks for multi-domain image-to-image translation. In:
IEEE CVPR (2018)
3. Chorowski, J., Weiss, R.J., Bengio, S., van den Oord, A.: Unsupervised speech rep-
resentation learning using WaveNet autoencoders. arXiv e-prints arXiv:1901.08810
(2019)
4. Dauphin, Y.N., Fan, A., Auli, M., Grangier, D.: Language modeling with gated
convolutional networks. In: Precup, D., Teh, Y.W. (eds.) PMLR (2017)
5. Dumoulin, V., Shlens, J., Kudlur, M.: A learned representation for artistic style.
In: ICLR (2017)
6. Erro, D., Moreno, A.: Weighted frequency warping for voice conversion. In: INTER-
SPEECH (2007)
7. He, T., Zhang, Z., Zhang, H., Zhang, Z., Xie, J., Li, M.: Bag of tricks for image
classification with convolutional neural networks. In: IEEE CVPR (2019)
8. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9,
1735–1780 (1997)
9. Howard, J., Gugger, S.: DynamicUnet: create a U-Net from a given architecture
(2020). https://docs.fast.ai/vision.models.unet#DynamicUnet. Accessed 8 Aug
2020
10. Huang, C., Lin, Y.Y., Lee, H., Lee, L.: Defending Your Voice: Adversarial Attack
on Voice Conversion. arXiv e-prints arXiv:2005.08781 (2020)
11. Kameoka, H., Kaneko, T., Tanaka, K., Hojo, N.: StarGAN-VC: non-parallel many-
to-many voice conversion using star generative adversarial networks. In: IEEE SLT
Workshop (2018)
12. Kameoka, H., Kaneko, T., Tanaka, K., Hojo, N.: ACVAE-VC: non-parallel voice
conversion with auxiliary classifier variational autoencoder. IEEE Trans. Audio
Speech Lang. Process. 27(9), 1432–1443 (2019)
13. Kaneko, T., Kameoka, H., Tanaka, K., Hojo, N.: StarGAN-VC2: rethinking condi-
tional methods for StarGAN-based voice conversion. In: INTERSPEECH (2019)
14. Kingma, D.P., Ba, J.: Adam: A Method for Stochastic Optimization. arXiv e-prints
arXiv:1412.6980 (2014)
15. Kumar, K., et al.: MelGAN: generative adversarial networks for conditional wave-
form synthesis. In: NeurIPS (2019)
16. Lal Srivastava, B.M., Vauquier, N., Sahidullah, M., Bellet, A., Tommasi, M., Vin-
cent, E.: Evaluating voice conversion-based privacy protection against informed
attackers. In: ICASSP (2020)
17. Lorenzo-Trueba, J., et al.: The voice conversion challenge 2018: promoting devel-
opment of parallel and nonparallel methods. In: Odyssey Speaker and Language
Recognition Workshop (2018)
18. Mao, X., Li, Q., Xie, H., Lau, R.Y., Wang, Z., Smolley, S.P.: Least squares gener-
ative adversarial networks. In: ICCV (2017)
19. Miyato, T., Koyama, M.: cGANs with projection discriminator. In: ICLR (2018)
20. Morise, M., Yokomori, F., Ozawa, K.: WORLD: a vocoder-based high-quality
speech synthesis system for real-time applications. IEICE Trans. Inf. Syst.
E99.D(7), 1877–1884 (2016)
21. van den Oord, A., et al.: WaveNet: A Generative Model for Raw Audio. arXiv
e-prints arXiv:1609.03499 (2016)
22. Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an ASR corpus
based on public domain audio books. In: IEEE ICASSP (2015)
23. Prenger, R., Valle, R., Catanzaro, B.: WaveGlow: a flow-based generative network
for speech synthesis. In: IEEE ICASSP (2019)
24. Qian, K., Zhang, Y., Chang, S., Yang, X., Hasegawa-Johnson, M.: AutoVC: zero-
shot voice style transfer with only autoencoder loss. In: PMLR (2019)
25. Rebryk, Y., Beliaev, S.: ConVoice: Real-Time Zero-Shot Voice Style Transfer with
Convolutional Network. arXiv e-prints arXiv:2005.07815 (2020)
26. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomed-
ical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F.
(eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015).
https://doi.org/10.1007/978-3-319-24574-4_28
27. Shuang, Z.W., Bakis, R., Shechtman, S., Chazan, D., Qin, Y.: Frequency warping
based on mapping formant parameters. In: INTERSPEECH (2006)
28. Sisman, B., Yamagishi, J., King, S., Li, H.: An Overview of Voice Conversion
and its Challenges: From Statistical Modeling to Deep Learning. arXiv e-prints
arXiv:2008.03648 (2020)
29. Smith, L.N.: Cyclical learning rates for training neural networks. In: IEEE WACV
(2017)
30. Stylianou, Y., Cappe, O., Moulines, E.: Continuous probabilistic transform for
voice conversion. IEEE Trans. Speech Audio Process. 6(2), 131–142 (1998)
31. Sun, L., Kang, S., Li, K., Meng, H.: Voice conversion using deep Bidirectional Long
Short-Term Memory based Recurrent Neural Networks. In: IEEE ICASSP (2015)
32. Sündermann, D., Strecha, G., Bonafonte, A., Höge, H., Ney, H.: Evaluation of
VTLN-based voice conversion for embedded speech synthesis. In: INTERSPEECH
(2005)
33. Toda, T., Black, A.W., Tokuda, K.: Voice conversion based on maximum-likelihood
estimation of spectral parameter trajectory. IEEE Trans. Audio Speech Lang. Pro-
cess. 15(8), 2222–2235 (2007)
34. Veaux, C., Yamagishi, J., Macdonald, K.: CSTR VCTK Corpus: English Multi-
speaker Corpus for CSTR Voice Cloning Toolkit (2017). http://homepages.inf.ed.
ac.uk/jyamagis/page3/page58/page58.html. Accessed 1 Sep 2020
35. Wan, L., Wang, Q., Papir, A., Moreno, I.L.: Generalized end-to-end loss for speaker
verification. In: ICASSP (2018)
36. Zhao, Y., et al.: Voice Conversion Challenge 2020: Intra-lingual semi-parallel and
cross-lingual voice conversion. arXiv e-prints arXiv:2008.12527 (2020)
37. Zhizheng, W., Haizhou, L.: Voice conversion versus speaker verification: an
overview. APSIPA Trans. Sig. Inf. Process. 3, e17 (2014)
Learning to Generalise in Sparse Reward
Navigation Environments
1 Introduction
2 Related Work
Generalisation remains a fundamental RL problem since agents tend to memo-
rise trajectories from their training environments instead of learning transferable
skills [7]. Classic RL benchmarks like the Arcade Learning Environment (ALE)
[3] focus on creating specialist agents that perform well in a single environ-
ment. New benchmarks have been proposed to focus research on generalisation.
The ProcGen Benchmark [7] uses procedural generation to generate new envi-
ronments. The inherent diversity in the generated environments demands that
agents learn robust policies in order to succeed. A similar framework is presented
in [19] with larger scale three-dimensional environments.
Justesen et al. [20], however, highlighted limitations of procedural generation: it is difficult to automatically scale the difficulty of the task [20], and the distribution of the procedurally generated environments is often different from that of human-generated environments. Procedurally generating environments may
lead to overfitting to the distribution of the generated environments [20]. A
novel approach that uses reinforcement learning to learn a policy for generating
environments shows promising results in [23].
Our work is inspired by Savinov et al. [32]. The authors emphasised the need
for separate training and testing environments and investigated generalisation in
custom maze environments with random goal placements. The aims of the study
were different but the principles were incorporated into the curriculum defined
in Subsect. 3.3. Similar findings were highlighted in other studies [8,42].
Curriculum learning was shown to decrease training times and improve gen-
eralisation across multiple common datasets in [4]. The main idea is to split a complex task into smaller, easier-to-solve sub-problems and to control the curriculum to ensure that the task is never too difficult for the agent [17]. Previous
work manually generated training curricula for various tasks [22,34]. A limita-
tion of this approach is the requirement of expert domain knowledge [39]. Various
studies attempted to alleviate this problem by presenting novel techniques for
automatically generating a curriculum [12,24,39]. Florensa et al. [12] presented
a method for automatically generating a curriculum that exhibited promising
results in sparse reward navigation environments. The maze environments from
the study have been incorporated into the present work. The curriculum in this work is manually designed, though only general concepts, such as environment size and obstacle configuration, were varied, so as to ensure that it did not require significant fine-tuning or expert knowledge.
Curriculum learning is an implicit form of generalisation [4]. Closely related
to curriculum learning is hierarchical reinforcement learning. Tessler et al. [40]
presented a framework that enabled agents to transfer “skills” learnt from easy
sub-tasks to difficult tasks requiring multiple skills. Agents learnt “high-level”
actions that pertain to walking and movement and used these skills to learn
difficult navigation tasks faster in [13]. Our curriculum has been designed to encourage this kind of learning implicitly, since there are no obstacles in the early stages of training, thereby allowing agents to focus on locomotion.
3 Methodology
3.1 The Task
The goal of the agent is to navigate from its starting point to a fixed distant
target, with obstacles or walls placed along its route. The agent is required to
learn foresight: it needs to learn to move further away from the target in the
present, in order to find the target in the future. The task is a variation of the
classic point-mass navigation task in various studies [10,11]. We consider an
agent interacting with an environment in discrete time steps. At each time step
t, the agent gets an observation ot of the environment and then takes an action
at from a set of actions A.
The observation set O comprises the coordinates of the agent’s current posi-
tion, the coordinates of the target, the distance to the goal and rays that extend
in 8 directions, at 45◦ intervals. These short rays provide essential feedback to
the agent by enabling it to detect walls and targets that are in its vicinity and
therefore adapt its policy accordingly.
The rays take on additional importance when agents are placed in previously
unseen environments since they enable the agents to learn robust policies: when
an agent detects an obstacle in its vicinity, it needs to learn to move away from
the obstacle, in the direction of an open path. If an agent executes memorised
actions, it will move directly into walls and never reach its destination.
The ray length was tuned to balance the difficulty of the task: if the rays are too long, the agent unrealistically detects objects that are far away, but if they are too short, the agent is unable to detect anything except that which is immediately in front of it. This is analogous to a field of view.
to equip the agents with a small memory of the immediate past. The previous
ten observations were stored at any given time.
The action set A allows the agent to move in eight directions: forwards, backwards and sideways, as well as diagonally, unlike the standard Gridworld task [42].
By default, before any training modifications are made, the environments
are all sparse reward environments since the agent only receives a +1 reward
for finding the target. The starting positions of the agent and the target are far
away from each other, on different ends of the environment. The agents do not
receive any intermediate rewards and incur a small penalty on every timestep,
to encourage them to find the target in the shortest possible time.
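The sketch below illustrates, under our own assumptions, what one time step of this observation and reward structure could look like; the feature ordering, the ray encoding and the size of the per-step penalty are not specified in the text and are chosen purely for illustration.

```python
import numpy as np

# Sketch (assumptions noted above) of one time step's stacked observation and reward.
def build_observation(agent_xy, target_xy, ray_hits, history, stack=10):
    distance = np.linalg.norm(np.asarray(target_xy) - np.asarray(agent_xy))
    obs = np.concatenate([agent_xy, target_xy, [distance], ray_hits])  # 8 rays at 45 degree intervals
    history.append(obs)
    frames = list(history)[-stack:]
    frames = [frames[0]] * (stack - len(frames)) + frames              # pad until ten frames are stored
    return np.concatenate(frames)

def reward(found_target: bool, step_penalty: float = 0.001):
    return 1.0 if found_target else -step_penalty                      # sparse +1, small per-step penalty
```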
3.2 Environments
There are multiple environments and each varies in terms of the configuration of
walls and obstacles (see Fig. 1). This is to deter agents from learning an optimal policy in one single environment, and instead encourage them to learn the "skill" of finding a target in an arbitrary navigation environment. The predefined environments were carefully designed to represent high-level features or environment characteristics that include dead-ends and multiple paths to the target. We theorise that introducing agents to numerous environment features during training allows them to learn a flexible policy that enables them to find targets when similar features are found in new environments. The environments were divided into a set of training and
testing environments. The generalisability of the agents was evaluated in the
testing environments.
The training environments were further divided into three categories: Obsta-
cle environments (see Fig. 1a) contain only a single obstacle that varies in terms
of size and orientation. The sizes range from a scale of 0 to 3 and the orientation
is defined as any angle from 0◦ , in 45◦ increments. The size of the agent and ray
length are also depicted in Fig. 1a to illustrate the scale of the task.
Maze environments have multiple obstacles and were subdivided based on
difficulty. There are Standard mazes in Fig. 1b and Difficult mazes in Fig. 1c.
Difficult mazes have multiple obstacles that span more than half the width of
the entire environment. They also include more complex versions of some of the
Standard mazes, by manipulating the size of each obstacle in an environment.
The “u-maze” from [11] was also incorporated into this group. The difficult
mazes were deliberately designed to test the boundaries of the algorithms and
to identify limitations.
Fig. 1. Training environments: (a) Obstacle, (b) Standard Mazes, (c) Difficult Mazes
The testing environments were divided into two categories: Orientation and
New. Orientation testing environments were created by rotating the training
mazes by 90◦ and without changing the overall structure of the obstacles. New
testing environments have different obstacle configurations to the training envi-
ronments. New features or environment characteristics, such as bottlenecks or
repeated obstacles, were incorporated into this group. This allowed us to anal-
yse whether the agents were able to learn advanced skills and further assess the
extent of the generalisation. Both these categories were further subdivided into
Standard and Difficult subcategories, as per the definition used for the train-
ing environments. An illustration of the Orientation environments is shown
in Fig. 2a. Both the Standard New and Difficult New groups, depicted in Fig. 2b
and c respectively, contain three mazes each. The “spiral-maze”, a commonly
used maze seen in [11], was incorporated into the difficult category.
3.3 Algorithms
Environment parameters are varied over time to control the difficulty of the
task to ensure that the current task is never too difficult for the agent. The first
parameter is the environment size: decreasing the size, while keeping the agent
size and speed the same, decreases the sparsity of rewards since the agent and the target are closer to each other in smaller environments. The second parameter is
the obstacle configuration, which is varied through changing the number and size
of obstacles: either single obstacles or multiple obstacles in a maze-like structure.
In the early stages of training, the environments are small and contain a
single obstacle or none at all. This was achieved by assigning O, in Algorithm 1,
to the obstacle environments in Fig. 1a. Agents are able to learn how to con-
trol themselves by navigating around the environment to nearby targets. When
the average reward (over the past 5000 consecutive episodes) reaches a prede-
fined threshold, the difficulty is increased. The first adjustment is to increase the
size and number of obstacles, through randomly sampling maze environments
from Fig. 1b and in Fig. 1c. When the agent reaches the same predefined reward
threshold, the environment size is increased. This two-fold difficulty adjustment
keeps occurring until the agent progresses to large maze environments with mul-
tiple obstacles. This ensures that the curriculum only progresses when the agent
has succeeded in its current task.
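The sketch below captures this progression logic under our own assumptions: the reward threshold, the discrete set of environment sizes and the environment pool names are illustrative only, and the authors' Algorithm 1 defines the actual procedure.

```python
import random
from collections import deque

class CurriculumSketch:
    """Illustrative curriculum controller: advance when the average reward over
    the last 5000 episodes reaches a threshold, alternating between harder
    obstacle configurations and larger environments (stages are assumptions)."""
    def __init__(self, obstacle_envs, maze_envs, threshold=0.8):
        self.threshold = threshold
        self.rewards = deque(maxlen=5000)
        self.stages = [
            (obstacle_envs, "small"),   # single obstacles, small environment
            (maze_envs, "small"),       # first adjustment: more/larger obstacles
            (maze_envs, "medium"),      # second adjustment: larger environment
            (maze_envs, "large"),
        ]
        self.stage = 0

    def record(self, episode_reward):
        self.rewards.append(episode_reward)
        full = len(self.rewards) == self.rewards.maxlen
        if full and sum(self.rewards) / len(self.rewards) >= self.threshold:
            if self.stage < len(self.stages) - 1:
                self.stage += 1
                self.rewards.clear()    # re-measure on the harder task

    def sample_environment(self):
        pool, size = self.stages[self.stage]
        return random.choice(pool), size   # environments are sampled at random
```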
Randomly sampling environments is an important aspect of the curriculum.
It is also essential that the set of training environments is diverse and incorpo-
rates a wide array of obstacle configurations [7]. This deters agents from memo-
rising the dynamics of any particular training environment, instead learning how
Figure 3 highlights the benefits of using the curriculum. The learning curve
never drops significantly since the agents’ task is never too difficult. The curricu-
lum advances quickly in the early stages of training when the task is easier. The
sudden drops in reward are indicative of points at which the task is made more
difficult, but the fact that the curve peaks again very quickly thereafter indicates that
knowledge is being transferred between tasks. In all runs, it was noted that the
curriculum agent converged significantly faster than the curiosity agent.
A major benefit of the curriculum is that there is no reward shaping nec-
essary. This is due to the manner in which the curriculum was designed that
ensures that the agents always receive sufficient reward feedback during training.
We performed an empirical investigation into various shaped rewards
and found no performance improvements. Rather, the motivations of the agents
became polluted [9,26]. For example, when an agent was rewarded for mov-
ing closer to the target, it lacked the foresight to move past obstacles. Shaping
rewards also resulted in more specialist policies that work well in some environ-
ments, but poorly in others. Reward shaping also requires additional information
which may not be available in the real world.
The curiosity curve shows rewards slowly increasing as training progresses.
The hybrid training curve is very similar to the curriculum agent. When the
curiosity strength was varied, the curves still followed a similar pattern. This
indicates that the curiosity rewards had little effect on the training process when
coupled with the curriculum.
Figure 3b illustrates that, for all algorithms, the agents were able to efficiently
find the target in all training environments, under the sparse reward setting. All
algorithms have an average reward that approaches a maximum possible reward
of +1. These results act as a validation of each algorithm since they indicate that all
agents have obtained sufficient knowledge of the task and are able to find targets
across a diverse set of mazes. This allowed us to perform a fair comparison of the
generalisation capabilities of each algorithm in the testing environments. Error
bars are depicted with a confidence interval of 95%.
4.2 Generalisability
The best performing training run from Subsect. 4.1 was selected for each algo-
rithm. The average reward was then analysed for each of the different groups of
testing environments. Each algorithm was run for 1000 episodes, with a random
testing environment being sampled at the start of the episode, from the corre-
sponding testing group. This is necessary due to the stochastic nature of the
policies: the agents sometimes succeed and sometimes fail in the same testing environment.
This results in vastly different episodic rewards and a large number of episodes
is therefore necessary to stabilise the average rewards.
When analysing the results, there are certain important considerations that
need to be made. The performance of each algorithm is often different, i.e. agents
succeed and fail in different testing environments. There are instances when one
algorithm enabled agents to navigate to the target in a short time, but another
resulted in agents only finding the target after a large number of episode steps
or never at all. We wish to investigate this phenomenon further in future work.
The task is not trivial since it is analogous to placing a human or vehicle in
a new environment and equipping it only with information about its current
location, its destination, and the ability to “see” what is around it. It does not have
any knowledge of the dynamics of the environment that it is placed in. This
means that some “exploration” is necessary and it is expected that agents will
move into obstacles as they try to advance towards the goal. It is not possible to
solve the generalisation problem completely: it was not expected that the agents
would obtain expert performance in the testing environments. The goal is rather
to transfer some knowledge that can be reused in the environments.
The policies are used “as-is” and there is no fine-tuning for any of the testing
environments, unlike in other studies where fine-tuning is performed [29]. It is certainly
possible to improve the results in each testing environment by fine-tuning the policy, though
that is not the aim of this study. This work rather investigated the extent to
which the learned policy generalised.
Lastly, the sample size in the testing environment groups is fairly small.
There are only three environments in some groups. In future work, we wish to
investigate whether the results hold when increasing the size of the groups.
The average episodic rewards are in the range [−1, 1). A successful run is one
in which agents are able to navigate to the target. The faster an agent finds the
target, the higher the reward it receives. An average reward approaching one
therefore indicates that the agents successfully found the targets on all runs, whereas a
score well below one indicates that the agents were unable to find the target on most
runs across the environments, with zero representing an inflection point.
The results highlight an expected gap between training and testing perfor-
mance. However, they also indicate that some generalisation has taken place.
Difficult Mazes. While the results of the algorithms in the standard mazes
showed similar performance, the agents trained using the curriculum performed
best in the difficult mazes.
The Difficult Orientation results in Fig. 4b indicate that the agents were
unable to find the target on most runs. However,
some transfer has taken place. The curriculum obtained the highest average
reward: the result is statistically significant under a 95% confidence interval.
Interestingly, both the curriculum and hybrid agents succeeded in two of the
five environments but the hybrid agent took significantly longer to find the tar-
gets. The hybrid agent is the worst performing algorithm; this indicates that
generalisability decreases significantly as the difficulty of the environments is
increased. The performance of the curiosity-driven agent showed limited transfer
to the testing environments with agents only succeeding in one environment.
Difficult New experiments show the least transfer, as expected. The curricu-
lum agent is once again the most successful. The nature of the environments
means that agents are able to find the targets on some runs, though not consis-
tently. The most promising result was that the curriculum agent was the only
algorithm that succeeded in the “spiral-maze” [11] depicted in Fig. 2c.
References
1. Centre for high performance computing. https://www.chpc.ac.za/
2. Andrychowicz, M., et al.: Hindsight experience replay. In: Advances in Neural
Information Processing Systems, pp. 5048–5058 (2017)
3. Bellemare, M.G., Naddaf, Y., Veness, J., Bowling, M.: The arcade learning environ-
ment: an evaluation platform for general agents. J. Artif. Intell. Res. 47, 253–279
(2013)
4. Bengio, Y., Louradour, J., Collobert, R., Weston, J.: Curriculum learning. In: Pro-
ceedings of the 26th Annual International Conference on Machine Learning, ICML
2009, pp. 41–48. Association for Computing Machinery, Montreal (2009). https://
doi.org/10.1145/1553374.1553380
5. Burda, Y., Edwards, H., Pathak, D., Storkey, A., Darrell, T., Efros, A.A.: Large-
scale study of curiosity-driven learning. In: International Conference on Learning
Representations (2019). https://openreview.net/forum?id=rJNwDjAqYX
6. Burda, Y., Edwards, H., Storkey, A., Klimov, O.: Exploration by random network
distillation. arXiv preprint arXiv:1810.12894 (2018)
7. Cobbe, K., Hesse, C., Hilton, J., Schulman, J.: Leveraging procedural generation to
benchmark reinforcement learning. arXiv preprint arXiv:1912.01588, p. 27 (2019)
8. Cobbe, K., Klimov, O., Hesse, C., Kim, T., Schulman, J.: Quantifying generaliza-
tion in reinforcement learning. arXiv preprint arXiv:1812.02341, p. 8 (2018)
9. Devlin, S.M., Kudenko, D.: Dynamic potential-based reward shaping (2012).
http://eprints.whiterose.ac.uk/75121/
10. Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking
deep reinforcement learning for continuous control. In: International Conference
on Machine Learning, pp. 1329–1338 (2016). http://proceedings.mlr.press/v48/
duan16.html
11. Florensa, C., Held, D., Geng, X., Abbeel, P.: Automatic goal generation for rein-
forcement learning agents. In: International Conference on Machine Learning,
pp. 1515–1528 (2018). http://proceedings.mlr.press/v80/florensa18a.html
12. Florensa, C., Held, D., Wulfmeier, M., Zhang, M., Abbeel, P.: Reverse curriculum
generation for reinforcement learning. arXiv:1707.05300 [cs] (2018)
13. Frans, K., Ho, J., Chen, X., Abbeel, P., Schulman, J.: Meta learning shared hier-
archies. arXiv:1710.09767 [cs] (2017)
14. Hacohen, G., Weinshall, D.: On the power of curriculum learning in training deep
networks. arXiv:1904.03626 [cs, stat] (2019)
15. Hussein, A., Elyan, E., Gaber, M.M., Jayne, C.: Deep reward shaping from demon-
strations. In: Proceedings of the 2017 International Joint Conference on Neural
Networks (IJCNN), pp. 510–517. IEEE (2017)
16. Jeewa, A., Pillay, A., Jembere, E.: Directed curiosity-driven exploration in hard
exploration, sparse reward environments. In: Davel, M.H., Barnard, E. (eds.) Pro-
ceedings of the South African Forum for Artificial Intelligence Research, Cape
Town, South Africa, 4–6 December 2019, CEUR Workshop Proceedings, vol. 2540,
pp. 12–24. CEUR-WS.org (2019). http://ceur-ws.org/Vol-2540/FAIR2019_paper_42.pdf
17. Jiang, L., Meng, D., Zhao, Q., Shan, S., Hauptmann, A.G.: Self-paced curricu-
lum learning. In: Proceedings of the Twenty-Ninth AAAI Conference on Artificial
Intelligence (2015). https://www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/
view/9750
18. Juliani, A., et al.: Unity: a general platform for intelligent agents. arXiv:1809.02627
[cs, stat] (2018)
19. Juliani, A., et al.: Obstacle tower: a generalization challenge in vision, control, and
planning. arXiv:1902.01378 [cs] (2019)
20. Justesen, N., Torrado, R.R., Bontrager, P., Khalifa, A., Togelius, J., Risi, S.: Illu-
minating generalization in deep reinforcement learning through procedural level
generation. arXiv:1806.10729 [cs, stat] (2018)
21. Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: Interna-
tional Conference on Machine Learning, pp. 2469–2478 (2018). http://proceedings.
mlr.press/v80/kang18a.html
22. Karpathy, A., van de Panne, M.: Curriculum learning for motor skills. In: Kosseim,
L., Inkpen, D. (eds.) AI 2012. LNCS (LNAI), vol. 7310, pp. 325–330. Springer,
Heidelberg (2012). https://doi.org/10.1007/978-3-642-30353-1 31
23. Khalifa, A., Bontrager, P., Earle, S., Togelius, J.: PCGRL: Procedural content
generation via reinforcement learning. arXiv:2001.09212 [cs, stat] (2020)
24. Matiisen, T., Oliver, A., Cohen, T., Schulman, J.: Teacher-student curriculum
learning. In: IEEE Transactions on Neural Networks and Learning Systems, pp.
1–9 (2019). https://doi.org/10.1109/TNNLS.2019.2934906
25. Narvekar, S., Stone, P.: Learning curriculum policies for reinforcement learning.
arXiv:1812.00285 [cs, stat] (2018)
26. Ng, A.Y., Harada, D., Russell, S.: Policy invariance under reward transformations:
theory and application to reward shaping. In: ICML 1999, pp. 278–287 (1999)
27. Oudeyer, P.Y., Kaplan, F.: What is intrinsic motivation? A typology of computa-
tional approaches. Front. Neurorobotics 1, 6 (2009)
28. Packer, C., Gao, K., Kos, J., Krähenbühl, P., Koltun, V., Song, D.: Assessing
generalization in deep reinforcement learning. arXiv:1810.12282 [cs, stat] (2019)
29. Pathak, D., Agrawal, P., Efros, A.A., Darrell, T.: Curiosity-driven exploration
by self-supervised prediction. In: Proceedings of the IEEE Conference on Com-
puter Vision and Pattern Recognition Workshops (CVPRW 2017), pp. 488–
489. IEEE, Honolulu (2017). https://doi.org/10.1109/CVPRW.2017.70, http://
ieeexplore.ieee.org/document/8014804/
30. Ramachandran, P., Zoph, B., Le, Q.V.: Searching for activation functions.
arXiv:1710.05941 [cs] (2017)
31. Ravishankar, N.R., Vijayakumar, M.V.: Reinforcement learning algorithms: survey
and classification. Indian J. Sci. Technol. 10(1), 1–8 (2017). https://doi.org/10.
17485/ijst/2017/v10i1/109385, http://www.indjst.org/index.php/indjst/article/
view/109385
32. Savinov, N., Dosovitskiy, A., Koltun, V.: Semi-parametric topological memory for
navigation. arXiv:1803.00653 [cs] (2018)
33. Savinov, N., et al.: Episodic curiosity through reachability. arXiv:1810.02274 [cs,
stat] (2019)
34. Schmidhuber, J.: POWERPLAY: training an increasingly general problem solver
by continually searching for the simplest still unsolvable problem. arXiv:1112.5309
[cs] (2012)
35. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy
optimization algorithms. arXiv:1707.06347 [cs] (2017)
36. Suay, H.B., Brys, T.: Learning from demonstration for shaping through inverse
reinforcement learning, p. 9 (2016)
37. Suay, H.B., Brys, T., Taylor, M.E., Chernova, S.: Reward shaping by demonstra-
tion. In: Proceedings of the Multi-Disciplinary Conference on Reinforcement Learn-
ing and Decision Making (RLDM) (2015)
38. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction, 2nd edn.
MIT Press, Cambridge (2018)
39. Svetlik, M., Leonetti, M., Sinapov, J., Shah, R., Walker, N., Stone, P.: Automatic
curriculum graph generation for reinforcement learning agents. In: Proceedings of
the Thirty-First AAAI Conference on Artificial Intelligence (2017). https://www.
aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14961
40. Tessler, C., Givony, S., Zahavy, T., Mankowitz, D.J., Mannor, S.: A deep hierar-
chical approach to lifelong learning in minecraft. arXiv:1604.07255 [cs] (2016)
41. Ye, C., Khalifa, A., Bontrager, P., Togelius, J.: Rotation, translation, and cropping
for zero-shot generalization. arXiv:2001.09908 [cs, stat] (2020)
42. Zhang, C., Vinyals, O., Munos, R., Bengio, S.: A study on overfitting in deep
reinforcement learning. arXiv:1804.06893 [cs, stat] (2018)
Evaluation of a Pure-Strategy Stackelberg
Game for Wildlife Security
in a Geospatial Framework
1 Introduction
Rhino poaching continues to be a major problem in South Africa. Although
rhino deaths started decreasing in 2015, the number of poaching activities inside
and adjacent to the Kruger National Park (KNP) has only decreased from 2 466
in 2015 to 2 014 in 2019 [3,4]. Anti-poaching units offer an attempt to com-
bat rhino poaching [15] and to facilitate such efforts, this paper focuses on
wildlife security games [23]. These games make use of the Stackelberg Secu-
rity Game (SSG) [5], the current game-theoretic approach in security domains.
In the domain of wildlife security, the attackers are poachers, the defenders are
rangers and the targets to protect are moving animals. SSGs are used to opti-
mally allocate limited ranger resources in a wildlife park where attacks on the
animals occur frequently. To our knowledge, there is currently no evaluation
framework for wildlife security games and there are inconsistencies in how they
are evaluated. Evaluating the games based on expected utility is difficult because it
depends on the locations of the wildlife, which are constantly moving. This
paper introduces a framework which simulates the movements of the poachers,
rangers, and wildlife to address this. The simulation studies are intended to
provide the rangers with an estimate of the average behaviour in one month.
We propose clear evaluation metrics for the rangers to assess their performance,
which allows them to decide on the best strategy for real-world implementation
to combat the poachers. Furthermore, we propose acting as the Stackelberg fol-
lower instead of the leader in this domain. A simple pure-strategy Stackelberg
game is designed and implemented within the framework to test this idea.
The SSG is an extensive form game wherein the defenders act first and the
attackers follow. The game can be represented by a game tree where the branches
are the actions of the agents and their payoffs for each combination of actions are
given at the terminal nodes. The SSG assumes that attackers conduct surveil-
lance on the defenders to obtain complete knowledge of their mixed-strategy
and then respond with a pure-strategy by attacking a single target [21]. A pure-
strategy for each agent consists of the cross product of the set of actions available
to them at each of their information states. A pure-strategy equilibrium provides
the optimal action to take at each of their nodes in the tree. A mixed-strategy
is a probability distribution over the set of pure-strategies. A Green Security
Game (GSG), which includes protection of wildlife, fisheries, and forests, has
frequent attacks on targets. Attackers can therefore not afford much time to
conduct extensive surveillance to learn the mixed-strategy of the defenders [10].
Furthermore, there could be many attackers present at any given time [17] so
assuming a pure-strategy response for the attackers is not viable. Attackers in
this domain often return to sites of past success [13] and since attacks occur fre-
quently, the defenders can gather enough observational data to learn the mixed-
strategy of the attackers. Thus, we propose that the defenders could perform
better when acting as the Stackelberg follower. There is not always an advan-
tage in terms of payoffs to being the Stackelberg leader [16], so it is reasonable
to reverse the roles of the defender and the attacker in the domain of GSGs.
Although the follower in SSGs acts with a pure-strategy, this is only necessary
to ease the computation of the leader’s optimisation problem [18]. However, we
do not need to find an optimal strategy for the attackers since we learn this from
the data. We only need to solve the follower’s problem, which is computationally
much simpler, to find the defenders’ best response to the mixed-strategy of the
attacker.
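As a small illustration of why the follower's problem is computationally simple, the sketch below computes a defender best response to a learned attacker mixed-strategy by maximising expected utility over the defender's pure-strategies; the payoff matrix and probabilities are hypothetical, not taken from the paper.

```python
# Defenders' best response to a learned attacker mixed-strategy (the follower's problem).
# u_defender[i][j] is the defenders' payoff when they cover target i and the attacker
# attacks target j; the values and probabilities below are hypothetical.
u_defender = [
    [4, 1],    # defenders cover target 1
    [-1, 3],   # defenders cover target 2
]
attacker_mixed = [0.7, 0.3]   # learned probability of the attacker choosing each target

# Expected utility of each defender pure-strategy against the attacker mixed-strategy.
expected = [sum(p * u for p, u in zip(attacker_mixed, row)) for row in u_defender]
best = max(range(len(expected)), key=lambda i: expected[i])
print(f"Best response: cover target {best + 1} (expected utility {expected[best]:.2f})")
```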
2 Related Work
The Bayesian SSG has become the standard approach for security games and
handles uncertainty around the attackers’ payoffs by assuming different types of
attackers with an a priori distribution over those types [18]. Uncertainties
due to the attackers’ bounded rationality and limited observations are addressed
by using robust algorithms [19]. The algorithms are evaluated by assessing the
defenders’ reward against two baselines: the uniform strategy, which assigns
equal probability of taking each strategy; and the maximin strategy, which max-
imises the minimum reward of the defenders irrespective of the attackers’ actions.
The first application of the Bayesian SSG for wildlife security is the PAWS
algorithm [23]. Since the targets to protect are moving animals whose location is
not always known, the wildlife park is divided into a grid of cells which become
the new targets. Available poaching data is utilised to learn a behavioural model
for the poachers to account for their bounded rationality. Evaluation of the
SSG algorithm is achieved by comparing the cumulative expected utility of the
rangers over 30 simulated rounds of the game against the maximin strategy on
a grid of 64 cells. Different behavioural models are compared in Yadav et al. [22]
where the models are learned using data from a wildlife park in Indonesia and
their predictive performance is tested using ROC curves. The SSG algorithm is
evaluated on 25 randomly selected grid cells, where the rangers’ maximum regret
of the optimal strategy is compared to that of the real world.
The work in Kar et al. [13] follows a similar approach to that of Yadav
et al. [22] by comparing different behavioural models. However, they develop a
computer game with a 5 × 5 grid over a Google Maps view of the wildlife park.
A probability heat map of the rangers’ mixed-strategy is overlaid onto the grid
and the wildlife is arranged in four different payoff structures. On average 38
human subjects played as the attackers in 5 rounds of each game to learn the
behavioural models and the defenders’ utility against these subjects is compared
for evaluation. Although there is some similarity to this work, no movements of
the wildlife, the rangers or the poachers are considered.
In our earlier work [14], a null game is designed where the rangers and the
poachers both act randomly. Thus, the uniform strategy [19] is similar to this
null game. Yet none of the research on wildlife security uses the uniform strategy
for comparison and the maximin strategy is the only baseline model considered.
While evaluating models by simulating repeated instances of the game or by
applying the game to real-world data is valid, the data used only provides a
snapshot of the situation. Since the wildlife are constantly foraging for food, the
poachers foraging for wildlife and the rangers foraging for poachers, it becomes
important to know more about their spatial and temporal movements [15]. Mea-
suring the expected utility is useful for comparison but since it is calculated based
on the locations of the wildlife, who are constantly moving, the actual number
of wildlife poached is more valuable. The evaluation framework presented in this
paper allows for including real-world data and is intended to offer an alternative
to implementation in a wildlife park.
and for each month we record the number of poaching events and the number
of times the poachers leave or are arrested. The monthly cycle is simulated for
1 000 Monte Carlo repetitions and the averages over the simulations provide
the measures of performance. Different ways of moving were compared in simulation,
and movement towards a destination provides the smoothest and most realistic
movement for the rangers and the poachers. The start and destination cells for
the rangers and the poachers are chosen completely randomly. Although the
grid cells are supposed to correspond to a map of the wildlife park, the game
does not take any geographical features into account. We modify that game
to incorporate geographical information into the framework. Figure 1 shows a
map of the KNP and a subarea used to demonstrate how the algorithm works.
Currently we use the public shapefiles in the SANParks data repository [2] for
the KNP. The geopandas Python library [1] is utilised so that geographical data
can be used, and distances can be calculated within the actual wildlife park. Two
classes are created, a Park class and an Agent class. Within the Park class, all
the geographical information is collected. A grid is also calculated, based on how
large the cells should be, for either the whole park or a subarea of the park.
Two attributes are created for the Agent class to exclude cells in the grid,
corresponding to areas where the agent cannot go. The first restricts agents from
entering areas which contain geographical obstacles such as steep mountains,
dense vegetation, rivers, and dams. The second restricts an agent to stay within
a specific area. For example: you might want to restrict the wildlife to stay within
an area defined by a census or home range analysis; ranger patrols might need
to stay within a certain distance from their patrol huts; or poachers might need
to stay close enough to their homes. Figure 2 shows a grid of 566 cells that are
1.5 × 1.5 km in size, for the subarea in Fig. 1, where the cells excluded are white.
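A rough sketch of how such a grid could be constructed and filtered with geopandas is given below; the shapefile names, CRS and cell size are placeholders rather than the actual SANParks data [2].

```python
import geopandas as gpd
import numpy as np
from shapely.geometry import box

# Placeholder file names; the actual SANParks shapefiles [2] differ.
park = gpd.read_file("knp_boundary.shp").to_crs(epsg=32736)      # projected CRS in metres (assumed)
obstacles = gpd.read_file("knp_obstacles.shp").to_crs(park.crs)   # mountains, rivers, dams, ...

cell_size = 1500                                                   # 1.5 x 1.5 km cells
xmin, ymin, xmax, ymax = park.total_bounds
cells = [
    box(x, y, x + cell_size, y + cell_size)
    for x in np.arange(xmin, xmax, cell_size)
    for y in np.arange(ymin, ymax, cell_size)
]
grid = gpd.GeoDataFrame(geometry=cells, crs=park.crs)

# Keep only cells that fall inside the park (or subarea) ...
grid = grid[grid.intersects(park.unary_union)]
# ... and exclude cells containing geographical obstacles the agent cannot enter.
allowed = grid[~grid.intersects(obstacles.unary_union)]
print(f"{len(allowed)} allowed cells")
```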
Fig. 1. Map and subarea of the Kruger National Park. The map on the left illustrates
the geographical information used within the framework, while the subarea shown on
the right will be the focus of this study.
The null game serves as a baseline model, so that when compared with other
games, the performance of the rangers’ success in protecting wildlife and arrest-
ing poachers can be assessed. We would like to know whether executing the GSG
helps the rangers to perform better than when they execute random motion.
However, in our previous attempt to construct a realistic baseline model, both
the rangers and the poachers moved from a random starting cell towards a ran-
dom destination cell (uniform game). This makes it difficult to assess whether
the rangers’ performance improves when both the rangers and the poachers act
according to the GSG since the poachers’ strategy improves at the same time as
the rangers’ strategy. Thus, in the baseline model, the random rangers need to
compete with more intelligent poachers, those who learn, to truly evaluate any
performance increase of the rangers when they act according to the GSG.
In a game theory algorithm, we would use information about an agent’s pref-
erences to try and quantify their payoffs. Similarly, we can use this information
to determine how an intelligent agent might move through the wildlife park.
Another two attributes were created for this: one for features they dislike and
how far they would like to stay away from them; and one for features they like
and how near they would like to stay to them. For example: the poachers would
probably like to avoid any entrance gates to the park since rangers often conduct
searches there; they would likely stay away from main roads, camps and picnic
spots to avoid being identified by the public; they might prefer to stay near to
the park border to make escape easier; and they would possibly like to stay near
to water sources since it is likely that they might find wildlife there. These pref-
erences are implemented by increasing or decreasing selection weights for each
cell. For each feature, each cell starts with a weight of w = 0.5. If it is a feature
that the poachers would like to stay d km away from, then the weight starts
decreasing for cells that are within d km away and continues to decrease as the
cells get nearer to the feature: if a cell $c_i$ has minimum distance $d_i$ km from the
feature, then its weight becomes $w = w \times [1 - (d - d_i)/d]$. Similarly, if it is a feature
that the poacher would like to stay within $d$ km of, then the weight increases more
for cells $c_i$ which are nearer to that feature: $w = w \times [1 + (d - d_i)/d]$. The weights
are increased or decreased in this manner for each feature that the poacher has
preferences for. Figure 2 shows the cell selection weights for a poacher who dis-
likes being 2 km from camps, 3 km from roads and 5 km from gates, and who
likes being within 15 km of dams and water and within 30 km of the border.
The weights are depicted by a colour scale, where darker colours indicate higher
values.
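A minimal sketch of this weight update is shown below, assuming the minimum distance from every cell to every feature has already been computed (for example with geopandas); the feature names and distances are illustrative only.

```python
def selection_weights(cell_distances, avoid=None, prefer=None):
    """cell_distances: {feature: [min distance (km) from each cell to the feature]}.
    avoid/prefer: {feature: d} -- stay d km away from / within d km of the feature."""
    n_cells = len(next(iter(cell_distances.values())))
    weights = [0.5] * n_cells                          # every cell starts at w = 0.5
    for feature, d in (avoid or {}).items():
        for i, d_i in enumerate(cell_distances[feature]):
            if d_i < d:                                # decrease weight near disliked features
                weights[i] *= 1 - (d - d_i) / d
    for feature, d in (prefer or {}).items():
        for i, d_i in enumerate(cell_distances[feature]):
            if d_i < d:                                # increase weight near liked features
                weights[i] *= 1 + (d - d_i) / d
    return weights


# Toy example with three cells and illustrative distances (km).
w = selection_weights(
    {"gates": [1.0, 4.0, 10.0], "water": [2.0, 20.0, 5.0]},
    avoid={"gates": 5},
    prefer={"water": 15},
)
print(w)
```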
Along with moving towards their preferences, the poachers also learn from
events that occur. The poachers begin in a random cell, either on the border of
the park or at the edge of the grid and proceed towards a random destination
cell. If they reach the destination cell with no event, then they head off on a new
trajectory back towards the start cell and we record that they left before poach-
ing. If they leave safely without being arrested, then they continue to use that
point of entry because they assume there is low risk in being arrested there.
Fig. 2. Poacher allowed cells and cell selection weights. The cells excluded for the
poachers are shown in white and the selection weights are depicted by a colour scale
where darker colours represent higher values.
The poachers can either encounter wildlife for poaching on their trajectory
towards the destination cell or on their way back to the start. We assume that
once the wildlife has been poached, that they would like to exit promptly before
being arrested. Thus, after a poaching occurs while going towards the destina-
tion, they change direction and head back towards the start cell. Furthermore,
since they know where the wildlife are likely to be, they want to return to that
area, so the poaching cell becomes their new destination cell. If the poachers
encounter the wildlife while going towards the start, then they continue on the
same trajectory towards the start after the poaching event. They would also
adopt the poaching cell as their new destination cell when re-entering the park
in this event. We do not allow a second poaching event on the trajectory
towards the start cell after a poaching; the poachers can only poach again after
re-entering the park. We record that the poachers left after poaching if they leave the
park safely in the two scenarios described above. However, if they are arrested before
reaching the start cell then we record that they were arrested after poaching. Of
course, the poachers could also be arrested without having poached any wildlife
and in this case, we record that they were arrested before poaching. After being
arrested, the poachers must re-enter the park, but they will choose a new ran-
dom entry point since they did not have success going towards the current start
cell. Figure 3 demonstrates the different scenarios that occur during one game.
The two attributes describing the poachers’ preferences can be helpful in
making the wildlife movement more realistic as well. Some examples include
wildlife wanting to stay near water; liking specific types of vegetation for grazing;
enjoying mud baths or shady areas; and wanting to avoid camp areas where there
are lots of people. The null game is thus simulated with better movement for the
wildlife as well, where the destination cell is chosen as a random cell near water
and movement is based on the preferences specified.
Fig. 3. Scenarios that can occur and how the poachers learn from events (Leave Before
Poach, Leave After Poach, Arrest Before Poach, Arrest After Poach). Trajectories
continue back and forth between the start and destination cell until the game ends after
a specified number of moves. An open circle represents the start of each trajectory and
an arrow represents the end of that trajectory. A solid circle represents a poach or
arrest event which can occur along any trajectory.
The game continues with the wildlife, rangers and poachers moving back
and forth between their start and destination cells and ends after a specified
number of moves. For example, considering a grid with 1 km² cells, a person
could walk 100 km in 20 h at a speed of 5 km/h. Allowing short stops to
rest or eat, 100 moves would consume an entire day. Thirty such games could
be played per month and the games are simulated for a specified number of
months to determine their average monthly behaviour. The number of poaching
events and arrest events per month are recorded and are used to calculate the
poach frequency per day and arrest frequency per day, so that games of different
lengths can be compared. When these measures are similar, we consider two
secondary measures: the average number of moves for each arrest and the aver-
age distance between the poachers and rangers for games with no arrests. For
further understanding and analysis, we also record how many times the agents
reach their start cell or destination cell to keep track of their trajectories. The
movingpandas Python library [12] is utilised to store trajectories which can be
easily analysed and plotted after the simulation.
The simple pure-strategy Stackelberg game is designed under the following assumptions:
– there is only one group of rangers and one group of poachers, where each
group acts together and they cannot split up;
– the park is divided into two grid cells;
– the rangers act as the leader and commit to protecting a single grid cell;
– the poachers observe which cell the rangers protect and react by attacking a
single grid cell; and
– the rangers and the poachers act to maximise their own expected utility.
With these assumptions set clearly, we can identify the components of the
game. We know that the agents are the rangers and the poachers, and that the
actions are the coverage of the two grid cells by these agents. We do not yet know
what the payoffs are, but we know that the outcome will include the number
of rhino saved and/or poached and whether the poacher is arrested. For this
example, we have two rhinos in grid cell 1 which is 500 m from the border and
one rhino in grid cell 2 which is on the border. The four possible events are shown
in Fig. 4. For the outcomes of each event, the solid rhinos represent the number
of rhinos saved and the dotted rhinos represent the number of rhinos poached.
If the rangers and the poachers are in the same grid cell, then the poachers are
arrested and no rhinos are poached.
Fig. 4. Game events and outcomes. For events, the policeman represents the rangers,
the gun represents the poachers and the solid (black) rhino depicts the animals in each
grid cell. For outcomes, the solid (green) rhino depicts the saved rhinos, the dotted (red)
rhino depicts the poached rhinos and an arrest is represented by handcuffs. (Color figure
online)
In order to solve the game, we need to define the pure-strategies for each
agent and quantify the outcomes as payoffs. Since the rangers act first, they
have one information state and their pure-strategies are just their actions, cell 1
and cell 2, given by the set $S^R = \{1, 2\}$. Because the poachers observe the
rangers’ action, they have two information states: P.1 when the rangers go to
cell 1 and P.2 when the rangers go to cell 2. The pure-strategies of the poachers
thus need to specify what action to take at each information state, and are given
by the set $S^P = \{(i, j) : i, j = 1, 2\}$, which means cell $i$ at P.1 and cell $j$
at P.2. To calculate the payoffs, let the value of a rhino be 1 and the value of
an arrest be 2. A poached rhino will count as negative for the rangers and an
arrest will count as negative for the poachers. The payoffs can also include an
agent’s preferences, so let the penalty for the poachers be $-1$ for every cell that
is 500 m away from the border. Let $u^R(S^R_i, S^P_{j.i})$ be the payoff for the rangers
and $u^P(S^R_i, S^P_{j.i})$ be the payoff for the poachers when the rangers are in cell $i$
and the poachers are in cell $j.i$, where $S^R_i$ is the $i$th element in $S^R$ and $S^P_{j.i}$ is
the $j$th element of $S^P$ with the action at information state P.$i$. Then the payoffs
are calculated as:
$$u^R(S^R_i, S^P_{j.i}) = \begin{cases} r_i + a_i + 2 & \text{if } i = j.i \\ r_i + a_i - a_{j.i} & \text{if } i \neq j.i \end{cases} \qquad (1)$$
and
$$u^P(S^R_i, S^P_{j.i}) = \begin{cases} p_{j.i} - 2 & \text{if } i = j.i \\ p_{j.i} + a_{j.i} & \text{if } i \neq j.i \end{cases} \qquad (2)$$
where $a_i$ is the number of animals in cell $i$, $r_i$ is the geographic utility of the
rangers in cell $i$ (0 in this example), $p_{j.i}$ is the geographic utility of the poachers
in cell $j.i$, and 2 is the value of an arrest. With the payoffs known, we can define
the game mathematically as $G = \langle A, S, U \rangle$, where
– $A = \{R, P\}$ is the set of agents, with $R$ denoting the rangers and $P$ denoting
the poachers;
– $S = \{S^R, S^P\}$, where $S^R = \{1, 2\}$ is the set of pure-strategies for the rangers
and $S^P = \{(1, 1), (1, 2), (2, 1), (2, 2)\}$ is the set of pure-strategies for the
poachers; and
– $U = \{u^R, u^P\}$ is the set of utility functions, where $u^R, u^P : S^R \times S^P \to \mathbb{R}$
are defined in Eqs. 1 and 2.
We construct the game tree in Fig. 5 to describe the game. The solution to
a pure-strategy Stackelberg Game is given by the Subgame Perfect Nash equi-
librium (SPNE) and is found using backward induction [16]. Figure 5 shows
the three subgames in this game. The SPNE requires that the solution has a
Nash equilibrium (NE) in each subgame, even if it is never reached. The back-
ward induction process for finding the SPNE is presented visually in Fig. 5 with
thick lines representing the optimal strategies for each agent. Let $\hat{S}^R$ denote the
optimal strategy for the rangers and $\hat{S}^P$ the optimal strategy for the poachers.
[Game tree of Fig. 5: the rangers choose cell 1 or cell 2; the poachers respond at
information state P.1 or P.2; the terminal payoffs (rangers, poachers) are (4, −3),
(1, 1), (−1, 1) and (3, −2).]
Fig. 5. Stackelberg game tree and subgames. The rangers have one information state
and the poachers have two information states (P.1 and P.2). The payoffs are shown at
the terminal nodes with the rangers’ payoffs first. There are 3 subgames and the thick
lines indicate the backward induction process for finding the SPNE.
The SPNE is $\hat{S}^R = 1$, $\hat{S}^P = (2, 1)$; thus the rangers go to cell 1 and the poachers
go to cell 2, with payoffs of 1 for the rangers and 1 for the poachers.
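For illustration, the sketch below reproduces the payoffs of this example from Eqs. 1 and 2 and recovers the SPNE by backward induction; the values follow the worked example above.

```python
a = {1: 2, 2: 1}          # rhinos per cell
r = {1: 0, 2: 0}          # rangers' geographic utility per cell
p = {1: -1, 2: 0}         # poachers' geographic utility (cell 1 is 500 m from the border)


def u_rangers(i, j):      # Eq. (1): rangers in cell i, poachers in cell j
    return r[i] + a[i] + 2 if i == j else r[i] + a[i] - a[j]


def u_poachers(i, j):     # Eq. (2)
    return p[j] - 2 if i == j else p[j] + a[j]


# Backward induction: the poachers best-respond at each information state P.i ...
best_response = {i: max((1, 2), key=lambda j: u_poachers(i, j)) for i in (1, 2)}
# ... and the rangers pick the cell that maximises their payoff given that response.
s_r = max((1, 2), key=lambda i: u_rangers(i, best_response[i]))
s_p = (best_response[1], best_response[2])

print("SPNE:", s_r, s_p)                                          # 1 (2, 1)
print("Payoffs:", u_rangers(s_r, best_response[s_r]),
      u_poachers(s_r, best_response[s_r]))                        # 1 1
```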
The game can also be written as the normal form representation in Table 1
and the SPNE is a subset of the NE for the normal form game. The set of all NE for
this normal form game includes the SPNE above, as well as an equilibrium where,
at information state P.2, the poachers execute a mixed-strategy by choosing each
cell with probability 0.5. If the roles were reversed and the rangers were to act
as the follower, then the SPNE is $\hat{S}^R = (1, 2)$, $\hat{S}^P = 2$, so the rangers and the
poachers both go to cell 2 and the payoffs are 3 for the rangers and −2 for
the poachers. This example thus shows a follower’s advantage since both agents
receive a higher payoff when they follow than when they lead. The use of the
SPNE is just for demonstration in this article. In practice, it would be preferred
to solve for an optimal mixed-strategy for the rangers, which is a probability
distribution over their set of pure-strategies. This mixed-strategy could then be
utilised, for example, over the duration of one month in which a random strategy
is selected from this distribution every day.
Table 1. Normal form of the Stackelberg game. The rangers are the row player and have
two pure-strategies. The poachers are the column player and have 4 pure-strategies,
where (i, j) means i at P.1 and j at P.2. The body of the table shows the payoffs at
each combination of their strategies, where the rangers’ payoffs are given first.
          Poachers
Rangers   (1, 1)    (1, 2)    (2, 1)    (2, 2)
1         4, −3     4, −3     1, 1      1, 1
2         −1, 1     3, −2     −1, 1     3, −2
5 Experiments
We perform simulations for the subarea in Fig. 1, using the grid shown in Fig. 2.
The poachers’ entry is at the border cells and their preferences are as described in
Sect. 3.2. The wildlife is set to dislike being 1 km near to camps and prefer being
within 10 km of dams and water. There are 566 allowed cells for each agent,
after excluding geographical obstacles. Simulations are run with 200 moves per
game, 10 games per month, and for 500 months. GAME1 is the uniform game,
with random movement for all agents. GAME2 is the null game, with random
movement for the rangers, intelligent movement for the poachers and improved
animal movement. The poach frequency per day and arrest frequency per day are
the primary measures for assessing the rangers’ performance. We also consider
the average number of moves to make an arrest and the average distance between
the rangers and the poachers when there are no arrests. Since the distributions of
these measures are skewed, we report on the median and calculate the bootstrap
standard error (SE) of the median using 1 000 bootstrap samples. Furthermore,
we do pairwise comparisons of the games for each measure and test the general
hypothesis of identical populations using Mood’s median test [11]. Table 2 shows
the median for each measure, with bootstrap SE of the median in brackets, and
the superscripts indicate where Mood’s median test is non-significant. As can be
expected, the random rangers have poorer performance against the intelligent
poachers than against the random poachers since the poach frequency per day is
significantly higher in GAME2 than in GAME1. Figure 6 shows the trajectories
for a single round of the null game.
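For illustration, the snippet below shows one way to compute the bootstrap standard error of the median and Mood's median test with NumPy and SciPy; the sample values are hypothetical and do not correspond to Table 2.

```python
import numpy as np
from scipy.stats import median_test

rng = np.random.default_rng(0)


def bootstrap_se_median(x, n_boot=1000):
    """Bootstrap standard error of the median, using 1 000 bootstrap samples."""
    medians = [np.median(rng.choice(x, size=len(x), replace=True)) for _ in range(n_boot)]
    return np.std(medians, ddof=1)


# Hypothetical monthly poach frequencies for two games.
game1 = rng.normal(0.067, 0.02, size=500)
game2 = rng.normal(0.083, 0.03, size=500)

print("GAME1 median:", np.median(game1), "SE:", bootstrap_se_median(game1))
print("GAME2 median:", np.median(game2), "SE:", bootstrap_se_median(game2))

# Mood's median test for a pairwise comparison of the two games.
stat, p_value, grand_median, table = median_test(game1, game2)
print("Mood's median test p-value:", p_value)
```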
Fig. 6. Trajectories for a single round of the null game. The lines represent the move-
ments of the wildlife (green, narrow line), the rangers (blue, medium line) and the
poachers (red, thick line). A black * represents a capture event, a black X represents a
poaching event and a black + represents a leaving event. (Color figure online)
In addition to the uniform and null games, four Stackelberg games are simulated:
– GAME3: the rangers as the Stackelberg leader against the intelligent poachers;
– GAME4: the rangers as the Stackelberg leader against the poachers as the
Stackelberg follower;
– GAME5: the rangers as the Stackelberg follower against the intelligent poach-
ers; and
– GAME6: the rangers as the Stackelberg follower against the poachers as the
Stackelberg leader.
Fig. 7. Wildlife density estimates and cell selection weights. The numbers indicate how
many wildlife have been sighted in each cell and the selection weights are depicted by
a colour scale where darker colours represent higher values.
Table 2. Median of each performance measure per game, with the bootstrap SE of the
median in brackets; superscripts indicate the games for which Mood's median test is
non-significant.

        Poach Freq per Day    Arrest Freq per Day   Ave Moves for Arrests   Ave Distance (km) for Non-arrests
GAME1   0.067 (0.006) 4,6     0.067 (0.001) 2,3     78.0 (2.1)              10.4 (0.1)
GAME2   0.083 (0.016) 3,5     0.067 (0.001) 1       67.6 (2.7) 3,4,5        11.8 (0.1) 3,6
GAME3   0.067 (0.016) 2,5     0.067 (0.013) 1       66.8 (2.4) 2,4,5        11.4 (0.2) 2,4,6
GAME4   0.033 (0.015) 1,6     0.167 (0.008) 5       62.8 (2.3) 2,3,5        11.1 (0.2) 3,6
GAME5   0.067 (0.005) 2,3     0.133 (0.016) 4       67.4 (1.2) 2,3,4        9.5 (0.1)
GAME6   0.067 (0.005) 1,4     0.267 (0.016)         49.8 (1.5)              11.6 (0.2) 2,3,4
The results for the Stackelberg games are shown in Table 2. When comparing
GAME3 and GAME5 with GAME2 (null game), we have a direct comparison
of the rangers’ performance since the only difference between the games is the
movement of the rangers. The poach frequency per day is lower in GAME3
and GAME5 than in GAME2, and the arrest frequency per day is significantly
higher in GAME5 than in GAME2. Thus, the rangers perform better than ran-
dom when playing the Stackelberg game as the leader or the follower against
the intelligent poachers. When comparing GAME3 with GAME4, where the
rangers act as the Stackelberg leader, they perform better in GAME4 since the
poach frequency per day is significantly lower and the arrest frequency per day
is significantly higher. Thus, comparison of the rangers’ performance against
the intelligent poachers (GAME3) represents a worse case for the rangers than
against the poachers as the Stackelberg follower (GAME4). This is reasonable
when trying to select the better game since we would not want to have an opti-
mistic estimate of their performance. Similarly, when comparing GAME5 and
GAME6, where the rangers act as the Stackelberg follower, GAME5 against the
intelligent poacher represents a worse case for the rangers than GAME6 against
the poachers as the Stackelberg leader since the arrest frequency per day is much
higher and the average moves for an arrest is much lower in GAME6. Comparing
GAME4 and GAME6, there is no significant difference in poach frequency per
day but GAME6 has a much higher arrest frequency per day and a much lower
average number of moves for arrests. Thus, when both agents act according to
the Stackelberg game, the rangers perform better when acting as the follower
than as the leader.
6 Conclusions
The null game presented in this paper provides a realistic baseline model
for assessing any improvement in the rangers’ performance. Improved ranger
performance is defined as having fewer wildlife poached and more poachers
arrested. The primary performance measures of poach frequency per day and
arrest frequency per day thus directly address the objectives of the rangers. As
expected, the rangers have poorer performance against the intelligent poachers
than against the random poachers in the null game. Implementing the simple
Stackelberg security game shows that even this simple game-theoretic algorithm
results in a significant improvement for the rangers. An Appendix is provided in
Sect. 7 containing a summary of the classes, attributes and simulation parame-
ters required for the null game and the pure-strategy Stackelberg games.
Utilising better geographic data is expected to alleviate the problem with
back and forth movement around cells that are excluded due to geographical
obstacles. The next step would be to incorporate time into the simulation. For
example, the poachers’ re-entry into the park after an arrest could be delayed;
the visibility of the agents could be increased during dry seasons or on full-moon
nights; rivers might be easily crossed during dry seasons; and where
there is dense vegetation or steep mountains the speed of the agents could be
decreased within that region instead of excluding those cells.
Further improvements can be implemented to make the null game more real-
istic. Including multiple groups of rangers, multiple groups of poachers, and
multiple herds of wildlife would be a valuable improvement. Utilising an ani-
mal movement model for different types of wildlife instead of simulating the
movements of the wildlife could also improve the framework considerably. Alter-
natively, we could design routes for each of the wildlife, the poachers, and the
rangers using imaging software to identify sand trails [15], using routes uncov-
ered by poacher tracking [8], or using road segments to define routes [9]. The
routes can then be used for their movement within the framework and as their
set of strategies in the GSG algorithm.
Since the framework is designed to compare and evaluate different wildlife
security games, another task would be to include game-theoretic algorithms dis-
cussed in current research within the framework. We would like to test the idea
of the rangers acting as the Stackelberg follower against the current algorithms.
Observed data can be utilised in a Bayesian network to learn the poachers’
mixed-strategy and the rangers’ mixed-strategy best response can be calculated.
7 Appendix
The evaluation framework is developed in Python 3.8. It
utilises the geopandas [1] library to handle geographical information and the
movingpandas [12] library to store, plot and analyse the trajectories in each
simulation. Table 3 provides a summary of the classes, attributes and simulation
parameters used in the framework. For the uniform game, the move_type is set
to "random" for all agents but for the null game it is set to "intelligent" for
the poachers. The Park class has methods to calculate the grid, the cells on the
edge of the grid, and the cells on the park border. The Agent class has methods
to find the agent’s allowed cells and calculate their geographical selection weights
and utilities. Furthermore, it contains methods to find their start cell, destina-
tion cell and calculate the next cell to move into. The Game class is more useful
for the security games as it contains methods to calculate strategies, payoffs, and
the game solution. For the null game, it just collects the agents and whether the
poachers’ entry point should be a cell on the edge of the grid or on the border of
the park. The Sim class has methods to simulate a single game and to simulate
games for a number of months. Additionally, to evaluate the simulations, there
are methods for calculating the median of the performance measures, bootstrap
standard errors of the median and p-values for Mood’s median test.
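A minimal skeleton of how these classes might fit together is sketched below; the class and method names follow the descriptions above, but the signatures and bodies are assumptions rather than the actual implementation.

```python
class Park:
    """Holds the geographical data and the grid for the park or a subarea."""
    def __init__(self, shapefile, subarea=None, cell_size=1500):
        self.boundary = ...          # loaded with geopandas
        self.grid = self.make_grid(cell_size)

    def make_grid(self, cell_size): ...
    def edge_cells(self): ...
    def border_cells(self): ...


class Agent:
    """Rangers, poachers or wildlife, with movement preferences."""
    def __init__(self, park, move_type="random", excluded=None, stay_within=None,
                 dislike=None, like=None):
        self.allowed_cells = ...     # grid minus excluded cells
        self.weights = ...           # geographical selection weights

    def start_cell(self): ...
    def destination_cell(self): ...
    def next_cell(self, current): ...


class Game:
    """Collects the agents; for security games, computes strategies, payoffs and the solution."""
    def __init__(self, agents, poacher_entry="border"): ...
    def solve(self): ...


class Sim:
    """Simulates single games and repeated monthly games, and evaluates the measures."""
    def __init__(self, game, moves_per_game=200, games_per_month=10, months=500): ...
    def run_game(self): ...
    def run_months(self): ...
    def medians(self): ...
    def bootstrap_se(self): ...
    def mood_median_test(self): ...
```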
References
1. GeoPandas 0.7.0. https://geopandas.org/
2. SANParks Data Repository. http://dataknp.sanparks.org/sanparks/
3. Minister Molewa highlights progress on Integrated Strategic Management
of Rhinoceros (2017). https://www.environment.gov.za/mediarelease/molewa
progressonintegrated strategicmanagement ofrhinoceros
4. Department of Environment, Forestry and Fisheries report back on rhino poaching
in South Africa in 2019 (2020). https://www.environment.gov.za/mediarelease/
reportbackon2019 rhinopoachingstatistics
5. An, B., Tambe, M.: Stackelberg security games (SSG) basics and application
overview. In: Abbas, A.E., Tambe, M., von Winterfeldt, D. (eds.) Improving Home-
land Security Decisions, pp. 485–507. Cambridge University Press, Cambridge
(2017). https://doi.org/10.1017/9781316676714.021
6. Conitzer, V., Sandholm, T.: Computing the optimal strategy to commit to. In: Pro-
ceedings of the 7th ACM Conference on Electronic Commerce, pp. 82–90 (2006).
https://dl.acm.org/doi/10.1145/1134707.1134717
7. Cournot, A.A.: Researches Into the Mathematical Principles of the Theory
of Wealth. Macmillan, New York (1897). https://www3.nd.edu/∼tgresik/IO/
Cournot.pdf
8. De Oude, P., Pavlin, G., De Villiers, J.P.: High-level tracking using bayesian context
fusion. In: FUSION 2018, 21st International Conference on Information Fusion, pp.
1415–1422. IEEE (2018). https://doi.org/10.23919/ICIF.2018.8455342
9. Fang, F., et al.: Deploying PAWS: field optimization of the protection assistant
for wildlife security. In: AAAI 2016, Proceedings of the 30th AAAI Conference on
Artificial Intelligence, pp. 3966–3973. AAAI Press (2016). https://dl.acm.org/doi/
10.5555/3016387.3016464
10. Fang, F., Stone, P., Tambe, M.: When security games go green: designing defender
strategies to prevent poaching and illegal fishing. In: IJCAI 2015, Proceedings of
the 24th International Joint Conference on Artificial Intelligence, pp. 2589–2595.
AAAI Press (2015). https://dl.acm.org/doi/10.5555/2832581.2832611
11. Gibbons, J.D., Chakraborti, S.: Nonparametric Statistical Inference, 4th edn. CRC
Press, Boca Raton (2003). https://doi.org/10.4324/9780203911563
12. Graser, A.: MovingPandas: efficient structures for movement data in Python. J.
Geogr. Inf. Sci. 7(1), 54–68 (2019). https://doi.org/10.1553/giscience2019 01 s54
13. Kar, D., Fang, F., Fave, F.D., Sintov, N., Tambe, M.: “A game of thrones”:
When human behavior models compete in repeated stackelberg security games. In:
AAMAS 2015, Proceedings of the 14th International Conference on Autonomous
Agents and Multiagent Systems, pp. 1381–1390. IFAAMAS (2015). https://dl.acm.
org/doi/10.5555/2772879.2773329
14. Kirkland, L., de Waal, A., de Villiers, J.P.: Simulating null games for uncertainty
evaluation in green security games. In: FUSION 2019, 22nd International Con-
ference on Information Fusion, pp. 1–8. IEEE (2019). https://ieeexplore.ieee.org/
document/9011280
DNNs Predicting Trends in Time Series Data
1 Introduction
With the advent of low cost sensors and digital transformation, time series data
is being generated at an unprecedented speed and volume in a wide range of
applications in almost every domain. For example, stock market fluctuations,
computer cluster traces, medical and biological experimental observations, sensor
network readings, etc., are all represented as time series. Consequently, there is
an enormous interest in analyzing time series data, which has resulted in a large
number of studies on new methodologies for indexing, classifying, clustering,
summarizing, and predicting time series data [10,11,16,23,25].
In certain time series prediction applications, segmenting the time series into
a sequence of trends and predicting the slope and duration of the next trend is
preferred over predicting just the next value in the series [16,23]. Piecewise lin-
ear representation [10] or trend lines can provide a better representation for the
underlying semantics and dynamics of the generating process of a non-stationary
and dynamic time series [16,23]. Moreover, trend lines are a more natural repre-
sentation for predicting change points in the data, which may be more interesting
to decision makers. For example, suppose a share price in the stock market is
currently rising. A trader in the stock market would ask “How long will it take
and at what price will the share price peak and when will the price start drop-
ping?” Another example application is for predicting daily household electricity
consumption. Here the user may be more interested in identifying the time, scale
and duration of peak or low energy consumption.
While deep neural networks (DNNs) have been widely applied to computer
vision, natural language processing (NLP) and speech recognition, there is lim-
ited research on applying DNNs for time series prediction. In 2017, Lin et al. [16]
proposed a novel approach to directly predict the next trend of a time series as a
piecewise linear approximation (trend line) with a slope and a duration using a
hybrid neural network approach, called TreNet. The authors showed that TreNet
outperformed SVR, CNN, LSTM, pHMM [23], and cascaded CNN and RNN.
However, the study had certain limitations.
Inadequacy of Cross-validation: The study used standard cross-validation
with random shuffling. This implies that data instances, which are generated
after a given validation set, are used for training [1].
No Model Update: In real world applications where systems are often
dynamic, models become outdated and must be updated as new data becomes
available. TreNet’s test error was estimated on a single hold-out set, which
assumes that the system under consideration is static. TreNet’s evaluation there-
fore does not provide a sufficiently robust performance measure for datasets that
are erratic and non-stationary [1].
No Evaluation of Model Stability: DNNs, as a result of random initialisation
and possibly other random parameter settings, could yield substantially different
results when re-run with the same hyperparameter values on the same dataset.
Thus, it is crucial that the best DNN configurations should be stable, i.e. have
minimal deviation from the mean test loss across multiple runs. There is no
evidence that this was done for TreNet.
Missing Implementation Details: Important implementation details in the
TreNet study are not stated explicitly. For instance, the segmentation method
used to transform the raw time series into trend lines is not apparent. This
questions the reproducibility of TreNet’s study.
This paper attempts to address these shortcomings. Our research questions
are:
1. Does a hybrid deep neural network approach for trend prediction perform
better than vanilla deep neural networks?
2. Do deep neural network models perform better for trend prediction than
simpler traditional machine learning (ML) models?
3. Does the addition of trend line features improve performance over local raw
data features alone?
The remainder of the paper is structured as follows. We first provide a brief
background of the problem and a summary of related work, followed by the
experimental design. We then give a brief overview of the experiments, describe
the experiments, present and discuss their results. Finally, we provide a summary
and discussion of the key findings.
and using a standard cross validation approach does not take into account the
sequential nature of time series data and may give erroneous results [16]. A walk-
forward validation with successive and overlapping partitioning (see Sect. 3.4)
is better suited for evaluating and comparing model performance on time series
data [18]. It maintains the order of a time series sequence and deals with changes
in its properties over time [18]. To deal with this limitation we attempt to repli-
cate the TreNet approach using a walk forward validation instead of random
shuffling and cross validation.
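A minimal sketch of such a walk-forward scheme with successive, overlapping partitions is shown below; the number of folds and the test fraction are illustrative, and the authors' Sect. 3.4 setup may differ.

```python
def walk_forward_splits(n, n_folds=5, test_frac=0.2):
    """Successive, overlapping train/test partitions that preserve temporal order."""
    fold_end = n // n_folds
    splits = []
    for k in range(1, n_folds + 1):
        end = fold_end * k                           # each partition extends the previous one
        test_size = int(end * test_frac)
        train_idx = range(0, end - test_size)
        test_idx = range(end - test_size, end)       # test data always follows training data
        splits.append((train_idx, test_idx))
    return splits


for train_idx, test_idx in walk_forward_splits(1000):
    print(f"train [0, {train_idx[-1]}]  ->  test [{test_idx[0]}, {test_idx[-1]}]")
```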
Some follow-up research to TreNet added attention mechanisms [5,27]; however,
it does not deal with trend prediction specifically. Another active field related to
trend prediction is stock market movement prediction, which is only concerned with
the direction of the time series; it does not predict the strength or duration of
trends [5,6,9,17,20,24]. Generally, the baseline meth-
ods used by prior work include neural networks, the naive last value prediction,
ARIMA, SVR [16,26]. They do not include ensemble methods such as random
forests, which are widely used particularly for stock market movement prediction
[13,22].
3 Experimental Design
3.1 Datasets
Experiments were conducted on the four different datasets described below.
1. The voltage dataset from the UCI machine learning repository1 . It contains
2075259 data points of household voltage measurements at one-minute
intervals. It is highly volatile but normally distributed. It follows the same
pattern every year, according to the weather seasons as shown in Fig. 4 in
the appendix. It corresponds to the power consumption dataset used by Lin
et al. [16].
2. The methane dataset from the UCI machine learning repository2 . We used a
resampled set of 41786 points at a frequency of 1 Hz. The methane dataset is
skewed to the right of its mean value and exhibits very sharp changes with
medium to low volatility as shown in Fig. 5 in the appendix. It corresponds
to the gas sensor dataset used by Lin et al. [16].
3. The NYSE dataset from Yahoo finance3. It contains 13563 data points of the
composite New York Stock Exchange (NYSE) closing price from 31-12-1965
to 15-11-2019. Its volatility is very low until around the year 2000, after
which it becomes very volatile. It is skewed to the right, as shown in
Fig. 6 in the appendix. It corresponds to the stock market dataset used by
Lin et al. [16].
1 https://archive.ics.uci.edu/ml/datasets/individual+household+electric+power+consumption.
2 https://archive.ics.uci.edu/ml/datasets/gas+sensor+array+under+dynamic+gas+mixtures.
3 https://finance.yahoo.com.
4. The JSE dataset from Yahoo finance. It contains 3094 data points of the com-
posite Johannesburg Stock Exchange (JSE) closing price from 2007-09-18 to
2019-12-31. Compared to the NYSE, this stock market dataset is less volatile
and shows a symmetrical distribution around its mean value. However, it has
a flat top and heavy tails on both sides as shown in Fig. 7 in the appendix.
The data preprocessing consists of three operations: missing data imputation,
data segmentation, and the sliding window operation. Each missing data point
is replaced with the closest preceding non-missing value. The segmentation of
the time series into trend lines, i.e. piecewise linear approximations, is done by
regression using the bottom-up approach, similar to the approach used by Wang
et al. [23]. The data instances, i.e. the input-output pairs, are formed using a slid-
ing window. The input features are the local data points L_k = <x_{t_k−w}, ..., x_{t_k}>
for the current trend T_k = <s_k, l_k> at the current time t. The window size w is
determined by the duration of the first trend line. The output is the next trend
T_{k+1} = <s_{k+1}, l_{k+1}>. The statistics of the segmented datasets are provided in
Table 7 in the appendix.
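Since the exact segmentation routine is not given, the sketch below illustrates only the sliding-window step: it assumes a pre-computed list of trends, represented here as hypothetical (start_index, slope, duration) tuples, and builds the input-output pairs described above.

```python
import numpy as np

def make_instances(series, trends):
    """Build (local raw points, next trend) pairs with a sliding window.

    `series` is the raw time series; `trends` is a list of
    (start_index, slope, duration) tuples from a prior segmentation step
    (the segmentation itself is not shown here).
    """
    w = int(trends[0][2])                          # window size = duration of first trend
    X, y = [], []
    for k in range(len(trends) - 1):
        t_end = int(trends[k][0] + trends[k][2])   # end of the current trend
        if t_end - w < 0:
            continue
        X.append(series[t_end - w:t_end])          # local raw data L_k
        y.append(trends[k + 1][1:3])               # next trend's slope and duration
    return np.array(X), np.array(y)
```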
The performance of seven ML algorithms was evaluated: the hybrid TreNet
approach, three vanilla DNN algorithms, and three traditional ML algorithms.
These algorithms are described below.
TreNet: TreNet is a hybrid of a CNN, which takes in raw point data, and an
LSTM, which takes in trend lines, as shown in Fig. 1. The LSTM consists of a single
LSTM layer, and the CNN is composed of two stacked 1D convolutional
layers without a pooling layer [16]. The second CNN layer is followed by a
ReLU activation function. The flattened output of the CNN's ReLU layer and the
output of the LSTM layer are each projected to the same dimension using a fully connected
layer for the fusion operation. The fusion layer consists of a fully connected layer
that takes the element-wise addition of the projected outputs of the CNN and
LSTM components as its input, and outputs the slope and duration values. A
dropout layer is added to the layer before the output layer. The best TreNet
hyperparameters for each dataset are shown in Table 9 in the appendix and
compared to Lin et al.'s [16].
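The sketch below is our own PyTorch reconstruction of this hybrid structure from the description above; the layer sizes are placeholders rather than the tuned values in Table 9.

```python
import torch
import torch.nn as nn

class HybridTrendNet(nn.Module):
    """CNN over local raw points + LSTM over trend lines, fused by addition."""

    def __init__(self, window, trend_feats=2, filters=32, lstm_cells=64,
                 fusion_dim=128, dropout=0.1):
        super().__init__()
        # Two stacked 1D convolutions without pooling; ReLU after the second.
        self.cnn = nn.Sequential(
            nn.Conv1d(1, filters, kernel_size=3, padding=1),
            nn.Conv1d(filters, filters, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.lstm = nn.LSTM(trend_feats, lstm_cells, batch_first=True)
        # Project both branches to the same dimension before fusion.
        self.proj_cnn = nn.Linear(filters * window, fusion_dim)
        self.proj_lstm = nn.Linear(lstm_cells, fusion_dim)
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(fusion_dim, 2)   # slope and duration

    def forward(self, raw, trends):
        # raw: (batch, window), trends: (batch, seq_len, trend_feats)
        c = self.cnn(raw.unsqueeze(1)).flatten(1)
        h, _ = self.lstm(trends)
        fused = self.proj_cnn(c) + self.proj_lstm(h[:, -1])  # element-wise addition
        return self.out(self.dropout(fused))
```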
The parameters of the vanilla DNN algorithms were tuned manually. The best
values found for each algorithm on each dataset are shown in Table 10 in the
appendix.
DNN Algorithm Training, and Initialisation: The equally weighted average slope
and duration mean square error (MSE) is used as a loss function during training
with the Adam optimizer [12]. To ensure robustness against random initiali-
sation, the DNNs are initialised using the He initialisation technique [8] with
normal distribution, fan-in mode, and a ReLU activation function.
During model update, the networks are initialised using the weights of the most recent model.
This makes the training of the network faster without compromising its gener-
alisation ability. More details about this technique which we refer to as model
update with warm-start is given in Sect. A.2 in the appendix. The average root
mean square error (RMSE), given in Eq. 1, is used as the evaluation metric.
[Figure: walk-forward evaluation — an initial training phase followed by successive model updates, each with its own training, validation, and test partitions.]
RMSE = sqrt((1/T) Σ_{t=1}^{T} (y_t − ŷ_t)²)    (1)
where y_t → actual next trend, ŷ_t → predicted next trend, and T → number of
data instances. For the DNN algorithms each experiment is run 10 times and the
mean and the standard deviation across the 10 runs are reported. This provides a
measure of the stability of the DNN configuration using different random seeds.
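A small sketch of this evaluation loop (our own, with a hypothetical train_and_predict helper) is given below; the mean and deviation of the average slope/duration RMSE across 10 seeded runs are what the later tables report.

```python
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def evaluate_stability(train_and_predict, X, y, runs=10):
    """Re-run a model with different random seeds and report mean/std RMSE.

    `train_and_predict(X, y, seed)` is a hypothetical helper that trains a
    model with the given seed and returns predictions for the test targets,
    where column 0 is the slope and column 1 the duration.
    """
    scores = []
    for seed in range(runs):
        y_pred = train_and_predict(X, y, seed)
        slope_rmse = rmse(y[:, 0], y_pred[:, 0])
        duration_rmse = rmse(y[:, 1], y_pred[:, 1])
        scores.append((slope_rmse + duration_rmse) / 2)  # average RMSE
    return np.mean(scores), np.std(scores)
```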
4 Experiments
We performed four experiments; each with four datasets. In experiment 1, we
implement and evaluate a TreNet [16]. TreNet uses a hybrid deep learning struc-
ture, that combines both a CNN and an LSTM, and takes in a combination of
raw data points and trend lines as its input. In experiment 2, we compared the
TreNet results with the performance of vanilla MLP, CNN and LSTM struc-
tures on raw point data to analyse the performance improvement when using a
hybrid approach with trend lines. In experiment 3, we evaluate the performance
of three traditional ML techniques, i.e. SVR, RF, and GBM on raw point data
to analyse the performance difference between DNN and non-DNN approaches.
In experiment 4, we supplement the raw data features with trend lines features
to evaluate the performance improvement over the raw data features alone for
both DNN and non-DNN algorithms.
Table 2. Comparison of the slope (S), duration (D), and average (A) RMSE values
achieved by our hybrid neural network and those reported by Lin et al. The
percentage improvement (% improv.) over the naive LVM is also shown.
In order to compare our results with the original TreNet we use a similar per-
formance measure to Lin et al. [16]. We measure the percentage improvement
over a naive last value model (LVM). The naive last value model simply “takes
the duration and slope of the last trend as the prediction for the next one” [16].
The use of a relative metric makes comparison easier, since the RMSE is scale-
dependent, and the trend lines generated in this study may differ from Lin et al.’s
[16]. Lin et al. [16] did not provide details of the segmentation method they used
in their paper. Furthermore, the naive last value model does not require any
hyper-parameter tuning, its predictions are stable and repeatable, i.e. they do not
differ when the experiment is rerun, and it depends only on the characteristics
of the dataset.
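The sketch below (our illustration, assuming trends are given as (slope, duration) pairs) shows the LVM baseline and the relative metric used in Table 2.

```python
import numpy as np

def last_value_predictions(trends):
    """Naive LVM: predict the next trend's (slope, duration) as the last one's."""
    trends = np.asarray(trends)
    return trends[:-1], trends[1:]        # predictions, actual next trends

def pct_improvement(model_rmse, lvm_rmse):
    """Percentage improvement of a model over the naive last value model."""
    return 100.0 * (lvm_rmse - model_rmse) / lvm_rmse
```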
Table 2 shows the performance improvement on RMSE values over the
LVM achieved by the TreNet implementation on each dataset. They are com-
pared to the performance of the original TreNet on the three datasets they
used in their experiments, i.e. the voltage, methane and NYSE datasets. The
results of our experiment differ substantially from those reported for the origi-
nal TreNet. Our TreNet models’ percentage improvement over the naive LVM
is 13.25 (74.58/5.63) and 1.27 (30.89/24.27) times greater than Lin et al.’s
[16], on the methane and NYSE datasets respectively; but 1.19 (36.71/27.90)
times smaller on the voltage dataset. The naive LVM performs better than our
TreNet model on the NYSE for the duration prediction. The –272.73 % decrease
in performance is due to two reasons. First, the model training, i.e. the loss
minimisation, was biased towards the slope loss at the expense of the duration
loss: the slope loss is significantly greater than the duration loss, yet TreNet's
loss function weights both equally. Second, the durations of the trends in the
NYSE dataset are very similar, with a standard deviation of 0.81, which makes
the last value prediction model highly competitive for the duration prediction.
The greater average improvement on the methane and NYSE is attributed
to the use of the walk-forward evaluation procedure. The methane and NYSE
datasets undergo various changes in the generating process because of the sudden
changes in methane concentrations and the economic cycles for the NYSE. Thus,
the use of the walk-forward evaluation ensures that the most recent and useful
training set is used for a given validation/test set. However, given that Lin et al.
[16] did not drop older data from the training data set, the network may learn
long-range relationships that are not useful for the current test set. Furthermore,
they used random shuffling, which most likely resulted in future data points
being included in the training data. The smaller improvement of our TreNet
model on the voltage dataset can be attributed to our use of a smaller window
size for the local raw data fed into the CNN. We used 19 compared to their
best value of 700 on the voltage dataset. This is one of the limitations of our
replication of TreNet. For each dataset, we used the length of the first trend line
as window size of the local raw data feature fed into the CNN, instead of tuning
it to select the best value. The other limitation is the use of a sampled version
of the methane dataset instead of the complete methane dataset.
Given that we are now using a different validation method, which yields differ-
ent performance scores from the original TreNet, we checked whether the TreNet
approach still outperforms the vanilla DNN algorithms. We implemented and
tested three vanilla DNN models namely a MLP, LSTM, and CNN using only
raw local data features.
Table 3 shows the average RMSE values for slope and trend predictions
achieved by the vanilla DNNs and TreNet on each dataset across 10 independent
runs. The deviation across the 10 runs is also shown to provide an indication
of the stability of the model across the runs. We use the average slope and
duration RMSE values as an overall comparison metric. The % improvement is
the improvement of the best vanilla DNN model over TreNet. The best model
is chosen based on the overall comparison metric.
In general TreNet still performs better than the vanilla DNN models, but does
not outperform the vanilla models on all the datasets. The most noticeable case is
on the NYSE, where the LSTM model outperforms the TreNet model on both the
slope and duration prediction. This contradicts Lin et al. [16]’s findings, where
TreNet clearly outperforms all other models including LSTM. On average, Lin et
al.’s [16] TreNet model outperformed their LSTM model by 22.48%; whereas, our
TreNet implementation underperformed our LSTM model by 1.31%. However,
Lin et al. [16]'s LSTM model appears to have been trained using trend lines only and
not raw point data, whereas our LSTM model uses local raw data features. It must also
be noted that the validation method used here is substantially different from the
one used by Lin et al. [16]. The large performance difference between TreNet and
the vanilla models on the methane dataset is because for this dataset the raw
local data features do not provide the global information about the time series
since it is non-stationary. This is confirmed by the increase in the performance
of the MLP (23.83%), LSTM (11.02%) and CNN (24.05%) after supplementing
the raw data features with trend line features (see experiment 4 in Sect. 4.4).
Table 3. Comparison of the RMSE values achieved by the vanilla DNN models and
TreNet. The % improvement (% improv.) is the improvement of the best vanilla DNN
model over TreNet
Voltage Methane
Slope Duration Average Slope Duration Average
MLP 9.04 ± 0.06 62.82 ± 0.04 35.93 ± 0.05 14.57 ± 0.10 49.79 ± 4.85 32.18 ± 2.48
LSTM 10.30 ± 0.0 62.87 ± 0.0 36.59 ± 0.0 14.21 ± 0.19 56.37 ± 1.77 35.29 ± 0.49
CNN 9.24 ± 0.10 62.40 ± 0.13 35.82 ± 0.12 15.07 ± 0.35 54.79 ± 4.55 34.93 ± 2.45
TreNet 9.25 ± 0.0 62.37 ± 0.01 35.81 ± 0.01 14.87 ± 0.40 31.25 ± 2.62 23.06 ± 1.51
% improv. −0.11 −0.05 −0.03 2.02 −59.33 −39.55
NYSE JSE
Slope Duration Average Slope Duration Average
MLP 90.76 ± 4.43 33.08 ± 42.08 61.92 ± 23.26 19.87 ± 0.01 12.51 ± 0.09 16.19 ± 0.05
LSTM 86.56 ± 0.01 0.41 ± 0.08 43.49 ± 0.05 19.83 ± 0.01 12.68 ± 0.01 16.25 ± 0.01
CNN 89.31 ± 1.38 12.21 ± 12.17 50.76 ± 6.78 19.90 ± 0.06 12.48 ± 0.21 16.19 ± 0.14
TreNet 86.89 ± 0.14 1.23 ± 0.38 44.06 ± 0.26 19.65 ± 0.05 12.49 ± 0.04 16.07 ± 0.05
% improv. 0.38 66.67 1.29 −1.12 −0.16 −0.75
Table 4. Comparison of the best DNN models (Best DNN) with the traditional ML
algorithms. The % improvement (% improv.) is the performance improvement of the
best traditional ML model over the best DNN model
Voltage Methane
Slope Duration Average Slope Duration Average
RF 9.53 ± 0.0 63.11 ± 0.20 36.32 ± 0.10 10.09 ± 0.01 20.79 ± 0.01 15.44 ± 0.01
GBM 10.0 ± 0.0 62.67 ± 0.0 36.34 ± 0.0 13.05 ± 0.0 75.10 ± 0.0 44.08 ± 0.0
SVR 9.32 ± 0.0 62.58 ± 0.0 35.95 ± 0.0 14.98 ± 0.0 34.39 ± 0.0 24.69 ± 0.0
Best DNN 9.25 ± 0.0 62.37 ± 0.01 35.81 ± 0.01 14.87 ± 0.40 31.25 ± 2.62 23.06 ± 1.51
% improv. −0.76 −0.34 −0.47 32.15 33.47 33.04
NYSE JSE
Slope Duration Average Slope Duration Average
RF 88.75 ± 0.17 0.29 ± 0.0 44.52 ± 0.09 20.21 ± 0.0 12.67 ± 0.0 16.44 ± 0.0
GBM 86.62 ± 0.0 0.42 ± 0.0 43.52 ± 0.0 20.08 ± 0.0 12.62 ± 0.0 16.35 ± 0.0
SVR 86.55 ± 0.0 0.42 ± 0.0 43.49 ± 0.0 20.01 ± 0.0 12.85 ± 0.0 16.43 ± 0.0
Best DNN 86.56 ± 0.01 0.41 ± 0.08 43.49 ± 0.05 19.65 ± 0.05 12.49 ± 0.04 16.07 ± 0.05
% improv. 0.01 2.44 0.0 −2.19 −1.04 −1.74
The fact that the radial-based SVR performed better than TreNet on the
NYSE dataset contradicts Lin et al. [16]’s results. We attribute this to the use
of local raw data features alone, instead of local raw data plus trend line features
used by Lin et al. [16].
In this experiment, we supplement the raw data with trend line features to
analyse whether this yields any performance improvement to the DNN and non-
DNN models from Experiments 2 and 3. We retained the hyperparameter
values found using the raw data features alone for this experiment.
Table 5 shows the average performance improvement (%) after supplement-
ing the raw data with trend line features. The negative sign indicates a drop in
performance. The Average is the mean and the standard error of the improve-
ments over the algorithm or the dataset. The actual RMSE values are shown in
Table 12 and Table 13 in the appendix.
Table 5. Performance improvement after supplementing the raw data with trend line
features.
The addition of trend line features improved the performance of both DNN
and non-DNN models in 10 out of 24 cases. In general, it improves per-
formance on dynamic and non-stationary time series such as the methane and
NYSE datasets. This is because local raw data features do not capture the global
information about the time series for non-stationary time series. Thus, the addi-
tion of trend line features brings new information to the models. In 12 out of 24
cases, the addition of trend line features reduced the performance of both DNN
and non-DNN models, except the GBM models. For these cases, the addition of
trend line features brings noise or duplicate information, which the models did
not deal with successfully. This may be because the best hyperparameters for
the raw data features alone may not be optimal for the raw data and the trend
line features combined. Although DNN models are generally able to extract
the true signal from noisy or duplicate input features, they are sensitive
to the hyperparameter values.
The above results show that the addition of trend line features has the poten-
tial to improve the performance of both DNN and non-DNN models on non-
stationary time series. This comes at the cost of additional complexities and
restrictions. The first complexity is related to the model complexity because the
bigger the input feature size, the more complex the model becomes. Secondly, the
trend line features require the segmentation of the time series into trends, which
brings new challenges and restrictions during inference. For instance, trend pre-
diction applications that require online inference need an online segmentation
method such as the one proposed by Keogh et al. [10]. It is therefore necessary
to evaluate whether the performance gain over raw data features alone justifies
these complexities and restrictions.
Table 6 provides a summary of the best models and their average performance
from all four experiments. The TreNet algorithm outperforms the non-hybrid
algorithms on the voltage and JSE datasets, but the performance difference
is marginal (< 1%). Interestingly, the traditional ML algorithms outperformed
TreNet and the vanilla DNN algorithms on the methane and NYSE datasets.
The addition of trend lines to the point data (experiment 4) did not yield
any substantial change in the results. It must be noted though that this was an
exploratory experiment and that no hyper-parameter optimisation was done to
cater for the introduction of a new input feature. It may well be the case that
better models could be found if a new hyper-parameter optimisation process
were undertaken.
It is clear from these results that TreNet generally performs well on most
datasets. However, it is not the clear winner, and there are some datasets where
traditional models can substantially outperform TreNet. It is also clear that
models built with point data alone can generally reach the performance levels of
TreNet.
Table 6. Average RMSE values (E) achieved by the hybrid algorithm, i.e. TreNet; and
the best non-hybrid algorithm (A) with raw point data features alone (Pt) and with
raw point data plus trend line features (Pt + Tr). The % change is with respect to the
TreNet algorithm.
A Appendix
A.1 Datasets
Fig. 4. Top - The individual household voltage dataset. Bottom - Probability distri-
bution of the voltage dataset.
Table 7. Summary of the basic statistics of the segmented datasets and the input
vector size per feature type.
Fig. 5. Top - Methane concentration in air over time. Bottom - Probability distribution
of the methane dataset.
Fig. 6. Top - The composite New York Stock Exchange (NYSE) closing price dataset.
Bottom - Probability distribution of the NYSE dataset.
Fig. 7. Top - Composite Johannesburg Stock Exchange (JSE) closing price dataset.
Bottom - Probability distribution of the JSE dataset.
That is, during model update, the new network is initialised with the weights
of the previous model. In effect, the patterns learnt by the previous network
are transferred to the new model, therefore, reducing the number of epochs
required to learn the new best function. In practice, the walk-forward evaluation
with warm start corresponds to performing the first training with the maximum
number of epochs required to converge, then using a fraction of this number
for every other update. This fraction - between 0.0 and 1.0 - becomes an addi-
tional hyperparameter dubbed warm start. The lowest value that out-performed
the model update without warm-start is used as the best value, because this
technique is essentially used to speed-up the model updates.
The speed-up, i.e. the expected reduction factor in the total number of epochs,
can be computed in advance using Eq. 4, which is derived from Eq. 2 and
Eq. 3.
E_w = E + E × (S − 1) × ω    (2)
E_w = E × (1 + (S − 1) × ω)    (3)
speed-up = (E × S) / E_w = S / (1 + (S − 1) × ω)    (4)
where E_w → total epochs with warm start, E → epochs per split without warm-
start, S → number of data partition splits, and ω → warm-start fraction.
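As a quick numerical check of Eq. 4 (our own sketch, with example values):

```python
def warm_start_speedup(splits, warm_frac):
    """Expected reduction factor in total epochs (Eq. 4).

    splits    -- number of walk-forward data partition splits (S)
    warm_frac -- warm-start fraction of epochs used per update (omega)
    """
    return splits / (1 + (splits - 1) * warm_frac)

# e.g. 10 splits with a warm-start fraction of 0.2 train roughly 3.6x faster
print(warm_start_speedup(10, 0.2))   # ~3.57
```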
Table 9. Best TreNet hyperparameters for each dataset, compared to Lin et al. [16]
Dropout L2 LR LSTM cells CNN filters Fusion layer Batch Size Epochs Warm start
Voltage 0.0 5e−4 1e−3 [600] [16, 16] 300 2000 100 0.2
Methane 0.0 5e−4 1e−3 [1500] [4, 4] 1200 2000 2000 0.1
NYSE 0.0 0.0 1e−3 [600] [128, 128] 300 5000 100 0.5
JSE 0.0 0.0 1e−3 [5] [32, 32] 10 500 100 0.05
Lin et al. [16] 0.5 5e−4 ? [600] [32, 32] From S ? ? N/A
Table 10. Hyperparameters optimised for the vanilla DNN algorithms and their best
values found for each dataset
Table 12. Performance of vanilla DNN algorithms on raw data alone and raw data
and trend line features.
MLP
Voltage Methane
Slope Duration Average Slope Duration Average
Raw data 9.04 ± 0.06 62.82 ± 0.04 35.93 ± 0.05 14.57 ± 0.10 49.79 ± 4.85 32.18 ± 2.47
Raw data + Trend lines 9.03 ± 0.06 62.81 ± 0.04 35.92 ± 0.05 14.56 ± 0.19 34.46 ± 2.79 24.51 ± 1.49
NYSE JSE
Slope Duration Average Slope Duration Average
Raw data 90.76 ± 4.43 33.08 ± 42.08 61.92 ± 23.26 19.87 ± 0.01 12.51 ± 0.09 16.19 ± 0.05
Raw data + Trend lines 90.45 ± 2.55 25.34 ± 24.09 57.90 ± 13.32 21.13 ± 0.30 12.59 ± 0.14 16.86 ± 0.22
LSTM
Voltage Methane
Slope Duration Average Slope Duration Average
Raw data 10.30 ± 0.0 62.87 ± 0.0 36.59 ± 0.0 14.21 ± 0.19 56.37 ± 1.77 35.29 ± 0.68
Raw data + Trend lines 10.30 ± 0.0 62.87 ± 0.0 36.59 ± 0.0 14.77 ± 0.51 48.03 ± 5.74 31.40 ± 3.13
NYSE JSE
Slope Duration Average Slope Duration Average
Raw data 86.56 ± 0.01 0.41 ± 0.08 43.49 ± 0.05 19.83 ± 0.01 12.68 ± 0.01 16.26 ± 0.01
Raw data + Trend lines 86.50 ± 0.01 0.47 ± 0.03 43.49 ± 0.02 20.16 ± 0.03 12.74 ± 0.02 16.45 ± 0.03
CNN
Voltage Methane
Slope Duration Average Slope Duration Average
Raw data 9.24 ± 0.10 62.40 ± 0.13 35.82 ± 0.12 15.07 ± 0.35 54.79 ± 4.55 34.93 ± 2.45
Raw data + Trend lines 33.26 ± 19.41 90.78 ± 53.17 62.02 ± 36.29 15.14 ± 0.28 37.92 ± 4.11 26.53 ± 2.20
NYSE JSE
Slope Duration Average Slope Duration Average
Raw data 89.31 ± 1.38 12.21 ± 12.17 50.76 ± 6.78 19.90 ± 0.06 12.48 ± 0.21 16.19 ± 0.14
Raw data + Trend lines 90.44 ± 1.74 14.05 ± 9.52 52.25 ± 5.63 21.41 ± 0.33 12.71 ± 0.15 17.06 ± 0.24
Table 13. Performance of traditional ML algorithms on raw data alone and raw data
and trend line features.
RF
Voltage Methane
Slope Duration Average Slope Duration Average
Local raw data 9.53 ± 0.0 63.11 ± 0.20 36.32 ± 0.10 10.09 ± 0.01 20.79 ± 0.01 15.44 ± 0.01
Local raw data + Trend lines 9.35 ± 0.0 63.19 ± 0.29 36.27 ± 0.15 11.53 ± 0.0 20.73 ± 0.01 16.13 ± 0.01
NYSE JSE
Slope Duration Average Slope Duration Average
Local raw data 88.75 ± 0.17 0.29 ± 0.0 44.52 ± 0.09 20.21 ± 0.0 12.67 ± 0.0 16.44 ± 0.0
Local raw data + Trend lines 86.53 ± 0.01 0.41 ± 0.0 43.47 ± 0.01 22.68 ± 0.0 12.69 ± 0.0 17.69 ± 0.0
GBM
Voltage Methane
Slope Duration Average Slope Duration Average
Local raw data 10.0 ± 0.0 62.67 ± 0.0 36.34 ± 0.0 13.05 ± 0.0 75.10 ± 0.0 44.08 ± 0.0
Local raw data + Trend lines 10.01 ± 0.0 62.63 ± 0.0 36.32 ± 0.0 12.02 ± 0.0 38.34 ± 0.0 25.18 ± 0.0
NYSE JSE
Slope Duration Average Slope Duration Average
Local raw data 86.62 ± 0.0 0.42 ± 0.0 43.52 ± 0.0 20.08 ± 0.0 12.62 ± 0.0 16.35 ± 0.0
Local raw data + Trend lines 86.42 ± 0.0 0.41 ± 0.0 43.42 ± 0.0 19.93 ± 0.0 12.65 ± 0.0 16.29 ± 0.0
SVR
Voltage Methane
Slope Duration Average Slope Duration Average
Raw data 9.32 ± 0.0 62.58 ± 0.0 35.95 ± 0.0 14.98 ± 0.0 34.39 ± 0.0 24.69 ± 0.0
Raw data + Trend lines 9.54 ± 0.0 62.62 ± 0.0 36.08 ± 0.0 17.95 ± 0.0 34.52 ± 0.0 26.24 ± 0.0
NYSE JSE
Slope Duration Average Slope Duration Average
Raw data 86.55 ± 0.0 0.42 ± 0.0 43.49 ± 0.0 20.01 ± 0.0 12.85 ± 0.0 16.43 ± 0.0
Raw data + Trend lines 86.54 ± 0.0 0.45 ± 0.0 43.50 ± 0.0 23.27 ± 0.0 13.19 ± 0.0 18.23 ± 0.0
References
1. Bergmeir, C., Benítez, J.M.: On the use of cross-validation for time series predictor
evaluation. Inf. Sci. 191, 192–213 (2012). https://doi.org/10.1016/j.ins.2011.12.028
2. Chang, L., Chen, P., Chang, F.: Reinforced two-step-ahead weight adjustment
technique for online training of recurrent neural networks. IEEE Trans. Neural
Netw. Learn. Syst. 23(8), 1269–1278 (2012)
3. Chung, J., Gülçehre, Ç., Cho, K., Bengio, Y.: Empirical evaluation of gated recur-
rent neural networks on sequence modeling. CoRR abs/1412.3555 (2014)
4. Falkner, S., Klein, A., Hutter, F.: BOHB: robust and efficient hyperparameter opti-
mization at scale. In: Proceedings of Machine Learning Research, PMLR, Stock-
holmsmässan, Stockholm Sweden, vol. 80, pp. 1437–1446, 10–15 July 2018. http://
proceedings.mlr.press/v80/falkner18a.html
5. Feng, F., Chen, H., He, X., Ding, J., Sun, M., Chua, T.S.: Enhancing stock move-
ment prediction with adversarial training. In: Proceedings of the Twenty-Eighth
International Joint Conference on Artificial Intelligence, IJCAI-19, International
Joint Conferences on Artificial Intelligence Organization, 7 July 2019, pp. 5843–
5849. https://doi.org/10.24963/ijcai.2019/810
6. Guo, J., Li, X.: Prediction of index trend based on LSTM model for extracting
image similarity feature. In: Proceedings of the 2019 International Conference on
Artificial Intelligence and Computer Science, pp. 335–340. AICS 2019. ACM, New
York (2019). https://doi.org/10.1145/3349341.3349427
7. Guo, T., Xu, Z., Yao, X., Chen, H., Aberer, K., Funaya, K.: Robust online time
series prediction with recurrent neural networks. In: 2016 IEEE International Con-
ference on Data Science and Advanced Analytics (DSAA), pp. 816–825 (2016).
https://doi.org/10.1109/DSAA.2016.92
8. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing human-
level performance on ImageNet classification (2015)
9. Kara, Y., Acar Boyacioglu, M., Baykan, Ö.K.: Predicting direction of stock price
index movement using artificial neural networks and support vector machines:
the sample of the Istanbul Stock Exchange. Expert Syst. Appl. 38(5), 5311–5319
(2011). https://doi.org/10.1016/j.eswa.2010.10.027
10. Keogh, E., Chu, S., Hart, D., Pazzani, M.: An online algorithm for segmenting
time series. In: Proceedings 2001 IEEE International Conference on Data Mining,
pp. 289–296 (2001). https://doi.org/10.1109/ICDM.2001.989531
11. Keogh, E., Pazzani, M.: An enhanced representation of time series which allows
fast and accurate classification, clustering and relevance feedback. In: KDD, vol. 98,
pp. 239–243 (1998). https://doi.org/10.1.1.42.1358. http://www.aaai.org/Papers/
KDD/1998/KDD98-041.pdf
12. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization (2014)
13. Kumar, I., Dogra, K., Utreja, C., Yadav, P.: A comparative study of supervised
machine learning algorithms for stock market trend prediction. In: 2018 Second
International Conference on Inventive Communication and Computational Tech-
nologies (ICICCT), pp. 1003–1007 (2018)
14. Lecun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to
document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
15. Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., Talwalkar, A.: Hyperband:
a novel bandit-based approach to hyperparameter optimization. J. Mach. Learn.
Res. 18(1), 6765–6816 (2017)
16. Lin, T., Guo, T., Aberer, K.: Hybrid neural networks for learning the trend in
time series. In: IJCAI - Proceedings of the Twenty-Sixth International Joint Con-
ference on Artificial Intelligence, pp. 2273–2279 (2017). https://doi.org/10.24963/
ijcai.2017/316. https://www.ijcai.org/proceedings/2017/316
17. Liu, Q., Cheng, X., Su, S., Zhu, S.: Hierarchical complementary attention net-
work for predicting stock price movements with news. In: Proceedings of the
7th ACM International Conference on Information and Knowledge Management,
CIKM 2018, pp. 1603–1606. ACM, New York (2018). https://doi.org/10.1145/
3269206.3269286. http://doi.acm.org/10.1145/3269206.3269286
18. Luo, L., Chen, X.: Integrating piecewise linear representation and weighted
support vector machine for stock trading signal prediction. Appl. Soft
Comput. 13(2), 806–816 (2013). https://doi.org/10.1016/j.asoc.2012.10.026.
http://www.sciencedirect.com/science/article/pii/S1568494612004796
19. Matsubara, Y., Sakurai, Y., Faloutsos, C.: AutoPlait: automatic mining of co-
evolving time sequences. In: Proceedings of the 2014 ACM SIGMOD International
Conference on Management of Data, SIGMOD 2014, pp. 193–204. Association
for Computing Machinery, New York (2014). https://doi.org/10.1145/2588555.
2588556
20. Nelson, D.M., Pereira, A.C., De Oliveira, R.A.: Stock market’s price movement
prediction with LSTM neural networks. In: Proceedings of the International Joint
Conference on Neural Networks (IJCNN), May 2017, pp. 1419–1426 (2017). https://
doi.org/10.1109/IJCNN.2017.7966019
21. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn.
Res. 12, 2825–2830 (2011)
22. Sharma, N., Juneja, A.: Combining of random forest estimates using LSboost for
stock market index prediction. In: 2017 2nd International Conference for Conver-
gence in Technology (I2CT), pp. 1199–1202 (2017)
23. Wang, P., Wang, H., Wang, W.: Finding semantics in time series. In: Proceedings of
the 2011 International Conference on Management of Data - SIGMOD 2011, p. 385
(2011). https://doi.org/10.1145/1989323.1989364. http://portal.acm.org/citation.
cfm?doid=1989323.1989364
24. Wen, M., Li, P., Zhang, L., Chen, Y.: Stock market trend predic-
tion using high-order information of time series. IEEE Access 7, 28299–
28308 (2019). https://doi.org/10.1109/ACCESS.2019.2901842. https://ieeexplore.
ieee.org/document/8653278/
25. Ye, L., Keogh, E.: Time series shapelets: a new primitive for data mining. In:
Proceedings of the 15th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, KDD 2009, pp. 947–956. Association for Computing
Machinery, New York (2009). https://doi.org/10.1145/1557019.1557122
26. Zhang, J., Cui, S., Xu, Y., Li, Q., Li, T.: A novel data-driven stock price trend
prediction system. Expert Syst. Appl. 97, 60–69 (2018). https://doi.org/10.1016/
j.eswa.2017.12.026
27. Zhao, Y., Shen, Y., Zhu, Y., Yao, J.: Forecasting wavelet transformed time series
with attentive neural networks. In: 2018 IEEE International Conference on Data
Mining (ICDM), pp. 1452–1457 (2018)
Text-to-Speech Duration Models
for Resource-Scarce Languages
in Neural Architectures
Johannes A. Louw(B)
1 Introduction
Deep neural network (DNN) based techniques applied to text-to-speech (TTS)
systems have brought on dramatic improvements in the naturalness and intelli-
gibility of synthesized speech. An example of the change in the landscape could
be seen in the 2019 edition of the Blizzard Challenge [1], where the best percep-
tually judged entry was based on a long short-term memory (LSTM) - recurrent
neural network (RNN) hybrid architecture [4] with WaveNet [22] as the vocoder.
In fact, of the twenty-one entries to the Blizzard Challenge 2019 that submit-
ted an accompanying paper (on the Blizzard Challenge website1), one system
1 http://festvox.org/blizzard/blizzard2019.html.
2 https://www.isca-speech.org/archive/SSW 2019/.
3 https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2#expected-training-time.
Recent neural acoustic models such as Fastspeech [14], FastSpeech 2 [13] and
a bottleneck feed-forward neural network implemented in [8] have moved away
from using attention mechanisms to align the linguistic and acoustic encoding
and decoding, and have instead reverted to using an explicit duration model
for the alignment. With our focus on developing and implementing DNN
architectures for resource-scarce environments, we look at duration models
in this work, in particular speaker-specific (dependent) models. We compare
the traditional HMM-based duration models with a DNN-based model suitable
for resource-scarce environments and report on objective measures between the
two models and a reference data set.
The organisation of the paper is as follows: in Sect. 2 we give some background
on duration modeling as well as an overview of the two approaches followed in
this work. Section 3 details our experiments and results, and lastly a discussion
and conclusion is presented in Sect. 4.
2 Duration Models
The technique we describe here is based on the widely used HMM-based Speech
Synthesis System (HTS) [28]. The fundamental unit of duration is a phoneme,
and each phoneme is modelled as a 5-state left-to-right HMM with no skip
transitions. State duration densities are modelled by single Gaussian distributions.
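As a rough illustration only (not the HTS implementation), predicting a phone's duration from such a model can be thought of as summing the per-state Gaussian duration means, with each state contributing at least one frame:

```python
def predict_phone_duration(state_means, frame_shift_ms=10.0, min_frames=1):
    """Predict a phone duration from a 5-state left-to-right HMM.

    `state_means` holds the mean of each state's single-Gaussian duration
    density, in frames. Each state emits at least one frame, so the minimum
    phone duration is 5 frames for a 5-state model.
    """
    frames = sum(max(m, min_frames) for m in state_means)
    return frames * frame_shift_ms  # duration in milliseconds

# e.g. a phone whose states average 1, 2, 3, 2, 1 frames lasts ~90 ms
print(predict_phone_duration([1, 2, 3, 2, 1]))
```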
The duration models are context dependent, with many contextual factors
that influence the duration of the individual phonemes taken into account (e.g.,
phone and phone context identity factors, stress-related factors, locational fac-
tors). The contextual factors taken into account depend on their availability
[Figure: decision-tree-based context clustering of duration models — questions such as "phone is vowel?" route contexts to clustered and merged HMM states, which form the target HMM sequence.]
3 Experimental Setup
3.1 Data
The data used in this work is a subset of an in-house single speaker Afrikaans
female TTS corpus of duration 12:08:15.89. The corpus was recorded in a studio
with a professional voice artist at a 44.1 kHz sampling rate with 16 bits preci-
sion. The subset used are recordings of the text of the Lwazi II Afrikaans TTS
Corpus [12], consisting of 763 utterances of duration 00:56:30.29. This subset
represents a small and phonetically balanced speech database as would be used
for building HMM-based synthetic voices and attempting to build DNN-based
synthetic voices.
The utterances were randomly split into training, validation and testing sets
as given in Table 1. All audio was down-sampled to 16 kHz at 16 bits per sample
and each utterance was normalised to the average power level of the subset (the
763 utterances).
[Figure: the DNN-based duration model — a linguistic description produced by the TTS engine front-end from the speech database is passed through hidden layers h1–h4 to predict duration features.]
Table 2. Contextual features used for duration modelling.
Context Feature
Phoneme The current phone
The two preceding and succeeding phones
The position of the current phone within the current syllable
Syllable The number of phonemes within preceding, current,
and succeeding syllables
The position of the current syllable within the current word
and phrase
The number of preceding and succeeding stressed syllables
within the current phrase
The number of preceding and succeeding accented syllables
within the current phrase
The vowel identity within the current syllable
Word Guessed part-of-speech (GPOS) of preceding,
current, and succeeding words
The number of syllables within preceding, current,
and succeeding words
The position of the current word within the
current phrase
The number of preceding and succeeding content words
within the current phrase
The number of words from the previous content word
The number of words to the next content word
Phrase The number of syllables within preceding, current,
and succeeding phrases
The position of the current phrase in major phrases
Utterance The number of syllables, words, and phrases in the utterance
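To feed contextual features like those above to a DNN duration model, categorical contexts are typically one-hot encoded while counts and positions remain numeric; the toy sketch below (ours, with made-up feature names and values) illustrates one way to do this with scikit-learn.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical context description for one phone, mirroring Table 2.
context = {
    "phone": "ae",
    "prev_phone": "k",
    "next_phone": "t",
    "pos_in_syllable": 2,          # numeric positional feature
    "syllable_len_phones": 3,
    "word_gpos": "noun",           # guessed part-of-speech of the current word
    "syllables_in_word": 2,
}

# DictVectorizer one-hot encodes string-valued features and passes
# numeric features through unchanged.
vec = DictVectorizer(sparse=False)
x = vec.fit_transform([context])
print(vec.get_feature_names_out())
print(x)
```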
Reference Durations. The reference durations of the phone units in the speech
database (Table 1) were obtained from a forced-alignment procedure using the
HTK toolkit [27]. A frame resolution of 10 ms was used (hop size). A silence
state was added between all words in the database in order to identify any
pauses or phrase breaks which were recorded but not specifically annotated in
Fig. 3. The phone duration distribution of all the non-silent phones in the speech
database.
the text with punctuation marks (based on work in [9]). Any non-annotated
silence longer than 80 ms is marked as a pause and a phrase break is inserted
into the utterance structure at this point. These phrase breaks have an influence
on the context features as given in Table 2.
Figures 3 and 4 give the duration distributions of all the non-silent phones and
the near-open front unrounded vowel (/æ/) respectively. Note that the minimum
phone duration is 5 frames due to the use of a 5-state HMM model (see Sect. 2.1).
After the reference durations were extracted, a duration model was built based
on the standard architecture of a 5-state (excluding the non-emitting states), left-
to-right HMM. The contextual features used were as defined in Table 2. The
duration features were modelled by a single-component Gaussian. The decision-
tree state clustering was done using a minimum description length (MDL) factor
of 1.0. Training of the model was done via custom scripts based on the standard
demonstration script 2 available as part of HTS [30] (version 2.2).
Note that the model was only trained on the 715 training utterances of
Table 1.
Fig. 4. The phone duration distribution of the near-open front unrounded vowel
(/æ/) in the speech database.
3.4 Results
The validation and test sets of Table 1 were synthesized with the HMM- and
DNN-based duration models and the durations per phone unit were extracted.
Fig. 5. A visual comparison of the duration prediction on the word level for the utter-
ance “Telkens moet hy die gevolge van sy dade dra”. At the top is the DNN prediction,
at the bottom the HMM prediction and in the middle the reference from the recorded
speech.
References
1. Black, A.W., Tokuda, K.: The blizzard challenge-2005: evaluating corpus-based
speech synthesis on common datasets. In: 9th European Conference on Speech
Communication and Technology, pp. 77–80 (September 2005)
2. Campbell, W.N.: Syllable-based segmental duration. In: Talking Machines: Theo-
ries, Models, and Designs, pp. 211–224 (1992)
3. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing human-
level performance on ImageNet classification. In: Proceedings of the IEEE Inter-
national Conference on Computer Vision (ICCV) (December 2015)
4. Jiang, Y., et al.: The USTC system for blizzard challenge 2019. In: Blizzard Chal-
lenge Workshop 2019, Vienna, Austria (September 2019)
5. Kalchbrenner, N., et al.: Efficient Neural Audio Synthesis. arXiv e-prints
arXiv:1802.08435 (February 2018)
6. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint
arXiv:1412.6980 (2014)
7. Klatt, D.H.: Interaction between two factors that influence vowel duration. J.
Acoust. Soc. Am. 54(4), 1102–1104 (1973)
8. Louw, J.A.: Neural speech synthesis for resource-scarce languages. In: Barnard, E.,
Davel, M. (eds.) Proceedings of the South African Forum for Artificial Intelligence
Research, Cape Town, South Africa, pp. 103–116 (December 2019)
9. Louw, J.A., Moodley, A., Govender, A.: The speect text-to-speech entry for the
blizzard challenge 2016. In: Blizzard Challenge Workshop 2016, Cupertino, United
States of America (September 2016)
10. Louw, J.A., van Niekerk, D.R., Schlünz, G.: Introducing the speect speech synthesis
platform. In: Blizzard Challenge Workshop 2010, Kyoto, Japan (September 2010)
11. Morais, E., Violaro, F.: Exploratory analysis of linguistic data based on genetic
algorithm for robust modeling of the segmental duration of speech. In: 9th Euro-
pean Conference on Speech Communication and Technology (2005)
12. van Niekerk, D., de Waal, A., Schlünz, G.: Lwazi II Afrikaans TTS Corpus
(November 2015). https://repo.sadilar.org/handle/20.500.12185/443. ISLRN: 570–
884-577-153-6
13. Ren, Y., Hu, C., Qin, T., Zhao, S., Zhao, Z., Liu, T.Y.: FastSpeech 2: Fast and
High-Quality End-to-End Text-to-Speech. arXiv preprint arXiv:2006.04558 (2020)
14. Ren, Y., et al.: Fastspeech: fast, robust and controllable text to speech. In:
Advances in Neural Information Processing Systems, pp. 3171–3180 (2019)
15. Riley, M.D.: Tree-based modelling for speech synthesis. In: The ESCA Workshop
on Speech Synthesis, pp. 229–232 (1991)
16. Shen, J., et al.: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectro-
gram Predictions. arXiv e-prints arXiv:1712.05884 (December 2017)
17. Silverman, K., et al.: ToBI: a standard for labeling English prosody. In: Proceedings
of the 2nd International Conference on Spoken Language Processing (ICSLP),
Alberta, Canada, pp. 867–870 (October 1992)
18. Sotelo, J., et al.: Char2wav: End-to-end speech synthesis. arXiv preprint
arXiv:1609.03499 (2017)
19. Tachibana, H., Uenoyama, K., Aihara, S.: Efficiently trainable text-to-speech sys-
tem based on deep convolutional networks with guided attention. arXiv e-prints
arXiv:1710.08969 (October 2017)
20. Taylor, P.: Text-to-Speech Synthesis. Cambridge University Press, Cambridge
(2009)
21. Tokuda, K., Nankaku, Y., Toda, T., Zen, H., Yamagishi, J., Oura, K.: Speech
synthesis based on hidden Markov models. Proc. IEEE 101(5), 1234–1252 (2013)
22. van den Oord, A., et al.: WaveNet: A generative model for raw audio. arXiv e-prints
arXiv:1609.03499 (September 2016)
23. Wang, Y., et al.: Tacotron: Towards end-to-end speech synthesis. arXiv e-prints
arXiv:1703.10135 (March 2017)
24. Watts, O., Henter, G.E., Fong, J., Valentini-Botinhao, C.: Where do the improve-
ments come from in sequence-to-sequence neural TTS? In: 10th ISCA Speech Syn-
thesis Workshop, ISCA, Vienna, Austria (September 2019)
25. Wei, X., Hunt, M., Skilling, A.: Neural network-based modeling of phonetic dura-
tions. arXiv preprint arXiv:1909.03030 (2019)
26. Wu, Z., Watts, O., King, S.: Merlin: an open source neural network speech synthesis
system. In: SSW, pp. 202–207 (2016)
27. Young, S., et al.: The HTK Book, vol. 3, p. 175. Cambridge University Engineering
Department, Cambridge (2002)
28. Zen, H., Tokuda, K., Masuko, T., Kobayasih, T., Kitamura, T.: A hidden semi-
Markov model-based speech synthesis system. IEICE Trans. Inf. Syst. E90–D(5),
825–834 (2007)
29. Zen, H., Senior, A.: Deep mixture density networks for acoustic modeling in sta-
tistical parametric speech synthesis. In: 2014 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), pp. 3844–3848. IEEE (2014)
30. Zen, H., Tokuda, K., Masuko, T., Kobayasih, T., Kitamura, T.: A hidden semi-
Markov model-based speech synthesis system. IEICE Trans. Infor. Sys. E90–D(5),
825–834 (2007)
31. Zhu, X., Zhang, Y., Yang, S., Xue, L., Xie, L.: Pre-alignment guided attention for
improving training efficiency and model stability in end-to-end speech synthesis.
IEEE Access 7, 65955–65964 (2019)
Importance Sampling Forests for Location
Invariant Proprioceptive Terrain
Classification
Abstract. The ability for ground vehicles to classify the terrain they
are traversing or have previously traversed is extremely important for
manoeuvrability. This is also beneficial for remote sensing as this infor-
mation can be used to enhance existing soil maps and geographic infor-
mation system prediction accuracy. However, existing proprioceptive
terrain classification methods require additional hardware and some-
times dedicated sensors to classify terrain, making the classification pro-
cess complex and costly to implement. This work investigates offline
classification of terrain using simple wheel slip estimations, enabling
the implementation of inexpensive terrain classification. Experimental
results show that slip-based classifiers struggle to classify the terrain sur-
faces using wheel slip estimates alone. This paper proposes a new clas-
sification method based on importance sampling, which uses position
estimates to address these limitations, while still allowing for location
independent terrain analysis. The proposed method is based on the use
of an ensemble of decision tree classifiers trained using position informa-
tion and terrain class predictions sampled from weak, slip-based terrain
classifiers.
1 Introduction
Fig. 1. Illustration of the terrain surfaces layout and the Packbot 510 UGV used for
experiments. The top part is the carpet, grass and rocks are on the sides, while the
surrounding area is the rubber.
to adapt to the type of terrain they are currently traversing. The ability for
ground vehicles to classify the terrain they are traversing or have previously tra-
versed is also beneficial for remote sensing [6], and the information can be used
to enhance existing soil maps [21] and geographic information system (GIS)
prediction accuracy [30].
Terrain classification is the process of determining into which terrain class
category a specific terrain patch falls [10,18,23]. Commonly classified terrain
surfaces for outdoor environments include dirt, sand, clay, asphalt, grass and
gravel [22,29] while carpet, ceramic tiles and linoleum [14,26,29] are generally
considered in indoor environments.
Terrain classification can be vision-based or through proprioception. Vision-
based classification uses visual features, such as colour, texture and shape,
obtained from sensors such as cameras and laser scanners [32]. Proprioceptive
classification uses physical wheel-terrain interaction features that are extracted
from a vehicle’s sensors [5], and is sometimes also referred to as contact-based
terrain classification [32].
A particularly interesting class of proprioceptive terrain classification relies
on wheel-slip [11] and has been recommended as a simple and low-cost terrain
analysis technique. However, this work shows that terrain classification using
slip measurements can be unreliable, and that models trained to classify terrain
using slip alone often suffer from over-fitting when applied in new terrains.
This paper proposes a method of addressing this limitation through impor-
tance sampling, which introduces position estimation into the classification pro-
cess, while still allowing for location independent terrain classification. Here, we
2 Related Work
Proprioception for ground vehicles involves the sensing of the internal states of
a vehicle using onboard sensors such as wheel encoders, accelerometers and rate
transducers [17,25]. Proprioceptive classifiers typically use these sensor measure-
ments directly to classify the terrain being traversed [10]. The two most common
proprioceptive terrain classification methods are vibration-based classification
and traction-based classification.
ω_a = φ_1 v_d + φ_2 ω_d    (3)
3 Importance Sampling
Importance Sampling (IS) is a Monte Carlo sampling tool that is used to approx-
imate a distribution when the only samples available are produced by a different
distribution. Here, a new distribution is produced by sampling random draws
from an existing distribution and computing a weighted average over the ran-
dom draws, approximating a mathematical expectation with respect to a target
distribution [2,27]. Using IS, when given a random variable x with a probability
of p(x), and assuming we wish to compute an expectation μ_f = E_p[f(X)] by
sampling random draws x^(1), ..., x^(m) from q(x), we can write
μ_f = ∫ f(x) p(x) dx    (4)
For any proposal density q(x) that satisfies q(x) > 0 whenever f(x)p(x) ≠ 0,
we also have:
μ_f = ∫ f(x) (p(x)/q(x)) q(x) dx ≈ (1/m) Σ_{i=1}^{m} f(x^(i)) p(x^(i))/q(x^(i))
To increase the accuracy of IS, for most x, q(x) has to be approximately propor-
tional to p(x), thereby reducing variance in the estimate of μf . Importantly, IS is
more than a variance reduction method as it is also used to investigate the prop-
erties of a distribution when sampling from another distribution of interest [27].
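A minimal numerical illustration of this idea (ours, not from the paper; the target and proposal densities are arbitrary Gaussians) is:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target p(x): standard normal. Proposal q(x): normal with mean 1, std 2.
def p(x):
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

def q(x):
    return np.exp(-0.5 * ((x - 1) / 2)**2) / (2 * np.sqrt(2 * np.pi))

x = rng.normal(1.0, 2.0, size=100_000)     # draws from q
w = p(x) / q(x)                            # importance weights

# Estimate E_p[f(X)] for f(x) = x**2 (true value is 1 for a standard normal).
f = x**2
print(np.sum(w * f) / np.sum(w))           # self-normalised IS estimate, ~1.0
```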
We leverage this property to draw terrain class samples from a joint distribution
conditioned on both position and slip, while still allowing for location indepen-
dent terrain classification. In order to apply the importance sampling principle,
we sample terrain labels from weak probabilistic slip classifiers (a decision tree),
and use these to train new decision trees conditioned on vehicle position to pro-
duce an ensemble model over terrain classes. This can be considered a form of
random forest [3]. Random forests are typically trained using bagging, where
a subset of features are randomly sampled from a set, but here we train the
forests (which we term importance sampling forests (ISF)) using labels sampled
from a slip classifier. This can also be viewed as a form of boosting, an iterative
training method where the weights of incorrectly classified samples are increased
to make these more important in the next iteration, thereby reducing variance
and bias and improving model performance [1,15]. The use of ISFs for terrain
classification is shown in Algorithm 1.
ISFs work by using terrain labels sampled from a pre-trained slip-based ter-
rain classifier that is applied to a new set of slip estimates captured in a previ-
ously unseen terrain patch. The ISF classifier randomly draws new terrain labels
C^j from the predicted output distribution given by the slip-based classifier
R(C_t|s_t). The new terrain label samples are then used, along with positions x_t,
to train a number of position-based terrain classifiers q(C_t|x_t).
The goal of the proposed ISF classification method is to find the expected ter-
rain class C at position x_t with respect to the slip classifier distribution R(C|s_t),
C(x_t) = ∫ q(C|x_t, s_t) R(C|s_t) ds_t,    (7)
where q_i(C|x_t, s_t) is a decision tree trained using class labels C^j sampled from
the slip classifier class likelihood
C^j ∼ R(C|s_t).    (9)
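A compact sketch of the ISF training procedure as we read it (not the authors' code; `slip_clf` is a pre-trained probabilistic slip classifier such as a decision tree, and the tree depth is a hypothetical setting):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_isf(slip_clf, slip_feats, positions, n_trees=25, seed=0):
    """Importance sampling forest: position-based trees trained on labels
    sampled from a weak slip-based classifier's predicted class probabilities."""
    rng = np.random.default_rng(seed)
    proba = slip_clf.predict_proba(slip_feats)          # R(C | s_t)
    classes = slip_clf.classes_
    forest = []
    for _ in range(n_trees):
        # Sample a terrain label per data point from the slip classifier output.
        labels = np.array([rng.choice(classes, p=p) for p in proba])
        tree = DecisionTreeClassifier(max_depth=8).fit(positions, labels)
        forest.append(tree)                              # q_i(C | x_t)
    return forest, classes

def predict_isf(forest, positions, classes):
    """Average class probabilities over the trees (aligned to `classes`)."""
    agg = np.zeros((len(positions), len(classes)))
    class_index = {c: i for i, c in enumerate(classes)}
    for tree in forest:
        proba = tree.predict_proba(positions)
        for j, c in enumerate(tree.classes_):
            agg[:, class_index[c]] += proba[:, j]
    return np.asarray(classes)[np.argmax(agg, axis=1)]
```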
5 Experimental Results
We illustrate the use of ISFs on a terrain classification task using a Packbot 510
tracked robot.
Fig. 2. The figure shows the positions on the four terrain surface patches where data
was captured. (Yellow - Rubber, Red - Carpet, Blue - Rocks, Green - Grass.) (Color
figure online)
Figure 3 shows the terrain setup that was used for Dataset B and to perform
the second test, which we refer to as Test B. We trained support vector machines
and decision trees1 to predict terrain labels from the given slip estimates using Dataset
A, and report test accuracy on Test A. Model parameters were tuned to produce
the best performance on the validation set.
We then tested these models on Dataset B, to illustrate the challenge of
over-fitting to terrain configurations. We also used Test B to illustrate the value
1 We also experimented with LSTM models, but these failed dismally, presumably due to a lack of data.
Fig. 3. Terrain layout where Dataset B was collected. The colour codes are still con-
sistent with those of Fig. 2 (Color figure online).
When tested on Dataset A, both the support vector machine (Fig. 4) and decision
tree (Fig. 5) seem to perform well, successfully classifying most terrain samples in
the test set. These results agree with those typically seen in the literature, which
has suggested that terrain classification using slip estimates alone is generally
effective.
The DT returned a validation accuracy of 89%, and an accuracy score of 90%
when Test A was conducted. However, when Test B is conducted, it is clear that
the DT has over-fit to the terrain configuration of Test A. Similar results are
seen for the SVM, as shown in Table 1.
When the trained decision tree classifier is used to classify the terrain patches
contained in Dataset B, the classifier fails with a low accuracy score of 31%.
In order to address this over-fitting, a more conservative DT classifier was
trained using training data from Dataset A. The classifier produced a validation
accuracy of 65% and an accuracy score of 62% when Test A was conducted.
Figure 7 shows the confusion matrix of the more conservative classifier, which is
only weakly able to classify the terrain in Test B, as shown in Fig. 6.
The poor classification results obtained here seem to contradict findings in
the literature about the efficacy of slip-based terrain classification, and highlight
the importance of testing terrain classifiers under multiple conditions to avoid
over-fitting.
Fig. 4. The predicted labels for an SVM on Test A terrain seem to indicate that the
terrain is generally classified correctly.
Fig. 5. The labels predicted by the DT classifier for Test A at position estimates
corresponding to the test data seem to show successful terrain classification.
Fig. 6. The predicted labels by the DT classifier for Test B terrain patch classification
at position estimates corresponding to the test data show that the slip-based terrain
classification fails to classify the terrain patches particularly well.
Fig. 7. The figure shows the performance of the properly tuned, conservative DT clas-
sifier when Test A is performed. The classifier obtains an accuracy score of 62%.
Figure 8 shows the confusion matrix when Test A position estimates were used to improve the
performance of the previously trained slip-based terrain classifier. From the plot,
we can see that the method accurately classifies the terrain patches. Though the
accuracy of rubber dropped from 95% to 90%, the overall performance of the
classifier improved significantly. The improvement can also be noted in Fig. 9,
where the terrain patches can be clearly seen.
When Test B position estimates were used to classify the terrain using impor-
tance sampled forests and slip classifiers trained using dataset A, the new clas-
sifier returned an accuracy score of 97%. Figure 10 shows the confusion matrix
Fig. 8. The plot shows the level of improvement from the classifier with an accuracy
score of 53% to an accuracy score of 94% when ISF and Test A pose estimations are
used. Rock, grass and carpet have improved from a classification accuracy of less than
45%, while rubber dropped from an accuracy of 95%.
Fig. 10. The figure shows the level of improvement from the classifier with an accuracy
score of 41% to an accuracy score of 97% when ISF and Test B pose estimations are
used to re-train the classifier for classification.
Fig. 11. The figure shows the Test B terrain patch label layout using the ISF terrain
classifier. The terrain patches can be clearly noted. This is in stark comparison with
the base slip classifier used for prediction, as shown in Fig. 6.
when Test B position estimates are used to improve the performance of the pre-
viously trained terrain classifier. From the plot, we can see that the method accu-
rately classifies the terrain patches, where the classifier performance improved.
The dramatic improvement can also be noted from Fig. 11, where the terrain
patches can be clearly seen.
The results presented in this section show that the ISF classifier is able
to dramatically increase classification performance through the use of spatial
smoothing, and that the forest ensemble helps to prevent over-fitting.
Table 1. The table summarises (% accuracy) the average performance of all the clas-
sifiers’ Test A and Test B results, while also comparing terrain-wise Test A and B
performance.
Table 1 summarises the experimental results described above. Here, the con-
vention Train A, Test B denotes a model trained on Dataset A, and tested on
Dataset B.
It should be noted that importance sampling forests are inexpensive to com-
pute when the underlying slip-based classifier is a decision tree, but slow dra-
matically if alternative models are used. For this reason, we do not attempt to
improve SVM classifiers using the proposed approach.
6 Conclusions
This work has highlighted the importance of testing terrain classification mod-
els in multiple terrain configurations to avoid over-fitting. Experimental results
showed that slip-based DT and SVM classifiers failed to classify terrain surfaces
due to a loss of generalisation caused by over-tuning of model parameters.
This paper introduced importance sampling forests for terrain classification,
a technique that uses sampled labels from a probabilistic slip-based terrain clas-
sifier to train position conditioned terrain classification models. This produces
an ensemble terrain classifier, which allows for terrain classification that incorpo-
rates spatial information. Importantly, this approach means that the slip-based
classifier can incorporate position information in new locations, as it is location
invariant.
References
1. Akar, Ö., Güngör, O.: Classification of multispectral images using random forest
algorithm. J. Geodesy Geoinf. 1(2), 105–112 (2012)
2. Bishop, C.M.: Pattern Recognition and Machine Learning. Information Science
and Statistics. Springer, New York (2006)
3. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
4. Brooks, C.A., Iagnemma, K.: Vibration-based terrain classification for planetary
exploration rovers. IEEE Trans. Rob. 21(6), 1185–1190 (2005). https://doi.org/
10.1109/TRO.2005.855994
5. Brooks, C.A., Iagnemma, K.: Self-supervised terrain classification for planetary
surface exploration rovers. J. Field Robot. 29(3), 445–468 (2012)
6. Brooks, C.A., Iagnemma, K.D.: Self-supervised classification for planetary rover
terrain sensing. In: IEEE Aerospace Conference Proceedings, pp. 1–9 (2007).
https://doi.org/10.1109/AERO.2007.352693
7. Burke, M.: Path-following control of a velocity constrained tracked vehicle incor-
porating adaptive slip estimation. In: Proceedings of the IEEE International Con-
ference on Robotics and Automation, pp. 97–102 (2012). https://doi.org/10.1109/
ICRA.2012.6224684
8. Collins, E.G., Coyle, E.J.: Vibration-based terrain classification using surface pro-
file input frequency responses. In: 2008 IEEE International Conference on Robotics
and Automation, pp. 3276–3283 (2008). https://doi.org/10.1109/ROBOT.2008.
4543710
9. Coyle, E.: Fundamentals and methods of terrain classification using proprioceptive
sensors. Ph.D. thesis, The Florida State University (2010)
10. Coyle, E., Collins, E.G., Roberts, R.G.: Speed independent terrain classification
using singular value decomposition interpolation. In: 2011 IEEE International Con-
ference on Robotics and Automation (ICRA), pp. 4014–4019. IEEE (2011)
11. Ding, L., Gao, H., Deng, Z., Yoshida, K., Nagatani, K.: Slip ratio for lugged wheel
of planetary rover in deformable soil: definition and estimation. In: 2009 IEEE/RSJ
International Conference on Intelligent Robots and Systems, IROS 2009, pp. 3343–
3348. IEEE (2009)
12. DuPont, E.M., Moore, C.A., Collins, E.G., Coyle, E.: Frequency response method
for terrain classification in autonomous ground vehicles. Auton. Robot. 24(4), 337–
347 (2008). https://doi.org/10.1007/s10514-007-9077-0
13. DuPont, E.M., Roberts, R.G., Selekwa, M.F., Moore, C.A., Collins, E.G.: Online
terrain classification for mobile robots. Dyn. Syst. Control Parts A and B. 2005,
1643–1648 (2005). https://doi.org/10.1115/IMECE2005-81659
14. Giguere, P., Dudek, G.: Surface identification using simple contact dynamics for
mobile robots. In: 2009 IEEE International Conference on Robotics and Automa-
tion (2009). https://doi.org/10.1109/ROBOT.2009.5152662
15. Gislason, P.O., Benediktsson, J.A., Sveinsson, J.R.: Random forests for land cover
classification. Pattern Recogn. Lett. 27(4), 294–300 (2006)
16. Gonzalez, R., Iagnemma, K.: DeepTerramechanics: Terrain Classification and Slip
Estimation for Ground Robots via Deep Learning. arXiv preprint arXiv:1806.07379
(2018)
168 D. Masha and M. Burke
17. Howard, A., Turmon, M., Matthies, L., Tang, B., Angelova, A., Mjolsness, E.:
Towards learned traversability for robot navigation: from underfoot to the far field.
J. Field Robot. 23(11–12), 1005–1017 (2006). https://doi.org/10.1002/rob.20168
18. Iagnemma, K., Shibly, H., Dubowsky, S.: On-line terrain parameter estimation for
planetary rovers. In: Proceedings of the IEEE International Conference on Robotics
and Automation, vol. 3, pp. 3142–3147 (2002). https://doi.org/10.1109/ROBOT.
2002.1013710
19. Kuntanapreeda, S.: Traction control of electric vehicles using sliding-mode con-
troller with tractive force observer. Int. J. Veh. Technol 2014, 1+ (2014). https://
doi.org/10.1155/2014/829097
20. Masha, D., Burke, M., Twala, B.: Slip estimation methods for proprioceptive ter-
rain classification using tracked mobile robots. In: 2017 Pattern Recognition Asso-
ciation of South Africa and Robotics and Mechatronics (PRASA-RobMech), pp.
150–155. IEEE (2017)
21. Moore, I.D., Gessler, P.E., Nielsen, G.Æ., Peterson, G.: Soil attribute prediction
using terrain analysis. Soil Sci. Soc. Am. J. 57(2), 443–452 (1993)
22. Ojeda, L., Borenstein, J., Witus, G.: Terrain trafficability characterization with
a mobile robot. In: Proceedings of the SPIE Defense and Security Conference,
Unmanned Ground Vehicle Technology VII, vol. 5804, pp. 235–243 (2005). https://
doi.org/10.1117/12.601499
23. Ojeda, L., Borenstein, J., Witus, G., Karlsen, R.: Terrain characterization and
classification with a mobile robot. J. Field Robot. 23(2), 103–122 (2006). https://
doi.org/10.1002/rob.20113
24. Otsu, K., Ono, M., Fuchs, T.J., Baldwin, I., Kubota, T.: Autonomous terrain
classification with co-and self-training approach. IEEE Robot. Autom. Lett. 1(2),
814–819 (2016)
25. Overholt, J.L., Hudas, G.R., Gerhart, G.R.: Defining proprioceptive behaviors for
autonomous mobile robots. In: Unmanned Ground Vehicle Technology IV, vol.
4715, pp. 287–295. International Society for Optics and Photonics (2002). https://
doi.org/10.1117/12.474460
26. Tick, D., Rahman, T., Busso, C., Gans, N.: Indoor robotic terrain classification via
angular velocity based hierarchical classifier selection. In: Proceedings of the IEEE
International Conference on Robotics and Automation, pp. 3594–3600 (2012).
https://doi.org/10.1109/ICRA.2012.6225128
27. Tokdar, S.T., Kass, R.E.: Importance sampling: a review. Wiley Interdisc. Rev.
Comput. Stat. 2(1), 54–60 (2010). https://doi.org/10.1002/wics.56
28. Valada, A., Burgard, W.: Deep spatiotemporal models for robust proprioceptive
terrain classification. Int. J. Robot. Res. 36(13–14), 1521–1539 (2017). https://doi.
org/10.1177/0278364917727062
29. Weiss, C., Frohlich, H., Zell, A.: Vibration-based terrain classification using sup-
port vector machines. In: 2006 IEEE/RSJ International Conference on Intelligent
Robots and Systems, pp. 4429–4434 (2006). https://doi.org/10.1109/IROS.2006.
282076
30. Yanar, T.A., Akyürek, Z.: The enhancement of the cell-based GIS analyses with
fuzzy processing capabilities. Inf. Sci. 176(8), 1067–1085 (2006). https://doi.org/
10.1016/j.ins.2005.02.006
31. Yoshida, K., Watanabe, T., Mizuno, N., Ishigami, G.: Slip, traction control, and
navigation of a lunar rover. In: i-SAIRAS 2003, p. 8579 (2003)
32. Zou, Y., Chen, W., Xie, L., Wu, X.: Comparison of different approaches to visual
terrain classification for outdoor mobile robots. Pattern Recogn. Lett. 38, 54–62
(2014)
Hybridized Deep Learning Architectures
for Human Activity Recognition
1 Introduction
The recognition of human activity involves a process of identifying the actions
and goals of individuals from a series of observations of the activities performed.
This recognition task can be utilized in multiple domains that aim to monitor the actions of human beings; examples include detecting foul play in sports and abnormal activity for security purposes. There exist various investigations that use two different categories of data, viz. video and sensor data, to solve the human activity recognition problem. Some video data may
contain depth information, which is referred to as RGB-D data. This type of
data is collected using special RGB-D camera devices. The sensor data used in
human activity recognition are generally collected from wearable devices, motion
sensors, and body heat-sensors.
There are two aspects to human activity recognition: action classification and
action detection [2]. Action detection refers to identifying activities of interest
[3]. In contrast, action classification aims to classify an action that is being
performed in each video snippet. The action classification problem needs to also
solve the action representation sub-problem. The action representation problem
deals with finding the best features to use to train a classifier.
Many approaches have been introduced to solve the human activity recog-
nition problem. Traditional image processing techniques require the extraction
of good feature descriptors, which is essential for successful classifications. How-
ever, the application of deep learning techniques has demonstrated tremendous
performance improvements over traditional techniques. A particular advantage
of deep learning approaches is their automatic feature extraction ability [4].
In this paper, a hybrid deep learning architecture, utilizing a Temporal Seg-
ment Network (TSN), Octave Convolutional Neural Network (OctConv), and a
Multi-Layer Perceptron (MLP), is proposed as a solution to the human activity
recognition problem. The TSN and MLP are hybridized. The OctConv is used
as a convolutional neural network (CNN). Other convolutional neural network
models and various multi-layer perceptrons were implemented and evaluated
before we settled on this architecture.
The rest of the paper is organised as follows. Section 2 presents a review of the
literature on human activity recognition from video. The proposed architecture
is described in Sect. 3. The methods, including the dataset, experimental design,
and evaluation metrics are discussed in Sect. 4. The results obtained using the
KTH dataset are presented and discussed in Sect. 5. Concluding remarks and
pointers to future work are given in Sect. 6.
2 Literature Review
The three most popular deep learning models developed for human activity
recognition are recurrent neural networks, 3D convolutional neural networks, and
two-stream convolutional neural networks [2]. Many of the deep learning models
for action recognition require two types of input sequences: spatial and temporal
sequences [2]. A two-stream convolutional neural network processes these two
types of inputs independently. The features extracted from the sequences are
fused using a fusion strategy.
A temporal segment network (TSN) is composed of a spatial and a temporal
stream [6]. A TSN model operates on a sequence of short snippets that are
sparsely sampled from the video instead of working on individual frames or a
stack of frames. The study in [5], conducted with a TSN and a two-stream
inflated 3D convolution network (I3D), demonstrated that two-stream networks
obtain better results for motion that occurs over a short duration. The results
presented in this study show that the TSN model yields slightly better results
than the I3D model. The disadvantage of using a CNN in these two-stream
networks is that they fail to model longer-term temporal variations.
Since deep learning is known for its automatic feature extraction abilities [4],
this study considers a two-stream deep learning model as a feature extractor to
train a simple multi-layer perceptron (MLP) model.
3 Architecture
The proposed architecture is given in Fig. 1. The architecture is based on the
hybridization of a two-stream network and a multi-layer perceptron. The two-
stream network employed is a Temporal Segment Network (TSN). A temporal
segment network framework performs video-level predictions utilizing the visual
information of entire videos [6]. The TSN is composed of two streams, namely a
spatial stream and a temporal stream. The model operates on a sequence of short
snippets sparsely sampled from the video instead of working on individual frames
or frame stacks. A video V is divided into k segments {V1, V2, ..., Vk} of equal duration. Short snippets consisting of x consecutive frames are randomly chosen from each segment Vk. An optical flow operator is applied to these sequential frames and the optical flow result is fed into the temporal stream.
Similarly, a random frame is selected from each segment Vk and input into the
spatial stream. The class scores assigned to the different snippets are fused by a
segmental consensus function to produce a class score, which is the video-level
predicted class. The predictions obtained from the spatial and temporal streams
are combined to produce the entire video’s final prediction. Traditionally, a TSN
is trained using three segments and then tested using 25 segments. However,
this TSN uses ten video segments during training, validation, and testing. The
consistency scheme was adopted from [8] to obtain more temporal information.
Both the spatial and the temporal stream require an input size of 224 × 224.
The frames chosen for the spatial stream are randomly cropped and horizontally
flipped. These are the same parameters that are employed by [8]. Once the video
is split into ten segments, an RGB video frame is randomly chosen from each
segment and is input to the spatial CNN model.
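The sampling scheme described above can be illustrated with the Python sketch below; the array-based interface and the handling of segment boundaries are assumptions, and the optical flow computation itself (FlowNet 2.0 in this work) is omitted.

```python
import numpy as np

def sample_tsn_inputs(video_frames, num_segments=10, snippet_len=11, rng=None):
    """Sketch of TSN-style sparse sampling (assumed array-based interface).

    video_frames: array of shape (T, H, W, 3).
    Returns one RGB frame per segment for the spatial stream and one snippet
    of consecutive frames per segment from which optical flow would be computed.
    """
    rng = rng if rng is not None else np.random.default_rng()
    T = len(video_frames)
    bounds = np.linspace(0, T, num_segments + 1, dtype=int)
    rgb_frames, flow_snippets = [], []
    for start, end in zip(bounds[:-1], bounds[1:]):
        # Spatial stream: one randomly chosen RGB frame from this segment.
        rgb_frames.append(video_frames[rng.integers(start, end)])
        # Temporal stream: `snippet_len` consecutive frames (may run past the
        # segment for very short segments; acceptable for this sketch).
        s = rng.integers(start, max(start + 1, end - snippet_len + 1))
        flow_snippets.append(video_frames[s:s + snippet_len])
    return np.stack(rgb_frames), flow_snippets
```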
Convolutional Neural Networks are popular and powerful deep learning algo-
rithms in image processing. A few pre-trained CNN models were investigated to
determine the best candidate for both the spatial and temporal streams. These
CNNs include AlexNet, VGG-16, SqueezeNet, ResNet50, and OctResNet50 mod-
els. The AlexNet CNN consists of five convolutional layers, where some are fol-
lowed by max-pooling layers [11]. It is then followed by three fully connected
layers, where the final layer is constructed with 1000 neurons. These fully con-
nected layers utilize a dropout technique to reduce overfitting. The AlexNet
CNN was known as one of the best CNN models to produce high classification
accuracies on the ImageNet dataset.
A study was conducted on the impact of increasing the depths of a CNN
architecture [12]. It investigated these architectures’ performances with various
depths of 16 to 19 weight layers and utilizing convolutional filters. The model
with a depth of 16 layers is called the VGG-16. The VGG model achieves bet-
ter results than AlexNet as the depth increases. Another deep CNN model,
SqueezeNet, was proposed with the intention to reduce the network’s parame-
ters so that it can easily be deployed with low memory requirements and can
efficiently be transferred over a computer network [13]. This CNN model has
approximately 5 MB of parameters and was able to produce the same level of
accuracy as AlexNet when evaluated on the ImageNet dataset. The study also
demonstrated that both the AlexNet and SqueezeNet model parameter size could
be significantly reduced while maintaining accuracy by using a deep compression
technique proposed by [14].
A deep residual learning framework, ResNet, was proposed in [15]. The study
demonstrated that these networks are easily optimized and achieve higher accu-
racies with increased depths. These accuracy levels are competitive with state-
of-the-art. An Octave Convolution (OctConv) was proposed in [1], to increase
the efficiency in CNN models. This convolution operator was designed to eas-
ily replace vanilla convolutions in existing CNN models without adjusting any
aspects of the network architecture. The study claims that this OctConv boosts
accuracy for both image and video recognition tasks with much lower mem-
ory and computational cost. Evaluating the OctConv in ResNet CNN mod-
els (OctResNet) has shown significant accuracy improvements over the original
ResNet models’ performances.
In a performance comparison of the selected CNN models, the OctResNet50 model produced the best results when building the proposed architecture. These results are presented and discussed in Sect. 5. Hence, the OctResNet50 model was chosen for both
the spatial and temporal stream. The OctResNet50 CNN model was pre-trained
on the ImageNet dataset [1]. The output layer of the CNN model produces the
classification score for the potential classes. These scores from all segments are
fused using average fusion to form a single vector as the feature representation
of the spatial stream component.
The temporal stream input consists of 11 consecutive video frames that are
randomly chosen from each segment. The optical flow is estimated by using the
FlowNet 2.0 algorithm to produce ten optical flow calculations [16]. FlowNet
is an optical flow estimation algorithm that utilizes deep neural networks [16].
FlowNet 2.0 is the most recent version to date and focuses on quality and speed improvements. It decreases estimation errors by over 50% compared to the original; although it is slower than the original architecture, it outperforms all other optical flow algorithms in terms of accuracy. FlowNet 2.0 also introduces a subnetwork to handle small motions. The algorithm has been evaluated on motion segmentation and action recognition, where its optical flow estimates proved very reliable. Overall, FlowNet 2.0 performs on par with existing state-of-the-art methods. Due to its speed and accuracy, it was the preferred optical flow method in our architecture.
The ten optical flow calculations produced by the FlowNet 2.0 algorithm are
stacked together and fed into the temporal CNN model. Another separate pre-
trained OctResNet50 model trained on the ImageNet dataset was also employed
as the temporal CNN model [1]. The output of the temporal stream’s CNN model
is the classification scores for the potential action classes. These scores from each
segment are fused by the average fusion technique to produce a single feature
representation vector of the temporal stream. Weighted average fusion is then applied to the spatial and temporal streams' features. The spatial and temporal weights were set to 1 and 1.5, respectively. This weighting strategy was adopted from [8].
Once the weighted fusion was calculated, this served as input to a multi-
layer perceptron (MLP) trained using backpropagation. This MLP consisted of
an input layer with six neurons, a single hidden layer with 12 neurons, and an
output layer with six neurons. ReLU was the activation function in the MLP.
Each output neuron of the MLP carries a score that is associated with a potential
classification class label.
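As an illustration only, the fusion and classification step described above might be sketched as follows in PyTorch; the class name, the normalisation of the weighted average and the usage example are assumptions, not the code used in the experiments.

```python
import torch
import torch.nn as nn

class FusionMLP(nn.Module):
    """Sketch of the MLP classifier head (6-12-6, ReLU) described above."""
    def __init__(self, num_classes=6, hidden=12):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_classes, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, spatial_scores, temporal_scores,
                w_spatial=1.0, w_temporal=1.5):
        # Weighted average fusion of the two streams' class-score vectors
        # (normalising by the weight sum is an assumption of this sketch).
        fused = (w_spatial * spatial_scores + w_temporal * temporal_scores) / (
            w_spatial + w_temporal)
        return self.net(fused)

# Usage sketch: fused scores for a batch of 4 videos over the 6 KTH classes.
mlp = FusionMLP()
spatial = torch.rand(4, 6)
temporal = torch.rand(4, 6)
logits = mlp(spatial, temporal)   # shape (4, 6)
```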
This proposed architecture is different from other existing architectures in
that it utilizes models that were introduced in very recent years by various
research works. These models are employed in the two-stream network segment
of this proposed architecture. This architecture stands out from the other two-
stream architectures because it uses the multi-layer perceptron (MLP) as a classi-
fier and the two-stream model as a feature extractor. Typical two-stream models
use the respective streams’ fused outputs as decision variables for the final clas-
sification result. However, this architecture uses the output as a feature vector
that serves as an input to the MLP classifier. This hybridization of a two-stream
model and an MLP has significantly increased the classification accuracies,
which are presented and discussed in Sect. 5.
4 Methods
4.1 Dataset
The KTH human activity video dataset was used to validate the proposed archi-
tecture [17]. The dataset consists of 599 video clips containing six categories of
actions, with approximately 100 videos per category. The six actions are: hand-
waving, boxing, handclapping, walking, jogging, and running. Example images
of these activities are given in Fig. 2. Each category contains videos that were
recorded by 25 different subjects in 4 different scenarios. All the video clips have
a fixed frame rate of 25 fps and a resolution of 160 × 120 pixels.
$\mathrm{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$  (2)
Unless stated otherwise, Eq. 1 was used to report the accuracy of the model instead of Eq. 2. Precision is used to maintain a fair comparison with results reported for other existing architectures.
The KTH dataset was divided into three sets, where 70% of the clips were chosen
for training, 10% for validation, and 20% for testing. Some pre-trained CNN
models trained on the ImageNet dataset were selected as candidate CNN models
for the spatial and temporal stream. These pre-trained CNN models include
AlexNet, VGG-16, SqueezeNet, ResNet50, and OctResNet50. Each stream was independently trained with these models, and the two best-performing CNN models were chosen as candidates for the spatial and temporal streams. Combinations of the selected CNN models were then evaluated to observe which TSN produces the best results. The best TSN model was then hybridized with various MLP architectures. The hybrid architecture that yielded the highest accuracy levels was chosen as the final architecture design. The results of the steps undertaken are presented in Sect. 5.
Three different MLP architectures were evaluated with the selected TSN model. All of them had one hidden layer; the networks varied in the number of neurons in that layer and the type of activation function used. The
different MLP architectures are defined in Table 4. Table 5 demonstrates the
differences in the accuracy levels for each proposed hybrid architecture. The
proposed TSN model hybridized with the first version of the MLP demonstrated
significant improvements over the non-hybridized TSN model.
Table 3. Classification accuracies obtained by the various versions of the TSN model
The first MLP version was adjusted by increasing the number of neurons
to produce the second version. After replacing the first version of the MLP
with the second version, the hybrid architecture showed greater accuracy levels.
Further adjusting the MLP model by replacing the ReLU activation function with a Sigmoid function resulted in a significant drop in accuracy levels compared to when utilizing the second version of the MLP. The hybridized architecture with
the second MLP version also displayed the potential of obtaining better accuracies. Therefore, it was decided to train that architecture for another 50 epochs, bringing it to a total of 150 epochs. Increasing the number of training epochs for the hybrid TSN and MLP version 2 architecture increased the accuracy levels, which are very competitive with the current state-of-the-art results.
The accuracy results of five independent runs of the proposed TSN model hybridised with the MLP version 2 architecture are presented in Table 6. Each of the independent runs reported was trained for 150 epochs. Saving the best model during training was also used to avoid overfitting. It can be seen that the
architecture produced accuracy levels that are above 90%. The highest accuracy
level that the architecture produced is 97.5%. The standard deviation was cal-
culated over the five independent runs and produced a value of 2.15. This value
is small, which indicates that the accuracy levels are close to the mean and
that high precision exists. The t-test was performed on the accuracy values to
determine if the difference in performance is statistically significant. The t-test
significance level was set to 0.05, and the two-tailed hypothesis was used. The
t-test produced a t-value of −0.001863 and a p-value of 0.998603. This result
indicates that the difference in performance is not statistically significant.
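For illustration, statistics of this form can be computed as in the sketch below; the run accuracies and the reference value are placeholders (the actual run-level results are reported in Table 6), and treating the comparison as a one-sample, two-tailed t-test against a reference accuracy is an assumption of this sketch.

```python
import numpy as np
from scipy import stats

# Placeholder accuracies for five independent runs (illustration only).
run_accuracies = np.array([97.5, 96.0, 93.1, 95.4, 94.9])

print("mean:", run_accuracies.mean(), "std:", run_accuracies.std(ddof=1))

# Two-tailed one-sample t-test against an assumed reference accuracy,
# with a significance level of 0.05.
reference = 95.4
t_stat, p_value = stats.ttest_1samp(run_accuracies, reference)
print(f"t = {t_stat:.6f}, p = {p_value:.6f}")
print("significant" if p_value < 0.05 else "not statistically significant")
```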
The architecture was trained using a Nvidia V100 16 GB GPU and 10 CPUs
on a high-performance computer. The training took approximately 3.5 h to com-
plete for 150 epochs per independent run, which is equivalent to about 0.175 s
to process a single video.
Table 6. Results from five independent runs of the best selected hybrid architecture
Method                   FPR    FNR (%)   Precision (%)   Recall (%)   CRR (%)
Khan et al. [25]         0.00   0.2       99.8            99.7         99.8
Proposed architecture    0.01   2.5       97.5            97.5         99.5
Table 9. Confusion matrix obtained by the proposed architecture on the KTH dataset.
References
1. Chen, Y., et al.: Drop an octave: reducing spatial redundancy in convolutional
neural networks with octave convolution. In: Proceedings of the IEEE International
Conference on Computer Vision, pp. 3435–3444 (2019)
2. Zhang, H.B., et al.: A comprehensive survey of vision-based human action recog-
nition methods. Sensors 19(5), 1005 (2019)
3. Kang, S.M., Wildes, R.P.: Review of action recognition and detection methods.
arXiv preprint arXiv:1610.06906 (2016)
4. Chandni, Khurana, R., Kushwaha A.K.S: Delving deeper with dual-stream CNN
for activity recognition. In: Khare, A., Tiwary, U., Sethi, I., Singh, N. (eds.) Recent
Trends in Communication, Computing, and Electronics. LNEE, vol. 524, pp. 333–
342. Springer, Singapore (2019). https://doi.org/10.1007/978-981-13-2685-1_32
5. Bilkhu, M., Ayyubi, H.: Human Activity Recognition for Edge Devices. arXiv
preprint arXiv:1903.07563 (2019)
6. Wang, L., et al.: Temporal segment networks: towards good practices for deep
action recognition. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV
2016. LNCS, vol. 9912, pp. 20–36. Springer, Cham (2016). https://doi.org/10.1007/
978-3-319-46484-8_2
7. Arif, S., Wang, J., Ul Hassan, T., Fei, Z.: 3D-CNN-based fused feature maps with
LSTM applied to action recognition. Fut. Internet 11(2), 42 (2019)
8. Song, S., Cheung, N.M., Chandrasekhar, V., Mandal, B.: Deep adaptive temporal
pooling for activity recognition. In: Proceedings of the 26th ACM International
Conference on Multimedia, pp. 1829–1837 (October 2018)
9. Baccouche, M., Mamalet, F., Wolf, C., Garcia, C., Baskurt, A.: Sequential deep
learning for human action recognition. In: Salah, A.A., Lepri, B. (eds.) HBU 2011.
LNCS, vol. 7065, pp. 29–39. Springer, Heidelberg (2011). https://doi.org/10.1007/
978-3-642-25446-8_4
10. Ullah, M., Ullah, H., Alseadonn, I.M.: Human action recognition in videos using
stable features. Sig. Image Process. Int. J. (SIPIJ) 8(6), 1–10 (2017)
11. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep con-
volutional neural networks. In: Advances in Neural Information Processing Sys-
tems, pp. 1097–1105 (2012)
12. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition. In: 3rd International Conference on Learning Representations
(2015)
13. Iandola, F.N., Han, S., Moskewicz, M.W., Ashraf, K., Dally, W.J., Keutzer, K.:
SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model
size. arXiv preprint arXiv:1602.07360 (2016)
14. Han, S., Mao, H., Dally, W.: Deep compression: compressing deep neural network
with pruning, trained quantization and Huffman coding. In: 4th International Con-
ference on Learning Representations (2016)
15. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition.
CoRR abs/1512.03385 (2015)
16. Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., Brox, T.: Flownet 2.0:
evolution of optical flow estimation with deep networks. In: Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, pp. 2462–2470
(2017)
17. Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions: a local SVM app-
roach. In: 2004 Proceedings of the 17th International Conference on Pattern Recog-
nition, ICPR 2004, vol. 3, pp. 32–36. IEEE (August 2004)
18. Shi, Y., Tian, Y., Wang, Y., Huang, T.: Sequential deep trajectory descriptor for
action recognition with three-stream CNN. IEEE Trans. Multimedia 19(7), 1510–
1520 (2017)
19. Shi, Y., Zeng, W., Huang, T., Wang, Y.: Learning deep trajectory descriptor for
action recognition in videos using deep neural networks. In: 2015 IEEE Interna-
tional Conference on Multimedia and Expo (ICME), pp. 1–6. IEEE (June 2015)
20. Dalal, N., Triggs, B., Schmid, C.: Human detection using oriented histograms of
flow and appearance. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006.
LNCS, vol. 3952, pp. 428–441. Springer, Heidelberg (2006). https://doi.org/10.
1007/11744047_33
21. Scovanner, P., Ali, S., Shah, M.: A 3-dimensional sift descriptor and its application
to action recognition. In: Proceedings of the 15th ACM International Conference
on Multimedia, pp. 357–360. ACM (September 2007)
22. Caetano, C., dos Santos, J.A., Schwartz, W.R.: Optical flow co-occurrence matri-
ces: a novel spatiotemporal feature descriptor. In: 2016 23rd International Confer-
ence on Pattern Recognition (ICPR), pp. 1947–1952. IEEE (December 2016)
23. Al-Akam, R., Paulus, D.: Dense 3D optical flow co-occurrence matrices for human
activity recognition. In: Proceedings of the 5th International Workshop on Sensor-
Based Activity Recognition and Interaction, p. 16. ACM (September 2018)
24. Samir, H., El Munim, H.E.A., Aly, G.: Suspicious human activity recognition using
statistical features. In: 2018 13th International Conference on Computer Engineer-
ing and Systems (ICCES), pp. 589–594. IEEE (December 2018)
25. Khan, M.A., Akram, T., Sharif, M., Javed, M.Y., Muhammad, N., Yasmin, M.: An
implementation of optimized framework for action classification using multilayers
neural network on selected fused features. Pattern Anal. Appl. 22(4), 1377–1397
(2019)
26. Qi, M., Wang, Y., Qin, J., Li, A., Luo, J., Van Gool, L.: stagNet: an attentive
semantic RNN for group activity and individual action recognition. In: IEEE Trans-
actions on Circuits and Systems for Video Technology (2019)
27. Jaouedi, N., Boujnah, N., Bouhlel, M.S.: A new hybrid deep learning model for
human action recognition. J. King Saud Univ. Comput. Inf. Sci 32(4), 447–453
(2020)
28. Tong, M., Wang, H., Tian, W., Yang, S.: Action recognition new framework
with robust 3D-TCCHOGAC and 3D-HOOFGAC. Multimedia Tools Appl. 76(2),
3011–3030 (2017)
29. Shao, L., Liu, L., Yu, M.: Kernelized multiview projection for robust action recog-
nition. Int. J. Comput. Vis. 118(2), 115–129 (2016)
DRICORN-K: A Dynamic RIsk
CORrelation-driven Non-parametric
Algorithm for Online Portfolio Selection
1 Introduction
Online Portfolio Selection is regarded as a fundamental problem at the inter-
section of Computer Science and Finance. Online Portfolio Selection algorithms
is $x_{t,i} = \dfrac{P_{t,i}}{P_{t-1,i}}$, where $P_{t,i}$ is the closing price of the $i$th asset in the $t$th trading period. Define a market window $X_{t-w}^{t} = (x_{t-w}, \ldots, x_t)$, where $w$ is the given window size. Next, define a portfolio $b_t = (b_{t,1}, \ldots, b_{t,m})$, where $b_{t,i}$ is the proportion of the portfolio invested in the $i$th asset for the $t$th trading period, with $b_{t,i} \geq 0$ and $\sum_i b_{t,i} = 1$. Thus the total return after $t$ trading periods is defined as $S_t = \prod_{j=1}^{t} b_j \cdot x_j$.
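These definitions translate directly into code; the following sketch (array shapes and the toy prices are illustrative assumptions) computes the price relatives and the cumulative wealth $S_t$.

```python
import numpy as np

def price_relatives(prices):
    """x[t, i] = P[t, i] / P[t-1, i] for a (T, m) matrix of closing prices."""
    return prices[1:] / prices[:-1]

def cumulative_wealth(portfolios, relatives):
    """S_t = prod_j (b_j . x_j) for matching (T, m) portfolio/relative arrays."""
    period_returns = np.einsum('tm,tm->t', portfolios, relatives)
    return np.prod(period_returns)

# Toy usage: two assets, uniform constant-rebalanced portfolio.
prices = np.array([[10.0, 20.0], [10.5, 19.0], [11.0, 19.5]])
x = price_relatives(prices)
b = np.full_like(x, 0.5)
print(cumulative_wealth(b, x))
```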
An Online Portfolio Selection algorithm, $A$, specifies the sequence of portfolios $B = (b_1, \ldots, b_n)$ which aims to maximise $S_n$ given a set of conditions based on the chosen performance measure. Examples of performance measures include the Sharpe ratio [3–5,8,9], maximum drawdown [5,8,9], and annual percentage yield or total cumulative wealth [4,8,10]. The maximisation of the performance measure is executed by learning each $b_t$ sequentially at the beginning of period $t$, based on the market window $X_{t-w}^{t-1}$. The decision criterion for choosing a specific $b_t$ is determined by the implemented algorithm, $A$.
The following assumptions are made with regard to the above Online Port-
folio Selection problem [2]:
3 Pattern-Matching Approaches
CORN-K then searches for the optimal portfolio, $E_t(w, \rho)$, based on $C_t(w, \rho)$ as follows:

$E_t(w, \rho) = \arg\max_{b \in \Delta_m} \prod_{i \in C_t(w, \rho)} (b \cdot x_i)$  (2)

After obtaining the set of all experts for a given trading period $t$ (i.e. $E_t(w, \rho)\ \forall w, \rho$), CORN-K creates an ensemble of the TOP-K experts according to the following formula:

$b_t = \dfrac{\sum_{w,\rho} q(w, \rho)\, s_{t-1}(w, \rho)\, E_t(w, \rho)}{\sum_{w,\rho} q(w, \rho)\, s_{t-1}(w, \rho)}$  (3)

where $q(w, \rho)$ represents the probability distribution function, and $s_{t-1}(w, \rho) = \prod_{j=1}^{t-1} E_t(w, \rho) \cdot x_j$ represents the total wealth achieved by $E_t(w, \rho)$.
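A sketch of the ensemble step in Eq. (3) is given below; the array-based interface, the selection of the TOP-K experts by historical wealth, and the K fraction are assumptions made for illustration.

```python
import numpy as np

def corn_k_combine(expert_portfolios, expert_wealth, q, k_fraction=0.1):
    """Sketch of the TOP-K ensemble step of Eq. (3) (assumed array interface).

    expert_portfolios: (n_experts, m) portfolios E_t(w, rho) for period t.
    expert_wealth:     (n_experts,) historical wealth s_{t-1}(w, rho).
    q:                 (n_experts,) weights q(w, rho).
    Only the top K fraction of experts by historical wealth contribute.
    """
    n_experts = len(expert_wealth)
    k = max(1, int(np.ceil(k_fraction * n_experts)))
    top = np.argsort(expert_wealth)[-k:]            # indices of the TOP-K experts
    weights = q[top] * expert_wealth[top]
    return weights @ expert_portfolios[top] / weights.sum()
```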
Wang et al. [9] extended the CORN-K algorithm by penalising risky portfolios
during portfolio optimisation to create a risk-averse CORN-K algorithm. The
risk penalty is measured by the standard deviation of returns of the portfolio
under question. They modify Eq. (2), opting to use the log of cumulative returns
instead, and create each expert as follows:
$E_t(w, \rho, \lambda) = \arg\max_{b \in \Delta_m} \left[\dfrac{\sum_{i \in C_t(w,\rho)} \log b^T x_i}{|C_t(w, \rho)|} - \lambda\, \sigma_t(w, \rho)\right]$  (4)

where $\lambda$ is their risk-aversion coefficient, $|C_t(w, \rho)|$ is the size of $C_t(w, \rho)$, and $\sigma_t(w, \rho) = \mathrm{std}\!\left(\log b^T x_i \mid i \in C_t(w,\rho)\right)$ is their risk measure.
Although [9] presents improved results over CORN-K in volatile markets,
the risk-averse CORN-K algorithm can result in an overly-conservative approach
that avoids exploiting upside risk. In particular, it may be beneficial to increase
portfolio risk when markets are bullish.
In financial theory, risk is decomposed into systematic and idiosyncratic com-
ponents. Idiosyncratic risk is risk experienced by a specific company or industry,
and can be diversified away. Systematic risk is inherent in the entire market and
is undiversifiable. Thus, the risk of a diversified portfolio is almost entirely due to
market movements, which can be exploited by adjusting the portfolio’s market
sensitivity according to market conditions.
4.2 DRICORN-K
We extend the risk-averse CORN-K algorithm to exploit upside risk while hedg-
ing downside risk. This is achieved by considering an alternative risk measure,
beta (β), which reflects the sensitivity of a portfolio to the overall market.
DRICORN-K penalises high-beta portfolios when the market is bearish, and
rewards high-beta portfolios when the market is bullish.
The two key components of DRICORN-K are measuring the market sensitiv-
ity associated with a portfolio, and determining the current market conditions.
$\beta_b = \dfrac{\mathrm{cov}(R_b, R_m)}{\mathrm{var}(R_m)}$  (5)
where Rb and Rm are the daily returns on b and the market portfolio respec-
tively. Thus, the β can be interpreted in the same manner as a regression coef-
ficient in a linear regression model. That is, β indicates the magnitude and
direction in which the portfolio moves relative to the market.
We incorporate β by extending the objective function in the portfolio opti-
misation step, based on the current market condition, as follows:
$E_t(w, \rho, \lambda) = \arg\max_{b \in \Delta_m} \left[\prod_{i \in C_t(w, \rho)} (b \cdot x_i) \pm \lambda \beta_b\right]$  (6)
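The following sketch illustrates Eqs. (5) and (6) for a given weight vector; the boolean bullish flag and the default value of λ are illustrative assumptions, and the optimisation over the simplex used by the actual algorithm is not shown.

```python
import numpy as np

def portfolio_beta(weights, asset_returns, market_returns):
    """beta_b = cov(R_b, R_m) / var(R_m), Eq. (5), for daily return series."""
    r_b = asset_returns @ weights
    cov = np.cov(r_b, market_returns)
    return cov[0, 1] / cov[1, 1]

def dricorn_objective(weights, similar_relatives, asset_returns,
                      market_returns, lam=0.001, bullish=True):
    """Sketch of Eq. (6): reward beta in bullish markets, penalise it otherwise."""
    wealth = np.prod(similar_relatives @ weights)   # product over C_t(w, rho)
    beta = portfolio_beta(weights, asset_returns, market_returns)
    return wealth + lam * beta if bullish else wealth - lam * beta
```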
Market classification approaches and the corresponding market movement (Decline / Rise / Stationary):

– Price Changes (price change >20% over the past month, or price change >20% from a peak or trough over the past two months): Decline = negative price change; Rise = positive price change; Stationary = no price change.
– Current Moving Average (CMA) vs. Lagged Moving Average (LMA) (uniform, arithmetic or exponential weighting): Decline = CMA < LMA; Rise = CMA > LMA; Stationary = CMA = LMA.
– Moving Linear Regression (uniform or exponential weighting): Decline = negative gradient; Rise = positive gradient; Stationary = zero gradient.
5 Datasets
5.1 Training
where each $M_t$ represents an element of the price relative vector, and each $c_i$ is an empirically determined coefficient. This model was used to generate a synthetic price relative vector over 504 trading days, $M = (M_1, \ldots, M_{504})$, which represented our simulated market.
Using this simulated market data generator, we simulated the price relative
vectors of 28 virtual stocks. In order to ensure that our training dataset would
exhibit DRICORN-K’s functionality, we required 75% of the stocks to be posi-
tively correlated with the simulated market, and the other 25% to be negatively
correlated using the following formula:
$\mathrm{stock}_i = \gamma_i M + \epsilon_i$  (8)
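A minimal sketch of this generation step is given below; the ranges of the γ coefficients, the noise scale and the placeholder market series are assumptions, since only the structure of Eq. (8) and the 75%/25% correlation split are specified above.

```python
import numpy as np

def simulate_stocks(market, n_stocks=28, pos_fraction=0.75,
                    noise_scale=0.01, seed=0):
    """stock_i = gamma_i * M + epsilon_i (Eq. 8), with assumed gamma/noise scales."""
    rng = np.random.default_rng(seed)
    n_pos = int(round(pos_fraction * n_stocks))
    # Positive gamma -> positively correlated with the market; negative otherwise.
    gammas = np.concatenate([rng.uniform(0.5, 1.5, n_pos),
                             -rng.uniform(0.5, 1.5, n_stocks - n_pos)])
    noise = rng.normal(0.0, noise_scale, size=(n_stocks, len(market)))
    return gammas[:, None] * market[None, :] + noise

# Usage sketch with a placeholder simulated market of 504 price relatives.
market = 1.0 + np.random.default_rng(1).normal(0, 0.01, 504)
stocks = simulate_stocks(market)   # shape (28, 504)
```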
5.2 Testing
We tested DRICORN-K on the five real world indices, the JSE Top 40, Bovespa,
DAX, DJIA and Nikkei 225, described in Table 2.
For each index, we downloaded individual stock datasets on the available top
30 index constituents, as of September 2020, from Yahoo Finance1 . These time
series were combined, cleaned and then used to create the price relative vectors
required by the algorithms.
The datasets were also chosen to interrogate the performance of DRICORN-
K, and the other algorithms, under various markets and market conditions. For
each dataset, we selected a time period in which the index displayed a certain
trend or pattern we wished to explore.
1 https://finance.yahoo.com.
The JSE Top 40 index displayed an oscillatory pattern during the selected
timeframe. It contained upward and downward trends, but overall, it traded
sideways.
The Bovespa index displayed a general upward trend throughout its time-
frame with a sharp decline, and subsequent recovery, around March 2020.
The DAX index displayed an overall downward trend across the selected
timeframe. This is indicative of a bear market.
Although it contained short market rises and declines, the DJIA index traded
relatively sideways during the selected time period.
The Nikkei 225 index displayed peaks and troughs during its timeframe.
There was a large decline at the beginning of the time period, and a large incline
towards the end of the time period.
By testing DRICORN-K, and other algorithms, on these datasets, we hope
to analyse their performance, strengths, weaknesses, and ability to generalise
across various markets experiencing different economic conditions.
6 Experiments
Price Changes: The first approach checks whether the current market price
has changed by more than 20%, compared to the previous month’s price. The
second approach checks whether the current market price has changed by more
than 20% of its highest or lowest recorded price, over the past two months. A
negative (positive) price change is classified as a declining (rising) market, while
no price change is classified as a stationary market. These methods assume that
price changes alone can indicate the current trend of the market.
6.2 Implementation
We implemented DRICORN-K, and other algorithms, using the toolbox pre-
sented by Li et al. [15]. For all implemented algorithms, we used their respective
default parameters. For DRICORN-K, we used W = 5, P = 10, K = 10%,
λ = 0.001. In our market classification method (moving linear regression with
exponential weighting) we used a smoothing factor of 0.6 and a market window
size of two months.
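The market classification rule used here (moving linear regression with exponential weighting) can be sketched as follows; the exact form of the exponential weights and the tolerance for treating a gradient as zero are assumptions of this sketch.

```python
import numpy as np

def classify_market(prices, smoothing=0.6, tol=1e-6):
    """Sketch: exponentially weighted linear regression over a market window.

    `prices` is the market index over the window (e.g. the past two months).
    Returns 'rise', 'decline' or 'stationary' from the sign of the gradient.
    """
    t = np.arange(len(prices), dtype=float)
    # Exponential weights: the most recent observations are weighted highest.
    weights = smoothing ** (len(prices) - 1 - t)
    slope = np.polyfit(t, prices, deg=1, w=weights)[0]
    if slope > tol:
        return 'rise'
    if slope < -tol:
        return 'decline'
    return 'stationary'
```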
where $SR_n$ is the annualised Sharpe Ratio after $n$ periods, $APY_n$ is the Annualised Percentage Yield (see note below), $R_f$ is the risk-free rate of return, and $\sigma_p$ is the annualised standard deviation of daily returns.
To calculate $SR_n$, we set $R_f = 4\%$. To obtain $\sigma_p$, we multiplied the calculated standard deviation of daily returns by $\sqrt{252}$, as we assumed that there is an average of 252 trading days in a year.
• Note: Annualised Percentage Yield (APY)

$APY_n = (S_n)^{1/y} - 1$  (11)

where $S_n$ is the total return after $n$ trading periods, and $y$ is the number of years corresponding to $n$. APY is the rate of return achieved, taking into account the impact of compounding. Generally, the higher the APY, the more
preferable the Online Portfolio Selection algorithm.
The Sharpe Ratio measures risk-adjusted return. Generally, the higher the annu-
alised Sharpe Ratio, the more preferable the Online Portfolio Selection algo-
rithm.
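The two measures can be computed as in the sketch below; the combination $(APY_n - R_f)/\sigma_p$ follows the variables listed above, and the toy return series is a placeholder.

```python
import numpy as np

def apy(total_return, years):
    """APY_n = S_n ** (1 / y) - 1 (Eq. 11)."""
    return total_return ** (1.0 / years) - 1.0

def annualised_sharpe(daily_returns, total_return, years, risk_free=0.04):
    """Annualised Sharpe ratio: (APY - R_f) / annualised std of daily returns."""
    sigma_p = np.std(daily_returns, ddof=1) * np.sqrt(252)
    return (apy(total_return, years) - risk_free) / sigma_p

# Toy usage: 60% total return over 2 years with placeholder daily returns.
daily = np.random.default_rng(0).normal(0.001, 0.01, 504)
print(apy(1.6, 2), annualised_sharpe(daily, 1.6, 2))
```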
– Maximum Drawdown (MDD)
Table 3. Results: performance measures for each of the twelve algorithms compared
to DRICORN-K in five markets with varying conditions
6.4 Algorithms
6.5 Results
The results of these experiments can be seen in Table 3, where the performance
measures for each algorithm tested on the five datasets are displayed. The top
two performances on each dataset are marked in bold. In Fig. 1 we display the cumulative returns achieved by UCRP, CORN-K and DRICORN-K across the five datasets.
Fig. 1. Cumulative returns achieved by UCRP, CORN-K and DRICORN-K across five datasets.
7 Conclusion
References
1. Paskaramoorthy, A.B., Gebbie, T.J., van Zyl, T.L.: A framework for online invest-
ment decisions. Invest. Anal. J. 49, 215–231 (2020)
2. Li, B., Hoi, S.: Online portfolio selection: a survey. ACM Comput. Surv. 46, 12
(2012)
3. Agarwal, A., Hazan, E., Kale, S., Schapire, R.E.: Algorithms for portfolio man-
agement based on the newton method. In: Proceedings of the 23rd International
Conference on Machine Learning, pp. 9–16 (2006)
4. Borodin, A., El-Yaniv, R., Gogan, V.: Can we learn to beat the best stock. J. Artif.
Intell. Res. 21, 579–594 (2004)
5. Li, B., Zhao, P., Hoi, S.C.H., et al.: PAMR: passive aggressive mean reversion
strategy for portfolio selection. Mach. Learn. 87, 221–258 (2012). https://doi.org/
10.1007/s10994-012-5281-z
6. Györfi, L., Lugosi, G., Udina, F.: Nonparametric kernel-based sequential invest-
ment strategies. Math. Finan. Int. J. Math. Stat. Finan. Econ. 16(2), 337–357
(2006)
7. Györfi, L., Udina, F., Walk, H., et al.: Nonparametric nearest neighbor based
empirical portfolio selection strategies. Stat. Decis. 26(2), 145–157 (2008)
8. Li, B., Hoi, S.C.H., Gopalkrishnan, V.: CORN: correlation-driven nonparametric
learning approach for portfolio selection. ACM Trans. Intell. Syst. Technol. (TIST)
2(3), 1–29 (2011)
9. Wang, Y., Wang, D., Zheng, T.F.: Racorn-k: risk-aversion pattern matching-based
portfolio selection. In: 2018 Asia-Pacific Signal and Information Processing Asso-
ciation Annual Summit and Conference (APSIPA ASC), pp. 1816–1820. IEEE
(2018)
10. Das, P., Banerjee, A.: Meta optimization and its application to portfolio selection.
In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, pp. 1163–1171 (2011)
11. Gooding, A.E., O’Malley, T.P.: Market phase and the stationarity of beta. J.
Financ. Quant. Anal. 12(5), 833–857 (1977)
12. Snow, D.: Machine learning in asset management–part 1: portfolio construction–
trading strategies. J. Financ. Data Sci. 2(1), 10–23 (2020)
13. Snow, D.: Machine learning in asset management–part 2: portfolio construction–
weight optimization. J. Financ. Data Sci. 2(1), 10–23 (2020)
14. James, F.E.: Monthly moving averages-an effective investment tool? J. Financ.
Quant. Anal. 3(3), 315–326 (1968)
15. Li, B., Sahoo, D., Hoi, S.C.H.: OLPS: a toolbox for on-line portfolio selection. J.
Mach. Learn. Res. 17(1), 1242–1246 (2016)
16. Sharpe, W.F.: Mutual fund performance. J. Bus. 39(1), 119–138 (1966)
Knowledge Representation
and Reasoning
Cognitive Defeasible Reasoning:
the Extent to Which Forms of Defeasible
Reasoning Correspond with Human
Reasoning
1 Introduction
It is well documented that human reasoning exhibits a flexibility considered key to intelligence [21], yet fails to conform to the prescriptions of the classical or propositional logic used in the Artificial Intelligence (AI) community [25]. The AI community therefore seeks to incorporate such flexibility in its work [21]. Non-classical, or non-monotonic, logic is flexible by nature. Whereas classical reasoning suffices to describe systems that compute their output in an efficient way, human reasoning is non-classical, because humans are known to reason in different ways [25].
The problem is that non-monotonic reasoning schemes have been developed for
and tested on computers, but not on humans. There is a need to investigate
whether there exists a correspondence between non-monotonic reasoning and
human reasoning and, if so, to what extent it exists. This problem is impor-
tant because we can gain insight into how humans reason and incorporate this
into building improved non-monotonic AI systems. An issue which needs to be
considered is that humans are diverse subjects: some reason normatively while
others reason descriptively. In the case of normative reasoning, a reasoner would
conclude that a certain condition should be the case or that the condition is
usually the case. In the case of descriptive reasoning, a reasoner would make a
bold claim that a certain condition is exactly true or exactly false. We emphasise
that a thorough investigation needs to be done to determine the extent of the
correspondence between non-monotonic reasoning and how humans reason.
We propose this work as a contribution towards solving this problem. While
acknowledging that this work falls within a broader research paradigm towards
this goal [21,24,25], what differentiates this work is that it is, to our knowledge,
the first work with an explicit view towards testing each of these particular formal
non-monotonic frameworks: KLM defeasible reasoning [13], AGM belief revision
[1], and KM belief update [11]. We report on these frameworks in a paper due
to the close theoretical links between the frameworks’ domains. Postulates for
defeasible reasoning and belief revision may be translated from the one context
to the other [5]. Using such translations, KLM defeasible reasoning [13] can
be shown to be the formal counterpart of AGM belief revision [8]. This does
not hold for KM belief update [11]. Belief update is commonly considered a
necessarily distinct variant of belief revision for describing peoples’ beliefs in
certain domains [11].
In Sect. 2, we describe related work and the formalisms of non-monotonic
reasoning under investigation in our study. We end this section with our problem
statement. In Sect. 3, we describe the design and implementation of three distinct
surveys, one for each formalism of non-monotonic reasoning in our study. Each
survey seeks to determine the extent of correspondence between the postulates
of that formalism and human reasoning. In Sect. 4, we describe the methods used
to analyse our survey results. We present our results, discussion and conclusions
in Sect. 5. Lastly, we propose the track for future work in Sect. 6.
2 Background
Humans are known to reason differently about situations in everyday life and
this reasoning behaviour can be compared to the paradigm of non-monotonic
reasoning in AI. Non-monotonic reasoning is the study of those ways of inferring
additional information from given information that does not satisfy the mono-
tonicity property, which is satisfied by all methods based on classical logic [13].
Said otherwise, non-monotonic logic fails the principle that whenever x follows
from a set A of propositions then it also follows from every set B with B ⊆ A
[18]. With non-monotonic reasoning, a conclusion drawn about a particular sit-
uation does not always hold i.e. in light of newly gained, valid information,
previously valid conclusions have to change. This type of reasoning is described
in the context of AI [23]. We consider three forms of non-monotonic reasoning,
namely defeasible reasoning, belief revision and belief update. The latter two are
both forms of belief change [11], wherein there exists a belief base and a belief
set [6]. Explicit knowledge the agent has about the world resides in the belief
base, whereas both the explicit knowledge the agent has about the world and
the inferences derived from it reside in the belief set.
it [11,19]. Information is then taken into account by selecting the models of the
new information closest to the models of the base, where a model of information
μ is a state of the world in which μ is true [11]. An example of this reasoning
pattern will now be described. Consider the same statements used above in the
defeasible reasoning example. Using the reasoning pattern of belief revision, we
can infer from our beliefs that Alice does pay tax. Suppose we now receive new
information: Alice does not pay tax. This is inconsistent with our belief base, so a
decision must be made regarding which beliefs to retract prior to adding the new
information into our beliefs. We could revise our beliefs to be that employees pay
tax and Alice does not pay tax. In [4], this decision is proposed to be influenced
by whether we believe some statements more strongly than others. In [1], it is
proposed to be influenced by closeness (the concept of minimal change), in that
we aim to change as little about our existing knowledge as we can do without
having conflicting beliefs.
In belief update, conflicting information is seen as reflecting the fact that the
world has changed, without the agent being wrong about the past state of the
world. To get an intuitive grasp of the distinction between belief update and
revision, take the following example adapted from [11]. Let b be the proposition
that the book is on the table, and m be the proposition that the magazine is on
the table. Say that our belief set includes (b ∧ ¬m) ∨ (¬b ∧ m), that is the book
is on the table or the magazine is on the table, but not both. We send a student
in to report on the state of the book. She comes back and tells us that the book is
on the table, that is b. Under the AGM [1] postulates for belief revision proposed
in [1], we would be warranted in concluding that b ∧ ¬m, that is, the book is on
the table and the magazine is not. But consider if we had instead asked her to
ensure that the book was on the table. After reporting, we again are faced with
the new knowledge that b. This time adding the new knowledge corresponds
to the case of belief update. And here it seems presumptuous to conclude that
the magazine is not on the table [11]. Either the book was already on the table
and the magazine was not, in which case the student would have done nothing
and left, or the magazine was on the table and the book not, in which case the
student presumably would have simply put the book on the table and left the
magazine similarly so. As these examples are formally identical, there is a need
for different formalisms to accommodate both cases.
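The contrast can be made concrete with a small propositional sketch; using Hamming distance between truth assignments, a Dalal-style revision operator and a Winslett-style pointwise update are specific modelling choices made here for illustration, since the AGM and KM frameworks themselves are axiomatic and do not fix these operators.

```python
from itertools import product

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

def revise(belief_models, new_models):
    # Revision sketch: keep the models of the new information that are
    # globally closest to the belief set as a whole.
    best = min(min(hamming(w, k) for k in belief_models) for w in new_models)
    return {w for w in new_models
            if min(hamming(w, k) for k in belief_models) == best}

def update(belief_models, new_models):
    # Update sketch: update every model of the belief set individually,
    # then take the union of the results.
    result = set()
    for k in belief_models:
        best = min(hamming(w, k) for w in new_models)
        result |= {w for w in new_models if hamming(w, k) == best}
    return result

# Worlds are (book_on_table, magazine_on_table). Belief: exactly one is true.
belief = {(1, 0), (0, 1)}
# New information: the book is on the table.
book = {w for w in product((0, 1), repeat=2) if w[0] == 1}

print(revise(belief, book))  # {(1, 0)}           -> book and not magazine
print(update(belief, book))  # {(1, 0), (1, 1)}   -> magazine left undetermined
```

On these inputs, revision concludes that the magazine is not on the table, while update leaves the magazine's position open, matching the intuition in the example above.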
3 Implementation
In this section, we describe the design and implementation of three surveys: one
each for defeasible reasoning, belief revision, and belief update. We also describe
our implementation strategy and expected challenges. Finally, we document our
testing and evaluation strategy. The major reason for our choice of the survey
as a testing instrument was its ease of integration with Mechanical Turk, which
was the channel we had chosen for sourcing our participants. Moreover, the web-
based survey is a common tool used in sociological research, such that “it might
be considered an essential part of the sociological toolkit” [32]. Future work may
look towards testing our research questions in a non-survey environment.
Survey 2. The questions in this survey were developed to test whether postu-
lates of a specific formalisation of the process of belief revision feature in cogni-
tive reasoning. The formalisation used is that of the eight-postulate approach as
proposed by Alchourrón, Gärdenfors and Makinson (AGM) [1]. We refer to the
eight-postulate approach as the AGM [1] postulates of Closure, Success, Inclu-
sion, Vacuity, Consistency, Extensionality, Super-expansion and Sub-expansion,
included in Appendix A for reference. Two types of questions were developed:
concrete and abstract. This involved designing scenarios in which to ground the
concrete questions. Five such scenarios were designed. Abstract questions were
developed directly based on the formal postulates. The abstract questions were
included to test the postulates without having the agent’s knowledge of the
world hindering their answers and to have questions which are less semantically
loaded [16] than real-world concrete questions. The benefit of abstract examples
is further discussed by Pelletier and Elio [21]. The concrete questions started as
abstract representations explicitly requiring the application of one or some of the
formal postulates to obtain the desired answer. These representations were then
Survey 3. The questions in this survey were developed to test the KM approach
[11] to belief update. The KM [11] postulates we used are included as postulates
U1, U2, U3, U4, U5, U6, U7 and U8 in Appendix A. These postulates mir-
rored the eight-postulate approach for belief revision, with the core difference
between the postulates for revision and the postulates for update being the type
of knowledge referred to: static knowledge for revision and dynamic knowledge
for update. The questions in this survey were broken into three sets. The first
consisted of abstract questions, in which the KM [11] postulates were presented
and participants were asked to rate their agreement with the postulates on a lin-
ear or Likert scale with extremal points “strongly agree” and “strongly disagree”.
The postulates were presented using non-technical language. The second set of
questions were concrete questions that were meant to be confirming instances of
each of the eight KM postulates, where participants were asked to answer either
Yes or No, and motivate their answer. The third set followed the same format
as the second but was meant to present counter-examples to the postulates,
with the counter-examples largely sourced from the literature. The first counter-
example was based on the observation that updating p by p ∨ q does not affect
the KM approach [9], which seems counter-intuitive. The second was based on
the observation that updating by an inclusive disjunction leads to the exclusive
disjunction being believed in the right conditions (a modification of the checker-
board example in [9]), which again seems counter-intuitive. The third was based
on the observation that sometimes belief revision semantics seem appropriate in
cases corresponding to the way that belief update is commonly, and has been
here, presented in [15]. The final is an example testing a counter-intuitive result
of treating equivalent sentences as leading to equivalent updates.
satisfy. For the defeasible reasoning survey, Workers were required to be Mas-
ter Workers, a qualification assigned by MTurk to top Workers who consistently
submit high-quality results. Workers were required to have a HIT Approval Rate
(%) for all Requesters’ HITs ≥ 97, and have more than 0 HITs approved. For
the belief revision survey, two MTurk qualifications and one internal qualifi-
cation were used to recruit participants. Workers were required to have a HIT
Approval Rate (%) for all Requesters’ HITs > 98, and have more than 5000
HITs approved. The required number of HITs approved was varied, between
1000 and 5000, to allow for a diverse sample of respondents. We created one
internal qualification to ensure that the 30 respondents were unique across all
of the published batches of the survey. This qualification was called Completed
my survey already and assigned to Workers which have submitted a response
in a previous batch, including the batch of the trial HIT. For the belief update
survey, a single qualification was used: only Master Workers were allowed to
participate in the survey.
Each of our surveys was evaluated by a group of both laypeople and experts for clarity. Each of our surveys was also published on MTurk as a trial HIT. The
results of the trial HITs were used to gauge how Turkers might respond to the
final survey.
Trial HITs. A trial of the surveys was conducted, (i) to gain familiarity with
the MTurk service and platform and (ii) to test the survey and its questions
on a sample of Turkers. It involved three separate postings of the survey links
as HITs on the site, each requiring five responses. The HIT was created with
certain specifications accordingly. Workers were compensated R30 (above the
South African hourly minimum wage) for completing the tasks, and the tasks
included a time estimate, all of which were under an hour. We did not restrict
workers by location, but required that they should have completed a certain
number of HITs previously, and have a certain approval rating (≥95%) for their
tasks, as recommended by Amazon to improve response quality [29]. A Turker’s
approval rating refers to the percentage of their tasks that have been approved or
accepted by the Requesters who published them. Based on the results from the
trial survey, changes were made for the final experiments. The changes included
increasing both the compensation and the estimated completion time.
Ethical issues are those which require a choice to be made between options based
on whether they evaluate as ethical or unethical. Professional issues here refer to
those which pertain to ethical standards and rules that the profession of Com-
puter Science has for its members, particularly with respect to research. Ethical
and professional issues thus overlap. Legal issues refer to those which involve
the law. As this project involved experiments with people, ethical clearance was
obtained from the University of Cape Town Faculty of Science Human Research
Ethics Committee before proceeding with the experiments. The primary issue
in the experiments was the use of MTurk, in particular, whether Workers were
being paid a fair wage for their work. Per [2], the following three steps were
taken to mitigate these concerns. First, workers were paid more than the South
African minimum wage for an hour’s work. Second, in the title of the task, the
estimated amount of time needed for the task was clearly stated. Finally, there
is a section in the survey which gives an overview of what the research concerns,
placing the work in context. Workers were also required to give their informed
consent to participate in the study. This was achieved by having a consent form
at the start of the survey, whereby workers could either agree to participate in
the research and then continue to the rest of the survey, or they could decline to
participate and be thanked for their time. Contact details of the researchers were
also provided. Before the data-handling, all survey responses were anonymised.
We also did not collect names, cellphone numbers or email addresses from our
participants. The only personal contact information we collected from each par-
ticipant was their Amazon Turk WorkerID. Our survey questions, raw collected data and the codebooks used for data analysis are available in our GitHub repository, referenced in Appendix A.
4 Methods of Analysis
Responses were rejected if the participant failed the checkpoint section in the
survey. In our analysis, we reference applying a baseline of 50% to our results.
The choice of 50% as a baseline was arbitrary, but it served as a tool to evaluate
the meaning of our results. As a starting point for evaluation and a baseline for
agreement, it was basic and could be improved upon in future work.
4.1 Survey 1
The defeasible reasoning survey had 30 responses, which were downloaded from
Google Forms. One response was rejected due to the participant submitting
twice. Coding of participant responses was performed using Microsoft Excel
functions. The coding spreadsheet is included in our Github repository, refer-
enced in Appendix A. For this survey, we assumed that the KLM [13] postulate
of Reflexivity, the idea that a proposition x defeasibly entails itself, holds for
all human reasoners and therefore it was not tested. Feedback from our super-
visor indicated that a few survey questions were not appropriate models of the
KLM [13] postulates they intended to test, as they used the word some in the
conclusion. These were questions 6 and 7, referring to the KLM [13] postulates
of Right Weakening and And, respectively. In the following, we state question 6
as it was presented in the survey, as an example, to clarify. The given information
was presented as a numbered list and the conclusion was phrased as a question.
Question 6, testing Right Weakening, asked: given i) no police dogs are vicious
and ii) highly trained dogs are typically police dogs, can you conclude that some
highly trained dogs are vicious? We draw the reader’s attention to the fact that
the word some is not part of the definition of the KLM [13] postulates. Thus,
we have removed the responses to these questions in our analysis of the results.
4.2 Survey 2
Quantitative Data. The modal answer and hit rate (%) for closed questions
were calculated by applying Microsoft Excel functions to the data. A hit indi-
cates success; in this context, success means that the respondent’s answer matches
the answer obtained by applying the belief revision postulates. The hit rate for
each question is thus calculated as (number of successes / number of responses) × 100.
The analysis of the results employs a baseline hit rate of 50% to indicate overall success.
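A minimal sketch of this calculation (our own illustration, not the authors' Excel formulas; the function name and the sample data are assumptions):

def hit_rate(responses, postulate_answer):
    # A hit is a response that matches the answer prescribed by the postulate.
    hits = sum(1 for r in responses if r == postulate_answer)
    return 100 * hits / len(responses)

# Example: 30 yes/no answers to a question whose postulate-prescribed answer is 'yes'.
responses = ['yes'] * 18 + ['no'] * 12
print(hit_rate(responses, 'yes'))  # 60.0, above the 50% baseline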
Qualitative Data. Pre-determined codes were drawn from literature, based on
the theory being empirically tested. These include
the eight postulates of belief revision as proposed by Alchourrón, Gärdenfors
and Makinson [1]: closure, success, inclusion, vacuity, consistency, extensionality,
super-expansion, sub-expansion. Other pre-determined codes include: normative
and descriptive. Emerging codes are those which were not anticipated at the
beginning, or are both unusual and of interest. They are developed solely on the
basis of the data collected from respondents by means of the survey. An exam-
ple of an emerging code used in the trial of this study is It is stated. This code
represents the respondent taking a passive approach to their response. Other
examples would be real-world influence and likelihood.
Pre-determined codes normative and descriptive refer to the reasoning style
identified in responses to open questions. A normative style involves making
value judgements [20], commenting on whether something is the way it should be
or not. This includes implied judgements through the use of emotive language.
A descriptive style, in contrast, does not; it involves making an observation,
commenting on how something is [20].
4.3 Survey 3
Quantitative Data. The belief update survey had 34 participants, of which 4
responses were rejected. For the quantitative data, two forms of analysis were
chosen, corresponding to the two different forms of quantitative data (ordinal
and binary) gathered. For the ordinal (Likert-type) data, the median is an appro-
priate measure of central tendency [28], and thus was chosen, and for the binary
data, the hit rate as above was chosen. Relating this back to the research question,
a postulate was considered confirmed if it had both a hit rate ≥50% for the
confirming concrete example and a median value of agree or better.
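The criterion can be summarised in a short sketch (our own illustration; the Likert coding and the sample responses are assumptions, not the authors' data):

from statistics import median

LIKERT = {'strongly disagree': 1, 'disagree': 2, 'neutral': 3,
          'agree': 4, 'strongly agree': 5}

def postulate_confirmed(binary_hits, likert_responses):
    # binary_hits: 1 for a hit on the concrete question, 0 otherwise.
    hit_rate = 100 * sum(binary_hits) / len(binary_hits)
    med = median(LIKERT[r] for r in likert_responses)
    return hit_rate >= 50 and med >= LIKERT['agree']

print(postulate_confirmed([1, 1, 0, 1],
                          ['agree', 'neutral', 'agree', 'strongly agree']))  # True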
Qualitative Data. For the qualitative data, emerging codes were developed for
Sect. 2 on a per question basis. This was so as to better interpret the quantitative
results, and, in particular with the counter-examples, to see whether the reasons
given by participants for their answers matched the theory behind the objections
as given in the literature. Similar to the belief revision case, a common code was
new information should be believed, which corresponds to the case of simply
believing new information.
The belief update postulates confirmed according to this criterion were U1, U3,
U4 and U6 (see Fig. 3). For each of the three surveys, we present additional
results that are of importance. Our surveys were designed separately and
contained slightly differing methodologies, so we have not attempted a holistic
comparison of the results. Future work might do so. Discussion of less expected
results from each survey can be found at either of the links in Appendix Sect.
A.1, in the respective individual papers.
abstract 36.67%) and Consistency (concrete 50%, abstract 36.67%) received hit
rates ≤50%, suggesting a negative relationship. Postulates Sub-expansion (con-
crete 76.67%, abstract 40%) and Inclusion (concrete 23.33%, abstract 60%) had
discrepancies of >30% between the hit rates for their concrete and abstract
questions, and their relationships to human reasoning were thus found to be incon-
clusive. Through an additional investigation, we found that participants have a
predominantly descriptive relationship with belief revision when postulates are
presented both in concrete and abstract form. The balance of descriptive and
normative reasoning styles of respondents in their responses became more even
for the abstract questions, perhaps suggesting an increasing reliance on perceived
rules in situations to which humans are less able to relate.
6 Future Work
Our results suggest that the models of KLM defeasible reasoning [13], AGM
belief revision [1] and KM belief update [11] are not yet a perfect fit with human
reasoning because participants failed to reason in accordance with many of the
postulates of these models. A larger participant pool is required to confirm our
results. In future work, it may be interesting to add blocks to the study, in
the form of different control groups, e.g. paid reasoners as opposed to unpaid
reasoners, to explore the effects of different circumstances on cognitive reasoning
and to determine which logical formalism is most closely resembled in each such block.
KLM Postulates. Table 1 presents the KLM postulates. For ease of compari-
son, we present the postulates translated in a manner similar to [27]. We write
Cn (S) to represent the smallest set closed under classical consequence containing
all sentences in S, and DC (S) to represent the resulting set if defeasible conse-
quence is used instead. DC (S) is assumed defined only for finite S. Cn (α) is an
abbreviation for Cn ({α}), and DC (α) is an abbreviation for DC ({α}).
1 Reflexivity α ∈ DC (α)
2 Left Logical Equivalence If α ≡ φ then DC (α) = DC (φ)
3 Right Weakening If α ∈ DC (φ) and γ ∈ Cn (α) then γ ∈ DC (φ)
4 And If α ∈ DC (φ) and γ ∈ DC (φ) then α ∧ γ ∈ DC (φ)
5 Or If α ∈ DC (φ) and α ∈ DC (γ) then α ∈ DC (φ ∨ γ)
6 Cautious Monotonicity If α ∈ DC (φ) and γ ∈ DC (φ) then γ ∈ DC (φ ∧ α)
Reflexivity states that if a formula is satisfied, it follows that the formula can
be a consequence of itself. Left Logical Equivalence states that logically equiva-
lent formulas have the same consequences. Right Weakening expresses the fact
that one should accept as plausible consequences all that is logically implied
by what one thinks are plausible consequences. And expresses the fact that the
conjunction of two plausible consequences is a plausible consequence. Or says
that any formula that is, separately, a plausible consequence of two different
formulas, should also be a plausible consequence of their disjunction. Cautious
Monotonicity expresses the fact that learning a new fact, the truth of which
could have been plausibly concluded, should not invalidate previous conclusions.
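As an illustrative instantiation of two of these postulates (our own example, not one of the survey questions): suppose bird ∧ ¬flies ∈ DC (penguin). Since bird ∈ Cn (bird ∧ ¬flies), Right Weakening yields bird ∈ DC (penguin); and if, in addition, antarctic ∈ DC (penguin), then And yields (bird ∧ ¬flies) ∧ antarctic ∈ DC (penguin).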
Additional Postulates. Table 2 presents additional defeasible reasoning pos-
tulates. Cut expresses the fact that one may, on the way towards a plausible
conclusion, first add a hypothesis to the facts known to be true, prove the
plausibility of the conclusion from this enlarged set of facts, and then
(plausibly) deduce this added hypothesis from the facts. Rational Monotonicity
expresses the fact that only additional information, the negation of which was
expected, should force us to withdraw plausible conclusions previously drawn.
Transitivity expresses that if the second fact is a plausible consequence of the
first and the third fact is a plausible consequence of the second, then the third
fact is also a plausible consequence of the first fact. Contraposition allows the
contrapositive of the original proposition to be inferred, by negating the terms and
swapping their order.
Table 3. AGM postulates
1 Closure K ∗ α = Cn (K ∗ α)
2 Success α ∈ K ∗ α
3 Inclusion K ∗ α ⊆ Cn (K ∪ {α})
4 Vacuity If ¬α ∉ K then Cn (K ∪ {α}) ⊆ K ∗ α
5 Consistency K ∗ α = Cn (α ∧ ¬α) only if |= ¬α
6 Extensionality If α ≡ φ then K ∗ α = K ∗ φ
7 Super-expansion K ∗ (α ∧ φ) ⊆ Cn (K ∗ α ∪ {φ})
8 Sub-expansion If ¬φ ∉ K then Cn (K ∗ α ∪ {φ}) ⊆ K ∗ (α ∧ φ)
Closure implies logical omniscience on the part of the ideal agent or reasoner,
including after revision of their belief set. Success expresses that the new infor-
mation should always be part of the new belief set. Inclusion and Vacuity are
motivated by the principle of minimum change. Together, they express that
in the case of information α, consistent with belief set or knowledge base K,
belief revision involves performing expansion on K by α, i.e., none of the origi-
nal beliefs need to be withdrawn. Consistency expresses that the agent should
prioritise consistency, where the only acceptable case of not doing so is when the
new information α is itself inconsistent.
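As an illustration of Inclusion and Vacuity (our own example, not one from the surveys): let K = Cn ({p → q}) and let the new information be α = p. Since ¬p ∉ K, Vacuity and Inclusion together give K ∗ p = Cn ({p → q, p}), so q is now believed and none of the original beliefs is withdrawn.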
Table 4. KM postulates
1 (U1) α ∈ K ⋄ α
2 (U2) If α ∈ K then K ⋄ α = K
3 (U3) K ⋄ α = Cn (α ∧ ¬α) only if |= ¬α or K = Cn (α ∧ ¬α)
4 (U4) If α ≡ φ then K ⋄ α = K ⋄ φ
5 (U5) K ⋄ (α ∧ φ) ⊆ Cn (K ⋄ α ∪ {φ})
6 (U6) If φ ∈ K ⋄ α and α ∈ K ⋄ φ then K ⋄ α = K ⋄ φ
7 (U7) If K is complete then K ⋄ (φ ∨ α) ⊆ Cn (K ⋄ α ∪ K ⋄ φ)
8 (U8) K ⋄ α = ⋂φ∈K (Cn (φ) ⋄ α)
9 (U9) K ⋄ α = Cn (K ⋄ α)
Here ⋄ denotes the update operator.
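To make the difference between revision and update concrete, the following sketch applies a standard semantic construction (Dalal-style revision and pointwise, Forbus-style update over propositional valuations with Hamming distance). It is our own illustration of the general idea behind the operators axiomatised in [1] and [11], not code or an example used in the surveys:

from itertools import product

ATOMS = ['p', 'q']

def models(formula):
    # All valuations (dicts over ATOMS) satisfying a formula given as a Python predicate.
    worlds = [dict(zip(ATOMS, bits)) for bits in product([False, True], repeat=len(ATOMS))]
    return [w for w in worlds if formula(w)]

def hamming(u, v):
    return sum(u[a] != v[a] for a in ATOMS)

def revise(K, alpha):
    # Dalal-style revision: the models of alpha globally closest to the models of K.
    d = min(hamming(u, v) for u in models(K) for v in models(alpha))
    return [v for v in models(alpha)
            if any(hamming(u, v) == d for u in models(K))]

def update(K, alpha):
    # KM-style update: for each model of K separately, keep its closest models of alpha.
    result = []
    for u in models(K):
        d = min(hamming(u, v) for v in models(alpha))
        result += [v for v in models(alpha) if hamming(u, v) == d and v not in result]
    return result

K = lambda w: w['p'] != w['q']   # belief: exactly one of p, q holds
alpha = lambda w: w['p']         # new information: p holds

print(revise(K, alpha))  # only {p: True, q: False}: revision concludes that q is false
print(update(K, alpha))  # both {p: True, q: True} and {p: True, q: False}: update leaves q open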
A.5 Results
In Fig. 1, we show the Hit Rate (%) for each defeasible reasoning postulate. In
Fig. 2, we show the Hit Rate (%) for each belief revision postulate. In Fig. 3, we
show the Hit Rate (%) for each belief update postulate.
References
1. Alchourrón, C.E., Gärdenfors, P., Makinson, D.: On the logic of theory change:
partial meet contraction and revision functions. J. Symb. Logic 50, 510–530 (1985).
https://doi.org/10.2307/2274239
2. Buhrmester, M.: M-turk guide (2018). https://michaelbuhrmester.wordpress.com/
mechanical-turk-guide/
3. Creswell, J.W.: Research Design: Qualitative, Quantitative, and Mixed Methods
Approaches, vol. 4, pp. 245–253. SAGE Publications, Thousand Oaks (2014)
4. Darwiche, A., Pearl, J.: On the logic of iterated belief revision. Artif. Intell. 89,
1–29 (1997). https://doi.org/10.1016/S0004-3702(96)00038-0
5. Gärdenfors, P., Makinson, D.: Nonmonotonic inference based on expectations.
Artif. Intell. 65(2), 197–245 (1994)
6. Gärdenfors, P.: Belief Revision: An Introduction, pp. 1–26. Cambridge University
Press, Cambridge (1992). https://doi.org/10.1017/CBO9780511526664.001
7. Governatori, G., Terenziani, P.: Temporal extensions to defeasible logic. In: Orgun,
M.A., Thornton, J. (eds.) AI 2007. LNCS (LNAI), vol. 4830, pp. 476–485. Springer,
Heidelberg (2007). https://doi.org/10.1007/978-3-540-76928-6 49
8. Hansson, S.: A Textbook of Belief Dynamics: Theory Change and Database Updat-
ing. Kluwer Academic Publishers, Berlin (1999)
9. Herzig, A., Rifi, O.: Update operations: a review. In: Prade, H. (ed.) Proceedings
of the 13th European Conference on Artificial Intelligence, pp. 13–17. John Wiley
& Sons, Ltd., New York (1998)
10. Amazon Mechanical Turk, Inc.: FAQs (2018). https://www.mturk.com/help
11. Katsuno, H., Mendelzon, A.O.: On the difference between updating a knowl-
edge base and revising it. In: Proceedings of the Second International Conference
on Principles of Knowledge Representation and Reasoning, KR 1991, pp. 387–
394. Morgan Kaufmann Publishers Inc., San Francisco (1991). http://dl.acm.org/
citation.cfm?id=3087158.3087197
12. Kennedy, R., Clifford, S., Burleigh, T., Jewell, R., Waggoner, P.: The shape of and
solutions to the MTurk quality crisis, October 2018
13. Kraus, S., Lehmann, D., Magidor, M.: Nonmonotonic reasoning, preferential mod-
els and cumulative logics. Artif. Intell. 44, 167–207 (1990)
14. Krosnick, J., Presser, S.: Question and questionnaire design. Handbook of Survey
Research, March 2009
15. Lang, J.: Belief update revisited. In: Proceedings of the 20th International Joint
Conference on Artificial Intelligence, IJCAI 2007, pp. 1534–1540, 2517–2522. Mor-
gan Kaufmann Publishers Inc., San Francisco (2007). http://dl.acm.org/citation.
cfm?id=1625275.1625681
16. Lehmann, D.: Another perspective on default reasoning. Ann. Math. Artif. Intell.
15(1), 61–82 (1995). https://doi.org/10.1007/BF01535841
17. Lieto, A., Minieri, A., Piana, A., Radicioni, D.: A knowledge-based system for
prototypical reasoning. Connect. Sci. 27(2), 137–152 (2015). https://doi.org/10.
1080/09540091.2014.956292
18. Makinson, D.: Bridges between classical and nonmonotonic logic. Logic J. IGPL
11(1), 69–96 (2003)
19. Martins, J., Shapiro, S.: A model for belief revision. Artif. Intell. 35, 25–79 (1988).
https://doi.org/10.1016/0004-3702(88)90031-8
20. Over, D.: Rationality and the normative/descriptive distinction. In: Koehler, D.J.,
Harvey, N. (eds.) Blackwell Handbook of Judgment and Decision Making, pp. 3–18.
Blackwell Publishing Ltd., United States (2004)
21. Pelletier, F., Elio, R.: The case for psychologism in default and inheritance reason-
ing. Synthese 146, 7–35 (2005). https://doi.org/10.1007/s11229-005-9063-z
22. Peppas, P.: Belief revision. In: Harmelen, F., Lifschitz, V., Porter, B. (eds.) Hand-
book of Knowledge Representation. Elsevier Science, December 2008. https://doi.
org/10.1016/S1574-6526(07)03008-8
23. Pollock, J.: A theory of defeasible reasoning. Int. J. Intell. Syst. 6, 33–54 (1991)
24. Ragni, M., Eichhorn, C., Bock, T., Kern-Isberner, G., Tse, A.P.P.: Formal non-
monotonic theories and properties of human defeasible reasoning. Minds Mach.
27(1), 79–117 (2017). https://doi.org/10.1007/s11023-016-9414-1
25. Ragni, M., Eichhorn, C., Kern-Isberner, G.: Simulating human inferences in light
of new information: a formal analysis. In: Kambhampati, S. (ed.) Proceedings of
the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI
16), pp. 2604–2610. IJCAI Press (2016)
26. Ross, J., Zaldivar, A., Irani, L., Tomlinson, B.: Who are the turkers? Worker demo-
graphics in Amazon mechanical turk, January 2009
27. Rott, H.: Change, Choice and Inference: A Study of Belief Revision and Nonmono-
tonic Reasoning. Oxford University Press (2001)
28. Sullivan, G., Artino, R., Artino, J.: Analyzing and interpreting data from Likert-
type scales. J. Grad. Med. Educ. 5(4), 541–542 (2013)
29. Amazon Mechanical Turk: Qualifications and worker task quality best practices, April 2019.
https://blog.mturk.com/qualifications-and-worker-task-quality-best-practices-
886f1f4e03fc
30. TurkPrime: After the bot scare: Understanding what’s been happen-
ing with data collection on MTurk and how to stop it, September
2018. https://blog.turkprime.com/after-the-bot-scare-understanding-whats-been-
happening-with-data-collection-on-mturk-and-how-to-stop-it
31. Verheij, B.: Correct grounded reasoning with presumptive arguments. In: Michael,
L., Kakas, A. (eds.) JELIA 2016. LNCS (LNAI), vol. 10021, pp. 481–496. Springer,
Cham (2016). https://doi.org/10.1007/978-3-319-48758-8 31
32. Witte, J.: Introduction to the special issue on web surveys. Sociol. Methods Res.
37(3), 283–290 (2009)
A Taxonomy of Explainable Bayesian
Networks
1 Introduction
Advances in technology have contributed to the generation of big data in nearly
all fields of science, giving rise to new challenges with respect to the explainability
of the models and techniques used to analyse such data. These models and tech-
niques are often too complex, concealing the knowledge within the machine and
hence decreasing the interpretability of results. Consequently, the lack of
explainable models and techniques contributes to mistrust among users in fields
of science where interpretability and explainability are indispensable.
To elucidate the need for explainable models, consider the following three sce-
narios. Firstly, suppose a medical diagnosis system is used to determine whether
a tumour sample is malignant or benign. Here, the medical practitioner must
be able to understand how and why the system reached the decision, and, if
necessary, inspect whether the decision is supported by medical knowledge [2].
Next, consider self-driving cars. In this context, the self-driving car must be able
to process information faster than a human, such that accidents and fatalities
can be avoided [21]. Suppose a self-driving car is involved in an accident, then
the system must be able to explain that in order to avoid hitting a pedestrian,
the only option was to swerve out of the way and, by coincidence, into another
vehicle. Lastly, consider an online restaurant review system, where reviews are
classified as positive or negative based on the words contained in the review.
Here, the classifier simply returns whether a review is positive or negative, with-
out explaining which words contributed to the classification. As such, negative
reviews that are expressed in, for example, a sarcastic manner, might be clas-
sified as positive, resulting in a restaurant receiving a higher rating and more
diners – who might experience bad service (or even food poisoning) as a result
of mislabelled reviews.
Given its relevance in many application areas, the explainability problem
has attracted a great deal of attention in recent years, and as such, is an open
research area [24]. The manifestation of explainable systems in high-risk areas
has influenced the development of explainable artificial intelligence (XAI) in
the sense of prescriptions or taxonomies of explanation. These include fairness,
accountability, transparency and ethicality [3,11,23]. The foundation of such a
system should include these prescriptions such that a level of usable intelligence
is reached to not only understand model behaviour [1] but also understand the
context of an application task [14]. Bayesian networks (BNs) – which lie at the
intersection of AI, machine learning, and statistics – are probabilistic graphical
models that can be used as a tool to manage uncertainty. These graphical models
allow the user to reason about uncertainty in the problem domain by updating
one’s beliefs, whether this reasoning occurs from cause to effect or from effect to
cause. Reasoning in Bayesian networks is often framed in terms of what-if questions.
The flexibility of a Bayesian network allows these questions to be predictive,
diagnostic and inter-causal. Some what-if questions might be intuitive to
formulate, but this is not always the case, especially at the diagnostic and inter-
causal level. This might result in sub-optimal use of explainability in BNs,
especially for end-users. Apart from well-established reasoning methods,
the probabilistic framework of a Bayesian network also allows for explainability
in evidence. These methods include the most probable explanation and the most relevant
explanation. To extend the existing explainability methods, we propose an additional
approach which considers explanations concerned with the decision-base.
In this paper, we research the current state of explainable models in AI and
machine learning tasks, where the domain of interest is BNs. In the current
research, explanation is often done by principled approaches to finding explana-
tions for models, reasoning, and evidence. Using this, we are able to formulate
a taxonomy of explainable BNs. We extend this taxonomy to include explana-
tion of decisions. This taxonomy will provide end-users with a set of tools to
better understand predictions made by BNs and will therefore encourage effi-
cient communication between end-users. The paper is structured as follows. We
first investigate the community and scope of explainability methods in Sect. 2.
Thereafter, we introduce explanation in BNs, which includes the formulation
of principled approaches, the theoretical properties associated therewith and
a hands-on medical diagnosis example. Section 4 presents our newly formulated
taxonomy of explainable BNs. The final section concludes the paper and includes
a short discussion of future work.
2 Related Work
Suppose during the doctor’s appointment, the patient tells the doctor he is
a smoker before any symptoms are assessed. As mentioned earlier, the doctor
knows smoking increases the probability of the patient having lung cancer and
bronchitis. This will, in turn, also influence the expectation of other symptoms,
such as the result of the chest X-Ray and shortness of breath. Here, our rea-
soning is performed from new information about the causes to new beliefs of
the effects. This type of reasoning is referred to as predictive reasoning and
follows the direction of the arcs in the network. Through predictive reasoning,
we are interested in questions concerning what will happen. In some cases, pre-
dictive reasoning is not of great insight and it is often required to reason from
symptoms (effect) to cause, which entails information flow in the opposite direc-
tion to the network arcs. For example, bronchitis can be seen as an effect of
smoking. Accordingly, we are interested in computing P (S|B). This is referred
to as diagnostic reasoning and is typically used in situations where we want
to determine what went wrong. The final type of probabilistic reasoning in BNs
is inter-causal reasoning, which relates to mutual causes of a common effect –
typically indicated by a v-structure in the network. In other words, inference is
performed on the parent nodes of a shared child node. Note that the parent nodes
are independent of one another unless the shared child node is observed, a con-
cept known as d-separation. From the Asia network, we observe a v-structure
between Tuberculosis, Lung Cancer and Tuberculosis or Cancer (see Fig. 2a).
Here, Tuberculosis is independent from Lung cancer. Suppose we observe the
patient has either Tuberculosis or Cancer – indicated by the green (or light grey
if viewed in grey-scale) bar in Fig. 2b – then this observation increases the prob-
abilities of the parent nodes, Tuberculosis and Lung Cancer. However, if it is
then revealed that the patient does, in fact, have Tuberculosis it, in turn, lowers
the probability of a patient having Lung Cancer (see Fig. 2c). We can then say
Lung Cancer has been explained away. It should be noted that the probabilistic
reasoning methods discussed above can be used as is, or can be combined to
accommodate the problem at hand.
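As a concrete, if simplified, illustration of predictive and diagnostic reasoning, the following sketch queries a two-node fragment of the Asia network using the pgmpy library. The conditional probability values, the state encoding (0 = no, 1 = yes) and the fragment itself are our own assumptions and are not taken from the network used in the paper:

from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

# Smoking -> Bronchitis, a fragment of the Asia network with illustrative CPTs.
model = BayesianNetwork([('Smoking', 'Bronchitis')])
model.add_cpds(
    TabularCPD('Smoking', 2, [[0.5], [0.5]]),      # P(Smoking)
    TabularCPD('Bronchitis', 2,
               [[0.7, 0.4],                        # P(Bronchitis = no | Smoking)
                [0.3, 0.6]],                       # P(Bronchitis = yes | Smoking)
               evidence=['Smoking'], evidence_card=[2]),
)
infer = VariableElimination(model)

# Predictive reasoning (cause to effect): P(Bronchitis | Smoking = yes).
print(infer.query(['Bronchitis'], evidence={'Smoking': 1}))
# Diagnostic reasoning (effect to cause): P(Smoking | Bronchitis = yes), i.e. P(S|B).
print(infer.query(['Smoking'], evidence={'Bronchitis': 1}))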
Sometimes, users of the system find the results of reasoning unclear or ques-
tionable. One way to address this is to provide scenarios for which the reasoning
outcomes are upheld. A fully specified scenario is easier to understand than a set
of reasoning outcomes. Explanation of evidence methods are useful in specifying
these scenarios. They are based on the posterior probability and the generalised
Bayes factor. Firstly, we focus on methods that aim to find a configuration of
variables such that the posterior probability is maximised given the evidence.
Here, we consider the Most Probable Explanation (MPE), which is a special case
of the Maximum A Posteriori (MAP). The MAP in a BN is a variable config-
uration which includes a subset of unobserved variables in the explanation set
Most Probable Explanation. Let’s first consider the MPE method. Recall
that the MPE finds the complete instantiation of the target variables – which are
defined to be unobserved – such that the joint posterior probability is maximised
given evidence. Figure 3 shows the scenario (or case) that has the highest joint
probability in the Asia network. Note here the probabilities are replaced by
the likelihood of the variable state belonging to the most probable scenario, for
example, if we look at the two possible states for Bronchitis, we see that ‘False’,
i.e., the patient does not have bronchitis, is more probable. Suppose we discover that
the patient suffers from shortness of breath; we can then set the evidence for
Dyspnoea to ‘True’ (illustrated in Fig. 4). By introducing this new evidence,
we now observe a slightly different scenario, where it is more probable for the
patient to be a smoker and have bronchitis. Notice here that variables that
seem irrelevant to the evidence explanation, such as Visit to Asia and XRay,
are included in the explanation. This could lead to overspecified hypotheses,
especially in larger networks.
Most Relevant Explanation. With the most relevant explanation (MRE), in contrast, a partial
instantiation of the target variables is found such that the generalised Bayes
factor is maximised. Let’s first consider the explanations obtained from the gen-
eralised Bayes factor. Again, suppose the patient suffers from shortness of breath
(evidence). We are then interested in finding only those variables that are rele-
vant in explaining why the patient has shortness of breath. Table 1 contains the
set of explanations obtained from the generalised Bayes factor. For example, the
last entry shows that a possible explanation for shortness of breath is a trip to
Asia and an abnormal X-ray. Thus including only 2 variables from the remaining
7 variables (excluding Dyspnoea). As mentioned, the MRE is the explanation
that maximises the generalised Bayes factor. From Table 1 we see that having
Bronchitis best explains the shortness of breath. Notice that this explanation
does not include Smoking, as opposed to the MPE which included Smoking.
Thus, although smoking is a probable cause for shortness of breath, it is not the
most relevant cause. An interesting characteristic of the MRE is its ability to
capture the explaining away phenomenon [33].
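For reference, the quantities involved can be stated compactly, following the definitions in the cited literature (e.g. [33]) rather than anything introduced in this paper: given evidence e, MPE(e) = argmaxx P (x | e), where x ranges over complete instantiations of all unobserved variables; MAP relaxes this to a chosen subset of the unobserved variables; the generalised Bayes factor of a partial instantiation h is GBF(h; e) = P (e | h)/P (e | ¬h); and MRE(e) = argmaxh GBF(h; e), where h ranges over partial instantiations of the target variables.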
2 The SDP scenario was constructed using the decision node functionality in
Bayesialab. The decision nodes are indicated as green (or dark grey if viewed in
grey-scale).
4 XBN in Action
The point of XBN is to explain the AI task at hand. In other words, the ques-
tion the decision-maker seeks to answer, and not the technique in principle.
Therefore, we need to be able to freely ask ‘why’ or ‘what’ and from this select
a method that would best address the AI task. In Fig. 7 we present a taxon-
omy of XBN. The purpose of this taxonomy is to categorise XBN methods into
four phases of BNs: The first phase involves the construction of the BN model.
Explanation in the ‘model’ phase is critical when the model is based on expert
knowledge. The second phase is reasoning, the third phase evidence, and the
fourth decision. Explanation of the model and sensitivity analysis are illustrated
in grey as they are out of scope for this paper. Although we define the taxonomy
along these phases, we do acknowledge that not all phases are necessarily utilised
by the decision-maker. For example, when using BNs to facilitate participatory
modelling [8], the main emphasis is on explaining the model. Or, when using
BNs as a classifier, the emphasis is on explaining the decisions. In this section,
we present typical questions of interest to the decision-maker in each category
of the XBN taxonomy.
4.1 Reasoning
Reasoning in the XBN taxonomy is concerned with the justification of a con-
clusion. Returning to our Asia example, the end-user might ask the following
question,
“Given the patient recently visited Asia, how likely is an abnormal chest X-
Ray? ”
Here, we are concerned with a single outcome: the X-Ray result. On the other
hand, the doctor may have knowledge about symptoms presented by the patient
and ask,
“What is the probability of a patient being a smoker, given that he presented
shortness of breath? ”
We can extend this to a forensic context. Suppose a crime scene is investigated
where a severely burned body is found. The forensic analyst can then ask,
“The burn victim is found with a protruded tongue, was the victim exposed to
fire before death or after? ”
Consider now a financial service context where a young prospective home owner
is declined a loan. The service provider can then ask,
“Did the prospective owner not qualify for the home loan because of his age? ”
From these examples, we see that explanation of reasoning is used where ques-
tions are asked in the context of single variable outcomes for diagnosis.
4.2 Evidence
When we are interested in the subset of variables that describes specific scenarios,
we use explanation of evidence methods. For example, in our Asia example the
doctor may ask,
“Which diseases are most probable to the symptoms presented by the patient? ”
or
“Which diseases are most relevant to the symptoms presented by the patient? ”
In a forensic context, the forensic analyst investigating a crime scene may ask
the following question,
“What are the most relevant causes of death, given the victim is found with
a severely burned body and protruded tongue? ”
Similarly this can be applied to fraud detection. Suppose the analyst investigates
the credit card transactions of a consumer. The analyst can then ask,
“What are the most probable transaction features that contributed to the flag-
ging of this consumer? ”
Explanation of evidence can also be used to provide explanations for financial
service circumstances. For example, if a prospective home owner is turned down
for a loan, he may ask the service provider which features in his risk profile are
more relevant (contributed most) to being turned down.
4.3 Decisions
Explanation of decisions typically asks the following questions “Do we have
enough evidence to make a decision?”, and if not, “what additional evidence is
required to make a decision?”. For example, in our Asia example we can ask,
“Do we have enough evidence on the symptoms presented to make a decision
on the disease? ”
or
“Since we are not yet able to determine the disease, what additional infor-
mation – test, underlying symptoms, comorbidities – is required to make a
decision? ”
5 Conclusion
The development of AI systems has seen remarkable advances in recent years. We
are now exposed to these systems on a daily basis, such as product recommen-
dation systems used by online retailers. However, these systems are also being
implemented by medical practitioners, forensic analysts and financial services
– application areas where decisions directly influence the lives of humans. It is
because of these high-risk application areas that progressively more interest is
given to the explainability of these systems.
This paper addresses the problem of explainability in BNs. We first explored
the state of explainable AI and in particular BNs, which serves as a foundation
for our XBN framework. We then presented a taxonomy to categorise XBN
methods in order to emphasise the benefits of each method given a specific
usage of the BN model. This XBN taxonomy will serve as a guideline, enabling
end-users to understand how and why predictions were made and, therefore,
to better communicate how outcomes were obtained based on these predictions.
The XBN taxonomy consists of explanation of reasoning, evidence and deci-
sions. Explanation of the model is reserved for future work, since the taxonomy
described in this paper is focused on how and why predictions were made and not
on the model-construction phase. Other future research endeavours include the
addition of more dimensions and methods to the XBN taxonomy – this involves
more statistical-based methods and the incorporation of causability (which also
addresses the quality of explanations) – as well as applying this taxonomy to
real-world applications.
References
1. Barredo Arrieta, A., et al.: Explainable artificial Intelligence (XAI): concepts, tax-
onomies, opportunities and challenges toward responsible AI. Inf. Fusion 58, 82–
115 (2020). https://doi.org/10.1016/j.inffus.2019.12.012
2. Brito-Sarracino, T., dos Santos, M.R., Antunes, E.F., de Andrade Santos, I.B.,
Kasmanas, J.C., de Leon Ferreira, A.C.P., et al.: Explainable machine learning for
breast cancer diagnosis. In: 2019 8th Brazilian Conference on Intelligent Systems
(BRACIS), pp. 681–686. IEEE (2019)
3. Cath, C.: Governing artificial intelligence: ethical, legal and technical opportunities
and challenges. Philos. Trans. R. Soc. Math. Phys. Eng. Sci. 376(2133) (2018).
https://doi.org/10.1098/rsta.2018.0080
4. Chan, H., Darwiche, A.: On the robustness of most probable explanations. In:
Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence, UAI
2006, June 2012
5. Choi, A., Xue, Y., Darwiche, A.: Same-decision probability: a confidence measure
for threshold-based decisions. Int. J. Approximate Reasoning 53(9), 1415–1428
(2012)
6. Das, A., Rad, P.: Opportunities and Challenges in Explainable Artificial Intelli-
gence (XAI): A Survey. arXiv preprint arXiv:2006.11371 (2020)
7. De Waal, A., Steyn, C.: Uncertainty measurements in neural network predictions
for classification tasks. In: 2020 IEEE 23rd International Conference on Information
Fusion (FUSION), pp. 1–7. IEEE (2020)
8. Düspohl, M., Frank, S., Döll, P.: A review of Bayesian networks as a participa-
tory modeling approach in support of sustainable environmental management. J.
Sustain. Dev. 5(12), 1 (2012). https://doi.org/10.5539/jsd.v5n12p1
9. Gallego, M.J.F.: Bayesian networks inference: Advanced algorithms for triangula-
tion and partial abduction (2005)
10. Goebel, R., et al.: Explainable AI: the new 42? In: Holzinger, A., Kieseberg, P.,
Tjoa, A.M., Weippl, E. (eds.) CD-MAKE 2018. LNCS, vol. 11015, pp. 295–303.
Springer, Cham (2018). https://doi.org/10.1007/978-3-319-99740-7 21
11. Greene, D., Hoffmann, A.L., Stark, L.: Better, nicer, clearer, fairer: a critical assess-
ment of the movement for ethical artificial intelligence and machine learning. In:
Proceedings of the 52nd Hawaii International Conference on System Sciences, pp.
2122–2131 (2019). https://doi.org/10.24251/hicss.2019.258
12. Gunning, D., Aha, D.W.: DARPA’s explainable artificial intelligence program. AI
Mag. 40(2), 44–58 (2019)
13. Helldin, T., Riveiro, M.: Explanation methods for Bayesian networks: review and
application to a maritime scenario. In: Proceedings of The 3rd Annual Skövde
Workshop on Information Fusion Topics, SWIFT, pp. 11–16 (2009)
14. Holzinger, A., et al.: Towards the Augmented Pathologist: Challenges of
Explainable-AI in Digital Pathology. arXiv preprint arXiv:1712.06657 pp. 1–34
(2017). http://arxiv.org/abs/1712.06657
15. Keppens, J.: Explaining Bayesian belief revision for legal applications. In: JURIX,
pp. 63–72 (2016)
16. Keppens, J.: Explainable Bayesian network query results via natural language gen-
eration systems. In: Proceedings of the Seventeenth International Conference on
Artificial Intelligence and Law, pp. 42–51 (2019)
17. Khedkar, S., Subramanian, V., Shinde, G., Gandhi, P.: Explainable AI in health-
care. In: Healthcare (April 8, 2019). 2nd International Conference on Advances in
Science and Technology (ICAST) (2019)
18. Korb, K.B., Nicholson, A.E.: Bayesian Artificial Intelligence. CRC Press, Boca
Raton (2010)
19. Lacave, C., Díez, F.J.: A review of explanation methods for Bayesian net-
works. Knowl. Eng. Rev. 17(2), 107–127 (2002). https://doi.org/10.1017/
S026988890200019X
20. Lauritzen, S.L., Spiegelhalter, D.J.: Local computations with probabilities on
graphical structures and their application to expert systems. J. R. Stat. Soc. Seri.
B (Methodological) 50(2), 157–194 (1988)
21. Lawless, W.F., Mittu, R., Sofge, D., Hiatt, L.: Artificial intelligence, autonomy,
and human-machine teams: interdependence, context, and explainable AI. AI Mag.
40(3), 5–13 (2019)
22. Lecue, F.: On the role of knowledge graphs in explainable AI. Seman. Web 11(1),
41–51 (2020)
23. Leslie, D.: Understanding artificial intelligence ethics and safety: A guide for the
responsible design and implementation of AI systems in the public sector (2019).
https://doi.org/10.5281/zenodo.3240529
24. Lipton, Z.C.: The mythos of model interpretability. Queue 16(3), 31–57 (2018)
25. Martens, D., Provost, F.: Explaining data-driven document classifications. MIS Q.
38(1), 73–100 (2014)
26. Miller, T., Weber, R., Magazzeni, D.: Proceedings of the IJCAI 2019 Workshop on
Explainable AI (2019)
27. Montavon, G., Binder, A., Lapuschkin, S., Samek, W., Müller, K.-R.: Layer-wise
relevance propagation: an overview. In: Samek, W., Montavon, G., Vedaldi, A.,
Hansen, L.K., Müller, K.-R. (eds.) Explainable AI: Interpreting, Explaining and
Visualizing Deep Learning. LNCS (LNAI), vol. 11700, pp. 193–209. Springer, Cham
(2019). https://doi.org/10.1007/978-3-030-28954-6 10
28. Samek, W., Müller, K.-R.: Towards explainable artificial intelligence. In: Samek,
W., Montavon, G., Vedaldi, A., Hansen, L.K., Müller, K.-R. (eds.) Explainable AI:
Interpreting, Explaining and Visualizing Deep Learning. LNCS (LNAI), vol. 11700,
pp. 5–22. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-28954-6 1
29. Timmer, S.T., Meyer, J.J.C., Prakken, H., Renooij, S., Verheij, B.: A two-phase
method for extracting explanatory arguments from Bayesian networks. Interna-
tional Journal of Approximate Reasoning 80, 475–494 (2017)
30. van der Gaag, L.C., Coupé, V.M.H.: Sensitivity analysis for threshold decision
making with Bayesian belief networks. In: Lamma, E., Mello, P. (eds.) AI*IA 1999.
LNCS (LNAI), vol. 1792, pp. 37–48. Springer, Heidelberg (2000). https://doi.org/
10.1007/3-540-46238-4 4
31. Xu, F., Uszkoreit, H., Du, Y., Fan, W., Zhao, D., Zhu, J.: Explainable AI: a brief
survey on history, research areas, approaches and challenges. In: Tang, J., Kan,
M.-Y., Zhao, D., Li, S., Zan, H. (eds.) NLPCC 2019, Part II. LNCS (LNAI), vol.
11839, pp. 563–574. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-
32236-6 51
32. Yuan, C.: Some properties of most relevant explanation. In: ExaCt, pp. 118–126
(2009)
33. Yuan, C., Lim, H., Lu, T.C.: Most relevant explanation in Bayesian networks. J.
Artif. Intell. Res. 42, 309–352 (2011). https://doi.org/10.1613/jair.3301
34. Yuan, C., Liu, X., Lu, T.C., Lim, H.: Most relevant explanation: Properties, algo-
rithms, and evaluations. In: Proceedings of the 25th Conference on Uncertainty in
Artificial Intelligence, UAI 2009, pp. 631–638 (2009)
A Boolean Extension of KLM-Style
Conditional Reasoning
1 Introduction
Non-monotonic reasoning has been extensively studied in the AI literature, as it
provides a mechanism for making bold inferences that go beyond what classical
methods can provide, while retaining the possibility of revising these inferences
in light of new information. In their seminal paper, Kraus et al. [14] consider a
general framework for non-monotonic reasoning, phrased in terms of defeasible,
or conditional implications of the form α |∼ β, to be read as “If α holds, then
typically β holds”. Importantly, they provide a set of rationality conditions,
Logic, and give a brief overview of the entailment problem for PTL. In Sect. 3
we define the logic BKLM, an extension of KLM-style logics that allows for
arbitrary boolean combinations of conditionals. We investigate the expressive-
ness of BKLM, and show that it is strictly more expressive than PTL by exhibiting
an explicit translation of PTL formulas into BKLM. In Sect. 4 we turn to the
entailment problem for BKLM, and show that BKLM suffers from stronger ver-
sions of the known impossibility results for PTL. Section 5 discusses some related
work, while Sect. 6 concludes and points out some future research directions.
2 Background
Let P be a set of propositional atoms, and let p, q, . . . be meta-variables for
elements of P. We write LP for the set of propositional formulas over P, defined
by α ::= p | ¬α | α ∧ α | ⊤ | ⊥. Other boolean connectives, such as ∨, →, and ↔,
are defined as usual in terms of ∧ and ¬. We write U P for the set of valuations of P,
which are functions v : P → {0, 1}. Valuations are extended to LP in the usual way,
and satisfaction of a formula α by a valuation v will be denoted v ⊩ α. For the remainder
of this paper we assume that P is finite, and drop the superscripts where there is no
ambiguity.
(Figure: an example ranked interpretation over the atoms p, b and f, with one valuation on rank 2, two on rank 1 and three on rank 0.)
Proposition 4 can be rephrased as saying that every KLM knowledge base has
an equivalent PTL knowledge base, in the sense that they share the same set of
ranked models. Note, however, that the converse doesn’t hold, which intuitively
shows that PTL is strictly more expressive than KLM:
Proposition 5 [3, Proposition 13]. For any p ∈ P, the knowledge base con-
sisting of •p has no equivalent KLM knowledge base.
Later, we will show that there is a sense in which PTL is not maximally
expressive for semantics given by ranked interpretations, a fact that may seem
surprising in light of its unrestricted syntax.
3 Boolean KLM
In this section we describe Boolean KLM (BKLM), an extension of KLM that
permits arbitrary boolean combinations of defeasible conditionals. Syntactically,
this goes beyond the extension of Booth and Paris [4] by allowing disjunctive as
well as negative assertions in knowledge bases. BKLM formulas are defined by
the grammar A ::= α |∼ β | ¬A | A ∧ A, with other boolean connectives defined
as usual in terms of ¬ and ∧. For convenience, we use α ̸|∼ β as a synonym for
¬(α |∼ β), and write Lb for the set of all BKLM formulas. Hence, for example,
(α |∼ β) ∧ (γ |∼ δ) and ¬((α |∼ β) ∨ (γ |∼ δ)) are valid BKLM formulas, but the
nested conditional α |∼ (β |∼ γ) is not.
Satisfaction for BKLM is defined in terms of ranked interpretations, by
extending KLM satisfaction in the obvious fashion, namely R ⊩ ¬A iff R ⊮ A
and R ⊩ A ∧ B iff R ⊩ A and R ⊩ B. This leads to some subtle differ-
ences between BKLM and the other logics described in this paper. For instance,
care must be taken to apply Proposition 3 correctly when translating between
propositional formulas and BKLM formulas. The propositional formula p ∨ q
translates to the BKLM formula ¬(p ∨ q) |∼ ⊥, and not to the BKLM formula
(¬p |∼ ⊥) ∨ (¬q |∼ ⊥), as the following example illustrates:
(Figure: a ranked interpretation R over the atoms p and q, with one valuation on rank 1 and one on rank 0.)
Example 3. Consider the formulas A = ¬(•p → q) and B = ¬(p |∼ q), and let
R be the ranked interpretation in the example above. Note that A is equivalent
to •p ∧ ¬q, which is not satisfied by R. On the other hand, R satisfies B.
Satisfaction for KLM, PTL and BKLM formulas is defined in terms of ranked
interpretations. This allows us to compare their expressiveness directly, in terms
of the sets of models that they can characterise. With the results mentioned
earlier, we can already do this for KLM and PTL:
Example 4. Let K ⊆ L|∼ be a KLM knowledge base. Then the PTL knowledge
base K′ = {•α → β : α |∼ β ∈ K} has exactly the same ranked models as K by
Proposition 4, and hence PTL is at least as expressive as KLM. Proposition 5
proves that PTL is strictly more expressive than KLM.
Our main result in this section is that BKLM is maximally expressive, in the
sense that it can characterise any set of ranked interpretations. First, we recall
that for every valuation u ∈ U there is a corresponding characteristic formula
û ∈ L, which has the property that v ⊩ û iff v = u.
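For instance (our own illustration), if P = {p, q} and u assigns p true and q false, then û = p ∧ ¬q.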
Note that this lemma holds even in the trivial case where R(u) = ∞ for all
u ∈ U. For convenience, in later parts of the paper we will write α < β as a
standard shorthand for the defeasible conditional α ∨ β |∼ ¬β.
In principle, this corollary shows that for any PTL knowledge base there
exists some BKLM formula with the same set of models, and hence BKLM
is at least as expressive as PTL. In the next section we make this relationship
more concrete, by providing an explicit algorithm for translating PTL knowledge
bases into BKLM.
In Sect. 2.2, satisfaction for PTL formulas was defined in terms of the possible
valuations of a ranked interpretation R. In order to define a translation operator
between PTL and BKLM, our main idea is to encode satisfaction with respect
to a particular valuation u ∈ U, by defining an operator tru : L• → Lb such that
for each u ∈ U R , R ⊩ tru (α) iff α is satisfied at u in R.
1. tru (p) =def û |∼ p
2. tru (⊤) =def û |∼ ⊤
3. tru (⊥) =def û |∼ ⊥
4. tru (¬α) =def ¬tru (α)
5. tru (α ∧ β) =def tru (α) ∧ tru (β)
6. tru (•α) =def tru (α) ∧ ⋀v∈U ((v̂ < û) → ¬trv (α))
Note that this is well-defined, as each case is defined in terms of strict sub-
formulas. These translations can be viewed as analogues of the definition of PTL
satisfaction - case 6 intuitively states that •α is satisfied by a possible valuation
u iff u is a minimal valuation satisfying α, for instance. The following lemma
confirms that this intuition is correct:
Finally, we can prove that this translation does indeed result in an equivalent
BKLM formula:
Lemma 4. For all α ∈ L• and any ranked interpretation R, R satisfies α iff
R satisfies tr(α).
Definition 6. Let < be a strict partial order on RI. Then for all knowledge
bases K ⊆ Lb and formulas α ∈ Lb , we say K <-entails α (denoted K |≈< α) iff
R ⊩ α for all <-minimal models R ∈ Mod(K).
The relation |≈< will be referred to as an order entailment relation. Note that
while we have explicitly referred to BKLM knowledge bases here, the construc-
tion works identically for KLM and PTL. It is also worth mentioning that the
set of models of a consistent knowledge base is always finite, as we have assumed
finiteness of P, and hence always has <-minimal elements.
Proposition 11. An order entailment relation |≈< satisfies the Single Model
property iff Mod(K) has a unique <-minimal model for any knowledge base K.
This is always the case if < is total, for instance, but it is also the case
for Rational Closure and LM-entailment. In the next section we will show that,
perhaps surprisingly, total order entailment relations are nevertheless (modulo
some minor conditions) the only entailment relations for BKLM satisfying the
Single Model property.
For the remainder of the proof, we consider a fixed BKLM entailment rela-
tion |≈? satisfying the Cumulativity, Ampliativity and Single Model properties.
Corresponding to |≈? is an associated consequence operator Cn? , defined by
Cn? (K) := {α ∈ Lb : K |≈? α}.
In what follows, we will move between the entailment relation and conse-
quence operator notations freely as convenient. To begin with, the following
lemma follows easily from our assumptions:
Lemma 6. For any knowledge base K ⊆ Lb , Cn? (K) = CnR (Cn? (K)) and
Cn? (K) = Cn? (CnR (K)).
1. Set M0 := RI, i := 0.
2. If Mi = ∅, terminate.
3. By Corollary 1, there is some Ki ⊆ Lb s.t. Mod(Ki ) = Mi .
4. By the Single Model property, there is some Ri ∈ Mi s.t. Cn? (Ki ) = sat(Ri ).
5. Set Mi+1 := Mi \ {Ri }, i := i + 1.
6. Go to step 2, and iterate until termination.
The following lemma proves that entailment under |≈? corresponds to min-
imisation of index:
5 Related Work
The most relevant work w.r.t. the present paper is that of Booth and Paris [4]
in which they define rational closure for the extended version of KLM for which
negated conditionals are allowed, and the work on PTL [2,5]. The relation this
work has with BKLM was investigated in detail throughout the paper.
Delgrande [10] proposes a logic that is as expressive as BKLM. The entail-
ment relation he proposes is different from the minimal entailment relations we
consider here and, given the strong links between our constructions and the KLM
approach, the remarks in the comparison made by Lehmann and Magidor [16,
Sect. 3.7] are also applicable here.
Boutilier [6] defines a family of conditional logics using preferential and
ranked interpretations. His logic is closer to ours and even more expressive, since
nesting of conditionals is allowed, but he too does not consider minimal construc-
tions. That is, both Delgrande and Boutilier’s approaches adopt a Tarskian-style
notion of consequence, in line with rank entailment. The move towards a non-
monotonic notion of defeasible entailment was precisely our motivation in the
present work.
Giordano et al. [13] propose the system Pmin which is based on a language
that is as expressive as PTL. However, they end up using a constrained form of
such a language that goes only slightly beyond the expressivity of the language
of KLM-style conditionals (their well-behaved knowledge bases). Also, the system
Pmin relies on preferential models and a notion of minimality that is closer to
circumscription [17].
In the context of description logics, Giordano et al. [11,12] propose to extend
the conditional language with an explicit typicality operator T (·), with a mean-
ing that is closely related to the PTL operator •. It is worth pointing out,
though, that most of the analysis in the work of Giordano et al. is dedicated to
a constrained use of the typicality operator T (·) that does not go beyond the
expressivity of a KLM-style conditional language, but revised, of course, for the
expressivity of description logics.
In the context of adaptive logics, Straßer [18] defines the logic R+ as an exten-
sion of KLM in which arbitrary boolean combinations of defeasible implications
are allowed, and the set of propositional atoms has been extended to include
the symbols {li : i ∈ N}. Semantically, these symbols encode rank in the object
language, in the sense that u ⊩ li in a ranked interpretation R iff R(u) ≥ i.
Straßer’s interest in R+ is to define an adaptive logic ALC S that provides a
dynamic proof theory for rational closure, whereas our interest in BKLM is to
generalise rational closure to more expressive extensions of KLM. Nevertheless,
the Minimal Abnormality Strategy (see the work of Batens [1], for instance) for
ALC S is closely related to LM -entailment as defined in this paper.
6 Conclusion
The main focus of this paper is exploring the connection between expressiveness
and entailment for extensions of the core logic KLM. Accordingly, we introduce
the logic BKLM, an extension of KLM that allows for arbitrary boolean combi-
nations of defeasible implications. We take an abstract approach to the analysis
of BKLM, and show that it is strictly more expressive than existing extensions
of KLM such as PTL [3] and KLM with negation [4]. Our primary conclusion
is that a logic as expressive as BKLM has to give up several desirable prop-
erties for defeasible entailment, most notably the Single Model property, and
thus appealing forms of entailment for PTL such as LM-entailment [2] cannot
be lifted to the BKLM case.
For future work, an obvious question is what forms of defeasible entailment
are appropriate for BKLM. For instance, is it possible to skirt the impossibility
results proven in this paper while still retaining the KLM rationality properties?
Other forms of entailment for PTL, such as PT-entailment, have also yet to
be analysed in the context of BKLM and may be better suited to such an
expressive logic. Another line of research to be explored is whether there is a
more natural translation of PTL formulas into BKLM than that defined in
this paper. Our translation is based on a direct encoding of PTL semantics,
and consequently results in an exponential blow-up in the size of the formulas
being translated. It is clear that there are much more efficient ways to translate
specific PTL formulas, but we leave it as an open problem whether this can
be done in general. In a similar vein, it is interesting to ask how PTL could
be extended in order to make it equiexpressive with BKLM. Finally, it may
be interesting to compare BKLM with an extension of KLM that allows for
nested defeasible implications, i.e. formulas such as α |∼ (β |∼ γ). While such an
extension cannot be more expressive than BKLM, at least for a semantics given
by ranked interpretations, it may provide more natural encodings of various
kinds of typicality, and thus be easier to work with from a pragmatic point of
view.
A Appendix
Proof. We will prove the result by structural induction on the cases in Defini-
tion 4:
Proof. Suppose that |≈? is such an entailment relation, and consider the knowl-
edge base K = {(⊤ |∼ p) ∨ (⊤ |∼ ¬p)}. Both interpretations in Fig. 1, R1
and R2 , are models of K. R1 satisfies ⊤ |∼ p and not ⊤ |∼ ¬p, whereas R2
satisfies ⊤ |∼ ¬p and not ⊤ |∼ p. Thus, by the Typical Entailment property,
neither K |≈? ⊤ |∼ p nor K |≈? ⊤ |∼ ¬p holds. On the other hand, by Ampliativity we get
References
1. Batens, D.: A universal logic approach to adaptive logics. Log. Univers. 1, 221–242
(2007). https://doi.org/10.1007/s11787-006-0012-5
2. Booth, R., Casini, G., Meyer, T., Varzinczak, I.: On the entailment problem for a
logic of typicality. IJCAI 2015, 2805–2811 (2015)
3. Booth, R., Meyer, T., Varzinczak, I.: A propositional typicality logic for extending
rational consequence. In: Fermé, E., Gabbay, D., Simari, G. (eds.) Trends in Belief
Revision and Argumentation Dynamics, Studies in Logic - Logic and Cognitive
Systems, vol. 48, pp. 123–154. King’s College Publications (2013)
4. Booth, R., Paris, J.: A note on the rational closure of knowledge bases with both
positive and negative knowledge. J. Logic Lang. Inform. 7(2), 165–190 (1998)
5. Booth, R., Casini, G., Meyer, T., Varzinczak, I.: On rational entailment for propo-
sitional typicality logic. Artif. Intell. 277, 103178 (2019)
6. Boutilier, C.: Conditional logics of normality: a modal approach. Artif. Intell.
68(1), 87–154 (1994)
7. Casini, G., Straccia, U.: Defeasible inheritance-based description logics. JAIR 48,
415–473 (2013)
8. Casini, G., Meyer, T., Moodley, K., Nortjé, R.: Relevant closure: a new form of
defeasible reasoning for description logics. In: Fermé, E., Leite, J. (eds.) JELIA
2014. LNCS (LNAI), vol. 8761, pp. 92–106. Springer, Cham (2014). https://doi.
org/10.1007/978-3-319-11558-0 7
9. Casini, G., Meyer, T., Varzinczak, I.: Taking defeasible entailment beyond rational
closure. In: Calimeri, F., Leone, N., Manna, M. (eds.) JELIA 2019. LNCS (LNAI),
vol. 11468, pp. 182–197. Springer, Cham (2019). https://doi.org/10.1007/978-3-
030-19570-0 12
10. Delgrande, J.: A first-order logic for prototypical properties. Artif. Intell. 33, 105–
130 (1987)
11. Giordano, L., Gliozzi, V., Olivetti, N., Pozzato, G.L.: Preferential description
logics. In: Dershowitz, N., Voronkov, A. (eds.) LPAR 2007. LNCS (LNAI), vol.
4790, pp. 257–272. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-
540-75560-9 20
12. Giordano, L., Gliozzi, V., Olivetti, N., Pozzato, G.: Semantic characterization of
rational closure: from propositional logic to description logics. Art. Int. 226, 1–33
(2015)
13. Giordano, L., Gliozzi, V., Olivetti, N., Pozzato, G.L.: A nonmonotonic extension
of KLM preferential logic P. In: Fermüller, C.G., Voronkov, A. (eds.) LPAR 2010.
LNCS, vol. 6397, pp. 317–332. Springer, Heidelberg (2010). https://doi.org/10.
1007/978-3-642-16242-8 23
14. Kraus, S., Lehmann, D., Magidor, M.: Nonmonotonic reasoning, preferential mod-
els and cumulative logics. Artif. Intell. 44, 167–207 (1990)
15. Lehmann, D.: Another perspective on default reasoning. Ann. Math. Art. Int.
15(1), 61–82 (1995)
16. Lehmann, D., Magidor, M.: What does a conditional knowledge base entail? Art.
Int. 55, 1–60 (1992)
17. McCarthy, J.: Circumscription, a form of nonmonotonic reasoning. Art. Int. 13(1–
2), 27–39 (1980)
18. Straßer, C.: An adaptive logic for rational closure. Adaptive Logics for Defeasible
Reasoning. Trends in Logic (Studia Logica Library), vol. 38, pp. 181–206. Springer,
Cham (2014). https://doi.org/10.1007/978-3-319-00792-2 7
An Exercise in a Non-classical Semantics
for Reasoning with Incompleteness
and Inconsistencies
Ivan Varzinczak1,2(B)
1 CRIL, Université d'Artois & CNRS, Lens, France
[email protected]
2 CAIR, Computer Science Division, Stellenbosch University, Stellenbosch, South Africa
1 Introduction
The problem of dealing with information that is either contradictory or incom-
plete (or even both) has long been a major challenge in human reasoning. With
the advent of artificial intelligence (AI), such a problem has transferred to AI-
based applications and has become one of the main topics of investigation of
many areas at the intersection of AI and others.
Classical logic (and its many variants) is at the heart of knowledge repre-
sentation and the formalisation of reasoning in AI. Alas, the classical semantics
is naturally hostile to inconsistencies and does not cope well with lack of infor-
mation. This has often forced applications of classical logic into resorting to
‘workarounds’ or limiting its scope.
In this paper, we make the first steps in the study of a generalised semantics
for propositional logic in which contradictions and incompleteness are admitted
α ::= ⊤ | ⊥ | P | ¬α | α ∧ α | α ∨ α | α → α
The idea behind our notions of absurd and partial worlds is certainly not
new. For instance, they have been explored by Rescher and Brandom [17], even
though their technical construction and proposed semantics for the propositional
connectives are different from ours (see below).
The definition of satisfaction of a sentence α ∈ L by a given world must be
redefined w.r.t. the classical tradition because the standard, classical, notion of
satisfaction is an 'all-or-nothing' notion. By that we mean that ⊩, although called
a satisfaction relation, is usually defined as a (recursive) function, which does not
allow for a sentence α ∈ L to be true and false at w at the same time, or to
be just unknown at a given world. Therefore, just as valuations (worlds) in our
setting are no longer total functions, which allows for propositions to be true
and false simultaneously, or even completely unknown, so will the satisfaction
relation be. The crux of the matter then becomes how to define ⊩ in such a way
as to allow α and ¬α to be true at a given w and how to express this in terms
of the subsentences and propositions therein. Of course, if we want to give an
intuitionistic flavour to our logic, ⊩ should also allow the values of both α
and ¬α, for some α ∈ L, to be unknown.
Previous attempts to redefine satisfaction in order to allow for sentences and
their negations to be true at the same time have stumbled upon the exporta-
tion principle [5,6]: the notion of satisfaction, which sits at the meta-level, gets
‘contaminated’ by inconsistencies at the object level, resulting in a contradictory
sentence being satisfied if and only if it is not satisfied (a contradiction ‘exported’
to the meta-level). Here we aim at avoiding precisely that by ensuring the meta-
language we work with remains as much as possible consistent. (We shall come
back to this matter at the end of the present section.)
The satisfaction relation ⊩ is a binary relation ⊩ ⊆ (U × L) × {0, 1}, where 1
and 0 are 'meta-truth' values standing for, respectively, 'is true' and 'is false'
at world w. The intuition is that α is true at w if 1 ∈ ⊩(w, α), and false at w
if 0 ∈ ⊩(w, α). Of course, to be false does not mean not to be true, and the
other way round. (Notice that the meta-truth values used by ⊩ and the truth
values assigned by valuations are not to be conflated, in the same vein as
'valuation' and 'satisfaction' are not synonyms.1) In that respect, a
given α ∈ L can have a truth value at a given world w ∈ U (0, 1, or both, which,
importantly, is not an extra truth value) or none at all (when ⊩(w, α) = ∅,
which is not an extra truth value either). Before we provide the definition of
satisfaction of a sentence in terms of its subsentences, we discuss its expected
behaviour w.r.t. the sentence's main connective. For the sake of readability, in
Tables 1, 2, 3, and 4, we represent {0}, {1} and {0, 1} as, respectively, 0, 1
and 01, and the lack of truth value as ∅, which, again, is not an extra truth
value. So, in the referred tables, 0 and 1 are read as usual, ∅ is read as 'has no
value', and 01 reads 'is true and false', i.e., both truth values apply. (The reader
not convinced by some of the entries in Tables 1, 2, 3, and 4 is invited to hold
on until we state some of the validities holding under the notion of satisfaction
we are about to define.)
1 We could, in principle, also have identified the meta-truth values with the truth values used by valuations, but here we shall adopt the (possibly superfluous) stance that truth of a fact within the 'actual' world and truth of a sentence at given worlds are notions sitting at different levels, or, at the very least, are notions of subtly different kinds.
Satisfaction of ¬α at w: If α is just true or false at a given world, then its negation
should behave as usual, i.e., as a ‘toggle’ function. By the same principle (applied
twice), if α happens to be true and false, then its negation should be false and true.
For the odd case, namely when α has no (known) value, a legitimate question to
ask is ‘can we know more about the negation of a fact than we know about the fact
itself?’ A cautious answer would be ‘no’. Table 1 summarises these considerations.
Table 1. Semantics of ¬.
α ¬α
∅ ∅
1 0
0 1
01 01
Table 2. Semantics of ∧.
α β α∧β
∅ ∅ ∅
∅ 0 0
∅ 1 ∅
∅ 01 0
0 ∅ 0
0 0 0
0 1 0
0 01 0
1 ∅ ∅
1 0 0
1 1 1
1 01 01
01 ∅ 0
01 0 0
01 1 01
01 01 01
Table 3. Semantics of ∨.
α β α∨β
∅ ∅ ∅
∅ 0 ∅
∅ 1 1
∅ 01 1
0 ∅ ∅
0 0 0
0 1 1
0 01 01
1 ∅ 1
1 0 1
1 1 1
1 01 1
01 ∅ 1
01 0 01
01 1 1
01 01 01
Table 4. Semantics of →.
α β α→β
∅ ∅ ∅
∅ 0 ∅
∅ 1 1
∅ 01 1
0 ∅ 1
0 0 1
0 1 1
0 01 1
1 ∅ ∅
1 0 0
1 1 1
1 01 01
01 ∅ 1
01 0 01
01 1 1
01 01 01
• ⊩(w, ⊤) = {1};
• ⊩(w, ⊥) = {0};
• ((w, p), 1) ∈ ⊩ iff (p, 1) ∈ w;
• ((w, p), 0) ∈ ⊩ iff (p, 0) ∈ w;
• ((w, ¬α), 1) ∈ ⊩ iff ((w, α), 0) ∈ ⊩;
• ((w, ¬α), 0) ∈ ⊩ iff ((w, α), 1) ∈ ⊩;
• ((w, α ∧ β), 1) ∈ ⊩ iff both ((w, α), 1) ∈ ⊩ and ((w, β), 1) ∈ ⊩;
• ((w, α ∧ β), 0) ∈ ⊩ iff either ((w, α), 0) ∈ ⊩ or ((w, β), 0) ∈ ⊩;
• ((w, α ∨ β), 1) ∈ ⊩ iff either ((w, α), 1) ∈ ⊩ or ((w, β), 1) ∈ ⊩;
• ((w, α ∨ β), 0) ∈ ⊩ iff both ((w, α), 0) ∈ ⊩ and ((w, β), 0) ∈ ⊩;
• ((w, α → β), 1) ∈ ⊩ iff ((w, α), 0) ∈ ⊩ or ((w, β), 1) ∈ ⊩;
• ((w, α → β), 0) ∈ ⊩ iff ((w, α), 1) ∈ ⊩ and ((w, β), 0) ∈ ⊩.
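To make the above definition concrete, the following is a minimal Python sketch of the satisfaction relation, representing a world as a mapping from atoms to subsets of {0, 1} and ⊩(w, α) as the set of meta-truth values returned by a recursive evaluator. The encoding of sentences as nested tuples and all function names are our own illustrative choices, not part of the formal framework.

```python
from typing import FrozenSet

# A world maps each atom to a subset of {0, 1}: {1} (true), {0} (false),
# {0, 1} (both, an 'absurd' assignment) or set() (unknown, a 'partial' assignment).
World = dict

def sat(w: World, a) -> FrozenSet[int]:
    """Return the set of meta-truth values of sentence `a` at world `w`.

    Sentences are nested tuples: ('top',), ('bot',), ('atom', p),
    ('not', x), ('and', x, y), ('or', x, y), ('imp', x, y).
    """
    op = a[0]
    if op == 'top':
        return frozenset({1})
    if op == 'bot':
        return frozenset({0})
    if op == 'atom':
        return frozenset(w.get(a[1], set()))
    if op == 'not':
        return frozenset({1 - v for v in sat(w, a[1])})
    x, y = sat(w, a[1]), sat(w, a[2])
    out = set()
    if op == 'and':
        if 1 in x and 1 in y:
            out.add(1)
        if 0 in x or 0 in y:
            out.add(0)
    elif op == 'or':
        if 1 in x or 1 in y:
            out.add(1)
        if 0 in x and 0 in y:
            out.add(0)
    elif op == 'imp':
        if 0 in x or 1 in y:
            out.add(1)
        if 1 in x and 0 in y:
            out.add(0)
    else:
        raise ValueError(f'unknown connective: {op}')
    return frozenset(out)

# An absurd world where p is both true and false, and a partial world where p is unknown.
absurd = {'p': {0, 1}}
partial = {'p': set()}
p = ('atom', 'p')
print(sat(absurd, ('and', p, ('not', p))))   # frozenset({0, 1}): p ∧ ¬p is true (and false) here
print(sat(partial, ('or', p, ('not', p))))   # frozenset(): excluded middle has no value here
```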
One of the difficulties brought about by some approaches allowing for the
notion of contradiction-bearing worlds is the ‘contamination’ of the meta-
language with inconsistencies via the exportation principle [5]. In such frame-
works, when assessing the truth of a sentence of the form α∧¬α, the correspond-
ing definition of satisfaction leads to the fact α is true and is not true (in the
meta-language). Let us see how things go in the light of our definitions above.
Assume that w ⊩ α ∧ ¬α, i.e., ((w, α ∧ ¬α), 1) ∈ ⊩. By the semantics for conjunction, we have both ((w, α), 1) ∈ ⊩ and ((w, ¬α), 1) ∈ ⊩, which, according to
Table 2, happens only if ⊩(w, α) = ⊩(w, ¬α) = {0, 1}. As far as one can tell,
the latter is not an antinomy in the meta-language.
It is worth noting that, for every α, β ∈ L, the truth conditions for both
¬(α∧β) and ¬α∨¬β are the same, and so are those for ¬(α∨β) and ¬α∧¬β. In
that respect, the De Morgan laws are preserved under our non-standard seman-
tics. It turns out the truth conditions for α → β and ¬α ∨ β also coincide,
and therefore the connective for material implication is superfluous. If we admit
material implication and the constant ⊥, then it is negation that becomes super-
fluous, as the semantics for ¬α and α → ⊥ coincide. Moreover, both ⊤ and ⊥
can be expressed in terms of each other with the help of negation.
As it is usually done, we can talk of validity, i.e., truth at all worlds under
consideration. Let α ∈ L; we say that α is a classical validity (alias, α is classically valid), denoted |= α, if w ⊩ α for every w ∈ Ucl. (Obviously, our notion of
classical validity and that of tautology in classical propositional logic coincide.)
We say α is a partial validity (alias, α is partially valid), denoted |=pa α, if
w ⊩ α for every w ∈ Upa. As it turns out, partial validity is quite a stingy notion
of validity: the only valid sentences in Upa are ⊤ and those of the form α ∨ ⊤
(or α → ⊤). Lest this be seen as a drawback of allowing partial valuations in
our framework, here we claim this is rather a reminder that partiality may have
as consequence that even the principles of the underlying logic do not hold at
worlds where the truth of some or all the propositions is unknown. Moreover, the
presence of partial worlds automatically rules out some validities from classical
logic that are usually seen as unjustifiable in non-classical circles. Among those
are the law of excluded middle (α∨¬α is a validity), the law of non-contradiction
(¬(α ∧ ¬α)), the principle of double negation (¬¬α ↔ α), and the principle of
explosion ((α ∧ ¬α) → β, for every β ∈ L). None of these is a partial validity,
as can easily be checked. (We shall come back to these principles later on.)
Finally, we can also define the notion of absurd validity (denoted |=ab ), which
amounts to satisfaction by all absurd worlds. Some examples of absurd validi-
ties within our framework are ⊤ and (p1 ∧ ¬p1) ∨ . . . ∨ (pn ∧ ¬pn), for pi ∈ P,
i = 1, . . . , n, with n = |P|.
Before we carry on, let us consider the so-called ‘paradoxes’ of material impli-
cation [18], namely the sentences α → (β → α), (α → β) ∨ (β → α), and
α → (β ∨ ¬β), which are all classical validities. We already know that none
of them is a partial validity, and it should not take too much effort to verify
that they are not absurd validities either. Furthermore, according to our seman-
tics, the three sentences above do not have the same meaning, i.e., their truth
tables are pairwise different from each other. This means that our semantics can
distinguish between these syntactically different sentences, which the classical
semantics cannot do.
From the discussion above, one can see that not all classical tautologies are
preserved in our semantic framework, which is just as intended. In that respect,
our framework provides the semantic foundation for an infra-classical logic cur-
tailing certain classical conclusions that are often perceived as problematic. In
the next section, once we have defined a few forms of entailment, we shall also
assess the validity and failure of some commonly considered rules of inference or
reasoning patterns from classical logic.
By [[α]]ab =def {w ∈ Uab | w ⊩ α} we denote the absurd models of α. The possible
models of α is the set [[α]]p =def [[α]]cl ∪ [[α]]pa, whereas the non-classical models of α
is the set [[α]]nc =def [[α]]pa ∪ [[α]]ab. Finally, the models of α tout court is the set
[[α]] =def [[α]]cl ∪ [[α]]nc.
The choice of which family of models one wants to work with gives rise to
different notions of entailment or logical consequence. Below are those to which
we shall give consideration in the present paper.2
α |=p β,   α |=p β → γ
───────────────────────  (MP)
α |=p γ
2 We do not rule out the remaining combinations; space and time constraints prevent us from assessing them here.
To see why, let w ∈ [[α]]p ; then, by definition of |=p , we have both w ∈ [[β]]p and
w ∈ [[β → γ]]p, i.e., w ⊩ β and w ⊩ β → γ. Then we have ((w, β), 1) ∈ ⊩ and
((w, β → γ), 1) ∈ ⊩. Since w ∈ Up, this only holds if ((w, γ), 1) ∈ ⊩ (cf. Table 4),
i.e., w ⊩ γ, and therefore w ∈ [[γ]]p.
General entailment, on the other hand, fails (MP). Let α = β = p ∧ ¬p, and
let γ = ⊥. It can easily be verified that p∧¬p |=g p∧¬p, p∧¬p |=g (p∧¬p) → ⊥,
and p ∧ ¬p ⊭g ⊥.
Neither possible nor general entailment satisfies the rule of Contraposition:
α |=∗ β
───────────  (CP)
¬β |=∗ ¬α
Possible entailment also satisfies the so-called ‘easy’ half of the deduction
theorem:
α |=p β → γ
─────────────  (EHD)
α ∧ β |=p γ
To witness, assume α |=p β → γ. By Mon, we have α ∧ β |=p β → γ. We also
have α ∧ β |=p β. By MP, we conclude α ∧ β |=p γ.
To see that general entailment fails EHD, let again α = β = p ∧ ¬p, and
let γ = ⊥. We have p ∧ ¬p |=g (p ∧ ¬p) → ⊥, but p ∧ ¬p ∧ p ∧ ¬p ⊭g ⊥.
Both possible and general entailment fail the ‘hard’ half of the deduction
theorem:
α ∧ β |=∗ γ
─────────────  (HHD)
α |=∗ β → γ
Indeed, we have p ∧ q |=p q, but for some w such that w(p) = {1} and w(q) = ∅,
we have ⊩(w, q → q) = ∅, and therefore p ⊭p q → q. The case for |=g is
analogous.
The following Transitivity rule is a consequence of Monotonicity and is sat-
isfied by both possible and general entailment:
α |=∗ β,   β |=∗ γ
──────────────────  (Tran)
α |=∗ γ
Not surprisingly, possible and general entailment fail the First Disjunctive
rule below, just as classical entailment does:
α |=∗ β ∨ γ
──────────────────────  (Disj1)
α |=∗ β  or  α |=∗ γ
4 Concluding Remarks
In this paper, we have revisited the semantics of classical propositional logic.
We started by generalising the notion of propositional valuation to that of a
world that may also admit inconsistencies, or lack of information, or both. We
have seen that our definition of valuation remains suitable for a compositional
interpretation of the truth value of a complex sentence, and that without appeal-
ing to either a dialetheist stance or the use of more than two truth values. In
particular, we have seen that assuming a compositional semantics does not lead
to difficulties brought about by the exportation principle, which is one of the
limitations of previous approaches sharing our motivations. We have also seen
that the adoption of a more general semantics, which brings in a higher num-
ber of possible states of affairs to consider, does not increase the computational
complexity of the satisfiability problem for the underlying language. We have
then explored some basic notions of entailment within our semantic framework
and compared them against many of the properties or reasoning patterns that
are usually considered in formal logic. Some of these are lost, as expected, while
some are preserved.
Immediate next steps for further investigation include (i ) an exploration of
other definitions of semantic entailment, their properties and respective suitabil-
ity (or not) for effective non-classical reasoning; (ii ) a comparison with standard
systems of paraconsistent logic and other existing non-classical logics; (iii ) the
identification of scenarios for potential applications of the framework here intro-
duced, and (iv ) the definition of a basic proof method, probably based on seman-
tic tableaux [8], that can serve as the backbone of more elaborate proof systems
for extensions of our semantic framework.
Further future work stemming from the basic definitions and results here put
forward can branch in several fruitful directions. A non-exhaustive list includes:
(i) investigating a generalisation of the satisfiability problem [4] and the adaptation of existing approaches and optimised techniques for its solution; (ii) extend-
ing the Kripkean semantics of modal logics [7] to also allow for ‘impossible’ or
‘incomplete’ worlds, or the set-theoretic semantics of description logics [3] to cap-
ture ‘incoherent’ or ‘partially-known’ objects or individuals in formal ontologies,
before considering a move to full first-order logic, and (iii ) revisiting the areas
of belief change [1,10] and non-monotonic reasoning [13] in artificial intelligence,
also benefitting from their semantic constructions in order to define more refined
forms of entailment in our setting.
With regard to the last point above, extra structure may be added to U,
e.g. in the form of a preference relation or a ranking function [11,12], in order
to distinguish worlds according to their level of logical plausibility. For instance,
absurd worlds can be deemed as the least plausible ones, and possible worlds
can be further ranked given extra information (e.g. a knowledge base and its
signature of relevant atomic propositions). The associated entailment relation
then becomes parameterised by such levels and should give rise to a consequence
relation with more interesting properties than those of the basic entailments we
have seen.
Acknowledgements. I would like to thank the anonymous referees for their com-
ments and helpful suggestions. This work was partially supported by the National
Research Foundation (NRF) of South Africa.
References
1. Alchourrón, C., Gärdenfors, P., Makinson, D.: On the logic of theory change: partial
meet contraction and revision functions. J. Symbolic Logic 50, 510–530 (1985)
2. Arieli, O., Avron, A.: The value of the four values. Artif. Intell. 102, 97–141 (1998)
3. Baader, F., Calvanese, D., McGuinness, D., Nardi, D., Patel-Schneider, P. (eds.):
The Description Logic Handbook: Theory, Implementation and Applications, 2nd
edn. Cambridge University Press, Cambridge (2007)
4. Ben-Ari, M.: Mathematical Logic for Computer Science, 3rd edn. Springer, London
(2012)
5. Berto, F., Jago, M.: Impossible Worlds. Oxford University Press, Oxford (2019)
6. Berto, F., Jago, M.: Impossible worlds. In: Zalta, E.N. (ed.) The Stanford Ency-
clopedia of Philosophy. Metaphysics Research Lab, Stanford University, fall 2018
edn. (2018)
7. Chellas, B.: Modal Logic: An Introduction. Cambridge University Press, Cam-
bridge (1980)
8. D’Agostino, M., Gabbay, D., Hähnle, R., Posegga, J. (eds.): Handbook of Tableau
Methods. Kluwer Academic Publishers, Dordrecht (1999)
9. Fitting, M.: Kleene’s three valued logics and their children. Fundamenta Informat-
icae 20(1), 113–131 (1994)
10. Hansson, S.: A Textbook of Belief Dynamics: Theory Change and Database Updat-
ing. Kluwer Academic Publishers, Dordrecht (1999)
11. Kraus, S., Lehmann, D., Magidor, M.: Nonmonotonic reasoning, preferential mod-
els and cumulative logics. Artif. Intell. 44, 167–207 (1990)
12. Lehmann, D., Magidor, M.: What does a conditional knowledge base entail? Artif.
Intell. 55, 1–60 (1992)
13. Makinson, D.: Bridges from Classical to Nonmonotonic Logic, Texts in Computing.
Texts in Computing, vol. 5. King’s College Publications, London (2005)
14. Papadimitriou, C.: Computational Complexity. Addison-Wesley, Boston (1994)
15. Priest, G.: An Introduction to Non-Classical Logic: From If to Is. Cambridge Intro-
ductions to Philosophy, 2nd edn. Cambridge University Press, Cambridge (2001)
16. Priest, G., Berto, F., Weber, Z.: Dialetheism. In: Zalta, E.N. (ed.) The Stanford
Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University, fall
2018 edn. (2018)
17. Rescher, N., Brandom, R.: The Logic of Inconsistency. A Study in Non-Standard
Possible-Worlds Semantics and Ontology. Basil Blackwell, Oxford; APQ Library
of Philosophy (1979)
18. Swart, H.: Logic: Mathematics, Language, Computer Science and Philosophy,
vol. 1. Peter Lang (1993)
Machine Learning Theory
Stride and Translation
Invariance in CNNs
1 Introduction
Traditional computer vision tasks such as image classification and object detec-
tion have been revolutionized by the use of Convolutional Neural Networks
(CNNs) [10]. CNNs are often assumed to be translation invariant, that is, classi-
fication ability is not influenced by shifts of the input image. This is a desirable
characteristic for image recognition, as a specific object or image must be cor-
rectly identified regardless of its location within the canvas area. The assumption
that CNNs exhibit translation invariance, however, has been shown to be erro-
neous by multiple authors [1,5,14], who all show that shifts of the input image
can drastically alter network classification accuracy. This is especially troubling,
as practical applications of CNNs require that an object be recognizable from
2 Background
In order to understand translation invariance, we must first define the terms
that play a role and their relation to one another.
A function f is translation invariant with respect to a transformation g if f(g(I)) = f(I) for any input I, where g translates the input. This implies that g has no effect on the output of function f, and the result
remains equal whether g is applied or not.
A second misconception is that whilst modern CNNs are not transla-
tion invariant, they are translation equivariant. Translation equivariance (also
referred to as covariance by some authors) is the property by which internal
feature maps are shifted in a one-to-one ratio along with shifts of the input. We
define translation equivariance as follows (as adapted from [12]): f is translation equivariant if f(g(I)) = g(f(I)) for any input I and translation g.
This implies that the output is shifted in accordance with the shift of the input,
or, in other terms, that the output of the function f can be translated to produce
the same result as translating the input I before f is applied would.
Intuitively it would be expected that translation equivariance holds for both
convolution and pooling layers, and this intuition proves correct for dense
convolution and pooling, if edge effects are ignored. To illustrate this prop-
erty, an arbitrary one-dimensional filter is applied to a one-dimensional input,
as well as to shifted values of this input. Consider an input signal I[n] =
[0, 0, 0, 0, 1, 2, 0, 0, 0, 0] and a kernel K[n] = [1, 0, 1]: the result of the convolu-
tion I[n] ∗ K[n] is shown in the second column of Table 1.
Table 1. Dense and strided convolution of a one-dimensional input with a kernel [1,0,1],
and shifted variants
As per this example, the equivariance property holds, as a shift of the input
results in an equal shift of the output, meaning f (g(I)) = g(f (I)). However, this
intuition fails when considering subsampling.
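The following NumPy sketch, under the simplifying assumption of zero padding and using the example signal and kernel above, illustrates both points: dense filtering commutes with shifts, while adding a stride of two breaks the equality for a one-sample shift. The helper names (conv1d, shift) are our own.

```python
import numpy as np

def conv1d(signal, kernel):
    """'Same'-size correlation with zero padding, so edge effects are minimised."""
    pad = len(kernel) // 2
    padded = np.pad(signal, pad)
    return np.array([np.dot(padded[i:i + len(kernel)], kernel)
                     for i in range(len(signal))])

def shift(signal, n):
    """Shift right by n samples, filling with zeros (samples pushed past the end are dropped)."""
    out = np.zeros_like(signal)
    out[n:] = signal[:len(signal) - n]
    return out

I = np.array([0, 0, 0, 0, 1, 2, 0, 0, 0, 0])
K = np.array([1, 0, 1])

# Dense filtering is translation equivariant (up to edge effects):
for n in range(1, 4):
    print(n, np.array_equal(conv1d(shift(I, n), K), shift(conv1d(I, K), n)))  # True

# With subsampling (stride 2), the same check fails for a shift of one sample:
stride = 2
print(np.array_equal(conv1d(shift(I, 1), K)[::stride],
                     shift(conv1d(I, K)[::stride], 1)))                       # False
```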
2.2 Subsampling
The main benefit of subsampling is that it can greatly reduce the training
time of CNNs. As He and Sun [3] show, the time complexity of convolution layers
in a CNN is given by:
O( Σ_{l=1}^{d} n_{l−1} · s_l² · n_l · m_l² )    (3)
the reduction in spatial size caused by the size of a kernel during a convolu-
tion/pooling operation, where information along the edges of an input is disre-
garded (commonly referred to as “edge effects”). Conversely, subsampling causes
spatial reduction by explicitly disregarding intermediary samples. Whilst down-
sampling does have an effect on translation invariance [6], we attempt to mitigate
this effect through the use of adequate padding.
3.1 Shiftability
Whilst subsampling breaks the equivariance property, we propose that it can
greatly benefit translation invariance under certain circumstances due to a third
property we define as shiftability. Shiftability holds for systems that make use
of subsampling and is defined as follows (note: this definition differs from that
provided in [1] for shiftability):
and
g′(X) = t(u/s, X)    (7)
Put otherwise, shiftability holds for translations that are integer multiples of the subsampling factor s of f. When subsampling shifted inputs, equivalence will hold if a
given translation is in accordance with the stride. To illustrate this property,
consider an arbitrary input signal that is subsampled by a factor of four and
various shifts of the signal, as shown in Table 2.
In this example, shiftability holds for translations that are multiples of the
subsampling factor (shifts of 4 and 8), and so a scaled form of equivariance is
kept. It is further evident that subsampling scales shifts of the input signal: in
this example, a shift of four in the input results in only a shift of one in the
output.
The subsampling factor dictates how many versions of the output signal can
potentially exist after translation (again ignoring edge effects): In this example
four discrete outputs are present, where all other outputs are merely shifted
variants. In the case of two-dimensional filtering, inputs are subsampled both
Input 0 0 0 0 0 0 0 0 0 3 2 5 2 4 1 6 3 4 6 5 5 0 0 0 0 0 0 0 0 0
Shift Subsampled output
0 00023500
1 00056500
2 00021600
3 00034400
4 00002350
5 00005650
6 00002160
7 00003440
8 00000235
vertically and horizontally, meaning s2 output signals can exist given a single
input and a bounded translation. This further implies that a given input will
only be shiftable for 1/s² of possible translations [1].
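A small NumPy sketch of the shiftability property, using a signal similar to the one in the example above (subsampling factor s = 4), is given below; the assertion checks that shifts which are multiples of s reappear as shifts of u/s in the subsampled output. Function names are illustrative only.

```python
import numpy as np

def shift(x, n):
    out = np.zeros_like(x)
    out[n:] = x[:len(x) - n]
    return out

def subsample(x, s):
    return x[::s]

x = np.array([0] * 9 + [3, 2, 5, 2, 4, 1, 6, 3, 4, 6, 5, 5] + [0] * 9)
s = 4
base = subsample(x, s)

for u in range(0, 9):
    out = subsample(shift(x, u), s)
    if u % s == 0:
        # Shiftability: a shift by a multiple of s reappears as a shift by u // s
        # in the subsampled output (the 'scaled' form of equivariance).
        assert np.array_equal(out, shift(base, u // s))
    print(u, out)

# Ignoring edge effects, at most s distinct output patterns occur for 1-D signals
# (s**2 for 2-D inputs); all other shifts produce shifted copies of these patterns.
```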
To explain how shiftability benefits translation invariance, we must first
define two distinct characteristics that must be accounted for when comparing
outputs of translated inputs to that of untranslated inputs.
Input 0 0 0 0 0 0 2 2 3 3 1 1 2 2 0 0 0 0 0 0
Shift Subsampled output
0 000231200
1 000231200
2 000023120
3 000023120
4 000002312
4 Analysis
In this section we empirically measure the translation invariance of different
architectures and explore how these results relate to local homogeneity.
proposed by Azulay and Weiss [1]. This allows us to determine an exact prob-
ability of a sample being incorrectly classified given a range of translation.
This is useful as it purely measures a change in prediction accuracy and does
not concern itself with other secondary effects.
Table 4. MCS for CIFAR10 networks with varying subsampling and max pooling
kernel sizes (10 Pixel Range)
for a maximum shift of 10 pixels for each network. We observe that in the case
of CIFAR10 2 × 2 max pooling is not sufficient, and a substantial increase in
translation invariance following subsampling is only observed at 3 × 3 pooling
and larger.
For networks that make use of subsampling, we observe that larger ker-
nel sizes always result in greater invariance. Conversely, for networks that do
not make use of subsampling (the first row of Table 4) we observe a signifi-
cant decrease in translation invariance as kernel size is increased. Intuitively one
would expect larger kernels to always provide greater translation invariance, but
this intuition fails since these networks are fully translation equivariant. Finally,
we observe that greater subsampling always results in greater invariance when
adequately sized kernels are used (as in the case of 4 × 4 and 5 × 5 pooling) which
are aligned with our findings on MNIST.
These results support our proposal that stride can significantly increase the
translation invariance of a network, given that it is combined with sufficient local
homogeneity. Furthermore we also find that the inherent homogeneity of a given
dataset dictates the required filtering for subsampling to be effective.
4.3 Anti-aliasing
Zhang [14] proposes a solution to this problem which allows the generaliza-
tion benefits of max pooling without compromising translation invariance. The
author alters strided max pooling by separating it into two distinct layers: (1)
Dense Max Pooling, and (2) Strided Anti-Aliasing. By applying an anti-aliasing
filter, local homogeneity is ensured and the subsequent subsampling operation’s
effect on signal similarity is strongly mitigated, which results in a more transla-
tion invariant network.
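A rough PyTorch sketch of this idea is shown below: a strided 2 × 2 max pooling layer is replaced by dense (stride-1) max pooling followed by a strided depthwise convolution with a fixed binomial ('bin-5') kernel. This is our own approximation of the construction described above and in [14], not the authors' exact implementation; the class and parameter names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlurPool2d(nn.Module):
    """Strided anti-aliasing: low-pass filter each channel with a fixed binomial kernel, then subsample."""
    def __init__(self, channels, stride=2):
        super().__init__()
        self.stride = stride
        a = torch.tensor([1., 4., 6., 4., 1.])          # 'bin-5' binomial coefficients
        kernel = torch.outer(a, a)
        kernel = kernel / kernel.sum()                  # normalise so the filter preserves scale
        self.register_buffer('kernel', kernel.expand(channels, 1, 5, 5).clone())

    def forward(self, x):
        # Depthwise (grouped) convolution applies the same blur to every channel.
        return F.conv2d(x, self.kernel, stride=self.stride, padding=2, groups=x.shape[1])

def antialiased_maxpool(channels):
    # (1) dense max pooling, followed by (2) strided anti-aliasing.
    return nn.Sequential(nn.MaxPool2d(kernel_size=2, stride=1),
                         BlurPool2d(channels, stride=2))

x = torch.randn(1, 16, 32, 32)
print(antialiased_maxpool(16)(x).shape)                 # spatial size roughly halved
```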
The efficacy of this method is explored for both the MNIST and CIFAR10
datasets using the three layer 2 × 2 pooling networks from the previous section.
Each pooling layer present in the network is replaced with a dense max pooling
layer and a bin-5 anti-aliasing filter. These networks are then compared to their
baseline counterparts that do not make use of anti-aliasing. The comparative
MCS for a maximum shift of 10 pixels is shown in Table 5.
Table 5. Mean Cosine Similarity for MNIST and CIFAR10 networks with and without
anti-aliasing for a maximum shift of 10 pixels
We verify this by adding a final global average pooling layer to our baseline
model without subsampling for CIFAR10, and we find that it has a 0% Ptop1
change for shifts within the canvas area. Put otherwise, the system is completely
translation invariant.
Although this might seem to be a complete solution, GAP is not without its
drawbacks. Ignoring the benefits of subsampling, the GAP operation disregards
a tremendous amount of information and could lower the classification ability of
a given architecture, and is therefore not necessarily a suitable solution for every
dataset. However, Fully Convolutional Neural Networks (FCNNs) do make use
of GAP, and withholding the use of subsampling in these architectures could be
a suitable solution for ensuring translation invariance.
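As a sketch of this design choice, a global average pooling head can be appended as below (PyTorch); the channel and class counts are placeholders and the snippet does not reproduce the architectures used in the experiments.

```python
import torch
import torch.nn as nn

# Appending global average pooling (GAP): every spatial position of the final feature
# maps contributes equally, so shifting the input within the canvas leaves the pooled
# descriptor essentially unchanged.
head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),   # (N, C, H, W) -> (N, C, 1, 1)
    nn.Flatten(),              # (N, C)
    nn.Linear(64, 10),         # hypothetical: 64 channels, 10 classes
)

feature_maps = torch.randn(8, 64, 12, 12)
print(head(feature_maps).shape)   # torch.Size([8, 10])
```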
Observing this result, we find that the same pattern emerges as that
of Fig. 1, where greater subsampling leads to greater translation invariance.
However, these networks are much more translation invariant than those not
trained on translated data, with the lowest MCS at a staggering 0.97.
Whilst learned invariance is certainly a powerful tool, Azulay and Weiss point
out that this method can potentially result in models that are overly biased
to translations of the train set, and it cannot be expected to generalize well
to translations of unseen data in all cases. We also point out that MNIST is a
particularly easy problem compared to more complex datasets such as CIFAR100
or ImageNet [9], and usually data augmentation would be required for these
networks to achieve good performance. This implies that the training set must be
explicitly augmented with translated data, which leads to a substantial increase
in training time.
Table 6. Test accuracy for CIFAR10 networks with varying subsampling and kernel
size
We observe that larger kernel sizes generally generalize better, but also that
kernels that are too large (such as 5 × 5 in this case) lead to a reduction in test set
accuracy. This is an expected result - larger kernels lower the variance of a given
sample and result in more locally homogeneous regions, but also implies that
more information is disregarded which negatively impacts the model’s ability
to generalize to samples not seen during training. Similarly, some subsampling
seems to always provide better generalization regardless of kernel size, but too
much subsampling leads to a reduction in model performance. These differences
suggest that there is a slight trade-off between a model’s inherent invariance to
translation and its generalization ability.
For the anti-aliased models of Sect. 4.3 we observe a very small overall effect
on generalization: Table 7 shows the test accuracy of these models with and
without the use of anti-aliasing filters.
The combination of subsampling with anti-aliasing actually improves gener-
alization for the CIFAR10 dataset, and only slightly hampers accuracy for that
Table 7. MNIST and CIFAR10 test accuracy with and without anti-aliasing (AA)
of MNIST. These results are aligned with those of Zhang [14], which show a slight
improvement in generalization for state-of-the-art ImageNet networks using anti-
aliasing. These results, along with those of Sect. 4.3, show that for these data
sets anti-aliasing is effective at improving translation invariance without reduc-
ing generalization ability.
5 Conclusion
References
1. Azulay, A., Weiss, Y.: Why do deep convolutional networks generalize so poorly
to small image transformations? CoRR abs/1805.12177 (2018). http://arxiv.org/
abs/1805.12177
2. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge
(2016). http://www.deeplearningbook.org
3. He, K., Sun, J.: Convolutional neural networks at constrained time cost. CoRR
abs/1412.1710 (2014). http://arxiv.org/abs/1412.1710
4. Jaderberg, M., Simonyan, K., Zisserman, A., Kavukcuoglu, K.: Spatial transformer
networks. CoRR abs/1506.02025 (2015). http://arxiv.org/abs/1506.02025
5. Kauderer-Abrams, E.: Quantifying translation-invariance in convolutional neural
networks. CoRR abs/1801.01450 (2018). http://arxiv.org/abs/1801.01450
6. Kayhan, O.S., van Gemert, J.C.: On translation invariance in CNNs: convolutional
layers can exploit absolute spatial location (2020)
7. Kingma, D., Ba, J.: Adam: a method for stochastic optimization. In: International
Conference on Learning Representations (December 2014)
8. Krizhevsky, A.: Learning multiple layers of features from tiny images. University
of Toronto (May 2012)
9. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep con-
volutional neural networks, pp. 1097–1105 (2012). http://papers.nips.cc/paper/
4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
10. LeCun, Y., et al.: Backpropagation applied to handwritten zip code recognition.
Neural Comput. 1(4), 541–551 (1989)
11. LeCun, Y., Cortes, C., Burges, C.: MNIST database of handwritten digits. http://
yann.lecun.com/exdb/mnist/
12. Lenc, K., Vedaldi, A.: Understanding image representations by measuring their
equivariance and equivalence. CoRR abs/1411.5908 (2014). http://arxiv.org/abs/
1411.5908
13. Scherer, D., Müller, A.C., Behnke, S.: Evaluation of pooling operations in convo-
lutional architectures for object recognition. In: ICANN (2010)
14. Zhang, R.: Making convolutional networks shift-invariant again. CoRR
abs/1904.11486 (2019). http://arxiv.org/abs/1904.11486
Tracking Translation Invariance in CNNs
1 Introduction
With the impressive performance of Convolutional Neural Networks (CNNs) in
object classification [8,9], they have become the go-to option for many modern
computer vision tasks. Due to their popularity, many different architectural vari-
ations of CNNs [2,4,14] have arisen in the past few years that excel at specific
tasks. One of the reasons for their rise in popularity is their capability to deal
with translated input features. It is widely believed that CNNs are capable of
learning translation-invariant representations, since convolutional kernels them-
selves are shifted across the input during execution. In this study we omit com-
plex variations of the CNN architecture and aim to explore translation invariance
2 Related Work
In this section we discuss the CNN architectures and optimization protocol used
in our experiments.
3.2 Datasets
In our analysis we use the MNIST dataset [10] containing 28×28 pixel samples of
handwritten digits, and the CIFAR10 dataset [7] containing 32×32 pixel colour
images from ten different classes. We decide on the use of these two datasets as our
translation-sensitivity quantification metric is quite computationally expensive.
We thus want to perform our analysis on a somewhat simple task (MNIST)
and a more complex task (CIFAR10) without being forced to use massive CNN
architectures to fit the datasets. The MNIST dataset is split into a training set
containing 55 000 samples, a validation set containing 5 000 samples and a test
set containing 10 000 samples. The CIFAR10 dataset is split into a training
set containing 45 000 samples, a validation set containing 5 000 samples and a
test set containing 10 000 samples. To be able to generate translation-sensitivity
maps without loss of features, all samples are zero-padded with a 6-pixel border.
All networks are initialized with He initialization [3] using 3 different seeds.
Adam is used to optimize Cross-Entropy Loss with a batch size of 128. Four
initial learning rates are used: when the best performing learning rate is found
at the edge of the learning rate sweep, the learning rate is varied by 0.001 outside
the sweep range to ensure that only fully optimized networks are used to generate
results. All networks are trained to near-perfect train accuracy (> 99%) and are
optimized on validation accuracy. All results shown are averaged over 3 seeds.
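A rough sketch of this optimization protocol in PyTorch is given below; the model is a placeholder standing in for the architectures in the Appendix tables, and the learning-rate values are illustrative, since the exact sweep used in the experiments is not reproduced here.

```python
import torch
import torch.nn as nn

def init_he(module):
    # He (Kaiming) initialization [3] for convolutional and linear layers.
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_normal_(module.weight, nonlinearity='relu')
        if module.bias is not None:
            nn.init.zeros_(module.bias)

def make_model():
    # Placeholder network; the 40x40 input size assumes the 6-pixel zero padding described above.
    net = nn.Sequential(
        nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Flatten(),
        nn.Linear(32 * 20 * 20, 10),
    )
    net.apply(init_he)
    return net

criterion = nn.CrossEntropyLoss()
batch_size = 128
for lr in (1e-4, 3e-4, 1e-3, 3e-3):        # illustrative sweep values
    model = make_model()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    # ... optimise `criterion` on mini-batches of `batch_size`, select on validation accuracy ...
```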
In this section we define translation invariance and discuss the sensitivity metric
we use to quantify translation invariance.
For a system to be completely translation invariant, its output must not be influ-
enced by any translation of the input. The output of a translation-invariant sys-
tem must thus remain identical for translated and untranslated inputs. Although
translation-invariance is a desirable quality for many image classification sys-
tems, it is seldom achieved in practice. Knowing that complete translation-
invariance is near impossible for standard CNN architectures, we redefine the
term “translation-invariance” to refer to a system’s sensitivity to translated
inputs. This means that a system can be more or less translation-invariant based
on the values received from our translation sensitivity quantification metric.
In the introductory paper [5], the Euclidean Distance between the two vectors
is used as similarity metric. The outputs from the Euclidean Distance calcula-
tion are non-normalized, restricting comparisons at different locations within a
network. (Even if two layers have the same dimensions, the activation values at
different layers may have different size distributions.) To address this normal-
ization issue, we propose the use of Cosine Similarity (Eq. 1) to calculate the
similarity between the two vectors. Cosine Similarity measures the cosine of the
angle between two vectors a and b in a multi-dimensional space, producing a
similarity value between 1 (high similarity) and −1 (high dissimilarity).
cos(θ) = (a · b) / (‖a‖ ‖b‖)    (1)
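The metric itself is straightforward to compute; the following sketch applies Eq. (1) to a pair of output vectors, e.g. a layer's response to an untranslated input and to a translated copy of it. The vectors shown are random placeholders.

```python
import numpy as np

def cosine_similarity(a, b):
    # Eq. (1): cos(theta) = (a . b) / (||a|| ||b||), bounded to [-1, 1].
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative comparison of a layer's output for an untranslated input and for a
# translated copy of the same input (placeholder vectors of length 150):
out_untranslated = np.random.randn(150)
out_translated = out_untranslated + 0.1 * np.random.randn(150)
print(cosine_similarity(out_untranslated, out_translated))   # close to 1: low sensitivity
```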
Dimensionality
Although Cosine Similarity has the advantage of producing normalized results,
it does not allow for the direct comparison of vectors with different dimensions.
When using either Euclidean Distance or Cosine Similarity the dimensions of
the datasets greatly influence the calculated metric. As dimension increases, the
average Euclidean Distance value tends to increase while the average Cosine
Similarity value decreases. This effect produces warped results when comparing
either Euclidean Distance or Cosine similarity values. As we use different out-
put vectors to generate our results, we ensure that we never directly compare results
generated from vectors with different lengths.
Fig. 1. Pearson Correlation Coefficients. The Cosine Similarity and Euclidean Distance
correlation coefficients with classification accuracy are calculated for each class of the
MNIST dataset. These results are generated with the first CNN architecture in Table 1
in the Appendix. Results are averaged over three seeds.
5 Tracking Invariance
As previously stated, standard convolutional neural networks consist mainly of
convolutional layers, separated by pooling layers, followed by a final set of fully
connected layers. The convolutional layers can be seen as encoders that morph
and highlight important features within the input. This is achieved by applying
convolutional kernels, with weights that are finetuned to identify certain features
in the input. The fully connected layers then use these encoded inputs to perform
certain tasks. In essence, convolutional layers learn to identify certain features
and their characteristics within the input and then pass these features to the
following layers in a more efficient representation. These efficient representations
are referred to as “feature maps” and their size depends on several variables such
as kernel size, stride, pooling, padding and input size.
In this section, we aim to determine the extent to which convolutional and
fully connected layers, respectively, contribute to the translation invariance of a
CNN as a whole. Since the fully connected layers of a CNN act as the classifier,
it is desired that the inputs they receive be unaffected by input shifts. Although
we know that complete translation invariance is unlikely with a standard CNN
architecture, it is still desired that the convolutional layers compensate for most
of the translation in the input since they are more equipped (with moving kernels
and spatial awareness) to deal with translation.
In the first experiment, we investigate the effect of convolutional kernel size on
translation invariance on the MNIST dataset. We also investigate how sensitive
a standard CNN is to translation at two locations within the network: before
and after the fully connected layers. To test translation-sensitivity at the first
location, we use the output of the last convolution layer to generate sensitivity
maps. This is done to investigate the effect that convolutional layers have on
the network’s sensitivity to translation. For the second location, the output of
the final fully connected layer is used to generate translation-sensitivity maps
allowing us to see how much translation invariance the fully connected layers
are responsible for.
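One simple way to capture outputs at these two locations is with forward hooks, as in the hedged PyTorch sketch below; the network, layer indices and shift are placeholders standing in for the architectures in Table 1 of the Appendix.

```python
import torch
import torch.nn as nn

# Placeholder CNN standing in for the MNIST networks used here.
model = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(5),
    nn.Flatten(),
    nn.Linear(8 * 5 * 5, 10),
)

captured = {}
def save_output(name):
    def hook(module, inputs, output):
        captured[name] = output.detach().flatten(start_dim=1)
    return hook

# Location 1: output of the last convolutional layer; location 2: final fully connected layer.
model[2].register_forward_hook(save_output('last_conv'))
model[6].register_forward_hook(save_output('final_fc'))

x = torch.randn(1, 1, 40, 40)                               # an (already padded) input sample
x_shifted = torch.roll(x, shifts=(3, 0), dims=(2, 3))       # shift three pixels vertically

results = {}
for name, inp in [('orig', x), ('shifted', x_shifted)]:
    model(inp)
    results[name] = {k: v.clone() for k, v in captured.items()}

# results['orig']['last_conv'] vs results['shifted']['last_conv'] (and likewise for
# 'final_fc') can now be compared with cosine similarity to build sensitivity maps.
```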
All convolutional kernel sizes are kept constant throughout each convolu-
tional layer of each network, but varied over the three different networks. Zero-
padding is used to ensure that all feature maps have the same size regardless
of the change in convolutional kernel size. This is done to allow us to compare
convolutional layer outputs across the three CNNs. By keeping the size of the
feature maps produced by the final convolutional layer the same across the dif-
ferent CNNs (5 × 5 pixels), we can be assured that changes in dimensionality
do not affect the results. The number of channels per convolutional layer is kept
constant across all three networks. The output from the final fully connected lay-
ers all have a length of 10 and require no modifications to be comparable with
one another. All networks are optimized as described in Sect. 3, and achieve
comparable performance on the held-out test set.
In Fig. 2(a) we see the radial translation-sensitivity functions generated from
the outputs of the final convolutional layers of the CNNs. It seems that smaller
convolutional kernels produce a feature map that is slightly less sensitive to
translated inputs. Although it is expected that the convolutional layers would
be responsible for most translation invariance of the network, the fully connected
layers drastically change the results. The results in Fig. 2(b) are generated from
the final fully connected layers of the CNNs and show that networks with larger
convolutional kernel sizes tend to be more translation-invariant after the fully
connected layer.
(a) Final convolutional layer output (b) Final fully connected layer output
Fig. 2. Radial translation-sensitivity functions generated from (a) the final convolu-
tional layer output and (b) the fully connected layer output on MNIST. Detailed CNN
architectures can be found in Table 1 in the Appendix.
(a) Non-Shifted Input (b) Ch.1 Output (c) Ch.2 Output (d) Ch.3 Output
(e) Shifted Input (f) Ch.1 Output (g) Ch.2 Output (h) Ch.3 Output
Fig. 3. Feature maps from the final convolutional layer of a CNN given a normal and
shifted input sample. These feature maps were randomly selected to show the presence
of translation after three convolutional layers. Similar translation effects are present in
all 150 feature maps.
(a) Final convolutional layer output (b) Final fully connected layer output
Fig. 4. Radial translation-sensitivity functions generated from (a) the final convolu-
tional layer output and (b) the fully connected layer output on CIFAR10. Detailed
CNN architectures can be found in Table 2 in the Appendix.
in Fig. 5. It seems that reducing feature map size has little to no influence on
translation invariance, especially in a CNN trained on a more complex dataset
that forces the network to be more translation invariant during training.
7 Conclusion
In this paper we use translation-sensitivity maps to analyse how different com-
ponents of a standard CNN affect the network’s translation invariance. We train
several standard CNNs on the MNIST and CIFAR10 datasets. We propose a
slight change to the similarity metric and demonstrate that it produces compa-
rable results to the prior metric, with the added benefit of normalizing results
across layers. Specifically, we focus on convolutional kernel size and find that
smaller kernels tend to produce feature maps that are less sensitive to trans-
lated inputs. We also study how convolutional and fully connected layers affect
translation invariance and find that although convolutional layers contribute, it
seems that fully connected layers are responsible for the majority of translation
invariance in a standard CNN. In our study we also vary feature map size and
find that it has little effect on translation sensitivity.
In our study we focus on standard CNN architectures that can fit the MNIST
and CIFAR10 datasets. Although these datasets have varying levels of complex-
ity, the samples they contain are relatively small in size. We believe that an in-
depth study on the effects of convolutional kernel size on translation invariance
in larger CNNs able to fit a complex dataset such as ImageNet may produce
interesting insights. We would expect to see a similar pattern as in our work
(smaller convolutional kernels result in more translation invariance) but to a
much larger extent as larger inputs tend to contain more location information.
Here we show the network architectures and accuracies of the CNNs used in our
experiments.
References
1. Deng, J., et al.: ImageNet: a large-scale hierarchical image database. In: CVPR
2009 (2009)
2. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition.
CoRR abs/1512.03385 (2015). http://arxiv.org/abs/1512.03385
3. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing human-
level performance on imagenet classification. CoRR abs/1502.01852 (2015). http://
arxiv.org/abs/1502.01852
4. Huang, G., Liu, Z., Weinberger, K.Q.: Densely connected convolutional networks.
CoRR abs/1608.06993 (2016). http://arxiv.org/abs/1608.06993
5. Kauderer-Abrams, E.: Quantifying translation-invariance in convolutional neural
networks. CoRR abs/1801.01450 (2018). http://arxiv.org/abs/1801.01450
6. Kayhan, O.S., Gemert, J.C.: On translation invariance in CNNs: convolutional
layers can exploit absolute spatial location. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), June 2020
7. Krizhevsky, A., Nair, V., Hinton, G.: CIFAR-10 (Canadian Institute for Advanced
Research). http://www.cs.toronto.edu/∼kriz/cifar.html
8. Krizhevsky, A., Sutskever, I., Hinton, G.: Imagenet classification with deep convo-
lutional neural networks. Neural Inf. Process. Syst. 25 (2012). https://doi.org/10.
1145/3065386
9. Lecun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to
document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
10. LeCun, Y., Cortes, C.: MNIST handwritten digit database (2010). http://yann.
lecun.com/exdb/mnist/
11. Lenc, K., Vedaldi, A.: Understanding image representations by measuring their
equivariance and equivalence. CoRR abs/1411.5908 (2014). http://arxiv.org/abs/
1411.5908
12. Scherer, D., Müller, A., Behnke, S.: Evaluation of pooling operations in convolu-
tional architectures for object recognition. In: Diamantaras, K., Duch, W., Iliadis,
L.S. (eds.) ICANN 2010. LNCS, vol. 6354, pp. 92–101. Springer, Heidelberg (2010).
https://doi.org/10.1007/978-3-642-15825-4 10
13. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition. CoRR abs/1409.1556 (2014). http://arxiv.org/abs/1409.1556
14. Szegedy, C., et al.: Going deeper with convolutions. CoRR abs/1409.4842 (2014).
http://arxiv.org/abs/1409.4842
15. Zhang, R.: Making convolutional networks shift-invariant again. CoRR
abs/1904.11486 (2019). http://arxiv.org/abs/1904.11486
Pre-interpolation Loss Behavior
in Neural Networks
1 Introduction
According to the principle of empirical risk minimization, it is possible to opti-
mize the performance on machine learning tasks (e.g. classification or regression)
by reducing the empirical risk on a surrogate loss function as measured on a train-
ing dataset [5]. The success of this depends on several assumptions regarding the
sampling methods used to obtain the training data and the consistency of the
risk estimators [14]. Assuming such criteria are met, we expect the training loss
to decrease throughout training and that the loss on samples not belonging to
the training samples (henceforth referred to as validation or evaluation loss) will
initially decrease but eventually increase as a result of overfitting on spurious
features in the training set.
Actual performance is usually not directly measured with the loss function
but rather with a secondary measurement, such as classification accuracy in a
Fig. 1. Learning curves of an example 1×1 000 MLP trained on 5 000 FMNIST samples
using SGD with a mini-batch size of 64. The network clearly shows an increasing
validation loss with a slightly increasing validation accuracy.
The cause of this behavior can easily be thought to be shallow local optima or
borderline cases of correct classification. While this explanation is consistent with
classical ideas of overfitting, it does not fully explain observed behavior. Specifi-
cally, if this is the extent of the phenomenon, there is no reason for improvement
in validation accuracy once a local optimum is found, and there would be an obvious
quantitative limit to the amount by which the validation loss can increase.
By investigating the distribution of per-sample validation loss values and not
just a point estimation (typically averaged over all samples) we show that the
increase in average validation loss can be attributed to a minority of validation
samples. This means that the discrepancy between the validation loss and accu-
racy is due to a form of overfitting that only affects the predictions of some
validation samples, thereby allowing the model to still generalize well to most of
the validation set.
The following is a summary of the main contributions of this paper:
– We present empirical evidence of a characteristic of empirical risk minimiza-
tion in MLPs performing classification tasks, that sheds light on an apparently
paradoxical relationship between validation loss and classification accuracy.
– We explain how this phenomenon is largely a result of quantitative increases in
related parameter values and the limits of using a point estimator to measure
overfitting.
– We discuss the practical and theoretical implications of this phenomenon with
regards to generalization in related machine learning models.
2 Background
Much work has been done to characterize how a neural network’s performance
changes over training iterations [7,10,17,20]. Such work has led to some pow-
erful machine learning techniques, including drop-out [8] and batch normaliza-
tion [11]. While both theoretically principled and practically useful generaliza-
tion bounds remain out of reach, many heuristics have been found that appear to
indicate whether a trained neural network will generalize well. These heuristics
have varying degrees of complexity, generality, and popularity, and include: small
weight norms, flatness of loss landscapes [9], heavy-tailed weight matrices [13],
and large margin distributions [19]. All of these proposed metrics have empirical
evidence to support their claims of contributing to the generalization ability of
a network, however, none of them have been proven to be a sufficient condition
to ensure generalization in general circumstances.
A popular experimental framework used to investigate generalization in deep
learning is to explore the optimization process of so-called “toy problems”. Such
experiments are typically characterized by varying different design choices or
training conditions, in an often simplified machine learning model, and then
interpreting the performance of resulting models on test data [6,18]. The per-
formance can be investigated post-training but it is often informative to observe
how the generalization changes during training.
A good example of why it is important to consider performance during train-
ing is the double descent phenomenon [2,15]. This phenomenon has enjoyed much
attention recently [1,3,16], due to its apparent bridging of classical and modern
regimes of representational complexity. In its most basic form it is characterized
by poor generalization within a “critically parameterized” regime of represen-
tational capacities near the minimum that is necessary to interpolate the entire
training set. Slightly smaller or larger models produce improved generalization.
However, if early stopping is used the phenomenon has been found to be almost
non-existent [15].
Having an accurate estimate of test loss and how it changes during train-
ing is clearly beneficial in investigating generalization. In the current work we
show that averaging over all test samples can result in a misrepresentation of
generalization ability and that this can account for the sometimes paradoxical
relationship between test accuracy and test loss.
3 Approach
We use a simple experimental setup to explore the validation loss behavior of
various fully-connected feedforward networks. All models use a multilayer per-
ceptron (MLP) architecture where hidden layers have an equal number of ReLU-
activated nodes. This architecture, while simple, still uses the fundamental prin-
ciples common to many deep learning models, that is, a set of hidden layers
optimized by gradient descent, using backpropagation to calculate the gradient
of a given loss function with regard to the parameters being optimized.
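As an illustration, the family of architectures just described can be generated with a small helper such as the following PyTorch sketch; the function name and default input/output sizes are our own choices.

```python
import torch.nn as nn

def make_mlp(n_hidden_layers, width, n_inputs=28 * 28, n_classes=10):
    """An 'n_hidden_layers x width' MLP with ReLU-activated hidden layers of equal size."""
    layers = [nn.Flatten()]
    in_features = n_inputs
    for _ in range(n_hidden_layers):
        layers += [nn.Linear(in_features, width), nn.ReLU()]
        in_features = width
    layers.append(nn.Linear(in_features, n_classes))
    return nn.Sequential(*layers)

# e.g. the 1 x 1000 and 3 x 100 configurations referred to elsewhere in the text:
mlp_wide = make_mlp(1, 1000)
mlp_deep = make_mlp(3, 100)
```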
We first determine whether the studied phenomenon (both validation accuracy
and loss displaying an increase during training) occurs in general circumstances,
and then select a few models where this phenomenon is clearly visible. We then
probe these models to better understand the mechanism causing this effect.
The experiments are performed on the well-known MNIST [12] and
FMNIST [21] classification datasets. These datasets consist of 60 000 training
samples and 10 000 test samples of 28 × 28 grayscale images with an associ-
ated label ∈ [0, 9]. FMNIST can be regarded as a slightly more complex drop-in
replacement for MNIST. Recently these datasets have become less useful as
benchmarks, but they are still popular resources for investigating theoretical
principles of DNNs.
All models are optimized to reduce a cross-entropy loss function measured
on mini-batches of training samples. Techniques that could have a regularizing
effect on the optimization process (such as batch normalization, drop-out, early-
stopping or weight decay) were omitted as far as possible. Networks are trained
till convergence, with the exact stopping criteria different for the separate exper-
iments, as described per set of results.
A selection of hyperparameters were investigated to ensure a variety of vali-
dation loss behaviors during training. These hyperparameters are:
– Training and validation set sizes;
– The number of hidden layers;
– The number of nodes in each hidden layer;
– Mini-batch sizes;
– Datasets (MNIST or FMNIST); and
– Optimizers (Adam or SGD).
Parameter settings differed per experiment, as detailed below. Take note that
the validation sets are held out from the train set, so a larger train set will result
in a smaller validation set and vice versa.
4 Results
Our initial experiments show that the average validation loss can indeed increase
with a stable or increasing validation accuracy for a wide variety of hyperpa-
rameters (Sect. 4.1). Based on this result, we select a few models where the
phenomenon is clearly visible, and investigate the per-sample loss distributions
throughout training, as well as weight distributions, to probe the reason for this
behavior (Sects. 4.2 and 4.3).
Fig. 2. Final validation loss (left) and validation accuracy (right) vs the same metric at
the epoch with minimum validation loss. 95 MLPs are trained on 5 000 MNIST samples
with the number of nodes in the hidden layer ranging from only 7 (red) to 2 000 (blue).
Models marked with a triangle had increasing validation loss and accuracy between
the epoch of minimum validation loss and the epoch of first interpolation. (Color figure
online)
Fig. 3. Final validation loss (left) and validation accuracy (right) vs the same metric
at the epoch with minimum validation loss. 80 MLPs are trained with varying hyper-
parameters; colors refer to different training sets. Models marked with a triangle had
increasing validation loss and accuracy between the epoch of minimum validation loss
and the epoch of first interpolation. (Color figure online)
These networks are trained for 150 epochs. To ensure that each model's performance is good enough to be considered typical for these architectures and datasets, we use optimized learning rates: the learning rate for each set of hyperparameters is chosen by a grid search over a wide range of values, selecting the value that yields the best validation error by the end of training. In some cases this resulted in final training accuracies slightly below 100%; in these cases, the epoch at which maximum training accuracy was achieved was taken as the "final" epoch.
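This selection procedure can be sketched as follows; run_experiment is a hypothetical helper that trains one model for 150 epochs and returns per-epoch training accuracy and validation error, and the grid of candidate learning rates is illustrative.

```python
def select_learning_rate(candidate_lrs, run_experiment):
    """Grid search: pick the learning rate with the best validation error at the
    'final' epoch, i.e. the epoch of maximum training accuracy."""
    best_lr, best_val_error = None, float("inf")
    for lr in candidate_lrs:
        history = run_experiment(lr)  # list of dicts with 'train_acc' and 'val_error'
        final = max(range(len(history)), key=lambda e: history[e]["train_acc"])
        if history[final]["val_error"] < best_val_error:
            best_lr, best_val_error = lr, history[final]["val_error"]
    return best_lr

candidate_lrs = [1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1]  # illustrative grid
```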
As expected, models trained on MNIST or on larger training sets generally had lower validation losses and higher validation accuracies. However, we also note that, compared to models trained on MNIST, models trained on FMNIST tend to show much larger increases in validation loss while the validation accuracy is still improving. In the next section we investigate how the loss
distributions change for selected models from this section.
We select the following four models, for which the phenomenon is clearly visible:
– A: 3 × 100 model trained on 5 000 MNIST samples using Adam and a mini-batch size of 64.
– B: 3 × 100 model trained on 5 000 MNIST samples using SGD and a mini-
batch size of 64.
– C: 3 × 100 model trained on 5 000 FMNIST samples using SGD and a mini-
batch size of 64.
– D: 3 × 100 model trained on 55 000 FMNIST samples using Adam and a
mini-batch size of 16.
The learning curves for these models are presented in Fig. 4. Notice that for
all four models a minimum validation loss is achieved early on. Beyond this point
the validation loss increases while the corresponding accuracy is either stable or
improving slightly.
Fig. 4. Learning curves for four selected models (A-D, see text) showing increasing
validation loss, despite an increasing or stable validation accuracy.
The validation loss curve, as seen in Fig. 4, is often used as an estimate of the
level of overfitting that is occurring as the model is optimized on the training set.
However, by averaging over the entire validation set we are producing a point
estimate that implicitly assumes that the losses of all validation samples are close to the mean value. This assumption is reasonable with regard to the training
set because most loss functions (e.g. cross entropy) work with the principle of
maximum likelihood estimation [5,14]. This means that, by minimizing the dissimilarity between the model and the entire distribution of the training data, the average loss is all but guaranteed to be indicative of performance on the entire training set.
There is no such guarantee with regard to any set other than the training set. This makes the average loss value a poor estimate of performance on the
validation set. The results presented in Fig. 5 motivate this point for model A.
See Appendix A for the same results for models B, C, and D. The plots show
heatmaps of loss distributions for the four selected models at several training
iterations for three datasets (training, validation and evaluation). The validation
set is the held-out set that is used to estimate performance during training and
model selection, and the evaluation set is the set that is used post-training to
ensure no indirect optimization is performed on the test data. The iterations
refer to parameter updates, not epochs. We show the distributions at log-sampled
iterations because many changes occur early on (even before the end of the first
epoch) and few occur towards the end of training. A final note with regard to these heatmaps is that the colors, which indicate the number of samples with the corresponding loss value, are also log-scaled. This visually highlights
the occurrence of samples with extreme loss values.
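The heatmaps are built from per-sample (unreduced) losses; a sketch of how one column of such a heatmap could be computed is given below. PyTorch, NumPy, and the log-spaced checkpoint grid are assumptions of the sketch, not a description of our exact tooling.

```python
import numpy as np
import torch
import torch.nn.functional as F

def per_sample_losses(model, inputs, targets):
    """Cross-entropy loss of every sample individually, without averaging."""
    model.eval()
    with torch.no_grad():
        return F.cross_entropy(model(inputs), targets, reduction="none").cpu().numpy()

# Log-spaced iterations at which distributions are recorded (illustrative):
# dense early in training, where most changes occur, sparse towards the end.
checkpoints = np.unique(np.logspace(0, 5, num=30).astype(int))

def heatmap_column(losses, bins):
    """One histogram (one column of the heatmap); counts are later drawn on a
    log color scale so the few samples with extreme losses remain visible."""
    counts, _ = np.histogram(losses, bins=bins)
    return counts
```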
The loss distributions in Fig. 5 show that, while the loss value for the vast majority of samples (indicated by the red and orange colors) decreases with training iterations, there is a small minority of samples for which the loss values increase.
For the training set, this increase is relatively low and eventually reduces as the
entire set is interpolated. For the validation and evaluation sets the loss values
of these “outliers” seem to only increase. This is why it is possible for the aver-
age validation loss to increase while the classification accuracy remains stable or
improves.
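The arithmetic behind this can be illustrated with a toy calculation; all numbers below are hypothetical and chosen only to show how a single outlier can dominate the mean.

```python
import numpy as np

# Toy illustration: 999 well-modeled validation samples and one outlier whose
# loss keeps growing during training.
well_modeled = np.full(999, 0.05)
for outlier_loss in [5.0, 15.0, 30.0]:
    losses = np.append(well_modeled, outlier_loss)
    print(f"outlier loss {outlier_loss:5.1f} -> mean validation loss {losses.mean():.3f}")
# The mean rises from ~0.055 to ~0.080 although 999 of 1 000 samples keep a loss
# of 0.05, i.e. accuracy can stay constant while the average loss increases.
```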
Figure 6 shows the weight distributions for the same model in the same format
as Fig. 5. See Appendix A for models B, C, and D. It can be observed that there
is a clear increase in the magnitude of some weights (their absolute weight values)
at the same iterations where we observe a corresponding increase in validation
and evaluation loss values in Fig. 5. This effect becomes even more pronounced after most of the training sample losses have been minimized. This is consistent with the notion of limiting weight norms to improve generalization, and it suggests that the validation losses increase because particular weights are being increased to fit idiosyncratic training samples.
While these heatmaps show that there are outlier per-sample loss values
in the validation set, they do not guarantee that these extreme loss values are
due to specific samples. It is possible that the extreme values are measured on completely different samples at every measured iteration, in which case there is nothing atypical about the samples themselves and the phenomenon is attributable to the optimization process rather than to the training and validation distributions. We address
this question in the next section.
Fig. 5. Change in loss distributions during training for model A (5k MNIST, Adam,
mini-batch size of 64). The three heatmaps refer to the train (top), validation (center),
and evaluation (bottom) loss distributions.
Fig. 6. Change in weight distributions during training for model A (MNIST, Adam,
mini-batch size of 64). Each heatmap refers to a layer in the network, including the
output layer, from top to bottom.
In this section we investigate whether the validation set samples with extreme
loss values are individual samples that are consistently modeled poorly, or
whether these outliers change from iteration to iteration due to the stochas-
tic nature of the optimization process. To this end, we analyze the number
of epochs for which a sample can be regarded as an outlier and compare it with
its final loss value.
We classify a sample as an outlier when its loss value is above the upper
Tukey fence, that is, larger than Q3 + 1.5 × (Q3 − Q1), where Q1 and Q3 are the first and third quartiles of all loss values in the validation set, respectively [4].
This indicator is simple and adequate to illustrate whether some specific samples
consistently have much larger loss values than the majority.
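A sketch of this outlier criterion follows; NumPy and the helper name tukey_outliers are assumptions of the sketch.

```python
import numpy as np

def tukey_outliers(val_losses: np.ndarray) -> np.ndarray:
    """Flag validation samples whose loss exceeds the upper Tukey fence."""
    q1, q3 = np.percentile(val_losses, [25, 75])
    return val_losses > q3 + 1.5 * (q3 - q1)

# Per-sample outlier-epoch counts, given one loss vector per epoch:
# outlier_epochs = sum(tukey_outliers(losses) for losses in per_epoch_val_losses)
```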
In Fig. 7 we show that the validation samples with extreme loss values at the
end of training are usually classified as outliers for most of the training process.
This means that the extreme validation loss values are due to specific samples
that are not well modeled. In addition, it is worth observing that the large majority of validation samples are never classified as outliers, and these samples always have small loss values at the end of training.
5 Discussion
We have shown that validation classification accuracy can increase while the
corresponding average loss value also increases. Empirically, we have noted that
this phenomenon is most influenced by the interplay between the training dataset
and model capacity. Specifically, it occurs more for larger models, smaller train-
ing datasets, and more difficult datasets (FMNIST in our investigation). We can, however, combine the first two factors, since model capacity should be judged relative to the size and complexity of the training set.
By taking a closer look at per-sample loss distributions and weight distri-
butions we have noted that the phenomenon is largely due to specific samples
in the validation set that have extremely large loss values and obtain progres-
sively larger loss values as training continues. These loss values become large enough to distort the average loss value in such a way that the model appears to be overfitting the training set, even though most of the validation set sample losses are still being minimized. From a theoretical viewpoint this is unsurprising, because the average validation loss is only a good measure of risk with regard to the training set, where it is directly being minimized according to the principle of maximum likelihood estimation. From a practical viewpoint it appears that increased
weight values are sacrificing the generality of the distributed representation used
by DNNs in order to minimize training loss as much as possible.
Practically, these findings serve as a clear cautionary tale against (1) assuming an inverse correlation between loss and accuracy, and (2) measuring overfitting with point estimators such as the average validation loss. Rather, we show that loss
distribution heatmaps (Fig. 5) provide additional, useful information.
Fig. 7. Outliers in the validation set. The blue datapoints show the number of epochs
for which each sample is considered an outlier. The red datapoints show the loss value
of each sample at the end of training. Samples are ordered in ascending order of epoch
counts.
The findings also highlight a more general aspect of generalization and deep
learning: DNNs optimize parameters with regard to training data in a heterogeneous manner. With sufficient parametric flexibility, these types of models can
fit generalizable features and memorize non-generalizable features concurrently
during training. Formally defining how this is achieved, and subsequently, how
generalization should be characterized in this context, remains an open problem.
6 Conclusion
These findings imply that a validation loss that starts increasing prior to interpolation of the training set is not necessarily an indication of overfitting, and that it is dangerous to assume a negative correlation between validation accuracy and loss (as is often done when selecting hyperparameters).
We note that this study focused on a narrow set of architectures and datasets.
Testing our findings for different scenarios – more complex architectures and
more challenging datasets, such as imbalanced and sparse datasets – remains
future work. While this study aimed to answer a very specific question, we hope it
will contribute to the general discourse on factors that influence the optimization
process and generalization ability of neural networks.
A Appendix
We include the results for models B, C, and D, analyzed using the same process as described in Sect. 4.2.
Fig. 8. Change in loss (left) and weight (right) distributions, for models B, C, and D,
during training. See Figs. 5 and 6 for plot ordering. (Color figure online)
References
1. Ba, J., Erdogdu, M., Suzuki, T., Wu, D., Zhang, T.: Generalization of two-layer
neural networks: an asymptotic viewpoint. In: International Conference on Learn-
ing Representations (2020)
2. Belkin, M., Hsu, D., Ma, S., Mandal, S.: Reconciling modern machine-learning
practice and the classical bias-variance trade-off. Proc. Natl. Acad. Sci. 116, 15849–
15854 (2019)
3. d’Ascoli, S., Refinetti, M., Biroli, G., Krzakala, F.: Double trouble in double
descent: bias and variance(s) in the lazy regime. In: Thirty-seventh International
Conference on Machine Learning, pp. 2676–2686 (2020)
4. Devore, J., Farnum, N.R.: Applied Statistics for Engineers and Scientists. Thomson
Brooks/Cole, Belmont (2005)
5. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge
(2016). http://www.deeplearningbook.org
6. Goodfellow, I.J., Vinyals, O.: Qualitatively characterizing neural network optimiza-
tion problems. CoRR abs/1412.6544 (2015)
7. Hardt, M., Recht, B., Singer, Y.: Train faster, generalize better: stability of stochas-
tic gradient descent. In: International Conference on Machine Learning, pp. 1225–
1234. PMLR (2016)
8. Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.:
Improving neural networks by preventing co-adaptation of feature detectors. CoRR
abs/1207.0580 (2012). http://arxiv.org/abs/1207.0580
9. Hochreiter, S., Schmidhuber, J.: Flat minima. Neural Comput. 9, 1–42 (1997)
10. Hoffer, E., Hubara, I., Soudry, D.: Train longer, generalize better: closing the gen-
eralization gap in large batch training of neural networks. In: Advances in Neural
Information Processing Systems, pp. 1731–1741 (2017)
11. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by
reducing internal covariate shift. In: Proceedings of the 32nd International Confer-
ence on Machine Learning, 2015, Lille, France, 6–11 July 2015. JMLR Workshop
and Conference Proceedings, vol. 37, pp. 448–456. JMLR.org (2015)
12. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to
document recognition. Proc. IEEE 86, 2278–2324 (1998). https://doi.org/10.1109/
5.726791
13. Martin, C., Mahoney, M.: Implicit self-regularization in deep neural networks:
evidence from random matrix theory and implications for learning. CoRR abs/1810.01075 (2018)
14. Murphy, K.P.: Machine Learning: A Probabilistic Perspective. The MIT Press,
Cambridge (2012)
15. Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep dou-
ble descent: where bigger models and more data hurt. In: International Conference
on Learning Representations (2020)
16. Nakkiran, P., Venkat, P., Kakade, S.M., Ma, T.: Optimal regularization can miti-
gate double descent. CoRR abs/2003.01897 (2020)
17. Neyshabur, B., Tomioka, R., Srebro, N.: In search of the real inductive bias: on
the role of implicit regularization in deep learning. CoRR abs/1412.6614 (2015)
18. Novak, R., Bahri, Y., Abolafia, D.A., Pennington, J., Sohl-Dickstein, J.: Sensi-
tivity and generalization in neural networks: an empirical study. In: International
Conference on Learning Representations (2018)
19. Sokolic, J., Giryes, R., Sapiro, G., Rodrigues, M.: Robust large margin deep neural
networks. IEEE Trans. Signal Process. 65, 4265–4280 (2017)
20. Wilson, D.R., Martinez, T.: The general inefficiency of batch training for gradient
descent learning. Neural Netw. 16(10), 1429–1451 (2003)
21. Xiao, H., Rasul, K., Vollgraf, R.: Fashion-MNIST: a novel image dataset for bench-
marking machine learning algorithms. CoRR abs/1708.07747 (2017). http://arxiv.
org/abs/1708.07747