Interpretable and Explainable Logical Policies Via Neurally Guided Symbolic Abstraction
Abstract
The limited priors required by neural networks make them the dominant choice to encode
and learn policies using reinforcement learning (RL). However, they are also black boxes,
making it hard to understand the agent's behaviour, especially when they work on the image level.
Therefore, neuro-symbolic RL aims at creating policies that are interpretable in the first place.
Unfortunately, interpretability is not explainability. To achieve both, we introduce Neurally
gUided Differentiable loGic policiEs (NUDGE). NUDGE exploits trained neural network-based
agents to guide the search for candidate weighted logic rules, then uses differentiable
logic to train the logic agents. Our experimental evaluation demonstrates that NUDGE agents
can induce interpretable and explainable policies while outperforming purely neural ones and
adapting well to environments with different initial states and problem sizes.
1 Introduction
Deep reinforcement learning (RL) agents use neural networks to make decisions directly from the unstructured input state
space, without manual engineering [Mnih et al., 2015]. However, these black-box policies lack interpretability
[Rudin, 2019], i.e. the capacity to articulate the reasoning behind the action selection. They are also not robust
to environmental changes [Pinto et al., 2017, Wulfmeier et al., 2017]. Although performing object detection and
policy optimization independently can mitigate the robustness issue [Devin et al., 2018], the interpretability
problem remains when neural networks are employed to encode the policy.
As logic constitutes a unified symbolic language that humans use to compose the reasoning behind their behavior,
logic-based policies can tackle the interpretability problems for RL. Recently proposed Neural Logic RL (NLRL)
agents [Jiang and Luo, 2019] construct logic-based policies using differentiable rule learners called ∂ILP [Evans
and Grefenstette, 2018], which can then be integrated with gradient-based optimization methods for RL. It
represents the policy as a set of weighted rules, and performs policy gradients-based learning to solve RL tasks
which require relational reasoning. It successfully produces interpretable rules, which describe each action
in terms of its preconditions and outcome. However, the number of potential rules grows exponentially with
the number of considered actions, entities, and their relations. NLRL is memory-intensive: it generates a set of
potential simple rules based on rule templates, and has therefore only been evaluated on simple abstract
environments created for the occasion. Moreover, the template-based approach can generate many newly invented
predicates without a specified meaning [Evans and Grefenstette, 2018], making the policy hard to interpret in
complex environments. Finally, explainability is absent, i.e. the agent cannot quantify the importance of
each input for its decision. Explainable agents should adaptively produce different explanations given different
input states. A question thus arises: How can we build interpretable and explainable RL agents that are robust to
environmental changes?
∗ These authors contributed equally.
Figure 1: Overview of NUDGE. Given a state (depicted in the image), NUDGE computes the action distribution
using relational state representation and differentiable forward reasoning. NUDGE provides interpretable and
explainable policies, i.e. derives policies as sets of interpretable weighted rules, and can produce explanations
using gradient-based attribution methods.
To this end, we introduce Neurally gUided Differentiable loGic policiEs (NUDGE), illustrated in Fig. 1, that
embody the advantages of logic: they are easily adaptable to environmental changes, composable, interpretable
and explainable (because of our differentiable logic module). Given an input state, NUDGE extracts entities
and their relations, converting raw states to logic representations. These probabilistic relational states are used
to deduce actions, using differentiable forward reasoning [Evans and Grefenstette, 2018, Shindo et al., 2023].
NUDGE produces a policy that is both interpretable, i.e. provides a policy as a set of weighted interpretable rules
that can be read out by humans, and explainable, i.e. explains which input is important using gradient-based
attribution methods [Sundararajan et al., 2017] over logical representations.
To achieve efficient policy learning with NUDGE, we provide an algorithm to train NUDGE agents based on
the PPO actor-critic framework. Moreover, we propose a novel rule-learning approach, called Neurally-Guided
Symbolic Abstraction, where the candidate rules for the logic-based agents are obtained efficiently under the
guidance of neural-based agents. NUDGE distills abstract representations of neural policies in the form of
logic rules. The rules are assigned weights, and we perform gradient-based optimization using the PPO
actor-critic framework.
Overall, we make the following contributions:
1. We propose NUDGE²: differentiable logical policies that learn interpretable rules and produce explanations
for their decisions in complex environments. NUDGE uses neurally-guided symbolic abstraction to efficiently
find a promising ruleset using the guidance of pretrained neural-based agents.
2. We empirically show that NUDGE agents: (i) can compete with neural-based agents, (ii) adapt to environmen-
tal changes, and (iii) are interpretable and explainable, i.e. produce interpretable policies as sets of weighted
rules and provide explanations for their action selections.
3. We evaluate NUDGE on 2 classic Atari games and on 3 proposed object-centric logically challenging
environments, where agents need relational reasoning in dynamic game-playing scenarios.
We start off by introducing the necessary background. Then we explain NUDGE’s inner workings and present
our experimental evaluation. Before concluding, we touch upon related work.
2 Background
We now describe the necessary background before formally introducing our NUDGE method.
Deep Reinforcement Learning. Reinforcement learning problems are modelled as a Markov decision process
$\mathcal{M} = \langle S, A, P, R \rangle$ where, at every timestep $t$, an agent is in a state $s_t \in S$, takes an action $a_t \in A$, receives a
reward $r_t = R(s_t, a_t)$ and transitions to the next state $s_{t+1}$, according to the environment dynamics $P(s_{t+1} \mid s_t, a_t)$.
Deep agents attempt to learn a parametric policy $\pi_\theta(a_t \mid s_t)$ in order to maximize the return, i.e. $\sum_t \gamma^t r_t$. In RL
problems, the desired input-to-output (i.e. state-to-action) distribution is not directly accessible, as RL agents only
2 Code publicly available: https://github.com/k4ntz/LogicRL.
observe returns. The value Vπθ (st ) (resp. Q-value Qπθ (st , at )) function provides the return of the state (resp.
state/action pair) following the policy πθ . Policy-based methods directly optimize πθ using the noisy return
signal, leading to potentially unstable learning. Value-based methods learn to approximate the value functions
V̂ϕ or Q̂ϕ , and implicitly encode the policy, e.g. by selecting the actions with the highest Q-value with a high
probability [Mnih et al., 2015]. To reduce the variance of the estimated Q-value function, one can learn the
advantage function $\hat{A}_\phi(s_t, a_t) = \hat{Q}_\phi(s_t, a_t) - \hat{V}_\phi(s_t)$. An estimate of the advantage function can be computed
as $\hat{A}_\phi(s_t, a_t) = \sum_{i=0}^{k-1} \gamma^i r_{t+i} + \gamma^k \hat{V}_\phi(s_{t+k}) - \hat{V}_\phi(s_t)$ [Mnih et al., 2016]. Advantage Actor-Critic (A2C)
methods encode both the policy $\pi_\theta$ (i.e. the actor) and the advantage function $\hat{A}_\phi$ (i.e. the critic), and use the critic
to provide feedback to the actor, as in [Konda and Tsitsiklis, 1999]. To push $\pi_\theta$ to take actions that lead to
higher returns, gradient ascent can be applied to $L^{PG}(\theta) = \hat{\mathbb{E}}[\log \pi_\theta(a \mid s)\hat{A}_\phi]$. Proximal Policy Optimization
(PPO) algorithms ensure small policy updates that avoid catastrophic drops [Schulman et al., 2017], and can
be applied to actor-critic methods. To do so, the main objective constrains the policy ratio $r(\theta) = \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{old}}(a \mid s)}$,
following $L^{PR}(\theta) = \hat{\mathbb{E}}[\min(r(\theta)\hat{A}_\phi, \mathrm{clip}(r(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_\phi)]$, where clip constrains its input to the
interval $[1-\epsilon, 1+\epsilon]$. The global objective of the PPO actor-critic algorithm is $L(\theta, \phi) = \hat{\mathbb{E}}[L^{PR}(\theta) - c_1 L^{VF}(\phi)]$, with
$L^{VF}(\phi) = (\hat{V}_\phi(s_t) - V(s_t))^2$ the value function loss. An entropy term can also be added to this objective
to encourage exploration.
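As a concrete illustration of the objectives above, the following PyTorch sketch computes the clipped PPO actor-critic loss. It is a minimal sketch under assumed tensor names (log_probs, old_log_probs, advantages, values, returns) and default coefficients, not the exact training code used for NUDGE.

```python
import torch

def ppo_loss(log_probs, old_log_probs, advantages, values, returns,
             eps=0.2, c1=0.5, c2=0.0, entropy=None):
    """Clipped surrogate L^PR minus the value loss L^VF (plus optional entropy bonus)."""
    ratio = torch.exp(log_probs - old_log_probs)               # r(theta)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    policy_loss = -torch.min(surr1, surr2).mean()              # maximize L^PR
    value_loss = (values - returns).pow(2).mean()              # L^VF
    loss = policy_loss + c1 * value_loss
    if entropy is not None:
        loss = loss - c2 * entropy.mean()                      # exploration bonus
    return loss
```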
First-Order Logic (FOL). A language L is a tuple (P, D, F, V), where P is a set of predicates, D a set
of constants, F a set of function symbols (functors), and V a set of variables. A term is either a constant
(e.g. obj1, agent), a variable (e.g. O1), or a functor applied to terms. An atom is a formula
p(t1 , . . . , tn ), where p is a predicate symbol (e.g. closeby) and t1 , . . . , tn are terms. A ground atom, or simply
a fact, is an atom with no variables (e.g. closeby(obj1, obj2)). A literal is an atom (A) or its negation (¬A). A
clause is a finite disjunction (∨) of literals. A ground clause is a clause with no variables. A definite clause is
a clause with exactly one positive literal. If A, B1 , . . . , Bn are atoms, then A ∨ ¬B1 ∨ . . . ∨ ¬Bn is a definite
clause. We write definite clauses in the form A :- B1 , . . . , Bn . Atom A is called the head, and the set of negated
atoms {B1 , . . . , Bn } is called the body. For simplicity, we refer to definite clauses as rules in this paper.
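To make this terminology concrete, the following minimal Python sketch represents atoms and rules as plain data structures and builds one of the GetOut action rules; the class names are illustrative and do not correspond to the paper's code base.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Atom:
    predicate: str            # e.g. "closeby"
    terms: Tuple[str, ...]    # constants such as "obj1" or variables such as "O1"

    def is_ground(self) -> bool:
        # Convention used here: variables start with an uppercase letter.
        return all(not t[0].isupper() for t in self.terms)

@dataclass(frozen=True)
class Rule:
    head: Atom                # the single positive literal A
    body: Tuple[Atom, ...]    # the atoms B1, ..., Bn

# right(1)(agent) :- type(O1,agent), type(O2,key), not_have_key(O1), on_right(O2,O1).
right_key = Rule(
    head=Atom("right_1", ("agent",)),
    body=(Atom("type", ("O1", "agent")), Atom("type", ("O2", "key")),
          Atom("not_have_key", ("O1",)), Atom("on_right", ("O2", "O1")))
)
```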
Differentiable Forward Reasoning is a data-driven approach to reasoning in FOL [Russell and Norvig, 2003].
In forward reasoning, given a set of facts and a set of rules, new facts are deduced by applying the rules to the
facts. Differentiable forward reasoning [Evans and Grefenstette, 2018, Shindo et al., 2023] is a differentiable
implementation of forward reasoning using fuzzy operations.
Fig. 2 illustrates an overview of RL with NUDGE. NUDGE agents consist of a policy reasoning module and a policy learning
module. NUDGE performs end-to-end differentiable policy reasoning based on forward reasoning, which
computes action distributions given input states. On top of the reasoning module, policies are learned using
neurally-guided symbolic abstraction and an actor-critic framework.
To realize NUDGE, we introduce a language to describe actions and states in FOL. Using it, we introduce
differentiable policy reasoning using forward chaining reasoning.
Definition 1 Action-state Language is a tuple of (PA , PS , D, V), where PA is a set of action predicates, PS is a
set of state predicates, D is a set of constants for entities, and V is a set of variables.
[Figure 2 depicts the policy learning pipeline (a neural policy and a critic guiding the symbolic abstraction that generates weighted action rules, e.g. 0.98:right(1)(agent):-type(O1,agent),type(O2,key),¬has_key(O1),on_right(O2,O1)) on top of the policy reasoning pipeline (input state → probabilistic state atoms such as 0.99:type(obj1,agent) and 0.93:closeby(obj1,obj2) → induced action atoms such as 0.89:right(1)(agent) → action distribution, e.g. 0.95 : right).]
Figure 2: NUDGE-RL. Policy Reasoning (bottom): NUDGE agents incorporate end-to-end reasoning architec-
tures from raw input based on differentiable forward reasoning. In the reasoning step, (1) the raw input state is
converted into a logical representation, i.e. a set of atoms with probabilities. (2) Differentiable forward reasoning
is performed using weighted action rules. (3) The final distribution over actions is computed using the results
of differentiable reasoning. Policy Learning (top): Using the guidance of a pretrained neural policy, a set of
candidate action rules is searched by neurally-guided symbolic abstraction, where promising action rules are
produced. Then, randomly initialized weights are assigned to the action rules and are optimized using the critic
of an actor-critic agent.
For example, for GetOut, illustrated in Fig. 1, the actual actions are: left, right, jump, and idle. We define
action predicates PA = {left(1) , left(2) , right(1) , right(2) , jump(1) , idle(1) } and state predicates PS =
{type, closeby}. To encode different reasons for a given game action, we explicitly define several action predicates
per action (e.g. right(1) and right(2) for right). Using these predicates, we can compose action atoms,
e.g. right(1) (agent), and state atoms, e.g. type(obj1, agent). Note that an action predicate can also be a state
predicate, e.g. in multiplayer settings. Now, we define rules to describe actions in the action-state language.
Definition 2 Let $X_A$ be an action atom and $X_S^{(1)}, \ldots, X_S^{(n)}$ be state atoms. An action rule is a rule written as
$X_A \text{:-} X_S^{(1)}, \ldots, X_S^{(n)}$.
For example, for action right, we define an action rule as:
right(1) (agent):-type(O1, agent), type(O2, key), ¬has_key(O1), on_right(O2, O1).
which can be interpreted as “The agent should go right if the agent does not have the key and the key is located
on the right of the agent.". Having several action predicates for an actual action (in the game) allows us to define
several action rules that describe different reasons for the action.
A NUDGE policy is composed of three functions: $f^{perceive}_\Theta: \mathbb{R}^N \to [0, 1]^G$, a perception function that maps the raw input state $s_t \in \mathbb{R}^N$ into a set of
probabilistic atoms; $f^{reason}_{(C,W)}: [0, 1]^G \to [0, 1]^{G_A}$, a differentiable forward reasoning function parameterized by a
set of rules $C$ and rule weights $W$; and $f^{act}: [0, 1]^{G_A} \to [0, 1]^{A}$, an action-selection function, which computes
the probability distribution over the action space.
Relational Perception. NUDGE agents take an object-centric state representation as input, obtained by e.g. using
object detection [Redmon et al., 2016] or discovery [Lin et al., 2020, Delfosse et al., 2022] methods. These
models return the detected objects and their attributes (e.g. class and positions). They are then converted into a
probabilistic logic form with their relations, i.e. a set of facts with their probabilities. An input state st ∈ RN
is converted to a valuation vector v ∈ [0, 1]G , which maps each fact to a probabilistic value. For example,
let G = {type(obj1, agent), type(obj2, enemy), closeby(obj1, obj2), jump(agent)}. A valuation vector
[0.8, 0.6, 0.3, 0.0]⊤ maps each fact to a corresponding probabilistic value. NUDGE performs differentiable
forward reasoning by updating the initial valuation vector v(0) for T times to v(T ) .
The initial valuation vector $v^{(0)}$ is computed as follows. For each ground state atom $p(t_1, \ldots, t_n) \in G_S$,
e.g. closeby(obj1, obj2), a differentiable function is called to compute its probability: it maps each
term $t_1, \ldots, t_n$ to a vector representation according to its interpretation, e.g. obj1 and obj2 are mapped to their
positions, and then performs binary classification using the distance between them. For action atoms, zero is assigned
as the initial probability (e.g. for jump(1)(agent)).
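For illustration, a minimal sketch of such a differentiable valuation function for a distance-based state predicate like closeby; the threshold, the sigmoid steepness, and the function name are assumptions rather than the paper's exact implementation.

```python
import torch

def closeby_valuation(pos_a: torch.Tensor, pos_b: torch.Tensor,
                      threshold: float = 2.0, steepness: float = 4.0) -> torch.Tensor:
    """Probability that two objects are close, from their (x, y) positions.

    A smooth (differentiable) binary classifier on the Euclidean distance:
    close to 1 when the distance is well below the threshold, close to 0 otherwise.
    """
    dist = torch.norm(pos_a - pos_b, dim=-1)
    return torch.sigmoid(steepness * (threshold - dist))

# Example: valuation of closeby(obj1, obj2) from detected object positions.
v_closeby = closeby_valuation(torch.tensor([1.0, 0.5]), torch.tensor([2.2, 0.5]))
```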
Differentiable Forward Reasoning. Given a set of candidate action rules C, we create the reasoning function
$f^{reason}_{(C,W)}: [0, 1]^G \to [0, 1]^{G_A}$, which takes the initial valuation vector and induces action atoms using weighted
action rules. We assign weights to the action rules of C as follows: we fix the target program's size, M, and select
M rules out of the C candidate action rules. To do so, we introduce C-dimensional weights $W = [w_1, \ldots, w_M]$
where $w_i \in \mathbb{R}^C$ (cf. Fig. 6 in the appendix). We take the softmax of each weight vector $w_i \in W$ to select M
action rules in a differentiable manner.
We perform T -step forward reasoning using action rules C with weights W. We compose the differentiable
forward reasoning function following Shindo et al. [2023]. It computes soft logical entailment based on efficient
tensor operations. Our differentiable forward reasoning module computes new valuation v(T ) including all
induced atoms given weighted action rules (C, W) and initial valuation v(0) . Finally, we compute valuations on
action atoms vA ∈ [0, 1]GA by extracting relevant values from v(T ) . We provide details in App. E.
Compute Action Probability. Given the valuations on action atoms $v_A$, we compute the action distribution over
the actual actions. Let $a_i \in A$ be an actual action, and $v'_1, \ldots, v'_n \in v_A$ be the valuations relevant for $a_i$
(e.g. the valuations of right(1)(agent) and right(2)(agent) for the action right). We assign a score to each
action $a_i$ based on the log-sum-exp approach of Cuturi and Blondel [2017]: $val(a_i) = \gamma \log \sum_{1 \le j \le n} \exp(v'_j/\gamma)$,
which smoothly approximates the maximum value of $\{v'_1, \ldots, v'_n\}$; $\gamma > 0$ is a smoothing parameter. The
action distribution is then obtained by taking the softmax over the scores of all actions.
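A minimal sketch of this action-selection step, i.e. the smooth-maximum aggregation of action-atom valuations followed by a softmax over actions; the grouping dictionary and the names are illustrative assumptions.

```python
import torch

def action_distribution(v_action_atoms: torch.Tensor, groups: dict,
                        gamma: float = 0.01) -> torch.Tensor:
    """Map valuations of action atoms to a distribution over actual actions.

    groups maps each actual action to the indices of its action atoms, e.g.
    {0: [0, 1]} if right(1)(agent) and right(2)(agent) both realize 'right'.
    """
    scores = []
    for action in sorted(groups):
        v = v_action_atoms[groups[action]]
        # val(a) = gamma * log sum_j exp(v_j / gamma): smooth maximum of the valuations.
        scores.append(gamma * torch.logsumexp(v / gamma, dim=0))
    return torch.softmax(torch.stack(scores), dim=0)

# Valuations for right(1), right(2), left(1), left(2), jump(1), idle(1).
vA = torch.tensor([0.89, 0.08, 0.0, 0.0, 0.32, 0.0])
dist = action_distribution(vA, {0: [0, 1], 1: [2, 3], 2: [4], 3: [5]})
```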
So far, we have considered that candidate rules for the policy are given, requiring human experts to handcraft
potential rules. To avoid this, template-based rule generation [Evans and Grefenstette, 2018, Jiang and Luo,
2019] can be applied, but the number of generated rules increases exponentially with the number of entities and
their potential relations. This technique is thus difficult to apply to complex environments where the agents need
to reason about many different relations of entities.
To mitigate this problem, we propose an efficient learning algorithm for NUDGE that consists of two steps:
neurally-guided symbolic abstraction and gradient-based optimization. First, NUDGE obtains symbolic abstract
representations of a given neural policy. We select a set of candidate rules for the policy by neurally-guided top-k
search, i.e. we generate a set of promising rules using the neural policy as an oracle to evaluate each rule. Then
we assign randomized weights to the generated rules and perform differentiable reasoning. We optimize the rule
weights based on actor-critic methods to maximize the return. We now describe each step in detail.
Neurally-Guided Symbolic Abstraction. We search for promising candidate rules using the pretrained neural policy to evaluate rules efficiently. The inputs are the initial rules $C_0$ and the neural policy $\pi_\theta$. We start with
elementary action rules and refine them to generate better action rules. $C_{to\_open}$ is the set of rules to be refined,
initialized as $C_0$. For each rule $C_i \in C_{to\_open}$, we generate new rules by refining it as follows. Let
$C_i = X_A \leftarrow X_S^{(1)}, \ldots, X_S^{(n)}$ be an already selected general action rule. Using a randomly picked ground or
non-ground state atom $Y_S$ (with $Y_S \neq X_S^{(i)}$ for all $i \in [1, \ldots, n]$), we refine the selected rule by adding the new state atom to its
body, obtaining: $X_A \leftarrow X_S^{(1)}, \ldots, X_S^{(n)}, Y_S$.
We evaluate each newly generated rule to select promising rules. We use the neural policy $\pi_\theta$ as a guide for the
rule evaluation, i.e. rules that entail the same action as the neural policy $\pi_\theta$ are promising action rules. Let $\mathcal{X}$ be
a set of states. Then we evaluate rule $R$ as
$$eval(R, \pi_\theta) = \frac{1}{N(R, \mathcal{X})} \sum_{s \in \mathcal{X}} \pi_\theta(s)^\top \cdot \pi_{(R,\mathbf{1})}(s), \qquad (2)$$
where $N(R, \mathcal{X})$ is a normalization term, $\pi_{(R,\mathbf{1})}$ is the differentiable logic policy with rule set $\mathbf{R} = \{R\}$ and rule
weights $\mathbf{1}$, a $1 \times 1$ identity matrix (for consistent notation), and $\cdot$ is the dot product. Intuitively,
$\pi_{(R,\mathbf{1})}$ is the logic policy that has $R$ as its only action rule. If $\pi_{(R,\mathbf{1})}$ produces a similar action distribution to that
produced by the neural policy $\pi_\theta$, we regard rule $R$ as a promising rule. The similarity score is computed using the
dot product between the two action distributions. We compute the similarity scores for each state $s \in \mathcal{X}$ and
sum them up to obtain the score for $R$. The normalization term helps NUDGE avoid scoring overly simple rules
as promising.
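The scoring in Eq. 2 can be sketched as follows; logic_policy_single_rule stands for $\pi_{(R,\mathbf{1})}$ and normalization for $N(R, \mathcal{X})$ (Eq. 3 below), both of which are assumed helpers rather than functions from the released code.

```python
import torch

def eval_rule(rule, states, neural_policy, logic_policy_single_rule, normalization):
    """Score a candidate rule by its agreement with the neural policy (Eq. 2).

    neural_policy(s) and logic_policy_single_rule(rule, s) both return an
    action distribution as a 1-D tensor over the actual actions.
    """
    score = 0.0
    for s in states:
        score = score + torch.dot(neural_policy(s), logic_policy_single_rule(rule, s))
    return score / normalization(rule, states)   # N(R, X) from Eq. 3
```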
To compute the normalization term, we consider groundings of rule $R$, i.e. we remove variables from the rule by
substituting constants. We consider all possible groundings of each rule. Let $\mathcal{T}$ be the set of all substitutions
for the variables to ground rule $R$. For each $\tau \in \mathcal{T}$, we get a ground rule $R\tau = X_A\tau \text{:-} X_S^{(1)}\tau, \ldots, X_S^{(n)}\tau$,
where $X\tau$ represents the result of applying substitution $\tau$ to atom $X$. Let $J = \{j_1, \ldots, j_n\}$ be the indices of the
ground atoms $X_S^{(1)}\tau, \ldots, X_S^{(n)}\tau$ in the ordered set of ground atoms $G$. Then, the normalization term is computed as:
$$N(R, \mathcal{X}) = \sum_{\tau \in \mathcal{T}} \sum_{s \in \mathcal{X}} \prod_{j \in J} v_s^{(0)}[j], \qquad (3)$$
where $v_s^{(0)}$ is the initial valuation vector for state $s$, i.e. $f^{perceive}_\Theta(s)$. Eq. 3 quantifies how often the body atoms of
ground rule $R\tau$ are activated on the given set of states $\mathcal{X}$. Simple rules with fewer atoms in their body tend to
have large normalization values, and thus their evaluation scores in Eq. 2 tend to be small. After scoring all of the new rules,
NUDGE selects the top-k rules and refines them in the next step. In the end, all of the top-k rules from each step are
returned as the candidate ruleset C for the policy (cf. App. A for more detail about our algorithm).
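Putting the pieces together, a sketch of the neurally-guided top-k search; refine (an implementation of the refinement operation, cf. Eq. 7 in App. A) and score_rule (a scorer following Eq. 2, e.g. the eval_rule sketch above with its helpers bound) are assumed, and the loop is an illustrative reading of Algorithm 1 rather than a verbatim transcription.

```python
def neurally_guided_search(initial_rules, states, score_rule, refine, k=3, n_steps=5):
    """Top-k rule refinement guided by a pretrained neural policy.

    Returns the union of the top-k rules kept at every refinement step,
    which becomes the candidate rule set C of the logic policy.
    """
    to_open = list(initial_rules)
    candidates = list(initial_rules)
    for _ in range(n_steps):
        new_rules = [r for rule in to_open for r in refine(rule)]
        if not new_rules:
            break
        ranked = sorted(new_rules, key=lambda r: score_rule(r, states), reverse=True)
        to_open = ranked[:k]           # keep only the k most promising new rules
        candidates.extend(to_open)     # all kept rules form the candidate set C
    return candidates
```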
NUDGE has thus produced the candidate action rules C, which will be associated with weights W to form the untrained policy
π(C,W) described in Sec. 3.1.2.
4 Experimental Evaluation
We here compare the performance of neural agents to that of NUDGE agents, and show that NUDGE produces interpretable
policies and can report the importance of each input for its decisions, i.e. is explainable.
Figure 3: NUDGE outperforms neural and logic baselines. Returns (avg.±std.) obtained by NUDGE, neural
PPO and logic-based agents without abstraction through training. NUDGE (Top-k rf.), with k ∈ {1, 3, 10}, uses
neurally-guided symbolic abstraction repeatedly until it obtains k rules for each action predicate. NUDGE (with
E.S.) uses rule set C supervised by an expert. Neural Logic RL composes logic-based policies by generating
all possible rules without neurally-guided symbolic abstraction [Jiang and Luo, 2019]. Random and human
baselines are also provided.
We use DQN agents (on the Atari environments) and PPO actor-critic agents (on the logic environments) as neural
baselines for comparison, and PPO agents as the pretrained agents that guide the symbolic abstraction. In our
experimental evaluation, all agent types receive object-centric descriptions of the environments. For clarity, we
deliberately annotate the action predicates of action rules with specific names, e.g. right_key instead of right(1)
when the rule describes a right action motivated by getting the key.
We intend to compare agents with object-centric information bottlenecks. We thus extract object-centric
states of the Atari environments (using information from the RAM). As Atari games do not embed logic
challenges, but are rather designed to test the reflexes of human players, we also created 3 logic-oriented
environments. They are modifications of Procgen [Mohanty et al., 2020] environments, with object-centric
representations and fewer objects, and are open-sourced along with our evaluation. Our environments are easily
hackable. We also provide variations of these environments to evaluate the ease of adaptation of every agent
type. In GetOut, the goal is to obtain a key, and then go to a door, while avoiding a
moving enemy. GetOut+ is a more complex variation with a larger world containing 5 enemies (among which 2
are static). In 3Fishes, the agent controls a fish and is confronted with 2 other fishes, one smaller (that the agent
needs to “eat”, i.e. go to) and one bigger, that the agent needs to dodge. A variation is 3Fishes-C, where the agent
can eat green fishes and dodge red ones. Finally, in Loot, the agent must open 1 or 2 chests using their corresponding
(i.e. same-color) keys. In Loot-C, the chests have different colors. Further details and hyperparameters are
provided in App. D.
We aim to answer the following research questions: Q1. How does NUDGE compare with neural and logic
baselines? Q2. Can NUDGE agents easily adapt to environmental changes? Q3. Are NUDGE agents interpretable
and explainable?
NUDGE competes with PPO agents (Q1). We compare NUDGE with different baselines regarding their
scores (or returns). First, we present scores obtained by trained DQN, Random and NUDGE agents (with expert
supervision) on 2 Atari games (cf. Tab. 1). Our results show that NUDGE obtains better (Asterix) or similar
(Freeway) scores compared to DQN. However, as said, Atari games are not logically challenging. We thus evaluate
NUDGE on our 3 logic environments. Fig. 3 shows the returns in GetOut, 3Fishes, and Loot, with descriptions of
each baseline in the caption. NUDGE obtains better performance than the neural baseline (Neural PPO) on 3Fishes,
is more stable on GetOut, i.e. has less variance, and achieves faster convergence on Loot. This shows that NUDGE
successfully distills logic-based policies that compete with neural baselines in different complex environments.
We also evaluate a baseline without symbolic abstraction, where candidate rules are generated
without being guided by neural policies, i.e. all of the generated rules are accepted in the rule refinement steps. This
setting corresponds to the template-based approach [Jiang and Luo, 2019], but we train the agents with the
actor-critic method, while vanilla policy gradient [Williams, 1992] is employed in [Jiang and Luo, 2019].
For the no-abstraction baseline and NUDGE, we provide initial action rules with basic type information,
e.g. jump(1) (agent):- type(O1, agent), type(O2, enemy), for each action rule. For this baseline, we generate
5 rules for GetOut, 30 rules for 3Fishes, and 40 rules for Loot in total to define all of the actual actions. NUDGE
agents with small k tend to have fewer rules, e.g. 5 rules in GetOut, 6 rules in 3Fishes, and 8 rules in Loot for
NUDGE (top-1 rf.). In Fig. 3, the no-abstraction baselines perform worse than neural PPO and NUDGE in each
environment, even though they have many more rules in 3Fishes and Loot. We thus show that NUDGE composes
Score (↑)   Random      DQN     NUDGE
Asterix     235 ±134    124.5   6259 ±1150
Freeway     0.0 ±0      25.8    21.4 ±0.8

Score (↑)   Random        Neural PPO     NUDGE
3Fishes-C   -0.64 ±0.17   -0.37 ±0.10    3.26 ±0.20
GetOut+     -22.5 ±0.41   -20.88 ±0.57   3.60 ±2.93
Loot-C      0.56 ±0.29    0.83 ±0.49     5.63 ±0.33
Table 1: Left: NUDGE agents can learn successful policies. Scores (avg. ± std.) of trained NUDGE agents (with expert
supervision) on 2 ALE games. Random and DQN (from van Hasselt et al. [2016]) are also provided.
Right: NUDGE agents adapt to environmental changes. Returns obtained by NUDGE, neural PPO and
random agents on our 3 modified environments.
0.57:jump(agent):-type(O1,agent),type(O2,enemy),closeby(O1,O2).
0.29:right_key(agent):-type(O1,agent),type(O2,key),on_right(O2,O1),not_have_key(O1).
0.32:right_door(agent):-type(O1,agent),type(O2,door),on_right(O2,O1),have_key(O1).
Figure 4: NUDGE produces an interpretable policy as set of weighted rules. A subset of the weighted action
rules discovered by NUDGE in the Getout environment. Full policies for every logic environment are provided
in App. B.3.
efficient logic-based policies using neurally-guided symbolic abstraction. In App. B.1, we visualize the transition
of the distribution of the rule weights in the GetOut environment.
NUDGE agents adapt to environmental changes (Q2). We used the agents trained on the basic environment
for this experimental evaluation, with no retraining or finetuning. For 3Fishes-C, we simply exchange the atom
is_bigger with the atom same_color. Such an easy modification is not applicable to the black-box networks
of neural PPO agents. For GetOut+ and Loot-C, we do not apply any modification to the agents. Our results
are summarized in Tab. 1 (right). Note that the agent obtains better performance than in 3Fishes, as it is easier
to dodge a (small) red fish than a big one. For GetOut+, NUDGE's performance decreases, as avoiding 5
enemies drastically increases the difficulty of the game. On Loot-C, the performances are similar to the ones
obtained in the original game. Our experiments show that NUDGE logic agents can easily adapt to environmental
changes.
NUDGE agents are interpretable and explainable (Q3). We show that NUDGE agents are interpretable
and explainable by showing that (1) NUDGE produces an interpretable policy as a set of weighted rules, and (2)
NUDGE can show the importance of each atom, explaining its action choices.
The efficient neurally-guided learning in NUDGE enables the system to learn rules without inventing predicates
with no specific interpretation, which are unavoidable in template-based approaches [Evans and Grefenstette,
2018, Jiang and Luo, 2019]. Thus, the policy can easily be read out by extracting the action rules with high weights.
Fig. 4 shows some action rules discovered by NUDGE in GetOut. The first rule says: "The agent should jump
when the enemy is close to the agent (to avoid the enemy).". The produced NUDGE policy is interpretable: a set
of weighted rules over interpretable predicates. For each state, we can also look at the valuation of
each atom and the selected rule.
Moreover, since NUDGE realizes differentiable logic-based policies, we can compute attribution values over logical
representations using their gradients. We compute the action gradients w.r.t. the input atoms, i.e. $\partial v_A / \partial v^{(0)}$, as shown
in Fig. 5, which represent the relevance scores of the probabilistic input atoms $v^{(0)}$ for the actions given a specific
state. The explanation is computed on the state shown in Fig. 1, where the agent takes right as its action. Important
atoms receive large gradients, e.g. ¬have_key(agent) and on_right(obj2, obj1). By extracting relevant atoms with
large gradients, NUDGE can produce clear explanations for the action selection. For example, by extracting the atoms
wrapped in orange in Fig. 5, NUDGE can explain the motivation: "The agent decides to go right because it does not
have the key and the key is located on the right-side of it.". NUDGE is interpretable and explainable: each action
predicate is defined by interpretable rules, and explanations for the action selections can be produced.
Figure 5: Explanation using inputs' gradients. The action gradients w.r.t. the input atoms, i.e. $\partial v_A / \partial v^{(0)}$, on the
state shown in Fig. 1. right was selected, due to the highlighted relevant atoms (with large gradients).
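Since the whole reasoning pipeline is differentiable, such explanations can be obtained with standard autograd; a minimal sketch in which reason_fn stands for an assumed differentiable implementation of $f^{reason}_{(C,W)}$ (a PyTorch module or function).

```python
import torch

def explain_action(v0: torch.Tensor, reason_fn, action_atom_index: int,
                   atom_names, top_n: int = 3):
    """Gradient of one induced action atom w.r.t. the input valuations v(0).

    Returns the top-n input atoms with the largest absolute gradient,
    i.e. the atoms most relevant for the selected action.
    """
    v0 = v0.clone().detach().requires_grad_(True)
    vA = reason_fn(v0)                     # differentiable forward reasoning
    vA[action_atom_index].backward()       # d vA[action] / d v(0)
    relevance = v0.grad.abs()
    top = torch.topk(relevance, top_n).indices.tolist()
    return [(atom_names[i], relevance[i].item()) for i in top]
```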
5 Related Work
Relational RL [Dzeroski et al., 2001, Kersting et al., 2004, Kersting and Driessens, 2008, Lang et al., 2012] has been
developed to tackle RL tasks in relational domains by incorporating logical representations into RL frameworks.
These approaches are based on probabilistic reasoning, whereas NUDGE uses differentiable logic programming.
Neural Logic Reinforcement Learning (NLRL) [Jiang and Luo, 2019] is the first framework that integrates
Differentiable Inductive Logic Programming (∂ILP) [Evans and Grefenstette, 2018] into the RL domain. ∂ILP
learns generalized logic rules from examples by gradient-based optimization. NLRL adopts ∂ILP as a policy function.
We extend this approach by proposing neurally-guided symbolic abstraction, embracing an extension of ∂ILP
[Shindo et al., 2021] for complex programs, which allows agents to learn interpretable action rules efficiently in
complex environments.

           FOL   N.G.   Int.   Exp.
NLRL        ✓     ✗      ✓      ✗
NeSyRL      ✓     ✗      ✓      ✗
DiffSES     ✗     ✓      ✓      ✗
NUDGE       ✓     ✓      ✓      ✓

Table 2: Logic-based RL methods comparison: First Order Logic (FOL), neurally-guided search (N.G.),
interpretability (Int.), and explainability (Exp.).
GALOIS [Cao et al., 2022] is a framework to represent policies as logic programs using the sketch setting [Solar-
Lezama, 2008], where programs are learned to fill blanks, but NUDGE performs structure learning from scratch
using policy gradients. KoGun [Zhang et al., 2020] integrates human knowledge as a prior for RL agents.
NUDGE learns a policy as a set of weighted rules and thus also can integrate human knowledge. Neuro-
Symbolic RL (NeSyRL) [Kimura et al., 2021] uses Logical Neural Networks (LNNs) [Riegel et al., 2020] for
the policy computation. LNNs parameterize the soft logical operators while NUDGE parameterizes rules with
their weights. Deep Relational RL approaches [Zambaldi et al., 2018] achieve relational reasoning as a neural
network, but NUDGE explicitly encodes relations in logic. Many languages for planning and RL tasks have been
developed [Fikes and Nilsson, 1971, Fox and Long, 2003]. Our approach is inspired by situation calculus [Reiter,
2001], which is an established framework to describe states and actions in logic.
Symbolic programs within RL have been investigated, e.g. program-guided agents [Sun et al., 2020], program
synthesis [Zhu et al., 2019], PIRL [Verma et al., 2018], SDRL [Lyu et al., 2019], interpretable model-based
hierarchical RL [Xu and Fekri, 2021], deep symbolic policies [Landajuela et al., 2021], and DiffSES [Zheng
et al., 2021]. These approaches use domain-specific languages or propositional logic, and address either
interpretability or explainability of RL, but not both. In Tab. 2, we compare NUDGE with the most relevant
approaches, i.e. those that share at least 2 of the following aspects: supporting first-order logic, neural guidance,
interpretability, and explainability. NUDGE is the first to use neural guidance on differentiable first-order logic
and to address both interpretability and explainability.
6 Conclusion
We proposed NUDGE, an interpretable and explainable policy reasoning and learning framework for rein-
forcement learning. NUDGE uses differentiable forward reasoning to obtain a set of interpretable weighted
rules as its policy. NUDGE performs neurally-guided symbolic abstraction, which efficiently distills symbolic
representations from a neural policy, and performs gradient-based policy optimization using actor-critic
methods. We empirically demonstrated that NUDGE (1) can compete with neural-based policies, (2) uses logical
representations to produce both interpretable and explainable policies, and (3) can automatically adapt, or be
easily modified, to tackle environmental changes.
Societal impact. As NUDGE can explain the importance it gives to each input in its decisions, and as its rules are
interpretable, it can help understand the decisions of RL agents trained in sensitive, complicated domains, as well as
discover biases and misalignments of a potentially discriminative nature.
Limitation and Future Work. NUDGE is only complete if provided with a sufficiently expressive language (in
terms of predicates and entities) to approximate neural policies. For future work, NUDGE could automatically
grow to (i) discover predicates using predicate invention [Muggleton et al., 2015], and (ii) augment the number of
accessible entities to reason about. Explainable interactive learning [Teso and Kersting, 2019] in RL can be tackled
with NUDGE, since NUDGE can produce explanations using logical representations. Causal RL [Madumal
et al., 2020] and meta learning [Mishra et al., 2018] also constitute interesting future avenues for NUDGE’s
development.
References
Yushi Cao, Zhiming Li, Tianpei Yang, Hao Zhang, Yan Zheng, Yi Li, Jianye Hao, and Yang Liu. GALOIS:
boosting deep reinforcement learning via generalizable logic synthesis. CoRR, 2022.
Andrew Cropper, Sebastijan Dumancic, Richard Evans, and Stephen H. Muggleton. Inductive logic programming
at 30. Mach. Learn., 2022.
Marco Cuturi and Mathieu Blondel. Soft-DTW: a differentiable loss function for time-series. In Proceedings of
the 34th International Conference on Machine Learning, 2017.
Quentin Delfosse, Patrick Schramowski, Martin Mundt, Alejandro Molina, and Kristian Kersting. Adaptive
rational activations to boost deep reinforcement learning. 2021.
Quentin Delfosse, Wolfgang Stammer, Thomas Rothenbacher, Dwarak Vittal, and Kristian Kersting. Boosting
object representation learning via motion and object continuity. CoRR, 2022.
Coline Devin, Pieter Abbeel, Trevor Darrell, and Sergey Levine. Deep object-centric representations for
generalizable robot learning. In IEEE International Conference on Robotics and Automation, 2018.
Saso Dzeroski, Luc De Raedt, and Kurt Driessens. Relational reinforcement learning. Mach. Learn., 2001.
Richard Evans and Edward Grefenstette. Learning explanatory rules from noisy data. J. Artif. Intell. Res., 2018.
Richard Fikes and Nils J. Nilsson. STRIPS: A new approach to the application of theorem proving to problem
solving. Artif. Intell., 1971.
Maria Fox and Derek Long. PDDL2.1: an extension to PDDL for expressing temporal planning domains. J.
Artif. Intell. Res., 2003.
Zhengyao Jiang and Shan Luo. Neural logic reinforcement learning. In Kamalika Chaudhuri and Ruslan
Salakhutdinov, editors, International Conference on Machine Learning, 2019.
Kristian Kersting and Kurt Driessens. Non-parametric policy gradients: a unified treatment of propositional and
relational domains. In William W. Cohen, Andrew McCallum, and Sam T. Roweis, editors, International
Conference on Machine Learning, 2008.
Kristian Kersting, Martijn van Otterlo, and Luc De Raedt. Bellman goes relational. In Carla E. Brodley, editor,
International Conference on Machine Learning, 2004.
Daiki Kimura, Masaki Ono, Subhajit Chaudhury, Ryosuke Kohita, Akifumi Wachi, Don Joven Agravante,
Michiaki Tatsubori, Asim Munawar, and Alexander Gray. Neuro-symbolic reinforcement learning with
first-order logic. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors,
Conference on Empirical Methods in Natural Language Processing, 2021.
Vijay R. Konda and John N. Tsitsiklis. Actor-critic algorithms. In Sara A. Solla, Todd K. Leen, and Klaus-Robert
Müller, editors, Advances in Neural Information Processing Systems 12, 1999.
Mikel Landajuela, Brenden K Petersen, Sookyung Kim, Claudio P Santiago, Ruben Glatt, Nathan Mundhenk,
Jacob F Pettit, and Daniel Faissol. Discovering symbolic policies with deep reinforcement learning. In
International Conference on Machine Learning, 2021.
Tobias Lang, Marc Toussaint, and Kristian Kersting. Exploration in relational domains for model-based
reinforcement learning. J. Mach. Learn. Res., 2012.
Zhixuan Lin, Yi-Fu Wu, Skand Vishwanath Peri, Weihao Sun, Gautam Singh, Fei Deng, Jindong Jiang, and
Sungjin Ahn. SPACE: unsupervised object-oriented scene representation via spatial attention and decomposi-
tion. In International Conference on Learning Representations, 2020.
Daoming Lyu, Fangkai Yang, Bo Liu, and Steven Gustafson. SDRL: interpretable and data-efficient deep
reinforcement learning leveraging symbolic planning. In The Thirty-Third AAAI Conference on Artificial
Intelligence, 2019.
Prashan Madumal, Tim Miller, Liz Sonenberg, and Frank Vetere. Explainable reinforcement learning through a
causal lens. In The Thirty-Fourth AAAI conference on Artificial Intelligence (AAAI), 2020.
Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. A simple neural attentive meta-learner. In 6th
International Conference on Learning Representations, 2018.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex
Graves, Martin A. Riedmiller, Andreas Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir
Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis.
Human-level control through deep reinforcement learning. Nat., 2015.
Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley,
David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Maria-
Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of the 33rd International Conference on
Machine Learning, 2016.
Sharada P. Mohanty, Jyotish Poonganam, Adrien Gaidon, Andrey Kolobov, Blake Wulfe, Dipam Chakraborty,
Grazvydas Semetulskis, João Schapke, Jonas Kubilius, Jurgis Pasukonis, Linas Klimas, Matthew J. Hausknecht, Patrick
MacAlpine, Quang Nhat Tran, Thomas Tumiel, Xiaocheng Tang, Xinwei Chen, Christopher Hesse, Jacob
Hilton, William Hebgen Guss, Sahika Genc, John Schulman, and Karl Cobbe. Measuring sample efficiency
and generalization in reinforcement learning benchmarks: Neurips 2020 procgen benchmark. In Hugo Jair
Escalante and Katja Hofmann, editors, NeurIPS 2020 Competition and Demonstration Track, 2020.
S. Muggleton. Inverse Entailment and Progol. New Generation Computing, Special issue on Inductive Logic
Programming, 1995.
Stephen H. Muggleton, Dianhuan Lin, and Alireza Tamaddoni-Nezhad. Meta-interpretive learning of higher-order
dyadic datalog: predicate invention revisited. Mach. Learn., 2015.
Shan-Hwei Nienhuys-Cheng and Ronald de Wolf. Foundations of Inductive Logic Programming. 1997.
Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta. Robust adversarial reinforcement learning.
In Doina Precup and Yee Whye Teh, editors, International Conference on Machine Learning, 2017.
Joseph Redmon, Santosh Kumar Divvala, Ross B. Girshick, and Ali Farhadi. You only look once: Unified,
real-time object detection. In Computer Vision and Pattern Recognition, 2016.
Raymond Reiter. Knowledge in Action: Logical Foundations for Specifying and Implementing Dynamical
Systems. 2001.
Ryan Riegel, Alexander G. Gray, Francois P. S. Luus, Naweed Khan, Ndivhuwo Makondo, Ismail Yunus
Akhalwaya, Haifeng Qian, Ronald Fagin, Francisco Barahona, Udit Sharma, Shajith Ikbal, Hima Karanam,
Sumit Neelam, Ankita Likhyani, and Santosh K. Srivastava. Logical neural networks. CoRR, 2020.
Cynthia Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable
models instead. Nat. Mach. Intell., 2019.
Stuart Russell and Peter Norvig. Artificial intelligence - a modern approach, 2nd Edition. 2003.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization
algorithms. CoRR, 2017.
Hikaru Shindo, Masaaki Nishino, and Akihiro Yamamoto. Differentiable inductive logic programming for
structured examples. In Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI), 2021.
Hikaru Shindo, Viktor Pfanschilling, Devendra Singh Dhami, and Kristian Kersting. αilp: thinking visual scenes
as differentiable logic programs. Mach. Learn., 2023.
Armando Solar-Lezama. Program Synthesis by Sketching. PhD thesis, 2008.
Shao-Hua Sun, Te-Lin Wu, and Joseph J. Lim. Program guided agent. In International Conference on Learning
Representations, 2020.
Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In International
Conference on Machine Learning, 2017.
Stefano Teso and Kristian Kersting. Explanatory interactive machine learning. In Vincent Conitzer, Gillian K.
Hadfield, and Shannon Vallor, editors, AAAI/ACM Conference on AI, Ethics, and Society, 2019.
Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. In Dale
Schuurmans and Michael P. Wellman, editors, AAAI Conference on Artificial Intelligence, 2016.
Abhinav Verma, Vijayaraghavan Murali, Rishabh Singh, Pushmeet Kohli, and Swarat Chaudhuri. Programmati-
cally interpretable reinforcement learning. In International Conference on Machine Learning, 2018.
Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning.
Mach. Learn., 1992.
Markus Wulfmeier, Ingmar Posner, and Pieter Abbeel. Mutual alignment transfer learning. In Conference on
Robot Learning, 2017.
Duo Xu and Faramarz Fekri. Interpretable model-based hierarchical reinforcement learning using inductive logic
programming. CoRR, abs/2106.11417, 2021.
Vinícius Flores Zambaldi, David Raposo, Adam Santoro, Victor Bapst, Yujia Li, Igor Babuschkin, Karl Tuyls,
David P. Reichert, Timothy P. Lillicrap, Edward Lockhart, Murray Shanahan, Victoria Langston, Razvan
Pascanu, Matthew M. Botvinick, Oriol Vinyals, and Peter W. Battaglia. Relational deep reinforcement learning.
CoRR, 2018.
Peng Zhang, Jianye Hao, Weixun Wang, Hongyao Tang, Yi Ma, Yihai Duan, and Yan Zheng. Kogun: Accelerating
deep reinforcement learning via integrating human suboptimal knowledge. In International Joint Conference
on Artificial Intelligence, 2020.
Wenqing Zheng, S P Sharan, Zhiwen Fan, Kevin Wang, Yihan Xi, and Zhangyang Wang. Symbolic visual
reinforcement learning: A scalable framework with object-level abstraction and differentiable expression
search. CoRR, abs/2106.11417, 2021.
He Zhu, Zikang Xiong, Stephen Magill, and Suresh Jagannathan. An inductive synthesis framework for
verifiable reinforcement learning. In ACM-SIGPLAN Symposium on Programming Language Design and
Implementation, 2019.
Supplemental Materials
At line 8 in Algorithm 1, given action rule $C$, we generate new action rules using the following refinement
operation:
$$\rho(C) = \{X_A \leftarrow X_S^{(1)}, \ldots, X_S^{(n)}, Y_S \mid Y_S \in \mathcal{G}_S^* \wedge \forall i\; Y_S \neq X_S^{(i)}\}, \qquad (7)$$
where $\mathcal{G}_S^*$ is the set of non-ground state atoms. This operation is an instance of the (downward) refinement operator,
a fundamental technique for rule learning in ILP [Nienhuys-Cheng and de Wolf, 1997], adapted to action rules for
solving RL tasks.
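Using the Atom/Rule sketch from Sec. 2, the refinement operation of Eq. 7 can be written as follows; non_ground_state_atoms stands for $\mathcal{G}_S^*$ (generated from the mode declarations below) and is an assumed input.

```python
def refine(rule, non_ground_state_atoms):
    """Downward refinement (Eq. 7): add one new state atom to the rule body.

    Returns every rule obtained by appending an atom Y_S from G*_S that is
    not already among the body atoms of the given rule.
    """
    new_rules = []
    for y in non_ground_state_atoms:
        if y not in rule.body:
            new_rules.append(Rule(head=rule.head, body=rule.body + (y,)))
    return new_rules
```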
We use mode declarations [Muggleton, 1995, Cropper et al., 2022] to define the search space, i.e. $\mathcal{G}_S^*$ in Eq. 7,
as follows. A mode declaration is either a head declaration modeh(r, p(mdt1 , . . . , mdtn )) or a
body declaration modeb(r, p(mdt1 , . . . , mdtn )), where r ∈ N is an integer, p is a predicate, and mdti is a mode
datatype. A mode datatype is a tuple (pm, dt), where pm is a place-marker and dt is a datatype. A place-marker is
either #, which represents constants, or + (resp. −), which represents input (resp. output) variables. r represents
the number of times the predicate can be used to compose a solution. Given a set of mode declarations, we can
determine a finite set of rules to be generated by the rule refinement.
Now we describe mode declarations we used in our experiments. For Getout, we used the following mode
declarations:
modeb(2, type(−object, +type))
modeb(1, closeby(+object, +object))
modeb(1, on_left(+object, +object))
modeb(1, on_right(+object, +object))
modeb(1, have_key(+object))
modeb(1, not_have_key(+object))
For 3Fishes, we used the following mode declarations:
modeb(2, type(−object, +type))
modeb(1, closeby(+object, +object))
modeb(1, on_top(+object, +object))
modeb(1, at_bottom(+object, +object))
modeb(1, on_left(+object, +object))
modeb(1, on_right(+object, +object))
modeb(1, bigger_than(+object, +object))
modeb(1, high_level(+object, +object))
modeb(1, low_level(+object, +object))
B Additional Results
B.1 Weights learning
Fig. 6 shows the NUDGE agent π(C,W), parameterized by rules C and weights W, before training (top) and
after training (bottom) on the GetOut environment. Each element on the x-axis of the plots corresponds to an
action rule. In this example, we have 10 action rules C = {C0 , C1 , . . . , C9 }, and we assign M = 5 weight vectors,
i.e. W = [w0 , w1 , . . . , w4 ]. The softmax distributions over the rule weights become peaked through learning to
maximize the return. The rightmost rules are redundant, and these rules get low weights after learning.
Fig. 7 shows the deduction pipeline of a NUDGE agent on 3 different states. Facts can be deduced by an
object detection method, or directly given by the object-centric environment. For state #1, the agent chooses to
jump, as the jump action is prioritized over the other ones and all atoms that compose this rule's body have high
valuations (including closeby). In state #2, the agent chooses to go left as the rule left_key is selected. In state
#3, the agent selects right as the rule right_door has the highest forward-chaining evaluation.
We show the logic policies obtained by NUDGE in GetOut, 3Fishes, and Loot in Fig. 8, e.g. the first line of
GetOut, “0.574 : jump(X):-closeby(O1, O2), type(O1, agent), type(O2, enemy).”, represents that the action
rule is chosen by the weight vector w1 with a value 0.574. NUDGE agents have several weight vectors
w1 , . . . , wM and thus several chosen action rules are shown for each environment.
Figure 6: Weights on action rules via softmax, before training (top) and after training (bottom), for NUDGE in
GetOut. Each element on the x-axis of the plots corresponds to an action rule. NUDGE learns to obtain high returns
while identifying useful action rules to solve the RL task. The right 5 rules are redundant rules, and these rules
get low weights after learning.
Figure 7: The logic reasoning of NUDGE agents makes them interpretable. The detailed logic pipeline for the
input state #1 of the Getout environment and the condensed action selection for state #2 and state #3.
GetOut-2En is a variation with 2 enemies. In 3Fishes, the agent controls a green fish and is confronted with 2 other fishes,
one smaller (that the agent needs to “eat”, i.e. go to) and one bigger, that the agent needs to dodge. A variation is
3Fishes-C, where the agent can eat green fishes and dodge red ones; all fishes have the same size. Finally, in
Loot, the (orange) agent is presented with 1 or 2 chests and their corresponding (i.e. same color) keys. In Loot-C,
the chests have different colors. All 3 environments are stationary in the sense of Delfosse et al. [2021].
# GetOut
0.574:jump(X):-closeby(O1,O2),type(O1,agent),type(O2,enemy).
0.315:right_go_to_door(X):-have_key(X),on_left(O1,O2),type(O1,agent),type(O2,door).
0.296:right_go_to_door(X):-have_key(X),on_left(O1,O2),type(O1,agent),type(O2,door).
0.291:right_go_get_key(X):-not_have_key(X),on_left(O1,O2),type(O1,agent),
type(O2,key).
0.562:right_go_to_door(X):-have_key(X),on_left(O1,O2),type(O1,agent),type(O2,door).
#3Fishes
0.779:right_to_eat(X):-is_bigger_than(O1,O2),on_left(O2,O1),type(O1,agent),
type(O2,fish).
0.445:down_to_dodge(X):-is_bigger_than(O2,O1),on_left(O2,O1),type(O1,agent),
type(O2,fish).
0.579:down_to_eat(X):-high_level(O1,O2),is_smaller_than(O2,O1),type(O1,agent),
type(O2,fish).
0.699:up_to_dodge(X):-closeby(O2,O1),is_smaller_than(O1,O2),low_level(O2,O1),
type(O1,agent),type(O2,fish).
0.601:up_to_eat(X):-is_bigger_than(O2,O1),on_left(O2,O1),type(O1,agent),
type(O2,fish).
0.581:left_to_eat(X):-closeby(O1,O2),on_right(O1,O2),type(O1,agent),type(O2,fish).
# Loot
0.844:up_to_door(X):-close(O1,O2),have_key(O2),on_top(O2,O1),type(O1,agent),
type(O2,door).
0.268:right_to_key(X):-close(O1,O2),on_right(O2,O1),type(O1,agent),type(O2,key).
0.732:right_to_door(X):-close(O1,O2),have_key(O2),on_left(O1,O2),type(O1,agent),
type(O2,door).
0.508:up_to_key(X):-close(O1,O2),on_top(O2,O1),type(O1,agent),type(O2,key).
0.995:left_to_door(X):-close(O1,O2),have_key(O2),on_left(O2,O1),type(O1,agent),
type(O2,door).
0.414:down_to_key(X):-close(O1,O2),on_top(O1,O2),type(O1,agent),type(O2,key).
0.992:down_to_door(X):-close(O1,O2),have_key(O2),on_top(O1,O2),type(O1,agent),
type(O2,door).
0.447:left_to_key(X):-close(O1,O2),on_left(O2,O1),type(O1,agent),type(O2,key).
Figure 8: NUDGE produces an interpretable policy as a set of weighted rules. Weighted action rules discovered
by NUDGE in each logic environment.
D.1 Hyperparameters
We here provide the hyperparameters used in our experiments. We set the clip parameter $\epsilon_{clip} = 0.2$ and the discount
factor $\gamma = 0.99$. We use the Adam optimizer, with an actor learning rate of 1e-3 and a critic learning rate of 3e-4.
The episode length is 500 timesteps. The policy is updated every 1000 steps. We train every algorithm for 800k
steps on each environment, apart from neural PPO, which needed 5M steps on Loot. We use an epsilon-greedy
strategy with $\epsilon = \max(e^{-episode/500}, 0.02)$.
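For reference, the stated values collected into a single configuration dictionary; this is an illustrative sketch, and the key names are our own rather than those of the released code.

```python
import math

config = {
    "clip_eps": 0.2,            # PPO clip parameter
    "gamma": 0.99,              # discount factor
    "optimizer": "Adam",
    "lr_actor": 1e-3,
    "lr_critic": 3e-4,
    "episode_length": 500,      # timesteps per episode
    "update_every": 1000,       # policy update frequency (in steps)
    "total_steps": 800_000,     # 5M for neural PPO on Loot
}

def eps_greedy(episode: int) -> float:
    """Exploration rate: eps = max(exp(-episode / 500), 0.02)."""
    return max(math.exp(-episode / 500), 0.02)
```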
All the rules set C of the different NUDGE and logic agents are available at https://anonymous.4open.
science/r/LogicRL-C43B in the folder nsfr/nsfr/data/lang.
We provide details of differentiable forward reasoning used in NUDGE. We denote a valuation vector at time
step t as v(t) ∈ [0, 1]G . We also denote the i-th element of vector x by x[i], and the (i, j)-th element of matrix
X by X[i, j]. The same applies to higher dimensional tensors.
Figure 9: Pictures of our environments (GetOut, Loot and 3Fishes) and their variations (GetOut-2En, Loot-C
and 3Fishes-C). All these environments can provide object-centric state descriptions (instead of pixel-based
states).
We compose the reasoning function $f^{reason}_{(C,W)}: [0, 1]^G \to [0, 1]^{G_A}$, which takes the initial valuation vector and
returns the valuation vector for the induced action atoms. We describe each step in detail.
(Step 1) Encode Logic Programs to Tensors. To achieve differentiable forward reasoning, each action rule is
encoded into a tensor representation. Let S be the maximum number of substitutions for existentially
quantified variables in C, and L be the maximum length of the body of the rules in C. Each action rule $C_i \in C$ is
encoded into a tensor $I_i \in \mathbb{N}^{G \times S \times L}$, which contains the indices of its body atoms. Intuitively, $I_i[j, k, l]$ is the index of
the l-th fact (subgoal) in the body of the i-th rule to derive the j-th fact with the k-th substitution for existentially
quantified variables.
For example, let R0 = jump(agent):-type(O1, agent), type(O2, enemy), closeby(O1, O2) ∈ C and F2 =
jump(agent) ∈ G, and assume that the constants for objects are {obj1, obj2}. R0 has existentially quantified
variables O1 and O2 in its body, so we obtain ground rules by substituting constants. By considering the possible
substitutions for O1 and O2, namely {O1/obj1, O2/obj2} and {O1/obj2, O2/obj1}, we have two ground rules,
as shown in the top of Table 3. The bottom rows of Table 3 show the elements of the tensor slices I0,:,0,: and I0,:,1,:. The facts G
and the indices are represented on the upper rows in the table. For example, I0,2,0,: = [3, 6, 7] because R0
entails jump(agent) with the first (k = 0) substitution τ = {O1/obj1, O2/obj2}. Then the subgoal atoms are
{type(obj1, agent), type(obj2, enemy), closeby(obj1, obj2)}, which have indices [3, 6, 7], respectively.
The atoms which have a different predicate, e.g., closeby(obj1, obj2), will never be entailed by clause R0 .
Therefore, the corresponding values are filled with 0, which represents the index of the false atom.
(Step 2) Assign Rule Weights. We assign weights to compose the policy with several action rules as follows: (i)
We fix the target program's size as M, i.e. we look for a policy with M action rules. (ii) We introduce
C-dimensional weights W = [w1 , . . . , wM ]. (iii) We take the softmax of each weight vector wj ∈ W and softly choose
M action rules out of the C candidate action rules to compose the policy.
(Step 3) Perform Differentiable Inference. We compute 1-step forward reasoning using weighted action rules,
then we recursively perform reasoning to compute T -step reasoning.
(k = 0) jump(agent):-type(obj1, agent), type(obj2, enemy), closeby(obj1, obj2).
(k = 1) jump(agent):-type(obj2, agent), type(obj1, enemy), closeby(obj2, obj1).
j 0 1 2 3 4 5
G ⊥ ⊤ jump(agent) type(obj1, agent) type(obj2, agent) type(obj1, enemy)
I0,j,0,: [0, 0, 0] [1, 1, 1] [3, 6, 7] [0, 0, 0] [0, 0, 0] [0, 0, 0]
I0,j,1,: [0, 0, 0] [1, 1, 1] [4, 5, 8] [0, 0, 0] [0, 0, 0] [0, 0, 0]
j 6 7 8 ...
G type(obj2, enemy) closeby(obj1, obj2) closeby(obj2, obj1) ...
I0,j,0,: [0, 0, 0] [0, 0, 0] [0, 0, 0] ...
I0,j,1,: [0, 0, 0] [0, 0, 0] [0, 0, 0] ...
Table 3: Example of ground rules (top) and elements in the index tensor (bottom). Each fact has its index, and
the index tensor contains the indices of the facts to compute forward inferences.
[(i) Reasoning using an action rule] First, for each action rule $C_i \in C$, we evaluate body atoms for different
groundings of $C_i$ by computing $b^{(t)}_{i,j,k} \in [0, 1]$:
$$b^{(t)}_{i,j,k} = \prod_{1 \le l \le L} gather(v^{(t)}, I_i)[j, k, l] \qquad (8)$$
where $gather: [0, 1]^G \times \mathbb{N}^{G \times S \times L} \to [0, 1]^{G \times S \times L}$ is:
$$gather(x, Y)[j, k, l] = x[Y[j, k, l]]. \qquad (9)$$
The gather function replaces the indices of the body state atoms by the current valuation values in $v^{(t)}$. To take
logical and across the subgoals in the body, we take the product across valuations. $b^{(t)}_{i,j,k}$ represents the valuation
of the body atoms for the $i$-th rule using the $k$-th substitution for the existentially quantified variables to deduce the $j$-th fact at
time $t$.
Now we take logical or softly to combine all of the different groundings for $C_i$ by computing $c^{(t)}_{i,j} \in [0, 1]$:
$$c^{(t)}_{i,j} = softor^\gamma(b^{(t)}_{i,j,1}, \ldots, b^{(t)}_{i,j,S}) \qquad (10)$$
where $softor^\gamma$ is a smooth logical or function:
$$softor^\gamma(x_1, \ldots, x_n) = \gamma \log \sum_{1 \le i \le n} \exp(x_i/\gamma), \qquad (11)$$
where $\gamma > 0$ is a smoothing parameter. Eq. 11 is an approximation of the max function over probabilistic values
based on the log-sum-exp approach [Cuturi and Blondel, 2017].
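To make Steps 1 and (i) concrete, the following sketch hard-codes the index-tensor slice I0 from Table 3 for clause R0 and evaluates its two groundings via gather, a product over body atoms, and the soft or of Eq. 10; the valuation values and the smoothing parameter are made up for illustration.

```python
import torch

G, S, L = 9, 2, 3  # ground atoms shown in Table 3, substitutions, body length of R0

# I0[j, k, l]: index of the l-th body atom needed to derive atom j with substitution k.
I0 = torch.zeros(G, S, L, dtype=torch.long)   # index 0 is the false atom (default)
I0[1] = 1                                     # the true atom is trivially derivable
I0[2, 0] = torch.tensor([3, 6, 7])            # jump(agent) via {O1/obj1, O2/obj2}
I0[2, 1] = torch.tensor([4, 5, 8])            # jump(agent) via {O1/obj2, O2/obj1}

# Illustrative current valuations v(t) over the 9 ground atoms of Table 3.
v = torch.tensor([0.0, 1.0, 0.0, 0.99, 0.01, 0.02, 0.99, 0.93, 0.05])

b = v[I0].prod(dim=-1)                          # Eq. 8: gather body valuations, logical and
gamma = 0.01
c = gamma * torch.logsumexp(b / gamma, dim=-1)  # Eq. 10: soft or over substitutions
print(c[2])  # valuation of jump(agent) derived by R0, ~0.91
```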
[(ii) Combine results from different action rules] Now we apply the different action rules using the assigned
weights by computing $h^{(t)}_{j,m} \in [0, 1]$:
$$h^{(t)}_{j,m} = \sum_{1 \le i \le C} w^*_{m,i} \cdot c^{(t)}_{i,j}, \qquad (12)$$
where $w^*_{m,i} = \exp(w_{m,i}) / \sum_{i'} \exp(w_{m,i'})$ and $w_{m,i} = w_m[i]$. Note that $w^*_{m,i}$ is interpreted as the probability
that action rule $C_i \in C$ is the $m$-th component of the policy. Now we complete the 1-step forward reasoning by
combining the results from the different weights:
$$r^{(t)}_j = softor^\gamma(h^{(t)}_{j,1}, \ldots, h^{(t)}_{j,M}). \qquad (13)$$
Taking $softor^\gamma$ means that we compose the policy using M softly chosen action rules out of the C candidate rules.
[(iii) Multi-step reasoning] We perform T-step forward reasoning by computing $r^{(t)}_j$ recursively for T times:
$v^{(t+1)}_j = softor^\gamma(r^{(t)}_j, v^{(t)}_j)$. Finally, we compute $v^{(T)} \in [0, 1]^G$ and return $v_A \in [0, 1]^{G_A}$ by extracting only the
outputs for action atoms from $v^{(T)}$. The whole reasoning computation of Eqs. 8-13 can be implemented using only
efficient tensor operations. See App. E.2 for a detailed description.
E.2 Implementation Details
Here we provide implementation details of the differentiable forward reasoning. The whole reasoning computation
in NUDGE can be implemented as a neural network that performs forward reasoning and can efficiently
process a batch of examples in parallel on GPUs, which is a non-trivial feature for logical reasoners.
Each clause $C_i \in C$ is compiled into a differentiable function that performs forward reasoning using the index tensor.
The clause function is computed as:
$$C^{(t)}_i = softor^\gamma_3\big(prod_2\big(gather_1(\tilde{V}^{(t)}, \tilde{I})\big)\big), \qquad (14)$$
where $gather_1(X, Y)_{i,j,k,l} = X_{i,\,Y_{i,j,k,l},\,k,\,l}$³ obtains the valuations for the body atoms of clause $C_i$ from the
valuation tensor and the index tensor. $prod_2$ returns the product along dimension 2, i.e. the product of the valuations
of the body atoms for each grounding of $C_i$. The $softor^\gamma_3$ function is applied along dimension 3, i.e. over all the groundings
(or possible substitutions) of $C_i$.
$softor^\gamma_d$ is a function for taking logical or softly along dimension d:
$$softor^\gamma_d(X) = \gamma \log\big(sum_d\, \exp(X/\gamma)\big), \qquad (15)$$
where $\gamma > 0$ is a smoothing parameter and $sum_d$ is the sum function along dimension d. The results from each
clause, $C^{(t)}_i \in \mathbb{R}^{B \times G}$, are stacked into the tensor $C^{(t)} \in \mathbb{R}^{C \times B \times G}$.
Finally, the T-step inference is computed by amalgamating the inference results recursively. We take the
softmax of the clause weights $W \in \mathbb{R}^{M \times C}$ and softly choose M clauses out of the C clauses to compose the logic
program:
$$W^* = softmax_1(W), \qquad (16)$$
where $softmax_1$ is a softmax function over dimension 1. The clause weights $W^* \in \mathbb{R}^{M \times C}$ and the output of
the clause function $C^{(t)} \in \mathbb{R}^{C \times B \times G}$ are expanded (via copy) to the same shape $\tilde{W}^*, \tilde{C}^{(t)} \in \mathbb{R}^{M \times C \times B \times G}$. The
tensor $H^{(t)} \in \mathbb{R}^{M \times B \times G}$ is computed as
$$H^{(t)} = sum_1(\tilde{W}^* \odot \tilde{C}^{(t)}), \qquad (17)$$
where $\odot$ is element-wise multiplication. Each value $H^{(t)}_{i,j,k}$ represents the weight of the k-th ground atom using the i-th
clause-weight vector for the j-th example in the batch. Finally, we compute the tensor $R^{(t)} \in \mathbb{R}^{B \times G}$, corresponding to the
fact that a logic program is a set of clauses:
$$R^{(t)} = softor^\gamma_0(H^{(t)}). \qquad (18)$$
With r the 1-step forward-chaining reasoning function:
$$r(V^{(t)}; I, W) = R^{(t)}, \qquad (19)$$
we compute the T-step reasoning using:
$$V^{(t+1)} = softor^\gamma_1\big(stack_1\big(V^{(t)}, r(V^{(t)}; I, W)\big)\big), \qquad (20)$$
where $I \in \mathbb{N}^{C \times G \times S \times L}$ is the precomputed index tensor and $W \in \mathbb{R}^{M \times C}$ are the clause weights. After the T-step
reasoning, the probabilities over the action atoms $G_A$ are extracted from $V^{(T)}$ as $V_A \in [0, 1]^{B \times G_A}$.
3 Done with pytorch.org/docs/torch.gather.
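A condensed, self-contained sketch of the batched computation in Eqs. 14-20; the shapes follow the text (B batch, C clauses, G ground atoms, S substitutions, L body length, M soft rule slots), while the dimension ordering, the default smoothing parameter, and the function names are our own assumptions rather than the released implementation.

```python
import torch

def softor(x: torch.Tensor, dim: int, gamma: float = 0.01) -> torch.Tensor:
    """Smooth logical or along `dim` via log-sum-exp (Eq. 15)."""
    return gamma * torch.logsumexp(x / gamma, dim=dim)

def one_step(V, I, W, gamma=0.01):
    """One step of weighted forward reasoning (Eqs. 14-18).

    V: [B, G] valuations, I: [C, G, S, L] long index tensor, W: [M, C] clause weights.
    """
    gathered = V[:, I]                                    # gather_1 -> [B, C, G, S, L]
    body = gathered.prod(dim=-1)                          # prod_2 over body atoms
    clause_out = softor(body, dim=-1, gamma=gamma)        # soft or over substitutions
    W_star = torch.softmax(W, dim=1)                      # Eq. 16: soft clause selection
    H = torch.einsum("mc,bcg->bmg", W_star, clause_out)   # Eq. 17
    return softor(H, dim=1, gamma=gamma)                  # Eq. 18 -> [B, G]

def reason(V0, I, W, T=3, gamma=0.01):
    """T-step reasoning (Eqs. 19-20): amalgamate new and previous valuations."""
    V = V0
    for _ in range(T):
        R = one_step(V, I, W, gamma)
        V = softor(torch.stack([V, R], dim=1), dim=1, gamma=gamma)
    return V
```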