Effective Bayesian Causal Inference via
Structural Marginalisation and Autoregressive Orders

Christian Toth    Christian Knoll    Franz Pernkopf    Robert Peharz
Abstract

Bayesian causal inference (BCI) naturally incorporates epistemic uncertainty about the true causal model into down-stream causal reasoning tasks by posterior averaging over causal models. However, this poses a tremendously hard computational problem due to the intractable number of causal structures to marginalise over. In this work, we decompose the structure learning problem into inferring (i) a causal order and (ii) a parent set for each variable given a causal order. By limiting the number of parents per variable, we can exactly marginalise over the parent sets in polynomial time, which leaves only the causal order to be marginalised. To this end, we propose a novel autoregressive model over causal orders (ARCO) learnable with gradient-based methods. Our method yields state-of-the-art in structure learning on simulated non-linear additive noise benchmarks with scale-free and Erdos-Renyi graph structures, and competitive results on real-world data. Moreover, we illustrate that our method accurately infers interventional distributions, which allows us to estimate posterior average causal effects and many other causal quantities of interest.

Bayesian causal inference, causal orders, Gaussian processes, causal discovery, causal structure learning, permutations, orderings

1 Introduction

Refer to caption
Refer to caption
Figure 1: Causal discovery on nonlinear additive noise models. Structure learning results in terms of expected Hamming distance (ESHD) and ancestor adjustment identification distance (A-AID) on simulated non-linear models with scale-free (left, blue) and Erdös-Renyi (right, orange) graphs, each with 20202020 nodes and 200200200200 data samples. Whiskers indicate maximum, minimum and median values across 20202020 simulated ground truth instances. For both metrics lower is better. Range for ESHD is set for better readability, omitting the result for DDS (>125absent125>125> 125).
Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 2: Posterior interventional distributions. Several interventional distributions as inferred by ARCO-GP (blue, solid) and the corresponding ground truth (green dashed). Specifically, we sampled full SCMs (orders, parent sets, mechanisms, exogeneous variables) and performed the indicated intervention to produce a sample from the corresponding distribution, which effectively marginalises over the posterior over SCMs. Vertical lines indicate the estimated means (averages). The underlying SCM is a simulated non-linear additive noise model with graph structure taken as the consensus protein interaction graph from Sachs et al. (2005) (see Figure 4). See Appendix C for details.

Few topics in science and philosophy have been as controversial as the discussion about the nature of causality. Interestingly, the discussion becomes relatively benign, from a philosophical perspective, as soon as one agrees on a well-defined mathematical model of causality, such as a structural causal model (SCM) (Pearl, 2009). A fully specified SCM induces a wide range of joint probability distributions, organised on three levels denoted as association, intervention and counterfactuals, collectively forming the so-called Pearl’s causal hierarchy (Bareinboim et al., 2022). Any causal question then elegantly reduces to standard probabilistic inference in some of these joint distributions. Hence, assuming that the data comes from some model within a considered class of SCMs, causal questions reduce, in principle, to epistemic questions, i.e., questions about what and how much is known about the model.

The view that causal questions reduce to epistemic questions about the true SCM naturally invites a Bayesian treatment, i.e., specifying a prior distribution over SCMs and inferring the posterior given collected data. Bayesian approaches to causal discovery, i.e., learning the causal structure (represented as directed acyclic graph, DAG), dates back to work as early as (Heckerman, 1995; Murphy, 2001; Tong & Koller, 2001). More intriguing, however, is to use a Bayesian approach over entire SCMs for the purpose of Bayesian causal inference (BCI), i.e., consistent reasoning about a down-stream causal query: usually, one is less interested in the SCM per se, but rather in some causal quantity, such as the existence of a direct causal connection (edge in the DAG), or the average causal effect one variable has on another. The goal of BCI is then to reason about such down-stream tasks by computing a Bayesian average over the whole posterior over SCMs.

footnotetext: Code available at: https://github.com/chritoth/bci-arco-gp

Most existing BCI approaches are restricted to linear Gaussian models (Geiger & Heckerman, 1994; Viinikka et al., 2020; Pensar et al., 2020; Horii, 2021) or binary variables (Moffa et al., 2017; Kuipers et al., 2019), while a full Bayesian treatment of non-linear SCMs has been proposed only recently (Toth et al., 2022; Tigas et al., 2022). Both approaches leverage the DiBS method (Lorch et al., 2021), a gradient-based framework for Bayesian inference over DAGs, and infer causal mechanisms conditionally on the DAG. While Tigas et al. (2022) use Bayesian neural nets to represent the causal mechanisms and require stochastic variational inference over their parameters, Toth et al. (2022) leverage Gaussian Processes (GPs) for the mechanisms, which are marginalised in closed-form. The latter effectively represents a form of Rao-Blackwellisation, reducing inference to the problem of marginalising over DAGs.111The term Rao-Blackwellisation is used in a generalised sense here, denoting a generally observed variance reduction in Monte Carlo estimates when “nuisance variables” are integrated out (Robert & Roberts, 2021).

In this paper, we enhance non-linear BCI and decompose structure inference to (i) inferring a causal order over the endogenous variables and (ii) inferring parent sets for each variable conditional on the order.222In this paper, we use the term causal order for the topological order of a causal DAG. We use this term to emphasise that this order per se might be an object of interest to the practitioner. We perform the latter by restricting the number of parents per variable to a fixed maximal number K𝐾Kitalic_K and exactly marginalising over all possible parent sets in polynomial (𝒪(dK)𝒪superscript𝑑𝐾\mathcal{O}(d^{K})caligraphic_O ( italic_d start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT )) time, where d𝑑ditalic_d is the number of endogenous variables. This technique is well-established in Markov chain Monte Carlo (MCMC) techniques for Bayesian structure learning (Koller & Friedman, 2003; Koivisto & Sood, 2004; Viinikka et al., 2020), but has—to the best of our knowledge—not been exploited in gradient-based structure learning so far. This structural marginalisation, in combination with the analytic treatment of the causal mechanisms via GPs (Toth et al., 2022), reduces the inference problem to the marginalisation w.r.t. causal orders. To this end, we propose a novel neural generative model over topological orders which can be understood as an Auto-Regressive distribution estimator (Larochelle & Murray, 2011), constrained to Causal Orders, hence we dub this model ARCO.

Figure 1 show results for simulated scale-free and Erdös-Renyi graphs, with non-linear causal mechanisms and additive Gaussian noise, demonstrating that ARCO-GP (meaning BCI via ARCO and GPs) is highly effective for causal discovery. In particular, for scale-free graphs, where the assumption of a restricted number of parents matches our modelling assumption, ARCO-GP clearly outperforms a wide range of competitor methods. Moreover, even when the assumptions are violated (in our case for Erdos-Renyi graphs), we demonstrate that ARCO-GP still outperforms the baselines, albeit with a smaller margin. Moreover, in Figure 2 we demonstrate ARCO-GP’s efficacy in down-stream BCI, by sampling from various interventional distributions. Specifically, by sampling from the whole Bayesian model (including causal structure and mechanisms) we effectively perform posterior marginalisation over SCMs. Detailed results, including further evaluation metrics and results on a real-world dataset (Sachs et al., 2005), are presented in Section Section 5 and Appendix C.

In summary, our main contributions are:

  • We propose an effective BCI framework which exactly marginalises both causal mechanisms and parent sets from the posterior over SCMs, reducing the hard BCI problem to inferring the causal (topological) order.

  • We propose ARCO, a neural autoregressive model over causal orders and devise a gradient-based learning scheme for it. While here ARCO is used mainly for causal reasoning, it might well find applications in many other domains requiring generative models over total orders (e.g., rankings).

  • We demonstrate in experiments that ARCO-GP sets state-of-the-art against a wide range of baselines when our model assumption are exactly matched. When the model assumptions are violated, we still outperform or are competitive with all baselines.

2 Background

Structural Causal Models. An SCM \mathcal{M}caligraphic_M over observed endogenous variables 𝐗={X1,,Xd}𝐗subscript𝑋1subscript𝑋𝑑\mathbf{X}=\{X_{1},\dots,X_{d}\}bold_X = { italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_X start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT } and unobserved exogenous variables 𝐔={U1,,Ud}𝐔subscript𝑈1subscript𝑈𝑑\mathbf{U}=\{U_{1},\dots,U_{d}\}bold_U = { italic_U start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_U start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT } consists of structural equations, or mechanisms,

Xi:=fi(𝐏𝐚i,Ui),fori{1,,d},formulae-sequenceassignsubscript𝑋𝑖subscript𝑓𝑖subscript𝐏𝐚𝑖subscript𝑈𝑖for𝑖1𝑑X_{i}:=f_{i}(\mathbf{Pa}_{i},U_{i}),\qquad\quad\text{for}\quad i\in\{1,\dots,d\},italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT := italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_Pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_U start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) , for italic_i ∈ { 1 , … , italic_d } , (1)

which assign the value of each Xisubscript𝑋𝑖X_{i}italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT as a deterministic function fisubscript𝑓𝑖f_{i}italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT of its direct causes, or causal parents, 𝐏𝐚i𝑿{Xi}subscript𝐏𝐚𝑖𝑿subscript𝑋𝑖\mathbf{Pa}_{i}\subseteq\bm{X}\setminus\{X_{i}\}bold_Pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ⊆ bold_italic_X ∖ { italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } and an exogenous variable Uisubscript𝑈𝑖U_{i}italic_U start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, and a joint distribution p(𝐔)𝑝𝐔p(\mathbf{U})italic_p ( bold_U ) over the exogenous variables. In this paper we assume that the exogenous variables are independent Gaussian and enter in additive fashion, i.e. fi(𝐏𝐚i,Ui)=fi(𝐏𝐚i)+Uisubscript𝑓𝑖subscript𝐏𝐚𝑖subscript𝑈𝑖subscript𝑓𝑖subscript𝐏𝐚𝑖subscript𝑈𝑖f_{i}(\mathbf{Pa}_{i},U_{i})=f_{i}(\mathbf{Pa}_{i})+U_{i}italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_Pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_U start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( bold_Pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) + italic_U start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, thus implying causal sufficiency. Associated with each SCM is a directed graph G𝐺Gitalic_G induced by the set of parent sets Pa={Pai}i=1dPasuperscriptsubscriptsubscriptPa𝑖𝑖1𝑑\textbf{Pa}=\{\textbf{Pa}_{i}\}_{i=1}^{d}Pa = { Pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT with vertices 𝑿𝑿\bm{X}bold_italic_X and edges XjXisubscript𝑋𝑗subscript𝑋𝑖X_{j}\to X_{i}italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT → italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT if and only if Xj𝐏𝐚isubscript𝑋𝑗subscript𝐏𝐚𝑖X_{j}\in\mathbf{Pa}_{i}italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∈ bold_Pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. Any SCM with an acyclic directed graph (DAG) then induces a unique observational distribution p(𝑿|)𝑝conditional𝑿p(\bm{X}\,|\,\mathcal{M})italic_p ( bold_italic_X | caligraphic_M ) over the endogenous variables 𝑿𝑿\bm{X}bold_italic_X, which is obtained as the pushforward measure of p(𝑼)𝑝𝑼p(\bm{U})italic_p ( bold_italic_U ) through the causal mechanisms in Equation 1.

A (hard) intervention do(𝐖=𝐰)𝑑𝑜𝐖𝐰do(\mathbf{W}=\mathbf{w})italic_d italic_o ( bold_W = bold_w ) on a set of endogenous variables 𝐖𝐗𝐖𝐗\mathbf{W}\subset\mathbf{X}bold_W ⊂ bold_X replaces the targeted mechanisms with constants 𝐰𝐰\mathbf{w}bold_w, resulting in a modified SCM. The entailed modified causal graph lacks the incoming edges into any intervention target. The pushforward through the modified SCM yields an interventional distribution p(𝐗|do(𝐖=𝐰),)𝑝conditional𝐗𝑑𝑜𝐖𝐰p(\mathbf{X}\,|\,do(\mathbf{W}=\mathbf{w}),\mathcal{M})italic_p ( bold_X | italic_d italic_o ( bold_W = bold_w ) , caligraphic_M ).

Causal Orders. A permutation L=L1,,Ld𝐿subscript𝐿1subscript𝐿𝑑L=\left\langle L_{1},\dots,L_{d}\right\rangleitalic_L = ⟨ italic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_L start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ⟩ of the endogenous variables 𝐗=i=1d{Li}𝐗superscriptsubscript𝑖1𝑑subscript𝐿𝑖\mathbf{X}=\bigcup_{i=1}^{d}\{L_{i}\}bold_X = ⋃ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT { italic_L start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT }, where LiLjsubscript𝐿𝑖subscript𝐿𝑗L_{i}\neq L_{j}italic_L start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ≠ italic_L start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT for all ij𝑖𝑗i\neq jitalic_i ≠ italic_j, entails a strict total order L1L2Ldprecedessubscript𝐿1subscript𝐿2precedesprecedessubscript𝐿𝑑L_{1}\prec L_{2}\prec\dots\prec L_{d}italic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ≺ italic_L start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ≺ ⋯ ≺ italic_L start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT among the variables. Henceforth, we refer to such a permutation L𝐿Litalic_L as a causal order. A causal order L𝐿Litalic_L constrains the possible causal interactions between the variables, i.e., Xisubscript𝑋𝑖X_{i}italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT can be a (direct) cause of Xjsubscript𝑋𝑗X_{j}italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT if, and only if, XiXjprecedessubscript𝑋𝑖subscript𝑋𝑗X_{i}\prec X_{j}italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ≺ italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT in L𝐿Litalic_L. We define L<k=L1,,Lk1subscript𝐿absent𝑘subscript𝐿1subscript𝐿𝑘1L_{<k}=\left\langle L_{1},\dots,L_{k-1}\right\rangleitalic_L start_POSTSUBSCRIPT < italic_k end_POSTSUBSCRIPT = ⟨ italic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_L start_POSTSUBSCRIPT italic_k - 1 end_POSTSUBSCRIPT ⟩ to be the first k1𝑘1k-1italic_k - 1 elements in L𝐿Litalic_L. Finally, let λL:𝐗{1,,d}:superscript𝜆𝐿maps-to𝐗1𝑑\lambda^{L}\colon\mathbf{X}\mapsto\{1,\dots,d\}italic_λ start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT : bold_X ↦ { 1 , … , italic_d } be the bijective mapping between 𝑿𝑿\bm{X}bold_italic_X and indices in L𝐿Litalic_L, i.e., λL(Xi)=kLk=Xiiffsuperscript𝜆𝐿subscript𝑋𝑖𝑘subscript𝐿𝑘subscript𝑋𝑖\lambda^{L}(X_{i})=k\iff L_{k}=X_{i}italic_λ start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = italic_k ⇔ italic_L start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT. We then denote by QL{0,1}d×dsuperscript𝑄𝐿superscript01𝑑𝑑Q^{L}\in\{0,1\}^{d\times d}italic_Q start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT ∈ { 0 , 1 } start_POSTSUPERSCRIPT italic_d × italic_d end_POSTSUPERSCRIPT the permutation matrix representing L𝐿Litalic_L, where QijL=1subscriptsuperscript𝑄𝐿𝑖𝑗1Q^{L}_{ij}=1italic_Q start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT = 1 iff λL(Xi)=jsuperscript𝜆𝐿subscript𝑋𝑖𝑗\lambda^{L}(X_{i})=jitalic_λ start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = italic_j.

3 Bayesian Causal Inference via Structural Marginalisation

𝜽𝜽\bm{\theta}bold_italic_θL𝐿Litalic_LG𝐺Gitalic_G𝝍𝝍\bm{\psi}bold_italic_ψ𝐟𝐟\mathbf{f}bold_f𝒟𝒟\mathcal{D}caligraphic_DY𝑌Yitalic_YSCM \mathcal{M}caligraphic_M
Figure 3: Bayesian network view of our generative model. We characterise a Structural Causal Model (SCM) by a causal graph G𝐺Gitalic_G, causal mechanisms 𝐟𝐟\mathbf{f}bold_f and parameters 𝝍𝝍\bm{\psi}bold_italic_ψ of a joint distribution over mechanisms and exogenous variables p(𝐟,𝐔|𝝍)𝑝𝐟conditional𝐔𝝍p(\mathbf{f},\mathbf{U}\,|\,\bm{\psi})italic_p ( bold_f , bold_U | bold_italic_ψ ). We model the mechanisms 𝐟𝐟\mathbf{f}bold_f using Gaussian Processes (GPs) and 𝐔𝐔\mathbf{U}bold_U as additive Gaussian noise, implying that 𝝍𝝍\bm{\psi}bold_italic_ψ is a set of GP hyper-parameters. The SCM gives rise to the data-generating likelihood p(𝒟|𝐟,𝝍,G)𝑝conditional𝒟𝐟𝝍𝐺p(\mathcal{D}\,|\,\mathbf{f},\bm{\psi},G)italic_p ( caligraphic_D | bold_f , bold_italic_ψ , italic_G ) and determines the (distribution over the) causal query Y𝑌Yitalic_Y. To sample a directed acyclic causal graph G𝐺Gitalic_G, we first sample a causal order L𝐿Litalic_L from a neural autoregressive distribution over causal orders (ARCO) p(L|𝜽)𝑝conditional𝐿𝜽p(L\,|\,\bm{\theta})italic_p ( italic_L | bold_italic_θ ) with parameters 𝜽𝜽\bm{\theta}bold_italic_θ. Given a causal order, we can then sample or marginalise exactly over causal graphs G𝐺Gitalic_G by limiting the maximum cardinality of parent sets.

When performing causal reasoning in the BCI framework, we define the causal quantity of our interest as Y𝑌Yitalic_Y, called the causal query, which is a function of the SCM \mathcal{M}caligraphic_M (see (Toth et al., 2022)). This causal query could be, for example, an endogenous variable under some intervention, (features of) the true causal graph, or even the entire SCM. Since we are following a Bayesian approach, \mathcal{M}caligraphic_M is a random variable equipped with a prior distribution p()𝑝p(\mathcal{M})italic_p ( caligraphic_M ), and hence also Y𝑌Yitalic_Y is a random variable.

Given a set of (observational) data 𝒟={𝐱ni.i.dp(𝐗|)}n=1N𝒟superscriptsubscriptsubscript𝐱𝑛i.i.dsimilar-to𝑝conditional𝐗superscript𝑛1𝑁\mathcal{D}=\{\mathbf{x}_{n}\overset{\text{i.i.d}}{\sim}p(\mathbf{X}\,|\,% \mathcal{M}^{*})\}_{n=1}^{N}caligraphic_D = { bold_x start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT overi.i.d start_ARG ∼ end_ARG italic_p ( bold_X | caligraphic_M start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) } start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT collected from the true underlying SCM superscript\mathcal{M}^{*}caligraphic_M start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT, we are interested in inferring the posterior p(Y|𝒟)𝑝conditional𝑌𝒟p(Y\,|\,\mathcal{D})italic_p ( italic_Y | caligraphic_D ) of the causal query, reflecting our uncertainty about superscript\mathcal{M}^{*}caligraphic_M start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT. Since the causal query Y𝑌Yitalic_Y is determined by the SCM, the query posterior is given as

p(Y|𝒟)=𝔼|𝒟[p(Y|)],𝑝conditional𝑌𝒟subscript𝔼conditional𝒟delimited-[]𝑝conditional𝑌p(Y\,|\,\mathcal{D})=\mathbb{E}_{\mathcal{M}\,|\,\mathcal{D}}[\,p(Y\,|\,% \mathcal{M})],italic_p ( italic_Y | caligraphic_D ) = blackboard_E start_POSTSUBSCRIPT caligraphic_M | caligraphic_D end_POSTSUBSCRIPT [ italic_p ( italic_Y | caligraphic_M ) ] , (2)

marginalising over posterior SCMs. Our inference procedure, described in Algorithm 1 and the generative model over SCMs shown in Figure 3, divides into a parameter learning and an inference phase. In the learning phase (Algorithm 1, lines 1-1), we infer posterior parameters p(𝜽,𝝍|𝒟)𝑝𝜽conditional𝝍𝒟p(\bm{\theta},\bm{\psi}\,|\,\mathcal{D})italic_p ( bold_italic_θ , bold_italic_ψ | caligraphic_D ). In the inference phase (Algorithm 1, lines 1-1), we use samples drawn from the learned generative model to approximate the query posterior. We first provide an overview of the individual phases starting with the inference phase, as it motivates our learning objective. We elaborate on the more technical details in Sections 3.1, 3.2 and 3.3.

Inference Phase.

Assuming we have already learned a set of posterior parameters 𝜽,𝝍𝜽𝝍\bm{\theta},\bm{\psi}bold_italic_θ , bold_italic_ψ approximating p(𝜽,𝝍|𝒟)𝑝𝜽conditional𝝍𝒟p(\bm{\theta},\bm{\psi}\,|\,\mathcal{D})italic_p ( bold_italic_θ , bold_italic_ψ | caligraphic_D ) (see the learning phase), we want to employ our model to approximate the query posterior in Equation 2. A key observation similar in spirit to (Lorch et al., 2021; Toth et al., 2022) is that the expectation w.r.t. SCMs can be written as the following importance weighted expectation (for a derivation see Section A.1)

p(Y|𝒟)=𝔼|𝒟[p(Y|)]𝑝conditional𝑌𝒟subscript𝔼conditional𝒟delimited-[]𝑝conditional𝑌\displaystyle p(Y\,|\,\mathcal{D})=\mathbb{E}_{\mathcal{M}\,|\,\mathcal{D}}% \left[p(Y\,|\,\mathcal{M})\right]italic_p ( italic_Y | caligraphic_D ) = blackboard_E start_POSTSUBSCRIPT caligraphic_M | caligraphic_D end_POSTSUBSCRIPT [ italic_p ( italic_Y | caligraphic_M ) ] (3)
=𝔼𝜽,𝝍|𝒟[𝔼L|𝜽[wL𝔼G|L,𝝍,𝒟[𝔼𝐟|𝝍,𝒟[p(Y|)]]]]absentsubscript𝔼𝜽conditional𝝍𝒟delimited-[]subscript𝔼conditional𝐿𝜽delimited-[]superscript𝑤𝐿subscript𝔼conditional𝐺𝐿𝝍𝒟delimited-[]subscript𝔼conditional𝐟𝝍𝒟delimited-[]𝑝conditional𝑌\displaystyle=\mathbb{E}_{\bm{\theta},\bm{\psi}\,|\,\mathcal{D}}\left[\mathbb{% E}_{L\,|\,\bm{\theta}}\left[w^{L}\cdot\mathbb{E}_{G\,|\,L,\bm{\psi},\mathcal{D% }}\left[\mathbb{E}_{\mathbf{f}\,|\,\bm{\psi},\mathcal{D}}\left[p(Y\,|\,% \mathcal{M})\right]\right]\right]\right]= blackboard_E start_POSTSUBSCRIPT bold_italic_θ , bold_italic_ψ | caligraphic_D end_POSTSUBSCRIPT [ blackboard_E start_POSTSUBSCRIPT italic_L | bold_italic_θ end_POSTSUBSCRIPT [ italic_w start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT ⋅ blackboard_E start_POSTSUBSCRIPT italic_G | italic_L , bold_italic_ψ , caligraphic_D end_POSTSUBSCRIPT [ blackboard_E start_POSTSUBSCRIPT bold_f | bold_italic_ψ , caligraphic_D end_POSTSUBSCRIPT [ italic_p ( italic_Y | caligraphic_M ) ] ] ] ]

with importance weights

wL:=𝔼G|L[p(𝒟|𝝍,G)p(𝝍|G)]𝔼L|𝜽[𝔼G|L[p(𝒟|𝝍,G)p(𝝍|G)]],assignsuperscript𝑤𝐿subscript𝔼conditional𝐺𝐿delimited-[]𝑝conditional𝒟𝝍𝐺𝑝conditional𝝍𝐺subscript𝔼conditionalsuperscript𝐿𝜽delimited-[]subscript𝔼conditionalsuperscript𝐺superscript𝐿delimited-[]𝑝conditional𝒟𝝍superscript𝐺𝑝conditional𝝍superscript𝐺\displaystyle w^{L}:=\frac{\mathbb{E}_{G\,|\,L}\left[p(\mathcal{D}\,|\,\bm{% \psi},G)\cdot p(\bm{\psi}\,|\,G)\right]}{\mathbb{E}_{L^{\prime}\,|\,\bm{\theta% }}\left[\mathbb{E}_{G^{\prime}\,|\,L^{\prime}}\left[p(\mathcal{D}\,|\,\bm{\psi% },G^{\prime})\cdot p(\bm{\psi}\,|\,G^{\prime})\right]\right]},italic_w start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT := divide start_ARG blackboard_E start_POSTSUBSCRIPT italic_G | italic_L end_POSTSUBSCRIPT [ italic_p ( caligraphic_D | bold_italic_ψ , italic_G ) ⋅ italic_p ( bold_italic_ψ | italic_G ) ] end_ARG start_ARG blackboard_E start_POSTSUBSCRIPT italic_L start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT | bold_italic_θ end_POSTSUBSCRIPT [ blackboard_E start_POSTSUBSCRIPT italic_G start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT | italic_L start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT [ italic_p ( caligraphic_D | bold_italic_ψ , italic_G start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ⋅ italic_p ( bold_italic_ψ | italic_G start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ] ] end_ARG , (4)

On a high level, the query posterior p(Y|𝒟)𝑝conditional𝑌𝒟p(Y\,|\,\mathcal{D})italic_p ( italic_Y | caligraphic_D ) is estimated in Eq. (3) by (i) sampling several candidate SCMs from the learned generative model in a nested manner given 𝜽,𝝍𝜽𝝍\bm{\theta},\bm{\psi}bold_italic_θ , bold_italic_ψ—first sampling order L𝐿Litalic_L, then a graph G𝐺Gitalic_G (parent sets) conditional on L𝐿Litalic_L, and mechanisms 𝐟𝐟\mathbf{f}bold_f conditional on G𝐺Gitalic_G, (ii) sample queries given the candidate SCM from p(Y|)𝑝conditional𝑌p(Y\,|\,\mathcal{M})italic_p ( italic_Y | caligraphic_M ), and (iii) weighting the sampled queries with their corresponding importance weight wLsuperscript𝑤𝐿w^{L}italic_w start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT.

Learning Phase.

The formulation in Equation 3 transforms the inference problem over posterior SCMs in Equation 2 into the problem of inferring posterior parameters p(𝜽,𝝍|D)𝑝𝜽conditional𝝍𝐷p(\bm{\theta},\bm{\psi}\,|\,D)italic_p ( bold_italic_θ , bold_italic_ψ | italic_D ) of our proposed generative model in Figure 3 (cf. (Lorch et al., 2021; Toth et al., 2022)). By ensuring that our generative distributions p(L|𝜽)𝑝conditional𝐿𝜽p(L\,|\,\bm{\theta})italic_p ( italic_L | bold_italic_θ ) and p(𝐟,𝐔|𝝍)𝑝𝐟conditional𝐔𝝍p(\mathbf{f},\mathbf{U}\,|\,\bm{\psi})italic_p ( bold_f , bold_U | bold_italic_ψ ) are sufficiently expressive, it suffices to infer a maximum a posteriori (MAP) estimate of the posterior parameters 𝜽,𝝍𝜽𝝍\bm{\theta},\bm{\psi}bold_italic_θ , bold_italic_ψ via gradient-based optimisation of (see Section A.2 for a derivation)

log\displaystyle\nabla\log∇ roman_log p(𝜽,𝝍|𝒟)=logp(𝜽)𝑝𝜽conditional𝝍𝒟𝑝𝜽\displaystyle p(\bm{\theta},\bm{\psi}\,|\,\mathcal{D})=\nabla\log p(\bm{\theta})italic_p ( bold_italic_θ , bold_italic_ψ | caligraphic_D ) = ∇ roman_log italic_p ( bold_italic_θ ) (5)
+log𝔼L|𝜽[𝔼G|L[p(𝒟|𝝍,G)p(𝝍|G)]].subscript𝔼conditional𝐿𝜽delimited-[]subscript𝔼conditional𝐺𝐿delimited-[]𝑝conditional𝒟𝝍𝐺𝑝conditional𝝍𝐺\displaystyle+\nabla\log\mathbb{E}_{L\,|\,\bm{\theta}}\left[\mathbb{E}_{G\,|\,% L}\left[p(\mathcal{D}\,|\,\bm{\psi},G)\cdot p(\bm{\psi}\,|\,G)\right]\right].+ ∇ roman_log blackboard_E start_POSTSUBSCRIPT italic_L | bold_italic_θ end_POSTSUBSCRIPT [ blackboard_E start_POSTSUBSCRIPT italic_G | italic_L end_POSTSUBSCRIPT [ italic_p ( caligraphic_D | bold_italic_ψ , italic_G ) ⋅ italic_p ( bold_italic_ψ | italic_G ) ] ] .

A major issue for the training of such a model is the quality of the estimated gradients in the face of the high-dimensional problem space and the coupling of the parameters 𝜽,𝝍𝜽𝝍\bm{\theta},\bm{\psi}bold_italic_θ , bold_italic_ψ. However, simultaneously updating all model parameters in a single gradient step is prone to yield very noisy gradients:

  1. 1.

    The gradient w.r.t. 𝜽𝜽\bm{\theta}bold_italic_θ (causal order proposal) depends on 𝝍𝝍\bm{\psi}bold_italic_ψ (mechanism and noise parameters) through the quality of the estimated importance weights wLsuperscript𝑤𝐿w^{L}italic_w start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT in Equation 6. A bad estimate of 𝝍𝝍\bm{\psi}bold_italic_ψ will result in poor estimates of the importance weights and consequently the gradient w.r.t. 𝜽𝜽\bm{\theta}bold_italic_θ.

  2. 2.

    The gradient w.r.t. 𝝍𝝍\bm{\psi}bold_italic_ψ likewise depends on 𝜽𝜽\bm{\theta}bold_italic_θ via the quality of the sampled orders and their induced parent sets. In general, the more often a parent set occurs in the sampled orders relative to other parent sets, the stronger its gradient w.r.t. 𝝍𝝍\bm{\psi}bold_italic_ψ.333Arguably, this may be especially problematic when sharing parameters between mechanisms with different parent sets, as is the case, e.g., when modeling the mechanisms with a single, masked neural network.

To mitigate these issues, we propose a nested optimisation procedure as laid out in Algorithm 1 (lines 1-1). In an outer loop, we sample causal orders from the learned proposal, estimate the gradient w.r.t. 𝜽𝜽\bm{\theta}bold_italic_θ in Equation 5 and update the parameters 𝜽𝜽\bm{\theta}bold_italic_θ of the proposal distribution p(L|𝜽)𝑝conditional𝐿𝜽p(L\,|\,\bm{\theta})italic_p ( italic_L | bold_italic_θ ) (see Section 3.1). To estimate this gradient, we need to compute the importance weights (Equation 4) for the sampled orders, which however depend on the GP hyper-parameters 𝝍𝝍\bm{\psi}bold_italic_ψ via the marginal likelihood p(𝒟|𝝍,G)𝑝conditional𝒟𝝍𝐺p(\mathcal{D}\,|\,\bm{\psi},G)italic_p ( caligraphic_D | bold_italic_ψ , italic_G ). Therefore, before computing the importance weights, we optimise the GP hyper-parameters (see Section 3.3) for each yet unseen parent set compatible with some sampled order. This ensures truthful estimates of the importance weights for the sampled orders and, consequently, provides a good gradient estimate.

Algorithm 1 BCI with ARCO-GP

Input: (Observational) data 𝒟𝒟\mathcal{D}caligraphic_D.

Output: Posterior parameters 𝜽𝜽\bm{\theta}bold_italic_θ, 𝝍𝝍\bm{\psi}bold_italic_ψ. Estimated (posterior over the) causal query Y𝑌Yitalic_Y and importance weights 𝐰Lsuperscript𝐰𝐿\mathbf{w}^{L}bold_w start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT.

/* Learning Phase */

repeat

       sample causal orders 𝐋{Lmp(L|𝜽)}𝐋similar-tosubscript𝐿𝑚𝑝conditional𝐿𝜽\mathbf{L}\leftarrow\left\{L_{m}\sim p(L\,|\,\bm{\theta})\right\}bold_L ← { italic_L start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ∼ italic_p ( italic_L | bold_italic_θ ) }
        \vartrianglerightSection 3.1 𝐰L,𝝍superscript𝐰𝐿𝝍absent\mathbf{w}^{L},\bm{\psi}\leftarrowbold_w start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT , bold_italic_ψ ← ComputeIW(𝐋𝐋\mathbf{L}bold_L, 𝒟𝒟\mathcal{D}caligraphic_D)
        \vartrianglerightEquation 4 estimate the gradients for 𝜽𝜽\bm{\theta}bold_italic_θ and perform a gradient step
        \vartrianglerightEquation 6
2until stopping criterion fulfilled
/* Inference Phase */ sample causal orders 𝐋{Lmp(L|𝜽)}𝐋similar-tosubscript𝐿𝑚𝑝conditional𝐿𝜽\mathbf{L}\leftarrow\left\{L_{m}\sim p(L\,|\,\bm{\theta})\right\}bold_L ← { italic_L start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ∼ italic_p ( italic_L | bold_italic_θ ) }
  \vartrianglerightSection 3.1 𝐰L,𝝍superscript𝐰𝐿𝝍absent\mathbf{w}^{L},\bm{\psi}\leftarrowbold_w start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT , bold_italic_ψ ← ComputeIW(𝐋𝐋\mathbf{L}bold_L, 𝒟𝒟\mathcal{D}caligraphic_D)
  \vartrianglerightEquation 4 [optional] sample graphs 𝐆{Gkp(G|L,𝝍,𝒟)}𝐆similar-tosubscript𝐺𝑘𝑝conditional𝐺𝐿𝝍𝒟\mathbf{G}\leftarrow\left\{G_{k}\sim p(G\,|\,L,\bm{\psi},\mathcal{D})\right\}bold_G ← { italic_G start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∼ italic_p ( italic_G | italic_L , bold_italic_ψ , caligraphic_D ) }
  \vartrianglerightSection 3.2 [optional] sample mechanisms 𝐟{𝐟jp(𝐟|𝝍,𝒟)}𝐟similar-tosubscript𝐟𝑗𝑝conditional𝐟𝝍𝒟\mathbf{f}\leftarrow\left\{\mathbf{f}_{j}\sim p(\mathbf{f}\,|\,\bm{\psi},% \mathcal{D})\right\}bold_f ← { bold_f start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∼ italic_p ( bold_f | bold_italic_ψ , caligraphic_D ) }
  \vartrianglerightSection 3.3 sample candidate queries 𝐘{Yip(Y|𝐟,𝝍,G)}𝐘similar-tosubscript𝑌𝑖𝑝conditional𝑌𝐟𝝍𝐺\mathbf{Y}\leftarrow\left\{Y_{i}\sim p(Y\,|\,\mathbf{f},\bm{\psi},G)\right\}bold_Y ← { italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∼ italic_p ( italic_Y | bold_f , bold_italic_ψ , italic_G ) }
  \vartrianglerightSection 3.3 return 𝜽𝜽\bm{\theta}bold_italic_θ, 𝝍𝝍\bm{\psi}bold_italic_ψ, 𝐘𝐘\mathbf{Y}bold_Y, 𝐰Lsuperscript𝐰𝐿\mathbf{w}^{L}bold_w start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT
/* Subroutine: compute importance weights and update mechanisms. */ Subroutine ComputeIW(𝐋𝐋\mathbf{L}bold_L, 𝒟𝒟\mathcal{D}caligraphic_D)
3       foreach causal order Lm𝐋subscript𝐿𝑚𝐋L_{m}\in\mathbf{L}italic_L start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ∈ bold_L  do
4             foreach parent set paisubscriptpa𝑖\textbf{pa}_{i}pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT compatible with Lmsubscript𝐿𝑚L_{m}italic_L start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT do
                   Learn (or retrieve) the corresponding GP hyper-parameters 𝝍isubscript𝝍𝑖\bm{\psi}_{i}bold_italic_ψ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT
                    \vartrianglerightEquation 11
5             end foreach
6            Compute the importance weight wLmsuperscript𝑤subscript𝐿𝑚w^{L_{m}}italic_w start_POSTSUPERSCRIPT italic_L start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUPERSCRIPT
7       end foreach
8      return 𝐰L,𝝍superscript𝐰𝐿𝝍\mathbf{w}^{L},\bm{\psi}bold_w start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT , bold_italic_ψ
9

3.1 Causal Order Inference

While Equations 3 and 5 elegantly reduce BCI to learning a model over causal orders and GP hyper-parameters, computing the expectation w.r.t. causal orders in Equation 3 poses a hard combinatorial problem, as there are d!𝑑d!italic_d ! possible causal orders over d𝑑ditalic_d variables. The involved distribution over causal orders p(L|𝜽)𝑝conditional𝐿𝜽p(L\,|\,\bm{\theta})italic_p ( italic_L | bold_italic_θ ) appearing in the BCI estimator Equation 3 can be interpreted as proposal distribution, with importance weights as defined in Equation 4, where the optimal proposal is the true posterior over causal orders. Hence, we require an expressive representation over causal orders which can account for multi-modal distributions over orders. In particular, consider the example of a Markov equivalence class (MEC) including a chain graph. Since the chain is contained in the MEC, also the reverse chain graph must be in the MEC, and thus, the proposal over causal orders must be able to represent both orders with equal probability.

A simple parameterisation of orders as proposed in (Charpentier et al., 2022) is not able to represent the true posterior over causal orders in this case. Specifically, they sample orders using the Gumbel Top-k trick (Kool et al., 2019) by perturbing d𝑑ditalic_d logits (corresponding to the d𝑑ditalic_d variables) with Gumbel noise and sorting these perturbed logits, yielding an order over variables. In essence, learning such a model boils down to ordering (and spreading) the d𝑑ditalic_d logits on the real line. Now, to sample a chain graph and a reverse chain graph, some variable is the first element in the causal order in one case and must thus have the highest (perturbed) logit, and it is the last element in the causal order in the other case where it must have the lowest (perturbed) logit, which is contradictory. In practice, we observe that when trying to learn a multi-modal distribution over causal orders with this model, the logits cluster together, resulting approximately in a uniform distribution over causal orders.

Here, we propose an expressive, Auto-Regressive distribution (Larochelle & Murray, 2011) p(L|𝜽)=p(L1|𝜽)k=2dp(Lk|L<k,𝜽)𝑝conditional𝐿𝜽𝑝conditionalsubscript𝐿1𝜽superscriptsubscriptproduct𝑘2𝑑𝑝conditionalsubscript𝐿𝑘subscript𝐿absent𝑘𝜽p(L\,|\,\bm{\theta})=p(L_{1}\,|\,\bm{\theta})\cdot\prod_{k=2}^{d}p(L_{k}\,|\,L% _{<k},\bm{\theta})italic_p ( italic_L | bold_italic_θ ) = italic_p ( italic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT | bold_italic_θ ) ⋅ ∏ start_POSTSUBSCRIPT italic_k = 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT italic_p ( italic_L start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT | italic_L start_POSTSUBSCRIPT < italic_k end_POSTSUBSCRIPT , bold_italic_θ ) over Causal Orders (ARCO) that is amenable to gradient-based optimisation and can represent multi-modal distributions over orders, avoiding the shortcoming described above.

Sampling Causal Orders.

The ordering of variables naturally implies a sequential sampling procedure as listed in Algorithm 2. In each step of the sampling procedure, we sample the next variable in the order from a categorical distribution p(Lk|L<k,𝜽)𝑝conditionalsubscript𝐿𝑘subscript𝐿absent𝑘𝜽p(L_{k}\,|\,L_{<k},\bm{\theta})italic_p ( italic_L start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT | italic_L start_POSTSUBSCRIPT < italic_k end_POSTSUBSCRIPT , bold_italic_θ ) over the set of yet unassigned variables, conditional on all preceding variables in the order (Algorithm 2, line 2). To account for the dependence on the preceding order L<ksubscript𝐿absent𝑘L_{<k}italic_L start_POSTSUBSCRIPT < italic_k end_POSTSUBSCRIPT, we compute the logits of the categorical distribution using a differentiable function g𝜽:d×dd:subscript𝑔𝜽maps-tosuperscript𝑑𝑑superscript𝑑g_{\bm{\theta}}:\mathbb{R}^{d\times d}\mapsto\mathbb{R}^{d}italic_g start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT : blackboard_R start_POSTSUPERSCRIPT italic_d × italic_d end_POSTSUPERSCRIPT ↦ blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT (Algorithm 2, line 2) and re-normalise them to exclude the elements in L<ksubscript𝐿absent𝑘L_{<k}italic_L start_POSTSUBSCRIPT < italic_k end_POSTSUBSCRIPT (Algorithm 2, line 2). We implement g𝜽subscript𝑔𝜽g_{\bm{\theta}}italic_g start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT as feed-forward neural network, taking as input a suitable encoding of the so-far sampled order L<ksubscript𝐿absent𝑘L_{<k}italic_L start_POSTSUBSCRIPT < italic_k end_POSTSUBSCRIPT. To this end, we encode L<ksubscript𝐿absent𝑘L_{<k}italic_L start_POSTSUBSCRIPT < italic_k end_POSTSUBSCRIPT using its induced permutation matrix QL<ksuperscript𝑄subscript𝐿absent𝑘Q^{L_{<k}}italic_Q start_POSTSUPERSCRIPT italic_L start_POSTSUBSCRIPT < italic_k end_POSTSUBSCRIPT end_POSTSUPERSCRIPT (see Section 2) and mask the rows corresponding to elements L>=ksubscript𝐿absent𝑘L_{>=k}italic_L start_POSTSUBSCRIPT > = italic_k end_POSTSUBSCRIPT with zeros.

Training ARCO.

Training ARCO amounts to learning the parameters 𝜽𝜽\bm{\theta}bold_italic_θ of the neural network g𝜽subscript𝑔𝜽g_{\bm{\theta}}italic_g start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT by performing gradient ascent on Equation 5. To estimate the gradient in Equation 5 w.r.t. 𝜽𝜽\bm{\theta}bold_italic_θ we use the score-function estimator, yielding (see Section A.2)

𝜽logp(𝜽,𝝍|𝒟)subscript𝜽𝑝𝜽conditional𝝍𝒟\displaystyle\nabla_{\bm{\theta}}\log p(\bm{\theta},\bm{\psi}\,|\,\mathcal{D})∇ start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT roman_log italic_p ( bold_italic_θ , bold_italic_ψ | caligraphic_D ) =𝜽logp(𝜽)absentsubscript𝜽𝑝𝜽\displaystyle=\nabla_{\bm{\theta}}\log p(\bm{\theta})= ∇ start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT roman_log italic_p ( bold_italic_θ ) (6)
+𝔼L|𝜽[wL𝜽logp(L|𝜽)].subscript𝔼conditional𝐿𝜽delimited-[]superscript𝑤𝐿subscript𝜽𝑝conditional𝐿𝜽\displaystyle+\mathbb{E}_{L\,|\,\bm{\theta}}\left[w^{L}\cdot\nabla_{\bm{\theta% }}\log p(L\,|\,\bm{\theta})\right].+ blackboard_E start_POSTSUBSCRIPT italic_L | bold_italic_θ end_POSTSUBSCRIPT [ italic_w start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT ⋅ ∇ start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT roman_log italic_p ( italic_L | bold_italic_θ ) ] .

with wLsuperscript𝑤𝐿w^{L}italic_w start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT as defined in Equation 4 (see Section A.1 for the derivation). To evaluate logp(L|𝜽)𝑝conditional𝐿𝜽\log p(L\,|\,\bm{\theta})roman_log italic_p ( italic_L | bold_italic_θ ) for a given causal order L𝐿Litalic_L, we simply need to sum the log-probabilities of the categorical distributions logp(Lk|L<k,𝜽)𝑝conditionalsubscript𝐿𝑘subscript𝐿absent𝑘𝜽\log p(L_{k}\,|\,L_{<k},\bm{\theta})roman_log italic_p ( italic_L start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT | italic_L start_POSTSUBSCRIPT < italic_k end_POSTSUBSCRIPT , bold_italic_θ ) for the respective elements Lksubscript𝐿𝑘L_{k}italic_L start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT. The necessary log-probabilities (i.e., the normalised logits) are computed as described in Algorithm 2. Note that, although we need to compute the logits sequentially in the sampling procedure, we can compute them in parallel during evaluation.

Algorithm 2 Sample Causal Order with ARCO

Input: Logit function g𝜽:d×dd:subscript𝑔𝜽maps-tosuperscript𝑑𝑑superscript𝑑g_{\bm{\theta}}:\mathbb{R}^{d\times d}\mapsto\mathbb{R}^{d}italic_g start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT : blackboard_R start_POSTSUPERSCRIPT italic_d × italic_d end_POSTSUPERSCRIPT ↦ blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT

Output: Causal order L𝐿Litalic_L sampled from p(L|𝜽)𝑝conditional𝐿𝜽p(L\,|\,\bm{\theta})italic_p ( italic_L | bold_italic_θ )

𝐑𝐗𝐑𝐗\mathbf{R}\leftarrow\mathbf{X}bold_R ← bold_X

  \vartrianglerightset of unassigned elements L𝐿L\leftarrow\varnothingitalic_L ← ∅
10  \vartrianglerightcausal order for k=1d𝑘1𝑑k=1\dots ditalic_k = 1 … italic_d do
       ϕg𝜽(QL<k)bold-italic-ϕsubscript𝑔𝜽superscript𝑄subscript𝐿absent𝑘\bm{\phi}\leftarrow g_{\bm{\theta}}(Q^{L_{<k}})bold_italic_ϕ ← italic_g start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT ( italic_Q start_POSTSUPERSCRIPT italic_L start_POSTSUBSCRIPT < italic_k end_POSTSUBSCRIPT end_POSTSUPERSCRIPT )
        \vartrianglerightcompute logits ϕi{ϕilogj|Xj𝐑expϕjifXi𝐑otherwisesubscriptitalic-ϕ𝑖casessubscriptitalic-ϕ𝑖subscriptconditional𝑗subscript𝑋𝑗𝐑subscriptitalic-ϕ𝑗ifsubscript𝑋𝑖𝐑otherwiseotherwiseotherwise\phi_{i}\leftarrow\begin{cases}\phi_{i}-\log\sum_{j\,|\,X_{j}\in\mathbf{R}}% \exp\phi_{j}\quad\text{if}\quad X_{i}\in\mathbf{R}\\ -\infty\quad\text{otherwise}\quad\end{cases}italic_ϕ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ← { start_ROW start_CELL italic_ϕ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - roman_log ∑ start_POSTSUBSCRIPT italic_j | italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∈ bold_R end_POSTSUBSCRIPT roman_exp italic_ϕ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT if italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ bold_R end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL - ∞ otherwise end_CELL start_CELL end_CELL end_ROW
        \vartrianglerightnormalise logits lCATEGORICAL(ϕ)similar-to𝑙CATEGORICALbold-italic-ϕl\sim\text{CATEGORICAL}(\bm{\phi})italic_l ∼ CATEGORICAL ( bold_italic_ϕ )
        \vartrianglerightsample next element LLXl𝐿𝐿delimited-⟨⟩subscript𝑋𝑙L\leftarrow L\cup\langle X_{l}\rangleitalic_L ← italic_L ∪ ⟨ italic_X start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ⟩
        \vartrianglerightupdate causal order 𝐑𝐑{Xl}𝐑𝐑subscript𝑋𝑙\mathbf{R}\leftarrow\mathbf{R}\setminus\{X_{l}\}bold_R ← bold_R ∖ { italic_X start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT }
11 end for

3.2 Marginalising over Causal Graphs

The marginalisation w.r.t. causal graphs G𝐺Gitalic_G in Equations 3, 4 and 5 is in general intractable, as the number of DAGs consistent with any given (causal) order is 2d(d1)2superscript2𝑑𝑑122^{\frac{d\cdot(d-1)}{2}}2 start_POSTSUPERSCRIPT divide start_ARG italic_d ⋅ ( italic_d - 1 ) end_ARG start_ARG 2 end_ARG end_POSTSUPERSCRIPT. Although this is significantly smaller than the total number of DAGs with d𝑑ditalic_d nodes (which grows super-exponentially in d𝑑ditalic_d, see e.g. (OEIS Foundation Inc., 2024)), an exhaustive enumeration is still infeasible.

In this work, we tackle this problem by restricting the number of parents per variable. By restricting the maximum size of any admissible parent set to some integer K𝐾Kitalic_K, the number of distinct parent sets consistent with any causal order L𝐿Litalic_L is in 𝒪(dK)𝒪superscript𝑑𝐾\mathcal{O}(d^{K})caligraphic_O ( italic_d start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT ). Although the exhaustive enumeration of all DAGs with restricted parent set size is still infeasible, it turns out that, given a causal order, the expectation w.r.t. graphs can be tractably computed under certain assumptions: by choosing a prior over graphs p(G|L)=ip(Pai|L)𝑝conditional𝐺𝐿subscriptproduct𝑖𝑝conditionalsubscriptPa𝑖𝐿p(G\,|\,L)=\prod_{i}p(\textbf{Pa}_{i}\,|\,L)italic_p ( italic_G | italic_L ) = ∏ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_p ( Pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | italic_L ) that factorises over parent sets, we can compute quantities h(G)𝐺h(G)italic_h ( italic_G ) that decompose over the parent sets by the following two propositions.444In practice, we assume a uniform prior over parent sets consistent with a given causal order.

Proposition 3.1.

Let h(G)=ihi(paiG)𝐺subscriptproduct𝑖subscript𝑖superscriptsubscriptpa𝑖𝐺h(G)=\prod_{i}h_{i}(\textbf{pa}_{i}^{G})italic_h ( italic_G ) = ∏ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_G end_POSTSUPERSCRIPT ) and w(G)=iw(paiG)𝑤𝐺subscriptproduct𝑖𝑤superscriptsubscriptpa𝑖𝐺w(G)=\prod_{i}w(\textbf{pa}_{i}^{G})italic_w ( italic_G ) = ∏ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_w ( pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_G end_POSTSUPERSCRIPT ) be factorising over the parent sets, then we can compute

𝔼G|L[w(G)h(G)]=i=1dpaip(pai|L)wi(pai)hi(pai).subscript𝔼conditional𝐺𝐿delimited-[]𝑤𝐺𝐺superscriptsubscriptproduct𝑖1𝑑subscriptsubscriptpa𝑖𝑝conditionalsubscriptpa𝑖𝐿subscript𝑤𝑖subscriptpa𝑖subscript𝑖subscriptpa𝑖\displaystyle\mathbb{E}_{G\,|\,L}\left[w(G)h(G)\right]=\prod_{i=1}^{d}\sum_{% \textbf{pa}_{i}}p(\textbf{pa}_{i}\,|\,L)w_{i}(\textbf{pa}_{i})h_{i}(\textbf{pa% }_{i}).blackboard_E start_POSTSUBSCRIPT italic_G | italic_L end_POSTSUBSCRIPT [ italic_w ( italic_G ) italic_h ( italic_G ) ] = ∏ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_p ( pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | italic_L ) italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) . (7)
Proposition 3.2.

Let h(G)=ihi(paiG)𝐺subscript𝑖subscript𝑖superscriptsubscriptpa𝑖𝐺h(G)=\sum_{i}h_{i}(\textbf{pa}_{i}^{G})italic_h ( italic_G ) = ∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_G end_POSTSUPERSCRIPT ) be summing and w(G)=iw(paiG)𝑤𝐺subscriptproduct𝑖𝑤superscriptsubscriptpa𝑖𝐺w(G)=\prod_{i}w(\textbf{pa}_{i}^{G})italic_w ( italic_G ) = ∏ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_w ( pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_G end_POSTSUPERSCRIPT ) be factorising over the parent sets, then we can compute

𝔼G|Lsubscript𝔼conditional𝐺𝐿\displaystyle\mathbb{E}_{G\,|\,L}blackboard_E start_POSTSUBSCRIPT italic_G | italic_L end_POSTSUBSCRIPT [w(G)h(G)]=delimited-[]𝑤𝐺𝐺absent\displaystyle\left[w(G)h(G)\right]=[ italic_w ( italic_G ) italic_h ( italic_G ) ] = (8)
i=1d(jiαj(L))paip(pai|L)wi(pai)hi(pai),superscriptsubscript𝑖1𝑑subscriptproduct𝑗𝑖subscript𝛼𝑗𝐿subscriptsubscriptpa𝑖𝑝conditionalsubscriptpa𝑖𝐿subscript𝑤𝑖subscriptpa𝑖subscript𝑖subscriptpa𝑖\displaystyle\sum_{i=1}^{d}\left(\prod_{j\neq i}\alpha_{j}(L)\right)\cdot\sum_% {\textbf{pa}_{i}}p(\textbf{pa}_{i}\,|\,L)w_{i}(\textbf{pa}_{i})h_{i}(\textbf{% pa}_{i}),∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT ( ∏ start_POSTSUBSCRIPT italic_j ≠ italic_i end_POSTSUBSCRIPT italic_α start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( italic_L ) ) ⋅ ∑ start_POSTSUBSCRIPT pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_p ( pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | italic_L ) italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ,

where

αj(L)=paip(pai|L)wi(pai).subscript𝛼𝑗𝐿subscriptsubscriptpa𝑖𝑝conditionalsubscriptpa𝑖𝐿subscript𝑤𝑖subscriptpa𝑖\displaystyle\alpha_{j}(L)=\sum_{\textbf{pa}_{i}}p(\textbf{pa}_{i}\,|\,L)w_{i}% (\textbf{pa}_{i}).italic_α start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( italic_L ) = ∑ start_POSTSUBSCRIPT pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_p ( pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | italic_L ) italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) .

These propositions generalise the results presented by Koller & Friedman (2003); Koivisto & Sood (2004) on how to compute the posterior probabilities of edges or parent sets, which is a special case of Proposition 3.1, to our setting of Bayesian causal inference (proofs in Section A.3).

Employing Proposition 3.1, we can compute 𝔼G|L[p(𝒟|𝝍,G)p(𝝍|G)]subscript𝔼conditional𝐺𝐿delimited-[]𝑝conditional𝒟𝝍𝐺𝑝conditional𝝍𝐺\mathbb{E}_{G\,|\,L}\left[p(\mathcal{D}\,|\,\bm{\psi},G)\cdot p(\bm{\psi}\,|\,% G)\right]blackboard_E start_POSTSUBSCRIPT italic_G | italic_L end_POSTSUBSCRIPT [ italic_p ( caligraphic_D | bold_italic_ψ , italic_G ) ⋅ italic_p ( bold_italic_ψ | italic_G ) ], and consequently

p(G|L,𝝍,𝒟)=p(𝒟|𝝍,G)p(𝝍|G)p(G|L)𝔼G|L[p(𝒟|𝝍,G)p(𝝍|G)],𝑝conditional𝐺𝐿𝝍𝒟𝑝conditional𝒟𝝍𝐺𝑝conditional𝝍𝐺𝑝conditional𝐺𝐿subscript𝔼conditional𝐺𝐿delimited-[]𝑝conditional𝒟𝝍𝐺𝑝conditional𝝍𝐺\displaystyle p(G\,|\,L,\bm{\psi},\mathcal{D})=\frac{p(\mathcal{D}\,|\,\bm{% \psi},G)\cdot p(\bm{\psi}\,|\,G)\cdot p(G\,|\,L)}{\mathbb{E}_{G\,|\,L}\left[p(% \mathcal{D}\,|\,\bm{\psi},G)\cdot p(\bm{\psi}\,|\,G)\right]},italic_p ( italic_G | italic_L , bold_italic_ψ , caligraphic_D ) = divide start_ARG italic_p ( caligraphic_D | bold_italic_ψ , italic_G ) ⋅ italic_p ( bold_italic_ψ | italic_G ) ⋅ italic_p ( italic_G | italic_L ) end_ARG start_ARG blackboard_E start_POSTSUBSCRIPT italic_G | italic_L end_POSTSUBSCRIPT [ italic_p ( caligraphic_D | bold_italic_ψ , italic_G ) ⋅ italic_p ( bold_italic_ψ | italic_G ) ] end_ARG , (9)

in closed-form by letting h(G)=p(𝒟|𝝍,G)p(𝝍|G)𝐺𝑝conditional𝒟𝝍𝐺𝑝conditional𝝍𝐺h(G)=p(\mathcal{D}\,|\,\bm{\psi},G)\cdot p(\bm{\psi}\,|\,G)italic_h ( italic_G ) = italic_p ( caligraphic_D | bold_italic_ψ , italic_G ) ⋅ italic_p ( bold_italic_ψ | italic_G ) and w(G)=1𝑤𝐺1w(G)=1italic_w ( italic_G ) = 1. This allows us to tractably compute the respective expectations w.r.t. causal graphs in Equations 3, 4 and 5. Furthermore, for any causal query that does not decompose over parent sets, we can compute a Monte-Carlo estimate of the query by sampling DAGs from the true posterior p(G|L,𝝍,𝒟)𝑝conditional𝐺𝐿𝝍𝒟p(G\,|\,L,\bm{\psi},\mathcal{D})italic_p ( italic_G | italic_L , bold_italic_ψ , caligraphic_D ) in Equation 3, as Equation 9 factorised over parent sets.

3.3 Mechanism Inference and Marginalisation

To compute the importance weights in Equation 4 we need to compute the marginal log-likelihood p(𝒟|𝝍,G)𝑝conditional𝒟𝝍𝐺p(\mathcal{D}\,|\,\bm{\psi},G)italic_p ( caligraphic_D | bold_italic_ψ , italic_G ) which is intractable for general models. As we focus on non-linear additive noise models in this work, we follow (von Kügelgen et al., 2019; Toth et al., 2022) and model each mechanism via a distinct GPs, assuming homoscedastic Gaussian noise and causal sufficiency. Under these assumptions, we can compute the marginal likelihood p(𝒟i|𝝍i,G)=𝔼𝐟|𝝍i[p(𝒟i|𝐟i,𝝍i,pai)]𝑝conditionalsubscript𝒟𝑖subscript𝝍𝑖𝐺subscript𝔼conditional𝐟subscript𝝍𝑖delimited-[]𝑝conditionalsubscript𝒟𝑖subscript𝐟𝑖subscript𝝍𝑖subscriptpa𝑖p(\mathcal{D}_{i}\,|\,\bm{\psi}_{i},G)=\mathbb{E}_{\mathbf{f}\,|\,\bm{\psi}_{i% }}\left[p(\mathcal{D}_{i}\,|\,\mathbf{f}_{i},\bm{\psi}_{i},\textbf{pa}_{i})\right]italic_p ( caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | bold_italic_ψ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_G ) = blackboard_E start_POSTSUBSCRIPT bold_f | bold_italic_ψ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ italic_p ( caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | bold_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , bold_italic_ψ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ] in closed form. The same holds true for the predictive posterior p(𝒟itest|𝝍i,G,𝒟)𝑝conditionalsuperscriptsubscript𝒟𝑖testsubscript𝝍𝑖𝐺𝒟p(\mathcal{D}_{i}^{\text{test}}\,|\,\bm{\psi}_{i},G,\mathcal{D})italic_p ( caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT test end_POSTSUPERSCRIPT | bold_italic_ψ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_G , caligraphic_D ) possibly needed in Equation 3. We factorise the GP hyper-parameter prior555The parameter set 𝝍isubscript𝝍𝑖\bm{\psi}_{i}bold_italic_ψ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT contains the GP’s kernel parameters and the noise variance. Note that, other than the structural dependence on the parent sets, the prior parameters 𝝍𝝍\bm{\psi}bold_italic_ψ are independent of 𝜽𝜽\bm{\theta}bold_italic_θ. and the likelihood over parent sets for each node i𝑖iitalic_i, i.e.,

p(𝒟|𝝍,G)p(𝝍|G)=ip(𝒟i|𝝍i,G)p(𝝍i|PaiG),𝑝conditional𝒟𝝍𝐺𝑝conditional𝝍𝐺subscriptproduct𝑖𝑝conditionalsubscript𝒟𝑖subscript𝝍𝑖𝐺𝑝conditionalsubscript𝝍𝑖superscriptsubscriptPa𝑖𝐺\displaystyle p(\mathcal{D}\,|\,\bm{\psi},G)\cdot p(\bm{\psi}\,|\,G)=\prod_{i}% p(\mathcal{D}_{i}\,|\,\bm{\psi}_{i},G)\cdot p(\bm{\psi}_{i}\,|\,\textbf{Pa}_{i% }^{G}),italic_p ( caligraphic_D | bold_italic_ψ , italic_G ) ⋅ italic_p ( bold_italic_ψ | italic_G ) = ∏ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_p ( caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | bold_italic_ψ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_G ) ⋅ italic_p ( bold_italic_ψ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | Pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_G end_POSTSUPERSCRIPT ) , (10)

allowing for efficient computation by caching intermediate results. For the individual GP we infer a MAP-Type II estimate of its hyper-parameters by performing gradient-ascent on666Importantly, in Section A.2 we show that this not an ad-hoc choice, but a consequence of optimising Equation 5.

𝝍ilogsubscriptsubscript𝝍𝑖\displaystyle\nabla_{\bm{\psi}_{i}}\log∇ start_POSTSUBSCRIPT bold_italic_ψ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT roman_log p(𝝍i|𝒟,G)=𝑝conditionalsubscript𝝍𝑖𝒟𝐺absent\displaystyle p(\bm{\psi}_{i}\,|\,\mathcal{D},G)=italic_p ( bold_italic_ψ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | caligraphic_D , italic_G ) = (11)
𝝍ilogp(𝒟i|𝝍i,G)+𝝍ilogp(𝝍i|G).subscriptsubscript𝝍𝑖𝑝conditionalsubscript𝒟𝑖subscript𝝍𝑖𝐺subscriptsubscript𝝍𝑖𝑝conditionalsubscript𝝍𝑖𝐺\displaystyle\nabla_{\bm{\psi}_{i}}\log p(\mathcal{D}_{i}\,|\,\bm{\psi}_{i},G)% +\nabla_{\bm{\psi}_{i}}\log p(\bm{\psi}_{i}\,|\,G).∇ start_POSTSUBSCRIPT bold_italic_ψ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT roman_log italic_p ( caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | bold_italic_ψ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_G ) + ∇ start_POSTSUBSCRIPT bold_italic_ψ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT roman_log italic_p ( bold_italic_ψ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | italic_G ) .

For root nodes, i.e., nodes without parents, we place a conjugate normal-inverse-gamma prior on the mean and variance of that node, which also allows for closed-form inference.

4 Related Work

BCI in the simpler form of Bayesian structure inference dates back to (Heckerman, 1995; Heckerman et al., 1997; Madigan et al., 1995), often using MCMC techniques, where many works utilise (causal) orders, e.g., (Koller & Friedman, 2003; Koivisto & Sood, 2004; Teyssier & Koller, 2012; Ellis & Wong, 2008; Niinimäki et al., 2016; Kuipers & Moffa, 2017; Viinikka et al., 2020), allowing a better exploration of the posterior space. Moreover, restricting the maximum number of parents per node allows an exhaustive enumeration of all parent sets in polynomial time for a given order. However, MCMC inference comes with its own set of challenges, and none of these works implement non-linear mechanism models. In contrast, our work focuses on non-linear additive noise models and utilises gradient-based learning of a generative model over causal orders. Besides sampling based inference, orders can also facilitate exact optimization schemes, e.g. (Cussens, 2010; De Campos & Ji, 2011; Peharz & Pernkopf, 2012).

After the advent of gradient-based (Bayesian) DAG structure learning methods (Zheng et al., 2018; Yu et al., 2019; Brouillard et al., 2020; Lachapelle et al., 2020; Lorch et al., 2021), inference via orders recently gained interest in the gradient-based causal structure learning community as a vehicle to sample DAGs without the need of utilising soft acyclicity constraints during optimisation (Cundy et al., 2021; Charpentier et al., 2022; Annadani et al., 2023). In contrast to these works, we stress the importance of an expressive gradient-based model for causal orders and we utilise causal orders to marginalise our Bayesian causal inference procedure by exhaustive parent set enumeration.

Finally, most existing BCI approaches are restricted to linear Gaussian models (Geiger & Heckerman, 1994; Viinikka et al., 2020; Pensar et al., 2020; Horii, 2021) or binary variables (Moffa et al., 2017; Kuipers et al., 2019). Only recently, works on a Bayesian treatment of entire non-linear SCMs (i.e., including mechanisms and exogenous noise) have been proposed by Toth et al. (2022); Tigas et al. (2022); Giudice et al. (2023, 2024). Both Toth et al. (2022) and Tigas et al. (2022) operate in an active learning scenario using DIBS (Lorch et al., 2021) for inferring posterior causal graphs, whereas Toth et al. (2022) focus on BCI while using GPs for mechanism inference. Similarly, Giudice et al. (2023) use GPs for the Bayesian inference of SCMs in combiniation with MCMC inference for causal graphs. In a concurrent preprint, Giudice et al. (2024) follow (Toth et al., 2022) in using GPs for BCI but use MCMC for sampling posterior causal graphs. None of these approaches features gradient-based inference of causal orders, nor utilises causal orders for structure marginalisation.

5 Experiments

Task I: Causal Structure Learning. For our results in Figure 1, we sample a fixed set of 200200200200 training samples from the observational distribution of non-linear additive noise ground truth SCMs with Erdös-Rényi (ER) (Erdös & Rényi, 1959) and scale-free (SF) (Barabási & Albert, 1999) graph structures. The small sample size emulates the setting of significant uncertainty relevant to the Bayesian inference scenario. The reported expected structural Hamming distance (ESHD) is a standard metric used structure learning metric. Additionally, to asses the inferred structures in terms of causal implications, we report the ancestor adjustment identification distance (A-AID) recently proposed by Henckel et al. (2024) (see Appendix D for details).

We compare our inference model (ARCO-GP) with maximum parent set size K=2𝐾2K=2italic_K = 2 to a diverse set of nine different structure learning methods: BAYESDAG ((Annadani et al., 2023)), DAG-GNN ((Yu et al., 2019)), DDS ((Charpentier et al., 2022)), DIBS-GP ((Toth et al., 2022; Lorch et al., 2021)), GADGET ((Viinikka et al., 2020)), GES ((Chickering, 2003)), GOLEM ((Ng et al., 2020)), GRASP ((Lam et al., 2022)) and PC ((Spirtes et al., 2000)), that we briefly describe in Appendix D.

Summarising Figure 1, when our assumption about the maximum parent set size is reflected by the ground truth models (as holds in our simulated SF graphs), ARCO-GP clearly outperforms the baselines in terms of ESHD and A-AID. Moreover, even when the ground truth models exhibit larger parent sets and our restriction is violated (as is the case in our simulated ER graphs), ARCO-GP still outperforms the baselines, albeit by a smaller margin. Additional experiments with varying number of variables d{11,20,50}𝑑112050d\in\left\{11,20,50\right\}italic_d ∈ { 11 , 20 , 50 }, as well as experiments on real-world data from Sachs et al. (2005) and ablations are provided in Appendix D. For these experiments we report an extended set of eight additional metrics.

Task II: Inferring Interventional Distributions and Average Causal Effects. We illustrate ARCO-GP’s causal reasoning capabilities by visualising a selected set of estimated posterior interventional distributions p(Xi|do(𝐖=𝐰),𝒟)𝑝conditionalsubscript𝑋𝑖𝑑𝑜𝐖𝐰𝒟p(X_{i}\,|\,do(\mathbf{W}=\mathbf{w}),\mathcal{D})italic_p ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | italic_d italic_o ( bold_W = bold_w ) , caligraphic_D ) in Figure 2. In order to have access to the corresponding ground truth distributions, we simulate a non-linear additive noise model on the consensus protein interaction graph reported by Sachs et al. (2005) (see Figure 4). We generate samples from the interventional distributions with the procedure laid out in Algorithms 1 and 3 and smoothing the empirical distribution with a kernel density estimate (see Appendix B for details). As can be seen, the posterior distribution inferred by our method closely resemble the true ones. Importantly, the multi-modality of the (inferred) distributions illustrate the benefits of a Bayesian approach to causal inference, as we can represent uncertainty via full distributions instead of single point estimates. In Appendix D we further compare our ability to infer pairwise average causal effects to two baselines, namely DIBS-GP (Toth et al., 2022; Lorch et al., 2021) and GADET+BEEPS (Viinikka et al., 2020).

6 Discussion

We demonstrated that our proposed ARCO-GP method for BCI, leveraging structural marginalisation, yields superior structure learning performance on non-linear additive noise models against a set of nine state-of-the-art baseline methods. Moreover, we illustrate the capability of ARCO-GP to accurately infer posterior interventional distributions and average causal effects. The capabilities our method relies upon the following assumptions and limitations.

Assumptions. Our assumptions on the data generating process include causal sufficiency and additive Gaussian noise, as well as a limited maximum parent set size. Presumably, for very dense graphs the performance of ARCO-GP will deteriorate in comparison to other methods. While our framework and implementation can handle interventional data, we do not evaluate this scenario experimentally because not all baselines support the use of interventional data, and the observational case is the more difficult problem from the perspective of model identifiability.

Scalability and Computation. The main driver of complexity is the exact inference using GPs, which grows with N3superscript𝑁3N^{3}italic_N start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT in the number of available data points. Although we used only CPUs for running our experiments, scaleable GPU inference techniques for GPs were proposed, e.g., by Gardner et al. (2018); Pleiss et al. (2018). Additionally, the training of the GPs could be straightforwardly parallelised. Conceptually, our framework is flexible and modular, allowing to use alternative mechanism models like normalising flows as in (Brouillard et al., 2020; Pawlowski et al., 2020). A second driver of computational complexity is the exhaustive enumeration of parent sets, which may be prohibitive on larger problem instances and bigger parent sets. Note, however, that the individual parent set contributions necessary to compute the importance weights in Equation 4 could be pre-computed in parallel.

Acknowledgements

The financial support by the Austrian Federal Ministry of Labour and Economy, the National Foundation for Research, Technology and Development and the Christian Doppler Research Association is gratefully acknowledged. The computational results presented have in part been achieved using the Vienna Scientific Cluster (VSC).

References

  • Annadani et al. (2023) Annadani, Y., Pawlowski, N., Jennings, J., Bauer, S., Zhang, C., and Gong, W. BayesDAG: Gradient-Based Posterior Inference for Causal Discovery. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  • Ansel et al. (2024) Ansel, J., Yang, E., He, H., Gimelshein, N., Jain, A., Voznesensky, M., Bao, B., Bell, P., Berard, D., Burovski, E., Chauhan, G., Chourdia, A., Constable, W., Desmaison, A., DeVito, Z., Ellison, E., Feng, W., Gong, J., Gschwind, M., Hirsh, B., Huang, S., Kalambarkar, K., Kirsch, L., Lazos, M., Lezcano, M., Liang, Y., Liang, J., Lu, Y., Luk, C. K., Maher, B., Pan, Y., Puhrsch, C., Reso, M., Saroufim, M., Siraichi, M. Y., Suk, H., Zhang, S., Suo, M., Tillet, P., Zhao, X., Wang, E., Zhou, K., Zou, R., Wang, X., Mathews, A., Wen, W., Chanan, G., Wu, P., and Chintala, S. PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, New York, NY, USA, 2024. ACM. ISBN 9798400703850. doi: 10.1145/3620665.3640366.
  • Barabási & Albert (1999) Barabási, A.-L. and Albert, R. Emergence of scaling in random networks. Science, 286, 1999.
  • Bareinboim et al. (2022) Bareinboim, E., Correa, J. D., Ibeling, D., and Icard, T. On Pearl’s Hierarchy and the Foundations of Causal Inference. In Probabilistic and Causal Inference. ACM, 2022.
  • Brouillard et al. (2020) Brouillard, P., Lachapelle, S., Lacoste, A., Lacoste-Julien, S., and Drouin, A. Differentiable Causal Discovery from Interventional Data. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H. (eds.), Advances in Neural Information Processing Systems. Curran Associates, Inc., 2020.
  • Charpentier et al. (2022) Charpentier, B., Kibler, S., and Günnemann, S. Differentiable {DAG} Sampling. In International Conference on Learning Representations, 2022.
  • Chickering (2003) Chickering, D. M. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3, 2003. ISSN 1532-4435.
  • Cundy et al. (2021) Cundy, C., Grover, A., and Ermon, S. BCD Nets: Scalable Variational Approaches for Bayesian Causal Discovery. In Thirty-Fifth Conference on Neural Information Processing Systems, 2021.
  • Cussens (2010) Cussens, J. Maximum likelihood pedigree reconstruction using integer programming. In WCB@ ICLP, pp.  8–19, 2010.
  • De Campos & Ji (2011) De Campos, C. P. and Ji, Q. Efficient structure learning of bayesian networks using constraints. The Journal of Machine Learning Research, 12:663–689, 2011.
  • Ellis & Wong (2008) Ellis, B. and Wong, W. H. Learning Causal Bayesian Network Structures from Experimental Data. Journal of the American Statistical Association, 2008.
  • Erdös & Rényi (1959) Erdös, P. and Rényi, A. On random graphs i. Publicationes Mathematicae Debrecen, 6:290, 1959.
  • Gardner et al. (2018) Gardner, J., Pleiss, G., Weinberger, K. Q., Bindel, D., and Wilson, A. G. Gpytorch: Blackbox matrix-matrix gaussian process inference with gpu acceleration. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems. Curran Associates, Inc., 2018.
  • Geiger & Heckerman (1994) Geiger, D. and Heckerman, D. Learning gaussian networks. In Uncertainty in Artificial Intelligence, pp.  235–243. Elsevier, 1994.
  • Giudice et al. (2023) Giudice, E., Kuipers, J., and Moffa, G. A Bayesian Take on Gaussian Process Networks. In Advances in Neural Information Processing Systems, 2023.
  • Giudice et al. (2024) Giudice, E., Kuipers, J., and Moffa, G. Bayesian causal inference with gaussian process networks. arXiv preprint arXiv:2402.00623, 2024.
  • Heckerman (1995) Heckerman, D. A bayesian approach to learning causal networks. In Proceedings of Eleventh Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann, 1995.
  • Heckerman et al. (1997) Heckerman, D., Meek, C., and Cooper, G. A Bayesian Approach to Causal Discovery. Computation, Causation, and Discovery, 1997.
  • Henckel et al. (2024) Henckel, L., Würtzen, T., and Weichwald, S. Adjustment Identification Distance: A gadjid for Causal Structure Learning. arXiv:2402.08616, 2024.
  • Hinton (2012) Hinton, G. Neural networks for machine learning - lecture 6a - overview of mini-batch gradient descent, 2012. https://www.cs.toronto.edu/ tijmen/csc321/slides/lecture_slides_lec6.pdf.
  • Horii (2021) Horii, S. Bayesian Model Averaging for Causality Estimation and its Approximation based on Gaussian Scale Mixture Distributions. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, 2021.
  • Kingma & Ba (2015) Kingma, D. P. and Ba, J. Adam: A Method for Stochastic Optimization. In And, Y. B. and LeCun, Y. (eds.), 3rd International Conference on Learning Representations, Conference Track Proceedings, San Diego, CA, USA, 2015.
  • Koivisto & Sood (2004) Koivisto, M. and Sood, K. Exact Bayesian structure discovery in Bayesian networks. Journal of Machine Learning Research, 2004.
  • Koller & Friedman (2003) Koller, D. and Friedman, N. Being Bayesian about network structure. A Bayesian approach to structure discovery in Bayesian networks. Machine Learning, 2003.
  • Kool et al. (2019) Kool, W., Van Hoof, H., and Welling, M. Stochastic Beams and Where To Find Them: The {G}umbel-Top-k Trick for Sampling Sequences Without Replacement. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97. PMLR, 2019. URL https://proceedings.mlr.press/v97/kool19a.html.
  • Kuipers & Moffa (2017) Kuipers, J. and Moffa, G. Partition MCMC for Inference on Acyclic Digraphs. Journal of the American Statistical Association, 2017.
  • Kuipers et al. (2019) Kuipers, J., Moffa, G., Kuipers, E., Freeman, D., and Bebbington, P. Links between psychotic and neurotic symptoms in the general population: an analysis of longitudinal british national survey data using directed acyclic graphs. Psychological Medicine, 49(3):388–395, 2019.
  • Lachapelle et al. (2020) Lachapelle, S., Brouillard, P., Deleu, T., and Lacoste-Julien, S. Gradient-Based Neural DAG Learning. In International Conference on Learning Representations, 2020.
  • Lam et al. (2022) Lam, W.-Y., Andrews, B., and Ramsey, J. Greedy relaxations of the sparsest permutation algorithm. In Cussens, J. and Zhang, K. (eds.), Proceedings of the Thirty-Eighth Conference on Uncertainty in Artificial Intelligence, Proceedings of Machine Learning Research. PMLR, 2022.
  • Larochelle & Murray (2011) Larochelle, H. and Murray, I. The neural autoregressive distribution estimator. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp.  29–37. JMLR Workshop and Conference Proceedings, 2011.
  • Lorch et al. (2021) Lorch, L., Rothfuss, J., Schölkopf, B., and Krause, A. DiBS: Differentiable Bayesian Structure Learning. Advances in Neural Information Processing Systems, 2021.
  • Madigan et al. (1995) Madigan, D., York, J., and Allard, D. Bayesian Graphical Models for Discrete Data. International Statistical Review / Revue Internationale de Statistique, 63, 1995. ISSN 03067734. doi: 10.2307/1403615.
  • Moffa et al. (2017) Moffa, G., Catone, G., Kuipers, J., Kuipers, E., Freeman, D., Marwaha, S., Lennox, B. R., Broome, M. R., and Bebbington, P. Using directed acyclic graphs in epidemiological research in psychosis: an analysis of the role of bullying in psychosis. Schizophrenia bulletin, 43(6):1273–1279, 2017.
  • Murphy (2001) Murphy, K. P. Active learning of causal Bayes net structure. Technical report, Department of Computer Science, U.C. Berkeley, 2001.
  • Murphy (2007) Murphy, K. P. Conjugate Bayesian Analysis of the Gaussian Distribution. Technical report, University of British Columbia, 2007.
  • Murphy (2023) Murphy, K. P. Probabilistic Machine Learning: Advanced Topics. MIT Press, 2023.
  • Ng et al. (2020) Ng, I., Ghassami, A., and Zhang, K. On the Role of Sparsity and DAG Constraints for Learning Linear DAGs. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33. Curran Associates, Inc., 2020.
  • Niinimäki et al. (2016) Niinimäki, T., Parviainen, P., and Koivisto, M. Structure Discovery in Bayesian Networks by Sampling Partial Orders. Journal of Machine Learning Research, 2016.
  • OEIS Foundation Inc. (2024) OEIS Foundation Inc. Number of acyclic digraphs (or dags) with n labeled nodes, 2024. Entry A003024 in The On-Line Encyclopedia of Integer Sequences, https://oeis.org/A003024.
  • Pawlowski et al. (2020) Pawlowski, N., Castro, D. C., and Glocker, B. Deep Structural Causal Models for Tractable Counterfactual Inference. In Advances in Neural Information Processing Systems, 2020.
  • Pearl (2009) Pearl, J. Causality. Cambridge University Press, 2009. ISBN 9780511803161.
  • Pedregosa et al. (2011) Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  • Peharz & Pernkopf (2012) Peharz, R. and Pernkopf, F. Exact maximum margin structure learning of bayesian networks. arXiv preprint arXiv:1206.6431, 2012.
  • Pensar et al. (2020) Pensar, J., Talvitie, T., Hyttinen, A., and Koivisto, M. A bayesian approach for estimating causal effects from observational data. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pp.  5395–5402, 2020.
  • Peters & Bühlmann (2015) Peters, J. and Bühlmann, P. Structural Intervention Distance for Evaluating Causal Graphs. Neural Computation, 27, 2015. ISSN 0899-7667. doi: 10.1162/NECO_a_00708.
  • Pleiss et al. (2018) Pleiss, G., Gardner, J., Weinberger, K., and Wilson, A. G. Constant-time predictive distributions for Gaussian processes. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research. PMLR, 2018.
  • Reisach et al. (2021) Reisach, A. G., Seiler, C., and Weichwald, S. Beware of the Simulated DAG! Causal Discovery Benchmarks May Be Easy To Game. In Vaughan, M. R., Beygelzimer, A., Dauphin, Y., Liang, P., and Wortman, J. (eds.), Advances in Neural Information Processing Systems. Curran Associates, Inc., 2021.
  • Robert & Roberts (2021) Robert, C. P. and Roberts, G. O. Rao-blackwellization in the mcmc era. arXiv preprint arXiv:2101.01011, 2021.
  • Sachs et al. (2005) Sachs, K., Perez, O., Pe’er, D., Lauffenburger, D. A., and Nolan, G. P. Causal Protein-Signaling Networks Derived from Multiparameter Single-Cell Data. Science, 308, 2005. doi: 10.1126/science.1105809.
  • Spirtes et al. (2000) Spirtes, P., Glymour, C., and Scheines, R. Causation, Prediction, and Search. The MIT Press, second edition, 2000. ISBN 9780262527927.
  • Teyssier & Koller (2012) Teyssier, M. and Koller, D. Ordering-Based Search: A Simple and Effective Algorithm for Learning Bayesian Networks. Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence, UAI 2005, 2012.
  • Tigas et al. (2022) Tigas, P., Annadani, Y., Jesson, A., Schölkopf, B., Gal, Y., and Bauer, S. Interventions, Where and How? Experimental Design for Causal Models at Scale. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.), Advances in Neural Information Processing Systems. Curran Associates, Inc., 2022.
  • Tong & Koller (2001) Tong, S. and Koller, D. Active learning for structure in Bayesian networks. In Proceedings of the 17th International Joint Conference on Artificial Intelligence, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc. ISBN 1558608125.
  • Toth et al. (2022) Toth, C., Lorch, L., Knoll, C., Krause, A., Pernkopf, F., Peharz, R., and von Kügelgen, J. Active Bayesian Causal Inference. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.), Advances in Neural Information Processing Systems. Curran Associates, Inc., 2022.
  • Viinikka et al. (2020) Viinikka, J., Hyttinen, A., Pensar, J., and Koivisto, M. Towards Scalable Bayesian Learning of Causal DAGs. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H. (eds.), Advances in Neural Information Processing Systems. Curran Associates, Inc., 2020.
  • von Kügelgen et al. (2019) von Kügelgen, J., Rubenstein, P. K., Schölkopf, B., and Weller, A. Optimal experimental design via Bayesian optimization: active causal structure learning for Gaussian process networks. arXiv:1910.03962, 2019.
  • Yu et al. (2019) Yu, Y., Chen, J., Gao, T., and Yu, M. DAG-GNN: DAG Structure Learning with Graph Neural Networks. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research. PMLR, 2019.
  • Zhang et al. (2021) Zhang, K., Zhu, S., Kalander, M., Ng, I., Ye, J., Chen, Z., and Pan, L. gcastle: A python toolbox for causal discovery. arXiv:2111.15155, 2021.
  • Zheng et al. (2018) Zheng, X., Aragam, B., Ravikumar, P. K., and Xing, E. P. DAGs with NO TEARS: Continuous Optimization for Structure Learning. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31. Curran Associates, Inc., 2018.
  • Zheng et al. (2024) Zheng, Y., Huang, B., Chen, W., Ramsey, J., Gong, M., Cai, R., Shimizu, S., Spirtes, P., and Zhang, K. Causal-learn: Causal discovery in python. Journal of Machine Learning Research, 25:1–8, 2024.

Appendix A Proofs and Derivations

A.1 Derivation of the Posterior Expectation w.r.t. SCMs

In the following, we derive the expectation w.r.t. SCMs in Equation 3 and the corresponding importance weights in Equation 4.

𝔼|𝒟[p(Y|)]subscript𝔼conditional𝒟delimited-[]𝑝conditional𝑌\displaystyle\mathbb{E}_{\mathcal{M}\,|\,\mathcal{D}}\left[p(Y\,|\,\mathcal{M}% )\right]blackboard_E start_POSTSUBSCRIPT caligraphic_M | caligraphic_D end_POSTSUBSCRIPT [ italic_p ( italic_Y | caligraphic_M ) ] =𝔼G,𝐟,𝝍|𝒟[p(Y|)]absentsubscript𝔼𝐺𝐟conditional𝝍𝒟delimited-[]𝑝conditional𝑌\displaystyle=\mathbb{E}_{G,\mathbf{f},\bm{\psi}\,|\,\mathcal{D}}\left[p(Y\,|% \,\mathcal{M})\right]= blackboard_E start_POSTSUBSCRIPT italic_G , bold_f , bold_italic_ψ | caligraphic_D end_POSTSUBSCRIPT [ italic_p ( italic_Y | caligraphic_M ) ]
=𝔼𝜽,𝝍|𝒟[𝔼G,𝐟|𝜽,𝝍,𝒟[p(Y|)]]absentsubscript𝔼𝜽conditional𝝍𝒟delimited-[]subscript𝔼𝐺conditional𝐟𝜽𝝍𝒟delimited-[]𝑝conditional𝑌\displaystyle=\mathbb{E}_{\bm{\theta},\bm{\psi}\,|\,\mathcal{D}}\left[\mathbb{% E}_{G,\mathbf{f}\,|\,\bm{\theta},\bm{\psi},\mathcal{D}}\left[p(Y\,|\,\mathcal{% M})\right]\right]= blackboard_E start_POSTSUBSCRIPT bold_italic_θ , bold_italic_ψ | caligraphic_D end_POSTSUBSCRIPT [ blackboard_E start_POSTSUBSCRIPT italic_G , bold_f | bold_italic_θ , bold_italic_ψ , caligraphic_D end_POSTSUBSCRIPT [ italic_p ( italic_Y | caligraphic_M ) ] ]
=𝔼𝜽,𝝍|𝒟[𝔼G|𝜽,𝝍,𝒟[𝔼𝐟|𝝍,𝒟[p(Y|)]]]absentsubscript𝔼𝜽conditional𝝍𝒟delimited-[]subscript𝔼conditional𝐺𝜽𝝍𝒟delimited-[]subscript𝔼conditional𝐟𝝍𝒟delimited-[]𝑝conditional𝑌\displaystyle=\mathbb{E}_{\bm{\theta},\bm{\psi}\,|\,\mathcal{D}}\left[\mathbb{% E}_{G\,|\,\bm{\theta},\bm{\psi},\mathcal{D}}\left[\mathbb{E}_{\mathbf{f}\,|\,% \bm{\psi},\mathcal{D}}\left[p(Y\,|\,\mathcal{M})\right]\right]\right]= blackboard_E start_POSTSUBSCRIPT bold_italic_θ , bold_italic_ψ | caligraphic_D end_POSTSUBSCRIPT [ blackboard_E start_POSTSUBSCRIPT italic_G | bold_italic_θ , bold_italic_ψ , caligraphic_D end_POSTSUBSCRIPT [ blackboard_E start_POSTSUBSCRIPT bold_f | bold_italic_ψ , caligraphic_D end_POSTSUBSCRIPT [ italic_p ( italic_Y | caligraphic_M ) ] ] ]
=𝔼𝜽,𝝍|𝒟[𝔼L|𝜽,𝝍,𝒟[𝔼G|L,𝝍,𝒟[𝔼𝐟|𝝍,𝒟[p(Y|)]]]]absentsubscript𝔼𝜽conditional𝝍𝒟delimited-[]subscript𝔼conditional𝐿𝜽𝝍𝒟delimited-[]subscript𝔼conditional𝐺𝐿𝝍𝒟delimited-[]subscript𝔼conditional𝐟𝝍𝒟delimited-[]𝑝conditional𝑌\displaystyle=\mathbb{E}_{\bm{\theta},\bm{\psi}\,|\,\mathcal{D}}\left[\mathbb{% E}_{L\,|\,\bm{\theta},\bm{\psi},\mathcal{D}}\left[\mathbb{E}_{G\,|\,L,\bm{\psi% },\mathcal{D}}\left[\mathbb{E}_{\mathbf{f}\,|\,\bm{\psi},\mathcal{D}}\left[p(Y% \,|\,\mathcal{M})\right]\right]\right]\right]= blackboard_E start_POSTSUBSCRIPT bold_italic_θ , bold_italic_ψ | caligraphic_D end_POSTSUBSCRIPT [ blackboard_E start_POSTSUBSCRIPT italic_L | bold_italic_θ , bold_italic_ψ , caligraphic_D end_POSTSUBSCRIPT [ blackboard_E start_POSTSUBSCRIPT italic_G | italic_L , bold_italic_ψ , caligraphic_D end_POSTSUBSCRIPT [ blackboard_E start_POSTSUBSCRIPT bold_f | bold_italic_ψ , caligraphic_D end_POSTSUBSCRIPT [ italic_p ( italic_Y | caligraphic_M ) ] ] ] ]
=𝔼𝜽,𝝍|𝒟[𝔼L|𝜽[p(𝒟,𝝍|L)p(𝒟,𝝍|𝜽)𝔼G|L,𝝍,𝒟[𝔼𝐟|𝝍,𝒟[p(Y|)]]]]absentsubscript𝔼𝜽conditional𝝍𝒟delimited-[]subscript𝔼conditional𝐿𝜽delimited-[]𝑝𝒟conditional𝝍𝐿𝑝𝒟conditional𝝍𝜽subscript𝔼conditional𝐺𝐿𝝍𝒟delimited-[]subscript𝔼conditional𝐟𝝍𝒟delimited-[]𝑝conditional𝑌\displaystyle=\mathbb{E}_{\bm{\theta},\bm{\psi}\,|\,\mathcal{D}}\left[\mathbb{% E}_{L\,|\,\bm{\theta}}\left[\frac{p(\mathcal{D},\bm{\psi}\,|\,L)}{p(\mathcal{D% },\bm{\psi}\,|\,\bm{\theta})}\cdot\mathbb{E}_{G\,|\,L,\bm{\psi},\mathcal{D}}% \left[\mathbb{E}_{\mathbf{f}\,|\,\bm{\psi},\mathcal{D}}\left[p(Y\,|\,\mathcal{% M})\right]\right]\right]\right]= blackboard_E start_POSTSUBSCRIPT bold_italic_θ , bold_italic_ψ | caligraphic_D end_POSTSUBSCRIPT [ blackboard_E start_POSTSUBSCRIPT italic_L | bold_italic_θ end_POSTSUBSCRIPT [ divide start_ARG italic_p ( caligraphic_D , bold_italic_ψ | italic_L ) end_ARG start_ARG italic_p ( caligraphic_D , bold_italic_ψ | bold_italic_θ ) end_ARG ⋅ blackboard_E start_POSTSUBSCRIPT italic_G | italic_L , bold_italic_ψ , caligraphic_D end_POSTSUBSCRIPT [ blackboard_E start_POSTSUBSCRIPT bold_f | bold_italic_ψ , caligraphic_D end_POSTSUBSCRIPT [ italic_p ( italic_Y | caligraphic_M ) ] ] ] ]
=𝔼𝜽,𝝍|𝒟[𝔼L|𝜽[wL𝔼G|L,𝝍,𝒟[𝔼𝐟|𝝍,𝒟[p(Y|)]]]]absentsubscript𝔼𝜽conditional𝝍𝒟delimited-[]subscript𝔼conditional𝐿𝜽delimited-[]superscript𝑤𝐿subscript𝔼conditional𝐺𝐿𝝍𝒟delimited-[]subscript𝔼conditional𝐟𝝍𝒟delimited-[]𝑝conditional𝑌\displaystyle=\mathbb{E}_{\bm{\theta},\bm{\psi}\,|\,\mathcal{D}}\left[\mathbb{% E}_{L\,|\,\bm{\theta}}\left[w^{L}\cdot\mathbb{E}_{G\,|\,L,\bm{\psi},\mathcal{D% }}\left[\mathbb{E}_{\mathbf{f}\,|\,\bm{\psi},\mathcal{D}}\left[p(Y\,|\,% \mathcal{M})\right]\right]\right]\right]= blackboard_E start_POSTSUBSCRIPT bold_italic_θ , bold_italic_ψ | caligraphic_D end_POSTSUBSCRIPT [ blackboard_E start_POSTSUBSCRIPT italic_L | bold_italic_θ end_POSTSUBSCRIPT [ italic_w start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT ⋅ blackboard_E start_POSTSUBSCRIPT italic_G | italic_L , bold_italic_ψ , caligraphic_D end_POSTSUBSCRIPT [ blackboard_E start_POSTSUBSCRIPT bold_f | bold_italic_ψ , caligraphic_D end_POSTSUBSCRIPT [ italic_p ( italic_Y | caligraphic_M ) ] ] ] ]

with

wLsuperscript𝑤𝐿\displaystyle w^{L}italic_w start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT :=p(𝒟,𝝍|L)p(𝒟,𝝍|𝜽)assignabsent𝑝𝒟conditional𝝍𝐿𝑝𝒟conditional𝝍𝜽\displaystyle:=\frac{p(\mathcal{D},\bm{\psi}\,|\,L)}{p(\mathcal{D},\bm{\psi}\,% |\,\bm{\theta})}:= divide start_ARG italic_p ( caligraphic_D , bold_italic_ψ | italic_L ) end_ARG start_ARG italic_p ( caligraphic_D , bold_italic_ψ | bold_italic_θ ) end_ARG
=𝔼G|L[p(𝒟,𝝍|G)]p(𝒟,𝝍|𝜽)absentsubscript𝔼conditional𝐺𝐿delimited-[]𝑝𝒟conditional𝝍𝐺𝑝𝒟conditional𝝍𝜽\displaystyle=\frac{\mathbb{E}_{G\,|\,L}\left[p(\mathcal{D},\bm{\psi}\,|\,G)% \right]}{p(\mathcal{D},\bm{\psi}\,|\,\bm{\theta})}= divide start_ARG blackboard_E start_POSTSUBSCRIPT italic_G | italic_L end_POSTSUBSCRIPT [ italic_p ( caligraphic_D , bold_italic_ψ | italic_G ) ] end_ARG start_ARG italic_p ( caligraphic_D , bold_italic_ψ | bold_italic_θ ) end_ARG
=𝔼G|L[p(𝒟|𝝍,G)p(𝝍|G)]p(𝒟,𝝍|𝜽)absentsubscript𝔼conditional𝐺𝐿delimited-[]𝑝conditional𝒟𝝍𝐺𝑝conditional𝝍𝐺𝑝𝒟conditional𝝍𝜽\displaystyle=\frac{\mathbb{E}_{G\,|\,L}\left[p(\mathcal{D}\,|\,\bm{\psi},G)% \cdot p(\bm{\psi}\,|\,G)\right]}{p(\mathcal{D},\bm{\psi}\,|\,\bm{\theta})}= divide start_ARG blackboard_E start_POSTSUBSCRIPT italic_G | italic_L end_POSTSUBSCRIPT [ italic_p ( caligraphic_D | bold_italic_ψ , italic_G ) ⋅ italic_p ( bold_italic_ψ | italic_G ) ] end_ARG start_ARG italic_p ( caligraphic_D , bold_italic_ψ | bold_italic_θ ) end_ARG
=𝔼G|L[p(𝒟|𝝍,G)p(𝝍|G)]𝔼L|𝜽[p(𝒟,𝝍|L)]absentsubscript𝔼conditional𝐺𝐿delimited-[]𝑝conditional𝒟𝝍𝐺𝑝conditional𝝍𝐺subscript𝔼conditionalsuperscript𝐿𝜽delimited-[]𝑝𝒟conditional𝝍superscript𝐿\displaystyle=\frac{\mathbb{E}_{G\,|\,L}\left[p(\mathcal{D}\,|\,\bm{\psi},G)% \cdot p(\bm{\psi}\,|\,G)\right]}{\mathbb{E}_{L^{\prime}\,|\,\bm{\theta}}\left[% p(\mathcal{D},\bm{\psi}\,|\,L^{\prime})\right]}= divide start_ARG blackboard_E start_POSTSUBSCRIPT italic_G | italic_L end_POSTSUBSCRIPT [ italic_p ( caligraphic_D | bold_italic_ψ , italic_G ) ⋅ italic_p ( bold_italic_ψ | italic_G ) ] end_ARG start_ARG blackboard_E start_POSTSUBSCRIPT italic_L start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT | bold_italic_θ end_POSTSUBSCRIPT [ italic_p ( caligraphic_D , bold_italic_ψ | italic_L start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ] end_ARG
=𝔼G|L[p(𝒟|𝝍,G)p(𝝍|G)]𝔼L|𝜽[𝔼G|L[p(𝒟|𝝍,G)p(𝝍|G)]].absentsubscript𝔼conditional𝐺𝐿delimited-[]𝑝conditional𝒟𝝍𝐺𝑝conditional𝝍𝐺subscript𝔼conditionalsuperscript𝐿𝜽delimited-[]subscript𝔼conditionalsuperscript𝐺superscript𝐿delimited-[]𝑝conditional𝒟𝝍superscript𝐺𝑝conditional𝝍superscript𝐺\displaystyle=\frac{\mathbb{E}_{G\,|\,L}\left[p(\mathcal{D}\,|\,\bm{\psi},G)% \cdot p(\bm{\psi}\,|\,G)\right]}{\mathbb{E}_{L^{\prime}\,|\,\bm{\theta}}\left[% \mathbb{E}_{G^{\prime}\,|\,L^{\prime}}\left[p(\mathcal{D}\,|\,\bm{\psi},G^{% \prime})\cdot p(\bm{\psi}\,|\,G^{\prime})\right]\right]}.= divide start_ARG blackboard_E start_POSTSUBSCRIPT italic_G | italic_L end_POSTSUBSCRIPT [ italic_p ( caligraphic_D | bold_italic_ψ , italic_G ) ⋅ italic_p ( bold_italic_ψ | italic_G ) ] end_ARG start_ARG blackboard_E start_POSTSUBSCRIPT italic_L start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT | bold_italic_θ end_POSTSUBSCRIPT [ blackboard_E start_POSTSUBSCRIPT italic_G start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT | italic_L start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT [ italic_p ( caligraphic_D | bold_italic_ψ , italic_G start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ⋅ italic_p ( bold_italic_ψ | italic_G start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ] ] end_ARG .

A.2 Derivation of the Gradient Estimators

In the following, we derive the gradient estimators in Equations 5, 6 and 11. We denote by =𝜽,𝝍subscript𝜽𝝍\nabla=\nabla_{\bm{\theta},\bm{\psi}}∇ = ∇ start_POSTSUBSCRIPT bold_italic_θ , bold_italic_ψ end_POSTSUBSCRIPT to avoid clutter.

General Posterior Gradient. The general posterior gradient in Equation 5 reads as follows.

logp(𝜽,𝝍|𝒟)𝑝𝜽conditional𝝍𝒟\displaystyle\nabla\log p(\bm{\theta},\bm{\psi}\,|\,\mathcal{D})∇ roman_log italic_p ( bold_italic_θ , bold_italic_ψ | caligraphic_D ) =logp(𝒟,𝝍|𝜽)p(𝜽)p(𝒟)absent𝑝𝒟conditional𝝍𝜽𝑝𝜽𝑝𝒟\displaystyle=\nabla\log\frac{p(\mathcal{D},\bm{\psi}\,|\,\bm{\theta})\cdot p(% \bm{\theta})}{p(\mathcal{D})}= ∇ roman_log divide start_ARG italic_p ( caligraphic_D , bold_italic_ψ | bold_italic_θ ) ⋅ italic_p ( bold_italic_θ ) end_ARG start_ARG italic_p ( caligraphic_D ) end_ARG
=logp(𝜽)+logp(𝒟,𝝍|𝜽)absent𝑝𝜽𝑝𝒟conditional𝝍𝜽\displaystyle=\nabla\log p(\bm{\theta})+\nabla\log p(\mathcal{D},\bm{\psi}\,|% \,\bm{\theta})= ∇ roman_log italic_p ( bold_italic_θ ) + ∇ roman_log italic_p ( caligraphic_D , bold_italic_ψ | bold_italic_θ )
=logp(𝜽)+log𝔼L|𝜽[p(𝒟,𝝍|L)]absent𝑝𝜽subscript𝔼conditional𝐿𝜽delimited-[]𝑝𝒟conditional𝝍𝐿\displaystyle=\nabla\log p(\bm{\theta})+\nabla\log\mathbb{E}_{L\,|\,\bm{\theta% }}\left[p(\mathcal{D},\bm{\psi}\,|\,L)\right]= ∇ roman_log italic_p ( bold_italic_θ ) + ∇ roman_log blackboard_E start_POSTSUBSCRIPT italic_L | bold_italic_θ end_POSTSUBSCRIPT [ italic_p ( caligraphic_D , bold_italic_ψ | italic_L ) ]
=logp(𝜽)+log𝔼L|𝜽[𝔼G|L[p(𝒟,𝝍|G)]]absent𝑝𝜽subscript𝔼conditional𝐿𝜽delimited-[]subscript𝔼conditional𝐺𝐿delimited-[]𝑝𝒟conditional𝝍𝐺\displaystyle=\nabla\log p(\bm{\theta})+\nabla\log\mathbb{E}_{L\,|\,\bm{\theta% }}\left[\mathbb{E}_{G\,|\,L}\left[p(\mathcal{D},\bm{\psi}\,|\,G)\right]\right]= ∇ roman_log italic_p ( bold_italic_θ ) + ∇ roman_log blackboard_E start_POSTSUBSCRIPT italic_L | bold_italic_θ end_POSTSUBSCRIPT [ blackboard_E start_POSTSUBSCRIPT italic_G | italic_L end_POSTSUBSCRIPT [ italic_p ( caligraphic_D , bold_italic_ψ | italic_G ) ] ]
=logp(𝜽)+log𝔼L|𝜽[𝔼G|L[p(𝒟|𝝍,G)p(𝝍|G)]]absent𝑝𝜽subscript𝔼conditional𝐿𝜽delimited-[]subscript𝔼conditional𝐺𝐿delimited-[]𝑝conditional𝒟𝝍𝐺𝑝conditional𝝍𝐺\displaystyle=\nabla\log p(\bm{\theta})+\nabla\log\mathbb{E}_{L\,|\,\bm{\theta% }}\left[\mathbb{E}_{G\,|\,L}\left[p(\mathcal{D}\,|\,\bm{\psi},G)\cdot p(\bm{% \psi}\,|\,G)\right]\right]= ∇ roman_log italic_p ( bold_italic_θ ) + ∇ roman_log blackboard_E start_POSTSUBSCRIPT italic_L | bold_italic_θ end_POSTSUBSCRIPT [ blackboard_E start_POSTSUBSCRIPT italic_G | italic_L end_POSTSUBSCRIPT [ italic_p ( caligraphic_D | bold_italic_ψ , italic_G ) ⋅ italic_p ( bold_italic_ψ | italic_G ) ] ]

ARCO Gradient. Using the above as starting point for the gradient in Equation 6, we get

𝜽logp(𝜽,𝝍|𝒟)subscript𝜽𝑝𝜽conditional𝝍𝒟\displaystyle\nabla_{\bm{\theta}}\log p(\bm{\theta},\bm{\psi}\,|\,\mathcal{D})∇ start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT roman_log italic_p ( bold_italic_θ , bold_italic_ψ | caligraphic_D ) =𝜽logp(𝜽)+𝜽log𝔼L|𝜽[𝔼G|L[p(𝒟|𝝍,G)p(𝝍|G)]]absentsubscript𝜽𝑝𝜽subscript𝜽subscript𝔼conditional𝐿𝜽delimited-[]subscript𝔼conditional𝐺𝐿delimited-[]𝑝conditional𝒟𝝍𝐺𝑝conditional𝝍𝐺\displaystyle=\nabla_{\bm{\theta}}\log p(\bm{\theta})+\nabla_{\bm{\theta}}\log% \mathbb{E}_{L\,|\,\bm{\theta}}\left[\mathbb{E}_{G\,|\,L}\left[p(\mathcal{D}\,|% \,\bm{\psi},G)\cdot p(\bm{\psi}\,|\,G)\right]\right]= ∇ start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT roman_log italic_p ( bold_italic_θ ) + ∇ start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT roman_log blackboard_E start_POSTSUBSCRIPT italic_L | bold_italic_θ end_POSTSUBSCRIPT [ blackboard_E start_POSTSUBSCRIPT italic_G | italic_L end_POSTSUBSCRIPT [ italic_p ( caligraphic_D | bold_italic_ψ , italic_G ) ⋅ italic_p ( bold_italic_ψ | italic_G ) ] ]
=𝜽logp(𝜽)+𝜽𝔼L|𝜽[𝔼G|L[p(𝒟|𝝍,G)p(𝝍|G)]]𝔼L|𝜽[𝔼G|L[p(𝒟|𝝍,G)p(𝝍|G)]]absentsubscript𝜽𝑝𝜽subscript𝜽subscript𝔼conditional𝐿𝜽delimited-[]subscript𝔼conditional𝐺𝐿delimited-[]𝑝conditional𝒟𝝍𝐺𝑝conditional𝝍𝐺subscript𝔼conditionalsuperscript𝐿𝜽delimited-[]subscript𝔼conditionalsuperscript𝐺superscript𝐿delimited-[]𝑝conditional𝒟𝝍superscript𝐺𝑝conditional𝝍superscript𝐺\displaystyle=\nabla_{\bm{\theta}}\log p(\bm{\theta})+\frac{\nabla_{\bm{\theta% }}\mathbb{E}_{L\,|\,\bm{\theta}}\left[\mathbb{E}_{G\,|\,L}\left[p(\mathcal{D}% \,|\,\bm{\psi},G)\cdot p(\bm{\psi}\,|\,G)\right]\right]}{\mathbb{E}_{L^{\prime% }\,|\,\bm{\theta}}\left[\mathbb{E}_{G^{\prime}\,|\,L^{\prime}}\left[p(\mathcal% {D}\,|\,\bm{\psi},G^{\prime})\cdot p(\bm{\psi}\,|\,G^{\prime})\right]\right]}= ∇ start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT roman_log italic_p ( bold_italic_θ ) + divide start_ARG ∇ start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT blackboard_E start_POSTSUBSCRIPT italic_L | bold_italic_θ end_POSTSUBSCRIPT [ blackboard_E start_POSTSUBSCRIPT italic_G | italic_L end_POSTSUBSCRIPT [ italic_p ( caligraphic_D | bold_italic_ψ , italic_G ) ⋅ italic_p ( bold_italic_ψ | italic_G ) ] ] end_ARG start_ARG blackboard_E start_POSTSUBSCRIPT italic_L start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT | bold_italic_θ end_POSTSUBSCRIPT [ blackboard_E start_POSTSUBSCRIPT italic_G start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT | italic_L start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT [ italic_p ( caligraphic_D | bold_italic_ψ , italic_G start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ⋅ italic_p ( bold_italic_ψ | italic_G start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ] ] end_ARG
and using 𝜽p(L|𝜽)=p(L|𝜽)𝜽logp(L|𝜽)subscript𝜽𝑝conditional𝐿𝜽𝑝conditional𝐿𝜽subscript𝜽𝑝conditional𝐿𝜽\nabla_{\bm{\theta}}p(L\,|\,\bm{\theta})=p(L\,|\,\bm{\theta})\cdot\nabla_{\bm{% \theta}}\log p(L\,|\,\bm{\theta})∇ start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT italic_p ( italic_L | bold_italic_θ ) = italic_p ( italic_L | bold_italic_θ ) ⋅ ∇ start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT roman_log italic_p ( italic_L | bold_italic_θ )
=𝜽logp(𝜽)+𝔼L|𝜽[𝔼G|L[p(𝒟|𝝍,G)p(𝝍|G)]𝜽logp(L|𝜽)]𝔼L|𝜽[𝔼G|L[p(𝒟|𝝍,G)p(𝝍|G)]]absentsubscript𝜽𝑝𝜽subscript𝔼conditional𝐿𝜽delimited-[]subscript𝔼conditional𝐺𝐿delimited-[]𝑝conditional𝒟𝝍𝐺𝑝conditional𝝍𝐺subscript𝜽𝑝conditional𝐿𝜽subscript𝔼conditionalsuperscript𝐿𝜽delimited-[]subscript𝔼conditionalsuperscript𝐺superscript𝐿delimited-[]𝑝conditional𝒟𝝍superscript𝐺𝑝conditional𝝍superscript𝐺\displaystyle=\nabla_{\bm{\theta}}\log p(\bm{\theta})+\frac{\mathbb{E}_{L\,|\,% \bm{\theta}}\left[\mathbb{E}_{G\,|\,L}\left[p(\mathcal{D}\,|\,\bm{\psi},G)% \cdot p(\bm{\psi}\,|\,G)\right]\cdot\nabla_{\bm{\theta}}\log p(L\,|\,\bm{% \theta})\right]}{\mathbb{E}_{L^{\prime}\,|\,\bm{\theta}}\left[\mathbb{E}_{G^{% \prime}\,|\,L^{\prime}}\left[p(\mathcal{D}\,|\,\bm{\psi},G^{\prime})\cdot p(% \bm{\psi}\,|\,G^{\prime})\right]\right]}= ∇ start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT roman_log italic_p ( bold_italic_θ ) + divide start_ARG blackboard_E start_POSTSUBSCRIPT italic_L | bold_italic_θ end_POSTSUBSCRIPT [ blackboard_E start_POSTSUBSCRIPT italic_G | italic_L end_POSTSUBSCRIPT [ italic_p ( caligraphic_D | bold_italic_ψ , italic_G ) ⋅ italic_p ( bold_italic_ψ | italic_G ) ] ⋅ ∇ start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT roman_log italic_p ( italic_L | bold_italic_θ ) ] end_ARG start_ARG blackboard_E start_POSTSUBSCRIPT italic_L start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT | bold_italic_θ end_POSTSUBSCRIPT [ blackboard_E start_POSTSUBSCRIPT italic_G start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT | italic_L start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT [ italic_p ( caligraphic_D | bold_italic_ψ , italic_G start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ⋅ italic_p ( bold_italic_ψ | italic_G start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ] ] end_ARG
=𝜽logp(𝜽)+𝔼L|𝜽[wL𝜽logp(L|𝜽)]absentsubscript𝜽𝑝𝜽subscript𝔼conditional𝐿𝜽delimited-[]superscript𝑤𝐿subscript𝜽𝑝conditional𝐿𝜽\displaystyle=\nabla_{\bm{\theta}}\log p(\bm{\theta})+\mathbb{E}_{L\,|\,\bm{% \theta}}\left[w^{L}\cdot\nabla_{\bm{\theta}}\log p(L\,|\,\bm{\theta})\right]= ∇ start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT roman_log italic_p ( bold_italic_θ ) + blackboard_E start_POSTSUBSCRIPT italic_L | bold_italic_θ end_POSTSUBSCRIPT [ italic_w start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT ⋅ ∇ start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT roman_log italic_p ( italic_L | bold_italic_θ ) ]

with wLsuperscript𝑤𝐿w^{L}italic_w start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT as defined in Equation 4.

GP Hyper-parameter Gradient. For the gradient of a distinct GP modeling the mechanism from parents PaksubscriptPa𝑘\textbf{Pa}_{k}Pa start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT to target node Xksubscript𝑋𝑘X_{k}italic_X start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT with corresponding hyper-parameters 𝝍ksubscript𝝍𝑘\bm{\psi}_{k}bold_italic_ψ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT as in Equation 11 we have

𝝍klogp(𝜽,𝝍|𝒟)subscriptsubscript𝝍𝑘𝑝𝜽conditional𝝍𝒟\displaystyle\nabla_{\bm{\psi}_{k}}\log p(\bm{\theta},\bm{\psi}\,|\,\mathcal{D})∇ start_POSTSUBSCRIPT bold_italic_ψ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT roman_log italic_p ( bold_italic_θ , bold_italic_ψ | caligraphic_D ) =𝝍klogp(𝜽)+𝝍klog𝔼L|𝜽[𝔼G|L[p(𝒟,𝝍|G)]]absentsubscriptsubscript𝝍𝑘𝑝𝜽subscriptsubscript𝝍𝑘subscript𝔼conditional𝐿𝜽delimited-[]subscript𝔼conditional𝐺𝐿delimited-[]𝑝𝒟conditional𝝍𝐺\displaystyle=\nabla_{\bm{\psi}_{k}}\log p(\bm{\theta})+\nabla_{\bm{\psi}_{k}}% \log\mathbb{E}_{L\,|\,\bm{\theta}}\left[\mathbb{E}_{G\,|\,L}\left[p(\mathcal{D% },\bm{\psi}\,|\,G)\right]\right]= ∇ start_POSTSUBSCRIPT bold_italic_ψ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT roman_log italic_p ( bold_italic_θ ) + ∇ start_POSTSUBSCRIPT bold_italic_ψ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT roman_log blackboard_E start_POSTSUBSCRIPT italic_L | bold_italic_θ end_POSTSUBSCRIPT [ blackboard_E start_POSTSUBSCRIPT italic_G | italic_L end_POSTSUBSCRIPT [ italic_p ( caligraphic_D , bold_italic_ψ | italic_G ) ] ]
=𝔼L|𝜽[𝔼G|L[𝝍kp(𝒟,𝝍|G)]]𝔼L|𝜽[𝔼G|L[p(𝒟,𝝍|G)]]absentsubscript𝔼conditional𝐿𝜽delimited-[]subscript𝔼conditional𝐺𝐿delimited-[]subscriptsubscript𝝍𝑘𝑝𝒟conditional𝝍𝐺subscript𝔼conditional𝐿𝜽delimited-[]subscript𝔼conditional𝐺𝐿delimited-[]𝑝𝒟conditional𝝍𝐺\displaystyle=\frac{\mathbb{E}_{L\,|\,\bm{\theta}}\left[\mathbb{E}_{G\,|\,L}% \left[\nabla_{\bm{\psi}_{k}}p(\mathcal{D},\bm{\psi}\,|\,G)\right]\right]}{% \mathbb{E}_{L\,|\,\bm{\theta}}\left[\mathbb{E}_{G\,|\,L}\left[p(\mathcal{D},% \bm{\psi}\,|\,G)\right]\right]}= divide start_ARG blackboard_E start_POSTSUBSCRIPT italic_L | bold_italic_θ end_POSTSUBSCRIPT [ blackboard_E start_POSTSUBSCRIPT italic_G | italic_L end_POSTSUBSCRIPT [ ∇ start_POSTSUBSCRIPT bold_italic_ψ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_p ( caligraphic_D , bold_italic_ψ | italic_G ) ] ] end_ARG start_ARG blackboard_E start_POSTSUBSCRIPT italic_L | bold_italic_θ end_POSTSUBSCRIPT [ blackboard_E start_POSTSUBSCRIPT italic_G | italic_L end_POSTSUBSCRIPT [ italic_p ( caligraphic_D , bold_italic_ψ | italic_G ) ] ] end_ARG
=𝔼L|𝜽[𝔼G|L[𝝍kp(𝒟,𝝍|G)]]p(𝒟,𝝍|𝜽)absentsubscript𝔼conditional𝐿𝜽delimited-[]subscript𝔼conditional𝐺𝐿delimited-[]subscriptsubscript𝝍𝑘𝑝𝒟conditional𝝍𝐺𝑝𝒟conditional𝝍𝜽\displaystyle=\frac{\mathbb{E}_{L\,|\,\bm{\theta}}\left[\mathbb{E}_{G\,|\,L}% \left[\nabla_{\bm{\psi}_{k}}p(\mathcal{D},\bm{\psi}\,|\,G)\right]\right]}{p(% \mathcal{D},\bm{\psi}\,|\,\bm{\theta})}= divide start_ARG blackboard_E start_POSTSUBSCRIPT italic_L | bold_italic_θ end_POSTSUBSCRIPT [ blackboard_E start_POSTSUBSCRIPT italic_G | italic_L end_POSTSUBSCRIPT [ ∇ start_POSTSUBSCRIPT bold_italic_ψ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_p ( caligraphic_D , bold_italic_ψ | italic_G ) ] ] end_ARG start_ARG italic_p ( caligraphic_D , bold_italic_ψ | bold_italic_θ ) end_ARG
and as the marginal likelihood and the prior over GP hyper-parameters factorise over parent sets we further get
=𝔼L|𝜽[𝔼G|L[𝝍ki=1dp(𝒟i,𝝍i|PaiG)]]p(𝒟,𝝍|𝜽)absentsubscript𝔼conditional𝐿𝜽delimited-[]subscript𝔼conditional𝐺𝐿delimited-[]subscriptsubscript𝝍𝑘superscriptsubscriptproduct𝑖1𝑑𝑝subscript𝒟𝑖conditionalsubscript𝝍𝑖superscriptsubscriptPa𝑖𝐺𝑝𝒟conditional𝝍𝜽\displaystyle=\frac{\mathbb{E}_{L\,|\,\bm{\theta}}\left[\mathbb{E}_{G\,|\,L}% \left[\nabla_{\bm{\psi}_{k}}\prod_{i=1}^{d}p(\mathcal{D}_{i},\bm{\psi}_{i}\,|% \,\textbf{Pa}_{i}^{G})\right]\right]}{p(\mathcal{D},\bm{\psi}\,|\,\bm{\theta})}= divide start_ARG blackboard_E start_POSTSUBSCRIPT italic_L | bold_italic_θ end_POSTSUBSCRIPT [ blackboard_E start_POSTSUBSCRIPT italic_G | italic_L end_POSTSUBSCRIPT [ ∇ start_POSTSUBSCRIPT bold_italic_ψ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∏ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT italic_p ( caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , bold_italic_ψ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | Pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_G end_POSTSUPERSCRIPT ) ] ] end_ARG start_ARG italic_p ( caligraphic_D , bold_italic_ψ | bold_italic_θ ) end_ARG
=𝔼L|𝜽[Gp(G|L)𝝍ki=1dp(𝒟i,𝝍i|PaiG)]p(𝒟,𝝍|𝜽).absentsubscript𝔼conditional𝐿𝜽delimited-[]subscript𝐺𝑝conditional𝐺𝐿subscriptsubscript𝝍𝑘superscriptsubscriptproduct𝑖1𝑑𝑝subscript𝒟𝑖conditionalsubscript𝝍𝑖superscriptsubscriptPa𝑖𝐺𝑝𝒟conditional𝝍𝜽\displaystyle=\frac{\mathbb{E}_{L\,|\,\bm{\theta}}\left[\sum_{G}p(G\,|\,L)% \cdot\nabla_{\bm{\psi}_{k}}\prod_{i=1}^{d}p(\mathcal{D}_{i},\bm{\psi}_{i}\,|\,% \textbf{Pa}_{i}^{G})\right]}{p(\mathcal{D},\bm{\psi}\,|\,\bm{\theta})}.= divide start_ARG blackboard_E start_POSTSUBSCRIPT italic_L | bold_italic_θ end_POSTSUBSCRIPT [ ∑ start_POSTSUBSCRIPT italic_G end_POSTSUBSCRIPT italic_p ( italic_G | italic_L ) ⋅ ∇ start_POSTSUBSCRIPT bold_italic_ψ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∏ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT italic_p ( caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , bold_italic_ψ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | Pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_G end_POSTSUPERSCRIPT ) ] end_ARG start_ARG italic_p ( caligraphic_D , bold_italic_ψ | bold_italic_θ ) end_ARG .
Now, note that for the summation over graphs G𝐺Gitalic_G, the gradient is zero for all graphs that do not contain the parent set PaksubscriptPa𝑘\textbf{Pa}_{k}Pa start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT corresponding to the GP with hyper-parameters ψksubscript𝜓𝑘\psi_{k}italic_ψ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT. Consequently, we get
=𝔼L|𝜽[G|PakGp(G|L)𝝍ki=1dp(𝒟i,𝝍i|PaiG)]p(𝒟,𝝍|𝜽)absentsubscript𝔼conditional𝐿𝜽delimited-[]subscriptconditional𝐺subscriptPa𝑘𝐺𝑝conditional𝐺𝐿subscriptsubscript𝝍𝑘superscriptsubscriptproduct𝑖1𝑑𝑝subscript𝒟𝑖conditionalsubscript𝝍𝑖superscriptsubscriptPa𝑖𝐺𝑝𝒟conditional𝝍𝜽\displaystyle=\frac{\mathbb{E}_{L\,|\,\bm{\theta}}\left[\sum_{G\,|\,\textbf{Pa% }_{k}\in G}p(G\,|\,L)\cdot\nabla_{\bm{\psi}_{k}}\prod_{i=1}^{d}p(\mathcal{D}_{% i},\bm{\psi}_{i}\,|\,\textbf{Pa}_{i}^{G})\right]}{p(\mathcal{D},\bm{\psi}\,|\,% \bm{\theta})}= divide start_ARG blackboard_E start_POSTSUBSCRIPT italic_L | bold_italic_θ end_POSTSUBSCRIPT [ ∑ start_POSTSUBSCRIPT italic_G | Pa start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ italic_G end_POSTSUBSCRIPT italic_p ( italic_G | italic_L ) ⋅ ∇ start_POSTSUBSCRIPT bold_italic_ψ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∏ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT italic_p ( caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , bold_italic_ψ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | Pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_G end_POSTSUPERSCRIPT ) ] end_ARG start_ARG italic_p ( caligraphic_D , bold_italic_ψ | bold_italic_θ ) end_ARG
and since 𝝍kp(𝒟k,𝝍k|Pak)=p(𝒟k,𝝍k|Pak)𝝍klogp(𝒟k,𝝍k|Pak)subscriptsubscript𝝍𝑘𝑝subscript𝒟𝑘conditionalsubscript𝝍𝑘subscriptPa𝑘𝑝subscript𝒟𝑘conditionalsubscript𝝍𝑘subscriptPa𝑘subscriptsubscript𝝍𝑘𝑝subscript𝒟𝑘conditionalsubscript𝝍𝑘subscriptPa𝑘\nabla_{\bm{\psi}_{k}}p(\mathcal{D}_{k},\bm{\psi}_{k}\,|\,\textbf{Pa}_{k})=p(% \mathcal{D}_{k},\bm{\psi}_{k}\,|\,\textbf{Pa}_{k})\cdot\nabla_{\bm{\psi}_{k}}% \log p(\mathcal{D}_{k},\bm{\psi}_{k}\,|\,\textbf{Pa}_{k})∇ start_POSTSUBSCRIPT bold_italic_ψ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_p ( caligraphic_D start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_italic_ψ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT | Pa start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) = italic_p ( caligraphic_D start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_italic_ψ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT | Pa start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ⋅ ∇ start_POSTSUBSCRIPT bold_italic_ψ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT roman_log italic_p ( caligraphic_D start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_italic_ψ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT | Pa start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT )
=𝔼L|𝜽[G|PakGp(G|L)i=1dp(𝒟i,𝝍i|PaiG)𝝍klogp(𝒟k,𝝍k|Pak)]p(𝒟,𝝍|𝜽)absentsubscript𝔼conditional𝐿𝜽delimited-[]subscriptconditional𝐺subscriptPa𝑘𝐺𝑝conditional𝐺𝐿superscriptsubscriptproduct𝑖1𝑑𝑝subscript𝒟𝑖conditionalsubscript𝝍𝑖superscriptsubscriptPa𝑖𝐺subscriptsubscript𝝍𝑘𝑝subscript𝒟𝑘conditionalsubscript𝝍𝑘subscriptPa𝑘𝑝𝒟conditional𝝍𝜽\displaystyle=\frac{\mathbb{E}_{L\,|\,\bm{\theta}}\left[\sum_{G\,|\,\textbf{Pa% }_{k}\in G}p(G\,|\,L)\cdot\prod_{i=1}^{d}p(\mathcal{D}_{i},\bm{\psi}_{i}\,|\,% \textbf{Pa}_{i}^{G})\cdot\nabla_{\bm{\psi}_{k}}\log p(\mathcal{D}_{k},\bm{\psi% }_{k}\,|\,\textbf{Pa}_{k})\right]}{p(\mathcal{D},\bm{\psi}\,|\,\bm{\theta})}= divide start_ARG blackboard_E start_POSTSUBSCRIPT italic_L | bold_italic_θ end_POSTSUBSCRIPT [ ∑ start_POSTSUBSCRIPT italic_G | Pa start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ italic_G end_POSTSUBSCRIPT italic_p ( italic_G | italic_L ) ⋅ ∏ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT italic_p ( caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , bold_italic_ψ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | Pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_G end_POSTSUPERSCRIPT ) ⋅ ∇ start_POSTSUBSCRIPT bold_italic_ψ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT roman_log italic_p ( caligraphic_D start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_italic_ψ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT | Pa start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ] end_ARG start_ARG italic_p ( caligraphic_D , bold_italic_ψ | bold_italic_θ ) end_ARG
=𝔼L|𝜽[G|PakGp(G|L)p(𝒟,𝝍|G)]p(𝒟,𝝍|𝜽)𝝍klogp(𝒟k,𝝍k|Pak).absentsubscript𝔼conditional𝐿𝜽delimited-[]subscriptconditional𝐺subscriptPa𝑘𝐺𝑝conditional𝐺𝐿𝑝𝒟conditional𝝍𝐺𝑝𝒟conditional𝝍𝜽subscriptsubscript𝝍𝑘𝑝subscript𝒟𝑘conditionalsubscript𝝍𝑘subscriptPa𝑘\displaystyle=\frac{\mathbb{E}_{L\,|\,\bm{\theta}}\left[\sum_{G\,|\,\textbf{Pa% }_{k}\in G}p(G\,|\,L)\cdot p(\mathcal{D},\bm{\psi}\,|\,G)\right]}{p(\mathcal{D% },\bm{\psi}\,|\,\bm{\theta})}\cdot\nabla_{\bm{\psi}_{k}}\log p(\mathcal{D}_{k}% ,\bm{\psi}_{k}\,|\,\textbf{Pa}_{k}).= divide start_ARG blackboard_E start_POSTSUBSCRIPT italic_L | bold_italic_θ end_POSTSUBSCRIPT [ ∑ start_POSTSUBSCRIPT italic_G | Pa start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∈ italic_G end_POSTSUBSCRIPT italic_p ( italic_G | italic_L ) ⋅ italic_p ( caligraphic_D , bold_italic_ψ | italic_G ) ] end_ARG start_ARG italic_p ( caligraphic_D , bold_italic_ψ | bold_italic_θ ) end_ARG ⋅ ∇ start_POSTSUBSCRIPT bold_italic_ψ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT roman_log italic_p ( caligraphic_D start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_italic_ψ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT | Pa start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) .

Note that the term preceding the gradient 𝝍klogp(𝒟k,𝝍k|Pak)subscriptsubscript𝝍𝑘𝑝subscript𝒟𝑘conditionalsubscript𝝍𝑘subscriptPa𝑘\nabla_{\bm{\psi}_{k}}\log p(\mathcal{D}_{k},\bm{\psi}_{k}\,|\,\textbf{Pa}_{k})∇ start_POSTSUBSCRIPT bold_italic_ψ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT roman_log italic_p ( caligraphic_D start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_italic_ψ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT | Pa start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) is a scalar factor that does not influence the direction of the gradient and can thus be practically omitted for gradient-based optimisation, as optimisation algorithms will scale the gradient depending on tune-able step size parameters anyways. Therefore, optimising the GP hyper-parameters w.r.t. the gradient in Equation 5 practically yields the same gradient direction as the common MAP type II gradient 𝝍klogp(𝒟k,𝝍k|Pak)subscriptsubscript𝝍𝑘𝑝subscript𝒟𝑘conditionalsubscript𝝍𝑘subscriptPa𝑘\nabla_{\bm{\psi}_{k}}\log p(\mathcal{D}_{k},\bm{\psi}_{k}\,|\,\textbf{Pa}_{k})∇ start_POSTSUBSCRIPT bold_italic_ψ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT roman_log italic_p ( caligraphic_D start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_italic_ψ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT | Pa start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ). Arguably, for mechanism models that do not decompose over individual mechanisms for each parent set and target variable, naively estimating the gradient in Equation 5 may yield very noisy gradients, as the magnitude of the estimated gradient will depend on the sampled structures used for its estimation.

A.3 Proofs regarding Exhaustive Parent Set Enumeration

Proof of Proposition 3.1

Proof.
𝔼G|L[w(G)h(G)]subscript𝔼conditional𝐺𝐿delimited-[]𝑤𝐺𝐺\displaystyle\mathbb{E}_{G\,|\,L}\left[w(G)\cdot h(G)\right]blackboard_E start_POSTSUBSCRIPT italic_G | italic_L end_POSTSUBSCRIPT [ italic_w ( italic_G ) ⋅ italic_h ( italic_G ) ] =Gp(G|L)w(G)h(G)absentsubscript𝐺𝑝conditional𝐺𝐿𝑤𝐺𝐺\displaystyle=\sum_{G}p(G\,|\,L)\cdot w(G)\cdot h(G)= ∑ start_POSTSUBSCRIPT italic_G end_POSTSUBSCRIPT italic_p ( italic_G | italic_L ) ⋅ italic_w ( italic_G ) ⋅ italic_h ( italic_G )
Since we assume that w(G)𝑤𝐺w(G)italic_w ( italic_G ) and h(G)𝐺h(G)italic_h ( italic_G ) factorise over the parent sets, we have
=Gip(paiG|L)wi(paiG)hi(paiG)absentsubscript𝐺subscriptproduct𝑖𝑝conditionalsuperscriptsubscriptpa𝑖𝐺𝐿subscript𝑤𝑖superscriptsubscriptpa𝑖𝐺subscript𝑖superscriptsubscriptpa𝑖𝐺\displaystyle=\sum_{G}\prod_{i}p(\textbf{pa}_{i}^{G}\,|\,L)\cdot w_{i}(\textbf% {pa}_{i}^{G})\cdot h_{i}(\textbf{pa}_{i}^{G})= ∑ start_POSTSUBSCRIPT italic_G end_POSTSUBSCRIPT ∏ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_p ( pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_G end_POSTSUPERSCRIPT | italic_L ) ⋅ italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_G end_POSTSUPERSCRIPT ) ⋅ italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_G end_POSTSUPERSCRIPT )
The sum over all graphs can be represented as sum over all combinations of possible parent sets to get
=pa1pa2padip(pai|L)wi(pai)hi(pai)absentsubscriptsubscriptpa1subscriptsubscriptpa2subscriptsubscriptpa𝑑subscriptproduct𝑖𝑝conditionalsubscriptpa𝑖𝐿subscript𝑤𝑖subscriptpa𝑖subscript𝑖subscriptpa𝑖\displaystyle=\sum_{\textbf{pa}_{1}}\sum_{\textbf{pa}_{2}}\dots\sum_{\textbf{% pa}_{d}}\prod_{i}p(\textbf{pa}_{i}\,|\,L)\cdot w_{i}(\textbf{pa}_{i})\cdot h_{% i}(\textbf{pa}_{i})= ∑ start_POSTSUBSCRIPT pa start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT pa start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT … ∑ start_POSTSUBSCRIPT pa start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∏ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_p ( pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | italic_L ) ⋅ italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ⋅ italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT )
=pa1p(pa1|L)w1(pai)h1(pai)pa2p(pa2|L)w2(pai)h2(pai)absentsubscriptsubscriptpa1𝑝conditionalsubscriptpa1𝐿subscript𝑤1subscriptpa𝑖subscript1subscriptpa𝑖subscriptsubscriptpa2𝑝conditionalsubscriptpa2𝐿subscript𝑤2subscriptpa𝑖subscript2subscriptpa𝑖\displaystyle=\sum_{\textbf{pa}_{1}}p(\textbf{pa}_{1}\,|\,L)\cdot w_{1}(% \textbf{pa}_{i})\cdot h_{1}(\textbf{pa}_{i})\sum_{\textbf{pa}_{2}}p(\textbf{pa% }_{2}\,|\,L)\cdot w_{2}(\textbf{pa}_{i})\cdot h_{2}(\textbf{pa}_{i})\dots= ∑ start_POSTSUBSCRIPT pa start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_p ( pa start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT | italic_L ) ⋅ italic_w start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ⋅ italic_h start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ∑ start_POSTSUBSCRIPT pa start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_p ( pa start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT | italic_L ) ⋅ italic_w start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ⋅ italic_h start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) …
Since each summation over parent sets is independent of the others, we get the final result
=ipaip(pai|L)wi(pai)hi(pai)absentsubscriptproduct𝑖subscriptsubscriptpa𝑖𝑝conditionalsubscriptpa𝑖𝐿subscript𝑤𝑖subscriptpa𝑖subscript𝑖subscriptpa𝑖\displaystyle=\prod_{i}\sum_{\textbf{pa}_{i}}p(\textbf{pa}_{i}\,|\,L)\cdot w_{% i}(\textbf{pa}_{i})\cdot h_{i}(\textbf{pa}_{i})= ∏ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_p ( pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | italic_L ) ⋅ italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ⋅ italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT )

Proof of Proposition 3.2

Proof.
𝔼G|Lsubscript𝔼conditional𝐺𝐿\displaystyle\mathbb{E}_{G\,|\,L}blackboard_E start_POSTSUBSCRIPT italic_G | italic_L end_POSTSUBSCRIPT [w(G)h(G)]delimited-[]𝑤𝐺𝐺\displaystyle\left[w(G)\cdot h(G)\right][ italic_w ( italic_G ) ⋅ italic_h ( italic_G ) ]
=Gp(G|L)w(G)h(G)absentsubscript𝐺𝑝conditional𝐺𝐿𝑤𝐺𝐺\displaystyle=\sum_{G}p(G\,|\,L)\cdot w(G)\cdot h(G)= ∑ start_POSTSUBSCRIPT italic_G end_POSTSUBSCRIPT italic_p ( italic_G | italic_L ) ⋅ italic_w ( italic_G ) ⋅ italic_h ( italic_G )
Since we assume that w(G)𝑤𝐺w(G)italic_w ( italic_G ) factorises and h(G)𝐺h(G)italic_h ( italic_G ) sums over the parent sets, we have
=Gip(paiG|L)wi(paiG)jhj(pajG)absentsubscript𝐺subscriptproduct𝑖𝑝conditionalsuperscriptsubscriptpa𝑖𝐺𝐿subscript𝑤𝑖superscriptsubscriptpa𝑖𝐺subscript𝑗subscript𝑗subscriptsuperscriptpa𝐺𝑗\displaystyle=\sum_{G}\prod_{i}p(\textbf{pa}_{i}^{G}\,|\,L)\cdot w_{i}(\textbf% {pa}_{i}^{G})\cdot\sum_{j}h_{j}(\textbf{pa}^{G}_{j})= ∑ start_POSTSUBSCRIPT italic_G end_POSTSUBSCRIPT ∏ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_p ( pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_G end_POSTSUPERSCRIPT | italic_L ) ⋅ italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_G end_POSTSUPERSCRIPT ) ⋅ ∑ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( pa start_POSTSUPERSCRIPT italic_G end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT )
The sum over all graphs can be represented as sum over all combinations of possible parent sets to get
=pa1pa2padip(pai|L)wi(pai)jhj(paj)absentsubscriptsubscriptpa1subscriptsubscriptpa2subscriptsubscriptpa𝑑subscriptproduct𝑖𝑝conditionalsubscriptpa𝑖𝐿subscript𝑤𝑖subscriptpa𝑖subscript𝑗subscript𝑗subscriptpa𝑗\displaystyle=\sum_{\textbf{pa}_{1}}\sum_{\textbf{pa}_{2}}\dots\sum_{\textbf{% pa}_{d}}\prod_{i}p(\textbf{pa}_{i}\,|\,L)\cdot w_{i}(\textbf{pa}_{i})\cdot\sum% _{j}h_{j}(\textbf{pa}_{j})= ∑ start_POSTSUBSCRIPT pa start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT pa start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT … ∑ start_POSTSUBSCRIPT pa start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∏ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_p ( pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | italic_L ) ⋅ italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ⋅ ∑ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( pa start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT )
=pa1pa2padjhj(paj)ip(pai|L)wi(pai)absentsubscriptsubscriptpa1subscriptsubscriptpa2subscriptsubscriptpa𝑑subscript𝑗subscript𝑗subscriptpa𝑗subscriptproduct𝑖𝑝conditionalsubscriptpa𝑖𝐿subscript𝑤𝑖subscriptpa𝑖\displaystyle=\sum_{\textbf{pa}_{1}}\sum_{\textbf{pa}_{2}}\dots\sum_{\textbf{% pa}_{d}}\cdot\sum_{j}h_{j}(\textbf{pa}_{j})\prod_{i}p(\textbf{pa}_{i}\,|\,L)% \cdot w_{i}(\textbf{pa}_{i})= ∑ start_POSTSUBSCRIPT pa start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT pa start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT … ∑ start_POSTSUBSCRIPT pa start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT end_POSTSUBSCRIPT ⋅ ∑ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( pa start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) ∏ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_p ( pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | italic_L ) ⋅ italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT )
=pa1pa2padh1(pa1)ip(pai|L)wi(pai)+pa1pa2padj=2dabsentsubscriptsubscriptpa1subscriptsubscriptpa2subscriptsubscriptpa𝑑subscript1subscriptpa1subscriptproduct𝑖𝑝conditionalsubscriptpa𝑖𝐿subscript𝑤𝑖subscriptpa𝑖subscriptsubscriptpa1subscriptsubscriptpa2subscriptsubscriptpa𝑑superscriptsubscript𝑗2𝑑\displaystyle=\sum_{\textbf{pa}_{1}}\sum_{\textbf{pa}_{2}}\dots\sum_{\textbf{% pa}_{d}}h_{1}(\textbf{pa}_{1})\prod_{i}p(\textbf{pa}_{i}\,|\,L)\cdot w_{i}(% \textbf{pa}_{i})+\sum_{\textbf{pa}_{1}}\sum_{\textbf{pa}_{2}}\dots\sum_{% \textbf{pa}_{d}}\sum_{j=2}^{d}\dots= ∑ start_POSTSUBSCRIPT pa start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT pa start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT … ∑ start_POSTSUBSCRIPT pa start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( pa start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ∏ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_p ( pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | italic_L ) ⋅ italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) + ∑ start_POSTSUBSCRIPT pa start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT pa start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT … ∑ start_POSTSUBSCRIPT pa start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_j = 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT …
=pa1h1(pa1)p(pa1|L)w1(pa1)pa2p(pa2|L)w2(pa2)+pa1pa2padj=2dabsentsubscriptsubscriptpa1subscript1subscriptpa1𝑝conditionalsubscriptpa1𝐿subscript𝑤1subscriptpa1subscriptsubscriptpa2𝑝conditionalsubscriptpa2𝐿subscript𝑤2subscriptpa2subscriptsubscriptpa1subscriptsubscriptpa2subscriptsubscriptpa𝑑superscriptsubscript𝑗2𝑑\displaystyle=\sum_{\textbf{pa}_{1}}h_{1}(\textbf{pa}_{1})\cdot p(\textbf{pa}_% {1}\,|\,L)\cdot w_{1}(\textbf{pa}_{1})\sum_{\textbf{pa}_{2}}p(\textbf{pa}_{2}% \,|\,L)\cdot w_{2}(\textbf{pa}_{2})\dots+\sum_{\textbf{pa}_{1}}\sum_{\textbf{% pa}_{2}}\dots\sum_{\textbf{pa}_{d}}\sum_{j=2}^{d}\dots= ∑ start_POSTSUBSCRIPT pa start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( pa start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ⋅ italic_p ( pa start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT | italic_L ) ⋅ italic_w start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( pa start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ∑ start_POSTSUBSCRIPT pa start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_p ( pa start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT | italic_L ) ⋅ italic_w start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( pa start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ⋯ + ∑ start_POSTSUBSCRIPT pa start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT pa start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT … ∑ start_POSTSUBSCRIPT pa start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_j = 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT …
By abbreviating αi(L)=paip(pai|L)wi(pai)subscript𝛼𝑖𝐿subscriptsubscriptpa𝑖𝑝conditionalsubscriptpa𝑖𝐿subscript𝑤𝑖subscriptpa𝑖\alpha_{i}(L)=\sum_{\textbf{pa}_{i}}p(\textbf{pa}_{i}\,|\,L)w_{i}(\textbf{pa}_% {i})italic_α start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_L ) = ∑ start_POSTSUBSCRIPT pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_p ( pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | italic_L ) italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) we get
=pa1h1(pa1)p(pa1|L)w1(pa1)k1αk(L)+pa1pa2padj=2dhj(paj)ip(pai|L)wi(pai)absentsubscriptsubscriptpa1subscript1subscriptpa1𝑝conditionalsubscriptpa1𝐿subscript𝑤1subscriptpa1subscriptproduct𝑘1subscript𝛼𝑘𝐿subscriptsubscriptpa1subscriptsubscriptpa2subscriptsubscriptpa𝑑superscriptsubscript𝑗2𝑑subscript𝑗subscriptpa𝑗subscriptproduct𝑖𝑝conditionalsubscriptpa𝑖𝐿subscript𝑤𝑖subscriptpa𝑖\displaystyle=\sum_{\textbf{pa}_{1}}h_{1}(\textbf{pa}_{1})\cdot p(\textbf{pa}_% {1}\,|\,L)\cdot w_{1}(\textbf{pa}_{1})\cdot\prod_{k\neq 1}\alpha_{k}(L)+\sum_{% \textbf{pa}_{1}}\sum_{\textbf{pa}_{2}}\dots\sum_{\textbf{pa}_{d}}\sum_{j=2}^{d% }h_{j}(\textbf{pa}_{j})\prod_{i}p(\textbf{pa}_{i}\,|\,L)\cdot w_{i}(\textbf{pa% }_{i})= ∑ start_POSTSUBSCRIPT pa start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( pa start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ⋅ italic_p ( pa start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT | italic_L ) ⋅ italic_w start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( pa start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ⋅ ∏ start_POSTSUBSCRIPT italic_k ≠ 1 end_POSTSUBSCRIPT italic_α start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_L ) + ∑ start_POSTSUBSCRIPT pa start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT pa start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT … ∑ start_POSTSUBSCRIPT pa start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_j = 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT italic_h start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( pa start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) ∏ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_p ( pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | italic_L ) ⋅ italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT )
By repeating this procedure for the remaining summands j𝑗jitalic_j, we get the final result
=i(kiαk(L))paip(pai|L)wi(pai)hi(pai)absentsubscript𝑖subscriptproduct𝑘𝑖subscript𝛼𝑘𝐿subscriptsubscriptpa𝑖𝑝conditionalsubscriptpa𝑖𝐿subscript𝑤𝑖subscriptpa𝑖subscript𝑖subscriptpa𝑖\displaystyle=\sum_{i}\left(\prod_{k\neq i}\alpha_{k}(L)\right)\cdot\sum_{% \textbf{pa}_{i}}p(\textbf{pa}_{i}\,|\,L)\cdot w_{i}(\textbf{pa}_{i})\cdot h_{i% }(\textbf{pa}_{i})= ∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( ∏ start_POSTSUBSCRIPT italic_k ≠ italic_i end_POSTSUBSCRIPT italic_α start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_L ) ) ⋅ ∑ start_POSTSUBSCRIPT pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_p ( pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | italic_L ) ⋅ italic_w start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ⋅ italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT )

Appendix B Implementation and Estimation Details

Our model is implemented in Python mainly relying on PyTorch (Ansel et al., 2024), and GPyTorch (Gardner et al., 2018) for GP inference.

B.1 Mechanism Model.

We follow Toth et al. (2022) and use two types of models for our mechanism.

For root nodes, i.e., causal variables without parents, we place a conjugate normal-inverse-gamma prior N-Γ1superscriptΓ1\Gamma^{-1}roman_Γ start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT on the mean and variance of that node. Specifically,

p(fi,σi2|Pai=)𝑝subscript𝑓𝑖conditionalsuperscriptsubscript𝜎𝑖2subscriptPa𝑖\displaystyle p(f_{i},\sigma_{i}^{2}\,|\,\textbf{Pa}_{i}=\varnothing)italic_p ( italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_σ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT | Pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = ∅ ) =N-Γ1(fi,σi2|μ0,κ01,α0,β0)absentN-superscriptΓ1subscript𝑓𝑖conditionalsuperscriptsubscript𝜎𝑖2subscript𝜇0subscriptsuperscript𝜅10subscript𝛼0subscript𝛽0\displaystyle=\text{N-}\Gamma^{-1}(f_{i},\sigma_{i}^{2}\,|\,\mu_{0},\kappa^{-1% }_{0},\alpha_{0},\beta_{0})= N- roman_Γ start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_σ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT | italic_μ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_κ start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_α start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_β start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT )
=N(fi|μ0,σi2κ01)Γ1(σi2|α0,β0)absentNconditionalsubscript𝑓𝑖subscript𝜇0superscriptsubscript𝜎𝑖2subscriptsuperscript𝜅10superscriptΓ1conditionalsuperscriptsubscript𝜎𝑖2subscript𝛼0subscript𝛽0\displaystyle=\text{N}(f_{i}\,|\,\mu_{0},\sigma_{i}^{2}\cdot\kappa^{-1}_{0})% \cdot\Gamma^{-1}(\sigma_{i}^{2}\,|\,\alpha_{0},\beta_{0})= N ( italic_f start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | italic_μ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_σ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ⋅ italic_κ start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) ⋅ roman_Γ start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( italic_σ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT | italic_α start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_β start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT )

where μ0,κ01,α0,β0subscript𝜇0subscriptsuperscript𝜅10subscript𝛼0subscript𝛽0\mu_{0},\kappa^{-1}_{0},\alpha_{0},\beta_{0}italic_μ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_κ start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_α start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_β start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT are fixed hyper-parameters. When sampling ground-truth SCMs we set μ0=0subscript𝜇00\mu_{0}=0italic_μ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = 0, κ0=1subscript𝜅01\kappa_{0}=1italic_κ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = 1, α0=5subscript𝛼05\alpha_{0}=5italic_α start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = 5 and β0=10subscript𝛽010\beta_{0}=10italic_β start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = 10, sample a mean and variance from the prior and keep them fixed thereafter. For the inference with ARCO-GP and DIBS-GP, we use μ0=0subscript𝜇00\mu_{0}=0italic_μ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = 0, κ0=1subscript𝜅01\kappa_{0}=1italic_κ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = 1, α0=10subscript𝛼010\alpha_{0}=10italic_α start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = 10 and β0=10subscript𝛽010\beta_{0}=10italic_β start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = 10. This yields a prior mean of 1111 for the variance σ2superscript𝜎2\sigma^{2}italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT, which is sufficiently broad considering that we standardise all training data to zero mean and unit variance. Analytic expressions for the posterior marginal likelihood are found in (Murphy, 2007).

For non-root nodes we place a GP prior on the mechanisms. Specifically, we use rational quadratic (RQ) kernel

kRQ(𝐱𝟏,𝐱𝟐)=δ(1+12γ(𝐱𝟏𝐱𝟐)λ2(𝐱𝟏𝐱𝟐))γsubscript𝑘RQsubscript𝐱1subscript𝐱2𝛿superscript112𝛾superscriptsubscript𝐱1subscript𝐱2topsuperscript𝜆2subscript𝐱1subscript𝐱2𝛾\displaystyle k_{\text{RQ}}(\mathbf{x_{1}},\mathbf{x_{2}})=\delta\cdot\left(1+% \frac{1}{2\gamma}(\mathbf{x_{1}}-\mathbf{x_{2}})^{\top}\lambda^{-2}(\mathbf{x_% {1}}-\mathbf{x_{2}})\right)^{-\gamma}italic_k start_POSTSUBSCRIPT RQ end_POSTSUBSCRIPT ( bold_x start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT , bold_x start_POSTSUBSCRIPT bold_2 end_POSTSUBSCRIPT ) = italic_δ ⋅ ( 1 + divide start_ARG 1 end_ARG start_ARG 2 italic_γ end_ARG ( bold_x start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT - bold_x start_POSTSUBSCRIPT bold_2 end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_λ start_POSTSUPERSCRIPT - 2 end_POSTSUPERSCRIPT ( bold_x start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT - bold_x start_POSTSUBSCRIPT bold_2 end_POSTSUBSCRIPT ) ) start_POSTSUPERSCRIPT - italic_γ end_POSTSUPERSCRIPT

with scaling parameter δ𝛿\deltaitalic_δ, lengthscale parameter λ𝜆\lambdaitalic_λ and mixing parameter γ𝛾\gammaitalic_γ to model non-linear mechanisms. For non-linear mechanisms we use an additive Gaussian noise likelihood with variance σi2superscriptsubscript𝜎𝑖2\sigma_{i}^{2}italic_σ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT. We place Gamma priors Γ(δ|αδ,βδ)Γconditional𝛿subscript𝛼𝛿subscript𝛽𝛿\Gamma(\delta\,|\,\alpha_{\delta},\beta_{\delta})roman_Γ ( italic_δ | italic_α start_POSTSUBSCRIPT italic_δ end_POSTSUBSCRIPT , italic_β start_POSTSUBSCRIPT italic_δ end_POSTSUBSCRIPT ), Γ(λ|αλ,βλ)Γconditional𝜆subscript𝛼𝜆subscript𝛽𝜆\Gamma(\lambda\,|\,\alpha_{\lambda},\beta_{\lambda})roman_Γ ( italic_λ | italic_α start_POSTSUBSCRIPT italic_λ end_POSTSUBSCRIPT , italic_β start_POSTSUBSCRIPT italic_λ end_POSTSUBSCRIPT ), Γ(γ|αγ,βγ)Γconditional𝛾subscript𝛼𝛾subscript𝛽𝛾\Gamma(\gamma\,|\,\alpha_{\gamma},\beta_{\gamma})roman_Γ ( italic_γ | italic_α start_POSTSUBSCRIPT italic_γ end_POSTSUBSCRIPT , italic_β start_POSTSUBSCRIPT italic_γ end_POSTSUBSCRIPT ) and Γ(σi2|ασ,βσ)Γconditionalsuperscriptsubscript𝜎𝑖2subscript𝛼𝜎subscript𝛽𝜎\Gamma(\sigma_{i}^{2}\,|\,\alpha_{\sigma},\beta_{\sigma})roman_Γ ( italic_σ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT | italic_α start_POSTSUBSCRIPT italic_σ end_POSTSUBSCRIPT , italic_β start_POSTSUBSCRIPT italic_σ end_POSTSUBSCRIPT ) on these parameters.

When generating non-linear models we set αδ=100subscript𝛼𝛿100\alpha_{\delta}=100italic_α start_POSTSUBSCRIPT italic_δ end_POSTSUBSCRIPT = 100, βδ=10subscript𝛽𝛿10\beta_{\delta}=10italic_β start_POSTSUBSCRIPT italic_δ end_POSTSUBSCRIPT = 10, αλ=30|Pai|subscript𝛼𝜆30subscriptPa𝑖\alpha_{\lambda}=30\cdot|\textbf{Pa}_{i}|italic_α start_POSTSUBSCRIPT italic_λ end_POSTSUBSCRIPT = 30 ⋅ | Pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT |, βλ=30subscript𝛽𝜆30\beta_{\lambda}=30italic_β start_POSTSUBSCRIPT italic_λ end_POSTSUBSCRIPT = 30, αγ=20subscript𝛼𝛾20\alpha_{\gamma}=20italic_α start_POSTSUBSCRIPT italic_γ end_POSTSUBSCRIPT = 20, βγ=10subscript𝛽𝛾10\beta_{\gamma}=10italic_β start_POSTSUBSCRIPT italic_γ end_POSTSUBSCRIPT = 10 and ασ=50subscript𝛼𝜎50\alpha_{\sigma}=50italic_α start_POSTSUBSCRIPT italic_σ end_POSTSUBSCRIPT = 50, βσ=50subscript𝛽𝜎50\beta_{\sigma}=50italic_β start_POSTSUBSCRIPT italic_σ end_POSTSUBSCRIPT = 50. For each GP we sample a set of hyper-parameters from their priors and keep them fixed thereafter. Additionally, we sample a function from the GP prior with 50505050 support points sampled uniform random in range [-10, 10]. For the inference with ARCO-GP and DIBS-GP, we set αδ=100subscript𝛼𝛿100\alpha_{\delta}=100italic_α start_POSTSUBSCRIPT italic_δ end_POSTSUBSCRIPT = 100, βδ=10subscript𝛽𝛿10\beta_{\delta}=10italic_β start_POSTSUBSCRIPT italic_δ end_POSTSUBSCRIPT = 10, αλ=30|Pai|subscript𝛼𝜆30subscriptPa𝑖\alpha_{\lambda}=30\cdot|\textbf{Pa}_{i}|italic_α start_POSTSUBSCRIPT italic_λ end_POSTSUBSCRIPT = 30 ⋅ | Pa start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT |, βλ=30subscript𝛽𝜆30\beta_{\lambda}=30italic_β start_POSTSUBSCRIPT italic_λ end_POSTSUBSCRIPT = 30, αγ=20subscript𝛼𝛾20\alpha_{\gamma}=20italic_α start_POSTSUBSCRIPT italic_γ end_POSTSUBSCRIPT = 20, βγ=10subscript𝛽𝛾10\beta_{\gamma}=10italic_β start_POSTSUBSCRIPT italic_γ end_POSTSUBSCRIPT = 10 and ασ=2subscript𝛼𝜎2\alpha_{\sigma}=2italic_α start_POSTSUBSCRIPT italic_σ end_POSTSUBSCRIPT = 2, βσ=88subscript𝛽𝜎88\beta_{\sigma}=88italic_β start_POSTSUBSCRIPT italic_σ end_POSTSUBSCRIPT = 88 again considering that we standardise all training data to zero mean and unit variance. We train GP hyper-parameters for a maximum of 100100100100 steps with the RMSprop (Hinton, 2012) optimiser with learning rate 0.050.050.050.05.

B.2 ARCO Model.

We use a normal prior 𝒩(𝜽| 0,σ2I)𝒩conditional𝜽 0superscript𝜎2𝐼\mathcal{N}(\bm{\theta}\,|\,0,\sigma^{2}I)caligraphic_N ( bold_italic_θ | 0 , italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_I ) with σ=10𝜎10\sigma=10italic_σ = 10 over ARCO’s neural network parameters 𝜽𝜽\bm{\theta}bold_italic_θ. The neural network g𝜽subscript𝑔𝜽g_{\bm{\theta}}italic_g start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT uses a single hidden layer with 30303030 neurons and ReLU activation functions (see Section 3.1). We train ARCO for a maximum of 400400400400 gradient steps, using the ADAM (Kingma & Ba, 2015) optimiser with learning rate of 0.010.010.010.01. For the score-function gradient estimator in Equation 6 we use an exponential moving average baseline with decay rate 0.90.90.90.9. To estimate gradients and causal queries we sample 100100100100 causal orders and 25252525 DAGs conditional on each order where necessary, resulting in 2500250025002500 DAG samples. Per default, we limit the parent set size to a maximum of two parents per node. These settings apply to all our simulations unless stated otherwise.

B.3 Kernel Density Estimates of Interventional Distributions.

To approximate the interventional distribution given our posterior samples (see Section 3, Algorithm 1, lines 1-1), we compute a kernel density estimate (KDE). We draw 100100100100 causal orders, with 10101010 graphs per order and 10101010 samples per graph, equalling 10000100001000010000 samples from the posterior interventional distribution in total. We use the KDE implementation provided by scikit-learn (Pedregosa et al., 2011), with a Gaussian kernel and a bandwidth of 0.20.20.20.2.

Appendix C Experimental Setup

Our code is available at https://github.com/chritoth/bci-arco-gp.

Refer to caption
Figure 4: Sachs Graph. Consensus protein interaction graph from (Sachs et al., 2005). We relabeled nodes to avoid misinterpretation of our simulation results. Nodes X0 to X10 correspond to the original labels [’PKC’, ’PKA’, ’Jnk’, ’P38’, ’Raf’, ’Mek’, ’Erk’, ’Akt’, ’Plcg’, ’PIP3’, ’PIP2’].

Sampling of Ground-Truth SCMs.

To sample ground truth SCMs we draw graph structures from either the Erdös-Rényi (ER) (Erdös & Rényi, 1959) or the scale-free (SF) (Barabási & Albert, 1999) random models commonly used in DAG structure learning benchmarks. We follow the setup of Lorch et al. (2021) and generate scale-free and Erdös-Rényi graphs with an expected degree of 2222. Specifically, this will yield SF graphs with a maximum parent set size of 2222, whereas no such restriction applies to ER graphs in general. We instantiate the causal mechanisms for non-linear additive noise models as described in Appendix B.

For each ground truth SCM, we sample a fixed set of N𝑁Nitalic_N training data from the observational distribution. Importantly, Reisach et al. (2021) argue, that the causal order may strongly correlate with increasing marginal variance in simulated data, and therefore, benchmarks may be easy to game by exploiting this property. As this may be especially relevant to order-based inference methods, we follow their recommendation and normalise the training data for each endogenous variable to zero mean and unit variance.

Metrics.

As metrics for learning the causal structure we report the expected structural Hamming distance (ESHD), as well as the area under the receiver operating characteristic (AUROC) and the area under the precision recall curve (AUPRC) w.r.t. posterior edge prediction as commonly reported metrics. To provide additional insight into the methods’ strengths and weaknesses, we also report the expected number of edges (#Edges), the true positive rate (TPR) and the true negative rate (TNR) for the edge prediction task. We do not report the log-likelihood on held-out test data as is sometimes reported, because the evaluated methods implement different noise models and approximate inference schemes, which leads to the (marginal) log-likelihoods being uncalibrated and thus incomparable (cf. (Murphy, 2023)[Sec. 7.5] and references therein).

To evaluate the inferred causal structures w.r.t. their causal implications, Henckel et al. (2024) recently proposed a family of causal distances called Adjustment Identification Distance (AID). Briefly summarised, the AID counts the number of correctly identified adjustment sets for pairwise causal effects w.r.t. a target graph, where the adjustment sets are determined according to a chosen identification strategy. We adopt the three variants proposed by Henckel et al. (2024), namely Ancestor adjustment (A-AID), Parent adjustment (P-AID), and Optimal adjustment (OSET-AID). These metrics can also be applied between different graph types, e.g., comparing the AID between a predicted CPDAG and a reference DAG. Note that the (P-AID) computed between to DAGs is equivalent to the well-known Structural Identification Distance proposed by Peters & Bühlmann (2015).

Baselines.

We compare our inference model (ARCO-GP) to a diverse set of nine different structure learning methods that we describe in the following.

  • BAYESDAG (Annadani et al., 2023). BayesDAG utilises a mixture of MCMC to infer permutations and mechanism parameters, and Variational Inference (VI) to infer DAG edges given the permutations. We use the implementation provided by Annadani et al. (2023) at https://github.com/microsoft/Project-BayesDAG. We needed to adapt the sparsity regularisation hyper-parameter in order to get meaningful results and ran our experiments with the configurations in Listings 2-3. As the implementation runs a number of MCMC chains and only evaluates the best chain afterwards, we use only one MCMC chain to enable a fair comparison, as multiple chains would correspond to multiple runs of the other methods.

  • DAG-GNN (Yu et al., 2019). DAG-GNN is a gradient-based structure learning approach combining graph neural networks an acyclicity constraint similar to Zheng et al. (2018). We use the implementation provided by Zhang et al. (2021)[Version 1.0.3] using default settings.

  • DDS (Charpentier et al., 2022). DDS builds upon the permutation-based approach of (Cundy et al., 2021) and utilises differentiable permutation sampling and VI to infer a posterior over DAGs. We use the implementation provided by Charpentier et al. (2022) at https://github.com/sharpenb/Differentiable-DAG-Sampling. We needed to adapt the default hyper-parameters in order to get meaningful results and ran our experiments with the configuration displayed in Listing 1.

  • DIBS-GP (Toth et al., 2022; Lorch et al., 2021). DIBS-GP is a Bayesian causal inference framework recently proposed by Toth et al. (2022) utilising the differential structure learning method by Lorch et al. (2021) employing a soft acyclicity constraint in line with Zheng et al. (2018); Yu et al. (2019). We use the implementation provided by Toth et al. (2022) at https://github.com/chritoth/active-bayesian-causal-inference. We use their standard parameters with 10101010 latent particles and constant hyper-parameters α=β=1𝛼𝛽1\alpha=\beta=1italic_α = italic_β = 1 as described in (Lorch et al., 2021). For each latent particle we sample 100100100100 graphs to estimate gradients and causal quantities.

  • GADGET (Viinikka et al., 2020). GADGET is a Bayesian structure learning method for linear Gaussian models based on MCMC and structure marginalisation over capped-size parent sets. We use the implementation provided by (Viinikka et al., 2020) at https://github.com/Sums-of-Products/sumu and parameters as shown in Listing 4.

  • GES (Chickering, 2003). GES is a well-known greedy score-based method for causal discovery using the BIC score. We use the implementation provided by Zhang et al. (2021)[Version 1.0.3] using default settings.

  • GOLEM (Ng et al., 2020). GOLEM is a differentiable DAG structure learning similar to Zheng et al. (2018) but with a likelihood-based score function for linear models. We use the implementation provided by Zhang et al. (2021)[Version 1.0.3] using default settings.

  • GRASP (Lam et al., 2022). GRASP is a recently proposed permutation-based approach to causal discovery. We use the implementation provided by Zheng et al. (2024) using default settings.

  • PC (Spirtes et al., 2000). The PC algorithm is another standard causal discovery methods based on conditional independence tests. We use the implementation provided by Zhang et al. (2021)[Version 1.0.3] using default settings. It happens quite frequently that PC return a possibly cyclic PDAG, for which the AID metrics cannot computed and are thus omited in our experimental results.

Listing 1: DDS Parameter Set.
# Architecture parameters
’seed_model’: 123, # Seed to init model.
’ma_hidden_dims’: [32, 32, 32], # Output dimension.
’ma_architecture’: ’linear’, # Output dimension.
’ma_fast’: False, # Output dimension.
’pd_initial_adj’: ’Learned’, # Output dimension.
’pd_temperature’: 1.0, # Output dimension.
’pd_hard’: True, # Output dimension.
’pd_order_type’: ’topk’, # Output dimension.
’pd_noise_factor’: 1.0, # Hidden dimensions.
# Training parameters
’max_epochs’: 500, # Maximum number of epochs for training
’patience’: 150, # Patience for early stopping.
’frequency’: 2, # Frequency for early stopping test.
’batch_size’: 16, # Batch size.
’ma_lr’: 1e-3, # Learning rate.
’pd_lr’: 1e-2, # Learning rate.
’loss’: ’ELBO’, # Loss name. string
’regr’: 1e-1, # Regularization factor in Bayesian loss.
’prior_p’: 1e-6 # Regularization factor in Bayesian loss.
Listing 2: BAYESDAG Non-Linear ER Parameter Set.
"model_hyperparams": {
"num_chains": 1,
"sinkhorn_n_iter": 3000,
"scale_noise_p": 0.001,
"scale_noise": 0.001,
"VI_norm": true,
"input_perm": false,
"lambda_sparse": 10,
"sparse_init": false
},
"training_hyperparams": {
"learning_rate": 1e-3,
"batch_size": 512,
"stardardize_data_mean": false,
"stardardize_data_std": false,
"max_epochs": 500
}
Listing 3: BAYESDAG Non-linear SF Parameter Set.
"model_hyperparams": {
"num_chains": 1,
"sinkhorn_n_iter": 3000,
"scale_noise_p": 0.001,
"scale_noise": 0.001,
"VI_norm": true,
"input_perm": false,
"lambda_sparse": 10,
"sparse_init": false
},
"training_hyperparams": {
"learning_rate": 1e-3,
"batch_size": 512,
"stardardize_data_mean": false,
"stardardize_data_std": false,
"max_epochs": 500
}
Listing 4: GADGET Parameter Set.
"scoref": ’bge’,
"max_id": -1,
"K": min(self.num_nodes - 1, 16),
"d": 2,
"cp_algo": ’greedy-lite’,
"mc3_chains": 16,
"burn_in": 1000,
"iterations": 1000,
"thinning": 10,

Appendix D Extended Experimental Results

Here we present additional experimental results and ablations with varying number of variables d{11,20,50}𝑑112050d\in\left\{11,20,50\right\}italic_d ∈ { 11 , 20 , 50 }, as well as experiments on real-world data from Sachs et al. (2005). For these extended experiments we report an extended set of eight metrics as described in Appendix C.

Real-world dataset from Sachs et al. (2005).

The dataset consists of 853 observational data. The target consensus graph has 11 nodes and 17 edges. We compare our ARCO-GP method in variants with maximum parent set size k2,3𝑘23k\in{2,3}italic_k ∈ 2 , 3. All variants are competitive with the baselines.

Simulations on larger models with d=50𝑑50d=50italic_d = 50 variables.

We report structure learning results for non-linear models with ER and SF graphs over d=50𝑑50d=50italic_d = 50 variables in Table 3. The results of ARCO-GP are qualitatively similar to the ones achieved on 20 node systems.

Influence of the maximum parent set size.

We evaluate the influence of the maximum parent set size in range k{1,2,3,4}𝑘1234k\in\{1,2,3,4\}italic_k ∈ { 1 , 2 , 3 , 4 } on the performance of our ARCO-GP method in Table 2. We distinguish the models variants by labels ARCO-Kk-GP. On non-linear benchmarks, k=1𝑘1k=1italic_k = 1 performs worst as expected, since each node can only have one parent. Not surprisingly, k=2𝑘2k=2italic_k = 2 performs best on scale-free graphs, whereas k=4𝑘4k=4italic_k = 4 performs best on Erdös-Rényi graphs, as the respective k𝑘kitalic_k better fit the stuctural properties of the true graphs.

Non-linear Average Causal Effect Estimation.

In this simulation, we compare our ability to infer pairwise average causal effects (ACEs)

ACE(Xi|do(Xj=xj)):=𝔼Xi|do(Xj=xj),𝒟[Xi]assign𝐴𝐶𝐸conditionalsubscript𝑋𝑖𝑑𝑜subscript𝑋𝑗subscript𝑥𝑗subscript𝔼conditionalsubscript𝑋𝑖𝑑𝑜subscript𝑋𝑗subscript𝑥𝑗𝒟delimited-[]subscript𝑋𝑖\displaystyle ACE(X_{i}\,|\,do(X_{j}=x_{j})):=\mathbb{E}_{X_{i}\,|\,do(X_{j}=x% _{j}),\mathcal{D}}\left[X_{i}\right]italic_A italic_C italic_E ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | italic_d italic_o ( italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) ) := blackboard_E start_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | italic_d italic_o ( italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) , caligraphic_D end_POSTSUBSCRIPT [ italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ]

to two baselines, namely DIBS-GP ((Toth et al., 2022; Lorch et al., 2021)) and GADET+BEEPS ((Viinikka et al., 2020)). While DIBS-GP can implements a non-linear additive noise model based on GPs, GADGET+BEEPS is designed for linear Gaussian models only. In particular, the implementation of GADGET+BEEPS does not naturally support the inference of average causal effects, but infers causal effects as the path-wise accumulation of the linear mechanism weights. Thus, we extend the implementation in the following way. A linear SCM can be written as

X=(GB)TX+μ+ϵ,𝑋superscriptdirect-product𝐺𝐵𝑇𝑋𝜇italic-ϵ\displaystyle X=(G\odot B)^{T}\cdot X+\mu+\epsilon,italic_X = ( italic_G ⊙ italic_B ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ⋅ italic_X + italic_μ + italic_ϵ ,

where X𝑋Xitalic_X are the endogenous variables, G𝐺Gitalic_G is the adjacency matrix of the causal DAG, B𝐵Bitalic_B is the weight matrix of the linear edge weights, direct-product\odot denotes the element-wise multiplication, μ𝜇\muitalic_μ denotes the mean vector of the variables and ϵitalic-ϵ\epsilonitalic_ϵ denotes the exogenous Gaussian noise. Solving for X yields

X=(I(GB)T)1(μ+ϵ),𝑋superscript𝐼superscriptdirect-product𝐺𝐵𝑇1𝜇italic-ϵ\displaystyle X=(I-(G\odot B)^{T})^{-1}(\mu+\epsilon),italic_X = ( italic_I - ( italic_G ⊙ italic_B ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( italic_μ + italic_ϵ ) ,

and taking the expectation w.r.t. ϵitalic-ϵ\epsilonitalic_ϵ yields

𝔼ϵ[X]=(I(GB)T)1μ.subscript𝔼italic-ϵdelimited-[]𝑋superscript𝐼superscriptdirect-product𝐺𝐵𝑇1𝜇\displaystyle\mathbb{E}_{\epsilon}[X]=(I-(G\odot B)^{T})^{-1}\mu.blackboard_E start_POSTSUBSCRIPT italic_ϵ end_POSTSUBSCRIPT [ italic_X ] = ( italic_I - ( italic_G ⊙ italic_B ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT italic_μ . (12)

We then estimate ACE(Xi|do(Xj=xj))𝐴𝐶𝐸conditionalsubscript𝑋𝑖𝑑𝑜subscript𝑋𝑗subscript𝑥𝑗ACE(X_{i}\,|\,do(X_{j}=x_{j}))italic_A italic_C italic_E ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | italic_d italic_o ( italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) ) by (i) drawing posterior DAGs G𝐺Gitalic_G and weight matrices B𝐵Bitalic_B using GADGET+BEEPS, (ii) performing the intervention by removing all parents of Xjsubscript𝑋𝑗X_{j}italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT in the sampled DAGs and setting the posterior mean μj=xjsubscript𝜇𝑗subscript𝑥𝑗\mu_{j}=x_{j}italic_μ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT, and (iii) finally solving Equation 12 to get the desired ACE.

We generate 10 non-linear ground truth models with fixed graph as in Figure 4. We then report the mean absolute errors (MAEs) between for all pairwise ACEs in Table 4 for ARCO-GP, in Table 5 for DIBS-GP, and in Table 6 for GADGET+BEEPS.

ARCO-GP achieves better MAEs than DIBS-GP for almost all pairwise ACEs. Not surprisingly, ARCO-GP performs better than GADGET+BEEPS on non-linear models.

Table 1: Real-world dataset from Sachs et al. (2005)). The target graph has 11 nodes and 17 edges. We report means and 95%percent9595\%95 % confidence intervals (CIs) across 20202020 different ground truth models. Arrows next to metrics indicate lower is better (\downarrow) and higher is better (\uparrow).
Model #Edges \downarrow A-AID \downarrow P-AID \downarrow OSET-AID
ARCO-K2-GP 7 ±plus-or-minus\pm± 0 0.22 ±plus-or-minus\pm± 0.01 0.48 ±plus-or-minus\pm± 0.02 0.23 ±plus-or-minus\pm± 0.01
ARCO-K3-GP 7 ±plus-or-minus\pm± 1 0.22 ±plus-or-minus\pm± 0.00 0.47 ±plus-or-minus\pm± 0.02 0.22 ±plus-or-minus\pm± 0.00
BAYESDAG 24 ±plus-or-minus\pm± 1 0.33 ±plus-or-minus\pm± 0.02 0.48 ±plus-or-minus\pm± 0.02 0.33 ±plus-or-minus\pm± 0.02
DAG-GNN 7 ±plus-or-minus\pm± 0 0.23 ±plus-or-minus\pm± 0.00 0.45 ±plus-or-minus\pm± 0.00 0.23 ±plus-or-minus\pm± 0.00
DDS 39 ±plus-or-minus\pm± 1 0.36 ±plus-or-minus\pm± 0.01 0.41 ±plus-or-minus\pm± 0.01 0.36 ±plus-or-minus\pm± 0.01
DIBS-GP 7 ±plus-or-minus\pm± 1 0.22 ±plus-or-minus\pm± 0.01 0.43 ±plus-or-minus\pm± 0.04 0.22 ±plus-or-minus\pm± 0.01
GADGET 9 ±plus-or-minus\pm± 0 0.25 ±plus-or-minus\pm± 0.00 0.50 ±plus-or-minus\pm± 0.01 0.25 ±plus-or-minus\pm± 0.00
GES 8 ±plus-or-minus\pm± 0 0.28 ±plus-or-minus\pm± 0.00 0.56 ±plus-or-minus\pm± 0.00 0.28 ±plus-or-minus\pm± 0.00
GOLEM 17 ±plus-or-minus\pm± 0 0.40 ±plus-or-minus\pm± 0.00 0.49 ±plus-or-minus\pm± 0.00 0.39 ±plus-or-minus\pm± 0.00
GRASP 8 ±plus-or-minus\pm± 0 0.28 ±plus-or-minus\pm± 0.00 0.56 ±plus-or-minus\pm± 0.00 0.28 ±plus-or-minus\pm± 0.00
PC 8 ±plus-or-minus\pm± 0
(a) AID Metrics.
Model \downarrow ESHD \uparrow AUROC \uparrow AUPRC \uparrow TPR \uparrow TNR
ARCO-K2-GP 17 ±plus-or-minus\pm± 1 0.56 ±plus-or-minus\pm± 0.02 0.30 ±plus-or-minus\pm± 0.02 0.20 ±plus-or-minus\pm± 0.03 0.96 ±plus-or-minus\pm± 0.01
ARCO-K3-GP 17 ±plus-or-minus\pm± 0 0.55 ±plus-or-minus\pm± 0.03 0.31 ±plus-or-minus\pm± 0.02 0.21 ±plus-or-minus\pm± 0.02 0.96 ±plus-or-minus\pm± 0.00
BAYESDAG 34 ±plus-or-minus\pm± 2 0.49 ±plus-or-minus\pm± 0.04 0.18 ±plus-or-minus\pm± 0.03 0.22 ±plus-or-minus\pm± 0.06 0.78 ±plus-or-minus\pm± 0.01
DAG-GNN 18 ±plus-or-minus\pm± 0 0.57 ±plus-or-minus\pm± 0.00 0.37 ±plus-or-minus\pm± 0.00 0.18 ±plus-or-minus\pm± 0.00 0.96 ±plus-or-minus\pm± 0.00
DDS 45 ±plus-or-minus\pm± 2 0.48 ±plus-or-minus\pm± 0.01 0.19 ±plus-or-minus\pm± 0.02 0.33 ±plus-or-minus\pm± 0.00 0.64 ±plus-or-minus\pm± 0.02
DIBS-GP 16 ±plus-or-minus\pm± 1 0.61 ±plus-or-minus\pm± 0.02 0.36 ±plus-or-minus\pm± 0.04 0.24 ±plus-or-minus\pm± 0.04 0.97 ±plus-or-minus\pm± 0.01
GADGET 20 ±plus-or-minus\pm± 0 0.63 ±plus-or-minus\pm± 0.03 0.27 ±plus-or-minus\pm± 0.02 0.19 ±plus-or-minus\pm± 0.01 0.94 ±plus-or-minus\pm± 0.00
GES 19 ±plus-or-minus\pm± 0 0.63 ±plus-or-minus\pm± 0.00 0.44 ±plus-or-minus\pm± 0.00 0.35 ±plus-or-minus\pm± 0.00 0.91 ±plus-or-minus\pm± 0.00
GOLEM 26 ±plus-or-minus\pm± 0 0.55 ±plus-or-minus\pm± 0.00 0.29 ±plus-or-minus\pm± 0.00 0.24 ±plus-or-minus\pm± 0.00 0.86 ±plus-or-minus\pm± 0.00
GRASP 19 ±plus-or-minus\pm± 0 0.63 ±plus-or-minus\pm± 0.00 0.44 ±plus-or-minus\pm± 0.00 0.35 ±plus-or-minus\pm± 0.00 0.91 ±plus-or-minus\pm± 0.00
PC 19 ±plus-or-minus\pm± 0 0.63 ±plus-or-minus\pm± 0.00 0.44 ±plus-or-minus\pm± 0.00 0.35 ±plus-or-minus\pm± 0.00 0.91 ±plus-or-minus\pm± 0.00
(b) Edge prediction metrics.
Table 2: Varying the maximal parent set size. Ablation studies on simulated (non-)linear ground truth models with 20202020 nodes. We report means and 95%percent9595\%95 % confidence intervals (CIs) across 20202020 different ground truth models. Arrows next to metrics indicate lower is better (\downarrow) and higher is better (\uparrow).
Model #Edges \downarrow A-AID \downarrow P-AID \downarrow OSET-AID \downarrow A-AID (CO)
ARCO-K1-GP 10 ±plus-or-minus\pm± 1 0.17 ±plus-or-minus\pm± 0.02 0.82 ±plus-or-minus\pm± 0.02 0.19 ±plus-or-minus\pm± 0.02 0.24 ±plus-or-minus\pm± 0.05
ARCO-K2-GP 32 ±plus-or-minus\pm± 1 0.06 ±plus-or-minus\pm± 0.02 0.13 ±plus-or-minus\pm± 0.04 0.07 ±plus-or-minus\pm± 0.02 0.06 ±plus-or-minus\pm± 0.03
ARCO-K3-GP 35 ±plus-or-minus\pm± 1 0.09 ±plus-or-minus\pm± 0.04 0.15 ±plus-or-minus\pm± 0.04 0.10 ±plus-or-minus\pm± 0.04 0.09 ±plus-or-minus\pm± 0.05
ARCO-K4-GP 37 ±plus-or-minus\pm± 1 0.08 ±plus-or-minus\pm± 0.03 0.16 ±plus-or-minus\pm± 0.05 0.10 ±plus-or-minus\pm± 0.03 0.08 ±plus-or-minus\pm± 0.04
(a) Scale-free Nonlinear (Part 1/2).
Model \downarrow ESHD \uparrow AUROC \uparrow AUPRC \uparrow TPR \uparrow TNR
ARCO-K1-GP 26 ±plus-or-minus\pm± 1 0.86 ±plus-or-minus\pm± 0.02 0.67 ±plus-or-minus\pm± 0.02 0.28 ±plus-or-minus\pm± 0.03 1.00 ±plus-or-minus\pm± 0.00
ARCO-K2-GP 5 ±plus-or-minus\pm± 2 0.97 ±plus-or-minus\pm± 0.02 0.94 ±plus-or-minus\pm± 0.03 0.88 ±plus-or-minus\pm± 0.04 1.00 ±plus-or-minus\pm± 0.00
ARCO-K3-GP 9 ±plus-or-minus\pm± 2 0.94 ±plus-or-minus\pm± 0.03 0.89 ±plus-or-minus\pm± 0.04 0.87 ±plus-or-minus\pm± 0.04 0.99 ±plus-or-minus\pm± 0.00
ARCO-K4-GP 11 ±plus-or-minus\pm± 2 0.94 ±plus-or-minus\pm± 0.02 0.90 ±plus-or-minus\pm± 0.04 0.86 ±plus-or-minus\pm± 0.04 0.98 ±plus-or-minus\pm± 0.00
(b) Scale-free Nonlinear (Part 2/2).
Model #Edges \downarrow A-AID \downarrow P-AID \downarrow OSET-AID \downarrow A-AID (CO)
ARCO-K1-GP 8 ±plus-or-minus\pm± 1 0.21 ±plus-or-minus\pm± 0.03 0.50 ±plus-or-minus\pm± 0.03 0.21 ±plus-or-minus\pm± 0.02 0.36 ±plus-or-minus\pm± 0.06
ARCO-K2-GP 20 ±plus-or-minus\pm± 2 0.15 ±plus-or-minus\pm± 0.02 0.32 ±plus-or-minus\pm± 0.04 0.18 ±plus-or-minus\pm± 0.03 0.23 ±plus-or-minus\pm± 0.05
ARCO-K3-GP 29 ±plus-or-minus\pm± 2 0.12 ±plus-or-minus\pm± 0.02 0.23 ±plus-or-minus\pm± 0.03 0.16 ±plus-or-minus\pm± 0.03 0.19 ±plus-or-minus\pm± 0.05
ARCO-K4-GP 32 ±plus-or-minus\pm± 3 0.10 ±plus-or-minus\pm± 0.03 0.21 ±plus-or-minus\pm± 0.04 0.15 ±plus-or-minus\pm± 0.04 0.18 ±plus-or-minus\pm± 0.06
(c) Erdös-Rényi Nonlinear (Part 1/2).
Model \downarrow ESHD \uparrow AUROC \uparrow AUPRC \uparrow TPR \uparrow TNR
ARCO-K1-GP 32 ±plus-or-minus\pm± 3 0.79 ±plus-or-minus\pm± 0.03 0.54 ±plus-or-minus\pm± 0.04 0.19 ±plus-or-minus\pm± 0.03 1.00 ±plus-or-minus\pm± 0.00
ARCO-K2-GP 21 ±plus-or-minus\pm± 3 0.84 ±plus-or-minus\pm± 0.03 0.68 ±plus-or-minus\pm± 0.05 0.49 ±plus-or-minus\pm± 0.05 1.00 ±plus-or-minus\pm± 0.00
ARCO-K3-GP 17 ±plus-or-minus\pm± 2 0.87 ±plus-or-minus\pm± 0.02 0.76 ±plus-or-minus\pm± 0.04 0.65 ±plus-or-minus\pm± 0.05 0.99 ±plus-or-minus\pm± 0.00
ARCO-K4-GP 17 ±plus-or-minus\pm± 3 0.88 ±plus-or-minus\pm± 0.03 0.79 ±plus-or-minus\pm± 0.04 0.70 ±plus-or-minus\pm± 0.05 0.99 ±plus-or-minus\pm± 0.00
(d) Erdös-Rényi Nonlinear (Part 2/2).
Table 3: Benchmarks on systems with 50 variables. Ablation studies on simulated non-linear ground truth models with 50505050 nodes and different DAG structures. We report means and 95%percent9595\%95 % confidence intervals (CIs) across 20202020 different ground truth models.
Model #Edges \downarrow A-AID \downarrow P-AID \downarrow OSET-AID
ARCO-GP 91 ±plus-or-minus\pm± 1 0.03 ±plus-or-minus\pm± 0.01 0.06 ±plus-or-minus\pm± 0.02 0.04 ±plus-or-minus\pm± 0.02
BAYESDAG 237 ±plus-or-minus\pm± 5 0.25 ±plus-or-minus\pm± 0.02 0.57 ±plus-or-minus\pm± 0.05 0.25 ±plus-or-minus\pm± 0.02
DAG-GNN 37 ±plus-or-minus\pm± 4 0.14 ±plus-or-minus\pm± 0.01 0.94 ±plus-or-minus\pm± 0.02 0.14 ±plus-or-minus\pm± 0.01
DDS 1082 ±plus-or-minus\pm± 26 0.39 ±plus-or-minus\pm± 0.01 0.42 ±plus-or-minus\pm± 0.01 0.39 ±plus-or-minus\pm± 0.01
DIBS-GP 70 ±plus-or-minus\pm± 7 0.14 ±plus-or-minus\pm± 0.01 0.82 ±plus-or-minus\pm± 0.04 0.14 ±plus-or-minus\pm± 0.01
GADGET 176 ±plus-or-minus\pm± 5 0.31 ±plus-or-minus\pm± 0.02 0.75 ±plus-or-minus\pm± 0.02 0.31 ±plus-or-minus\pm± 0.02
GES 220 ±plus-or-minus\pm± 11 0.38 ±plus-or-minus\pm± 0.05 0.71 ±plus-or-minus\pm± 0.05 0.39 ±plus-or-minus\pm± 0.05
GOLEM 56 ±plus-or-minus\pm± 5 0.16 ±plus-or-minus\pm± 0.02 0.88 ±plus-or-minus\pm± 0.03 0.17 ±plus-or-minus\pm± 0.02
GRASP 154 ±plus-or-minus\pm± 8 0.30 ±plus-or-minus\pm± 0.02 0.71 ±plus-or-minus\pm± 0.03 0.31 ±plus-or-minus\pm± 0.02
PC 108 ±plus-or-minus\pm± 3
(a) Scale-free Nonlinear (Part 1/2).
Model \downarrow ESHD \uparrow AUROC \uparrow AUPRC \uparrow TPR \uparrow TNR
ARCO-GP 6 ±plus-or-minus\pm± 2 0.98 ±plus-or-minus\pm± 0.01 0.95 ±plus-or-minus\pm± 0.03 0.94 ±plus-or-minus\pm± 0.02 1.00 ±plus-or-minus\pm± 0.00
BAYESDAG 220 ±plus-or-minus\pm± 11 0.77 ±plus-or-minus\pm± 0.03 0.40 ±plus-or-minus\pm± 0.03 0.59 ±plus-or-minus\pm± 0.04 0.92 ±plus-or-minus\pm± 0.00
DAG-GNN 90 ±plus-or-minus\pm± 4 0.61 ±plus-or-minus\pm± 0.02 0.42 ±plus-or-minus\pm± 0.03 0.22 ±plus-or-minus\pm± 0.03 0.99 ±plus-or-minus\pm± 0.00
DDS 1037 ±plus-or-minus\pm± 25 0.79 ±plus-or-minus\pm± 0.03 0.52 ±plus-or-minus\pm± 0.05 0.73 ±plus-or-minus\pm± 0.03 0.57 ±plus-or-minus\pm± 0.01
DIBS-GP 123 ±plus-or-minus\pm± 7 0.61 ±plus-or-minus\pm± 0.02 0.27 ±plus-or-minus\pm± 0.03 0.22 ±plus-or-minus\pm± 0.03 0.98 ±plus-or-minus\pm± 0.00
GADGET 191 ±plus-or-minus\pm± 7 0.79 ±plus-or-minus\pm± 0.02 0.30 ±plus-or-minus\pm± 0.03 0.42 ±plus-or-minus\pm± 0.02 0.94 ±plus-or-minus\pm± 0.00
GES 229 ±plus-or-minus\pm± 13 0.69 ±plus-or-minus\pm± 0.02 0.34 ±plus-or-minus\pm± 0.03 0.46 ±plus-or-minus\pm± 0.05 0.92 ±plus-or-minus\pm± 0.00
GOLEM 94 ±plus-or-minus\pm± 6 0.65 ±plus-or-minus\pm± 0.02 0.43 ±plus-or-minus\pm± 0.04 0.30 ±plus-or-minus\pm± 0.03 0.99 ±plus-or-minus\pm± 0.00
GRASP 160 ±plus-or-minus\pm± 9 0.72 ±plus-or-minus\pm± 0.01 0.41 ±plus-or-minus\pm± 0.02 0.49 ±plus-or-minus\pm± 0.02 0.95 ±plus-or-minus\pm± 0.00
PC 156 ±plus-or-minus\pm± 5 0.61 ±plus-or-minus\pm± 0.01 0.25 ±plus-or-minus\pm± 0.02 0.25 ±plus-or-minus\pm± 0.02 0.96 ±plus-or-minus\pm± 0.00
(b) Scale-free Nonlinear (Part 2/2).
Model #Edges \downarrow A-AID \downarrow P-AID \downarrow OSET-AID
ARCO-GP 48 ±plus-or-minus\pm± 3 0.09 ±plus-or-minus\pm± 0.01 0.26 ±plus-or-minus\pm± 0.02 0.09 ±plus-or-minus\pm± 0.01
BAYESDAG 195 ±plus-or-minus\pm± 4 0.17 ±plus-or-minus\pm± 0.03 0.48 ±plus-or-minus\pm± 0.06 0.18 ±plus-or-minus\pm± 0.03
DAG-GNN 34 ±plus-or-minus\pm± 3 0.14 ±plus-or-minus\pm± 0.01 0.57 ±plus-or-minus\pm± 0.04 0.14 ±plus-or-minus\pm± 0.01
DDS 928 ±plus-or-minus\pm± 30 0.37 ±plus-or-minus\pm± 0.03 0.45 ±plus-or-minus\pm± 0.03 0.38 ±plus-or-minus\pm± 0.03
DIBS-GP 33 ±plus-or-minus\pm± 8 0.12 ±plus-or-minus\pm± 0.01 0.43 ±plus-or-minus\pm± 0.04 0.13 ±plus-or-minus\pm± 0.01
GADGET 121 ±plus-or-minus\pm± 4 0.25 ±plus-or-minus\pm± 0.02 0.50 ±plus-or-minus\pm± 0.05 0.26 ±plus-or-minus\pm± 0.02
GES 134 ±plus-or-minus\pm± 7 0.25 ±plus-or-minus\pm± 0.03 0.45 ±plus-or-minus\pm± 0.04 0.27 ±plus-or-minus\pm± 0.03
GOLEM 52 ±plus-or-minus\pm± 5 0.16 ±plus-or-minus\pm± 0.02 0.54 ±plus-or-minus\pm± 0.03 0.16 ±plus-or-minus\pm± 0.01
GRASP 92 ±plus-or-minus\pm± 5 0.18 ±plus-or-minus\pm± 0.02 0.40 ±plus-or-minus\pm± 0.05 0.19 ±plus-or-minus\pm± 0.02
PC 92 ±plus-or-minus\pm± 3
(c) Erdös-Rényi Nonlinear (Part 1/2).
Model \downarrow ESHD \uparrow AUROC \uparrow AUPRC \uparrow TPR \uparrow TNR
ARCO-GP 56 ±plus-or-minus\pm± 4 0.77 ±plus-or-minus\pm± 0.02 0.57 ±plus-or-minus\pm± 0.04 0.46 ±plus-or-minus\pm± 0.03 1.00 ±plus-or-minus\pm± 0.00
BAYESDAG 213 ±plus-or-minus\pm± 9 0.68 ±plus-or-minus\pm± 0.02 0.29 ±plus-or-minus\pm± 0.02 0.41 ±plus-or-minus\pm± 0.04 0.93 ±plus-or-minus\pm± 0.00
DAG-GNN 110 ±plus-or-minus\pm± 4 0.55 ±plus-or-minus\pm± 0.01 0.25 ±plus-or-minus\pm± 0.03 0.12 ±plus-or-minus\pm± 0.01 0.99 ±plus-or-minus\pm± 0.00
DDS 900 ±plus-or-minus\pm± 26 0.75 ±plus-or-minus\pm± 0.03 0.44 ±plus-or-minus\pm± 0.05 0.64 ±plus-or-minus\pm± 0.04 0.63 ±plus-or-minus\pm± 0.01
DIBS-GP 105 ±plus-or-minus\pm± 5 0.56 ±plus-or-minus\pm± 0.02 0.30 ±plus-or-minus\pm± 0.04 0.13 ±plus-or-minus\pm± 0.03 0.99 ±plus-or-minus\pm± 0.00
GADGET 130 ±plus-or-minus\pm± 4 0.84 ±plus-or-minus\pm± 0.01 0.46 ±plus-or-minus\pm± 0.03 0.45 ±plus-or-minus\pm± 0.02 0.97 ±plus-or-minus\pm± 0.00
GES 122 ±plus-or-minus\pm± 9 0.77 ±plus-or-minus\pm± 0.02 0.51 ±plus-or-minus\pm± 0.03 0.58 ±plus-or-minus\pm± 0.03 0.97 ±plus-or-minus\pm± 0.00
GOLEM 111 ±plus-or-minus\pm± 4 0.59 ±plus-or-minus\pm± 0.01 0.31 ±plus-or-minus\pm± 0.03 0.20 ±plus-or-minus\pm± 0.02 0.99 ±plus-or-minus\pm± 0.00
GRASP 80 ±plus-or-minus\pm± 6 0.79 ±plus-or-minus\pm± 0.01 0.61 ±plus-or-minus\pm± 0.03 0.60 ±plus-or-minus\pm± 0.03 0.98 ±plus-or-minus\pm± 0.00
PC 112 ±plus-or-minus\pm± 5 0.69 ±plus-or-minus\pm± 0.01 0.43 ±plus-or-minus\pm± 0.02 0.40 ±plus-or-minus\pm± 0.02 0.98 ±plus-or-minus\pm± 0.00
(d) Erdös-Rényi Nonlinear (Part 2/2).
Table 4: Nonlinear Average Causal Effect Estimation (ARCO-GP). We report the mean absolute errors and 95%percent9595\%95 % confidence intervals (CIs) between predicted and true average causal effects 𝔼Xj|do(Xi=1)[Xj]subscript𝔼conditionalsubscript𝑋𝑗𝑑𝑜subscript𝑋𝑖1delimited-[]subscript𝑋𝑗\mathbb{E}_{X_{j}\,|\,do(X_{i}=1)}\left[X_{j}\right]blackboard_E start_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT | italic_d italic_o ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 1 ) end_POSTSUBSCRIPT [ italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ]. The interventions do(Xi=1)𝑑𝑜subscript𝑋𝑖1do(X_{i}=1)italic_d italic_o ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 1 ) are listed per column and the targets Xjsubscript𝑋𝑗X_{j}italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT listed per row. The results stem from simulations on 10101010 different ground truth models with fixed graph (see Figure 4) and simulated nonlinear mechanisms.
do(X0 = 1) do(X1 = 1) do(X2 = 1) do(X3 = 1) do(X4 = 1) do(X5 = 1)
X0 0.00 ±plus-or-minus\pm± 0.00 0.06 ±plus-or-minus\pm± 0.02 0.06 ±plus-or-minus\pm± 0.02 0.06 ±plus-or-minus\pm± 0.02 0.06 ±plus-or-minus\pm± 0.02 0.06 ±plus-or-minus\pm± 0.02
X1 0.07 ±plus-or-minus\pm± 0.02 0.00 ±plus-or-minus\pm± 0.00 0.09 ±plus-or-minus\pm± 0.02 0.08 ±plus-or-minus\pm± 0.03 0.09 ±plus-or-minus\pm± 0.03 0.09 ±plus-or-minus\pm± 0.03
X2 0.18 ±plus-or-minus\pm± 0.09 0.25 ±plus-or-minus\pm± 0.08 0.00 ±plus-or-minus\pm± 0.00 0.10 ±plus-or-minus\pm± 0.04 0.12 ±plus-or-minus\pm± 0.03 0.10 ±plus-or-minus\pm± 0.04
X3 0.11 ±plus-or-minus\pm± 0.03 0.44 ±plus-or-minus\pm± 0.23 0.06 ±plus-or-minus\pm± 0.02 0.00 ±plus-or-minus\pm± 0.00 0.06 ±plus-or-minus\pm± 0.02 0.06 ±plus-or-minus\pm± 0.02
X4 0.13 ±plus-or-minus\pm± 0.05 0.30 ±plus-or-minus\pm± 0.12 0.08 ±plus-or-minus\pm± 0.03 0.08 ±plus-or-minus\pm± 0.02 0.00 ±plus-or-minus\pm± 0.00 0.06 ±plus-or-minus\pm± 0.02
X5 0.26 ±plus-or-minus\pm± 0.06 0.44 ±plus-or-minus\pm± 0.21 0.04 ±plus-or-minus\pm± 0.01 0.04 ±plus-or-minus\pm± 0.01 0.34 ±plus-or-minus\pm± 0.11 0.00 ±plus-or-minus\pm± 0.00
X6 0.22 ±plus-or-minus\pm± 0.08 0.27 ±plus-or-minus\pm± 0.12 0.06 ±plus-or-minus\pm± 0.03 0.06 ±plus-or-minus\pm± 0.03 0.14 ±plus-or-minus\pm± 0.04 0.10 ±plus-or-minus\pm± 0.03
X7 0.12 ±plus-or-minus\pm± 0.04 0.25 ±plus-or-minus\pm± 0.14 0.09 ±plus-or-minus\pm± 0.03 0.09 ±plus-or-minus\pm± 0.03 0.08 ±plus-or-minus\pm± 0.02 0.09 ±plus-or-minus\pm± 0.04
X8 0.06 ±plus-or-minus\pm± 0.02 0.06 ±plus-or-minus\pm± 0.02 0.06 ±plus-or-minus\pm± 0.02 0.06 ±plus-or-minus\pm± 0.02 0.06 ±plus-or-minus\pm± 0.02 0.06 ±plus-or-minus\pm± 0.02
X9 0.07 ±plus-or-minus\pm± 0.03 0.06 ±plus-or-minus\pm± 0.03 0.07 ±plus-or-minus\pm± 0.03 0.07 ±plus-or-minus\pm± 0.03 0.07 ±plus-or-minus\pm± 0.03 0.07 ±plus-or-minus\pm± 0.03
X10 0.06 ±plus-or-minus\pm± 0.02 0.06 ±plus-or-minus\pm± 0.02 0.06 ±plus-or-minus\pm± 0.03 0.06 ±plus-or-minus\pm± 0.03 0.07 ±plus-or-minus\pm± 0.02 0.07 ±plus-or-minus\pm± 0.02
AVG 0.12 ±plus-or-minus\pm± 0.02 0.20 ±plus-or-minus\pm± 0.03 0.06 ±plus-or-minus\pm± 0.01 0.06 ±plus-or-minus\pm± 0.01 0.10 ±plus-or-minus\pm± 0.01 0.07 ±plus-or-minus\pm± 0.01
do(X6 = 1) do(X7 = 1) do(X8 = 1) do(X9 = 1) do(X10 = 1)
X0 0.06 ±plus-or-minus\pm± 0.02 0.06 ±plus-or-minus\pm± 0.02 0.06 ±plus-or-minus\pm± 0.02 0.06 ±plus-or-minus\pm± 0.02 0.06 ±plus-or-minus\pm± 0.02
X1 0.12 ±plus-or-minus\pm± 0.03 0.10 ±plus-or-minus\pm± 0.02 0.10 ±plus-or-minus\pm± 0.03 0.09 ±plus-or-minus\pm± 0.03 0.09 ±plus-or-minus\pm± 0.03
X2 0.12 ±plus-or-minus\pm± 0.03 0.11 ±plus-or-minus\pm± 0.03 0.11 ±plus-or-minus\pm± 0.03 0.11 ±plus-or-minus\pm± 0.04 0.11 ±plus-or-minus\pm± 0.04
X3 0.07 ±plus-or-minus\pm± 0.02 0.07 ±plus-or-minus\pm± 0.02 0.07 ±plus-or-minus\pm± 0.02 0.07 ±plus-or-minus\pm± 0.02 0.07 ±plus-or-minus\pm± 0.02
X4 0.08 ±plus-or-minus\pm± 0.03 0.08 ±plus-or-minus\pm± 0.02 0.08 ±plus-or-minus\pm± 0.02 0.08 ±plus-or-minus\pm± 0.02 0.08 ±plus-or-minus\pm± 0.02
X5 0.04 ±plus-or-minus\pm± 0.01 0.04 ±plus-or-minus\pm± 0.01 0.05 ±plus-or-minus\pm± 0.01 0.05 ±plus-or-minus\pm± 0.01 0.04 ±plus-or-minus\pm± 0.01
X6 0.00 ±plus-or-minus\pm± 0.00 0.06 ±plus-or-minus\pm± 0.03 0.07 ±plus-or-minus\pm± 0.03 0.06 ±plus-or-minus\pm± 0.03 0.06 ±plus-or-minus\pm± 0.03
X7 0.15 ±plus-or-minus\pm± 0.05 0.00 ±plus-or-minus\pm± 0.00 0.09 ±plus-or-minus\pm± 0.03 0.09 ±plus-or-minus\pm± 0.03 0.09 ±plus-or-minus\pm± 0.03
X8 0.06 ±plus-or-minus\pm± 0.02 0.06 ±plus-or-minus\pm± 0.02 0.00 ±plus-or-minus\pm± 0.00 0.06 ±plus-or-minus\pm± 0.02 0.06 ±plus-or-minus\pm± 0.02
X9 0.07 ±plus-or-minus\pm± 0.03 0.06 ±plus-or-minus\pm± 0.03 0.26 ±plus-or-minus\pm± 0.17 0.00 ±plus-or-minus\pm± 0.00 0.07 ±plus-or-minus\pm± 0.03
X10 0.06 ±plus-or-minus\pm± 0.02 0.06 ±plus-or-minus\pm± 0.03 0.15 ±plus-or-minus\pm± 0.08 0.12 ±plus-or-minus\pm± 0.07 0.00 ±plus-or-minus\pm± 0.00
AVG 0.07 ±plus-or-minus\pm± 0.01 0.06 ±plus-or-minus\pm± 0.01 0.09 ±plus-or-minus\pm± 0.02 0.07 ±plus-or-minus\pm± 0.01 0.07 ±plus-or-minus\pm± 0.01
Table 5: Nonlinear Average Causal Effect Estimation (DIBS). We report the mean absolute errors and 95%percent9595\%95 % confidence intervals (CIs) between predicted and true average causal effects 𝔼Xj|do(Xi=1)[Xj]subscript𝔼conditionalsubscript𝑋𝑗𝑑𝑜subscript𝑋𝑖1delimited-[]subscript𝑋𝑗\mathbb{E}_{X_{j}\,|\,do(X_{i}=1)}\left[X_{j}\right]blackboard_E start_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT | italic_d italic_o ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 1 ) end_POSTSUBSCRIPT [ italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ]. The interventions do(Xi=1)𝑑𝑜subscript𝑋𝑖1do(X_{i}=1)italic_d italic_o ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 1 ) are listed per column and the targets Xjsubscript𝑋𝑗X_{j}italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT listed per row. The results stem from simulations on 10101010 different ground truth models with fixed graph (see Figure 4) and simulated nonlinear mechanisms.
do(X0 = 1) do(X1 = 1) do(X2 = 1) do(X3 = 1) do(X4 = 1) do(X5 = 1)
X0 0.00 ±plus-or-minus\pm± 0.00 0.30 ±plus-or-minus\pm± 0.28 0.23 ±plus-or-minus\pm± 0.27 0.20 ±plus-or-minus\pm± 0.22 0.10 ±plus-or-minus\pm± 0.10 0.06 ±plus-or-minus\pm± 0.03
X1 0.40 ±plus-or-minus\pm± 0.33 0.00 ±plus-or-minus\pm± 0.00 0.19 ±plus-or-minus\pm± 0.28 0.09 ±plus-or-minus\pm± 0.05 0.12 ±plus-or-minus\pm± 0.10 0.07 ±plus-or-minus\pm± 0.04
X2 0.45 ±plus-or-minus\pm± 0.22 0.34 ±plus-or-minus\pm± 0.20 0.00 ±plus-or-minus\pm± 0.00 0.12 ±plus-or-minus\pm± 0.09 0.09 ±plus-or-minus\pm± 0.05 0.09 ±plus-or-minus\pm± 0.06
X3 0.23 ±plus-or-minus\pm± 0.23 0.55 ±plus-or-minus\pm± 0.38 0.17 ±plus-or-minus\pm± 0.25 0.00 ±plus-or-minus\pm± 0.00 0.07 ±plus-or-minus\pm± 0.03 0.10 ±plus-or-minus\pm± 0.06
X4 0.24 ±plus-or-minus\pm± 0.11 0.49 ±plus-or-minus\pm± 0.23 0.20 ±plus-or-minus\pm± 0.19 0.10 ±plus-or-minus\pm± 0.06 0.00 ±plus-or-minus\pm± 0.00 0.14 ±plus-or-minus\pm± 0.07
X5 0.38 ±plus-or-minus\pm± 0.10 0.50 ±plus-or-minus\pm± 0.38 0.14 ±plus-or-minus\pm± 0.19 0.07 ±plus-or-minus\pm± 0.04 0.32 ±plus-or-minus\pm± 0.20 0.00 ±plus-or-minus\pm± 0.00
X6 0.37 ±plus-or-minus\pm± 0.21 0.59 ±plus-or-minus\pm± 0.26 0.06 ±plus-or-minus\pm± 0.04 0.05 ±plus-or-minus\pm± 0.03 0.20 ±plus-or-minus\pm± 0.13 0.42 ±plus-or-minus\pm± 0.23
X7 0.27 ±plus-or-minus\pm± 0.15 0.43 ±plus-or-minus\pm± 0.22 0.11 ±plus-or-minus\pm± 0.07 0.10 ±plus-or-minus\pm± 0.06 0.12 ±plus-or-minus\pm± 0.09 0.17 ±plus-or-minus\pm± 0.11
X8 0.07 ±plus-or-minus\pm± 0.04 0.07 ±plus-or-minus\pm± 0.04 0.07 ±plus-or-minus\pm± 0.05 0.07 ±plus-or-minus\pm± 0.04 0.07 ±plus-or-minus\pm± 0.04 0.08 ±plus-or-minus\pm± 0.05
X9 0.07 ±plus-or-minus\pm± 0.04 0.07 ±plus-or-minus\pm± 0.05 0.08 ±plus-or-minus\pm± 0.04 0.07 ±plus-or-minus\pm± 0.04 0.08 ±plus-or-minus\pm± 0.04 0.07 ±plus-or-minus\pm± 0.05
X10 0.04 ±plus-or-minus\pm± 0.04 0.05 ±plus-or-minus\pm± 0.03 0.07 ±plus-or-minus\pm± 0.05 0.06 ±plus-or-minus\pm± 0.04 0.04 ±plus-or-minus\pm± 0.03 0.05 ±plus-or-minus\pm± 0.04
AVG 0.23 ±plus-or-minus\pm± 0.08 0.31 ±plus-or-minus\pm± 0.10 0.12 ±plus-or-minus\pm± 0.10 0.08 ±plus-or-minus\pm± 0.04 0.11 ±plus-or-minus\pm± 0.02 0.11 ±plus-or-minus\pm± 0.03
do(X6 = 1) do(X7 = 1) do(X8 = 1) do(X9 = 1) do(X10 = 1)
X0 0.06 ±plus-or-minus\pm± 0.03 0.07 ±plus-or-minus\pm± 0.03 0.08 ±plus-or-minus\pm± 0.04 0.07 ±plus-or-minus\pm± 0.02 0.05 ±plus-or-minus\pm± 0.03
X1 0.10 ±plus-or-minus\pm± 0.05 0.08 ±plus-or-minus\pm± 0.05 0.08 ±plus-or-minus\pm± 0.05 0.07 ±plus-or-minus\pm± 0.04 0.09 ±plus-or-minus\pm± 0.04
X2 0.10 ±plus-or-minus\pm± 0.05 0.10 ±plus-or-minus\pm± 0.05 0.10 ±plus-or-minus\pm± 0.06 0.10 ±plus-or-minus\pm± 0.05 0.09 ±plus-or-minus\pm± 0.05
X3 0.08 ±plus-or-minus\pm± 0.03 0.08 ±plus-or-minus\pm± 0.03 0.09 ±plus-or-minus\pm± 0.04 0.08 ±plus-or-minus\pm± 0.03 0.08 ±plus-or-minus\pm± 0.03
X4 0.11 ±plus-or-minus\pm± 0.08 0.09 ±plus-or-minus\pm± 0.04 0.08 ±plus-or-minus\pm± 0.04 0.07 ±plus-or-minus\pm± 0.04 0.08 ±plus-or-minus\pm± 0.05
X5 0.20 ±plus-or-minus\pm± 0.16 0.06 ±plus-or-minus\pm± 0.04 0.07 ±plus-or-minus\pm± 0.03 0.07 ±plus-or-minus\pm± 0.03 0.05 ±plus-or-minus\pm± 0.03
X6 0.00 ±plus-or-minus\pm± 0.00 0.10 ±plus-or-minus\pm± 0.11 0.06 ±plus-or-minus\pm± 0.04 0.04 ±plus-or-minus\pm± 0.03 0.04 ±plus-or-minus\pm± 0.02
X7 0.27 ±plus-or-minus\pm± 0.13 0.00 ±plus-or-minus\pm± 0.00 0.10 ±plus-or-minus\pm± 0.06 0.09 ±plus-or-minus\pm± 0.05 0.10 ±plus-or-minus\pm± 0.06
X8 0.07 ±plus-or-minus\pm± 0.04 0.07 ±plus-or-minus\pm± 0.04 0.00 ±plus-or-minus\pm± 0.00 0.15 ±plus-or-minus\pm± 0.21 0.14 ±plus-or-minus\pm± 0.19
X9 0.08 ±plus-or-minus\pm± 0.04 0.07 ±plus-or-minus\pm± 0.05 0.40 ±plus-or-minus\pm± 0.27 0.00 ±plus-or-minus\pm± 0.00 0.14 ±plus-or-minus\pm± 0.12
X10 0.06 ±plus-or-minus\pm± 0.04 0.07 ±plus-or-minus\pm± 0.05 0.29 ±plus-or-minus\pm± 0.16 0.44 ±plus-or-minus\pm± 0.34 0.00 ±plus-or-minus\pm± 0.00
AVG 0.10 ±plus-or-minus\pm± 0.02 0.07 ±plus-or-minus\pm± 0.02 0.12 ±plus-or-minus\pm± 0.04 0.11 ±plus-or-minus\pm± 0.04 0.08 ±plus-or-minus\pm± 0.02
Table 6: Nonlinear Average Causal Effect Estimation (GADGET + BEEPS). We report the mean absolute errors and 95%percent9595\%95 % confidence intervals (CIs) between predicted and true average causal effects 𝔼Xj|do(Xi=1)[Xj]subscript𝔼conditionalsubscript𝑋𝑗𝑑𝑜subscript𝑋𝑖1delimited-[]subscript𝑋𝑗\mathbb{E}_{X_{j}\,|\,do(X_{i}=1)}\left[X_{j}\right]blackboard_E start_POSTSUBSCRIPT italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT | italic_d italic_o ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 1 ) end_POSTSUBSCRIPT [ italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ]. The interventions do(Xi=1)𝑑𝑜subscript𝑋𝑖1do(X_{i}=1)italic_d italic_o ( italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = 1 ) are listed per column and the targets Xjsubscript𝑋𝑗X_{j}italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT listed per row. The results stem from simulations on 10101010 different ground truth models with fixed graph (see Figure 4) and simulated nonlinear mechanisms.
do(X0 = 1) do(X1 = 1) do(X2 = 1) do(X3 = 1) do(X4 = 1) do(X5 = 1)
X0 0.00 ±plus-or-minus\pm± 0.00 0.38 ±plus-or-minus\pm± 0.17 0.27 ±plus-or-minus\pm± 0.13 0.34 ±plus-or-minus\pm± 0.17 0.24 ±plus-or-minus\pm± 0.14 0.15 ±plus-or-minus\pm± 0.12
X1 0.74 ±plus-or-minus\pm± 0.30 0.00 ±plus-or-minus\pm± 0.00 0.20 ±plus-or-minus\pm± 0.13 0.15 ±plus-or-minus\pm± 0.08 0.22 ±plus-or-minus\pm± 0.13 0.26 ±plus-or-minus\pm± 0.16
X2 0.49 ±plus-or-minus\pm± 0.28 0.31 ±plus-or-minus\pm± 0.20 0.00 ±plus-or-minus\pm± 0.00 0.16 ±plus-or-minus\pm± 0.11 0.14 ±plus-or-minus\pm± 0.14 0.21 ±plus-or-minus\pm± 0.10
X3 0.49 ±plus-or-minus\pm± 0.23 0.87 ±plus-or-minus\pm± 0.44 0.09 ±plus-or-minus\pm± 0.03 0.00 ±plus-or-minus\pm± 0.00 0.13 ±plus-or-minus\pm± 0.08 0.15 ±plus-or-minus\pm± 0.10
X4 0.43 ±plus-or-minus\pm± 0.28 0.74 ±plus-or-minus\pm± 0.57 0.14 ±plus-or-minus\pm± 0.09 0.15 ±plus-or-minus\pm± 0.06 0.00 ±plus-or-minus\pm± 0.00 0.20 ±plus-or-minus\pm± 0.12
X5 0.54 ±plus-or-minus\pm± 0.27 0.47 ±plus-or-minus\pm± 0.31 0.07 ±plus-or-minus\pm± 0.04 0.10 ±plus-or-minus\pm± 0.07 0.32 ±plus-or-minus\pm± 0.20 0.00 ±plus-or-minus\pm± 0.00
X6 0.38 ±plus-or-minus\pm± 0.17 0.48 ±plus-or-minus\pm± 0.12 0.13 ±plus-or-minus\pm± 0.10 0.09 ±plus-or-minus\pm± 0.04 0.23 ±plus-or-minus\pm± 0.13 0.33 ±plus-or-minus\pm± 0.16
X7 0.30 ±plus-or-minus\pm± 0.23 0.43 ±plus-or-minus\pm± 0.24 0.09 ±plus-or-minus\pm± 0.06 0.13 ±plus-or-minus\pm± 0.11 0.12 ±plus-or-minus\pm± 0.09 0.17 ±plus-or-minus\pm± 0.13
X8 0.06 ±plus-or-minus\pm± 0.04 0.06 ±plus-or-minus\pm± 0.03 0.06 ±plus-or-minus\pm± 0.04 0.06 ±plus-or-minus\pm± 0.04 0.06 ±plus-or-minus\pm± 0.04 0.06 ±plus-or-minus\pm± 0.04
X9 0.08 ±plus-or-minus\pm± 0.05 0.07 ±plus-or-minus\pm± 0.05 0.08 ±plus-or-minus\pm± 0.05 0.07 ±plus-or-minus\pm± 0.05 0.08 ±plus-or-minus\pm± 0.05 0.08 ±plus-or-minus\pm± 0.05
X10 0.03 ±plus-or-minus\pm± 0.01 0.03 ±plus-or-minus\pm± 0.01 0.03 ±plus-or-minus\pm± 0.01 0.03 ±plus-or-minus\pm± 0.01 0.03 ±plus-or-minus\pm± 0.01 0.03 ±plus-or-minus\pm± 0.01
AVG 0.32 ±plus-or-minus\pm± 0.10 0.35 ±plus-or-minus\pm± 0.11 0.11 ±plus-or-minus\pm± 0.03 0.12 ±plus-or-minus\pm± 0.02 0.14 ±plus-or-minus\pm± 0.03 0.15 ±plus-or-minus\pm± 0.05
do(X6 = 1) do(X7 = 1) do(X8 = 1) do(X9 = 1) do(X10 = 1)
X0 0.08 ±plus-or-minus\pm± 0.05 0.09 ±plus-or-minus\pm± 0.05 0.06 ±plus-or-minus\pm± 0.03 0.06 ±plus-or-minus\pm± 0.03 0.06 ±plus-or-minus\pm± 0.02
X1 0.18 ±plus-or-minus\pm± 0.11 0.15 ±plus-or-minus\pm± 0.07 0.10 ±plus-or-minus\pm± 0.04 0.10 ±plus-or-minus\pm± 0.04 0.10 ±plus-or-minus\pm± 0.04
X2 0.09 ±plus-or-minus\pm± 0.04 0.12 ±plus-or-minus\pm± 0.06 0.09 ±plus-or-minus\pm± 0.05 0.09 ±plus-or-minus\pm± 0.05 0.08 ±plus-or-minus\pm± 0.04
X3 0.10 ±plus-or-minus\pm± 0.05 0.12 ±plus-or-minus\pm± 0.06 0.08 ±plus-or-minus\pm± 0.02 0.08 ±plus-or-minus\pm± 0.03 0.08 ±plus-or-minus\pm± 0.03
X4 0.12 ±plus-or-minus\pm± 0.07 0.13 ±plus-or-minus\pm± 0.05 0.09 ±plus-or-minus\pm± 0.03 0.08 ±plus-or-minus\pm± 0.03 0.08 ±plus-or-minus\pm± 0.04
X5 0.17 ±plus-or-minus\pm± 0.11 0.12 ±plus-or-minus\pm± 0.07 0.04 ±plus-or-minus\pm± 0.02 0.04 ±plus-or-minus\pm± 0.02 0.04 ±plus-or-minus\pm± 0.02
X6 0.00 ±plus-or-minus\pm± 0.00 0.24 ±plus-or-minus\pm± 0.15 0.05 ±plus-or-minus\pm± 0.03 0.05 ±plus-or-minus\pm± 0.03 0.05 ±plus-or-minus\pm± 0.03
X7 0.32 ±plus-or-minus\pm± 0.09 0.00 ±plus-or-minus\pm± 0.00 0.08 ±plus-or-minus\pm± 0.04 0.08 ±plus-or-minus\pm± 0.04 0.08 ±plus-or-minus\pm± 0.04
X8 0.06 ±plus-or-minus\pm± 0.03 0.06 ±plus-or-minus\pm± 0.03 0.00 ±plus-or-minus\pm± 0.00 0.15 ±plus-or-minus\pm± 0.08 0.15 ±plus-or-minus\pm± 0.08
X9 0.07 ±plus-or-minus\pm± 0.05 0.08 ±plus-or-minus\pm± 0.05 0.51 ±plus-or-minus\pm± 0.20 0.00 ±plus-or-minus\pm± 0.00 0.26 ±plus-or-minus\pm± 0.13
X10 0.03 ±plus-or-minus\pm± 0.01 0.03 ±plus-or-minus\pm± 0.01 0.32 ±plus-or-minus\pm± 0.16 0.49 ±plus-or-minus\pm± 0.29 0.00 ±plus-or-minus\pm± 0.00
AVG 0.11 ±plus-or-minus\pm± 0.03 0.10 ±plus-or-minus\pm± 0.03 0.13 ±plus-or-minus\pm± 0.03 0.11 ±plus-or-minus\pm± 0.02 0.09 ±plus-or-minus\pm± 0.01