Effective Bayesian Causal Inference via
Structural Marginalisation and Autoregressive Orders

Christian Toth Christian Knoll Franz Pernkopf Robert Peharz

Abstract

Bayesian causal inference (BCI) naturally incorporates epistemic uncertainty about the true causal model into down-stream causal reasoning tasks by posterior averaging over causal models. However, this poses a tremendously hard computational problem due to the intractable number of causal structures to marginalise over. In this work, we decompose the structure learning problem into inferring (i) a causal order and (ii) a parent set for each variable given a causal order. By limiting the number of parents per variable, we can exactly marginalise over the parent sets in polynomial time, which leaves only the causal order to be marginalised. To this end, we propose a novel autoregressive model over causal orders (ARCO) learnable with gradient-based methods. Our method yields state-of-the-art in structure learning on simulated non-linear additive noise benchmarks with scale-free and Erdos-Renyi graph structures, and competitive results on real-world data. Moreover, we illustrate that our method accurately infers interventional distributions, which allows us to estimate posterior average causal effects and many other causal quantities of interest.

Bayesian causal inference, causal orders, Gaussian processes, causal discovery, causal structure learning, permutations, orderings

1 Introduction

Refer to caption — Figure 1: Causal discovery on nonlinear additive noise models. Structure learning results in terms of *expected Hamming distance* (ESHD) and *ancestor adjustment identification distance* (A-AID) on simulated non-linear models with scale-free (left, blue) and Erdös-Renyi (right, orange) graphs, each with $20$ nodes and $200$ data samples. Whiskers indicate maximum, minimum and median values across $20$ simulated ground truth instances. For both metrics lower is better. Range for ESHD is set for better readability, omitting the result for DDS ( $>125$ ).

Few topics in science and philosophy have been as controversial as the discussion about the nature of causality. Interestingly, the discussion becomes relatively benign, from a philosophical perspective, as soon as one agrees on a well-defined mathematical model of causality, such as a structural causal model (SCM) (Pearl, 2009). A fully specified SCM induces a wide range of joint probability distributions, organised on three levels denoted as association, intervention and counterfactuals, collectively forming the so-called Pearl’s causal hierarchy (Bareinboim et al., 2022). Any causal question then elegantly reduces to standard probabilistic inference in some of these joint distributions. Hence, assuming that the data comes from some model within a considered class of SCMs, causal questions reduce, in principle, to epistemic questions, i.e., questions about what and how much is known about the model.

The view that causal questions reduce to epistemic questions about the true SCM naturally invites a Bayesian treatment, i.e., specifying a prior distribution over SCMs and inferring the posterior given collected data. Bayesian approaches to causal discovery, i.e., learning the causal structure (represented as directed acyclic graph, DAG), dates back to work as early as (Heckerman, 1995; Murphy, 2001; Tong & Koller, 2001). More intriguing, however, is to use a Bayesian approach over entire SCMs for the purpose of Bayesian causal inference (BCI), i.e., consistent reasoning about a down-stream causal query: usually, one is less interested in the SCM per se, but rather in some causal quantity, such as the existence of a direct causal connection (edge in the DAG), or the average causal effect one variable has on another. The goal of BCI is then to reason about such down-stream tasks by computing a Bayesian average over the whole posterior over SCMs.

^†^†footnotetext: Code available at: https://github.com/chritoth/bci-arco-gp

Most existing BCI approaches are restricted to linear Gaussian models (Geiger & Heckerman, 1994; Viinikka et al., 2020; Pensar et al., 2020; Horii, 2021) or binary variables (Moffa et al., 2017; Kuipers et al., 2019), while a full Bayesian treatment of non-linear SCMs has been proposed only recently (Toth et al., 2022; Tigas et al., 2022). Both approaches leverage the DiBS method (Lorch et al., 2021), a gradient-based framework for Bayesian inference over DAGs, and infer causal mechanisms conditionally on the DAG. While Tigas et al. (2022) use Bayesian neural nets to represent the causal mechanisms and require stochastic variational inference over their parameters, Toth et al. (2022) leverage Gaussian Processes (GPs) for the mechanisms, which are marginalised in closed-form. The latter effectively represents a form of Rao-Blackwellisation, reducing inference to the problem of marginalising over DAGs.¹¹1The term Rao-Blackwellisation is used in a generalised sense here, denoting a generally observed variance reduction in Monte Carlo estimates when “nuisance variables” are integrated out (Robert & Roberts, 2021).

In this paper, we enhance non-linear BCI and decompose structure inference to (i) inferring a causal order over the endogenous variables and (ii) inferring parent sets for each variable conditional on the order.²²2In this paper, we use the term causal order for the topological order of a causal DAG. We use this term to emphasise that this order per se might be an object of interest to the practitioner. We perform the latter by restricting the number of parents per variable to a fixed maximal number $K$ and exactly marginalising over all possible parent sets in polynomial ( $\mathcal{O}(d^{K})$ ) time, where $d$ is the number of endogenous variables. This technique is well-established in Markov chain Monte Carlo (MCMC) techniques for Bayesian structure learning (Koller & Friedman, 2003; Koivisto & Sood, 2004; Viinikka et al., 2020), but has—to the best of our knowledge—not been exploited in gradient-based structure learning so far. This structural marginalisation, in combination with the analytic treatment of the causal mechanisms via GPs (Toth et al., 2022), reduces the inference problem to the marginalisation w.r.t. causal orders. To this end, we propose a novel neural generative model over topological orders which can be understood as an Auto-Regressive distribution estimator (Larochelle & Murray, 2011), constrained to Causal Orders, hence we dub this model ARCO.

Figure 1 show results for simulated scale-free and Erdös-Renyi graphs, with non-linear causal mechanisms and additive Gaussian noise, demonstrating that ARCO-GP (meaning BCI via ARCO and GPs) is highly effective for causal discovery. In particular, for scale-free graphs, where the assumption of a restricted number of parents matches our modelling assumption, ARCO-GP clearly outperforms a wide range of competitor methods. Moreover, even when the assumptions are violated (in our case for Erdos-Renyi graphs), we demonstrate that ARCO-GP still outperforms the baselines, albeit with a smaller margin. Moreover, in Figure 2 we demonstrate ARCO-GP’s efficacy in down-stream BCI, by sampling from various interventional distributions. Specifically, by sampling from the whole Bayesian model (including causal structure and mechanisms) we effectively perform posterior marginalisation over SCMs. Detailed results, including further evaluation metrics and results on a real-world dataset (Sachs et al., 2005), are presented in Section Section 5 and Appendix C.

In summary, our main contributions are:

•

We propose an effective BCI framework which exactly marginalises both causal mechanisms and parent sets from the posterior over SCMs, reducing the hard BCI problem to inferring the causal (topological) order.
•

We propose ARCO, a neural autoregressive model over causal orders and devise a gradient-based learning scheme for it. While here ARCO is used mainly for causal reasoning, it might well find applications in many other domains requiring generative models over total orders (e.g., rankings).
•

We demonstrate in experiments that ARCO-GP sets state-of-the-art against a wide range of baselines when our model assumption are exactly matched. When the model assumptions are violated, we still outperform or are competitive with all baselines.

2 Background

Structural Causal Models. An SCM $\mathcal{M}$ over observed endogenous variables $\mathbf{X}=\{X_{1},\dots,X_{d}\}$ and unobserved exogenous variables $\mathbf{U}=\{U_{1},\dots,U_{d}\}$ consists of structural equations, or mechanisms,

X_{i}:=f_{i}(\mathbf{Pa}_{i},U_{i}),\qquad\quad\text{for}\quad i\in\{1,\dots,d\},

(1)

which assign the value of each $X_{i}$ as a deterministic function $f_{i}$ of its direct causes, or causal parents, $\mathbf{Pa}_{i}\subseteq\bm{X}\setminus\{X_{i}\}$ and an exogenous variable $U_{i}$ , and a joint distribution $p(\mathbf{U})$ over the exogenous variables. In this paper we assume that the exogenous variables are independent Gaussian and enter in additive fashion, i.e. $f_{i}(\mathbf{Pa}_{i},U_{i})=f_{i}(\mathbf{Pa}_{i})+U_{i}$ , thus implying causal sufficiency. Associated with each SCM is a directed graph $G$ induced by the set of parent sets $\textbf{Pa}=\{\textbf{Pa}_{i}\}_{i=1}^{d}$ with vertices $\bm{X}$ and edges $X_{j}\to X_{i}$ if and only if $X_{j}\in\mathbf{Pa}_{i}$ . Any SCM with an acyclic directed graph (DAG) then induces a unique observational distribution $p(\bm{X}\,|\,\mathcal{M})$ over the endogenous variables $\bm{X}$ , which is obtained as the pushforward measure of $p(\bm{U})$ through the causal mechanisms in Equation 1.

A (hard) intervention $do(\mathbf{W}=\mathbf{w})$ on a set of endogenous variables $\mathbf{W}\subset\mathbf{X}$ replaces the targeted mechanisms with constants $\mathbf{w}$ , resulting in a modified SCM. The entailed modified causal graph lacks the incoming edges into any intervention target. The pushforward through the modified SCM yields an interventional distribution $p(\mathbf{X}\,|\,do(\mathbf{W}=\mathbf{w}),\mathcal{M})$ .

Causal Orders. A permutation $L=\left\langle L_{1},\dots,L_{d}\right\rangle$ of the endogenous variables $\mathbf{X}=\bigcup_{i=1}^{d}\{L_{i}\}$ , where $L_{i}\neq L_{j}$ for all $i\neq j$ , entails a strict total order $L_{1}\prec L_{2}\prec\dots\prec L_{d}$ among the variables. Henceforth, we refer to such a permutation $L$ as a causal order. A causal order $L$ constrains the possible causal interactions between the variables, i.e., $X_{i}$ can be a (direct) cause of $X_{j}$ if, and only if, $X_{i}\prec X_{j}$ in $L$ . We define $L_{<k}=\left\langle L_{1},\dots,L_{k-1}\right\rangle$ to be the first $k-1$ elements in $L$ . Finally, let $\lambda^{L}\colon\mathbf{X}\mapsto\{1,\dots,d\}$ be the bijective mapping between $\bm{X}$ and indices in $L$ , i.e., $\lambda^{L}(X_{i})=k\iff L_{k}=X_{i}$ . We then denote by $Q^{L}\in\{0,1\}^{d\times d}$ the permutation matrix representing $L$ , where $Q^{L}_{ij}=1$ iff $\lambda^{L}(X_{i})=j$ .

3 Bayesian Causal Inference via Structural Marginalisation

Figure 3: Bayesian network view of our generative model. We characterise a Structural Causal Model (SCM) by a causal graph

G

, causal mechanisms

\mathbf{f}

and parameters

\bm{\psi}

of a joint distribution over mechanisms and exogenous variables

p(\mathbf{f},\mathbf{U}\,|\,\bm{\psi})

. We model the mechanisms

\mathbf{f}

using Gaussian Processes (GPs) and

\mathbf{U}

as additive Gaussian noise, implying that

\bm{\psi}

is a set of GP hyper-parameters. The SCM gives rise to the data-generating likelihood

p(\mathcal{D}\,|\,\mathbf{f},\bm{\psi},G)

and determines the (distribution over the) causal query

Y

. To sample a directed acyclic causal graph

G

, we first sample a causal order

L

from a neural autoregressive distribution over causal orders (ARCO)

p(L\,|\,\bm{\theta})

with parameters

\bm{\theta}

. Given a causal order, we can then sample or marginalise exactly over causal graphs

G

by limiting the maximum cardinality of parent sets.

When performing causal reasoning in the BCI framework, we define the causal quantity of our interest as $Y$ , called the causal query, which is a function of the SCM $\mathcal{M}$ (see (Toth et al., 2022)). This causal query could be, for example, an endogenous variable under some intervention, (features of) the true causal graph, or even the entire SCM. Since we are following a Bayesian approach, $\mathcal{M}$ is a random variable equipped with a prior distribution $p(\mathcal{M})$ , and hence also $Y$ is a random variable.

Given a set of (observational) data $\mathcal{D}=\{\mathbf{x}_{n}\overset{\text{i.i.d}}{\sim}p(\mathbf{X}\,|\,% \mathcal{M}^{*})\}_{n=1}^{N}$ collected from the true underlying SCM $\mathcal{M}^{*}$ , we are interested in inferring the posterior $p(Y\,|\,\mathcal{D})$ of the causal query, reflecting our uncertainty about $\mathcal{M}^{*}$ . Since the causal query $Y$ is determined by the SCM, the query posterior is given as

p(Y\,|\,\mathcal{D})=\mathbb{E}_{\mathcal{M}\,|\,\mathcal{D}}[\,p(Y\,|\,% \mathcal{M})],

(2)

marginalising over posterior SCMs. Our inference procedure, described in Algorithm 1 and the generative model over SCMs shown in Figure 3, divides into a parameter learning and an inference phase. In the learning phase (Algorithm 1, lines 1-1), we infer posterior parameters $p(\bm{\theta},\bm{\psi}\,|\,\mathcal{D})$ . In the inference phase (Algorithm 1, lines 1-1), we use samples drawn from the learned generative model to approximate the query posterior. We first provide an overview of the individual phases starting with the inference phase, as it motivates our learning objective. We elaborate on the more technical details in Sections 3.1, 3.2 and 3.3.

Inference Phase.

Assuming we have already learned a set of posterior parameters $\bm{\theta},\bm{\psi}$ approximating $p(\bm{\theta},\bm{\psi}\,|\,\mathcal{D})$ (see the learning phase), we want to employ our model to approximate the query posterior in Equation 2. A key observation similar in spirit to (Lorch et al., 2021; Toth et al., 2022) is that the expectation w.r.t. SCMs can be written as the following importance weighted expectation (for a derivation see Section A.1)

		$\displaystyle p(Y\,\|\,\mathcal{D})=\mathbb{E}_{\mathcal{M}\,\|\,\mathcal{D}}% \left[p(Y\,\|\,\mathcal{M})\right]$		(3)
		$\displaystyle=\mathbb{E}_{\bm{\theta},\bm{\psi}\,\|\,\mathcal{D}}\left[\mathbb{% E}_{L\,\|\,\bm{\theta}}\left[w^{L}\cdot\mathbb{E}_{G\,\|\,L,\bm{\psi},\mathcal{D% }}\left[\mathbb{E}_{\mathbf{f}\,\|\,\bm{\psi},\mathcal{D}}\left[p(Y\,\|\,% \mathcal{M})\right]\right]\right]\right]$

with importance weights

\displaystyle w^{L}:=\frac{\mathbb{E}_{G\,|\,L}\left[p(\mathcal{D}\,|\,\bm{% \psi},G)\cdot p(\bm{\psi}\,|\,G)\right]}{\mathbb{E}_{L^{\prime}\,|\,\bm{\theta% }}\left[\mathbb{E}_{G^{\prime}\,|\,L^{\prime}}\left[p(\mathcal{D}\,|\,\bm{\psi% },G^{\prime})\cdot p(\bm{\psi}\,|\,G^{\prime})\right]\right]},

(4)

On a high level, the query posterior $p(Y\,|\,\mathcal{D})$ is estimated in Eq. (3) by (i) sampling several candidate SCMs from the learned generative model in a nested manner given $\bm{\theta},\bm{\psi}$ —first sampling order $L$ , then a graph $G$ (parent sets) conditional on $L$ , and mechanisms $\mathbf{f}$ conditional on $G$ , (ii) sample queries given the candidate SCM from $p(Y\,|\,\mathcal{M})$ , and (iii) weighting the sampled queries with their corresponding importance weight $w^{L}$ .

Learning Phase.

The formulation in Equation 3 transforms the inference problem over posterior SCMs in Equation 2 into the problem of inferring posterior parameters $p(\bm{\theta},\bm{\psi}\,|\,D)$ of our proposed generative model in Figure 3 (cf. (Lorch et al., 2021; Toth et al., 2022)). By ensuring that our generative distributions $p(L\,|\,\bm{\theta})$ and $p(\mathbf{f},\mathbf{U}\,|\,\bm{\psi})$ are sufficiently expressive, it suffices to infer a maximum a posteriori (MAP) estimate of the posterior parameters $\bm{\theta},\bm{\psi}$ via gradient-based optimisation of (see Section A.2 for a derivation)

	$\displaystyle\nabla\log$	$\displaystyle p(\bm{\theta},\bm{\psi}\,\|\,\mathcal{D})=\nabla\log p(\bm{\theta})$		(5)
		$\displaystyle+\nabla\log\mathbb{E}_{L\,\|\,\bm{\theta}}\left[\mathbb{E}_{G\,\|\,% L}\left[p(\mathcal{D}\,\|\,\bm{\psi},G)\cdot p(\bm{\psi}\,\|\,G)\right]\right].$

A major issue for the training of such a model is the quality of the estimated gradients in the face of the high-dimensional problem space and the coupling of the parameters $\bm{\theta},\bm{\psi}$ . However, simultaneously updating all model parameters in a single gradient step is prone to yield very noisy gradients:

1.

The gradient w.r.t. $\bm{\theta}$ (causal order proposal) depends on $\bm{\psi}$ (mechanism and noise parameters) through the quality of the estimated importance weights $w^{L}$ in Equation 6. A bad estimate of $\bm{\psi}$ will result in poor estimates of the importance weights and consequently the gradient w.r.t. $\bm{\theta}$ .
2.

The gradient w.r.t. $\bm{\psi}$ likewise depends on $\bm{\theta}$ via the quality of the sampled orders and their induced parent sets. In general, the more often a parent set occurs in the sampled orders relative to other parent sets, the stronger its gradient w.r.t. $\bm{\psi}$ .³³3Arguably, this may be especially problematic when sharing parameters between mechanisms with different parent sets, as is the case, e.g., when modeling the mechanisms with a single, masked neural network.

To mitigate these issues, we propose a nested optimisation procedure as laid out in Algorithm 1 (lines 1-1). In an outer loop, we sample causal orders from the learned proposal, estimate the gradient w.r.t. $\bm{\theta}$ in Equation 5 and update the parameters $\bm{\theta}$ of the proposal distribution $p(L\,|\,\bm{\theta})$ (see Section 3.1). To estimate this gradient, we need to compute the importance weights (Equation 4) for the sampled orders, which however depend on the GP hyper-parameters $\bm{\psi}$ via the marginal likelihood $p(\mathcal{D}\,|\,\bm{\psi},G)$ . Therefore, before computing the importance weights, we optimise the GP hyper-parameters (see Section 3.3) for each yet unseen parent set compatible with some sampled order. This ensures truthful estimates of the importance weights for the sampled orders and, consequently, provides a good gradient estimate.

Algorithm 1 BCI with ARCO-GP

Input: (Observational) data $\mathcal{D}$ .

Output: Posterior parameters $\bm{\theta}$ , $\bm{\psi}$ . Estimated (posterior over the) causal query $Y$ and importance weights $\mathbf{w}^{L}$ .

/* Learning Phase */

repeat

sample causal orders

\mathbf{L}\leftarrow\left\{L_{m}\sim p(L\,|\,\bm{\theta})\right\}

\vartriangleright

Section 3.1

\mathbf{w}^{L},\bm{\psi}\leftarrow

ComputeIW( $\mathbf{L}$ , $\mathcal{D}$ )

\vartriangleright

Equation 4 estimate the gradients for

\bm{\theta}

and perform a gradient step

\vartriangleright

Equation 6

2until stopping criterion fulfilled

/* Inference Phase */ sample causal orders

\mathbf{L}\leftarrow\left\{L_{m}\sim p(L\,|\,\bm{\theta})\right\}

\vartriangleright

Section 3.1

\mathbf{w}^{L},\bm{\psi}\leftarrow

ComputeIW( $\mathbf{L}$ , $\mathcal{D}$ )

\vartriangleright

Equation 4 [optional] sample graphs

\mathbf{G}\leftarrow\left\{G_{k}\sim p(G\,|\,L,\bm{\psi},\mathcal{D})\right\}

\vartriangleright

Section 3.2 [optional] sample mechanisms

\mathbf{f}\leftarrow\left\{\mathbf{f}_{j}\sim p(\mathbf{f}\,|\,\bm{\psi},% \mathcal{D})\right\}

\vartriangleright

Section 3.3 sample candidate queries

\mathbf{Y}\leftarrow\left\{Y_{i}\sim p(Y\,|\,\mathbf{f},\bm{\psi},G)\right\}

\vartriangleright

Section 3.3 return

\bm{\theta}

\bm{\psi}

\mathbf{Y}

\mathbf{w}^{L}

/* Subroutine: compute importance weights and update mechanisms. */ Subroutine ComputeIW( $\mathbf{L}$ , $\mathcal{D}$ )

3 foreach causal order $L_{m}\in\mathbf{L}$ do

4 foreach parent set $\textbf{pa}_{i}$ compatible with $L_{m}$ do

Learn (or retrieve) the corresponding GP hyper-parameters

\bm{\psi}_{i}

\vartriangleright

Equation 11

5 end foreach

6 Compute the importance weight

w^{L_{m}}

7 end foreach

8 return

\mathbf{w}^{L},\bm{\psi}

3.1 Causal Order Inference

While Equations 3 and 5 elegantly reduce BCI to learning a model over causal orders and GP hyper-parameters, computing the expectation w.r.t. causal orders in Equation 3 poses a hard combinatorial problem, as there are $d!$ possible causal orders over $d$ variables. The involved distribution over causal orders $p(L\,|\,\bm{\theta})$ appearing in the BCI estimator Equation 3 can be interpreted as proposal distribution, with importance weights as defined in Equation 4, where the optimal proposal is the true posterior over causal orders. Hence, we require an expressive representation over causal orders which can account for multi-modal distributions over orders. In particular, consider the example of a Markov equivalence class (MEC) including a chain graph. Since the chain is contained in the MEC, also the reverse chain graph must be in the MEC, and thus, the proposal over causal orders must be able to represent both orders with equal probability.

A simple parameterisation of orders as proposed in (Charpentier et al., 2022) is not able to represent the true posterior over causal orders in this case. Specifically, they sample orders using the Gumbel Top-k trick (Kool et al., 2019) by perturbing $d$ logits (corresponding to the $d$ variables) with Gumbel noise and sorting these perturbed logits, yielding an order over variables. In essence, learning such a model boils down to ordering (and spreading) the $d$ logits on the real line. Now, to sample a chain graph and a reverse chain graph, some variable is the first element in the causal order in one case and must thus have the highest (perturbed) logit, and it is the last element in the causal order in the other case where it must have the lowest (perturbed) logit, which is contradictory. In practice, we observe that when trying to learn a multi-modal distribution over causal orders with this model, the logits cluster together, resulting approximately in a uniform distribution over causal orders.

Here, we propose an expressive, Auto-Regressive distribution (Larochelle & Murray, 2011) $p(L\,|\,\bm{\theta})=p(L_{1}\,|\,\bm{\theta})\cdot\prod_{k=2}^{d}p(L_{k}\,|\,L% _{<k},\bm{\theta})$ over Causal Orders (ARCO) that is amenable to gradient-based optimisation and can represent multi-modal distributions over orders, avoiding the shortcoming described above.

Sampling Causal Orders.

The ordering of variables naturally implies a sequential sampling procedure as listed in Algorithm 2. In each step of the sampling procedure, we sample the next variable in the order from a categorical distribution $p(L_{k}\,|\,L_{<k},\bm{\theta})$ over the set of yet unassigned variables, conditional on all preceding variables in the order (Algorithm 2, line 2). To account for the dependence on the preceding order $L_{<k}$ , we compute the logits of the categorical distribution using a differentiable function $g_{\bm{\theta}}:\mathbb{R}^{d\times d}\mapsto\mathbb{R}^{d}$ (Algorithm 2, line 2) and re-normalise them to exclude the elements in $L_{<k}$ (Algorithm 2, line 2). We implement $g_{\bm{\theta}}$ as feed-forward neural network, taking as input a suitable encoding of the so-far sampled order $L_{<k}$ . To this end, we encode $L_{<k}$ using its induced permutation matrix $Q^{L_{<k}}$ (see Section 2) and mask the rows corresponding to elements $L_{>=k}$ with zeros.

Training ARCO.

Training ARCO amounts to learning the parameters $\bm{\theta}$ of the neural network $g_{\bm{\theta}}$ by performing gradient ascent on Equation 5. To estimate the gradient in Equation 5 w.r.t. $\bm{\theta}$ we use the score-function estimator, yielding (see Section A.2)

	$\displaystyle\nabla_{\bm{\theta}}\log p(\bm{\theta},\bm{\psi}\,\|\,\mathcal{D})$	$\displaystyle=\nabla_{\bm{\theta}}\log p(\bm{\theta})$		(6)
		$\displaystyle+\mathbb{E}_{L\,\|\,\bm{\theta}}\left[w^{L}\cdot\nabla_{\bm{\theta% }}\log p(L\,\|\,\bm{\theta})\right].$

with $w^{L}$ as defined in Equation 4 (see Section A.1 for the derivation). To evaluate $\log p(L\,|\,\bm{\theta})$ for a given causal order $L$ , we simply need to sum the log-probabilities of the categorical distributions $\log p(L_{k}\,|\,L_{<k},\bm{\theta})$ for the respective elements $L_{k}$ . The necessary log-probabilities (i.e., the normalised logits) are computed as described in Algorithm 2. Note that, although we need to compute the logits sequentially in the sampling procedure, we can compute them in parallel during evaluation.

Algorithm 2 Sample Causal Order with ARCO

Input: Logit function $g_{\bm{\theta}}:\mathbb{R}^{d\times d}\mapsto\mathbb{R}^{d}$

Output: Causal order $L$ sampled from $p(L\,|\,\bm{\theta})$

$\mathbf{R}\leftarrow\mathbf{X}$

\vartriangleright

set of unassigned elements

L\leftarrow\varnothing

\vartriangleright

causal order for $k=1\dots d$ do

\bm{\phi}\leftarrow g_{\bm{\theta}}(Q^{L_{<k}})

\vartriangleright

compute logits

\phi_{i}\leftarrow\begin{cases}\phi_{i}-\log\sum_{j\,|\,X_{j}\in\mathbf{R}}% \exp\phi_{j}\quad\text{if}\quad X_{i}\in\mathbf{R}\\ -\infty\quad\text{otherwise}\quad\end{cases}

\vartriangleright

normalise logits

l\sim\text{CATEGORICAL}(\bm{\phi})

\vartriangleright

sample next element

L\leftarrow L\cup\langle X_{l}\rangle

\vartriangleright

update causal order

\mathbf{R}\leftarrow\mathbf{R}\setminus\{X_{l}\}

11 end for

3.2 Marginalising over Causal Graphs

The marginalisation w.r.t. causal graphs $G$ in Equations 3, 4 and 5 is in general intractable, as the number of DAGs consistent with any given (causal) order is $2^{\frac{d\cdot(d-1)}{2}}$ . Although this is significantly smaller than the total number of DAGs with $d$ nodes (which grows super-exponentially in $d$ , see e.g. (OEIS Foundation Inc., 2024)), an exhaustive enumeration is still infeasible.

In this work, we tackle this problem by restricting the number of parents per variable. By restricting the maximum size of any admissible parent set to some integer $K$ , the number of distinct parent sets consistent with any causal order $L$ is in $\mathcal{O}(d^{K})$ . Although the exhaustive enumeration of all DAGs with restricted parent set size is still infeasible, it turns out that, given a causal order, the expectation w.r.t. graphs can be tractably computed under certain assumptions: by choosing a prior over graphs $p(G\,|\,L)=\prod_{i}p(\textbf{Pa}_{i}\,|\,L)$ that factorises over parent sets, we can compute quantities $h(G)$ that decompose over the parent sets by the following two propositions.⁴⁴4In practice, we assume a uniform prior over parent sets consistent with a given causal order.

Proposition 3.1.

Let $h(G)=\prod_{i}h_{i}(\textbf{pa}_{i}^{G})$ and $w(G)=\prod_{i}w(\textbf{pa}_{i}^{G})$ be factorising over the parent sets, then we can compute

\displaystyle\mathbb{E}_{G\,|\,L}\left[w(G)h(G)\right]=\prod_{i=1}^{d}\sum_{% \textbf{pa}_{i}}p(\textbf{pa}_{i}\,|\,L)w_{i}(\textbf{pa}_{i})h_{i}(\textbf{pa% }_{i}).

(7)

Proposition 3.2.

Let $h(G)=\sum_{i}h_{i}(\textbf{pa}_{i}^{G})$ be summing and $w(G)=\prod_{i}w(\textbf{pa}_{i}^{G})$ be factorising over the parent sets, then we can compute

	$\displaystyle\mathbb{E}_{G\,\|\,L}$	$\displaystyle\left[w(G)h(G)\right]=$		(8)
		$\displaystyle\sum_{i=1}^{d}\left(\prod_{j\neq i}\alpha_{j}(L)\right)\cdot\sum_% {\textbf{pa}_{i}}p(\textbf{pa}_{i}\,\|\,L)w_{i}(\textbf{pa}_{i})h_{i}(\textbf{% pa}_{i}),$

where

\displaystyle\alpha_{j}(L)=\sum_{\textbf{pa}_{i}}p(\textbf{pa}_{i}\,|\,L)w_{i}% (\textbf{pa}_{i}).

These propositions generalise the results presented by Koller & Friedman (2003); Koivisto & Sood (2004) on how to compute the posterior probabilities of edges or parent sets, which is a special case of Proposition 3.1, to our setting of Bayesian causal inference (proofs in Section A.3).

Employing Proposition 3.1, we can compute $\mathbb{E}_{G\,|\,L}\left[p(\mathcal{D}\,|\,\bm{\psi},G)\cdot p(\bm{\psi}\,|\,% G)\right]$ , and consequently

\displaystyle p(G\,|\,L,\bm{\psi},\mathcal{D})=\frac{p(\mathcal{D}\,|\,\bm{% \psi},G)\cdot p(\bm{\psi}\,|\,G)\cdot p(G\,|\,L)}{\mathbb{E}_{G\,|\,L}\left[p(% \mathcal{D}\,|\,\bm{\psi},G)\cdot p(\bm{\psi}\,|\,G)\right]},

(9)

in closed-form by letting $h(G)=p(\mathcal{D}\,|\,\bm{\psi},G)\cdot p(\bm{\psi}\,|\,G)$ and $w(G)=1$ . This allows us to tractably compute the respective expectations w.r.t. causal graphs in Equations 3, 4 and 5. Furthermore, for any causal query that does not decompose over parent sets, we can compute a Monte-Carlo estimate of the query by sampling DAGs from the true posterior $p(G\,|\,L,\bm{\psi},\mathcal{D})$ in Equation 3, as Equation 9 factorised over parent sets.

3.3 Mechanism Inference and Marginalisation

To compute the importance weights in Equation 4 we need to compute the marginal log-likelihood $p(\mathcal{D}\,|\,\bm{\psi},G)$ which is intractable for general models. As we focus on non-linear additive noise models in this work, we follow (von Kügelgen et al., 2019; Toth et al., 2022) and model each mechanism via a distinct GPs, assuming homoscedastic Gaussian noise and causal sufficiency. Under these assumptions, we can compute the marginal likelihood $p(\mathcal{D}_{i}\,|\,\bm{\psi}_{i},G)=\mathbb{E}_{\mathbf{f}\,|\,\bm{\psi}_{i% }}\left[p(\mathcal{D}_{i}\,|\,\mathbf{f}_{i},\bm{\psi}_{i},\textbf{pa}_{i})\right]$ in closed form. The same holds true for the predictive posterior $p(\mathcal{D}_{i}^{\text{test}}\,|\,\bm{\psi}_{i},G,\mathcal{D})$ possibly needed in Equation 3. We factorise the GP hyper-parameter prior⁵⁵5The parameter set $\bm{\psi}_{i}$ contains the GP’s kernel parameters and the noise variance. Note that, other than the structural dependence on the parent sets, the prior parameters $\bm{\psi}$ are independent of $\bm{\theta}$ . and the likelihood over parent sets for each node $i$ , i.e.,

\displaystyle p(\mathcal{D}\,|\,\bm{\psi},G)\cdot p(\bm{\psi}\,|\,G)=\prod_{i}% p(\mathcal{D}_{i}\,|\,\bm{\psi}_{i},G)\cdot p(\bm{\psi}_{i}\,|\,\textbf{Pa}_{i% }^{G}),

(10)

allowing for efficient computation by caching intermediate results. For the individual GP we infer a MAP-Type II estimate of its hyper-parameters by performing gradient-ascent on⁶⁶6Importantly, in Section A.2 we show that this not an ad-hoc choice, but a consequence of optimising Equation 5.

	$\displaystyle\nabla_{\bm{\psi}_{i}}\log$	$\displaystyle p(\bm{\psi}_{i}\,\|\,\mathcal{D},G)=$		(11)
		$\displaystyle\nabla_{\bm{\psi}_{i}}\log p(\mathcal{D}_{i}\,\|\,\bm{\psi}_{i},G)% +\nabla_{\bm{\psi}_{i}}\log p(\bm{\psi}_{i}\,\|\,G).$

For root nodes, i.e., nodes without parents, we place a conjugate normal-inverse-gamma prior on the mean and variance of that node, which also allows for closed-form inference.

4 Related Work

BCI in the simpler form of Bayesian structure inference dates back to (Heckerman, 1995; Heckerman et al., 1997; Madigan et al., 1995), often using MCMC techniques, where many works utilise (causal) orders, e.g., (Koller & Friedman, 2003; Koivisto & Sood, 2004; Teyssier & Koller, 2012; Ellis & Wong, 2008; Niinimäki et al., 2016; Kuipers & Moffa, 2017; Viinikka et al., 2020), allowing a better exploration of the posterior space. Moreover, restricting the maximum number of parents per node allows an exhaustive enumeration of all parent sets in polynomial time for a given order. However, MCMC inference comes with its own set of challenges, and none of these works implement non-linear mechanism models. In contrast, our work focuses on non-linear additive noise models and utilises gradient-based learning of a generative model over causal orders. Besides sampling based inference, orders can also facilitate exact optimization schemes, e.g. (Cussens, 2010; De Campos & Ji, 2011; Peharz & Pernkopf, 2012).

After the advent of gradient-based (Bayesian) DAG structure learning methods (Zheng et al., 2018; Yu et al., 2019; Brouillard et al., 2020; Lachapelle et al., 2020; Lorch et al., 2021), inference via orders recently gained interest in the gradient-based causal structure learning community as a vehicle to sample DAGs without the need of utilising soft acyclicity constraints during optimisation (Cundy et al., 2021; Charpentier et al., 2022; Annadani et al., 2023). In contrast to these works, we stress the importance of an expressive gradient-based model for causal orders and we utilise causal orders to marginalise our Bayesian causal inference procedure by exhaustive parent set enumeration.

Finally, most existing BCI approaches are restricted to linear Gaussian models (Geiger & Heckerman, 1994; Viinikka et al., 2020; Pensar et al., 2020; Horii, 2021) or binary variables (Moffa et al., 2017; Kuipers et al., 2019). Only recently, works on a Bayesian treatment of entire non-linear SCMs (i.e., including mechanisms and exogenous noise) have been proposed by Toth et al. (2022); Tigas et al. (2022); Giudice et al. (2023, 2024). Both Toth et al. (2022) and Tigas et al. (2022) operate in an active learning scenario using DIBS (Lorch et al., 2021) for inferring posterior causal graphs, whereas Toth et al. (2022) focus on BCI while using GPs for mechanism inference. Similarly, Giudice et al. (2023) use GPs for the Bayesian inference of SCMs in combiniation with MCMC inference for causal graphs. In a concurrent preprint, Giudice et al. (2024) follow (Toth et al., 2022) in using GPs for BCI but use MCMC for sampling posterior causal graphs. None of these approaches features gradient-based inference of causal orders, nor utilises causal orders for structure marginalisation.

5 Experiments

Task I: Causal Structure Learning. For our results in Figure 1, we sample a fixed set of $200$ training samples from the observational distribution of non-linear additive noise ground truth SCMs with Erdös-Rényi (ER) (Erdös & Rényi, 1959) and scale-free (SF) (Barabási & Albert, 1999) graph structures. The small sample size emulates the setting of significant uncertainty relevant to the Bayesian inference scenario. The reported expected structural Hamming distance (ESHD) is a standard metric used structure learning metric. Additionally, to asses the inferred structures in terms of causal implications, we report the ancestor adjustment identification distance (A-AID) recently proposed by Henckel et al. (2024) (see Appendix D for details).

We compare our inference model (ARCO-GP) with maximum parent set size $K=2$ to a diverse set of nine different structure learning methods: BAYESDAG ((Annadani et al., 2023)), DAG-GNN ((Yu et al., 2019)), DDS ((Charpentier et al., 2022)), DIBS-GP ((Toth et al., 2022; Lorch et al., 2021)), GADGET ((Viinikka et al., 2020)), GES ((Chickering, 2003)), GOLEM ((Ng et al., 2020)), GRASP ((Lam et al., 2022)) and PC ((Spirtes et al., 2000)), that we briefly describe in Appendix D.

Summarising Figure 1, when our assumption about the maximum parent set size is reflected by the ground truth models (as holds in our simulated SF graphs), ARCO-GP clearly outperforms the baselines in terms of ESHD and A-AID. Moreover, even when the ground truth models exhibit larger parent sets and our restriction is violated (as is the case in our simulated ER graphs), ARCO-GP still outperforms the baselines, albeit by a smaller margin. Additional experiments with varying number of variables $d\in\left\{11,20,50\right\}$ , as well as experiments on real-world data from Sachs et al. (2005) and ablations are provided in Appendix D. For these experiments we report an extended set of eight additional metrics.

Task II: Inferring Interventional Distributions and Average Causal Effects. We illustrate ARCO-GP’s causal reasoning capabilities by visualising a selected set of estimated posterior interventional distributions $p(X_{i}\,|\,do(\mathbf{W}=\mathbf{w}),\mathcal{D})$ in Figure 2. In order to have access to the corresponding ground truth distributions, we simulate a non-linear additive noise model on the consensus protein interaction graph reported by Sachs et al. (2005) (see Figure 4). We generate samples from the interventional distributions with the procedure laid out in Algorithms 1 and 3 and smoothing the empirical distribution with a kernel density estimate (see Appendix B for details). As can be seen, the posterior distribution inferred by our method closely resemble the true ones. Importantly, the multi-modality of the (inferred) distributions illustrate the benefits of a Bayesian approach to causal inference, as we can represent uncertainty via full distributions instead of single point estimates. In Appendix D we further compare our ability to infer pairwise average causal effects to two baselines, namely DIBS-GP (Toth et al., 2022; Lorch et al., 2021) and GADET+BEEPS (Viinikka et al., 2020).

6 Discussion

We demonstrated that our proposed ARCO-GP method for BCI, leveraging structural marginalisation, yields superior structure learning performance on non-linear additive noise models against a set of nine state-of-the-art baseline methods. Moreover, we illustrate the capability of ARCO-GP to accurately infer posterior interventional distributions and average causal effects. The capabilities our method relies upon the following assumptions and limitations.

Assumptions. Our assumptions on the data generating process include causal sufficiency and additive Gaussian noise, as well as a limited maximum parent set size. Presumably, for very dense graphs the performance of ARCO-GP will deteriorate in comparison to other methods. While our framework and implementation can handle interventional data, we do not evaluate this scenario experimentally because not all baselines support the use of interventional data, and the observational case is the more difficult problem from the perspective of model identifiability.

Scalability and Computation. The main driver of complexity is the exact inference using GPs, which grows with $N^{3}$ in the number of available data points. Although we used only CPUs for running our experiments, scaleable GPU inference techniques for GPs were proposed, e.g., by Gardner et al. (2018); Pleiss et al. (2018). Additionally, the training of the GPs could be straightforwardly parallelised. Conceptually, our framework is flexible and modular, allowing to use alternative mechanism models like normalising flows as in (Brouillard et al., 2020; Pawlowski et al., 2020). A second driver of computational complexity is the exhaustive enumeration of parent sets, which may be prohibitive on larger problem instances and bigger parent sets. Note, however, that the individual parent set contributions necessary to compute the importance weights in Equation 4 could be pre-computed in parallel.

Acknowledgements

The financial support by the Austrian Federal Ministry of Labour and Economy, the National Foundation for Research, Technology and Development and the Christian Doppler Research Association is gratefully acknowledged. The computational results presented have in part been achieved using the Vienna Scientific Cluster (VSC).

References

Annadani et al. (2023) Annadani, Y., Pawlowski, N., Jennings, J., Bauer, S., Zhang, C., and Gong, W. BayesDAG: Gradient-Based Posterior Inference for Causal Discovery. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
Ansel et al. (2024) Ansel, J., Yang, E., He, H., Gimelshein, N., Jain, A., Voznesensky, M., Bao, B., Bell, P., Berard, D., Burovski, E., Chauhan, G., Chourdia, A., Constable, W., Desmaison, A., DeVito, Z., Ellison, E., Feng, W., Gong, J., Gschwind, M., Hirsh, B., Huang, S., Kalambarkar, K., Kirsch, L., Lazos, M., Lezcano, M., Liang, Y., Liang, J., Lu, Y., Luk, C. K., Maher, B., Pan, Y., Puhrsch, C., Reso, M., Saroufim, M., Siraichi, M. Y., Suk, H., Zhang, S., Suo, M., Tillet, P., Zhao, X., Wang, E., Zhou, K., Zou, R., Wang, X., Mathews, A., Wen, W., Chanan, G., Wu, P., and Chintala, S. PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, New York, NY, USA, 2024. ACM. ISBN 9798400703850. doi: 10.1145/3620665.3640366.
Barabási & Albert (1999) Barabási, A.-L. and Albert, R. Emergence of scaling in random networks. Science, 286, 1999.
Bareinboim et al. (2022) Bareinboim, E., Correa, J. D., Ibeling, D., and Icard, T. On Pearl’s Hierarchy and the Foundations of Causal Inference. In Probabilistic and Causal Inference. ACM, 2022.
Brouillard et al. (2020) Brouillard, P., Lachapelle, S., Lacoste, A., Lacoste-Julien, S., and Drouin, A. Differentiable Causal Discovery from Interventional Data. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H. (eds.), Advances in Neural Information Processing Systems. Curran Associates, Inc., 2020.
Charpentier et al. (2022) Charpentier, B., Kibler, S., and Günnemann, S. Differentiable {DAG} Sampling. In International Conference on Learning Representations, 2022.
Chickering (2003) Chickering, D. M. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3, 2003. ISSN 1532-4435.
Cundy et al. (2021) Cundy, C., Grover, A., and Ermon, S. BCD Nets: Scalable Variational Approaches for Bayesian Causal Discovery. In Thirty-Fifth Conference on Neural Information Processing Systems, 2021.
Cussens (2010) Cussens, J. Maximum likelihood pedigree reconstruction using integer programming. In WCB@ ICLP, pp. 8–19, 2010.
De Campos & Ji (2011) De Campos, C. P. and Ji, Q. Efficient structure learning of bayesian networks using constraints. The Journal of Machine Learning Research, 12:663–689, 2011.
Ellis & Wong (2008) Ellis, B. and Wong, W. H. Learning Causal Bayesian Network Structures from Experimental Data. Journal of the American Statistical Association, 2008.
Erdös & Rényi (1959) Erdös, P. and Rényi, A. On random graphs i. Publicationes Mathematicae Debrecen, 6:290, 1959.
Gardner et al. (2018) Gardner, J., Pleiss, G., Weinberger, K. Q., Bindel, D., and Wilson, A. G. Gpytorch: Blackbox matrix-matrix gaussian process inference with gpu acceleration. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems. Curran Associates, Inc., 2018.
Geiger & Heckerman (1994) Geiger, D. and Heckerman, D. Learning gaussian networks. In Uncertainty in Artificial Intelligence, pp. 235–243. Elsevier, 1994.
Giudice et al. (2023) Giudice, E., Kuipers, J., and Moffa, G. A Bayesian Take on Gaussian Process Networks. In Advances in Neural Information Processing Systems, 2023.
Giudice et al. (2024) Giudice, E., Kuipers, J., and Moffa, G. Bayesian causal inference with gaussian process networks. arXiv preprint arXiv:2402.00623, 2024.
Heckerman (1995) Heckerman, D. A bayesian approach to learning causal networks. In Proceedings of Eleventh Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann, 1995.
Heckerman et al. (1997) Heckerman, D., Meek, C., and Cooper, G. A Bayesian Approach to Causal Discovery. Computation, Causation, and Discovery, 1997.
Henckel et al. (2024) Henckel, L., Würtzen, T., and Weichwald, S. Adjustment Identification Distance: A gadjid for Causal Structure Learning. arXiv:2402.08616, 2024.
Hinton (2012) Hinton, G. Neural networks for machine learning - lecture 6a - overview of mini-batch gradient descent, 2012. https://www.cs.toronto.edu/ tijmen/csc321/slides/lecture_slides_lec6.pdf.
Horii (2021) Horii, S. Bayesian Model Averaging for Causality Estimation and its Approximation based on Gaussian Scale Mixture Distributions. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, 2021.
Kingma & Ba (2015) Kingma, D. P. and Ba, J. Adam: A Method for Stochastic Optimization. In And, Y. B. and LeCun, Y. (eds.), 3rd International Conference on Learning Representations, Conference Track Proceedings, San Diego, CA, USA, 2015.
Koivisto & Sood (2004) Koivisto, M. and Sood, K. Exact Bayesian structure discovery in Bayesian networks. Journal of Machine Learning Research, 2004.
Koller & Friedman (2003) Koller, D. and Friedman, N. Being Bayesian about network structure. A Bayesian approach to structure discovery in Bayesian networks. Machine Learning, 2003.
Kool et al. (2019) Kool, W., Van Hoof, H., and Welling, M. Stochastic Beams and Where To Find Them: The {G}umbel-Top-k Trick for Sampling Sequences Without Replacement. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97. PMLR, 2019. URL https://proceedings.mlr.press/v97/kool19a.html.
Kuipers & Moffa (2017) Kuipers, J. and Moffa, G. Partition MCMC for Inference on Acyclic Digraphs. Journal of the American Statistical Association, 2017.
Kuipers et al. (2019) Kuipers, J., Moffa, G., Kuipers, E., Freeman, D., and Bebbington, P. Links between psychotic and neurotic symptoms in the general population: an analysis of longitudinal british national survey data using directed acyclic graphs. Psychological Medicine, 49(3):388–395, 2019.
Lachapelle et al. (2020) Lachapelle, S., Brouillard, P., Deleu, T., and Lacoste-Julien, S. Gradient-Based Neural DAG Learning. In International Conference on Learning Representations, 2020.
Lam et al. (2022) Lam, W.-Y., Andrews, B., and Ramsey, J. Greedy relaxations of the sparsest permutation algorithm. In Cussens, J. and Zhang, K. (eds.), Proceedings of the Thirty-Eighth Conference on Uncertainty in Artificial Intelligence, Proceedings of Machine Learning Research. PMLR, 2022.
Larochelle & Murray (2011) Larochelle, H. and Murray, I. The neural autoregressive distribution estimator. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 29–37. JMLR Workshop and Conference Proceedings, 2011.
Lorch et al. (2021) Lorch, L., Rothfuss, J., Schölkopf, B., and Krause, A. DiBS: Differentiable Bayesian Structure Learning. Advances in Neural Information Processing Systems, 2021.
Madigan et al. (1995) Madigan, D., York, J., and Allard, D. Bayesian Graphical Models for Discrete Data. International Statistical Review / Revue Internationale de Statistique, 63, 1995. ISSN 03067734. doi: 10.2307/1403615.
Moffa et al. (2017) Moffa, G., Catone, G., Kuipers, J., Kuipers, E., Freeman, D., Marwaha, S., Lennox, B. R., Broome, M. R., and Bebbington, P. Using directed acyclic graphs in epidemiological research in psychosis: an analysis of the role of bullying in psychosis. Schizophrenia bulletin, 43(6):1273–1279, 2017.
Murphy (2001) Murphy, K. P. Active learning of causal Bayes net structure. Technical report, Department of Computer Science, U.C. Berkeley, 2001.
Murphy (2007) Murphy, K. P. Conjugate Bayesian Analysis of the Gaussian Distribution. Technical report, University of British Columbia, 2007.
Murphy (2023) Murphy, K. P. Probabilistic Machine Learning: Advanced Topics. MIT Press, 2023.
Ng et al. (2020) Ng, I., Ghassami, A., and Zhang, K. On the Role of Sparsity and DAG Constraints for Learning Linear DAGs. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33. Curran Associates, Inc., 2020.
Niinimäki et al. (2016) Niinimäki, T., Parviainen, P., and Koivisto, M. Structure Discovery in Bayesian Networks by Sampling Partial Orders. Journal of Machine Learning Research, 2016.
OEIS Foundation Inc. (2024) OEIS Foundation Inc. Number of acyclic digraphs (or dags) with n labeled nodes, 2024. Entry A003024 in The On-Line Encyclopedia of Integer Sequences, https://oeis.org/A003024.
Pawlowski et al. (2020) Pawlowski, N., Castro, D. C., and Glocker, B. Deep Structural Causal Models for Tractable Counterfactual Inference. In Advances in Neural Information Processing Systems, 2020.
Pearl (2009) Pearl, J. Causality. Cambridge University Press, 2009. ISBN 9780511803161.
Pedregosa et al. (2011) Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
Peharz & Pernkopf (2012) Peharz, R. and Pernkopf, F. Exact maximum margin structure learning of bayesian networks. arXiv preprint arXiv:1206.6431, 2012.
Pensar et al. (2020) Pensar, J., Talvitie, T., Hyttinen, A., and Koivisto, M. A bayesian approach for estimating causal effects from observational data. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pp. 5395–5402, 2020.
Peters & Bühlmann (2015) Peters, J. and Bühlmann, P. Structural Intervention Distance for Evaluating Causal Graphs. Neural Computation, 27, 2015. ISSN 0899-7667. doi: 10.1162/NECO_a_00708.
Pleiss et al. (2018) Pleiss, G., Gardner, J., Weinberger, K., and Wilson, A. G. Constant-time predictive distributions for Gaussian processes. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research. PMLR, 2018.
Reisach et al. (2021) Reisach, A. G., Seiler, C., and Weichwald, S. Beware of the Simulated DAG! Causal Discovery Benchmarks May Be Easy To Game. In Vaughan, M. R., Beygelzimer, A., Dauphin, Y., Liang, P., and Wortman, J. (eds.), Advances in Neural Information Processing Systems. Curran Associates, Inc., 2021.
Robert & Roberts (2021) Robert, C. P. and Roberts, G. O. Rao-blackwellization in the mcmc era. arXiv preprint arXiv:2101.01011, 2021.
Sachs et al. (2005) Sachs, K., Perez, O., Pe’er, D., Lauffenburger, D. A., and Nolan, G. P. Causal Protein-Signaling Networks Derived from Multiparameter Single-Cell Data. Science, 308, 2005. doi: 10.1126/science.1105809.
Spirtes et al. (2000) Spirtes, P., Glymour, C., and Scheines, R. Causation, Prediction, and Search. The MIT Press, second edition, 2000. ISBN 9780262527927.
Teyssier & Koller (2012) Teyssier, M. and Koller, D. Ordering-Based Search: A Simple and Effective Algorithm for Learning Bayesian Networks. Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence, UAI 2005, 2012.
Tigas et al. (2022) Tigas, P., Annadani, Y., Jesson, A., Schölkopf, B., Gal, Y., and Bauer, S. Interventions, Where and How? Experimental Design for Causal Models at Scale. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.), Advances in Neural Information Processing Systems. Curran Associates, Inc., 2022.
Tong & Koller (2001) Tong, S. and Koller, D. Active learning for structure in Bayesian networks. In Proceedings of the 17th International Joint Conference on Artificial Intelligence, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc. ISBN 1558608125.
Toth et al. (2022) Toth, C., Lorch, L., Knoll, C., Krause, A., Pernkopf, F., Peharz, R., and von Kügelgen, J. Active Bayesian Causal Inference. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.), Advances in Neural Information Processing Systems. Curran Associates, Inc., 2022.
Viinikka et al. (2020) Viinikka, J., Hyttinen, A., Pensar, J., and Koivisto, M. Towards Scalable Bayesian Learning of Causal DAGs. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H. (eds.), Advances in Neural Information Processing Systems. Curran Associates, Inc., 2020.
von Kügelgen et al. (2019) von Kügelgen, J., Rubenstein, P. K., Schölkopf, B., and Weller, A. Optimal experimental design via Bayesian optimization: active causal structure learning for Gaussian process networks. arXiv:1910.03962, 2019.
Yu et al. (2019) Yu, Y., Chen, J., Gao, T., and Yu, M. DAG-GNN: DAG Structure Learning with Graph Neural Networks. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research. PMLR, 2019.
Zhang et al. (2021) Zhang, K., Zhu, S., Kalander, M., Ng, I., Ye, J., Chen, Z., and Pan, L. gcastle: A python toolbox for causal discovery. arXiv:2111.15155, 2021.
Zheng et al. (2018) Zheng, X., Aragam, B., Ravikumar, P. K., and Xing, E. P. DAGs with NO TEARS: Continuous Optimization for Structure Learning. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31. Curran Associates, Inc., 2018.
Zheng et al. (2024) Zheng, Y., Huang, B., Chen, W., Ramsey, J., Gong, M., Cai, R., Shimizu, S., Spirtes, P., and Zhang, K. Causal-learn: Causal discovery in python. Journal of Machine Learning Research, 25:1–8, 2024.

Appendix A Proofs and Derivations

A.1 Derivation of the Posterior Expectation w.r.t. SCMs

In the following, we derive the expectation w.r.t. SCMs in Equation 3 and the corresponding importance weights in Equation 4.

	$\displaystyle\mathbb{E}_{\mathcal{M}\,\|\,\mathcal{D}}\left[p(Y\,\|\,\mathcal{M}% )\right]$	$\displaystyle=\mathbb{E}_{G,\mathbf{f},\bm{\psi}\,\|\,\mathcal{D}}\left[p(Y\,\|% \,\mathcal{M})\right]$
		$\displaystyle=\mathbb{E}_{\bm{\theta},\bm{\psi}\,\|\,\mathcal{D}}\left[\mathbb{% E}_{G,\mathbf{f}\,\|\,\bm{\theta},\bm{\psi},\mathcal{D}}\left[p(Y\,\|\,\mathcal{% M})\right]\right]$
		$\displaystyle=\mathbb{E}_{\bm{\theta},\bm{\psi}\,\|\,\mathcal{D}}\left[\mathbb{% E}_{G\,\|\,\bm{\theta},\bm{\psi},\mathcal{D}}\left[\mathbb{E}_{\mathbf{f}\,\|\,% \bm{\psi},\mathcal{D}}\left[p(Y\,\|\,\mathcal{M})\right]\right]\right]$
		$\displaystyle=\mathbb{E}_{\bm{\theta},\bm{\psi}\,\|\,\mathcal{D}}\left[\mathbb{% E}_{L\,\|\,\bm{\theta},\bm{\psi},\mathcal{D}}\left[\mathbb{E}_{G\,\|\,L,\bm{\psi% },\mathcal{D}}\left[\mathbb{E}_{\mathbf{f}\,\|\,\bm{\psi},\mathcal{D}}\left[p(Y% \,\|\,\mathcal{M})\right]\right]\right]\right]$
		$\displaystyle=\mathbb{E}_{\bm{\theta},\bm{\psi}\,\|\,\mathcal{D}}\left[\mathbb{% E}_{L\,\|\,\bm{\theta}}\left[\frac{p(\mathcal{D},\bm{\psi}\,\|\,L)}{p(\mathcal{D% },\bm{\psi}\,\|\,\bm{\theta})}\cdot\mathbb{E}_{G\,\|\,L,\bm{\psi},\mathcal{D}}% \left[\mathbb{E}_{\mathbf{f}\,\|\,\bm{\psi},\mathcal{D}}\left[p(Y\,\|\,\mathcal{% M})\right]\right]\right]\right]$
		$\displaystyle=\mathbb{E}_{\bm{\theta},\bm{\psi}\,\|\,\mathcal{D}}\left[\mathbb{% E}_{L\,\|\,\bm{\theta}}\left[w^{L}\cdot\mathbb{E}_{G\,\|\,L,\bm{\psi},\mathcal{D% }}\left[\mathbb{E}_{\mathbf{f}\,\|\,\bm{\psi},\mathcal{D}}\left[p(Y\,\|\,% \mathcal{M})\right]\right]\right]\right]$

with

	$\displaystyle w^{L}$	$\displaystyle:=\frac{p(\mathcal{D},\bm{\psi}\,\|\,L)}{p(\mathcal{D},\bm{\psi}\,% \|\,\bm{\theta})}$
		$\displaystyle=\frac{\mathbb{E}_{G\,\|\,L}\left[p(\mathcal{D},\bm{\psi}\,\|\,G)% \right]}{p(\mathcal{D},\bm{\psi}\,\|\,\bm{\theta})}$
		$\displaystyle=\frac{\mathbb{E}_{G\,\|\,L}\left[p(\mathcal{D}\,\|\,\bm{\psi},G)% \cdot p(\bm{\psi}\,\|\,G)\right]}{p(\mathcal{D},\bm{\psi}\,\|\,\bm{\theta})}$
		$\displaystyle=\frac{\mathbb{E}_{G\,\|\,L}\left[p(\mathcal{D}\,\|\,\bm{\psi},G)% \cdot p(\bm{\psi}\,\|\,G)\right]}{\mathbb{E}_{L^{\prime}\,\|\,\bm{\theta}}\left[% p(\mathcal{D},\bm{\psi}\,\|\,L^{\prime})\right]}$
		$\displaystyle=\frac{\mathbb{E}_{G\,\|\,L}\left[p(\mathcal{D}\,\|\,\bm{\psi},G)% \cdot p(\bm{\psi}\,\|\,G)\right]}{\mathbb{E}_{L^{\prime}\,\|\,\bm{\theta}}\left[% \mathbb{E}_{G^{\prime}\,\|\,L^{\prime}}\left[p(\mathcal{D}\,\|\,\bm{\psi},G^{% \prime})\cdot p(\bm{\psi}\,\|\,G^{\prime})\right]\right]}.$

A.2 Derivation of the Gradient Estimators

In the following, we derive the gradient estimators in Equations 5, 6 and 11. We denote by $\nabla=\nabla_{\bm{\theta},\bm{\psi}}$ to avoid clutter.

General Posterior Gradient. The general posterior gradient in Equation 5 reads as follows.

	$\displaystyle\nabla\log p(\bm{\theta},\bm{\psi}\,\|\,\mathcal{D})$	$\displaystyle=\nabla\log\frac{p(\mathcal{D},\bm{\psi}\,\|\,\bm{\theta})\cdot p(% \bm{\theta})}{p(\mathcal{D})}$
		$\displaystyle=\nabla\log p(\bm{\theta})+\nabla\log p(\mathcal{D},\bm{\psi}\,\|% \,\bm{\theta})$
		$\displaystyle=\nabla\log p(\bm{\theta})+\nabla\log\mathbb{E}_{L\,\|\,\bm{\theta% }}\left[p(\mathcal{D},\bm{\psi}\,\|\,L)\right]$
		$\displaystyle=\nabla\log p(\bm{\theta})+\nabla\log\mathbb{E}_{L\,\|\,\bm{\theta% }}\left[\mathbb{E}_{G\,\|\,L}\left[p(\mathcal{D},\bm{\psi}\,\|\,G)\right]\right]$
		$\displaystyle=\nabla\log p(\bm{\theta})+\nabla\log\mathbb{E}_{L\,\|\,\bm{\theta% }}\left[\mathbb{E}_{G\,\|\,L}\left[p(\mathcal{D}\,\|\,\bm{\psi},G)\cdot p(\bm{% \psi}\,\|\,G)\right]\right]$

ARCO Gradient. Using the above as starting point for the gradient in Equation 6, we get

	$\displaystyle\nabla_{\bm{\theta}}\log p(\bm{\theta},\bm{\psi}\,\|\,\mathcal{D})$	$\displaystyle=\nabla_{\bm{\theta}}\log p(\bm{\theta})+\nabla_{\bm{\theta}}\log% \mathbb{E}_{L\,\|\,\bm{\theta}}\left[\mathbb{E}_{G\,\|\,L}\left[p(\mathcal{D}\,\|% \,\bm{\psi},G)\cdot p(\bm{\psi}\,\|\,G)\right]\right]$
		$\displaystyle=\nabla_{\bm{\theta}}\log p(\bm{\theta})+\frac{\nabla_{\bm{\theta% }}\mathbb{E}_{L\,\|\,\bm{\theta}}\left[\mathbb{E}_{G\,\|\,L}\left[p(\mathcal{D}% \,\|\,\bm{\psi},G)\cdot p(\bm{\psi}\,\|\,G)\right]\right]}{\mathbb{E}_{L^{\prime% }\,\|\,\bm{\theta}}\left[\mathbb{E}_{G^{\prime}\,\|\,L^{\prime}}\left[p(\mathcal% {D}\,\|\,\bm{\psi},G^{\prime})\cdot p(\bm{\psi}\,\|\,G^{\prime})\right]\right]}$
and using $\nabla_{\bm{\theta}}p(L\,\|\,\bm{\theta})=p(L\,\|\,\bm{\theta})\cdot\nabla_{\bm{% \theta}}\log p(L\,\|\,\bm{\theta})$
		$\displaystyle=\nabla_{\bm{\theta}}\log p(\bm{\theta})+\frac{\mathbb{E}_{L\,\|\,% \bm{\theta}}\left[\mathbb{E}_{G\,\|\,L}\left[p(\mathcal{D}\,\|\,\bm{\psi},G)% \cdot p(\bm{\psi}\,\|\,G)\right]\cdot\nabla_{\bm{\theta}}\log p(L\,\|\,\bm{% \theta})\right]}{\mathbb{E}_{L^{\prime}\,\|\,\bm{\theta}}\left[\mathbb{E}_{G^{% \prime}\,\|\,L^{\prime}}\left[p(\mathcal{D}\,\|\,\bm{\psi},G^{\prime})\cdot p(% \bm{\psi}\,\|\,G^{\prime})\right]\right]}$
		$\displaystyle=\nabla_{\bm{\theta}}\log p(\bm{\theta})+\mathbb{E}_{L\,\|\,\bm{% \theta}}\left[w^{L}\cdot\nabla_{\bm{\theta}}\log p(L\,\|\,\bm{\theta})\right]$

with $w^{L}$ as defined in Equation 4.

GP Hyper-parameter Gradient. For the gradient of a distinct GP modeling the mechanism from parents $\textbf{Pa}_{k}$ to target node $X_{k}$ with corresponding hyper-parameters $\bm{\psi}_{k}$ as in Equation 11 we have

	$\displaystyle\nabla_{\bm{\psi}_{k}}\log p(\bm{\theta},\bm{\psi}\,\|\,\mathcal{D})$	$\displaystyle=\nabla_{\bm{\psi}_{k}}\log p(\bm{\theta})+\nabla_{\bm{\psi}_{k}}% \log\mathbb{E}_{L\,\|\,\bm{\theta}}\left[\mathbb{E}_{G\,\|\,L}\left[p(\mathcal{D% },\bm{\psi}\,\|\,G)\right]\right]$
		$\displaystyle=\frac{\mathbb{E}_{L\,\|\,\bm{\theta}}\left[\mathbb{E}_{G\,\|\,L}% \left[\nabla_{\bm{\psi}_{k}}p(\mathcal{D},\bm{\psi}\,\|\,G)\right]\right]}{% \mathbb{E}_{L\,\|\,\bm{\theta}}\left[\mathbb{E}_{G\,\|\,L}\left[p(\mathcal{D},% \bm{\psi}\,\|\,G)\right]\right]}$
		$\displaystyle=\frac{\mathbb{E}_{L\,\|\,\bm{\theta}}\left[\mathbb{E}_{G\,\|\,L}% \left[\nabla_{\bm{\psi}_{k}}p(\mathcal{D},\bm{\psi}\,\|\,G)\right]\right]}{p(% \mathcal{D},\bm{\psi}\,\|\,\bm{\theta})}$
and as the marginal likelihood and the prior over GP hyper-parameters factorise over parent sets we further get
		$\displaystyle=\frac{\mathbb{E}_{L\,\|\,\bm{\theta}}\left[\mathbb{E}_{G\,\|\,L}% \left[\nabla_{\bm{\psi}_{k}}\prod_{i=1}^{d}p(\mathcal{D}_{i},\bm{\psi}_{i}\,\|% \,\textbf{Pa}_{i}^{G})\right]\right]}{p(\mathcal{D},\bm{\psi}\,\|\,\bm{\theta})}$
		$\displaystyle=\frac{\mathbb{E}_{L\,\|\,\bm{\theta}}\left[\sum_{G}p(G\,\|\,L)% \cdot\nabla_{\bm{\psi}_{k}}\prod_{i=1}^{d}p(\mathcal{D}_{i},\bm{\psi}_{i}\,\|\,% \textbf{Pa}_{i}^{G})\right]}{p(\mathcal{D},\bm{\psi}\,\|\,\bm{\theta})}.$
Now, note that for the summation over graphs $G$ , the gradient is zero for all graphs that do not contain the parent set $\textbf{Pa}_{k}$ corresponding to the GP with hyper-parameters $\psi_{k}$ . Consequently, we get
		$\displaystyle=\frac{\mathbb{E}_{L\,\|\,\bm{\theta}}\left[\sum_{G\,\|\,\textbf{Pa% }_{k}\in G}p(G\,\|\,L)\cdot\nabla_{\bm{\psi}_{k}}\prod_{i=1}^{d}p(\mathcal{D}_{% i},\bm{\psi}_{i}\,\|\,\textbf{Pa}_{i}^{G})\right]}{p(\mathcal{D},\bm{\psi}\,\|\,% \bm{\theta})}$
and since $\nabla_{\bm{\psi}_{k}}p(\mathcal{D}_{k},\bm{\psi}_{k}\,\|\,\textbf{Pa}_{k})=p(% \mathcal{D}_{k},\bm{\psi}_{k}\,\|\,\textbf{Pa}_{k})\cdot\nabla_{\bm{\psi}_{k}}% \log p(\mathcal{D}_{k},\bm{\psi}_{k}\,\|\,\textbf{Pa}_{k})$
		$\displaystyle=\frac{\mathbb{E}_{L\,\|\,\bm{\theta}}\left[\sum_{G\,\|\,\textbf{Pa% }_{k}\in G}p(G\,\|\,L)\cdot\prod_{i=1}^{d}p(\mathcal{D}_{i},\bm{\psi}_{i}\,\|\,% \textbf{Pa}_{i}^{G})\cdot\nabla_{\bm{\psi}_{k}}\log p(\mathcal{D}_{k},\bm{\psi% }_{k}\,\|\,\textbf{Pa}_{k})\right]}{p(\mathcal{D},\bm{\psi}\,\|\,\bm{\theta})}$
		$\displaystyle=\frac{\mathbb{E}_{L\,\|\,\bm{\theta}}\left[\sum_{G\,\|\,\textbf{Pa% }_{k}\in G}p(G\,\|\,L)\cdot p(\mathcal{D},\bm{\psi}\,\|\,G)\right]}{p(\mathcal{D% },\bm{\psi}\,\|\,\bm{\theta})}\cdot\nabla_{\bm{\psi}_{k}}\log p(\mathcal{D}_{k}% ,\bm{\psi}_{k}\,\|\,\textbf{Pa}_{k}).$

Note that the term preceding the gradient $\nabla_{\bm{\psi}_{k}}\log p(\mathcal{D}_{k},\bm{\psi}_{k}\,|\,\textbf{Pa}_{k})$ is a scalar factor that does not influence the direction of the gradient and can thus be practically omitted for gradient-based optimisation, as optimisation algorithms will scale the gradient depending on tune-able step size parameters anyways. Therefore, optimising the GP hyper-parameters w.r.t. the gradient in Equation 5 practically yields the same gradient direction as the common MAP type II gradient $\nabla_{\bm{\psi}_{k}}\log p(\mathcal{D}_{k},\bm{\psi}_{k}\,|\,\textbf{Pa}_{k})$ . Arguably, for mechanism models that do not decompose over individual mechanisms for each parent set and target variable, naively estimating the gradient in Equation 5 may yield very noisy gradients, as the magnitude of the estimated gradient will depend on the sampled structures used for its estimation.

A.3 Proofs regarding Exhaustive Parent Set Enumeration

Proof of Proposition 3.1

Proof.

	$\displaystyle\mathbb{E}_{G\,\|\,L}\left[w(G)\cdot h(G)\right]$	$\displaystyle=\sum_{G}p(G\,\|\,L)\cdot w(G)\cdot h(G)$
Since we assume that $w(G)$ and $h(G)$ factorise over the parent sets, we have
		$\displaystyle=\sum_{G}\prod_{i}p(\textbf{pa}_{i}^{G}\,\|\,L)\cdot w_{i}(\textbf% {pa}_{i}^{G})\cdot h_{i}(\textbf{pa}_{i}^{G})$
The sum over all graphs can be represented as sum over all combinations of possible parent sets to get
		$\displaystyle=\sum_{\textbf{pa}_{1}}\sum_{\textbf{pa}_{2}}\dots\sum_{\textbf{% pa}_{d}}\prod_{i}p(\textbf{pa}_{i}\,\|\,L)\cdot w_{i}(\textbf{pa}_{i})\cdot h_{% i}(\textbf{pa}_{i})$
		$\displaystyle=\sum_{\textbf{pa}_{1}}p(\textbf{pa}_{1}\,\|\,L)\cdot w_{1}(% \textbf{pa}_{i})\cdot h_{1}(\textbf{pa}_{i})\sum_{\textbf{pa}_{2}}p(\textbf{pa% }_{2}\,\|\,L)\cdot w_{2}(\textbf{pa}_{i})\cdot h_{2}(\textbf{pa}_{i})\dots$
Since each summation over parent sets is independent of the others, we get the final result
		$\displaystyle=\prod_{i}\sum_{\textbf{pa}_{i}}p(\textbf{pa}_{i}\,\|\,L)\cdot w_{% i}(\textbf{pa}_{i})\cdot h_{i}(\textbf{pa}_{i})$

∎

Proof of Proposition 3.2

Proof.

	$\displaystyle\mathbb{E}_{G\,\|\,L}$	$\displaystyle\left[w(G)\cdot h(G)\right]$
		$\displaystyle=\sum_{G}p(G\,\|\,L)\cdot w(G)\cdot h(G)$
Since we assume that $w(G)$ factorises and $h(G)$ sums over the parent sets, we have
		$\displaystyle=\sum_{G}\prod_{i}p(\textbf{pa}_{i}^{G}\,\|\,L)\cdot w_{i}(\textbf% {pa}_{i}^{G})\cdot\sum_{j}h_{j}(\textbf{pa}^{G}_{j})$
The sum over all graphs can be represented as sum over all combinations of possible parent sets to get
		$\displaystyle=\sum_{\textbf{pa}_{1}}\sum_{\textbf{pa}_{2}}\dots\sum_{\textbf{% pa}_{d}}\prod_{i}p(\textbf{pa}_{i}\,\|\,L)\cdot w_{i}(\textbf{pa}_{i})\cdot\sum% _{j}h_{j}(\textbf{pa}_{j})$
		$\displaystyle=\sum_{\textbf{pa}_{1}}\sum_{\textbf{pa}_{2}}\dots\sum_{\textbf{% pa}_{d}}\cdot\sum_{j}h_{j}(\textbf{pa}_{j})\prod_{i}p(\textbf{pa}_{i}\,\|\,L)% \cdot w_{i}(\textbf{pa}_{i})$
		$\displaystyle=\sum_{\textbf{pa}_{1}}\sum_{\textbf{pa}_{2}}\dots\sum_{\textbf{% pa}_{d}}h_{1}(\textbf{pa}_{1})\prod_{i}p(\textbf{pa}_{i}\,\|\,L)\cdot w_{i}(% \textbf{pa}_{i})+\sum_{\textbf{pa}_{1}}\sum_{\textbf{pa}_{2}}\dots\sum_{% \textbf{pa}_{d}}\sum_{j=2}^{d}\dots$
		$\displaystyle=\sum_{\textbf{pa}_{1}}h_{1}(\textbf{pa}_{1})\cdot p(\textbf{pa}_% {1}\,\|\,L)\cdot w_{1}(\textbf{pa}_{1})\sum_{\textbf{pa}_{2}}p(\textbf{pa}_{2}% \,\|\,L)\cdot w_{2}(\textbf{pa}_{2})\dots+\sum_{\textbf{pa}_{1}}\sum_{\textbf{% pa}_{2}}\dots\sum_{\textbf{pa}_{d}}\sum_{j=2}^{d}\dots$
By abbreviating $\alpha_{i}(L)=\sum_{\textbf{pa}_{i}}p(\textbf{pa}_{i}\,\|\,L)w_{i}(\textbf{pa}_% {i})$ we get
		$\displaystyle=\sum_{\textbf{pa}_{1}}h_{1}(\textbf{pa}_{1})\cdot p(\textbf{pa}_% {1}\,\|\,L)\cdot w_{1}(\textbf{pa}_{1})\cdot\prod_{k\neq 1}\alpha_{k}(L)+\sum_{% \textbf{pa}_{1}}\sum_{\textbf{pa}_{2}}\dots\sum_{\textbf{pa}_{d}}\sum_{j=2}^{d% }h_{j}(\textbf{pa}_{j})\prod_{i}p(\textbf{pa}_{i}\,\|\,L)\cdot w_{i}(\textbf{pa% }_{i})$
By repeating this procedure for the remaining summands $j$ , we get the final result
		$\displaystyle=\sum_{i}\left(\prod_{k\neq i}\alpha_{k}(L)\right)\cdot\sum_{% \textbf{pa}_{i}}p(\textbf{pa}_{i}\,\|\,L)\cdot w_{i}(\textbf{pa}_{i})\cdot h_{i% }(\textbf{pa}_{i})$

∎

Appendix B Implementation and Estimation Details

Our model is implemented in Python mainly relying on PyTorch (Ansel et al., 2024), and GPyTorch (Gardner et al., 2018) for GP inference.

B.1 Mechanism Model.

We follow Toth et al. (2022) and use two types of models for our mechanism.

For root nodes, i.e., causal variables without parents, we place a conjugate normal-inverse-gamma prior N- $\Gamma^{-1}$ on the mean and variance of that node. Specifically,

	$\displaystyle p(f_{i},\sigma_{i}^{2}\,\|\,\textbf{Pa}_{i}=\varnothing)$	$\displaystyle=\text{N-}\Gamma^{-1}(f_{i},\sigma_{i}^{2}\,\|\,\mu_{0},\kappa^{-1% }_{0},\alpha_{0},\beta_{0})$
		$\displaystyle=\text{N}(f_{i}\,\|\,\mu_{0},\sigma_{i}^{2}\cdot\kappa^{-1}_{0})% \cdot\Gamma^{-1}(\sigma_{i}^{2}\,\|\,\alpha_{0},\beta_{0})$

where $\mu_{0},\kappa^{-1}_{0},\alpha_{0},\beta_{0}$ are fixed hyper-parameters. When sampling ground-truth SCMs we set $\mu_{0}=0$ , $\kappa_{0}=1$ , $\alpha_{0}=5$ and $\beta_{0}=10$ , sample a mean and variance from the prior and keep them fixed thereafter. For the inference with ARCO-GP and DIBS-GP, we use $\mu_{0}=0$ , $\kappa_{0}=1$ , $\alpha_{0}=10$ and $\beta_{0}=10$ . This yields a prior mean of $1$ for the variance $\sigma^{2}$ , which is sufficiently broad considering that we standardise all training data to zero mean and unit variance. Analytic expressions for the posterior marginal likelihood are found in (Murphy, 2007).

For non-root nodes we place a GP prior on the mechanisms. Specifically, we use rational quadratic (RQ) kernel

\displaystyle k_{\text{RQ}}(\mathbf{x_{1}},\mathbf{x_{2}})=\delta\cdot\left(1+% \frac{1}{2\gamma}(\mathbf{x_{1}}-\mathbf{x_{2}})^{\top}\lambda^{-2}(\mathbf{x_% {1}}-\mathbf{x_{2}})\right)^{-\gamma}

with scaling parameter $\delta$ , lengthscale parameter $\lambda$ and mixing parameter $\gamma$ to model non-linear mechanisms. For non-linear mechanisms we use an additive Gaussian noise likelihood with variance $\sigma_{i}^{2}$ . We place Gamma priors $\Gamma(\delta\,|\,\alpha_{\delta},\beta_{\delta})$ , $\Gamma(\lambda\,|\,\alpha_{\lambda},\beta_{\lambda})$ , $\Gamma(\gamma\,|\,\alpha_{\gamma},\beta_{\gamma})$ and $\Gamma(\sigma_{i}^{2}\,|\,\alpha_{\sigma},\beta_{\sigma})$ on these parameters.

When generating non-linear models we set $\alpha_{\delta}=100$ , $\beta_{\delta}=10$ , $\alpha_{\lambda}=30\cdot|\textbf{Pa}_{i}|$ , $\beta_{\lambda}=30$ , $\alpha_{\gamma}=20$ , $\beta_{\gamma}=10$ and $\alpha_{\sigma}=50$ , $\beta_{\sigma}=50$ . For each GP we sample a set of hyper-parameters from their priors and keep them fixed thereafter. Additionally, we sample a function from the GP prior with $50$ support points sampled uniform random in range [-10, 10]. For the inference with ARCO-GP and DIBS-GP, we set $\alpha_{\delta}=100$ , $\beta_{\delta}=10$ , $\alpha_{\lambda}=30\cdot|\textbf{Pa}_{i}|$ , $\beta_{\lambda}=30$ , $\alpha_{\gamma}=20$ , $\beta_{\gamma}=10$ and $\alpha_{\sigma}=2$ , $\beta_{\sigma}=88$ again considering that we standardise all training data to zero mean and unit variance. We train GP hyper-parameters for a maximum of $100$ steps with the RMSprop (Hinton, 2012) optimiser with learning rate $0.05$ .

B.2 ARCO Model.

We use a normal prior $\mathcal{N}(\bm{\theta}\,|\,0,\sigma^{2}I)$ with $\sigma=10$ over ARCO’s neural network parameters $\bm{\theta}$ . The neural network $g_{\bm{\theta}}$ uses a single hidden layer with $30$ neurons and ReLU activation functions (see Section 3.1). We train ARCO for a maximum of $400$ gradient steps, using the ADAM (Kingma & Ba, 2015) optimiser with learning rate of $0.01$ . For the score-function gradient estimator in Equation 6 we use an exponential moving average baseline with decay rate $0.9$ . To estimate gradients and causal queries we sample $100$ causal orders and $25$ DAGs conditional on each order where necessary, resulting in $2500$ DAG samples. Per default, we limit the parent set size to a maximum of two parents per node. These settings apply to all our simulations unless stated otherwise.

B.3 Kernel Density Estimates of Interventional Distributions.

To approximate the interventional distribution given our posterior samples (see Section 3, Algorithm 1, lines 1-1), we compute a kernel density estimate (KDE). We draw $100$ causal orders, with $10$ graphs per order and $10$ samples per graph, equalling $10000$ samples from the posterior interventional distribution in total. We use the KDE implementation provided by scikit-learn (Pedregosa et al., 2011), with a Gaussian kernel and a bandwidth of $0.2$ .

Appendix C Experimental Setup

Our code is available at https://github.com/chritoth/bci-arco-gp.

Sampling of Ground-Truth SCMs.

To sample ground truth SCMs we draw graph structures from either the Erdös-Rényi (ER) (Erdös & Rényi, 1959) or the scale-free (SF) (Barabási & Albert, 1999) random models commonly used in DAG structure learning benchmarks. We follow the setup of Lorch et al. (2021) and generate scale-free and Erdös-Rényi graphs with an expected degree of $2$ . Specifically, this will yield SF graphs with a maximum parent set size of $2$ , whereas no such restriction applies to ER graphs in general. We instantiate the causal mechanisms for non-linear additive noise models as described in Appendix B.

For each ground truth SCM, we sample a fixed set of $N$ training data from the observational distribution. Importantly, Reisach et al. (2021) argue, that the causal order may strongly correlate with increasing marginal variance in simulated data, and therefore, benchmarks may be easy to game by exploiting this property. As this may be especially relevant to order-based inference methods, we follow their recommendation and normalise the training data for each endogenous variable to zero mean and unit variance.

Metrics.

As metrics for learning the causal structure we report the expected structural Hamming distance (ESHD), as well as the area under the receiver operating characteristic (AUROC) and the area under the precision recall curve (AUPRC) w.r.t. posterior edge prediction as commonly reported metrics. To provide additional insight into the methods’ strengths and weaknesses, we also report the expected number of edges (#Edges), the true positive rate (TPR) and the true negative rate (TNR) for the edge prediction task. We do not report the log-likelihood on held-out test data as is sometimes reported, because the evaluated methods implement different noise models and approximate inference schemes, which leads to the (marginal) log-likelihoods being uncalibrated and thus incomparable (cf. (Murphy, 2023)[Sec. 7.5] and references therein).

To evaluate the inferred causal structures w.r.t. their causal implications, Henckel et al. (2024) recently proposed a family of causal distances called Adjustment Identification Distance (AID). Briefly summarised, the AID counts the number of correctly identified adjustment sets for pairwise causal effects w.r.t. a target graph, where the adjustment sets are determined according to a chosen identification strategy. We adopt the three variants proposed by Henckel et al. (2024), namely Ancestor adjustment (A-AID), Parent adjustment (P-AID), and Optimal adjustment (OSET-AID). These metrics can also be applied between different graph types, e.g., comparing the AID between a predicted CPDAG and a reference DAG. Note that the (P-AID) computed between to DAGs is equivalent to the well-known Structural Identification Distance proposed by Peters & Bühlmann (2015).

Baselines.

We compare our inference model (ARCO-GP) to a diverse set of nine different structure learning methods that we describe in the following.

•

BAYESDAG (Annadani et al., 2023). BayesDAG utilises a mixture of MCMC to infer permutations and mechanism parameters, and Variational Inference (VI) to infer DAG edges given the permutations. We use the implementation provided by Annadani et al. (2023) at https://github.com/microsoft/Project-BayesDAG. We needed to adapt the sparsity regularisation hyper-parameter in order to get meaningful results and ran our experiments with the configurations in Listings 2-3. As the implementation runs a number of MCMC chains and only evaluates the best chain afterwards, we use only one MCMC chain to enable a fair comparison, as multiple chains would correspond to multiple runs of the other methods.
•

DAG-GNN (Yu et al., 2019). DAG-GNN is a gradient-based structure learning approach combining graph neural networks an acyclicity constraint similar to Zheng et al. (2018). We use the implementation provided by Zhang et al. (2021)[Version 1.0.3] using default settings.
•

DDS (Charpentier et al., 2022). DDS builds upon the permutation-based approach of (Cundy et al., 2021) and utilises differentiable permutation sampling and VI to infer a posterior over DAGs. We use the implementation provided by Charpentier et al. (2022) at https://github.com/sharpenb/Differentiable-DAG-Sampling. We needed to adapt the default hyper-parameters in order to get meaningful results and ran our experiments with the configuration displayed in Listing 1.
•

DIBS-GP (Toth et al., 2022; Lorch et al., 2021). DIBS-GP is a Bayesian causal inference framework recently proposed by Toth et al. (2022) utilising the differential structure learning method by Lorch et al. (2021) employing a soft acyclicity constraint in line with Zheng et al. (2018); Yu et al. (2019). We use the implementation provided by Toth et al. (2022) at https://github.com/chritoth/active-bayesian-causal-inference. We use their standard parameters with $10$ latent particles and constant hyper-parameters $\alpha=\beta=1$ as described in (Lorch et al., 2021). For each latent particle we sample $100$ graphs to estimate gradients and causal quantities.
•

GADGET (Viinikka et al., 2020). GADGET is a Bayesian structure learning method for linear Gaussian models based on MCMC and structure marginalisation over capped-size parent sets. We use the implementation provided by (Viinikka et al., 2020) at https://github.com/Sums-of-Products/sumu and parameters as shown in Listing 4.
•

GES (Chickering, 2003). GES is a well-known greedy score-based method for causal discovery using the BIC score. We use the implementation provided by Zhang et al. (2021)[Version 1.0.3] using default settings.
•

GOLEM (Ng et al., 2020). GOLEM is a differentiable DAG structure learning similar to Zheng et al. (2018) but with a likelihood-based score function for linear models. We use the implementation provided by Zhang et al. (2021)[Version 1.0.3] using default settings.
•

GRASP (Lam et al., 2022). GRASP is a recently proposed permutation-based approach to causal discovery. We use the implementation provided by Zheng et al. (2024) using default settings.
•

PC (Spirtes et al., 2000). The PC algorithm is another standard causal discovery methods based on conditional independence tests. We use the implementation provided by Zhang et al. (2021)[Version 1.0.3] using default settings. It happens quite frequently that PC return a possibly cyclic PDAG, for which the AID metrics cannot computed and are thus omited in our experimental results.

Listing 1: DDS Parameter Set.

⬇

# Architecture parameters

’seed_model’: 123, # Seed to init model.

’ma_hidden_dims’: [32, 32, 32], # Output dimension.

’ma_architecture’: ’linear’, # Output dimension.

’ma_fast’: False, # Output dimension.

’pd_initial_adj’: ’Learned’, # Output dimension.

’pd_temperature’: 1.0, # Output dimension.

’pd_hard’: True, # Output dimension.

’pd_order_type’: ’topk’, # Output dimension.

’pd_noise_factor’: 1.0, # Hidden dimensions.

# Training parameters

’max_epochs’: 500, # Maximum number of epochs for training

’patience’: 150, # Patience for early stopping.

’frequency’: 2, # Frequency for early stopping test.

’batch_size’: 16, # Batch size.

’ma_lr’: 1e-3, # Learning rate.

’pd_lr’: 1e-2, # Learning rate.

’loss’: ’ELBO’, # Loss name. string

’regr’: 1e-1, # Regularization factor in Bayesian loss.

’prior_p’: 1e-6 # Regularization factor in Bayesian loss.

Listing 2: BAYESDAG Non-Linear ER Parameter Set.

⬇

"model_hyperparams": {

"num_chains": 1,

"sinkhorn_n_iter": 3000,

"scale_noise_p": 0.001,

"scale_noise": 0.001,

"VI_norm": true,

"input_perm": false,

"lambda_sparse": 10,

"sparse_init": false

"training_hyperparams": {

"learning_rate": 1e-3,

"batch_size": 512,

"stardardize_data_mean": false,

"stardardize_data_std": false,

"max_epochs": 500

}

Listing 3: BAYESDAG Non-linear SF Parameter Set.

⬇

"model_hyperparams": {

"num_chains": 1,

"sinkhorn_n_iter": 3000,

"scale_noise_p": 0.001,

"scale_noise": 0.001,

"VI_norm": true,

"input_perm": false,

"lambda_sparse": 10,

"sparse_init": false

"training_hyperparams": {

"learning_rate": 1e-3,

"batch_size": 512,

"stardardize_data_mean": false,

"stardardize_data_std": false,

"max_epochs": 500

}

Listing 4: GADGET Parameter Set.

⬇

"scoref": ’bge’,

"max_id": -1,

"K": min(self.num_nodes - 1, 16),

"d": 2,

"cp_algo": ’greedy-lite’,

"mc3_chains": 16,

"burn_in": 1000,

"iterations": 1000,

"thinning": 10,

Appendix D Extended Experimental Results

Here we present additional experimental results and ablations with varying number of variables $d\in\left\{11,20,50\right\}$ , as well as experiments on real-world data from Sachs et al. (2005). For these extended experiments we report an extended set of eight metrics as described in Appendix C.

Real-world dataset from Sachs et al. (2005).

The dataset consists of 853 observational data. The target consensus graph has 11 nodes and 17 edges. We compare our ARCO-GP method in variants with maximum parent set size $k\in{2,3}$ . All variants are competitive with the baselines.

Simulations on larger models with $d=50$ variables.

We report structure learning results for non-linear models with ER and SF graphs over $d=50$ variables in Table 3. The results of ARCO-GP are qualitatively similar to the ones achieved on 20 node systems.

Influence of the maximum parent set size.

We evaluate the influence of the maximum parent set size in range $k\in\{1,2,3,4\}$ on the performance of our ARCO-GP method in Table 2. We distinguish the models variants by labels ARCO-Kk-GP. On non-linear benchmarks, $k=1$ performs worst as expected, since each node can only have one parent. Not surprisingly, $k=2$ performs best on scale-free graphs, whereas $k=4$ performs best on Erdös-Rényi graphs, as the respective $k$ better fit the stuctural properties of the true graphs.

Non-linear Average Causal Effect Estimation.

In this simulation, we compare our ability to infer pairwise average causal effects (ACEs)

\displaystyle ACE(X_{i}\,|\,do(X_{j}=x_{j})):=\mathbb{E}_{X_{i}\,|\,do(X_{j}=x% _{j}),\mathcal{D}}\left[X_{i}\right]

to two baselines, namely DIBS-GP ((Toth et al., 2022; Lorch et al., 2021)) and GADET+BEEPS ((Viinikka et al., 2020)). While DIBS-GP can implements a non-linear additive noise model based on GPs, GADGET+BEEPS is designed for linear Gaussian models only. In particular, the implementation of GADGET+BEEPS does not naturally support the inference of average causal effects, but infers causal effects as the path-wise accumulation of the linear mechanism weights. Thus, we extend the implementation in the following way. A linear SCM can be written as

\displaystyle X=(G\odot B)^{T}\cdot X+\mu+\epsilon,

where $X$ are the endogenous variables, $G$ is the adjacency matrix of the causal DAG, $B$ is the weight matrix of the linear edge weights, $\odot$ denotes the element-wise multiplication, $\mu$ denotes the mean vector of the variables and $\epsilon$ denotes the exogenous Gaussian noise. Solving for X yields

\displaystyle X=(I-(G\odot B)^{T})^{-1}(\mu+\epsilon),

and taking the expectation w.r.t. $\epsilon$ yields

\displaystyle\mathbb{E}_{\epsilon}[X]=(I-(G\odot B)^{T})^{-1}\mu.

(12)

We then estimate $ACE(X_{i}\,|\,do(X_{j}=x_{j}))$ by (i) drawing posterior DAGs $G$ and weight matrices $B$ using GADGET+BEEPS, (ii) performing the intervention by removing all parents of $X_{j}$ in the sampled DAGs and setting the posterior mean $\mu_{j}=x_{j}$ , and (iii) finally solving Equation 12 to get the desired ACE.

We generate 10 non-linear ground truth models with fixed graph as in Figure 4. We then report the mean absolute errors (MAEs) between for all pairwise ACEs in Table 4 for ARCO-GP, in Table 5 for DIBS-GP, and in Table 6 for GADGET+BEEPS.

ARCO-GP achieves better MAEs than DIBS-GP for almost all pairwise ACEs. Not surprisingly, ARCO-GP performs better than GADGET+BEEPS on non-linear models.

Table 1: Real-world dataset from Sachs et al. (2005)). The target graph has 11 nodes and 17 edges. We report means and

95\%

confidence intervals (CIs) across

20

different ground truth models. Arrows next to metrics indicate lower is better (

\downarrow

) and higher is better (

\uparrow

Model	#Edges	$\downarrow$ A-AID	$\downarrow$ P-AID	$\downarrow$ OSET-AID
ARCO-K2-GP	7 $\pm$ 0	0.22 $\pm$ 0.01	0.48 $\pm$ 0.02	0.23 $\pm$ 0.01
ARCO-K3-GP	7 $\pm$ 1	0.22 $\pm$ 0.00	0.47 $\pm$ 0.02	0.22 $\pm$ 0.00
BAYESDAG	24 $\pm$ 1	0.33 $\pm$ 0.02	0.48 $\pm$ 0.02	0.33 $\pm$ 0.02
DAG-GNN	7 $\pm$ 0	0.23 $\pm$ 0.00	0.45 $\pm$ 0.00	0.23 $\pm$ 0.00
DDS	39 $\pm$ 1	0.36 $\pm$ 0.01	0.41 $\pm$ 0.01	0.36 $\pm$ 0.01
DIBS-GP	7 $\pm$ 1	0.22 $\pm$ 0.01	0.43 $\pm$ 0.04	0.22 $\pm$ 0.01
GADGET	9 $\pm$ 0	0.25 $\pm$ 0.00	0.50 $\pm$ 0.01	0.25 $\pm$ 0.00
GES	8 $\pm$ 0	0.28 $\pm$ 0.00	0.56 $\pm$ 0.00	0.28 $\pm$ 0.00
GOLEM	17 $\pm$ 0	0.40 $\pm$ 0.00	0.49 $\pm$ 0.00	0.39 $\pm$ 0.00
GRASP	8 $\pm$ 0	0.28 $\pm$ 0.00	0.56 $\pm$ 0.00	0.28 $\pm$ 0.00
PC	8 $\pm$ 0	–	–	–

(a) AID Metrics.

Model	$\downarrow$ ESHD	$\uparrow$ AUROC	$\uparrow$ AUPRC	$\uparrow$ TPR	$\uparrow$ TNR
ARCO-K2-GP	17 $\pm$ 1	0.56 $\pm$ 0.02	0.30 $\pm$ 0.02	0.20 $\pm$ 0.03	0.96 $\pm$ 0.01
ARCO-K3-GP	17 $\pm$ 0	0.55 $\pm$ 0.03	0.31 $\pm$ 0.02	0.21 $\pm$ 0.02	0.96 $\pm$ 0.00
BAYESDAG	34 $\pm$ 2	0.49 $\pm$ 0.04	0.18 $\pm$ 0.03	0.22 $\pm$ 0.06	0.78 $\pm$ 0.01
DAG-GNN	18 $\pm$ 0	0.57 $\pm$ 0.00	0.37 $\pm$ 0.00	0.18 $\pm$ 0.00	0.96 $\pm$ 0.00
DDS	45 $\pm$ 2	0.48 $\pm$ 0.01	0.19 $\pm$ 0.02	0.33 $\pm$ 0.00	0.64 $\pm$ 0.02
DIBS-GP	16 $\pm$ 1	0.61 $\pm$ 0.02	0.36 $\pm$ 0.04	0.24 $\pm$ 0.04	0.97 $\pm$ 0.01
GADGET	20 $\pm$ 0	0.63 $\pm$ 0.03	0.27 $\pm$ 0.02	0.19 $\pm$ 0.01	0.94 $\pm$ 0.00
GES	19 $\pm$ 0	0.63 $\pm$ 0.00	0.44 $\pm$ 0.00	0.35 $\pm$ 0.00	0.91 $\pm$ 0.00
GOLEM	26 $\pm$ 0	0.55 $\pm$ 0.00	0.29 $\pm$ 0.00	0.24 $\pm$ 0.00	0.86 $\pm$ 0.00
GRASP	19 $\pm$ 0	0.63 $\pm$ 0.00	0.44 $\pm$ 0.00	0.35 $\pm$ 0.00	0.91 $\pm$ 0.00
PC	19 $\pm$ 0	0.63 $\pm$ 0.00	0.44 $\pm$ 0.00	0.35 $\pm$ 0.00	0.91 $\pm$ 0.00

(b) Edge prediction metrics.

Table 2: Varying the maximal parent set size. Ablation studies on simulated (non-)linear ground truth models with

20

nodes. We report means and

95\%

confidence intervals (CIs) across

20

different ground truth models. Arrows next to metrics indicate lower is better (

\downarrow

) and higher is better (

\uparrow

Model	#Edges	$\downarrow$ A-AID	$\downarrow$ P-AID	$\downarrow$ OSET-AID	$\downarrow$ A-AID (CO)
ARCO-K1-GP	10 $\pm$ 1	0.17 $\pm$ 0.02	0.82 $\pm$ 0.02	0.19 $\pm$ 0.02	0.24 $\pm$ 0.05
ARCO-K2-GP	32 $\pm$ 1	0.06 $\pm$ 0.02	0.13 $\pm$ 0.04	0.07 $\pm$ 0.02	0.06 $\pm$ 0.03
ARCO-K3-GP	35 $\pm$ 1	0.09 $\pm$ 0.04	0.15 $\pm$ 0.04	0.10 $\pm$ 0.04	0.09 $\pm$ 0.05
ARCO-K4-GP	37 $\pm$ 1	0.08 $\pm$ 0.03	0.16 $\pm$ 0.05	0.10 $\pm$ 0.03	0.08 $\pm$ 0.04

(a) Scale-free Nonlinear (Part 1/2).

Model	$\downarrow$ ESHD	$\uparrow$ AUROC	$\uparrow$ AUPRC	$\uparrow$ TPR	$\uparrow$ TNR
ARCO-K1-GP	26 $\pm$ 1	0.86 $\pm$ 0.02	0.67 $\pm$ 0.02	0.28 $\pm$ 0.03	1.00 $\pm$ 0.00
ARCO-K2-GP	5 $\pm$ 2	0.97 $\pm$ 0.02	0.94 $\pm$ 0.03	0.88 $\pm$ 0.04	1.00 $\pm$ 0.00
ARCO-K3-GP	9 $\pm$ 2	0.94 $\pm$ 0.03	0.89 $\pm$ 0.04	0.87 $\pm$ 0.04	0.99 $\pm$ 0.00
ARCO-K4-GP	11 $\pm$ 2	0.94 $\pm$ 0.02	0.90 $\pm$ 0.04	0.86 $\pm$ 0.04	0.98 $\pm$ 0.00

(b) Scale-free Nonlinear (Part 2/2).

Model	#Edges	$\downarrow$ A-AID	$\downarrow$ P-AID	$\downarrow$ OSET-AID	$\downarrow$ A-AID (CO)
ARCO-K1-GP	8 $\pm$ 1	0.21 $\pm$ 0.03	0.50 $\pm$ 0.03	0.21 $\pm$ 0.02	0.36 $\pm$ 0.06
ARCO-K2-GP	20 $\pm$ 2	0.15 $\pm$ 0.02	0.32 $\pm$ 0.04	0.18 $\pm$ 0.03	0.23 $\pm$ 0.05
ARCO-K3-GP	29 $\pm$ 2	0.12 $\pm$ 0.02	0.23 $\pm$ 0.03	0.16 $\pm$ 0.03	0.19 $\pm$ 0.05
ARCO-K4-GP	32 $\pm$ 3	0.10 $\pm$ 0.03	0.21 $\pm$ 0.04	0.15 $\pm$ 0.04	0.18 $\pm$ 0.06

Model	$\downarrow$ ESHD	$\uparrow$ AUROC	$\uparrow$ AUPRC	$\uparrow$ TPR	$\uparrow$ TNR
ARCO-K1-GP	32 $\pm$ 3	0.79 $\pm$ 0.03	0.54 $\pm$ 0.04	0.19 $\pm$ 0.03	1.00 $\pm$ 0.00
ARCO-K2-GP	21 $\pm$ 3	0.84 $\pm$ 0.03	0.68 $\pm$ 0.05	0.49 $\pm$ 0.05	1.00 $\pm$ 0.00
ARCO-K3-GP	17 $\pm$ 2	0.87 $\pm$ 0.02	0.76 $\pm$ 0.04	0.65 $\pm$ 0.05	0.99 $\pm$ 0.00
ARCO-K4-GP	17 $\pm$ 3	0.88 $\pm$ 0.03	0.79 $\pm$ 0.04	0.70 $\pm$ 0.05	0.99 $\pm$ 0.00

(d) Erdös-Rényi Nonlinear (Part 2/2).

Table 3: Benchmarks on systems with 50 variables. Ablation studies on simulated non-linear ground truth models with

50

nodes and different DAG structures. We report means and

95\%

confidence intervals (CIs) across

20

different ground truth models.

Model	#Edges	$\downarrow$ A-AID	$\downarrow$ P-AID	$\downarrow$ OSET-AID
ARCO-GP	91 $\pm$ 1	0.03 $\pm$ 0.01	0.06 $\pm$ 0.02	0.04 $\pm$ 0.02
BAYESDAG	237 $\pm$ 5	0.25 $\pm$ 0.02	0.57 $\pm$ 0.05	0.25 $\pm$ 0.02
DAG-GNN	37 $\pm$ 4	0.14 $\pm$ 0.01	0.94 $\pm$ 0.02	0.14 $\pm$ 0.01
DDS	1082 $\pm$ 26	0.39 $\pm$ 0.01	0.42 $\pm$ 0.01	0.39 $\pm$ 0.01
DIBS-GP	70 $\pm$ 7	0.14 $\pm$ 0.01	0.82 $\pm$ 0.04	0.14 $\pm$ 0.01
GADGET	176 $\pm$ 5	0.31 $\pm$ 0.02	0.75 $\pm$ 0.02	0.31 $\pm$ 0.02
GES	220 $\pm$ 11	0.38 $\pm$ 0.05	0.71 $\pm$ 0.05	0.39 $\pm$ 0.05
GOLEM	56 $\pm$ 5	0.16 $\pm$ 0.02	0.88 $\pm$ 0.03	0.17 $\pm$ 0.02
GRASP	154 $\pm$ 8	0.30 $\pm$ 0.02	0.71 $\pm$ 0.03	0.31 $\pm$ 0.02
PC	108 $\pm$ 3	–	–	–

(a) Scale-free Nonlinear (Part 1/2).

Model	$\downarrow$ ESHD	$\uparrow$ AUROC	$\uparrow$ AUPRC	$\uparrow$ TPR	$\uparrow$ TNR
ARCO-GP	6 $\pm$ 2	0.98 $\pm$ 0.01	0.95 $\pm$ 0.03	0.94 $\pm$ 0.02	1.00 $\pm$ 0.00
BAYESDAG	220 $\pm$ 11	0.77 $\pm$ 0.03	0.40 $\pm$ 0.03	0.59 $\pm$ 0.04	0.92 $\pm$ 0.00
DAG-GNN	90 $\pm$ 4	0.61 $\pm$ 0.02	0.42 $\pm$ 0.03	0.22 $\pm$ 0.03	0.99 $\pm$ 0.00
DDS	1037 $\pm$ 25	0.79 $\pm$ 0.03	0.52 $\pm$ 0.05	0.73 $\pm$ 0.03	0.57 $\pm$ 0.01
DIBS-GP	123 $\pm$ 7	0.61 $\pm$ 0.02	0.27 $\pm$ 0.03	0.22 $\pm$ 0.03	0.98 $\pm$ 0.00
GADGET	191 $\pm$ 7	0.79 $\pm$ 0.02	0.30 $\pm$ 0.03	0.42 $\pm$ 0.02	0.94 $\pm$ 0.00
GES	229 $\pm$ 13	0.69 $\pm$ 0.02	0.34 $\pm$ 0.03	0.46 $\pm$ 0.05	0.92 $\pm$ 0.00
GOLEM	94 $\pm$ 6	0.65 $\pm$ 0.02	0.43 $\pm$ 0.04	0.30 $\pm$ 0.03	0.99 $\pm$ 0.00
GRASP	160 $\pm$ 9	0.72 $\pm$ 0.01	0.41 $\pm$ 0.02	0.49 $\pm$ 0.02	0.95 $\pm$ 0.00
PC	156 $\pm$ 5	0.61 $\pm$ 0.01	0.25 $\pm$ 0.02	0.25 $\pm$ 0.02	0.96 $\pm$ 0.00

(b) Scale-free Nonlinear (Part 2/2).

Model	#Edges	$\downarrow$ A-AID	$\downarrow$ P-AID	$\downarrow$ OSET-AID
ARCO-GP	48 $\pm$ 3	0.09 $\pm$ 0.01	0.26 $\pm$ 0.02	0.09 $\pm$ 0.01
BAYESDAG	195 $\pm$ 4	0.17 $\pm$ 0.03	0.48 $\pm$ 0.06	0.18 $\pm$ 0.03
DAG-GNN	34 $\pm$ 3	0.14 $\pm$ 0.01	0.57 $\pm$ 0.04	0.14 $\pm$ 0.01
DDS	928 $\pm$ 30	0.37 $\pm$ 0.03	0.45 $\pm$ 0.03	0.38 $\pm$ 0.03
DIBS-GP	33 $\pm$ 8	0.12 $\pm$ 0.01	0.43 $\pm$ 0.04	0.13 $\pm$ 0.01
GADGET	121 $\pm$ 4	0.25 $\pm$ 0.02	0.50 $\pm$ 0.05	0.26 $\pm$ 0.02
GES	134 $\pm$ 7	0.25 $\pm$ 0.03	0.45 $\pm$ 0.04	0.27 $\pm$ 0.03
GOLEM	52 $\pm$ 5	0.16 $\pm$ 0.02	0.54 $\pm$ 0.03	0.16 $\pm$ 0.01
GRASP	92 $\pm$ 5	0.18 $\pm$ 0.02	0.40 $\pm$ 0.05	0.19 $\pm$ 0.02
PC	92 $\pm$ 3	–	–	–

Model	$\downarrow$ ESHD	$\uparrow$ AUROC	$\uparrow$ AUPRC	$\uparrow$ TPR	$\uparrow$ TNR
ARCO-GP	56 $\pm$ 4	0.77 $\pm$ 0.02	0.57 $\pm$ 0.04	0.46 $\pm$ 0.03	1.00 $\pm$ 0.00
BAYESDAG	213 $\pm$ 9	0.68 $\pm$ 0.02	0.29 $\pm$ 0.02	0.41 $\pm$ 0.04	0.93 $\pm$ 0.00
DAG-GNN	110 $\pm$ 4	0.55 $\pm$ 0.01	0.25 $\pm$ 0.03	0.12 $\pm$ 0.01	0.99 $\pm$ 0.00
DDS	900 $\pm$ 26	0.75 $\pm$ 0.03	0.44 $\pm$ 0.05	0.64 $\pm$ 0.04	0.63 $\pm$ 0.01
DIBS-GP	105 $\pm$ 5	0.56 $\pm$ 0.02	0.30 $\pm$ 0.04	0.13 $\pm$ 0.03	0.99 $\pm$ 0.00
GADGET	130 $\pm$ 4	0.84 $\pm$ 0.01	0.46 $\pm$ 0.03	0.45 $\pm$ 0.02	0.97 $\pm$ 0.00
GES	122 $\pm$ 9	0.77 $\pm$ 0.02	0.51 $\pm$ 0.03	0.58 $\pm$ 0.03	0.97 $\pm$ 0.00
GOLEM	111 $\pm$ 4	0.59 $\pm$ 0.01	0.31 $\pm$ 0.03	0.20 $\pm$ 0.02	0.99 $\pm$ 0.00
GRASP	80 $\pm$ 6	0.79 $\pm$ 0.01	0.61 $\pm$ 0.03	0.60 $\pm$ 0.03	0.98 $\pm$ 0.00
PC	112 $\pm$ 5	0.69 $\pm$ 0.01	0.43 $\pm$ 0.02	0.40 $\pm$ 0.02	0.98 $\pm$ 0.00

(d) Erdös-Rényi Nonlinear (Part 2/2).

Table 4: Nonlinear Average Causal Effect Estimation (ARCO-GP). We report the mean absolute errors and

95\%

confidence intervals (CIs) between predicted and true average causal effects

\mathbb{E}_{X_{j}\,|\,do(X_{i}=1)}\left[X_{j}\right]

. The interventions

do(X_{i}=1)

are listed per column and the targets

X_{j}

listed per row. The results stem from simulations on

10

different ground truth models with fixed graph (see Figure 4) and simulated nonlinear mechanisms.

	do(X0 = 1)	do(X1 = 1)	do(X2 = 1)	do(X3 = 1)	do(X4 = 1)	do(X5 = 1)
X0	0.00 $\pm$ 0.00	0.06 $\pm$ 0.02	0.06 $\pm$ 0.02	0.06 $\pm$ 0.02	0.06 $\pm$ 0.02	0.06 $\pm$ 0.02
X1	0.07 $\pm$ 0.02	0.00 $\pm$ 0.00	0.09 $\pm$ 0.02	0.08 $\pm$ 0.03	0.09 $\pm$ 0.03	0.09 $\pm$ 0.03
X2	0.18 $\pm$ 0.09	0.25 $\pm$ 0.08	0.00 $\pm$ 0.00	0.10 $\pm$ 0.04	0.12 $\pm$ 0.03	0.10 $\pm$ 0.04
X3	0.11 $\pm$ 0.03	0.44 $\pm$ 0.23	0.06 $\pm$ 0.02	0.00 $\pm$ 0.00	0.06 $\pm$ 0.02	0.06 $\pm$ 0.02
X4	0.13 $\pm$ 0.05	0.30 $\pm$ 0.12	0.08 $\pm$ 0.03	0.08 $\pm$ 0.02	0.00 $\pm$ 0.00	0.06 $\pm$ 0.02
X5	0.26 $\pm$ 0.06	0.44 $\pm$ 0.21	0.04 $\pm$ 0.01	0.04 $\pm$ 0.01	0.34 $\pm$ 0.11	0.00 $\pm$ 0.00
X6	0.22 $\pm$ 0.08	0.27 $\pm$ 0.12	0.06 $\pm$ 0.03	0.06 $\pm$ 0.03	0.14 $\pm$ 0.04	0.10 $\pm$ 0.03
X7	0.12 $\pm$ 0.04	0.25 $\pm$ 0.14	0.09 $\pm$ 0.03	0.09 $\pm$ 0.03	0.08 $\pm$ 0.02	0.09 $\pm$ 0.04
X8	0.06 $\pm$ 0.02	0.06 $\pm$ 0.02	0.06 $\pm$ 0.02	0.06 $\pm$ 0.02	0.06 $\pm$ 0.02	0.06 $\pm$ 0.02
X9	0.07 $\pm$ 0.03	0.06 $\pm$ 0.03	0.07 $\pm$ 0.03	0.07 $\pm$ 0.03	0.07 $\pm$ 0.03	0.07 $\pm$ 0.03
X10	0.06 $\pm$ 0.02	0.06 $\pm$ 0.02	0.06 $\pm$ 0.03	0.06 $\pm$ 0.03	0.07 $\pm$ 0.02	0.07 $\pm$ 0.02
AVG	0.12 $\pm$ 0.02	0.20 $\pm$ 0.03	0.06 $\pm$ 0.01	0.06 $\pm$ 0.01	0.10 $\pm$ 0.01	0.07 $\pm$ 0.01

	do(X6 = 1)	do(X7 = 1)	do(X8 = 1)	do(X9 = 1)	do(X10 = 1)
X0	0.06 $\pm$ 0.02	0.06 $\pm$ 0.02	0.06 $\pm$ 0.02	0.06 $\pm$ 0.02	0.06 $\pm$ 0.02
X1	0.12 $\pm$ 0.03	0.10 $\pm$ 0.02	0.10 $\pm$ 0.03	0.09 $\pm$ 0.03	0.09 $\pm$ 0.03
X2	0.12 $\pm$ 0.03	0.11 $\pm$ 0.03	0.11 $\pm$ 0.03	0.11 $\pm$ 0.04	0.11 $\pm$ 0.04
X3	0.07 $\pm$ 0.02	0.07 $\pm$ 0.02	0.07 $\pm$ 0.02	0.07 $\pm$ 0.02	0.07 $\pm$ 0.02
X4	0.08 $\pm$ 0.03	0.08 $\pm$ 0.02	0.08 $\pm$ 0.02	0.08 $\pm$ 0.02	0.08 $\pm$ 0.02
X5	0.04 $\pm$ 0.01	0.04 $\pm$ 0.01	0.05 $\pm$ 0.01	0.05 $\pm$ 0.01	0.04 $\pm$ 0.01
X6	0.00 $\pm$ 0.00	0.06 $\pm$ 0.03	0.07 $\pm$ 0.03	0.06 $\pm$ 0.03	0.06 $\pm$ 0.03
X7	0.15 $\pm$ 0.05	0.00 $\pm$ 0.00	0.09 $\pm$ 0.03	0.09 $\pm$ 0.03	0.09 $\pm$ 0.03
X8	0.06 $\pm$ 0.02	0.06 $\pm$ 0.02	0.00 $\pm$ 0.00	0.06 $\pm$ 0.02	0.06 $\pm$ 0.02
X9	0.07 $\pm$ 0.03	0.06 $\pm$ 0.03	0.26 $\pm$ 0.17	0.00 $\pm$ 0.00	0.07 $\pm$ 0.03
X10	0.06 $\pm$ 0.02	0.06 $\pm$ 0.03	0.15 $\pm$ 0.08	0.12 $\pm$ 0.07	0.00 $\pm$ 0.00
AVG	0.07 $\pm$ 0.01	0.06 $\pm$ 0.01	0.09 $\pm$ 0.02	0.07 $\pm$ 0.01	0.07 $\pm$ 0.01

Table 5: Nonlinear Average Causal Effect Estimation (DIBS). We report the mean absolute errors and

95\%

confidence intervals (CIs) between predicted and true average causal effects

\mathbb{E}_{X_{j}\,|\,do(X_{i}=1)}\left[X_{j}\right]

. The interventions

do(X_{i}=1)

are listed per column and the targets

X_{j}

listed per row. The results stem from simulations on

10

different ground truth models with fixed graph (see Figure 4) and simulated nonlinear mechanisms.

	do(X0 = 1)	do(X1 = 1)	do(X2 = 1)	do(X3 = 1)	do(X4 = 1)	do(X5 = 1)
X0	0.00 $\pm$ 0.00	0.30 $\pm$ 0.28	0.23 $\pm$ 0.27	0.20 $\pm$ 0.22	0.10 $\pm$ 0.10	0.06 $\pm$ 0.03
X1	0.40 $\pm$ 0.33	0.00 $\pm$ 0.00	0.19 $\pm$ 0.28	0.09 $\pm$ 0.05	0.12 $\pm$ 0.10	0.07 $\pm$ 0.04
X2	0.45 $\pm$ 0.22	0.34 $\pm$ 0.20	0.00 $\pm$ 0.00	0.12 $\pm$ 0.09	0.09 $\pm$ 0.05	0.09 $\pm$ 0.06
X3	0.23 $\pm$ 0.23	0.55 $\pm$ 0.38	0.17 $\pm$ 0.25	0.00 $\pm$ 0.00	0.07 $\pm$ 0.03	0.10 $\pm$ 0.06
X4	0.24 $\pm$ 0.11	0.49 $\pm$ 0.23	0.20 $\pm$ 0.19	0.10 $\pm$ 0.06	0.00 $\pm$ 0.00	0.14 $\pm$ 0.07
X5	0.38 $\pm$ 0.10	0.50 $\pm$ 0.38	0.14 $\pm$ 0.19	0.07 $\pm$ 0.04	0.32 $\pm$ 0.20	0.00 $\pm$ 0.00
X6	0.37 $\pm$ 0.21	0.59 $\pm$ 0.26	0.06 $\pm$ 0.04	0.05 $\pm$ 0.03	0.20 $\pm$ 0.13	0.42 $\pm$ 0.23
X7	0.27 $\pm$ 0.15	0.43 $\pm$ 0.22	0.11 $\pm$ 0.07	0.10 $\pm$ 0.06	0.12 $\pm$ 0.09	0.17 $\pm$ 0.11
X8	0.07 $\pm$ 0.04	0.07 $\pm$ 0.04	0.07 $\pm$ 0.05	0.07 $\pm$ 0.04	0.07 $\pm$ 0.04	0.08 $\pm$ 0.05
X9	0.07 $\pm$ 0.04	0.07 $\pm$ 0.05	0.08 $\pm$ 0.04	0.07 $\pm$ 0.04	0.08 $\pm$ 0.04	0.07 $\pm$ 0.05
X10	0.04 $\pm$ 0.04	0.05 $\pm$ 0.03	0.07 $\pm$ 0.05	0.06 $\pm$ 0.04	0.04 $\pm$ 0.03	0.05 $\pm$ 0.04
AVG	0.23 $\pm$ 0.08	0.31 $\pm$ 0.10	0.12 $\pm$ 0.10	0.08 $\pm$ 0.04	0.11 $\pm$ 0.02	0.11 $\pm$ 0.03

	do(X6 = 1)	do(X7 = 1)	do(X8 = 1)	do(X9 = 1)	do(X10 = 1)
X0	0.06 $\pm$ 0.03	0.07 $\pm$ 0.03	0.08 $\pm$ 0.04	0.07 $\pm$ 0.02	0.05 $\pm$ 0.03
X1	0.10 $\pm$ 0.05	0.08 $\pm$ 0.05	0.08 $\pm$ 0.05	0.07 $\pm$ 0.04	0.09 $\pm$ 0.04
X2	0.10 $\pm$ 0.05	0.10 $\pm$ 0.05	0.10 $\pm$ 0.06	0.10 $\pm$ 0.05	0.09 $\pm$ 0.05
X3	0.08 $\pm$ 0.03	0.08 $\pm$ 0.03	0.09 $\pm$ 0.04	0.08 $\pm$ 0.03	0.08 $\pm$ 0.03
X4	0.11 $\pm$ 0.08	0.09 $\pm$ 0.04	0.08 $\pm$ 0.04	0.07 $\pm$ 0.04	0.08 $\pm$ 0.05
X5	0.20 $\pm$ 0.16	0.06 $\pm$ 0.04	0.07 $\pm$ 0.03	0.07 $\pm$ 0.03	0.05 $\pm$ 0.03
X6	0.00 $\pm$ 0.00	0.10 $\pm$ 0.11	0.06 $\pm$ 0.04	0.04 $\pm$ 0.03	0.04 $\pm$ 0.02
X7	0.27 $\pm$ 0.13	0.00 $\pm$ 0.00	0.10 $\pm$ 0.06	0.09 $\pm$ 0.05	0.10 $\pm$ 0.06
X8	0.07 $\pm$ 0.04	0.07 $\pm$ 0.04	0.00 $\pm$ 0.00	0.15 $\pm$ 0.21	0.14 $\pm$ 0.19
X9	0.08 $\pm$ 0.04	0.07 $\pm$ 0.05	0.40 $\pm$ 0.27	0.00 $\pm$ 0.00	0.14 $\pm$ 0.12
X10	0.06 $\pm$ 0.04	0.07 $\pm$ 0.05	0.29 $\pm$ 0.16	0.44 $\pm$ 0.34	0.00 $\pm$ 0.00
AVG	0.10 $\pm$ 0.02	0.07 $\pm$ 0.02	0.12 $\pm$ 0.04	0.11 $\pm$ 0.04	0.08 $\pm$ 0.02

Table 6: Nonlinear Average Causal Effect Estimation (GADGET + BEEPS). We report the mean absolute errors and

95\%

confidence intervals (CIs) between predicted and true average causal effects

\mathbb{E}_{X_{j}\,|\,do(X_{i}=1)}\left[X_{j}\right]

. The interventions

do(X_{i}=1)

are listed per column and the targets

X_{j}

listed per row. The results stem from simulations on

10

different ground truth models with fixed graph (see Figure 4) and simulated nonlinear mechanisms.

	do(X0 = 1)	do(X1 = 1)	do(X2 = 1)	do(X3 = 1)	do(X4 = 1)	do(X5 = 1)
X0	0.00 $\pm$ 0.00	0.38 $\pm$ 0.17	0.27 $\pm$ 0.13	0.34 $\pm$ 0.17	0.24 $\pm$ 0.14	0.15 $\pm$ 0.12
X1	0.74 $\pm$ 0.30	0.00 $\pm$ 0.00	0.20 $\pm$ 0.13	0.15 $\pm$ 0.08	0.22 $\pm$ 0.13	0.26 $\pm$ 0.16
X2	0.49 $\pm$ 0.28	0.31 $\pm$ 0.20	0.00 $\pm$ 0.00	0.16 $\pm$ 0.11	0.14 $\pm$ 0.14	0.21 $\pm$ 0.10
X3	0.49 $\pm$ 0.23	0.87 $\pm$ 0.44	0.09 $\pm$ 0.03	0.00 $\pm$ 0.00	0.13 $\pm$ 0.08	0.15 $\pm$ 0.10
X4	0.43 $\pm$ 0.28	0.74 $\pm$ 0.57	0.14 $\pm$ 0.09	0.15 $\pm$ 0.06	0.00 $\pm$ 0.00	0.20 $\pm$ 0.12
X5	0.54 $\pm$ 0.27	0.47 $\pm$ 0.31	0.07 $\pm$ 0.04	0.10 $\pm$ 0.07	0.32 $\pm$ 0.20	0.00 $\pm$ 0.00
X6	0.38 $\pm$ 0.17	0.48 $\pm$ 0.12	0.13 $\pm$ 0.10	0.09 $\pm$ 0.04	0.23 $\pm$ 0.13	0.33 $\pm$ 0.16
X7	0.30 $\pm$ 0.23	0.43 $\pm$ 0.24	0.09 $\pm$ 0.06	0.13 $\pm$ 0.11	0.12 $\pm$ 0.09	0.17 $\pm$ 0.13
X8	0.06 $\pm$ 0.04	0.06 $\pm$ 0.03	0.06 $\pm$ 0.04	0.06 $\pm$ 0.04	0.06 $\pm$ 0.04	0.06 $\pm$ 0.04
X9	0.08 $\pm$ 0.05	0.07 $\pm$ 0.05	0.08 $\pm$ 0.05	0.07 $\pm$ 0.05	0.08 $\pm$ 0.05	0.08 $\pm$ 0.05
X10	0.03 $\pm$ 0.01	0.03 $\pm$ 0.01	0.03 $\pm$ 0.01	0.03 $\pm$ 0.01	0.03 $\pm$ 0.01	0.03 $\pm$ 0.01
AVG	0.32 $\pm$ 0.10	0.35 $\pm$ 0.11	0.11 $\pm$ 0.03	0.12 $\pm$ 0.02	0.14 $\pm$ 0.03	0.15 $\pm$ 0.05

	do(X6 = 1)	do(X7 = 1)	do(X8 = 1)	do(X9 = 1)	do(X10 = 1)
X0	0.08 $\pm$ 0.05	0.09 $\pm$ 0.05	0.06 $\pm$ 0.03	0.06 $\pm$ 0.03	0.06 $\pm$ 0.02
X1	0.18 $\pm$ 0.11	0.15 $\pm$ 0.07	0.10 $\pm$ 0.04	0.10 $\pm$ 0.04	0.10 $\pm$ 0.04
X2	0.09 $\pm$ 0.04	0.12 $\pm$ 0.06	0.09 $\pm$ 0.05	0.09 $\pm$ 0.05	0.08 $\pm$ 0.04
X3	0.10 $\pm$ 0.05	0.12 $\pm$ 0.06	0.08 $\pm$ 0.02	0.08 $\pm$ 0.03	0.08 $\pm$ 0.03
X4	0.12 $\pm$ 0.07	0.13 $\pm$ 0.05	0.09 $\pm$ 0.03	0.08 $\pm$ 0.03	0.08 $\pm$ 0.04
X5	0.17 $\pm$ 0.11	0.12 $\pm$ 0.07	0.04 $\pm$ 0.02	0.04 $\pm$ 0.02	0.04 $\pm$ 0.02
X6	0.00 $\pm$ 0.00	0.24 $\pm$ 0.15	0.05 $\pm$ 0.03	0.05 $\pm$ 0.03	0.05 $\pm$ 0.03
X7	0.32 $\pm$ 0.09	0.00 $\pm$ 0.00	0.08 $\pm$ 0.04	0.08 $\pm$ 0.04	0.08 $\pm$ 0.04
X8	0.06 $\pm$ 0.03	0.06 $\pm$ 0.03	0.00 $\pm$ 0.00	0.15 $\pm$ 0.08	0.15 $\pm$ 0.08
X9	0.07 $\pm$ 0.05	0.08 $\pm$ 0.05	0.51 $\pm$ 0.20	0.00 $\pm$ 0.00	0.26 $\pm$ 0.13
X10	0.03 $\pm$ 0.01	0.03 $\pm$ 0.01	0.32 $\pm$ 0.16	0.49 $\pm$ 0.29	0.00 $\pm$ 0.00
AVG	0.11 $\pm$ 0.03	0.10 $\pm$ 0.03	0.13 $\pm$ 0.03	0.11 $\pm$ 0.02	0.09 $\pm$ 0.01

	$\displaystyle\mathbb{E}_{\mathcal{M}\,\|\,\mathcal{D}}\left[p(Y\,\|\,\mathcal{M}% )\right]$	$\displaystyle=\mathbb{E}_{G,\mathbf{f},\bm{\psi}\,\|\,\mathcal{D}}\left[p(Y\,\|% \,\mathcal{M})\right]$
		$\displaystyle=\mathbb{E}_{\bm{\theta},\bm{\psi}\,\|\,\mathcal{D}}\left[\mathbb{% E}_{G,\mathbf{f}\,\|\,\bm{\theta},\bm{\psi},\mathcal{D}}\left[p(Y\,\|\,\mathcal{% M})\right]\right]$
		$\displaystyle=\mathbb{E}_{\bm{\theta},\bm{\psi}\,\|\,\mathcal{D}}\left[\mathbb{% E}_{G\,\|\,\bm{\theta},\bm{\psi},\mathcal{D}}\left[\mathbb{E}_{\mathbf{f}\,\|\,% \bm{\psi},\mathcal{D}}\left[p(Y\,\|\,\mathcal{M})\right]\right]\right]$
		$\displaystyle=\mathbb{E}_{\bm{\theta},\bm{\psi}\,\|\,\mathcal{D}}\left[\mathbb{% E}_{L\,\|\,\bm{\theta},\bm{\psi},\mathcal{D}}\left[\mathbb{E}_{G\,\|\,L,\bm{\psi% },\mathcal{D}}\left[\mathbb{E}_{\mathbf{f}\,\|\,\bm{\psi},\mathcal{D}}\left[p(Y% \,\|\,\mathcal{M})\right]\right]\right]\right]$
		$\displaystyle=\mathbb{E}_{\bm{\theta},\bm{\psi}\,\|\,\mathcal{D}}\left[\mathbb{% E}_{L\,\|\,\bm{\theta}}\left[\frac{p(\mathcal{D},\bm{\psi}\,\|\,L)}{p(\mathcal{D% },\bm{\psi}\,\|\,\bm{\theta})}\cdot\mathbb{E}_{G\,\|\,L,\bm{\psi},\mathcal{D}}% \left[\mathbb{E}_{\mathbf{f}\,\|\,\bm{\psi},\mathcal{D}}\left[p(Y\,\|\,\mathcal{% M})\right]\right]\right]\right]$
		$\displaystyle=\mathbb{E}_{\bm{\theta},\bm{\psi}\,\|\,\mathcal{D}}\left[\mathbb{% E}_{L\,\|\,\bm{\theta}}\left[w^{L}\cdot\mathbb{E}_{G\,\|\,L,\bm{\psi},\mathcal{D% }}\left[\mathbb{E}_{\mathbf{f}\,\|\,\bm{\psi},\mathcal{D}}\left[p(Y\,\|\,% \mathcal{M})\right]\right]\right]\right]$

	$\displaystyle w^{L}$	$\displaystyle:=\frac{p(\mathcal{D},\bm{\psi}\,\|\,L)}{p(\mathcal{D},\bm{\psi}\,% \|\,\bm{\theta})}$
		$\displaystyle=\frac{\mathbb{E}_{G\,\|\,L}\left[p(\mathcal{D},\bm{\psi}\,\|\,G)% \right]}{p(\mathcal{D},\bm{\psi}\,\|\,\bm{\theta})}$
		$\displaystyle=\frac{\mathbb{E}_{G\,\|\,L}\left[p(\mathcal{D}\,\|\,\bm{\psi},G)% \cdot p(\bm{\psi}\,\|\,G)\right]}{p(\mathcal{D},\bm{\psi}\,\|\,\bm{\theta})}$
		$\displaystyle=\frac{\mathbb{E}_{G\,\|\,L}\left[p(\mathcal{D}\,\|\,\bm{\psi},G)% \cdot p(\bm{\psi}\,\|\,G)\right]}{\mathbb{E}_{L^{\prime}\,\|\,\bm{\theta}}\left[% p(\mathcal{D},\bm{\psi}\,\|\,L^{\prime})\right]}$
		$\displaystyle=\frac{\mathbb{E}_{G\,\|\,L}\left[p(\mathcal{D}\,\|\,\bm{\psi},G)% \cdot p(\bm{\psi}\,\|\,G)\right]}{\mathbb{E}_{L^{\prime}\,\|\,\bm{\theta}}\left[% \mathbb{E}_{G^{\prime}\,\|\,L^{\prime}}\left[p(\mathcal{D}\,\|\,\bm{\psi},G^{% \prime})\cdot p(\bm{\psi}\,\|\,G^{\prime})\right]\right]}.$

	$\displaystyle\nabla\log p(\bm{\theta},\bm{\psi}\,\|\,\mathcal{D})$	$\displaystyle=\nabla\log\frac{p(\mathcal{D},\bm{\psi}\,\|\,\bm{\theta})\cdot p(% \bm{\theta})}{p(\mathcal{D})}$
		$\displaystyle=\nabla\log p(\bm{\theta})+\nabla\log p(\mathcal{D},\bm{\psi}\,\|% \,\bm{\theta})$
		$\displaystyle=\nabla\log p(\bm{\theta})+\nabla\log\mathbb{E}_{L\,\|\,\bm{\theta% }}\left[p(\mathcal{D},\bm{\psi}\,\|\,L)\right]$
		$\displaystyle=\nabla\log p(\bm{\theta})+\nabla\log\mathbb{E}_{L\,\|\,\bm{\theta% }}\left[\mathbb{E}_{G\,\|\,L}\left[p(\mathcal{D},\bm{\psi}\,\|\,G)\right]\right]$
		$\displaystyle=\nabla\log p(\bm{\theta})+\nabla\log\mathbb{E}_{L\,\|\,\bm{\theta% }}\left[\mathbb{E}_{G\,\|\,L}\left[p(\mathcal{D}\,\|\,\bm{\psi},G)\cdot p(\bm{% \psi}\,\|\,G)\right]\right]$

	$\displaystyle\nabla_{\bm{\theta}}\log p(\bm{\theta},\bm{\psi}\,\|\,\mathcal{D})$	$\displaystyle=\nabla_{\bm{\theta}}\log p(\bm{\theta})+\nabla_{\bm{\theta}}\log% \mathbb{E}_{L\,\|\,\bm{\theta}}\left[\mathbb{E}_{G\,\|\,L}\left[p(\mathcal{D}\,\|% \,\bm{\psi},G)\cdot p(\bm{\psi}\,\|\,G)\right]\right]$
		$\displaystyle=\nabla_{\bm{\theta}}\log p(\bm{\theta})+\frac{\nabla_{\bm{\theta% }}\mathbb{E}_{L\,\|\,\bm{\theta}}\left[\mathbb{E}_{G\,\|\,L}\left[p(\mathcal{D}% \,\|\,\bm{\psi},G)\cdot p(\bm{\psi}\,\|\,G)\right]\right]}{\mathbb{E}_{L^{\prime% }\,\|\,\bm{\theta}}\left[\mathbb{E}_{G^{\prime}\,\|\,L^{\prime}}\left[p(\mathcal% {D}\,\|\,\bm{\psi},G^{\prime})\cdot p(\bm{\psi}\,\|\,G^{\prime})\right]\right]}$
and using $\nabla_{\bm{\theta}}p(L\,\|\,\bm{\theta})=p(L\,\|\,\bm{\theta})\cdot\nabla_{\bm{% \theta}}\log p(L\,\|\,\bm{\theta})$
		$\displaystyle=\nabla_{\bm{\theta}}\log p(\bm{\theta})+\frac{\mathbb{E}_{L\,\|\,% \bm{\theta}}\left[\mathbb{E}_{G\,\|\,L}\left[p(\mathcal{D}\,\|\,\bm{\psi},G)% \cdot p(\bm{\psi}\,\|\,G)\right]\cdot\nabla_{\bm{\theta}}\log p(L\,\|\,\bm{% \theta})\right]}{\mathbb{E}_{L^{\prime}\,\|\,\bm{\theta}}\left[\mathbb{E}_{G^{% \prime}\,\|\,L^{\prime}}\left[p(\mathcal{D}\,\|\,\bm{\psi},G^{\prime})\cdot p(% \bm{\psi}\,\|\,G^{\prime})\right]\right]}$
		$\displaystyle=\nabla_{\bm{\theta}}\log p(\bm{\theta})+\mathbb{E}_{L\,\|\,\bm{% \theta}}\left[w^{L}\cdot\nabla_{\bm{\theta}}\log p(L\,\|\,\bm{\theta})\right]$

	$\displaystyle\nabla_{\bm{\psi}_{k}}\log p(\bm{\theta},\bm{\psi}\,\|\,\mathcal{D})$	$\displaystyle=\nabla_{\bm{\psi}_{k}}\log p(\bm{\theta})+\nabla_{\bm{\psi}_{k}}% \log\mathbb{E}_{L\,\|\,\bm{\theta}}\left[\mathbb{E}_{G\,\|\,L}\left[p(\mathcal{D% },\bm{\psi}\,\|\,G)\right]\right]$
		$\displaystyle=\frac{\mathbb{E}_{L\,\|\,\bm{\theta}}\left[\mathbb{E}_{G\,\|\,L}% \left[\nabla_{\bm{\psi}_{k}}p(\mathcal{D},\bm{\psi}\,\|\,G)\right]\right]}{% \mathbb{E}_{L\,\|\,\bm{\theta}}\left[\mathbb{E}_{G\,\|\,L}\left[p(\mathcal{D},% \bm{\psi}\,\|\,G)\right]\right]}$
		$\displaystyle=\frac{\mathbb{E}_{L\,\|\,\bm{\theta}}\left[\mathbb{E}_{G\,\|\,L}% \left[\nabla_{\bm{\psi}_{k}}p(\mathcal{D},\bm{\psi}\,\|\,G)\right]\right]}{p(% \mathcal{D},\bm{\psi}\,\|\,\bm{\theta})}$
and as the marginal likelihood and the prior over GP hyper-parameters factorise over parent sets we further get
		$\displaystyle=\frac{\mathbb{E}_{L\,\|\,\bm{\theta}}\left[\mathbb{E}_{G\,\|\,L}% \left[\nabla_{\bm{\psi}_{k}}\prod_{i=1}^{d}p(\mathcal{D}_{i},\bm{\psi}_{i}\,\|% \,\textbf{Pa}_{i}^{G})\right]\right]}{p(\mathcal{D},\bm{\psi}\,\|\,\bm{\theta})}$
		$\displaystyle=\frac{\mathbb{E}_{L\,\|\,\bm{\theta}}\left[\sum_{G}p(G\,\|\,L)% \cdot\nabla_{\bm{\psi}_{k}}\prod_{i=1}^{d}p(\mathcal{D}_{i},\bm{\psi}_{i}\,\|\,% \textbf{Pa}_{i}^{G})\right]}{p(\mathcal{D},\bm{\psi}\,\|\,\bm{\theta})}.$
Now, note that for the summation over graphs $G$ , the gradient is zero for all graphs that do not contain the parent set $\textbf{Pa}_{k}$ corresponding to the GP with hyper-parameters $\psi_{k}$ . Consequently, we get
		$\displaystyle=\frac{\mathbb{E}_{L\,\|\,\bm{\theta}}\left[\sum_{G\,\|\,\textbf{Pa% }_{k}\in G}p(G\,\|\,L)\cdot\nabla_{\bm{\psi}_{k}}\prod_{i=1}^{d}p(\mathcal{D}_{% i},\bm{\psi}_{i}\,\|\,\textbf{Pa}_{i}^{G})\right]}{p(\mathcal{D},\bm{\psi}\,\|\,% \bm{\theta})}$
and since $\nabla_{\bm{\psi}_{k}}p(\mathcal{D}_{k},\bm{\psi}_{k}\,\|\,\textbf{Pa}_{k})=p(% \mathcal{D}_{k},\bm{\psi}_{k}\,\|\,\textbf{Pa}_{k})\cdot\nabla_{\bm{\psi}_{k}}% \log p(\mathcal{D}_{k},\bm{\psi}_{k}\,\|\,\textbf{Pa}_{k})$
		$\displaystyle=\frac{\mathbb{E}_{L\,\|\,\bm{\theta}}\left[\sum_{G\,\|\,\textbf{Pa% }_{k}\in G}p(G\,\|\,L)\cdot\prod_{i=1}^{d}p(\mathcal{D}_{i},\bm{\psi}_{i}\,\|\,% \textbf{Pa}_{i}^{G})\cdot\nabla_{\bm{\psi}_{k}}\log p(\mathcal{D}_{k},\bm{\psi% }_{k}\,\|\,\textbf{Pa}_{k})\right]}{p(\mathcal{D},\bm{\psi}\,\|\,\bm{\theta})}$
		$\displaystyle=\frac{\mathbb{E}_{L\,\|\,\bm{\theta}}\left[\sum_{G\,\|\,\textbf{Pa% }_{k}\in G}p(G\,\|\,L)\cdot p(\mathcal{D},\bm{\psi}\,\|\,G)\right]}{p(\mathcal{D% },\bm{\psi}\,\|\,\bm{\theta})}\cdot\nabla_{\bm{\psi}_{k}}\log p(\mathcal{D}_{k}% ,\bm{\psi}_{k}\,\|\,\textbf{Pa}_{k}).$

Effective Bayesian Causal Inference via Structural Marginalisation and Autoregressive Orders

Abstract

1 Introduction

2 Background

3 Bayesian Causal Inference via Structural Marginalisation

Inference Phase.

Learning Phase.

3.1 Causal Order Inference

Sampling Causal Orders.

Training ARCO.

3.2 Marginalising over Causal Graphs

Proposition 3.1.

Proposition 3.2.

3.3 Mechanism Inference and Marginalisation

4 Related Work

5 Experiments

6 Discussion

Acknowledgements

References

Appendix A Proofs and Derivations

A.1 Derivation of the Posterior Expectation w.r.t. SCMs

A.2 Derivation of the Gradient Estimators

A.3 Proofs regarding Exhaustive Parent Set Enumeration

Proof of Proposition 3.1

Proof.

Proof of Proposition 3.2

Proof.

Appendix B Implementation and Estimation Details

B.1 Mechanism Model.

B.2 ARCO Model.

B.3 Kernel Density Estimates of Interventional Distributions.

Appendix C Experimental Setup

Sampling of Ground-Truth SCMs.

Metrics.

Baselines.

Appendix D Extended Experimental Results

Real-world dataset from Sachs et al. (2005).

Simulations on larger models with d=50𝑑50d=50italic_d = 50 variables.

Influence of the maximum parent set size.

Non-linear Average Causal Effect Estimation.

Effective Bayesian Causal Inference via
Structural Marginalisation and Autoregressive Orders

Simulations on larger models with $d=50$ variables.