
An Overview of Edward:

A Probabilistic Programming System

Dustin Tran
Columbia University
Alp Kucukelbir Adji Dieng Dawen Liang

Eugene Brevdo Maja Rudolph Matt Hoffman Rajesh Ranganath

Andrew Gelman David Blei Kevin Murphy Rif Saurous


• Exploratory analysis of 1.7M taxi trajectories, in Stan [Kucukelbir+ 2017]
• Simulators of 100K time series in ecology, in Edward [Tran+ 2017]
• Generation & compression of 10M colored 32x32 images, in Edward [Tran+ 2017; fig from van den Oord+ 2016]
• Cause and effect of 1.6B genetic measurements, in Edward [in preparation; fig from Gopalan+ 2017]
• Spatial analysis of 150,000 shots from 308 NBA players, in Edward [Dieng+ 2017]
Probabilistic machine learning

• A probabilistic model is a joint distribution of hidden variables z and observed variables x,

    p(z, x).

• Inference about the unknowns is through the posterior, the conditional distribution of the hidden variables given the observations,

    p(z | x) = p(z, x) / p(x).

• For most interesting models, the denominator is not tractable. We appeal to approximate posterior inference.
Variational inference
[Figure: the variational family q(z; ν) is initialized at ν_init and optimized to ν*, minimizing KL(q(z; ν*) || p(z | x)) to the exact posterior p(z | x)]

• VI solves inference with optimization.

• Posit a variational family of distributions over the latent variables, q(z; ν).

• Fit the variational parameters ν to be close (in KL) to the exact posterior.
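
Minimizing this KL divergence is equivalent to maximizing the evidence lower bound (ELBO), the standard variational objective (stated here for reference):

    ν* = argmax_ν E_{q(z; ν)}[ log p(x, z) − log q(z; ν) ].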
What is probabilistic programming?

Probabilistic programs reify models from mathematics to physical objects.

• Each model is equipped with memory (“bits”, floating point, storage) and
computation (“flops”, scalability, communication).
Anything you do lives in the world of probabilistic programming.
• Any computable model.

ex. graphical models; neural networks; SVMs; stochastic processes.


• Any computable inference algorithm.

ex. automated inference; model-specific algorithms; inference within inference (learning to learn).
• Any computable application.

ex. exploratory analysis; object recognition; code generation; causality.


[fig. from Frank Wood]
George E.P. Box (1919 - 2013)

An iterative process for science:


1. Build a model of the science
2. Infer the model given data
3. Criticize the model given data

[Box & Hunter 1962, 1965; Box & Hill 1967; Box 1976, 1980]
Box’s Loop

Edward is a library designed around this loop.

[Box 1976, 1980; Blei 2014]


We have an active community of several thousand users & many contributors.
Model

Edward’s language augments computational graphs with an abstraction for random variables. Each random variable x is associated to a tensor x∗,

    x∗ ∼ p(x | θ∗).

Unlike tf.Tensors, ed.RandomVariables carry an explicit density with methods such as log_prob() and sample().
For implementation, we wrap all TensorFlow Distributions and call sample to
produce the associated tensor.
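
A minimal sketch of the abstraction (assuming Edward and TensorFlow 1.x; shapes are illustrative):

import tensorflow as tf
from edward.models import Normal

# building a random variable adds its sample tensor x* to the graph
x = Normal(loc=tf.zeros(10), scale=tf.ones(10))

log_density = x.log_prob(tf.zeros(10))  # explicit density, unlike a plain tf.Tensor
draw = x.sample()                       # additional draws from p(x | θ)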

[Tran+ 2017]
Example: Beta-Bernoulli

Consider a Beta-Bernoulli model,

    p(x, θ) = Beta(θ | 1, 1) ∏_{n=1}^{50} Bernoulli(x_n | θ),

where θ is a probability shared across 50 data points x ∈ {0, 1}^50.

Fetching x from the graph generates a binary vector of 50 elements.


All computation is represented on the graph, enabling us to leverage model
structure during inference.
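
A minimal sketch of this model in Edward (one way to write the 50 conditionally independent draws):

import tensorflow as tf
from edward.models import Bernoulli, Beta

theta = Beta(1.0, 1.0)
x = Bernoulli(probs=tf.ones(50) * theta)  # 50 coin flips sharing the probability θ

# fetching x from the graph, e.g. tf.Session().run(x), yields a binary vector of 50 elements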
Example: Variational Auto-Encoder for Binarized MNIST

[Kingma & Welling 2014; Rezende+ 2014]

[Demo]
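
A minimal sketch of the generative model (generative_network is an assumed decoder; N and d are illustrative):

import tensorflow as tf
from edward.models import Bernoulli, Normal

N, d = 128, 10  # batch size and latent dimension (illustrative)

def generative_network(z):
  # assumed decoder: map latent codes to 28*28 Bernoulli logits
  hidden = tf.layers.dense(z, 256, activation=tf.nn.relu)
  return tf.layers.dense(hidden, 28 * 28)

# prior over latent codes and likelihood over binarized pixels
z = Normal(loc=tf.zeros([N, d]), scale=tf.ones([N, d]))
x = Bernoulli(logits=generative_network(z))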
Example: Bayesian neural network for classification

[Denker+ 1987; MacKay 1992; Hinton & Van Camp, 1993; Neal 1995]
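
A minimal sketch of a one-hidden-layer Bayesian neural network classifier (sizes and shapes are illustrative):

import tensorflow as tf
from edward.models import Bernoulli, Normal

N, D, H = 40, 5, 10  # data points, input features, hidden units (illustrative)
X = tf.placeholder(tf.float32, [N, D])

# priors over weights and biases
W_0 = Normal(loc=tf.zeros([D, H]), scale=tf.ones([D, H]))
b_0 = Normal(loc=tf.zeros(H), scale=tf.ones(H))
W_1 = Normal(loc=tf.zeros([H, 1]), scale=tf.ones([H, 1]))
b_1 = Normal(loc=tf.zeros(1), scale=tf.ones(1))

# likelihood of binary labels
h = tf.tanh(tf.matmul(X, W_0) + b_0)
y = Bernoulli(logits=tf.reshape(tf.matmul(h, W_1) + b_1, [N]))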
Example: Gaussian process classification

[Rasmussen & Williams, 2006; fig from Hensman+ 2013]
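
One way this model could be written in Edward (a sketch; rbf_kernel is a helper written for this example, and sizes are illustrative):

import tensorflow as tf
from edward.models import Bernoulli, MultivariateNormalTriL

N, D = 100, 2  # data points and features (illustrative)
X = tf.placeholder(tf.float32, [N, D])

def rbf_kernel(X, variance=1.0, lengthscale=1.0):
  # squared-exponential kernel (helper written for this sketch)
  sq = tf.reduce_sum(tf.square(X), axis=1, keep_dims=True)
  d2 = sq - 2.0 * tf.matmul(X, X, transpose_b=True) + tf.transpose(sq)
  return variance * tf.exp(-0.5 * d2 / lengthscale**2)

K = rbf_kernel(X) + 1e-6 * tf.eye(N)  # jitter for numerical stability
f = MultivariateNormalTriL(loc=tf.zeros(N), scale_tril=tf.cholesky(K))
y = Bernoulli(logits=f)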


Inference

Given
• Data xtrain.
• Model p(x, z, β) of observed variables x and latent variables z, β.

Goal
• Calculate the posterior distribution

    p(z, β | xtrain) = p(xtrain, z, β) / ∫ p(xtrain, z, β) dz dβ.

This is the key problem in Bayesian inference.

edwardlib.org/tutorials
Inference

All Inference algorithms take (at least) two inputs:

1. a binding of latent variables to their posterior approximations;
2. a binding of observed variables to their realizations (the data).

Inference has class methods to finely control the algorithm. Edward is as fast as handwritten TensorFlow at runtime.
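
A minimal sketch of these two bindings (names are illustrative; KLqp is Edward's variational inference algorithm):

import numpy as np
import tensorflow as tf
import edward as ed
from edward.models import Normal

# toy model: latent mean z with 50 Normal observations
z = Normal(loc=0.0, scale=1.0)
x = Normal(loc=z, scale=1.0, sample_shape=50)

# variational approximation for z
qz = Normal(loc=tf.Variable(0.0), scale=tf.nn.softplus(tf.Variable(0.0)))

x_train = np.zeros(50, dtype=np.float32)  # stand-in for real observations
inference = ed.KLqp({z: qz},              # latent variables ↔ posterior approximations
                    data={x: x_train})    # observed variables ↔ realizations
inference.run(n_iter=1000)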
edwardlib.org/api
Inference

Variational inference. It uses a variational model.

Monte Carlo. It uses an Empirical approximation.

Conjugacy & exact inference. It uses symbolic algebra on the graph.
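
For example, the same latent variable can be approximated with a parametric variational model or with an Empirical distribution of samples (a sketch with illustrative names):

import numpy as np
import tensorflow as tf
import edward as ed
from edward.models import Empirical, Normal

# toy model
z = Normal(loc=0.0, scale=1.0)
x = Normal(loc=z, scale=1.0, sample_shape=50)
x_train = np.zeros(50, dtype=np.float32)

# variational inference: a parametric variational model
qz_vi = Normal(loc=tf.Variable(0.0), scale=tf.nn.softplus(tf.Variable(0.0)))
inference_vi = ed.KLqp({z: qz_vi}, data={x: x_train})

# Monte Carlo: an Empirical approximation holding T posterior samples
T = 1000
qz_mc = Empirical(params=tf.Variable(tf.zeros(T)))
inference_mc = ed.HMC({z: qz_mc}, data={x: x_train})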


Inference: Composing Inference

Core to Edward’s design is that inference can be written as a collection of separate inference programs.

For example, here is variational EM (sketched below).
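
A minimal sketch, assuming a toy model with a global latent variable beta and local latent variables z (names and shapes are illustrative; the E and M steps alternate updates):

import numpy as np
import tensorflow as tf
import edward as ed
from edward.models import Normal, PointMass

N = 50  # number of data points (illustrative)

# model: global latent beta, local latents z, observations x
beta = Normal(loc=0.0, scale=1.0)
z = Normal(loc=tf.zeros(N), scale=tf.ones(N))
x = Normal(loc=z + beta, scale=1.0)

# approximations: variational qz for the E-step, point mass qbeta for the M-step
qz = Normal(loc=tf.Variable(tf.zeros(N)),
            scale=tf.nn.softplus(tf.Variable(tf.zeros(N))))
qbeta = PointMass(params=tf.Variable(0.0))

x_train = np.zeros(N, dtype=np.float32)  # stand-in data

inference_e = ed.KLqp({z: qz}, data={x: x_train, beta: qbeta})
inference_m = ed.MAP({beta: qbeta}, data={x: x_train, z: qz})

inference_e.initialize()
inference_m.initialize()

sess = ed.get_session()
sess.run(tf.global_variables_initializer())

for _ in range(1000):
  inference_e.update()
  inference_m.update()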

We can also write message passing algorithms, which work over a collection
of local inference problems. This includes expectation propagation.

[Neal & Hinton 1993; Minka 2001; Gelman+ 2017]


Non-Bayesian Methods: GANs

GANs posit a generative process,

    ε ∼ Normal(0, 1)
    x = G(ε; θ),

for some generative network G.


Training uses a discriminative network D via the optimization problem

    min_θ max_φ  E_{p*(x)}[log D(x; φ)] + E_{p(x; θ)}[log(1 − D(x; φ))].

The generator tries to generate samples indistinguishable from the true data.
The discriminator tries to distinguish samples from the generator from samples from the true data.
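
A minimal sketch in Edward (generative_network and discriminative_network are assumed helper networks; ed.GANInference binds the generated samples to real data and takes the discriminator):

import numpy as np
import tensorflow as tf
import edward as ed
from edward.models import Normal

M, d = 128, 100  # batch size and noise dimension (illustrative)

def generative_network(eps):
  # assumed generator: map noise to 28*28 pixel intensities
  hidden = tf.layers.dense(eps, 256, activation=tf.nn.relu)
  return tf.layers.dense(hidden, 28 * 28, activation=tf.sigmoid)

def discriminative_network(x):
  # assumed discriminator: map samples to a real-valued score
  hidden = tf.layers.dense(x, 256, activation=tf.nn.relu)
  return tf.layers.dense(hidden, 1)

eps = Normal(loc=tf.zeros([M, d]), scale=tf.ones([M, d]))
x = generative_network(eps)

x_ph = tf.placeholder(tf.float32, [M, 28 * 28])  # fed with batches of real images
inference = ed.GANInference(data={x: x_ph}, discriminator=discriminative_network)
inference.initialize()

sess = ed.get_session()
sess.run(tf.global_variables_initializer())
x_batch = np.random.binomial(1, 0.5, size=(M, 28 * 28)).astype(np.float32)  # stand-in batch
inference.update(feed_dict={x_ph: x_batch})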

[Goodfellow+ 2014]
Example: Generative Adversarial Network for MNIST

[Demo]
http://edwardlib.org/tutorials/gan
Non-Bayesian Methods: GANs

[Goodfellow+ 2014]
Non-Bayesian Methods: GANs

[Arjovsky+ 2017; Gulrajani+ 2017]


Current Work
Dynamic Graphs
Distributions Backend
# `pixelcnn` is an externally defined autoregressive network; `images` is a batch
# of binarized images; `tfd` aliases the TensorFlow Distributions module.
def pixelcnn_dist(params=None, x_shape=(32, 32, 3)):
  def _logit_func(features):
    # single autoregressive step on observed features
    logits = pixelcnn(features)
    return logits

  logit_template = tf.make_template("pixelcnn", _logit_func)
  make_dist = lambda x: tfd.Independent(tfd.Bernoulli(logit_template(x)))
  return tfd.Autoregressive(make_dist, tf.reduce_prod(x_shape))

x = pixelcnn_dist()
loss = -tf.reduce_sum(x.log_prob(images))
train = tf.train.AdamOptimizer().minimize(loss)  # run for training
generate = x.sample()                            # run for generation

TensorFlow Distributions consists of a large collection of distributions.

Bijectors enable efficient, composable manipulation of probability distributions (see the sketch below).

PyTorch PPLs are consolidating on a shared distributions backend.
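
A minimal Bijector sketch (assuming the TF 1.x contrib aliases for the distributions and bijectors modules):

import tensorflow as tf

tfd = tf.contrib.distributions
tfb = tfd.bijectors

# a log-normal built by pushing a Normal through an Exp bijector
log_normal = tfd.TransformedDistribution(
    distribution=tfd.Normal(loc=0.0, scale=1.0),
    bijector=tfb.Exp())

samples = log_normal.sample(5)
density = log_normal.log_prob(samples)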

[Dillon+ 2017]
Distributed, Compiled, Accelerated Systems

Probabilistic programming over multiple machines. XLA compiler optimization and TPUs. More flexible programmable inference.
References

edwardlib.org
• Edward: A library for probabilistic modeling, inference, and criticism.
arXiv preprint arXiv:1610.09787, 2016.
• Deep probabilistic programming.
International Conference on Learning Representations, 2017.
• TensorFlow Distributions.
arXiv preprint arXiv:1711.10604, 2017.
