
An Overview of Edward:

A Probabilistic Programming System

Dustin Tran
Columbia University
Alp Kucukelbir Adji Dieng Dawen Liang

Eugene Brevdo Maja Rudolph Matt Hoffman Rajesh Ranganath

Andrew Gelman David Blei Kevin Murphy Rif Saurous


• Exploratory analysis of 1.7M taxi trajectories, in Stan [Kucukelbir+ 2017]
• Simulators of 100K time series in ecology, in Edward [Tran+ 2017]
• Generation & compression of 10M colored 32x32 images, in Edward [Tran+ 2017; fig from van den Oord+ 2016]
• Cause and effect of 1.6B genetic measurements, in Edward [in preparation; fig from Gopalan+ 2017]
• Spatial analysis of 150,000 shots from 308 NBA players, in Edward [Dieng+ 2017]
Probabilistic machine learning

• A probabilistic model is a joint distribution of hidden variables z and observed variables x,

    p(z, x).

• Inference about the unknowns is through the posterior, the conditional distribution of the hidden variables given the observations,

    p(z | x) = p(z, x) / p(x).

• For most interesting models, the denominator is not tractable. We appeal to approximate posterior inference.
Variational inference
[Figure: the variational family q(z; ν) is initialized at ν_init and optimized to ν*, minimizing KL(q(z; ν*) || p(z | x)) to the exact posterior p(z | x)]

• VI solves inference with optimization.

• Posit a variational family of distributions over the latent variables, q(z; ν).

• Fit the variational parameters ν to be close (in KL) to the exact posterior.
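
Minimizing this KL divergence is equivalent to maximizing the evidence lower bound (ELBO), the standard variational objective (stated here for reference):

    ν* = argmax_ν E_{q(z; ν)}[ log p(x, z) − log q(z; ν) ].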
What is probabilistic programming?

Probabilistic programs reify models from mathematics to physical objects.

• Each model is equipped with memory (“bits”, floating point, storage) and
computation (“flops”, scalability, communication).
Anything you do lives in the world of probabilistic programming.
• Any computable model.

ex. graphical models; neural networks; SVMs; stochastic processes.


• Any computable inference algorithm.

ex. automated inference; model-specific algorithms; inference within inference (learning to learn).
• Any computable application.

ex. exploratory analysis; object recognition; code generation; causality.


[fig. from Frank Wood]
George E.P. Box (1919 - 2013)

An iterative process for science:


1. Build a model of the science
2. Infer the model given data
3. Criticize the model given data

[Box & Hunter 1962, 1965; Box & Hill 1967; Box 1976, 1980]
Box’s Loop

Edward is a library designed around this loop.

[Box 1976, 1980; Blei 2014]


We have an active community of several thousand users & many contributors.
Model

Edward’s language augments computational graphs with an abstraction for random variables. Each random variable x is associated to a tensor x∗,

    x∗ ∼ p(x | θ∗).

Unlike tf.Tensors, ed.RandomVariables carry an explicit density with methods such as log_prob() and sample().
For implementation, we wrap all TensorFlow Distributions and call sample to
produce the associated tensor.
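
A minimal sketch of the abstraction (assuming Edward and TensorFlow 1.x; shapes are illustrative):

import tensorflow as tf
from edward.models import Normal

# building a random variable adds its sample tensor x* to the graph
x = Normal(loc=tf.zeros(10), scale=tf.ones(10))

log_density = x.log_prob(tf.zeros(10))  # explicit density, unlike a plain tf.Tensor
draw = x.sample()                       # additional draws from p(x | θ)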

[Tran+ 2017]
Example: Beta-Bernoulli

Consider a Beta-Bernoulli model,

    p(x, θ) = Beta(θ | 1, 1) ∏_{n=1}^{50} Bernoulli(x_n | θ),

where θ is a probability shared across 50 data points x ∈ {0, 1}^50.

Fetching x from the graph generates a binary vector of 50 elements.


All computation is represented on the graph, enabling us to leverage model
structure during inference.
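
A minimal sketch of this model in Edward (one way to write the 50 conditionally independent draws):

import tensorflow as tf
from edward.models import Bernoulli, Beta

theta = Beta(1.0, 1.0)
x = Bernoulli(probs=tf.ones(50) * theta)  # 50 coin flips sharing the probability θ

# fetching x from the graph, e.g. tf.Session().run(x), yields a binary vector of 50 elements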
Example: Variational Auto-Encoder for Binarized MNIST

[Kingma & Welling 2014; Rezende+ 2014]

[Demo]
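
A minimal sketch of the generative model (generative_network is an assumed decoder; N and d are illustrative):

import tensorflow as tf
from edward.models import Bernoulli, Normal

N, d = 128, 10  # batch size and latent dimension (illustrative)

def generative_network(z):
  # assumed decoder: map latent codes to 28*28 Bernoulli logits
  hidden = tf.layers.dense(z, 256, activation=tf.nn.relu)
  return tf.layers.dense(hidden, 28 * 28)

# prior over latent codes and likelihood over binarized pixels
z = Normal(loc=tf.zeros([N, d]), scale=tf.ones([N, d]))
x = Bernoulli(logits=generative_network(z))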
Example: Bayesian neural network for classification

[Denker+ 1987; MacKay 1992; Hinton & Van Camp, 1993; Neal 1995]
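
A minimal sketch of a one-hidden-layer Bayesian neural network classifier (sizes and shapes are illustrative):

import tensorflow as tf
from edward.models import Bernoulli, Normal

N, D, H = 40, 5, 10  # data points, input features, hidden units (illustrative)
X = tf.placeholder(tf.float32, [N, D])

# priors over weights and biases
W_0 = Normal(loc=tf.zeros([D, H]), scale=tf.ones([D, H]))
b_0 = Normal(loc=tf.zeros(H), scale=tf.ones(H))
W_1 = Normal(loc=tf.zeros([H, 1]), scale=tf.ones([H, 1]))
b_1 = Normal(loc=tf.zeros(1), scale=tf.ones(1))

# likelihood of binary labels
h = tf.tanh(tf.matmul(X, W_0) + b_0)
y = Bernoulli(logits=tf.reshape(tf.matmul(h, W_1) + b_1, [N]))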
Example: Gaussian process classification

[Rasmussen & Williams, 2006; fig from Hensman+ 2013]
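
One way this model could be written in Edward (a sketch; rbf_kernel is a helper written for this example, and sizes are illustrative):

import tensorflow as tf
from edward.models import Bernoulli, MultivariateNormalTriL

N, D = 100, 2  # data points and features (illustrative)
X = tf.placeholder(tf.float32, [N, D])

def rbf_kernel(X, variance=1.0, lengthscale=1.0):
  # squared-exponential kernel (helper written for this sketch)
  sq = tf.reduce_sum(tf.square(X), axis=1, keep_dims=True)
  d2 = sq - 2.0 * tf.matmul(X, X, transpose_b=True) + tf.transpose(sq)
  return variance * tf.exp(-0.5 * d2 / lengthscale**2)

K = rbf_kernel(X) + 1e-6 * tf.eye(N)  # jitter for numerical stability
f = MultivariateNormalTriL(loc=tf.zeros(N), scale_tril=tf.cholesky(K))
y = Bernoulli(logits=f)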


Inference

Given
• Data xtrain.
• Model p(x, z, β) of observed variables x and latent variables z, β.

Goal
• Calculate the posterior distribution

    p(z, β | xtrain) = p(xtrain, z, β) / ∫ p(xtrain, z, β) dz dβ.

This is the key problem in Bayesian inference.

edwardlib.org/tutorials
Inference

All Inference algorithms take (at least) two inputs:

1. a binding of latent variables to their posterior approximations;
2. a binding of observed variables to their realizations (the data).

Inference has class methods to finely control the algorithm. Edward is as fast as handwritten TensorFlow at runtime.
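
A minimal sketch of these two bindings (names are illustrative; KLqp is Edward's variational inference algorithm):

import numpy as np
import tensorflow as tf
import edward as ed
from edward.models import Normal

# toy model: latent mean z with 50 Normal observations
z = Normal(loc=0.0, scale=1.0)
x = Normal(loc=z, scale=1.0, sample_shape=50)

# variational approximation for z
qz = Normal(loc=tf.Variable(0.0), scale=tf.nn.softplus(tf.Variable(0.0)))

x_train = np.zeros(50, dtype=np.float32)  # stand-in for real observations
inference = ed.KLqp({z: qz},              # latent variables ↔ posterior approximations
                    data={x: x_train})    # observed variables ↔ realizations
inference.run(n_iter=1000)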
edwardlib.org/api
Inference

Variational inference. It uses a variational model.

Monte Carlo. It uses an Empirical approximation.

Conjugacy & exact inference. It uses symbolic algebra on the graph.
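
For example, the same latent variable can be approximated with a parametric variational model or with an Empirical distribution of samples (a sketch with illustrative names):

import numpy as np
import tensorflow as tf
import edward as ed
from edward.models import Empirical, Normal

# toy model
z = Normal(loc=0.0, scale=1.0)
x = Normal(loc=z, scale=1.0, sample_shape=50)
x_train = np.zeros(50, dtype=np.float32)

# variational inference: a parametric variational model
qz_vi = Normal(loc=tf.Variable(0.0), scale=tf.nn.softplus(tf.Variable(0.0)))
inference_vi = ed.KLqp({z: qz_vi}, data={x: x_train})

# Monte Carlo: an Empirical approximation holding T posterior samples
T = 1000
qz_mc = Empirical(params=tf.Variable(tf.zeros(T)))
inference_mc = ed.HMC({z: qz_mc}, data={x: x_train})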


Inference: Composing Inference

Core to Edward’s design is that inference can be written as a collection of separate inference programs.

For example, here is variational EM (sketched below).
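
A minimal sketch, assuming a toy model with a global latent variable beta and local latent variables z (names and shapes are illustrative; the E and M steps alternate updates):

import numpy as np
import tensorflow as tf
import edward as ed
from edward.models import Normal, PointMass

N = 50  # number of data points (illustrative)

# model: global latent beta, local latents z, observations x
beta = Normal(loc=0.0, scale=1.0)
z = Normal(loc=tf.zeros(N), scale=tf.ones(N))
x = Normal(loc=z + beta, scale=1.0)

# approximations: variational qz for the E-step, point mass qbeta for the M-step
qz = Normal(loc=tf.Variable(tf.zeros(N)),
            scale=tf.nn.softplus(tf.Variable(tf.zeros(N))))
qbeta = PointMass(params=tf.Variable(0.0))

x_train = np.zeros(N, dtype=np.float32)  # stand-in data

inference_e = ed.KLqp({z: qz}, data={x: x_train, beta: qbeta})
inference_m = ed.MAP({beta: qbeta}, data={x: x_train, z: qz})

inference_e.initialize()
inference_m.initialize()

sess = ed.get_session()
sess.run(tf.global_variables_initializer())

for _ in range(1000):
  inference_e.update()
  inference_m.update()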

We can also write message passing algorithms, which work over a collection
of local inference problems. This includes expectation propagation.

[Neal & Hinton 1993; Minka 2001; Gelman+ 2017]


Non-Bayesian Methods: GANs

GANs posit a generative process,

    ε ∼ Normal(0, 1)
    x = G(ε; θ),

for some generative network G.


Training uses a discriminative network D via the optimization problem

    min_θ max_φ  E_{p*(x)}[log D(x; φ)] + E_{p(x; θ)}[log(1 − D(x; φ))].

The generator tries to generate samples indistinguishable from the true data.
The discriminator tries to distinguish samples from the generator from samples from the true data.
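
A minimal sketch in Edward (generative_network and discriminative_network are assumed helper networks; ed.GANInference binds the generated samples to real data and takes the discriminator):

import numpy as np
import tensorflow as tf
import edward as ed
from edward.models import Normal

M, d = 128, 100  # batch size and noise dimension (illustrative)

def generative_network(eps):
  # assumed generator: map noise to 28*28 pixel intensities
  hidden = tf.layers.dense(eps, 256, activation=tf.nn.relu)
  return tf.layers.dense(hidden, 28 * 28, activation=tf.sigmoid)

def discriminative_network(x):
  # assumed discriminator: map samples to a real-valued score
  hidden = tf.layers.dense(x, 256, activation=tf.nn.relu)
  return tf.layers.dense(hidden, 1)

eps = Normal(loc=tf.zeros([M, d]), scale=tf.ones([M, d]))
x = generative_network(eps)

x_ph = tf.placeholder(tf.float32, [M, 28 * 28])  # fed with batches of real images
inference = ed.GANInference(data={x: x_ph}, discriminator=discriminative_network)
inference.initialize()

sess = ed.get_session()
sess.run(tf.global_variables_initializer())
x_batch = np.random.binomial(1, 0.5, size=(M, 28 * 28)).astype(np.float32)  # stand-in batch
inference.update(feed_dict={x_ph: x_batch})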

[Goodfellow+ 2014]
Example: Generative Adversarial Network for MNIST

[Demo]
http://edwardlib.org/tutorials/gan
Non-Bayesian Methods: GANs

[Goodfellow+ 2014]
Non-Bayesian Methods: GANs

[Arjovsky+ 2017; Gulrajani+ 2017]


Current Work
Dynamic Graphs
Distributions Backend
# `pixelcnn` is an externally defined autoregressive network; `images` is a batch
# of binarized images; `tfd` aliases the TensorFlow Distributions module.
def pixelcnn_dist(params=None, x_shape=(32, 32, 3)):
  def _logit_func(features):
    # single autoregressive step on observed features
    logits = pixelcnn(features)
    return logits

  logit_template = tf.make_template("pixelcnn", _logit_func)
  make_dist = lambda x: tfd.Independent(tfd.Bernoulli(logit_template(x)))
  return tfd.Autoregressive(make_dist, tf.reduce_prod(x_shape))

x = pixelcnn_dist()
loss = -tf.reduce_sum(x.log_prob(images))
train = tf.train.AdamOptimizer().minimize(loss)  # run for training
generate = x.sample()                            # run for generation

TensorFlow Distributions consists of a large collection of distributions.

Bijectors enable efficient, composable manipulation of probability distributions (see the sketch below).

PyTorch PPLs are consolidating on a shared distributions backend.
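
A minimal Bijector sketch (assuming the TF 1.x contrib aliases for the distributions and bijectors modules):

import tensorflow as tf

tfd = tf.contrib.distributions
tfb = tfd.bijectors

# a log-normal built by pushing a Normal through an Exp bijector
log_normal = tfd.TransformedDistribution(
    distribution=tfd.Normal(loc=0.0, scale=1.0),
    bijector=tfb.Exp())

samples = log_normal.sample(5)
density = log_normal.log_prob(samples)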

[Dillon+ 2017]
Distributed, Compiled, Accelerated Systems

Probabilistic programming over multiple machines. XLA compiler optimization and TPUs. More flexible programmable inference.
References

edwardlib.org
• Edward: A library for probabilistic modeling, inference, and criticism.
arXiv preprint arXiv:1610.09787, 2016.
• Deep probabilistic programming.
International Conference on Learning Representations, 2017.
• TensorFlow Distributions.
arXiv preprint arXiv:1711.10604, 2017.
