LOD: Differentiable Wonderland
Presenter: Simone Scardapane
01
Differentiability: the key ingredient of AI?
“For, you see, so many out-of-the-way things had happened lately, that Alice had begun to think that very few
things indeed were really impossible.”
—Chapter 1, Down the Rabbit-Hole
“This position paper proposes an architecture and training
paradigms with which to construct autonomous intelligent
agents.”
“A system architecture for
autonomous intelligence. All modules
in this model are assumed to be
differentiable.”
What is an “artificial neural network”?
No biology, please.
[Examples: implicit layers, Transformer blocks, neural computers]
A deep network is a differentiable function ...
[Diagram: the loss goes into automatic differentiation; the resulting gradients feed the optimizer]
def f(x):
    y = my_network(x)
    y = another_network(y)
    y = yet_another_network(y)
    return y
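A minimal sketch of the picture above, with hypothetical nn.Linear stand-ins for the three networks of the pseudocode: the loss goes in, automatic differentiation fills in every gradient, and the optimizer consumes them.

import torch
import torch.nn as nn

# Hypothetical stand-ins for the three networks in the pseudocode above.
my_network = nn.Linear(8, 8)
another_network = nn.Linear(8, 8)
yet_another_network = nn.Linear(8, 1)

def f(x):
    y = my_network(x)
    y = another_network(y)
    y = yet_another_network(y)
    return y

optimizer = torch.optim.SGD(
    [*my_network.parameters(), *another_network.parameters(), *yet_another_network.parameters()],
    lr=0.1,
)

x = torch.randn(4, 8)
loss = f(x).pow(2).mean()  # the loss "goes in"
loss.backward()            # automatic differentiation fills the .grad of every parameter
optimizer.step()           # the optimizer consumes the gradients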
02
The how: automatic differentiation
“Would you tell me, please, which way I ought to go from here?”
“That depends a good deal on where you want to get to,” said the Cat.
“I don’t much care where” said Alice.
“Then it doesn’t matter which way you go,” said the Cat.
—Chapter 6, Pig and Pepper
Preliminaries: Derivative(s)
For example, for a linear primitive y = Wx (with x of size n and output of size m), the Jacobian w.r.t. x is (m, n), but the Jacobian w.r.t. W is a rank-3 tensor of shape (m, m, n). In general, full Jacobians of real layers can be quite cumbersome.
[Diagram: a generic primitive taking an input and (trainable) parameters]
For each primitive, we know how to compute the input Jacobian and the weight
Jacobian.
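As a sketch (using torch.autograd.functional.jacobian, with the linear primitive above as the running example), both Jacobians can be obtained directly; the shapes match the (m, n) and (m, m, n) stated earlier.

import torch

# Hypothetical linear primitive f(x, W) = W @ x, with x in R^n and W in R^{m x n}.
m, n = 3, 4
x = torch.randn(n)
W = torch.randn(m, n)

f = lambda x, W: W @ x

# Returns one Jacobian per input: (m, n) w.r.t. x and (m, m, n) w.r.t. W.
Jx, JW = torch.autograd.functional.jacobian(f, (x, W))
print(Jx.shape, JW.shape)  # torch.Size([3, 4]) torch.Size([3, 3, 4])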
Neural networks
Simple NNs are a sequence of primitive operations:
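For instance, with three primitives (the same f1, f2, f3 used below), the composition can be written as:

$$ f(x) = f_3\big(f_2\big(f_1(x, W_1),\, W_2\big),\, W_3\big) $$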
Computational graph
In the more general case, we can have a DAG (not a sequence), and also parameter
sharing between layers.
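A minimal sketch of parameter sharing (shapes are illustrative): the same module appears twice in the graph, so gradients from both uses accumulate in the same weights.

import torch
import torch.nn as nn

shared = nn.Linear(8, 8)             # one set of weights...
x = torch.randn(2, 8)
y = shared(torch.relu(shared(x)))    # ...used at two different nodes of the DAG
y.sum().backward()                   # both uses contribute to shared.weight.grad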
The goal of autodiff
Note that the final output is almost always a scalar (e.g., the sum of the per-element losses).
What we need is a way to efficiently and simultaneously compute all weight Jacobians (up to numerical precision):
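One way to state the goal in symbols (assuming a scalar loss L depending on parameters W_1, ..., W_L):

$$ \nabla_{W_i} L \quad \text{for every } i = 1, \dots, L, \text{ ideally in a single pass over the graph.} $$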
[1502.05767] Automatic differentiation in machine learning: a survey
For example, this could be a one-layer neural network, cross-entropy loss, and final
sum.
Considering each instruction in isolation, we have 4 input Jacobians and 3 weight
Jacobians:
[Equations: the Jacobians of f1, f2, and f3 written out; several intermediate Jacobian products are repeated across them.]
Performing multiplications right-to-left: reverse-mode autodiff
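A sketch of this idea with explicit VJPs (torch.func.vjp); the chain f1, f2, f3 and the shapes are illustrative. The seed u starts at the scalar output and is propagated backwards.

import torch
from torch.func import vjp, grad

f1 = lambda x: torch.sin(x)
f2 = lambda x: x ** 2
f3 = lambda x: x.sum()           # scalar output, as in the slides

x = torch.randn(5)
y1, vjp1 = vjp(f1, x)
y2, vjp2 = vjp(f2, y1)
y3, vjp3 = vjp(f3, y2)

u = torch.ones(())               # seed on the scalar output
u = vjp3(u)[0]                   # multiply by the Jacobian of f3...
u = vjp2(u)[0]                   # ...then f2...
grad_x = vjp1(u)[0]              # ...then f1: this is the gradient w.r.t. x
print(torch.allclose(grad_x, grad(lambda x: f3(f2(f1(x))))(x)))  # True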
Forward-mode autodiff
However, the cost scales linearly with the number of parameters (one pass per tangent direction), which is
impractical for today's neural networks. On the plus side, it requires little memory,
because previous operations can be discarded as soon as they have been used.
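A sketch with torch.func.jvp (the function and shapes are illustrative): one forward pass propagates the values together with a single tangent direction v, so covering all parameters requires one pass per direction.

import torch
from torch.func import jvp

def f(x):
    return (torch.sin(x) ** 2).sum()

x = torch.randn(5)
v = torch.zeros(5)
v[0] = 1.0                       # tangent = one coordinate direction
y, dy = jvp(f, (x,), (v,))       # dy is the directional derivative along v
print(dy)                        # equals the first entry of the gradient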
Forward-mode autodiff
● Wengert (1964) is credited with the first description of forward-mode AD, which became
popular in the '80s, mostly through the work of Griewank.
● Linnainmaa (1976) is considered the first description of modern reverse-mode AD, with
the first major implementation in Speelpenning (1980).
● Werbos (1982) provided the first concrete application to NNs, before the technique was popularized (as
backpropagation) by Rumelhart et al. (1986).
[Diagram: the framework stack. Layer 1 (the core autodiff engine); Layer 2: high-level constructs (layers, ...); on top: hubs, libraries, extensions, ...]
Theano and TF 1.0 focused heavily on performance. The user implemented the
computational graph with a small domain-specific language (DSL), and execution was
decoupled from the definition (define-then-run).
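A minimal define-then-run sketch in the style of TF 1.x (assuming a TensorFlow 1.x environment; names and shapes are illustrative): the graph, including the gradient nodes, is built first, and only sess.run executes it.

import tensorflow as tf

x = tf.placeholder(tf.float32, shape=(None, 3))   # definition only, nothing runs yet
w = tf.Variable(tf.random_normal((3, 1)))
loss = tf.reduce_sum(tf.matmul(x, w) ** 2)
grad_w, = tf.gradients(loss, [w])                 # gradients are just more graph nodes

with tf.Session() as sess:                        # execution is a separate phase
    sess.run(tf.global_variables_initializer())
    print(sess.run(grad_w, feed_dict={x: [[1., 2., 3.]]}))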
1. Having an external context manager to store operations (e.g., the GradientTape of TF, technically a Wengert list) vs. building the DAG dynamically.
2. Being able to easily differentiate w.r.t. any sort of object (e.g., the PyTrees of JAX); both points are sketched below.
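Two small sketches of these points, one with tf.GradientTape and one with jax.grad over a dict of parameters (names and shapes are illustrative):

import tensorflow as tf

# (1) An explicit tape: operations inside the context are recorded on a Wengert list.
x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y = x ** 2
print(tape.gradient(y, x))                        # 6.0

import jax
import jax.numpy as jnp

# (2) Differentiating w.r.t. an arbitrary PyTree (here, a dict of parameters).
params = {"w": jnp.ones(3), "b": jnp.array(0.5)}

def loss(params, x):
    return jnp.sum(params["w"] * x + params["b"])

print(jax.grad(loss)(params, jnp.arange(3.0)))    # a dict with the same structure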
[Diagram: grad_fn links each tensor to the operation that created it; the gradient is accumulated in the grad field of the leaf tensors.]
Note: PyTorch also has a functional variant, very similar to the JAX implementation.
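A sketch of the functional variant via torch.func (assuming PyTorch ≥ 2.0; the loss and shapes are illustrative), mirroring jax.grad:

import torch
from torch.func import grad

def loss(w, x):
    return torch.sum(w * x) ** 2

w = torch.ones(3)
x = torch.arange(3.0)
print(grad(loss)(w, x))   # gradient w.r.t. the first argument, w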
High-level APIs
Most frameworks (TensorFlow, PyTorch) implement an object-oriented API:
Simple polymorphism allows for module compositionality:

class MyModule(nn.Module):
    def __init__(self):
        super().__init__()  # registers parameters and submodules
        self.params = Parameter(torch.tensor( … ))
“Curiouser and curiouser!” cried Alice (she was so much surprised, that for the moment she quite
forgot how to speak good English).
—Chapter 2, The Pool of Tears
Layer sharing
class FixedPointLayer(nn.Module):
    def __init__(self):
        super().__init__()
        self.w = ...
By default, however, AD needs to store all intermediate steps of the while loop and
backpropagate through them, which is expensive.
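A minimal sketch of what the full layer might look like (the update z = tanh(x + zW), the tolerance, and the iteration cap are illustrative choices): the forward pass iterates until approximate convergence, and naive AD would record every iteration.

import torch
import torch.nn as nn

class FixedPointLayer(nn.Module):
    def __init__(self, n, tol=1e-4, max_iter=50):
        super().__init__()
        self.W = nn.Parameter(torch.randn(n, n) / n)
        self.tol, self.max_iter = tol, max_iter

    def forward(self, x):
        z = torch.zeros_like(x)
        for _ in range(self.max_iter):            # naive AD stores every iteration
            z_next = torch.tanh(x + z @ self.W.T)
            if (z_next - z).norm() < self.tol:
                break
            z = z_next
        return z_next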
Implicit function theorem
By the implicit function theorem, there exists a continuous function z* such that:
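Writing the layer's update as z = f(z, x) (as in the sketch above), the condition is:

$$ z^\star(x) = f\big(z^\star(x),\, x\big) $$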
Differentiating everything:
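Differentiating both sides and solving for the Jacobian of z* gives (a sketch of the standard step):

$$ \frac{\partial z^\star(x)}{\partial x} = \Big(I - \frac{\partial f(z^\star, x)}{\partial z}\Big)^{-1} \frac{\partial f(z^\star, x)}{\partial x} $$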
We can differentiate the layer by computing two gradients at the optimum z* (no need to store
any intermediate iterations).
VJPs of implicit functions
In order to implement the layer as a primitive in a framework, we need its VJP with some
vector u. Fascinatingly, this can be expressed as another fixed-point equation:
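A sketch of why: define g as the solution of the linear fixed point

$$ g^\top = u^\top + g^\top \frac{\partial f(z^\star, x)}{\partial z} $$

which can itself be solved by iteration; the desired VJP w.r.t. x is then g^\top \, \partial f(z^\star, x) / \partial x.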
To save memory, gradient checkpointing is now popular: the outputs of a few selected nodes (checkpoints)
are stored, and when back-propagating through a non-checkpointed node, its output is
recomputed starting from the closest preceding checkpoint in memory.
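A minimal sketch with torch.utils.checkpoint (assuming a recent PyTorch; the block and shapes are illustrative): activations inside the checkpointed block are not stored during the forward pass and are recomputed on backward.

import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU(), torch.nn.Linear(16, 16))
x = torch.randn(4, 16, requires_grad=True)

y = checkpoint(block, x, use_reentrant=False)  # activations inside `block` are recomputed on backward
y.sum().backward()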
Bilevel optimization
[2201.05624] Scientific Machine Learning through Physics-Informed Neural Networks: Where we are and What's next
“Tut, tut, child!” said the Duchess. “Everything’s got a moral, if only you can find it.”
—Chapter 9, The Mock Turtle’s Story
https://www.sscardapane.it/
https://twitter.com/s_scardapane