
Alice's Adventures in a differentiable Wonderland
Presenter: Simone Scardapane
01
Differentiability: the key ingredient of AI?
“For, you see, so many out-of-the-way things had happened lately, that Alice had begun to think that very few
things indeed were really impossible.”
—Chapter 1, Down the Rabbit-Hole
“This position paper proposes an architecture and training paradigms with which to construct autonomous intelligent agents.”

“A system architecture for autonomous intelligence. All modules in this model are assumed to be differentiable.”
What is an “artificial neural network”?
No biology, please.

« computing systems vaguely inspired by the biological neural networks that constitute animal brains »
— Wikipedia
Convolutional neural networks
Implicit layers
Transformer blocks
Neural computers
A deep network is a differentiable function ...

def my_network(x: tensor) -> tensor:   # input type -> output type
    ...                                # a sequence of differentiable primitives
    return y
… that can be optimized from data.

(Diagram: the loss goes in; the optimizer, driven by automatic differentiation, produces a "better" function as output.)
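Concretely, a minimal sketch of such an optimization step in PyTorch (the model and the dummy data below are illustrative placeholders, not from the slides):

import torch
import torch.nn.functional as F

model = torch.nn.Linear(10, 2)                         # stand-in for my_network
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

inputs = torch.randn(32, 10)                           # dummy batch
targets = torch.randint(0, 2, (32,))                   # dummy labels

for _ in range(100):
    optimizer.zero_grad()                              # reset accumulated gradients
    loss = F.cross_entropy(model(inputs), targets)     # the loss goes in
    loss.backward()                                    # automatic differentiation
    optimizer.step()                                   # a "better" function comes out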
Corollary: deep networks are composable

def f(x):
    y = my_network(x)
    y = another_network(y)
    y = yet_another_network(y)
    return y
02
The how: automatic
differentiation
“Would you tell me, please, which way I ought to go from here?”
“That depends a good deal on where you want to get to,” said the Cat.
“I don’t much care where,” said Alice.
“Then it doesn’t matter which way you go,” said the Cat.
—Chapter 6, Pig and Pepper
Preliminaries: Derivative(s)

For a scalar function, the derivative is the rate of change for an infinitesimally small displacement.
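In symbols, for f: R → R:

f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}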
Preliminaries: Gradient(s)
For a function with an n-dimensional vector as input, we can define partial derivatives, one along each basis vector e_i. The vector of all partial derivatives is called the gradient:
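In symbols (e_i denotes the i-th basis vector):

\partial_i f(x) = \lim_{h \to 0} \frac{f(x + h\, e_i) - f(x)}{h}, \qquad
\nabla f(x) = \begin{bmatrix} \partial_1 f(x) \\ \vdots \\ \partial_n f(x) \end{bmatrix} \in \mathbb{R}^n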


Preliminaries: Jacobian(s)
Consider now a function with an n-dimensional input and an m-dimensional output; its Jacobian is defined as a matrix with m rows (one for each output) and n columns (one for each input):
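In symbols, for f: R^n → R^m:

\partial f(x) = \begin{bmatrix}
\frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n}
\end{bmatrix} \in \mathbb{R}^{m \times n}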
The Jacobian wrt what?
Let us go back to a classical fully-connected layer:

The Jacobian w.r.t. x is (m, n), but the Jacobian w.r.t. W is a rank-3 tensor of shape (m, m, n). In general, full Jacobians of real layers can be quite cumbersome.
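As a concrete sketch (assuming the standard form of the layer, f(x) = Wx + b with W of shape (m, n)):

\frac{\partial f}{\partial x} = W \in \mathbb{R}^{m \times n}, \qquad
\frac{\partial f_i}{\partial W_{jk}} = \delta_{ij}\, x_k \quad \text{(a tensor of shape } (m, m, n)\text{)}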

We will return to this point later on.


Chain rule of Jacobians
Like classical derivatives, Jacobians also have a chain rule:
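For a composition h(x) = f(g(x)), with g: R^n → R^m and f: R^m → R^p, the (p × n) Jacobian factorizes as:

\partial h(x) = \partial f(g(x)) \; \partial g(x)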

The Jacobian of a function composition is the product of the corresponding Jacobians.
Neural network primitives
Neural networks are composed of simple differentiable primitives, each taking an input and (possibly) some trainable parameters.

For each primitive, we know how to compute the input Jacobian and the weight Jacobian.
Neural networks
Simple NNs are a sequence of primitive operations:
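Schematically, with trainable parameters w_i attached to each primitive:

f(x) = f_L\big(f_{L-1}\big(\cdots f_1(x;\, w_1)\cdots ;\, w_{L-1}\big);\, w_L\big)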

Computational graph
In the more general case, we can have a DAG (not a sequence), and also parameter
sharing between layers.
The goal of autodiff

Note that our last output is almost always a scalar (e.g., the sum of the per-element losses).

What we need is a way to efficiently, simultaneously compute all weight Jacobians (up
to numerical precision):
[1502.05767] Automatic differentiation in machine learning: a survey

● Symbolic: build a symbolic formula for the gradients from the original symbolic program.
● Numeric: numerically evaluate the derivatives using the definition (a quick sketch follows this list).

Automatic differentiation - Wikipedia
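For instance, a naive numeric (finite-difference) approximation of a gradient, as a sketch:

import numpy as np

def numeric_gradient(f, x, h=1e-5):
    """Approximate the gradient of a scalar function f at x, one coordinate at a time."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e_i = np.zeros_like(x)
        e_i[i] = 1.0
        grad[i] = (f(x + h * e_i) - f(x - h * e_i)) / (2 * h)   # central difference
    return grad

# Example: f(x) = sum(x**2) has gradient 2x
x = np.array([1.0, -2.0, 3.0])
print(numeric_gradient(lambda v: np.sum(v ** 2), x))   # ~ [2., -4., 6.]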
One worked-out example
Consider a very simple example:

For example, this could be a one-layer neural network, cross-entropy loss, and final
sum.
Considering each instruction in isolation, we have 4 input Jacobians and 3 weight Jacobians (the Jacobians of f1, f2, and f3).

Then, we can use the chain rule to “stitch” them together.


Performing the multiplications left-to-right: forward-mode autodiff (note that parts of the computation are repeated).

Performing the multiplications right-to-left: reverse-mode autodiff.
Forward-mode autodiff

Forward-mode autodiff can be implemented easily: all operations can be performed in parallel with the original program (i.e., we can devise a new program returning both the original outputs and the gradients).

However, its cost scales linearly with the number of parameters, which is impractical for today's neural networks. On the plus side, it requires little memory, because previous operations can be discarded.
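In JAX, forward mode is exposed directly as a Jacobian-vector product (a minimal sketch):

import jax
import jax.numpy as jnp

def f(x):
    return jnp.tanh(x) ** 2          # any differentiable function

x = jnp.array([0.5, -1.0, 2.0])      # primal input
v = jnp.ones_like(x)                 # tangent (direction of the directional derivative)

y, dy = jax.jvp(f, (x,), (v,))       # runs f and its derivative "in parallel"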
Forward-mode autodiff

[1502.05767] Automatic differentiation in machine learning: a survey


Reverse-mode autodiff

Reverse-mode autodiff collects all intermediate operations of the program (tracing), and then "unrolls" all the gradient operations from right to left.

It is significantly more efficient (only vector-matrix operations are needed), but it requires storing all intermediate outputs, making it highly memory-consuming.

Autodiff in ML is almost always done in reverse mode (backpropagation).
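A minimal reverse-mode example in PyTorch:

import torch

x = torch.randn(3, requires_grad=True)
y = (x ** 2).sum()     # the forward pass is traced into a DAG
y.backward()           # reverse-mode autodiff ("backpropagation")
print(x.grad)          # equals 2 * x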


Reverse-mode autodiff

[1502.05767] Automatic differentiation in machine learning: a survey


Quick-note: autodiff or backprop?
A few commonly accepted milestones:

● Wengert (1964) is credited with the first description of forward-mode AD, which became popular in the '80s, mostly through the work of Griewank.
● Linnainmaa (1976) is considered the first description of modern reverse-mode AD, with the first major implementation in Speelpenning (1980).
● Werbos (1982) is the first concrete application to NNs, before it was popularized (as backpropagation) by Rumelhart et al. (1986).

Who Invented Backpropagation? Who Invented the Reverse Mode of Differentiation?


Vector-Jacobian products
Importantly, we do not need to know how to compute the full Jacobians of the primitives, but only their vector-Jacobian products (VJPs):

(To see this, transpose all the previous equations!)

Forward-mode autodiff can equivalently be written with Jacobian-vector products (JVPs), by computing the final gradients one value at a time.
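For instance, jax.vjp returns the output of a function together with a closure computing its VJPs (a sketch):

import jax
import jax.numpy as jnp

def f(x):
    return jnp.tanh(x)

x = jnp.array([0.5, -1.0, 2.0])
y, vjp_fn = jax.vjp(f, x)                    # trace f once, get a VJP closure
(cotangent_x,) = vjp_fn(jnp.ones_like(y))    # multiply a row vector with the Jacobian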
VJPs vs. Jacobians
The previous result is important, because sometimes VJPs are much easier to compute than the full Jacobians: the full Jacobian can be a rank-3 tensor, while the corresponding VJP stays simple.
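For example, for the weight Jacobian of the fully-connected layer seen earlier, the VJP with a vector u ∈ R^m is just an outer product:

u^\top \frac{\partial (Wx)}{\partial W} = u\, x^\top \in \mathbb{R}^{m \times n}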


03
The nitty-gritty details

“The best way to explain it is to do it.”


—Chapter 3, A Caucus-Race and a Long Tale
Deep learning frameworks

Layer 2: high-level constructs (layers, optimizers, losses, metrics, …) plus the surrounding ecosystem (hubs, libraries, extensions, …).

Layer 1: the autodiff module.

Layer 0: tensor primitives and their VJPs (matrix multiplication, etc.).

Dispatchers: one primitive may be implemented with multiple kernels, depending on the supported hardware (CPU, GPU, TPU, IPU, …).

PyTorch internals : ezyang's blog


Materials to learn more
PyTorch internals : ezyang's blog (slightly outdated)
Automatic differentiation (Matthieu Blondel, 2020)
GitHub - mattjj/autodidact: A pedagogical implementation of Autograd
GitHub - karpathy/micrograd: A tiny scalar-valued autograd engine and a neural net library on top of it with PyTorch-like API
GitHub - geohot/tinygrad: You like pytorch? You like micrograd? You love tinygrad!
GitHub - MINI-PYTORCH/MINI-TORCH: Mini-pytorch implemented from scratch using Python
The spelled-out intro to neural networks and backpropagation: building micrograd

Our discussion will be highly simplified, leaving out many important topics (e.g., tracing and distributed implementations).
Some history and terminology
The revival of AD in neural networks started with Theano (2008), which was followed by a Cambrian explosion of frameworks (TensorFlow 1.0, PyTorch, Caffe, JAX, …).

Theano and TF 1.0 focused heavily on performance: the user implemented the computational graph with a small domain-specific language (DSL), and execution was decoupled from the definition (define-then-run).

Most frameworks today are more dynamic (define-by-run), leaving compilation as a separate, optional step (JIT in JAX / PyTorch, tf.function in TF).
Flavours of implementations
While most frameworks implement similar things, the way they implement them can make some use
cases considerably faster or easier.

1. Having an external context manager to store operations (e.g., the GradientTape of TF, technically a
Wengert list) vs. building the DAG dynamically.

2. Being able to easily differentiate w.r.t. any sort of object (e.g., the PyTrees of JAX)

3. The flexibility of the autodiff framework (e.g., to compute full Hessians).

4. Having only a functional interface (e.g., JAX).

5. Supporting sparse and/or complex-valued data.


Anatomy of a PyTorch tensor
data: the actual storage of the tensor.
dtype: the data type of the elements.
requires_grad = True: explicitly identifies tensors that require gradients.
grad: the gradient is accumulated here.
grad_fn: a pointer to the parent operation, used to traverse the DAG.

Note: PyTorch also has a functional variant, very similar to the JAX implementation.
High-level APIs
Most frameworks (TensorFlow, PyTorch) implement an object-oriented API:
import torch
from torch import nn
from torch.nn import Parameter

class MyModule(nn.Module):   # simple polymorphism allows for module compositionality
    def __init__(self):
        super().__init__()
        # Parameters are properties of the object
        self.params = Parameter(torch.tensor( … ))

    def forward(self, x):
        # Forward logic is a method
        return self.params @ x
Limits of OOP
By default, JAX takes a fully functional paradigm: everything (layers, losses) is a function.

Sometimes, this makes it easier to explicitly manipulate parameters and to compose different transformations.

Most high-level frameworks in JAX define a layer by a pair of init/apply functions (with some exceptions, see Equinox).
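As a sketch of the init/apply style (names here are illustrative, not taken from a specific library):

import jax.numpy as jnp
from jax import random

def dense(out_dim):
    def init(key, in_dim):
        w_key, _ = random.split(key)
        return {
            "w": 0.01 * random.normal(w_key, (in_dim, out_dim)),
            "b": jnp.zeros(out_dim),
        }
    def apply(params, x):
        return x @ params["w"] + params["b"]
    return init, apply

init_fn, apply_fn = dense(16)
params = init_fn(random.PRNGKey(0), in_dim=32)    # parameters are explicit...
y = apply_fn(params, jnp.ones((4, 32)))           # ...and passed to a pure function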
Notebook time!
Notebook 1: simple experiments with PyTorch and JAX, side-by-side.
Notebook 2: building a toy autodiff framework, PyTorch-style.
https://colab.research.google.com/drive/1AbNRRjL0DMoj4VPnul7QIYJC5nLU3Fn2?usp=sharing
From PyTorch to JAX: towards neural net frameworks that purify stateful code — Sabrina J. Mielke
Notebook time!
Notebook 3: moving from PyTorch to JAX and vice versa.
https://colab.research.google.com/drive/1AbNRRjL0DMoj4VPnul7QIYJC5nLU3Fn2?usp=sharing
04
Advanced topics

“Curiouser and curiouser!” cried Alice (she was so much surprised, that for the moment she quite forgot how to speak good English).
—Chapter 2, The Pool of Tears
Layer sharing

(Figure: 1 layer vs. 2 layers vs. 2 layers with replicated weights.)
Deep equilibrium layers
What happens if we replicate the layer infinitely many times?

The output of the layer is now implicitly defined by a fixed-point equation: the input x is given, and the output z can be initialized to zero.
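In symbols, denoting by f the replicated layer (parameters omitted for brevity):

z^\star = f(z^\star, x)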

Deep Equilibrium Models


Solving fixed-point equations
Writing a fixed-point layer is easy (of course, there are faster alternatives):

import torch
from torch import nn

class FixedPointLayer(nn.Module):
    def __init__(self):
        super().__init__()
        self.w = ...   # the layer's parameters

    def forward(self, x):
        # Iterate z = f(w, z, x) until convergence, starting from zero
        z = torch.zeros_like(x)
        while not self.check_convergence(z):
            z = f(self.w, z, x)
        return z

By default, however, AD would store all intermediate steps of the while loop and backpropagate through them, which is expensive.
Implicit function theorem
By the implicit function theorem, there exists a continuous function z*(x) satisfying the fixed-point equation.

Differentiating everything, we can differentiate the layer by computing two Jacobians of f at the optimum z* (no need to store any intermediate layers).
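In symbols (writing ∂_z f and ∂_x f for the Jacobians of f at the fixed point):

z^\star(x) = f(z^\star(x), x) \;\Rightarrow\; \partial z^\star(x) = \partial_z f \, \partial z^\star(x) + \partial_x f \;\Rightarrow\; \partial z^\star(x) = \left[ I - \partial_z f \right]^{-1} \partial_x f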
VJPs of implicit functions
In order to implement the layer as a primitive in a framework, we need its VJP with some
vector u. Fascinatingly, this can be expressed as another fixed-point equation:
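One standard way to write it: the VJP u^⊤ ∂z*(x) equals v^⊤ ∂_x f(z*, x), where v solves

v^\top = u^\top + v^\top \, \partial_z f(z^\star, x)

which can itself be found by fixed-point iteration.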

Custom derivative rules for JAX-transformable Python functions


Read more

Implicit differentiation is the key to several topics:

● Defining layers in terms of convex optimization problems;


● Relaxing combinatorial problems inside layers;
● Neural ordinary differential equations (Neural ODEs);
● And so on…

Check out this tutorial for more: Deep Implicit Layers


Gradient checkpointing

To save memory, gradient checkpointing is now popular. Outputs of red nodes are stored
(checkpoints). When back-propagating through a non-checkpointed node, its output is
recomputed starting from the previous checkpoint in memory.
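A minimal sketch with PyTorch's checkpointing utility (the block and shapes are illustrative):

import torch
from torch.utils.checkpoint import checkpoint

expensive_block = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.ReLU(),
    torch.nn.Linear(512, 512), torch.nn.ReLU(),
)

x = torch.randn(8, 512, requires_grad=True)
y = checkpoint(expensive_block, x, use_reentrant=False)  # activations inside the block are not stored
loss = y.sum()
loss.backward()   # the block's activations are recomputed here, during the backward pass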
Bilevel optimization

A bilevel problem nests an inner optimization problem inside an outer one (see the sketch below).

Examples:

1. Hyper-parameter optimization (the outer loop optimizes the validation accuracy);

2. Few-shot learning (the inner loop is the training step on the few-shot dataset).

GitHub - leopard-ai/betty: Betty: an automatic differentiation library for generalized meta-learning and multilevel optimization
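In symbols, a generic bilevel problem reads:

\min_{\lambda} \; L_{\text{outer}}\big(w^\star(\lambda)\big) \quad \text{s.t.} \quad w^\star(\lambda) \in \arg\min_{w} \; L_{\text{inner}}(w, \lambda)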
Physics-based Deep Learning
Physics-informed NNs

[2201.05624] Scientific Machine Learning through Physics-Informed Neural Networks: Where we are and What's next
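As a sketch, a typical physics-informed loss combines a PDE residual term with boundary/data terms (here N denotes the differential operator and g the boundary condition):

L(\theta) = \frac{1}{N_r} \sum_{i} \big\| \mathcal{N}[u_\theta](x_r^i) \big\|^2 + \frac{1}{N_b} \sum_{j} \big\| u_\theta(x_b^j) - g(x_b^j) \big\|^2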
“Tut, tut, child!” said the Duchess. “Everything’s got a moral, if only you can find it.”
—Chapter 9, The Mock Turtle’s Story

https://www.sscardapane.it/

https://twitter.com/s_scardapane

Thanks for listening!


CREDITS: This presentation template was created by
Slidesgo, including icons by Flaticon, and infographics &
images by Freepik
