LOD: Differentiable Wonderland
Presenter: Simone Scardapane
01
Differentiability: the key ingredient of AI?
“For, you see, so many out-of-the-way things had happened lately, that Alice had begun to think that very few
things indeed were really impossible.”
—Chapter 1, Down the Rabbit-Hole
“This position paper proposes an architecture and training
paradigms with which to construct autonomous intelligent
agents.”
“A system architecture for
autonomous intelligence. All modules
in this model are assumed to be
differentiable.”
What is an “artificial neural network”?
No biology, please.
[Examples: implicit layers, Transformer blocks, neural computers]
A deep network is a differentiable function ...
[Diagram: the loss goes into automatic differentiation; the resulting gradients feed the optimizer]
def f(x):
    y = my_network(x)
    y = another_network(y)
    y = yet_another_network(y)
    return y
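A minimal sketch of the picture above, with hypothetical nn.Linear stand-ins for the three networks of the pseudocode: the loss goes in, automatic differentiation fills in every gradient, and the optimizer consumes them.

import torch
import torch.nn as nn

# Hypothetical stand-ins for the three networks in the pseudocode above.
my_network = nn.Linear(8, 8)
another_network = nn.Linear(8, 8)
yet_another_network = nn.Linear(8, 1)

def f(x):
    y = my_network(x)
    y = another_network(y)
    y = yet_another_network(y)
    return y

optimizer = torch.optim.SGD(
    [*my_network.parameters(), *another_network.parameters(), *yet_another_network.parameters()],
    lr=0.1,
)

x = torch.randn(4, 8)
loss = f(x).pow(2).mean()  # the loss "goes in"
loss.backward()            # automatic differentiation fills the .grad of every parameter
optimizer.step()           # the optimizer consumes the gradients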
02
The how: automatic differentiation
“Would you tell me, please, which way I ought to go from here?”
“That depends a good deal on where you want to get to,” said the Cat.
“I don’t much care where” said Alice.
“Then it doesn’t matter which way you go,” said the Cat.
—Chapter 6, Pig and Pepper
Preliminaries: Derivative(s)
For example, for a linear primitive y = Wx (with x of size n and output of size m), the Jacobian w.r.t. x is (m, n), but the Jacobian w.r.t. W is a rank-3 tensor of shape (m, m, n). In general, full Jacobians of real layers can be quite cumbersome.
[Diagram: a generic primitive taking an input and (trainable) parameters]
For each primitive, we know how to compute the input Jacobian and the weight
Jacobian.
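As a sketch (using torch.autograd.functional.jacobian, with the linear primitive above as the running example), both Jacobians can be obtained directly; the shapes match the (m, n) and (m, m, n) stated earlier.

import torch

# Hypothetical linear primitive f(x, W) = W @ x, with x in R^n and W in R^{m x n}.
m, n = 3, 4
x = torch.randn(n)
W = torch.randn(m, n)

f = lambda x, W: W @ x

# Returns one Jacobian per input: (m, n) w.r.t. x and (m, m, n) w.r.t. W.
Jx, JW = torch.autograd.functional.jacobian(f, (x, W))
print(Jx.shape, JW.shape)  # torch.Size([3, 4]) torch.Size([3, 3, 4])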
Neural networks
Simple NNs are a sequence of primitive operations:
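For instance, with three primitives (the same f1, f2, f3 used below), the composition can be written as:

$$ f(x) = f_3\big(f_2\big(f_1(x, W_1),\, W_2\big),\, W_3\big) $$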
Computational graph
In the more general case, we can have a DAG (not a sequence), and also parameter
sharing between layers.
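A minimal sketch of parameter sharing (shapes are illustrative): the same module appears twice in the graph, so gradients from both uses accumulate in the same weights.

import torch
import torch.nn as nn

shared = nn.Linear(8, 8)             # one set of weights...
x = torch.randn(2, 8)
y = shared(torch.relu(shared(x)))    # ...used at two different nodes of the DAG
y.sum().backward()                   # both uses contribute to shared.weight.grad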
The goal of autodiff
Note that the final output is almost always a scalar (e.g., the sum of the per-element losses).
What we need is a way to efficiently and simultaneously compute all weight Jacobians (up to numerical precision):
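One way to state the goal in symbols (assuming a scalar loss L depending on parameters W_1, ..., W_L):

$$ \nabla_{W_i} L \quad \text{for every } i = 1, \dots, L, \text{ ideally in a single pass over the graph.} $$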
[1502.05767] Automatic differentiation in machine learning: a survey
For example, this could be a one-layer neural network, cross-entropy loss, and final
sum.
Considering each instruction in isolation, we have 4 input Jacobians and 3 weight
Jacobians:
[Equations: the Jacobians of f1, f2, and f3 written out; several intermediate Jacobian products are repeated across them.]
Performing multiplications right-to-left: reverse-mode autodiff
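A sketch of this idea with explicit VJPs (torch.func.vjp); the chain f1, f2, f3 and the shapes are illustrative. The seed u starts at the scalar output and is propagated backwards.

import torch
from torch.func import vjp, grad

f1 = lambda x: torch.sin(x)
f2 = lambda x: x ** 2
f3 = lambda x: x.sum()           # scalar output, as in the slides

x = torch.randn(5)
y1, vjp1 = vjp(f1, x)
y2, vjp2 = vjp(f2, y1)
y3, vjp3 = vjp(f3, y2)

u = torch.ones(())               # seed on the scalar output
u = vjp3(u)[0]                   # multiply by the Jacobian of f3...
u = vjp2(u)[0]                   # ...then f2...
grad_x = vjp1(u)[0]              # ...then f1: this is the gradient w.r.t. x
print(torch.allclose(grad_x, grad(lambda x: f3(f2(f1(x))))(x)))  # True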
Forward-mode autodiff
However, the cost scales linearly with the number of parameters (one pass per tangent direction), which is
impractical for today's neural networks. On the plus side, it requires little memory,
because previous operations can be discarded as soon as they have been used.
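A sketch with torch.func.jvp (the function and shapes are illustrative): one forward pass propagates the values together with a single tangent direction v, so covering all parameters requires one pass per direction.

import torch
from torch.func import jvp

def f(x):
    return (torch.sin(x) ** 2).sum()

x = torch.randn(5)
v = torch.zeros(5)
v[0] = 1.0                       # tangent = one coordinate direction
y, dy = jvp(f, (x,), (v,))       # dy is the directional derivative along v
print(dy)                        # equals the first entry of the gradient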
Forward-mode autodiff
● Wengert (1964) is credited with the first description of forward-mode AD, which became
popular in the '80s, mostly through the work of Griewank.
● Linnainmaa (1976) is considered the first description of modern reverse-mode AD, with
the first major implementation in Speelpenning (1980).
● Werbos (1982) provided the first concrete application to NNs, before the technique was popularized (as
backpropagation) by Rumelhart et al. (1986).
[Diagram: the framework stack. Layer 1 (the core autodiff engine); Layer 2: high-level constructs (layers, ...); on top: hubs, libraries, extensions, ...]
Theano and TF 1.0 focused heavily on performance. The user implemented the
computational graph with a small domain-specific language (DSL), and execution was
decoupled from the definition (define-then-run).
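A minimal define-then-run sketch in the style of TF 1.x (assuming a TensorFlow 1.x environment; names and shapes are illustrative): the graph, including the gradient nodes, is built first, and only sess.run executes it.

import tensorflow as tf

x = tf.placeholder(tf.float32, shape=(None, 3))   # definition only, nothing runs yet
w = tf.Variable(tf.random_normal((3, 1)))
loss = tf.reduce_sum(tf.matmul(x, w) ** 2)
grad_w, = tf.gradients(loss, [w])                 # gradients are just more graph nodes

with tf.Session() as sess:                        # execution is a separate phase
    sess.run(tf.global_variables_initializer())
    print(sess.run(grad_w, feed_dict={x: [[1., 2., 3.]]}))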
1. Having an external context manager to store operations (e.g., the GradientTape of TF, technically a Wengert list) vs. building the DAG dynamically.
2. Being able to easily differentiate w.r.t. any sort of object (e.g., the PyTrees of JAX); both points are sketched below.
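Two small sketches of these points, one with tf.GradientTape and one with jax.grad over a dict of parameters (names and shapes are illustrative):

import tensorflow as tf

# (1) An explicit tape: operations inside the context are recorded on a Wengert list.
x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y = x ** 2
print(tape.gradient(y, x))                        # 6.0

import jax
import jax.numpy as jnp

# (2) Differentiating w.r.t. an arbitrary PyTree (here, a dict of parameters).
params = {"w": jnp.ones(3), "b": jnp.array(0.5)}

def loss(params, x):
    return jnp.sum(params["w"] * x + params["b"])

print(jax.grad(loss)(params, jnp.arange(3.0)))    # a dict with the same structure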
[Diagram: grad_fn links each tensor to the operation that created it; the gradient is accumulated in the grad field of the leaf tensors.]
Note: PyTorch also has a functional variant, very similar to the JAX implementation.
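A sketch of the functional variant via torch.func (assuming PyTorch ≥ 2.0; the loss and shapes are illustrative), mirroring jax.grad:

import torch
from torch.func import grad

def loss(w, x):
    return torch.sum(w * x) ** 2

w = torch.ones(3)
x = torch.arange(3.0)
print(grad(loss)(w, x))   # gradient w.r.t. the first argument, w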
High-level APIs
Most frameworks (TensorFlow, PyTorch) implement an object-oriented API:
Simple polymorphism allows for module compositionality:

class MyModule(nn.Module):
    def __init__(self):
        super().__init__()  # registers parameters and submodules
        self.params = Parameter(torch.tensor( … ))
“Curiouser and curiouser!” cried Alice (she was so much surprised, that for the moment she quite
forgot how to speak good English).
—Chapter 2, The Pool of Tears
Layer sharing
class FixedPointLayer(nn.Module):
    def __init__(self):
        super().__init__()
        self.w = ...
By default, however, AD needs to store all intermediate steps of the while loop and
backpropagate through them, which is expensive.
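A minimal sketch of what the full layer might look like (the update z = tanh(x + zW), the tolerance, and the iteration cap are illustrative choices): the forward pass iterates until approximate convergence, and naive AD would record every iteration.

import torch
import torch.nn as nn

class FixedPointLayer(nn.Module):
    def __init__(self, n, tol=1e-4, max_iter=50):
        super().__init__()
        self.W = nn.Parameter(torch.randn(n, n) / n)
        self.tol, self.max_iter = tol, max_iter

    def forward(self, x):
        z = torch.zeros_like(x)
        for _ in range(self.max_iter):            # naive AD stores every iteration
            z_next = torch.tanh(x + z @ self.W.T)
            if (z_next - z).norm() < self.tol:
                break
            z = z_next
        return z_next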
Implicit function theorem
By the implicit function theorem, there exists a continuous function z* such that:
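Writing the layer's update as z = f(z, x) (as in the sketch above), the condition is:

$$ z^\star(x) = f\big(z^\star(x),\, x\big) $$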
Differentiating everything:
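Differentiating both sides and solving for the Jacobian of z* gives (a sketch of the standard step):

$$ \frac{\partial z^\star(x)}{\partial x} = \Big(I - \frac{\partial f(z^\star, x)}{\partial z}\Big)^{-1} \frac{\partial f(z^\star, x)}{\partial x} $$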
We can differentiate the layer by computing two gradients at the optimum z* (no need to store
any intermediate iterations).
VJPs of implicit functions
In order to implement the layer as a primitive in a framework, we need its VJP with some
vector u. Fascinatingly, this can be expressed as another fixed-point equation:
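A sketch of why: define g as the solution of the linear fixed point

$$ g^\top = u^\top + g^\top \frac{\partial f(z^\star, x)}{\partial z} $$

which can itself be solved by iteration; the desired VJP w.r.t. x is then g^\top \, \partial f(z^\star, x) / \partial x.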
To save memory, gradient checkpointing is now popular: the outputs of a few selected nodes (checkpoints)
are stored, and when back-propagating through a non-checkpointed node, its output is
recomputed starting from the closest preceding checkpoint in memory.
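A minimal sketch with torch.utils.checkpoint (assuming a recent PyTorch; the block and shapes are illustrative): activations inside the checkpointed block are not stored during the forward pass and are recomputed on backward.

import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU(), torch.nn.Linear(16, 16))
x = torch.randn(4, 16, requires_grad=True)

y = checkpoint(block, x, use_reentrant=False)  # activations inside `block` are recomputed on backward
y.sum().backward()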
Bilevel optimization
[2201.05624] Scientific Machine Learning through Physics-Informed Neural Networks: Where we are and What's next
“Tut, tut, child!” said the Duchess. “Everything’s got a moral, if only you can find it.”
—Chapter 9, The Mock Turtle’s Story
https://www.sscardapane.it/
https://twitter.com/s_scardapane