MAP578
Collaborative and Reliable Learning
9 lectures:
1 Lectures on some crucial topics (Lectures 1-5)
2 Reading articles in small groups (Lectures 6-7)
3 Presenting those articles (Lectures 8-9)
Spreadsheet for paper selection (the list is indicative and will evolve over the next 1-3 weeks):
https://docs.google.com/spreadsheets/d/1WkmYHmFUMnS0FjM8UhX1z2S2xKGCj0vUJnnSRCHIR2E
Aymeric DIEULEVEUT
Assistant professor at Polytechnique, CMAP
Interests:
Optimization and statistics, and the links between the two
Large-scale learning
Federated, distributed, and privacy-preserving learning
Contact: [email protected]
El Mahdi EL MHAMDI
Assistant professor at Polytechnique, CMAP
Interests:
Distributed systems, distributed algorithms
Robustness, fault tolerance
Computable ethics (mathematics, analytical philosophy, social sciences, ...)
Contact: [email protected]
Supervised learning
Objective: learn a prediction rule g for Y given X.
Loss function: here, the squared loss (y − g(x))².
Then the risk is minimized by
g*(X) = E[Y | X],
the regression function.
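A quick reminder of why this holds (a standard argument, recalled here for completeness under the squared loss):

\[
\mathbb{E}\big[(Y - g(X))^2\big]
= \mathbb{E}\big[(Y - \mathbb{E}[Y \mid X])^2\big]
+ \mathbb{E}\big[(\mathbb{E}[Y \mid X] - g(X))^2\big],
\]

the cross term vanishing by the tower property. The second term is non-negative and equals zero when g(X) = E[Y | X], so the risk is minimized by g*(X) = E[Y | X].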
As the size of the class C increases, the approximation error decreases while the estimation error increases.
[Figure: approximation and estimation errors as a function of the complexity of C.]
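For reference, the trade-off in the figure corresponds to the standard excess-risk decomposition (ĝ_n denotes the rule selected in C from the data, g* the Bayes rule):

\[
R(\hat g_n) - R(g^*)
= \underbrace{\Big(R(\hat g_n) - \inf_{g \in C} R(g)\Big)}_{\text{estimation error}}
+ \underbrace{\Big(\inf_{g \in C} R(g) - R(g^*)\Big)}_{\text{approximation error}}.
\]

Enlarging C shrinks the approximation term but makes the estimation term harder to control.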
Problems
How do we choose C?
How do we select a decision rule within C?
“Generative” approach
Solution: estimate the regression function η(X) = P(Y = 1 | X) and plug this estimator into the Bayes rule: (generalized) linear models, kernel methods, k-nearest neighbors, naive Bayes, ...
“Optimization” approach
Solution: minimize the empirical risk (or an upper bound on the empirical risk): support vector machines, neural networks, ...
D_n = {(X_1, Y_1), ..., (X_n, Y_n)}

Optimization
arg min_{g ∈ C} R̂_n(g)    (1)
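A minimal sketch of the optimization approach (1) in Python. The model class (linear predictors), the loss (squared loss), and the optimizer (plain gradient descent) are illustrative choices, not ones prescribed by the course:

import numpy as np

def empirical_risk(theta, X, y):
    # R̂_n(g_theta): average squared loss of g_theta(x) = <theta, x> on D_n
    return 0.5 * np.mean((X @ theta - y) ** 2)

def erm(X, y, lr=0.1, n_steps=200):
    # approximately solve (1) over the class C of linear predictors
    theta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        grad = X.T @ (X @ theta - y) / len(y)  # gradient of the empirical risk
        theta -= lr * grad
    return theta

# Toy data D_n = {(X_1, Y_1), ..., (X_n, Y_n)}
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)
theta_hat = erm(X, y)
print(empirical_risk(theta_hat, X, y))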
Distribution
1 Several workers/agents/nodes share the data or the model
2 Data distribution: each holds a share of the data
3 Model distribution: each holds a share of the model
arg min_{θ ∈ C ⊂ R^d} Σ_{i=1}^N F_i(θ)    (3)
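A hedged sketch of the data-distributed version of (3): each worker holds its own data, computes the gradient of its local objective F_i, and a server aggregates the local gradients (averaging them, which corresponds to minimizing Σ_i F_i up to a rescaling of the step size). The quadratic local losses and all constants are placeholders:

import numpy as np

rng = np.random.default_rng(1)
N, d = 5, 3
# Worker i holds (X_i, y_i); here F_i(theta) = 0.5 * ||X_i theta - y_i||^2 / n_i
local_data = [(rng.normal(size=(20, d)), rng.normal(size=20)) for _ in range(N)]

def local_grad(theta, Xi, yi):
    # gradient of the local objective F_i at theta, computed on worker i's data only
    return Xi.T @ (Xi @ theta - yi) / len(yi)

theta = np.zeros(d)
for _ in range(100):
    grads = [local_grad(theta, Xi, yi) for Xi, yi in local_data]  # one round of communication
    theta -= 0.1 * np.mean(grads, axis=0)                         # server-side aggregation
print(theta)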
[Sketch: workers 1, ..., N with local losses F_1, ..., F_N; the overall loss F combines the F_i, and the local data distributions D_i and D_j may differ.]
Heterogeneity
1 Averaging consensus:
arg min_{θ ∈ C ⊂ R^d} Σ_{i=1}^N F_i(θ)    (4)
2 Adaptation:
arg min_{(θ_i)_{i ∈ [N]} ∈ C^N ⊂ (R^d)^N} Σ_{i=1}^N F_i(θ_i)    (5)
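A toy numerical contrast between (4) and (5), assuming hypothetical quadratic local objectives F_i(θ) = ½‖θ − c_i‖² so that both problems have closed-form solutions:

import numpy as np

rng = np.random.default_rng(2)
N, d = 4, 2
c = rng.normal(size=(N, d))      # heterogeneity: worker i's own optimum c_i

# (4) Averaging consensus: one shared theta minimizing sum_i F_i -> the mean of the c_i
theta_consensus = c.mean(axis=0)

# (5) Adaptation: one theta_i per worker, each minimizing its own F_i -> theta_i = c_i
theta_adapted = c.copy()

print("consensus:", theta_consensus)
print("adapted:", theta_adapted)

The more heterogeneous the c_i, the further the consensus solution sits from each worker's own optimum, which is what motivates the adaptation formulation.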
Communication constraints
1 Communication can be the bottleneck in distributed systems: can we get a
speedup?
2 Uploading and downloading updates: saturation of networks, bandwidth? (a compression sketch follows below)
3 Unavailability of some workers?
[Sketch: workers 1, ..., N uploading updates to and downloading the model from a central server.]
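One common answer to the bandwidth question is to compress updates before uploading them, e.g. by keeping only the k largest-magnitude coordinates (top-k sparsification). A minimal sketch, not tied to any particular system or library:

import numpy as np

def top_k(update, k):
    # keep the k largest-magnitude coordinates of a local update, zero out the rest
    compressed = np.zeros_like(update)
    idx = np.argsort(np.abs(update))[-k:]
    compressed[idx] = update[idx]
    return compressed

update = np.random.default_rng(3).normal(size=10)
print(top_k(update, k=3))   # only k values (and their indices) need to be uploaded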
Federated Learning
3 Tackles both
the averaging consensus problem
the adaptation problem
4 Data distribution:
cross-silo
cross-device
5 Some concerns
Privacy
Non-i.i.d. agents
Optimization with bandwidth constraints, partial participation (a FedAvg-style sketch follows this list)
6 Important implementation issues
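A compact FedAvg-style round (in the spirit of Federated Averaging, McMahan et al., 2017) with partial participation: each selected client runs a few local gradient steps and the server averages the returned models. The quadratic local losses and all constants are illustrative placeholders:

import numpy as np

rng = np.random.default_rng(4)
N, d = 10, 3
c = rng.normal(size=(N, d))   # client i's local optimum: non-i.i.d. clients

def local_update(theta, ci, lr=0.1, local_steps=5):
    # a few local gradient steps on F_i(theta) = 0.5 * ||theta - c_i||^2
    for _ in range(local_steps):
        theta = theta - lr * (theta - ci)
    return theta

theta = np.zeros(d)
for _ in range(50):
    participants = rng.choice(N, size=3, replace=False)    # partial participation
    models = [local_update(theta.copy(), c[i]) for i in participants]
    theta = np.mean(models, axis=0)                        # server averages the models

print(theta)            # hovers around the average of the c_i
print(c.mean(axis=0))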
ARPANET cold war motivation: how to make sure information is still available
and possible to disseminate after a nuclear attack?
How to manage the messy problem of synchronizing nodes, making them agree on values despite crashes, errors in data, asynchrony, etc.?
The proxy will inevitably be different from the intended goal, but we can study
the mismatch and avoid pitfalls.
Industrial perspective:
1 How to get people to engage
2 Loss functions from business models
Insurance:
1 Liability for loss of privacy in a privacy-preserving framework
Legal aspects:
1 How do we define privacy in the law?
2 Role of GDPR?
Economics:
1 Data valuation?
2 Value Sharing?
Cryptography:
1 Role of homomorphic encryption?
2 Secure multi-party computation (MPC)? (a toy secret-sharing sketch follows at the end of this section)
*: but for which you will be (technically) better equipped by the end of the
class.
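On the MPC question above: a toy illustration of additive secret sharing, the idea behind secure aggregation, in which the aggregator can reconstruct the sum of the clients' values but never sees an individual value in the clear. Purely pedagogical; real protocols add key agreement, dropout handling, and careful encoding of real-valued updates:

import random

PRIME = 2**61 - 1   # toy modulus; shares are uniform in [0, PRIME)

def share(value, n_parties):
    # split an integer into n_parties additive shares summing to value mod PRIME
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

values = [12, 7, 30]                                   # three clients' private values
all_shares = [share(v, n_parties=3) for v in values]   # each client sends one share to each party
partial_sums = [sum(col) % PRIME for col in zip(*all_shares)]  # each party sums what it received
print(sum(partial_sums) % PRIME)                       # 49 = 12 + 7 + 30; no single value revealed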