The Reasonable Effectiveness of The Multiplicative Weights Update Algorithm - Math ∩ Programming
(https://jeremykun.files.wordpress.com/2015/09/papad.jpg)
Christos Papadimitriou, who studies multiplicative weights in the context of biology.
Hard to believe
Sanjeev Arora and his coauthors (https://www.cs.princeton.edu/~arora/pubs/MWsurvey.pdf)
consider it “a basic tool [that should be] taught to all algorithms students together with divide-and-
conquer, dynamic programming, and random sampling.” Christos Papadimitriou
(https://www.youtube.com/watch?v=KP0WFbdHhJM) calls it “so hard to believe that it has been
discovered five times and forgotten.” It has formed the basis of algorithms in machine learning,
optimization, game theory, economics, biology, and more.
What mystical algorithm has such broad applications? Now that computer scientists have studied it in
generality, it’s known as the Multiplicative Weights Update Algorithm (MWUA). Procedurally, the
algorithm is simple. I can even describe the core idea in six lines of pseudocode. You start with a
collection of objects, and each object has a weight.
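Roughly, those six lines are (a paraphrase of the idea):

    initialize each of the n objects' weights to 1
    for each round t = 1, ..., T:
        draw an object at random, with probability proportional to its weight
        observe the outcome of this round's "event"
        receive a reward measuring how well the drawn object did on the event
        update the drawn object's weight multiplicatively, based on the reward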
The name “multiplicative weights” comes from how we implement the last step: if the weight of the chosen object at step $t$ is $w_t$ before the event, and $G$ represents how well the object did in the event, then we’ll update the weight according to the rule:

$\displaystyle w_{t+1} = w_t (1 + \varepsilon G)$

for a small “learning rate” parameter $\varepsilon > 0$.
Think of this as increasing the weight by a small multiple of the object’s performance on a given round.
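For concreteness: with a learning rate of $\varepsilon = 0.1$, an object that earns performance $G = 0.5$ in a round has its weight multiplied by $1 + (0.1)(0.5) = 1.05$, while an object with $G = 0$ keeps its weight unchanged.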
Here is a simple example of how it might be used. You have some money you want to invest, and you
have a bunch of financial experts who are telling you what to invest in every day. So each day you pick
an expert, and you follow their advice, and you either make a thousand dollars, or you lose a thousand
dollars, or something in between. Then you repeat, and your goal is to figure out which expert is the
most reliable.
This is how we use multiplicative weights: if we number the experts $1, \dots, N$, we give each expert a weight $w_i$ which starts at 1. Then, each day we pick an expert at random (where experts with larger weights are more likely to be picked) and at the end of the day we have some gain or loss $G$. Then we update the weight of the chosen expert by multiplying it by $(1 + \varepsilon G)$. Sometimes you have enough information to update the weights of experts you didn’t choose, too. The theoretical guarantees of the algorithm say we’ll find the best expert quickly (“quickly” will be made concrete later).
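To make the story concrete, here is a tiny simulation of that setup. The three “experts” and their gain distributions are invented purely for illustration (gains are scaled to $[-1, 1]$; note that negative gains technically require the minor modifications to the analysis mentioned below, but the mechanics are identical):

    import random

    random.seed(0)
    experts = [
        lambda: random.uniform(-1.0, 0.2),   # usually loses money
        lambda: random.uniform(-0.5, 0.5),   # breaks even on average
        lambda: random.uniform(-0.2, 1.0),   # usually makes money
    ]

    learningRate = 0.1
    weights = [1.0] * len(experts)

    for day in range(1000):
        # pick an expert at random, proportionally to the weights
        choice = random.uniform(0, sum(weights))
        expertIndex = 0
        while choice > weights[expertIndex]:
            choice -= weights[expertIndex]
            expertIndex += 1

        gain = experts[expertIndex]()  # follow that expert's advice today
        weights[expertIndex] *= (1 + learningRate * gain)

    print(weights)  # the third expert's weight ends up dominating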
In fact, let’s play a game where you, dear reader, get to decide the rewards for each expert and each day.
I programmed the multiplicative weights algorithm to react according to your choices. Click the image
below to go to the demo.
(https://j2kun.github.io/mwua/index.html)
This core mechanism of updating weights can be interpreted in many ways, and that’s part of the reason
it has sprouted up all over mathematics and computer science. Just a few examples of where this has led:
1. In game theory, weights are the “belief” of a player about the strategy of an opponent. The most
famous algorithm to use this is called Fictitious Play
(https://en.wikipedia.org/wiki/Fictitious_play), and others include EXP3
(https://jeremykun.com/2013/11/08/adversarial-bandits-and-the-exp3-algorithm/) for minimizing
regret in the so-called “adversarial bandit learning” problem.
2. In machine learning, weights are the difficulty of a specific training example, so that higher weights
mean the learning algorithm has to “try harder” to accommodate that example. The first result I’m
aware of for this is the Perceptron (https://jeremykun.com/2011/08/11/the-perceptron-and-all-the-
things-it-cant-perceive/) (and similar Winnow) algorithm for learning hyperplane separators. The
most famous is the AdaBoost algorithm (https://jeremykun.com/2015/05/18/boosting-census/).
3. Analogously, in optimization, the weights are the difficulty of a specific constraint, and this technique
can be used to approximately solve linear and semidefinite programs. The approximation is because
MWUA only provides a solution with some error.
4. In mathematical biology, the weights represent the fitness of individual alleles, and filtering
reproductive success (http://www.pnas.org/content/111/29/10620.abstract) based on this and
updating weights for successful organisms produces a mechanism very much like evolution. With
modifications, it also provides a mechanism through which to understand sex
(http://web.stanford.edu/class/ee380/Abstracts/120425-slides.pdf) in the context of evolutionary
biology.
5. The TCP protocol, which basically defined the internet, uses additive and multiplicative weight
updates (which are very similar in the analysis) to manage congestion
(https://en.wikipedia.org/wiki/TCP_congestion_control).
6. You can get easy $\log(n)$-approximation algorithms for many NP-hard problems, such as set cover
(https://en.wikipedia.org/wiki/Set_cover_problem).
Additional, more technical examples can be found in this survey of Arora et al.
(https://www.cs.princeton.edu/~arora/pubs/MWsurvey.pdf)
In the rest of this post, we’ll implement a generic Multiplicative Weights Update Algorithm, we’ll prove
its main theoretical guarantees, and we’ll implement a linear program solver as an example of its
applicability. As usual, all of the code used in the making of this post is available in a Github repository
(https://github.com/j2kun/mwua).
Let’s start by writing down pseudocode and an implementation for the MWUA algorithm in full
generality.
In general we have some set $X$ of objects and some set $Y$ of “event outcomes,” which can be completely independent. If these sets are finite, we can write down a table $M$ whose rows are objects, whose columns are outcomes, and whose $(x, y)$ entry is the reward produced by object $x$ when the outcome is $y$. We will also write this as $M(x, y)$ for object $x$ and outcome $y$. The only assumption we’ll make on the rewards is that the values $M(x, y)$ are bounded by some small constant $B$ (by small I mean $B$ should not require exponentially many bits to write down as compared to the size of $X$). In symbols, $0 \leq M(x, y) \leq B$. There are minor modifications you can make to the algorithm if you want negative rewards, but for simplicity we will leave that out. Note the table $M$ just exists for analysis, and the algorithm does not know its values. Moreover, while the values in $M$ are static, the choice of outcome $y_t$ for a given round may be nondeterministic.
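(To tie this back to the experts story: the objects are the $N$ experts, an outcome is a day’s worth of market results, and $M(x, y)$ is the gain of expert $x$ under outcome $y$, shifted and scaled if necessary so that it lands in $[0, B]$.)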
Let’s describe the algorithm in notation first and build up pseudocode as we go. The input to the
algorithm is the set of objects, a subroutine that observes an outcome, a black-box reward function, a
learning rate parameter, and a number of rounds.
We define for object $x$ a nonnegative number $w_x$ we call a “weight.” The weights will change over time, so we’ll also subscript a weight with a round number $t$, i.e. $w_{x,t}$ is the weight of object $x$ in round $t$. Initially, all the weights are $1$. Then MWUA continues in rounds. We start each round by drawing an object randomly with probability proportional to the weights. Then we observe the outcome for that round and the reward for that round.
import random

# draw: [float] -> int
# pick an index from the given list of floats proportionally
# to the size of the entry (i.e. normalize to a probability
# distribution and draw according to the probabilities).
def draw(weights):
    choice = random.uniform(0, sum(weights))
    choiceIndex = 0

    for weight in weights:
        choice -= weight
        if choice <= 0:
            return choiceIndex

        choiceIndex += 1

# MWUA: the multiplicative weights update algorithm
def MWUA(objects, observeOutcome, reward, learningRate, numRounds):
    weights = [1] * len(objects)
    for t in range(numRounds):
        chosenObjectIndex = draw(weights)
        chosenObject = objects[chosenObjectIndex]

        outcome = observeOutcome(t, weights, chosenObject)
        thisRoundReward = reward(chosenObject, outcome)

        ...
Sampling objects in this way is the same as associating a distribution $D_t$ with each round $t$: if $S_t = \sum_{x} w_{x,t}$, then the probability of drawing $x$, which we denote $D_t(x)$, is $w_{x,t} / S_t$. We don’t need to keep track of this distribution in the actual run of the algorithm, but it will help us with the mathematical analysis.
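For example, if three objects currently have weights $(1, 1, 2)$, then $D_t = (1/4, 1/4, 1/2)$.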
Next comes the weight update step. Let’s call our learning rate parameter $\varepsilon$. In round $t$, say we have object $x_t$ and outcome $y_t$, so the reward is $M(x_t, y_t)$. We update the weight of the chosen object according to the formula:

$\displaystyle w_{x_t, t+1} = w_{x_t, t} \left( 1 + \frac{\varepsilon M(x_t, y_t)}{B} \right)$

In the more general setting where you have rewards for all objects (if not, the reward-producing function can output zero), you would perform this weight update on all objects $x$. This turns into a short Python snippet, where we hide the division by $B$ in the choice of learning rate.
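Here’s a sketch of that snippet, completing the MWUA loop we started above (my reconstruction, written to match how MWUA is invoked in the solver later: it returns the final weights, the cumulative reward of the chosen objects, and the list of observed outcomes):

    # MWUA with the multiplicative update filled in. The division by B is
    # absorbed into learningRate, so learningRate plays the role of epsilon / B.
    def MWUA(objects, observeOutcome, reward, learningRate, numRounds):
        weights = [1] * len(objects)
        cumulativeReward = 0
        outcomes = []

        for t in range(numRounds):
            chosenObjectIndex = draw(weights)
            chosenObject = objects[chosenObjectIndex]

            outcome = observeOutcome(t, weights, chosenObject)
            outcomes.append(outcome)
            cumulativeReward += reward(chosenObject, outcome)

            # the multiplicative update, applied to every object
            for i in range(len(weights)):
                weights[i] *= (1 + learningRate * reward(objects[i], outcome))

        return weights, cumulativeReward, outcomes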
The outcomes $y_t$ can be chosen adversarially, even with full knowledge of the current weights (recall, the algorithm never sees the table $M$). But even in such an oppressive, exploitative environment, MWUA persists and achieves its guarantee. And now we can state that guarantee.

Theorem: For rewards bounded by $B$, a learning rate $0 < \varepsilon < 1/2$, and any fixed object $x$, the expected cumulative reward of MWUA over $T$ rounds satisfies

$\displaystyle \sum_{t=1}^{T} R_t \geq (1 - \varepsilon) \sum_{t=1}^{T} M(x, y_t) - \frac{B \ln n}{\varepsilon}$

where $R_t$ is the expected reward of MWUA in round $t$ and $n$ is the number of objects.
The core of the proof, which we’ll state as a lemma, uses one of the most elegant proof techniques in all of mathematics. It’s the idea of constructing a potential function and tracking the change in that potential function over time. Such a proof usually follows a mysterious script: define a potential function, show it cannot grow too quickly, show the quantity you care about forces it to be large, and squeeze the theorem out from between the two bounds.
Clearly, coming up with a useful potential function is a difficult and prized skill.
In this proof our potential function is the sum of the weights of the objects in a given round, $S_t = \sum_{x} w_{x,t}$. Now the lemma.
Lemma: Let $B$ be the bound on the size of the rewards, and $0 < \varepsilon < 1/2$ a learning parameter. Recall that $D_t(x)$ is the probability that MWUA draws object $x$ in round $t$. Write the expected reward for MWUA for round $t$ as the following (using only the definition of expected value):

$\displaystyle R_t = \sum_{x} D_t(x) M(x, y_t)$

Then the potential function satisfies $S_{t+1} \leq S_t e^{\varepsilon R_t / B}$.

Proof. The update rule multiplies each weight by $(1 + \varepsilon M(x, y_t)/B)$, so

$\displaystyle S_{t+1} = \sum_{x} w_{x,t} \left( 1 + \frac{\varepsilon M(x, y_t)}{B} \right) = S_t + \frac{\varepsilon}{B} \sum_{x} w_{x,t} M(x, y_t)$

Using the fact that $D_t(x) = w_{x,t} / S_t$, we can replace $w_{x,t}$ with $D_t(x) S_t$, which allows us to get

$\displaystyle S_{t+1} = S_t \left( 1 + \frac{\varepsilon R_t}{B} \right)$

And then using the fact that $1 + z \leq e^z$ (Taylor series), we can bound the last expression by $S_t e^{\varepsilon R_t / B}$, as desired. $\square$
Now using the lemma, we can get a hold on the potential after a large number of rounds $T$, namely that

$\displaystyle S_{T+1} \leq S_1 e^{\varepsilon \sum_{t=1}^{T} R_t / B}$

If there are $n$ objects then $S_1 = n$, simplifying the above. Moreover, the sum of the weights in round $T$ is certainly greater than any single weight, so that for every fixed object $x$,

$\displaystyle S_{T+1} \geq w_{x, T+1} = \prod_{t=1}^{T} \left( 1 + \frac{\varepsilon M(x, y_t)}{B} \right)$

Squeezing $w_{x,T+1}$ between these two inequalities and taking logarithms (to simplify the exponents) gives

$\displaystyle \sum_{t=1}^{T} \ln\left( 1 + \frac{\varepsilon M(x, y_t)}{B} \right) \leq \ln n + \frac{\varepsilon}{B} \sum_{t=1}^{T} R_t$

Multiply through by $B$, divide by $\varepsilon$, rearrange, and use the fact that when $0 \leq z \leq 1/2$ we have $\ln(1 + z) \geq z - z^2$ (Taylor series), together with the bound $M(x, y_t) \leq B$ once more, to get

$\displaystyle \sum_{t=1}^{T} R_t \geq (1 - \varepsilon) \left[ \sum_{t=1}^{T} M(x, y_t) \right] - \frac{B \ln n}{\varepsilon}$

The bracketed term is the payoff of object $x$, and MWUA’s payoff is at least a $(1 - \varepsilon)$ fraction of that minus the logarithmic term. The bound applies to any object $x$, and hence to the best one. This proves the theorem. $\square$
Briefly discussing the bound itself: the smaller the learning rate $\varepsilon$ is, the closer you eventually get to the best object’s payoff, but by contrast the more the subtracted quantity $\frac{B \ln n}{\varepsilon}$ hurts you. If your target is an absolute error bound of $\delta$ against the best performing object on average, you can do more algebra to determine how many rounds you need in terms of a fixed $\delta$. The answer is roughly: let $\varepsilon = \delta / (2B)$ and pick $T = \frac{4 B^2 \ln n}{\delta^2}$. See this survey
(https://www.cs.princeton.edu/~arora/pubs/MWsurvey.pdf) for more.
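As a quick sanity check of these settings, here is a small helper (hypothetical, not part of the post’s codebase) that computes $\varepsilon$ and $T$ from $B$, $n$, and a target average error $\delta$ using the rough recipe above:

    import math

    # suggest MWUA parameters for average regret at most delta,
    # following the rough bound discussed above
    def mwuaParameters(B, n, delta):
        epsilon = delta / (2 * B)
        numRounds = math.ceil(4 * B**2 * math.log(n) / delta**2)
        return epsilon, numRounds

    # e.g., rewards bounded by B = 1, n = 10 experts, target delta = 0.1:
    # mwuaParameters(1, 10, 0.1) == (0.05, 922)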
Linear programs

Now let’s turn to the linear programming application. The program we’ll solve has the covering form: minimize $\langle c, x \rangle$ subject to $Ax \geq b$ and $x \geq 0$, where $A$ is an $m \times n$ matrix of constraints. We can further simplify the constraints by assuming we know the optimal value $Z = \langle c, x^* \rangle$ in advance, by doing a binary search (more on this later). So, if we ignore the hard constraint $Ax \geq b$, the “easy feasible region” of possible $x$’s includes $Q = \{ x : x \geq 0, \langle c, x \rangle = Z \}$.
In order to fit linear programming into the MWUA framework we have to define two things.

1. The objects: these are the constraints of the linear program, one object per row $A_i$ (with threshold $b_i$).
2. The rewards: the reward of a constraint in a round is its error on a specially chosen input vector, i.e. how far the constraint is from being satisfied.

Number 2 is curious (why would we give a reward for error?) but it’s crucial, and we’ll discuss it momentarily.
The special input $x_t$ depends on the weights in round $t$ (which is allowed, recall). Specifically, if the weights are $w = (w_1, \dots, w_m)$, we ask for a vector $x_t$ in our “easy feasible region” which satisfies

$\displaystyle (w^T A) x_t \geq w^T b$

For this post we call the implementation of procuring such a vector the “oracle,” since it can be seen as the black-box problem of, given a vector $q$ and a scalar $P$ and a convex region $Q$, finding a vector $x \in Q$ satisfying $\langle q, x \rangle \geq P$. This allows one to solve more complex optimization problems with the same technique, swapping in a new oracle as needed. Our choice of inputs, $q = w^T A$, $P = w^T b$, and $Q = \{ x : x \geq 0, \langle c, x \rangle = Z \}$, are particular to the linear programming formulation.
Two remarks on this choice of inputs. First, the vector $w^T A$ is a weighted average of the constraints in $A$, and $w^T b$ is a weighted average of the thresholds. So this inequality is a “weighted average” inequality (a convex combination, after normalizing, since the weights are nonnegative). In particular, if no such $x$ exists, then the original linear program has no solution. Indeed, given a solution $x^*$ to the original linear program, each constraint, say $A_1 x^* \geq b_1$, is preserved when multiplied by the nonnegative weight $w_1$, and summing the weighted constraints gives $(w^T A) x^* \geq w^T b$.
Second, and more important to the conceptual understanding of this algorithm, the choice of rewards
and the multiplicative updates ensure that easier constraints show up less prominently in the inequality
by having smaller weights. That is, if we end up overly satisfying a constraint, we penalize that object for
future rounds so we don’t waste our effort on it. The byproduct of MWUA—the weights—identify the
hardest constraints to satisfy, and so in each round we can put a proportionate amount of effort into
solving (one of) the hard constraints. This is why it makes sense to reward error; the error is a signal for
where to improve, and by over-representing the hard constraints, we force MWUA’s attention on them.
At the end, our final output is an average of the $x_t$ produced in each round, i.e. $\bar{x} = \frac{1}{T} \sum_{t=1}^{T} x_t$. This vector satisfies all the constraints to a roughly equal degree. We will skip the proof that this vector does what we want, but see these notes for a simple proof
(https://www.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15859-f11/www/notes/lecture17.pdf).
We’ll spend the rest of this post implementing the scheme outlined above.
Implementing the oracle
For the case of this linear region $Q$, we can simply find the index $i$ which maximizes $Z q_i / c_i$. If this value exceeds $P$, we can return the vector with $Z / c_i$ in the $i$-th position and zeros elsewhere. Otherwise, the problem has no solution.
To prove the “no solution” part, say $n = 2$ and you have a solution $(x_1, x_2)$ to $c_1 x_1 + c_2 x_2 = Z$. Then for whichever index makes $q_i / c_i$ bigger, say $i = 1$, you can increase $\langle q, x \rangle$ without changing $\langle c, x \rangle$ by replacing $x_1$ with $x_1 + (c_2 / c_1) x_2$ and $x_2$ with zero. I.e., we’re moving the solution along the line $c_1 x_1 + c_2 x_2 = Z$ until it reaches a vertex of the region bounded by $x_1 \geq 0$ and $x_2 \geq 0$. This must happen when all entries but one are zero. This is the same reason why optimal solutions of (generic) linear programs occur at vertices of their feasible regions.
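For instance, with $c = (1, 2)$, $Z = 4$, and $q = (3, 1)$, the ratios $q_i / c_i$ are $3$ and $1/2$, so the oracle picks $x = (4, 0)$, achieving $\langle q, x \rangle = 12$; no other point of $\{ x \geq 0 : \langle c, x \rangle = 4 \}$ does better, since on that segment $\langle q, x \rangle = 3 x_1 + x_2 = 12 - 5 x_2 \leq 12$.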
The code for this becomes quite simple. Note we use the numpy library in the entire codebase to make
linear algebra operations fast and simple to read.
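Here is a sketch of the oracle. The helper name makeOracle and the exception InfeasibleException are my choices, made to match how they’re used in the solver below; the sketch also assumes every entry of the objective vector is positive (see the comments at the end of the post for what goes wrong otherwise):

    import numpy as np

    class InfeasibleException(Exception):
        pass

    # makeOracle: build the oracle for the region Q = {x >= 0 : <c, x> = Z}.
    # Given q = w^T A and P = w^T b, the oracle returns the vertex of Q
    # maximizing <q, x>, or raises InfeasibleException if even that vertex
    # fails the inequality <q, x> >= P.
    def makeOracle(c, optimalValue):
        def oracle(weightedVector, weightedThreshold):
            n = len(c)

            # value of <q, x> at the vertex (Z / c_i) e_i; assumes c_i > 0
            def vertexValue(i):
                return optimalValue * weightedVector[i] / c[i]

            best = max(range(n), key=vertexValue)
            if vertexValue(best) < weightedThreshold:
                raise InfeasibleException

            x = np.zeros(n)
            x[best] = optimalValue / c[best]
            return x

        return oracle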
The core solver implements the discussion from above, given the optimal value of the linear program as input. To avoid too many single-letter variable names, we use linearObjective instead of $c$.
def solveGivenOptimalValue(A, b, linearObjective, optimalValue, learningRate):
    m, n = A.shape  # m equations, n variables
    oracle = makeOracle(linearObjective, optimalValue)

    def reward(i, specialVector):
        ...

    def observeOutcome(_, weights, __):
        ...

    numRounds = 1000
    weights, cumulativeReward, outcomes = MWUA(
        range(m), observeOutcome, reward, learningRate, numRounds
    )
    averageVector = sum(outcomes) / numRounds

    return averageVector
First we make the oracle, then the reward and outcome-producing functions, then we invoke the MWUA subroutine. Here are those two functions; they are closures because they need access to $A$ and $b$. Note that neither $c$ nor the optimal value show up here.
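A sketch of those two functions, written to match the call signatures used above (the exact bodies are my reconstruction; the logic follows the discussion of rewards-as-errors and the weighted-average inequality):

    # the reward for constraint i on the oracle's vector is its error:
    # how far A_i x falls short of the threshold b_i
    def reward(i, specialVector):
        constraint = A[i]
        threshold = b[i]
        return threshold - np.dot(constraint, specialVector)

    # the "outcome" of a round is the oracle's vector for the
    # weighted-average inequality (w^T A) x >= w^T b
    def observeOutcome(_, weights, __):
        weights = np.array(weights)
        weightedVector = A.transpose().dot(weights)
        weightedThreshold = weights.dot(b)
        return oracle(weightedVector, weightedThreshold)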
Finally, the top-level routine. Note that the binary search for the optimal value is sophisticated (though it
could be more sophisticated). It takes a max range for the search, and invokes the optimization
subroutine, moving the upper bound down if the linear program is feasible and moving the lower
bound up otherwise.
def solve(A, b, linearObjective, maxRange=1000):
    optRange = [0, maxRange]

    while optRange[1] - optRange[0] > 1e-8:
        proposedOpt = sum(optRange) / 2
        print("Attempting to solve with proposedOpt=%G" % proposedOpt)

        # Because the binary search starts so high, it results in extremely
        # large reward values that must be tempered by a slow learning rate.
        # Exercise to the reader: determine absolute bounds for the rewards,
        # and set this learning rate in a more principled fashion.
        learningRate = 1 / max(2 * proposedOpt * c for c in linearObjective)
        learningRate = min(learningRate, 0.1)

        try:
            result = solveGivenOptimalValue(
                A, b, linearObjective, proposedOpt, learningRate
            )
            optRange[1] = proposedOpt
        except InfeasibleException:
            optRange[0] = proposedOpt

    return result
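Calling the solver on a concrete instance might look like the following. The instance here is my choice, but it is consistent with the printed output below: the reported solution $(0, 0.987, 1.026)$ has objective value $\approx 3$ and constraint errors $\approx (0.052, 8.5 \times 10^{-9})$ for this data:

    # a covering LP: minimize <c, x> subject to Ax >= b, x >= 0
    A = np.array([[1.0, 2.0, 3.0],
                  [0.0, 4.0, 2.0]])
    b = np.array([5.0, 6.0])
    c = np.array([1.0, 2.0, 1.0])

    solution = solve(A, b, c, maxRange=1000)
    print(solution)                 # the averaged vector x
    print(np.dot(c, solution))      # its objective value <c, x>
    print(np.dot(A, solution) - b)  # the per-constraint error Ax - b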
The output:
Attempting to solve with proposedOpt=500
Attempting to solve with proposedOpt=250
Attempting to solve with proposedOpt=125
Attempting to solve with proposedOpt=62.5
Attempting to solve with proposedOpt=31.25
Attempting to solve with proposedOpt=15.625
Attempting to solve with proposedOpt=7.8125
Attempting to solve with proposedOpt=3.90625
Attempting to solve with proposedOpt=1.95312
Attempting to solve with proposedOpt=2.92969
Attempting to solve with proposedOpt=3.41797
Attempting to solve with proposedOpt=3.17383
Attempting to solve with proposedOpt=3.05176
Attempting to solve with proposedOpt=2.99072
Attempting to solve with proposedOpt=3.02124
Attempting to solve with proposedOpt=3.00598
Attempting to solve with proposedOpt=2.99835
Attempting to solve with proposedOpt=3.00217
Attempting to solve with proposedOpt=3.00026
Attempting to solve with proposedOpt=2.99931
Attempting to solve with proposedOpt=2.99978
Attempting to solve with proposedOpt=3.00002
Attempting to solve with proposedOpt=2.9999
Attempting to solve with proposedOpt=2.99996
Attempting to solve with proposedOpt=2.99999
Attempting to solve with proposedOpt=3.00001
Attempting to solve with proposedOpt=3
Attempting to solve with proposedOpt=3  # note %G rounds the printed value
Attempting to solve with proposedOpt=3
Attempting to solve with proposedOpt=3
Attempting to solve with proposedOpt=3
Attempting to solve with proposedOpt=3
Attempting to solve with proposedOpt=3
Attempting to solve with proposedOpt=3
Attempting to solve with proposedOpt=3
Attempting to solve with proposedOpt=3
Attempting to solve with proposedOpt=3
[ 0.     0.987  1.026]
3.00000000425
[  5.20000072e-02   8.49831849e-09]
So there we have it. A fiendishly clever use of multiplicative weights for solving linear programs.
Discussion
One of the nice aspects of MWUA is that it’s completely transparent. If you want to know why a decision
was made, you can simply look at the weights and look at the history of rewards of the objects. There’s
also a clear interpretation of what is being optimized, as the potential function used in the proof is a
measure of both quality and adaptability to change. The latter is why MWUA succeeds even in adversarial
settings, and why it makes sense to think about MWUA in the context of evolutionary biology.
This even makes one imagine new problems that traditional algorithms cannot solve, but which MWUA
handles with grace. For example, imagine trying to solve an “online” linear program in which over time
a constraint can change. MWUA can adapt to maintain its approximate solution.
The linear programming technique is known in the literature as the Plotkin-Shmoys-Tardos framework
for covering and packing problems. The same ideas extend to other convex optimization problems,
including semidefinite programming (https://en.wikipedia.org/wiki/Semidefinite_programming).
If you’ve been reading this entire post screaming “This is just gradient descent!” then you’re right and wrong. It bears a striking resemblance to gradient descent (see this document for details
(http://tcs.epfl.ch/files/content/sites/tcs/files/Lec2-Fall14-Ver2.pdf) about how special cases of
MWUA are gradient descent by another name), but the adaptivity for the rewards makes MWUA
different.
Even though so many people have been advocating for MWUA over the past decade, it’s surprising that
it doesn’t show up in the general math/CS discourse on the internet or even in many algorithms courses.
The Arora survey (https://www.cs.princeton.edu/~arora/pubs/MWsurvey.pdf) I referenced is from
2005 and the linear programming technique I demoed (http://dl.acm.org/citation.cfm?id=208531) is
originally from 1991! I took algorithms classes wherever I could, starting undergraduate in 2007, and I
didn’t even hear a whisper of this technique until midway through my PhD in theoretical CS (I did,
however, study fictitious play in a game theory class). I don’t have an explanation for why this is the
case, except maybe that it takes more than 20 years for techniques to make it to the classroom. At the
very least, this is one good reason to go to graduate school. You learn the things (and where to look for
the things) which haven’t made it to classrooms yet.
Jakob Lund
March 3, 2017 at 2:07 pm
Dang, I thought I was following you here… If I follow the code and notation, then
Sorry if I’m being a complete bozoid. And thanks btw for posting this, it’s great stuff!
Jakob
j2kun
March 3, 2017 at 2:34 pm
Ah, you’re right. I was translating the proof from an (equivalent) version in which the updates
put the in the exponent, and the sums work out nicely. I will have to fix this later today. Nice
catch!
Jakob Lund
March 3, 2017 at 2:12 pm
ouch some math is being eaten by html… I meant to write
“if any of the M_t < B then…”
j2kun
March 2, 2017 at 11:45 am • Reply
It’s not guaranteed to get anything in absolute terms. It minimizes regret in hindsight when
compared with having chosen the single option that historically performed the best every time.
Of course, that regret still grows over time no matter what, but the growth rate of the regret is
logarithmic in the number of rounds.
4.
Dang Manh Truong
March 5, 2017 at 1:39 am • Reply
“Moreover, the sum of the weights in round T is certainly less than any single weight” shouldn’t it
be: “Moreover, the sum of the weights in round T is certainly greater than any single weight” ?
5.
Ives Macedo
March 14, 2017 at 11:23 am • Reply
Hi Jeremy, I believe it makes sense to provide the official link to the paper by Arora, Hazan, and Kale
since it was published on an Open Access Journal. http://dx.doi.org/10.4086/toc.2012.v008a006
j2kun
March 14, 2017 at 11:24 am • Reply
Thanks! Much appreciated
6.
Daniel
March 20, 2017 at 6:54 am • Reply
Maybe a small typo: in the line
shouldn’t the \leq be a \geq instead (since \sum_t M(x,y_t)/B is between 0 and 1)?
7.
zenburn
April 18, 2017 at 4:11 am • Reply
small thing:
the draw(…) code seems to be doing something similar to what is called roulette-wheel selection, right? If so, then computing a cumulative sum and then using bisection should be more efficient. So we have the following:

import bisect
import random

def draw(weights):
    cws = [sum(weights[:i+1]) for i in range(len(weights))]
    return bisect.bisect_left(cws, random.uniform(0, cws[-1]))
beautiful article by the way
8.
Lynda Ruiz
January 21, 2018 at 6:56 am • Reply
Useful information. Lucky me, I discovered your website accidentally, and I am shocked this coincidence did not come about earlier! I bookmarked it.
9.
man on laptop
September 5, 2018 at 4:26 am • Reply
Hey Jeremy. Referring to your opening sentence, do you know what Sanjeev Arora meant by
“random sampling”? Do they mean randomized algorithms, or sampling theory (rejection sampling,
hierarchical sampling, inverse transform sampling, MCMC)? What are the applications of sampling
theory to computer science?
j2kun
September 5, 2018 at 8:25 pm • Reply
I expect Arora means the use of random sampling to solve otherwise difficult problems. For
example, random sampling for primality testing (or other kinds of property testing) is far more
efficient than the best known algorithm for exact computation. I think MCMC also falls within
this. I’m not quite sure what you mean by “sampling theory” insofar as being different from the
general class of algorithms that use randomness.
10.
yifei yuan
August 17, 2019 at 5:17 am • Reply
Sorry to ask a question: if the objective vector c contains a negative value, your method will force the corresponding entry of x to be 0, which may not be the optimal solution.