MPC Paper
1 Introduction
It appears that for a majority of the datasets (e.g., the MNIST database of
handwritten digits [15] or the ARCENE dataset [14]), the classification achieves
very good accuracy after only a few iterations of the gradient descent using a
piecewise-linear approximation of the sigmoid function sigmo : R → [0, 1] defined
as
sigmo(x) = 1 / (1 + e^(−x)),
although the current cost function is still far from the minimum value [25].
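For intuition (not taken from the paper), the following short Python snippet contrasts the exact sigmoid with a generic clipped-linear surrogate; the cut points below are arbitrary and need not match the approximation used in [25].

```python
# Quick plaintext illustration of the sigmoid and a generic piecewise-linear
# surrogate; the particular surrogate is an illustrative choice, not the one
# from [25].
import numpy as np

def sigmo(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmo_piecewise(x):
    # 0 below x = -2, 1 above x = 2, linear in between.
    return np.clip(0.25 * x + 0.5, 0.0, 1.0)

xs = np.linspace(-6, 6, 200)
print(np.max(np.abs(sigmo(xs) - sigmo_piecewise(xs))))  # worst-case gap
```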
Other approximation methods of the sigmoid function have also been proposed
in the past. In [29], an approximation with low degree polynomials resulted in a
more efficient but less accurate algorithm. Conversely, a higher-degree polyno-
mial approximation applied to deep learning algorithms in [24] yielded more ac-
curate, but less efficient algorithms (and thus, less suitable for privacy-preserving
computing). In parallel, approximation solutions for privacy-preserving methods
based on homomorphic encryption [2], [27], [18], [22] and differential privacy [1],
[10] have been proposed in the context of both classification algorithms and deep
learning.
Nevertheless, accuracy itself is not always a sufficient measure for the quality
of the model, especially if, as mentioned in [19, p.423], our goal is to detect a rare
event such as a rare disease or a fraudulent financial transaction. If, for example,
one out of every one thousand transactions is fraudulent, a naïve model that
classifies all transactions as honest achieves 99.9% accuracy; yet this model has
no predictive capability. In such cases, measures such as precision, recall and
F1-score allow for better estimating the quality of the model. They bound the
rates of false positives or negatives relative to only the positive events rather
than the whole dataset.
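As a concrete, hypothetical illustration of this point (the confusion-matrix counts below are made up, not taken from the paper), one can compute these metrics for the 1-in-1000 fraud scenario:

```python
# Hypothetical confusion-matrix counts for 100,000 transactions with a 0.1%
# fraud rate: the "always honest" classifier vs. a detector that finds most
# frauds. Numbers are illustrative only.
def scores(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

# Naive model: classifies everything as honest (100 frauds among 100,000).
print(scores(tp=0, fp=0, fn=100, tn=99_900))    # accuracy 0.999, F1 = 0
# A useful detector: catches 70 of the 100 frauds, with 30 false alarms.
print(scores(tp=70, fp=30, fn=30, tn=99_870))   # accuracy ~0.999, F1 = 0.7
```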
The techniques cited above achieve excellent accuracy for most balanced
datasets, but since they rely on a rough approximation of the sigmoid function,
they do not converge to the same model and thus, they provide poor scores on
datasets with a very low acceptance rate. In this paper, we show how to regain
this numerical precision in MPC, and to reach the same score as the plaintext
regression. Our MPC approach is mostly based on additive secret shares with
precomputed multiplication triplets [4]. This means that the computation is
divided into two phases: an offline phase that can be executed before the data is
shared between the players, and an online phase that computes the actual result.
For the offline phase, we propose a first solution based on a trusted dealer, and
then discuss a protocol where the dealer is honest-but-curious.
New protocol for the honest-but-curious offline phase, extendable to n players.
We introduce a new protocol for executing the offline phase in the honest-but-
curious model that is easily extendable to a generic number n of players while
remaining efficient. To achieve this, we use a broadcast channel instead of peer-
to-peer communication which avoids a quadratic explosion in the number of
communications. This is an important contribution, as none of the previous
protocols for n > 3 players in this model are efficient. In [17], for instance, the
authors propose a very efficient algorithm in the trusted dealer model; yet the
execution time of the oblivious transfer protocol is quite slow.
Computing secret shares for a sum x + y (or a linear combination if (G, +) has
a module structure) can be done non-interactively by each player by adding
the corresponding shares of x and y. Computing secret shares for a product
is more challenging. One way to do that is to use an idea of Beaver based on
precomputed and secret-shared multiplication triplets. From a general point of
view, let (G1 , +), (G2 , +) and (G3 , +) be three abelian groups and let π : G1 ×
G2 → G3 be a bilinear map.
Given additive secret shares JxK+ and JyK+ for two elements x ∈ G1 and
y ∈ G2 , we would like to compute secret shares for the element π(x, y) ∈ G3 .
With Beaver’s method, the players must employ precomputed single-use random
triplets (JλK+ , JµK+ , Jπ(λ, µ)K+ ) for λ ∈ G1 and µ ∈ G2 , and then use them to
mask and reveal a = x + λ and b = y + µ. The players then compute secret
shares for π(x, y) as follows: each player i sets z_i = Jπ(λ, µ)K_i − π(a, µ_i) − π(λ_i, b),
where λ_i, µ_i and Jπ(λ, µ)K_i denote player i's shares of λ, µ and π(λ, µ), and one
designated player (say player 1) additionally adds π(a, b) to his share.
The computed z_1, . . . , z_n are then additive shares of π(x, y). A given λ can
be used to mask only one variable, so one triplet must be precomputed for each
multiplication during the offline phase (i.e. before the data is made available
to the players). Instantiated with the appropriate groups, this abstract scheme
allows us to evaluate not only a product in a ring, but also a dot product of
vectors, a matrix-vector product, or a matrix-matrix product.
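To make the abstract scheme concrete, here is a minimal sketch of Beaver's multiplication for the scalar case over the ring Z/2^64 Z with additive shares; the paper instead instantiates the idea over masked fixed-point reals, so the modulus, names and the small example are purely illustrative.

```python
# Illustrative sketch of Beaver's multiplication-triplet technique for n
# players with additive shares over 64-bit integers; not the paper's
# fixed-point instantiation.
import secrets

MOD = 2**64

def share(x, n=2):
    """Split x into n additive shares modulo MOD."""
    shares = [secrets.randbelow(MOD) for _ in range(n - 1)]
    shares.append((x - sum(shares)) % MOD)
    return shares

def beaver_mul(x_shares, y_shares, lam_shares, mu_shares, lammu_shares):
    """Multiply secret-shared x and y using a precomputed triplet (λ, µ, λµ)."""
    n = len(x_shares)
    # Each player broadcasts its share of a = x + λ and b = y + µ, which are revealed.
    a = sum((xi + li) % MOD for xi, li in zip(x_shares, lam_shares)) % MOD
    b = sum((yi + mi) % MOD for yi, mi in zip(y_shares, mu_shares)) % MOD
    # Local computation of the output shares; player 0 adds the public a*b term.
    z = [(-a * mu_shares[i] - lam_shares[i] * b + lammu_shares[i]) % MOD
         for i in range(n)]
    z[0] = (z[0] + a * b) % MOD
    return z

# Usage: shares of x = 6 and y = 7 are multiplied, giving shares of 42.
lam, mu = secrets.randbelow(MOD), secrets.randbelow(MOD)
triplet = (share(lam), share(mu), share(lam * mu % MOD))
z = beaver_mul(share(6), share(7), *triplet)
assert sum(z) % MOD == 42
```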
For various applications (e.g., logistic regression in Section 6), we need to com-
pute continuous real-valued functions over secret shared data. For non-linear
functions (e.g. exponential, log, power, cos, sin, sigmoid, etc.), different methods
are proposed in the literature.
A straightforward approach consists in implementing a full floating-point
arithmetic framework [6, 12] and compiling a data-oblivious algorithm that
evaluates the function over floats. This is, for instance, what Sharemind and SPDZ
use. However, these two generic methods lead to prohibitive running times if the
floating point function has to be evaluated millions of times.
In this section, we present our masking technique for fixed-point arithmetic
and provide an algorithm for the MPC evaluation of real-valued continuous
functions. In particular, we show that to achieve p bits of numerical precision
in MPC, it suffices to use (p + 2τ)-bit floating-point numbers, where τ is a fixed
security parameter.
The secret shares we consider are real numbers. We would like to mask these
shares using floating point numbers. Yet, as there is no uniform distribution
on R, no additive masking distribution over the reals can perfectly hide an
arbitrary input. When the secret shares belong to some known range of
numerical precision, however, it is possible to carefully choose a masking
distribution, depending on the precision range, so that the masked value
computationally leaks no information about the input. A distribution with a
sufficiently large standard deviation does the job: for the rest of the paper, we
refer to this type of masking as “statistical masking”. In practice, we choose a
normal distribution with standard deviation σ = 2^40.
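As a toy illustration of statistical masking (a sketch under simplifying assumptions, not the paper's implementation), one can mask a value with Gaussian noise of standard deviation 2^40; note that with ordinary 64-bit doubles only about 12 fractional bits of a unit-scale value survive unmasking, which is precisely why the paper works with (p + 2τ)-bit floats.

```python
# Minimal sketch of statistical masking of a fixed-point value assumed to lie
# in a known, small range; sigma follows the paper, everything else is a demo.
import numpy as np

SIGMA = 2.0**40  # standard deviation of the masking distribution

def mask(x: float) -> tuple[float, float]:
    """Return (masked value, mask); x is assumed to satisfy |x| << SIGMA."""
    lam = np.random.normal(0.0, SIGMA)
    return x + lam, lam

def unmask(masked: float, lam: float) -> float:
    return masked - lam

masked, lam = mask(3.14159)
# With 64-bit doubles, only ~52-40 = 12 fractional bits of the input remain,
# illustrating the need for extended (e.g. 128-bit) floats in the paper.
print(unmask(masked, lam))
```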
On the other hand, by using such masking, we observe that the sizes of the
secret shares increase every time we evaluate the multiplication via Beaver’s
technique (Section 2.2). In Section 3.3, we address this problem by introducing
a technique that allows us to reduce the secret-share sizes by discarding the most
significant bits of each secret share (using the fact that the sum of the secret
shares is still much smaller than their size).
The algorithm we propose depends on two auxiliary parameters: the cutoff, defined
as η = B + τ, so that 2^η is the desired bound (in absolute value) on the reduced
shares, and an auxiliary parameter M = 2^κ larger than the number of players.
The main idea is that the initial shares consist of large components z_1, . . . , z_n
that sum up to the small secret-shared value z. Additionally, the most significant
bits of each share beyond the cutoff position (say MSB(z_i) = ⌊z_i / 2^η⌉) do not
contain any information on the data, and are all safe to reveal. We also know
that the MSB of the sum of the shares (i.e. the MSB of the data) is null, so the sum
of the MSBs of the shares is very small. The share-reduction algorithm simply
computes this sum, and redistributes it evenly among the players. Since the sum
is guaranteed to be small, the computation is done modulo M rather than on
large integers. More precisely, using the cutoff parameter η, each player i (for
i = 1, . . . , n) writes his secret share z_i of z as z_i = u_i + 2^η · v_i, with v_i ∈ Z and
u_i ∈ [−2^(η−1), 2^(η−1)). He then broadcasts v_i mod M, so that every player can
compute the sum w = Σ_i v_i mod M. The individual shares can optionally be
re-randomized using a precomputed share JνK+ with ν = 0 mod M. Since w is
guaranteed to lie between −M/2 and M/2, it can be recovered from its
representation mod M. Thus, each player locally updates his share to u_i + 2^η · w/n;
by construction, the new shares have the same sum as the original ones, but are
bounded by 2^η. This construction is summarized in Algorithm 3 in Appendix B.
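The following self-contained sketch mirrors this share-reduction step with plain Python integers and illustrative parameter values (the paper's Algorithm 3 operates on fixed-point reals and includes the optional re-randomization, omitted here).

```python
# Illustrative sketch of the share-reduction step: each player splits its share
# at the cutoff position eta, the high parts are exchanged modulo M, and their
# (small) sum is redistributed. Parameter values are demo assumptions.
ETA = 64          # cutoff position: reduced shares stay below 2**ETA
M = 2**8          # modulus for the high parts; must exceed the number of players

def reduce_shares(shares: list[int]) -> list[int]:
    n = len(shares)
    lows, highs = [], []
    for z in shares:
        v = round(z / 2**ETA)          # high part, safe to reveal
        u = z - v * 2**ETA             # low part, in [-2**(ETA-1), 2**(ETA-1)]
        lows.append(u)
        highs.append(v % M)
    # Broadcast phase: everyone learns the sum of the high parts modulo M.
    w = sum(highs) % M
    if w >= M // 2:                    # recover the signed value in (-M/2, M/2]
        w -= M
    # Each player keeps its low part plus an equal slice of the recovered sum.
    total = 2**ETA * w
    out = [u + total // n for u in lows]
    out[0] += total % n                # hand the rounding remainder to player 0
    return out

# Usage: three huge shares of the small secret 42 are reduced below 2**ETA.
big = [3 * 2**70 + 10, -(2**70) + 15, -2 * 2**70 + 17]
small = reduce_shares(big)
assert sum(small) == sum(big) == 42
```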
4 Fourier Approximation
Approximating the sigmoid function. We now restrict to the case of the sigmoid
function over the interval [−B/2, B/2] for some B > 0. We can rescale the
variable to approximate g(x) = sigmo(Bx/π) over [−π/2, π/2]. If we extend g by
anti-periodicity (odd-even) to the interval [π/2, 3π/2] with the mirror condition
g(x) = g(π − x), we obtain a continuous 2π-periodic piecewise-C¹ function. By
Dirichlet's global theorem, the Fourier series of g converges uniformly over R, so
for all ε > 0, there exist a degree N and a trigonometric polynomial g_N such that
‖g_N − g‖_∞ ≤ ε. To compute sigmo(t) over a secret-shared t, we first apply the affine
change of variable (which is easy to evaluate in MPC) to get the corresponding
x ∈ [−π/2, π/2], and then we evaluate the trigonometric polynomial g_N(x) using
a Fourier triplet. This method suffices to get 24 bits of precision with a
polynomial of only 10 terms; however, asymptotically, the convergence rate is
only Θ(N^(−2)) because of the discontinuities in the derivative of g. In other words,
approximating g with λ bits of precision requires evaluating a trigonometric
polynomial of degree 2^(λ/2). Luckily, in the special case of the sigmoid function, we
can make this degree polynomial in λ by explicitly constructing a 2π-periodic analytic
function that is exponentially close to the rescaled sigmoid on the whole interval
[−π, π] (not just the half interval). Besides, the geometric decay of the coefficients of
the trigonometric polynomial ensures perfect numerical stability. The following
theorem, whose proof can be found in Appendix D, summarizes this construction.
Theorem 1. Let h_α(x) = 1/(1 + e^(−αx)) − x/(2π) for x ∈ (−π, π). For every ε > 0,
there exists α = O(log(1/ε)) such that h_α is at uniform distance ε/2 from a 2π-
periodic analytic function g. Moreover, there exists N = O(log²(1/ε)) such that the
N-th partial sum of the Fourier series of g is at distance ε/2 from g, and thus at
distance ≤ ε from h_α.
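The numerical sketch below illustrates the idea behind Theorem 1 (it is not the paper's MPC evaluation procedure): the Fourier coefficients of h_α are estimated with an FFT and a low-degree truncation is compared against h_α; the values of α and N are arbitrary demo choices.

```python
# Fourier approximation of h_alpha(x) = sigmoid(alpha*x) - x/(2*pi) on (-pi, pi);
# alpha and N are illustrative, and the FFT is only a convenient way to estimate
# the Fourier coefficients for the demo.
import numpy as np

def h(alpha, x):
    return 1.0 / (1.0 + np.exp(-alpha * x)) - x / (2 * np.pi)

alpha, N = 4.0, 12                         # demo parameters (assumptions)
xs = np.linspace(-np.pi, np.pi, 4096, endpoint=False)
samples = h(alpha, xs)

# Fourier coefficients of the 2*pi-periodic extension, via the FFT.
coeffs = np.fft.rfft(samples) / len(xs)
coeffs = coeffs * (-1.0) ** np.arange(coeffs.size)   # x-origin is -pi, not 0

def eval_trig_poly(x):
    """Evaluate the degree-N truncation of the Fourier series at x."""
    total = np.real(coeffs[0]) * np.ones_like(x)
    for k in range(1, N + 1):
        total += 2 * (np.real(coeffs[k]) * np.cos(k * x)
                      - np.imag(coeffs[k]) * np.sin(k * x))
    return total

err = np.max(np.abs(eval_trig_poly(xs) - samples))
print(f"max error of the degree-{N} approximation: {err:.2e}")
```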
In a classification problem, one is given a data set (also called a training set),
which we represent here by a matrix X ∈ M_{N,k}(R), and a training vector
y ∈ {0, 1}^N. The data set consists of N input vectors of k features each, and the
coordinate y_i ∈ {0, 1} gives the class of the i-th input. At each step of the
gradient descent, the model θ is updated as

θ := θ − α ∇C_{x,y}(θ),
where ∇Cx,y (θ) is the gradient of the cost function and α > 0 is a constant
called the learning rate. Choosing the optimal α depends largely on the quality
of the dataset: if α is too large, the method may diverge, and if α is too small,
a very large number of iterations is needed to reach the minimum. Unfortunately,
tuning this parameter requires either revealing information about the data
or having access to a public fake training set, which is not always feasible in
private MPC computations. This step is often silently ignored in the literature.
Similarly, preprocessing techniques such as feature scaling or orthogonalization
can improve the dataset and make it possible to increase the learning rate
significantly. But again, these techniques cannot easily be implemented when the
input data is shared and when correlation information should remain private.
In this work, we choose to implement the IRLS method [5, §4.3], which does not
require feature scaling, works with a learning rate of 1, and converges in far fewer
iterations, provided that we have enough floating-point precision. In this case,
the model is updated as

θ := θ − H_{x,y}(θ)^(−1) ∇C_{x,y}(θ),

where H_{x,y}(θ) is the Hessian of the cost function.
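For reference, here is a plaintext (non-MPC) sketch of the IRLS iteration for logistic regression, i.e. the kind of update the paper evaluates on secret-shared data; the synthetic dataset and parameter choices are illustrative.

```python
# Plaintext IRLS (Newton) iteration for logistic regression; no secret sharing.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def irls_logreg(X, y, iters=8):
    """Fit theta for P(y=1|x) = sigmoid(x . theta) with IRLS, learning rate 1."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ theta)
        W = p * (1.0 - p)                     # Hessian weights
        H = X.T @ (W[:, None] * X)            # Hessian of the cost
        grad = X.T @ (p - y)                  # gradient of the cost
        theta -= np.linalg.solve(H, grad)     # Newton step, no tuning needed
    return theta

# Tiny synthetic example.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])
true_theta = np.array([-0.5, 1.0, -1.5])
y = (rng.random(200) < sigmoid(X @ true_theta)).astype(float)
print(irls_logreg(X, y))
```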
On the CPU, 128-bit floating-point arithmetic is emulated using GCC's quadmath
library; however, additional speed-ups could be achieved on more recent hardware
that natively supports these operations (e.g. IBM's upcoming POWER9 processor).
In our proof of concept, our main focus was to improve the running time, the
floating point precision, and the communication complexity of the online phase,
so we implemented the offline phase only for the trusted dealer scenario, leaving
the honest-but-curious dealer variant as future work.
instructions on immutable variables (which are read-only once they are assigned).
More importantly, the compiler associates a single additive mask λU with each of
these immutable variables U. This solves two important problems noted in the
previous sections: first, the masking information for huge matrices that are
re-used throughout the algorithm is transmitted only once during the whole
protocol (this optimization already appears in [25]; in our case, it has a
huge impact for the constant input matrix and its precomputed products,
which are re-used in all IRLS iterations). It also mitigates the attack that would
retrieve information by averaging several maskings of the same value, because an
attacker never gets two samples from the same masked distribution. This justifies
the choice of 40 bits of security for the masking.
During the offline phase, the trusted dealer generates one random mask value
for each immutable variable, and secret shares these masks. For all matrix-vector
or matrix-matrix products between any two immutable variables U and V (com-
ing from lines 1, 4, 6, 7 and 8 of Alg.2), the trusted dealer also generates a
specific multiplication triplet using the masks λU of U and λV of V . More pre-
cisely, he generates and distributes additive shares for λU · λV as well as integer
vectors/matrices of the same dimensions as the product for the share-reduction
phase. These integer coefficients are taken modulo 256 for efficiency reasons.
6.2 Results
We implemented all the described algorithms and tested our code for two and
three parties, using cloud instances on both the AWS and Azure platforms,
with Xeon E5-2666 v3 processors. In our setup, each instance communicates
via its public IP address. Furthermore, we use the ZeroMQ library to
handle low-level communication between the players (peer-to-peer, broadcast,
central nodes, etc.).
In the results that are provided in Table 1 in Appendix A, we fixed the
number of IRLS iterations to 8, which is enough to reach perfect convergence
for most datasets, and we experimentally verified that the MPC computation
outputs the same model as the plaintext iterations. We see that for
datasets of 150,000 points, the total running time of the online phase ranges from
1 to 5 minutes. This running time is mostly due to the use of emulated quad-
float arithmetic, and this MPC computation is no more than 20 times slower
than the plaintext logistic regression on the same datasets, if we implement it
using the same 128-bit floats (yet, of course, the native double-precision version
is much faster). More interestingly, we see that the total size of the triplets
and the amount of online communication are small: for instance, a logistic
regression on 150,000 points with 8 features requires only 756MB of triplets per
player, of which only 205MB of data is broadcast per player during the online
phase. This is because a Fourier triplet is much larger than the value that is
masked and exchanged. Consequently, the communication time is insignificant
compared to the whole running time, even with regular WAN bandwidth.
Acknowledgements
We thank Hunter Brooks, Daniel Kressner and Marco Picasso for useful conver-
sations on data-independent iterative optimization algorithms. We are grateful
to Jordan Brandt, Alexandre Duc and Morten Dahl for various useful discussions
regarding multi-party computations and privacy-preserving machine learning.
References
1. M. Abadi, A. Chu, I. Goodfellow, H. Brendan McMahan, I. Mironov, K. Talwar,
and L. Zhang. Deep learning with differential privacy. CoRR, abs/1607.00133,
2016.
2. Y. Aono, T. Hayashi, L. Trieu Phong, and L. Wang. Privacy-preserving logis-
tic regression with distributed data sources via homomorphic encryption. IEICE
Transactions, 99-D(8):2079–2089, 2016.
3. T. Araki, J. Furukawa, Y. Lindell, A. Nof, and K. Ohara. High-throughput semi-
honest secure three-party computation with an honest majority. In Proceedings of
the 2016 ACM SIGSAC Conference on Computer and Communications Security,
Vienna, Austria, October 24-28, 2016, pages 805–817, 2016.
4. D. Beaver. Efficient Multiparty Protocols Using Circuit Randomization. In
CRYPTO ’91, volume 576 of Lecture Notes in Computer Science, pages 420–432.
Springer, 1992.
5. A. Björck. Numerical Methods for Least Squares Problems. SIAM, Philadelphia,
1996.
[Figure 1: cost function (y-axis) vs. iteration number, logscale (x-axis).]
Figure 1 shows the evolution of the cost function during the logistic regression as a
function of the number of iterations, on a test dataset of 150000 samples, with 8 features
and an acceptance rate of 0.5%. In yellow is the standard gradient descent with optimal
learning rate, in red, the gradient descent using the piecewise linear approximation of
the sigmoid function (as in [25]), and in green, our MPC model (based on the IRLS
method). The MPC IRLS method (as well as the plaintext IRLS method) converges
in fewer than 8 iterations, versus 500 iterations for the standard gradient descent. As
expected, the approximated method does not reach the minimal cost.
[Figure 2: score (y-axis) vs. iteration number, logscale (x-axis).]
Figure 2 shows the evolution of the F-score during the same logistic regression as a
function of the number of iterations. The standard gradient descent and our MPC
produce the same model, with a limit F-score of 0.64. However, no positive samples are
detected by the piecewise linear approximation, leading to a null F-score. In all
three cases, however, the accuracy (purple) is nearly 100% from the first iteration.
As it was observed in [7], if one uses the naı̈ve basis to write the solutions, the
Fourier coefficients of the functions gn are unbounded, thus resulting in numerical
instability. It was explained in [21] how to describe the solution in terms of two
families of orthogonal polynomials closely related to the Chebyshev polynomials
of the first and second kind. More importantly, it is proved that the solution
converges to f exponentially rather than super-algebraically and it is shown
how to numerically estimate the solution gn (x) in terms of these bases.
We will now summarize the method of [21]. Let

C_n = {1/√2} ∪ {cos(kx) : k = 1, . . . , n},

and let C_n be the R-vector space spanned by these functions (the subspace of
even functions). Similarly, let

S_n = {sin(kx) : k = 1, . . . , n},

and let S_n be the R-span of S_n (the space of odd functions). Note that C_n ∪ S_n
is a basis for G_n.
where

a_k = (2/π) ∫_{−π/2}^{π/2} f(x) T_k^h(cos x) dx,

and

b_k = (2/π) ∫_{−π/2}^{π/2} f(x) U_k^h(cos x) sin x dx.
D Proof of Theorem 1
We now prove Theorem 1, with the following methodology. We first bound the
successive derivatives of the sigmoid function using a differential equation. Then,
since the first derivative of the sigmoid decays exponentially fast, we can sum
all its values for any x modulo 2π, and construct a C ∞ periodic function, which
approximates tightly the original function over [−π, π]. Finally, the bounds on
the successive derivatives directly prove the geometric decrease of the Fourier
coefficients.
Proof. First, consider the sigmoid function σ(x) = 1/(1 + e^(−x)) over R. It
satisfies the differential equation σ′ = σ − σ². Differentiating n times, we have

σ^(n+1) = σ^(n) − Σ_{k=0}^{n} C(n,k) σ^(k) σ^(n−k) = σ^(n)(1 − σ) − Σ_{k=1}^{n} C(n,k) σ^(k) σ^(n−k),

where C(n,k) denotes the binomial coefficient. Dividing by (n + 1)!, this yields

|σ^(n+1)| / (n+1)! ≤ (1/(n+1)) · ( |σ^(n)|/n! + Σ_{k=1}^{n} (|σ^(k)|/k!) · (|σ^(n−k)|/(n−k)!) ).

From there, we deduce by induction that for all n ≥ 0 and all x ∈ R,
|σ^(n)(x)|/n! ≤ 1, and that this bound decreases with n, so for all n ≥ 1,
[Figure: the functions oe(α, x) for α = 1, 3, 5 (top panel) and h(α, x) for α = 1, 1.5, 2 (bottom panel), plotted over x ∈ [−6, 6].]
As α grows, the discontinuity in the rescaled sigmoid function h_α(x) = σ(αx) − x/(2π) vanishes, and
it gets exponentially close to an analytic periodic function, whose Fourier coefficients
decrease geometrically fast. This method is numerically stable, and can evaluate the
sigmoid with arbitrary precision in polynomial time.
The figures presented in this section represent the communication channels be-
tween the players and the dealer in both the trusted dealer and the honest but
curious models. Two types of communication channels are used: private channels,
which correspond in practice to SSL channels (generally < 20MB/s), and public
channels, which correspond in practice to TCP connections (generally from
100MB/s to 1GB/s). In the figures, private channels are represented with dashed
lines, while public channels are represented with plain lines.
Figure 5 illustrates the connections during the offline phase of the MPC
protocols. In the TD model, the dealer is the only one generating all the pre-
computed data. He uses private channels to send to each player his share of the
triplets (one-way arrows). In the HBC model, the players collaborate for the
generation of the triplets. To do that, they need an additional private broadcast
channel between them, that is not accessible to the dealer.
Figure 6 represents the communication channels between players during the
online phase. The online phase is the same in both the TD and the HBC models
and the dealer is not present.
In this section we give two detailed algorithms in the honest but curious model,
already described in Section 5. The first algorithm (Algorithm 4) describes the
[Fig. 5 diagrams: the dealer connected to players P1, P2, P3 in the trusted dealer model (left) and in the honest-but-curious model (right). Fig. 6 diagram: players P1, P2, P3 connected by a broadcast channel.]
Fig. 6. Communication channels in the online phase - The figure represents the
communication channels (of the same type in both the honest-but-curious and the
trusted dealer models) used during the online phase. The players send and receive masked values
via a public broadcast channel (public channels are denoted with plain lines). Their
number, limited to 3 in the example, can easily be extended to a generic number n of
players.
power in Algorithm 5). The dealer generates new additive shares for the result
and sends these values back to each player via the private channel. This way,
the players do not learn each other's shares. Finally, the players, who know the
common mask, can independently unmask their secret shares and obtain their
final share of the triplet, which therefore remains unknown to the dealer.
Algorithm 5 Honest but curious triplets generation for the power function
Output: Shares JλK+ and Jλ^(−α)K+.
1: Each player Pi generates λi and ai (drawn from the corresponding distributions).
2: Each player Pi broadcasts ai to all the other players.
3: Each player computes a = a1 + · · · + an.
4: Each player Pi generates zi such that Σ_{i=1}^{n} zi = 0.
5: Each player Pi sends zi + a·λi to the dealer.
6: The dealer computes a·λ = Σ_{i=1}^{n} (zi + a·λi) and w = (a·λ)^(−α).
7: The dealer creates JwK+ and sends wi to player Pi, for i = 1, . . . , n.
8: Each player Pi multiplies wi by a^α to obtain (λ^(−α))i.
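Below is a single-process Python simulation of Algorithm 5 for scalar values (the distributions, the exponent α and the three-player setting are illustrative assumptions); it checks that the unmasked shares indeed sum to λ^(−α), while the dealer only ever sees the masked product a·λ.

```python
# Toy simulation of the honest-but-curious triplet generation for the power
# function (Algorithm 5), scalar case with illustrative parameters.
import random

N_PLAYERS = 3
ALPHA = 1          # we generate shares of lambda**(-ALPHA)

# Steps 1-3: each player draws lambda_i and a_i, broadcasts a_i; a is public.
lam_shares = [random.uniform(0.5, 2.0) for _ in range(N_PLAYERS)]
a_shares = [random.uniform(0.5, 2.0) for _ in range(N_PLAYERS)]
a = sum(a_shares)

# Step 4: random z_i summing to zero (hides the individual lambda_i from the dealer).
z = [random.uniform(-1, 1) for _ in range(N_PLAYERS - 1)]
z.append(-sum(z))

# Steps 5-7: the dealer only learns a*lambda, computes w = (a*lambda)**(-ALPHA)
# and returns fresh additive shares w_i of w.
a_lam = sum(zi + a * li for zi, li in zip(z, lam_shares))
w = a_lam ** (-ALPHA)
w_shares = [random.uniform(-1, 1) for _ in range(N_PLAYERS - 1)]
w_shares.append(w - sum(w_shares))

# Step 8: each player unmasks its share with the public factor a**ALPHA.
out_shares = [wi * a ** ALPHA for wi in w_shares]

lam = sum(lam_shares)
assert abs(sum(out_shares) - lam ** (-ALPHA)) < 1e-9
```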