Learning a Parallelepiped:
Cryptanalysis of GGH and NTRU Signatures
1 Introduction
Inspired by the seminal work of Ajtai [1], Goldreich, Goldwasser and Halevi
(GGH) proposed at Crypto ’97 [10] a lattice analogue of the coding-theory-
based public-key cryptosystem of McEliece [22]. The security of GGH is related to the hardness of approximating the closest vector problem (CVP) in a lattice.
∗ Part of this work is supported by the Commission of the European Communities through the IST program under contract IST-2002-507932 ECRYPT, and by the French government through the X-Crypt RNRT project.
∗∗ Supported by the Binational Science Foundation, by the Israel Science Foundation, by the European Commission under the Integrated Project QAP funded by the IST directorate as Contract Number 015848, and by a European Research Council (ERC) Starting Grant.
Our attack models the information leaked by signatures as a hidden parallelepiped problem (HPP), or a natural variant thereof (see Fig. 1). We transform the HPP into a multivariate optimization
problem based on the fourth moment (also known as kurtosis) of one-dimensional
projections. This problem can be solved by a gradient descent. Our approach is
very effective in practice: we present the first successful key-recovery experiments
on NTRUSign-251 without perturbation, as proposed in half of the parameter
choices in the NTRU standards [4] being considered by IEEE P1363.1 [19]; ex-
perimentally, 400 signatures are enough to disclose the NTRUSign-251 secret
key. We have also been able to recover the secret key in the signature analogue
of all five GGH encryption challenges; the GGH case requires significantly more
signatures because NTRU lattices have special properties which can be exploited
by the attack. When the number of signatures is sufficiently high, the running
time of the attack is only a fraction of the time required to generate all the
signatures.
From the theoretical side, we are able to show that under a natural assump-
tion on the distribution of signatures, an attacker can recover a good approxi-
mation of the secret key of NTRUSign and the GGH challenges in polynomial
time, given a polynomial number of signatures of random messages. Since the
secret key in both NTRUSign and the GGH challenges has very small entries,
this approximation leads to the exact secret key by simple rounding.
Related Work. Interestingly, it turns out that the HPP (as well as related problems) has already been studied by people dealing with what is known as Independent Component Analysis (ICA) (see, e.g., the book by Hyvärinen et al. [17]). ICA is a statistical method whose goal is to find directions of independent components, which in our case translates to the n vectors that define
the parallelepiped. It has many applications in statistics, signal processing, and
neural network research. To the best of our knowledge, this is the first time ICA
is used in cryptanalysis.
There are several known algorithms for ICA, and most are based on a gra-
dient method such as the one we use in our algorithm. Our algorithm is closest
in nature to the FastICA algorithm proposed in [18], whose authors also considered the
fourth moment as a goal function. We are not aware of any rigorous analysis
of these algorithms; the proofs we have seen often ignore the effect of errors in
approximations. Finally, we remark that the ICA literature offers other, more
general goal functions that are supposed to offer better robustness against noise, etc. We have not tried to experiment with these other functions, since the fourth
moment seems sufficient for our purposes.
Another closely related result is that of Frieze et al. [5], who proposed a
polynomial-time algorithm to solve the HPP (and generalizations thereof). Tech-
nically, their algorithm is slightly different from those present in the ICA litera-
ture as it involves the Hessian, in addition to the usual gradient method. They
also claim to have a fully rigorous analysis of their algorithm, taking into ac-
count the effect of errors in approximations. Unfortunately, most of the analysis
is missing from the preliminary version, and to the best of our knowledge, a full
version of the paper has never appeared.
Open Problem. Our attack does not work against the perturbation techniques
proposed in [12, 4, 14] as efficient countermeasures: these modify the signature
generation process in such a way that the hidden parallelepiped is replaced by
a more complicated set. For instance, the second half of parameter choices in
NTRU standards [4] involves exactly a single perturbation. In this case, the at-
tacker has to solve an extension of the hidden parallelepiped problem in which the
parallelepiped is replaced by the Minkowski sum of two hidden parallelepipeds:
the lattice spanned by one of the parallelepipeds is public, but not the other
one. The existence of efficient attacks against perturbation techniques is an open
problem. The drawback of perturbations is that they slow down signature generation and increase both the size of the secret key and the distance between the signature and the message.
Other Schemes. We now mention some other lattice-based signature schemes,
all of which come with an associated security proof, showing that any (asymp-
totic) attack on the scheme must necessarily lead to an efficient algorithm for a
certain lattice problem that is believed to be hard. Moreover, their security is
established based on worst-case hardness, i.e., any asymptotic attack (even with
a small probability of success) implies an efficient solution to any instance of the
underlying lattice problem. For more details on provably secure lattice-based
cryptography and on the signature schemes mentioned below, see, e.g., [32, 24,
26].
From a theoretical point of view, signature schemes can be constructed from
one-way functions in a black-box way without any further assumptions [28].
Therefore, one can obtain signature schemes that are provably secure based on
the worst-case hardness of lattice problems by using known constructions of
lattice-based one-way functions, such as those in Ajtai’s seminal work [1] and
followup work. These black-box constructions, however, incur a large overhead
and are impractical.
The first construction of efficient lattice-based signature schemes with a sup-
porting proof of security (in the random oracle model) was suggested by Miccian-
cio and Vadhan [27]. More efficient schemes were recently proposed by Gentry,
Peikert and Vaikuntanathan [7], and by Lyubashevsky and Micciancio [21].
The former scheme can be seen as a theoretically justified variant of the
GGH and NTRUSign signature schemes, with worst-case security guarantees
based on general lattices in the random oracle model. Compared to the GGH
scheme, their construction differs in two main aspects. First, it is based on lattices
chosen from a distribution that enjoys a worst-case connection (the lattices in
GGH and NTRU are believed to be hard, but not known to have a worst-
case connection). A second and crucial difference is that their signing algorithm
is designed so that it does not reveal any information about the secret basis.
This is achieved by replacing Babai’s round-off procedure with a “Gaussian
sampling procedure”, originally due to Klein [20], whose distinctive feature is
that its output distribution, for the range of parameters considered in [7], is
essentially independent of the secret basis used. The effect of this on our attack
is that instead of observing points chosen uniformly from the parallelepiped
generated by the secret basis, the attack observes points chosen from a spherically
symmetric Gaussian distribution, and therefore learns nothing about the secret
basis.
The scheme of Lyubashevsky and Micciancio [21] has worst-case security
guarantees based on a type of lattices known as ideal lattices, and it is the
most (asymptotically) efficient construction known to date, yielding signature
generation and verification algorithms that run in almost linear time. Moreover,
the security of [21] does not rely on the random oracle model.
Despite these significant advances, no concrete choice of parameters has been
proposed yet, and it is probably fair to say that provably-secure lattice-based
signature schemes are not yet at the level of efficiency and maturity that would
allow them to be used extensively in real-life applications.
2.1 Lattices
Let ‖·‖ and ⟨·, ·⟩ be the Euclidean norm and inner product of R^n. We refer to the survey [31] for a bibliography on lattices. In this paper, by the term lattice, we mean a full-rank discrete subgroup of R^n. The simplest lattice is Z^n. It turns out that in any lattice L, not just Z^n, there must exist linearly independent vectors b_1, . . . , b_n ∈ L such that:

$$L = \left\{ \sum_{i=1}^{n} n_i b_i \;\middle|\; n_i \in \mathbb{Z} \right\}.$$
The GGH scheme [10] works with a lattice L in Zn . The secret key is a non-
singular matrix R ∈ Mn (Z), with very short row vectors (their entries are
polynomial in n). In the GGH challenges [9], R was chosen as a perturbation
of a multiple of the identity matrix, so that its vectors were almost orthogonal: more precisely, R = kI_n + E, where k = 4⌊√(n+1)⌉ + 1 (with ⌊·⌉ denoting rounding to the nearest integer) and each entry of the n × n matrix E is chosen uniformly at random in {−4, . . . , +3}. Micciancio [23] noticed
that this distribution has the weakness that it discloses the rough directions of
the secret vectors. The lattice L is the lattice in Zn spanned by the rows of R:
the knowledge of R enables the signer to approximate CVP rather well in L.
The basis R is then transformed to a non-reduced basis B, which will be public.
In the original scheme [10], B is obtained by multiplying R by sufficiently many small unimodular matrices. Micciancio [23] suggested using the Hermite normal
form (HNF) of L instead. As shown in [23], the HNF gives an attacker the least
advantage (in a certain precise sense) and it is therefore a good choice for the
public basis. The messages are hashed onto a “large enough” subset of Zn , for
instance a large hypercube. Let m ∈ Zn be the hash of the message to be signed.
The signer applies Babai’s round-off CVP approximation algorithm [3] to get a
lattice vector close to m:
$$s = \lfloor m R^{-1} \rceil R,$$

so that s − m ∈ P_{1/2}(R) = { xR : x ∈ [−1/2, 1/2]^n }. Of course, any other CVP
approximation algorithm could alternatively be applied, for instance Babai’s
nearest plane algorithm [3]. To verify the signature s of m, one would first check
that s ∈ L using the public basis B, and compute the distance ‖s − m‖ to check
that it is sufficiently small.
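For concreteness, here is a minimal numpy sketch of this signing and verification process. The function names and the distance bound `dist_bound` are our own illustrative choices, not part of the scheme's specification:

```python
# Sketch of GGH signing via Babai's round-off, and of verification.
import numpy as np

def sign(m, R):
    """Babai round-off: s = round(m R^{-1}) R, a lattice point such that
    s - m lies in the parallelepiped P_{1/2}(R)."""
    return np.rint(m @ np.linalg.inv(R)) @ R

def verify(s, m, B, dist_bound):
    """Check that s is a lattice point (integer coordinates with respect to
    the public basis B) and that it is close enough to the hashed message m."""
    coords = s @ np.linalg.inv(B)
    is_lattice_point = np.allclose(coords, np.rint(coords))
    return is_lattice_point and np.linalg.norm(s - m) <= dist_bound
```

Each such signature reveals a point s − m uniformly distributed (heuristically) in P_{1/2}(R), which is exactly the leakage the attack exploits.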
2.3 NTRUSign
NTRUSign [13] is a special instantiation of GGH with the compact lattices from
the NTRU encryption scheme [15], which we briefly recall: we refer to [13, 4] for
more details. In the NTRU standards [4] being considered by IEEE P1363.1 [19],
one selects N = 251, q = 128. Let R be the ring Z[X]/(X^N − 1), whose multiplication is denoted by ∗. Using resultants, one computes a quadruplet (f, g, F, G) ∈ R^4 such that f ∗ G − g ∗ F = q in R and f is invertible mod q, where f and g have 0–1 coefficients (with a prescribed number of 1's), while F and G have slightly larger coefficients, yet much smaller than q. This quadruplet is
the NTRU secret key. Then the secret basis is the following (2N) × (2N) matrix:

$$R = \begin{pmatrix}
f_0 & f_1 & \cdots & f_{N-1} & g_0 & g_1 & \cdots & g_{N-1} \\
f_{N-1} & f_0 & \cdots & f_{N-2} & g_{N-1} & g_0 & \cdots & g_{N-2} \\
\vdots & \ddots & \ddots & \vdots & \vdots & \ddots & \ddots & \vdots \\
f_1 & \cdots & f_{N-1} & f_0 & g_1 & \cdots & g_{N-1} & g_0 \\
F_0 & F_1 & \cdots & F_{N-1} & G_0 & G_1 & \cdots & G_{N-1} \\
F_{N-1} & F_0 & \cdots & F_{N-2} & G_{N-1} & G_0 & \cdots & G_{N-2} \\
\vdots & \ddots & \ddots & \vdots & \vdots & \ddots & \ddots & \vdots \\
F_1 & \cdots & F_{N-1} & F_0 & G_1 & \cdots & G_{N-1} & G_0
\end{pmatrix}.$$

The public basis is the following (2N) × (2N) matrix, where h = f^{-1} ∗ g mod q:

$$B = \begin{pmatrix}
1 & 0 & \cdots & 0 & h_0 & h_1 & \cdots & h_{N-1} \\
0 & 1 & \cdots & 0 & h_{N-1} & h_0 & \cdots & h_{N-2} \\
\vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \ddots & \vdots \\
0 & 0 & \cdots & 1 & h_1 & \cdots & h_{N-1} & h_0 \\
0 & 0 & \cdots & 0 & q & 0 & \cdots & 0 \\
0 & 0 & \cdots & 0 & 0 & q & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 0 & 0 & 0 & \cdots & q
\end{pmatrix}.$$
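As an illustration, the two bases could be assembled as follows (a numpy sketch; the helper names are ours, and the polynomials are given as length-N coefficient arrays):

```python
# Sketch: building the block-circulant NTRUSign bases.
import numpy as np

def circulant(a):
    """N x N matrix whose i-th row is a cyclically shifted right by i, i.e.
    the matrix of multiplication by a in Z[X]/(X^N - 1)."""
    N = len(a)
    return np.array([np.roll(a, i) for i in range(N)])

def secret_basis(f, g, F, G):
    return np.block([[circulant(f), circulant(g)],
                     [circulant(F), circulant(G)]])

def public_basis(h, q):
    N = len(h)
    return np.block([[np.eye(N, dtype=int), circulant(h)],
                     [np.zeros((N, N), dtype=int), q * np.eye(N, dtype=int)]])
```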
The symmetries of NTRU lattices lead to attacks that require far fewer signatures. Namely, Whyte noticed that
in the particular case of NTRUSign, the hidden parallelepiped P(R) has the
following property: for each x ∈ P(R) the block-rotation σ(x) also belongs to
P(R), where σ is the function that maps any (x_1, . . . , x_N, y_1, . . . , y_N) ∈ R^{2N} to (x_N, x_1, . . . , x_{N−1}, y_N, y_1, . . . , y_{N−1}). This is because σ is a linear operation
that permutes the rows of R and hence leaves P(R) invariant. As a result, by
using the N possible rotations, each signature actually gives rise to N samples
in the parallelepiped P(R) (as opposed to just one in the general case of GGH).
For instance, 400 NTRUSign-251 signatures give rise to 100,400 samples in the
NTRU parallelepiped. Notice that these samples are no longer independent and
hence Assumption 1 does not hold. Nevertheless, as we will describe later, this
technique leads in practice to attacks using a significantly smaller number of
signatures.
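A sketch of this expansion (our own helper, with numpy; each half of the 2N-dimensional sample is rotated simultaneously):

```python
# Sketch: expanding one NTRUSign transcript sample s - m into N samples
# using the block rotation sigma described above.
import numpy as np

def rotations(sample, N):
    """Given one 2N-dimensional parallelepiped sample, return the N samples
    obtained by applying sigma repeatedly."""
    x, y = sample[:N], sample[N:]
    return [np.concatenate([np.roll(x, i), np.roll(y, i)]) for i in range(N)]
```

With N = 251, the 400 signatures mentioned above indeed expand to 400 · 251 = 100,400 samples.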
4 Learning a Parallelepiped
In this section, we describe our solution to the Hidden Parallelepiped Problem
(HPP), based on the following steps. First, we approximate the covariance matrix
of the given distribution. This covariance matrix is essentially V^t V (where V defines the given parallelepiped). We then exploit this approximation in order
to transform our hidden parallelepiped P(V ) into a unit hypercube: in other
words, we reduce the HPP to the case where the hidden parallelepiped is a
hypercube. Finally, we show how hypercubic instances of the HPP are related
to a multivariate optimization problem based on the fourth moment, which we
solve by a gradient descent. The algorithm is summarized in Algorithms 1 and 2,
and is described in more detail in the following.
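For concreteness, the steps below are accompanied by minimal Python/numpy sketches (with function names of our own choosing). They all use the sample model of the HPP: a sample from U(P(V)) is u = xV with x uniform over [−1, 1]^n.

```python
# Minimal sketch of the HPP sample model: u = xV with x uniform over [-1,1]^n.
import numpy as np

def hpp_samples(V, num, rng=np.random.default_rng(0)):
    """Return `num` samples, as rows, uniformly distributed over P(V)."""
    n = V.shape[0]
    X = rng.uniform(-1.0, 1.0, size=(num, n))  # x uniform over [-1,1]^n
    return X @ V                                # u = xV (rows of V are the v_i)
```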
Lemma 1. Let v be a random vector uniformly distributed over P(V). Then

$$\mathrm{Exp}[v^t v] = V^t V / 3.$$

Proof. We can write v = xV where x has uniform distribution over [−1, 1]^n. Hence,

$$v^t v = V^t x^t x V.$$

An elementary computation shows that Exp[x^t x] = I_n/3 where I_n is the n × n identity matrix, and the lemma follows. ⊓⊔
Hence, by taking the average of v^t v over all our samples v from U(P(V)) and multiplying the result by 3, we can obtain an approximation of V^t V.
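In code, this covariance step could look as follows (a sketch under the same sample model as above; `approximate_gram` is our own name):

```python
# Sketch of the covariance step (Lemma 1): since Exp[v^t v] = V^t V / 3,
# averaging the outer products v^t v over the samples and multiplying by 3
# yields an approximation G of V^t V.
import numpy as np

def approximate_gram(samples):
    """samples: (num, n) array whose rows are drawn from U(P(V))."""
    return 3.0 * (samples.T @ samples) / samples.shape[0]
```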
Lemma 2. Let G = V^t V and let L be a matrix such that G^{-1} = LL^t (e.g., the Cholesky factor of G^{-1}). Then C = VL ∈ O_n(R); moreover, if v is uniformly distributed over P(V), then vL is uniformly distributed over the hypercube P(C).

Proof. For the first claim, note that

$$CC^t = V LL^t V^t = V V^{-1} V^{-t} V^t = I_n.$$

For the second claim, let v be uniformly distributed over P(V). Then we can write v = xV where x is uniformly distributed over [−1, 1]^n. It follows that vL = xV L = xC has the uniform distribution over P(C). ⊓⊔
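A sketch of the corresponding morphing step (our own helper, using numpy's Cholesky routine on the approximated Gram matrix):

```python
# Sketch of the morphing step (Lemma 2): compute L with G^{-1} = L L^t and
# map each sample v to vL, turning P(V)-samples into (approximate) samples
# from the hypercube P(C), where C = VL.
import numpy as np

def to_hypercube(samples):
    """samples: (num, n) array of rows from U(P(V))."""
    G = 3.0 * (samples.T @ samples) / samples.shape[0]  # approximation of V^t V
    L = np.linalg.cholesky(np.linalg.inv(G))            # lower-triangular, G^{-1} = L L^t
    return samples @ L, L                               # hypercube samples and L
```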
Lemma 2 says that by applying the transformation L, we can map our samples from the parallelepiped P(V) into samples from the hypercube P(C).¹

¹ Instead of the Cholesky factor, one can take any matrix L such that G^{-1} = LL^t. We work with the Cholesky factorization as this turns out to be more convenient in our experiments.

For any integer k ≥ 1 and any w ∈ R^n, define the k-th moment of P(V) in the direction w as

$$\mathrm{mom}_{V,k}(w) = \mathrm{Exp}\left[\langle u, w \rangle^k\right],$$

where u is uniformly distributed over the parallelepiped P(V).² Clearly, mom_{V,k}(w)
can be approximated by using the given samples from U (P(V )). Since all the
odd moments are zero, we are interested in the first even moments, namely the
second and fourth moments. A straightforward calculation shows that for any
w ∈ R^n, they are given by

$$\mathrm{mom}_{V,2}(w) = \frac{1}{3} \sum_{i=1}^{n} \langle v_i, w \rangle^2 = \frac{1}{3}\, w V^t V w^t,$$

$$\mathrm{mom}_{V,4}(w) = \frac{1}{5} \sum_{i=1}^{n} \langle v_i, w \rangle^4 + \frac{1}{3} \sum_{i \neq j} \langle v_i, w \rangle^2 \langle v_j, w \rangle^2.$$
² This should not be confused with an unrelated notion of moment considered in [13, 14, 8].
Note that the second moment is given by the covariance matrix mentioned in
Section 4.1. When V ∈ O_n(R) (i.e., the vectors v_i are orthonormal), the second moment becomes ‖w‖²/3, while the fourth moment becomes

$$\mathrm{mom}_{V,4}(w) = \frac{1}{3}\|w\|^4 - \frac{2}{15} \sum_{i=1}^{n} \langle v_i, w \rangle^4.$$
For w on the unit sphere, the second moment is constantly 1/3, and

$$\mathrm{mom}_{V,4}(w) = \frac{1}{3} - \frac{2}{15} \sum_{i=1}^{n} \langle v_i, w \rangle^4,$$

$$\nabla \mathrm{mom}_{V,4}(w) = \frac{4}{3}\, w - \frac{8}{15} \sum_{i=1}^{n} \langle v_i, w \rangle^3 v_i. \tag{1}$$
See Figure 3.
Fig. 3. The fourth moment for n = 2. On the left: the dotted line shows the restriction to the unit circle. On the right: a polar plot restricted to the unit circle.
Lemma 3. Let V = [v_1, . . . , v_n] ∈ O_n(R). Then the global minimum of mom_{V,4}(w) over the unit sphere of R^n is 1/5, and this minimum is attained at ±v_1, . . . , ±v_n. There are no other local minima.
Proof. The method of Lagrange multipliers shows that for w to be an extremum
point of momV,4 on the unit sphere, it must be proportional to ∇momV,4 (w).
By writing w = Σ_{i=1}^{n} ⟨v_i, w⟩ v_i and using (1), we see that there must exist some α such that ⟨v_i, w⟩³ = α⟨v_i, w⟩ for i = 1, . . . , n. In other words, each ⟨v_i, w⟩ is either zero or ±√α. It is easy to check that among all such points, only ±v_1, . . . , ±v_n form local minima. ⊓⊔
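This is easy to confirm numerically. The following self-contained check (our own sanity test, not one of the paper's experiments) estimates mom_{V,4} from samples for a random orthonormal V and evaluates it at w = v_1, where it should be close to 1/3 − 2/15 = 1/5:

```python
# Numerical check of Lemma 3: for orthonormal V, mom_{V,4} on the unit
# sphere is minimized at +-v_i, where its value is 1/5.
import numpy as np

rng = np.random.default_rng(1)
n = 8
V, _ = np.linalg.qr(rng.standard_normal((n, n)))    # random orthogonal matrix
U = rng.uniform(-1.0, 1.0, size=(200_000, n)) @ V   # samples from U(P(V))

def mom4(w):
    return np.mean((U @ w) ** 4)

w_rand = rng.standard_normal(n)
w_rand /= np.linalg.norm(w_rand)
print(mom4(V[0]))    # at w = v_1: close to 1/5 = 0.2
print(mom4(w_rand))  # at a random direction: noticeably larger
```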
In other words, the hidden hypercube problem can be reduced to a minimiza-
tion problem of the fourth moment over the unit sphere. A classical technique
to solve such minimization problems is the gradient descent described in Algo-
rithm 2. The gradient descent typically depends on a parameter δ, which has
to be carefully chosen. Since we want to minimize the function here, we go in
the opposite direction of the gradient. To approximate the gradient in Step 2 of
Algorithm 2, we notice that

$$\nabla \mathrm{mom}_{V,4}(w) = \mathrm{Exp}\left[\nabla \langle u, w \rangle^4\right] = 4\,\mathrm{Exp}\left[\langle u, w \rangle^3 u\right].$$

This allows us to approximate the gradient ∇mom_{V,4}(w) using averages over samples, just as for the fourth moment itself.
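The descent itself can then be sketched as follows: a simplified rendition of Algorithm 2 (our own code, with a fixed iteration budget and a naive stopping rule rather than the exact termination condition), assuming hypercube samples as rows of U:

```python
# Simplified sketch of Algorithm 2: gradient descent of the fourth moment
# over the unit sphere, with the gradient approximated by the sample
# average 4 * avg(<u,w>^3 u).
import numpy as np

def descent(U, delta=0.75, max_iters=1000, rng=np.random.default_rng()):
    """U: (num, n) array of hypercube samples (rows). Returns an
    approximation of +-v_i in the hypercube case."""
    n = U.shape[1]
    w = rng.standard_normal(n)
    w /= np.linalg.norm(w)                # random start on the unit sphere
    for _ in range(max_iters):
        grad = 4.0 * np.mean(((U @ w) ** 3)[:, None] * U, axis=0)
        w_new = w - delta * grad          # move against the gradient
        w_new /= np.linalg.norm(w_new)    # project back onto the sphere
        if np.linalg.norm(w_new - w) < 1e-12:  # no further progress
            break
        w = w_new
    return w
```

Per the morphing step above, the full attack then multiplies the returned vector by L^{-1} to undo the transformation of Lemma 2.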
5 Experimental Results
5.1 NTRUSign
Fig. 4. Average number of random descents needed for a successful descent (y-axis: 0–350), as a function of the number of signatures (x-axis: 100,000–300,000).
We did not notice any improvement using Babai's nearest plane algorithm [3] (with a BKZ-20 reduced basis [33] computed from the public basis) as the CVP approximation. The curve shows the average number of
random descents needed for a successful descent as a function of the number of
signatures.
Typically, a single random descent does not take much time: for instance, a descent for 150,000 signatures takes roughly ten minutes. When successful,
a descent may take as little as a few seconds. The minimal number of signatures
to make the attack successful in our experiments was 90,000, in which case the
required number of random descents was about 400. With 80,000 signatures, we
tried 5,000 descents without any success. The curve given in Fig. 4 may vary
a little bit, depending on the secret basis: for instance, for the basis used in
the experiments of Fig. 4, the average number of random descents was 15 with
140,000 signatures, but it was 23 for another basis generated with the same
NTRU parameters. It seems that the exact geometry of the secret basis has an
influence, as will be seen in the analysis of Section 6.
This is indeed the case in practice (see Table 1): as few as 400 signatures are enough
in practice to recover the secret key, though the corresponding 100,400 paral-
lelepiped samples are not independent. This means that the previous number of
90,000 signatures required by the attack can be roughly divided by N = 251.
Hence, NTRUSign without perturbation should be considered totally insecure.
Fig. 5. Average number of GGH signatures required so that ten random descents
coupled with Babai’s nearest plane algorithm disclose with high probability a secret
vector, depending on the dimension of the GGH challenge.
The numbers in Fig. 5 should not be interpreted as the minimal number of signatures required for the success of the attack: they only give an upper bound on that number. Indeed, there are
several ways to decrease the number of signatures:
– One can run many more than ten random descents.
– One can take advantage of the structure of the GGH challenges: when starting a descent, rather than picking a random point on the unit sphere, we may exploit the fact that we know the rough directions of the secret vectors.
– One can use better CVP approximation algorithms, or use better reduction
algorithms in conjunction with Babai’s nearest plane algorithm.
6 Theoretical Analysis
Our goal in this section is to give a rigorous theoretical justification to the success
of the attack. Namely, we will show that given a large enough polynomial number
of samples, Algorithm 1 succeeds in finding a good approximation to a row of
V with some constant probability. For the sake of clarity and simplicity, we will not
make any attempt to optimize this polynomial bound on the number of samples.
We will also assume we can perform operations on real numbers; modifying
the analysis to work with finite precision numbers should be straightforward.
Let us remark that it is possible that a rigorous analysis already exists in the
ICA literature, although we were unable to find any (an analysis under some
simplifying assumptions can be found in [18]). Also, Frieze et al. [5] sketch a
rigorous analysis of a similar algorithm.
In order to approximate the covariance matrix, the fourth moment, and its
gradient, our attack computes averages over samples. Because the samples are
independent and identically distributed, we can use known bounds on large de-
viations such as the Chernoff bound (see, e.g., [2]) to obtain that with extremely
high probability the approximations are very close to the true values. In our
analysis below we omit the explicit calculations, as these are relatively standard.
Theorem 3. For any c₀ > 0 there exists a c₁ > 0 such that given n^{c₁} samples uniformly distributed over some unit hypercube P(V), V = [v_1, . . . , v_n] ∈ O_n(R), Algorithm 2 with δ = 3/4 and r = O(log log n) descent steps outputs with constant probability a vector that is within ℓ₂ distance n^{−c₀} of ±v_i for some i.
Proof. We first analyze the behavior of Algorithm 2 under the assumption that all
gradients are computed exactly, without any error. We write any vector w ∈ R^n as w = Σ_{i=1}^{n} w_i v_i. Then, using (1), we see that for w on the unit sphere,

$$\nabla \mathrm{mom}_{V,4}(w) = \frac{4}{3}\, w - \frac{8}{15} \sum_{i=1}^{n} w_i^3 v_i.$$
The vector is then normalized in Step 4. So we see that each step of the gradient descent takes a vector (w_1, . . . , w_n) to the vector α · (w_1³, . . . , w_n³) for some normalization factor α (where both vectors are written in the v_i basis). Hence, after r iterations, a vector (w_1, . . . , w_n) is transformed to the vector

$$\alpha \cdot \left(w_1^{3^r}, \ldots, w_n^{3^r}\right)$$

for some normalization factor α.
Recall now that the original vector (w1 , . . . , wn ) is chosen uniformly from
the unit sphere. It can be shown that with some constant probability, one of its
coordinates is greater in absolute value than all other coordinates by a factor of
at least 1 + Ω(1/ log n) (first prove this for a vector distributed according to the
standard multivariate Gaussian distribution, and then note that by normalizing
we obtain a uniform vector from the unit sphere). For such a vector, after only
r = O(log log n) iterations, this gap is amplified to more than, say, n^{log n}, which means that we have one coordinate very close to ±1 and all others at most n^{−log n} in absolute value. This establishes that if all gradients are known exactly,
Algorithm 2 succeeds with some constant probability.
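The doubly-exponential amplification is easy to see numerically; the toy computation below (our own illustration, not part of the proof) starts from a gap of about 4%:

```python
# Toy illustration: repeated coordinate-wise cubing with renormalization
# amplifies an initial small gap doubly exponentially, so O(log log n)
# iterations single out one coordinate.
import numpy as np

w = np.array([0.52, 0.50, 0.49, 0.49])  # leading coordinate ahead by ~4%
w /= np.linalg.norm(w)
for r in range(1, 6):
    w = w ** 3
    w /= np.linalg.norm(w)
    print(r, np.round(w, 6))
# within five iterations, w is essentially the first standard basis vector
```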
To complete the analysis of Algorithm 2, we now argue that it succeeds
with good probability even in the presence of noise in the approximation of
the gradients. First, it can be shown that for any c > 0, given a large enough
polynomial number of samples, with very high probability all our gradient ap-
proximations are accurate to within an additive error of n−c in the `2 norm (we
have r such approximations during the course of the algorithm). This follows
by a standard application of the Chernoff bound followed by a union bound.
Now let w = (w_1, . . . , w_n) be a unit vector in which one coordinate, say the j-th, is greater in absolute value than all other coordinates by at least a factor of 1 + Ω(1/ log n). Since w is a unit vector, this in particular means that w_j > 1/√n. Let w̃_new = w − δ∇mom₄(w). Recall that for each i, w̃_new,i = (2/5)w_i³, which in particular implies that w̃_new,j > (2/5)n^{−1.5} > n^{−2}. By our assumption on the gradient approximation, we have that for each i, |w̃_new,i − w_new,i| ≤ n^{−c}. So for any k ≠ j,

$$\frac{|w_{\mathrm{new},j}|}{|w_{\mathrm{new},k}|} \;\geq\; \frac{|\tilde{w}_{\mathrm{new},j}| - n^{-c}}{|\tilde{w}_{\mathrm{new},k}| + n^{-c}} \;\geq\; \frac{|\tilde{w}_{\mathrm{new},j}|\left(1 - n^{-(c-2)}\right)}{|\tilde{w}_{\mathrm{new},k}| + n^{-c}}.$$
If |w̃_new,k| > n^{−(c−1)}, then the above is at least (1 − O(1/n))(w_j/w_k)³. Otherwise, the above is at least Ω(n^{c−3}). Hence, after O(log log n) steps, the gap w_j/w_k becomes Ω(n^{c−3}). Therefore, for any c₀ > 0 we can make the distance between the output vector and one of the ±v_i's less than n^{−c₀} by choosing a
large enough c. ⊓⊔
The following theorem completes the analysis of the attack. In particular, it im-
plies that if V is an integer matrix all of whose entries are bounded in absolute
value by some polynomial, then running Algorithm 1 with a large enough poly-
nomial number of samples from the uniform distribution on P(V ) gives (with
constant probability) an approximation to a row of ±V whose error is less than
1/2 in each coordinate, and therefore leads to an exact row of ±V simply by
rounding each coordinate to the nearest integer. Hence we have a rigorous proof
that our attack can efficiently recover the secret key in both NTRUSign and
the GGH challenges.
Theorem 4. For any c₀ > 0 there exists a c₁ > 0 such that given n^{c₁} samples uniformly distributed over some parallelepiped P(V), V = [v_1, . . . , v_n] ∈ GL_n(R), Algorithm 1 outputs with constant probability a vector ẽV where ẽ is within ℓ₂ distance n^{−c₀} of some standard basis vector e_i.
It follows that the statistical distance³ between a set of n^{c−4} samples from P(C) and a set of n^{c−4} samples from P(C̃) is at most O(n^{−1}). By Theorem 3, we know that when given samples from P(C̃), Algorithm 2 outputs an approximation of a row of ±C̃ with some constant probability. Hence, when given samples from P(C), it must still output an equally good approximation of a row of ±C̃ with a probability that is smaller by at most O(n^{−1}), and in particular constant.
To complete the proof, let c̃ be the vector obtained in Step 4. The output of Algorithm 1 is then c̃L^{−1}. As we have seen before, all eigenvalues of U₁D^{−1}U₁^t are close to 1. It therefore follows that the above is a good approximation to a row of ±V, and it is not hard to verify that the quality of this approximation satisfies the requirements stated in the theorem. ⊓⊔
Proof. We first show that the parallelepiped P(C) almost contains and is almost contained in the cube P(C̃):

$$(1 - n^{-c+2})\, \mathcal{P}(\tilde{C}) \subseteq \mathcal{P}(C) \subseteq (1 + n^{-c+2})\, \mathcal{P}(\tilde{C}).$$

To show this, take any vector y ∈ [−1, 1]^n. The second containment is equivalent to showing that all the coordinates of yU₁DU₁^t are at most 1 + n^{−c+2} in absolute value, which follows from the triangle inequality. The first containment is proved similarly. On the other hand, the ratio of volumes between the two cubes is ((1 + n^{−c+2})/(1 − n^{−c+2}))^n = 1 + O(n^{−c+3}). From this it follows that the statistical distance between the uniform distribution on P(C) and that on P(C̃) is at most O(n^{−c+3}). ⊓⊔
Acknowledgements. We thank William Whyte for helpful discussions and the
anonymous referees for useful comments.
References
1. M. Ajtai. Generating hard instances of lattice problems. In Complexity of com-
putations and proofs, volume 13 of Quad. Mat., pages 1–32. Dept. Math., Seconda
Univ. Napoli, Caserta, 2004.
³ The statistical distance (or total variation distance) between two distributions is the maximum probability with which one can distinguish between an input sampled from the first distribution and an input sampled from the second distribution.