A Deeper Look at Machine Learning-Based Cryptanalysis
1 Introduction
block cipher SPECK-32/64 (the 32-bit block 64-bit key version of SPECK [2]),
he managed to obtain a good accuracy for a non-negligible number of rounds.
He even managed to mount a key recovery process on top of his neural distin-
guisher, eventually leading to the current best known key recovery attack for
this number of rounds (improving over works on SPECK-32/64 such as [6,24]).
Even if his distinguisher/key recovery attack had not improved over the
state-of-the-art, the prospect of a generic tool that could pre-scan for vulnera-
bilities in a cryptographic primitive (while reaching an accuracy close to existing
cryptanalysis) would have been very attractive anyway.
Yet, Gohr's paper actually opened many questions. The most important,
listed by the author as an open problem, is the interpretability of the distin-
guisher. An obvious issue with a neural distinguisher is that its black-box nature
is not really telling us much about the actual weakness of the cipher analyzed.
More generally, interpretability for deep neural networks has been known to be
a very complex problem and represents a key challenge for the machine learning
community. At first sight, it seems therefore very difficult to make any advances
in this direction.
Another interesting aspect to explore is to try to match Gohr's neural
distinguisher/key recovery attack with classical cryptanalysis tools. It remains very
surprising that a trained deep neural network can perform better than the
scrutiny of experienced cryptanalysts. As remarked by Gohr, his neural dis-
tinguisher is mostly differential in nature (on the ciphertext pairs), but some
unknown extra property is exploited. Indeed, as demonstrated by one of his
experiments, the neural distinguisher can still distinguish between a real and a
random set that have the exact same differential distribution on the ciphertext
pairs. Since we know there is some property that researchers have not seen or
exploited, what is it?
Finally, a last natural question is: can we do better? Are there some better
settings that could improve the accuracy of Gohr's distinguishers?
Our Contributions. In this article, we analyze the behavior of Gohr's neural
distinguishers when working on the SPECK-32/64 cipher. We first study in detail
the classified sets of real/random ciphertext pairs in order to get some hints on
what criterion the neural network is actually basing its decisions on. Looking for
patterns, we observe that the neural distinguisher is very probably deducing some
differential conditions not on the ciphertext pairs directly, but on the penultimate
or antepenultimate rounds. We then conduct some experiments to validate our
hypothesis.
In order to further confirm our findings, we construct for 5, 6 and 7-round
reduced SPECK-32/64 a new distinguisher purely based on cryptanalysis, with-
out any neural network or machine learning algorithm, that matches Gohr's
neural distinguisher's accuracy while actually being faster and using the same
amount of precomputation/training data. In short, our distinguisher relies on
selective partial decryption: in order to attack nr rounds, some hypothesis is
made on some bits of the last round subkey and partial decryption is performed,
eventually filtered by a precomputed approximated DDT on nr - 1 rounds.
We then take a different approach by tackling the problem not from the crypt-
analysis side, but the machine learning side. More precisely, as a deep learning
model learns high-level features by itself, in order to reach full interpretability
we need to discover what these features are. By analyzing the components of
Gohr's neural network, we managed to identify a procedure to model these fea-
tures, while retaining almost the same accuracy as Gohr's neural distinguishers.
Moreover, we also show that our method performs similarly on other primitives
by applying it on the SIMON block cipher. This result is interesting from a cryp-
tography perspective, but also from a machine learning perspective, showing an
example of interpretability by transformation of a deep neural network.
Finally, we explore possible improvements over Gohr's neural distinguishers.
By using batches of ciphertexts instead of pairs, we are able to significantly
improve the accuracy of the distinguisher, while maintaining identical experi-
mental conditions.
Outline. In Sect. 2, we introduce notations as well as basic cryptanalysis and
machine learning concepts that will be used in the rest of the paper. In Sect. 3,
we describe in more detail the various experiments conducted by Gohr and the
corresponding results. We provide in Sect. 4 an explanation of his neural distin-
guishers as well as the description of an actual cryptanalysis-only distinguisher
that matches Gohr's accuracy. We propose in Sect. 5 a machine learning app-
roach to enable interpretability of the neural distinguishers. Finally, we study
possible improvements in Sect. 6.
2 Preliminaries
Basic notations. In the rest of this article, ⊕, ∧ and ⊞ will denote the
eXclusive-OR operation, the bitwise AND operation and the modular addition
respectively. A right/left bit rotation will be denoted as >>> and <<< respectively,
while a||b will represent the concatenation of two bit strings a and b.
addition. See Fig. 1, where ki represents the 16-bit subkey at round i and where
α = 7, β = 2. The final ciphertext C is then obtained as C ← (l22 || r22). The
subkeys are generated with a key schedule that is very similar to the round
function (we refer to [2] for a complete description, as we do not make use of the
details of the key schedule in this article).
[Fig. 1. The SPECK-32/64 round function, mapping (li, ri) to (li+1, ri+1).]
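For concreteness, here is a minimal Python sketch of one SPECK-32/64 round as described above (α = 7, β = 2, 16-bit words); the function names are ours, not taken from any published implementation.

```python
MASK16 = 0xFFFF  # 16-bit words

def rol(x, r):
    """Rotate a 16-bit word left by r bits."""
    return ((x << r) | (x >> (16 - r))) & MASK16

def ror(x, r):
    """Rotate a 16-bit word right by r bits."""
    return ((x >> r) | (x << (16 - r))) & MASK16

def speck_round(l, r, k):
    """One SPECK-32/64 round:
    l_{i+1} = ((l_i >>> 7) + r_i mod 2^16) ^ k_i,  r_{i+1} = (r_i <<< 2) ^ l_{i+1}."""
    l = ((ror(l, 7) + r) & MASK16) ^ k
    r = rol(r, 2) ^ l
    return l, r
```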
The main problem tackled by a DNN is, given a dataset D = {(x_0, y_0), ..., (x_n, y_n)},
with x_i ∈ X being samples and y_i ∈ [0, ..., l] being labels, to find
the optimal parameters θ* for the DNN_θ model, with the parameters θ, such
that:

    θ* = argmin_θ  Σ_{i=0}^{n} L(y_i, DNN_θ(x_i))        (1)
with L being the loss function. As there is no literal expression of θ*, the approx-
imate solution will depend on the chosen optimization algorithm, such as
stochastic gradient descent. Moreover, hyper-parameters of the problem (param-
eters whose value is used to control the learning process) need to be adjusted as
they play an important role in the final quality of the solution.
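As a minimal illustration of Eq. (1), the PyTorch sketch below minimizes the empirical loss with stochastic gradient descent; the toy model, learning rate and batch size are arbitrary choices for illustration, not the hyper-parameters used later in this paper.

```python
import torch
import torch.nn as nn

# Toy data: n samples of 64 input bits with binary labels.
x = torch.randint(0, 2, (1000, 64)).float()
y = torch.randint(0, 2, (1000, 1)).float()

model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
loss_fn = nn.BCELoss()                          # the loss L in Eq. (1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(10):                         # approximate argmin over theta
    for i in range(0, len(x), 100):             # mini-batches of 100 samples
        xb, yb = x[i:i + 100], y[i:i + 100]
        loss = loss_fn(model(xb), yb)
        opt.zero_grad()
        loss.backward()
        opt.step()
```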
DNNs are powerful enough to derive accurate non-linear features from the
training data, but these features are not robust. Indeed, adding a small amount
of noise at the input can cause these features to deviate and confuse the model.
In other words, the DNN is a very unbiased classifier, but has a high variance.
Different blocks can be used to implement these complex models. However,
in this paper, we will be using four types of blocks: the linear neural network, the
one-dimensional convolutional neural network, the activation functions (ReLU
and sigmoid) and the batch normalization.
Activation functions. The three activation functions that we discuss here are
the Rectified Linear Unit (ReLU), defined as ReLU(x) = max(0, x), the sigmoid,
defined as Sigmoid(x) = σ(x) = 1/(1 + exp(-x)), and the Heaviside step function,
defined as H(x) = 1/2 + sgn(x)/2. This block, added between each layer of the DNN,
introduces the non-linear part of the model.
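For reference, these three activation functions are a direct transcription of the definitions above:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def heaviside(x):
    # H(x) = 1/2 + sgn(x)/2, i.e. 0 for x < 0 and 1 for x > 0 (1/2 at x = 0)
    return 0.5 + np.sign(x) / 2.0
```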
Since its release, the lightweight block cipher SPECK attracted a lot of external
cryptanalysis, together with its sibling SIMON (this was amplified by the fact
Fig. 2. The whole pipeline of Gohr's deep neural network. Block 1 refers to the initial
convolution block, Block 2-1 to 2-10 refer to the residual block and Block 3 refers to
the classification block.
a convolution layer with 32 filters is then applied. The kernel size of this 1D-
CNN is 1, thus, it maps (Cl, Cr, C'l, C'r) to (filter1, filter2, ..., filter32). Each
filter is a non-linear combination of the features (Cl, Cr, C'l, C'r) after the ReLU
activation function, depending on the value of the inputs and weights learned by
the 1D-CNN. The output of the first block is connected to the input and added
to the output of the subsequent layer in the residual block (see Fig. 3).
In the residual blocks (Blocks 2-i), both 1D-CNNs have a kernel of size 3,
a padding of size 1 and a stride of size 1 which make the temporal dimension
invariant across layers. At the end of each layer, the output is connected to the
input and added to the output of the subsequent layer to prevent the relevant
input signal from being wiped out across layers. The output of a residual block
is a (32 x 16) feature tensor (see Fig. 4).
Fig. 3. The initial convolution block (Block 1).
Fig. 4. The residual block (Blocks 2-i).
The final classification block takes as input the flattened output tensor of
the residual block. This 512 x 1 vector is passed into three perceptron layers
(Multi-Layer Perceptron or MLP) with batch normalization and ReLU activation
functions for the first two layers and a final sigmoid activation function performs
the binary classification (see Fig. 5).
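The PyTorch sketch below mirrors this three-block structure (kernel-1 initial convolution, kernel-3 residual blocks with shortcut connections, flattening into 512 features and an MLP head with a final sigmoid). It is an approximation written from the description above, not Gohr's exact training code; in particular the 64-unit middle layer and the placement of batch normalization are our assumptions.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Block 2-i: two kernel-3 convolutions with a shortcut connection."""
    def __init__(self, filters=32):
        super().__init__()
        self.conv1 = nn.Conv1d(filters, filters, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(filters, filters, kernel_size=3, padding=1)
        self.bn1, self.bn2 = nn.BatchNorm1d(filters), nn.BatchNorm1d(filters)
        self.act = nn.ReLU()

    def forward(self, x):
        out = self.act(self.bn1(self.conv1(x)))
        out = self.act(self.bn2(self.conv2(out)))
        return out + x                       # shortcut keeps the input signal alive

class GohrLikeNet(nn.Module):
    def __init__(self, depth=10, filters=32, words=4, word_size=16):
        super().__init__()
        # Block 1: kernel-1 convolution mapping the 4 input words to 32 filters.
        self.block1 = nn.Sequential(nn.Conv1d(words, filters, kernel_size=1),
                                    nn.BatchNorm1d(filters), nn.ReLU())
        self.blocks2 = nn.Sequential(*[ResidualBlock(filters) for _ in range(depth)])
        # Block 3: flatten the (32 x 16) tensor into 512 features, then the MLP head.
        self.block3 = nn.Sequential(
            nn.Flatten(),
            nn.Linear(filters * word_size, 64), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Linear(64, 64), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid())

    def forward(self, x):                    # x has shape (batch, 4, 16)
        return self.block3(self.blocks2(self.block1(x)))
```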
[Fig. 5. The classification block (Block 3).]
Accuracy and efficiency of the neural distinguishers. For each pair, the
neural distinguisher outputs a real-valued score between 0 and 1. If this score
is greater than or equal to 0.5, the sample is classified as a real pair, and as a
random pair otherwise. The results given by Gohr are presented in Table 1. Note
that N7 and N8 are trained using some sophisticated methods (we refer to [11]
for more details on the training). We remark that Gohr's neural distinguisher
has about 100,000 floating parameters, which is size efficient considering the
accuracies obtained.
Table 1. Accuracies of neural distinguishers for 5, 6, 7 and 8 rounds (taken from Table 2
of [11]). TPR and TNR denote true positive and true negative rates respectively.
from masked pairs even without re-training for this particular purpose, which
shows that they do not just rely on the difference distribution.
when we do not restrict the input difference, the best differential characteristic
for 5 rounds is 0x2800/0010 → 0x850a/9520, with probability 2^-9. However,
when we trained the neural distinguishers to recognize ciphertext pairs with the
input difference 0x2800/0010, the neural distinguishers performed worse (an
accuracy of 75.85% for 5 rounds). This is surprising as it is generally natural for
a cryptanalyst to maximize the differential probability when choosing a differ-
ential characteristic. We believe this is explained by the fact that 0x0040/0000
is the input difference maximizing the differential probability for 3 or 4 rounds
of SPECK-32/64 (verified with constraint programming), which has the most
chances to provide a biased distribution one or two rounds later. Generally, we
believe that when using such a neural distinguisher, a good method to choose
an input difference is to simply use the input difference leading to the highest
differential probability for nr - 1 or nr - 2 rounds.
Changing the inputs to the neural network. Gohr's neural distinguishers
are trained using the actual ciphertext pairs (C, C') whereas the pure differential
distinguishers are only using the difference between the two ciphertexts C ⊕ C'.
Thus, it is unfair to compare both as they are not exploiting the same amount of
information. To have a fair comparison of the capability of neural distinguishers
and pure differential distinguishers, we trained new neural distinguishers using
C ⊕ C' instead of (C, C'). The results are an accuracy of 90.6% for 5 rounds,
75.4% for 6 rounds and 58.3% for 7 rounds. This shows us that when the neural
distinguishers are restricted to only have access to the difference distribution,
they do not perform as well as their respective Nnr, and perform similarly to Dnr, as
can be seen in Table 1. Therefore, this is another confirmation (on top of the
real differences experiment conducted in [11]), that Gohr's neural distinguishers
are learning more than just the distribution of the differences on the ciphertext.
With that information, we therefore naturally looked beyond just the difference
distribution at round nr.
ciphertext pairs. The goal now is to find similarities and differences in these two
groups separately.
As we believe that most of the features the neural distinguishers learned are
differential in nature, we focus on the differentials of these ciphertext pairs. To
start, we did the following experiment (Experiment A), a code sketch of which is given after the list:
1. Using 10^5 real 5-round SPECK-32/64 ciphertext pairs, extract the set G.
2. Obtain the differences of the ciphertext pairs and sort them by frequency.
3. For each of the differences δ:
   (a) Generate 10^4 random 32-bit numbers and apply the difference δ to get
       10^4 different ciphertext pairs.
   (b) Feed the pairs to the neural distinguisher N5 to obtain the scores.
   (c) Note down the number of pairs that yield a score ≥ 0.5.
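A condensed sketch of Experiment A is given below. The callable `score_fn` stands for the trained 5-round neural distinguisher N5 (returning one score per pair) and `real_pairs` for the set G; both are placeholders we do not reproduce here.

```python
import numpy as np
from collections import Counter

def experiment_a(real_pairs, score_fn, n_trials=10_000):
    """real_pairs: list of (C, C') 32-bit ciphertext pairs forming the set G.
    score_fn: callable mapping a list of pairs to distinguisher scores in [0, 1]."""
    # Step 2: collect the ciphertext differences and sort them by frequency.
    diffs = Counter(c ^ cp for c, cp in real_pairs)
    results = {}
    for delta, _ in diffs.most_common():
        # Step 3a: random 32-bit values with the fixed difference delta applied.
        c = np.random.randint(0, 2**32, n_trials, dtype=np.int64)
        pairs = [(int(x), int(x) ^ delta) for x in c]
        # Steps 3b-3c: count how many pairs are scored as real (>= 0.5).
        scores = np.asarray(score_fn(pairs))
        results[delta] = int(np.sum(scores >= 0.5))
    return results
```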
Since the neural distinguishers outperform the ones with just the XOR input,
we started to look beyond just the differences at 5 rounds. We decided to partially
decrypt the ciphertext pairs from G for a few rounds and re-run Experiment A on
these partially decrypted pairs: for each pair, we compute the difference and for
each difference, we created 10^4 random plaintext pairs with these differences and
encrypted them to round nr using random keys. The results are very intriguing,
as compared to that of Table 3: almost all of the (top 1000) unique differences
obtained in this experiment achieved 99% or 100% of ciphertext pairs having a
score of ≥ 0.5.
We can see that the differences at rounds 3 and 4 (after decrypting 2 and 1
round respectively) start to show some strong biases. In fact, for all of the top
1000 differences at rounds 3 and 4, all 10^4 pairs × 1000 differences returned a
score of ≥ 0.5. With that, we conduct yet another experiment (Experiment B):
1. For all the ciphertext pairs in G, decrypt i rounds with their respective keys
   and compute the corresponding difference. Denote the set of differences as
   Diff_{5-i}.
2. Generate 10^5 plaintext pairs with a difference of 0x0040/0000 with random
   keys, encrypt to 4 rounds.
3. If the pair's difference is in Diff_{5-i}, keep the pair. Otherwise, discard it.
4. Encrypt the remaining pairs to 5 rounds and evaluate them using N5.
Table 4. Difference bit bias of ciphertext pairs in G and B after decrypting 2 rounds.
A negative (resp. positive) value indicates a bias towards '0' (resp. '1').

bit position  31      30      29      28      27      26      25      24      23      22      21      20      19      18      17      16
G             0.476   -0.454  -0.355  -0.135  0.045   0.084   -0.009  0.487   -0.473  -0.426  -0.300  -0.050  0.006   0.019   0.500   -0.500
B             -0.002  0.018   0.008   -0.011  0.044   0.002   0.023   -0.022  0.010   -0.002  0.013   -0.004  0.006   -0.005  0.103   0.072
bit position  15      14      13      12      11      10      9       8       7       6       5       4       3       2       1       0
G             0.476   -0.454  -0.142  -0.006  0.025   0.084   -0.009  0.487   -0.473  -0.426  0.165   0.094   -0.006  0.019   -0.500  -0.500
B             0.031   -0.009  -0.015  -0.007  -0.014  -0.024  0.025   0.026   0.034   -0.005  -0.018  -0.021  0.006   0.009   0.079   -0.065
Interestingly, the difference bit biases after decrypting 1 and 2 rounds are
very similar (in their positions). We will provide an explanation in Sect. 4.2. The
exact truncated differentials are (* denotes no specific constraint, while 0 or 1
denotes the expected bit difference):

3 rounds: 10*****00*****00 10*****00*****10
4 rounds: 10*****10*****10 10*****10*****00

We refer to these particular truncated differential masks as TD3 and TD4 for
the following discussion. Using constraint programming, we evaluate that the
probabilities for these truncated differentials are 87.86% and 49.87% respectively.
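In code, such a truncated mask boils down to a pair of (care, value) bit masks; the small helper below (our own notation, not part of the paper's tooling) parses the patterns above and tests a 32-bit difference against them. The probabilities can then be estimated empirically by sampling pairs with input difference 0x0040/0000, in the spirit of Experiment C below.

```python
def parse_truncated(pattern):
    """Turn a pattern such as '10*****00*****00 10*****00*****10'
    into (care, value) bit masks over the 32-bit state difference."""
    bits = pattern.replace(" ", "")
    care = int("".join("0" if b == "*" else "1" for b in bits), 2)
    value = int("".join("1" if b == "1" else "0" for b in bits), 2)
    return care, value

def matches(diff, care, value):
    """True if the 32-bit difference agrees with the mask on all constrained bits."""
    return (diff & care) == value

# TD3 as written above (left half || right half of the state difference).
TD3_CARE, TD3_VALUE = parse_truncated("10*****00*****00 10*****00*****10")
```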
In order to verify how much the neural distinguisher is relying on these bits, we
perform the following experiment (Experiment C):
1. Generate 10^6 plaintext pairs with initial difference 0x0040/0000 and 10^6 ran-
   dom keys.
2. Encrypt all 10^6 plaintext pairs to 5 - i rounds. If a plaintext pair satisfies
   TD_{5-i}, then we keep it. Otherwise, it is discarded.
3. Encrypt the remaining pairs to 5 rounds and evaluate them using N5.
Table 5. Results of Experiment C with TD3 and TD4. Proport. refers to the number
of true positive ciphertext pairs captured by the experiment.
Table 5 shows the statistics of the above experiment with 5 rounds of SPECK-
32/64. The true positive rates for ciphertext pairs that follow these are closer
to that of Gohr's neural distinguisher. Now, there remain about 3% of the
ciphertext pairs yet to be explained (comparing the results of TD_{5-2} with N5).
The important point to note here is that the pairs we have identified are exactly
the ones verified by the neural distinguisher as well, by the nature of these
experiments. In other words, we managed to find what the neural distinguisher
is looking for and not just another distinguisher that would achieve a good
accuracy by identifying a different set of ciphertext pairs.
Fig. 6. The distribution of the possible output differences after passing through the
modular addition operation.
In Fig. 7 and Fig. 8, we show how the bits evolve along the most probable
differential path from round 1 (0x8000/8000) to round 4 (0x850a/9520). As
it passes through the modular addition operation, we highlight the bits that
have a relatively higher probability of being different from the most probable
differential. The darker the color, the higher the probability of the difference
being toggled.
Figure 7 and Fig. 8 show us why TD3 is important at round 3, and how
the active bits shift in SPECK-32/64 when we start with the input difference of
0x0040/0000. In every round, b31 (the leftmost bit) has a high probability of
staying active. This bit is then rotated to b24 before it goes into the modular
addition operation. In each round, b26 has a 1/2 chance of switching from 1 → 0 or
the other way round, while b21 and b28 have a 3/4 and a 1/8 chance respectively of switching.
This makes them highly volatile and therefore unreliable. On the other hand,
the right part of SPECK-32/64 rotates by 2 to the left at the end of each round.
Because of the high rotation value in the left part of SPECK-32/64, low rotation
Fig. 7. The left (resp. right) part shows how the active bit from difference 0x8000/8000
(resp. 0x8100/8102) propagates to difference 0x8100/8102 (resp. 0x8000/820a). The darker
the color, the higher the probability that it has a carry propagated to it.
Fig. 8. Showing how the active bit from difference 0x8000/820a propagates to differ-
ence 0x850a/9520. The darker the color, the higher the probability (≥ 3/4) that it has
a carry propagated to it.
value of the right part of SPECK-32/64, and the fact that the left part is added
into the right part after the rotation, it takes about 3 to 4 rounds for the volatile
and unreliable bits to spread.
— The training set of Gohr's neural network consists of 10^7 ciphertext pairs.
Thus, we restrict our distinguisher to only use 10^7 ciphertext pairs as well.
— If we do an exhaustive key search for two rounds, the time complexity will be
extremely high. Instead, we may need to limit ourselves to only one round to
match the complexity of the neural distinguishers.
— If we know the difference at round i, the difference of the right part at round
i - 1 is known as well, since r_{i-1} = (l_i ⊕ r_i) >>> 2 (see the sketch below).
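The sketch below makes this explicit: inverting one SPECK round recovers the right word with no key material at all, and the left word only under a guess of the last-round subkey (function names are ours).

```python
MASK16 = 0xFFFF

def ror(x, r):
    return ((x >> r) | (x << (16 - r))) & MASK16

def rol(x, r):
    return ((x << r) | (x >> (16 - r))) & MASK16

def speck_round_inverse(l, r, k):
    """Undo one SPECK-32/64 round under a guessed 16-bit subkey k."""
    r_prev = ror(r ^ l, 2)                          # r_{i-1} = (l_i ^ r_i) >>> 2, key-free
    l_prev = rol(((l ^ k) - r_prev) & MASK16, 7)    # needs the subkey guess
    return l_prev, r_prev

def right_diff_prev(l, lp, r, rp):
    """Difference of the right words one round earlier, without any key guess."""
    return ror((l ^ r) ^ (lp ^ rp), 2)
```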
pose to use the number of floating point multiplications performed by the neural
network instead. Let I and O respectively denote the number of inputs and out-
puts of one layer. The computational cost of going through a dense layer is I · O
multiplications. For a 1D-CNN with kernel size ks = 1, a null padding, a stride
equal to 1 and F filters, with input size (I, T), the cost is computed as I · F · T
multiplications. With the same input but with kernel size ks = 3 and a padding
equal to 1, the cost is I · ks · F · T. Applying these formulas to Gohr's neural
network, we obtain a total of 137280 ≈ 2^17.07 multiplications. Note that we do
not account for batch normalizations and additions, which are dominated by the
cost of the multiplications. Using this estimation, it seems that our distinguisher
is slightly better in terms of complexity.
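These counting rules are easy to turn into code. The sketch below implements them; the concrete layer dimensions in the example (a single residual block with 32 filters over a temporal dimension of 16, followed by a 512-64-64-1 MLP head) are our reading of the architecture and are stated as assumptions, but under them the rules reproduce the 137280 figure quoted above.

```python
def dense_cost(n_in, n_out):
    """Multiplications in a fully connected layer: I * O."""
    return n_in * n_out

def conv1d_cost(n_in, n_filters, t, ks=1):
    """Multiplications in a stride-1, shape-preserving 1D convolution: I * ks * F * T."""
    return n_in * ks * n_filters * t

total = (conv1d_cost(4, 32, 16, ks=1)          # Block 1: kernel-1 convolution
         + 2 * conv1d_cost(32, 32, 16, ks=3)   # one residual block: two kernel-3 convolutions
         + dense_cost(512, 64)                 # Block 3: MLP head (assumed 512-64-64-1)
         + dense_cost(64, 64)
         + dense_cost(64, 1))
print(total)  # 137280, i.e. about 2^17.07
```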
4.6 Discussion
Even though Gohr trained a neural distinguisher with a fixed input difference, it
is unfair to compare the accuracy of a neural distinguisher to that of pure differ-
ential cryptanalysis (with the use of a DDT), since there are alternative cryptanal-
ysis methods that the neural distinguisher may have learned. The experiments
performed indicate that while Gohr's neural distinguishers did not rely much on
the difference at round nr, they rely strongly on the differences at round nr - 1
and even more strongly at round nr - 2. These results support the hypothesis
that the neural distinguisher may learn differential-linear cryptanalysis [13] in
the case of SPECK. While we did not present any attacks here, using the MILP
model shown in [9], we verified that there are indeed many linear relations with
large biases for 2 to 3 rounds.
Unlike traditional linear cryptanalysis, which usually uses independent charac-
teristics or linear hulls involving the same plaintext and ciphertext bits, a well-
trained neural network is able to learn and exploit several linear characteristics
while taking into account their dependencies and correlations.
We believe that neural networks find the easiest way to achieve the best
accuracy. In the case of SPECK, it seems that differential-linear cryptanalysis
would be a good fit since it requires less data and the truncated differential
has a very high probability. Thus, we think that neural networks have the abil-
ity to efficiently learn short but strong differential, linear or differential-linear
characteristics for small block ciphers for a small number of rounds.
Table 9. A comparison of the neural distinguisher and LGBM model for 5 rounds, for
10^6 samples generated of type (Cl, Cr, C'l, C'r).

N5      D5      LGBM as classifier for the original input   LGBM as classifier for the 512-feature   LGBM as classifier for the 64-feature
92.9%   91.1%   76.34% ± 2.62                               91.49% ± 0.09                            92.36% ± 0.07
The final MLP block is not essential. As described above, we cannot
replace the entire DNN with another non-neuronal machine learning model that
is easier to interpret. However, we may be able to replace the last block (Block
3) of the neural distinguisher, which performs the final classification, by an ensemble
model.
Experiment. We successfully exchanged the final MLP block for an LGBM model.
The reasons for choosing LGBM as a non-linear classifier were detailed in the
previous experiment paragraph. The first attempt is a complete substitution
of Block 3, taking the 512-dimension output of Block 2-10 as input. In the
fourth column of Table 9, we observe that this experiment leads to much better
results than the one from Conjecture 2, and even better results than the classical
DDT method D5 (+0.39%). To further improve the accuracy, we implemented
a partial substitution, taking only the 64-dimension output of the first layer of
the MLP as input. As can be seen in the fifth column of Table 9, the accuracy
with those inputs is now much closer to the DNN accuracy. In both cases, the
accuracy is close to the neural distinguisher, supporting our conjecture. At this
point, in order to grasp the unknown property P, one needs to understand the
feature vector at the residual blocks' output.
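In practice this substitution amounts to training a gradient-boosting classifier on feature vectors extracted from the trained network; a minimal sketch with the lightgbm package is given below. The `features` array stands for the 512-dimension outputs of Block 2-10 (or the 64-dimension first MLP layer) and is a placeholder, and the hyper-parameters are illustrative, not the ones used for Table 9.

```python
import numpy as np
from lightgbm import LGBMClassifier

# Placeholder: features extracted from the trained network for labelled
# real/random samples (here random data, for the sake of a runnable example).
features = np.random.rand(100_000, 512)
labels = np.random.randint(0, 2, 100_000)

clf = LGBMClassifier(n_estimators=200, num_leaves=63)
clf.fit(features[:90_000], labels[:90_000])
accuracy = clf.score(features[90_000:], labels[90_000:])
```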
Experiments. As the inputs of the first convolution are binary, we could formally
verify our conjecture. By forcing to one all non-zero values of the output of this
layer, we calculated the truth table of the first convolution. We thus obtained
the boolean expression of the first layer for the 32 filters. We observed that eight
filters were empty and the remaining twenty-four filters were simple. The filter
expressions are provided in the long version of the paper that can be found on
eprint.
However, one may argue that setting all non-zero values to one is an over-
simplified approach. Therefore, we replaced the first ReLU activation function
by the Heaviside activation function, and then we retrained the DNN. Since the
Heaviside function binarizes the intermediate value (as in [28]), we can estab-
lish the formal expression of the first layer of the retrained DNN. This second
A Deeper Look at Machine Learning-Based Cryptanalysis 827
DNN had the same accuracy as the first one and almost the same filter boolean
expression.
Finally, we trained the same DNN with the following entries (ΔL, ΔV, V0, V1).
Using the same method as before, we established the filters' boolean expressions.
This time, we obtained twenty-five null filters and seven non-null filters, with the
following expressions: ΔL, V0 ∧ V1, ΔL, ΔL, V0 ∧ V1, ΔL ∧ ΔV, ΔL ∧ ΔV. These
observations support Conjecture 4. Therefore, we kept only (ΔL, ΔV, V0, V1) as
inputs for our pipeline.
Conjecture 5. The neural distinguisher internal data processing of Block 2-i can
be approached by:
This approach enables us to replace Block 2-i of the DNN. However, we still
need to clarify how to obtain the RM ensemble.
These points stand respectively for Block 1, Block 2-i and Block 3.
5.3 Implementation
In this section, based on the verified conjectures, we describe the step-
wise implementation of our method. We consider that we have a DNN trained
with 10^7 samples of type (ΔL, ΔV, V0, V1) for 5 and 6 rounds of SPECK-32/64. We
developed a three-step approach:

Mask extraction from the DNN. We first ranked 10^4 real samples according
to the DNN score, as described in Sect. 4.1, in order to estimate the masks from these
entries. We used multiple local interpretation methods: Integrated Gradients
[26], DeepLift [22], Gradient Shap [15], Saliency maps [23], Shapley Value [5],
and Occlusion [27]. These methods score each bit according to their importance
for the classification. Following averaging by batch and by method, there were
two possible ways to move forward. We could either assign a Hamming weight
or else set a threshold above which all bits would be set to one. After a wide
range of experiments, we chose the first option and set the Hamming weight to
sixteen and eighteen (which turned out to be the best values in our testing).
This approach allowed us to build the ensemble RM of the relevant masks.
4 https://github.com/pytorch/captum.
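The bit-importance scores come from standard attribution libraries; a hedged sketch using captum's Integrated Gradients is shown below. The model interface, the input shape and the Hamming-weight truncation are written as assumptions, and the other attribution methods listed above can be used analogously.

```python
import torch
from captum.attr import IntegratedGradients

def extract_mask(model, samples, hamming_weight=16):
    """Assumes model returns a single real/random score per sample.
    Average Integrated-Gradients attributions over a batch of real samples,
    then keep the hamming_weight most important bit positions as a mask."""
    ig = IntegratedGradients(model)
    attributions = ig.attribute(samples)             # same shape as the input bits
    importance = attributions.abs().mean(dim=0).flatten()
    top = torch.topk(importance, hamming_weight).indices
    mask = torch.zeros_like(importance)
    mask[top] = 1
    return mask
```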
P(Real | I_M) = P(I_M | Real) P(Real) / ( P(I_M | Real) P(Real) + P(I_M | Random) P(Random) )

with P(Real) = P(Random) = 0.5, P(I_M | Random) = 2^(-HW(M)), HW(M) being
the Hamming weight of M, and P(I_M | Real) = (1/N) × U[I_M]. Finally, we update
the M-ODT as follows: M-ODT[M][I_M] = P(Real | I_M).
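A direct transcription of this update is sketched below; the dictionary layout and helper names are ours, and taking the masked value as a bitwise AND of the integer-encoded input with the mask is an assumption made for illustration.

```python
from collections import Counter

def hamming_weight(m):
    return bin(m).count("1")

def build_modt(masks, real_samples):
    """real_samples: iterable of integer-encoded inputs (e.g. the concatenated
    DL, DV, V0, V1 words). Returns M-ODT[M][I_M] = P(Real | I_M)."""
    n = len(real_samples)
    modt = {}
    for m in masks:
        counts = Counter(s & m for s in real_samples)    # U[I_M]
        p_random = 2.0 ** (-hamming_weight(m))           # P(I_M | Random)
        table = {}
        for i_m, u in counts.items():
            p_real = u / n                               # P(I_M | Real)
            table[i_m] = (p_real * 0.5) / (p_real * 0.5 + p_random * 0.5)
        modt[m] = table
    return modt
```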
5.4 Results
The M-ODT pipeline was implemented with numpy, scikit-learn [20] and pytorch
[19]. The project code can be found at this URL address. Our workstation
consists of an Nvidia GeForce GTX 970 GPU with 4043 MiB of memory and
four Intel Core i5-4460 processors clocked at 3.20 GHz.
General results. Table 10 shows the accuracies of the DDT, the DNN and our M-
ODT pipeline on 5 and 6-round reduced SPECK-32/64 for 1.1 × 10^7 generated
samples. When compared to the DNN and DDT, our M-ODT pipeline reached an
intermediate performance right below DNN. The main difference is the true
positive rate which is higher in our pipeline (this can be explained by the fact
that our M-ODT preprocessing only considers real samples). All in all, our M-
ODT pipeline successfully models the property P.
Table 10. A comparison of Gohr's neural network, the DDT and our M-ODT pipeline
accuracies for around 150 masks generated each time, with input (ΔL, ΔV, V0, V1),
LGBM as classifier and 1.1 × 10^7 samples generated in total. TPR and TNR refer to
the true positive and true negative rates respectively.
identical and equal to the label. On 6 rounds, the matching prediction drops
to 93.1%.
We thus demonstrated that our method advantageously approximates the
performance of the neural distinguisher. With an initial linear transforma-
tion on the inputs, computing a M-ODT for a set of masks extracted from
the DNN and then classifying the resulting feature vector with LGBM, we
achieved an efficient yet more easily interpretable approach than Gohr's distin-
guishers. Indeed, the obscure DNN features are simply approximated in our pipeline
by F = (P(Real | I_M1), P(Real | I_M2), ..., P(Real | I_Mn))^T. Finally, we interpret the per-
formance of the classifier globally (i.e. retrieving the decision tree) and locally
(i.e. deducing which feature played the greatest role in the classification for each
sample) as in [14]. Those results are not displayed as they are beyond the scope
of the present work, but they can be found in the project code.
Table 11. A comparison of Gohr's neural network predictions and our M-ODT pipeline
predictions for around 150 masks generated each time, with input (ΔL, ΔV, V0, V1),
LGBM as classifier and 1.1 × 10^7 samples generated in total.
32/64 block cipher. Implementing the same pipeline, we obtained an 82.2% accu-
racy for the classification, whereas the neural distinguisher achieves 83.4% accu-
racy. In addition, the matching rate between the two models was up to 92.4%.
The slight deterioration in the results of our pipeline for SIMON can be explained
by the lack of efficient masks as introduced in Sect. 5.3 for SPECK.
5.6 Discussions
From the cryptanalysts' standpoint, one important aspect of using the neural
distinguisher is to uncover the property P learned by the DNN. Unfortunately,
while being powerful and easy to use, Gohr's neural network remains opaque.
Our main conjecture is that the 10-layer residual blocks, considered as the
core of the model, are acting as a compressed DDT applied on the whole input
space. We model our idea with a Masked Output Distribution Table (M-ODT).
The M-ODT can be seen as a distribution table applied on masked outputs,
in our case (ΔL, ΔV, V0, V1), instead of only the difference (Cl ⊕ C'l, Cr ⊕ C'r).
By doing so, features are no longer abstract as in the neural distinguisher. In
our pipeline, each one of the features is a probability for the sample to be real
knowing the mask and the input. In the end, with our M-ODT pipeline, we
successfully obtained a model which has only a 0.6% accuracy difference with the
DNN and a matching rate of 97.3% on 5 rounds of SPECK-32/64. Additional analyses
of our pipeline (e.g. mask independence, input influence, classifier influence)
are available in the project code. To the best of our knowledge, this work is
the first successful attempt to exhibit the underlying mechanism of the neural
distinguisher. However, we note that a minor limitation of our method is that it
still requires the DNN to extract the relevant masks during the preparation of the
distinguisher. Since it is only during preparation, this does not remove anything
with regards to the interpretability of the distinguisher. Future work will aim
at computing these masks without DNN. All in all, our findings represent an
opportunity to guide the development of a novel, easy-to-use and interpretable
cryptanalysis method.
While in the two previous sections we focused on understanding how the neural
distinguisher works, here we will explain how one can outperform Gohr's results.
The main idea is to create batches of ciphertext inputs instead of pairs.
We refer to a batch input of size B as a group of B ciphertexts that are con-
structed with the same key. Here, we can distinguish two ways to train and
evaluate the neural distinguisher pipeline with batch input. The straightfor-
ward one is to evaluate the neural distinguisher score for each element of the
batch and then take the median of the results. The second is to consider
the whole batch as a single input for a neural distinguisher. In order to do so,
we used a 2-dimensional CNN (2D-CNN) where the channel dimension is the fea-
tures (ΔL, ΔV, V0, V1). We should point out that, for the sake of comparability with
Gohr's work, we maintained the product of the training set size by the batch size
equal to 10^7. Both batch-based methods yielded similar accuracy values (see
Table 12). Notably, in both cases, we reached 100% accuracy
on 5 and 6 rounds with batch sizes 10 and 50 respectively.
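The averaging method is the simpler of the two and reduces to a one-line aggregation of per-sample scores, as in the sketch below (`score_fn` stands for the trained per-sample neural distinguisher and is a placeholder).

```python
import numpy as np

def batch_score(score_fn, batch):
    """Score a batch of B samples built with the same key: evaluate the
    per-sample distinguisher and aggregate the scores with the median."""
    scores = np.asarray(score_fn(batch))
    return float(np.median(scores))

def classify_batch(score_fn, batch, threshold=0.5):
    return batch_score(score_fn, batch) >= threshold
```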
Table 12. Study of the batch size methods on the accuracies with (ΔL, ΔV, V0, V1)
as input for 5 and 6 rounds.

Rounds              5                        6
Batch input size    1       5       10       1       5        10      50
Averaging Method    92.9%   99.8%   100%     78.6%   95.41%   99.0%   100%
2D-CNN Method       -       99.4%   100%     -       93.27%   97.7%   100%
Table 13. Study of the averaging batch size method on the 7-round accuracies with
(ΔL, ΔV, V0, V1) as input.
Conclusion
Our results indicate that Gohr's neural distinguishers are not really produc-
ing novel cryptanalysis attacks, but rather optimizing the information extrac-
tion under low-data constraints. Many more distinguisher settings, machine
learning pipelines and types of ciphers should be studied to gain a better understand-
ing of what machine learning-based cryptanalysis might be capable of. Yet, we
foresee that such tools could become of interest for cryptanalysts and designers
to easily and generically pre-test a primitive for simple weaknesses.
Our work also opens interesting directions with regards to interpretability of
deep neural networks and we believe our simplified pipeline might lead to better
interpretability in other areas than cryptography.
Acknowledgements. The authors are grateful to the anonymous reviewers for their
insightful comments that improved the quality of the paper. The authors are supported
by the Temasek Laboratories NTU grant DSOCL17101. We would like to thank Aron
Gohr for pointing out that the differential characteristics mentioned in Dinur's
attacks [6] have been extended by one free round; thus, our previous suggestion of
extending Dinur's attack by one round is invalid.
References
1. Abed, F., List, E., Lucks, S., Wenzel, J.: Differential cryptanalysis of round-reduced SIMON and SPECK. In: Cid, C., Rechberger, C. (eds.) FSE 2014. LNCS, vol. 8540, pp. 525-545. Springer, Heidelberg (2015). https://doi.org/10.1007/978-3-662-46706-0_27
2. Beaulieu, R., Shors, D., Smith, J., Treatman-Clark, S., Weeks, B., Wingers, L.: The SIMON and SPECK families of lightweight block ciphers. IACR Cryptol. ePrint Arch. 2013, 404 (2013). http://eprint.iacr.org/2013/404
3. Biryukov, A., Roy, A., Velichkov, V.: Differential analysis of block ciphers SIMON and SPECK. In: Cid, C., Rechberger, C. (eds.) FSE 2014. LNCS, vol. 8540, pp. 546-570. Springer, Heidelberg (2015). https://doi.org/10.1007/978-3-662-46706-0_28
4. Breiman, L.: Random forests. Mach. Learn. 45(1), 5-32 (2001)
5. Castro, J., Gomez, D., Tejada, J.: Polynomial calculation of the Shapley value based on sampling. Comput. Oper. Res. 36(5), 1726-1730 (2009)
6. Dinur, I.: Improved differential cryptanalysis of round-reduced Speck. In: Joux, A., Youssef, A. (eds.) SAC 2014. LNCS, vol. 8781, pp. 147-164. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-13051-4_9
7. Duan, X., Yue, C., Liu, H., Guo, H., Zhang, F.: Attitude tracking control of small-scale unmanned helicopters using quaternion-based adaptive dynamic surface control. IEEE Access 9, 10153-10165 (2021). https://doi.org/10.1109/ACCESS.2020.3043363
8. Friedman, J.H.: Greedy function approximation: a gradient boosting machine. Ann. Stat., 1189-1232 (2001)
9. Fu, K., Wang, M., Guo, Y., Sun, S., Hu, L.: MILP-based automatic search algorithms for differential and linear trails for Speck. IACR Cryptol. ePrint Arch. 407 (2016)
10. Gerault, D., Minier, M., Solnon, C.: Constraint programming models for chosen key differential cryptanalysis. In: Rueher, M. (ed.) CP 2016. LNCS, vol. 9892, pp. 584-601. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-44953-1_37
11. Gohr, A.: Improving attacks on round-reduced Speck32/64 using deep learning. In: Boldyreva, A., Micciancio, D. (eds.) CRYPTO 2019, Part II. LNCS, vol. 11693, pp. 150-179. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-26951-7_6
12. Ke, G., et al.: LightGBM: a highly efficient gradient boosting decision tree. Adv. Neural Inf. Process. Syst., 3146-3154 (2017)
13. Langford, S.K., Hellman, M.E.: Differential-linear cryptanalysis. In: Desmedt, Y.G. (ed.) CRYPTO 1994. LNCS, vol. 839, pp. 17-25. Springer, Heidelberg (1994). https://doi.org/10.1007/3-540-48658-5_3
14. Lundberg, S.M., et al.: Explainable AI for trees: from local explanations to global understanding. arXiv preprint arXiv:1905.04610 (2019)
15. Lundberg, S.M., Lee, S.I.: A unified approach to interpreting model predictions. In: Advances in Neural Information Processing Systems, pp. 4765-4774 (2017)
16. Maghrebi, H., Portigliatti, T., Prouff, E.: Breaking cryptographic implementations using deep learning techniques. In: Carlet, C., Hasan, M.A., Saraswat, V. (eds.) SPACE 2016. LNCS, vol. 10076, pp. 3-26. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-49445-6_1
17. Mouha, N., Preneel, B.: A proof that the ARX cipher Salsa20 is secure against differential cryptanalysis. IACR Cryptol. ePrint Arch. 328 (2013). http://eprint.iacr.org/2013/328
18. Mouha, N., Wang, Q., Gu, D., Preneel, B.: Differential and linear cryptanalysis using mixed-integer linear programming. In: Inf. Secur. Cryptology - Inscrypt 2011, pp. 57-76 (2011)
19. Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems, pp. 8026-8037 (2019)
20. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825-2830 (2011)
21. Rivest, R.L.: Cryptography and machine learning. In: Imai, H., Rivest, R.L., Matsumoto, T. (eds.) ASIACRYPT 1991. LNCS, vol. 739, pp. 427-439. Springer, Heidelberg (1993). https://doi.org/10.1007/3-540-57332-1_36
22. Shrikumar, A., Greenside, P., Kundaje, A.: Learning important features through propagating activation differences. arXiv preprint arXiv:1704.02685 (2017)
23. Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034 (2013)
24. Song, L., Huang, Z., Yang, Q.: Automatic differential analysis of ARX block ciphers with application to SPECK and LEA. In: Liu, J.K., Steinfeld, R. (eds.) ACISP 2016. LNCS, vol. 9723, pp. 379-394. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-40367-0_24
25. Sun, S., Gerault, D., Lafourcade, P., Yang, Q., Todo, Y., Qiao, K., Hu, L.: Analysis of AES, SKINNY, and others with constraint programming. IACR Trans. Symmetric Cryptol. 2017(1), 281-306 (2017)
26. Sundararajan, M., Taly, A., Yan, Q.: Axiomatic attribution for deep networks. arXiv preprint arXiv:1703.01365 (2017)
27. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014, Part I. LNCS, vol. 8689, pp. 818-833. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10590-1_53
28. Zhou, S., Wu, Y., Ni, Z., Zhou, X., Wen, H., Zou, Y.: DoReFa-Net: training low bitwidth convolutional neural networks with low bitwidth gradients. CoRR abs/1606.06160 (2016). http://arxiv.org/abs/1606.06160