1 Introduction

Nowadays, we use cryptography for almost all online activities, e.g. payments, secure messaging and web navigation. Even though cryptography is usually perceived as a single monolithic piece, this is not the case: cryptographic protocols combine different primitives to provide security, and each primitive must achieve a certain level of resistance against different attacks to be considered secure. Thus, it is crucial to define and classify what an attack is.

To define a cryptographic attack, we need to specify the goal and the abilities of the adversary with respect to a security model that describes how the primitive is used and attacked [16, 19]. For example, recovering the plaintext of an encrypted message without knowing the key is the classical example of an attack on an encryption scheme.

Let us consider the concrete scenario in which an adversary is given an object sampled from one of two possible classes and the goal is to correctly guess the class used to generate that object. This adversary is known as a distinguisher for a distinguishing problem.

A classical example of a distinguishing problem in cryptography is the one underlying the pseudorandomness property of a primitive \(\mathsf{G}\) [16]. The problem is to distinguish elements generated by \(\mathsf{G}\) from elements generated uniformly at random, i.e. the pair of classes \((\mathsf{G}, \mathsf{rand})\). In other words, to disprove pseudorandomness, it is sufficient to build a distinguisher \(\mathsf{D}\) with a non-negligible advantage in solving the related distinguishing problem.

Machine learning (ML) and cryptography have been widely combined in the literature, e.g. from random number generation [20, 23], random number prediction [17, 18, 22] and supervised learning on encrypted data [12] to testing how good a pseudorandom generator (PRG) is [10].

Our contributions In this paper, we propose a constructive methodology based on ML that allows the generation of several distinguishers \(\{\mathsf{D}_i\}_i\) for a given distinguishing problem between two classes \((\mathsf{G}_0,\mathsf{G}_1)\). We implement a tool named \(\mathsf{MLCrypto}\) and freely release the code to facilitate future work on this line of research.

In a nutshell, we generate a dataset that contains tuples of elements \(y_i\) together with the classes from which they are sampled. This dataset is the input of an ML algorithm whose output is a distinguisher \(\mathsf{D}_i\). Our methodology also provides strategies and solutions that allow an adversary to improve her advantage. Concretely, we present a strategy that combines several distinguishers \(\{\mathsf{D}_i\}_i\) generated by \(\mathsf{MLCrypto}\) into a more accurate distinguisher \(\mathsf{D}\). We further discuss the blind spot paradox, a paradoxical phenomenon that can annihilate any advantage when the attacker blindly uses tools like \(\mathsf{MLCrypto}\) in realistic attack scenarios.

We present a case study on the cipher suite distinguishing problem for PRGs, based on the deterministic random bit generators (DRBG) recommended by the National Institute of Standards and Technology (NIST). We review the state-of-the-art generation of distinguishers from statistical test suites and link the advantage in breaking the pseudorandomness property, that is, the advantage in discriminating between a PRG and a random element, with the advantage in distinguishing between two PRGs \((\mathsf{G}_0,\mathsf{G}_1)\). We design an experiment that uses \(\mathsf{MLCrypto}\) as a distinguisher generator between the DRBG recommended by NIST [4]. In more detail, \(\mathsf{MLCrypto}\) generates Naive Bayes classifiers because of their (i) computational efficiency, (ii) implementation simplicity and (iii) lack of learning parameters to be tuned.

From our experiments, we conclude that both our methodology and \(\mathsf{MLCrypto}\) can be used to efficiently generate general-purpose distinguishers.

Case study: distinguishing NIST DRBGs There are two main approaches to generating distinguishers: theoretical and empirical. The theoretical approach consists of searching for flaws by scrutinising the mathematical definition of the primitive. For instance, there are theoretical attacks [8, 28] against PRGs proposed by NIST [4] based on specific differential cryptanalysis [6]. The empirical approach relies on defining a statistically significant number of experiments to provide enough confidence in the results used to create a distinguisher. For instance, the test suite provided by NIST [5] is composed of multiple statistical tests that check whether the outputs generated by the PRG are correlated with the presence of some pattern—defined by each one of the tests. After running these tests, the outputs are compared to the result that a uniform distribution would generate. The more tests are passed, the higher the confidence in stating that the PRG is pseudorandom.

However, all these tests, and more specifically the failed ones, can be used to distinguish between a PRG and real randomness. By observing the failing tests, a distinguisher can infer that the input elements are generated by the PRG. The failed tests can also be used to define fingerprints of the PRGs, i.e. each PRG is prone to fail the same tests, uniquely identifying it. Concretely, such a distinguisher can be used to solve a related problem named the cipher suite distinguishing problem [16]. Similar to pseudorandomness, an attacker has to discriminate between objects generated by two different primitives \((\mathsf{G}_0, \mathsf{G}_1)\) rather than from random elements.

Related work Other works propose to distinguish between random numbers generated with block ciphers [2, 9, 11, 14, 15, 25, 29], of which a vast majority extract features coming from the statistical tests proposed by NIST (NIST STS) [5] and use them as inputs of ML algorithms. While the documentation provided by NIST does not provide any formal security analysis [13], Woodgate et al. [28] carry out an in-depth security review. Contrary to prior proposals, we apply \(\mathsf{MLCrypto}\) to the DRBG recommended by NIST [4], being able to statistically distinguish between two pairs of generators.

Fig. 1 A state machine representation of the NIST \(\mathsf{DRBG}\) work flow

To extract features from the NIST STS to distinguish between random data generated from block ciphers, Zhao et al. [29] use support vector machines (SVM). They use OpenSSL to generate ciphertexts from the \(\mathsf{AES}\), \(\mathsf{Camellia}\), \(\mathsf{Blowfish}\), \(\mathsf{DES}\), \(\mathsf{IDEA}\) and \(\mathsf{TDEA}\) algorithms. The authors derive 54 features from the NIST STS, observing that 42 features achieve accuracies higher than 50% while 12 features achieve accuracies higher than 60%. Hu et al. [15] use random forests to classify random data from 16 block ciphers instead of the 6 that Zhao et al. use, obtaining a classification accuracy of 88%. Svenda et al. [26] use software circuits together with evolutionary algorithms to search for patterns, random bit predictability and random data indistinguishability.

Contrary to the aforementioned works, instead of distinguishing between random data, we use \(\mathsf{MLCrypto}\) as a machine learning approach to distinguish between the functions that generate these data, i.e. in our case study, we create distinguishers between the NIST DRBG [4].

Paper organisation In Sect. 2, we give a brief introduction to pseudorandom generators, NIST DRBG and machine learning. Section 3 describes the methodology for generating distinguishers using machine learning and additionally discusses limitations, such as the blind spot paradox, and a possible strategy to amplify the adversarial advantage. In Sect. 4, we implement our methodology into the \(\mathsf{MLCrypto}\) tool and consider a particular case study based on the DRBG recommended by NIST. This paper ends with ideas for future work in Sect. 5.

2 Preliminaries

In this section, we present definitions and concepts used throughout the paper.

Notation Let \(\mathop {\mathsf{Pr}}\limits _{x \in X}\left[ E\,\right] \) denote the probability, computed over the \(x \in X\), that the event E occurs. We omit the probability space whenever it is clear from the context, i.e. \(\mathop {\mathsf{Pr}}\limits _{}\left[ E\,\right] \). Random sampling from the set X is denoted as \(x{\leftarrow }_{\$} {X} \) and, whenever it is not specified, the sampling is always considered to be uniform at random. Let the natural numbers be denoted with \(\mathbb {N}\), the real numbers with \(\mathbb {R}\) and the positive ones with \(\mathbb {R}_+\). Let \([a,b]\) denote the interval between a and b, endpoints included. The space of binary strings of length \({\ell _{}}\) is \(\{0,1\}^{{\ell _{}}}\) while \(\Vert \) denotes binary concatenation.

Cryptography We report the definition of a pseudorandom generator (PRG) and the abstract NIST construction framework for a DRBG. For readability, we omit the error handling of these constructions.

Definition 2.1

(PRG [19]) Given the positive integers \({\ell _{\mathsf{in}}},{\ell _{\mathsf{out}}}\in \mathbb {N}\) with \({\ell _{\mathsf{out}}}> {\ell _{\mathsf{in}}}\), let \(\mathsf{G}: {\{0,1\}}^{{\ell _{\mathsf{in}}}} \rightarrow {\{0,1\}}^{{\ell _{\mathsf{out}}}} \) be a deterministic function. We say that \(\mathsf{G}\) is a pseudorandom generator if the following two distributions are computationally indistinguishable for a distinguisher \(\mathsf{D}\):

  • Sample a random seed \(\mathsf{s}{\leftarrow }_{\$} {{\{0,1\}}^{{\ell _{\mathsf{in}}}} } \) and output \(\mathsf{G}(\mathsf{s})\).

  • Sample a random string \(\mathsf{r} {\leftarrow }_{\$} {{\{0,1\}}^{{\ell _{\mathsf{out}}}} } \) and output \(\mathsf{r}\).
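To make the game concrete, the following minimal Python sketch (our illustration, not part of the definition; the byte lengths and the SHA-256 stand-in for \(\mathsf{G}\) are arbitrary choices) plays the two branches of the definition against a candidate distinguisher:

```python
import os
import hashlib
import secrets

L_IN, L_OUT = 16, 32  # seed and output lengths in bytes (illustrative choices)

def G(seed: bytes) -> bytes:
    # Toy deterministic expander: G(s) = SHA-256(s); for illustration only.
    return hashlib.sha256(seed).digest()

def prg_game(distinguisher) -> bool:
    """One round of the PRG indistinguishability game.
    Returns True iff the distinguisher guesses the coin b correctly."""
    b = secrets.randbelow(2)
    if b == 0:
        y = G(os.urandom(L_IN))      # pseudorandom branch
    else:
        y = os.urandom(L_OUT)        # truly random branch
    return distinguisher(y) == b

# A blind distinguisher guesses at random, wins with probability 1/2 and
# therefore has advantage |2 * Pr[win] - 1| = 0.
wins = sum(prg_game(lambda y: secrets.randbelow(2)) for _ in range(10_000))
print(f"empirical accuracy: {wins / 10_000:.3f}")
```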

Definition 2.2

(Abstract NIST DRBG) Let \(\lambda \in \mathbb {N}\) be the security parameter, \({\ell _{\mathsf{s}}}\in \mathbb {N}\) the seed length and \({\ell _{\mathsf{r}}}\in \mathbb {N}\) the number of iterations allowed before a reseed is required. We define a seed \(\widetilde{s}\in {\{0,1\}}^{{\ell _{\mathsf{s}}}} \), i.e. a bit string obtained from a random source, a nonce \(\nu \in {\{0,1\}}^{\lambda } \) and an auxiliary string \(\mathsf{aux}\in {\{0,1\}}^{*} \). Let a NIST abstract \(\mathsf{DRBG}\) be defined by the algorithms:

  • \(\mathsf{init}(\widetilde{s},\nu ,\mathsf{aux}, \lambda ) \rightarrow {\mathsf{st}}_1\): given a random binary string \(\widetilde{s}\), a nonce \(\nu \), an auxiliary string \(\mathsf{aux}\) and the security parameter \(\lambda \), the instantiation algorithm outputs the initial internal state \({\mathsf{st}}_1\).

  • \(\mathsf{reseed}({\mathsf{st}},\widetilde{s}^\prime ,\mathsf{aux}) \rightarrow {\mathsf{st}}^\prime _1\): given an internal state \({\mathsf{st}}\), a fresh random binary string \(\widetilde{s}^\prime \) and an auxiliary binary string \(\mathsf{aux}\), the reseeding algorithm outputs a fresh initial internal state \({\mathsf{st}}^\prime _1\).

  • \(\mathsf{gen}({\mathsf{st}}_i,n,\mathsf{aux}) \rightarrow (y,{\mathsf{st}}_{i+1})\): given the internal state \({\mathsf{st}}_i\), a non-zero number of output bits \(n \in \mathbb {N}\) and an auxiliary string \(\mathsf{aux}\), the generation algorithm outputs the pseudorandom bit-string \(y \in {\{0,1\}}^{n} \) and the successive internal state \({\mathsf{st}}_{i+1}\).

The DRBG is defined as a state machine, depicted in Fig. 1. It takes a random binary string \(\widetilde{s}\), a nonce \(\nu \), an auxiliary string \(\mathsf{aux}\) and the security parameter \(\lambda \) to initialise the internal state, generating the internal state \({\mathsf{st}}_{1}\). The internal state \({\mathsf{st}}_{i}\) is used as input of subsequent updates together with a nonzero number \(n \in \mathbb {N}\) indicating the number of random bits requested and an auxiliary string \(\mathsf{aux}\). It outputs an n-bit random string y and updates the internal state to the next state \({\mathsf{st}}_{i+1}\). Whenever requested, the DRBG can be reseeded, i.e. it restarts from a new internal state, producing a new \({\mathsf{st}}_1^\prime \) given a previous state \({\mathsf{st}}_i\), a new random binary string \(\widetilde{s}^\prime \) and some auxiliary information \(\mathsf{aux}^\prime \).

To correctly instantiate the DRBG, NIST suggests three different constructions based on: (1) a hash function; (2) the \(\mathsf{HMAC}\) of a hash function; and (3) a block cipher in counter mode. NIST requires the use of recommended cryptographic primitives [4], e.g. \(\mathsf{HMAC}\) with a secure hash function, \(\mathsf{AES{\text {-}128}}\) or the \(\mathsf{SHA{\text {-}2}}\) family, and a bit string obtained from a secure random source [3, 24]. Whenever it is not specified, we always consider NIST-approved primitives and security parameters.
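For concreteness, a minimal Python sketch of this work flow follows. It mirrors the \(\mathsf{init}\)/\(\mathsf{reseed}\)/\(\mathsf{gen}\) interface of Definition 2.2 (counting in bytes rather than bits, with the state size fixed by the choice of \(\mathsf{SHA{\text {-}256}}\)); it is only an illustration of the state machine, not the exact update functions specified in [4]:

```python
import hmac
import hashlib

def init(s_tilde: bytes, nu: bytes, aux: bytes) -> bytes:
    # Derive the initial internal state st_1 from seed, nonce and aux data.
    return hashlib.sha256(s_tilde + nu + aux).digest()

def reseed(st: bytes, s_prime: bytes, aux: bytes) -> bytes:
    # Fold fresh entropy into the state, yielding a fresh initial state st'_1.
    return hashlib.sha256(st + s_prime + aux).digest()

def gen(st: bytes, n: int, aux: bytes):
    # Produce n pseudorandom bytes plus the successive state st_{i+1}.
    out, counter = b"", 0
    while len(out) < n:
        out += hmac.new(st, aux + counter.to_bytes(4, "big"),
                        hashlib.sha256).digest()
        counter += 1
    next_st = hashlib.sha256(b"update" + st).digest()
    return out[:n], next_st

st = init(b"\x00" * 32, b"nonce", b"")
y, st = gen(st, 16, b"")   # 16 pseudorandom bytes, then the updated state
print(y.hex())
```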

Machine learning Roughly speaking, ML is a set of algorithms whose goal is finding and describing patterns in a dataset. The dataset is usually composed of independent instances, each one defined by a set of features or attributes. Once the dataset is generated, it is used as input to the ML algorithm, which produces the knowledge that has been learnt [1].

There are four main types of learning: (i) supervised learning or classification; (ii) unsupervised learning or clustering; (iii) association; and (iv) numeric prediction [27]. In supervised learning, the ML algorithm learns from an already labelled dataset and tries to predict the class of a new instance. On the contrary, in unsupervised learning, the dataset is not labelled and the ML algorithm looks for common patterns based on heuristics. Association seeks relationships between the features of the dataset, whereas the goal of numeric prediction learning algorithms is to predict numbers instead of (labelled) classes.

Naïve Bayes. The intuition behind Naïve Bayes is that features are assumed to be independent and equally important, a consequence of applying Bayes' theorem within a classification algorithm. A particular case of the Naïve Bayes algorithm arises when the likelihood of the features follows a Gaussian distribution, i.e. when the (continuous) values associated with each feature are distributed according to a Gaussian.
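As a quick illustration (with made-up toy data; the class means and sizes below are arbitrary), a Gaussian Naive Bayes classifier can be trained in a few lines with scikit-learn:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
# Two toy classes whose single feature follows two different Gaussians.
x0 = rng.normal(loc=0.0, scale=1.0, size=(500, 1))   # class 0
x1 = rng.normal(loc=0.5, scale=1.0, size=(500, 1))   # class 1
X = np.vstack([x0, x1])
y = np.array([0] * 500 + [1] * 500)

clf = GaussianNB().fit(X, y)     # no learning parameters to tune
print(clf.predict([[0.4]]))      # class guess for a fresh instance
print(clf.score(X, y))           # training accuracy
```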

3 Machine learning distinguishers

In this section, we formally define the distinguishing problem and present our methodology, which explains how ML can be used to solve a distinguishing problem. We discuss how to use the accuracy obtained from ML as a cryptographic advantage, describe a curious phenomenon we call the “blind spot paradox”, and propose a generic methodology to increase the advantage of a distinguisher at the cost of generating multiple ones.

In cryptography, it is common to find security properties defined by the probability of an adversary \(\mathcal {A}\) being able to distinguish between two different instances. For example, in a simulation-based proof, \(\mathcal {A}\) must discriminate between a real execution of a protocol and an ideal functionality assumed to be secure. Whenever proving the pseudorandomness of a function, \(\mathcal {A}\) must decide whether a value is computed by the function or randomly sampled.

Definition 3.1

(Distinguishing problem) Let \(\mathsf{G}_0\) and \(\mathsf{G}_1\) be two classes, \(b {\leftarrow }_{\$} {\{0,1\}} \) a random coin flip and y an element of \(\mathsf{G}_b\). Consider a distinguisher \(\mathsf{D}\) that takes as input y and outputs a guess \(b^\prime \). We define the distinguishing problem as \(\mathsf{D}\)’s task of discriminating the membership of the value \(y \in \mathsf{G}_b\) between the two classes \((\mathsf{G}_0,\mathsf{G}_1)\), with advantage:

$$\begin{aligned} \mathsf {Adv}^{\mathsf {D}}_{\mathsf {G}_0, \mathsf {G}_1} = \left|2 \cdot \mathop {\mathsf{Pr}}\limits _{}\left[ \mathsf{D}(y) = b\,\right] - 1 \right| \end{aligned}$$
(3.1)

Despite its abstract form, the distinguishing problem is the core concept behind many important cryptographic security notions: pseudorandomness is defined as a distinguishing problem between a primitive \(\mathsf{G}\) and a real random process; in indistinguishability under chosen-plaintext attacks, it is required to decide which of two possible messages a given ciphertext encrypts; and the cipher suite problem requires discriminating between different primitives \((\mathsf{G}_0,\mathsf{G}_1)\).

3.1 Our methodology: from classifiers to distinguishers

Our methodology, depicted in Fig. 2, is based on the idea that a supervised learning algorithm can be used by an adversary \(\mathcal {A}\) to create a distinguisher \(\mathsf{D}\) between two classes \((\mathsf{G}_0,\mathsf{G}_1)\). Observe that a supervised learning algorithm requires as input a labelled dataset of correctly classified values \((y_i , \mathsf{G}_{b_i})\), with \(y_i \in \mathsf{G}_{b_i}\), which are used to define the classifier. Our methodology assumes that an adversary \(\mathcal {A}\) can pre-compute any labelled simulated training dataset, i.e. \(\mathcal {A}\) can easily compute different but related instances of \((\mathsf{G}_0,\mathsf{G}_1)\), e.g. by sampling a different secret key. In this way, \(\mathcal {A}\) can simulate arbitrary labelled datasets which might not refer to the original problem instance between \((\mathsf{G}_0,\mathsf{G}_1)\) but are somehow related, and we thus consider them correct.
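A minimal sketch of this pipeline follows (our illustration: the byte-level feature encoding, dataset size and Gaussian Naive Bayes choice anticipate Sect. 4, and the keyed SHA-256 toy classes at the bottom merely stand in for any samplable pair \((\mathsf{G}_0,\mathsf{G}_1)\)):

```python
import hashlib

import numpy as np
from sklearn.naive_bayes import GaussianNB

def make_dataset(G0, G1, n, rng):
    """Labelled dataset with n samples per class; the features of a sample
    are simply the output bytes of the sampled class."""
    xs, ys = [], []
    for b, G in enumerate((G0, G1)):
        for _ in range(n):
            y = G(rng.bytes(16))                      # fresh seed per sample
            xs.append(np.frombuffer(y, dtype=np.uint8))
            ys.append(b)
    return np.array(xs), np.array(ys)

def train_distinguisher(G0, G1, n=4096, seed=0):
    X, y = make_dataset(G0, G1, n, np.random.RandomState(seed))
    return GaussianNB().fit(X, y)     # D: element -> guessed class (0 or 1)

# Toy classes: two keyed SHA-256 variants standing in for (G0, G1).
D = train_distinguisher(lambda s: hashlib.sha256(b"0" + s).digest(),
                        lambda s: hashlib.sha256(b"1" + s).digest())
```

Each call to train_distinguisher with a fresh seed yields an independent \(\mathsf{D}_i\), matching the assumption that \(\mathcal {A}\) can pre-compute arbitrarily many labelled datasets.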

Fig. 2 Abstract representation of our methodology

The output of the algorithm is a classifier \(\mathsf{D}\) that works exactly as a distinguisher, i.e. provided an element y, it guesses whether y belongs to \(\mathsf{G}_0\) or \(\mathsf{G}_1\). The next step is to consistently evaluate the accuracy that this distinguisher achieves. For the sake of simplicity, in this paper, we consider the classifier accuracy as the probability of correctly guessing the class for every element of a target dataset \(\mathsf{Y}\). However, other mechanisms can also be used to evaluate the accuracy, such as computing the confusion matrix and cross-validating the obtained results. We formally define the accuracy as:

$$\begin{aligned} \mathsf{Acc}^{\mathsf{D}}_{\mathsf{G}_0,\mathsf{G}_1}(\mathsf{Y}) = \mathop {\mathsf{Pr}}\limits _{y_i \in \mathsf{Y}}\left[ \mathsf{D}(y_i) = \mathsf{G}_{b_i} \,\right] \end{aligned}$$

Observe that the distinguisher’s accuracy highly depends on the target dataset \(\mathsf{Y}\). This implies that the accuracy computed by a distinguisher generated by our methodology is not directly related to the distinguisher’s advantage previously described in the distinguishing problem of Definition 3.1. The reason is that the accuracy is computed over a target subset \(\mathsf{Y}\), which is generally much smaller than the set \(\mathbb {Y}\) of all the possible elements. In other words, it is not possible to compare the accuracy \(\mathop {\mathsf{Pr}}\limits _{y_i \in \mathsf{Y}}\left[ \mathsf{D}(y_i) = \mathsf{G}_{b_i} \,\right] \) with the probability \(\mathop {\mathsf{Pr}}\limits _{y_i \in \mathbb {Y}}\left[ \mathsf{D}(y_i) = \mathsf{G}_{b_i}\,\right] \) because the target \(\mathsf{Y}\) might not be representative of the whole space \(\mathbb {Y}\): \(\mathsf{Y}\) might, for example, only contain “easy to classify” elements, providing a high accuracy for \(\mathsf{D}\) even though \(\mathsf{D}\) might have no cryptographic advantage.

Roughly speaking, the accuracy can be seen as a statistical estimator of the advantage \(\mathsf{Adv}^{\mathsf{D}}\), meaning that there is a strong conceptual gap between theoretical and empirical results. However, it is possible to estimate both the dimensions and the number of samples needed to achieve a statistically relevant distinguisher, e.g. by verifying some accuracy properties with an appropriate statistical test and later evaluating the power analysis to confirm the number of samples needed to reach statistical relevance.

For the rest of the paper, we assume that there is always a way to correctly generate statistically relevant distinguishers \(\mathsf{D}_i\) for any pair of classes \((\mathsf{G}_0,\mathsf{G}_1)\). Furthermore, we refer to \(\mathsf{D}\)’s advantage as:

$$\begin{aligned} \mathsf {Adv}^{\mathsf {D}}_{\mathsf {G}_0, \mathsf {G}_1}(\mathsf {Y}) = \left| 2 \cdot \mathsf{Acc}^{\mathsf{D}}_{\mathsf{G}_0,\mathsf{G}_1}(\mathsf{Y}) - 1 \right| \end{aligned}$$
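In code, and reusing the scikit-learn classifier interface from the sketch above, the empirical advantage over a target dataset is then a one-liner (our illustration):

```python
import numpy as np

def advantage(D, X_target, y_target):
    # Empirical Acc over the target dataset Y, turned into Adv = |2*Acc - 1|.
    acc = float(np.mean(D.predict(X_target) == y_target))
    return abs(2.0 * acc - 1.0)
```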

Note that, whenever it is possible, the adversary \(\mathcal {A}\) can generate many different training datasets, thus obtaining a set of n distinguishers \(\{\mathsf{D}_i\}_{i = 1}^n\), each having its own accuracy \(\mathsf{Acc}^{\mathsf{D}_i}_{\mathsf{G}_0,\mathsf{G}_1}(\mathsf{Y})\). By correctly analysing the accuracy’s distribution, \(\mathcal {A}\) can consider different attack strategies. Let us explain this concept with an example. Suppose that all the distinguishers generated by \(\mathcal {A}\) have the same accuracy of 0.5. This means that \(\mathcal {A}\) has no advantage and must therefore abandon the idea of solving the distinguishing problem. Differently, if \(\mathcal {A}\) observes that a distinguisher \(\mathsf{D}_i\) has an accuracy \(0.5-\delta \) for some positive \(\delta \in \mathbb {R}_+\), \(\mathcal {A}\) can invert \(\mathsf{D}_i\)’s output to define a new distinguisher \(\mathsf{D}_i^\prime \) with accuracy \(0.5+\delta \). In this way, \(\mathcal {A}\) can transform distinguishers with an advantage in making wrong guesses into distinguishers that make correct guesses with the same advantage.
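The output-flipping trick translates directly into code (a sketch for binary labels 0/1, wrapping any classifier with a predict method):

```python
class Flipped:
    """Wrap a distinguisher with accuracy 0.5 - delta into one with
    accuracy 0.5 + delta by inverting every guess."""
    def __init__(self, D):
        self.D = D

    def predict(self, X):
        return 1 - self.D.predict(X)
```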

In summary, our methodology allows an adversary \(\mathcal {A}\) to produce ML-generated distinguishers if \(\mathcal {A}\) can:

  (i) pre-compute labelled simulated training datasets;

  (ii) obtain statistically relevant target datasets, and;

  (iii) run appropriate tests to evaluate the accuracy.

Consider an adversary \(\mathcal {A}\) that, after executing our methodology, obtains several distinguishers whose accuracy distribution she does not know. Despite the odd requirement, observe that this is the standard situation in practice since computing the accuracy distribution requires a correct target dataset, which might not be obtainable; e.g. a primitive’s security might be defined as a distinguishing problem where the adversary cannot query the correct primitive instantiation, thus preventing \(\mathcal {A}\) from getting any target dataset.

The blind spot paradox, depicted in Fig. 3, is the paradoxical phenomenon where a blind adversary \(\mathcal {A}\), one that does not know whether a specific distinguisher has an advantage or not, is unable to spot how to correctly utilise the results, thus annihilating any advantage possessed. This paradox arises naturally whenever the accuracy is distributed symmetrically with respect to the probability 0.5. Consider a distinguisher \(\mathsf{D}\) and observe that, without any precise knowledge, it is impossible to know whether \(\mathsf{D}\) has a potential advantage \(\delta \) or \(-\delta \). The symmetry of the accuracy distribution implies that the probability of \(\mathsf{D}\) being a “good” or a “bad” distinguisher is the same. For this reason, \(\mathcal {A}\) is unable to properly utilise the potential advantage obtained, thus giving rise to the paradox. To avoid the paradox, it is necessary to allow the adversary to receive “hints” in the form of a statistically relevant list of correctly classified target outputs. In this way, the adversary can estimate the accuracy distribution and use this information to filter out the “bad” distinguishers. This completely breaks the symmetry and allows \(\mathcal {A}\) to use the “good” distinguishers. Of course, these hints might not be allowed by some theoretical security properties but might better represent a realistic usage of such properties.
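A small simulation (our illustration, with an artificial symmetric accuracy distribution and arbitrary parameter choices) shows both the paradox and how a hint of correctly classified targets escapes it:

```python
import numpy as np

rng = np.random.default_rng(0)
delta, n_dist, n_targets = 0.02, 1001, 20_000

# Symmetric accuracy distribution: each distinguisher is "good" (0.5 + delta)
# or "bad" (0.5 - delta) with equal probability, unknown to the adversary.
accs = 0.5 + delta * rng.choice([-1.0, 1.0], size=n_dist)
# correct[i, j] is True iff distinguisher i classifies target j correctly.
correct = rng.random((n_dist, n_targets)) < accs[:, None]

# Blind majority vote: the majority guess is right iff most voters are right.
blind = (correct.mean(axis=0) > 0.5).mean()
print(f"blind majority accuracy:    {blind:.3f}")     # ~0.5: no advantage

# A "hint" (100 correctly classified targets) breaks the symmetry: keep only
# the distinguishers scoring above 0.5 on the hint, then vote with the rest.
keep = correct[:, :100].mean(axis=1) > 0.5
informed = (correct[keep, 100:].mean(axis=0) > 0.5).mean()
print(f"informed majority accuracy: {informed:.3f}")  # noticeably above 0.5
```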

Fig. 3 Representation of the blind spot paradox

3.2 Distinguisher accuracy amplification

In this section, we propose a general method to combine several independent distinguishers into a more accurate one, amplifying their advantage, under the assumption that all the distinguishers have the same accuracy. The underlying reasoning still holds when considering different assumptions on the accuracy distribution.

Let us assume we have n distinguishers \(\{\mathsf{D}_i\}_{i=1}^n\) between classes \((\mathsf{G}_0,\mathsf{G}_1)\), all with the same accuracy \(p > 0.5\). We require the distinguishers to be independent in the sense that they are generated from different and independent training sets. Our goal is to take the majority of all n distinguishers’ guesses. In order to always have a majority, we assume that n is odd, i.e. there exists \(k \in \mathbb {N}\) such that \(n = 2k+1\).

Proposition 3.1

Let \(k \in \mathbb {N}\), \(0.5< p < 1\), \(n=2k+1\) and \(\{\mathsf{D}_i\}_{i=1}^n\) be independent distinguishers with accuracy p. We define the distinguisher \(\mathsf{D}^\prime \) as the majority function of the n independent \(\mathsf{D}_i\) guesses. Formally, \(\mathsf{D}^\prime (y) = \mathsf{maj}\left( \mathsf{D}_1(y), \dots ,\mathsf{D}_n(y) \right) \). Then, it holds that \(\mathsf{D}^\prime \) has an accuracy \(p_k\) greater than p.

Proof

Note that the distinguisher’ outputs define a binomial distribution of parameters p and n where the probability of “t distinguishers are correct” is:

$$\begin{aligned} \mathop {\mathsf {Pr}}\limits _{}\left[ t \text { are correct}\,\right] = \left( {\begin{array}{c}2k+1\\ t\end{array}}\right) \cdot (1-p)^{2k+1-t} \cdot p^t \end{aligned}$$

The final guess of \(\mathsf{D}^\prime \) is defined by at least \(k+1\) distinguishers that share the same guess. This implies that the accuracy of \(\mathsf{D}^\prime \) directly depends on p and n. Formally, the probability that \(\mathsf{D}^\prime \) correctly guesses in the distinguishing game, with \(q = 1-p\), is:

$$\begin{aligned} p_k = \mathop {\mathsf{Pr}}\limits _{}\left[ \mathsf{D}^\prime \text { correct}\,\right]&= {\mathsf {Pr}} \left[ \begin{array}{l} \ge k+1 \\ \mathsf {D}_i \text{ correct } \end{array}\right] \\&= \sum _{t=k+1}^{2k+1} \left( {\begin{array}{c}2k+1\\ t\end{array}}\right) \cdot q^{2k+1-t} \cdot p^t \end{aligned}$$

Let us recall the binomial identities \(\left( {\begin{array}{c}j\\ k\end{array}}\right) =\left( {\begin{array}{c}j-1\\ k\end{array}}\right) +\left( {\begin{array}{c}j-1\\ k-1\end{array}}\right) \) and \(\left( {\begin{array}{c}2k-1\\ t\end{array}}\right) = 0\) whenever \(t > 2k-1\). Let us define \(p_0\) to be exactly p. Our goal is to relate the probability \(p_k\) to \(p_{k-1}\). It holds that:

$$\begin{aligned}&p_k = \sum _{t=k+1}^{2k+1} \left( {\begin{array}{c}2k+1\\ t\end{array}}\right) \cdot q^{2k+1-t} \cdot p^t\nonumber \\&= \sum _{t=k+1}^{2k+1} \left( \left( {\begin{array}{c}2k-1\\ t\end{array}}\right) {+} 2\cdot \left( {\begin{array}{c}2k-1\\ t-1\end{array}}\right) {+} \left( {\begin{array}{c}2k-1\\ t-2\end{array}}\right) \right) \cdot \nonumber \\&\qquad \qquad \cdot q^{2k+1-t} \cdot p^t\nonumber \\&= \sum _{t=k+1}^{2k+1} \left( {\begin{array}{c}2k-1\\ t\end{array}}\right) \cdot q^{2k+1-t} \cdot p^t \nonumber \\&\qquad \qquad + 2\sum _{t=k+1}^{2k+1}\left( {\begin{array}{c}2k-1\\ t-1\end{array}}\right) \cdot q^{2k+1-t} \cdot p^t \nonumber \\&\qquad \qquad + \sum _{t=k+1}^{2k+1} \left( {\begin{array}{c}2k-1\\ t-2\end{array}}\right) \cdot q^{2k+1-t} \cdot p^t \end{aligned}$$
(3.2)

Let us take a look at the first addend and observe that it can be rewritten as:

$$\begin{aligned}&\sum _{t=k+1}^{2k+1} \left( {\begin{array}{c}2k-1\\ t\end{array}}\right) \cdot q^{2k+1-t} \cdot p^t \nonumber \\&= q^2 \cdot \sum _{t=k+1}^{2k-1} \left( {\begin{array}{c}2k-1\\ t\end{array}}\right) \cdot q^{2k-1-t} \cdot p^t\nonumber \\&= q^2 \cdot \left( \sum _{t=k}^{2k-1} \left( {\begin{array}{c}2k-1\\ t\end{array}}\right) \cdot q^{2k-1-t} \cdot p^t\right) \nonumber \\&\quad \quad - q^2 \left( {\begin{array}{c}2k-1\\ k\end{array}}\right) \cdot q^{k-1}\cdot p^k\nonumber \\&= q^2 \cdot p_{k-1} - \left( {\begin{array}{c}2k-1\\ k\end{array}}\right) \cdot q^{k+1}\cdot p^k \end{aligned}$$
(3.3)

where we note the presence of a relation to the winning probability \(p_{k-1}\). Similarly, we manipulate the second and third addends and obtain:

$$\begin{aligned} 2\sum _{t=k+1}^{2k+1}&\left( {\begin{array}{c}2k-1\\ t-1\end{array}}\right) \cdot q^{2k+1-t} \cdot p^t \nonumber \\&=2\cdot p \cdot q \cdot \sum _{t=k+1}^{2k} \left( {\begin{array}{c}2k-1\\ t-1\end{array}}\right) \cdot q^{2k-t} \cdot p^{t-1}\nonumber \\&= 2\cdot p \cdot q \cdot \sum _{t=k}^{2k-1} \left( {\begin{array}{c}2k-1\\ t\end{array}}\right) \cdot q^{2k-1-t} \cdot p^{t}\nonumber \\&= 2\cdot p \cdot q \cdot p_{k-1} \end{aligned}$$
(3.4)
$$\begin{aligned} \sum _{t=k+1}^{2k+1}&\left( {\begin{array}{c}2k-1\\ t-2\end{array}}\right) \cdot q^{2k+1-t} \cdot p^t \nonumber \\&= p^2 \cdot \sum _{t=k+1}^{2k+1} \left( {\begin{array}{c}2k-1\\ t-2\end{array}}\right) \cdot q^{2k+1-t} \cdot p^{t-2}\nonumber \\&= p^2 \cdot \sum _{t=k}^{2k-1} \left( {\begin{array}{c}2k-1\\ t\end{array}}\right) \cdot q^{2k-1-t} \cdot p^{t} + \nonumber \\&\quad \quad + p^2 \cdot \left( {\begin{array}{c}2k-1\\ k-1\end{array}}\right) \cdot q^{k} \cdot p^{k-1}\nonumber \\&= p^2 \cdot p_{k-1} + \left( {\begin{array}{c}2k-1\\ k-1\end{array}}\right) \cdot q^{k} \cdot p^{k+1} \nonumber \\&= p^2 \cdot p_{k-1} + \left( {\begin{array}{c}2k-1\\ k\end{array}}\right) \cdot q^{k} \cdot p^{k+1} \end{aligned}$$
(3.5)

where we used the fact that:

$$\begin{aligned} \left( {\begin{array}{c}2k-1\\ k-1\end{array}}\right) = \frac{(2k-1)!}{(k-1)!\cdot k!} = \left( {\begin{array}{c}2k-1\\ k\end{array}}\right) \end{aligned}$$
Fig. 4 Example of distribution fitting with respect to an ideal binomial distribution

By putting Equations 3.3, 3.4 and 3.5 together into Equation 3.2, it holds that:

$$\begin{aligned} p_k&= p_{k-1} \Big ( q^2 + 2qp + p^2 \Big ) + \left( {\begin{array}{c}2k-1\\ k\end{array}}\right) \cdot q^{k} \cdot p^{k+1} \\&\quad \quad - \left( {\begin{array}{c}2k-1\\ k\end{array}}\right) \cdot q^{k+1}\cdot p^k\\&= p_{k-1} \big (q+p\big )^2 + \left( {\begin{array}{c}2k-1\\ k\end{array}}\right) \cdot \big ( q\cdot p\big )^{k} \cdot \big (p - q \big )\\&= p_{k-1} + \left( {\begin{array}{c}2k-1\\ k\end{array}}\right) \cdot \big ( q\cdot p\big )^{k} \cdot \big (2p - 1 \big ) \end{aligned}$$

from which we observe that \(p_k > p_{k-1}\) whenever:

$$\begin{aligned} p_k> p_{k-1}&\;\Leftrightarrow \; \left( {\begin{array}{c}2k-1\\ k\end{array}}\right) \cdot \big ( q\cdot p\big )^{k}\cdot \big (2p - 1 \big )> 0\\&\;\Leftrightarrow \;\big ( 2p-1 \big )> 0 \;\Longleftrightarrow \; p > \frac{1}{2} \end{aligned}$$

which is true by our hypothesis. The distinguisher \(\mathsf{D}^\prime \) built from \(2k+1\) distinguishers has an accuracy \(p_k> p_{k-1}> \cdots > p_0 = p\), concluding our proof. \(\square \)
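Proposition 3.1 is straightforward to verify numerically; the following sketch (our illustration) computes \(p_k\) directly from the binomial sum, and the second call reproduces the amplification figures reported in Sect. 4.1:

```python
from math import comb

def majority_accuracy(p: float, n: int) -> float:
    """Accuracy p_k of the majority vote of n = 2k + 1 independent
    distinguishers, each with individual accuracy p (Proposition 3.1)."""
    assert n % 2 == 1 and 0.5 < p < 1
    k = (n - 1) // 2
    return sum(comb(n, t) * p**t * (1 - p)**(n - t)
               for t in range(k + 1, n + 1))

print(f"{majority_accuracy(0.501, 1):.4f}")    # 0.5010: a single distinguisher
print(f"{majority_accuracy(0.501, 511):.4f}")  # ~0.518: 511 combined voters
```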

4 Case study: cipher suite distinguisher for pseudorandom generators

In this section, we implement our methodology into the \(\mathsf{MLCrypto}\) tool which we use to create distinguishers for NIST DRBG. We also discuss the connection between our empirical results and the constraints posed by a possible real attack against the primitives.

Let us consider a PRG \(\mathsf{G}: {\{0,1\}}^{{\ell _{\mathsf{in}}}} \rightarrow {\{0,1\}}^{{\ell _{\mathsf{out}}}} \), as in Definition 2.1, and focus on the pseudorandomness property. Such a property states the indistinguishability between the distribution of \(\mathsf{G}\)’s outputs and that of uniformly random elements. In the game-based proving framework, it is required that any distinguisher \(\mathsf{D}\) is unable to distinguish between a random value and \(\mathsf{G}\)’s output when provided by the challenger. Formally, we define the advantage as:

$$\begin{aligned} \mathsf{Adv}^{\mathsf{D}}_{\mathsf{G},\mathsf{rand}}(\lambda ) = \left|\mathsf{Pr}\left[ {\mathsf{D}(\mathsf{G}(\mathsf{s})) = \mathsf{G}}\right] - {\mathsf{Pr}} \left[ {\mathsf{D}(r) = \mathsf{G}}\right] \right| \end{aligned}$$

for some random seed \(\mathsf{s}{\leftarrow }_{\$} {{\{0,1\}}^{{\ell _{\mathsf{in}}}} } \) and uniformly sampled \(r {\leftarrow }_{\$} {{\{0,1\}}^{{\ell _{\mathsf{out}}}} } \). The theoretical approach is conceptually simple and tight but infeasible because it requires a function that outputs truly random elements, which is, in other words, precisely what the PRG tries to emulate, thus creating a circular loophole in which the goal is at the same time the solution.

To avoid this loophole, we can use a statistical approach, which consists of running several statistical tests on the outputs of \(\mathsf{G}\). After running \(\mathsf{G}\), the tests compare the empirical and the theoretical distributions to accept/reject the hypothesis that \(\mathsf{G}\) is random. There are several statistical test suites to analyse PRGs, such as NIST STS [5], Dieharder [7] and TestU01 [21].

Let us explain the approach with an example. Consider a list of N outputs \(\{y_i\}_{i=1}^N\) from a pseudorandom generator \(\mathsf{G}\) for which we want to determine whether they appear random. To do so, consider the statistical test that measures the frequency of 1s in the output, i.e. it returns the number of 1s in a given output binary string.

Theoretically, we know that the test’s outputs should follow a binomial distribution, whose probability mass function we know exactly. For this reason, we apply the test to the set of outputs \(\{y_i\}_{i=1}^N\) and compare the result with the theoretical binomial one, thus testing whether the outputs are “binomial enough”. In Fig. 4, we illustrate the possible outcomes of the test, where we compare the ideal distribution (b) with a fitting one (c) and a completely random one (a).
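A sketch of this frequency test follows (our illustration; scipy is assumed to be available, and os.urandom merely stands in for the generator under test). It aggregates the counts of 1s and checks them against the theoretical binomial law:

```python
import os
from scipy.stats import binomtest

ELL, N = 128, 10_000                 # bits per output, number of outputs
outputs = [os.urandom(ELL // 8) for _ in range(N)]  # stand-in for G's outputs

# Frequency statistic: the total number of 1s over all outputs should follow
# a Binomial(N * ELL, 1/2) distribution if the generator is "binomial enough".
ones = sum(bin(int.from_bytes(y, "big")).count("1") for y in outputs)
result = binomtest(ones, n=N * ELL, p=0.5)
print(f"p-value: {result.pvalue:.3f}")  # a small p-value rejects randomness
```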

These tests take an analytical approach by computing precise values, e.g. the p-value of some specific statistical test. By repeating the test multiple times, it is possible to improve the confidence in the result. Sadly, regardless of the number of different tests we perform and analyse, this approach can only state whether a generator is plausibly pseudorandom or not.

On the other hand, the statistical approach allows the direct construction of a distinguisher \(\mathsf{D}\) for the general pseudorandomness property, i.e. \(\mathsf{D}\) executes the statistical tests on the given output and uses the test results to discriminate between pseudorandom and non-pseudorandom. A failing test allows \(\mathsf{D}\) to gain an advantage in discriminating non-pseudorandom PRGs.

Let us take a step back and observe that the pseudorandomness property can be turned into a cipher suite distinguishing problem in which a distinguisher \(\mathsf{D}\) must distinguish between two different generators \(\mathsf{G}_0\) and \(\mathsf{G}_1\), regardless of their pseudorandomness properties. By arithmetic manipulation of Equation 3.1, we obtain:

$$\begin{aligned}&\mathsf {Adv}^{\mathsf {D}}_{\mathsf {G}_0, \mathsf {G}_1} (\lambda )\\&\quad = \left|\mathop {\mathsf{Pr}}\limits _{}\left[ \mathsf{D}(\mathsf{G}_1(\mathsf{s})) = \mathsf{G}_1\,\right] - \mathop {\mathsf{Pr}}\limits _{}\left[ \mathsf{D}(\mathsf{G}_0(\mathsf{s})) = \mathsf{G}_1\,\right] \right|\\&\quad = \left| \mathop {\mathsf{Pr}}\limits _{}\left[ \mathsf{D}(\mathsf{G}_1(\mathsf{s})) = \mathsf{G}_1\,\right] - \mathop {\mathsf{Pr}}\limits _{}\left[ \mathsf{D}(r) = \mathsf{G}_1\,\right] \right. \\&\quad \left. + \mathop {\mathsf{Pr}}\limits _{}\left[ \mathsf{D}(r) = \mathsf{G}_1\,\right] - \mathop {\mathsf{Pr}}\limits _{}\left[ \mathsf{D}(\mathsf{G}_0(\mathsf{s})) = \mathsf{G}_1\,\right] \right| \\&\quad \le \left|\mathop {\mathsf{Pr}}\limits _{}\left[ \mathsf{D}(\mathsf{G}_1(\mathsf{s})) = \mathsf{G}_1\,\right] - \mathop {\mathsf{Pr}}\limits _{}\left[ \mathsf{D}(r) = \mathsf{G}_1\,\right] \right|\\&\quad + \left|\mathop {\mathsf{Pr}}\limits _{}\left[ \mathsf{D}(r) = \mathsf{G}_1\,\right] - \mathop {\mathsf{Pr}}\limits _{}\left[ \mathsf{D}(\mathsf{G}_0(\mathsf{s})) = \mathsf{G}_1\,\right] \right|\\&\quad \le \mathsf {Adv}^{\mathsf {D}}_{\mathsf {G}_0, \mathsf {rand}} (\lambda ) + \mathsf {Adv}^{\mathsf {D}}_{\mathsf {rand}, \mathsf {G}_1} (\lambda ) \end{aligned}$$

where the second addend:

$$\begin{aligned} \left|\mathop {\mathsf{Pr}}\limits _{}\left[ \mathsf{D}(r) {=} \mathsf{G}_1\,\right] {-} \mathop {\mathsf{Pr}}\limits _{}\left[ \mathsf{D}(\mathsf{G}_0(\mathsf{s})) {=} \mathsf{G}_1\,\right] \right| \le \mathsf {Adv}^{\mathsf {D}}_{\mathsf {rand}, \mathsf {G}_1} (\lambda ) \end{aligned}$$

measures the probability of \(\mathsf{D}\) wrongly distinguishing \(\mathsf{G}_0\). By the nature of the absolute value, we can modify this faulty distinguisher into a correct one by simply flipping \(\mathsf{D}\)’s output. The idea behind our observation is that, by the triangle inequality, the advantage in distinguishing between two generators imposes a lower bound on the generators’ pseudorandomness advantages. Formally:

Fig. 5 Distinguishers’ accuracy distributions of two arbitrary primitives computed for 3 different training dataset sizes

$$\begin{aligned} \mathsf {Adv}^{\mathsf {D}}_{\mathsf {G}_0, \mathsf {G}_1} (\lambda ) \le \mathsf {Adv}^{\mathsf {D}}_{\mathsf {G}_0, \mathsf {rand}} (\lambda ) + \mathsf {Adv}^{\mathsf {D}}_{\mathsf {rand}, \mathsf {G}_1} (\lambda ) \end{aligned}$$
(4.1)

Since executing the cryptanalysis necessary to create \(\mathsf{D}\) is a tedious, time-consuming and human-intensive task, we use \(\mathsf{MLCrypto}\) to automatically generate \(\mathsf{D}\) from the outputs of different NIST DRBG.

4.1 Experiments and results

In this section, we analyse the distinguishers generated by \(\mathsf{MLCrypto}\) for the cipher suite distinguishing problem. Concretely, we focus on the DRBG that NIST recommends [4]. All the experiments we present in this section were run on an Intel(R) Core(TM) i7-4790 CPU @3.60GHz with 16GB of RAM running Linux. We implement \(\mathsf{MLCrypto}\) in Python, and all the source code of our tool is freely released for future research.

For this experiment, we consider the NIST DRBG based on the primitives \(\mathsf{TDEA}\), \(\mathsf{AES{\text {-}256}}\), \(\mathsf{SHA{\text {-}256}}\) and \(\mathsf{HMAC}\)-\(\mathsf{SHA{\text {-}256}}\). The choice of these DRBG is arbitrary; had other primitives been chosen, the conclusions would remain the same.

For all the experiments, there is a common initial phase in which we compute all possible pairs \((\mathsf{alg}_0,\mathsf{alg}_1)\) of the primitives and accordingly generate the training and target datasets. For the training datasets, we want to simulate an adversary who cannot create such a dataset with the same seed as the target. Thus, all the training datasets use different seeds from the target ones. In our case study, we analyse whether the distribution of the accuracy of the distinguishers generated by \(\mathsf{MLCrypto}\) (see Sect. 3) is affected by (i) the size of the datasets (training and target) and (ii) a different target dataset.
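This common phase can be sketched as follows (our illustration: the keyed SHA-256 stand-ins merely associate a generator with each primitive name; the real experiment invokes the corresponding NIST DRBG implementations):

```python
import hashlib
from itertools import combinations

import numpy as np
from sklearn.naive_bayes import GaussianNB

def make_gen(name: str):
    # Stand-in generator keyed by the primitive name (illustrative only).
    return lambda seed: hashlib.sha256(name.encode() + seed).digest()

GENS = {name: make_gen(name)
        for name in ("TDEA", "AES-256", "SHA-256", "HMAC-SHA-256")}

def dataset(g0, g1, n, rng):
    X = [np.frombuffer(GENS[g](rng.bytes(16)), dtype=np.uint8)
         for g in (g0, g1) for _ in range(n)]
    return np.array(X), np.array([0] * n + [1] * n)

for g0, g1 in combinations(GENS, 2):          # all 6 unordered pairs
    # Training and target datasets come from independent RNGs: the adversary
    # never knows the seeds behind the target dataset.
    X_tr, y_tr = dataset(g0, g1, 4096, np.random.RandomState(1))
    X_ta, y_ta = dataset(g0, g1, 8192, np.random.RandomState(2))
    D = GaussianNB().fit(X_tr, y_tr)
    print(f"{g0} vs {g1}: accuracy = {D.score(X_ta, y_ta):.4f}")
```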

Fig. 6 Distinguishers’ accuracy distributions of two arbitrary primitives computed for 4 different target datasets

Fig. 7 Distinguishers’ accuracy distributions of the combinations between the primitives in \(\mathsf{alg}\). We compute the distributions for 3 target dataset sizes and 3 training ones

The reason why we chose Naive Bayes classifiers for \(\mathsf{MLCrypto}\) is that they are (i) computationally efficient, (ii) simple to implement and (iii) free of learning parameters to be tuned.

Fig. 8 Distinguishers’ accuracy distribution of all the NIST recommended DRBG combinations with training dataset size \(n_{\mathsf{X}}= 2^{13}\) and target dataset size \(n_{\mathsf{Y}}= 2^{16}\)

Dataset size To cross-validate our ML classifiers, we check whether the size of the datasets affects the output of the distinguisher. To do so, for each primitive \(\mathsf{alg}\) we generate a training dataset \(\mathsf{X}_{\mathsf{alg}}\) containing \(n_{\mathsf{X}}\) outputs of \(\mathsf{alg}\) and a target dataset \(\mathsf{Y}_{\mathsf{alg}}\) containing \(n_{\mathsf{Y}}\) outputs of \(\mathsf{alg}\). In more detail, the sizes of the training (\(n_{\mathsf{X}}\)) and target (\(n_{\mathsf{Y}}\)) datasets are \(n_{\mathsf{X}}\in \{2^{i}: i \in [12,14]\}\) and \(n_{\mathsf{Y}}\in \{2^{i}: i \in [14,16]\}\), respectively. The dataset generation is computationally efficient, and the average size of a dataset with \(2^{16}\) values is \(\sim 1.1\;\mathsf{MB}\). We independently execute \(\mathsf{MLCrypto}\) \(t_{\mathsf{X}}\) times with a freshly generated training dataset, say \(\mathsf{X}^\prime \), but with the same target dataset \(\mathsf{Y}\). Concretely, we consider \(t_{\mathsf{X}}= 2^{10}\), which corresponds to a Cohen’s effect size of \(d = 0.0876\) at a statistical power of 0.8 when analysing the distinguishers’ accuracy distribution with a one-sample Student’s t-test at significance level \(\alpha =0.05\). In other words, the size of our datasets, as well as the number of tests, provides a (simplistic) statistical argument that the obtained classifiers’ accuracy distribution has some statistical confidence. In Fig. 5, we observe that changing the training dataset size \(n_{\mathsf{X}}\) does not have any major impact on the accuracy distribution. This suggests that it is possible to provide smaller training datasets and still achieve the same accuracy distribution. Finally, we also checked our model’s ability to predict new data (i.e. to avoid overfitting or selection bias) by computing the cross-validation value of each one of the experiments we performed. In more detail, we computed the 10-fold cross-validation using the function provided by scikit-learn and obtained a consistent accuracy in all our independent experiments.
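The cross-validation step amounts to a single scikit-learn call; a sketch on toy byte-level features (our illustration, with made-up data in place of the DRBG outputs):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(8192, 16))   # toy byte-level features
y = rng.integers(0, 2, size=8192)           # toy class labels

# 10-fold cross-validation; consistent per-fold accuracies indicate that the
# model generalises, i.e. no overfitting or selection bias on a single split.
scores = cross_val_score(GaussianNB(), X, y, cv=10)
print(f"{scores.mean():.4f} +/- {scores.std():.4f}")
```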

Different targets We generate the training datasets of such primitives and obtain a distinguisher \(\mathsf{D}\) for the algorithms \((\mathsf{alg}_0,\mathsf{alg}_1)\). Once we have \(\mathsf{D}\), we randomly generate a target dataset and compute the accuracy of the distinguisher as \(\mathsf{Acc}^{\mathsf{D}}_{\mathsf{alg}_0,\mathsf{alg}_1}\). Figure 6 shows that the same distinguishers define different accuracy distributions when computed on different target datasets. This phenomenon is explained by the fact that each target dataset is generated using a different seed, thus making the generator de facto different. This implies that an increased accuracy advantage \(\delta \) for a distinguisher \(\mathsf{D}\) holds exclusively for a specific target: by changing the target, \(\mathsf{D}\)’s advantage changes to a different value \(\delta ^\prime \). We also consider a variation of \(n_{\mathsf{Y}}\) and observe that the peaks are spread differently. This is coherent when considering that a smaller dataset \(\mathsf{Y}^\prime \) is a sample of a bigger one \(\mathsf{Y}\), meaning that \(\mathsf{Y}^\prime \) might not be a statistically significant representation of \(\mathsf{Y}\). This implies the necessity of always using statistically significant target datasets when computing the accuracy distribution.

Timing and space efficiency In total, we generate \(4\cdot (1+t_{\mathsf{X}}) = 4100\) independent datasets, 4 being the number of different primitives considered, and \(\left( {\begin{array}{c}4\\ 2\end{array}}\right) \cdot t_{\mathsf{X}}\cdot 3 = 18432\) distinguishers, 3 being the number of distinct possible \(n_{\mathsf{X}}\) values. Each distinguisher outputs 3 values, 3 being the number of distinct possible \(n_{\mathsf{Y}}\) values, for a total of 55296 measurements. In Fig. 7, we show how the accuracy of the distinguishers is always distributed either as a single peak centred at 0.5 or as two symmetric peaks at \(0.5 \pm \delta \) for some non-negligible \(\delta \in \mathbb {R}_+\) of the order of \(\delta \sim 10^{-3}\). This demonstrates that \(\mathsf{MLCrypto}\) can create a distinguisher \(\mathsf{D}\) with advantage \(\mathsf{Adv}^{\mathsf{D}}= 2\delta \). Even though \(\delta \) might initially be small, when we consider only the distinguishers with accuracy \(0.5+\delta \) we can apply the distinguisher amplification method presented in Proposition 3.1 to increase that advantage. For instance, in this case we have \(\frac{t_{\mathsf{X}}}{2}-1= 511\) distinguishers with an accuracy of \(p \sim 50.1\%\), which implies that the amplification method creates a distinguisher \(\mathsf{D}^\prime \) with accuracy \(p^\prime \sim 51.8\%\).

For completeness, we execute \(\mathsf{MLCrypto}\) over all the NIST DRBG, with training dataset size \(n_{\mathsf{X}}= 2^{13}\) and target dataset size \(n_{\mathsf{Y}}= 2^{16}\), whose accuracy distributions are depicted in Fig. 8. We observe that the accuracy distribution is always symmetric. This means that a blind adversary \(\mathcal {A}\) must face the blind spot paradox, allowing us to empirically confirm that the NIST DRBG are, most probably, hard to distinguish between themselves. On the other hand, if \(\mathcal {A}\) can reconstruct the distribution, then there is a concrete possibility of achieving a non-negligible advantage in distinguishing between the primitives.

5 Conclusions and future work

In this paper, we presented a methodology to use ML for developing practical distinguishers for cryptographic purposes. In particular, we showed how it can be used for solving and analysing instances of distinguishing problems, e.g. we analysed the distinguishers obtained by \(\mathsf{MLCrypto}\) for the cipher suite distinguishing problem between NIST DRBG. We foresee the possibility of applying our tool to cipher suite distinguishing problems for block ciphers, hash functions, message authentication codes and similar primitives. The generality of our method allows it to be used for more practical problems related to side-channel attacks, where the attacker is interested in distinguishing between two primitives based on non-cryptographic measurements, e.g. power consumption and computational timing, and provides a consistent framework for future comparison between distinguishers generated by different ML approaches, e.g. random forests, neural networks or the multi-layer perceptron model [2].