Romulus Spec Round2

Romulus
v1.2
Designers/Submitters (in alphabetical order):
Tetsu Iwata 1 , Mustafa Khairallah 2 , Kazuhiko Minematsu 3 , Thomas Peyrin 2

1
Nagoya University, Japan
[email protected]
2
Nanyang Technological University, Singapore
[email protected],
[email protected]
3
NEC Corporation, Japan
[email protected]
Webpage: https://romulusae.github.io/romulus/
1. Introduction
This document specifies Romulus, an authenticated encryption with associated data (AEAD) scheme
based on a tweakable block cipher (TBC) Skinny. Romulus consists of two families, a nonce-based
AE (NAE) Romulus-N and a nonce misuse-resistant AE (MRAE) Romulus-M.
A TBC was introduced by Liskov et al. at CRYPTO 2002 [30]. Since their inception, TBCs
have been acknowledged as a powerful primitive in that they can be used to construct simple and
highly secure NAE/MRAE schemes, including ΘCB3 [28] and SCT [36]. While these schemes
are computationally efficient (in terms of the number of primitive calls) and have high security,
lightweight applications are not their primary use case, and they are not particularly
suitable for small devices. With this in mind, Romulus aims at lightweight, efficient, and highly
secure NAE and MRAE schemes based on a TBC.
The overall structure of Romulus-N shares similarity in part with a (TBC-based variant of the)
block cipher mode COFB [13, 14], yet we make numerous refinements to achieve our design goal.
Romulus-N generally requires fewer TBC calls than ΘCB3 thanks to the faster MAC
computation for associated data, while the hardware implementation is significantly smaller than
ΘCB3 thanks to the reduced state size and inverse-freeness (i.e., the TBC inverse is not needed). In
fact, Romulus-N’s state size is essentially what is needed for computing the TBC. Moreover, it encrypts
an n-bit plaintext block with just one call of the n-bit block TBC, hence there is no efficiency
loss. Romulus-N is extremely efficient for small messages, which is particularly important in many
lightweight applications, requiring for example only 2 TBC calls to handle one associated data
block and one message block (in comparison, other designs like ΘCB3, OCB3, TAE, and CCM require
from 3 to 5 TBC calls in the same situation). Romulus-N achieves these advantages without a
security penalty, i.e., Romulus-N has the full n-bit security, which is a similar security bound to
ΘCB3.
If we compare Romulus-N with other size-oriented and n-bit secure AE schemes, such as
conventional permutation-based AEs using a 3n-bit permutation with n-bit rate, the state size is
comparable (3n to 3.5n bits). Our advantage is that the underlying cryptographic primitive is
expected to be much more lightweight and/or faster because of its smaller output size (n bits for
the TBC versus 3n bits for the permutation).
In addition, the n-bit security of Romulus-N is proved in the standard model, which provides a
high level of assurance for security not only quantitatively but also qualitatively. To elaborate a bit
more, with a security proof in the standard model, one can precisely connect the security status
of the primitive to the overall security of the mode that uses this primitive. In our case, for each
of the members of Romulus, the best attack on it implies a chosen-plaintext attack (CPA) in the
single-key setting against Skinny, i.e., unless Skinny is broken by CPA adversaries in the single-key
setting, Romulus indeed maintains the claimed n-bit security. Such a guarantee is not possible
with non-standard models and it is often not easy to deduce the impact of a found “flaw” of the
primitive to the security of the mode. In a more general context, this gap between the proof and
the actual security is best exemplified by “uninstantiable” Random Oracle-Model schemes [5, 11].
To evaluate the security of Romulus, with the standard model proof, we can focus on the security
evaluation of Skinny, while this type of focus is not possible in schemes with proofs in non-standard
models.
Another interesting feature of Romulus-N is that it can reduce area depending on the use case,
without harming security. If it is enough to have a relatively short nonce or a short counter (or
both), which is common in low-power networks, we can save area by truncating the tweak
length. This is possible because Skinny allows the area to be reduced if a part of its tweak is never
used. A member of Romulus-N (Romulus-N2) particularly benefits from this feature. Note that this
type of area reduction is not possible with conventional permutation-based AE schemes: they only
offer a throughput/security trade-off.
Romulus-M follows the general construction of MRAE called
SIV [39]. Romulus-M reuses the components of Romulus-N as much as possible, and Romulus-M
is simply obtained by processing the message twice with Romulus-N. This allows a faster and smaller
operation than the TBC-based MRAE SCT, yet we maintain the strong security features of SCT. That
is, Romulus-M achieves n-bit security against nonce-respecting adversaries and n/2-bit security
against nonce-misusing adversaries. Moreover, Romulus-M enjoys a useful feature called graceful
degradation, introduced with SCT. This ensures that the full n-bit security is almost retained if the
number of nonce repetitions at encryption is limited. Thanks to the shared components, most of
the advantages of Romulus-N mentioned above also hold for Romulus-M.
We present a detailed comparison of Romulus with other AE candidates in Section 6.
As the underlying TBC, we adopt Skinny proposed at CRYPTO 2016 [2]. The security of this
TBC has been extensively studied, and it has attractive implementation characteristics.
Organization of the document. In Section 2, we first introduce the basic notations and the
notion of tweakable block cipher, followed by the list of parameters for Romulus, the recommended
parameter sets, and the specification of TBC Skinny. In the last part of Section 2, we specify two
families of Romulus, Romulus-N and Romulus-M. We present our security claims in Section 3 and
show our security analysis including the provable security bounds and the status of computational
security of Skinny in Section 4. In Section 5, we describe the desirable features of Romulus. The
design rationale behind our schemes, including some details of the modes and the choice of the TBC, is
presented in Section 6. Finally, we show some implementation aspects of Romulus in Section 7.
2. Specification
2.1 Notations
Let $\{0,1\}^*$ be the set of all finite bit strings, including the empty string $\varepsilon$. For $X \in \{0,1\}^*$, let $|X|$
denote its bit length. Here $|\varepsilon| = 0$. For integer $n \ge 0$, let $\{0,1\}^n$ be the set of $n$-bit strings, and let
$\{0,1\}^{\le n} = \bigcup_{i=0,\dots,n} \{0,1\}^i$, where $\{0,1\}^0 = \{\varepsilon\}$. Let $\llbracket n \rrbracket = \{1,\dots,n\}$ and $\llbracket n \rrbracket_0 = \{0,1,\dots,n-1\}$.
For two bit strings $X$ and $Y$, $X \,\|\, Y$ is their concatenation. We also write this as $XY$ if it is
clear from the context. Let $0^i$ ($1^i$) be the string of $i$ zero bits ($i$ one bits), and for instance we write
$10^i$ for $1 \,\|\, 0^i$. Bitwise XOR of two variables $X$ and $Y$ is denoted by $X \oplus Y$, where $|X| = |Y| = c$
for some positive integer $c$. We write $\mathrm{msb}_x(X)$ (resp. $\mathrm{lsb}_x(X)$) to denote the truncation of $X$ to
its $x$ most (resp. least) significant bits. See the “Endian” paragraph below.
Padding. For $X \in \{0,1\}^{\le l}$ whose length is a multiple of 8 (i.e., a byte string), let
$$
\mathrm{pad}_l(X) =
\begin{cases}
X & \text{if } |X| = l, \\
X \,\|\, 0^{l-|X|-8} \,\|\, \mathrm{len}_8(X) & \text{if } 0 \le |X| < l,
\end{cases}
$$
where $\mathrm{len}_8(X)$ denotes the one-byte encoding of the byte-length of $X$. Here, $\mathrm{pad}_l(\varepsilon) = 0^l$. When
$l = 128$, $\mathrm{len}_8(X)$ has 16 variations (i.e., byte length 0 to 15), and we encode it to the last 4 bits of
$\mathrm{len}_8(X)$ (for example, $\mathrm{len}_8(11) = 00001011$). The case $l = 64$ is treated similarly, by using the
last 3 bits.
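As an illustration of the padding rule, here is a minimal Python sketch operating on byte strings. The function name `pad` and the convention of passing the block length in bytes (`l_bytes` = l/8) are choices made for this example only and are not part of the specification.

```python
def pad(x: bytes, l_bytes: int = 16) -> bytes:
    """Sketch of pad_l(X) for byte strings, with l_bytes = l / 8."""
    if len(x) > l_bytes:
        raise ValueError("input is longer than one block")
    if len(x) == l_bytes:
        # A full block is returned unchanged.
        return x
    # X || 0^(l - |X| - 8) || len_8(X): zero bytes, then one length byte.
    return x + bytes(l_bytes - len(x) - 1) + bytes([len(x)])
```

For instance, an 11-byte input is extended to 16 bytes whose last byte is 0x0B, and the empty string maps to the all-zero block.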
Parsing. For $X \in \{0,1\}^*$, let $|X|_n = \max\{1, \lceil |X|/n \rceil\}$. Let $(X[1], \dots, X[x]) \xleftarrow{n} X$ be the parsing
of $X$ into $n$-bit blocks. Here $X[1] \,\|\, X[2] \,\|\, \dots \,\|\, X[x] = X$ and $x = |X|_n$. When $X = \varepsilon$, we have
$X[1] \xleftarrow{n} X$ and $X[1] = \varepsilon$. Note in particular that $|\varepsilon|_n = 1$.
Alternating Parsing. Let $n$ and $t$ be positive integers larger than 8. For $X \in \{0,1\}^*$, let
$(X[1], \dots, X[x]) \xleftarrow{n,t} X$ be the parsing of $X$ into $n$-bit blocks and $t$-bit blocks in an alternating order.
That is, we have $X[1] \,\|\, X[2] \,\|\, \dots \,\|\, X[x] = X$, where $|X[i]| = n$ for any odd $i \in \{1, \dots, x-1\}$,
$|X[i]| = t$ for any even $i \in \{1, \dots, x-1\}$, $|X[x]| \in \llbracket n \rrbracket$ if $x$ is odd, and $|X[x]| \in \llbracket t \rrbracket$ if $x$ is even.
When $X \ne \varepsilon$, $x$ is determined as
$$
x =
\begin{cases}
2\lfloor |X|/(n+t) \rfloor & \text{if } |X| > 0 \text{ and } |X| \bmod (n+t) = 0, \\
2\lfloor |X|/(n+t) \rfloor + 1 & \text{if } 1 \le |X| \bmod (n+t) \le n, \\
2\lfloor |X|/(n+t) \rfloor + 2 & \text{if } n < |X| \bmod (n+t) < n+t.
\end{cases}
$$
When $X = \varepsilon$, $X[1] \xleftarrow{n,t} X$ (thus $x = 1$) and $X[1] = \varepsilon$.
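The alternating parsing can be sketched in a few lines of Python. This is only an illustration on byte strings; the name `alt_parse` and the byte-based arguments `n_bytes` and `t_bytes` are assumptions of this example.

```python
def alt_parse(x: bytes, n_bytes: int, t_bytes: int) -> list:
    """Split x into blocks of n, t, n, t, ... bytes; the last block may be short."""
    blocks, pos = [], 0
    while pos < len(x):
        # Odd-numbered blocks (1st, 3rd, ...) have n bytes, even-numbered ones t bytes.
        size = n_bytes if len(blocks) % 2 == 0 else t_bytes
        blocks.append(x[pos:pos + size])
        pos += size
    # The empty string is parsed into a single empty block (x = 1).
    return blocks or [b""]
```

With n = 128 and t = 96 (16 and 12 bytes), a 29-byte input yields three blocks of 16, 12 and 1 bytes, which agrees with the second case of the formula for x.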
Galois Field. An element $a$ in the Galois field $\mathrm{GF}(2^n)$ will be interchangeably represented as an
$n$-bit string $a_{n-1} \dots a_1 a_0$, a formal polynomial $a_{n-1}x^{n-1} + \dots + a_1 x + a_0$, or an integer $\sum_{i=0}^{n-1} a_i 2^i$.
Matrix. Let G be an n × n binary matrix defined over GF(2). For X ∈ {0, 1}n , let G(X) denote
the matrix-vector multiplication over GF(2), where X is interpreted as a column vector. We may
write G · X instead of G(X).
Endian. We employ little endian for byte ordering: an $n$-bit string $X$ is received as
$$X_7 X_6 \dots X_0 \,\|\, X_{15} X_{14} \dots X_8 \,\|\, \dots \,\|\, X_{n-1} X_{n-2} \dots X_{n-8},$$
where $X_i$ denotes the $(i+1)$-st bit of $X$ (for $i \in \llbracket n \rrbracket_0$). Therefore, when $c$ is a multiple of 8 and
$X$ is a byte string, $\mathrm{msb}_c(X)$ and $\mathrm{lsb}_c(X)$ denote the last (rightmost) $c/8$ bytes of $X$ and the first
(leftmost) $c/8$ bytes of $X$, respectively. For example, $\mathrm{lsb}_{16}(X) = (X_7 X_6 \dots X_0 \,\|\, X_{15} X_{14} \dots X_8)$ and
$\mathrm{msb}_8(X) = (X_{n-1} X_{n-2} \dots X_{n-8})$ with the above $X$. Since our specification is defined over byte
strings, we only consider the above case for the msb and lsb functions (i.e., the subscript $c$ is always a
multiple of 8).
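(Tweakable) Block Cipher. A tweakable block cipher (TBC) is a keyed function $\widetilde{E} : \mathcal{K} \times \mathcal{T}_W \times \mathcal{M} \to \mathcal{M}$,
where $\mathcal{K}$ is the key space, $\mathcal{T}_W$ is the tweak space, and $\mathcal{M} = \{0,1\}^n$ is the message space,
such that for any $(K, T_w) \in \mathcal{K} \times \mathcal{T}_W$, $\widetilde{E}(K, T_w, \cdot)$ is a permutation over $\mathcal{M}$. We interchangeably
write $\widetilde{E}(K, T_w, M)$ or $\widetilde{E}_K(T_w, M)$ or $\widetilde{E}_K^{T_w}(M)$. When $\mathcal{T}_W$ is a singleton, it is essentially a block cipher
and is simply written as $E : \mathcal{K} \times \mathcal{M} \to \mathcal{M}$.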
2.2 Parameters
Romulus has the following parameters:
• Nonce length nl ∈ {96, 128}.
• Key length k = 128.
• Message block length n = 128.
• Counter bit length d ∈ {24, 56, 48}.
• AD block length n + t, where t ∈ {96, 128}.
• Tag length τ = 128.
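• A TBC $\widetilde{E} : \mathcal{K} \times \overline{\mathcal{T}} \times \mathcal{M} \to \mathcal{M}$, where $\mathcal{K} = \{0,1\}^k$, $\mathcal{M} = \{0,1\}^n$, and $\overline{\mathcal{T}} = \mathcal{T} \times \mathcal{B} \times \mathcal{D}$. Here,
$\mathcal{T} = \{0,1\}^t$, $\mathcal{D} = \llbracket 2^d - 1 \rrbracket_0$, and $\mathcal{B} = \llbracket 256 \rrbracket_0$ for parameters $t$ and $d$, and $\mathcal{B}$ is also represented
as a byte (see Section 2.5.1). For tweak $\overline{T} = (T, B, D) \in \overline{\mathcal{T}}$, $T$ is always assumed to be a byte
string including $\varepsilon$, and $t$ is a multiple of 8. $\widetilde{E}$ is either Skinny-128-384 or Skinny-128-256 with
appropriate tweakey encoding functions as described in Section 2.4.
For $\widetilde{E}$, $T$ is used to process the nonce or an AD block, $D$ is used for the counter, and $B$ is for domain
separation, i.e., deriving a small number of independent instances.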
While our submission fixes τ = 128, a tag for NAE schemes can be truncated if needed, at the
cost of decreased security against forgery. See Section 4.
NAE and MRAE families. Romulus has two families, Romulus-N and Romulus-M, and each
family consists of several members (the sets of parameters). The former implements nonce-based
AE (NAE) secure against Nonce-respecting adversaries, and the latter implements nonce Misuse-
resistant AE (MRAE) introduced by Rogaway and Shrimpton [39]. The name Romulus stands for
the set of two families.
Table 2.1: Members of Romulus.

Family      Name         $\widetilde{E}$   k     nl    n     t     d    τ
Romulus-N   Romulus-N1   Skinny-128-384    128   128   128   128   56   128
Romulus-N   Romulus-N2   Skinny-128-384    128   96    128   96    48   128
Romulus-N   Romulus-N3   Skinny-128-256    128   96    128   96    24   128
Romulus-M   Romulus-M1   Skinny-128-384    128   128   128   128   56   128
Romulus-M   Romulus-M2   Skinny-128-384    128   96    128   96    48   128
Romulus-M   Romulus-M3   Skinny-128-256    128   96    128   96    24   128
2.3 Recommended Parameter Sets

We present our members (sets of parameters) in Table 2.1. The primary member of our submission
is Romulus-N1. Members except Romulus-N3 and Romulus-M3 conform to the requirements for a
primary member with respect to key length (minimum 128 bits), nonce length (minimum 96 bits),
tag length (minimum 64 bits), and maximum input length (minimum $2^{50} - 1$ bytes). Romulus-N3
and Romulus-M3 conform to the requirements for a primary member except that they have a 24-bit
counter, in return for better efficiency. See Section 6 for our justification of length limits.
2.4 The Tweakable Block Cipher Skinny

In this section, we will recall the Skinny family of tweakable block ciphers [2]. In our submission,
we will use two members of the family: Skinny-128-384 and Skinny-128-256.
Skinny Versions.
The lightweight block ciphers of the Skinny family have 64-bit and 128-bit block versions. However,
we will only use the $n = 128$ bit versions here. The internal state is viewed as a 4 × 4 square array
of cells, where each cell is a byte. We denote by $IS_{i,j}$ the cell of the internal state located at Row $i$
and Column $j$ (counting starting from 0). One can also view this 4 × 4 square array of cells as a
vector of cells by concatenating the rows. Thus, we denote with a single subscript $IS_i$ the cell of
the internal state located at Position $i$ in this vector (counting starting from 0) and we have that
$IS_{i,j} = IS_{4 \cdot i + j}$.
Skinny follows the TWEAKEY framework from [25] and thus takes a tweakey input instead of
a key or a key/tweak pair. The Skinny family of lightweight block ciphers has three main tweakey
size versions, but we will use only two of them: for a block size $n$, we will use versions with tweakey
size $t = 2n$ and $t = 3n$. We denote by $z = t/n$ the tweakey size to block size ratio. The tweakey state
is also viewed as a collection of $z$ 4 × 4 square arrays of cells. We denote these arrays $TK1$ and
$TK2$ when $z = 2$, and $TK1$, $TK2$ and $TK3$ when $z = 3$. Moreover, we denote by $TKz_{i,j}$ the cell of
the tweakey state located at Row $i$ and Column $j$ of the $z$-th cell array. As for the internal state,
we extend this notation to a vector view with a single subscript: $TK1_i$, $TK2_i$ and $TK3_i$. Moreover,
we define the adversarial model SK (resp. TK1, TK2 or TK3) where the attacker cannot (resp.
can) introduce differences in the tweakey state.
Initialization.
The cipher receives a plaintext $m = m_0 \| m_1 \| \cdots \| m_{14} \| m_{15}$, where the $m_i$ are bytes. The initialization
of the cipher’s internal state is performed by simply setting $IS_i = m_i$ for $0 \le i \le 15$:
$$
IS =
\begin{pmatrix}
m_0 & m_1 & m_2 & m_3 \\
m_4 & m_5 & m_6 & m_7 \\
m_8 & m_9 & m_{10} & m_{11} \\
m_{12} & m_{13} & m_{14} & m_{15}
\end{pmatrix}
$$
This is the initial value of the cipher internal state. Note that the state is loaded row-wise
rather than in the column-wise fashion we have come to expect from the AES; this is a more
hardware-friendly choice, as pointed out in [34].
The cipher receives a tweakey input $tk = tk_0 \| tk_1 \| \cdots \| tk_{16z-1}$, where the $tk_i$ are 8-bit
cells. The initialization of the cipher’s tweakey state is performed by simply setting, for $0 \le i \le 15$,
$TK1_i = tk_i$ and $TK2_i = tk_{16+i}$ when $z = 2$, and $TK1_i = tk_i$, $TK2_i = tk_{16+i}$ and
$TK3_i = tk_{32+i}$ when $z = 3$. We note that the tweakey states are loaded row-wise.
The Round Function.

Skinny-128-256 has 48 rounds and Skinny-128-384 has 56 rounds. One encryption round is composed
of five operations in the following order: SubCells, AddConstants, AddRoundTweakey, ShiftRows
and MixColumns (see illustration in Figure 2.1). Note that no whitening key is used in Skinny.
Figure 2.1: The Skinny round function applies five different transformations: SubCells (SC),
AddConstants (AC), AddRoundTweakey (ART), ShiftRows (SR) and MixColumns (MC).
SubCells. An 8-bit Sbox is applied to every cell of the cipher internal state. The action of this
Sbox is given in hexadecimal notation in Table 2.2.
Note that $S_8$ can also be described with eight NOR and eight XOR operations, as depicted
in Figure 2.2. If $x_0, \dots, x_7$ represent the eight input bits of the Sbox ($x_0$ being the least
significant bit), it applies the following transformation on the 8-bit state, where the overline
denotes negation (so each term is a NOR):
$$(x_7, x_6, x_5, x_4, x_3, x_2, x_1, x_0) \to (x_7, x_6, x_5, x_4 \oplus \overline{(x_7 \lor x_6)}, x_3, x_2, x_1, x_0 \oplus \overline{(x_3 \lor x_2)}),$$
followed by the bit permutation
$$(x_7, x_6, x_5, x_4, x_3, x_2, x_1, x_0) \longrightarrow (x_2, x_1, x_7, x_6, x_4, x_0, x_3, x_5),$$
repeating this process four times, except for the last iteration where there is just a bit swap
between $x_1$ and $x_2$.
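The bit-level description of the Sbox can be checked against Table 2.2 with a short Python sketch. The function name `sbox8` is chosen for this example, and the code assumes the NOR reading of the transformation (the text speaks of eight NOR operations); under that reading the first two and the last table entries are reproduced.

```python
def sbox8(x: int) -> int:
    """Bit-level sketch of the Skinny 8-bit Sbox (four NOR/XOR iterations)."""
    for iteration in range(4):
        b = [(x >> i) & 1 for i in range(8)]  # b[i] holds bit x_i
        b[4] ^= 1 ^ (b[7] | b[6])             # x4 ^= NOR(x7, x6)
        b[0] ^= 1 ^ (b[3] | b[2])             # x0 ^= NOR(x3, x2)
        if iteration < 3:
            # (x7,...,x0) -> (x2, x1, x7, x6, x4, x0, x3, x5), listed here LSB first
            b = [b[5], b[3], b[0], b[4], b[6], b[7], b[1], b[2]]
        else:
            # Last iteration: only swap x1 and x2.
            b[1], b[2] = b[2], b[1]
        x = sum(bit << i for i, bit in enumerate(b))
    return x
```

For example, `sbox8(0x00)` gives 0x65 and `sbox8(0x01)` gives 0x4c, the first two entries of the table.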
AddConstants. A 6-bit affine LFSR, whose state is denoted $(rc_5, rc_4, rc_3, rc_2, rc_1, rc_0)$ (with $rc_0$
being the least significant bit), is used to generate round constants. Its update function is
defined as:
$$(rc_5 \| rc_4 \| rc_3 \| rc_2 \| rc_1 \| rc_0) \to (rc_4 \| rc_3 \| rc_2 \| rc_1 \| rc_0 \| rc_5 \oplus rc_4 \oplus 1).$$
Table 2.2: 8-bit Sbox $S_8$ used in Skinny when s = 8.
uint8_t S8[256] = {
0x65,0x4c,0x6a,0x42,0x4b,0x63,0x43,0x6b,0x55,0x75,0x5a,0x7a,0x53,0x73,0x5b,0x7b,
0x35,0x8c,0x3a,0x81,0x89,0x33,0x80,0x3b,0x95,0x25,0x98,0x2a,0x90,0x23,0x99,0x2b,
0xe5,0xcc,0xe8,0xc1,0xc9,0xe0,0xc0,0xe9,0xd5,0xf5,0xd8,0xf8,0xd0,0xf0,0xd9,0xf9,
0xa5,0x1c,0xa8,0x12,0x1b,0xa0,0x13,0xa9,0x05,0xb5,0x0a,0xb8,0x03,0xb0,0x0b,0xb9,
0x32,0x88,0x3c,0x85,0x8d,0x34,0x84,0x3d,0x91,0x22,0x9c,0x2c,0x94,0x24,0x9d,0x2d,
0x62,0x4a,0x6c,0x45,0x4d,0x64,0x44,0x6d,0x52,0x72,0x5c,0x7c,0x54,0x74,0x5d,0x7d,
0xa1,0x1a,0xac,0x15,0x1d,0xa4,0x14,0xad,0x02,0xb1,0x0c,0xbc,0x04,0xb4,0x0d,0xbd,
0xe1,0xc8,0xec,0xc5,0xcd,0xe4,0xc4,0xed,0xd1,0xf1,0xdc,0xfc,0xd4,0xf4,0xdd,0xfd,
0x36,0x8e,0x38,0x82,0x8b,0x30,0x83,0x39,0x96,0x26,0x9a,0x28,0x93,0x20,0x9b,0x29,
0x66,0x4e,0x68,0x41,0x49,0x60,0x40,0x69,0x56,0x76,0x58,0x78,0x50,0x70,0x59,0x79,
0xa6,0x1e,0xaa,0x11,0x19,0xa3,0x10,0xab,0x06,0xb6,0x08,0xba,0x00,0xb3,0x09,0xbb,
0xe6,0xce,0xea,0xc2,0xcb,0xe3,0xc3,0xeb,0xd6,0xf6,0xda,0xfa,0xd3,0xf3,0xdb,0xfb,
0x31,0x8a,0x3e,0x86,0x8f,0x37,0x87,0x3f,0x92,0x21,0x9e,0x2e,0x97,0x27,0x9f,0x2f,
0x61,0x48,0x6e,0x46,0x4f,0x67,0x47,0x6f,0x51,0x71,0x5e,0x7e,0x57,0x77,0x5f,0x7f,
0xa2,0x18,0xae,0x16,0x1f,0xa7,0x17,0xaf,0x01,0xb2,0x0e,0xbe,0x07,0xb7,0x0f,0xbf,
0xe2,0xca,0xee,0xc6,0xcf,0xe7,0xc7,0xef,0xd2,0xf2,0xde,0xfe,0xd7,0xf7,0xdf,0xff
};
Figure 2.2: Construction of the Sbox $S_8$.
The six bits are initialized to zero, and updated before use in a given round. The bits from
the LFSR are arranged into a 4 × 4 array (only the first column of the state is affected by the
LFSR bits), depending on the size of the internal state:
$$
\begin{pmatrix}
c_0 & 0 & 0 & 0 \\
c_1 & 0 & 0 & 0 \\
c_2 & 0 & 0 & 0 \\
0 & 0 & 0 & 0
\end{pmatrix},
$$
with $c_2 = \texttt{0x2}$ and
$$(c_0, c_1) = (rc_3 \| rc_2 \| rc_1 \| rc_0,\ 0 \| 0 \| rc_5 \| rc_4) \quad \text{when } s = 4,$$
$$(c_0, c_1) = (0 \| 0 \| 0 \| 0 \| rc_3 \| rc_2 \| rc_1 \| rc_0,\ 0 \| 0 \| 0 \| 0 \| 0 \| 0 \| rc_5 \| rc_4) \quad \text{when } s = 8.$$
The round constants are combined with the state, respecting array positioning, using bitwise
exclusive-or. The values of the $(rc_5, rc_4, rc_3, rc_2, rc_1, rc_0)$ constants for each round are given
in the table below, encoded to byte values for each round, with $rc_0$ being the least significant
bit.

Rounds    Constants
1 - 16    01,03,07,0F,1F,3E,3D,3B,37,2F,1E,3C,39,33,27,0E
17 - 32   1D,3A,35,2B,16,2C,18,30,21,02,05,0B,17,2E,1C,38
33 - 48   31,23,06,0D,1B,36,2D,1A,34,29,12,24,08,11,22,04
49 - 62   09,13,26,0C,19,32,25,0A,15,2A,14,28,10,20
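The round constant table can be regenerated from the LFSR update rule. The following Python sketch is an illustration only; the names `round_constants` and `constant_column` are hypothetical helpers introduced here.

```python
def round_constants(rounds: int) -> list:
    """Clock the 6-bit affine LFSR once per round, starting from the all-zero state."""
    rc, out = 0, []
    for _ in range(rounds):
        feedback = ((rc >> 5) ^ (rc >> 4) ^ 1) & 1   # rc5 xor rc4 xor 1
        rc = ((rc << 1) & 0x3F) | feedback           # shift left, insert feedback
        out.append(rc)
    return out

def constant_column(rc: int) -> tuple:
    """Return (c0, c1, c2) for s = 8: low four bits, top two bits, and 0x2."""
    return (rc & 0x0F, (rc >> 4) & 0x03, 0x02)
```

Calling `round_constants(16)` reproduces the first row of the table (01, 03, 07, 0F, ...), and `constant_column` gives the three bytes that are XORed into the first column of the state.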
AddRoundTweakey. The first and second rows of all tweakey arrays are extracted and bitwise
exclusive-ored to the cipher internal state, respecting the array positioning. More formally,
for $i \in \{0, 1\}$ and $j \in \{0, 1, 2, 3\}$, we have:
• $IS_{i,j} = IS_{i,j} \oplus TK1_{i,j} \oplus TK2_{i,j}$ when $z = 2$,
• $IS_{i,j} = IS_{i,j} \oplus TK1_{i,j} \oplus TK2_{i,j} \oplus TK3_{i,j}$ when $z = 3$.
Figure 2.3: The tweakey schedule in Skinny. Each tweakey word $TK1$, $TK2$ and $TK3$ (if any)
follows a similar transformation update, except that no LFSR is applied to $TK1$.
Then, the tweakey arrays are updated as follows (this tweakey schedule is illustrated in
Figure 2.3). First, a permutation $P_T$ is applied on the cell positions of all tweakey arrays:
for all $0 \le i \le 15$, we set $TK1_i \leftarrow TK1_{P_T[i]}$ with
$$P_T = [9, 15, 8, 13, 10, 14, 12, 11, 0, 1, 2, 3, 4, 5, 6, 7],$$
and similarly for $TK2$ when $z = 2$, and for $TK2$ and $TK3$ when $z = 3$. This corresponds to
the following reordering of the matrix cells, where indices are taken row-wise:
$$(0, \dots, 15) \overset{P_T}{\longmapsto} (9, 15, 8, 13, 10, 14, 12, 11, 0, 1, 2, 3, 4, 5, 6, 7)$$
Finally, every cell of the first and second rows of $TK2$ and $TK3$ (for the Skinny versions
where $TK2$ and $TK3$ are used) is individually updated with an LFSR. The LFSRs used are
given in Table 2.3 ($x_0$ stands for the LSB of the cell).
Table 2.3: The LFSRs used in Skinny to update the tweakey cells. The TK column gives
the tweakey word to which the LFSR is applied, and s is the cell size in bits.

TK    s   LFSR
TK2   8   $(x_7 \| x_6 \| x_5 \| x_4 \| x_3 \| x_2 \| x_1 \| x_0) \to (x_6 \| x_5 \| x_4 \| x_3 \| x_2 \| x_1 \| x_0 \| x_7 \oplus x_5)$
TK3   8   $(x_7 \| x_6 \| x_5 \| x_4 \| x_3 \| x_2 \| x_1 \| x_0) \to (x_0 \oplus x_6 \| x_7 \| x_6 \| x_5 \| x_4 \| x_3 \| x_2 \| x_1)$
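One step of the tweakey schedule (cell permutation followed by the per-cell LFSR on the first two rows) could be sketched in Python as follows. The helper names `lfsr_tk2`, `lfsr_tk3` and `update_tweakey` are assumptions of this example; each tweakey array is modelled as a list of 16 byte values in row-wise order.

```python
PT = [9, 15, 8, 13, 10, 14, 12, 11, 0, 1, 2, 3, 4, 5, 6, 7]

def lfsr_tk2(x: int) -> int:
    """(x7..x0) -> (x6, x5, x4, x3, x2, x1, x0, x7 xor x5)."""
    return ((x << 1) & 0xFE) | (((x >> 7) ^ (x >> 5)) & 1)

def lfsr_tk3(x: int) -> int:
    """(x7..x0) -> (x0 xor x6, x7, x6, x5, x4, x3, x2, x1)."""
    return (x >> 1) | (((x ^ (x >> 6)) & 1) << 7)

def update_tweakey(tk: list, lfsr=None) -> list:
    """Apply P_T, then the given LFSR to the cells of the first two rows."""
    tk = [tk[PT[i]] for i in range(16)]
    if lfsr is not None:          # no LFSR is applied to TK1
        tk[:8] = [lfsr(cell) for cell in tk[:8]]
    return tk
```

A usage example would be `update_tweakey(tk1)` for TK1, `update_tweakey(tk2, lfsr_tk2)` for TK2 and `update_tweakey(tk3, lfsr_tk3)` for TK3. As a sanity check on the two update rules, the TK3 map undoes the TK2 map on every byte value.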
ShiftRows. As in AES, in this layer the rows of the cipher state cell array are rotated, but
to the right. More precisely, the second, third, and fourth cell rows are rotated by 1, 2
and 3 positions to the right, respectively. In other words, a permutation $P$ is applied on the
cell positions of the cipher internal state cell array: for all $0 \le i \le 15$, we set $IS_i \leftarrow IS_{P[i]}$
with
$$P = [0, 1, 2, 3, 7, 4, 5, 6, 10, 11, 8, 9, 13, 14, 15, 12].$$
MixColumns. Each column of the cipher internal state array is multiplied by the following binary
matrix $\mathbf{M}$:
$$
\mathbf{M} =
\begin{pmatrix}
1 & 0 & 1 & 1 \\
1 & 0 & 0 & 0 \\
0 & 1 & 1 & 0 \\
1 & 0 & 1 & 0
\end{pmatrix}.
$$
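The two linear layers, ShiftRows and MixColumns, can be illustrated with the Python sketch below. It models the state as a list of 16 byte values in row-wise order; the function names are chosen for this example only.

```python
P = [0, 1, 2, 3, 7, 4, 5, 6, 10, 11, 8, 9, 13, 14, 15, 12]
M = [[1, 0, 1, 1],
     [1, 0, 0, 0],
     [0, 1, 1, 0],
     [1, 0, 1, 0]]

def shift_rows(state: list) -> list:
    """IS_i <- IS_P[i]: rows 1, 2, 3 are rotated right by 1, 2, 3 cells."""
    return [state[P[i]] for i in range(16)]

def mix_columns(state: list) -> list:
    """Multiply each column by the binary matrix M (addition is XOR)."""
    out = [0] * 16
    for col in range(4):
        for row in range(4):
            acc = 0
            for k in range(4):
                if M[row][k]:
                    acc ^= state[4 * k + col]
            out[4 * row + col] = acc
    return out
```

For a column (a, b, c, d) read from top to bottom, `mix_columns` returns (a ⊕ c ⊕ d, a, b ⊕ c, a ⊕ c), which is exactly the product with the matrix above.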
The final value of the internal state array provides the ciphertext with cells being unpacked in
the same way as the packing during initialization. Test vectors for Skinny-128-256 or Skinny-128-384
are provided below.
/* Skinny-128-256 */
Key:        009cec81605d4ac1d2ae9e3085d7a1f3
            1ac123ebfc00fddcf01046ceeddfcab3
Plaintext:  3a0c47767a26a68dd382a695e7022e25
Ciphertext: b731d98a4bde147a7ed4a6f16b9b587f

/* Skinny-128-384 */
Key:        df889548cfc7ea52d296339301797449
            ab588a34a47f1ab2dfe9c8293fbea9a5
            ab1afac2611012cd8cef952618c3ebe8
Plaintext:  a3994b66ad85a3459f44e92b08f550cb
Ciphertext: 94ecf589e2017c601b38c6346a10dcfa
2.5 The Authenticated Encryption Romulus

2.5.1 The Tweakey Encoding
Domain separation. We will use a domain separation byte $B$ to ensure appropriate independence
between the tweakable block cipher calls and the various versions of Romulus. Let
$B = (b_7 \| b_6 \| b_5 \| b_4 \| b_3 \| b_2 \| b_1 \| b_0)$ be the bitwise representation of this byte, where $b_7$ is the MSB and
$b_0$ is the LSB (see also Figure 2.4). Then, we have the following:
- $b_7 b_6 b_5$ will specify the parameter sets. They are fixed to:
• 000 for Romulus-N1
• 001 for Romulus-M1
Note that all nonce-respecting modes have $b_5 = 0$ and all nonce-misuse resistant modes have
$b_5 = 1$.
- $b_4$ is set to 1 once we have handled the last block of data (AD and message chains are treated
separately), to 0 otherwise.
- $b_3$ is set to 1 when we are performing the authentication phase of the operating mode (i.e., when
no ciphertext data is produced), to 0 otherwise. In the special case where $b_5 = 1$ and $b_4 = 1$
(i.e., last block for the nonce-misuse mode), $b_3$ will instead denote if the number of AD
blocks is even ($b_3 = 1$ if that is the case, 0 otherwise).
- $b_2$ is set to 1 when we are handling a message block, to 0 otherwise. Note that in the case of
the misuse-resistant modes, the message blocks will be used during the authentication phase (in
which case we will have $b_3 = 1$ and $b_2 = 1$). In the special case where $b_5 = 1$ and $b_4 = 1$ (i.e.,
last block for the nonce-misuse mode), $b_2$ will instead denote if the number of message blocks
is even ($b_2 = 1$ if that is the case, 0 otherwise).
- $b_1$ is set to 1 when we are handling a padded AD block, to 0 otherwise.
- $b_0$ is set to 1 when we are handling a padded message block, to 0 otherwise.
The reader can refer to Table ?? in the Appendix to obtain the exact specifications of the
domain separation values depending on the various cases.
Figure 2.4: Domain separation when using the tweakable block cipher. Bit layout from MSB to
LSB: $b_7 b_6 b_5$ parameter sets; $b_4$ last block; $b_3$ auth. (or AD even); $b_2$ message block (or M even);
$b_1$ padded AD; $b_0$ padded M.
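To make the bit layout concrete, the following Python sketch assembles the domain separation byte from its fields. The function `domain_byte` and its keyword names are hypothetical and introduced only for this illustration; the checked values are the constants that appear in the Romulus-N and Romulus-M algorithms (8, 24, 26, 4, 20, 21 and 48).

```python
def domain_byte(param: int = 0, last: bool = False, auth: bool = False,
                msg: bool = False, pad_ad: bool = False,
                pad_msg: bool = False) -> int:
    """Assemble B = b7 b6 b5 | b4 | b3 | b2 | b1 | b0 from its fields."""
    if not 0 <= param < 8:
        raise ValueError("param must fit in the three bits b7 b6 b5")
    return ((param << 5) | (int(last) << 4) | (int(auth) << 3)
            | (int(msg) << 2) | (int(pad_ad) << 1) | int(pad_msg))
```

For Romulus-N1 (parameter bits 000), an intermediate AD block gives 8, the last AD block gives 24 or 26 depending on padding, an intermediate message block gives 4, and the last message block gives 20 or 21.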
LFSR. We use LFSRs for counters. For positive integer $c$, $\mathrm{lfsr}_c$ is a one-to-one mapping
$\mathrm{lfsr}_c : \llbracket 2^c - 1 \rrbracket_0 \to \{0,1\}^c \setminus \{0^c\}$ defined as follows. Let $F_c(x)$ be the
lexicographically-first polynomial among the irreducible degree-$c$ polynomials with a minimum
number of coefficients. Specifically, $F_c(x)$ for $c \in \{56, 24\}$ are
$$F_{56}(x) = x^{56} + x^7 + x^4 + x^2 + 1,$$
$$F_{24}(x) = x^{24} + x^4 + x^3 + x + 1,$$
and
$$\mathrm{lfsr}_c(D) = 2^D \bmod F_c(x).$$
Note that we use $\mathrm{lfsr}_c(D)$ as a block counter, so most of the time $D$ changes incrementally
with a step of 1, and this enables $\mathrm{lfsr}_c(D)$ to generate a sequence of $2^c - 1$ pairwise-distinct
values. From an implementation point of view, it should be implemented in the sequence form,
$x_{i+1} = 2 \cdot x_i \bmod F_c(x)$.
Let $(z_{c-1} \| z_{c-2} \| \dots \| z_1 \| z_0)$ denote the state of the $c$-bit LFSR. In our modes, these LFSRs are
initialized to $1 \bmod F_c(x)$, i.e., $(0^7 1 \,\|\, 0^{c-8})$, in little-endian format. Incrementation of the LFSRs is
defined as follows: for $c = 56$,
$$
\begin{aligned}
z_i &\leftarrow z_{i-1} \quad \text{for } i \in \llbracket 56 \rrbracket_0 \setminus \{7, 4, 2, 0\}, \\
z_7 &\leftarrow z_6 \oplus z_{55}, \\
z_4 &\leftarrow z_3 \oplus z_{55}, \\
z_2 &\leftarrow z_1 \oplus z_{55}, \\
z_0 &\leftarrow z_{55}.
\end{aligned}
$$
Similarly for $c = 24$,
$$
\begin{aligned}
z_i &\leftarrow z_{i-1} \quad \text{for } i \in \llbracket 24 \rrbracket_0 \setminus \{4, 3, 1, 0\}, \\
z_4 &\leftarrow z_3 \oplus z_{23}, \\
z_3 &\leftarrow z_2 \oplus z_{23}, \\
z_1 &\leftarrow z_0 \oplus z_{23}, \\
z_0 &\leftarrow z_{23}.
\end{aligned}
$$
Our LFSRs are also called doubling over $\mathrm{GF}(2^c)$ in the context of modes [37].
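The LFSR counter can be sketched in Python by holding the state as an integer whose bit $i$ is $z_i$. The names `TAPS`, `lfsr_step` and `lfsr` are assumptions of this example; the tap masks 0x95 and 0x1B simply encode the low-order terms of $F_{56}$ and $F_{24}$.

```python
# Low-order terms of F_c(x): x^7 + x^4 + x^2 + 1 and x^4 + x^3 + x + 1.
TAPS = {56: 0x95, 24: 0x1B}

def lfsr_step(z: int, c: int) -> int:
    """One doubling step: z <- 2 * z mod F_c(x)."""
    msb = (z >> (c - 1)) & 1
    z = (z << 1) & ((1 << c) - 1)
    return z ^ TAPS[c] if msb else z

def lfsr(d: int, c: int) -> int:
    """State after d steps from the initial value 1 (sequence form; slow for large d)."""
    z = 1
    for _ in range(d):
        z = lfsr_step(z, c)
    return z
```

The little-endian byte form used in the tweakey can be obtained with `lfsr(d, c).to_bytes(c // 8, "little")`; for `d = 0` this gives the byte 0x01 followed by zero bytes, matching the stated initial value.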
Tweakey Encoding. We specify the following tweakey encoding functions for implementing
the TBC $\widetilde{E} : \mathcal{K} \times \overline{\mathcal{T}} \times \mathcal{M} \to \mathcal{M}$ using Skinny-128-256 or Skinny-128-384. The tweakey encoding is a
function
$$\mathrm{encode}_{m,t} : \mathcal{K} \times \overline{\mathcal{T}} \to \mathcal{KT},$$
where $\mathcal{KT} = \{0,1\}^m$ is the tweakey space for either Skinny-128-256 with $m = 256$ or Skinny-128-384
with $m = 384$. As defined earlier, $\overline{\mathcal{T}} = \mathcal{T} \times \mathcal{B} \times \mathcal{D}$, $\mathcal{K} = \{0,1\}^k$ and $\mathcal{T} = \{0,1\}^t$, $\mathcal{D} = \llbracket 2^d - 1 \rrbracket_0$,
$\mathcal{B} = \llbracket 256 \rrbracket_0$.
• Case $(m, t) = (384, 128)$: this variant is used for Romulus-N1 and Romulus-M1. The
encode function is defined as follows:
$$\mathrm{encode}_{384,128}(K, T, B, D) = \mathrm{lfsr}_{56}(D) \,\|\, B \,\|\, 0^{64} \,\|\, T \,\|\, K$$
• Case $(m, t) = (384, 96)$: this variant is used for Romulus-N2 and Romulus-M2. The encode
function is defined as follows:
$$\mathrm{encode}_{384,96}(K, T, B, D) = \mathrm{lfsr}_{24}(D_1) \,\|\, B \,\|\, T \,\|\, K \,\|\, \mathrm{lfsr}_{24}(D_2) \,\|\, 0^{104},$$
where $D_1, D_2 \in \mathcal{D}_s$ with $\mathcal{D}_s = \llbracket 2^{24} - 1 \rrbracket$. The set $\mathcal{D}$ is defined as $\mathcal{D} = \mathcal{D}_s \times \mathcal{D}_s$, and the
components are determined from $D$ as $D_1 = \lfloor D/(2^{24} - 1) \rfloor + 1$ and $D_2 = (D \bmod (2^{24} - 1)) + 1$.
For the first $2^{24} - 1$ cycles starting from $D = 0$ (but note that $D = 1$ is the initial value in our
scheme and $D = 0$ is not used), $D_1$ is fixed to 1 and $D_2$ takes all integers of $\llbracket 2^{24} - 1 \rrbracket$. For
the next $2^{24} - 1$ cycles, $D_1$ is fixed to 2 and $D_2$ takes all values of $\llbracket 2^{24} - 1 \rrbracket$ again, and so on.
We stress that the counter cycle is $(2^{24} - 1)^2$, which is slightly smaller than the original range
of $D$. One can also interpret $D$ as a two-dimensional vector: for example, if $D_1 = 0^{23}1$ and
$D_2 = 0^{20}1^4$, then $D = (0^{23}1 \,\|\, 0^{20}1^4)$. In this case, the initial value of $D$ is $[0^7 1 0^{16}, 0^7 1 0^{16}]$, in
little-endian format.
• Case $(m, t) = (256, 96)$: this variant is used for Romulus-N3 and Romulus-M3. The encode
function is defined as follows:
$$\mathrm{encode}_{256,96}(K, T, B, D) = \mathrm{lfsr}_{24}(D) \,\|\, B \,\|\, T \,\|\, K$$
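The three encoding cases can be sketched in Python as plain byte concatenations. This is an illustration under stated assumptions: the helper `lfsr_bytes` and the function names are hypothetical, the LFSR state is serialized as little-endian bytes, and `T` and `K` are taken as raw byte strings of t/8 and 16 bytes.

```python
TAPS = {56: 0x95, 24: 0x1B}  # low-order terms of F_56 and F_24

def lfsr_bytes(d: int, c: int) -> bytes:
    """Little-endian bytes of the c-bit LFSR after d steps from the state 1."""
    z = 1
    for _ in range(d):  # sequence form; only practical for small d
        msb = (z >> (c - 1)) & 1
        z = (z << 1) & ((1 << c) - 1)
        if msb:
            z ^= TAPS[c]
    return z.to_bytes(c // 8, "little")

def encode_384_128(key: bytes, tweak: bytes, b: int, d: int) -> bytes:
    assert len(key) == 16 and len(tweak) == 16
    return lfsr_bytes(d, 56) + bytes([b]) + bytes(8) + tweak + key

def encode_384_96(key: bytes, tweak: bytes, b: int, d: int) -> bytes:
    assert len(key) == 16 and len(tweak) == 12
    d1 = d // (2**24 - 1) + 1
    d2 = d % (2**24 - 1) + 1
    return (lfsr_bytes(d1, 24) + bytes([b]) + tweak + key
            + lfsr_bytes(d2, 24) + bytes(13))

def encode_256_96(key: bytes, tweak: bytes, b: int, d: int) -> bytes:
    assert len(key) == 16 and len(tweak) == 12
    return lfsr_bytes(d, 24) + bytes([b]) + tweak + key
```

The outputs are 48, 48 and 32 bytes long, matching the 384-bit and 256-bit tweakey sizes of Skinny-128-384 and Skinny-128-256.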
For plaintext $M \in \{0,1\}^n$ and tweak $\overline{T} = (T, B, D) \in \mathcal{T} \times \mathcal{B} \times \mathcal{D}$, $\widetilde{E}_K^{(T,B,D)}(M)$ denotes
encryption of $M$ with the $m$-bit tweakey state $\mathrm{encode}_{m,t}(K, T, B, D)$. The tweakey encoding is always
implicitly applied, hence the counter $D$ is never arithmetic in the tweakey state. To avoid confusion,
we may write $\overline{D}$ (in particular when it appears as a part of a tweak) in order to emphasize that this
is indeed an LFSR counter. One can interpret $\overline{D}$ as the state of the LFSR when clocked $D$ times (but in
that case it is a part of the tweakey state and not a part of the input of encode).
2.5.2 State Update Function

Let $G$ be an $n \times n$ binary matrix defined as an $n/8 \times n/8$ diagonal matrix of $8 \times 8$ binary sub-matrices:
$$
G =
\begin{pmatrix}
G_s & 0 & 0 & \dots & 0 \\
0 & G_s & 0 & \dots & 0 \\
\vdots & & \ddots & & \vdots \\
0 & \dots & 0 & G_s & 0 \\
0 & \dots & 0 & 0 & G_s
\end{pmatrix},
$$
where $0$ here represents the $8 \times 8$ zero matrix, and $G_s$ is an $8 \times 8$ binary matrix, defined as
$$
G_s =
\begin{pmatrix}
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix}.
$$
Alternatively, let $X \in \{0,1\}^n$, where $n$ is a multiple of 8, then the matrix-vector multiplication
$G \cdot X$ can be represented as
$$G \cdot X = (G_s \cdot X[0], G_s \cdot X[1], G_s \cdot X[2], \dots, G_s \cdot X[n/8 - 1]),$$
where
$$G_s \cdot X[i] = (X[i][1], X[i][2], X[i][3], X[i][4], X[i][5], X[i][6], X[i][7], X[i][7] \oplus X[i][0])$$
for all $i \in \llbracket n/8 \rrbracket_0$, such that $(X[0], \dots, X[n/8 - 1]) \xleftarrow{8} X$ and $(X[i][0], \dots, X[i][7]) \xleftarrow{1} X[i]$, for all
$i \in \llbracket n/8 \rrbracket_0$.
The state update function $\rho : \{0,1\}^n \times \{0,1\}^n \to \{0,1\}^n \times \{0,1\}^n$ and its inverse $\rho^{-1} :
\{0,1\}^n \times \{0,1\}^n \to \{0,1\}^n \times \{0,1\}^n$ are defined as
$$\rho(S, M) = (S', C),$$
where $C = M \oplus G(S)$ and $S' = S \oplus M$. Similarly,
$$\rho^{-1}(S, C) = (S', M),$$
where $M = C \oplus G(S)$ and $S' = S \oplus M$. We note that we abuse the notation by writing $\rho^{-1}$, as this
function is only the inverse of $\rho$ with respect to its second argument. For any $(S, M) \in \{0,1\}^n \times \{0,1\}^n$,
if $\rho(S, M) = (S', C)$ holds then $\rho^{-1}(S, C) = (S', M)$. Besides, we remark that $\rho(S, 0^n) = (S, G(S))$
holds.
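The functions $G$, $\rho$ and $\rho^{-1}$ can be sketched on 16-byte strings as follows. The names `g`, `rho` and `rho_inv` are chosen for this example, and the code assumes that $X[i][0]$ is the least significant bit of byte $X[i]$; with the opposite bit order the shift direction would be reversed.

```python
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def g(s: bytes) -> bytes:
    """Apply G_s to each byte: output bit j is x_{j+1} for j < 7, bit 7 is x_7 xor x_0.

    Assumption: X[i][0] is the least significant bit of the byte.
    """
    return bytes((x >> 1) ^ (x & 0x80) ^ ((x << 7) & 0x80) for x in s)

def rho(s: bytes, m: bytes) -> tuple:
    """rho(S, M) = (S xor M, M xor G(S))."""
    return xor(s, m), xor(m, g(s))

def rho_inv(s: bytes, c: bytes) -> tuple:
    """rho^{-1}(S, C) = (S xor M, M) with M = C xor G(S)."""
    m = xor(c, g(s))
    return xor(s, m), m
```

Both properties stated in the text can be checked directly: `rho_inv(s, c)` returns the same next state as `rho(s, m)` together with `m`, and `rho(s, bytes(16))` returns `(s, g(s))`.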
2.5.3 Romulus-N nonce-based AE mode
The specification of Romulus-N is shown in Figure 2.5. Figure 2.6 shows the encryption of Romulus-N.
For completeness, the definition of ρ is also included. Note that the algorithm always assumes
t = nl.
2.5.4 Romulus-M misuse-resistant AE mode

The specification of Romulus-M is shown in Figure 2.7. Figure 2.8 shows the encryption of
Romulus-M. For completeness, the definition of ρ is also included. Note that the algorithm always assumes
t = nl.
2.5.5 Hashing mode

The current specification of Romulus does not contain a cryptographic hashing functionality. If
hashing is needed, there are two hashing schemes based on Skinny-128-384 and Skinny-128-256, called
SKINNY-tk3-Hash and SKINNY-tk2-Hash respectively [4]. Another option is to build Hirose’s
double-block length (DBL) compression function [20] and use it with a mode such as classical
Merkle-Damgård. Both options enable an n-bit secure hash function.
14
Algorithm Romulus-N.Enc_K(N, A, M)
 1. S ← 0^n
 2. (A[1], …, A[a]) ←^{n,t} A
 3. if a mod 2 = 0 then u ← t else n
 4. if |A[a]| < u then w_A ← 26 else 24
 5. A[a] ← pad_u(A[a])
 6. for i = 1 to ⌊a/2⌋
 7.    (S, η) ← ρ(S, A[2i − 1])
 8.    S ← Ẽ_K^{(A[2i], 8, 2i−1)}(S)
 9. end for
10. if a mod 2 = 0 then V ← 0^n else A[a]
11. (S, η) ← ρ(S, V)
12. S ← Ẽ_K^{(N, w_A, a)}(S)
13. (M[1], …, M[m]) ←^{n} M
14. if |M[m]| < n then w_M ← 21 else 20
15. for i = 1 to m − 1
16.    (S, C[i]) ← ρ(S, M[i])
17.    S ← Ẽ_K^{(N, 4, i)}(S)
18. end for
19. M'[m] ← pad_n(M[m])
20. (S, C'[m]) ← ρ(S, M'[m])
21. C[m] ← lsb_{|M[m]|}(C'[m])
22. S ← Ẽ_K^{(N, w_M, m)}(S)
23. (η, T) ← ρ(S, 0^n)
24. C ← C[1] ∥ … ∥ C[m − 1] ∥ C[m]
25. return (C, T)

Algorithm Romulus-N.Dec_K(N, A, C, T)
 1. S ← 0^n
 2. (A[1], …, A[a]) ←^{n,t} A
 3. if a mod 2 = 0 then u ← t else n
 4. if |A[a]| < u then w_A ← 26 else 24
 5. A[a] ← pad_u(A[a])
 6. for i = 1 to ⌊a/2⌋
 7.    (S, η) ← ρ(S, A[2i − 1])
 8.    S ← Ẽ_K^{(A[2i], 8, 2i−1)}(S)
 9. end for
10. if a mod 2 = 0 then V ← 0^n else A[a]
11. (S, η) ← ρ(S, V)
12. S ← Ẽ_K^{(N, w_A, a)}(S)
13. (C[1], …, C[m]) ←^{n} C
14. if |C[m]| < n then w_C ← 21 else 20
15. for i = 1 to m − 1
16.    (S, M[i]) ← ρ^{−1}(S, C[i])
17.    S ← Ẽ_K^{(N, 4, i)}(S)
18. end for
19. S̃ ← (0^{|C[m]|} ∥ msb_{n−|C[m]|}(G(S)))
20. C'[m] ← pad_n(C[m]) ⊕ S̃
21. (S, M'[m]) ← ρ^{−1}(S, C'[m])
22. M[m] ← lsb_{|C[m]|}(M'[m])
23. S ← Ẽ_K^{(N, w_C, m)}(S)
24. (η, T*) ← ρ(S, 0^n)
25. M ← M[1] ∥ … ∥ M[m − 1] ∥ M[m]
26. if T* = T then return M else ⊥

Algorithm ρ(S, M)
 1. C ← M ⊕ G(S)
 2. S' ← S ⊕ M
 3. return (S', C)

Algorithm ρ^{−1}(S, C)
 1. M ← C ⊕ G(S)
 2. S' ← S ⊕ M
 3. return (S', M)

Figure 2.5: The Romulus-N nonce-based AE mode. Lines of [if (statement) then X ← x else
x'] are shorthand for [if (statement) then X ← x else X ← x']. The dummy variable η is
always discarded. We use Romulus-N1 as working example. For other Romulus-N members, the
values of the bits b7 and b6 in the domain separation need to be adapted accordingly.
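The state update ρ, its inverse, and the handling of a partial last block (steps 19 to 22 of decryption) can be illustrated with a short Python sketch. This is not a reference implementation: the byte-wise map `g_byte` and the padding rule in `pad` are assumptions standing in for the definitions of G and pad given earlier in the specification, and the helper names `enc_last_block` and `dec_last_block` are hypothetical. The TBC calls are left out entirely; the sketch only checks that both sides derive the same plaintext and the same next state.

```python
N_BYTES = 16  # n = 128 bits


def g_byte(x: int) -> int:
    """Assumed byte map of G: (x7..x0) -> (x0 ^ x7, x7, x6, ..., x1)."""
    return (x >> 1) | (((x & 1) ^ (x >> 7)) << 7)


def G(s: bytes) -> bytes:
    """Apply the assumed byte map to every byte of the state."""
    return bytes(g_byte(b) for b in s)


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def pad(x: bytes, l: int = N_BYTES) -> bytes:
    """Assumed padding: X || 0...0 || len(X) in the last byte when |X| < l."""
    if len(x) == l:
        return x
    return x + bytes(l - len(x) - 1) + bytes([len(x)])


def rho(s: bytes, m: bytes):
    """rho(S, M): C = M xor G(S), S' = S xor M."""
    return xor(s, m), xor(m, G(s))


def rho_inv(s: bytes, c: bytes):
    """rho^-1(S, C): M = C xor G(S), S' = S xor M."""
    m = xor(c, G(s))
    return xor(s, m), m


def enc_last_block(s: bytes, m_last: bytes):
    """Encryption steps 19-21: pad, apply rho, keep the leftmost |M[m]| bytes."""
    s_next, c_full = rho(s, pad(m_last))
    return s_next, c_full[:len(m_last)]


def dec_last_block(s: bytes, c_last: bytes):
    """Decryption steps 19-22: rebuild the full block C'[m], then apply rho^-1."""
    k = len(c_last)
    s_tilde = bytes(k) + G(s)[k:]          # 0^{|C[m]|} || rightmost bytes of G(S)
    c_full = xor(pad(c_last), s_tilde)
    s_next, m_full = rho_inv(s, c_full)
    return s_next, m_full[:k]
```

Because pad(C[m]) and pad(M[m]) share the same padding bytes, XORing in the rightmost bytes of G(S) recreates exactly the bytes that encryption truncated, which is why decryption never needs the TBC inverse.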
[Figure 2.6 diagram not reproduced. Panels: case a is even; case a is odd; encryption. In the AD
chains, the intermediate TBC calls use domain separation value 8 and the final call uses
w_A ∈ {24, 26}. In the encryption chain, the intermediate TBC calls use domain separation value 4
and the final call uses w_M ∈ {20, 21}.]
Figure 2.6: The Romulus-N nonce-based AE mode. (Top) process of AD with even AD blocks
(Middle) process of AD with odd AD blocks (Bottom) Encryption. We use Romulus-N1 as working
example. For other Romulus-N members, the values of the bits b7 and b6 in the domain separation
need to be adapted accordingly.
Algorithm Romulus-M.Enc_K(N, A, M)
 1. S ← 0^n
 2. (X[1], …, X[a]) ←^{n,t} A
 3. if a mod 2 = 0 then u ← t else n
 4. (X[a + 1], …, X[a + m]) ←^{n+t−u,u} M
 5. if m mod 2 = 0 then v ← u else n + t − u
 6. w ← 48
 7. if |X[a]| < u then w ← w ⊕ 2
 8. if |X[a + m]| < v then w ← w ⊕ 1
 9. if a mod 2 = 0 then w ← w ⊕ 8
10. if m mod 2 = 0 then w ← w ⊕ 4
11. X[a] ← pad_u(X[a])
12. X[a + m] ← pad_v(X[a + m])
13. x ← 40
14. for i = 1 to ⌊(a + m)/2⌋
15.    (S, η) ← ρ(S, X[2i − 1])
16.    if i = ⌊a/2⌋ + 1 then x ← x ⊕ 4
17.    S ← Ẽ_K^{(X[2i], x, 2i−1)}(S)
18. end for
19. if a mod 2 = m mod 2 then
20.    (S, η) ← ρ(S, 0^n)
21. else
22.    (S, η) ← ρ(S, X[a + m])
23. S ← Ẽ_K^{(N, w, a+m)}(S)
24. (η, T) ← ρ(S, 0^n)
25. if M = ε then return (ε, T)
26. S ← T
27. (M[1], …, M[m']) ←^{n} M
28. z ← |M[m']|
29. M[m'] ← pad_n(M[m'])
30. for i = 1 to m'
31.    S ← Ẽ_K^{(N, 36, i−1)}(S)
32.    (S, C[i]) ← ρ(S, M[i])
33. end for
34. C[m'] ← lsb_z(C[m'])
35. C ← C[1] ∥ … ∥ C[m' − 1] ∥ C[m']
36. return (C, T)

Algorithm Romulus-M.Dec_K(N, A, C, T)
 1. if C = ε then M ← ε
 2. else
 3.    S ← T
 4.    (C[1], …, C[m']) ←^{n} C
 5.    z ← |C[m']|
 6.    C[m'] ← pad_n(C[m'])
 7.    for i = 1 to m'
 8.       S ← Ẽ_K^{(N, 36, i−1)}(S)
 9.       (S, M[i]) ← ρ^{−1}(S, C[i])
10.    end for
11.    M[m'] ← lsb_z(M[m'])
12.    M ← M[1] ∥ … ∥ M[m' − 1] ∥ M[m']
13. S ← 0^n
14. (X[1], …, X[a]) ←^{n,t} A
15. if a mod 2 = 0 then u ← t else n
16. (X[a + 1], …, X[a + m]) ←^{n+t−u,u} M
17. if m mod 2 = 0 then v ← u else n + t − u
18. w ← 48
19. if |X[a]| < u then w ← w ⊕ 2
20. if |X[a + m]| < v then w ← w ⊕ 1
21. if a mod 2 = 0 then w ← w ⊕ 8
22. if m mod 2 = 0 then w ← w ⊕ 4
23. X[a] ← pad_u(X[a])
24. X[a + m] ← pad_v(X[a + m])
25. x ← 40
26. for i = 1 to ⌊(a + m)/2⌋
27.    (S, η) ← ρ(S, X[2i − 1])
28.    if i = ⌊a/2⌋ + 1 then x ← x ⊕ 4
29.    S ← Ẽ_K^{(X[2i], x, 2i−1)}(S)
30. end for
31. if a mod 2 = m mod 2 then
32.    (S, η) ← ρ(S, 0^n)
33. else
34.    (S, η) ← ρ(S, X[a + m])
35. S ← Ẽ_K^{(N, w, a+m)}(S)
36. (η, T*) ← ρ(S, 0^n)
37. if T* = T then return M else ⊥

Algorithm ρ(S, M)
 1. C ← M ⊕ G(S)
 2. S' ← S ⊕ M
 3. return (S', C)

Algorithm ρ^{−1}(S, C)
 1. M ← C ⊕ G(S)
 2. S' ← S ⊕ M
 3. return (S', M)

Figure 2.7: The Romulus-M misuse-resistant AE mode. Lines of [if (statement) then X ← x
else x'] are shorthand for [if (statement) then X ← x else X ← x']. The dummy variable η is
always discarded. We use Romulus-M1 as working example. For other Romulus-M members, the
values of the bits b7 and b6 in the domain separation need to be adapted accordingly. Note that in
the case of an empty message, no encryption call has to be performed in the encryption part.
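The domain separation values used in the tag generation part of Figure 2.7 follow a simple XOR pattern. The Python sketch below mirrors the updates of w and x for the Romulus-M1 working example; the function names and the boolean flags for "the last block is partial" are hypothetical conveniences, not part of the specification.

```python
def romulus_m1_final_domain_value(a: int, m: int,
                                  last_ad_partial: bool,
                                  last_msg_partial: bool) -> int:
    """Value w used in the final TBC call of tag generation.

    a, m             : number of AD blocks and message blocks X[1..a], X[a+1..a+m]
    last_ad_partial  : True when |X[a]| < u
    last_msg_partial : True when |X[a+m]| < v
    """
    w = 48
    if last_ad_partial:
        w ^= 2
    if last_msg_partial:
        w ^= 1
    if a % 2 == 0:
        w ^= 8
    if m % 2 == 0:
        w ^= 4
    return w


def romulus_m1_hash_domain_values(a: int, m: int) -> list:
    """Values x used by the TBC calls i = 1 .. floor((a+m)/2) of the hashing loop."""
    x = 40
    values = []
    for i in range(1, (a + m) // 2 + 1):
        if i == a // 2 + 1:
            x ^= 4  # switch from 40 to 44 once the loop reaches the message part
        values.append(x)
    return values
```

For instance, with two full AD blocks and two full message blocks the final call uses w = 60, which is the lowest value of the range given for the (even, even) case in Figure 2.8.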
[Figure 2.8 diagram not reproduced. Panels: tag generation for the cases (a, m) = (even, even),
(even, odd), (odd, even) and (odd, odd), followed by encryption. In tag generation, the TBC calls
over AD blocks use domain separation value 40, the calls over message blocks use 44, and the final
call uses w, with w ∈ {60, …, 63} for (even, even), w ∈ {56, …, 59} for (even, odd),
w ∈ {52, …, 55} for (odd, even) and w ∈ {48, …, 51} for (odd, odd). In encryption, the state is
initialized with T and every TBC call uses domain separation value 36 with counters 0, …, m' − 1.]
Figure 2.8: The Romulus-M misuse-resistant AE mode. (Top) process of AD with even/even,
even/odd, odd/even, odd/odd AD and M blocks respectively (Bottom) Encryption. We use
Romulus-M1 as working example. For other Romulus-M members, the values of the bits b7 and b6
in the domain separation need to be adapted accordingly.
3. Security Claims
Attack Models. We consider two models of adversaries: nonce-respecting (NR) and
nonce-misusing (NM).¹ In the former model, nonce values in encryption queries (the tuples (N, A, M))
may be chosen by the adversary but must be distinct. In the latter, nonce values in encryption
queries can repeat. An NM adversary can repeat a nonce arbitrarily, so even using the
same nonce for all queries is possible. We can further qualify NM by the distribution of nonces,
such as the maximum number of repetitions of a nonce in the encryption queries.
For both models, adversaries can use any nonce values in decryption queries (the tuples
(N, A, C, T)): such a nonce can collide with a nonce in an encryption query or in another decryption query.
Security Claims. Our security claims are summarized in Table 3.1. The values in the table
denote the workload, in terms of data complexity and as a base-2 logarithm, that an adversary
requires to break the cipher. The data complexity of the attacker consists of the number of queries
and the total amount of processed message blocks. Once it reaches the stated value, there is no
security guarantee anymore, and the cipher can be broken. For simplicity, small constant factors,
which are determined from the concrete security bounds, are neglected in the table. A more detailed
analysis is given in Section 4.
We claim these numbers hold as long as Skinny is a tweakable pseudorandom permutation, that
is, it is computationally hard to distinguish Skinny from the set of uniform random permutations
(URP) indexed by the tweak (a tweakable URP or TURP), using chosen-plaintext queries in the
single-key setting.
Table 3.1: Security claims of Romulus. NR denotes Nonce-Respecting adversary and NM denotes
Nonce-Misusing adversary.
Family       NR-Priv   NR-Auth   NM-Priv     NM-Auth

Romulus-N    128       128       –           –
Romulus-M    128       128       64 ∼ 128    64 ∼ 128
For all the members of Romulus-N, Table 3.1 shows n-bit security for privacy and authenticity
against NR adversaries. For all the members of Romulus-M, Table 3.1 shows n-bit security for privacy
and authenticity against NR adversaries and, in addition, n/2-bit security for privacy and authenticity
against NM adversaries. The n/2-bit security assumes that the NM adversary has full control over
the nonce. In practice, however, nonce repetition can happen accidentally, and it is conceivable
that a nonce is repeated only a few times. As we present in Section 4, the security bounds of
Romulus-M show the notable property of graceful security degradation with respect to the number
of nonce repetitions [36]. This property is similar to that of SCT: if the number of nonce repetitions
is limited, the actual security bound is close to the full n-bit security.
¹ Also known as Nonce Repeating or Nonce Ignoring. We chose “Nonce Misuse” for notational
convenience of using acronyms, NR for nonce-respecting and NM for nonce-misuse.
Table 3.1 does not show the time complexity. We claim k-bit security in terms of the attacker's
time complexity for all the members of Romulus that use Skinny with k-bit keys, as is common for
schemes having security proofs in the standard model. This also indicates that the time complexity
of key recovery is k bits, i.e., key recovery is no easier than attacking Skinny itself, under the
single-key setting. Note that all members have k = 128. See Table 3.2.
Table 3.2: Security claims of Romulus against key recovery.
Family Key Recovery

Romulus-N 128
Romulus-M 128
4. Security Analysis
4.1 Security Notions

Security Notions for NAE. We consider the standard security notions for nonce-based AE [6,
7, 38]. Let Π denote an NAE scheme consisting of an encryption procedure Π.E_K and a decryption
procedure Π.D_K, for secret key K uniform over set 𝒦 (denoted as K ←$ 𝒦). For plaintext M
with nonce N and associated data A, Π.E_K takes (N, A, M) and returns ciphertext C (typically
|C| = |M|) and tag T. For decryption, Π.D_K takes (N, A, C, T) and returns a decrypted plaintext
M if the authentication check is successful, and otherwise an error symbol, ⊥.
The privacy notion is the indistinguishability of the encryption oracle Π.E_K from the random-bit
oracle $, which returns random |M| + τ bits for any query (N, A, M). The adversary is assumed to
be nonce-respecting. We define the privacy advantage as
\[
\mathbf{Adv}^{\mathsf{priv}}_{\Pi}(\mathcal{A}) \stackrel{\text{def}}{=}
\Pr\left[K \stackrel{\$}{\leftarrow} \mathcal{K} : \mathcal{A}^{\Pi.\mathsf{E}_K(\cdot,\cdot,\cdot)} \Rightarrow 1\right]
- \Pr\left[\mathcal{A}^{\$(\cdot,\cdot,\cdot)} \Rightarrow 1\right],
\]
which measures the hardness of breaking the privacy notion for 𝒜.

The authenticity notion is the probability of successful forgery via queries to the Π.E_K and Π.D_K
oracles. We define the authenticity advantage as
\[
\mathbf{Adv}^{\mathsf{auth}}_{\Pi}(\mathcal{A}) \stackrel{\text{def}}{=}
\Pr\left[K \stackrel{\$}{\leftarrow} \mathcal{K} : \mathcal{A}^{\Pi.\mathsf{E}_K(\cdot,\cdot,\cdot),\,\Pi.\mathsf{D}_K(\cdot,\cdot,\cdot,\cdot)} \text{ forges}\right],
\]
where 𝒜 forges if it receives a value M' ≠ ⊥ from Π.D_K. Here, to prevent trivial wins, if
(C, T) ← Π.E_K(N, A, M) is obtained earlier, 𝒜 cannot query (N, A, C, T) to Π.D_K. The adversary
is assumed to be nonce-respecting for encryption queries.
Security Notion for TBC. The security of a TBC Ẽ : 𝒦 × 𝒯 × ℳ → ℳ is defined by the
indistinguishability from an ideal object, the tweakable uniform random permutation (TURP), denoted
by P̃, using chosen-plaintext, chosen-tweak queries. It is a set of independent uniform random
permutations (URPs) over ℳ indexed by tweak T ∈ 𝒯. Let Adv^tprp_Ẽ(𝒜) denote the TPRP
advantage of TBC Ẽ against adversary 𝒜. It is defined as
\[
\mathbf{Adv}^{\mathsf{tprp}}_{\widetilde{E}}(\mathcal{A}) \stackrel{\text{def}}{=}
\Pr\left[K \stackrel{\$}{\leftarrow} \mathcal{K} : \mathcal{A}^{\widetilde{E}_K(\cdot,\cdot)} \Rightarrow 1\right]
- \Pr\left[\mathcal{A}^{\widetilde{\mathsf{P}}(\cdot,\cdot)} \Rightarrow 1\right].
\]
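Security Notions for MRAE. We adopt the security notions of MRAE following the same
security definitions as above, with the exception that the adversary can now repeat nonces. We
write the corresponding privacy advantage as
\[
\mathbf{Adv}^{\mathsf{nm\text{-}priv}}_{\Pi}(\mathcal{A}) \stackrel{\text{def}}{=}
\Pr\left[K \stackrel{\$}{\leftarrow} \mathcal{K} : \mathcal{A}^{\Pi.\mathsf{E}_K(\cdot,\cdot,\cdot)} \Rightarrow 1\right]
- \Pr\left[\mathcal{A}^{\$(\cdot,\cdot,\cdot)} \Rightarrow 1\right],
\]
and the authenticity advantage as
\[
\mathbf{Adv}^{\mathsf{nm\text{-}auth}}_{\Pi}(\mathcal{A}) \stackrel{\text{def}}{=}
\Pr\left[K \stackrel{\$}{\leftarrow} \mathcal{K} : \mathcal{A}^{\Pi.\mathsf{E}_K(\cdot,\cdot,\cdot),\,\Pi.\mathsf{D}_K(\cdot,\cdot,\cdot,\cdot)} \text{ forges}\right].
\]
We note that while NM adversaries can repeat nonces, we assume without loss of generality that
they do not repeat the same query. See also [39] for reference.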
4.2 Security of Romulus-N
For A ∈ {0, 1}^*, we say A has a AD blocks if it is parsed as (A[1], …, A[a]) ←^{n,t} A. Let
ã = ⌊a/2⌋ + 1, which is a bound on the actual number of primitive calls for AD. Similarly, for
plaintext M ∈ {0, 1}^*, we say M has m message blocks if |M|_n := ⌈|M|/n⌉ = m. The same applies to
ciphertext C. For an encryption query (N, A, M) or a decryption query (N, A, C, T) of a AD blocks and
m message blocks, the number of total TBC calls is at most ã + m, which is called the number of
effective blocks of a query.
Let 𝒜 be an NR adversary against Romulus-N using q encryption queries with time complexity
t_𝒜 and with total number of effective blocks σ_priv. Moreover, let ℬ be an NR adversary using q_e
encryption queries and q_d decryption queries, with total number of effective blocks for encryption
and decryption queries σ_auth, and time complexity t_ℬ. Then
\[
\mathbf{Adv}^{\mathsf{priv}}_{\textsf{Romulus-N}}(\mathcal{A}) \le \mathbf{Adv}^{\mathsf{tprp}}_{\widetilde{E}}(\mathcal{A}'),
\]
\[
\mathbf{Adv}^{\mathsf{auth}}_{\textsf{Romulus-N}}(\mathcal{B}) \le \mathbf{Adv}^{\mathsf{tprp}}_{\widetilde{E}}(\mathcal{B}') + \frac{3q_d}{2^{n}} + \frac{2q_d}{2^{\tau}}
\]
hold for some 𝒜' using σ_priv chosen-plaintext queries with time complexity t_𝒜 + O(σ_priv), and for
some ℬ' using σ_auth chosen-plaintext queries with time complexity t_ℬ + O(σ_auth). These bounds
hold for all the members of Romulus-N. Note that (n, τ) = (128, 128) holds for all the members. If
1 ≤ τ < n (which is not a part of our submission), the scheme still keeps n-bit privacy and τ-bit authenticity.
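To get a feeling for the authenticity bound, the two additive terms 3q_d/2^n + 2q_d/2^τ can be evaluated exactly for concrete parameters. The following Python sketch does only that; it ignores the TPRP advantage term, and the function name is a hypothetical helper rather than anything defined in the specification.

```python
from fractions import Fraction


def romulus_n_auth_terms(q_d: int, n: int = 128, tau: int = 128) -> Fraction:
    """Exact value of 3*q_d/2^n + 2*q_d/2^tau (TPRP advantage term not included)."""
    return Fraction(3 * q_d, 2 ** n) + Fraction(2 * q_d, 2 ** tau)


# With the submitted parameters (n, tau) = (128, 128), the two terms merge into 5*q_d/2^128.
example = romulus_n_auth_terms(2 ** 64)
```

With a shortened tag (τ < n, outside the submission), the second term dominates, which matches the statement that authenticity drops to τ bits while privacy is unaffected.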
The security of Romulus-N crucially relies on the n × n matrix G defined over GF(2). Let G^(i)
be an n × n matrix that is equal to G except the (i + 1)-st to n-th rows, which are set to all zero.
Here, G^(0) is the zero matrix and G^(n) = G, and for X ∈ {0, 1}^n, G^(i)(X) = lsb_i(G(X)) ∥ 0^{n−i} for
all i = 0, 8, 16, …, n; note that all variables are byte strings, and lsb_i(X) is the leftmost i/8 bytes
(Section 2). Let I denote the n × n identity matrix. We say G is sound if (1) G is regular and (2)
G^(i) + I is regular for all i = 8, 16, …, n. The above security bounds hold as long as G is sound.
The proofs are similar to those for iCOFB [14]. We have verified the soundness of our G, for a range
of n including n = 64 and n = 128, by a computer program.
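A soundness check of this kind can be sketched in a few lines of Python. The sketch below is not the authors' verification program. It assumes that G acts independently on each byte through the map (x7..x0) → (x0 ⊕ x7, x7, …, x1) from the definition of G earlier in the specification; under that assumption G^(i) + I is block diagonal, with byte blocks equal to either G_s + I or the identity, so soundness reduces to checking that G_s and G_s + I are invertible on 8 bits. A linear map over GF(2) is regular exactly when it is a bijection, which is what the code tests.

```python
def g_byte(x: int) -> int:
    """Assumed byte map of G: (x7..x0) -> (x0 ^ x7, x7, x6, ..., x1)."""
    return (x >> 1) | (((x & 1) ^ (x >> 7)) << 7)


def is_bijection_on_bytes(f) -> bool:
    """True when f maps the 256 byte values onto 256 distinct byte values."""
    return len({f(x) for x in range(256)}) == 256


# Condition (1): G is regular (checked on one byte block).
g_is_regular = is_bijection_on_bytes(g_byte)

# Condition (2): G^(i) + I is regular; over GF(2), addition is XOR.
g_plus_identity_is_regular = is_bijection_on_bytes(lambda x: g_byte(x) ^ x)

g_is_sound = g_is_regular and g_plus_identity_is_regular
```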
4.3 Security of Romulus-M

Let 𝒜 be an adversary against Romulus-M using q encryption queries with time complexity t_𝒜
and with total number of effective blocks σ_priv. Here, for an encryption query (N, A, M), we
define the number of effective blocks as ⌊a/2⌋ + ⌊m/2⌋ + 2 + m', where (A[1], …, A[a]) ←^{n,t} A,
(M[1], …, M[m]) ←^{n,t} M, and (M[1], …, M[m']) ←^{n} M. In the NR case, we have
\[
\mathbf{Adv}^{\mathsf{priv}}_{\textsf{Romulus-M}}(\mathcal{A}) \le \mathbf{Adv}^{\mathsf{tprp}}_{\widetilde{E}}(\mathcal{A}'),
\]
and in the NM case, we have
\[
\mathbf{Adv}^{\mathsf{nm\text{-}priv}}_{\textsf{Romulus-M}}(\mathcal{A}) \le \mathbf{Adv}^{\mathsf{tprp}}_{\widetilde{E}}(\mathcal{A}') + \frac{4r\sigma_{\mathsf{priv}}}{2^{n}},
\]
where r is the maximum number of repetitions of a nonce in encryption queries, and 𝒜' uses
σ_priv chosen-plaintext queries with time complexity t_𝒜 + O(σ_priv).
Let ℬ be an adversary using q_e encryption queries and q_d decryption queries, with total number
of effective blocks for encryption and decryption queries σ_auth, and time complexity t_ℬ. Here, for a
decryption query (N, A, C, T), the number of effective blocks is defined as ⌊a/2⌋ + ⌊m/2⌋ + 2 + m',
where (A[1], …, A[a]) ←^{n,t} A, (C[1], …, C[m]) ←^{n,t} C, and (C[1], …, C[m']) ←^{n} C.
Then in the NR case, we have
\[
\mathbf{Adv}^{\mathsf{auth}}_{\textsf{Romulus-M}}(\mathcal{B}) \le \mathbf{Adv}^{\mathsf{tprp}}_{\widetilde{E}}(\mathcal{B}') + \frac{5q_d}{2^{n}}.
\]
In the NM case, we have
\[
\mathbf{Adv}^{\mathsf{nm\text{-}auth}}_{\textsf{Romulus-M}}(\mathcal{B}) \le \mathbf{Adv}^{\mathsf{tprp}}_{\widetilde{E}}(\mathcal{B}') + \frac{4rq_e}{2^{n}} + \frac{5rq_d}{2^{n}}.
\]
Here, r is the maximum number of repetitions of a nonce in encryption queries, and ℬ' uses
σ_auth chosen-plaintext queries with time complexity t_ℬ + O(σ_auth).
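The graceful degradation in r can be made concrete by evaluating the additive terms of the NM bounds. The Python sketch below is only an illustration of the formulas above: it leaves out the TPRP advantage term, and the function names are hypothetical.

```python
import math
from fractions import Fraction


def nm_priv_term(r: int, sigma_priv: int, n: int = 128) -> Fraction:
    """Additive term 4*r*sigma_priv / 2^n of the NM privacy bound."""
    return Fraction(4 * r * sigma_priv, 2 ** n)


def nm_auth_term(r: int, q_e: int, q_d: int, n: int = 128) -> Fraction:
    """Additive terms 4*r*q_e / 2^n + 5*r*q_d / 2^n of the NM authenticity bound."""
    return Fraction(4 * r * q_e + 5 * r * q_d, 2 ** n)


def log2_data_limit(r: int, n: int = 128) -> float:
    """log2 of the sigma_priv at which 4*r*sigma_priv / 2^n reaches 1."""
    return n - 2 - math.log2(r)


# A nonce used once leaves about 126 bits of data complexity; each doubling of r costs one bit.
limits = {r: log2_data_limit(r) for r in (1, 2, 2 ** 10, 2 ** 62)}
```

When r grows to roughly 2^{n/2}, the limit falls to about n/2 bits, consistent with the birthday-bound security claimed for an adversary who fully controls the nonce.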
4.4 Security of Skinny

Skinny [2, 3] is claimed to be secure against related-tweakey attacks, an attack model very generous to
the adversary, who can fully control the tweak input. We refer to the original research paper for the
extensive security analysis provided by the authors (differential cryptanalysis, linear cryptanalysis,
meet-in-the-middle attacks, impossible differential attacks, integral attacks, slide attacks, invariant
subspace cryptanalysis, and algebraic attacks). In particular, strong security guarantees for Skinny
have been provided with regard to differential and linear cryptanalysis.
In addition, since the publication of the cipher in 2016, there has been a large amount of
cryptanalysis and structural analysis (improvement of security bounds) of Skinny by third parties.
This was further motivated by the organization of cryptanalysis competitions of Skinny by the designers.
To the best of our knowledge, the cryptanalysis that can attack the highest number of rounds
(related-tweakey impossible differential attack [31, 40]) can only reach 23 of the 48 rounds of
Skinny-128-256 and 27 of the 56 rounds of Skinny-128-384, with a very high data/memory/time
complexity.
All in all, we can conclude that all versions of Skinny that we use have a very large security margin
(more than 50%), even after numerous third-party cryptanalysis efforts. This is a very strong argument
for Skinny, as it provides excellent performance while maintaining a very safe security margin. We
emphasize that comparison between ciphers should take into account this security margin aspect
(for example by normalizing performance by the maximum ratio of attacked rounds).
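The margin figures follow directly from the round counts quoted above. The Python sketch below computes them; `normalized_cost` is one possible reading of "normalizing performance by the ratio of attacked rounds" and is an assumption of this sketch, not a metric defined by the authors.

```python
def security_margin(attacked_rounds: int, total_rounds: int) -> float:
    """Fraction of rounds not reached by the best known attack."""
    return 1 - attacked_rounds / total_rounds


def normalized_cost(cost: float, attacked_rounds: int, total_rounds: int) -> float:
    """Hypothetical normalization: cost scaled down to the number of attacked rounds."""
    return cost * attacked_rounds / total_rounds


margin_128_256 = security_margin(23, 48)  # Skinny-128-256
margin_128_384 = security_margin(27, 56)  # Skinny-128-384
```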
5. Features
The primary goal of Romulus is to provide a lightweight, yet highly-secure, highly-efficient AE based
on a TBC. Romulus has a number of desirable features. Below we detail some representative ones:
• Security margin. The Skinny family of tweakable block ciphers was published at CRYPTO
2016. Even though a thorough security analysis was provided by the authors in the original
article, these primitives have attracted a lot of attention and third-party cryptanalysis in the
past years. So far, the Skinny functions still offer a very comfortable security margin. For example,
the Skinny members used in Romulus still have more than 50% security margin in the related-key
related-tweakey model. The actual security margin is probably even higher, as these
attacks cannot be directly applied to Skinny in the Romulus setting due to data limitations,
limited tweak space, etc. Moreover, our security assumption on the internal primitive is only
single-key, not related-key.
• Security proofs. Both Romulus-N and Romulus-M have provable security reductions to
Skinny in the standard model. See [21] for the proofs. This is very important for high
confidence in the security of Romulus and allows us to reduce the security of Romulus to that of
Skinny, which has been extensively studied since its proposal in 2016.
• Beyond-birthday-bound security. The security bounds of Romulus shown in Section 4
are comparable to the state-of-the-art TBC modes of operation, namely ΘCB3 for NAE and
SCT for MRAE. In particular, Romulus-N and Romulus-M (under NR adversary) achieve
beyond-birthday-bound (BBB) security with respect to the block length. This level of security
is much stronger than the up-to-birthday-bound, n/2-bit security achieved by conventional
block cipher modes using n-bit block ciphers, e.g. GCM. Our provable security results are in
the standard model, where there is a reduction from the security of the entire modes to the
underlying primitive, Skinny, where the security of Skinny refers to the standard single-key
setting. This implies that, up to the security bounds, our schemes cannot be broken without
breaking the security of the underlying primitive in the single-key setting.
• Misuse resistance. Romulus-M is an MRAE mode which is secure against misuse (repetition)
of nonces in encryption queries. More formally, it provides the best-possible security against
nonce repetition in that ciphertexts do not give any information as long as the uniqueness of the
input tuple (N, A, M) is maintained. In contrast, popular nonce-based AE modes are
often vulnerable to nonce repetition, and even a single repetition can be significant. For example,
the famous nonce repetition attack against GCM [19, 26] reveals its authentication key.
• Performance. Romulus-N is smaller than ΘCB3 in that it does not need an additional state
beyond the internal TBC. Besides, it is faster as it processes (n + t)-bit AD blocks per TBC
call. In general, it requires only ⌈(|A| − n)/(n + t)⌉ + ⌈|M|/n⌉ + 1 TBC calls, as opposed to
ΘCB3, which requires ⌈|A|/n⌉ + ⌈|M|/n⌉ + 1. Although Romulus is serial in nature, i.e., not
parallelizable, it was shown during the CAESAR competition that parallelizability does not lead
to significant performance gains in hardware [17, 27, 29]. Moreover, parallelizability is not
considered crucial in lightweight applications, so it is a small price for a simple, small and
fast design.
In Romulus-M, a plaintext is processed twice, once for generating a tag and once for encryption.
Romulus-M inherits the overall design of Romulus-N, and thanks to the highly efficient tag
generation, the efficiency loss is minimized. Romulus-M is only about 1.5 times slower than
Romulus-N when associated data is empty, and gets closer to Romulus-N for long associated
data.
• Simplicity/Small footprint. Romulus has quite a small footprint. Especially for Romulus-N,
we essentially need only what is needed to implement the TBC Skinny itself. We remark that
this becomes possible thanks to the permutation-based structure of Skinny’s tweakey schedule,
which allows sharing the state registers used for storing input variables and for deriving
round-key values. Thus, this feature is specific to our use of Skinny, though one can expect
a similar effect with a TBC using a simple tweak(ey) schedule. There are no OCB-like masks
applied to the primitive, and we do not need the inverse circuit for Skinny, which was needed
for ΘCB3. A comparison in Section 6 (Table 6.1) shows that Romulus-N is quite small and
especially efficient in terms of a combined metric of size and speed, compared with other
schemes.
Romulus-M also has a small footprint due to the shared structure with Romulus-N.
• Small messages. Romulus-N has a small computational overhead and therefore performs well
for small messages. For example, it needs just two TBC calls to encrypt one-block AD
and one-block message, i.e., 16 Bytes of AD and 16 Bytes of message. In particular, in the
authentication part, the first 16 Bytes of AD can be processed for free in that they are processed
without calling the TBC.
• Flexibility. Romulus is highly flexible. It is defined as a generic mode for
TBCs, and the provable security reduction in the standard model contributes to high
confidence in the scheme when it is combined with a secure TBC.
• Side channels and Fault Attacks. Romulus does not inherently guarantee security against
Side Channel Analysis (SCA) and Fault Attacks. However, standard countermeasures are easily
adaptable to Romulus, e.g. Fresh Rekeying [32], Masking [33], etc. Moreover, powerful fault
attacks that require a small number of faults and pairs of faulty and non-faulty ciphertexts,
such as DFA, are not applicable to Romulus-N without violating the security model, i.e.,
repeating the nonce or releasing unverified plaintexts.
We also note that in Romulus-N1, Romulus-N2, Romulus-M1 and Romulus-M3, we do not
require the full tweakey size of Skinny-128-384, so a potential countermeasure to both SCA
and DFA is to randomize the round keys by adding a random value to each TBC call. The
downside of this idea is that the randomness needs to be synchronized for correct decryption,
but this property is shared with most randomized SCA and DFA countermeasures. We plan to
analyze this idea and other ideas to make Romulus resistant to such attacks in
detail in subsequent works.
6. Design Rationale
6.1 Overview
Romulus is designed with the following goals in mind:
1. Have a very small area compared to other TBC/BC based AEAD modes.
2. Have relatively high efficiency in general.
3. Have smaller overhead and fewer TBC calls for the AD processing.
4. Use the underlying TBC as a black box, with the standard security reduction to the TBC.
6.2 Mode Design

Rationale of NAE Mode. Romulus-N has a structure similar to a mode called iCOFB, which
appeared in the full version of the CHES 2017 paper [14]. Because iCOFB was introduced to show the
feasibility of the main proposal of [13], the block cipher mode COFB, it does not work as a full-fledged
AE using conventional TBCs. Therefore, starting from iCOFB, we apply numerous changes to
improve efficiency while achieving high security. As a result, Romulus-N becomes a much more
advanced, sophisticated NAE mode based on a TBC. The security bound of Romulus-N is essentially
equivalent to that of ΘCB3, having full n-bit security.
Rationale of MRAE Mode. Romulus-M is designed as an MRAE mode following the structure
of SIV [39] and SCT [36]. Romulus-M reuses the components of Romulus-N as much as possible to
inherit its implementation advantages and its security. In fact, this brings us several advantages
(not only in implementation aspects) over SIV/SCT. Compared with SCT, Romulus-M needs
fewer primitive calls thanks to the faster MAC part. Moreover, Romulus-M has a smaller
state than SCT because of the single-state encryption part taken from Romulus-N (SCT employs a
variant of counter mode). The provable security of Romulus-M is equivalent to that of SCT: the security
depends on the maximum number of repetitions of a nonce in encryption (r), and if r = 1 (i.e.,
NR adversary) we have the full n-bit security. Security gradually decreases as r increases,
which is also known as “graceful degradation”, and even if r equals the number of encryption queries,
implying nonces are fixed, we maintain the birthday-bound, n/2-bit security.
ZAE [23] is another TBC-based MRAE. Although it is faster than SCT, the state size is much
larger than SCT and Romulus-M.
Efficiency Comparison. In Table 6.1, we compare Romulus-N to ΘCB3, a well-studied TBC-based
AEAD mode, in addition to a group of recently proposed lightweight AEAD modes. State
size is the minimum number of bits that the mode has to maintain during its operation, and rate is
the ratio of input data length divided by the total output length of the primitive needed to process
that input. The comparison follows the guidelines listed after Table 6.1, while trying to be fair in
comparing designs that follow completely different approaches.
Table 6.1: Features of Romulus-N members compared to ΘCB3 and other lightweight AEAD
algorithms: λ is the bit security level of a mode. Here, (n, k)-BC is a block cipher of n-bit block
and k-bit key, (n, t, k)-TBC is a TBC of n-bit block and k-bit key and t-bit tweak, and n-Perm is
an n-bit cryptographic permutation.

Scheme           Number of Primitive Calls        Primitive                  Security (λ)     State Size (S)      Rate (R)  S/R     Inverse Free
Romulus-N1       ⌈(|A|−n)/2n⌉ + ⌈|M|/n⌉ + 1       (n, 1.5n, k)-TBC†, n = k   n                n + 2.5k = 3.5λ     1         3.5λ    Yes
Romulus-N2       ⌈(|A|−n)/1.75n⌉ + ⌈|M|/n⌉ + 1    (n, 1.2n, k)-TBC†, n = k   n                n + 2.2k = 3.2λ     1         3.2λ    Yes
Romulus-N3       ⌈(|A|−n)/1.75n⌉ + ⌈|M|/n⌉ + 1    (n, n, k)-TBC, n = k       n                n + 2k = 3λ         1         3λ      Yes
COFB [13]        ⌈|A|/n⌉ + ⌈|M|/n⌉ + 1            (n, k)-BC, n = k           n/2 − log2 n/2   1.5n + k = 5.4λ‡    1         5.4λ    Yes
ΘCB3 [28]        ⌈|A|/n⌉ + ⌈|M|/n⌉ + 1            (n, 1.5n, k)-TBC♯, n = k   n                2n + 2.5k = 4.5λ    1         4.5λ    No
Beetle [12]      ⌈|A|/n⌉ + ⌈|M|/n⌉ + 2            2n-Perm, n = k             n − log2 n       2n = 2.12λ          1/2       4.24λ   Yes
Ascon-128 [16]   ⌈|A|/n⌉ + ⌈|M|/n⌉ + 1            5n-Perm, n = k/2           n/2              7n = 3.5λ           1/5       17.5λ   Yes
Ascon-128a [16]  ⌈|A|/n⌉ + ⌈|M|/n⌉ + 1            2.5n-Perm, n = k           n                3.5n = 3.5λ         1/2.5     8.75λ   Yes
SpongeAE♭ [9]    ⌈|A|/n⌉ + ⌈|M|/n⌉ + 1            3n-Perm, n = k             n                3n = 3λ             1/3       9λ      Yes

† Unused part of tweakey is not a part of state thus not considered;
‡ Can possibly be enhanced to about 4λ with a 2n-bit block cipher;
♯ 1.5n-bit tweak for n-bit nonce and 0.5n-bit counter;
♭ Duplex construction with n-bit rate, 2n-bit capacity.
1. k = 128 for all the designs.

2. n is the input block size (in bits) for each primitive call.
3. λ is the security level of the design.
4. For BC/TBC based designs, the key is considered to be stored inside the design, but we also
consider that the encryption and decryption keys are interchangeable, i.e., the encryption key
can be derived from the decryption key and vice versa. Hence, no need to store the master
key in additional storage. The same applies for the nonce.
5. For Sponge and Sponge-like designs, if the key/nonce are used only during initialization, then
they are counted as part of the state and do not need extra storage. However, in designs like
Ascon, where the key is used again during finalization, we assume the key storage is part of
the state, as the key should be supplied only once as an input.
Our comparative analysis shows that Romulus-N is smaller and more efficient than ΘCB3 for
the same security level. Moreover, the cost of processing AD is about half that of the message. For
example, in the case of Romulus-N1, if the message and AD have equal length, there is an extra
speed-up of ∼ 1.33x, which means that the S/R metric even improves from 3.5λ to 2.625λ, compared
to 4.5λ in the case of ΘCB3. This makes Romulus-N a very promising candidate for NAE, for both
short and long messages.
A similar comparison is shown in Table 6.2 for misuse-resistant TBC-based AEAD modes. It
shows that Romulus-M is very efficient: not only is the state size smaller, but it is also faster. For
example, Romulus-M1 needs about 25% fewer TBC calls (a 1.33x speed-up) than SCT for the same
parameters when |A| = 0, and it is even faster when |A| > 0.
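The speed-up figures quoted above can be reproduced from the call-count formulas in Tables 6.1 and 6.2. The Python sketch below evaluates them for Romulus-N1 and Romulus-M1 (n = t = 128) against ΘCB3 and SCT. It assumes that each fraction in the tables is rounded up, clamps the |A| − n term at zero for very short inputs, and uses hypothetical function names; corner cases such as empty inputs may differ from the exact algorithms.

```python
def ceil_div(x: int, y: int) -> int:
    """Integer ceiling of x / y for x >= 0 and y > 0."""
    return -(-x // y)


def romulus_n1_calls(ad_bits: int, msg_bits: int, n: int = 128, t: int = 128) -> int:
    return ceil_div(max(ad_bits - n, 0), n + t) + ceil_div(msg_bits, n) + 1


def theta_cb3_calls(ad_bits: int, msg_bits: int, n: int = 128) -> int:
    return ceil_div(ad_bits, n) + ceil_div(msg_bits, n) + 1


def romulus_m1_calls(ad_bits: int, msg_bits: int, n: int = 128, t: int = 128) -> int:
    return ceil_div(max(ad_bits + msg_bits - n, 0), n + t) + ceil_div(msg_bits, n) + 1


def sct_calls(ad_bits: int, msg_bits: int, n: int = 128) -> int:
    return ceil_div(ad_bits + msg_bits, n) + ceil_div(msg_bits, n) + 1


LONG = 1 << 20  # 2^20 bits of AD and/or message

# Equal-length AD and message: Romulus-N1 versus ThetaCB3.
nae_speedup = theta_cb3_calls(LONG, LONG) / romulus_n1_calls(LONG, LONG)

# Empty AD: Romulus-M1 versus SCT.
mrae_speedup = sct_calls(0, LONG) / romulus_m1_calls(0, LONG)
```

Both ratios come out close to 4/3 for long inputs, and `romulus_n1_calls(128, 128)` gives the two TBC calls mentioned for one block of AD and one block of message.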
Rationale of TBC. We chose some of the members of the Skinny family of tweakable block
ciphers [2] as our internal TBC primitives. Skinny was published at CRYPTO 2016 and has received
a lot of attention since its proposal. In particular, a lot of third-party cryptanalysis has been
provided (in part motivated by the organization of cryptanalysis competitions of Skinny by the
designers) and this was a crucial point in our primitive choice. Besides, our mode requires a
lightweight tweakable block cipher and Skinny is the main such primitive. It is very efficient and
lightweight, while providing a very comfortable security margin. Provable constructions that turn a
block cipher into a tweakable block cipher were considered, but they are usually not lightweight,
not efficient, and often only guarantee birthday-bound security.

Table 6.2: Features of Romulus-M members compared to other MRAE modes: λ is the bit security
level of a mode. Here, (n, k)-BC is a block cipher of n-bit block and k-bit key, (n, t, k)-TBC is a
TBC of n-bit block and k-bit key and t-bit tweak. Security is for Nonce-respecting adversary.

Scheme        Number of Primitive Calls            Primitive                  Security (λ)  State Size (S)    Rate (R)  S/R    Inverse Free
Romulus-M1    ⌈(|A|+|M|−n)/2n⌉ + ⌈|M|/n⌉ + 1       (n, 1.5n, k)-TBC†, n = k   n             n + 2.5k = 3.5λ   1/2       7λ     Yes
Romulus-M2    ⌈(|A|+|M|−n)/1.75n⌉ + ⌈|M|/n⌉ + 1    (n, 1.2n, k)-TBC†, n = k   n             n + 2.2k = 3.2λ   1/2       6.4λ   Yes
Romulus-M3    ⌈(|A|+|M|−n)/1.75n⌉ + ⌈|M|/n⌉ + 1    (n, n, k)-TBC, n = k       n             n + 2k = 3λ       1/2       6λ     Yes
SCT‡ [36]     ⌈(|A|+|M|)/n⌉ + ⌈|M|/n⌉ + 1          (n, n, k)-TBC, n = k       n             4n = 4λ           1/2       8λ     Yes
SUNDAE [1]    ⌈(|A|+|M|)/n⌉ + ⌈|M|/n⌉ + 1          (n, k)-BC, n = k           n/2           2n = 4λ           1/2       8λ     Yes
ZAE♯ [23]     ⌈(|A|+|M|)/2n⌉ + ⌈|M|/n⌉ + 6         (n, n, k)-TBC, n = k       n             7n = 7λ           1/2       14λ    Yes

† Unused part of tweakey is not a part of state thus not considered;
‡ Tag is n bits;
♯ Tag is 2n bits;
6.3 Hardware Implementations

General Architecture and Hardware Estimates. The goal of the design of Romulus is to have
a very small area overhead over the underlying TBC, especially for round-based implementations.
In order to achieve this goal, we set two requirements:
1. There should be no extra Flip-Flops over what is already required by the TBC, since Flip-Flops
are very costly (4 ∼ 7 GEs per Flip-Flop).
2. The number of possible inputs to each Flip-Flop and outputs of the circuits has to be
minimized. This is in order to reduce the number of multiplexers required, which is usually
one of the causes of efficiency reduction between the specification and the implementation.
One of the advantages of Skinny as a lightweight TBC is that it has a very simple datapath,
consisting of a simple state register followed by a low-area combinational circuit, where the same
circuit is used for all the rounds. The only multiplexer required is the one that selects between the
initial input for the first round and the round output afterwards (Figure 6.1(a)), and it has been
shown that this multiplexer can have an even lower cost than a normal multiplexer if it is combined
with the Flip-Flops by using Scan-Flops (Figure 6.1(b)) [24]. However, when Skinny is used inside an
AEAD mode, challenges arise, such as how to store the key and nonce, as the key scheduling algorithm
changes these values after each block encryption. The same goes for the block counter. In order
to avoid duplicating the storage elements for these values (one set used to execute the TBC
and one set used by the mode to maintain the current value), we studied the relation between
the original and final values of the tweakey. Since the key scheduling algorithm of Skinny is fully
linear and has very low area (most of the algorithm is just routing and renaming of different bytes),
the full algorithm can be inverted using a very small circuit that costs 64 XOR gates. Moreover,
the LFSR computation required between blocks can be implemented on top of this circuit, costing
3 extra XOR gates. This operation can be computed in parallel to ρ, such that when the state
is updated for the next block, the required tweakey is also ready. This costs only ∼ 67 XOR
gates as opposed to the ∼ 320 Flip-Flops that would otherwise be needed to maintain the tweakey
value. Hence, the mode was designed with the architecture in Figure 6.1(b) in mind, where only a
full-width state register is used, carrying the TBC state and tweakey values. In every cycle, it is
either kept without change, updated with the TBC round output (which includes a single round of
the key scheduling algorithm), or updated with the output of a simple linear transformation, which
consists of ρ/ρ^{−1}, the unrolled inverse key schedule and the block counter. In order to estimate
the hardware cost of the Romulus-N1 mode, we consider the round-based implementation with an
n/4-bit input/output bus:
• 4 XOR gates for computing G.
• 64 XOR gates for computing ρ.
• 67 XOR gates for the correction of the tweakey and counting.
• 56 multiplexers to select whether or not to increment the counter.
• 320 multiplexers to select between the output of the Skinny round and lt.
This adds up to 135 XOR gates and 376 multiplexers. For estimation purposes, assume an
XOR gate costs 2.25 GEs and a multiplexer costs 2.75 GEs, which adds up to 1337.75 GEs. In the
original Skinny paper [2], the authors reported that Skinny-128-384 requires 4,268 GEs, which gives
a total of ∼ 5,605 GEs. This is ∼ 1.4 KGEs smaller than the round based implementation of Ascon [18].
Moreover, a smart design can make use of the fact that 64 bits of the tweakey of Skinny-128-384
are not used, replacing 64 Flip-Flops by 64 multiplexers and saving an extra ∼ 200 GEs. In order to
design a combined encryption/decryption circuit, we show below that the decryption costs only
an extra 32 multiplexers and ∼ 32 OR gates, or ∼ 100 GEs. A similar analysis for Romulus-N2
and Romulus-N3 estimates that they would cost 1,217 and 1,073 GEs, respectively, on top of
their corresponding Skinny variant, or 5,485 and 4,385 GEs, respectively, in total.
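The arithmetic behind the Romulus-N1 estimate can be replayed in a few lines. The sketch below only uses the gate counts and per-gate costs quoted above; the variable names are ours and are not part of the specification.

```python
# Gate counts for the Romulus-N1 mode logic, as listed in the text.
XOR_GATES = {"G": 4, "rho": 64, "tweakey correction and counting": 67}
MUXES = {"counter increment select": 56, "Skinny round / lt select": 320}

# Per-gate costs assumed in the text for estimation purposes (in GE).
XOR_GE = 2.25
MUX_GE = 2.75

# Area of the round based Skinny-128-384 as reported in the Skinny paper [2].
SKINNY_128_384_GE = 4268

xor_total = sum(XOR_GATES.values())
mux_total = sum(MUXES.values())
mode_overhead_ge = xor_total * XOR_GE + mux_total * MUX_GE
total_ge = SKINNY_128_384_GE + mode_overhead_ge

print(xor_total, mux_total)   # 135 376
print(mode_overhead_ge)       # 1337.75
print(total_ge)               # 5605.75
```

Multiplexers dominate the overhead (1034 of the 1337.75 GEs), which is consistent with the second design requirement stated at the start of this section.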
These estimations show that Romulus-N is not just competitive theoretically but can also be a
very attractive option in practice for low area applications. For example, the 8-bit implementation
of ACORN, the smallest implementation publicly available for all the round 3 candidates of the
CAESAR competition, costs 5,900 GEs, as shown in [29]. If we assume ∼ 1,000 GEs as the
cost of the CAESAR Hardware API included in that design, as reported in [18], then Romulus-N3 is
still smaller than that. Besides, we believe the area can be even lower using serial implementations
of Skinny, which cost ∼ 3,000 GEs for Skinny-128-384 and ∼ 2,000 GEs for Skinny-128-256, a gain
of more than 1,000 GEs compared to the round based implementation.
Another possible optimization is to consider the fact that most of the area of Skinny comes from
the storage elements. Hence, we can speed up Romulus to almost double the speed by using a simple
two-round unrolling, which costs ∼ 1,000 GEs, as only the logic part of Skinny needs replication;
this is an increase of < 20% in terms of area.
Romulus-M is estimated to have almost the same area as Romulus-N, except for an additional set
of multiplexers in order to use the tag as an initial vector for the encryption part. This indicates
that it can be a very lightweight choice for high security applications.
For the serial implementations we followed the currently popular bit-sliding framework [24] with
minor tweaks. The state of Skinny is represented as a feedback shift register which typically
operates on 8 bits at a time, while allowing the 32-bit MixColumns operation, as shown in Figure 6.2.
It can be seen in Figure 6.2 that several careful design choices, such as a lightweight serializable
ρ function without the need for any extra storage and a lightweight padding/truncation scheme,
allow the low area implementations to use a very small number of multiplexers on top of the Skinny
circuit for the state update (three 8-bit multiplexers to be exact, two of which have a constant zero
input) and ∼ 22 XORs for the ρ function and block counter. For the key update functions, we ran
several experiments on how to serialize the operations and found that the best trade-off is to design
a parallel/serial register for every tweakey, where the key schedule and mode operations are done in
the same manner as in the round based implementation, while the AddRoundKey operation of Skinny
is done serially, as shown in Figure 6.2.
[Figure 6.1: Expected architectures for Skinny and Romulus. (a) Overview of the round based architecture of Skinny. (b) Overview of the round based architecture of Romulus. lt: the linear transformation that includes ρ, block counter and inverse key schedule.]
[Figure 6.2: Serial State Update Function Used in Romulus.]
6.4 Software Implementations
We refer to the Skinny document for discussions on software implementations of the various Skinny
versions. The Romulus mode will have little impact on the global performance of Skinny in software
as long as serial implementations are used. We expect very little increase in ROM or RAM when
compared to Skinny benchmarks. The very performant micro-controller implementations reported
in the Skinny document were benchmarked without assuming parallel cipher calls, and without any
pre-processing. Therefore, Romulus will present a performance profile very similar to the numbers
reported on micro-controllers. Generally, using a small amount of RAM, Skinny is easy and efficient
to implement using a simple table-based approach.
For high-end platforms, such as the latest Intel processors, very efficient highly-parallel bitsliced
implementations of Skinny using SSE, AVX, AVX2 instructions on XMM/YMM registers will not
be directly applicable as our Romulus mode is serial in nature. However, in the classical case of a
server communicating with many lightweight devices, we note that it would be possible to consider
bitslicing the key schedule [8] of Skinny (being relatively simple to compute) or using scheduling
strategies [10]. A classical table-based implementation of Skinny will ensure acceptable performance
even on legacy platforms, while Vector Permute (vperm) might lead to better results on medium
range platforms by parallelizing the computation of the Sbox.
6.5 Primitives Choices

LFSR-Based Counters. The NIST call for lightweight AEAD algorithms requires that such
algorithms must allow encrypting messages of length at least 2^50 bytes while still maintaining their
security claims. This means that using a TBC whose block size is 128 bits, we need a block counter
with a period of at least 2^46. While this can be achieved by a simple arithmetic counter of 46 bits,
arithmetic counters can be costly both in terms of area (3 ∼ 5 GEs/bit) and performance (due to
the long carry chains which limit the frequency of the circuit). In order to avoid this, we decided
to use LFSR-based counters, which can be implemented using a handful of XOR gates (3 XORs
≈ 6 ∼ 9 GEs). This, in addition to the architecture described above, makes the cost of the counter
almost negligible.
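As an illustration, one step of such a counter can be written as a shift plus a conditional XOR. The sketch below assumes that the 56-bit LFSR is defined by the polynomial F56(x) = x^56 + x^7 + x^4 + x^2 + 1 from Section 2.5 and that one step is a multiplication by x modulo F56; the function name and the integer encoding of the state (bit i of the integer holds z_i) are our own choices and not the reference implementation.

```python
WIDTH = 56
MASK = (1 << WIDTH) - 1
# Low-order terms of F56(x) = x^56 + x^7 + x^4 + x^2 + 1, i.e. x^7 + x^4 + x^2 + 1.
FEEDBACK = (1 << 7) | (1 << 4) | (1 << 2) | 1


def lfsr56_step(z: int) -> int:
    """Advance the counter by one block: multiply z by x modulo F56."""
    carry = (z >> (WIDTH - 1)) & 1      # bit z_55 falls out of the register
    z = (z << 1) & MASK                 # z_i <- z_{i-1}
    if carry:
        z ^= FEEDBACK                   # fold z_55 back into bits 7, 4, 2 and 0
    return z


# 2^50 bytes split into 16-byte blocks need 2^46 counter values.
blocks_needed = 2**50 // 16

state = 1
for _ in range(56):
    state = lfsr56_step(state)
print(hex(state))  # 0x95, the first wrap-around
```

Only bits 7, 4 and 2 need an XOR gate, since bit 0 simply receives z_55; this matches the 3 XORs quoted above.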
Counter Separation (Romulus-N2 and Romulus-M2). For Romulus-N2 and Romulus-M2, we
used two LFSRs instead of one, such that the second LFSR is updated only when the first LFSR
completes a full period. The rationale for that is to provide an even more lightweight variant of
Romulus-N1 and Romulus-M1. Since one can argue that the limitation imposed by NIST on the
number of bytes is impractical for some lightweight applications, the designer can choose to support
only up to ∼ 2^28 bytes instead, and not use the second LFSR. This saves 64 ∼ 128 Flip-Flops.
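A minimal sketch of this two-counter arrangement follows. It assumes that the 24-bit LFSR has feedback taps at bits 4, 3, 1 and 0 (our reading of the update rule in Section 2.5), that both registers start from the value 1, and that "a full period" is detected when the first register returns to that starting value. The helper names and the toy 4-bit polynomial x^4 + x + 1 used for the demonstration are ours.

```python
def lfsr_step(z: int, width: int, feedback: int) -> int:
    """One LFSR step: multiply z by x modulo x^width + feedback."""
    carry = (z >> (width - 1)) & 1
    z = (z << 1) & ((1 << width) - 1)
    return z ^ feedback if carry else z


def dual_counter_step(d1: int, d2: int, width: int, feedback: int, start: int = 1):
    """Advance the first LFSR; advance the second only when the first wraps."""
    d1 = lfsr_step(d1, width, feedback)
    if d1 == start:                      # first LFSR completed a full period
        d2 = lfsr_step(d2, width, feedback)
    return d1, d2


# Parameters assumed for the 24-bit counters: taps at bits 4, 3, 1 and 0.
WIDTH24 = 24
FEEDBACK24 = (1 << 4) | (1 << 3) | (1 << 1) | 1

# With only the first LFSR, 2^24 - 1 blocks of 16 bytes are covered (about 2^28 bytes).
single_lfsr_bytes = (2**24 - 1) * 16

# Demonstration on a toy 4-bit LFSR (x^4 + x + 1, period 15) so the wrap is quick to see.
d1, d2 = 1, 1
for _ in range(15):
    d1, d2 = dual_counter_step(d1, d2, width=4, feedback=0b0011)
print(d1, d2)  # 1 2
```

Dropping the second register, as suggested above, amounts to never taking the `if` branch, at the price of the shorter message limit.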
Smaller D for Romulus-N3. The goal of Romulus-N3 is to fit the Romulus algorithm in Skinny-
128-256, which is faster and smaller than Skinny-128-384. In most lightweight applications, the
amount of data to be sent under the same key is small. Hence, Romulus-N3 represents a variant
targeted at such applications that is faster and smaller than the other variants and can encrypt up
to ∼ 256 MBs of data.
Tag Generation. Considering hardware simplicity, the tag is the final output state (i.e., the
same way as the ciphertext blocks), as opposed to the final state S of the TBC. In order to avoid
branching when it comes to the output of the circuit, the tag is generated as G(S) instead of S.
In hardware, this can be implemented as ρ(S, 0^n), i.e., similar to the encryption of a zero vector.
Consequently, the output bus is always connected to the output of ρ and a multiplexer is avoided.
Padding. The padding function used in Romulus is chosen so that the padding information is
always inserted in the most significant byte of the last block of the message/AD. Hence, it reduces
the number of decisions for each byte to only two decisions (either the input byte or a zero byte,
except the most significant byte which is either the input byte or the byte length of that block).
Besides, it is also the case when the input is treated as a string of words (16-, 32-, 64- or 128-bit
words). This is much simpler than the classical 10∗ padding approach, where every word has a
lot of different possibilities when it comes to the location of the padding string. Besides, usually
implementations maintain the length of the message in a local variable/register, which means that
the padding information is already available, just a matter of placing it in the right place in the
message, as opposed to the decoder required to convert the message length into 10∗ padding.
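A byte-level sketch of this padding rule is given below. It assumes byte-aligned inputs, a 16-byte block, and that the "most significant byte" of the block is the last byte in memory order; the function name is ours.

```python
BLOCK_BYTES = 16  # n = 128 bits


def pad(block: bytes, size: int = BLOCK_BYTES) -> bytes:
    """Pad a final block: zero-fill, then store the byte length in the last byte."""
    if len(block) > size:
        raise ValueError("block longer than the block size")
    if len(block) == size:
        return block                     # full blocks are left untouched
    zeros = bytes(size - len(block) - 1)
    return block + zeros + bytes([len(block)])


padded = pad(b"\xaa\xbb\xcc")
print(padded.hex(), len(padded))
```

Every byte of the output is either an input byte, a zero byte, or (for the last byte only) the length, so no decoder is needed to locate a padding marker.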
Padding Circuit for Decryption. One of the main features of Romulus is that it is inverse
free and both the encryption and decryption algorithms are almost the same. However, it can be
tricky to understand the behavior of decryption when the last ciphertext block has length < n. In
order to understand padding in the decryption algorithm, we look at the ρ and ρ⁻¹ functions when
the input plaintext/ciphertext is partial. The ρ function applied on a partial plaintext block is
shown in Equation (6.1). If ρ⁻¹ is directly applied to pad_n(C), the corresponding output will be
incorrect, due to the truncation of the last ciphertext block. Hence, before applying ρ⁻¹ we need to
regenerate the truncated bits. It can be verified that C′ = pad_n(C) ⊕ msb_{n−|C|}(G(S)). Once C′ is
regenerated, ρ⁻¹ can be computed as shown in Equation (6.2):

\[
\begin{pmatrix} S' \\ C' \end{pmatrix}
= \begin{pmatrix} 1 & 1 \\ G & 1 \end{pmatrix}
\begin{pmatrix} S \\ \mathrm{pad}_n(M) \end{pmatrix}
\quad\text{and}\quad C = \mathrm{lsb}_{|M|}(C'). \tag{6.1}
\]

\[
C' = \mathrm{pad}_n(C) \oplus \mathrm{msb}_{n-|C|}(G(S))
\quad\text{and}\quad
\begin{pmatrix} S' \\ M \end{pmatrix}
= \begin{pmatrix} 1 \oplus G & 1 \\ G & 1 \end{pmatrix}
\begin{pmatrix} S \\ C' \end{pmatrix}. \tag{6.2}
\]

While this looks like a special padding function, in practice it is simple. First of all, G(S) needs
to be calculated anyway. Besides, the whole operation can be implemented in two steps:

\[
M = C \oplus \mathrm{lsb}_{|C|}(G(S)), \qquad S' = \mathrm{pad}_n(M) \oplus S,
\]

which can have a very simple hardware implementation, as discussed in the next paragraph.
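The two-step form can be checked with a short round trip on a partial block. This is a sketch under several assumptions: blocks are byte strings of 16 bytes, lsb_{|C|} is taken to mean the first |C| bytes, the padding stores the byte length in the last byte, and G is modeled by a byte-wise map with a single feedback bit (our reading of Section 2.5.2, to be checked against the specification). The round trip itself holds for any choice of G.

```python
BLOCK_BYTES = 16


def pad(block: bytes) -> bytes:
    """Zero-fill a short block and store its byte length in the last byte."""
    if len(block) == BLOCK_BYTES:
        return block
    return block + bytes(BLOCK_BYTES - len(block) - 1) + bytes([len(block)])


def G(state: bytes) -> bytes:
    """Byte-wise linear map with one feedback bit (assumed form of G)."""
    return bytes((b >> 1) ^ (b & 0x80) ^ ((b & 1) << 7) for b in state)


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def rho(state: bytes, m: bytes):
    """Encrypt a (possibly partial) block: S' = S xor pad(M), C = M xor G(S), truncated."""
    c = xor(m, G(state))                 # zip() truncates G(S) to |M| bytes
    return xor(state, pad(m)), c


def rho_inv(state: bytes, c: bytes):
    """Decrypt with the two steps from the text: recover M first, then pad M (not C)."""
    m = xor(c, G(state))
    return xor(state, pad(m)), m


S = bytes(range(16))
S_enc, C = rho(S, b"partial")
S_dec, M = rho_inv(S, C)
assert M == b"partial" and S_enc == S_dec
```

Both directions end in the same state because the value folded into S is always the padded plaintext, never the padded ciphertext.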
Encryption-Decryption Combined Circuit. One of the goals of Romulus is to be efficient for
implementations that require a combined encryption-decryption datapath. Hence, we made sure
that the algorithm is inverse free, i.e., it does not use the inverse function of Skinny or G(S).
Moreover, ρ and ρ⁻¹ can be implemented and combined using only one multiplexer, whose size
depends on the size of the input/output bus. The same circuit can be used to solve the padding
issue in decryption, by padding M instead of C. The tag verification operation simply checks if
ρ(S, 0^n) equals T, which can be serialized depending on the implementation of ρ.
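The single-multiplexer structure can be sketched in software as one function with a mode flag. The sketch handles full 16-byte blocks only; the form of G is an assumption (a byte-wise map with one feedback bit), the function names are ours, and the constant-time comparison via `hmac.compare_digest` is an implementation choice of this example rather than something the text prescribes.

```python
import hmac

BLOCK_BYTES = 16


def G(state: bytes) -> bytes:
    """Byte-wise linear map with one feedback bit (assumed form of G)."""
    return bytes((b >> 1) ^ (b & 0x80) ^ ((b & 1) << 7) for b in state)


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def rho_combined(state: bytes, data_in: bytes, decrypt: bool):
    """Shared datapath for full blocks: the output is always data_in xor G(S)."""
    data_out = xor(data_in, G(state))
    # The single multiplexer: the state absorbs the plaintext, which is the
    # input when encrypting and the output when decrypting.
    plaintext = data_out if decrypt else data_in
    return xor(state, plaintext), data_out


def verify_tag(final_state: bytes, tag: bytes) -> bool:
    """Check that rho(S, 0^n), i.e. G(S), equals the received tag T."""
    _, expected = rho_combined(final_state, bytes(BLOCK_BYTES), decrypt=False)
    return hmac.compare_digest(expected, tag)


S = bytes(range(16))
S1, C = rho_combined(S, b"sixteen byte msg", decrypt=False)
S2, M = rho_combined(S, C, decrypt=True)
assert (S1, M) == (S2, b"sixteen byte msg")
assert verify_tag(S, G(S))
```

The XOR with G(S) is shared by both directions, so the only mode-dependent element is the selection of which value feeds the state update.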
Choice of the G Matrix. We chose the position of G so that it is applied to the output state.
This removes the need of G for AD processing, which improves software performance. In Section 6.2,
we listed the security condition for G, and we choose our matrix G so that it meets these conditions
and suits well for various hardware and software.
We noticed that for lightweight applications, most implementations use an input/output bus of
width ≤ 32. Hence, we expect the implementation of ρ to be serialized depending on the bus size.
Consequently, the matrix used in iCOFB can be inefficient as it needs a feedback operation over 4
bytes, which requires up to 32 extra Flip-Flops in order to be serialized, something we are trying to
avoid in Romulus. Moreover, the serial operation of ρ differs from byte to byte, which requires
additional multiplexers.
However, we observed that if the input block is interpreted in a different order, both problems
can be avoided. First, it is impossible to satisfy the security requirements of G without any feedback
signals, i.e., when G is a bit permutation:
• If G is a bit permutation with at least one bit going to itself, then there is at least one
non-zero value on the diagonal, so I + G has at least 1 row that is all 0s.
• If G is a bit permutation without any bit going to itself, then every column in I + G has
exactly two 1’s. The sum of all rows in such matrix is the 0 vector, which means the rows are
linearly dependent. Hence, I + G is not invertible.
However, the number of feedback signals can be adjusted to our requirements, starting from only
1 feedback signal. Second, we noticed that the input block/state of length n bits can be treated
as n/w independent sub-blocks of w bits each. Hence, it is enough to design a matrix Gs of
size w × w bits and apply it independently n/w times, once to each sub-block. The operation applied on
each sub-block in this case is the same (i.e., we can distribute the feedback bits evenly across
the input block). Unfortunately, the choice of w and Gs that provides the optimal results depends
on the implementation architecture. However, we found that the best trade-off/balance across
different architectures is when w = 8 and Gs uses a single bit feedback.
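The invertibility argument can be checked numerically for w = 8. The sketch below compares a pure bit permutation (a rotation) with a single-feedback byte map; the latter is our assumed form of Gs and should be checked against Section 2.5.2. All function names are ours.

```python
def gs(b: int) -> int:
    """Assumed single-feedback byte map: shift right, feed bit 0 back into bit 7."""
    return (b >> 1) ^ (b & 0x80) ^ ((b & 1) << 7)


def rotate(b: int) -> int:
    """A pure bit permutation (rotation by one), with no feedback XOR."""
    return (b >> 1) | ((b & 1) << 7)


def rank_gf2(columns):
    """Rank over GF(2) of a matrix whose columns are given as integers."""
    basis = {}                           # leading-bit position -> reduced vector
    for v in columns:
        while v:
            top = v.bit_length() - 1
            if top not in basis:
                basis[top] = v
                break
            v ^= basis[top]
    return len(basis)


def rank_of_identity_plus(f) -> int:
    """Rank of I + F, where F is the matrix of the linear byte map f."""
    return rank_gf2([(1 << i) ^ f(1 << i) for i in range(8)])


print(rank_of_identity_plus(gs))      # 8: I + Gs is invertible
print(rank_of_identity_plus(rotate))  # 7: I + (bit permutation) is singular
```

A single XOR is enough to move from rank 7 to full rank, which is why one feedback signal per sub-block suffices.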
In order to verify our observations, we generated a family of matrices with different values of w
and Gs , and measured the cost of implementing each of them on different architectures.
7. Implementations
In this section we provide implementation results and estimates. The source code can be found on
our GitHub page: https://github.com/romulusae
7.1 Software Performances

7.1.1 Software implementations
The Skinny article presents extremely fast software implementations on various recent Intel processors,
some as low as 2.37 c/B. However, these bitslice implementations rely heavily on the
parallelism offered by some operating modes. In our case, this parallelism is not present as Romulus
is not a parallel mode. Therefore, the performance of Romulus on high-end servers will be closer to
20 c/B than 2 c/B.
However, in practice several easy solutions are possible to overcome this performance limitation.
One solution is to let the two communicating entities use short sessions, which would re-enable
the server side to parallelise the encryption/decryption of the various sessions. Another possible
solution to still use these very fast bitslice implementations is to let the server communicate
with several clients in parallel. This is in fact very probably what will happen in practice (a
server communicating with many clients is the main reason why fast software implementations are
interesting). Even in the case where the arriving data is always from new clients, bitslicing remains
possible by bitslicing the key schedule part of Skinny as well.
7.1.2 Micro-controller implementations

The Skinny article also reports very efficient micro-controller implementations of Skinny-128-128,
with various tradeoffs. One can also mention the good performance and rankings of a simple
table-based implementation of Skinny-128-128 in the FELICS benchmarks
(https://www.cryptolux.org/index.php/FELICS). Since these implementations do not require any
parallelism, they can directly be applied in Romulus. However, since Romulus uses bigger versions
of Skinny, the higher number of rounds will naturally reduce the performance accordingly. We expect
the effect of the Romulus mode on the performance to be minor since it is rate 1.
7.2 ASIC Performances

We have implemented the low-area, round-based and 4-round unrolled architectures of Romulus-N1.
The figures are expected to be similar for Romulus-N2, and about 550 GEs smaller and
12∼18% faster for Romulus-N3. Moreover, Romulus and Remus [22] share a similar structure and our
experimental results show that it costs only around 200 GEs to convert the implementation to the
misuse resistant variant. The implementations are compliant with the CAESAR Hardware API, so
a more lightweight interface can be designed. We added our estimations to the table labeled by ’*’.
Table 7.1: ASIC Implementations of Romulus-N1 using the TSMC 65nm standard cell library.
Power and Energy are estimated at 10 MHz. Energy is for 1 TBC call.

Variant          Cycles  Area w/o interface  Area   Minimum Delay  Throughput  Power  Energy  Thput/Area  NR Security  NM Security
                         (GE)                (GE)   (ns)           (Gbps)      (µW)   (pJ)    (Gbps/kGE)
Low Area         1264    -                   4498   0.8            0.1689      -      -       0.0376      128          -
Basic Iterative  60      5514                6620   1              2.78        548    32.8    0.42        128          -
Unrolled x4†     18      8231                9286   1.5            6.18        1325   23      0.67        128          -
Unrolled x4‡     18      9632                10748  1              9.27        -      -       0.86        128          -
† Minimum Area;
‡ 1 GHz;
Acknowledgments
The second and fourth authors are supported by the Temasek Labs grant (DSOCL16194).
Bibliography
[1] Banik, S., Bogdanov, A., Luykx, A., Tischhauser, E.: SUNDAE: Small Universal Deterministic
Authenticated Encryption for the Internet of Things. IACR Trans. Symmetric Cryptol. 2018(3)
(2018) 1–35
[2] Beierle, C., Jean, J., Kölbl, S., Leander, G., Moradi, A., Peyrin, T., Sasaki, Y., Sasdrich, P.,
Sim, S.M.: The SKINNY Family of Block Ciphers and Its Low-Latency Variant MANTIS.
In: CRYPTO 2016 (2). Volume 9815 of Lecture Notes in Computer Science., Springer (2016)
123–153
[3] Beierle, C., Jean, J., Kölbl, S., Leander, G., Moradi, A., Peyrin, T., Sasaki, Y., Sasdrich, P.,
Sim, S.M.: The SKINNY Family of Block Ciphers and its Low-Latency Variant MANTIS.
IACR Cryptology ePrint Archive 2016 (2016) 660
[4] Beierle, C., Jean, J., Kölbl, S., Leander, G., Moradi, A., Peyrin, T., Sasaki, Y., Sasdrich,
P., Sim, S.M.: SKINNY-AEAD and SKINNY-HASH. Submission to NIST Lightweight
Cryptography Project (2019)
[5] Bellare, M., Boldyreva, A., Palacio, A.: An Uninstantiable Random-Oracle-Model Scheme
for a Hybrid-Encryption Problem. In: EUROCRYPT 2004. Volume 3027 of Lecture Notes in
Computer Science., Springer (2004) 171–188
[6] Bellare, M., Namprempre, C.: Authenticated Encryption: Relations among Notions and
Analysis of the Generic Composition Paradigm. J. Cryptology 21(4) (2008) 469–491
[7] Bellare, M., Rogaway, P., Wagner, D.A.: The EAX Mode of Operation. In: FSE 2004. Volume
3017 of Lecture Notes in Computer Science., Springer (2004) 389–407
[8] Benadjila, R., Guo, J., Lomné, V., Peyrin, T.: Implementing Lightweight Block Ciphers on x86
Architectures. In: SAC 2013. Volume 8282 of Lecture Notes in Computer Science., Springer
(2013) 324–351
[9] Bertoni, G., Daemen, J., Peeters, M., Assche, G.V.: Duplexing the Sponge: Single-Pass
Authenticated Encryption and Other Applications. In: SAC 2011. Volume 7118 of Lecture
Notes in Computer Science., Springer (2011) 320–337
[10] Bogdanov, A., Lauridsen, M.M., Tischhauser, E.: Comb to Pipeline: Fast Software Encryption
Revisited. In Leander, G., ed.: FSE 2015. Volume 9054 of Lecture Notes in Computer Science.,
Springer (2015) 150–171
[11] Canetti, R., Goldreich, O., Halevi, S.: The Random Oracle Methodology, Revisited (Preliminary
Version). In: STOC, ACM (1998) 209–218
[12] Chakraborti, A., Datta, N., Nandi, M., Yasuda, K.: Beetle Family of Lightweight and Secure
Authenticated Encryption Ciphers. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2018(2)
(2018) 218–241
[13] Chakraborti, A., Iwata, T., Minematsu, K., Nandi, M.: Blockcipher-Based Authenticated
Encryption: How Small Can We Go? In: CHES 2017. Volume 10529 of Lecture Notes in
Computer Science., Springer (2017) 277–298
[14] Chakraborti, A., Iwata, T., Minematsu, K., Nandi, M.: Blockcipher-based Authenticated
Encryption: How Small Can We Go? (Full version of [13]). IACR Cryptology ePrint Archive
2017 (2017) 649
[15] Cogliati, B., Lee, J., Seurin, Y.: New Constructions of MACs from (Tweakable) Block Ciphers.
IACR Trans. Symmetric Cryptol. 2017(2) (2017) 27–58
[16] Dobraunig, C., Eichlseder, M., Mendel, F., Schläffer, M.: Ascon v1.2. Submission to the
CAESAR Competition (2016)
[17] George Mason University: ATHENa: Automated Tools for Hardware EvaluatioN.
https://cryptography.gmu.edu/athena/ (2017)
[18] Groß, H., Wenger, E., Dobraunig, C., Ehrenhöfer, C.: Suit up!–Made-to-Measure Hardware
Implementations of ASCON. In: 2015 Euromicro Conference on Digital System Design, IEEE
(2015) 645–652
[19] Handschuh, H., Preneel, B.: Key-Recovery Attacks on Universal Hash Function Based MAC
Algorithms. In: CRYPTO 2008. Volume 5157 of Lecture Notes in Computer Science., Springer
(2008) 144–161
[20] Hirose, S.: Some Plausible Constructions of Double-Block-Length Hash Functions. In: FSE
2006. Volume 4047 of Lecture Notes in Computer Science., Springer (2006) 210–225
[21] Iwata, T., Khairallah, M., Minematsu, K., Peyrin, T.: Duel of the Titans: The Romulus and
Remus Families of Lightweight AEAD Algorithms. IACR Cryptology ePrint Archive 2019
(2019) 992
[22] Iwata, T., Khairallah, M., Minematsu, K., Peyrin, T.: Remus v1. Submission to NIST
Lightweight Cryptography Project (2019)
[23] Iwata, T., Minematsu, K., Peyrin, T., Seurin, Y.: ZMAC: A Fast Tweakable Block Cipher
Mode for Highly Secure Message Authentication. In: CRYPTO 2017 (3). Volume 10403 of
Lecture Notes in Computer Science., Springer (2017) 34–65
[24] Jean, J., Moradi, A., Peyrin, T., Sasdrich, P.: Bit-Sliding: A Generic Technique for Bit-Serial
Implementations of SPN-based Primitives - Applications to AES, PRESENT and SKINNY. In:
CHES 2017. Volume 10529 of Lecture Notes in Computer Science., Springer (2017) 687–707
[25] Jean, J., Nikolic, I., Peyrin, T.: Tweaks and Keys for Block Ciphers: The TWEAKEY
Framework. In: ASIACRYPT 2014 (2). Volume 8874 of Lecture Notes in Computer Science.,
Springer (2014) 274–288
[26] Joux, A.: Authentication Failures in NIST Version of GCM. Comments submitted to
NIST Modes of Operation Process (2006) Available at
http://csrc.nist.gov/groups/ST/toolkit/BCM/documents/comments/800-38_Series-Drafts/GCM/Joux_comments.pdf.
[27] Khairallah, M., Chattopadhyay, A., Peyrin, T.: Looting the LUTs: FPGA Optimization of
AES and AES-like Ciphers for Authenticated Encryption. In: INDOCRYPT 2017. Volume
[28] Krovetz, T., Rogaway, P.: The Software Performance of Authenticated-Encryption Modes. In:
FSE 2011. Volume 6733 of Lecture Notes in Computer Science., Springer (2011) 306–327
[29] Kumar, S., Haj-Yihia, J., Khairallah, M., Chattopadhyay, A.: A Comprehensive Performance
Analysis of Hardware Implementations of CAESAR Candidates. IACR Cryptology ePrint
Archive 2017 (2017) 1261
[30] Liskov, M., Rivest, R.L., Wagner, D.A.: Tweakable Block Ciphers. In: CRYPTO 2002. Volume
[31] Liu, G., Ghosh, M., Song, L.: Security Analysis of SKINNY under Related-Tweakey Settings
(Long Paper). IACR Trans. Symmetric Cryptol. 2017(3) (2017) 37–72
[32] Medwed, M., Standaert, F., Großschädl, J., Regazzoni, F.: Fresh Re-keying: Security against
Side-Channel and Fault Attacks for Low-Cost Devices. In: AFRICACRYPT 2010. Volume
[33] Messerges, T.S.: Securing the AES Finalists Against Power Analysis Attacks. In: FSE 2000.
Volume 1978 of Lecture Notes in Computer Science., Springer (2000) 150–164
[34] Moradi, A., Poschmann, A., Ling, S., Paar, C., Wang, H.: Pushing the Limits: A Very
Compact and a Threshold Implementation of AES. In: EUROCRYPT 2011. Volume 6632 of
Lecture Notes in Computer Science., Springer (2011) 69–88
[35] Naito, Y., Sugawara, T.: Lightweight Authenticated Encryption Mode of Operation for
Tweakable Block Ciphers. IACR Cryptology ePrint Archive 2019 (2019) 339
[36] Peyrin, T., Seurin, Y.: Counter-in-Tweak: Authenticated Encryption Modes for Tweakable
Block Ciphers. In: CRYPTO 2016 (1). Volume 9814 of Lecture Notes in Computer Science.,
Springer (2016) 33–63
[37] Rogaway, P.: Efficient Instantiations of Tweakable Blockciphers and Refinements to Modes
OCB and PMAC. In: ASIACRYPT 2004. Volume 3329 of Lecture Notes in Computer Science.,
Springer (2004) 16–31
[38] Rogaway, P.: Nonce-Based Symmetric Encryption. In: FSE 2004. Volume 3017 of Lecture
Notes in Computer Science., Springer (2004) 348–359
[39] Rogaway, P., Shrimpton, T.: A Provable-Security Treatment of the Key-Wrap Problem. In:
EUROCRYPT 2006. Volume 4004 of Lecture Notes in Computer Science., Springer (2006)
373–390
[40] Sadeghi, S., Mohammadi, T., Bagheri, N.: Cryptanalysis of Reduced round SKINNY Block
Cipher. IACR Trans. Symmetric Cryptol. 2018(3) (2018) 124–162
A. Appendix
Table A.1: Domain separation byte B of Romulus. Bits b7 and b6 are to be set to the appropriate
value according to the parameter sets.

Family     b7 b6 b5 b4 b3 b2 b1 b0  int(B)  case
Romulus-N  -  -  0  0  1  0  0  0   8       A main
Romulus-N  -  -  0  1  1  0  0  0   24      A last unpadded
Romulus-N  -  -  0  1  1  0  1  0   26      A last padded
Romulus-N  -  -  0  0  0  1  0  0   4       M main
Romulus-N  -  -  0  1  0  1  0  0   20      M last unpadded
Romulus-N  -  -  0  1  0  1  0  1   21      M last padded
Romulus-M  -  -  1  0  1  0  0  0   40      A main
Romulus-M  -  -  1  0  1  1  0  0   44      M auth main
Romulus-M  -  -  1  1  1  1  1  1   63      w: (even,even,padded,padded)
Romulus-M  -  -  1  1  1  1  1  0   62      w: (even,even,padded,unpadded)
Romulus-M  -  -  1  1  1  1  0  1   61      w: (even,even,unpadded,padded)
Romulus-M  -  -  1  1  1  1  0  0   60      w: (even,even,unpadded,unpadded)
Romulus-M  -  -  1  1  1  0  1  1   59      w: (even,odd,padded,padded)
Romulus-M  -  -  1  1  1  0  1  0   58      w: (even,odd,padded,unpadded)
Romulus-M  -  -  1  1  1  0  0  1   57      w: (even,odd,unpadded,padded)
Romulus-M  -  -  1  1  1  0  0  0   56      w: (even,odd,unpadded,unpadded)
Romulus-M  -  -  1  1  0  1  1  1   55      w: (odd,even,padded,padded)
Romulus-M  -  -  1  1  0  1  1  0   54      w: (odd,even,padded,unpadded)
Romulus-M  -  -  1  1  0  1  0  1   53      w: (odd,even,unpadded,padded)
Romulus-M  -  -  1  1  0  1  0  0   52      w: (odd,even,unpadded,unpadded)
Romulus-M  -  -  1  1  0  0  1  1   51      w: (odd,odd,padded,padded)
Romulus-M  -  -  1  1  0  0  1  0   50      w: (odd,odd,padded,unpadded)
Romulus-M  -  -  1  1  0  0  0  1   49      w: (odd,odd,unpadded,padded)
Romulus-M  -  -  1  1  0  0  0  0   48      w: (odd,odd,unpadded,unpadded)
Romulus-M  -  -  1  0  0  1  0  0   36      M enc main
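The sixteen w values for Romulus-M follow a regular pattern: bits b5 and b4 are set, and b3 to b0 encode the four entries of the tuple in order, with "even" and "padded" mapped to 1. The sketch below reproduces the int(B) column from that observation. The function and argument names are ours, reading the tuple as (parity of a, parity of m, padding of the last AD block, padding of the last message block) is an assumption based on the case labels of Section 2.5.4, and b7, b6 are left at zero.

```python
from itertools import product


def w_byte(a_even: bool, m_even: bool, ad_padded: bool, msg_padded: bool) -> int:
    """Low six bits of B for the 'w' cases of Table A.1 (b7 and b6 left at 0)."""
    return (0b11 << 4) | (a_even << 3) | (m_even << 2) | (ad_padded << 1) | int(msg_padded)


# Rebuild the int(B) column in table order: 63 down to 48.
values = [w_byte(*bits) for bits in product([True, False], repeat=4)]
print(values)
```

For instance, `w_byte(True, False, False, False)` corresponds to the row (even,odd,unpadded,unpadded).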
B. Changelog
• 29-03-2019: version v1.0

• 06-06-2019: version v1.01
– added link to webpage and GitHub
• 22-07-2019: version v1.1
– added an improved authenticity bound of Romulus-N in Section 4.2, and a corrected
and improved nonce-misusing authenticity bound of Romulus-M in Section 4.3. Both
improvements are based on the analysis in [35], and the analysis on Romulus-M is also
based on [15].
– added Section 2.5.5 to describe some possible options for a cryptographic hash function
based on Skinny.
• 20-09-2019: version v1.2
– added two paragraphs in Section 6 on how the design choices relate to the serial low-area
hardware implementations. Added Figure 6.2.
– added the synthesis results for the low area implementation in Table 7.1.