Qualificação
Faculdade de Tecnologia
Departamento de Engenharia Elétrica
Orientador
Prof. Dr. Daniel Guerreiro e Silva
Coorientador
Prof. Dr. Anderson Clayton Alves do Nascimento
Brasília
2022
Universidade de Brasília
Prof. Dr. João José Costa Gondim, Universidade de Brasília
Prof. Dr. Denis Gustavo Fantinato, Universidade de Campinas
Abstract
Machine learning (ML) applications have become increasingly frequent and pervasive in
many areas of our lives. We enjoy customized services based on predictive models built
with our private data. There are, however, growing concerns about privacy, as evidenced
by the enactment of the General Data Protection Law in Brazil and by similar legislative
initiatives in the European Union and in several other countries.
This trade-off between privacy and the benefits of ML applications can be mitigated
with the use of techniques that allow the construction and operation of these computational
models with formal guarantees of privacy preservation. These techniques need to
respond adequately to challenges posed at every stage of the typical ML application life
cycle, from data discovery, through feature extraction, model training and validation, up
to its effective application.
This work presents an architecture for Privacy-Preserving Machine Learning (PPML)
solutions built on homomorphic cryptography primitives and Secure Multi-party Com-
putation (MPC) protocols, which allow the adequate treatment of data and the efficient
application of ML algorithms with robust guarantees of privacy preservation. It describes
a concrete implementation of the proposed PPML architecture and demonstrates its use
in two case studies: text classification for fake news detection and image classification for
breast cancer detection and staging.
Resumo
Machine learning (ML) applications have become increasingly recurrent and pervasive
across the many areas of our lives. We enjoy personalized services based on predictive
models built with our private data. There is, however, growing concern about privacy.
The General Data Protection Law in Brazil and similar legislative initiatives in the
European Union and in several other countries are evidence of this.
This trade-off between privacy and the benefits of ML applications can be mitigated
by using techniques that allow these computational models to be built and operated with
formal, mathematical guarantees that user privacy is preserved. These techniques must
respond adequately to the challenges posed at every stage of the typical ML application
life cycle, from data discovery, through feature extraction, model training and validation,
up to the model's effective use.
This work presents an architecture for Privacy-Preserving Machine Learning (PPML),
built on homomorphic encryption primitives and secure multi-party computation (MPC)
protocols, which allow the proper handling of data and the efficient application of ML
algorithms with robust privacy guarantees. The work also provides a concrete
implementation of the proposed architecture and its application to two relevant and
sensitive topics: text classification for fake news detection and image classification for
breast cancer detection and staging.
Contents
1 Introduction
1.1 Research subject
1.2 Motivation
1.3 Research objectives
1.4 Methodology
1.4.1 Preliminary results
1.4.2 Limitations
1.5 Organization
3 Text classification
3.1 Natural Language Processing
3.2 Classic NLP preprocessing techniques
3.2.1 Tokenization
3.2.2 Stopword removal (SwR)
3.2.3 Stemming
3.2.4 Lemmatization
3.2.5 Part-of-Speech (PoS) tagging
3.2.6 Bag-of-Words (BoW)
3.2.7 Term Frequency – Inverse Document Frequency (TF-IDF)
3.2.8 Continuous bag-of-words (CBoW)
3.3 The State-of-the-Art: Transformers
3.3.1 Trade-offs and applications
6 Conclusion
6.1 Experimental results
6.1.1 Future work
Bibliography
Appendix
I MPC Protocols
List of Figures
List of Tables
Acronyms
HE Homomorphic Encryption.
ML Machine Learning.
1 Introduction
Regulations such as Brazil's General Data Protection Law and the California Consumer
Privacy Act of 2018 (CCPA) are evidence of this move to limit and regulate the use of
private information by large service providers [3, 4].
This trade-off between privacy and the benefits of ML applications can be mitigated
with the use of techniques that allow the training and application of predictive models
while preserving user privacy. These techniques need to respond adequately to the chal-
lenges presented at all stages in the life-cycle of a typical ML application: from data
discovery, through the data wrangling stage (feature selection, feature extraction, combi-
nation, normalization and imputation in the sample space), to the training and validation
of the models, until their effective use in inference.
Note that the use of encrypted databases presupposes the prompt availability of the
entire original database, as well as large computing capability for encryption and for
the training of inference models on the part of the data owner, which is not always
the case. Seeking to overcome this limitation, there is also a line of work focused on
online, interactive and reinforcement learning techniques. These applications require
Secure Multi-party Computation (MPC) and the composition of several protocols as the
basis for the learning algorithms [19].
The main challenge, in all these lines of research, is the distance between the logical
or theoretical design of the proposed solutions and their intended practical applications.
The most cited works in the area generally do not show how, or how effectively, the
proposed system or protocol deals with problems of scale, response time and usability,
among others, which drastically affect its ability to support a real decision process in
real use cases.
Furthermore, most works in the literature focus on just one phase of the machine learning
life cycle: either the training-related steps or the application of the trained model in
inference (for both regression and classification models). Little or no attention is given
to the initial steps, such as data fitting and feature extraction, activities that are
extremely relevant for the overall performance of the predictive models. Works that bring
any analysis of the statistical robustness or the quality of the models produced are also
rare.
Therefore, the selected subject for the current research is not limited to isolated PPML
techniques for privacy-preserving model training or inference. The proposed contribution,
in fact, is the specification of a general architecture for the application of ML techniques
with robust guarantees of privacy. Attention is given to how the well established ML
solutions, such as transfer learning, can interact with the privacy enhancing technologies
without loss of predictive power or usability.
Also, the complete knowledge production cycle is considered, from basic theory to
the assessment of the impact of its applications. Hence the choice of building a concrete
implementation and analyzing two case studies for the architecture: one in the NLP
domain and one in the computer vision domain. Practical implementation details, such
as ease of use for the general public and realistic computational cost analysis, are also
examined.
For the NLP problem, this study dives into the use case of text classification for fake
news detection. For the computer vision problem, there is a use case of image classification
for the detection and categorization of breast cancer.
1.2 Motivation
This work results from both the author's research as a doctoral student at the Graduate
Program in Electrical Engineering (PPGEE) at Universidade de Brasília and his work
as a Data Scientist at the Administrative Council for Economic Defense (Cade). It also
draws on experience with homomorphic cryptography gathered during the author's
master's research [18].
The scientific investigation proposed in this project is primarily based on the need to
complete the knowledge production cycle, as discussed above, with the effective develop-
ment, application and validation of a solution based on the state-of-the-art in the fields
of Machine Learning and privacy enhancing technologies. These are somewhat disparate
concepts: while ML is focused on extracting information, cryptography is focused on
concealing it. Therefore, there is a contribution from the theoretical point of view,
considering the exposition on how to couple and harmonize such different groups of
techniques.
Another contribution is the discussion of adequate security models for the specific
tasks of text and image classification. There is also a contribution from a technical point
of view, with the carrying out of experiments that may serve as a reference implementation
of the proposed architecture. The most relevant contribution, nevertheless, is the practical
application, with a relevant impact on the institutions involved; in the present case, the
specific contribution to Cade, the Brazilian competition authority.
The intelligence unit at Cade deals with an enormous quantity of data, from large public
procurement databases to open intelligence sources such as news articles, online
marketplaces and company websites. The unit also performs dawn raids, collecting
documents from investigated companies: on paper, on computers, hard drives, executives'
smartphones, etc. Some operations are carried out in cooperation with prosecutors and
the police, at the federal or state level. Some, involving multinationals, are coordinated
with competition authorities in other countries.
All this intelligence activity, nonetheless, is bounded by the data protection laws in
force in Brazil, which require formal guarantees of privacy protection [20]. While
searching for evidence of cartel or other anticompetitive conduct, Cade needs to protect
the privacy of the individuals involved, whether executives, legal representatives or other
persons somehow related to the investigated economic agents. And a considerable portion
of this data lies in textual and image formats: hence the importance of privacy in text
and image classification.
1.3 Research objectives
The general goal of the research work described in this document is to present an
architecture, that is, a cohesive, logically and formally integrated set of techniques,
combining cryptographic and MPC protocols as well as ML algorithms, that provides
robust privacy guarantees for real inference applications, with special attention to text
and image classification tasks.
The proposed goal is to be attained through the following specific objectives:
• Propose and test improvements, new techniques and implementation details that
may correct and overcome major flaws and limitations pointed out in previous so-
lutions;
1.4 Methodology
Experience shows, as reported by Trauth [21], that when a study in the area of technology
development seeks to understand the impact of a given solution, there is a need for the
application of qualitative methodologies.
Also, according to Deb, Dey & Balas [22], engineering research must effectively combine
the conceptualization of a research question with practical problems, from equipment
to algorithms and mathematical concepts used to solve the proposed problem. Furthermore,
engineering research should advance knowledge in three broad, and somewhat
overlapping, areas: observational data, that is, knowledge of the phenomena; functional
modeling of the observed phenomena; and the design of processes (algorithms, procedures,
arrangements) that contribute to the desired output. This gives engineering research a
strong descriptive character, as one must clearly communicate the preconditions and
environmental dependencies, observed data, processes, inputs and outputs of experimental
results.
Kaplan and Duchon state that the eminently applied nature of research in the field
of information systems requires a combination of qualitative and quantitative methods
[23]. They assert that quantitative investigations, usually performed through statisti-
cal hypothesis testing, are extremely limited when the expected results or the intended
applications are highly dependent on context.
They refer to the work of Yin [24], which deals with methods for Case Study research,
to show that quantitative research must be preceded by a qualitative investigation, in
which the problem is better defined based on the observation of the context, habits and
needs of the stakeholders. This is even more relevant in exploratory research, in which
new hypotheses are raised. These initial hypotheses need contextualization, through
qualitative investigation, to create more refined models and hypotheses, which can then
be tested using quantitative methods.
Taking into account the research subject and the proposed objectives, this work is ex-
ploratory in nature. It sources from different areas of study to propose a new technological
arrangement. It is also of applied nature, as the method used for scientific investigation
is centered on the design of a solution for a specific problem. Therefore, it must combine
descriptive, qualitative and quantitative approaches to the study of the selected problem.
Overall, this work can be described as the combination of the following initiatives:
1.4.1 Preliminary results
As a preliminary result, the author, in collaboration with graduate students from
Universidade de Campinas (Unicamp), published a solution that uses the source-based
detection approach, by identifying autonomous software agents (bots) [25].
The author also reported a few experimental results on NLP preprocessing techniques
at the seminar organized by the Digital Signal Processing Group (GPDS). These results
are detailed in the chapter dedicated to Natural Language Processing in this work.
The experiments used to compare the computational cost and discuss the expected
execution time of HE and MPC protocols are presented in [26] and are part of the chapter
on PPML.
1.4.2 Limitations
As a result of the extensive research on different topics (ML, NLP, Computer Vision,
HE and MPC), this document is not meant as a deep exposition of any of these areas.
It provides, nonetheless, a good set of references for those willing to further investigate
relevant results in each area.
This work also does not set out to be a reference for complete security proofs of all
the cryptographic primitives and protocols used. There is, however, sufficient discussion
of the security models, that is, the assumptions and requirements under which the
security of the underlying privacy-enhancing technologies holds.
There are projects underway to apply the knowledge gathered in this study, as well as
the proposed architecture, to document classification and evidence search at Cade. These
are critical use cases, especially when they involve cooperation and information sharing
with other government agencies and with competition authorities in other countries.
However, due to confidentiality requirements, it is not possible to expose results or
present reproducible experiments performed over such data.
All the experimental details exposed are limited to the two proposed use cases: fake
news detection and breast cancer detection and staging. There is also the choice to treat
both tasks as binary output problems. Thus, texts are classified as either fake or true,
and, for images, binary one-vs-all models are trained for each class of carcinoma. This
choice results from the fact that most public datasets on both topics are annotated that
way. The same techniques can be generalized, with some preprocessing effort, to deal
with multi-class problems using other approaches that would render multi-label
probabilities.
1.5 Organization
The next chapter brings a more detailed exposition of the concept of Privacy-Preserving
Machine Learning (PPML), with selected results from the literature. It also presents a
general architecture for PPML. Chapter 3 presents a literature review on Natural
Language Processing, from the classic preprocessing techniques developed throughout the
last three or four decades to the present state of the art with Transformers and other
complex Natural Language Understanding models. The fourth chapter brings the concept
of fake news and a brief review of fake news detection, and introduces a few experimental
results on the fake news detection use case. Chapter 5 presents a short literature review
on computer vision, transfer learning and embeddings for image processing applications,
and describes the ongoing work on image classification for cancer detection and staging.
The last chapter summarizes our results and conclusions, pointing out, to the best of our
knowledge, how our proposition contributes to the field of privacy-preserving machine
learning.
2 An Architecture for
Privacy-Preserving Machine
Learning
Encrypted databases are solutions that attracted a lot of attention, even from the
mainstream media, as, for example, MIT's CryptDB [16]. In this line there are also many
practical solutions, including areas as sensitive as Electronic Health Records [15, 18].
Note that the use of encrypted databases usually requires the prompt availability of the
entire database, and the computing power needed for encryption, on the side of the data
owner, which is not always the case. Also, training inference models requires at least
a few communication rounds between the data owner and the service provider, in order
to compute the loss function and adjust the trained parameters accordingly. This process
is computationally costly and complicates the security analysis of the solution. Seeking
to overcome this limitation, there is also a line of work focused on online, distributed
and interactive machine learning techniques. These applications are based on Secure
Multi-party Computation (MPC) and on the composition of several protocols as the basis
for the learning algorithms [19].
Since then, privacy-preserving computation has grown in importance and attention, and
many Privacy-Preserving Machine Learning (PPML) and Privacy-Preserving Function
Evaluation (PPFE) frameworks have been developed. The protocols that form the basic
building blocks of these frameworks are usually based on homomorphic cryptography
primitives or on Secure Multi-Party Computation protocols. Some of the first frameworks
to appear in the literature, for instance, used MPC protocols based on secret sharing.
Among those early results are FairplayMP [27] and Sharemind [28].
Recent developments include frameworks like PySyft [29], which uses HE protocols,
and Chameleon [30], which uses secret-sharing-based MPC for linear operations and
Garbled Circuits (another secure computation technique) for non-linear evaluations.
Other MPC frameworks in the literature include CrypTen [31], PICCO [32], TinyGarble
[33], ABY3 [34] and SecureML [35].
Most of these results were demonstrated with proof-of-concept applications focused on
the inference step of the machine learning solution. Usually, however, there is no
discussion of how these frameworks accommodate heavily used practices, such as feature
engineering and transfer learning, that are defining characteristics of the different
areas of machine learning.
Hence the need to systematize knowledge on how to connect the common steps of
ML with the common privacy-enhancing technologies. The rest of this chapter presents,
in a little more detail, the two classes of privacy-enhancing techniques discussed above,
MPC and HE, along with a way to reason about their implementation complexity and
computational cost in order to select the best fit for different applications. At the end,
we present a general design for a PPML solution that can be implemented with most of
the APIs and frameworks listed above.
2.1 Secure Multi-Party Computation
Introduced by Yao [36], Secure Multi-Party Computation refers to a set of protocols and
algorithms that allow a group of computing parties P to evaluate a function F(X) over
a set X = (x_1, x_2, ..., x_n) of private inputs, in a way that guarantees participants
gain knowledge only of the global function result, and not of each other's inputs.
Additive secret sharing is one way to implement MPC. Protocol parties hold additive
shares of the secret values and perform joint computations over those shares. For
example, to create n additive shares of a secret value x ∈ Z_q, a participant can draw
(x_1, ..., x_n) uniformly from {0, ..., q − 1} such that x = Σ_{i=1}^n x_i mod q. We
denote this sharing by ⟦x⟧_q.

Protocol: π_ADD
Input: secret shares ⟦x_1⟧_q, ..., ⟦x_n⟧_q
Output: ⟦z⟧_q = Σ_{j=1}^n ⟦x_j⟧_q
Execution:
1. Each party P_i ∈ P locally computes z_i = Σ_{j=1}^n x_{j,i}, where x_{j,i} is its share of x_j
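The sharing and reconstruction steps above, together with π_ADD, can be sketched in a few lines of Python (a minimal illustration; the modulus q and the number of parties are arbitrary choices, and a real deployment would also need secure channels between the parties):

```python
import random

q = 2**31 - 1  # an arbitrary prime modulus, for illustration only

def share(x, n):
    """Split secret x into n additive shares modulo q."""
    shares = [random.randrange(q) for _ in range(n - 1)]
    shares.append((x - sum(shares)) % q)  # last share makes the sum equal x
    return shares

def reconstruct(shares):
    return sum(shares) % q

def pi_add(all_shares):
    # pi_ADD: each party locally adds its share of every input secret;
    # the resulting shares reconstruct to the sum of the secrets.
    n = len(all_shares[0])
    return [sum(s[i] for s in all_shares) % q for i in range(n)]

x_sh, y_sh = share(20, 3), share(22, 3)
z_sh = pi_add([x_sh, y_sh])
assert reconstruct(z_sh) == 42  # 20 + 22, with no party seeing x or y
```

Note that each individual share is uniformly random, so no proper subset of fewer than n shares reveals anything about the secret.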
With these two basic building blocks, addition and multiplication, it is possible to
compose protocols to perform virtually any computation. For instance, protocol π_Eq
uses π_MUL to provide secure distributed equality testing. Appendix I provides
definitions for other additive-secret-sharing MPC protocols, such as the Secure
Multi-party Inner Product protocol π_IP, the Secure Multi-party Bitwise OR/XOR protocol
π_OR|XOR, the Secure Multi-party Argmax protocol π_argmax and the Secure Multi-party
Order Comparison protocol π_DC.

Protocol: π_MUL
Setup:
1. The Trusted Initializer draws u, v, w uniformly from Z_q, such that w = uv, and
distributes shares ⟦u⟧_q, ⟦v⟧_q and ⟦w⟧_q to the protocol parties
2. The TI draws ι uniformly from {1, ..., n} and sends the asymmetric bit 1 to party P_ι
and the bit 0 to the parties P_{i≠ι}
Input: ⟦x⟧_q and ⟦y⟧_q
Output: ⟦z⟧_q = ⟦xy⟧_q
Execution:
1. Each party P_i computes d_i ← x_i − u_i and e_i ← y_i − v_i
2. Parties broadcast d_i, e_i
3. Each party computes d ← Σ_{i=1}^n d_i and e ← Σ_{i=1}^n e_i
4. Each party P_i computes z_i ← w_i + d·v_i + e·u_i, and the party holding the
asymmetric bit 1 adds d·e to its share
Protocol: π_Eq
Setup: the setup procedure for π_MUL
Input: ⟦x⟧_q and ⟦y⟧_q
Output: ⟦z⟧_q = ⟦0⟧_q if x = y; ⟦z⟧_q ≠ ⟦0⟧_q otherwise
Execution:
3. Output ⟦z⟧_q
All of these protocols are built on the commodity-based model [37]. In this approach,
there is a costly offline phase, led by a Trusted Initializer (TI), which pre-distributes
correlated random numbers. This role can be performed by an independent agent or by
one of the computing parties, without loss of generality or of the security guarantees
of the online phase.
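Combining the commodity-based setup with a Beaver-style multiplication, a two-party sketch might look as follows (toy code: the TI, the parties and the network are all simulated in one process, and the same triple is reused across calls, which a real protocol must never allow):

```python
import random

q = 2**31 - 1  # arbitrary prime modulus for illustration

def share(x, n=2):
    s = [random.randrange(q) for _ in range(n - 1)]
    return s + [(x - sum(s)) % q]

def reconstruct(sh):
    return sum(sh) % q

# Offline phase: the Trusted Initializer samples a triple with w = u*v
u, v = random.randrange(q), random.randrange(q)
u_sh, v_sh, w_sh = share(u), share(v), share((u * v) % q)

def pi_mul(x_sh, y_sh):
    # Online phase: parties broadcast d_i = x_i - u_i and e_i = y_i - v_i,
    # so only the masked differences d = x - u and e = y - v become public.
    d = reconstruct([(x_sh[i] - u_sh[i]) % q for i in range(2)])
    e = reconstruct([(y_sh[i] - v_sh[i]) % q for i in range(2)])
    # Each party computes z_i = w_i + d*v_i + e*u_i; the party holding the
    # asymmetric bit also adds the public product d*e.
    z = [(w_sh[i] + d * v_sh[i] + e * u_sh[i]) % q for i in range(2)]
    z[0] = (z[0] + d * e) % q  # say P_1 holds the asymmetric bit
    return z

assert reconstruct(pi_mul(share(6), share(7))) == 42
```

The correctness follows from xy = (u + d)(v + e) = w + dv + eu + de, where w, u, v stay secret-shared and only d and e are revealed.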
In order to learn from data, a machine learning algorithm will usually update a set
β of internal model parameters by iterating a computation over the training set,
comparing the result of a function F(x_i) = ŷ_i over the properties of each element in
the sample with the expected inference output y_i. This difference, ŷ − y, defines the
'loss' function, which is used to update the internal parameters with an intensity given
by a pre-defined learning rate.
Consider, for instance, Linear Regression, one of the most commonly used models. It
consists in the multiplication of the model parameters β with a matrix X ∈ Z_q^{n×k}
representing the feature values of all the elements in the training set. The learning
goal is to find a coefficient vector β = (β_0, β_1, ..., β_k) that minimizes the mean
squared error

    (1/n) Σ_{i=1}^n (β x_i − y_i)²    (2.1)

which admits the closed-form solution

    β = (X^T X)^{−1} X^T y    (2.2)
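For a toy dataset, the closed-form solution (2.2) can be evaluated directly; the sketch below fits the line y = 1 + 2x to three points using plain Python (the dataset and the explicit 2×2 inverse are illustrative choices):

```python
# Fit y ~ b0 + b1*x for points (0, 1), (1, 3), (2, 5) via beta = (X^T X)^-1 X^T y
X = [[1, 0], [1, 1], [1, 2]]          # first column of ones carries the intercept
y = [1, 3, 5]

# X^T X (2x2) and X^T y (2-vector)
xtx = [[sum(r[i] * r[j] for r in X) for j in range(2)] for i in range(2)]
xty = [sum(X[k][i] * y[k] for k in range(3)) for i in range(2)]

# Explicit inverse of the 2x2 matrix X^T X
det = xtx[0][0] * xtx[1][1] - xtx[0][1] * xtx[1][0]
inv = [[xtx[1][1] / det, -xtx[0][1] / det],
       [-xtx[1][0] / det, xtx[0][0] / det]]

beta = [sum(inv[i][j] * xty[j] for j in range(2)) for i in range(2)]
# beta == [1.0, 2.0]: the line y = 1 + 2x fits these points exactly
```

In a PPML setting, the matrix products X^T X and X^T y are exactly the operations delegated to a secure matrix multiplication protocol such as π_MMUL below.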
Protocol: π_MMUL
Setup: the TI chooses uniformly random A_x, A_w ∈ Z_q^{n1×n2}, B_x, B_w ∈ Z_q^{n2×n3}
and T ∈ Z_q^{n1×n3}, and distributes the values A_x, B_x and T to party P_i, and the
values A_w, B_w and C = (A_x·B_w + A_w·B_x − T) to P_{j≠i}
Input: ⟦X⟧_q and ⟦W⟧_q
Output: ⟦XW⟧_q
Execution:
2. P_j sends (X − A_w) and (W − B_w) to P_i
3. Output ⟦XW⟧_q
2.2 Homomorphic Cryptography
A cryptosystem is said to be homomorphic if there is a homomorphism between the
domain (the message space M) and the image (the cipher space C) of its encryption
function Enc(m) [12]. A homomorphism is a map from one algebraic structure to another
that preserves its internal operations. So, if there is an internally well defined
relation, or function, in M, f_M : M → M, there will be a corresponding function defined
in C, f_C : C → C, such that:

    ∀m ∈ M, f_C(Enc(m)) ≡ Enc(f_M(m))

Fully Homomorphic Encryption (FHE) refers to a class of cryptosystems for which the
homomorphism is valid for every function defined in M. That is:

    ∀f_M : M → M, ∃f_C : C → C such that f_C(Enc(m)) ≡ Enc(f_M(m))
The most commonly used homomorphic cryptography systems, however, are only partially
homomorphic. There are additively homomorphic systems, multiplicatively homomorphic
systems, and systems that combine a few homomorphic features. For example, Paillier's
cryptosystem has additive homomorphic features, including multiplication of an encrypted
value by a plaintext constant, that can be used to delegate a limited set of computations
over a dataset without compromising its confidentiality [38, 17].
The underlying primitive in Paillier's system is the Decisional Composite Residuosity
Problem (DCRP). This problem deals with the intractability of deciding, given n = pq,
where p and q are two unknown large primes, and an arbitrary integer g coprime to n,
whether g is an n-th residue modulo n². In other words, the problem consists in deciding
whether there exists y ∈ Z*_{n²} such that g ≡ y^n mod n². In his work, Paillier defines
the DCRP and demonstrates its equivalence (in terms of computing cost) with the Quadratic
Residuosity Problem, which is the foundation of well known cryptosystems, such as
Goldwasser-Micali's [39].
Paillier's system can be defined by three algorithms:

Paillier.KeyGen: the key generation algorithm selects two large primes p, q; computes
their product n = pq; uniformly draws g coprime to n; computes λ = lcm(p − 1, q − 1);
and, finally, computes µ = (L(g^λ mod n²))^{−1} mod n, where L(u) = (u − 1)/n.
The public key is ⟨n, g⟩, and the private key is ⟨µ, λ⟩;

Paillier.Enc: given the public key ⟨n, g⟩ and a message m ∈ Z_n, the encryption
algorithm consists in uniformly selecting r from {1, ..., n − 1} and computing the
ciphertext c = g^m · r^n mod n²;

Paillier.Dec: the decryption algorithm, in turn, receives the private key ⟨µ, λ⟩ and a
ciphertext c, and computes the corresponding message as m = L(c^λ mod n²) · µ mod n.
Again, it is straightforward to see that, with the basic building blocks of addition
and multiplication, it is possible to compose arbitrarily complex protocols for privacy-
preserving computations using homomorphic encryption.
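A toy instantiation of the three algorithms illustrates the additive homomorphism: the product of two ciphertexts decrypts to the sum of the plaintexts. Here g = n + 1 (a standard valid choice) and the primes are deliberately tiny; real deployments use primes of a thousand or more bits:

```python
import math
import random

def keygen(p, q):
    n = p * q
    g = n + 1                       # a standard valid choice for g
    lam = math.lcm(p - 1, q - 1)
    L = lambda u: (u - 1) // n
    mu = pow(L(pow(g, lam, n * n)), -1, n)   # modular inverse (Python 3.8+)
    return (n, g), (lam, mu)

def encrypt(pk, m):
    n, g = pk
    while True:
        r = random.randrange(1, n)  # random mask, coprime to n
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)

def decrypt(pk, sk, c):
    n, _ = pk
    lam, mu = sk
    L = lambda u: (u - 1) // n
    return (L(pow(c, lam, n * n)) * mu) % n

pk, sk = keygen(1789, 1861)   # toy primes, for illustration only
c1, c2 = encrypt(pk, 20), encrypt(pk, 22)
# Enc(m1) * Enc(m2) mod n^2 is an encryption of m1 + m2
assert decrypt(pk, sk, (c1 * c2) % (pk[0] ** 2)) == 42
```

Raising a ciphertext to a plaintext constant k similarly yields an encryption of k·m, which is the scalar multiplication feature mentioned above.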
The aforementioned PPML frameworks are proof of the research interest in such methods
and of the large investment put into the development of privacy-preserving computation.
It is important to note that some of the most recent and relevant results are sponsored
by major cloud service providers, especially those with Machine Learning related cloud
services, such as Google, IBM and Amazon [31].
Therefore, not only are the capabilities of each protocol or framework of great
importance, but also their economic viability, determined mostly by their computational
cost. Expected execution time and power consumption may drive decisions that impact
millions of users and relevant fields of application. In spite of the importance of the
efficiency of their solutions, authors usually discuss only very briefly the estimated
computational costs or execution times.
Very few works on PPML publish their results with computing times observed against
benchmark datasets and tasks (e.g. classification on the ImageNet or Iris datasets).
When any general estimate is present, it is usually the complexity order O(g(n)), which
gives an asymptotic upper bound g(n) on the complexity or cost function.
The order function g(n) represents the general behavior, or shape, of a class of
functions. So, if t(n) is the function defining the computational cost of a given
algorithm over its input size n, then stating that t(n) ∈ O(g(n)) means that there is a
constant c and a large enough value of n after which t(n) is always less than c · g(n).
A protocol is usually regarded as efficient if its order function is at most polynomial.
Sub-exponential or exponential orders, on the other hand, are usually deemed
prohibitively high.
For example, the authors of ABY3 [34] assert that their Linear Regression protocol is
the most efficient in the literature, with cost O(B + D) per round, where B is the
training batch size and D is the feature matrix dimension [34]. That may seem like a very
well behaved linear function, which would lead us to conclude they devised an
exceptionally efficient protocol of order O(n).
Nevertheless, this order expression only bounds the number of operations performed by
the algorithm; it does not give an accurate estimate of execution time. More importantly,
the order function only bounds the actual cost function for sufficiently large input
sizes. Recall that t(n) = K · n is in O(n), regardless of how arbitrarily large the
constant K may be.
Thus, for small input sizes, the actual cost may be many orders of magnitude higher
than the asymptotic bound suggests. The addition protocol in [41] is also of order O(n),
but one would never assume that a protocol with many matrix multiplications and
inversions can run as fast as one with a few simple additions.
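The point about hidden constants can be made concrete with two hypothetical cost functions of the same order O(n):

```python
def t_fast(n):
    # e.g. a few local additions per input element
    return 3 * n

def t_slow(n):
    # same O(n) order, but a huge hidden constant per element
    # (think heavy cryptographic operations and network rounds)
    return 3_000_000 * n

# Same asymptotic class, wildly different actual cost at any fixed n
assert t_slow(1000) // t_fast(1000) == 1_000_000
```

Both functions are linear, yet for every practical input size one is a million times more expensive, which is precisely what the order notation hides.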
Probabilistic and ML models are commonly used to estimate cost, execution time and
other statistics of complex systems and algorithms, especially in control and real-time
systems engineering [42, 43].
We propose the use of Monte Carlo methods to estimate execution times for
privacy-preserving computations, considering different protocols and input sizes. In
[26], we present a short review of privacy-preserving computation and its cost, a brief
description of the Monte Carlo methods used, implementation details and results of the
various Monte Carlo experiments we performed, and key conclusions, pointing out relevant
questions open for further investigation.
Monte Carlo methods are a class of algorithms based on repeated random sampling
that render numerical approximations, or estimations, of a wide range of statistics, as
well as the associated standard error for the empirical average of any function of the
parameters of interest. We know, for example, that if X is a random variable with density
f(x), then the mathematical expectation of the random variable T = t(X) is:

    E[t(X)] = ∫_{−∞}^{∞} t(x) f(x) dx.    (2.3)
And, if the distribution of t(X) is unknown, or the analytic solution of the integral
is hard or impossible to obtain, then we can use a Monte Carlo estimation of the expected
value, obtained by drawing samples x_1, ..., x_M from f(x) and computing:

    θ̂ = (1/M) Σ_{i=1}^M t(x_i)    (2.4)
In other words, if the probability density function f(x) has support on a set X (that
is, f(x) ≥ 0 ∀x ∈ X and ∫_X f(x) dx = 1), we can estimate the integral

    θ = ∫_X t(x) f(x) dx    (2.5)

by sampling x_1, ..., x_M from f(x) and computing

    θ̂ = (1/M) Σ_{i=1}^M t(x_i)    (2.6)

with estimated variance

    V̂ar(θ̂) = (1/M²) Σ_{i=1}^M (t(x_i) − θ̂)²    (2.7)
In order to improve the accuracy of our estimation, we can always increase $M$, which appears squared in the divisor of the variance expression. That comes, however, with increased computational cost. We explore this trade-off in our experiments by performing the simulations with different values of $M$ and then examining the impact of $M$ on the observed sample variance and on the execution time of the experiment.
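The estimators in Eqs. (2.6) and (2.7) can be sketched as follows. This is a toy illustration (the function names and the integrand are ours, not from the thesis code): $t(x) = x^2$ with $x \sim \mathcal{N}(0,1)$ stands in for a measured protocol cost, so the true expectation is exactly 1.

```python
import random

def mc_estimate(t, sampler, M):
    """Plain Monte Carlo: draw M samples x_i ~ f and average t(x_i).

    Implements Eqs. (2.6) and (2.7):
    theta_hat = (1/M) sum t(x_i);  var_hat = (1/M^2) sum (t(x_i) - theta_hat)^2.
    """
    samples = [t(sampler()) for _ in range(M)]
    theta_hat = sum(samples) / M
    var_hat = sum((s - theta_hat) ** 2 for s in samples) / M**2
    return theta_hat, var_hat

if __name__ == "__main__":
    random.seed(42)
    # Toy stand-in for a protocol runtime model: t(x) = x^2, x ~ N(0, 1).
    for M in (100, 1_000, 10_000):
        est, var = mc_estimate(lambda x: x * x, lambda: random.gauss(0, 1), M)
        print(f"M={M:>6}  estimate={est:.4f}  est. variance={var:.2e}")
```

Increasing $M$ visibly shrinks the estimated variance, at the price of more samples, which is exactly the trade-off explored in the experiments.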
2.4 A general architecture for Privacy-Preserving
Machine Learning
As discussed above, there are different privacy-enhancing technologies that may serve as the fundamental blocks upon which to build a privacy-preserving solution, each with its own implementation complexity and computational cost. Monte Carlo methods can be used during the design of private computation solutions in order to select the best technology for a specific task and data type.
This, however, can be a cumbersome process if you consider that different steps of the machine learning life cycle, from data collection to inference, will have different algorithms, data types and data ranges. The process would have to be repeated for each step.
We propose an architecture for Privacy-Preserving Machine Learning that combines the experience and the model ecosystem of the ML community with the innovative Privacy-Preserving Machine Learning technologies from the cryptography community. It lessens the cost of implementing complex solutions and running heavy computations in the first steps of the ML cycle.
The first and main aspect of the proposed solution is to abstract all the preprocessing, data wrangling, feature extraction and feature engineering steps of machine learning
using transfer learning. Transfer learning refers to a subset of ML techniques based on
the notion of storing knowledge gained while solving one problem and applying it to a
different but related problem.
There is an inherent trade-off between privacy guarantees and the need of experienced data scientists and machine learning specialists to visualize and understand the data in order to fine-tune predictive models. Especially for high-dimensionality problems, running task-specific heuristics for data wrangling, dimensionality reduction, hyper-parameter tuning, among other common ML techniques, over encrypted data or over MPC protocols may be impractical, as computing costs may grow exponentially with large datasets [44].
Transfer learning allows these steps to be abstracted with the use of pre-trained models that usually convey knowledge gathered while training on very large datasets. The most used models are community driven, have proven success in their intended applications and are available through the APIs of common ML frameworks, such as PyTorch and TensorFlow.
The proposed architecture consists of two groups or types of components:
1. Data encoder: performs a two-step encoding process: it creates the embeddings (a dense matrix representation of the private input) and then creates a privacy-preserving computation representation of the embeddings (MPC shares for MPC protocols, or encrypted embeddings for HE protocols). There can be many data encoders cooperating in the private computation;
2. PPML computation parties: The computing parties run the MPC and/or HE
protocols for privacy-preserving model training and inference. There can also be
many computing parties cooperating in the private computation.
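As a sketch of the data encoder's second step, the following toy example secret-shares a fixed-point-encoded embedding vector among three parties with additive sharing. All names (`share`, `reconstruct`) and constants (`PRIME`, `SCALE`) are our own illustration, not part of any framework; the first step, producing the embedding, is abstracted away.

```python
import random

PRIME = 2**61 - 1          # field modulus for the additive sharing (illustrative)
SCALE = 2**16              # fixed-point scale to encode real-valued embeddings

def share(value: float, n_parties: int = 3):
    """Split one fixed-point-encoded value into n additive shares mod PRIME."""
    encoded = round(value * SCALE) % PRIME
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    last = (encoded - sum(shares)) % PRIME
    return shares + [last]

def reconstruct(shares):
    """Recombine shares and decode back to a float (handles negative values)."""
    total = sum(shares) % PRIME
    if total > PRIME // 2:          # map back from the upper half of the field
        total -= PRIME
    return total / SCALE

# Step 1 (abstracted): a pre-trained model maps the private text/image to an
# embedding vector. Step 2: each coordinate is secret-shared among the parties.
embedding = [0.25, -1.5, 3.0]
shared = [share(x) for x in embedding]
print([round(reconstruct(s), 4) for s in shared])   # → [0.25, -1.5, 3.0]
```

Each individual share is uniformly random, so no single party learns anything about the embedding; only the sum of all shares reveals it.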
This model renders a general privacy-preserving machine learning solution, since you can plug in any model for data encoding (such as BERT for text classification, and ResNet for image classification) and any PPML framework for model training and inference. It can
also be described as a general framework or architecture in the sense that any learning
task (of classification, regression, interaction or process control), can be mapped and
implemented as a transfer learning application.
Now, this general architecture can have different implementations. The proposed implementation, as suggested in Figure 2.1, uses MPC protocols running on 3 computing parties. We recommend that any MPC implementation hold 3 or more computing parties, and that at least one of the computing parties be hosted by a different service provider. This condition can be relaxed if the end user, running the Data Encoder component, also participates as a computing party in the MPC protocols. Also, for ease of implementation, any of the computing parties can perform the role of Trusted Initializer for protocol setup.
The MPC implementation, without loss of generality, could be described as the com-
bination of three components:
1. Data encoder: the second step in encoding consists in creating the respective MPC
shares for the embeddings;
2. MPC computing parties: The computing parties run the MPC protocols;
3. MPC Trusted Initializer: The TI pre-distributes secret shares for MPC protocol
acceleration;
In this case, the trusted initializer is only a tool for protocol acceleration. Since it is
part of the distributed, private, computation, it could be listed as just another member
of the computing parties group. It is set out in this case just to reinforce the fact that the
TI does not participate in the online phase of the MPC protocols, but only in the initial
setup.
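The Trusted Initializer's role can be illustrated with the classic Beaver-triple technique for secure multiplication. The sketch below is a simplified, single-process simulation (our own code, not the thesis implementation): in a real deployment each list entry would live on a separate party and the "opened" values would travel over the network, while the TI goes offline after distributing the triple.

```python
import random

P = 2**61 - 1  # illustrative prime modulus

def share(x, n=2):
    """Additively secret-share an integer x mod P among n parties."""
    s = [random.randrange(P) for _ in range(n - 1)]
    return s + [(x - sum(s)) % P]

def open_shares(shares):
    """Publicly reconstruct a shared value."""
    return sum(shares) % P

# --- Setup phase: the Trusted Initializer samples a Beaver triple a*b = c,
# secret-shares it to the computing parties, and then goes offline.
a, b = random.randrange(P), random.randrange(P)
c = (a * b) % P
a_sh, b_sh, c_sh = share(a), share(b), share(c)

def beaver_multiply(x_sh, y_sh):
    """Online phase: parties multiply shared x and y without the TI.

    Each party locally masks its shares; the masked values d = x - a and
    e = y - b are opened, and the product shares follow from
    x*y = c + d*b + e*a + d*e (all mod P).
    """
    d = open_shares([(x - ai) % P for x, ai in zip(x_sh, a_sh)])
    e = open_shares([(y - bi) % P for y, bi in zip(y_sh, b_sh)])
    z_sh = [(ci + d * bi + e * ai) % P for ai, bi, ci in zip(a_sh, b_sh, c_sh)]
    z_sh[0] = (z_sh[0] + d * e) % P   # the public term d*e is added by one party
    return z_sh

x_sh, y_sh = share(6), share(7)
print(open_shares(beaver_multiply(x_sh, y_sh)))  # → 42
```

Note that the opened values $d$ and $e$ are uniformly masked by $a$ and $b$, so the online phase leaks nothing about the inputs, which is why the TI is needed only for setup.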
Note that the implementation will determine the security model of the solution. When implemented with only two computing parties (e.g. data owner and service provider), the system is secure under an honest-but-curious adversarial model. It means that the data remains private as long as all parties follow the protocol correctly. When implemented with more than 2 computing parties, the underlying MPC protocols guarantee an $(n-1)$-private security model. That is, even if $n-1$ out of $n$ computing parties collude during protocol execution, the private information cannot be disclosed.
This allows end users and specialists to contribute to the training of the inference models used in relevant applications without waiving their right to privacy.
Looking at the fake news problem, PPML model training allows fact-checking agencies, or groups of scholars and specialists, to provide annotated news articles on a topic of community relevance. The data owners may fear that the service provider has economic or political incentives to meddle with the specialists' classification. The privacy-preserving model training solution guarantees that the model owner has no knowledge of the provided texts or the classification labels and thus cannot negatively impact the quality or fairness of the dataset and, consequently, of the trained model.
In the context of cancer detection or staging, PPML model training also allows collaborative model training in which centers of excellence, such as Brazil's INCA and the US' NCI, could overcome legal barriers related to health data protection, work together on more general and powerful predictive models, and serve the international medical community with better tools to save lives all over the globe.
A general architecture allows for a broader impact, as contributions in one area are
more easily adapted and integrated in a myriad of possible applications. It also allows for
open, community driven, models of development that have a higher chance of surviving
the ‘proof-of-concept’ stage and evolving into really useful technologies with positive social
impact.
3 Text classification
Discourse on Method
René Descartes
of tokens: words, punctuation marks, emojis, etc. Most NLP software libraries (e.g. nltk,
gensim and CoreNLP) provide multiple tokenization strategies, such as character, sub-
word, word or n-gram. The best granularity or tokenization strategy usually depends on
the application [46].
A common practice is to combine tokenization with sentence splitting. The gensim li-
brary, for instance, will perform tokenization by processing a sentence at a time. CoreNLP,
in turn, adds flags to the tokens that represent the limits of each sentence. Sentence split-
ting is also very important to other NLP methods, such as Part-Of-Speech (PoS) tagging
and Named Entity Recognition (NER).
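A minimal word-level tokenizer combined with naive sentence splitting can be sketched as follows. This is illustrative only (the function names are ours); libraries such as nltk, gensim and CoreNLP implement far more robust strategies that handle abbreviations, URLs and other edge cases.

```python
import re

def split_sentences(text: str):
    """Naive sentence splitter on terminal punctuation followed by space."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence: str):
    """Word-level tokenization: keeps words and isolates punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence, flags=re.UNICODE)

text = "Fake news spread fast. Can models detect them?"
print([tokenize(s) for s in split_sentences(text)])
# → [['Fake', 'news', 'spread', 'fast', '.'], ['Can', 'models', 'detect', 'them', '?']]
```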
• Pre-compiled dictionary: manually curated stopword lists. The lists may be crafted
for specific contexts, jargons or document corpus;
• Frequency based: use frequency based rules, such as TF-High (removal of terms
with high frequency), TF-1 (removal of terms with a single occurrence), IDF-Low
(removal of terms with low inverse document frequency, i.e. terms that are present
in most documents);
• Term Based Random Sampling (TBRS): uses the Kullback-Leibler divergence be-
tween term frequencies on the corpus with the frequency measured on randomly
sampled text chunks to identify words with low divergence and, consequently, low
information on any given text class.
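The frequency-based rules can be sketched as follows. The thresholds and function name are illustrative choices of ours, not values taken from the experiments; the sketch combines the TF-1 and IDF-Low rules described above.

```python
from collections import Counter

def frequency_stopwords(docs, df_ratio=0.9):
    """Flag candidate stopwords with two frequency-based rules:
    TF-1 (terms with a single occurrence in the corpus) and IDF-Low
    (terms present in more than df_ratio of the documents)."""
    tokenized = [doc.lower().split() for doc in docs]
    tf = Counter(t for doc in tokenized for t in doc)          # corpus term counts
    df = Counter(t for doc in tokenized for t in set(doc))     # document frequency
    n = len(docs)
    tf1 = {t for t, c in tf.items() if c == 1}
    idf_low = {t for t, c in df.items() if c / n > df_ratio}
    return tf1 | idf_low

docs = ["the cat sat", "the dog barked", "the cat and the dog"]
print(sorted(frequency_stopwords(docs)))  # → ['and', 'barked', 'sat', 'the']
```

Here 'the' is flagged by IDF-Low (it appears in every document), while the hapax terms are flagged by TF-1.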
3.2.3 Stemming
Stemming is the reduction of variant forms of a word, eliminating inflectional morphemes
such as verbal tense or plural suffixes, in order to provide a common representation, the
root or stem. The intuition is to perform a dimensionality reduction on the dataset,
removing rare morphological word variants, and reduce the risk of bias on word statistics
measured on the documents [49].
Most stemming algorithms only truncate suffixes and do not return the appropriate
term stem or even a valid word in the language of the text. There are different classes of
stemming algorithms, including:
• Dictionary based algorithms: lookup tables with terms and corresponding stems.
Usually restricted to a specific corpus, jargon or knowledge area;
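A minimal truncation-based stemmer illustrates the idea. The suffix list is a toy example of ours; real stemmers, such as Porter's algorithm, apply ordered rule sets with conditions on the remaining stem.

```python
def naive_stem(token: str):
    """Truncation-based stemming sketch: strips a fixed list of English
    inflectional suffixes, keeping at least 3 characters of stem."""
    for suffix in ("ingly", "edly", "ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

# Note that truncation does not always reach a common stem ('connection' vs
# 'connect'), which is exactly the limitation discussed above.
print([naive_stem(t) for t in ["connected", "connecting", "connections", "cats"]])
# → ['connect', 'connect', 'connection', 'cat']
```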
3.2.4 Lemmatization
Lemmatization consists of the reduction of each token to a linguistically valid root or lemma. The goal, from the statistical perspective, is exactly the same as in stemming:
reduce variance in term frequency. It is sometimes compared to the normalization of the
word sample, and aims to provide more accurate transformations than stemming, from
the linguistic perspective [50].
The impact on predictive models, however, will depend on characteristics of the lan-
guage or the document corpus being processed. In highly inflectional languages, such as
Latin and the Romance languages, lemmatization is expected to produce better results than stemming [51].
The typical lemmatizer implementation requires the creation of a lexicon (dictionary
or wordbook) of valid words and their corresponding lemma [52]. Yet, there are different
classes of algorithms, designed to deal with distinct problems in word normalization and
different languages. Recent works in literature, for instance, use deep neural networks to
produce ‘neuro lemmatizers’ trained for specific tasks [53, 54].
found [55, 54]. Most implementations will return multiple tags per token, with syntactic,
lexical, phrasal and other categories.
PoS tagging helps to differentiate homonyms, words with the same spelling but differ-
ent meanings, and to capture part of the semantics relations between words. Therefore,
many works in the fake news detection literature use PoS tags to engineer new features
(e.g. "noun count", "adjective count", "mean adjectives per noun") to capture concepts
such as ‘style’ or ‘quality’ of the text and improve model accuracy [56, 57].
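Such PoS-based feature engineering can be sketched as follows. The function and feature names are our own illustration; a tagger producing (token, tag) pairs with Penn Treebank tags (e.g. nltk.pos_tag) is assumed upstream.

```python
from collections import Counter

def pos_style_features(tagged_tokens):
    """Engineer simple 'style' features from (token, PoS-tag) pairs,
    in the spirit of the fake news detection literature."""
    tags = Counter(tag for _, tag in tagged_tokens)
    nouns = sum(c for t, c in tags.items() if t.startswith("NN"))
    adjectives = sum(c for t, c in tags.items() if t.startswith("JJ"))
    return {
        "noun_count": nouns,
        "adjective_count": adjectives,
        "adjectives_per_noun": adjectives / nouns if nouns else 0.0,
    }

tagged = [("shocking", "JJ"), ("news", "NN"), ("spread", "VBD"), ("fast", "RB")]
print(pos_style_features(tagged))
# → {'noun_count': 1, 'adjective_count': 1, 'adjectives_per_noun': 1.0}
```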
algorithms of Bag-of-N-grams, where the basic unit of count is not a single word but a set of words of size n [61].
$$\mathrm{tf}(t_i, d_j) = 1 + \log \frac{f_{t_i,d_j}}{\sum_{t \in d_j} f_{t,d_j}}, \qquad \mathrm{idf}(t_i, D) = 1 + \log \frac{|D| + 1}{|\{d \in D : t_i \in d\}| + 1} \tag{3.1}$$
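A direct transcription of Eq. (3.1) into code can be sketched as follows (illustrative; production pipelines typically use library implementations such as scikit-learn's TfidfVectorizer). Note that the tf component is only defined for terms actually present in the document.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Length-normalized, log-scaled term frequency from Eq. (3.1).
    Assumes `term` occurs at least once in `doc_tokens`."""
    counts = Counter(doc_tokens)
    return 1 + math.log(counts[term] / len(doc_tokens))

def idf(term, corpus_docs):
    """Smoothed inverse document frequency from Eq. (3.1)."""
    df = sum(1 for doc in corpus_docs if term in doc)
    return 1 + math.log((len(corpus_docs) + 1) / (df + 1))

corpus = [d.split() for d in ["the cat sat", "the dog ran", "a cat and a dog"]]
weight = tf("cat", corpus[0]) * idf("cat", [set(d) for d in corpus])
print(round(weight, 4))
```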
One of the advantages of word embeddings, such as CBoW, is the fixed length, dense
matrix representation that usually allows for more efficient computations. It also captures
some of the semantic relationships of words, based on their co-occurrence probability.
CBoW has also been used to achieve good results in fake news detection [66, 67] and
is possibly the most advanced preprocessing or feature engineering technique that can
be used on top of MPC protocols in order to produce a privacy-preserving fake news
classification model.
From this basic transformer architecture, there are even more complex models, such as the Bidirectional Encoder Representations from Transformers (BERT), that uses a variable number of encoders and bi-directional self-attention heads [69]. The model was trained on two tasks: language modeling, in which the model predicts missing words in the context; and next sentence prediction, in which the model predicts the next sentence. As a result, BERT embeddings convey contextual meaning for words. BERT has virtually become a baseline for any new advancements in language models and NLP in general [70].
4 Privacy-preserving fake news
detection
Principia Philosophiae
René Descartes
Fake news are texts, possibly distributed with or on other media formats, that present
false, incorrect or inaccurate information and are shared over digital platforms, such as
social networks, messaging apps or news web sites [72]. The main characteristics that dif-
ferentiate fake news from concepts such as gossip, hoax and other forms of misinformation
are:
1. They are formatted and presented as legitimate news, usually as a form of ‘self-validation’, with the intent to manipulate the audience’s cognitive processes;
2. They have faster and broader propagation patterns, partly due to the context of
instant and pervasive communication of digital platforms;
3. They have greater impact on the audience’s social behavior, also partly due to the
business model of the digital platforms based on engagement or “attention reten-
tion”.
These platforms are designed to retain their audience with algorithms that will filter
and sort the content displayed to each user based on their preferences and attention
patterns. The algorithms are so effective in retaining users’ attention that a growing number of people now suffer from social media addiction [73].
With users spending an ever-increasing amount of time on their favorite platforms, social media have amassed huge databases for user profiling and segmentation and have become, arguably, some of the most effective mass communication tools. They monetize their databases by serving targeted advertising and charging companies that consume their Application Programming Interfaces (APIs) to interact with their users [25].
This business model is threatened by the misuse of the platforms with the spread
of illegitimate and false content. The malicious use of digital platforms to spread fake
news has already been applied, for example, to manipulate opinions on extremely relevant
issues, such as presidential elections in France and the United States [74].
Identifying and clearly flagging fake news may help users exercise better judgment over the content they consume and lessen its negative effects [75]. Research in the area, nevertheless, faces a few particular challenges. First, there is the difficulty, from the technological perspective, of delimiting fake news and distinguishing them from other forms of propaganda. Also, the relevance of the topic in the political arena elevates the risk of bias and partisan interference. For instance, a dataset with over 200 citations yields an accuracy of 99% using the presence of a single term to label a text as true news [76]. Finally, there are very few good public datasets and fewer NLP resources (dictionaries, word embedding models, language models, etc.) in languages other than English [57].
Another important issue in the area is how to balance the need to detect and appropriately handle fake news with the equally important need to guarantee end users’ privacy. This concern with users’ privacy has led to the development of many Privacy-Preserving Machine Learning (PPML) techniques [30, 34, 35]. There are already many classic Machine Learning (ML) algorithms, such as Logistic Regression, Decision Trees and Support Vector Machines, implemented on top of Secure Multi-party Computation (MPC) protocols [8].
This detection approach has shown many positive results. Nevertheless, the use of personal information raises concerns about user privacy. The problem is aggravated if the proposed solution involves the transfer of users’ personal information to some sort of third party or government agency responsible for the detection or repression of misinformation.
We already have a few classic ML algorithms, such as Support Vector Machines, Decision Trees and Logistic Regression, running in the PPML setting with MPC protocols [8]. Current research efforts focus on the identification of preprocessing or feature engineering methods that improve model accuracy and that can be ported to the PPML setting.
4.2 Experiments
Our experiments were designed as a proof-of-concept of the general privacy-preserving machine learning framework proposed in Chapter 2. They cover the two basic scenarios of application of the framework. We implement the PPML model training and the PPML news classification features with additive secret sharing protocols available in Meta’s CrypTen, an open source research tool for PPML [31].
CrypTen extends the PyTorch library API with tensor-based implementations of secret sharing protocols. We decided to use this framework to facilitate our experiments, since it extends a well known library and would allow us to use peer-reviewed neural network architectures found in literature with very few changes to the code. CrypTen also allows for private inference with an encrypted PyTorch model trained on clear-text data.
CrypTen also helped us to simplify the implementation design, since it automatically instantiates all the MPC protocol requirements, such as the pre-distributed multiplication secret shares provided by the Trusted Initializer, for every computing party. It also provisions the communication channels and secret share distribution between data providers and computing parties, so data can be seamlessly loaded from any participating node and used for computations involving any number of parties.
• Fake.br: 7200 full-length news articles, with text and metadata, manually labeled
as real or fake news [57];
• Liar: curated by the UC Santa Barbara NLP Group, contains 12791 claims by
North-American politicians and celebrities, classified as ‘true’, ‘mostly-true’, ‘half-
true’, ‘barely-true’, ‘false’ and ‘pants-on-fire’ [94];
• Source Based Fake News Classification (sbnc): 2020 full-length news manu-
ally labeled as Real or Fake [95].
In order to make the results comparable across these datasets with different classification cardinalities, we have created mapping functions, attributing binary labels (fake/true) to the datasets annotated with more classes. For the Liar dataset, we map the class ‘mostly-true’ as true, while the classes ‘half-true’, ‘barely-true’ and ‘pants-on-fire’ are mapped as fake news. The FactCk.br dataset has over 18 classes, as it aggregates classification systems from different fact-checking agencies. The exact mapping is documented in the Jupyter notebooks published with our experiments.
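The mapping for the Liar dataset can be expressed as a simple lookup. This is a sketch: the dictionary below maps ‘true’ and ‘false’ in the obvious way, and the exact dictionaries for every dataset live in the published notebooks.

```python
# Binary projection of the Liar dataset's six-way labels; 'half-true' and
# below are treated as fake, following the mapping described in the text.
LIAR_TO_BINARY = {
    "true": "true",
    "mostly-true": "true",
    "half-true": "fake",
    "barely-true": "fake",
    "false": "fake",
    "pants-on-fire": "fake",
}

def map_labels(labels, mapping=LIAR_TO_BINARY):
    """Project a fine-grained label column onto the common fake/true scheme."""
    return [mapping[label] for label in labels]

print(map_labels(["mostly-true", "pants-on-fire", "half-true"]))
# → ['true', 'fake', 'fake']
```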
For hyper-parameter search, we decided to use the ROC AUC metric to compare and select the best models, as it gives better information on a model’s performance despite class imbalance. After model selection, we recorded ROC AUC, F1-score and accuracy metrics on the test set for the model selected at the end of each experiment. The best combination of preprocessing techniques and classifier algorithm, measured by the accuracy on the test set for each dataset, is presented in Table 4.2. The runtime is in seconds.
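The ROC AUC criterion can be computed directly from its probabilistic definition, as the following self-contained sketch shows (equivalent, for binary labels, to library implementations such as scikit-learn's roc_auc_score).

```python
def roc_auc(labels, scores):
    """ROC AUC as the probability that a randomly chosen positive example
    is scored above a randomly chosen negative one (ties count 1/2)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

# Because AUC only compares score rankings between classes, a perfect ranking
# scores 1.0 no matter how few positives there are; this robustness to class
# imbalance is why it was chosen for model selection.
print(roc_auc([0, 0, 0, 0, 1], [0.1, 0.2, 0.3, 0.4, 0.9]))  # → 1.0
```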
We also trained and tested the convolutional neural network from [97] as a benchmark for our results. Our neural networks outperformed their model in accuracy, F1 score and ROC AUC for all datasets. For the next step, however, we selected only the best neural network of each architecture for the experiments in the privacy-preserving setting and compare training runtimes for these networks in Table 4.3.
Contrary to the initial expectations, however, results show that in most of our clear-
text experiments, word embeddings or sentence embeddings from large language models
did not outperform traditional NLP preprocessing techniques. Also, classic ML models
outperformed simpler deep learning models. Particularly, tree-based models, Random
Forest and GBDT, presented the best performance for most datasets.
It is important to note that, as stated above, we used only simple feed-forward and convolutional neural network architectures. This choice is due to CrypTen’s limited implementation of PyTorch modules. It does not implement modules such as Recurrent Neural Networks (RNN) or Long Short-Term Memory networks (LSTM), which have been proven in the NLP literature to provide better results than simpler network architectures [98].
On the first machine, named ‘alice’, we store the trained model. The training set,
embeddings and corresponding labels, are stored on the second machine, named ‘bob’. The
third participant, ‘charlie’, holds the validation set. At the end of the MPC computation,
the accuracy score on the validation set is known to all computing parties, but only alice
has knowledge of the trained model’s weights.
The results in Table 4.3 show that training times are, on average, one order of magnitude higher in the privacy-preserving setting. That is a reasonable cost, considering the advantage of preserving both the privacy of participants’ input texts and of the service provider’s trained model.
For the sake of reproducibility, we have created a docker image with all required pack-
ages and made it publicly available on Docker Hub (https://dockr.ly/3ED3S1D). The
experiments are extensively documented on Jupyter notebooks at a public git repository
(https://bit.ly/3BwhfPn). We need to point out, however, that exact reproducibility
cannot be guaranteed across PyTorch and CrypTen non-deterministic algorithms. Also,
the cuDNN library, used for CUDA convolution operations, can be a source of nondeter-
minism across multiple executions.
4.3 Findings
These experiments demonstrate that it is possible to train and query predictive models
using Secure Multi-party Computation protocols in order to detect fake news in a privacy-
preserving way.
We observed that the models trained with the short text datasets had lower performance in all metrics, over all experiments. This indicates that NLP models require a larger word sample in order to appropriately approximate the underlying statistics represented by the trained parameters. We also noticed a higher runtime for the Random Forest model trained over Sentence-BERT embeddings, as seen in Table 4.2 for the Liar dataset. This indicates that large language models may provide better results, but may also introduce higher computing costs in all stages of the machine learning pipeline: from preprocessing to inference.
Concerning the selected language models, we observed that Sentence-BERT achieved better performance than DistilBERT for most of the datasets. Moreover, the accuracy marks for the datasets in Portuguese were higher than the ones for the English datasets, even though these BERT-based models were trained mostly on English texts. In terms of runtime, in the worst case, the inference for each text takes about half a second. On average, however, each text takes about 4 milliseconds, which is reasonable for a solution with good guarantees on privacy.
We found that the performance of the privacy-preserving inference of the fake news classification model, measured in terms of runtime, accuracy and other classification metrics, is very close to that of a model queried in the clear-text setting. This indicates that introducing MPC protocols does not reduce the predictive power or usability of fake news detection models.
It is arguable that, despite having a visibly higher cost when compared with clear-text training, privacy-preserving model training can still be considered viable for the specific cases where a group of parties needs to cooperate in training without trusting the model owner.
We have also found that, for the datasets at hand, using large multi-lingual language
models did not significantly improve the overall pipeline performance, when compared
to well established NLP techniques such as Lemmatization, Stop-Word Removal and
TF-IDF. In possible future investigations, we may test the effect of other preprocessing techniques, such as word count by part-of-speech, and other feature extraction and
engineering methods commonly found in NLP literature.
5 Privacy-preserving breast cancer
classification
Discourse on Method
René Descartes
Breast cancer is the most prevalent form of cancer, even when considering both sexes, and causes over 600,000 deaths per year [99]. Despite all the visibility and awareness around this disease, and all the public health policies put into practice, its incidence has increased, on average, by 0.5% annually [100].
It is straightforward to understand that any knowledge or technology developed on this subject that contributes to early detection or supports the best possible treatment has a highly relevant impact on society.
For this reason, research teams at the Brazilian National Cancer Institute (INCA) are
always developing and testing new ways to achieve the earliest possible detection of breast
cancer, and to effectively follow up the progression of the disease in order to provide the
best information for its treatment.
There is ongoing research seeking to use INCA’s large dataset of high-resolution scans of breast cancer histologic sections to build inference models. The project also involves the use of tabular data extracted from patients’ Electronic Health Records (EHR), in order to build multi-modal inference models that will help identify the disease and indicate its probable progression and best treatment course.
There is, however, a significant cost associated with transferring a patient’s EHR from their original practice. In most cases, it also requires the patient to travel from their home State to Rio de Janeiro, where INCA’s hospitals are located. Sometimes, the final diagnosis is that of a benign tumor, or of a carcinoma at an initial stage, that can be treated at the patient’s original location.
Thus, a solution that would reduce the risk of false positives, and reduce the costs
associated with the transfer of these patients that could have stayed home, would improve
the overall cancer related health policy, reducing costs both to the public health system
and to the patient. Also, starting the treatment as soon as possible, at their original
location, would improve their chances of success and remission.
Now, that is when a privacy-preserving solution plays a key role: the legal apparatus around EHRs restricts the sharing of patients’ data, even in situations where two hospitals collaborate on the diagnosis. So, a solution that allows such a collaboration in a way that adheres to EHR laws and standards could have an enormous impact.
That is the motivation for the choice of the second use case of our architecture for privacy-preserving machine learning. This chapter presents a short literature review on breast cancer classification. It also lists a few concepts on computer vision, transfer learning for computer vision and image models. Finally, it discusses the design and motivation of the experiments carried out with INCA’s researchers as a second proof-of-concept for our PPML architecture.
The T categories for breast cancer are:
Tis: Carcinoma in situ (DCIS, or Paget disease of the breast with no associated tumor
mass)
T1: (includes T1a, T1b, and T1c) Tumor is 2 cm (3/4 of an inch) or less across.
T2: Tumor is more than 2 cm but not more than 5 cm (2 inches) across.
T4: (includes T4a, T4b, T4c, and T4d) Tumor of any size growing into the chest wall
or skin. This includes inflammatory breast cancer.
be trained to associate an input with one or more concepts (or labels), usually giving a
probability of the presence of each labeled concept in the processed input [103].
ResNet, VGG and AlexNet, among others, are some of the first image models and were applied to problems ranging from face recognition to Covid-19 diagnosis [104]. The first ResNet variants are no longer the state of the art in image models; however, they still serve as a baseline for developments in the area [105].
colorization), and three-dimensional anamorphic vision (e.g., point cloud classification
and segmentation) [107].
5.4 Experiments
5.4.1 Datasets
The experiments use two datasets:
• A subset of The Cancer Genome Atlas (TCGA) available at the Genomic Data Commons (GDC) from the US National Cancer Institute (NCI). We have used the ‘Case’ filter, selecting ‘Breast’ as the primary site, rendering 1879 SVS images [111]. Those are classified into Triple negative breast carcinoma, ER positive breast carcinoma, PR positive breast carcinoma and HER positive breast carcinoma;
a more realistic analysis on the viability of the proposed solution for real life medical
applications.
We are also going to use the INCA dataset in a second experiment, applying the trained models from the previous experiment and observing their classification performance over the INCA dataset. The datasets are not labeled with the same classification system, but we expect to observe whether the models present a low false-positive count on the ‘Benign’ class. We also want to compare the models’ classification results with those of specialists, when considering the same classification system.

In order to use common image models in the encoding of these high-resolution images, we had to generate lower-resolution copies, using the ImageMagick open source tool.
We are also interested in demonstrating the privacy-preserving model inference. This
scenario is important for the Brazilian National Cancer Institute, since it receives Elec-
tronic Health Records (EHR) of potential cancer patients from most hospitals in Brazil.
INCA is not only the reference hospital for cancer treatment in Brazil, it is also the co-
ordinator of the cancer related public policies in the context of the SUS (the Brazilian
Unified Health System).
The SUS system, in turn, is a coordination framework of Federal and sub-national
(States and Municipalities) governments for the provision of sanitary and public health
services. It includes legal, budgetary, procedural standardization and many other levels
of coordination.
Therefore, if INCA has the ability to classify exam images that would potentially help the diagnosis of breast cancer, or any other cancer type, without requiring the formal transfer of a patient’s EHR, there could be an increase in efficiency, as well as a reduction of costs for the entire public health system.
6 Conclusion
Our experiments also demonstrate how neural networks can be trained to detect fake
news using Secure Multi-party Computation protocols and how those MPC protocols allow
users to perform text classification in a privacy-preserving way. The respective findings
were submitted for publication at a relevant venue and are currently under review.
We argue that the most relevant finding is the fact that the performance of the privacy-preserving fake news classification solution built according to our architecture, measured in terms of runtime, accuracy and other classification metrics, is very close to that of models trained and queried in the clear-text setting. This indicates that our architecture guarantees the privacy of end users without reducing the predictive power or usability of fake news detection solutions.
The results presented in this work also indicate that transfer learning, supported by pre-trained language models, is the best fit for a privacy-preserving solution such as the one proposed by our architecture, for three main reasons. First, language models can be used as 'plug-and-play' components of the solution: a newer, better language model brings better results without major changes to the overall solution. Second, the use of language models reduces the risk of data leakage during the data wrangling activities. Finally, this preprocessing strategy also significantly reduces computing costs on the data owner's side.
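The 'plug-and-play' role of the language model can be illustrated with a minimal sketch (all names here are hypothetical, and a toy hash-based embedder stands in for a real sentence encoder such as Sentence-BERT): the privacy-preserving classifier depends only on a fixed embedding dimension, so swapping the preprocessing model requires no change to the MPC pipeline.

```python
import hashlib
from typing import Callable, List

# Hypothetical contract: a "language model" is any function mapping a
# text to a fixed-size numeric vector. The MPC servers downstream only
# ever see (secret shares of) vectors of dimension DIM, never raw text.
DIM = 8
Embedder = Callable[[str], List[float]]

def toy_hash_embedder(text: str) -> List[float]:
    """Stand-in for a real sentence encoder (e.g. Sentence-BERT)."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        h = int(hashlib.sha256(token.encode()).hexdigest(), 16)
        vec[h % DIM] += 1.0
    return vec

def prepare_query(text: str, embed: Embedder) -> List[float]:
    # Runs on the user's device: only the embedding leaves the device,
    # secret-shared among the computing parties.
    v = embed(text)
    assert len(v) == DIM  # the contract the MPC pipeline relies on
    return v

q = prepare_query("breaking news headline", toy_hash_embedder)
```

Replacing `toy_hash_embedder` with a stronger encoder of the same output dimension leaves `prepare_query` and everything after it untouched, which is the plug-and-play property argued for above.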
yond the scope of the present doctoral research, could study the differences in performance
between computationally heavier and lighter preprocessing techniques. Considering that
in a privacy-preserving setting, the preprocessing phase must be performed on the user’s
device, it is interesting to look for techniques with low computational cost and acceptable
performance.
Acknowledgments
This work has been funded in part by the Graduate Deanship of Universidade de Brasília,
under the “EDITAL DPG Nº 0004/2021” grants program.
Bibliography
[1] Dale, Robert: GPT-3: What’s it good for? Natural Language Engineering,
27(1):113–118, 2021. 1
[2] Van Noorden, Richard: The ethical questions that haunt facial-recognition research.
Nature, 587(7834):354–359, 2020. 1
[7] Canetti, R.: Universally Composable Security: A New Paradigm for Cryptographic
Protocols. In Proceedings of the 42Nd IEEE Symposium on Foundations of Computer
Science, FOCS ’01, pages 136–, Washington, DC, USA, 2001. IEEE Computer Soci-
ety, ISBN 0-7695-1390-5. http://dl.acm.org/citation.cfm?id=874063.875553.
2, 9
[8] De Cock, Martine, Rafael Dowsley, Caleb Horst, Raj Katti, Anderson Nascimento,
Wing Sea Poon, and Stacey Truex: Efficient and Private Scoring of Decision Trees,
Support Vector Machines and Logistic Regression Models based on Pre-Computation.
IEEE Transactions on Dependable and Secure Computing, PP(99), 2017. 2, 9, 31,
33
[9] Rivest, R., L. Adleman, and M. Dertouzos: On data banks and privacy homo-
morphisms. Foundations of Secure Computation, pages 169–177, 1978. 2, 9
[10] Gentry, C.: A fully homomorphic encryption scheme. PhD thesis, Stanford Univer-
sity, 2009. crypto.stanford.edu/craig. 2, 9
[12] Souza, Stefano M P C and Ricardo S Puttini: Client-side encryption for privacy-
sensitive applications on the cloud. Procedia Computer Science, 97:126–130, 2016.
2, 14
[13] Damgård, Ivan and Mats Jurik: A Generalisation, a Simplification and Some Appli-
cations of Paillier’s Probabilistic Public-Key System. In Proceedings of the 4th Inter-
national Workshop on Practice and Theory in Public Key Cryptography: Public Key
Cryptography, PKC ’01, pages 119–136, London, UK, UK, 2001. Springer-Verlag,
ISBN 3-540-41658-7. http://dl.acm.org/citation.cfm?id=648118.746742. 2,
9
[14] Nikolaenko, V., U. Weinsberg, S. Ioannidis, M. Joye, D. Boneh, and N. Taft:
Privacy-Preserving Ridge Regression on Hundreds of Millions of Records. In 2013
IEEE Symposium on Security and Privacy. IEEE, 2013. 2, 9
[15] Bos, J. W., K. Lauter, and M. Naehrig: Private Predictive Analysis on Encrypted
Medical Data. Cryptology ePrint Archive, Report 2014/336, 2014. 2, 10
[18] Souza, Stefano M. P. C.: Safe-Record: segurança e privacidade para registros eletrô-
nicos em saúde na nuvem. Master’s thesis, PPGEE/FT - Universidade de Brasília,
2016. 2, 4, 10
[21] Trauth, E. M.: Achieving the Research Goal with Qualitative Methods: Lessons
Learned along the Way. In Proceedings of the IFIP TC8 WG 8.2 International
Conference on Information Systems and Qualitative Research, page 225–245, GBR,
1997. Chapman & Hall, Ltd., ISBN 0412823608. 5
[22] Deb, Dipankar, Rajeeb Dey, and Valentina E. Balas: [Intelligent Systems Refe-
rence Library - Vol. 153] Engineering Research Methodology: A Practical Insight
for Researchers, volume 10.1007/978-981-13-2947-0, chapter 1, pages 1–7. Sprin-
ger, 2019, ISBN 978-981-13-2946-3,978-981-13-2947-0. http://gen.lib.rus.ec/
scimag/index.php?s=10.1007/978-981-13-2947-0. 5
[23] Kaplan, Bonnie and Dennis Duchon: Combining Qualitative and Quantitative
Methods in Information Systems Research: A Case Study. MIS Q., 12(4):571–586,
December 1988, ISSN 0276-7783. http://dx.doi.org/10.2307/249133. 5
[24] Yin, R. K.: Case Study Research: Design and Methods. SAGE, Beverly Hills, 1984.
6
[26] Souza, Stefano M. P. C. and Daniel G. Silva: Monte Carlo execution time estimation
for Privacy-preserving Distributed Function Evaluation protocols, 2021. 7
[27] Ben-David, Assaf, Noam Nisan, and Benny Pinkas: FairplayMP: a system for se-
cure multi-party computation. In Ning, Peng, Paul F. Syverson, and Somesh Jha
(editors): Proceedings of the 2008 ACM Conference on Computer and Communica-
tions Security, CCS 2008, Alexandria, Virginia, USA, October 27-31, 2008, pages
257–266. ACM, 2008. 10
[28] Bogdanov, Dan, Sven Laur, and Jan Willemson: Sharemind: A Framework for Fast
Privacy-Preserving Computations. In Proc. of the 13th European Symposium on
Research in Computer Security, pages 192–206, 2008. 10
[29] Ryffel, Theo, Andrew Trask, Morten Dahl, Bobby Wagner, Jason Mancuso, Da-
niel Rueckert, and Jonathan Passerat-Palmbach: A generic framework for privacy
preserving deep learning. CoRR, abs/1811.04017, 2018. 10
[31] Knott, B., S. Venkataraman, A.Y. Hannun, S. Sengupta, M. Ibrahim, and L.J.P.
van der Maaten: CrypTen: Secure Multi-Party Computation Meets Machine Le-
arning. In Proceedings of the NeurIPS Workshop on Privacy-Preserving Machine
Learning, 2020. 10, 16, 33
[32] Zhang, Yihua, Aaron Steele, and Marina Blanton: PICCO: A General-purpose
Compiler for Private Distributed Computation. In Proceedings of the 2013
ACM SIGSAC Conference on Computer Communications Security. ACM, 2013,
ISBN 978-1-4503-2477-9. 10
[34] Demmler, Daniel, Thomas Schneider, and Michael Zohner: ABY - A Framework
for Efficient Mixed-Protocol Secure Two-Party Computation. In 22nd Network and
Distributed System Security Symposium, 2015. 10, 16, 31
[35] Mohassel, P. and Y. Zhang: SecureML: A System for Scalable Privacy-Preserving
Machine Learning. In 2017 IEEE Symposium on Security and Privacy (SP), pages
19–38, May 2017. 10, 31
[36] Yao, Andrew C.: Protocols for Secure Computations. In Proceedings of the 23rd An-
nual Symposium on Foundations of Computer Science, SFCS ’82. IEEE Computer
Society, 1982. 11
[37] Beaver, Donald: One-time tables for two-party computation. In Computing and
Combinatorics, pages 361–370. Springer, 1998. 12
[38] Paillier, Pascal: Public-key cryptosystems based on composite degree residuosity clas-
ses. In IN ADVANCES IN CRYPTOLOGY — EUROCRYPT 1999, pages 223–238.
Springer-Verlag, 1999. 14
[39] Goldwasser, Shafi and Silvio Micali: Probabilistic encryption. Journal of Computer
and System Sciences, 28(2):270–299, 1984, ISSN 0022-0000. 14
[40] Naor, Moni and Kobbi Nissim: Communication Complexity and Secure Function
Evaluation. Electronic Colloquium on Computational Complexity (ECCC), 8, 2001.
15
[41] Agarwal, Anisha, Rafael Dowsley, Nicholas D. McKinney, Dongrui Wu, Chin Teng
Lin, Martine De Cock, and Anderson Nascimento: Privacy-Preserving Linear Re-
gression for Brain-Computer Interface Applications. In Proc. of 2018 IEEE Inter-
national Conference on Big Data, 2018. 16
[42] Silva, D. G, M. Jino, and B. de Abreu: A Simple Approach for Estimation of Exe-
cution Effort of Functional Test Cases. In IEEE Sixth International Conference on
Software Testing, Verification and Validation. IEEE Computer Society, Apr 2009.
17
[43] Iqbal, N., M. A. Siddique, and J. Henkel: DAGS: Distribution agnostic sequential
Monte Carlo scheme for task execution time estimation. In 2010 Design, Automation
Test in Europe Conference Exhibition (DATE 2010), pages 1645–1648, 2010. 17
[44] Li, Zhi, Hao Wang, Guangquan Xu, Alireza Jolfaei, Xi Zheng, Chunhua Su, and
Wenying Zhang: Privacy-Preserving Distributed Transfer Learning and its Applica-
tion in Intelligent Transportation. IEEE Transactions on Intelligent Transportation
Systems, pages 1–17, 2022. 19
[45] Cunha, Washington, Vítor Mangaravite, Christian Gomes, Sérgio Canuto, Elaine
Resende, Cecilia Nascimento, Felipe Viegas, Celso França, Wellington Santos Mar-
tins, Jussara M. Almeida, Thierson Rosa, Leonardo Rocha, and Marcos André
Gonçalves: On the cost-effectiveness of neural and non-neural approaches and repre-
sentations for text classification: A comprehensive comparative study. Information
Processing & Management, 58(3):102481, 2021, ISSN 0306-4573. 23
[46] Habert, Benoit, Gilles Adda, Martine Adda-Decker, P Boula de Marëuil, Serge
Ferrari, Olivier Ferret, Gabriel Illouz, and Patrick Paroubek: Towards tokenization
evaluation. In Proceedings of LREC, volume 98, pages 427–431, 1998. 24
[47] Kaur, Jashanjot and P Kaur Buttar: A systematic review on stopword removal
algorithms. Int. J. Futur. Revolut. Comput. Sci. Commun. Eng, 4(4), 2018. 24
[48] Gerlach, Martin, Hanyu Shi, and Luís A Nunes Amaral: A universal information
theoretic approach to the identification of stopwords. Nature Machine Intelligence,
1(12):606–612, 2019. 24
[49] Singh, Jasmeet and Vishal Gupta: Text stemming: Approaches, applications, and
challenges. ACM Computing Surveys (CSUR), 49(3):1–46, 2016. 25
[50] Dereza, Oksana: Lemmatization for Ancient Languages: Rules or Neural Networks?
In Conference on Artificial Intelligence and Natural Language, pages 35–47. Sprin-
ger, 2018. 25
[51] Jongejan, Bart and Hercules Dalianis: Automatic training of lemmatization rules
that handle morphological changes in pre-, in-and suffixes alike. In Proceedings of the
Joint Conference of the 47th Annual Meeting of the ACL and the 4th International
Joint Conference on Natural Language Processing of the AFNLP, pages 145–153,
2009. 25
[52] Plisson, Joël, Nada Lavrac, Dunja Mladenic, et al.: A rule based approach to word
lemmatization. In Proceedings of IS, volume 3, pages 83–86, 2004. 25
[53] Malaviya, Chaitanya, Shijie Wu, and Ryan Cotterell: A simple joint model for im-
proved contextual neural lemmatization. arXiv preprint arXiv:1904.02306, 2019. 25
[54] Kondratyuk, Daniel, Tomáš Gavenčiak, Milan Straka, and Jan Hajič: LemmaTag:
Jointly tagging and lemmatizing for morphologically-rich languages with BRNNs.
arXiv preprint arXiv:1808.03703, 2018. 25, 26
[55] Schmid, Helmut and Florian Laws: Estimation of conditional probabilities with de-
cision trees and an application to fine-grained POS tagging. In Proceedings of the
22nd International Conference on Computational Linguistics (Coling 2008), pages
777–784, 2008. 26
[56] Potthast, Martin, Johannes Kiesel, Kevin Reinartz, Janek Bevendorff, and Benno
Stein: A Stylometric Inquiry into Hyperpartisan and Fake News. In Proceedings of
the 56th Annual Meeting of the Association for Computational Linguistics (Volume
1: Long Papers), pages 231–240, Melbourne, Australia, July 2018. Association for
Computational Linguistics. https://www.aclweb.org/anthology/P18-1022. 26,
32
[57] Monteiro, Rafael A., Roney L. S. Santos, Thiago A. S. Pardo, Tiago A. de Al-
meida, Evandro E. S. Ruiz, and Oto A. Vale: Contributions to the Study of Fake
News in Portuguese: New Corpus and Automatic Detection Results. In Computati-
onal Processing of the Portuguese Language, pages 324–334. Springer International
Publishing, 2018, ISBN 978-3-319-99722-3. 26, 31, 34
[58] Davis, R. and C. Proctor: Fake News, Real Consequences: Recruiting Neural
Networks for the Fight Against Fake News. Technical report, Stanford University,
2017. 26
[59] Barde, B. V. and A. M. Bainwad: An overview of topic modeling methods and tools.
In 2017 International Conference on Intelligent Computing and Control Systems
(ICICCS), pages 745–750, 2017. 26
[60] El-Din, Doaa Mohey: Enhancement bag-of-words model for solving the challenges
of sentiment analysis. International Journal of Advanced Computer Science and
Applications, 7(1), 2016. 26
[61] Li, Bofang, Zhe Zhao, Tao Liu, Puwei Wang, and Xiaoyong Du: Weighted neural bag-
of-n-grams model: New baselines for text classification. In Proceedings of COLING
2016, the 26th International Conference on Computational Linguistics: Technical
Papers, pages 1591–1600, 2016. 27
[62] Yun-tao, Zhang, Gong Ling, and Wang Yong-cheng: An improved TF-IDF approach
for text classification. Journal of Zhejiang University-Science A, 6(1):49–55, 2005.
27
[63] Ahmed, Hadeer, Issa Traore, and Sherif Saad: Detection of online fake news using
n-gram analysis and machine learning techniques. In International conference on
intelligent, secure, and dependable systems in distributed and cloud environments,
pages 127–138. Springer, 2017. 27
[64] Dyson, Lauren and Alden Golab: Fake News Detection Exploring the Application of
NLP Methods to Machine Identification of Misleading News Sources. CAPP 30255
Adv. Mach. Learn. Public Policy, 2017. 27, 32
[65] Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean: Efficient Estimation
of Word Representations in Vector Space, 2013. 27
[66] Yang, Kai Chou, Timothy Niven, and Hung Yu Kao: Fake News Detection as Natural
Language Inference. arXiv preprint arXiv:1907.07347, 2019. 28, 32
[67] Hosseinimotlagh, Seyedmehdi and Evangelos E Papalexakis: Unsupervised content-
based identification of fake news articles with tensor decomposition ensembles. In
Proceedings of the Workshop on Misinformation and Misbehavior Mining on the
Web (MIS2), 2018. 28
[68] Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan
N Gomez, Łukasz Kaiser, and Illia Polosukhin: Attention is All you Need. In Guyon,
I., U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R.
Garnett (editors): Advances in Neural Information Processing Systems, volume 30.
Curran Associates, Inc., 2017. https://proceedings.neurips.cc/paper/2017/
file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf. 28
[69] Devlin, Jacob, Ming Wei Chang, Kenton Lee, and Kristina Toutanova: Bert: Pre-
training of deep bidirectional transformers for language understanding. arXiv pre-
print arXiv:1810.04805, 2018. 29, 32
[70] Rogers, Anna, Olga Kovaleva, and Anna Rumshisky: A Primer in BERTology: What
We Know About How BERT Works. Transactions of the Association for Computa-
tional Linguistics, 8:842–866, January 2021, ISSN 2307-387X. 29
[71] Reimers, Nils and Iryna Gurevych: Sentence-BERT: Sentence Embeddings using Sia-
mese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods
in Natural Language Processing. Association for Computational Linguistics, Novem-
ber 2019. https://arxiv.org/abs/1908.10084. 29
[72] Gelfert, Axel: Fake News: A Definition. Informal Logic, 37(0):83–117, 2017. 30
[73] D’Arienzo, Maria Chiara, Valentina Boursier, and Mark D. Griffiths: Addiction to
Social Media and Attachment Styles: A Systematic Literature Review. International
Journal of Mental Health and Addiction, 17:1094 – 1118, 2019. 30
[74] Ferrara, Emilio: Disinformation and Social Bot Operations in the Run Up to the
2017 French Presidential Election. First Monday, 22, June 2017. 31
[75] Lee, Sangwon and Michael Xenos: Social distraction? Social media use and political
knowledge in two U.S. Presidential elections. Computers in Human Behavior, 90:18
– 25, 2019, ISSN 0747-5632. 31
[76] Nascimento, Josué: Only one word 99.2%, Aug 2020. https://www.kaggle.com/
josutk/only-one-word-99-2. 31, 32
[77] Gangireddy, Siva Charan Reddy, Deepak P, Cheng Long, and Tanmoy Chakraborty:
Unsupervised Fake News Detection: A Graph-Based Approach. In Proceedings of
the 31st ACM Conference on Hypertext and Social Media, HT ’20, page 75–83, New
York, NY, USA, 2020. Association for Computing Machinery, ISBN 9781450370981.
https://doi.org/10.1145/3372923.3404783. 31
[78] Shu, Kai, Xinyi Zhou, Suhang Wang, Reza Zafarani, and Huan Liu: The Role of
User Profiles for Fake News Detection. In ASONAM ’19: International Conference
on Advances in Social Networks Analysis and Mining, page 436–439, New York,
NY, USA, 2019. Association for Computing Machinery, ISBN 9781450368681. 31
[79] Pinnaparaju, Nikhil, Vijaysaradhi Indurthi, and Vasudeva Varma: Identifying Fake
News Spreaders in Social Media. In CLEF, 2020. 31
[80] Nadeem, Moin, Wei Fang, Brian Xu, Mitra Mohtarami, and James Glass: FAKTA:
An Automatic End-to-End Fact Checking System, 2019. 32
[81] Moreno, João and Graça Bressan: FACTCK.BR: A New Dataset to Study
Fake News. In Proceedings of the 25th Brazilian Symposium on Multimedia and the
Web, WebMedia ’19, page 525–527, New York, NY, USA, 2019. Association for Com-
puting Machinery, ISBN 9781450367639. https://doi.org/10.1145/3323503.
3361698. 32, 33
[82] Gupta, Ankur, Yash Varun, Prarthana Das, Nithya Muttineni, Parth Srivastava,
Hamim Zafar, Tanmoy Chakraborty, and Swaprava Nath: TruthBot: An Automa-
ted Conversational Tool for Intent Learning, Curated Information Presenting, and
Fake News Alerting. CoRR, abs/2102.00509, 2021. https://arxiv.org/abs/2102.
00509. 32
[83] Lee, Sungjin: Nudging Neural Conversational Model with Domain Knowledge.
CoRR, abs/1811.06630, 2018. http://arxiv.org/abs/1811.06630. 32
[84] Graves, Lucas: Anatomy of a Fact Check: Objective Practice and the Contested
Epistemology of Fact Checking. Communication, Culture and Critique, 10(3):518–
537, October 2017, ISSN 1753-9129. https://doi.org/10.1111/cccr.12163. 32
[85] Marietta, Morgan, David C Barker, and Todd Bowser: Fact-checking polarized poli-
tics: Does the fact-check industry provide consistent guidance on disputed realities?
In The Forum, volume 13, pages 577–596. De Gruyter, 2015. 32
[86] Horne, Benjamin D. and Sibel Adali: This Just In: Fake News Packs a Lot in Title,
Uses Simpler, Repetitive Content in Text Body, More Similar to Satire than Real
News. CoRR, abs/1703.09398, 2017. http://arxiv.org/abs/1703.09398. 32
[87] Young, T., D. Hazarika, S. Poria, and E. Cambria: Recent Trends in Deep Lear-
ning Based Natural Language Processing [Review Article]. IEEE Computational
Intelligence Magazine, 13(3):55–75, 2018. 32
[88] Baruah, Arup, K Das, F Barbhuiya, and Kuntal Dey: Automatic Detection of Fake
News Spreaders Using BERT. In CLEF, 2020. 32
[89] Zhang, T., D. Wang, H. Chen, Z. Zeng, W. Guo, C. Miao, and L. Cui: BDANN:
BERT-Based Domain Adaptation Neural Network for Multi-Modal Fake News De-
tection. In 2020 International Joint Conference on Neural Networks (IJCNN), pages
1–8, 2020. 32
[90] Brown, Tom, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Pra-
fulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell,
Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon
Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse,
Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark,
Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario
Amodei: Language Models are Few-Shot Learners. In Larochelle, H., M. Ranzato,
R. Hadsell, M. F. Balcan, and H. Lin (editors): Advances in Neural Information
Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc., 2020. 32
[91] Tan, Reuben, Bryan A. Plummer, and Kate Saenko: Detecting Cross-Modal Incon-
sistency to Defend Against Neural Fake News, 2020. 32
[92] Mosallanezhad, Ahmadreza, Kai Shu, and Huan Liu: Topic-Preserving Synthetic
News Generation: An Adversarial Deep Reinforcement Learning Approach, 2020.
32
[93] Zellers, Rowan, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Fran-
ziska Roesner, and Yejin Choi: Defending Against Neural Fake News. In Wallach,
H., H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (editors):
Advances in Neural Information Processing Systems, volume 32, pages 9054–9065.
Curran Associates, Inc., 2019. https://proceedings.neurips.cc/paper/2019/
file/3e9f0fc9b2f89e043bc6233994dfcf76-Paper.pdf. 32
[94] Wang, William Yang: "liar, liar pants on fire": A new benchmark dataset for fake
news detection. arXiv preprint arXiv:1705.00648, 2017. 34
[95] Bhatia, Ruchi: Source based Fake News Classification, Aug 2020. https://www.
kaggle.com/ruchi798/source-based-news-classification. 34
[96] Reimers, Nils and Iryna Gurevych: Making Monolingual Sentence Embeddings Mul-
tilingual using Knowledge Distillation. In Proceedings of the 2020 Conference on
Empirical Methods in Natural Language Processing. Association for Computational
Linguistics, November 2020. 34
[97] Adams, Samuel, David Melanson, and Martine De Cock: Private Text Classification
with Convolutional Neural Networks. In Proceedings of the Third Workshop on Pri-
vacy in Natural Language Processing, pages 53–58, Online, June 2021. Association
for Computational Linguistics. 35
[98] Trueman, Tina Esther, Ashok Kumar J., Narayanasamy P., and Vidya J.: Attention-
based C-BiLSTM for fake news detection. Applied Soft Computing, 110:107600,
2021, ISSN 1568-4946. 36
[99] Lotter, William, Abdul Rahman Diab, Bryan Haslam, Jiye G Kim, Giorgia Grisot,
Eric Wu, Kevin Wu, Jorge Onieva Onieva, Yun Boyer, Jerrold L Boxerman, et al.:
Robust breast cancer detection in mammography and digital breast tomosynthesis
using an annotation-efficient deep learning approach. Nature Medicine, 27(2):244–
249, 2021. 39
[100] Siegel, Rebecca L., Kimberly D. Miller, Hannah E. Fuchs, and Ahmedin Jemal:
Cancer statistics, 2022. CA: A Cancer Journal for Clinicians, 72(1):7–33, 2022. 39
[101] Amin, M.B., S.B. Edge, F.L. Greene, D.R. Byrd, R.K. Brookland, M.K. Washing-
ton, J.E. Gershenwald, C.C. Compton, K.R. Hess, D.C. Sullivan, et al.: AJCC
Cancer Staging Manual. Springer Cham, 2018, ISBN 9783319406176. 40
[102] Deng, J., W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei: ImageNet: A Large-
Scale Hierarchical Image Database. In CVPR09, 2009. 41
[103] Minaee, Shervin, Yuri Y Boykov, Fatih Porikli, Antonio J Plaza, Nasser Kehtarna-
vaz, and Demetri Terzopoulos: Image segmentation using deep learning: A survey.
IEEE transactions on pattern analysis and machine intelligence, 2021. 42
[104] Elpeltagy, Marwa and Hany Sallam: Automatic prediction of COVID-19 from chest
images using modified ResNet50. Multimedia tools and applications, 80(17):26451–
26463, 2021. 42
[105] Ali, Nairveen, Elsie Quansah, Katarina Köhler, Tobias Meyer, Michael Schmitt,
Jürgen Popp, Axel Niendorf, and Thomas Bocklitz: Automatic label-free detection
of breast cancer using nonlinear multimodal imaging and the convolutional neural
network ResNet50. Translational Biophotonics, 1(1-2):e201900003, 2019. 42
[106] Kolesnikov, Alexander, Alexey Dosovitskiy, Dirk Weissenborn, Georg Heigold, Ja-
kob Uszkoreit, Lucas Beyer, Matthias Minderer, Mostafa Dehghani, Neil Houlsby,
Sylvain Gelly, Thomas Unterthiner, and Xiaohua Zhai: An Image is Worth 16x16
Words: Transformers for Image Recognition at Scale. 2021. 42
[107] Hansen, Nicklas, Hao Su, and Xiaolong Wang: Stabilizing Deep Q-Learning with
ConvNets and Vision Transformers under Data Augmentation. In Ranzato, M., A.
Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (editors): Advances
in Neural Information Processing Systems, volume 34, pages 3680–3693. Curran
Associates, Inc., 2021. 43
[108] Yang, Kaiyu, Jacqueline Yau, Li Fei-Fei, Jia Deng, and Olga Russakovsky: A Study
of Face Obfuscation in ImageNet. In International Conference on Machine Learning
(ICML), 2022. 43
[110] Kumar, Rajesh, Jay Kumar, Abdullah Aman Khan, Zakria, Hub Ali, Cobbinah
M. Bernard, Riaz Ullah Khan, and Shaoning Zeng: Blockchain and homomorphic
encryption based privacy-preserving model aggregation for medical images. Compu-
terized Medical Imaging and Graphics, 102:102139, 2022, ISSN 0895-6111. 43
[111] Koboldt, Daniel C. et al: Comprehensive molecular portraits of human breast tu-
mours. Nature, 490(7418):61–70, Oct 2012, ISSN 1476-4687. 43
I MPC Protocols
Protocol: πADD
Input: Secret shares [[x1]]_q, . . . , [[xn]]_q
Output: [[z]]_q = Σ_(j=1..n) [[xj]]_q
Execution:
1. Each party Pi ∈ P locally computes its output share zi = Σ_(j=1..n) xj,i, where xj,i denotes Pi's share of xj
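Since additive shares of a sum are just the sums of the shares, πADD needs no interaction at all. A minimal clear-text sketch (illustrative modulus and party count):

```python
import random

Q = 2**61 - 1  # illustrative field modulus
N = 3          # number of parties

def share(x):
    """Split x into N additive shares modulo Q."""
    s = [random.randrange(Q) for _ in range(N - 1)]
    return s + [(x - sum(s)) % Q]

def reconstruct(shares):
    return sum(shares) % Q

def pi_add(shared_inputs):
    """pi_ADD: party i locally sums its shares of x_1..x_m; no messages sent."""
    return [sum(sh[i] for sh in shared_inputs) % Q for i in range(N)]

z = pi_add([share(10), share(20), share(12)])
assert reconstruct(z) == 42
```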
Protocol: πMUL
Setup:
1. The Trusted Initializer draws u, v uniformly from Zq, computes w = uv, and distributes shares [[u]]_q, [[v]]_q and [[w]]_q to the protocol parties
2. The TI draws ι uniformly from {1, . . . , n} and sends the asymmetric bit b = 1 to party Pι and b = 0 to every party Pi with i ≠ ι
Input: [[x]]_q and [[y]]_q
Output: [[z]]_q = [[xy]]_q
Execution:
1. Each party Pi locally computes di = xi − ui and ei = yi − vi
2. Parties broadcast di, ei
3. Each party computes d ← Σ_(i=1..n) di and e ← Σ_(i=1..n) ei
4. Each party Pi computes its output share zi = wi + e·ui + d·vi + bi·d·e, where bi is its asymmetric bit
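The multiplication with a Beaver triple can be simulated in the clear for two parties; the asymmetric bit decides which single party folds in the public term d·e (here, party 0):

```python
import random

Q = 2**61 - 1  # illustrative field modulus

def share(x):
    a = random.randrange(Q)
    return [a, (x - a) % Q]

def open_(sh):
    return sum(sh) % Q

def pi_mul(x_sh, y_sh):
    # Trusted initializer: triple (u, v, w) with w = u*v
    u, v = random.randrange(Q), random.randrange(Q)
    u_sh, v_sh, w_sh = share(u), share(v), share((u * v) % Q)
    b = [1, 0]  # asymmetric bit: exactly one party holds 1
    # steps 1-3: local masking, then broadcast and reconstruct d, e
    d = open_([(x_sh[i] - u_sh[i]) % Q for i in range(2)])
    e = open_([(y_sh[i] - v_sh[i]) % Q for i in range(2)])
    # step 4: local output share w_i + e*u_i + d*v_i (+ d*e once)
    return [(w_sh[i] + e * u_sh[i] + d * v_sh[i] + b[i] * d * e) % Q
            for i in range(2)]

assert open_(pi_mul(share(6), share(7))) == 42
```

Expanding the reconstruction confirms correctness: w + eu + dv + de = uv + (y−v)u + (x−u)v + (x−u)(y−v) = xy.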
Protocol: πIP
Setup: The setup procedure for πMUL
Input: [[~x]]_q, [[~y]]_q, and l (the length of ~x and ~y)
Output: [[z]]_q = [[~x · ~y]]_q
Execution:
1. Run l parallel instances of πMUL to compute [[zk]]_q = [[xk]]_q · [[yk]]_q for k ∈ {1, . . . , l}
2. Locally compute [[z]]_q ← Σ_(k=1..l) [[zk]]_q
3. Output [[z]]_q
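The inner product is thus l independent Beaver multiplications followed by a purely local sum of shares; a two-party sketch (the multiplication is re-stated so the snippet is self-contained):

```python
import random

Q = 2**61 - 1  # illustrative field modulus

def share(x):
    a = random.randrange(Q)
    return [a, (x - a) % Q]

def open_(sh):
    return sum(sh) % Q

def pi_mul(x_sh, y_sh):
    # Beaver-triple multiplication; party 0 adds the public term d*e
    u, v = random.randrange(Q), random.randrange(Q)
    u_sh, v_sh, w_sh = share(u), share(v), share((u * v) % Q)
    d = open_([(x_sh[i] - u_sh[i]) % Q for i in range(2)])
    e = open_([(y_sh[i] - v_sh[i]) % Q for i in range(2)])
    return [(w_sh[i] + e * u_sh[i] + d * v_sh[i] + (d * e if i == 0 else 0)) % Q
            for i in range(2)]

def inner_product(xs_sh, ys_sh):
    # one multiplication per coordinate, then a local sum of shares
    prods = [pi_mul(x, y) for x, y in zip(xs_sh, ys_sh)]
    return [sum(p[i] for p in prods) % Q for i in range(2)]

z = inner_product([share(a) for a in (1, 2, 3)],
                  [share(b) for b in (4, 5, 6)])
assert open_(z) == 32  # 1*4 + 2*5 + 3*6
```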
Protocol: πOIS
Setup: Let l be the bitlength of the inputs to be shared and n the dimension of the input vector. The trusted initializer pre-distributes all the correlated randomness necessary for the execution of πMUL over Z_(2^l)
Input: Alice inputs the vector ~x = (x1, . . . , xn), and Bob inputs k, the index of the desired output value
Output: xk
Execution:
1. Bob defines the selection bits y1, . . . , yn, with yj = 1 if j = k and yj = 0 otherwise
2. For j ∈ {1, . . . , n} and i ∈ {1, . . . , l}, let xj,i denote the i-th bit of xj
3. Define [[yj]]_2 as the pair of shares (0, yj) and [[xj,i]]_2 as (xj,i, 0)
Protocol: πEq
Setup: The setup procedure for πMUL
Input: [[x]]_q and [[y]]_q
Output: [[0]]_q if x = y; a sharing of a non-zero number otherwise
Execution:
1. The parties locally compute [[c]]_q ← [[x]]_q − [[y]]_q
2. Using πMUL, compute [[z]]_q ← [[r]]_q · [[c]]_q, where [[r]]_q is a sharing of a uniformly random non-zero r pre-distributed by the TI
3. Output [[z]]_q
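The output specification matches the standard randomized equality test z = r·(x − y) for a secret random non-zero r; a clear-text sketch of that instantiation (an assumption on our part, since the multiplication by r would use πMUL in the actual protocol):

```python
import random

Q = 2**61 - 1  # a prime modulus, so r*c != 0 whenever r, c != 0

def share(x):
    a = random.randrange(Q)
    return [a, (x - a) % Q]

def open_(sh):
    return sum(sh) % Q

def pi_eq(x_sh, y_sh, r):
    # [[c]] = [[x]] - [[y]] is a local operation on shares; the product
    # r*c would be computed with pi_MUL -- simulated here in the clear
    c = open_([(x_sh[i] - y_sh[i]) % Q for i in range(2)])
    return (r * c) % Q

r = random.randrange(1, Q)          # random non-zero blinding factor
assert pi_eq(share(5), share(5), r) == 0
assert pi_eq(share(5), share(9), r) != 0
```

Because Q is prime, r·(x − y) mod Q is zero exactly when x = y, and otherwise reveals nothing about x − y beyond its being non-zero.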
Protocol: F2toq
Input: JxK2
Output: JxKq
Execution:
2. Alice and Bob perform a secure bitwise xor using πOR|XOR with Alice’s inputs being
(~xa , 0) and Bob’s input being (0, ~xb ) and modulus q > 2.
Protocol: πOR|XOR
Setup: The setup procedure for πMUL over Z2
Input: [[x]]_2, [[y]]_2 and k, where k = 1 to compute the OR and k = 2 to compute the XOR of the inputs
Output: [[x ∨ y]]_2 if k = 1, [[x ⊕ y]]_2 if k = 2
Execution:
1. Using πMUL, compute [[w]]_2 ← [[x]]_2 · [[y]]_2
2. Locally compute [[z]]_2 ← [[x]]_2 + [[y]]_2 + [[w]]_2 if k = 1, or [[z]]_2 ← [[x]]_2 + [[y]]_2 if k = 2
3. Output [[z]]_2
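Over Z2, XOR is a purely local operation on shares, while OR costs one secure AND via x ∨ y = x ⊕ y ⊕ xy; a two-party sketch with a Beaver triple over Z2 (addition and subtraction both become XOR):

```python
import random

def share2(bit):
    a = random.randrange(2)
    return [a, bit ^ a]  # XOR shares over Z2

def open2(sh):
    return sh[0] ^ sh[1]

def and2(x_sh, y_sh):
    # Beaver triple over Z2
    u, v = random.randrange(2), random.randrange(2)
    u_sh, v_sh, w_sh = share2(u), share2(v), share2(u & v)
    d = open2([x_sh[i] ^ u_sh[i] for i in range(2)])
    e = open2([y_sh[i] ^ v_sh[i] for i in range(2)])
    # only party 0 folds in the public term d*e
    return [w_sh[i] ^ (e & u_sh[i]) ^ (d & v_sh[i]) ^ (d & e if i == 0 else 0)
            for i in range(2)]

def or_xor(x_sh, y_sh, k):
    """k = 1 computes OR, k = 2 computes XOR, as in pi_OR|XOR."""
    z = [x_sh[i] ^ y_sh[i] for i in range(2)]   # x + y mod 2
    if k == 1:
        w = and2(x_sh, y_sh)
        z = [z[i] ^ w[i] for i in range(2)]     # + xy mod 2
    return z

for x in (0, 1):
    for y in (0, 1):
        assert open2(or_xor(share2(x), share2(y), 1)) == (x | y)
        assert open2(or_xor(share2(x), share2(y), 2)) == (x ^ y)
```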
Protocol: πTrunc
Setup: Let λ be a statistical security parameter. The protocol is parametrized by the size q > 2^(k+f+λ+1) of the field and the dimensions ℓ1, ℓ2 of the input matrix. The trusted initializer picks a matrix R′ ∈ Fq^(ℓ1×ℓ2) with elements uniformly drawn from {0, . . . , 2^f − 1} and a matrix R″ ∈ Fq^(ℓ1×ℓ2) with elements uniformly drawn from {0, . . . , 2^(k+λ) − 1}. Then, the TI computes R = R″·2^f + R′ and creates secret shares [[R]]_q and [[R′]]_q to distribute to the parties.
Input: The parties' input is [[W]]_q such that every element w of W satisfies w ∈ {0, 1, . . . , 2^(k+f−1) − 1} ∪ {q − 2^(k+f−1) + 1, . . . , q − 1}.
Execution:
1. The parties locally compute [[Z]]_q ← [[W]]_q + [[R]]_q and open Z
2. Each party computes C = Z mod 2^f and locally sets [[S]]_q ← [[W]]_q + [[R′]]_q − C (with C subtracted from a single designated share)
3. For i = ((q + 1)/2)^f, locally compute [[T]]_q ← i·[[S]]_q and output [[T]]_q
Protocol: πBD
Setup: Let l be the bitlength of the value x to be bit-decomposed. The TI draws blinding values U, V uniformly from Z2, sets W := U·V, and distributes shares [[U]]_2, [[V]]_2 and [[W]]_2 to the parties
Input: [[x]]_q, for q ≤ 2^l
Output: [[x]]_2
Execution:
1. Let a denote Alice's share of x, which corresponds to the bit string {a1, . . . , al}. Similarly, let b denote Bob's share of x, which corresponds to the bit string {b1, . . . , bl}. Define the secret sharing [[yi]]_2 as the pair of shares (ai, bi) for yi = ai + bi mod 2, [[ai]]_2 as (ai, 0) and [[bi]]_2 as (0, bi)
2. Compute [[c1]]_2 ← [[a1]]_2 · [[b1]]_2 using distributed multiplication, and locally set [[x1]]_2 ← [[y1]]_2
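The remaining iterations evaluate the binary addition circuit on the shared bits: x_i = y_i ⊕ c_(i−1) with carry c_i = a_i·b_i ⊕ c_(i−1)·y_i. The circuit itself, sketched in the clear for reference (in πBD each AND is a distributed multiplication over Z2):

```python
def add_circuit_bits(a, b, l):
    """Bits (LSB first) of (a + b) mod 2^l, using only XOR/AND gates --
    the circuit pi_BD evaluates on secret-shared bits."""
    bits, carry = [], 0
    for i in range(l):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        y = ai ^ bi                       # y_i = a_i + b_i mod 2
        bits.append(y ^ carry)            # x_i = y_i xor carry-in
        carry = (ai & bi) ^ (carry & y)   # carry out of position i
    return bits

bits = add_circuit_bits(13, 25, 6)        # 13 + 25 = 38 = 0b100110
assert sum(bit << i for i, bit in enumerate(bits)) == 38
```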
Protocol: πDC
Setup: The trusted initializer draws blinding values U, V uniformly from Z2, sets W := U·V, and distributes shares [[U]]_2, [[V]]_2 and [[W]]_2 to the parties
Input: Each party gets the shares [[xi]]_2 and [[yi]]_2 for each bit of the l-bit integers x and y
Output: [[1]]_2 if x ≥ y, and [[0]]_2 otherwise
Execution:
1. For i ∈ {1, . . . , l}, compute in parallel [[di]]_2 ← [[yi]]_2 · ([[1]]_2 − [[xi]]_2) using the multiplication protocol
2. For i ∈ {1, . . . , l}, locally compute the bit-equality indicators [[ei]]_2 ← [[1]]_2 + [[xi]]_2 + [[yi]]_2
3. For i ∈ {1, . . . , l}, compute [[ci]]_2 ← [[di]]_2 · Π_(j=i+1..l) [[ej]]_2 using the multiplication protocol
4. Compute [[w]]_2 ← [[1]]_2 + Σ_(i=1..l) [[ci]]_2
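The comparison formula can be checked by evaluating it in the clear, with the bit-equality indicators e_i = 1 + x_i + y_i mod 2 (the standard choice; the protocol text elides this step). Exactly one c_i is 1 when y > x, namely at the most significant differing bit, so w = 1 ⊕ Σ c_i is the bit [x ≥ y]:

```python
def ge_circuit(x, y, l):
    """Evaluate the pi_DC formula in the clear: returns 1 iff x >= y.
    Bit 1 is the LSB and bit l the MSB, matching the protocol's indexing."""
    xb = [(x >> i) & 1 for i in range(l)]
    yb = [(y >> i) & 1 for i in range(l)]
    d = [yb[i] & (1 ^ xb[i]) for i in range(l)]   # y_i > x_i at bit i
    e = [1 ^ xb[i] ^ yb[i] for i in range(l)]     # bits equal at position i
    c = []
    for i in range(l):
        prod = 1
        for j in range(i + 1, l):                 # all more significant bits equal
            prod &= e[j]
        c.append(d[i] & prod)
    return 1 ^ (sum(c) % 2)                       # w = 1 + sum(c_i) mod 2

for x in range(8):
    for y in range(8):
        assert ge_circuit(x, y, 3) == (1 if x >= y else 0)
```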
Protocol: πargmax
Setup: Let l be the bitlength and k the number of values to be compared. The trusted initializer draws blinding values U, V uniformly from Zq, sets W := U·V, and distributes shares [[U]]_q, [[V]]_q and [[W]]_q to the parties
Input: Each party has as input the shares [[v_j,i]]_q for all j ∈ {1, . . . , k} and i ∈ {1, . . . , l}
Output: Value m computed by party P1
Execution:
1. For j ∈ {1, . . . , k} and n ∈ {1, . . . , k}, the parties compute in parallel the distributed comparison protocol with inputs [[v_j,i]]_2 and [[v_n,i]]_2 (i ∈ {1, . . . , l}). Let [[w_j,n]]_2 denote the obtained output
2. For j ∈ {1, . . . , k}, compute in parallel [[wj]]_2 = Π_(n∈{1,...,k}) [[w_j,n]]_2 using the multiplication protocol