
Universidade de Brasília

Faculdade de Tecnologia
Departamento de Engenharia Elétrica

An Architecture for Privacy-Preserving Machine Learning

Stefano Mozart Pontes Canedo de Souza

Thesis proposal presented as a partial requirement for
qualification in the Doctoral Program in Electrical Engineering

Advisor
Prof. Dr. Daniel Guerreiro e Silva

Co-advisor
Prof. Dr. Anderson Clayton Alves do Nascimento

Brasília
2022
Universidade de Brasília
Faculdade de Tecnologia
Departamento de Engenharia Elétrica

An Architecture for Privacy-Preserving Machine Learning

Stefano Mozart Pontes Canedo de Souza

Thesis proposal presented as a partial requirement for
qualification in the Doctoral Program in Electrical Engineering

Prof. Dr. Daniel Guerreiro e Silva (Advisor)


Universidade de Brasília

Prof. Dr. Anderson Clayton Alves do Nascimento (Co-advisor)


University of Washington at Tacoma

Prof. Dr. João José Costa Gondim
Universidade de Brasília

Prof. Dr. Denis Gustavo Fantinato
Universidade de Campinas

Bernardo Machado David


IT University of Copenhagen

Prof. Dr. Kleber Melo e Silva


Coordinator of the Graduate Program in Electrical Engineering

Brasília, June 28, 2022


Abstract

Machine learning (ML) applications have become increasingly frequent and pervasive in
many areas of our lives. We enjoy customized services based on predictive models built
with our private data. There are, however, growing concerns about privacy. This is evidenced
by the enactment of the General Data Protection Law in Brazil and by similar legislative
initiatives in the European Union and in several other countries.
This trade-off between privacy and the benefits of ML applications can be mitigated
with the use of techniques that allow the construction and operation of these computational
models with formal guarantees of privacy preservation. These techniques need to
respond adequately to challenges posed at every stage of the typical ML application life
cycle, from data discovery, through feature extraction, model training and validation, up
to its effective application.
This work presents an architecture for Privacy-Preserving Machine Learning (PPML)
solutions built on homomorphic cryptography primitives and Secure Multi-party Com-
putation (MPC) protocols, which allow the adequate treatment of data and the efficient
application of ML algorithms with robust guarantees of privacy preservation. It describes
a concrete implementation of the proposed PPML architecture and demonstrates its use
in two case studies: text classification for fake news detection and image classification for
breast cancer detection and staging.

Keywords: Privacy-Preserving Machine Learning, Secure Multi-Party Computation, Transfer Learning

Resumo

Machine learning (ML) applications have become increasingly frequent and pervasive in
the various areas of our lives. We enjoy personalized services based on predictive models
built with our private data. There is, however, growing concern about privacy. The
General Data Protection Law in Brazil and similar legislative initiatives in the European
Union and in several other countries are evidence of this.
This trade-off between privacy and the benefits of ML applications can be mitigated
with the use of techniques that allow the construction and operation of these computational
models with formal, mathematical, guarantees that user privacy is preserved. These
techniques need to respond adequately to the challenges posed at every stage of the
typical ML application life cycle, from data discovery, through feature extraction, model
training and validation, up to its effective use.
This work presents an architecture for Privacy-Preserving Machine Learning (PPML)
solutions, built on homomorphic cryptography primitives and Secure Multi-party Computation
(MPC) protocols, which allow the adequate treatment of data and the efficient
application of ML algorithms with robust privacy guarantees. The work also brings a
concrete implementation of the proposed architecture and its application to two relevant
and sensitive topics: text classification for fake news detection and image classification
for breast cancer detection and staging.

Palavras-chave: Privacy-Preserving Machine Learning, Secure Multi-Party Computation, Transfer Learning

Contents

1 Introduction 1
1.1 Research subject . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Research objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4.1 Preliminary results . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4.2 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.5 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2 An Architecture for Privacy-Preserving Machine Learning 9


2.1 Secure Multi-Party Computation . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Homomorphic Cryptography . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 Comparing complexity and cost . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.1 Monte Carlo methods for integration . . . . . . . . . . . . . . . . . 17
2.4 A general architecture for Privacy-Preserving
Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4.1 The need for a general PPML architecture . . . . . . . . . . . . . . 21

3 Text classification 23
3.1 Natural Language Processing . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2 Classic NLP preprocessing techniques . . . . . . . . . . . . . . . . . . . . . 23
3.2.1 Tokenization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2.2 Stopword removal (SwR) . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2.3 Stemming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2.4 Lemmatization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2.5 Part-of-Speech (PoS) tagging . . . . . . . . . . . . . . . . . . . . . 25
3.2.6 Bag-of-Words (BoW) . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.7 Term Frequency – Inverse Document Frequency (TF-IDF) . . . . . 27
3.2.8 Continuous bag-of-words (CBoW) . . . . . . . . . . . . . . . . . . . 27

3.3 The State-of-the-Art: Transformers . . . . . . . . . . . . . . . . . . . . . . 28
3.3.1 Trade-offs and applications . . . . . . . . . . . . . . . . . . . . . . . 29

4 Privacy-preserving fake news detection 30


4.1 Detection approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.1.1 Source based detection . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.1.2 Fact checking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.1.3 Natural Language Processing (NLP) . . . . . . . . . . . . . . . . . 32
4.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.2.1 Selected datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.2.2 Clear-text training and inference . . . . . . . . . . . . . . . . . . . 34
4.2.3 Privacy-preserving model training . . . . . . . . . . . . . . . . . . . 35
4.2.4 Privacy-preserving inference . . . . . . . . . . . . . . . . . . . . . . 36
4.3 Findings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

5 Privacy-preserving breast cancer classification 39


5.1 Breast cancer classification and staging . . . . . . . . . . . . . . . . . . . . 40
5.2 Transfer learning in computer vision . . . . . . . . . . . . . . . . . . . . . 41
5.2.1 Image models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.2.2 Image embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.2.3 Vision transformers . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.3 Privacy-preserving cancer classification . . . . . . . . . . . . . . . . . . . . 43
5.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

6 Conclusion 45
6.1 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6.1.1 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

Bibliography 48

Appendix 57

I MPC Protocols 58

List of Figures

2.1 General PPML Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3.1 PoS Tagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26


3.2 Continuous Bag-of-Words & Skip-gram auto-encoder . . . . . . . . . . . . 28

List of Tables

2.1 Experiment Type I - Runtimes in Milliseconds . . . . . . . . . . . . . . . . 17

4.1 List of NLP preprocessing experiments in clear-text . . . . . . . . . . . . . 34


4.2 Best accuracy on clear-text setting . . . . . . . . . . . . . . . . . . . . . . 35
4.3 Neural networks training runtime . . . . . . . . . . . . . . . . . . . . . . . 35
4.4 Best accuracy on Privacy-Preserving setting . . . . . . . . . . . . . . . . . 36

6.1 Schedule of doctoral research activities . . . . . . . . . . . . . . . . . . . . 46

Acronyms

Cade Administrative Council for Economic Defense.

FHE Fully Homomorphic Encryption.

GDPR General Data Protection Regulation.

HE Homomorphic Encryption.

INCA Brazilian National Cancer Institute.

LGPD General Data Protection Law.

MIT Massachusetts Institute of Technology.

ML Machine Learning.

MPC Secure Multi-party Computation.

NLP Natural Language Processing.

PPGEE Graduate Program in Electrical Engineering.

PPML Privacy-preserving Machine Learning.

UC Universal Composability.

1 Introduction

The last thing one discovers in composing a work
is what to put first.

Pensées
Blaise Pascal

Machine learning (ML) is nowadays an unavoidable topic in academia, industry, and
in the general imagination. Natural Language Processing (NLP), the group of techniques
used for tasks such as text classification, is one of the earliest and also one of the most
advanced areas of Machine Learning (ML). Conversational agents and generative models
still have people in awe, especially when large language understanding models, such as
GPT-3, make it to the headlines in the general media, which advertise their fascinating
performance on tasks as diverse as producing working computer code, coherently answering
e-mails, or even writing a novel [1].
Likewise, applications of ML in computer vision – the area of study concerned with
teaching machines to deal with visual perception – have not only amazed but also worried
people at all levels of society with extremely relevant issues such as facial recognition,
deep fakes and self-driving vehicles [2].
In fact, aside from the astonishing headlines, ML applications have become more and
more present, and have an increasing impact on everyday life. Predictive models, built
with ML algorithms, are found in trivial tasks, from ranking and ordering algorithms
in search engines and social media, to recommendation systems in streaming platforms
and e-commerce marketplaces. There are, however, applications in areas as sensitive as
medical imaging diagnosis, detection of tax fraud and crimes against the financial system.
In most cases, we benefit from personalized services based on inference models built
with our private data. Recent disclosures of numerous cases of abuse arising from the
possession of such data, in addition to frequent security breaches that expose millions of
users, have fueled a growing concern about privacy. The General Data Protection Law
in Brazil (LGPD), and similar legislative initiatives such as the General Data Protection
Regulation (GDPR) in the European Union, and the California Consumer Privacy Act of
2018 (CCPA) are evidence of this move to limit and regulate the use of private information
by large service providers [3, 4].
This trade-off between privacy and the benefits of ML applications can be mitigated
with the use of techniques that allow the training and application of predictive models
while preserving user privacy. These techniques need to respond adequately to the chal-
lenges presented at all stages in the life-cycle of a typical ML application: from data
discovery, through the data wrangling stage (feature selection, feature extraction, combi-
nation, normalization and imputation in the sample space), to the training and validation
of the models, until their effective use in inference.

1.1 Research subject


The research on Privacy-Preserving Machine Learning (PPML) has several active branches
or trends. On one end of the spectrum, there are some works dealing with palliative ap-
proaches to privacy, such as obfuscation techniques or access control tools for servers
and software defined networks [5, 6]. Among the various techniques available, however,
those that offer the most robust privacy guarantees are those based on computation over
encrypted data. This assertion stems from the fact that, if there is a formal – mathematical
– proof of the security of the underlying cryptographic primitives, then, by the Universal
Composability (UC) framework, it is possible to build larger models that inherit the same
confidentiality guarantees [7, 8].
Performing computations on encrypted data was shown to be theoretically possible more
than three decades ago [9], but the field has developed slowly ever since. The complexity
of formal security proofs of the proposed cryptographic systems and protocols,
as well as their high computational cost, made any practical applications unfeasible for
many years and imposed large barriers to the development of this field. More recently,
the subject has received renewed interest with works demonstrating the use of Fully Ho-
momorphic Encryption (FHE) cryptographic systems and protocols [10, 11].
Another line of publications uses partially homomorphic cryptographic systems, which
allow a limited set of operations on encrypted databases [12, 13, 14, 15]. Among these
works there are solutions that attracted a lot of attention, even from the mainstream
media, as, for example, CryptDB from MIT [16]. In this line there is also the
research developed here at the PPGEE, Universidade de Brasília, which integrated an
additively homomorphic encryption system and a cryptosystem with order preservation
in an application that allows the search on an encrypted repository of Electronic Health
Records [17, 18].

Note that the use of encrypted databases presupposes the prompt availability of the
entire original database and of a large computing capability, on the part of the data owner,
for encryption and for the training of inference models - which is not always
the case. Seeking to overcome this limitation, there is also a line of works focused on
online, interactive and reinforcement learning techniques. These applications require the
development of Secure Multi-party Computation (MPC) and the composition of several
protocols as the basis for learning algorithms [19].
The main challenge, in all these lines of research, is the distance between the logical
or theoretical design of the proposed solutions and the practical intended applications.
The most cited works in the area, in general, do not present how, or how effectively, the
proposed system or protocol deals with problems of scale, response time, and usability,
among others, which drastically affect the support of real decision processes in real use
cases.
Furthermore, most works in the literature focus on just one phase of the Machine Learning
life-cycle: either the training-related steps or the application of the trained model in
inference (both for regression and classification models). Little or no attention is given
to the initial steps, to data fitting, feature extraction, among other activities that are
extremely relevant for the overall performance of the predictive models. Works that bring
any analysis of the statistical robustness or the quality of the models produced are also
rare.
Therefore, the selected subject for the current research is not limited to isolated PPML
techniques for privacy-preserving model training or inference. The proposed contribution,
in fact, is the specification of a general architecture for the application of ML techniques
with robust guarantees of privacy. Attention is given to how well-established ML
solutions, such as transfer learning, can interact with privacy-enhancing technologies
without loss of predictive power or usability.
Also, the complete knowledge production cycle is considered, from the basic theory to
the assessment of the impact of its applications. Thus the choice of building a concrete
implementation and analyzing two case studies for the architecture: one in the NLP
domain and one in the computer vision domain. Practical implementation details, such
as ease of use for the general public and realistic computational cost analysis, are also
examined.
For the NLP problem, this study dives into the use case of text classification for fake
news detection. For the computer vision problem, there is a use case of image classification
for the detection and categorization of breast cancer.

1.2 Motivation
This work results from both the author’s research as a Doctorate student at the Graduate
Program in Electrical Engineering (PPGEE), at Universidade de Brasília, and his work
as a Data Scientist at the Administrative Council for Economic Defense (Cade). It also
draws on experience with homomorphic cryptography gathered during the author's
master's research [18].
The scientific investigation proposed in this project is primarily based on the need to
complete the knowledge production cycle, as discussed above, with the effective develop-
ment, application and validation of a solution based on the state-of-the-art in the fields
of Machine Learning and privacy enhancing technologies. These are somewhat disparate
concepts: while ML are focused on extracting information, Cryptography is focused on
concealing it. Therefore, there is a contribution from the theoretical point of view, consid-
ering the exposition on how to couple and harmonize such different groups of techniques.
Another contribution is the discussion on adequate security models for the specific
tasks of text and image classification. There is also a contribution from a technical point
of view, with the carrying out of experiments that may serve as a reference implementation
of the proposed architecture. The most relevant contribution, nevertheless, is the practical
application, with a relevant impact on the institutions involved. In the present case, the
specific contribution to Cade - the Brazilian competition authority.
The intelligence unit at Cade deals with an enormous quantity of data, from large pub-
lic procurement databases to other open intelligence sources, such as news articles, online
marketplaces, and enterprise websites. This unit also performs dawn raids, collecting
documents from investigated companies - on paper, on computers, hard drives, executives'
smartphones, etc. Some operations are carried out in cooperation with prosecutors and
police, at Federal or State levels. Some, involving multinationals, are coordinated with
other competition authorities in different countries.
All this intelligence activity, nonetheless, is bound by the data protection laws in force
in Brazil, which require formal guarantees of privacy protection [20]. While searching for
evidence of cartel or other anticompetitive conduct, Cade needs to protect the privacy
of the individuals involved - whether executives, legal representatives or other persons
somehow related to the investigated economic agents. And a considerable portion of this
data lies in textual and image formats. Thus, the importance of privacy in text and image
classification.

1.3 Research objectives
The general goal of the research work described in this document is to present an
architecture: a cohesive, logically and formally integrated set of techniques, combining
cryptographic and MPC protocols, as well as ML algorithms, that provide robust privacy
guarantees for real inference applications, with special attention to the text and image
classification tasks.
The proposed goal is to be attained through the following specific objectives:

• Identify possible limitations and flaws in PPML solutions presented in literature;

• Propose and test improvements, new techniques and implementation details that
may correct and overcome major flaws and limitations pointed out in previous so-
lutions;

• Systematize this knowledge in the form of a general architecture for PPML;

• Build reference implementations of the architecture in order to demonstrate its


feasibility and the correctness of its design.

1.4 Methodology
Experience shows, as reported by Trauth [21], that when a study in the area of technology
development seeks to understand the impact of a given solution, there is a need for the
application of qualitative methodologies.
Also, according to Deb, Dey & Balas [22], engineering research must effectively com-
bine the conceptualization of a research question with practical problems, from equip-
ment to algorithms and mathematical concepts used to solve the proposed problem. Fur-
thermore, engineering research should advance knowledge in three broad, and somewhat
overlapping, areas: observational data, that is, knowledge of the phenomena; functional
modeling of the observed phenomena; and the design of processes (algorithms, procedures,
arrangements) that contribute to the desired output. This brings a strong descriptive
character to engineering research, as one must clearly communicate the preconditions
and environmental dependencies, observed data, processes, inputs and outputs of their
experimental results.
Kaplan and Duchon state that the eminently applied nature of research in the field
of information systems requires a combination of qualitative and quantitative methods
[23]. They assert that quantitative investigations, usually performed through statisti-
cal hypothesis testing, are extremely limited when the expected results or the intended
applications are highly dependent on context.

They refer to the work of Yin [24], which deals with methods for Case Study research,
to show that quantitative research must be preceded by a qualitative investigation, in
which the problem is better defined based on the observation of the context, habits and
needs of the stakeholders. This is even more relevant in exploratory research, in which
new hypotheses are raised. These initial hypotheses need contextualization, through
qualitative investigation, to create more refined models and hypotheses, which can then
be tested using quantitative methods.
Taking into account the research subject and the proposed objectives, this work is ex-
ploratory in nature. It draws from different areas of study to propose a new technological
arrangement. It is also of an applied nature, as the method used for scientific investigation
is centered on the design of a solution for a specific problem. Therefore, it must combine
descriptive, qualitative and quantitative approaches to the study of the selected problem.
Overall, this work can be described as the combination of the following initiatives:

1. Literature review on Machine Learning;

2. Literature review on Natural Language Processing for text classification;

3. Literature review on fake news detection;

4. Literature review on Computer Vision for image classification;

5. Literature review on breast cancer detection;

6. Literature review on Privacy-Preserving Machine Learning;

7. Implementation and experimentation with Secure Multi-party Computation (MPC) protocols;

8. Experimentation on fake news detection in the clear-text setting;

9. Experimentation on fake news detection in the privacy-preserving setting;

10. Experimentation on cancer detection in the clear-text setting;

11. Experimentation on cancer detection in the privacy-preserving setting.

1.4.1 Preliminary results


Although this document reports only text classification experiments for the fake news
detection use case, the author also tested a few other approaches during the doctoral research.

As a preliminary result, the author, in collaboration with graduate students from
Universidade de Campinas - Unicamp, published a solution that uses the source-based
detection approach, identifying autonomous software agents (bots) [25].
The author also reported a few experimental results on NLP preprocessing techniques
at the seminar organized by the Digital Signal Processing Group (GPDS). These results
are detailed in the chapter dedicated to Natural Language Processing in this work.
The experiments used to compare the computational cost and discuss the expected
execution time of HE and MPC protocols are presented in [26], and are part of the chapter
on PPML.

1.4.2 Limitations
As a result of the extensive research on different topics (ML, NLP, Computer Vision,
HE and MPC), this document is not meant as a deep exposition of any of these areas.
It provides, nonetheless, a good set of references for those willing to further investigate
relevant results in each area.
This work also does not set out to be a reference for complete security proofs of all
the cryptographic primitives and protocols used. There is, however, sufficient discussion
regarding the security model, that is, the assumptions or requirements for the validity of
the security of the underlying privacy-enhancing technologies.
There are projects underway to apply the knowledge gathered in this study, as well as
the proposed architecture, in document classification and evidence search at Cade. These
are critical use cases, especially when they involve cooperation and information sharing
with other government agencies and with competition authorities in other countries.
However, due to confidentiality requirements, it is not possible to expose results or present
reproducible experiments performed over such data.
All the experimental details presented are limited to the two proposed use cases: fake
news detection and breast cancer detection and staging. There is also the choice to deal with
both tasks as a binary output problem. Thus, objects are classified as one of two classes -
fake or true, for texts - or through binary, one-vs-all models for each class of carcinoma. This
choice results from the fact that most public datasets, on both topics, are annotated that
way. The same techniques can be generalized, with some preprocessing effort, to deal with
multi-class problems using other approaches that would render multi-label probabilities.

1.5 Organization
The next chapter brings a more detailed exposition of the concept of Privacy-Preserving
Machine Learning (PPML), with selected results from the literature. It also presents a
general architecture for PPML. Chapter 3 presents a literature review on Natural Lan-
guage Processing, from the classic preprocessing techniques developed throughout the last
3 or 4 decades, to the present state-of-the-art with Transformers and other complex Natural
Language Understanding models. The fourth chapter brings the concept of fake news
and a brief review on fake news detection. It also introduces a few experimental results
on the use case on fake news detection. Chapter 5 presents a short literature review on
computer vision, transfer learning and embeddings for image processing applications. It
also describes the ongoing work on image classification for cancer detection and staging.
The last chapter summarizes our results and conclusions, pointing out the current state
of our knowledge on privacy-preserving machine learning and how our proposition contributes
to that field.

2 An Architecture for
Privacy-Preserving Machine
Learning

It is remarkable that a science which began with
the consideration of games of chance should have
become the most important object of human
knowledge.

Théorie Analytique des Probabilités


Pierre-Simon Laplace

The research on Privacy-Preserving Machine Learning (PPML) is currently very active.
On one end of the spectrum, there are some works dealing with palliative approaches
to privacy, such as obfuscation techniques or access control tools for servers and software
defined networks [5, 6]. Among the various techniques available, however, the ones that
offer the most robust privacy guarantees are those based on computation over encrypted data.
This assertion stems from the fact that, if there is a formal – mathematical – proof of
the security of the underlying cryptographic primitives, then, by the Universal Composability
(UC) framework, it is possible to build a complex model that inherits the same confidentiality
guarantees of its basic building blocks [7, 8].
Performing computations on encrypted data was shown to be theoretically possible more
than three decades ago [9], but the field has developed slowly ever since. The complexity
of formal security proofs of the proposed cryptographic systems and protocols,
as well as their high computational cost, made any practical applications unfeasible for
many years and imposed large barriers to the development of this field. More recently,
the subject has received renewed interest with works demonstrating the use of Fully Ho-
momorphic Encryption (FHE) cryptographic systems and protocols [10, 11].
Another line of publications uses partially homomorphic cryptographic systems, which
allow a limited set of operations on encrypted databases [13, 14]. Among these works there
are solutions that attracted a lot of attention, even from the mainstream media, as, for
example, CryptDB from MIT [16]. In this line there are also many practical
solutions, including areas as sensitive as Electronic Health Records [15, 18].
Note that the use of encrypted databases usually requires the prompt availability of the
entire database, and the computing power needed for encryption, on the side of the data
owner - which is not always the case. Also, for the training of inference models there will
be at least a few communication rounds between the data owner and the service provider,
in order to compute the loss function and adjust the trained parameters accordingly.
This process is computationally costly and complicates the analysis of the security of
the solution. Seeking to overcome this limitation, there is also a line of works focused on
online, distributed, and interactive machine learning techniques. These applications are based
on Secure Multi-party Computation (MPC) and the composition of several protocols as
the basis for the learning algorithms [19].
Since then, privacy-preserving computation has grown in importance and attention, and
many Privacy-Preserving Machine Learning (PPML) and Privacy-Preserving Function
Evaluation (PPFE) frameworks have been developed. The protocols that form the basic
building blocks of these frameworks are usually based on homomorphic cryptography
primitives or Secure Multi-Party Computation protocols. Some of the first frameworks to
appear in the literature, for instance, used MPC protocols based on Secret Sharing. Among
those preceding results are FairPlayMP [27] and Sharemind [28].
Recent developments include frameworks like PySyft [29], which uses HE protocols,
and Chameleon [30], which uses secret-sharing MPC for linear operations and Garbled
Circuits (a two-party MPC technique) for non-linear evaluations. Other MPC frameworks
in the literature include CrypTen
[31], PICCO [32], TinyGarble [33], ABY3 [34] and SecureML [35].
Most of these results were demonstrated with proof-of-concept applications focused on
the inference step of the machine learning solution. Usually, however, there is no discussion
about how these frameworks accommodate heavily used practices, such as feature engineering
and transfer learning, that are defining characteristics of the different areas of machine
learning.
Thus the need for a systematization of knowledge on how to connect the common steps of
ML with the common privacy-enhancing technologies. The rest of this chapter presents,
in a little more detail, the two classes of privacy-enhancing techniques discussed above
– MPC and HE – and a way to reason about their implementation complexity and
computational cost in order to select the best fit for the different applications. At the end,
we present a general design for a PPML solution that can be implemented with most of
the APIs and frameworks listed above.

2.1 Secure Multi-Party Computation
Introduced by Yao [36], Secure Multi-Party Computation refers to a set of protocols and
algorithms that allow a group of computing parties P to evaluate a function F (X) over
a set X = (x1 , x2 , . . . , xn ) of private inputs, in a way that guarantees participants gain
knowledge only of the global function result, but not of each other's inputs.

Protocol: πADD
Input: Secret shares Jx1 Kq , . . . , Jxn Kq
Output: JzKq = Σ_{i=1}^{n} Jxi Kq

Execution:

1. Each party Pi ∈ P locally computes zi = Σ_{j=1}^{n} xj,i , the sum of its shares of each input

2. Each party Pi broadcasts zi

3. Each party locally computes z = Σ_{i=1}^{n} zi mod q

Protocol 1: Secure Multi-party Addition Protocol πADD

Additive secret sharing is one way to implement MPC. Protocol parties have additive
shares of their secret values and perform joint computations over those shares. For
example, to create n additive shares of a secret value x ∈ Zq , a participant can draw
(x1 , . . . , xn ) uniformly from {0, . . . , q − 1} such that x = Σ_{i=1}^{n} xi mod q. We denote this
set of shares by JxKq .
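The sharing and reconstruction steps above can be sketched in a few lines of Python (a toy single-process illustration; real MPC frameworks handle communication, encoding and security details, and the modulus and values here are arbitrary):

```python
# Toy additive secret sharing over Z_q -- illustration only, not secure code.
import random

q = 2**31 - 1  # public modulus; any sufficiently large modulus works

def share(x, n=3):
    """Split secret x into n additive shares modulo q."""
    shares = [random.randrange(q) for _ in range(n - 1)]
    shares.append((x - sum(shares)) % q)
    return shares

def reveal(shares):
    """Reconstruct the secret; requires ALL shares."""
    return sum(shares) % q

x_sh = share(123)
assert reveal(x_sh) == 123
# pi_ADD in miniature: addition is local, each party adds its shares of x, y
y_sh = share(456)
z_sh = [(a + b) % q for a, b in zip(x_sh, y_sh)]
assert reveal(z_sh) == 579
```

Note that any proper subset of the shares is uniformly random, which is exactly why inspecting fewer than all of them reveals nothing about the secret.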


Notice that access to any proper subset of JxKq gives no information about x. A shared
secret can only be revealed after gathering all shares. Likewise, the result of a protocol
executing linear transformations over such shares can only be known if all the local results
at each computing party are combined.
Given two sets of shares JxKq , JyKq and a constant α, it is trivial to implement a
protocol like πADD or πMUL in order to locally compute linear functions over the
shares and broadcast local results to securely compute values such as
JzKq = α(JxKq ± JyKq ), JzKq = α ± (JxKq ± JyKq ), and JzKq = α1 (α2 ± (JxKq ± JyKq )). In all
those operations, JzKq is the shared secret result of the protocol. The real value of z can
only be obtained if one has knowledge of the local zi values held by all computing parties,
and performs a last step of computation to obtain z = Σ_{i=1}^{n} zi mod q.

With these two basic building blocks, addition and multiplication, it is possible to
compose protocols that perform virtually any computation. For instance, Protocol
πEq uses πMUL in order to provide the Secure Distributed Equality computation. In
Appendix I, you will find definitions for other additive secret sharing MPC protocols, such
as the Secure Multi-party Inner Product Protocol πIP , the Secure Multi-party Bitwise OR/XOR
Protocol: πMUL
Setup:

1. The Trusted Initializer draws u, v, w uniformly from Zq , such that w = uv, and
distributes shares JuKq , JvKq and JwKq to the protocol parties

2. The TI draws ι uniformly from {1, . . . , n}, and sends the asymmetric bit 1 to party Pι and 0 to
the parties Pi≠ι

Input: Shares JxKq , JyKq


Output: JzKq = JxyKq
Execution:

1. Each party Pi locally computes di ← xi − ui and ei ← yi − vi

2. Parties broadcast di , ei

3. Each party computes d ← Σ_{i=1}^{n} di , e ← Σ_{i=1}^{n} ei

4. The party Pι holding the asymmetric bit computes zι ← wι + dvι + euι + de

5. All other parties Pi≠ι compute zi ← wi + dvi + eui

Protocol 2: Secure Multi-party Multiplication πMUL

πOR|XOR , Secure Multi-party Argmax πargmax and Secure Multi-party Order Comparison
Protocol πDC .

Protocol: πEq
Setup: The setup procedure for πMUL , which pre-distributes the random sharing JvKq
Input: JxKq and JyKq
Output: JzKq = J0Kq if x = y; JzKq ≠ J0Kq , otherwise.
Execution:

1. Locally compute JdKq = JxKq − JyKq .

2. Execute πMUL to compute JzKq = JdKq · JvKq .

3. Output JzKq

Protocol 3: Equality Protocol πEq

All of these protocols are built on the commodity-based model [37]. In this approach,
there is a costly offline phase, led by a Trusted Initializer (TI), that pre-distributes
correlated random numbers. This role can be performed by an independent agent or
by one of the computing parties, without loss of generality or of the security guarantees of the
online phase.
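The offline/online structure of πMUL can be simulated in a short single-process sketch (the two "parties" live in one script, broadcasts become plain variables, and the modulus and inputs are illustrative):

```python
# Single-process simulation of pi_MUL with a Beaver-style commodity triple.
import random

q = 2**31 - 1

def share(x, n=2):
    s = [random.randrange(q) for _ in range(n - 1)]
    s.append((x - sum(s)) % q)
    return s

def reveal(s):
    return sum(s) % q

# Offline phase: the TI samples u, v and w = u*v and distributes shares
u, v = random.randrange(q), random.randrange(q)
u_sh, v_sh, w_sh = share(u), share(v), share(u * v % q)

# Online phase: the parties hold shares of the private inputs x and y
x, y = 1234, 5678
x_sh, y_sh = share(x), share(y)

# Steps 1-2: each party broadcasts d_i = x_i - u_i and e_i = y_i - v_i
d = reveal([(xi - ui) % q for xi, ui in zip(x_sh, u_sh)])
e = reveal([(yi - vi) % q for yi, vi in zip(y_sh, v_sh)])

# Steps 4-5: local computation; the asymmetric-bit holder also adds d*e
z_sh = [(wi + d * vi + e * ui) % q for wi, ui, vi in zip(w_sh, u_sh, v_sh)]
z_sh[0] = (z_sh[0] + d * e) % q
assert reveal(z_sh) == (x * y) % q
```

Revealing d and e leaks nothing, since u and v act as one-time masks for x and y.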

In order to learn from data, a Machine Learning algorithm will usually update a set
β of internal model parameters by iterating a computation over the training set, comparing
the result of a function F (X) = y′ computed over the properties of each element in the sample
with the expected inference output y. This difference, y′ − y, defines the ‘loss’
function, which is used to update the internal parameters with an intensity controlled by a
predefined learning rate.
The most commonly used model, for instance, Linear Regression, consists in the multiplication
of the model parameters β with a matrix X ∈ Zq^{n×k} representing the values of the
features of all the elements in the training set. The learning goal is to find a coefficient
vector β = (β0 β1 . . . βk ) that minimizes the mean squared error

(1/n) Σ_{i=1}^{n} (βxi − yi )²     (2.1)

The coefficient vector that minimizes (2.1) can be computed as

β = (X^T X)^{−1} X^T y     (2.2)

Therefore, to perform a privacy-preserving linear regression, one needs at least a
solution for matrix multiplication and inversion. The matrix multiplication can be built
as a simple series of iterations of the πMUL and πADD protocols. The construction below
represents an accelerated version that significantly reduces communication costs and execution
time. A matrix pseudo-inverse, in turn, can be approximated iteratively, by the Newton-Raphson
method, with a series of matrix multiplications.
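To make this recipe concrete, the sketch below evaluates (2.2) in the clear using only matrix products and additions, approximating the inverse with the Newton-Schulz form of the Newton-Raphson iteration; a privacy-preserving version would run the same sequence of operations over secret shares. The data and the iteration count are illustrative choices:

```python
# Closed-form regression (2.2) built from multiplications and additions only.
def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def newton_inverse(A, iters=25):
    n = len(A)
    I = [[float(i == j) for j in range(n)] for i in range(n)]
    norm1 = max(sum(abs(A[i][j]) for i in range(n)) for j in range(n))
    norminf = max(sum(abs(A[i][j]) for j in range(n)) for i in range(n))
    # X0 = A^T / (||A||_1 * ||A||_inf) guarantees convergence of the iteration
    X = [[A[j][i] / (norm1 * norminf) for j in range(n)] for i in range(n)]
    for _ in range(iters):
        AX = mat_mul(A, X)
        X = mat_mul(X, [[2.0 * I[i][j] - AX[i][j] for j in range(n)]
                        for i in range(n)])     # X <- X (2I - A X)
    return X

# Toy data generated by y = 1 + 2x (first column of X is the intercept term)
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [[1.0], [3.0], [5.0], [7.0]]
Xt = transpose(X)
beta = mat_mul(newton_inverse(mat_mul(Xt, X)), mat_mul(Xt, y))
assert abs(beta[0][0] - 1.0) < 1e-6 and abs(beta[1][0] - 2.0) < 1e-6
```

Each Newton-Schulz step costs two matrix multiplications, which is precisely the operation an accelerated protocol such as πMMUL is designed to provide.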

Protocol: πMMUL
Setup: the TI chooses uniformly random Ax , Aw ∈ Zq^{n1×n2} and Bx , Bw ∈ Zq^{n2×n3} and T ∈
Zq^{n1×n3} , and distributes the values Ax , Bx , and T to party Pi and the values Aw , Bw , and
C = (Ax Bw + Aw Bx − T ) to Pj≠i
Input: JXKq and JW Kq , with party Pi holding the shares Xi , Wi and party Pj the shares Xj , Wj
Output: JXW Kq
Execution:

1. Pi sends (Xi − Ax ) and (Wi − Bx ) to Pj ; Pj sends (Xj − Aw ) and (Wj − Bw ) to Pi .

2. Each party locally computes D = (Xi − Ax ) + (Xj − Aw ) and E = (Wi − Bx ) + (Wj − Bw ).

3. Pi computes Zi = T + Ax Bx + D Bx + Ax E + DE, and Pj computes Zj = C + Aw Bw + D Bw + Aw E .

4. Output JXW Kq = (Zi , Zj ).

Protocol 4: Matrix Multiplication πMMUL

2.2 Homomorphic Cryptography
A cryptosystem is said to be homomorphic if there is a homomorphism between the
domain (the message space M) and the image (the cipher space C) of its encryption function
Enc(m) [12]. A homomorphism is a map from one algebraic structure to another
that preserves its internal operations. So, if there is an internally well defined relation,
or function, in M, fM : M → M, there will be a corresponding function defined in C,
fC : C → C, such that:

∀m ∈ M,
fC (Enc(m)) ≡ Enc(fM (m))

Fully Homomorphic Encryption (FHE) refers to a class of cryptosystems for which the
homomorphism is valid for every function defined in M. That is:

∀fM : M → M,
∃fC : C → C | fC (Enc(m)) ≡ Enc(fM (m))

The most commonly used homomorphic cryptography systems, however, are only par-
tially homomorphic. There are additive homomorphic systems, multiplicative homomor-
phic systems and systems that combine a few homomorphic features. For example, Pail-
lier’s cryptosystem has additive and multiplicative homomorphic features that can be
used to delegate a limited set of computations over a dataset, without compromising its
confidentiality [38, 17].
The underlying primitive in Paillier’s system is the Decisional Composite Residuosity
Problem (DCRP). This problem deals with the intractability of deciding, given n =
pq, where p and q are two unknown large primes, and an arbitrary integer g coprime
to n, whether g is an n-th residue modulo n² ; in other words, whether there exists
y ∈ Z∗n2 such that g ≡ y^n mod n² . In his work, Paillier defines the DCRP and
demonstrates its equivalence (in terms of computing cost) with the Quadratic
Residuosity Problem, which is the foundation of well known cryptosystems, such
as Goldwasser-Micali’s [39].
Paillier’s system can be defined by the three algorithms:

Paillier.KeyGen the key generation algorithm selects two large primes p, q; computes
their product n = pq; uniformly draws g from Z∗n2 ; computes λ = lcm(p − 1, q − 1);
and, finally, computes µ = (L(g^λ mod n²))^{−1} mod n, where L(u) = (u − 1)/n.
The Public Key is ⟨n, g⟩, and the Private Key is ⟨µ, λ⟩;

Paillier.Enc given the public key ⟨n, g⟩ and a message m ∈ Zn , the encryption algorithm
consists in uniformly selecting r from {1, . . . , n − 1}, with gcd(r, n) = 1, and computing
the ciphertext c = g^m r^n mod n² ;

Paillier.Dec the decryption algorithm, in turn, receives the private key ⟨µ, λ⟩ and a
ciphertext c and computes the corresponding message as m = L(c^λ mod n²) · µ
mod n.

This construction harnesses the homomorphism between the groups (Zn , +) and
(Z∗n2 , ·) to render the following features:

Asymmetric cryptography: it is possible to perform homomorphic computations over
the encrypted data using only the public key. Learning the results of the computation,
nevertheless, requires access to the private key;

Additive homomorphism: the multiplication of two ciphertexts equals the ciphertext
of the sum of their respective messages. That is:

Enc(m1 ) · Enc(m2 ) mod n² = Enc(m1 + m2 mod n)

Multiplicative homomorphism: a ciphertext raised to the power of an integer equals the
ciphertext of the multiplication of the original message by that integer. That is:

Enc(m1 )^{m2} mod n² = Enc(m1 · m2 mod n)

Again, it is straightforward to see that, with the basic building blocks of addition
and multiplication, it is possible to compose arbitrarily complex protocols for privacy-
preserving computation using homomorphic encryption.
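The three algorithms and both homomorphic features can be exercised with a toy implementation; the tiny primes below are chosen purely for illustration and make the scheme utterly insecure:

```python
# Toy Paillier cryptosystem -- insecure parameters, for illustration only.
import math
import random

def L(u, n):
    return (u - 1) // n

def keygen(p=61, q=53):
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    g = n + 1                      # a standard, valid choice of g
    mu = pow(L(pow(g, lam, n * n), n), -1, n)
    return (n, g), (lam, mu)

def enc(pk, m):
    n, g = pk
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:     # r must be coprime to n
        r = random.randrange(1, n)
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)

def dec(pk, sk, c):
    n, _ = pk
    lam, mu = sk
    return (L(pow(c, lam, n * n), n) * mu) % n

pk, sk = keygen()
c1, c2 = enc(pk, 42), enc(pk, 100)
# Additive homomorphism: product of ciphertexts decrypts to sum of messages
assert dec(pk, sk, (c1 * c2) % (pk[0] ** 2)) == 142
# Multiplicative feature: c1^3 decrypts to 3 * 42
assert dec(pk, sk, pow(c1, 3, pk[0] ** 2)) == 126
```

The choice g = n + 1 is a common simplification; with it, L(g^λ mod n²) reduces to λ, so µ is just the modular inverse of λ.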

2.3 Comparing complexity and cost


In order to compare the efficiency of these privacy-preserving computation techniques,
researchers usually resort to theoretical complexity analysis, which gives upper-bounding
asymptotic limits on computational cost as the input size tends to infinity [40]. These
limits, derived from the underlying mathematical primitives, often drive design and
implementation decisions on large and complex systems.
Ultimately, they may determine important investments in the research and development
of protocols and applications. Nonetheless, theoretical limits may be very different from
the actual average computational cost and execution times observed when executing the
protocols over small or average sized datasets.

The aforementioned PPML frameworks are proof of the research interest in such methods
and of the large investment put into the development of privacy-preserving computation.
It is important to note that some of the most recent and relevant results are sponsored
by major cloud service providers, especially those with specific Machine Learning related
cloud services, such as Google, IBM and Amazon [31].
Therefore, not only the capabilities of each protocol or framework are of great importance,
but also their economic viability, determined mostly by their computational cost.
Expected execution time and power consumption may drive decisions that have impact
on millions of users and on relevant fields of application. In spite of the importance of
the efficiency of their solutions, authors usually only discuss very briefly the estimated
computational cost or execution times.
Very few works on PPML publish their results with computing times observed against
benchmark datasets/tasks (e.g. classification on the ImageNet or the Iris datasets). When
any general estimate measure is present, it is usually the complexity order O(g(n)), which
gives an asymptotic upper bound on the cost through a function g(n).
The order function, g(n), represents the general behavior or shape of a class of functions.
So, if t(n) is the function defining the computational cost of the given algorithm
over its input size n, then stating that t(n) ∈ O(g(n)) means that there is a constant c
and a large enough value of n after which t(n) is always less than c × g(n).

t(n) ∈ O(g(n)) −→ ∃ c, k | ∀ n ≥ k, t(n) ≤ cg(n)

A protocol is usually regarded as efficient if its order function is at most polynomial.
Sub-exponential or exponential orders, on the other hand, are usually deemed
prohibitively high.
For example, the authors of ABY3 [34] assert that their Linear Regression protocol is
the most efficient in literature, with cost O(B + D) per round, where B is a training batch
size and D is the feature matrix dimension [34]. That may seem like a very well behaved
linear function over n, which would lead us to conclude they devised an exceptionally
efficient protocol with order O(n).
This order expression, however, only gives a bound on the number of operations
to be performed by the algorithm; it does not provide an accurate estimate of
execution time. More importantly, the order function will only bound the actual
cost function for extremely large input sizes. Recall that t(Kn) ∈ O(n), regardless of how
arbitrarily large the constant K may be.
Thus, for small input sizes, the actual cost may be many orders of magnitude higher
than the asymptotic bound. The addition protocol in [41] is also of order O(n), but one

would never assume that a protocol with many matrix multiplications and inversions can
run as fast as the one with a few simple additions.
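The point admits a trivial numeric illustration; both cost functions and their constants below are hypothetical, chosen only to show how the hidden constant dominates:

```python
# Two cost functions with the same order O(n); the hidden constant decides
# which one is practical at realistic input sizes (hypothetical constants).
def t_add(n):             # e.g. a few modular additions per element
    return 3 * n

def t_matrix(n):          # e.g. heavy matrix work per element
    return 1_000_000 * n

# Same asymptotic class, yet a constant gap of several orders of magnitude
assert t_matrix(1000) // t_add(1000) == 333_333
```

Asymptotic notation deliberately erases this factor, which is why empirical runtime estimates remain necessary for protocol selection.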
Probabilistic and ML models are commonly used to estimate cost, execution time and
other statistics over complex systems and algorithms, especially in control or real-time
systems engineering [42, 43].
We propose the use of Monte Carlo methods in order to estimate execution times for
privacy-preserving computations, considering different protocols and input sizes. The
next subsection presents a brief description of the Monte Carlo methods used;
implementation details, the results of the various Monte Carlo experiments we performed,
and the key conclusions and open questions are discussed in the remainder of this document.

2.3.1 Monte Carlo methods for integration


There is another way to estimate execution times. All examples of privacy-preserving
computation found in the literature have one thing in common: their protocols depend heavily
on pseudo-randomly generated numbers, used to mask or encrypt private data.
Those numbers are assumed to be drawn independently according to a specific proba-
bility density function. That is, the algorithms use at least one random variable as input.
Although the observed execution time does not depend directly on the random inputs, it
is directly affected by their magnitude, or more specifically, their average bit-size.
Also, the size of the dataset has a direct impact on execution time. If we consider the
size of the dataset as a random variable, then the interactions between the magnitude of
the random numbers and the numerical representation of the dataset are, unequivocally,
random variables. So, it is safe to assume that protocol runtimes, being a function of
the previous two variables, are also random variables.

Table 2.1: Experiment Type I - Runtimes in Milliseconds

Dataset             Protocol   M      θ̂cli       V̂ar(θ̂cli)   θ̂srv     V̂ar(θ̂srv)
Dow Jones Index     πµ̂HE       1000   1943.94    4.36         13.03    0.001
(750 instances)                5000   1931.49    0.12         12.73    4.77e−05
                    πµ̂MPC      1000   134.35     6.66         0.25     3.15e−05
                               5000   130.62     1.23         0.241    1.04e−06
Bank Marketing      πµ̂HE       1000   11630.08   165.05       24.75    0.006
(4521 instances)               5000   11514.43   24.22        24.44    5.05e−04
                    πµ̂MPC      1000   289.51     6.96         0.89     1.63e−04
                               5000   280.97     1.30         0.82     9.91e−06

Monte Carlo methods are a class of algorithms based on repeated random sampling
that render numerical approximations, or estimations, to a wide range of statistics, as well
as the associated standard error for the empirical average of any function of the parameters
of interest. We know, for example, that if X is a random variable with density f (x), then

the mathematical expectation of the random variable T = t(X) is:
E[t(X)] = ∫_{−∞}^{∞} t(x)f (x) dx.     (2.3)

And, if t(X) is unknown, or the analytic solution for the integral is hard or impossible,
then we can use a Monte Carlo estimation of the expected value. Drawing samples
x1 , . . . , xM from f (x), it can be obtained with:

θ̂ = (1/M ) Σ_{i=1}^{M} t(xi )     (2.4)

In other words, if the probability density function f (x) has support on a set X (that
is, f (x) ≥ 0 ∀x ∈ X and ∫_X f (x) dx = 1), we can estimate the integral

θ = ∫_X t(x)f (x) dx     (2.5)

by sampling M instances from f (x) and computing

θ̂ = (1/M ) Σ_{i=1}^{M} t(xi )     (2.6)

The intuition is that, as M grows, the sample X = {x1 , . . . , xM }, drawn from X , the
support of f (x), becomes closer to X itself. Therefore, the estimate θ̂ will converge to
the expected value θ. This comes from the fact that the sample mean is an unbiased
estimator of the expected value.
We know that, by the Law of Large Numbers, the sample mean θ̂ converges to E[θ̂] = θ
as M → ∞. Therefore, we are assured that, for a sufficiently large M , the error ε = E[θ̂] − θ̂
becomes negligible.
The associated variance of θ̂ is V ar(θ̂) = σ²/M , where σ² is the variance of t(x).
Since we may not know the exact form of t(x), or the corresponding mean and variance,
we can use the following approximation:

V̂ar(θ̂) = (1/M²) Σ_{i=1}^{M} (t(xi ) − θ̂)²     (2.7)

In order to improve the accuracy of our estimation, we can always increase M , the
divisor in the variance expression. That comes, however, with increased computational
cost. We explore this trade-off in our experiments by performing the simulations with
different values of M and then examining the impact of M on the observed sample variance
and on the execution time of the experiment.
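The estimator (2.6) and the variance approximation (2.7) can be exercised with a few lines of Python; the integrand is an illustrative toy whose exact expectation is known:

```python
# Minimal Monte Carlo estimation: approximate E[t(X)] for X ~ Uniform(0, 1)
# with t(x) = x**2, whose exact expectation is 1/3.
import random

def mc_estimate(t, sample, M):
    xs = [sample() for _ in range(M)]
    theta = sum(t(x) for x in xs) / M                  # the estimator (2.6)
    var = sum((t(x) - theta) ** 2 for x in xs) / M**2  # approximation (2.7)
    return theta, var

random.seed(0)
theta, var = mc_estimate(lambda x: x * x, random.random, 50_000)
assert abs(theta - 1 / 3) < 0.01   # close to the exact expectation
assert var < 1e-5                  # the variance shrinks as 1/M
```

Doubling M roughly halves the variance of the estimate, at the price of doubling the sampling cost, which is exactly the trade-off explored in the experiments.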

2.4 A general architecture for Privacy-Preserving
Machine Learning
As discussed above, there are different privacy-enhancing technologies that may serve as
the fundamental blocks upon which to build a privacy-preserving solution, each with its
own implementation complexity and computational cost. The Monte Carlo methods
can be used during the design of private computation solutions in order to select the best
technology for a specific task and data type.
This, however, can be a cumbersome process if you consider that different steps of the
machine learning life cycle, from data collection to inference, will have different algorithms,
data types and data ranges. The process would have to be repeated for each step.
We propose an architecture for Privacy-Preserving Machine Learning that combines
the experience and the model ecosystem of the ML community with the innovative Privacy-
Preserving Machine Learning technologies from the cryptography community. It lessens
the cost of implementing complex solutions and running heavy computations in the first
steps of the ML cycle.
The first and main aspect of the proposed solution is to abstract all the preprocessing,
data wrangling, feature extraction and feature engineering steps of machine learning
using transfer learning. Transfer learning refers to a subset of ML techniques based on
the notion of storing knowledge gained while solving one problem and applying it to a
different but related problem.
There is an unavoidable trade-off between privacy guarantees and the need of experienced
data scientists and machine learning specialists to visualize and understand the data
in order to fine-tune predictive models. Especially for high dimensionality problems,
running task specific heuristics for data wrangling, dimensionality reduction, hyper-parameter
tuning, among other common ML techniques, over encrypted data or over MPC protocols
may be impractical, as computing costs may grow exponentially with large datasets [44].
Transfer learning allows these steps to be abstracted away with the use of pre-trained
models that usually convey knowledge gathered while training on very large datasets.
The most used models are community driven, have proven success on their intended
applications and are available through the APIs of common ML frameworks, such as PyTorch
and TensorFlow.
The proposed architecture consists of two groups or types of components:

1. Data encoder: performs a two-step encoding process: it creates the embeddings
(a dense matrix representation of the private input), and then creates a privacy-
preserving computation representation of the embeddings (MPC shares for MPC
protocols, or encrypted embeddings for HE protocols). There can be many data
encoders cooperating in the private computation;

2. PPML computation parties: The computing parties run the MPC and/or HE
protocols for privacy-preserving model training and inference. There can also be
many computing parties cooperating in the private computation.
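A highly simplified sketch of the Data Encoder's two-step encoding follows; `pretrained_embed` is a stand-in for a real pre-trained model (e.g. BERT or ResNet), and its output values are meaningless placeholders:

```python
# Sketch of the Data Encoder: embed, then secret-share the embedding.
# `pretrained_embed` is a hypothetical stand-in, NOT a real model API.
import random

q = 2**31 - 1  # public modulus for the MPC share representation

def pretrained_embed(text):
    rng = random.Random(sum(text.encode()))  # deterministic stand-in only
    return [rng.randrange(q) for _ in range(4)]

def share_vector(vec, n_parties=3):
    """Step 2: one additive share vector per computing party."""
    shares = [[random.randrange(q) for _ in vec] for _ in range(n_parties - 1)]
    last = [(v - sum(col)) % q for v, col in zip(vec, zip(*shares))]
    return shares + [last]

embedding = pretrained_embed("some private message")   # step 1
shares = share_vector(embedding)                       # step 2
# Each computing party receives one row; the embedding is only recoverable
# by combining all rows modulo q.
recovered = [sum(col) % q for col in zip(*shares)]
assert recovered == embedding
```

In the HE variant of the architecture, step 2 would instead encrypt each embedding coordinate under the service's public key.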

This model renders a general privacy-preserving machine learning solution, since one
can plug in any model for data encoding (such as BERT for text classification, or ResNet
for image classification) and any PPML framework for model training and inference. It can
also be described as a general framework or architecture in the sense that any learning
task (classification, regression, interaction or process control) can be mapped and
implemented as a transfer learning application.
Now, this general architecture can have different implementations. The proposed
implementation, as suggested in Figure 2.1, uses MPC protocols running on 3 computing
parties. We recommend that any MPC implementation should hold 3 or more computing
parties, and that at least one of the computing parties should be hosted by a different
service provider. This condition can be relaxed if the end user, running the Data Encoder
component, also participates as a computing party in the MPC protocols. Also, for ease of
implementation, any of the computing parties can perform the role of Trusted Initializer
for protocol setup.

Figure 2.1: General PPML Architecture

The MPC implementation, without loss of generality, could be described as the com-
bination of three components:

1. Data encoder: the second step in encoding consists in creating the respective MPC
shares for the embeddings;

2. MPC computing parties: The computing parties run the MPC protocols;

3. MPC Trusted Initializer: The TI pre-distributes secret shares for MPC protocol
acceleration;

In this case, the trusted initializer is only a tool for protocol acceleration. Since it is
part of the distributed private computation, it could be listed as just another member
of the computing parties group. It is set apart here just to reinforce the fact that the
TI does not participate in the online phase of the MPC protocols, but only in the initial
setup.
Note that the implementation will determine the security model of the solution. When
implemented with only two computing parties (e.g. data owner and service provider), the
system is secure under an honest-but-curious adversarial model. It means that the data
remains private as long as all parties follow the protocol correctly. When implemented
with more than 2 computing parties, the underlying MPC protocols guarantee an (n − 1)-
private security model. That is, even if n − 1 out of n computing parties collude during
protocol execution, the private information cannot be disclosed.

2.4.1 The need for a general PPML architecture


The first and most important use case of such a PPML architecture is privacy-preserving
inference. It allows users to receive a probable classification or an approximate regression
value without exposing their private data.
In the fake news classification example, users can receive an indication of whether
the content they are reading is false, without exposing their private conversation or the
people interacting with them. This addresses, for instance, the concerns of messaging app
users who want feedback on a message shared in a family group, but are not
comfortable with a fake news classification agency being able to track and record the
exact content shared in that private chat.
For the use case on breast cancer image classification, privacy-preserving inference
would allow a physician to harness the knowledge of a large database of patients and
receive an estimated probability of the existence of cancer in a given exam, without exposing
any private Electronic Health Records (EHR). It addresses the need of family doctors
and general clinic practices to receive information from specialized complex treatment
centers, without having to wait for all the bureaucratic procedures needed in order to
legally transfer the patient’s EHR to the specialized hospital and, later, receive the EHR
back with the specialist’s diagnosis.
It can also be used to design privacy-preserving model training solutions that allow
collaborative and distributed model training without disclosing the participants’ private data.

This allows end users and specialists to contribute to the training of the inference model
used in relevant applications without waiving their right to privacy.
Looking at the fake news problem, PPML model training allows fact-checking
agencies, or groups of scholars and specialists, to provide annotated news articles on a
community relevant topic. The data owners may fear that the service provider has
economical or political incentives to meddle with the specialists’ classification. The privacy-
preserving model training solution guarantees that the model owner has no knowledge of the
provided texts or the classification labels and, thus, cannot negatively impact the quality
or fairness of the dataset and, consequently, of the trained model.
In the context of cancer detection or staging, PPML model training also allows for
collaborative model training, in a way that centers of excellence, such as Brazil’s INCA
and the US’ NCI, could overcome legal boundaries related to health data protection,
work together on more general and powerful predictive models, and serve the international
medical community with better tools to save lives all over the globe.
A general architecture allows for a broader impact, as contributions in one area are
more easily adapted and integrated in a myriad of possible applications. It also allows for
open, community driven, models of development that have a higher chance of surviving
the ‘proof-of-concept’ stage and evolving into really useful technologies with positive social
impact.

3 Text classification

Divide each difficulty into as many parts as is


feasible and necessary to resolve it.

Discourse on Method
René Descartes

Text classification refers to a broad category of techniques used to categorize, or label,
texts based on their content. Most of these techniques include statistical analysis, in general,
and inference modeling, in particular, to infer the correct label with the help of term
statistics. Since words have many different relationships (synonymy, concordance,
generalization, specification, dependency etc.), the correct computation and interpretation of
term statistics demands a great deal of work, not only on data wrangling, exploration and
preparation, but also in understanding the structure and inner workings of the languages
in which the texts are written.

3.1 Natural Language Processing


The development of information retrieval techniques gave rise to many domain specific
text retrieval languages, such as the structured query language (SQL).
Many times, simpler and computationally cheaper solutions outperform deep learning
and large language models in important metrics, such as statistical significance of model
parameters [45].

3.2 Classic NLP preprocessing techniques


3.2.1 Tokenization
Tokenization is a preprocessing technique commonly understood as the first step of any
kind of natural language processing. It is used to identify the atomic units of text
processing. The text, represented as a single sequence of characters, is transformed into a collection

of tokens: words, punctuation marks, emojis, etc. Most NLP software libraries (e.g. nltk,
gensim and CoreNLP) provide multiple tokenization strategies, such as character, sub-
word, word or n-gram. The best granularity or tokenization strategy usually depends on
the application [46].
A common practice is to combine tokenization with sentence splitting. The gensim li-
brary, for instance, will perform tokenization by processing a sentence at a time. CoreNLP,
in turn, adds flags to the tokens that represent the limits of each sentence. Sentence split-
ting is also very important to other NLP methods, such as Part-Of-Speech (PoS) tagging
and Named Entity Recognition (NER).
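As a minimal illustration, a crude word-level tokenizer fits in a single regular expression; this is only a sketch, and libraries like nltk, gensim and CoreNLP implement far more robust strategies:

```python
# Crude regex tokenizer: words (with internal apostrophes) or single
# punctuation marks become tokens. Toy example, not production quality.
import re

def tokenize(text):
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

tokens = tokenize("Don't panic! It's just tokenization.")
assert tokens == ["Don't", "panic", "!", "It's", "just", "tokenization", "."]
```

Character, subword or n-gram tokenization would replace the pattern above with a different segmentation rule, which is precisely the strategy choice the application dictates.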

3.2.2 Stopword removal (SwR)


Stopword removal consists of removing some words from the documents before computing
any statistics. The intuition is to reduce the noise in the dataset and the computational
cost of the subsequent processing steps by removing uninformative words [47, 48].
Common stopword removal algorithms include:

• Pre-compiled dictionary: manually curated stopword lists. The lists may be crafted
for specific contexts, jargons or document corpora;

• Frequency based: use frequency based rules, such as TF-High (removal of terms
with high frequency), TF-1 (removal of terms with a single occurrence), IDF-Low
(removal of terms with low inverse document frequency, i.e. terms that are present
in most documents);

• Mutual-Information: the algorithm computes the Mutual Information (MI) between


a term and a class of documents, removing the terms with low MI with the target
class for classification;

• Term Based Random Sampling (TBRS): uses the Kullback-Leibler divergence between
term frequencies in the corpus and the frequencies measured on randomly
sampled text chunks to identify words with low divergence and, consequently, low
information about any given text class.
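The TF-1 rule above, for example, can be sketched in a few lines over a toy corpus:

```python
# TF-1 stopword removal sketch: drop terms that occur only once in the corpus.
from collections import Counter

docs = [["the", "model", "learns", "the", "data"],
        ["the", "data", "drives", "the", "model"]]
tf = Counter(t for d in docs for t in d)
filtered = [[t for t in d if tf[t] > 1] for d in docs]
# "learns" and "drives" are hapax terms and get removed
assert filtered == [["the", "model", "the", "data"],
                    ["the", "data", "the", "model"]]
```

The other frequency-based rules (TF-High, IDF-Low) only change the threshold test inside the list comprehension.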

3.2.3 Stemming
Stemming is the reduction of variant forms of a word, eliminating inflectional morphemes
such as verbal tense or plural suffixes, in order to provide a common representation, the
root or stem. The intuition is to perform a dimensionality reduction on the dataset,

removing rare morphological word variants, and reduce the risk of bias on word statistics
measured on the documents [49].
Most stemming algorithms only truncate suffixes and do not return the appropriate
term stem or even a valid word in the language of the text. There are different classes of
stemming algorithms, including:
• Dictionary based algorithms: lookup tables with terms and corresponding stems.
Usually restricted to a specific corpus, jargon or knowledge area;

• Fixed truncation methods: truncation of a fixed number of characters, or of a fixed


list of word endings;

• Rule based truncation: terms are truncated iteratively, according to a set of
predefined rules. Algorithms of this class, such as the Lovins and Porter algorithms,
are the most commonly used and are implemented in all the major NLP libraries;

• Inflectional/derivational stochastic algorithms: use stochastic models, such as
Hidden Markov Models and Finite State Automata, in order to compute the probability
of equivalence between two terms, based on the context (surrounding words).
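A naive fixed-truncation stemmer illustrates the general idea; the suffix list below is a hypothetical toy, while real rule-based stemmers such as Porter's apply ordered rule sets in several passes:

```python
# Toy fixed-suffix truncation stemmer -- illustration of the technique only.
SUFFIXES = ("ation", "ing", "ers", "er", "ed", "es", "s")

def crude_stem(word):
    # Try longer suffixes first; keep a minimal stem length of 3 characters
    for suf in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

assert [crude_stem(w) for w in ["learners", "learning", "learned", "learns"]] \
       == ["learn", "learn", "learn", "learn"]
```

As the text notes, such truncation often does not yield a valid word of the language, which is exactly the shortcoming lemmatization addresses.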

3.2.4 Lemmatization
Lemmatization consists of the reduction of each token to a linguistically valid root, or
lemma. The goal, from the statistical perspective, is exactly the same as in stemming:
to reduce the variance in term frequency. It is sometimes compared to a normalization of the
word sample, and aims to provide more accurate transformations than stemming, from
the linguistic perspective [50].
The impact on predictive models, however, will depend on characteristics of the language
or of the document corpus being processed. In highly inflectional languages, such as
Latin and the Romance languages, lemmatization is expected to produce better results
than stemming [51].
The typical lemmatizer implementation requires the creation of a lexicon (dictionary
or wordbook) of valid words and their corresponding lemma [52]. Yet, there are different
classes of algorithms, designed to deal with distinct problems in word normalization and
different languages. Recent works in literature, for instance, use deep neural networks to
produce ‘neuro lemmatizers’ trained for specific tasks [53, 54].
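The typical lexicon-based approach can be sketched as a simple lookup. The tiny lexicon below is hypothetical; real lemmatizers ship dictionaries with tens of thousands of entries:

```python
# Toy lexicon-based lemmatizer: unknown tokens are returned unchanged.
LEXICON = {
    "was": "be", "were": "be", "is": "be",
    "better": "good", "worse": "bad",
    "studies": "study", "studying": "study",
}

def lemmatize(token: str) -> str:
    return LEXICON.get(token.lower(), token)

print([lemmatize(t) for t in ["Studies", "were", "better"]])
# → ['study', 'be', 'good']
```

The example also illustrates the linguistic advantage over stemming: a lemmatizer can map "better" to "good", a transformation no suffix-truncation algorithm can produce.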

3.2.5 Part-of-Speech (PoS) tagging


Part-of-Speech tagging is a processing technique that flags each token with a grammatical
class, taking into account the sentence, or even the context of the sentence, in which it is
found [55, 54]. Most implementations will return multiple tags per token, with syntactic,
lexical, phrasal and other categories.

Figure 3.1: PoS Tagging

PoS tagging helps to differentiate homonyms, words with the same spelling but differ-
ent meanings, and to capture part of the semantic relations between words. Therefore,
many works in the fake news detection literature use PoS tags to engineer new features
(e.g. "noun count", "adjective count", "mean adjectives per noun") that capture concepts
such as ‘style’ or ‘quality’ of the text and improve model accuracy [56, 57].
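Assuming tokens already tagged with, for instance, the Penn Treebank tag set, such style features can be computed in a few lines. This is a sketch; the feature names are illustrative:

```python
# Engineer simple 'style' features from (token, PoS-tag) pairs.
# Penn Treebank prefixes: 'NN*' for nouns, 'JJ*' for adjectives.
def pos_features(tagged):
    nouns = sum(1 for _, tag in tagged if tag.startswith("NN"))
    adjectives = sum(1 for _, tag in tagged if tag.startswith("JJ"))
    return {
        "noun_count": nouns,
        "adjective_count": adjectives,
        "adjectives_per_noun": adjectives / nouns if nouns else 0.0,
    }

tagged = [("Shocking", "JJ"), ("secret", "JJ"), ("report", "NN"), ("leaked", "VBD")]
print(pos_features(tagged))
# → {'noun_count': 1, 'adjective_count': 2, 'adjectives_per_noun': 2.0}
```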

3.2.6 Bag-of-Words (BoW)


A Bag-of-Words is a representation of a document in which the sequence of tokens, a one-
dimensional ordered vector, is replaced by a matrix in which tokens are associated with a
statistic. There are several BoW algorithms, which differ mainly in how they deal with
repetition and ordering of tokens [58].
BoW may also be considered a Topic Modeling technique, although most of the meth-
ods in this class of algorithms produce document representations that associate tokens with
their context and also differentiate homonyms. Among these methods, we can mention
Latent Semantic Indexing (LSI), Probabilistic Latent Semantic Analysis (PLSA) and La-
tent Dirichlet Allocation (LDA) [59].
The BoW algorithm used in most NLP libraries is based on the Vector Space Model
(VSM) and associates each token with the corresponding term frequency: the number
of occurrences of that token in that document. This algorithm produces an unordered set
that does not retain any information on word order or proximity in the document [60].
In order to deal with this loss of information on word order and word-word relationships,
many techniques have been proposed and tested in various NLP tasks. There are, for instance,
Bag-of-N-grams algorithms, in which the basic unit of count is not a single word but a set
of n consecutive words [61].
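Both variants can be sketched in a few lines of plain Python (function names are illustrative):

```python
from collections import Counter

# VSM Bag-of-Words: each document becomes a {token: count} mapping,
# discarding all word-order information.
def bag_of_words(tokens):
    return dict(Counter(tokens))

# Bag-of-N-grams variant: the unit of count is a tuple of n consecutive words,
# which preserves local word order inside each n-gram.
def bag_of_ngrams(tokens, n=2):
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return dict(Counter(grams))

doc = "the news is fake the news is viral".split()
print(bag_of_words(doc))   # {'the': 2, 'news': 2, 'is': 2, 'fake': 1, 'viral': 1}
print(bag_of_ngrams(doc))  # {('the', 'news'): 2, ('news', 'is'): 2, ...}
```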

3.2.7 Term Frequency – Inverse Document Frequency (TF-IDF)


Similar to the Vector Space Model Bag-of-Words, the TF-IDF (sometimes expressed as
TF*IDF) document representation associates each token in a document with a nor-
malized or smoothed term frequency, weighted by the inverse of the frequency at which
the term occurs in the corpus, or list of documents [62]. That is, $f_{t_i,d_j}$, the number
of occurrences of token $t_i$ in document $d_j$, is replaced by tf · idf, where:

\begin{equation}
\mathrm{tf}(t_i, d_j) = 1 + \log \frac{f_{t_i,d_j}}{\sum_{t \in d_j} f_{t,d_j}}, \qquad
\mathrm{idf}(t_i, D) = 1 + \log \frac{|D| + 1}{|\{d \in D : t_i \in d\}| + 1}
\tag{3.1}
\end{equation}

and D is the collection of all documents being processed.


The TF-IDF representation tries to overcome a few limitations of the classic BoW
representation by creating a measure of the relevance of each term for a specific document.
Terms that are frequent throughout the corpus will have lower TF-IDF, while terms that
are particularly common to a class of documents will have higher TF-IDF representation.
There are many results showing how the TF-IDF representation improves fake news classification
models [63, 64].
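The weighting in Equation (3.1) can be implemented directly. This is a sketch: the smoothing constants follow the formulas above, the logarithm base (natural log here) is an assumption, and tf is only defined for tokens present in the document:

```python
import math
from collections import Counter

def tf(token, doc):
    # 1 + log of the term frequency normalized by the document length;
    # assumes the token occurs at least once in doc.
    counts = Counter(doc)
    return 1 + math.log(counts[token] / len(doc))

def idf(token, docs):
    df = sum(1 for d in docs if token in d)  # document frequency
    return 1 + math.log((len(docs) + 1) / (df + 1))

docs = [["fake", "news", "alert"], ["news", "report"], ["weather", "report"]]
weight = tf("fake", docs[0]) * idf("fake", docs)
```

As expected from the discussion above, a term frequent throughout the corpus ("report") receives a lower idf than a term concentrated in one document ("fake").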

3.2.8 Continuous bag-of-words (CBoW)


The expression Continuous Bag-of-Words (CBoW) is frequently used to refer to a word
embedding representation, based on a language model pre-trained on word/word co-
occurrence probabilities. The CBoW model, as originally proposed, is a neural network
whose weights represent the conditional probability of occurrence of a word, given a con-
text, that is, a window of n words that precede it and the n following words [65].
The Word2vec algorithm, proposed by Tomas Mikolov et al. [65] at Google in 2013, is
still the most widely used CBoW implementation. It is trained in conjunction with the skip-gram
model, as seen in Figure 3.2, forming a kind of ‘auto-encoder’. The skip-gram model, as a
reverse step, gives the conditional probability for a set of preceding and following words,
given a specific term.

One of the advantages of word embeddings, such as CBoW, is the fixed-length, dense
representation, which usually allows for more efficient computations. It also captures
some of the semantic relationships between words, based on their co-occurrence probability.

Figure 3.2: Continuous Bag-of-Words & Skip-gram auto-encoder

CBoW has also been used to achieve good results in fake news detection [66, 67] and
is possibly the most advanced preprocessing or feature engineering technique that can
be used on top of MPC protocols in order to produce a privacy-preserving fake news
classification model.
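The context windows that feed a CBoW model can be illustrated as follows. This is a sketch of the data preparation step only, not of the network itself:

```python
# For each position, pair the surrounding window of n words (the context)
# with the center word (the target the CBoW network learns to predict).
def cbow_pairs(tokens, n=2):
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - n):i] + tokens[i + 1:i + 1 + n]
        pairs.append((context, target))
    return pairs

pairs = cbow_pairs("the quick brown fox jumps".split(), n=2)
print(pairs[2])
# → (['the', 'quick', 'fox', 'jumps'], 'brown')
```

The skip-gram model inverts each pair: it predicts the context words given the target term.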

3.3 The State-of-the-Art: Transformers


A transformer is a model whose parameters are trained to apprehend context and meaning
of elements in a sequence by tracing relationships between them. In fact, transformers are
usually implemented as a composition of deep neural networks, with different architectures
and contributions to the overall capability of the transformer model.
The transformer architecture creates an encoder-decoder structure in which the en-
coder receives a raw input sequence and outputs a sequence of internal transformer sym-
bols or representations. The representation of one element in the sequence is presented
to the decoder, along with the representation of the previous element. The decoder then
outputs a vector of probabilities predicting the next element in the sequence [68].

From this basic transformer architecture, even more complex models have been derived,
such as the Bidirectional Encoder Representations from Transformers (BERT), which uses a
variable number of encoders and bi-directional self-attention heads [69]. The model was
trained on two tasks: language modeling, in which the model predicts missing words in
the context; and next sentence prediction, in which the model predicts the next sentence.
As a result, BERT embeddings convey contextual meaning for words. BERT has virtually
become a baseline for any new advancements in language models and NLP in general [70].

3.3.1 Trade-offs and applications


When comparing the two approaches, the use of classic NLP preprocessing techniques or
the use of word embeddings and large transformer models, at least two criteria must be
considered: computational cost and effective impact on the given task’s performance.
Some may argue that large transformer models, such as GPT-3, are still restricted to
research and inaccessible for everyday tasks. There are, however, a few libraries that al-
ready give access to well-tested transformer models, such as BERT, RoBERTa, DistilBERT,
XLNet and even GPT-2, with high-level APIs for both Natural Language Understanding
(such as classification) and Natural Language Generation tasks.
It is also possible to fine-tune these models, training the last layers on a specific set
of texts in order to improve classification results. There are several libraries, such as
sentence_transformers [71] from the Ubiquitous Knowledge Processing (UKP) Lab at the
Technical University of Darmstadt, that offer a high level API for fine-tuning BERT,
RoBERTa and other transformer models.
This fine-tuning comes, evidently, at a high computational cost and may introduce
more complexity to the deployment of such models in real-life applications, since the
‘home made’ model is not as easily integrated into common NLP and transformer libraries as
the original models.

4 Privacy-preserving fake news
detection

In order to seek truth, it is necessary once in the
course of our life to doubt, as far as possible, of
all things.

Principia Philosophiae
René Descartes

Fake news are texts, possibly distributed alongside other media formats, that present
false, incorrect or inaccurate information and are shared over digital platforms, such as
social networks, messaging apps or news websites [72]. The main characteristics that dif-
ferentiate fake news from concepts such as gossip, hoaxes and other forms of misinformation
are:

1. They are formatted and presented as legitimate news, usually as a form of ’self-
validation’, with the intent to manipulate the audience’s cognitive processes;

2. They have faster and broader propagation patterns, partly due to the context of
instant and pervasive communication of digital platforms;

3. They have greater impact on the audience’s social behavior, also partly due to the
business model of the digital platforms based on engagement or “attention reten-
tion”.

These platforms are designed to retain their audience with algorithms that filter
and sort the content displayed to each user based on their preferences and attention
patterns. The algorithms are so effective in retaining users’ attention that a growing
number of people now suffer from addiction to social media [73].
With users spending an ever-increasing amount of time on their favorite platforms,
social media companies have amassed huge databases on user profiling and segmentation and
have become, arguably, some of the most effective mass communication tools. They monetize
their databases by serving targeted advertising and by charging companies that consume their
Application Programming Interfaces (APIs) to interact with their users [25].
This business model is threatened by the misuse of the platforms for the spread
of illegitimate and false content. Digital platforms have already been maliciously used,
for example, to manipulate opinions on extremely relevant issues, such as presidential
elections in France and the United States [74].
Identifying and clearly flagging fake news may help users exercise better judgment
on the content they consume and lessen its negative effects [75]. Research in the area,
nevertheless, faces a few particular challenges. First, there is the difficulty, from the
technological perspective, of delimiting fake news and distinguishing them from other forms
of propaganda. Also, the relevance of the topic in the political arena elevates the risk of
bias and partisan interference. For instance, in a dataset with over 200 citations, the
presence of a single term yields 99% accuracy in labeling a text as true news [76]. Finally,
there are very few good public datasets, and even fewer NLP resources (dictionaries, word
embedding models, language models, etc.) in languages other than English [57].
Another important issue in the area is how to balance the need to detect and appro-
priately handle fake news with the equally important need to guarantee end users’ privacy.
This concern with users’ privacy has led to the development of many Privacy-Preserving
Machine Learning (PPML) techniques [30, 34, 35]. There are already many classic Machine
Learning (ML) algorithms, such as Logistic Regression, Decision Trees and Support Vector
Machines, implemented on top of Secure Multi-party Computation (MPC) protocols [8].

4.1 Detection approaches


4.1.1 Source based detection
One of the main trends in the recent literature is the use of graph-based and ML models, such
as Logistic Regression, Random Forest and Support Vector Machines, to identify probable
sources of fake news. The predictive models are trained on user profile information and
metadata for accounts that have been manually flagged as the source of a certain number
of fake news posts [77, 78, 79].
Works in this area usually build models that are specific to each platform, using
metadata relevant to that particular context. For instance, retweet counts and thread
size are relevant on Twitter, but may not have a similar meaning on other platforms. In
many cases, fake news detection is associated with the problem of detecting autonomous
software agents (usually referred to as "bots") [25].

This detection approach has shown many positive results. Nevertheless, the use of
personal information raises concerns about user privacy. The problem is aggravated if the
proposed solution involves the transfer of users’ personal information to some sort of third
party or government agency responsible for the detection or repression of misinformation.

4.1.2 Fact checking


A few works propose solutions that are based on complex conversational models that
query the topics identified in the text against a database of checked facts [80, 81, 82].
Conversational models are Deep Neural Networks trained to react to a text, or a spe-
cific query, according to a database of ⟨input, response⟩ pairs. First, the model classifies
the input with multiple labels, each indicating a topic or knowledge area. Then, the
model selects the responses that have the highest probability of appropriately responding to
that query. Some models use a fixed knowledge base, or even a fixed list of responses.
Others search the web, performing a second round of classification to select the prob-
able correct response [83].
Research in this area has led to the creation of curated databases of checked facts.
Some are maintained by multidisciplinary research groups. Most of these databases,
however, are created and curated by fact-checking agencies and news companies [84].
Here, there is a risk that heightened partisanship might interfere with the quality of
these databases. This is especially true in a politically polarized environment, in which
agencies aligned with the different political forces are mutually accused of publishing
fake news [85]. The dataset mentioned in the introduction, for example, flags every text
published by a single news outlet as true and all the others as fake [76].

4.1.3 Natural Language Processing (NLP)


The majority of the works on fake news detection are based on the notion that it is possible
to identify language traits particular to fake news [86, 56]. Many works in the
recent literature still use classic ML and NLP preprocessing techniques [66, 64]. The
major trend, however, is the use of complex DL algorithms and large Language Models, such
as GPT-3 and BERT [87, 69, 88, 89, 90].
Some of the most recent works use adversarial deep reinforcement learning algorithms
to deal with ‘Neural Fake News’: fake news generated by large language models [91, 92, 93].
We could argue that this line of research represents the state-of-the-art in fake news detection.
Nonetheless, complex solutions involving deep neural networks or large language models
are out of our scope, since the computational cost and implementation complexity of
such algorithms on top of PPML protocols are arguably prohibitive.

We already have a few classic ML algorithms, such as Support Vector Machines,
Decision Trees and Logistic Regression, running in the PPML setting with MPC protocols
[8]. The current research effort is focused on the identification of preprocessing or feature
engineering methods that improve model accuracy and that can be ported to the PPML
setting.

4.2 Experiments
Our experiments were designed as a proof-of-concept of the general privacy-preserving ma-
chine learning framework proposed in Chapter 2. They cover the two basic scenarios of ap-
plication of the framework. We implemented the PPML model training and the PPML news
classification features with the additive secret sharing protocols available in Meta’s CrypTen,
an open-source research tool for PPML [31].
CrypTen extends the PyTorch library API with tensor-based implementations of secret
sharing protocols. We decided to use this framework to facilitate our experiments, since
it extends a well-known library and allows us to use peer-reviewed neural network
architectures found in the literature with very few changes to the code. CrypTen also allows
for private inference with an encrypted PyTorch model trained on clear-text data.
CrypTen also helped us to simplify the implementation design, since it automatically
instantiates all the MPC protocol requirements, such as the pre-distributed multiplication
secret shares provided by the Trusted Initializer, for every computing party. It also pro-
visions the communication channels and secret share distribution between data providers
and computing parties, so data can be seamlessly loaded from any participating node and
used in computations involving any number of parties.
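The additive secret sharing underlying these protocols can be illustrated in a few lines. This is a didactic sketch over a prime field, far removed from CrypTen's fixed-point tensor implementation:

```python
import random

P = 2**61 - 1  # a prime modulus; all arithmetic is done in Z_P

def share(secret, n_parties=3):
    # n-1 random shares plus one correcting share: they sum to the secret mod P,
    # and any subset of fewer than n shares reveals nothing about the secret.
    shares = [random.randrange(P) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

def reconstruct(shares):
    return sum(shares) % P

# Secure addition: each party adds its two local shares; no party ever
# sees the other inputs, yet the reconstructed result is the true sum.
x_shares, y_shares = share(42), share(100)
z_shares = [(x + y) % P for x, y in zip(x_shares, y_shares)]
print(reconstruct(z_shares))
# → 142
```

Multiplication is the step that requires the pre-distributed correlated randomness mentioned above, since it cannot be computed locally on additive shares.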

4.2.1 Selected datasets


We selected two datasets in English and two in Portuguese. Each pair has a dataset with
full-length news articles and a dataset composed of short statements. The purpose of
experimenting with different languages and text sizes was to observe how these variables
may impact preprocessing and training cost and, ultimately, model performance.
The selected datasets are:

• FactCk.br: 1313 claims by Brazilian politicians, manually annotated by fact-check-
ing agencies¹ as ‘true’, ‘false’, ‘imprecise’ and ‘others’. This dataset is imbalanced,
with 22% of the texts labeled as true [81];
¹ https://piaui.folha.uol.com.br/lupa, https://www.aosfatos.org and https://apublica.org

• Fake.br: 7200 full-length news articles, with text and metadata, manually labeled
as real or fake news [57];

• Liar: curated by the UC Santa Barbara NLP Group, contains 12791 claims by
North-American politicians and celebrities, classified as ‘true’, ‘mostly-true’, ‘half-
true’, ‘barely-true’, ‘false’ and ‘pants-on-fire’ [94];

• Source Based Fake News Classification (sbnc): 2020 full-length news articles, manu-
ally labeled as real or fake [95].

In order to make the results comparable across these datasets with different classification
cardinalities, we created mapping functions attributing binary labels (fake/true)
to the datasets annotated with more classes. For the Liar dataset, we map the class
‘mostly-true’ as true, while the classes ‘half-true’, ‘barely-true’ and ‘pants-on-fire’ are
mapped as fake news. The FactCk.br dataset has over 18 classes, as it aggregates the classifi-
cation systems of different fact-checking agencies. Their exact mapping is documented
in the Jupyter notebooks published with our experiments.
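The Liar mapping described above can be expressed as a simple lookup (the FactCk.br mapping, with its many agency-specific classes, is documented in the notebooks and omitted here):

```python
# Map Liar's six-way labels onto the binary fake/true scheme.
LIAR_TO_BINARY = {
    "true": "true",
    "mostly-true": "true",
    "half-true": "fake",
    "barely-true": "fake",
    "false": "fake",
    "pants-on-fire": "fake",
}

def map_liar_label(label):
    return LIAR_TO_BINARY[label.lower()]

print(map_liar_label("mostly-true"))  # → true
```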

4.2.2 Clear-text training and inference


In order to establish benchmark performance measures in the ‘clear-text’ setting, we
ran the pipeline detailed in [25] for model tuning, selection and testing, with different
combinations of NLP preprocessing techniques, as shown in Table 4.1.
The pipeline uses k-fold validation and random search for hyper-parameter tuning on
Naive Bayes, Decision Tree, K-Nearest Neighbors, Logistic Regression, Support Vector
Machines, Random Forest and XGBoost GBDT classifiers. We have also trained a few
convolutional (CNN) and deep feed-forward (FNN) networks using these embeddings in
the clear-text setting.
For the DistilBERT and Sentence-BERT experiments, we have used the pre-built
multilingual models from [96] in order to encode our datasets with the corresponding
embeddings, before submitting them to the pipeline.

Table 4.1: List of NLP preprocessing experiments in clear-text


E01: BoW E07: SwR + TF-IDF
E02: SwR + BoW E08: Stemming + TF-IDF
E03: Stemming + BoW E09: Lemmatization + TF-IDF
E04: Lemmatization + BoW E10: SwR + Lemm. + TF-IDF
E05: SwR + Lemm. + BoW E11: DistilBERT embeddings
E06: TF-IDF E12: Sentence-BERT embeddings

For hyper-parameter search, we decided to use the ROC AUC metric to compare and se-
lect the best models, as it gives better information on model performance in the presence of
class imbalance. After model selection, we recorded ROC AUC, F1-score and accuracy metrics
on the test set for the model selected at the end of each experiment. The best combination
of preprocessing techniques and classifier algorithm, measured by the accuracy on the test
set for each dataset, is presented in Table 4.2. Runtimes are in seconds.
We also trained and tested the convolutional neural network from [97] as a benchmark
for our results. Our neural networks outperformed their model in accuracy, F1 score and
ROC AUC for all datasets. For the next step, however, we selected only the best neural
network of each architecture for the experiments in the privacy-preserving setting and
compare their training runtimes in Table 4.3.

Table 4.2: Best accuracy on clear-text setting


Language Portuguese English
Dataset factck.br fake.br liar sbnc
Preprocessing SwR, Lemm., BoW Lemm., BoW Sentence-BERT SwR, TF-IDF
Model Logistic Regression XGBoost Random Forest XGBoost
Accuracy 83.33 98.69 63.07 81.73
F1 score 77.07 98.26 60.0 78.26
Runtime 0.11 56 2529 31

Contrary to the initial expectations, however, results show that in most of our clear-
text experiments, word embeddings or sentence embeddings from large language models
did not outperform traditional NLP preprocessing techniques. Also, classic ML models
outperformed simpler deep learning models. Particularly, tree-based models, Random
Forest and GBDT, presented the best performance for most datasets.

4.2.3 Privacy-preserving model training


For privacy-preserving model training and inference, we set up three virtual machines on
the cloud. We tested our code using Amazon AWS EC2 and Google Cloud Compute
Engine instances, with similar results. Table 4.3 presents a comparison of the training cost of
the same neural networks in the clear-text and privacy-preserving settings.

Table 4.3: Neural networks training runtime

                                FNN                       CNN
Dataset     Embeddings          Clear-text   PPML         Clear-text   PPML
factck.br   DistilBERT          2.91         167.48       9.04         1503.27
factck.br   Sentence-BERT       3.18         168.06       9.01         1511.90
fake.br     DistilBERT          16.62        948.499      46.64        8182.08
fake.br     Sentence-BERT       16.34        903.07       45.53        8536.39
liar        DistilBERT          28.10        1619.57      78.99        14543.13
liar        Sentence-BERT       30.64        1621.22      80.22        14577.85
sbnc        DistilBERT          4.64         256.96       13.61        2276.23
sbnc        Sentence-BERT       4.73         268.62       13.59        2293.95

It is important to note that, as stated above, we used only simple feed-forward and
convolutional neural network architectures. This choice is due to CrypTen’s limited im-
plementation of PyTorch modules. It does not implement modules such as Recurrent Neural
Networks (RNN) or Long Short-Term Memory networks (LSTM), which have been shown
in the NLP literature to provide better results than simpler network architectures [98].
On the first machine, named ‘alice’, we store the trained model. The training set,
comprising embeddings and corresponding labels, is stored on the second machine, named
‘bob’. The third participant, ‘charlie’, holds the validation set. At the end of the MPC
computation, the accuracy score on the validation set is known to all computing parties,
but only alice has knowledge of the trained model’s weights.
The results in Table 4.3 show that training times are, on average, one order of magni-
tude higher in the privacy-preserving setting. That is a reasonable cost, considering the
advantage of preserving both the privacy of the participants’ input texts and of the service
provider’s trained model.

4.2.4 Privacy-preserving inference


For the privacy-preserving inference experiment, we used the aforementioned VMs.
This time, alice holds the private model, already trained, while bob and charlie each hold
half of the test set. In this particular test scenario, we also allow bob and charlie to hold the
original labels, in order to compare them with the predicted values produced with alice’s
model and compute the accuracy, F1 and ROC-AUC scores.
The result of the joint computation, the predicted classification, is available to all
three computing parties, even though alice’s model and the other parties’ texts remain
private to their respective owners. Table 4.4 shows the observed metrics for our models.

Table 4.4: Best accuracy on Privacy-Preserving setting

Dataset     Embedding        Model   Accuracy   F1-Score   ROC-AUC   Runtime
factck.br   DistilBERT       FNN     78.32      87.84      50.00     1.62
factck.br   Sentence-BERT    CNN     78.32      87.84      60.00     175.28
fake.br     DistilBERT       FNN     80.83      80.72      80.83     5.28
fake.br     Sentence-BERT    CNN     81.04      79.11      81.04     793.78
liar        DistilBERT       FNN     58.73      48.73      57.24     9.06
liar        Sentence-BERT    CNN     61.54      51.71      59.99     1414.53
sbnc        DistilBERT       FNN     67.82      74.10      65.61     1.66
sbnc        Sentence-BERT    FNN     72.52      77.11      71.44     1.60

For the sake of reproducibility, we created a docker image with all required pack-
ages and made it publicly available on Docker Hub (https://dockr.ly/3ED3S1D). The
experiments are extensively documented in Jupyter notebooks at a public git repository
(https://bit.ly/3BwhfPn). We must point out, however, that exact reproducibility
cannot be guaranteed across PyTorch and CrypTen non-deterministic algorithms. Also,
the cuDNN library, used for CUDA convolution operations, can be a source of non-deter-
minism across multiple executions.

4.3 Findings
These experiments demonstrate that it is possible to train and query predictive models
using Secure Multi-party Computation protocols in order to detect fake news in a privacy-
preserving way.
We observed that the models trained with the short-text datasets had lower perfor-
mance on all metrics, across all experiments. This indicates that NLP models require a larger
word sample in order to appropriately approximate the underlying statistics represented
by the trained parameters. We also noticed a higher runtime for the Random Forest model
trained over Sentence-BERT embeddings, as seen in Table 4.2 for the liar dataset. This
indicates that large language models may provide better results, but may also introduce
higher computing costs at all stages of the machine learning pipeline: from preprocessing
to inference.
Concerning the selected language models, we observed that Sentence-BERT achieved
better performance than DistilBERT for most of the datasets. Moreover, the accuracy
scores for the datasets in Portuguese were higher than those for the English datasets,
even though these BERT-based models were trained mostly on English texts. In terms
of runtime, in the worst case, the inference for each text takes about half a second. On
average, however, each text takes about 4 milliseconds, which is reasonable for a solution
with good privacy guarantees.
We found that the performance of privacy-preserving inference with a fake news clas-
sification model, measured in terms of runtime, accuracy and other classification
metrics, is very close to that of a model queried in the clear-text setting. This indicates
that introducing MPC protocols does not reduce the predictive power or usability
of fake news detection models.
It is arguable that, despite having a visibly higher cost when compared with clear-text
training, privacy-preserving model training can still be considered viable for the specific
cases where a group of parties needs to cooperate in training without trusting the model
owner.
We also found that, for the datasets at hand, using large multilingual language
models did not significantly improve the overall pipeline performance when compared
to well-established NLP techniques such as Lemmatization, Stop-Word Removal and
TF-IDF. In future investigations, we may test the effect of other preprocessing
techniques, such as word counts by part-of-speech, and other feature extraction and
engineering methods commonly found in the NLP literature.

5 Privacy-preserving breast cancer
classification

Each problem that I solved became a rule which
served afterwards to solve other problems.

Discourse on Method
René Descartes

Breast cancer is the most prevalent form of cancer, even when considering both sexes,
and causes over 600,000 deaths per year [99]. Despite all the visibility and awareness around
this disease, and all the health policies put into practice, its incidence has increased, on
average, by 0.5% annually [100].
It is straightforward to understand that any knowledge or technology developed on
this subject that could contribute to detection, or support the best possible treatment,
has a very relevant impact on society.
For this reason, research teams at the Brazilian National Cancer Institute (INCA) are
always developing and testing new ways to achieve the earliest possible detection of breast
cancer, and to effectively follow up the progression of the disease in order to provide the
best information for its treatment.
There is ongoing research seeking to use INCA’s large dataset of high-resolution
scans of breast cancer histologic sections to build inference models. The project also
involves the use of tabular data extracted from patients’ Electronic Health Records (EHR),
in order to build multi-modal inference models that will help identify the disease and
indicate its probable progression and best treatment course.
There is, however, a significant cost associated with transferring a patient’s EHR from
their original practice. In most cases, it also requires the patient to travel from their
home state to Rio de Janeiro, where INCA’s hospitals are located. Sometimes, the final
diagnosis is that of a benign tumor, or of a carcinoma in an initial stage, that can be
treated at the patient’s original location.
Thus, a solution that reduced the risk of false positives, and the costs associated with
the transfer of patients who could have stayed home, would improve the overall
cancer-related health policy, reducing costs both for the public health system
and for the patient. Also, starting the treatment as soon as possible, at their original
location, would improve their chances of success and remission.
This is where a privacy-preserving solution plays a key role: the legal apparatus around
EHRs restricts the sharing of patients’ data, even in situations where two hospitals col-
laborate on a diagnosis. A solution that allowed such a collaboration in a way
that adheres to the EHR laws and standards could therefore have an enormous impact.
That is the motivation for the choice of the second use case of our architecture for
privacy-preserving machine learning. This chapter presents a short literature review on
breast cancer classification. It also covers a few concepts in computer vision, transfer
learning for computer vision and image models. Finally, it discusses the design and
motivation of the experiments carried out with INCA’s researchers as a second proof-of-
concept for our PPML architecture.

5.1 Breast cancer classification and staging


There are different ways to classify breast cancer tumors in terms of histologic type,
location, development stage, extension and general morphological characteristics of the
affected cells. Different classifications are used for different decision processes in the course
of treatment [101].
In the histological classification, three histologic features are examined and each is
assigned a score to determine the histologic grade. The scores are added together, and the
resulting sum, between 3 and 9, is mapped to a grade of 1, 2 or 3 that can appear in
a pathology report. Sometimes the terms well-differentiated, moderately-differentiated
and poorly-differentiated are used to describe the grade, rather than numbers:
• Grade 1, or well-differentiated, is associated with scores 3, 4 or 5. Cancerous cells
are growing slowly and look more like normal breast tissue;

• Grade 2, or moderately-differentiated, is related to scores 6 or 7. The cells have
characteristics between grades 1 and 3;

• Grade 3, or poorly-differentiated, is associated with scores 8 or 9. The cells do not
have normal characteristics and tend to grow and spread more aggressively.
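The score-to-grade mapping above amounts to a small lookup, which can be sketched as follows (an illustrative sketch of the published scoring rule, not a clinical tool):

```python
# Histologic grade from the sum of the three feature scores (3..9).
def histologic_grade(score_sum):
    if not 3 <= score_sum <= 9:
        raise ValueError("score sum must be between 3 and 9")
    if score_sum <= 5:
        return 1, "well-differentiated"
    if score_sum <= 7:
        return 2, "moderately-differentiated"
    return 3, "poorly-differentiated"

print(histologic_grade(4))  # → (1, 'well-differentiated')
```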
Disease type classification

• Phyllodes tumor: a very rare type of breast tumor, which develops in the
connective tissue (stroma), in contrast to carcinomas, which develop in the ducts or
lobules. Most are benign, but some are malignant (cancer).

The T categories for breast cancer are:

TX: Primary tumor cannot be assessed.

T0: No evidence of primary tumor.

Tis: Carcinoma in situ (DCIS, or Paget disease of the breast with no associated tumor
mass).

T1: (includes T1a, T1b, and T1c) Tumor is 2 cm (3/4 of an inch) or less across.

T2: Tumor is more than 2 cm but not more than 5 cm (2 inches) across.

T3: Tumor is more than 5 cm across.

T4: (includes T4a, T4b, T4c, and T4d) Tumor of any size growing into the chest wall
or skin. This includes inflammatory breast cancer.
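The size-based part of this mapping can be sketched as a hypothetical helper function. Categories TX, Tis, and T4 require clinical information beyond tumor size, so this simplified sketch covers only T0 through T3:

```python
def t_category(size_cm: float) -> str:
    """Simplified T staging from tumor size alone (TX, Tis, T4 need more info)."""
    if size_cm < 0:
        raise ValueError("tumor size must be non-negative")
    if size_cm == 0:
        return "T0"   # no evidence of primary tumor
    if size_cm <= 2:
        return "T1"   # 2 cm or less across
    if size_cm <= 5:
        return "T2"   # more than 2 cm but not more than 5 cm
    return "T3"       # more than 5 cm across
```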

5.2 Transfer learning in computer vision


Computer vision is the area of study concerned with enabling machines to solve
vision-related problems, or, in other words, to process and react adequately to images.
Its many applications and techniques vary in their approach to vision, whether modeling
biological or physical visual perception or focusing purely on efficient multidimensional
data processing.
Computer vision, as the art and science of making machines perceive, has applications
including event detection, information organization and retrieval, object or experience
modeling, interaction, and process control, such as moving industrial robots or
autonomous vehicles.
As discussed before, transfer learning is the set of machine learning techniques related
to the use of knowledge acquired in solving one task to solve a similar or related task.
For instance, a model trained to differentiate the 1,000 object classes in ImageNet [102]
can be used to help label a smaller, two-class dataset.

5.2.1 Image models


Transfer learning has received great contributions from the computer vision community,
especially in the area of image models and the extensive literature on their design and
application. These models are usually deep artificial neural networks trained to recognize
specific patterns, or concepts, in the processed input (image or video). Image models can
be trained to associate an input with one or more concepts (or labels), usually giving a
probability of the presence of each labeled concept in the processed input [103].
ResNet, VGG, and AlexNet, among others, are some of the earliest deep image models
and have been applied to problems ranging from face recognition to Covid-19 diagnosis
[104]. The first ResNet variants are no longer the state of the art in image models;
however, they still serve as baselines for developments in the area [105].

5.2.2 Image embeddings


An image embedding is a numeric representation of an image that conveys previously
acquired knowledge on computer vision tasks. In other words, it is a dense vector
representation of the image computed with the parameters of a previously trained
model, which can be used for many tasks, such as classification. A convolutional neural
network (CNN), such as ResNet50, can be used to create the image embedding: one
submits the image as input and takes the activated output values of the last layer
of the network.
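The mechanics can be sketched with a toy fixed-weight network in NumPy. A real pipeline would load a pretrained backbone such as ResNet50 and read the activations of its last pre-classifier layer; here two random projections stand in for the frozen backbone, and the dimensions are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "pretrained" network: fixed random weights stand in for a backbone
# such as ResNet50; the embedding is the activated output of the last layer.
W1 = rng.standard_normal((2048, 64))   # flattened image -> hidden layer
W2 = rng.standard_normal((64, 16))     # hidden layer -> 16-dim embedding

def embed(image: np.ndarray) -> np.ndarray:
    x = image.reshape(-1).astype(np.float64)  # flatten pixels into a vector
    h = np.maximum(x @ W1, 0.0)               # hidden layer with ReLU
    return np.maximum(h @ W2, 0.0)            # last-layer activations = embedding

img = rng.random((32, 64))                    # fake 32x64 grayscale "image"
e = embed(img)
print(e.shape)
```

The embedding `e` is the dense vector that downstream classifiers consume instead of raw pixels.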

5.2.3 Vision transformers


The computer vision community started researching the applicability of Transformer
models to vision problems not only because of their astounding performance on NLP
tasks, but also because Transformers have several advantages over recurrent networks,
such as a large short-term memory, the ability to model long-range dependencies between
parts of the input sequence, and support for parallel processing of the sequence.
Transformers, in turn, can be understood as set functions, requiring only minor inductive
biases and achieving high performance on tasks such as classification and segmentation.
Their simple architecture also makes it possible to handle multiple data domains, such
as photos, videos, text, and audio, with the same processing blocks, and exhibits great
scalability to very large neural networks trained on even larger datasets [106].
Transformer networks have made great progress on a variety of vision tasks as a result
of these strengths. The main characteristics of Transformer models are self-attention,
extensive pre-training, and bidirectional feature encoding. Most works applying
Transformers to vision address well-known recognition tasks, such as image
classification, object detection, action recognition, and segmentation. Transformers are
also used in generative modeling, multi-modal tasks (such as visual question answering,
visual reasoning, and visual grounding), video processing (such as activity recognition
and video forecasting), low-level vision (such as image super-resolution, image
enhancement, and colorization), and three-dimensional analysis (e.g., point cloud
classification and segmentation) [107].
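The self-attention mechanism at the core of these models can be sketched in a few lines of NumPy. This is a single-head, parameter-fixed illustration; the projection matrices and patch-sequence shape are hypothetical stand-ins for learned weights:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence of patches."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # pairwise patch affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # each patch attends to all patches

rng = np.random.default_rng(1)
n_patches, d = 9, 8                                  # e.g. a 3x3 grid of image patches
X = rng.standard_normal((n_patches, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)
```

Every output row is a mixture of all input patches, which is exactly the long-range dependency modeling that recurrent networks struggle to provide.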

5.3 Privacy-preserving cancer classification


As discussed above, privacy-preserving solutions for breast cancer image classification
have many relevant applications. There are many approaches to privacy, ranging from
face obfuscation to private computation on medical imaging for automated cancer
detection [108, 109, 110].
There is a relevant topic, however, not clearly covered in recent literature: the
collaborative, joint computation of predictive models and inference on those models. We
want to demonstrate that the proposed PPML architecture allows for multi-institution
collaborative model training for cancer classification.

5.4 Experiments
5.4.1 Datasets
The experiments use two datasets:

• A dataset of 50 scans of histological sections, created by the Brazilian National
Cancer Institute (INCA). The scans were created in the SVS format, a high-resolution
tiled TIFF with embedded text annotations. They are classified into six classes: benign,
‘in situ’ breast carcinoma, lobular invasive breast carcinoma, non-special invasive
breast carcinoma, ‘in situ’ non-special invasive breast carcinoma, and invasive
carcinoma;

• A subset of The Cancer Genome Atlas (TCGA) available at the Genomic Data
Commons (GDC) from the US National Cancer Institute (NIH). We have used
the ‘Case’ filter, selecting ‘Breast’ as the primary site, rendering 1879 SVS images
[111]. These are classified into triple-negative breast carcinoma, ER-positive breast
carcinoma, PR-positive breast carcinoma, and HER-positive breast carcinoma.

5.4.2 Experiment motivation and design

We wanted to use an approach similar


to [109]. In this work, however, we want to train a few different models, such as deep
feed-forward and convolutional neural networks, using the TCGA dataset for training
and validation. The goal is to evaluate not only the performance, in terms of accuracy,
f1-score or other classification metrics, but also the associated computing cost of each
model both in the clear-text setting and on the privacy-preserving setting. This will give

43
a more realistic analysis on the viability of the proposed solution for real life medical
applications.
We will also use the INCA dataset in a second experiment, applying the models trained
in the previous experiment and observing their classification performance on the INCA
dataset. The datasets are not labeled with the same classification system, but we expect
to observe whether the models have a low false-positive count on the ‘Benign’ class. We
also want to compare the models’ classification results with those of specialists, when
considering the same classification system.
In order to use common image models in the encoding of these high-resolution images,
we had to generate lower-resolution copies, using the ImageMagick open source tool.
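The downsampling that the ImageMagick tool performs can be sketched in NumPy as block-average resampling, where each tile of pixels collapses into one output pixel. The factor and array shapes below are illustrative, not the actual values used in the experiments:

```python
import numpy as np

def downsample(img: np.ndarray, factor: int) -> np.ndarray:
    """Block-average downsampling: each factor x factor tile becomes one pixel."""
    h = img.shape[0] // factor * factor   # crop to a multiple of the factor
    w = img.shape[1] // factor * factor
    tiles = img[:h, :w].reshape(h // factor, factor, w // factor, factor)
    return tiles.mean(axis=(1, 3))        # average each tile

# Stand-in for a huge histological scan (real SVS files are far larger).
big = np.arange(64 * 48, dtype=float).reshape(64, 48)
small = downsample(big, 8)
print(small.shape)   # 8x smaller in each dimension
```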
We are also interested in demonstrating privacy-preserving model inference. This
scenario is important for the Brazilian National Cancer Institute, since it receives
Electronic Health Records (EHR) of potential cancer patients from most hospitals in
Brazil. INCA is not only the reference hospital for cancer treatment in Brazil; it is also
the coordinator of cancer-related public policies in the context of the SUS (the Brazilian
Unified Health System).
The SUS system, in turn, is a coordination framework of Federal and sub-national
(States and Municipalities) governments for the provision of sanitary and public health
services. It includes legal, budgetary, procedural standardization and many other levels
of coordination.
Therefore, if INCA had the ability to classify exam images that could potentially help
the diagnosis of breast cancer, or any other cancer type, without requiring the formal
transfer of a patient’s EHR, there could be an increase in efficiency, as well as a reduction
of costs for the entire public health system.

6 Conclusion

Neither logic without observation, nor observation
without logic, can move one step in the formation
of science.

The Organization of Thought

Alfred North Whitehead

This work presents a general architecture for privacy-preserving machine learning. It
results from a thorough consideration of a typical ML solution life-cycle and of how
privacy-enhancing technologies, such as homomorphic encryption and secure multi-party
computation, can be used as basic building blocks for private computations in each
phase of the desired ML solution.
It also presents a proof-of-concept of the proposed architecture, based on MPC proto-
cols, and its application on two use cases: text classification for fake news detection and
image classification for breast cancer detection and staging.

6.1 Experimental results


The experimental results on the first use case, fake news detection, cover two groups of
PPML techniques: privacy-preserving model training and privacy-preserving text
classification.
This work also presented relevant fake news detection approaches: source-based, fact
checking, and Natural Language Processing (NLP). The initial results, using the
source-based approach with the detection of autonomous agents (also referred to as
‘bots’), were published in the proceedings of a local conference [25]. Later we identified,
as pointed out in Chapter 4, a few advantages of the NLP approach in a
privacy-preserving oriented solution.
We have also discussed the use of different NLP techniques in text classification, and
how large language models can be used as a preprocessing step to generate embeddings
that convey semantic information from the encoded text. Then, we showed how those
embeddings are used for training and querying fake news detection inference models.

Our experiments also demonstrate how neural networks can be trained to detect fake
news using Secure Multi-party Computation protocols and how those MPC protocols allow
users to perform text classification in a privacy-preserving way. The respective findings
were submitted for publication at a relevant venue and are currently under review.
We argue that the most relevant finding is that the performance of the privacy-preserving
fake news classification solution built according to our architecture, measured in terms
of runtime, accuracy, and other classification metrics, is very close to that of models
trained and queried in the clear-text setting. This indicates that our architecture
guarantees the privacy of end users without reducing the predictive power or usability
of fake news detection solutions.
The results presented in this work also indicate that the use of transfer learning, with
the support of language models, is the best fit for a privacy-preserving solution, as
indicated by the architecture proposed in this work. A few arguments support this
statement. First, language models can be used as ‘plug-and-play’ pieces in the solution:
a newer, better language model would bring better results without major changes to the
overall solution. Also, the use of language models reduces the risk of data leakage during
the data wrangling activities. And, finally, this preprocessing strategy also significantly
reduces computing costs on the side of the data owner.

6.1.1 Future work


The work at hand now is the application of our architecture to a computer vision task
related to detection and staging of breast cancer. The experiments are being carried out
in collaboration with researchers from the Brazilian National Cancer Institute. We have
run a few preliminary experiments on preprocessing and classification of histologic
section images, converting large SVS images and generating embeddings with open
source vision models.
The goal for the next semester is to perform the activities listed in Table 6.1:

Table 6.1: Schedule of doctoral research activities


Jan, 2023 Improve the SVS image processing pipeline
Feb, 2023 Experiments on ‘clear-text’ setting for cancer classification
Mar, 2023 Experiments on privacy-preserving model training for cancer classification
Apr, 2023 Experiments on privacy-preserving inference cancer classification
May, 2023 Review and publish results
Jun, 2023 Review, conclusion and submission of the doctoral thesis

Moreover, in the experiments reported above, we have only looked at the computational
cost, in terms of runtime, of privacy-preserving training and inference. Further work,
beyond the scope of the present doctoral research, could study the differences in
performance between computationally heavier and lighter preprocessing techniques.
Considering that, in a privacy-preserving setting, the preprocessing phase must be
performed on the user’s device, it is interesting to look for techniques with low
computational cost and acceptable performance.

Acknowledgments
This work has been funded in part by the Graduate Deanship of Universidade de Brasília,
under the “EDITAL DPG Nº 0004/2021” grants program.

Bibliography

[1] Dale, Robert: GPT-3: What’s it good for? Natural Language Engineering,
27(1):113–118, 2021. 1

[2] Van Noorden, Richard: The ethical questions that haunt facial-recognition research.
Nature, 587(7834):354–359, 2020. 1

[3] BRASIL: Lei nº 13.709, de 14 de agosto de 2018., 2018. http://www.planalto.gov.
br/ccivil_03/_Ato2015-2018/2018/Lei/L13709.htm. 2

[4] European Commission: Regulation EU n. 2016/679., 2016. https://ec.europa.


eu/info/law/law-topic/data-protection_en. 2

[5] Al-Rubaie, Mohammad and J. Morris Chang: Privacy-Preserving Machine Lear-


ning: Threats and Solutions. IEEE Security Privacy, 17(2):49–58, 2019. 2, 9

[6] Graepel, T., K Lauter, and M Naehrig: ML Confidential: Machine Learning on


Encrypted Data. Cryptology ePrint Archive, Report 2012/323, 2012. https://
eprint.iacr.org/2012/323. 2, 9

[7] Canetti, R.: Universally Composable Security: A New Paradigm for Cryptographic
Protocols. In Proceedings of the 42Nd IEEE Symposium on Foundations of Computer
Science, FOCS ’01, pages 136–, Washington, DC, USA, 2001. IEEE Computer Soci-
ety, ISBN 0-7695-1390-5. http://dl.acm.org/citation.cfm?id=874063.875553.
2, 9

[8] De Cock, Martine, Rafael Dowsley, Caleb Horst, Raj Katti, Anderson Nascimento,
Wing Sea Poon, and Stacey Truex: Efficient and Private Scoring of Decision Trees,
Support Vector Machines and Logistic Regression Models based on Pre-Computation.
IEEE Transactions on Dependable and Secure Computing, PP(99), 2017. 2, 9, 31,
33

[9] Rivest, R., L. Adleman, and M. Dertouzos: On data banks and privacy homomor-
phisms. Foundations of Secure Computation, pages 169–177, 1978. 2, 9

[10] Gentry, C.: A fully homomorphic encryption scheme. PhD thesis, Stanford Univer-
sity, 2009. crypto.stanford.edu/craig. 2, 9

[11] Lopez-Alt, A., E. Tromer, and V. Vaikuntanathan: On-the-Fly Multiparty Computa-


tion on the Cloud via Multikey Fully Homomorphic Encryption. Cryptology ePrint
Archive, Report 2013/094, 2013. 2, 9

[12] Souza, Stefano M P C and Ricardo S Puttini: Client-side encryption for privacy-
sensitive applications on the cloud. Procedia Computer Science, 97:126–130, 2016.
2, 14

[13] Damgård, Ivan and Mats Jurik: A Generalisation, a Simplification and Some Appli-
cations of Paillier’s Probabilistic Public-Key System. In Proceedings of the 4th Inter-
national Workshop on Practice and Theory in Public Key Cryptography: Public Key
Cryptography, PKC ’01, pages 119–136, London, UK, UK, 2001. Springer-Verlag,
ISBN 3-540-41658-7. http://dl.acm.org/citation.cfm?id=648118.746742. 2,
9

[14] Nikolaenko, V., U. Weinsberg, S. Ioannidis, M. Joye, D. Boneh, and N. Taft:
Privacy-Preserving Ridge Regression on Hundreds of Millions of Records. In 2013
IEEE Symposium on Security and Privacy. IEEE, 2013. 2, 9

[15] Bos, J. W., K. Lauter, and M. Naehrig: Private Predictive Analysis on Encrypted
Medical Data. Cryptology ePrint Archive, Report 2014/336, 2014. 2, 10

[16] Popa, R. A., C. M. S. Redfield, N. Zeldovich, and H. Balakrishnan: CryptDB: Pro-


tecting Confidentiality with Encrypted Query Processing. In Proceedings of the 23rd
ACM Symposium on Operating Systems Principles, SOSP ’11, pages 85–100, New
York, NY, USA, 2011. ACM, ISBN 978-1-4503-0977-6. 2, 10

[17] Souza, Stefano M P C, RF Gonçalves, E Leonova, RS Puttini, and Anderson CA


Nascimento: Privacy-ensuring electronic health records in the cloud. Concurrency
and Computation: Practice and Experience, 29(11):e4045, 2017. 2, 14

[18] Souza, Stefano M. P. C.: Safe-Record: segurança e privacidade para registros eletrô-
nicos em saúde na nuvem. Master’s thesis, PPGEE/FT - Universidade de Brasília,
2016. 2, 4, 10

[19] Sakuma, Jun, Shigenobu Kobayashi, and Rebecca N. Wright: Privacy-preserving


Reinforcement Learning. In Proceedings of the 25th International Conference on
Machine Learning, ICML ’08, pages 864–871, New York, 2008. ACM. 3, 10

[20] Souza, Stefano M P C: Possíveis impactos da LGPD na atividade de inteligência do


Cade. Escola Nacional de Administração Pública (Enap), 2020. 4

[21] Trauth, E. M.: Achieving the Research Goal with Qualitative Methods: Lessons
Learned along the Way. In Proceedings of the IFIP TC8 WG 8.2 International
Conference on Information Systems and Qualitative Research, page 225–245, GBR,
1997. Chapman & Hall, Ltd., ISBN 0412823608. 5

[22] Deb, Dipankar, Rajeeb Dey, and Valentina E. Balas: [Intelligent Systems Refe-
rence Library - Vol. 153] Engineering Research Methodology: A Practical Insight
for Researchers, volume 10.1007/978-981-13-2947-0, chapter 1, pages 1–7. Sprin-
ger, 2019, ISBN 978-981-13-2946-3,978-981-13-2947-0. http://gen.lib.rus.ec/
scimag/index.php?s=10.1007/978-981-13-2947-0. 5

[23] Kaplan, Bonnie and Dennis Duchon: Combining Qualitative and Quantitative
Methods in Information Systems Research: A Case Study. MIS Q., 12(4):571–586,
December 1988, ISSN 0276-7783. http://dx.doi.org/10.2307/249133. 5

[24] Yin, R. K.: Case Study Research: Design and Methods. SAGE, Beverly Hills, 1984.
6

[25] Souza, S. M. P. C., T. B. Rezende, J. Nascimento, L. G. Chaves, D. H. P. Soto,


and S. Salavati: Tuning machine learning models to detect bots on Twitter. In 2020
Workshop on Communication Networks and Power Systems (WCNPS), pages 1–6,
2020. 7, 31, 34, 45

[26] Souza, Stefano M. P. C. and Daniel G. Silva: Monte Carlo execution time estimation
for Privacy-preserving Distributed Function Evaluation protocols, 2021. 7

[27] Ben-David, Assaf, Noam Nisan, and Benny Pinkas: FairplayMP: a system for se-
cure multi-party computation. In Ning, Peng, Paul F. Syverson, and Somesh Jha
(editors): Proceedings of the 2008 ACM Conference on Computer and Communica-
tions Security, CCS 2008, Alexandria, Virginia, USA, October 27-31, 2008, pages
257–266. ACM, 2008. 10

[28] Bogdanov, Dan, Sven Laur, and Jan Willemson: Sharemind: A Framework for Fast
Privacy-Preserving Computations. In Proc. of the 13th European Symposium on
Research in Computer Security, pages 192–206, 2008. 10

[29] Ryffel, Theo, Andrew Trask, Morten Dahl, Bobby Wagner, Jason Mancuso, Da-
niel Rueckert, and Jonathan Passerat-Palmbach: A generic framework for privacy
preserving deep learning. CoRR, abs/1811.04017, 2018. 10

[30] Sadegh Riazi, M., C. Weinert, O. Tkachenko, E. M. Songhori, T. Schneider, and


F. Koushanfar: Chameleon: A Hybrid Secure Computation Framework for Machine
Learning Applications. ArXiv e-prints, 2018. 10, 31

[31] Knott, B., S. Venkataraman, A.Y. Hannun, S. Sengupta, M. Ibrahim, and L.J.P.
van der Maaten: CrypTen: Secure Multi-Party Computation Meets Machine Le-
arning. In Proceedings of the NeurIPS Workshop on Privacy-Preserving Machine
Learning, 2020. 10, 16, 33

[32] Zhang, Yihua, Aaron Steele, and Marina Blanton: PICCO: A General-purpose
Compiler for Private Distributed Computation. In Proceedings of the 2013
ACM SIGSAC Conference on Computer Communications Security. ACM, 2013,
ISBN 978-1-4503-2477-9. 10

[33] Songhori, E. M., S. U. Hussain, A. Sadeghi, T. Schneider, and F. Koushanfar:


TinyGarble: Highly Compressed and Scalable Sequential Garbled Circuits. In 2015
IEEE Symposium on Security and Privacy, pages 411–428, May 2015. 10

[34] Demmler, Daniel, Thomas Schneider, and Michael Zohner: ABY - A Framework
for Efficient Mixed-Protocol Secure Two-Party Computation. In 22nd Network and
Distributed System Security Symposium, 2015. 10, 16, 31

[35] Mohassel, P. and Y. Zhang: SecureML: A System for Scalable Privacy-Preserving
Machine Learning. In 2017 IEEE Symposium on Security and Privacy (SP), pages
19–38, May 2017. 10, 31
[36] Yao, Andrew C.: Protocols for Secure Computations. In Proceedings of the 23rd An-
nual Symposium on Foundations of Computer Science, SFCS ’82. IEEE Computer
Society, 1982. 11
[37] Beaver, Donald: One-time tables for two-party computation. In Computing and
Combinatorics, pages 361–370. Springer, 1998. 12
[38] Paillier, Pascal: Public-key cryptosystems based on composite degree residuosity clas-
ses. In IN ADVANCES IN CRYPTOLOGY — EUROCRYPT 1999, pages 223–238.
Springer-Verlag, 1999. 14
[39] Goldwasser, Shafi and Silvio Micali: Probabilistic encryption. Journal of Computer
and System Sciences, 28(2):270–299, 1984, ISSN 0022-0000. 14
[40] Naor, Moni and Kobbi Nissim: Communication Complexity and Secure Function
Evaluation. Electronic Colloquium on Computational Complexity (ECCC), 8, 2001.
15
[41] Agarwal, Anisha, Rafael Dowsley, Nicholas D. McKinney, Dongrui Wu, Chin Teng
Lin, Martine De Cock, and Anderson Nascimento: Privacy-Preserving Linear Re-
gression for Brain-Computer Interface Applications. In Proc. of 2018 IEEE Inter-
national Conference on Big Data, 2018. 16
[42] Silva, D. G, M. Jino, and B. de Abreu: A Simple Approach for Estimation of Exe-
cution Effort of Functional Test Cases. In IEEE Sixth International Conference on
Software Testing, Verification and Validation. IEEE Computer Society, Apr 2009.
17
[43] Iqbal, N., M. A. Siddique, and J. Henkel: DAGS: Distribution agnostic sequential
Monte Carlo scheme for task execution time estimation. In 2010 Design, Automation
Test in Europe Conference Exhibition (DATE 2010), pages 1645–1648, 2010. 17
[44] Li, Zhi, Hao Wang, Guangquan Xu, Alireza Jolfaei, Xi Zheng, Chunhua Su, and
Wenying Zhang: Privacy-Preserving Distributed Transfer Learning and its Applica-
tion in Intelligent Transportation. IEEE Transactions on Intelligent Transportation
Systems, pages 1–17, 2022. 19
[45] Cunha, Washington, Vítor Mangaravite, Christian Gomes, Sérgio Canuto, Elaine
Resende, Cecilia Nascimento, Felipe Viegas, Celso França, Wellington Santos Mar-
tins, Jussara M. Almeida, Thierson Rosa, Leonardo Rocha, and Marcos André
Gonçalves: On the cost-effectiveness of neural and non-neural approaches and repre-
sentations for text classification: A comprehensive comparative study. Information
Processing & Management, 58(3):102481, 2021, ISSN 0306-4573. 23
[46] Habert, Benoit, Gilles Adda, Martine Adda-Decker, P Boula de Marëuil, Serge
Ferrari, Olivier Ferret, Gabriel Illouz, and Patrick Paroubek: Towards tokenization
evaluation. In Proceedings of LREC, volume 98, pages 427–431, 1998. 24

[47] Kaur, Jashanjot and P Kaur Buttar: A systematic review on stopword removal
algorithms. Int. J. Futur. Revolut. Comput. Sci. Commun. Eng, 4(4), 2018. 24
[48] Gerlach, Martin, Hanyu Shi, and Luís A Nunes Amaral: A universal information
theoretic approach to the identification of stopwords. Nature Machine Intelligence,
1(12):606–612, 2019. 24
[49] Singh, Jasmeet and Vishal Gupta: Text stemming: Approaches, applications, and
challenges. ACM Computing Surveys (CSUR), 49(3):1–46, 2016. 25
[50] Dereza, Oksana: Lemmatization for Ancient Languages: Rules or Neural Networks?
In Conference on Artificial Intelligence and Natural Language, pages 35–47. Sprin-
ger, 2018. 25
[51] Jongejan, Bart and Hercules Dalianis: Automatic training of lemmatization rules
that handle morphological changes in pre-, in-and suffixes alike. In Proceedings of the
Joint Conference of the 47th Annual Meeting of the ACL and the 4th International
Joint Conference on Natural Language Processing of the AFNLP, pages 145–153,
2009. 25
[52] Plisson, Joël, Nada Lavrac, Dunja Mladenic, et al.: A rule based approach to word
lemmatization. In Proceedings of IS, volume 3, pages 83–86, 2004. 25
[53] Malaviya, Chaitanya, Shijie Wu, and Ryan Cotterell: A simple joint model for im-
proved contextual neural lemmatization. arXiv preprint arXiv:1904.02306, 2019. 25
[54] Kondratyuk, Daniel, Tomáš Gavenčiak, Milan Straka, and Jan Hajič: LemmaTag:
Jointly tagging and lemmatizing for morphologically-rich languages with BRNNs.
arXiv preprint arXiv:1808.03703, 2018. 25, 26
[55] Schmid, Helmut and Florian Laws: Estimation of conditional probabilities with de-
cision trees and an application to fine-grained POS tagging. In Proceedings of the
22nd International Conference on Computational Linguistics (Coling 2008), pages
777–784, 2008. 26
[56] Potthast, Martin, Johannes Kiesel, Kevin Reinartz, Janek Bevendorff, and Benno
Stein: A Stylometric Inquiry into Hyperpartisan and Fake News. In Proceedings of
the 56th Annual Meeting of the Association for Computational Linguistics (Volume
1: Long Papers), pages 231–240, Melbourne, Australia, July 2018. Association for
Computational Linguistics. https://www.aclweb.org/anthology/P18-1022. 26,
32
[57] Monteiro, Rafael A., Roney L. S. Santos, Thiago A. S. Pardo, Tiago A. de Al-
meida, Evandro E. S. Ruiz, and Oto A. Vale: Contributions to the Study of Fake
News in Portuguese: New Corpus and Automatic Detection Results. In Computati-
onal Processing of the Portuguese Language, pages 324–334. Springer International
Publishing, 2018, ISBN 978-3-319-99722-3. 26, 31, 34
[58] Davis, R. and C. Proctor: Fake News, Real Consequences: Recruiting Neural
Networks for the Fight Against Fake News. Technical report, Stanford University,
2017. 26

[59] Barde, B. V. and A. M. Bainwad: An overview of topic modeling methods and tools.
In 2017 International Conference on Intelligent Computing and Control Systems
(ICICCS), pages 745–750, 2017. 26
[60] El-Din, Doaa Mohey: Enhancement bag-of-words model for solving the challenges
of sentiment analysis. International Journal of Advanced Computer Science and
Applications, 7(1), 2016. 26
[61] Li, Bofang, Zhe Zhao, Tao Liu, Puwei Wang, and Xiaoyong Du: Weighted neural bag-
of-n-grams model: New baselines for text classification. In Proceedings of COLING
2016, the 26th International Conference on Computational Linguistics: Technical
Papers, pages 1591–1600, 2016. 27
[62] Yun-tao, Zhang, Gong Ling, and Wang Yong-cheng: An improved TF-IDF approach
for text classification. Journal of Zhejiang University-Science A, 6(1):49–55, 2005.
27
[63] Ahmed, Hadeer, Issa Traore, and Sherif Saad: Detection of online fake news using
n-gram analysis and machine learning techniques. In International conference on
intelligent, secure, and dependable systems in distributed and cloud environments,
pages 127–138. Springer, 2017. 27
[64] Dyson, Lauren and Alden Golab: Fake News Detection Exploring the Application of
NLP Methods to Machine Identification of Misleading News Sources. CAPP 30255
Adv. Mach. Learn. Public Policy, 2017. 27, 32
[65] Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean: Efficient Estimation
of Word Representations in Vector Space, 2013. 27
[66] Yang, Kai Chou, Timothy Niven, and Hung Yu Kao: Fake News Detection as Natural
Language Inference. arXiv preprint arXiv:1907.07347, 2019. 28, 32
[67] Hosseinimotlagh, Seyedmehdi and Evangelos E Papalexakis: Unsupervised content-
based identification of fake news articles with tensor decomposition ensembles. In
Proceedings of the Workshop on Misinformation and Misbehavior Mining on the
Web (MIS2), 2018. 28
[68] Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan
N Gomez, Łukasz Kaiser, and Illia Polosukhin: Attention is All you Need. In Guyon,
I., U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R.
Garnett (editors): Advances in Neural Information Processing Systems, volume 30.
Curran Associates, Inc., 2017. https://proceedings.neurips.cc/paper/2017/
file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf. 28
[69] Devlin, Jacob, Ming Wei Chang, Kenton Lee, and Kristina Toutanova: Bert: Pre-
training of deep bidirectional transformers for language understanding. arXiv pre-
print arXiv:1810.04805, 2018. 29, 32
[70] Rogers, Anna, Olga Kovaleva, and Anna Rumshisky: A Primer in BERTology: What
We Know About How BERT Works. Transactions of the Association for Computa-
tional Linguistics, 8:842–866, January 2021, ISSN 2307-387X. 29

[71] Reimers, Nils and Iryna Gurevych: Sentence-BERT: Sentence Embeddings using Sia-
mese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods
in Natural Language Processing. Association for Computational Linguistics, Novem-
ber 2019. https://arxiv.org/abs/1908.10084. 29

[72] Gelfert, Axel: Fake News: A Definition. Informal Logic, 37(0):83–117, 2017. 30

[73] D’Arienzo, Maria Chiara, Valentina Boursier, and Mark D. Griffiths: Addiction to
Social Media and Attachment Styles: A Systematic Literature Review. International
Journal of Mental Health and Addiction, 17:1094 – 1118, 2019. 30

[74] Ferrara, Emilio: Disinformation and Social Bot Operations in the Run Up to the
2017 French Presidential Election. First Monday, 22, June 2017. 31

[75] Lee, Sangwon and Michael Xenos: Social distraction? Social media use and political
knowledge in two U.S. Presidential elections. Computers in Human Behavior, 90:18
– 25, 2019, ISSN 0747-5632. 31

[76] Nascimento, Josué: Only one word 99.2%, Aug 2020. https://www.kaggle.com/
josutk/only-one-word-99-2. 31, 32

[77] Gangireddy, Siva Charan Reddy, Deepak P, Cheng Long, and Tanmoy Chakraborty:
Unsupervised Fake News Detection: A Graph-Based Approach. In Proceedings of
the 31st ACM Conference on Hypertext and Social Media, HT ’20, page 75–83, New
York, NY, USA, 2020. Association for Computing Machinery, ISBN 9781450370981.
https://doi.org/10.1145/3372923.3404783. 31

[78] Shu, Kai, Xinyi Zhou, Suhang Wang, Reza Zafarani, and Huan Liu: The Role of
User Profiles for Fake News Detection. In ASONAM ’19: International Conference
on Advances in Social Networks Analysis and Mining, page 436–439, New York,
NY, USA, 2019. Association for Computing Machinery, ISBN 9781450368681. 31

[79] Pinnaparaju, Nikhil, Vijaysaradhi Indurthi, and Vasudeva Varma: Identifying Fake
News Spreaders in Social Media. In CLEF, 2020. 31

[80] Nadeem, Moin, Wei Fang, Brian Xu, Mitra Mohtarami, and James Glass: FAKTA:
An Automatic End-to-End Fact Checking System, 2019. 32

[81] Moreno, João and Graça Bressan: FACTCK.BR: A New Dataset to Study
Fake News. In Proceedings of the 25th Brazilian Symposium on Multimedia and the
Web, WebMedia ’19, page 525–527, New York, NY, USA, 2019. Association for Com-
puting Machinery, ISBN 9781450367639. https://doi.org/10.1145/3323503.
3361698. 32, 33

[82] Gupta, Ankur, Yash Varun, Prarthana Das, Nithya Muttineni, Parth Srivastava,
Hamim Zafar, Tanmoy Chakraborty, and Swaprava Nath: TruthBot: An Automa-
ted Conversational Tool for Intent Learning, Curated Information Presenting, and
Fake News Alerting. CoRR, abs/2102.00509, 2021. https://arxiv.org/abs/2102.
00509. 32

[83] Lee, Sungjin: Nudging Neural Conversational Model with Domain Knowledge.
CoRR, abs/1811.06630, 2018. http://arxiv.org/abs/1811.06630. 32

[84] Graves, Lucas: Anatomy of a Fact Check: Objective Practice and the Contested
Epistemology of Fact Checking. Communication, Culture and Critique, 10(3):518–
537, October 2017, ISSN 1753-9129. https://doi.org/10.1111/cccr.12163. 32

[85] Marietta, Morgan, David C Barker, and Todd Bowser: Fact-checking polarized poli-
tics: Does the fact-check industry provide consistent guidance on disputed realities?
In The Forum, volume 13, pages 577–596. De Gruyter, 2015. 32

[86] Horne, Benjamin D. and Sibel Adali: This Just In: Fake News Packs a Lot in Title,
Uses Simpler, Repetitive Content in Text Body, More Similar to Satire than Real
News. CoRR, abs/1703.09398, 2017. http://arxiv.org/abs/1703.09398. 32

[87] Young, T., D. Hazarika, S. Poria, and E. Cambria: Recent Trends in Deep Lear-
ning Based Natural Language Processing [Review Article]. IEEE Computational
Intelligence Magazine, 13(3):55–75, 2018. 32

[88] Baruah, Arup, K Das, F Barbhuiya, and Kuntal Dey: Automatic Detection of Fake
News Spreaders Using BERT. In CLEF, 2020. 32

[89] Zhang, T., D. Wang, H. Chen, Z. Zeng, W. Guo, C. Miao, and L. Cui: BDANN:
BERT-Based Domain Adaptation Neural Network for Multi-Modal Fake News De-
tection. In 2020 International Joint Conference on Neural Networks (IJCNN), pages
1–8, 2020. 32

[90] Brown, Tom, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Pra-
fulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell,
Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon
Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse,
Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark,
Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario
Amodei: Language Models are Few-Shot Learners. In Larochelle, H., M. Ranzato,
R. Hadsell, M. F. Balcan, and H. Lin (editors): Advances in Neural Information
Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc., 2020. 32

[91] Tan, Reuben, Bryan A. Plummer, and Kate Saenko: Detecting Cross-Modal Incon-
sistency to Defend Against Neural Fake News, 2020. 32

[92] Mosallanezhad, Ahmadreza, Kai Shu, and Huan Liu: Topic-Preserving Synthetic
News Generation: An Adversarial Deep Reinforcement Learning Approach, 2020.
32

[93] Zellers, Rowan, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Fran-
ziska Roesner, and Yejin Choi: Defending Against Neural Fake News. In Wallach,
H., H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (editors):
Advances in Neural Information Processing Systems, volume 32, pages 9054–9065.
Curran Associates, Inc., 2019. https://proceedings.neurips.cc/paper/2019/
file/3e9f0fc9b2f89e043bc6233994dfcf76-Paper.pdf. 32

[94] Wang, William Yang: "Liar, Liar Pants on Fire": A New Benchmark Dataset for
Fake News Detection. arXiv preprint arXiv:1705.00648, 2017. 34

[95] Bhatia, Ruchi: Source based Fake News Classification, Aug 2020. https://www.
kaggle.com/ruchi798/source-based-news-classification. 34

[96] Reimers, Nils and Iryna Gurevych: Making Monolingual Sentence Embeddings Mul-
tilingual using Knowledge Distillation. In Proceedings of the 2020 Conference on
Empirical Methods in Natural Language Processing. Association for Computational
Linguistics, November 2020. 34

[97] Adams, Samuel, David Melanson, and Martine De Cock: Private Text Classification
with Convolutional Neural Networks. In Proceedings of the Third Workshop on Pri-
vacy in Natural Language Processing, pages 53–58, Online, June 2021. Association
for Computational Linguistics. 35

[98] Trueman, Tina Esther, Ashok Kumar J., Narayanasamy P., and Vidya J.: Attention-
based C-BiLSTM for fake news detection. Applied Soft Computing, 110:107600,
2021, ISSN 1568-4946. 36

[99] Lotter, William, Abdul Rahman Diab, Bryan Haslam, Jiye G Kim, Giorgia Grisot,
Eric Wu, Kevin Wu, Jorge Onieva Onieva, Yun Boyer, Jerrold L Boxerman, et al.:
Robust breast cancer detection in mammography and digital breast tomosynthesis
using an annotation-efficient deep learning approach. Nature Medicine, 27(2):244–
249, 2021. 39

[100] Siegel, Rebecca L., Kimberly D. Miller, Hannah E. Fuchs, and Ahmedin Jemal:
Cancer statistics, 2022. CA: A Cancer Journal for Clinicians, 72(1):7–33, 2022. 39

[101] Amin, M.B., S.B. Edge, F.L. Greene, D.R. Byrd, R.K. Brookland, M.K. Washing-
ton, J.E. Gershenwald, C.C. Compton, K.R. Hess, D.C. Sullivan, et al.: AJCC
Cancer Staging Manual. Springer Cham, 2018, ISBN 9783319406176. 40

[102] Deng, J., W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei: ImageNet: A Large-
Scale Hierarchical Image Database. In CVPR09, 2009. 41

[103] Minaee, Shervin, Yuri Y Boykov, Fatih Porikli, Antonio J Plaza, Nasser Kehtarna-
vaz, and Demetri Terzopoulos: Image segmentation using deep learning: A survey.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021. 42

[104] Elpeltagy, Marwa and Hany Sallam: Automatic prediction of COVID-19 from chest
images using modified ResNet50. Multimedia Tools and Applications, 80(17):26451–
26463, 2021. 42

[105] Ali, Nairveen, Elsie Quansah, Katarina Köhler, Tobias Meyer, Michael Schmitt,
Jürgen Popp, Axel Niendorf, and Thomas Bocklitz: Automatic label-free detection
of breast cancer using nonlinear multimodal imaging and the convolutional neural
network ResNet50. Translational Biophotonics, 1(1-2):e201900003, 2019. 42

[106] Kolesnikov, Alexander, Alexey Dosovitskiy, Dirk Weissenborn, Georg Heigold, Ja-
kob Uszkoreit, Lucas Beyer, Matthias Minderer, Mostafa Dehghani, Neil Houlsby,
Sylvain Gelly, Thomas Unterthiner, and Xiaohua Zhai: An Image is Worth 16x16
Words: Transformers for Image Recognition at Scale. 2021. 42

[107] Hansen, Nicklas, Hao Su, and Xiaolong Wang: Stabilizing Deep Q-Learning with
ConvNets and Vision Transformers under Data Augmentation. In Ranzato, M., A.
Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (editors): Advances
in Neural Information Processing Systems, volume 34, pages 3680–3693. Curran
Associates, Inc., 2021. 43

[108] Yang, Kaiyu, Jacqueline Yau, Li Fei-Fei, Jia Deng, and Olga Russakovsky: A Study
of Face Obfuscation in ImageNet. In International Conference on Machine Learning
(ICML), 2022. 43

[109] Kaissis, G, A Ziller, J Passerat-Palmbach, T Ryffel, D Usynin, A Trask, I Lima, J
Mancuso, F Jungmann, M M Steinborn, A Saleh, M Makowski, D Rueckert, and R
Braren: End-to-end privacy preserving deep learning on multi-institutional medical
imaging. Nature Machine Intelligence, 3:473–484, 2021. 43

[110] Kumar, Rajesh, Jay Kumar, Abdullah Aman Khan, Zakria, Hub Ali, Cobbinah
M. Bernard, Riaz Ullah Khan, and Shaoning Zeng: Blockchain and homomorphic
encryption based privacy-preserving model aggregation for medical images. Compu-
terized Medical Imaging and Graphics, 102:102139, 2022, ISSN 0895-6111. 43

[111] Koboldt, Daniel C. et al: Comprehensive molecular portraits of human breast tu-
mours. Nature, 490(7418):61–70, Oct 2012, ISSN 1476-4687. 43

I MPC Protocols

Protocol: πADD
Input: Secret shares [[x_1]]_q, ..., [[x_n]]_q
Output: [[z]]_q = Σ_{j=1}^n [[x_j]]_q
Execution:

1. Each party P_i ∈ P locally computes z_i = Σ_{j=1}^n x_{j,i}, where x_{j,i} denotes P_i's share of x_j

2. Output [[z]]_q = (z_1, ..., z_n)

Protocol 1: Secure Distributed Addition Protocol πADD
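The purely local nature of πADD can be illustrated with a short Python sketch. This is an informal illustration, not part of the protocol specification; the modulus q, the three-party setting, and all function names are choices made here for the example:

```python
import random

q = 2**61 - 1  # an arbitrary prime modulus chosen for this illustration

def share(x, n=3):
    """Split x into n additive shares modulo q."""
    shares = [random.randrange(q) for _ in range(n - 1)]
    shares.append((x - sum(shares)) % q)
    return shares

def reconstruct(shares):
    return sum(shares) % q

def pi_add(shared_values):
    """pi_ADD: shared_values[j][i] is party P_i's share of x_j.
    Each party only sums the shares it already holds; no communication is needed."""
    n = len(shared_values[0])
    return [sum(s[i] for s in shared_values) % q for i in range(n)]

z_shares = pi_add([share(x) for x in (10, 20, 12)])
assert reconstruct(z_shares) == 42
```

The sketch makes the key design point visible: addition of additively shared values is "free", since each party operates only on its own shares.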

Protocol: πMUL
Setup:

1. The Trusted Initializer (TI) draws u, v uniformly from Z_q, sets w = uv, and distributes
shares [[u]]_q, [[v]]_q and [[w]]_q to the protocol parties

2. The TI draws ι uniformly from {1, ..., n} and sends the asymmetric bit 1 to party P_ι
and 0 to the parties P_{i≠ι}

Input: Shares [[x]]_q, [[y]]_q
Output: [[z]]_q = [[xy]]_q
Execution:
Execution:

1. Each party P_i locally computes d_i ← x_i − u_i and e_i ← y_i − v_i

2. The parties broadcast d_i, e_i

3. Each party computes d ← Σ_{i=1}^n d_i and e ← Σ_{i=1}^n e_i

4. The party P_ι holding the asymmetric bit computes z_ι ← w_ι + dv_ι + eu_ι + de

5. All other parties P_{i≠ι} compute z_i ← w_i + dv_i + eu_i

Protocol 2: Secure Distributed Multiplication Protocol πMUL
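As a sanity check of the Beaver-triple algebra above (z = w + dv + eu + de = xy), the following Python sketch plays all the roles in one process: the trusted initializer and both parties are simulated locally. The two-party setting, the modulus, and the function names are illustrative assumptions, not part of the protocol:

```python
import random

q = 2**61 - 1  # illustrative prime modulus
n = 2          # number of parties in this sketch

def share(x):
    a = random.randrange(q)
    return [a, (x - a) % q]

def reconstruct(sh):
    return sum(sh) % q

def pi_mul(x_sh, y_sh):
    # Setup: the TI creates a multiplication triple w = u*v and shares it
    u, v = random.randrange(q), random.randrange(q)
    u_sh, v_sh, w_sh = share(u), share(v), share(u * v % q)
    # Steps 1-3: the parties open d = x - u and e = y - v
    d = reconstruct([(x_sh[i] - u_sh[i]) % q for i in range(n)])
    e = reconstruct([(y_sh[i] - v_sh[i]) % q for i in range(n)])
    # Steps 4-5: local computation; party 0 plays the role of P_iota
    # holding the asymmetric bit and adds the public term d*e
    z = [(w_sh[i] + d * v_sh[i] + e * u_sh[i]) % q for i in range(n)]
    z[0] = (z[0] + d * e) % q
    return z

assert reconstruct(pi_mul(share(6), share(7))) == 42
```

Opening d and e leaks nothing about x and y because u and v act as one-time pads.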

Protocol: πIP
Setup: The setup procedure for πMUL
Input: Vector shares [[x]]_q and [[y]]_q, where x = (x_1, ..., x_l) and y = (y_1, ..., y_l)
Output: [[z]]_q = [[x · y]]_q
Execution:

1. Run l parallel instances of πMUL to compute [[z_k]]_q = [[x_k]]_q · [[y_k]]_q for k ∈ {1, ..., l}

2. Each party P_i locally computes z_i = Σ_{k=1}^l z_{k,i}

3. Output [[z]]_q

Protocol 3: Secure Distributed Inner Product Protocol πIP
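Building on the multiplication protocol, πIP amounts to l parallel products followed by a purely local sum of the product shares. A Python sketch under the same illustrative two-party, TI-simulated-locally assumptions:

```python
import random

q = 2**61 - 1  # illustrative prime modulus

def share(x):
    a = random.randrange(q)
    return [a, (x - a) % q]

def reconstruct(sh):
    return sum(sh) % q

def pi_mul(x_sh, y_sh):
    # Beaver-triple multiplication as in pi_MUL (TI simulated locally)
    u, v = random.randrange(q), random.randrange(q)
    u_sh, v_sh, w_sh = share(u), share(v), share(u * v % q)
    d = reconstruct([(x_sh[i] - u_sh[i]) % q for i in range(2)])
    e = reconstruct([(y_sh[i] - v_sh[i]) % q for i in range(2)])
    z = [(w_sh[i] + d * v_sh[i] + e * u_sh[i]) % q for i in range(2)]
    z[0] = (z[0] + d * e) % q
    return z

def pi_ip(xs_sh, ys_sh):
    # Step 1: l parallel instances of pi_MUL
    prods = [pi_mul(x, y) for x, y in zip(xs_sh, ys_sh)]
    # Step 2: each party locally sums its shares of the products
    return [sum(p[i] for p in prods) % q for i in range(2)]

z_sh = pi_ip([share(x) for x in (1, 2, 3)], [share(y) for y in (4, 5, 6)])
assert reconstruct(z_sh) == 32  # 1*4 + 2*5 + 3*6
```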

Protocol: πOIS
Setup: Let l be the bitlength of the inputs to be shared and n the dimension of the input
vector. The trusted initializer pre-distributes all the correlated randomness necessary for the
executions of πMUL over Z_2
Input: Alice inputs the vector x = (x_1, ..., x_n); Bob inputs k, the index of the desired
output value
Output: x_k
Execution:

1. Define y_k = 1, and y_j = 0 for j ∈ {1, ..., n}, j ≠ k

2. For j ∈ {1, ..., n} and i ∈ {1, ..., l}, let x_{j,i} denote the i-th bit of x_j

3. Define [[y_j]]_2 as the pair of shares (0, y_j) and [[x_{j,i}]]_2 as (x_{j,i}, 0)

4. Compute in parallel [[z_i]]_2 ← Σ_{j=1}^n [[y_j]]_2 · [[x_{j,i}]]_2 for i ∈ {1, ..., l}

5. Output [[z_i]]_2 for i ∈ {1, ..., l}

Protocol 4: Oblivious Input Selection Protocol πOIS
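The selection step can be checked with a small Python sketch over Z_2 in which the trusted initializer and both parties run in one process; the layout of the trivial shares follows step 3. The sketch opens the output bits at the end only to verify correctness; names and the two-party structure are assumptions of the example:

```python
import random

def share2(bit):
    a = random.randrange(2)
    return [a, bit ^ a]

def open2(sh):
    return (sh[0] + sh[1]) % 2

def mul2(x_sh, y_sh):
    # Beaver-triple multiplication over Z_2 (pi_MUL with q = 2)
    u, v = random.randrange(2), random.randrange(2)
    u_sh, v_sh, w_sh = share2(u), share2(v), share2(u & v)
    d = open2([(x_sh[i] - u_sh[i]) % 2 for i in range(2)])
    e = open2([(y_sh[i] - v_sh[i]) % 2 for i in range(2)])
    z = [(w_sh[i] + d * v_sh[i] + e * u_sh[i]) % 2 for i in range(2)]
    z[0] = (z[0] + d * e) % 2
    return z

def pi_ois(x_vec, k, l):
    """Alice holds x_vec; Bob holds the index k; the result is x_vec[k]."""
    n = len(x_vec)
    y_sh = [[0, 1 if j == k else 0] for j in range(n)]       # Bob's selector bits
    bits = []
    for i in range(l):
        x_sh = [[(x_vec[j] >> i) & 1, 0] for j in range(n)]  # Alice's input bits
        prods = [mul2(y_sh[j], x_sh[j]) for j in range(n)]
        bits.append(open2([sum(p[s] for p in prods) % 2 for s in range(2)]))
    return sum(b << i for i, b in enumerate(bits))

assert pi_ois([5, 9, 12], 1, 4) == 9
```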

Protocol: πEq
Setup: The setup procedure for πMUL; in addition, the TI distributes shares [[r]]_q of a value
r drawn uniformly at random from Z_q \ {0}
Input: [[x]]_q and [[y]]_q
Output: [[0]]_q if x = y; shares of a nonzero value otherwise
Execution:

1. Locally compute [[d]]_q = [[x]]_q − [[y]]_q.

2. Execute πMUL to compute [[z]]_q = [[r]]_q · [[d]]_q.

3. Output [[z]]_q

Protocol 5: Equality Protocol πEq
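The masking step is the heart of πEq: since q is prime and r is nonzero, r(x − y) is zero exactly when x = y, and otherwise it is a uniformly random nonzero value that reveals nothing about the difference. A Python sketch under the same illustrative two-party, TI-simulated-locally assumptions:

```python
import random

q = 2**61 - 1  # illustrative prime modulus

def share(x):
    a = random.randrange(q)
    return [a, (x - a) % q]

def reconstruct(sh):
    return sum(sh) % q

def pi_mul(x_sh, y_sh):
    # Beaver-triple multiplication as in pi_MUL (TI simulated locally)
    u, v = random.randrange(q), random.randrange(q)
    u_sh, v_sh, w_sh = share(u), share(v), share(u * v % q)
    d = reconstruct([(x_sh[i] - u_sh[i]) % q for i in range(2)])
    e = reconstruct([(y_sh[i] - v_sh[i]) % q for i in range(2)])
    z = [(w_sh[i] + d * v_sh[i] + e * u_sh[i]) % q for i in range(2)]
    z[0] = (z[0] + d * e) % q
    return z

def pi_eq(x_sh, y_sh):
    d_sh = [(x_sh[i] - y_sh[i]) % q for i in range(2)]  # step 1, local
    r_sh = share(random.randrange(1, q))                # TI-provided nonzero mask
    return pi_mul(r_sh, d_sh)                           # step 2

assert reconstruct(pi_eq(share(5), share(5))) == 0
assert reconstruct(pi_eq(share(5), share(9))) != 0
```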

Protocol: F2toq
Input: [[x]]_2, consisting of Alice's share x_a and Bob's share x_b
Output: [[x]]_q
Execution:

1. Alice and Bob perform a secure XOR using πOR|XOR (k = 2), with Alice's input being
(x_a, 0), Bob's input being (0, x_b), and the arithmetic carried out modulo q > 2

2. Output the resulting shares [[x]]_q

Protocol 6: Z_2 to Z_q Conversion Function F2toq

Protocol: πOR|XOR
Setup: The setup procedure for πMUL over Z_2
Input: [[x]]_2, [[y]]_2 and k, where k = 1 computes the OR and k = 2 computes the XOR of
the two bits
Output: [[x ∨ y]]_2 if k = 1, [[x ⊕ y]]_2 if k = 2
Execution:

1. Execute πMUL to compute [[v]]_2 = [[x]]_2 · [[y]]_2

2. Locally compute [[z]]_2 = [[x]]_2 + [[y]]_2 − k[[v]]_2.

3. Output [[z]]_2.

Protocol 7: Bitwise OR/XOR Protocol πOR|XOR
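The single formula z = x + y − k·xy covers both gates: with k = 1 it is the inclusion-exclusion form of OR, and with k = 2 it is XOR. The following Python sketch verifies it over all bit pairs (two-party illustration with the TI simulated locally; names are choices of the example):

```python
import random

def share2(bit):
    a = random.randrange(2)
    return [a, bit ^ a]

def open2(sh):
    return (sh[0] + sh[1]) % 2

def mul2(x_sh, y_sh):
    # Beaver-triple multiplication over Z_2 (the setup of pi_MUL)
    u, v = random.randrange(2), random.randrange(2)
    u_sh, v_sh, w_sh = share2(u), share2(v), share2(u & v)
    d = open2([(x_sh[i] - u_sh[i]) % 2 for i in range(2)])
    e = open2([(y_sh[i] - v_sh[i]) % 2 for i in range(2)])
    z = [(w_sh[i] + d * v_sh[i] + e * u_sh[i]) % 2 for i in range(2)]
    z[0] = (z[0] + d * e) % 2
    return z

def pi_or_xor(x_sh, y_sh, k):
    v_sh = mul2(x_sh, y_sh)  # step 1: [[v]] = [[xy]]
    # step 2: local linear combination z = x + y - k*v
    return [(x_sh[i] + y_sh[i] - k * v_sh[i]) % 2 for i in range(2)]

for x in (0, 1):
    for y in (0, 1):
        assert open2(pi_or_xor(share2(x), share2(y), 1)) == (x | y)
        assert open2(pi_or_xor(share2(x), share2(y), 2)) == (x ^ y)
```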

Protocol: πTrunc
Setup: Let λ be a statistical security parameter. The protocol is parametrized by the size
q > 2^{k+f+λ+1} of the field and the dimensions ℓ_1 × ℓ_2 of the input matrix. The trusted
initializer picks a matrix R′ ∈ F_q^{ℓ_1×ℓ_2} with elements drawn uniformly from {0, ..., 2^f − 1}
and a matrix R″ ∈ F_q^{ℓ_1×ℓ_2} with elements drawn uniformly from {0, ..., 2^{k+λ} − 1}. The TI
then computes R = R″2^f + R′ and distributes the secret shares [[R]]_q and [[R′]]_q to the parties.
Input: [[W]]_q such that every element w of W satisfies w ∈ {0, 1, ..., 2^{k+f−1} − 1} ∪
{q − 2^{k+f−1} + 1, ..., q − 1}.
Execution:

1. Locally compute [[Z]]_q ← [[W]]_q + [[R]]_q and then open Z.

2. Compute C = Z + 2^{k+f−1} and C′ = C mod 2^f, where these scalar operations are
performed element-wise. Then compute [[S]]_q ← [[W]]_q + [[R′]]_q − C′.

3. For i = ((q + 1)/2)^f, i.e. the inverse of 2^f modulo q, locally compute [[T]]_q ← i[[S]]_q
and output [[T]]_q.

Protocol 8: Truncation Protocol πTrunc

Protocol: πBD
Setup: Let l be the bitlength of the value x to be bit-decomposed. The TI pre-distributes
shares of multiplication triples (u, v, w) with w = uv, drawn uniformly from Z_2, for the
distributed multiplications below.
Input: [[x]]_q, for q ≤ 2^l
Output: [[x]]_2, i.e. shares over Z_2 of each bit of x
Execution:

1. Let a denote Alice's share of x, corresponding to the bit string (a_1, ..., a_l), and let b
denote Bob's share of x, corresponding to the bit string (b_1, ..., b_l). Define the secret
sharing [[y_i]]_2 as the pair of shares (a_i, b_i), so that y_i = a_i + b_i mod 2; define [[a_i]]_2
as (a_i, 0) and [[b_i]]_2 as (0, b_i).

2. Compute [[c_1]]_2 ← [[a_1]]_2 [[b_1]]_2 using distributed multiplication, and locally set
[[x_1]]_2 ← [[y_1]]_2.

3. For i ∈ {2, ..., l}:

(a) compute [[d_i]]_2 ← [[a_i]]_2 [[b_i]]_2 + [[1]]_2
(b) compute [[e_i]]_2 ← [[y_i]]_2 [[c_{i−1}]]_2 + [[1]]_2
(c) compute [[c_i]]_2 ← [[e_i]]_2 [[d_i]]_2 + [[1]]_2
(d) locally set [[x_i]]_2 ← [[y_i]]_2 + [[c_{i−1}]]_2

4. Output [[x_i]]_2 for i ∈ {1, ..., l}.

Protocol 9: Bit-Decomposition Protocol πBD
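The recursion in step 3 is a ripple-carry addition of the two shares expressed with AND/XOR gates: d_i, e_i and c_i together compute the carry c_i = a_i b_i OR y_i c_{i−1} using only complements and products. The plaintext Python sketch below checks that boolean logic only; in the actual protocol every product is a secure Z_2 multiplication, and bit index 1 is taken here as the least significant:

```python
def pi_bd_logic(a, b, l):
    """Recover the bits of x = a + b mod 2**l from the bit strings of the
    two shares a and b, following the carry recursion of pi_BD."""
    ab = [(a >> i) & 1 for i in range(l)]
    bb = [(b >> i) & 1 for i in range(l)]
    y = [ab[i] ^ bb[i] for i in range(l)]  # y_i = a_i + b_i mod 2
    x = [y[0]]                             # x_1 = y_1 (no carry into the lowest bit)
    c = ab[0] & bb[0]                      # c_1 = a_1 * b_1
    for i in range(1, l):
        x.append(y[i] ^ c)                 # x_i = y_i + c_{i-1}
        d = 1 ^ (ab[i] & bb[i])            # d_i = a_i b_i + 1
        e = 1 ^ (y[i] & c)                 # e_i = y_i c_{i-1} + 1
        c = 1 ^ (d & e)                    # c_i = e_i d_i + 1
    return x

a, b, l = 13, 7, 5
bits = pi_bd_logic(a, b, l)
assert sum(bit << i for i, bit in enumerate(bits)) == (a + b) % 2**l
```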

Protocol: πDC
Setup: The trusted initializer pre-distributes shares of multiplication triples (u, v, w) with
w = uv, drawn uniformly from Z_2, for the distributed multiplications below.
Input: Each party holds the shares [[x_i]]_2 and [[y_i]]_2 of each bit of the l-bit integers x and y.
Output: [[1]]_2 if x ≥ y, and [[0]]_2 otherwise.
Execution:

1. For i ∈ {1, ..., l}, compute in parallel [[d_i]]_2 ← [[y_i]]_2 ([[1]]_2 − [[x_i]]_2) using the
multiplication protocol.

2. Locally compute [[e_i]]_2 ← [[x_i]]_2 + [[y_i]]_2 + [[1]]_2 for i ∈ {1, ..., l}.

3. For i ∈ {1, ..., l}, compute [[c_i]]_2 ← [[d_i]]_2 · Π_{j=i+1}^l [[e_j]]_2 using the multiplication
protocol.

4. Compute [[w]]_2 ← [[1]]_2 + Σ_{i=1}^l [[c_i]]_2 and output [[w]]_2.

Protocol 10: Comparison Protocol πDC
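The comparison logic can likewise be sanity-checked in the clear: d_i flags positions where y has a one and x a zero, e_i flags positions where the two bits agree, so c_i fires only when the highest differing bit favours y, and at most one c_i can be one. A plaintext Python sketch of that boolean recursion (bit 1 taken as the least significant; every product would be a secure Z_2 multiplication in the protocol):

```python
def pi_dc_logic(x, y, l):
    """Return 1 iff x >= y, following the boolean recursion of pi_DC."""
    xb = [(x >> i) & 1 for i in range(l)]
    yb = [(y >> i) & 1 for i in range(l)]
    d = [yb[i] & (1 ^ xb[i]) for i in range(l)]      # d_i = y_i (1 - x_i)
    e = [(xb[i] + yb[i] + 1) % 2 for i in range(l)]  # e_i = 1 iff x_i == y_i
    c = []
    for i in range(l):
        prod = d[i]
        for j in range(i + 1, l):                    # product of e_j for j > i
            prod &= e[j]
        c.append(prod)
    return (1 + sum(c)) % 2                          # w = 1 + sum of c_i mod 2

for x in range(8):
    for y in range(8):
        assert pi_dc_logic(x, y, 3) == int(x >= y)
```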

Protocol: πargmax
Setup: Let l be the bitlength and k the number of values to be compared. The trusted
initializer pre-distributes the correlated randomness needed for the distributed multiplications
and comparisons below.
Input: Each party holds the shares [[v_{j,i}]]_2 for all j ∈ {1, ..., k} and i ∈ {1, ..., l}.
Output: Value m computed by party P_1.
Execution:

1. For j ∈ {1, ..., k} and n ∈ {1, ..., k}, n ≠ j, the parties compute in parallel the
distributed comparison protocol πDC with inputs [[v_{j,i}]]_2 and [[v_{n,i}]]_2 for
i ∈ {1, ..., l}. Let [[w_{j,n}]]_2 denote the resulting output.

2. For j ∈ {1, ..., k}, compute in parallel [[w_j]]_2 = Π_{n≠j} [[w_{j,n}]]_2 using the
multiplication protocol.

3. The parties open each w_j to P_1. If w_j = 1, P_1 appends j to the value m to be
output in the end.

Protocol 11: ArgMax Protocol πargmax

