High Performance Federated Learning Using Serverless Computing
Evgeni Kiradzhiyski
DEPARTMENT OF INFORMATICS
TECHNISCHE UNIVERSITÄT MÜNCHEN
Contents
Abstract
1 Introduction
2 Background
  2.1 Federated Learning
    2.1.1 Definition and Workflow
    2.1.2 Categories and Use Cases
    2.1.3 Challenges
  2.2 Serverless Computing
    2.2.1 Definition and Key Terminology
    2.2.2 Applications
    2.2.3 Architecture and Workflow
    2.2.4 Serverless Platforms
  2.3 Serverless Meets ML
3 Related Work
  3.1 Federated Learning
  3.2 Serverless in the Machine Learning Context
  3.3 Serverless Meets Federated Learning
  3.4 FedKeeper and FedLess
5 Evaluation
  5.1 Experiment Setup
  5.2 Comparison with FedLess
    5.2.1 RQ 1: Parameter Server
    5.2.2 RQ 2: Replicated Parameter Servers
    5.2.3 FedLess Costs
  5.3 Discussion
List of Figures
List of Tables
Bibliography
1 Introduction
As a result of technological advancement over the past decades, a large share of today's world population uses mobile phones and tablets as primary computing devices on a daily basis [1, 2]. Such devices now serve to accomplish a wide variety of daily tasks and activities in many diverse areas, for example education, business, entertainment, and social networking, among many others [3]. In fact, the usage of smart devices has increased drastically, especially in the last two years, driven primarily by the global pandemic caused by Covid-19 and the resulting shift towards digitization in many different fields, with school and higher education, for instance, being among the most heavily affected ones [4]. Recent studies show a vast increase in the use of mobile devices for diverse learning activities among students over the last years, such as library services, access to learning material systems, writing texts for examination papers, internet research during lessons, reviewing grades, and sending emails to teachers [5].
In general, smartphone ownership and internet usage continue to grow in emerging economies at an extraordinary rate - a phenomenon that has been observed over the past decade. Between 2013 and 2015 alone, the share of mobile device users around the globe climbed from 21% to 37%, a rise of 16 percentage points [6, 1]. According to a survey conducted by the Pew Research Center1 in 2021, the share of Americans who own a smartphone has grown to 85%, marking a huge increase in less than ten years [7]. Along with smartphone usage, there has also been increasing use of other powerful end-user and network-edge devices such as smart home applications, connected vehicles, connected industrial equipment, and many others [8]. Currently, there are nearly 12 billion connected Internet of Things (IoT) devices and more than 3 billion smartphones worldwide that are used daily [9].
With the technological evolution, many of these modern mobile and IoT devices - autonomous vehicles, smart homes, smart wearable devices, and many more - have access to a wealth of data [10, 11] such as photos [12], videos [13], location information [14], etc. This rapid development of new technologies, and the rise of social networking applications in particular, has led to an exponential and unprecedented growth of the data generated at the network edge [15, 16], with the prediction that the data generation rate will exceed the capacity of today's Internet in the near future [17]. These smart devices generate vast amounts of data each day [8], which opens up various possibilities for meaningful research and industry applications [2]. Due to the rich data these devices hold, they provide opportunities for crafting machine learning (ML) models that can be used to analyze data sets, build decision-making systems, empower more intelligent applications, and thus improve the user experience on the
1 https://www.pewresearch.org/
device [18, 19, 11, 20, 16]. Examples of potential advancements that can be achieved as a
result of applying various machine learning techniques include obtaining information about
image detection and classification, controlling self-driving cars [21], recognizing speech [22],
predicting both customer behavior [23] and future events [15], medical screening [24] and
many more. Gained insights are then used to improve workflows in various use cases -
mobile keyboard predictions on smartphones [25, 26], traffic condition monitoring on IoT
devices [27], patient mortality and hospital stay time predictions on health-care devices [28],
on-demand traffic condition estimation [29], and many more [30, 31, 32, 33].
Machine learning refers to a sub-field of computer science concerned with computer programs that are able to "learn" from experience and thus improve their performance on a specific task over time [34, 35]; machine learning-based algorithms are mainly used to detect patterns in data in order to automate complex tasks or make predictions [36]. Machine learning systems find applications in many domains of science, business, and government - they are used to identify objects in images, transcribe speech into text, and select relevant results of a search, to name a few [37]. In particular, the word "learning" is a metaphor and does not imply that computer systems are artificially replicating the advanced cognitive systems thought to be involved in human learning [38]; rather, it refers to learning in a functional sense, as these algorithms can change their behavior to enhance their performance on some task based on previous experience [36]. The formal definition of machine learning, however, is
quite broad, varying from simple data summarization with linear regression to multi-class
classification with support vector machines (SVMs) and deep neural networks (DNNs) [39,
40]. The latter have recently shown extremely promising results on various complex tasks
such as image classification, natural language understanding [41], sentiment analysis and
question answering [42], language translation [43, 44], etc. Neural networks have countless
applications nowadays in many different areas and are used by many people worldwide in
their daily routines [45]. One key enabler of all these machine learning techniques is the ability to learn models given a very large amount of data [15]. This promises great improvements in many different applications and is one of the reasons behind the rapidly growing academic interest in smart devices that are able to generate huge amounts of data.
Nevertheless, there are some important aspects that require careful consideration when
dealing with such heterogeneous and more importantly, quite sensitive data generated by
these devices. The large-scale collection of data entails some risks of improper usage and
data mishandling, which might have huge consequences [46]. In fact, much of the data
generated and collected by these devices is private in nature [47] and often contains privacy-
sensitive information, making data owners reluctant to upload their data to a central server
for learning purposes [46, 19] as usually done in the standard ML setting nowadays. Examples
include personally identifiable information (e.g. passport information), payment data (e.g.
bank accounts, credit card information, payment invoices, financial documents), protected
health information (e.g. medical records containing disease information), and others [45].
Additionally, in the absence of serious privacy and security considerations, sensitive data are highly exposed to undesirable disclosure, attacks, and cyber risks [45], especially when dealing with large-scale data collection. Such threats might result in disclosing a huge amount
of private data to the public and, therefore, affect millions of users worldwide with eBay2
(145 million users affected in 2014), Yahoo3 (3 billion users affected in 2013-2014), LinkedIn4
(millions of SHA-1 passwords breached by an attacker in 2016), and Equifax5 (147.9 million
customers affected in 2017) being among the worst examples of data breaches recorded in the
21st century [48].
Thus, due to the growing concerns about data privacy and confidentiality, in many industry
sectors e.g. banking and healthcare domains [49], sharing data is even strictly forbidden as
it might violate many regulations [50, 51, 52] like the European Commission’s General Data
Protection Regulation (GDPR) [53, 54], the Consumer Privacy Bill of Rights in the U.S. [55],
the Health Insurance Portability and Accountability Act (HIPAA) [56], etc. Nonetheless, these data-related restrictions conflict with the traditional setting in which machine learning is applied, which usually requires all training data to be stored at a centralized location [57, 10] - a setup that might compromise users' privacy, for example through eavesdropping attacks. Therefore, an
alternative way to benefit from the rich-data devices without violating confidentiality and,
thus, ensuring their data privacy has to be considered.
A new, emerging technology that arises on the horizon and deals with many of these issues
is Mobile Edge Computing (MEC). In the MEC framework, cloud computing capabilities
are provided to mobile devices [58], meaning that these devices have the ability to offload
their tasks to the MEC servers [59]. It consists of edge nodes, backed with storage and
computation capabilities that work together with a remote cloud, in order to successfully
perform distributed tasks at scale [15]. The data is stored locally and processed to a server with
global coordination [60, 61]. There exist some research works that have considered training
machine learning models centrally, but serving and storing them locally instead, e.g. in
mobile user modeling and penalization [62, 63]. However, due to the growing computational
and storage capabilities of the devices within distributed networks, it is possible to benefit
from these strengthened resources on each device [64]. Thus, by making use of the enhanced
capabilities [65] of the end devices, the main idea is to bring the model training closer to
where the data is being produced [66], which has led to a growing interest in federated learning.
Federated learning (FL) is a new distributed machine learning approach that enables training
on a large corpus of decentralized data [67, 68] and overcomes many of the above-stated
issues. This paradigm enables multiple edge clients to collaboratively train a common model
and produce a global inference without sharing their local private data [69, 70]. The locally
trained and optimized model parameters are being aggregated and contribute to the global
model. The raw training data is collected and stored only at the edge nodes and never sent
to a central place [15], which means that no local record leaves any device at any point
in time during the process. FL decouples the model training from the need of directly
accessing the training data and allows training of machine learning models even on private,
strongly user-based information, without compromising privacy as the data never leaves the
device [10, 18]. Additionally, FL provides security guarantees with security techniques such
2 https://www.ebay.de/
3 https://de.yahoo.com/?p=us
4 https://www.linkedin.com/
5 https://www.equifax.com/
as differential privacy [71] and secure multiparty computation [46]. This privacy-preserving
learning technique allows researchers and practitioners to benefit from shared models that
are trained under the coordination of a central server (e.g. cloud), having a large amount of
data at hand and without violating any regulations [10, 18]. Thus, it can be considered an
instance of the general approach of bringing the code to the data, instead of the data to the code [67],
allowing training of ML models by addressing fundamental problems in terms of privacy,
ownership, and locality of data [72, 73], which are hard to avoid in the standard ML setting.
Nonetheless, along with solving issues related to the privacy of the data, FL comes at some
cost and introduces new challenges that require special attention as well as handling [74]. For
instance, the involvement of many resource-constrained edge and cloud devices and a shared
model makes the management of FL clients challenging [75], turning the communication
between all involved devices into a potential bottleneck of FL systems. Additionally, due to
the rapid growth of clients in the FL setting, dealing with a large number of clients requires
special attention as scaling the FL training process becomes more difficult [75].
This work builds on the foundations laid by Chadha, Jindal, and Gerndt [75], who applied Serverless Computing in an FL setting to address challenges such as the efficient management of participating workers (clients), and further argues that both components in the FL setting - the server and the clients - can benefit from this new computing paradigm, called Serverless Computing (Serverless). Serverless computing has recently emerged as a new type of computation infrastructure [76], which abstracts away operational concerns such as provisioning
and management of cloud computing resources from infrastructure engineers and hands
these over to a Function-as-a-Service (FaaS) platform [77, 78]. Applications are developed and
deployed in small pieces of cloud-native code - functions, that respond to a specific event [78].
The problems that serverless technologies try to solve align closely with many of the
challenges that occur in current FL systems, which is a compelling motivation behind the
work throughout this paper. By applying serverless in an FL setting there is a potential to
achieve an improvement in terms of costs on the FL-client side and performance when dealing
with a sudden increase of computationally expensive workloads on the FL server side. An
example of such is the aggregation of the latest clients’ results from a specific FL round.
Thus, the focus throughout this work is on the intersections of these two technological
trends - it is concerned with distributed ML training on top of serverless computing, aiming
to overcome some of the above-stated issues in a typical FL setting. It builds on top of the
foundations set in the development of FedKeeper6 - a framework for efficiently managing
FL over a combination of connected Function-as-a-Service (FaaS) platforms, i.e., FaaS fabric
[75] and its successor, called FedLess, and aims at addressing known issues with the existing solution. The first part of this work targets performance and scalability: it replaces the existing parameter server [79] solution in order to achieve more efficient FL training. Moreover, it looks at different weights serialization
approaches to adopt the newly introduced parameter server implementation and achieve
a performance speedup. The second part of this work is concerned with evaluating the
system’s performance with the consideration of multiple parameter servers, aiming at a
6 https://github.com/ansjin/fl_faas_fabric
further-increased efficiency (e.g. in terms of network latencies). Based on the stated objectives,
the study answers the following two research questions:
• Research Question 1: Which technology should the current parameter server be replaced with,
in order to increase the system’s performance and scalability?
• Research Question 2: How would the system perform under the consideration of multiple FL
servers with replicated parameter servers?
By approaching and answering these two research questions, we aim at delivering the
following key contributions as an outcome:
• We improve FedLess’ performance by replacing the technology used for the parameter
server of the system, solving the problem with the slow model uploads, identified as
the system’s bottleneck. With the proposed solution, we not only achieve faster model
uploads, but faster running times for both the clients and the aggregator functions in
general, resulting in reduced costs. We illustrate the performance speedup and the cost
reduction by comparing the two parameter server approaches.
• We evaluate the system’s performance while eliminating the need for weights serial-
ization by comparing it with the existing solution, which uses a specific serialization
technique. We show that this further contributes to the reduction of the total running time.
• We demonstrate the ability of the system to handle larger models by increasing the model size up to 8 times and performing the corresponding evaluations.
• We examine the system under the consideration of multiple FL servers by integrating parameter server replication nodes.
In general, the improved FedLess system achieves more than 3 times faster running times on
the client side and more than 4 times on the FL-server side when aggregating the latest model
updates. The exact difference varies depending on the corresponding test case considering
the model size as well as the number of clients involved in the training. Nonetheless, the new
solution is always superior to the existing system’s parameter server technology. As a result
of that, we are able to significantly cut the costs associated with FL training using FedLess.
The remainder of this paper is structured as follows: Section 2 introduces the concepts
of the technological trends at the core of this work in depth, namely FL and serverless, by
defining the key terminology used throughout this study. It furthermore describes a typical
FL and serverless workflow; Section 3 presents different applications of FL and serverless in
the industry as well as related academic findings; Section 4 describes the problem statement
and the approach in detail. Additionally, it presents the changes introduced within this
work in terms of the system design and workflow. It concludes with details concerned with
the implementation of the system; Section 5 presents the results of this study comparing
the two technologies for the parameter server in terms of their running times as well as
associated costs, presenting the exact improvements in detail. The section concludes with a
brief discussion of the findings; Section 6 concludes the whole study with a summary of the
exact system evaluations and discusses potential improvements and future work.
2 Background
The following sections provide background information about the concepts that are used
throughout this work. Section 2.1 introduces the origin, popularity, definition, and challenges
of Federated Learning. It furthermore presents the typical workflow in an FL setting. Similarly,
Section 2.2 gives information about what is being referred to and understood as Serverless
Computing (or simply Serverless) nowadays, the popularity of the term over the last decades as
well as its most recent applications.
Figure 2.1: Popularity of the term Federated Learning as reported by Google Trends
Ever since its first mention [46], the concept of FL has received great interest among both
academia and industry and has been thoroughly studied as it promises great improvements
in many different applications [84, 10, 85, 86], providing a solution to the problem of input
data privacy [87]. In practice, FL demonstrated its success in a wide range of applications in
many different areas [11, 81, 88, 82, 89, 90, 91] including learning sentiment, semantic location,
adapting to pedestrian behavior in autonomous vehicles, activities of mobile phone users,
face detection, voice recognition, and others. Google2 , for example, made use of the Federated
1 https://trends.google.com/
2 https://www.google.com/
Averaging algorithm (FedAvg) [10] and applied it to Google's Gboard in order to improve next-word [11] as well as emoji [92] predictions based on users' historical text data.
Another privacy-sensitive domain where the training data are distributed at the edge and
that benefits from federated learning is healthcare, with a wide range of concrete applications
such as diagnosis prediction [93], meta-analysis of brain data [94], predicting health events
like heart attack risk from wearable devices [95, 72], and many more [96, 97, 98, 99].
\[
\min_{w \in \mathbb{R}^d} f(w) \quad \text{where} \quad f(w) \overset{\text{def}}{=} \frac{1}{n} \sum_{i=1}^{n} f_i(w) \tag{2.1}
\]
partitioned over K different clients. Thus, in order to consider the multiple partitions in which
all the clients are located, one needs to modify the expression by splitting up the objective
function f, which results in the following updated formulation of the original optimization
problem:
\[
f(w) = \sum_{k=1}^{K} \frac{n_k}{n} F_k(w) \quad \text{where} \quad F_k(w) = \frac{1}{n_k} \sum_{i \in \mathcal{P}_k} f_i(w) \tag{2.2}
\]
This contrasts with the IID assumption made by distributed optimization algorithms, namely,
that the partition is formed by a uniform, random distribution of the training examples over
the involved clients [10].
The application of stochastic gradient descent (SGD) and its variants seem a natural choice
in the FL setting, given the number of its successful applications in the context of deep
learning in recent times [10]. Even though SGD can be applied in its original form to the
federated optimization and provides computationally efficient results, as reported by [10], it requires a very large number of training rounds in order to produce good models - a downside examined thoroughly by Ioffe and Szegedy [102]. In their experiments, the authors trained an MNIST model for 50,000 steps on mini-batches of size 60 even with the use of batch normalization, illustrating the incredibly large number of rounds involved.
In order to address this disadvantage and increase performance, McMahan et al. [10] adopt large-batch synchronous SGD, an approach evaluated by Chen et al. [103] in the data center setting, and successfully transfer it to the federated setting with some necessary modifications, referring to the resulting algorithm as FederatedSGD (or FedSGD) [10]. In
FederatedSGD, instead of sending back the gradients, each client, by using the current model
and its local data, takes a single step of gradient descent. The server then computes a weighted
average of the resulting models [10]. A more formal definition of this approach takes the
form
\[
w_{t+1}^{k} \leftarrow w_t - \eta \nabla F_k(w_t), \quad \text{followed by} \quad w_{t+1} \leftarrow \sum_{k=1}^{K} \frac{n_k}{n}\, w_{t+1}^{k} \tag{2.3}
\]
where $\eta$ is the corresponding learning rate. This can then be further extended by adding more computation on each client, iterating the local update $w^k \leftarrow w^k - \eta \nabla F_k(w^k)$ multiple times before the averaging step, which is termed by the authors FederatedAveraging
or FedAvg [10] and can be considered as the current state-of-the-art method for federated
learning [104].
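To make one FedAvg round concrete, the following is a minimal sketch in plain Python/NumPy under simplifying assumptions: a least-squares model stands in for the real objective, all clients participate in every round, and names such as client_update and fedavg_round are illustrative rather than taken from FedLess.

import numpy as np

def client_update(w, features, targets, lr=0.01, local_epochs=1):
    # Local gradient steps on the client's private data (least-squares loss as a stand-in).
    w = w.copy()
    for _ in range(local_epochs):
        grad = features.T @ (features @ w - targets) / len(targets)
        w -= lr * grad
    return w, len(targets)

def fedavg_round(w_global, clients):
    # clients: list of (features, targets) pairs, one per participating client.
    updates, sizes = [], []
    for features, targets in clients:
        w_k, n_k = client_update(w_global, features, targets)
        updates.append(w_k)
        sizes.append(n_k)
    # Weighted average of the client models, as in Eq. (2.3).
    return sum((n_k / sum(sizes)) * w_k for n_k, w_k in zip(sizes, updates))

# Toy run: three clients with differently sized local datasets.
rng = np.random.default_rng(0)
clients = [(rng.normal(size=(n, 5)), rng.normal(size=n)) for n in (20, 50, 100)]
w = np.zeros(5)
for _ in range(10):
    w = fedavg_round(w, clients)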
Due to great interest in federated learning among researchers, there exist many variations of
the FedAvg nowadays, studying primarily the drawbacks of the approach and seeking potential
improvements. In fact, the way in which the participating clients communicate their local model updates to the server is of great importance for both model convergence and system efficiency [18], giving rise to the main differentiation between synchronous FL and asynchronous FL. In the
original synchronous federated learning approach, clients suffer waiting times, implying more
sensitivity to local model training times and irregular or missed model updates, thus, having
a negative impact not only on the client-server communication but also on the scalability of
the system [105].
Therefore, providing the possibility for clients to calculate their new local updates without
the need to wait for other, delayed clients from previous rounds can be seen as a great
improvement in terms of the system’s performance. Unlike a synchronous update scheme such
as FedAvg, an asynchronous one would allow for an increased amount of data that is being
processed simultaneously and, thus, tends to achieve a higher system efficiency [18]. In fact,
asynchronous training is widely used in traditional distributed SGD [106, 107, 108, 109] and
usually converges faster than synchronous SGD in real time due to parallelism. Nonetheless,
it is impractical in the FL context given the existing unreliable and slow communication as it
directly sends gradients to the server after each local update [110]. Thus, one can find some
interesting research works dealing with the application of an asynchronous scheme in the FL
context, which are presented in Section 3.
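To illustrate the contrast with the synchronous scheme above, the following sketch shows one generic way an asynchronous server could merge a client update as soon as it arrives, damping stale contributions. The mixing rule and the constant alpha are illustrative assumptions and do not reproduce any specific published algorithm.

import numpy as np

def async_apply_update(w_global, w_client, client_round, server_round, alpha=0.5):
    # Merge a single client update on arrival, damping updates computed on stale models.
    staleness = max(server_round - client_round, 0)
    mix = alpha / (1.0 + staleness)
    return (1.0 - mix) * w_global + mix * w_client

# The server no longer waits for a full round; updates are folded in as they arrive.
w = np.zeros(5)
w = async_apply_update(w, np.ones(5), client_round=3, server_round=7)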
2.1.3 Challenges
Even though FL appears to be similar, in terms of its structure, to the distributed optimization
approach for statistical learning [114, 115] in a data center or traditional private data analyses,
for example, there exist a number of important differences, resulting in some new challenges
that arise in the FL-setting [74]. These challenges, therefore, necessitate an additional effort
and consideration of new, concrete solutions when applying federated learning techniques, in
order to implement an efficient FL system and benefit from all the advantages in the federated
learning context. The following list contains some of the most recurring, key challenges faced
in the FL context:
• Communication cost: Since training a model over decentralized data heavily relies
on communication, one of the core challenges in the federated learning setting is
the cost of that communication occurring [116]. Communication can be seen as a
serious bottleneck in that setting, being limited by the bandwidth of wide-area mobile networks [117, 10]. Network communication can thus be slower than local computation by many orders of magnitude, as federated networks are usually comprised of a huge number of clients [118, 119]. As a result, the slower network communication affects the performance of the
system [64, 120, 19]. Potential approaches to overcome the expensive communication
issue include reducing the total number of communication rounds or reducing the size
of the messages that are being transmitted each round [64].
in FL takes place over a high-latency network, the size of the network might addi-
tionally cause only a small portion of the devices to be active at once [121], which
could hinder these devices from participating in an FL round. Thus, heterogeneous
hardware toleration and robustness to dropped devices in the network are to be
considered thoroughly when dealing with federated learning [64].
– Data heterogeneity: In FL there is no control over the data distribution of the
participating computational nodes [87], which results in heterogeneity in the train-
ing data available at the devices. The local data distributions on the clients are
usually based on a single user or a group of users. These different contexts lead to
significant data distribution skew across all the devices [116]. As a consequence, the data set of a single client is not representative of the total population across all clients, which violates the commonly used independent and identically distributed (I.I.D.) assumption. Additionally, the number of data points collected across devices may also vary significantly. As
some clients might have significantly larger local data sizes than others, hetero-
geneity may cause stragglers [122, 123] (slow clients), especially in large-scale FL
training scenarios, that can potentially impact both training throughput and model
accuracy [19, 124, 125]. There exist several approaches that might help to overcome this issue; nonetheless, to the best of our knowledge there are no concrete guidelines for every specific case. One potential approach would be to modify existing algorithms, for example through hyper-parameter tuning [74]. Additionally, data augmentation might be an appropriate choice to handle the problem and make the data across participating clients more similar in some applications [126]. A minimal sketch of simulating such label skew is given directly after this list.
• Privacy: Even though privacy is one of the core motivations behind the use of FL
contrasting to the standard distributed ML approach, there are some concerns that
require special consideration in FL applications as well. FL can be seen as the better
choice in terms of data protection as it shares the model updates instead of the raw
data [127, 128], nevertheless, sensitive information can still be revealed throughout the
communication involved within the training process, when communicating with the
central server, for instance [129].
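As referenced in the data heterogeneity item above, the following is a minimal sketch of how such label skew can be simulated when experimenting with FL: each client receives samples from only a few classes. The partitioning scheme and its parameters are illustrative assumptions and not the data splits used in our later evaluation.

import numpy as np

def label_skew_partition(labels, num_clients, classes_per_client=2, seed=0):
    # Each client receives a random half of the samples from only a few classes,
    # so local datasets are both class-skewed and unequally sized (overlap is possible).
    rng = np.random.default_rng(seed)
    classes = np.unique(labels)
    partitions = []
    for _ in range(num_clients):
        chosen = rng.choice(classes, size=classes_per_client, replace=False)
        idx = np.flatnonzero(np.isin(labels, chosen))
        partitions.append(rng.choice(idx, size=len(idx) // 2, replace=False))
    return partitions

labels = np.random.default_rng(1).integers(0, 10, size=1000)   # toy 10-class labels
parts = label_skew_partition(labels, num_clients=5)
print([len(p) for p in parts])                                  # uneven, skewed splits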
This list is far from complete as it contains only the most recurring problems in the FL
context identified in the literature. In addition to the above-mentioned challenges, we can
further note lacking convergence of the shared model, efficient hyper-parameter tuning, and
more [74, 18, 130, 131, 132, 133, 47] as issues that require special attention and handling in the
FL setting. Similar to the core problems listed above, these issues are an area of active research
among academia. Based on the amount of research work in the form of scientific papers and
books, one can expect interesting findings and concrete solutions to these problems in the near future.
Figure 2.2: Popularity of the term Serverless as reported by Google Trends (2004 - 2022)
developers do not need to concern themselves with provisioning or operating servers since
serverless solutions are server-hidden. They are designed to hide how scaling is done, in
other words, abstract servers away [139].
Nevertheless, as with many different trends in the software community, there does not
exist a clear view of what serverless is and, in fact, it encompasses two different but in some
sense similar and overlapping ideas [143]. On the one hand, serverless is used to describe
applications that fully incorporate third-party, cloud-hosted applications and services, in order
to manage server-side logic and state [143]. Prominent examples include the so-called "rich
client" applications - single-page web applications (SPAs), mobile applications, authentication
services (e.g. Auth0, AWS Cognito5 [144]), databases (e.g. Parse, Google’s Firebase6 [145], AWS’
DynamoDB7 [146]), and so on, all being referred to as (Mobile) Backend-as-a-Service or shortly
(M)BaaS. Some of these services even provide "cloud functions" - the ability to run server-side
code on behalf of a mobile app, without the necessity to manage servers [78]. An example of
such service is Meta’s8 Parse Cloud Code.
On the other hand, serverless describes applications with server-side logic still written
by the application developer, however, unlike traditional architectures, it’s run in stateless
containers that can be described with the following characteristics: event-triggered, ephemeral
(may only last for one invocation), and fully managed by a third party [143]. The latter is
being referred to as Function-as-a-Service or in short - FaaS, which is the more commonly used
and more descriptive name for the core of serverless offerings [138]. In fact, both BaaS and
FaaS concepts are related in their operational attributes (e.g. lacking resource management)
and are frequently used together [143], jointly building the collective term serverless.
In the cloud, serverless computing appeared as an attractive alternative to already existing
cloud computing paradigms, which demand comparably more work when it comes to
operational concerns such as provisioning and management of cloud computing resources as
serverless does not require dedicated infrastructure [147]. Referring to the official, opening
description of AWS’s Lambda, it is defined as follows: "AWS Lambda lets you run code without
5 https://aws.amazon.com/cognito/
6 https://firebase.google.com/
7 https://aws.amazon.com/dynamodb/
8 https://about.facebook.com/meta/
provisioning or managing servers.", which compactly describes one of the core ideas of serverless
- abstracting away server management. With FaaS, the vendor handles all underlying resource
provisioning and allocation [143]. Simplifying the way in which scalable applications are
being developed in a novel, cost-effective manner, serverless platforms can be seen as the
next step in the evolution of cloud computing architectures [148]. Serverless describes a more
fine-grained deployment model in contrast to other cloud application models, for instance,
Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS) (e.g. Google AppEngine9 [149], and
Elastic Beanstalk10 from AWS [150]), and container orchestration or Container-as-a-Service (CaaS).
These fine-grained, small pieces of cloud-native code, typically written in popular modern
programming languages such as JavaScript, Python, Go, C#, any JVM language (e.g. Java,
Clojure, Scala), etc., are also referred to as functions. The functions are usually triggered in
response to an inbound HTTP request or some predefined event determined by the provider
[78, 138] such as S3 (file/object) updates, scheduled tasks, and messages added to the message
bus with AWS [143], for instance.
As a cloud application development model, however, the exact definition of serverless
might still more or less overlap with other cloud models such as PaaS and Software-as-a-Service
(SaaS) [78]. Figure 2.3 provides an overview of different existing cloud computing approaches
and visualizes what resources are being provided with each of these different services: for
Infrastructure-as-a-Service (IaaS), only a network and a computer are provided, typically a
virtual machine (VM); in addition to that, Container-as-a-Service (CaaS) provides a (host)
operating system as well as a container run-time; Platform-as-a-Service (PaaS) adds tools and
libraries as well as application run-time to the already available resources from the preceding
service. On top of that, in PaaS, serverless adds service events and a service mesh. In the case
of Software-as-a-Service (SaaS), additionally, the application and its service are given in the
hands of the consumer.
In fact, Baldini et al. [78] explain the term Serverless by considering the varying levels
of developer control over the cloud infrastructure with the IaaS model being the most
customizable among the different services, providing the most control to the developer and,
therefore, making that person responsible for provisioning the hardware or virtual machines
(VMs). That person is in charge of customizing every single aspect of how an application is
being deployed and executed, whereas PaaS and SaaS are on the exact opposite side. This
means that the developer has the least control and is unaware of any infrastructure, having
only access to prepackaged components or full applications.
Even though the PaaS model is often also seen as an approach for abstracting away the
management of servers, there are some distinct characteristics when it comes to comparing
it to what serverless (FaaS) provides. The serverless model provides a "stripped down"
programming model based on stateless functions and unlike PaaS, it allows developers to
implement code snippets not restricted to using pre-packaged applications [78]. FaaS is
about running back-end code without managing own server systems or own long-lived
server applications, which is the key difference when it comes to comparisons with other
9 https://cloud.google.com/appengine
10 https://aws.amazon.com/elasticbeanstalk/
Figure 2.3: Cloud approaches based on the SEIP lecture from msg group for TUM [151]
modern architectural trends like containers and PaaS [143]. The functions are not meant to
be long-running, just the opposite, they are designed from the start to be short-lived [147],
usually in accordance with some provider-specific, predefined timeouts for their execution
duration. In other words, in contrast to PaaS, serverless applications are geared to bring up
and down entire applications in a response to some event [143].
Furthermore, unlike SaaS or PaaS, which are always running but scale on-demand, serverless
workloads rather run on-demand, and consequently, scale on-demand [139], which makes
them an attractive alternative, due to their cost-saving billing model. In contrast to serverless,
most of the other services in the cloud-computing era follow the pay-as-you-go billing method,
which charges for allocated resources rather than actual usage [152], forcing cloud users to
eventually pay for initially-defined resources that they might not fully use. In serverless, the cloud customer never pays for procured resources that go unused; billing is typically based on increments as fine as 0.1 seconds, in contrast to many VM servers that use an hourly charging metric [152]. Applications in the serverless context are uploaded to a platform,
executed, scaled, and billed based on the exact demand needed [153]. Thus, customers
are paying only once a corresponding resource has been consumed. In fact, from a cost
perspective, the benefits of serverless architecture are most attractive for bursty, compute-
intensive workloads [154]. The latter are appropriate as the price of a function invocation is
proportional to its running time. Additionally, bursty workloads are suitable for serverless as
functions are able to scale to zero, which means that there is no cost to the customer when
the system is idle [154].
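To make the billing model tangible, the following back-of-the-envelope sketch computes the cost of a bursty workload from invocation count, billed duration, and configured memory; the prices and the 0.1-second billing increment are illustrative placeholders rather than any provider's actual rates.

import math

def faas_cost(invocations, avg_duration_s, memory_gb,
              price_per_gb_s=0.0000167, price_per_million_requests=0.20,
              billing_increment_s=0.1):
    # Each invocation is billed for its duration rounded up to the billing increment.
    billed_s = math.ceil(avg_duration_s / billing_increment_s) * billing_increment_s
    compute = invocations * billed_s * memory_gb * price_per_gb_s
    requests = invocations / 1_000_000 * price_per_million_requests
    return compute + requests

# Example: one million invocations of a 200 ms function configured with 512 MB.
print(round(faas_cost(1_000_000, 0.2, 0.5), 2))   # a few dollars; zero cost when idle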
In spite of the name, servers are still needed in serverless, however, decisions such as the
number of these servers and their capacity are taken care of by the serverless platform [78].
Though, from a functionality perspective, serverless and more traditional architectures may
be used interchangeably [154]. The determination of when to use serverless will likely
be influenced by other non-functional requirements such as the amount of control over
operations required, cost as well as application workload characteristics [154].
2.2.2 Applications
Initially introduced for web microservices and IoT applications [76] and largely influenced by
the recent shift of enterprise application architectures to containers and microservices [139],
the concept has been utilized to support a wide variety of applications [154]. It is nowadays
favored by many other applications such as event processing, API composition, data flow
control, and many more [78, 155]. Concrete examples of applications that benefit from
serverless include building a chatbot [147], programming for mobile applications [156], and
securing Linux containers [157] among many others.
The stateless nature of serverless functions places them close to functional reactive programming in terms of application structure, given the many similarities from a programming model perspective [158]. Therefore, typical use cases
for serverless would include applications that exhibit event-driven and flow-like processing
patterns [143]. The following lists some further interesting examples of the use of serverless
among practitioners within the industry: Coca-Cola11 makes use of serverless in their vending
machine and loyalty program, which lets the company achieve 65% cost savings at 30 million
requests per month. In fact, a serverless framework is a core component of the company’s
initiative to reduce IT operational costs; Expedia12 is constantly moving more and more
services towards the use of serverless such as integration events for their CI/CD platforms,
infrastructure governance, and autoscaling. As an example, they went from 2.3 billion Lambda
calls per month to 6.2 billion requests in 2017; A Cloud Guru13 uses functions for performing
protected actions such as payment processing and triggering group emails; Netflix14 is another
interesting example on the market that is making use of serverless. They use functions within
an internally developed framework [159].
Serverless relies on the cloud infrastructure to automatically address the challenges of
resource provisioning [160, 161, 162, 142, 163, 164, 165, 166, 167, 168]. Thus, the platform to
which the code snippets (functions) are deployed is solely maintained by the vendor, which
includes keeping the infrastructure running smoothly and scaling resources to satisfy function
executions caused by changes in user demand [147]. In general, from the perspective of the
cloud provider, serverless computing provides an opportunity to reduce operational costs
by efficient optimization and management of cloud resources [148]. This allows developers
and IT/operations teams to save time and resources on activities such as server maintenance,
updates, scaling, and capacity planning and focus on writing their applications’ business logic
instead [153, 138]. Thus, from the perspective of the cloud consumer, serverless platforms
provide developers or software architects with a simplified programming model for their
applications [148]. They are only concerned with creating actions that load on-demand and
are triggered to execute by system-generated events or end users [156], which means that after
uploading the code for the function to the FaaS provider, everything in terms of provisioning
resources, instantiating VMs, managing processes, etc. is in the "hands" of the provider [143].
11 https://www.coca-cola.com/
12 https://www.expedia.com/
13 https://acloudguru.com/
14 https://www.netflix.com/
This allows practitioners to not only save costs, due to the new, pay-as-you-go pricing model
but also save time and human labor allowing them to concentrate on the business logic of
their application.
Cloud users, however, are by far not the only stakeholders that can benefit from this novel
approach. In fact, large amounts of the computing power in data centers are allocated but sit unused and idle [169]. A study shows that around 10-30% of the data centers in the US are unused at any point in time, resulting in wasted resources [169]. This means that, along with the cloud users, cloud providers can benefit from the automated resource
management as well, allowing for an efficient distribution of the available resources.
FaaS functions have several distinguishable characteristics that need to be considered when
dealing with serverless and, especially, when it comes to certain design choices in various use
cases. The functions are by nature stateless, ephemeral, and even-driven. These can be further
seen as limitations or restrictions, making these functions inappropriate in certain situations.
The following list aims at explaining and describing these characteristics in more detail, even
though some of these might have already been introduced:
• State: FaaS functions are often described as stateless as there exist some restrictions when
it comes to the local state. In fact, these functions do have storage available, however,
there are no guarantees that the state is persisted across multiple invocations [143].
Thus, the assumption that the state would be available to another invocation of the
same function is not correct. If there is a need to persist the state of any FaaS function, it has to be externalized outside of the function instance so that it is available to further invocations, as sketched below.
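The following minimal sketch illustrates the externalization of state mentioned above, writing a small state object to an object store with boto3; the bucket and key names are made-up placeholders, and the choice of S3 is only one of many possible external stores.

import json
import boto3

s3 = boto3.client("s3")  # credentials and region come from the function's environment

def lambda_handler(event, context):
    # Any state needed by later invocations is written to external storage, because
    # the local container (and its disk) may not exist on the next call.
    state = {"last_processed_id": event.get("id"), "count": event.get("count", 0) + 1}
    s3.put_object(Bucket="my-state-bucket",           # hypothetical bucket name
                  Key="state/latest.json",            # hypothetical object key
                  Body=json.dumps(state).encode("utf-8"))
    return {"statusCode": 200, "body": json.dumps(state)}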
In terms of their programming model, serverless frameworks allow for the execution of a
single main function, which typically takes a dictionary as an input (such as a JSON object)
and produces a dictionary as an output [158, 78]. In addition to that, a function is further able
to access a context object, representing the environment in which the function is running (e.g.
security context) [158]. A Hello World! example of a function that takes a JSON event object as its first parameter and a context as its second is shown in Listing 2.1 below:
import json

def lambda_handler(event, context):
    # 'event' carries the JSON input, 'context' describes the runtime environment
    return {
        'statusCode': 200,
        'body': json.dumps('Hello from Serverless!')
    }
The example above is written in Python language, however, currently, cloud provider
serverless offerings support a wide variety of programming languages as already introduced.
The list contains modern languages such as JavaScript, Java, Swift, C#, etc. Some of the
providers go even further and support extensibility mechanisms for code written in any
language as long as it is packed in a Docker15 image supporting a well-defined API [158, 78].
15 https://www.docker.com/
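Since such a handler is an ordinary function that maps a dictionary to a dictionary, it can also be exercised locally before deployment; the call below assumes the AWS-style handler from Listing 2.1 and passes no real context object.

# Local smoke test: pass a plain dictionary as the event and no real context object.
response = lambda_handler({"name": "world"}, None)
print(response["statusCode"], response["body"])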
the structure of the functions is identical across all the existing providers, they do differ
from each other in some of their characteristics [78]. There exist differences in terms of
resource utilization, performance isolation efficiency, the way how users can include custom
dependencies, etc. Moreover, despite the fact that consumers are not charged for unused
computational resources, different FaaS platforms differ in their pricing models considering
factors such as the actual execution time of the corresponding function, configured memory,
number of function invocations as well as the size of the data that is being transferred [78]. In
fact, the function’s maximum memory, along with its cloud region, are the only two required
configurations that the consumer needs to specify. Nevertheless, these do have an important
effect on the function’s performance and if misconfigured, can result in function failures such
as invocation interrupts caused by an exceeded memory limit, for instance.
There exist some differences between the platforms when it comes to function compositions.
Some providers offer higher-level mechanisms for composing functions, making it possible to invoke one serverless function from another and, thus, to construct
more complex serverless applications [78]. IBM OpenWhisk, for example, allows chaining
multiple serverless functions, in order to create composite functions [78]. Another important
aspect is deployment, which is handled similarly alongside many of these providers by
presenting support for both a user interface (UI) as well as a command-line interface (CLI)
as options to deploy a particular function. AWS recommends the use of their open-source
framework for building serverless applications, called Serverless Application Model (SAM)16 .
Alongside the UI, Google provides their users a similar, command-line way of deploying
serverless applications with the use of the gcloud SDK17 , whereas Microsoft has integrated
support for functions’ deployment within their Visual Studio product.
16 https://github.com/aws/serverless-application-model
17 https://cloud.google.com/sdk/docs/install
3 Related Work
As noted previously in Section 2, federated learning is a relatively new and heavily studied
area among academia in the past couple of years. Researchers are looking at major challenges
involved in FL such as communication efficiency, system heterogeneity, statistical heterogene-
ity, and privacy among others [64]. Even though these fields enjoy quite a big popularity
among both industry and academia recently, most of the studies referenced in the following
date back to the last couple of years. This is, in fact, not surprising given that the concept of
federated learning was first proposed in 2016 [10]. Thus, one can expect that a novel approach
such as applying serverless principles in the field of federated learning might have not yet
delivered a large number of substantial research outcomes. Nevertheless, there exist some quite interesting and promising results that are worth referencing and that could turn out to be the foundations of great improvements in the FL context.
In the following, we present existing academic work that has been conducted in the fields of Federated Learning and Serverless Computing. The first two sections, namely 3.1 and 3.2, present work conducted in each field on its own. The last section, Section 3.3, provides information about research works at the intersection of the two that try to benefit from serverless in the FL setting for training purposes.
8 https://github.com/ucbrise/cirrus
In order to reduce the amount of time spent on data serialization and subsequent deserialization, Cirrus uses a special routine: the runtime provides a minibatch-based iterator backed by a local in-memory ring buffer, enabling participating workers to access the data with low latency. Nonetheless, the lack of support in Cirrus' worker runtime for popular ML frameworks such as TensorFlow could be seen as a disadvantage.
Siren is another asynchronous distributed ML framework that is making use of the serverless
architecture. It allows the execution of stateless functions in the cloud, resulting in a higher
level of parallelism and elasticity [198]. Jiang et al. [76] implemented LambdaML, a prototype
system of FaaS-based ML training on Amazon Lambda and presented comparative results
of distributed ML training between "serverfull" (IaaS) and serverless (FaaS) architectures
using the implemented platform. Within their work, the authors furthermore consider several
interesting aspects that relate to our work as well. They take a closer look at the effect of
serialization and deserialization on the communication time. With the use of two different
RPC frameworks and a Lambda function with varying memory size and, thus, CPUs, they
show that increasing the CPUs can accelerate data serialization and deserialization. As a
result of that, one could achieve a communication time speedup. Thus, the serialization
performance and, therefore, the communication time is eventually bounded by the limited
CPU resource of the corresponding function, in this case, Lambda, in accordance with the
exact evaluations performed by the authors. These findings provide us with a direction to
investigate in order to improve the performance of the system. In particular, within our work, we pay special attention to data serialization and deserialization and look for techniques to eliminate the need for these time-consuming steps, in order to accelerate the total running times of all the participating devices in an FL training.
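The following small measurement sketch illustrates why these steps matter for large models, timing a round trip of a weight vector through a generic serializer versus a raw byte buffer in Python/NumPy; the vector size and the two encodings are illustrative and are not the exact formats used by FedLess.

import pickle
import time
import numpy as np

weights = np.random.default_rng(0).standard_normal(5_000_000).astype(np.float32)  # ~20 MB of weights

start = time.perf_counter()
blob = pickle.dumps(weights, protocol=pickle.HIGHEST_PROTOCOL)
_ = pickle.loads(blob)
print("pickled size:", len(blob), "round trip:", round(time.perf_counter() - start, 4), "s")

start = time.perf_counter()
raw = weights.tobytes()                      # raw buffer, no framing or type metadata
_ = np.frombuffer(raw, dtype=np.float32)
print("raw size:   ", len(raw), "round trip:", round(time.perf_counter() - start, 4), "s")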
Stratum is another serverless platform, developed for the life-cycle management of various
ML-based data analytics tasks [199] that was identified throughout the research phase of this
work. It allows deployment, scheduling, and dynamic management of ML-as-a-service for
inference.
Serverless optimization is another topic that is heavily studied in academia. Researchers investigate the benefits of employing the serverless approach for various optimization methods, and several works have examined the impact of serverless on diverse optimization tasks. OverSketched Newton, for instance, is a randomized
Hessian-based optimization algorithm that solves convex optimization problems in serverless
systems [200]. Interestingly, it demonstrates a reduction of around 50% in total running time
(on AWS Lambda) as compared with the state-of-the-art distributed optimization schemes.
Even though most of the referenced works and solutions have dealt with the use of serverless for ML inference, it is to be noted that training ML models with serverless is steadily gaining attention in academia [76]. Given that high interest, more promising research works focusing on training ML models on top of an underlying FaaS infrastructure can be expected in the near future [76].
9 https://github.com/ansjin/fl_faas_fabric
The system comprises several sub-components that interact with each other in order to
execute FL training, namely - FL-Server, Client Invoker, Weights Updater, and Client Register,
each being a function. The Client Register sub-component creates and stores all the information
about the clients that are going to participate in the following training in a MongoDB database.
The information includes the client’s ID, the URL through which the corresponding client
could be triggered within the training process, the type of the FaaS platform on which the
client was created (e.g. gcloud) as well as additional authentication information required. The
FL-Server sub-component is then responsible for the model and hyperparameter selection. It
decides on the model that is to be trained by specifying various parameters such as input and
output size, number of layers, etc. as well as the hyperparameters such as the optimizer and
the learning rate. These configurations are then used by the Client-Invoker for the invocation
of the different clients that are going to participate in the upcoming training round. The
Client-Invoker then creates n invoker-functions, each one responsible for the trigger of the
corresponding FL-client, where n is the number of clients required in the FL training round,
visualized on the right-hand side within the OpenWhisk local component in Figure 3.1. Each
invoker function fetches all the necessary information about its corresponding client function
as well as the current model weights and triggers the client function via an HTTP request,
forwarding the data within the request body. The client then compiles the model, sets the
model weights, updates them in accordance with the predefined hyperparameters, and sends
the updated weights as an HTTP response to the corresponding invoker function, which then
persists the client results into the parameter server. Finally, after the successful execution of
all the specified clients, the Weights-Updater aggregates the updated clients’ weights, persists
them so that they are used and passed to the participating clients in the next training round,
and notifies the FL-Server for the completion of the current training round. This process is
then repeated until a certain threshold of a predefined metric (e.g. accuracy) is achieved.
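A minimal sketch of what such an invoker function could look like is given below; the payload fields, the endpoint handling, and the save_to_parameter_server callback are hypothetical illustrations of the described flow rather than FedKeeper's actual implementation.

import requests

def invoke_client(client_url, auth_token, model_config, hyperparams, global_weights,
                  save_to_parameter_server):
    # Trigger one FL client over HTTP, forwarding the current global model in the body.
    payload = {
        "model": model_config,          # e.g. layer sizes, input/output dimensions
        "hyperparams": hyperparams,     # e.g. optimizer, learning rate, epochs
        "weights": global_weights,      # current global model parameters
    }
    response = requests.post(client_url, json=payload,
                             headers={"Authorization": auth_token}, timeout=300)
    response.raise_for_status()
    updated_weights = response.json()["weights"]
    # Persist the client's result so the aggregator can pick it up later.
    save_to_parameter_server(updated_weights)
    return updated_weights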
With their work, the authors provide a novel approach that addresses problems in FL
training by intersecting two emerging technological trends. This sets up the foundations of
an interesting research area and strengthens expectations from [76], which could increase the
number of applications and researchers in both industry and academia that focus on training ML models using an underlying FaaS infrastructure in the near future.
FedLess [204] can be seen as the successor of FedKeeper that addresses open issues with
the initially introduced framework. FedLess extends FedKeeper by providing additional vital
features like authentication, authorization, and differential privacy. The first two come as
a result of two requirements. On the one hand, only the FL server that the involved data
holders interact with can call their client functions and, on the other hand, the client functions
have to be approved by the FL server, in order to be included as valid ones. To achieve this,
FedLess extends FedKeeper by integrating an external identity provider with the use of AWS
Cognito. The authentication and authorization are achieved through the use of JSON Web
Tokens (JWTs) [205] that are attached to the HTTP requests between the communicating clients
and the FL-server.
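On the receiving side, a client function can verify the attached token before doing any work. The sketch below uses the PyJWT package and assumes that the identity provider's public key and the expected audience are already available to the function; how they are obtained (e.g. from a Cognito JWKS endpoint) is omitted, and the sketch is not FedLess's actual authorization code.

import jwt  # PyJWT

def verify_request(headers, public_key, expected_audience):
    # Reject calls that do not carry a valid, signed JWT issued for this client function.
    auth = headers.get("Authorization", "")
    if not auth.startswith("Bearer "):
        raise PermissionError("missing bearer token")
    token = auth[len("Bearer "):]
    # decode() checks the signature, expiry, and audience claim; it raises on failure.
    return jwt.decode(token, public_key, algorithms=["RS256"], audience=expected_audience)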
In addition to that, FedLess focuses on strengthening and protecting the privacy of the
training data, introducing (Local) Differential Privacy - an approach in which all the client
functions distort updated parameters before communicating them to the parameter server.
Moreover, looking at the performance of the system, FedLess is faster than its predecessor and
is able to easily train more complex models across up to 200 clients and beyond. Along with
these changes, FedLess extends the list of functions in the FaaS fabric and introduces support
for AWS Lambda, Azure Functions, and IBM Cloud Functions.
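As a rough illustration of the local differential privacy idea mentioned above, the sketch below perturbs a client's updated layer weights with Gaussian noise before they leave the client function. The clipping bound and noise scale are arbitrary placeholder values and do not represent FedLess's actual mechanism or noise calibration.

import numpy as np

def perturb_update(layer_weights, clip_norm=1.0, noise_scale=0.1):
    # Clip each layer to bound its norm, then add Gaussian noise locally
    # before the parameters are communicated to the parameter server.
    noisy_layers = []
    for layer in layer_weights:
        layer = np.asarray(layer, dtype=np.float32)
        norm = np.linalg.norm(layer)
        if norm > clip_norm:
            layer = layer * (clip_norm / norm)
        noisy_layers.append(layer + np.random.normal(0.0, noise_scale, size=layer.shape))
    return noisy_layers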
Figure 3.2 provides an overview of the improved system and its most important components.
All the components within the Control Plane, running in a Kubernetes cluster, belong to and are managed by FedLess. These include the Parameter Server, the Client Database, an OpenWhisk cluster containing the Aggregator Function, which is responsible for aggregating the model updates provided by all the involved clients, and the FedLess Controller. In contrast to FedKeeper, FedLess drops the separate Invoker Functions for calling the clients and introduces a new invocation strategy that gives the clients direct access to the Parameter Server. This is a result of the large model sizes, which make it impractical to include the weights in the HTTP body due to the existing payload limits. The HTTP-callable external clients, including the newly introduced providers, can be seen on the right-hand side. The external authorization and authentication provider, the Auth Server, is depicted above the Control Plane in the diagram.
The evaluation results from the two frameworks motivate us to further investigate and
improve the system by extending it with new features as well as improving already existing
ones. With this work, we aim at addressing performance-related issues, identified as the main
bottleneck of the system. These, as well as the exact evaluations as a result of the introduced
improvements, are presented in the following sections.
27
4 Towards High Performance Federated
Learning Using Serverless Computing
This work aims to address some of the major challenges involved in FL, such as communication efficiency [64], and to achieve better performance of the system under development. It consists of evaluating and improving an existing solution in the field of serverless FL and builds on the foundations set by two consecutive frameworks - FedKeeper [75] and its successor FedLess - both introduced in Section 3.4. Thus, the main objective of this study is to evaluate and improve the functionality of FedLess by addressing performance and scalability issues that have been identified throughout previous experiments. Based on the identified problems of the system, we define and summarize this work with two research questions (RQs), which we aim to answer by further improving the system and evaluating its performance. Providing answers to these questions is considered the desired outcome of this research work.
In the following, Section 4.1 presents the exact definition of the problems of this study, summarized as research questions (RQs), in detail. This is followed by Section 4.2, which provides information about the methodology followed throughout this study in order to address these problems. Section 4.3 describes the changes made to the FedLess system in the scope of this work, in terms of both the system architecture and the training workflow. We conclude with Section 4.4, which introduces a technology used throughout the implementation phase, followed by the implementation details in Section 4.5.
1 https://www.mongodb.com/
2 Throughout this study, we refer to a scenario as large-scale once more than 100 clients are involved in an FL training round.
bad scalability of the system. Thus, switching to another technology that can easily support many simultaneous connections and uploads of large models is a desired outcome within the scope of this work.
The initial idea is to replace the MongoDB parameter server with an In-memory database (IMDB) and evaluate the system's performance in comparison with the old implementation. Indeed, we end up integrating such a technology after supporting our initial beliefs with a detailed literature and industry review. With this analysis, we aim to gain insights into what is being used in similar production systems in both academia and industry and, thus, to accept or reject our initial hypothesis of replacing the parameter server with an in-memory database. The exact findings gathered throughout that research, and the resulting decision on the technology to replace the parameter server, are presented below.
In addition to that, aiming at increased efficiency (e.g. in terms of network latencies), a performance evaluation under the consideration of multiple servers comprises the second desired outcome of this research work. The initial intuition that we want to investigate with this research question is that using multiple FL servers with replicated parameter servers could reduce the costly interaction with the central server. As a result, the overall training time is expected to decrease, leading to lower execution costs compared with the current FedLess solution. Given that FedLess operates over a combination of connected FaaS clients, referred to as the FaaS fabric, and thus relies on various cloud providers, the latter is of great interest within this research work. Reducing the training costs would make the usage of the framework cheaper. Moreover, with multiple replicated parameter servers, one could also reduce privacy concerns for geographically distributed clients spread over multiple regions, which might be of great interest for the cross-silo FL setting.
Based on the above-stated objectives, and as already noted in the introductory Section 1,
this study aims at answering the following two research questions (RQs) by extending and
executing evaluations on FedLess’ performance:
• Research Question 1: Which technology should the current parameter server be replaced with,
in order to increase the system’s performance and scalability?
• Research Question 2: How would the system perform under the consideration of multiple FL
servers with replicated parameter servers?
4.2 Approach
The following sections describe the approaches followed throughout this study in order to
address both research questions and, therefore, provide the desired comparative outcomes
related to the performance of the system. Section 4.2.1 gives information about what has
been discovered while approaching the first research question including the outcomes of the
literature review. Section 4.2.2, analogously, provides information about the findings while
considering the second research question.
3 https://aws.amazon.com/s3/
than S3, resulting in a faster round of communication [76], which is in line with our initial considerations and strengthens the intention of replacing the current parameter server solution with an in-memory data store technology, in order to decrease the latency and achieve faster communication rounds.
As AWS S3 is an object storage service that, just like MongoDB, the technology FedLess currently uses, stores data on disk, replacing the current storage component with AWS S3 was not taken into further consideration, as we aimed at a shift towards a technology that introduces lower latency and, thus, faster access times. As already argued above with reference to existing research work, Memcached introduces lower latency than S3, resulting in a faster FL training round [76]. Thus, being able to support our initial beliefs with concrete scientific results from existing performance examinations in academia, the decision fell on using an in-memory data store solution as the parameter server replacement in FedLess.
Based on the literature findings, we reduced the list of potential candidates for the parameter server replacement to two main options, namely Redis and Memcached, which, according to the referenced research work and the benchmark introduced by its authors, show similar performance [76]. In fact, the authors give a slight advantage to Memcached due to its multi-threaded architecture, which is not present in Redis and which results in better performance and, therefore, lower communication latency throughout training when considering a larger ML model [76]. Nonetheless, we discovered several other promising advantages of using Redis in our context throughout the research. These are listed in the following.
A promising finding that we discovered through our own Internet research is the RedisAI4
module by Redis. However, we did not find the use of RedisAI in any of the reviewed research
works and the open-source frameworks, which might not be surprising given the fact that it
is a relatively new module by Redis, officially introduced on May 12, 2020, according to the
official repository release notes5. Nevertheless, given that its main purpose is to serve
a variety of ML tasks, we decided to take it into account within the considerations of potential
parameter server replacements along with the Memcached option. According to the official documentation, RedisAI allows practitioners "to run inference engine where the data lives, decreasing latency and increasing simplicity — all coupled with the core Redis Enterprise features".
Being referred to as one of the benefits that come with the module, reduced latency matches
our needs within the scope of the first research question. Designed and built exclusively to support ML tasks such as the deployment and management of ML models, RedisAI appears to be a good fit and a potential solution to the existing problem of high communication costs in FedLess. Especially interesting is the fact that RedisAI allows users to directly store and, respectively, retrieve model parameters as tensors, removing the need for preparation steps such as serialization and deserialization of the model weights, which are necessary with the current MongoDB solution. Removing the serialization overhead could result in significantly improved latency and, therefore, shorter training rounds, allowing us not only
to benefit from faster, in-memory accesses to the stored objects, but also to further reduce the
cost by completely removing the need for any intermediate steps.
4 https://redis.com/modules/redis-ai/
5 https://github.com/RedisAI/RedisAI/releases?page=1
Based on these findings, the decision for the technology to replace the current parameter
server solution fell on Redis with the use of the RedisAI module. Even though, as noted above, Memcached is considered another strong in-memory data store option, which is furthermore built on a multi-threaded architecture, the benefits of RedisAI seem more promising. Being
able to completely remove the weights serialization and deserialization steps would not only
have a positive impact on the actual model training that occurs on the client-side in terms of
communication costs but also significantly improve the aggregation step that occurs on the
FL server side.
Thus, we decided to examine the system’s performance under the consideration of two
different approaches in order to address the open problem of communication cost in FedLess.
We performed multiple experiments by replacing our storage component with Redis and
examining two different implementations. At first, we evaluated the framework’s performance
with Redis without the use of any additional modules such as RedisAI. We kept the weights
serialization solution as is and only replaced the underlying parameter server with Redis.
This, however, did not provide the desired improvements. Afterward, we shifted towards the
use of the RedisAI module and completely removed any weights serialization involved. With this change, we not only reduced the cost of a training round with a small number of clients involved but also managed to outperform the MongoDB solution even in the large-scale scenario with up to 300 clients involved. This, moreover, resulted in a more stable solution when it comes to the use of larger model sizes.
The resulting design and workflow changes can be found in Section 4.3, the corresponding implementation details in Section 4.5, and the outcomes with the exact performance results in Section 5.
The machines providing data storage services in the Redis context are called "nodes" [207].
Each of the nodes takes a certain amount of memory space on each machine, configurable
by the practitioner in accordance with the corresponding needs. Replication is the process
that allows specific nodes to be exact copies of others, referred to as master nodes. Redis
provides different ways of achieving replication, which mainly depends on the context in
which replication is applied.
We go through the differences between the replication techniques in Redis with the help of the comparison diagram in Figure 4.1, taken from the official AWS documentation. As shown, there exist two distinct implementations of replication in Redis, namely replication with cluster mode disabled, which can be seen on the left-hand side of the diagram, and replication with cluster mode enabled, on the right-hand side of the diagram. The former consists of a single shard, with each node holding all of the cluster's data, whereas the latter has its data partitioned across up to 500 shards, each having a single read/write primary node and up to 5 read-only replica nodes. The number of shards, however, is configurable, as Redis allows for a varying shard-replica ratio - one may create a cluster with a higher number of shards and a lower number of replicas, and vice-versa.
Figure 4.1: Cluster mode disabled (left) vs cluster mode enabled (right) replication in Redis 6
The cluster mode disabled replication supports read-write isolation by assigning write operations to the single master (primary) node and directing read operations to the replica nodes [208], whose number may vary between 0 and 5. It furthermore supports scaling and is considered to improve the read throughput, making it a great choice for read-intensive applications. Based on the experience gained throughout the experiments concerned with RQ 1, we considered the latter a strong argument for using the cluster mode disabled approach in our experiments, as our system showed good write performance but comparably worse read performance with Redis as the underlying parameter server technology.
6 The Redis replication comparison diagram is taken from the AWS documentation available at: https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/Replication.Redis-RedisCluster.html
Starting with the changes, we replace the underlying technology for the Aggregator Function,
switching from OpenWhisk to OpenFaaS. The main reason for this is the unstable OpenWhisk solution. Throughout the experiments, we experienced issues with OpenWhisk, as the aggregator function was not able to handle the aggregation of multiple clients' results, especially in the large-scale scenario involving 100 or more clients, and ended up in a failed function state. During the experiments, we could not achieve stability for the aggregator function despite trying many different configuration parameters. Thus, we switch to an OpenFaaS Aggregator Function, which introduces better performance and is able to handle the aggregation of up to 300 clients even with an increased ML model size. For the sake of completeness, it is to be mentioned that, in contrast to OpenFaaS, OpenWhisk seems poorly maintained. More concretely, the latest OpenWhisk release7 dates back to Nov 10, 2020, making it almost two years old, whereas the latest stable release of OpenFaaS8 is just a couple of months old. Moreover, the community behind OpenFaaS seems much more active and involved, resulting in ten times fewer open issues compared with OpenWhisk, which by contrast gives the impression of lacking support.
In addition to the existing three components in the FedLess Controller, we introduce a new
one named Database Strategy. Just like the FL Strategy, the Database Strategy is to be selected at
the beginning of each training session by the administrator, who is also responsible for the
hyper-parameter configurations. The administrator is able to choose between two different
options to store all the necessary information throughout an FL training session such as client
and model information, client results, etc., namely MongoDB and Redis. Based on this choice
at the beginning of the session, the system uses the corresponding underlying technology for
all the FL training rounds throughout the whole training session.
Another noticeable difference with the FedLess architecture from Figure 3.2 is the newly
introduced Redis database that comprises the Parameter Server as well as the Client Database
components. These have the exact same purpose and role as the ones with MongoDB. However,
due to the RedisAI module, the clients’ results as well as the latest model parameters are
stored directly as tensors. Detailed information on how this is achieved is provided in Section
4.5.
Client Side
Figure 4.3 shows the differences between the two parameter server approaches in a direct
comparison. Both diagrams contain events representing concrete steps that occur throughout
an FL training round of the system once a client has been called to participate. These distinct
events can be observed in a top to bottom manner starting with downloading the model and
the latest parameters (Model Download) and ending with the persistence of the calculated
client results (Model Upload). The execution durations of these events contribute to the total duration of an FL training round and are measured and thoroughly evaluated as presented in the evaluation in Section 5.
Figure 4.3: MongoDB vs Redis client-side workflow comparison for a complete FL training
round
In the following, we introduce the context and provide a detailed explanation of each of the five events from the figure above:
• Model download - consisting of two steps, namely loading the JSON-serialized model from the database without the model weights, followed by loading the actual latest parameters (weights) from the parameter server. Depending on the technology, the weights are in a different form, as described below;
• Training preparation - compiling the model and, in the case of MongoDB, deserializing the downloaded parameters before they can be set as model weights;
• Training - the actual local training of the model on the client's data according to the predefined hyperparameters;
• Weights serialization - serializing the updated weights prior to the upload (applicable only to MongoDB);
• Model upload - persisting the updated client weights (results) in the parameter server.
Differences: On the left-hand side of the figure, one can find the steps that are executed once a client has been invoked to participate in an FL training round with the use of MongoDB and, on the opposite side, the ones needed with Redis as the parameter server. The first noticeable difference can be found within the first event - Model download. With the MongoDB approach, one loads the latest parameters in a serialized form according to a predefined serialization technique, whereas the Redis solution, due to the redisAI module for ML contexts, allows for a direct persistence of the weights in the form of tensors. This allows for the elimination of some of the steps in the subsequent events of the workflow.
Moving to the Training preparation event, one can note that the fourth step of the workflow
(crossed out and highlighted in red color), namely "Deserialize the parameters" is no longer
necessary as redisAI allows us to directly store the weights as tensors without the need for a
prior serialization. The same holds true for the whole Weights serialization event, which can be
completely eliminated with the use of Redis and redisAI as the clients’ weights can be directly
stored after the training process has been successfully executed, without the need for further
modification.
With these simplifications and the elimination of the steps described above, we could achieve
a total performance speedup during an FL training round on the client-side of the system.
Redis and redisAI allow us to completely omit the weights serialization and deserialization processes, which contributes to the desired performance increase. Moreover, the upload times with the Redis solution are shorter than those of the current MongoDB solution, allowing us to address the main bottleneck of the current system. The exact comparisons
between the performance of the system with both parameter server technologies can be found
in the evaluation Section 5.
FL-server side
Figure 4.4 shows the differences between the two parameter server approaches on the FL-
server side. Similar to the approach in Figure 4.3, the two diagrams contain events that occur
within an FL training round. However, unlike the example from above, here we show the
effect that the new Redis parameter server has on the aggregator process in terms of the
workflow. The figure below shows the subsequent events once the aggregator function has
been invoked to combine the latest clients’ model updates. The events are to be read top to
bottom starting with the Model Download and ending with persisting the aggregated latest
model parameters, namely, the Model Upload event.
In total, five events can be observed on the aggregator side:
• Load client results - comprising a single step for fetching the latest clients' results (model updates);
• Aggregation preparation - deserializing the weights so that they can be used in the subsequent aggregation event (applicable to MongoDB);
• Aggregation - representing the actual aggregation of the clients' model updates according to a predefined aggregation scheme (a minimal sketch follows this list);
• Weights serialization - serializing the aggregated weights prior to their persistence (applicable to MongoDB);
• Model upload - storing the aggregated clients' results (weights) that are to be used as the latest parameters in the next FL training round.
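As a minimal sketch of the Aggregation event referenced in the list above, the following illustrates a FedAvg-style unweighted average over the clients' layer tensors; the actual aggregation scheme used by FedLess may differ, for example by weighting clients by their number of training samples.

import numpy as np

def aggregate(client_results):
    # client_results: one list of layer tensors (np.ndarray) per participating client.
    num_clients = len(client_results)
    num_layers = len(client_results[0])
    # Average each layer across all clients to obtain the new global parameters.
    return [sum(client[layer] for client in client_results) / num_clients
            for layer in range(num_layers)]

# Example with two clients, each contributing a single 2x2 layer:
new_weights = aggregate([[np.ones((2, 2))], [np.zeros((2, 2))]])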
Differences: On the left-hand side of the figure, all the necessary steps that are executed
on the aggregator side with MongoDB can be seen. These constitute the workflow, once the
aggregator has been invoked to combine the latest clients’ updates within an FL training
round. Alternatively, the right-hand side shows the same workflow procedure with the use of
Redis and the redisAI module. All differences between these two approaches are marked in
red and, additionally, all the steps that could be eliminated with the use of Redis are crossed
out.
The first noticeable difference can be seen within the first event - Load client results. Like
in the Model download event on the client-side workflow, Redis allows us to directly fetch the
clients’ results as tensors, eliminating the deserialization need for the weights in any of the
subsequent steps with the help of the redisAI module. On the contrary, MongoDB does not
support storing the weights directly as tensors and, thus, requires some pre-processing steps
such as weights serialization. As a result of that, the Aggregation preparation event can be
completely ignored for Redis as we can directly proceed to the Aggregation, having already the
weights in the desired tensor form at hand. Likewise, the event Weights serialization that occurs
right after the actual aggregation process has been successfully executed can be eliminated.
By eliminating all the steps described above on both the client and the FL server side, we were able to achieve an overall performance increase. A key contribution to this improvement is the removal of the time-consuming training and upload preparation events, such as the weights serialization and deserialization, which reduces the total number of events on the FL-server side from five with MongoDB to just three with Redis. Additionally, being an in-memory data store, Redis showed better performance within the Model upload step on both the client and the FL server side.
4.3.3 Replication
Motivation and Design
This section provides an architectural overview of the changes caused by the introduction
of the parameter server replication and, thus, relates to the second research question (RQ
2) of this study. It furthermore discusses some of the design choices made throughout the
implementation phase.
As already introduced in Section 4.1, Redis replication allows Redis nodes to be exact copies
of specific instances called primary or also master nodes. In the Redis context one can observe
two different replication approaches - with enabled as well as with disabled cluster mode. As
already discussed in previous sections, the decision for the replication approach fell on the
so-called cluster mode disabled way of handling replications in Redis. This was influenced by
the findings from the evaluation of the system’s performance within the scope of the first
research question of this work. Throughout the conducted experiments, we discovered that even though the system clearly outperforms the old FedLess solution using MongoDB as the underlying parameter server, there is still room for improvement, especially considering the read throughput. In fact, the read performance of Redis was even worse than that of MongoDB once an increased number of clients was used within an FL training round.
The latter seems to be a direct result of Redis lacking a multi-threading infrastructure. Interestingly, even though Redis is mostly single-threaded, it does provide an option to enable multiple threads for certain operations, such as UNLINK and slow I/O accesses, as of its 6.0 release. This can be achieved by extending the configuration and setting the desired number of I/O threads as follows: io-threads 4. According to the official documentation, this results in improved performance per core as it speeds up the accesses, thus removing the need for pipelining, for instance. In this example, we set 4 threads and thereby enable multi-threading for writes, that is, we thread the write(2) syscall that transfers the client buffers to the socket. Throughout our evaluations, we experimented with different configurations for the I/O threads, starting with 4 and scaling up to 64 threads.
In fact, we were able to achieve good write times and, thus, outperform the system's performance with MongoDB by orders of magnitude, tackling the main bottleneck of the system identified in the preceding FedLess evaluations. As already noted, however, the read times that we could achieve were as bad as or even worse than the ones observed with the MongoDB set-up, even after following the exact configuration instructions on enabling multiple threads for read operations. The latter is achieved by setting the corresponding flag in the configuration file as follows: io-threads-do-reads yes. Nonetheless, we discovered that this setting usually does not result in a performance increase, as stated in the official Redis documentation: "usually threading reads doesn't help much".
Therefore, based on these findings and the corresponding evaluations performed as a result
of approaching the first research question, we aimed at a replication approach that would
allow us to improve the read performance of our system and, thus, the replication decision
fell on an integration of the cluster mode disabled way of applying replication in Redis.
Figure 4.5 shows what the Redis replication looks like in the context of FedLess. It consists of
6 nodes in total, which differ in the operations that these nodes could perform. It furthermore
represents the clients involved and their interaction with the corresponding nodes throughout
an FL training round.
In the middle of the diagram, one can see the primary, also referred to as the master node, within the single shard. The primary node is the only one that can be used for write as well as read operations by all the involved clients. Then, based on a predefined synchronization scheme, records are copied to the connected replica nodes.
The number of replica nodes within a single shard can vary from 0 to 5, as already
introduced in Section 4.1. Throughout our experiments, we used three different settings and compared the performance results of the system under these varying numbers of replicas, namely: 1) without any replicas involved, 2) with two replicas, and 3) with the maximum number of five replicas involved.
Additionally, one can see the clients involved in an FL session. As stated, these all write their latest model updates to the primary node, as it is the only one that allows write operations. However, a subset of clients can be configured to read the latest model parameters from a specific replica node, offloading the read load from the primary node and, thus, increasing the performance of read operations. The number of clients within a subset reading from the very same replica node is not limited and can be chosen based on the needs of the FedLess user. These clients are different FaaS functions running both on-premise and on any public cloud provider.
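A minimal redis-py sketch of this read/write split is shown below. The hostnames and the assignment of a particular client subset to a specific replica are illustrative assumptions.

import redis

# All clients write their updates to the single primary (master) node ...
primary = redis.Redis(host="fedless-redis-primary.example.com", port=6379)
# ... while this subset of clients reads the latest global model from one of the
# read-only replica nodes, offloading read traffic from the primary.
replica = redis.Redis(host="fedless-redis-replica-2.example.com", port=6379)

primary.set("current_round", 42)              # write operations go to the primary only
current_round = replica.get("current_round")  # read operations go to the assigned replica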
4.4 RedisAI
As RedisAI is one of the key contributors to the achievement of the initially stated objectives of improved performance and, thus, decreased communication cost, we briefly present the core concepts of the module in the following. The description is largely based on the official RedisAI introduction page [209].
RedisAI is a module for Deep Learning/Machine Learning model and data management, providing out-of-the-box support for popular DL/ML frameworks. The list of data structures supported by RedisAI contains the following entries:
• Tensors: being a standard representation for data in DL/ML workloads, the tensor data
structure represents an n-dimensional array of values. Like any other data structure in
Redis and RedisAI, tensors are stored and identified by keys;
• Model: represents a frozen computational graph by one of the supported DL/ML
framework backends that is stored in the database and can be executed/run. Similarly
to all the other data structures, it is identified by a specific key;
• Script: represents a TorchScript10 program
Within the implementation, we make use of the Tensor data structure, which allows us to simplify our workflow in two places, as already presented in Section 4.3: after a successful client training, to persist the results in the database, as well as after a successful aggregation of the results by the aggregator function on the FL server side.
As noted, redisAI allows practitioners to store tensors in Redis. This is done with the AI.TENSORSET command. A concrete example of storing a tensor looks like this: AI.TENSORSET fedless FLOAT 2 VALUES 2 3, where the key of the tensor is fedless, its values are of type float as specified by the FLOAT argument, the following 2 defines the tensor's shape (here a one-dimensional tensor with two entries), and the VALUES argument lists the actual tensor values - in this case 2 and 3. For the complete list of commands for all the data structures offered by the module, as well as further examples of their use, the reader is referred to the official Redis documentation referenced at the beginning of this section.
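The Python counterpart of this command, using the redisai-py client introduced in Section 4.5.1, could look as follows; host, port, and key are illustrative.

import numpy as np
import redisai as rai

con = rai.Client(host="localhost", port=6379)
# Store a float tensor of shape (2,) with values 2 and 3 under the key "fedless",
# analogous to: AI.TENSORSET fedless FLOAT 2 VALUES 2 3
con.tensorset("fedless", np.array([2.0, 3.0], dtype=np.float32))
# Retrieve it back as a NumPy array, analogous to AI.TENSORGET.
tensor = con.tensorget("fedless")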
10 https://pytorch.org/docs/stable/jit.html
4.5.1 Dependencies
As our first and foremost concern throughout this work is replacing the parameter server
technology with Redis, we extend the system’s external dependencies list by adding the
corresponding Redis dependency. In order to be able to use Redis with the Python language,
we integrate redis-py11 , the most widely-used Redis Python Client. For similar purposes, we
add redisai-py12 , a Python client for RedisAI, which allows the use of the redisAI API in Python,
and perfectly matches the context and needs of our system. Additionally, we add ultra JSON,
in short ujson, to the used dependencies, in order to achieve faster dictionary serialization
and deserialization times as compared to the standard approach with the json module.
MongoDB vs Redis
Before presenting the exact implementation changes, we briefly compare MongoDB and Redis
and the implications that these had on the implementation of the parameter server. These two
distinct technologies have many differences, starting with the way data is handled and persisted. MongoDB is a disk-based database and, as such, persists the data on the disk of
11 https://github.com/redis/redis-py
12 https://github.com/RedisAI/redisai-py
the corresponding host machine. On the contrary, Redis stores data in memory, promising faster data access. However, it is limited in the size and amount of data that can be stored.
Being a key/value store, Redis allows, as the description suggests, users to store as well as fetch certain values by an identifying key using the standard .set('<KEY>', '<VALUE>') and .get('<KEY>') commands. Thus, querying on multiple conditions is not a natural way of fetching data in Redis, making Redis a less than perfect solution for scenarios where the business requirements necessitate querying data on multiple conditions or fields.
A concrete example from our code base that makes use of such a multi-field data fetch is the realization of the load method responsible for fetching the latest client results. With MongoDB, we are able to apply a filter to the fetch and retrieve only the client results that correspond to the passed session_id and round_id arguments; a sketch of such a query is shown below. In fact, we did not find the need for such queries throughout the implementation phase, as we remove the latest clients' results once these have been combined by the aggregator, in order to achieve better performance. Nonetheless, it is a distinction that is worth mentioning at this point.
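Since the original listing is not reproduced here, the following is only a minimal pymongo sketch of such a filtered fetch; the connection string, database, collection, and field names are hypothetical.

from pymongo import MongoClient

def load_client_results(session_id, round_id):
    # Fetch only the client results that belong to the given session and training round.
    collection = MongoClient("mongodb://localhost:27017")["fedless"]["client_results"]
    return list(collection.find({"session_id": session_id, "round_id": round_id}))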
Another key difference between MongoDB and Redis is the way Python dictionaries are handled. MongoDB allows storing documents in accordance with a predefined schema and, thus, one can adopt the exact form of the dictionaries and persist them accordingly. This, however, is not the case with Redis. The most common data structures supported by Redis include strings, sets, sorted sets, lists, hashes, and bitmaps, with hashes being the equivalent of dictionaries in Python, storing a set of field-value pairs. An important obstacle, however, is that the hash data structure supports only dictionaries nested one level deep, which conflicts with many of the objects within our code base.
In order to still store such nested dictionaries with redis-py, we use a serialization workaround, which allows us to serialize a nested dictionary as a string and persist a corresponding key/string value pair in Redis. The latter can be realized with different serialization techniques and third-party
dependencies such as json, ujson, msgpack, and so on. Among these, we aim for the one adding
the least serialization time overhead, in order to achieve the best performance and, therefore,
reduce the total communication costs. The exact evaluation of the system’s performance with
all these serialization techniques is described in detail in Section 5.
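A minimal sketch of this workaround is shown below; the key and the nested configuration content are illustrative.

import redis
import ujson

r = redis.Redis(host="localhost", port=6379)

# A nested dictionary that a Redis hash could not store directly.
client_config = {"client_id": "client-42",
                 "hyperparams": {"optimizer": {"name": "sgd", "lr": 0.01}}}

# Serialize the nested dictionary to a string and persist it as a plain key/value pair ...
r.set("client_config:client-42", ujson.dumps(client_config))
# ... and deserialize it again on retrieval.
restored = ujson.loads(r.get("client_config:client-42"))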
Redis Pipelining
Even though the use of redisai-py allows practitioners to use the redisAI module with Python,
we had to make some adjustments and look for workarounds aiming at increased system
performance. A concrete example is the need to use pipelining for any bulk action, due to the lack of an appropriate API able to handle such scenarios. Redisai-py, unfortunately, does not provide bulk get or set tensor operations that would allow practitioners to persist and retrieve multiple tensors at once. The example below shows the realization of a bulk
tensors set with the use of a pipeline, in order to achieve a better performance once persisting
all the clients’ results within Redis. As seen, the tensors of each layer of the corresponding
client are set in an iterative manner with the use of the .tensorset() method provided by
redisai-py and executed at once afterward with the help of a previously defined pipeline.
import numpy as np

# Persist each layer of the client's result as a separate tensor and execute all
# queued tensorset commands at once through the pipeline.
redisai_pipe = self._redisai_client.pipeline(transaction=False)
for i in range(len(result.parameters)):
    redisai_pipe.tensorset(f"client_result:{client_result_id}_layer_{(i+1)}",
                           np.array(result.parameters[i], dtype=np.float32))
redisai_pipe.execute()
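For completeness, the read counterpart rebuilds the client's parameter list tensor by tensor. The sketch below mirrors the context of the listing above (self._redisai_client and the key pattern), with num_layers assumed to be known, e.g. from the stored model configuration; the individual tensorget calls could in principle also be grouped in a pipeline.

# Illustrative read counterpart: fetch the stored layer tensors back as NumPy arrays.
client_parameters = [
    self._redisai_client.tensorget(f"client_result:{client_result_id}_layer_{(i + 1)}")
    for i in range(num_layers)
]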
Adding Redis on top of the existing solution, we allow the FedLess administrator to decide
between the two parameter server technologies once the training session is started. This
is realized by introducing an enum type, shown in the listing below, which differentiates
between the two different approaches. The administrator provides one of the two values
and selects the technology that is to be used for the parameter server within the FL training
session. Based on that selection, the initial configurations, as well as the intermediate and
final results, are persisted in the corresponding database accordingly.
from enum import Enum

class DatabaseTypeEnum(str, Enum):
    REDIS = "redis"
    MONGO_DB = "mongo_db"
Due to the problems presented and described in detail in Section 4.1, we make use of synchronous replication. Being the non-default option in Redis, synchronously propagating the data to the replicas requires some changes and configurations, which are described in the following.
First, in order to synchronously copy the data from the primary to the replica nodes, we use the WAIT command, which prevents the clients from reading an older version of the updated model parameters: it blocks until a specified number of replicas have successfully acknowledged the latest parameters from the primary node. A second argument to the command specifies a timeout in milliseconds, which, when reached, unblocks subsequent read operations even if the specified number of replicas has not yet been reached. These two parameters have to be specified by the FedLess administrator at the beginning of each FL training session. If not provided, they take the default values of 0 replicas that have to acknowledge and no predefined timeout threshold.
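A minimal redis-py sketch of this synchronization step is shown below; the replica count and timeout are placeholder values that the FedLess administrator would configure.

import redis

r = redis.Redis(host="fedless-redis-primary.example.com", port=6379)

# ... executed right after the aggregated parameters have been written to the primary node:
num_replicas = 2    # replicas that must acknowledge the write (defaults to 0)
timeout_ms = 1000   # unblock after this many milliseconds even without full acknowledgement
acknowledged = r.wait(num_replicas, timeout_ms)  # issues the Redis WAIT command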
5 Evaluation
The following sections present the evaluation results of the experiments conducted throughout
this study. Section 5.1 introduces the experiment setup that includes the data set, the model
architecture, the hyper-parameters, and the hardware setup for the serverless functions. We
conclude with comparative examples between the improved system and the older FedLess version in Section 5.2. For the comparisons, we use three different settings based on the model size, as shown in Table 5.1, and evaluate the system's performance under the consideration of a varying number of involved clients in each of the three test settings. It is to be noted that the goal of the experiments is not to evaluate the accuracy of the different models, but rather to study how costly each global round of FL training is in terms of execution duration and, thus, pricing. We compare the results with the results from FedLess. Moreover, we observe the dependency between the training costs and the model's size.
In the following, the underlying experimental environment used for the execution of the evaluations is described in detail, starting with the data set. In order to conduct the experiments in an environment as close as possible to the FL setting, we use real data sets from the LEAF benchmark framework. The framework is specifically designed for the FL context, with applications that, in addition, include multi-task learning, on-device learning, and others. It contains many data sets for FL systems and covers many different domains such as image classification. In particular, we use the FEMNIST data set, which is a modification, in accordance with the needs of the FL setting, of the extended MNIST data set, or in short - EMNIST [210]. Compared to the original data set, its extended version EMNIST constitutes more challenging classification tasks involving letters as well as digits [210]. The FEMNIST data set used throughout the experiments contains over 800,000 image samples of digits and various characters, partitioned by writer. There are 3,550 writers in total, each contributing roughly 226 images on average. Besides the original pre-processing steps in line with the official recommendations provided by the framework, we additionally sample more users, in order to scale the experiments. Firstly, we sample 25% of the users and discard all that have fewer than 100 samples. Then we use 10% of the data for testing and end up with 895 partitions. We serve the data set with nginx1 as a file server.
1 https://www.nginx.com/
The first model that we use throughout the experiments is the same CNN as used for
FEMNIST evaluations in FedLess. It consists of the following layers: two convolutional layers
with a kernel size of 5x5, a fully connected layer with 2048 neurons, and a 10-neuron output layer. We use ReLU as the activation function and the categorical cross-entropy loss function. This gives us a model size of 25,911,521 bytes, or roughly 26 MB, which is what we refer to as the first test setting in our experiments.
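A hedged Keras sketch of this first test setting is given below. The text does not specify the number of convolutional filters, the pooling, the padding, the optimizer, or the input shape, so the values used here are assumptions for illustration only and will not reproduce the exact parameter count of Table 5.1.

import tensorflow as tf

# Illustrative sketch of the 26 MB FEMNIST CNN; only the layer structure follows the
# description above, while filter counts, pooling, padding, and optimizer are assumed.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 5, padding="same", activation="relu",
                           input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Conv2D(64, 5, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(2048, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])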
In order to increase the model size and, thus, test the parameter server’s performance with
larger data transfer, we experiment with the parameters from above and artificially increase
the size of the model. Doubling the number of neurons within the fully connected layer from 2048 to 4096 results in a model of size 51,609,830 bytes, or nearly 52 MB, which we refer to as the second scenario for our experiments.
Lastly, to achieve a model size of more than 200 MB, we add, in addition to the doubled number of neurons within the fully-connected (FC) layer, three further FC layers - two of them containing the same doubled number of neurons, 4096, and one with the initial number of 2048 neurons. This results in a model of size 219,934,718 bytes, or roughly 220 MB. This model size comprises our third experiment setting.
Table 5.1 shows the resulting number of parameters that each of the model sizes contains.
Intuitively, doubling the model size roughly doubles the number of trainable parameters
within a given model.
                        26 MB        52 MB         220 MB
Number of parameters    6,476,672    12,901,248    54,981,566
Table 5.1: Model sizes and the total number of their trainable parameters
Within each of these three different test scenarios, we use a different number of clients
for an FL training round for both MongoDB and Redis and directly compare the parameter
server’s performance. Starting with a single client, we increase their number, in order to
examine the performance changes as the amount of involved clients increases.
For the case of 26 MB and 52 MB model size, we scale the number of clients involved in an
FL training round to 300 and compare the execution duration on both client-side as well as FL
server-side using the two different parameter server technologies, namely the implementation
from FedLess with MongoDB and our implementation with Redis.
In the case of the 220 MB model size, we scale the number of clients per FL training round only up to 100, due to the increased size of the data that is being transferred. Moreover, this test setting contains results only for Redis as the parameter server. This is a result of the bad MongoDB performance in the second test setting with a model of size 52 MB once the number of clients within a training round exceeds 250. Thus, with a roughly 4 times bigger model, we did not expect the system with MongoDB as a parameter server to complete training sessions successfully.
Hardware Setup
The FL server is running on a machine with 10 vCPUs and 45 GB RAM. On the same
machine, we have also deployed a single-node Kubernetes2 cluster. In contrast to the experiments performed with FedKeeper and FedLess, we deploy an OpenFaaS aggregation function inside the cluster. This is primarily caused by the unstable performance of OpenWhisk.
We encountered constant function failures and dropouts of the aggregator once invoked to
combine the results of the clients involved in the training round. This was especially the case
with a large number of clients involved in the FL training session, typically in test scenarios
that include 100 and more client functions. Nonetheless, from a design perspective, we stick
to the approach of keeping the aggregation process physically close to the parameter server
and deploy it inside the cluster, even though it could be successfully deployed in the cloud
using any cloud provider. This gives the benefit of minimized networking overhead and turns
out to be a crucial performance advantage as studied and reported by [76]. The OpenFaaS
aggregator function was configured with a memory limit of 12000 MB, or 12 GB, whereas the FaaS client functions have a limit of 2048 MB, if not stated otherwise. The only exception to this setting is the third test setting with the 220 MB model. The functions deployed on public cloud providers all run in the very same region, namely Frankfurt, Germany.
2 https://kubernetes.io/
Figure 5.1: Client function duration compared between five different parameter server
approaches
All but the Redis (with the redisAI module) approaches perform serialization and, respec-
tively, deserialization of the model weights when interacting with the underlying parameter
server. The model weights are serialized and stored in BSON format. The three different Redis
approaches in blue color differ in the way the serialization for the model and client configura-
tions is being achieved, considering three different serialization techniques, namely json, ujson,
and msgpack. The need for serialization for the client configurations follows from the lack of
support in Redis for deeply nested Python dictionaries as already discussed in Section 4.5.
Therefore, we apply a serialization to both the client as well as the model configurations and
store them as strings corresponding to a key. By measuring the performance with all three evaluated serialization techniques, we study the implications as well as the additional overhead that these serialization approaches have on the system's performance. Finally, we compare all of the above with a novel approach in the DL/ML context by enabling the redisAI module. RedisAI allows us to omit any serialization and, respectively, deserialization steps within the training session.
As seen in Figure 5.1, all of the solutions with Redis, in which we serialize the model
parameters in BSON format, Redis (json), Redis (ujson), and Redis (msgpack), do not lead to a
performance increase once we scale the number of clients. In fact, besides the single-client
test case, these approaches perform significantly worse even than the existing MongoDB
solution. Scaling the number of clients involved in an FL training round to 10, which we consider a relatively small number, all but the redisAI approach deliver more than 3 times slower running times. MongoDB requires 2.765195 seconds on average to perform a single training round in FedLess, whereas Redis (json) needs 8.932911 seconds, thus being exactly 3.23 times slower. Redis (ujson) and Redis (msgpack) need 8.808135 and 10.094108 seconds, respectively. The solution using redisAI, in contrast, performs even better than the original one, resulting in 1.566101 seconds. It is to be noted that all these timings exclude the actual training time of the model, as we consider this to be independent of the underlying parameter server technology and serialization technique.
To support these bad scalability findings with another example, we increase the number
of clients involved to 50 and perform the exact same comparisons, receiving the following
results: MongoDB - 5.400897 seconds, Redis (json) - 105.632522 seconds, Redis (ujson) - 94.485904
seconds, Redis (msgpack) - 123.469976 seconds, and Redis (ujson) + redisAI - 2.541085 seconds.
As seen, we observe similar results - the three approaches using Redis with BSON-serialized model parameters perform significantly worse, independent of the exact serialization technique used for the client and model configurations.
We attribute this to the inappropriate format in which the weights are serialized when using Redis as the parameter server technology. Based on these results, we no longer consider using BSON-serialized model parameters with Redis and evaluate the system's performance only with the Redis approach that involves the redisAI module. Thus, in the following, we compare the existing MongoDB solution only to the redisAI solution and do not consider weights serialization in the context of Redis any further.
We still need to perform serialization for the model and client configurations, however, due to the limitations in Redis related to nested dictionaries as described above. Even though msgpack proved superior to the other two serialization methods, namely json and ujson, in test cases with a smaller number of clients, ujson delivered better results once the number of clients increased to 10 and 50. As one of the main goals of this study is to evaluate the performance of the system in the large-scale scenario, considering up to 300 clients involved in an FL training round, we perform the corresponding serialization with ujson.
Clients: Figures 5.2 and 5.3 show running time comparisons between the two parameter
server technologies used with a model size of 26 MB. To achieve these results, we deploy 300
client functions solely on the Google Cloud Platform. Figure 5.2, however, does not include
the training times of the clients as we expect these to be independent of the underlying
parameter server technology.
Figure 5.2: Client functions running time comparisons between MongoDB and Redis for a
model size of 26 MB and varying number of clients
The events contributing to the clients' total running times seen in the figure above are the following:
• Model Download - representing the time for the client to load the latest model parameters;
• Training Preparation - the time used for the parameters' deserialization (applicable only to MongoDB) as well as the model compilation;
• Upload Preparation - covering the time needed for the serialization of the weights (applicable only to MongoDB);
• Model Upload - representing the exact duration of the write operation that persists the latest client model updates.
From the figure, we can see that increasing the total number of clients involved in an FL training round increases the running times for both parameter server technologies. This, in fact, we consider expected behaviour of the system, as involving more clients within an FL training round increases the complexity and the execution duration of each defined event.
Taking a closer look at the exact comparisons, one can note that Redis leads to much faster
running times in each and every single test case starting from a single client per round and
scaling their number to 300. Training a single client gives us a total, average run time of a
round of 2.931053 seconds (excluding the actual model fit time) with MongoDB, whereas Redis
gives a run time of 1.061442 seconds (again, without considering the actual training time),
which is more than 2.5 times better performance. While scaling the number of clients to 100,
we observe a roughly two times performance difference in favor of the Redis solution.
Interestingly, having a larger number of clients involved, we see an even greater difference
in terms of the system’s performance, showing that MongoDB is unable to handle many
simultaneous interactions with the participating clients, which corresponds with FedLess’s
observations that motivate this work. Having 300 clients involved in the training process
shows the better capability of the Redis solution to handle a large amount of participating
clients. For 300 clients, for example, Redis results in a total duration of under 15 seconds, or exactly 14.212507 seconds, whereas MongoDB performs significantly worse with a total duration of nearly 60 seconds, or exactly 59.106725 seconds, which is a more than 4 times worse performance. As in the example with the single client from above, these timings do not
include the exact training time as we assume it not to be influenced by the exact technology
used for the underlying parameter server.
We attribute these findings primarily to the much faster write operations that Redis supports compared with the solution involving MongoDB. Redis clearly outperforms MongoDB in the client-side Model Upload event by orders of magnitude, leading to a better total round execution duration. Moreover, as we are able to eliminate the need for weights serialization with the help of the redisAI module, which allows us to store the client results as tensors, the Upload Preparation event does not contribute to the total execution duration at all, which differs from the MongoDB solution. The combination of these two factors results in an improved client-side running time of the system with the use of Redis.
The two diagrams in Figure 5.3 provide a detailed view of the exact events contributing
to the total round execution duration on average, comparing the two parameter server
approaches. For the sake of completeness, here we include the actual Training event to end up
with the exact total round running time required by the involved clients.
(a) Comparison with 50 clients involved in a training round; (b) Detailed comparison with 100 clients involved in a training round
Figure 5.3: Detailed client functions running time comparisons between MongoDB and Redis
for a model size of 26 MB
The plot in Figure 5.3 (a) compares the two technologies under the consideration of 50
participating clients. One can observe the running time differences per individual event such
as Model Upload, for instance. Taking a closer look at the Model Upload event, we can note that
MongoDB is not able to handle many simultaneous client uploads. MongoDB requires roughly
5 times more time to execute the uploads with an exact duration of 5.384037 seconds against
the 1.085305 seconds needed by the Redis solution. Additionally, MongoDB requires 0.110116
seconds to serialize the weights prior to the upload event, which is not necessary with Redis,
due to the redisAI module.
Similarly, Figure 5.3 (b) provides the exact same comparisons with the consideration of 100
clients involved in an FL training round. In fact, here we observe an even larger difference
between the two upload events - 8.637785 seconds for MongoDB against 1.64687 seconds for
Redis. Meeting our initial expectations, the training times of the two approaches are roughly similar, as they do not directly depend on the underlying parameter server technology.
The only noticeable drawback of the Redis solution relates to the Model Download event,
which is, in fact, faster with the MongoDB solution, taking 0.710569 seconds against the
1.833939 seconds taken by Redis. This is primarily caused by the fact that enabling multi-threading for the read operations in Redis does not improve the read performance. With the changes related to the second research question of this study, RQ 2, however, we aim at mitigating these performance issues and decreasing the read times with Redis.
We conclude the client-side results for the 26 MB model by looking at the two parameter servers with a larger number of involved clients, namely 200, as shown in Figure 5.4.
Figure 5.4: Detailed client functions running time comparisons between MongoDB and Redis
for a model size of 26 MB and 200 clients
In Figure 5.4, we plot the exact running times of the individual events and show the capability of the Redis solution to better handle FL training that involves a larger number of clients within a training round. Even though we see a slightly worse running time of the Model Download event with the Redis approach, 3.504046 against 0.888442 seconds in favor of MongoDB, this is barely noticeable given the timings of the other events such as the Model Upload. Similarly, the Training Preparation event cannot be clearly seen in the plot as it takes a tiny portion of the total running time. However, we observe that in this event, similar to the test cases with 50 and 100 clients, Redis is superior to MongoDB, taking just 0.009015 seconds against the more than 6 times longer 0.057535 seconds of the old solution. The Model Serialization event, which happens once the actual training has been successfully executed and before the persistence of the corresponding model parameters, takes 0.098098 seconds of the total running time for MongoDB and 0 seconds with the Redis solution, since the redisAI module persists the client results directly as tensors.
The most noticeable difference, however, is the Model Upload event, identified by previous work on the FedLess framework as the weak spot of the system. Uploading the model weights makes up a large portion of the total running time throughout an FL training with MongoDB. With a total average upload time of 65.645192 seconds, the clients spend on average 89.9% of the round duration uploading their computed results to the parameter server. On the contrary, Redis requires only 9.892569 seconds to upload the client results, which is more than 6 times faster than the MongoDB solution and only 73.7% of the total round duration on average.
In general, considering all the different test cases that we performed throughout the experiments with a model size of 26 MB, we achieve 2.21 times better performance with the new Redis technology for the underlying parameter server of our system. Averaged over all scenarios with a varying number of clients, the round duration is thus 2.21 times shorter. This reduction, furthermore, has a positive effect on the costs associated with the use of FedLess. The exact cost savings are presented at the end of the evaluation section.
Figure 5.5: Aggregator function running time comparisons between MongoDB and Redis for a
model size of 26 MB and varying number of clients
Aggregator: As Figure 5.5 shows, with a growing number of clients involved in an FL training round, the duration of the whole aggregation process increases as
well. Moreover, we can see that the system using Redis as a parameter server outperforms
the one with MongoDB for all the different numbers of clients. Interestingly, the more clients we consider throughout the training round, the bigger the difference between the two becomes. With 300 clients, the Redis solution performs almost 3 times faster than the one using MongoDB: with Redis an aggregation completes in 20.111208 seconds on average, whereas with MongoDB it takes nearly a minute, namely exactly 58.906086 seconds.
Identical to the client-side measurements, we observe a great performance increase in
terms of the model uploads. Furthermore, being able to completely eliminate the weight
deserialization and serialization steps, which occur before and after the aggregation process
respectively, we are able to further decrease the total running time and, thus, achieve better
overall performance as compared with the old FedLess implementation.
We plot the differences between the two approaches considering all the individual events contributing to the total round duration. Figure 5.6 compares the performance of the system with MongoDB and Redis for 50 and 100 clients within an FL training round. We see that in the 50 client setting as well as in the 100 client setting, the aggregator with the Redis solution clearly outperforms the old implementation of FedLess using MongoDB. In the 50 clients' case, the total round aggregation time on average is roughly 2.5 times faster with Redis, with the exact timings being 10.049266 seconds for MongoDB and 3.995371 seconds for Redis. Comparing the individual events, we further see that both the aggregation as well as the model upload
events take considerably less time. Similar to the client-side measurements, the two main reasons for these results are the faster write operations of Redis and the eliminated weights serialization and deserialization.
(a) Comparison with 50 clients involved in a training round
(b) Comparison with 100 clients involved in a training round
Figure 5.6: Detailed aggregator function running time comparisons between MongoDB and
Redis for a model size of 26 MB
Scaling the clients to 100, as seen in Figure 5.6 (b), results in a greater difference. Redis completes the aggregation within 6.764538 seconds, whereas MongoDB needs significantly more time, 20.390591 seconds. Looking at the largest possible test case, with 300 clients within an FL training, we observe 58.906086 seconds of aggregation execution for MongoDB and almost 3 times less with Redis, 20.111208 seconds. One noticeable finding, however, is the huge portion of time taken by the actual aggregation process with both parameter server technologies. For the case with 300 clients, the system with MongoDB spends 99.08% of the time aggregating the weights. Even though we drastically improve the aggregation performance with Redis by removing the need for serialization and, thus, deserialization afterward, the actual aggregation is still an event that takes roughly 99% of the total aggregation time. We consider this a potential bottleneck on the aggregation side of the system, which could be mitigated by replacing the aggregation scheme.
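For reference, the dominating step corresponds to a weighted average over all client updates. The following minimal Python sketch shows a FedAvg-style aggregation of this kind; the function name and the weighting by local sample counts are assumptions made for illustration, not the exact FedLess implementation.

from typing import List

import numpy as np


def fedavg(client_weights: List[List[np.ndarray]],
           num_samples: List[int]) -> List[np.ndarray]:
    # Weighted average of the clients' layer-wise weight arrays; each client
    # contributes proportionally to its number of local training samples.
    total = float(sum(num_samples))
    aggregated = []
    for layer_idx in range(len(client_weights[0])):
        layer = sum(w[layer_idx] * (n / total)
                    for w, n in zip(client_weights, num_samples))
        aggregated.append(layer)
    return aggregated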
In summary for the aggregation-side performance, considering all the different test cases
that we performed with a model size of 26 MB, we achieve 2.97 times better performance
with the use of the new Redis technology for the parameter server.
Throughout all the experiments conducted with a model size of 26 MB, the solution with Redis performed in a more stable manner. On average, we reached the corresponding predefined accuracy in 5 to 6 training rounds. Moreover, even though we allowed a high number of stragglers, we did not experience a training round with more than 10% failing clients. In fact, the only scenarios in which any clients returned a failed function status were the cases with the largest number of clients in our setting, namely 300, and even then the failures did not reach 10% of the total number of clients involved.
For the doubled model size, we repeat the exact same procedure, comparing the two parameter server technologies with various numbers of clients involved and investigating the ability of the system with Redis to handle larger model sizes. In contrast to the first test setting, however, here we only look at a reduced number of test cases, defined by the number of clients used throughout an FL training round. We evaluate the system's performance based on 6 different settings, instead of 12 as in the preceding evaluations. The minimum as well as the maximum number of clients involved do not change; nonetheless, we do not consider test cases with 5, 30, 75, 150, and 250 clients throughout the evaluations of the second biggest model.
Similar to the tests performed with the smallest model, we do not aim to achieve a certain performance in terms of a predefined metric such as accuracy. Thus, we arbitrarily pick a value of 0.5 which has to be reached in order to successfully terminate the training. If this value is not achieved within a maximum number of 15 training rounds, the process terminates automatically. To perform the measurements, we use the exact same function configurations as in the first test setting. That is, we deploy 300 Google Cloud Functions with 2048 MB memory each as well as an OpenFaaS aggregator function running on-premise.
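A minimal sketch of this termination logic could look as follows; run_training_round is a hypothetical stand-in for one full FL round and not an actual FedLess function.

ACCURACY_THRESHOLD = 0.5   # arbitrarily chosen target, as described above
MAX_ROUNDS = 15            # upper bound before the process terminates automatically


def run_session(run_training_round) -> int:
    # run_training_round() executes one full FL round (client training plus
    # aggregation) and returns the evaluation accuracy of the global model.
    for round_idx in range(1, MAX_ROUNDS + 1):
        accuracy = run_training_round()
        if accuracy >= ACCURACY_THRESHOLD:
            return round_idx        # threshold reached, training stops early
    return MAX_ROUNDS               # maximum number of rounds exhausted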
Clients: Figure 5.7 shows the client running times of the two parameter server technologies
in a comparative manner.
Figure 5.7: Client functions duration comparisons between MongoDB and Redis for a model
size of 52 MB and varying number of clients
Throughout the experiments, we observe the exact same pattern as with the 26 MB model: Redis outperforms MongoDB in each and every test case. Moreover, the more clients we include within a training round, the larger the running time difference between the two technologies becomes. As with the 26 MB size, we do not consider the actual training times in this diagram, as we assume these do not depend on the underlying parameter server technology.
In the very first scenario with a single client, we see that the Redis solution is superior to MongoDB. The average round duration for Redis is 2.306066 seconds, whereas the MongoDB solution takes 5.852588 seconds and is thus more than 2.5 times slower. Evaluating the performance with a larger number of clients, for example 200, we see a similar ratio in their run times: 71.02988 seconds for MongoDB against 26.43306 seconds for Redis. Looking at an even larger number of clients, the difference grows further. With 300 clients, MongoDB ends up with an average running time per round of more than 2 minutes, or exactly 136.422855 seconds, roughly 3 times more than the 46.189571 seconds of the Redis solution.
Figure 5.8 provides a detailed view of the performance of the two approaches, considering all the individual events that occur throughout an FL training on the client-side of the system, including the actual training time, i.e., the Training event, in contrast to the figure above.
(a) Comparison with 50 clients involved in a training round
(b) Comparison with 100 clients involved in a training round
Figure 5.8: Detailed client functions running time comparisons between MongoDB and Redis
for a model size of 52 MB
Looking at the exact running times of these events, we can validate our observations from the evaluations with the 26 MB model. Uploading the model with Redis is faster than the same event with MongoDB by orders of magnitude, resulting in a better overall performance. Figure 5.8 (a) shows the running times with 50 clients involved. With 8.715303 seconds of upload time, the existing FedLess solution using MongoDB performs worse compared with the 1.743917 seconds needed by Redis. Moreover, with MongoDB the upload event takes a big portion of the whole duration, 46.52%, whereas with Redis the running times of the individual events are distributed more uniformly: the clients spend only 13.72% of the total time uploading their results.
Similar differences can be observed in Figure 5.8 (b), showing the results for 100 clients per FL training round. The model upload event with Redis is significantly faster than the one with MongoDB. In total, using Redis, we achieve nearly 4 times faster running times: 36.630309 seconds for MongoDB vs. 9.52527 seconds for Redis. The upload event accounts for 71.67% of the total time with MongoDB and only 29.04% with Redis.
Averaging the speed-ups measured with a model size of 52 MB, we achieve a 3.35 times better performance with Redis as the technology for the underlying parameter server of the system. As a result, we were able to reduce the costs associated with the use of FedLess.
Aggregator: As with the smallest model size, we take a more detailed look at the running times once the aggregator function has been invoked to combine the latest client results. The running times on the aggregator side, as displayed in Figure 5.9, strengthen the observations and show the superiority of Redis over the existing MongoDB solution.
Figure 5.9: Aggregator function duration comparisons between MongoDB and Redis for a
model size of 52 MB and varying number of clients
In fact, the difference in this test setting is even larger, as we achieve 4.6 times faster running times on average. The following values from the figure illustrate this in detail: for 10 clients, 6.492808 seconds for MongoDB vs. 2.039934 seconds for Redis, roughly 3.2 times slower; for 50 clients, 28.961579 seconds vs. 9.798537 seconds in favor of Redis; for 100 clients, 58.199336 seconds for MongoDB vs. 17.188075 seconds for Redis, roughly 3.4 times slower. For 200 clients, we observe 72.781905 seconds for MongoDB vs. 35.161906 seconds for Redis, roughly 2 times faster with Redis.
Looking at the contributions of the individual events, we see that the aggregation is the most costly operation on the aggregator side. The client download events contribute little to the total average running time of the round. For example, with just 0.000073 seconds in the case of 50 clients and MongoDB as the underlying parameter server, the download event accounts for less than 0.1% of the total running time. Similarly, the time for uploading the latest aggregated model parameters is also relatively small compared with the aggregation time and contributes little to the total running time: we measure 0.971391 seconds of upload time, which is roughly 3.5% of the total 28.961579 seconds.
Figure 5.10 displays these detailed observations and compares MongoDB with Redis for 200 clients. As seen, the contributions of the download (Download Clients' Results) and upload (Model Upload) events are relatively small in duration compared with the aggregation event. Looking at the aggregations of the two different parameter server approaches, we observe a great improvement with the use of Redis. As noted, besides the actual aggregation, we have to perform the corresponding serialization steps when using MongoDB. These steps are included within the Aggregation event displayed below and are no longer necessary with Redis.
Figure 5.10: Detailed aggregator function running time comparisons between MongoDB and
Redis for a model size of 52 MB and 200 clients involved per FL training round
The aggregation with MongoDB takes longer than 2.5 minutes, whereas the solution with Redis needs a bit more than 30 seconds and is thus more than 5 times faster. As with the smaller model, the aggregation event is the most costly one on the aggregator side. The download as well as the upload events take significantly less time compared to the actual aggregation and are barely noticeable in the plot, contributing only a small portion of the total duration.
In general, throughout the aggregator experiments with the model of size 52 MB, we achieve 4.6 times better running times with Redis compared with the existing MongoDB solution. However, similar to the evaluations in the previous test setting, we see that the actual aggregation clearly dominates all the other events such as the downloads and uploads of the model. Out of the total 30.630445 seconds in the 200 clients' case with Redis, 30.47168 seconds are taken by the actual aggregation, which is nearly 99.5% of the whole aggregation process.
Within this test setting, we demonstrate the capability of the Redis parameter server not only to handle a large number of clients, but also to interact with models of larger size. For this, we use a model of size 220 MB, which we obtain by artificially adding new fully-connected layers to the network. For these experiments, we increase the memory of the functions from 2048 MB to 4096 MB, since functions with 2048 MB memory are not able to handle such big models, which exceed the pre-configured memory limit. Additionally, we evaluated the system's performance only with the Redis solution for the parameter server, as MongoDB resulted in poor performance with many unsuccessful training rounds and client function failures.
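As a rough back-of-the-envelope check, the serialized model size follows directly from the number of trainable float32 parameters (4 bytes each), which is how additional fully-connected layers translate into a larger payload; the helper below is purely illustrative and not part of FedLess.

def model_size_mb(num_params: int) -> float:
    # 4 bytes per float32 parameter, converted to mebibytes
    return num_params * 4 / (1024 * 1024)

# Roughly 6.8 million parameters correspond to about 26 MB of weights,
# and roughly 58 million parameters to about 220 MB.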
Clients: Figure 5.11 compares the performance of the system with Redis and the three
different model sizes considering a varying number of clients. Due to the large size of the
biggest model, we consider up to 100 clients, in contrast to the preceding test settings.
Figure 5.11: Client functions duration comparisons between models of different sizes and
varying number of clients using Redis
The running times in the comparison plot above, as well as the exact measures discussed in the following, do not include the actual training time. We can observe that the model size has a roughly linear relationship with the average round duration of the clients. Looking at the single-client case, we see that the biggest model is roughly 8 times slower than the smallest one: the model of size 220 MB has an average running time of 7.797662 seconds and the one of size 26 MB 1.061442 seconds. In fact, the ratio of their running times is roughly equal to the ratio of their sizes, showing the relationship between the size of the model and the average round duration of an FL training.
Looking at a larger number of clients, we see that the difference tends to increase. For 10 clients, the smallest model of 26 MB requires 1.566101 seconds on average, the model of size 52 MB 3.200821 seconds, and the biggest model 11.327892 seconds, in line with the observations from above. Training with 50 clients gives the following results: 2.451213 seconds, 6.506457 seconds, and 39.698943 seconds for the three models sorted by size, respectively. With 50 as well as 100 clients, we thus observe an even larger running time difference between the smallest and biggest model. In general, averaging over all the performance differences that we evaluate within this test setting, we see that by increasing the size of the model 8 times, we obtain a roughly 10 times slower round duration on the client-side of the system.
Aggregator: For the sake of completeness, we show the exact same system performance
comparisons with the aggregator function in Figure 5.12.
Figure 5.12: Aggregator function duration comparisons between models of different sizes
and varying number of clients using Redis
Similar to the client-side performance, we see a relationship between the model size and the corresponding running time of the aggregator. A single client leads to the following running times: 0.274577 seconds for the smallest model; 0.494854 seconds for the second model, which is roughly twice as much and in line with the size difference; and 1.952163 seconds for the largest model, roughly 8 times more. We observe similar differences when scaling the number of involved clients to 10, with 1.021198 seconds, 2.039934 seconds, and 10.007791 seconds for the corresponding model sizes, respectively. Looking at the diagram for 100 clients, however, similar to the observations on the client-side, the gap between the smallest and the largest model seems to increase. The model with a size of 26 MB requires a bit more than 6 seconds on average, or exactly 6.764538 seconds, whereas the biggest model needs 1.5 minutes, or exactly 90.537119 seconds, to perform the aggregation of the latest model updates.
Even though we were able to increase the aggregator-side performance of the system by introducing Redis, we see, even with the largest model size, that the whole aggregation process, including the events measuring the interaction with the underlying parameter server such as the download and upload, is clearly dominated by the time for the actual aggregation. This is, however, independent of the parameter server technology. For the 100 clients case, the actual weights aggregation takes 90.537108 out of the total 91.501616 seconds, which is nearly 99.0% of the whole aggregation process. Thus, in the future, we might consider switching to a different aggregation scheme that could speed up the actual aggregation.
Based on all the experiments conducted for the first research question, we achieve a better performance of the system with Redis as the technology for the underlying parameter server. We outperform MongoDB by orders of magnitude on both the client-side and the FL-server side of the system, considering a varying number of clients and models of different sizes. For a model of size 26 MB, our evaluations showed 2.21 times faster client-side running times and 2.97 times faster performance for the aggregation function. Increasing the model to 52 MB, we end up with 3.35 times better running times for the client functions and 4.6 times for the aggregator.
The only event in which MongoDB proved superior to the Redis solution is the Model Download event, in which a client downloads the latest model parameters in order to train the model in the corresponding training round. This, as discussed, is primarily caused by the single-threaded nature of Redis. However, due to the much faster write operations as well as the eliminated need for serialization and deserialization of the model weights, the system ends up with better average round running times in total and, thus, with reduced costs associated with the use of the framework.
In the scope of the second research question (RQ 2), we aim to further investigate and
improve this observed performance issue with Redis. We try to reduce the download times
of the clients involved in an FL training round by integrating multiple replicated parameter
servers within the system and, thus, achieve an even better performance.
Table 5.2: Detailed clients' running times for Redis with different numbers of replication nodes involved and a model size of 26 MB
We identify that the use of a different number of replicas does influence the individual events. The total running time, however, stays in a similar range, with the only exception being the 100 clients' case with 5 replicas. For example, the Model Download time of the system without any replication nodes is 0.842111 seconds, whereas by
adding two replication nodes, we are able to decrease it to 0.732189 seconds. Increasing the number of replicas further to the maximum allowed number of 5, we reduce the download time of the model even more and achieve a download running time of 0.610326 seconds. Thus, we observe that increasing the number of replication nodes has a positive effect on the read operations in Redis and on the corresponding Model Download event, respectively. This also holds when looking at the system's performance with an increased number of involved clients. For 30 clients, for instance, we are able to decrease the download time from 1.37695 seconds to 0.658058 seconds by scaling the number of replication nodes from 0 to 5.
Nonetheless, introducing replication nodes and decreasing the running time of the Model Download event seems to have the exact opposite effect on the fast write operations and the Model Upload event, respectively. We see that in the 30 clients' case, the best model upload performance is achieved without any replicas. Starting with an upload time of 1.97091 seconds with no replication nodes, the running time of the event increases to 2.026332 seconds when we enable 2 replication nodes, and even to 2.220283 seconds once the number of replicas reaches the maximum allowed, namely 5. The same holds true for the case of 50 clients, where we get steadily increasing running times of 1.087932 seconds, 1.713158 seconds, and 2.220283 seconds while increasing the number of replication nodes within the system. An identical dependency can be observed once 100 clients are involved in an FL training round.
The observations from above show a certain trade-off with the use of replication. The more replica nodes we use within the system, the better the read running times seem to be and, thus, the less costly the Model Download event. On the other hand, the increased number of replicas leads to worse write operation running times and, thus, more costly Model Upload times. In total, these two events seem to compensate for each other, which results in similar overall performance of the system. Even though the total execution duration slightly increases, the values do not differ by much. The overall duration with 30 clients is 1.9709 seconds for 0 replication nodes against 2.026332 seconds for 2 replicas and 2.220283 seconds for 5 replicas. This relation can be observed in all three cases described in Table 5.2. The 50 clients' results, for instance, differ by roughly 0.25 seconds, being exactly 2.541085 and 2.793512 seconds for the two replication extremes within our setting.
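The replication setup behind these numbers follows the usual Redis pattern of writing to the primary and reading from a replica. A minimal Python sketch of such a read/write split is shown below; the host names and the routing of all downloads to a single replica are illustrative assumptions rather than the exact FedLess configuration.

import redisai as rai

# Writes (Model Upload) always go to the primary node, while the latest
# global model (Model Download) is fetched from a read-only replica.
primary = rai.Client(host="redis-primary", port=6379)
replica = rai.Client(host="redis-replica-1", port=6379)


def download_global_model(num_layers: int):
    # Read each layer of the current global model from the replica.
    return [replica.tensorget(f"global:layer:{i}") for i in range(num_layers)]


def upload_client_result(client_id: str, weights) -> None:
    # Store the client's updated layers on the primary node as tensors.
    for i, layer in enumerate(weights):
        primary.tensorset(f"{client_id}:layer:{i}", layer)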
For the test setting with the model of size 52 MB, we follow the exact same approach and execute performance tests for two distinct test cases with 50 and 100 involved clients. Starting with 50 clients, we scale the experiments to 100, using three different replication configurations, namely without any replica nodes, with 2, as well as with 5 replicas. The evaluation results are shown in Table 5.3.
Within the 30 clients’ case, we have the following running times for the Model Download -
2.457766 seconds for 0 replicas, 1.947995 seconds for 2, and 1.305657 seconds for 5, in line
with the observations from above. Thus, adding more replication nodes to the system seems
to have a positive effect on the read times of the system, reducing the running time of the model download. Similarly, in the 100 clients' case, the best performance in terms of the Model Download event is achieved by the system with 5 replication nodes.
Exactly the opposite holds for the model upload times. We get the best running time performance with no replica nodes, 2.567118 seconds in the 30 clients' case, and the worst once we add 5 nodes, 8.949222 seconds for the same test case. Overall, we see that even though it ends up with the worst model download running times, the system with 0 replication nodes seems to deliver the best total running times.
Table 5.3: Detailed clients' running times for Redis with different numbers of replication nodes involved and a model size of 52 MB
With the implementation for the second research question of this study (RQ 2), we were able to address the issue of the Redis-based system discovered in the scope of RQ 1, namely the slow read operations. With the use of replication nodes, we were able to decrease the running times of the Model Download event, which happens at the client-side of the system once a particular client has been selected to participate in an FL training round.
Nonetheless, we identified that this improvement comes at a certain cost related to the write operations. By improving the download times, we worsen the upload running times. Introducing more and more replication nodes, the Model Upload event shows increasingly worse performance. In fact, the more replication nodes we involve, the better the read operation performance becomes, but the worse the write performance gets. Looking at the test cases with a large number of clients, we see that having multiple replication nodes even increases the total running times and, thus, decreases the general performance of the system. As a result of these observations, we do not see any general improvement of the system with the use of replication and find no benefit in introducing this approach. In fact, it is just the opposite: not only does it not lead to a reduction in the total round duration, but it also introduces an additional configuration overhead. Thus, we would rather classify it as a downside than as an actual performance improvement.
Test setting 1: Model of size 26 MB Table 5.4 shows the monthly cost estimations for the use of FedLess with a 26 MB model size. For the case of 50 clients, we obtain a total runtime of 4153.73 seconds with MongoDB and a runtime roughly half as long, exactly 2037.80 seconds, with Redis. As a result, given the parameters from above, FedLess with Redis as a parameter server is roughly two times cheaper than the solution with MongoDB. With Redis, FedLess ends up with 15.53 USD/month against the 30.87 USD/month with MongoDB. In the large-scale scenario, we observe an even higher price difference, namely 572.04 USD/month against 172.96 USD/month in favor of Redis.
Table 5.4: Monthly cost estimation comparisons between MongoDB and Redis for 26 MB
model
The last column of the table displays the cost savings that we were able to achieve with the Redis solution in each test case. On average, we achieve a 45.52% cost reduction with Redis.
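These monthly estimates are derived from the measured total runtimes and the provider's function pricing. A minimal sketch of such a calculation, following the general Google Cloud Functions pricing structure of a per-invocation fee plus memory-time charges, is shown below; the rates and the assumed number of training sessions per month are illustrative placeholders, not the exact parameters behind Table 5.4.

PRICE_PER_GB_SECOND = 0.0000025    # assumed compute rate in USD, placeholder
PRICE_PER_INVOCATION = 0.0000004   # assumed per-invocation rate in USD, placeholder


def monthly_cost(total_runtime_s: float, memory_mb: int,
                 invocations: int, sessions_per_month: int) -> float:
    # Memory-time charge (GB-seconds) plus the invocation fee, scaled by the
    # assumed number of training sessions executed per month.
    gb_seconds = total_runtime_s * (memory_mb / 1024.0)
    per_session = (gb_seconds * PRICE_PER_GB_SECOND
                   + invocations * PRICE_PER_INVOCATION)
    return per_session * sessions_per_month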
Test setting 2: Model of size 52 MB Table 5.5 shows the calculated monthly cost compar-
isons of FedLess for a model of size 52 MB. Here we use the exact same parameters as in the
first test setting with the smallest model.
Table 5.5: Monthly cost estimation comparisons between MongoDB and Redis for 52 MB
model
We identify roughly the same pattern as with the smallest model, namely that Redis is roughly half as expensive as MongoDB on a monthly basis in US dollars. For the case of 50 clients involved in an FL training round, we obtain a total duration of 5619.773967 seconds for MongoDB and 3176.863725 seconds for Redis and, thus, costs of 83.00 USD/month and 47.58 USD/month, respectively. Increasing the number of involved clients increases the total round duration and, thus, the monthly costs, in accordance with the pricing predefined by the Google Cloud Platform.
Even though these estimations are based on the exact pricing by a single public cloud
provider and, thus, are specific to Google Cloud, we expect to observe the exact same pattern
and, therefore, similar results, when looking at other prominent public cloud providers such
as AWS and Azure, for instance.
Similar to the setting with the model of size 26 MB, we take a closer look at the cost savings with the Redis approach. The last column of Table 5.5 displays the exact cost savings per test case, defined by the number of clients involved in the FL training. On average, we achieve a total cost reduction of 52.04% by replacing the parameter server in FedLess with Redis.
Test setting 3: Model of size 220 MB In Table 5.6 we present the cost results for the biggest model of size 220 MB. Here we are not able to compare the costs between the two parameter server options, as we only execute measurements with Redis in this test setting. For these experiments, we use two different test cases, defined by the number of clients involved, namely 50 and 100 clients.
No. clients    Redis costs (USD/month)
50             203.50
100            621.04
Table 5.6: Monthly FedLess cost estimation with Redis for 220 MB model
In this scenario, we slightly adjust the configurations as the functions require an increased
amount of memory, namely 4096 MB, instead of the previously used 2048 MB as a result of
the increased model size, which increases the costs for using FedLess. Having 50 clients gives
a total running time of 11881.69 seconds and, therefore, an estimated monthly cost of 203.50
USD. In the larger case, including 100 clients, we end up with 36442.90621 seconds of total
runtime resulting in 621.04 USD monthly costs for the execution of FedLess, in line with the
pricing by the provider.
5.3 Discussion
In summary, we were able to evaluate the system's performance with the new parameter server technology and, thus, reveal several interesting findings. Firstly, within the first part of the work and in the scope of the first research question (RQ 1), we were able to clearly outperform the existing MongoDB parameter server solution in FedLess in terms of the communication that occurs within an FL training round. By implementing the parameter server with Redis, the system executes multiple client uploads much faster than with the current parameter server solution. Within our experiments, we scaled the number of clients involved in an FL training round to 300 and demonstrated the ability of the Redis implementation to handle many simultaneous model uploads in a significantly reduced amount of time. We observed this improvement in various test cases, covering not only the above-mentioned large number of clients, but also much smaller scales starting with a single client.
Moreover, by artificially increasing the model sizes, we showed that with the use of Redis, FedLess is able to handle models of much larger size in a stable manner. We demonstrated this with three different model sizes: 26 MB, 52 MB, and 220 MB. In addition, we evaluated the system with replicated parameter servers, aiming at increased performance for model downloads. Even though we could achieve reduced running times for the read operations, we identified that this comes at the cost of worse write times. Nonetheless, as a result of the improved overall performance, we were able to significantly reduce the training costs. As the framework should support and work with client functions that can be deployed across a diverse set of cloud providers, being able to achieve a cost reduction is an important improvement.
6 Conclusion and Future Work
Initially introduced by Chadha, Jindal, and Gerndt, FedKeeper is a novel solution in the Federated Learning (FL) context. Applying serverless computing principles, the authors address common issues in the FL setting such as efficient FL-client management. By making use of Function-as-a-Service (FaaS) infrastructure [75], FedKeeper shows promising results when evaluated on an image classification task with a varying number of clients. Its successor, named FedLess, solves open problems identified in the initial framework by providing additional vital features like authentication, authorization, and differential privacy. In this work, we build on top of the foundations set by the development of these two consecutive frameworks for serverless FL.
Within the scope of this thesis, we worked on evaluating and improving the performance of
FedLess by looking at known issues and bottlenecks of the system. The latest experiments per-
formed with FedLess serve as an input and motivation for this work, showing the weaknesses
of the system and providing directions for potential improvements.
Firstly, we approached the underlying parameter server of the system, identified by previous evaluations as the main problem in terms of the system's performance due to its inability to efficiently handle multiple simultaneous model uploads. We replaced the parameter server technology with Redis, an in-memory data object store, and evaluated the system's performance by comparing the new implementation with the existing MongoDB parameter server solution in detail. Within the comparisons, we considered not only the execution duration of the participating clients in an FL training round, but also measured the aggregation process, which occurs once an aggregator function is invoked to combine the latest model updates of the clients involved in the training process. With the use of Redis, we considered two different approaches for handling the serialization of the model weights. At first, we kept the serialization technique present in the existing MongoDB solution, which serializes the model weights in a BSON format prior to their persistence, and applied it to the new Redis solution. This, however, did not provide the expected results, as we were not able to achieve a better performance of the system. In fact, quite the opposite: using the exact same weights serialization approach, the system showed even worse running times compared with the existing solution.
Then, inspired by the different Redis modules for various industry use cases, we applied redisAI, a Redis module specifically designed for applications in the ML/DL context. RedisAI maximizes computation throughput and reduces latency. The module allowed us to completely skip any serialization and deserialization steps throughout the FL training session by storing and retrieving the client results directly as tensors. With this, we were able to further reduce the execution running times of the system, achieve a better performance, and, thus, reduce the associated costs.
In summary, the new Redis-based parameter server outperforms the existing MongoDB solution by orders of magnitude and thus achieves a significant cost reduction. For the model size of 26
MB, we achieved 2.21 times better client running times and 2.97 times better performance for the aggregator, which, according to our measurements, allows up to 45.54% cost reduction in FedLess with Redis as a parameter server. Within the experiments with the bigger model of size 52 MB, we reached 3.35 times better client running times and 4.6 times better aggregator performance compared with the existing MongoDB solution, resulting in a 52.05% cost reduction on a monthly basis for the use of FedLess. Since one of the key and distinctive characteristics of FedLess is that it is cloud provider-agnostic and, thus, supposed to interact with any kind of FaaS function, the cost reduction can be seen as an important improvement.
Despite these performance and, therefore, cost-related advancements, we identified several weaknesses of the system that could not be addressed within the scope of this work and that could be potential subjects for future improvement. Firstly, as discussed above, the use of replication did not lead to the desired outcome of an overall reduction in round duration. In fact, looking only at the model download times, the existing solution with MongoDB is superior to Redis, providing faster round training durations on average. One could investigate this problem further and apply a different replication approach, such as one with enabled cluster mode, in order to improve the running times of the clients participating in an FL training round.
Additionally, considering the aggregator's performance, we identified that the system spends nearly 99% of the whole aggregation process on the actual aggregation of the latest model parameters. This is, in fact, independent of the underlying parameter server technology and is influenced by factors such as the predefined aggregation scheme. Thus, replacing the existing aggregation scheme with a more efficient one could be considered in the future, in order to improve the aggregation performance of the framework.
In conclusion, throughout our work, we extended and evaluated a framework for serverless FL by integrating novel techniques designed for the DL/ML context. Given the current interest in the FL field, the fast growth of serverless applications in many different contexts, and the ongoing technological advancements, we expect to see many interesting findings in this area in the future.
List of Figures
2.1 Popularity of the term Federated Learning as reported by Google Trends . . . . 6
2.2 Popularity of the term Serverless as reported by Google Trends (2004 - 2022) . . 13
2.3 Cloud approaches based on the SEIP lecture from msg group for TUM [151] . 15
5.1 Client function duration compared between five different parameter server
approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.2 Client functions running time comparisons between MongoDB and Redis for a
model size of 26 MB and varying number of clients . . . . . . . . . . . . . . . . 52
5.3 Detailed client functions running time comparisons between MongoDB and
Redis for a model size of 26 MB . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.4 Detailed client functions running time comparisons between MongoDB and
Redis for a model size of 26 MB and 200 clients . . . . . . . . . . . . . . . . . . . 54
5.5 Aggregator function running time comparisons between MongoDB and Redis
for a model size of 26 MB and varying number of clients . . . . . . . . . . . . . 56
5.6 Detailed aggregator function running time comparisons between MongoDB and
Redis for a model size of 26 MB . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.7 Client functions duration comparisons between MongoDB and Redis for a model
size of 52 MB and varying number of clients . . . . . . . . . . . . . . . . . . . . 58
5.8 Detailed client functions running time comparisons between MongoDB and
Redis for a model size of 52 MB . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.9 Aggregator function duration comparisons between MongoDB and Redis for a
model size of 52 MB and varying number of clients . . . . . . . . . . . . . . . . 60
5.10 Detailed aggregator function running time comparisons between MongoDB and
Redis for a model size of 52 MB and 200 clients involved per FL training round 61
5.11 Client functions duration comparisons between models of different sizes and
varying number of clients using Redis . . . . . . . . . . . . . . . . . . . . . . . . 62
5.12 Aggregator function duration comparisons between models of different sizes and
varying number of clients using Redis . . . . . . . . . . . . . . . . . . . . . . . 63
List of Tables
5.1 Number of trainable parameters corresponding to a particular model . . . . . 48
5.2 Detailed clients’ running times for Redis with different number of replications
nodes involved and a model size of 26 MB . . . . . . . . . . . . . . . . . . . . . 65
5.3 Detailed clients’ running times for Redis with different number of replications
nodes involved and a model size of 52 MB . . . . . . . . . . . . . . . . . . . . . 67
5.4 Monthly cost estimation comparisons between MongoDB and Redis for 26 MB
model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.5 Monthly cost estimation comparisons between MongoDB and Redis for 52 MB
model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.6 Monthly FedLess cost estimation with Redis for 220 MB model . . . . . . . . . . 70
Bibliography
[1] M. Anderson. Technology device ownership: 2015. Available at: https://www.pewresearch.org/internet/2015/10/29/technology-device-ownership-2015/. Accessed on 13/07/2022.
[2] W. Y. B. Lim, N. C. Luong, D. T. Hoang, Y. Jiao, Y.-C. Liang, Q. Yang, D. Niyato,
and C. Miao. “Federated Learning in Mobile Edge Networks: A Comprehensive
Survey”. In: IEEE Communications Surveys Tutorials 22.3 (2020), pp. 2031–2063. doi:
10.1109/COMST.2020.2986024.
[3] E. Ahmed, A. Gani, M. Sookhak, S. H. A. Hamid, and F. Xia. “Application optimization
in mobile cloud computing: Motivation, taxonomies, and open challenges”. In: Journal
of Network and Computer Applications 52 (2015), pp. 52–68. issn: 1084-8045. doi: https:
//doi.org/10.1016/j.jnca.2015.02.003. url: https://www.sciencedirect.com/
science/article/pii/S1084804515000417.
[4] G. Marinoni, H. van’t Land, and T. Jensen. “The impact of COVID-19 on higher
education around the world”. In: (2020). url: https://www.iau-aiu.net/IMG/pdf/
iau_covid19_and_he_survey_report_final_may_2020.pdf.
[5] O. Zawacki-Richter. “The current state and impact of Covid-19 on digital higher educa-
tion in Germany”. In: Human Behavior and Emerging Technologies 3.1 (2021), pp. 218–226.
doi: https://doi.org/10.1002/hbe2.238. eprint: https://onlinelibrary.wiley.
com/doi/pdf/10.1002/hbe2.238. url: https://onlinelibrary.wiley.com/doi/abs/
10.1002/hbe2.238.
[6] J. Poushter. Smartphone ownership and internet usage continues to climb in emerging economies. Available at: https://www.pewresearch.org/global/2016/02/22/smartphone-ownership-and-internet-usage-continues-to-climb-in-emerging-economies/. Accessed on 13/07/2022. 2016.
[7] Mobile Fact Sheet. Available at: https://www.pewresearch.org/internet/fact-sheet/mobile/. 2021.
[8] M. Chiang and T. Zhang. “Fog and IoT: An Overview of Research Opportunities”.
In: IEEE Internet of Things Journal 3.6 (2016), pp. 854–864. doi: 10.1109/JIOT.2016.
2584538.
[9] K. L. Lueth. State of the IoT 2020: 12 billion IoT connections, surpassing non-IoT for the first
time. Available https://iot-analytics.com/state-of-the-iot-2020-12-billion-
iot-connections-surpassing-non-iot-for-the-first-time/. 2020.
[32] B. Liu, L. Wang, and M. Liu. “Lifelong Federated Reinforcement Learning: A Learning
Architecture for Navigation in Cloud Robotic Systems”. In: IEEE Robotics and Automa-
tion Letters 4.4 (Oct. 2019), pp. 4555–4562. issn: 2377-3774. doi: 10.1109/lra.2019.
2931179. url: http://dx.doi.org/10.1109/LRA.2019.2931179.
[33] S. Lu, Y. Yao, and W. Shi. “Collaborative Learning on the Edges: A Case Study
on Connected Vehicles”. In: 2nd USENIX Workshop on Hot Topics in Edge Computing
(HotEdge 19). Renton, WA: USENIX Association, July 2019. url: https://www.usenix.
org/conference/hotedge19/presentation/lu.
[34] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall Press,
One Lake Street Upper Saddle River, NJ, United States, 2010. isbn: 978-0-13-604259-4.
[35] P. Flach. “Machine Learning: The art and science of algorithms that make sense of
data”. English. In: (Sept. 2012).
[36] D. E. Sorkin. “Technical and Legal Approaches to Unsolicited Electronic Mail”. In: Uni-
versity of San Francisco Law Review 35 (2001). Available at SSRN: https://ssrn.com/abstract=265768,
pp. 325–384.
[37] Y. LeCun, Y. Bengio, and G. Hinton. “Deep learning”. In: Nature 521.7553 (May 2015),
pp. 436–444. issn: 1476-4687. doi: 10.1038/nature14539. url: https://doi.org/10.
1038/nature14539.
[38] I. H. Witten, E. Frank, and M. A. Hall. Data Mining: Practical Machine Learning Tools
and Techniques. 3rd. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2011.
isbn: 0123748569.
[39] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to
Algorithms. USA: Cambridge University Press, 2014. isbn: 1107057132.
[40] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. http://www.deeplearningbook.
org. MIT Press, 2016.
[41] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural
Language Processing (almost) from Scratch. 2011. arXiv: 1103.0398 [cs.LG].
[42] A. Bordes, S. Chopra, and J. Weston. Question Answering with Subgraph Embeddings.
2014. arXiv: 1406.3676 [cs.CL].
[43] S. Jean, K. Cho, R. Memisevic, and Y. Bengio. On Using Very Large Target Vocabulary for
Neural Machine Translation. 2015. arXiv: 1412.2007 [cs.CL].
[44] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to Sequence Learning with Neural Networks.
2014. arXiv: 1409.3215 [cs.CL].
[45] S. Abdulrahman, H. Tout, H. Ould-Slimane, A. Mourad, C. Talhi, and M. Guizani. “A
Survey on Federated Learning: The Journey From Centralized to Distributed On-Site
Learning and Beyond”. In: IEEE Internet of Things Journal 8.7 (2021), pp. 5476–5497.
doi: 10.1109/JIOT.2020.3030072.
[69] C. Chen, H. Xu, W. Wang, B. Li, B. Li, L. Chen, and G. Zhang. “Communication-
Efficient Federated Learning with Adaptive Parameter Freezing”. In: 2021 IEEE 41st
International Conference on Distributed Computing Systems (ICDCS). 2021, pp. 1–11. doi:
10.1109/ICDCS51616.2021.00010.
[70] E. Diao, J. Ding, and V. Tarokh. “HeteroFL: Computation and Communication Efficient
Federated Learning for Heterogeneous Clients”. In: (Oct. 2020).
[71] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang.
“Deep Learning with Differential Privacy”. In: Proceedings of the 2016 ACM SIGSAC
Conference on Computer and Communications Security (Oct. 2016). doi: 10.1145/2976749.
2978318. url: http://dx.doi.org/10.1145/2976749.2978318.
[72] A. Pantelopoulos and N. G. Bourbakis. “A Survey on Wearable Sensor-Based Systems
for Health Monitoring and Prognosis”. In: IEEE Transactions on Systems, Man, and
Cybernetics, Part C (Applications and Reviews) 40.1 (2010), pp. 1–12. doi: 10.1109/TSMCC.
2009.2032660.
[73] S. Samarakoon, M. Bennis, W. Saad, and M. Debbah. “Federated Learning for Ultra-
Reliable Low-Latency V2V Communications”. In: 2018 IEEE Global Communications
Conference (GLOBECOM) (Dec. 2018). doi: 10 . 1109 / glocom . 2018 . 8647927. url:
http://dx.doi.org/10.1109/GLOCOM.2018.8647927.
[74] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz,
Z. Charles, G. Cormode, R. Cummings, R. G. L. D’Oliveira, H. Eichner, S. E. Rouayheb,
D. Evans, J. Gardner, Z. Garrett, A. Gascón, B. Ghazi, P. B. Gibbons, M. Gruteser, Z.
Harchaoui, C. He, L. He, Z. Huo, B. Hutchinson, J. Hsu, M. Jaggi, T. Javidi, G. Joshi, M.
Khodak, J. Konečný, A. Korolova, F. Koushanfar, S. Koyejo, T. Lepoint, Y. Liu, P. Mittal,
M. Mohri, R. Nock, A. Özgür, R. Pagh, M. Raykova, H. Qi, D. Ramage, R. Raskar,
D. Song, W. Song, S. U. Stich, Z. Sun, A. T. Suresh, F. Tramèr, P. Vepakomma, J. Wang,
L. Xiong, Z. Xu, Q. Yang, F. X. Yu, H. Yu, and S. Zhao. Advances and Open Problems in
Federated Learning. 2021. arXiv: 1912.04977 [cs.LG].
[75] M. Chadha, A. Jindal, and M. Gerndt. “Towards Federated Learning Using FaaS
Fabric”. In: Proceedings of the 2020 Sixth International Workshop on Serverless Computing.
WoSC’20. Delft, Netherlands: Association for Computing Machinery, 2020, pp. 49–54.
isbn: 9781450382045. doi: 10.1145/3429880.3430100. url: https://doi.org/10.
1145/3429880.3430100.
[76] J. Jiang, S. Gan, Y. Liu, F. Wang, G. Alonso, A. Klimovic, A. Singla, W. Wu, and C.
Zhang. “Towards Demystifying Serverless Machine Learning Training”. In: Proceedings
of the 2021 International Conference on Management of Data (2021).
[77] S. Allen, L. Browning, L. Calcote, A. Chaudhry, D. Davis, L. Fourie, A. Gulli, Y. Haviv,
D. Krook, O. Nissan-Messing, C. Munns, K. Owens, M. Peek, C. Zhang, and C.A.
"CNCF WG-Serverless Whitepaper v1.0". In: CNCF, Tech. Rep. (2018). [Online]. Available: https://github.com/cncf/wg-serverless/blob/master/whitepapers/serverless-overview/cncf_serverless_whitepaper_v1.0.pdf.
[90] V. Patel, S. Kanani, T. Pathak, P. Patel, M. I. Ali, and J. Breslin. “An Intelligent
Doorbell Design Using Federated Deep Learning”. In: 8th ACM IKDD CODS and
26th COMAD. CODS COMAD 2021. Bangalore, India: Association for Computing
Machinery, 2021, pp. 380–384. isbn: 9781450388177. doi: 10.1145/3430984.3430988.
url: https://doi.org/10.1145/3430984.3430988.
[91] S. Doomra, N. Kohli, and S. Athavale. “Turn Signal Prediction: A Federated Learning
Case Study”. In: (Dec. 2020).
[92] S. Ramaswamy, R. Mathews, K. Rao, and F. Beaufays. Federated Learning for Emoji
Prediction in a Mobile Keyboard. 2019. arXiv: 1906.04329 [cs.CL].
[93] T. S. Brisimi, R. Chen, T. Mela, A. Olshevsky, I. C. Paschalidis, and W. Shi. “Federated
learning of predictive models from federated Electronic Health Records”. In: Interna-
tional Journal of Medical Informatics 112 (2018), pp. 59–67. issn: 1386-5056. doi: https:
//doi.org/10.1016/j.ijmedinf.2018.01.007. url: https://www.sciencedirect.
com/science/article/pii/S138650561830008X.
[94] S. Silva, B. A. Gutman, E. Romero, P. M. Thompson, A. Altmann, and M. Lorenzi.
“Federated Learning in Distributed Medical Databases: Meta-Analysis of Large-Scale
Subcortical Brain Data”. In: 2019 IEEE 16th International Symposium on Biomedical
Imaging (ISBI 2019) (Apr. 2019). doi: 10.1109/isbi.2019.8759317. url: http://dx.
doi.org/10.1109/ISBI.2019.8759317.
[95] L. Huang, Y. Yin, Z. Fu, S. Zhang, H. Deng, and D. Liu. LoAdaBoost: loss-based AdaBoost
federated machine learning with reduced computational complexity on IID and non-IID
intensive care data. 2018. arXiv: 1811.12629 [cs.LG].
[96] J. Xu, B. S. Glicksberg, C. Su, P. Walker, J. Bian, and F. Wang. “Federated Learning
for Healthcare Informatics”. In: Journal of Healthcare Informatics Research 5.1 (Mar.
2021), pp. 1–19. issn: 2509-498X. doi: 10.1007/s41666- 020- 00082- 4. url: https:
//doi.org/10.1007/s41666-020-00082-4.
[97] N. Rieke, J. Hancox, W. Li, F. Milletarì, H. R. Roth, S. Albarqouni, S. Bakas, M. N.
Galtier, B. A. Landman, K. Maier-Hein, S. Ourselin, M. Sheller, R. M. Summers,
A. Trask, D. Xu, M. Baust, and M. J. Cardoso. “The future of digital health with
federated learning”. In: npj Digital Medicine 3.1 (Sept. 2020), p. 119. issn: 2398-6352.
doi: 10.1038/s41746-020-00323-1. url: https://doi.org/10.1038/s41746-020-
00323-1.
[98] H. R. Roth, K. Chang, P. Singh, N. Neumark, W. Li, V. Gupta, S. Gupta, L. Qu, A.
Ihsani, B. C. Bizzo, Y. Wen, V. Buch, M. Shah, F. Kitamura, M. Mendonça, V. Lavor, A.
Harouni, C. Compas, J. Tetreault, P. Dogra, Y. Cheng, S. Erdal, R. White, B. Hashemian,
T. Schultz, M. Zhang, A. McCarthy, B. M. Yun, E. Sharaf, K. V. Hoebel, J. B. Patel, B.
Chen, S. Ko, E. Leibovitz, E. D. Pisano, L. Coombs, D. Xu, K. J. Dreyer, I. Dayan, R. C.
Naidu, M. Flores, D. Rubin, and J. Kalpathy-Cramer. “Federated Learning for Breast
Density Classification: A Real-World Implementation”. In: Lecture Notes in Computer
[111] Q. Yang, Y. Liu, T. Chen, and Y. Tong. “Federated Machine Learning: Concept and
Applications”. In: (2019). doi: 10.48550/ARXIV.1902.04885. url: https://arxiv.
org/abs/1902.04885.
[112] A. Gascón, P. Schoppmann, B. Balle, M. Raykova, J. Doerner, S. Zahur, and D. Evans.
“Secure Linear Regression on Vertically Partitioned Datasets”. In: IACR Cryptol. ePrint
Arch. 2016 (2016), p. 892.
[113] R. C. Geyer, T. Klein, and M. Nabi. Differentially Private Federated Learning: A Client
Level Perspective. 2017. doi: 10.48550/ARXIV.1712.07557. url: https://arxiv.org/
abs/1712.07557.
[114] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. “Distributed Optimization
and Statistical Learning via the Alternating Direction Method of Multipliers”. In:
Foundations and Trends in Machine Learning 3 (Jan. 2011), pp. 1–122. doi: 10.1561/
2200000016.
[115] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J.
Shekita, and B.-Y. Su. “Scaling Distributed Machine Learning with the Parameter
Server”. In: Proceedings of the 11th USENIX Conference on Operating Systems Design and
Implementation. OSDI’14. Broomfield, CO: USENIX Association, 2014, pp. 583–598. isbn:
9781931971164.
[116] K. Hsieh, A. Phanishayee, O. Mutlu, and P. B. Gibbons. The Non-IID Data Quagmire of
Decentralized Machine Learning. 2019. arXiv: 1910.00189 [cs.LG].
[117] K. Hsieh, A. Harlap, N. Vijaykumar, D. Konomis, G. R. Ganger, P. B. Gibbons, and O.
Mutlu. “Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds”. In: 14th
USENIX Symposium on Networked Systems Design and Implementation (NSDI 17). Boston,
MA: USENIX Association, Mar. 2017, pp. 629–647. isbn: 978-1-931971-37-9. url: https:
//www.usenix.org/conference/nsdi17/technical-sessions/presentation/hsieh.
[118] J. Huang, F. Qian, Y. Guo, Y. Zhou, Q. Xu, Z. M. Mao, S. Sen, and O. Spatscheck.
“An In-Depth Study of LTE: Effect of Network Protocol and Application Behavior on
Performance”. In: Proceedings of the ACM SIGCOMM 2013 Conference on SIGCOMM.
SIGCOMM ’13. Hong Kong, China: Association for Computing Machinery, 2013,
pp. 363–374. isbn: 9781450320566. doi: 10.1145/2486001.2486006. url: https://doi.
org/10.1145/2486001.2486006.
[119] C. H. ( van Berkel. “Multi-Core for Mobile Phones”. In: Proceedings of the Conference on
Design, Automation and Test in Europe. DATE ’09. Nice, France: European Design and
Automation Association, 2009, pp. 1260–1265. isbn: 9783981080155.
[120] Z. Chai, Y. Chen, A. Anwar, L. Zhao, Y. Cheng, and H. Rangwala. FedAT: A High-
Performance and Communication-Efficient Federated Learning System with Asynchronous
Tiers. 2021. arXiv: 2010.05958 [cs.DC].
[135] AWS re:Invent 2014: Getting started with AWS Lambda. Available: https://www.youtube.com/watch?v=UFj27laTWQA. 2014.
[136] W. MJ. “Hype Cycle for Emerging Technologies”. In: Gartner, 2017. url: https :
//www.gartner.com/en/documents/3768572.
[137] Emerging Technology Analysis: Serverless Computing and Function Platform as a Service.
2016.
[138] J. M. Hellerstein, J. Faleiro, J. E. Gonzalez, J. Schleier-Smith, V. Sreekanti, A. Tumanov,
and C. Wu. Serverless Computing: One Step Forward, Two Steps Back. 2018. doi: 10.48550/
ARXIV.1812.03651. url: https://arxiv.org/abs/1812.03651.
[139] G. C. Fox, V. Ishakian, V. Muthusamy, and A. Slominski. “Status of Serverless
Computing and Function-as-a-Service (FaaS) in Industry and Research”. In: CoRR
abs/1708.08028 (2017). arXiv: 1708.08028. url: http://arxiv.org/abs/1708.08028.
[140] W. Ye, A. Khan, and E. Kendall. “Distributed network file storage for a serverless (P2P)
network”. In: The 11th IEEE International Conference on Networks, 2003. ICON2003. 2003,
pp. 343–347. doi: 10.1109/ICON.2003.1266214.
[141] D. Bryan, B. Lowekamp, and C. Jennings. “SOSIMPLE: A Serverless, Standards-
based, P2P SIP Communication System”. In: First International Workshop on Advanced
Architectures and Algorithms for Internet Delivery and Applications (AAA-IDEA’05). 2005,
pp. 42–49. doi: 10.1109/AAA-IDEA.2005.15.
[142] E. Jonas, J. Schleier-Smith, V. Sreekanti, C.-C. Tsai, A. Khandelwal, Q. Pu, V. Shankar,
J. Carreira, K. Krauth, N. Yadwadkar, J. E. Gonzalez, R. A. Popa, I. Stoica, and D. A.
Patterson. Cloud Programming Simplified: A Berkeley View on Serverless Computing. 2019.
arXiv: 1902.03383 [cs.OS].
[143] M. Roberts. “Serverless Architectures”. Available: https://martinfowler.com/articles/serverless.html. 2016.
[144] Amazon. [n.d.]. AWS Cognito. Available: https://aws.amazon.com/cognito/.
[145] Google. [n.d.]. Firebase. Available: https://firebase.google.com/.
[146] Amazon. [n.d.]. AWS DynamoDB. Available: https://aws.amazon.com/dynamodb/.
[147] M. Yan, P. Castro, P. Cheng, and V. Isahagian. “Building a Chatbot with Serverless
Computing”. In: Dec. 2016, pp. 1–4. doi: 10.1145/3007203.3007217.
[148] V. Ishakian, V. Muthusamy, and A. Slominski. “Serving deep learning models in a
serverless platform”. In: CoRR abs/1710.08460 (2017). arXiv: 1710.08460. url: http:
//arxiv.org/abs/1710.08460.
[149] P. McDonald. Introducing Google App Engine. https://googleappengine.blogspot.
com/2008/04/introducing-google-app-engine-our-new.html. 2008.
[150] Amazon Web Services. AWS Elastic Beanstalk - PaaS Application Management. https://aws.amazon.com/de/elasticbeanstalk/.
[151] Software Engineering in der Industriellen Praxis (SEIP), offered by the msg group at the Technische Universität München (TUM), Department of Informatics, Chair of Software Engineering for Business Information Systems (sebis). Available: https://seip.msg.group/repo/seip/public.
[152] H. Lee, K. Satyam, and G. Fox. “Evaluation of Production Serverless Computing
Environments”. In: 2018 IEEE 11th International Conference on Cloud Computing (CLOUD).
2018, pp. 442–450. doi: 10.1109/CLOUD.2018.00062.
[153] CNCF Serverless WG. CNCF WG-Serverless Whitepaper v1.0. https://gw.alipayobjects.
com/os/basement_prod/24ec4498-71d4-4a60-b785-fa530456c65b.pdf. 2018.
[154] P. Castro, V. Ishakian, V. Muthusamy, and A. Slominski. “Serverless Programming
(Function as a Service)”. In: 2017 IEEE 37th International Conference on Distributed
Computing Systems (ICDCS). 2017, pp. 2658–2659. doi: 10.1109/ICDCS.2017.305.
[155] R. Chard, K. Chard, J. Alt, D. Y. Parkinson, S. Tuecke, and I. Foster. “Ripple: Home
Automation for Research Data Management”. In: 2017 IEEE 37th International Conference
on Distributed Computing Systems Workshops (ICDCSW). 2017, pp. 389–394. doi: 10.
1109/ICDCSW.2017.30.
[156] I. Baldini, P. Castro, P. Cheng, S. Fink, V. Ishakian, N. Mitchell, V. Muthusamy, R. Rab-
bah, and P. Suter. “Cloud-Native, Event-Based Programming for Mobile Applications”.
In: 2016 IEEE/ACM International Conference on Mobile Software Engineering and Systems
(MOBILESoft). 2016, pp. 287–288. doi: 10.1109/MobileSoft.2016.063.
[157] N. Bila, P. Dettori, A. Kanso, Y. Watanabe, and A. Youssef. “Leveraging the Serverless
Architecture for Securing Linux Containers”. In: 2017 IEEE 37th International Conference
on Distributed Computing Systems Workshops (ICDCSW). 2017, pp. 401–404. doi: 10.
1109/ICDCSW.2017.66.
[158] P. Castro, V. Isahagian, V. Muthusamy, and A. Slominski. “The rise of serverless
computing”. In: Communications of the ACM 62 (Nov. 2019), pp. 44–54. doi: 10.1145/
3368454.
[159] The Netflix Cosmos Platform. https://netflixtechblog.com/the-netflix-cosmos-platform-35c14d9351ad.
[160] M. Kiener, M. Chadha, and M. Gerndt. “Towards Demystifying Intra-Function Paral-
lelism in Serverless Computing”. In: Proceedings of the Seventh International Workshop on
Serverless Computing (WoSC7) 2021. WoSC ’21. Virtual Event, Canada, 2021, pp. 42–49.
doi: 10.1145/3493651.3493672. url: https://doi.org/10.1145/3493651.3493672.
[161] J. Carreira, P. Fonseca, A. Tumanov, A. Zhang, and R. Katz. “Cirrus: A Serverless
Framework for End-to-End ML Workflows”. In: Proceedings of the ACM Symposium
on Cloud Computing. SoCC ’19. Santa Cruz, CA, USA: Association for Computing
Machinery, 2019, pp. 13–24. isbn: 9781450369732. doi: 10.1145/3357223.3362711. url:
https://doi.org/10.1145/3357223.3362711.