Tamerlan Gudabayev
The work was good but I always had this thought in the back of my mind:
"Man, I want to work on big systems such as ones for Google, Netflix,
etc..."
I didn't even know the term distributed systems until one of my colleagues
started talking about it.
I studied this concept of distributed systems for a while, but didn't fully
understand it until I saw it all in action a few years later.
Now that I have some experience, I would like to share with you what I
know about distributed systems.
Prerequisite Knowledge
The topics I'll be discussing here may be a bit advanced for beginner
programmers. To help you be prepared, here's what I assume you know:
Here are some resources that can help you brush up on some of these more
specific topics:
5. Saga Pattern
8. Sidecar Pattern
We used to get requests from clients and then we'd just build their site.
We had one backend service written in PHP and Yii2 (PHP Framework) and
a frontend written in JavaScript and React.
All of this was deployed to one server hosted in ps.kz (Kazakhstan Hosting
Provider) and exposed to the internet using NGINX as a web server.
This architecture works for most projects. But once your application
becomes more complex and popular, the cracks begin to show.
Complexity – Codebase is too large and too complex for one person
to mentally handle. It's also hard to create new features and
maintain old ones.
There are many ways to optimize a monolithic application, and it can go very
far. Many big tech companies such as Netflix, Google, and Facebook (Meta)
started off as monolithic applications because they're easier to launch.
But they all began facing problems with monoliths at scale and had to find a
way to fix it.
Some people mistake distributed systems for microservices. And it's true –
microservices are a distributed system. But distributed systems do not
always follow the microservice architecture.
So with that in mind, let's come up with a proper definition for distributed
systems:
That's why before migrating or starting a new project, you should ask the
question:
If you decide that you do need a distributed system, then there are some
common challenges you will face:
But you are not alone. Other smart people have faced similar problems and
offer common solutions which are called design patterns.
PS. We won't only cover patterns, but anything that helps in distributed systems.
This may include data structures, algorithms, common scenarios, etc...
Imagine that you have an application with millions of users. You have
multiple services that handle the backend, but only a single database.
The problem arises when you do reads and writes on this same database.
Writes are a lot more expensive to compute than reads, and the system
starts to suffer.
Pros
Code Simplification – Reduces system complexity by separating
writes and reads.
Cons
Code Complexity – Adds code complexity by requiring developers
to manage reads and writes separately.
Use Cases
CQRS is best used when an application's writes and reads have different
performance requirements. But it is not always the best approach, and
developers should carefully consider the pros and cons before adopting the
pattern.
Here are some use cases that utilize the CQRS pattern:
Social Media – By applying CQRS, the read models can efficiently
handle feed generation, personalized content recommendations,
and user profile queries, while the write side handles content
creation, updates, and engagement tracking.
2PC solves the problem of data consistency. When you have multiple
services talking to a relational database, it's hard to keep the data
consistent as one service can create a transaction while the other aborts it.
It works in two phases. The first phase is the Prepare phase in which the
transaction coordinator tells the service to prepare the data. Then comes
the Commit phase, which signals the service to send the prepared data, and
the transaction gets committed.
2PC systems make sure that all services are locked by default. This means
that they can't just write to the database.
While locked, the services complete the Prepare stage to get their data
ready. Then the transaction coordinator checks each service one-by-one to
see if they have any prepared data.
If they do, then the service gets unlocked and the data gets committed. If
not, then the transaction coordinator moves on to another service.
2PC ensures that only one service can operate at a time, which makes the
process more resilient and consistent than CQRS.
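The two phases can be sketched in a few lines. This is a simplified model, not a production protocol: the `Participant` class and its vote are assumptions standing in for real services, and a real coordinator would also persist its decisions so it can recover from its own crash.

```python
# Two-phase commit sketch: the coordinator asks every participant to
# prepare (phase 1); only if all vote yes does it tell them to commit
# (phase 2), otherwise everyone rolls back.

class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self._can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # Phase 1: lock resources, get the data ready, and vote.
        self.state = "prepared" if self._can_commit else "aborted"
        return self._can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"

def two_phase_commit(participants):
    # Phase 1: collect votes from every participant.
    if all(p.prepare() for p in participants):
        # Phase 2: unanimous yes, so commit everywhere.
        for p in participants:
            p.commit()
        return "committed"
    # At least one participant voted no: roll everyone back.
    for p in participants:
        p.rollback()
    return "aborted"

services = [Participant("orders"), Participant("payments")]
print(two_phase_commit(services))   # committed
```

The blocking drawback listed below is visible even in this sketch: the coordinator cannot move past phase 1 until every participant has answered.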
Pros
Data Consistency – Ensures data consistency in a distributed
transaction environment.
Cons
Blocking – The protocol can introduce delays or blocking in the
system, as it may have to wait for unresponsive participants or
resolve network issues before proceeding with the transaction.
Use Cases
2PC is best used for systems that deal with important transaction
operations that must be accurate.
Here are some use cases where the 2PC pattern would be beneficial:
Saga Pattern
So let's imagine that you have an e-commerce app that has three services,
each with its own database.
You have an API for your merchants which is called /products to which you
can add a product with all its information.
Whenever you create a product, you also have to create its price and
metadata. All three are managed in different services with different databases.
But what if you created a product but failed to create a price? How can one
service know that there was a failed transaction of another service?
Orchestration
You have a central service that calls all the different services in the right
order.
The central service makes sure that if there is a failure, it will know how to
compensate for that by reverting transactions or logging the errors.
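Here's a minimal sketch of that orchestration logic. The step and compensation functions are hypothetical stand-ins for calls to the product and price services; the point is the shape: run each step, and on failure replay the compensations for the steps that already succeeded, in reverse order.

```python
# Orchestrated saga sketch: a central coordinator runs each
# (action, compensation) pair and reverts completed steps on failure.

def run_saga(steps):
    done = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(done):
                undo()            # revert the steps that succeeded
            return "rolled back"
        done.append(compensate)
    return "completed"

log = []

def create_product():   log.append("product created")
def delete_product():   log.append("product deleted")
def create_price():     raise RuntimeError("price service is down")
def delete_price():     log.append("price deleted")

result = run_saga([(create_product, delete_product),
                   (create_price, delete_price)])
print(result)   # rolled back
print(log)      # ['product created', 'product deleted']
```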
Pros
Suitable for complex transactions that involve multiple services or
new services added over time.
Cons
Additional design complexity requires you to implement a
coordination logic.
Choreography
On the other hand, the Choreography method doesn't use a central service.
Instead, all communication between servers happens by events.
Services will react to events and will know what to do in case of success or
failure.
So for our example above, when the user creates a product it will:
1. First, the product service will create the product and emit a product-created event.
2. Then the price service will react to the event by creating a price for
the product, and it will then create another event called price-created-successfully.
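The event-driven flow above can be sketched with a tiny in-process event bus. The bus, handler names, and event names are illustrative assumptions; in practice this would be a message broker like Kafka or RabbitMQ, but the choreography idea is the same: there is no central coordinator, only services reacting to events.

```python
# Choreographed saga sketch: services subscribe to events and react;
# no central orchestrator is involved.

subscribers = {}

def subscribe(event, handler):
    subscribers.setdefault(event, []).append(handler)

def publish(event, payload):
    for handler in subscribers.get(event, []):
        handler(payload)

created = []

def on_product_created(product):
    created.append(f"price for {product}")        # price service reacts
    publish("price-created-successfully", product)

def on_price_created(product):
    created.append(f"metadata for {product}")     # metadata service reacts

subscribe("product-created", on_product_created)
subscribe("price-created-successfully", on_price_created)

publish("product-created", "keyboard")
print(created)   # ['price for keyboard', 'metadata for keyboard']
```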
Pros
Suitable for simple workflows that don't require complex
coordination logic.
Cons
Difficult to debug because it's difficult to track which saga services
listen to which commands.
This is essentially a load balancer – I don't know why they made it sound so
intimidating.
But that's not always the case – it can also route different URL paths to
different services.
So for example:
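A minimal sketch of that path-based routing, with hypothetical service addresses:

```python
# Path-based routing sketch: different URL prefixes map to different
# backend services. The addresses are illustrative.

routes = {
    "/products": "product-service:8080",
    "/payments": "payment-service:8080",
}

def route(path: str) -> str:
    for prefix, backend in routes.items():
        if path.startswith(prefix):
            return backend
    return "default-service:8080"

print(route("/products/42"))   # product-service:8080
```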
Pros
Performance – Load balancing distributes the workload evenly
across multiple resources, preventing any single resource from
becoming overloaded. This leads to improved response times,
reduced latency, and better overall performance for users or clients
accessing the system.
Cons
Complexity – Implementing and configuring load balancing can be
complicated especially for large scale systems.
You have a high traffic website and you want to spread the load so
that your servers don't fry.
You have users from all over the world and want to serve them data
from their closest location. You could have a server in Asia and
another in Europe. The load balancer would then route all users
from Asia to the Asian server and European users to the Europe
server.
I've written a bunch of articles on load balancing, so feel free to check them
out.
Load balancing 101: How it works and why it matters for your
platform
This is good for stateless services. But what if you have a stateful service?
Then a sharded approach would be more appropriate.
For example, you may have one shard service accept all caching requests
while another shard service accepts high-priority requests.
Then you could use a load balancer to route requests by URL path to the
appropriate service.
PS. Sharding is not only used for application services but can be used for
databases, caches, CDNs, etc...
Pros:
Scalability – Sharding allows you to distribute load across multiple
nodes or servers, thus enabling horizontal scaling. As your workload
increases, you can just add more shards.
Cons:
Complexity – Sharding is not easy to implement. It requires careful
planning and design to handle data distribution, consistency, and
query coordination.
Use Cases
The Sharded Services Pattern is typically used in the following scenarios:
But keep in mind that there must be careful considerations when sharding
your services as it is very complex and expensive to implement and revert.
Sidecar Pattern
If one functionality fails then this can lead to another functionality failing or
the whole service failing.
The downsides are that it adds latency to the application when we deploy
two services on different containers, and it adds complexity in terms of
hosting, deployment, and management.
The first is the application container which contains the business logic. The
second container, usually called the sidecar, is used to extend/enhance the
functionality of the application container.
You should keep in mind that the sidecar service runs in the same node as
the application container. So they share the same resources (like
filesystem, memory, network, and so on...)
An example
Let's say you have a legacy application that generates logs and saves them
in a volume (persisted data) and you want to extract them into an external
platform such as ELK.
One way to do this is to just extend the main application. But that's difficult
due to the messy legacy code.
So you decide to go with the sidecar method and develop a utility service
that:
Hooray, you haven't changed any code in the application and you extended
its functionality by plugging in a sidecar.
Heck, you can even plug this log aggregator sidecar into other applications.
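Here's a sketch of the core of such a log-shipping sidecar. The `ship_to_elk` function is a hypothetical stand-in for a real HTTP call to your ELK stack; the key idea is that the sidecar only reads the shared log file and never touches the application's code.

```python
# Sidecar sketch: the main app writes plain-text logs to a shared
# volume; the sidecar tails the file and ships each new line out.

import os
import tempfile

shipped = []

def ship_to_elk(line: str):
    shipped.append(line)       # placeholder for an HTTP call to ELK

def tail_once(path: str, offset: int) -> int:
    """Ship lines appended since `offset`; return the new offset."""
    with open(path) as f:
        f.seek(offset)
        for line in f:
            ship_to_elk(line.rstrip("\n"))
        return f.tell()

# Simulate the application container writing to the shared volume.
log_path = os.path.join(tempfile.mkdtemp(), "app.log")
with open(log_path, "w") as f:
    f.write("request handled\nrequest failed\n")

offset = tail_once(log_path, 0)
print(shipped)   # ['request handled', 'request failed']
```

A real sidecar would run `tail_once` in a loop (remembering `offset` between calls), but the shape is the same.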
Pros:
Modularity – Sidecar allows you to develop and maintain utility
functions independently.
Scalability – If there is too much load on the sidecar, you can easily
horizontally scale it by adding more containers.
Cons:
Complexity – It requires extra management of multiple containers
and their dependencies.
Use Cases
The sidecar pattern is useful when you want to add additional functionality
to the application without touching the core business logic code.
By deploying the sidecar, the core logic can remain lightweight and focus on
its primary task while the sidecar can handle additional functionality.
If need be, you can reuse the sidecar for other applications too.
Now that we know when to use this pattern, let's look at some use cases
where it is beneficial:
One day, the server crashes. Your database crashes. All the data is gone,
apart from the backups.
You sync the database with the backup, but the backup is not up to date. It's
1 day old. You sit and cry in the corner.
Durability – Ensures that the data will not be lost, even in an event
of a system failure.
So for example, let's say you created your own in-memory database called
KVStore. In case of system failure, you want to be able to recover your data.
Every time you do any transaction (SET or REMOVE), the command will be
logged into a file on the hard disk. This allows us to recover the data in case
of system failure. The memory will be flushed, but the log is still stored in
the hard drive.
Performance
If you use standard file-handling libraries in most programming languages,
you would most likely "flush" the file onto the hard disk.
Flushing every log will give you a strong guarantee of durability. But this
severely limits performance and can quickly become a bottleneck.
You could instead flush the log less often. This might improve performance,
but at the risk of losing entries from the log if the server crashes before
entries are flushed.
The best practice here is to implement techniques like Batching, to limit the
impact of the flush operation.
Data Corruption
The other consideration is that we have to make sure that corrupted log
files are detected.
To handle this, save each log record with a CRC (Cyclic Redundancy Check),
which is validated when the record is read back.
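A sketch of CRC-protected log records, using Python's standard `zlib.crc32`. The record format (checksum prefix plus payload) is an illustrative choice:

```python
# CRC-protected log records: each record stores a checksum of its
# payload, which is verified again on read.

import zlib

def encode(record: str) -> str:
    return f"{zlib.crc32(record.encode()):08x} {record}"

def decode(line: str) -> str:
    crc, record = line.split(" ", 1)
    if int(crc, 16) != zlib.crc32(record.encode()):
        raise ValueError("corrupted log record")
    return record

line = encode("SET balance 100")
print(decode(line))              # SET balance 100
```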
Storage
A single log file can be difficult to manage and can quickly consume all the
available storage. To handle this, logs are typically split into segments,
and old segments are cleaned up once they are no longer needed.
These two techniques are used together, as they complement each
other.
Duplicate Entries
WALs are append-only, meaning that you can only add data. Because of this
behavior, we might have duplicate entries. So when the log is applied, it
needs to make sure that the duplicates are ignored.
One way to solve this is to use a hashmap, where updates to the same key
are idempotent. If not, then there needs to be a mechanism to mark each
transaction with a unique identifier and detect duplicates.
Use Cases
Overall WALs are mostly used in databases but can be beneficial in other
areas:
Write-ahead logs (WALs) are widely used in various systems and databases.
Here are some common use cases for write-ahead logs:
Split-Brain Pattern
That's definitely an interesting name, isn't it? It might make you think of the
two halves of the brain.
Data Inconsistency
This will usually shut the cluster off while developers try to fix things. This
causes downtime which makes the business lose money.
For example, if the old leader had a generation number of one, then the
second leader will have a generation number of two.
The generation number is included in every request, so clients can
simply trust the leader with the highest number.
But keep in mind that the generational number must be persisted on disk.
One way to do that is using a Write Ahead Log. See, things are connected to
each other.
Pros
Data Consistency – Implementing a fix ensures that shared data
remains consistent across the distributed system.
Cons
Increased Complexity – Fixing split brain scenarios adds complexity
to the system due to the intricate logic and mechanisms required.
You send a request to update the customer's balance to $100. You send this
to all the replicas.
The request is successful for the first two replicas but the last one is down.
After a few seconds, the replica that was down got back up again but it has
the old data.
So when the unavailable node comes back alive, it can retrieve the hints and
apply them.
1. Detection – When a node fails, other nodes detect this failure and
mark that node as unavailable.
2. Hint Storage – Writes meant for the failed node are temporarily
stored as hints on the nodes that are still available.
3. Hint Delivery – When the unavailable node goes back online, it sends a
message to the other nodes requesting any hints that were made
while it was offline. The other nodes send the hints and the node
applies them.
By using this technique we ensure our data is consistent and available even
when nodes fail or become temporarily unavailable.
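The steps above can be sketched like this. The `Replica` class and the "first healthy node holds the hint" rule are simplifying assumptions; real systems like Cassandra have more elaborate hint placement and expiry.

```python
# Hinted-handoff sketch: writes for a down replica are stored as hints
# on a healthy node, then replayed when the replica comes back.

class Replica:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}
        self.hints = []            # hints held for *other* replicas

def write_all(replicas, key, value):
    for r in replicas:
        if r.up:
            r.data[key] = value
        else:
            # Store a hint on the first healthy replica instead.
            helper = next(x for x in replicas if x.up)
            helper.hints.append((r.name, key, value))

def recover(replica, helpers):
    replica.up = True
    for helper in helpers:
        kept = []
        for target, key, value in helper.hints:
            if target == replica.name:
                replica.data[key] = value   # replay the hint
            else:
                kept.append((target, key, value))
        helper.hints = kept

a, b, c = Replica("a"), Replica("b"), Replica("c")
c.up = False
write_all([a, b, c], "balance", 100)   # c is down: a holds the hint
recover(c, [a, b])
print(c.data)   # {'balance': 100}
```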
Pros
Improved data availability – Hinted handoff ensures data remains
accessible during temporary node failures by transferring
responsibilities to other nodes.
Cons
Increased complexity – Implementing hinted handoff adds
complexity to the system, making development, debugging, and
maintenance more challenging.
Use Cases
Hinted handoff is typically implemented in distributed database systems or
distributed storage systems where data availability and consistency are
crucial.
In a distributed system, you can have data partitioned into multiple nodes.
This introduces a new challenge where we have to keep the data consistent
in all nodes.
For example, if you update data on node A, the changes might not be
immediately propagated to other nodes due to all sorts of reasons.
Once it receives the latest data, it will update the node with the old data
with the new data. Hence the "repair".
Pros
Data Consistency – Read repair maintains data consistency by
automatically detecting and correcting inconsistencies between
replicas or nodes.
Cons
Increased Complexity – Implementing read repair adds complexity
to system design, development, and maintenance efforts.
Use Cases
Read repair can be beneficial in various scenarios where maintaining data
consistency across replicas or nodes is crucial.
Here are some situations where you should consider using read repair:
One service can have ten instances at one time and two at another time.
Imagine that you have a client and it wants to talk to a service. How will it
know the IP address of the service if IP addresses are dynamically created?
But how does the service registry know all this information?
The service registry then stores this information in its data store.
Now that we know what a service registry is, let's talk about the patterns of
service discovery.
The first and easiest way is for the client to call the service registry and get
information about all the available instances of a service.
But the significant drawback here is that it couples the client with the
service registry.
On the other hand, the server-side discovery forces the client to make the
request via a load balancer.
If you don't know what a load balancer is, feel free to check out my other
comprehensive article on it.
The load balancer will call the service registry and route the request to the
specific instance.
It's built into most popular cloud providers, such as AWS ELB (Elastic Load
Balancer).
The only downside is that you have another component (service registry) in
your infrastructure that you have to maintain.
Let's say you have three services, A, B, and C. They all call each other
sequentially – A calls B, which calls C.
All goes well as long as the services work. But what if one of the services is
down? Then the other services would fail. If service C is down, then B and A
would be down, too.
It also provides insight into the status of a service which helps us identify
failures more quickly.
Pros
Fault tolerance – Circuit breakers enhance system stability by
protecting against cascading failures and reducing the impact of
unavailable or error-prone dependencies.
Cons
Increased complexity – Implementing a circuit breaker adds
complexity to the system, impacting development, testing, and
maintenance efforts.
Use Cases
The circuit breaker pattern is beneficial in specific scenarios where a
system relies on remote services or external dependencies.
The leader election pattern is a pattern that gives a single thing (process,
node, thread, object) superpowers in a distributed system.
So when you have three or five nodes performing similar tasks such as data
processing or maintaining a shared resource, you don't want them to
conflict with one another (that is, contesting for resources or interfering
with each other's work).
If a low-ranked node detects that the leader has failed, it sends a signal to all
the other high-ranked nodes to take over. If none of them respond then the
node will make itself the leader.
If a lower-ranked node detects that the leader has failed, then it will request
all the other nodes to update their leader to the next highest node.
Both of the above algorithms assume that every node can be uniquely
identified.
Pros
Coordination – Leader election allows for better organization and
coordination in distributed systems by establishing a centralized
point of control.
Cons
Overhead and Complexity – Leader election introduces additional
complexity and communication overhead, increasing network traffic
and computational requirements.
Use Cases
Distributed Computing – In distributed computing systems, leader
election is crucial for coordinating and synchronizing the activities of
multiple nodes. It enables the selection of a leader responsible for
distributing tasks, maintaining consistency, and ensuring efficient
resource utilization.
Things fail for all sorts of reasons in a distributed system. This is why we
need to make sure our systems are resilient in case of failures.
Pros
Fault isolation – The bulkhead pattern contains failures within
individual services, minimizing their impact on the overall system.
Cons
Complexity – Implementing the bulkhead pattern adds architectural
complexity to the system.
Use Cases
Here are some common scenarios where the bulkhead pattern is beneficial:
Retry Pattern
Requests fail for all sorts of reasons, from faulty connections to a service
restarting after a deployment.
So let's say we have a client and a service. The client sends a request to the
service and receives a response code 500 back.
There are multiple ways to handle this according to the retry pattern.
For that, we need some sort of tracking system. We can use the circuit
breaker pattern to limit the impact of repeatedly retrying a
failed/recovering service.
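One common variant is retrying with exponential backoff, so repeated retries don't hammer a struggling service. Here's a sketch; `flaky_request` is a hypothetical stand-in for an HTTP call, and the delays are kept tiny for illustration.

```python
# Retry sketch with exponential backoff: transient failures are
# retried with growing delays; the last failure is re-raised.

import time

def retry(fn, max_attempts=3, base_delay=0.01):
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise                               # give up
            time.sleep(base_delay * 2 ** (attempt - 1))  # 0.01s, 0.02s, ...

calls = {"n": 0}

def flaky_request():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("HTTP 500")   # transient failure
    return "HTTP 200"

print(retry(flaky_request))   # HTTP 200 (after two failed attempts)
```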
Pros
Resilience and Reliability – By automatically retrying failed
requests, you increase the chance of the system recovering from
failure (resilience) and increase the chance of successful requests
(reliability).
Cons
Increased latency – When retries are attempted, there is an
inherent delay introduced in the system. If the retries are frequent
or the operations take a long time to complete, the overall latency of
the system may increase.
Use Cases
The applicability of the retry pattern depends on the specific requirements
and characteristics of the system being developed.
The pattern is most useful for handling temporary errors. But you must
also be careful in scenarios involving long delays, stateful requests, or
cases where manual intervention is required.
Now, let's look at some use cases where retrying requests help:
We have a single video compression service that takes in the video and
processes each resolution sequentially. The problem here is that it's very
slow.
1. Scatter – We will divide the task across multiple nodes, so one node
will take care of compressing the video into 240p, another into 360p,
and so on.
2. Process – Each node compresses its assigned resolution in parallel,
independently of the others.
3. Gather – Once all the nodes have finished compressing the videos,
the videos will be stored on some server and we collect the links for
all the different versions.
Pros
Parallel Processing – The scatter-gather pattern improves
performance by enabling parallel processing of subtasks. Tasks can
execute concurrently across multiple nodes or processors.
Cons
Communication Overhead – This pattern involves communication
between nodes, which introduces potential latency and network
congestion that may impact overall performance, especially with
large data volumes.
Use Cases
The scatter-gather pattern is useful in distributed systems and parallel
computing scenarios where you can divide tasks into smaller subtasks that
can be performed concurrently across multiple nodes and processors.
Here are some use cases where the scatter-gather pattern can be used:
Web Crawling – You can use the scatter-gather pattern to fetch and
crawl multiple web pages concurrently.
A Bloom filter is a data structure designed to tell you, efficiently in both
memory and speed, whether an item is in a set.
But the cost of this efficiency is that it is probabilistic. It can tell you that an
item is either:
Definitely not in the set, or
Might be in the set.
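A minimal Bloom filter sketch: k hash functions set k bits per item. If a lookup finds any bit unset, the item is definitely not in the set; if all bits are set, it might be. The size and hash count here are arbitrary illustrative choices.

```python
# Bloom filter sketch: add() sets k bits per item; might_contain()
# returns False only when the item was definitely never added.

import hashlib

class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, item: str):
        # Derive k positions by salting the item with the hash index.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("alice")
print(bf.might_contain("alice"))   # True
print(bf.might_contain("bob"))     # False (almost certainly)
```

Note that you can never remove an item: clearing its bits could also clear bits shared with other items, which is why false positives grow as the filter fills up.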
Pros
Space Efficiency – Bloom filters usually require a smaller amount of
memory to store the same number of elements compared to hash
tables or other similar data structures.
Cons
False Positives – Due to their probabilistic nature, there is a small
probability that the filter will incorrectly claim an element is present
when it is not. The probability of false positives increases as the
filter becomes more crowded or the number of elements increases.
Use Cases
Bloom filters are most useful in situations where approximate membership
tests with a low false positive rate are acceptable and where memory
efficiency is a priority.
Here are some scenarios where Bloom filters are particularly beneficial:
Conclusion
You don't need to be an expert in all of these things – I'm not. And even if you
don't ever directly use or work with some of these concepts, they are still
good to know.
Try to implement them in simple projects. It's a good idea to keep the
projects simple enough so you won't get sidetracked.
Below, I share with you my favorite resources that I used to learn about
distributed systems.
Recommended Resources
Microservices.io
Tamerlan Gudabayev
Hi, my name is Tamerlan and I'm a software engineer trying to make other software engineers more productive. If you're interested, feel free to check
out my Twitter where I share my insights.