An introduction to Distributed Systems

Jhonatan Ríchard Raphael
7 min read · Dec 16, 2019

Hi! Today I will start a series of posts about an important branch of Computer Science: Distributed Systems and related topics. But why did I decide to write about Distributed Systems? What is the relevance? What is the motivation?

Distributed systems have seen a huge rise in recent years. Subjects like mobile apps, cloud computing, analytics, data/CPU-intensive apps, and Big Data have boosted the use of distributed resources on the Internet. If you develop an application that does any kind of processing or storage on a server and communicates over the Internet, you already need to think about the challenging scenario of distributed systems.

There are several good technical reasons that have made distributed systems a very powerful option. Some examples:

  • Scalability: If your data volume, read load, or write load is too large for a single machine, you can spread the load across multiple machines.
  • Fault tolerance/high availability: If your application needs to keep running even if a machine crashes, you can use multiple machines to gain redundancy.
  • Latency: If you have users worldwide, you can have servers worldwide so that each user can be served from the nearest datacenter. This saves users from waiting too long for packets to travel over the network.

The general idea of the series is to try to answer, directly and clearly, the following question: “What is the necessary theory within Distributed Systems that every software engineer should know?”, and to apply this knowledge in our daily work. Understanding concepts such as ACID, the CAP theorem, partitioning, replication, and eventual consistency helps us make architectural decisions and become better software engineers. Well, let’s go!

What is a distributed system?

Two relatively recent inventions have helped us connect computers: more efficient microprocessors and more powerful networks. The result is that it is now possible to build a computer system consisting of a large number of nodes (computers or software processes) on a network. These computers may be geographically dispersed, forming a “distributed system”.

An interesting property is that a distributed system appears to its users as a single system (the distribution transparency principle). Implementing this collaboration between nodes is very difficult and is the core of distributed systems. Building a distributed system may not always be a good idea: there are many advantages, as we saw earlier, but also many challenges that can complicate development, and in many cases a distributed design is not feasible given the system requirements.

A distributed system has several design goals, such as easy resource sharing, distribution that is transparent to users and applications, and openness to other systems through integration components, but the main factor to be aware of is that the system must be scalable.

Before getting into scalability, I will first define two very important concepts for scalability and for distributed systems in general: synchronous and asynchronous communication. I will talk more about communication in distributed systems in another post, covering topics such as RPCs and message-oriented communication.

Synchronous communication versus asynchronous communication

In synchronous communication, the requesting service, usually referred to as the client, blocks until the response is sent by the server that implements the service. A conversation between two people is an example of synchronous communication. In the context of a software system, an example would be a client node that only shows front-end data to the user after a server node has delivered that data.

In asynchronous communication the client does not block: the message is sent without waiting for a response from the server, and the sending node is free to perform other tasks, possibly before the message has even been sent. Email is a good example: you send the message and move on, and the recipient opens and replies to it at any time after delivery.

Multithreading plays a major role in asynchronous communication. In addition to letting you run independent tasks in parallel, threads keep the application from locking up while it waits for the result of a task. Therefore, threads play an important role in both non-distributed and distributed environments.
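To make the difference concrete, here is a minimal Python sketch. The fetch_user_sync and fetch_user_async functions are hypothetical stand-ins for a remote call, and the sleeps simulate network and server time:

```python
import asyncio
import time


def fetch_user_sync(user_id: int) -> dict:
    """Hypothetical remote call: the caller blocks until the server answers."""
    time.sleep(1)  # stands in for network + server processing time
    return {"id": user_id, "name": "Alice"}


async def fetch_user_async(user_id: int) -> dict:
    """Same call, but the caller can do other work while it waits."""
    await asyncio.sleep(1)  # stands in for network + server processing time
    return {"id": user_id, "name": "Bob"}


async def main() -> None:
    # Synchronous style: nothing else happens during this second.
    user_a = fetch_user_sync(1)

    # Asynchronous style: start the request, do useful work, then collect the result.
    task = asyncio.create_task(fetch_user_async(2))
    print("rendering the rest of the page while the data is in flight...")
    user_b = await task

    print(user_a, user_b)


asyncio.run(main())
```

The same non-blocking effect can also be achieved with threads, as mentioned above; asyncio is just one convenient way to express it.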

Scalability

Getting back to scalability: it is a very common term among developers, but one that is often misunderstood. The scalability of a system can be measured along at least three different dimensions:

  • Size scalability: a system being scalable relative to its size means we can add more users and resources to the system without a noticeable loss of performance. Problems typically arise around centralized services, which can hold back an otherwise distributed system. Example: let’s assume a service is deployed on a single node. In this case, there are three possible causes of a bottleneck: computational capacity, limited by the CPUs; storage capacity, including the I/O transfer rate; and the network between the users and the centralized service;
  • Geographic scalability: in a geographically scalable system, users and resources may be far apart physically, but the communication delays are hardly noticeable. For systems that need a lot of synchronous communication this can be a big problem: such a scenario works well on LANs, where communication is fast, but fails on WANs. Example: a communication pattern that involves many interactions with database transactions (I will talk about transactions in another post). Another problem is that communication in wide area networks is less reliable than in local area networks, and we also need to deal with limited bandwidth. The effect is that solutions developed for LANs cannot easily be ported to WANs. A clear example of this would be a video streaming server;
  • Administrative scalability: an administratively scalable system is one that can be easily managed even if it spans many independent administrative organizations. A recurring problem is the need to address conflicting policies regarding the use of resources, management, and security.

Techniques for scaling a distributed system

In most cases scalability issues in distributed systems appear as performance issues caused by limited capacity of servers and networks. There are two ways to try to work around this problem:

  1. If you need to scale to higher loads, the simplest approach is to increase hardware capabilities (for example, adding memory or CPUs, or swapping in faster network modules), a solution named Scaling Up or Vertical Scaling. In this kind of shared-memory architecture, all hardware components can be treated as a single machine. The problem with the shared-memory approach is that the cost grows faster than linearly: a machine with twice the CPUs, twice the RAM, and twice the disk capacity typically costs more than twice as much. Also, due to communication bottlenecks, a machine with twice the computational power cannot handle twice the load. Another disadvantage of the shared-memory architecture is that it is limited to a single geographic location, which can increase the latency of your application. A related approach is the “shared-disk architecture”, which uses many machines with independent CPUs and memory but stores data on an array of disks shared between the machines, connected through a fast network. The problem with this architecture is that contention and the overhead of locking limit its scalability.
  2. When we talk about Scaling Out or Horizontal Scaling (shared-nothing architectures), we add additional infrastructure to our system (deploying more machines, virtual machines, or containers). In contrast to shared-memory architectures, shared-nothing architectures have gained a lot of popularity. In this approach, each machine that runs the distributed system software is called a node. Each node uses its CPU, RAM, and disks independently, and any coordination between nodes is performed at the software level over a conventional network. In this case, there are basically three techniques we can apply: hiding communication latencies, partitioning, and replication. I will give a short introduction to each of these techniques now, and I will write another post with more details on each.

a) Hiding communication latencies: applies in the case of geographic scalability. The basic idea is to wait as little as possible for responses to remote service requests and to do useful work in the meantime. This essentially means building the application to use asynchronous communication: when the response is delivered, the application is interrupted and a special handler is called to complete the previously issued request. However, asynchronous communication is not always possible given the application requirements, so the goal is to use it as much as we can to hide latencies and communication bottlenecks. If the system requirements do not allow asynchronous communication, it is recommended that you decrease the overall communication of the application, for example by moving part of the computation that runs on the server to the client.
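As a rough Python sketch of this idea, here the hypothetical remote_request call is fired off in the background, a completion handler plays the role of the “special handler” described above, and the application keeps working while the response is in flight:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def remote_request(query: str) -> str:
    """Hypothetical remote call with noticeable latency."""
    time.sleep(0.5)  # simulated round-trip time
    return f"result for {query}"


def on_response(future) -> None:
    # Special handler: runs once the response is finally delivered.
    print("handler received:", future.result())


with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(remote_request, "user:42")
    future.add_done_callback(on_response)
    # Meanwhile, the application keeps doing useful local work
    # instead of blocking on the remote response.
    print("doing other useful work while the request is in flight...")
```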

b) Partitioning and distribution (also known as sharding): this is another important scaling technique, which involves taking a component, dividing it into smaller parts, and then distributing those parts across the nodes of the system. Good examples are the Internet Domain Name System (DNS) and the World Wide Web.
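A minimal sketch of hash-based sharding, assuming three hypothetical storage nodes; each key is deterministically mapped to the node responsible for it:

```python
import hashlib

# Hypothetical storage nodes; in a real system this list would come from configuration.
NODES = ["node-a", "node-b", "node-c"]


def node_for_key(key: str) -> str:
    """Map a key to the node responsible for it using a stable hash (modulo sharding)."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[digest % len(NODES)]


for user in ["alice", "bob", "carol", "dave"]:
    print(user, "->", node_for_key(user))
```

Note that simple modulo sharding reshuffles most keys whenever a node is added or removed; real systems usually prefer consistent hashing or explicit partition tables for that reason.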

c) Replication: given that scalability issues often show up as performance degradation, it is generally a good idea to replicate components across a distributed system. Replication not only increases availability by providing redundancy, but also helps balance the load between components, leading to better performance. Also, in widely geographically dispersed systems, having a nearby copy can hide the communication latency issues mentioned earlier.
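A toy illustration in Python: every write goes to all replicas and reads are spread across them, which is the load-balancing effect described above (a real system would also have to handle failures and consistency, which come up below):

```python
import random


class ReplicatedStore:
    """Toy replication: every write goes to all replicas, reads are spread across them."""

    def __init__(self, replica_count: int) -> None:
        self.replicas = [dict() for _ in range(replica_count)]

    def write(self, key: str, value: str) -> None:
        for replica in self.replicas:  # synchronously update every copy
            replica[key] = value

    def read(self, key: str) -> str:
        replica = random.choice(self.replicas)  # any copy can serve the read
        return replica[key]


store = ReplicatedStore(replica_count=3)
store.write("greeting", "hello")
print(store.read("greeting"))
```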

Caching is a special form of replication, although the differences between the two are often subtle. Just like replication, caching means making a copy of a resource, usually in the proximity of the client accessing that resource. In contrast to replication, however, caching is a decision made by the client of a resource, not by the owner of the resource.
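To illustrate that client-side nature, here is a small sketch in which the client alone decides what to cache and for how long; the fetch function stands in for the real remote call, and the TTL is an arbitrary assumption:

```python
import time


class CachingClient:
    """The client decides what to cache and for how long; the server is unaware."""

    def __init__(self, fetch, ttl_seconds: float = 30.0) -> None:
        self.fetch = fetch      # function performing the real remote call
        self.ttl = ttl_seconds  # arbitrary freshness window chosen by the client
        self.cache = {}         # key -> (value, timestamp)

    def get(self, key: str):
        entry = self.cache.get(key)
        if entry is not None and time.time() - entry[1] < self.ttl:
            return entry[0]  # fresh enough: no network round trip
        value = self.fetch(key)  # cache miss or stale entry: go to the server
        self.cache[key] = (value, time.time())
        return value


client = CachingClient(fetch=lambda key: f"value-of-{key}")
print(client.get("profile:42"))  # first call goes to the "server"
print(client.get("profile:42"))  # second call is served from the local cache
```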

There is a serious problem with caching and replication that can affect scalability. Because we now have multiple copies of a resource, modifying one copy makes it differ from the others. Consequently, caching and replication lead to consistency problems, which we’ll cover in another post. Keeping copies consistent often requires some kind of global synchronization mechanism, and unfortunately such mechanisms are extremely difficult or even impossible to implement in a scalable manner, because network latencies impose a natural lower bound on synchronization. Consequently, scaling by replication may introduce other, inherently non-scalable, solutions.

Practice shows us that combining Partitioning, Replication, and Caching techniques with different forms of consistency often leads us to acceptable solutions for size scalability and geographic scalability. Finally, administrative scalability seems to be the most difficult problem because we need to deal with non-technical issues such as organizational policies and human collaboration.

Well, that’s it for today. Leave your comments, suggestions, or questions below. I will be back in the next post writing about architectures of distributed systems and types of communication between nodes.

See you there!
