Understanding CAP Theorem

Before diving into CAP Theorem, lets understand what’s meant by Distributed System.

Distributed System

As you can imagine, we now have billions of people accessing internet. That means, your system should handle that much amount of traffic with reasonable latency for better user experience. Obviously you will try spreading the assets using CDN, scale to multiple servers across different global regions, and other optimizations. Now you have to replicate the data across these servers accodingly. There are lot of components connected over a network communicating, responding to user requests. This kind of system can be termed as “Distributed System”. If you summarize, essentially we want to handle data synchronization, and giving the right data view to the user.

Understanding CAP Theorem gives a good idea on how to think about the trade-offs based on what value that solution is going to bring. Often CAP Theorem is explained straight away and it doesn’t stick to our minds. People tend to say, you can have only two of the CAP but not all - causing many confusions. This article focuses on how you should look at CAP Theorem., and helps you appreciate the fundamentals.

Let us start with the terms mentioned in CAP Theorem.

CAP Theorem terms

Consistency - All the components should have a consistent view, meaning same data view. Lets say, I updated my profile name from X to Y. I should be seeing Y, and someone across the world is trying to see my profile name, they should see Y.

Availability - It doesn’t matter whatever happens, the system should respond to user requests. It could be a success or error, but the system should respond to the user.

Partition Tolerance - In case of any network (communication) failure between different nodes, the system should still continue to operate.

But how to look at the CAP Theorem? what does it tell us? What are the trade-offs, why do they come into picture?

CAP Theorem point of view

The way we should be looking at CAP Theorem is, from the point of Network failure, i.e., Network Partition in the distributed system, where-in one/some nodes cannot communicate to other nodes. In order to be tolerant to this failures, we have to design the system in some way. That dictates whether the system has Partition Tolerance which means it is tolerant to Network Partitions/Failures.

So, how do we make the system Partition Tolerant? We have two choices:

Choice 1: Choose Consistency - We can go for Consistency, and make the system unavailable until the network issue is resolved. Does this make sense? Yes - Cases like financial transactions systems need this kind of strategy. Imagine a withdrawl happens on a bank account, and if the other nodes in the system doesn’t have the updated view, they could allow further withdrawls making the balance negative. Even for short duration, things like these could have catastrophic problems on the economy and reconciliation is very complex.

Choice 2: Choose Availability - We can go for Availability, and make the system continue responding to user requests. Does this make sense? Yes - Cases like blog comments etc., where the immediate appearance of comment or syncing that data throughout the system may not be of high value. But responding to users is of high value, hence Availability is given priority.

There are cases, where you don’t have to go for 100% Consistency or 100% Availability. If you deep dive, you can find may be some cases with in your system require Consistency, and some can go with Availability. Combined with the domain knowledge, CAP Theorem helps in realizing what’s the best you can provide to the customer. That way with-in an area you would be going with:

Consistency + Partition Tolerance
Availability + Partition Tolerance.

Coming to the case of Consistency + Availability, that means here we are assuming there is no network failure/partition situation. Hence you can have only two together (CA, CP, CA) of the three terms (C, A, P).

Hope this gave good clarity and nice point of view of CAP Theorem. Follow for more!