Distributed Systems Primer

A distributed system is basically one giant computer. It can be defined by one application/system running across several nodes in a cluster. Instead of just running one application per server and having the entire app become unavailable if the host it's running on goes down, we can distribute the application across multiple servers, and thus make it fault-tolerant against single points of failure.

The "CAP Theorem" in distributed systems was first proposed by Professor Eric A. Brewer during a talk he gave on distributed computing in 2000. It stands for Consistency, Availability, and Partition tolerance. These three attributes explain how a distributed system behaves to a single node failing within it.

Consistency: all working nodes within a cluster will respond with the same data at the same time, regardless of which node the request goes to
Availability: all working nodes within a cluster will give a valid response
Partition tolerance: the cluster will continue to function despite any number of network segmentation between nodes

The real power behind the CAP theorem is that for any distributed system, you can only choose two of the three attributes for it to have. Remember that a cluster is just a giant distributed computer. If we lose communication with a single node, that shouldn't bring the entire system grinding to a halt. It wouldn't make sense for the whole system to fail if one node suddenly doesn't respond to the others - that would defeat the purpose of making it distributed in the first place!

Therefore, in practice, we can't really control partition tolerance - we can't choose to have 100% availability and 100% consistency all the time. So in reality, we can only choose between making a given cluster consistent, or available.

But what does that choice look like? What happens if you choose one over another?

Let's say we have two servers in a cluster. They're just happy little servers, communicating data back and forth and each servicing their own requests.

Now let's break the connection between them! Remember, our system has to be partition tolerant, or else its not really distributed, is it? What happens if a new client requests some data from Server A, who can no longer talk to Server B?

Well, if that connection wasn't broken, Server A could just verify with Server B that it has the latest data in the cluster, update itself if it wasn't, and then respond to the request. However, since Server A can no longer see B, it now has to proceed with the choice that we made when we set it up:

If we chose consistency over availability, Server A would not respond until it can re-establish a connection with Server B. It cannot verify that it's own data is accurate, so instead of sending what may be inaccurate data back to the client, it will hang indefinitely while it tries to see Server B again. This makes it consistent - it really wants to give back the right answer - but not available, it will just hang forever.

If we chose availability over consistency, Server A would respond, but with potentially inaccurate data to the client. This makes it available - we can talk to it just fine - but not consistent, it has no way of verifying if it has the latest information in the cluster.

Now, where I come from, hanging on a web page indefinitely is usually considered a bad customer experience. Typically you'll see most applications choose to sacrifice consistency and then try some fancy tricks to get as close to optimal consistency as you can. There are some interesting ways to make up for the sacrifice like eventual consistency, which can break down further into read-your-writes consistency and monotonic read consistency, but I'm going to save those topics for a later date.

For now, I want you to try to find examples of this out in the wild. Ever post a comment on social media and refreshed the page to see that it's not there? Then after freaking out that the world will never see your great commentary, you frantically refresh again to see your sweet little comment all nuzzled and warm on the web page? That social media platform chose availability over consistency.

On the contrary, if you refreshed the page and it hung for a few seconds (or in the case of a catastrophic failure - indefinitely) while it waited for all the nodes to get the same data, then that social media platform chose consistency over availability.

I hope that learning about the CAP theorem was somewhat enlightening, even if there are great debates in the distributed system community today about the validity of such a theorem.

Have you ever seen a time where a website chose C over A or vice versa? Let me know on Twitter: @synacklair