Distributed systems are even worse. Because the individual servers are separated by a network, no single server can see the overall state of the system. By the time a server receives a message, its contents may already be stale, but that server has to make a decision anyway. On top of that, distributed systems have to deal with partial failures: servers can fail and messages can be delayed or dropped, and the system must still recover. So distributed systems really open up a whole new level of complexity (maybe that’s why I like them).
The challenge with distributed systems is managing this complexity. Most people strive to make their distributed systems as simple as possible. That’s not to say the resulting systems will be trivial; there will still be a lot of essential complexity. The goal is to minimize any additional, accidental complexity. Until we master building “simple” systems, I think that’s solid advice.
During my PhD, I co-authored a paper titled “In Search of an Understandable Consensus Algorithm”, which describes how we designed the Raft algorithm specifically for understandability. The paper argues that understandability should be a primary design objective for distributed systems.