We admire the likes of Google, Amazon, Azure, Netflix, Facebook, LinkedIn, Twitter, and Swift because systems like theirs are very difficult to build and operate. For that reason, there are only a few hundred such companies.
In my opinion, there are five things that are very difficult to get right in any large-scale distributed system built by a dev team of more than 50:
Dev is fun: creating something out of nothing is, in my mind, the purest form of art. Ops, however, is bloody. DevOps is therefore great in theory but very difficult in practice. Your most creative minds create; they don’t operate.
The classic iterative model fails miserably at scale. “Let us first build the system using a relational database management system (RDBMS), and later on we can switch the RDBMS to Cassandra.” It doesn’t work that way. First, the abstractions aren’t clean enough to be swappable. More importantly, the application semantics built on top of the persistence layer come to rely on the behavior of the underlying system. To scale, one must anticipate scale, and that is very hard.
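A hypothetical sketch of how that reliance creeps in: even behind a tidy storage interface, callers come to depend on RDBMS-shaped guarantees, such as multi-row transactions, that a store like Cassandra does not offer in the same form. All class and method names here are illustrative, not from any real codebase.

```python
from abc import ABC, abstractmethod

class Store(ABC):
    """A persistence interface that looks swappable -- but isn't."""
    @abstractmethod
    def put(self, key: str, value: dict) -> None: ...
    @abstractmethod
    def get(self, key: str) -> dict: ...

class RdbmsStore(Store):
    """Backed by an RDBMS: callers quietly inherit ACID semantics."""
    def __init__(self):
        self._rows = {}

    def put(self, key, value):
        self._rows[key] = value

    def get(self, key):
        return self._rows[key]

    def transfer(self, src, dst, amount):
        # A multi-row atomic update: trivial on an RDBMS,
        # not a native capability of a store like Cassandra.
        a, b = self.get(src), self.get(dst)
        a["balance"] -= amount
        b["balance"] += amount
        self.put(src, a)
        self.put(dst, b)

# Application code that "only uses the interface" -- except it doesn't:
# it calls transfer(), an RDBMS-shaped capability. Swapping in a
# different backend later means reworking every such call site.
store = RdbmsStore()
store.put("alice", {"balance": 100})
store.put("bob", {"balance": 0})
store.transfer("alice", "bob", 40)
print(store.get("alice")["balance"])  # 60
```

The `transfer` method is the leak: it is not part of the abstract interface, yet application code depends on it, so the abstraction was never really switchable.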
Change at scale is not easy. Many people think you can lock the system down and thereby satisfy its safety property. In fact, there have been some excellent write-ups about how locking down is actually antithetical to safety, but this is a difficult lesson to learn; trial by fire isn’t easy. The progress property is even more difficult: how does one innovate, and let one’s customers innovate, yet keep things running? At Apigee, we make hundreds of changes every month, our customers deploy thousands of new APIs every month, and the infrastructure keeps changing underneath us. Yet we have carefully learned how to guarantee availability, scale, and latency for our customers’ APIs.
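One common way to reconcile progress with safety is a gradual (canary) rollout: send a small fraction of traffic to the new version and promote it only if it performs no worse than the baseline. The sketch below is a generic illustration of that idea, not a description of Apigee’s actual rollout machinery; the thresholds are made-up examples.

```python
import random

def route(canary_fraction: float) -> str:
    """Route a request: a small fraction goes to the new version."""
    return "v2" if random.random() < canary_fraction else "v1"

def should_promote(canary_errors: int, canary_requests: int,
                   baseline_error_rate: float,
                   slack: float = 0.001) -> bool:
    """Promote only if the canary's error rate is no worse than
    the baseline's, within a small slack."""
    if canary_requests == 0:
        return False  # not enough evidence yet
    canary_rate = canary_errors / canary_requests
    return canary_rate <= baseline_error_rate + slack

# After observing 10,000 canary requests with 8 errors, against a
# baseline error rate of 0.1%:
print(should_promote(8, 10_000, 0.001))  # True: safe to promote
```

The point of the pattern is that change never stops, but each change is exposed to a bounded blast radius before it reaches everyone.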
A common attitude on development teams is “let’s focus on features, and the cost will work itself out.” What we have found helpful is to ask each team for its baseline cost (for example, if you’re building persistence on AWS, start with the AWS GB/month cost), and then for the cost of each added piece of functionality. Without this, everything is jumbled, and that is no way to build distributed systems at scale.
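The accounting can be sketched in a few lines: a baseline cost for the service as it stands, plus a marginal cost attributed to each feature. All rates below are made-up placeholders, not real AWS prices.

```python
# Assumed example rates -- placeholders, not real AWS pricing.
GB_MONTH_RATE = 0.10       # $/GB-month of storage
INSTANCE_MONTH_RATE = 70.0 # $/instance-month of compute

def baseline_cost(stored_gb: float) -> float:
    """Cost of the persistence service before any added features."""
    return stored_gb * GB_MONTH_RATE

def feature_cost(extra_gb: float, extra_instances: int) -> float:
    """Marginal cost of one added piece of functionality."""
    return extra_gb * GB_MONTH_RATE + extra_instances * INSTANCE_MONTH_RATE

# A team's answer might look like: "our baseline is 500 GB, and
# audit logging adds 200 GB plus one instance per month."
total = baseline_cost(500) + feature_cost(200, 1)
print(f"${total:.2f}/month")  # $140.00/month
```

The value is less in the arithmetic than in the discipline: every feature arrives with a cost attached, so trade-offs are visible before the bill is.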
Minimizing dependencies between teams is, of course, not a technology problem, though technology helps. Clean APIs between team microservices help. But dependencies naturally arise, at the very least when some teams provide platform services, such as the core persistence services at Apigee. Furthermore, there are always tensions: commonality of skills (can one team use MySQL and another Postgres?) or agreement on the look and feel of the technology.
It’s important to note that there’s no magic wand here; these are lessons from what we have observed and experienced. Eventually, we want to get into the top 10, but that is still a few years away, with many more lessons to be learned.
Anant Jhingran, CTO, Apigee