Advanced systems require advanced systems engineers


I’ve been nostalgic recently. Over the last few weeks, I’ve been reflecting on changes in the technology industry from when we first started our careers up to now. It goes without saying that we’ve come a long way. Two different but overlapping spheres in particular have gone through extensive change: technology and methodology. The systems many of us worked on when we first started out were the first generations of client-server applications. They were fundamentally different from the prior generation, in which terminals connected to centralised apps running on mainframe or midrange systems. A learning curve emerged. Suddenly, engineers had to understand the logic of their application client as well as the server powering it. New issues needed to be considered in order to manage these systems effectively, including connectivity, the transmission of data, security, latency and performance, and the synchronisation of state between the client and the server.

This increase in sophistication spawned commensurate increases in the complexity of the methodologies and skills required to manage those systems. New types of systems meant new skills: understanding new tools, frameworks, and programming languages. We can trace back to this moment the emergence of numerous specialisations that had previously been concentrated in single roles: front-end engineers, back-end engineers, data scientists, designers, UX/UI specialists, and myriad other specialities. We can perhaps also trace back to this period the construction of more siloed functions and the increased complexity in transitions between those silos. These are the silos that the DevOps and SRE communities are attempting to dismantle today.

Since the first generation of client-server systems, we’ve seen significant evolution, much of it driven by technology becoming mission critical to doing business, for any business in every industry. This has been coupled with customer demand for fast, immediate functionality available on devices, delivered seamlessly across different geographies and fabrics. Take, for example, the evolution from renting videos at the corner video store to streaming on Netflix, Hulu, and their peers. Our expectation of latency for the delivery of content has dropped from hours or minutes to seconds. We expect that content to be available to us 24x7x365 on every device we own and in every location, from our homes and offices to being on the move. We, as customers, also don’t care about the infrastructure or the complexity of the systems required to deliver this: we just want to binge watch the new season of Making A Murderer.

Each iteration of this evolution has required change in the technology and systems, and in the skills we need to build and manage them. In almost every case, those changes have introduced more complexity. The skills and knowledge we once needed to manage our client-server systems are vastly different from those needed for modern distributed systems, with their requirements for resilience, low latency, and high availability. So, what do we need to know now that we didn’t before?

Building for a better future

As practitioners, we’ve had to build better. With availability and resilience being prime concerns, an application’s minimum viable product has had to be redefined. Good design goals now have to include a baseline architecture for operability, security, performance, and observability. Every engineer, from a front-end engineer working on a React component to a back-end engineer building a distributed data store, needs to consider how their piece of the system will impact the overall system.

This is especially true because the performance demands of our users have created new constraints on the computational models and state management strategies available to our systems. Computational models are turning to serverless and edge computing architectures to reduce latency for users. The new lesson we’ve learned: it’s always more efficient to perform computations as close to the end user as possible.

This is also true for state management. Applications are being deployed from inception with distributed state, shared storage, and possibly even the migration of data (or some segment of data) from centralised stores into the edge and the cloud. But being closer to the end user enables faster decisions at the expense of greatly increasing the complexity of our applications.

Both of these constraints mean engineers need to understand how their part of the stack pairs with the other pieces and what implications a seemingly small change might have for the overall system. And when this can’t be modelled mentally, due to complexity or lack of insight into the systems, then it has to be modelled programmatically via observability, instrumentation, tracing, and tests.

We can no longer rely on simplistic probing alone to identify failures or to provide sufficient information to debug faults. Applications with complex architectures and distributed state may look fully functional to probes yet not be performing optimally or accurately for end users. Even when looking at metrics and events, which in turn require correlation and levelling across disparate systems, we struggle to gain a full picture, as traditional approaches and even calculations of latency are less accurate for distributed systems.
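One way to see how a simple check can miss what users experience is to compare an average with percentiles. The sketch below is purely illustrative: the latency samples, the 200 ms threshold, and the `percentile` helper are all assumptions made for the example, not figures from any real system.

```python
import statistics


def percentile(samples, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, avoiding floating-point rounding.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds:
# 95 fast responses and 5 very slow ones.
latencies_ms = [20] * 95 + [2000] * 5

mean_ms = statistics.mean(latencies_ms)  # 119
p50_ms = percentile(latencies_ms, 50)    # 20
p99_ms = percentile(latencies_ms, 99)    # 2000

# A simplistic probe (assumed threshold: mean under 200 ms) reports "healthy"
# even though the slowest requests take two seconds.
naive_probe_healthy = mean_ms < 200

print(f"mean={mean_ms}ms p50={p50_ms}ms p99={p99_ms}ms healthy={naive_probe_healthy}")
```

In this made-up sample the mean-based probe passes while one request in twenty is a hundred times slower than the median, which is the kind of gap between "looks functional" and "works for end users" described above.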

The instrumentation of your applications is now a mandatory step in the development process and no longer an afterthought. Every engineer needs to consider how to articulate the state, performance, and observability of their aspects of the system. This requires engineers to develop the skills and adopt the techniques to ship these new capabilities.
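As a rough illustration of what building instrumentation in from the start can look like, here is a minimal sketch using only the Python standard library. The `instrumented` decorator, the in-memory `METRICS` store, and the `lookup_user` function are hypothetical names invented for this example; a real system would more likely export to a dedicated metrics or tracing backend.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-memory metrics store, keyed by operation name.
METRICS = defaultdict(lambda: {"calls": 0, "errors": 0, "total_seconds": 0.0})


def instrumented(name):
    """Record call count, error count, and cumulative duration for a function."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                METRICS[name]["errors"] += 1
                raise
            finally:
                # Runs on both success and failure.
                METRICS[name]["calls"] += 1
                METRICS[name]["total_seconds"] += time.perf_counter() - start
        return wrapper
    return decorator


@instrumented("lookup_user")
def lookup_user(user_id):
    # Stand-in for a real back-end call.
    if user_id < 0:
        raise ValueError("user_id must be non-negative")
    return {"id": user_id}


lookup_user(1)
try:
    lookup_user(-1)
except ValueError:
    pass

print(dict(METRICS["lookup_user"]))
```

The point of the sketch is only that the function's author decides, at the time of writing it, what state and performance data it will expose, rather than leaving that to be bolted on later.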

Evolving the tech ecosystem

New frameworks, architectures, processes, and a thriving ecosystem of tools have emerged to help us meet those challenges. Some of these are in an embryonic state, but rapid adoption is driving quick maturity. We’ve seen this evolution in compute: it’s only been four years since containers became a mainstream technology, and we are now working with complex application-level abstractions enabled by tools like Kubernetes. A similar evolution is occurring with deployment, serverless, edge-computing technology, security, performance, and system observability.

Ultimately, no changes can exist in a human and organisational vacuum. In order to efficiently build truly cross-functional teams and enable the rapid iteration required to build more advanced systems, we must place emphasis on developing the appropriate leadership skills. We must avoid stagnation within the industry by continuing the work of DevOps and SRE communities to break down barriers, eradicate silos, and streamline transitions between teams – only then can we boost development velocity. The bottom line is: teams structured around swiftly delivering high-quality, secure, and performant applications create highly innovative products and organisations. By listening to our peers about how they’ve both succeeded and failed in building, scaling and securing distributed systems, we can then start to better navigate the modern complexities facing businesses all over the world.

James Turnbull, CTO in residence, Microsoft
Ines Sombra, Distributed Systems Director, Fastly
Image Credit: Bbernard / Shutterstock