Fixing risk sharing with observability

Where does responsibility for risk lie? The incentives of SREs, SecOps, and application developers are all at odds with one another, and that mismatch creates challenges around how and what information is shared across siloed teams. Inevitably, deployment risk is shifted from one team to another with no accountability. As a result of this risk-shifting, many companies end up with unstable applications, inefficient infrastructure, security issues, and poor customer experience. All of which hits the bottom line.

Bridging the gap with observability 

Observability has the power to help bring these disparate groups together. Observable systems allow users to ask questions about their data in an open-ended way, unlike more rigid monitoring systems. Meeting the observability expectations of IT leaders requires pervasive instrumentation across applications, infrastructure, and third-party software. However, delivering that level of instrumentation has remained out of reach due to incentive mismatches, as well as human and infrastructure costs.

App devs want to ship code quickly and do so with a reasonable level of quality, meaning a low bug count. In contrast, SREs are incentivized by uptime, performance, and efficiency. For SecOps, it’s all about risk reduction and breach mitigation.

Challenges arise when changes are not effectively communicated. For instance, SREs and SecOps often have no insight into what developers have changed. Fresh code might include only minor changes that are harmless to existing operations, or it could replace large chunks of logic across the entire codebase, including adding calls to external and third-party applications. DevOps teams want to deploy quickly, and waiting on approval from other teams slows deployments down. As a result, comprehensive reviews to iron out the bugs don't happen. This doesn't mean DevOps teams are intentionally trying to sabotage partner teams; they're simply acting in their own interests based on their incentives.

The challenge is that one party, the developers, has more information than other parties. That information asymmetry is what creates unbalanced risk-sharing. Coping with information asymmetry has led to all kinds of new collaborative models, starting with DevOps, and evolving into DevSecOps and other permutations like BizDevSecOps.

True collaboration has been hard to come by. Early DevOps efforts are often successful, but scaling beyond five to seven teams is difficult because organizations lack the breadth of IT operations experience, or the SRE capacity, to staff multiple product teams. Furthermore, the change velocity DevOps teams can achieve is often far greater than SREs and SecOps can absorb, making information asymmetry worse.

If teams can’t maintain high levels of collaboration and communication, another option must be developed.

Observability practices, like collecting all events, metrics, traces, and logs, allow SREs and SecOps teams to interrogate applications about their behavior without knowing which questions they want to ask ahead of time. However, observability only works if applications, and the infrastructure they rely on, are instrumented. This creates another problem: who does the instrumentation?
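As a minimal sketch of what that open-ended interrogation looks like in practice: if events are captured as structured records rather than free-form strings, a question nobody anticipated at instrumentation time becomes a simple filter. The event shape and field names below are illustrative assumptions, not a prescribed schema.

import json

# Hypothetical structured events captured from an instrumented service.
events = [
    {"type": "http", "route": "/checkout", "status": 502, "duration_ms": 1840},
    {"type": "fs", "op": "open", "path": "/etc/ssl/certs", "duration_ms": 2},
    {"type": "http", "route": "/health", "status": 200, "duration_ms": 3},
]

# An ad hoc question asked long after instrumentation was written:
# "Which HTTP routes are both slow and failing?"
suspects = [e for e in events
            if e["type"] == "http" and e["status"] >= 500 and e["duration_ms"] > 1000]

for e in suspects:
    print(json.dumps(e))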

The expectation is DevOps teams embed instrumentation into their code as part of the development process. While that’s a nice idea, there are four reasons this falls short:

1. The quality of instrumentation varies - many log statements are brief and only understandable by the developer who wrote them. The message “In function xyx123!” isn’t helpful to an SRE digging into a performance problem that cropped up in the latest release (a structured alternative is sketched after this list).

2. Instrumentation libraries vary by implementation - this leads to inconsistent results across language bindings. OpenTelemetry tries to improve this, but progress is slow, and it still requires developers to do work that often doesn’t benefit them; it benefits SREs and SecOps. So, we’re back to mismatched incentives.

3. The volume of data is enormous - each instrumented application can produce terabytes of data every day. With robust instrumentation, that volume can be overwhelming, and extremely costly to analyze and store.

4. Instrumentation is isolated to your team’s code - that represents a fraction of the code you rely on. Vendor-provided services and APIs remain a black box, limiting your observability into those components.
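On the first point, the gap between a developer-oriented log line and an operator-oriented one is easy to see side by side. The sketch below is hedged: the event fields and the function name are hypothetical, not a standard schema.

import json
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("payments")

# The kind of message only its author understands:
log.warning("In function xyx123!")

# A structured alternative an SRE can act on. Every field name
# here is illustrative, not a prescribed schema.
log.warning(json.dumps({
    "event": "retry_exhausted",
    "function": "charge_card",        # hypothetical function name
    "upstream": "gateway.example.com",
    "attempts": 3,
    "last_status": 503,
}))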

Resolving information asymmetry across teams requires pervasive, pluggable instrumentation for all code, delivered without requiring developer involvement. It also needs an observability pipeline to filter, redact, and enrich data before routing it to the analytics platform of your choice.

Pervasive instrumentation 

Operations teams need instrumentation they can turn on and off as needed, producing readily consumable data. They also need every piece of data they can get, including packet payloads and visibility into encrypted traffic. This goes well beyond what’s possible with today’s instrumentation options.

The newly released open-source project AppScope takes a fresh approach to instrumentation. AppScope interposes itself between application threads and system libraries, tracking things like file system access, network and HTTP activity, and CPU and process activity. It also provides payload data, and because it sits between the application and encryption libraries, it gives access to cleartext data. SREs and operations teams can instrument anything, even code they didn’t write, because it works with any Linux binary.
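AppScope does its interposition at the native-library layer, which is what lets it work on unmodified binaries. As a conceptual analogy only, not AppScope's actual mechanism, the same idea can be sketched in a few lines of Python: wrap a standard call so every use of it is observed, without changing any of the code that calls it.

import builtins
import time

# Toy interposition: replace the built-in open() with a wrapper that
# records every file access, then delegates to the real function.
_real_open = builtins.open

def traced_open(path, *args, **kwargs):
    start = time.perf_counter()
    try:
        return _real_open(path, *args, **kwargs)
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        print(f"fs.open path={path!r} duration_ms={elapsed_ms:.2f}")

builtins.open = traced_open  # interpose; callers are unchanged

# Any code run after this point, including third-party libraries,
# now emits a file-access event for each open() call.
with open("/etc/hostname") as f:  # Linux path, chosen for illustration
    f.read()

AppScope applies the same principle one level down, between the process and shared system libraries, which is why it can observe payloads before they are encrypted.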

Observability pipeline 

The challenge with pervasive instrumentation is dealing with all the data generated by applications. Network, file system, and other system data can easily swamp destination APM and log analytics platforms, driving up licensing and infrastructure costs. While instrumentation data is vital to rebalancing risk-sharing in organizations, you need a way to manage that data intelligently to get value out of it. This is where the observability pipeline comes in.

Observability pipelines act as a strategic control point. The pipeline gives users control over how data is formatted, filtered, enriched, and redacted before it is routed to its destination(s). These pipelines help SREs and operations teams deal with the flood of instrumentation data by routing low-value data to low-cost storage, like S3, while higher-value information lands in APM and log analytics tools. If, at some point, you need the data stored in S3 to add more context to your analysis, you can replay it back through the pipeline and enrich your data set.
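To make the filter-redact-enrich-route flow concrete, here is a minimal sketch of a single pipeline stage. The field names, the routing rule, and the two in-memory sinks standing in for an APM tool and S3 are all assumptions for illustration; this is not Cribl's API.

import json

APM, ARCHIVE = [], []  # stand-ins for an APM tool and low-cost S3 storage

def process(event):
    # Filter: drop pure noise before it costs anything downstream.
    if event.get("type") == "heartbeat":
        return
    # Redact: mask sensitive fields before data leaves the pipeline.
    if "card_number" in event:
        event["card_number"] = "****" + event["card_number"][-4:]
    # Enrich: attach context analysts will want later.
    event["env"] = "prod"  # hypothetical deployment metadata
    # Route: errors go to analytics; everything else to cheap storage.
    sink = APM if event.get("status", 0) >= 500 else ARCHIVE
    sink.append(json.dumps(event))

for e in [{"type": "heartbeat"},
          {"type": "http", "status": 502, "card_number": "4111111111111111"},
          {"type": "http", "status": 200}]:
    process(e)

print(f"APM events: {len(APM)}, archived events: {len(ARCHIVE)}")

Replay, in this picture, is simply feeding archived events back through the stage with different rules when you need more context later.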

Time to change 

Teams are under ever-increasing pressure to deliver faster. However, faster delivery pushes deployment risk onto operations and security teams that lack visibility into the changes developers make across complex distributed systems. Over time, these applications also become less predictable and less reliable, pushing risk up further.

Traditional methods of resolving the information mismatch haven’t worked because incentives across teams aren’t aligned. Adopting pervasive instrumentation and observability practices will give SREs and operations teams critical visibility into rapidly changing application and infrastructure environments. Crucially, they can do this without disrupting the developer experience and process.

Nick Heudecker, Senior Director of Market Strategy, Cribl

Nick Heudecker is the Senior Director of Market Strategy & Competitive Intelligence at Cribl and a former analyst covering Data & Analytics for Gartner.