Observability and Kubernetes are sometimes treated as separate concerns, but they are closely intertwined. Engineers benefit from the reliability, performance, and efficiency of a well-managed Kubernetes cluster, but they cannot realize its full potential unless they know which applications are performing poorly and which resources are being underutilized.
What is observability?
Observability isn’t a single tool or technology. Rather, it is a measure of how long it takes you to understand a problem: the average time from receiving an alert to understanding the nature of what’s going wrong. When you restart a pod that isn’t functioning properly, and that fixes the problem, your observability in that situation may be quite poor, since you never really understood what went wrong. Observability isn’t simply a set of dashboards displaying time-series metrics or a searchable index of your application logs; it’s a holistic approach to understanding your entire system and how it operates.
Today’s complex software architectures demand high availability, reliability, security, and auto-scaling capabilities at peak times. The scale of these operations requires comprehensive management of logs, metrics, and traces to help debug and maintain these robust software systems. Thus, Kubernetes observability is now a top priority for engineering teams.
There is no simple route to achieving Kubernetes observability but here are six top tips that can be followed to fully explore, visualize, and troubleshoot the entire environment:
- Learn the language of your microservices - It is often harder to track communication between nodes and pods within a cluster than behavior within a single node. Engineering teams will find it easier to understand communication between microservices by linking Kubernetes metadata, which provides access to application performance data and distributed traces, whether instrumentation comes from New Relic agents, open-source tools like Prometheus, StatsD, and Zipkin, or standards like OpenTelemetry deployed in Kubernetes clusters. This small change gives teams insight into error rates, transaction times, and throughput to better understand their performance.
- Familiarize yourself with the overall health and capacity of your clusters - Infrastructure monitoring remains vital to the business. When analyzing unexpected behavior and performance issues in applications deployed on Kubernetes, the first step in troubleshooting is evaluating the cluster’s overall health.
- Monitor the dynamic behavior of clusters - Tracking and tracing Kubernetes events gives engineers useful analytics about dynamic behaviors such as new deployments, autoscaling, and health checks. The real-world performance of clusters is determined by the Kubernetes control plane, so it is crucial to track dynamic events as well as control-plane components such as the API server and scheduler to gain a full overview of the cluster.
- Leverage integrated telemetry data - My definition of observability is “how quickly you can understand problems with your system.” In other words, the speed at which developers can read metrics from a dashboard has a direct impact on observability. User experience is integral to all monitoring, and a good user experience makes engineers’ lives easier - the better the experience, the more quickly they can understand and resolve problems.
- Correlate log data - To view clusters in the context of the broader Kubernetes environment and accelerate troubleshooting, engineers must correlate log data from all services running on Kubernetes, while avoiding a scattered user experience. When developers move from logs, to overall metrics, to a tracing tool, correlating this data can be tricky: when a team notices a spike in response-time metrics, it can be hard to find the logs from the slowest responses or to connect distributed traces with the relevant logging. Open-source observability projects, such as OpenTelemetry, are therefore actively developing ‘logs in context’ to connect logging data with other monitoring tools.
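The ‘logs in context’ idea can be sketched without any particular vendor: attach the active trace ID to every log line so logs can later be joined with traces and metrics. In real deployments the trace ID would come from the active OpenTelemetry span; the contextvar, logger name, and trace ID below are purely illustrative.

```python
import contextvars
import io
import logging

# Illustrative stand-in for trace context; a real system would read
# this from the active OpenTelemetry span, not a hand-set contextvar.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")

class TraceContextFilter(logging.Filter):
    """Attach the active trace ID to every log record."""
    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True

# Capture output in a string so the example is self-contained.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s trace=%(trace_id)s %(message)s"))
handler.addFilter(TraceContextFilter())
logger = logging.getLogger("checkout")  # hypothetical service logger
logger.setLevel(logging.INFO)
logger.addHandler(handler)

# Simulate handling one request inside a trace.
current_trace_id.set("4bf92f3577b34da6")
logger.info("payment authorized")
print(stream.getvalue().strip())
# -> INFO trace=4bf92f3577b34da6 payment authorized
```

With every log line carrying a trace ID, finding the logs behind a slow distributed trace becomes a simple filter rather than guesswork.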
It is important for developers to choose a solution that allows Prometheus data to be viewed alongside telemetry data from other sources for unified visibility. This in turn removes the overhead of managing storage and availability of Prometheus, so teams can focus on deploying and scaling software.
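One way to view Prometheus data alongside telemetry from other sources is to pull it over Prometheus’s standard HTTP API (`/api/v1/query`) and feed it into whatever unified tool you use. A minimal sketch, assuming a Prometheus server at a hypothetical in-cluster address; the metric name is the usual cAdvisor one, but what is available depends on your scrape configuration.

```python
import json
import urllib.parse
import urllib.request

def build_query_url(base_url, promql):
    """Build the instant-query URL for a PromQL expression."""
    params = urllib.parse.urlencode({"query": promql})
    return f"{base_url}/api/v1/query?{params}"

def instant_query(base_url, promql):
    """Run an instant query; requires a reachable Prometheus server."""
    with urllib.request.urlopen(build_query_url(base_url, promql)) as resp:
        payload = json.load(resp)
    return payload["data"]["result"]

# Example: per-pod CPU usage across the cluster.
url = build_query_url(
    "http://prometheus:9090",  # hypothetical in-cluster service address
    "sum by (pod) (rate(container_cpu_usage_seconds_total[5m]))",
)
print(url)
```

Pulling data this way (or via Prometheus remote write) lets a managed backend handle storage and availability, which is exactly the overhead the paragraph above suggests removing.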
Kubernetes provides teams and businesses with a competitive edge as it offers differentiation on uptime, performance, and efficiency. During Covid-19, it is more important than ever that businesses stay relevant, and an efficiently orchestrated cluster is one way to do so. To reach performance goals, engineering teams must maintain consistent insight into how the cluster is really performing, which is especially critical for maintaining efficiency. Close monitoring will highlight when there is excess capacity that can be put to more efficient use.
- Good security practices improve observability - The recommendations of security professionals often have knock-on benefits for other operational concerns. Practices like vulnerability scanning and auditing your software packages have huge benefits for observability. As David Sudia wrote, it’s key to keep track of whether there are known vulnerabilities in the packages you’re using in your containers, even in the relatively less-compromised Linux ecosystem. All well and good, but how does this improve observability? Because any measure to audit your package usage means more people than the original developer know your system’s dependencies. Any kind of auditing gives the reviewer a map of how the system really works and what it relies on.
Further, the maintenance of network policies, while a crucial part of external security, can also provide insight into how your system communicates and what parts are active. This again means that, when you’re getting performance alerts, your ability to understand your logging, metrics, and tracing will be that much greater.
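As an illustration of how a network policy doubles as documentation of system communication, here is a minimal NetworkPolicy sketch (the namespace, labels, and port are hypothetical) stating that only frontend pods may reach the checkout service:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: checkout-ingress   # hypothetical policy name
  namespace: shop          # hypothetical namespace
spec:
  podSelector:
    matchLabels:
      app: checkout
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend   # only the frontend may call checkout
      ports:
        - protocol: TCP
          port: 8080
```

Reading a set of such policies tells an on-call engineer, before any dashboard is opened, which services are expected to talk to each other and on which ports.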
Observability is a goal
The difference between ‘a black box of monolithic code’ and ‘a perfectly maintainable system’ isn’t a binary, and just like Agile or DevOps, observability will always be a goal that teams work toward. Any effort you put into increasing your teams’ understanding of your stack is valuable for observability.
Nočnica Fee, developer advocate, New Relic