
Observability and day two problems - what issues might you face?

The term ‘honeymoon’ refers to the time that newlyweds spend with each other after a wedding, when everything tends to be new and exciting and to go really well. After that initial period, things come up that are no longer so perfect. It’s the same with technology projects - once the initial implementation is complete, you start to find things that are not as easy to deal with. In IT, where everything moves faster than real life, these are called ‘day two’ problems.

For software development and observability, this can include discovering unintended consequences after implementation, or workflow processes that don’t suit parts of your team and were not caught in user acceptance testing. These problems can be serious, and they can effectively stop your project from delivering all its benefits. They can also prevent implementations from expanding beyond their original users.

So how can you get through the honeymoon phase and keep solving those day two problems before they affect your overall success?

What should observability deliver? 

In order to avoid these kinds of problems, it is worth spending some time on what your observability approach should deliver for you, and how you can optimize that approach in practice. Observability refers to how you can tell the state of a system based on observation of its outputs. For software developers, this means looking at application logs, metrics and tracing data from multiple application components and infrastructure sources in order to see what is happening over time. 

Today, application developers are more likely to build microservices applications and host them in the cloud. This is great for elasticity and modularity of the application architecture, but not so good from an observability perspective as the complexity grows exponentially.

Applications today have many more moving parts and dependencies, so finding the initial fault can be more difficult. Alongside this, some components may be third-party services run by outside providers or public services. For example, a retailer may use weather data in its prediction models for which products are likely to be in demand - if the source of that data goes down, you may know where the fault is but might not be able to fix it yourself.

Alongside this, getting a good level of insight involves defining what good performance levels currently look like, and setting up alerts that represent situations where service levels and expected results are not being delivered.
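
As an illustration of what such an alert might look like in code, here is a minimal sketch that checks a batch of latency measurements against an assumed service-level threshold; the 800 ms target and the sample values are hypothetical.

```python
import math

# Hypothetical service-level objective: 95th percentile request latency under 800 ms.
SLO_P95_LATENCY_MS = 800

def p95(values):
    """95th percentile of the samples, using the nearest-rank method."""
    ordered = sorted(values)
    rank = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return ordered[rank]

def latency_slo_breached(latency_samples_ms) -> bool:
    """Return True if the p95 of the samples breaches the assumed SLO and an alert should fire."""
    if not latency_samples_ms:
        return False  # no data, nothing to judge
    return p95(latency_samples_ms) > SLO_P95_LATENCY_MS

# Example: samples (in milliseconds) gathered from a metrics backend over the last five minutes.
samples = [220.0, 310.0, 290.0, 1250.0, 400.0, 950.0, 870.0, 300.0, 280.0, 1100.0]
if latency_slo_breached(samples):
    print("ALERT: p95 latency above 800 ms")
```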

What should this offer in practice? At a minimum, it should help developers find and fix issues when things go wrong. However, it should deliver more than this: applications become more reliable and perform better when you focus not only on known violations of predefined signals (KPIs) but also gather enough background information to flag and fix new, unknown scenarios. It should also provide enough intelligence to prevent these issues in the future. Rather than treating observability problems as black or white issues, look at the business results and the experience that customers are getting. This should allow your team to improve reliability and performance as a whole.

Similarly, getting an accurate overview of application deployments in one place is not as simple as just having the data. Correlating activity and getting an accurate picture of what is taking place often requires enhancing data in a way specific to the actual customer environment. That can be more difficult and often impossible if you use a ‘black box’ agent approach to provide observability information, leaving aside the issue of vendor lock-in.

This is a good example of where a project can become more problematic over time. If you have to take a vendor-bound route for an element of observability like tracing, it can lead to more problems and stop you taking your preferred path for software innovation, development and deployment. Instead, looking at open source approaches like OpenTelemetry can help avoid these problems.
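
For instance, here is a minimal sketch of vendor-neutral instrumentation using the OpenTelemetry Python SDK; the service and span names are placeholders, and the console exporter stands in for whichever OTLP-compatible backend you eventually choose.

```python
# Requires: pip install opentelemetry-api opentelemetry-sdk
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Vendor-neutral setup: spans go to the console here, but swapping in an
# OTLP exporter changes the backend without touching the instrumentation code.
provider = TracerProvider(resource=Resource.create({"service.name": "checkout-service"}))
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("process-order") as span:
    span.set_attribute("order.items", 3)  # placeholder attribute
    with tracer.start_as_current_span("charge-payment"):
        pass  # the call to the payment service would go here
```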

Supporting multiple pipelines and approaches 

Alongside this, it’s important to understand that software development is more complex than a single continuous integration / continuous deployment (CI/CD) pipeline can represent. For many enterprises, there will be tens or hundreds of CI/CD pipelines in place across different teams, departments and business units that all feed into applications. Each of these teams can have different software development tools in place, depending on how standardized and how rigid the company is around its developers.

If you have not enforced a specific approach, then it should not be a surprise when different departments have their own solutions in place to help with their processes, or when they have chosen a specific cloud platform to deploy on that differs from the corporate standard. Each team may also have its own pipeline for software development and deployment, and its own quirks around getting applications from development to production. Information from all these pipelines therefore becomes more difficult to normalize and understand over time.
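
One way to keep that data comparable is to normalize each pipeline’s events into a small common schema as they arrive. A minimal sketch, assuming hypothetical payload fields from two different CI tools:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class DeployEvent:
    """Common shape for deployment events, regardless of which pipeline produced them."""
    team: str
    service: str
    succeeded: bool
    finished_at: datetime

# Hypothetical payloads, as two different CI tools might report them.
def from_tool_a(payload: dict) -> DeployEvent:
    return DeployEvent(
        team=payload["owning_team"],
        service=payload["app"],
        succeeded=payload["status"] == "passed",
        finished_at=datetime.fromtimestamp(payload["completed_ts"], tz=timezone.utc),
    )

def from_tool_b(payload: dict) -> DeployEvent:
    return DeployEvent(
        team=payload["group"],
        service=payload["component"],
        succeeded=payload["result"] == "SUCCESS",
        finished_at=datetime.fromisoformat(payload["end_time"]),
    )
```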

Lastly, the largest enterprises will have multiple business units, companies and departments. Allowing each department or company to run their own observability stacks may work, but it makes it harder for the enterprise as a whole to understand its approaches and how well it is operating. Similarly, it is much more challenging to use this data for security purposes too when it is spread across different units or tools. Ideally, getting more consistency in approach around software security and observability data can make this easier over time. 

However, this is not as easy as just enforcing specific tools or locking yourself in to certain providers. The sheer volume of data that today’s application components create can affect how successful you are, as the cost to support all that data does not shrink over time. If anything, it will increase, leading to a significant ‘day two’ problem around cloud cost management and the value that gets delivered. Consolidating your approach - and your business-critical data - can therefore help.

Stopping the problems before they start 

In order to deal with these potential problems, it is important to plan ahead. This involves looking at how you bring together data from across your applications and infrastructure, but also how you manage this process across multiple pipelines and diverse teams without making things too complex. It should also enable you to avoid overspending on data from the very solutions that you are using to keep yourself informed.

When enterprises have so many different CI/CD pipelines in place, consolidating information from these implementations will involve some automated correlation and analysis to show how teams are progressing and where there are opportunities to improve. Using this data, you can also compare your teams’ performance to industry-standard metrics to get an impression of where you can improve.
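
As a rough example, once deployment events have been normalized into a common shape, a metric such as deployment frequency per team (one of the commonly used DORA measures) takes only a few lines; the record structure and the sample events here are hypothetical.

```python
from collections import Counter, namedtuple
from datetime import datetime, timezone

# Hypothetical normalized deployment record (see the earlier sketch).
Deploy = namedtuple("Deploy", "team service succeeded finished_at")

events = [
    Deploy("payments", "checkout", True, datetime(2024, 5, 1, tzinfo=timezone.utc)),
    Deploy("payments", "checkout", False, datetime(2024, 5, 2, tzinfo=timezone.utc)),
    Deploy("search", "indexer", True, datetime(2024, 5, 2, tzinfo=timezone.utc)),
]

# Deployment frequency per team, counted over whatever window the events cover.
frequency = Counter(e.team for e in events if e.succeeded)
print(frequency)  # Counter({'payments': 1, 'search': 1})
```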

Rather than running different security and observability platforms, you can look at consolidating your approach to cover both use cases. The data will be the same for both teams, but the analytics and viewpoints on that data will be different. This cuts the volume of data that you might have to store and process considerably, as rather than getting multiple versions of the same data stored in various places, you can use a single set.
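
To make that concrete, here is a minimal sketch of one shared set of log records being queried from two different viewpoints; the log schema and values are hypothetical.

```python
# Hypothetical shared log records, stored once and queried by both teams.
logs = [
    {"service": "auth", "level": "error", "event": "login_failed", "source_ip": "203.0.113.7"},
    {"service": "checkout", "level": "error", "event": "timeout", "source_ip": "198.51.100.4"},
    {"service": "auth", "level": "info", "event": "login_ok", "source_ip": "203.0.113.7"},
]

# Observability view: which services are producing errors?
error_counts = {}
for record in logs:
    if record["level"] == "error":
        error_counts[record["service"]] = error_counts.get(record["service"], 0) + 1

# Security view: which source IPs are failing logins?
suspicious_ips = {r["source_ip"] for r in logs if r["event"] == "login_failed"}

print(error_counts)    # {'auth': 1, 'checkout': 1}
print(suspicious_ips)  # {'203.0.113.7'}
```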

For more complex deployments, where teams may need their own environments to work in, you should look at how to organize parent and child organizations for the data. This still consolidates the amount of data that you have to store over time, but it also makes it possible for each team to work independently. The parent enterprise gets a full overview of all the data that each team is looking at, while each child organization can run its deployment securely. As part of this, you should consider role-based access control so you can manage access to the data set as appropriate for each member of your team - those who only need a specific set of data can work on their analysis, while those who need broader access can work at their level too.
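
A minimal sketch of what such scoping might look like, with hypothetical organization and role names:

```python
from dataclasses import dataclass

@dataclass
class User:
    name: str
    org: str   # child organization the user belongs to
    role: str  # e.g. "analyst" (own org only) or "enterprise-admin" (all orgs)

def can_read(user: User, data_org: str) -> bool:
    """Parent-level admins see every child organization's data; others see only their own."""
    if user.role == "enterprise-admin":
        return True
    return user.org == data_org

print(can_read(User("dana", "payments", "analyst"), "payments"))           # True
print(can_read(User("dana", "payments", "analyst"), "search"))             # False
print(can_read(User("lee", "head-office", "enterprise-admin"), "search"))  # True
```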

Understanding the role that data can play across security, development and operations teams can help you implement tools successfully. However, thinking ahead about how to encourage more collaboration between teams around their data can help each team go further. By consolidating your approach and looking at the potential issues that can come up over time, you can keep the results from any project going for longer and get more pervasive use of data across the enterprise.

Pawel Brzoska, Principal Product Manager, Sumo Logic