If you have previously worked with cloud platforms, you will be familiar with the distributed and decoupled nature of these systems. A decoupled distributed system relies on microservices to carry out specific tasks, each one exposing its own REST (Representational State Transfer) API. These microservices talk to each other through a lightweight messaging layer, usually in the form of a message broker such as RabbitMQ or Qpid.
This is precisely how OpenStack works. Each major OpenStack component (Keystone, Glance, Cinder, Neutron, Nova, etc.) exposes a REST endpoint, and the components and sub-components communicate via a message broker layer, such as RabbitMQ. This approach has two benefits: first, failures can be attributed to specific components, and second, cloud infrastructure operators can scale all services horizontally and intelligently distribute the load.
However, as with everything, this distributed decoupled system, while hugely beneficial, also brings with it inherent challenges: how to properly monitor OpenStack services and, more specifically, how to identify possible single points of failure.
Below I outline four real-world challenges in monitoring OpenStack services properly, along with possible solutions for each.
Challenge one: the system is not monolithic
OpenStack’s non-monolithic and decoupled nature is often highlighted as one of its main advantages. And it certainly is an important advantage. However, it significantly complicates any attempt to monitor the state of the service as a whole. In a distributed system where each component carries out one specific task, and each component is further distributed into multiple sub-components, it is hard to identify the impact on the overall service when a specific piece of software fails.
The first step to overcome this is to get to know the cloud. You need to identify the relations between all major components, and then for each one, isolate specific services whose failures can impact the overall service. Simply put, you need to know everything there is to know about the relationship between all components in the cloud.
With that in mind, you need to not only monitor the state (up-and-running or stopped-and-failed) of each individual component, but also identify how other services can be affected by its possible failure.
For example, if Keystone dies, nobody will be able to obtain the service catalogue or log into any service, but that wouldn’t normally affect the virtual machines or other established cloud-services (object storage, block storage, load balancers, etc.) unless services are restarted and Keystone is still down. However, if Apache fails, Keystone and other similar API services could also be affected if they work through Apache.
So, the monitoring platform or solution must not only be capable of assessing the status of individual services, but also be able to correlate between service failures in order to examine the real impact on the entire system, and send alarms or notifications accordingly.
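As a rough illustration of this kind of correlation, the sketch below walks a dependency map to list every service affected when one component fails. The map itself is hypothetical (service names such as "mariadb" and the exact dependencies are assumptions for illustration) and should be replaced with the relationships in your own cloud.

```python
from collections import deque

# Hypothetical dependency map: each service lists what it directly depends on.
# Adjust it to match your own deployment.
DEPENDS_ON = {
    "keystone": ["apache", "mariadb"],
    "glance-api": ["keystone", "mariadb"],
    "nova-api": ["keystone", "rabbitmq", "mariadb"],
    "nova-compute": ["rabbitmq", "libvirt"],
    "neutron-server": ["keystone", "rabbitmq", "mariadb"],
}


def impacted_by(failed, depends_on=DEPENDS_ON):
    """Return every service that directly or indirectly depends on `failed`."""
    # Invert the map: dependency -> services that rely on it.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first walk over the inverted map.
    impacted, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service not in impacted:
                impacted.add(service)
                queue.append(service)
    return sorted(impacted)


if __name__ == "__main__":
    print(impacted_by("apache"))
```

With this map, a failure of "apache" reports Keystone and every API service that relies on Keystone, which is the information an alarm needs in order to describe the real impact rather than a single dead process.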
Challenge two: OpenStack is not simply OpenStack
Not only is the OpenStack-based cloud a distributed and decoupled system, it is also an orchestration solution which creates resources in the operating system and other devices inside or related to the cloud infrastructure. These resources include virtual machines (Xen, KVM or other hypervisor software components), persistent volumes (NFS storage servers, Ceph clusters, SAN-based LVM volumes or other storage backends), network entities (ports, bridges, networks, routers, load balancers, firewalls, VPNs, etc., running with specific components like iptables, kernel namespaces, HAProxy, Open vSwitch and many other sub-components), ephemeral disks (Qcow2 files residing in an operating system directory), and many other small systems.
The monitoring solution must therefore take these underlying components into account. Although these resources can be less complex and are less prone to failure, when they do go down, the logs of the major OpenStack services can obscure the true cause. They show only the consequence in the affected OpenStack service, not the actual root cause in the device or operating system software that failed.
For example, if libvirt fails, Nova will be unable to deploy a virtual instance. Nova-compute as a service will be up and running, but instances will fail (instance state: error) in the deploying stage. In order to detect this, you need to monitor libvirt (the service state, its metrics and its logs) alongside the nova-compute logs.
It is therefore necessary to examine the relationships between the underlying software and the major components, to monitor the end of the chain, and to consider consistency tests across all final services. You need to monitor everything: storage, networking, the hypervisor layer, each individual component, and the relationships between all of these.
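A minimal sketch of the state part of that check follows. It assumes a systemd-based host and uses `systemctl is-active`. The unit names "nova-compute" and "libvirtd" are assumptions that vary by distribution, and the log and metric checks are left out.

```python
import subprocess


def systemd_state(unit):
    """Return the output of `systemctl is-active` for a unit, e.g. 'active'."""
    result = subprocess.run(
        ["systemctl", "is-active", unit],
        capture_output=True, text=True, check=False,
    )
    return result.stdout.strip() or "unknown"


def diagnose_compute(get_state=systemd_state,
                     nova_unit="nova-compute", libvirt_unit="libvirtd"):
    """Combine both states so the alert names the likely root cause."""
    nova = get_state(nova_unit)
    libvirt = get_state(libvirt_unit)
    # Check the underlying layer first: nova-compute can look healthy
    # while libvirt is down.
    if libvirt != "active":
        return f"CRITICAL: {libvirt_unit} is {libvirt}; new instances will fail"
    if nova != "active":
        return f"CRITICAL: {nova_unit} is {nova}"
    return "OK"


if __name__ == "__main__":
    print(diagnose_compute())
```

Checking the underlying service before the OpenStack service means the alert points at libvirt rather than at the symptom seen in Nova.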
Challenge three: think outside the box
Cacti, Nagios, and Zabbix are good examples of open source monitoring solutions. These solutions define a very specific set of metrics that identify possible problems in the operating system. What they don’t offer are the specialised metrics necessary to determine more complex failure situations, or even the state of a service.
This is where you need to think outside the box. You can implement specialised metrics and tests that define whether your services are OK, degraded, or completely failed.
A distributed system like OpenStack, where every core service exposes a REST API and also connects to a TCP-based message service, is susceptible to networking bottlenecks, connection-pool exhaustion and other related problems. Many of these services also connect to SQL-based databases, whose maximum-connections limit can be exhausted. The monitoring solution therefore needs metrics that track connection states (established, fin-wait, closing, etc.) in order to detect connection-related problems that affect the API. Moreover, CLI tests can be constructed to check the endpoint state and measure its response time, which can be converted into a metric that shows the real state of the service.
Each of the aforementioned monitoring solutions, and most other commercial or open source solutions, can be extended with specialised metrics that you design yourself.
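One way to build such a connection-state metric is to count the states reported by `ss -tan` for a given local port. The sketch below is only an illustration: the function names are made up for this example, and the port you watch (for instance 5000 for Keystone or 3306 for a MySQL-compatible database) depends on your deployment.

```python
import subprocess
from collections import Counter


def count_states(ss_output, local_port):
    """Count TCP connection states for one local port in `ss -tan` output."""
    counts = Counter()
    for line in ss_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 5:
            continue
        state, local = fields[0], fields[3]
        # rsplit keeps IPv6 addresses such as [::1]:5000 intact.
        if local.rsplit(":", 1)[-1] == str(local_port):
            counts[state] += 1
    return counts


def current_states(local_port):
    """Run `ss -tan` on this host and count states for `local_port`."""
    out = subprocess.run(["ss", "-tan"], capture_output=True,
                         text=True, check=True).stdout
    return count_states(out, local_port)
```

The resulting counts can be exported to whichever monitoring system you use and alarmed on, for example when established connections approach the database connection limit.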
The command “time openstack catalog list” can measure the Keystone API response time. A wrapper around it can then evaluate the answer and generate an artificial failure state if the answer is not what is expected. Additionally, you can use simple operating system tools like “netstat” or “ss” to monitor the different connection states on your API endpoints and gain visibility into possible problems in your service. The same can be done for critical OpenStack dependencies such as the message broker and the database services. Note that a message broker failure will essentially kill your OpenStack cloud.
The key here is: don’t be lazy! Don’t stick with the default metrics. Do your homework and implement service-related metrics.
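The Keystone timing check described above could be wrapped roughly as follows. This is a sketch under stated assumptions: the thresholds are arbitrary, the function names are hypothetical, credentials are expected to be in the environment already, and treating a catalogue without an "identity" entry as a failure is just one possible definition of an unexpected answer.

```python
import subprocess
import time


def run_catalog_list(timeout):
    """Run `openstack catalog list`; credentials must already be in the environment."""
    return subprocess.run(
        ["openstack", "catalog", "list"],
        capture_output=True, text=True, timeout=timeout, check=False,
    )


def check_keystone(run=run_catalog_list, warn_seconds=2.0, timeout=10.0):
    """Return (state, elapsed_seconds) with state OK, DEGRADED or FAILED."""
    start = time.monotonic()
    try:
        result = run(timeout)
    except (subprocess.TimeoutExpired, OSError):
        # Timed out, or the CLI could not be started at all.
        return "FAILED", time.monotonic() - start
    elapsed = time.monotonic() - start
    # Treat a non-zero exit code or a catalogue without an identity entry
    # as an unexpected answer.
    if result.returncode != 0 or "identity" not in result.stdout:
        return "FAILED", elapsed
    if elapsed > warn_seconds:
        return "DEGRADED", elapsed
    return "OK", elapsed


if __name__ == "__main__":
    state, seconds = check_keystone()
    print(f"{state} ({seconds:.2f}s)")
```

The three states map directly onto the OK, degraded and failed levels that most monitoring tools accept from custom checks.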
Challenge four: the human factor
The human factor is in everything. As the old saying goes, it's a poor craftsman that blames his tools.
Without a tested incident response procedure, a single failure will not only remain a problem, but will also create many more. Every possible incident in your cloud infrastructure, and its related alarms in your monitoring solution, should be well documented with clear steps that explain how to detect, contain and solve the problem.
The human factor is relevant even if you have a smart system (one with some degree of artificial intelligence) that can relate events and recommend proper solutions to detected incidents. It is important to remember that if you feed your system with inaccurate or incomplete information, the output will also be inaccurate or incomplete.
Summing up, OpenStack monitoring doesn’t have to be difficult. What’s most important is to be thorough. Each individual service, as well as its interactions with every other service, needs careful monitoring, and you can implement the specialised metrics yourself. With some TLC, you can easily and successfully monitor your OpenStack cloud.
Ronny Lehmann, CTO, Loom Systems
Image source: Shutterstock/TechnoVectors