Modern data clusters are becoming increasingly commonplace and essential to businesses, and anyone running one inevitably discovers headaches.
Typically, a wide variety of workloads run on a single cluster, which can make it a nightmare to manage and operate, much like managing traffic in a busy city. It is a real pain for the operations folks who have to manage Spark, Hive, Impala and Kafka applications running on the same cluster. They have to worry about each app’s resource requirements, the time distribution of the cluster workloads, and the priority levels of each app or user, and then make sure everything runs like a predictable, well-oiled machine.
Anyone working in data ops will have a strong point of view here, since you’ll no doubt have spent countless hours, day in and day out, studying the behaviour of giant production clusters in search of insights into how to improve performance, predictability and stability. That holds whether it is a thousand-node Hadoop cluster running batch jobs, or a five-hundred-node Spark cluster running AI, ML or some type of advanced, real-time analytics. Or, more likely, it is 1,000 nodes of Hadoop connected via a 50-node Kafka cluster to a 500-node Spark cluster for processing.
Just from listing the kinds of environments I see regularly, one quickly becomes aware of what can go wrong in these multi-tenant big data clusters. For example:
Oversubscribed clusters – too many apps or jobs to run, just not enough resources
Bad container sizing – too big or too small
Poor queue management – sizes of queues are inappropriate
Resource-hogging users or apps – bad apples in the cluster
So how do you go about solving each of these issues?
Measure and analyse
To understand which of the above issues plagues your cluster, you must first understand what’s happening under the hood. Modern data clusters have a number of precious resources that operations teams must keep a constant eye on. These include memory, CPU, and the NameNode.
When monitoring these resources, make sure to measure both the total available and the amount consumed at any given time.
Next, break down these resource charts by user, app, department, and project to truly understand who is contributing how much to the total usage. This kind of analytical exploration can help quickly reveal:
If there is any one tenant (user, app, dept, or project) causing the majority of usage of the cluster, which may then require further investigation to determine if that tenant is using or abusing resources
Which resources are under constant threat of being oversubscribed
If you need to expand your big data cluster or tune apps and system to get more juice
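As a rough illustration of this kind of breakdown, here is a minimal Python sketch. The sample records, tenant names, cluster totals and the 50 per cent threshold are all made up for the example; in practice the numbers would come from whatever monitoring your cluster already exposes.

```python
from collections import defaultdict

# Hypothetical usage samples at one point in time:
# (user, app, memory_gb, vcores). Names and numbers are illustrative only.
SAMPLES = [
    ("alice", "spark-etl", 120, 30),
    ("bob", "hive-report", 40, 10),
    ("alice", "spark-ml", 200, 50),
    ("carol", "kafka-ingest", 40, 10),
]

# Hypothetical cluster total.
TOTAL_MEMORY_GB = 500


def usage_by_tenant(samples, key_index=0):
    """Sum memory and vcores per tenant; key_index picks user (0) or app (1)."""
    totals = defaultdict(lambda: [0, 0])
    for sample in samples:
        tenant = sample[key_index]
        totals[tenant][0] += sample[2]
        totals[tenant][1] += sample[3]
    return {tenant: tuple(values) for tenant, values in totals.items()}


def memory_share(samples, total_memory_gb, key_index=0):
    """Fraction of total cluster memory consumed by each tenant."""
    by_tenant = usage_by_tenant(samples, key_index)
    return {tenant: mem / total_memory_gb for tenant, (mem, _) in by_tenant.items()}


def dominant_tenants(shares, threshold=0.5):
    """Tenants whose share exceeds the (assumed) threshold, sorted by name."""
    return sorted(tenant for tenant, share in shares.items() if share > threshold)


shares = memory_share(SAMPLES, TOTAL_MEMORY_GB)
consumed_fraction = sum(shares.values())  # total consumed vs. total available
heavy_users = dominant_tenants(shares)
```

With these sample numbers, one user accounts for well over half of the cluster’s memory, which is exactly the kind of tenant that deserves a closer look. Passing `key_index=1` gives the same breakdown per app instead of per user.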
Make apps better multi-tenant citizens
Configuration settings at the cluster and app level dictate how many system resources each app gets. For example, if we have a setting of 8GB containers at the master level, then each app will get 8GB containers whether it needs them or not. Now imagine if most of your apps only needed 4GB containers. Well, your system would show it’s at max capacity when it could actually be running twice as many apps.
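The arithmetic behind the 8GB versus 4GB example can be sketched in a few lines of Python. The 512GB cluster size is an assumption picked purely to make the numbers concrete.

```python
def max_concurrent_containers(cluster_memory_gb, container_gb):
    """How many containers of a given size fit into cluster memory."""
    return cluster_memory_gb // container_gb


def wasted_memory_gb(containers, allocated_gb, needed_gb):
    """Memory reserved but not needed across all running containers."""
    return containers * max(allocated_gb - needed_gb, 0)


CLUSTER_MEMORY_GB = 512  # hypothetical cluster size

at_8gb = max_concurrent_containers(CLUSTER_MEMORY_GB, 8)
at_4gb = max_concurrent_containers(CLUSTER_MEMORY_GB, 4)
wasted = wasted_memory_gb(at_8gb, allocated_gb=8, needed_gb=4)
```

Under these assumptions the cluster reports itself as full at 64 containers, while half of its memory sits reserved and idle; right-sized containers would let 128 run at once.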
In addition to inefficient memory sizing, big data apps can be bad multi-tenant citizens due to other bad configuration settings (CPU, number of containers, heap size, etc.), inefficient code, and bad data layout.
Therefore, it’s important to measure and understand each of these resource-hogging factors for every app on the system and make sure that they are actually using and not abusing resources.
Define queues and priority levels
Your big data cluster must have a resource management tool built in, for example YARN or Kubernetes. These tools allow you to divide your cluster into queues (or, in the case of Kubernetes, comparable constructs such as namespaces with resource quotas). This feature can work really well if you want to separate production workloads from experiments, Spark from HBase, or high-priority users from low-priority ones. The trick is to get the levels of these queues right.
This is where the measure-and-analyse techniques help. You should analyse the usage of system resources by users, departments or any other tenant you see fit to determine the minimum, maximum and average that each usually demands. This will at least get you some common-sense levels for your queues.
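One way to turn those min, max and average figures into starting queue levels is sketched below. The department names, demand samples and cluster size are hypothetical, and the rule used here (guarantee the average, cap at the peak) is only one reasonable assumption, not a recommendation for every cluster.

```python
from statistics import mean

# Hypothetical memory demand samples (GB) per department.
DEMAND = {
    "production": [300, 320, 280, 340],
    "data-science": [80, 120, 100, 60],
    "ad-hoc": [20, 60, 40, 40],
}

CLUSTER_MEMORY_GB = 500  # hypothetical cluster size


def demand_profile(samples):
    """Minimum, maximum and average demand for one tenant."""
    return {"min": min(samples), "max": max(samples), "avg": mean(samples)}


def suggest_queue_levels(demand, cluster_gb):
    """Assumed rule: guarantee the average demand, cap at the peak demand.

    Both values are expressed as a percentage of total cluster memory.
    """
    levels = {}
    for tenant, samples in demand.items():
        profile = demand_profile(samples)
        levels[tenant] = {
            "guaranteed_pct": round(100 * profile["avg"] / cluster_gb, 1),
            "max_pct": round(100 * profile["max"] / cluster_gb, 1),
        }
    return levels


levels = suggest_queue_levels(DEMAND, CLUSTER_MEMORY_GB)
total_guaranteed = sum(level["guaranteed_pct"] for level in levels.values())
```

If the guaranteed percentages add up to more than 100, the cluster is oversubscribed on average and no queue layout alone will fix it.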
However, queue levels may need to be adjusted dynamically for best results. For example, a mission-critical app may need more resources if it processes 5x more data on one day than on another. Therefore, having a sense of seasonality is also important when allocating these levels. A heatmap of cluster usage will enable you to be more precise about these allocations.
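The data behind such a heatmap is simply usage bucketed by day of week and hour of day. Here is a minimal sketch, assuming usage samples arrive as timestamped memory readings; the timestamps and values are invented for the example.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical samples: (timestamp, memory_used_gb).
SAMPLES = [
    ("2024-01-01 02:00", 400),  # a Monday
    ("2024-01-08 02:00", 440),  # the following Monday
    ("2024-01-02 02:00", 100),  # a Tuesday
    ("2024-01-01 14:00", 200),  # Monday afternoon
]


def usage_heatmap(samples):
    """Average usage keyed by (weekday, hour), where weekday 0 is Monday."""
    buckets = defaultdict(list)
    for timestamp, used_gb in samples:
        moment = datetime.strptime(timestamp, "%Y-%m-%d %H:%M")
        buckets[(moment.weekday(), moment.hour)].append(used_gb)
    return {cell: sum(values) / len(values) for cell, values in buckets.items()}


def busiest_cell(heatmap):
    """The (weekday, hour) cell with the highest average usage."""
    return max(heatmap, key=heatmap.get)


heatmap = usage_heatmap(SAMPLES)
peak = busiest_cell(heatmap)
```

Each cell of the resulting dictionary is one square of the heatmap. A recurring hot cell, such as early Monday morning in this made-up data, is a hint that a queue may need a higher level at that time than during the rest of the week.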
Proactively find and fix rogue users or apps
Even after you follow the steps above, your cluster will experience rogue usage from time to time. Rogue usage is bad behaviour on the cluster by an application or user, such as hogging resources from a mission-critical app, taking more CPU or memory than needed for timely execution, or leaving a shell idle for a very long time.
In a multi-tenant environment this type of behaviour affects all users and ultimately reduces the reliability of the overall platform.
Therefore, setting boundaries for acceptable behaviour is very important to keep your big data cluster humming. For example:
Time limit for application execution
CPU, memory, containers limit for each application or user
Set the thresholds for these boundaries after analysing your cluster patterns over a month to determine what the average or accepted values are. These values may also differ by day of the week. Also, think about what happens when these boundaries are breached. Should the user and admin get an alert? Should these rogue applications be killed or moved to a lower-priority queue?
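To make the idea concrete, here is a minimal Python sketch of boundary checking. Every limit value, class name and action name in it is hypothetical, and the escalation policy (alert on one breach, demote on several) is just one possible answer to the questions above.

```python
from dataclasses import dataclass


@dataclass
class Limits:
    """Hypothetical per-application boundaries."""
    max_runtime_min: int
    max_memory_gb: int
    max_vcores: int
    max_containers: int


@dataclass
class AppUsage:
    """Hypothetical snapshot of what one application is consuming."""
    app_id: str
    runtime_min: int
    memory_gb: int
    vcores: int
    containers: int


# Assumed values; real ones should come from about a month of cluster history.
DEFAULT_LIMITS = Limits(max_runtime_min=120, max_memory_gb=64, max_vcores=16, max_containers=20)

# Thresholds can differ by weekday (0 = Monday); here weekends are looser.
LIMITS_BY_WEEKDAY = {
    5: Limits(240, 128, 32, 40),
    6: Limits(240, 128, 32, 40),
}


def limits_for(weekday):
    """Pick the limits for a weekday, falling back to the defaults."""
    return LIMITS_BY_WEEKDAY.get(weekday, DEFAULT_LIMITS)


def breaches(app, limits):
    """Names of every boundary the application currently exceeds."""
    checks = [
        ("runtime", app.runtime_min, limits.max_runtime_min),
        ("memory", app.memory_gb, limits.max_memory_gb),
        ("vcores", app.vcores, limits.max_vcores),
        ("containers", app.containers, limits.max_containers),
    ]
    return [name for name, value, limit in checks if value > limit]


def choose_action(breached):
    """Assumed policy: alert on a single breach, demote on several."""
    if not breached:
        return "ok"
    if len(breached) == 1:
        return "alert_user_and_admin"
    return "move_to_lower_priority_queue"


app = AppUsage("app-42", runtime_min=150, memory_gb=96, vcores=8, containers=10)
weekday_action = choose_action(breaches(app, limits_for(2)))  # Wednesday
weekend_action = choose_action(breaches(app, limits_for(5)))  # Saturday
```

The same application is demoted on a Wednesday but left alone on a Saturday, because the assumed weekend limits are looser. Whether the harsher action should be a kill rather than a demotion is a policy decision for each cluster.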
Only by thinking through these options can the tenants of a multi-tenant big data cluster play well together.
Kunal Agarwal, CEO, Unravel