The science of selecting the right public cloud instances


While organisations are increasingly moving application workloads to public cloud infrastructure as a service (IaaS), many are now realising the pitfalls. Although the cloud once promised a cost-effective solution, numerous organisations are in fact significantly overspending on, and overprovisioning, their cloud instances. To give a measure of the issue, analysts predict that by 2020, 80% of organisations will overspend their cloud IaaS budgets due to a lack of effective cost optimisation.

One standout reason for the overspend is the lack of a detailed understanding of application workload patterns. This matters because the precise nature of a workload directly affects what it costs to run in the cloud. For example, a batch processing job will have periods of high utilisation but use little or no resources the rest of the time. Such a job is ideally suited to the cloud, because its pattern is simple and unchanging: the instance can be turned off when it is not active, so you pay only for the hours when the application is actually running.
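
To put rough numbers on the batch-job case, the sketch below compares an always-on instance with one that runs only during the job's nightly window. It assumes simple per-hour billing; the hourly rate and the three-hour window are hypothetical placeholders, not real provider prices, and the sketch ignores attached-storage charges, which typically continue while an instance is stopped.

```python
# Hypothetical figures for illustration only: substitute your provider's
# actual on-demand rate and your job's real schedule.
HOURLY_RATE = 0.40        # assumed on-demand price per instance-hour (USD)
HOURS_PER_MONTH = 730     # common billing approximation (8,760 h / 12)
JOB_HOURS_PER_DAY = 3     # assumed nightly batch window
DAYS_PER_MONTH = 30


def monthly_cost(billed_hours: float, rate: float = HOURLY_RATE) -> float:
    """Cost of an instance billed per hour of running time."""
    return round(billed_hours * rate, 2)


always_on = monthly_cost(HOURS_PER_MONTH)
scheduled = monthly_cost(JOB_HOURS_PER_DAY * DAYS_PER_MONTH)

print(f"always on: ${always_on:.2f}/month")   # always on: $292.00/month
print(f"scheduled: ${scheduled:.2f}/month")   # scheduled: $36.00/month
```

With these assumed figures the scheduled instance costs roughly an eighth of the always-on one; finer-grained (for example per-second) billing would shift the numbers slightly without changing the conclusion.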

Another error is relying on tools that take a simplistic approach to cloud sizing, such as using averages or percentiles. Sizing by averages leaves organisations wildly over- or under-provisioning apps, depending on the timeframe used, with a significant impact on either performance or cost. To get instance provisioning and management right, it is essential to statistically model workload patterns in ways that look at hourly, daily, monthly, and quarterly activity.
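
The sketch below uses a synthetic, entirely hypothetical utilisation trace to illustrate why a single average misleads: a workload that is busy during weekday business hours and quiet otherwise averages about 26% CPU, while a simple model keyed by day type and hour of day recovers the 85% demand it actually has to serve. The trace shape, the weekday/weekend split and the function names are assumptions for illustration, not a description of any particular tool.

```python
import statistics
from collections import defaultdict


def synthetic_trace(days: int = 28) -> list[tuple[int, int, float]]:
    """Hypothetical hourly CPU % samples as (day, hour, utilisation)."""
    trace = []
    for day in range(days):
        weekday = day % 7 < 5            # assume day 0 is a Monday
        for hour in range(24):
            busy = weekday and 9 <= hour < 17
            trace.append((day, hour, 85.0 if busy else 8.0))
    return trace


trace = synthetic_trace()

# Simplistic approach: one number for the whole period.
overall_mean = statistics.mean([u for _, _, u in trace])

# Pattern-aware approach: mean utilisation per (is_weekday, hour_of_day).
buckets = defaultdict(list)
for day, hour, u in trace:
    buckets[(day % 7 < 5, hour)].append(u)
model = {key: statistics.mean(vals) for key, vals in buckets.items()}

print(f"overall mean: {overall_mean:.1f}%")                # overall mean: 26.3%
print(f"weekday 10:00 demand: {model[(True, 10)]:.1f}%")   # 85.0%
print(f"weekend 10:00 demand: {model[(False, 10)]:.1f}%")  # 8.0%
```

A real model would also need monthly and quarterly buckets; this one stops at day type and hour only to keep the example short.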

Avoid the “Bump-Up” Loop 

Imagine that your reporting tool indicates your application workload is running at nearly 100% utilisation for four hours overnight. A simplistic tool will interpret high utilisation as bad, conclude that the workload is under-provisioned and recommend bumping up the CPU resources, and therefore the costs. Despite the change, the next day the workload still runs at 100%, just for a shorter period of time. Once again, the tool says to throw more resources at it. And so on: you’re stuck in an endless capacity bump-up loop.

This happens because some applications will take as much resource as you give them. To avoid this loop, you need to understand what the workload is doing and how it behaves. A high-level view of workload usage is not enough; you also need to understand each individual workload pattern at a granular level.
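
The bump-up loop can be reproduced with a toy model. Below, a hypothetical batch job has a fixed amount of work and always drives its vCPUs to 100%, so a rule that reacts only to peak utilisation keeps doubling the size, whereas a rule that asks whether the job finishes inside its business window settles on one size. The work amount, window, 90% threshold and instance sizes are all assumed values.

```python
# Hypothetical batch job: a fixed amount of work that always runs its
# vCPUs flat out, i.e. it "takes as much resource as you give it".
WORK_VCPU_HOURS = 32.0   # assumed total work per nightly run
WINDOW_HOURS = 6.0       # assumed deadline the business actually cares about
PEAK_THRESHOLD = 90.0    # naive rule: peak above this means "too small"


def run(vcpus: int) -> tuple[float, float]:
    """Return (runtime_hours, peak_utilisation_pct) for one run."""
    return WORK_VCPU_HOURS / vcpus, 100.0


def naive_bump_up(vcpus: int, iterations: int = 5) -> list[int]:
    """React only to peak utilisation: double whenever the peak is high."""
    history = [vcpus]
    for _ in range(iterations):
        _, peak = run(vcpus)
        if peak > PEAK_THRESHOLD:
            vcpus *= 2
        history.append(vcpus)
    return history


def window_based_size(sizes: list[int]) -> int:
    """Pick the smallest size whose runtime fits the business window."""
    for vcpus in sorted(sizes):
        runtime, _ = run(vcpus)
        if runtime <= WINDOW_HOURS:
            return vcpus
    raise ValueError("no size meets the window")


print(naive_bump_up(4))                       # [4, 8, 16, 32, 64, 128]
print(window_based_size([2, 4, 8, 16, 32]))   # 8
```

The peak-utilisation rule is capped at five iterations here only so that it terminates; left alone it would keep recommending bigger instances, because the peak never drops below 100%.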

Size Your Cloud Instances for Memory 

As with CPU capacity, sizing memory resources can also lead to the bump-up loop, because people often focus on the wrong statistics. The data supplied by the cloud providers, and the limited analysis tools people rely on, can be misleading.

Examining how much memory is being used will not suffice; you need to analyse how it is being used. This means checking whether the memory is actively being used by the system, which is usually measured as active memory or resident set size (RSS). This is the actual working set used by the application or operating system, and it is where we should look to see whether there is any pressure or waste.
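
On Linux, the gap between consumed and active memory is visible in `/proc/meminfo`, which reports fields such as `MemTotal`, `MemFree` and `Active` in kB. The sketch below parses a made-up sample in that format. Treating `MemTotal - MemFree` as "consumed" is a simplifying assumption (it includes page cache), and per-process resident set size would instead come from the `VmRSS` line of `/proc/<pid>/status`.

```python
# Made-up sample in /proc/meminfo format; values are hypothetical.
SAMPLE_MEMINFO = """\
MemTotal:       16384000 kB
MemFree:         1024000 kB
MemAvailable:   10752000 kB
Buffers:          512000 kB
Cached:          9216000 kB
Active:          4096000 kB
Inactive:        9728000 kB
"""


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style lines into {field_name: value_in_kB}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            fields[name.strip()] = int(parts[0])
    return fields


info = parse_meminfo(SAMPLE_MEMINFO)
consumed_kb = info["MemTotal"] - info["MemFree"]   # includes page cache
active_kb = info["Active"]                         # recently used working set

print(f"consumed: {consumed_kb / info['MemTotal']:.0%}")   # consumed: 94%
print(f"active:   {active_kb / info['MemTotal']:.0%}")     # active:   25%
```

On a live Linux host you could pass `open("/proc/meminfo").read()` instead of the sample string. In this sample the instance looks almost full even though only a quarter of its memory is active.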

To avoid costly mistakes in sizing memory, one-dimensional analysis is not enough. Best practice dictates that the actual “required memory” for a cloud instance must be a function of both consumed memory and active memory, and that policy must be applied in this calculation to ensure enough extra memory is earmarked for the operating system. This allows the operating system to do a reasonable amount of caching without the instance becoming bloated, providing the optimal balance of cost efficiency and application performance.
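
One hypothetical way to express “required memory” as a function of consumed and active memory, with a policy-driven allowance earmarked for the operating system, is sketched below. The formula, the `cache_fraction` and `os_reserve` policy parameters and the list of instance sizes are illustrative assumptions, not a published method.

```python
def required_memory_gib(consumed: float, active: float,
                        cache_fraction: float = 0.5,
                        os_reserve: float = 1.0) -> float:
    """Hypothetical policy: active working set plus an OS allowance.

    The allowance covers memory consumed beyond the active set (mostly
    cache), capped at cache_fraction of the active set, but is never
    smaller than os_reserve.
    """
    beyond_active = max(consumed - active, 0.0)
    allowance = max(os_reserve, min(beyond_active, cache_fraction * active))
    return active + allowance


def smallest_fit(required: float, sizes_gib: list[float]) -> float:
    """Smallest offered memory size that covers the requirement."""
    fits = [s for s in sizes_gib if s >= required]
    if not fits:
        raise ValueError("no size is large enough")
    return min(fits)


SIZES = [4.0, 8.0, 16.0, 32.0]   # hypothetical instance memory sizes (GiB)

need = required_memory_gib(consumed=15.0, active=4.0)
print(need, smallest_fit(need, SIZES))   # 6.0 8.0
print(smallest_fit(15.0, SIZES))         # 16.0 (sizing on consumed alone)
```

With 15 GiB consumed but only 4 GiB active, this policy asks for 6 GiB and fits an 8 GiB instance, whereas sizing on consumed memory alone would pick 16 GiB.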

Modernise for Positive Financial Impact    

Modernising public cloud instances means moving from an existing instance type to a newer version of that offering. Newer instance types are often hosted on newer hardware, which typically offers higher performance and greater processing power. Our experience has shown that when organisations both modernise and right-size instances, cost savings average 41%, approximately double the savings that right-sizing alone delivers.

So why are many companies spending 40% more than they have to? It all comes down to the complexity of cloud offerings, which makes it hard to reach the right decision. All the big public cloud vendors offer a dizzying array of services and instance types to meet a variety of computing needs, and they are continually introducing new instance offerings with different CPU, memory and I/O combinations. As the underlying hardware grows more powerful and prices change, one of these new offerings may meet your workload’s needs more cost-efficiently, but the onus is on you to find that out.

Only by using analytics that can make accurate, like-for-like comparisons across hardware platforms can you identify the best choice for each workload’s current requirements. These analytics must also factor in the cost of the various offerings, so that they can accurately calculate and optimise the financial impact as well as the performance.
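
A like-for-like comparison can be sketched by normalising each instance type's CPU capacity with a relative per-vCPU performance score, then filtering on the workload's requirements and ranking by price. Everything in the catalogue below (type names, scores, memory sizes and hourly prices) is hypothetical; in practice the scores would come from benchmarks of your own workloads and the prices from the provider's current price list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    name: str
    vcpus: int
    perf_per_vcpu: float   # relative benchmark score; 1.0 = baseline generation
    memory_gib: float
    hourly_price: float    # USD per hour (hypothetical)

    @property
    def cpu_units(self) -> float:
        """CPU capacity expressed in baseline-vCPU equivalents."""
        return self.vcpus * self.perf_per_vcpu


# Entirely hypothetical catalogue: an older and a newer generation.
CATALOGUE = [
    InstanceType("gen4.xlarge", 4, 1.00, 16.0, 0.20),
    InstanceType("gen4.2xlarge", 8, 1.00, 32.0, 0.40),
    InstanceType("gen5.xlarge", 4, 1.25, 16.0, 0.19),
    InstanceType("gen5.2xlarge", 8, 1.25, 32.0, 0.38),
]


def cheapest_fit(cpu_units: float, memory_gib: float,
                 catalogue: list[InstanceType]) -> InstanceType:
    """Cheapest type whose normalised CPU and memory cover the workload."""
    fits = [t for t in catalogue
            if t.cpu_units >= cpu_units and t.memory_gib >= memory_gib]
    if not fits:
        raise ValueError("no instance type satisfies the workload")
    return min(fits, key=lambda t: t.hourly_price)


print(cheapest_fit(4.8, 14.0, CATALOGUE).name)   # gen5.xlarge
print(cheapest_fit(6.0, 14.0, CATALOGUE).name)   # gen5.2xlarge
```

Here a workload needing 4.8 baseline-vCPU equivalents and 14 GiB fits the smaller newer-generation type, because its faster vCPUs make up the difference; comparing raw vCPU counts alone would have ruled it out.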

Eliminate the Deadwood   

Idle instances plague many public cloud environments and waste opex budget. This deadwood is often the result of hasty deployments or a lack of accountability in the cloud world: workloads change over time, and no one goes back to eliminate the now-idle instances. The fact is, most organisations don’t have an effective process for managing cloud instances and identifying idle ones. Plus, given the complexity of cloud provider invoicing and the lack of visibility into workload patterns, it is often hard to know what is truly idle.

Eliminating idle instances may seem obvious, but there is a potential risk. What if the instance is idle 95% of the time, but lights up briefly to handle a monthly or quarterly workload, such as batch processing? Eliminating that instance could spell disaster, especially if it is part of a mission-critical application. Identifying the true deadwood requires sophisticated analytics that examine workload patterns across a full business cycle and look at all utilisation factors, including CPU, I/O and memory.
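
The risk of judging idleness from a short window can be shown with a synthetic quarter of hourly samples for a hypothetical instance that runs a six-hour batch job once every 30 days. Checking CPU, I/O and active memory together against assumed thresholds, the last week of data looks idle, but the full 90-day business cycle does not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    cpu_pct: float          # CPU utilisation, %
    io_ops: float           # I/O operations per second
    active_mem_pct: float   # active memory as % of total


# Hypothetical idleness thresholds: a sample is idle only if all hold.
CPU_MAX, IO_MAX, MEM_MAX = 5.0, 50.0, 10.0


def monthly_batch_trace(days: int = 90) -> list[Sample]:
    """Synthetic hourly samples: a 6-hour batch run on days 0, 30 and 60."""
    samples = []
    for day in range(days):
        for hour in range(24):
            busy = day % 30 == 0 and hour < 6
            samples.append(Sample(90.0, 5000.0, 70.0) if busy
                           else Sample(1.0, 2.0, 5.0))
    return samples


def is_idle(samples: list[Sample]) -> bool:
    """Idle only if every sample is below every threshold."""
    return all(s.cpu_pct < CPU_MAX and s.io_ops < IO_MAX
               and s.active_mem_pct < MEM_MAX for s in samples)


trace = monthly_batch_trace()
last_week = trace[-7 * 24:]
busy_share = sum(not is_idle([s]) for s in trace) / len(trace)

print("idle over last 7 days:", is_idle(last_week))   # idle over last 7 days: True
print("idle over full quarter:", is_idle(trace))      # idle over full quarter: False
```

The instance is busy for under 1% of the quarter, yet deleting it on the strength of the weekly view would break the monthly job.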

While the path to public cloud IaaS might present numerous hurdles, with the right analytics and true visibility of workload patterns, organisations can make confident decisions about which workloads should stay and which should go. This insight helps them establish a process for regularly reviewing cloud instances to ensure they make the most of their opex spend.

Yama Habibzai, CMO of Densify 
