How infrastructure management can make IT a trusted, agile partner

Cost, complexity, convergence

Technology and application trends such as Big Data, storage and server growth, mobile applications, bring-your-own-device (BYOD) and social media are the biggest disrupters of enterprise IT today.

The resultant pressure on IT organisations is to deliver 24x7 "dial-tone" service levels that align with business demands for availability and performance, while remaining ever-agile to meet continually evolving application and business requirements.

Compounding the complexity and risk of this "new normal" for mission-critical infrastructures are complex interdependencies and technology advances, which are leading to blindingly fast servers, converging infrastructure components, mass adoption of virtualisation, and rapid migration to the private cloud.

The key driver behind all of these trends is demand for anytime/anywhere application access and flawless performance from any device. All this sharply increases storage requirements, mobile data, IP traffic and server counts, and fuels advances in new technologies. The result is a never-ending pace of rapid transformation, and an explosive increase in complexity and risk.

Managing the new normal

It's now business-critical for enterprises to have a purpose-built infrastructure performance management platform.

One that is specifically designed for granular, real-time monitoring of performance throughout the infrastructure supporting mission critical applications and business processes. One that also assures system-wide performance and availability.

This platform should enable infrastructure teams (no matter if they are server, storage, virtualisation or performance focused) to de-risk mission critical workloads across physical, virtual and cloud based computing environments. At the same time it should also provide deep insights into the real-time performance of the IT infrastructure, so that the risk of virtualising new applications and deploying new technologies can be minimised.

Leveraging such a platform would allow IT Organisations to align with business demands and application requirements, and become a trusted partner in driving business growth.

Impediments to aligning IT organisations with business goals: Minding and managing the gap

There are three main factors—people, processes and technology—and any one of them can prevent IT organisations from reaching a state of maturity, where they are seen by the broader organisation as a business advantage that can deliver exacting, cost-optimised service levels back to the organisation.

1. People and organisational structure

The IT organisation itself can impact business alignment, particularly if resources and management are functionally siloed. For example, organisations are often divided into Application, Server, Network, and Storage administrators. If there is a performance problem, the various administrators tend to make sure that it is not occurring in their domain. This structure pits teams against each other and results in finger pointing.

This predicament is further aggravated by device-specific system administration tools that provide a biased, myopic focus on only one facet of an IT infrastructure. When systems are interrelated and interdependent, it is nearly impossible to manage performance without a holistic and unbiased view of the environment, let alone quickly find root causes of problems or set accurate SLAs.

2. Process and IT maturity

IT organisations are forever focused on providing cost-appropriate service levels, but many lack the processes or metrics to fully realise their commitments. The reasons range from a lack of executive sponsorship, to over-allocated or mis-allocated resources, to little or no institutionalised business collaboration processes. In all cases, teams must focus on finding and resolving specific problems.

However, in many cases they lack the proper instrumentation and processes for fast and conclusive resolution. Many IT teams try to focus on aligning their operations with business goals. Often, however, this ends up being more of a hope than a reality. Without continuing executive commitment to people and processes (backed by accurate data), it's tough to achieve alignment. Keeping uniform processes in place, especially when there are multiple stakeholders with different reporting structures, incentives and goals, often proves impossible.

3. Technology and tools for yesterday's environments

Systems management tools have traditionally focused on specific components of an infrastructure, such as provisioning and monitoring server resources, managing storage capacity, or tracking utilisation of the storage and network fabric. In most cases these legacy device-specific tools have been marketed as performance monitoring, even though they only monitor utilisation.

Neither singly nor combined do these tools provide the performance information that application, server, storage and cloud administrators need to understand how their mission-critical infrastructures are actually performing. They lack real-time performance metrics, such as infrastructure response times for executing actions and efficiency in moving data. They also have limited scalability, and are often agent-based and vendor-specific.

These cobbled-together tool sets may appear to show that the overall IT environment is working fine, with the right capacities and resources provisioned. However, if application response is poor, the related IT service levels are still deemed unsatisfactory.

Performance versus utilisation

Infrastructure monitoring tools use utilisation metrics to infer the potential impact on performance. Since they are not measuring true performance, teams are led to over-provision resources in order to ensure that utilisation doesn't impact performance. However, now that over-provisioning is no longer acceptable, inferring performance is not an option.

The days of over-provisioning to ensure performance are gone, since it's no longer viable with hyper data growth and accompanying cost pressures—especially when the fundamental promise of virtualisation and the cloud is to drive down costs and achieve greater utilisation against existing assets.

Requirements for IT to be ever-agile

Another important factor placing everything at risk is the ever-changing IT infrastructure landscape. Systems management tools were originally designed for monitoring physical elements. With the advent of virtualisation and cloud architectures, those tools are neither appropriate nor useful for managing the related complexities.

The continued virtualisation and abstraction of the infrastructure and its underlying physical elements makes performance management, problem avoidance, root cause analysis and remediation of the physical elements all the more difficult.

IPM: Defined

Infrastructure Performance Management (IPM) is the ability to continuously capture, correlate and analyse, in real time, the system-wide performance, utilisation and health of heterogeneous physical, virtual and cloud computing environments. Ultimately, IPM enables IT to establish and maintain the service levels the business requires while driving the systems-level optimisation and agility that is the promise of virtualisation and the cloud.

IPM: Managing through compounding complexity and risk to assure IT business value and alignment

Infrastructure Operations and Management teams are forever under pressure to deliver IT service levels that are aligned with enterprise business goals and application performance requirements.

In fact, business processes and application teams often set IT's priorities in terms of availability and performance. Moreover, new IT infrastructure compute models include hybrid environments where portfolios of applications are migrated between, and reside within, physical, virtual and cloud environments simultaneously. The result is compounded complexity in monitoring, managing and reporting, which necessitates an IPM platform for assuring performance optimisation, risk mitigation and sustainable SLAs.

Trying to manage through this complexity forces the device-specific management tools of yesterday to change. In the past, management tools focused on capacity, utilisation and management of individual physical components. With the move to virtualisation and new deployment models, Application Performance Management (APM) and Network Performance Management (NPM) have been trying to extend and fill in for what's missing.

The challenge is that neither delivers a holistic view of the underlying IT infrastructure. So, between all of these tool sets that address device, application and network performance, there is still a huge gap in understanding system-wide performance.

This can be solved with a real-time Infrastructure Performance Management platform.

Real-Time IPM platform requirements

An IPM platform must be able to continuously capture, correlate and analyse system-wide heterogeneous infrastructure performance, utilisation and health metrics in real-time. It must provide an unbiased view of system-wide infrastructure performance, from virtual machine, to server, to switch fabric, to storage array to logical unit of storage.

1. Continuous, granular monitoring and measurement

The most important metric for IT infrastructure performance is infrastructure response time. This is defined as the time it takes for any application, running on a physical server or virtual machine, to place an I/O request and get a response back. In large, highly virtualised and/or cloud infrastructures, too much happens within a five- or 20-minute polling interval for it to be glossed over.

For monitoring to be effective, organisations need to capture the minimum, maximum and average response times at intervals of no more than one second to gain accurate insight into overall infrastructure performance.
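To illustrate why the polling interval matters, here is a minimal Python sketch that groups response-time samples into one-second buckets and keeps the minimum, maximum and average of each. The function name, the sample data and the 50 ms stall are hypothetical and chosen only for illustration; a real platform would collect these figures from the infrastructure itself.

```python
from collections import defaultdict
from statistics import mean


def summarise_per_second(samples):
    """Group (timestamp_seconds, response_ms) samples into one-second buckets.

    Returns {second: (min_ms, max_ms, avg_ms)}.
    """
    buckets = defaultdict(list)
    for timestamp, response_ms in samples:
        buckets[int(timestamp)].append(response_ms)
    return {
        second: (min(values), max(values), mean(values))
        for second, values in sorted(buckets.items())
    }


# Hypothetical data: 300 seconds of I/O at 2 ms, plus one 50 ms stall in second 120.
samples = [(second + 0.5, 2.0) for second in range(300)]
samples.append((120.25, 50.0))

per_second = summarise_per_second(samples)

# A single average over the whole five-minute window, as a coarse poll would report.
five_minute_average = mean(response_ms for _, response_ms in samples)

# The second whose maximum response time was highest.
worst_second = max(per_second, key=lambda second: per_second[second][1])

print(f"five-minute average: {five_minute_average:.2f} ms")
print(f"worst second: {worst_second}, (min, max, avg) = {per_second[worst_second]}")
```

In this made-up data the five-minute average stays close to 2 ms, while the one-second summary for second 120 shows the stall clearly in its maximum.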

2. Unbiased and heterogeneous

Infrastructures by design are heterogeneous, with multiple vendors delivering their part of a complete physical, virtual and cloud solution. An IPM platform must collect data from various devices without vendor or product-specific bias or dependencies. It must deliver an unbiased, vendor independent view of the whole system—with precise metrics that enable understanding of what is happening throughout the system.

3. System-wide data collection, visibility and analytics

The IPM platform must capture, organise and correlate metrics about all the infrastructure elements. This means every element from the virtual machine through the server, fabric and storage arrays down to the logical unit number (LUN). Further, the data must be presented in a way that clearly describes interdependencies in the context of system-wide performance.

The data must also be persisted, to make it easy to go back in time and pinpoint exactly where and when a performance impacting event happened and how it affected the system overall. The IPM platform must track all granular I/O activity across highly virtualised infrastructures—from server transmission, through the switch to storage and back—and leverage an analytics framework for contextual understanding, correlation and discovery.
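As a rough illustration of going back through persisted data to find where and when an event began, the Python sketch below scans stored per-second records for the earliest one whose response time is well above the normal level for that element. The record layout, the element names and the "five times the median" rule are assumptions made for this example, not features of any particular IPM product.

```python
from statistics import median


def find_first_anomaly(records, factor=5.0):
    """Return the earliest (second, element, response_ms) record whose response
    time exceeds `factor` times that element's median, or None if there is none."""
    by_element = {}
    for second, element, response_ms in records:
        by_element.setdefault(element, []).append(response_ms)
    medians = {element: median(values) for element, values in by_element.items()}

    # Tuples sort by second first, so the first match is the earliest in time.
    for record in sorted(records):
        second, element, response_ms = record
        if response_ms > factor * medians[element]:
            return record
    return None


# Hypothetical history: three elements on one I/O path, sampled once per second.
records = []
for second in range(60):
    records.append((second, "server-hba", 0.2))
    records.append((second, "switch-port", 0.1))
    # The storage element slows down from second 40 to second 44.
    slow = 40 <= second < 45
    records.append((second, "storage-lun", 9.0 if slow else 1.0))

first = find_first_anomaly(records)
print(first)
```

Because every element is compared against its own history, the same rule can be applied to servers, switch ports and storage without vendor-specific thresholds.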

4. Scalability

A typical virtualised enterprise IT infrastructure comprises thousands to hundreds of thousands of servers, switch ports and storage I/O ports, and petabytes of storage. An IPM platform must be able to handle the large number of installed physical devices and the associated metrics without a hiccup and without risk of hitting a limit. Most legacy device or system managers, by design, struggle at this scale.

Benefits of an IPM platform

An IPM platform delivers numerous CAPEX and OPEX benefits to both IT and the business:

Operational efficiency and effectiveness: Cross-domain system visibility enables comprehensive measurement of infrastructure performance from the virtual machine all the way to the storage LUN, so performance bottlenecks can be identified and corrected proactively, before they impact the business. The result is significantly higher performance and higher utilisation of existing infrastructure assets.

Additionally, IPM helps IT improve infrastructure response times, proactively avoid outages, quickly identify root causes and remediate performance issues, and drive continuous improvement in reliability.

1. Risk mitigation

IPM enables highly accurate baselining to proactively identify the impact of any and all changes to the infrastructure. This improves forecast accuracy, and results are seen in real-time. For example, IT can migrate applications from physical to virtual and cloud environments with much lower risk. Application performance has an interdependent relationship with the infrastructure it runs on. Different applications require different resource quality, and compete for resources with other applications, so it is important to manage the infrastructure accordingly.

By leveraging IPM data, it's possible to project application performance and ensure that business processes are executing as they should. This applies equally whether the change process spans months of engineering design, test and rollout, or happens in near real time in highly automated cloud environments.
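The baselining idea in this section can be sketched in a few lines of Python: summarise response times before a change, then check whether the figures after the change fall outside the normal range. The function names, the sample values and the three-standard-deviation tolerance are illustrative assumptions only.

```python
from statistics import mean, pstdev


def build_baseline(response_times_ms):
    """Summarise pre-change response times as (mean, standard deviation)."""
    return mean(response_times_ms), pstdev(response_times_ms)


def change_degraded_performance(baseline, after_change_ms, tolerance=3.0):
    """True if the post-change mean exceeds the baseline mean by more than
    `tolerance` baseline standard deviations."""
    baseline_mean, baseline_stdev = baseline
    return mean(after_change_ms) > baseline_mean + tolerance * baseline_stdev


# Hypothetical response times (ms) before and after migrating a workload.
before = [2.0, 2.2, 1.8, 2.1, 1.9, 2.0]
after_ok = [2.1, 2.0, 2.2, 1.9, 2.0, 2.1]
after_bad = [3.5, 3.8, 3.6, 3.9, 3.7, 3.6]

baseline = build_baseline(before)
print("migration A degraded:", change_degraded_performance(baseline, after_ok))
print("migration B degraded:", change_degraded_performance(baseline, after_bad))
```

A production baseline would normally account for time of day and workload mix; this sketch only shows the comparison step.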

2. Business alignment

The business side of an enterprise is mainly concerned with providing the service levels required by their users and having an agile response as those requirements change and evolve—all while optimising the related operational (OPEX) and infrastructure capital expenses (CAPEX).

For IT organisations, IPM gives the ability to stay in lock-step with business agility. IPM helps IT quickly identify root causes, and proactively (and significantly) reduce mean time to resolution. By increasing system utilisation, reducing over-provisioning and making the correct decisions on infrastructure spend versus performance required, IT can significantly reduce CAPEX. Having an IPM platform helps cut through virtualisation complexity, accelerates transformation and minimises risk.

The IPM platform: Enabling IT organisations for ever-agile business alignment

Though it's still important to understand utilisation and how it affects performance, it's now essential to understand it within the context of true infrastructure response times. It's critical to understand how the infrastructure is delivering the resources required by the applications relying on it, and if the resources are delivering the right service levels.

An IPM Platform looks at the entire compute environment from a systems level—focusing on correlation and interdependencies of performance, utilisation and health overall.

The benefits for IT organisations are multi-fold, as they are now enabled to:

  • Right size and align infrastructure capacity and resource quality
  • Drive greater utilisation against existing assets and resources
  • Accurately measure I/O traffic, correlate system-wide data and monitor trends
  • Improve infrastructure response times
  • Proactively, accurately and quickly identify and remediate problems
  • Deliver the right level of performance at the appropriate cost
  • Establish SLAs aligned with business requirements
  • Effectively partner with and contribute to the success of the business
  • Deliver on the promise of agility and business alignment

The move to virtualised and cloud environments can help reduce CAPEX growth in servers, but this new paradigm can't contribute to controlling storage CAPEX or OPEX growth in maintenance costs and staff. In addition, there's often a revenue impact of downtime or unacceptable performance, which can also impact brand equity and customer satisfaction. If one is not vigilant, the net effect of cloud-driven consolidation, migration, and new technology roll-outs is increased risk and cycle time, often negating the very reason to move to a cloud infrastructure in the first place.

The new normal

Current IT infrastructures are at a point where the entire system, from server to storage fabric to storage arrays, has been virtualised and abstracted to the extent that clear visibility into performance and issue resolution are ongoing challenges. As multi-vendor, multi-layer data centre architectures perpetually change, the views of the user, administrator and application are increasingly decoupled from the actual physical architectures.

As a result, a clear and unbiased understanding of the physical infrastructures, the applications dependent on them, and how they are performing, is difficult to achieve. Moreover, this "new normal" makes it nearly impossible to manage with confidence and authority. These circumstances are creating inherent gaps in visibility that must be addressed in order to guarantee performance and availability.

Image: Flickr (Sebastian Fissore; Victor1558)