
Top 3 criteria for choosing a data virtualisation solution

With cost cutting and greater agility so high on the CIO agenda, data virtualisation solutions (also known as Copy Data Management (CDM), Copy Data Virtualisation (CDV) and Data Virtualisation Appliances (DVA)) have steadily become more mainstream.

This adoption is hardly surprising. Data virtualisation is a simple answer to many of the challenges that exist around storage cost and copy-data provisioning.

Database vendor Oracle claims that, on average, a customer has twelve copies of each production database in non-production environments. This covers everything from development, Quality Assurance (QA) and User Acceptance Testing (UAT) to backup, business intelligence and sandboxes.

Large enterprise companies often have thousands of databases, many reaching multiple terabytes in size. The downstream storage costs of these data copies can be staggering. However, when it comes to choosing a data virtualisation solution it’s hard to know where to start. When so many vendors make similar claims or use wildly different explanations, how do you know which to choose?

With the wisdom of hindsight and after cutting my teeth with many years of data virtualisation experience, I’ve come up with what I consider the top three questions to ask when looking at data virtualisation solutions:

  1. Will the solution address your business goals?
  2. Will the solution support your entire IT landscape?
  3. Is it automated, complete and simple to use?

Simple enough, right? Well let’s unpack these a bit further.

  1. Addressing business goals

The first step is to articulate what those goals are. The top use cases for data virtualisation are storage savings, accelerating application development, improving data protection and production support.

Ultimately, all data virtualisation solutions will offer storage savings if you are virtualising for the first time. Data virtualisation provides thin clones of data, so each new copy initially takes up no space; new space is only needed to save modifications to these copied datasets. This raises the first comparison point: how much storage is required to store new modifications, and how much is required to initially link to a data source?

In my experience, the initial required storage ranges from a third of the size of the source data to three times that size. The reduced volume is largely down to compression, so that’s a feature to look for. When it comes to additional storage, it pays to look at how the solution deals with changed data blocks: does it save just the changed values, or is it forced to make copies of larger data blocks?
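The copy-on-write idea behind thin cloning can be sketched in a few lines of Python. This is an illustrative model only (the class, block counts and clone count are made up, not any vendor’s API): clones share the source’s blocks, and new storage is consumed only when a clone modifies a block.

```python
# Illustrative sketch of copy-on-write accounting, the mechanism behind
# thin cloning. Not a real storage implementation.

class ThinClone:
    def __init__(self, source_blocks):
        # A clone starts as references to the shared source blocks.
        self.blocks = dict(source_blocks)   # block_id -> data (shared)
        self.private = set()                # blocks this clone has rewritten

    def write(self, block_id, data):
        # Only a modified block consumes new space (copy-on-write).
        self.blocks[block_id] = data
        self.private.add(block_id)

    def extra_storage(self):
        # Space consumed beyond the shared source image, in blocks.
        return len(self.private)

source = {i: f"block-{i}" for i in range(1000)}   # a 1,000-block "database"
clones = [ThinClone(source) for _ in range(12)]   # 12 dev/QA/UAT copies

clones[0].write(42, "patched")
clones[0].write(43, "patched")

# Twelve full physical copies would need 12,000 blocks of storage;
# thin clones need only the blocks actually changed.
total_extra = sum(c.extra_storage() for c in clones)
print(total_extra)  # 2
```

The same model shows why changed-block granularity matters: a solution that copies a whole larger region per modification would grow `total_extra` far faster than one that saves just the changed values.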

Despite the massive storage savings, for most adopters data agility is more often the primary business goal. Agility means a virtual copy can be made in minutes, whereas a traditional full physical copy of a large database can take hours, days or even weeks.

This ability to provision data sets so rapidly has a massive knock-on effect on the speed of application development and testing. Application development typically requires many copies of source data when developing and/or customising an application.

These copies of data are required not only by developers but also during testing. The ability to rapidly provision, refresh and reset data speeds up development and reduces the associated costs and strain on resources. Key considerations are whether provisioning is automated, how easy it is to choose the assets you want to provision from, and how easy it is to identify the point in time from which the data is provisioned.

In terms of better supporting development and production, there are some core features to look out for:

Versioning, refreshing, bookmarking and rollback - These are standard development functions and should be absolutely fundamental in a data virtualisation solution. Can a developer bookmark a certain version of a database, version-control data sets and roll back to how the data was at specific moments in time?

Masking - In almost all cases full data sets contain sensitive content that should be masked before giving the data to developers. Including data masking means that data protection is automatic and isn’t an obstacle when provisioning data. It also ensures that masked data copies stay linked to the original – avoiding challenges when it comes to reset.

Branching - Branching data copies means making a new thin clone copy from an existing thin clone copy. This is essential for being able to spin up copies of data for testing, directly from development. One of the biggest bottlenecks in development is supplying QA with the correct version of data to run tests. If there is a development database with schema changes or modifications, then instead of having to build up a new copy, one can branch a new clone, or many clones, and give them to QA in minutes. All the while development can go ahead and continue to use the data branch they were working on.
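The branching workflow above can be sketched in a few lines of Python. Again this is a conceptual model, not a vendor API: a branch copies only block references, so QA gets development’s exact state immediately, and development’s later changes don’t leak into QA’s copies.

```python
# Illustrative sketch of branching thin clones: a new clone made from an
# existing clone shares its blocks until either side writes.

class Branch:
    def __init__(self, blocks):
        self.blocks = dict(blocks)  # shared references to parent's blocks
        self.changed = set()

    def write(self, block_id, data):
        self.blocks[block_id] = data
        self.changed.add(block_id)

    def branch(self):
        # Branching copies only block *references*, not data,
        # so it completes in (near) constant time.
        return Branch(self.blocks)

prod = Branch({i: 0 for i in range(100)})
dev = prod.branch()
dev.write(7, "schema-change")   # dev applies a schema change

qa1 = dev.branch()              # QA branches see the change instantly
qa2 = dev.branch()
dev.write(8, "wip")             # dev keeps working; QA is unaffected

print(qa1.blocks[7])  # schema-change
```

Because a branch is just new references, handing QA "many clones in minutes" is cheap, and development can keep writing to its own branch without coordination.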

Backup and data protection - Development databases are often not backed up because they are “just development”. But if a developer inadvertently corrupts data, they could be in trouble. Timelines are important: if your solution can roll back to, or branch from, the second before a bug or accidental damage, you need not lose any work. Some solutions offer no protection; others offer manual snapshots of points in time. The best solutions simply and automatically provide a window of multiple days into the past.
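The "window into the past" idea can be shown with a small Python sketch. The snapshot interval and retention period here are illustrative assumptions (hourly snapshots, three days retained), not a statement about any particular product:

```python
# Illustrative sketch of a rollback window: automatic snapshots retained
# over several days, from which we pick the latest point in time just
# before a corruption occurred.

import bisect
from datetime import datetime, timedelta

start = datetime(2024, 1, 1)
# Hypothetical: automatic hourly snapshots over a 3-day retention window.
snapshots = [start + timedelta(hours=h) for h in range(72)]

def rollback_point(corruption_time):
    # Latest snapshot strictly before the damage occurred.
    i = bisect.bisect_left(snapshots, corruption_time)
    return snapshots[i - 1] if i else None

bad = datetime(2024, 1, 2, 14, 30)   # corruption at 14:30 on day two
print(rollback_point(bad))           # 2024-01-02 14:00:00
```

A solution that records changes continuously (down to the second) effectively shrinks the gap between `bad` and the recovered point to near zero; one that relies on manual snapshots leaves that gap as lost work.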

User-specific interfaces – Whether it’s for a developer, a database administrator (DBA) or a storage administrator, each user will have different requirements of a data virtualisation solution. Administrator-only interfaces impede self-service: developers have to request copies, causing delays. Per-user logins help enforce the correct security level by limiting which data developers have access to, how many copies they can make and how much extra storage they can use when modifying data.

  2. Support your entire IT landscape

What you want from a data virtualisation solution is versatility. A solution that can expand to fit the needs of the IT department and all other enterprise requirements is crucial. It also needs to play well with others: integrating with and supporting your existing storage and operating systems, and handling multiple data types. Take a look at the hardware required: is it specialised, or can the solution run on multiple systems?

Future-proofing against what your storage or IT estate will look like in several years’ time is vital. This ensures you don’t find yourself locked in to a specific storage type, particularly as new, better and more affordable systems become available. For example, how does the solution integrate with cloud platforms?

Assess the needs of your IT estate for virtualisation and ensure that the solution has a proven track record in supporting a wide range of systems and data types.

  3. Fully Automated, Complete and Simple

The key to efficiency and self-service is automating as much of the work as possible. Can an end user provision data, or does it require a specialised technician such as a storage admin or DBA? When provisioning databases such as Oracle, SQL Server or MySQL, does the solution fully and automatically provision a running database, or are manual steps required?

The goal is to reduce manual intervention as much as possible so that you have as much flexibility in terms of what data and from when, but with the least effort.

Things like syncing and collecting changes should be automated to reduce manual intervention or the need to integrate other tools.

No data virtualisation solution should be an island – you want the tool to be as complete as possible. A point solution for specific databases will limit you in terms of the kind of storage and systems you can use. If the solution includes masking, replication, backup and recovery (down to the second), then use and deployment are far simpler. These essential functions become part of the process, instead of added layers that create delays and obstacles and require workarounds.

Simplicity is key – both when it comes to deployment and use. You want each of your users to be able to self-serve data, providing Data as a Service, and access the tools and features they require for their role.

The goal is to empower users by putting the data they need within their reach, as they need it. However, administration of the system should be simple too. It’s vital to ask questions like: is it easy to add new storage, or to take it away? And do you get a single overview of the virtual data copies across the (potentially hundreds of) separate locations in your IT estate?

In Summary

Find out how powerful, flexible and complete the solution is. Some solutions are specific point solutions that, for example, cater only to Oracle databases.

Complete, flexible solutions sync automatically with source data, collect all changes, allow you to provision data from time windows that are accurate down to the second, and support any data type or database on any hardware, as well as the cloud.

But when it comes down to it, even after asking all these questions, don’t believe the answers alone. Ask the vendor to prove it.

Kyle Hailey is a performance architect at Delphix.