Virtualisation technology dominates the enterprise landscape. According to Gartner, most firms report 75 per cent or higher virtualisation. Improvements in hypervisors have reduced the complexity of setting up and maintaining physical servers, greatly improved server utilisation, and increased IT flexibility and responsiveness to the needs of the business. It’s no wonder that the bulk of modern IT systems are virtualised.
But, whether you use VMware, Hyper-V, Citrix, Oracle or any of the other hypervisors, there is a potential downside to virtualisation. In order to transform a physical server into many virtual machines (VMs), an additional software layer is added. While simplifying the admin user experience, virtualisation raises the overall complexity of the IT environment as the underlying hardware is obfuscated, making it more difficult for admins to know which physical system their VMs are running on or which storage is used for a particular machine in the event of data loss. With fewer people to maintain and monitor a larger number of virtual machines (compared to physical servers), there are greater chances for problems and data loss.
The not so fantastic four
The primary causes of virtual machine data loss are:
- Hardware/RAID issues
To help protect against data loss, modern systems will often use some form of replication of data across multiple physical drives (HDD or SSD) that is consolidated into a single logical unit. This data protection can be a hardware or software-based solution. RAID combines multiple hard drives or data stripes to improve redundancy, increase data reliability and boost I/O (input/output) performance. RAID effectively fragments data across many disks and reassembles it when requested by the user or needed by the system.
Unfortunately, data loss is not uncommon with RAID storage. Deduplication and compression add to the complexity of modern hardware and software RAID. Now factor in an additional virtualisation layer and the likelihood of a fault increases. If a RAID configuration becomes corrupted, files can’t be rebuilt. When that happens, the interconnectivity of multiple systems can potentially cause significant data loss and downtime.
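To make the fragment-and-reassemble idea concrete, here is a minimal Python sketch of single-parity striping, loosely in the style of RAID 5. It is an illustration only: the function names are hypothetical, parity is kept in a fixed last position rather than rotated, and real controllers add layout metadata that this toy ignores.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data, data_disks):
    """Split data into one equal-sized chunk per data disk, plus a parity chunk."""
    chunk_len = -(-len(data) // data_disks)  # ceiling division
    padded = data.ljust(chunk_len * data_disks, b"\0")
    chunks = [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(data_disks)]
    return chunks + [xor_blocks(chunks)]  # parity goes last (an assumption)


def rebuild_missing(stripe):
    """Recompute the single missing chunk (marked None) from the survivors."""
    missing = [i for i, chunk in enumerate(stripe) if chunk is None]
    if len(missing) != 1:
        raise ValueError("single parity can rebuild exactly one missing chunk")
    survivors = [chunk for chunk in stripe if chunk is not None]
    repaired = list(stripe)
    repaired[missing[0]] = xor_blocks(survivors)
    return repaired


stripe = make_stripe(b"virtual machine data", data_disks=4)
damaged = list(stripe)
damaged[1] = None  # simulate one failed drive
assert rebuild_missing(damaged) == stripe
```

The rebuild only works because the code knows the chunk size and chunk order. That layout knowledge is the "configuration" in this sketch: lose it, or lose a second chunk, and the parity alone is not enough.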
- Formatting/Software issues
Reformatting a disk, virtual disk, array, LUN, vDisk, volume or other storage media and re-installing software are additional causes of data loss in virtualised environments.
Corruption can come about due to buggy patches and updates without an offline backup, poorly planned implementation of new software, integration issues and database corruption. These issues can also cause host file corruption and guest file system damage.
Thin provisioning data loss, too, should be considered. Instead of allocating all the space the VM will need up front and positioning the file system structures at their specified physical offsets, thin provisioning only provisions the amount of space immediately needed and adds additional blocks to the virtual disk as it grows. This can result in a more complex and fragmented virtual environment on disk. If the metadata pointers to the data are missing or damaged, it is challenging to locate the various fragments and rebuild the virtual disk. Alternatively, the mapping layer within the virtual disk may be damaged or overwritten, making reassembly extremely difficult.
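The dependence on the mapping layer can be shown with a toy model. The Python sketch below assumes a hypothetical `ThinDisk` class with a tiny four-byte block size. It is not how any particular hypervisor lays out its virtual disks, only an illustration of allocate-on-first-write.

```python
class ThinDisk:
    """Toy thin-provisioned disk: blocks are allocated only on first write."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.backing = bytearray()  # physical extent, grows on demand
        self.block_map = {}         # logical block number -> physical block number

    def write_block(self, logical, payload):
        if len(payload) != self.block_size:
            raise ValueError("payload must be exactly one block")
        if logical not in self.block_map:
            # Allocate the next free physical block at the end of the extent.
            self.block_map[logical] = len(self.backing) // self.block_size
            self.backing.extend(bytes(self.block_size))
        start = self.block_map[logical] * self.block_size
        self.backing[start:start + self.block_size] = payload

    def read_block(self, logical):
        if logical not in self.block_map:
            return bytes(self.block_size)  # unallocated blocks read as zeros
        start = self.block_map[logical] * self.block_size
        return bytes(self.backing[start:start + self.block_size])


disk = ThinDisk()
for logical, payload in [(7, b"CCCC"), (0, b"AAAA"), (3, b"BBBB")]:
    disk.write_block(logical, payload)

# Physical order follows write order, not logical order.
assert bytes(disk.backing) == b"CCCCAAAABBBB"

# With the map intact, the logical order is recoverable.
in_order = b"".join(disk.read_block(n) for n in (0, 3, 7))
assert in_order == b"AAAABBBBCCCC"
```

If `block_map` is wiped, every read returns zeros even though all twelve bytes are still sitting in `backing`. Recovery then becomes a matter of working out which physical fragment belongs at which logical offset.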
- Virtual file system metadata corruption
Yet another source of data loss is metadata corruption. Metadata is even more important in virtualised environments due to the number of layers and VMs that exist. A small problem with VMFS metadata can have serious repercussions for data availability.
- User error
A surprisingly large number of failures are due to virtual disks being deleted by mistake, VMs being overwritten or their space being reassigned. There can also be snapshot chain corruption, i.e. one of a series of snapshots is corrupted, deleted or otherwise unavailable. This can foul up backups and make it difficult to recover data.
Ironically, the ease of use of modern hypervisors is causing organisations to invest in less training. Inexperienced staff are being handed responsibility for managing large and ever-growing virtualised environments.
Employee turnover is another source of problems. A new incumbent who can’t figure out the intricacies of the virtualised architecture may inadvertently delete VMs or introduce changes that result in data loss. In other cases, the original flat file may be stored but nobody can find it when data loss occurs. Neglect of backups, too, is a common reason for virtual data loss.
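The snapshot chain problem mentioned above is easy to picture as a linked list in which each snapshot depends on its parent. The following Python sketch is a simplified illustration: the `resolve_chain` helper and the snapshot names are hypothetical, and real hypervisors record parent links in their own descriptor formats.

```python
def resolve_chain(snapshots, leaf):
    """Walk parent links from the leaf snapshot back to the base disk.

    `snapshots` maps a snapshot name to its parent name (None for the base).
    Returns the names ordered base first, or raises if a link is missing.
    """
    chain = []
    seen = set()
    current = leaf
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle detected at {current!r}")
        if current not in snapshots:
            raise KeyError(f"chain broken: {current!r} is missing")
        seen.add(current)
        chain.append(current)
        current = snapshots[current]
    return list(reversed(chain))


# Hypothetical three-link chain: base disk plus two snapshots.
snapshots = {"base": None, "snap1": "base", "snap2": "snap1"}
assert resolve_chain(snapshots, "snap2") == ["base", "snap1", "snap2"]
```

Removing `snap1` from the dictionary makes the same call raise `KeyError`, even though `base` and `snap2` are both still present. One missing link in the middle is enough to leave the latest state unreadable.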
On the level
What can enterprises do, then, when they experience data loss from a virtualised environment? There is no back or undo button. A deleted VM is gone. Backups? They are often incomplete or corrupted. Fortunately, data recovery is often possible through global data recovery service providers.
The good news is that there are a great many ways to recover some and, in many cases, all of the lost virtual data. The first point of entry is at the storage level. It can be possible in some cases to directly recover data from physical drives by taking an image of the drives and reading whatever raw data might be available on the disk.
The next option is to attempt to recover data from the logical volumes (LUNs) or RAID. If the RAID controller is available, it can be used to track down the many slices of data spread across virtual disks. By determining what the configuration should be, engineers can virtually rebuild the array and gain access to the storage. If the RAID controller is corrupted, it may be necessary to emulate the RAID controller and rebuild what is missing.
The next level up is the host file system level. In VMware this would be VMFS and in Hyper-V, NTFS or ReFS. In many cases, data isn’t available directly at the storage level. But if the right tools are used, recovery experts can trace data from the basic storage data blocks, map it to the host level and recompile it.
If that process doesn’t provide an adequate recovery, additional tools can be employed to extend further into the guest file system level. By investigating the virtual file system, data recovery specialists can sometimes find data that would otherwise be lost. Finally, it is possible to reach into the guest file level and access data lurking in application files such as SQL, Exchange, SharePoint, Oracle, Office files, ZIP files and more.
What it takes is an understanding of each level and knowing what might be available where. Those well-versed in storage architectures can track down data that seemed lost by finding pieces of it in one level and other parts in another level.
A unique set of challenges
Virtualisation may save time and eliminate complexity from the user view. But it comes with a unique set of challenges, one of which is a rising incidence of corruption and data loss. Whether through volume corruption, ransomware, corrupted virtual backups, hardware failures or accidentally deleted files, data loss is a reality for anyone managing virtual systems. Whilst backup is necessary to safeguard enterprise data, it is far from foolproof, so it shouldn’t be overly relied upon.
Philip Bridge, President, Ontrack