
Don’t be duped by dedupe: Understanding data deduplication for backup

Too much data, not enough time, not enough storage space, and not enough budget. Sound familiar? Since the first mainframes, IT teams have worked hard to optimise storage capacity requirements and data protection processes.

Backups are failing, taking too much space, and costing way too much. In the open systems world, these are the same issues we faced years ago, when data deduplication technology first entered the mainstream. Today, data volumes are growing exponentially and organisations of every size are struggling to manage what has become a very expensive problem.

Cheaper storage helps, but is not operationally efficient for many workloads. Instead we need to shrink data to more manageable levels, because too much of it causes real problems. Problems like:

  • Overprovisioned, expensive backup infrastructure.
  • Backups with legacy products that take too long or are incomplete.
  • Missed recovery point objectives and recovery time targets.
  • Overloaded infrastructure and network bandwidth.
  • An inability to embrace new technologies such as cloud backup, because there is too much data to transfer over wide area networks.

Today, new technology advances are needed to combat the unstoppable and exponential growth of virtual machines and data. There are many out there, and it’s not always easy to tell which is right for your data. I’ll focus on two competing dedupe technologies for backup to help shine a light on what they bring to the backup process.

Target deduplication

In recent years it’s become clear that backing up large amounts of data has a big impact on backup windows. Not only that: storing TBs or even PBs of backup data carries a huge cost, and deduplication appliances have stepped to the fore. The process takes backup data, optimises it through deduplication and stores it on disk: “compressing” backup volumes and saving money.

Target deduplication works very well and is still used in many environments. It’s attractive because users only need to change the destination of the backup streams rather than drastically changing their backup software configurations or policies.

Figure one. Target-side deduplication

Target deduplication happens either inline (on the fly) or as a post process: write everything to disk first so backups complete faster, then optimise after the fact. In one of its educational sessions, SNIA provides a view of target deduplication scenarios (figure one): the backup software acts as the data mover and sends all the non-deduplicated backup data streams to the target disk or VTL appliance.
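To make the mechanics concrete, here is a minimal sketch of hash-based target-side deduplication — not any particular vendor’s implementation. The fixed 4 KB chunk size and SHA-256 hashing are assumptions for illustration; real appliances often use variable-size chunking. Note that the appliance still receives the full, non-deduplicated stream; only what it stores shrinks:

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed-size chunks; appliances often chunk variably


class TargetDedupeStore:
    """Illustrative target-side dedupe: the client sends the full stream;
    the appliance splits it into chunks and stores each unique chunk once."""

    def __init__(self):
        self.chunks = {}        # content hash -> chunk bytes (deduplicated store)
        self.bytes_received = 0  # everything still crosses the wire

    def ingest(self, stream: bytes) -> list:
        """Receive a full backup stream; return the 'recipe' (ordered list
        of chunk hashes) needed to reconstruct it on restore."""
        self.bytes_received += len(stream)
        recipe = []
        for i in range(0, len(stream), CHUNK_SIZE):
            chunk = stream[i:i + CHUNK_SIZE]
            h = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(h, chunk)  # store only if unseen
            recipe.append(h)
        return recipe

    def restore(self, recipe: list) -> bytes:
        """Reassemble the original stream from stored chunks."""
        return b"".join(self.chunks[h] for h in recipe)
```

Running two identical backups through this store doubles `bytes_received` but adds no new chunks — which is exactly the trade-off discussed below: storage is saved, but the full data volume still moves.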

Global source-side deduplication

Why deduplicate data after the fact if you can back up only new and unique data at the source? As long as you do not impact the client, you can save the bandwidth needed to send backup streams to the target and share deduplication intelligence across all your clients.

Global source-side deduplication removes redundancies from data before it is transmitted to the backup target. The challenge is to ensure that the source (client) system is not bogged down by the deduplication software.

Using global source-side deduplication across all clients is central to limiting the unnecessary storage and transfer of duplicate backup data: freeing up server space and cutting down the time it takes to back up data.

Data is deduplicated across nodes, jobs and sites. Global deduplication goes beyond the limitations of WAN optimisation products: it doesn’t simply apply to the WAN replication cache, it targets what is actually stored on disk. And because backup data is globally deduplicated before it is transferred to the target backup server, only changes are sent over the network, improving performance and reducing bandwidth usage. The entire process is secured with data store-level encryption and per-session passwords.

However, global source-side deduplication is built into the backup server and requires a new generation of data protection software. Legacy backup solutions were not designed with dedupe in mind, so they cannot really accommodate this kind of technology.

What works best?

Of course every technology has its advantages and disadvantages. Driven by the rapid adoption of disk media as a backup storage target, target-side deduplication has delivered important benefits for storage efficiency with minimum disruption to the existing backup software – but ultimately it still moves and processes all the data a user produces, and in time that will slow down the backup process. Global source-side deduplication, on the other hand, is efficient and targeted, and brings business benefits too:

  • Complete backups faster. With less data to transmit and store, backups are faster. This is important where the total volume of data threatens to take so long to back up that one backup isn’t finished before the next one is due to start.
  • Reduce bandwidth required for backups. The backup server only pulls new or changed data from a client with high levels of granularity – even down to 4 KB chunks. This makes backups extremely efficient.
  • Improve client performance. For virtualised environments, agentless VMware and Hyper-V backup reduces the risk of bottlenecks and performance problems at the hardware level. In other words, it ensures that backups don’t stall virtualised servers.
  • Simplify backup infrastructure. It’s easier to direct backup data to different places, for example to the cloud via Azure or Amazon Web Services or to local tape or an offsite private cloud. All data transfers more quickly because of built-in LAN and WAN optimisation.
  • Improve resilience and availability. Because the deduplication data is stored centrally in the backup server, it is easy to protect the backup infrastructure. For example, all the information in the data store can be replicated to the cloud, another server in the same datacentre or offsite.
  • Meet recovery point and recovery time SLAs. With global source-side deduplication it’s easier to meet RPO and RTO targets. With less data to transfer, backups can be more frequent and restores are faster.

So yes, it’s important to deduplicate – but it’s also really important to get the right kind of technology for your workloads. There’s no one-size-fits-all option, so take time to work out what will be the best fit for you. Global deduplication is definitely the direction many are taking, because the benefits can be amazing.

Christophe Bertrand, VP of Product Marketing, Arcserve

Image source: Shutterstock/Carlos Amarillo