Where to begin with Deduplication

De-duplication in itself is easy to understand: optimising storage capacity usage by eliminating duplicated data. However, the devil is in understanding the different technologies, techniques and implementations on the market, and relating these to customers' specific needs.

Instead of storing data multiple times, de-duplication enables the data to be stored once and uses that single instance as a reference. The techniques used to do this vary. For instance, we could look for complete files which are the same, and only when these are a complete match with each other is a single instance created.

Alternatively, we could look at files which are broadly similar (for example, revisions of a draft document) and create a single instance of a master file, saving only the byte-level differences between this and subsequent versions. So which of these approaches is best? As always, the answer is not straightforward.

If we look at the first of these, working at file level rather than byte level, there are well-established techniques such as CAS (Content Addressable Storage). With this approach the contents of the file are put through a mathematical mincer, in practice a hash function, and the end product is a unique identifier which is attached to the file.

If exactly the same file exists somewhere else in the system, the mathematical mincer will produce exactly the same identifier, indicating a duplicate file which can be reduced to a single instance.
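As a rough illustration, here is a minimal sketch of that idea in Python, assuming a SHA-256 hash plays the role of the mathematical mincer; real CAS products differ in the algorithms, metadata and storage layouts they use.

```python
import hashlib
from pathlib import Path

# In-memory stand-in for a content-addressable store:
# identifier (hash of the contents) -> file contents, stored once.
store: dict[str, bytes] = {}

def cas_identifier(data: bytes) -> str:
    """The 'mathematical mincer': identical contents always yield the same ID."""
    return hashlib.sha256(data).hexdigest()

def archive_file(path: Path) -> str:
    data = path.read_bytes()
    file_id = cas_identifier(data)
    if file_id not in store:   # new content: keep a single instance
        store[file_id] = data
    return file_id             # duplicates simply reuse the existing identifier

# Two byte-identical files collapse to one stored instance; a one-character
# edit produces a completely different identifier and a second stored copy.
```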

Using this approach, every time a spelling mistake is corrected or punctuation is added to a document, a new identifier is created and both versions of the document are stored.

The result is that where files are constantly changing, the saving in storage capacity that can be achieved with CAS is fairly minimal. So why do we have it at all?

The answer is: for archives. When a file is archived it is normally for long-term storage and is likely only to be referenced rather than changed. After all, changing the archives is like re-writing history.

This aspect of CAS is also a way to ensure that archived records are not tampered with (as might be a temptation in a company facing a significant legal or regulatory challenge), as any change will produce a new identifier and will be seen as a changed file.

This is where the second technique, byte-level deduplication, comes in. At this level the mathematical mincer changes: this time it is looking for differences between files at the byte level.

Going back to our previous example of a document where a spelling or punctuation change has been made, byte-level deduplication would recognise and store only the minor changes that have been made to the original document.
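To make the contrast with CAS concrete, here is a toy delta-encoding sketch built on Python's difflib. It illustrates the principle only; real byte-level deduplication products use their own delta or chunking algorithms and on-disk formats.

```python
from difflib import SequenceMatcher

def delta_encode(master: bytes, revision: bytes) -> list:
    """Describe a revision as operations against the master copy, so only
    the changed bytes need to be kept alongside the single master instance."""
    ops = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, master, revision).get_opcodes():
        if tag == 'equal':
            ops.append(('copy', i1, i2 - i1))        # bytes already held in the master
        else:
            ops.append(('insert', revision[j1:j2]))  # only the new bytes are stored
    return ops

def delta_decode(master: bytes, ops: list) -> bytes:
    out = bytearray()
    for op in ops:
        if op[0] == 'copy':
            _, start, length = op
            out += master[start:start + length]
        else:
            out += op[1]
    return bytes(out)

master = b"The quick brown fox jumps over the lazy dog."
revision = b"The quick brown fox jumped over the lazy dog!"
ops = delta_encode(master, revision)
assert delta_decode(master, ops) == revision  # the revision is fully recoverable
```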

This is an effective approach to minimising the storage capacity consumed, but it does not provide the change tracking that CAS delivers. Where ‘live’ data is being used, this approach is far and away the most effective for an enterprise environment; the challenge is that it consumes much more processing power.

On the face of it, this would go a long way towards saving expensive primary storage capacity. However, the reality is that in most primary storage environments the emphasis is on performance rather than saving disk capacity, and any performance overhead (such as the mathematics needed to detect duplicates) is seen as an inhibitor to speed of delivery.

Additionally, the lifecycle of primary data can be fleeting (minutes or even seconds), so deduplicating it may be an unnecessary process. As a result, today, with a few evolving exceptions, byte-level deduplication is aimed at the backup environment.

Another key question to consider is where in the data centre we implement deduplication. This doesn’t sound too important, but it is a raging argument among the vendors in this part of the industry.

Some approaches implement deduplication for backup with a software ‘agent’ loaded onto each application server that undertakes backup. This spreads the deduplication processing load across all the servers involved, but crucially the agent must interact correctly and effectively with the existing backup software packages loaded onto those servers.

The upside of this implementation of deduplication at source is that the process is completed before any data is sent to the storage devices, minimising data transfers between server and storage.
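A simplified sketch of how that saving arises might look like the following. The BackupTarget class and the fixed-size chunking are assumptions for illustration; commercial agents typically use content-defined chunking and their own wire protocols.

```python
import hashlib

CHUNK_SIZE = 4096  # fixed-size chunks keep the example short; real agents
                   # often use variable-size, content-defined chunking

class BackupTarget:
    """Stand-in for the backup storage device's chunk index."""
    def __init__(self):
        self.chunks: dict[str, bytes] = {}

    def missing(self, fingerprints: list[str]) -> set[str]:
        return {fp for fp in fingerprints if fp not in self.chunks}

    def receive(self, fingerprint: str, chunk: bytes) -> None:
        self.chunks[fingerprint] = chunk

def backup_at_source(data: bytes, target: BackupTarget) -> list[str]:
    """Agent side: fingerprint every chunk, but only transfer the ones the
    target has never seen. Returns the 'recipe' needed to restore the stream."""
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    recipe = [hashlib.sha256(c).hexdigest() for c in chunks]
    to_send = target.missing(recipe)      # one small exchange of hashes
    for fp, chunk in zip(recipe, chunks):
        if fp in to_send:
            target.receive(fp, chunk)     # only new data crosses the wire
    return recipe
```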

The downside is the one encountered by any agent-based strategy: the agent must stay compatible with the server software. This means that any software upgrade or change on any server creates a potential incompatibility and adds to the management task for the server administrators.

The alternative approach is to have a dedicated platform in the backup path which handles deduplication ‘on the fly’. This effectively centralises the process.

The benefits here are that the platform, not the servers, delivers the processing power for deduplication, and because it requires no changes to the server software it is effectively transparent to the user. Some storage vendors are taking up the idea of embedding these functions in their storage devices, though no such products appear to exist yet.

In many ways this endorses the in-line platform as the most elegant solution, because such vendors would essentially be keeping the dedicated in-line platform and simply locating it inside the storage device.
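By way of contrast with the agent-based sketch above, the following toy model shows the same chunk-and-index idea running entirely on a dedicated in-line platform; the class and method names are invented for illustration, and the servers simply send their full backup stream as they always have.

```python
import hashlib

class InlineDedupAppliance:
    """Toy model of a dedicated in-line platform: servers send their full,
    unmodified backup stream, and duplicate elimination happens here."""

    CHUNK_SIZE = 4096  # fixed-size chunking keeps the example short

    def __init__(self):
        self.index: dict[str, bytes] = {}        # fingerprint -> chunk kept on disk
        self.catalog: dict[str, list[str]] = {}  # backup name -> ordered fingerprints

    def ingest(self, backup_name: str, stream: bytes) -> None:
        recipe = []
        for i in range(0, len(stream), self.CHUNK_SIZE):
            chunk = stream[i:i + self.CHUNK_SIZE]
            fp = hashlib.sha256(chunk).hexdigest()
            if fp not in self.index:   # only previously unseen chunks are written
                self.index[fp] = chunk
            recipe.append(fp)
        self.catalog[backup_name] = recipe

    def restore(self, backup_name: str) -> bytes:
        return b"".join(self.index[fp] for fp in self.catalog[backup_name])
```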

Whichever approach eventually becomes the dominant implementation, as the data deluge continues to accelerate, deduplication will rapidly become a core element of any data centre’s storage strategy.

It is not only the storage capacity savings that are attractive: the support deduplication can offer for compliance (a single instance of a file is easier to manage, protect and delete as required) will also continue to drive this market.