Follow ITProPortal:

RSS Tweet Digg

Where to begin with Deduplication

De-duplication in itself is easy to understand: it optimises storage capacity usage by eliminating duplicated data. However, the devil is in understanding the different technologies, techniques and implementations in the market and relating these to customers' specific needs.

Instead of storing data multiple times, de-duplication enables the data to be stored once and uses that single instance as a reference. The techniques used to do this vary. For instance, we could look for complete files which are the same, and create a single instance only when they match each other completely.

Alternatively, we could look at files which are basically similar (for example, revisions of a draft document) and create a single instance of a master file, saving only the byte-level differences between this and subsequent files. So which of these approaches is best? As always, the answer is not straightforward.
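To make the byte-level idea concrete, here is a minimal sketch in Python. It assumes the master file and a revision both fit in memory, and it uses the standard library's `difflib.SequenceMatcher` purely for illustration; the function names `make_delta` and `apply_delta` are hypothetical, and real products use their own, far more efficient delta algorithms.

```python
from difflib import SequenceMatcher


def make_delta(master: bytes, revision: bytes) -> list:
    """Describe `revision` as a list of operations against `master`.

    Illustrative only: SequenceMatcher is a stand-in for a real delta encoder.
    """
    ops = []
    matcher = SequenceMatcher(None, master, revision, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # These bytes already exist in the master: store a reference only.
            ops.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            # Only bytes that are new in the revision are stored.
            ops.append(("data", revision[j1:j2]))
        # "delete": nothing is emitted, the master bytes are simply skipped.
    return ops


def apply_delta(master: bytes, ops: list) -> bytes:
    """Rebuild the revision from the master file plus the stored delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += master[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)
```

Calling `apply_delta(master, make_delta(master, revision))` should return the revision unchanged, while the delta itself holds only the bytes that differ from the master.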

If we look at the first of these, working at a file level rather than a byte level, there are well-established techniques such as CAS (Content Addressable Storage). With this approach the contents of the file are put through a mathematical mincer (in practice, a hash function) and the end product is a unique identifier which is attached to the file.

If exactly the same file exists somewhere else in the system, the mathematical mincer will produce exactly the same identifier – indicating a duplicate file which can be made into a single instance. 
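As a rough sketch of how this could work, the Python below uses SHA-256 from the standard `hashlib` module as the "mincer". The choice of SHA-256 and the `ContentStore` class are assumptions made for illustration; they do not describe any particular CAS product.

```python
import hashlib


class ContentStore:
    """Toy in-memory content-addressable store (hypothetical, illustration only)."""

    def __init__(self):
        self.blobs = {}  # identifier -> file contents, held once
        self.names = {}  # file name -> identifier

    def put(self, name: str, data: bytes) -> str:
        # Assumption: SHA-256 stands in for the "mathematical mincer".
        identifier = hashlib.sha256(data).hexdigest()
        # If the identifier is already known, the existing copy is reused.
        self.blobs.setdefault(identifier, data)
        self.names[name] = identifier
        return identifier

    def get(self, name: str) -> bytes:
        return self.blobs[self.names[name]]
```

Storing the same bytes under two different file names returns the same identifier both times, and `blobs` ends up holding a single copy.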

Using this approach, every time a spelling mistake is corrected or punctuation is added to a document, a new identifier is created and both versions of the document are stored in full.
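The short Python sketch below illustrates this limitation. It again assumes SHA-256 as the identifier function, and the two sample documents are invented for the example.

```python
import hashlib


def identifier(data: bytes) -> str:
    # Assumption: SHA-256 plays the role of the identifier function.
    return hashlib.sha256(data).hexdigest()


# Two hypothetical revisions that differ by a single corrected typo.
draft = b"Deduplication saves storage capcity."
fixed = b"Deduplication saves storage capacity."

# File-level de-duplication keys each document by its identifier.
store = {identifier(doc): doc for doc in (draft, fixed)}

# Total bytes held by the store across all distinct identifiers.
stored_bytes = sum(len(doc) for doc in store.values())
```

Because the two identifiers differ, `store` holds two entries and `stored_bytes` is the combined size of both revisions, even though they differ by only one character.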




Owned &
operated by:

Net Communities