Introducing Deduplication to the Enterprise One Step at a Time

Any change to an enterprise backup environment should be made with caution. You are, after all, working to protect data, not to put it at risk.

Even the relatively low-risk task of introducing data deduplication should be done in phases to ensure that data is continuously protected in a safe, secure manner and that deduplication results are predictable and repeatable.

Start by reviewing the various types of data in your overall backup environment. Some data types, such as Microsoft Exchange and VMware snapshot (vmdk) files, tend to contain a high proportion of duplicate data. Backup streams with other data types, such as check imaging or medical imaging files, may not contain much duplicate data at all.

Of course, the more duplicate data in the backup stream, the more beneficial your deduplication technology will be. With this review in mind, you should set specific goals for the capacity reduction you expect to see from each data type.
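
One way to make these goals concrete is to express each one as a capacity-reduction percentage and compare measured results against it. The Python sketch below is illustrative only: the data-type names and target percentages are hypothetical placeholders, not vendor figures, and the byte counts are assumed to come from your own backup reports.

```python
# Hypothetical per-data-type goals, expressed as the fraction of capacity saved.
# These numbers are placeholders for illustration, not measured values.
HYPOTHETICAL_GOALS = {
    "exchange": 0.90,
    "vmdk": 0.85,
    "medical_imaging": 0.10,
}


def dedup_ratio(logical_bytes: int, stored_bytes: int) -> float:
    """Bytes sent to the backup target divided by bytes actually stored."""
    if stored_bytes <= 0:
        raise ValueError("stored_bytes must be positive")
    return logical_bytes / stored_bytes


def capacity_reduction(logical_bytes: int, stored_bytes: int) -> float:
    """Fraction of capacity saved, e.g. 0.9 means 90% less space used."""
    if logical_bytes <= 0:
        raise ValueError("logical_bytes must be positive")
    return 1.0 - stored_bytes / logical_bytes


def meets_goal(data_type: str, logical_bytes: int, stored_bytes: int,
               goals: dict = HYPOTHETICAL_GOALS) -> bool:
    """True if the measured reduction reaches the goal set for this data type."""
    return capacity_reduction(logical_bytes, stored_bytes) >= goals[data_type]
```

A deduplication ratio of 10:1 and a capacity reduction of 90 percent describe the same result; tracking one of them consistently per data type is what matters.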

To ensure that you get optimal deduplication efficiency without jeopardizing backup windows or restore times, pick a data type that is likely to yield the best deduplication results. At the same time, identify any data types that you may not want to deduplicate.

For example, you may not want to deduplicate data backed up from testing environments with a very short retention time (less than a week), or data that is subject to certain regulatory requirements.

Back up the selected data type to your deduplication technology and continue to back up all other data to your legacy system. In this way, you can tune your deduplication for optimal results and test it to verify performance and efficiency.
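
The routing rule described above can be sketched as a small policy function. This is a minimal illustration, not a feature of any particular backup product: the `BackupJob` fields, the target names "dedup" and "legacy", and the seven-day retention threshold are all assumptions chosen to mirror the examples in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupJob:
    """Hypothetical description of one backup job."""
    name: str
    data_type: str
    retention_days: int
    regulated: bool = False


def choose_target(job: BackupJob, pilot_types: set,
                  min_retention_days: int = 7) -> str:
    """Return "dedup" for jobs in the current pilot, otherwise "legacy"."""
    # Data you chose not to deduplicate stays on the legacy system.
    if job.regulated or job.retention_days < min_retention_days:
        return "legacy"
    # Only the data types already introduced go to the new system.
    if job.data_type in pilot_types:
        return "dedup"
    return "legacy"


# Phase one: only Exchange data is sent to the deduplication system.
pilot = {"exchange"}
jobs = [
    BackupJob("mail-nightly", "exchange", retention_days=30),
    BackupJob("test-lab", "exchange", retention_days=3),
    BackupJob("vm-weekly", "vmdk", retention_days=90),
]
plan = {job.name: choose_target(job, pilot) for job in jobs}
```

Adding the next data type to `pilot` is then the only change needed to start the next phase.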

Depending on the type of data you select, your deduplication technology may allow you to fine-tune the methodology it uses to identify and remove duplicate data. This fine-tuning process can have a significant impact on the amount of capacity reduction you will ultimately achieve.

Test carefully to ensure that your key metrics are being achieved and that data integrity has been maintained throughout the process. Also be sure to identify any sources of human error in the process.
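
One simple way to check data integrity is to restore a sample of files and compare a checksum of each restored file with the original. The sketch below assumes both copies are reachable as ordinary file paths; the function names are hypothetical, and a real environment may instead rely on verification features built into the backup software.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches(original_path: str, restored_path: str) -> bool:
    """True if the restored file is byte-for-byte identical to the original."""
    return sha256_of(original_path) == sha256_of(restored_path)
```

Running this check on a scheduled sample of restores, rather than once, also helps surface procedural mistakes such as restoring the wrong version of a file.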

Once the first data type is being backed up and deduplicated at levels of efficiency that meet your expectations, introduce the next data type. Repeat the cycle of measuring performance and deduplication efficiency, then tuning, until all of the data you want to deduplicate is on your new system.

This process may take slightly longer than simply flipping a switch, but a well-planned, phased approach to introducing deduplication will help ensure you get optimal results and the highest return on your investment.

This article was written by Miklos Sandorfi, Chief Technology Officer, SEPATON, Inc. Miki can be reached at msandorfi@sepaton.com.