Straight Talk: Sizing a disk backup system

Missed the previous parts of our Straight Talk series? Check out the introduction to data backup and deduplication, backup to tape, disk staging, and enter data deduplication.

"Data deduplication only stores unique data to reduce the amount of total disk storage required. However, depending on how you implement data deduplication, the backup and restore performance can be greatly impacted." Bill Andrews, president and CEO, ExaGrid Systems

In the last section, we considered how organisations can use disk for backup for the cost of tape. We also explored the differences between scale-up and scale-out architectures and the different approaches to deduplication.

This section evaluates the aspects of your environment that affect the size of your backup system. It is critical both to size the system correctly and to choose the right architecture, so you can avoid costly forklift upgrades.

Just as many factors must be considered when evaluating the architectural implications of different disk backup with deduplication products, many aspects of your environment are part of the equation when sizing the system correctly.

In primary storage you can simply say, "I have 8TB to store and so I will buy 10TB." In disk-based backup with deduplication, a sizing exercise must be conducted based on a number of factors so that you avoid buying an undersized system whose capacity is quickly exceeded.

Data types

As discussed in the third chapter, the data types you have directly impact the deduplication ratio and therefore the system size you need. If your mix of data types is conducive to deduplication and has high deduplication ratios (e.g. 50:1), then the deduplicated data will occupy less storage space and you'll need a smaller system. If you have a mix of data that does not deduplicate well (e.g. 10:1 or less data reduction), then you will need a much larger system.

What matters is what deduplication ratio is achieved in a real-world environment with a real mix of data types.

Deduplication method

Deduplication method has a significant impact on deduplication ratio. Not all deduplication approaches are created equal.

  • Zone-level deduplication with byte comparison, or 8KB block-level with variable-length content splitting, will achieve the best deduplication ratios. The average is a 20:1 deduplication ratio with a general mix of data types.
  • 64KB and 128KB fixed block will produce the lowest deduplication ratio, as the blocks are too big to find many repetitive matches. The average is a 7:1 deduplication ratio.
  • 4KB fixed block will get close to the best ratios but often suffers a performance hit. A 13:1 deduplication ratio is the average with a general mix of data types.
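To see how the method's ratio drives capacity, here is a minimal sketch. The function name and the 40TB figure are illustrative (10TB full retained for four weeks); the ratios are the averages quoted above, and real-world ratios vary with the data mix:

```python
def required_capacity_tb(logical_backup_tb, dedup_ratio):
    """Raw disk needed to hold the retained (logical) backup data."""
    return logical_backup_tb / dedup_ratio

# Example: 40TB of retained backups (10TB full x 4 weeks of retention)
for method, ratio in [("zone-level / 8KB variable block", 20),
                      ("4KB fixed block", 13),
                      ("64KB-128KB fixed block", 7)]:
    print(f"{method}: {required_capacity_tb(40, ratio):.1f}TB")
```

The same retained data needs roughly 2TB at 20:1 but nearly three times as much at 7:1, which is why the deduplication method matters so much to the sizing exercise.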


Retention

The number of weeks of retention you keep impacts the deduplication ratio as well. The longer the retention, the more repetitive data the deduplication system sees, so the deduplication ratio increases as retention increases. Most vendors will say that they get a deduplication ratio of 20:1, but when you do the maths, that typically assumes a retention period of about 16 weeks. If you keep only two weeks of retention, you may only get about a 4:1 reduction.

Example: If you have 10TB of data and you keep four weeks of retention, then without deduplication you would store about 40TB of data. With deduplication, assuming a two per cent weekly change rate, you would store about 5.6TB of data, so the deduplication ratio is about 7.1:1 (40TB ÷ 5.6TB = 7.1:1).
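One way to reproduce the arithmetic in this example is to assume the first full is stored at roughly 2:1 compression and that each subsequent week adds only its 2 per cent of changed data. That compression factor is an assumption on our part, not stated in the text, but it makes the numbers line up:

```python
def stored_tb(full_tb, retention_weeks, weekly_change=0.02, first_full_factor=0.5):
    # first_full_factor = 0.5 is an assumption: the initial full is kept at
    # roughly 2:1 compression; each later week adds only its changed data.
    first_full = full_tb * first_full_factor
    weekly_changes = (retention_weeks - 1) * weekly_change * full_tb
    return first_full + weekly_changes

stored = stored_tb(10, 4)       # 5.0 + 3 x 0.2 = 5.6TB actually stored
ratio = (10 * 4) / stored       # 40TB logical / 5.6TB stored, about 7.1:1
```

Run with 16 weeks of retention instead of four and the same model yields a much higher ratio, which is the point the retention discussion above is making.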


Backup rotation

Your backup rotation will also impact the size of the disk-based backup with deduplication system you need. If you are doing rolling full backups each night, then you need a larger system than if you are doing incremental backups of files during the week and then a weekend full backup.

Rotation schemes are usually:

Database and email

  • Full backup each night - Monday, Tuesday, Wednesday, Thursday - and at the weekend

File data

  • Incrementals forever or optimised synthetics - copies only changed files each night, no weekend full
  • Incrementals - copies changed files each night, full backup of all files on the weekend
  • Differentials - copies files each night that have changed since the last full backup; full backup of all files on the weekend
  • Rolling fulls - breaks total full backup into a subset and backs up a portion of the full backup each night (e.g. if the full backup is 30TB, then back up 10TB each night and keep rotating on a three-day schedule)
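The schemes above can be compared with a rough model of how much data each one sends to the backup system per week. The 30TB full and 2 per cent daily change rate are illustrative assumptions, and real incrementals and differentials vary night to night:

```python
def weekly_ingest_tb(full_tb, scheme, daily_change=0.02):
    """Approximate data sent to the backup system per week (illustrative only)."""
    nightly_change = full_tb * daily_change
    if scheme == "incrementals_forever":
        return 7 * nightly_change                 # changed files only, no weekend full
    if scheme == "incrementals":
        return 6 * nightly_change + full_tb       # nightly changes plus weekend full
    if scheme == "differentials":
        # each night re-sends everything changed since the last full,
        # so the nightly job grows through the week
        return sum(n * nightly_change for n in range(1, 7)) + full_tb
    if scheme == "rolling_fulls":
        return 7 * (full_tb / 3)                  # a third of the full each night
    raise ValueError(scheme)

for s in ("incrementals_forever", "incrementals", "differentials", "rolling_fulls"):
    print(s, round(weekly_ingest_tb(30, s), 1), "TB/week")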

Because the backup rotation scheme you use changes how much data is being sent to the disk-based backup with deduplication system, this also impacts the system size you require.

Cross protection

Sizing scenario A: You are backing up data at site A and replicating to site B for disaster recovery. For example, if site A is 10TB and site B is just for DR, then a system that can handle 10TB at site A and 10TB at site B is required.

Sizing scenario B: However, if backup data is kept at both site A (e.g. 10TB) and at site B (e.g. 6TB) and the data from site A is being replicated to site B while the data from site B is being cross-replicated to site A, then a larger system on both sides is required.
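The two scenarios reduce to a simple rule: each site must hold its own backup data plus any replicas it receives. A minimal sketch using the figures above:

```python
def site_capacity_tb(local_backup_tb, replicated_in_tb):
    """Each site must hold its own backups plus the replicas it receives."""
    return local_backup_tb + replicated_in_tb

# Scenario A: one-way replication, site B is for DR only
assert site_capacity_tb(10, 0) == 10   # site A: its own 10TB
assert site_capacity_tb(0, 10) == 10   # site B: 10TB replicated from A

# Scenario B: cross-replication between sites
assert site_capacity_tb(10, 6) == 16   # site A: its 10TB plus 6TB from B
assert site_capacity_tb(6, 10) == 16   # site B: its 6TB plus 10TB from A
```

Cross-replication thus raises the required capacity at both sites, not just one.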

Bottom line for sizing a system

In summary, dozens of possible scenarios impact the sizing of a system, including:

  • How much data is in your full backup? What percentage of the data is compressed (including media files), encrypted, database, unstructured?
  • What is the required retention period in weeks/months onsite?
  • What is the required retention period in weeks/months offsite?
  • What is the nightly backup rotation?
  • Is data being replicated one way only or backed up from multiple sites and cross-replicated?
  • Other considerations unique to your environment

When working with a vendor, ensure they have a sizing calculator and that they calculate the exact size of the system you need based on all of the above.
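As a rough illustration of what such a calculator combines, here is a toy version built from the factors in the checklist above. The formula and the 25 per cent headroom are our own assumptions; a real vendor calculator models data types, rotation scheme and growth in far more detail:

```python
def size_system_tb(full_tb, dedup_ratio, retention_weeks,
                   cross_replica_tb=0.0, headroom=1.25):
    """Toy sizing sketch combining the checklist factors (hypothetical formula)."""
    logical = full_tb * retention_weeks          # retained logical data (weekly fulls)
    stored = logical / dedup_ratio               # after deduplication
    return (stored + cross_replica_tb) * headroom

# The earlier 10TB / 4-week example at its roughly 7.1:1 ratio:
print(round(size_system_tb(10, 7.1, 4), 1), "TB")
```

Even this toy version shows how retention, deduplication ratio and cross-replication compound, which is why a back-of-the-envelope "buy 10TB for 8TB of data" approach fails for backup.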

The mistake often made is that a system is acquired and fills up within a few short months because it was undersized: the retention was longer than planned, the rotation scheme put more data into the system than expected, the deduplication method achieved a low deduplication ratio, or the data types simply did not deduplicate well.

The truly knowledgeable vendors understand that disk-based backup with deduplication is not simply primary storage; therefore, they have the proper tools to help you size the system correctly.

This guide explains the various backup complexities, enabling you to ask the right questions and make the right decision for your specific environment and requirements. Stay tuned for the next part of this guide, which will be live on ITProPortal shortly.