
How Data Dedupe Works

Data dedupe (deduplication) is a process that reduces the amount of data that is stored by eliminating redundant data, whether the redundancies exist across separate files or within a single file. It can do this because data dedupe works at the full-file level as well as within a file at the block and bit level.

For example, if file A has a set of data, 12345, and other files, B, C and D, have the exact same set of data, the 12345 data set in files B, C and D is eliminated and each iteration is replaced with a reference link to the original iteration of the data in file A. Or, if file A itself has several iterations of the same data set, say 12345 repeated four times, the additional three iterations are replaced with links to the original iteration. The links take up far less space than duplicates of the data set.
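As a rough illustration, here is a minimal Python sketch of that idea; the file names and the 12345 data set are just the hypothetical examples from above, and a real dedupe system works on raw storage blocks rather than Python strings.

store = set()       # unique data sets that are physically kept
file_table = {}     # per-file lists of references to stored data sets

def save(name, data_sets):
    for data in data_sets:
        store.add(data)                  # each unique data set is stored only once
    file_table[name] = list(data_sets)   # the file keeps references, not copies

for name in ("A", "B", "C", "D"):
    save(name, ["12345"])

print(len(store))         # 1 -- one physical copy serves all four files
print(file_table["B"])    # ['12345'] -- a reference to that single copy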

Additionally, if file A is changed and that data set becomes 12345678, data dedupe stores just the change, the added block or bits 678, rather than storing the entire file again, thus further saving storage space.

Each file, block or bit of data is processed in data dedupe using a hashing algorithm that creates a specific, unique number for each fragment of the file, block or bit. The resulting hash numbers are collected in an index. If a hash number already exists in the index and, during the dedupe process, the system recognizes another iteration of that hash number, the new iteration does not need to be stored; it is simply linked to the existing entry. If a new number does not match any existing hash number, the new number is added to the index.
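A rough sketch of that hash-and-index step, in Python with made-up block contents, might look like the following; SHA-256 stands in here for whatever hash function a given dedupe product actually uses.

import hashlib

index = {}        # hash number -> the unique block actually stored
file_refs = {}    # file name -> list of hash numbers (the reference links)

def dedupe_store(name, blocks):
    refs = []
    for block in blocks:
        h = hashlib.sha256(block).hexdigest()   # the "hash number" for this fragment
        if h not in index:
            index[h] = block                    # new number: index it and store the block
        refs.append(h)                          # known number: just link to the existing entry
    file_refs[name] = refs

# Files A through D all contain the same 12345 data set.
for name in "ABCD":
    dedupe_store(name, [b"12345"])
print(len(index))   # 1 -- the block is physically stored once

# File A changes to 12345678: only the new 678 fragment is stored.
dedupe_store("A", [b"12345", b"678"])
print(len(index))   # 2 -- the original block plus the new 678 fragment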

A full backup may generate an index of millions of unique hash numbers, so the economies of scale are huge, given how many opportunities there are to avoid storing redundant data.

It is possible, but rare, that two separate sets of data are assigned the same hash number. This causes a condition called a "hash collision," which means that some data may be lost: the system sees that a hash number is already assigned to data set 12345, encounters data set 98670, which erroneously has the same hash number, and therefore disregards data set 98670, so it is lost.
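The failure mode is easier to see with a deliberately weak, hypothetical hash. The sketch below truncates SHA-256 to two hex characters (only 256 possible values), so a collision turns up after only a handful of blocks; real dedupe hashes are vastly larger, which is why collisions are so rare.

import hashlib
from itertools import count

def weak_hash(block):
    # Hypothetical, deliberately tiny hash used only to force a collision quickly.
    return hashlib.sha256(block).hexdigest()[-2:]

index = {}
for i in count():
    block = str(i).encode()
    h = weak_hash(block)
    if h in index and index[h] != block:
        # A dedupe system that trusts the hash alone would treat this new,
        # different block as a duplicate and discard it; the data would be lost.
        print("collision:", index[h], "and", block, "both hash to", h)
        break
    index[h] = block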

Considering the success rate of data dedupe, hash collisions are so rare that the process is more efficient, and for practical purposes just as reliable, as a system of complete data duplication. If files continued to be stored with complete duplication, as was once the industry standard, the data storage capacity of many companies would be overwhelmed, a far worse condition than the rare, incidental loss of a data set that very likely could be recovered from a previous backup, losing only any data added since that last backup.
