What HASH do you prefer?

De-duplication of files is a common function of ECM systems, but how does it work?

You can have two files that have exactly the same content but different file names, yet systems are able to determine that these are duplicates and act appropriately. In many cases we don’t want the same content stored twice, as it wastes storage and makes storage management harder. In the email world we can even use the compound model, which splits the email from its file attachment so that de-duplication can happen at both levels: on the email and on the file.

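The technique used to make these comparisons relies on cryptographic hash algorithms, or ‘hashing’. A hash algorithm reduces a file’s content to a short, fixed-length value, so two files with identical content produce the same hash regardless of their names. There are two main families of hash algorithm: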

1. MD5 – has been available for many years and hence is widespread in the industry today. It is frequently used for checking data integrity, which is similar to our de-duplication discussion. Its weakness in today’s world is that it is not as secure as the more recent standards: it produces a 128-bit hash, and flaws discovered in the algorithm make it practical to construct two different inputs that share the same hash.

2. SHA – SHA1 was designed by the National Security Agency and, with a 160-bit hash, was more secure than MD5. It was subsequently followed by SHA2 and, more recently, SHA3.
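To make the bit sizes above concrete, here is a minimal sketch using Python's standard `hashlib` module. The helper name `digest_bits` and the sample content are made up for illustration; SHA-256 is used as one representative member of the SHA2 family.

```python
import hashlib


def digest_bits(algorithm: str, data: bytes) -> int:
    """Return the length, in bits, of the hash produced by the named algorithm."""
    # digest_size is reported in bytes, so multiply by 8 to get bits.
    return hashlib.new(algorithm, data).digest_size * 8


# Hypothetical file content, used only for illustration.
content = b"quarterly report"

# Hash length per algorithm: MD5, SHA1, and SHA-256 (a SHA2 variant).
sizes = {name: digest_bits(name, content) for name in ("md5", "sha1", "sha256")}
print(sizes)
```

The hash length depends only on the algorithm, not on how large the input is, which is what makes hashes convenient to store and compare.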

The general guideline when it comes to hash algorithms is to use SHA2 rather than MD5 or SHA1, since no practical attacks of the kind that affect those older algorithms are known against it. This certainly applies to security-focused use cases such as saving a password, but the reality is that many systems focused on de-duplication still use the original MD5 hash algorithm.
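As a rough illustration of hash-based de-duplication, the sketch below groups files by the hash of their content. It is not how any particular ECM product implements the feature; the function names, file names, and contents are hypothetical, and the algorithm is left as a parameter so MD5 can be swapped for a SHA2 variant such as `"sha256"`.

```python
import hashlib
from collections import defaultdict


def content_digest(data: bytes, algorithm: str = "md5") -> str:
    """Hash the file content only; the file name plays no part."""
    return hashlib.new(algorithm, data).hexdigest()


def find_duplicates(files: dict[str, bytes], algorithm: str = "md5") -> list[list[str]]:
    """Group file names by content hash and return groups with more than one name."""
    groups = defaultdict(list)
    for name, data in files.items():
        groups[content_digest(data, algorithm)].append(name)
    return [names for names in groups.values() if len(names) > 1]


# Hypothetical in-memory "files": name -> content.
files = {
    "report.docx": b"Q3 results",
    "report-copy.docx": b"Q3 results",
    "notes.txt": b"meeting notes",
}

duplicates = find_duplicates(files)
print(duplicates)
```

Because only the content is hashed, the two report files land in the same group despite their different names. A real system would typically read files from storage in chunks rather than holding them in memory, and could apply the same idea separately to emails and their attachments.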
