“SimHash: Hash-Based Similarity Detection”, 2007-12-13 (; backlinks):
Most hash functions are used to separate and obscure data, so that similar data hashes to very different keys. We propose to use hash functions for the opposite purpose: to detect similarities between data.
Detecting similar files and classifying documents is a well-studied problem, but typically involves complex heuristics and/or 𝒪(n2) pair-wise comparisons. Using a hash function that hashed similar files to similar values, file similarity could be determined simply by comparing pre-sorted hash key values. The challenge is to find a similarity hash that minimizes false positives.
We have implemented a family of similarity hash functions with this intent (SimHash). We have further enhanced their performance by storing the auxiliary data used to compute our hash keys. This data is used as a second filter after a hash key comparison indicates that two files are potentially similar. We use these tests to explore the notion of “similarity”.
View PDF: