Image Duplication Detection

19 Apr 2013
In my last post, I downloaded thousands of images from Jiandan OOXX. (What a cool site!) While I was enjoying this amazing collection, I found that there were many duplicate images. I don't want my disk space wasted on duplicates, so I needed to figure out a way to deal with them.
Detecting duplicate images is quite different from detecting duplicate ordinary files: two copies of the same image may be stored in different formats, at different dimensions, and with different file sizes. File hash values can't be relied on to detect image duplicates; other image features have to be taken into consideration.
First of all, let me introduce a simple but effective method for detecting image duplicates: the Perceptual Hash Algorithm. The basic idea is to compute a fingerprint for each image and then compare the fingerprints. If two fingerprints are identical or very close, the two images are probably duplicates of each other.
I wrote a Python script to compare two images using this method (read the reference above if you want to understand it).
```python
#!/usr/bin/python
# Python 2: uses the print statement, the reduce builtin,
# and tuple unpacking in lambda arguments.
import sys
from PIL import Image

def avhash(im):
    """Average hash: 64-bit fingerprint of an image."""
    if not isinstance(im, Image.Image):
        im = Image.open(im)
    # Shrink to 8x8 and convert to grayscale.
    im = im.resize((8, 8), Image.ANTIALIAS).convert('L')
    # Mean of the 64 pixel values.
    avg = reduce(lambda x, y: x + y, im.getdata()) / 64.
    # Set bit i when pixel i is at or above the mean.
    return reduce(lambda x, (y, z): x | (z << y),
                  enumerate(map(lambda i: 0 if i < avg else 1, im.getdata())),
                  0)

def hamming(h1, h2):
    """Number of differing bits between two fingerprints."""
    h, d = 0, h1 ^ h2
    while d:
        h += 1
        d &= d - 1  # clear the lowest set bit
    return h

if __name__ == '__main__':
    if len(sys.argv) != 3:
        print "Usage: %s img1 img2" % sys.argv[0]
    else:
        img1 = sys.argv[1]
        img2 = sys.argv[2]
        hash1 = avhash(img1)
        hash2 = avhash(img2)
        dist = hamming(hash1, hash2)
        print "hash(%s) = %d\nhash(%s) = %d\nhamming distance = %d\nsimilarity = %d%%" % (
            img1, hash1, img2, hash2, dist, (64 - dist) * 100 / 64)
```
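The script above is Python 2 only (`print` statements, tuple-unpacking lambdas, the `reduce` builtin, and `Image.ANTIALIAS`, which newer Pillow releases have removed in favor of `Image.LANCZOS`). As a sketch, here is the same hashing math in Python 3, factored so that the 64 grayscale pixel values are passed in directly; `avhash_from_pixels` is a name I made up here, and the Pillow loading step is shown only in the docstring:

```python
def avhash_from_pixels(pixels):
    """Average hash of an 8x8 grayscale thumbnail.

    `pixels` is a sequence of 64 grayscale values, e.g. obtained with
    Pillow as list(im.resize((8, 8)).convert('L').getdata()).
    Sets bit i when pixel i is at or above the mean, matching the
    Python 2 script above.
    """
    avg = sum(pixels) / 64.0
    bits = 0
    for i, p in enumerate(pixels):
        if p >= avg:
            bits |= 1 << i
    return bits

def hamming(h1, h2):
    """Number of differing bits between two 64-bit fingerprints."""
    return bin(h1 ^ h2).count('1')
```

Keeping the math separate from image loading also makes the fingerprinting trivial to unit-test without any image files on disk.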
And I grabbed some pictures from the internet.
They all have different formats and sizes, and their pixels differ slightly from each other, except that 2.png is an identical copy of 2.jpg. The comparison results:
```
$ ./imgcmp.py 1.jpg 2.jpg
hash(1.jpg) = 14933407633201104831
hash(2.jpg) = 14919930918911018815
hamming distance = 7
similarity = 89%
$ ./imgcmp.py 1.jpg 3.jpg
hash(1.jpg) = 14933407633201104831
hash(3.jpg) = 14933407633204774719
hamming distance = 3
similarity = 95%
$ ./imgcmp.py 2.jpg 3.jpg
hash(2.jpg) = 14919930918911018815
hash(3.jpg) = 14933407633204774719
hamming distance = 8
similarity = 87%
$ ./imgcmp.py 2.jpg 2.png
hash(2.jpg) = 14919930918911018815
hash(2.png) = 14919930918911018815
hamming distance = 0
similarity = 100%
```
Well, the results speak for themselves.
Assume you have N images. First, scan them and compute their fingerprints. If you take 100% similarity as the definition of a duplicate (i.e. duplicates have identical fingerprints), you can find all duplicates in O(N) time by grouping the fingerprints in a hash table. (This gist shows a possible solution.) If you instead define duplication by a similarity threshold (such as 90%), finding all duplicate pairs requires O(N * N) time with pairwise comparison (if you can improve this bound, please let me know).
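The two search strategies above can be sketched as follows. The filename/hash pairs fed in are hypothetical inputs, e.g. produced by `avhash()`, and both function names are my own:

```python
from collections import defaultdict
from itertools import combinations

def group_exact_duplicates(fingerprints):
    """O(N): bucket files by fingerprint; buckets of size > 1 are duplicates.

    `fingerprints` is an iterable of (filename, 64-bit hash) pairs.
    """
    buckets = defaultdict(list)
    for name, h in fingerprints:
        buckets[h].append(name)
    return [names for names in buckets.values() if len(names) > 1]

def near_duplicate_pairs(fingerprints, threshold=90):
    """O(N^2): compare every pair; keep those at or above the similarity threshold."""
    pairs = []
    for (n1, h1), (n2, h2) in combinations(list(fingerprints), 2):
        dist = bin(h1 ^ h2).count('1')       # Hamming distance
        similarity = (64 - dist) * 100 // 64  # same formula as the script above
        if similarity >= threshold:
            pairs.append((n1, n2))
    return pairs
```

The exact-match case is fast because a hash table lookup replaces all pairwise comparisons; the threshold case has no such shortcut in this naive form.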
Congratulations if you've read this far! You definitely have the potential to become as much of a ballache as I am!