Image Duplication Detection
19 Apr 2013
In my last post, I downloaded thousands of images from Jiandan OOXX. (What a cool site!) While enjoying this amazing collection, I found that it contains many duplicate images. I don't want my disk space wasted on duplicates, so I need to figure out a way to deal with them.
Detecting duplicate images is totally different from detecting duplicate ordinary files, because two copies of the same image may be in different formats, have different dimensions, and have different file sizes. File hash values can't be relied on to detect duplicate images, since re-encoding or resizing an image changes its bytes completely; features of the image content itself have to be taken into consideration.
First of all, I will introduce a simple but effective method for detecting duplicate images: the Perceptual Hash Algorithm. The basic idea is to compute a fingerprint for each image and then compare the fingerprints. If two fingerprints are the same or very close, the two images are probably duplicates of each other.
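To make the fingerprint idea concrete, here is a minimal sketch of one common perceptual-hash variant, the average hash. It is an illustration only, not necessarily the exact method used by the script mentioned in this post. It assumes the image has already been shrunk to an 8x8 grid of grayscale values (in practice an imaging library would do that step), and the names `average_hash`, `hamming_distance` and `similarity` are made up for this example.

```python
def average_hash(gray, hash_size=8):
    """Compute an average-hash fingerprint as an integer.

    `gray` is assumed to be a hash_size x hash_size grid (list of rows)
    of grayscale values, i.e. the image after shrinking and desaturating.
    """
    pixels = [p for row in gray for p in row]
    if len(pixels) != hash_size * hash_size:
        raise ValueError("expected a %dx%d grid" % (hash_size, hash_size))
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        # One bit per pixel: 1 if brighter than the mean, else 0.
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def similarity(a, b, nbits=64):
    """Fraction of matching bits, between 0.0 and 1.0."""
    return 1.0 - hamming_distance(a, b) / nbits
```

Because each bit only records whether a pixel is brighter than the image's own mean, small changes in brightness or compression tend to leave the fingerprint unchanged, while a genuinely different picture flips many bits.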
I wrote a Python script to compare two images using this method (read the reference above if you want to understand it).
Then I grabbed some pictures from the internet. They all have different formats and sizes, and their pixels differ slightly from each other, except that 2.png is an identical copy of 2.jpg. The comparison results:
Well, the results explain themselves.
Assume you have N images. First, scan them and compute their fingerprints. If you count only 100% similarity as duplication (that is, duplicates have identical fingerprints), you can find all duplicates in O(N) time. (This gist shows a possible solution.) If you define duplication by a similarity threshold (such as 90%), finding all duplicates takes O(N * N) time (if you can improve this time bound, please let me know).
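The two cases can be sketched roughly as follows. This is not the linked gist, just a minimal illustration under some assumptions: fingerprints are 64-bit integers stored in a dict keyed by file name, and the function names and the file names in the usage example are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def exact_duplicates(fingerprints):
    """Group file names by identical fingerprint.

    One pass over the dict with hash-table lookups, so expected O(N).
    """
    groups = defaultdict(list)
    for name, fp in fingerprints.items():
        groups[fp].append(name)
    return [names for names in groups.values() if len(names) > 1]


def near_duplicates(fingerprints, threshold=0.9, nbits=64):
    """Return pairs whose fingerprints match in at least `threshold` of bits.

    Compares every pair, so this is the O(N * N) case.
    """
    pairs = []
    for (n1, f1), (n2, f2) in combinations(fingerprints.items(), 2):
        distance = bin(f1 ^ f2).count("1")  # Hamming distance
        if 1.0 - distance / nbits >= threshold:
            pairs.append((n1, n2))
    return pairs


# Usage with made-up fingerprints:
fps = {
    "a.jpg": 0b1111,
    "a.png": 0b1111,              # identical fingerprint to a.jpg
    "b.jpg": 0b1110,              # differs from a.jpg in one bit
    "c.jpg": 0xFFFF000000000000,  # very different
}
print(exact_duplicates(fps))
print(near_duplicates(fps, threshold=0.9))
```

The exact case is fast because identical fingerprints land in the same dictionary bucket; with a threshold, two fingerprints can be close without being equal, so this simple version falls back to checking all pairs.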
Congratulations if you've read this far! You definitely have the potential to become as ballache as me!