How To: detect duplicate images
Any collection of digital media is likely to contain duplicates, and the time wasted tagging the same file twice (along with the risks of fragmented record-keeping) makes removing them a high priority for any digital asset manager.
Duplicate files not only occupy storage; they also cause fragmentation in digital asset management systems and on file servers. This fragmentation breaks down record-keeping and leads to metadata for the same content being entered repeatedly. Since metadata entry is an expensive operation, and consistent record-keeping is one of the core benefits of digital asset management software, duplicate files must be avoided.
Desktop tools for finding duplicate images are usually inappropriate, because assets arrive from a dispersed user base through a process of uploads, approvals and metadata capture.
There are several methods for detecting duplicate images which can be used to improve the ingest of new digital assets into centralised media storage systems, including:
- Filename matching
- Hash checking
- Visual similarity tests
- Metadata fingerprinting
Filename Matching

The most basic approach, filename matching, involves looking for two files with the same filename and flagging them as duplicates.
This method is poor, as many digital files have overlapping filenames (e.g. DSC0001.JPG is the first filename a digital camera produces from the factory, and your staff may have several such cameras). Additionally, many kinds of duplication arise when several versions of the same file are saved under different filenames (for example, DSC0001.JPG might later be saved as Example.jpg).
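For illustration, a filename check like the one above can be sketched in a few lines of Python (the function name and the case-insensitive comparison are choices made here, not a prescribed method):

```python
from collections import defaultdict
from pathlib import Path

def find_filename_duplicates(root):
    """Group files under `root` by case-insensitive filename."""
    groups = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            groups[path.name.lower()].append(path)
    # Keep only names that occur more than once
    return {name: paths for name, paths in groups.items() if len(paths) > 1}
```

Note that this flags files purely by name: two unrelated photos both called DSC0001.JPG would be reported as duplicates, which is exactly the weakness described above.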
Hash Checking

If two identical files are saved under different filenames, they can still be detected as duplicates by comparing a distinctive file signature, or hash. This is easily obtained with MD5 (a checksum tool) or equivalents such as SHA-1 or CRC.
A hash is a "message digest" of a large file: a very short string which is easy to compare against the hash of another file. An MD5 hash looks like this:

e8735480f03aeecfa21aec49e8f95d0a
Although a hash is far smaller than the file itself, it is highly unlikely that two non-identical files will share the same hash, so it provides a very easy test for data duplication. If two files have the same hash, they are almost certainly duplicates.
If you are using an Apple Mac, you can try this from the Terminal (open Applications > Utilities > Terminal). In this test, there are two files on the Desktop with the filenames DSC0001.JPG and Example.JPG.
```
macpro:Desktop user$ md5 DSC0001.JPG
MD5 (DSC0001.JPG) = e8735480f03aeecfa21aec49e8f95d0a
macpro:Desktop user$ md5 Example.JPG
MD5 (Example.JPG) = e8735480f03aeecfa21aec49e8f95d0a
```
As these two files produce the same hash, they are identical, even though their filenames differ.
Tip: a hash check is better than a file size check, as two files can easily share the same size without being duplicates.
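The same comparison can be automated across a whole folder with Python's standard hashlib module. This is a sketch rather than production code; the chunked reading simply keeps memory use low on large media files:

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def file_hash(path, algorithm="md5", chunk_size=65536):
    """Hash a file in chunks so large media files do not exhaust memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_hash_duplicates(root):
    """Group files under `root` whose contents hash identically."""
    groups = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            groups[file_hash(path)].append(path)
    return {h: paths for h, paths in groups.items() if len(paths) > 1}
```

Swapping `algorithm` to "sha1" or "sha256" works unchanged, since hashlib exposes all of them behind the same interface.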
Unfortunately, there are situations where even a hash check is insufficient. Each of the following changes alters the file's hash, so the modified copy will no longer be matched against the original:
- Adding metadata to the file
- Changing the size of the image (even if only very slightly)
- Saving the file in a different format (eg. TIFF to JPEG)
- Cropping or scaling the image
- Making adjustments in Photoshop, however slight
- Re-saving the image as JPEG (compounding compression losses)
Because of these limitations, hash checking has limited practical application.
Visual Similarity Tests
Basic duplicate detection systems fail to catch obvious duplicates because they only look at filenames or simple checksums. Another approach is to look for patterns in the images themselves, inferring when two images are identical or merely similar. This approach is more sophisticated and requires more image processing.
Visual duplicate detection is able to match many kinds of duplication, including the most difficult cases such as:
- When the filenames or file sizes are different
- When the images are the same but the metadata is different
- When the images are different sizes
- When the images are stored in different formats (for example, RAW, PDF, TIFF...)
- When the images are similar, but not identical.
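As one concrete illustration of how a visual test can tolerate such differences, the sketch below implements a simple "difference hash" (dHash), a well-known perceptual hashing technique. It assumes the image has already been downscaled to a 9x8 grid of grayscale values (that downscaling step is omitted here); real products use considerably more sophisticated analysis:

```python
def dhash(pixels):
    """Compute a 64-bit difference hash from a 9x8 grid of grayscale values.

    Each bit records whether a pixel is brighter than its right-hand
    neighbour. These relative comparisons tend to survive resizing,
    recompression, format changes and uniform brightness shifts.
    """
    bits = 0
    for row in pixels:                         # 8 rows
        for left, right in zip(row, row[1:]):  # 8 comparisons per 9-pixel row
            bits = (bits << 1) | (1 if left > right else 0)
    return bits

def hamming_distance(a, b):
    """Count differing bits; a small distance suggests a near-duplicate."""
    return bin(a ^ b).count("1")
```

Because only brightness *relationships* are encoded, a re-saved, resized or uniformly brightened copy of an image produces the same (or a nearly identical) hash, and a small Hamming distance between two hashes flags the pair for review.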
Visual duplicate detection is one of the most powerful and useful features in Third Light's Intelligent Media Server (a digital asset management system), whose visual similarity engine automates scanning for duplicate images.
Third Light IMS performs a rapid scan to check for similarities when new files are uploaded, and provides an interface to help you quickly review duplicates and recover wasted space by removing unnecessary copies.
The duplicate detection system in Third Light IMS is also able to detect images that are not precisely identical. For example, different cropping, aspect ratios or file sizes, as well as other subtle differences caused by compression or differing metadata tags are all taken into account.
Once duplicate files have been discovered, they are summarised with visual priority filtering to help identify the best images: those with the largest file sizes or the most recent changes, for example, are ranked highest. This makes selective pruning quick and easy.
Metadata Fingerprinting

When a file is downloaded from a Third Light digital asset management system, the download is logged in the DAM server as part of an audit trail. The downloaded file also carries a copy of this audit log. If the file is later uploaded back into the DAM server, a tell-tale metadata fingerprint connects the asset back to its origins in the DAM server.
This method of using metadata (normally Adobe XMP) to add custom XML records to downloads is useful once an established DAM solution is in place, and most media originates in the DAM server. Third Light IMS leaves digital fingerprints in the custom 'thirdlight' XMP packet for these purposes.
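The general idea of such a fingerprint can be sketched with Python's standard XML tooling. Everything here is illustrative: the namespace URI, element names and field choices are invented for this example and are not Third Light's actual XMP schema:

```python
import xml.etree.ElementTree as ET

# Illustrative namespace; a real deployment would use the vendor's
# registered XMP namespace rather than this placeholder.
FINGERPRINT_NS = "http://example.com/ns/dam-fingerprint/"

def build_fingerprint_packet(asset_id, server, download_time):
    """Build a minimal XMP-style XML packet recording an asset's origin."""
    ET.register_namespace("dam", FINGERPRINT_NS)
    root = ET.Element(f"{{{FINGERPRINT_NS}}}fingerprint")
    ET.SubElement(root, f"{{{FINGERPRINT_NS}}}assetId").text = asset_id
    ET.SubElement(root, f"{{{FINGERPRINT_NS}}}server").text = server
    ET.SubElement(root, f"{{{FINGERPRINT_NS}}}downloaded").text = download_time
    return ET.tostring(root, encoding="unicode")

def read_fingerprint(packet):
    """Recover the original asset ID from an embedded packet, if present."""
    root = ET.fromstring(packet)
    node = root.find(f"{{{FINGERPRINT_NS}}}assetId")
    return node.text if node is not None else None
```

On re-upload, finding such a packet inside the file's metadata immediately identifies the original asset record, without any content comparison at all.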
Each duplicate file is a challenge to eliminate, as there may be no obvious way to connect one image to a clone or near-copy of the same image. We have seen that looking solely at filenames, file sizes or hashes is inadequate.
The best solution is to remove duplicates as soon as they arrive, using a fast visual similarity test against existing assets. This test must remain agnostic about the file formats and tolerate minor differences in the images and their metadata. Using metadata fingerprints gives the DAM software a second source of data in cases where the asset has already been in the media server once, and is being uploaded again.