How to detect duplicate images

Duplicate Image Detection

Any collection of digital media is likely to contain some duplicates, and the time and effort spent on tagging the same file (or the risks associated with fragmented record-keeping) makes this a high priority for any digital asset manager.

Duplicate files not only occupy storage, they also cause fragmentation in digital asset management systems and on file servers. This fragmentation causes a breakdown of record-keeping, and leads to metadata records for the same content being entered repeatedly. As metadata entry is an expensive operation, and consistent record-keeping is one of the core benefits of digital asset management software, duplicate files must be avoided.

Desktop tools for finding duplicate images are usually inappropriate as assets arrive from a dispersed user base, through a process of uploads, approvals and metadata capture.

There are several methods for detecting duplicate images which can be used to improve the ingest of new digital assets into centralised media storage systems, including:

  1. Filename matching
  2. Hash checking
  3. Visual similarity tests
  4. Metadata fingerprinting

Filename Matching

The most basic approach, filename matching, involves looking for two files with the same filename and then highlighting them as duplicates.

This method is unreliable, as many unrelated digital files have the same filename (e.g. DSC0001.JPG is the first filename produced by a digital camera from the factory, but your staff may have several such cameras). Additionally, many kinds of duplication arise when several versions of the same file are saved with different filenames (for example, DSC0001.JPG might be saved as Example.jpg later).
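
As a minimal sketch of filename matching, assuming the files are given as a list of paths: the function name group_by_filename and the case-insensitive comparison are illustrative choices, not taken from any particular product. The sample paths are made up to show both failure modes described above.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_filename(paths):
    """Group paths whose final filename matches, ignoring case."""
    groups = defaultdict(list)
    for path in paths:
        groups[PurePosixPath(path).name.lower()].append(path)
    # Only names that occur more than once are candidate duplicates.
    return {name: found for name, found in groups.items() if len(found) > 1}


candidates = group_by_filename([
    "camera_a/DSC0001.JPG",
    "camera_b/DSC0001.JPG",  # a different photo with the same factory filename
    "edits/Example.jpg",     # a renamed copy of camera_a/DSC0001.JPG
])
print(candidates)
```

The two camera files are flagged even though they may be different photos, while the renamed copy is missed entirely.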

Hash checking

If two identical files are saved with different filenames, they can still be identified as duplicates by looking at the distinctive file signature, or hash. This is obtained easily using MD5 (a hashing algorithm) or an equivalent such as SHA-1. Simpler checksums such as CRC can be used in the same way, but are more prone to accidental collisions.

A hash is a “message digest” of a large file, resulting in a very short string which is easy to compare against the hash of another file. A hash looks like this:

4b62753267da6995182dec1b7ff523a0

Although the hash is much shorter and smaller than the whole file, it is highly unlikely that two non-identical files will have the same hash, and so it provides a very easy way to test for data duplication. If two files have the same hash, they are almost certainly duplicates.
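
A hash like this can also be computed programmatically. The following is a small sketch using Python's standard hashlib module; the helper name file_md5 and the chunk size are assumptions made for illustration.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Reading in chunks keeps memory use flat, even for very large images.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two files are then treated as identical when their digests compare equal as strings.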

If you are using an Apple Mac, you can try this from the Terminal (open Applications > Utilities > Terminal). In this test, there are two files on the Desktop with the filenames DSC0001.JPG and Example.JPG.

macpro:Desktop user$ md5 DSC0001.JPG
MD5 (DSC0001.JPG) = e8735480f03aeecfa21aec49e8f95d0a
macpro:Desktop user$ md5 Example.JPG
MD5 (Example.JPG) = e8735480f03aeecfa21aec49e8f95d0a

As these two files result in the same hash, they are identical – even though the filenames are different.

Tip: using a hash check is better than using a file size check, as two files can have the same file size in many different circumstances, but not be duplicates.
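
File size is still useful as a cheap first filter, because files of different sizes cannot be identical. The sketch below, which assumes a list of readable file paths and uses the made-up name find_exact_duplicates, only hashes files that share a size with at least one other file.

```python
import hashlib
import os
from collections import defaultdict


def find_exact_duplicates(paths):
    """Return lists of paths whose contents are byte-for-byte identical."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    duplicates = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot have an exact duplicate
        by_hash = defaultdict(list)
        for path in same_size:
            with open(path, "rb") as handle:
                by_hash[hashlib.md5(handle.read()).hexdigest()].append(path)
        # Equal size alone is not enough; only equal hashes count.
        duplicates.extend(group for group in by_hash.values() if len(group) > 1)
    return duplicates
```

For brevity this version reads each file into memory in one go; a production tool would hash in chunks.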

Unfortunately, there are some situations where using a hash is still insufficient. For example, each of the following changes alters the file’s bytes and therefore its hash, so the modified copy would no longer be detected as a duplicate:

  • Adding metadata to the file
  • Changing the size of the image (even if only very slightly)
  • Saving the file in a different format (e.g. TIFF to JPEG)
  • Cropping or scaling the image
  • Making adjustments in Photoshop, however slight
  • Saving the image repeatedly as JPEG (resulting in compound compression losses)

Because of these concerns, hash checking has limited practical applications.
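
The sensitivity of a hash is easy to demonstrate. In this small sketch the byte strings are stand-ins for real image data, and the single appended byte plays the part of a tiny metadata change.

```python
import hashlib

original = b"\xff\xd8 stand-in image bytes \xff\xd9"
# Hypothetical edit: one extra byte, as a small metadata change might add.
edited = original + b"\x00"

original_hash = hashlib.md5(original).hexdigest()
edited_hash = hashlib.md5(edited).hexdigest()

print(original_hash)
print(edited_hash)
print("same hash:", original_hash == edited_hash)
```

The two digests have nothing visibly in common, so a hash check can say that two files differ but not how similar they are.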

Visual Similarity Tests

Basic duplicate detection systems fail to catch obvious duplicates, because they only look at filenames or simple checksums. Another approach is to look for patterns in the images themselves, and to infer from those patterns when two images are identical or similar. This approach is more sophisticated and requires more image processing.

Visual duplicate detection is able to match many kinds of duplication, including the most difficult cases such as:

  • When the filenames or file sizes are different
  • When the images are the same but the metadata is different
  • When the images are different sizes
  • When the images are stored in different formats (for example, RAW, PDF, TIFF…)
  • When the images are similar, but not identical.
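
One widely known way to compare images by appearance is an "average hash". The sketch below illustrates that general technique only and is not a description of how Third Light IMS works. It assumes the image has already been decoded into a 2D list of greyscale values (0 to 255), and the function names are made up for the example.

```python
def average_hash(pixels, hash_size=8):
    """Compute an average hash from a 2D list of greyscale values.

    The image is reduced to hash_size x hash_size cells by block averaging,
    then each cell becomes one bit: 1 if it is brighter than the mean cell.
    """
    height, width = len(pixels), len(pixels[0])
    cells = []
    for row in range(hash_size):
        top = row * height // hash_size
        bottom = max((row + 1) * height // hash_size, top + 1)
        for col in range(hash_size):
            left = col * width // hash_size
            right = max((col + 1) * width // hash_size, left + 1)
            block = [pixels[y][x] for y in range(top, bottom) for x in range(left, right)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | int(value > mean)
    return bits


def hamming_distance(hash_a, hash_b):
    """Count the bits that differ between two hashes."""
    return bin(hash_a ^ hash_b).count("1")


# A synthetic 16x16 gradient, a copy scaled up to 32x32, and an inverted version.
small = [[(x + y) * 8 for x in range(16)] for y in range(16)]
scaled = [[small[y // 2][x // 2] for x in range(32)] for y in range(32)]
inverted = [[255 - value for value in row] for row in small]

print(hamming_distance(average_hash(small), average_hash(scaled)))
print(hamming_distance(average_hash(small), average_hash(inverted)))
```

The scaled copy produces the same hash as the original, while the inverted image differs in most bits. In practice, a small distance threshold decides whether two images count as similar.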

Visual duplicate detection is one of the most powerful and useful features in Third Light’s Intelligent Media Server (IMS), a digital asset management system whose visual similarity duplicate image finder automates scanning for duplicate images.

Third Light IMS performs a rapid scan to check for similarities when new files are uploaded, and provides an interface to help you quickly review duplicates and recover wasted space by removing unnecessary copies.

The duplicate detection system in Third Light IMS is also able to detect images that are not precisely identical. For example, different cropping, aspect ratios or file sizes, as well as other subtle differences caused by compression or differing metadata tags are all taken into account.

Screenshot: finding and eliminating duplicates using Third Light IMS v6.0, using graphical indicators to identify the best files to retain

Once duplicate files have been discovered they are summarised with visual priority filtering to help identify the best images. Those with the largest file sizes or most recent changes made, for example, are ranked highest. This makes it easier to prune selectively and quickly.
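
The article gives file size and recency only as examples of ranking criteria, so the following is a hedged sketch of the idea rather than the actual IMS logic. The Candidate class, its fields and the ordering (largest file first, most recent change as the tie-breaker) are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    size_bytes: int
    modified: float  # modification time as a Unix timestamp


def rank_candidates(candidates):
    """Order duplicates so the most likely file to keep comes first."""
    return sorted(candidates, key=lambda c: (c.size_bytes, c.modified), reverse=True)


group = [
    Candidate("thumb.jpg", 40_000, 1_300_000_000.0),
    Candidate("master.tif", 9_000_000, 1_290_000_000.0),
    Candidate("master_copy.tif", 9_000_000, 1_310_000_000.0),
]
for candidate in rank_candidates(group):
    print(candidate.name)
```

Here the two large masters rank above the thumbnail, and the more recently changed master comes first.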

Metadata Fingerprinting

When a file is downloaded from a Third Light digital asset management system, the download is logged in the DAM server as part of an audit trail. The file that is downloaded also carries a copy of this audit log. If the file is uploaded back into the DAM server later, there will be a tell-tale metadata fingerprint that connects the downloaded asset back to its origins in the DAM server.

This method of using metadata (normally Adobe XMP) to add custom XML records to downloads is useful once an established DAM solution is in place, and most media originates in the DAM server. Third Light IMS leaves digital fingerprints in the custom ‘thirdlight’ XMP packet for these purposes.
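
XMP metadata is embedded in a file as an XML packet wrapped in xpacket processing instructions, so a fingerprint check can be sketched as a search within that packet. The layout of the Third Light records is not described here, so the marker string and the thirdlight:assetId element in the sample are purely hypothetical.

```python
def extract_xmp_packet(data):
    """Return the first XMP packet found in raw file bytes, or None."""
    start = data.find(b"<?xpacket begin=")
    if start == -1:
        return None
    end = data.find(b"<?xpacket end=", start)
    if end == -1:
        return None
    close = data.find(b"?>", end)
    if close == -1:
        return None
    return data[start:close + 2]


def has_fingerprint(data, marker=b"thirdlight"):
    """Report whether the XMP packet mentions the (assumed) marker string."""
    packet = extract_xmp_packet(data)
    return packet is not None and marker in packet


# Stand-in file contents; the element name below is invented for illustration.
sample = (
    b"...image data..."
    b'<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>'
    b"<x:xmpmeta><thirdlight:assetId>12345</thirdlight:assetId></x:xmpmeta>"
    b'<?xpacket end="w"?>'
    b"...more image data..."
)
print(has_fingerprint(sample))
print(has_fingerprint(b"no metadata here"))
```

A real implementation would parse the packet as XML and read specific fields rather than searching for a substring.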

Duplicate file detection loop - the role of central digital asset management in deduplicating media storage

Users may source files from other storage solutions by accident or because of legacy systems. Duplicate file detection closes this loop. Central digital asset management servers provide de-duplication at the point of upload.

Summary

Each duplicate file is a challenge to eliminate, as there may be no obvious way to connect one image to a clone or to a similar copy of the same image. We have seen that looking solely at filenames, file sizes or hashes is inadequate.

The best solution is to remove duplicates as soon as they arrive, using a fast visual similarity test against existing assets. This test must remain agnostic about the file formats and tolerate minor differences in the images and their metadata. Using metadata fingerprints gives the DAM software a second source of data in cases where the asset has already been in the media server once, and is being uploaded again.
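
The idea of checking every upload against existing assets can be sketched as a small in-memory index. This is an illustration under assumptions and not the IMS implementation. The class name, the distance threshold and the choice to try an exact hash before a visual comparison are all made up for the example, and the visual hash is assumed to be an integer computed elsewhere.

```python
import hashlib


class IngestIndex:
    """Toy index of known assets, consulted at the point of upload."""

    def __init__(self, max_distance=5):
        self.by_digest = {}  # exact content hash -> asset id
        self.visual = []     # (visual hash, asset id) pairs
        self.max_distance = max_distance

    def check(self, data, visual_hash):
        """Return ('exact' | 'similar' | 'new', matching asset id or None)."""
        digest = hashlib.md5(data).hexdigest()
        if digest in self.by_digest:
            return "exact", self.by_digest[digest]
        for known_hash, asset_id in self.visual:
            # Count differing bits between the two visual hashes.
            if bin(known_hash ^ visual_hash).count("1") <= self.max_distance:
                return "similar", asset_id
        return "new", None

    def add(self, asset_id, data, visual_hash):
        """Record an accepted asset so later uploads are checked against it."""
        self.by_digest[hashlib.md5(data).hexdigest()] = asset_id
        self.visual.append((visual_hash, asset_id))


index = IngestIndex()
index.add("asset-1", b"original bytes", 0b1111000011110000)

print(index.check(b"original bytes", 0b1111000011110000))
print(index.check(b"re-saved bytes", 0b1111000011110001))
print(index.check(b"unrelated bytes", 0b0000111100001111))
```

The linear scan over visual hashes is fine for a demonstration; a large library would need a faster nearest-match lookup.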

Build a stylesheet from a logo image using colour recognition

How to build a stylesheet from a logo image

Third Light Intelligent Media Server v6.0 supports themes and templates for every page. The aim is to let you deploy our digital asset management software under your own brand: the colours, layout and wording are completely flexible.

We’ve also added a neat innovation: in IMS v6.0 you can create a colour scheme for your theme just by uploading your logo. IMS looks at the logo’s primary palette colours and works out a colour scheme for you.
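
How IMS chooses its palette is not described here, but the general idea of finding a logo's main colours can be sketched by counting pixels after merging similar shades. The function name, the quantisation step and the sample pixel data below are all assumptions for illustration.

```python
from collections import Counter


def dominant_colours(pixels, count=3, step=32):
    """Return the most common colours after coarse quantisation.

    pixels is an iterable of (r, g, b) tuples; step controls how coarsely
    similar shades are merged into one bucket.
    """
    buckets = Counter(
        tuple((channel // step) * step for channel in pixel) for pixel in pixels
    )
    return ["#%02x%02x%02x" % colour for colour, _ in buckets.most_common(count)]


# Stand-in logo: mostly red, some white, a little near-black.
logo_pixels = [(200, 16, 46)] * 60 + [(255, 255, 255)] * 30 + [(10, 10, 10)] * 10
print(dominant_colours(logo_pixels))
```

The resulting hex values could then be assigned to roles such as background, accent and text when generating a stylesheet.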

Screenshots: colour schemes derived from logos (examples 1, 2 and 3)

If you have a library of assets for several different brands, you can create a theme from each brand’s logo and assign each theme to the relevant users (or collections of media). You can create as many themes as you want.

If you already have a Third Light IMS site, you can read more about these features in our online help wiki. If you are not yet using Third Light IMS, you can take it for a free 30-day trial without obligation.

Communicate Magazine: Digital Asset Management feature

Amid the deafening noise of digital and social media, how can corporate communicators hope to achieve cut through and maintain consistency in their comms output?

“The issue with digital assets is their uncanny ability to become fragmented, lost or hoarded. It’s not just a problem of inefficiency but some genuine cost and risk when digital content starts to sprawl. Keeping track of how the content is used and being absolutely certain about the status of files used is part of every day interaction with the press.”

“As custodians of an ever-growing array of brand assets, communicators are turning to digital asset management providers to bring efficiency and consistency to the way assets are deployed.”

Read more in the full briefing on Digital Asset Management:

Communicate Magazine feature on Digital Asset Management

Communicate Magazine feature on Digital Asset Management (PDF)

Article content is © Copyright Communicate Magazine, Cravenhill Publishing

Third Light IMS v6.0 is released

A new standard for user-friendly digital asset management software

Third Light IMS v6.0 is our answer to DAM users who expect outstanding, uncompromising usability. IMS v6.0 features a highly-evolved user interface, faster and more natural workflow tools and new assistants and wizards. We’re proud to be obsessed with great usability and making the benefits of DAM software a reality!

What’s new in IMS v6.0?

Screenshots of V6

Try it for 30 days

Read more in our August 2011 newsletter