Duplicate Finder

License: PRO+ (Professional Edition or higher)
Plugin Type: Post-Index Plugin
Author: Diskover Data, Inc.

Overview

The Duplicate Finder plugin identifies files with identical content across your storage systems. By analyzing file contents using cryptographic hash values, it pinpoints exact duplicates—even when filenames differ—enabling you to reclaim wasted storage space and streamline data management.

Why Use Duplicate Finder?

Storage environments inevitably accumulate duplicate files over time. Users save multiple copies of documents, backup processes create redundant data, and file migrations leave behind forgotten duplicates. The Duplicate Finder plugin helps you:

Quantify duplicate data with precise metrics on how much storage is consumed by redundant files
Identify cleanup candidates by marking which copies can be safely removed while preserving originals
Generate actionable reports in CSV format for review, approval workflows, or integration with cleanup tools

Key Capabilities

Multi-stage detection pipeline: Optimizes performance by filtering files through size matching, first-chunk hashing, and full content hashing—only processing what's necessary
Multiple hash algorithms: Choose from xxhash (fastest), MD5, SHA1, or SHA256 based on your speed vs. security requirements
Fast hash mode: Quick analysis using filename + size matching when you need rapid results
Cross-index support: Find duplicates spanning multiple storage systems or indices in a single operation
Persistent caching: SQLite-based cache avoids re-hashing unchanged files on subsequent runs
Copy number tracking: Assigns copy numbers (1, 2, 3...) to duplicates for systematic cleanup decisions
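
The multi-stage idea can be illustrated with a minimal Python sketch. This is not the plugin's code: the function names, the use of SHA-256 from the standard library, and treating the first block as the "first chunk" are assumptions made to keep the example self-contained.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def _hash_file(path, first_chunk_only=False, blocksize=65536):
    """Hash a file with SHA-256, optionally reading only the first block."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        while True:
            block = fh.read(blocksize)
            if not block:
                break
            digest.update(block)
            if first_chunk_only:
                break
    return digest.hexdigest()


def _group(paths, key):
    """Group paths by key(path) and keep only groups with two or more members."""
    groups = defaultdict(list)
    for p in paths:
        groups[key(p)].append(p)
    return [g for g in groups.values() if len(g) > 1]


def find_duplicates(paths):
    """Return lists of paths whose contents are identical."""
    duplicates = []
    # Stage 1: only files that share a size can be duplicates.
    for same_size in _group(paths, lambda p: Path(p).stat().st_size):
        # Stage 2: compare a cheap hash of the first block.
        for same_head in _group(same_size, lambda p: _hash_file(p, True)):
            # Stage 3: full-content hash on the remaining candidates only.
            duplicates.extend(_group(same_head, _hash_file))
    return duplicates
```

Each stage discards files that cannot be duplicates, so the expensive full read happens only for the small set that survives the size and first-block checks.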

Use Cases

Storage Optimization

Identify and quantify duplicate files consuming unnecessary storage space across your file shares and storage systems.

Workflow:

Run Duplicate Finder against your index to identify all duplicates
Review the analysis report showing potential space savings
Use Diskover search to locate duplicates by size, location, or file type
Export results to CSV for cleanup planning and approval workflows

Cross-Storage Deduplication

Find duplicate files that exist across multiple storage systems, NAS devices, or cloud storage locations. This is particularly valuable when consolidating storage or planning migrations.

Workflow:

Ensure both storage locations have been indexed by Diskover
Run Duplicate Finder against both indices simultaneously
Export results to CSV showing which files exist in both locations
Use the report to identify files that can be removed from one location

Backup Validation

Verify that backup copies match production files by comparing hash values. This helps confirm backup integrity and identify any discrepancies between primary and backup storage.

Workflow:

Index both your production storage and backup storage
Run Duplicate Finder with a cryptographic hash algorithm (SHA256 recommended for verification)
Files appearing as duplicates confirm successful backup
Files without matches may indicate backup gaps requiring investigation
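
The last two steps amount to a set comparison on hash values. A minimal sketch, assuming you have already collected a path-to-hash mapping for each side (the function name and input shape are hypothetical, not part of the plugin):

```python
def backup_gaps(production, backup):
    """Return production paths whose content hash has no match in backup.

    Both arguments map a file path to its content hash, for example a
    SHA256 hex digest. The input shape is an assumption for illustration.
    """
    backup_hashes = set(backup.values())
    return sorted(
        path for path, digest in production.items()
        if digest not in backup_hashes
    )
```

Comparing by hash rather than by path means a file that was renamed or moved in the backup still counts as backed up.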

Requirements

System Requirements

Component	Requirement
Python	3.9 or higher
Diskover	Core installation with Elasticsearch
Elasticsearch	7.x or 8.x (as supported by Diskover)
Memory	Minimum 2GB RAM (4GB+ recommended for large datasets)
Storage	Sufficient space for SQLite cache database
Access	Read access to indexed storage locations

Python Dependencies

The xxhash algorithm (recommended for best performance) requires an additional Python package:

Linux:

python3 -m pip install xxhash

Windows:

python -m pip install xxhash

Verification

Verify the installation is complete:

Linux:

# Verify xxhash is installed
python3 -c "import xxhash; print(xxhash.VERSION)"

# Verify Elasticsearch connectivity
python3 -c "from diskover_elasticsearch import es_connection_cached; print(es_connection_cached().info())"

Windows:

# Verify xxhash is installed
python -c "import xxhash; print(xxhash.VERSION)"

# Verify Elasticsearch connectivity
python -c "from diskover_elasticsearch import es_connection_cached; print(es_connection_cached().info())"

Installation

The Duplicate Finder plugin is included with Diskover Professional Edition and higher. The plugin files are located in the post-index plugins directory:

Linux: /opt/diskover/plugins_postindex/diskover_dupesfinder/

Windows: C:\Program Files\Diskover\plugins_postindex\diskover_dupesfinder\

No additional installation steps are required beyond ensuring the xxhash Python dependency is installed.

Configuration

Configuration is managed through the Diskover Admin Panel. Navigate to Plugins → Post Index → Dupes Finder to access the settings.

Sample Configuration in Diskover Admin:

Here is the beginning of our sample configuration. There are many other configurations for the Dupes Finder plugin, covered in detail below.

Configuration Parameters

Parameter	Default	Description
`max_threads`	0	Maximum hashing threads. Set to 0 for automatic detection based on CPU cores.
`fast_hash_only`	False	Use fast hash mode only (filename + size), skipping full file content hashing.
`mode`	xxhash	Hash algorithm: `xxhash`, `md5`, `sha1`, or `sha256`.
`blocksize`	65536	Block size in bytes for reading files. Increase for large files or network storage.
`cachedir`	/opt/diskover/plugins_postindex/__diskover_hash_cache__/	Directory for SQLite cache database.
`cache_expiretime`	0	Cache expiry in seconds. Set to 0 for no expiration.
`minsize`	1024	Minimum file size to process (bytes).
`maxsize`	1073741824	Maximum file size to process (bytes). Default is 1GB.
`extensions`	[]	File extensions to include. Empty means all file types.
`exclude_extensions`	[]	File extensions to exclude from processing.
`exclude_files`	[]	Specific filenames to exclude.
`exclude_dirs`	[]	Directory paths to exclude. Use `*` at the end for recursive exclusion.
`hardlinks`	False	Include hardlinked files (nlink > 1) in processing.
`other_query`	""	Additional Elasticsearch query to filter files.
`restore_times`	False	Restore file access and modification times after hashing.
`replacepaths`	{}	Path replacement mapping for NFS or mounted shares.
`usediskmtime`	False	Use actual disk mtime instead of indexed mtime for cache comparison.
`csvdir`	/tmp	Directory for CSV output files.

Hash Algorithm Selection

Algorithm	Speed	Use Case
xxhash	Fastest	General duplicate detection, day-to-day operations
md5	Fast	Legacy system compatibility
sha1	Medium	Legacy compatibility
sha256	Slowest	Backup validation, compliance requirements, archival verification

For most duplicate detection scenarios, xxhash provides the best balance of speed and reliability. Use SHA256 when you need cryptographic verification for compliance or backup validation purposes.
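
To make the trade-off concrete, here is a hedged sketch of hashing a file block by block with a selectable algorithm. The function names are hypothetical, and the choice of the 64-bit xxhash variant (`xxhash.xxh64`) is an assumption; the plugin may select algorithms differently.

```python
import hashlib

try:
    import xxhash  # optional third-party package
except ImportError:
    xxhash = None


def new_hasher(mode):
    """Return a fresh hash object for one of the documented mode names."""
    if mode == "xxhash":
        if xxhash is None:
            raise RuntimeError("xxhash is not installed")
        return xxhash.xxh64()  # assumption: 64-bit variant
    if mode in ("md5", "sha1", "sha256"):
        return hashlib.new(mode)
    raise ValueError(f"unsupported mode: {mode}")


def hash_file(path, mode="sha256", blocksize=65536):
    """Read the file in blocksize pieces so memory use stays constant."""
    hasher = new_hasher(mode)
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(blocksize), b""):
            hasher.update(block)
    return hasher.hexdigest()
```

A larger `blocksize` means fewer read calls per file, which is why raising it can help on network storage.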

Example Configuration

A typical configuration for general storage optimization:

Plugins:
  Post Index:
    Dupes Finder:
      Default:
        max_threads: 0
        fast_hash_only: false
        mode: xxhash
        blocksize: 1048576
        cachedir: /opt/diskover/plugins_postindex/__diskover_hash_cache__/
        cache_expiretime: 0
        minsize: 1024
        maxsize: 10737418240
        extensions: []
        exclude_extensions:
          - tmp
          - log
          - bak
        exclude_files:
          - .DS_Store
          - Thumbs.db
        exclude_dirs:
          - /mnt/data/temp/*
          - /mnt/data/cache/*
        hardlinks: false
        other_query: ""
        restore_times: false
        csvdir: /var/log/diskover/reports

Path Replacement

If your indexed paths differ from the paths where Diskover can access files (common with NFS mounts), configure path replacement:

replacepaths:
  from_path: /mnt/nfs/production
  to_path: /data/production
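
Conceptually, path replacement is a prefix swap applied to each indexed path before the file is opened. The sketch below shows that idea only; the function name is hypothetical and the plugin's own matching rules may differ.

```python
def replace_path(path, from_path, to_path):
    """Swap a leading from_path for to_path; leave other paths unchanged."""
    prefix = from_path.rstrip("/")
    # Match whole path components so /mnt/nfs/production2 is not rewritten.
    if path == prefix or path.startswith(prefix + "/"):
        return to_path.rstrip("/") + path[len(prefix):]
    return path
```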

Execution

Manual Execution

Run Duplicate Finder from the command line to process one or more indices.

Basic Syntax

Linux:

python3 /opt/diskover/plugins_postindex/diskover_dupesfinder/diskover_dupesfinder.py [OPTIONS] INDEX [INDEX...]

Windows:

python "C:\Program Files\Diskover\plugins_postindex\diskover_dupesfinder\diskover_dupesfinder.py" [OPTIONS] INDEX [INDEX...]

Command-Line Options

Option	Description
`-n NAME`	Use a named configuration from Diskover Admin
`-a`	Update all file documents with hash values, not just duplicates
`-c`	Export duplicate files to CSV
`-u`	Enable SQLite hash cache
`-f`	Clear the hash cache before processing
`-U INDEX`	Retrieve existing hashes from another index
`--useindexauto`	Automatically find previous index for hash reuse
`-r`	Remove all hash and duplicate fields from index
`-l TOPPATH`	Auto-find latest index by top path
`-e`	Skip files that already have hash values
`-m MODE`	Override hash algorithm (xxhash/md5/sha1/sha256)
`--fasthash`	Use fast hash mode (filename + size only)
`-v`	Enable verbose logging
`-V`	Enable debug logging
`--version`	Display version information
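
If you drive the plugin from your own automation, the options above can be assembled into an argument list. This is a sketch under assumptions: the helper name and its parameters are hypothetical, and it uses only the flags listed in the table with the Linux script path from this page.

```python
SCRIPT = "/opt/diskover/plugins_postindex/diskover_dupesfinder/diskover_dupesfinder.py"


def build_command(indices, use_cache=True, export_csv=False, mode=None,
                  config_name=None, python="python3", script=SCRIPT):
    """Assemble an argv list from the documented options."""
    if not indices:
        raise ValueError("at least one index name is required")
    cmd = [python, script]
    if config_name:
        cmd += ["-n", config_name]  # named configuration
    if mode:
        cmd += ["-m", mode]         # override hash algorithm
    if use_cache:
        cmd.append("-u")            # enable SQLite hash cache
    if export_csv:
        cmd.append("-c")            # export duplicates to CSV
    return cmd + list(indices)
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; passing a list rather than a single string avoids quoting problems with paths that contain spaces.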

Common Examples

Basic duplicate detection with caching:

Linux:

python3 /opt/diskover/plugins_postindex/diskover_dupesfinder/diskover_dupesfinder.py -u -v diskover-myindex

Windows:

python "C:\Program Files\Diskover\plugins_postindex\diskover_dupesfinder\diskover_dupesfinder.py" -u -v diskover-myindex

Export results to CSV:

Linux:

python3 /opt/diskover/plugins_postindex/diskover_dupesfinder/diskover_dupesfinder.py -u -c -v diskover-myindex

Windows:

python "C:\Program Files\Diskover\plugins_postindex\diskover_dupesfinder\diskover_dupesfinder.py" -u -c -v diskover-myindex

Cross-storage deduplication (multiple indices):

Linux:

python3 /opt/diskover/plugins_postindex/diskover_dupesfinder/diskover_dupesfinder.py -u -c diskover-storage1 diskover-storage2

Windows:

python "C:\Program Files\Diskover\plugins_postindex\diskover_dupesfinder\diskover_dupesfinder.py" -u -c diskover-storage1 diskover-storage2

Backup validation with SHA256:

Linux:

python3 /opt/diskover/plugins_postindex/diskover_dupesfinder/diskover_dupesfinder.py -m sha256 -u -c diskover-production diskover-backup

Windows:

python "C:\Program Files\Diskover\plugins_postindex\diskover_dupesfinder\diskover_dupesfinder.py" -m sha256 -u -c diskover-production diskover-backup

Quick analysis with fast hash mode:

Linux:

python3 /opt/diskover/plugins_postindex/diskover_dupesfinder/diskover_dupesfinder.py --fasthash -u -v diskover-myindex

Windows:

python "C:\Program Files\Diskover\plugins_postindex\diskover_dupesfinder\diskover_dupesfinder.py" --fasthash -u -v diskover-myindex

Process index with existing checksum data (skip already-hashed files):

Linux:

python3 /opt/diskover/plugins_postindex/diskover_dupesfinder/diskover_dupesfinder.py -e -u -v diskover-myindex

Windows:

python "C:\Program Files\Diskover\plugins_postindex\diskover_dupesfinder\diskover_dupesfinder.py" -e -u -v diskover-myindex

Sample CLI Execution of Dupes Finder:

python3 /opt/diskover/plugins_postindex/diskover_dupesfinder/diskover_dupesfinder.py -c -n "Documentation Example" -l /opt/diskover

INFO - Starting diskover dupesfinder ...
INFO - Using alternate configuration: Documentation Example
INFO - Finding latest index name for /opt/diskover ...
INFO - Found latest index diskover-build-dir-dupes-finder
INFO - Starting diskover dupes finder for indices ['diskover-build-dir-dupes-finder'] ...
INFO - Started 2 file hash threads
INFO - Using hash mode md5
INFO - Updating index mappings for hash, is_dupe, and dupe_count fields in diskover-build-dir-dupes-finder...
INFO - Done.
INFO - Starting dupes finding for index diskover-build-dir-dupes-finder...
INFO - Searching and queuing files in index diskover-build-dir-dupes-finder...
INFO - Starting checksuming first chunks of files...
INFO - Found 579 docs
INFO - Finished hashing first chunks of files. Starting full hash on potential duplicates...
INFO - Done queuing files in index diskover-build-dir-dupes-finder. Waiting for hash threads to finish...
INFO - Done.
INFO - Finding and updating any duplicate files...
INFO - *** Total files: 579 ***
INFO - *** Files hashed: 37 (93.6% reduction of total files) ***
INFO - *** Files with similar sizes: 50 (91.4% reduction of total files) ***
INFO - *** Files with similar first chunk size: 37 (93.6% reduction of total files) ***
INFO - *** Dupes found: 37 (6.4% of total files) ***
INFO -
INFO - ================================================================================
INFO - DUPLICATE FILES ANALYSIS REPORT
INFO - ================================================================================
INFO -
INFO - Total duplicate files found: 37
INFO - Unique duplicate groups: 18
INFO - Files that can be cleaned up: 19
INFO - Potential space savings: 168.02 KB (172,052 bytes)
INFO -
INFO - Cleanup efficiency: 51.4% of duplicate files can be removed
INFO - Space savings percentage: 0.18% of total indexed data
INFO -
INFO - ================================================================================
INFO - Updating 37 ES docs...
INFO - Done.
INFO - Saving results to /tmp/diskover-dupesfinder_diskover-build-dir-dupes-finder_md5_2026_04_09_22_02_23.csv
INFO - Done.
INFO - Saving duplicate analysis report to /tmp/diskover-dupesfinder-report_diskover-build-dir-dupes-finder_md5_2026_04_09_22_02_23.txt
INFO - Report saved successfully.

Automated Execution

Duplicate Finder can be scheduled to run automatically using Diskover's built-in task scheduling.

Post-Crawl Command (Index Task)

Configure Duplicate Finder to run automatically after each index scan completes by adding it as a Post-Crawl Command in your Index Task configuration.

Sample Post-Crawl Command configuration for Dupes Finder executing with an Index Task:

On your system, replace ConfigurationName in the sample command with a named configuration that you have created at Diskover Admin → Plugins → Post Index → Dupes Finder. If you are not using a custom configuration and are only using Default, then the -n flag and the ConfigurationName are not required.

Linux Example:

Field	Value
Post-Crawl Command	`python3`
Post-Crawl Command Args	`/opt/diskover/plugins_postindex/diskover_dupesfinder/diskover_dupesfinder.py -u -c -v {indexname}`

Windows Example:

Field	Value
Post-Crawl Command	`python`
Post-Crawl Command Args	`"C:\Program Files\Diskover\plugins_postindex\diskover_dupesfinder\diskover_dupesfinder.py" -u -c -v {indexname}`

Available Index Task Tokens:

{indexname} — The name of the index that was just created

Custom Task

For on-demand or scheduled execution independent of indexing, create a Custom Task in the Diskover Admin Panel.

Sample Custom Task Configuration:

The sample shows the Run Command and args needed for the Custom Task. Note that you cannot use the {indexname} token here, because a Custom Task does not create an index. Use the -l (top path) CLI option instead and pass in your top path.

Reviewing the Output

Console Output

During execution, Duplicate Finder displays progress statistics including files processed, hash performance, and memory usage.

Upon completion, a summary report is displayed:

================================================================================
DUPLICATE FILES ANALYSIS REPORT
================================================================================

Total duplicate files found: 15,847
Unique duplicate groups: 4,231
Files that can be cleaned up: 11,616
Potential space savings: 847.00 GB (909,456,789,012 bytes)

Cleanup efficiency: 73.3% of duplicate files can be removed
Space savings percentage: 12.45% of total indexed data

================================================================================
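
The report figures are related by simple arithmetic, assuming one copy in each duplicate group is kept. The sketch below reproduces the file counts and the cleanup efficiency from the sample report; the function name is hypothetical.

```python
def report_metrics(total_dupes, groups):
    """Derive cleanable file count and cleanup efficiency (percent).

    Assumes exactly one copy per duplicate group is preserved.
    """
    cleanable = total_dupes - groups
    efficiency = 100.0 * cleanable / total_dupes if total_dupes else 0.0
    return cleanable, round(efficiency, 1)
```

For the sample report, `report_metrics(15847, 4231)` gives 11,616 removable files and 73.3% efficiency.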

CSV Export

When using the -c option, duplicate files are exported to a CSV file in the configured csvdir directory.

Filename format: diskover-dupesfinder_<indices>_<hashmode>_<timestamp>.csv

CSV columns:

Column	Description
File	Full file path
Fhash(Fast Hash)	MD5 hash of filename + size
Hash	Full content hash (when not using fast hash mode)
Copy Count	Duplicate copy number (1 = original, 2+ = duplicates)
Size(bytes)	File size
Mtime(utc)	Last modified time
Index	Source Elasticsearch index
Docid	Elasticsearch document ID
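
A short sketch of post-processing the export with Python's `csv` module. It assumes the file has a header row whose names match the column names in the table above exactly; verify that against your own export before relying on it.

```python
import csv


def reclaimable_bytes(csv_file):
    """Sum the sizes of rows whose Copy Count is 2 or higher.

    csv_file is an open text file object. Column names are assumed to
    match the table above ("Copy Count", "Size(bytes)").
    """
    total = 0
    for row in csv.DictReader(csv_file):
        if int(row["Copy Count"]) >= 2:
            total += int(row["Size(bytes)"])
    return total
```

Typical use is `with open(path, newline="") as fh: print(reclaimable_bytes(fh))`, where `path` points at a file in your csvdir.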

Analysis Report

A text report is also saved alongside the CSV file:

Filename format: diskover-dupesfinder-report_<indices>_<hashmode>_<timestamp>.txt

Searching in Diskover

After running Duplicate Finder, new fields become available for searching in the Diskover web interface.

New Search Fields

Field	Type	Description
`is_dupe`	Boolean	True if file has duplicates
`dupe_count`	Integer	Copy number (1 = first copy, 2+ = additional copies)
`hash.xxhash`	Keyword	xxhash value (when using xxhash mode)
`hash.md5`	Keyword	MD5 hash value
`hash.sha1`	Keyword	SHA1 hash value
`hash.sha256`	Keyword	SHA256 hash value
`hash.fhash`	Keyword	Fast hash value (filename + size MD5)
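
The fast hash is described as an MD5 of filename plus size. How the plugin joins the two values before hashing is not stated here, so the sketch below simply concatenates them; treat the exact byte layout, and the function name, as assumptions. The point it illustrates is that fast hash never reads file content.

```python
import hashlib


def fast_hash(filename, size):
    """MD5 over filename and size only (concatenation format is assumed)."""
    return hashlib.md5(f"{filename}{size}".encode("utf-8")).hexdigest()
```

Two files with the same name and size get the same fast hash even if their contents differ, which is why fast hash mode suits quick initial analysis rather than verification.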

Common Search Queries

Find all duplicate files:

is_dupe:true

Find files safe to remove (copies 2 and higher):

dupe_count:[2 TO *]

Find original files to keep (first copies):

dupe_count:1

Find large duplicate files (over 100MB):

is_dupe:true AND size:[104857600 TO *]

Find duplicate files by extension:

is_dupe:true AND extension:pdf

Find duplicates in a specific directory:

is_dupe:true AND parent_path:/mnt/data/projects/*

Find files with a specific hash value:

hash.xxhash:ef46db3751d8e999

Find all files that have been hashed:

hash:*

Combine multiple criteria (large video duplicates that can be removed):

is_dupe:true AND dupe_count:[2 TO *] AND size:[104857600 TO *] AND extension:(mp4 OR mov OR avi)
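
Queries like the ones above can also be composed programmatically, for example when generating saved searches. A minimal sketch; the helper name and parameters are hypothetical, and it only joins the field expressions already shown in this section.

```python
def dupes_query(min_size_mb=None, extensions=(), removable_only=False):
    """Build a duplicate-file query string from optional criteria."""
    parts = ["is_dupe:true"]
    if removable_only:
        parts.append("dupe_count:[2 TO *]")
    if min_size_mb is not None:
        # The size field is in bytes; 1 MB is taken as 1024 * 1024 bytes.
        parts.append(f"size:[{min_size_mb * 1024 * 1024} TO *]")
    if extensions:
        parts.append("extension:(" + " OR ".join(extensions) + ")")
    return " AND ".join(parts)
```

Calling `dupes_query(100, ("mp4", "mov", "avi"), removable_only=True)` produces the combined video query shown above.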

Sample Diskover query output for duplicate files that exist more than twice:

Working with Existing Checksum Data

If you have previously run the Checksums plugin against an index, Duplicate Finder can leverage that existing hash data rather than re-hashing files.

Skip files that already have hash values:

Linux:

python3 /opt/diskover/plugins_postindex/diskover_dupesfinder/diskover_dupesfinder.py -e -u diskover-myindex

Windows:

python "C:\Program Files\Diskover\plugins_postindex\diskover_dupesfinder\diskover_dupesfinder.py" -e -u diskover-myindex

This approach is useful when you want to add duplicate detection to indices that already have checksum metadata, without the overhead of re-hashing all files.

Troubleshooting

No Duplicates Found

Symptom: Plugin completes but reports no duplicates despite expecting some.

Possible causes:

File size filters (minsize/maxsize) may be excluding your target files
Extension filters may be too restrictive
Files may have unique sizes (the first-stage filter eliminates them early)

Resolution:

Run with -v or -V to see the Elasticsearch query being used
Check that your filter settings include the files you expect
Verify files exist in the index with the expected attributes
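
When checking filter settings, it can help to reason about a single file. The sketch below mirrors the documented defaults for minsize and maxsize; whether the plugin treats the bounds as inclusive, and how it normalizes extensions, are assumptions, and the function name is hypothetical.

```python
def passes_filters(size, extension, minsize=1024, maxsize=1073741824,
                   extensions=(), exclude_extensions=()):
    """Return True if a file would survive the size and extension filters."""
    if size < minsize or size > maxsize:
        return False
    ext = extension.lower().lstrip(".")
    # An empty include list means all file types are allowed.
    if extensions and ext not in extensions:
        return False
    return ext not in exclude_extensions
```

With the defaults, a 500-byte file or a 2 GB file is excluded before any hashing happens, which is a common reason for expected duplicates not appearing.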

xxhash Module Not Found

Symptom: Error message "Missing xxhash Python module"

Resolution:

Linux:

python3 -m pip install xxhash

Windows:

python -m pip install xxhash

Permission Denied Errors

Symptom: Warnings about unable to open or read files.

Resolution:

Ensure the Diskover service account has read access to all indexed file paths
For NFS mounts, verify export options include appropriate read permissions
Check that replacepaths configuration correctly maps indexed paths to accessible paths

Slow Performance

Symptom: Hashing takes longer than expected.

Resolution:

Increase blocksize for large files or network storage (try 1MB: 1048576)
Use xxhash mode instead of SHA256 for better performance
Enable caching (-u) to avoid re-hashing unchanged files on subsequent runs
Use --fasthash for quick initial analysis when full content verification isn't required

Cache Not Working

Symptom: Files are being re-hashed on every run despite using -u.

Resolution:

Verify the cache directory exists and is writable
Check if file modification times are changing between runs
Try flushing and rebuilding the cache with -f -u
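
To see why changing modification times defeat the cache, consider a cache keyed on path and mtime. The schema and class below are hypothetical, written only to illustrate the behavior; the plugin's actual SQLite layout is not described in this document.

```python
import sqlite3


class HashCache:
    """Toy hash cache: a lookup hits only if path and mtime both match."""

    def __init__(self, db_path=":memory:"):
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS hashes "
            "(path TEXT PRIMARY KEY, mtime REAL, digest TEXT)"
        )

    def get(self, path, mtime):
        row = self.db.execute(
            "SELECT digest FROM hashes WHERE path = ? AND mtime = ?",
            (path, mtime),
        ).fetchone()
        return row[0] if row else None

    def put(self, path, mtime, digest):
        self.db.execute(
            "INSERT OR REPLACE INTO hashes VALUES (?, ?, ?)",
            (path, mtime, digest),
        )
        self.db.commit()
```

In this model, any process that touches a file's mtime between runs turns every lookup for that file into a miss, so the file is hashed again.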

Removing Duplicate Data from Index

To completely remove all hash and duplicate fields from an index and start fresh:

Linux:

python3 /opt/diskover/plugins_postindex/diskover_dupesfinder/diskover_dupesfinder.py -r diskover-myindex

Windows:

python "C:\Program Files\Diskover\plugins_postindex\diskover_dupesfinder\diskover_dupesfinder.py" -r diskover-myindex

Support

Last Updated: April 2026
