Duplicate Finder
License: PRO+ (Professional Edition or higher)
Plugin Type: Post-Index Plugin
Author: Diskover Data, Inc.
Overview
The Duplicate Finder plugin identifies files with identical content across your storage systems. By analyzing file contents using cryptographic hash values, it pinpoints exact duplicates—even when filenames differ—enabling you to reclaim wasted storage space and streamline data management.
Why Use Duplicate Finder?
Storage environments inevitably accumulate duplicate files over time. Users save multiple copies of documents, backup processes create redundant data, and file migrations leave behind forgotten duplicates. The Duplicate Finder plugin helps you:
Quantify duplicate data with precise metrics on how much storage is consumed by redundant files
Identify cleanup candidates by marking which copies can be safely removed while preserving originals
Generate actionable reports in CSV format for review, approval workflows, or integration with cleanup tools
Key Capabilities
Multi-stage detection pipeline: Optimizes performance by filtering files through size matching, first-chunk hashing, and full content hashing—only processing what's necessary
Multiple hash algorithms: Choose from xxhash (fastest), MD5, SHA1, or SHA256 based on your speed vs. security requirements
Fast hash mode: Quick analysis using filename + size matching when you need rapid results
Cross-index support: Find duplicates spanning multiple storage systems or indices in a single operation
Persistent caching: SQLite-based cache avoids re-hashing unchanged files on subsequent runs
Copy number tracking: Assigns copy numbers (1, 2, 3...) to duplicates for systematic cleanup decisions
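The multi-stage pipeline described above can be sketched in a few lines of Python. This is a minimal illustration, not the plugin's actual code: it uses MD5 from the standard library in place of xxhash, and the function names (`find_duplicates`, `_group`, `_hash`) and the first-chunk size are assumptions made for the example.

```python
import hashlib
import os
from collections import defaultdict

CHUNK = 65536  # bytes per read; assumed to match the documented blocksize default


def _hash(path, first_chunk_only=False):
    """Return an MD5 hex digest of the first chunk or of the whole file."""
    h = hashlib.md5()
    with open(path, "rb") as fh:
        while True:
            block = fh.read(CHUNK)
            if not block:
                break
            h.update(block)
            if first_chunk_only:
                break
    return h.hexdigest()


def _group(paths, key):
    """Group paths by key(path) and keep only groups with two or more members."""
    groups = defaultdict(list)
    for p in paths:
        groups[key(p)].append(p)
    return [g for g in groups.values() if len(g) > 1]


def find_duplicates(paths):
    """Stage 1: match by size. Stage 2: first-chunk hash. Stage 3: full hash."""
    result = []
    for same_size in _group(paths, os.path.getsize):
        for same_head in _group(same_size, lambda p: _hash(p, True)):
            result.extend(_group(same_head, _hash))
    return result
```

Each stage only sees files that survived the previous one, so a file with a unique size is never opened, and a full read happens only for files that already match on size and first chunk.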
Use Cases
Storage Optimization
Identify and quantify duplicate files consuming unnecessary storage space across your file shares and storage systems.
Workflow:
Run Duplicate Finder against your index to identify all duplicates
Review the analysis report showing potential space savings
Use Diskover search to locate duplicates by size, location, or file type
Export results to CSV for cleanup planning and approval workflows
Cross-Storage Deduplication
Find duplicate files that exist across multiple storage systems, NAS devices, or cloud storage locations. This is particularly valuable when consolidating storage or planning migrations.
Workflow:
Ensure both storage locations have been indexed by Diskover
Run Duplicate Finder against both indices simultaneously
Export results to CSV showing which files exist in both locations
Use the report to identify files that can be removed from one location
Backup Validation
Verify that backup copies match production files by comparing hash values. This helps confirm backup integrity and identify any discrepancies between primary and backup storage.
Workflow:
Index both your production storage and backup storage
Run Duplicate Finder with a cryptographic hash algorithm (SHA256 recommended for verification)
Files appearing as duplicates confirm successful backup
Files without matches may indicate backup gaps requiring investigation
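One caveat with this workflow is that two copies inside the production index also count as duplicates, without proving a backup exists. The sketch below checks the exported CSV for that case. It assumes the CSV has a header row using the File, Hash, and Index column names listed in the CSV Export section; the function name `backed_up_files` is hypothetical.

```python
import csv
import io
from collections import defaultdict


def backed_up_files(csv_text, prod_index, backup_index):
    """Return production paths whose content hash also occurs in the backup index."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    indices_by_hash = defaultdict(set)
    for row in rows:
        indices_by_hash[row["Hash"]].add(row["Index"])
    return sorted(
        row["File"]
        for row in rows
        if row["Index"] == prod_index
        and backup_index in indices_by_hash[row["Hash"]]
    )
```

A production file that is missing from the returned list either has no duplicate at all or is duplicated only within production, and both cases are worth investigating.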
Requirements
System Requirements
| Component | Requirement |
|---|---|
| Python | 3.9 or higher |
| Diskover | Core installation with Elasticsearch |
| Elasticsearch | 7.x or 8.x (as supported by Diskover) |
| Memory | Minimum 2GB RAM (4GB+ recommended for large datasets) |
| Storage | Sufficient space for SQLite cache database |
| Access | Read access to indexed storage locations |
Python Dependencies
The xxhash algorithm (recommended for best performance) requires an additional Python package:
Linux:
python3 -m pip install xxhash
Windows:
python -m pip install xxhash
Verification
Verify the installation is complete:
Linux:
# Verify xxhash is installed
python3 -c "import xxhash; print(xxhash.VERSION)"
# Verify Elasticsearch connectivity
python3 -c "from diskover_elasticsearch import es_connection_cached; print(es_connection_cached().info())"
Windows:
# Verify xxhash is installed
python -c "import xxhash; print(xxhash.VERSION)"
# Verify Elasticsearch connectivity
python -c "from diskover_elasticsearch import es_connection_cached; print(es_connection_cached().info())"
Installation
The Duplicate Finder plugin is included with Diskover Professional Edition and higher. The plugin files are located in the post-index plugins directory:
Linux: /opt/diskover/plugins_postindex/diskover_dupesfinder/
Windows: C:\Program Files\Diskover\plugins_postindex\diskover_dupesfinder\
No additional installation steps are required beyond ensuring the xxhash Python dependency is installed.
Configuration
Configuration is managed through the Diskover Admin Panel. Navigate to Plugins → Post Index → Dupes Finder to access the settings.
Sample Configuration in Diskover Admin:
This is the beginning of the sample configuration. The Dupes Finder plugin has many other configuration options, covered in detail below.
Configuration Parameters
| Parameter | Default | Description |
|---|---|---|
| max_threads | 0 | Maximum hashing threads. Set to 0 for automatic detection based on CPU cores. |
| fast_hash_only | False | Use fast hash mode only (filename + size), skipping full file content hashing. |
| mode | xxhash | Hash algorithm: xxhash, md5, sha1, or sha256. |
| blocksize | 65536 | Block size in bytes for reading files. Increase for large files or network storage. |
| cachedir | /opt/diskover/plugins_postindex/diskover_hash_cache/ | Directory for SQLite cache database. |
| cache_expiretime | 0 | Cache expiry in seconds. Set to 0 for no expiration. |
| minsize | 1024 | Minimum file size to process (bytes). |
| maxsize | 1073741824 | Maximum file size to process (bytes). Default is 1GB. |
| extensions | [] | File extensions to include. Empty means all file types. |
| exclude_extensions | [] | File extensions to exclude from processing. |
| exclude_files | [] | Specific filenames to exclude. |
| exclude_dirs | [] | Directory paths to exclude. The example configuration uses paths ending in /*. |
| hardlinks | False | Include hardlinked files (nlink > 1) in processing. |
| other_query | "" | Additional Elasticsearch query to filter files. |
| restore_times | False | Restore file access and modification times after hashing. |
| replacepaths | {} | Path replacement mapping for NFS or mounted shares. |
| | False | Use actual disk mtime instead of indexed mtime for cache comparison. |
| csvdir | /tmp | Directory for CSV output files. |
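To make the restore_times option concrete, here is a minimal sketch of hashing a file and then putting its access and modification times back. It illustrates the general technique with the standard library; the function name `hash_preserving_times` and the use of MD5 are assumptions, and the plugin's own implementation may differ.

```python
import hashlib
import os


def hash_preserving_times(path, blocksize=65536):
    """Hash a file, then restore the atime/mtime recorded before reading."""
    before = os.stat(path)
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(blocksize), b""):
            digest.update(block)
    # Reading can update atime on some mounts, so write both timestamps back.
    os.utime(path, ns=(before.st_atime_ns, before.st_mtime_ns))
    return digest.hexdigest()
```

Restoring timestamps matters mostly when other tools rely on access times, for example tiering or archive policies that act on "last accessed" data.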
Hash Algorithm Selection
| Algorithm | Speed | Use Case |
|---|---|---|
| xxhash | Fastest | General duplicate detection, day-to-day operations |
| md5 | Fast | Legacy system compatibility |
| sha1 | Medium | Legacy compatibility |
| sha256 | Slowest | Backup validation, compliance requirements, archival verification |
For most duplicate detection scenarios, xxhash provides the best balance of speed and reliability. Use SHA256 when you need cryptographic verification for compliance or backup validation purposes.
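As a rough illustration of how the algorithm and block size interact, the sketch below hashes a file block by block with a selectable algorithm. It covers only the three algorithms available in Python's standard `hashlib`; xxhash is left out because it needs the separate package. The function name `hash_file` is hypothetical.

```python
import hashlib


def hash_file(path, mode="sha256", blocksize=65536):
    """Hash a file block by block with the named hashlib algorithm."""
    if mode not in ("md5", "sha1", "sha256"):
        raise ValueError(f"unsupported mode for this sketch: {mode}")
    h = hashlib.new(mode)
    with open(path, "rb") as fh:
        # A larger blocksize means fewer read calls, which helps on network storage.
        for block in iter(lambda: fh.read(blocksize), b""):
            h.update(block)
    return h.hexdigest()
```

The digest length grows with the algorithm: 32 hex characters for md5, 40 for sha1, and 64 for sha256. The block size changes only how the file is read, not the resulting digest.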
Example Configuration
A typical configuration for general storage optimization:
Plugins:
  Post Index:
    Dupes Finder:
      Default:
        max_threads: 0
        fast_hash_only: false
        mode: xxhash
        blocksize: 1048576
        cachedir: /opt/diskover/plugins_postindex/__diskover_hash_cache__/
        cache_expiretime: 0
        minsize: 1024
        maxsize: 10737418240
        extensions: []
        exclude_extensions:
          - tmp
          - log
          - bak
        exclude_files:
          - .DS_Store
          - Thumbs.db
        exclude_dirs:
          - /mnt/data/temp/*
          - /mnt/data/cache/*
        hardlinks: false
        other_query: ""
        restore_times: false
        csvdir: /var/log/diskover/reports
Path Replacement
If your indexed paths differ from the paths where Diskover can access files (common with NFS mounts), configure path replacement:
replacepaths:
  from_path: /mnt/nfs/production
  to_path: /data/production
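The effect of a path replacement can be pictured as a simple prefix swap. The sketch below assumes plain prefix matching on directory boundaries; the plugin's exact matching rules are not documented here, and `replace_path` is a hypothetical helper.

```python
def replace_path(indexed_path, from_path, to_path):
    """Swap a leading from_path prefix for to_path; leave other paths unchanged."""
    prefix = from_path.rstrip("/")
    if indexed_path == prefix or indexed_path.startswith(prefix + "/"):
        return to_path.rstrip("/") + indexed_path[len(prefix):]
    return indexed_path


print(replace_path("/mnt/nfs/production/docs/a.txt",
                   "/mnt/nfs/production", "/data/production"))
# /data/production/docs/a.txt
```

If hashing fails with "file not found" style warnings, comparing an indexed path against the result of this kind of mapping is a quick way to spot a wrong from_path or to_path.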
Execution
Manual Execution
Run Duplicate Finder from the command line to process one or more indices.
Basic Syntax
Linux:
python3 /opt/diskover/plugins_postindex/diskover_dupesfinder/diskover_dupesfinder.py [OPTIONS] INDEX [INDEX...]
Windows:
python "C:\Program Files\Diskover\plugins_postindex\diskover_dupesfinder\diskover_dupesfinder.py" [OPTIONS] INDEX [INDEX...]
Command-Line Options
| Option | Description |
|---|---|
| -n | Use a named configuration from Diskover Admin |
| | Update all file documents with hash values, not just duplicates |
| -c | Export duplicate files to CSV |
| -u | Enable SQLite hash cache |
| -f | Clear the hash cache before processing |
| | Retrieve existing hashes from another index |
| | Automatically find previous index for hash reuse |
| -r | Remove all hash and duplicate fields from index |
| -l | Auto-find latest index by top path |
| -e | Skip files that already have hash values |
| -m | Override hash algorithm (xxhash/md5/sha1/sha256) |
| --fasthash | Use fast hash mode (filename + size only) |
| -v | Enable verbose logging |
| | Enable debug logging |
| | Display version information |
Common Examples
Basic duplicate detection with caching:
Linux:
python3 /opt/diskover/plugins_postindex/diskover_dupesfinder/diskover_dupesfinder.py -u -v diskover-myindex
Windows:
python "C:\Program Files\Diskover\plugins_postindex\diskover_dupesfinder\diskover_dupesfinder.py" -u -v diskover-myindex
Export results to CSV:
Linux:
python3 /opt/diskover/plugins_postindex/diskover_dupesfinder/diskover_dupesfinder.py -u -c -v diskover-myindex
Windows:
python "C:\Program Files\Diskover\plugins_postindex\diskover_dupesfinder\diskover_dupesfinder.py" -u -c -v diskover-myindex
Cross-storage deduplication (multiple indices):
Linux:
python3 /opt/diskover/plugins_postindex/diskover_dupesfinder/diskover_dupesfinder.py -u -c diskover-storage1 diskover-storage2
Windows:
python "C:\Program Files\Diskover\plugins_postindex\diskover_dupesfinder\diskover_dupesfinder.py" -u -c diskover-storage1 diskover-storage2
Backup validation with SHA256:
Linux:
python3 /opt/diskover/plugins_postindex/diskover_dupesfinder/diskover_dupesfinder.py -m sha256 -u -c diskover-production diskover-backup
Windows:
python "C:\Program Files\Diskover\plugins_postindex\diskover_dupesfinder\diskover_dupesfinder.py" -m sha256 -u -c diskover-production diskover-backup
Quick analysis with fast hash mode:
Linux:
python3 /opt/diskover/plugins_postindex/diskover_dupesfinder/diskover_dupesfinder.py --fasthash -u -v diskover-myindex
Windows:
python "C:\Program Files\Diskover\plugins_postindex\diskover_dupesfinder\diskover_dupesfinder.py" --fasthash -u -v diskover-myindex
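Fast hash mode trades accuracy for speed because it never reads file content. The CSV Export section describes the fast hash as an MD5 of filename + size; the sketch below shows one way such a key could be built. The exact string the plugin hashes is not documented here, so the concatenation format and the `fast_hash` name are assumptions.

```python
import hashlib
import os


def fast_hash(path, size):
    """MD5 over the file's base name and size; no file content is read."""
    key = f"{os.path.basename(path)}{size}"  # assumed concatenation format
    return hashlib.md5(key.encode("utf-8")).hexdigest()
```

Two unrelated files that happen to share a name and size produce the same fast hash, so treat fast hash results as candidates and confirm with a full content hash before deleting anything.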
Process index with existing checksum data (skip already-hashed files):
Linux:
python3 /opt/diskover/plugins_postindex/diskover_dupesfinder/diskover_dupesfinder.py -e -u -v diskover-myindex
Windows:
python "C:\Program Files\Diskover\plugins_postindex\diskover_dupesfinder\diskover_dupesfinder.py" -e -u -v diskover-myindex
Sample CLI Execution of Dupes Finder:
python3 /opt/diskover/plugins_postindex/diskover_dupesfinder/diskover_dupesfinder.py -c -n "Documentation Example" -l /opt/diskover
INFO - Starting diskover dupesfinder ...
INFO - Using alternate configuration: Documentation Example
INFO - Finding latest index name for /opt/diskover ...
INFO - Found latest index diskover-build-dir-dupes-finder
INFO - Starting diskover dupes finder for indices ['diskover-build-dir-dupes-finder'] ...
INFO - Started 2 file hash threads
INFO - Using hash mode md5
INFO - Updating index mappings for hash, is_dupe, and dupe_count fields in diskover-build-dir-dupes-finder...
INFO - Done.
INFO - Starting dupes finding for index diskover-build-dir-dupes-finder...
INFO - Searching and queuing files in index diskover-build-dir-dupes-finder...
INFO - Starting checksuming first chunks of files...
INFO - Found 579 docs
INFO - Finished hashing first chunks of files. Starting full hash on potential duplicates...
INFO - Done queuing files in index diskover-build-dir-dupes-finder. Waiting for hash threads to finish...
INFO - Done.
INFO - Finding and updating any duplicate files...
INFO - *** Total files: 579 ***
INFO - *** Files hashed: 37 (93.6% reduction of total files) ***
INFO - *** Files with similar sizes: 50 (91.4% reduction of total files) ***
INFO - *** Files with similar first chunk size: 37 (93.6% reduction of total files) ***
INFO - *** Dupes found: 37 (6.4% of total files) ***
INFO -
INFO - ================================================================================
INFO - DUPLICATE FILES ANALYSIS REPORT
INFO - ================================================================================
INFO -
INFO - Total duplicate files found: 37
INFO - Unique duplicate groups: 18
INFO - Files that can be cleaned up: 19
INFO - Potential space savings: 168.02 KB (172,052 bytes)
INFO -
INFO - Cleanup efficiency: 51.4% of duplicate files can be removed
INFO - Space savings percentage: 0.18% of total indexed data
INFO -
INFO - ================================================================================
INFO - Updating 37 ES docs...
INFO - Done.
INFO - Saving results to /tmp/diskover-dupesfinder_diskover-build-dir-dupes-finder_md5_2026_04_09_22_02_23.csv
INFO - Done.
INFO - Saving duplicate analysis report to /tmp/diskover-dupesfinder-report_diskover-build-dir-dupes-finder_md5_2026_04_09_22_02_23.txt
INFO - Report saved successfully.
Automated Execution
Duplicate Finder can be scheduled to run automatically using Diskover's built-in task scheduling.
Post-Crawl Command (Index Task)
Configure Duplicate Finder to run automatically after each index scan completes by adding it as a Post-Crawl Command in your Index Task configuration.
Sample Post-Crawl Command configuration for Dupes Finder executing with an Index Task:
On your system, replace ConfigurationName in the command with the name of a configuration you created at Diskover Admin → Plugins → Post Index → Dupes Finder. If you are using the Default configuration rather than a custom one, the -n flag and the ConfigurationName are not required.
Linux Example:
| Field | Value |
|---|---|
| Post-Crawl Command | |
| Post-Crawl Command Args | |
Windows Example:
| Field | Value |
|---|---|
| Post-Crawl Command | |
| Post-Crawl Command Args | |
Available Index Task Tokens:
{indexname}: The name of the index that was just created
Custom Task
For on-demand or scheduled execution independent of indexing, create a Custom Task in the Diskover Admin Panel.
Sample Custom Task Configuration:
This shows the Run Command and arguments needed for the Custom Task. Because a Custom Task does not create an index, the {indexname} variable is not available; use the -l (toppath) CLI option and pass in the top path instead.
Reviewing the Output
Console Output
During execution, Duplicate Finder displays progress statistics including files processed, hash performance, and memory usage.
Upon completion, a summary report is displayed:
================================================================================
DUPLICATE FILES ANALYSIS REPORT
================================================================================
Total duplicate files found: 15,847
Unique duplicate groups: 4,231
Files that can be cleaned up: 11,616
Potential space savings: 847.3 GB (909,456,789,012 bytes)
Cleanup efficiency: 73.3% of duplicate files can be removed
Space savings percentage: 12.45% of total indexed data
================================================================================
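The cleanup figures in the report follow from the first two counts: one copy is kept per duplicate group, so the removable count is the total minus the number of groups. The short sketch below reproduces that arithmetic; the function name `cleanup_stats` is hypothetical and the formula is inferred from the numbers in the sample reports.

```python
def cleanup_stats(total_dupes, groups):
    """Derive the report's cleanup figures from the two counts it prints."""
    removable = total_dupes - groups  # keep one copy per group
    efficiency = round(100 * removable / total_dupes, 1)
    return removable, efficiency


print(cleanup_stats(15847, 4231))  # (11616, 73.3)
print(cleanup_stats(37, 18))       # (19, 51.4)
```

Both results match the sample reports in this article: 11,616 files at 73.3% and 19 files at 51.4%.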
CSV Export
When using the -c option, duplicate files are exported to a CSV file in the configured csvdir directory.
Filename format: diskover-dupesfinder_<indices>_<hashmode>_<timestamp>.csv
CSV columns:
| Column | Description |
|---|---|
| File | Full file path |
| Fhash(Fast Hash) | MD5 hash of filename + size |
| Hash | Full content hash (when not using fast hash mode) |
| Copy Count | Duplicate copy number (1 = original, 2+ = duplicates) |
| Size(bytes) | File size |
| Mtime(utc) | Last modified time |
| Index | Source Elasticsearch index |
| Docid | Elasticsearch document ID |
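For cleanup planning, the exported CSV can be summarized with a few lines of Python. The sketch below totals the size of every row with a copy number of 2 or higher. It assumes the CSV has a header row using the column names from the table above; the function name `reclaimable_bytes` is hypothetical.

```python
import csv
import io


def reclaimable_bytes(csv_text):
    """Sum Size(bytes) over rows whose Copy Count is 2 or higher."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sum(
        int(row["Size(bytes)"])
        for row in reader
        if int(row["Copy Count"]) >= 2
    )
```

To use it on a real export, read the file contents first, for example `reclaimable_bytes(open(path, newline="").read())`, where `path` points at a CSV in your csvdir.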
Analysis Report
A text report is also saved alongside the CSV file:
Filename format: diskover-dupesfinder-report_<indices>_<hashmode>_<timestamp>.txt
Searching in Diskover
After running Duplicate Finder, new fields become available for searching in the Diskover web interface.
New Search Fields
| Field | Type | Description |
|---|---|---|
| is_dupe | Boolean | True if file has duplicates |
| dupe_count | Integer | Copy number (1 = first copy, 2+ = additional copies) |
| hash.xxhash | Keyword | xxhash value (when using xxhash mode) |
| | Keyword | MD5 hash value |
| | Keyword | SHA1 hash value |
| | Keyword | SHA256 hash value |
| | Keyword | Fast hash value (filename + size MD5) |
Common Search Queries
Find all duplicate files:
is_dupe:true
Find files safe to remove (copies 2 and higher):
dupe_count:[2 TO *]
Find original files to keep (first copies):
dupe_count:1
Find large duplicate files (over 100MB):
is_dupe:true AND size:[104857600 TO *]
Find duplicate files by extension:
is_dupe:true AND extension:pdf
Find duplicates in a specific directory:
is_dupe:true AND parent_path:/mnt/data/projects/*
Find files with a specific hash value:
hash.xxhash:ef46db3751d8e999
Find all files that have been hashed:
hash:*
Combine multiple criteria (large video duplicates that can be removed):
is_dupe:true AND dupe_count:[2 TO *] AND size:[104857600 TO *] AND extension:(mp4 OR mov OR avi)
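If you build these queries from a script, composing the string programmatically avoids typos. The sketch below assembles the combined query shown above and wraps it in an Elasticsearch `query_string` request body. That the Diskover search bar maps to `query_string` is an assumption, and `dupe_query` is a hypothetical helper.

```python
import json


def dupe_query(min_size=None, extensions=()):
    """Compose a query string for removable duplicates from documented fields."""
    parts = ["is_dupe:true", "dupe_count:[2 TO *]"]
    if min_size is not None:
        parts.append(f"size:[{min_size} TO *]")
    if extensions:
        parts.append("extension:(" + " OR ".join(extensions) + ")")
    return " AND ".join(parts)


# Request body for an Elasticsearch search; sending it is left to your client.
body = {"query": {"query_string": {"query": dupe_query(104857600, ["mp4", "mov", "avi"])}}}
print(json.dumps(body))
```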
Sample Diskover query output for duplicate files that exist more than twice:
Working with Existing Checksum Data
If you have previously run the Checksums plugin against an index, Duplicate Finder can leverage that existing hash data rather than re-hashing files.
Skip files that already have hash values:
Linux:
python3 /opt/diskover/plugins_postindex/diskover_dupesfinder/diskover_dupesfinder.py -e -u diskover-myindex
Windows:
python "C:\Program Files\Diskover\plugins_postindex\diskover_dupesfinder\diskover_dupesfinder.py" -e -u diskover-myindex
This approach is useful when you want to add duplicate detection to indices that already have checksum metadata, without the overhead of re-hashing all files.
Troubleshooting
No Duplicates Found
Symptom: Plugin completes but reports no duplicates despite expecting some.
Possible causes:
File size filters (minsize/maxsize) may be excluding your target files
Extension filters may be too restrictive
Files may have unique sizes (the first-stage filter eliminates them early)
Resolution:
Run with -v or -V to see the Elasticsearch query being used
Check that your filter settings include the files you expect
Verify files exist in the index with the expected attributes
xxhash Module Not Found
Symptom: Error message "Missing xxhash Python module"
Resolution:
Linux:
python3 -m pip install xxhash
Windows:
python -m pip install xxhash
Permission Denied Errors
Symptom: Warnings about unable to open or read files.
Resolution:
Ensure the Diskover service account has read access to all indexed file paths
For NFS mounts, verify export options include appropriate read permissions
Check that the replacepaths configuration correctly maps indexed paths to accessible paths
Slow Performance
Symptom: Hashing takes longer than expected.
Resolution:
Increase blocksize for large files or network storage (try 1MB: 1048576)
Use xxhash mode instead of SHA256 for better performance
Enable caching (-u) to avoid re-hashing unchanged files on subsequent runs
Use --fasthash for quick initial analysis when full content verification isn't required
Cache Not Working
Symptom: Files are being re-hashed on every run despite using -u.
Resolution:
Verify the cache directory exists and is writable
Check if file modification times are changing between runs
Try flushing and rebuilding the cache with -f -u
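To see why changing modification times defeat the cache, it helps to picture how a hash cache lookup typically works. The sketch below is a minimal SQLite cache keyed on path, size, and mtime. The table layout and function names are hypothetical and are not the plugin's actual schema.

```python
import sqlite3


def open_cache(db_path=":memory:"):
    """Create the (hypothetical) cache table if it does not exist yet."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hashes ("
        "path TEXT PRIMARY KEY, size INTEGER, mtime REAL, digest TEXT)"
    )
    return conn


def cached_digest(conn, path, size, mtime):
    """Return the stored digest only if size and mtime still match."""
    row = conn.execute(
        "SELECT digest FROM hashes WHERE path=? AND size=? AND mtime=?",
        (path, size, mtime),
    ).fetchone()
    return row[0] if row else None


def store_digest(conn, path, size, mtime, digest):
    """Insert or overwrite the cache entry for a path."""
    conn.execute(
        "INSERT OR REPLACE INTO hashes VALUES (?, ?, ?, ?)",
        (path, size, mtime, digest),
    )
    conn.commit()
```

In a design like this, any process that touches a file's mtime between runs (backup agents, sync tools, antivirus) turns every lookup into a miss, which shows up as files being re-hashed despite caching being enabled.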
Removing Duplicate Data from Index
To completely remove all hash and duplicate fields from an index and start fresh:
Linux:
python3 /opt/diskover/plugins_postindex/diskover_dupesfinder/diskover_dupesfinder.py -r diskover-myindex
Windows:
python "C:\Program Files\Diskover\plugins_postindex\diskover_dupesfinder\diskover_dupesfinder.py" -r diskover-myindex
Support
Last Updated: April 2026