Checksums Post-Index Plugin

License: PRO+ (Professional Edition or higher)
Plugin Type: Post-Index Plugin
Author: Diskover Data, Inc.

Overview

The Checksums plugin generates and stores hash values (xxhash, MD5, SHA1, or SHA256) for files in your Diskover indices after a scan completes. These hashes serve as digital fingerprints that enable data integrity verification, migration validation, and duplicate detection.

Think of checksums as unique identifiers for your file contents. Two files with identical contents will always produce the same hash, while even a single byte change results in a completely different value. This makes checksums invaluable for tracking changes, validating data transfers, finding duplicates, and proving file authenticity.
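To make the fingerprint idea concrete, here is a minimal sketch using Python's standard hashlib module. It is an illustration only, not the plugin's own code, and the sample byte strings are made up.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


# Made-up sample contents, for illustration only.
original = b"quarterly report v1"
identical_copy = b"quarterly report v1"
one_byte_changed = b"quarterly report v2"

# Identical contents always give the same digest.
same = sha256_hex(original) == sha256_hex(identical_copy)
# A single changed byte gives a different digest.
different = sha256_hex(original) != sha256_hex(one_byte_changed)
print(same, different)
```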

Why Use This Plugin?

Verify data integrity — Confirm files haven't been corrupted or modified by comparing hashes over time
Validate migrations — Compare checksums before and after data moves to ensure nothing was lost or corrupted
Enable duplicate detection — Generate the hash values that the Dupesfinder plugin uses to identify duplicate files across your storage

Sample data from a Checksums execution:

Here we can see the entire Checksums array containing the FHash, MD5, and SHA256 values. The MD5 and SHA256 checksums have also been placed in their own columns.

Use Cases

Data Integrity Verification

Ensuring your files remain unchanged over time is critical for long-term storage, archival systems, and compliance requirements. The Checksums plugin lets you establish a baseline hash for each file, then verify those hashes later to confirm nothing has been corrupted or tampered with.

Typical workflow:

Run an initial Diskover scan of your storage
Execute the Checksums plugin to generate hash values for all files
Periodically re-scan and re-run checksums
Compare hash values to identify any files that have changed unexpectedly

This approach is particularly valuable for cold storage and archive systems where files should never change, or for compliance scenarios where you need to demonstrate file integrity over time.
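The comparison step in the workflow above can be sketched as follows. This assumes you have already collected two snapshots as plain path-to-hash dictionaries (for example from two CSV exports); the function name and the sample paths are hypothetical.

```python
def find_changed_files(baseline: dict, current: dict) -> dict:
    """Compare two {path: hash} snapshots taken at different times."""
    shared = baseline.keys() & current.keys()
    return {
        # Same path, different hash: content changed between runs.
        "changed": sorted(p for p in shared if baseline[p] != current[p]),
        # Present in the baseline but gone now.
        "missing": sorted(baseline.keys() - current.keys()),
        # New since the baseline was taken.
        "added": sorted(current.keys() - baseline.keys()),
    }


baseline = {"/archive/a.dat": "11", "/archive/b.dat": "22", "/archive/c.dat": "33"}
current = {"/archive/a.dat": "11", "/archive/b.dat": "99", "/archive/d.dat": "44"}
report = find_changed_files(baseline, current)
print(report)
```

On an archive where files should never change, any entry under "changed" is worth investigating.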

Migration Validation

When moving data between storage systems, network locations, or cloud platforms, checksums provide definitive proof that every file arrived intact. Rather than simply comparing file counts and sizes, you can verify the actual content of each file matches the original.

Typical workflow:

Index your source storage location with Diskover

Run the Checksums plugin with CSV export enabled to create a hash manifest:

Linux:

python3 /opt/diskover/plugins_postindex/diskover_checksums/diskover_checksums.py -c -u diskover-source-index

Windows (PowerShell):

python "C:\Program Files\Diskover\plugins_postindex\diskover_checksums\diskover_checksums.py" -c -u diskover-source-index

Perform your data migration
Index the destination storage location with Diskover

Run the Checksums plugin on the destination index:

Linux:

python3 /opt/diskover/plugins_postindex/diskover_checksums/diskover_checksums.py -c -u diskover-destination-index

Windows (PowerShell):

python "C:\Program Files\Diskover\plugins_postindex\diskover_checksums\diskover_checksums.py" -c -u diskover-destination-index

Compare the CSV exports to verify all files transferred correctly

Any mismatched hashes indicate files that were corrupted during transfer and need to be re-copied.
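One possible way to compare the two CSV exports is sketched below. The column names `File` and `Hash(sha256)` are assumptions based on the CSV column descriptions in this document, so check the header row of your own exports. The mount points and inline CSV text are made up so the example runs on its own.

```python
import csv
import io


def load_manifest(csv_text: str, root: str, hash_column: str) -> dict:
    """Map each path (relative to root) to its hash value."""
    manifest = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        path = row["File"]
        if path.startswith(root):
            # Strip the storage-specific root so both sides line up.
            path = path[len(root):].lstrip("/")
        manifest[path] = row[hash_column]
    return manifest


def compare_manifests(source: dict, destination: dict) -> dict:
    """Report files whose hashes differ or that never arrived."""
    return {
        "mismatched": sorted(
            p for p in source if p in destination and source[p] != destination[p]
        ),
        "missing": sorted(p for p in source if p not in destination),
    }


# Hypothetical exports; real files would be read from the configured csvdir.
source_csv = "File,Hash(sha256)\n/mnt/old/a.txt,aaa\n/mnt/old/b.txt,bbb\n/mnt/old/c.txt,ccc\n"
dest_csv = "File,Hash(sha256)\n/mnt/new/a.txt,aaa\n/mnt/new/b.txt,xxx\n"

result = compare_manifests(
    load_manifest(source_csv, "/mnt/old", "Hash(sha256)"),
    load_manifest(dest_csv, "/mnt/new", "Hash(sha256)"),
)
print(result)
```

Paths are compared relative to each storage root because the source and destination are usually mounted at different locations.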

Duplicate Detection Preparation

The Checksums plugin generates the hash values that the Dupesfinder plugin uses to identify duplicate files across your storage. Running checksums first enriches your index with the metadata needed for accurate duplicate detection later.

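Why this matters: Without hash values, duplicate detection would rely solely on filename and size matching, which can produce false positives (different files with the same name and size) and miss true duplicates (identical files with different names). Hash-based detection compares content instead: two files with the same content hash have, for practical purposes, identical content. Collisions are possible in principle, most notably with non-cryptographic or weak algorithms such as xxhash and MD5, but they are rare for ordinary files.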
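As a rough illustration of content-based grouping (not Dupesfinder's actual logic), the sketch below groups paths by a content hash. The paths and the short hash strings are placeholders.

```python
from collections import defaultdict


def group_duplicates(files: list) -> list:
    """Group paths by content hash; keep only groups with more than one path."""
    by_hash = defaultdict(list)
    for path, content_hash in files:
        by_hash[content_hash].append(path)
    return [sorted(paths) for paths in by_hash.values() if len(paths) > 1]


files = [
    ("/data/report.pdf", "h1"),
    ("/backup/report-copy.pdf", "h1"),   # same content, different name
    ("/data/archive/report.pdf", "h2"),  # same name, different content
]
duplicates = group_duplicates(files)
print(duplicates)
```

Name-based matching would pair the two files called report.pdf, while hash-based grouping pairs the two files that share content.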

Typical workflow:

Run the Checksums plugin to generate hash values:

Linux:

python3 /opt/diskover/plugins_postindex/diskover_checksums/diskover_checksums.py -u diskover-myindex

Windows (PowerShell):

python "C:\Program Files\Diskover\plugins_postindex\diskover_checksums\diskover_checksums.py" -u diskover-myindex

Run the Dupesfinder plugin to identify files with matching hashes:

Linux:

python3 /opt/diskover/plugins_postindex/diskover_dupesfinder/diskover_dupesfinder.py -U diskover-myindex

Windows (PowerShell):

python "C:\Program Files\Diskover\plugins_postindex\diskover_dupesfinder\diskover_dupesfinder.py" -U diskover-myindex

This two-step approach gives you complete control over when hashing occurs and which files to include, while keeping the duplicate detection logic separate and flexible.

Installation

DNF Installation (Linux RPM)

On Linux systems using DNF package management, this plugin can be installed via RPM:

sudo dnf install diskover-plugin-postindex-checksums

Note: Ensure your system is configured with the Diskover RPM repository before running the install command.

Prerequisites

Component	Requirement
Python	3.9 or higher
Diskover	Core installation with Elasticsearch
Elasticsearch	7.x or 8.x (as supported by Diskover)
Storage Access	Read access to the indexed storage paths
Memory	Minimum 2GB RAM (4GB+ recommended for large datasets)

Python Dependencies

Package	Purpose
xxhash	Fast non-cryptographic hashing (required for xxhash mode)

Installation Steps

Ensure the plugin file is in your Diskover plugins directory:

Linux:

/opt/diskover/plugins_postindex/diskover_checksums/diskover_checksums.py

Windows:

C:\Program Files\Diskover\plugins_postindex\diskover_checksums\diskover_checksums.py

Install the required Python dependency:

Linux:
```
python3 -m pip install xxhash
```
Windows (PowerShell):
```
python -m pip install xxhash
```

Verify the installation:

Linux:

python3 -c "import xxhash; print('xxhash version:', xxhash.VERSION)"

Windows (PowerShell):

python -c "import xxhash; print('xxhash version:', xxhash.VERSION)"

Verify Elasticsearch connectivity:

Linux:

python3 -c "from diskover_elasticsearch import es_connection_cached; print(es_connection_cached().info())"

Windows (PowerShell):

python -c "from diskover_elasticsearch import es_connection_cached; print(es_connection_cached().info())"

Configuration

Configuration is managed through the Diskover Admin Panel. Navigate to Plugins → Post Index → Checksums to access the settings.

Here is the beginning of our sample configuration, which performs a SHA256 hash on the files. The Checksums plugin has many other configuration options, covered in detail below.

Configuration Parameters

Option	Type	Default	Description
`maxthreads`	int	0	Number of hashing threads. 0 = auto-detect based on CPU cores
`fast_hash_only`	boolean	false	Use fast hash only (filename+size MD5), skip reading file contents
`hash_mode`	string	xxhash	Algorithm: xxhash, md5, sha1, or sha256
`blocksize`	int	65536	Block size in bytes for reading files
`cache_dir`	string	(see below)	Directory for SQLite cache database
`cache_expire_time`	int	0	Cache expiry in seconds (0 = never expire)
`min_size`	int	1024	Minimum file size to hash (bytes)
`max_size`	int	1073741824	Maximum file size to hash (bytes, default 1GB)
`extensions`	list	[]	File extensions to include (empty = all)
`exclude_extensions`	list	[]	File extensions to exclude
`exclude_files`	list	[]	Filenames to exclude
`exclude_dirs`	list	[]	Directory paths to exclude (use `*` for recursive)
`hardlinks`	boolean	false	Include hardlinks (nlink > 1)
`other_query`	string	""	Additional Elasticsearch query to filter files
`restore_times`	boolean	false	Restore atime/mtime after hashing
`replace_paths`	object	{}	Path replacement for NFS/mounted shares
`use_disk_mtime`	boolean	false	Use disk mtime instead of index mtime for cache comparison
`csvdir`	string	/tmp	Directory for CSV output files

Default cache directory:

Linux: /opt/diskover/plugins_postindex/__diskover_hash_cache__/
Windows: C:\Program Files\Diskover\plugins_postindex\__diskover_hash_cache__\

Hash Algorithm Selection

Choosing the right hash algorithm depends on your use case. Here's how they compare:

Algorithm	Speed	Security Level	Best For	Hash Length
xxhash	Fastest	Non-cryptographic	Duplicate detection, general integrity checks	16 chars
md5	Fast	Weak (collisions possible)	Legacy system compatibility	32 chars
sha1	Medium	Deprecated for security	Legacy compatibility	40 chars
sha256	Slowest	Strong (cryptographic)	Compliance, archival, security requirements	64 chars

Recommendations:

For duplicate detection: Use xxhash for maximum performance. Since you're comparing files within your own storage, cryptographic security isn't necessary.
For data integrity verification: Use xxhash for routine checks, or sha256 if you need cryptographic assurance.
For compliance requirements (SOX, HIPAA, GDPR): Use sha256 to meet regulatory standards for file integrity monitoring.
For migration validation: Use sha256 when you need definitive proof of data integrity, or xxhash for faster validation of large datasets.
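The hash lengths in the table can be checked for the algorithms that ship with Python. This short sketch uses only the standard hashlib module. xxhash is left out because it is a third-party package; the 16-character length listed for it is consistent with a 64-bit digest.

```python
import hashlib

sample = b"diskover"

# Length of the hex digest for each algorithm available in hashlib.
lengths = {
    name: len(hashlib.new(name, sample).hexdigest())
    for name in ("md5", "sha1", "sha256")
}
print(lengths)
```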

Fast Hash Mode

The fast hash mode generates an MD5 of the filename combined with file size, without reading the file contents:

fhash = md5(filename + filesize)

This provides extremely fast fingerprinting suitable for quick duplicate detection based on name and size, but is not suitable for integrity verification since it doesn't read actual file content.
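A minimal sketch of the idea is shown below. How the plugin actually joins the filename and size before hashing is not documented here, so the plain string concatenation is an assumption and the resulting values may not match the plugin's `fhash` field.

```python
import hashlib


def fast_hash(filename: str, size_bytes: int) -> str:
    """MD5 over filename + size; the exact string layout is an assumption."""
    key = f"{filename}{size_bytes}"
    return hashlib.md5(key.encode("utf-8")).hexdigest()


# Same name and size give the same fast hash, whatever the contents are.
first = fast_hash("report.pdf", 2048)
second = fast_hash("report.pdf", 2048)
resized = fast_hash("report.pdf", 2049)
print(first == second, first == resized)
```

Because the file is never opened, two different files that happen to share a name and size get the same fast hash.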

Example Configuration: Production Environment

maxthreads: 8
fast_hash_only: false
hash_mode: xxhash
blocksize: 1048576  # 1MB for large files
cache_dir: /opt/diskover/plugins_postindex/__diskover_hash_cache__/
cache_expire_time: 0
min_size: 1024
max_size: 10737418240  # 10GB
extensions: []
exclude_extensions:
  - tmp
  - log
  - bak
exclude_files:
  - .DS_Store
  - Thumbs.db
exclude_dirs:
  - /mnt/data/temp/*
  - /mnt/data/cache/*
hardlinks: false
restore_times: false
csvdir: /var/log/diskover/checksums

Path Replacement Configuration

For environments where indexed paths differ from accessible paths (common with NFS mounts or when the indexing server accesses storage differently than the checksum worker), configure path replacement:

replace_paths:
  enable: true
  from_path: /mnt/nfs/production
  to_path: /data/production
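Conceptually, path replacement is a prefix swap. The helper below is a hypothetical sketch of that behaviour, not the plugin's implementation. It only rewrites paths that sit at or under from_path.

```python
def replace_path(path: str, from_path: str, to_path: str) -> str:
    """Swap a leading from_path prefix for to_path; leave other paths alone."""
    prefix = from_path.rstrip("/")
    if path == prefix or path.startswith(prefix + "/"):
        return to_path.rstrip("/") + path[len(prefix):]
    return path


indexed = "/mnt/nfs/production/projects/plan.docx"
accessible = replace_path(indexed, "/mnt/nfs/production", "/data/production")
print(accessible)
```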

Filtering Options

You can control which files get hashed using several filtering mechanisms:

Size-based filtering:

min_size: 1024        # Skip files smaller than 1KB
max_size: 10737418240 # Skip files larger than 10GB

Extension-based filtering:

# Only hash specific file types
extensions:
  - pdf
  - docx
  - xlsx

# Or exclude specific file types
exclude_extensions:
  - tmp
  - log
  - bak
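The size and extension filters can be pictured as a simple predicate. The sketch below is an assumption about how they combine: inclusive size bounds (matching the diagnostic query in the Troubleshooting section) and case-insensitive extension matching. The function name is hypothetical.

```python
def passes_filters(
    name: str,
    size: int,
    min_size: int = 1024,
    max_size: int = 1073741824,
    extensions: tuple = (),
    exclude_extensions: tuple = (),
) -> bool:
    """Return True if a file would be hashed under the given filters."""
    if not (min_size <= size <= max_size):
        return False
    ext = name.rsplit(".", 1)[-1].lower() if "." in name else ""
    # An empty extensions list means "all extensions".
    if extensions and ext not in extensions:
        return False
    return ext not in exclude_extensions


print(passes_filters("scan.pdf", 2048))
print(passes_filters("scan.pdf", 512))
print(passes_filters("build.tmp", 2048, exclude_extensions=("tmp", "log", "bak")))
```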

Directory exclusions:

exclude_dirs:
  - /mnt/data/temp           # Exact match
  - /mnt/data/cache/*        # Recursive (all subdirectories)
  - /mnt/data/.snapshot/*    # NetApp snapshots
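The difference between an exact entry and a recursive entry ending in `/*` might be modelled as shown below. This is a sketch under the assumption that a recursive entry covers the named directory and everything beneath it. The plugin's real matching rules may differ.

```python
def is_excluded(dir_path: str, exclude_dirs: list) -> bool:
    """Check a directory against exact and recursive (trailing /*) entries."""
    for pattern in exclude_dirs:
        if pattern.endswith("/*"):
            base = pattern[:-2]
            # Assumed: recursive entries cover the base and all subdirectories.
            if dir_path == base or dir_path.startswith(base + "/"):
                return True
        elif dir_path == pattern:
            return True
    return False


excludes = ["/mnt/data/temp", "/mnt/data/cache/*"]
print(is_excluded("/mnt/data/temp/sub", excludes))
print(is_excluded("/mnt/data/cache/a/b", excludes))
```

Under this model, the exact entry leaves /mnt/data/temp/sub eligible for hashing, while the recursive entry excludes every directory below /mnt/data/cache.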

Custom Elasticsearch queries:

# Only hash files modified in the last 30 days
other_query: "mtime:[now-30d TO now]"

# Only hash files with specific tags
other_query: "tags:important"

# Combine multiple conditions
other_query: "owner:dataadmin AND mtime:[now-7d TO now]"

Running the Plugin

Basic Usage

Run the plugin from the command line, specifying the index to process:

Linux:

python3 /opt/diskover/plugins_postindex/diskover_checksums/diskover_checksums.py diskover-indexname

Windows (PowerShell):

python "C:\Program Files\Diskover\plugins_postindex\diskover_checksums\diskover_checksums.py" diskover-indexname

Command Line Options

Option	Description
`-a, --configurationname NAME`	Use a specific named configuration
`-c, --csv`	Export hash results to CSV file
`-u, --usecache`	Enable SQLite hash cache
`-f, --flushcache`	Clear the hash cache before processing
`-U, --useindex INDEX`	Retrieve hashes from an existing index
`--useindexauto`	Auto-find previous index for hash reuse
`-r, --removehashes`	Remove all hash fields from index
`-l, --latestindex TOPPATH`	Auto-find latest index by top path
`-e, --excludehashes`	Skip files that already have hash values
`-m, --hashmode MODE`	Override hash algorithm (xxhash/md5/sha1/sha256)
`--fasthash`	Use fast hash mode (overrides config)
`-v`	Enable verbose logging
`-V`	Enable very verbose (debug) logging
`--version`	Print version and exit

Example: Basic Checksums with Caching

Enable caching to avoid re-hashing unchanged files on subsequent runs:

Linux:

python3 /opt/diskover/plugins_postindex/diskover_checksums/diskover_checksums.py -u -v diskover-myindex

Windows (PowerShell):

python "C:\Program Files\Diskover\plugins_postindex\diskover_checksums\diskover_checksums.py" -u -v diskover-myindex

Example: SHA256 for Compliance with CSV Export

Generate SHA256 hashes (for compliance requirements) and export results to a CSV file:

Linux:

python3 /opt/diskover/plugins_postindex/diskover_checksums/diskover_checksums.py -m sha256 -c -u diskover-myindex

Windows (PowerShell):

python "C:\Program Files\Diskover\plugins_postindex\diskover_checksums\diskover_checksums.py" -m sha256 -c -u diskover-myindex

Example: Skip Already-Hashed Files

When running incrementally, skip files that already have hash values:

Linux:

python3 /opt/diskover/plugins_postindex/diskover_checksums/diskover_checksums.py -e -u diskover-myindex

Windows (PowerShell):

python "C:\Program Files\Diskover\plugins_postindex\diskover_checksums\diskover_checksums.py" -e -u diskover-myindex

Example: Process Multiple Indices

Hash files across multiple indices in a single run:

Linux:

python3 /opt/diskover/plugins_postindex/diskover_checksums/diskover_checksums.py -u diskover-index1 diskover-index2 diskover-index3

Windows (PowerShell):

python "C:\Program Files\Diskover\plugins_postindex\diskover_checksums\diskover_checksums.py" -u diskover-index1 diskover-index2 diskover-index3

Example: Reuse Hashes from Previous Index

When you have a new index of the same storage location, reuse hashes from the previous index for unchanged files:

Linux:

python3 /opt/diskover/plugins_postindex/diskover_checksums/diskover_checksums.py --useindexauto -u diskover-newindex

Windows (PowerShell):

python "C:\Program Files\Diskover\plugins_postindex\diskover_checksums\diskover_checksums.py" --useindexauto -u diskover-newindex

Example: Remove All Hashes from an Index

If you need to clear hash data and start fresh:

Linux:

python3 /opt/diskover/plugins_postindex/diskover_checksums/diskover_checksums.py -r diskover-myindex

Windows (PowerShell):

python "C:\Program Files\Diskover\plugins_postindex\diskover_checksums\diskover_checksums.py" -r diskover-myindex

Setting Up Automated Checksums

To run checksums automatically, use Diskover's built-in task scheduling features.

Option 1: Custom Task

Create a Custom Task in Diskover Admin to run checksums on a defined schedule.

Navigate to Task Panel → Custom Tasks in Diskover Admin
Create a new Custom Task with the appropriate configuration
Configure the schedule (daily, weekly, etc.)
Save and enable the task

Here we can see the Run Command and args needed for the Custom Task. Note that the {indexname} variable cannot be used here, because a Custom Task does not create an index. Instead, use the -l (toppath) CLI option and pass in your top path.

Option 2: Post-Crawl Command (Index Task)

Run checksums automatically after each index completes by adding it as a Post-Crawl Command. This ensures your index is always enriched with hash metadata immediately after scanning.

Navigate to Task Panel → Index Tasks in Diskover Admin
Edit the Index Task you want to trigger checksums from
Add the Post-Crawl Command configuration:

Linux Example:

Field	Value
Post-Crawl Command	`python3`
Post-Crawl Command Args	`/opt/diskover/plugins_postindex/diskover_checksums/diskover_checksums.py -u -v {indexname}`

Windows Example:

Field	Value
Post-Crawl Command	`python`
Post-Crawl Command Args	`"C:\Program Files\Diskover\plugins_postindex\diskover_checksums\diskover_checksums.py" -u -v {indexname}`

Available Index Task Tokens:

{indexname} — The name of the index that was just created

Important:

The Post-Crawl Command field should contain ONLY the executable (e.g., python3, python)
All script paths, flags, and arguments go in the Post-Crawl Command Args field

If your command includes -a ConfigurationName, replace ConfigurationName with a named configuration that you have created at Diskover Admin → Plugins → Post-Index → Checksums. If you are using the Default configuration rather than a custom one, the -a flag and the ConfigurationName are not required.

Reviewing the Output

Log Output

A successful run displays progress information and final statistics:

INFO - Starting diskover checksums for indices ['diskover-myindex'] ...
INFO - Using hash mode xxhash
INFO - Started 8 file hash threads
INFO - Found 15247 docs
INFO - Queuing files from index diskover-myindex...
INFO - STATS (files hashed 5000 (32.8%), files in queue 150, elapsed 0:02:34, perf 32.5 files/s, memory usage 245MB)
INFO - Done file checksuming for index diskover-myindex
INFO - *** Elapsed time 0:07:48 ***
INFO - *** Total files: 15247 ***
INFO - *** Files hashed: 15247 (0.0% reduction of total files) ***

Key metrics to watch:

Files hashed — Total number of files processed
Perf (files/s) — Processing speed, useful for estimating completion time
Memory usage — Monitor this if processing large datasets
Reduction percentage — Shows how many files were skipped due to caching or filters
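The perf figure can be turned into a rough completion estimate. The sketch below parses a STATS line in the format shown in the sample log and divides the remaining files by the current rate. The regular expression is an assumption based on that one sample line.

```python
import re

# Captures the running file count and the files/s rate from a STATS line.
STATS_PATTERN = re.compile(r"files hashed (\d+) .*?perf ([\d.]+) files/s")


def estimate_remaining_seconds(stats_line: str, total_files: int) -> float:
    """Estimate seconds left from one STATS line and the total file count."""
    match = STATS_PATTERN.search(stats_line)
    if match is None:
        raise ValueError("not a STATS line")
    hashed = int(match.group(1))
    files_per_second = float(match.group(2))
    return (total_files - hashed) / files_per_second


line = (
    "INFO - STATS (files hashed 5000 (32.8%), files in queue 150, "
    "elapsed 0:02:34, perf 32.5 files/s, memory usage 245MB)"
)
remaining = estimate_remaining_seconds(line, 15247)
print(round(remaining), "seconds remaining")
```

For the sample log this gives roughly 315 seconds, which matches the gap between the STATS line (0:02:34) and the final elapsed time (0:07:48).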

CSV Export

When using the -c flag, CSV files are saved to the configured csvdir with the naming pattern:

diskover-checksums_<index>_<hashmode>_<YYYY_MM_DD_HH_MM_SS>.csv

Example: diskover-checksums_diskover-prod_xxhash_2025_01_15_14_30_00.csv
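If you need to locate or generate matching file names in your own scripts, the pattern can be reproduced with a few lines of Python. The helper name is hypothetical. The format string follows the naming pattern above.

```python
from datetime import datetime


def csv_filename(index: str, hash_mode: str, when: datetime) -> str:
    """Build a file name following the documented naming pattern."""
    stamp = when.strftime("%Y_%m_%d_%H_%M_%S")
    return f"diskover-checksums_{index}_{hash_mode}_{stamp}.csv"


name = csv_filename("diskover-prod", "xxhash", datetime(2025, 1, 15, 14, 30, 0))
print(name)
```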

CSV columns (full hash mode):

Column	Description
File	Full file path
Fhash(Fast Hash)	MD5 of filename+size
Hash(algorithm)	Full file content hash
Size(bytes)	File size
Mtime(utc)	Modified time in UTC
Index	Elasticsearch index name
Docid	Elasticsearch document ID

CSV columns (fast hash only mode):

Column	Description
File	Full file path
Fhash(Fast Hash)	MD5 of filename+size
Size(bytes)	File size
Mtime(utc)	Modified time in UTC
Index	Elasticsearch index name
Docid	Elasticsearch document ID

Cache Database

When caching is enabled (-u), a SQLite database stores hash values to avoid re-hashing unchanged files. The cache uses:

Key: MD5 hash of the file path
Value: Hash value and file mtime

On subsequent runs, if a file's mtime matches the cached entry, the stored hash is reused instead of re-reading the file. This dramatically speeds up repeated runs on the same storage.
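The lookup logic described above can be sketched with Python's built-in sqlite3 module. The table layout, column names, and function names here are hypothetical. Only the key (MD5 of the path) and the stored values (hash and mtime) come from the description above.

```python
import hashlib
import sqlite3


def cache_key(path: str) -> str:
    """Cache key: MD5 hash of the file path."""
    return hashlib.md5(path.encode("utf-8")).hexdigest()


def open_cache() -> sqlite3.Connection:
    """Open an in-memory cache; a real cache would live in cache_dir."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, hash TEXT, mtime REAL)"
    )
    return conn


def store_hash(conn: sqlite3.Connection, path: str, file_hash: str, mtime: float) -> None:
    """Insert or update the cached hash for a path."""
    conn.execute(
        "INSERT OR REPLACE INTO cache (key, hash, mtime) VALUES (?, ?, ?)",
        (cache_key(path), file_hash, mtime),
    )
    conn.commit()


def get_cached_hash(conn: sqlite3.Connection, path: str, mtime: float):
    """Return the cached hash if the mtime still matches, else None."""
    row = conn.execute(
        "SELECT hash, mtime FROM cache WHERE key = ?", (cache_key(path),)
    ).fetchone()
    if row is not None and row[1] == mtime:
        return row[0]
    return None


conn = open_cache()
store_hash(conn, "/data/a.bin", "abc123", 1700000000.0)
hit = get_cached_hash(conn, "/data/a.bin", 1700000000.0)   # mtime unchanged
miss = get_cached_hash(conn, "/data/a.bin", 1700000500.0)  # file was modified
print(hit, miss)
```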

Sample output of initial execution with Cache enabled:

2026-04-09 19:38:02,635 - diskover.plugin.checksums - INFO - Done file checksuming for index diskover-build-dir-checksums
2026-04-09 19:38:02,635 - diskover.plugin.checksums - INFO - *** Total files: 578 ***
2026-04-09 19:38:02,635 - diskover.plugin.checksums - INFO - *** Files hashed: 578 (0.0% reduction of total files) ***
2026-04-09 19:38:02,636 - diskover.cache - INFO - CACHE HITS: 0, MISSES: 578, HIT RATIO: 0.0% (/opt/diskover/plugins_postindex/__diskover_hash_cache__/)
2026-04-09 19:38:02,636 - diskover.cache - INFO - Closing cache DB /opt/diskover/plugins_postindex/__diskover_hash_cache__/cache_database.db...
2026-04-09 19:38:02,643 - diskover.cache - INFO - Cache DB /opt/diskover/plugins_postindex/__diskover_hash_cache__/cache_database.db closed

Here we can see that there was a 0.0% hit ratio on the Cache and all 578 files were hashed!

Sample output of second execution with Cache enabled:

2026-04-09 19:38:08,453 - diskover.plugin.checksums - INFO - Done file checksuming for index diskover-build-dir-checksums
2026-04-09 19:38:08,453 - diskover.plugin.checksums - INFO - *** Total files: 578 ***
2026-04-09 19:38:08,453 - diskover.plugin.checksums - INFO - *** Files hashed: 0 (0.0% reduction of total files) ***
2026-04-09 19:38:08,453 - diskover.cache - INFO - CACHE HITS: 578, MISSES: 0, HIT RATIO: 100.0% (/opt/diskover/plugins_postindex/__diskover_hash_cache__/)
2026-04-09 19:38:08,454 - diskover.cache - INFO - Closing cache DB /opt/diskover/plugins_postindex/__diskover_hash_cache__/cache_database.db...
2026-04-09 19:38:08,454 - diskover.cache - INFO - Cache DB /opt/diskover/plugins_postindex/__diskover_hash_cache__/cache_database.db closed

Here we can see a 100% hit ratio on the Cache (as no files were modified between executions) and that 0 files were actually hashed!

Searching in Diskover

After running the Checksums plugin, hash values are stored in Elasticsearch and become searchable through the Diskover web interface. This enables powerful queries for data integrity verification and duplicate investigation.

Available Hash Fields

The plugin creates the following searchable fields:

Field	Description
`hash.fhash`	Fast hash (MD5 of filename+size)
`hash.xxhash`	xxhash content hash
`hash.md5`	MD5 content hash
`hash.sha1`	SHA1 content hash
`hash.sha256`	SHA256 content hash

Find Files by Specific Hash Value

Search for a specific hash to find all files with that exact content:

hash.xxhash: ef46db3751d8e999

hash.sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855

hash.md5: d41d8cd98f00b204e9800998ecf8427e
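A note on the sample values: the MD5 and SHA256 strings above are the well-known digests of empty input, which the short check below confirms with the standard hashlib module. The xxhash value appears to be the 64-bit xxHash of empty input as well, but that is not checked here because xxhash is a third-party package.

```python
import hashlib

# Digests of zero bytes of input.
empty_md5 = hashlib.md5(b"").hexdigest()
empty_sha256 = hashlib.sha256(b"").hexdigest()
print(empty_md5)
print(empty_sha256)
```

Searching for these values would therefore match zero-byte files, which are presumably only hashed if min_size is lowered from its default of 1024 bytes.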

Find All Files with Hashes

hash: *

Find Files with a Specific Hash Type

hash.sha256: *

hash.xxhash: *

Find Files Without Hashes

Useful for identifying files that weren't processed (perhaps due to size filters):

type:file AND NOT hash: *

Combine Hash Searches with Other Criteria

Find hashed PDF files larger than 1MB:

hash.xxhash: * AND extension:pdf AND size:>1048576

Find hashed files in a specific directory:

hash: * AND parent_path:/mnt/data/important/*

Find files hashed with one algorithm but not another:

hash.md5: * AND NOT hash.sha256: *

Here we can see all 578 files found have an MD5 checksum but not a SHA256!

Integration with Dupesfinder

The Checksums plugin is designed to work together with the Dupesfinder plugin for comprehensive duplicate detection. Running Checksums first enriches your index with hash metadata that Dupesfinder then uses to identify files with identical content.

Why Two Plugins?

Separating checksum generation from duplicate detection provides flexibility:

Run checksums once, find duplicates multiple times — Hash values persist in the index, so you can run duplicate detection whenever needed without re-hashing
Control resource usage — Hashing is I/O intensive; run it during off-peak hours, then run lightweight duplicate detection anytime
Different scopes — Hash a single index, then find duplicates across multiple indices
Incremental updates — Add hashes to new files without re-processing existing ones

Complete Duplicate Detection Workflow

Step 1: Generate checksums for your index

Linux:

python3 /opt/diskover/plugins_postindex/diskover_checksums/diskover_checksums.py -u -v diskover-myindex

Windows (PowerShell):

python "C:\Program Files\Diskover\plugins_postindex\diskover_checksums\diskover_checksums.py" -u -v diskover-myindex

Step 2: Run Dupesfinder to identify duplicates

Linux:

python3 /opt/diskover/plugins_postindex/diskover_dupesfinder/diskover_dupesfinder.py -U diskover-myindex

Windows (PowerShell):

python "C:\Program Files\Diskover\plugins_postindex\diskover_dupesfinder\diskover_dupesfinder.py" -U diskover-myindex

Automating the Workflow

You can chain both plugins as Post-Crawl Commands on an Index Task to automatically generate checksums and find duplicates after each scan:

Linux Example:

Field	Value
Post-Crawl Command	`python3`
Post-Crawl Command Args	`/opt/diskover/plugins_postindex/diskover_checksums/diskover_checksums.py -u {indexname} && python3 /opt/diskover/plugins_postindex/diskover_dupesfinder/diskover_dupesfinder.py -U {indexname}`

Windows Example:

Field	Value
Post-Crawl Command	`cmd`
Post-Crawl Command Args	`/c python "C:\Program Files\Diskover\plugins_postindex\diskover_checksums\diskover_checksums.py" -u {indexname} && python "C:\Program Files\Diskover\plugins_postindex\diskover_dupesfinder\diskover_dupesfinder.py" -U {indexname}`

Performance Tuning

Thread Configuration

The maxthreads setting controls how many files are hashed in parallel:

# Auto-detect based on CPU cores (recommended for most environments)
maxthreads: 0

# Fixed thread count for controlled resource usage
maxthreads: 8

Recommendations:

For local storage: Use auto-detect (0) or match your CPU core count
For network storage (NFS, SMB): Start with 4-8 threads and adjust based on I/O saturation
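To illustrate what hashing in parallel means, here is a minimal thread-pool sketch using Python's standard library. It is not the plugin's implementation. The function names are hypothetical, and the rule that 0 falls back to the CPU count is modelled on the maxthreads description above.

```python
import hashlib
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def hash_file(path: str) -> str:
    """Hash one file (read whole for brevity; fine for small demo files)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def hash_files(paths: list, maxthreads: int = 0) -> dict:
    """Hash files in parallel; 0 means one worker per CPU core."""
    workers = maxthreads or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(paths, pool.map(hash_file, paths)))


# Demo on two temporary files so the example is self-contained.
with tempfile.TemporaryDirectory() as folder:
    paths = []
    for name, content in (("a.bin", b"alpha"), ("b.bin", b"beta")):
        path = os.path.join(folder, name)
        Path(path).write_bytes(content)
        paths.append(path)
    results = hash_files(paths, maxthreads=2)

print(len(results), "files hashed")
```

Threads help here because much of the time is spent waiting on storage I/O, which is also why too many threads can saturate a network share.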

Block Size Optimization

The blocksize setting controls how much data is read at a time when hashing:

# Default (64KB) - good for mixed file sizes
blocksize: 65536

# Large files optimization (1MB)
blocksize: 1048576

# NFS optimization - match your rsize mount option
blocksize: 131072  # 128KB

Recommendations:

For large files (video, archives): Increase to 1MB (1048576)
For NFS storage: Match your mount's rsize option for optimal read performance
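Block-wise reading can be sketched as follows. The example is a generic illustration with hashlib rather than the plugin's code. It also shows that the block size affects only how the data is read, not the resulting hash.

```python
import hashlib
import io


def hash_stream(stream, blocksize: int = 65536) -> str:
    """Hash a binary stream by reading blocksize bytes at a time."""
    digest = hashlib.sha256()
    while True:
        block = stream.read(blocksize)
        if not block:
            break
        digest.update(block)
    return digest.hexdigest()


# In-memory stand-in for a file, so the example runs anywhere.
data = b"x" * 300_000
small_blocks = hash_stream(io.BytesIO(data), blocksize=65536)
large_blocks = hash_stream(io.BytesIO(data), blocksize=1048576)
print(small_blocks == large_blocks)
```

Larger blocks mean fewer read calls per file, which tends to matter most on high-latency storage.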

Algorithm Performance

Approximate hashing speeds on typical server hardware (single thread):

Algorithm	Speed (MB/s)	1GB File Time
xxhash	5000+	~0.2s
MD5	400-600	~2s
SHA1	300-500	~2.5s
SHA256	200-400	~3.5s

If performance is critical and cryptographic security isn't required, xxhash provides dramatically faster processing.
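Actual speeds depend heavily on CPU and storage, so it can be worth measuring on your own hardware. The sketch below times the hashlib algorithms on an in-memory buffer and converts a throughput figure into an estimated time for a 1GB file. It measures CPU hashing only, not disk or network reads, and the function names are hypothetical.

```python
import hashlib
import time


def throughput_mb_per_s(algorithm: str, size_mb: int = 16) -> float:
    """Measure in-memory hashing throughput for one hashlib algorithm."""
    data = b"\0" * (size_mb * 1024 * 1024)
    start = time.perf_counter()
    hashlib.new(algorithm, data).hexdigest()
    elapsed = time.perf_counter() - start
    return size_mb / elapsed


def seconds_for_one_gb(mb_per_s: float) -> float:
    """Estimated time to hash a 1GB (1024MB) file at a given throughput."""
    return 1024 / mb_per_s


for algorithm in ("md5", "sha1", "sha256"):
    rate = throughput_mb_per_s(algorithm)
    print(f"{algorithm}: {rate:.0f} MB/s, ~{seconds_for_one_gb(rate):.1f}s per GB")
```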

Troubleshooting

No Hashes Generated

Symptom: Plugin runs but no hash fields appear in documents.

What to check:

Verify file size filters match your files — default min_size is 1024 bytes and max_size is 1GB
Check that the Diskover service user has read access to the files
Review extension filters if you've configured them
Run with -v or -V to see which files are being processed and why others might be skipped

Diagnostic query to count eligible files:

curl -X GET "localhost:9200/diskover-myindex/_count" -H 'Content-Type: application/json' -d'
{
  "query": {
    "query_string": {
      "query": "type:file AND size:>=1024 AND size:<=1073741824"
    }
  }
}'
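If your filters differ from the defaults, the same request body can be generated from your own min_size and max_size values. This small helper is a hypothetical convenience that only builds the JSON. Sending it to Elasticsearch is left to curl or your client of choice.

```python
import json


def eligible_files_query(min_size: int = 1024, max_size: int = 1073741824) -> str:
    """Build the JSON body for a _count request using the size filters."""
    query = f"type:file AND size:>={min_size} AND size:<={max_size}"
    return json.dumps({"query": {"query_string": {"query": query}}})


body = eligible_files_query()
print(body)
```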

xxhash Module Not Found

Symptom: Error "Missing xxhash Python module"

Solution: Install the xxhash package:

Linux:

python3 -m pip install xxhash

Windows (PowerShell):

python -m pip install xxhash

Verification:

python3 -c "import xxhash; print('xxhash version:', xxhash.VERSION)"

Permission Denied Errors

Symptom: Warnings about being unable to open or stat files.

What to check:

Ensure the Diskover service user has read access to the storage paths
For NFS mounts, verify export options include read permissions
Check if replace_paths configuration is needed for your environment

Test file access:

sudo -u diskover cat /path/to/problem/file > /dev/null && echo "OK"

Cache Not Working

Symptom: Files are re-hashed on every run despite using -u.

What to check:

Verify the cache directory exists and is writable
Check if file mtimes are changing between runs
Consider enabling use_disk_mtime if index mtimes differ from actual file mtimes
Try flushing the cache with -f and rebuilding

Debug cache behavior:

Linux:

python3 /opt/diskover/plugins_postindex/diskover_checksums/diskover_checksums.py -V -u diskover-myindex 2>&1 | grep -E "CACHE (HIT|MISS)"

Windows (PowerShell):

python "C:\Program Files\Diskover\plugins_postindex\diskover_checksums\diskover_checksums.py" -V -u diskover-myindex 2>&1 | Select-String "CACHE (HIT|MISS)"

Slow Performance

What to try:

Use xxhash instead of SHA256 if cryptographic security isn't required
Increase blocksize to 1MB (1048576) for large files or network storage
Reduce maxthreads for network storage to avoid I/O saturation
Enable caching (-u) to skip unchanged files on subsequent runs
Use extension or size filters to focus on the files that matter

Elasticsearch Connection Issues

Symptom: Error connecting to Elasticsearch.

Diagnostic steps:

# Test ES connectivity
curl -X GET "localhost:9200/_cluster/health"

# Test from Python
python3 -c "from diskover_elasticsearch import es_connection_cached; print(es_connection_cached().info())"

What to check:

Verify Elasticsearch is running
Check ES host/port configuration in Diskover settings
Verify authentication credentials if security is enabled
Check network connectivity and firewall rules

Support

Last Updated: April 2026
