Checksums
License: PRO+ (Professional Edition or higher)
Plugin Type: Index Plugin
Author: Diskover Data, Inc.
Overview
The Checksums plugin generates hash values for files during Diskover indexing, enabling data integrity verification and duplicate file detection across your storage infrastructure. By computing unique fingerprints for file contents, you can confirm successful data migrations, identify duplicate files consuming storage, and maintain confidence in your data's integrity over time.
The plugin supports multiple hash algorithms ranging from ultra-fast options optimized for performance to cryptographic hashes suitable for compliance requirements. Every processed file also receives a fast hash (fhash)—a lightweight fingerprint computed from the filename and file size without reading file contents—enabling rapid duplicate candidate identification.
Use Cases
Storage Administrators use checksums to verify data integrity after storage migrations or transfers. By comparing hash values between source and destination, administrators can confirm all files transferred correctly without bit-level corruption.
Data Managers use checksums for duplicate file detection and storage optimization. Files with identical content hashes almost certainly contain identical content (collisions are possible but rare), making it straightforward to identify redundant copies consuming storage capacity. The hash values generated by this plugin can also be used by the Dupes Finder post-index plugin for comprehensive duplicate analysis.
Compliance Officers rely on cryptographic checksums to establish audit trails and verify file authenticity. SHA-256 hashes provide the cryptographic strength required for regulatory compliance, chain of custody documentation, and proving files haven't been modified since a specific point in time.
Indexed Fields
The plugin adds the following fields to indexed documents:
Field | Description |
|---|---|
hash.fhash | Fast hash: MD5 of filename + file size (always present) |
hash.xxhash | xxhash64 digest (when using xxhash mode) |
hash.md5 | MD5 digest (when using md5 mode) |
hash.sha1 | SHA1 digest (when using sha1 mode) |
hash.sha256 | SHA256 digest (when using sha256 mode) |
Understanding File Checksums
A checksum (or hash) is a fixed-size value computed from file contents using a mathematical algorithm. Even a single-byte change in a file produces a completely different hash value, making checksums ideal for detecting modifications or verifying that two files are identical.
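To make this sensitivity concrete, the following minimal sketch uses Python's built-in hashlib module to hash two byte strings that differ in a single byte. The sample data is invented for illustration; any input behaves the same way.

```python
import hashlib

original = b"quarterly report, revision 1"
modified = b"quarterly report, revision 2"  # differs in the last byte only

digest_a = hashlib.sha256(original).hexdigest()
digest_b = hashlib.sha256(modified).hexdigest()

# Count the hex positions where the two digests happen to agree.
matching = sum(a == b for a, b in zip(digest_a, digest_b))

print(digest_a)
print(digest_b)
print(f"{matching} of {len(digest_a)} hex characters match")
```

The two digests share no meaningful structure even though the inputs are nearly identical, which is why comparing hashes is a reliable way to detect modification.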
Hash Algorithm Comparison
The Checksums plugin supports four hash algorithms, each with different characteristics suited to different use cases:
Algorithm | Output Size | Speed | Security | Best For |
|---|---|---|---|---|
xxhash | 64-bit | Fastest | Non-cryptographic | Duplicate detection, general integrity checks |
md5 | 128-bit | Fast | Weak (collisions possible) | Legacy system compatibility |
sha1 | 160-bit | Medium | Deprecated | Legacy requirements only |
sha256 | 256-bit | Slowest | Strong (cryptographic) | Compliance, security verification |
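The output sizes in the table can be checked directly for the three algorithms that ship with Python's hashlib module. This short sketch is illustrative only and omits xxhash, which requires the third-party xxhash package; its 64-bit digest corresponds to 16 hexadecimal characters.

```python
import hashlib

algorithms = ("md5", "sha1", "sha256")

# digest_size is reported in bytes; multiply by 8 for bits.
sizes = {name: hashlib.new(name).digest_size * 8 for name in algorithms}

for name in algorithms:
    hex_len = len(hashlib.new(name).hexdigest())
    print(f"{name}: {sizes[name]}-bit digest, {hex_len} hex characters")
```

The hexadecimal length is a quick way to tell which algorithm produced a stored value: 32 characters for MD5, 40 for SHA1, and 64 for SHA256.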
Choosing the Right Algorithm
xxhash (Default) is recommended for most use cases. It provides exceptional performance while maintaining excellent accuracy for duplicate detection and data integrity verification. Choose xxhash when processing large data volumes where speed is important and cryptographic verification isn't required.
md5 offers broad compatibility with legacy systems and existing checksum databases. While no longer considered cryptographically secure due to known collision vulnerabilities, MD5 remains useful when you need to compare against existing MD5 hash databases or work with systems that only support MD5.
sha1 is included for legacy compatibility but is deprecated for security purposes. Use only when required by existing systems that mandate SHA1 hashes.
sha256 provides cryptographic-strength hashing suitable for compliance requirements (HIPAA, SOX, GxP) and security verification. The longer computation time is justified when you need to prove file authenticity or meet regulatory requirements that specify cryptographic hash algorithms.
Understanding Fast Hash (fhash)
The fhash field is always generated regardless of your selected hash mode. It provides an extremely fast fingerprint computed as:
fhash = MD5(filename + file_size)
Because fhash doesn't read file contents, it generates almost instantly. Files with identical fhash values have the same name and size—making them potential duplicates worth investigating with full content hashes. Use fhash for rapid initial filtering before computing more expensive full-file checksums.
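The formula above can be sketched in Python as follows. This is an approximation, not the plugin's source: the exact concatenation and UTF-8 encoding are assumptions that mirror the shell verification example in this article, and the file list is hypothetical.

```python
import hashlib
from collections import defaultdict

def fhash(filename: str, size: int) -> str:
    """MD5 of the file name followed by the size in bytes (assumed format)."""
    return hashlib.md5(f"{filename}{size}".encode("utf-8")).hexdigest()

# Hypothetical (path, size) pairs standing in for indexed files.
files = [
    ("/mnt/a/report.pdf", 1048576),
    ("/mnt/b/report.pdf", 1048576),
    ("/mnt/b/notes.txt", 2048),
]

# Group paths by fhash; groups with more than one member are duplicate candidates.
candidates = defaultdict(list)
for path, size in files:
    name = path.rsplit("/", 1)[-1]
    candidates[fhash(name, size)].append(path)

duplicate_candidates = [paths for paths in candidates.values() if len(paths) > 1]
print(duplicate_candidates)
```

Only the files in a candidate group need a full content hash to confirm that they are true duplicates.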
Performance Considerations
Hash computation requires reading entire file contents, which takes time proportional to file size and storage performance. Consider these factors when planning your indexing strategy:
Initial scans take longer than subsequent scans because every file must be read and hashed before the cache is populated
Large files (multi-GB) require significant I/O time regardless of algorithm choice
Network storage (NFS, SMB) adds latency compared to local storage
The plugin caches hash values, so unchanged files hash instantly on re-scans
For environments with large data volumes, consider using extension filtering to hash only the file types that matter for your use case, or use fast_hash_only mode for initial duplicate candidate identification.
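The following sketch shows the general technique of hashing a file in fixed-size blocks, which is why hashing time tracks file size and storage throughput rather than available memory. It is not the plugin's implementation; the hash_file function and its blocksize argument are illustrative names chosen to echo the configuration parameter.

```python
import hashlib
import os
import tempfile

def hash_file(path: str, algorithm: str = "sha256", blocksize: int = 65536) -> str:
    """Hash a file by reading it in blocksize-byte chunks."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        while chunk := f.read(blocksize):
            h.update(chunk)
    return h.hexdigest()

# Demonstrate on a temporary 1 MiB file of random bytes.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(1024 * 1024))
    tmp_path = tmp.name

try:
    small_blocks = hash_file(tmp_path, blocksize=4096)
    large_blocks = hash_file(tmp_path, blocksize=131072)
    print(small_blocks == large_blocks)
finally:
    os.remove(tmp_path)
```

The block size changes how many read calls are issued, not the resulting digest, so it can be tuned for the storage backend without invalidating existing hash values.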
Requirements
Python Dependencies
Package | Version | Required | Purpose |
|---|---|---|---|
xxhash | Latest | Only if hash_mode='xxhash' | Ultra-fast hashing algorithm |
The md5, sha1, and sha256 algorithms use Python's built-in hashlib module and require no additional dependencies.
System Requirements
Diskover PRO+ license or higher
Python 3.9 or higher
Read access to files during indexing
Write access to cache directory
Installation
Step 1: Install Python Dependencies
If using the default xxhash algorithm, install the xxhash module:
Linux:
python3 -m pip install xxhash
Windows:
python -m pip install xxhash
For other hash modes (md5, sha1, sha256), no additional installation is required.
Step 2: Configure the Plugin
Navigate to Diskover Admin > Plugins > Index Plugins > Checksums
Enable the plugin and configure parameters as needed (see Configuration section)
Save the configuration
Step 3: Enable in Index Task Configuration
Navigate to Diskover > Configurations > select your configuration (e.g., Default)
Scroll to the Index Plugins Enablement section at the bottom
Enable the Checksums plugin
Save the configuration
The plugin will now run automatically during scans using this configuration.
Configuration
Configuration Parameters
Parameter | Type | Default | Description |
|---|---|---|---|
fast_hash_only | bool | | Generate only fhash (no full file checksum). Fastest mode: no file content reading required. |
hash_mode | string | xxhash | Hash algorithm: xxhash, md5, sha1, or sha256 |
blocksize | int | | Block size in bytes for reading files. Increase for large files or to match NFS rsize. |
extensions | list | | File extensions to process |
exclude_extensions | list | | File extensions to skip |
exclude_dirs | list | | Directory paths to exclude. Supports wildcards with * |
cache_dir | string | | Directory for SQLite cache database |
cache_expire_time | int | | Cache entry expiration in seconds. |
restore_times | bool | | Restore atime/mtime after hashing. |
Configuration Examples
Standard Configuration (Default)
Fast checksums using xxhash for general duplicate detection and integrity verification:
{
  "fast_hash_only": false,
  "hash_mode": "xxhash",
  "blocksize": 65536,
  "extensions": [],
  "exclude_extensions": ["tmp", "log", "swp"],
  "exclude_dirs": [],
  "cache_dir": "/opt/diskover/plugins_postindex/__diskover_hash_cache__/",
  "cache_expire_time": 0,
  "restore_times": false
}
Fast Hash Only Mode
Maximum speed for quick duplicate candidate scanning without reading file contents:
{
  "fast_hash_only": true,
  "hash_mode": "xxhash",
  "extensions": [],
  "exclude_extensions": [],
  "exclude_dirs": [],
  "cache_dir": "/opt/diskover/plugins_postindex/__diskover_hash_cache__/",
  "cache_expire_time": 0,
  "restore_times": false
}
Compliance Mode (SHA-256)
Cryptographic hashes for regulatory compliance requirements:
{
  "fast_hash_only": false,
  "hash_mode": "sha256",
  "blocksize": 131072,
  "extensions": ["pdf", "doc", "docx", "xls", "xlsx"],
  "exclude_extensions": [],
  "exclude_dirs": ["/mnt/data/temp/*", "/mnt/data/scratch/*"],
  "cache_dir": "/opt/diskover/plugins_postindex/__diskover_hash_cache__/",
  "cache_expire_time": 0,
  "restore_times": false
}
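To reason about which files a configuration like the one above would select, the filtering can be approximated in a few lines of Python. This is a sketch under stated assumptions, not the plugin's logic: it assumes an empty extensions list means all extensions, that exclusions take precedence, that extension matching is case-insensitive, and that wildcard patterns follow shell-style fnmatch rules. The should_hash function is a hypothetical helper.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

def should_hash(path, extensions, exclude_extensions, exclude_dirs):
    """Return True if the file would be hashed under the assumed filter rules."""
    for pattern in exclude_dirs:
        # Match wildcard patterns, or treat a plain directory as a path prefix.
        if fnmatch(path, pattern) or path.startswith(pattern.rstrip("/") + "/"):
            return False
    ext = PurePosixPath(path).suffix.lstrip(".").lower()
    if ext in exclude_extensions:
        return False
    # Assumption: an empty extensions list places no restriction.
    return not extensions or ext in extensions

config = {
    "extensions": ["pdf", "doc", "docx", "xls", "xlsx"],
    "exclude_extensions": [],
    "exclude_dirs": ["/mnt/data/temp/*", "/mnt/data/scratch/*"],
}

results = {
    path: should_hash(path, **config)
    for path in (
        "/mnt/data/finance/q1.pdf",
        "/mnt/data/temp/q1.pdf",
        "/mnt/data/finance/q1.txt",
    )
}
print(results)
```

Running a handful of representative paths through a check like this before a large scan can catch a filter that is broader or narrower than intended.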
Indexed Fields / Elasticsearch Mappings
Field Mappings
Field Path | ES Type | Description |
|---|---|---|
hash | object | Container for all hash-related fields |
hash.fhash | keyword | Fast hash: MD5 of filename + file size (always present) |
hash.xxhash | keyword | xxhash64 hexadecimal digest |
hash.md5 | keyword | MD5 hexadecimal digest |
hash.sha1 | keyword | SHA1 hexadecimal digest |
hash.sha256 | keyword | SHA256 hexadecimal digest |
Only the hash field corresponding to your configured hash_mode will be populated (plus fhash, which is always present).
Example Document
A file indexed with hash_mode='xxhash':
{
  "name": "quarterly_report.pdf",
  "size": 1048576,
  "extension": "pdf",
  "hash": {
    "fhash": "a1b2c3d4e5f67890abcdef1234567890",
    "xxhash": "ef46db3751d8e999"
  }
}
Searching in Diskover
Use these search queries in the Diskover web interface to find files based on checksum data.
Basic Hash Searches
Query | Description |
|---|---|
| Find file with specific xxhash value |
| Find file with specific MD5 value |
| Find file with specific SHA-256 value |
| Find files with specific fast hash (same name + size) |
Finding Files with Checksums
Query | Description |
|---|---|
| Find all files that have checksum data |
| Find all files with xxhash values |
| Find all files with SHA-256 values |
Combined Searches
Query | Description |
|---|---|
| PDF files with checksums |
| Large files (1GB+) with checksums |
| Files with checksums in projects directories |
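As a rough illustration of how such queries can be composed, the sketch below builds query strings from the field names used in this article. It assumes the search bar accepts Elasticsearch query-string syntax (field:value and _exists_:field); the helper functions are hypothetical, and the hash value is the one from the example document.

```python
def hash_query(algorithm: str, value: str) -> str:
    """Build a query matching one specific hash value."""
    return f"hash.{algorithm}:{value}"

def has_field_query(field: str = "hash.fhash") -> str:
    """Build a query matching documents where the field is present."""
    return f"_exists_:{field}"

specific = hash_query("xxhash", "ef46db3751d8e999")
with_sha256 = has_field_query("hash.sha256")
pdf_with_checksums = f"{has_field_query()} AND extension:pdf"

print(specific)
print(with_sha256)
print(pdf_with_checksums)
```

Because hash.fhash is always present on processed files, testing for that field is one way to select every file the plugin has handled.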
Verifying Checksums with External Tools
You can verify hash values computed by the plugin using standard command-line tools.
Verify xxhash
# Install the xxhash CLI tool
# RHEL/CentOS: yum install xxhash
# Debian/Ubuntu: apt-get install xxhash
xxh64sum /path/to/file.pdf
# Compare output with hash.xxhash field in Diskover
Verify MD5
md5sum /path/to/file.pdf
# Compare output with hash.md5 field in Diskover
Verify SHA-256
sha256sum /path/to/file.pdf
# Compare output with hash.sha256 field in Diskover
Verify fhash
# fhash = MD5(filename + size)
filename=$(basename /path/to/file.pdf)
size=$(stat -c%s /path/to/file.pdf)
echo -n "${filename}${size}" | md5sum
# Compare output with hash.fhash field in Diskover
Troubleshooting
Common Issues
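Issue | Cause | Solution |
|---|---|---|
"Missing xxhash Python module" error | xxhash package not installed | Run python3 -m pip install xxhash (on Windows, python -m pip install xxhash) |
Files missing hash data | File filtered by extension or directory exclusion | Check the extensions, exclude_extensions, and exclude_dirs settings |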
Files missing hash data | Permission denied reading file | Verify Diskover service account has read access |
Slow indexing performance | Large files require significant I/O | Use extension filtering or increase blocksize for large files |
Hash values don't update after file change | File mtime unchanged | Clear cache or touch files to update mtime |
Cache Management
The plugin caches hash values to improve re-scan performance. If you need to force re-computation of all hashes, clear the cache directory:
Linux:
rm -rf /opt/diskover/plugins_postindex/__diskover_hash_cache__/
Windows:
Remove-Item -Recurse -Force "C:\Program Files\diskover\plugins_postindex\__diskover_hash_cache__\"
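The reason clearing the cache forces re-computation, and the reason a file whose mtime is unchanged keeps its old hash, can be illustrated with a small SQLite sketch. The table layout and lookup key here are hypothetical; this article states only that the cache is a SQLite database and that an unchanged mtime can leave a stale value.

```python
import sqlite3

# In-memory database for illustration; the plugin keeps its database under cache_dir.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hash_cache (path TEXT PRIMARY KEY, mtime REAL, digest TEXT)"
)

def cached_digest(path, mtime, compute):
    """Return (digest, was_cached); recompute only when mtime has changed."""
    row = conn.execute(
        "SELECT mtime, digest FROM hash_cache WHERE path = ?", (path,)
    ).fetchone()
    if row is not None and row[0] == mtime:
        return row[1], True
    digest = compute()
    conn.execute(
        "INSERT OR REPLACE INTO hash_cache VALUES (?, ?, ?)", (path, mtime, digest)
    )
    return digest, False

# The lambdas stand in for an expensive full-file hash.
first = cached_digest("/data/a.bin", 1700000000.0, lambda: "digest-v1")
second = cached_digest("/data/a.bin", 1700000000.0, lambda: "digest-v2")
third = cached_digest("/data/a.bin", 1700000500.0, lambda: "digest-v2")
print(first, second, third)
```

In this sketch the second call returns the old digest because the mtime matches the cached entry, while the third call recomputes because the mtime changed. Deleting the cache has the same effect as a changed mtime for every file.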
Debug Logging
Enable verbose logging in the Checksums configuration within Diskover Admin to see detailed processing information. Check logs at:
Linux:
/var/log/diskover/diskover.log
Windows:
Check Diskover service logs or configured log location
# Monitor checksum-related log entries
tail -f /var/log/diskover/diskover.log | grep -i checksum
Support
Last Updated: January 2026
Diskover Data, Inc.