Directory Caching (dircache)
License: PRO (Professional Edition or higher)
Module Type: Alternate Scanner
Author: Diskover Data, Inc.
Overview / Use Cases
The Directory Caching (dircache) alternate scanner accelerates repeated filesystem scans by caching directory listings and file stat information in a local SQLite database. Instead of re-reading every directory from disk on each scan, the dircache scanner serves unchanged directories directly from cache — dramatically reducing scan times on subsequent runs while maintaining data accuracy through mtime-based change detection.
If you've ever watched a multi-hour Diskover scan grind through millions of directories that haven't changed since the last scan, dircache is the solution. It wraps the standard os.scandir() call with a caching layer so that only directories that have actually changed are re-read from disk. Everything else comes straight from a local SQLite database, which is orders of magnitude faster than traversing a network filesystem.
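The idea of wrapping os.scandir() with an mtime-checked cache can be sketched as follows. This is a minimal illustration and not Diskover's implementation: the table name dircache, its schema, and the helper names open_cache and cached_listdir are all hypothetical.

```python
import json
import os
import sqlite3


def open_cache(db_path):
    """Open (or create) a SQLite cache with one row per directory (assumed schema)."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dircache ("
        "path TEXT PRIMARY KEY, mtime_ns INTEGER, entries TEXT)"
    )
    return conn


def cached_listdir(conn, path):
    """Return (entries, hit); hit is True when the listing came from the cache."""
    mtime_ns = os.stat(path).st_mtime_ns  # a single stat call on the directory
    row = conn.execute(
        "SELECT mtime_ns, entries FROM dircache WHERE path = ?", (path,)
    ).fetchone()
    if row is not None and row[0] == mtime_ns:
        return json.loads(row[1]), True  # unchanged: no directory read needed

    # Cache miss: read the directory from disk and record name, size, mtime.
    entries = []
    with os.scandir(path) as it:
        for entry in it:
            st = entry.stat(follow_symlinks=False)
            entries.append([entry.name, st.st_size, st.st_mtime_ns])
    entries.sort()
    conn.execute(
        "INSERT OR REPLACE INTO dircache VALUES (?, ?, ?)",
        (path, mtime_ns, json.dumps(entries)),
    )
    conn.commit()
    return entries, False
```

In this sketch an unchanged directory costs one stat call plus a local database lookup, while a changed directory pays for the full read and refreshes its cache row.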
Who Benefits from dircache
Storage Administrators running scheduled re-indexes on large filesystems (daily, hourly, or more frequently) will see the biggest gains. When millions of directories exist but only a small fraction change between scans, dircache avoids redundant stat calls and directory reads for the unchanged majority. This can reduce total scan time from hours to minutes after the initial cache is populated.
NAS and Network Storage Teams dealing with NFS or SMB-mounted storage will appreciate that dircache eliminates per-operation network latency for unchanged directories. Every stat call and directory listing on network storage traverses the network — dircache short-circuits this overhead by serving cached results from the local SQLite database. This is especially impactful on high-latency links or storage systems under heavy concurrent load.
Operations Teams Needing Frequent Indexing can use dircache to achieve near-incremental scan behavior. The mtime comparison identifies which directories have changed since the last scan, and only those directories are fully re-read. Combined with lazy mtime checking, this provides a tunable trade-off between scan speed and change detection granularity.
Resource-Constrained Environments benefit because dircache shifts the I/O burden from the target storage to a local SQLite database, freeing storage system resources for production workloads during scanning windows.
Requirements
System Requirements
| Component | Requirement |
|---|---|
| Python | 3.9 or higher |
| Diskover | Core installation with alternate scanner support |
| Storage | Local disk space for the SQLite cache database |
| Filesystem | Local or network-mounted filesystem accessible via standard OS path operations |
Python Dependencies
No additional Python packages are required beyond the standard Diskover installation. The scanner uses the Diskover core module diskover_cache together with hashlib, os, re, logging, and datetime from the Python standard library.
Installation
Step 1: Install Scanner Package
Linux:
dnf install diskover-scanner-dircache
Windows:
The scanner files are included with the Diskover Windows installation. No separate installation step is required.
Install locations:
Linux:
/opt/diskover/scanners/scandir_dircache/
Windows:
C:\Program Files\Diskover\scanners\scandir_dircache\
Step 2: Verify Installation
Confirm the scanner module can be imported:
Linux:
cd /opt/diskover
python3 -c "from scanners.scandir_dircache import scandir_dircache; print('scandir_dircache: OK')"
Windows:
cd "C:\Program Files\Diskover"
python -c "from scanners.scandir_dircache import scandir_dircache; print('scandir_dircache: OK')"
If you see scandir_dircache: OK, the scanner is ready to configure!
# python3 -c "from scanners.scandir_dircache import scandir_dircache; print('scandir_dircache: OK')"
scandir_dircache: OK
Configuration
Configuration is managed through the Diskover Admin UI under Settings > Alternate Scanners > DirCache.
Configuration Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| verbose | bool | false | Enable verbose cache hit/miss logging for each directory processed. Useful for diagnosing cache behavior. |
| cachedir | string | /opt/diskover/__dircache__/ | Directory where SQLite cache database files are stored. Can be an absolute or relative path. |
| dirlist_expire | int | 0 | Cache entry expiry time in seconds. Set to |
| load_db_mem | bool | false | Load the SQLite database into memory at startup for faster reads. Warning: Risk of database corruption if the scan crashes before the database is written back to disk. |
| cache_exclude_dirs | list | [] | Regex patterns for directory names or absolute paths to exclude from caching. Uses Python re.search matching. |
| lazy_mtime_check | bool | false | Enable periodic mtime checking instead of always comparing the on-disk mtime. Faster scans at the expense of change detection accuracy. |
| lazy_mtime_check_days | int | 30 | When lazy mtime checking is enabled, only check directories with cached mtimes newer than this many days. Older entries are assumed unchanged and served from cache without any filesystem stat call. |
Windows Note: Set cachedir to a Windows-compatible path such as C:\Program Files\diskover\__dircache__\.
Configuration via Diskover Admin
Navigate to Settings > Alternate Scanners > DirCache
Adjust parameters as needed for your environment
Save the configuration
Configuration Examples
Example 1: Standard Configuration (Accuracy-First)
This is the default configuration that checks every directory mtime on each scan. Best for environments where accuracy is more important than maximum scan speed.
Diskover:
  Alternate Scanners:
    DirCache:
      Default:
        verbose: false
        cachedir: /opt/diskover/__dircache__/
        dirlist_expire: 0
        load_db_mem: false
        cache_exclude_dirs: []
        lazy_mtime_check: false
        lazy_mtime_check_days: 30
Example 2: Maximum Performance
Enables lazy mtime checking with a 7-day window. Directories modified within the past week are checked for changes on each scan. Directories with cached mtimes older than 7 days are served from cache without any filesystem stat call — the assumption being that old directories are unlikely to have changed.
This configuration is ideal for very large filesystems where the vast majority of directories are archival or rarely modified.
Diskover:
  Alternate Scanners:
    DirCache:
      Default:
        verbose: false
        cachedir: /opt/diskover/__dircache__/
        dirlist_expire: 0
        load_db_mem: false
        cache_exclude_dirs: []
        lazy_mtime_check: true
        lazy_mtime_check_days: 7
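The lazy mtime window in Example 2 amounts to a simple age test on the cached mtime. The sketch below shows that decision under the assumption that "newer than N days" means strictly less than N days old. The function name needs_mtime_check is hypothetical and this is not the scanner's actual code.

```python
import time

SECONDS_PER_DAY = 86400


def needs_mtime_check(cached_mtime, lazy_mtime_check, lazy_mtime_check_days, now=None):
    """Return True if the directory should be stat'ed and compared to the cache."""
    if not lazy_mtime_check:
        return True  # accuracy-first mode: always compare the on-disk mtime
    if now is None:
        now = time.time()
    age_days = (now - cached_mtime) / SECONDS_PER_DAY
    # Recently modified directories are re-checked; older ones are assumed
    # unchanged and served from cache without a filesystem stat call.
    return age_days < lazy_mtime_check_days
```

With lazy_mtime_check_days set to 7, a directory whose cached mtime is two days old is still checked, while one last modified 30 days ago is served from cache unconditionally.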
Example 3: Selective Caching
Excludes frequently changing directories (temp, logs, and hidden/dot-prefixed directories) from caching while caching everything else normally. Sets a 24-hour (86400 seconds) cache expiry so old entries age out automatically.
This is useful when certain directories change so often that caching them provides no benefit.
Diskover:
  Alternate Scanners:
    DirCache:
      Default:
        verbose: false
        cachedir: /opt/diskover/__dircache__/
        dirlist_expire: 86400
        load_db_mem: false
        cache_exclude_dirs:
          - ".*"
          - "/data/tmp"
          - "/data/logs"
          - ".snapshot"
        lazy_mtime_check: false
        lazy_mtime_check_days: 30
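The effect of dirlist_expire in Example 3 can be pictured as an age check on each cache entry. This is only a sketch: the function name is_expired and the cached_at timestamp are hypothetical, and treating 0 as "expiry disabled" is an assumption that the text here does not state explicitly.

```python
import time


def is_expired(cached_at, dirlist_expire, now=None):
    """Return True if a cache entry is older than dirlist_expire seconds."""
    if dirlist_expire <= 0:
        return False  # ASSUMPTION: 0 means entries never expire
    if now is None:
        now = time.time()
    return (now - cached_at) > dirlist_expire
```

Under these assumptions, an entry cached 25 hours ago is expired when dirlist_expire is 86400, and the directory would be re-read from disk on the next scan.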
Tip: The cache_exclude_dirs patterns use Python regex matching (re.search), so patterns are matched anywhere in the directory name or full path. Use anchored patterns (e.g., ^temp$) if you need exact name matching.
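The difference between anchored and unanchored patterns is easy to check with re.search directly. The helper below is a hypothetical stand-in for the scanner's exclusion test: it simply tries each pattern against the directory name and the full path, and does not model any special handling the scanner may apply to particular patterns.

```python
import os
import re


def is_excluded(path, patterns):
    """Return True if any pattern matches the directory name or the full path."""
    name = os.path.basename(path.rstrip("/"))
    return any(re.search(p, name) or re.search(p, path) for p in patterns)


print(is_excluded("/data/projects/temp", ["^temp$"]))       # True: exact name
print(is_excluded("/data/projects/temporary", ["^temp$"]))  # False: anchored
print(is_excluded("/data/projects/temporary", ["temp"]))    # True: substring
print(is_excluded("/data/logs/app", ["/data/logs"]))        # True: path match
print(is_excluded("/data/projects", [".*"]))                # True: matches anything
```

Note that under plain re.search semantics a pattern such as .* matches every string, so it is worth confirming with verbose logging which directories a pattern actually excludes.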
Usage / Execution
The dircache scanner is a standard alternate scanner that integrates with diskover.py via the --altscanner flag.
Basic Usage
Linux:
cd /opt/diskover
python3 diskover.py --altscanner scandir_dircache /data
Windows:
cd "C:\Program Files\Diskover"
python diskover.py --altscanner scandir_dircache D:\data
Path Format Reference
| Path Format | Description | Example |
|---|---|---|
| Absolute Linux path | Standard filesystem path | |
| Windows drive path | Local drive with directory | |
| UNC path (Windows) | Network share path | |
| NFS/SMB mount | Network-mounted path (Linux) | |
Advanced Usage Examples
Custom Index Name:
cd /opt/diskover
python3 diskover.py --altscanner scandir_dircache -i diskover-myindex /data
Verbose / Debug Logging:
Enable debug-level logging to see cache hit/miss information for each directory:
cd /opt/diskover
python3 diskover.py --altscanner scandir_dircache --loglevel DEBUG /data
Multiple Top Paths:
When scanning multiple top paths, the scanner creates a single cache database named using an MD5 hash of the combined paths:
cd /opt/diskover
python3 diskover.py --altscanner scandir_dircache /data/projects /data/archive /data/shared
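To illustrate how one database name can be derived from several top paths, here is a small sketch using hashlib.md5. How the scanner actually combines the paths before hashing is not documented here, so the newline separator and the function name cache_name_for are assumptions for illustration only.

```python
import hashlib


def cache_name_for(top_paths):
    """Derive one stable name from several top paths (joining rule is assumed)."""
    combined = "\n".join(top_paths)  # hypothetical separator, not confirmed
    return hashlib.md5(combined.encode("utf-8")).hexdigest()


name = cache_name_for(["/data/projects", "/data/archive", "/data/shared"])
print(len(name))  # 32 hex characters
```

If the name is derived along these lines, the same set of paths always maps to the same database, while adding, removing, or reordering paths would yield a different name and therefore a cold cache.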
Integration with Index Tasks
The dircache scanner can be configured as part of a Diskover Index Task:
| Field | Value |
|---|---|
| Alternate Scanner | scandir_dircache |
Set the Alternate Scanner field to scandir_dircache in the Index Task configuration to use directory caching for scheduled scans.
Performance Tips
First scan sees no speedup: The initial scan populates the cache and runs at the same speed as a standard scan (every directory is a cache miss). The performance gains appear on subsequent scans.
Keep the cache database on fast local storage: The SQLite cache performs best on local SSD or NVMe storage. Avoid placing the cache directory on the same network storage being scanned.
Use lazy mtime checking for archival data: If your filesystem has large archival areas that rarely change, enable lazy_mtime_check with a short lazy_mtime_check_days value to skip mtime stat calls on older directories.
Exclude volatile directories: Add frequently changing directories (temp, logs, build outputs) to cache_exclude_dirs; caching them wastes I/O because they'll always be cache misses.
Avoid in-memory mode for large caches: The load_db_mem option loads the entire SQLite database into process memory at startup. This improves read performance but can consume significant memory for large filesystems, and risks database corruption if the scan crashes.
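The trade-off behind in-memory mode can be seen with Python's sqlite3 backup API. The sketch below is not the scanner's mechanism. It only shows, with hypothetical helper names, that changes made to an in-memory copy do not reach the on-disk file until they are explicitly written back.

```python
import sqlite3


def load_into_memory(db_path):
    """Copy an on-disk SQLite database into a new in-memory connection."""
    disk = sqlite3.connect(db_path)
    mem = sqlite3.connect(":memory:")
    disk.backup(mem)  # Connection.backup copies the whole database to the target
    disk.close()
    return mem


def write_back(mem, db_path):
    """Persist the in-memory database; never runs if the process dies first."""
    disk = sqlite3.connect(db_path)
    mem.backup(disk)
    disk.close()
```

Reads against the in-memory connection avoid disk I/O entirely, but the whole database occupies process memory, and everything changed since load_into_memory depends on write_back completing.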
Troubleshooting
Common Issues
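| Issue | Cause | Solution |
|---|---|---|
| Scanner fails to start with cache directory errors | The configured cachedir does not exist or is not writable by the Diskover process user | Create the cache directory and set appropriate ownership/permissions (see below) |
| Indexed data doesn't reflect recent file changes | Lazy mtime checking is skipping recently changed directories, or stale cache entries persist | Reduce lazy_mtime_check_days, disable lazy_mtime_check, set a dirlist_expire value, or delete the cache database (see below) |
| High memory usage during scanning | The load_db_mem option loads the entire SQLite database into process memory | Set load_db_mem to false |
| First scan is slow | Expected behavior: the initial scan populates the cache and is equivalent in speed to a standard scan | Run the first scan during a maintenance window; subsequent scans will be significantly faster |
| Scan crashes and cache database is corrupted | Using load_db_mem: true, which risks corruption if the scan crashes before the database is written back to disk | Set load_db_mem to false |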
Cache Database Errors
Symptom: Scanner fails to start with errors about creating the cache directory or SQLite database, such as Error creating dir cache directory.
Diagnosis:
Check that the cache directory exists and is writable:
Linux:
ls -la /opt/diskover/__dircache__/
df -h /opt/diskover/__dircache__/
Windows:
Get-Item "C:\Program Files\diskover\__dircache__"
Get-PSDrive C | Select-Object Used, Free
Resolution:
Create the cache directory if it does not exist:
Linux:
mkdir -p /opt/diskover/__dircache__
chown diskover:diskover /opt/diskover/__dircache__
Windows:
New-Item -ItemType Directory -Path "C:\Program Files\diskover\__dircache__"
Ensure the Diskover process user has read/write permissions on the cache directory, and verify sufficient disk space is available.
Verification:
python3 -c "import os; print('writable:', os.access('/opt/diskover/__dircache__/', os.W_OK))"
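For a slightly fuller check than the one-liner above, the existence, write permission, and free space tests can be combined in one small script. This is a sketch using only the standard library. The function name check_cache_dir is hypothetical, and the path passed in should be your configured cachedir.

```python
import os
import shutil


def check_cache_dir(path):
    """Report whether the cache directory exists, is writable, and has free space."""
    exists = os.path.isdir(path)
    return {
        "exists": exists,
        "writable": exists and os.access(path, os.W_OK),
        "free_bytes": shutil.disk_usage(path).free if exists else None,
    }


print(check_cache_dir("/opt/diskover/__dircache__/"))
```

Run it as the same user that runs the Diskover process, since os.access reports permissions for the calling user.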
Stale Cache Data
Symptom: Indexed data does not reflect recent changes to files or directories. New files are missing from the index, or file sizes and timestamps are outdated.
Diagnosis:
Check if lazy_mtime_check is enabled in Settings > Alternate Scanners > DirCache. This can cause recently changed directories to be served from stale cache.
Run a scan with verbose logging to see cache hits/misses:
cd /opt/diskover
python3 diskover.py --altscanner scandir_dircache --loglevel DEBUG /data 2>&1 | grep -E "CACHE (HIT|MISS|EXCLUDED)"
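When the debug output is long, counting the markers is often more useful than reading them. The sketch below tallies the same CACHE HIT, CACHE MISS, and CACHE EXCLUDED strings that the grep pattern matches. The function name summarize is hypothetical, and nothing about the rest of the log line format is assumed.

```python
import re
from collections import Counter

MARKER = re.compile(r"CACHE (HIT|MISS|EXCLUDED)")


def summarize(lines):
    """Count CACHE HIT / MISS / EXCLUDED markers in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = MARKER.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open log file or sys.stdin to summarize gives a quick hit/miss ratio. A second scan that still reports mostly misses suggests the cache is not being reused.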
Resolution:
Delete the cache database to force a full re-scan:
Linux:
rm -rf /opt/diskover/__dircache__/*
Windows:
Remove-Item -Recurse -Force "C:\Program Files\diskover\__dircache__\*"
Alternatively, reduce lazy_mtime_check_days to a smaller value, disable lazy_mtime_check entirely, or set a dirlist_expire value (in seconds) so old cache entries expire automatically.
After deleting the cache, all directories should report cache misses on the first subsequent scan, confirming the cache has been rebuilt from disk.
High Memory Usage with In-Memory Database
Symptom: The Diskover process consumes excessive memory during scanning.
Diagnosis:
Check if load_db_mem is set to true in the configuration, and monitor the on-disk cache database size:
Linux:
du -sh /opt/diskover/__dircache__/*/
Resolution:
Set load_db_mem to false. The on-disk SQLite database performs well for most workloads and avoids loading the entire cache into process memory. Note that load_db_mem: true also carries a risk of SQLite database corruption if the scan process crashes before the database is written back to disk.
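On systems without du, the on-disk cache size can be totalled with a few lines of Python. This is a sketch with a hypothetical function name. The result is only a rough indication of how much data in-memory mode would have to load.

```python
import os


def cache_size_bytes(cachedir):
    """Sum the sizes of all regular files under the cache directory."""
    total = 0
    for root, _dirs, files in os.walk(cachedir):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


print(cache_size_bytes("/opt/diskover/__dircache__/"))
```

os.walk yields nothing for a path that does not exist, so the function returns 0 in that case rather than raising an error.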
Debug Logging
To enable full debug output for diagnosing cache behavior:
cd /opt/diskover
python3 diskover.py --altscanner scandir_dircache --loglevel DEBUG /data
With verbose: true in the configuration, the debug log will show CACHE HIT, CACHE MISS, and CACHE EXCLUDED entries for each directory processed, allowing you to verify that caching is working as expected.
Log File Locations:
Linux:
/var/log/diskover/diskover.log
Windows: Check Diskover service logs or your configured log location
Support
Last Updated: March 2026
Diskover Data, Inc.