Troubleshooting — Scan Tasks
Overview
A Diskover scan (crawl) is initiated by diskoverd, which launches diskover.py as a subprocess. The crawler walks a directory tree (or alternate storage backend), processes each file and directory through any enabled plugins, and bulk-uploads documents to Elasticsearch.
This guide covers diagnosing and resolving failures across the full scan pipeline: task dispatch, the crawl process itself, plugin errors, Elasticsearch indexing errors, and scan performance.
For service-level issues (diskoverd not starting, Celery worker down, RabbitMQ not routing), refer to the Service-specific and OS-specific troubleshooting docs first.
Where to Look First
What failed | Where to look |
|---|---|
Task never started | Diskover Admin > Tasks — check task status and assigned worker |
Task started, crawl failed | |
Crawl running but slow | |
Elasticsearch errors | |
Plugin errors | |
Alt scanner errors | |
# Tail the crawl log in real time
sudo tail -f /var/log/diskover/diskoverd_subproc.log
# Search for errors only
sudo grep -E 'ERROR|CRITICAL|EXCEPTION|FATAL' /var/log/diskover/diskoverd_subproc.log
For more detail on any failing scan, increase the log level to DEBUG in Diskover Admin > Configuration > DiskoverD > $WorkerName, then re-run the task.
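The error search can also be sketched in Python when a per-severity count is more useful than raw matches. This is a minimal illustration, not part of Diskover: `count_severities` is a hypothetical helper, and the keywords are simply the ones from the grep pattern above.

```python
import re
from collections import Counter

# Severity keywords copied from the grep pattern in this guide.
SEVERITY_RE = re.compile(r"\b(ERROR|CRITICAL|EXCEPTION|FATAL)\b")


def count_severities(lines):
    """Count how many lines mention each severity keyword (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        # set() so a keyword repeated on one line is counted once for that line.
        for keyword in set(SEVERITY_RE.findall(line)):
            counts[keyword] += 1
    return counts


# Sample lines taken from the error messages quoted in this guide.
sample = [
    'OS ERROR: top path "/mnt/share" does not exist; skipping this path.',
    "FATAL ERROR: no valid top paths were provided.",
    "PLUGIN EXCEPTION: Plugin initialization failed.",
    "CRAWL STATS (path /mnt/share, files 12450 (skipped 230))",
]
print(count_severities(sample))
```

To run it against the real log, pass an open file handle for /var/log/diskover/diskoverd_subproc.log instead of `sample`.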
Running a Manual Scan
Bypassing diskoverd entirely is the fastest way to isolate whether a problem is in task dispatch or in the crawl itself:
cd /opt/diskover
python3 diskover.py -i diskover-<indexname> /path/to/scan
Useful flags for diagnosis:
Flag | Purpose |
|---|---|
-v | Verbose: prints per-file progress |
| More verbose |
--debug | Full debug output including thread activity |
-f | Drop and recreate an existing index |
-a | Add to an existing index instead of replacing |
| Index directories only (faster, useful for tree validation) |
--threads | Override crawl thread count |
| Use an alternate storage scanner |
-l | List all loaded plugins without scanning |
All index names must start with diskover-.
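When manual scans are scripted, the index-name rule can be checked before the crawler is launched. The sketch below is an assumption-laden illustration: `build_scan_command` is a hypothetical helper, it only uses the `-i`, `-f`, and `-a` options that appear in this guide, and treating `-f` and `-a` as mutually exclusive is this sketch's own choice rather than documented crawler behavior.

```python
def build_scan_command(index, path, overwrite=False, append=False):
    """Return the argv list for a manual diskover.py run (hypothetical helper)."""
    # Rule from this guide: every index name must start with "diskover-".
    if not index.startswith("diskover-"):
        raise ValueError(f"index name must start with 'diskover-': {index!r}")
    # Assumption of this sketch: dropping and appending are contradictory.
    if overwrite and append:
        raise ValueError("choose either overwrite (-f) or append (-a), not both")
    cmd = ["python3", "diskover.py", "-i", index]
    if overwrite:
        cmd.append("-f")
    if append:
        cmd.append("-a")
    cmd.append(path)
    return cmd


print(build_scan_command("diskover-test", "/tmp", overwrite=True))
```

The returned list can be passed to `subprocess.run(cmd, cwd="/opt/diskover")` so that no shell quoting is involved.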
Common Scan Errors
Path Does Not Exist or Is Not a Directory
OS ERROR: top path "/mnt/share" does not exist; skipping this path.
OS ERROR: top path "/mnt/share" is not a directory; skipping this path.
FATAL ERROR: no valid top paths were provided.
The scan exits immediately if no valid paths remain after skipping invalid ones.
Resolution:
Verify the path exists and is accessible by the user running diskover:
ls -la /mnt/share
If NFS/CIFS mounted, confirm the mount is active:
mount | grep share
Check diskoverd's configured Diskover Path in Diskover Admin > Configuration > DiskoverD > $WorkerName, and confirm that mount points are consistent between the configured path and the actual filesystem
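The same checks can be run from Python as the crawl user, which helps when several top paths are involved. This is a rough sketch: `classify_top_path` is a hypothetical helper whose categories loosely mirror the error messages above, and it does not reproduce the crawler's own validation logic.

```python
import os


def classify_top_path(path):
    """Return 'missing', 'not a directory', 'unreadable', or 'ok' (hypothetical helper)."""
    if not os.path.exists(path):
        return "missing"
    if not os.path.isdir(path):
        return "not a directory"
    # Listing a directory requires both read and execute permission.
    if not os.access(path, os.R_OK | os.X_OK):
        return "unreadable"
    return "ok"


for candidate in ("/mnt/share", "/tmp"):
    print(candidate, classify_top_path(candidate), "mount point:", os.path.ismount(candidate))
```

Run it as the same user that runs diskover, since permissions are evaluated for the calling user.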
Index Already Exists
Index diskover-myindex already exists, not crawling, use -f to overwrite.
Resolution:
Use -f to drop and recreate:
python3 diskover.py -i diskover-myindex -f /path
Use -a to append to the existing index:
python3 diskover.py -i diskover-myindex -a /path
Use a different index name
Initialization Failed (ES / License / Config)
Diskover initialization failed. Details: <details>. Next steps: verify config/profile values, license validity, and Elasticsearch connectivity before retrying.
This is a startup failure before the crawl begins. Work through in order:
License — confirm the license is valid and not expired. The license check runs first; a failure here exits immediately.
Elasticsearch connectivity — test from the worker machine:
curl -s http://<es-host>:9200/_cluster/health
Config — verify the Elasticsearch host, port, and credentials in Diskover Admin > Configuration > Diskover > Elasticsearch
Admin API — diskoverd fetches its config from diskover-admin on startup. Verify diskover-admin is running and reachable on port 8000:
curl -s http://localhost:8000/diskover_admin/api/info
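The output of the cluster health check can be interpreted programmatically. The sketch below assumes only that the `_cluster/health` response is JSON with a `status` field of `green`, `yellow`, or `red`; `summarize_health` is a hypothetical helper, and treating `yellow` as acceptable for scanning is this sketch's policy choice.

```python
import json


def summarize_health(body):
    """Turn a _cluster/health JSON body into (status, ok_to_scan) (hypothetical helper)."""
    health = json.loads(body)
    status = health.get("status", "unknown")
    # Assumption: green and yellow clusters still accept writes, while red
    # means at least one primary shard is unassigned and indexing may fail.
    return status, status in ("green", "yellow")


print(summarize_health('{"cluster_name": "example", "status": "yellow"}'))
```

The body would typically be the output captured from the curl command shown for the Elasticsearch connectivity step.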
Bulk Indexing Errors
BULK INDEX ERROR: indexing failed for one or more documents.
ERROR bulk writing error documents; retrying one-by-one.
INDEXING ERROR: unable to bulk upload documents.
Documents failed to write to Elasticsearch.
Resolution:
Check cluster health and disk space. By default, Elasticsearch marks indices read-only when a data node crosses the flood-stage disk watermark (95% used):
curl -s "http://localhost:9200/_cluster/health?pretty"
df -h /var/lib/elasticsearch
Field mapping conflict — a plugin may be producing a value incompatible with the existing index mapping. Try disabling plugins and re-running to isolate.
Check for index write blocks:
curl -s "http://localhost:9200/diskover-*/_settings?pretty" | grep read_only
If blocked, clear it:
curl -X PUT "http://localhost:9200/*/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index.blocks.read_only_allow_delete": null}'
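Instead of grepping the settings output, the JSON can be inspected to list exactly which indices carry a block. This sketch assumes the default (non-flat) shape of the `_settings` response, where each index maps to `settings` > `index` > `blocks`; `blocked_indices` is a hypothetical helper.

```python
def blocked_indices(settings):
    """Return index names whose settings carry a read-only block (hypothetical helper)."""
    blocked = []
    for name, body in settings.items():
        blocks = body.get("settings", {}).get("index", {}).get("blocks", {})
        # Elasticsearch returns setting values as strings ("true" / "false").
        if any(
            str(blocks.get(key, "false")).lower() == "true"
            for key in ("read_only", "read_only_allow_delete")
        ):
            blocked.append(name)
    return sorted(blocked)


# Illustrative response shape; the index names are made up for this example.
sample = {
    "diskover-a": {"settings": {"index": {"blocks": {"read_only_allow_delete": "true"}}}},
    "diskover-b": {"settings": {"index": {"number_of_shards": "1"}}},
}
print(blocked_indices(sample))
```

Feed it the parsed JSON from the settings request (for example via `json.load`) to see which indices need the block cleared.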
Plugin Errors
Plugins run during the crawl, adding metadata fields to each document before it is indexed. A plugin failure does not stop the crawl — the file is still indexed without that plugin's metadata.
Plugin Initialization Failure
PLUGIN EXCEPTION: Plugin initialization failed.
The plugin failed at startup, before any files were processed.
Resolution:
Run python3 diskover.py -l to list loaded plugins and confirm the plugin appears
Check the plugin's config in Diskover Admin > Configuration > Plugins > Index > $PluginName
Verify any external dependencies the plugin requires are installed (e.g., mediainfo, ffprobe, CLI tools)
Temporarily disable the plugin in the Admin UI and re-run to confirm the base crawl works
Plugin Runtime Error
PLUGIN ERROR: runtime failure in plugin "mediainfo".
PLUGIN EXCEPTION: unexpected failure in plugin "mediainfo".
The plugin encountered an error while processing a specific file.
Resolution:
Enable DEBUG logging and re-run; the debug output will include the specific file and exception
Check whether the error is consistent (every file) or intermittent (specific file types)
Verify the plugin's external dependencies are on the system PATH (see External Command Not Found below)
External Command Not Found
CRITICAL - mediainfo command not found in path! (Exit code: 1)
Plugins like mediainfo, imageinfo, and pdfinfo shell out to external CLI tools that must be installed and accessible on the PATH of the process running the crawl.
Verify the tool is installed:
which mediainfo
mediainfo --version
If installed but not found by the crawler, the issue is the service PATH. On Linux, confirm /usr/local/bin and any custom install paths are included in the environment of the diskoverd service. On macOS, see the LaunchDaemon PATH section in the macOS troubleshooting guide.
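To see which tools would be missing under a specific PATH, such as the one a service environment provides, the lookup can be reproduced with the standard library. `missing_tools` is a hypothetical helper and the PATH string in the example is an assumption; substitute the value your diskoverd service actually receives.

```python
import shutil


def missing_tools(tools, path=None):
    """Return the tools that cannot be resolved on the given PATH string.

    With path=None, shutil.which falls back to the current process PATH.
    """
    return [tool for tool in tools if shutil.which(tool, path=path) is None]


# Hypothetical restricted PATH, similar to what a service unit might provide.
service_path = "/usr/bin:/bin"
print(missing_tools(["mediainfo", "ffprobe"], path=service_path))
```

If a tool resolves in an interactive shell but appears in this list for the service PATH, the service environment is the likely culprit.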
Alternate Scanner Errors
Alternate scanners replace the default filesystem scandir with a custom backend (S3, Azure, PowerScale, etc.).
ALT SCANNER EXCEPTION: Alternate scanner initialization failed.
ALT SCANNER ERROR: failed to stat path with alternate scanner.
Resolution:
Verify the scanner module is installed and importable:
cd /opt/diskover
python3 -c "import scandir_<name>"
Check the scanner's config in Diskover Admin > Configuration > Diskover > Alternate Scanners
Test connectivity to the backend independently (storage API, credentials, network access)
Run without the alternate scanner to confirm the base crawl and ES connection are healthy:
python3 diskover.py -i diskover-test /tmp
Each alternate scanner ships with a test.py or similar validation script — run this first to confirm the backend is reachable before attempting a full scan.
Measuring Scan Performance
The crawler logs performance stats to the subprocess log throughout the crawl and at completion.
Live Stats (every ~10 seconds during crawl)
CRAWL STATS (path /mnt/share, files 12450 (skipped 230), dirs 880 (skipped 12), elapsed 0:02:14, perf 94.321 inodes/s (max 210, min 40, avg 94), 4 paths still scanning, memory usage: 312 MB)
ES UPLOAD STATS (path /mnt/share, uploaded 11200 docs, upload time 0:01:58, perf 94.9 docs/s)
Final Stats (on crawl completion)
*** finished walking /mnt/share ***
*** walk files 48200, skipped 950 ***
*** walk size 4.2 TB ***
*** walk dirs 3400, skipped 24 ***
*** walk took 0d:00h:18m:42s ***
*** walk perf 44.2 inodes/s (max 210, min 8, avg 44) ***
Key metrics to watch:
inodes/s — Overall crawl throughput. Healthy range depends heavily on storage type and file count, but a sudden drop mid-crawl indicates I/O slowdown or thread starvation.
paths still scanning — Number of directories being actively walked in parallel. If this stays at 1 for a deep tree, threading isn't engaging — consider adjusting thread depth settings.
skipped counts — High skip counts may indicate misconfigured exclusions filtering out files unexpectedly.
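To track these metrics over time, the live stats line can be parsed into numbers. The regular expression below is written against the sample CRAWL STATS line shown earlier and is an assumption about the format; if the log format differs between versions, it will need adjusting. `parse_crawl_stats` is a hypothetical helper.

```python
import re

# Pattern derived from the sample CRAWL STATS line in this guide (assumed format).
CRAWL_STATS_RE = re.compile(
    r"CRAWL STATS \(path (?P<path>.+?), "
    r"files (?P<files>\d+) \(skipped (?P<files_skipped>\d+)\), "
    r"dirs (?P<dirs>\d+) \(skipped (?P<dirs_skipped>\d+)\), "
    r"elapsed (?P<elapsed>[\d:]+), "
    r"perf (?P<perf>[\d.]+) inodes/s .*?"
    r"(?P<scanning>\d+) paths still scanning"
)


def parse_crawl_stats(line):
    """Return a dict of crawl metrics, or None if the line does not match."""
    match = CRAWL_STATS_RE.search(line)
    if match is None:
        return None
    stats = match.groupdict()
    for key in ("files", "files_skipped", "dirs", "dirs_skipped", "scanning"):
        stats[key] = int(stats[key])
    stats["perf"] = float(stats["perf"])
    return stats


sample = (
    "CRAWL STATS (path /mnt/share, files 12450 (skipped 230), dirs 880 (skipped 12), "
    "elapsed 0:02:14, perf 94.321 inodes/s (max 210, min 40, avg 94), "
    "4 paths still scanning, memory usage: 312 MB)"
)
print(parse_crawl_stats(sample))
```

Collecting `perf` and `scanning` from successive lines makes a mid-crawl throughput drop, or a scan stuck at one active path, easy to spot.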
Improving Scan Speed
Threading Configuration
The most impactful setting for scan speed on large directory trees is the thread configuration, managed in Diskover Admin > Configuration > Diskover > $ConfigName:
Setting | Description | Tuning Guidance |
|---|---|---|
Max Threads | Crawl threads per top path | Start at CPU core count; increase for network storage |
Max Walk Threads | Threads for walking multiple top paths in parallel | Set to number of top paths if scanning many simultaneously |
Max Thread Depth | Directory depth at which threading begins | Lower values thread earlier (better for shallow trees); higher for deep trees |
Auto Thread Depth | Automatically calculate optimal thread depth | Enable on first run, then tune manually based on results |
Index Threads | Threads for bulk-uploading to Elasticsearch | Default 16; increase if ES upload is the bottleneck |
Thread depth tip: If a path has millions of files in a flat structure (depth 1–2), set Max Thread Depth to 1 so threads start immediately at the top level. For deep trees (depth 10+), let Auto Thread Depth calculate the optimal point.
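If the depth of a tree is not known in advance, it can be measured before choosing a setting. The sketch below walks a directory and maps the result onto the tip above; `max_tree_depth` and `suggest_thread_depth` are hypothetical helpers, and the fallback for depths between 3 and 9 is this sketch's assumption, not Diskover guidance. Walking a very large share is itself slow, so point it at a representative subtree.

```python
import os


def max_tree_depth(top, limit=50):
    """Return the deepest directory level under top (top itself is depth 0)."""
    deepest = 0
    for dirpath, dirnames, _filenames in os.walk(top):
        rel = os.path.relpath(dirpath, top)
        depth = 0 if rel == "." else rel.count(os.sep) + 1
        deepest = max(deepest, depth)
        if depth >= limit:
            dirnames[:] = []  # stop descending once the limit is reached
    return deepest


def suggest_thread_depth(depth):
    """Map a measured depth onto the thread depth tip (heuristic sketch)."""
    if depth <= 2:
        return "set Max Thread Depth to 1"
    if depth >= 10:
        return "enable Auto Thread Depth"
    # Assumption: the tip gives no rule for mid-depth trees.
    return "enable Auto Thread Depth on the first run, then tune manually"


print(suggest_thread_depth(max_tree_depth("/tmp", limit=12)))
```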
You can also override threading at scan time without changing the config:
python3 diskover.py -i diskover-test --threads 16 --walkthreads 4 --threaddepth 2 /path
Exclusions — Skip What You Don't Need
Excluding irrelevant paths and files reduces crawl time significantly on large volumes. Configure in Diskover Admin > Configuration > Diskover > $ConfigName:
Directory exclusions: regex patterns; match on name or full path
File exclusions: exact names, extensions, or patterns (e.g., *.tmp, .*)
Min file size: skip files below a threshold (e.g., ignore zero-byte files)
Empty directories: skip empty directories if they aren't needed in the index
Time filters: minmtime / maxmtime to restrict the crawl to recently modified files
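Before committing a directory exclusion pattern to the config, it can help to test it against a few known paths. The sketch below is only an approximation of "match on name or full path": `is_excluded_dir` is a hypothetical helper, and whether Diskover uses search or full-match semantics is not stated here, so confirm real behavior with a debug scan.

```python
import os
import re


def is_excluded_dir(path, patterns):
    """True if any regex matches the directory name or its full path (sketch)."""
    name = os.path.basename(path.rstrip("/"))
    # Assumption: patterns are applied with search semantics to both forms.
    return any(re.search(p, name) or re.search(p, path) for p in patterns)


# Hypothetical patterns: one anchored to a directory name, one to a path prefix.
patterns = [r"^\.snapshot$", r"^/mnt/share/tmp"]
for candidate in ("/mnt/share/proj/.snapshot", "/mnt/share/tmp/cache", "/mnt/share/proj/data"):
    print(candidate, is_excluded_dir(candidate, patterns))
```

Anchoring a pattern with `^` and `$` keeps a name-based rule from accidentally matching longer names that merely contain the same text.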
To audit what is being skipped, run with --debug and grep for excluded:
python3 diskover.py -i diskover-test --debug /path 2>&1 | grep -i excluded | head -50
Elasticsearch Performance
If the bottleneck is ES upload speed rather than crawl speed (upload docs/s is much lower than crawl inodes/s):
Increase Index Threads in the Diskover config (default: 16)
Check ES cluster health: a yellow or red cluster has degraded write performance
Check ES heap usage:
curl -s "http://localhost:9200/_cat/nodes?v&h=name,heap.percent"
Ensure the ES data volume is not near the disk watermarks (by default, 85% low, 90% high, and 95% flood stage, at which point writes are blocked)
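The comparison between upload rate and crawl rate can be made mechanical by reading both figures from the live stats lines. In this sketch, `upload_is_bottleneck` is a hypothetical helper, the patterns are derived from the sample log lines in this guide, and the 0.5 ratio is an arbitrary stand-in for "much lower".

```python
import re

# Patterns derived from the sample CRAWL STATS and ES UPLOAD STATS lines (assumed format).
CRAWL_PERF_RE = re.compile(r"perf ([\d.]+) inodes/s")
UPLOAD_PERF_RE = re.compile(r"perf ([\d.]+) docs/s")


def upload_is_bottleneck(crawl_line, upload_line, ratio=0.5):
    """True if the upload rate is below ratio * crawl rate (threshold is an assumption)."""
    crawl = CRAWL_PERF_RE.search(crawl_line)
    upload = UPLOAD_PERF_RE.search(upload_line)
    if not crawl or not upload:
        raise ValueError("could not find perf figures in the given lines")
    return float(upload.group(1)) < ratio * float(crawl.group(1))


crawl_line = "CRAWL STATS (path /mnt/share, files 12450 (skipped 230), perf 94.321 inodes/s (max 210, min 40, avg 94))"
upload_line = "ES UPLOAD STATS (path /mnt/share, uploaded 11200 docs, upload time 0:01:58, perf 94.9 docs/s)"
print(upload_is_bottleneck(crawl_line, upload_line))
```

With the sample figures the two rates are nearly equal, so the check reports no upload bottleneck.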
Network Storage Considerations
For NFS/CIFS/cloud-backed paths, scan speed is limited by network latency and storage IOPS rather than CPU:
Increase Max Threads well above CPU count (e.g., 32–64) to keep the pipeline saturated despite latency
Use Auto Thread Depth to let the crawler find the optimal depth for the share structure
For very large flat directories (millions of files in one folder), threading has less benefit; crawl speed will be limited by the storage system's directory listing performance
Testing Connectivity Before Scanning
Before running a scan against a new environment, verify each dependency independently:
Elasticsearch
curl -s "http://<es-host>:9200/_cluster/health?pretty"
curl -s "http://<es-host>:9200/_cat/indices?v"
Mount / Path Accessibility
# NFS
showmount -e <nfs-server>
mount | grep <mountpoint>
ls -la /path/to/scan
# CIFS
smbclient -L //<server>/ -U <user>
# General: confirm readable by the user running diskover
sudo -u <diskover-user> ls /path/to/scan
Alt Scanner Backend (S3, Azure, PowerScale, etc.)
Each alternate scanner includes a test/validation script. Run it before the first scan:
cd /opt/diskover
python3 test_<scannername>.py
For cloud scanners (S3, Azure, OneDrive), also verify credentials and required API permissions are in place — see the scanner-specific README for the required permission scopes.
Admin API (Config Fetch)
The crawler fetches its config from diskover-admin at startup. Verify it is reachable from the worker:
curl -s http://<diskover-admin-host>:8000/diskover_admin/api/info
Scan Produces Empty or Partial Index
The crawl completed but the index has fewer files than expected.
Check skip counts in the final stats: High skipped values in the crawl completion output indicate exclusion rules are filtering out files.
Audit exclusions:
python3 diskover.py -i diskover-test --debug /path 2>&1 | grep -i 'excluded\|skipped' | head -100
Check for path permission issues: If the crawl user cannot read a subdirectory, it is silently skipped. Run a manual scan with verbose output and look for permission denied errors:
python3 diskover.py -i diskover-test -v /path 2>&1 | grep -i 'permission\|denied'
Check --maxdepth: If the task was configured with a maxdepth limit, only files within that depth are indexed. The default is 999 (effectively unlimited).
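An independent count of what is on disk gives a baseline to compare against the index. The sketch below walks a path as the current user and records directories that could not be listed; `count_tree` is a hypothetical helper, and its totals will not match the crawler's exactly because it ignores exclusions, maxdepth, and any other crawl settings.

```python
import os


def count_tree(top):
    """Walk top and return (file_count, dir_count, unreadable_dirs)."""
    unreadable = []

    def on_error(err):
        # os.walk reports directories it cannot list through this callback.
        unreadable.append(err.filename)

    files = dirs = 0
    for _dirpath, dirnames, filenames in os.walk(top, onerror=on_error):
        dirs += len(dirnames)
        files += len(filenames)
    return files, dirs, unreadable


file_count, dir_count, unreadable = count_tree("/tmp")
print(f"files={file_count} dirs={dir_count} unreadable={len(unreadable)}")
```

Run it as the crawl user: a non-empty unreadable list points at permission problems, while a large gap with no unreadable directories points at exclusions or a maxdepth limit.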
Post-Index Plugins
Post-index plugins run after the crawl completes as standalone scripts, separate from the indexing process.
# Run a post-index plugin manually
python3 diskover_<name>/diskover_<name>.py <index_name>
# Dry run (where supported)
python3 diskover_<name>/diskover_<name>.py --dryrun <index_name>
# Verbose
python3 diskover_<name>/diskover_<name>.py -v <index_name>
If a post-index plugin fails, the index itself is unaffected — the plugin can be re-run against the same index after fixing the issue.