Troubleshooting — Scan Tasks

Overview

A Diskover scan (crawl) is initiated by diskoverd, which launches diskover.py as a subprocess. The crawler walks a directory tree (or alternate storage backend), processes each file and directory through any enabled plugins, and bulk-uploads documents to Elasticsearch.

This guide covers diagnosing and resolving failures across the full scan pipeline: task dispatch, the crawl process itself, plugin errors, Elasticsearch indexing errors, and scan performance.

For service-level issues (diskoverd not starting, Celery worker down, RabbitMQ not routing), start with the service-specific and OS-specific troubleshooting docs.

Where to Look First

What failed	Where to look
Task never started	Diskover Admin > Tasks — check task status and assigned worker
Task started, crawl failed	`/var/log/diskover/diskoverd_subproc.log`
Crawl running but slow	`diskoverd_subproc.log` — look for CRAWL STATS lines
Elasticsearch errors	`diskoverd_subproc.log` — look for BULK INDEX ERROR / INDEXING ERROR
Plugin errors	`diskoverd_subproc.log` — look for PLUGIN ERROR / PLUGIN EXCEPTION
Alt scanner errors	`diskoverd_subproc.log` — look for ALT SCANNER ERROR / EXCEPTION

# Tail the crawl log in real time
sudo tail -f /var/log/diskover/diskoverd_subproc.log

# Search for errors only
sudo grep -E 'ERROR|CRITICAL|EXCEPTION|FATAL' /var/log/diskover/diskoverd_subproc.log
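
To see at a glance which stage is failing, the markers from the table above can be tallied in one pass. This is a minimal sketch: the function name summarize_scan_errors and its tab-separated output are illustrative and not part of Diskover, and the marker list only covers the strings quoted in this guide.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count log lines per error marker.
# Usage: summarize_scan_errors <logfile>
summarize_scan_errors() {
    local log="$1" marker
    for marker in 'OS ERROR' 'FATAL ERROR' 'BULK INDEX ERROR' 'INDEXING ERROR' \
                  'PLUGIN ERROR' 'PLUGIN EXCEPTION' \
                  'ALT SCANNER ERROR' 'ALT SCANNER EXCEPTION'; do
        # grep -c prints the number of matching lines (0 when there are none);
        # -F treats the marker as a fixed string rather than a regex.
        printf '%s\t%s\n' "$(grep -c -F -- "$marker" "$log")" "$marker"
    done
}
```

Run it from a shell that can read the log, for example `summarize_scan_errors /var/log/diskover/diskoverd_subproc.log`. A non-zero count in a single category points to the matching section of this guide.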

For more detail on any failing scan, increase the log level to DEBUG in Diskover Admin > Configuration > DiskoverD > $WorkerName, then re-run the task.

Running a Manual Scan

Bypassing diskoverd entirely is the fastest way to isolate whether a problem is in task dispatch or in the crawl itself:

cd /opt/diskover
python3 diskover.py -i diskover-<indexname> /path/to/scan

Useful flags for diagnosis:

Flag	Purpose
`-v`	Verbose — prints per-file progress
`-V`	More verbose
`--debug`	Full debug output including thread activity
`-f`	Drop and recreate an existing index
`-a`	Add to an existing index instead of replacing
`--nofiles`	Index directories only (faster, useful for tree validation)
`--threads N`	Override crawl thread count
`--altscanner <name>`	Use an alternate storage scanner
`-l` / `--listplugins`	List all loaded plugins without scanning

All index names must start with diskover-.
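
When index names are generated by a script, a quick check before launching the crawl avoids a failed run. The sketch below is an assumption-laden helper, not a Diskover feature: valid_index_name is a hypothetical name, the requirement of at least one character after the prefix is assumed, and the uppercase check reflects the Elasticsearch rule that index names must be lowercase.

```shell
#!/usr/bin/env bash
# Hypothetical helper: return 0 if the name looks usable as a Diskover index name.
valid_index_name() {
    case "$1" in
        diskover-?*) ;;              # required prefix plus at least one more character (assumed)
        *) return 1 ;;
    esac
    case "$1" in
        *[[:upper:]]*) return 1 ;;   # Elasticsearch rejects uppercase letters in index names
    esac
    return 0
}
```

For example, `valid_index_name "diskover-$(date +%Y%m%d)" || echo "bad index name"` catches a missing prefix before diskover.py is started.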

Common Scan Errors

Path Does Not Exist or Is Not a Directory

OS ERROR: top path "/mnt/share" does not exist; skipping this path.
OS ERROR: top path "/mnt/share" is not a directory; skipping this path.
FATAL ERROR: no valid top paths were provided.

The scan exits immediately if no valid paths remain after skipping invalid ones.

Resolution:

Verify the path exists and is accessible by the user running diskover: ls -la /mnt/share
If NFS/CIFS mounted, confirm the mount is active: mount | grep share
Check diskoverd's configured Diskover Path in Diskover Admin > Configuration > DiskoverD > $WorkerName and that mount points are consistent between the configured path and the actual filesystem
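
The first two checks can be combined into a small preflight function. This is a sketch only: check_top_path and its return codes are hypothetical, and it must be run as the same user that runs diskover for the readability check to mean anything.

```shell
#!/usr/bin/env bash
# Hypothetical preflight: mirror the conditions behind the OS ERROR messages above.
# Returns 0 if usable, 1 if missing, 2 if not a directory, 3 if not readable.
check_top_path() {
    local path="$1"
    if [ ! -e "$path" ]; then
        echo "missing: $path"
        return 1
    elif [ ! -d "$path" ]; then
        echo "not a directory: $path"
        return 2
    elif [ ! -r "$path" ] || [ ! -x "$path" ]; then
        # Listing a directory needs read permission; descending into it needs execute
        echo "not readable by $(id -un): $path"
        return 3
    fi
    echo "ok: $path"
}
```

A result of "missing" on a path that should be a network mount usually means the mount is not active on this worker.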

Index Already Exists

Index diskover-myindex already exists, not crawling, use -f to overwrite.

Resolution:

Use -f to drop and recreate: python3 diskover.py -i diskover-myindex -f /path
Use -a to append to the existing index: python3 diskover.py -i diskover-myindex -a /path
Use a different index name

Initialization Failed (ES / License / Config)

Diskover initialization failed. Details: <details>.
Next steps: verify config/profile values, license validity, and Elasticsearch connectivity before retrying.

This is a startup failure before the crawl begins. Work through in order:

License — confirm the license is valid and not expired. The license check runs first; a failure here exits immediately.
Elasticsearch connectivity — test from the worker machine:
```
curl -s http://<es-host>:9200/_cluster/health
```
Config — verify the Elasticsearch host, port, and credentials in Diskover Admin > Configuration > Diskover > Elasticsearch
Admin API — diskoverd fetches its config from diskover-admin on startup. Verify diskover-admin is running and reachable on port 8000:
```
curl -s http://localhost:8000/diskover_admin/api/info
```

Bulk Indexing Errors

BULK INDEX ERROR: indexing failed for one or more documents.
ERROR bulk writing error documents; retrying one-by-one.
INDEXING ERROR: unable to bulk upload documents.

Documents failed to write to Elasticsearch.

Resolution:

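Check cluster health and disk space. With default settings, Elasticsearch stops allocating new shards to a node at 85% disk usage (low watermark), starts moving shards off it at 90% (high watermark), and marks its indices read-only at 95% (flood stage):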

curl -s "http://localhost:9200/_cluster/health?pretty"
df -h /var/lib/elasticsearch

Field mapping conflict — a plugin may be producing a value incompatible with the existing index mapping. Try disabling plugins and re-running to isolate.

Check for index write blocks:

curl -s "http://localhost:9200/diskover-*/_settings?pretty" | grep read_only

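If blocked, free disk space first (Elasticsearch reapplies the block while the node remains above the flood-stage watermark), then clear it on the Diskover indices:

curl -X PUT "http://localhost:9200/diskover-*/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index.blocks.read_only_allow_delete": null}'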

Plugin Errors

Plugins run during the crawl, adding metadata fields to each document before it is indexed. A plugin failure does not stop the crawl — the file is still indexed without that plugin's metadata.

Plugin Initialization Failure

PLUGIN EXCEPTION: Plugin initialization failed.

The plugin failed at startup, before any files were processed.

Resolution:

Run python3 diskover.py -l to list loaded plugins and confirm the plugin appears
Check the plugin's config in Diskover Admin > Configuration > Plugins > Index > $PluginName
Verify any external dependencies the plugin requires are installed (e.g., mediainfo, ffprobe, CLI tools)
Temporarily disable the plugin in the Admin UI and re-run to confirm the base crawl works

Plugin Runtime Error

PLUGIN ERROR: runtime failure in plugin "mediainfo".
PLUGIN EXCEPTION: unexpected failure in plugin "mediainfo".

The plugin encountered an error while processing a specific file.

Resolution:

Enable DEBUG logging and re-run — the debug output will include the specific file and exception
Check whether the error is consistent (every file) or intermittent (specific file types)
Verify the plugin's external dependencies are on the system PATH — see External Command Not Found below

External Command Not Found

CRITICAL - mediainfo command not found in path! (Exit code: 1)

Plugins like mediainfo, imageinfo, and pdfinfo shell out to external CLI tools that must be installed and accessible on the PATH of the process running the crawl.

Verify the tool is installed:

which mediainfo
mediainfo --version

If installed but not found by the crawler, the issue is the service PATH. On Linux, confirm /usr/local/bin and any custom install paths are included in the environment of the diskoverd service. On macOS, see the LaunchDaemon PATH section in the macOS troubleshooting guide.
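
To check several tools at once, a loop over `command -v` reports whatever is not resolvable on the current PATH. The helper name missing_tools is hypothetical, and the tool names in the usage line are just the examples mentioned in this guide; substitute whatever your enabled plugins need.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each tool that cannot be found on PATH.
# Returns 0 if everything was found, 1 if at least one tool is missing.
missing_tools() {
    local tool rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            rc=1
        fi
    done
    return "$rc"
}
```

For example, `missing_tools mediainfo ffprobe` prints nothing when both are available. Because the result depends on PATH, run it in an environment that matches the service (same user, same PATH) rather than in an interactive login shell.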

Alternate Scanner Errors

Alternate scanners replace the default filesystem scandir with a custom backend (S3, Azure, PowerScale, etc.).

ALT SCANNER EXCEPTION: Alternate scanner initialization failed.
ALT SCANNER ERROR: failed to stat path with alternate scanner.

Resolution:

Verify the scanner module is installed and importable:
```
cd /opt/diskover
python3 -c "import scandir_<name>"
```
Check the scanner's config in Diskover Admin > Configuration > Diskover > Alternate Scanners
Test connectivity to the backend independently (storage API, credentials, network access)
Run without the alternate scanner to confirm the base crawl and ES connection are healthy:
```
python3 diskover.py -i diskover-test /tmp
```

Each alternate scanner ships with a test.py or similar validation script — run this first to confirm the backend is reachable before attempting a full scan.

Measuring Scan Performance

The crawler logs performance stats to the subprocess log throughout the crawl and at completion.

Live Stats (every ~10 seconds during crawl)

CRAWL STATS (path /mnt/share, files 12450 (skipped 230), dirs 880 (skipped 12),
elapsed 0:02:14, perf 94.321 inodes/s (max 210, min 40, avg 94), 4 paths still scanning, memory usage: 312 MB)

ES UPLOAD STATS (path /mnt/share, uploaded 11200 docs, upload time 0:01:58, perf 94.9 docs/s)

Final Stats (on crawl completion)

*** finished walking /mnt/share ***
*** walk files 48200, skipped 950 ***
*** walk size 4.2 TB ***
*** walk dirs 3400, skipped 24 ***
*** walk took 0d:00h:18m:42s ***
*** walk perf 44.2 inodes/s (max 210, min 8, avg 44) ***

Key metrics to watch:

inodes/s — Overall crawl throughput. Healthy range depends heavily on storage type and file count, but a sudden drop mid-crawl indicates I/O slowdown or thread starvation.
paths still scanning — Number of directories being actively walked in parallel. If this stays at 1 for a deep tree, threading isn't engaging — consider adjusting thread depth settings.
skipped counts — High skip counts may indicate misconfigured exclusions filtering out files unexpectedly.
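
A sudden throughput drop is easier to spot when the inodes/s readings are pulled out of the log as a series. The sketch below assumes the stats lines look like the samples above; the function names and the 50% default threshold are illustrative choices, not Diskover settings.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for reading crawl throughput out of the subprocess log.

# Print one inodes/s reading per live CRAWL STATS entry, oldest first.
crawl_perf_series() {
    # Drop the final "walk perf" summary so only live readings remain
    grep -v 'walk perf' "$1" | grep -o 'perf [0-9.]* inodes/s' | awk '{print $2}'
}

# Report readings that fall below a fraction of the previous reading.
# Usage: perf_drops <logfile> [fraction]   (fraction defaults to 0.5)
perf_drops() {
    crawl_perf_series "$1" | awk -v frac="${2:-0.5}" '
        NR > 1 && $1 < prev * frac {
            printf "sample %d: %s -> %s inodes/s\n", NR, prev, $1
        }
        { prev = $1 }'
}
```

Running `perf_drops /var/log/diskover/diskoverd_subproc.log 0.3` would list only the points where throughput fell below 30% of the preceding reading. If the log contains more than one crawl, the series mixes them, so trim the log to the run of interest first.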

Improving Scan Speed

Threading Configuration

The most impactful setting for scan speed on large directory trees is the thread configuration, managed in Diskover Admin > Configuration > Diskover > $ConfigName:

Setting	Description	Tuning Guidance
Max Threads	Crawl threads per top path	Start at CPU core count; increase for network storage
Max Walk Threads	Threads for walking multiple top paths in parallel	Set to number of top paths if scanning many simultaneously
Max Thread Depth	Directory depth at which threading begins	Lower values thread earlier (better for shallow trees); higher for deep trees
Auto Thread Depth	Automatically calculate optimal thread depth	Enable on first run, then tune manually based on results
Index Threads	Threads for bulk-uploading to Elasticsearch	Default 16; increase if ES upload is the bottleneck

Thread depth tip: If a path has millions of files in a flat structure (depth 1–2), set Max Thread Depth to 1 so threads start immediately at the top level. For deep trees (depth 10+), let Auto Thread Depth calculate the optimal point.

You can also override threading at scan time without changing the config:

python3 diskover.py -i diskover-test --threads 16 --walkthreads 4 --threaddepth 2 /path

Exclusions — Skip What You Don't Need

Excluding irrelevant paths and files reduces crawl time significantly on large volumes. Configure in Diskover Admin > Configuration > Diskover > $ConfigName:

Directory exclusions — Regex patterns; match on name or full path
File exclusions — Exact names, extensions, or patterns (e.g., *.tmp, .*)
Min file size — Skip files below a threshold (e.g., ignore zero-byte files)
Empty directories — Skip empty directories if they aren't needed in the index
Time filters — minmtime/maxmtime to restrict crawl to recently modified files

To audit what is being skipped, run with --debug and grep for excluded:

python3 diskover.py -i diskover-test --debug /path 2>&1 | grep -i excluded | head -50

Elasticsearch Performance

If the bottleneck is ES upload speed rather than crawl speed (upload docs/s is much lower than crawl inodes/s):

Increase Index Threads in the Diskover config (default: 16)
Check ES cluster health — a yellow or red cluster has degraded write performance
Check ES heap usage (quote the URL so the shell does not interpret the &): curl -s "http://localhost:9200/_cat/nodes?v&h=name,heap.percent"
Ensure the ES data volume is not near the disk watermarks (Elasticsearch defaults: 85% low, 90% high, and 95% flood stage, at which point indices become read-only)
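
To decide whether upload really is the bottleneck, the latest upload rate can be divided by the latest crawl rate. This sketch assumes both stats lines appear in the log in the format shown under Live Stats; last_rate and upload_ratio are hypothetical names.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: compare ES upload throughput with crawl throughput.

# Print the most recent "perf <n> <unit>" reading for the given unit.
# Usage: last_rate <logfile> <inodes/s|docs/s>
last_rate() {
    grep -v 'walk perf' "$1" | grep -o "perf [0-9.]* $2" | tail -n 1 | awk '{print $2}'
}

# Print upload docs/s divided by crawl inodes/s, to two decimals.
upload_ratio() {
    local crawl upload
    crawl=$(last_rate "$1" 'inodes/s')
    upload=$(last_rate "$1" 'docs/s')
    # Both readings must be present for the ratio to be meaningful
    [ -n "$crawl" ] && [ -n "$upload" ] || return 1
    awk -v c="$crawl" -v u="$upload" 'BEGIN { if (c == 0) exit 1; printf "%.2f\n", u / c }'
}
```

A ratio near 1.00 means uploads are keeping pace with the crawl. A ratio well below 1 suggests working through the items above before changing crawl threading.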

Network Storage Considerations

For NFS/CIFS/cloud-backed paths, scan speed is limited by network latency and storage IOPS rather than CPU:

Increase Max Threads well above CPU count (e.g., 32–64) to keep the pipeline saturated despite latency
Use Auto Thread Depth to let the crawler find the optimal depth for the share structure
For very large flat directories (millions of files in one folder), threading has less benefit — crawl speed will be limited by the storage system's directory listing performance

Testing Connectivity Before Scanning

Before running a scan against a new environment, verify each dependency independently:

Elasticsearch

curl -s "http://<es-host>:9200/_cluster/health?pretty"
curl -s "http://<es-host>:9200/_cat/indices?v"
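
For scripted preflight checks, the status field can be extracted from the health response without extra tooling. This is a rough sketch using sed rather than a JSON parser; es_status is a hypothetical helper name, and it relies on the response containing a single "status" key, as the cluster health API returns.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a _cluster/health JSON response on stdin
# and print the value of its "status" field (green, yellow, or red).
es_status() {
    sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p' | head -n 1
}
```

Typical use is `curl -s "http://<es-host>:9200/_cluster/health" | es_status`. Empty output means the request failed or returned something other than a health document, which is itself a connectivity finding.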

Mount / Path Accessibility

# NFS
showmount -e <nfs-server>
mount | grep <mountpoint>
ls -la /path/to/scan

# CIFS
smbclient -L //<server>/ -U <user>

# General — confirm readable by the user running diskover
sudo -u <diskover-user> ls /path/to/scan

Alt Scanner Backend (S3, Azure, PowerScale, etc.)

Each alternate scanner includes a test/validation script. Run it before the first scan:

cd /opt/diskover
python3 test_<scannername>.py

For cloud scanners (S3, Azure, OneDrive), also verify credentials and required API permissions are in place — see the scanner-specific README for the required permission scopes.

Admin API (Config Fetch)

The crawler fetches its config from diskover-admin at startup. Verify it is reachable from the worker:

curl -s http://<diskover-admin-host>:8000/diskover_admin/api/info

Scan Produces Empty or Partial Index

The crawl completed but the index has fewer files than expected.

Check skip counts in the final stats: High skipped values in the crawl completion output indicate exclusion rules are filtering out files.

Audit exclusions:

python3 diskover.py -i diskover-test --debug /path 2>&1 | grep -i 'excluded\|skipped' | head -100

Check for path permission issues: If the crawl user cannot read a subdirectory, the crawler skips it and continues, so the task still completes. Run a manual scan with verbose output and look for permission-denied errors:

python3 diskover.py -i diskover-test -v /path 2>&1 | grep -i 'permission\|denied'

Check --maxdepth: If the task was configured with a maxdepth limit, only files within that depth are indexed. The default is 999 (effectively unlimited).
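
A direct way to quantify the gap is to count files on disk and compare that with the document count in the index. The sketch below covers the filesystem side; count_files is a hypothetical helper, and it counts regular files only, ignoring any exclusions or depth limits configured in Diskover.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count regular files under a path.
count_files() {
    # -print0 emits one NUL per file; counting NUL bytes keeps filenames
    # that contain newlines from being miscounted.
    find "$1" -type f -print0 2>/dev/null | tr -dc '\0' | wc -c | tr -d ' '
}
```

Run `count_files /path/to/scan` as the crawl user, then compare with the Elasticsearch count API, for example `curl -s "http://<es-host>:9200/diskover-<indexname>/_count"`. The index count also includes directory documents, so a large shortfall rather than an exact mismatch is the signal to look for.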

Post-Index Plugins

Post-index plugins run after the crawl completes as standalone scripts, separate from the indexing process.

# Run a post-index plugin manually
python3 diskover_<name>/diskover_<name>.py <index_name>

# Dry run (where supported)
python3 diskover_<name>/diskover_<name>.py --dryrun <index_name>

# Verbose
python3 diskover_<name>/diskover_<name>.py -v <index_name>

If a post-index plugin fails, the index itself is unaffected — the plugin can be re-run against the same index after fixing the issue.
