Diskover Core Scanner Usage
License: Included with all Diskover editions
Module Type: Core Scanner (Default)
Author: Diskover Data, Inc.
Overview
The core scanner is the heart of every Diskover installation. When you run diskover.py, it walks your filesystems extracting metadata from every file and directory it encounters, and bulk-indexes everything into Elasticsearch. The result is a searchable inventory of your entire storage environment — sizes, owners, timestamps, and more — accessible from the Diskover Web UI.
Every Diskover installation uses the core scanner by default. You only need an alternate scanner if your data lives somewhere that isn't mountable as a regular filesystem (like S3, Azure Blob Storage, or Google Cloud Storage).
What Gets Indexed
For every file, the scanner captures name, parent path, owner, group, size (logical and allocated), hard link count, inode number, file extension, and the full mtime/atime/ctime timestamp triplet. For directories, it adds recursive rollups: total size, file count, directory count, and depth. When optional features like storage cost rollup, file age groups, or time rollup are enabled, those fields are added to directory documents as well.
The core scanner itself does not add any custom Elasticsearch fields beyond these built-ins. Any additional fields you see in a Diskover index come from either an index plugin (under plugins/) or an alternate scanner (under scanners/).
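To make the per-file field list above concrete, here is a minimal sketch that gathers the same kind of metadata with Python's standard library. It is not Diskover's implementation: the function name file_metadata and the dictionary keys are hypothetical, owner and group are left as numeric IDs, and the allocated size relies on st_blocks, which is only available on POSIX systems.

```python
import os
import tempfile


def file_metadata(path):
    """Collect scanner-style metadata for one file (illustrative only)."""
    st = os.lstat(path)  # lstat: describe a symlink itself, do not follow it
    name = os.path.basename(path)
    # st_blocks counts 512-byte units on POSIX; fall back to logical size elsewhere
    allocated = st.st_blocks * 512 if hasattr(st, "st_blocks") else st.st_size
    return {
        "name": name,
        "parent_path": os.path.dirname(os.path.abspath(path)),
        "extension": os.path.splitext(name)[1].lstrip(".").lower(),
        "size": st.st_size,        # logical size in bytes
        "size_du": allocated,      # allocated size in bytes (hypothetical key)
        "nlink": st.st_nlink,      # hard link count
        "ino": st.st_ino,          # inode number
        "owner": st.st_uid,        # numeric IDs; resolving names needs pwd/grp
        "group": st.st_gid,
        "mtime": st.st_mtime,
        "atime": st.st_atime,
        "ctime": st.st_ctime,
    }
```

Calling file_metadata on any readable path returns one dictionary per file, which is roughly the shape of a document before it is bulk-indexed.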
Choosing the Right Scanner
Diskover ships with the core scanner plus a suite of alternate scanners for non-POSIX or specialized storage backends. Use the table below to pick the right one.
| Scanner | Target Storage | Invocation | When to Use |
|---|---|---|---|
| Core | Local disk, NFS, SMB/CIFS, any POSIX-mounted filesystem | (default; no --altscanner) | Default; use whenever the data is reachable through a normal mount point |
| DirCache | Same as core, with an SQLite cache layer | --altscanner scandir_dircache | Repeated scans of slow or high-latency mounts where caching stat results saves time |
| S3 | Amazon S3 buckets (and S3-compatible object stores) | --altscanner scandir_s3 | Indexing S3 objects without a filesystem gateway |
| Azure | Azure Blob Storage (Blob and Data Lake Gen2) | | Indexing Azure cloud storage directly via REST |
| GCP | Google Cloud Storage buckets | | Indexing GCS without mounting gcsfuse |
| PowerScale | Dell PowerScale (Isilon) clusters | | Capturing PowerScale-specific metadata in one pass |
| OneDrive/SharePoint | Microsoft 365 OneDrive / SharePoint | | Indexing M365 user drives (standalone async scanner) |
| Spectra RioBroker | Spectra Logic RioBroker / tape libraries | Standalone CLI Scanner | Indexing Spectra-managed tape archives |
| Atempo | Atempo archive | | Indexing Atempo-archived content |
| Offline Media | Offline / tape media catalogs | | Indexing offline or tape-resident data from a manifest |
| Google | Google Workspace (Drive) | | Indexing Google Drive content |
For full documentation of each alternate scanner, see the individual scanner guides under Alternate Scanners in this knowledge base.
Requirements
| Component | Requirement |
|---|---|
| Python | 3.9 or higher (3.11.x recommended) |
| Diskover | Core installation (the scanner ships with it) |
| Elasticsearch | A reachable Elasticsearch 7.x or 8.x cluster, OR an alternate ingester configured |
| Operating System | Linux (RHEL 8/9, Rocky, Alma, Ubuntu) or Windows 10/11 / Windows Server |
| Memory | 4 GB minimum; 16+ GB recommended for production crawls |
| Disk | Free space for the local SQLite config DB and per-thread file metadata cache |
All required Python packages are included with the standard Diskover installation and pinned in /opt/diskover/requirements.txt.
Installation
The core scanner ships with the main Diskover installation — no separate package is needed.
Verify Diskover Is Installed
Linux:
python3 /opt/diskover/diskover.py --version
Windows:
python "C:\Program Files\Diskover\diskover.py" --version
Both should print diskover v2.5.0 (or the currently installed version).
Verify Elasticsearch Connectivity
Linux:
curl -sS http://localhost:9200/
Windows:
Invoke-WebRequest -Uri "http://localhost:9200/" -UseBasicParsing
If Diskover is configured for a remote cluster, substitute the host and port from your Elasticsearch configuration.
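If you prefer to script the connectivity check, the sketch below reads the JSON document that the Elasticsearch root endpoint returns and compares its version against the 7.x/8.x requirement. The function names are hypothetical, and the sketch assumes an unauthenticated HTTP endpoint like the one in the commands above; a cluster with TLS or authentication enabled would need extra handling.

```python
import json
import urllib.request


def supported_major(version_number, supported=(7, 8)):
    """Return True if a version string such as '8.11.1' has a supported major."""
    major = int(version_number.split(".")[0])
    return major in supported


def check_cluster(url="http://localhost:9200/", timeout=5):
    """Fetch the cluster root document and report (version, is_supported)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        info = json.load(resp)
    number = info["version"]["number"]
    return number, supported_major(number)
```

check_cluster() raises an exception from urllib if the cluster is unreachable, which is usually the signal you want in a pre-flight script.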
List Loaded Plugins
Verify that expected index plugins are discoverable:
Linux:
python3 /opt/diskover/diskover.py -l
Windows:
python "C:\Program Files\Diskover\diskover.py" -l
This prints the list of loaded plugins/ entries and exits.
CLI Usage
The core scanner is invoked without --altscanner. Top paths are positional arguments at the end of the command line.
Sample CLI Scan:
# python3 /opt/diskover/diskover.py -f -i diskover-test-scan /opt/diskover
2026-04-15 18:46:45,473 - diskover - INFO - configuration Default
2026-04-15 18:46:45,473 - diskover - INFO - Creating index diskover-test-scan...
2026-04-15 18:46:45,473 - diskover_elasticsearch - INFO - ES index diskover-test-scan already exists, deleting
2026-04-15 18:46:45,558 - diskover_elasticsearch - INFO - Tuning index settings for crawl
2026-04-15 18:46:45,572 - diskover - INFO - No plugins loaded
2026-04-15 18:46:45,572 - diskover - INFO - maxwalkthreads set to 4
2026-04-15 18:46:45,572 - diskover - INFO - indexthreads set to 16
2026-04-15 18:46:45,575 - diskover - INFO - Enqueuing dir tree /opt/diskover
2026-04-15 18:46:46,006 - diskover - INFO - [ThreadPoolExecutor-1_0] finished crawling /opt/diskover (3239 dirs, 15541 files, 352.0 MB) in 0d:0h:00m:00s
2026-04-15 18:46:46,022 - diskover - INFO - *** finished walking /opt/diskover ***
2026-04-15 18:46:46,023 - diskover - INFO - *** walk files 15541, skipped 123 ***
2026-04-15 18:46:46,023 - diskover - INFO - *** walk size 352.0 MB ***
2026-04-15 18:46:46,023 - diskover - INFO - *** walk du size 392.14 MB ***
2026-04-15 18:46:46,023 - diskover - INFO - *** walk dirs 3240, skipped 0 ***
2026-04-15 18:46:46,023 - diskover - INFO - *** walk took 0d:0h:00m:00s ***
2026-04-15 18:46:46,023 - diskover - INFO - *** walk perf 43851.119 inodes/s (max 43851.119, min 43851.119, avg 43851.119) ***
2026-04-15 18:46:46,023 - diskover - INFO - *** docs ingested 18781 ***
2026-04-15 18:46:46,023 - diskover - INFO - *** ingesting perf 22071.066 docs/s (max 22071.066, min 22071.066, avg 22071.066) ***
2026-04-15 18:46:46,023 - diskover - INFO - *** ingesting took 0d:0h:00m:00s ***
2026-04-15 18:46:46,023 - diskover - INFO - *** warnings/errors 0 ***
Single-Path Scan
cd /opt/diskover
python3 diskover.py /data
This creates an index named diskover-<path>-<datetime> containing every file and directory under /data.
Multi-Path Scan
python3 diskover.py /data /scratch /home
Each top path becomes its own subtree in the same Elasticsearch index. The walk thread pool processes top paths in parallel.
Custom Index Name with Force-Overwrite
python3 diskover.py -f -i diskover-data-prod /data
-f silently drops the existing diskover-data-prod index before recreating it. Without -f, Diskover refuses to scan into an index that already exists.
Append a New Top Path to an Existing Index
python3 diskover.py -a -i diskover-data-prod /scratch
-a keeps the existing index intact and adds the new top path's documents to it.
Remove a Top Path from an Existing Index
python3 diskover.py -r -i diskover-data-prod /scratch
-r deletes documents under the given top path without dropping the index itself.
Limit Descent Depth
python3 diskover.py -m 3 /data
Only descend up to 3 directory levels below /data.
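The idea behind a depth limit can be sketched with os.walk by pruning the list of subdirectories once the limit is reached. This is an illustration only, not Diskover's walker: the name walk_limited is hypothetical, and whether -m counts the boundary level exactly the way this sketch does is an assumption.

```python
import os
import tempfile


def walk_limited(top, max_depth):
    """Yield (dirpath, depth) for directories at most max_depth levels below top."""
    top = os.path.abspath(top)
    base_depth = top.rstrip(os.sep).count(os.sep)
    for dirpath, dirnames, _filenames in os.walk(top):
        depth = dirpath.rstrip(os.sep).count(os.sep) - base_depth
        yield dirpath, depth
        if depth >= max_depth:
            dirnames[:] = []  # prune in place so os.walk stops descending
```

Clearing dirnames in place is the documented way to tell os.walk not to descend further, so directories below the limit are never listed at all.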
Thread Tuning
# Manually set crawl and walk thread counts
python3 diskover.py --threads 16 --walkthreads 4 /data /scratch

# Cap thread dispatch at depth 4
python3 diskover.py --threaddepth 4 /data

# Let Diskover auto-compute the optimal thread depth for this tree
python3 diskover.py --autothreaddepth /data
--threads, --walkthreads, and --threaddepth override their corresponding configuration fields for the duration of the run.
Dirs-Only Scan
python3 diskover.py --nofiles /data
Indexes only directory documents (including rollup fields). Useful for fast capacity reports when individual file metadata isn't needed.
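As a rough picture of what directory rollup fields contain, the sketch below computes recursive size, file count, and directory count for every directory in a tree. It is a simplified stand-in, not Diskover's rollup code: the function name rollup is hypothetical, only logical sizes are summed, and whether Diskover counts a directory itself in its own totals is not covered here.

```python
import os
import tempfile


def rollup(top):
    """Return {dirpath: (total_size, file_count, dir_count)} with recursive totals."""
    totals = {}
    # topdown=False visits children before parents, so child totals are ready
    for dirpath, dirnames, filenames in os.walk(top, topdown=False):
        size = 0
        files = 0
        for name in filenames:
            try:
                size += os.lstat(os.path.join(dirpath, name)).st_size
                files += 1
            except OSError:
                continue  # unreadable or vanished file: skip it
        dirs = 0
        for name in dirnames:
            child = totals.get(os.path.join(dirpath, name))
            if child is None:
                continue  # symlinked or unreadable directory was not walked
            size += child[0]
            files += child[1]
            dirs += child[2] + 1  # the child itself plus its descendants
        totals[dirpath] = (size, files, dirs)
    return totals
```

Because every directory carries its own totals, a capacity report can be built from directory documents alone, which is what makes a dirs-only scan useful.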
Verbose / Debug Logging
python3 diskover.py -v /data       # verbose
python3 diskover.py -V /data       # very verbose
python3 diskover.py --debug /data  # debug level
Use an Alternate Scanner
python3 diskover.py --altscanner scandir_s3 s3://my-bucket/prefix
The core scanner loads the alternate scanner module and runs the rest of the crawl pipeline transparently against the alternate backend.
Use an Alternate Ingester
python3 diskover.py --altingester my_ingester /data
Bulk uploads are routed through the alternate ingester instead of Elasticsearch. When an alternate ingester is in use, Elasticsearch index creation and tuning are skipped entirely.
Use a Named Configuration
python3 diskover.py -c production /data
Loads the Diskover.Configurations.production scope instead of the default. You can combine -c with any of the other flags above.
Task-Based Usage (diskoverd)
In production, the core scanner is rarely invoked by hand. Instead, the Diskover daemon diskoverd reads tasks from the local config database and spawns diskover.py as a subprocess on a cron schedule (or on demand from the Admin UI).
How Tasks Work
Each task carries everything diskoverd needs to build the diskover.py command line: the crawl paths, index naming strategy, configuration variant, alternate scanner selection, scheduling, and reliability settings. When a task fires, diskoverd resolves any date tokens in the index name, assembles the command-line flags, and executes the scan as a subprocess on the assigned worker.
Task Field to CLI Flag Mapping
| Task Field | CLI Flag | Description |
|---|---|---|
| name | -n | Task name added to IndexInfo |
| (task primary key) | -t | Task ID added to IndexInfo |
| crawl_paths | positional | One or many top paths |
| auto_index_name | (none) | If true, Diskover generates a default index name |
| custom_index_name | -i | Custom name; supports strftime date tokens |
| overwrite_existing | -f | Drop existing index before scanning |
| add_to_index | -a | Append to existing index |
| alt_scanner | --altscanner | Name of the alternate scanner to use |
| alt_config_file | -c | Named configuration scope |
| cli_options | (raw) | Free-form additional flags passed verbatim |
| assigned_worker | (routing) | Which diskoverd worker picks up the task |
| | (env) | Extra environment variables for the subprocess |
| timeout | (subprocess) | Hard wall-clock kill time |
| retries | (subprocess) | Number of retries on failure (exit code other than 0 or 64) |
| retry_delay | (subprocess) | Delay between retries |
| | (hook) | Command to run before the scan |
| | (hook) | Command to run after the scan |
| run_min, run_hour, run_day_month, run_month, run_day_week | (cron) | Standard 5-field cron schedule |
Example Task Configuration
Here is a typical nightly production task:
name: "Nightly /data scan"
description: "Full /data crawl, drops yesterday's index"
crawl_paths:
  - /data

# Index naming
auto_index_name: false
custom_index_name: "diskover-data-%Y%m%d"
overwrite_existing: true
add_to_index: false

# Configuration scope
use_default_config: false
alt_config_file: "production"

# No alt scanner; use the core scanner
alt_scanner: null

# Extra CLI flags
cli_options: "-v --autothreaddepth"

# Cron: every day at 02:15
run_min: "15"
run_hour: "2"
run_day_month: "*"
run_month: "*"
run_day_week: "*"

# Worker routing and reliability
assigned_worker: "diskoverd-worker-1"
timeout: 21600  # 6 hours
retries: 2
retry_delay: 600
When this task fires, diskoverd resolves the date tokens and builds a command line equivalent to:
python3 /opt/diskover/diskover.py \
  -n "Nightly /data scan" \
  -t <task_id> \
  -i diskover-data-20260415 \
  -f \
  -c production \
  -v --autothreaddepth \
  /data
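The translation from task fields to a command line can be pictured with a short sketch. This is a hypothetical reconstruction based only on the example task and command shown above, not diskoverd's actual code: the function build_command, the flag ordering, and the task ID 42 used in the usage note are assumptions.

```python
import shlex


def build_command(task, task_id, index_name, script="/opt/diskover/diskover.py"):
    """Assemble a diskover.py argument list from a task dictionary.

    index_name is expected to have its date tokens already resolved.
    """
    cmd = ["python3", script, "-n", task["name"], "-t", str(task_id)]
    if not task.get("auto_index_name"):
        cmd += ["-i", index_name]
    if task.get("overwrite_existing"):
        cmd.append("-f")
    if task.get("add_to_index"):
        cmd.append("-a")
    if not task.get("use_default_config") and task.get("alt_config_file"):
        cmd += ["-c", task["alt_config_file"]]
    if task.get("alt_scanner"):
        cmd += ["--altscanner", task["alt_scanner"]]
    # cli_options is free-form text, so split it the way a shell would
    cmd += shlex.split(task.get("cli_options") or "")
    cmd += list(task["crawl_paths"])
    return cmd
```

Passing the nightly task above with a task ID of 42 and the resolved name diskover-data-20260415 yields the same flags, in the same order, as the command shown. Returning a list rather than a single string avoids quoting problems with task names that contain spaces.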
Date Tokens in Index Names
The custom_index_name field supports Python strftime tokens that are resolved at task firing time:
| Token | Expands To | Example |
|---|---|---|
| %Y | 4-digit year | 2026 |
| %m | 2-digit month | 04 |
| %d | 2-digit day | 15 |
| %H | 2-digit hour (24h) | 02 |
| %M | 2-digit minute | 15 |
| %S | 2-digit second | 00 |
So diskover-data-%Y%m%d becomes diskover-data-20260415 when the task fires on April 15, 2026.
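Because these are ordinary strftime tokens, you can preview how a template will resolve before saving a task. The helper name resolve_index_name below is hypothetical; the expansion itself is standard Python behavior.

```python
from datetime import datetime


def resolve_index_name(template, when):
    """Expand strftime tokens in an index-name template for a given firing time."""
    return when.strftime(template)


# A task firing at 02:15:00 on April 15, 2026
fired_at = datetime(2026, 4, 15, 2, 15, 0)
print(resolve_index_name("diskover-data-%Y%m%d", fired_at))  # diskover-data-20260415
```

Adding %H%M%S to the template gives every run a distinct index name even when a task fires more than once a day.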
Alternate Scanners
Alternate scanners replace the core scanner's native calls with custom implementations that access cloud storage, enterprise storage APIs, or other specialized data sources. They come in two execution models.
Standard Alternate Scanners (--altscanner)
These integrate with the core scanner via the --altscanner flag. The core scanner loads the module, replaces the filesystem traversal functions, and runs the rest of the crawl pipeline (threading, bulk indexing, error handling) transparently.
python3 diskover.py --altscanner scandir_s3 s3://my-bucket/prefix
Scanners using this model: S3, Azure, GCP, DirCache, Offline Media, PowerScale, Atempo, Google.
In Index Tasks, set the Alternate Scanner field to the scanner name (e.g., scandir_s3). The task's crawl paths should use the scanner-specific path format.
Standalone Async Scanners
These have their own CLI entry points, asyncio event loops, and direct Elasticsearch integration. They do not use the --altscanner flag and are not configurable as Index Tasks.
# Example: OneDrive scanner
python3 scandir_onedrive.py --config onedrive_config.toml
Scanners using this model: OneDrive/SharePoint, Spectra RioBroker.
Scanner Configuration
Standard alternate scanners are configured in the Diskover Admin UI under Settings > Alternate Scanners > [Scanner Name]. Standalone async scanners use their own configuration files (typically TOML or YAML) in the scanner directory.
Alternate Ingesters
Alternate ingesters redirect the scanner's output away from Elasticsearch and into a different downstream sink. When an alternate ingester is active, Elasticsearch index creation, tuning, and info-index updates are all skipped.
python3 diskover.py --altingester my_ingester /data
Both bulk uploads and error documents are routed through the alternate ingester, so it receives the full document stream. Alternate ingesters are resolved with the same lookup order as alternate scanners: first the config database scope, then the ingesters/ directory as a fallback.
Exit Codes
| Exit Code | Meaning | Treated as Success by diskoverd? |
|---|---|---|
| 0 | Crawl completed cleanly with zero warnings | Yes |
| 64 | Crawl completed but with non-zero warnings (typically permission errors that became error documents) | Yes |
| Any other | Fatal error (config/license/network failure, unhandled exception) | No; diskoverd retries per the task's retry settings |
A run that exits with code 64 is still a successful crawl — every reachable file and directory was indexed, but at least one path produced an error document. Check the warning log or query Elasticsearch for documents with an error: field to see what went wrong.
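If you wrap diskover.py in your own automation instead of diskoverd, the same success rule is easy to reproduce. The sketch below is an assumption-laden illustration, not diskoverd's logic: the names is_success and run_with_retries are hypothetical, and the default retry count and delay simply echo the example task values.

```python
import subprocess
import time

SUCCESS_CODES = {0, 64}  # clean run, or run that finished with warnings


def is_success(returncode):
    """Return True for exit codes that count as a successful crawl."""
    return returncode in SUCCESS_CODES


def run_with_retries(cmd, retries=2, retry_delay=600, timeout=None):
    """Run cmd, retrying on failure.

    Returns the last exit code, or None if the final attempt timed out.
    """
    attempts = retries + 1
    for attempt in range(1, attempts + 1):
        try:
            code = subprocess.run(cmd, timeout=timeout).returncode
        except subprocess.TimeoutExpired:
            code = None  # the child was killed; treat as a failure
        if code is not None and is_success(code):
            return code
        if attempt < attempts:
            time.sleep(retry_delay)
    return code
```

Treating 64 as success keeps a handful of permission warnings from triggering a full rescan, while still leaving the code available to the caller for reporting.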
Troubleshooting
Index Already Exists
Symptom: The crawl aborts immediately with Index diskover-data-prod already exists, not crawling, use -f to overwrite.
Resolution:
To replace the existing index:
python3 diskover.py -f -i diskover-data-prod /data
To append a new top path:
python3 diskover.py -a -i diskover-data-prod /scratch
To keep both, use a different index name:
python3 diskover.py -i diskover-data-$(date +%Y%m%d) /data
SLOW PATH Warnings
Symptom: The log shows SLOW PATH: /data/projectA (612s) and the crawl takes hours.
Resolution:
Tune slowdirtime upward if 600 seconds is too tight for your storage
Set slowdirtimestopscan: true to abort scanning of pathologically slow directories
Add the slow path to excludes.dirs if it's not needed
Increase --threads for more parallelism
Try --autothreaddepth for deep or narrow trees
Permission Denied Errors as Error Documents
Symptom: The scan completes (often with exit code 64) and Elasticsearch contains documents with an error: field.
Resolution:
Run Diskover as a user with read access to the affected paths
If inaccessible paths are off-limits, exclude them via excludes.dirs
For SMB/CIFS shares, verify mount credentials have list/read permission at the directory level
Diagnosis:
curl -sS "http://localhost:9200/diskover-data-prod/_search?q=_exists_:error&pretty" | head -40
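The same diagnosis can be scripted. The sketch below builds the equivalent _search URL and, given a reachable cluster, returns the matching documents. The function names are hypothetical, and it assumes the unauthenticated localhost endpoint and the error field name used in the curl command above.

```python
import json
import urllib.parse
import urllib.request


def error_docs_url(index, host="http://localhost:9200", size=10):
    """Build a _search URL for documents that have an 'error' field."""
    query = urllib.parse.urlencode({"q": "_exists_:error", "size": size})
    return f"{host}/{urllib.parse.quote(index)}/_search?{query}"


def fetch_error_docs(index, **kwargs):
    """Return the _source of each matching hit (requires a reachable cluster)."""
    with urllib.request.urlopen(error_docs_url(index, **kwargs), timeout=10) as resp:
        body = json.load(resp)
    return [hit["_source"] for hit in body["hits"]["hits"]]
```

For example, fetch_error_docs("diskover-data-prod", size=50) would return up to 50 error documents, which you can then group by parent path to see where permissions need attention.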
Exit Code 64 (Warnings)
Symptom: diskoverd reports the task as success, but the log shows Exit code: 64 and warnings: <N>.
Resolution:
Check the warning log file:
grep -E "WARNING|ERROR" /var/log/diskover/*warning*.log | tail
Address the underlying causes where possible (permissions, mount stability, exclude rules)
Accept exit 64 as expected for environments with some unavoidable warnings; diskoverd already treats it as success
BulkIndexError: Document Mapping Conflict
Symptom: BULK INDEX ERROR lines in the log; some documents are missing from the final index.
Resolution:
Force a fresh index:
python3 diskover.py -f -i diskover-data-prod /data
If you can't drop the index, fix the mapping at the source (typically a plugin or alt scanner that changed a field's type)
Thread Starvation on Shallow Trees
Symptom: A scan with --threads 16 runs no faster than --threads 4, and CPU usage stays low.
Resolution:
Use --autothreaddepth to let Diskover pick the right dispatch depth
Or manually set --threaddepth to the level where the tree fans out:
python3 diskover.py --threaddepth 4 --threads 16 /data
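If you want to choose a --threaddepth value by hand, one way to find where a tree fans out is to count directories at each level. The sketch below is a hypothetical helper, not the algorithm behind --autothreaddepth; the function names and the idea of picking the first level with at least as many directories as threads are assumptions.

```python
import os
import tempfile
from collections import Counter


def dirs_per_depth(top, max_depth=6):
    """Count directories at each depth below top (depth 0 is top itself)."""
    top = os.path.abspath(top)
    base = top.rstrip(os.sep).count(os.sep)
    counts = Counter()
    for dirpath, dirnames, _files in os.walk(top):
        depth = dirpath.rstrip(os.sep).count(os.sep) - base
        counts[depth] += 1
        if depth >= max_depth:
            dirnames[:] = []  # stop descending past max_depth
    return dict(counts)


def first_depth_with(counts, min_dirs):
    """Return the shallowest depth holding at least min_dirs directories, or None."""
    for depth in sorted(counts):
        if counts[depth] >= min_dirs:
            return depth
    return None
```

For a run with --threads 16, first_depth_with(dirs_per_depth("/data"), 16) gives a candidate depth to try; treat it as a starting point and compare crawl times rather than as a guaranteed optimum.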
NFS or SMB Latency
Symptom: The scan is dominated by stat() time, not Elasticsearch upload time.
Resolution:
Increase --threads to overlap more stat() calls
For repeated scans, consider the DirCache alternate scanner:
python3 diskover.py --altscanner scandir_dircache /mnt/nas
For NFS, mount with ro,noatime,nodiratime
For CIFS, set restoretimes: true in the configuration
Debug Logging
# Linux
python3 /opt/diskover/diskover.py --debug /data

# Windows
python "C:\Program Files\Diskover\diskover.py" --debug /data
Log File Locations
Linux: /var/log/diskover/
Windows: C:\Program Files\diskover\logs\ (or your configured log location)
Support
Last Updated: April 2026