Diskover Core Scanner Usage

License: Included with all Diskover editions
Module Type: Core Scanner (Default)
Author: Diskover Data, Inc.

Overview

The core scanner is the heart of every Diskover installation. When you run diskover.py, it walks your filesystems, extracts metadata from every file and directory it encounters, and bulk-indexes the results into Elasticsearch. The result is a searchable inventory of your entire storage environment (sizes, owners, timestamps, and more), accessible from the Diskover Web UI.

Every Diskover installation uses the core scanner by default. You only need an alternate scanner if your data lives somewhere that isn't mountable as a regular filesystem (like S3, Azure Blob Storage, or Google Cloud Storage).

What Gets Indexed

For every file, the scanner captures name, parent path, owner, group, size (logical and allocated), hard link count, inode number, file extension, and the full mtime/atime/ctime timestamp triplet. For directories, it adds recursive rollups: total size, file count, directory count, and depth. When optional features like storage cost rollup, file age groups, or time rollup are enabled, those fields are added to directory documents as well.
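
As a rough illustration of where the per-file fields come from, the sketch below reads them from a single stat call using only the Python standard library. It is not Diskover's own code: the function name file_metadata and the dictionary keys are hypothetical, owner and group are left as numeric IDs, and lowercasing the extension is an assumption made for the example.

```python
import os
import tempfile


def file_metadata(path):
    """Collect per-file fields like those listed above (key names are illustrative)."""
    st = os.lstat(path)  # lstat describes a symlink itself instead of following it
    name = os.path.basename(path)
    # st_blocks counts 512-byte units on POSIX; the attribute is absent on Windows
    blocks = getattr(st, "st_blocks", None)
    return {
        "name": name,
        "parent_path": os.path.dirname(os.path.abspath(path)),
        "extension": os.path.splitext(name)[1].lstrip(".").lower(),
        "size": st.st_size,  # logical size in bytes
        "size_du": blocks * 512 if blocks is not None else st.st_size,  # allocated
        "owner": st.st_uid,  # numeric IDs; resolving names is left out
        "group": st.st_gid,
        "nlink": st.st_nlink,
        "ino": st.st_ino,
        "mtime": st.st_mtime,
        "atime": st.st_atime,
        "ctime": st.st_ctime,
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "report.CSV")
        with open(sample, "wb") as fh:
            fh.write(b"x" * 1000)
        print(file_metadata(sample))
```

The gap between size and size_du in this sketch mirrors the difference between the "walk size" and "walk du size" figures that appear in scan logs.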

The core scanner itself does not add any custom Elasticsearch fields beyond these built-ins. Any additional fields you see in a Diskover index come from either an index plugin (under plugins/) or an alternate scanner (under scanners/).

Choosing the Right Scanner

Diskover ships with the core scanner plus a suite of alternate scanners for non-POSIX or specialized storage backends. Use the table below to pick the right one.

Scanner	Target Storage	Invocation	When to Use
Core (`diskover.py`)	Local disk, NFS, SMB/CIFS, any POSIX-mounted filesystem	`python3 diskover.py /path`	Default — use whenever the data is reachable through a normal mount point
`scandir_dircache`	Same as core, with an SQLite cache layer	`--altscanner scandir_dircache /path`	Repeated scans of slow or high-latency mounts where caching stat results saves time
`scandir_s3`	Amazon S3 buckets (and S3-compatible object stores)	`--altscanner scandir_s3 s3://bucket/prefix`	Indexing S3 objects without a filesystem gateway
`scandir_azure`	Azure Blob Storage (Blob and Data Lake Gen2)	`--altscanner scandir_azure <container-path>`	Indexing Azure cloud storage directly via REST
`scandir_gcp`	Google Cloud Storage buckets	`--altscanner scandir_gcp gs://bucket/prefix`	Indexing GCS without mounting gcsfuse
`scandir_powerscale`	Dell PowerScale (Isilon) clusters	`--altscanner scandir_powerscale /path`	Capturing PowerScale-specific metadata in one pass
`scandir_onedrive`	Microsoft 365 OneDrive / SharePoint	Standalone CLI Scanner	Indexing M365 user drives (standalone async scanner)
`scandir_spectra`	Spectra Logic RioBroker / tape libraries	Standalone CLI Scanner	Indexing Spectra-managed tape archives
`scandir_atempo`	Atempo archive	`--altscanner scandir_atempo <path>`	Indexing Atempo-archived content
`scandir_offline_media`	Offline / tape media catalogs	`--altscanner scandir_offline_media <path>`	Indexing offline or tape-resident data from a manifest
`scandir_google`	Google Workspace (Drive)	`--altscanner scandir_google <drive folder id>`	Indexing Google Drive content

For full documentation of each alternate scanner, see the individual scanner guides under Alternate Scanners in this knowledge base.

Requirements

Component	Requirement
Python	3.9 or higher (3.11.x recommended)
Diskover	Core installation (the scanner ships with it)
Elasticsearch	A reachable Elasticsearch 7.x or 8.x cluster, OR an alternate ingester configured
Operating System	Linux (RHEL 8/9, Rocky, Alma, Ubuntu) or Windows 10/11 / Windows Server
Memory	4 GB minimum; 16+ GB recommended for production crawls
Disk	Free space for the local SQLite config DB and per-thread file metadata cache

All required Python packages are included with the standard Diskover installation and pinned in /opt/diskover/requirements.txt.

Installation

The core scanner ships with the main Diskover installation — no separate package is needed.

Verify Diskover Is Installed

Linux:

python3 /opt/diskover/diskover.py --version

Windows:

python "C:\Program Files\Diskover\diskover.py" --version

Both should print diskover v2.5.0 (or the currently installed version).

Verify Elasticsearch Connectivity

Linux:

curl -sS http://localhost:9200/

Windows:

Invoke-WebRequest -Uri "http://localhost:9200/" -UseBasicParsing

If Diskover is configured for a remote cluster, substitute the host and port from your Elasticsearch configuration.

List Loaded Plugins

Verify that expected index plugins are discoverable:

Linux:

python3 /opt/diskover/diskover.py -l

Windows:

python "C:\Program Files\Diskover\diskover.py" -l

This prints the plugins loaded from the plugins/ directory and exits.

CLI Usage

The core scanner is invoked without --altscanner. Top paths are positional arguments at the end of the command line.

Sample CLI Scan:

python3 /opt/diskover/diskover.py -f -i diskover-test-scan /opt/diskover

2026-04-15 18:46:45,473 - diskover - INFO - configuration Default
2026-04-15 18:46:45,473 - diskover - INFO - Creating index diskover-test-scan...
2026-04-15 18:46:45,473 - diskover_elasticsearch - INFO - ES index diskover-test-scan already exists, deleting
2026-04-15 18:46:45,558 - diskover_elasticsearch - INFO - Tuning index settings for crawl
2026-04-15 18:46:45,572 - diskover - INFO - No plugins loaded
2026-04-15 18:46:45,572 - diskover - INFO - maxwalkthreads set to 4
2026-04-15 18:46:45,572 - diskover - INFO - indexthreads set to 16
2026-04-15 18:46:45,575 - diskover - INFO - Enqueuing dir tree /opt/diskover
2026-04-15 18:46:46,006 - diskover - INFO - [ThreadPoolExecutor-1_0] finished crawling /opt/diskover (3239 dirs, 15541 files, 352.0 MB) in 0d:0h:00m:00s
2026-04-15 18:46:46,022 - diskover - INFO - *** finished walking /opt/diskover ***
2026-04-15 18:46:46,023 - diskover - INFO - *** walk files 15541, skipped 123 ***
2026-04-15 18:46:46,023 - diskover - INFO - *** walk size 352.0 MB ***
2026-04-15 18:46:46,023 - diskover - INFO - *** walk du size 392.14 MB ***
2026-04-15 18:46:46,023 - diskover - INFO - *** walk dirs 3240, skipped 0 ***
2026-04-15 18:46:46,023 - diskover - INFO - *** walk took 0d:0h:00m:00s ***
2026-04-15 18:46:46,023 - diskover - INFO - *** walk perf 43851.119 inodes/s (max 43851.119, min 43851.119, avg 43851.119) ***
2026-04-15 18:46:46,023 - diskover - INFO - *** docs ingested 18781 ***
2026-04-15 18:46:46,023 - diskover - INFO - *** ingesting perf 22071.066 docs/s (max 22071.066, min 22071.066, avg 22071.066) ***
2026-04-15 18:46:46,023 - diskover - INFO - *** ingesting took 0d:0h:00m:00s ***
2026-04-15 18:46:46,023 - diskover - INFO - *** warnings/errors 0 ***

Single-Path Scan

cd /opt/diskover
python3 diskover.py /data

This creates an index named diskover-<path>-<datetime> containing every file and directory under /data.

Multi-Path Scan

python3 diskover.py /data /scratch /home

Each top path becomes its own subtree in the same Elasticsearch index. The walk thread pool processes top paths in parallel.
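
The parallel handling of top paths can be pictured with a small thread-pool sketch. This is an illustration under assumptions, not Diskover's implementation: count_tree and scan_top_paths are hypothetical names, and the default of 4 workers simply echoes the maxwalkthreads value in the sample log above.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def count_tree(top):
    """Walk one top path and return (top, dir_count, file_count)."""
    dirs = files = 0
    for _root, dirnames, filenames in os.walk(top):
        dirs += len(dirnames)
        files += len(filenames)
    return top, dirs, files


def scan_top_paths(top_paths, walk_threads=4):
    """Walk several top paths concurrently, one worker per path."""
    with ThreadPoolExecutor(max_workers=walk_threads) as pool:
        # map() returns results in the same order as top_paths
        return list(pool.map(count_tree, top_paths))


if __name__ == "__main__":
    for top, dirs, files in scan_top_paths([os.getcwd()]):
        print(f"{top}: {dirs} dirs, {files} files")
```

Threads suit this workload because directory listing and stat calls spend most of their time waiting on storage rather than on the CPU.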

Custom Index Name with Force-Overwrite

python3 diskover.py -f -i diskover-data-prod /data

-f silently drops the existing diskover-data-prod index before recreating it. Without -f, Diskover refuses to scan into an index that already exists.

Append a New Top Path to an Existing Index

python3 diskover.py -a -i diskover-data-prod /scratch

-a keeps the existing index intact and adds the new top path's documents to it.

Remove a Top Path from an Existing Index

python3 diskover.py -r -i diskover-data-prod /scratch

-r deletes documents under the given top path without dropping the index itself.

Limit Descent Depth

python3 diskover.py -m 3 /data

Descends at most 3 directory levels below /data.
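
A depth-limited walk of this kind can be sketched as follows. The function name walk_max_depth is hypothetical, and the boundary rule used here (the top path is depth 0, and directories at depth 3 are still visited but not descended into) is an assumption about how a limit of 3 behaves, not a statement of Diskover's exact semantics.

```python
import os


def walk_max_depth(top, max_depth):
    """Yield (directory, depth) for directories at most max_depth levels below top."""
    stack = [(os.path.abspath(top), 0)]
    while stack:
        path, depth = stack.pop()
        yield path, depth
        if depth >= max_depth:
            continue  # limit reached: list this directory but do not go deeper
        with os.scandir(path) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append((entry.path, depth + 1))


if __name__ == "__main__":
    for directory, depth in walk_max_depth(os.getcwd(), 1):
        print(depth, directory)
```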

Thread Tuning

# Manually set crawl and walk thread counts
python3 diskover.py --threads 16 --walkthreads 4 /data /scratch

# Cap thread dispatch at depth 4
python3 diskover.py --threaddepth 4 /data

# Let Diskover auto-compute the optimal thread depth for this tree
python3 diskover.py --autothreaddepth /data

--threads, --walkthreads, and --threaddepth override their corresponding configuration fields for the duration of the run.

Dirs-Only Scan

python3 diskover.py --nofiles /data

Indexes only directory documents (including rollup fields). Useful for fast capacity reports when individual file metadata isn't needed.
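
To make the rollup fields concrete, here is a minimal sketch that computes a recursive size, file count, directory count, and depth for every directory in a tree. The function name directory_rollups and the key names are hypothetical, and whether Diskover counts a directory in its own dir_count is not specified here, so this example counts only descendants.

```python
import os


def directory_rollups(top):
    """Map each directory under top to its recursive size, counts, and depth."""
    rollups = {}

    def visit(path, depth):
        size = files = dirs = 0
        with os.scandir(path) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    sub = visit(entry.path, depth + 1)
                    size += sub["size"]
                    files += sub["file_count"]
                    dirs += sub["dir_count"] + 1  # the child plus its own subdirs
                elif entry.is_file(follow_symlinks=False):
                    size += entry.stat(follow_symlinks=False).st_size
                    files += 1
        rollups[path] = {"size": size, "file_count": files,
                         "dir_count": dirs, "depth": depth}
        return rollups[path]

    visit(os.path.abspath(top), 0)
    return rollups


if __name__ == "__main__":
    for path, info in sorted(directory_rollups(os.getcwd()).items()):
        print(path, info)
```

Because each directory's totals are built from its children's totals, a capacity report can be answered from the directory documents alone, which is what makes a dirs-only index useful.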

Verbose / Debug Logging

python3 diskover.py -v /data         # verbose
python3 diskover.py -V /data         # very verbose
python3 diskover.py --debug /data    # debug level

Use an Alternate Scanner

python3 diskover.py --altscanner scandir_s3 s3://my-bucket/prefix

The core scanner loads the alternate scanner module and runs the rest of the crawl pipeline transparently against the alternate backend.

Use an Alternate Ingester

python3 diskover.py --altingester my_ingester /data

Bulk uploads are routed through the alternate ingester instead of Elasticsearch. When an alternate ingester is in use, Elasticsearch index creation and tuning are skipped entirely.

Use a Named Configuration

python3 diskover.py -c production /data

Loads the Diskover.Configurations.production scope instead of the default. You can combine -c with any of the other flags above.

Task-Based Usage (diskoverd)

In production, the core scanner is rarely invoked by hand. Instead, the Diskover daemon diskoverd reads tasks from the local config database and spawns diskover.py as a subprocess on a cron schedule (or on demand from the Admin UI).

How Tasks Work

Each task carries everything diskoverd needs to build the diskover.py command line: the crawl paths, index naming strategy, configuration variant, alternate scanner selection, scheduling, and reliability settings. When a task fires, diskoverd resolves any date tokens in the index name, assembles the command-line flags, and executes the scan as a subprocess on the assigned worker.

Task Field to CLI Flag Mapping

Task Field	CLI Flag	Description
`name`	`-n / --name`	Task name added to IndexInfo
(task primary key)	`-t / --task`	Task ID added to IndexInfo
`crawl_paths`	positional `tree_dir` arguments	One or many top paths
`auto_index_name`	(none)	If true, Diskover generates a default index name
`custom_index_name`	`-i / --index`	Custom name; supports `%Y %m %d %H %M %S` date tokens
`overwrite_existing`	`-f / --forcedropexisting`	Drop existing index before scanning
`add_to_index`	`-a / --addtoindex`	Append to existing index
`alt_scanner`	`--altscanner`	Name of the alternate scanner to use
`use_default_config` / `alt_config_file`	`-c / --configurationname`	Named configuration scope
`cli_options`	(raw)	Free-form additional flags passed verbatim
`assigned_worker` / `allowed_workers`	(routing)	Which diskoverd worker picks up the task
`env_vars`	(env)	Extra environment variables for the subprocess
`timeout`	(subprocess)	Hard wall-clock kill time
`retries`	(subprocess)	Number of retries on failure (exit code other than 0 or 64)
`retry_delay`	(subprocess)	Delay between retries
`pre_command` / `pre_command_args`	(hook)	Command to run before the scan
`post_command` / `post_command_args`	(hook)	Command to run after the scan
`run_min` `run_hour` `run_day_month` `run_month` `run_day_week`	(cron)	Standard 5-field cron schedule

Example Task Configuration

Here is a typical nightly production task:

name: "Nightly /data scan"
description: "Full /data crawl, drops yesterday's index"
crawl_paths:
  - /data

# Index naming
auto_index_name: false
custom_index_name: "diskover-data-%Y%m%d"
overwrite_existing: true
add_to_index: false

# Configuration scope
use_default_config: false
alt_config_file: "production"

# No alt scanner — use core scanner
alt_scanner: null

# Extra CLI flags
cli_options: "-v --autothreaddepth"

# Cron: every day at 02:15
run_min: "15"
run_hour: "2"
run_day_month: "*"
run_month: "*"
run_day_week: "*"

# Worker routing and reliability
assigned_worker: "diskoverd-worker-1"
timeout: 21600     # 6 hours
retries: 2
retry_delay: 600

When this task fires, diskoverd resolves the date tokens and builds a command line equivalent to:

python3 /opt/diskover/diskover.py \
  -n "Nightly /data scan" \
  -t <task_id> \
  -i diskover-data-20260415 \
  -f \
  -c production \
  -v --autothreaddepth \
  /data
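
The assembly step can be approximated in a few lines of Python. The sketch below follows the field-to-flag table above but is not diskoverd's actual code: the function name build_command, the dictionary shape of the task, and the ordering of flags are assumptions chosen to reproduce the command shown.

```python
import shlex
from datetime import datetime


def build_command(task, task_id, fired_at, script="/opt/diskover/diskover.py"):
    """Translate a task dictionary into a diskover.py argument list."""
    argv = ["python3", script, "-n", task["name"], "-t", str(task_id)]
    if not task.get("auto_index_name") and task.get("custom_index_name"):
        # Date tokens are resolved against the time the task fires
        argv += ["-i", fired_at.strftime(task["custom_index_name"])]
    if task.get("overwrite_existing"):
        argv.append("-f")
    if task.get("add_to_index"):
        argv.append("-a")
    if task.get("alt_scanner"):
        argv += ["--altscanner", task["alt_scanner"]]
    if not task.get("use_default_config", True) and task.get("alt_config_file"):
        argv += ["-c", task["alt_config_file"]]
    argv += shlex.split(task.get("cli_options") or "")  # free-form extra flags
    argv += list(task["crawl_paths"])  # top paths always come last
    return argv


if __name__ == "__main__":
    nightly = {
        "name": "Nightly /data scan",
        "crawl_paths": ["/data"],
        "auto_index_name": False,
        "custom_index_name": "diskover-data-%Y%m%d",
        "overwrite_existing": True,
        "add_to_index": False,
        "use_default_config": False,
        "alt_config_file": "production",
        "alt_scanner": None,
        "cli_options": "-v --autothreaddepth",
    }
    print(shlex.join(build_command(nightly, 42, datetime(2026, 4, 15, 2, 15))))
```

The task ID 42 is a placeholder for the task's primary key. Building an argument list rather than a single string avoids quoting problems with names such as "Nightly /data scan".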

Date Tokens in Index Names

The custom_index_name field supports Python strftime tokens that are resolved at task firing time:

Token	Expands To	Example
`%Y`	4-digit year	2026
`%m`	2-digit month	04
`%d`	2-digit day	15
`%H`	2-digit hour (24h)	02
`%M`	2-digit minute	15
`%S`	2-digit second	00

So diskover-data-%Y%m%d becomes diskover-data-20260415 when the task fires on April 15, 2026.
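
Since these are ordinary strftime tokens, a template can be checked in a Python shell before it is saved in a task. The helper name resolve_index_name is hypothetical and only wraps the standard datetime.strftime call.

```python
from datetime import datetime


def resolve_index_name(template, fired_at):
    """Expand strftime tokens in an index name template."""
    return fired_at.strftime(template)


fired_at = datetime(2026, 4, 15, 2, 15, 0)
print(resolve_index_name("diskover-data-%Y%m%d", fired_at))
# diskover-data-20260415
print(resolve_index_name("diskover-data-%Y%m%d-%H%M%S", fired_at))
# diskover-data-20260415-021500
```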

Alternate Scanners

Alternate scanners replace the core scanner's native filesystem calls with custom implementations that access cloud storage, enterprise storage APIs, or other specialized data sources. They come in two execution models.

Standard Alternate Scanners (`--altscanner`)

These integrate with the core scanner via the --altscanner flag. The core scanner loads the module, replaces the filesystem traversal functions, and runs the rest of the crawl pipeline (threading, bulk indexing, error handling) transparently.

python3 diskover.py --altscanner scandir_s3 s3://my-bucket/prefix

Scanners using this model: S3, Azure, GCP, DirCache, Offline Media, PowerScale, Atempo, Google.

In Index Tasks, set the Alternate Scanner field to the scanner name (e.g., scandir_s3). The task's crawl paths should use the scanner-specific path format.
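
The idea of swapping the traversal functions while keeping the rest of the pipeline can be shown with a generic sketch. Nothing here reflects Diskover's real alternate scanner interface: Entry, local_scandir, make_memory_scandir, and crawl are hypothetical names, and the in-memory tree merely stands in for a backend such as an object store.

```python
import os
from collections import namedtuple

Entry = namedtuple("Entry", "name path is_dir size")


def local_scandir(path):
    """Default backend: list a directory on a mounted filesystem."""
    with os.scandir(path) as entries:
        for e in entries:
            is_dir = e.is_dir(follow_symlinks=False)
            size = 0 if is_dir else e.stat(follow_symlinks=False).st_size
            yield Entry(e.name, e.path, is_dir, size)


def make_memory_scandir(tree):
    """Stand-in for an alternate backend: tree maps a path to its entries."""
    def scandir(path):
        return iter(tree.get(path, []))
    return scandir


def crawl(top, scandir=local_scandir):
    """Backend-agnostic pipeline: only the scandir callable changes."""
    files = total = 0
    stack = [top]
    while stack:
        for entry in scandir(stack.pop()):
            if entry.is_dir:
                stack.append(entry.path)
            else:
                files += 1
                total += entry.size
    return files, total


if __name__ == "__main__":
    tree = {
        "bucket/": [Entry("a.txt", "bucket/a.txt", False, 10),
                    Entry("sub", "bucket/sub/", True, 0)],
        "bucket/sub/": [Entry("b.bin", "bucket/sub/b.bin", False, 5)],
    }
    print(crawl("bucket/", make_memory_scandir(tree)))
```

The crawl function never learns which backend it is talking to, which is the property that lets threading, bulk indexing, and error handling stay shared across scanners.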

Standalone Async Scanners

These have their own CLI entry points, asyncio event loops, and direct Elasticsearch integration. They do not use the --altscanner flag and are not configurable as Index Tasks.

# Example: OneDrive scanner
python3 scandir_onedrive.py --config onedrive_config.toml

Scanners using this model: OneDrive/SharePoint, Spectra RioBroker.

Scanner Configuration

Standard alternate scanners are configured in the Diskover Admin UI under Settings > Alternate Scanners > [Scanner Name]. Standalone async scanners use their own configuration files (typically TOML or YAML) in the scanner directory.

Alternate Ingesters

Alternate ingesters redirect the scanner's output away from Elasticsearch and into a different downstream sink. When an alternate ingester is active, Elasticsearch index creation, tuning, and info-index updates are all skipped.

python3 diskover.py --altingester my_ingester /data

Both bulk uploads and error documents are routed through the alternate ingester, so it receives the full document stream. Alternate ingesters are resolved with the same lookup order as alternate scanners: first the config database scope, then the ingesters/ directory as a fallback.

Exit Codes

Exit Code	Meaning	Treated as Success by diskoverd?
`0`	Crawl completed cleanly with zero warnings	Yes
`64`	Crawl completed but with non-zero warnings (typically permission errors that became error documents)	Yes
`1`	Fatal error (config/license/network failure, unhandled exception)	No — diskoverd retries per the task's `retries` and `retry_delay` settings

A run that exits with code 64 is still a successful crawl — every reachable file and directory was indexed, but at least one path produced an error document. Check the warning log or query Elasticsearch for documents with an error: field to see what went wrong.
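
If you wrap diskover.py in your own scheduler instead of diskoverd, the same success rule is easy to reproduce. The sketch below is an assumption-labeled illustration, not diskoverd's code: run_with_retries is a hypothetical helper that treats exit codes 0 and 64 as success and retries anything else.

```python
import subprocess
import sys
import time

SUCCESS_CODES = {0, 64}  # clean run, or completed with warnings


def run_with_retries(cmd, retries=0, retry_delay=0.0):
    """Run cmd, retrying on failure; return the last exit code seen."""
    attempts = retries + 1
    code = None
    for attempt in range(attempts):
        code = subprocess.run(cmd).returncode
        if code in SUCCESS_CODES:
            return code
        if attempt < attempts - 1:
            time.sleep(retry_delay)  # wait before the next attempt
    return code


if __name__ == "__main__":
    # A stand-in command that exits with 64, as a scan with warnings would
    demo = [sys.executable, "-c", "import sys; sys.exit(64)"]
    print(run_with_retries(demo, retries=2, retry_delay=1))
```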

Troubleshooting

Index Already Exists

Symptom: The crawl aborts immediately with Index diskover-data-prod already exists, not crawling, use -f to overwrite.

Resolution:

To replace the existing index: python3 diskover.py -f -i diskover-data-prod /data
To append a new top path: python3 diskover.py -a -i diskover-data-prod /scratch
To keep both, use a different index name: python3 diskover.py -i diskover-data-$(date +%Y%m%d) /data

SLOW PATH Warnings

Symptom: The log shows SLOW PATH: /data/projectA (612s) and the crawl takes hours.

Resolution:

Tune slowdirtime upward if 600 seconds is too tight for your storage
Set slowdirtimestopscan: true to abort scanning of pathologically slow directories
Add the slow path to excludes.dirs if it's not needed
Increase --threads for more parallelism
Try --autothreaddepth for deep or narrow trees

Permission Denied Errors as Error Documents

Symptom: The scan completes (often with exit code 64) and Elasticsearch contains documents with an error: field.

Resolution:

Run Diskover as a user with read access to the affected paths
If inaccessible paths are off-limits, exclude them via excludes.dirs
For SMB/CIFS shares, verify mount credentials have list/read permission at the directory level

Diagnosis:

curl -sS "http://localhost:9200/diskover-data-prod/_search?q=_exists_:error&pretty" | head -40

Exit Code 64 (Warnings)

Symptom: diskoverd reports the task as success, but the log shows Exit code: 64 and warnings: <N>.

Resolution:

Check the warning log file: grep -E "WARNING|ERROR" /var/log/diskover/*warning*.log | tail
Address the underlying causes where possible (permissions, mount stability, exclude rules)
Accept exit 64 as expected for environments with some unavoidable warnings — diskoverd already treats it as success

BulkIndexError: Document Mapping Conflict

Symptom: BULK INDEX ERROR lines in the log; some documents are missing from the final index.

Resolution:

Force a fresh index: python3 diskover.py -f -i diskover-data-prod /data
If you can't drop the index, fix the mapping at the source (typically a plugin or alt scanner that changed a field's type)

Thread Starvation on Shallow Trees

Symptom: A scan with --threads 16 runs no faster than --threads 4, and CPU usage stays low.

Resolution:

Use --autothreaddepth to let Diskover pick the right dispatch depth
Or manually set --threaddepth to the level where the tree fans out: python3 diskover.py --threaddepth 4 --threads 16 /data

NFS or SMB Latency

Symptom: The scan is dominated by stat() time, not Elasticsearch upload time.

Resolution:

Increase --threads to overlap more stat() calls
For repeated scans, consider the DirCache alternate scanner: python3 diskover.py --altscanner scandir_dircache /mnt/nas
For NFS, mount with ro,noatime,nodiratime
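For CIFS, where the noatime mount option is not available, set restoretimes: true in the configuration so the scan restores file times after reading them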

Debug Logging

# Linux
python3 /opt/diskover/diskover.py --debug /data

# Windows
python "C:\Program Files\Diskover\diskover.py" --debug /data

Log File Locations

Linux: /var/log/diskover/
Windows: C:\Program Files\diskover\logs\ (or your configured log location)

Support

Last Updated: April 2026
