Diskover Core Scanner Configuration
Overview
Every aspect of how a Diskover scan behaves at runtime — threading, file exclusions, cost rollup, plugins, and more — is controlled by a configuration model. You edit these settings through the Diskover Admin UI and select which configuration to use at scan time.
This guide walks through each configuration section at a high level: what it controls, why you'd change it, and which settings matter most. For CLI usage and task scheduling, see the companion Diskover Core Scanner Usage guide.
How Configuration Works
Where Settings Live
Configuration is stored under a scope like Diskover.Configurations.Default. The Diskover Admin UI reads the configuration model at runtime and generates the form controls you see on screen — so the field names, descriptions, and allowed values in this document match the Admin UI exactly.
Named Configurations
You can create multiple configuration variants for different scanning scenarios. Each variant gets its own scope in the database (e.g., Diskover.Configurations.nightly, Diskover.Configurations.fast).
To create a variant: Open the Default scope in the Admin UI, change the configuration_name field to your new name, and save. This duplicates the configuration under a new scope — the original is left unchanged.
To use a variant at scan time:
python3 /opt/diskover/diskover.py -c nightly /data
Without -c, the Default scope is always used.
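If scans are launched from a wrapper script or scheduler, the command above can be assembled programmatically. The sketch below is illustrative only: build_scan_command is a hypothetical helper, and the only details taken from this guide are the script path and the -c flag.

```python
import shlex

DISKOVER = "/opt/diskover/diskover.py"  # script path used in the example above


def build_scan_command(top_path, config_name=None):
    """Return the argv list for a scan (hypothetical helper).

    Leaving config_name unset omits -c, so the Default scope is used.
    """
    argv = ["python3", DISKOVER]
    if config_name:
        argv += ["-c", config_name]
    argv.append(top_path)
    return argv


if __name__ == "__main__":
    print(shlex.join(build_scan_command("/data", "nightly")))
    print(shlex.join(build_scan_command("/data")))
```

Returning a list rather than a string keeps the command safe to pass to subprocess.run without shell quoting concerns.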
Common multi-config pattern: Teams that index the same filesystem with different policies might define a nightly scope that includes everything with cost rollup enabled, a fast scope that excludes temp and build directories for on-demand runs, and a forensic scope that disables plugins to minimize load while capturing stat metadata.
Editing via the Admin UI
Navigate to Settings > Diskover > Configurations > <configuration name>. Nested objects (Excludes, Plugins, StorageCost, etc.) appear as collapsible sub-forms. Save the configuration and it's immediately available for the next scan that references it.
From this page you can create alternate scanning configurations for scanning different volumes.
Importing Legacy YAML Configurations
If you have YAML configuration files from an earlier Diskover version:
python3 diskover.py --importconfig <file.yaml>
This imports the YAML into the SQLite database as the Default scope, automatically migrating any deprecated field names.
Threading & Performance
These five settings control how many OS threads the crawl and indexer use, and at what depth to split work across them. Getting these right is the single biggest factor in scan performance.
| Setting | Default | What It Controls |
|---|---|---|
| maxthreads | | Maximum simultaneous directory crawl threads per top path. Auto-sizes from CPU count when set to 0. This is the primary performance tuning knob. |
| maxwalkthreads | | Maximum simultaneous walk threads, one per top path. Only matters when scanning multiple top paths (e.g., |
| maxthreaddepth | 999 | Maximum directory depth at which to spawn new crawl threads. |
| automaxthreaddepth | | When enabled, Diskover walks the tree once up front to find the optimal depth for thread dispatch. Recommended for shallow or wide trees. |
| indexthreads | | Number of threads feeding bulk uploads into Elasticsearch. Increase for slow or high-latency clusters. |
Tuning Tips
For a single-path scan: Focus on maxthreads. On a 16-core host with fast local disk, the auto-set value (CPU count) is usually close to optimal. On a slow NFS mount, try oversubscribing by 2–4x because threads spend most of their time blocked on syscalls.
For multi-path scans: maxwalkthreads controls how many top paths run concurrently. Each top path gets its own share of the crawl thread budget to prevent one large root from starving siblings.
For shallow trees: If --threads 16 runs no faster than --threads 4, try enabling automaxthreaddepth. The default maxthreaddepth: 999 only spawns threads as it descends, which doesn't help when the tree fans out near the root.
For remote Elasticsearch: Increase indexthreads to 24–32 to keep the network pipe full while crawl threads produce documents.
Example — 32-Core Host with Remote ES
maxwalkthreads: 4
maxthreads: 48
maxthreaddepth: 5
automaxthreaddepth: false
indexthreads: 24
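As a starting point, the rules of thumb from the tuning tips can be written down as a small helper. This is a sketch, not part of Diskover: suggest_threads and its parameters are hypothetical, and the numbers simply restate the tips (CPU count for fast local disk, 2 to 4x oversubscription for slow NFS, and 24 index threads as the low end of the suggested range for remote Elasticsearch).

```python
import os


def suggest_threads(cpu_count=None, slow_nfs=False, remote_es=False, oversubscribe=2):
    """Return starting values for maxthreads (and indexthreads if remote_es).

    Hypothetical helper: the results are starting points to measure against
    real scans, not guaranteed optima.
    """
    cpus = cpu_count or os.cpu_count() or 1
    if not 2 <= oversubscribe <= 4:
        raise ValueError("oversubscribe should stay in the 2-4x range")
    # Slow NFS: threads mostly block on syscalls, so run more than one per core.
    settings = {"maxthreads": cpus * oversubscribe if slow_nfs else cpus}
    if remote_es:
        settings["indexthreads"] = 24  # low end of the 24-32 range
    return settings


if __name__ == "__main__":
    print(suggest_threads(16))
    print(suggest_threads(16, slow_nfs=True, oversubscribe=3))
```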
Crawl Behavior & Feature Flags
These settings control filesystem traversal behavior and optional rollup features that add fields to directory documents.
| Setting | Default | What It Controls |
|---|---|---|
| | | Block size (bytes) used to compute the |
| followsymlinks | | Whether to follow and index symbolic links. |
| restoretimes | | After reading file metadata, restore the original atime/mtime so the scan doesn't pollute timestamps. Useful for CIFS mounts that don't support |
| | | Threshold in seconds beyond which a directory is logged as a SLOW PATH. |
| | | When enabled, abort scanning of directories that exceed |
| | | Add a |
| | | Add a |
| | | Add |
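The timestamp-restore behavior described in the table (put atime/mtime back after reading a file) can be illustrated with the standard library. This is a minimal sketch of the general technique, not Diskover's implementation; read_then_restore_times is a hypothetical name.

```python
import os
import tempfile


def read_then_restore_times(path):
    """Stat and read a file, then reset atime/mtime to their pre-read values."""
    before = os.stat(path)
    with open(path, "rb") as fh:
        fh.read(1)  # any read may update atime, depending on mount options
    # os.utime with ns= takes (atime_ns, mtime_ns) and avoids float rounding.
    os.utime(path, ns=(before.st_atime_ns, before.st_mtime_ns))
    return before


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"example")
    before = read_then_restore_times(tmp.name)
    after = os.stat(tmp.name)
    print(before.st_mtime_ns == after.st_mtime_ns)
    os.remove(tmp.name)
```

Note that ctime cannot be restored this way; calling os.utime itself updates it.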
Exclusions
The exclusions section decides what the crawler skips before anything reaches Elasticsearch. This is where you filter out OS junk files, snapshot directories, empty files, and data outside a specific age window.
Directory and File Exclusions
| Setting | Default | What It Controls |
|---|---|---|
| excludes.dirs | | Directory names and absolute paths to exclude. Supports exact strings, wildcards, and Python regex. |
| excludes.files | | File names and extensions to exclude. Names are case-sensitive; extensions are case-insensitive. Use |
| excludes.emptyfiles | | Skip 0-byte files. |
| excludes.emptydirs | | Skip directories that contain no kept entries after other exclude rules run. |
| excludes.minfilesize | | Skip files smaller than this many bytes. Set to |
How Directory Matching Works
Directory names in excludes.dirs are matched in this order: exact string match first, then a special-case check for .* (dot-wildcard), then Python re.search. Because re.search matches anywhere in the string, a pattern like \.backup matches any directory whose name contains the literal text .backup, including .backups_old and .backup.tmp. If that's not what you want, anchor the pattern: ^\.backup$ matches only the exact name .backup.
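The matching order described above can be sketched in a few lines of Python. This is not Diskover's source code: dir_excluded is a hypothetical function, and treating the .* entry as "name starts with a dot" is an assumption about what the dot-wildcard special case does.

```python
import re


def _regex_hit(pattern, name):
    """re.search wrapper that treats an invalid regex as a non-match."""
    try:
        return re.search(pattern, name) is not None
    except re.error:
        return False  # e.g. a shell-style wildcard that is not valid regex


def dir_excluded(name, patterns):
    """Apply the three checks in the documented order (hypothetical sketch)."""
    if name in patterns:                            # 1. exact string match
        return True
    if ".*" in patterns and name.startswith("."):   # 2. dot-wildcard (assumed meaning)
        return True
    # 3. regex search: matches anywhere in the name unless anchored
    return any(_regex_hit(p, name) for p in patterns if p != ".*")


if __name__ == "__main__":
    print(dir_excluded(".backup.tmp", [r"\.backup"]))     # unanchored pattern
    print(dir_excluded(".backup.tmp", [r"^\.backup$"]))   # anchored pattern
    print(dir_excluded(".backup", [r"^\.backup$"]))
```

Running the unanchored and anchored patterns side by side against your own directory names is a quick way to check a rule before saving it.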
Time-Based Exclusions
All six time fields are in days, interpreted as "exclude files whose time falls outside this range." The master switch checkfiletimes must be enabled for any of these to take effect.
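| Setting | Default | What It Controls |
|---|---|---|
| excludes.checkfiletimes | | Master switch for all time-based exclusions below. |
| excludes.minmtime / excludes.maxmtime | 0 / 36500 | Skip files modified less than minmtime days ago or more than maxmtime days ago. |
| excludes.minctime / excludes.maxctime | 0 / 36500 | Skip files changed less than minctime days ago or more than maxctime days ago. |
| excludes.minatime / excludes.maxatime | 0 / 36500 | Skip files accessed less than minatime days ago or more than maxatime days ago. |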
The defaults (0 to 36500 — roughly a century) effectively disable time-based filtering even when checkfiletimes is on.
Example — Stale-Data Sweep (Only Files Older Than 1 Year)
excludes:
    emptyfiles: true
    emptydirs: true
    checkfiletimes: true
    minmtime: 365
    maxmtime: 36500
    minatime: 365
    maxatime: 36500
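The "outside this range" rule can be made concrete with a short sketch. It is an illustration under assumptions, not Diskover's code: excluded_by_time is a hypothetical function, the range boundaries are treated as inclusive, and the minctime/maxctime names are assumed to follow the same pattern as the mtime and atime fields.

```python
def excluded_by_time(ages, cfg):
    """Return True if any timestamp age (in days) falls outside its window.

    ages: dict with "mtime", "ctime", "atime" ages in days.
    cfg:  dict holding checkfiletimes and the min*/max* fields.
    """
    if not cfg.get("checkfiletimes", False):
        return False  # master switch off: no time-based filtering
    for kind in ("mtime", "ctime", "atime"):
        low = cfg.get("min" + kind, 0)
        high = cfg.get("max" + kind, 36500)
        if ages[kind] < low or ages[kind] > high:
            return True
    return False


if __name__ == "__main__":
    sweep = {"checkfiletimes": True, "minmtime": 365, "maxmtime": 36500,
             "minatime": 365, "maxatime": 36500}
    recent = {"mtime": 30, "ctime": 30, "atime": 30}
    stale = {"mtime": 400, "ctime": 400, "atime": 400}
    print(excluded_by_time(recent, sweep))  # recent file is skipped
    print(excluded_by_time(stale, sweep))   # stale file is kept
```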
Example — Aggressive Tune That Strips OS Junk and Build Artifacts
excludes:
dirs:
- "^\\.snapshot$"
- "^\\.Snapshot$"
- "^\\.git$"
- "^node_modules$"
- "^__pycache__$"
files:
- ".*"
- "Thumbs.db"
- ".DS_Store"
- "*.pyc"
- "*.swp"
emptyfiles: true
emptydirs: true
minfilesize: 4096
Whitelists (Includes)
Whitelists override excludes for specific directory or file names. If a name appears in both excludes.* and includes.*, the include wins and the item is crawled. Leave them empty (the default) to disable whitelisting.
| Setting | Default | What It Controls |
|---|---|---|
| includes.dirs | empty | Directory names/paths to always include, even if they match an exclude rule. |
| includes.files | empty | File names to always include, even if they match an exclude rule. |
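The precedence rule (an include beats an exclude) can be sketched as follows. This is a simplified illustration: should_crawl_dir is a hypothetical function, and it assumes include entries are matched with re.search in the same way as exclude entries.

```python
import re


def matches_any(name, patterns):
    """True if any regex pattern is found in the name."""
    return any(re.search(p, name) for p in patterns)


def should_crawl_dir(name, exclude_dirs, include_dirs):
    """Include rules win over exclude rules; empty includes change nothing."""
    if matches_any(name, include_dirs):
        return True
    return not matches_any(name, exclude_dirs)


if __name__ == "__main__":
    excludes = [r"^\.git$"]
    print(should_crawl_dir(".git", excludes, []))            # excluded
    print(should_crawl_dir(".git", excludes, [r"^\.git$"]))  # whitelisted back
```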
Example — Exclude .git Generally but Whitelist It Back
excludes:
dirs:
- "^\\.git$"
includes:
dirs:
- "^\\.git$"
Owners & Groups
These settings control how file ownership is resolved and how domain-qualified usernames are parsed before being stored in the owner and group Elasticsearch fields.
| Setting | Default | What It Controls |
|---|---|---|
| uidgidonly | | Store numeric UID/GID values instead of resolving to names. Useful on ID-mapped NFS mounts where name lookups are slow or unreliable. |
| domain | | Set to |
| domainsep | | The character that separates domain from account name, usually |
| domainfirst | | Set to |
| keepdomain | | When |
When to Use Each Mode
Plain Linux mount: Leave all defaults — owner names resolve through standard system calls.
ID-mapped NFS with broken name resolution: Set uidgidonly: true.
AD-joined CIFS mount reporting EXAMPLE\alice: Set domain: true, domainsep: '\', domainfirst: true, and keepdomain: false to store just alice.
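To make the domain settings concrete, here is a minimal sketch of how a domain-qualified name might be split. parse_owner is a hypothetical function, not Diskover's parser, and it assumes that keepdomain: true stores the name unchanged; it models only the case where domain is enabled.

```python
def parse_owner(raw, domainsep="\\", domainfirst=True, keepdomain=False):
    """Split a domain-qualified name and return what would be stored."""
    if domainsep not in raw:
        return raw  # plain local name, nothing to strip
    left, right = raw.split(domainsep, 1)
    # domainfirst=True means DOMAIN<sep>account; False means account<sep>DOMAIN
    account = right if domainfirst else left
    return raw if keepdomain else account


if __name__ == "__main__":
    print(parse_owner("EXAMPLE\\alice"))
    print(parse_owner("alice@EXAMPLE", domainsep="@", domainfirst=False))
    print(parse_owner("EXAMPLE\\alice", keepdomain=True))
```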
Path Replacement
Path replacement rewrites every crawled path before it is stored in Elasticsearch. This is most commonly used to translate POSIX scan paths into Windows-style display paths (or vice versa) when the scanner runs on one OS but the Web UI serves users on another.
| Setting | Default | What It Controls |
|---|---|---|
| replacepaths.enable | | Enable path replacement. |
| replacepaths.from_path | | Source path prefix to replace. |
| replacepaths.to_path | | Destination path prefix to substitute. |
Example — Linux Scan Path to Windows UNC Path
replacepaths:
    enable: true
    from_path: "/mnt/nas/marketing"
    to_path: "\\\\fileserver01\\marketing"
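The effect of this setting on a single path can be sketched as a prefix swap. This is an illustration rather than Diskover's implementation: replace_path is a hypothetical function, the directory-boundary check is a design choice of the sketch, and separators after the prefix are left unchanged because this guide does not say whether Diskover converts them.

```python
def replace_path(path, from_path, to_path):
    """Swap a leading from_path for to_path; leave other paths untouched."""
    prefix = from_path.rstrip("/")
    # Match only on a directory boundary so /mnt/nas/marketing2 is not rewritten.
    if path == prefix or path.startswith(prefix + "/"):
        return to_path + path[len(prefix):]
    return path


if __name__ == "__main__":
    unc = "\\\\fileserver01\\marketing"  # Python escaping for \\fileserver01\marketing
    print(replace_path("/mnt/nas/marketing/q1/plan.docx", "/mnt/nas/marketing", unc))
    print(replace_path("/mnt/nas/marketing2/x", "/mnt/nas/marketing", unc))
```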
Auto-Tagging
Auto-tagging attaches string tags to files or directories at crawl time based on match rules. Each rule uses Python re.search against the entry's name, parent path, and (for files) extension, with optional age filters. When a rule matches, its tags are merged into the document's tags field before indexing.
| Setting | Default | What It Controls |
|---|---|---|
| autotag.enable | | Master switch for auto-tagging. |
| autotag.rawstrings | | Use raw strings for regex patterns. |
| autotag.files | | List of file match rules (see below). |
| autotag.dirs | | List of directory match rules (see below). |
Match Rule Fields
Each rule in autotag.files[] or autotag.dirs[] supports these fields:
| Field | Type | Description |
|---|---|---|
| tags | list of strings | Tags to apply when this rule matches. |
| | list of strings | Regex patterns to match against the entry's name. |
| | list of strings | Regex patterns that prevent matching (if any pattern hits the name, the rule is skipped). |
| path | list of strings | Regex patterns to match against the entry's parent path. |
| path_exclude | list of strings | Regex patterns that prevent matching by path. |
| mtime | integer (days) | Minimum mtime age in days for the rule to match. |
| atime | integer (days) | Minimum atime age in days. |
| ctime | integer (days) | Minimum ctime age in days. |
| ext | list of strings | (File rules only) File extensions to match (case-insensitive). |
When a file or directory matches multiple rules, each rule's tags are merged into the document.
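Rule evaluation and tag merging can be sketched as follows. This is a simplified model, not Diskover's code: tags_for_file and rule_matches are hypothetical names, only the ext, path, path_exclude, and mtime fields are handled, and the sketch assumes that every field present in a rule must match (logical AND).

```python
import re


def rule_matches(rule, parent_path, ext, mtime_age_days):
    """True when every criterion present in the rule is satisfied."""
    exts = [e.lower() for e in rule.get("ext", [])]
    if exts and ext.lower() not in exts:            # extensions: case-insensitive
        return False
    paths = rule.get("path", [])
    if paths and not any(re.search(p, parent_path) for p in paths):
        return False
    if any(re.search(p, parent_path) for p in rule.get("path_exclude", [])):
        return False
    return mtime_age_days >= rule.get("mtime", 0)   # minimum age in days


def tags_for_file(rules, parent_path, ext, mtime_age_days):
    """Merge tags from every matching rule, keeping order and dropping repeats."""
    tags = []
    for rule in rules:
        if rule_matches(rule, parent_path, ext, mtime_age_days):
            for tag in rule["tags"]:
                if tag not in tags:
                    tags.append(tag)
    return tags


if __name__ == "__main__":
    rules = [
        {"tags": ["archived", "legal-hold"], "ext": ["pdf"],
         "path": ["^/archive/"], "mtime": 365},
        {"tags": ["archived", "document"], "ext": ["pdf"]},
    ]
    print(tags_for_file(rules, "/archive/2019", "PDF", 400))
```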
Example — Tag Old PDFs Under /archive
autotag:
enable: true
rawstrings: true
files:
- tags:
- archived
- legal-hold
ext:
- pdf
path:
- "^/archive/"
mtime: 365
dirs: []
Example — Tag Project Directories and Design Assets
autotag:
enable: true
rawstrings: true
files:
- tags:
- design-asset
ext:
- psd
- ai
- sketch
dirs:
- tags:
- project-root
path:
- "^/projects/client-"
path_exclude:
- "^/projects/client-archive/"
Storage Cost Rollup
When enabled, every document carries a costpergb field computed from a base rate and any matching path or time rules. Directory documents additionally carry a rolled-up total summarizing all descendant costs. This powers Diskover's cost attribution dashboards.
Top-Level Settings
| Setting | Default | What It Controls |
|---|---|---|
| storagecost.enable | | Enable storage cost calculation. |
| storagecost.costpergb | | Default storage cost per GB. Used when no more specific path or time rule matches. |
| storagecost.base | | GB unit base: |
| storagecost.sizefield | | Which size field cost is computed against: |
| storagecost.priority | | When a file matches both a path rule and a time rule, which one wins. |
| storagecost.rawstrings | | Use raw strings for path cost regex patterns. |
Path Cost Rules
Each entry in storagecost.paths[] associates a cost rate with files whose parent path matches a regex pattern.
| Field | Description |
|---|---|
| path | Regex patterns for parent-path matching. |
| path_exclude | Regex patterns that prevent matching. |
| costpergb | $/GB rate to apply when matched. |
Time Cost Rules
Each entry in storagecost.times[] associates a cost rate with files whose timestamps fall within a given window.
| Field | Description |
|---|---|
| mtime | mtime window in days. |
| atime | atime window in days. |
| ctime | ctime window in days. |
| costpergb | $/GB rate to apply when matched. |
Example — Flat-Rate Default
storagecost:
    enable: true
    costpergb: 0.08
    base: 2
    sizefield: size_du
    priority: time
Example — Path-Tiered Rates (Hot and Cold)
storagecost:
enable: true
costpergb: 0.05
base: 2
sizefield: size_du
priority: path
rawstrings: true
paths:
- path:
- "^/projects/"
path_exclude: []
costpergb: 0.12
- path:
- "^/archive/"
path_exclude: []
costpergb: 0.01
times:
- mtime: 365
atime: 0
ctime: 0
costpergb: 0.02
In this configuration, files under /projects/ are charged at $0.12/GB, files under /archive/ at $0.01/GB, and files that match neither a path rule nor the time rule fall through to the default $0.05/GB. Because priority: path is set, the path rule wins when a file matches both a path rule and a time rule.
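The rate selection above can be worked through in code. The sketch below is a model built on assumptions, not Diskover's calculation: rate_for and file_cost are hypothetical names, a time rule is assumed to match when each timestamp age is at least the rule's value (so 0 means "no constraint"), and base: 2 is assumed to mean 1024^3 bytes per GB versus 1000^3 otherwise.

```python
import re

EXAMPLE = {
    "costpergb": 0.05, "base": 2, "priority": "path",
    "paths": [
        {"path": ["^/projects/"], "path_exclude": [], "costpergb": 0.12},
        {"path": ["^/archive/"], "path_exclude": [], "costpergb": 0.01},
    ],
    "times": [{"mtime": 365, "atime": 0, "ctime": 0, "costpergb": 0.02}],
}


def _path_rate(cfg, parent_path):
    """Rate of the first path rule that matches, or None."""
    for rule in cfg.get("paths", []):
        if any(re.search(p, parent_path) for p in rule.get("path_exclude", [])):
            continue
        if any(re.search(p, parent_path) for p in rule["path"]):
            return rule["costpergb"]
    return None


def _time_rate(cfg, ages):
    """Rate of the first time rule whose age thresholds are all met, or None."""
    for rule in cfg.get("times", []):
        if all(ages[k] >= rule.get(k, 0) for k in ("mtime", "atime", "ctime")):
            return rule["costpergb"]
    return None


def rate_for(cfg, parent_path, ages):
    """Pick the $/GB rate: preferred rule type, then the other, then default."""
    by_path, by_time = _path_rate(cfg, parent_path), _time_rate(cfg, ages)
    ordered = (by_path, by_time) if cfg.get("priority") == "path" else (by_time, by_path)
    for rate in ordered:
        if rate is not None:
            return rate
    return cfg["costpergb"]


def file_cost(cfg, size_bytes, parent_path, ages):
    """Cost of one file: size in GB (per cfg base) times the selected rate."""
    bytes_per_gb = 1024 ** 3 if cfg.get("base", 2) == 2 else 1000 ** 3
    return size_bytes / bytes_per_gb * rate_for(cfg, parent_path, ages)


if __name__ == "__main__":
    old = {"mtime": 400, "atime": 400, "ctime": 400}
    print(rate_for(EXAMPLE, "/archive/2019", old))
    print(rate_for({**EXAMPLE, "priority": "time"}, "/archive/2019", old))
```

Switching priority from path to time in the last line shows how the same old file under /archive/ can land on a different rate.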
Index Plugins
Index plugins are Python modules under /opt/diskover/plugins/ that contribute additional fields to file or directory documents during the crawl. They are different from post-index plugins (which run after a crawl completes) and file-action plugins (which run on demand from the UI).
| Setting | Default | What It Controls |
|---|---|---|
| plugins.enable | | Master switch for all index plugins. |
| plugins.dirs | | Plugin names to apply to directory documents. |
| plugins.files | | Plugin names to apply to file documents. |
Available Plugins
The valid plugin names depend on which plugins are installed on your Diskover host. To see what's available:
python3 /opt/diskover/diskover.py -l
Copy the exact names from this output into plugins.dirs and plugins.files. The Admin UI dropdowns also show only the plugins installed and loaded on the current host.
Example
plugins:
enable: true
dirs:
- dupesfinder
files:
- hash
- mediainfo
Note: If you configure plugins in plugins.dirs or plugins.files but leave plugins.enable set to false, the Admin UI will prompt you to enable the master switch.
Elasticsearch Overrides
These settings let an individual Diskover scan override the cluster-wide Elasticsearch ingest settings. This is useful when you run multiple named configurations against the same cluster but need different shard counts or compression settings for specific scans.
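| Setting | Default | What It Controls |
|---|---|---|
| elasticsearch_overrides.enable | false | Enable scan-specific Elasticsearch settings. When disabled, values from the cluster-wide Elasticsearch scope are used. |
| elasticsearch_overrides.settings.shards | | Number of shards for the index. |
| elasticsearch_overrides.settings.replicas | | Number of replicas for the index. |
| elasticsearch_overrides.settings.chunksize | | Chunk size for ES bulk operations. |
| elasticsearch_overrides.settings.max_size | | Number of connections kept open to ES during the crawl. |
| elasticsearch_overrides.settings.http_compress | | Compress HTTP data. Set to |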
Example — Remote Cluster Tune
elasticsearch_overrides:
enable: true
settings:
shards: 3
replicas: 1
chunksize: 1000
max_size: 30
http_compress: true
When to use overrides vs. the cluster-wide scope: Leave enable: false (the default) when all scans should use the same ES ingest settings. Use overrides when a specific configuration variant needs different shard/replica counts or compression — for example, a nightly variant writing a large multi-shard index versus an ad-hoc fast variant writing a small single-shard index.
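The relationship between the two scopes can be pictured as a dictionary merge. This is only a conceptual sketch: effective_es_settings is a hypothetical function, the cluster-wide values shown are invented for illustration, and it assumes that keys missing from the override block fall back to the cluster-wide value.

```python
def effective_es_settings(cluster_settings, overrides):
    """Return the settings a scan would use under this model."""
    if not overrides.get("enable", False):
        return dict(cluster_settings)  # overrides disabled: cluster-wide values
    # Later keys win, so override values replace cluster-wide ones.
    return {**cluster_settings, **overrides.get("settings", {})}


if __name__ == "__main__":
    # Hypothetical cluster-wide values, for illustration only.
    cluster = {"shards": 1, "replicas": 0, "chunksize": 500,
               "max_size": 20, "http_compress": False}
    overrides = {"enable": True,
                 "settings": {"shards": 3, "replicas": 1, "http_compress": True}}
    print(effective_es_settings(cluster, overrides))
```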
Complete Configuration Examples
Production Tune for a Busy NAS
# Diskover.Configurations.production
excludes:
dirs:
- '\.snapshot'
- '\.Snapshot'
- '\.zfs'
- '~snapshot'
files:
- 'Thumbs.db'
- '.DS_Store'
- 'desktop.ini'
emptyfiles: true
emptydirs: true
minfilesize: 1
checkfiletimes: false
maxthreads: 16
maxwalkthreads: 0
maxthreaddepth: 999
automaxthreaddepth: true
indexthreads: 16
rolluptimes: true
fileagegroups: true
followsymlinks: false
restoretimes: false
storagecost:
enable: true
costpergb: 0.08
base: 2
sizefield: size_du
priority: time
plugins:
enable: true
dirs: []
files: []
Stale-Data Sweep (Only Files Older Than 1 Year)
# Diskover.Configurations.stale-sweep
excludes:
    emptyfiles: true
    emptydirs: true
    checkfiletimes: true
    minmtime: 365
    maxmtime: 36500
    minatime: 365
    maxatime: 36500
maxthreads: 8
fileagegroups: true
rolluptimes: true
Support
Last Updated: April 2026