Diskover Core Scanner Configuration

Overview

Every aspect of how a Diskover scan behaves at runtime — threading, file exclusions, cost rollup, plugins, and more — is controlled by a configuration model. You edit these settings through the Diskover Admin UI and select which configuration to use at scan time.

This guide walks through each configuration section at a high level: what it controls, why you'd change it, and which settings matter most. For CLI usage and task scheduling, see the companion Diskover Core Scanner Usage guide.

How Configuration Works

Where Settings Live

Configuration is stored under a scope like Diskover.Configurations.Default. The Diskover Admin UI reads the configuration model at runtime and generates the form controls you see on screen — so the field names, descriptions, and allowed values in this document match the Admin UI exactly.

Named Configurations

You can create multiple configuration variants for different scanning scenarios. Each variant gets its own scope in the database (e.g., Diskover.Configurations.nightly, Diskover.Configurations.fast).

To create a variant: Open the Default scope in the Admin UI, change the configuration_name field to your new name, and save. This duplicates the configuration under a new scope — the original is left unchanged.

To use a variant at scan time:

python3 /opt/diskover/diskover.py -c nightly /data

Without -c, the Default scope is always used.

Common multi-config pattern: Teams that index the same filesystem with different policies might define a nightly scope that includes everything with cost rollup enabled, a fast scope that excludes temp and build directories for on-demand runs, and a forensic scope that disables plugins to minimize load while capturing stat metadata.

Editing via the Admin UI

Navigate to Settings > Diskover > Configurations > <configuration name>. Nested objects (Excludes, Plugins, StorageCost, etc.) appear as collapsible sub-forms. Save the configuration and it's immediately available for the next scan that references it.

Use this page to create alternate scanning configurations for scanning different volumes.

Importing Legacy YAML Configurations

If you have YAML configuration files from an earlier Diskover version:

python3 diskover.py --importconfig <file.yaml>

This imports the YAML into the SQLite database as the Default scope, automatically migrating any deprecated field names.

Threading & Performance

These five settings control how many OS threads the crawl and indexer use, and at what depth to split work across them. Getting these right is the single biggest factor in scan performance.

Setting	Default	What It Controls
`maxthreads`	`0` (auto)	Maximum simultaneous directory crawl threads per top path. Auto-sizes from CPU count when set to 0. This is the primary performance tuning knob.
`maxwalkthreads`	`0` (auto)	Maximum simultaneous walk threads — one per top path. Only matters when scanning multiple top paths (e.g., `/data /scratch /home`).
`maxthreaddepth`	`999`	Maximum directory depth at which to spawn new crawl threads.
`automaxthreaddepth`	`false`	When enabled, Diskover walks the tree once up front to find the optimal depth for thread dispatch. Recommended for shallow or wide trees.
`indexthreads`	`16`	Number of threads feeding bulk uploads into Elasticsearch. Increase for slow or high-latency clusters.

Tuning Tips

For a single-path scan: Focus on maxthreads. On a 16-core host with fast local disk, the auto-set value (CPU count) is usually close to optimal. On a slow NFS mount, try oversubscribing by 2–4x because threads spend most of their time blocked on syscalls.

For multi-path scans: maxwalkthreads controls how many top paths run concurrently. Each top path gets its own share of the crawl thread budget to prevent one large root from starving siblings.

For shallow trees: If --threads 16 runs no faster than --threads 4, try enabling automaxthreaddepth. The default maxthreaddepth: 999 only spawns threads as it descends, which doesn't help when the tree fans out near the root.

For remote Elasticsearch: Increase indexthreads to 24–32 to keep the network pipe full while crawl threads produce documents.
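
As a rough illustration of the single-path advice above, the sketch below sizes a crawl thread count from the CPU count and an oversubscription factor. The function name suggest_maxthreads and its parameters are hypothetical. This is not Diskover's auto-sizing logic, only the arithmetic described in the tips.

```python
import os


def suggest_maxthreads(cpu_count=None, slow_storage=False, oversubscribe=2):
    """Return a starting value for maxthreads (hypothetical helper).

    Fast local disk: one thread per CPU.
    Slow or high-latency storage: CPU count times the oversubscription
    factor, since threads mostly wait on blocked syscalls.
    """
    cpus = cpu_count or os.cpu_count() or 1
    if slow_storage:
        return cpus * oversubscribe
    return cpus


if __name__ == "__main__":
    print("local disk:", suggest_maxthreads(16))
    print("slow NFS, 3x:", suggest_maxthreads(16, slow_storage=True, oversubscribe=3))
```

Treat the result as a first guess to benchmark against, not a recommendation.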

Example — 32-Core Host with Remote ES

maxwalkthreads: 4
maxthreads: 48
maxthreaddepth: 5
automaxthreaddepth: false
indexthreads: 24

Crawl Behavior & Feature Flags

These settings control filesystem traversal behavior and optional rollup features that add fields to directory documents.

Setting	Default	What It Controls
`blocksize`	`512`	Block size (bytes) used to compute the `size_du` (allocated size) field. Leave at 512 unless your filesystem uses a different block size.
`followsymlinks`	`false`	Whether to follow and index symbolic links.
`restoretimes`	`false`	After reading file metadata, restore the original atime/mtime so the scan doesn't pollute timestamps. Useful for CIFS mounts that don't support `noatime`. For NFS, use the `ro,noatime,nodiratime` mount options instead.
`slowdirtime`	`600`	Threshold in seconds beyond which a directory is logged as a SLOW PATH.
`slowdirtimestopscan`	`false`	When enabled, abort scanning of directories that exceed `slowdirtime`. When disabled, only log a warning.
`fileagegroups`	`false`	Add a `fileages` field to directory documents with six age-bucket counts (0–30d, 30–90d, 90–180d, 180d–1y, 1–2y, >2y). Pre-computing these makes file-age charts in the Web UI render much faster on large indices.
`rolluptimes`	`false`	Add a `timerollup` field to directory documents containing the maximum child mtime/atime/ctime. Useful for retention dashboards that need "when was anything under this directory last modified or accessed" without runtime aggregation.
`store_relative_parent`	`false`	Add `parent_path_rel` (parent path relative to the scan root) as an additional index field.
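
To make the two rollup features more concrete, here is a minimal sketch of how age buckets and a maximum-time rollup could be computed. The bucket labels, the choice of which side each boundary falls on, and the function names are assumptions for illustration. The actual field layout Diskover writes is not shown in this guide.

```python
def age_bucket(age_days):
    """Map a file age in days to one of six bucket labels (labels are hypothetical)."""
    bounds = [
        (30, "0-30d"),
        (90, "30-90d"),
        (180, "90-180d"),
        (365, "180d-1y"),
        (730, "1-2y"),
    ]
    for upper, label in bounds:
        if age_days < upper:  # assumption: upper bound is exclusive
            return label
    return ">2y"


def rollup_max(child_times):
    """Return the maximum mtime/atime/ctime across a list of child dicts."""
    return {
        key: max(child[key] for child in child_times)
        for key in ("mtime", "atime", "ctime")
    }


if __name__ == "__main__":
    print(age_bucket(12), age_bucket(400), age_bucket(1000))
    print(rollup_max([
        {"mtime": 100, "atime": 500, "ctime": 200},
        {"mtime": 300, "atime": 400, "ctime": 900},
    ]))
```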

Exclusions

The exclusions section decides what the crawler skips before anything reaches Elasticsearch. This is where you filter out OS junk files, snapshot directories, empty files, and data outside a specific age window.

Directory and File Exclusions

Setting	Default	What It Controls
`excludes.dirs`	`[".snapshot", ".Snapshot", "~snapshot", "~Snapshot", ".zfs"]`	Directory names and absolute paths to exclude. Supports exact strings, wildcards, and Python regex.
`excludes.files`	`[".*", "Thumbs.db", ".DS_Store", "._.DS_Store", ".localized", "desktop.ini"]`	File names and extensions to exclude. Names are case-sensitive; extensions are case-insensitive. Use `NULLEXT` to match files without extensions.
`excludes.emptyfiles`	`true`	Skip 0-byte files.
`excludes.emptydirs`	`true`	Skip directories that contain no kept entries after other exclude rules run.
`excludes.minfilesize`	`1`	Skip files smaller than this many bytes. Set to `0` to disable.

How Directory Matching Works

Directory names in excludes.dirs are matched in this order: exact string match first, then a special-case check for .* (dot-wildcard), then Python re.search. Because re.search matches anywhere in the string, a pattern like \.backup will match any directory whose name contains .backup, including .backups_old and .backup.tmp. If that's not what you want, anchor the pattern: ^\.backup$ matches only the exact name .backup.
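
The difference between an unanchored and an anchored pattern can be checked directly with Python's re module. The sketch below is a simplified stand-in for the matching order described above: it performs the exact-string check and then re.search, and omits the .* special case. The helper name dir_is_excluded is hypothetical.

```python
import re


def dir_is_excluded(name, patterns):
    """Return True if any pattern hits the name by exact match or re.search."""
    for pattern in patterns:
        if name == pattern:           # exact string match first
            return True
        if re.search(pattern, name):  # unanchored: may match anywhere in the name
            return True
    return False


unanchored = [r"\.backup"]
anchored = [r"^\.backup$"]

if __name__ == "__main__":
    for name in [".backup", ".backup.tmp", "old.backups", "notes"]:
        print(
            name,
            "unanchored:", dir_is_excluded(name, unanchored),
            "anchored:", dir_is_excluded(name, anchored),
        )
```

Running a candidate pattern through a check like this against a handful of real directory names is a cheap way to catch over-broad excludes before a long scan.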

Time-Based Exclusions

All six time fields are in days, interpreted as "exclude files whose time falls outside this range." The master switch checkfiletimes must be enabled for any of these to take effect.

Setting	Default	What It Controls
`excludes.checkfiletimes`	`false`	Master switch for all time-based exclusions below.
`excludes.minmtime` / `excludes.maxmtime`	`0` / `36500`	Skip files modified less than `minmtime` days ago or more than `maxmtime` days ago.
`excludes.minctime` / `excludes.maxctime`	`0` / `36500`	Skip files changed less than `minctime` days ago or more than `maxctime` days ago.
`excludes.minatime` / `excludes.maxatime`	`0` / `36500`	Skip files accessed less than `minatime` days ago or more than `maxatime` days ago.

The defaults (0 to 36500 — roughly a century) effectively disable time-based filtering even when checkfiletimes is on.
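
The "outside this range" rule can be sketched as a small age check. This is an illustration under stated assumptions, not Diskover's code: the function name outside_window is hypothetical, ages are computed as fractional days, and both range boundaries are treated as inclusive.

```python
import time

SECONDS_PER_DAY = 86400


def outside_window(timestamp, min_days, max_days, now=None):
    """Return True when the timestamp's age in days is outside [min_days, max_days]."""
    now = time.time() if now is None else now
    age_days = (now - timestamp) / SECONDS_PER_DAY
    return age_days < min_days or age_days > max_days


if __name__ == "__main__":
    now = time.time()
    ten_days_old = now - 10 * SECONDS_PER_DAY
    # With the defaults, nothing realistic is excluded.
    print(outside_window(ten_days_old, 0, 36500, now))
    # With minmtime raised to 365, a 10-day-old file falls outside the window.
    print(outside_window(ten_days_old, 365, 36500, now))
```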

Example — Stale-Data Sweep (Only Files Older Than 1 Year)

excludes:
  emptyfiles: true
  emptydirs: true
  checkfiletimes: true
  minmtime: 365
  maxmtime: 36500
  minatime: 365
  maxatime: 36500

Example — Aggressive Tune That Strips OS Junk and Build Artifacts

excludes:
  dirs:
    - "^\\.snapshot$"
    - "^\\.Snapshot$"
    - "^\\.git$"
    - "^node_modules$"
    - "^__pycache__$"
  files:
    - ".*"
    - "Thumbs.db"
    - ".DS_Store"
    - "*.pyc"
    - "*.swp"
  emptyfiles: true
  emptydirs: true
  minfilesize: 4096

Whitelists (Includes)

Whitelists override excludes for specific directory or file names. If a name appears in both excludes.* and includes.*, the include wins and the item is crawled. Leave them empty (the default) to disable whitelisting.

Setting	Default	What It Controls
`includes.dirs`	`[]`	Directory names/paths to always include, even if they match an exclude rule.
`includes.files`	`[]`	File names to always include, even if they match an exclude rule.
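
The "include wins" rule amounts to checking the whitelist before the exclude list. The sketch below assumes includes are matched the same way as excludes (exact string, then re.search). The guide does not state that, and the function names are hypothetical.

```python
import re


def matches_any(name, patterns):
    """True if the name equals a pattern or the pattern is found by re.search."""
    return any(name == p or re.search(p, name) is not None for p in patterns)


def should_crawl_dir(name, excludes, includes):
    """Apply the include-overrides-exclude rule to one directory name."""
    if matches_any(name, includes):
        return True  # whitelist wins even when an exclude also matches
    return not matches_any(name, excludes)


if __name__ == "__main__":
    excludes = [r"^\.git$", r"^node_modules$"]
    print(should_crawl_dir(".git", excludes, includes=[]))
    print(should_crawl_dir(".git", excludes, includes=[r"^\.git$"]))
    print(should_crawl_dir("src", excludes, includes=[]))
```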

Example — Exclude `.git` Generally but Whitelist It Back

excludes:
  dirs:
    - "^\\.git$"
includes:
  dirs:
    - "^\\.git$"

Owners & Groups

These settings control how file ownership is resolved and how domain-qualified usernames are parsed before being stored in the owner and group Elasticsearch fields.

Setting	Default	What It Controls
`ownersgroups.uidgidonly`	`false`	Store numeric UID/GID values instead of resolving to names. Useful on ID-mapped NFS mounts where name lookups are slow or unreliable.
`ownersgroups.domain`	`false`	Set to `true` when owner/group names contain a domain prefix (typically on AD-joined CIFS/NFS mounts).
`ownersgroups.domainsep`	`\`	The character that separates domain from account name — usually `\` or `@`.
`ownersgroups.domainfirst`	`true`	Set to `true` when domain comes before the separator (e.g., `EXAMPLE\alice`). Set to `false` for formats like `alice@example.com`.
`ownersgroups.keepdomain`	`false`	When `true`, the domain portion is kept in the indexed value. When `false`, only the bare username is stored.

When to Use Each Mode

Plain Linux mount: Leave all defaults — owner names resolve through standard system calls.

ID-mapped NFS with broken name resolution: Set uidgidonly: true.

AD-joined CIFS mount reporting EXAMPLE\alice: Set domain: true, domainsep: "\", domainfirst: true, and keepdomain: false to store just alice.
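
How the four domain settings interact is easiest to see on concrete names. The following sketch is an assumption about the behavior described in the table, with a hypothetical parse_owner helper whose parameters are named after the settings. It splits on the first occurrence of the separator.

```python
def parse_owner(raw, domainsep="\\", domainfirst=True, keepdomain=False):
    """Return the value to index for a possibly domain-qualified owner name."""
    if keepdomain or domainsep not in raw:
        return raw  # nothing to strip
    first, _, second = raw.partition(domainsep)
    # domainfirst=True  -> DOMAIN<sep>account, keep the part after the separator
    # domainfirst=False -> account<sep>domain, keep the part before it
    return second if domainfirst else first


if __name__ == "__main__":
    print(parse_owner(r"EXAMPLE\alice"))
    print(parse_owner("alice@example.com", domainsep="@", domainfirst=False))
    print(parse_owner(r"EXAMPLE\alice", keepdomain=True))
```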

Path Replacement

Path replacement rewrites every crawled path before it is stored in Elasticsearch. This is most commonly used to translate POSIX scan paths into Windows-style display paths (or vice versa) when the scanner runs on one OS but the Web UI serves users on another.

Setting	Default	What It Controls
`replacepaths.enable`	`false`	Enable path replacement.
`replacepaths.from_path`	`""`	Source path prefix to replace.
`replacepaths.to_path`	`""`	Destination path prefix to substitute.

Example — Linux Scan Path to Windows UNC Path

replacepaths:
  enable: true
  from_path: "/mnt/nas/marketing"
  to_path: "\\\\fileserver01\\marketing"
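
A prefix substitution like the one configured above can be sketched in a few lines. This sketch only swaps the leading from_path for to_path and leaves the rest of the path untouched. Whether Diskover also converts the remaining forward slashes is not covered in this guide, so that step is deliberately left out. The function name replace_prefix is hypothetical.

```python
def replace_prefix(path, from_path, to_path):
    """Swap a leading from_path for to_path; return other paths unchanged."""
    if path.startswith(from_path):
        return to_path + path[len(from_path):]
    return path


if __name__ == "__main__":
    unc_root = "\\\\fileserver01\\marketing"  # i.e. \\fileserver01\marketing
    print(replace_prefix("/mnt/nas/marketing/q1/plan.docx", "/mnt/nas/marketing", unc_root))
    print(replace_prefix("/mnt/other/file.txt", "/mnt/nas/marketing", unc_root))
```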

Auto-Tagging

Auto-tagging attaches string tags to files or directories at crawl time based on match rules. Each rule uses Python re.search against the entry's name, parent path, and (for files) extension, with optional age filters. When a rule matches, its tags are merged into the document's tags field before indexing.

Setting	Default	What It Controls
`autotag.enable`	`false`	Master switch for auto-tagging.
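`autotag.rawstrings`	`false`	Treat regex patterns as raw strings, so backslashes in a pattern reach the regex engine literally instead of being processed as string escapes.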
`autotag.files`	`[]`	List of file match rules (see below).
`autotag.dirs`	`[]`	List of directory match rules (see below).

Match Rule Fields

Each rule in autotag.files[] or autotag.dirs[] supports these fields:

Field	Type	Description
`tags`	list of strings	Tags to apply when this rule matches.
`name`	list of strings	Regex patterns to match against the entry's name.
`name_exclude`	list of strings	Regex patterns that prevent matching (if any pattern hits the name, the rule is skipped).
`path`	list of strings	Regex patterns to match against the entry's parent path.
`path_exclude`	list of strings	Regex patterns that prevent matching by path.
`mtime`	integer (days)	Minimum mtime age in days for the rule to match. `0` disables.
`atime`	integer (days)	Minimum atime age in days.
`ctime`	integer (days)	Minimum ctime age in days.
`ext`	list of strings	(File rules only) File extensions to match (case-insensitive).

When a file or directory matches multiple rules, each rule's tags are merged into the document.
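
The rule fields above can be read as a series of filters. The sketch below is one plausible interpretation, not Diskover's implementation: it assumes every field present in a rule must pass (logical AND), that an exclude hit vetoes the rule, and that mtime is a minimum age. Only the mtime filter is modeled, and the function names are hypothetical.

```python
import re


def rule_matches(rule, name, parent_path, ext="", mtime_age_days=0):
    """Return True if an entry satisfies every field set on the rule."""
    if any(re.search(p, name) for p in rule.get("name_exclude", [])):
        return False
    if any(re.search(p, parent_path) for p in rule.get("path_exclude", [])):
        return False
    if rule.get("name") and not any(re.search(p, name) for p in rule["name"]):
        return False
    if rule.get("path") and not any(re.search(p, parent_path) for p in rule["path"]):
        return False
    if rule.get("ext") and ext.lower() not in [e.lower() for e in rule["ext"]]:
        return False
    if rule.get("mtime", 0) and mtime_age_days < rule["mtime"]:
        return False  # 0 disables the age filter
    return True


def collect_tags(rules, **entry):
    """Merge tags from every matching rule, keeping first-seen order."""
    tags = []
    for rule in rules:
        if rule_matches(rule, **entry):
            for tag in rule["tags"]:
                if tag not in tags:
                    tags.append(tag)
    return tags


if __name__ == "__main__":
    rules = [
        {"tags": ["archived", "legal-hold"], "ext": ["pdf"],
         "path": ["^/archive/"], "mtime": 365},
        {"tags": ["document"], "ext": ["pdf", "docx"]},
    ]
    print(collect_tags(rules, name="contract.pdf", parent_path="/archive/2019",
                       ext="PDF", mtime_age_days=400))
```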

Example — Tag Old PDFs Under /archive

autotag:
  enable: true
  rawstrings: true
  files:
    - tags:
        - archived
        - legal-hold
      ext:
        - pdf
      path:
        - "^/archive/"
      mtime: 365
  dirs: []

Example — Tag Project Directories and Design Assets

autotag:
  enable: true
  rawstrings: true
  files:
    - tags:
        - design-asset
      ext:
        - psd
        - ai
        - sketch
  dirs:
    - tags:
        - project-root
      path:
        - "^/projects/client-"
      path_exclude:
        - "^/projects/client-archive/"

Storage Cost Rollup

When enabled, every document carries a costpergb field computed from a base rate and any matching path or time rules. Directory documents additionally carry a rolled-up total summarizing all descendant costs. This powers Diskover's cost attribution dashboards.

Top-Level Settings

Setting	Default	What It Controls
`storagecost.enable`	`false`	Enable storage cost calculation.
`storagecost.costpergb`	`0.08`	Default storage cost per GB. Used when no more specific path or time rule matches.
`storagecost.base`	`2`	GB unit base: `2` = 1024^3 bytes per GB (binary, GiB), `10` = 1000^3 bytes per GB (decimal, GB). Most on-prem storage uses base 2; most cloud vendors bill in base 10.
`storagecost.sizefield`	`size_du`	Which size field cost is computed against: `size` (logical/apparent file size) or `size_du` (on-disk allocated size).
`storagecost.priority`	`time`	Which rule wins when a file matches both a path rule and a time rule: `path` or `time`.
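`storagecost.rawstrings`	`false`	Treat path cost regex patterns as raw strings, so backslashes in a pattern reach the regex engine literally instead of being processed as string escapes.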
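
The effect of the base setting is plain arithmetic: the same byte count yields a slightly different GB figure, and therefore a different cost, depending on whether a GB is taken as 1024^3 or 1000^3 bytes. The sketch below shows that calculation. The function name storage_cost is hypothetical, and rounding as Diskover might apply it is not modeled.

```python
def storage_cost(size_bytes, costpergb, base=2):
    """Return the cost of size_bytes at costpergb, using a base-2 or base-10 GB."""
    bytes_per_gb = 1024 ** 3 if base == 2 else 1000 ** 3
    return size_bytes / bytes_per_gb * costpergb


if __name__ == "__main__":
    one_tb_decimal = 10 ** 12  # bytes
    print("base 10:", storage_cost(one_tb_decimal, 0.08, base=10))
    print("base 2: ", storage_cost(one_tb_decimal, 0.08, base=2))
```

The size passed in would be whichever field sizefield selects, so the choice between size and size_du changes the input rather than the formula.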

Path Cost Rules

Each entry in storagecost.paths[] associates a cost rate with files whose parent path matches a regex pattern.

Field	Description
`path`	Regex patterns for parent-path matching.
`path_exclude`	Regex patterns that prevent matching.
`costpergb`	$/GB rate to apply when matched.

Time Cost Rules

Each entry in storagecost.times[] associates a cost rate with files whose timestamps fall within a given window.

Field	Description
`mtime`	mtime window in days.
`atime`	atime window in days.
`ctime`	ctime window in days.
`costpergb`	$/GB rate to apply when matched.

Example — Flat-Rate Default

storagecost:
  enable: true
  costpergb: 0.08
  base: 2
  sizefield: size_du
  priority: time

Example — Path-Tiered Rates (Hot and Cold)

storagecost:
  enable: true
  costpergb: 0.05
  base: 2
  sizefield: size_du
  priority: path
  rawstrings: true
  paths:
    - path:
        - "^/projects/"
      path_exclude: []
      costpergb: 0.12
    - path:
        - "^/archive/"
      path_exclude: []
      costpergb: 0.01
  times:
    - mtime: 365
      atime: 0
      ctime: 0
      costpergb: 0.02

In this configuration, files under /projects/ are charged at $0.12/GB and files under /archive/ at $0.01/GB. Files elsewhere that match the time rule (mtime: 365) are charged at $0.02/GB, and everything else falls through to the default $0.05/GB. Because priority: path is set, the path rule wins when a file matches both a path rule and a time rule.
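
Rate selection for the configuration above can be sketched as follows. Several things here are assumptions rather than documented behavior: the first matching rule in each list wins, a time rule's mtime value is treated as a minimum age in days, and only mtime is considered. The names pick_rate and CFG are hypothetical.

```python
import re

CFG = {
    "costpergb": 0.05,
    "priority": "path",
    "paths": [
        {"path": ["^/projects/"], "path_exclude": [], "costpergb": 0.12},
        {"path": ["^/archive/"], "path_exclude": [], "costpergb": 0.01},
    ],
    "times": [{"mtime": 365, "atime": 0, "ctime": 0, "costpergb": 0.02}],
}


def pick_rate(parent_path, mtime_age_days, cfg):
    """Return the $/GB rate for one file under the assumptions stated above."""
    path_rate = None
    for rule in cfg.get("paths", []):
        hit = any(re.search(p, parent_path) for p in rule["path"])
        vetoed = any(re.search(p, parent_path) for p in rule.get("path_exclude", []))
        if hit and not vetoed:
            path_rate = rule["costpergb"]
            break

    time_rate = None
    for rule in cfg.get("times", []):
        if rule.get("mtime", 0) and mtime_age_days >= rule["mtime"]:
            time_rate = rule["costpergb"]
            break

    # priority decides which rate is consulted first when both matched
    ordered = (path_rate, time_rate) if cfg.get("priority") == "path" else (time_rate, path_rate)
    for rate in ordered:
        if rate is not None:
            return rate
    return cfg["costpergb"]  # no rule matched: fall back to the default


if __name__ == "__main__":
    print(pick_rate("/projects/alpha", 400, CFG))
    print(pick_rate("/home/bob", 400, CFG))
    print(pick_rate("/home/bob", 10, CFG))
```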

Index Plugins

Index plugins are Python modules under /opt/diskover/plugins/ that contribute additional fields to file or directory documents during the crawl. They are different from post-index plugins (which run after a crawl completes) and file-action plugins (which run on demand from the UI).

Setting	Default	What It Controls
`plugins.enable`	`true`	Master switch for all index plugins.
`plugins.dirs`	`[]`	Plugin names to apply to directory documents.
`plugins.files`	`[]`	Plugin names to apply to file documents.

Available Plugins

The valid plugin names depend on which plugins are installed on your Diskover host. To see what's available:

python3 /opt/diskover/diskover.py -l

Copy the exact names from this output into plugins.dirs and plugins.files. The Admin UI dropdowns also show only the plugins installed and loaded on the current host.

Example

plugins:
  enable: true
  dirs:
    - dupesfinder
  files:
    - hash
    - mediainfo

Note: If you configure plugins in plugins.dirs or plugins.files but leave plugins.enable set to false, the Admin UI will prompt you to enable the master switch.

Elasticsearch Overrides

These settings let an individual Diskover scan override the cluster-wide Elasticsearch ingest settings. This is useful when you run multiple named configurations against the same cluster but need different shard counts or compression settings for specific scans.

Setting	Default	What It Controls
`elasticsearch_overrides.enable`	`false`	Enable scan-specific Elasticsearch settings. When disabled, values from the cluster-wide Elasticsearch scope are used.
`elasticsearch_overrides.settings.shards`	`1`	Number of shards for the index.
`elasticsearch_overrides.settings.replicas`	`0`	Number of replicas for the index.
`elasticsearch_overrides.settings.chunksize`	`1000`	Chunk size for ES bulk operations.
`elasticsearch_overrides.settings.max_size`	`20`	Number of connections kept open to ES during the crawl.
`elasticsearch_overrides.settings.http_compress`	`false`	Compress HTTP data. Set to `true` for remote Elasticsearch clusters.

Example — Remote Cluster Tune

elasticsearch_overrides:
  enable: true
  settings:
    shards: 3
    replicas: 1
    chunksize: 1000
    max_size: 30
    http_compress: true

When to use overrides vs. the cluster-wide scope: Leave enable: false (the default) when all scans should use the same ES ingest settings. Use overrides when a specific configuration variant needs different shard/replica counts or compression — for example, a nightly variant writing a large multi-shard index versus an ad-hoc fast variant writing a small single-shard index.
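
The enable switch can be pictured as a simple merge. The sketch below assumes override values replace cluster-wide values key by key when enable is true and are ignored otherwise. The function name effective_es_settings and the dictionary layout are hypothetical, chosen to mirror the setting names in the table.

```python
def effective_es_settings(cluster_settings, overrides):
    """Return the ES settings a scan would use under the assumed merge rule."""
    if overrides.get("enable"):
        # scan-specific values take precedence over cluster-wide ones
        return {**cluster_settings, **overrides.get("settings", {})}
    return dict(cluster_settings)


if __name__ == "__main__":
    cluster = {"shards": 1, "replicas": 0, "chunksize": 1000,
               "max_size": 20, "http_compress": False}
    nightly = {"enable": True,
               "settings": {"shards": 3, "replicas": 1, "http_compress": True}}
    fast = {"enable": False, "settings": {"shards": 3}}
    print(effective_es_settings(cluster, nightly))
    print(effective_es_settings(cluster, fast))
```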

Complete Configuration Examples

Production Tune for a Busy NAS

# Diskover.Configurations.production
excludes:
  dirs:
    - '\.snapshot'
    - '\.Snapshot'
    - '\.zfs'
    - '~snapshot'
  files:
    - 'Thumbs.db'
    - '.DS_Store'
    - 'desktop.ini'
  emptyfiles: true
  emptydirs: true
  minfilesize: 1
  checkfiletimes: false

maxthreads: 16
maxwalkthreads: 0
maxthreaddepth: 999
automaxthreaddepth: true
indexthreads: 16

rolluptimes: true
fileagegroups: true
followsymlinks: false
restoretimes: false

storagecost:
  enable: true
  costpergb: 0.08
  base: 2
  sizefield: size_du
  priority: time

plugins:
  enable: true
  dirs: []
  files: []

Stale-Data Sweep (Only Files Older Than 1 Year)

# Diskover.Configurations.stale-sweep
excludes:
  emptyfiles: true
  emptydirs: true
  checkfiletimes: true
  minmtime: 365
  maxmtime: 36500
  minatime: 365
  maxatime: 36500

maxthreads: 8
fileagegroups: true
rolluptimes: true

Last Updated: April 2026