Auto Clean / Orchestrate
License: ENT+ (Enterprise Edition or higher)
Plugin Type: Post-Index Plugin
Author: Diskover Data, Inc.
Overview
AutoClean is Diskover's automated file orchestration plugin that takes action on files and directories based on Elasticsearch query results. Think of it as your automated file management assistant—it can move, copy, delete, rename, or even run custom scripts on files that match specific criteria you define.
This plugin runs after a Diskover indexing scan completes, making it perfect for implementing data lifecycle policies, compliance workflows, and storage optimization strategies. By combining Diskover's powerful search capabilities with automated actions, you can build sophisticated file management workflows that would otherwise require manual intervention or complex scripting.
Why Use AutoClean?
Automate repetitive tasks: Stop manually moving expired files or cleaning up temp directories
Implement approval workflows: Require specific tags before destructive actions occur
Ensure consistency: Apply the same file management policies across your entire storage infrastructure
Maintain audit trails: Track what actions were taken through Diskover's tagging system
Use Cases
Quarantine Expired Files
Files tagged as expired by users or automated policies can be automatically moved to a quarantine directory. This preserves the original path structure while giving stakeholders a grace period before permanent deletion. For example, /data/projects/report.pdf tagged as expired moves to /quarantine/data/projects/report.pdf and receives an autocleaned tag.
Archive and Purge Large Files
Implement a two-stage workflow for storage optimization: First, copy large files (say, over 1GB) to archive storage. After verifying the archive via a new Diskover scan, run a second AutoClean configuration to delete the originals from primary storage. This ensures data safety while reclaiming expensive Tier 1 storage space.
Custom Processing Pipeline
Run validation scripts, compliance checkers, or any external tool on matched files. AutoClean passes the full file path to your custom command, enabling integration with existing tools and workflows. For example, files tagged for compliance review can automatically trigger your validation tool.
Sample Data from an AutoClean Execution:
Here we can see that the data has been copied to the archive location and tagged with both autocleaned and copied-to-archive.
Installation
AutoClean is included with Diskover Enterprise Edition installations. No separate installation is required—the plugin is ready to configure once your Diskover environment is set up.
DNF Installation (Linux RPM)
On Linux systems using DNF package management, this plugin can be installed via RPM:
sudo dnf install diskover-plugin-postindex-autoclean
Note: Ensure your system is configured with the Diskover RPM repository before running the install command.
Prerequisites
| Component | Requirement |
|---|---|
| Diskover | Enterprise Edition with plugin support |
| Python | 3.9 or higher (included with Diskover) |
| Elasticsearch | 7.x or 8.x (as supported by your Diskover version) |
| Storage Access | Read/write permissions to target storage locations |
Verify Installation
To confirm the plugin is available, run the following command:
Linux:
python3 /opt/diskover/plugins_postindex/diskover_autoclean/diskover_autoclean.py --version
Windows:
python "C:\Program Files\Diskover\plugins_postindex\diskover_autoclean\diskover_autoclean.py" --version
You should see output showing the plugin version number.
Configuration
AutoClean is configured through the Diskover Admin Panel. Navigate to Plugins > Post Index > AutoClean to access the configuration interface.
Here is the beginning of our sample configuration. The AutoClean plugin has many other configuration options, covered in detail below.
Global Settings
These settings apply to all AutoClean actions:
| Setting | Type | Default | Description |
|---|---|---|---|
| maxthreads | Integer | 8 | Maximum parallel processing threads. Increase for I/O-bound operations on fast storage. |
| | Boolean | False | When True, deletes entire directory trees (like |
| | Boolean | True | Preserves the source directory structure when moving files. |
| | Boolean | True | Preserves the source directory structure when copying files. |
Path Replacement Settings
For environments where indexed paths differ from actual filesystem paths (common with NFS mounts or SMB shares):
| Setting | Type | Description |
|---|---|---|
| | Boolean | Enable path translation |
| | String | The path pattern as it appears in the index |
| | String | The actual filesystem path to use |
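As a rough illustration of what path translation means here, the sketch below swaps an indexed prefix for a filesystem prefix. The function name, parameter names, and example paths are hypothetical; this is not AutoClean's actual implementation or its setting names.

```python
def translate_path(indexed_path: str, replace_from: str, replace_to: str) -> str:
    """Swap a leading indexed prefix for the real filesystem prefix.

    Hypothetical helper for illustration only; not AutoClean's code.
    """
    prefix = replace_from.rstrip("/")
    # Only rewrite when the prefix matches a whole path component,
    # so "/mnt/nfs" does not accidentally match "/mnt/nfs2".
    if indexed_path == prefix or indexed_path.startswith(prefix + "/"):
        return replace_to.rstrip("/") + indexed_path[len(prefix):]
    return indexed_path


# Example paths are made up for the demonstration.
print(translate_path("/mnt/nfs/projects/report.pdf", "/mnt/nfs", "/data"))
```

Matching on whole path components is a design choice in this sketch; a plain string replacement would also rewrite unrelated paths that merely share the same leading characters.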
Action Configuration
AutoClean supports separate action lists for files and directories. Each action in the list includes:
| Setting | Type | Default | Description |
|---|---|---|---|
| | String | | The operation to perform: |
| | String | | Elasticsearch query to match documents |
| check_ctime | Boolean | True | Skip files if change time differs from indexed value |
| check_mtime | Boolean | True | Skip files if modification time differs from indexed value |
| | List | | Tags to add after successful processing |
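The two time checks compare a file's current timestamps with the values recorded at index time. Below is a minimal sketch of that comparison, assuming whole-second granularity and a hypothetical helper name; AutoClean's internal logic may differ. Note that on Windows, st_ctime reports creation time rather than change time.

```python
import os
import tempfile


def unchanged_since_index(path: str, indexed_mtime: float, indexed_ctime: float,
                          check_mtime: bool = True, check_ctime: bool = True) -> bool:
    """Return True if the enabled timestamp checks still match the indexed values.

    Hypothetical helper; compares at whole-second granularity (an assumption).
    """
    st = os.stat(path)
    if check_mtime and int(st.st_mtime) != int(indexed_mtime):
        return False
    if check_ctime and int(st.st_ctime) != int(indexed_ctime):
        return False
    return True


# Demonstration on a temporary file standing in for an indexed file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    demo_path = tmp.name
indexed = os.stat(demo_path)
print(unchanged_since_index(demo_path, indexed.st_mtime, indexed.st_ctime))

# Simulate a modification after indexing by shifting mtime back 100 seconds.
os.utime(demo_path, (indexed.st_atime, indexed.st_mtime - 100))
print(unchanged_since_index(demo_path, indexed.st_mtime, indexed.st_ctime))
os.remove(demo_path)
```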
Action-Specific Settings
| Setting | Applies To | Description |
|---|---|---|
| | rename | Suffix to append (e.g., |
| | move | Destination directory (relative or absolute path) |
| | copy | Destination directory for copies |
| | chmod | Permission mode in octal (e.g., |
| | chown | User ID to set (-1 to skip) |
| | chown | Group ID to set (-1 to skip) |
| | custom | External command to execute |
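For reference, the chmod and chown rows correspond closely to Python's standard library calls, where passing -1 for an ID leaves that ID unchanged. The helper name and the 0o640 mode below are illustrative assumptions, not AutoClean defaults or its actual implementation.

```python
import os
import stat
import tempfile


def apply_mode_and_owner(path, mode=None, uid=-1, gid=-1):
    """Apply an octal mode and/or owner IDs to path (hypothetical helper)."""
    if mode is not None:
        os.chmod(path, mode)
    # os.chown is POSIX-only; passing -1 for an ID leaves it unchanged.
    if hasattr(os, "chown") and (uid != -1 or gid != -1):
        os.chown(path, uid, gid)


# Demonstration on a temporary file; 0o640 is only an example value.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    demo_path = tmp.name
apply_mode_and_owner(demo_path, mode=0o640)
print(oct(stat.S_IMODE(os.stat(demo_path).st_mode)))
os.remove(demo_path)
```

Changing ownership to another user generally requires elevated privileges, so the demonstration only changes the mode.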
Understanding Path Behavior
Relative paths (starting with ./) are interpreted relative to each source file's location:
Source: /data/projects/report.pdf
Move directory: ./.archive
Result: /data/projects/.archive/report.pdf
Absolute paths with path preservation enabled recreate the full directory structure:
Source: /data/projects/report.pdf
Move directory: /archive
Result: /archive/data/projects/report.pdf
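The two path rules described in this section can be sketched as a small function. This is an illustration under stated assumptions: the function and parameter names are hypothetical, and the behavior when preservation is disabled (placing the file directly in the target directory) is an assumption that the text does not spell out.

```python
import posixpath


def destination_path(source: str, target_dir: str, preserve_tree: bool = True) -> str:
    """Compute where a file would land for a move or copy (hypothetical helper)."""
    name = posixpath.basename(source)
    if target_dir.startswith("./"):
        # Relative target: resolved against the source file's own directory.
        joined = posixpath.join(posixpath.dirname(source), target_dir, name)
        return posixpath.normpath(joined)
    if preserve_tree:
        # Absolute target with preservation: recreate the full source path beneath it.
        return posixpath.join(target_dir, source.lstrip("/"))
    # Assumption: without preservation the file goes directly into the target.
    return posixpath.join(target_dir, name)


print(destination_path("/data/projects/report.pdf", "./.archive"))
print(destination_path("/data/projects/report.pdf", "/archive"))
```

The two printed results correspond to the relative and absolute examples given in this section.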
Example Configuration
Here's a typical configuration for quarantining expired files:
Files Action:
Action: move
Query: tags:expired AND NOT tags:quarantined
Move Directory: /quarantine
Tags: ['quarantined', 'autocleaned']
Check mtime: True
Check ctime: True
This configuration finds all files tagged as expired (but not already quarantined), moves them to the /quarantine directory while preserving their original path structure, and adds the quarantined and autocleaned tags to track the action.
Execution
AutoClean can be run manually from the command line or scheduled to run automatically after indexing operations.
Manual Execution
Linux:
python3 /opt/diskover/plugins_postindex/diskover_autoclean/diskover_autoclean.py [options] <indexname>
Windows:
python "C:\Program Files\Diskover\plugins_postindex\diskover_autoclean\diskover_autoclean.py" [options] <indexname>
Command-Line Options
| Option | Description |
|---|---|
| | Display help message |
| -c | Use a named configuration from Diskover Admin |
| -n | Preview mode: shows what would happen without making changes |
| | Auto-find the most recent index for a given top path |
| -v | Enable verbose logging |
| -V | Enable debug-level logging |
| --version | Print version and exit |
Recommended: Always Test with Dry-Run First
Before running AutoClean on production data, use the dry-run option to preview what actions would be taken:
Linux:
python3 /opt/diskover/plugins_postindex/diskover_autoclean/diskover_autoclean.py -n -V diskover-mydata-2025.01.14
Windows:
python "C:\Program Files\Diskover\plugins_postindex\diskover_autoclean\diskover_autoclean.py" -n -V diskover-mydata-2025.01.14
This executes all discovery logic and logging but skips actual file operations and tag updates.
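If you drive AutoClean from your own tooling, a thin wrapper can make the dry-run step hard to skip. The sketch below only assembles the documented Linux command line (script path, -n, -V, and the index name) and runs it if the script is present. The wrapper function itself is hypothetical.

```python
import subprocess
from pathlib import Path

# Script path as documented for Linux installs.
AUTOCLEAN = Path("/opt/diskover/plugins_postindex/diskover_autoclean/diskover_autoclean.py")


def dry_run_command(index_name: str) -> list[str]:
    """Build the dry-run invocation: -n previews, -V enables detailed logging."""
    return ["python3", str(AUTOCLEAN), "-n", "-V", index_name]


if __name__ == "__main__":
    cmd = dry_run_command("diskover-mydata-2025.01.14")
    if AUTOCLEAN.exists():
        # check=True raises CalledProcessError if AutoClean exits non-zero.
        subprocess.run(cmd, check=True)
    else:
        print("AutoClean script not found; would run:", " ".join(cmd))
```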
Here we can see that directories matching a query on a specific path are being targeted for a change of ownership (chown) command.
Automated Execution
To run AutoClean automatically after indexing completes, configure it as a Post-Crawl Command within your Index Task.
Post-Crawl Command Configuration
Linux Example:
| Field | Value |
|---|---|
| Post-Crawl Command | |
| Post-Crawl Command Args | |
Windows Example:
| Field | Value |
|---|---|
| Post-Crawl Command | |
| Post-Crawl Command Args | |
Available Index Task Tokens:
{indexname}: The name of the index that was just created
On your system, replace ConfigurationName in the examples above with the name of a configuration you have created at Diskover Admin → Plugins → Post-Index → AutoClean. If you are using the Default configuration rather than a custom one, the -c flag and ConfigurationName are not required.
Important:
The Post-Crawl Command field should contain ONLY the executable (e.g., python3, python)
All script paths, flags, and arguments go in the Post-Crawl Command Args field
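To make the split between the two fields concrete, the sketch below takes a full command line and separates the executable from everything else. The command line used here is pieced together from the script path, the -c option, and the {indexname} token described in this document, so treat it as an assumption rather than an exact value to paste. The helper name is hypothetical.

```python
def split_for_index_task(command_line: str) -> tuple[str, str]:
    """Split a command line into (Post-Crawl Command, Post-Crawl Command Args).

    Assumes the executable itself contains no spaces (e.g. python3 or python).
    """
    executable, _, args = command_line.strip().partition(" ")
    return executable, args.strip()


command, args = split_for_index_task(
    "python3 /opt/diskover/plugins_postindex/diskover_autoclean/diskover_autoclean.py "
    "-c ConfigurationName {indexname}"
)
print("Post-Crawl Command:     ", command)
print("Post-Crawl Command Args:", args)
```

Quoting inside the argument string, such as a quoted Windows script path, is left untouched by this split.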
Using Named Configurations
If you have multiple AutoClean configurations for different workflows, specify which one to use with the -c option:
Linux Example:
| Field | Value |
|---|---|
| Post-Crawl Command | |
| Post-Crawl Command Args | |
Windows Example:
| Field | Value |
|---|---|
| Post-Crawl Command | |
| Post-Crawl Command Args | |
Using Latest Index Auto-Discovery
For workflows where you want to process the most recent index for a specific storage path:
Linux Example:
| Field | Value |
|---|---|
| Post-Crawl Command | |
| Post-Crawl Command Args | |
Windows Example:
| Field | Value |
|---|---|
| Post-Crawl Command | |
| Post-Crawl Command Args | |
Reviewing the Output
During Execution
With verbose logging enabled (-v or -V), AutoClean provides detailed progress information:
[MainThread] Finding files and directories to cleanup in index diskover-mydata-2025.01.14 ... (8 threads)
[Thread-1] Moving /data/projects/old_report.pdf to /quarantine/data/projects/old_report.pdf (2.3 MB) ...
[Thread-1] Finished moving /data/projects/old_report.pdf (Size: 2.3 MB, Duration: 0:00:01.234, Speed: 1.9 MB/s)
[Thread-2] Deleting /data/temp/cache.tmp (156 KB) ...
[Thread-2] Finished deleting /data/temp/cache.tmp (Size: 156 KB, Duration: 0:00:00.045)
Summary Statistics
At completion, AutoClean reports comprehensive statistics:
Finished cleaning
*** Duration: 0:05:23.456 ***
*** Processed 1,234 files (3.82/s), 56 dirs (0.17/s) ***
*** Removed 15.6 GB, allocated size 16.2 GB ***
*** Moved 8.3 GB, allocated size 8.7 GB ***
*** Copied 0 B, allocated size 0 B ***
*** Warnings 12, Errors 0 ***
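When AutoClean runs unattended, it can help to pull the warning and error counts out of a captured log. The sketch below parses the sample summary shown here with a regular expression. It assumes the summary line keeps this exact wording, and the function name is hypothetical.

```python
import re

# Sample summary text, copied from the example output in this document.
SUMMARY = """\
Finished cleaning
*** Duration: 0:05:23.456 ***
*** Processed 1,234 files (3.82/s), 56 dirs (0.17/s) ***
*** Removed 15.6 GB, allocated size 16.2 GB ***
*** Moved 8.3 GB, allocated size 8.7 GB ***
*** Copied 0 B, allocated size 0 B ***
*** Warnings 12, Errors 0 ***
"""


def parse_warnings_errors(text: str) -> tuple[int, int]:
    """Return (warnings, errors) from an AutoClean summary (hypothetical helper)."""
    match = re.search(r"Warnings\s+([\d,]+),\s+Errors\s+([\d,]+)", text)
    if match is None:
        raise ValueError("summary line not found")
    warnings, errors = (int(g.replace(",", "")) for g in match.groups())
    return warnings, errors


warnings, errors = parse_warnings_errors(SUMMARY)
print(f"warnings={warnings} errors={errors}")
```

A scheduler or monitoring script could use the returned error count to decide whether a run needs attention.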
Understanding Warnings
Common warning messages and their meanings:
| Message | Meaning |
|---|---|
| | File no longer exists at the indexed path (it may have been moved or deleted since indexing) |
| | File was modified after indexing; skipped to prevent processing changed files |
Log Location
AutoClean logs are written to the standard Diskover log location and can also be captured by redirecting output:
Linux:
python3 /opt/diskover/plugins_postindex/diskover_autoclean/diskover_autoclean.py -V myindex 2>&1 | tee autoclean_run.log
Windows:
python "C:\Program Files\Diskover\plugins_postindex\diskover_autoclean\diskover_autoclean.py" -V myindex 2>&1 | Tee-Object -FilePath autoclean_run.log
Here we can see that 412 files totaling 2.24 MB were copied by this AutoClean execution.
Searching in Diskover
After AutoClean processes files, you can easily find them using Diskover's search capabilities. The plugin adds configurable tags to processed files, making them searchable.
Finding Processed Files
To find all files processed by AutoClean using the default tag:
tags:autocleaned
If you configured custom tags (e.g., quarantined, archived), search for those instead:
tags:quarantined
tags:archived
Combining with Other Search Criteria
Find large archived files (over 1 GiB) that were modified in the last 30 days:
tags:archived AND size:>1073741824 AND mtime:[now-30d TO now]
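The size value in this query is 1 GiB expressed in bytes (1024 × 1024 × 1024 = 1073741824). The following sketch, using a hypothetical helper, builds the same query string from more readable units:

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def archived_large_recent_query(min_gib: int = 1, days: int = 30) -> str:
    """Build the search string for large, recently modified archived files."""
    return f"tags:archived AND size:>{min_gib * GIB} AND mtime:[now-{days}d TO now]"


print(archived_large_recent_query())
```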
Find files in a specific directory that were processed:
tags:autocleaned AND parent_path:\/data\/projects*
Here we can see that 18 autocleaned items were found in this scan of /opt/diskover/.
Troubleshooting
Files Not Being Processed
Check your query syntax: Test the exact query in Diskover's search interface first. If it doesn't return expected results there, it won't work in AutoClean either.
Time validation may be skipping files: If files were modified after indexing, AutoClean skips them by default (a safety feature). Use -V logging to see which files are skipped and why. Consider disabling check_mtime or check_ctime if appropriate for your workflow.
Verify files still exist: Files may have been moved or deleted since the index was created. The not found! warning indicates this condition.
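If you would rather test a query against Elasticsearch directly than in the search interface, one option is to wrap it in a query_string search body. The sketch below only builds the request body and sends nothing. It assumes the AutoClean query uses the same Lucene-style syntax as the search bar, and the helper name is hypothetical.

```python
import json


def build_search_body(query: str, size: int = 10) -> dict:
    """Wrap a query string in an Elasticsearch query_string search body."""
    return {"size": size, "query": {"query_string": {"query": query}}}


body = build_search_body("tags:expired AND NOT tags:quarantined")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent to the index's _search endpoint with whichever HTTP client you normally use.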
Permission Errors
Ensure the user account running Diskover has appropriate read/write permissions on both source and destination paths. On Windows, AutoClean attempts to modify permissions before deletion if initial attempts fail, but this requires appropriate privileges.
Slow Performance
Increase thread count: For I/O-bound operations on fast storage, increase maxthreads beyond the default of 8
Check Elasticsearch health: Slow queries can bottleneck processing
Schedule during off-peak hours: Heavy file operations compete with normal storage access
Debug Mode
For detailed troubleshooting, run with maximum verbosity:
Linux:
python3 /opt/diskover/plugins_postindex/diskover_autoclean/diskover_autoclean.py -n -V myindex 2>&1 | tee debug.log
Windows:
python "C:\Program Files\Diskover\plugins_postindex\diskover_autoclean\diskover_autoclean.py" -n -V myindex 2>&1 | Tee-Object -FilePath debug.log
This captures the full Elasticsearch queries being executed, path translations, and detailed action logging.
Support
Last Updated: April 2026