Path Tokens
License: PRO+ (Professional Edition or higher)
Plugin Type: Index Plugin
Author: Diskover Data, Inc.
Overview
The Path Tokens plugin transforms file and directory names into searchable keywords during Diskover indexing. It breaks down concatenated names, extracts embedded dates, and filters common words to create meaningful search tokens—making it dramatically easier to find files based on concepts rather than exact filenames.
When file paths contain concatenated names like ProjectPhoenix_FinalDeliverable or CASE-2024-00123_Evidence, standard searches struggle to find them. This plugin transforms these names into searchable tokens: project, phoenix, final, deliverable or case, 2024-00123, evidence.
Key capabilities include:
CamelCase Conversion — Transforms
MyProjectNameinto separate searchable tokensDate Extraction — Recognizes dates in 9 different formats embedded in filenames and normalizes them
Stop Word Filtering — Removes common English words (the, and, is, etc.) to reduce noise
Path and Name Separation — Creates distinct token sets for directory paths and filenames
Modification Time Indexing — Adds human-readable
YYYY-Monthformat for time-based searches
Why Use Path Tokens?
Organizations often develop naming conventions that concatenate meaningful information into file and folder names. These conventions work well for humans organizing files, but make searching difficult. The Path Tokens plugin bridges this gap by extracting the meaningful words from complex names.
The plugin adds these fields to indexed documents:
Field | Description |
|---|---|
| Tokens extracted from the directory path |
| Tokens extracted from the file or directory name |
| File modification time as "YYYY-Month" format |
Use Cases
Discovering Hot Words Across Your Storage
One of the most powerful applications of the Path Tokens plugin is discovering common themes, project names, and terminology across your file system. By tokenizing paths and names, you can quickly identify which project names appear most frequently, common client or department identifiers, and recurring patterns in how teams name their files and folders.
For Media & Entertainment
Use Case | Description | Example Search |
|---|---|---|
Film/Show Discovery | Find all files related to a production by name |
|
Deliverable Tracking | Find final deliverables across projects |
|
For Legal & Compliance
Use Case | Description | Example Search |
|---|---|---|
Case File Discovery | Find files by legal case identifiers |
|
Matter Management | Locate documents by matter number or name |
|
For Project-Based Organizations
Use Case | Description | Example Search |
|---|---|---|
Project Discovery | Find all files related to a project codename |
|
Version Tracking | Find files with version indicators |
|
For Storage Administrators
Use Case | Description | Example Search |
|---|---|---|
Naming Convention Audit | Verify files follow naming standards |
|
Department Content Analysis | Analyze storage by department |
|
Understanding Path Tokenization
What is Path Tokenization?
Path tokenization is the process of breaking down file and directory names into individual searchable keywords. This is essential because many file naming conventions use concatenation (CamelCase, underscores, hyphens) that make natural language searching difficult.
Why Tokenization Matters
Problem | Without Tokenization | With Tokenization |
|---|---|---|
File named | Must search exact name or use wildcards | Search for |
Path | Limited path searching | Search for |
Mixed naming conventions | Different search patterns needed for each style | Consistent token-based search |
Legal case | Complex wildcard queries required | Search for |
How the Plugin Processes Names
The plugin applies several transformations in sequence:
CamelCase to snake_case —
MyProjectbecomesmy_projectSpecial character replacement — Characters like
.-_\/(){}[]|@,;become spacesWord tokenization — Text is split into individual words
Stop word removal — Common English words are filtered out
Date extraction — Date patterns are identified and normalized
Deduplication — Duplicate tokens are removed
Example Transformation
Input path:
/data/projects/ClientABC/ProjectPhoenix_FinalDeliverable_2024-01-15.docx
Resulting tokens:
Field | Tokens |
|---|---|
|
|
|
|
|
|
Notice how:
ClientABCwas split intoclientandabcProjectPhoenixwas split intoprojectandphoenixThe date
2024-01-15was extracted and normalizedThe file extension
.docxwas removed from tokensCommon words were filtered out
Stop Words
The plugin filters out common English words that add noise to search results. Words like "the", "and", "is", "in", "of", "to" are removed from tokens.
Note: The word "it" is explicitly preserved (not filtered) to allow searching for files with "it" in their names.
Supported Date Formats
The plugin recognizes and extracts dates in 9 different formats:
Format | Example | Normalized Output |
|---|---|---|
YYYY-MM-DD |
|
|
MM-DD-YYYY |
|
|
DD-MM-YYYY |
|
|
YYYY.MM.DD |
|
|
MM.DD.YYYY |
|
|
DD.MM.YYYY |
|
|
YYYY/MM/DD |
|
|
MM/DD/YYYY |
|
|
DD/MM/YYYY |
|
|
All extracted dates are normalized to YYYY-MM-DD format for consistent searching.
Requirements
Python Dependencies
Package | Version | Purpose |
|---|---|---|
nltk | 3.9.1 | Natural Language Toolkit for tokenization and stop words |
System Requirements
Python 3.9 or higher
Diskover indexer with plugin support enabled
Installation
Step 1: Install Python Dependencies
Linux:
python3 -m pip install nltk==3.9.1
Windows:
python -m pip install nltk==3.9.1
Step 2: Configure the Plugin
Navigate to ⚙️ Settings > Plugins > Index Plugins > Path Tokens
The plugin uses default settings and requires no additional configuration
Save the configuration
Step 3: Enable in Index Task Configuration
Navigate to the Index Task Configurations page
Select the configuration you want to modify (or create a new one)
Scroll to the Index Plugins Enablement section at the bottom
Enable the Path Tokens plugin
Save the configuration
Step 4: Run an Index
The plugin will automatically process files and directories during the next scan using the configured Index Task. No additional commands or manual execution is required.
Configuration
The Path Tokens plugin uses minimal configuration and works effectively with default settings.
Configuration Parameters
Parameter | Type | Default | Description |
|---|---|---|---|
| method |
| Processes both files and directories |
Default Behavior
Files: Tokenizes both the parent directory path and the filename
Directories: Tokenizes both the parent path and the directory name
All file types: No extension filtering—processes all files regardless of type
Lightweight processing: No caching required due to fast in-memory text processing
Scope
The plugin configuration scope is: Plugins.Index.Path Tokens.Default
Indexed Fields
The Path Tokens plugin adds a pathtokens object to indexed documents with the following structure:
Elasticsearch Field Mappings
Field Path | ES Type | Description |
|---|---|---|
| object | Root container for all path token data |
| keyword (array) | Tokenized directory path components |
| keyword (array) | Tokenized filename or directory name |
| keyword | File modification time in |
Example Document: File
{
"name": "CASE-2024-00123_FinalBrief.pdf",
"parent_path": "/legal/ActiveMatters/SmithVsJones",
"extension": "pdf",
"pathtokens": {
"path": ["legal", "active", "matters", "smith", "vs", "jones"],
"name": ["case", "2024-00123", "final", "brief"],
"mtime": "2024-March"
}
}
Example Document: Directory
{
"name": "ProjectPhoenix_Phase2",
"parent_path": "/projects/ClientABC/2024",
"pathtokens": {
"path": ["projects", "client", "abc", "2024"],
"name": ["project", "phoenix", "phase2"],
"mtime": "2024-February"
}
}
Example Document: Media File
{
"name": "Inception_S01E03_ColorGrade_Final.mov",
"parent_path": "/productions/Netflix/Inception/Deliverables",
"extension": "mov",
"pathtokens": {
"path": ["productions", "netflix", "inception", "deliverables"],
"name": ["s01e03", "color", "grade", "final"],
"mtime": "2024-January"
}
}
Searching in Diskover
Use these search queries in the Diskover web interface to find files based on path tokens.
Basic Token Searches
Query | Description |
|---|---|
| Find files in paths containing "project" |
| Find files with "final" in their name |
| Find files modified in January 2024 |
Project and Client Discovery
Query | Description |
|---|---|
| Find all files in Project Phoenix directories |
| Find files in ClientABC folders |
| Find deliverable files across all projects |
Date-Based Searches
Query | Description |
|---|---|
| Find files with this specific date in the name |
| Find files with any 2024 date in the name |
| Find files modified anytime in 2023 |
Combined Searches
Query | Description |
|---|---|
| Find PDFs in finance directories |
| Find 2024 reports |
| Find large files in archive paths |
Troubleshooting
Common Issues
Issue | Cause | Solution |
|---|---|---|
| Plugin not enabled in Index Task | Enable the plugin in your Index Task Configuration and re-run the scan |
Empty or missing tokens | All tokens were stop words | This is expected—common words like "the", "and", "is" are filtered out |
Date not extracted | Date format not recognized | Ensure dates use one of the 9 supported formats (see Understanding Path Tokenization section) |
File extension appearing in tokens | Plugin version issue | Update to the latest plugin version |
Tokens appear lowercase | Expected behavior | All tokens are normalized to lowercase for consistent searching |
Plugin Not Loading
Symptom: Errors during indexing or pathtokens field not appearing.
Diagnosis:
Linux:
# Verify plugin directory exists
ls -la /opt/diskover/plugins/pathtokens/
# Check Python syntax
python3 -m py_compile /opt/diskover/plugins/pathtokens/__init__.py
# Test import
cd /opt/diskover
python3 -c "from plugins.pathtokens import *; print('Plugin loaded successfully')"
Windows:
# Verify plugin directory exists
dir "C:\Program Files\Diskover\plugins\pathtokens\"
# Check Python syntax
python -m py_compile "C:\Program Files\Diskover\plugins\pathtokens\__init__.py"
# Test import
cd "C:\Program Files\Diskover"
python -c "from plugins.pathtokens import *; print('Plugin loaded successfully')"
NLTK Not Installed
Symptom: ModuleNotFoundError: No module named 'nltk'
Solution:
Linux:
python3 -m pip install nltk==3.9.1
Windows:
python -m pip install nltk==3.9.1
Verifying Indexed Data
To confirm the plugin is working, search for any file with path tokens:
pathtokens:*
If results appear, the plugin is successfully indexing token data.
Debug Logging
To enable verbose logging for troubleshooting:
Check your Diskover indexer log configuration
Look for entries containing
pathtokens_pluginin the logs
Log locations:
Linux:
/var/log/diskover/diskover.logor as configuredWindows: Check Diskover service logs or configured log location
Support
Last Updated: January 2026
Diskover Data, Inc.
Comments
0 comments
Please sign in to leave a comment.