Path Tokens

License: PRO+ (Professional Edition or higher)
Plugin Type: Index Plugin
Author: Diskover Data, Inc.

Overview

The Path Tokens plugin transforms file and directory names into searchable keywords during Diskover indexing. It breaks down concatenated names, extracts embedded dates, and filters out common words to create meaningful search tokens, so files can be found by the words in their names rather than by exact filename.

When file paths contain concatenated names like ProjectPhoenix_FinalDeliverable or CASE-2024-00123_Evidence, standard searches struggle to find them. This plugin turns such names into searchable tokens: project, phoenix, final, deliverable for the first, and case, 2024-00123, evidence for the second.

Key capabilities include:

CamelCase Conversion — Transforms MyProjectName into separate searchable tokens
Date Extraction — Recognizes dates in 9 different formats embedded in filenames and normalizes them
Stop Word Filtering — Removes common English words (the, and, is, etc.) to reduce noise
Path and Name Separation — Creates distinct token sets for directory paths and filenames
Modification Time Indexing — Adds human-readable YYYY-Month format for time-based searches

Why Use Path Tokens?

Organizations often develop naming conventions that concatenate meaningful information into file and folder names. These conventions work well for humans organizing files, but make searching difficult. The Path Tokens plugin bridges this gap by extracting the meaningful words from complex names.

The plugin adds these fields to indexed documents:

Field	Description
`pathtokens.path`	Tokens extracted from the directory path
`pathtokens.name`	Tokens extracted from the file or directory name
`pathtokens.mtime`	File modification time as "YYYY-Month" format

Use Cases

Discovering Hot Words Across Your Storage

One of the most powerful applications of the Path Tokens plugin is discovering common themes, project names, and terminology across your file system. By tokenizing paths and names, you can quickly identify which project names appear most frequently, common client or department identifiers, and recurring patterns in how teams name their files and folders.
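
As a rough illustration of hot word discovery, the sketch below counts how many documents each token appears in. The sample documents are hypothetical and reduced to the pathtokens object; `hot_words` is an illustrative helper, not part of the plugin.

```python
from collections import Counter

# Hypothetical sample of indexed documents, reduced to the pathtokens object.
docs = [
    {"pathtokens": {"path": ["projects", "client", "abc", "2024"],
                    "name": ["project", "phoenix", "phase2"]}},
    {"pathtokens": {"path": ["legal", "active", "matters"],
                    "name": ["case", "final", "brief"]}},
    {"pathtokens": {"path": ["projects", "phoenix"],
                    "name": ["final", "deliverable"]}},
]

def hot_words(documents, top_n=3):
    """Count how many documents each token appears in, across path and name."""
    counts = Counter()
    for doc in documents:
        tokens = doc.get("pathtokens", {})
        # A set keeps a token from being counted twice for the same document.
        counts.update(set(tokens.get("path", [])) | set(tokens.get("name", [])))
    return counts.most_common(top_n)

print(hot_words(docs))
```

On a real index the same idea is usually expressed as an aggregation on the search backend rather than by loading documents into Python.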

For Media & Entertainment

Use Case	Description	Example Search
Film/Show Discovery	Find all files related to a production by name	`pathtokens.path:inception OR pathtokens.name:inception`
Deliverable Tracking	Find final deliverables across projects	`pathtokens.name:final AND pathtokens.name:deliverable`

For Legal & Compliance

Use Case	Description	Example Search
Case File Discovery	Find files by legal case identifiers	`pathtokens.name:2024-00123`
Matter Management	Locate documents by matter number or name	`pathtokens.path:matter AND pathtokens.path:smith`

For Project-Based Organizations

Use Case	Description	Example Search
Project Discovery	Find all files related to a project codename	`pathtokens.path:phoenix`
Version Tracking	Find files with version indicators	`pathtokens.name:v2 OR pathtokens.name:final`

For Storage Administrators

Use Case	Description	Example Search
Naming Convention Audit	Verify files follow naming standards	`pathtokens.path:archive AND pathtokens.name:*`
Department Content Analysis	Analyze storage by department	`pathtokens.path:finance OR pathtokens.path:legal`

Understanding Path Tokenization

What is Path Tokenization?

Path tokenization is the process of breaking down file and directory names into individual searchable keywords. This is essential because many file naming conventions use concatenation (CamelCase, underscores, hyphens) that makes natural language searching difficult.

Why Tokenization Matters

Problem	Without Tokenization	With Tokenization
File named `ProjectAlphaFinalDeliverable.pdf`	Must search exact name or use wildcards	Search for `alpha`, `final`, or `deliverable`
Path `/data/2024-Q1-Reports/`	Limited path searching	Search for `2024`, `q1`, `reports`
Mixed naming conventions	Different search patterns needed for each style	Consistent token-based search
Legal case `CASE-2024-00123_Exhibits`	Complex wildcard queries required	Search for `case`, `2024-00123`, `exhibits`

How the Plugin Processes Names

The plugin applies several transformations in sequence:

CamelCase to snake_case — MyProject becomes my_project
Special character replacement — Characters like .-_\/(){}[]|@,; become spaces
Word tokenization — Text is split into individual words
Stop word removal — Common English words are filtered out
Date extraction — Date patterns are identified and normalized
Deduplication — Duplicate tokens are removed
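
The steps above can be sketched in a few lines of Python. This is a minimal approximation, not the plugin's source: the stop word set is a small stand-in for the NLTK list, only the YYYY-MM-DD pattern is handled, and dates are pulled out before separators are replaced (an assumption made here so the hyphens inside a date survive). The sketch expects a name with its extension already removed.

```python
import re

# Small stand-in stop word list; the plugin's real list (from NLTK) is longer.
STOP_WORDS = {"the", "and", "is", "in", "of", "to", "a", "for"}

# Only the YYYY-MM-DD pattern is handled in this sketch.
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")

def camel_to_snake(text):
    """Insert underscores at lowercase/digit-to-uppercase boundaries and
    between an acronym and a following capitalized word, then lowercase."""
    text = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", text)
    text = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", text)
    return text.lower()

def tokenize(name):
    # Pull dates out first so the separator replacement does not break them up.
    dates = ISO_DATE.findall(name)
    name = ISO_DATE.sub(" ", name)
    # CamelCase to snake_case, then separators to spaces.
    name = camel_to_snake(name)
    name = re.sub(r"[.\-_\\/(){}\[\]|@,;]", " ", name)
    # Split, drop stop words, and deduplicate while keeping first-seen order.
    words = [w for w in name.split() if w not in STOP_WORDS]
    return list(dict.fromkeys(words + dates))

print(tokenize("ProjectPhoenix_FinalDeliverable_2024-01-15"))
```

`dict.fromkeys` is used for deduplication because it preserves the order in which tokens first appear, which keeps the output easy to compare against the original name.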

Example Transformation

Input path:

/data/projects/ClientABC/ProjectPhoenix_FinalDeliverable_2024-01-15.docx

Resulting tokens:

Field	Tokens
`pathtokens.path`	`data`, `projects`, `client`, `abc`
`pathtokens.name`	`project`, `phoenix`, `final`, `deliverable`, `2024-01-15`
`pathtokens.mtime`	`2024-March`

Notice how:

ClientABC was split into client and abc
ProjectPhoenix was split into project and phoenix
The date 2024-01-15 was extracted and normalized
The file extension .docx was removed from tokens
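Common words, where present, are filtered out
The pathtokens.mtime value comes from the file's modification time, not from the date in its name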

Stop Words

The plugin filters out common English words that add noise to search results. Words like "the", "and", "is", "in", "of", "to" are removed from tokens.

Note: The word "it" is explicitly preserved (not filtered) to allow searching for files with "it" in their names.
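
A minimal sketch of this filtering, assuming the stop words come from NLTK's English stopwords corpus. That corpus normally has to be downloaded separately (for example with `nltk.download("stopwords")`); whether the plugin handles this itself is not stated here, so the sketch falls back to a short hard-coded set when NLTK or the corpus is unavailable. The function names are illustrative only.

```python
def load_stop_words():
    """Return English stop words, minus "it", which the plugin keeps."""
    try:
        from nltk.corpus import stopwords
        words = set(stopwords.words("english"))
    except (ImportError, LookupError):
        # Fallback for this sketch when NLTK or its stopwords corpus is absent.
        words = {"the", "and", "is", "in", "of", "to", "it"}
    words.discard("it")
    return words

def remove_stop_words(tokens):
    stop_words = load_stop_words()
    return [t for t in tokens if t not in stop_words]

print(remove_stop_words(["the", "it", "report", "of", "budget"]))
```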

Supported Date Formats

The plugin recognizes and extracts dates in 9 different formats:

Format	Example	Normalized Output
YYYY-MM-DD	`2024-01-15`	`2024-01-15`
MM-DD-YYYY	`01-15-2024`	`2024-01-15`
DD-MM-YYYY	`15-01-2024`	`2024-01-15`
YYYY.MM.DD	`2024.01.15`	`2024-01-15`
MM.DD.YYYY	`01.15.2024`	`2024-01-15`
DD.MM.YYYY	`15.01.2024`	`2024-01-15`
YYYY/MM/DD	`2024/01/15`	`2024-01-15`
MM/DD/YYYY	`01/15/2024`	`2024-01-15`
DD/MM/YYYY	`15/01/2024`	`2024-01-15`

All extracted dates are normalized to YYYY-MM-DD format for consistent searching.
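
The normalization in the table can be approximated with two regular expressions. This is a sketch under stated assumptions, not the plugin's implementation: a year-last date is read as MM-DD-YYYY first and as DD-MM-YYYY only if that is not a valid calendar date (the source does not say how ambiguous values such as 01-02-2024 are resolved), and both separators in a date must match.

```python
import re
from datetime import date

# Year-first and year-last patterns; \2 forces both separators to be the same.
YEAR_FIRST = re.compile(r"(?<!\d)(\d{4})([-./])(\d{2})\2(\d{2})(?!\d)")
YEAR_LAST = re.compile(r"(?<!\d)(\d{2})([-./])(\d{2})\2(\d{4})(?!\d)")

def _iso(year, month, day):
    """Return YYYY-MM-DD, or None when the values are not a real calendar date."""
    try:
        return date(year, month, day).isoformat()
    except ValueError:
        return None

def extract_dates(text):
    found = []
    for m in YEAR_FIRST.finditer(text):
        found.append(_iso(int(m.group(1)), int(m.group(3)), int(m.group(4))))
    for m in YEAR_LAST.finditer(text):
        first, second, year = int(m.group(1)), int(m.group(3)), int(m.group(4))
        # Assumption: read as MM-DD-YYYY first, fall back to DD-MM-YYYY.
        found.append(_iso(year, first, second) or _iso(year, second, first))
    return [d for d in found if d is not None]

print(extract_dates("Invoice_15.01.2024.pdf"))
```

Validating through `datetime.date` rejects impossible values such as month 13, which a pattern match alone would accept.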

Requirements

Python Dependencies

Package	Version	Purpose
nltk	3.9.1	Natural Language Toolkit for tokenization and stop words

System Requirements

Python 3.9 or higher
Diskover indexer with plugin support enabled

Installation

Step 1: Install Python Dependencies

Linux:

python3 -m pip install nltk==3.9.1

Windows:

python -m pip install nltk==3.9.1

Step 2: Configure the Plugin

Navigate to ⚙️ Settings > Plugins > Index Plugins > Path Tokens
The plugin uses default settings and requires no additional configuration
Save the configuration

Step 3: Enable in Index Task Configuration

Navigate to the Index Task Configurations page
Select the configuration you want to modify (or create a new one)
Scroll to the Index Plugins Enablement section at the bottom
Enable the Path Tokens plugin
Save the configuration

Step 4: Run an Index

The plugin will automatically process files and directories during the next scan using the configured Index Task. No additional commands or manual execution is required.

Configuration

The Path Tokens plugin uses minimal configuration and works effectively with default settings.

Configuration Parameters

Parameter	Type	Default	Description
`for_type`	method	`file`, `directory`	Processes both files and directories

Default Behavior

Files: Tokenizes both the parent directory path and the filename
Directories: Tokenizes both the parent path and the directory name
All file types: No extension filtering—processes all files regardless of type
Lightweight processing: No caching required due to fast in-memory text processing
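
To illustrate how a full path divides into the two inputs described above, the sketch below splits a POSIX-style path into its parent path and its name. `split_for_tokens` is a hypothetical helper; keeping dots in directory names (rather than treating them as extensions) is an assumption of this sketch.

```python
import posixpath

def split_for_tokens(full_path, is_directory=False):
    """Return (parent_path, name_without_extension) for a POSIX-style path."""
    parent_path, name = posixpath.split(full_path.rstrip("/"))
    if not is_directory:
        # Drop the extension so it does not end up as a token.
        name, _extension = posixpath.splitext(name)
    return parent_path, name

print(split_for_tokens("/legal/ActiveMatters/SmithVsJones/CASE-2024-00123_FinalBrief.pdf"))
```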

Scope

The plugin configuration scope is: Plugins.Index.Path Tokens.Default

Indexed Fields

The Path Tokens plugin adds a pathtokens object to indexed documents with the following structure:

Elasticsearch Field Mappings

Field Path	ES Type	Description
`pathtokens`	object	Root container for all path token data
`pathtokens.path`	keyword (array)	Tokenized directory path components
`pathtokens.name`	keyword (array)	Tokenized filename or directory name
`pathtokens.mtime`	keyword	File modification time in `YYYY-Month` format
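
The `YYYY-Month` value can be reproduced from a Unix timestamp as shown in this sketch. Two assumptions are made here: the timestamp is interpreted in UTC, and month names are hard-coded in English so the result does not depend on the system locale. The plugin's own time zone handling is not documented in this article.

```python
from datetime import datetime, timezone

MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")

def mtime_token(epoch_seconds):
    """Format a Unix timestamp as YYYY-Month, for example 2024-March."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return f"{moment.year}-{MONTHS[moment.month - 1]}"

print(mtime_token(1710000000))
```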

Example Document: File

{
  "name": "CASE-2024-00123_FinalBrief.pdf",
  "parent_path": "/legal/ActiveMatters/SmithVsJones",
  "extension": "pdf",
  "pathtokens": {
    "path": ["legal", "active", "matters", "smith", "vs", "jones"],
    "name": ["case", "2024-00123", "final", "brief"],
    "mtime": "2024-March"
  }
}

Example Document: Directory

{
  "name": "ProjectPhoenix_Phase2",
  "parent_path": "/projects/ClientABC/2024",
  "pathtokens": {
    "path": ["projects", "client", "abc", "2024"],
    "name": ["project", "phoenix", "phase2"],
    "mtime": "2024-February"
  }
}

Example Document: Media File

{
  "name": "Inception_S01E03_ColorGrade_Final.mov",
  "parent_path": "/productions/Netflix/Inception/Deliverables",
  "extension": "mov",
  "pathtokens": {
    "path": ["productions", "netflix", "inception", "deliverables"],
    "name": ["inception", "s01e03", "color", "grade", "final"],
    "mtime": "2024-January"
  }
}

Searching in Diskover

Use these search queries in the Diskover web interface to find files based on path tokens.

Basic Token Searches

Query	Description
`pathtokens.path:project`	Find files in paths containing "project"
`pathtokens.name:final`	Find files with "final" in their name
`pathtokens.mtime:2024-January`	Find files modified in January 2024

Project and Client Discovery

Query	Description
`pathtokens.path:phoenix`	Find all files in Project Phoenix directories
`pathtokens.path:client AND pathtokens.path:abc`	Find files in ClientABC folders
`pathtokens.name:deliverable`	Find deliverable files across all projects

Date-Based Searches

Query	Description
`pathtokens.name:2024-01-15`	Find files with this specific date in the name
`pathtokens.name:2024*`	Find files with any 2024 date in the name
`pathtokens.mtime:2023*`	Find files modified anytime in 2023

Combined Searches

Query	Description
`pathtokens.path:finance AND extension:pdf`	Find PDFs in finance directories
`pathtokens.name:report AND pathtokens.path:2024`	Find 2024 reports
`pathtokens.path:archive AND size:>=104857600`	Find large files in archive paths
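
For scripted use, the same query strings can in principle be wrapped in an Elasticsearch request body. The sketch below only builds that body and assumes the queries follow Lucene query string syntax, which the examples above are consistent with. The helper name and the chosen `_source` fields are illustrative, and the endpoint and index name are deliberately left out because they depend on the deployment.

```python
import json

def build_search_body(query, size=10):
    """Wrap a Diskover-style query in an Elasticsearch query_string request body."""
    return {
        "size": size,
        "query": {"query_string": {"query": query}},
        # Return only the fields relevant to path tokens.
        "_source": ["name", "parent_path", "pathtokens"],
    }

body = build_search_body("pathtokens.path:finance AND extension:pdf")
print(json.dumps(body, indent=2))
```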

Troubleshooting

Common Issues

Issue	Cause	Solution
`pathtokens` field not appearing	Plugin not enabled in Index Task	Enable the plugin in your Index Task Configuration and re-run the scan
Empty or missing tokens	All tokens were stop words	This is expected—common words like "the", "and", "is" are filtered out
Date not extracted	Date format not recognized	Ensure dates use one of the 9 supported formats (see Supported Date Formats)
File extension appearing in tokens	Plugin version issue	Update to the latest plugin version
Tokens appear lowercase	Expected behavior	All tokens are normalized to lowercase for consistent searching

Plugin Not Loading

Symptom: Errors during indexing or pathtokens field not appearing.

Diagnosis:

Linux:

# Verify plugin directory exists
ls -la /opt/diskover/plugins/pathtokens/

# Check Python syntax
python3 -m py_compile /opt/diskover/plugins/pathtokens/__init__.py

# Test import
cd /opt/diskover
python3 -c "from plugins.pathtokens import *; print('Plugin loaded successfully')"

Windows:

# Verify plugin directory exists
dir "C:\Program Files\Diskover\plugins\pathtokens\"

# Check Python syntax
python -m py_compile "C:\Program Files\Diskover\plugins\pathtokens\__init__.py"

# Test import
cd "C:\Program Files\Diskover"
python -c "from plugins.pathtokens import *; print('Plugin loaded successfully')"

NLTK Not Installed

Symptom: ModuleNotFoundError: No module named 'nltk'

Solution:

Linux:

python3 -m pip install nltk==3.9.1

Windows:

python -m pip install nltk==3.9.1
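
After installing, the version can be confirmed from the same Python interpreter that runs the indexer. This is a small sketch using the standard library's `importlib.metadata`; `check_nltk` is an illustrative helper, and the expected version is taken from the Requirements section.

```python
from importlib import metadata

EXPECTED_NLTK_VERSION = "3.9.1"

def check_nltk(expected=EXPECTED_NLTK_VERSION):
    """Return a short status string describing the installed NLTK version."""
    try:
        installed = metadata.version("nltk")
    except metadata.PackageNotFoundError:
        return "nltk is not installed"
    if installed != expected:
        return f"nltk {installed} is installed, expected {expected}"
    return f"nltk {installed} is installed"

print(check_nltk())
```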

Verifying Indexed Data

To confirm the plugin is working, search for any file with path tokens:

pathtokens:*

If results appear, the plugin is successfully indexing token data.

Debug Logging

To use the indexer logs for troubleshooting:

Check your Diskover indexer log configuration
Look for entries containing pathtokens_plugin in the logs

Log locations:

Linux: /var/log/diskover/diskover.log or as configured
Windows: Check Diskover service logs or configured log location

Support

Last Updated: January 2026
Diskover Data, Inc.
