PDF Info

License: PRO+ (Professional Edition or higher)
Plugin Type: Index Plugin
Author: Diskover Data, Inc.

Overview

The PDF Info plugin extracts metadata from PDF (Portable Document Format) files during the Diskover indexing process. When enabled, this plugin automatically reads document properties such as title, author, subject, creation date, and keywords embedded within PDF files, making this information searchable and reportable in the Diskover web interface.

PDF files often contain valuable metadata that describes a document's origin, purpose, and history. This information is typically invisible when browsing file systems but becomes fully searchable once indexed by this plugin.

Why Use This Plugin?

Find documents by content properties rather than just filenames
Track document origins including authors, creation dates, and source applications
Support compliance and governance with audit trails of document metadata
Identify document collections created by specific applications or from templates

Use Cases by Role

Legal and Compliance Teams

Locate contracts and legal documents by author or creation date
Find all documents created within specific date ranges for discovery requests

Marketing and Creative Teams

Organize branded document assets by subject and keywords
Find collateral created by specific team members

Finance and Accounting

Audit document collections by creation and modification dates
Find reports authored by specific analysts

IT Administrators

Analyze PDF document collections across storage
Generate reports on document metadata for compliance audits

Understanding PDF Metadata

PDF files store metadata in a document information dictionary that contains properties about the file itself. This metadata is separate from the document content and includes:

Metadata Type	Description	Example
Title	Document title (different from filename)	"Q4 Financial Report"
Author	Person who created the document	"Jane Smith"
Subject	Brief description of content	"Quarterly financials"
Creator	Application that created original content	"Microsoft Word"
Producer	Application that generated the PDF	"Adobe PDF Library"
Creation Date	When document was originally created	2024-06-15
Modification Date	When document was last modified	2024-06-20
Keywords	Searchable tags set by author	"confidential, finance"

Some PDFs also contain additional properties like Company (organization name) and Template URL (reference to source template), typically set by Microsoft Office applications.

Note: Not all PDFs contain metadata. Scanned documents, PDFs from certain generators, or files where metadata was intentionally stripped may have empty or minimal metadata fields.
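
Inside the file, the PDF format stores Creation Date and Modification Date as strings such as `D:20240615143000+02'00'` rather than as ISO timestamps. The sketch below shows one way to convert such a string into a timezone-aware value. The function name `parse_pdf_date` is illustrative and is not part of the plugin or of pypdf, and treating a missing offset as UTC is an assumption.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# D:YYYYMMDDHHmmSS followed by an optional Z or +HH'mm' / -HH'mm' offset.
# Every component after the year is optional in the PDF date format.
_PDF_DATE = re.compile(
    r"^(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:(Z)|([+-])(\d{2})'?(\d{2})?'?)?$"
)


def parse_pdf_date(raw: str) -> Optional[datetime]:
    """Return a timezone-aware datetime, or None if the string is malformed."""
    match = _PDF_DATE.match(raw.strip())
    if match is None:
        return None
    year, month, day, hour, minute, second, _zulu, sign, off_h, off_m = match.groups()
    offset = timedelta(0)  # assumption: a missing offset is treated as UTC
    if sign:
        offset = timedelta(hours=int(off_h), minutes=int(off_m or 0))
        if sign == "-":
            offset = -offset
    try:
        return datetime(
            int(year), int(month or 1), int(day or 1),
            int(hour or 0), int(minute or 0), int(second or 0),
            tzinfo=timezone(offset),
        )
    except ValueError:  # out-of-range component, e.g. month 13
        return None


print(parse_pdf_date("D:20240615143000+02'00'").isoformat())
# prints 2024-06-15T14:30:00+02:00
```

Malformed date strings are common in the wild, which is why the sketch returns `None` instead of raising.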

Requirements

Software Requirements

Requirement	Version	Description
Diskover	2.3+	Core Diskover indexing application
Python	3.9+	Python runtime environment
pypdf	5.1.0	Python library for PDF metadata extraction

Installation

Step 1: Install Python Dependencies

Install the required pypdf library:

Linux:

python3 -m pip install pypdf==5.1.0

Windows:

python -m pip install pypdf==5.1.0

Verify the installation:

Linux:

python3 -m pip show pypdf

Windows:

python -m pip show pypdf

You should see output showing pypdf version 5.1.0.
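
The same check can be run from Python, which is convenient when several interpreters are installed and you want to confirm which one Diskover's environment uses. This is a minimal sketch using only the standard library; the helper name `installed_version` is illustrative.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional

REQUIRED_PYPDF = "5.1.0"  # version pinned in the Software Requirements table


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version("pypdf")
if found is None:
    print("pypdf is not installed for this interpreter")
elif found != REQUIRED_PYPDF:
    print(f"pypdf {found} is installed, but this article pins {REQUIRED_PYPDF}")
else:
    print(f"pypdf {found} is installed")
```

Run it with the same interpreter that runs Diskover (`python3` on Linux, `python` on Windows).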

Step 2: Configure the Plugin

Navigate to Diskover Admin > Plugins > Index Plugins > PDF Info
Enable the plugin and configure parameters as needed (see Configuration section)
Save the configuration

Step 3: Enable in Index Task Configuration

Navigate to Diskover > Configurations > [Your Configuration Name]
Scroll to the bottom to find Index Plugins Enablement
Enable the PDF Info plugin
Save the configuration

The plugin will now run automatically during scans using this configuration.

Configuration

The PDF Info plugin is configured through the Diskover Admin interface.

Configuration Parameters

Parameter	Type	Default	Description
`enable_cache`	boolean	`true`	Enable SQLite caching for extracted PDF metadata. Improves performance on subsequent scans.
`cache_dir`	string	`/opt/diskover/__pdfinfo_plugin_cache__/`	Directory path for the SQLite cache database.
`cache_expire_time`	integer	`0`	Cache expiration time in seconds. Set to `0` for no expiration (recommended).
`index_field_name`	string	`pdf_info`	Elasticsearch field name for PDF metadata. This is a fixed value.

Configuration Example 1: Default Configuration (Recommended)

The default configuration enables caching with no expiration time:

Linux:

enable_cache: true
cache_dir: /opt/diskover/__pdfinfo_plugin_cache__/
cache_expire_time: 0
index_field_name: pdf_info

Windows:

enable_cache: true
cache_dir: C:\Program Files\Diskover\__pdfinfo_plugin_cache__\
cache_expire_time: 0
index_field_name: pdf_info

This configuration:

Enables caching to improve performance on subsequent scans
Uses modification time validation to ensure cached data remains accurate
Never expires cached entries unless the source file changes
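
To make the interaction between modification-time validation and `cache_expire_time` concrete, here is a minimal sketch of a cache that behaves as described above. It is not the plugin's implementation: the class name, table name, and column layout are all hypothetical, and the plugin's real SQLite schema may differ.

```python
import json
import sqlite3
import time
from typing import Optional


class MetadataCache:
    """Hypothetical cache: one row per file path, validated by mtime."""

    def __init__(self, db_path: str, expire_seconds: int = 0) -> None:
        self.expire_seconds = expire_seconds  # 0 means entries never expire
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pdf_cache ("
            "path TEXT PRIMARY KEY, mtime REAL, stored_at REAL, metadata TEXT)"
        )

    def put(self, path: str, mtime: float, metadata: dict) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO pdf_cache VALUES (?, ?, ?, ?)",
            (path, mtime, time.time(), json.dumps(metadata)),
        )
        self.conn.commit()

    def get(self, path: str, mtime: float) -> Optional[dict]:
        row = self.conn.execute(
            "SELECT mtime, stored_at, metadata FROM pdf_cache WHERE path = ?",
            (path,),
        ).fetchone()
        if row is None:
            return None
        cached_mtime, stored_at, metadata = row
        if cached_mtime != mtime:
            return None  # the file changed after it was cached
        if self.expire_seconds and time.time() - stored_at > self.expire_seconds:
            return None  # the entry is older than the expiration time
        return json.loads(metadata)


cache = MetadataCache(":memory:", expire_seconds=0)
cache.put("/docs/report.pdf", 1718000000.0, {"author": "Jane Smith"})
print(cache.get("/docs/report.pdf", 1718000000.0))  # same mtime: cache hit
print(cache.get("/docs/report.pdf", 1718999999.0))  # changed mtime: None
```

With `expire_seconds=0`, an entry is discarded only when the file's modification time no longer matches, which is the behavior the default configuration describes.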

Configuration Example 2: Custom Cache Location

For environments with specific storage requirements:

Linux:

enable_cache: true
cache_dir: /data/diskover_cache/pdfinfo/
cache_expire_time: 604800
index_field_name: pdf_info

Windows:

enable_cache: true
cache_dir: D:\DiskoverCache\pdfinfo\
cache_expire_time: 604800
index_field_name: pdf_info

This configuration:

Stores cache in a custom directory on a dedicated volume
Sets cache entries to expire after 7 days (604800 seconds)
Useful for environments where cache storage must be managed

Important: Ensure the Diskover service user has read/write permissions to the configured cache directory.
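
One way to confirm the permissions before a scan is to try creating a file in the cache directory. The sketch below does that with the standard library only; the helper name `cache_dir_is_writable` is illustrative, and the result is meaningful only when the script runs as the Diskover service user.

```python
import os
import tempfile


def cache_dir_is_writable(cache_dir: str) -> bool:
    """Return True if the current user can create and delete a file in cache_dir."""
    if not os.path.isdir(cache_dir):
        return False
    try:
        # Creating a real file is a more direct test than inspecting mode bits.
        with tempfile.NamedTemporaryFile(dir=cache_dir):
            pass
    except OSError:
        return False
    return True


# Substitute the cache_dir value from your configuration.
print(cache_dir_is_writable("/opt/diskover/__pdfinfo_plugin_cache__/"))
```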

Indexed Fields

The PDF Info plugin extracts metadata and stores it in the pdf_info field within each indexed document.

Field Mappings

Field Path	ES Type	Description
`pdf_info.title`	text	Document title as set in PDF properties
`pdf_info.author`	text	Name of the document author
`pdf_info.subject`	text	Document subject description
`pdf_info.creator`	text	Application that created the original content (e.g., "Microsoft Word")
`pdf_info.producer`	text	Application that produced the PDF file (e.g., "Adobe PDF Library")
`pdf_info.creation_date`	date	Date and time the document was created
`pdf_info.modification_date`	date	Date and time the document was last modified
`pdf_info.keywords`	text	Keywords or tags associated with the document
`pdf_info.company`	text	Company or organization (typically from Microsoft Office)
`pdf_info.template_url`	text	URL or path of the template used to create the document
`pdf_info.error`	text	Error message when metadata extraction fails

Example Document

{
  "name": "Q4_Financial_Report.pdf",
  "extension": "pdf",
  "size": 2458624,
  "pdf_info": {
    "title": "Q4 2024 Financial Report",
    "author": "Jane Smith",
    "subject": "Quarterly Financial Summary",
    "creator": "Microsoft Word",
    "producer": "Microsoft: Print To PDF",
    "creation_date": "2024-10-15T14:30:00Z",
    "modification_date": "2024-10-18T09:15:00Z",
    "keywords": "finance, quarterly, confidential",
    "company": "Acme Corporation"
  }
}
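
If you process indexed documents in your own scripts, the date fields can be parsed with the standard library. The sketch below assumes a document has already been retrieved as a Python dictionary shaped like the example above; the helper names are illustrative.

```python
from datetime import datetime

doc = {
    "name": "Q4_Financial_Report.pdf",
    "pdf_info": {
        "author": "Jane Smith",
        "creation_date": "2024-10-15T14:30:00Z",
        "modification_date": "2024-10-18T09:15:00Z",
    },
}


def parse_iso(value: str) -> datetime:
    # datetime.fromisoformat accepts a trailing "Z" only from Python 3.11 on,
    # so normalise it to an explicit UTC offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def edit_window_hours(document: dict) -> float:
    """Hours between creation and last modification; 0.0 if either is missing."""
    info = document.get("pdf_info", {})
    created = info.get("creation_date")
    modified = info.get("modification_date")
    if not created or not modified:
        return 0.0
    return (parse_iso(modified) - parse_iso(created)).total_seconds() / 3600


print(edit_window_hours(doc))  # prints 66.75
```

Reading fields with `.get()` matters because any `pdf_info` property may be absent when the PDF did not carry it.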

Searching in Diskover

Use the following search queries in the Diskover web interface to find PDF documents based on their metadata.

Basic Field Searches

Query	Description
`pdf_info.author: "Jane Smith"`	Find PDFs authored by Jane Smith
`pdf_info.title: "Annual Report"`	Find PDFs with "Annual Report" in the title
`pdf_info.subject: "financial"`	Find PDFs with "financial" in the subject
`pdf_info.creator: "Microsoft Word"`	Find PDFs created in Microsoft Word
`pdf_info.producer: "Adobe"`	Find PDFs produced by Adobe applications
`pdf_info.company: "Acme Corporation"`	Find PDFs from Acme Corporation
`pdf_info.keywords: "confidential"`	Find PDFs tagged as confidential

Date-Based Searches

Query	Description
`pdf_info.creation_date: [2024-01-01 TO *]`	PDFs created on or after January 1, 2024
`pdf_info.creation_date: [* TO 2019-12-31]`	PDFs created before 2020
`pdf_info.modification_date: [2024-01-01 TO 2024-06-30]`	PDFs modified in first half of 2024
`pdf_info.creation_date: [2024-10-01 TO 2024-12-31]`	PDFs created in Q4 2024
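
Square brackets make both ends of a range inclusive, and `*` leaves an end open. When range queries are generated by a script, for example for recurring reports, a small helper keeps the syntax consistent. This is a sketch with illustrative function names; it only builds the query string and does not contact Diskover.

```python
from datetime import date
from typing import Optional

# Last month and day of each calendar quarter.
_QUARTER_END = {1: (3, 31), 2: (6, 30), 3: (9, 30), 4: (12, 31)}


def date_range_query(field: str, start: Optional[date] = None,
                     end: Optional[date] = None) -> str:
    """Build an inclusive range clause; None becomes the open-ended '*'."""
    lower = start.isoformat() if start else "*"
    upper = end.isoformat() if end else "*"
    return f"{field}: [{lower} TO {upper}]"


def quarter_query(field: str, year: int, quarter: int) -> str:
    """Range clause covering one calendar quarter (1 to 4)."""
    end_month, end_day = _QUARTER_END[quarter]
    return date_range_query(
        field, date(year, end_month - 2, 1), date(year, end_month, end_day)
    )


print(quarter_query("pdf_info.creation_date", 2024, 4))
# prints pdf_info.creation_date: [2024-10-01 TO 2024-12-31]
```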

Wildcard Searches

Query	Description
`pdf_info.author: *`	Find all PDFs with author metadata
`pdf_info.author: John*`	Find PDFs by authors whose name starts with "John"
`pdf_info.title: *report*`	Find PDFs with "report" anywhere in the title
`pdf_info.creator: Microsoft*`	Find PDFs created by any Microsoft application

Finding PDFs with Errors or Missing Metadata

Query	Description
`pdf_info.error: *`	Find PDFs that had extraction errors
`extension: pdf AND NOT pdf_info.author: *`	Find PDFs without author metadata
`extension: pdf AND NOT pdf_info.error: *`	Find PDFs with successful metadata extraction

Combined Searches

Query	Description
`pdf_info.author: "Jane Smith" AND pdf_info.creation_date: [2024-01-01 TO 2024-12-31]`	Jane Smith's PDFs from 2024
`pdf_info.creator: Microsoft* AND pdf_info.company: "Acme Corporation"`	Microsoft Office PDFs from Acme Corporation
`pdf_info.keywords: "confidential" AND pdf_info.creation_date: [2023-01-01 TO *]`	Confidential PDFs created since 2023
`size: [10485760 TO *] AND pdf_info.creator: "Adobe InDesign"`	Large PDFs (10MB+) from InDesign
`parent_path: *legal* AND pdf_info.author: *`	PDFs with author info in directories whose path contains "legal"
`pdf_info.subject: * AND pdf_info.keywords: *`	PDFs with both subject and keywords defined
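
Combined queries are clauses joined with `AND`. The sketch below assembles one of the queries from the table programmatically, including the conversion from mebibytes to the byte count that the `size` field expects. The helper names are illustrative, and the code only produces a query string to paste into the search bar.

```python
def phrase(field: str, value: str) -> str:
    """Exact-phrase clause; escapes embedded backslashes and double quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}: "{escaped}"'


def size_at_least(mib: int) -> str:
    """Clause matching files of at least `mib` mebibytes (1 MiB = 1048576 bytes)."""
    return f"size: [{mib * 1024 * 1024} TO *]"


def all_of(*clauses: str) -> str:
    """Join clauses so that every one of them must match."""
    return " AND ".join(clauses)


print(all_of(size_at_least(10), phrase("pdf_info.creator", "Adobe InDesign")))
# prints size: [10485760 TO *] AND pdf_info.creator: "Adobe InDesign"
```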

Troubleshooting

Common Issues and Solutions

Issue	Cause	Solution
Plugin not processing PDF files	Plugin not enabled in Index Task Configuration	Navigate to Diskover > Configurations > [Config Name] and enable PDF Info in Index Plugins Enablement
`pdf_info` field missing from indexed PDFs	pypdf library not installed	Install with `python3 -m pip install pypdf==5.1.0` (Linux) or `python -m pip install pypdf==5.1.0` (Windows)
`pdf_info.error` present in indexed data	PDF is corrupted, encrypted, or password-protected	The plugin cannot read encrypted PDFs; verify the file can be opened normally
Empty metadata fields	PDF file contains no metadata	This is normal; many PDFs (especially scanned documents) have no embedded metadata
Cache directory permission errors	Diskover service user lacks write access	Create directory and set proper ownership (see Cache Management below)
License error in logs	Invalid or missing PRO+ license	Verify your Diskover license includes PRO+ features

Cache Management

If you need to clear the cache to force re-extraction of metadata:

Linux:

rm -rf /opt/diskover/__pdfinfo_plugin_cache__/

Windows:

rmdir /s /q "C:\Program Files\Diskover\__pdfinfo_plugin_cache__\"

To create the cache directory with proper permissions:

Linux:

mkdir -p /opt/diskover/__pdfinfo_plugin_cache__
chown diskover:diskover /opt/diskover/__pdfinfo_plugin_cache__
chmod 755 /opt/diskover/__pdfinfo_plugin_cache__

Windows:
Ensure the Diskover service account has read/write permissions to the cache directory.

Debug Logging

To troubleshoot the plugin, search the Diskover logs for entries that mention it:

Linux:

grep -i "pdfinfo" /var/log/diskover/diskover.log | tail -50

Windows:
Check the Diskover service logs or configured log location for entries containing "pdfinfo".
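
On systems without `grep` and `tail`, the same filter can be written in a few lines of Python. This is a minimal sketch; the function name `tail_matching` is illustrative, and the log path in the usage comment is a placeholder for your configured log location.

```python
from collections import deque
from typing import Iterable, List


def tail_matching(lines: Iterable[str], needle: str = "pdfinfo",
                  limit: int = 50) -> List[str]:
    """Return the last `limit` lines containing `needle`, ignoring case."""
    needle = needle.lower()
    matches = deque(maxlen=limit)  # older matches fall off the front
    for line in lines:
        if needle in line.lower():
            matches.append(line.rstrip("\r\n"))
    return list(matches)


# Usage (placeholder path; substitute your configured log location):
# with open(r"C:\path\to\diskover.log", encoding="utf-8", errors="replace") as log:
#     for entry in tail_matching(log):
#         print(entry)
```

The function streams the file line by line, so it remains usable on large logs.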

Testing PDF Metadata Extraction

To verify a PDF file contains metadata and can be read by pypdf:

Linux:

python3 -c "from pypdf import PdfReader; r = PdfReader('/path/to/test.pdf'); print(r.metadata)"

Windows:

python -c "from pypdf import PdfReader; reader = PdfReader(r'C:\path\to\test.pdf'); print(reader.metadata)"
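
For a more readable result than the raw dictionary, the one-liner can be expanded into a short script that reports each property and captures failures. This is a sketch, not the plugin's extraction code: the function name `read_pdf_metadata` is illustrative, and the `error` key only loosely mirrors the `pdf_info.error` field.

```python
from typing import Any, Dict


def read_pdf_metadata(path: str) -> Dict[str, Any]:
    """Return the document properties, or {'error': ...} if they cannot be read."""
    try:
        # Imported lazily so that a missing library is reported like any other failure.
        from pypdf import PdfReader

        meta = PdfReader(path).metadata  # None when the PDF has no metadata
        if meta is None:
            return {}
        fields = {
            "title": meta.title,
            "author": meta.author,
            "subject": meta.subject,
            "creator": meta.creator,
            "producer": meta.producer,
            "creation_date": meta.creation_date,
            "modification_date": meta.modification_date,
        }
    except Exception as exc:  # missing library or file, corrupt or encrypted PDF
        return {"error": f"{type(exc).__name__}: {exc}"}
    # Drop properties the PDF does not define.
    return {name: value for name, value in fields.items() if value is not None}


# Usage (placeholder path):
# for name, value in read_pdf_metadata("/path/to/test.pdf").items():
#     print(f"{name}: {value}")
```

An empty result means the file was readable but carries no document properties, which is the "Empty metadata fields" case in the table above.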

Support

Diskover Documentation: https://docs.diskoverdata.com
Diskover Support: https://support.diskoverdata.com
pypdf Documentation: https://pypdf.readthedocs.io

Last Updated: January 2026
Diskover Data, Inc.
