PDF Info
License: PRO+ (Professional Edition or higher)
Plugin Type: Index Plugin
Author: Diskover Data, Inc.
Overview
The PDF Info plugin extracts metadata from PDF (Portable Document Format) files during the Diskover indexing process. When enabled, this plugin automatically reads document properties such as title, author, subject, creation date, and keywords embedded within PDF files, making this information searchable and reportable in the Diskover web interface.
PDF files often contain valuable metadata that describes a document's origin, purpose, and history. This information is typically invisible when browsing file systems but becomes fully searchable once indexed by this plugin.
Why Use This Plugin?
Find documents by content properties rather than just filenames
Track document origins including authors, creation dates, and source applications
Support compliance and governance with audit trails of document metadata
Identify document collections created by specific applications or from templates
Use Cases by Role
Legal and Compliance Teams
Locate contracts and legal documents by author or creation date
Find all documents created within specific date ranges for discovery requests
Marketing and Creative Teams
Organize branded document assets by subject and keywords
Find collateral created by specific team members
Finance and Accounting
Audit document collections by creation and modification dates
Find reports authored by specific analysts
IT Administrators
Analyze PDF document collections across storage
Generate reports on document metadata for compliance audits
Understanding PDF Metadata
PDF files store metadata in a document information dictionary that contains properties about the file itself. This metadata is separate from the document content and includes:
| Metadata Type | Description | Example |
|---|---|---|
| Title | Document title (different from filename) | "Q4 Financial Report" |
| Author | Person who created the document | "Jane Smith" |
| Subject | Brief description of content | "Quarterly financials" |
| Creator | Application that created original content | "Microsoft Word" |
| Producer | Application that generated the PDF | "Adobe PDF Library" |
| Creation Date | When document was originally created | 2024-06-15 |
| Modification Date | When document was last modified | 2024-06-20 |
| Keywords | Searchable tags set by author | "confidential, finance" |
Some PDFs also contain additional properties like Company (organization name) and Template URL (reference to source template), typically set by Microsoft Office applications.
Note: Not all PDFs contain metadata. Scanned documents, PDFs from certain generators, or files where metadata was intentionally stripped may have empty or minimal metadata fields.
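As a rough illustration of the information dictionary described above, the sketch below maps the standard raw keys (such as /Title and /CreationDate) to friendlier names and converts PDF date strings to ISO 8601. It works on any dict-like object, for example the value of pypdf's PdfReader(path).metadata (which can be None when a file has no information dictionary). The snake_case output names and both helper functions are assumptions made for this example, not necessarily what the plugin does internally.

```python
import re
from datetime import datetime, timedelta, timezone

# Standard information-dictionary keys mapped to illustrative snake_case names.
INFO_KEYS = {
    "/Title": "title",
    "/Author": "author",
    "/Subject": "subject",
    "/Creator": "creator",
    "/Producer": "producer",
    "/CreationDate": "creation_date",
    "/ModDate": "modification_date",
    "/Keywords": "keywords",
}

# PDF dates look like D:YYYYMMDDHHmmSS followed by Z or an offset such as +02'00'.
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:(Z)|([+-])(\d{2})'?(\d{2})?'?)?$"
)


def parse_pdf_date(value):
    """Return a datetime for a PDF date string, or None if it does not parse."""
    m = _PDF_DATE.match(value.strip())
    if not m:
        return None
    defaults = (0, 1, 1, 0, 0, 0)  # missing month/day default to 1, time to 0
    parts = [int(g) if g else d for g, d in zip(m.groups()[:6], defaults)]
    tz = None
    if m.group(7):
        tz = timezone.utc
    elif m.group(8):
        offset = timedelta(hours=int(m.group(9)), minutes=int(m.group(10) or 0))
        tz = timezone(offset if m.group(8) == "+" else -offset)
    try:
        return datetime(*parts, tzinfo=tz)
    except ValueError:
        return None  # e.g. month 13 or day 32


def normalize_info(info):
    """Map a raw information dictionary to friendly names, skipping empty values."""
    result = {}
    for raw_key, name in INFO_KEYS.items():
        value = info.get(raw_key)
        if value is None or str(value).strip() == "":
            continue
        text = str(value)
        if name.endswith("_date"):
            parsed = parse_pdf_date(text)
            text = parsed.isoformat() if parsed else text
        result[name] = text
    return result
```

Empty values are skipped, so a PDF with no metadata yields an empty dictionary rather than a set of blank fields.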
Requirements
Software Requirements
| Requirement | Version | Description |
|---|---|---|
| Diskover | 2.3+ | Core Diskover indexing application |
| Python | 3.9+ | Python runtime environment |
| pypdf | 5.1.0 | Python library for PDF metadata extraction |
Installation
Step 1: Install Python Dependencies
Install the required pypdf library:
Linux:
python3 -m pip install pypdf==5.1.0
Windows:
python -m pip install pypdf==5.1.0
Verify the installation:
Linux:
python3 -m pip show pypdf
Windows:
python -m pip show pypdf
You should see output showing pypdf version 5.1.0.
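The same check can be scripted, which may be handy when verifying several indexing hosts. This is a minimal sketch using only the standard library; the function names are made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def meets_pin(package, pinned):
    """True only when the package is installed at exactly the pinned version."""
    return installed_version(package) == pinned


if __name__ == "__main__":
    found = installed_version("pypdf")
    print("pypdf:", found or "not installed")
    print("matches 5.1.0:", meets_pin("pypdf", "5.1.0"))
```

Run it with the same interpreter that Diskover uses, since a different Python installation may have a different set of packages.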
Step 2: Configure the Plugin
Navigate to Diskover Admin > Plugins > Index Plugins > PDF Info
Enable the plugin and configure parameters as needed (see Configuration section)
Save the configuration
Step 3: Enable in Index Task Configuration
Navigate to Diskover > Configurations > [Your Configuration Name]
Scroll to the bottom to find Index Plugins Enablement
Enable the PDF Info plugin
Save the configuration
The plugin will now run automatically during scans using this configuration.
Configuration
The PDF Info plugin is configured through the Diskover Admin interface.
Configuration Parameters
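| Parameter | Type | Default | Description |
|---|---|---|---|
| `enable_cache` | boolean | `true` | Enable SQLite caching for extracted PDF metadata. Improves performance on subsequent scans. |
| `cache_dir` | string | `/opt/diskover/__pdfinfo_plugin_cache__/` (Linux) or `C:\Program Files\Diskover\__pdfinfo_plugin_cache__\` (Windows) | Directory path for the SQLite cache database. |
| `cache_expire_time` | integer | `0` | Cache expiration time in seconds. Set to `0` so that entries never expire. |
| `index_field_name` | string | `pdf_info` | Elasticsearch field name for PDF metadata. This is a fixed value. |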
Configuration Example 1: Default Configuration (Recommended)
The default configuration enables caching with no expiration time:
Linux:
enable_cache: true
cache_dir: /opt/diskover/__pdfinfo_plugin_cache__/
cache_expire_time: 0
index_field_name: pdf_info
Windows:
enable_cache: true
cache_dir: C:\Program Files\Diskover\__pdfinfo_plugin_cache__\
cache_expire_time: 0
index_field_name: pdf_info
This configuration:
Enables caching to improve performance on subsequent scans
Uses modification time validation to ensure cached data remains accurate
Never expires cached entries unless the source file changes
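The interaction between modification-time validation and cache_expire_time can be sketched as follows. This is an illustrative model only: the MetadataCache class, its table layout, and its method names are assumptions for this example and are not taken from the plugin's source.

```python
import json
import sqlite3
import time


class MetadataCache:
    """Tiny SQLite cache keyed by file path (hypothetical, for illustration)."""

    def __init__(self, db_path=":memory:", expire_time=0):
        self.expire_time = expire_time  # seconds; 0 means entries never expire
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cache ("
            "path TEXT PRIMARY KEY, mtime REAL, cached_at REAL, data TEXT)"
        )

    def get(self, path, mtime, now=None):
        """Return cached metadata, or None if missing, stale, or expired."""
        now = time.time() if now is None else now
        row = self.db.execute(
            "SELECT mtime, cached_at, data FROM cache WHERE path = ?", (path,)
        ).fetchone()
        if row is None:
            return None
        cached_mtime, cached_at, data = row
        if cached_mtime != mtime:
            return None  # the file changed after it was cached
        if self.expire_time and now - cached_at >= self.expire_time:
            return None  # the entry outlived the configured lifetime
        return json.loads(data)

    def put(self, path, mtime, metadata, now=None):
        """Store metadata together with the file's mtime and the current time."""
        now = time.time() if now is None else now
        self.db.execute(
            "INSERT OR REPLACE INTO cache VALUES (?, ?, ?, ?)",
            (path, mtime, now, json.dumps(metadata)),
        )
        self.db.commit()
```

With expire_time set to 0, an entry is discarded only when the file's modification time no longer matches the cached one; a positive value adds an age limit on top of that check.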
Configuration Example 2: Custom Cache Location
For environments with specific storage requirements:
Linux:
enable_cache: true
cache_dir: /data/diskover_cache/pdfinfo/
cache_expire_time: 604800
index_field_name: pdf_info
Windows:
enable_cache: true
cache_dir: D:\DiskoverCache\pdfinfo\
cache_expire_time: 604800
index_field_name: pdf_info
This configuration:
Stores cache in a custom directory on a dedicated volume
Sets cache entries to expire after 7 days (604800 seconds)
Useful for environments where cache storage must be managed
Important: Ensure the Diskover service user has read/write permissions to the configured cache directory.
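One way to confirm the permissions is to try creating a temporary file in the cache directory while running as the Diskover service user. The helper below is a small sketch with a hypothetical function name; it uses only the standard library and works on both Linux and Windows.

```python
import os
import tempfile


def cache_dir_is_writable(cache_dir):
    """Return True if a file can be created and removed inside cache_dir."""
    if not os.path.isdir(cache_dir):
        return False
    try:
        # The temporary file is deleted automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=cache_dir):
            pass
    except OSError:
        return False
    return True


if __name__ == "__main__":
    path = "/opt/diskover/__pdfinfo_plugin_cache__/"
    print(path, "writable:", cache_dir_is_writable(path))
```

The result reflects the account that runs the script, so a check made as an administrator says nothing about the service account.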
Indexed Fields
The PDF Info plugin extracts metadata and stores it in the pdf_info field within each indexed document.
Field Mappings
| Field Path | ES Type | Description |
|---|---|---|
| `pdf_info.title` | text | Document title as set in PDF properties |
| `pdf_info.author` | text | Name of the document author |
| `pdf_info.subject` | text | Document subject description |
| `pdf_info.creator` | text | Application that created the original content (e.g., "Microsoft Word") |
| `pdf_info.producer` | text | Application that produced the PDF file (e.g., "Adobe PDF Library") |
| `pdf_info.creation_date` | date | Date and time the document was created |
| `pdf_info.modification_date` | date | Date and time the document was last modified |
| `pdf_info.keywords` | text | Keywords or tags associated with the document |
| `pdf_info.company` | text | Company or organization (typically from Microsoft Office) |
| | text | URL or path of the template used to create the document |
| | text | Error message when metadata extraction fails |
Example Document
{
  "name": "Q4_Financial_Report.pdf",
  "extension": "pdf",
  "size": 2458624,
  "pdf_info": {
    "title": "Q4 2024 Financial Report",
    "author": "Jane Smith",
    "subject": "Quarterly Financial Summary",
    "creator": "Microsoft Word",
    "producer": "Microsoft: Print To PDF",
    "creation_date": "2024-10-15T14:30:00Z",
    "modification_date": "2024-10-18T09:15:00Z",
    "keywords": "finance, quarterly, confidential",
    "company": "Acme Corporation"
  }
}
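Nested keys in a document like this are usually addressed with dotted paths such as pdf_info.author. The short sketch below flattens a trimmed copy of the example document to show how those paths are formed; the flatten helper is written for this illustration and is not part of Diskover.

```python
def flatten(doc, prefix=""):
    """Flatten nested dictionaries into {dotted.path: value} pairs."""
    paths = {}
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            paths.update(flatten(value, path))
        else:
            paths[path] = value
    return paths


# Trimmed copy of the example document above.
doc = {
    "name": "Q4_Financial_Report.pdf",
    "size": 2458624,
    "pdf_info": {
        "title": "Q4 2024 Financial Report",
        "author": "Jane Smith",
        "creation_date": "2024-10-15T14:30:00Z",
    },
}

for path, value in flatten(doc).items():
    print(f"{path} = {value}")
```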
Searching in Diskover
Use the following search queries in the Diskover web interface to find PDF documents based on their metadata.
Basic Field Searches
Query | Description |
|---|---|
| Find PDFs authored by Jane Smith |
| Find PDFs with "Annual Report" in the title |
| Find PDFs with "financial" in the subject |
| Find PDFs created in Microsoft Word |
| Find PDFs produced by Adobe applications |
| Find PDFs from Acme Corporation |
| Find PDFs tagged as confidential |
Date-Based Searches
Query | Description |
|---|---|
| PDFs created after January 1, 2024 |
| PDFs created before 2020 |
| PDFs modified in first half of 2024 |
| PDFs created in Q4 2024 |
Wildcard Searches
Query | Description |
|---|---|
| Find all PDFs with author metadata |
| Find PDFs by authors whose name starts with "John" |
| Find PDFs with "report" anywhere in the title |
| Find PDFs created by any Microsoft application |
Finding PDFs with Errors or Missing Metadata
Query | Description |
|---|---|
| Find PDFs that had extraction errors |
| Find PDFs without author metadata |
| Find PDFs with successful metadata extraction |
Combined Searches
Query | Description |
|---|---|
| Jane Smith's PDFs from 2024 |
| Microsoft Office PDFs from Acme Corp |
| Confidential PDFs created since 2023 |
| Large PDFs (10MB+) from InDesign |
| PDFs with author info in legal directories |
| PDFs with both subject and keywords defined |
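For scripted searches, query strings can be assembled from field names and values. The sketch below assumes that the search bar accepts Elasticsearch query-string syntax and that fields are named as in the example document (pdf_info.author, pdf_info.creation_date, and so on). The helper functions are hypothetical, and the strings they produce are illustrations rather than the exact queries from the tables above.

```python
def term(field, value):
    """Exact-phrase match on a field, with quotes and backslashes escaped."""
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{escaped}"'


def date_range(field, start="*", end="*"):
    """Inclusive range; "*" leaves that side of the range open."""
    return f"{field}:[{start} TO {end}]"


def exists(field):
    """Match documents in which the field has any value."""
    return f"_exists_:{field}"


def negate(clause):
    """Match documents that do not satisfy the clause."""
    return f"NOT {clause}"


def all_of(*clauses):
    """Require every clause to match."""
    return " AND ".join(clauses)


if __name__ == "__main__":
    print(all_of(
        term("pdf_info.author", "Jane Smith"),
        date_range("pdf_info.creation_date", "2024-01-01", "2024-12-31"),
    ))
    print(negate(exists("pdf_info.author")))
```

Building queries from small helpers keeps quoting and range syntax in one place, which reduces typos when the same filters are reused across reports.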
Troubleshooting
Common Issues and Solutions
| Issue | Cause | Solution |
|---|---|---|
| Plugin not processing PDF files | Plugin not enabled in Index Task Configuration | Navigate to Diskover > Configurations > [Config Name] and enable PDF Info in Index Plugins Enablement |
| | pypdf library not installed | Install with `python3 -m pip install pypdf==5.1.0` (on Windows, use `python` instead of `python3`) |
| | PDF is corrupted, encrypted, or password-protected | The plugin cannot read encrypted PDFs; verify the file can be opened normally |
| Empty metadata fields | PDF file contains no metadata | This is normal; many PDFs (especially scanned documents) have no embedded metadata |
| Cache directory permission errors | Diskover service user lacks write access | Create directory and set proper ownership (see Cache Management below) |
| License error in logs | Invalid or missing PRO+ license | Verify your Diskover license includes PRO+ features |
Cache Management
If you need to clear the cache to force re-extraction of metadata:
Linux:
rm -rf /opt/diskover/__pdfinfo_plugin_cache__/
Windows:
rmdir /s /q "C:\Program Files\Diskover\__pdfinfo_plugin_cache__\"
To create the cache directory with proper permissions:
Linux:
mkdir -p /opt/diskover/__pdfinfo_plugin_cache__
chown diskover:diskover /opt/diskover/__pdfinfo_plugin_cache__
chmod 755 /opt/diskover/__pdfinfo_plugin_cache__
Windows:
Ensure the Diskover service account has read/write permissions to the cache directory.
Debug Logging
To troubleshoot the plugin, search the Diskover logs for entries it has written:
Linux:
grep -i "pdfinfo" /var/log/diskover/diskover.log | tail -50
Windows:
Check the Diskover service logs or configured log location for entries containing "pdfinfo".
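Where grep is not available, such as on Windows, a few lines of Python give the same result: the last 50 log lines that mention "pdfinfo", ignoring case. This is a minimal sketch; the function name is made up, and the log path must be replaced with the location configured in your environment.

```python
from collections import deque


def tail_matching(log_path, needle="pdfinfo", limit=50):
    """Return the last `limit` lines containing `needle`, ignoring case."""
    needle = needle.lower()
    matches = deque(maxlen=limit)  # older matches fall off automatically
    with open(log_path, "r", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if needle in line.lower():
                matches.append(line.rstrip("\n"))
    return list(matches)


if __name__ == "__main__":
    # Path taken from the Linux example above; adjust as needed.
    for entry in tail_matching("/var/log/diskover/diskover.log"):
        print(entry)
```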
Testing PDF Metadata Extraction
To verify a PDF file contains metadata and can be read by pypdf:
Linux:
python3 -c "from pypdf import PdfReader; r = PdfReader('/path/to/test.pdf'); print(r.metadata)"
Windows:
python -c "from pypdf import PdfReader; r = PdfReader(r'C:\path\to\test.pdf'); print(r.metadata)"
Support
Diskover Documentation: https://docs.diskoverdata.com
Diskover Support: https://support.diskoverdata.com
pypdf Documentation: https://pypdf.readthedocs.io
Last Updated: January 2026
Diskover Data, Inc.