License: MLT (MultiStream Edition)
Module Type: Alternate Ingester
Author: Diskover Data, Inc.
Overview
The Parquet Ingester exports Diskover scan metadata directly to Apache Parquet files instead of Elasticsearch. Apache Parquet is a columnar storage format widely used for analytical workloads; it offers strong compression and fast queries on large datasets.
This ingester is particularly valuable for organizations building unified data platforms. Rather than keeping file system metadata siloed in Elasticsearch, you can push it directly into your data lake or data warehouse alongside business, operational, and other enterprise data.
Why Use the Parquet Ingester?
Data Lake and Data Warehouse Integration: The primary use case for this ingester is feeding unstructured data metadata into enterprise data platforms. Diskover has a technical partnership with Snowflake, and this integration enables you to push file system metadata directly into the Snowflake ecosystem for unified analytics across structured and unstructured data.
Compliance and Audit: Store records of file system state at specific points in time. Each set of Parquet files is a point-in-time snapshot that can support regulatory retention requirements, provided the files are kept on storage that prevents later modification.
Cloud Analytics: Load Parquet files into serverless analytics services like AWS Athena, Google BigQuery, or Azure Synapse without provisioning infrastructure.
Important: Data stored via the Parquet ingester is not accessible through the Diskover Web UI. You'll query this data using analytics tools appropriate for your data platform.
Understanding Apache Parquet
Before diving into configuration, it helps to understand what makes Parquet well-suited for this use case.
Columnar vs Row Storage
Traditional file formats like CSV and JSON store data row by row. When you query such files, the system must read entire rows even if you only need a few columns. Parquet uses columnar storage, where values from each column are stored together. This means analytical queries that aggregate or filter on specific columns can skip irrelevant data entirely.
For Diskover metadata, this is particularly efficient. A query like "sum all file sizes grouped by extension" only reads the size and extension columns, ignoring owner, mtime, paths, and other fields.
Storage Type | Best For | Diskover Example |
|---|---|---|
Row-based (CSV, JSON) | Transactional workloads, full-record access | Exporting individual file details |
Columnar (Parquet) | Analytical queries, aggregations | Storage reports by extension, owner analysis |
Compression
Parquet achieves excellent compression because similar values are stored together. A column containing file extensions will have many repeated values like "pdf", "docx", and "jpg"—these compress extremely well. Parquet supports multiple compression algorithms:
Algorithm | Compression Ratio | Speed | Use Case |
|---|---|---|---|
Snappy (default) | Good | Very fast | General purpose, balanced performance |
GZIP | Excellent | Slower | Long-term archival, storage optimization |
Zstd | Excellent | Fast | Modern alternative balancing ratio and speed |
The Parquet ingester uses PyArrow's default compression (Snappy), which provides a good balance for most use cases.
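To see why a column of repeated values compresses well, the following sketch compresses a made-up extension column with Python's standard `zlib` module. `zlib` (DEFLATE, the algorithm behind GZIP) is only a stand-in here; Parquet applies its own encodings and codecs, so the exact numbers will differ.

```python
import random
import zlib

random.seed(0)  # fixed seed so the run is repeatable

# A made-up "extension" column: many rows, only three distinct values.
extensions = [random.choice(["pdf", "docx", "jpg"]) for _ in range(10_000)]
column_bytes = "\n".join(extensions).encode("utf-8")

# Compress the whole column in one go, as a columnar format would.
compressed = zlib.compress(column_bytes)

ratio = len(compressed) / len(column_bytes)
print(f"raw: {len(column_bytes)} bytes, compressed: {len(compressed)} bytes")
print(f"compressed size is {ratio:.0%} of the original")
```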
Parquet vs Elasticsearch Comparison
When deciding between Parquet output and Elasticsearch indexing, consider these trade-offs:
Aspect | Elasticsearch | Parquet Ingester |
|---|---|---|
Query Method | REST API / Diskover Web UI | SQL engines, Pandas, Spark, cloud analytics |
Real-time Search | Yes, millisecond latency | No, batch processing (seconds to minutes) |
Infrastructure | Elasticsearch cluster required | Local disk or cloud storage |
Cost Model | Cluster licensing + hardware | Storage costs only |
Diskover Web UI | Full support | Not supported |
Analytics Integration | Limited | Native integration with data platforms |
Best For | Interactive exploration, file actions | Historical analysis, data lake integration |
Decision Guide
Use Elasticsearch when:
You need the Diskover Web UI for interactive browsing and file actions
Real-time search and filtering are required
Users need immediate access to scan results
You're leveraging Diskover workflows and automations
Use the Parquet Ingester when:
Building a data lake or data warehouse that includes file metadata
Integrating with Snowflake or other analytics platforms
Performing historical analysis across multiple scans
Archiving scan data for compliance or long-term retention
Minimizing infrastructure costs for stored scan data
Hybrid Approach: Many organizations use both approaches—Elasticsearch for active, real-time operations and Parquet for archival and analytics. You can run scans with different ingesters for different storage targets or export Elasticsearch data to Parquet periodically.
Requirements
System Requirements
Component | Requirement |
|---|---|
Python | 3.9 or higher |
Disk Space | Sufficient for Parquet output files |
Permissions | Write access to output directory |
Python Dependencies
Package | Version | Purpose |
|---|---|---|
pandas | 2.2.3 | DataFrame operations and data manipulation |
pyarrow | 18.0.0 | Apache Arrow and Parquet file writing |
Installation
1. Install Python Dependencies
Linux:
cd /opt/diskover
python3 -m pip install -r ingesters/parquet/requirements.txt
Windows:
cd "C:\Program Files\Diskover"
python -m pip install -r ingesters\parquet\requirements.txt
2. Verify Installation
Linux:
python3 -c "import pandas; import pyarrow; print(f'pandas: {pandas.__version__}, pyarrow: {pyarrow.__version__}')"
Windows:
python -c "import pandas; import pyarrow; print(f'pandas: {pandas.__version__}, pyarrow: {pyarrow.__version__}')"
Expected output:
pandas: 2.2.3, pyarrow: 18.0.0
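If you prefer a check that reports a missing package instead of failing on import, a small script using the standard library's `importlib.metadata` is one option. This is a sketch; the helper name `check_packages` is hypothetical, and the expected versions are copied from the Python Dependencies table.

```python
from importlib import metadata

# Versions listed in the Python Dependencies table.
EXPECTED = {"pandas": "2.2.3", "pyarrow": "18.0.0"}


def check_packages(expected):
    """Return {package: status}; status is 'ok', 'missing', or a mismatch note."""
    report = {}
    for package, wanted in expected.items():
        try:
            found = metadata.version(package)
        except metadata.PackageNotFoundError:
            report[package] = "missing"
            continue
        report[package] = "ok" if found == wanted else f"found {found}, expected {wanted}"
    return report


for package, status in check_packages(EXPECTED).items():
    print(f"{package}: {status}")
```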
3. Create Output Directory
Linux:
mkdir -p /data/parquet_output
chmod 755 /data/parquet_output
Windows:
mkdir "D:\DiskoverData\parquet_output"
4. Configure the Ingester
Navigate to Diskover Admin > Diskover > Alternate Ingesters > Parquet and configure the parameters as needed (see Configuration section below).
Here is the beginning of our sample configuration, in which data is output to a /datalake-pipeline endpoint. The Parquet ingester has many other configuration options, covered in detail below.
Configuration
Configuration Parameters
Parameter | Type | Default | Required | Description |
|---|---|---|---|---|
parquet_filedir | string | | No | Output directory for Parquet files. Can be overridden with the PARQUETDIR environment variable. |
parquet_filename | string | | No | Filename prefix for generated Parquet files. Files are named {parquet_filename}-{counter}.parquet. |
parquet_filesize_split | integer | | No | Size threshold in bytes for splitting output into multiple files. When the in-memory buffer reaches this size, a new file is created. |
exclude_fields | list | | No | List of metadata fields to exclude from the Parquet output. |
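The effect of parquet_filesize_split can be pictured with a short sketch: rows accumulate in a buffer, and once the estimated buffer size reaches the threshold, the buffer is flushed as one output file. This is an illustration of the idea only, not the ingester's actual implementation; the function name and the fixed 100-byte row estimate are assumptions.

```python
def split_into_batches(rows, estimate_row_bytes, split_bytes):
    """Group rows into batches; a batch closes once its estimated size reaches split_bytes."""
    batches, current, current_bytes = [], [], 0
    for row in rows:
        current.append(row)
        current_bytes += estimate_row_bytes(row)
        if current_bytes >= split_bytes:
            batches.append(current)  # this batch would become one Parquet file
            current, current_bytes = [], 0
    if current:
        batches.append(current)  # remaining rows form the final, smaller file
    return batches


rows = [{"name": f"file{i}.txt", "size": i} for i in range(10)]
# Pretend every row costs 100 bytes and split at 400 bytes.
batches = split_into_batches(rows, lambda row: 100, split_bytes=400)
print([len(batch) for batch in batches])  # [4, 4, 2]
```

A larger threshold produces fewer, bigger files but needs more memory per buffer; a smaller one does the opposite.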
Configuration Examples
Standard Configuration (Local Storage):
Parameter | Value |
|---|---|
parquet_filedir |
|
parquet_filename |
|
parquet_filesize_split |
|
exclude_fields |
|
Data Warehouse Integration (Minimal Fields):
When integrating with a data warehouse, you may want to exclude additional fields to reduce storage and focus on key metrics:
Parameter | Value |
|---|---|
parquet_filedir |
|
parquet_filename |
|
parquet_filesize_split |
|
exclude_fields |
|
Environment Variable Override
The output directory can be overridden at runtime using the PARQUETDIR environment variable. This is useful for directing each scan to a unique location:
Linux:
export PARQUETDIR=/data/parquet_output/scan_$(date +%Y%m%d_%H%M%S)
python3 /opt/diskover/diskover.py --altingester parquet /path/to/scan
Windows:
$env:PARQUETDIR = "D:\DiskoverData\parquet_output\scan_$(Get-Date -Format 'yyyyMMdd_HHmmss')"
python "C:\Program Files\Diskover\diskover.py" --altingester parquet D:\path\to\scan
Note: When using PARQUETDIR, the ingester will create the directory automatically if it doesn't exist. However, if the specified directory already exists, the ingester will exit with an error to prevent accidental overwrites.
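Because an existing directory causes the ingester to exit, a wrapper script can build the timestamped path and check it before starting a long scan. The sketch below assumes nothing about Diskover internals; `scan_output_dir` and `ensure_unused` are hypothetical helper names, and a temporary directory stands in for the real base directory.

```python
import os
import tempfile
from datetime import datetime


def scan_output_dir(base_dir, now=None):
    """Build a timestamped path such as <base_dir>/scan_20260401_123000."""
    now = now or datetime.now()
    return os.path.join(base_dir, "scan_" + now.strftime("%Y%m%d_%H%M%S"))


def ensure_unused(path):
    """Raise early if the path exists, since the ingester would exit anyway."""
    if os.path.exists(path):
        raise FileExistsError(f"{path} already exists; pick a different PARQUETDIR")
    return path


with tempfile.TemporaryDirectory() as base:
    target = ensure_unused(scan_output_dir(base))
    # Pass the path through the environment; the directory is not created here
    # because the ingester creates it.
    scan_env = dict(os.environ, PARQUETDIR=target)
    print(scan_env["PARQUETDIR"])
```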
Usage
Basic Usage
Run a Diskover scan with the Parquet ingester:
Linux:
cd /opt/diskover
python3 diskover.py --altingester parquet /path/to/scan
Windows:
cd "C:\Program Files\Diskover"
python diskover.py --altingester parquet D:\path\to\scan
With Custom Output Directory
Linux:
PARQUETDIR=/data/archive/scan_$(date +%Y%m%d_%H%M%S) python3 /opt/diskover/diskover.py --altingester parquet /mnt/storage
Windows:
$env:PARQUETDIR = "D:\archive\scan_$(Get-Date -Format 'yyyyMMdd_HHmmss')"
python "C:\Program Files\Diskover\diskover.py" --altingester parquet D:\storage
With Verbose Logging
Linux:
python3 /opt/diskover/diskover.py --altingester parquet --loglevel INFO /path/to/scan
Windows:
python "C:\Program Files\Diskover\diskover.py" --altingester parquet --loglevel INFO D:\path\to\scan
Multiple Scans to Separate Directories
Scanning different storage targets to separate output locations:
Linux:
# Scan production storage
PARQUETDIR=/data/parquet/production_$(date +%Y%m%d) \
  python3 /opt/diskover/diskover.py --altingester parquet /mnt/production

# Scan archive storage
PARQUETDIR=/data/parquet/archive_$(date +%Y%m%d) \
  python3 /opt/diskover/diskover.py --altingester parquet /mnt/archive
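When the list of storage targets grows, the same pattern can be driven from Python. The sketch below only builds the command and environment for each scan and prints them; `build_scan_job` is a hypothetical helper, and the paths are the example paths used in this article.

```python
import os
from datetime import date

DISKOVER = "/opt/diskover/diskover.py"  # install path used in this article


def build_scan_job(label, scan_path, base_dir="/data/parquet", today=None):
    """Return (command, env) for one scan; nothing is executed here."""
    today = today or date.today()
    env = dict(os.environ)
    # One dated output directory per storage target.
    env["PARQUETDIR"] = f"{base_dir}/{label}_{today:%Y%m%d}"
    command = ["python3", DISKOVER, "--altingester", "parquet", scan_path]
    return command, env


jobs = [
    build_scan_job("production", "/mnt/production"),
    build_scan_job("archive", "/mnt/archive"),
]
for command, env in jobs:
    print(env["PARQUETDIR"], "<-", " ".join(command))
    # To run for real: subprocess.run(command, env=env, check=True)
```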
Example CLI Output from the Parquet Ingester:
~# python3 /opt/diskover/diskover.py --altingester parquet /var/log/
Using alternate ingester <module 'ingesters.parquet.parquet' from '/opt/diskover/ingesters/parquet/parquet.py'>
ingesters.parquet.parquet_writer - Creating directory /tmp/parquet_files/
ingesters.parquet.parquet_writer - Using temporary directory: /tmp/parquet_export_6ooh7pf1
configuration Default
No plugins loaded
....
Enqueuing dir tree /var/log
ingesters.parquet.parquet_writer - INFO - Initial bytes_per_row estimate: 16.0
[ThreadPoolExecutor-1_0] finished crawling /var/log (16 dirs, 40 files, 31.97 MB) in 0d:0h:00m:00s
....
ingesters.parquet.parquet_writer - INFO - Writing data to parquet file /tmp/parquet_export_6ooh7pf1/diskover_scan_data_001.parquet...
ingesters.parquet.parquet_writer - INFO - File 001: 0.01 MB, 49 rows, 317.8 bytes/row, avg: 136.7
ingesters.parquet.parquet_writer - INFO - Wrote 49 rows to 1 parquet file(s)
Now let’s look at how we can integrate this into a Scan Task!
Integration with Index Tasks
The Parquet ingester can be configured as part of a scheduled Index Task:
Field | Value |
|---|---|
Alternate Ingester |
|
Important: The --altingester flag completely replaces Elasticsearch output. Data will NOT be searchable in the Diskover Web UI.
Example of integrating the Parquet Ingester directly into a Diskover Scan Task:
Output Format
File Naming Convention
Parquet files are named using the pattern:
{parquet_filename}-{counter}.parquet
Example output directory structure:
/data/parquet_output/
├── diskover_scan_data-0.parquet
├── diskover_scan_data-1.parquet
└── diskover_scan_data-2.parquet
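When processing the parts in order, note that a plain alphabetical sort places `-10` before `-2`. The sketch below sorts by the numeric counter instead. It assumes the `{parquet_filename}-{counter}.parquet` pattern shown above; `sorted_parts` is a hypothetical helper, and the pattern would need adjusting if your files use a different separator.

```python
import re
from pathlib import Path

# Matches '{prefix}-{counter}.parquet' as documented in the naming convention.
PART_RE = re.compile(r"^(?P<prefix>.+)-(?P<counter>\d+)\.parquet$")


def sorted_parts(filenames):
    """Sort '{prefix}-{counter}.parquet' names by prefix, then numeric counter."""
    parsed = []
    for name in filenames:
        match = PART_RE.match(Path(name).name)
        if match:  # silently skip files that do not follow the pattern
            parsed.append((match["prefix"], int(match["counter"]), name))
    return [name for _, _, name in sorted(parsed)]


names = [
    "diskover_scan_data-10.parquet",
    "diskover_scan_data-2.parquet",
    "diskover_scan_data-0.parquet",
    "notes.txt",
]
print(sorted_parts(names))
```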
Schema / Field Mappings
The Parquet schema mirrors Diskover's metadata structure with added temporal columns for partitioning:
Field | Type | Description |
|---|---|---|
| string | File or directory name |
| string | Full path to the item |
| string | Parent directory path |
| string | File extension (files only) |
| int64 | Size in bytes |
| int64 | Disk usage (allocated size) in bytes |
| datetime | Last access time |
| datetime | Last modification time |
| datetime | Creation/change time |
| string | Item type: |
| string | File owner |
| string | File group |
| string | Scan year (temporal partition) |
| string | Scan month (temporal partition) |
| string | Scan day (temporal partition) |
| string | Scan hour (temporal partition) |
Additional fields from Diskover plugins (tags, custom metadata, etc.) are included unless specified in exclude_fields.
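Conceptually, exclude_fields acts as a filter on each metadata record before it is written. The sketch below shows that idea on a plain dictionary; it is not the ingester's code, and the record keys and the `drop_excluded` name are illustrative assumptions.

```python
def drop_excluded(record, exclude_fields):
    """Return a copy of record without the excluded keys."""
    excluded = set(exclude_fields)
    return {key: value for key, value in record.items() if key not in excluded}


# Made-up record with one plugin-style field ("tags").
record = {"name": "report.pdf", "size": 1024, "owner": "alice", "tags": ["finance"]}
print(drop_excluded(record, ["tags", "owner"]))
# {'name': 'report.pdf', 'size': 1024}
```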
Working with Output Data
Since Parquet data isn't accessible through the Diskover Web UI, you'll query it using analytics tools appropriate for your environment. Parquet's wide adoption means you have many options depending on your infrastructure and use case.
Snowflake
As a Diskover technical partner, Snowflake is an ideal destination for Parquet scan data. You can stage Parquet files from local storage or cloud object storage (S3, Azure Blob, GCS) and load them directly into Snowflake tables. Once loaded, file system metadata becomes part of your unified data platform, enabling cross-analysis with business data, cost allocation reporting, and compliance dashboards.
Cloud Analytics Services
AWS Athena allows you to query Parquet files directly in S3 without loading them into a database. This serverless approach is cost-effective for ad-hoc analysis and scheduled reporting. Define an external table pointing to your Parquet files and run standard SQL queries. See the AWS Athena Documentation for configuration details.
Google BigQuery can load Parquet files from Cloud Storage for serverless analytics at scale. BigQuery's columnar engine is well-matched to Parquet's format, enabling fast aggregation queries across large scan datasets. See the BigQuery Parquet Documentation for loading procedures.
Azure Synapse Analytics supports Parquet natively, allowing you to query files in Azure Data Lake Storage or load them into dedicated SQL pools for enterprise analytics.
Local Analysis Tools
Python with Pandas provides the simplest approach for local analysis. Pandas can read Parquet files directly and offers powerful data manipulation capabilities for filtering, grouping, and aggregating scan metadata. This is ideal for one-off analysis, prototyping queries, or generating reports.
DuckDB is an embedded analytical database that can query Parquet files directly without loading them entirely into memory. It's particularly useful for analyzing large scan outputs on a laptop or workstation, offering SQL syntax with excellent performance.
Apache Spark handles large-scale distributed processing when scan data exceeds what a single machine can handle. Spark's native Parquet support makes it straightforward to run analytics across petabytes of historical scan data in a cluster environment.
Common Analysis Patterns
Use Case | Approach |
|---|---|
Find largest files | Filter by type, sort by size descending |
Storage by extension | Group files by extension, sum size |
Old/stale files | Filter by mtime or atime older than threshold |
Storage by owner | Group by owner, sum size |
Directory sizing | Aggregate file sizes by parent path |
Growth over time | Compare scans using temporal partition columns |
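Several of the patterns in the table can be sketched with pandas. The DataFrame below is an invented stand-in for data loaded with `pd.read_parquet(...)`; the column names and the `"file"` / `"directory"` type values are assumptions, so check them against your own output before reusing the queries.

```python
import pandas as pd

# Stand-in for pd.read_parquet("<output directory>"); the rows are made up.
df = pd.DataFrame({
    "name": ["a.pdf", "b.pdf", "c.jpg", "docs"],
    "extension": ["pdf", "pdf", "jpg", ""],
    "size": [100, 250, 40, 0],
    "owner": ["alice", "bob", "alice", "alice"],
    "type": ["file", "file", "file", "directory"],
    "mtime": pd.to_datetime(["2020-01-01", "2025-06-01", "2019-03-15", "2025-01-01"]),
})

files = df[df["type"] == "file"]

# Find largest files: sort by size descending.
largest = files.sort_values("size", ascending=False).head(2)
# Storage by extension and by owner: group, then sum size.
by_extension = files.groupby("extension")["size"].sum()
by_owner = files.groupby("owner")["size"].sum()
# Old/stale files: mtime older than a chosen threshold.
stale = files[files["mtime"] < pd.Timestamp("2021-01-01")]

print(largest["name"].tolist())
print(by_extension.to_dict())
print(by_owner.to_dict())
print(stale["name"].tolist())
```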
Troubleshooting
Issue | Cause | Solution |
|---|---|---|
| Using | Include a timestamp in the directory path, or remove the existing directory if no longer needed |
| Insufficient write permissions | Ensure the Diskover user has write access to the output directory with |
| Missing Python dependencies | Run |
| Missing Python dependencies | Run |
Empty output directory | Scan path contains no files or exclusion rules filtered everything | Verify scan path contains files and check Diskover logs for errors |
| File split size too large for available RAM | Reduce |
Type conversion warnings | Mixed data types in columns | Informational only—the ingester automatically converts to strings |
Debug Logging
Enable verbose logging to troubleshoot issues:
Linux:
python3 /opt/diskover/diskover.py --altingester parquet --loglevel DEBUG /path/to/scan
Windows:
python "C:\Program Files\Diskover\diskover.py" --altingester parquet --loglevel DEBUG D:\path\to\scan
Verifying Parquet File Integrity
If you suspect file corruption, verify with PyArrow:
import pyarrow.parquet as pq

try:
    # Reading the whole table fails if the file is truncated or corrupt.
    table = pq.read_table('/data/parquet_output/diskover_scan_data-0.parquet')
    print(f"Rows: {table.num_rows}, Columns: {table.num_columns}")
except Exception as e:
    print(f"Error reading file: {e}")
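For a quick first pass over many files, you can check the Parquet magic bytes before doing a full read: a Parquet file begins and ends with the 4-byte marker `PAR1`, so a file that was cut off mid-write usually lacks the trailing marker. This sketch uses only the standard library; `has_parquet_magic` is a hypothetical helper, and passing this check does not prove the file is fully readable.

```python
import os
import tempfile

PARQUET_MAGIC = b"PAR1"  # 4-byte marker at both ends of a Parquet file


def has_parquet_magic(path):
    """Return True if the file starts and ends with the Parquet magic bytes."""
    # Smallest plausible file: magic + 4-byte footer length + magic.
    if os.path.getsize(path) < 12:
        return False
    with open(path, "rb") as handle:
        head = handle.read(4)
        handle.seek(-4, os.SEEK_END)
        tail = handle.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


with tempfile.TemporaryDirectory() as tmp:
    truncated = os.path.join(tmp, "truncated.parquet")
    with open(truncated, "wb") as handle:
        handle.write(b"PAR1" + b"\x00" * 32)  # header present, footer missing
    print(has_parquet_magic(truncated))  # False
```

Files that pass can then be read with PyArrow for a complete check.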
Support
Last Updated: April 2026
Diskover Data, Inc.