Parquet – Diskover Data

License: MLT (MultiStream Edition)
Module Type: Alternate Ingester
Author: Diskover Data, Inc.

Overview

The Parquet Ingester exports Diskover scan metadata directly to Apache Parquet files instead of Elasticsearch. Apache Parquet is a columnar storage format that's become the standard for analytical workloads, offering excellent compression and query performance for large datasets.

This ingester is particularly valuable for organizations building unified data platforms. Rather than keeping file system metadata siloed in Elasticsearch, you can push it directly into your data lake or data warehouse alongside business, operational, and other enterprise data.

Why Use the Parquet Ingester?

Data Lake and Data Warehouse Integration: The primary use case for this ingester is feeding unstructured data metadata into enterprise data platforms. Diskover has a technical partnership with Snowflake, and this integration enables you to push file system metadata directly into the Snowflake ecosystem for unified analytics across structured and unstructured data.

Compliance and Audit: Store records of file system state at specific points in time. Each set of Parquet files serves as a point-in-time snapshot that can support regulatory retention requirements when kept on write-protected storage.

Cloud Analytics: Load Parquet files into serverless analytics services like AWS Athena, Google BigQuery, or Azure Synapse without provisioning infrastructure.

Important: Data stored via the Parquet ingester is not accessible through the Diskover Web UI. You'll query this data using analytics tools appropriate for your data platform.

Understanding Apache Parquet

Before diving into configuration, it helps to understand what makes Parquet well-suited for this use case.

Columnar vs Row Storage

Traditional file formats like CSV and JSON store data row by row. When you query such files, the system must read entire rows even if you only need a few columns. Parquet uses columnar storage, where values from each column are stored together. This means analytical queries that aggregate or filter on specific columns can skip irrelevant data entirely.

For Diskover metadata, this is particularly efficient. A query like "sum all file sizes grouped by extension" only reads the size and extension columns, ignoring owner, mtime, paths, and other fields.

Storage Type	Best For	Diskover Example
Row-based (CSV, JSON)	Transactional workloads, full-record access	Exporting individual file details
Columnar (Parquet)	Analytical queries, aggregations	Storage reports by extension, owner analysis

Compression

Parquet achieves excellent compression because similar values are stored together. A column containing file extensions will have many repeated values like "pdf", "docx", and "jpg"—these compress extremely well. Parquet supports multiple compression algorithms:

Algorithm	Compression Ratio	Speed	Use Case
Snappy (default)	Good	Very fast	General purpose, balanced performance
GZIP	Excellent	Slower	Long-term archival, storage optimization
Zstd	Excellent	Fast	Modern alternative balancing ratio and speed

The Parquet ingester uses PyArrow's default compression (Snappy), which provides a good balance for most use cases.

Parquet vs Elasticsearch Comparison

When deciding between Parquet output and Elasticsearch indexing, consider these trade-offs:

Aspect	Elasticsearch	Parquet Ingester
Query Method	REST API / Diskover Web UI	SQL engines, Pandas, Spark, cloud analytics
Real-time Search	Yes, millisecond latency	No, batch processing (seconds to minutes)
Infrastructure	Elasticsearch cluster required	Local disk or cloud storage
Cost Model	Cluster licensing + hardware	Storage costs only
Diskover Web UI	Full support	Not supported
Analytics Integration	Limited	Native integration with data platforms
Best For	Interactive exploration, file actions	Historical analysis, data lake integration

Decision Guide

Use Elasticsearch when:

You need the Diskover Web UI for interactive browsing and file actions
Real-time search and filtering are required
Users need immediate access to scan results
You're leveraging Diskover workflows and automations

Use the Parquet Ingester when:

Building a data lake or data warehouse that includes file metadata
Integrating with Snowflake or other analytics platforms
Performing historical analysis across multiple scans
Archiving scan data for compliance or long-term retention
Minimizing infrastructure costs for stored scan data

Hybrid Approach: Many organizations use both approaches—Elasticsearch for active, real-time operations and Parquet for archival and analytics. You can run scans with different ingesters for different storage targets or export Elasticsearch data to Parquet periodically.

Requirements

System Requirements

Component	Requirement
Python	3.9 or higher
Disk Space	Sufficient for Parquet output files
Permissions	Write access to output directory

Python Dependencies

Package	Version	Purpose
pandas	2.2.3	DataFrame operations and data manipulation
pyarrow	18.0.0	Apache Arrow and Parquet file writing

Installation

1. Install Python Dependencies

Linux:

cd /opt/diskover
python3 -m pip install -r ingesters/parquet/requirements.txt

Windows:

cd "C:\Program Files\Diskover"
python -m pip install -r ingesters\parquet\requirements.txt

2. Verify Installation

Linux:

python3 -c "import pandas; import pyarrow; print(f'pandas: {pandas.__version__}, pyarrow: {pyarrow.__version__}')"

Windows:

python -c "import pandas; import pyarrow; print(f'pandas: {pandas.__version__}, pyarrow: {pyarrow.__version__}')"

Expected output:

pandas: 2.2.3, pyarrow: 18.0.0
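For automated checks, for example in a provisioning script, the same verification can be done without importing the packages. This is a minimal sketch using the standard library; the `EXPECTED` mapping simply restates the versions from the Python Dependencies table, and `check_dependencies` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions restated from the Python Dependencies table.
EXPECTED = {"pandas": "2.2.3", "pyarrow": "18.0.0"}


def check_dependencies(expected=EXPECTED):
    """Return {package: (installed_version_or_None, matches_expected)}."""
    report = {}
    for package, wanted in expected.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            # Package is not installed in this interpreter's environment.
            installed = None
        report[package] = (installed, installed == wanted)
    return report


if __name__ == "__main__":
    for package, (installed, ok) in check_dependencies().items():
        status = "OK" if ok else "CHECK"
        print(f"{package}: installed={installed}, "
              f"expected={EXPECTED[package]} [{status}]")
```

A `CHECK` status only means the installed version differs from the table; whether other versions work is not something this page states.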

3. Create Output Directory

Linux:

mkdir -p /data/parquet_output
chmod 755 /data/parquet_output

Windows:

mkdir "D:\DiskoverData\parquet_output"

4. Configure the Ingester

Navigate to Diskover Admin > Diskover > Alternate Ingesters > Parquet and configure the parameters as needed (see Configuration section below).

In our sample configuration, data is written to a /datalake-pipeline endpoint. The Parquet ingester has many other configuration options, all covered in detail below.

Configuration

Configuration Parameters

Parameter	Type	Default	Required	Description
`parquet_filedir`	string	`/tmp/parquet_files/`	No	Output directory for Parquet files. Can be overridden with the `PARQUETDIR` environment variable.
`parquet_filename`	string	`diskover_scan_data`	No	Filename prefix for generated Parquet files. Files are named `{prefix}-{counter}.parquet`.
`parquet_filesize_split`	integer	`1073741824` (1 GB)	No	Size threshold in bytes for splitting output into multiple files. When the in-memory buffer reaches this size, a new file is created.
`exclude_fields`	list	`['task', 'ino', 'nlink', 'costpergb']`	No	List of metadata fields to exclude from the Parquet output.
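Because `parquet_filesize_split` is expressed in bytes, a small helper can make the arithmetic less error-prone. The sketch below assumes binary units (1 GB = 1024^3 bytes), which matches the default of `1073741824` shown in the table; `split_size_bytes` is a hypothetical name, not part of Diskover.

```python
# Binary multipliers, consistent with the documented 1 GB default.
UNITS = {"MB": 1024 ** 2, "GB": 1024 ** 3}


def split_size_bytes(amount, unit="MB"):
    """Convert a human-friendly size into the byte value the setting expects."""
    if unit not in UNITS:
        raise ValueError(f"unsupported unit: {unit!r}")
    return int(amount * UNITS[unit])


if __name__ == "__main__":
    for amount, unit in [(256, "MB"), (512, "MB"), (1, "GB")]:
        print(f"{amount} {unit} -> {split_size_bytes(amount, unit)}")
```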

Configuration Examples

Standard Configuration (Local Storage):

Parameter	Value
parquet_filedir	`/data/parquet_output/`
parquet_filename	`diskover_scan_data`
parquet_filesize_split	`1073741824`
exclude_fields	`['task', 'ino', 'nlink', 'costpergb']`

Data Warehouse Integration (Minimal Fields):

When integrating with a data warehouse, you may want to exclude additional fields to reduce storage and focus on key metrics:

Parameter	Value
parquet_filedir	`/data/warehouse_staging/`
parquet_filename	`diskover_metadata`
parquet_filesize_split	`1073741824`
exclude_fields	`['task', 'ino', 'nlink', 'costpergb', 'hardlinks', 'tag', 'tag_custom']`
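To preview which columns a given `exclude_fields` list would leave in the output, the effect can be modeled on a single metadata record. This is only an illustration of the outcome, not the ingester's implementation, and the sample record and its values are invented for the example.

```python
def drop_excluded(doc, exclude_fields):
    """Return a copy of one metadata record without the excluded fields."""
    excluded = set(exclude_fields)
    return {key: value for key, value in doc.items() if key not in excluded}


if __name__ == "__main__":
    # Hypothetical record; field names follow the ones used on this page.
    record = {
        "name": "report.pdf",
        "size": 2048,
        "owner": "alice",
        "ino": 123456,
        "nlink": 1,
        "task": "scan",
        "costpergb": 0.0,
        "tag": [],
    }
    minimal = ["task", "ino", "nlink", "costpergb", "hardlinks", "tag", "tag_custom"]
    print(sorted(drop_excluded(record, minimal)))
```

Listing a field that is absent from a record, such as `hardlinks` here, is harmless in this model.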

Environment Variable Override

The output directory can be overridden at runtime using the PARQUETDIR environment variable. This is useful for directing each scan to a unique location:

Linux:

export PARQUETDIR=/data/parquet_output/scan_$(date +%Y%m%d_%H%M%S)
python3 /opt/diskover/diskover.py --altingester parquet /path/to/scan

Windows:

$env:PARQUETDIR = "D:\DiskoverData\parquet_output\scan_$(Get-Date -Format 'yyyyMMdd_HHmmss')"
python "C:\Program Files\Diskover\diskover.py" --altingester parquet D:\path\to\scan

Note: When using PARQUETDIR, the ingester will create the directory automatically if it doesn't exist. However, if the specified directory already exists, the ingester will exit with an error to prevent accidental overwrites.
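When scans are launched from a scheduler or wrapper script, the same pattern can be expressed in Python. The sketch below builds the command and environment shown above and checks up front that the target directory does not exist yet. The helper names and the default Diskover path are assumptions for illustration; only the `--altingester parquet` flag and the `PARQUETDIR` variable come from this page.

```python
import os
import subprocess
from datetime import datetime
from pathlib import Path


def build_scan(base_dir, scan_path, now=None,
               diskover="/opt/diskover/diskover.py"):
    """Return (command, env) for a scan writing to a fresh timestamped directory."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    out_dir = Path(base_dir) / f"scan_{stamp}"
    # The ingester exits if PARQUETDIR already exists, so fail early here.
    if out_dir.exists():
        raise FileExistsError(f"{out_dir} already exists")
    env = dict(os.environ, PARQUETDIR=str(out_dir))
    command = ["python3", diskover, "--altingester", "parquet", str(scan_path)]
    return command, env


def run_scan(command, env):
    """Launch the scan; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(command, env=env, check=True)


if __name__ == "__main__":
    command, env = build_scan("/data/parquet_output", "/path/to/scan")
    print("PARQUETDIR =", env["PARQUETDIR"])
    print("Command    =", " ".join(command))
    # On a host with Diskover installed, start the scan with:
    # run_scan(command, env)
```

The directory is deliberately not created by the script, since the ingester creates it and treats an existing one as an error.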

Usage

Basic Usage

Run a Diskover scan with the Parquet ingester:

Linux:

cd /opt/diskover
python3 diskover.py --altingester parquet /path/to/scan

Windows:

cd "C:\Program Files\Diskover"
python diskover.py --altingester parquet D:\path\to\scan

With Custom Output Directory

Linux:

PARQUETDIR=/data/archive/scan_$(date +%Y%m%d_%H%M%S) python3 /opt/diskover/diskover.py --altingester parquet /mnt/storage

Windows:

$env:PARQUETDIR = "D:\archive\scan_$(Get-Date -Format 'yyyyMMdd_HHmmss')"
python "C:\Program Files\Diskover\diskover.py" --altingester parquet D:\storage

With Verbose Logging

Linux:

python3 /opt/diskover/diskover.py --altingester parquet --loglevel INFO /path/to/scan

Windows:

python "C:\Program Files\Diskover\diskover.py" --altingester parquet --loglevel INFO D:\path\to\scan

Multiple Scans to Separate Directories

Scanning different storage targets to separate output locations:

Linux:

# Scan production storage
PARQUETDIR=/data/parquet/production_$(date +%Y%m%d) \
  python3 /opt/diskover/diskover.py --altingester parquet /mnt/production

# Scan archive storage
PARQUETDIR=/data/parquet/archive_$(date +%Y%m%d) \
  python3 /opt/diskover/diskover.py --altingester parquet /mnt/archive

Example CLI Output from the Parquet Ingester:

~# python3 /opt/diskover/diskover.py --altingester parquet /var/log/

Using alternate ingester <module 'ingesters.parquet.parquet' from '/opt/diskover/ingesters/parquet/parquet.py'>
ingesters.parquet.parquet_writer - Creating directory /tmp/parquet_files/
ingesters.parquet.parquet_writer - Using temporary directory: /tmp/parquet_export_6ooh7pf1
configuration Default
No plugins loaded
....
Enqueuing dir tree /var/log
ingesters.parquet.parquet_writer - INFO - Initial bytes_per_row estimate: 16.0
[ThreadPoolExecutor-1_0] finished crawling /var/log (16 dirs, 40 files, 31.97 MB) in 0d:0h:00m:00s
....
ingesters.parquet.parquet_writer - INFO - Writing data to parquet file /tmp/parquet_export_6ooh7pf1/diskover_scan_data_001.parquet...
ingesters.parquet.parquet_writer - INFO - File 001: 0.01 MB, 49 rows, 317.8 bytes/row, avg: 136.7
ingesters.parquet.parquet_writer - INFO - Wrote 49 rows to 1 parquet file(s)

Now let’s look at how we can integrate this into a Scan Task!

Integration with Index Tasks

The Parquet ingester can be configured as part of a scheduled Index Task:

Field	Value
Alternate Ingester	`parquet`

Important: The --altingester flag completely replaces Elasticsearch output. Data will NOT be searchable in the Diskover Web UI.

Example of integrating the Parquet Ingester directly into a Diskover Scan Task:

Output Format

File Naming Convention

Parquet files are named using the pattern:

{parquet_filename}-{counter}.parquet

Example output directory structure:

/data/parquet_output/
├── diskover_scan_data-0.parquet
├── diskover_scan_data-1.parquet
└── diskover_scan_data-2.parquet
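Downstream loaders often need the split files in counter order. The sketch below lists them numerically rather than alphabetically, so that part 10 sorts after part 2. It accepts either a hyphen or an underscore before the counter, because the sample CLI output earlier on this page shows an underscore with a zero-padded counter; `list_parts` is a hypothetical helper.

```python
import re
import tempfile
from pathlib import Path


def list_parts(output_dir, prefix="diskover_scan_data"):
    """Return the Parquet files for one prefix, ordered by numeric counter."""
    # One optional non-digit separator, then the counter, then .parquet
    pattern = re.compile(rf"^{re.escape(prefix)}\D?(\d+)\.parquet$")
    parts = []
    for path in Path(output_dir).glob(f"{prefix}*.parquet"):
        match = pattern.match(path.name)
        if match:
            parts.append((int(match.group(1)), path))
    return [path for _, path in sorted(parts, key=lambda item: item[0])]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Empty placeholder files stand in for real scan output.
        for counter in (0, 10, 2):
            (Path(tmp) / f"diskover_scan_data-{counter}.parquet").touch()
        for part in list_parts(tmp):
            print(part.name)
```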

Schema / Field Mappings

The Parquet schema mirrors Diskover's metadata structure with added temporal columns for partitioning:

Field	Type	Description
`name`	string	File or directory name
`path`	string	Full path to the item
`path_parent`	string	Parent directory path
`extension`	string	File extension (files only)
`size`	int64	Size in bytes
`size_du`	int64	Disk usage (allocated size) in bytes
`atime`	datetime	Last access time
`mtime`	datetime	Last modification time
`ctime`	datetime	Creation/change time
`type`	string	Item type: `file` or `directory`
`owner`	string	File owner
`group`	string	File group
`year`	string	Scan year (temporal partition)
`month`	string	Scan month (temporal partition)
`day`	string	Scan day (temporal partition)
`hour`	string	Scan hour (temporal partition)

Additional fields from Diskover plugins (tags, custom metadata, etc.) are included unless specified in exclude_fields.

Working with Output Data

Since Parquet data isn't accessible through the Diskover Web UI, you'll query it using analytics tools appropriate for your environment. Parquet's wide adoption means you have many options depending on your infrastructure and use case.

Snowflake

As a Diskover technical partner, Snowflake is an ideal destination for Parquet scan data. You can stage Parquet files from local storage or cloud object storage (S3, Azure Blob, GCS) and load them directly into Snowflake tables. Once loaded, file system metadata becomes part of your unified data platform, enabling cross-analysis with business data, cost allocation reporting, and compliance dashboards.

Cloud Analytics Services

AWS Athena allows you to query Parquet files directly in S3 without loading them into a database. This serverless approach is cost-effective for ad-hoc analysis and scheduled reporting. Define an external table pointing to your Parquet files and run standard SQL queries. See the AWS Athena Documentation for configuration details.

Google BigQuery can load Parquet files from Cloud Storage for serverless analytics at scale. BigQuery's columnar engine is well-matched to Parquet's format, enabling fast aggregation queries across large scan datasets. See the BigQuery Parquet Documentation for loading procedures.

Azure Synapse Analytics supports Parquet natively, allowing you to query files in Azure Data Lake Storage or load them into dedicated SQL pools for enterprise analytics.

Local Analysis Tools

Python with Pandas provides the simplest approach for local analysis. Pandas can read Parquet files directly and offers powerful data manipulation capabilities for filtering, grouping, and aggregating scan metadata. This is ideal for one-off analysis, prototyping queries, or generating reports.

DuckDB is an embedded analytical database that can query Parquet files directly without loading them entirely into memory. It's particularly useful for analyzing large scan outputs on a laptop or workstation, offering SQL syntax with excellent performance.

Apache Spark handles large-scale distributed processing when scan data exceeds what a single machine can handle. Spark's native Parquet support makes it straightforward to run analytics across petabytes of historical scan data in a cluster environment.

Common Analysis Patterns

Use Case	Approach
Find largest files	Filter by type, sort by size descending
Storage by extension	Group files by extension, sum size
Old/stale files	Filter by mtime or atime older than threshold
Storage by owner	Group by owner, sum size
Directory sizing	Aggregate file sizes by parent path
Growth over time	Compare scans using temporal partition columns
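A few of these patterns can be sketched with Pandas. The example runs on a small in-memory frame with invented rows so that it is self-contained; column names follow the schema table on this page, the function names are hypothetical, and `mtime` is assumed to be a timezone-naive datetime column.

```python
import pandas as pd


def sample_frame():
    """Build a tiny stand-in for scan output (hypothetical rows)."""
    return pd.DataFrame({
        "name": ["a.pdf", "b.pdf", "c.jpg", "docs"],
        "path": ["/data/a.pdf", "/data/b.pdf", "/data/c.jpg", "/data/docs"],
        "extension": ["pdf", "pdf", "jpg", ""],
        "size": [100, 250, 4000, 0],
        "type": ["file", "file", "file", "directory"],
        "owner": ["alice", "bob", "alice", "alice"],
        "mtime": pd.to_datetime(
            ["2019-05-01", "2024-02-10", "2023-07-04", "2024-02-10"]),
    })


def storage_by_extension(df):
    """Storage by extension: group files by extension and sum size."""
    files = df[df["type"] == "file"]
    return files.groupby("extension")["size"].sum().sort_values(ascending=False)


def largest_files(df, n=10):
    """Find largest files: filter by type, keep the n biggest by size."""
    files = df[df["type"] == "file"]
    return files.nlargest(n, "size")[["path", "size"]]


def stale_files(df, cutoff):
    """Old/stale files: files whose mtime is earlier than the cutoff date."""
    files = df[df["type"] == "file"]
    return files[files["mtime"] < pd.Timestamp(cutoff)]


if __name__ == "__main__":
    df = sample_frame()
    print(storage_by_extension(df))
    print(largest_files(df, n=2))
    print(stale_files(df, "2021-01-01")[["path", "mtime"]])
```

With real output, the frame would instead come from something like `pd.read_parquet("/data/parquet_output/diskover_scan_data-0.parquet")`, after which the same functions apply unchanged.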

Troubleshooting

Issue	Cause	Solution
`Directory already exists` error	Using `PARQUETDIR` with an existing directory	Include a timestamp in the directory path, or remove the existing directory if no longer needed
`Permission denied` error	Insufficient write permissions	Ensure the Diskover user has write access to the output directory with `chown` and `chmod`
`ModuleNotFoundError: No module named 'pandas'`	Missing Python dependencies	Run `pip install -r ingesters/parquet/requirements.txt`
`ModuleNotFoundError: No module named 'pyarrow'`	Missing Python dependencies	Run `pip install -r ingesters/parquet/requirements.txt`
Empty output directory	Scan path contains no files or exclusion rules filtered everything	Verify scan path contains files and check Diskover logs for errors
`MemoryError` during large scans	File split size too large for available RAM	Reduce `parquet_filesize_split` to `268435456` (256 MB) or lower
Type conversion warnings	Mixed data types in columns	Informational only—the ingester automatically converts to strings

Debug Logging

Enable verbose logging to troubleshoot issues:

Linux:

python3 /opt/diskover/diskover.py --altingester parquet --loglevel DEBUG /path/to/scan

Windows:

python "C:\Program Files\Diskover\diskover.py" --altingester parquet --loglevel DEBUG D:\path\to\scan

Verifying Parquet File Integrity

If you suspect file corruption, verify with PyArrow:

import pyarrow.parquet as pq

try:
    table = pq.read_table('/data/parquet_output/diskover_scan_data-0.parquet')
    print(f"Rows: {table.num_rows}, Columns: {table.num_columns}")
except Exception as e:
    print(f"Error reading file: {e}")

Support

Last Updated: April 2026
Diskover Data, Inc.
