BAM Info Index Plugin
License: PRO+ (Professional Edition or higher)
Plugin Type: Index Plugin
Author: Diskover Data, Inc.
Overview
The BAM Info plugin automatically extracts metadata from bioinformatics sequence alignment files during Diskover indexing. If your organization stores genomics data—such as DNA or RNA sequencing results—this plugin makes those files searchable by the software tools and processing parameters used to create them.
What does this plugin do?
When Diskover scans your storage, the BAM Info plugin reads the header information embedded in BAM, SAM, and CRAM files. These headers contain valuable metadata about:
Which alignment programs processed the file (e.g., STAR, BWA, Bowtie2)
Software version numbers used during processing
Command-line parameters that were executed
Custom annotations added by your bioinformatics pipelines
This metadata becomes searchable in Diskover, allowing you to find files based on how they were created—not just their names or locations.
Why is this useful?
Audience | Use Case |
|---|---|
IT & Storage Administrators | Identify files from deprecated pipelines for archival or deletion; plan storage migrations by workflow type; generate compliance reports showing which tool versions were used |
Bioinformaticians & Researchers | Locate all files processed with a specific software version; verify pipeline consistency across projects; find files that need reprocessing after a bug fix |
Compliance & Quality Teams | Audit processing records for regulatory requirements (FDA, GxP); track data provenance; verify reproducibility of results |
Understanding BAM/SAM/CRAM Formats
If you're an IT administrator managing storage for a genomics team, you may encounter these file types without knowing exactly what they contain. This section provides the background you need to understand what the plugin extracts and why it matters.
What Are These Files?
BAM, SAM, and CRAM files store sequence alignment data—the results of mapping DNA or RNA sequencing reads to a reference genome. Think of them as the processed output of a sequencing experiment.
Format | What It Is | File Size | When You'll See It |
|---|---|---|---|
SAM | Human-readable text format | Very large (uncompressed) | Rarely stored long-term; used for debugging |
BAM | Compressed binary version of SAM | Medium (5-50+ GB typical) | Most common format for active analysis |
CRAM | Highly compressed format | Smallest (reference-based compression) | Long-term archival storage |
Why Do These Files Get So Large?
A single human genome sequencing run can produce BAM files ranging from 5 GB to over 100 GB. Organizations with active sequencing programs may have thousands of these files, consuming petabytes of storage. Understanding what's in these files helps you make informed decisions about storage tiering and data lifecycle management.
What's Inside These Files?
Every BAM/SAM/CRAM file has two main sections:
Header Section (what this plugin reads):
Reference genome information
Program (PG) records: Which software processed the file and version numbers
Comment (CO) records: Command-line parameters, custom annotations, pipeline identifiers
Alignment Records (the bulk of the file):
Millions of individual sequence reads and their positions
Quality scores and alignment details
The plugin extracts metadata from the header section only, which is fast and doesn't require reading the entire file.
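To make the header layout concrete, here is a minimal sketch that pulls @PG and @CO lines out of SAM-style header text using only the Python standard library. It illustrates the header structure and is not the plugin's actual code, which reads headers through pysam; the function name `parse_sam_header` and the example header are assumptions made for this illustration.

```python
from typing import Dict


def parse_sam_header(header_text: str) -> Dict[str, list]:
    """Collect @PG records (as tag dicts) and @CO comment strings from header text."""
    parsed: Dict[str, list] = {"PG": [], "CO": []}
    for line in header_text.splitlines():
        if line.startswith("@PG\t"):
            # @PG fields are tab-separated TAG:VALUE pairs, e.g. ID:STAR, VN:2.7.10a
            fields = line.split("\t")[1:]
            parsed["PG"].append(dict(f.split(":", 1) for f in fields if ":" in f))
        elif line.startswith("@CO\t"):
            # @CO lines carry free text after the first tab
            parsed["CO"].append(line.split("\t", 1)[1])
    return parsed


# Hypothetical header, modeled on the kind of file described in this article
EXAMPLE_HEADER = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:chr1\tLN:248956422\n"
    "@PG\tID:STAR\tPN:STAR\tVN:2.7.10a\n"
    "@CO\tuser command line: STAR --runThreadN 8\n"
    "@CO\tANNID:ANN001\n"
)

header = parse_sam_header(EXAMPLE_HEADER)
print(header["PG"])
print(header["CO"])
```

Because only the lines starting with `@` are read, the cost is independent of how many alignment records follow the header.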
Why Header Metadata Matters for Storage Management
Scenario | How Metadata Helps |
|---|---|
Storage cleanup | Find all files created by an old pipeline version that's been deprecated |
Compliance audit | Prove that files were processed with approved software versions |
Troubleshooting | Identify which files were affected when a software bug is discovered |
Migration planning | Group files by processing workflow for targeted data movement |
Capacity planning | Understand which pipelines generate the most data |
Requirements
Python Dependencies
Package | Version | Purpose |
|---|---|---|
pysam | 0.22.1 | Python library for reading BAM/SAM/CRAM file headers |
System Requirements
Python 3.9 or higher
Diskover indexer with plugin support enabled
Read access to BAM/SAM/CRAM files during indexing
Sufficient permissions to create cache directories
Optional: CRAM Format Support
CRAM files use reference-based compression, which may require additional system libraries:
Library | Purpose |
|---|---|
htslib | Core library for high-throughput sequencing data formats |
libdeflate | Fast compression library (improves performance) |
Note: Most organizations primarily use BAM format. CRAM support is only needed if your teams have adopted CRAM for archival storage.
Installation
Step 1: Install Python Dependencies
Linux:
python3 -m pip install pysam==0.22.1
Windows:
python -m pip install pysam==0.22.1
Step 2: Verify Installation
Linux:
python3 -c "import pysam; print(f'pysam version: {pysam.__version__}')"
Windows:
python -c "import pysam; print(f'pysam version: {pysam.__version__}')"
Expected output:
pysam version: 0.22.1
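If you prefer a single script over the one-liners above, the following sketch checks both the Python version and the pysam version listed in the requirements. The helper names `installed_version` and `check_environment` are hypothetical, and the script is not shipped with Diskover.

```python
import sys
from importlib import metadata
from typing import List, Optional

REQUIRED_PYTHON = (3, 9)    # from the System Requirements section
EXPECTED_PYSAM = "0.22.1"   # from the Python Dependencies table


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_environment() -> List[str]:
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    if sys.version_info < REQUIRED_PYTHON:
        current = f"{sys.version_info.major}.{sys.version_info.minor}"
        problems.append(f"Python {current} is older than 3.9")
    found = installed_version("pysam")
    if found is None:
        problems.append("pysam is not installed")
    elif found != EXPECTED_PYSAM:
        problems.append(f"pysam {found} found, expected {EXPECTED_PYSAM}")
    return problems


for problem in check_environment():
    print("WARNING:", problem)
```

Run it with the same interpreter that Diskover uses for indexing, since a different interpreter may have a different set of installed packages.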
Step 3: Configure the Plugin
Navigate to Diskover Admin > Plugins > Index Plugins > BAM Info
Enable the plugin and configure parameters as needed (see Configuration section below)
Save the configuration
Step 4: Enable in Index Task Configuration
Navigate to Diskover > Configurations > [Your Configuration Name]
Scroll to the bottom to find Index Plugins Enablement
Enable the BAM Info plugin
Save the configuration
The plugin will now run automatically during any scan that uses this configuration.
Configuration
The BAM Info plugin configuration controls which metadata is extracted from alignment files.
Configuration Parameters
Parameter | Type | Default | Description |
|---|---|---|---|
PG_enabled | Boolean | true | Enable extraction of Program Group (PG) header information: the alignment software names and versions |
PG_program_ids | List | ["STAR", "bwa"] | Which alignment programs to track. Only programs in this list are indexed |
CO_enabled | Boolean | true | Enable extraction of Comment (CO) header information: command lines and custom annotations |
CO_list | List | ["ANNID"] | Which custom comment keys to extract as searchable fields |
extensions | Dictionary | {"sam": "", "bam": "b", "cram": "c"} | File extensions to process with their pysam open mode suffixes |
Understanding PG_program_ids
This setting determines which alignment programs are tracked. Only files processed by programs in this list will have their program metadata indexed.
Common alignment programs:
Program ID | Description | Typical Use |
|---|---|---|
STAR | Spliced Transcripts Alignment to a Reference | RNA sequencing analysis |
bwa | Burrows-Wheeler Aligner | DNA sequencing, whole genome |
minimap2 | Long-read aligner | PacBio, Oxford Nanopore data |
bowtie2 | Fast short-read aligner | ChIP-seq, smaller genomes |
hisat2 | Graph-based aligner | RNA sequencing |
Tip: Ask your bioinformatics team which alignment tools they use, then add those program IDs to the configuration.
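The filtering behavior described in this section can be sketched as follows. This is an illustration only: the function name `filter_programs` is hypothetical, the exact case-sensitive match is an assumption about how the plugin compares IDs, and the `id`/`vn` output keys mirror the indexed fields shown later in this article.

```python
from typing import Dict, List


def filter_programs(
    pg_records: List[Dict[str, str]], program_ids: List[str]
) -> List[Dict[str, str]]:
    """Keep only PG records whose ID is in the configured list."""
    wanted = set(program_ids)  # assumption: exact, case-sensitive comparison
    return [
        {"id": rec["ID"], "vn": rec.get("VN", "")}
        for rec in pg_records
        if rec.get("ID") in wanted
    ]


# Hypothetical header records: one tracked aligner, one untracked tool
records = [
    {"ID": "STAR", "VN": "2.7.10a"},
    {"ID": "samtools", "VN": "1.17"},
]
tracked = filter_programs(records, ["STAR", "bwa"])
print(tracked)
```

In this sketch the samtools record is dropped because its ID is not in the list, which is why a file can have valid headers and still show no program metadata.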
Understanding CO_list
BAM files often contain custom comments in KEY:VALUE format. This setting specifies which comment keys to extract.
Key | Description |
|---|---|
ANNID | Annotation identifier |
PIPELINE | Pipeline name or identifier |
PROJECT | Project code |
SAMPLE_ID | Sample identifier |
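As a rough sketch of how KEY:VALUE comments might be matched against CO_list, the snippet below splits each comment on its first colon and keeps only the configured keys. The function name `extract_comments` is hypothetical and the splitting rule is an assumption; the plugin's real parsing may differ.

```python
from typing import Dict, List


def extract_comments(co_lines: List[str], co_list: List[str]) -> List[Dict[str, str]]:
    """Return key/value pairs for comments whose key appears in co_list."""
    wanted = set(co_list)
    pairs = []
    for line in co_lines:
        # Split on the first colon only, so values may themselves contain colons
        key, sep, value = line.partition(":")
        if sep and key.strip() in wanted:
            pairs.append({"key": key.strip(), "value": value.strip()})
    return pairs


# Hypothetical @CO contents: two KEY:VALUE comments and one free-text note
comments = ["ANNID:ANN001", "PROJECT:PROJ_2024_001", "free-text note"]
matched = extract_comments(comments, ["ANNID", "PROJECT"])
print(matched)
```

Comments without a colon, or with a key that is not configured, are ignored in this sketch.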
Configuration Examples
Default Configuration (tracks STAR and BWA):
{
"PG_enabled": true,
"PG_program_ids": ["STAR", "bwa"],
"CO_enabled": true,
"CO_list": ["ANNID"],
"extensions": {"sam": "", "bam": "b", "cram": "c"}
}
Expanded Configuration (tracks multiple alignment tools and custom fields):
{
"PG_enabled": true,
"PG_program_ids": ["STAR", "bwa", "minimap2", "bowtie2", "hisat2"],
"CO_enabled": true,
"CO_list": ["ANNID", "PIPELINE", "PROJECT", "SAMPLE_ID"],
"extensions": {"sam": "", "bam": "b", "cram": "c"}
}
BAM-Only Configuration (skip SAM and CRAM files):
{
"PG_enabled": true,
"PG_program_ids": ["STAR", "bwa"],
"CO_enabled": true,
"CO_list": ["ANNID"],
"extensions": {"bam": "b"}
}
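The values in the extensions dictionary are suffixes for the pysam open mode: pysam opens SAM with "r", BAM with "rb", and CRAM with "rc". A minimal sketch of that mapping is shown below; the helper name `pysam_mode_for` and the lowercasing of the extension are assumptions for illustration.

```python
import os
from typing import Dict, Optional

EXTENSIONS = {"sam": "", "bam": "b", "cram": "c"}  # default configuration shown above


def pysam_mode_for(path: str, extensions: Dict[str, str]) -> Optional[str]:
    """Return the pysam read mode for a path, or None if its extension is not configured."""
    ext = os.path.splitext(path)[1].lstrip(".").lower()  # assumption: case-insensitive
    if ext not in extensions:
        return None
    return "r" + extensions[ext]


for name in ("reads.sam", "reads.bam", "reads.cram", "notes.txt"):
    print(name, pysam_mode_for(name, EXTENSIONS))
```

With the BAM-only configuration, the same helper would return None for .sam and .cram files, so they would be skipped.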
Cache Configuration
The plugin caches extracted metadata to improve re-scan performance. The cache is stored in:
Linux:
/opt/diskover/__bam_plugin_cache__/
Windows:
C:\Program Files\Diskover\__bam_plugin_cache__\
The cache uses file modification time (mtime) to detect changes—if a file hasn't changed since the last scan, cached metadata is reused.
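The mtime check can be sketched as follows. This is a simplified illustration that keeps the cache in an in-memory dictionary, whereas the plugin stores its cache on disk; the function `get_metadata` and the stand-in extractor are hypothetical.

```python
import os
import tempfile
from typing import Callable, Dict, Tuple


def get_metadata(path: str, cache: Dict[str, dict], extract: Callable[[str], dict]) -> Tuple[dict, bool]:
    """Return (metadata, cache_hit), re-extracting only when the file's mtime has changed."""
    mtime = os.stat(path).st_mtime
    entry = cache.get(path)
    if entry is not None and entry["mtime"] == mtime:
        return entry["metadata"], True
    metadata = extract(path)
    cache[path] = {"mtime": mtime, "metadata": metadata}
    return metadata, False


calls = []


def fake_extract(path: str) -> dict:
    """Stand-in for the real header extraction; records each call."""
    calls.append(path)
    return {"pg": []}


with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.bam")
    with open(sample, "wb") as fh:
        fh.write(b"placeholder")
    cache: Dict[str, dict] = {}
    _, hit_first = get_metadata(sample, cache, fake_extract)
    _, hit_second = get_metadata(sample, cache, fake_extract)
    # Simulate a modification by setting a different mtime
    os.utime(sample, (1_000_000_000, 1_000_000_000))
    _, hit_third = get_metadata(sample, cache, fake_extract)

print(hit_first, hit_second, hit_third)
```

The first and third lookups extract again, while the second is served from the cache, which is the behavior that makes re-scans of unchanged files faster.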
Indexed Fields / Elasticsearch Mappings
The BAM Info plugin adds a bam_info object to indexed documents containing metadata extracted from BAM file headers.
Field Mappings
Field Path | ES Type | Description |
|---|---|---|
bam_info | object | Root container for all BAM metadata |
bam_info.pg | object (array) | Array of program group entries |
bam_info.pg.id | keyword | Program identifier (e.g., "STAR", "bwa") |
bam_info.pg.vn | keyword | Program version number (e.g., "2.7.10a") |
bam_info.co_cmd | text | Full user command line from CO header |
bam_info.co_cmd_checksum | keyword | MD5 hash of the command line for exact matching |
bam_info.co | object (array) | Array of custom comment key-value pairs |
bam_info.co.key | keyword | Comment key (e.g., "ANNID", "PIPELINE") |
bam_info.co.value | keyword | Comment value |
| text | Error message if metadata extraction failed |
Example Document
A BAM file processed by STAR version 2.7.10a with custom annotations would have metadata like this:
{
"name": "sample_001_aligned.bam",
"extension": "bam",
"size": 15234567890,
"bam_info": {
"pg": [
{
"id": "STAR",
"vn": "2.7.10a"
}
],
"co_cmd": "user command line: STAR --runThreadN 8 --genomeDir /ref/GRCh38 --readFilesIn sample_001_R1.fastq.gz sample_001_R2.fastq.gz --outSAMtype BAM SortedByCoordinate",
"co_cmd_checksum": "a3f2b8c9d4e5f6a7b8c9d0e1f2a3b4c5",
"co": [
{
"key": "ANNID",
"value": "ANN001"
},
{
"key": "PROJECT",
"value": "PROJ_2024_001"
}
]
}
}
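The co_cmd_checksum field is described as an MD5 hash of the command line. A minimal sketch of such a checksum follows; the UTF-8 encoding and the exact string that the plugin hashes (for example, with or without the "user command line:" prefix) are assumptions, so values computed this way may not match the indexed ones.

```python
import hashlib


def command_checksum(command_line: str) -> str:
    """Return the hexadecimal MD5 digest of a command line string."""
    return hashlib.md5(command_line.encode("utf-8")).hexdigest()


cmd_a = "STAR --runThreadN 8 --genomeDir /ref/GRCh38"
cmd_b = "STAR --runThreadN 16 --genomeDir /ref/GRCh38"

# Identical command lines produce identical checksums; any difference changes the digest
print(command_checksum(cmd_a) == command_checksum(cmd_a))
print(command_checksum(cmd_a) == command_checksum(cmd_b))
```

A fixed-length keyword like this is easier to match exactly than a long free-text command line.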
Searching in Diskover
Once the plugin indexes your BAM files, you can search for them using the metadata fields. Enter these queries in the Diskover search bar.
Program and Version Searches
Query | Description |
|---|---|
| Find all files aligned with STAR |
| Find all files aligned with BWA |
| Find files processed with a specific version |
| STAR version 2.7.x files |
Command Line and Annotation Searches
Query | Description |
|---|---|
| Files aligned to GRCh38 reference genome |
| Files with exact same processing parameters |
| Files with specific annotation ID |
| Files from 2024 projects |
Tip: Use the command line checksum to find files processed with identical parameters—useful for verifying consistency across a project.
Combined and Storage Management Searches
Query | Description |
|---|---|
| STAR files larger than 1 GB |
| BWA BAM files modified in last 30 days |
| Old version files not modified in a year (archive candidates) |
| Files that failed metadata extraction |
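As an illustration of how queries for these fields could be assembled, the sketch below builds search strings from the field paths shown in the example document. It assumes the search bar accepts Elasticsearch query-string syntax (field:value, AND, wildcards, and range operators such as size:>N); the helper names are hypothetical, and the resulting strings are examples rather than the exact queries from the tables above.

```python
def field_query(field: str, value: str) -> str:
    """Build a field:value clause, quoting values that contain whitespace."""
    text = str(value)
    if any(ch.isspace() for ch in text):
        text = '"' + text.replace('"', '\\"') + '"'
    return f"{field}:{text}"


def all_of(*clauses: str) -> str:
    """Join clauses so that every one of them must match."""
    return " AND ".join(clauses)


ONE_GIB = 1024 ** 3  # 1 GiB in bytes

star_files = field_query("bam_info.pg.id", "STAR")
star_2_7 = all_of(star_files, field_query("bam_info.pg.vn", "2.7.*"))
large_star = all_of(star_files, f"size:>{ONE_GIB}")

print(star_files)
print(star_2_7)
print(large_star)
```

Check the generated strings against a small index before relying on them for cleanup or audit reports.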
Troubleshooting
Common Issues
Issue | Cause | Solution |
|---|---|---|
Plugin not running during scans | Plugin not enabled in Index Task Configuration | Navigate to Diskover > Configurations > [Config Name] > Index Plugins Enablement and enable BAM Info |
No | pysam not installed or import error | Run |
Metadata missing for valid files | Program ID not in PG_program_ids | Add the program ID to the configuration; check file headers with samtools view -H |
Files tagged as "bad-file" | Corrupted or truncated BAM file | Verify file integrity with |
Slow indexing performance | Cache misses on first scan; network storage latency | First scan is slower due to cache population; ensure files are accessible with low latency |
CRAM files not processing | Missing reference genome or htslib | Set |
Verifying Plugin Installation
Check if pysam is installed:
Linux:
python3 -c "import pysam; print(f'pysam version: {pysam.__version__}')"
Windows:
python -c "import pysam; print(f'pysam version: {pysam.__version__}')"
Test plugin import:
Linux:
cd /opt/diskover
python3 -c "from plugins.baminfo import *; print('Plugin loaded successfully')"
Windows:
cd "C:\Program Files\Diskover"
python -c "from plugins.baminfo import *; print('Plugin loaded successfully')"
Checking BAM File Headers
If metadata isn't being extracted, verify the file contains the expected headers:
# View all headers
samtools view -H /path/to/file.bam | head -50
# Check for Program Group headers
samtools view -H /path/to/file.bam | grep "^@PG"
# Check for Comment headers
samtools view -H /path/to/file.bam | grep "^@CO"
Cache Management
To clear the plugin cache and force re-extraction of all metadata:
Linux:
rm -rf /opt/diskover/__bam_plugin_cache__/
Windows:
rmdir /s /q "C:\Program Files\Diskover\__bam_plugin_cache__\"
Note: Clearing the cache causes the next scan to re-process every BAM, SAM, and CRAM file, which may increase scan time.
Debug Logging
To enable verbose logging for troubleshooting:
Enable verbose/debug logging in the BAM Info plugin configuration within Diskover Admin
Run a scan and monitor the logs
View logs:
Linux:
tail -f /var/log/diskover/diskover.log | grep baminfo
Look for these log messages:
Log Message | Meaning |
|---|---|
| Cached metadata found and reused |
| New extraction performed |
| Plugin is processing a file |
| File could not be read (check permissions or file integrity) |
Support
Last Updated: January 2026
Diskover Data, Inc.