BAM Info Index Plugin

License: PRO+ (Professional Edition or higher)
Plugin Type: Index Plugin
Author: Diskover Data, Inc.

Overview

The BAM Info plugin automatically extracts metadata from bioinformatics sequence alignment files during Diskover indexing. If your organization stores genomics data—such as DNA or RNA sequencing results—this plugin makes those files searchable by the software tools and processing parameters used to create them.

What does this plugin do?

When Diskover scans your storage, the BAM Info plugin reads the header information embedded in BAM, SAM, and CRAM files. These headers contain valuable metadata about:

- Which alignment programs processed the file (e.g., STAR, BWA, Bowtie2)
- Software version numbers used during processing
- Command-line parameters that were executed
- Custom annotations added by your bioinformatics pipelines

This metadata becomes searchable in Diskover, allowing you to find files based on how they were created—not just their names or locations.

Why is this useful?

Audience	Use Case
IT & Storage Administrators	Identify files from deprecated pipelines for archival or deletion; plan storage migrations by workflow type; generate compliance reports showing which tool versions were used
Bioinformaticians & Researchers	Locate all files processed with a specific software version; verify pipeline consistency across projects; find files that need reprocessing after a bug fix
Compliance & Quality Teams	Audit processing records for regulatory requirements (FDA, GxP); track data provenance; verify reproducibility of results

Understanding BAM/SAM/CRAM Formats

If you're an IT administrator managing storage for a genomics team, you may encounter these file types without knowing exactly what they contain. This section provides the background you need to understand what the plugin extracts and why it matters.

What Are These Files?

BAM, SAM, and CRAM files store sequence alignment data—the results of mapping DNA or RNA sequencing reads to a reference genome. Think of them as the processed output of a sequencing experiment.

Format	What It Is	File Size	When You'll See It
SAM	Human-readable text format	Very large (uncompressed)	Rarely stored long-term; used for debugging
BAM	Compressed binary version of SAM	Medium (5-50+ GB typical)	Most common format for active analysis
CRAM	Highly compressed format	Smallest (reference-based compression)	Long-term archival storage

Why Do These Files Get So Large?

A single human genome sequencing run can produce BAM files ranging from 5 GB to over 100 GB. Organizations with active sequencing programs may have thousands of these files, consuming hundreds of terabytes to petabytes of storage. Understanding what's in these files helps you make informed decisions about storage tiering and data lifecycle management.

What's Inside These Files?

Every BAM/SAM/CRAM file has two main sections:

Header Section (what this plugin reads):
- Reference genome information
- Program (PG) records: Which software processed the file and version numbers
- Comment (CO) records: Command-line parameters, custom annotations, pipeline identifiers
Alignment Records (the bulk of the file):
- Millions of individual sequence reads and their positions
- Quality scores and alignment details

The plugin extracts metadata from the header section only, which is fast and doesn't require reading the entire file.
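
To make the header layout concrete, here is a minimal sketch that separates `@PG` and `@CO` lines from SAM header text using only the Python standard library. The helper name `parse_sam_header` and the sample header are illustrative assumptions, not the plugin's code; the plugin itself reads headers through pysam.

```python
def parse_sam_header(header_text):
    """Split SAM header text into @PG records and @CO comments.

    Hypothetical helper for illustration; not part of the plugin.
    """
    programs, comments = [], []
    for line in header_text.splitlines():
        if line.startswith("@PG\t"):
            # @PG fields are tab-separated TAG:VALUE pairs (ID, PN, VN, CL, ...).
            fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
            programs.append(fields)
        elif line.startswith("@CO\t"):
            # @CO lines carry a single free-text comment after the tab.
            comments.append(line.split("\t", 1)[1])
    return programs, comments


# Made-up header resembling what `samtools view -H` prints.
SAMPLE_HEADER = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:chr1\tLN:248956422\n"
    "@PG\tID:STAR\tPN:STAR\tVN:2.7.10a\tCL:STAR --runThreadN 8\n"
    "@CO\tuser command line: STAR --runThreadN 8\n"
    "@CO\tANNID:ANN001\n"
)

programs, comments = parse_sam_header(SAMPLE_HEADER)
print(programs[0]["ID"], programs[0]["VN"])
print(comments)
```

Only the lines beginning with `@` are touched, which is why header extraction stays cheap even for very large files.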

Why Header Metadata Matters for Storage Management

Scenario	How Metadata Helps
Storage cleanup	Find all files created by an old pipeline version that's been deprecated
Compliance audit	Prove that files were processed with approved software versions
Troubleshooting	Identify which files were affected when a software bug is discovered
Migration planning	Group files by processing workflow for targeted data movement
Capacity planning	Understand which pipelines generate the most data

Requirements

Python Dependencies

Package	Version	Purpose
pysam	0.22.1	Python library for reading BAM/SAM/CRAM file headers

System Requirements

- Python 3.9 or higher
- Diskover indexer with plugin support enabled
- Read access to BAM/SAM/CRAM files during indexing
- Sufficient permissions to create cache directories

Optional: CRAM Format Support

CRAM files use reference-based compression, which may require additional system libraries:

Library	Purpose
htslib	Core library for high-throughput sequencing data formats
libdeflate	Fast compression library (improves performance)

Note: Most organizations primarily use BAM format. CRAM support is only needed if your teams have adopted CRAM for archival storage.

Installation

Step 1: Install Python Dependencies

Linux:

python3 -m pip install pysam==0.22.1

Windows:

python -m pip install pysam==0.22.1

Step 2: Verify Installation

Linux:

python3 -c "import pysam; print(f'pysam version: {pysam.__version__}')"

Windows:

python -c "import pysam; print(f'pysam version: {pysam.__version__}')"

Expected output:

pysam version: 0.22.1

Step 3: Configure the Plugin

- Navigate to Diskover Admin > Plugins > Index Plugins > BAM Info
- Enable the plugin and configure parameters as needed (see Configuration section below)
- Save the configuration

Step 4: Enable in Index Task Configuration

- Navigate to Diskover > Configurations > [Your Configuration Name]
- Scroll to the bottom to find Index Plugins Enablement
- Enable the BAM Info plugin
- Save the configuration

The plugin will now run automatically during any scan that uses this configuration.

Configuration

The BAM Info plugin configuration controls which metadata is extracted from alignment files.

Configuration Parameters

Parameter	Type	Default	Description
`PG_enabled`	Boolean	`True`	Enable extraction of Program Group (PG) header information—the alignment software names and versions
`PG_program_ids`	List	`["STAR", "bwa"]`	Which alignment programs to track. Only programs in this list are indexed
`CO_enabled`	Boolean	`True`	Enable extraction of Comment (CO) header information—command lines and custom annotations
`CO_list`	List	`["ANNID"]`	Which custom comment keys to extract as searchable fields
`extensions`	Dictionary	`{"sam": "", "bam": "b", "cram": "c"}`	File extensions to process with their pysam open mode suffixes
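
The `extensions` values are suffixes appended to pysam's read mode `r`, which gives `r` for SAM, `rb` for BAM, and `rc` for CRAM. The following is a minimal sketch of that lookup. The helper name `open_mode_for` and the case-insensitive extension match are assumptions made for illustration; the plugin's internals may differ.

```python
import os

# Default value of the `extensions` parameter.
EXTENSIONS = {"sam": "", "bam": "b", "cram": "c"}


def open_mode_for(filename, extensions=EXTENSIONS):
    """Return the pysam open mode for a file, or None if its extension is not configured.

    Hypothetical helper; lower-casing the extension is an assumption.
    """
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    if ext not in extensions:
        return None
    return "r" + extensions[ext]


for name in ("sample.sam", "sample.bam", "sample.cram", "notes.txt"):
    print(name, open_mode_for(name))
```

A file whose extension is absent from the dictionary yields `None` in this sketch, which corresponds to the file being skipped.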

Understanding PG_program_ids

This setting determines which alignment programs are tracked. Only files processed by programs in this list will have their program metadata indexed.

Common alignment programs:

Program ID	Description	Typical Use
`STAR`	Spliced Transcripts Alignment to a Reference	RNA sequencing analysis
`bwa`	Burrows-Wheeler Aligner	DNA sequencing, whole genome
`minimap2`	Long-read aligner	PacBio, Oxford Nanopore data
`bowtie2`	Fast short-read aligner	ChIP-seq, smaller genomes
`hisat2`	Graph-based aligner	RNA sequencing

Tip: Ask your bioinformatics team which alignment tools they use, then add those program IDs to the configuration.
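
As an illustration of how a program ID list narrows what gets indexed, the sketch below filters parsed `@PG` records down to the configured IDs and keeps only the ID and version. The function name `filter_programs` is hypothetical, and exact, case-sensitive matching is an assumption; check how your files spell the ID before relying on it.

```python
def filter_programs(pg_records, program_ids):
    """Keep only @PG records whose ID is listed, reduced to id and version.

    Hypothetical helper; exact, case-sensitive ID matching is assumed.
    """
    wanted = set(program_ids)
    return [
        {"id": rec["ID"], "vn": rec.get("VN", "")}
        for rec in pg_records
        if rec.get("ID") in wanted
    ]


# Made-up records, shaped like parsed @PG header lines.
pg_records = [
    {"ID": "STAR", "PN": "STAR", "VN": "2.7.10a"},
    {"ID": "samtools", "PN": "samtools", "VN": "1.17"},
]

tracked = filter_programs(pg_records, ["STAR", "bwa"])
print(tracked)
```

Here the `samtools` record is dropped because its ID is not in the list, while the `STAR` record is kept.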

Understanding CO_list

BAM files often contain custom comments in KEY:VALUE format. This setting specifies which comment keys to extract.

Key	Description
`ANNID`	Annotation identifier
`PIPELINE`	Pipeline name or identifier
`PROJECT`	Project code
`SAMPLE_ID`	Sample identifier
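
A minimal sketch of KEY:VALUE extraction follows, assuming each comment is split at its first colon and kept only when the key appears in `CO_list`. The function name `extract_comments` and the whitespace trimming are assumptions for illustration, not the plugin's confirmed behavior.

```python
def extract_comments(comments, co_list):
    """Return key/value pairs for comments whose key is in co_list.

    Hypothetical helper; splitting on the first colon is an assumption.
    """
    wanted = set(co_list)
    pairs = []
    for comment in comments:
        key, sep, value = comment.partition(":")
        if sep and key.strip() in wanted:
            pairs.append({"key": key.strip(), "value": value.strip()})
    return pairs


# Made-up @CO comment strings.
comments = [
    "user command line: STAR --runThreadN 8",
    "ANNID:ANN001",
    "PROJECT:PROJ_2024_001",
]

pairs = extract_comments(comments, ["ANNID"])
print(pairs)
```

With only `ANNID` configured, the `PROJECT` comment is ignored; adding `PROJECT` to the list would return both pairs.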

Configuration Examples

Default Configuration (tracks STAR and BWA):

{
  "PG_enabled": true,
  "PG_program_ids": ["STAR", "bwa"],
  "CO_enabled": true,
  "CO_list": ["ANNID"],
  "extensions": {"sam": "", "bam": "b", "cram": "c"}
}

Expanded Configuration (tracks multiple alignment tools and custom fields):

{
  "PG_enabled": true,
  "PG_program_ids": ["STAR", "bwa", "minimap2", "bowtie2", "hisat2"],
  "CO_enabled": true,
  "CO_list": ["ANNID", "PIPELINE", "PROJECT", "SAMPLE_ID"],
  "extensions": {"sam": "", "bam": "b", "cram": "c"}
}

BAM-Only Configuration (skip SAM and CRAM files):

{
  "PG_enabled": true,
  "PG_program_ids": ["STAR", "bwa"],
  "CO_enabled": true,
  "CO_list": ["ANNID"],
  "extensions": {"bam": "b"}
}
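
Before pasting a configuration into Diskover Admin, it can help to check its shape. The sketch below is one possible sanity check based only on the types in the Configuration Parameters table; `validate_config` is a hypothetical helper, and the plugin may apply different or additional rules.

```python
import json


def validate_config(cfg):
    """Return a list of problems found in a BAM Info config dict (empty if none).

    Hypothetical check derived from the parameter table; not the plugin's own validation.
    """
    problems = []
    for key in ("PG_enabled", "CO_enabled"):
        if not isinstance(cfg.get(key), bool):
            problems.append(f"{key} must be true or false")
    for key in ("PG_program_ids", "CO_list"):
        value = cfg.get(key)
        if not (isinstance(value, list) and all(isinstance(v, str) for v in value)):
            problems.append(f"{key} must be a list of strings")
    ext = cfg.get("extensions")
    if not (isinstance(ext, dict) and ext):
        problems.append("extensions must be a non-empty object")
    elif any(suffix not in ("", "b", "c") for suffix in ext.values()):
        problems.append('extensions values must be "", "b", or "c"')
    return problems


bam_only = json.loads("""
{
  "PG_enabled": true,
  "PG_program_ids": ["STAR", "bwa"],
  "CO_enabled": true,
  "CO_list": ["ANNID"],
  "extensions": {"bam": "b"}
}
""")

problems = validate_config(bam_only)
print(problems)
```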

Cache Configuration

The plugin caches extracted metadata to improve re-scan performance. The cache is stored in:

Linux: /opt/diskover/__bam_plugin_cache__/
Windows: C:\Program Files\Diskover\__bam_plugin_cache__\

The cache uses file modification time (mtime) to detect changes—if a file hasn't changed since the last scan, cached metadata is reused.
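
The mtime check can be sketched as follows. This is an in-memory illustration under stated assumptions: the real plugin persists its cache on disk in the directories listed above, and the names `cached_extract` and `fake_extract` are hypothetical.

```python
import os
import tempfile


def cached_extract(path, cache, extract):
    """Return (metadata, was_cache_hit), calling extract only when mtime changed.

    Hypothetical helper; uses an in-memory dict instead of an on-disk cache.
    """
    mtime = os.stat(path).st_mtime
    entry = cache.get(path)
    if entry is not None and entry["mtime"] == mtime:
        return entry["meta"], True    # unchanged file: reuse cached metadata
    meta = extract(path)
    cache[path] = {"mtime": mtime, "meta": meta}
    return meta, False                # new or modified file: extract again


calls = []


def fake_extract(path):
    """Stand-in for header extraction; records each call."""
    calls.append(path)
    return {"pg": [{"id": "STAR", "vn": "2.7.10a"}]}


with tempfile.NamedTemporaryFile(suffix=".bam", delete=False) as fh:
    fh.write(b"placeholder")
    path = fh.name

cache = {}
first = cached_extract(path, cache, fake_extract)
second = cached_extract(path, cache, fake_extract)
os.utime(path, (0, 0))  # simulate a modification by changing mtime
third = cached_extract(path, cache, fake_extract)
os.remove(path)

print(first[1], second[1], third[1], len(calls))
```

The second lookup is served from the cache; the third triggers a fresh extraction because the modification time no longer matches.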

Indexed Fields / Elasticsearch Mappings

The BAM Info plugin adds a `bam_info` object to each indexed document, containing metadata extracted from the BAM, SAM, or CRAM file header.

Field Mappings

Field Path	ES Type	Description
`bam_info`	object	Root container for all BAM metadata
`bam_info.pg`	object (array)	Array of program group entries
`bam_info.pg.id`	keyword	Program identifier (e.g., "STAR", "bwa")
`bam_info.pg.vn`	keyword	Program version number (e.g., "2.7.10a")
`bam_info.co_cmd`	text	Full user command line from CO header
`bam_info.co_cmd_checksum`	keyword	MD5 hash of the command line for exact matching
`bam_info.co`	object (array)	Array of custom comment key-value pairs
`bam_info.co.key`	keyword	Comment key (e.g., "ANNID", "PIPELINE")
`bam_info.co.value`	keyword	Comment value
`bam_info.error`	text	Error message if metadata extraction failed

Example Document

A BAM file processed by STAR version 2.7.10a with custom annotations would have metadata like this:

{
  "name": "sample_001_aligned.bam",
  "extension": "bam",
  "size": 15234567890,
  "bam_info": {
    "pg": [
      {
        "id": "STAR",
        "vn": "2.7.10a"
      }
    ],
    "co_cmd": "user command line: STAR --runThreadN 8 --genomeDir /ref/GRCh38 --readFilesIn sample_001_R1.fastq.gz sample_001_R2.fastq.gz --outSAMtype BAM SortedByCoordinate",
    "co_cmd_checksum": "a3f2b8c9d4e5f6a7b8c9d0e1f2a3b4c5",
    "co": [
      {
        "key": "ANNID",
        "value": "ANN001"
      },
      {
        "key": "PROJECT",
        "value": "PROJ_2024_001"
      }
    ]
  }
}

Searching in Diskover

Once the plugin indexes your BAM files, you can search for them using the metadata fields. Enter these queries in the Diskover search bar.

Program and Version Searches

Query	Description
`bam_info.pg.id:STAR`	Find all files aligned with STAR
`bam_info.pg.id:bwa`	Find all files aligned with BWA
`bam_info.pg.vn:2.7.10a`	Find files processed with a specific version
`bam_info.pg.id:STAR AND bam_info.pg.vn:2.7*`	STAR version 2.7.x files

Command Line and Annotation Searches

Query	Description
`bam_info.co_cmd:GRCh38`	Files aligned to GRCh38 reference genome
`bam_info.co_cmd_checksum:a3f2b8c9d4e5f6a7b8c9d0e1f2a3b4c5`	Files with exact same processing parameters
`bam_info.co.key:ANNID AND bam_info.co.value:ANN001`	Files with specific annotation ID
`bam_info.co.key:PROJECT AND bam_info.co.value:PROJ_2024*`	Files from 2024 projects

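Tip: Use the command line checksum to find files processed with identical parameters, which is useful for verifying consistency across a project.

Note: `bam_info.co` is listed in the Field Mappings table as an object array rather than a nested type, so Elasticsearch matches `bam_info.co.key` and `bam_info.co.value` independently of each other. A query such as `bam_info.co.key:ANNID AND bam_info.co.value:ANN001` can therefore also match a file where `ANN001` is the value of a different key.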
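
To show why a checksum is convenient for exact matching, the sketch below hashes command lines with MD5 from Python's `hashlib`. Exactly which string the plugin hashes (for example, with or without the `user command line:` prefix, and in which encoding) is not documented here, so the input and the helper name `command_checksum` are assumptions.

```python
import hashlib


def command_checksum(command_line):
    """Return the MD5 hex digest of a command line (UTF-8 encoding assumed)."""
    return hashlib.md5(command_line.encode("utf-8")).hexdigest()


a = command_checksum("STAR --runThreadN 8 --genomeDir /ref/GRCh38")
b = command_checksum("STAR --runThreadN 8 --genomeDir /ref/GRCh38")
c = command_checksum("STAR --runThreadN 16 --genomeDir /ref/GRCh38")

print(len(a), a == b, a == c)
```

Identical command lines produce the same 32-character digest, while changing a single parameter produces a different one, so a keyword match on the checksum groups files by exact parameters.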

Combined and Storage Management Searches

Query	Description
`bam_info.pg.id:STAR AND size:>=1073741824`	STAR files of 1 GiB (1,073,741,824 bytes) or larger
`extension:bam AND bam_info.pg.id:bwa AND mtime:[now-30d TO now]`	BWA BAM files modified in last 30 days
`bam_info.pg.vn:2.5* AND mtime:<now-1y`	Files with a 2.5.x program version that have not been modified in over a year (archive candidates)
`tags:baminfo-plugin AND tags:bad-file`	Files that failed metadata extraction
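
For scripted reporting, the same query strings can be assembled programmatically. The sketch below builds a query string from optional filters and wraps it in an Elasticsearch `query_string` request body. The helper `build_query` is hypothetical, and how you send the body (index name, client, authentication) depends on your deployment and is not shown.

```python
import json


def build_query(program_id=None, version_prefix=None, min_size_bytes=None):
    """Join optional filters into a query string like those in the tables above.

    Hypothetical helper for illustration.
    """
    clauses = []
    if program_id:
        clauses.append(f"bam_info.pg.id:{program_id}")
    if version_prefix:
        clauses.append(f"bam_info.pg.vn:{version_prefix}*")
    if min_size_bytes is not None:
        clauses.append(f"size:>={min_size_bytes}")
    return " AND ".join(clauses)


query = build_query(program_id="STAR", min_size_bytes=1024 ** 3)
body = {"query": {"query_string": {"query": query}}}

print(query)
print(json.dumps(body))
```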

Troubleshooting

Common Issues

Issue	Cause	Solution
Plugin not running during scans	Plugin not enabled in Index Task Configuration	Navigate to Diskover > Configurations > [Config Name] > Index Plugins Enablement and enable BAM Info
No `bam_info` field on BAM files	pysam not installed or import error	Run `python3 -c "import pysam"` to verify installation; check logs for import errors
Metadata missing for valid files	Program ID not in `PG_program_ids` list	Add the program ID to the configuration; check file headers with `samtools view -H file.bam | grep "@PG"`
Files tagged as "bad-file"	Corrupted or truncated BAM file	Verify file integrity with `samtools quickcheck file.bam`; re-transfer if corrupted
Slow indexing performance	Cache misses on first scan; network storage latency	First scan is slower due to cache population; ensure files are accessible with low latency
CRAM files not processing	Missing reference genome or htslib	Set `REF_PATH` environment variable; verify CRAM support with `samtools view -T ref.fa file.cram`

Verifying Plugin Installation

Check if pysam is installed:

Linux:

python3 -c "import pysam; print(f'pysam version: {pysam.__version__}')"

Windows:

python -c "import pysam; print(f'pysam version: {pysam.__version__}')"

Test plugin import:

Linux:

cd /opt/diskover
python3 -c "from plugins.baminfo import *; print('Plugin loaded successfully')"

Windows:

cd "C:\Program Files\Diskover"
python -c "from plugins.baminfo import *; print('Plugin loaded successfully')"

Checking BAM File Headers

If metadata isn't being extracted, verify the file contains the expected headers:

# View all headers
samtools view -H /path/to/file.bam | head -50

# Check for Program Group headers
samtools view -H /path/to/file.bam | grep "^@PG"

# Check for Comment headers
samtools view -H /path/to/file.bam | grep "^@CO"

Cache Management

To clear the plugin cache and force re-extraction of all metadata:

Linux:

rm -rf /opt/diskover/__bam_plugin_cache__/

Windows:

rmdir /s /q "C:\Program Files\Diskover\__bam_plugin_cache__\"

Note: Clearing the cache causes the next scan to re-process every BAM, SAM, and CRAM file, which will increase scan time.

Debug Logging

To enable verbose logging for troubleshooting:

Enable verbose/debug logging in the BAM Info plugin configuration within Diskover Admin
Run a scan and monitor the logs

View logs:

Linux:

tail -f /var/log/diskover/diskover.log | grep baminfo

Look for these log messages:

Log Message	Meaning
`CACHE HIT`	Cached metadata found and reused
`CACHE MISS`	New extraction performed
`Getting bam info for...`	Plugin is processing a file
`Failed to open file`	File could not be read (check permissions or file integrity)
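
To get a quick summary after a scan, you can count how often each message appears. The sketch below matches only the message substrings from the table above. The sample lines are synthetic placeholders: real log lines carry timestamps, paths, and other fields in a format not shown here, and `summarize_log` is a hypothetical helper.

```python
from collections import Counter

# Message substrings taken from the log message table.
MARKERS = ("CACHE HIT", "CACHE MISS", "Getting bam info for", "Failed to open file")


def summarize_log(lines):
    """Count occurrences of each known message substring in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        for marker in MARKERS:
            if marker in line:
                counts[marker] += 1
    return counts


# Synthetic placeholder lines, not real Diskover log output.
sample_lines = [
    "<placeholder> CACHE MISS <placeholder>",
    "<placeholder> Getting bam info for <placeholder>",
    "<placeholder> CACHE HIT <placeholder>",
    "<placeholder> CACHE HIT <placeholder>",
    "<placeholder> unrelated message",
]

counts = summarize_log(sample_lines)
print(dict(counts))
```

To run it against a real log, pass an open file object instead of `sample_lines`, for example `summarize_log(open("/var/log/diskover/diskover.log"))`. A high share of `CACHE MISS` entries on a repeat scan suggests files are changing or the cache was cleared.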

Support

Last Updated: January 2026
Diskover Data, Inc.
