pysam — BAM/CRAM Python Interface

pysam is the Pythonic interface to htslib: read BAM/CRAM/VCF/TBF, iterate alignments, call variants, and build custom pipelines. The 2026 reality: pysam 0.22+ pairs with htslib 1.19+ and is the low-level I/O engine behind GATK, freebayes, and most variant callers. If you need to programmatically iterate through NGS files, this is the canonical way.

When to use

Programmatically iterate BAM alignments with Python logic.
Extract variant or coverage data from a BAM.
Build a custom pipeline that needs alignment-level access.
Query regions from indexed BAM/CRAM without loading a whole file.

When NOT to use

You need a BAM stats summary → use samtools stats or mosdepth.
You need variant calling → use GATK/HaplotypeCaller or freebayes.
Visual → IGV, IGV.js, or JBrowse.

Prerequisites

pysam ≥ 0.22 (PyPI or conda)
htslib ≥ 1.19 (linked automatically)

Core workflow

Open an alignment file with pysam.AlignmentFile.
Iterate and filter with Python (no shell calls).
Query by region with pileup / fetch.
Write optional BAM/CRAM output.
Close explicitly or use a context manager.

Code patterns

Open a BAM (read)

import pysam
"
bam = pysam.AlignmentFile("input.bam", "rb")

Open a SAM (auto-detect gzipped)

sam = pysam.AlignmentFile("input.sam", "r")

Open a indexed region

Access a small region without loading the whole BAM:

bam = pysam.AlignmentFile("input.bam", "rb")
for read in bam.fetch("chr1", 100_000, 200_000):
    print(read.query_name, read.reference_name, read.pos, read.cigartuples)

Iterate through all alignments

bam = pysam.AlignmentFile("input.bam", "rb")
for read in bam:
    # read is a pysam AlignedSegment
    if read.is_proper_pair:
        print(read.query_name, read.tlen)

Filter by FLAG in Python

def proper_paired(read):
    return read.is_paired and read.is_proper_pair

bam = pysam.AlignmentFile("input.bam", "rb")
proper = [r for r in bam if proper_paired(r)]

Extract MAPQ, CIGAR, cigarstring

for read in bam:
    print(read.query_name,
          read.mapq,
          read.cigartuples,   # [(0, 100)] = 100M
          read.cigarstring)   # "100M"

Extract the aligned sequence

for read in bam:
    if not read.is_unmapped:
        print(read.query_sequence[:50])  # First 50 bp

Access mate information

for read in bam:
    if read.is_paired and read.mate_is_unmapped:
        print(f"{read.query_name} mate is unmapped")

Pileup at a position

bam = pysam.AlignmentFile("input.bam", "rb")
for column in bam.pileup("chr1", 100_000, 100_001):
    for read in column.pileups:
        print(read.alignment.query_name)

Write a filtered BAM

with pysam.AlignmentFile("filtered.bam", "wb", header=bam.header) as out:
    for read in bam:
        if read.mapq >= 30 and read.is_proper_pair:
            out.write(read)

Iterating over a tabix-indexed BED/VCF

vcf = pysam.TabixFile("variants.vcf.gz")
for record in vcf.contigs:
    pass  # Use region query for actual records

Read from CRAM (requires reference)

cram = pysam.AlignmentFile("input.cram", "rc", reference_fasta="genome.fa")

Access the header as Python dict

header = bam.header.to_dict()
print(header["RG"])  # List of read groups

Convert SAM to Python objects

with pysam.AlignmentFile("input.sam", "r") as sam:
    for read in sam:
        # read is an AlignedSegment
        ...

Get index statistics without loading

index = pysam.IndexedReads(pysam.AlignmentFile("input.bam"))

The Index object is for random-access.

Create custom tags

bam = pysam.AlignmentFile("input.bam", "rb", fields=["NM", "MD", "RG"])
# Fields are automatically loaded if they exist

Check if a position is covered

bam = pysam.AlignmentFile("input.bam", "rb")
for col in bam.pileup("chr1", 100_000, 100_001):
    coverage = col.nsegments
    if coverage >= 10:
        print(f"Position covered: {coverage}")

Extract all reads overlapping a gene

# Use fetch for gene coordinates
bam = pysam.AlignmentFile("input.bam", "rb")
gene_reads = [r for r in bam.fetch("chr1", start, end)]

Read group filter

rg_bam = pysam.AlignmentFile("input.bam", "rb", header=header)
for read in rg_bam:
    # read.get_tag("RG") - read group from optional tags
    pass

Common pitfalls

Forgetting to close the file. Use a context manager (with ...) or close explicitly.
CIGAR interpretation. read.cigartuples returns (operation, length) tuples. 0 = M, 1 = I, 2 = D.
MAPQ zero means unmapped in this alignment, not zero quality.
Reference fasta for CRAM. Must be the exact FASTA used to create the CRAM.
Duplicate reads being filtered. samtools markdup -r is the modern way.
Pileup object is lazy. If you need it, materialize it (e.g., with list()) otherwise it's consumed by iteration.

Validation

After filtering, samtools flagstat filtered.bam should show the expected read count.
read_mapq should be >= your threshold.
read.is_proper_pair checks both ends.

Open alternatives

Need	Tool
Shell pipeline	`samtools view`, `samtools filter`
Variant calling	GATK / freebayes / bcftools
Coverage	`mosdepth`, `samtools depth`
Multi-sample	`bcftools isec`
Pysam in R	`Rsamtools` (Bioconductor)

References

pysam documentation: https://pysam.readthedocs.io/
htslib: http://www.htslib.org/
SAM spec: https://samtools.github.io/hts-specs/SAMv1.pdf
Companion: ors-bioinformatics-sequence-sam-bam-basics, ors-bioinformatics-sequence-bcftools-variant-manipulation.

Changelog

1.0.0 (2026-06-10): Initial adaptation by Pradyumna Jayaram from pysam-genomic-files (SciAgent-Skills).

skills/bioinformatics-sequence/pysam-genomics

pysam — BAM/CRAM Python Interface

When to use

When NOT to use

Prerequisites

Core workflow

Code patterns

Open a BAM (read)

Open a SAM (auto-detect gzipped)

Open a indexed region

Iterate through all alignments

Filter by FLAG in Python

Extract MAPQ, CIGAR, cigarstring

Extract the aligned sequence

Access mate information

Pileup at a position

Write a filtered BAM

Iterating over a tabix-indexed BED/VCF

Read from CRAM (requires reference)

Access the header as Python dict

Convert SAM to Python objects

Get index statistics without loading

Create custom tags

Check if a position is covered

Extract all reads overlapping a gene

Read group filter

Common pitfalls

Validation

Open alternatives

References

Changelog