skills/bioinformatics-sequence/pysam-genomics
stars:0
forks:0
watches:0
last updated:N/A
pysam — BAM/CRAM Python Interface
pysamis the Pythonic interface to htslib: read BAM/CRAM/VCF/TBF, iterate alignments, call variants, and build custom pipelines. The 2026 reality: pysam 0.22+ pairs with htslib 1.19+ and is the low-level I/O engine behind GATK, freebayes, and most variant callers. If you need to programmatically iterate through NGS files, this is the canonical way.
When to use
- Programmatically iterate BAM alignments with Python logic.
- Extract variant or coverage data from a BAM.
- Build a custom pipeline that needs alignment-level access.
- Query regions from indexed BAM/CRAM without loading a whole file.
When NOT to use
- You need a BAM stats summary → use
samtools statsormosdepth. - You need variant calling → use GATK/HaplotypeCaller or freebayes.
- Visual → IGV, IGV.js, or JBrowse.
Prerequisites
pysam≥ 0.22 (PyPI or conda)htslib≥ 1.19 (linked automatically)
Core workflow
- Open an alignment file with
pysam.AlignmentFile. - Iterate and filter with Python (no shell calls).
- Query by region with pileup / fetch.
- Write optional BAM/CRAM output.
- Close explicitly or use a context manager.
Code patterns
Open a BAM (read)
import pysam
"
bam = pysam.AlignmentFile("input.bam", "rb")
Open a SAM (auto-detect gzipped)
sam = pysam.AlignmentFile("input.sam", "r")
Open a indexed region
Access a small region without loading the whole BAM:
bam = pysam.AlignmentFile("input.bam", "rb")
for read in bam.fetch("chr1", 100_000, 200_000):
print(read.query_name, read.reference_name, read.pos, read.cigartuples)
Iterate through all alignments
bam = pysam.AlignmentFile("input.bam", "rb")
for read in bam:
# read is a pysam AlignedSegment
if read.is_proper_pair:
print(read.query_name, read.tlen)
Filter by FLAG in Python
def proper_paired(read):
return read.is_paired and read.is_proper_pair
bam = pysam.AlignmentFile("input.bam", "rb")
proper = [r for r in bam if proper_paired(r)]
Extract MAPQ, CIGAR, cigarstring
for read in bam:
print(read.query_name,
read.mapq,
read.cigartuples, # [(0, 100)] = 100M
read.cigarstring) # "100M"
Extract the aligned sequence
for read in bam:
if not read.is_unmapped:
print(read.query_sequence[:50]) # First 50 bp
Access mate information
for read in bam:
if read.is_paired and read.mate_is_unmapped:
print(f"{read.query_name} mate is unmapped")
Pileup at a position
bam = pysam.AlignmentFile("input.bam", "rb")
for column in bam.pileup("chr1", 100_000, 100_001):
for read in column.pileups:
print(read.alignment.query_name)
Write a filtered BAM
with pysam.AlignmentFile("filtered.bam", "wb", header=bam.header) as out:
for read in bam:
if read.mapq >= 30 and read.is_proper_pair:
out.write(read)
Iterating over a tabix-indexed BED/VCF
vcf = pysam.TabixFile("variants.vcf.gz")
for record in vcf.contigs:
pass # Use region query for actual records
Read from CRAM (requires reference)
cram = pysam.AlignmentFile("input.cram", "rc", reference_fasta="genome.fa")
Access the header as Python dict
header = bam.header.to_dict()
print(header["RG"]) # List of read groups
Convert SAM to Python objects
with pysam.AlignmentFile("input.sam", "r") as sam:
for read in sam:
# read is an AlignedSegment
...
Get index statistics without loading
index = pysam.IndexedReads(pysam.AlignmentFile("input.bam"))
The Index object is for random-access.
Create custom tags
bam = pysam.AlignmentFile("input.bam", "rb", fields=["NM", "MD", "RG"])
# Fields are automatically loaded if they exist
Check if a position is covered
bam = pysam.AlignmentFile("input.bam", "rb")
for col in bam.pileup("chr1", 100_000, 100_001):
coverage = col.nsegments
if coverage >= 10:
print(f"Position covered: {coverage}")
Extract all reads overlapping a gene
# Use fetch for gene coordinates
bam = pysam.AlignmentFile("input.bam", "rb")
gene_reads = [r for r in bam.fetch("chr1", start, end)]
Read group filter
rg_bam = pysam.AlignmentFile("input.bam", "rb", header=header)
for read in rg_bam:
# read.get_tag("RG") - read group from optional tags
pass
Common pitfalls
- Forgetting to close the file. Use a context manager (
with ...) or close explicitly. - CIGAR interpretation.
read.cigartuplesreturns(operation, length)tuples.0= M,1= I,2= D. - MAPQ zero means unmapped in this alignment, not zero quality.
- Reference fasta for CRAM. Must be the exact FASTA used to create the CRAM.
- Duplicate reads being filtered.
samtools markdup -ris the modern way. - Pileup object is lazy. If you need it, materialize it (e.g., with
list()) otherwise it's consumed by iteration.
Validation
- After filtering,
samtools flagstat filtered.bamshould show the expected read count. read_mapqshould be >= your threshold.read.is_proper_pairchecks both ends.
Open alternatives
| Need | Tool |
|---|---|
| Shell pipeline | samtools view, samtools filter |
| Variant calling | GATK / freebayes / bcftools |
| Coverage | mosdepth, samtools depth |
| Multi-sample | bcftools isec |
| Pysam in R | Rsamtools (Bioconductor) |
References
- pysam documentation: https://pysam.readthedocs.io/
- htslib: http://www.htslib.org/
- SAM spec: https://samtools.github.io/hts-specs/SAMv1.pdf
- Companion:
ors-bioinformatics-sequence-sam-bam-basics,ors-bioinformatics-sequence-bcftools-variant-manipulation.
Changelog
- 1.0.0 (2026-06-10): Initial adaptation by Pradyumna Jayaram from
pysam-genomic-files(SciAgent-Skills).
