skills/bioinformatics-sequence/compressed-sequence-files

stars:0
forks:0
watches:0
last updated:N/A

Compressed Sequence Files (gzip / bzip2 / BGZF)

Every modern sequencing deliverable is compressed, but only BGZF is indexable. If you ever need to seek into a .fasta.gz from Python — for SeqIO.index(), region queries, or tabix-style lookups — you must convert it to BGZF first. This skill covers the three formats you'll actually meet, the 'rt'/'wt' text-mode trap, and when to reach for htslib bgzip instead.

When to use

  • Reading or writing .fasta.gz, .fastq.gz, .fasta.bz2 in Python.
  • Producing or consuming indexable compressed FASTA (BGZF).
  • Building pipelines that need random access into compressed records.

When NOT to use

"

  • Plain uncompressed FASTA — open with SeqIO.parse(path, "fasta") directly.
  • BAM/CRAM/VCF BGZF: use pysam (see ors-bioinformatics-sequence-pysam-genomics).
  • Whole-genome-scale index needs → use samtools faidx + bgzip, not Biopython.

Prerequisites

  • biopython>=1.83 (ships Bio.bgzf).
  • For production BGZF: htslib ≥ 1.19 (bgzip + tabix).
  • For bzip2: Python's stdlib bz2.

Core workflow

  1. Open in text mode ('rt' / 'wt'). The parser expects text. 'rb' will trip on a bytes-vs-str TypeError.
  2. Pick the format by use case:
    • .gz → archive only, not indexable.
    • .bz2 → smaller but slower; archive only.
    • .bgz → indexable via SeqIO.index(); native for BAM/VCF/tabix.
  3. For new pipelines, prefer BGZF unless you have a hard reason to use plain gzip.
  4. For BAM/VCF, always use htslib bgzip — its blocks are HTSlib-compatible; Biopython's BGZF works but is slower on huge files.

Code patterns

Read gzipped FASTA

import gzip
from Bio import SeqIO

with gzip.open("sequences.fasta.gz", "rt") as fh:
    for rec in SeqIO.parse(fh, "fasta"):
        print(rec.id, len(rec.seq))

Read bzip2

import bz2
from Bio import SeqIO

with bz2.open("reads.fastq.bz2", "rt") as fh:
    recs = list(SeqIO.parse(fh, "fastq"))

Write gzipped output

import gzip
from Bio import SeqIO

with gzip.open("clean.fasta.gz", "wt") as out:
    SeqIO.write(records, out, "fasta")

Read BGZF (text path) and via bgzf.open

from Bio import SeqIO, bgzf

# Either works
for rec in SeqIO.parse("seqs.fasta.bgz", "fasta"):
    ...

with bgzf.open("seqs.fasta.bgz", "rt") as fh:
    for rec in SeqIO.parse(fh, "fasta"):
        ...

Write BGZF for downstream indexing

from Bio import SeqIO, bgzf

with bgzf.open("indexable.fasta.bgz", "wt") as out:
    SeqIO.write(SeqIO.parse("raw.fasta", "fasta"), out, "fasta")

Index a BGZF file (random access by record ID)

from Bio import SeqIO

db = SeqIO.index("seqs.fasta.bgz", "fasta")
print(db["target_id"].seq)
db.close()

# Or persistent on-disk index for huge files
db = SeqIO.index_db("seqs.idx", "seqs.fasta.bgz", "fasta")

Auto-detect by extension (single entry point)

from pathlib import Path
import gzip, bz2
from Bio import SeqIO, bgzf

def open_seq(path: str, fmt: str):
    p = Path(path)
    sfx = "".join(p.suffixes).lower()
    if sfx.endswith(".bgz") or sfx.endswith(".bgzf"):
        fh = bgzf.open(p, "rt")
    elif sfx.endswith(".gz") or sfx.endswith(".bgzf"):
        fh = gzip.open(p, "rt")
    elif sfx.endswith(".bz2"):
        fh = bz2.open(p, "rt")
    else:
        fh = open(p, "r")
    return SeqIO.parse(fh, fmt)

Convert plain gzip to indexable BGZF

import gzip
from Bio import SeqIO, bgzf

with gzip.open("input.fasta.gz", "rt") as src, \
     bgzf.open("indexable.fasta.bgz", "wt") as dst:
    SeqIO.write(SeqIO.parse(src, "fasta"), dst, "fasta")

Low-memory record count (FASTA)

import gzip
from Bio.SeqIO.FastaIO import SimpleFastaParser

with gzip.open("sequences.fasta.gz", "rt") as fh:
    n = sum(1 for _ in SimpleFastaParser(fh))
print(f"{n} sequences")

Common pitfalls

  • Using 'rb' instead of 'rt'. You'll get TypeError: a bytes-like object is required, not 'str'.
  • Expecting SeqIO.index() to work on plain .gz. It does not — convert to BGZF first.
  • Mixing Biopython BGZF with htslib bgzip blocks. They are wire-compatible for FASTA, but for .vcf.gz/.bcf use htslib bgzip end-to-end to avoid rare edge cases.
  • Decompressing to disk then re-reading. Wastes 3-4x disk. Stream from the compressed handle.
  • .bz2 is slow. For long-term archives, yes; for active work, prefer BGZF.

Validation

  • File ends with the correct magic: gzip1f 8b, bzip242 5a, BGZF → 1f 8b 08 04 (with extra fields).
  • After BGZF write, SeqIO.index("out.bgz", "fasta")["known_id"] returns the same record.
  • Round-trip: bgzf open → gzip open byte counts differ only by BGZF block overhead.

Open alternatives

NeedOpen toolWhy
Indexable compressed FASTAhtslib bgzip (≥ 1.19)BAM/VCF-compatible blocks
Fast stats on .fasta.gzseqkit stats -j 8 *.fasta.gzMuch faster than Python
Random access by regionsamtools faidx regions.fa.gz + bgzipStandard, battle-tested
BGZF block dumpbgzip -b 0 -d out.fa.gzInspect block boundaries

References

Changelog

  • 1.0.0 (2026-06-10): Initial adaptation by Pradyumna Jayaram from bio-compressed-files (bioSkills-main/sequence-io/compressed-files).
    Good AI Tools