skills/bioinformatics-sequence/read-write-sequences
Read and Write Sequence Files
SeqIOis the I/O workhorse of Biopython. Three functions cover 90% of the work:parse(streaming),index(random access by ID), andwrite(single or batch). The remainder isconvertfor the times you don't need to touch the records. This skill is the cheat sheet I keep open.
When to use
- Streaming reads from FASTA/FASTQ/GenBank/EMBL.
- Random access to a record by ID (huge files).
- Writing records back out, optionally after modification.
- Format conversion without touching record content.
When NOT to use
- BAM/SAM/CRAM/VCF → use
pysam(seeors-bioinformatics-sequence-pysam-genomics). - gzipped BAM/CRAM with random access →
pysam.AlignmentFile. - For >10 GB files, consider
pyfastx(faster than Biopython for FASTA).
Prerequisites
biopython>=1.83- For indexed random access on BGZF FASTA, Biopython 1.83+ supports
index()on.bgz.
Format strings (cheat sheet)
| Format | String | Notes |
|--------|--------|-------|"
| FASTA | "fasta" | Sequence only, single-line or wrapped |
| FASTQ (Sanger / Illumina 1.8+) | "fastq" | Phred+33 |
| FASTQ (Illumina 1.3-1.7) | "fastq-illumina" | Phred+64 |
| FASTQ (Solexa) | "fastq-solexa" | Solexa+64 |
| GenBank | "genbank" | Annotation + sequence |
| EMBL | "embl" | EBI submission format |
| SwissProt | "swiss" | UniProt curated |
| UniProt XML | "uniprot-xml" | EBI download |
| GFF3 | "gff3" | Genome features, not sequence records |
| GTF | "gtf" | Gene transfer format |
| ABI | "abi" | Sanger trace |
| Phylip | "phylip" | Alignment |
| Clustal | "clustal" | Alignment |
| Stockholm | "stockholm" | Pfam/Rfam |
| Tab-delimited | "tab" | ID, seq, optional qualifiers |
Code patterns
Stream parse
from Bio import SeqIO
for rec in SeqIO.parse("input.fasta", "fasta"):
# rec.id, rec.description, rec.seq
...
List parse (small files only)
records = list(SeqIO.parse("small.fasta", "fasta"))
Indexed random access (huge FASTA)
from Bio import SeqIO
db = SeqIO.index("large.fasta", "fasta")
seq = db["target_id"].seq
db.close()
For persistent index (no rebuild on next open):
db = SeqIO.index_db("large.idx", "large.fasta", "fasta")
Indexed BGZF random access
db = SeqIO.index("indexable.fasta.bgz", "fasta")
Write records (single file, multiple records)
from Bio import SeqIO
SeqIO.write(records, "out.fasta", "fasta")
The return value is the number of records written.
Write one record at a time
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq
with open("out.fasta", "w") as fh:
rec = SeqRecord(Seq("ACGT"), id="seq1", description="example")
SeqIO.write(rec, fh, "fasta")
Append records across formats (streaming)
from Bio import SeqIO
def stream_to_fasta(in_path, in_fmt, out_path):
with open(out_path, "w") as out:
for rec in SeqIO.parse(in_path, in_fmt):
# modify rec if needed
SeqIO.write(rec, out, "fasta")
Convert without touching records
from Bio import SeqIO
n = SeqIO.convert("in.gb", "genbank", "out.fasta", "fasta")
print(f"Converted {n} records")
Handle FASTA with wrapped multi-line sequences
Biopython auto-handles wrapped FASTA in parse(); record.seq is always the full sequence.
Parse GenBank with features
from Bio import SeqIO
for rec in SeqIO.parse("annotation.gb", "genbank"):
for feat in rec.features:
if feat.type == "CDS":
gene = feat.qualifiers.get("gene", ["?"])[0]
print(gene, feat.location)
Parse BLAST tab output
from Bio import SeqIO
for qresult in SeqIO.parse("blast.tab", "blast-tab"):
for hsp in qresult:
...
blast-tab parses the tabular -outfmt 6.
Truncate IDs to first whitespace
def short_id(rec):
rec.id = rec.id.split()[0]
rec.description = ""
return rec
Common pitfalls
- Mixing parse() with index() and forgetting to close the index. Index files hold file handles; use a context manager or explicit
close(). - Indexed access on a
.fasta.gz(plain gzip) fails. Convert to BGZF first. - Write format mismatch. If you parsed with
"fastq-illumina"and wrote with"fastq", Biopython re-encodes qualities. Usually fine, but worth being explicit. - Mismatched record count from
SeqIO.convert. Returns 0 silently if source format is wrong.assert n > 0. - Header pollution. FASTA identifiers stop at first whitespace; multi-word descriptions go into
description. If you write a record with bothid="seq1 chr1"and a description, downstream tools will seeseq1as the ID.
Validation
n_records_in == n_records_outfor round-trips.- For indexed access:
db["known_id"].seq == expected_seq. - For GenBank round-trips:
len(features_in) == len(features_out), modulo per-format feature filtering.
Open alternatives
| Need | Tool |
|---|---|
| Faster FASTA parsing | pyfastx |
| Format conversion at scale | seqkit convert, bioawk |
| Random access to huge FASTA | samtools faidx + bgzip |
| GFF/GTF parsing | gffutils, pyranges |
References
- Biopython SeqIO: https://biopython.org/docs/latest/api/Bio.SeqIO.html
- Format list: https://biopython.org/wiki/SeqIO#File_Formats
- Companion:
ors-bioinformatics-sequence-format-conversion,ors-bioinformatics-sequence-compressed-sequence-files.
Changelog
- 1.0.0 (2026-06-10): Initial adaptation by Pradyumna Jayaram from
bio-read-sequencesandbio-write-sequences(bioSkills-main/sequence-io/).
