Sequence Format Conversion
Format conversion looks trivial (call `SeqIO.convert` and you're done), but it isn't. FASTA is sequence-only; GenBank/EMBL/INSDC carry annotation (features, qualifiers, references); FASTQ carries quality. Crossing those boundaries is a one-way trip. This skill is the safe path: which conversions are lossless, which lose annotation, and how to batch the safe ones at scale.
When to use
- Switching between FASTA and GenBank for a tool that requires one or the other.
- Converting a directory of `.gb` → `.fasta` for BLAST.
- Converting EMBL → GenBank for NCBI submission compatibility.
- Stripping quality from FASTQ for tools that need FASTA.
When NOT to use
- Format conversions that need re-validation (e.g., a GenBank file you got from a third party; see the `bio-format-validation` skill from `read-qc`).
- Lossy round-trips where you actually need the annotation back: don't go FASTA → GenBank and expect features.
Prerequisites
- `biopython>=1.83`
- For large-scale batch: `seqkit` (Go) is far faster than Python.
Core workflow
- Decide if the conversion is lossless. Cross-format only when features don't matter, or when both formats are annotation-aware.
- Use `SeqIO.convert(src, src_fmt, dst, dst_fmt)`; it returns the record count and handles buffered I/O.
- For streaming, prefer parse + write when you need to filter or modify on the way through.
- Always set explicit formats — never rely on extension sniffing alone.
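The explicit-format rule and the record-count return value can be combined in a thin wrapper. What follows is a minimal sketch, not part of Biopython: `checked_convert`, `KNOWN_FORMATS`, and the optional `convert` parameter are hypothetical names introduced here. The `convert` hook is only there so the wrapper can be exercised with a stub; by default it calls `SeqIO.convert`.

```python
from typing import Callable, Optional

# Format names used in this skill; all are names Bio.SeqIO accepts.
KNOWN_FORMATS = {"fasta", "genbank", "embl", "fastq", "fastq-illumina"}


def checked_convert(
    src: str,
    src_fmt: str,
    dst: str,
    dst_fmt: str,
    convert: Optional[Callable[[str, str, str, str], int]] = None,
) -> int:
    """Convert with explicit formats and fail loudly on an empty result."""
    for fmt in (src_fmt, dst_fmt):
        if fmt not in KNOWN_FORMATS:
            raise ValueError(f"Unexpected format name: {fmt!r}")
    if convert is None:
        # Imported lazily so the wrapper can be exercised with a stub.
        from Bio import SeqIO
        convert = SeqIO.convert
    n = convert(src, src_fmt, dst, dst_fmt)
    if n == 0:
        raise RuntimeError(f"{src}: 0 records converted; is it really {src_fmt!r}?")
    return n
```

In a pipeline, `checked_convert("input.gb", "genbank", "out.fasta", "fasta")` would then either return a positive record count or stop the run.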
Lossless vs. lossy
| From → To | Sequence | Annotation | Quality |
|---|---|---|---|
| FASTA → GenBank | ✓ | ✗ (none to start) | ✗ |
| GenBank → FASTA | ✓ | ✗ (dropped) | ✗ |
| GenBank → EMBL | ✓ | ✓ | ✗ |
| EMBL → GenBank | ✓ | ✓ (with caveats) | ✗ |
| FASTQ → FASTA | ✓ | n/a | ✗ |
| FASTQ → FASTQ (re-encode) | ✓ | n/a | ✓ (re-encoded to Phred+33) |
| GFF3 → BED | n/a | partial (gene-level only) | n/a |
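The sequence-format rows of the table can be encoded as a small lookup, which is handy for warning before a lossy conversion runs. This is an illustrative sketch: `CARRIES` and `lost_on_conversion` are hypothetical names, and the model is deliberately coarse (it does not capture qualifier-level caveats such as those noted for EMBL → GenBank).

```python
# What each format can carry, following the table above.
CARRIES = {
    "fasta": {"sequence"},
    "genbank": {"sequence", "annotation"},
    "embl": {"sequence", "annotation"},
    "fastq": {"sequence", "quality"},
}


def lost_on_conversion(src_fmt: str, dst_fmt: str) -> set:
    """Return the kinds of data present in src_fmt that dst_fmt cannot hold."""
    return CARRIES[src_fmt] - CARRIES[dst_fmt]
```

For example, `lost_on_conversion("genbank", "fasta")` returns `{"annotation"}`, while `lost_on_conversion("genbank", "embl")` returns an empty set.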
Code patterns
GenBank → FASTA (annotation dropped)
from Bio import SeqIO
n = SeqIO.convert("input.gb", "genbank", "out.fasta", "fasta")
print(f"Converted {n} records (annotation discarded)")
FASTA → GenBank (skeleton record, no features)
from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq
recs = []
for r in SeqIO.parse("input.fasta", "fasta"):
    # FASTA carries no molecule type; the GenBank writer needs one.
    recs.append(SeqRecord(Seq(str(r.seq)), id=r.id, description=r.description,
                          annotations={"molecule_type": "DNA"}))
SeqIO.write(recs, "out.gb", "genbank")
Note: NCBI's tbl2asn will reject records without molecule_type and other minimal metadata. Add a translation table annotation if you'll submit.
EMBL → GenBank
from Bio import SeqIO
n = SeqIO.convert("input.embl", "embl", "out.gb", "genbank")
print(f"Converted {n} records (qualifiers mapped to GenBank qualifiers)")
EMBL → GenBank is lossy in edge cases (e.g., /translation table differences); always re-validate before submission.
FASTQ → FASTA (drop quality)
from Bio import SeqIO
SeqIO.write(SeqIO.parse("reads.fastq", "fastq"), "reads.fasta", "fasta")
FASTA → FASTQ (dummy Q-scores)
There's no canonical way — FASTQ requires quality. Common convention: write a constant Q40 (ASCII I):
from Bio import SeqIO

def fasta_to_fastq_q40(in_fa: str, out_fq: str):
    with open(out_fq, "w") as out:
        for r in SeqIO.parse(in_fa, "fasta"):
            # FASTQ needs one Phred score per base; 40 is a placeholder, not a measurement.
            r.letter_annotations["phred_quality"] = [40] * len(r.seq)
            SeqIO.write(r, out, "fastq")
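The "Q40 is ASCII I" convention follows from the Phred+33 encoding: the quality character is the score plus 33, and a score Q implies an error probability of 10^(-Q/10). The sketch below spells out that arithmetic without Biopython; the function names are illustrative only.

```python
PHRED33_OFFSET = 33  # offset used by standard (Sanger-style) FASTQ


def phred33_char(q: int) -> str:
    """ASCII character that encodes Phred score q under Phred+33."""
    return chr(q + PHRED33_OFFSET)


def error_probability(q: int) -> float:
    """Error probability implied by Phred score q: 10 ** (-q / 10)."""
    return 10 ** (-q / 10)


def dummy_quality_line(seq_len: int, q: int = 40) -> str:
    """Constant quality string of the given length."""
    return phred33_char(q) * seq_len


print(phred33_char(40))       # I
print(dummy_quality_line(4))  # IIII
```

A constant Q40 therefore claims roughly one error per 10,000 bases, a figure that was never measured, so downstream tools that weight by quality should be treated with care on such files.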
Batch convert a directory
from pathlib import Path
from Bio import SeqIO

out_dir = Path("fasta")
out_dir.mkdir(exist_ok=True)  # SeqIO.convert does not create the output directory
for gb in sorted(Path("genbank").glob("*.gb")):
    out = out_dir / (gb.stem + ".fasta")
    n = SeqIO.convert(str(gb), "genbank", str(out), "fasta")
    print(f"{gb.name} -> {out.name}: {n}")
Streaming with a filter (length ≥ 500 bp)
from Bio import SeqIO
def long_records(path, fmt, min_len=500):
    for r in SeqIO.parse(path, fmt):
        if len(r.seq) >= min_len:
            yield r

SeqIO.write(long_records("input.gb", "genbank"), "long_only.fasta", "fasta")
Re-encode legacy FASTQ to Phred+33
from Bio import SeqIO
records = SeqIO.parse("legacy.fastq", "fastq-illumina")
SeqIO.write(records, "modern.fastq", "fastq")
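What the re-encoding does to each quality character can be shown without Biopython. The sketch below assumes the legacy file is plain Phred+64 (what `fastq-illumina` denotes); older Solexa-scaled files use a different score definition and are not covered. `phred64_to_phred33` is a hypothetical helper name.

```python
def phred64_to_phred33(qual_line: str) -> str:
    """Re-encode one quality string from Phred+64 to Phred+33."""
    out = []
    for ch in qual_line:
        q = ord(ch) - 64  # decode with the legacy offset
        if q < 0:
            raise ValueError(f"{ch!r} is not valid Phred+64; wrong source encoding?")
        out.append(chr(q + 33))  # encode with the modern offset
    return "".join(out)


print(phred64_to_phred33("hh@"))  # II!
```

The `ValueError` branch is the useful part in practice: characters below `@` cannot occur in Phred+64, so hitting it suggests the file was already Phred+33.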
Common pitfalls
- GenBank round-trip loses qualifiers not mapped 1-to-1. Always diff before assuming a clean round-trip.
- `SeqIO.convert` can return 0 silently if the source format is wrong. Always `assert n > 0` in pipelines.
- FASTQ → FASTA can produce files with characters outside A/C/G/T (e.g., `N`) if the FASTQ had ambiguity codes. Decide whether to clean first.
- `molecule_type` annotation missing → GenBank writers will warn or reject.
- Don't write features onto a `Seq` from a `SeqIO.parse(..., "fasta")` record: features belong to the record, not the `Seq`, and the record's `features` list is empty by design.
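For the ambiguity-code pitfall, a quick census of non-A/C/G/T characters helps decide whether cleaning is needed before conversion. This is a minimal sketch on plain strings; `non_acgt_counts` is a hypothetical helper, and in a Biopython loop it would be called as `non_acgt_counts(str(record.seq))`.

```python
from collections import Counter


def non_acgt_counts(seq: str) -> Counter:
    """Count characters in seq other than A, C, G, T (case-insensitive)."""
    return Counter(ch for ch in seq.upper() if ch not in "ACGT")


print(non_acgt_counts("ACGTNNRacgt"))  # Counter({'N': 2, 'R': 1})
```

An empty `Counter` means the sequence is unambiguous; anything else is a prompt to filter, mask, or accept the ambiguity explicitly.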
Validation
- After conversion, `grep -c '^>' out.fasta` equals the record count.
- For GenBank → FASTA → GenBank, compare `Bio.SeqIO.parse()` record counts and IDs.
- For FASTQ re-encoding, re-parse with `fastq` and confirm the `letter_annotations["phred_quality"]` values are all 40 in the dummy case.
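The header-count and ID checks can also be done in Python when `grep` is not available. The sketch below works on any iterable of lines (an open file handle, for instance); `fasta_ids` and `count_fasta_headers` are illustrative names. It takes the ID to be the first whitespace-delimited token after `>`, which matches how Biopython assigns `record.id` for FASTA.

```python
from typing import Iterable, List


def fasta_ids(lines: Iterable[str]) -> List[str]:
    """Return the ID (first token after '>') of every FASTA header line."""
    ids = []
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            ids.append(fields[0] if fields else "")
    return ids


def count_fasta_headers(lines: Iterable[str]) -> int:
    """Equivalent of grep -c '^>' on the given lines."""
    return len(fasta_ids(lines))
```

A typical check would open `out.fasta`, pass the handle to `count_fasta_headers`, and compare the result with the `n` returned by `SeqIO.convert`.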
Open alternatives
| Need | Tool |
|---|---|
| Bulk FASTA → GenBank for tbl2asn | NCBI tbl2asn |
| Sequence-only format conversion at scale | seqkit fq2fa (FASTQ → FASTA); seqkit convert re-encodes FASTQ quality (Go, very fast) |
| FASTA/FASTQ inspection as a table | seqkit fx2tab |
| Annotation-aware conversion | gff3_to_genbank from pycbio |
References
- Biopython convert formats table: https://biopython.org/wiki/SeqIO#File_Formats
- INSDC feature table: https://www.insdc.org/submitting-standards/feature-table/
- Companion: `ors-bioinformatics-sequence-batch-sequence-processing`, `ors-bioinformatics-sequence-read-write-sequences`.
Changelog
- 1.0.0 (2026-06-10): Initial adaptation by Pradyumna Jayaram from `bio-format-conversion` (bioSkills-main/sequence-io/format-conversion).
