Sequence Properties
Six numbers summarize most sequences at a glance: length, GC fraction, molecular weight, melting temperature (DNA), isoelectric point (protein), and instability index (protein). This skill covers the canonical computations, the formula citations, and the current (2026) API in Bio.SeqUtils (Biopython 1.83+).
When to use
- Pre-alignment sanity checks (a sequence with 90% Ns is not real).
- Primer/probe design: Tm, GC, hairpin, dimer.
- Protein characterization: MW, pI, instability, aromaticity, GRAVY.
- Genome composition: GC content by window, by chromosome, by gene.
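As a minimal sketch of the pre-alignment sanity check listed above, the helper below reports the fraction of N characters and flags sequences above a cutoff. The function names and the 0.5 default cutoff are assumptions made for illustration; they are not part of Biopython.

```python
def n_fraction(seq: str) -> float:
    """Fraction of positions that are N (case-insensitive)."""
    if not seq:
        return 0.0
    return seq.upper().count("N") / len(seq)


def looks_real(seq: str, max_n: float = 0.5) -> bool:
    """Hypothetical gate: reject empty sequences and those dominated by Ns."""
    return bool(seq) and n_fraction(seq) <= max_n


print(n_fraction("ATGCNNNNNN"))             # 6 of 10 positions are N
print(looks_real("N" * 90 + "ACGTACGTAC"))  # 90% N, so this prints False
```

Pick the cutoff to suit the data; heavily masked assemblies may need a looser one.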
When NOT to use
- Production primer design → use primer3-py (handles salt, oligo concentration, mismatches).
- Genome-wide GC → use bedtools nuc or computeGCBias (deepTools).
- Protein domain detection → use InterProScan or pyhmmer.
Prerequisites
- biopython>=1.83
- For protein: Bio.SeqUtils.ProtParam
- For DNA/RNA: Bio.SeqUtils (nt utilities)
Core workflow
- Identify the molecule type (DNA, RNA, protein).
- Pick the right utility module: ProtParam is for protein, the rest of Bio.SeqUtils for DNA/RNA.
- Compute the canonical properties for that type.
- Cite the formula in your methods section.
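The first two workflow steps can be sketched as an alphabet check followed by a lookup of which properties apply. Both function names below are hypothetical helpers, not Biopython calls, and the heuristic is deliberately crude: a short peptide made only of A, C, G and T would be misread as DNA, so pass the molecule type explicitly whenever you know it.

```python
DNA_LETTERS = set("ACGTN")
RNA_LETTERS = set("ACGUN")


def guess_molecule_type(seq: str) -> str:
    """Hypothetical heuristic: classify a sequence by its alphabet."""
    letters = set(seq.upper())
    if not letters:
        raise ValueError("empty sequence")
    if letters <= DNA_LETTERS:
        return "DNA"
    if letters <= RNA_LETTERS:
        return "RNA"
    return "protein"


def properties_for(seq: str) -> list[str]:
    """Map the guessed type to the properties this skill computes for it."""
    if guess_molecule_type(seq) == "protein":
        return ["length", "MW", "pI", "instability index", "aromaticity", "GRAVY"]
    return ["length", "GC fraction", "MW", "Tm"]


print(guess_molecule_type("ATGCATGC"))
print(properties_for("MGEKLPVRLNVMGYEEDILKQHKWLRNVQTLKDGIVFVD"))
```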
Code patterns
DNA / RNA: GC, MW, Tm
from Bio.Seq import Seq
from Bio.SeqUtils import gc_fraction, molecular_weight

s = Seq("ATGCATGCATGCATGCATGC")
print(f"length: {len(s)}")
print(f"GC: {gc_fraction(s):.3f}")  # 0.500
# molecular_weight defaults to single-stranded DNA; request the duplex explicitly
mw_double = molecular_weight(s, seq_type="DNA", double_stranded=True)
print(f"ds MW: {mw_double:.1f} Da")
DNA Tm (Wallace rule for short oligos)
For very short oligos (<14 nt), Tm in °C is roughly 2 * (A+T) + 4 * (G+C), where each term is a base count:
def tm_wallace(seq: str) -> float:
    s = seq.upper()
    # 2 °C per A/T and 4 °C per G/C
    return float(2 * (s.count("A") + s.count("T")) + 4 * (s.count("G") + s.count("C")))
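Because the rule only holds for short oligos, a guarded variant can refuse longer input instead of silently returning a misleading number. This is a sketch under assumptions: the name tm_wallace_checked and the max_len parameter are hypothetical, and the 13 nt default simply mirrors the "<14 nt" limit stated above.

```python
def tm_wallace_checked(seq: str, max_len: int = 13) -> float:
    """Wallace-rule Tm in °C; raises for oligos longer than max_len."""
    s = seq.upper()
    if len(s) > max_len:
        raise ValueError(
            f"{len(s)} nt is too long for the Wallace rule; use a nearest-neighbor model"
        )
    at = s.count("A") + s.count("T")
    gc = s.count("G") + s.count("C")
    return float(2 * at + 4 * gc)


oligo = "ATGCATGCATGC"  # 12 nt: 6 A/T and 6 G/C
print(tm_wallace_checked(oligo))  # 2*6 + 4*6 = 36.0
```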
DNA Tm (Marmur / nearest-neighbor for longer oligos)
Use primer3-py for production work — see the primer-design reference. The
nearest-neighbor model (SantaLucia 1998) is the 2026 standard.
# In production: primer3-py
import primer3
tm = primer3.calc_tm("ATGCATGCATGCATGC")
Protein: MW, pI, instability, aromaticity, GRAVY
from Bio.Seq import Seq
from Bio.SeqUtils.ProtParam import ProteinAnalysis
pa = ProteinAnalysis(str(Seq("MGEKLPVRLNVMGYEEDILKQHKWLRNVQTLKDGIVFVD")))
print(f"length: {len(pa.sequence)}")
print(f"MW: {pa.molecular_weight():.1f} Da")
print(f"pI: {pa.isoelectric_point():.2f}")
print(f"instability_index: {pa.instability_index():.1f}") # <40 stable
print(f"aromaticity: {pa.aromaticity():.3f}")
print(f"gravy: {pa.gravy():.3f}") # Grand average of hydropathy
Amino acid percent composition
print(pa.get_amino_acids_percent())
# {'A': 0.07, 'C': 0.02, ...}
Secondary structure fraction (Chou-Fasman)
print(pa.secondary_structure_fraction()) # (helix, turn, sheet)
GC content by sliding window
from Bio.Seq import Seq
from Bio.SeqUtils import gc_fraction

def gc_window(seq: Seq, window: int = 100, step: int = 50):
    # Returns (start, GC fraction) for every full window along seq
    s = str(seq)
    out = []
    for i in range(0, len(s) - window + 1, step):
        out.append((i, gc_fraction(Seq(s[i:i + window]))))
    return out
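For long sequences with small steps, recounting every window gets slow. The sketch below computes the same window starts with a running G/C count, so each window costs one subtraction. It is pure Python rather than a Biopython call, the name gc_window_prefix is hypothetical, and it divides by the full window length, so windows containing Ns will differ from the default gc_fraction result.

```python
from itertools import accumulate


def gc_window_prefix(seq: str, window: int = 100, step: int = 50):
    """(start, GC fraction) per full window, using a prefix sum of G/C counts."""
    s = seq.upper()
    # prefix[i] holds the number of G or C characters in s[:i]
    prefix = [0] + list(accumulate(1 if b in "GC" else 0 for b in s))
    return [
        (i, (prefix[i + window] - prefix[i]) / window)
        for i in range(0, len(s) - window + 1, step)
    ]


print(gc_window_prefix("GGGGAAAACCCC", window=4, step=4))
# [(0, 1.0), (4, 0.0), (8, 1.0)]
```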
CpG observed/expected
CpG O/E = (CpG count) / (C count × G count / N), where N is the sequence length. Used in vertebrate
methylation studies.
def cpg_oe(s: str) -> float:
    s = s.upper()
    c = s.count("C")
    g = s.count("G")
    cg = s.count("CG")
    n = len(s)  # N is the total sequence length, not C + G
    if c == 0 or g == 0:
        return 0.0
    return cg * n / (c * g)
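A small worked example makes the ratio concrete. The sketch below takes N as the full sequence length, as the formula above reads; the function name and the two toy sequences are illustrative only.

```python
def cpg_observed_expected(seq: str) -> float:
    """CpG O/E = CpG count * N / (C count * G count), with N = len(seq)."""
    s = seq.upper()
    n = len(s)
    c, g, cg = s.count("C"), s.count("G"), s.count("CG")
    if c == 0 or g == 0:
        return 0.0
    return cg * n / (c * g)


seq_a = "CGCGATAT"  # 2 CpG, 2 C, 2 G, N = 8
seq_b = "CCGGATAT"  # 1 CpG, 2 C, 2 G, N = 8
print(cpg_observed_expected(seq_a))  # 2 * 8 / (2 * 2) = 4.0
print(cpg_observed_expected(seq_b))  # 1 * 8 / (2 * 2) = 2.0
```

Both sequences have the same base composition; only the dinucleotide arrangement changes the ratio, which is the point of the statistic.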
DNA / RNA / protein ambiguity
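By default (ambiguous="remove"), gc_fraction drops N and other ambiguous bases from the denominator. If you need a denominator that
includes Ns, pass ambiguous="ignore" or normalize manually:
def gc_with_n(s: str) -> float:
    s = s.upper()
    # Count Ns in the denominator, but never as G or C
    denom = sum(1 for b in s if b in "ACGTNU")
    gc = sum(1 for b in s if b in "GC")
    return gc / denom if denom else 0.0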
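The choice of denominator matters most on masked or gappy sequence. The two functions below are pure-Python stand-ins written for this comparison (not Biopython's implementation), and the names are hypothetical.

```python
def gc_excluding_n(seq: str) -> float:
    """GC over unambiguous bases only (N left out of the denominator)."""
    s = seq.upper()
    denom = sum(1 for b in s if b in "ACGTU")
    gc = sum(1 for b in s if b in "GC")
    return gc / denom if denom else 0.0


def gc_including_n(seq: str) -> float:
    """GC over unambiguous bases plus N."""
    s = seq.upper()
    denom = sum(1 for b in s if b in "ACGTUN")
    gc = sum(1 for b in s if b in "GC")
    return gc / denom if denom else 0.0


masked = "ATGCNNNN"
print(gc_excluding_n(masked))  # 2 / 4 = 0.5
print(gc_including_n(masked))  # 2 / 8 = 0.25
```

State which denominator you used when reporting GC for assemblies with long N runs.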
Common pitfalls
- Bio.SeqUtils.GC vs gc_fraction. In 1.80+, the canonical function is gc_fraction(seq). Older code uses GC(seq). The new function returns a fraction (0 to 1); the old one returned a percentage (0 to 100).
- Wallace Tm is wrong for oligos > 14 nt. Use nearest-neighbor (SantaLucia 1998) for production.
- Tm calculation is salt-, oligo-concentration-, and Mg2+-dependent. A "Tm" without conditions is meaningless. primer3-py handles this.
- Instability index is for in vitro stability of a purified protein, not cellular half-life. Don't conflate.
- GRAVY negative = hydrophilic, positive = hydrophobic. The sign convention trips people up.
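The GRAVY sign convention is easy to check by hand. The sketch below averages Kyte-Doolittle hydropathy values over a sequence; it is a standalone illustration, not the ProteinAnalysis implementation, and the function name is hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard residues
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy_sketch(protein: str) -> float:
    """Mean hydropathy per residue; positive means hydrophobic overall."""
    s = protein.upper()
    if not s:
        raise ValueError("empty sequence")
    return sum(KYTE_DOOLITTLE[aa] for aa in s) / len(s)


print(gravy_sketch("IIVVLL") > 0)  # aliphatic residues: hydrophobic, prints True
print(gravy_sketch("KKDDEE") < 0)  # charged residues: hydrophilic, prints True
```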
Validation
- Length: len(seq) == len(str(seq)).
- GC: 0 ≤ GC ≤ 1.
- pI: amino acid distribution implies a pI; verify against the expected range for the protein class (e.g., basic proteins have pI > 7).
- MW: protein MW = sum of residue masses + 18 (water). ProteinAnalysis accounts for this.
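The MW check can be reproduced by hand for a short peptide. The sketch below uses approximate average residue masses rounded to two decimals; treat the table as illustrative and rely on ProteinAnalysis for real work. The function name is hypothetical.

```python
# Approximate average residue masses in Da (amino acid minus one water)
AVG_RESIDUE_MASS = {
    "G": 57.05, "A": 71.08, "S": 87.08, "P": 97.12, "V": 99.13,
    "T": 101.10, "C": 103.14, "L": 113.16, "I": 113.16, "N": 114.10,
    "D": 115.09, "Q": 128.13, "K": 128.17, "E": 129.12, "M": 131.19,
    "H": 137.14, "F": 147.18, "R": 156.19, "Y": 163.18, "W": 186.21,
}
WATER = 18.02  # one water for the free N- and C-termini


def protein_mw_sketch(protein: str) -> float:
    """Sum of residue masses plus one water, in Da."""
    return sum(AVG_RESIDUE_MASS[aa] for aa in protein.upper()) + WATER


print(round(protein_mw_sketch("GG"), 2))  # 2 * 57.05 + 18.02 = 132.12
```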
Open alternatives
| Need | Tool |
|---|---|
| Production primer Tm | primer3-py |
| Genome-wide GC | bedtools nuc, deepTools computeGCBias |
| Protein domain + pI + GO | InterProScan, UniProt |
| Codon usage | Bio.SeqUtils.CodonUsage (see codon-usage skill) |
References
- Biopython ProtParam: https://biopython.org/docs/latest/api/Bio.SeqUtils.ProtParam.html
- SantaLucia 1998 nearest-neighbor Tm: https://doi.org/10.1073/pnas.95.4.1460
- ExPASy ProtParam: https://web.expasy.org/protparam/
- Companion: ors-bioinformatics-sequence-codon-usage, ors-bioinformatics-sequence-reverse-complement.
Changelog
- 1.0.0 (2026-06-10): Initial adaptation by Pradyumna Jayaram from bio-sequence-properties (bioSkills-main/sequence-manipulation/sequence-properties).
