skills/bioinformatics-sequence/sam-bam-basics

stars:0
forks:0
watches:0
last updated:N/A

SAM / BAM File Basics

Every aligner produces SAM (text) or BAM (binary). The format is the lingua franca of short-read genomics. If you understand the SAM line format — QNAME through the optional fields — you can debug anything: why reads are unmapped, why a variant isn't callable, or why your count matrix has fewer reads than expected. This skill is the 101.

When to use

  • Understanding why reads aligned (or didn't).
  • Filtering BAMs by FLAG, MAPQ, CIGAR.
  • Extracting specific read subsets for downstream tools.
  • Converting SAM ↔ BAM and adding custom tags.

When NOT to use

  • You need BAM statistics — see bam-statistics skill.
  • You need to filter duplicates — see duplicate-handling skill.
  • Visualizing alignments — use IGV, samtools tview, or a browser.

Prerequisites

  • samtools ≥ 1.19
  • For programmatic work: pysam (see pysam-genomics skill)

The SAM line format

QNAME FLAG RNAME POS CIGAR SEQ QUAL TAGS
FieldContentExample
QNAMERead nameHISEQ:XX:123:CGT...
FLAGBitwise OR flags99 = paired + proper + reverse + mate-reverse
RNAMEReferencechr1, * if unmapped
POS1-based leftmost12345678, 0 if unmapped
CIGARAlignment100M, 50S50M, * for unmapped
SEQSequenceACGT..., * if no SEQ
QUALPhred qualityIIII..., * if no QUAL
TAGSOptional fieldsNM:i:5 MD:Z:50 GA:G...

The canonical optional tags

TagTypeMeaning
NMiEdit distance (total mismatches + indels)
MDZMD:Z string for SNPs
RGZRead group (from input @RG)
ASiAlignment score
XSAStrand for RNA (+ or -)
XPZAlternative alignment positions
YIZOld read group (Illumina)
MCZMate CIGAR
MQiMate MAPQ
PGZProgram that produced this alignment

Common FLAGs (binary decode)

Use samtools flags 99 to decode or encode:

FlagBinaryMeaning
1000000000001paired
2000000000010proper paired
4000000000100unmapped
8000000001000mate unmapped
16000000010000reverse strand
32000000100000mate reverse strand
64000001000000first in pair
128000010000000second in pair
256000100000000secondary alignment
512001000000000QC fail
1024010000000000duplicate
2048100000000000supplementary alignment

Code patterns

View a BAM as SAM (text)

samtools view -h input.bam | head -20

Convert SAM to BAM

samtools view -bS input.sam > output.bam

Extract mapped reads

samtools view -F 4 -b input.bam > mapped.bam
# -F 4 = exclude FLAG 4 (unmapped)

Extract unmapped reads

samtools view -f 4 -b input.bam > unmapped.bam
# -f 4 = require FLAG 4 (unmapped)

Extract properly paired reads

samtools view -F 4 -f 2 -b input.bam > proper.bam

Extract reads with MAPQ >= 30

samtools view -q 30 -b input.bam > q30.bam

Extract primary alignments only

samtools view -F 256 -b input.bam > primary.bam

Extract alignments to chromosome 1

samtools view -b chr1 input.bam > chr1.bam

Extract reads by read name

samtools view -h input.bam | grep "read_name" | samtools view -bS - > reads.bam

Convert BED to SAM/BAM (via BEDTools if needed)

BEDTools: bedtools bedtobam -bed12 input.bed -g genome.txt > output.bam

Decode a FLAG to readable string

samtools flags 99
# 99 = PAIRED|PROPER_PAIRED|REVERSE|MREVERSE

Sort by name (for fixmate)

samtools sort -n input.bam -o input.name.bam
samtools index input.name.bam

Sort by coordinate (default)

samtools sort -o input.coord.bam input.bam
samtools index input.coord.bam

Build a header from scratch

samtools view -H input.bam
# Shows @HD, @SQ, @RG, @PG lines

Add a read group to the header

samtools addreplacerg -r "@RG\tID:sample1\tSM:sample1\tPL:ILLUMINA" -o out.bam in.bam
samtools index out.bam

Extract strand-specific information

For RNA-seq, the XS tag indicates strand:

samtools view input.bam | awk '$20 ~ /XS:A:[+-]/ {print $20}'

Coordinate systems

SystemStartEndUsed by
SAM/BAM (POS)1-basedinclusiveSAM/BAM
BED0-basedexclusiveUCSC, BEDTools
VCF1-basedinclusiveVCF
Python0-basedexclusiveAll Python libraries
GFF31-basedinclusiveGFF3

Common pitfalls

  • 0-based vs 1-based confusion. SAM POS is 1-based. Python slice is 0-based. Convert explicitly.
  • CIGAR interpretation. 50M = 50 matches. 10S40M = 10 soft-clipped + 40 matches.
  • Primary/secondary alignment. FLAG 256 means secondary (split read aligned elsewhere). Often not what you want.
  • Supplementary. FLAG 2048 means supplementary alignment (chimeric).
  • Duplicate markings are recommendations, not truth. samtools markdup is the canonical; confirm before filtering.
  • MAPQ = 255 from bowtie2 means "aligner can't calculate a good MAPQ" — not "very high confidence".

Validation

  • After any filter, use samtools flagstat to check read counts.
  • After samtools view -F 4, no reads should have FLAG bit 4.
  • Coordinate-sorted BAM must be indexed (samtools index) before samtools tview or region queries.

Open alternatives

NeedTool
PAF (minimap2 output)Convert to SAM with paftools, or use minimap2 -a for SAM
CRAM (compressed)samtools view -C
HTSJDK (JVM)Use Picard from GATK
Genome-wide statisticssamtools stats, mosdepth

References

Changelog

  • 1.0.0 (2026-06-10): Initial adaptation by Pradyumna Jayaram from bio-sam-bam-basics (bioSkills-main/alignment-files/sam-bam-basics).
    Good AI Tools