STAR RNA-seq Alignment
STAR (Spliced Transcripts Alignment to a Reference) is the de facto standard for RNA-seq alignment: sensitive to novel junctions, fast, and the default in nf-core/rnaseq. As of 2026, STAR 2.7.11+ has improved memory management, and 2-pass mapping is standard for sensitive analyses. The cost: ~40 GB RAM for human genome index generation.
When to use
- Bulk RNA-seq with novel junction discovery (most cases).
- Single-cell RNA-seq (STARsolo).
- Chimeric / fusion detection (with --chimOutType).
- Long-read cDNA / ONT direct RNA (use --alignEndsType Extend...).
When NOT to use
- Genome with no annotation and small memory budget → use HISAT2 (this category) or minimap2 -ax splice.
- DNA-seq → use bwa-mem or bwa-mem2.
- Quantification-only → use pseudo-alignment (salmon / kallisto).
Prerequisites
- STAR ≥ 2.7.11
- samtools ≥ 1.19
- Reference FASTA + GTF
- ~40 GB RAM and 50 GB disk for human index
Core workflow
- Generate the genome index with STAR --runMode genomeGenerate.
- Align reads with STAR --runMode alignReads.
- 2-pass mapping for sensitive novel junction discovery: the first pass produces an SJ.out.tab, and the second pass uses it.
- Sort the BAM (STAR can output a sorted BAM directly with --outSAMtype BAM SortedByCoordinate).
- Index the BAM with samtools index.
Code patterns
Generate the genome index (one-time, ~40 GB RAM for human)
STAR --runMode genomeGenerate \
--runThreadN 16 \
--genomeDir star_index/ \
--genomeFastaFiles reference/genome.fa \
--sjdbGTFfile reference/genes.gtf \
--sjdbOverhang 149 \
--genomeSAindexNbases 14
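Use --sjdbOverhang 149 for 150 bp reads; for other read lengths, set it to read_length - 1 (for paired-end data, the length of one mate; for variable-length reads, the longest read).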
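The read_length - 1 rule can be derived from the data rather than guessed. The sketch below is only an illustration: sjdb_overhang is a hypothetical helper name, and it assumes plain four-line FASTQ records on stdin.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of STAR): print (longest read length - 1)
# for four-line FASTQ records read from stdin.
sjdb_overhang() {
  awk 'NR % 4 == 2 { if (length($0) > max) max = length($0) }
       END { v = (max > 0 ? max - 1 : 0); print v }'
}

# Tiny synthetic demo: reads of 10 bp and 6 bp.
printf '@r1\nACGTACGTAC\n+\nIIIIIIIIII\n@r2\nACGTAC\n+\nIIIIII\n' | sjdb_overhang
# prints 9
```

On real data, sampling the first reads is usually enough, for example `zcat reads_R1.fq.gz | head -n 400000 | sjdb_overhang` (the file name is a placeholder).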
Paired-end alignment (most common)
STAR --runMode alignReads \
--runThreadN 16 \
--genomeDir star_index/ \
--readFilesIn reads_R1.fq.gz reads_R2.fq.gz \
--readFilesCommand zcat \
--outFileNamePrefix sample1/ \
--outSAMtype BAM SortedByCoordinate \
--outSAMattrRGline ID:sample1 SM:sample1 PL:ILLUMINA LB:lib1 \
--quantMode GeneCounts
Output:
- sample1/Aligned.sortedByCoord.out.bam: coordinate-sorted BAM
- sample1/ReadsPerGene.out.tab: gene-level counts (use this for DESeq2/edgeR)
- sample1/SJ.out.tab: splice junctions
- sample1/Log.final.out: alignment stats
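ReadsPerGene.out.tab has four tab-separated columns (gene ID, unstranded counts, counts for reads on the same strand as the gene, counts for reads on the opposite strand) and begins with four summary rows (N_unmapped, N_multimapping, N_noFeature, N_ambiguous). Before passing it to DESeq2 or edgeR you need the one column that matches your library's strandedness. A minimal sketch, where star_counts is a hypothetical helper and the demo values are made up:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print gene ID plus one count column.
# $1 = ReadsPerGene.out.tab, $2 = unstranded | forward | reverse
star_counts() {
  local file=$1 col
  case $2 in
    unstranded) col=2 ;;
    forward)    col=3 ;;
    reverse)    col=4 ;;
    *) echo "unknown strandedness: $2" >&2; return 1 ;;
  esac
  # Skip the four N_* summary rows at the top of the file.
  awk -v c="$col" 'BEGIN { FS = OFS = "\t" } NR > 4 { print $1, $c }' "$file"
}

# Demo on a tiny synthetic file (gene IDs and counts are invented).
demo=$(mktemp)
printf 'N_unmapped\t10\t10\t10\nN_multimapping\t5\t5\t5\nN_noFeature\t3\t40\t4\nN_ambiguous\t1\t0\t1\ngeneA\t100\t2\t98\ngeneB\t50\t1\t49\n' > "$demo"
star_counts "$demo" reverse
```

If the strandedness is unknown, comparing the N_noFeature values across columns 3 and 4 is a common heuristic: the column with far fewer no-feature reads usually matches the protocol.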
2-pass mapping for novel junction discovery
# Pass 1: align and collect splice junctions
STAR --runMode alignReads ... --outFileNamePrefix pass1/
mv pass1/SJ.out.tab pass1_SJ.tab
# Pass 2: re-align; STAR inserts the pass-1 junctions into the original index on the fly
STAR --runMode alignReads \
  --genomeDir star_index/ \
  --sjdbFileChrStartEnd pass1_SJ.tab \
  ... --outFileNamePrefix pass2/
Alternatively, regenerate the index with the new SJ file (pass --sjdbFileChrStartEnd to genomeGenerate) and re-align against it. For a single sample, --twopassMode Basic runs both passes in one command.
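Some workflows filter the first-pass junctions before the second pass to keep low-confidence junctions out. This is a common practice rather than something this skill prescribes. In the sketch below, filter_sj is a hypothetical helper and the default threshold of 3 unique reads is an illustrative assumption. It relies on the SJ.out.tab column layout: column 5 is the intron motif (0 = non-canonical), column 6 is annotation status (0 = unannotated), column 7 is the uniquely mapped read count.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep novel, canonical junctions with enough unique support.
# Reads SJ.out.tab rows on stdin; $1 = minimum uniquely mapped reads (default 3).
filter_sj() {
  local min_unique=${1:-3}
  awk -v m="$min_unique" '$6 == 0 && $5 > 0 && $7 >= m'
}

# Synthetic demo rows: only the first passes (novel, canonical, 5 unique reads).
printf 'chr1\t100\t200\t1\t1\t0\t5\t0\t30\nchr1\t300\t400\t2\t0\t0\t9\t0\t25\nchr1\t500\t600\t1\t1\t0\t1\t0\t12\nchr1\t700\t800\t1\t1\t1\t50\t0\t40\n' | filter_sj 3
```

Typical use would be `filter_sj 3 < pass1/SJ.out.tab > pass1_SJ.filtered.tab`, with the filtered file then given to --sjdbFileChrStartEnd (the output file name is a placeholder).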
Single-cell RNA-seq (STARsolo)
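# STARsolo expects the cDNA read first and the barcode (CB+UMI) read second;
# this assumes R1 carries the barcode, as in 10x libraries.
# whitelist.txt is a placeholder for the barcode whitelist matching your chemistry.
STAR --runMode alignReads \
  --genomeDir star_index/ \
  --readFilesIn sc_R2.fastq.gz sc_R1.fastq.gz \
  --readFilesCommand zcat \
  --soloType CB_UMI_Simple \
  --soloCBwhitelist whitelist.txt \
  --soloCBstart 1 --soloCBlen 16 \
  --soloUMIstart 17 --soloUMIlen 10 \
  --soloBarcodeReadLength 0 \
  --soloCellFilter EmptyDrops_CR \
  --outFileNamePrefix sc_outs/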
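A quick sanity check on the barcode geometry is to confirm that the barcode reads are long enough to contain the configured cell barcode and UMI. The sketch below is only an illustration: check_barcode_geometry is a hypothetical helper, and it assumes four-line FASTQ records on stdin.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count barcode reads too short for the configured CB + UMI.
# Args mirror --soloCBstart, --soloCBlen, --soloUMIstart, --soloUMIlen (1-based starts).
check_barcode_geometry() {
  local cb_start=$1 cb_len=$2 umi_start=$3 umi_len=$4
  local cb_end=$(( cb_start + cb_len - 1 ))
  local umi_end=$(( umi_start + umi_len - 1 ))
  local need=$(( cb_end > umi_end ? cb_end : umi_end ))
  # Sequence lines are every 4th line starting at line 2.
  awk -v need="$need" 'NR % 4 == 2 && length($0) < need { short++ }
       END { print short + 0 }'
}

# Toy demo: CB at 1-4 and UMI at 5-6, so reads need at least 6 bases.
printf '@r1\nACGTAC\n+\nIIIIII\n@r2\nACGT\n+\nIIII\n' | check_barcode_geometry 1 4 5 2
# prints 1
```

For the 16 + 10 layout used above, the call would be `zcat sc_R1.fastq.gz | head -n 40000 | check_barcode_geometry 1 16 17 10`; a non-zero count suggests the geometry does not match the library.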
Long-read cDNA / ONT
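# STAR's default build caps read length; long reads generally need the STARlong binary that ships with STAR.
# ExtendSoftClip is not a documented --alignEndsType value; Local (the default) allows soft-clipping.
STAR --runMode alignReads \
  --genomeDir star_index/ \
  --readFilesIn long_reads.fq.gz \
  --readFilesCommand zcat \
  --outFilterMismatchNmax 5 \
  --outFilterMatchNmin 10 \
  --alignEndsType Local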
Output sorted BAM and skip the post-alignment sort
STAR ... --outSAMtype BAM SortedByCoordinate ...
This writes the BAM sorted and lets you skip samtools sort. You still need samtools index.
Multi-sample alignment (sequential shell loop)
for r1 in reads/*_R1.fq.gz; do
  base=$(basename "$r1" _R1.fq.gz)
  r2="reads/${base}_R2.fq.gz"
  mkdir -p "star_out/${base}"
  STAR --runMode alignReads \
    --runThreadN 8 \
    --genomeDir star_index/ \
    --readFilesIn "$r1" "$r2" \
    --readFilesCommand zcat \
    --outFileNamePrefix "star_out/${base}/" \
    --outSAMtype BAM SortedByCoordinate \
    --outSAMattrRGline "ID:${base}" "SM:${base}" PL:ILLUMINA \
    --quantMode GeneCounts
done
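The loop above assumes every R1 file has a matching R2. A minimal pre-flight sketch that lists only complete pairs is shown below; list_paired_samples is a hypothetical helper, and the _R1.fq.gz / _R2.fq.gz naming is taken from the loop.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print sample names that have both mates in directory $1.
list_paired_samples() {
  local dir=$1 r1 base
  for r1 in "$dir"/*_R1.fq.gz; do
    [ -e "$r1" ] || continue              # glob matched nothing
    base=$(basename "$r1" _R1.fq.gz)
    if [ -e "$dir/${base}_R2.fq.gz" ]; then
      echo "$base"
    else
      echo "skipping ${base}: no R2 file" >&2
    fi
  done
}

# Demo with empty placeholder files: sampleA is complete, sampleB lacks R2.
demo_dir=$(mktemp -d)
touch "$demo_dir/sampleA_R1.fq.gz" "$demo_dir/sampleA_R2.fq.gz" "$demo_dir/sampleB_R1.fq.gz"
list_paired_samples "$demo_dir"
```

The printed names can drive the alignment loop, for example `list_paired_samples reads | while read -r base; do ...; done`.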
Index the output BAM
samtools index sample1/Aligned.sortedByCoord.out.bam
Extract splice junctions for visualization
# Convert STAR SJ.out.tab to 6-column BED for IGV (strand column: 0 = ".", 1 = "+", 2 = "-")
awk 'BEGIN{OFS="\t"} $1 !~ /^#/ {s = ($4 == 1 ? "+" : ($4 == 2 ? "-" : ".")); print $1, $2-1, $3, ".", $7, s}' SJ.out.tab > sj.bed
Important flags
| Flag | Purpose |
|---|---|
| --runThreadN | Threads |
| --genomeDir | STAR index directory |
| --readFilesCommand zcat | Decompress .gz |
| --outSAMtype BAM SortedByCoordinate | Sorted BAM output |
| --outSAMattrRGline | Read group |
| --quantMode GeneCounts | Per-gene read counts |
| --outFilterMismatchNoverLmax 0.1 | Max 10% mismatches (default 0.3) |
| --outFilterMultimapNmax 1 | Unique alignments only (for some workflows) |
| --alignSJoverhangMin 8 | Min overhang for splice junction |
| --twopassMode Basic | Per-sample 2-pass mapping in one run (for multi-sample 2-pass, pass first-pass junctions via --sjdbFileChrStartEnd) |
| --chimOutType SeparateSAMold | Chimeric alignments (fusion detection); requires --chimSegmentMin > 0 |
Common pitfalls
- Not enough RAM for human index. STAR --runMode genomeGenerate needs ~40 GB. --limitGenomeGenerateRAM only sets the ceiling STAR may use; to lower the requirement, raise --genomeSAsparseD, at the cost of slower mapping.
- Forgetting --readFilesCommand zcat for .fq.gz inputs. STAR will try to read them as text and fail.
- Leaving --sjdbOverhang at a value that does not match your read length. Set it to read_length - 1 for best sensitivity.
- Skipping 2-pass mapping. The first pass discovers novel junctions; the second pass realigns with them and is more sensitive. For novel transcript discovery, 2-pass is essential.
- STARsolo barcode config. --soloCBstart / --soloCBlen and --soloUMIstart / --soloUMIlen must match your library prep. Misconfiguration gives 0% barcodes-in-cells.
Validation
- samtools flagstat sample1/Aligned.sortedByCoord.out.bam: expect a high mapping rate (≥ 85% for bulk RNA-seq). STAR omits unmapped reads from the BAM unless --outSAMunmapped Within is set, so without it flagstat reports close to 100%.
- grep "Uniquely mapped reads %" sample1/Log.final.out: ideally ≥ 75%, and at least 70% for most samples.
- ReadsPerGene.out.tab should have non-zero counts for known reference genes.
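The uniquely-mapped check can be scripted so that a batch of samples is screened automatically. The sketch below assumes the usual Log.final.out layout, where each line is "label | value" with a tab after the bar. Both function names are hypothetical, and the 70% default mirrors the threshold above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the uniquely mapped percentage (without the % sign).
uniquely_mapped_pct() {
  awk -F'[|]' '/Uniquely mapped reads %/ { gsub(/[ \t%]/, "", $2); print $2 }' "$1"
}

# Hypothetical helper: succeed if the percentage is at least $2 (default 70).
check_mapping_rate() {
  local pct
  pct=$(uniquely_mapped_pct "$1")
  awk -v p="$pct" -v t="${2:-70}" 'BEGIN { exit !(p + 0 >= t + 0) }'
}

# Demo on a synthetic two-line log (values are invented).
log=$(mktemp)
printf '%s\t%s\n' '   Number of input reads |' '1000000' \
                  '   Uniquely mapped reads % |' '92.50%' > "$log"
uniquely_mapped_pct "$log"
# prints 92.50
check_mapping_rate "$log" 70 && echo "OK" || echo "LOW"
```

A missing or malformed log yields an empty percentage, which the check treats as 0 and therefore as a failure.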
Open alternatives
| Need | Tool |
|---|---|
| Lower memory footprint | HISAT2 |
| Pseudo-alignment (faster) | salmon, kallisto |
| Long-read cDNA | minimap2 -ax splice:hq |
| Production bulk RNA-seq | nf-core/rnaseq (default STAR) |
| Single-cell RNA-seq | nf-core/scrnaseq (default STARsolo), cellranger (10x) |
References
- STAR paper: Dobin et al. 2013, Bioinformatics, doi:10.1093/bioinformatics/bts635
- STAR 2.7.11+ release notes: https://github.com/alexdobin/STAR/releases
- Companion: ors-bioinformatics-sequence-hisat2-alignment, ors-bioinformatics-omics-rna-seq-count-matrix-qc.
Changelog
- 1.0.0 (2026-06-10): Initial adaptation by Pradyumna Jayaram from bio-star-alignment (bioSkills-main/read-alignment/star-alignment).
