skills/chemoinformatics/virtual-screening
Version Compatibility
Reference examples tested with: AutoDock Vina 1.2.5+, SMINA 2020-12+, GNINA 1.1+, RDKit 2024.09+, meeko 0.5+, P2Rank 2.4+, ProDy 2.4+, pdb2pqr 3.6+.
Before using code patterns, verify installed versions match. If versions differ:
- Python:
pip show <package>thenhelp(module.function)to check signatures - CLI:
vina --version;gnina --version;smina --version
If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
Virtual Screening
Screen chemical libraries against protein targets via molecular docking. Vina is the de-facto default, SMINA adds flexibility (Vinardo scoring, custom scoring), and GNINA adds CNN-based pose scoring. Deep-learning docking (DiffDock-L, EquiBind, NeuralPLexer) competes in pose accuracy but fails physical-plausibility tests (PoseBusters) ~50% of the time; the postdoc-grade workflow combines ML pose sampling with classical scoring and physical validation. For ultralarge libraries (>1M), library preparation, hierarchical filtering, and HPC orchestration become the limiting steps.
For pose physical-validity QC, see chemoinformatics/pose-validation. For ML-driven docking + rescoring, see chemoinformatics/docking-rescoring. For covalent docking, see chemoinformatics/covalent-design. For affinity calculations (FEP), see chemoinformatics/free-energy-calculations.
Docking Tool Taxonomy
| Tool | Scoring | Speed (sec/lig) | Best at | Fails when |
|---|---|---|---|---|
| AutoDock Vina 1.2 | Vina (empirical) | 5-30 | Default, well-validated | Cross-dock; cryptic pockets; metal centers |
| SMINA | Vina + flexible + custom | 5-30 | Custom scoring; flexible side chains | Same Vina-scoring caveats |
| Vinardo | Vinardo (improved Vina) | 5-30 | Better scoring than Vina on DUD-E | Limited adoption |
| GNINA 1.1 | CNN (default) or Vina | 30-120 | Pose ranking, redocking | Slower; GPU recommended |
| AutoDock 4 | AD4 + grid maps | 30-60 | Legacy reference | Slower than Vina |
| DOCK 6/7 | DOCK + Amber | 60-300 | UCSF DOCK ecosystem | Steep learning curve |
| Glide (Schrodinger) | GlideScore | 5-30 | Commercial SOTA | License cost |
| GOLD (CCDC) | GOLDScore / ChemScore | 60-180 | Commercial; metal coordination | License cost |
| FlexX (BioSolveIT) | FlexX | 30-60 | Fragment-based | License cost |
| rDock | rDock | 30-60 | Open-source alternative | Slower than Vina |
| DiffDock-L | Diffusion-generative | 5 (GPU) | Pose sampling for cross-docking | High PB-invalid rate (~50%) |
| EquiBind | Equivariant NN | <1 (GPU) | Single-shot pose | Lowest accuracy in PoseBuster benchmarks |
| Boltz-2 + GNINA rescore | Foundation model + CNN | 10 (GPU) | Modern SOTA hybrid | High GPU; not all proteins |
Decision: For most screens, GNINA 1.1 with CNN scoring is the modern default (better than Vina on every benchmark; 30s/ligand on GPU). For >1M library scale, hierarchical Vina -> GNINA -> MM/GBSA. For cross-docking (predicted target structure or apo-holo), GNINA's CNN scoring transfers better than Vina.
Decision Tree by Scenario
| Scenario | Recommended workflow |
|---|---|
| Self-dock against known ligand pocket | GNINA gnina --cnn_scoring rescore |
| Cross-dock to apo or related-target structure | DiffDock-L pose + GNINA rescore + PoseBusters |
| Ultralarge library (10M+) | Vina hierarchical: pre-filter (Lipinski, PAINS) -> Vina dock -> top 1% GNINA rescore -> top 0.1% MM/GBSA |
| Cryptic pocket / induced fit | Ensemble docking (10 conformer snapshots) + AlphaFold3 or Boltz-1 holo prediction |
| Allosteric / undefined site | P2Rank for pocket detection -> ensemble dock all pockets |
| Metal-coordinated ligand | GOLD (commercial) or manually parameterize Vina metal scoring |
| Covalent inhibitor | See chemoinformatics/covalent-design: DOCKovalent, HCovDock |
| Fragment screen (<300 Da) | rDock or constrained Vina with seed atoms |
| Hit-to-lead refinement | Use co-crystal structure if available; MD-relaxed receptor; FEP for affinity |
Receptor Preparation
Goal: Convert a protein PDB into a docking-ready format with correct protonation, missing atoms, and removed waters.
Approach: Strip ligands and waters -> fill missing atoms (PROPKA or pdbfixer) -> add hydrogens at pH 7.4 (PDB2PQR / Reduce) -> assign partial charges -> convert to PDBQT (Open Babel).
import subprocess
def prepare_receptor(pdb_in, pdbqt_out, pH=7.4, remove_het=True):
base = pdb_in.replace('.pdb', '')
fixed = f'{base}_fixed.pdb'
propka_pdb = f'{base}_propka.pdb'
if remove_het:
with open(pdb_in) as fin, open(f'{base}_clean.pdb', 'w') as fout:
for line in fin:
if line.startswith('HETATM') and 'HOH' in line:
continue
if line.startswith('HETATM'):
continue
fout.write(line)
subprocess.run(['pdb2pqr', '--ff=AMBER', f'--with-ph={pH}',
f'{base}_clean.pdb', propka_pdb], check=True)
subprocess.run(['obabel', propka_pdb, '-O', pdbqt_out,
'-xr', '--partialcharge', 'gasteiger'], check=True)
return pdbqt_out
Common pitfall: Forgetting to add hydrogens at protein pH (7.4) but using pH 7.0 ligand charges. Hist mistakenly protonated. Use PROPKA + manual review of catalytic residues.
Ligand Preparation
Goal: Generate a 3D, docking-ready ligand file from SMILES with appropriate protonation and conformation.
Approach: Protonate at pH 7.4 with Chem.MolFromSmiles then rdMolStandardize.Uncharger -> embed 3D with ETKDGv3 -> minimize with MMFF94 -> assign Gasteiger charges -> write PDBQT with meeko.
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.MolStandardize import rdMolStandardize
from meeko import MoleculePreparation, PDBQTWriterLegacy
def prepare_ligand(smiles, pdbqt_out):
mol = Chem.MolFromSmiles(smiles)
uncharger = rdMolStandardize.Uncharger()
mol = uncharger.uncharge(mol)
mol = Chem.AddHs(mol)
AllChem.EmbedMolecule(mol, AllChem.ETKDGv3())
AllChem.MMFFOptimizeMolecule(mol)
mk_prep = MoleculePreparation()
setups = mk_prep.prepare(mol)
pdbqt_text, is_ok, err = PDBQTWriterLegacy.write_string(setups[0])
if not is_ok:
raise RuntimeError(f'meeko PDBQT export failed: {err}')
with open(pdbqt_out, 'w') as f:
f.write(pdbqt_text)
return pdbqt_out
meeko (AutoDock developers' tool) handles torsion tree creation, rotamer flagging, and PDBQT writing -- preferred over Open Babel's PDBQT writer. Note: meeko 0.5+ separated the writer (PDBQTWriterLegacy) from MoleculePreparation; older code using prep.write_pdbqt_file() is deprecated.
Binding Site Detection
When the binding pocket is not known (apo target, novel allosteric site):
| Tool | Approach | Output |
|---|---|---|
| P2Rank | ML on protein surface descriptors | Ranked pocket list with center coords |
| fpocket | Voronoi tessellation | Pocket descriptor list |
| DoGSiteScorer | Geometric + drugability | Pocket list with score |
| AutoSite (Vina) | Affinity map clustering | Pocket centers |
| AlphaFill | Co-fold known ligand | Plausible binding site |
prank predict -f receptor.pdb -o pockets/
P2Rank output <receptor>_predictions.csv lists pocket centers with scores. Pocket 1 (highest score) is typically the orthosteric site for known protein families.
Vina Docking (Single Ligand)
# AutoDock Vina Python API requires Vina 1.2+; for Vina 1.1 use subprocess CLI:
# subprocess.run(['vina', '--receptor', ..., '--ligand', ..., '--center_x', ...], check=True)
from vina import Vina
def dock_single(receptor_pdbqt, ligand_pdbqt, center, box_size,
exhaustiveness=8, n_poses=10):
v = Vina(sf_name='vina')
v.set_receptor(receptor_pdbqt)
v.set_ligand_from_file(ligand_pdbqt)
v.compute_vina_maps(center=center, box_size=box_size)
v.dock(exhaustiveness=exhaustiveness, n_poses=n_poses)
return v.energies(), v.poses()
Exhaustiveness:
- 8: default, ~30s/ligand, acceptable for screening
- 16: ~60s/ligand, lead-like prioritization
- 32: ~2 min/ligand, top-hit re-docking
- 64: ~4 min/ligand, production / co-crystal comparison
Vina pose RMSD lower bound rmsd_lb and upper bound rmsd_ub are NOT pose-vs-reference RMSDs; they are bounds on the conformational space sampled. Use external RMSD to a reference pose for accuracy QC.
GNINA with CNN Scoring (modern default)
gnina -r receptor.pdb -l ligand.sdf \
--autobox_ligand reference_ligand.sdf \
--cnn_scoring rescore \
-o poses.sdf.gz \
--num_modes 9 --exhaustiveness 8
--cnn_scoring:
rescore(default): Vina sampling + CNN scoring (best validated)refinement: CNN refines top Vina posemetrorefine: Metropolis-style CNN refinement (slower, ~10% better pose accuracy)none: Vina-only (use SMINA equivalently)
--autobox_ligand: define box from reference ligand SDF/PDB. Otherwise specify --center_x/y/z + --size_x/y/z.
Critical: GNINA's CNN was trained on PDBbind 2019; performance on novel chemotypes outside this distribution is reduced. Validate with a known co-crystal redock if possible.
Virtual Screening Pipeline (Hierarchical)
Goal: Screen 10M-compound library down to top-1k candidates for follow-up.
Approach: Three-stage filter:
import pandas as pd
from concurrent.futures import ProcessPoolExecutor
# Stub helpers to be implemented per project; see the cross-referenced skills.
def drug_like_filter(df):
raise NotImplementedError('See chemoinformatics/admet-prediction (Lipinski+Veber+PAINS)')
def vina_dock(smi, receptor_pdbqt, center, box):
raise NotImplementedError('Wrap dock_single() above; return best affinity')
def gnina_rescore(smi, receptor_pdbqt, center, box):
raise NotImplementedError('Wrap gnina --cnn_scoring rescore subprocess call')
def pose_validate(df):
raise NotImplementedError('See chemoinformatics/pose-validation (PoseBusters)')
def vs_pipeline(library_smi, receptor_pdbqt, center, box, output_dir, n_workers=16):
df = pd.read_csv(library_smi)
df_stage1 = drug_like_filter(df) # ~10x reduction
with ProcessPoolExecutor(max_workers=n_workers) as ex:
affinities = list(ex.map(
lambda smi: vina_dock(smi, receptor_pdbqt, center, box),
df_stage1['smiles']))
df_stage1['vina_affinity'] = affinities
df_stage2 = df_stage1.nsmallest(int(len(df_stage1) * 0.01), 'vina_affinity')
df_stage2['gnina_affinity'] = df_stage2['smiles'].apply(
lambda smi: gnina_rescore(smi, receptor_pdbqt, center, box))
df_stage3 = df_stage2.nsmallest(1000, 'gnina_affinity')
return pose_validate(df_stage3)
For 10M-compound libraries (ZINC22, Enamine REAL), use slurm-orchestrated parallel docking. Single-node GPU GNINA: ~3000 ligands/day. SLURM cluster (50 nodes): ~150k ligands/day.
Ultralarge Library Screening (ZINC22, Enamine REAL)
| Library | Size | Format | Notes |
|---|---|---|---|
| ZINC22 | 4.1B | 3D SMILES + xyz | Pre-built; 37GB tranches |
| Enamine REAL | 29B | SMILES | Make-on-demand; via Enamine |
| Enamine HTS | 4M | SMILES + SDF | Stock for HTS |
| Mcule | 25M | SMILES | Real, in-stock |
| ChEMBL | 2.5M | SMILES + bioactivity | Activity-labeled |
For ultralarge VS:
- Pre-filter to drug-like (Lipinski + Ro5 + PAINS) -> typically 10-20x reduction
- Pre-filter by 2D similarity to known actives (Tanimoto >= 0.4 via ECFP4) -> 100x reduction
- Vina dock the filtered subset
- Rescore top 1% with GNINA
- Rescore top 0.1% with MM/GBSA or FEP
Lyu et al. 2019 ZINC15 ultralarge VS gold standard: 138M docked, top 96 hits picked, 13 nM-µM actives.
Per-Tool Failure Modes
Vina -- cross-dock failure
Trigger: Receptor structure not the holo (co-crystal with ligand from another binder).
Mechanism: Vina pose accuracy degrades 2-3x from self-dock (RMSD<=2A 70%) to cross-dock (RMSD<=2A 30%).
Symptom: Top-ranked pose makes no geometric sense; key contacts missing.
Fix: GNINA CNN scoring or ensemble docking. For genuine apo, predict holo with AlphaFold3 / Boltz-1 then dock.
GNINA CNN -- novel chemotype out-of-distribution
Trigger: Ligand chemotype not in PDBbind training.
Mechanism: CNN scoring overfits to PDBbind chemotypes; novel macrocycle / peptide / PROTAC scores poorly.
Symptom: Affinity prediction far worse than Vina alone.
Fix: Use --cnn_scoring rescore (sampling still by Vina) rather than CNN sampling. Validate against co-crystal of close analog.
Box too small
Trigger: Binding box defined tightly around small ligand reference.
Mechanism: Vina explores only within the box; large analogs cannot fit. " Symptom: Many ligands report "no valid pose"; chemotype-biased hits.
Fix: Add padding 5-10 A beyond reference ligand bounding box. For unknown ligand size, use 25x25x25 A.
Multi-pocket protein -- wrong site
Trigger: Protein has multiple binding sites (orthosteric + allosteric).
Mechanism: P2Rank or AutoBox picks the most "drugable" pocket; not always the desired one.
Symptom: Hits dock in wrong pocket; SAR confusing.
Fix: Verify pocket from co-crystal data; explicitly set center_x/y/z from known ligand centroid.
DiffDock-L -- PoseBusters invalid
Trigger: Default DiffDock-L output for any receptor.
Mechanism: Diffusion generates poses without physical-validity loss; ~50% fail planarity / stereochemistry / vdW overlap tests.
Symptom: Poses look reasonable but fail PoseBusters checks.
Fix: Filter to PB-valid (PoseBusters); rescore with GNINA. See chemoinformatics/pose-validation.
Wrong ionization state
Trigger: Ligand or receptor residues protonated incorrectly at pH 7.4.
Mechanism: Aspartate/glutamate/histidine protonation depends on local environment; default protonation may be wrong.
Symptom: Salt bridges missing; poses misranked.
Fix: Run PROPKA on receptor (cabinet residue pKas); for catalytic His, manually inspect rotamer state.
Reconciliation: Vina vs GNINA Disagreement
| Vina top pose | GNINA top pose | Action |
|---|---|---|
| Same pose, similar score | Same pose, similar score | Trust pose; use GNINA score for ranking |
| Vina top pose ≠ GNINA top pose | Same pocket, different orientation | Use GNINA (better trained for pose ranking) |
| Vina excellent, GNINA mediocre | Different pose, very different score | GNINA found pose Vina missed; trust GNINA |
| Both poor scores | Many ligands score similarly poor | Wrong pocket / protein conformation; reconsider receptor |
Common Errors
| Symptom | Cause | Fix |
|---|---|---|
| Vina segfault | PDBQT corrupted (atom names) | Re-prep with meeko |
| GNINA hangs | GPU OOM | Reduce batch / -num_modes 5 |
| All affinities very poor (-3 to -5) | Wrong protonation; ligand too large for box | Re-check pKa; expand box |
| Identical affinity across ligands | Receptor grid not computed | Call v.compute_vina_maps() before dock |
| Pose poses make no sense | Receptor and ligand in different frames | Ensure same coordinate origin |
| AutoDock 4 needed | Vina cannot handle Zn / Fe metals | Use AutoDock 4 with metal scoring or GOLD |
| GPU mode slow | Vina is CPU-only; only GNINA is GPU | Use GNINA for GPU; Vina is multi-core CPU |
References
- Trott & Olson, J. Comput. Chem. 31:455 -- AutoDock Vina.
- Eberhardt et al., J. Chem. Inf. Model. 61:3891 -- Vina 1.2 features.
- Quiroga & Villarreal, PLoS ONE 11:e0155183 -- Vinardo scoring.
- McNutt et al., J. Cheminformatics 13:43 -- GNINA 1.0 CNN docking.
- Buttenschoen et al., Chem. Sci. 15:3130 -- PoseBusters (DL methods fail).
- Lyu et al., Nature 566:224 -- ultralarge VS proof-of-concept.
- Krivak & Hoksza, J. Cheminformatics 10:39 -- P2Rank.
- Forli et al., Nat. Protoc. 11:905 -- AutoDockTools / meeko.
Related Skills
- chemoinformatics/molecular-io - Parse ligands
- chemoinformatics/conformer-generation - Generate 3D for ligand prep
- chemoinformatics/molecular-standardization - Canonicalize before docking
- chemoinformatics/pose-validation - PoseBusters physical-validity QC
- chemoinformatics/docking-rescoring - DiffDock-L + GNINA hybrid
- chemoinformatics/covalent-design - Covalent docking
- chemoinformatics/free-energy-calculations - FEP for refined affinity
- chemoinformatics/admet-prediction - Filter library before docking
- structural-biology/structure-io - PDB / mmCIF handling
- structural-biology/modern-structure-prediction - AlphaFold3 / Boltz-1 for apo receptors
