skills/chemoinformatics/virtual-screening

stars:0
forks:0
watches:0
last updated:N/A

Version Compatibility

Reference examples tested with: AutoDock Vina 1.2.5+, SMINA 2020-12+, GNINA 1.1+, RDKit 2024.09+, meeko 0.5+, P2Rank 2.4+, ProDy 2.4+, pdb2pqr 3.6+.

Before using code patterns, verify installed versions match. If versions differ:

  • Python: pip show <package> then help(module.function) to check signatures
  • CLI: vina --version; gnina --version; smina --version

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

Virtual Screening

Screen chemical libraries against protein targets via molecular docking. Vina is the de-facto default, SMINA adds flexibility (Vinardo scoring, custom scoring), and GNINA adds CNN-based pose scoring. Deep-learning docking (DiffDock-L, EquiBind, NeuralPLexer) competes in pose accuracy but fails physical-plausibility tests (PoseBusters) ~50% of the time; the postdoc-grade workflow combines ML pose sampling with classical scoring and physical validation. For ultralarge libraries (>1M), library preparation, hierarchical filtering, and HPC orchestration become the limiting steps.

For pose physical-validity QC, see chemoinformatics/pose-validation. For ML-driven docking + rescoring, see chemoinformatics/docking-rescoring. For covalent docking, see chemoinformatics/covalent-design. For affinity calculations (FEP), see chemoinformatics/free-energy-calculations.

Docking Tool Taxonomy

ToolScoringSpeed (sec/lig)Best atFails when
AutoDock Vina 1.2Vina (empirical)5-30Default, well-validatedCross-dock; cryptic pockets; metal centers
SMINAVina + flexible + custom5-30Custom scoring; flexible side chainsSame Vina-scoring caveats
VinardoVinardo (improved Vina)5-30Better scoring than Vina on DUD-ELimited adoption
GNINA 1.1CNN (default) or Vina30-120Pose ranking, redockingSlower; GPU recommended
AutoDock 4AD4 + grid maps30-60Legacy referenceSlower than Vina
DOCK 6/7DOCK + Amber60-300UCSF DOCK ecosystemSteep learning curve
Glide (Schrodinger)GlideScore5-30Commercial SOTALicense cost
GOLD (CCDC)GOLDScore / ChemScore60-180Commercial; metal coordinationLicense cost
FlexX (BioSolveIT)FlexX30-60Fragment-basedLicense cost
rDockrDock30-60Open-source alternativeSlower than Vina
DiffDock-LDiffusion-generative5 (GPU)Pose sampling for cross-dockingHigh PB-invalid rate (~50%)
EquiBindEquivariant NN<1 (GPU)Single-shot poseLowest accuracy in PoseBuster benchmarks
Boltz-2 + GNINA rescoreFoundation model + CNN10 (GPU)Modern SOTA hybridHigh GPU; not all proteins

Decision: For most screens, GNINA 1.1 with CNN scoring is the modern default (better than Vina on every benchmark; 30s/ligand on GPU). For >1M library scale, hierarchical Vina -> GNINA -> MM/GBSA. For cross-docking (predicted target structure or apo-holo), GNINA's CNN scoring transfers better than Vina.

Decision Tree by Scenario

ScenarioRecommended workflow
Self-dock against known ligand pocketGNINA gnina --cnn_scoring rescore
Cross-dock to apo or related-target structureDiffDock-L pose + GNINA rescore + PoseBusters
Ultralarge library (10M+)Vina hierarchical: pre-filter (Lipinski, PAINS) -> Vina dock -> top 1% GNINA rescore -> top 0.1% MM/GBSA
Cryptic pocket / induced fitEnsemble docking (10 conformer snapshots) + AlphaFold3 or Boltz-1 holo prediction
Allosteric / undefined siteP2Rank for pocket detection -> ensemble dock all pockets
Metal-coordinated ligandGOLD (commercial) or manually parameterize Vina metal scoring
Covalent inhibitorSee chemoinformatics/covalent-design: DOCKovalent, HCovDock
Fragment screen (<300 Da)rDock or constrained Vina with seed atoms
Hit-to-lead refinementUse co-crystal structure if available; MD-relaxed receptor; FEP for affinity

Receptor Preparation

Goal: Convert a protein PDB into a docking-ready format with correct protonation, missing atoms, and removed waters.

Approach: Strip ligands and waters -> fill missing atoms (PROPKA or pdbfixer) -> add hydrogens at pH 7.4 (PDB2PQR / Reduce) -> assign partial charges -> convert to PDBQT (Open Babel).

import subprocess

def prepare_receptor(pdb_in, pdbqt_out, pH=7.4, remove_het=True):
    base = pdb_in.replace('.pdb', '')
    fixed = f'{base}_fixed.pdb'
    propka_pdb = f'{base}_propka.pdb'

    if remove_het:
        with open(pdb_in) as fin, open(f'{base}_clean.pdb', 'w') as fout:
            for line in fin:
                if line.startswith('HETATM') and 'HOH' in line:
                    continue
                if line.startswith('HETATM'):
                    continue
                fout.write(line)

    subprocess.run(['pdb2pqr', '--ff=AMBER', f'--with-ph={pH}',
                    f'{base}_clean.pdb', propka_pdb], check=True)
    subprocess.run(['obabel', propka_pdb, '-O', pdbqt_out,
                    '-xr', '--partialcharge', 'gasteiger'], check=True)
    return pdbqt_out

Common pitfall: Forgetting to add hydrogens at protein pH (7.4) but using pH 7.0 ligand charges. Hist mistakenly protonated. Use PROPKA + manual review of catalytic residues.

Ligand Preparation

Goal: Generate a 3D, docking-ready ligand file from SMILES with appropriate protonation and conformation.

Approach: Protonate at pH 7.4 with Chem.MolFromSmiles then rdMolStandardize.Uncharger -> embed 3D with ETKDGv3 -> minimize with MMFF94 -> assign Gasteiger charges -> write PDBQT with meeko.

from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.MolStandardize import rdMolStandardize
from meeko import MoleculePreparation, PDBQTWriterLegacy

def prepare_ligand(smiles, pdbqt_out):
    mol = Chem.MolFromSmiles(smiles)
    uncharger = rdMolStandardize.Uncharger()
    mol = uncharger.uncharge(mol)
    mol = Chem.AddHs(mol)
    AllChem.EmbedMolecule(mol, AllChem.ETKDGv3())
    AllChem.MMFFOptimizeMolecule(mol)

    mk_prep = MoleculePreparation()
    setups = mk_prep.prepare(mol)
    pdbqt_text, is_ok, err = PDBQTWriterLegacy.write_string(setups[0])
    if not is_ok:
        raise RuntimeError(f'meeko PDBQT export failed: {err}')
    with open(pdbqt_out, 'w') as f:
        f.write(pdbqt_text)
    return pdbqt_out

meeko (AutoDock developers' tool) handles torsion tree creation, rotamer flagging, and PDBQT writing -- preferred over Open Babel's PDBQT writer. Note: meeko 0.5+ separated the writer (PDBQTWriterLegacy) from MoleculePreparation; older code using prep.write_pdbqt_file() is deprecated.

Binding Site Detection

When the binding pocket is not known (apo target, novel allosteric site):

ToolApproachOutput
P2RankML on protein surface descriptorsRanked pocket list with center coords
fpocketVoronoi tessellationPocket descriptor list
DoGSiteScorerGeometric + drugabilityPocket list with score
AutoSite (Vina)Affinity map clusteringPocket centers
AlphaFillCo-fold known ligandPlausible binding site
prank predict -f receptor.pdb -o pockets/

P2Rank output <receptor>_predictions.csv lists pocket centers with scores. Pocket 1 (highest score) is typically the orthosteric site for known protein families.

Vina Docking (Single Ligand)

# AutoDock Vina Python API requires Vina 1.2+; for Vina 1.1 use subprocess CLI:
# subprocess.run(['vina', '--receptor', ..., '--ligand', ..., '--center_x', ...], check=True)
from vina import Vina

def dock_single(receptor_pdbqt, ligand_pdbqt, center, box_size,
                exhaustiveness=8, n_poses=10):
    v = Vina(sf_name='vina')
    v.set_receptor(receptor_pdbqt)
    v.set_ligand_from_file(ligand_pdbqt)
    v.compute_vina_maps(center=center, box_size=box_size)
    v.dock(exhaustiveness=exhaustiveness, n_poses=n_poses)
    return v.energies(), v.poses()

Exhaustiveness:

  • 8: default, ~30s/ligand, acceptable for screening
  • 16: ~60s/ligand, lead-like prioritization
  • 32: ~2 min/ligand, top-hit re-docking
  • 64: ~4 min/ligand, production / co-crystal comparison

Vina pose RMSD lower bound rmsd_lb and upper bound rmsd_ub are NOT pose-vs-reference RMSDs; they are bounds on the conformational space sampled. Use external RMSD to a reference pose for accuracy QC.

GNINA with CNN Scoring (modern default)

gnina -r receptor.pdb -l ligand.sdf \
      --autobox_ligand reference_ligand.sdf \
      --cnn_scoring rescore \
      -o poses.sdf.gz \
      --num_modes 9 --exhaustiveness 8

--cnn_scoring:

  • rescore (default): Vina sampling + CNN scoring (best validated)
  • refinement: CNN refines top Vina pose
  • metrorefine: Metropolis-style CNN refinement (slower, ~10% better pose accuracy)
  • none: Vina-only (use SMINA equivalently)

--autobox_ligand: define box from reference ligand SDF/PDB. Otherwise specify --center_x/y/z + --size_x/y/z.

Critical: GNINA's CNN was trained on PDBbind 2019; performance on novel chemotypes outside this distribution is reduced. Validate with a known co-crystal redock if possible.

Virtual Screening Pipeline (Hierarchical)

Goal: Screen 10M-compound library down to top-1k candidates for follow-up.

Approach: Three-stage filter:

import pandas as pd
from concurrent.futures import ProcessPoolExecutor

# Stub helpers to be implemented per project; see the cross-referenced skills.
def drug_like_filter(df):
    raise NotImplementedError('See chemoinformatics/admet-prediction (Lipinski+Veber+PAINS)')
def vina_dock(smi, receptor_pdbqt, center, box):
    raise NotImplementedError('Wrap dock_single() above; return best affinity')
def gnina_rescore(smi, receptor_pdbqt, center, box):
    raise NotImplementedError('Wrap gnina --cnn_scoring rescore subprocess call')
def pose_validate(df):
    raise NotImplementedError('See chemoinformatics/pose-validation (PoseBusters)')

def vs_pipeline(library_smi, receptor_pdbqt, center, box, output_dir, n_workers=16):
    df = pd.read_csv(library_smi)
    df_stage1 = drug_like_filter(df)  # ~10x reduction

    with ProcessPoolExecutor(max_workers=n_workers) as ex:
        affinities = list(ex.map(
            lambda smi: vina_dock(smi, receptor_pdbqt, center, box),
            df_stage1['smiles']))
    df_stage1['vina_affinity'] = affinities
    df_stage2 = df_stage1.nsmallest(int(len(df_stage1) * 0.01), 'vina_affinity')

    df_stage2['gnina_affinity'] = df_stage2['smiles'].apply(
        lambda smi: gnina_rescore(smi, receptor_pdbqt, center, box))
    df_stage3 = df_stage2.nsmallest(1000, 'gnina_affinity')

    return pose_validate(df_stage3)

For 10M-compound libraries (ZINC22, Enamine REAL), use slurm-orchestrated parallel docking. Single-node GPU GNINA: ~3000 ligands/day. SLURM cluster (50 nodes): ~150k ligands/day.

Ultralarge Library Screening (ZINC22, Enamine REAL)

LibrarySizeFormatNotes
ZINC224.1B3D SMILES + xyzPre-built; 37GB tranches
Enamine REAL29BSMILESMake-on-demand; via Enamine
Enamine HTS4MSMILES + SDFStock for HTS
Mcule25MSMILESReal, in-stock
ChEMBL2.5MSMILES + bioactivityActivity-labeled

For ultralarge VS:

  1. Pre-filter to drug-like (Lipinski + Ro5 + PAINS) -> typically 10-20x reduction
  2. Pre-filter by 2D similarity to known actives (Tanimoto >= 0.4 via ECFP4) -> 100x reduction
  3. Vina dock the filtered subset
  4. Rescore top 1% with GNINA
  5. Rescore top 0.1% with MM/GBSA or FEP

Lyu et al. 2019 ZINC15 ultralarge VS gold standard: 138M docked, top 96 hits picked, 13 nM-µM actives.

Per-Tool Failure Modes

Vina -- cross-dock failure

Trigger: Receptor structure not the holo (co-crystal with ligand from another binder).

Mechanism: Vina pose accuracy degrades 2-3x from self-dock (RMSD<=2A 70%) to cross-dock (RMSD<=2A 30%).

Symptom: Top-ranked pose makes no geometric sense; key contacts missing.

Fix: GNINA CNN scoring or ensemble docking. For genuine apo, predict holo with AlphaFold3 / Boltz-1 then dock.

GNINA CNN -- novel chemotype out-of-distribution

Trigger: Ligand chemotype not in PDBbind training.

Mechanism: CNN scoring overfits to PDBbind chemotypes; novel macrocycle / peptide / PROTAC scores poorly.

Symptom: Affinity prediction far worse than Vina alone.

Fix: Use --cnn_scoring rescore (sampling still by Vina) rather than CNN sampling. Validate against co-crystal of close analog.

Box too small

Trigger: Binding box defined tightly around small ligand reference.

Mechanism: Vina explores only within the box; large analogs cannot fit. " Symptom: Many ligands report "no valid pose"; chemotype-biased hits.

Fix: Add padding 5-10 A beyond reference ligand bounding box. For unknown ligand size, use 25x25x25 A.

Multi-pocket protein -- wrong site

Trigger: Protein has multiple binding sites (orthosteric + allosteric).

Mechanism: P2Rank or AutoBox picks the most "drugable" pocket; not always the desired one.

Symptom: Hits dock in wrong pocket; SAR confusing.

Fix: Verify pocket from co-crystal data; explicitly set center_x/y/z from known ligand centroid.

DiffDock-L -- PoseBusters invalid

Trigger: Default DiffDock-L output for any receptor.

Mechanism: Diffusion generates poses without physical-validity loss; ~50% fail planarity / stereochemistry / vdW overlap tests.

Symptom: Poses look reasonable but fail PoseBusters checks.

Fix: Filter to PB-valid (PoseBusters); rescore with GNINA. See chemoinformatics/pose-validation.

Wrong ionization state

Trigger: Ligand or receptor residues protonated incorrectly at pH 7.4.

Mechanism: Aspartate/glutamate/histidine protonation depends on local environment; default protonation may be wrong.

Symptom: Salt bridges missing; poses misranked.

Fix: Run PROPKA on receptor (cabinet residue pKas); for catalytic His, manually inspect rotamer state.

Reconciliation: Vina vs GNINA Disagreement

Vina top poseGNINA top poseAction
Same pose, similar scoreSame pose, similar scoreTrust pose; use GNINA score for ranking
Vina top pose ≠ GNINA top poseSame pocket, different orientationUse GNINA (better trained for pose ranking)
Vina excellent, GNINA mediocreDifferent pose, very different scoreGNINA found pose Vina missed; trust GNINA
Both poor scoresMany ligands score similarly poorWrong pocket / protein conformation; reconsider receptor

Common Errors

SymptomCauseFix
Vina segfaultPDBQT corrupted (atom names)Re-prep with meeko
GNINA hangsGPU OOMReduce batch / -num_modes 5
All affinities very poor (-3 to -5)Wrong protonation; ligand too large for boxRe-check pKa; expand box
Identical affinity across ligandsReceptor grid not computedCall v.compute_vina_maps() before dock
Pose poses make no senseReceptor and ligand in different framesEnsure same coordinate origin
AutoDock 4 neededVina cannot handle Zn / Fe metalsUse AutoDock 4 with metal scoring or GOLD
GPU mode slowVina is CPU-only; only GNINA is GPUUse GNINA for GPU; Vina is multi-core CPU

References

  • Trott & Olson, J. Comput. Chem. 31:455 -- AutoDock Vina.
  • Eberhardt et al., J. Chem. Inf. Model. 61:3891 -- Vina 1.2 features.
  • Quiroga & Villarreal, PLoS ONE 11:e0155183 -- Vinardo scoring.
  • McNutt et al., J. Cheminformatics 13:43 -- GNINA 1.0 CNN docking.
  • Buttenschoen et al., Chem. Sci. 15:3130 -- PoseBusters (DL methods fail).
  • Lyu et al., Nature 566:224 -- ultralarge VS proof-of-concept.
  • Krivak & Hoksza, J. Cheminformatics 10:39 -- P2Rank.
  • Forli et al., Nat. Protoc. 11:905 -- AutoDockTools / meeko.

Related Skills

  • chemoinformatics/molecular-io - Parse ligands
  • chemoinformatics/conformer-generation - Generate 3D for ligand prep
  • chemoinformatics/molecular-standardization - Canonicalize before docking
  • chemoinformatics/pose-validation - PoseBusters physical-validity QC
  • chemoinformatics/docking-rescoring - DiffDock-L + GNINA hybrid
  • chemoinformatics/covalent-design - Covalent docking
  • chemoinformatics/free-energy-calculations - FEP for refined affinity
  • chemoinformatics/admet-prediction - Filter library before docking
  • structural-biology/structure-io - PDB / mmCIF handling
  • structural-biology/modern-structure-prediction - AlphaFold3 / Boltz-1 for apo receptors
    Good AI Tools