skills/chemoinformatics/shape-similarity

Version Compatibility

Reference examples tested with: RDKit 2024.09+ (Open3DAlign), USRCAT 1.2+, ShaEP 1.7+, ROCS (OpenEye, commercial).

Before using code patterns, verify installed versions match. If versions differ:

  • Python: pip show <package> then help(module.function) to check signatures

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
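The introspection step above can be sketched with the standard library alone. The helper name `describe_callable` is hypothetical, and the distribution name passed to it (for example `rdkit`) is an assumption about how the package was installed.

```python
import importlib
import inspect
from importlib import metadata


def describe_callable(package, dotted_name):
    """Return (installed_version, signature_text) for 'module.attr'."""
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = None  # not installed under this distribution name

    module_name, _, attr = dotted_name.rpartition(".")
    func = getattr(importlib.import_module(module_name), attr)
    try:
        sig = str(inspect.signature(func))
    except (TypeError, ValueError):
        # Compiled wrappers often expose no introspectable signature;
        # fall back to the first line of the docstring.
        doc = inspect.getdoc(func) or ""
        sig = doc.splitlines()[0] if doc else "<signature unavailable>"
    return version, sig


# Demonstrated on a stdlib function so the sketch runs anywhere.
print(describe_callable("no-such-dist", "json.dumps"))
```

For the tools in this document, a call such as `describe_callable("rdkit", "rdkit.Chem.rdMolAlign.GetO3A")` would be the analogous check, assuming RDKit is installed under that distribution name.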

Shape Similarity

Search for compounds with similar 3D shape (and optionally chemical features) to a query molecule. Shape-based screening complements 2D fingerprint search: it can find scaffold-hopped compounds that ECFP4 misses (different scaffolds with similar shape). ROCS (OpenEye) is the industry-standard commercial tool; Open3DAlign (RDKit), USRCAT (Schreyer & Blundell 2012), and ShaEP are open-source alternatives. Modern best practice combines shape with color (chemical-feature similarity) via Tanimoto-Combo: matches share both shape and pharmacophore feature distribution.

For 2D fingerprint similarity, see chemoinformatics/similarity-searching. For pharmacophore search (discrete feature constraints), see chemoinformatics/pharmacophore-modeling. For 3D conformer generation, see chemoinformatics/conformer-generation.

Shape Method Taxonomy

| Tool | Speed | Approach | Open-source | Main limitation |
|---|---|---|---|---|
| ROCS (OpenEye) | 1k mols/sec on GPU (FastROCS) | Gaussian shape + color | No | License cost |
| ROCS X (Sept 2025) | Multi-billion library, GPU | ML-enhanced shape | No | Limited release |
| USRCAT | 100k mols/sec | Ultrafast moment-based + atom types | Yes | Coarse approximation |
| Open3DAlign (RDKit) | 100 mols/sec | Atom-based alignment (MMFF94 atom types and charges) | Yes | Slow optimization |
| ShaEP | 10 mols/sec | Field-based (shape + ESP) | Yes | Less standard |
| ESPSim | Similar to ShaEP | Electrostatic + shape | Yes | Limited public benchmarks |
| Phase-Shape (Schrödinger) | Commercial (speed not listed) | Shape + pharmacophore | No | Commercial |
| USR (original) | 100k mols/sec | Moment-based only | Yes | No atom type info |

Decision: For commercial pipelines, ROCS is the gold standard. For open-source, Open3DAlign is the most accurate; USRCAT is the fastest for ultralarge libraries.

Decision Tree by Scenario

| Scenario | Method | Notes |
|---|---|---|
| Lead-like library, search top 100k | USRCAT pre-filter + Open3DAlign rescore | Hybrid speed/accuracy |
| Production VS for scaffold hop | ROCS + color (commercial) | Industry standard |
| Scaffold hopping prospective | Open3DAlign with conformer ensemble | Shape + flexibility |
| Bioisostere replacement | ROCS color with neutral scoring | Pharmacophore-equivalent matches |
| Patent space carve-out | Shape constraint + 2D dissimilarity | Combine shape + dissimilar scaffold |
| Library diversity assessment | USRCAT k-nearest neighbor | Fast |
| Crystal-bound conformer template | Open3DAlign starting from co-crystal pose | Bioactive shape |
| Cross-target screening | Shape + pharmacophore feature | Combined screen |

Tanimoto-Combo Scoring (ROCS Standard)

Tanimoto-Combo = Tanimoto_shape + Tanimoto_color (range 0 to 2)

  • Tanimoto_shape: normalized volume overlap (0 to 1)
  • Tanimoto_color: normalized pharmacophore-feature overlap (0 to 1)

| Range | Interpretation |
|---|---|
| > 1.0 | Very similar shape + color (rare; top hits) |
| 0.7-1.0 | Strong hit; likely binding mode similarity |
| 0.5-0.7 | Moderate; further validation needed |
| 0.3-0.5 | Weak; many false positives |
| < 0.3 | Background |

In ROCS production, hits with TanimotoCombo > 0.7 are typically followed up.
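Each Tanimoto term is a normalized overlap: the mutual overlap divided by the sum of the two self-overlaps minus the mutual overlap. The sketch below shows that normalization and a lookup into the interpretation bands above. The function names and the example overlap values are illustrative assumptions, not ROCS output.

```python
def tanimoto(self_a, self_b, overlap_ab):
    """Tanimoto from two self-overlaps and the mutual overlap."""
    denom = self_a + self_b - overlap_ab
    if denom <= 0:
        raise ValueError("overlaps must give a positive denominator")
    return overlap_ab / denom


def interpret(combo):
    """Map a Tanimoto-Combo score onto the bands of the table above."""
    if combo > 1.0:
        return "very similar shape + color"
    if combo >= 0.7:
        return "strong hit"
    if combo >= 0.5:
        return "moderate"
    if combo >= 0.3:
        return "weak"
    return "background"


# Hypothetical overlap volumes, for illustration only.
shape_term = tanimoto(100.0, 80.0, 60.0)  # 60 / (100 + 80 - 60)
print(shape_term, interpret(0.75))
```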

USRCAT (Ultra-Fast Shape Recognition + Atom Types)

USRCAT extends Ultrafast Shape Recognition (USR) with atom-type information. Each molecule is represented as a 60-dimensional moment vector: 12 USR moments computed for each of 5 atom subsets (all atoms, hydrophobic, aromatic, acceptor, donor).

Goal: Encode a molecule into the 60-D USRCAT moment vector and score similarity against another molecule for alignment-free shape search.

Approach: Parse the SMILES, add hydrogens, generate one 3D conformer with ETKDGv3, compute USRCAT descriptors, and apply the inverse-mean-absolute-difference similarity to a second descriptor vector.

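from rdkit import Chem
from rdkit.Chem import AllChem, rdMolDescriptors

def embed(smiles, seed=42):
    """Parse SMILES, add hydrogens, and embed one ETKDGv3 conformer."""
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
    params = AllChem.ETKDGv3()
    params.randomSeed = seed
    if AllChem.EmbedMolecule(mol, params) < 0:
        raise ValueError(f'3D embedding failed for {smiles}')
    return mol

# RDKit's built-in USRCAT is used here; the standalone usrcat package exposes
# its own function names, so check them with help() before substituting.
desc1 = rdMolDescriptors.GetUSRCAT(embed('CCO'))
desc2 = rdMolDescriptors.GetUSRCAT(embed('CCN'))
# Each descriptor holds 60 floats: 12 USR moments x 5 atom subsets
# (all atoms, hydrophobic, aromatic, acceptor, donor)

# Score = 1 / (1 + weighted mean absolute moment difference); 0-1, higher = more similar
similarity = rdMolDescriptors.GetUSRScore(desc1, desc2)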

Speed: Descriptor calculation is O(N) in the number of atoms and needs no alignment; each similarity comparison is O(1) on fixed-length vectors. Suitable for >10M compound libraries.

Limit: USRCAT is a coarse approximation. Predictive for analog identification; less precise for scaffold hopping.
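To make the 12 moments concrete, here is a minimal pure-Python sketch of the underlying USR descriptor: three moments of the atom-distance distribution measured from four reference points. It assumes plain (x, y, z) tuples as input; implementations differ in how they define the second and third moments, so values will not match any particular library. USRCAT repeats the same calculation on five atom subsets to reach 60 values.

```python
import math


def _moments(values):
    """Mean, standard deviation, and signed cube root of the third central moment."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    third = sum((v - mean) ** 3 for v in values) / n
    return [mean, math.sqrt(var), math.copysign(abs(third) ** (1 / 3), third)]


def usr_descriptor(coords):
    """12 USR-style moments from a list of (x, y, z) atom positions."""
    n = len(coords)
    ctd = tuple(sum(c[i] for c in coords) / n for i in range(3))  # centroid
    d_ctd = [math.dist(ctd, c) for c in coords]
    cst = coords[d_ctd.index(min(d_ctd))]  # atom closest to the centroid
    fct = coords[d_ctd.index(max(d_ctd))]  # atom farthest from the centroid
    d_fct = [math.dist(fct, c) for c in coords]
    ftf = coords[d_fct.index(max(d_fct))]  # atom farthest from fct
    desc = []
    for ref in (ctd, cst, fct, ftf):
        desc.extend(_moments([math.dist(ref, c) for c in coords]))
    return desc


def usr_similarity(d1, d2):
    """Inverse of one plus the mean absolute moment difference."""
    mad = sum(abs(a - b) for a, b in zip(d1, d2)) / len(d1)
    return 1.0 / (1.0 + mad)


# Made-up coordinates, for illustration only.
coords = [(0.0, 0.0, 0.0), (1.5, 0.1, 0.0), (2.2, 1.3, 0.4),
          (3.6, 1.2, -0.5), (4.1, 2.5, 0.9)]
print(len(usr_descriptor(coords)))
```

Because every moment is built from interatomic distances, the descriptor needs no alignment, which is where the speed comes from.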

Open3DAlign (RDKit)

Open3DAlign is an atom-based method: it matches atoms by MMFF94 atom type and charge (or by Crippen contributions in the GetCrippenO3A variant) and refines the superposition iteratively. It does not optimize Gaussian volume overlap directly.

Goal: Align a target molecule onto a query in 3D and score the alignment with Open3DAlign.

Approach: Build 3D structures for query and target (parse SMILES, add hydrogens, ETKDG embed), run GetO3A to find the best alignment, then call Align() for the in-place RMSD and Score() for the alignment score.

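from rdkit import Chem
from rdkit.Chem import AllChem, rdMolAlign

def embed(smiles, seed=42):
    """Parse SMILES, add hydrogens, and embed one ETKDGv3 conformer."""
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
    params = AllChem.ETKDGv3()
    params.randomSeed = seed
    if AllChem.EmbedMolecule(mol, params) < 0:
        raise ValueError(f'3D embedding failed for {smiles}')
    return mol

query = embed('CCC(=O)Nc1ccccc1')
target = embed('CCC(=O)Nc1ccc(F)cc1')

# Probe (target) comes first, reference (query) second
O3A = rdMolAlign.GetO3A(target, query)
rmsd = O3A.Align()   # moves target onto query in place; returns RMSD of matched atoms
score = O3A.Score()  # unnormalized O3A score, not a 0-1 Tanimoto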

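GetO3A finds the best alignment between two conformers; Align() applies it to the probe and returns the RMSD over matched atoms; Score() returns the Open3DAlign score. That score is unnormalized and grows with molecule size, so it is not directly comparable to a 0-1 shape Tanimoto or to TanimotoCombo. For a normalized shape score on the aligned pose, RDKit provides rdShapeHelpers.ShapeTanimotoDist (shape Tanimoto = 1 - distance).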

Open3DAlign vs ROCS: Open3DAlign is open-source and competitive on small benchmarks; slower than ROCS at scale.
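Once two molecules share a frame, a normalized shape score can be computed from their volumes. The sketch below is a hard-sphere grid approximation in pure Python, assuming each atom is given as a center plus radius; it is meant to show the intersection-over-union idea, not to reproduce the Gaussian or grid schemes that ROCS and RDKit use. The radii and coordinates are made up.

```python
import itertools
import math


def _inside(point, atoms):
    """True if the point lies within any atom sphere."""
    return any(math.dist(point, center) <= radius for center, radius in atoms)


def grid_shape_tanimoto(atoms_a, atoms_b, spacing=0.5):
    """Intersection volume over union volume, counted on a regular grid.

    atoms_*: list of ((x, y, z), radius), already aligned in one frame.
    """
    all_atoms = atoms_a + atoms_b
    lo = [min(c[i] - r for c, r in all_atoms) for i in range(3)]
    hi = [max(c[i] + r for c, r in all_atoms) for i in range(3)]
    axes = [
        [lo[i] + spacing * k for k in range(int((hi[i] - lo[i]) / spacing) + 1)]
        for i in range(3)
    ]
    both = either = 0
    for point in itertools.product(*axes):
        in_a = _inside(point, atoms_a)
        in_b = _inside(point, atoms_b)
        both += in_a and in_b
        either += in_a or in_b
    return both / either if either else 0.0


# Two hypothetical two-atom "molecules", the second shifted by 1 unit in x.
mol_a = [((0.0, 0.0, 0.0), 1.7), ((1.5, 0.0, 0.0), 1.7)]
mol_b = [((1.0, 0.0, 0.0), 1.7), ((2.5, 0.0, 0.0), 1.7)]
print(grid_shape_tanimoto(mol_a, mol_b))
```

A finer `spacing` gives a better volume estimate at cubic cost in grid points, which is one reason production tools use analytic Gaussian overlaps instead.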

Conformer-Ensemble Shape Searching

For each library molecule, generate an ensemble of conformers and keep the score of the best-matching one:

Goal: Run shape-similarity search over a conformer ensemble per library molecule so bound-conformer-like shapes are recovered.

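Approach: For each library molecule, add hydrogens, embed n_conf conformers with ETKDGv3, MMFF-optimize, align each conformer onto the query with Open3DAlign, score the aligned pose by shape Tanimoto, and keep the best score per molecule.

from rdkit import Chem
from rdkit.Chem import AllChem, rdMolAlign, rdShapeHelpers

def shape_search_ensemble(query_mol, library_mols, n_conf=20):
    """Rank library molecules by best shape Tanimoto over their conformers.

    query_mol must already carry hydrogens and a 3D conformer.
    """
    hits = []
    for mol in library_mols:
        target = Chem.AddHs(mol)
        AllChem.EmbedMultipleConfs(target, numConfs=n_conf,
                                   params=AllChem.ETKDGv3())
        if target.GetNumConformers() == 0:
            continue  # embedding failed; nothing to score
        AllChem.MMFFOptimizeMoleculeConfs(target)

        scores = []
        for conf in target.GetConformers():
            cid = conf.GetId()
            rdMolAlign.GetO3A(target, query_mol, prbCid=cid).Align()
            # Shape Tanimoto = 1 - Tanimoto distance of the aligned pose
            dist = rdShapeHelpers.ShapeTanimotoDist(target, query_mol, confId1=cid)
            scores.append(1.0 - dist)
        hits.append((target, max(scores)))
    return sorted(hits, key=lambda x: x[1], reverse=True)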

Critical: Single-conformer shape search can miss roughly 30% of true hits when the one sampled conformer is far from the bioactive one. Always search with an ensemble.

ESP Similarity (Electrostatic)

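ShaEP and ESPSim extend shape with electrostatic surface potential overlap. Use them when the pharmacophore is ESP-driven (binding pockets with strong electrostatics). Confirm the flag names below against the installed ShaEP's help output before running: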

shaep --query query.mol2 --target target.mol2 --output match.sdf \
      --esp-weight 0.5

ESP scoring catches electrostatically equivalent bioisosteres that pure shape misses (for example, carboxylate and tetrazole, which carry the same charge).

Shape vs ECFP4 Complementarity

| Shape Tanimoto | ECFP4 Tanimoto | Interpretation |
|---|---|---|
| > 0.7 | > 0.7 | Same chemotype, same shape (close analog) |
| > 0.7 | < 0.5 | Scaffold-hop! Different chemotype, similar shape |
| < 0.5 | > 0.7 | Same chemotype, different shape (flexible) |
| < 0.5 | < 0.5 | Unrelated |

The high-shape, low-ECFP4 quadrant is where scaffold hops are found:

Goal: Identify scaffold-hop candidates that are 3D-shape-similar but 2D-chemotype-dissimilar to the query.

Approach: Run the conformer-ensemble shape search, keep hits above a shape Tanimoto cutoff, then retain only those whose ECFP4 Tanimoto to the query is below an ECFP4 dissimilarity cutoff.

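from rdkit import Chem, DataStructs
from rdkit.Chem import rdFingerprintGenerator

_ecfp4 = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)

def ecfp_tanimoto(mol_a, mol_b):
    # ECFP4 (Morgan radius 2) Tanimoto on hydrogen-suppressed graphs
    fps = [_ecfp4.GetFingerprint(Chem.RemoveHs(m)) for m in (mol_a, mol_b)]
    return DataStructs.TanimotoSimilarity(fps[0], fps[1])

def scaffold_hop_candidates(query_mol, library, shape_threshold=0.7,
                            ecfp_threshold=0.5):
    shape_hits = shape_search_ensemble(query_mol, library)  # defined in the previous section
    candidates = []
    for target, shape_score in shape_hits:
        if shape_score >= shape_threshold:
            ecfp_sim = ecfp_tanimoto(query_mol, target)
            if ecfp_sim < ecfp_threshold:
                candidates.append((target, shape_score, ecfp_sim))
    return candidates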
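The quadrant logic of the complementarity table can be sketched without any toolkit, assuming fingerprints are available as sets of on-bit indices and the shape score is already on a 0-1 scale. The function names and the "intermediate" label for pairs that fall between the table's cutoffs are illustrative choices, not part of the table.

```python
def set_tanimoto(bits_a, bits_b):
    """Tanimoto coefficient on two collections of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def quadrant(shape_sim, ecfp_sim, hi=0.7, lo=0.5):
    """Label a (shape, ECFP4) similarity pair using the table's cutoffs."""
    if shape_sim > hi and ecfp_sim > hi:
        return "close analog"
    if shape_sim > hi and ecfp_sim < lo:
        return "scaffold hop"
    if shape_sim < lo and ecfp_sim > hi:
        return "same chemotype, different shape"
    if shape_sim < lo and ecfp_sim < lo:
        return "unrelated"
    return "intermediate"  # between cutoffs; the table does not classify these


# Hypothetical bit sets sharing 2 of 8 distinct bits.
ecfp = set_tanimoto({1, 2, 3, 4, 5}, {4, 5, 6, 7, 8})
print(ecfp, quadrant(0.82, ecfp))
```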

Per-Tool Failure Modes

USRCAT -- false positive on small molecules

Trigger: Library has many fragment-sized compounds.

Mechanism: USRCAT moments are dominated by overall shape, so small molecules look "similar" whenever their shapes resemble each other.

Symptom: Many fragment hits; not pharmacophore-relevant.

Fix: Filter by MW (>= 200); use Open3DAlign for rescoring.

Open3DAlign -- slow on large library

Trigger: Million-compound library, full alignment.

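Mechanism: Each comparison runs an iterative alignment, so total cost grows linearly with the number of molecules (and conformers) at a high per-molecule cost.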

Symptom: Hours of compute.

Fix: Pre-filter with USRCAT (fast), Open3DAlign on top 10k.
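The pre-filter-then-rescore pattern is independent of the scoring tools. A minimal sketch, assuming two caller-supplied scoring functions (a cheap one standing in for USRCAT and an expensive one standing in for Open3DAlign); the function name and the toy scorers are hypothetical.

```python
import heapq


def two_stage_screen(library, fast_score, slow_score, keep=10_000, top=100):
    """Shortlist with a cheap scorer, then rank the shortlist with a costly one."""
    shortlist = heapq.nlargest(keep, library, key=fast_score)
    rescored = [(item, slow_score(item)) for item in shortlist]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:top]


# Toy stand-ins: integers as "molecules", distance-to-target as the score.
library = range(1000)
result = two_stage_screen(
    library,
    fast_score=lambda x: -abs(x - 500),   # coarse: peaks near 500
    slow_score=lambda x: -abs(x - 503),   # refined: true optimum at 503
    keep=20,
    top=3,
)
print(result)
```

The expensive scorer runs only `keep` times, so `keep` is the knob that trades recall against compute.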

Shape only -- wrong stereochemistry match

Trigger: Mirror-image of correct binder.

Mechanism: Distance-based descriptors such as USR/USRCAT are unchanged by mirror inversion, and pure shape overlap is often nearly so.

Symptom: The inactive enantiomer of an active scores as a hit.

Fix: Validate hits by 3D pose; check stereochemistry.
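One way to check handedness on an overlaid pose is the sign of a signed volume, which flips under mirror reflection. The sketch below assumes the four substituent positions around a stereocenter have already been extracted in matching order for the query and the hit; that bookkeeping is not shown, and the helper names are hypothetical.

```python
def signed_volume(p0, p1, p2, p3):
    """Scalar triple product (p1-p0) . ((p2-p0) x (p3-p0)); sign encodes handedness."""
    a = [p1[i] - p0[i] for i in range(3)]
    b = [p2[i] - p0[i] for i in range(3)]
    c = [p3[i] - p0[i] for i in range(3)]
    cross = (
        b[1] * c[2] - b[2] * c[1],
        b[2] * c[0] - b[0] * c[2],
        b[0] * c[1] - b[1] * c[0],
    )
    return sum(a[i] * cross[i] for i in range(3))


def same_handedness(quad_query, quad_hit):
    """True when both four-point sets have the same chirality sign."""
    return signed_volume(*quad_query) * signed_volume(*quad_hit) > 0


# Made-up substituent positions and their mirror image through the xy plane.
quad = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
mirror = [(x, y, -z) for x, y, z in quad]
print(same_handedness(quad, quad), same_handedness(quad, mirror))
```

A hit that passes the shape cutoff but fails this check is a candidate for the enantiomer problem described above.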

ROCS color -- bioisostere missed

Trigger: -COOH replaced by -SO3H or tetrazole.

Mechanism: Default color types may not equate these bioisosteres.

Symptom: Known bioisostere doesn't score high.

Fix: Score on color alone, without the size penalty, or define a pharmacophore-feature equivalence so that these groups map to the same color type.

Conformer not bioactive

Trigger: The generated conformer of a library compound is not its bound conformation.

Mechanism: ETKDGv3 generates plausible conformers; bound conformer may be higher energy.

Symptom: Known active doesn't shape-match query.

Fix: Use larger conformer ensemble; weight by Boltzmann; or use CREST + GFN2-xTB for high-quality sampling.
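Boltzmann weighting can be sketched as follows, assuming conformer energies in kcal/mol (for example from a force-field optimization) and one shape score per conformer. The function names are illustrative, and 298.15 K is an assumed default temperature.

```python
import math

R_KCAL = 0.0019872  # gas constant in kcal/(mol*K)


def boltzmann_weights(energies_kcal, temperature=298.15):
    """Normalized Boltzmann populations from conformer energies."""
    e_min = min(energies_kcal)  # shift by the minimum for numerical stability
    factors = [
        math.exp(-(e - e_min) / (R_KCAL * temperature)) for e in energies_kcal
    ]
    total = sum(factors)
    return [f / total for f in factors]


def weighted_score(scores, energies_kcal, temperature=298.15):
    """Population-weighted average of per-conformer scores."""
    weights = boltzmann_weights(energies_kcal, temperature)
    return sum(w * s for w, s in zip(weights, scores))


# Hypothetical energies (kcal/mol) and shape scores for three conformers.
energies = [0.0, 0.5, 2.0]
scores = [0.55, 0.80, 0.90]
print(boltzmann_weights(energies), weighted_score(scores, energies))
```

Weighting favors low-energy conformers, so it pulls against the case described here where the bound conformer is higher in energy; comparing the weighted score with the plain maximum shows how much a hit depends on a strained conformer.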

Field-based methods slower

Trigger: ShaEP or ESPSim on production library.

Mechanism: Field-based methods compute Gaussian fields per molecule.

Symptom: 10-100x slower than ROCS shape-only.

Fix: Use as second-stage rescore; not primary screen.

Reconciliation: Shape vs Pharmacophore

| Aspect | Shape | Pharmacophore |
|---|---|---|
| Representation | Volume distribution | Discrete features in space |
| Captures | Overall bulk | Interaction-relevant features |
| Speed | Fast (USRCAT) to medium (Open3DAlign) | Fast |
| Specificity | Lower | Higher |
| False positive rate | Medium | Lower |
| Best for | Initial scaffold-hopping search | Refining scaffold-hop hits |

Use shape for broad search (high recall, moderate precision); use pharmacophore for refinement (lower recall, high precision).

Common Errors

| Symptom | Cause | Fix |
|---|---|---|
| Open3DAlign returns 0 | No reasonable alignment found | Use random starting rotation; increase attempts |
| USRCAT vector all zeros | Mol has no 3D coords | Generate conformer first |
| Shape Tanimoto > 1 | Unnormalized score, or TanimotoCombo (range 0-2) read as shape Tanimoto | Check the formula and which score the tool reports |
| ROCS very slow | Sequential processing | Use parallel batching |
| Shape match but no docking pose | Wrong binding pose | Use docking on top shape hits, not shape alone |
| Missing co-crystal template | Apo or AlphaFold-only structure | Use ligand-based pharmacophore + shape |
| ShaEP returns no hits | Strict tolerance | Loosen overlap thresholds |

References

  • Hawkins et al., J. Med. Chem. 50:74 -- ROCS algorithm.
  • Schreyer & Blundell, J. Cheminformatics 4:27 -- USRCAT.
  • Vainio, Puranen & Johnson, J. Chem. Inf. Model. 49:492 -- ShaEP.
  • Liu et al., J. Chem. Inf. Model. (2024) -- Open3DAlign improvements.
  • Roy et al., J. Med. Chem. 65:11875 -- shape-based VS modern review.

Related Skills

  • chemoinformatics/molecular-io - Parse query and library
  • chemoinformatics/conformer-generation - Generate 3D conformer ensembles
  • chemoinformatics/similarity-searching - 2D similarity comparison
  • chemoinformatics/pharmacophore-modeling - Pharmacophore alternative
  • chemoinformatics/scaffold-analysis - 2D scaffold analysis
  • chemoinformatics/virtual-screening - Shape as pre-filter to docking