skills/chemoinformatics/substructure-search


Version Compatibility

Reference examples tested with: RDKit 2024.09+. SMARTS dialect follows Daylight specification with RDKit extensions.

Before using code patterns, verify installed versions match. If versions differ:

  • Python: pip show rdkit then help(rdkit.Chem.MolFromSmarts) to check signatures

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
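
A minimal sketch of that introspection step, assuming RDKit is importable in the current environment. The helper name rdkit_version_tuple is hypothetical; only rdkit.__version__ and the function docstring are taken from RDKit itself.

```python
import rdkit
from rdkit import Chem

def rdkit_version_tuple():
    """Return (year, month) parsed from rdkit.__version__, e.g. (2024, 9)."""
    parts = rdkit.__version__.split('.')
    return tuple(int(p) for p in parts[:2] if p.isdigit())

print('RDKit version:', rdkit.__version__)
# RDKit functions are compiled wrappers, so the docstring is where the
# accepted arguments and their defaults are listed.
print(Chem.MolFromSmarts.__doc__)
```

Comparing rdkit_version_tuple() against (2024, 9) gives a quick check on whether the reference examples were tested against the installed release.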

Substructure Search

Search molecular collections for structural patterns using SMARTS. The choice of SMARTS dialect, atom/bond matching mode, and structural-alert catalog determines whether the search captures the intended chemistry. PAINS (Baell & Holloway 2010) is the most-cited and most-misunderstood filter: it identifies substructures associated with assay interference, not "bad molecules". Knowing when to apply each catalog and how to interpret hits is essential.

For SMARTS-based reactions (transforming matched substructures), see chemoinformatics/reaction-enumeration. For 3D pharmacophore matching, see chemoinformatics/pharmacophore-modeling.

SMARTS Grammar Essentials

| Token | Meaning | Example |
| --- | --- | --- |
| [#6] | Atom by atomic number | [#6] carbon (aromatic or aliphatic) |
| c | Lowercase = aromatic | c1ccccc1 benzene |
| C | Uppercase = aliphatic only | C(=O)O carboxylic acid carbon |
| [CX4] | Atom + connection count X | [CX4] sp3 carbon (4 connections) |
| [CX3]=O | Carbonyl (CX3 = sp2 carbon with 3 connections) | matches ketone, aldehyde, ester C |
| [#6;R] | Atom in ring | [#6;R] ring carbon |
| [#6;!R] | Atom not in ring | [#6;!R] acyclic carbon |
| [#6;r6] | Atom in 6-membered ring | [#6;r6] six-ring carbon |
| [a] | Any aromatic atom | [a] |
| [!#1] | Anything except H | [!#1] heavy atom |
| [N;H2] | N with exactly 2 H | [NH2] primary amine |
| [N+] | Positively charged N | [N+](=O)[O-] nitro |
| [$(...)] | Recursive SMARTS | [$(c1ccccc1)] carbon in an aromatic six-carbon ring |
| [F,Cl,Br,I] | Comma = OR within brackets | [c][F,Cl,Br,I] aryl halide |
| ~ | Any bond type | c~c aromatic carbons joined by any bond |
| @ | Any ring bond | C@C aliphatic carbons joined by a ring bond |
| - | Single bond explicit | C-C |
| = | Double bond | C=O |
| : | Aromatic bond explicit | c:c |

Common SMARTS Patterns

| Pattern | SMARTS | Notes |
| --- | --- | --- |
| Hydroxyl (alcohol + phenol) | [OX2H] | X2 with one H excludes O- (alkoxide, carboxylate); also matches the OH of carboxylic acids |
| Phenol only | [OX2H][c] | OH attached to aromatic carbon |
| Aliphatic OH only | [OX2H][CX4] | OH attached to sp3 C |
| Carboxylic acid | [CX3](=O)[OX2H1] | C(=O)OH |
| Carboxylate | [CX3](=O)[O-] | C(=O)O- (deprotonated) |
| Ester | [CX3](=O)[OX2H0][#6] | C(=O)O-R, R = carbon |
| Amide | [CX3](=[OX1])[NX3] | C(=O)N-R |
| Primary amine | [NX3;H2] | -NH2 |
| Secondary amine | [NX3;H1] | -NH-R |
| Tertiary amine | [NX3;H0;!$(NC=O)] | -NR2 (not amide N) |
| Quaternary ammonium | [NX4+] | -NR4+ |
| Nitro | [N+](=O)[O-] | -NO2 (charge-separated form) |
| Nitrile | [CX2]#[NX1] | -C#N |
| Sulfonamide | [SX4](=[OX1])(=[OX1])[NX3] | -S(=O)(=O)N |
| Aryl halide | [c][F,Cl,Br,I] | halogen on aromatic |
| Aliphatic halide | [CX4][F,Cl,Br,I] | halogen on sp3 C |
| Hydrogen bond donor | [#7,#8;!H0] | N or O with at least 1 H |
| Hydrogen bond acceptor | [#7,#8;!$([NX3]([O-])=O);!$([N+]=O)] | N/O excluding nitro N |
| Michael acceptor | [CX3]=[CX3][CX3]=O | enone, acrylamide warhead |
| Aldehyde | [CX3H1](=O) | -CHO |
| Ketone | [#6][CX3](=O)[#6] | -C(=O)R, both R = C |
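
A short sketch of how a handful of the patterns above can be compiled once and used to count functional groups in a molecule. The dictionary labels and the helper names compile_patterns and count_groups are illustrative, not part of RDKit.

```python
from rdkit import Chem

# Subset of the table above; the keys are illustrative labels.
FUNCTIONAL_GROUPS = {
    'carboxylic_acid': '[CX3](=O)[OX2H1]',
    'primary_amine': '[NX3;H2]',
    'nitrile': '[CX2]#[NX1]',
    'aryl_halide': '[c][F,Cl,Br,I]',
}

def compile_patterns(smarts_by_name):
    """Parse each SMARTS once and fail loudly on invalid grammar."""
    compiled = {}
    for name, smarts in smarts_by_name.items():
        pat = Chem.MolFromSmarts(smarts)
        if pat is None:
            raise ValueError(f'invalid SMARTS for {name}: {smarts}')
        compiled[name] = pat
    return compiled

def count_groups(mol, compiled):
    """Number of unique matches per named pattern."""
    return {name: len(mol.GetSubstructMatches(pat)) for name, pat in compiled.items()}

PATTERNS = compile_patterns(FUNCTIONAL_GROUPS)
mol = Chem.MolFromSmiles('NCc1ccc(Cl)cc1C(=O)O')
counts = count_groups(mol, PATTERNS)
print(counts)
```

Compiling up front keeps SMARTS parsing out of the per-molecule loop and surfaces typos in the pattern table before any library is scanned.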

Basic Substructure Match

Goal: Test whether a molecule contains a SMARTS pattern and enumerate the matching atom indices.

Approach: Parse the molecule with MolFromSmiles and the pattern with MolFromSmarts, gate with HasSubstructMatch, then call GetSubstructMatches and map each atom index back to the molecule for inspection.

from rdkit import Chem

mol = Chem.MolFromSmiles('c1ccc(O)cc1CCO')
pattern = Chem.MolFromSmarts('[OX2H]')

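if mol.HasSubstructMatch(pattern):
    matches = mol.GetSubstructMatches(pattern)  # one 1-tuple per OH oxygen
    for match in matches:
        # Each match holds molecule atom indices, one per pattern atom
        atoms = [mol.GetAtomWithIdx(i).GetSymbol() for i in match]
        print(match, atoms)

HasSubstructMatch returns a bool; GetSubstructMatches returns a tuple of tuples of atom indices, one inner tuple per match.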
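
As a worked illustration of those return values, the sketch below separates the phenol and aliphatic hydroxyls of the same molecule and shows the effect of the uniquify argument on a symmetric ring. The variable names are illustrative.

```python
from rdkit import Chem

mol = Chem.MolFromSmiles('c1ccc(O)cc1CCO')
hydroxyl = mol.GetSubstructMatches(Chem.MolFromSmarts('[OX2H]'))
phenol = mol.GetSubstructMatches(Chem.MolFromSmarts('[OX2H][c]'))
aliphatic = mol.GetSubstructMatches(Chem.MolFromSmarts('[OX2H][CX4]'))
print(hydroxyl, phenol, aliphatic)

# Symmetric patterns map onto the same atoms in several ways.
benzene = Chem.MolFromSmiles('c1ccccc1')
ring = Chem.MolFromSmarts('c1ccccc1')
unique = benzene.GetSubstructMatches(ring)                      # default uniquify=True
all_mappings = benzene.GetSubstructMatches(ring, uniquify=False)
print(len(unique), len(all_mappings))
```

With the default uniquify=True, matches that cover the same atom set are collapsed; counting functional groups normally wants that behavior, while atom-by-atom mapping may need every permutation.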

Recursive SMARTS (key for postdoc-grade patterns)

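[$(pattern)] matches a single atom whose environment matches the whole inner pattern, with that atom as the pattern's first atom. Only that first atom appears in the match, which makes recursive SMARTS the standard tool for context-aware matching.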

# Aromatic carbon attached to a carbonyl
pat = Chem.MolFromSmarts('[$(c[C](=O))]')

# Aniline-type N (NH2 on an aromatic carbon)
pat = Chem.MolFromSmarts('[$([NX3;H2][c])]')

# Trialkylamine (N with 3 sp3 carbon neighbors)
pat = Chem.MolFromSmarts('[$([NX3]([CX4])([CX4])[CX4])]')

# H-bond donor (N or O with at least one H; skips 3-connected cationic N such as iminium)
hbd = Chem.MolFromSmarts('[#7,#8;!H0;!$([NX3+])]')

# H-bond acceptor (N or O, with or without H; excludes the nitro N)
hba = Chem.MolFromSmarts('[#7,#8;!$([NX3+]=O);!$(N(=O)~O)]')
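
To make the single-atom behavior concrete, the following sketch compares a plain pattern with its recursive form on acetophenone. The variable names are illustrative.

```python
from rdkit import Chem

mol = Chem.MolFromSmiles('CC(=O)c1ccccc1')  # acetophenone

# Plain pattern: every pattern atom appears in the match tuple.
plain = mol.GetSubstructMatches(Chem.MolFromSmarts('c[C](=O)'))

# Recursive pattern: same environment test, but only the aromatic carbon is returned.
recursive = mol.GetSubstructMatches(Chem.MolFromSmarts('[$(c[C](=O))]'))

print(plain)
print(recursive)
```

The recursive form is the one to embed inside larger patterns, since it contributes exactly one atom to the enclosing match.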

Structural-Alert Filter Catalogs

| Filter | Origin | Patterns | Use case | Failure mode |
| --- | --- | --- | --- | --- |
| PAINS_A | Baell & Holloway 2010, family A (alerts hit by the most screening compounds) | 16 | Flag known pan-assay interferers | Many false positives in primary screens; legitimate medicines flagged |
| PAINS_B | Baell & Holloway 2010, family B | 55 | Broader PAINS coverage | Similar |
| PAINS_C | Baell & Holloway 2010, family C (alerts with the fewest supporting compounds) | 409 | Broadest PAINS coverage | Weakest evidence per pattern; most false positives |
| BRENK | Brenk 2008 (DDS unsuitable) | 105 | Reactive / toxicity / undesirable | Useful for fragment / virtual library |
| NIH | NIH MLSMR | 180 | Reactive groups, unstable | Legacy filter |
| ZINC | ZINC clean-leads | 50 | Drug-like cleanup | Used for library standardization |
| Aldridge | Aldridge medchem rules | ~50 | medchem ugly substructures | Hand-curated |
| Glaxo / Eli Lilly | Vendor lists | varies | Internal "ugly" filters | Often unpublished |
| REOS | Walters & Murcko 2002 | property + structural | Drug-likeness combined filter | Hand-curated thresholds |

When to Apply Each Filter

| Scenario | Catalog | Reason |
| --- | --- | --- |
| Hit validation from biochemical screen | PAINS_A | Identify assay-interference candidates |
| Library prep for HTS | PAINS_A + Brenk + ZINC | Remove clearly bad |
| Fragment library design | Brenk + ZINC | Remove reactive; PAINS less critical at fragments |
| Lead optimization | None mandatory | Filters can exclude valid leads |
| Natural product analog | None | Filters trained on synthetic chemistry |
| Covalent inhibitor design | Skip warhead filter | Warheads ARE the design |

Critical: Capuzzi et al. showed that 8% of FDA-approved drugs match a PAINS pattern. PAINS is a flag for follow-up assay validation, not a hard exclusion filter.

PAINS Filter

Goal: Split a molecule list into PAINS-flagged and PAINS-clean sets using one or more PAINS catalog tiers.

Approach: Configure FilterCatalogParams with the requested catalog enums, build a FilterCatalog once, and for each molecule use GetFirstMatch to either bucket it as clean or record the matching pattern description.

from rdkit.Chem.FilterCatalog import FilterCatalog, FilterCatalogParams

def pains_filter(mols, catalogs=('PAINS_A',)):
    params = FilterCatalogParams()
    for cat in catalogs:
        params.AddCatalog(getattr(FilterCatalogParams.FilterCatalogs, cat))
    catalog = FilterCatalog(params)

    flagged = []
    clean = []
    for mol in mols:
        if mol is None:
            continue
        entry = catalog.GetFirstMatch(mol)
        if entry is None:
            clean.append(mol)
        else:
            flagged.append((mol, entry.GetDescription()))
    return clean, flagged

Available catalog names: PAINS_A, PAINS_B, PAINS_C, PAINS (all), BRENK, NIH, ZINC, ALL.
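
When every matching alert is needed rather than only the first, FilterCatalog.GetMatches can be used in place of GetFirstMatch. The sketch below assumes the combined PAINS catalog; the helper name alert_names and the 5-benzylidene rhodanine test structure are illustrative choices.

```python
from rdkit import Chem
from rdkit.Chem.FilterCatalog import FilterCatalog, FilterCatalogParams

# Build the catalog once at module level and reuse it for every molecule.
params = FilterCatalogParams()
params.AddCatalog(FilterCatalogParams.FilterCatalogs.PAINS)
catalog = FilterCatalog(params)

def alert_names(mol):
    """Descriptions of all catalog entries that match the molecule."""
    return [entry.GetDescription() for entry in catalog.GetMatches(mol)]

rhodanine = Chem.MolFromSmiles('O=C1NC(=S)SC1=Cc1ccccc1')  # 5-benzylidene rhodanine
benzene = Chem.MolFromSmiles('c1ccccc1')
print(alert_names(rhodanine))
print(alert_names(benzene))
```

Recording all alert names per compound makes it easier to review flagged hits by alert class instead of discarding them outright.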

Reaction-Reactive Group Filter (custom)

For HTS triage, filter electrophilic warheads (acrylamide, chloroacetamide, etc.) unless designing covalent inhibitors.

Goal: Flag molecules containing electrophilic warheads or other reactive functional groups that would interfere with biochemical HTS.

Approach: Maintain a named SMARTS dictionary of reactive groups (acid halides, epoxides, Michael acceptors, etc.), then per molecule scan each pattern with HasSubstructMatch and return the first matching warhead name.

from rdkit import Chem

REACTIVE_SMARTS = {
    'acid_anhydride': '[CX3](=O)O[CX3](=O)',
    'acid_halide': '[CX3](=O)[F,Cl,Br,I]',
    'alpha_halo_carbonyl': '[CX3](=O)C([F,Cl,Br,I])',
    'aldehyde_reactive': '[CX3H1](=O)[#6;X4]',  # aliphatic aldehydes
    'epoxide': 'C1OC1',
    'aziridine': 'C1NC1',
    'isocyanate': '[NX2]=C=[OX1]',
    'isothiocyanate': '[NX2]=C=[SX1]',
    'beta_lactam': 'C1(=O)NCC1',
    'sulfonyl_halide': '[SX4](=O)(=O)[F,Cl,Br,I]',
    'Michael_acceptor': '[CX3]=[CX3][CX3]=O',
    'vinyl_sulfone': '[SX4](=O)(=O)C=C',
}

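# Parse each SMARTS once instead of on every call
REACTIVE_PATTERNS = {name: Chem.MolFromSmarts(s) for name, s in REACTIVE_SMARTS.items()}

def reactive_filter(mol, exclude_warheads=True):
    # Always return (is_reactive, warhead_name) so callers can unpack safely;
    # a bare tuple such as (False, None) is truthy, so never test it directly.
    if not exclude_warheads:
        return False, None
    for name, pattern in REACTIVE_PATTERNS.items():
        if mol.HasSubstructMatch(pattern):
            return True, name
    return False, None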

For covalent-inhibitor design, see chemoinformatics/covalent-design; these warheads are the desired chemistry, not noise to filter.
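
For triage reports it can help to list every reactive group in a molecule rather than stopping at the first. This is a small sketch using three of the patterns above; the names WARHEADS and reactive_groups are hypothetical.

```python
from rdkit import Chem

# Three patterns from the dictionary above, compiled once.
WARHEADS = {
    'acid_halide': Chem.MolFromSmarts('[CX3](=O)[F,Cl,Br,I]'),
    'alpha_halo_carbonyl': Chem.MolFromSmarts('[CX3](=O)C([F,Cl,Br,I])'),
    'Michael_acceptor': Chem.MolFromSmarts('[CX3]=[CX3][CX3]=O'),
}

def reactive_groups(mol):
    """Names of all reactive-group patterns present in the molecule."""
    return [name for name, pat in WARHEADS.items() if mol.HasSubstructMatch(pat)]

acrylamide = Chem.MolFromSmiles('C=CC(=O)Nc1ccccc1')
chloroacetamide = Chem.MolFromSmiles('ClCC(=O)N')
ethanol = Chem.MolFromSmiles('CCO')
print(reactive_groups(acrylamide), reactive_groups(chloroacetamide), reactive_groups(ethanol))
```

An empty list means no listed warhead was found, which avoids the ambiguity of mixing bool and tuple return values.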

Library Filtering with Multiple Patterns

Goal: Reduce a molecule library to those that match all required SMARTS patterns and none of the excluded ones.

Approach: Start from the full molecule list, iteratively intersect with each include SMARTS using HasSubstructMatch, then subtract any molecule matching an exclude SMARTS.

def filter_library(mols, include=None, exclude=None):
    keep = list(mols)
    if include:
        for s in include:
            p = Chem.MolFromSmarts(s)
            keep = [m for m in keep if m and m.HasSubstructMatch(p)]
    if exclude:
        for s in exclude:
            p = Chem.MolFromSmarts(s)
            keep = [m for m in keep if m and not m.HasSubstructMatch(p)]
    return keep

Atom Map Indices in SMARTS

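Atom maps [C:1] track atoms through transformations. They are used in reactions (reaction-enumeration skill) but also for substructure-based extraction:

# Find amide N with attached aryl. The match holds one index per pattern atom,
# including the unmapped carbonyl O, so select atoms by map number.
pat = Chem.MolFromSmarts('[CX3:1](=O)[NX3:2][c:3]')
match = mol.GetSubstructMatch(pat)  # empty tuple when there is no match

if match:
    by_map = {a.GetAtomMapNum(): match[a.GetIdx()]
              for a in pat.GetAtoms() if a.GetAtomMapNum()}
    amide_C, amide_N, aryl_C = by_map[1], by_map[2], by_map[3]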

Per-Tool Failure Modes

PAINS -- false positive on natural product

Trigger: Library contains natural products, polyphenols, flavonoids, quinones.

Mechanism: PAINS_A patterns target rhodanines, curcumins, polyhydroxylated polyphenols -- legitimate scaffolds in natural-product chemistry.

Symptom: Library hits flagged as PAINS but trace back to validated natural products with confirmed activity.

Fix: Use PAINS as a flag not a delete. Cross-check flagged compounds for orthogonal-assay confirmation (label-free e.g. SPR, ITC).

Aromaticity dialect mismatch

Trigger: SMARTS pattern with c (aromatic) for a heteroatom-rich ring; molecule parsed with different aromaticity model.

Mechanism: RDKit, OpenEye, ChemAxon differ on whether furan, thiazole, tropone, etc. are aromatic.

Symptom: Same pattern matches in one toolkit, not in another.

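Fix: Re-canonicalize molecules within RDKit before applying SMARTS, so the aromaticity flags come from the same toolkit that does the matching. Or write the pattern without aromaticity primitives, e.g. [#6]~[#6] plus ring constraints instead of c:c; note that [#6]:[#6] still requires an aromatic bond and does not avoid the problem.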
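
The effect can be reproduced inside RDKit alone by clearing the aromatic flags of a molecule, which stands in here for a structure perceived under a different aromaticity model. This is a sketch; the generic ring query is one possible aromaticity-independent pattern, not the only one.

```python
from rdkit import Chem

aromatic_query = Chem.MolFromSmarts('c1ccccc1')
# Atomic numbers plus "any bond": no reliance on aromatic flags.
generic_query = Chem.MolFromSmarts('[#6]~1~[#6]~[#6]~[#6]~[#6]~[#6]~1')

benzene = Chem.MolFromSmiles('c1ccccc1')
kekule = Chem.Mol(benzene)                       # independent copy
Chem.Kekulize(kekule, clearAromaticFlags=True)   # mimic a non-aromatic representation

results = {
    'aromatic_on_aromatic': benzene.HasSubstructMatch(aromatic_query),
    'aromatic_on_kekule': kekule.HasSubstructMatch(aromatic_query),
    'generic_on_aromatic': benzene.HasSubstructMatch(generic_query),
    'generic_on_kekule': kekule.HasSubstructMatch(generic_query),
}
print(results)
```

The generic query is less selective (it also matches cyclohexane), so it usually needs extra constraints such as unsaturation or ring-size primitives.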

Tautomer-sensitive pattern miss

Trigger: SMARTS targets keto form C(=O) but molecule is enol C(O)=C.

Mechanism: Default canonical form differs by toolkit + standardization choice.

Symptom: Known matching molecule reports no match.

Fix: Use tautomer-aware match: enumerate tautomers and OR-match. Or canonicalize first via chemoinformatics/molecular-standardization. Or expand pattern with [$(C(=O)),$(C(O)=C)].
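
One way to sketch the enumerate-and-OR-match option uses RDKit's TautomerEnumerator from rdMolStandardize with default settings. The helper name matches_any_tautomer is hypothetical, and the enumeration rules are RDKit's defaults rather than anything specific to this skill.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

def matches_any_tautomer(mol, pattern, enumerator=None):
    """True if any enumerated tautomer of mol contains the pattern."""
    enumerator = enumerator or rdMolStandardize.TautomerEnumerator()
    return any(t.HasSubstructMatch(pattern) for t in enumerator.Enumerate(mol))

ketone = Chem.MolFromSmarts('[#6][CX3](=O)[#6]')
enol = Chem.MolFromSmiles('C=C(C)O')  # enol form of acetone

direct = enol.HasSubstructMatch(ketone)
tautomer_aware = matches_any_tautomer(enol, ketone)
print(direct, tautomer_aware)
```

Enumeration multiplies the matching cost by the number of tautomers, so for large libraries canonicalizing once up front is usually the cheaper route.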

Stereochemistry ignored

Trigger: Query carrying @ or / \ stereo markers is matched with default settings against a mol with explicit stereo.

Mechanism: SMARTS matching is stereo-agnostic by default.

Symptom: Wrong stereoisomer is matched as well as right one.

Fix: mol.GetSubstructMatches(pattern, useChirality=True) to require chirality match.
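
The difference is easy to see on a small chiral centre. The sketch below follows the pattern used in the RDKit Getting Started guide, with queries parsed from SMILES; the same useChirality argument is passed when the query comes from SMARTS.

```python
from rdkit import Chem

mol = Chem.MolFromSmiles('CC[C@H](F)Cl')
same = Chem.MolFromSmiles('C[C@H](F)Cl')     # same configuration
mirror = Chem.MolFromSmiles('C[C@@H](F)Cl')  # opposite configuration

# Default: stereo markers in the query are ignored.
default_hits = (mol.HasSubstructMatch(same), mol.HasSubstructMatch(mirror))

# useChirality=True: the configuration must agree.
chiral_hits = (mol.HasSubstructMatch(same, useChirality=True),
               mol.HasSubstructMatch(mirror, useChirality=True))
print(default_hits, chiral_hits)
```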

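Ring closure / fused ring over-match

Trigger: Pattern uses c1ccccc1 to mean "phenyl" but the library contains fused systems (naphthalene, indole).

Mechanism: c1ccccc1 only requires six aromatic carbons joined in a ring; it places no limit on additional ring bonds, so the benzo ring of a fused system also matches. The opposite miss appears once ring-count primitives such as R1 are added, because fusion atoms belong to two rings.

Symptom: Naphthalene and indole are counted as benzene hits, or a ring-count-restricted pattern silently skips fused rings.

Fix: State the intent in the pattern. [cR1]1[cR1][cR1][cR1][cR1][cR1]1 matches only isolated benzene rings; plain c1ccccc1 (equivalently [c]1[c][c][c][c][c]1) matches any aromatic six-carbon ring, including fused ones.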
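
A quick check of how the benzene pattern behaves on a fused system, assuming RDKit's default aromaticity and ring perception. The R1-restricted pattern is shown as one way to require an isolated ring; variable names are illustrative.

```python
from rdkit import Chem

any_benzene = Chem.MolFromSmarts('c1ccccc1')
# R1 = atom is a member of exactly one ring, which excludes fusion atoms.
isolated_benzene = Chem.MolFromSmarts('[cR1]1[cR1][cR1][cR1][cR1][cR1]1')

naphthalene = Chem.MolFromSmiles('c1ccc2ccccc2c1')
toluene = Chem.MolFromSmiles('Cc1ccccc1')

fused_hits = len(naphthalene.GetSubstructMatches(any_benzene))
fused_isolated = naphthalene.HasSubstructMatch(isolated_benzene)
toluene_isolated = toluene.HasSubstructMatch(isolated_benzene)
print(fused_hits, fused_isolated, toluene_isolated)
```

In RDKit the plain ring pattern matches both rings of naphthalene, so any miss on a fused system points at extra constraints in the pattern rather than at the ring closure itself.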

Recursive SMARTS performance

Trigger: Deeply nested recursive SMARTS over a large library.

Mechanism: Each [$()] re-evaluates the inner pattern for every candidate atom.

Symptom: Search 10x-100x slower than expected.

Fix: Flatten recursion where possible; pre-filter with simpler pattern, then re-test with the recursive one.
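
The pre-filter idea can be sketched as a two-stage search. The function name two_stage_search is hypothetical, and the approach is only valid when the cheap pattern is a necessary condition for the expensive one.

```python
from rdkit import Chem

def two_stage_search(mols, cheap, expensive):
    """Keep molecules matching `expensive`; `cheap` must be implied by it."""
    hits = []
    for mol in mols:
        if mol is None or not mol.HasSubstructMatch(cheap):
            continue  # rejected without evaluating the recursive pattern
        if mol.HasSubstructMatch(expensive):
            hits.append(mol)
    return hits

cheap = Chem.MolFromSmarts('[NX3;H0]')                # any N with three heavy neighbors
expensive = Chem.MolFromSmarts('[NX3;H0;!$(NC=O)]')   # tertiary amine, not amide

smiles = ['CCN(CC)CC', 'CC(=O)N(C)C', 'c1ccccc1']
mols = [Chem.MolFromSmiles(s) for s in smiles]
hits = two_stage_search(mols, cheap, expensive)
print([Chem.MolToSmiles(m) for m in hits])
```

Whether this pays off depends on how many molecules the cheap stage rejects, so it is worth timing on a sample of the real library before adopting it.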

Common Errors

| Symptom | Cause | Fix |
| --- | --- | --- |
| Chem.MolFromSmarts returns None | Invalid SMARTS grammar | Test the result for None before matching; check parens and brackets against the parser error RDKit logs |
| [OH] misses expected hydroxyls | Oxygen is deprotonated ([O-]) or does not carry exactly one H in the parsed structure | Check the charge state of the input; use [OX2H] or [O;H1] to make the intent explicit |
| Pattern matches but library is "empty" | Mol failed sanitize | Try Chem.SDMolSupplier(path, sanitize=False) then catch errors |
| Multiple matches per molecule | Single-match query expected | GetSubstructMatch returns first; GetSubstructMatches returns all |
| Match indices but no fragment | Match returns atom indices in pattern order | Map to original mol via mol.GetAtomWithIdx(i) |
| PAINS catalog initialization slow | Re-parsing hundreds of patterns on every call | Build catalog once, reuse for batch |
| Stereo SMARTS not matching | useChirality=False (default) | mol.GetSubstructMatches(p, useChirality=True) |

References

  • Baell & Holloway (2010), J. Med. Chem. 53:2719 -- original PAINS filter.
  • Capuzzi et al. (2017), J. Chem. Inf. Model. 57:417 -- PAINS reality check (FDA drug overlap).
  • Brenk et al. (2008), ChemMedChem 3:435 -- structural alerts (BRENK filter).
  • Walters & Murcko (2002), Adv. Drug Deliv. Rev. 54:255 -- REOS filter framework.
  • Bruns & Watson (2012), J. Med. Chem. 55:9763 -- Eli Lilly medchem rules.
  • Daylight SMARTS theory documentation -- complete grammar reference.

Related Skills

  • chemoinformatics/molecular-io - Parse molecules before searching
  • chemoinformatics/molecular-standardization - Canonicalize tautomers before SMARTS
  • chemoinformatics/similarity-searching - Fingerprint-based fuzzy matching
  • chemoinformatics/scaffold-analysis - Scaffold-based pattern derivation
  • chemoinformatics/reaction-enumeration - SMARTS for chemical transformations
  • chemoinformatics/admet-prediction - PAINS as ADMET filter
  • chemoinformatics/covalent-design - Warhead chemistry