skills/chemoinformatics/molecular-descriptors

Version Compatibility

Reference examples tested with: RDKit 2024.09+, numpy 1.26+, pandas 2.2+, mapchiral 0.1+ (MAP4), mhfp 1.9+, molfeat 0.10+.

Before using code patterns, verify installed versions match. If versions differ:

  • Python: pip show <package> then help(module.function) to check signatures

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
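The version check described above can be scripted. This is a minimal sketch using only the standard library; the distribution names in `MINIMUMS` are assumed to match the PyPI package names, and the helper names (`version_tuple`, `check_versions`) are illustrative, not part of any of the listed packages.

```python
from importlib import metadata

# Minimum versions copied from the compatibility note; the distribution
# names are assumptions (PyPI names can differ from import names).
MINIMUMS = {
    "rdkit": "2024.09",
    "numpy": "1.26",
    "pandas": "2.2",
    "mapchiral": "0.1",
    "mhfp": "1.9",
    "molfeat": "0.10",
}


def version_tuple(version):
    """Turn '2024.09.1' into (2024, 9, 1); stop at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def check_versions(minimums=MINIMUMS):
    """Return {package: (installed_version_or_None, meets_minimum)}."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = (None, False)
            continue
        ok = version_tuple(installed) >= version_tuple(minimum)
        report[name] = (installed, ok)
    return report
```

Calling `check_versions()` and printing the result shows which packages are missing or older than the tested versions before any example is run.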

Molecular Descriptors

Featurize molecules for similarity search, QSAR, virtual screening, or ML. The right fingerprint or descriptor depends on the chemotype: ECFP4 dominates drug-like organic similarity, AtomPair and TopologicalTorsion outperform it for scaffold hopping, MAP4/MHFP6 win on metabolomics-scale chemical diversity, and 3D conformer-based descriptors are essential when shape and stereochemistry matter.

For canonicalization before featurization, see chemoinformatics/molecular-standardization. For 3D-only descriptors, see chemoinformatics/conformer-generation.

Fingerprint Taxonomy

Fingerprint | Type | Radius/Path | Bits | Use case | Fails when
Morgan (ECFP) | Circular | r=2 (ECFP4), r=3 (ECFP6) | 2048 typical | Drug-like similarity, ML default | Loses long-range topology; bit collisions at low nBits
FCFP | Functional Morgan | r=2 default | 2048 | Pharmacophore-aware similarity | Same caveats as ECFP; less specific
MACCS | Substructure key | 166 fixed bits | 167 | Quick fingerprint, drug-likeness | Too sparse for large diverse libraries
RDKit FP | Path-based | linear paths up to 7 atoms | 2048 | RDKit-native ECFP alternative | Drug-like only; not optimal for scaffold hopping
AtomPair | Pair + topological distance | All atom pairs | 2048 | Scaffold hopping; flexible molecules | Slower than ECFP; harder to interpret
TopologicalTorsion | 4-atom torsion | All TT | 2048 | Scaffold hopping; less hit-rate | Like AP, slower than ECFP
Avalon | Substructure + atom pairs | Mixed | 512/1024 | Fast similarity | Less standard; older
MAP4 (MinHashed atom-pair) | MinHash atom-pair | r=1,2 | 1024/2048 | Biological + metabolite diversity | Library required (mapchiral); slower hash
MHFP6 (MinHash) | MinHash ECFP-like | r=3 (diam 6) | 2048 | Big-data nearest-neighbor (Annoy) | Different distance (Jaccard on MinHash)
Pharm2D | 2D pharmacophore | feature pairs/triplets | sparse | Pharmacophore search | Sparse, slower

Decision: For drug-like similarity ranking, use 2048-bit ECFP4: it is the established baseline, fast, and well understood. For diverse libraries (>1M compounds, metabolomics, peptides), MHFP6 outperforms ECFP4 on analog recovery (Probst & Reymond 2018). For scaffold hopping, AtomPair beats ECFP4 on retrospective scaffold-hopping benchmarks but tends to lose to it on retrospective single-target similarity ranking.

Bit vs Count Vectors

Form | Use | Library impact
Bit (0/1) | Tanimoto similarity, BulkTanimotoSimilarity, RDKit fingerprint folding | Standard for similarity
Count (integer) | Some ML methods, RF on counts, neural fingerprints | Loses bit-level fast operations; richer signal
Sparse (dict) | Direct chemical interpretation (which fragments at which atoms) | Use for SHAP / atomic attribution

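from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator

mol = Chem.MolFromSmiles('CCO')

# One generator yields all three forms (the older AllChem.GetMorgan* helpers are deprecated)
morgan = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)

ecfp4_bit = morgan.GetFingerprint(mol)                 # folded 2048-bit vector
ecfp4_count = morgan.GetCountFingerprint(mol)          # folded 2048-slot count vector
ecfp4_sparse = morgan.GetSparseCountFingerprint(mol)   # unfolded feature ID -> count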
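To make the bit versus count distinction concrete, the sketch below computes Tanimoto similarity in plain Python for both forms. It is an illustration, not RDKit's implementation: the count version uses the min/max generalization, which is one common choice, and returning 0.0 for two empty fingerprints is a convention picked here. The function names are hypothetical.

```python
def tanimoto_bits(on_bits_a, on_bits_b):
    """Tanimoto on bit fingerprints given as collections of 'on' bit indices."""
    a, b = set(on_bits_a), set(on_bits_b)
    union = len(a | b)
    # Convention chosen for this sketch: two empty fingerprints score 0.0.
    return len(a & b) / union if union else 0.0


def tanimoto_counts(counts_a, counts_b):
    """Min/max Tanimoto on count fingerprints given as {feature_id: count}."""
    keys = set(counts_a) | set(counts_b)
    shared = sum(min(counts_a.get(k, 0), counts_b.get(k, 0)) for k in keys)
    total = sum(max(counts_a.get(k, 0), counts_b.get(k, 0)) for k in keys)
    return shared / total if total else 0.0


# Hypothetical fingerprints: feature 1 occurs twice in the first molecule.
fp_a = {1: 2, 2: 1}
fp_b = {1: 1, 3: 1}

bit_similarity = tanimoto_bits(fp_a.keys(), fp_b.keys())   # presence only
count_similarity = tanimoto_counts(fp_a, fp_b)             # uses multiplicity
```

The two scores differ because the bit form only records that feature 1 is present, while the count form also penalizes the mismatch in how often it occurs.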

Morgan / ECFP Radius Math

ECFP-X notation: X is the diameter in bonds. RDKit's radius parameter is half of X.

Notation | RDKit radius | Diameter | Captures
ECFP0 | 0 | 0 | Atom identity only
ECFP2 | 1 | 2 | Atom + immediate neighbors
ECFP4 | 2 | 4 | Atom + 2-bond environment
ECFP6 | 3 | 6 | Atom + 3-bond environment

Trade-off: Larger radius captures more specific local environment but inflates bit-collision rate at fixed nBits. For QSAR with <10k compounds, ECFP4 2048 is the established default (Rogers & Hahn 2010; MoleculeNet benchmarks Wu 2018). For large libraries (>1M), use nBits=4096 or unhashed sparse representation to reduce ~1-5% bit-collision rate (O'Boyle 2016).
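The effect of nBits on collisions can be estimated with a back-of-the-envelope model. The sketch below assumes each distinct substructure feature hashes uniformly and independently into one of `n_bits` positions, which is an idealization rather than a description of RDKit's hashing; the feature count of 50 is a hypothetical value for a drug-like molecule.

```python
def expected_collision_fraction(n_features, n_bits):
    """Expected fraction of distinct features lost to bit collisions.

    Assumes uniform, independent hashing of n_features into n_bits slots.
    Expected occupied bits = n_bits * (1 - (1 - 1/n_bits) ** n_features).
    """
    if n_features <= 0 or n_bits <= 0:
        raise ValueError("n_features and n_bits must be positive")
    occupied = n_bits * (1.0 - (1.0 - 1.0 / n_bits) ** n_features)
    return 1.0 - occupied / n_features


# Hypothetical molecule with 50 distinct Morgan environments.
loss_2048 = expected_collision_fraction(50, 2048)
loss_4096 = expected_collision_fraction(50, 4096)
```

Under this model, doubling nBits roughly halves the expected loss, and raising the radius increases `n_features` and therefore the loss at a fixed nBits.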

FCFP vs ECFP

FCFP (Functional-Class) uses atom invariants based on pharmacophore role (donor, acceptor, aromatic, halogen, basic, acidic) instead of atom identity. It trades atom specificity for functional equivalence.

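from rdkit.Chem import rdFingerprintGenerator as rfg

ecfp4 = rfg.GetMorganGenerator(radius=2, fpSize=2048).GetFingerprint(mol)
fcfp4 = rfg.GetMorganGenerator(radius=2, fpSize=2048, atomInvariantsGenerator=rfg.GetMorganFeatureAtomInvGen()).GetFingerprint(mol)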

When to use FCFP4: Scaffold-hopping campaigns, pharmacophore-driven similarity, cross-target activity prediction.

When to use ECFP4: Within-series QSAR, lead optimization, when chemotype identity matters.

molfeat: Pluggable Featurization

molfeat (Datamol/Molecular AI) wraps RDKit, Mordred, Vina, and hand-crafted transformers behind a uniform Transformer interface. It is the right choice when you need to swap featurizations programmatically (e.g. for a benchmark or a notebook exploration) rather than hardcoding the RDKit calls.

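from molfeat.trans.fp import FPVecTransformer

# Kind strings must match the installed molfeat registry; if 'ap' or 'mhfp' is
# rejected, look for names such as 'atompair' or 'secfp' in your version.
ecfp4 = FPVecTransformer(kind='ecfp', length=2048, radius=2)
fcfp4 = FPVecTransformer(kind='fcfp', length=2048, radius=2)
ap = FPVecTransformer(kind='ap', length=2048)
mhfp6 = FPVecTransformer(kind='mhfp', length=2048, radius=3)

# list_of_mols is a list of RDKit mols or SMILES strings; X is a numpy array, one row per molecule
X = ecfp4(list_of_mols)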

For large fingerprints (MHFP6, AtomPair) or large libraries, molfeat avoids the boilerplate of building fingerprints one molecule at a time.

3D Descriptors and Conformer Dependence

Conformer-dependent descriptors (asphericity, eccentricity, principal moments of inertia, RDF) require a generated 3D structure. A single conformer is rarely sufficient: descriptor variance across the conformer ensemble can exceed the descriptor signal.

Goal: Compute 3D shape descriptors over a conformer ensemble rather than from a single (possibly unrepresentative) conformer.

Approach: Add explicit hydrogens, embed N conformers with ETKDGv3, MMFF-optimize them all, then evaluate the descriptor across each conformer for downstream averaging.

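from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors3D

mol = Chem.MolFromSmiles('CCCCO')
mol = Chem.AddHs(mol)  # explicit hydrogens are needed for sensible 3D geometry

params = AllChem.ETKDGv3()
params.randomSeed = 42  # reproducible embedding
# EmbedMultipleConfs returns the conformer IDs, not a count
conf_ids = list(AllChem.EmbedMultipleConfs(mol, numConfs=20, params=params))
AllChem.MMFFOptimizeMoleculeConfs(mol)

asphericities = [Descriptors3D.Asphericity(mol, confId=c) for c in conf_ids]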

Decision: For QSAR / ML, compute over a conformer ensemble (n=20-100) and report mean or Boltzmann-weighted average. Single-conformer 3D descriptors are unreliable.
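The Boltzmann-weighted average mentioned above can be sketched as follows. This is a minimal illustration that assumes per-conformer energies in kcal/mol are already available (for example, MMFFOptimizeMoleculeConfs returns a (not_converged, energy) pair per conformer); the function name and the example numbers are hypothetical.

```python
import math

K_B_KCAL = 0.0019872041  # gas constant in kcal/(mol*K)


def boltzmann_average(values, energies_kcal, temperature=298.15):
    """Boltzmann-weighted mean of a descriptor over a conformer ensemble.

    values        : descriptor value per conformer
    energies_kcal : conformer energy per conformer, in kcal/mol
    """
    if not values or len(values) != len(energies_kcal):
        raise ValueError("values and energies_kcal must be non-empty and equal length")
    kt = K_B_KCAL * temperature
    e_min = min(energies_kcal)  # shift by the minimum for numerical stability
    weights = [math.exp(-(e - e_min) / kt) for e in energies_kcal]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


# Hypothetical asphericities and relative energies for three conformers.
avg = boltzmann_average([0.10, 0.30, 0.90], [0.0, 0.5, 10.0])
```

Conformers far above the minimum contribute almost nothing, so the weighted mean is dominated by low-energy geometries, whereas a plain mean treats every embedded conformer equally.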

Partial Charge Methods

Method | Software | Cost | Accuracy | Use for
Gasteiger-Marsili | RDKit, Open Babel | <0.1s/mol | Empirical, rough | AutoDock Vina, fast screening
MMFF94 | RDKit | 0.1s/mol | Force-field consistent | MMFF energy, conformer ranking
AM1-BCC | antechamber (AmberTools) | ~10s/mol | Semi-empirical | MD setup, FEP, GAFF
RESP | psi4, Gaussian | minutes/mol | DFT ESP-fitted | High-accuracy MD, FEP
OpenFF Recharge | openff-recharge | seconds | DFT-derived but cached | OpenFF / SAGE setup
ABCG2 | Open Babel | <1s | Improved empirical | Modern Vina, AutoDock-GPU

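from rdkit.Chem import AllChem

# mol: an RDKit molecule, e.g. the hydrogen-added molecule from the previous example
AllChem.ComputeGasteigerCharges(mol)
for atom in mol.GetAtoms():
    # charges are stored per atom in the private property '_GasteigerCharge'
    print(atom.GetIdx(), atom.GetDoubleProp('_GasteigerCharge'))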

Critical: The charge method must match the downstream force field or scoring function. Gasteiger charges in an AMBER MD run violate the assumptions of the protein force field.

MAP4 and MHFP6 for Diverse Libraries

For libraries spanning drug-like + natural products + peptides + metabolites, ECFP4 saturates Tanimoto similarity (most pairs report 0.1-0.3, hard to rank). MAP4 and MHFP6 use MinHash + atom-pair / circular substructures and discriminate better.

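from mhfp.encoder import MHFPEncoder

encoder = MHFPEncoder(n_permutations=2048)   # instantiate the encoder; 2048 MinHash permutations
mhfp6 = encoder.encode_mol(mol, radius=3)    # encode_mol takes an RDKit mol; encode() takes a SMILES string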

MHFP6 distance is Jaccard on MinHash, not standard Tanimoto. Use MHFPEncoder.distance(fp1, fp2).
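Why the distance is "Jaccard on MinHash" can be seen from a generic MinHash sketch. The code below is not the mhfp implementation: it uses salted BLAKE2b hashes from the standard library as stand-in permutations, and the feature strings are hypothetical substructure labels. It only illustrates that the fraction of matching signature positions estimates the Jaccard similarity of the underlying feature sets.

```python
import hashlib


def minhash_signature(features, n_permutations=512):
    """MinHash signature of a set of string features (generic sketch)."""
    features = set(features)
    if not features:
        raise ValueError("features must be non-empty")
    signature = []
    for i in range(n_permutations):
        salt = i.to_bytes(8, "little")  # a different salt plays the role of each permutation
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for f in features
        ))
    return signature


def minhash_distance(sig_a, sig_b):
    """Estimated Jaccard distance: 1 - fraction of matching positions."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have equal length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return 1.0 - matches / len(sig_a)


# Hypothetical feature sets sharing 2 of 6 distinct features (true Jaccard distance 2/3).
sig_a = minhash_signature({"c1ccccc1", "CO", "CN", "C=O"})
sig_b = minhash_signature({"CN", "C=O", "CS", "CCl"})
estimate = minhash_distance(sig_a, sig_b)
```

Because the signature entries are hash minima rather than bit positions, bit-vector Tanimoto does not apply to them; comparing positions for equality is the appropriate operation.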

Physicochemical Descriptors

Descriptor | Source | Range | Drug-like cutoff
MolWt | RDKit Descriptors.MolWt | ~50-2000 Da | <=500 (Lipinski)
MolLogP (Crippen) | RDKit Descriptors.MolLogP | -5 to 8 | <=5 (Lipinski)
HBD | Lipinski.NumHDonors | 0-10 | <=5 (Lipinski)
HBA | Lipinski.NumHAcceptors | 0-15 | <=10 (Lipinski)
TPSA | Descriptors.TPSA (Ertl) | 0-200 Å² | <=140 (Veber oral); <=90 (BBB+)
RotBonds | Lipinski.NumRotatableBonds | 0-15 | <=10 (Veber)
AromaticRings | Lipinski.NumAromaticRings | 0-6 | <=3-4 (Ritchie-Macdonald aromatic ring count)
HeavyAtoms | Descriptors.HeavyAtomCount | | <=50 (lead-like)
FractionCSP3 | Descriptors.FractionCSP3 | 0-1 | >=0.25 (Lovering 2009 escape-from-flatland)
QED | QED.qed | 0-1 | >=0.5 generally drug-like
SAscore | sascorer.calculateScore (external) | 1-10 | <=4 acceptable; >6 hard to synth

Goal: Compute a standard physicochemical descriptor panel for drug-likeness filtering and QSAR features.

Approach: Combine RDKit Descriptors, Lipinski, and QED calls into a single dict so the caller gets MW, LogP, HBD/HBA, TPSA, rotatable bonds, aromatic rings, fraction sp3, and QED in one pass.

from rdkit.Chem import Descriptors, Lipinski, QED

def physchem(mol):
    return {
        'MolWt': Descriptors.MolWt(mol),
        'MolLogP': Descriptors.MolLogP(mol),
        'HBD': Lipinski.NumHDonors(mol),
        'HBA': Lipinski.NumHAcceptors(mol),
        'TPSA': Descriptors.TPSA(mol),
        'RotBonds': Lipinski.NumRotatableBonds(mol),
        'AromRings': Lipinski.NumAromaticRings(mol),
        'FractionCSP3': Descriptors.FractionCSP3(mol),
        'QED': QED.qed(mol),
    }

Drug-Likeness Rule Sets

Rule | Constraints | Source
Lipinski Ro5 | MW<=500, LogP<=5, HBD<=5, HBA<=10 | Lipinski 1997
Veber | RotBonds<=10, TPSA<=140 | Veber 2002 (oral)
Ghose | 160<=MW<=480, -0.4<=LogP<=5.6, 40<=MR<=130, 20<=atoms<=70 | Ghose 1999
Egan | LogP<=5.88, TPSA<=131.6 | Egan 2000
Muegge | 200<=MW<=600, -2<=LogP<=5, TPSA<=150, rings<=7 | Muegge 2001
Lead-like | MW<=350, LogP<=3 | Teague 1999
Fragment Ro3 | MW<=300, LogP<=3, HBD<=3, HBA<=3, RotBonds<=3 | Congreve 2003
BBB+ Pfizer CNS | TPSA<=90, MW<=500, HBD<=3 | Wager 2010

Use case: Apply Ro5/Veber as a screening filter, not a hard cutoff. ~30% of marketed oral drugs violate at least one Ro5 rule (analyzed by Doak 2014). For oncology indications, Ro5 deviation is common and acceptable.
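Treating the rules as a screening filter rather than a hard cutoff might look like the sketch below. It works on a plain dict using the same keys as the physchem() panel above; the thresholds are the Ro5 and Veber limits from the table, while the function names, the default tolerance of one Ro5 violation, and the example values are assumptions made for illustration.

```python
RO5_LIMITS = {"MolWt": 500, "MolLogP": 5, "HBD": 5, "HBA": 10}
VEBER_LIMITS = {"RotBonds": 10, "TPSA": 140}


def violations(props, limits):
    """Names of the properties that exceed their upper limit."""
    return [name for name, limit in limits.items() if props[name] > limit]


def screen(props, max_ro5_violations=1):
    """Report violations and flag for review instead of silently discarding."""
    ro5 = violations(props, RO5_LIMITS)
    veber = violations(props, VEBER_LIMITS)
    return {
        "ro5_violations": ro5,
        "veber_violations": veber,
        # Tolerance is an assumption of this sketch; tune it per project.
        "flag": len(ro5) > max_ro5_violations or bool(veber),
    }


# Hypothetical descriptor values, not computed from real molecules.
small = {"MolWt": 180.2, "MolLogP": 1.3, "HBD": 1, "HBA": 3, "TPSA": 63.6, "RotBonds": 3}
large = {"MolWt": 812.0, "MolLogP": 6.1, "HBD": 6, "HBA": 12, "TPSA": 201.0, "RotBonds": 14}

small_report = screen(small)
large_report = screen(large)
```

Returning the list of violated rules, rather than a bare pass/fail, keeps the decision reviewable for indications where Ro5 deviation is acceptable.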

QED (Weighted Drug-Likeness)

QED (Bickerton 2012) is a single-number drug-likeness measure (0-1) combining 8 properties (MW, LogP, HBD, HBA, PSA, RotBonds, AromaticRings, structural alerts) via desirability functions.

Caveat: QED's desirability functions were fit to approved drugs, so it under-rates fragment-like and natural-product-like molecules. Do not use it as the sole drug-likeness filter for fragment screens or natural-product libraries.
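The way QED combines its eight desirabilities is a weighted geometric mean, QED = exp(sum(w_i * ln d_i) / sum(w_i)). The sketch below shows only that combination step; the desirability values are hypothetical placeholders, and the per-property desirability functions and published weights are not reproduced here (QED.qed handles both).

```python
import math


def weighted_geometric_mean(desirabilities, weights=None):
    """Combine per-property desirabilities (each in (0, 1]) into one score."""
    if not desirabilities:
        raise ValueError("desirabilities must be non-empty")
    if weights is None:
        weights = [1.0] * len(desirabilities)  # unweighted variant
    if len(weights) != len(desirabilities):
        raise ValueError("weights and desirabilities must have equal length")
    if any(d <= 0.0 or d > 1.0 for d in desirabilities):
        raise ValueError("each desirability must lie in (0, 1]")
    log_sum = sum(w * math.log(d) for w, d in zip(weights, desirabilities))
    return math.exp(log_sum / sum(weights))


# Hypothetical desirabilities for the eight QED properties.
d = [0.9, 0.8, 0.95, 0.9, 0.85, 0.7, 0.9, 0.2]
score = weighted_geometric_mean(d)
```

Because the mean is geometric, a single very low desirability (here the last entry, standing in for structural alerts) pulls the score down more than an arithmetic mean would.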

Common Errors

Symptom | Cause | Fix
Fingerprint changes between runs | Random seed not set for canonicalization | RDKit Morgan is deterministic; check if input differs (stereo, charges)
MACCS bit count != 166 | RDKit MACCS returns 167 bits (bit 0 unused) | Slice [1:] if comparing to literature 166-bit
Crippen LogP differs from XLogP | Different model | Use Descriptors.MolLogP for Crippen; XLogP3 requires external lib
3D descriptor differs between calls | Different conformer | Set confId=0 explicitly; or average over ensemble
QED returns nan | Charged species or non-standard atom | Standardize (uncharge) before QED
Tanimoto on count vector wrong | TanimotoSimilarity expects bit vector | Use Hamming or weighted Tanimoto for counts
MolWt off by ~1 from PubChem | Implicit H counted differently | Use Descriptors.ExactMolWt for monoisotopic; PubChem reports average

References

  • Rogers & Hahn, J. Chem. Inf. Model. 50:742 -- ECFP / Morgan fingerprints.
  • Probst & Reymond, J. Cheminformatics 10:66 -- MHFP6 fingerprint.
  • Capecchi et al., J. Cheminformatics 12:43 -- MAP4 fingerprint.
  • Bickerton et al., Nat. Chem. 4:90 -- QED weighted drug-likeness.
  • Lipinski et al., Adv. Drug Deliv. Rev. 23:3 -- Rule of 5.
  • Veber et al., J. Med. Chem. 45:2615 -- Oral bioavailability rules.
  • Lovering et al., J. Med. Chem. 52:6752 -- Fraction sp3 / escape from flatland.

Related Skills

  • chemoinformatics/molecular-io - Parse molecules before featurization
  • chemoinformatics/molecular-standardization - Canonicalize before fingerprinting
  • chemoinformatics/conformer-generation - Generate 3D for conformer-dependent descriptors
  • chemoinformatics/similarity-searching - Use fingerprints for similarity ranking
  • chemoinformatics/qsar-modeling - ML using these descriptors as features
  • chemoinformatics/admet-prediction - Filter by drug-likeness criteria