skills/chemoinformatics/molecular-descriptors
Version Compatibility
Reference examples tested with: RDKit 2024.09+, numpy 1.26+, pandas 2.2+, mapchiral 0.1+ (MAP4), mhfp 1.9+, molfeat 0.10+.
Before using code patterns, verify installed versions match. If versions differ:
- Python: pip show <package>, then help(module.function) to check signatures
If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
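The version check above can be scripted. The following is a minimal sketch using only the standard library; the distribution names in MINIMUMS (for example "rdkit") are assumptions about how the packages were installed, and the helper names parse_version and check_minimum are hypothetical.

```python
import re
from importlib import metadata


def parse_version(text):
    """Turn '2024.09.1' into (2024, 9, 1); stops at the first non-numeric piece."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def check_minimum(package, minimum):
    """Return (installed_version, ok); installed_version is None if not installed."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return None, False
    return installed, parse_version(installed) >= parse_version(minimum)


# Minimum versions copied from the compatibility note; names are assumed PyPI names.
MINIMUMS = {"rdkit": "2024.09", "numpy": "1.26", "pandas": "2.2",
            "mhfp": "1.9", "molfeat": "0.10"}

for name, minimum in MINIMUMS.items():
    installed, ok = check_minimum(name, minimum)
    print(f"{name}: installed={installed} required>={minimum} ok={ok}")
```

A package reported as missing or too old is a cue to introspect the installed API before adapting any example.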
Molecular Descriptors
Featurize molecules for similarity search, QSAR, virtual screening, or ML. The right fingerprint or descriptor depends on the chemotype: ECFP4 dominates drug-like organic similarity, AtomPair and TopologicalTorsion outperform for scaffold hopping, MAP4/MHFP6 win on metabolomics-scale chemical diversity, and 3D conformer-based descriptors are essential when shape and stereochemistry matter.
For canonicalization before featurization, see chemoinformatics/molecular-standardization. For 3D-only descriptors, see chemoinformatics/conformer-generation.
Fingerprint Taxonomy
| Fingerprint | Type | Radius/Path | Bits | Use case | Fails when |
|---|---|---|---|---|---|
| Morgan (ECFP) | Circular | r=2 (ECFP4), r=3 (ECFP6) | 2048 typical | Drug-like similarity, ML default | Loses long-range topology; bit collisions at low nBits |
| FCFP | Functional Morgan | r=2 default | 2048 | Pharmacophore-aware similarity | Same caveats as ECFP; less specific |
| MACCS | Substructure key | 166 fixed bits | 167 | Quick fingerprint, drug-likeness | Too sparse for large diverse libraries |
| RDKit FP | Path-based | paths up to 7 bonds (default maxPath=7) | 2048 | RDKit-native ECFP alternative | Drug-like only; not optimal for scaffold hopping |
| AtomPair | Pair + topological distance | All atom pairs | 2048 | Scaffold hopping; flexible mol | Slower than ECFP; harder to interpret |
| TopologicalTorsion | 4-atom torsion | All TT | 2048 | Scaffold hopping; less hit-rate | Like AP, slower than ECFP |
| Avalon | Substructure + atom pairs | Mixed | 512/1024 | Fast similarity | Less standard; older |
| MAP4 (MinHashed atom-pair) | MinHash atom-pair | r=1,2 | 1024/2048 | Biological + metabolite diversity | Library required (mapchiral); slower hash |
| MHFP6 (MinHash) | MinHash ECFP-like | r=3 (diam 6) | 2048 | Big-data nearest-neighbor (Annoy) | Different distance (Jaccard on MinHash) |
| Pharm2D | 2D pharmacophore | feature pairs/triplets | sparse | Pharmacophore search | Sparse, slower |
Decision: For drug-like similarity ranking, use 2048-bit ECFP4: it is the established baseline, fast, and well understood. For diverse libraries (>1M compounds, metabolomics, peptides), MHFP6 outperforms ECFP4 on analog recovery (Probst & Reymond 2018). For scaffold hopping, AtomPair beats ECFP4 on retrospective scaffold-hopping benchmarks but loses to it on retrospective single-target screens.
Bit vs Count Vectors
| Form | Use | Library impact |
|---|---|---|
| Bit (0/1) | Tanimoto similarity, BulkTanimotoSimilarity, RDKit fingerprint folding | Standard for similarity |
| Count (integer) | Some ML methods, RF on counts, neural fingerprints | Loses bit-level fast operations; richer signal |
| Sparse (dict) | Direct chemical interpretation (which fragments at which atoms) | Use for SHAP / atomic attribution |
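from rdkit import Chem
from rdkit.Chem import AllChem
mol = Chem.MolFromSmiles('CCO')
# Legacy helpers: still available, but recent RDKit flags them as deprecated in favor of
# rdFingerprintGenerator.GetMorganGenerator; check help() on your installed version.
ecfp4_bit = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)  # folded 0/1 bit vector
ecfp4_count = AllChem.GetHashedMorganFingerprint(mol, radius=2, nBits=2048)  # folded integer counts
ecfp4_sparse = AllChem.GetMorganFingerprint(mol, radius=2)  # unfolded sparse {feature id: count}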
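To make the bit-versus-count distinction concrete, here is a minimal plain-Python sketch (no RDKit). The feature ids and counts are made up for illustration, and the min/max form is one common count-based generalization of Tanimoto, not necessarily the one every library uses.

```python
def tanimoto_bits(a, b):
    """Tanimoto on sets of on-bit indices: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def tanimoto_counts(a, b):
    """Count-based (min/max) Tanimoto on {feature_id: count} dicts."""
    keys = set(a) | set(b)
    num = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    den = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    return num / den if den else 0.0


# Hypothetical feature ids; feature 101 stands in for a repeated fragment.
short_chain = {101: 2, 202: 1, 303: 1}
long_chain = {101: 8, 202: 1, 303: 1}

print(tanimoto_bits(short_chain, long_chain))    # 1.0: same features present
print(tanimoto_counts(short_chain, long_chain))  # 0.4: counts tell them apart
```

The two molecules are indistinguishable as bit vectors but clearly different as count vectors, which is the "richer signal" the table refers to.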
Morgan / ECFP Radius Math
ECFP-X notation: X is the diameter in bonds. RDKit's radius parameter is half of X.
| Notation | RDKit radius | Diameter | Captures |
|---|---|---|---|
| ECFP0 | 0 | 0 | Atom identity only |
| ECFP2 | 1 | 2 | Atom + immediate neighbors |
| ECFP4 | 2 | 4 | Atom + 2-bond environment |
| ECFP6 | 3 | 6 | Atom + 3-bond environment |
Trade-off: A larger radius captures a more specific local environment but inflates the bit-collision rate at fixed nBits. For QSAR with <10k compounds, ECFP4 2048 is the established default (Rogers & Hahn 2010; MoleculeNet benchmarks Wu 2018). For large libraries (>1M), use nBits=4096 or an unhashed sparse representation to reduce the ~1-5% bit-collision rate (O'Boyle 2016).
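The radius arithmetic and the folding trade-off can be sketched numerically. This is a back-of-the-envelope model, assuming each distinct feature lands on a uniformly random bit; the figure of 60 features per molecule is a hypothetical input, not a measured value.

```python
def ecfp_radius(diameter):
    """Map the X in ECFP-X (a diameter in bonds) to the Morgan radius, X / 2."""
    if diameter < 0 or diameter % 2:
        raise ValueError("ECFP diameter must be a non-negative even integer")
    return diameter // 2


def expected_collision_fraction(n_features, n_bits):
    """Expected fraction of distinct features lost when folding to n_bits.

    Under uniform random placement the expected number of occupied bits is
    n_bits * (1 - (1 - 1/n_bits) ** n_features).
    """
    if n_features == 0:
        return 0.0
    occupied = n_bits * (1.0 - (1.0 - 1.0 / n_bits) ** n_features)
    return 1.0 - occupied / n_features


print(ecfp_radius(4), ecfp_radius(6))
for n_bits in (1024, 2048, 4096):
    # 60 distinct features per molecule is an illustrative assumption
    print(n_bits, round(expected_collision_fraction(60, n_bits), 4))
```

Doubling nBits roughly halves the expected collision fraction for a fixed feature count, which is why 4096 bits or a sparse representation helps when a larger radius generates more features.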
FCFP vs ECFP
FCFP (Functional-Class) uses atom invariants based on pharmacophore role (donor, acceptor, aromatic, halogen, basic, acidic) instead of atom identity. It trades atom specificity for functional equivalence.
ecfp4 = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, useFeatures=False)
fcfp4 = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, useFeatures=True)
When to use FCFP4: Scaffold-hopping campaigns, pharmacophore-driven similarity, cross-target activity prediction.
When to use ECFP4: Within-series QSAR, lead optimization, when chemotype identity matters.
molfeat: Pluggable Featurization
molfeat (Datamol/Molecular AI) wraps RDKit, Mordred, Vina, and hand-crafted transformers behind a uniform Transformer interface. It is the right choice when you need to swap featurizations programmatically (e.g. for a benchmark or a notebook exploration) rather than hardcoding the RDKit calls.
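from molfeat.trans.fp import FPVecTransformer
# kind names depend on the installed molfeat version; if one is rejected, check the list of
# supported fingerprints (atom pairs are typically 'atompair'; the MHFP-style folded variant may be 'secfp')
ecfp4 = FPVecTransformer(kind='ecfp', length=2048, radius=2)
fcfp4 = FPVecTransformer(kind='fcfp', length=2048, radius=2)
ap = FPVecTransformer(kind='atompair', length=2048)
mhfp6 = FPVecTransformer(kind='mhfp', length=2048, radius=3)
# list_of_mols is a list of RDKit mols or SMILES; X is a numpy array
X = ecfp4(list_of_mols)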
For large fingerprints (MHFP6, AtomPair) or large libraries, molfeat avoids the boilerplate of building fingerprints one molecule at a time.
3D Descriptors and Conformer Dependence
Conformer-dependent descriptors (asphericity, eccentricity, principal moments of inertia, RDF) require a generated 3D structure. A single conformer is rarely sufficient: descriptor variance across the conformer ensemble can exceed the descriptor signal.
Goal: Compute 3D shape descriptors over a conformer ensemble rather than from a single (possibly unrepresentative) conformer.
Approach: Add explicit hydrogens, embed N conformers with ETKDGv3, MMFF-optimize them all, then evaluate the descriptor across each conformer for downstream averaging.
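from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors3D
mol = Chem.MolFromSmiles('CCCCO')
mol = Chem.AddHs(mol)  # explicit hydrogens are needed for sensible 3D geometry
params = AllChem.ETKDGv3()
params.randomSeed = 42  # reproducible embedding
# EmbedMultipleConfs returns the conformer IDs, not a count
conf_ids = list(AllChem.EmbedMultipleConfs(mol, numConfs=20, params=params))
AllChem.MMFFOptimizeMoleculeConfs(mol)  # optimizes every conformer in place
asphericities = [Descriptors3D.Asphericity(mol, confId=c) for c in conf_ids]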
Decision: For QSAR / ML, compute over a conformer ensemble (n=20-100) and report mean or Boltzmann-weighted average. Single-conformer 3D descriptors are unreliable.
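The Boltzmann-weighted average mentioned above can be sketched in a few lines. This assumes per-conformer energies in kcal/mol (for instance from a force-field optimization) and a temperature of 298.15 K; the function name and the numbers in the usage lines are illustrative only.

```python
import math

# Gas constant in kcal/(mol*K) times an assumed temperature of 298.15 K
KT_KCAL_PER_MOL = 0.0019872041 * 298.15


def boltzmann_average(values, energies_kcal, kt=KT_KCAL_PER_MOL):
    """Average per-conformer descriptor values with weights exp(-E / kT)."""
    if not values or len(values) != len(energies_kcal):
        raise ValueError("need one energy per descriptor value")
    e_min = min(energies_kcal)  # shift so the lowest-energy conformer has weight 1
    weights = [math.exp(-(e - e_min) / kt) for e in energies_kcal]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


# Illustrative numbers: three conformers, descriptor values and relative energies
values = [0.21, 0.35, 0.60]
energies = [0.0, 0.5, 3.0]
print(boltzmann_average(values, energies))
print(sum(values) / len(values))  # plain mean, for comparison
```

High-energy conformers contribute little to the weighted average, so it is less sensitive than the plain mean to strained geometries that happen to be in the ensemble.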
Partial Charge Methods
| Method | Software | Cost | Accuracy | Use for |
|---|---|---|---|---|
| Gasteiger-Marsili | RDKit, Open Babel | <0.1s/mol | Empirical, rough | AutoDock Vina, fast screening |
| MMFF94 | RDKit | 0.1s/mol | Force-field consistent | MMFF energy, conformer ranking |
| AM1-BCC | antechamber (AmberTools) | ~10s/mol | Semi-empirical | MD setup, FEP, GAFF |
| RESP | psi4, Gaussian | minutes/mol | DFT ESP-fitted | High-accuracy MD, FEP |
| OpenFF Recharge | openff-recharge | seconds | DFT-derived but cached | OpenFF / SAGE setup |
| ABCG2 | antechamber (AmberTools) | ~10s/mol | AM1-BCC variant reparameterized for GAFF2 | GAFF2 MD setup, solvation free energies |
from rdkit.Chem import AllChem
# Charges are stored on each atom as the '_GasteigerCharge' property
AllChem.ComputeGasteigerCharges(mol)
for atom in mol.GetAtoms():
    print(atom.GetIdx(), atom.GetPropsAsDict().get('_GasteigerCharge', None))
Critical: Charge method must match downstream. Gasteiger charges in an AMBER MD run violate the assumptions of the protein force field.
MAP4 and MHFP6 for Diverse Libraries
For libraries spanning drug-like + natural products + peptides + metabolites, ECFP4 saturates Tanimoto similarity (most pairs report 0.1-0.3, hard to rank). MAP4 and MHFP6 use MinHash + atom-pair / circular substructures and discriminate better.
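from mhfp.encoder import MHFPEncoder
encoder = MHFPEncoder()  # instantiate the encoder before calling it
# encode_mol takes an RDKit Mol; encode takes a SMILES string
mhfp6 = encoder.encode_mol(mol, radius=3)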
MHFP6 distance is Jaccard on MinHash, not standard Tanimoto. Use MHFPEncoder.distance(fp1, fp2).
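To show what "Jaccard on MinHash" means, here is a minimal stdlib sketch of the general MinHash technique. It is not the mhfp implementation: the hash family, the prime modulus, and the string "features" are all assumptions chosen for illustration.

```python
import hashlib
import random

PRIME = (1 << 61) - 1  # Mersenne prime used as the hash modulus (an assumption)


def make_permutations(n_permutations=2048, seed=42):
    """Random (a, b) pairs defining hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(n_permutations)]


def feature_to_int(feature):
    """Map a string feature (e.g. a substructure SMILES) to an integer."""
    digest = hashlib.sha1(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % PRIME


def minhash(features, permutations):
    """Signature: the minimum hash value of the feature set under each function."""
    ids = [feature_to_int(f) for f in set(features)]
    if not ids:
        raise ValueError("cannot MinHash an empty feature set")
    return [min((a * x + b) % PRIME for x in ids) for a, b in permutations]


def minhash_distance(sig_a, sig_b):
    """Estimated Jaccard distance: 1 - fraction of matching signature slots."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return 1.0 - matches / len(sig_a)


perms = make_permutations()
set_a = [f"frag{i}" for i in range(0, 60)]   # hypothetical fragment labels
set_b = [f"frag{i}" for i in range(30, 90)]  # overlaps set_a on 30 of 90 labels
print(minhash_distance(minhash(set_a, perms), minhash(set_b, perms)))
```

The printed estimate should land close to the exact Jaccard distance of 2/3 for these two sets. Because signatures are compared slot by slot, they suit approximate nearest-neighbor indexes rather than bit-vector Tanimoto routines.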
Physicochemical Descriptors
| Descriptor | Source | Range | Drug-like cutoff |
|---|---|---|---|
| MolWt | RDKit Descriptors.MolWt | ~50-2000 Da | <=500 (Lipinski) |
| MolLogP (Crippen) | RDKit Descriptors.MolLogP | -5 to 8 | <=5 (Lipinski) |
| HBD | Lipinski.NumHDonors | 0-10 | <=5 (Lipinski) |
| HBA | Lipinski.NumHAcceptors | 0-15 | <=10 (Lipinski) |
| TPSA | Descriptors.TPSA (Ertl) | 0-200 A^2 | <=140 (Veber oral); <=90 (BBB+) |
| RotBonds | Lipinski.NumRotatableBonds | 0-15 | <=10 (Veber) |
| AromaticRings | Lipinski.NumAromaticRings | 0-6 | <=3-4 (Ritchie-Macdonald aromatic ring count) |
| HeavyAtoms | Descriptors.HeavyAtomCount | | <=50 (lead-like) |
| FractionCSP3 | Descriptors.FractionCSP3 | 0-1 | >=0.25 (Lovering 2009 escape-from-flatland) |
| QED | QED.qed | 0-1 | >=0.5 generally drug-like |
| SAscore | sascorer.calculateScore (external) | 1-10 | <=4 acceptable; >6 hard to synth |
Goal: Compute a standard physicochemical descriptor panel for drug-likeness filtering and QSAR features.
Approach: Combine RDKit Descriptors, Lipinski, and QED calls into a single dict so the caller gets MW, LogP, HBD/HBA, TPSA, rotatable bonds, aromatic rings, fraction sp3, and QED in one pass.
from rdkit.Chem import Descriptors, Lipinski, QED
def physchem(mol):
    # One pass over the standard drug-likeness panel; keys match the table above
    return {
        'MolWt': Descriptors.MolWt(mol),
        'MolLogP': Descriptors.MolLogP(mol),
        'HBD': Lipinski.NumHDonors(mol),
        'HBA': Lipinski.NumHAcceptors(mol),
        'TPSA': Descriptors.TPSA(mol),
        'RotBonds': Lipinski.NumRotatableBonds(mol),
        'AromRings': Lipinski.NumAromaticRings(mol),
        'FractionCSP3': Descriptors.FractionCSP3(mol),
        'QED': QED.qed(mol),
    }
Drug-Likeness Rule Sets
| Rule | Constraints | Source |
|---|---|---|
| Lipinski Ro5 | MW<=500, LogP<=5, HBD<=5, HBA<=10 | Lipinski 1997 |
| Veber | RotBonds<=10, TPSA<=140 | Veber 2002 (oral) |
| Ghose | 160<=MW<=480, -0.4<=LogP<=5.6, 40<=MR<=130, 20<=atoms<=70 | Ghose 1999 |
| Egan | LogP<=5.88, TPSA<=131.6 | Egan 2000 |
| Muegge | 200<=MW<=600, -2<=LogP<=5, TPSA<=150, rings<=7 | Muegge 2001 |
| Lead-like | MW<=350, LogP<=3 | Teague 1999 |
| Fragment Ro3 | MW<=300, LogP<=3, HBD<=3, HBA<=3, RotBonds<=3 | Congreve 2003 |
| BBB+ Pfizer CNS | TPSA<=90, MW<=500, HBD<=3 | Wager 2010 |
Use case: Apply Ro5/Veber as a screening filter, not a hard cutoff. ~30% of marketed oral drugs violate at least one Ro5 rule (analyzed by Doak 2014). For oncology indications, Ro5 deviation is common and acceptable.
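Using the rules as a screening filter rather than a hard cutoff might look like the sketch below. It works on a plain dict with the same keys as the physchem() panel; the tolerance of one Ro5 violation and the example property values are assumptions for illustration.

```python
# Upper limits taken from the rule table above
RO5 = {"MolWt": 500, "MolLogP": 5, "HBD": 5, "HBA": 10}
VEBER = {"RotBonds": 10, "TPSA": 140}


def violations(props, limits):
    """Names of the properties that exceed their upper limit."""
    return [name for name, limit in limits.items() if props[name] > limit]


def soft_filter(props, max_ro5_violations=1):
    """Report violations and flag only when they exceed the tolerance.

    max_ro5_violations=1 is an illustrative choice, not a published threshold.
    """
    ro5 = violations(props, RO5)
    veber = violations(props, VEBER)
    return {
        "ro5_violations": ro5,
        "veber_violations": veber,
        "flag": len(ro5) > max_ro5_violations or bool(veber),
    }


# Hypothetical property panels
mild = {"MolWt": 540.0, "MolLogP": 4.2, "HBD": 2, "HBA": 8,
        "TPSA": 95.0, "RotBonds": 7}
heavy = {"MolWt": 720.0, "MolLogP": 6.1, "HBD": 6, "HBA": 12,
         "TPSA": 180.0, "RotBonds": 14}

print(soft_filter(mild))
print(soft_filter(heavy))
```

Returning the list of violated properties, rather than a bare pass/fail, keeps the decision reviewable for indications where Ro5 deviation is acceptable.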
QED (Weighted Drug-Likeness)
QED (Bickerton 2012) is a single-number drug-likeness measure (0-1) combining 8 properties (MW, LogP, HBD, HBA, PSA, RotBonds, AromaticRings, structural alerts) via desirability functions.
Caveat: QED is trained on FDA-approved drugs; it under-rates fragment-like and natural-product-like molecules. Do not use as the sole drug-likeness filter for fragment screens or natural-product libraries.
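The way QED combines its desirability functions is a weighted geometric mean, which can be sketched as follows. The desirability values and equal weights here are placeholders for illustration, not the fitted parameters from the QED paper or RDKit.

```python
import math


def weighted_geometric_mean(desirabilities, weights):
    """exp(sum(w_i * ln d_i) / sum(w_i)) for desirabilities in (0, 1]."""
    if len(desirabilities) != len(weights) or not weights:
        raise ValueError("need one weight per desirability")
    if any(d <= 0 for d in desirabilities):
        raise ValueError("desirabilities must be positive")
    total_weight = sum(weights)
    log_sum = sum(w * math.log(d) for d, w in zip(desirabilities, weights))
    return math.exp(log_sum / total_weight)


# Placeholder values: seven favorable properties and one poor one
d = [0.9] * 7 + [0.1]
w = [1.0] * 8
print(weighted_geometric_mean(d, w))
print(sum(d) / len(d))  # arithmetic mean, for comparison
```

A geometric mean penalizes a single poor property more than an arithmetic mean would, so one unfavorable property can pull the overall score down noticeably.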
Common Errors
| Symptom | Cause | Fix |
|---|---|---|
| Fingerprint changes between runs | Input molecule differs between runs (stereo, charges), not randomness | RDKit Morgan is deterministic; standardize and compare the inputs |
| MACCS bit count != 166 | RDKit MACCS returns 167 bits (bit 0 unused) | Slice [1:] if comparing to literature 166-bit |
| Crippen LogP differs from XLogP | Different model | Use Descriptors.MolLogP for Crippen; XLogP3 requires external lib |
| 3D descriptor differs between calls | Different conformer | Set confId=0 explicitly; or average over ensemble |
| QED returns nan | Charged species or non-standard atom | Standardize (uncharge) before QED |
| Tanimoto on count vector wrong | TanimotoSimilarity expects bit vector | Use Hamming or weighted Tanimoto for counts |
| MolWt off by ~1 from PubChem | Implicit H counted differently | Use Descriptors.ExactMolWt for monoisotopic; PubChem reports average |
References
- Rogers & Hahn (2010), J. Chem. Inf. Model. 50:742 -- ECFP / Morgan fingerprints.
- Probst & Reymond (2018), J. Cheminformatics 10:66 -- MHFP6 fingerprint.
- Capecchi et al. (2020), J. Cheminformatics 12:43 -- MAP4 fingerprint.
- Bickerton et al. (2012), Nat. Chem. 4:90 -- QED weighted drug-likeness.
- Lipinski et al. (1997), Adv. Drug Deliv. Rev. 23:3 -- Rule of 5.
- Veber et al. (2002), J. Med. Chem. 45:2615 -- Oral bioavailability rules.
- Lovering et al. (2009), J. Med. Chem. 52:6752 -- Fraction sp3 / escape from flatland.
Related Skills
- chemoinformatics/molecular-io - Parse molecules before featurization
- chemoinformatics/molecular-standardization - Canonicalize before fingerprinting
- chemoinformatics/conformer-generation - Generate 3D for conformer-dependent descriptors
- chemoinformatics/similarity-searching - Use fingerprints for similarity ranking
- chemoinformatics/qsar-modeling - ML using these descriptors as features
- chemoinformatics/admet-prediction - Filter by drug-likeness criteria
