Version Compatibility

Reference examples tested with: RDKit 2024.09+, chembl_structure_pipeline 1.2+, datamol 0.12+.

Before using code patterns, verify installed versions match. If versions differ:

Python: pip show <package> then help(module.function) to check signatures

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

Molecular Standardization

Convert raw molecular structures into a single canonical form for ML training data, deduplication, registry, and cross-database joining. Standardization is the single most underrated upstream step: skipping it causes silent ML data leakage (training and test compounds with different tautomers count as separate), bogus QSAR predictions, and database join misses. The ChEMBL pipeline (Bento 2020) and canSARchem (Ravi 2022) are the two industry references; canSARchem extends ChEMBL with canonical-tautomer-before-parent extraction. RDKit's rdMolStandardize implements ChEMBL-equivalent logic in C++ (the older MolVS Python implementation was deprecated Q1 2024).

For format-level I/O and aromaticity perception, see chemoinformatics/molecular-io. For descriptor calculation after standardization, see chemoinformatics/molecular-descriptors.

Standardization Pipeline Stages

Stage	RDKit Tool	Operation	Common errors caught
1. Sanitization	`Chem.SanitizeMol`	Kekulize, assign aromaticity, fix valences	Wrong valence on N/O
2. Salt stripping	`rdMolStandardize.FragmentRemover` or `LargestFragmentChooser`	Remove counterions	Cl-, Na+, K+, OH-
3. Mixture choice	`LargestFragmentChooser`	Pick parent fragment	Co-crystals, hydrates
4. Charge neutralization	`Uncharger`	Neutralize while preserving net charge	Permanent charges preserved (quaternary N+)
5. Tautomer canonicalization	`TautomerEnumerator.Canonicalize`	Pick canonical tautomer	Keto/enol; amide/imidate
6. Stereo standardization	`Chem.AssignStereochemistry`	Consistent stereo descriptors	Lost wedges, ambiguous R/S
7. Isotope normalization	manual or `MolToSmiles(isomericSmiles=False)`	Remove 13C, 2H labels	Tracer studies
8. Output canonicalization	`Chem.MolToSmiles(canonical=True)`	Canonical SMILES + InChIKey	Round-trip stability

Pipeline Reconciliation

Pipeline	Origin	Tautomer canonicalization	Salt definition	Use case
ChEMBL pipeline	EBI ChEMBL	Pre-rdMolStandardize legacy; now uses rdMolStandardize	ChEMBL salt list (extensive)	Drug-like compounds, FDA approvals
canSARchem	ICR Cancer Research UK	Canonical tautomer BEFORE parent extraction	Extended salt list	Cancer drug discovery
PubChem (OpenEye)	NIH NCBI	OpenEye QUACPAC tautomer	PubChem salt list	Bioassay data, large-scale
RDKit rdMolStandardize default	Greg Landrum	RDKit TautomerEnumerator	RDKit default	General purpose, open source
Datamol wrapper	Molecular AI / Owkin	Wraps ChEMBL/RDKit pipeline	RDKit default	Notebooks, quick standardization

Key difference (canSARchem vs ChEMBL):

ChEMBL: parent first, then canonical tautomer of parent
canSARchem: canonical tautomer first, then parent

For 95% of drug-like molecules these produce identical results. For tautomer-ambiguous molecules (amide/imidate, ketoenol, lactam/lactim), the order matters; canSARchem produces more stable canonical forms.

ChEMBL Structure Pipeline (Reference Implementation)

ChEMBL's standardization is the most widely-used reference. The Python package chembl_structure_pipeline exposes the validated pipeline.

Goal: Apply the industry-reference ChEMBL standardization pipeline to a SMILES.

Approach: Parse SMILES with RDKit, run standardize_mol (sanitize + uncharge + normalize + canonical tautomer), then get_parent_mol (strip salts/counter-ions), and emit canonical SMILES.

from chembl_structure_pipeline import standardize_mol, get_parent_mol
from rdkit import Chem

def chembl_pipeline(smi):
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        return None, 'parse_failure'
    standardized, _ = standardize_mol(mol)
    parent, _ = get_parent_mol(standardized)
    return Chem.MolToSmiles(parent), 'ok'

standardize_mol: sanitize + uncharge + normalize functional groups + canonicalize tautomers.

get_parent_mol: strip salts/counter-ions; choose largest fragment.

Output: canonical SMILES of the parent (free acid/free base, neutral form).

Full Standardization with rdMolStandardize

For more granular control or non-ChEMBL workflows.

Goal: Execute each standardization step explicitly to control salt stripping, charge handling, tautomer canonicalization, and isotope normalization.

Approach: Run the 8-stage pipeline (sanitize, largest fragment, normalize, uncharge, tautomer canonicalize, isotope strip, stereo standardize, canonical SMILES) sequentially with rdMolStandardize primitives.

from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

def full_standardize(smi, keep_isotopes=False):
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        return None

    Chem.SanitizeMol(mol)

    largest = rdMolStandardize.LargestFragmentChooser(preferOrganic=True)
    mol = largest.choose(mol)

    normalizer = rdMolStandardize.Normalizer()
    mol = normalizer.normalize(mol)

    uncharger = rdMolStandardize.Uncharger(canonicalOrdering=True)
    mol = uncharger.uncharge(mol)

    enumerator = rdMolStandardize.TautomerEnumerator()
    mol = enumerator.Canonicalize(mol)

    if not keep_isotopes:
        for atom in mol.GetAtoms():
            atom.SetIsotope(0)

    Chem.AssignStereochemistry(mol, cleanIt=True, force=True)
    return Chem.MolToSmiles(mol)

canonicalOrdering=True ensures the uncharger produces the same result regardless of atom ordering in input -- critical for stable canonical output.

Datamol Quick Standardization

For interactive notebooks and quick pipelines, datamol provides a one-call wrapper around the ChEMBL/RDKit flow.

import datamol as dm

def quick_standardize(smi):
    mol = dm.to_mol(smi)
    mol = dm.fix_mol(mol)              # sanitize, kekulize, normalize
    mol = dm.standardize_mol(mol,        # uncharge + canonicalize
                             disconnect_metals=False,
                             normalize=True,
                             reionize=True,
                             uncharge=True,
                             stereo=True)
    return dm.to_smiles(mol)

Use datamol when the standard ChEMBL/RDKit flow is what you want; drop to RDKit primitives when you need per-stage visibility or non-default choices.

Salt Stripping Edge Cases

Salt form	Action	Example
Mono-salt	Strip counter-ion	`[Na+].CC(=O)[O-]` -> `CC(=O)O`
Di-salt	Strip both	`[Na+].[Na+].CC(=O)[O-].CC(=O)[O-]` -> `CC(=O)O`
Mixed salt	Largest organic fragment	`CCO.CC(=O)O` -> `CCO` (or CC(=O)O depending on rule)
Co-crystal	Hardest case	`CC(=O)O.CCOC(C)=O` -- both organic; default returns largest
Hydrate	Strip waters	`CC(=O)O.O` -> `CC(=O)O`
Solvate	Strip solvents	`CC(=O)O.CO` -> `CC(=O)O`
Quaternary ammonium	Preserve charge	`[N+](C)(C)(C)C` (permanent charge; do NOT neutralize)

LargestFragmentChooser(preferOrganic=True) prefers organic fragments over inorganic counter-ions even if smaller; for co-crystals, default rule picks largest organic fragment.

Tautomer Canonicalization (debated)

Tautomer canonicalization is the most controversial standardization step. There is no universally-correct canonical tautomer for many drug-like molecules.

Tautomer pair	Default canonical	Issue
Keto/enol	Keto preferred	Most kinase ATP-mimetic enols destabilize on canonicalization
Lactam/lactim	Lactam preferred	Some natural products (rifampin) are inherently lactim
Amidine/iminol	Amidine preferred	Some bioactive amidines convert
Phenol/keto (e.g., naphthol/naphthalenone)	Phenol preferred	Some quinone-form pharmaceuticals reverted
2H-pyrazole / 1H-pyrazole	1H-pyrazole	Both equally stable in vivo

Practical rules:

Always apply consistent canonicalization across train + test for ML
For prospective prediction, predict for both tautomers if disagreement could matter
For library deduplication, canonical tautomer is the standard answer
For docking, use ionization-aware preparation (epik from Schrödinger or Open Babel pkBABEL)

from rdkit.Chem.MolStandardize import rdMolStandardize

def canonical_tautomer(smi):
    mol = Chem.MolFromSmiles(smi)
    enumerator = rdMolStandardize.TautomerEnumerator()
    canon = enumerator.Canonicalize(mol)
    return Chem.MolToSmiles(canon)

Stereochemistry Standardization

from rdkit import Chem

def standardize_stereo(mol, remove_undefined=False):
    Chem.AssignStereochemistry(mol, cleanIt=True, force=True)
    if remove_undefined:
        Chem.RemoveStereochemistry(mol)
    return mol

Cases:

Explicit stereo with @ / \ / / -> preserved
Wedge bonds in SDF -> re-perceived from 3D coords if present
Ambiguous stereo (no markers) -> left as-is, marked as undefined"
Racemic (explicit "rac") -> keep as racemate

For ML, often drop stereo entirely (Chem.RemoveStereochemistry(mol)) since most QSAR endpoints are not stereo-specific. For docking and FEP, preserve stereo always.

Standardization for ML Training (avoiding data leakage)

Goal: Build a standardized + deduplicated training set with replicate-averaged activity for QSAR or ADMET model training.

Approach: Standardize every SMILES through the ChEMBL pipeline, compute InChIKey as canonical identity, group by InChIKey, and mean-aggregate activities; report replicate count for confidence weighting.

import pandas as pd
from chembl_structure_pipeline import standardize_mol, get_parent_mol

def prepare_qsar_data(df, smiles_col='smiles', activity_col='pIC50'):
    standardized = []
    for i, row in df.iterrows():
        mol = Chem.MolFromSmiles(row[smiles_col])
        if mol is None:
            continue
        try:
            mol, _ = standardize_mol(mol)
            mol, _ = get_parent_mol(mol)
            standardized.append({
                'smiles': Chem.MolToSmiles(mol),
                'inchikey': Chem.MolToInchiKey(mol),
                'activity': row[activity_col],
            })
        except Exception:
            continue

    df_std = pd.DataFrame(standardized)
    df_std = df_std.groupby('inchikey').agg(
        smiles=('smiles', 'first'),
        activity=('activity', 'mean'),
        n_replicates=('activity', 'count'),
    ).reset_index()
    return df_std

Deduplication by InChIKey collapses tautomer-equivalent compounds. Replicate count signals measurement reliability.

Per-Tool Failure Modes

ChEMBL pipeline -- inorganic salt fails

Trigger: Molecule is genuinely an inorganic salt (e.g., NaCl, K2SO4).

Mechanism: get_parent_mol chooses largest organic; falls back to largest fragment for fully inorganic.

Symptom: Returns the salt itself (not a drug).

Fix: Pre-filter to compounds with ≥1 carbon atom.

Uncharger -- removes critical charge

Trigger: Quaternary ammonium (permanent positive) or sulfonate at physiological pH.

Mechanism: Default uncharger attempts to neutralize without distinguishing permanent vs pH-dependent charges.

Symptom: Permanently charged ligands neutralized; structure incorrect for downstream docking.

Fix: Use Uncharger(canonicalOrdering=True, force=False); manually inspect borderline cases.

Tautomer enumerator -- combinatorial explosion

Trigger: Molecule with many tautomerizable groups (polyhydroxylated heterocycle).

Mechanism: TautomerEnumerator.Enumerate generates all possible tautomers; can produce thousands.

Symptom: OOM or hour-long compute on single molecule.

Fix: Use Canonicalize (returns single canonical) instead of Enumerate; for Enumerate, cap maxTransforms parameter.

MolVS deprecated -- ImportError on Python 3.12+

Trigger: Code still using legacy from molvs import Standardizer.

Mechanism: RDKit MolStandardize Python implementation removed Q1 2024.

Symptom: ImportError or AttributeError on newer RDKit.

Fix: Migrate to from rdkit.Chem.MolStandardize import rdMolStandardize; methods renamed (e.g., standardize -> Cleanup).

Round-trip InChIKey mismatch

Trigger: Compound canonicalized to different tautomer per run.

Mechanism: RDKit tautomer canonicalization depends on atom ordering for very symmetric molecules.

Symptom: Re-running standardization yields different InChIKey.

Fix: Set canonicalOrdering=True in Uncharger; sort atoms via canonical SMILES first.

Common Errors

Symptom	Cause	Fix
ImportError on `MolStandardize`	Python MolStandardize deprecated	Use `from rdkit.Chem.MolStandardize import rdMolStandardize`
`standardize_mol` returns None	Sanitize failure on input	Try `Chem.MolFromSmiles(smi, sanitize=False)` first
Stripped wrong fragment	LargestFragmentChooser ambiguity	Manually inspect; consider custom logic
Tautomer differs between runs	Atom-order-dependent	Set `canonicalOrdering=True`; sort atoms
Charge lost on quaternary N	Aggressive neutralization	Use `force=False`
InChIKey collisions across "different" mols	Same canonical InChI but different stereo / tautomer	Use longer InChIKey (or full InChI)
Pipeline slow on large library	Per-mol Python overhead	Use `chembl_structure_pipeline` (vectorized) or process in chunks

References

Bento et al., J. Cheminformatics 12:51 -- ChEMBL structure pipeline.
Ravi et al., J. Cheminformatics 14:36 -- canSARchem registration pipeline.
Hähnke et al., J. Cheminformatics 10:36 -- PubChem standardization.
Landrum, RDKit Blog -- new tautomer canonicalization code.

Related Skills

chemoinformatics/molecular-io - Parse molecules before standardizing
chemoinformatics/molecular-descriptors - Apply descriptors to standardized molecules
chemoinformatics/similarity-searching - Standardize before comparing
chemoinformatics/substructure-search - Standardize before SMARTS matching
chemoinformatics/qsar-modeling - Mandatory upstream for QSAR

skills/chemoinformatics/molecular-standardization