skills/chemoinformatics/molecular-io

stars:0
forks:0
watches:0
last updated:N/A

Version Compatibility

Reference examples tested with: RDKit 2024.09+, Open Babel 3.1.1+, ChEMBL structure_pipeline 1.2+.

Before using code patterns, verify installed versions match. If versions differ:

  • Python: pip show <package> then help(module.function) to check signatures
  • CLI: obabel -V; obabel -L formats

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

Molecular I/O

Parse, write, and convert molecular file formats. Most downstream errors trace back to silent I/O issues: incorrect aromaticity perception, lost stereochemistry, mishandled charges, dropped stereo bonds, or non-canonical tautomers. This skill enumerates each format's failure modes and prescribes the correct toolchain for each scenario.

For full standardization (canonicalization, salt stripping, tautomer enumeration) see chemoinformatics/molecular-standardization. For generating 3D conformers from parsed 2D molecules, see chemoinformatics/conformer-generation.

Format Taxonomy

FormatDimStereoChargesStrengthFails when
SMILES2DExplicit /\@Net charges onlyCompact, web-friendly, fast parseLoses absolute coordinates; aromatic perception ambiguous across toolkits; tautomers not canonical
InChI2DLayer 't,s'pH-fixed std layer 'p'Canonical by construction; cross-toolkit identityLoses tautomers (std InChI merges); loses stereo at C^x metals; large molecules timeout
SDF V20002D/3DWedge bondsM CHG lineIndustry default; metadata via tags999-atom limit; cannot encode multi-component reactions; query atoms ambiguous
SDF V30002D/3DWedge + stereo flagInline chargeNo atom limit; query support; rich propertiesSome software (legacy) cannot read; verbose
MOL2 (Tripos)3DWedge bondsPer-atom partialSYBYL atom types preserved for dockingAtom-type dialects diverge (SYBYL vs Corina); RDKit MOL2 parser brittle
PDB3DNoneNone standardUniversal protein formatNo bond orders; aromatic perception lost; ligand names truncated to 3 chars
PDBQT3DNoneGasteiger / AD4AutoDock-ready; torsion tree encodedSpecific to docking; no aromaticity layer
MMTF/BCIF3DEncodedEncodedCompact PDB replacement; PDB archive default since 2023Not all toolkits parse; binary format
CDX/CDXML2DDrawingDrawingChemDraw nativeNot a structural format; converts unreliably
InChIKeyHashStereo layerDatabase key, fast lookupHash collisions ~10^-9 but possible; cannot recover structure

Aromaticity Perception (most common silent error)

Different toolkits perceive aromaticity differently. The same SMILES round-tripped between toolkits may produce different canonical strings and different fingerprints.

ModelToolkitRuleSymptom of mismatch
DaylightOpenEye, Daylight4n+2 π on planar ringFuran, thiophene aromatic
RDKit defaultRDKitDaylight-like with extensions for fused / N-containingCompatible with Daylight for drug-like molecules
MDLIndigo, ChemAxonReduced (only pyrrole-type)Pyrrole NH aromatic but tropone non-aromatic
OpenEyeOEAroModelSeveral modesCharged thiophene non-aromatic in MDL but aromatic in OpenEye

Fix: Always re-canonicalize via the toolkit doing analysis. Never trust a SMILES produced by toolkit A as input to toolkit B without SetAromaticity(rdkit.Chem.AromaticityModel).

Stereochemistry Layers

Stereo loss is the second most common silent error. Each format encodes stereo differently:

  • SMILES: @/@@ for tetrahedral, / and \ for cis/trans double bonds
  • InChI: separate /t (tetrahedral), /s (stereo flag), /m (mirror) layers
  • SDF: wedge/hash bond + parity 0/1/2; cis/trans encoded via bond direction
  • MOL2: explicit stereo flag

Round-trip tests: If Chem.MolToSmiles(Chem.MolFromSmiles(smi)) does not preserve @ and /\, the molecule was sanitized without Chem.RemoveStereochemistry(). If MolFromMolFile returns a molecule missing wedge bonds, the SDF used parity-only encoding (legacy).

Reading SMILES with Stereo Preservation

Goal: Parse SMILES while preserving stereo and aromatic-flag consistency.

Approach: Use Chem.MolFromSmiles(smi) with sanitization on, verify with round-trip canonicalization, and set explicit stereochemistry where the toolkit's perception missed it.

from rdkit import Chem
from rdkit.Chem import AllChem

def parse_smiles_safe(smi):
    mol = Chem.MolFromSmiles(smi)
    if mol is None:
        return None, 'parse_failure'
    Chem.AssignStereochemistry(mol, cleanIt=True, force=True)
    canon = Chem.MolToSmiles(mol)
    round_trip = Chem.MolFromSmiles(canon)
    if Chem.MolToSmiles(round_trip) != canon:
        return mol, 'round_trip_unstable'
    return mol, 'ok'

Reading SDF with Property Carryover

Goal: Load a multi-record SDF preserving per-molecule properties (Name, ID, IC50, etc.) used by downstream filtering and ML labeling.

Approach: Iterate via SDMolSupplier(removeHs=False, sanitize=True), filter None (parse failures), and capture properties via mol.GetPropsAsDict().

from rdkit import Chem

supplier = Chem.SDMolSupplier('library.sdf', removeHs=False, sanitize=True)
mols = []
fails = []
for i, mol in enumerate(supplier):
    if mol is None:
        fails.append(i)
        continue
    props = mol.GetPropsAsDict()
    mols.append((mol, props))
print(f'parsed: {len(mols)}; failed: {len(fails)}')

If a large fraction fails, try sanitize=False then Chem.SanitizeMol(mol, catchErrors=True) to identify per-step failures (kekulization, valence, aromaticity).

Open Babel for MOL2 / PDBQT

RDKit's MOL2 parser is incomplete (SYBYL atom-type sets differ). Open Babel is more robust for MOL2 and PDBQT.

from openbabel import pybel

mols = list(pybel.readfile('mol2', 'ligands.mol2'))
for mol in mols:
    smi = mol.write('smi').strip().split()[0]
    inchi = mol.write('inchi').strip()

For docking output PDBQT, use Open Babel rather than RDKit:

import subprocess
subprocess.run(['obabel', 'docked.pdbqt', '-O', 'docked.sdf'], check=True)

InChI for Canonical Identity

InChI is the only canonical 2D representation guaranteed across toolkits. Standard InChI ignores tautomers and metal stereo; non-standard variants preserve them with flags.

from rdkit.Chem.inchi import MolToInchi, MolToInchiKey, InchiToInchiKey

mol = Chem.MolFromSmiles('c1ccc2c(c1)cccc2')
inchi = MolToInchi(mol)
key = MolToInchiKey(mol)

inchi_fixedH, retcode, msg, _, _ = Chem.MolToInchiAndAuxInfo(mol, options='/FixedH')

Caveat: Two molecules with identical std InChI may be different tautomers. Use /FixedH for tautomer-distinguishing InChI when needed.

Datamol Wrapper for Friendly I/O

" datamol provides a thin, opinionated wrapper over RDKit with sensible defaults (sanitize on, standard tautomer, common error suppression). It is a good fit for notebook-style interactive work and for code that should "just work" without deep RDKit knowledge.

import datamol as dm

mol = dm.to_mol('CC(=O)Oc1ccccc1C(=O)O')   # parses + sanitizes
smi = dm.to_smiles(mol)                    # canonical SMILES
inchikey = dm.to_inchikey(mol)             # canonical InChIKey
fps = dm.to_fp(mol, fp={'radius': 2, 'nBits': 2048})  # ECFP4

Datamol adds utilities such as dm.standardize_mol, dm.fix_mol, and dm.remove_salt that wrap the ChEMBL/RDKit pipeline. Use datamol when you want a friendlier API; drop down to raw RDKit when you need fine-grained control over the standardization steps.

Per-Format Failure Modes

SMILES -- ambiguous aromaticity

Trigger: Input from non-RDKit source (ChemAxon, OpenEye, Daylight) round-tripping into RDKit.

Mechanism: RDKit perceives aromaticity on input. Aromatic flags from origin toolkit are overwritten.

Symptom: Fingerprints differ between toolkits for "identical" molecules; database joins by canonical SMILES miss records.

Fix: Always re-canonicalize within the analysis toolkit. For cross-toolkit identity, use InChIKey not canonical SMILES.

SDF V2000 -- atom count >999

Trigger: Large molecules (peptides, oligonucleotides, dendrimers).

Mechanism: V2000 header uses fixed 3-character atom count field.

Symptom: Truncated atom block; parse failure with cryptic error.

Fix: Switch to V3000: Chem.SDWriter('out.sdf', forceV3000=True). RDKit auto-detects V3000 on read; explicitly request on write.

SDF -- wedge bond orientation lost

Trigger: SDF written by tools that use parity flags only (older ISIS-Draw, some pipeline tools).

Mechanism: Parity alone is ambiguous without geometric coordinates; RDKit reads parity but cannot re-render wedges.

Symptom: Drawn molecule shows undefined stereo despite SDF carrying parity bits.

Fix: After read, Chem.AssignStereochemistryFrom3D(mol) if 3D coords present; otherwise stereo must be re-derived from SMILES with wedges.

PDB ligand -- no bond orders

Trigger: Parsing ligand from PDB entry (e.g., extracting co-crystal ligand).

Mechanism: PDB stores only atoms + CONECT; bond orders inferred by RDKit's AssignBondOrdersFromTemplate which requires a template molecule.

Symptom: All bonds single; aromatic rings non-aromatic; valences wrong.

Fix: Use AllChem.AssignBondOrdersFromTemplate(template, ligand) where template is a SMILES-derived mol of the expected ligand structure. Or use the PDB Ligand Expo SDF.

MOL2 -- SYBYL atom type dialect

Trigger: MOL2 produced by Corina, MOE, or Schrodinger.

Mechanism: SYBYL atom types are not perfectly standardized across vendors; RDKit's parser handles canonical SYBYL.

Symptom: Mol returns as None or with wrong atom types (Cl vs Cl.O peroxide-style).

Fix: Convert via Open Babel as intermediate: obabel input.mol2 -O temp.sdf then read SDF.

Open Babel pybel -- import path

Trigger: Code written for Open Babel 2.x.

Mechanism: OB 3.x reorganized: import pybel no longer works.

Symptom: ModuleNotFoundError: No module named 'pybel'.

Fix: from openbabel import pybel.

Charge Models on I/O

SourceCharges in fileUse for
Parsed SMILESNet formal charges onlyStorage, similarity, ML training
Parsed PDBAtomic charges typically absentAlways re-assign for downstream
obabel --partialcharge gasteigerGasteiger Marsili (empirical)AutoDock Vina, fast
AM1-BCC (AmberTools antechamber)Semi-empiricalMD, FEP setup
RESP (psi4, Gaussian)Quantum ESP-fittedHigh-accuracy MD, FEP

The charge model must match the downstream method. Mixing AM1-BCC ligand charges with TIP3P water + AMBER protein is valid; Gasteiger charges are unsuitable for MD.

Drawing for QC

Always draw a random subset of parsed molecules. Wrong stereo, missing rings, and broken aromaticity show immediately.

from rdkit.Chem.Draw import rdMolDraw2D

def draw_grid(mols, fname, mols_per_row=5, sub_img_size=(250, 200)):
    from rdkit.Chem.Draw import MolsToGridImage
    img = MolsToGridImage(mols[:25], molsPerRow=mols_per_row, subImgSize=sub_img_size,
                          legends=[m.GetProp('_Name') if m.HasProp('_Name') else ''
                                   for m in mols[:25]])
    img.save(fname)

MolsToGridImage returns PIL image; for headless servers use MolDraw2DCairo directly.

Common Errors

SymptomCauseFix
Chem.MolFromSmiles returns NoneInvalid SMILES, bad parentheses, ring not closedTry sanitize=False, inspect with Chem.MolFromSmiles(smi, sanitize=False)
Round-trip SMILES changesAromaticity perception driftAlways canonicalize within analysis toolkit
All bonds single in PDB ligandPDB has no bond ordersAllChem.AssignBondOrdersFromTemplate(template, mol)
Stereo lost on SDF writeChem.RemoveStereochemistry calledUse MolToMolBlock(mol, kekulize=True)
MOL2 parse returns NoneRDKit MOL2 parser incomplete for vendor dialectsConvert via Open Babel intermediate
InChI differs for "same" moleculeDifferent tautomers, charges, or stereoUse /FixedH to retain tautomer; compare without standardization
Fingerprints differ across toolkitsAromaticity model differenceUse InChIKey for identity; re-canonicalize for similarity

References

  • Heller et al., J. Cheminformatics 7:23 -- InChI v1.05 specification.
  • O'Boyle, J. Cheminformatics 4:22 -- canonical SMILES algorithms across toolkits.
  • Bento et al., J. Cheminformatics 12:51 -- ChEMBL structure pipeline.
  • Riniker & Landrum, J. Chem. Inf. Model. 55:2562-2574 -- ETKDG conformer embedding.
  • RDKit datamol documentation: https://docs.datamol.io/

Related Skills

  • chemoinformatics/molecular-standardization - Salt stripping, tautomer canonicalization, neutralization
  • chemoinformatics/molecular-descriptors - Calculate fingerprints and properties from parsed molecules
  • chemoinformatics/conformer-generation - Generate 3D coordinates from 2D inputs
  • chemoinformatics/virtual-screening - Prepare ligands for docking
  • chemoinformatics/similarity-searching - Use parsed structures for similarity
    Good AI Tools