skills/chemoinformatics/generative-design

Version Compatibility

Reference examples tested with: REINVENT 4.0+, RDKit 2024.09+, PyTorch 2.1+, MolMIM (NVIDIA BioNeMo), chemprop 2.0+.

Before using code patterns, verify installed versions match. If versions differ:

  • Python: pip show <package> then help(module.function) to check signatures

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
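
The version check can also be scripted. The following is a minimal sketch using only the standard library; the distribution names in `PACKAGES` are assumptions (the name registered with pip can differ from the import name, and REINVENT or MolMIM may be installed under another name or not through pip at all).

```python
from importlib import metadata

# Distribution names are assumptions; adjust them to what `pip list` reports.
PACKAGES = ["rdkit", "torch", "chemprop"]


def installed_version(dist_name):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def report(packages=PACKAGES):
    """Map each distribution name to its version (None when not installed)."""
    return {name: installed_version(name) for name in packages}


if __name__ == "__main__":
    for name, version in report().items():
        print(f"{name}: {version or 'not installed'}")
```

A `None` entry means the package is missing from the active environment, which is worth ruling out before debugging an ImportError.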

Generative Molecular Design

Generate novel molecules biased toward desired properties using deep generative models. REINVENT 4 (Loeffler 2024, AstraZeneca) is an open-source, production-grade framework supporting 4 generation modes (de novo, scaffold decoration, linker design, molecular optimization) and 3 learning algorithms (transfer learning, reinforcement learning, curriculum learning). The art of generative design is in the scoring function: a poorly designed scoring function rewards uninteresting molecules, while a well-designed one captures both activity and developability.

For QSAR/scoring models that feed generative design, see chemoinformatics/qsar-modeling. For synthetic feasibility, see chemoinformatics/retrosynthesis. For library enumeration as an alternative, see chemoinformatics/reaction-enumeration.

Generator Mode Taxonomy

| Mode | Input | Output | Use case | Fails when |
|---|---|---|---|---|
| De novo | Empty seed or training set | Novel molecules | Wide chemical space exploration | Synthetic feasibility weak |
| Scaffold decoration | Scaffold + attachment points | Decorated molecules | Series expansion | Diversity limited by scaffold |
| Linker design | 2 fragments | Linker molecules | PROTAC, ternary complex | Few linker geometric options |
| R-group replacement | Scaffold + existing R-groups | New R-group set | Optimize one position | Single-position only |
| Molecular optimization | Lead molecule | Improved analogs | Lead optimization | Improvement window narrow |
| Constrained generation | Hard constraints (MW, fragments) | Compliant molecules | Patent / IP design | Constraints overly restrictive |

Learning Algorithm Taxonomy

| Algorithm | Use | Pro | Con |
|---|---|---|---|
| Transfer learning (TL) | Adapt prior model to focused training set | Stable, simple | Limited optimization power |
| Reinforcement learning (RL) | Reward-driven generation | Powerful for MPO | Reward hacking risk |
| Curriculum learning (CL) | Gradual constraint introduction | Better convergence | Slower; tuning sensitive |
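
To make the curriculum idea concrete, here is a toy sketch of stage advancement: each stage has a score threshold, and training moves on once the batch mean reaches it or the step budget runs out. The `Stage` fields, the stage names, and the `toy_score` learning curve are hypothetical stand-ins for a real training loop; this is not REINVENT's implementation.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str          # hypothetical label for the constraints active in this stage
    max_score: float   # advance once the batch mean score reaches this value
    max_steps: int     # hard cap so a stalled stage cannot run forever


def run_curriculum(stages, mean_score_at):
    """Run stages in order; return (name, steps used, threshold reached) per stage.

    mean_score_at(stage_name, step) stands in for one training step and
    returns the mean score of the sampled batch.
    """
    history = []
    for stage in stages:
        reached = False
        steps_used = 0
        for step in range(1, stage.max_steps + 1):
            steps_used = step
            if mean_score_at(stage.name, step) >= stage.max_score:
                reached = True
                break
        history.append((stage.name, steps_used, reached))
    return history


def toy_score(stage_name, step):
    # Toy learning curve that restarts in every stage: +0.01 per step, capped at 1.0.
    return min(1.0, step / 100)


stages = [
    Stage("drug-likeness only", max_score=0.5, max_steps=100),
    Stage("add activity", max_score=0.7, max_steps=100),
    Stage("add ADMET", max_score=0.9, max_steps=50),
]
history = run_curriculum(stages, toy_score)
for name, steps_used, reached in history:
    print(f"{name}: {steps_used} steps, threshold reached: {reached}")
```

In this toy run the last stage exhausts its step budget before reaching its threshold, which illustrates why curriculum learning is sensitive to how thresholds and budgets are tuned.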

Decision Tree by Scenario

| Scenario | Generator | Algorithm | Scoring |
|---|---|---|---|
| New target, no SAR | De novo | RL on docking score | Glide / Vina + QED |
| Series expansion | Scaffold decoration | TL on series + RL | QSAR ensemble + QED |
| PROTAC linker | Linker design | RL on ternary complex | DC50 surrogate |
| Lead optimization MPO | Molecular optimization | CL with staged constraints | Multi-task: activity + ADMET |
| Diverse hit set | De novo with diversity bonus | RL + Tanimoto distance to known | Activity + diversity |
| Patent space carve-out | Constrained de novo | RL + structural constraints | Activity + novelty |
| Hit-to-lead | R-group replacement | TL on lead + RL | Activity + Lipinski |
| ADMET-aware design | De novo or optimization | RL | hERG + CYP + AMES + QED |

REINVENT 4 Setup

REINVENT 4 uses a TOML configuration file specifying generator, algorithm, prior model, and scoring functions.

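# Abbreviated sketch; verify key names against the example configs shipped with the installed REINVENT release.
[parameters]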
prior_file = "priors/reinvent.prior"
agent_file = "priors/reinvent.prior"
batch_size = 64
unique_sequences = true

[[stage]]
max_steps = 1000
chkpt_file = "checkpoints/agent.chkpt"

[[stage.scoring.component]]
name = "QED"

[[stage.scoring.component]]
name = "custom_activity"
weight = 1.0

Multi-Parameter Optimization (MPO)

A multi-parameter scoring function combines several components, each targeting one property. Common components:

| Component | Purpose | Reference |
|---|---|---|
| QED | Drug-likeness | Bickerton 2012 |
| SAScore | Synthetic accessibility | Ertl 2009 |
| Activity QSAR | Target binding | chemprop model |
| hERG | Cardiotoxicity | ADMETlab 3.0 |
| Lipinski | Rule of 5 | Lipinski 1997 |
| Tanimoto distance | Diversity from known actives | RDKit |
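
One common way to combine such components is a weighted geometric mean, sketched below. It assumes every component has already been mapped to [0, 1] with higher meaning better (a raw SAScore, for instance, runs from 1 to 10 with lower being better and needs a transform first). The function name, component names, and weights are illustrative.

```python
import math


def aggregate(scores, weights):
    """Weighted geometric mean of component scores, each expected in [0, 1].

    A zero in any weighted component drives the total to zero, so a molecule
    cannot trade a failed property for a high score elsewhere.
    """
    active = {name: value for name, value in scores.items() if weights.get(name, 0.0) > 0}
    if not active:
        raise ValueError("at least one component needs a positive weight")
    if any(value <= 0.0 for value in active.values()):
        return 0.0
    total_weight = sum(weights[name] for name in active)
    log_sum = sum(weights[name] * math.log(value) for name, value in active.items())
    return math.exp(log_sum / total_weight)


# Hypothetical component scores for one molecule, already scaled to [0, 1].
scores = {"qed": 0.8, "sascore": 0.5, "activity": 0.9}
weights = {"qed": 1.0, "sascore": 1.0, "activity": 2.0}
total = aggregate(scores, weights)
print(f"aggregate score: {total:.3f}")
```

An arithmetic mean would instead let a strong activity prediction compensate for a failed developability component, which is often not the intended behavior.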

Critical pitfall: reward hacking. If the activity model is biased, the generator produces structures that exploit the bias. Mitigations:

  • Use an ensemble of models (5+)
  • Constrain to chemically reasonable substructures
  • Validate the top 100 by orthogonal in silico methods (docking, FEP)
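
The ensemble mitigation can be sketched as follows: score each molecule with the ensemble mean minus a penalty for disagreement between models, so that molecules the models disagree on are not rewarded. The prediction values and the penalty factor of 1.0 are hypothetical; a suitable factor depends on how well calibrated the models are.

```python
from statistics import mean, pstdev


def ensemble_score(predictions, penalty=1.0):
    """Mean prediction minus a disagreement penalty, clipped to [0, 1]."""
    if len(predictions) < 2:
        raise ValueError("need at least two model predictions")
    raw = mean(predictions) - penalty * pstdev(predictions)
    return max(0.0, min(1.0, raw))


# Five hypothetical activity predictions per molecule, one per ensemble member.
agree = ensemble_score([0.80, 0.82, 0.79, 0.81, 0.80])
disagree = ensemble_score([0.99, 0.20, 0.95, 0.30, 0.98])
print(f"models agree: {agree:.2f}, models disagree: {disagree:.2f}")
```

A molecule that only some ensemble members rate highly is a typical sign that the generator has found a weakness in individual models, and the penalty pushes its score down.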

MolMIM (NVIDIA BioNeMo)

MolMIM is a property-guided latent-variable model:

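# Pseudo-API: the import path and optimize() signature are illustrative.
# MolMIM ships through NVIDIA BioNeMo; verify against the current release.
from molmim import MolMIM
from rdkit import Chem

model = MolMIM()
smiles = 'CCO'
optimized = model.optimize(smiles, target_logp=2.5, target_sas=2.0)
mol = Chem.MolFromSmiles(optimized)  # returns None if the SMILES is invalid
if mol is None:
    raise ValueError(f'optimizer returned an invalid SMILES: {optimized!r}')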

Strength: Continuous property optimization in latent space. Weakness: Latent space not always semantically meaningful.

Diffusion-Based Generators (DiffSMol, DiGress)

Diffusion models generate molecules by iteratively denoising:

# DiffSMol / DiGress pseudo-API; verify against current release.
# from diffsmol import generate
# mols = generate(n_samples=1000, scaffold='aryl_sulfonamide')

Strength: Strong sample quality reported on the MOSES benchmark. Weakness: Slower than RL; harder to condition on multiple objectives.

JT-VAE (Latent-Space Optimization)

Junction Tree VAE optimizes in latent space then decodes:

# Pseudo-API
# from jtvae import optimize
# best_mol = optimize(smiles='CCO', target='activity', iterations=100)

Strength: Smooth latent space for optimization. Weakness: Older architecture than transformer-based generators, with lower reconstruction quality.

Per-Tool Failure Modes

REINVENT -- reward hacking

Trigger: Generated molecules score high but don't bind target.

Mechanism: Generator exploits QSAR model weaknesses (e.g., simple features that correlate spuriously).

Symptom: The top 100 candidates score well in silico but give a 0% hit rate in vitro.

Fix: Ensemble of 5+ scoring models; structural diversity constraint; orthogonal validation (docking).

Scaffold decoration -- diversity loss

Trigger: After 1000 REINVENT steps, all generated molecules share nearly the same scaffold.

Mechanism: Generator converges to a local optimum.

Symptom: Generated SMILES have Tanimoto similarity > 0.9 to the scaffold.

Fix: Add Tanimoto distance to known actives as bonus; restart with new seed; increase stochasticity.
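
Both the symptom check and the diversity bonus reduce to Tanimoto arithmetic, sketched below in plain Python. Fingerprints are represented as sets of on-bit indices, which is an assumption made to keep the example dependency-free; in practice they would come from a cheminformatics toolkit such as RDKit. The helper names and the toy fingerprints are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def collapse_fraction(fingerprints, reference, threshold=0.9):
    """Fraction of generated fingerprints more similar than threshold to the reference."""
    if not fingerprints:
        return 0.0
    near = sum(1 for fp in fingerprints if tanimoto(fp, reference) > threshold)
    return near / len(fingerprints)


def diversity_bonus(fp, known_actives):
    """Tanimoto distance (1 - similarity) to the nearest known active."""
    if not known_actives:
        return 1.0
    return 1.0 - max(tanimoto(fp, active) for active in known_actives)


# Toy fingerprints: two near-copies of the scaffold and one distinct molecule.
scaffold_fp = set(range(20))
batch = [set(range(20)), set(range(19)), set(range(10)) | set(range(100, 110))]
print(f"collapsed fraction: {collapse_fraction(batch, scaffold_fp):.2f}")
print(f"bonus for distinct molecule: {diversity_bonus(batch[2], [scaffold_fp]):.2f}")
```

Tracking the collapsed fraction per step gives an early warning before the whole batch converges, and the bonus can be added as one more scoring component.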

Generative vs docking mismatch

Trigger: Generated molecules have high predicted QED but Vina scores near 0 kcal/mol (no predicted binding).

Mechanism: Generator not aware of binding pocket geometry.

Symptom: Synthesizable but non-binders.

Fix: Add docking score (Vina or GNINA) to scoring function; dock top candidates as filter.
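
Docking scores are in kcal/mol with more negative values being better, so they need a transform before joining other components on a [0, 1] scale. The sketch below uses a reversed logistic curve; the function name, the midpoint of -7.5 kcal/mol, and the steepness of 1.5 are assumptions to be tuned per target, and this is not presented as REINVENT's built-in transform.

```python
import math


def docking_reward(score_kcal, midpoint=-7.5, steepness=1.5):
    """Map a docking score (kcal/mol, more negative is better) to a reward in (0, 1)."""
    exponent = steepness * (score_kcal - midpoint)
    # Clamp so math.exp stays within float range for extreme scores (e.g. severe clashes).
    exponent = max(-60.0, min(60.0, exponent))
    return 1.0 / (1.0 + math.exp(exponent))


# Hypothetical docking scores for three generated molecules.
for score in (-10.0, -7.5, 0.0):
    print(f"{score:6.1f} kcal/mol -> reward {docking_reward(score):.3f}")
```

A score at the midpoint maps to a reward of 0.5, and non-binders with scores near 0 kcal/mol receive a reward close to zero, so they no longer pass on QED alone.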

MolMIM -- discontinuous objective

Trigger: Optimizing a property with sharp boundaries (e.g., exactly 1 sulfonamide).

Mechanism: Latent-space optimization relies on smooth gradients, which sharp objectives do not provide.

Symptom: Generator oscillates around target.

Fix: Use reward-style scoring instead of property-distance; post-filter for hard constraints.

Common Errors

| Symptom | Cause | Fix |
|---|---|---|
| reinvent exits with OOM | Prior model too large for GPU | Use smaller prior; CPU mode |
| All generated molecules identical | Mode collapse | Reset agent; add diversity bonus |
| Generated SMILES invalid | Tokenizer mismatch | Update reinvent to latest version; validate SMILES post-gen |
| MPO components all zero | Components missing from TOML | Re-check TOML section names |
| QED 1.0 but no synthesis | QED rewards unrealistic features | Add SAScore; run retrosynthesis filter |

References

  • Loeffler et al. (2024), J. Cheminform. -- REINVENT 4.
  • Sanchez-Lengeling et al., ACS Cent. Sci. -- generative chemistry review.
  • Jin et al. -- JT-VAE.

Related Skills

  • chemoinformatics/qsar-modeling - Scoring models
  • chemoinformatics/retrosynthesis - Synthetic feasibility
  • chemoinformatics/reaction-enumeration - Library enumeration alternative
  • chemoinformatics/protac-degraders - Linker design for PROTACs