skills/chemoinformatics/qsar-modeling

Version Compatibility

Reference examples tested with: chemprop 2.0+ (major API change from 1.x), RDKit 2024.09+, scikit-learn 1.4+, MAPIE 0.8+ (conformal prediction), shap 0.44+, PyTorch 2.1+.

Before using code patterns, verify installed versions match. If versions differ:

  • Python: pip show <package> to check the installed version, then help(module.function) to check signatures
  • CLI: chemprop train --help

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
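
The version check above can be scripted. The following is a minimal sketch using only the standard library; the distribution names in MINIMUMS (for example "torch" for PyTorch) and the helper names are assumptions for illustration, and the comparison only looks at the leading numeric components of each version string.

```python
from __future__ import annotations

import inspect
import itertools
from importlib import metadata

# Minimums copied from the compatibility note; the distribution names are
# assumptions about how each package is published on the package index.
MINIMUMS = {
    "chemprop": "2.0",
    "rdkit": "2024.09",
    "scikit-learn": "1.4",
    "mapie": "0.8",
    "shap": "0.44",
    "torch": "2.1",
}


def installed_version(package: str) -> str | None:
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def _numeric_prefix(version: str) -> tuple[int, ...]:
    """Turn '2.1.0+cu121' into (2, 1, 0), stopping at the first non-numeric part."""
    parts = []
    for token in version.split("."):
        digits = "".join(itertools.takewhile(str.isdigit, token))
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare numeric prefixes, padding the shorter one with zeros."""
    a, b = _numeric_prefix(installed), _numeric_prefix(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_versions(minimums: dict[str, str]) -> dict[str, str]:
    """Map each package to 'missing', 'ok (...)' or 'older than ... (...)'."""
    report = {}
    for package, minimum in minimums.items():
        found = installed_version(package)
        if found is None:
            report[package] = "missing"
        elif meets_minimum(found, minimum):
            report[package] = f"ok ({found})"
        else:
            report[package] = f"older than {minimum} ({found})"
    return report


if __name__ == "__main__":
    for package, status in check_versions(MINIMUMS).items():
        print(f"{package}: {status}")
    # inspect.signature shows which parameters a callable accepts in this install.
    print(inspect.signature(metadata.version))
```

Calling inspect.signature on the function that raised a TypeError is usually the quickest way to see which arguments the installed release expects.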

QSAR Modeling

Build quantitative structure-activity relationship (QSAR) models from molecular structure inputs. The choice of model, featurization, and split strategy determines whether the model captures real chemical signal or memorizes the training data. chemprop D-MPNN is the modern open-source standard; transformer-based methods (MolFormer, Uni-Mol) compete on benchmarks. The five OECD principles provide the framework for regulatory acceptance.

For descriptor/fingerprint choices, see chemoinformatics/molecular-descriptors. For ADMET-specific QSAR, see chemoinformatics/admet-prediction.

Model Taxonomy

| Model | Architecture | Use case | Fails when |
|---|---|---|---|
| Random Forest + ECFP4 | Classical baseline | Small data (<200 compounds) | Saturates at ~AUC 0.85 |
| chemprop D-MPNN | Directed message passing | Modern default; 100-10k compounds | Very small datasets (<100) |
| MolFormer | Transformer (87M params) | Large public data | Compute overhead |
| Uni-Mol | 3D-aware transformer | 3D-relevant endpoints | Requires 3D conformers |
| Gaussian Process + ECFP4 | Probabilistic | Active learning | O(N^3) scaling |

Decision: For 200-10k compounds, chemprop 2.0 D-MPNN is the modern standard. For <200 compounds, Random Forest + ECFP4 is competitive.

chemprop 2.0 Training (CLI)

Goal: Train a chemprop D-MPNN ensemble with scaffold-balanced split.

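# Flag and featurizer names have shifted between chemprop 2.x releases;
# confirm each one with: chemprop train --help
# In 2.x the normalized RDKit 2D featurizer is registered as v1_rdkit_2d_normalized.
# 5 folds x 5 ensemble members trains 25 models in total.
chemprop train \
    --data-path data.csv \
    --task-type classification \
    --save-dir model_dir \
    --molecule-featurizers v1_rdkit_2d_normalized \
    --num-folds 5 \
    --ensemble-size 5 \
    --epochs 50 \
    --batch-size 128 \
    --split scaffold_balanced \
    --split-sizes 0.8 0.1 0.1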
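
To make the idea behind a scaffold-balanced split concrete, here is a minimal sketch in pure Python. It is not chemprop's implementation: it only illustrates the principle that all molecules sharing a scaffold land in the same subset, with large scaffold groups placed first and small groups shuffled. The function name and the toy scaffold keys are hypothetical; in practice the keys would be Murcko scaffold SMILES computed with RDKit.

```python
import random
from collections import defaultdict


def scaffold_balanced_split(scaffolds, sizes=(0.8, 0.1, 0.1), seed=0):
    """Split molecule indices so that no scaffold group spans two subsets.

    scaffolds: one scaffold key (any hashable, e.g. a scaffold SMILES) per molecule.
    Returns (train, val, test) lists of indices.
    """
    n = len(scaffolds)
    train_cap, val_cap, test_cap = (fraction * n for fraction in sizes)

    # Group molecule indices by scaffold key.
    groups = defaultdict(list)
    for index, key in enumerate(scaffolds):
        groups[key].append(index)

    # Groups too large to fit comfortably in val/test are placed first,
    # which in practice sends them to the training set.
    big, small = [], []
    for members in groups.values():
        if len(members) > val_cap / 2 or len(members) > test_cap / 2:
            big.append(members)
        else:
            small.append(members)
    big.sort(key=len, reverse=True)
    random.Random(seed).shuffle(small)  # seeded for reproducibility

    # Greedy fill: train first, then validation, remainder to test.
    train, val, test = [], [], []
    for members in big + small:
        if len(train) + len(members) <= train_cap:
            train.extend(members)
        elif len(val) + len(members) <= val_cap:
            val.extend(members)
        else:
            test.extend(members)
    return train, val, test


# Toy data: three large scaffold families plus 30 singleton scaffolds.
toy_scaffolds = ["A"] * 40 + ["B"] * 20 + ["C"] * 10 + [f"s{i}" for i in range(30)]
toy_train, toy_val, toy_test = scaffold_balanced_split(toy_scaffolds, seed=0)
```

Because whole scaffold groups move together, the realized subset sizes only approximate the requested fractions when groups are large.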

OECD 5 Principles

  1. Defined endpoint: specific bioassay, units, threshold definitions
  2. Unambiguous algorithm: reproducible code, fixed seeds
  3. Defined applicability domain (AD): where the model is valid
  4. Appropriate measures of goodness-of-fit, robustness, and predictivity: external test set, cross-validation
  5. Mechanistic interpretation, if possible: biological/chemical rationale

Applicability Domain Methods

| Method | Definition | Advantage |
|---|---|---|
| Ensemble variance | Std across N-model predictions | Built-in to chemprop |
| kNN distance | Mean Tanimoto to k nearest in training | Easy to interpret |
| Leverage | Hat matrix diagonal | Statistical |
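
The kNN distance row can be sketched without any chemistry toolkit by treating each fingerprint as the set of its on-bit indices. This is an illustrative sketch only: the function names are hypothetical, the distance is defined here as 1 minus the mean Tanimoto similarity to the k nearest training compounds, and scoring two empty fingerprints as similarity 0.0 is a convention chosen for this example. In practice the bit sets would come from ECFP4-style fingerprints computed with RDKit.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen for this sketch
    return len(a & b) / union


def knn_mean_distance(query, training, k=5):
    """1 - mean Tanimoto similarity of `query` to its k most similar training compounds."""
    if not training:
        raise ValueError("training set is empty")
    similarities = sorted((tanimoto(query, fp) for fp in training), reverse=True)
    nearest = similarities[:k]
    return 1.0 - sum(nearest) / len(nearest)


# Toy fingerprints (sets of on-bit indices), not real molecules.
toy_training = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}]
near_query = {1, 2, 3}   # identical to one training compound
far_query = {100, 101}   # shares no bits with the training set

near_distance = knn_mean_distance(near_query, toy_training, k=2)
far_distance = knn_mean_distance(far_query, toy_training, k=2)
```

A query would then be flagged as out-of-AD when its distance exceeds a cutoff chosen from the distribution of training-set distances; the cutoff itself is a modeling decision that the table does not fix.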

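Operational rule: Set --ensemble-size 5 at training; at predict time, flag a prediction as out-of-AD when its ensemble std exceeds P95, the 95th percentile of ensemble std over a reference set (for example, the validation set).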
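
The rule above can be sketched with the standard library alone. The sketch assumes the per-member predictions have already been collected into a list of lists (one inner list per ensemble member); the function names are hypothetical, the population standard deviation is used, and the percentile uses linear interpolation between data points.

```python
import statistics


def ensemble_std(member_preds):
    """Per-molecule std across ensemble members.

    member_preds[m][i] is the prediction of ensemble member m for molecule i.
    """
    return [statistics.pstdev(column) for column in zip(*member_preds)]


def p95_threshold(reference_stds):
    """95th percentile of ensemble std over a reference set (needs >= 2 values)."""
    return statistics.quantiles(reference_stds, n=100, method="inclusive")[94]


def flag_out_of_ad(member_preds, threshold):
    """True where the ensemble std exceeds the threshold."""
    return [std > threshold for std in ensemble_std(member_preds)]


# Toy example: 3 ensemble members, 2 molecules.
toy_preds = [
    [0.10, 0.90],
    [0.10, 0.10],
    [0.10, 0.50],
]
toy_stds = ensemble_std(toy_preds)               # members agree on molecule 0 only
toy_flags = flag_out_of_ad(toy_preds, threshold=0.2)

# Threshold derived from a made-up reference distribution of stds.
toy_reference = [float(value) for value in range(101)]
toy_p95 = p95_threshold(toy_reference)
```

Computing the threshold once on the reference set and storing it alongside the model keeps the out-of-AD flag reproducible at predict time.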

References

  • Yang et al. (2019), J. Chem. Inf. Model. 59:3370 -- chemprop scaffold split.
  • OECD -- QSAR validation principles.

Related Skills

  • chemoinformatics/molecular-descriptors - Features
  • chemoinformatics/molecular-standardization - Critical upstream
  • chemoinformatics/admet-prediction - ADMET QSAR