Version Compatibility

Reference examples tested with: chemprop 2.0+ (major API change from 1.x), RDKit 2024.09+, scikit-learn 1.4+, MAPIE 0.8+ (conformal prediction), SHAP 0.44+, PyTorch 2.1+.

Before using the code patterns, verify that the installed versions match. If they differ, check what is actually installed:

Python: pip show <package> for the installed version; help(<module>.<function>) for signatures
CLI: chemprop train --help

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
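The check described above can be sketched with the standard library alone. This is a minimal sketch, not part of any of the listed packages: the helper names (installed_version, version_tuple, meets_minimum, describe_signature) are hypothetical, and the distribution names in MINIMUMS (for example "torch" for PyTorch and "mapie" for MAPIE) are assumptions about how the packages are published on PyPI.

```python
from importlib import metadata
import inspect

# Minimum versions taken from the list above.
# The distribution names on the left are assumptions about the pip names.
MINIMUMS = {
    "chemprop": "2.0",
    "rdkit": "2024.09",
    "scikit-learn": "1.4",
    "mapie": "0.8",
    "shap": "0.44",
    "torch": "2.1",
}

def installed_version(dist_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None

def version_tuple(version):
    """Parse leading numeric components, e.g. '2.1.0rc1' -> (2, 1, 0)."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)

def meets_minimum(found, minimum):
    """Compare two version strings after zero-padding to equal length."""
    a, b = version_tuple(found), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b

def describe_signature(obj):
    """Return the call signature of a function or class as a string."""
    return str(inspect.signature(obj))

def version_report(minimums=MINIMUMS):
    """Map each package to (installed version or None, meets-minimum flag)."""
    report = {}
    for name, minimum in minimums.items():
        found = installed_version(name)
        ok = found is not None and meets_minimum(found, minimum)
        report[name] = (found, ok)
    return report

for name, (found, ok) in version_report().items():
    print(f"{name}: installed={found} ok={ok}")
```

When an example fails with TypeError, calling describe_signature on the function that raised shows the parameters the installed release actually accepts.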

QSAR Modeling

Build quantitative structure-activity relationship (QSAR) models from molecular structure inputs. The choice of model, featurization, and split strategy determines whether the model captures real chemical signal or memorizes the training data. chemprop D-MPNN is the modern open-source standard; transformer-based methods (MolFormer, Uni-Mol) compete on benchmarks. The five OECD principles provide the framework for regulatory acceptance.

For descriptor/fingerprint choices, see chemoinformatics/molecular-descriptors. For ADMET-specific QSAR, see chemoinformatics/admet-prediction.

Model Taxonomy

Model	Architecture	Use case	Fails when
Random Forest + ECFP4	Classical baseline	Small data (<200 compounds)	Saturates at ~AUC 0.85
chemprop D-MPNN	Directed message passing	Modern default; 100-10k compounds	Very small datasets (<100)
MolFormer	Transformer (87M params)	Large public data	Compute overhead
Uni-Mol	3D-aware transformer	3D-relevant endpoints	Requires 3D conformers
Gaussian Process + ECFP4	Probabilistic	Active learning	O(N^3) scaling

Decision: For 200-10k compounds, chemprop 2.0 D-MPNN is the modern standard. For <200 compounds, Random Forest + ECFP4 is competitive.

chemprop 2.0 Training (CLI)

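Goal: Train a chemprop D-MPNN ensemble with a scaffold-balanced split. Flag names can change between 2.x releases, so confirm them with chemprop train --help before running.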

chemprop train \
    --data-path data.csv \
    --task-type classification \
    --save-dir model_dir \
    --molecule-featurizers v1_rdkit_2d_normalized \
    --num-folds 5 \
    --ensemble-size 5 \
    --epochs 50 \
    --batch-size 128 \
    --split scaffold_balanced \
    --split-sizes 0.8 0.1 0.1
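To show what a scaffold-based split is meant to guarantee, here is a simplified sketch of the idea: compounds are grouped by scaffold, and each group is assigned whole to train, validation, or test, so no scaffold appears in more than one set. It is not chemprop's implementation. The function name scaffold_group_split is hypothetical, and the scaffold strings are assumed to be precomputed (for example Bemis-Murcko scaffolds from RDKit, not shown here).

```python
import random
from collections import defaultdict

def scaffold_group_split(scaffolds, sizes=(0.8, 0.1, 0.1), seed=0):
    """Split compound indices so that no scaffold spans two sets.

    scaffolds: one scaffold string per compound (assumed precomputed).
    sizes: target fractions for train, validation, and test.
    """
    # Collect compound indices by scaffold.
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    n = len(scaffolds)
    train_cap = sizes[0] * n
    val_cap = sizes[1] * n

    # Large scaffold groups go first (largest to smallest) so they land in
    # train; the remaining small groups are shuffled to balance the sets.
    big = [g for g in groups.values() if len(g) > val_cap / 2]
    small = [g for g in groups.values() if len(g) <= val_cap / 2]
    big.sort(key=len, reverse=True)
    random.Random(seed).shuffle(small)

    train, val, test = [], [], []
    for group in big + small:
        if len(train) + len(group) <= train_cap:
            train.extend(group)
        elif len(val) + len(group) <= val_cap:
            val.extend(group)
        else:
            test.extend(group)
    return train, val, test

# Toy input: three large scaffold families plus 20 singleton scaffolds.
example = ["A"] * 50 + ["B"] * 20 + ["C"] * 10 + [f"S{i}" for i in range(20)]
train_idx, val_idx, test_idx = scaffold_group_split(example)
print(len(train_idx), len(val_idx), len(test_idx))
```

Because whole scaffold groups are assigned together, the realized split fractions only approximate the requested ones on real data, and the test set contains scaffolds the model has never seen.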

OECD 5 Principles

Defined endpoint: specific bioassay, units, threshold definitions
Unambiguous algorithm: reproducible code, fixed seeds
Defined applicability domain (AD): where the model is valid
Appropriate statistical validation: external test set, cross-validation
Mechanistic interpretation: biological/chemical rationale

Applicability Domain Methods

Method	Definition	Pro
Ensemble variance	Std across N-model predictions	Built-in to chemprop
kNN distance	Mean Tanimoto to k nearest in training	Easy to interpret
Leverage	Hat matrix diagonal	Statistical
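The kNN row of the table can be sketched in plain Python. This is an illustrative sketch under stated assumptions: fingerprints are represented as sets of on-bit indices (in practice they would come from a fingerprint generator such as RDKit's, not shown), the function names are hypothetical, and the value of k and any cutoff applied to the score are choices the source does not specify.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on bits."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)

def mean_knn_similarity(query, training_fps, k=5):
    """Mean Tanimoto similarity of the query to its k nearest training compounds."""
    if not training_fps:
        raise ValueError("training_fps must not be empty")
    sims = sorted((tanimoto(query, fp) for fp in training_fps), reverse=True)
    top = sims[:k]
    return sum(top) / len(top)

def mean_knn_distance(query, training_fps, k=5):
    """Distance form of the same score: 1 minus the mean kNN similarity."""
    return 1.0 - mean_knn_similarity(query, training_fps, k)

# Toy fingerprints as sets of on-bit indices.
training = [{1, 2}, {1, 2, 3, 4}, {9}]
print(mean_knn_similarity({1, 2}, training, k=2))
```

A low mean similarity (high distance) indicates that the query has no close analogues in the training set, which is why this measure is easy to interpret.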

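Operational rule: Set --ensemble-size 5 at training. At predict time, compute the standard deviation of the five model predictions for each compound and flag as out-of-AD any prediction whose ensemble std exceeds the 95th percentile (P95) of the ensemble std observed on a reference set, such as the held-out validation compounds.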
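The thresholding step of the operational rule can be sketched as follows. This is a minimal sketch with assumptions: per-model predictions are taken to be already available as plain lists (one list per ensemble member), the P95 reference distribution is taken to be the ensemble std on validation compounds, population standard deviation and a linearly interpolated percentile are used, and all function names are hypothetical.

```python
import statistics

def ensemble_std(preds_per_model):
    """Per-compound population std across models.

    preds_per_model: list of N lists, each holding M predictions
    (one list per ensemble member, same compound order in each).
    """
    return [statistics.pstdev(column) for column in zip(*preds_per_model)]

def percentile(values, q):
    """Percentile with linear interpolation between closest ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def flag_out_of_domain(query_stds, reference_stds, q=95):
    """Flag compounds whose ensemble std exceeds the q-th percentile of the reference."""
    threshold = percentile(reference_stds, q)
    return [std > threshold for std in query_stds]

# Toy example: two models, two compounds.
stds = ensemble_std([[0.1, 0.5], [0.1, 0.9]])
print(stds)
```

The threshold is computed once from the reference set and then reused for every new batch, so the flag means the same thing across prediction runs.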

References

Yang et al. (2019), J. Chem. Inf. Model. 59:3370 -- chemprop D-MPNN and scaffold split.
OECD -- QSAR validation principles.

Related Skills

chemoinformatics/molecular-descriptors - Features
chemoinformatics/molecular-standardization - Critical upstream
chemoinformatics/admet-prediction - ADMET QSAR