Clever Hans in Chemistry: Chemist Style Signals Confound Activity Prediction on Public Benchmarks

Dec 24, 2025 · 9:42
q-bio.BM · Machine Learning · physics.chem-ph

Abstract

Can machine learning models identify which chemist made a molecule from structure alone? If so, models trained on literature data may exploit chemist intent rather than learning causal structure-activity relationships. We test this by linking ChEMBL assays to publication authors and training a 1,815-class classifier to predict authors from molecular fingerprints, achieving 60% top-5 accuracy under scaffold-based splitting. We then train an activity model that receives only a protein identifier and an author-probability vector derived from structure, with no direct access to molecular descriptors. This author-only model achieves predictive power comparable to a simple baseline that has access to structure. This reveals a "Clever Hans" failure mode: models can predict bioactivity largely by inferring chemist goals and favorite targets without requiring a lab-independent understanding of chemistry. We analyze the sources of this leakage, propose author-disjoint splits, and recommend dataset practices to decouple chemist intent from biological outcomes.

Summary

This paper investigates a "Clever Hans" phenomenon in chemistry machine learning, where models predict bioactivity by inferring chemist intent rather than learning true structure-activity relationships. The authors hypothesized that models trained on public datasets like ChEMBL might exploit stylistic regularities in how chemists design molecules, leading to inflated benchmark performance that doesn't generalize. They tested this by first training a classifier to predict the author of a molecule based solely on its structure, achieving 60% top-5 accuracy across 1,815 authors under scaffold-based splitting. They then trained an activity prediction model that received only the protein identifier and the author-probability vector (derived from the first model) as input, *without* direct access to molecular descriptors. Surprisingly, this "author-only" model achieved predictive power comparable to a simple baseline using structural descriptors (ECFPs). The key finding is that models can predict bioactivity largely by inferring chemist goals and favorite targets, exposing a significant confound in public medicinal chemistry datasets: models may be learning the "sociology of the dataset" (who works on what) rather than the underlying chemistry. The authors propose several mitigations, including author-disjoint splits and reporting source metadata, to better evaluate and train models that learn genuine structure-activity relationships. This research highlights the importance of accounting for dataset biases and provenance when developing and evaluating machine learning models in chemistry.
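The two-stage setup described above can be sketched minimally. Everything here is illustrative: toy random "fingerprints" stand in for ECFPs, and a nearest-centroid classifier with a softmax stands in for the paper's 1,815-class gradient-boosted author model. The point is only the plumbing: the downstream activity model sees a protein identifier plus the author-probability vector, never the structure itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 3 "authors", each with a characteristic style vector in a
# 16-dimensional stand-in for fingerprint space, plus per-molecule noise.
n_authors, n_bits = 3, 16
styles = rng.random((n_authors, n_bits))
X_train = np.vstack([rng.random((20, n_bits)) * 0.2 + styles[a]
                     for a in range(n_authors)])
y_train = np.repeat(np.arange(n_authors), 20)

# Stage 1: author classifier. Here, nearest-centroid with a softmax over
# negative distances, yielding an author-probability vector per molecule.
centroids = np.vstack([X_train[y_train == a].mean(axis=0)
                       for a in range(n_authors)])

def author_probs(fp):
    logits = -np.linalg.norm(centroids - fp, axis=1)
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Stage 2: the activity model's input is only (protein id, author probs);
# the fingerprint itself is discarded after stage 1.
fp_new = rng.random(n_bits) * 0.2 + styles[1]   # a molecule "in author 1's style"
p = author_probs(fp_new)
features = np.concatenate(([2.0], p))           # protein id 2 + author probabilities
```

A real replication would swap in RDKit ECFP fingerprints and the paper's gradient-boosting classifier, but the feature handoff between the two stages is the same.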

Key Insights

  • Models can predict the author of a molecule from its structure with 60% top-5 accuracy (1,815-way classification) using ECFP fingerprints and gradient boosting, even with scaffold-based splitting. This demonstrates that "chemist style" is encoded in molecular structure.
  • An activity prediction model using only author probabilities and protein identifiers achieves performance comparable to a model using ECFP fingerprints and protein identifiers (AUROC around 0.65). This indicates that much of the predictive signal in ChEMBL-derived benchmarks can be attributed to chemist style rather than true structure-activity relationships.
  • Adding author probabilities to an ECFP+protein model yields only modest additional gains, suggesting that chemist style already captures a large fraction of the predictive signal.
  • The author-probability vectors retain a rich, highly structured view of chemical space, as demonstrated by high ROC-AUC values (median around 0.9) when predicting individual ECFP bits from the author probabilities.
  • The paper highlights the importance of source metadata (authors, labs, institutions) as a potential confounder in public medicinal chemistry datasets.
  • The "Clever Hans" effect manifests as models learning the "sociology of the dataset" rather than the underlying chemistry.
  • The authors propose author-aware splits (e.g., author-disjoint) as a mitigation strategy.
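The proposed author-disjoint split amounts to assigning whole authors to either train or test, so no author contributes records to both sides. A minimal sketch, with made-up records and field names, and a greedy smallest-groups-first fill as one illustrative assignment policy:

```python
from collections import defaultdict

def author_disjoint_split(records, test_fraction=0.2):
    """Assign entire author groups to train or test; no author spans both."""
    by_author = defaultdict(list)
    for rec in records:
        by_author[rec["author"]].append(rec)
    train, test = [], []
    target = test_fraction * len(records)
    # Fill the test set with the smallest author groups first, so the
    # target fraction is approached without ever splitting an author.
    for author in sorted(by_author, key=lambda a: len(by_author[a])):
        bucket = test if len(test) < target else train
        bucket.extend(by_author[author])
    return train, test

records = [
    {"author": "smith", "smiles": "CCO",      "active": 1},
    {"author": "smith", "smiles": "CCN",      "active": 0},
    {"author": "lee",   "smiles": "c1ccccc1", "active": 1},
    {"author": "patel", "smiles": "CC(=O)O",  "active": 0},
    {"author": "patel", "smiles": "CCCl",     "active": 1},
]
train, test = author_disjoint_split(records, test_fraction=0.2)
train_authors = {r["author"] for r in train}
test_authors = {r["author"] for r in test}
```

Lab-disjoint or site-disjoint variants would group on a lab or institution field instead of the author field.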

Practical Implications

  • Benchmark designers should retain and report source metadata (author, lab, etc.) to allow for better analysis and mitigation of biases.
  • Researchers should include source-only or source+target baselines to quantify the contribution of source signals in their models.
  • Practitioners should consider source-aware splits (author-disjoint, lab-disjoint, site-disjoint) alongside scaffold and temporal splits when evaluating models on public datasets.
  • This work motivates future research into adversarial debiasing techniques to remove source-related biases from machine learning models.
  • The findings underscore the need for caution when interpreting results on public medicinal chemistry benchmarks without accounting for potential "Clever Hans" effects.
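The recommended source-only baseline amounts to asking how far (author, target) identity alone goes before any structure is considered. A minimal lookup-table version, with invented records and a flat prior for unseen pairs, might look like:

```python
from collections import defaultdict

def fit_source_baseline(records, prior=0.5):
    """Mean activity per (author, target) pair; fall back to a prior otherwise."""
    sums, counts = defaultdict(float), defaultdict(int)
    for rec in records:
        key = (rec["author"], rec["target"])
        sums[key] += rec["active"]
        counts[key] += 1
    table = {k: sums[k] / counts[k] for k in sums}
    return lambda author, target: table.get((author, target), prior)

train = [
    {"author": "smith", "target": "EGFR",  "active": 1},
    {"author": "smith", "target": "EGFR",  "active": 1},
    {"author": "smith", "target": "HDAC1", "active": 0},
    {"author": "lee",   "target": "EGFR",  "active": 0},
]
predict = fit_source_baseline(train)
```

If a baseline this crude approaches a structure-based model's AUROC on a given benchmark, that benchmark's headline numbers are likely inflated by source leakage.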
