SMDD-Bench Leaderboard

A benchmark for LLM agents on small molecule drug design. Agents are given a sandboxed Python environment plus a structure-prediction tool (Boltz) and an ADMET predictor, then asked to solve tasks whose solutions are easy to verify computationally but hard to produce.

Read the paper
(paper link coming soon)
Submit your results
(submission form coming soon)
Full benchmark (502 tasks per run).

| Agent | Pass rate | 2D Pharmacophore Identification | Interaction Point Discovery | Scaffold Hopping | Lead Optimization | Fragment Assembly | Cost / task | Avg time |
|---|---|---|---|---|---|---|---|---|
| gpt-5.4_medium_minimalist_agent (boltz=8 · admet=15) | 40.2% | 12.0% | 0.0% | 3.8% | 57.6% | 1.7% | $0.78 | 23.4m |
| gemini-3.1-pro_medium_minimalist_agent (boltz=8 · admet=15) | 39.0% | 20.0% | 4.0% | 0.0% | 55.6% | 1.7% | $0.62 | 19.1m |
| claude-4.6-sonnet_medium_minimalist_agent (boltz=8 · admet=15) | 38.0% | 28.0% | 0.0% | 3.8% | 53.5% | 0.0% | $1.61 | 23.8m |
| kimi-k2.5-thinking_minimalist_agent (boltz=8 · admet=15) | 30.3% | 12.0% | 0.0% | 1.9% | 43.5% | 0.0% | $0.40 | 30.2m |
| qwen3.5-397b-a17b_minimalist_agent (boltz=8 · admet=15) | 27.5% | 4.0% | 0.0% | 1.9% | 40.0% | 0.0% | $0.75 | 18.6m |
| deepseek-v3.2_minimalist_agent (boltz=8 · admet=15) | 24.3% | 8.0% | 0.0% | 3.8% | 34.7% | 0.0% | $0.43 | 43.0m |
| minimax-m2.7_minimalist_agent (boltz=8 · admet=15) | 19.3% | 16.0% | 0.0% | 1.9% | 27.1% | 0.0% | $0.36 | 47.0m |

About the benchmark

Each task is generated and synthesized to be both challenging and guaranteed solvable. The benchmark is a testbed for LLM capability under realistic chemistry constraints; however, the individual tasks themselves may not be of significant therapeutic interest. Binding affinity values throughout SMDD-Bench are Boltz-2 outputs of log10(IC50); lower values indicate stronger predicted binding.
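
Because affinities are reported on a log10 scale, a difference of 1.0 corresponds to a tenfold change in predicted IC50. The sketch below makes that arithmetic explicit. It assumes the underlying IC50 is expressed in micromolar, which matches our reading of the Boltz-2 documentation but is not stated on this page; the function names are illustrative and not part of SMDD-Bench.

```python
def log_ic50_to_ic50_um(log_ic50: float) -> float:
    """Convert log10(IC50) to a linear IC50, assuming micromolar units."""
    return 10.0 ** log_ic50


def log_ic50_to_pic50(log_ic50: float) -> float:
    """pIC50 = -log10(IC50 in molar) = 6 - log10(IC50 in micromolar)."""
    return 6.0 - log_ic50


def fold_improvement(log_ic50_before: float, log_ic50_after: float) -> float:
    """How many times lower the predicted IC50 became (values > 1 mean stronger)."""
    return 10.0 ** (log_ic50_before - log_ic50_after)
```

For example, under the micromolar assumption a value of -1.0 corresponds to a predicted IC50 of 0.1 µM (pIC50 of 7), and moving from 0.5 to -0.5 is a tenfold predicted improvement.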

Type 1

2D Pharmacophore Identification

Given sets of active and inactive molecules for a protein target, write a Python predicate that distinguishes actives from inactives via 2D structural reasoning.
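
As a rough illustration of what such a predicate and its evaluation might look like, the sketch below scores a toy rule against labelled SMILES strings. Everything here is an assumption: the page does not specify the predicate signature or the scoring metric, balanced accuracy is only one plausible choice, and the substring test stands in for real substructure matching (for example with RDKit inside the sandbox).

```python
from typing import Callable, Iterable


def toy_predicate(smiles: str) -> bool:
    """Placeholder rule: flag SMILES containing a sulfonamide-like token.

    Plain substring matching is fragile because the same group can be
    written in several ways; a real predicate would use substructure search.
    """
    return "S(=O)(=O)N" in smiles


def balanced_accuracy(
    predicate: Callable[[str], bool],
    actives: Iterable[str],
    inactives: Iterable[str],
) -> float:
    """Mean of the true-positive rate on actives and true-negative rate on inactives."""
    actives, inactives = list(actives), list(inactives)
    tpr = sum(predicate(s) for s in actives) / len(actives)
    tnr = sum(not predicate(s) for s in inactives) / len(inactives)
    return 0.5 * (tpr + tnr)


# Hypothetical example data, not taken from the benchmark.
example_actives = ["CC1=CC=C(C=C1)S(=O)(=O)N", "NS(=O)(=O)c1ccccc1"]
example_inactives = ["CCO", "c1ccccc1"]
example_score = balanced_accuracy(toy_predicate, example_actives, example_inactives)
```

The second active is a sulfonamide written with a different atom order, so the substring rule misses it. This is exactly the kind of failure that motivates structural reasoning over string matching.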

Type 2

Interaction Point Discovery

Given a protein pocket, propose 3D coordinates of the three interaction points most likely conserved across diverse binders (donor/acceptor/hydrophobic, etc.).
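
One way to think about verifying such a proposal is to match predicted points to reference points within a distance tolerance. The sketch below is a hypothetical check, not the benchmark's scoring rule: the 2.0 Å tolerance, the one-to-one matching, and the function name are all assumptions, and interaction types (donor, acceptor, hydrophobic) are ignored for brevity.

```python
import math
from itertools import permutations
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def matched_points(
    predicted: Sequence[Point],
    reference: Sequence[Point],
    tolerance: float = 2.0,
) -> int:
    """Largest number of predicted points that lie within `tolerance` of
    distinct reference points, over all one-to-one assignments."""
    best = 0
    # Try every assignment of predicted points to distinct reference points.
    for assignment in permutations(range(len(reference)), len(predicted)):
        hits = sum(
            math.dist(p, reference[j]) <= tolerance
            for p, j in zip(predicted, assignment)
        )
        best = max(best, hits)
    return best


# Hypothetical coordinates in angstroms.
example_predicted = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (0.0, 5.0, 0.0)]
example_reference = [(0.5, 0.0, 0.0), (5.0, 1.0, 0.0), (10.0, 10.0, 10.0)]
example_hits = matched_points(example_predicted, example_reference)
```

Brute-force assignment is acceptable here because only three points are involved; a larger point set would call for a proper assignment algorithm.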

Type 3

Scaffold Hopping

Given an active molecule, propose a molecule with a chemically distinct scaffold that preserves the protein-ligand interactions and remains a binder.

Type 4

Lead Optimization

Modify a strong binder to improve sampled objectives (potency, ADMET, etc.) while satisfying hard constraints and holding other properties constant.
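
The three requirements in this task type (improve objectives, satisfy hard constraints, hold other properties constant) can be sketched as a single pass/fail check. The property names, bounds, and tolerance below are hypothetical and chosen only for illustration; the page does not specify how the benchmark encodes or thresholds them.

```python
from typing import Dict, Iterable, Tuple


def passes_lead_optimization(
    parent: Dict[str, float],
    candidate: Dict[str, float],
    objectives: Dict[str, str],                        # property -> "lower" or "higher"
    hard_constraints: Dict[str, Tuple[float, float]],  # property -> (min, max)
    held_constant: Iterable[str],
    tolerance: float = 0.1,
) -> bool:
    """Return True if the candidate strictly improves every objective,
    stays inside every hard bound, and keeps held properties near the parent."""
    for name, direction in objectives.items():
        delta = candidate[name] - parent[name]
        if direction == "lower" and delta >= 0:
            return False
        if direction == "higher" and delta <= 0:
            return False
    for name, (low, high) in hard_constraints.items():
        if not low <= candidate[name] <= high:
            return False
    for name in held_constant:
        if abs(candidate[name] - parent[name]) > tolerance:
            return False
    return True


# Hypothetical property values for a parent molecule and two candidates.
parent = {"log_ic50": -0.5, "logp": 3.00, "mol_wt": 420.0}
good = {"log_ic50": -1.2, "logp": 3.05, "mol_wt": 435.0}
too_heavy = {"log_ic50": -1.2, "logp": 3.05, "mol_wt": 510.0}

objectives = {"log_ic50": "lower"}          # lower log10(IC50) is stronger
hard_constraints = {"mol_wt": (0.0, 500.0)}
held_constant = ["logp"]
```

With these inputs, `good` passes while `too_heavy` fails the molecular-weight bound even though its potency improves.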

Type 5

Fragment Assembly

Given 1–2 fragments with 3D poses in a pocket, design a single drug-like molecule that incorporates the fragments and binds the target.

Diversity Leaderboard

The diversity benchmark resamples a 20-task subset of Lead Optimization, with every model producing 10 submissions per task. We measure whether agents produce diverse, distinct, and novel successful solutions rather than converging on the same passing solution.

Diversity benchmark: 20 tasks × 10 rollouts = 200 submissions per agent.

| Agent | Avg successful | Avg unique & successful | Novel & successful | Pairwise Tanimoto | Cost / task | Avg time |
|---|---|---|---|---|---|---|
| claude-4.6-sonnet_medium_minimalist_agent (boltz=8 · admet=15) | 8.40 | 3.70 | 74.0% | 0.823 | $0.87 | 14.7m |
| gemini-3.1-pro_medium_minimalist_agent (boltz=8 · admet=15) | 8.00 | 4.00 | 67.6% | 0.809 | $0.59 | 17.0m |
| gpt-5.4_medium_minimalist_agent (boltz=8 · admet=15) | 7.90 | 2.75 | 64.6% | 0.863 | $0.66 | 18.4m |
| qwen3.5-397b-a17b_minimalist_agent (boltz=8 · admet=15) | 7.25 | 3.55 | 67.2% | 0.814 | $0.72 | 15.3m |
| kimi-k2.5-thinking_minimalist_agent (boltz=8 · admet=15) | 6.00 | 3.85 | 65.0% | 0.786 | $0.44 | 28.9m |
| minimax-m2.7_minimalist_agent (boltz=8 · admet=15) | 6.00 | 4.05 | 73.1% | 0.763 | $0.34 | 47.7m |
| deepseek-v3.2_minimalist_agent (boltz=8 · admet=15) | 5.35 | 3.85 | 68.4% | 0.763 | $0.47 | 59.6m |
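
For readers unfamiliar with the similarity measure, the sketch below shows how a mean pairwise Tanimoto score and a count of unique solutions could be computed for one task. It is a minimal illustration under stated assumptions: fingerprints are represented as plain sets of "on" bit indices (in practice they would come from a cheminformatics toolkit), SMILES strings are assumed to be already canonicalized, and the exact definitions used for the leaderboard columns are not given on this page.

```python
from itertools import combinations
from typing import List, Optional, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity |A ∩ B| / |A ∪ B| between two bit sets."""
    union = len(a | b)
    if union == 0:
        return 1.0  # two empty fingerprints are treated as identical
    return len(a & b) / union


def mean_pairwise_tanimoto(fingerprints: List[Set[int]]) -> Optional[float]:
    """Average similarity over all unordered pairs; None if fewer than two items."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        return None
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


def count_unique(canonical_smiles: List[str]) -> int:
    """Number of distinct molecules, assuming canonical SMILES as input."""
    return len(set(canonical_smiles))


# Hypothetical fingerprints and SMILES for three successful submissions.
example_fps = [{1, 2, 3}, {2, 3, 4}, {1, 2, 3}]
example_smiles = ["CCO", "CCN", "CCO"]
example_similarity = mean_pairwise_tanimoto(example_fps)
example_unique = count_unique(example_smiles)
```

A lower mean pairwise similarity indicates a more diverse set of solutions; values near 1.0 indicate that the submissions are close variants of one another.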

Citing SMDD-Bench

If you use SMDD-Bench in your work, please cite it as follows:

@misc{smddbench2026,
  title  = {SMDD-Bench: A Small Molecule Drug Design Benchmark for LLM Agents},
  author = {Your Name and Collaborators},
  year   = {2026},
  eprint = {arXiv:xxxx.xxxxx},
  archivePrefix = {arXiv},
  url    = {https://smddbench.com},
}