Submit your model to the 2025 SCT-Bench Competition by May 2025. Winners will be featured on our benchmark website.
SCT-Bench is a benchmark for evaluating clinical reasoning in large language models (LLMs) using Script Concordance Tests (SCTs). SCTs are scored by comparing an LLM's response to a clinical challenge with the responses of a panel of expert clinicians. The model earns points in proportion to the number of experts who chose the same answer. We evaluate the performance of various state-of-the-art LLMs against that of physicians and students on a novel set of SCT questions.
Script Concordance Testing is a validated medical assessment tool designed to evaluate clinical reasoning under uncertainty. Unlike traditional multiple-choice questions, SCTs measure how new information alters diagnostic and treatment hypotheses, a critical aspect of real-world clinical decision-making.
Will your model perform better than the current leaders on the private test set?
Rank | Model/Group | Overall Score (%) |
---|---|---|
Submission deadline: May 2025.
This leaderboard will be continually updated to include the four highest-performing models measured on the complete private test set. Note: Some values are missing (-) as they were not recorded at the original SCT testing sites.
SCT-Bench comprises 750 SCT questions drawn from diverse international datasets, including the Open Medical SCT and Adelaide SCT datasets. Each question is designed to evaluate clinical reasoning under uncertainty.
Example SCT Question with Expert Scoring
Clinical Stem | If you were thinking of: | And then you find: | Category | Much less likely (-2) | Slightly less likely (-1) | Neither more nor less (0) | Slightly more likely (1) | Much more likely (2) |
---|---|---|---|---|---|---|---|---|
A 27-year-old male presents to the doctor with weakness affecting his right arm. He has a manually repetitive job and also suffered a shoulder dislocation while playing sport 1 week ago. | Carpal tunnel syndrome | He also complains of "shooting" pain in his neck | Expert opinions | 10 | 7 | 0 | 0 | 0 |
| | | | Score values | 1.0 | 0.7 | 0 | 0 | 0 |

The five rating columns complete the prompt "This diagnosis becomes:".
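Each SCT question presents a clinical scenario, a diagnostic or treatment hypothesis, and new information. Expert clinicians rate how this new information affects the likelihood of the hypothesis. Their responses are aggregated into weighted scores: the most common expert response receives a score of 1.0, and every other response is scored relative to it. In the example above, 10 experts chose the most common rating and 7 chose the next one, which therefore scores 7/10 = 0.7.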
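The aggregation described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own scoring code: the function names (`score_key`, `score_response`, `overall_score_percent`) are hypothetical, and averaging per-question credit into a percentage is an assumption about how the "Overall Score (%)" column is derived, which this page does not spell out.

```python
# Five-point rating scale taken from the example table.
RATINGS = (-2, -1, 0, 1, 2)


def score_key(expert_counts):
    """Return the credit for each rating: its vote count divided by the
    vote count of the most common (modal) rating."""
    top = max((expert_counts.get(r, 0) for r in RATINGS), default=0)
    if top == 0:
        raise ValueError("at least one expert response is required")
    return {r: expert_counts.get(r, 0) / top for r in RATINGS}


def score_response(expert_counts, model_rating):
    """Credit earned by a single model answer on one question."""
    if model_rating not in RATINGS:
        raise ValueError(f"rating must be one of {RATINGS}")
    return score_key(expert_counts)[model_rating]


def overall_score_percent(question_scores):
    """Assumption: the overall score is the mean per-question credit,
    expressed as a percentage."""
    scores = list(question_scores)
    if not scores:
        raise ValueError("no question scores given")
    return 100.0 * sum(scores) / len(scores)


# Expert votes from the example question: 10 chose -2, 7 chose -1.
example_counts = {-2: 10, -1: 7, 0: 0, 1: 0, 2: 0}
example_key = score_key(example_counts)
```

With the example panel, a model answering "Much less likely (-2)" earns full credit, one answering "Slightly less likely (-1)" earns 0.7, and any other rating earns nothing.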