SCT-Bench

What is SCT-Bench?

SCT-Bench is a benchmark for evaluating clinical reasoning in large language models (LLMs) using Script Concordance Tests (SCTs). An SCT is scored by comparing the model's response to each clinical question with the responses of a panel of expert clinicians. The model earns credit for an answer in proportion to the number of experts who chose it, with the most common expert answer earning full credit. We compare the performance of several state-of-the-art LLMs with that of physicians and students on a novel set of SCT questions.

Why This Competition?

Script Concordance Testing is a validated medical assessment tool designed to evaluate clinical reasoning under uncertainty. Unlike traditional multiple-choice questions, SCTs measure how new information alters diagnostic and treatment hypotheses—a critical aspect of real-world clinical decision-making.

2025 Competition Leaderboard

Will your model perform better than the current leaders on the private test set?

Rank	Model/Group	Overall Score (%)

🚀 Submit your model

Submission deadline: July 2025.

About SCT-Bench

Dataset Overview

SCT-Bench comprises 750 SCT questions drawn from diverse international datasets, including the Open Medical SCT and Adelaide SCT datasets. Each question is designed to evaluate clinical reasoning under uncertainty.

Example SCT Question with Expert Scoring

Clinical Stem	If you were thinking of:	And then you find:	Category	This diagnosis becomes: Much less likely (-2)	Slightly less likely (-1)	Neither more nor less (0)	Slightly more likely (1)	Much more likely (2)
A 27-year-old male presents to the doctor with weakness affecting his right arm. He has a manually repetitive job and also suffered a shoulder dislocation while playing sport 1 week ago.	Carpal tunnel syndrome	He also complains of "shooting" pain in his neck	Expert opinions	10	7	0	0	0
			Score values	1.0	0.7	0	0	0

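Each SCT question presents a clinical scenario, a diagnostic or treatment hypothesis, and a new piece of information. Expert clinicians rate how the new information affects the likelihood of the hypothesis on a five-point scale from -2 to 2. Their responses are aggregated into weighted score values: the most common expert response receives a score of 1.0, and every other response receives its count divided by the count of the most common response. In the example above, 10 experts chose -2 (score 1.0) and 7 chose -1 (7/10 = 0.7).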
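
The aggregate scoring shown in the example table can be sketched in a few lines of Python. This is a minimal illustration, not the official SCT-Bench scorer: the function names are hypothetical, and the overall score is assumed here to be the mean per-question credit expressed as a percentage, which this page does not spell out.

```python
from typing import List, Sequence

# Five-point scale from the example table, ordered from
# "much less likely" (-2) to "much more likely" (2).
LIKERT_SCALE = (-2, -1, 0, 1, 2)


def score_values(expert_counts: Sequence[int]) -> List[float]:
    """Turn per-option expert counts into score values.

    The most common response gets 1.0; every other response gets
    its count divided by the count of the most common response.
    """
    modal_count = max(expert_counts)
    if modal_count <= 0:
        raise ValueError("at least one expert response is required")
    return [count / modal_count for count in expert_counts]


def credit_for_answer(answer: int, expert_counts: Sequence[int]) -> float:
    """Credit a model earns for one answer on the -2..2 scale."""
    return score_values(expert_counts)[LIKERT_SCALE.index(answer)]


def overall_score_percent(
    answers: Sequence[int], panels: Sequence[Sequence[int]]
) -> float:
    """ASSUMPTION: overall score = mean per-question credit * 100."""
    if not answers or len(answers) != len(panels):
        raise ValueError("need one expert panel per answer")
    credits = [credit_for_answer(a, p) for a, p in zip(answers, panels)]
    return 100.0 * sum(credits) / len(credits)


# Expert counts from the carpal tunnel example above.
example_panel = [10, 7, 0, 0, 0]
print(score_values(example_panel))  # [1.0, 0.7, 0.0, 0.0, 0.0]
```

With this scheme, a model answering "slightly less likely" (-1) on the example question would earn 0.7 of a point, and an answer that no expert chose would earn nothing.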