Multimodal Large Language Models (MLLMs) have tremendous potential to improve the accuracy, availability, and cost-effectiveness of healthcare by providing automated solutions or serving as aids to medical professionals. Despite promising first steps in developing medical MLLMs in the past few years, their capabilities and limitations are not well-understood.
MediConfusion is a challenging medical Visual Question Answering (VQA) benchmark dataset that probes the failure modes of medical MLLMs from a vision perspective. We reveal that state-of-the-art models are easily confused by image pairs that are otherwise visually dissimilar and clearly distinct to medical experts. Strikingly, all available models (open-source and proprietary) achieve performance below random guessing on MediConfusion, raising serious concerns about the reliability of existing medical MLLMs for healthcare deployment.
What is MediConfusion?
MediConfusion is a radiology-focused benchmark designed to evaluate the visual reasoning of MLLMs. Instead of curating adversarial samples, our approach leverages a high-level understanding of how MLLMs encode visual information. MediConfusion consists of VQA pairs, each presenting two different medical images that share the same question and answer options; however, the correct answer differs between the two images. This tests whether models can truly distinguish between medical images!
Dataset curation
We find confusing image pairs in ROCO, a multimodal dataset of about 80k radiology images and their corresponding captions extracted from PMC-OA. We seek out pairs with clear visual differences, but high similarity in the feature space of medical MLLMs. This implies that at least one of the images in the pair is compressed ambiguously, and so it is likely that relevant visual information is lost in the encoding. In particular, we search for pairs that have similar BiomedCLIP embeddings (indicating they may be ambiguously encoded by the MLLM) but dissimilar DINOv2 embeddings (indicating visual differences).
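To make the selection criterion concrete, here is a minimal sketch of the pair-mining step, assuming precomputed BiomedCLIP and DINOv2 embedding matrices for the same set of ROCO images. The function name and similarity thresholds are illustrative and not the exact values used to build the benchmark.

```python
import numpy as np

def mine_confusing_pairs(biomedclip_emb, dinov2_emb, sim_hi=0.95, sim_lo=0.5):
    """Return index pairs (i, j) whose BiomedCLIP embeddings are highly similar
    (ambiguous for the medical encoder) but whose DINOv2 embeddings are not
    (clear visual difference). Thresholds are illustrative, not the benchmark's."""
    # L2-normalize so that inner products are cosine similarities.
    b = biomedclip_emb / np.linalg.norm(biomedclip_emb, axis=1, keepdims=True)
    d = dinov2_emb / np.linalg.norm(dinov2_emb, axis=1, keepdims=True)
    clip_sim = b @ b.T
    dino_sim = d @ d.T
    n = clip_sim.shape[0]
    pairs = []
    for i in range(n):                      # brute-force O(n^2) scan; fine for a sketch
        for j in range(i + 1, n):
            if clip_sim[i, j] >= sim_hi and dino_sim[i, j] <= sim_lo:
                pairs.append((i, j))
    return pairs
```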
For each pair, which we call a confusing pair, we provide the image captions to an LLM and prompt it to generate a two-choice question whose correct answer differs for the two images. A radiologist then reviews the extracted pairs and corresponding questions, checking for quality, correctness, and relevance, and revises the questions to improve language quality. Our final filtered dataset consists of 352 curated questions across 9 subspecialties.
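For illustration, the question-generation step could look like the sketch below. The prompt wording and the `call_llm` helper are placeholders, not the exact prompt or client used for MediConfusion.

```python
# Illustrative sketch of turning a caption pair into a shared two-choice question.
PROMPT_TEMPLATE = """You are given captions of two different radiology images.
Caption A: {caption_a}
Caption B: {caption_b}

Write ONE question with exactly two answer options (I) and (II) such that the
question applies to both images, option (I) is correct for image A, and
option (II) is correct for image B."""

def generate_question(caption_a: str, caption_b: str, call_llm) -> str:
    """`call_llm` is a placeholder for any chat-completion client; the returned
    question is later reviewed and revised by a radiologist."""
    prompt = PROMPT_TEMPLATE.format(caption_a=caption_a, caption_b=caption_b)
    return call_llm(prompt)
```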
Results
In MediConfusion, we use three key metrics to evaluate MLLMs (see the short sketch after this list for how they are computed):
- Set accuracy: The proportion of VQA pairs for which the model answers both questions correctly.
- Individual accuracy: The proportion of correctly answered questions, irrespective of pairing.
- Confusion: The proportion of pairs for which the model selects the same answer option for both images.
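To make these definitions concrete, here is a minimal sketch of how the metrics can be computed, assuming a hypothetical `results` list that stores the chosen and correct option for both images of each pair.

```python
def evaluate(results):
    """`results` is a list of pairs; each pair is a 2-tuple of
    (chosen_option, correct_option) for the two images in the pair."""
    n_pairs = len(results)
    # Set accuracy: both questions in a pair answered correctly.
    set_correct = sum(
        all(chosen == correct for chosen, correct in pair) for pair in results
    )
    # Individual accuracy: correct answers over all 2 * n_pairs questions.
    individual_correct = sum(
        chosen == correct for pair in results for chosen, correct in pair
    )
    # Confusion: same option chosen for both images of a pair.
    confused = sum(pair[0][0] == pair[1][0] for pair in results)
    return {
        "set_accuracy": set_correct / n_pairs,
        "individual_accuracy": individual_correct / (2 * n_pairs),
        "confusion": confused / n_pairs,
    }

# Example: one pair answered fully correctly, one pair confused (same option picked twice).
# evaluate([(("I", "I"), ("II", "II")), (("I", "I"), ("I", "II"))])
# -> {'set_accuracy': 0.5, 'individual_accuracy': 0.75, 'confusion': 0.5}
```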
Evaluating MLLMs fairly is often challenging due to their sensitivity to the specific prompt format and phrasing. Moreover, models vary greatly in how well they handle the multiple-choice question answering format. Therefore, to provide a fair comparison, we use several evaluation techniques to assess performance and report the best numbers for each model. We consider a wide range of models, including open-source general-domain and medical MLLMs as well as state-of-the-art proprietary models. We summarize the results below.
| Rank | Model | Version | Set acc. (%) | Confusion (%) |
|---|---|---|---|---|
| 🥇 | Random Guessing | – | 25.00 | 50.00 |
| 🥈 | Gemini | 1.5 Pro | 19.89 | 58.52 |
| 🥉 | GPT | 4o (release 20240513) | 18.75 | 75.00 |
| 4 | InstructBLIP | Vicuna 7B | 12.50 | 80.35 |
| 5 | LLaVA | v1.6/Mistral 7B | 9.09 | 85.80 |
| 6 | Claude | 3 Opus | 8.52 | 84.09 |
| 7 | BLIP-2 | Opt 2.7B | 6.82 | 86.93 |
| 8 | RadFM | – | 5.68 | 85.80 |
| 9 | MedFlamingo | – | 4.55 | 98.30 |
| 10 | LLaVA-Med | v1.5/Mistral 7B | 1.14 | 97.16 |
Alarmingly, all MLLMs perform below random guessing in terms of set accuracy, corroborating our hypothesis that models struggle to differentiate between the extracted image pairs at the level of detail necessary for accurate medical reasoning. This observation is further supported by the markedly high (often above 90%) confusion scores, indicating that models tend to select the same answer for both images within a pair. Even RadFM, a model that does not leverage a CLIP-style image encoder, is confused on our benchmark (82.39% confusion score), with performance well below random guessing. As proprietary models most likely also leverage visual encoders other than CLIP, the overall poor performance and extremely high confusion scores suggest that the exposed vulnerability is more general and not solely rooted in the specific ambiguities of CLIP-style contrastive pretraining.
An interesting outlier is Gemini 1.5 Pro, which is the least confused model on the dataset (approximately 60% confusion); however, its accuracy is still close to random guessing. This may suggest that the model’s visual representations are rich enough to meaningfully distinguish between images, but that it lacks the medical knowledge or reasoning skills necessary to answer the questions correctly.
Failure Modes
The first step towards improving the reliability of MLLMs is to identify and categorize common cases where they tend to break down. We leverage an expert-in-the-loop pipeline to extract failure modes from MediConfusion via a combination of LLM prompting and radiologist supervision. As a result, we identify the following common patterns that have confused the models:
- Pattern 1: Normal/variant anatomy vs. pathology – Models often struggle with differentiating between normal/variant anatomy and pathological structures.
- Pattern 2: Lesion signal characteristics – Models fail to correctly identify regions of high signal intensity and their significance, particularly on T2-weighted sequences. This failure is of particular clinical significance when differentiating solid from cystic entities.
- Pattern 3: Vascular conditions – Identifying aneurysms and differentiating them from normal vascular structures or other abnormalities like vascular malformations seems to be challenging for MLLMs.
- Pattern 4: Medical devices – Models often fail to detect the presence of stents and have difficulty distinguishing between various types of stents. Identifying the presence or absence of guidewires in images of interventional procedures also tends to be challenging for MLLMs.
Most of the above shortcomings can be, to some degree, traced back to known, common failure modes of visual reasoning in MLLMs in the general domain.
Detecting the presence (or absence) of specific features: Correct reasoning over medical VQA problems relies heavily on detecting the presence (or absence) of particular features or objects relevant to the question. MLLMs are known to suffer from object hallucinations, and we see this weakness reflected directly in Patterns 3 and 4.
Understanding state and condition: In medical VQA, it is crucial for the model to understand the difference between “normal” and “abnormal” structures. MLLMs have difficulty identifying the state and condition of objects even in the general domain, such as whether the ground is wet or a flag is blowing in the wind. These challenges may be amplified in the more nuanced medical setting, which we observe especially in Patterns 1 and 3.
Positional and relational context: Answering medical VQA problems often necessitates a careful understanding of the spatial relationships of various anatomical features and their specific location. Recent research has uncovered serious limitations in the spatial reasoning capabilities of MLLMs, some even failing to distinguish left from right. This pervasive weakness in spatial reasoning may translate to failures in medical VQA seen in Pattern 1.
Color and appearance: Recent work has shown that MLLMs can confuse colors and their intensity (bright/dark), which may cause challenges in identifying signal characteristics in radiology images (high/low intensity), as reflected in Pattern 2.
Visual prompts in MediConfusion
Free-form visual prompts are intuitive annotations in the input image, such as a red bounding box or an arrow, aimed at highlighting a specific point or area within the image. We find that some images in MediConfusion include such visual prompts, typically in the form of arrows pointing at the abnormality, and in one specific case, the correct answer is written in the image along with the prompt. We observe that only the proprietary models, LLaVA v1.6, and BLIP-2 consistently provide correct answers for this particular image; none of the medical MLLMs do. We hypothesize that the success of the proprietary models and LLaVA v1.6 can be attributed to their OCR (optical character recognition) capabilities, which are missing from medical MLLMs. Imbuing medical MLLMs with the ability to interpret visual prompts is a promising direction for future research.
Conclusion
We firmly believe in the transformative potential of AI in medical applications. However, our findings underscore the urgent need for improved evaluation methods to ensure that these models are reliable enough for deployment in sensitive areas like healthcare. Our work uncovered serious flaws in the visual understanding capabilities of MLLMs, which must be addressed before these models can be safely integrated into critical medical AI solutions. Moreover, we identified common patterns of model failure, which we aim to use as a foundation for more trustworthy and reliable MLLMs in the medical domain. This preliminary release focuses on radiology, and we plan to expand it with additional data points and subspecialties in the future, so stay tuned!
For further details, read our paper and check out our GitHub repository. You can download our dataset from our Hugging Face repository.