Welcome to the 9th edition of the Caption Task!
Motivation
Interpreting and summarizing the insights gained from medical images such as
radiology output is a time-consuming task that involves highly trained experts
and often represents a bottleneck in clinical diagnosis pipelines.
Consequently, there is a considerable need for automatic methods that can
approximate this mapping from visual information to condensed textual
descriptions. The more image characteristics are known, the more structured
the interpretation of radiology scans becomes and, hence, the more efficiently
radiologists can work. We work on the basis of a large-scale collection of
figures from open access biomedical journal articles (PubMed Central). All
images in the training data are accompanied by UMLS concepts extracted from
the original image caption.
Lessons learned:
-
In the first and second editions of this task, held at ImageCLEF 2017 and
ImageCLEF 2018, participants noted a broad variety of content and situations
among the training images. In 2019, the training data was reduced solely to
radiology images, and ImageCLEF 2020 added imaging modality information to
support pre-processing and multi-modal approaches.
-
The focus in ImageCLEF 2021 was on using real radiology images annotated by
medical doctors. This step aimed at increasing the medical context relevance
of the UMLS concepts, but more images of such high quality are difficult to
acquire.
-
As uncertainty regarding additional data sources was noted, we will clearly
separate systems that use exclusively the official training data from those
that incorporate additional sources of evidence.
-
For ImageCLEF 2022, an extended version of the ImageCLEF 2020 dataset was
used. For the caption prediction subtask, a number of additional
evaluation metrics were introduced with the goal of replacing the primary
evaluation metric in future iterations of the task.
-
For ImageCLEF 2023, several issues with the dataset (large number of
concepts, lemmatization errors, duplicate captions) were tackled, and, based
on experiments from the previous year, BERTScore was used as the primary
evaluation metric for the caption prediction subtask.
News
- 24.10.2024: website goes live
- 15.12.2024: registration opens
- 01.03.2025: development dataset released
- 15.04.2025: test dataset released
- 20.05.2025: run submission phase ended
- 22.05.2025: results published
Preliminary Schedule
- 24.10.2024: website goes live
- 15.12.2024: registration opens
- 01.03.2025: development dataset released
- 15.04.2025: test dataset released
- 20.05.2025: run submission phase ended
- 22.05.2025: results published
- 30.05.2025: submission of participant papers [CEUR-WS]
- 21.06.2025: notification of acceptance
Task Description
For captioning, participants will be requested to develop solutions for
automatically identifying individual components from which captions are composed
in Radiology Objects in COntext version 2 (ROCOv2) [2] images. ImageCLEFmedical
Caption 2025 consists of two subtasks:
- Concept Detection Task
- Caption Prediction Task
Concept Detection Task
The first step to automatic image captioning and scene understanding is
identifying the presence and location of relevant concepts in a large corpus
of medical images. Based on the visual image content, this subtask provides
the building blocks for the scene understanding step by identifying the
individual components from which captions are composed. The concepts can be
further applied for context-based image and information retrieval purposes.
Evaluation is conducted in terms of set coverage metrics such as precision,
recall, and combinations thereof.
Caption Prediction Task
On the basis of the concept vocabulary detected in the first subtask as well
as the visual information of their interaction in the image, participating
systems are tasked with composing coherent captions for the entirety of an
image. In this step, rather than the mere coverage of visual concepts,
detecting the interplay of visible elements is crucial for strong performance.
This year, we will use BERTScore as the primary evaluation metric and ROUGE as
the secondary evaluation metric for the caption prediction subtask. Other
metrics such as MedBERTScore, MedBLEURT, and BLEU will also be published.
Explainability Task
In addition, we ask participants to provide explanations for the captions of a
small subset (will be released with the test dataset) of images. We encourage
people to be creative. There are no technical limitations to this task. The
explanations will be manually evaluated by a radiologist for interpretability,
relevance, and creativity. Examples of what such an explanation might look
like are provided below:
Data
The data for the caption task will contain curated images from the medical
literature including their captions and associated UMLS terms that are
manually controlled as metadata. A more diverse data set will be made
available to foster more complex approaches.
For questions regarding the dataset please use the challenge website forum or
contact hendrik.damm@fh-dortmund.de.
For the development dataset, Radiology Objects in COntext Version 2 (ROCOv2)
[2], an updated and extended version of the Radiology Objects in COntext
(ROCO) dataset [1], is used for both subtasks. As in previous editions, the
dataset originates from biomedical articles of the PMC OpenAccess subset, with
the test set comprising a previously unseen set of images.
Training Set: 80,091 radiology images
Validation Set: 17,277 radiology images
Test Set: 19,267 radiology images
Concept Detection Task
The concepts were generated using a reduced subset of the UMLS 2022 AB release.
To improve the feasibility of recognizing concepts from the images, concepts
were filtered based on their semantic type. Concepts with low frequency were
also removed, based on suggestions from previous years.
Caption Prediction Task
For this task, each caption is pre-processed in the following way:
- removal of links from the captions
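This step could be sketched as follows. The exact URL pattern used by the organizers is not published, so the regex below is an assumption:

```python
import re

def remove_links(caption: str) -> str:
    """Strip URLs from a caption (hypothetical pattern; the official
    pre-processing may differ)."""
    return re.sub(r"https?://\S+|www\.\S+", "", caption).strip()
```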
Evaluation methodology
The source code of the evaluation script is available on Github
(https://github.com/taubsity/clef-caption-evaluation).
For questions regarding the evaluation scripts please use the challenge website
forum or contact tabea.pakull@uk-essen.de.
Concept Detection
Evaluation is conducted in terms of F1 scores between system
predicted and ground truth concepts, using the following methodology and
parameters:
-
The default implementation of the Python scikit-learn (v0.17.1-2) F1 scoring
method (sklearn.metrics.f1_score) is used; it is documented in the
scikit-learn documentation.
-
A Python (3.x) script loads the candidate run file, as well as the ground
truth (GT) file, and processes each candidate-GT concept set pair.
-
For each candidate-GT concept set, the y_pred and
y_true arrays are generated. They are binary arrays
indicating, for each concept occurring in the candidate or the GT set, whether
it is present (1) or not (0).
-
The F1 score is then calculated. The default 'binary' averaging method is
used.
-
All F1 scores are summed and averaged over the number of elements in the
test set, giving the final score.
-
The primary score considers any concept. The secondary score filters both
predicted and GT concepts to the set of manually annotated concepts before
repeating the same F1 scoring steps.
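The scoring steps above can be sketched as a simplified re-implementation. This is not the official script (the scikit-learn-based code on GitHub is authoritative), but it follows the same per-image binary F1 averaging:

```python
def image_f1(candidate: set, ground_truth: set) -> float:
    """Binary F1 for one image: build y_true/y_pred over the union of
    concepts, then apply the standard precision/recall formula."""
    concepts = sorted(candidate | ground_truth)
    y_true = [1 if c in ground_truth else 0 for c in concepts]
    y_pred = [1 if c in candidate else 0 for c in concepts]
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / sum(y_pred)
    recall = tp / sum(y_true)
    return 2 * precision * recall / (precision + recall)

def mean_f1(runs: dict, gt: dict) -> float:
    """Average the per-image F1 scores over all images in the test set."""
    return sum(image_f1(runs[i], gt[i]) for i in gt) / len(gt)
```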
The ground truth for the test set was generated based on the same reduced
subset of the UMLS 2022 AB release which was used for the training data (see
above for more details).
Caption Prediction
This year, the ranking of participants is based on an average score over all
metrics used. In total, 6 metrics are computed, covering two aspects:
relevance and factuality.
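Based on the published results (e.g., the UMUTeam row in the results table below), the overall score can be reproduced as the mean of the relevance average and the factuality average:

```python
def overall_score(relevance_metrics, factuality_metrics):
    """Overall = mean of the relevance average and the factuality average."""
    rel = sum(relevance_metrics) / len(relevance_metrics)
    fact = sum(factuality_metrics) / len(factuality_metrics)
    return (rel + fact) / 2

# Example values taken from the UMUTeam row of the results table:
# Similarity, BERTScore (Recall), ROUGE-1, BLEURT / UMLS Concept F1, AlignScore
score = overall_score([0.9271, 0.5977, 0.2594, 0.3230], [0.1816, 0.1375])
```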
Relevance
In order to evaluate the relevance aspect of generated captions the following
metrics are used:
- Image and Caption Similarity
-
BERT-Score (Recall) with inverse document frequency (idf) scores computed
from the test corpus for importance weighting
-
Recall-Oriented Understudy for Gisting Evaluation (ROUGE) for overlap of
unigrams (ROUGE-1) (F-measure)
-
Bilingual Evaluation Understudy with Representations from Transformers
(BLEURT)
Image and Caption Similarity is computed using the following
methodology:
A medical imaging embedding model is used to compute embeddings of both the
caption and the image, and the similarity of these embeddings is calculated.
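The organizers do not name the embedding model or similarity measure here; the sketch below assumes a generic image/text encoder pair and cosine similarity:

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm

# embed_image(...) and embed_text(...) are placeholders for the
# (unspecified) medical imaging embedding model:
# similarity = cosine_similarity(embed_image(img), embed_text(caption))
```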
Note: For the following relevance metrics (BERT-Score,
ROUGE and BLEURT), each caption is pre-processed in the same way:
- The caption is converted to lower-case.
- Numbers are replaced with the token 'number'.
- Punctuation is removed.
Note that each caption is always considered as a single sentence, even if
it actually contains several sentences.
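These pre-processing steps can be sketched as follows (a simplified version; the exact number pattern and punctuation handling in the official script may differ):

```python
import re
import string

def preprocess_caption(caption: str) -> str:
    """Lower-case, replace numbers with the token 'number',
    remove punctuation."""
    caption = caption.lower()
    caption = re.sub(r"\d+", "number", caption)
    caption = caption.translate(str.maketrans("", "", string.punctuation))
    return caption
```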
BERTScore is calculated using the following methodology and
parameters:
The
native Python implementation of BERTScore
is used. This scoring method is based on the paper
"BERTScore: Evaluating Text Generation with BERT"
and aims to measure the quality of generated text by comparing it to a
reference. We use Recall BERTScore with inverse document frequency (idf)
scores computed from the test corpus for importance weighting as this setting
correlates the most with human ratings for the image captioning task reported
in the BERTScore paper.
To calculate BERTScore, we use the
microsoft/deberta-xlarge-mnli
model, which can be found on the Hugging Face Model Hub. The model is
pretrained on a large corpus of text and fine-tuned for natural language
inference tasks. It can be used to compute contextualized word embeddings,
which are essential for BERTScore calculation.
To compute the final BERTScore, we first calculate the individual score
(Recall idf) for each caption. The BERTScore is then averaged across all
captions to give the final score.
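Conceptually, idf-weighted recall BERTScore matches each reference token to its most similar candidate token in embedding space. The toy sketch below uses hand-made vectors; the real metric uses contextual embeddings from the model named above:

```python
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def bertscore_recall(ref_embs, cand_embs, idf_weights):
    """Each reference token greedily matches its most similar candidate
    token; matches are combined with idf importance weights."""
    scores = [max(cosine(r, c) for c in cand_embs) for r in ref_embs]
    total = sum(idf_weights)
    return sum(w * s for w, s in zip(idf_weights, scores)) / total
```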
The ROUGE score is calculated using the following methodology
and parameters:
The native Python implementation of the ROUGE scoring method is used. It is
designed to replicate results from the original Perl package that was
introduced in the paper
"ROUGE: A Package for Automatic Evaluation of Summaries".
Specifically, we calculate the ROUGE-1 (F-measure) score, which measures the
number of matching unigrams between the model-generated text and a reference.
The final score is the average ROUGE-1 over all captions.
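A minimal sketch of ROUGE-1 F-measure on already pre-processed captions (the official package adds options such as stemming that are omitted here):

```python
from collections import Counter

def rouge1_f(candidate: str, reference: str) -> float:
    """ROUGE-1 F-measure: clipped unigram overlap between candidate
    and reference."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```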
For calculation of BLEURT the following methodology and
parameters are used:
The
native Python implementation of BLEURT
is used. This scoring method is based on the paper
"BLEURT: Learning Robust Metrics for Text Generation". The aim of BLEURT is to provide an evaluation metric for text generation by
learning from human judgments using BERT-based representations. In this
evaluation, the recommended BLEURT-20 checkpoint is employed.
Factuality
In order to evaluate the factuality aspect of generated captions the following
metrics are used:
- Unified Medical Language System (UMLS) Concept F1
- AlignScore
We calculate the UMLS F1 using the following methodology:
We use MedCAT to extract the medical entities (UMLS concepts) from the
reference caption and the predicted caption. We only match entities with the
semantic types that are
used to calculate MEDCON as described in the paper
"Aci-bench: a Novel Ambient Clinical Intelligence Dataset for
Benchmarking Automatic Visit Note Generation".
For calculating MEDCON the following methodology and
parameters are used:
The
native Python implementation of MEDCON
is used. As described in the paper
"Aci-bench: a Novel Ambient Clinical Intelligence Dataset for
Benchmarking Automatic Visit Note Generation"
it evaluates the clinical accuracy and consistency of the Unified Medical
Language System (UMLS) concept sets in generated and reference texts.
For detecting the UMLS concepts in both texts the QuickUMLS package is used
and then the F1 score is calculated.
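Once the UMLS concepts have been extracted from both texts (by MedCAT or QuickUMLS), the score itself is a set-level F1; a minimal sketch, with the concept extraction step not shown:

```python
def concept_f1(pred_concepts: set, ref_concepts: set) -> float:
    """F1 between the UMLS concept sets (CUIs) of the predicted and
    reference captions."""
    tp = len(pred_concepts & ref_concepts)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_concepts)
    recall = tp / len(ref_concepts)
    return 2 * precision * recall / (precision + recall)
```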
The AlignScore is calculated using the following methodology
and parameters:
The
native Python implementation of the AlignScore
is used. It implements the metric based on RoBERTa introduced in the paper
"AlignScore: Evaluating Factual Consistency with A Unified Alignment
Function". The checkpoints are
available on the Huggingface Model Hub. The AlignScore is designed to evaluate
factual consistency in text generation by assessing the alignment of
information between two pieces of text. The model calculates a score by
splitting long contexts into manageable chunks and matching each claim
sentence with the most supportive context chunk. The final AlignScore is the
average alignment score across all claim sentences.
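The chunk-and-match procedure can be sketched as follows, with `align(claim, chunk)` standing in as a placeholder for the RoBERTa-based alignment model:

```python
def alignscore(claim_sentences, context_chunks, align):
    """For each claim sentence, take the best alignment score over all
    context chunks; the final score is the average over claim sentences.
    `align` is a placeholder for the RoBERTa alignment function."""
    best = [max(align(claim, chunk) for chunk in context_chunks)
            for claim in claim_sentences]
    return sum(best) / len(best)
```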
Participant registration
Please refer to the general ImageCLEF registration instructions.
Results
The tables below contain only the best runs of each owner on ai4mediabench; for
a complete list of all runs, please see the Google Sheets files for Concept
Detection and for Caption Prediction.
Concept Detection
| ID   | Owner           | Submission Name                         | F1     | F1 secondary |
|------|-----------------|-----------------------------------------|--------|--------------|
| 1980 | AUEB NLP Group  | ensemble_dual_thr_3_5monte_eff_eff_.zip | 0.5888 | 0.9484       |
| 1725 | DeepLens        | submission                              | 0.5766 | 0.9299       |
| 1505 | mapan           | submission                              | 0.5660 | 0.9298       |
| 1892 | UIT-Oggy        | submission                              | 0.5613 | 0.9104       |
| 1508 | DS4DH           | submission.csv                          | 0.5225 | 0.8672       |
| 1774 | sakthiii        | submission                              | 0.4003 | 0.9082       |
| 1903 | JJ-VMed         | submission                              | 0.3982 | 0.8329       |
| 1807 | UMUTeam         | submission_with_unknown_clean           | 0.2398 | 0.5377       |
| 1942 | LekshmiscopeVIT | submission.csv                          | 0.1494 | 0.2298       |
Caption Prediction
| ID | Owner | Submission Name | Overall | Similarity | BERTScore (Recall) | ROUGE-1 | BLEURT | Relevance Average | UMLS Concept F1 | AlignScore | Factuality Average |
|----|-------|-----------------|---------|------------|--------------------|---------|--------|-------------------|-----------------|------------|--------------------|
| 1681 | UMUTeam | submission.zip | 0.3432 | 0.9271 | 0.5977 | 0.2594 | 0.3230 | 0.5268 | 0.1816 | 0.1375 | 0.1596 |
| 1520 | DS4DH | submission.csv.zip | 0.3362 | 0.9016 | 0.6067 | 0.2516 | 0.3096 | 0.5174 | 0.1682 | 0.1417 | 0.1549 |
| 1900 | AI Stat Lab | submission.zip | 0.3229 | 0.8919 | 0.5823 | 0.2440 | 0.3173 | 0.5089 | 0.1524 | 0.1213 | 0.1369 |
| 1914 | UIT-Oggy | submission_ep2_cleaned.zip | 0.3211 | 0.8798 | 0.5951 | 0.2535 | 0.3020 | 0.5076 | 0.1672 | 0.1021 | 0.1346 |
| 1403 | AUEB NLP Group | 2-instruct-blip-ft.zip | 0.3068 | 0.7947 | 0.5884 | 0.2176 | 0.3030 | 0.4759 | 0.1429 | 0.1325 | 0.1377 |
| 1896 | JJ-VMed | submission.zip | 0.3043 | 0.8251 | 0.5953 | 0.2389 | 0.3094 | 0.4922 | 0.1366 | 0.0964 | 0.1165 |
| 1890 | sakthiii | submission | 0.2746 | 0.7957 | 0.5553 | 0.1607 | 0.2806 | 0.4481 | 0.1094 | 0.0928 | 0.1011 |
| 1815 | csmorgan | Qwen_2B_Submission_1.zip | 0.2315 | 0.5704 | 0.5180 | 0.1598 | 0.2385 | 0.3717 | 0.0741 | 0.1087 | 0.0914 |
Explainability Task - Human Evaluation Results
| Team | Caption readability | Clinical appropriateness of caption | Caption level of detail | Caption focus | Mean caption rating | Visual-text coherence | Completeness of visualization | Visualization focus | Mean visualization rating | Appropriateness of Methodology | Overall |
|------|---------------------|-------------------------------------|-------------------------|---------------|---------------------|-----------------------|-------------------------------|---------------------|---------------------------|--------------------------------|---------|
| AUEB NLP Group | 4.5 | 2.7 | 2.6 | 3.3 | 3.3 | 3.1 | 2.8 | 2.6 | 2.8 | 4.0 | 3.2 |
| JJ-VMed | 3.4 | 2.4 | 2.8 | 4.1 | 3.2 | 1.9 | 1.9 | 1.9 | 1.9 | 2.0 | 2.6 |
* All categories were rated by a radiologist using a 5-point Likert scale,
with 5 indicating the best score.
CEUR Working Notes
The working-notes paper is your opportunity to describe your approach, present
all submitted runs and discuss the results. All participating teams with at
least one graded submission, regardless of the score, should submit a CEUR
working notes paper. Teams that participated in both tasks should generally
submit only one report.
Make sure the EasyChair metadata (author names and order, title, affiliations)
exactly match the PDF, as these fields feed directly into the proceedings.
-
Camera-ready + signed copyright form: due 7 July 2025 (23:59 CEST)
Dataset image attribution
If you include dataset images in your paper, you must provide the correct
attribution. Use the lookup file below to find the attribution string for each
image ID:
https://fh-dortmund.sciebo.de/s/KT9AMjPtoq3pxTz
Insert the attribution in the figure caption or directly beside the image.
Working notes papers should cite the ImageCLEF 2025 overview paper, the
ImageCLEFmedical task overview paper, and the ROCOv2 dataset paper; citation
information is available in the Citations section below.
Reproducibility
We encourage you to make your work as reproducible as possible by releasing
code, trained models, and detailed instructions on a public repository (e.g.
GitHub) and pointing to it in your paper.
ArXiv references
Please refrain from citing preprints (e.g., arXiv) without checking whether
they have been published in the meantime. Full publication details prove the
value of the cited work and give proper attribution to the authors; an arXiv
reference is only a preprint with no peer review. You can try
https://preprintresolver.eu/ to give proper attribution to authors and improve
the quality of your work. If no proper reference/DOI is found, you can use the
arXiv reference.
Citations
It is mandatory to cite the overview papers and the dataset.
When referring to ImageCLEFmedical 2025 Caption general
goals, general results, etc. please cite the following publication:
-
Hendrik Damm, Tabea M. G. Pakull, Helmut Becker, Benjamin Bracke, Bahadir
Eryilmaz, Louise Bloch, Raphael Brüngel, Cynthia S. Schmidt, Johannes
Rückert, Obioma Pelka, Henning Schäfer, Ahmad Idrissi-Yaghir, Asma Ben
Abacha, Alba García Seco de Herrera, Henning Müller and Christoph M.
Friedrich. Overview of ImageCLEFmedical 2025 -- Medical Concept Detection
and Interpretable Caption Generation. CLEF 2025 Working Notes. CEUR Workshop
Proceedings (CEUR-WS.org), Madrid, Spain, September 9-12, 2025.
-
BibTex:
@inproceedings{ImageCLEFmedicalCaptionOverview2025,
author = {Hendrik Damm and Tabea M. G. Pakull and Helmut Becker and Benjamin Bracke and Bahadir Eryilmaz and Louise Bloch and Raphael Br{\"u}ngel and Cynthia S. Schmidt and Johannes R{\"u}ckert and Obioma Pelka and Henning Sch{\"a}fer and Ahmad Idrissi{-}Yaghir and Asma Ben Abacha and Alba Garc{\'{\i}}a Seco de Herrera and Henning M{\"u}ller and Christoph M. Friedrich},
title = {Overview of {ImageCLEFmedical} 2025 -- Medical Concept Detection and Interpretable Caption Generation},
booktitle = {CLEF 2025 Working Notes},
series = {CEUR Workshop Proceedings},
publisher = {CEUR-WS.org},
address = {Madrid, Spain},
month = {September 9--12},
year = {2025}
}
When referring to ImageCLEF 2025, please cite the following
publication (to be updated):
-
BibTex:
@inproceedings{OverviewImageCLEF2025,
author = {Ionescu, Bogdan and M{\"u}ller, Henning and Stanciu, Dan-Cristian and Andrei, Alexandra-Georgiana and Radzhabov, Ahmedkhan and Prokopchuk, Yuri and {\c{S}}tefan, Liviu-Daniel and Constantin, Mihai-Gabriel and Dogariu, Mihai and Kovalev, Vassili and Damm, Hendrik and R{\"u}ckert, Johannes and Ben Abacha, Asma and Garc\'ia Seco de Herrera, Alba and Friedrich, Christoph M. and Bloch, Louise and Br{\"u}ngel, Raphael and Idrissi-Yaghir, Ahmad and Sch{\"a}fer, Henning and Schmidt, Cynthia Sabrina and Pakull, Tabea M. G. and Bracke, Benjamin and Pelka, Obioma and Eryilmaz, Bahadir and Becker, Helmut and Yim, Wen-Wai and Codella, Noel and Novoa, Roberto Andres and Malvehy, Josep and Dimitrov, Dimitar and Das, Rocktim Jyoti and Xie, Zhuohan and Shan, Hee Ming and Nakov, Preslav and Koychev, Ivan and Hicks, Steven A. and Gautam, Sushant and Riegler, Michael A. and Thambawita, Vajira and Halvorsen, P\r{a}l and Fabre, Diandra and Macaire, C\'ecile and Lecouteux, Benjamin and Schwab, Didier and Potthast, Martin and Heinrich, Maximilian and Kiesel, Johannes and Wolter, Moritz and Stein, Benno},
title = {Overview of {ImageCLEF} 2025: Multimedia Retrieval in Medical, Social Media and Content Recommendation Applications},
booktitle = {Experimental IR Meets Multilinguality, Multimodality, and Interaction},
series = {Proceedings of the 16th International Conference of the CLEF Association (CLEF 2025)},
year = {2025},
publisher = {Springer Lecture Notes in Computer Science LNCS},
pages = {},
month = {September 9-12},
address = {Madrid, Spain}
}
When describing the data, note that an extended version of ROCOv2 was used and
cite:
-
Johannes Rückert, Louise Bloch, Raphael Brüngel, Ahmad Idrissi-Yaghir,
Henning Schäfer, Cynthia S. Schmidt, Sven Koitka, Obioma Pelka, Asma Ben
Abacha, Alba García Seco de Herrera, Henning Müller, Peter Horn, Felix Nensa
and Christoph M. Friedrich (2024). ROCOv2: Radiology Objects in Context
Version 2, an Updated Multimodal Image Dataset. Scientific Data, 11(1).
doi:10.1038/s41597-024-03496-6
-
BibTex:
@article{2405.10004v2,
title = {{ROCOv2}: Radiology Objects in COntext Version 2, an Updated Multimodal Image Dataset},
author = {Johannes R{\"u}ckert and Louise Bloch and Raphael Br{\"u}ngel and Ahmad Idrissi{-}Yaghir and Henning Sch{\"a}fer and Cynthia S. Schmidt and Sven Koitka and Obioma Pelka and Asma Ben Abacha and Alba Garc{\'{\i}}a Seco de Herrera and Henning M{\"u}ller and Peter Horn and Felix Nensa and Christoph M. Friedrich},
journal = {Scientific Data},
volume = {11},
number = {1},
year = {2024},
doi = {10.1038/s41597-024-03496-6}
}
Contact
Organizers:
-
Hendrik Damm <hendrik.damm(at)fh-dortmund.de>, University of Applied
Sciences and Arts Dortmund, Germany
-
Tabea M. G. Pakull, <tabea.pakull(at)uk-essen.de>, Institute for Transfusion Medicine, University Hospital
Essen, Germany
-
Johannes Rückert, University of Applied Sciences and Arts Dortmund, Germany
-
Asma Ben Abacha
<abenabacha(at)microsoft.com>, Microsoft, USA
-
Alba García Seco de Herrera <alba.garcia(at)essex.ac.uk>, University of Essex, UK
-
Christoph M. Friedrich <christoph.friedrich(at)fh-dortmund.de>, University of Applied
Sciences and Arts Dortmund, Germany
-
Henning Müller <henning.mueller(at)hevs.ch>, University of Applied Sciences
Western Switzerland, Sierre, Switzerland
-
Louise Bloch <louise.bloch(at)fh-dortmund.de>, University of
Applied Sciences and Arts Dortmund, Germany
-
Raphael Brüngel <raphael.bruengel(at)fh-dortmund.de>, University
of Applied Sciences and Arts Dortmund, Germany
-
Ahmad Idrissi-Yaghir <ahmad.idrissi-yaghir(a)fh-dortmund.de>,
University of Applied Sciences and Arts Dortmund, Germany
-
Henning Schäfer <henning.schaefer(at)uk-essen.de>, Institute for
Transfusion Medicine, University Hospital Essen, Germany
-
Cynthia S. Schmidt, Institute for Artificial Intelligence in Medicine
(IKIM), University Hospital Essen
-
Benjamin Bracke, University of Applied Sciences and Arts Dortmund, Germany
-
Obioma Pelka, Institute for Artificial Intelligence in Medicine, Germany
-
Bahadir Eryilmaz, Institute for Artificial Intelligence in Medicine,
Germany
-
Helmut Becker, Institute for Artificial Intelligence in Medicine, Germany
References
[1] Rückert, J., Bloch, L., Brüngel, R., Idrissi-Yaghir, A., Schäfer,
H., Schmidt, C. S., Koitka, S., Pelka, O., Abacha, A. B., de Herrera, A. G.
S., Müller, H., Horn, P. A., Nensa, F., & Friedrich, C. M. (2024). ROCOv2:
Radiology objects in COntext version 2, an updated multimodal image dataset.
https://doi.org/10.48550/ARXIV.2405.10004