ImageCLEFmed Caption

Welcome to the 9th edition of the Caption Task!

Description

Motivation

Interpreting and summarizing the insights gained from medical images such as radiology output is a time-consuming task that involves highly trained experts and often represents a bottleneck in clinical diagnosis pipelines.

Consequently, there is a considerable need for automatic methods that can approximate this mapping from visual information to condensed textual descriptions. The more image characteristics are known, the more structured the interpretation of radiology scans becomes, and hence the more efficiently radiologists can work. We work on the basis of a large-scale collection of figures from open-access biomedical journal articles (PubMed Central). All images in the training data are accompanied by UMLS concepts extracted from the original image caption.

Lessons learned:

  • In the first and second editions of this task, held at ImageCLEF 2017 and ImageCLEF 2018, participants noted a broad variety of content and situations among the training images. In 2019, the training data was reduced to radiology images only, and ImageCLEF 2020 added imaging modality information to support pre-processing and multi-modal approaches.
  • The focus in ImageCLEF 2021 lay in using real radiology images annotated by medical doctors. This step aimed at increasing the medical context relevance of the UMLS concepts, but more images of such high quality are difficult to acquire.
  • As uncertainty regarding the use of additional data sources was noted, we will clearly separate systems that use exclusively the official training data from those that incorporate additional sources of evidence.
  • For ImageCLEF 2022, an extended version of the ImageCLEF 2020 dataset was used. For the caption prediction subtask, a number of additional evaluation metrics were introduced with the goal of replacing the primary evaluation metric in future iterations of the task.
  • For ImageCLEF 2023, several issues with the dataset (large number of concepts, lemmatization errors, duplicate captions) were tackled and based on experiments in the previous year, BERTScore was used as the primary evaluation metric for the caption prediction subtask.

Preliminary Schedule

  • 24.10.2024: website goes live
  • 15.12.2024: registration opens
  • 01.03.2025: development dataset released
  • 15.04.2025: test dataset released
  • 20.05.2025: run submission phase ended
  • 22.05.2025: results published
  • 30.05.2025: submission of participant papers [CEUR-WS]
  • 21.06.2025: notification of acceptance

Task Description

For captioning, participants will be requested to develop solutions for automatically identifying individual components from which captions are composed in Radiology Objects in COntext version 2 (ROCOv2) [2] images. ImageCLEFmedical Caption 2025 consists of two subtasks:
  • Concept Detection Task
  • Caption Prediction Task

Concept Detection Task

The first step to automatic image captioning and scene understanding is identifying the presence and location of relevant concepts in a large corpus of medical images. Based on the visual image content, this subtask provides the building blocks for the scene understanding step by identifying the individual components from which captions are composed. The concepts can be further applied for context-based image and information retrieval purposes.

Evaluation is conducted in terms of set coverage metrics such as precision, recall, and combinations thereof.

Caption Prediction Task

On the basis of the concept vocabulary detected in the first subtask as well as the visual information of their interaction in the image, participating systems are tasked with composing coherent captions for the entirety of an image. In this step, rather than the mere coverage of visual concepts, detecting the interplay of visible elements is crucial for strong performance.

This year, we will use BERTScore as the primary evaluation metric and ROUGE as the secondary evaluation metric for the caption prediction subtask. Other metrics such as MedBERTScore, MedBLEURT, and BLEU will also be published.

Explainability Task

In addition, we ask participants to provide explanations for the captions of a small subset of images (released together with the test dataset). We encourage participants to be creative; there are no technical limitations to this task. The explanations will be manually evaluated by a radiologist for interpretability, relevance, and creativity. An example of what such an explanation might look like is provided below:

[Figure: Example of an explainability approach]

Data

The data for the caption task will contain curated images from the medical literature, including their captions and associated UMLS concepts, which have been manually validated, as metadata. A more diverse dataset will be made available to foster more complex approaches.

For questions regarding the dataset please use the challenge website forum or contact hendrik.damm@fh-dortmund.de.

For the development dataset, Radiology Objects in COntext Version 2 (ROCOv2) [2], an updated and extended version of the Radiology Objects in COntext (ROCO) dataset [1], is used for both subtasks. As in previous editions, the dataset originates from biomedical articles of the PMC OpenAccess subset, with the test set comprising a previously unseen set of images.

  • Training set: 80,091 radiology images
  • Validation set: 17,277 radiology images
  • Test set: 19,267 radiology images

Concept Detection Task

The concepts were generated using a reduced subset of the UMLS 2022 AB release. To improve the feasibility of recognizing concepts from the images, concepts were filtered based on their semantic type. Concepts with low frequency were also removed, based on suggestions from previous years.

Caption Prediction Task

For this task, each caption is pre-processed in the following way:
  • removal of links from the captions

Evaluation methodology

The source code of the evaluation script is available on GitHub (https://github.com/taubsity/clef-caption-evaluation).

For questions regarding the evaluation scripts please use the challenge website forum or contact tabea.pakull@uk-essen.de.

Concept Detection

Evaluation is conducted in terms of F1 scores between system predicted and ground truth concepts, using the following methodology and parameters:

  • The default implementation of the F1 scoring method in the Python scikit-learn library (v0.17.1-2) is used; see the scikit-learn documentation for details.
  • A Python (3.x) script loads the candidate run file as well as the ground truth (GT) file and processes each candidate-GT concept set.
  • For each candidate-GT concept set, the y_pred and y_true arrays are generated. These are binary arrays indicating, for each concept contained in either the candidate or the GT set, whether it is present (1) or not (0).
  • The F1 score is then calculated. The default 'binary' averaging method is used.
  • All F1 scores are summed and averaged over the number of elements in the test set, giving the final score.
  • The primary score considers any concept. The secondary score filters both predicted and GT concepts to the set of manually annotated concepts before repeating the same F1 scoring steps.

The ground truth for the test set was generated based on the same reduced subset of the UMLS 2022 AB release which was used for the training data (see above for more details).
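The scoring procedure above can be sketched in a few lines. This is a dependency-free illustration that mirrors scikit-learn's 'binary'-averaged F1; run-file parsing is omitted and the image IDs and CUIs are placeholders, so the official script on GitHub remains authoritative:

```python
# Sketch of the concept-detection scoring described above.
# binary_f1 mirrors sklearn.metrics.f1_score(y_true, y_pred, average='binary').

def binary_f1(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def evaluate(candidates, ground_truth):
    """candidates / ground_truth: dicts mapping image ID -> set of UMLS CUIs."""
    scores = []
    for image_id, gt_concepts in ground_truth.items():
        pred_concepts = candidates.get(image_id, set())
        # Binary indicator arrays over all concepts in either set
        vocab = sorted(pred_concepts | gt_concepts)
        y_true = [1 if c in gt_concepts else 0 for c in vocab]
        y_pred = [1 if c in pred_concepts else 0 for c in vocab]
        scores.append(binary_f1(y_true, y_pred))
    # Final score: F1 summed and averaged over the test set
    return sum(scores) / len(scores)
```

For the secondary score, both prediction and GT sets would first be filtered to the manually annotated concepts before running the same loop.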

Caption Prediction

This year, the ranking of participants is based on an average score over all metrics used. In total, six metrics are computed, covering two aspects: relevance and factuality.
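Judging from the published results table, the overall ranking score is the mean of the relevance average (four metrics) and the factuality average (two metrics). A sketch of that aggregation (inferred from the published scores, not taken from the official script):

```python
def overall_score(relevance_scores, factuality_scores):
    """Aspect averages and overall ranking score: the overall score is the
    mean of the relevance average and the factuality average, as implied by
    the published results table."""
    relevance_avg = sum(relevance_scores) / len(relevance_scores)
    factuality_avg = sum(factuality_scores) / len(factuality_scores)
    return (relevance_avg + factuality_avg) / 2
```

For example, the top caption-prediction run (Similarity 0.9271, BERTScore 0.5977, ROUGE-1 0.2594, BLEURT 0.3230; UMLS F1 0.1816, AlignScore 0.1375) yields an overall score of 0.3432, matching the table.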

Relevance

In order to evaluate the relevance aspect of generated captions the following metrics are used:

  • Image and Caption Similarity
  • BERT-Score (Recall) with inverse document frequency (idf) scores computed from the test corpus for importance weighting
  • Recall-Oriented Understudy for Gisting Evaluation (ROUGE) for overlap of unigrams (ROUGE-1) (F-measure)
  • Bilingual Evaluation Understudy with Representations from Transformers (BLEURT)

Image and Caption Similarity is computed using the following methodology: a medical imaging embedding model is used to calculate embeddings of both the caption and the image, and the similarity of these embeddings is then calculated.
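As a minimal sketch of such an embedding-similarity computation (cosine similarity is an assumption here; the exact embedding model and similarity function are not specified in this description):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors, e.g. an image
    embedding and a caption embedding produced by a medical imaging
    embedding model (the vectors below are plain lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```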

Note: For the following relevance metrics (BERT-Score, ROUGE and BLEURT), each caption is pre-processed in the same way:

  • The caption is converted to lower case.
  • Numbers are replaced with the token 'number'.
  • Punctuation is removed.

Note that each caption is always treated as a single sentence, even if it actually contains several sentences.
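The three pre-processing steps can be sketched as follows (the regex details are an assumption, not the official script):

```python
import re
import string

def preprocess_caption(caption: str) -> str:
    """Pre-processing applied to captions before BERTScore, ROUGE, and BLEURT."""
    caption = caption.lower()                    # 1. convert to lower case
    caption = re.sub(r"\d+", "number", caption)  # 2. replace numbers with 'number'
    caption = caption.translate(                 # 3. remove punctuation
        str.maketrans("", "", string.punctuation))
    return caption
```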

BERTScore is calculated using the following methodology and parameters:

The native Python implementation of BERTScore is used. This scoring method is based on the paper "BERTScore: Evaluating Text Generation with BERT" and aims to measure the quality of generated text by comparing it to a reference. We use the recall BERTScore with inverse document frequency (idf) scores computed from the test corpus for importance weighting, as this setting showed the highest correlation with human ratings for the image captioning task, as reported in the BERTScore paper.

To calculate BERTScore, we use the microsoft/deberta-xlarge-mnli model, which can be found on the Hugging Face Model Hub. The model is pretrained on a large corpus of text and fine-tuned for natural language inference tasks. It can be used to compute contextualized word embeddings, which are essential for BERTScore calculation.

To compute the final BERTScore, we first calculate the individual score (Recall idf) for each caption. The BERTScore is then averaged across all captions to give the final score.

The ROUGE score is calculated using the following methodology and parameters:

The native Python implementation of the ROUGE scoring method is used. It is designed to replicate results from the original Perl package introduced in the paper "ROUGE: A Package for Automatic Evaluation of Summaries".

Specifically, we calculate the ROUGE-1 (F-measure) score, which measures the number of matching unigrams between the model-generated text and a reference. The final score is the average ROUGE-1 over all captions.
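A simplified, dependency-free sketch of ROUGE-1 (F-measure) over pre-processed captions (the official evaluation uses the native Python rouge-score package; the function name here is illustrative):

```python
from collections import Counter

def rouge1_f(reference: str, candidate: str) -> float:
    """ROUGE-1 F-measure: clipped unigram overlap between candidate and reference."""
    ref_counts = Counter(reference.split())
    cand_counts = Counter(candidate.split())
    overlap = sum((ref_counts & cand_counts).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)
```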

For calculation of BLEURT the following methodology and parameters are used:

The native Python implementation of BLEURT is used. This scoring method is based on the paper "BLEURT: Learning Robust Metrics for Text Generation". The aim of BLEURT is to provide an evaluation metric for text generation by learning from human judgments using BERT-based representations. In this evaluation, the recommended BLEURT-20 checkpoint is employed.

Factuality

In order to evaluate the factuality aspect of generated captions the following metrics are used:

  • Unified Medical Language System (UMLS) Concept F1
  • AlignScore

We calculate the UMLS F1 using the following methodology:

We use MedCAT to extract the medical entities (UMLS concepts) from the reference caption and the predicted caption. We only match entities with the semantic types used to calculate MEDCON, as described in the paper "Aci-bench: a Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation".

For calculating MEDCON the following methodology and parameters are used:

The native Python implementation of MEDCON is used. As described in the paper "Aci-bench: a Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation" it evaluates the clinical accuracy and consistency of the Unified Medical Language System (UMLS) concept sets in generated and reference texts.

For detecting the UMLS concepts in both texts, the QuickUMLS package is used, and the F1 score is then calculated.
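Once concepts have been extracted from both texts, the set-based F1 might be computed as follows (a sketch; concept extraction itself, e.g. via MedCAT or QuickUMLS, is outside this snippet, and the function name is hypothetical):

```python
def umls_concept_f1(reference_concepts: set, candidate_concepts: set) -> float:
    """Set-based F1 over UMLS concepts extracted from the reference caption
    and the generated caption."""
    tp = len(reference_concepts & candidate_concepts)  # concepts found in both
    if tp == 0:
        return 0.0
    precision = tp / len(candidate_concepts)
    recall = tp / len(reference_concepts)
    return 2 * precision * recall / (precision + recall)
```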

The AlignScore is calculated using the following methodology and parameters:

The native Python implementation of the AlignScore is used. It implements the metric based on RoBERTa introduced in the paper "AlignScore: Evaluating Factual Consistency with A Unified Alignment Function". The checkpoints are available on the Huggingface Model Hub. The AlignScore is designed to evaluate factual consistency in text generation by assessing the alignment of information between two pieces of text. The model calculates a score by splitting long contexts into manageable chunks and matching each claim sentence with the most supportive context chunk. The final AlignScore is the average alignment score across all claim sentences.

Participant registration

Please refer to the general ImageCLEF registration instructions.

Results

The tables below contain only the best run of each team on ai4mediabench; for a complete list of all runs, please see the Google Sheets files for Concept Detection and for Caption Prediction.

Concept Detection

ID   | Owner           | Submission Name                         | F1     | F1 (secondary)
1980 | AUEB NLP Group  | ensemble_dual_thr_3_5monte_eff_eff_.zip | 0.5888 | 0.9484
1725 | DeepLens        | submission                              | 0.5766 | 0.9299
1505 | mapan           | submission                              | 0.5660 | 0.9298
1892 | UIT-Oggy        | submission                              | 0.5613 | 0.9104
1508 | DS4DH           | submission.csv                          | 0.5225 | 0.8672
1774 | sakthiii        | submission                              | 0.4003 | 0.9082
1903 | JJ-VMed         | submission                              | 0.3982 | 0.8329
1807 | UMUTeam         | submission_with_unknown_clean           | 0.2398 | 0.5377
1942 | LekshmiscopeVIT | submission.csv                          | 0.1494 | 0.2298

Caption Prediction

ID   | Owner          | Submission Name            | Overall | Similarity | BERTScore (Recall) | ROUGE-1 | BLEURT | Relevance Avg. | UMLS Concept F1 | AlignScore | Factuality Avg.
1681 | UMUTeam        | submission.zip             | 0.3432  | 0.9271     | 0.5977             | 0.2594  | 0.3230 | 0.5268         | 0.1816          | 0.1375     | 0.1596
1520 | DS4DH          | submission.csv.zip         | 0.3362  | 0.9016     | 0.6067             | 0.2516  | 0.3096 | 0.5174         | 0.1682          | 0.1417     | 0.1549
1900 | AI Stat Lab    | submission.zip             | 0.3229  | 0.8919     | 0.5823             | 0.2440  | 0.3173 | 0.5089         | 0.1524          | 0.1213     | 0.1369
1914 | UIT-Oggy       | submission_ep2_cleaned.zip | 0.3211  | 0.8798     | 0.5951             | 0.2535  | 0.3020 | 0.5076         | 0.1672          | 0.1021     | 0.1346
1403 | AUEB NLP Group | 2-instruct-blip-ft.zip     | 0.3068  | 0.7947     | 0.5884             | 0.2176  | 0.3030 | 0.4759         | 0.1429          | 0.1325     | 0.1377
1896 | JJ-VMed        | submission.zip             | 0.3043  | 0.8251     | 0.5953             | 0.2389  | 0.3094 | 0.4922         | 0.1366          | 0.0964     | 0.1165
1890 | sakthiii       | submission                 | 0.2746  | 0.7957     | 0.5553             | 0.1607  | 0.2806 | 0.4481         | 0.1094          | 0.0928     | 0.1011
1815 | csmorgan       | Qwen_2B_Submission_1.zip   | 0.2315  | 0.5704     | 0.5180             | 0.1598  | 0.2385 | 0.3717         | 0.0741          | 0.1087     | 0.0914

Explainability Task - Human Evaluation Results

Team           | Caption readability | Clinical appropriateness of caption | Caption level of detail | Caption focus | Mean caption rating | Visual-text coherence | Completeness of visualization | Visualization focus | Mean visualization rating | Appropriateness of methodology | Overall
AUEB NLP Group | 4.5                 | 2.7                                 | 2.6                     | 3.3           | 3.3                 | 3.1                   | 2.8                           | 2.6                 | 2.8                       | 4.0                            | 3.2
JJ-VMed        | 3.4                 | 2.4                                 | 2.8                     | 4.1           | 3.2                 | 1.9                   | 1.9                           | 1.9                 | 1.9                       | 2.0                            | 2.6

* All categories were rated by a radiologist using a 5-point Likert scale, with 5 indicating the best score.

CEUR Working Notes

The working-notes paper is your opportunity to describe your approach, present all submitted runs, and discuss the results. All participating teams with at least one graded submission, regardless of the score, should submit a CEUR working-notes paper. Teams that participated in both subtasks should generally submit only one report.

Make sure the EasyChair metadata (author names and order, title, affiliations) exactly match the PDF, as these fields feed directly into the proceedings.

  • Camera-ready + signed copyright form: due 7 July 2025 (23:59 CEST)

Dataset image attribution

If you include dataset images in your paper, you must provide the correct attribution. Use the lookup file below to find the attribution string for each image ID:

https://fh-dortmund.sciebo.de/s/KT9AMjPtoq3pxTz

Insert the attribution in the figure caption or directly beside the image.

Working-notes papers should cite the ImageCLEF 2025 overview paper, the ImageCLEFmedical task overview paper, and the ROCOv2 dataset paper; citation information is available in the Citations section below.

Reproducibility

We encourage you to make your work as reproducible as possible by releasing code, trained models, and detailed instructions on a public repository (e.g. GitHub) and pointing to it in your paper.

ArXiv references

Please refrain from citing preprints (e.g., arXiv) without checking whether they have been published in the meantime. Publication details prove the value of the cited work and give proper attribution to the authors; an arXiv reference is only a preprint that has not undergone peer review. You can try https://preprintresolver.eu/ to find the published version and improve the quality of your work. If no proper reference/DOI is found, you can use the arXiv reference.

Citations

It is mandatory to cite the overview papers and the dataset.

When referring to ImageCLEFmedical 2025 Caption general goals, general results, etc. please cite the following publication:

  • Hendrik Damm, Tabea M. G. Pakull, Helmut Becker, Benjamin Bracke, Bahadir Eryilmaz, Louise Bloch, Raphael Brüngel, Cynthia S. Schmidt, Johannes Rückert, Obioma Pelka, Henning Schäfer, Ahmad Idrissi-Yaghir, Asma Ben Abacha, Alba García Seco de Herrera, Henning Müller and Christoph M. Friedrich. Overview of ImageCLEFmedical 2025 -- Medical Concept Detection and Interpretable Caption Generation. CLEF 2025 Working Notes. CEUR Workshop Proceedings (CEUR-WS.org), Madrid, Spain, September 9-12, 2025.
  • BibTex:
    @inproceedings{ImageCLEFmedicalCaptionOverview2025,
author = {Hendrik Damm and Tabea M. G. Pakull and Helmut Becker and Benjamin Bracke and Bahadir Eryilmaz and Louise Bloch and Raphael Br{\"u}ngel and Cynthia S. Schmidt and Johannes R{\"u}ckert and Obioma Pelka and Henning Sch{\"a}fer and Ahmad Idrissi{-}Yaghir and Asma Ben Abacha and Alba Garc{\'{\i}}a Seco de Herrera and Henning M{\"u}ller and Christoph M. Friedrich},
    title = {Overview of {ImageCLEFmedical} 2025 -- Medical Concept Detection and Interpretable Caption Generation},
    booktitle = {CLEF 2025 Working Notes},
    series = {CEUR Workshop Proceedings},
    publisher = {CEUR-WS.org},
    address = {Madrid, Spain},
    month = {September 9--12},
    year = {2025}
    }
    

When referring to ImageCLEF 2025, please cite the following publication (to be updated):

  • BibTex:
    @inproceedings{OverviewImageCLEF2025,
 author    = {Ionescu, Bogdan and M{\"u}ller, Henning and Stanciu, Dan-Cristian and Andrei, Alexandra-Georgiana and Radzhabov, Ahmedkhan and Prokopchuk, Yuri and {\c{S}}tefan, Liviu-Daniel and Constantin, Mihai-Gabriel and Dogariu, Mihai and Kovalev, Vassili and Damm, Hendrik and R{\"u}ckert, Johannes and Ben Abacha, Asma and Garc\'ia Seco de Herrera, Alba and Friedrich, Christoph M. and Bloch, Louise and Br{\"u}ngel, Raphael and Idrissi-Yaghir, Ahmad and Sch{\"a}fer, Henning and Schmidt, Cynthia Sabrina and Pakull, Tabea M. G. and Bracke, Benjamin and Pelka, Obioma and Eryilmaz, Bahadir and Becker, Helmut and Yim, Wen-Wai and Codella, Noel and Novoa, Roberto Andres and Malvehy, Josep and Dimitrov, Dimitar and Das, Rocktim Jyoti and Xie, Zhuohan and Shan, Hee Ming and Nakov, Preslav and Koychev, Ivan and Hicks, Steven A. and Gautam, Sushant and Riegler, Michael A. and Thambawita, Vajira and Halvorsen, P\r{a}l and Fabre, Diandra and Macaire, C\'ecile and Lecouteux, Benjamin and Schwab, Didier and Potthast, Martin and Heinrich, Maximilian and Kiesel, Johannes and Wolter, Moritz and Stein, Benno},
    title = {Overview of {ImageCLEF} 2025: Multimedia Retrieval in Medical, Social Media and Content Recommendation Applications},
    booktitle = {Experimental IR Meets Multilinguality, Multimodality, and Interaction},
    series = {Proceedings of the 16th International Conference of the CLEF Association (CLEF 2025)},
    year = {2025},
    publisher = {Springer Lecture Notes in Computer Science LNCS},
    pages = {},
    month = {September 9-12},
    address = {Madrid, Spain}
    }
    

When describing the data, note that an extended version of ROCOv2 was used and cite:

  • Johannes Rückert, Louise Bloch, Raphael Brüngel, Ahmad Idrissi-Yaghir, Henning Schäfer, Cynthia S. Schmidt, Sven Koitka, Obioma Pelka, Asma Ben Abacha, Alba García Seco de Herrera, Henning Müller, Peter Horn, Felix Nensa and Christoph M. Friedrich (2024). ROCOv2: Radiology Objects in Context Version 2, an Updated Multimodal Image Dataset. Scientific Data, 11(1). doi:10.1038/s41597-024-03496-6
  • BibTex:
    @article{2405.10004v2,
    title = {{ROCOv2}: Radiology Objects in COntext Version 2, an Updated Multimodal Image Dataset},
author = {Johannes R{\"u}ckert and Louise Bloch and Raphael Br{\"u}ngel and Ahmad Idrissi{-}Yaghir and Henning Sch{\"a}fer and Cynthia S. Schmidt and Sven Koitka and Obioma Pelka and Asma Ben Abacha and Alba Garc{\'{\i}}a Seco de Herrera and Henning M{\"u}ller and Peter Horn and Felix Nensa and Christoph M. Friedrich},
    journal = {Scientific Data},
    volume = {11},
    number = {1},
    year = {2024},
    doi = {10.1038/s41597-024-03496-6}
    }
    

Contact

Organizers:
  • Hendrik Damm <hendrik.damm(at)fh-dortmund.de>, University of Applied Sciences and Arts Dortmund, Germany
  • Tabea M. G. Pakull, <tabea.pakull(at)uk-essen.de>, Institute for Transfusion Medicine, University Hospital Essen, Germany
  • Johannes Rückert, University of Applied Sciences and Arts Dortmund, Germany
  • Asma Ben Abacha  <abenabacha(at)microsoft.com>, Microsoft, USA
  • Alba García Seco de Herrera <alba.garcia(at)essex.ac.uk>, University of Essex, UK
  • Christoph M. Friedrich <christoph.friedrich(at)fh-dortmund.de>, University of Applied Sciences and Arts Dortmund, Germany
  • Henning Müller <henning.mueller(at)hevs.ch>, University of Applied Sciences Western Switzerland, Sierre, Switzerland
  • Louise Bloch <louise.bloch(at)fh-dortmund.de>, University of Applied Sciences and Arts Dortmund, Germany
  • Raphael Brüngel <raphael.bruengel(at)fh-dortmund.de>, University of Applied Sciences and Arts Dortmund, Germany
  • Ahmad Idrissi-Yaghir <ahmad.idrissi-yaghir(a)fh-dortmund.de>, University of Applied Sciences and Arts Dortmund, Germany
  • Henning Schäfer <henning.schaefer(at)uk-essen.de>, Institute for Transfusion Medicine, University Hospital Essen, Germany
  • Cynthia S. Schmidt, Institute for Artificial Intelligence in Medicine (IKIM), University Hospital Essen
  • Benjamin Bracke, University of Applied Sciences and Arts Dortmund, Germany
  • Obioma Pelka, Institute for Artificial Intelligence in Medicine, Germany
  • Bahadir Eryilmaz, Institute for Artificial Intelligence in Medicine, Germany
  • Helmut Becker, Institute for Artificial Intelligence in Medicine, Germany

References

[1] Pelka, O., Koitka, S., Rückert, J., Nensa, F., & Friedrich, C. M. (2018). Radiology Objects in COntext (ROCO): A multimodal image dataset. In Intravascular Imaging and Computer Assisted Stenting and Large-Scale Annotation of Biomedical Data and Expert Label Synthesis (pp. 180-189). Springer. https://doi.org/10.1007/978-3-030-01364-6_20
[2] Rückert, J., Bloch, L., Brüngel, R., Idrissi-Yaghir, A., Schäfer, H., Schmidt, C. S., Koitka, S., Pelka, O., Abacha, A. B., de Herrera, A. G. S., Müller, H., Horn, P. A., Nensa, F., & Friedrich, C. M. (2024). ROCOv2: Radiology objects in COntext version 2, an updated multimodal image dataset. Scientific Data, 11(1). https://doi.org/10.1038/s41597-024-03496-6