
ImageCLEFmedical Caption

Welcome to the 7th edition of the Caption Task!



Interpreting and summarizing the insights gained from medical images such as radiology output is a time-consuming task that involves highly trained experts and often represents a bottleneck in clinical diagnosis pipelines.

Consequently, there is a considerable need for automatic methods that can approximate this mapping from visual information to condensed textual descriptions. The more image characteristics are known in advance, the more structured the interpretation of radiology scans becomes, and hence the more efficiently radiologists can work. We work on the basis of a large-scale collection of figures from open-access biomedical journal articles (PubMed Central). All images in the training data are accompanied by UMLS concepts extracted from the original image caption.

Lessons learned:

  • In the first and second editions of this task, held at ImageCLEF 2017 and ImageCLEF 2018, participants noted a broad variety of content and situations among the training images. In 2019, the training data was reduced to radiology images only, and ImageCLEF 2020 added imaging modality information to support pre-processing and multi-modal approaches.
  • The focus in ImageCLEF 2021 lay in using real radiology images annotated by medical doctors. This step aimed at increasing the medical context relevance of the UMLS concepts, but more images of such high quality are difficult to acquire.
  • As uncertainty regarding the use of additional sources was noted, we will clearly separate systems that use exclusively the official training data from those that incorporate additional sources of evidence.
  • For ImageCLEF 2022, an extended version of the ImageCLEF 2020 dataset was used. There were several issues with the dataset (large number of concepts, lemmatization errors, duplicate captions), which will be tackled for the 7th edition of the task, alongside an updated primary evaluation metric for the caption prediction subtask.


  • 12.10.2022: website goes live
  • 12.01.2023: registration opens

Task Description

For captioning, participants will be requested to develop solutions for automatically identifying the individual components from which captions are composed in Radiology Objects in COntext (ROCO) images.

ImageCLEFmedical Caption 2023 consists of two subtasks:

  • Concept Detection Task
  • Caption Prediction Task

Concept Detection Task

The first step to automatic image captioning and scene understanding is identifying the presence and location of relevant concepts in a large corpus of medical images. Based on the visual image content, this subtask provides the building blocks for the scene understanding step by identifying the individual components from which captions are composed. The concepts can be further applied for context-based image and information retrieval purposes.

Evaluation is conducted in terms of set coverage metrics such as precision, recall, and combinations thereof.

Caption Prediction Task

On the basis of the concept vocabulary detected in the first subtask as well as the visual information of their interaction in the image, participating systems are tasked with composing coherent captions for the entirety of an image. In this step, rather than the mere coverage of visual concepts, detecting the interplay of visible elements is crucial for strong performance.

This year, we will use BERTScore as the primary evaluation metric and ROUGE as the secondary evaluation metric for the caption prediction subtask. Other metrics such as METEOR, CIDEr, and BLEU will also be published.


The data for the caption task will contain curated images from the medical literature, including their captions and associated UMLS terms, which are manually validated as metadata. A more diverse data set will be made available to foster more complex approaches.

An updated and extended version of the Radiology Objects in COntext (ROCO) dataset [1] is used for both subtasks. As in previous editions, the dataset originates from biomedical articles of the PMC OpenAccess subset.

Training Set: Consists of TBA radiology images
Validation Set: Consists of TBA radiology images
Test Set: Consists of TBA radiology images

Concept Detection Task

The concepts were generated using a reduced subset of the UMLS 2020 AB release, which includes the sections (restriction levels) 0, 1, 2, and 9. To improve the feasibility of recognizing concepts from the images, concepts were filtered based on their semantic type. Concepts with very low frequency were also removed, based on suggestions from previous years.

Caption Prediction Task

For this task each caption is pre-processed in the following way:

  • removal of non-English captions
  • removal of corrupted captions
  • removal of links from the captions
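As an illustration of the link-removal step, the following sketch strips URLs from a caption and normalizes the remaining whitespace. Both the regular expression and the helper name are hypothetical — the official cleaning code has not been published:

```python
import re

# Matches http(s) URLs as well as bare "www." links (an assumption about
# what counts as a link; the official pattern may differ).
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def strip_links(caption: str) -> str:
    """Remove URLs from a caption and collapse the resulting whitespace."""
    without_links = URL_PATTERN.sub("", caption)
    return re.sub(r"\s+", " ", without_links).strip()
```

For example, `strip_links("See https://example.com/img.png for details")` yields `"See for details"`.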

Evaluation methodology

For assessing performance, established metrics will be used: set-coverage measures such as the F1 score for concept detection, and text-similarity measures such as BERTScore and ROUGE for caption prediction.

Concept Detection

Evaluation is conducted in terms of F1 scores between system predicted and ground truth concepts, using the following methodology and parameters:

  • The default implementation of the Python scikit-learn (v0.17.1-2) F1 scoring method is used.
  • A Python (3.x) script loads the candidate run file as well as the ground truth (GT) file and processes each pair of candidate and GT concept sets.
  • For each pair, the y_pred and y_true arrays are generated. They are binary arrays indicating, for each concept occurring in the union of the candidate and GT sets, whether it is present (1) or absent (0).
  • The F1 score is then calculated using the default 'binary' averaging method.
  • All F1 scores are summed and averaged over the number of elements in the test set (7,601), giving the final score.
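The steps above can be sketched as follows. The run and ground-truth data structures (dicts mapping image IDs to concept sets) are hypothetical, and details such as how concept lists are parsed may differ from the official script:

```python
from sklearn.metrics import f1_score

def image_f1(candidate_concepts: set, gt_concepts: set) -> float:
    """F1 for a single image: binarize over the union of both concept sets."""
    all_concepts = sorted(candidate_concepts | gt_concepts)
    y_true = [1 if c in gt_concepts else 0 for c in all_concepts]
    y_pred = [1 if c in candidate_concepts else 0 for c in all_concepts]
    return f1_score(y_true, y_pred, average="binary")

def mean_f1(run: dict, ground_truth: dict) -> float:
    """Average the per-image F1 scores over all images in the test set."""
    return sum(
        image_f1(set(run[image_id]), set(ground_truth[image_id]))
        for image_id in ground_truth
    ) / len(ground_truth)
```

For instance, predicting {C1, C2} against a ground truth of {C1} gives precision 0.5 and recall 1.0, hence an F1 of 2/3 for that image.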

The ground truth for the test set was generated based on the same reduced subset of the UMLS 2020 AB release which was used for the training data (see above for more details).

NOTE: The source code of the evaluation tool is available. To execute it, first change the variables ground_truth_path and submission_file_path in the evaluation script (making sure that both files do not have header rows!), then either run the setup in data/test_runs/docker/ on a Linux system with Docker installed, or run the script with python3 after installing the requirements in requirements.txt on Python 3.6+.

Caption Prediction

Details for the evaluation metrics will be made available at a later stage.

This year, BERTScore is used as the primary metric instead of BLEU, and ROUGE is used as a secondary metric. Other metrics like METEOR and CIDEr will be reported after the challenge concludes.

Details on the BERTScore implementation will follow at a later stage.

The ROUGE scores are calculated using the following methodology and parameters:

  • The native Python implementation of the ROUGE scoring method is used. It is designed to replicate the results of the original Perl package introduced in the article describing the ROUGE evaluation method.
  • Specifically, we calculate the ROUGE-1 (F-measure) score, which measures the number of matching unigrams between the model-generated text and a reference.
  • A Python (3.7) script loads the candidate run file as well as the ground truth (GT) file and processes each candidate-GT caption pair.
  • Each caption is pre-processed in the following way:
    • The caption is converted to lower case
    • Stopwords are removed using NLTK's "english" stopword list
    • Lemmatization is applied using spaCy's Lemmatizer
  • The ROUGE score is then calculated. Note that the caption is always considered as a single sentence, even if it actually contains several sentences.
  • All ROUGE scores are summed and averaged over the number of captions, giving the final score.

Participant registration

Please refer to the general ImageCLEF registration instructions

EUA template

Preliminary Schedule

  • 19 December 2022: Registration opens
  • 6 February 2023: Release of the training and validation sets
  • 14 March 2023: Release of the test sets
  • 22 April 2023: Registration closes
  • 10 May 2023: Run submission deadline
  • 17 May 2023: Release of the processed results by the task organizers
  • 5 June 2023: Submission of participant papers [CEUR-WS]
  • 23 June 2023: Notification of acceptance
  • 7 July 2023: Camera ready copy of participant papers and extended lab overviews [CEUR-WS]
  • 18-21 September 2023: CLEF 2023, Thessaloniki, Greece

Submission Instructions

To be added soon.


To be added soon.



Organizers

  • Johannes Rückert <johannes.rueckert(at)>, University of Applied Sciences and Arts Dortmund, Germany
  • Asma Ben Abacha  <abenabacha(at)>, Microsoft, USA
  • Alba García Seco de Herrera <alba.garcia(at)>, University of Essex, UK
  • Christoph M. Friedrich <christoph.friedrich(at)>, University of Applied Sciences and Arts Dortmund, Germany
  • Henning Müller <henning.mueller(at)>, University of Applied Sciences Western Switzerland, Sierre, Switzerland
  • Louise Bloch <louise.bloch(at)>, University of Applied Sciences and Arts Dortmund, Germany
  • Raphael Brüngel <raphael.bruengel(at)>, University of Applied Sciences and Arts Dortmund, Germany
  • Ahmad Idrissi-Yaghir <ahmad.idrissi-yaghir(at)>, University of Applied Sciences and Arts Dortmund, Germany
  • Henning Schäfer <henning.schaefer(at)>, University of Applied Sciences and Arts Dortmund, Germany


[1] O. Pelka, S. Koitka, J. Rückert, F. Nensa and C. M. Friedrich, "Radiology Objects in COntext (ROCO): A Multimodal Image Dataset", Proceedings of the MICCAI Workshop on Large-scale Annotation of Biomedical data and Expert Label Synthesis (MICCAI LABELS 2018), Granada, Spain, September 16, 2018, Lecture Notes in Computer Science (LNCS) Volume 11043, Pages 180-189, DOI: 10.1007/978-3-030-01364-6_20, Springer Verlag, 2018.