Visual Question Answering in the Medical Domain

Welcome to the inaugural edition of the Medical Domain Visual Question Answering Task!


With increasing interest in artificial intelligence (AI) to support clinical decision making and improve patient engagement, opportunities to develop and leverage algorithms for automated medical image interpretation are being explored. Patients can now access structured and unstructured data related to their health via patient portals, which motivates the need to help them better understand their conditions from the data available to them, including medical images.

Clinicians' confidence in interpreting complex medical images can be significantly enhanced by a “second opinion” provided by an automated system. In addition, patients may be interested in the morphology/physiology and disease status of anatomical structures around a lesion that has been well characterized by their healthcare providers, and they may not be willing to pay significant amounts for a separate office or hospital visit just to address such questions. Although patients often turn to search engines (e.g. Google) to disambiguate complex terms or obtain answers to confusing aspects of a medical image, results from search engines may be nonspecific, erroneous, misleading, or overwhelming in volume.


  • 26.10.2017: Website goes live.
  • 06.03.2018: Training and validation sets released.
  • 20.03.2018: Test set released.

Task Description

Visual Question Answering is a new and exciting problem that combines natural language processing and computer vision techniques. Inspired by the recent success of visual question answering in the general domain, we propose a pilot task this year that focuses on visual question answering in the medical domain. Given a medical image accompanied by a clinically relevant question, participating systems are tasked with answering the question based on the visual content of the image.


The data will include a training set (~5K) and a validation set (0.5K) of medical images accompanied by question-answer pairs, and a test set (0.5K) of medical images with questions only. To create the datasets for the proposed task, we use medical domain images extracted from PubMed Central articles (essentially a subset of the ImageCLEF 2017 caption prediction task).

Evaluation Methodology

The following pre-processing methodology is applied before running the evaluation metrics on each answer:

  • Each answer is converted to lower-case
  • All punctuation is removed and the answer is tokenized into individual words
  • Stopwords are removed using NLTK's English stopword list
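
The three pre-processing steps above can be sketched in Python as follows. This is a minimal illustration, not the organizers' evaluation script: the function name `preprocess`, the whitespace tokenization, and the small inline stopword set are assumptions made so that the example runs without external data. The task itself uses NLTK's English stopword list.

```python
import string

# Stand-in stopword set for illustration only (an assumption). The task uses
# NLTK's English list, available as nltk.corpus.stopwords.words("english")
# once the NLTK "stopwords" corpus has been downloaded.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "is", "are", "and", "or", "to", "with"}

# Translation table that deletes every ASCII punctuation character.
_DELETE_PUNCTUATION = str.maketrans("", "", string.punctuation)


def preprocess(answer, stopwords=STOPWORDS):
    """Lower-case an answer, strip punctuation, tokenize, and drop stopwords."""
    lowered = answer.lower()
    without_punctuation = lowered.translate(_DELETE_PUNCTUATION)
    # Assumption: tokens are separated by whitespace.
    tokens = without_punctuation.split()
    return [token for token in tokens if token not in stopwords]
```

For example, `preprocess("The lesion is in the Left Lung.")` yields `["lesion", "left", "lung"]` with the stand-in stopword set. Deleting punctuation outright joins hyphenated words, so a script that replaces punctuation with spaces would tokenize such answers differently.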

Evaluation is conducted based on the following three metrics: BLEU, WBSS, and CBSS.

  1. BLEU
    We use the BLEU metric [1] to capture the similarity between a system-generated answer and the ground truth answer. The overall methodology and resources for the BLEU metric are essentially similar to the ImageCLEF 2017 caption prediction task.
  2. WBSS (Word-based Semantic Similarity)
    Following a recent algorithm for calculating semantic similarity in the biomedical domain [2], we create a metric based on Wu-Palmer Similarity (WUPS) [3] with the WordNet ontology as the backend. Please refer to [2] for details of the algorithm. The WUPS source code can be downloaded from here.
  3. CBSS (Concept-based Semantic Similarity)
    This metric is similar to WBSS as described above, except that instead of tokenizing the predicted and ground truth answers into words, we use MetaMap via the pymetamap wrapper to extract biomedical concepts from the answers, and build a dictionary using these concepts. Then, we build one-hot vector representations of the answers to calculate their semantic similarity using the cosine similarity measure.
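
The final step of CBSS, cosine similarity over one-hot concept vectors, can be sketched as below. This is only an illustration under stated assumptions: the concept extraction with MetaMap is not shown, the function names are hypothetical, and the dictionary is built from the two answers being compared (the description above does not say whether it is built per pair or over the whole dataset).

```python
import math


def one_hot(concepts, vocabulary):
    """Mark with 1 each vocabulary entry that occurs in `concepts`, else 0."""
    present = set(concepts)
    return [1 if concept in present else 0 for concept in vocabulary]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def concept_similarity(predicted_concepts, truth_concepts):
    """Compare two answers given as lists of already-extracted concepts."""
    # Assumption: the dictionary is the union of concepts in the two answers.
    vocabulary = sorted(set(predicted_concepts) | set(truth_concepts))
    return cosine(one_hot(predicted_concepts, vocabulary),
                  one_hot(truth_concepts, vocabulary))
```

With placeholder concept labels, `concept_similarity(["concept_a", "concept_b"], ["concept_b", "concept_c"])` gives 0.5, because one of the two concepts in each answer is shared. Using a larger dictionary would not change the score, since dimensions that are zero in both vectors contribute nothing to the cosine.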


[1] Papineni, K.; Roukos, S.; Ward, T.; Zhu, W. J. (2002). BLEU: a method for automatic evaluation of machine translation (PDF). ACL-2002: 40th Annual meeting of the Association for Computational Linguistics. pp. 311–318.
[2] Soğancıoğlu, G., Öztürk, H., & Özgür, A. (2017). BIOSSES: a semantic sentence similarity estimation system for the biomedical domain. Bioinformatics, 33(14), i49-i58.
[3] Wu, Z., & Palmer, M. (1994, June). Verbs semantics and lexical selection. In Proceedings of the 32nd annual meeting on Association for Computational Linguistics (pp. 133-138). Association for Computational Linguistics.

Preliminary Schedule

  • 08.11.2017: registration opens for all ImageCLEF tasks (open until 27.04.2018)
  • 06.03.2018: training and validation data release
  • 20.03.2018: test data release
  • 08.05.2018: deadline for submitting the participants' runs
  • 15.05.2018: release of the processed results by the task organizers
  • 31.05.2018: deadline for submission of working notes papers by the participants
  • 15.06.2018: notification of acceptance of the working notes papers
  • 29.06.2018: camera-ready working notes papers
  • 10-14.09.2018: CLEF 2018, Avignon, France

Participant Registration

Please refer to the general ImageCLEF registration instructions

Submission Instructions

  • Each team is allowed to submit a maximum of 5 runs.
  • We expect the following format for the result submission file: <QA-ID><TAB><Image-ID><TAB><Answer>

    For example:

    1 rjv03401 answer of the first question in one single line
    2 AIAN-14-313-g002 answer of the second question
    3 wjem-11-76f3 answer of the third question

  • You need to respect the following constraints:

    • The separator between <QA-ID>, <Image-ID> and <Answer> has to be a tab character.
    • Each <QA-ID> of the test set must be included in the run file exactly once.
    • You should not include special characters in the <Answer> field.
    • All 500 <QA-ID> and <Image-ID> pairs must be present in a participant’s run file in the same order as the VQAMed2018Test-QA.csv file.

  • Participants are allowed to use resources other than the official training/validation datasets; however, the use of any additional resources must be explicitly stated. For meaningful comparison, we will group separately the systems that use only the official training data and those that incorporate additional sources.
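
The run-file format and constraints listed above can be checked before submission with a sketch like the following. It is not an official validator: the function names are hypothetical, the expected (QA-ID, Image-ID) pairs are assumed to have been read from the test file already, and "special characters" is interpreted here as anything other than letters, digits, spaces, and hyphens, which the task page does not define.

```python
import re

# Assumption: an answer may contain only letters, digits, spaces, and hyphens.
ALLOWED_ANSWER = re.compile(r"^[A-Za-z0-9 \-]*$")


def format_run(rows):
    """Render (qa_id, image_id, answer) triples as tab-separated lines."""
    return "\n".join(f"{qa_id}\t{image_id}\t{answer}"
                     for qa_id, image_id, answer in rows)


def validate_run(text, expected_pairs):
    """Return a list of problems found in the text of a run file.

    expected_pairs: (qa_id, image_id) string pairs in test-set order. If the
    pairs are unique and every line matches in order, each QA-ID necessarily
    appears exactly once.
    """
    problems = []
    lines = text.splitlines()
    if len(lines) != len(expected_pairs):
        problems.append(f"expected {len(expected_pairs)} lines, found {len(lines)}")
    for number, (line, expected) in enumerate(zip(lines, expected_pairs), start=1):
        fields = line.split("\t")
        if len(fields) != 3:
            problems.append(f"line {number}: expected 3 tab-separated fields")
            continue
        qa_id, image_id, answer = fields
        if (qa_id, image_id) != expected:
            problems.append(f"line {number}: expected pair {expected}")
        if not ALLOWED_ANSWER.match(answer):
            problems.append(f"line {number}: answer has special characters")
    return problems
```

An empty list returned by `validate_run` means that none of the checked constraints were violated. The limit of 5 runs per team is outside the scope of this check.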


  • When referring to the ImageCLEF VQA-Med 2018 task general goals, evaluation, dataset, general results, etc. please cite the following publication which will be published by September 2018:
    • Sadid A. Hasan, Yuan Ling, Oladimeji Farri, Joey Liu, Matthew Lungren, and Henning Müller. Overview of ImageCLEF 2018 Medical Domain Visual Question Answering Task, CLEF working notes, CEUR, 2018.
    • BibTex:

        author = {Sadid A. Hasan and Yuan Ling and Oladimeji Farri and Joey Liu and Matthew Lungren and Henning M\"uller},
        title = {Overview of the {ImageCLEF} 2018 Medical Domain Visual Question Answering Task},
        booktitle = {CLEF2018 Working Notes},
        series = {{CEUR} Workshop Proceedings},
        year = {2018},
        volume = {},
        publisher = { $<$$>$},
        pages = {},
        month = {September 10-14},
        address = {Avignon, France},
  • When referring to the ImageCLEF 2018 task in general, please cite the following publication which will be published by September 2018:
    • Bogdan Ionescu, Henning Müller, Mauricio Villegas, Alba García Seco de Herrera, Carsten Eickhoff, Vincent Andrearczyk, Yashin Dicente Cid, Vitali Liauchuk, Vassili Kovalev, Sadid A. Hasan, Yuan Ling, Oladimeji Farri, Joey Liu, Matthew Lungren, Duc-Tien Dang-Nguyen, Luca Piras, Michael Riegler, Liting Zhou, Mathias Lux and Cathal Gurrin. Overview of ImageCLEF 2018: Challenges, Datasets and Evaluation, Proceedings of the Ninth International Conference of the CLEF Association (CLEF 2018), 2018.
    • BibTex:

        author = {Bogdan Ionescu and Henning M\"uller and Mauricio Villegas
        and Alba Garc\'ia Seco de Herrera and Carsten Eickhoff and Vincent
        Andrearczyk and Yashin Dicente Cid and Vitali Liauchuk and Vassili
        Kovalev and Sadid A. Hasan and Yuan Ling and Oladimeji Farri and Joey
        Liu and Matthew Lungren and Duc-Tien Dang-Nguyen and Luca Piras and
        Michael Riegler and Liting Zhou and Mathias Lux and Cathal Gurrin},
        title = {{Overview of ImageCLEF 2018}: Challenges, Datasets and Evaluation},
        booktitle = {Experimental IR Meets Multilinguality, Multimodality, and
        series = {Proceedings of the Ninth International Conference of the
        CLEF Association (CLEF 2018)},
        year = {2018},
        volume = {},
        publisher = {{LNCS} Lecture Notes in Computer Science, Springer},
        pages = {},
        month = {September 10-14},
        address = {Avignon, France},

Organizers

    • Sadid Hasan <sadid.hasan(at)>, Philips Research Cambridge, USA
    • Yuan Ling <yuan.ling(at)>, Philips Research Cambridge, USA
    • Oladimeji Farri <dimeji.farri(at)>, Philips Research Cambridge, USA
    • Joey Liu <joey.liu(at)>, Philips Research Cambridge, USA
    • Henning Müller <henning.mueller(at)>, University of Applied Sciences Western Switzerland, Sierre, Switzerland
    • Matthew Lungren <mlungren(at)>, Stanford University Medical Center, USA

    Join our mailing list: