Motivation
Vision-Language Models (VLMs) have made remarkable progress in integrating visual and textual information, achieving strong performance on tasks such as image captioning, visual question answering (VQA), and multimodal dialogue. Despite these advances, their ability to perform structured reasoning and to draw inferences from complex visual-linguistic relationships remains limited: VLMs often struggle with questions that demand multi-step reasoning, abstract understanding, or hypothetical thinking grounded in visual evidence. Visual question answering and visual reasoning therefore provide a crucial benchmark for evaluating how effectively modern models can understand, interpret, and reason over multimodal inputs across diverse domains and languages. This year, we plan to expand the existing set of multiple-choice questions and to introduce a new task designed to challenge VLMs’ reasoning abilities even further.
News
This year we are using a new website for task information: https://mbzuai-nlp.github.io/ImageCLEF-MultimodalReasoning/2026/
The test data from ImageCLEF 2025, with labels now released, is available at https://huggingface.co/datasets/MBZUAI/EXAMS-V. This repository also contains the training and development data (a minimal loading sketch follows the news items below).
GitHub: https://github.com/mbzuai-nlp/ImageCLEF-2025-MultimodalReasoning contains information about the previous edition of the task. We will update the repository with new instructions and scripts.
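For participants who want to start experimenting right away, the data above can be pulled directly from the Hugging Face Hub with the datasets library. The snippet below is only a minimal sketch: the config and split names, and the per-example schema, are assumptions on our part, so please consult the dataset card on the Hub for the exact layout.

```python
# Minimal sketch for loading the EXAMS-V data from the Hugging Face Hub.
# Assumptions: the repo loads with a default config and exposes standard
# splits; if it defines multiple configs, pass one explicitly, e.g.
#   load_dataset("MBZUAI/EXAMS-V", "<config_name>")
from datasets import load_dataset

dataset = load_dataset("MBZUAI/EXAMS-V")

# Inspect which splits (e.g., train/validation/test) are available.
print(dataset)

# Peek at one example. The field names are whatever the dataset card
# specifies (image, question, options, answer, language, ...).
first_split = next(iter(dataset.values()))
print(first_split[0])
```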