ImageCLEF Wikipedia Image Retrieval Datasets

1. Introduction

The Wikipedia image retrieval task is an ad-hoc image retrieval task. Its overall goal is to investigate how well multi-modal image retrieval approaches, which combine textual and visual evidence to satisfy a user’s multimedia information need, cope with larger-scale image collections whose items are highly heterogeneous in both their textual descriptions and their visual content. The aim is to simulate image retrieval in a realistic setting such as the Web, where the available images cover highly diverse subjects and have highly varied visual properties, while their accompanying textual metadata (if any) are user-generated, noisy, unstructured, and of varying quality and length.

The Wikipedia image retrieval task ran as part of ImageCLEF for four years: 2008-2011.

2. Datasets

Two collections of Wikipedia images were used during the four years of the task: the Wikipedia INEX Multimedia Collection, consisting of 151,519 images, in 2008 and 2009, and the Wikipedia Retrieval 2010 Collection, consisting of 237,434 images, in 2010 and 2011. Topics were developed each year to reflect diverse multimedia information needs: 75 in 2008, 45 in 2009, 70 in 2010, and 50 in 2011. The ground truth for these topics assumes binary relevance (relevant vs. non-relevant). Only pooled images were assessed, where each year's pools were built from the images retrieved by the runs that the participants submitted; a pool depth of 100 was used in 2008, 2010, and 2011, and a pool depth of 50 in 2009.
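
The pooling procedure described above can be illustrated with a minimal sketch. It assumes each run is simply a ranked list of image identifiers for one topic; the run names, image identifiers, and the `build_pool` helper are hypothetical and are not part of the official ImageCLEF tooling.

```python
from typing import Dict, List, Set


def build_pool(runs: Dict[str, List[str]], depth: int) -> Set[str]:
    """Return the union of the top-`depth` image ids of every run for one topic."""
    pool: Set[str] = set()
    for ranked_ids in runs.values():
        pool.update(ranked_ids[:depth])
    return pool


# Hypothetical toy runs for a single topic (best-ranked image first).
runs = {
    "run_a": ["img3", "img1", "img7", "img2"],
    "run_b": ["img1", "img9", "img3", "img5"],
}

# The task used a depth of 100 (2008, 2010, 2011) or 50 (2009);
# a depth of 2 keeps the toy example readable.
pool = build_pool(runs, depth=2)

# Only pooled images are judged, each with a binary label.
# The set of relevant images below is made up for illustration.
judged_relevant = {"img1"}
qrels = {image_id: int(image_id in judged_relevant) for image_id in sorted(pool)}

print(sorted(pool))
print(qrels)
```

Images outside the pool (here `img7`, `img2`, and `img5`) receive no judgment at all, which is why evaluations on pooled ground truth usually treat unjudged images as non-relevant.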

3. How to acquire the datasets

To obtain access to the ImageCLEF Wikipedia Image Retrieval datasets, please follow these steps:

  1. Register with the dataset management system (funded by CHORUS+).
  2. Select the dataset you would like to access. Currently, only the ImageCLEF Wikipedia Image Retrieval 2010-2011 dataset is available.
  3. The access details for downloading the data will then appear in the detailed view of the dataset in the system. Use these access details to download the datasets listed below.
  4. By downloading the ImageCLEF Wikipedia Image Retrieval 2010-2011 dataset, you (the END USER) agree to this.

For any other inquiries, send an email to Theodora Tsikrika, University of Applied Sciences Western Switzerland (HES-SO), Sierre, Switzerland.

4. Downloading the datasets

  • Test collection 2008-2009 (Not available yet. To be provided soon.)
    • Wikipedia INEX Multimedia Collection: 151,519 images + user-generated textual annotations
    • 2008: 75 topics + ground truth
    • 2009: 45 topics + ground truth
  • Test collection 2010-2011
    • Wikipedia Retrieval 2010 Collection: This collection consists of 237,434 images, their associated user-generated textual annotations (i.e., the images' textual descriptions extracted from the Wikimedia Commons files and the images' captions in the Wikipedia article(s) that contain them), and the Wikipedia articles containing the images.

      The collection was built to cover similar topics in English, German and French, with the following language distribution for their associated textual annotations:
       - English only: 70,127
       - German only: 50,291
       - French only: 28,461
       - English and German: 26,880
       - English and French: 20,747
       - German and French: 9,646
       - English, German and French: 22,899
       - Language undetermined: 8,144
       - No textual annotation: 239

      An example that illustrates an image in the collection and its associated textual annotations is provided below:

      • A README file describing the provided data can be downloaded: HERE.
      • The ImageCLEF 2010 Wikipedia image collection (237,434 .jpeg and .png images - 21GB) can be downloaded in small batches: HERE.
      • A .zip file containing the user-generated textual annotations (metadata) of the images in the collection, the Wikipedia articles containing these images, and an id.txt file listing all image identifiers can be downloaded HERE.
      • Low-level visual features:
    • Topics and ground truth: The topics are descriptions of multimedia information needs that contain textual and visual hints. There were 70 topics in 2010 and 50 topics in 2011.
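
As a quick consistency check on the language distribution listed above for the Wikipedia Retrieval 2010 Collection, the following sketch sums the nine categories and derives how many images carry an annotation in each language. The counts are copied from the list; the language codes and variable names are illustrative choices only.

```python
# Counts copied from the language distribution of the annotations.
counts = {
    frozenset({"en"}): 70_127,
    frozenset({"de"}): 50_291,
    frozenset({"fr"}): 28_461,
    frozenset({"en", "de"}): 26_880,
    frozenset({"en", "fr"}): 20_747,
    frozenset({"de", "fr"}): 9_646,
    frozenset({"en", "de", "fr"}): 22_899,
}
undetermined = 8_144   # language could not be determined
no_annotation = 239    # images without any textual annotation

# The categories are mutually exclusive, so they should add up
# to the size of the whole collection.
total = sum(counts.values()) + undetermined + no_annotation

# Images with an annotation in a given language, alone or combined.
per_language = {
    lang: sum(n for langs, n in counts.items() if lang in langs)
    for lang in ("en", "de", "fr")
}

print(total)         # 237434
print(per_language)  # {'en': 140653, 'de': 109716, 'fr': 81753}
```

The total matches the 237,434 images stated for the collection, so the nine categories partition it completely.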

5. Related Publications