Revision of Wikipedia retrieval task 2011 from Thu, 02/03/2011 - 19:03

Registration is available at the main ImageCLEF 2011 page.

ImageCLEF's Wikipedia Retrieval task provides a testbed for the system-oriented evaluation of visual information retrieval from a collection of Wikipedia images and articles. The aim is to investigate retrieval approaches in the context of a large and heterogeneous collection of images and their noisy text annotations (similar to those encountered on the Web) that are searched by users with diverse information needs. This diversity is simulated through the variety of topics covered by the queries and through the different query types, which are expected to be better solved by textual, visual, or multimodal retrieval, respectively.

In 2011, the task uses the ImageCLEF 2010 Wikipedia Collection (Popescu et al., 2010), which contains 237,434 Wikipedia images that cover diverse topics of interest. These images are associated with unstructured and noisy textual annotations in English, French, and German.

(Popescu et al., 2010) A. Popescu, T. Tsikrika and J. Kludas. Overview of the Wikipedia Retrieval Task at ImageCLEF 2010. In CLEF (Notebook Papers/LABs/Workshops), 2010.

  • ad-hoc image retrieval task: Given a textual, multilingual query and sample images describing a user's (multimedia) information need, find as many relevant images as possible from the Wikipedia image collection. To strengthen the visual modality, up to 5 example images will be given.

    Any method can be used to retrieve relevant documents. We encourage the use of both text-based and content-based retrieval methods and, in particular, multi-modal and multi-lingual approaches that investigate the combination of evidence from different modalities and language resources.

  • late information fusion task (NEW!): Participants in this subtask have access to all text-based, content-based and multimodal runs submitted by the participants of the ad-hoc task. The goal is to explore late fusion and rank aggregation approaches.
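
As an illustration of what rank aggregation over submitted runs can look like, the sketch below applies reciprocal rank fusion (RRF), one well-known technique; the task itself does not prescribe any particular method. The function name, the toy runs, the image identifiers and the constant k=60 are all assumptions made for this example and are not taken from the task data.

```python
from collections import defaultdict


def reciprocal_rank_fusion(runs, k=60):
    """Fuse several ranked lists of image ids into a single ranking.

    runs: list of rankings, each an ordered list of image ids (best first).
    k:    smoothing constant (60 is a commonly used value; an assumption here).
    """
    scores = defaultdict(float)
    for ranking in runs:
        for rank, image_id in enumerate(ranking, start=1):
            # Each run contributes more for images it ranks near the top.
            scores[image_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by image id for determinism.
    return sorted(scores, key=lambda image_id: (-scores[image_id], image_id))


# Hypothetical toy runs for one topic: one text-based, one content-based.
text_run = ["img3", "img1", "img2"]
visual_run = ["img1", "img4", "img3"]

fused = reciprocal_rank_fusion([text_run, visual_run])
print(fused)
```

RRF uses only rank positions, so it can combine runs whose scores are not comparable, which is often the case when mixing text-based and content-based systems. Score-based late fusion would additionally require normalising the scores of each run.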

ImageCLEF 2010 Wikipedia Collection
The ImageCLEF 2010 Wikipedia collection consists of 237,434 images and associated user-supplied annotations. The collection was built to cover similar topics in English, German and French. Topical similarity was obtained by selecting only Wikipedia articles that have versions in all three languages and are illustrated with at least one image in each version: 44,664 such articles were extracted from the September 2009 Wikipedia dumps, containing a total of 265,987 images. Since the collection is intended to be freely distributed, we decided to remove all images with unclear copyright status. After this operation, duplicate elimination and some additional cleaning up, the number of images remaining in the collection is 237,434, with the following language distribution:

  • English only: 70,127
  • German only: 50,291
  • French only: 28,461
  • English and German: 26,880
  • English and French: 20,747
  • German and French: 9,646
  • English, German and French: 22,899
  • Language undetermined: 8,144
  • No textual annotation: 239
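
The distribution above is given per language combination. As a quick consistency check, the sketch below sums the categories and derives how many images carry an annotation in each language. The counts are copied from the list; the variable names and the two-letter language codes are assumptions chosen for this example.

```python
# Image counts per combination of annotation languages (from the list above).
counts = {
    frozenset({"en"}): 70127,
    frozenset({"de"}): 50291,
    frozenset({"fr"}): 28461,
    frozenset({"en", "de"}): 26880,
    frozenset({"en", "fr"}): 20747,
    frozenset({"de", "fr"}): 9646,
    frozenset({"en", "de", "fr"}): 22899,
}
undetermined = 8144   # annotation language could not be determined
no_annotation = 239   # images without any textual annotation

# All categories together should account for every image in the collection.
total = sum(counts.values()) + undetermined + no_annotation

# Number of images that have an annotation in a given language,
# whether alone or together with other languages.
per_language = {
    lang: sum(n for langs, n in counts.items() if lang in langs)
    for lang in ("en", "de", "fr")
}

print(total)         # 237434
print(per_language)  # {'en': 140653, 'de': 109716, 'fr': 81753}
```

The categories add up to the 237,434 images of the collection. English annotations are available for 140,653 images, German for 109,716 and French for 81,753.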

Two examples that illustrate the images in the collection and their metadata are provided below:

[Image: example 8120]

[Image: example 35]

DOWNLOAD (participants only: the login and password are listed in the "Detail" view of the collection in the ImageCLEF registration system and are available only to registered participants who have also signed the End User Agreement)

  • The ImageCLEF 2010 Wikipedia image collection (237,434 .jpeg and .png images - 22.5GB) can be downloaded in small batches: HERE.
  • The metadata of the images in the collection can be downloaded: HERE.
  • A README file describing the provided data can be downloaded: HERE.
  • An id.txt file listing all image identifiers can be downloaded: HERE.

  • Additional resources:
    • The Wikipedia articles that contain the images in the collection can be downloaded: HERE.
    • The low-level visual features of the images will be provided before topic release.
  • Search engine: the Cross-Modal Search Engine (CMSE, by UniGe) lets you search the ImageCLEF 2010 Wikipedia image collection through a web interface using text queries, example images, or both at once.

Evaluation Objectives
The characteristics of the new Wikipedia collection allow for the investigation of the following objectives:

  • how well do the retrieval approaches cope with larger scale image collections?
  • how well do the retrieval approaches cope with noisy and unstructured textual annotations?
  • how well do the content-based retrieval approaches cope with images that cover diverse topics and are of varying quality?
  • how well can systems exploit and combine different modalities given a user's multimedia information need? Can they outperform mono-modal approaches like query-by-text or query-by-image?
  • how well can systems exploit the multiple language resources? Can they outperform mono-lingual approaches that use, for example, only the English text annotations?

The results of Wikipedia Retrieval at ImageCLEF 2010 showed that the best multimedia retrieval approaches outperformed the text-based approaches. To promote research on multi-modal approaches, a subtask focused on late-fusion approaches is introduced this year. In this subtask, which will take place after the announcement of the main task results, all participants have access to the text- and content-based runs submitted by other participants and are free to combine them in whatever way they consider suitable in order to obtain multi-modal runs. As in 2010, a second focus will be the effectiveness of multi-lingual approaches for multimedia document retrieval.

A tentative schedule can be found here:
  • 1.2.2011: registration opens for all ImageCLEF tasks
  • 15.3.2011: data release (images + metadata + articles)
  • 15.4.2011: topic release
  • 15.5.2011: registration closes for all ImageCLEF tasks
  • 15.6.2011: submission of runs
  • 15.7.2011: release of results
  • 14.8.2011: submission of working notes papers
  • 19.9.2011-22.9.2011: CLEF 2011 Conference, Amsterdam, The Netherlands