
Scalable Concept Image Annotation 2013

Last updated: Mon, 30 Sep 2013


  • The subtask overview presentation slides are available here.
  • The subtask overview paper is available here.
  • The online working notes have been published here.
  • The results for the participants' submissions have been posted here.



Image concept detection has generally relied on training data that has been manually, and thus reliably, annotated: an expensive and laborious endeavor that cannot easily scale. To address this issue, this year's annotation task concentrates exclusively on developing annotation systems that rely only on automatically obtained web data. A very large number of images can be cheaply gathered from the web and, furthermore, text associated with the images can be obtained from the webpages that contain them. However, the degree of relationship between the surrounding text and the image varies greatly. Figures 1 and 2 show some example images retrieved from a search engine for a couple of queries; it can be observed that some images do not have any apparent relationship with the intended concept. Moreover, the webpages can be in any language, or even a mixture of languages, and they tend to contain many writing mistakes. Overall, the data can be considered to be very noisy.

Figure 1. Images from a web search query of "rainbow".

Figure 2. Images from a web search query of "sun".

The goal of this subtask is to evaluate different strategies to deal with the noisy data so that it can be reliably used for annotating images from practically any topic.


In this subtask, the objective is to develop systems that can easily change or scale the list of concepts used for image annotation. In other words, the list of concepts is itself an input to the system: when given an input image and a list of concepts, the system's job is to give a score to each concept in the list and to decide which of them to assign as annotations. To observe this scalable characteristic of the systems, the list of concepts will be different for the development and test sets.

As described earlier, the training set does not include manually annotated concepts, only textual features obtained from the webpages where the images appeared. It is not permitted to use any labeled data for training the systems, although a possible strategy is to use the textual features to decide which concepts are present and thus label the provided training data artificially. On the other hand, the use of additional language resources, such as language models, language detectors, stemmers and WordNet, is permitted and encouraged.
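As a minimal sketch of the pseudo-labelling strategy mentioned above, one could threshold the textual evidence for each concept to produce artificial training labels. The word scores, concept names and threshold below are invented for illustration; a real system would also expand concepts with WordNet synonyms, stemming, translation, and so on.

```python
# Toy sketch: artificially label a training image from its webpage
# word-score pairs. All values here are invented examples.

def pseudo_label(word_scores, concepts, threshold=0.5):
    """Assign each concept whose textual evidence reaches a threshold."""
    labels = []
    for concept in concepts:
        evidence = word_scores.get(concept, 0.0)  # 0 if the word is absent
        if evidence >= threshold:
            labels.append(concept)
    return labels

word_scores = {"rainbow": 0.9, "sky": 0.6, "deleted": 0.1}  # toy example
print(pseudo_label(word_scores, ["rainbow", "sun", "sky"]))
# ['rainbow', 'sky']
```

The labelled images could then be used to train per-concept classifiers on the provided visual features.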


The dataset used in this task is a subset of 250,000 images extracted from a database of millions of images downloaded from the Internet. The URLs of the images were obtained by querying popular image search engines (namely Google, Bing and Yahoo) for words in the English dictionary. Also, for each image, the corresponding web page that contained the image was downloaded and processed to extract the textual features. An effort was made to avoid including near duplicates and message images (such as "deleted image", "no hot linking", etc.) in the dataset; however, the dataset should still be considered very noisy.

Visual features

Several visual feature sets are being provided so that the participants may concentrate their efforts on other aspects of the task. The following feature vector types have been computed: GIST, Color Histograms, SIFT, C-SIFT, RGB-SIFT and OPPONENT-SIFT. For the *-SIFT descriptors a bag-of-words representation is provided. Also, the images themselves are provided (resized to a maximum of 640 pixels in both width and height), so that the participants can extract their own features. Further details can be found in the README.txt file that is distributed with the data.
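For readers unfamiliar with the bag-of-words representation mentioned above: each local descriptor is assigned to its nearest codebook centre, and the image is represented by the histogram of those assignments. The sketch below uses toy 2-D descriptors and a toy 2-word codebook, not the provided features.

```python
# Illustrative bag-of-words quantization with toy values.

def bag_of_words(descriptors, codebook):
    """Histogram of nearest-centroid assignments over a codebook."""
    histogram = [0] * len(codebook)
    for d in descriptors:
        # Nearest centre by squared Euclidean distance.
        nearest = min(range(len(codebook)),
                      key=lambda k: sum((a - b) ** 2
                                        for a, b in zip(d, codebook[k])))
        histogram[nearest] += 1
    return histogram

codebook = [(0.0, 0.0), (1.0, 1.0)]            # 2 visual words (toy)
descriptors = [(0.1, 0.2), (0.9, 1.1), (1.0, 0.8)]
print(bag_of_words(descriptors, codebook))     # [1, 2]
```

The provided *-SIFT features already come in this form, so this step is only needed if you extract your own descriptors from the resized images.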

Textual features

Several sets of textual features are provided. Two of these correspond to text from the webpages where the images appear, differing in the amount of processing applied to them. Another set consists of the words used to find each of the images in the search engines. Finally, since the URLs of the images sometimes also relate to their content, these are made available as well. The provided features are the following:

  • The complete websites converted to valid XML to ease processing.
  • For each image a list of word-score pairs. The scores were derived taking into account 1) the term frequency (TF), 2) the document object model (DOM) attributes, and 3) the word distance to the image.
  • Triplets of word, search engine and rank, describing how each image was found.
  • The URLs of the images as referenced in the corresponding webpages.
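To give a feel for working with the word-score pairs, here is a hypothetical parser; the actual column layout is documented in the README.txt distributed with the data, and the line format assumed below (an image identifier followed by alternating words and scores) is an illustration only.

```python
# Hypothetical parser for a word-score feature line. The ASSUMED format is
# "image_id word1 score1 word2 score2 ..."; check README.txt for the real one.

def parse_word_scores(line):
    fields = line.split()
    image_id, rest = fields[0], fields[1:]
    # Pair up each word with its floating point score.
    pairs = {rest[i]: float(rest[i + 1]) for i in range(0, len(rest), 2)}
    return image_id, pairs

iid, pairs = parse_word_scores("img00001 rainbow 0.8 sky 0.35 color 0.2")
print(iid, pairs["rainbow"])  # img00001 0.8
```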

Registering for the Task and Accessing the Data

To participate in this task, you must first sign the end-user agreement and register as explained on the main ImageCLEF 2013 webpage. Once the registration is complete, you will be able to access the ImageCLEF 2013 system, where you can find the username and password required for accessing the data.

The task has ended, however, all of the data made available to the participants can now be downloaded from

Submission Details

The submissions will be received through the ImageCLEF 2013 system: go to "Runs", then "Submit run", and select the track "ImageCLEFphoto-annotation".

The participants will be permitted to submit up to 12 runs, distributed as follows:

  • 6 runs for the test set
  • 6 runs for the development set

The objective is to submit up to 5 pairs of runs, one run for the development set and one for the test set, where both runs use exactly the same system and parameters. This information can then be used to judge the generalization capability of the systems.

The participants will submit each of their system runs in a single ASCII plain text file. The results for the development/test set images are given one per line, each line having the format:

[image_identifier] [score_for_concept_1] [decision_for_concept_1] ... [score_for_concept_M] [decision_for_concept_M]

where the image identifiers are the ones found in '{devel|test}_iids.txt', the scores are floating point values for which a higher value means a higher score, and the decisions are either '0' or '1', with '1' meaning that the concept has been selected for annotating the image. The order of the concepts must be the same as in the concept lists '{devel|test}_concepts.txt', where M is the total number of concepts. For the development set M=95, and for the test set M=116. The scores are optional; if you do not want to supply them, please set them all to zero.
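The line format above can be produced with a few lines of code. The helper and the example values below are illustrative (the example uses M=3 concepts rather than the 95 or 116 of the real concept lists).

```python
# Format one run-file line: image ID followed by M score/decision pairs.
# The image ID and values below are invented example data.

def format_run_line(image_id, scores, decisions):
    assert len(scores) == len(decisions)
    fields = [image_id]
    for score, decision in zip(scores, decisions):
        assert decision in (0, 1)        # decisions must be binary
        fields.append("%g" % score)      # higher score = more confident
        fields.append(str(decision))
    return " ".join(fields)

line = format_run_line("img00001", [0.9, 0.1, 0.5], [1, 0, 1])
print(line)  # img00001 0.9 1 0.1 0 0.5 1
```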

Performance Measures

Included with the baseline techniques toolkit is a Matlab/Octave function 'evalannotat.m' which computes some performance measures. These will be among the ones (or variations of these) used for comparing the submitted systems in the final overview paper. The performance measures are computed for each of the development/test set images and/or concepts, and as a global measure, the mean of these measures is obtained. The basic measures are:

  • F: F-measure. F=2*(precision*recall)/(precision+recall)
  • AP: Average Precision. AP=1/C [SUM_{c=1...C} c/rank(c)]

The F-measure only uses the annotation decisions and is computed in two ways, one by analyzing each of the testing samples and the other by analyzing each of the concepts. On the other hand, the AP uses the annotation scores, and it is computed for each testing image by sorting the concepts by the scores. 'C' is the number of ground truth concepts for that image and 'rank(c)' is the rank position of the c-th ranked ground truth concept. If there are ties in the scores, a random permutation is applied within the ties.

After obtaining the means, these measures are referred to as MF (mean F-measure) and MAP (mean average precision), respectively. The MF will be computed by analyzing both the samples (MF-samples) and the concepts (MF-concepts), whereas the MAP will only be computed by analyzing the samples.
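The two basic measures can be sketched as follows; this is an illustrative re-implementation from the definitions above, not the distributed 'evalannotat.m' code. The random permutation within score ties is realized by shuffling before a stable sort.

```python
import random

def f_measure(decisions, truth):
    """F = 2*(precision*recall)/(precision+recall) from binary vectors."""
    tp = sum(d and t for d, t in zip(decisions, truth))
    if tp == 0:
        return 0.0
    precision = tp / sum(decisions)
    recall = tp / sum(truth)
    return 2 * precision * recall / (precision + recall)

def average_precision(scores, truth, rng=random):
    """AP = 1/C * sum_{c=1..C} c/rank(c), ties broken randomly.

    Assumes at least one ground-truth concept (C >= 1).
    """
    order = list(range(len(scores)))
    rng.shuffle(order)                    # random permutation within ties...
    order.sort(key=lambda i: -scores[i])  # ...kept by the stable sort
    # Ranks (1-based) of the ground-truth concepts, in increasing order.
    ranks = sorted(r + 1 for r, i in enumerate(order) if truth[i])
    C = len(ranks)
    return sum((c + 1) / ranks[c] for c in range(C)) / C

# Toy example with 4 concepts, ground truth = concepts 1 and 3.
print(average_precision([0.9, 0.2, 0.8, 0.1], [1, 0, 1, 0]))  # 1.0
```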

Baseline results for development set

The baseline toolkit describes and implements two basic baseline techniques supplied for the task.

Providing these baseline techniques has several objectives:

  • Familiarizing the participants with the submission format for the runs.
  • Introducing to the participants some of the performance measures that will be used to judge the submissions and example code to compute them.
  • Having some reference performance measures that can help the participants know if they are on the right track.
  • Giving the participants some hints and ideas for tackling this problem.

The baseline toolkit is available for download from here.

The performance measures for the baseline systems can be found in the results page.



This work has been partially supported through the EU 7th Framework Programme grant tranScriptorium (Ref: 600707), by the Spanish MEC under the STraDA research project (TIN2012-37475-C02-01), and by the Generalitat Valenciana (GVA) under the reference Prometeo/2009/014.