
Scalable Concept Image Annotation 2014

Last updated: Thu, 15 May 2014

Welcome to the website of the 3rd edition of the Scalable Concept Image Annotation challenge!

This is a challenge that addresses the multi-concept image annotation problem. Unlike similar challenges, only automatically obtained data is allowed for training (i.e., no hand labeled samples). This restriction is imposed in order to favor the concept-wise scalability of the systems.


  • 15.09.2014: The overview presentation slides are available here.
  • 15.09.2014: The task overview paper is available here.
  • 15.09.2014: The online working notes have been published here.
  • 15.05.2014: The results for the participants' submissions have been posted here.




Automatic concept detection within images is a challenging and as yet unsolved research problem. Impressive improvements have been achieved, although most of the proposed systems rely on training data that has been labeled manually and thus reliably, an expensive and laborious endeavor that cannot easily scale. Recent image annotation benchmark campaigns have resorted to crowdsourcing in order to label large collections of images. However, when multiple concepts must be detected per image and the list of concepts for annotation keeps growing, the labeling task becomes too expensive even with crowdsourcing. Thus, reducing the reliance on cleanly labeled data has become a necessity.

There are billions of images available online appearing on webpages, where the text surrounding the image may be directly or indirectly related to its content, thus providing clues as to what is actually depicted in the image. Moreover, images and the webpages on which they appear can be obtained easily and cheaply for virtually any topic by using a web crawler. The Internet thus puts at our disposal an immense source of data that can potentially be used for developing practical and scalable image annotation systems. To achieve this, research needs to focus on how to take advantage of all the noisy information that is available in order to build reliable image annotation systems.

Task description

In this task, the objective is to develop systems that receive as input only an image (from which visual features would be extracted) and produce as output a prediction of which concepts are present in that image, selected from a predefined list of concepts. Unlike other image annotation evaluations, in this task the use of hand labeled training data is not allowed. Instead, web crawled data should be used, including both images and textual data from the webpages (see dataset). However, the systems do not have to rely only on the web data. Other resources can be used (and their use is encouraged) as long as they can be easily obtained, do not depend on specific concepts, and do not require any significant hand labeling of data. Examples of other types of resources that can be used are: ontologies, word disambiguators, language models, language detectors, spell checkers, and automatic translation systems. In case of doubt about the use of any data or resource, please feel free to contact the task organizers. To ease participation, several resources will be provided, including already extracted visual features, pre-processed webpage data and a baseline system.

The design and development of the systems must emphasize scalability. It should be simple to adapt or retrain the system when the list of concepts is changed, and the performance should generalize well to concepts not observed during development. To observe this scalability, the list of concepts will be different for the development and test sets. In fact, within each set the list of concepts will not be the same for all images. Furthermore, for the evaluation and comparison of the submitted systems, importance will be given to the performance on the concepts not seen during development.

Registering for the task and accessing the data

To participate in this task, please register by following the instructions found on the main ImageCLEF 2014 webpage. Once registered, you will receive updates on the task by email. You can also follow the ImageCLEF Twitter account.

All the data for the task is now available for download. When downloading the data, the system will ask you to accept the Data Usage Agreement found in 'agreement.txt'. This will only work if you have JavaScript and cookies enabled in your browser. Alternatively, the data can be downloaded using a command line tool by setting a cookie as "webupv-datasets_accept_agreement=yes". For example, using wget, a file could be downloaded as follows:

$ wget --header "Cookie: webupv-datasets_accept_agreement=yes" ${FILE_URL}
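
The same cookie can also be sent from a script. The following is a minimal Python sketch using only the standard library. The cookie string is the one given above; the function names and the chunk size are illustrative choices, and the file URL has to be supplied by the caller.

```python
import urllib.request

# Cookie name and value as stated in the text; everything else is illustrative.
AGREEMENT_COOKIE = "webupv-datasets_accept_agreement=yes"


def build_request(file_url: str) -> urllib.request.Request:
    """Build a GET request that carries the agreement cookie."""
    return urllib.request.Request(file_url, headers={"Cookie": AGREEMENT_COOKIE})


def download(file_url: str, dest_path: str, chunk_size: int = 1 << 20) -> None:
    """Stream the response body to dest_path in fixed-size chunks."""
    with urllib.request.urlopen(build_request(file_url)) as resp, \
            open(dest_path, "wb") as out:
        while True:
            chunk = resp.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
```

Streaming in chunks avoids holding a large data file in memory while it downloads.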


The dataset used in this task is a subset of images extracted from a database of millions of images downloaded from the Internet. The URLs of the images were obtained by querying popular image search engines (namely Google, Bing and Yahoo) when searching for words in the English dictionary. For each image, the corresponding web page that contained the image was also downloaded and processed to extract the textual features. An effort was made to avoid including near duplicates and message images (such as "deleted image", "no hot linking", etc.) in the dataset. However, the dataset is expected to be very noisy and should be treated as such.

For this edition of the task, there will be two releases of training data (files "train_*" and "train2_*"). The first release will be made available at the same time as the development set, and the second will be made available with the test set. Both training sets can be used for training the systems that will be submitted. The first training set is exactly the same as the one used in ImageCLEF 2013. If you participated in that edition, you can avoid downloading the big files again by checking the MD5 checksums of the files to see whether you already have them.
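
As an illustration of the checksum comparison, here is a small Python sketch that computes the MD5 digest of a local file and compares it with a published value. The function names are hypothetical, and the expected checksum must be taken from the checksums distributed with the data.

```python
import hashlib


def md5sum(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of the file at path."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read in chunks so that large feature files need not fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected_md5: str) -> bool:
    """True if the local file has the expected (published) MD5 checksum."""
    return md5sum(path) == expected_md5.strip().lower()
```

A file from the 2013 edition whose digest matches the published one does not need to be downloaded again.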

Visual features (training, development and test)

Several visual feature sets are being provided so that the participants may concentrate their efforts on other aspects of the task. The following feature vector types have been computed: GIST, Color Histograms, SIFT, C-SIFT, RGB-SIFT and OPPONENT-SIFT. For the *-SIFT descriptors a bag-of-words representation is provided. The images are also provided (resized to a maximum of 640 pixels for both width and height), so that the participants can extract their own features. Further details can be found in the README.txt file that is distributed with the data.

Textual features (only for training set)

Several sets of textual features are provided. Two of these correspond to text from the webpages where the images appear, and they differ in the amount of processing applied to them. Another set of features consists of the words used to find each of the images in the search engines. Finally, the URLs of the images sometimes also relate to the content of the images, so these are also made available. The provided features are the following:

  • The complete websites converted to valid XML to ease processing.
  • For each image a list of word-score pairs. The scores were derived taking into account 1) the term frequency (TF), 2) the document object model (DOM) attributes, and 3) the word distance to the image.
  • Triplets of word, search engine and rank, of how the images were found.
  • The URLs of the images as referenced in the corresponding webpages.

Submission instructions

Submissions will be received through the ImageCLEF 2014 system: go to "Runs", then "Submit run", and then select the track "ImageCLEF2014:photo-annotation".

The participants will be permitted to submit up to 10 runs. Each system run will consist of a single ASCII plain text file. In this file, the results for the test set images are given on separate lines, each line having the format:

[image_identifier] [score_for_concept_1] [decision_for_concept_1] ... [score_for_concept_M] [decision_for_concept_M]

where the image identifiers are the ones found in 'test_conceptlists.txt', the scores are floating point values for which a higher value means a higher score, and the decisions are either '0' or '1', where '1' means that the concept has been selected for annotating the image. It is important to clarify that the number of concepts M is not the same for all images. Each image has a corresponding concept list, and an image may appear more than once. In the run file, the order of the images and the order of the concepts must be the same as in 'test_conceptlists.txt'. The scores are optional. If you do not want to supply them, please set them all to zero. Files with an incorrect format will be rejected by the submission system.
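
To make the line format concrete, the following Python sketch builds one run-file line from an image identifier and parallel lists of scores and decisions. The function name, the single-space separator and the six-decimal score formatting are assumptions made for illustration; only the field order comes from the format described above.

```python
def format_run_line(image_id, scores, decisions):
    """Return '[image_id] [score_1] [decision_1] ... [score_M] [decision_M]'.

    scores and decisions must be given in the same concept order as the
    concept list of this image.
    """
    if len(scores) != len(decisions):
        raise ValueError("scores and decisions must have the same length")
    fields = [image_id]
    for score, decision in zip(scores, decisions):
        if decision not in (0, 1):
            raise ValueError("decisions must be 0 or 1")
        fields.append(f"{score:.6f}")  # assumed formatting of the score
        fields.append(str(decision))
    return " ".join(fields)


# Scores are optional: a system without scores passes zeros.
line = format_run_line("example_image_id", [0.0, 0.0, 0.0], [1, 0, 1])
```

Here "example_image_id" is a placeholder. Real identifiers are the ones listed in 'test_conceptlists.txt'.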

A script is available for verifying the correct format of the files. The verification script can be downloaded from here and is used as follows:

$ ./ test_conceptlists.txt {path_to_run_file}
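
For a quick check before running the official script, the kind of test it performs can be approximated in Python. This sketch is not the official verification script. It assumes that each line of 'test_conceptlists.txt' holds an image identifier followed by the concepts for that image, separated by whitespace, which should be confirmed against the README.txt before relying on it.

```python
def check_run(concept_lines, run_lines):
    """Return a list of error messages; an empty list means no problem found.

    concept_lines: lines of the concept list file (assumed layout:
                   'image_id concept_1 ... concept_M').
    run_lines:     lines of the run file to check.
    """
    errors = []
    if len(concept_lines) != len(run_lines):
        errors.append(f"expected {len(concept_lines)} lines, found {len(run_lines)}")
    for n, (cline, rline) in enumerate(zip(concept_lines, run_lines), start=1):
        expected = cline.split()
        fields = rline.split()
        if not expected:
            errors.append(f"line {n}: empty concept list line")
            continue
        image_id, concepts = expected[0], expected[1:]
        if not fields or fields[0] != image_id:
            errors.append(f"line {n}: image identifier does not match")
            continue
        values = fields[1:]
        if len(values) != 2 * len(concepts):
            errors.append(f"line {n}: expected {2 * len(concepts)} values")
            continue
        # Values alternate: score, decision, score, decision, ...
        for score, decision in zip(values[0::2], values[1::2]):
            try:
                float(score)
            except ValueError:
                errors.append(f"line {n}: score '{score}' is not a number")
            if decision not in ("0", "1"):
                errors.append(f"line {n}: decision '{decision}' is not 0 or 1")
    return errors
```

Because the images and concepts must appear in the same order as in the concept list file, the check compares the two files line by line rather than looking identifiers up.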

Evaluation methodology

Included with the baseline techniques toolkit is a Matlab/Octave function 'evalannotat.m' which computes some performance measures. These will be among the ones (or variations of these) used for comparing the submitted systems in the final overview paper. The performance measures are computed for each of the development/test set images and/or concepts, and as a global measure, the mean of these measures is obtained. The basic measures are:

  • F: F-measure. F=2*(precision*recall)/(precision+recall)
  • AP: Average Precision. AP=1/C [SUM_{c=1...C} c/rank(c)]

The F-measure only uses the annotation decisions and is computed in two ways, one by analyzing each of the testing samples and the other by analyzing each of the concepts. On the other hand, the AP uses the annotation scores, and it is computed for each testing image by sorting the concepts by the scores. 'C' is the number of ground truth concepts for that image and 'rank(c)' is the rank position of the c-th ranked ground truth concept. If there are ties in the scores, a random permutation is applied within the ties.
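
The AP definition above can be sketched in Python as follows. This is an illustrative reimplementation, not the 'evalannotat.m' code. The function name, the seeded random generator and the value returned for an image without ground truth concepts are assumptions.

```python
import random


def average_precision(scores, relevant, rng=None):
    """AP = 1/C * sum_{c=1..C} c / rank(c) for one image.

    scores:   one score per concept in the image's concept list.
    relevant: parallel booleans, True for ground truth concepts.
    rng:      random.Random used to break ties (seeded by default).
    """
    if rng is None:
        rng = random.Random(0)
    order = list(range(len(scores)))
    # Shuffling first and then using a stable sort yields a random
    # permutation within each group of tied scores.
    rng.shuffle(order)
    order.sort(key=lambda i: scores[i], reverse=True)

    hits = 0      # c: how many ground truth concepts seen so far
    total = 0.0   # running sum of c / rank(c)
    for rank, i in enumerate(order, start=1):
        if relevant[i]:
            hits += 1
            total += hits / rank
    # Assumption: an image with no ground truth concepts gets AP 0.
    return total / hits if hits else 0.0


# Ground truth concepts ranked 1st and 3rd: AP = (1/1 + 2/3) / 2
ap = average_precision([0.9, 0.8, 0.7, 0.1], [True, False, True, False])
```

The MAP is then the mean of this value over all test images.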

After obtaining the means, these measures can be referred to as: MAP (mean average precision) and MF (mean F-measure), respectively. The MF will be computed analyzing both the samples (MF-samples) and the concepts (MF-concepts), whereas the MAP will only be computed analyzing the samples.
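
The difference between MF-samples and MF-concepts can be illustrated with a short Python sketch. It is a simplified reading of the description above, not the 'evalannotat.m' implementation. It assumes that F is taken as 0 when precision plus recall is 0, and that each concept is pooled over all images whose concept list contains it.

```python
from collections import defaultdict


def f_measure(tp, fp, fn):
    """F = 2*P*R / (P+R), which simplifies to 2*tp / (2*tp + fp + fn)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0  # assumed convention: 0 if undefined


def mean_f(images):
    """Return (MF-samples, MF-concepts).

    images: list of (concepts, decisions, truth) with parallel lists, where
            decisions and truth hold 0/1 values for each listed concept.
    """
    per_sample = []
    per_concept = defaultdict(lambda: [0, 0, 0])  # concept -> [tp, fp, fn]
    for concepts, decisions, truth in images:
        tp = fp = fn = 0
        for name, decided, true in zip(concepts, decisions, truth):
            counts = per_concept[name]
            if decided and true:
                tp += 1
                counts[0] += 1
            elif decided:
                fp += 1
                counts[1] += 1
            elif true:
                fn += 1
                counts[2] += 1
        per_sample.append(f_measure(tp, fp, fn))
    if not per_sample:
        return 0.0, 0.0
    mf_samples = sum(per_sample) / len(per_sample)
    mf_concepts = sum(f_measure(*c) for c in per_concept.values()) / len(per_concept)
    return mf_samples, mf_concepts


example = [
    (["cat", "dog"], [1, 1], [1, 0]),
    (["cat", "sky"], [0, 1], [1, 1]),
]
mf_samples, mf_concepts = mean_f(example)
```

In this toy example both images have F = 2/3, so MF-samples is 2/3. The per-concept values are 2/3 for "cat", 0 for "dog" and 1 for "sky", so MF-concepts is 5/9.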

Baseline results for development set

The baseline toolkit describes and implements two basic baseline techniques supplied for the task.

Providing these baseline techniques has several objectives:

  • Familiarizing the participants with the submission format for the runs.
  • Introducing the participants to some of the performance measures that will be used to judge the submissions, along with example code to compute them.
  • Having some reference performance measures that can help the participants know if they are on the right track.
  • Giving the participants some hints and ideas for tackling this problem.

The baseline toolkit is available for download from here.

The performance measures for the baseline systems can be found in the README.txt file of the baseline toolkit.

Evaluation results

The preliminary results of the evaluation can be found here. The final and more in-depth analysis of the submitted systems will be provided in the overview paper of this task that will be presented at CLEF 2014.

Past editions