Image Annotation

Welcome to the 4th edition of the Scalable Concept Image Annotation challenge!

Concept Annotation, Localization and Sentence Generation

Every day, users struggle with the ever-increasing quantity of data available to them: trying to find “that” photo they took on holiday last year, the image on Google of their favourite actress or band, or the images from the news article someone mentioned at work. There are a huge number of images that can be cheaply found and gathered from the Internet. However, more valuable is mixed modality data, for example, web pages containing both images and text. A large amount of information about the image is present on these web pages and vice-versa. However, the relationship between the surrounding text and images varies greatly, with much of the text being redundant and/or unrelated. Despite the obvious benefits of using such information in automatic learning, the very weak supervision it provides means that it remains a challenging problem.


The Scalable Concept Image Annotation task aims to develop techniques that allow computers to reliably describe images, localize the different concepts depicted in the images and generate a description of the scene. This year the task has been split into two related subtasks using a single mixed modality data source of 500,000 web page items. Each item consists of an image and the keyword text extracted from the web page. A development set with ground truth localised concept labels and sentence descriptions will be provided to participants. The overall performance will be evaluated by asking participants to annotate and localise concepts and/or generate sentence descriptions on the 500,000 web page items.


  • 08.09.2015: Task overview presentation slides available here
  • 08.09.2015: Task overview paper available here
  • 08.09.2015: Online working notes have been published here
  • 01.06.2015: Results for all sub tasks are available
  • 26.05.2015: Results for Sub task 1 are available
  • 14.05.2015: deadline for submission of runs by the participants 11:59:59 PM CEST
  • 17.04.2015: The submission format for the test is defined (Submission instructions)
  • 04.03.2015: The full development set of 2000 web pages is now available for download
  • 07.01.2015: The dataset is now available, together with precomputed features. (Access instructions)
  • 01.11.2014: Website is up!
  • 01.11.2014: Registration is open. (Register here)


  • 01.11.2014: registration opens. (Register here)
  • 07.01.2015: The 500,000 web page dataset is released. (Access instructions)
  • 02.02.2015: Development data is released
  • 24.04.2015: Submission system opens
  • 07.05.2015: release of clean track for sub_task 2 by the task organizers
  • 14.05.2015: deadline for submission of runs by the participants 11:59:59 PM CEST
  • 01.06.2015: release of processed results by the task organizers
  • 07.06.2015: deadline for submission of working notes papers by the participants
  • 08.-11.09.2015: CLEF 2015, Toulouse, France

Tasks overview

There are two subtasks in 2015:

  • SubTask 1 - Image Concept detection and localisation

    To build on the success, both in participation and results, of ImageCLEF 2013 and 2014, the image annotation task continues along the same lines as in past years. The objective requires participants to develop a system that receives an image as input and produces as output a prediction of which concepts are present in that image, selected from a predefined list of concepts, and where they are located within the image. In contrast to previous years, hand-labelled data is allowed in this edition of the task. Thus, the available trained ImageNet CNNs can be used, although this will be the baseline; we therefore encourage the use of the provided training data and of other resources such as ontologies, word disambiguators, language models, language detectors, spell checkers, and automatic translation systems.
    The design and development of the systems must emphasize scalability. This year the test set is the full set of 500,000 training images, so all images will need to be annotated with the concepts, and these concepts localized within the image.

  • SubTask 2 - Generation of Textual Descriptions of Images

    In light of recent interest in annotating images beyond just concept labels, we are introducing a new subtask this year where participants are requested to develop systems that can describe an image with a textual description of the visual content depicted in the image. This can be considered an extension to SubTask 1. We are providing two separate tracks for SubTask 2:

    1. Clean track: This track is aimed primarily at those interested only in the Natural Language Generation aspects of the subtask. A gold standard input (bounding boxes labelled with concepts) will be given to the text generation system for each test image. Participants should develop systems that generate sentence-level, natural language descriptions using these gold standard annotations as input.
    2. Noisy track: This track is geared towards participants interested in developing systems that generate textual descriptions directly with an image as input, e.g. by using visual detectors to identify concepts and generating textual descriptions from the detected concepts. Participants are welcome and encouraged to take part with their own inputs, for example by using the output of their own image annotation systems developed for SubTask 1. They could also use the provided baseline image annotation system to facilitate participation without the need to participate in SubTask 1.

    Participants may take part in one or both tracks. Both tracks will be evaluated separately.

The classes

The concepts this year are chosen to be visual objects that are localizable and that are useful for generating textual descriptions of the visual content of images. They include animate objects such as person, dogs and cats, inanimate objects such as houses, cars and balls, and scenes such as city, sea and mountains. The concepts are mined from the texts of our large database of image-webpage pairs. Nouns that are subjects or objects of sentences are extracted and mapped onto WordNet synsets. These are then filtered to 'natural', basic-level categories ('dog' rather than 'yorkshire terrier'), based on the WordNet hierarchy and heuristics from a large-scale text corpus. The final list of concepts is manually shortlisted by the organizers such that the concepts are (i) visually concrete and localizable; (ii) suitable for use in image descriptions; (iii) at a suitable 'every day' level of specificity that is neither too general nor too specific.

Registering for the task and accessing the data

Please register by following the instructions found on the main webpage of ImageCLEF 2015.

Following the approval of registration for the task, participants will be given access rights to download the data files. The access details can be found at the ImageCLEF system -> Collections -> c_ic15_image_annotation Detail.


The dataset used in this task is a subset of images extracted from a database of millions of images downloaded from the Internet. The URLs of the images were obtained by querying popular image search engines (namely Google, Bing and Yahoo) with words from the English dictionary. For each image, the corresponding web page that contained the image was also downloaded and processed to extract the textual features. An effort was made to avoid including near duplicates and message images (such as "deleted image", "no hotlinking", etc.) in the dataset; however, the dataset should be considered, and is expected to be, very noisy.

Visual features (training, development and test)

Several visual feature sets are being provided so that the participants may concentrate their efforts on other aspects of the task. The following feature vector types have been computed: CNN, GIST, Color Histograms, SIFT, C-SIFT, RGB-SIFT and OPPONENT-SIFT. For the CNN, the seventh-layer feature representations extracted from a deep CNN model pre-trained on the ImageNet dataset are provided. For the *-SIFT descriptors, a bag-of-words representation is provided. The images themselves are also provided (resized to a maximum of 640 pixels for both width and height), so that participants can extract their own features. Further details can be found in the README.txt file that is distributed with the data.

Textual features

Several sets of textual features are provided. Two of these correspond to the text from the web pages where the images appear, and differ in the amount of processing applied to them. Another set of features is the words used to find each of the images in the search engines. Finally, the URLs of the images sometimes also relate to the content of the images, so these are also made available. The provided features are the following:
• The complete websites converted to valid XML to ease processing.
• For each image a list of word-score pairs. The scores were derived taking into account 1) the term frequency (TF), 2) the document object model (DOM) attributes, and 3) the word distance to the image.
• Triplets of word, search engine and rank, of how the images were found.
• The URLs of the images as referenced on the corresponding web pages.

Submission instructions

Submissions will be received through the ImageCLEF 2015 system, by going to "Runs", then "Submit run", and then selecting the track "ImageCLEF2015:photo-annotation".

Participants are permitted to submit up to 10 runs. Given the size of the result files, we would like participants to host their result files (a temporary Google Drive folder would fulfill this requirement) and then share the link to the folder within the submission system. Each system run will consist of a single ASCII plain text file. The format of this text file is the following. The results for the test set images are given on separate lines, one image per line, with each line listing up to 100 localised concepts and up to 100 localisations of the same concept:

Sub task 1:

The format uses the following separator characters: colon ':' after the confidence, comma ',' between multiple bounding boxes, and 'x' and '+' for the size-offset bounding box format, i.e.:
[sub_task_id] [image_ID] [concept1] [[confidence1,1]:][width1,1]x[height1,1]+[xmin1,1]+[ymin1,1],[[confidence1,2]:][width1,2]x[height1,2]+[xmin1,2]+[ymin1,2],... [concept2] ...

So, for an example in the development set format (notice that there are 2 bounding boxes for concept n03001627):

  • -0-QJyJXLD_48kXv 0 n04555897 0.6 329 340 366 390
  • -0-QJyJXLD_48kXv 3 n03001627 0.7 39 233 591 400
  • -0-QJyJXLD_48kXv 4 n03001627 0.3 4 233 65 355

In the new submission format it would be a single line:

  • 1 -0-QJyJXLD_48kXv n04555897 0.6:38x51+329+340 n03001627 0.7:553x168+39+233,0.3:62x123+4+233
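The conversion in the example above can be sketched in Python. This is only an illustration, not an official tool: the function name `dev_to_submission` is hypothetical, the column order of the development rows (image ID, box index, concept, confidence, xmin, ymin, xmax, ymax) is inferred from the example, and so is the inclusive pixel convention (width = xmax - xmin + 1), which is what reproduces the numbers shown.

```python
def dev_to_submission(dev_rows, sub_task_id=1):
    """Convert development-format rows for ONE image into a submission line.

    Assumed row layout (inferred from the example, not from a specification):
    image_ID box_index concept confidence xmin ymin xmax ymax
    """
    image_id = None
    boxes_by_concept = {}  # concept -> list of "conf:WxH+X+Y" strings, in input order
    for row in dev_rows:
        iid, _index, concept, conf, xmin, ymin, xmax, ymax = row.split()
        image_id = image_id or iid
        xmin, ymin, xmax, ymax = map(int, (xmin, ymin, xmax, ymax))
        # Inclusive pixel coordinates: a box from 329 to 366 is 38 pixels wide.
        width = xmax - xmin + 1
        height = ymax - ymin + 1
        boxes_by_concept.setdefault(concept, []).append(
            f"{conf}:{width}x{height}+{xmin}+{ymin}"
        )
    parts = [str(sub_task_id), image_id]
    for concept, boxes in boxes_by_concept.items():
        parts.append(concept)
        parts.append(",".join(boxes))
    return " ".join(parts)


rows = [
    "-0-QJyJXLD_48kXv 0 n04555897 0.6 329 340 366 390",
    "-0-QJyJXLD_48kXv 3 n03001627 0.7 39 233 591 400",
    "-0-QJyJXLD_48kXv 4 n03001627 0.3 4 233 65 355",
]
print(dev_to_submission(rows))
```

Multiple boxes of the same concept are grouped under a single concept entry and joined with commas, as in the example line.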

Sub task 2:

    For both tracks, we need one sentence per image:
    [sub_task_id] [image_ID] [sentence]

    The clean track also requires participants to provide the bounding box(es) corresponding to relevant terms within the sentence. In the example below, "boys" refers to bounding boxes 0 and 2, and "dog" refers to bounding box 1, where bounding boxes are defined in the input file given for each test image.

    [sub_task_id] [image_ID] The [[[boys|0,2]]] are playing with the [[[dog|1]]].

    where sub_task_id = 1 for Subtask 1, sub_task_id = 2 for the noisy track and sub_task_id = 3 for the clean track. Image IDs are the ones found in 'data_iids.txt'. The confidences are floating point values in the range 0-1, where a higher value means a higher score. In the run file, the order of the images and the order of the concepts must be the same as in 'data_iids.txt' and 'concepts.txt' (which can also be found in annotations/imageclef2015.concept_to_parents.v20150320.gz), respectively. Given that the result files will be large, it is important that you validate your file format beforehand, as files with an incorrect format will be ignored.
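As an illustration of the clean track markup, the following Python sketch extracts the term/bounding-box correspondences from a sentence. The function name `parse_clean_sentence` and the regular expression are assumptions made for this example; they are not part of the official tooling.

```python
import re

# Matches [[[term|i,j,...]]] where the part after '|' is a comma-separated
# list of bounding box indices (an assumed reading of the format above).
TERM = re.compile(r"\[\[\[([^|\]]+)\|([0-9,]+)\]\]\]")


def parse_clean_sentence(sentence):
    """Return (plain_sentence, [(term, [box indices]), ...])."""
    links = [
        (m.group(1), [int(i) for i in m.group(2).split(",")])
        for m in TERM.finditer(sentence)
    ]
    # Replace each marked-up term by the bare term to recover the sentence.
    plain = TERM.sub(lambda m: m.group(1), sentence)
    return plain, links


plain, links = parse_clean_sentence(
    "The [[[boys|0,2]]] are playing with the [[[dog|1]]]."
)
print(plain)
print(links)
```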

    The deadline for submission of runs by the participants is 14.05.2015, 11:59:59 PM CEST.


    A script is available for verifying the correct format of the files. The verification script can be downloaded from here and it would be used as follows:

    $ ./ data_iids.txt concepts.txt {path_to_run_file}

    or for the subtask 2 (noisy track)

    $ ./ data_iids.txt {path_to_run_file}

    or for the subtask 2 (clean track)

    $ ./ data_iids_subtask2cleantrack.txt {path_to_run_file}

    In the example verification script folder, the test run1.txt doesn’t pass the validation because the third image has 101 bounding boxes for the first concept.

    When submitting your results, please upload a file containing a string like "I have run the validator script and it passed fine" to indicate that you have used the validator script, and insert the shared link to your result file in the "Method description" textbox.
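For a quick local sanity check before running the official verification script, the limits described above can be sketched in Python. This is not the official validator and does not replace it: the function name `check_subtask1_line`, the regular expression and the error messages are all assumptions for illustration, and the check does not verify image or concept ordering.

```python
import re

# One bounding box entry: optional "confidence:" then WxH+X+Y.
BOX = re.compile(r"^(?:(\d*\.?\d+):)?(\d+)x(\d+)\+(\d+)\+(\d+)$")
MAX_CONCEPTS = 100  # up to 100 localised concepts per line
MAX_BOXES = 100     # up to 100 localisations of the same concept


def check_subtask1_line(line):
    """Return a list of problems found in one Sub task 1 run-file line."""
    fields = line.split()
    if len(fields) < 2 or fields[0] != "1":
        return ["line must start with sub_task_id 1 and an image ID"]
    problems = []
    pairs = fields[2:]
    if len(pairs) % 2:
        problems.append("concepts and bounding box lists must come in pairs")
    concepts, box_lists = pairs[0::2], pairs[1::2]
    if len(concepts) > MAX_CONCEPTS:
        problems.append(f"{len(concepts)} concepts (max {MAX_CONCEPTS})")
    for concept, box_list in zip(concepts, box_lists):
        items = box_list.split(",")
        if len(items) > MAX_BOXES:
            problems.append(
                f"{concept}: {len(items)} bounding boxes (max {MAX_BOXES})"
            )
        for item in items:
            match = BOX.match(item)
            if match is None:
                problems.append(f"{concept}: malformed bounding box '{item}'")
            elif match.group(1) is not None and float(match.group(1)) > 1:
                problems.append(f"{concept}: confidence outside 0-1 in '{item}'")
    return problems


good = "1 -0-QJyJXLD_48kXv n04555897 0.6:38x51+329+340"
print(check_subtask1_line(good))
```

An empty list means that no problem was detected by this sketch; the official script remains the reference.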

    Evaluation methodology

    Initial details of the evaluation approach:

    SubTask 1:

    Localisation in Subtask 1 will be evaluated using the PASCAL-style metric of intersection over union (IoU): the area of intersection between the foreground in the output segmentation and the foreground in the ground-truth segmentation, divided by the area of their union.
    The final results are presented both in terms of average performance over all images of all concepts, and in terms of per-concept performance over all images.
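For rectangular bounding boxes, the IoU measure can be sketched in Python as follows. This is a minimal illustration that assumes boxes in the size-offset notation of the submission format (WxH+X+Y, without the confidence prefix); the helper names `parse_box` and `iou` are hypothetical and the official evaluation code may differ in details.

```python
def parse_box(text):
    """Parse 'WxH+X+Y' into (xmin, ymin, width, height)."""
    size, xmin, ymin = text.split("+")
    width, height = size.split("x")
    return int(xmin), int(ymin), int(width), int(height)


def iou(box_a, box_b):
    """Intersection over union of two (xmin, ymin, width, height) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Overlap extent along each axis, clamped at zero for disjoint boxes.
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union else 0.0


# Two 10x10 boxes shifted by 5 pixels overlap in a 5x10 region.
print(iou(parse_box("10x10+0+0"), parse_box("10x10+5+0")))
```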

    SubTask 2:

    Subtask 2 will be evaluated using the Meteor evaluation metric against a minimum of five human-authored textual descriptions as the gold standard reference. Systems participating in the "clean track" will additionally have the option of being evaluated with a fine-grained metric, which is the average F-measure across all test images on how well the text generation system selects the correct concepts to be described (against the gold standard references). A small subset of the textual descriptions in the development set has been annotated with bounding-box/textual-term correspondence for this purpose.

    Frequently Asked Questions

    Below are a number of questions and answers.

    I understand that there are limitations to crowd-sourcing which can result in missed annotations. However, will we be penalised for correct annotations that are missed during testing?

    It is impossible to be 100% sure that there won't be missing annotations. Nevertheless, we can assume that the number of missing annotations is insignificant.

    Where are the test images?

    The test, training and development images are all contained within the 500,000 images. At test time you are expected to provide classifications for all images.

    Do we have to submit both classification and localization results in order to participate in SubTask 1?

    Teams have to submit classification results in the localization format (meaning both class labels and bounding boxes). If you choose to do so, you can return the full image as your guess for the object bounding box. It is hoped that you will use at least simple heuristics to localize the objects, but you are not required to.

    How does this challenge differ from the ImageNet and MSCOCO datasets and challenges?

    All three work on the detection and classification of concepts within images. However, the ImageCLEF dataset is created from Internet web pages, which is a key difference from the other popular datasets. The web pages are unsorted and unconstrained, meaning that the relationship or quality of the text and image in relation to a concept can vary greatly. Therefore, instead of a high-quality Flickr-style photo of a car as in ImageNet, the image in the ImageCLEF dataset could be a fuzzy abstract car shape in the corner of the image. This allows the ImageCLEF image annotation challenge to provide additional opportunities for testing proposed approaches. Another important difference is that, in addition to the image, text data from web pages can be used to train systems and to generate the output description of the image.


    Results for all sub tasks are contained within this folder

    Sub Task 1

    MAP_0.5Overlap is the localised mean average precision (MAP) of each submitted method, using a performance measure of 50% overlap with the ground truth.

    MAP_0_Overlap is the image annotation MAP of each method, with success if the concept is simply detected in the image, without any localisation.

    MAP_IncreasingMAP is the localisation accuracy MAP using an increasing threshold of detection overlap with the ground truth.

    PerConceptMAP_BBoxOverlap_0.5 is the per-concept localisation accuracy given a 50% detection overlap with the ground truth labels.

    Sub Task 2 (Noisy Track)

    Subtask2_NoisyTrack_Meteor: The average Meteor score across all test images. Also provided are the median, min and max scores.

    Sub Task 2 (Clean Track)

    Subtask2_CleanTrack_Fmeasure: The average F1 score, Precision and Recall across all 450 test images. For each image, the average Precision and Recall across all gold standard sentences are computed, and the F1 score is computed from this average Precision and Recall. The final score is the average across all test images.
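The aggregation described above can be sketched in Python. This is only an illustration under stated assumptions: each sentence is represented as the set of bounding box indices it mentions, a selected box counts as correct if it also appears in the gold standard sentence, and the function names are hypothetical. The organizers' actual matching procedure may differ.

```python
def precision_recall(predicted, gold):
    """Precision and recall of predicted box indices against one gold sentence."""
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


def image_f1(predicted, gold_sentences):
    """Average P and R over all gold sentences, then compute F1 from them."""
    scores = [precision_recall(predicted, gold) for gold in gold_sentences]
    precision = sum(p for p, _ in scores) / len(scores)
    recall = sum(r for _, r in scores) / len(scores)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def corpus_f1(images):
    """Final score: mean per-image F1 over (predicted, gold_sentences) pairs."""
    return sum(image_f1(pred, gold) for pred, gold in images) / len(images)


# One image: the system mentions boxes 0 and 1; two gold sentences.
print(image_f1({0, 1}, [{0, 1}, {0}]))
```

Note that the F1 score is computed from the averaged Precision and Recall of an image, rather than by averaging per-sentence F1 scores.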

    Subtask2_CleanTrack_Meteor: The average Meteor score across all test images. Also provided are the median, min and max scores.

    Submitting a working notes paper to CLEF

    The next step is to produce and submit a working notes paper on the method and system you evaluated. It is important to produce a paper even if your results are lower, as publishing methods that do not work well is as important as publishing methods that are successful. The CLEF 2015 working notes will be published in the proceedings, facilitating the indexing by DBLP and the assignment of an individual DOI (URN) to each paper. According to the CEUR-WS policies, a light review of the working notes will be conducted by the task organizers to ensure quality.

    Working notes will have to be submitted before 7th June 2015 11:59 pm - midnight - Central European Summer Time, through the easychair system. The working notes papers are technical reports written in English and describing the participating systems and the conducted experiments. To avoid redundancy, the papers should *not* include a detailed description of the actual task, dataset and experimentation protocol. Instead, the papers are required to cite both the general ImageCLEF overview paper and the corresponding image annotation task overview paper. Bibtex references are available below. A general structure for the paper should provide at a minimum the following information:

  • 1. Title
  • 2. Authors
  • 3. Affiliations
  • 4. Email addresses of all authors
  • 5. The body of the text. This should contain information on:
      o tasks performed
      o main objectives of experiments
      o approach(es) used and progress beyond state-of-the-art
      o resources employed
      o results obtained
      o analysis of the results
      o perspectives for future work

    The paper should not exceed 12 pages, and further instructions on how to write and submit your working notes are on the following page.

    So the upcoming dates in the schedule are:
    07.06.2015 11:59 pm - midnight - Central European Summer Time: deadline for submission of working notes papers by the participants
    30.06.2015: notification of acceptance of the working notes papers
    15.07.2015: camera ready working notes papers
    08.-11.09.2015: CLEF 2015, Toulouse, France

    Bibtex references


    title={{General Overview of ImageCLEF at the CLEF 2015 Labs}},
    author={Mauricio Villegas and
    Henning M\"uller and
    Andrew Gilbert and
    Luca Piras and
    Josiah Wang and
    Krystian Mikolajczyk and
    Alba Garc\'ia Seco de Herrera and
    Stefano Bromuri and
    M. Ashraful Amin and
    Mahmood Kazi Mohammed and
    Burak Acar and
    Suzan Uskudarli and
    Neda B. Marvasti and
    Jos\'e F. Aldana and
    Mar\'ia del Mar Rold\'an Garc\'ia},
    booktitle = {},
    year = {2015},
    publisher = {Springer International Publishing},
    series = {Lecture Notes in Computer Science},
    volume = {},
    isbn = {},
    issn = {0302-9743},
    pages = {},

    author = {Andrew Gilbert and
    Luca Piras and
    Josiah Wang and
    Fei Yan and
    Emmanuel Dellandrea and
    Robert Gaizauskas and
    Mauricio Villegas and
    Krystian Mikolajczyk},
    title = {{Overview of the ImageCLEF 2015 Scalable Image Annotation, Localization and Sentence Generation task}},
    booktitle = {CLEF2015 Working Notes},
    series = {{CEUR} Workshop Proceedings},
    year = {2015},
    volume = {},
    publisher = {},
    issn = {1613-0073},
    pages = {},
    month = {September 8-11},
    address = {Toulouse, France},


    • Andrew Gilbert, University of Surrey, a.gilbert(replace-by-an-at) - Primary Contact
    • Luca Piras, University of Cagliari, Italy, luca.piras(replace-by-an-at)
    • The ViSen consortium, main contact Krystian Mikolajczyk, k.mikolajczyk(replace-by-an-at)
    • Mauricio Villegas, Universidad Politécnica de Valencia, Spain, mauvilsa(replace-by-an-at)