
Scalable image annotation using general Web data

Last updated: Mon, 30 Jul 2012



Image concept detection relies on training data that has been manually, and thus reliably, annotated, an expensive and laborious endeavor that cannot easily scale. To address this issue, a new annotation task is introduced this year: scalable image annotation using a collection of automatically obtained Web images as training data. A very large number of images can be cheaply gathered from the Web, and the text associated with them can be obtained from the webpages that contain the images. However, the degree of relationship between the surrounding text and the image varies greatly, so the data can be considered very noisy. Moreover, the webpages can be in any language or even a mixture of languages, and they tend to contain many writing mistakes. The goal of this task is to evaluate different strategies for dealing with the noisy data so that it can be reliably used for annotating images from practically any topic.

Figure 1. Images from a Web search query of "rainbow".

Figure 2. Images from a Web search query of "sun".

To illustrate the objective of the task, suppose that we search for the word "rainbow" in a popular image search engine. We would expect many of the results to be rainbow landscapes. However, other types of images will also appear (see Figure 1). The images will be related to the query in different senses, and some might have no apparent relationship to it at all. See Figure 2 for a similar example for "sun". One interesting research topic is therefore how to use and handle the automatically retrieved noisy Web data to complement the manually labeled training data, and thereby obtain an annotation system that performs better than one using the manually labeled data alone. On the other hand, since Web data can easily be obtained for any topic, another research topic is how to use the noisy Web data to develop an annotation system with a somewhat unbounded list of concepts.

These two research topics are addressed in two separate subtasks.

Subtask 1: Improving performance in Flickr concept annotation task

In this subtask the list of concepts and the test samples will be exactly the same as the ones used in the Flickr concept annotation subtask. The objective is for participants to use both the Flickr and Web training datasets, trying to improve concept annotation performance by means of the additional Web data. For the sake of comparison, participants should also submit results using only the Flickr training set to the Flickr concept annotation subtask.

Subtask 2: Scalable concept image annotation

In this subtask, the objective is to develop systems that can easily change or scale the list of concepts used for image annotation. In other words, the list of concepts can also be considered an input to the system. Thus, given an input image and a list of concepts, the system's job is to give a score to each of the concepts in the list and decide how many of them to assign as annotations. To observe this scalable characteristic of the systems, the list of concepts will be different for the development and test sets. For this first edition of the task, however, the list of concepts will overlap considerably with the concepts of the Flickr annotation task.

To obtain good results in the test set, the developed techniques should be data-driven. This means that in order to estimate the score of the concepts, the system should not be concept specific, but rather use the textual data automatically extracted from the webpages to derive the score of each of the possible concepts. The use of additional language resources, such as language models, language detectors, stemmers and WordNet, is permitted and encouraged.
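As a minimal sketch of what "not concept specific" can mean in practice, the Python snippet below scores an arbitrary concept list using only a per-image list of word-score pairs. It is an illustration under stated assumptions, not one of the supplied baseline techniques: the function names, the exact case-insensitive word match, and the `max_annotations` cut-off are all hypothetical choices.

```python
def score_concepts(word_scores, concepts):
    """Score each concept by the score of the identical word, if present.

    word_scores: list of (word, score) pairs extracted for one image.
    concepts:    list of concept names; it can change without retraining.
    """
    # Case-insensitive lookup table built from the textual features.
    lookup = {word.lower(): score for word, score in word_scores}
    # Concepts that never appear near the image get a score of 0.0.
    return [lookup.get(concept.lower(), 0.0) for concept in concepts]


def decide(scores, max_annotations=3):
    """Select at most max_annotations concepts with a positive score."""
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    chosen = {i for i in ranked[:max_annotations] if scores[i] > 0.0}
    return [1 if i in chosen else 0 for i in range(len(scores))]


if __name__ == "__main__":
    pairs = [("Rainbow", 0.8), ("sky", 0.5), ("hotel", 0.2)]
    concepts = ["sky", "rainbow", "dog"]
    scores = score_concepts(pairs, concepts)
    print(scores, decide(scores))
```

Because the concept list is only consulted at annotation time, the same code runs unchanged on the development and test concept lists. A real system would likely add stemming, synonyms or translation on top of the exact match used here.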


The dataset used in this task is a subset of 250,000 images extracted from a database of millions of images downloaded from the Internet. The URLs of the images were obtained by querying popular image search engines (namely Google, Bing and Yahoo) with words from the English dictionary. For each image, the web page that contained the image was also downloaded and processed to extract the textual features. An effort was made to avoid including near duplicates and message images (such as "deleted image", "no hot linking", etc.) in the dataset; nevertheless, the dataset can be considered, and is meant to be, very noisy.

Visual features

Several visual feature sets are provided so that participants may concentrate their efforts on other aspects of the tasks. The following feature vector types have been computed: GIST, Color Histograms, SIFT, C-SIFT, RGB-SIFT, OPPONENT-SIFT and SURF. For the *-SIFT descriptors, both the raw extracted features and a more compact bag-of-words representation are provided. These feature sets and the format they are distributed in are the same as for the Flickr data, to ease participation in both of the Photo Annotation tasks. Further details can be found in the README.txt file that is distributed with the data.

Textual features

Several sets of textual features are provided. Two of these correspond to text extracted from the webpages near the image, and differ in the amount of processing applied to them. Another set of features consists of the words used to find each of the images in the search engines. Finally, the URLs of the images sometimes also relate to the content of the images, so these are also made available. The provided features are the following:

  • Raw text extracted near the image with the image position marked.
  • For each image a list of word-score pairs. The scores were derived taking into account 1) the term frequency (TF), 2) the document object model (DOM) attributes, and 3) the word distance to the image.
  • Triplets of word, search engine and rank, describing how the images were found.
  • The URLs of the images as referenced in the corresponding webpages.

Registering for the Task and Accessing the Data

To participate in this task, you must first sign the end-user agreement and register as explained on the main ImageCLEF 2012 webpage. Once the registration is complete, you will be able to access the ImageCLEF 2012 system, where you can find the username and password for accessing the data.

Now that the task is over, the data for subtask 2 is freely available from here.

Submission Details

Submissions will be received through the ImageCLEF 2012 system: go to "Runs", then "Submit run", and select the track "ImageCLEFphoto-annotation-Web".

The participants will be permitted to submit up to 15 runs, distributed as follows:

  • 5 runs for the test set of subtask 1
  • 5 runs for the test set of subtask 2
  • 5 runs for the development set of subtask 2

The runs for the development set of subtask 2 should correspond exactly to the same systems/parameters as the runs for the test set of subtask 2, so that this information can be used to judge the generalization capability of the systems.

For both subtasks the submission format is the same (detailed below), although the list of concepts is different. For more details on the submission for subtask 1, please also check the Flickr concept annotation website. Each run is submitted as a single ASCII plain text file. The results for each development/test set image are given on a single line, with each line having the format:

[image_identifier] [confidence_for_concept_1] [decision_for_concept_1] ... [confidence_for_concept_M] [decision_for_concept_M]

where the image identifiers are the ones found in '{devel|test}_iids.txt', the confidences are floating point values for which a higher value means a higher confidence, and the decisions are either '0' or '1', where '1' means that the concept has been selected for annotating the image. The order of the concepts must be the same as in the concept lists '{devel|test}_concepts.txt', with M being the total number of concepts, which for subtask 2 is not necessarily the same for development and test.
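The line format described above can be produced with a few lines of Python. This is only a sketch: the helper name `format_run_line`, the six-decimal formatting of the confidences, and the example identifier are assumptions made for illustration, since the task only requires floating point confidences and 0/1 decisions in concept-list order.

```python
def format_run_line(image_id, confidences, decisions):
    """Build one submission line: id, then (confidence, decision) per concept.

    confidences and decisions must follow the order of the concept list.
    """
    if len(confidences) != len(decisions):
        raise ValueError("one decision is needed per confidence")
    fields = [str(image_id)]
    for confidence, decision in zip(confidences, decisions):
        if decision not in (0, 1):
            raise ValueError("decisions must be 0 or 1")
        fields.append(f"{confidence:.6f}")  # assumed precision, not mandated
        fields.append(str(int(decision)))
    return " ".join(fields)


if __name__ == "__main__":
    # Hypothetical image identifier and scores for a two-concept list.
    print(format_run_line("img001", [0.75, 0.1], [1, 0]))
```

A complete run file would contain one such line per identifier in the corresponding '{devel|test}_iids.txt' file.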

Performance Measures

Included with the baseline techniques toolkit is a Matlab/Octave function 'evalannotat.m' which computes 3 performance measures. These will be among the ones (or variations of these) used for comparing the submitted systems in the final overview paper. The performance measures are computed for each of the development/test set images, and as a global measure, the mean of these measures is obtained. The basic measures are:

  • AP: Average Precision, AP=1/C [SUM_{c=1...C} c/rank(c)]
  • IAP: Interpolated Average Precision, IAP=1/C [SUM_{c=1...C} MAX_{c'=c...C} c'/rank(c')]
  • F-measure, F=2*(precision*recall)/(precision+recall)

where 'C' is the number of ground truth concepts for the image, 'rank(c)' is the rank position of the c-th ranked ground truth concept, and 'precision' and 'recall' are respectively the precision and recall for the annotation decisions. Note that AP and IAP depend only on the confidence scores, while the F-measure only depends on the annotation decisions.

Since the number of concepts per image is small and variable, the AP and IAP will be computed using the rank positions (precision = c/rank(c)), i.e. using every possible value of recall, instead of using some fixed values of recall. If there are ties in the scores, a random permutation is applied within the ties.
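To make the definitions concrete, here is a small Python sketch of the three per-image measures. It is not a port of 'evalannotat.m' and may differ from it in details. Two choices are assumptions of this sketch: ties in the scores are kept in input order (instead of being randomly permuted) so that the example is reproducible, and an image without ground truth concepts, or without any correct decision, gets a value of 0.

```python
def ground_truth_ranks(scores, truth):
    """Rank positions (1-based) of the ground truth concepts, best first."""
    # Stable sort: tied scores keep input order (assumption of this sketch).
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return [pos for pos, i in enumerate(order, start=1) if truth[i]]


def average_precision(scores, truth):
    """AP = 1/C * SUM_{c=1..C} c / rank(c)."""
    ranks = ground_truth_ranks(scores, truth)
    if not ranks:
        return 0.0
    return sum(c / r for c, r in enumerate(ranks, start=1)) / len(ranks)


def interpolated_average_precision(scores, truth):
    """IAP = 1/C * SUM_{c=1..C} MAX_{c'=c..C} c' / rank(c')."""
    ranks = ground_truth_ranks(scores, truth)
    if not ranks:
        return 0.0
    precisions = [c / r for c, r in enumerate(ranks, start=1)]
    total, best = 0.0, 0.0
    for precision in reversed(precisions):  # running maximum from the end
        best = max(best, precision)
        total += best
    return total / len(ranks)


def f_measure(decisions, truth):
    """F = 2 * precision * recall / (precision + recall)."""
    hits = sum(1 for d, t in zip(decisions, truth) if d and t)
    if hits == 0:
        return 0.0
    precision = hits / sum(1 for d in decisions if d)
    recall = hits / sum(1 for t in truth if t)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.7, 0.6]  # confidences for four concepts
    truth = [0, 1, 1, 1]           # ground truth concepts rank 2, 3 and 4
    print(average_precision(scores, truth))               # mean of 1/2, 2/3, 3/4
    print(interpolated_average_precision(scores, truth))  # each term becomes 3/4
    print(f_measure([1, 1, 0, 0], truth))
```

In the example the ground truth concepts sit at ranks 2, 3 and 4, so the precisions are 1/2, 2/3 and 3/4. The interpolation replaces each of them by the best precision reached at that or any later ground truth concept, here 3/4 in all three cases.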

After obtaining the means, these measures can be referred to as: MAP (mean average precision), MIAP (mean interpolated average precision) and MF (mean F-measure), respectively. For the IAP the geometric mean is also used, since it accounts better for improvements on difficult concepts; in this case it is referred to as GMIAP (geometric mean interpolated average precision).
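The difference between the arithmetic and the geometric mean can be sketched as follows. The function names and the small floor `eps`, which keeps a single zero value from forcing the geometric mean to zero, are assumptions of this sketch and are not taken from the evaluation toolkit.

```python
import math


def mean(values):
    """Arithmetic mean, as used for MAP, MIAP and MF."""
    return sum(values) / len(values)


def geometric_mean(values, eps=1e-6):
    """Geometric mean, as used for GMIAP; values are floored at eps."""
    logs = [math.log(max(value, eps)) for value in values]
    return math.exp(sum(logs) / len(logs))


if __name__ == "__main__":
    before = [0.01, 0.9]  # one difficult case, one easy case
    after = [0.02, 0.9]   # only the difficult case improves
    print(mean(before), mean(after))
    print(geometric_mean(before), geometric_mean(after))
```

Doubling the value of the difficult case raises the arithmetic mean only from 0.455 to 0.460, while the geometric mean rises from about 0.095 to about 0.134, which is why the geometric mean is more sensitive to improvements on difficult cases.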

Baseline results for development set for Subtask 2

The baseline toolkit describes two baseline techniques supplied for subtask 2, the "Scalable concept image annotation" subtask of this ImageCLEF 2012 Web Photo Annotation task.

Providing these baseline techniques has several objectives:

  • Familiarizing the participants with the submission format for the runs.
  • Introducing the participants to some of the performance measures that will be used to judge the submissions, along with example code to compute them.
  • Having some reference performance measures that can help the participants know if they are on the right track.
  • Giving the participants some hints and ideas for tackling this problem.

From the 10th of May onward, the baseline toolkit will be available for download from here, using the same username and password as for accessing the dataset.

Results of submissions

Subtask 1:

Submissions using Web data
ISI 1393 0.264 0.217 0.182
ISI 1398 0.261 0.214 0.181
ISI 1399 0.259 0.213 0.178
ISI 1400 0.255 0.211 0.175
KIDS-NUTN 1369 0.536 0.498 0.399
KIDS-NUTN 1370 0.552 0.517 0.397
KIDS-NUTN 1371 0.506 0.456 0.331
KIDS-NUTN 1372 0.539 0.496 0.400
LIRIS-ECL 1355 0.645 0.609 0.459

Best submissions without using Web data and Random
Random baseline 0.115 0.096 0.054
ISI 1424 0.719 0.689 0.553
KIDS-NUTN 1451 0.589 0.541 0.454
LIRIS-ECL 1386 0.707 0.677 0.567

Subtask 2:

Development set results
Random baseline 0.090 0.065 0.063
Co-occurrence baseline 0.232 0.166 0.168
ISI 1406 0.341 0.256 0.262
ISI 1409 0.345 0.262 0.260
ISI 1410 0.350 0.263 0.267
ISI 1413 0.348 0.264 0.266
ISI 1414 0.342 0.259 0.264

Test set results
Random baseline 0.072 0.053 0.055
Co-occurrence baseline 0.229 0.151 0.171
ISI 1407 0.323 0.219 0.246
ISI 1408 0.330 0.225 0.251
ISI 1411 0.332 0.226 0.252
ISI 1412 0.331 0.227 0.254
ISI 1415 0.329 0.224 0.249



This work is supported by the Spanish MICINN under the MIPRCV Consolider Ingenio 2010 (CSD2007-00018) and RISE (TIN2008-04571) projects, and by the Generalitat Valenciana (GVA) under the reference Prometeo/2009/014.