
ImageCLEFtoPicto

Welcome to the 2nd edition of the ToPicto task!

Motivation

Several genetic diseases, such as Rett syndrome, can result in language impairment, thereby interfering with the development of language skills such as speaking, listening, reading, and writing. Both language production and comprehension are impaired. Language impairment may also arise from incidents such as a car accident or a stroke, leading to aphasia — a partial or complete loss of the ability to express oneself or understand written and spoken language. In these particular cases, Augmentative and Alternative Communication (AAC) can be implemented. AAC involves the use of pictograms to help individuals accurately convey their messages [1].

In AAC, a pictogram is an image representing a more or less concrete concept. It can stand for a single word, a named entity, or a polylexical expression, among others. The example above, taken from ARASAAC (a collection of over 25,000 pictograms freely available under a Creative Commons CC BY-NC-SA license), shows, from left to right, the pictograms for “music”, “brush the teeth”, and “what is your name?”.

Using pictograms as a communication aid has proven effective in visualizing syntax, manipulating words, and facilitating language access [2,3]. Moreover, the use of AAC has a positive social impact for individuals with language impairment: the “Croix-Rouge” (French Red Cross) has identified a reduction in stress, an improvement in autonomy and health, and greater serenity and enjoyment in daily life [4]. However, not everyone has prior knowledge of AAC and pictograms. In a situation where a “verbal” person wants to communicate with an AAC user, a tool that converts speech or text into a sequence of pictograms is therefore essential. By providing a relevant and comprehensible sequence of pictograms to the person with a language impairment, communication between the two parties can be initiated.

The goal of ToPicto is to bring together linguists, computer scientists, and translators to develop new methods for translating either speech or text into a corresponding sequence of pictograms.

News

  • 20.12.2024: Website goes live and registration is open.
  • 27.01.2025: The development set is available to registered participants.

Tasks description

Participants will be asked to develop solutions for translating text or speech into a sequence of pictogram terms, each linked to a unique pictogram image from ARASAAC. The proposed models will be tested in two distinct scenarios: utterances from the general domain and utterances from the medical domain.

ImageCLEFToPicto 2025 consists of two subtasks:

Text-to-Picto

Description

The Text-to-Picto task focuses on automatically generating a corresponding sequence of pictogram terms from a French text. This challenge can be seen as a translation problem in which the source language is French and the target language is French pictogram terms.

The resulting translation has to follow the specifications for translation into pictograms so that it is understandable by AAC users.

Speech-to-Picto

Description

Speech-to-Picto focuses on two modalities: speech and pictograms. The challenge is to translate speech directly into pictogram terms without going through a transcription step, which is the approach taken by the speech community in current spoken language translation systems.



Data

The data for the task is sourced from the CommonVoice v.15 corpus [5] and the Orféo corpus [6]. CommonVoice is a corpus of speech data recorded by users on the Common Voice platform, and based on text from various public domain sources, including blog posts, old books, movies, and other public speech corpora. Orféo contains interactions between adults, adults and children, as well as between children, covering a wide range of topics including debates, everyday situations, and medical consultations. This type of text is representative of the interactions observed between caregivers (e.g., families and medical staff) and individuals who rely on pictograms due to language impairments.

For ToPicto, we provide, for each speech utterance or oral transcription, a corresponding sequence of terms, each linked to a pictogram.
Below is detailed information about each input and the expected output format for each task.

Text-to-Picto

  • Input: a JSON file with the following fields (tgt and pictos are provided only for training and validation data; for the test data, only id and src are given):
      - id: unique identifier of each utterance (e.g., cefc-tcof-Acc_del_07-1)
      - src: source of the utterance, i.e., text from the oral transcription (e.g., tu peux pas savoir)
      - tgt: target of the utterance, i.e., the sequence of pictogram terms (tokens) (e.g., toi pouvoir savoir non)
      - pictos: a list of pictogram identifiers, one per pictogram term (same length as tgt)* (e.g., [6625, 35949, 16885, 5526])
  • Output: a JSON file with the following fields:
      - id: unique identifier of each utterance (e.g., cefc-tcof-Acc_del_07-1)
      - hyp: hypothesis produced by your system/model, i.e., the sequence of pictogram terms (e.g., toi savoir non)
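As a sketch, a training record and a submission record in the format above might look like this in Python (the field names id, src, tgt, pictos, and hyp come from the tables above; everything else here is illustrative):

```python
import json

# One training/validation record (fields from the task description).
train_record = {
    "id": "cefc-tcof-Acc_del_07-1",
    "src": "tu peux pas savoir",
    "tgt": "toi pouvoir savoir non",
    "pictos": [6625, 35949, 16885, 5526],
}

# A matching submission record: only the utterance id and the hypothesis.
submission_record = {
    "id": "cefc-tcof-Acc_del_07-1",
    "hyp": "toi savoir non",
}

# Round-trip through JSON, as you would when writing a run file.
serialized = json.dumps([submission_record], ensure_ascii=False)
print(serialized)
```

Note that pictos has exactly one identifier per token of tgt, as the table states.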


Speech-to-Picto

  • Input: a JSON file with the following fields (tgt and pictos are provided only for training and validation data; for the test data, only id and src are given):
      - id: unique identifier of each utterance (e.g., cefc-tcof-Acc_del_07-1)
      - src: audio file linked to the id, in .wav format (e.g., cefc-tcof-Acc_del_07-1.wav)
      - tgt: target of the utterance, i.e., the sequence of pictogram terms (tokens) (e.g., toi pouvoir savoir non)
      - pictos: a list of pictogram identifiers, one per pictogram term (same length as tgt)* (e.g., [6625, 35949, 16885, 5526])
  • Output: a JSON file with the following fields:
      - id: unique identifier of each utterance (e.g., cefc-tcof-Acc_del_07-1)
      - hyp: hypothesis produced by your system/model, i.e., the sequence of pictogram terms (tokens) (e.g., toi savoir non)

*This information is provided for reference to illustrate the input with a sequence of pictogram images. Each image can be obtained from the ARASAAC website at the following link: https://api.arasaac.org/v1/pictograms/6625
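Following the link pattern quoted above, the image URL for each identifier in pictos can be built mechanically; a minimal sketch (it only constructs the URLs, it does not download anything):

```python
# Build ARASAAC pictogram image URLs from the identifiers in the `pictos` field.
ARASAAC_API = "https://api.arasaac.org/v1/pictograms"

def picto_urls(picto_ids):
    """Return one image URL per pictogram identifier."""
    return [f"{ARASAAC_API}/{picto_id}" for picto_id in picto_ids]

urls = picto_urls([6625, 35949, 16885, 5526])
print(urls[0])  # https://api.arasaac.org/v1/pictograms/6625
```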

Data statistics


Development set

The statistics of the Text-to-Picto and Speech-to-Picto development sets are given below.

  • The first table provides information about the source (src). For Text-to-Picto, it includes the minimum and maximum lengths of the text. For Speech-to-Picto, it includes the minimum and maximum durations of speech utterances in seconds.
                                   Text-to-Picto / src       Speech-to-Picto / src
                                   train       valid         train       valid
  Number of utterances             20,177      1,208         20,177      1,208
  Min length / duration (s)        1           1             0.08        0.22
  Max length / duration (s)        99          50            28.28       21.97
  Average length / duration (s)    9.8         10.0          4.00        4.35
  • The target (tgt) is the same for the two tasks.

                                   Train - tgt     Valid - tgt
  Number of utterances             20,177          1,208
  Min length                       1               1
  Max length                       89              48
  Average length                   8.5             8.6
  Unique tokens                    4,346           1,492
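Statistics like those in the target table can be recomputed from the JSON records; a sketch assuming a list of records with the tgt field described earlier (the function name is ours):

```python
def target_stats(records):
    """Length and vocabulary statistics over the `tgt` token sequences."""
    lengths = [len(r["tgt"].split()) for r in records]
    vocab = {tok for r in records for tok in r["tgt"].split()}
    return {
        "utterances": len(records),
        "min_len": min(lengths),
        "max_len": max(lengths),
        "avg_len": sum(lengths) / len(lengths),
        "unique_tokens": len(vocab),
    }

# Tiny illustrative sample, not the real data.
sample = [{"tgt": "toi pouvoir savoir non"}, {"tgt": "toi savoir non"}]
print(target_stats(sample))
```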


Test set

The statistics of the Text-to-Picto and Speech-to-Picto test sets are given below.

The table provides information about the source (src). For Text-to-Picto, it includes the minimum and maximum lengths of the text. For Speech-to-Picto, it includes the minimum and maximum durations of speech utterances in seconds.

                                   Text-to-Picto    Speech-to-Picto
  Number of utterances             2,904            2,904
  Min length / duration (s)        1                0.19
  Max length / duration (s)        62               21.35
  Average length / duration (s)    10.12            4.45

Evaluation methodology

The evaluation is conducted using sacreBLEU [7], METEOR [8], and the Picto-term Error Rate (PictoER) [9]. For all three metrics, the evaluation involves comparing the hypothesis (hyp) with the target (tgt), i.e., the sequence of pictogram terms.

  • SacreBLEU measures the number of common n-grams between the translation hypothesis (hyp) and the reference translation (tgt).
  • METEOR performs an alignment between the translation hypothesis and the reference translations, going beyond simple word matching. It takes into account not only direct matches but also those based on synonyms, morphological variations (such as lemmas and word roots), and even paraphrases. The evaluation is more nuanced because it captures additional semantic information that is not encoded in the BLEU score.
  • PictoER is a metric derived from the Word Error Rate (WER). Instead of counting errors at the word level, it counts errors at the token level, each token being linked to an ARASAAC pictogram.
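To make the n-gram matching behind BLEU concrete, here is a minimal pure-Python sketch of the precision arithmetic. It is not a substitute for the sacreBLEU package (which also standardizes tokenization and applies smoothing); this version assumes all n-gram precisions are nonzero:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(ref, hyp, max_n=4):
    """BLEU as 100 * BP * geometric mean of clipped n-gram precisions."""
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hyp, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip each hypothesis n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        log_precisions.append(math.log(clipped / max(1, len(hyp) - n + 1)))
    # Brevity penalty: 1.0 when the hypothesis is at least as long as the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return 100 * bp * math.exp(sum(log_precisions) / max_n)
```

Applied to the worked example in the next subsection, this reproduces the 48.89 score.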

How are the scores computed?

Let's take the following example with hyp the hypothesis given by your system:

  • tgt: passé me écouter les battement de mains
  • hyp: passé me écouter les coeur de mains
sacreBLEU score (N = 4, BP = 1.0):
  • unigram precision = 6/7 ≈ 0.857
  • bigram precision = 4/6 ≈ 0.667
  • trigram precision = 2/5 = 0.40
  • 4-gram precision = 1/4 = 0.25
  • sacreBLEU = 100 * BP * exp(mean of log precisions) = 48.89

METEOR score (gamma = 0.5, beta = 3, Nchunks = 2, matched unigrams Nmatched = 6):
  • precision (P) = 6/7 ≈ 0.857
  • recall (R) = 6/7 ≈ 0.857
  • Fmean = (10 * P * R) / (R + 9 * P) = 0.857
  • Penalty = gamma * (Nchunks / Nmatched)^beta = 0.5 * (2/6)^3 ≈ 0.018
  • METEOR = 100 * Fmean * (1 - Penalty) = 84.1

PictoER (N = 7 reference tokens):
  • substitutions (S) = 1, deletions (D) = 0, insertions (I) = 0
  • PictoER = 100 * (S + D + I) / N = 100 * 1/7 ≈ 14.2
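The PictoER arithmetic can be reproduced with a standard edit-distance computation over token sequences; a minimal pure-Python sketch with uniform edit costs (the function name is ours):

```python
def pictoer(ref, hyp):
    """WER-style error rate over pictogram terms: 100 * (S + D + I) / N."""
    n, m = len(ref), len(hyp)
    # dp[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i          # i deletions
    for j in range(m + 1):
        dp[0][j] = j          # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + sub,  # substitution / match
                           dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1)        # insertion
    return 100 * dp[n][m] / n

ref = "passé me écouter les battement de mains".split()
hyp = "passé me écouter les coeur de mains".split()
print(pictoer(ref, hyp))  # 100/7 ≈ 14.29 (one substitution out of 7 tokens)
```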

How to interpret the results?

  • sacreBLEU [0 - 100]: the higher the better. A translation is considered good when the BLEU score is above 30.
  • METEOR [0 - 100]: the higher the better. A translation is considered good when the METEOR score is above 40.
  • PictoER [0 - 100]: the lower the better. It gives a quick overview of the proportion of incorrectly predicted tokens.

Visualization of the output

The target (tgt) or the hypothesis (hyp) provided by your model for each utterance should be a sequence of pictogram terms (tokens). Each token should correspond to a pictogram from ARASAAC. To visualize the output with the pictogram images, we developed a platform available here: https://huggingface.co/spaces/ToPicto/Visualize-Pictograms

How to use it:

1. Write a sequence of pictogram terms.
2. The platform will display the corresponding pictogram images.

Example:

  • id: common_voice_fr_24203862
  • src: common_voice_fr_24203862.wav
  • tgt/hyp: quatre concert coordonner à ville et à new_york

Participant registration

The registration is open here:

For general information, please refer to ImageCLEF registration instructions.

Important dates

  • 20.12.2024: Registration opens for all ImageCLEF tasks
  • 27.01.2025: Development data release starts
  • 18.03.2025: Test data release starts
  • 25.04.2025: Registration closes for all ImageCLEF tasks
  • 15.05.2025 (extended from 10.05.2025): Deadline for submitting the participants' runs
  • 17.05.2025: Release of the processed results by the task organizers
  • 30.05.2025: Deadline for submission of working notes papers by the participants
  • 27.06.2025: Notification of acceptance of the working notes papers
  • 07.07.2025: Camera ready working notes papers
  • 09-12.09.2025: CLEF 2025, Madrid, Spain

Submission instructions

The submission instructions are available here, in the Submission instructions tab:

Results

More information will be added soon!

Contact

Organizers:

  • Diandra Fabre — <diandra.fabre(at)univ-grenoble-alpes.fr>, Université Grenoble Alpes, LIG, France
  • Cécile Macaire — <cecile.macaire(at)univ-grenoble-alpes.fr>, Université Grenoble Alpes, LIG, France
  • Benjamin Lecouteux — <benjamin.lecouteux(at)univ-grenoble-alpes.fr>, Université Grenoble Alpes, LIG, France
  • Didier Schwab — <didier.schwab(at)univ-grenoble-alpes.fr>, Université Grenoble Alpes, LIG, France

Acknowledgments

ToPicto is organized as part of the Pantagruel project (ANR 23-IAS1-0001), which focuses on developing and evaluating Multimodal and Inclusive Language Models for General and Clinical French. The project is supported by the following partners:






References

[1] Romski, M., & Sevcik, R. A. (2005). Augmentative communication and early intervention: Myths and realities. Infants & Young Children, 18(3), 174-185.
[2] Cataix-Nègre, É. (2017). Communiquer autrement: Accompagner les personnes avec des troubles de la parole ou du langage : les communications alternatives. De Boeck Supérieur.
[3] Beukelman, D.R. and Mirenda, P. (2013). Augmentative and Alternative Communication: Supporting Children and Adults with Complex Communication Needs. Paul H. Brookes Pub.
[4] Communication alternative améliorée (CAA) : la Croix-Rouge française dévoile sa première étude d’impact social ! (2021). Croix-Rouge. Retrieved June 28, 2023, from https://www.croix-rouge.fr/actualite/communication-alternative-amelioree-caa-la-croix-rouge-francaise-devoile-sa-premiere-etude-d-impact-social-2513
[5] Ardila et al. (2020). Common Voice: A Massively-Multilingual Speech Corpus. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4218–4222, Marseille, France. European Language Resources Association.
[6] C. Benzitoun, J.-M. Debaisieux, H.-J. Deulofeu (2016). Le projet ORFÉO : un corpus d'études pour le français contemporain. Corpus n°15, p. 91-114.
[7] Post, M. (2018). A Call for Clarity in Reporting BLEU Scores. In Proceedings of the Third Conference on Machine Translation: Research Papers. Belgium, Brussels : Association for Computational Linguistics, p. 186-191.
[8] Banerjee, S., & Lavie, A. (2005). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization (pp. 65-72).
[9] Woodard, J. P., & Nelson, J. T. (1982). An information theoretic measure of speech recognition performance. In Workshop on standardisation for speech I/O technology, Naval Air Development Center, Warminster, PA.