Several genetic diseases, such as Rett syndrome, can result in language impairment, interfering with the development of language skills such as speaking, listening, reading, and writing; both language production and comprehension are affected. Language impairment may also arise from events such as a car accident or a stroke, leading to aphasia: a partial or complete loss of the ability to express oneself or to understand written and spoken language. In such cases, Augmentative and Alternative Communication (AAC) can be implemented. AAC involves the use of pictograms to help individuals accurately convey their messages [1].
In AAC, a pictogram is an image representing a more or less concrete concept. It can stand for a single word, a named entity, or a polylexical expression, among others (see the example with pictograms taken from ARASAAC, a collection of over 25,000 pictograms freely available under a Creative Commons CC BY-NC-SA license).
Using pictograms as a communication aid has proven effective in visualizing syntax, manipulating words, and facilitating access to language [2,3]. Moreover, the use of AAC has a positive social impact on individuals with language impairment: the Croix-Rouge (French Red Cross) has observed a reduction in stress, improved autonomy and health, and greater serenity and enjoyment in daily life [4]. However, not everyone has prior knowledge of AAC and pictograms. In a situation where a “verbal” person wants to communicate with an AAC user, a tool that converts the two modalities, speech and text, into a sequence of pictograms is therefore essential. By providing a relevant and comprehensible sequence of pictograms to the impaired person, communication between the two parties can be initiated.
The goal of ToPicto is to bring together linguists, computer scientists, and translators to develop new translation methods to translate either speech or text into a corresponding sequence of pictograms.
News
20.12.2024: Website goes live and registration is open.
27.01.2025: The development set is available to registered participants.
Tasks description
Participants will be asked to develop solutions for translating text or speech into a sequence of pictogram terms, each linked to a unique pictogram image from ARASAAC. The proposed models will be tested in two distinct scenarios: utterances from the general domain and utterances from the medical domain.
ImageCLEFToPicto 2025 consists of two subtasks:
Text-to-Picto
The Text-to-Picto task focuses on the automatic generation of a corresponding sequence of pictogram terms from a French text. This challenge can be seen as a translation problem where the source language is French and the target language is French pictogram terms.
The produced translation has to follow the specifications for a translation into pictograms that is understandable by AAC users.
Speech-to-Picto
Speech-to-Picto involves two modalities: speech and pictograms. The challenge is to translate speech directly into pictogram terms, without going through an intermediate transcription step, which is how current spoken language translation systems in the speech community typically operate.
Data
The data for the task is sourced from the CommonVoice v15 corpus [5] and the Orféo corpus [6]. CommonVoice is a corpus of speech data recorded by users on the Common Voice platform, based on text from various public-domain sources, including blog posts, old books, movies, and other public speech corpora. Orféo contains interactions between adults, between adults and children, and between children, covering a wide range of topics including debates, everyday situations, and medical consultations. This type of text is representative of the interactions observed between caregivers (e.g., families and medical staff) and individuals who rely on pictograms due to language impairments.
For ToPicto, we provide the sequence of pictogram terms corresponding to each speech utterance or oral transcription.
Below is detailed information about each input and the expected output format for each task.
Text-to-Picto
Input: a JSON file with the following information (provided only for training and validation data; for the test data, only the id and src fields are given):

| Tag | Definition | Example |
|-----|------------|---------|
| id | unique identifier of each utterance | cefc-tcof-Acc_del_07-1 |
| src | source of the utterance: text from the oral transcription | tu peux pas savoir |
| tgt | target of the utterance: sequence of pictogram terms (tokens) | toi pouvoir savoir non |
| pictos | list of pictogram identifiers, one per pictogram term (same length as the target)* | [6625, 35949, 16885, 5526] |
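Put together, a single training entry would look like the following (values taken from the example column above; the exact top-level layout, e.g., a list of such objects, follows the distributed files):

```json
{
  "id": "cefc-tcof-Acc_del_07-1",
  "src": "tu peux pas savoir",
  "tgt": "toi pouvoir savoir non",
  "pictos": [6625, 35949, 16885, 5526]
}
```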
Output: a JSON file with the following information:

| Tag | Definition | Example |
|-----|------------|---------|
| id | unique identifier of each utterance | cefc-tcof-Acc_del_07-1 |
| hyp | hypothesis produced by your system/model: the predicted sequence of pictogram terms | toi savoir non |
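A submission entry for the utterance above would therefore look like this (with the same caveat that the surrounding top-level layout follows the distributed files):

```json
{
  "id": "cefc-tcof-Acc_del_07-1",
  "hyp": "toi savoir non"
}
```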
Speech-to-Picto
Input: a JSON file containing the following information (provided only for training and validation data; for the test data, only the id and src fields are given):

| Tag | Definition | Example |
|-----|------------|---------|
| id | unique identifier of each utterance | cefc-tcof-Acc_del_07-1 |
| src | audio file linked to the id, in .wav format | cefc-tcof-Acc_del_07-1.wav |
| tgt | target of the utterance: sequence of pictogram terms (tokens) | toi pouvoir savoir non |
| pictos | list of pictogram identifiers, one per pictogram term (same length as the target)* | [6625, 35949, 16885, 5526] |
Output: a JSON file with the following information:

| Tag | Definition | Example |
|-----|------------|---------|
| id | unique identifier of each utterance | cefc-tcof-Acc_del_07-1 |
| hyp | hypothesis produced by your system/model: the predicted sequence of pictogram terms (tokens) | toi savoir non |
*This information is provided for reference to illustrate the input with a sequence of pictogram images. Each image can be obtained from the ARASAAC website at the following link: https://api.arasaac.org/v1/pictograms/6625
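The pictogram images referenced in the pictos field can also be fetched programmatically. The sketch below builds the API URL shown above and downloads the image to a local file; the function and file names are illustrative, and it assumes, as the link above suggests, that the endpoint returns the image directly:

```python
import urllib.request

ARASAAC_API = "https://api.arasaac.org/v1/pictograms"

def picto_url(picto_id: int) -> str:
    """Build the ARASAAC API URL for a pictogram identifier."""
    return f"{ARASAAC_API}/{picto_id}"

def download_picto(picto_id: int, path: str) -> None:
    """Save the pictogram image to a local file (requires network access)."""
    urllib.request.urlretrieve(picto_url(picto_id), path)
```

For example, download_picto(6625, "6625.png") would retrieve the image for the first pictogram of the example target "toi pouvoir savoir non".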
Data statistics
Development set
The statistics of the Text-to-Picto and Speech-to-Picto development sets are given below.
The first table provides information about the source (src). For Text-to-Picto, it includes the minimum and maximum lengths of the text. For Speech-to-Picto, it includes the minimum and maximum durations of speech utterances in seconds.
| src | Text-to-Picto train | Text-to-Picto valid | Speech-to-Picto train | Speech-to-Picto valid |
|-----|---------------------|---------------------|------------------------|------------------------|
| Number of utterances | 20,177 | 1,208 | 20,177 | 1,208 |
| Min length (words) / duration (seconds) | 1 | 1 | 0.08 | 0.22 |
| Max length (words) / duration (seconds) | 99 | 50 | 28.28 | 21.97 |
| Average length (words) / duration (seconds) | 9.8 | 10.0 | 4.00 | 4.35 |
The target (tgt) is the same for the two tasks.
| tgt | train | valid |
|-----|-------|-------|
| Number of utterances | 20,177 | 1,208 |
| Min length (tokens) | 1 | 1 |
| Max length (tokens) | 89 | 48 |
| Average length (tokens) | 8.5 | 8.6 |
| Unique tokens | 4,346 | 1,492 |
Test set
The statistics of the Text-to-Picto and Speech-to-Picto test sets are given below.
The table provides information about the source (src). For Text-to-Picto, it includes the minimum and maximum lengths of the text. For Speech-to-Picto, it includes the minimum and maximum durations of speech utterances in seconds.
| src | Text-to-Picto | Speech-to-Picto |
|-----|---------------|-----------------|
| Number of utterances | 2,904 | 2,904 |
| Min length (words) / duration (seconds) | 1 | 0.19 |
| Max length (words) / duration (seconds) | 62 | 21.35 |
| Average length (words) / duration (seconds) | 10.12 | 4.45 |
Evaluation methodology
The evaluation is conducted using sacreBLEU [7], METEOR [8], and the Picto-term Error Rate (PictoER) [9]. For all three metrics, the evaluation involves comparing the hypothesis (hyp) with the target (tgt), i.e., the sequence of pictogram terms.
SacreBLEU measures the number of common n-grams between the translation hypothesis (hyp) and the reference translation (tgt).
METEOR performs an alignment between the translation hypothesis and the reference translations, going beyond simple word matching. It takes into account not only direct matches but also those based on synonyms, morphological variations (such as lemmas and word roots), and even paraphrases. The evaluation is more nuanced because it captures additional semantic information that is not encoded in the BLEU score.
PictoER is a metric derived from the Word Error Rate (WER). Instead of counting errors at the word level, it counts errors at the token level, where each token is linked to an ARASAAC pictogram.
How are the scores computed?
Let's take the following example with hyp the hypothesis given by your system:
SacreBLEU [0 - 100]: the higher the better. A translation is generally considered good when the BLEU score is above 30.
METEOR [0 - 100]: the higher the better. A translation is generally considered good when the METEOR score is above 40.
PictoER [0 - 100]: the lower the better. It gives a quick overview of the proportion of incorrectly predicted tokens.
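As an illustration, PictoER can be computed as a standard word error rate over pictogram tokens. The following is a minimal pure-Python sketch (not the official evaluation script), comparing the example target "toi pouvoir savoir non" with the hypothesis "toi savoir non":

```python
def picto_er(ref: str, hyp: str) -> float:
    """Token-level error rate (substitutions + insertions + deletions,
    via Levenshtein distance) between a reference and a hypothesis,
    expressed as a percentage of the reference length."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return 100.0 * dp[len(r)][len(h)] / len(r)

# "pouvoir" is missing from the hypothesis: 1 deletion out of 4 tokens
print(picto_er("toi pouvoir savoir non", "toi savoir non"))  # 25.0
```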
Visualization of the output
The target (tgt) or the hypothesis (hyp) provided by your model for each utterance should be a sequence of pictogram terms (tokens). Each token should correspond to a pictogram from ARASAAC. To visualize the output with the pictogram images, we developed a platform available here: https://huggingface.co/spaces/ToPicto/Visualize-Pictograms
How to use it:
1. Write a sequence of pictogram terms.
2. The platform will display the corresponding pictogram images.
Example:
id: common_voice_fr_24203862
src: common_voice_fr_24203862.wav
tgt/hyp: quatre concert coordonner à ville et à new_york
Organizers
Diandra Fabre — <diandra.fabre(at)univ-grenoble-alpes.fr>, Université Grenoble Alpes, LIG, France
Cécile Macaire — <cecile.macaire(at)univ-grenoble-alpes.fr>, Université Grenoble Alpes, LIG, France
Benjamin Lecouteux — <benjamin.lecouteux(at)univ-grenoble-alpes.fr>, Université Grenoble Alpes, LIG, France
Didier Schwab — <didier.schwab(at)univ-grenoble-alpes.fr>, Université Grenoble Alpes, LIG, France
Acknowledgments
ToPicto is organized as part of the Pantagruel project (ANR 23-IAS1-0001), which focuses on developing and evaluating Multimodal and Inclusive Language Models for General and Clinical French. The project is supported by the following partners:
References
[1] Romski, M., & Sevcik, R. A. (2005). Augmentative communication and early intervention: Myths and realities. Infants & Young Children, 18(3), 174-185.
[2] Cataix-Nègre, É. (2017). Communiquer autrement: Accompagner les personnes avec des troubles de la parole ou du langage : les communications alternatives. De Boeck Supérieur.
[3] Beukelman, D.R. and Mirenda, P. (2013). Augmentative and Alternative Communication: Supporting Children and Adults with Complex Communication Needs. Paul H. Brookes Pub.
[4] Communication alternative améliorée (CAA) : la Croix-Rouge française dévoile sa première étude d’impact social ! (2021). Croix-Rouge. Retrieved June 28, 2023, from https://www.croix-rouge.fr/actualite/communication-alternative-amelioree-caa-la-croix-rouge-francaise-devoile-sa-premiere-etude-d-impact-social-2513
[5] Ardila et al. (2020). Common Voice: A Massively-Multilingual Speech Corpus. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4218–4222, Marseille, France. European Language Resources Association.
[6] C. Benzitoun, J.-M. Debaisieux, H.-J. Deulofeu (2016). Le projet ORFÉO : un corpus d'études pour le français contemporain. Corpus n°15, p. 91-114.
[7] Post, M. (2018). A Call for Clarity in Reporting BLEU Scores. In Proceedings of the Third Conference on Machine Translation: Research Papers. Belgium, Brussels : Association for Computational Linguistics, p. 186-191.
[8] Banerjee, S., & Lavie, A. (2005). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization (pp. 65-72).
[9] Woodard, J. P., & Nelson, J. T. (1982). An information theoretic measure of speech recognition performance. In Workshop on standardisation for speech I/O technology, Naval Air Development Center, Warminster, PA.