ImageCLEFmed GAN

Welcome to the 3rd edition of the GANs task!

Motivation

Controlling the Quality of Synthetic Medical Images created via GANs

Description

AI systems for medical tasks like predicting, detecting and classifying diseases rely on the availability of large and diverse datasets for training. High-quality data allows these models to learn complex patterns and improve their accuracy and reliability. However, getting access to real medical data is not easy because of privacy concerns. Patients are usually willing to share their medical information only for their own treatment and not for research. This makes it hard to gather enough data to train AI models effectively, slowing down progress in developing better tools for healthcare.

One way to solve this problem is by creating synthetic data—artificial data that looks like real medical data but doesn’t come from actual patients. Generative models, like GANs (Generative Adversarial Networks), can be used to create these datasets. Synthetic data can help researchers build and test AI systems without needing to rely on real patient data, which protects privacy and makes it easier to collect the variety of information needed for training.

But there’s an important challenge with synthetic data: it must not include hidden details, or “fingerprints” from the real data it was trained on. If synthetic data can somehow be traced back to the original patient data, it could risk exposing private information. Ensuring synthetic data are completely free from such “fingerprints” is critical. It’s the only way to guarantee patient privacy while using synthetic data to push forward advancements in AI-powered healthcare.

Lessons learned:

  • In the first and second editions of this task, held at ImageCLEF 2023 and 2024, various generative models were analyzed within the framework of the first subtask to investigate whether synthetic images contained "fingerprints" of the real medical data used during training. The results demonstrated that generative models do retain and imprint features from their training data, raising important security and privacy concerns. These findings underscore the need for robust techniques to detect and mitigate such imprints to ensure that synthetic images protect patient privacy while maintaining their utility for research and development.

  • In the 2nd edition of the task, it was confirmed that generative models leave unique "fingerprints" on the synthetic images they produce. By analyzing images generated from various models, distinct patterns and features were identified that allowed the attribution of synthetic images to their respective generative models.

News

Both train and test datasets have been released and can be found here.

Task Description

We will continue to investigate the hypothesis that generative models generate synthetic medical images that retain "fingerprints" from the real images used during their training. These fingerprints raise important security and privacy concerns, particularly in the context of personal medical image data being used to create artificial images for various real-life applications.

The task is divided into two subtasks, both focusing on detecting and analyzing these "fingerprints" within synthetic biomedical image data to determine which real images contributed to the training process:

Subtask 1: Detect Training Data Usage

In this subtask, participants will analyze synthetic biomedical images to determine whether specific real images were used in the training process of generative models. For each real image in the test set, participants must label it as either used (1) or not used (0) for generating the given synthetic images. This task focuses on detecting the presence of training data "fingerprints" within synthetic outputs.
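To make the labeling concrete, here is a minimal nearest-neighbor sketch of what a membership detector might look like. It assumes feature vectors have already been extracted for both image sets; the function name, the toy feature arrays, and the distance threshold are illustrative assumptions, not part of the task.

```python
import numpy as np

def detect_usage(real_feats, synth_feats, threshold):
    """Label each real image 1 (used) or 0 (not used) by its nearest-neighbor
    distance to the synthetic set: images the generator memorized tend to lie
    closer to some synthetic output in feature space. Hypothetical baseline."""
    labels = []
    for r in real_feats:
        # minimum Euclidean distance from this real image to any synthetic image
        d_min = np.min(np.linalg.norm(synth_feats - r, axis=1))
        labels.append(1 if d_min < threshold else 0)
    return labels

# Toy example: "used" images sit close to the synthetic cluster.
rng = np.random.default_rng(0)
synth = rng.normal(0.0, 0.1, size=(50, 8))
used = rng.normal(0.0, 0.1, size=(3, 8))      # near the synthetic cluster
not_used = rng.normal(5.0, 0.1, size=(3, 8))  # far away
labels = detect_usage(np.vstack([used, not_used]), synth, threshold=1.0)
print(labels)  # → [1, 1, 1, 0, 0, 0]
```

Real submissions would replace the toy arrays with features from the "real_unknown" and "generated" folders and calibrate the threshold on the labeled training folders.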

Subtask 2: Identify Training Data Subsets

In this subtask, participants will link each synthetic biomedical image to the specific subset of real data used during its generation. The goal is to identify the particular dataset of real images that contributed to the training of the generative model responsible for creating each synthetic image. This requires a more detailed attribution of synthetic images to their corresponding training subsets.
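As an illustration of such an attribution, a nearest-centroid baseline could assign each synthetic image to the closest subset in feature space. Everything here (function name, 4-D features, centroid placement) is a hypothetical sketch, not the official method.

```python
import numpy as np

def attribute_subset(synth_feat, subset_centroids):
    """Assign a synthetic image to the real-data subset (1-5) whose feature
    centroid it is closest to: a nearest-centroid attribution baseline."""
    dists = [np.linalg.norm(synth_feat - c) for c in subset_centroids]
    return int(np.argmin(dists)) + 1  # labels are 1-based: t1..t5

# Toy example with well-separated subset centroids in a 4-D feature space.
centroids = [np.full(4, k) for k in range(5)]  # centroid k at (k, k, k, k)
query = np.array([2.1, 1.9, 2.0, 2.2])        # nearest to the third centroid
print(attribute_subset(query, centroids))     # → 3 (i.e. subset "t3")
```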

Both subtasks aim to advance our understanding of how generative models utilize training data, ensuring that synthetic data generation respects privacy and mitigates potential risks associated with patient confidentiality.

Data

The benchmarking dataset includes both real and synthetic biomedical images. The real images consist of axial slices of 3D CT scans from approximately 8,000 lung tuberculosis patients. These slices vary in appearance: some may look relatively “normal,” while others exhibit distinct lung lesions, including severe cases. The real images are stored in 8-bit per pixel PNG format, with dimensions of 256x256 pixels, providing a standardized resolution for analysis.

The synthetic images, also sized at 256x256 pixels, have been generated using various generative models, including Generative Adversarial Networks (GANs) and Diffusion Neural Networks. By providing both real and synthetic datasets, this task enables participants to analyze and compare the characteristics of synthetic images with their real counterparts, investigating potential "fingerprints" and patterns related to the training process.

Datasets are available here.

Subtask 1: Detect Training Data Usage

The TRAINING DATASET consists of 3 folders:

  • "generated" – contains 5,000 synthetic images generated using a Generative Adversarial Network (GAN).
  • "real_used" – contains 100 real images that were used to train the GAN to produce the synthetic images in the "generated" folder.
  • "real_not_used" – contains 100 real images that were not used in training the GAN.

The TEST DATASET consists of 2 folders:

  • "generated" – contains 2,000 additional synthetic images, generated by the same GAN model and under the same conditions as the synthetic images in the training dataset.
  • "real_unknown" – contains a mix of 500 real images; some were used in training the generative model, while others were not.

Each image in the test dataset will be assigned a label. Submission files must follow the guidelines outlined in the "Submission Instructions" section.

Subtask 2: Identify Training Data Subsets

The TRAINING DATASET consists of two main folders:

  • "generated" – contains 5 subfolders, each holding synthetic images. Each subset was generated using a different training dataset for the generative model.
  • "real" – contains 5 subfolders, each corresponding to a specific training dataset used to train the generative model. The real images in each subfolder were used to generate the synthetic images in the corresponding "generated" subfolder.

The mapping between the real and generated images is as follows:

  • Folder "t1" (real images) → Used to generate synthetic images in "gen_t1"
  • Folder "t2" (real images) → Used to generate synthetic images in "gen_t2"
  • Folder "t3" (real images) → Used to generate synthetic images in "gen_t3"
  • Folder "t4" (real images) → Used to generate synthetic images in "gen_t4"
  • Folder "t5" (real images) → Used to generate synthetic images in "gen_t5"

The TEST DATASET contains 25,000 generated images, each derived from one of the real-image subsets in the training dataset. Each image must be assigned a label consistent with those used in the training dataset.

Submission files must follow the guidelines outlined in the "Submission Instructions" section.

Evaluation methodology

Subtask 1 is assessed using the Cohen's kappa score.

Accuracy is the official evaluation metric for Subtask 2.
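For reference, both metrics can be computed from their standard definitions as in the plain NumPy sketch below; this is not the official evaluation script, only an illustration of the formulas.

```python
import numpy as np

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement corrected for chance.
    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
    p_e is the agreement expected from the label marginals alone."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes = np.union1d(y_true, y_pred)
    p_o = np.mean(y_true == y_pred)
    p_e = sum(np.mean(y_true == c) * np.mean(y_pred == c) for c in classes)
    return (p_o - p_e) / (1 - p_e)

def accuracy(y_true, y_pred):
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

# Toy run: 6 of 8 predictions agree, with balanced labels on both sides.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
print(round(cohen_kappa(y_true, y_pred), 3))  # → 0.5
print(accuracy(y_true, y_pred))               # → 0.75
```

Kappa discounts chance agreement: with balanced labels, random guessing gives 0.5 accuracy but a kappa near 0, which is why many runs in the results table score close to zero despite accuracies above 0.5.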

    Participant registration

Please refer to the general ImageCLEF registration instructions.

    Preliminary Schedule

    20.12.2024: registration opens for all ImageCLEF tasks
    25.04.2025: registration closes for all ImageCLEF tasks
    17.03.2025: Test data release starts
15.05.2025 (extended from 10.05.2025): Deadline for submitting the participant runs
    30.05.2025: Deadline for submission of working notes papers by the participants
    27.06.2025: Notification of acceptance of the working notes papers
    07.07.2025: Camera ready working notes papers
    09-12.09.2025: CLEF 2025, Madrid, Spain

    PRESENTATIONS
    Session 2: 11 September 16:30 – 18:00

1. "Reverse Engineering Generative Fingerprints in Medical Images: A Deep Learning Approach to Training Data Attribution" - Sara Nambiar, Isha Shah and Nikita Bhedasgaonkar, Pune Institute of Computer Technology, India (in person)
2. "Evaluation of the Privacy of Images Generated by ImageCLEFmedical GANs 2025 Based on Pre-trained Model Feature Extraction Methods" - Dengtao Zhang, Yunnan University, China (Online)
3. "Evaluating of the Privacy of Images Generated by ImageCLEFmedical GAN 2025 Using Similarity Classification Method Based on Image Enhancement and Deep Learning Model" - Haojie Zuo, Yunnan University, China (Online)
4. "ViT-based generative model fingerprinting" - Yijiang Zhou, Yunnan University, China (Online)
5. "Detecting Training Data Usage in Synthetic Images Using Machine Learning Techniques" - Krithikha Sanju S, Sri Sivasubramaniya Nadar College of Engineering, India (Online)
6. "Identify Training Data Subsets in GAN-Generated Medical Images" - Shruti Chandrasekar, Vedajanaani R S and Vijayalakshmi P, Sri Sivasubramaniya Nadar College of Engineering, India (Online)
7. "Detecting Training Data Fingerprints in GAN-Generated Medical Images" - Shruti Chandrasekar, Vedajanaani R S and Vijayalakshmi P, Sri Sivasubramaniya Nadar College of Engineering, India (Online)

    Submission Instructions

    Subtask 1: Detect Training Data Usage

    • The submission must include a run file named exactly: run.csv.
    • This file must be zipped; the zip file can have any name.
• The run.csv file should contain two columns, the image filename and its label, where 1 indicates the image was used for training and 0 indicates it was not.

Example Submission Structure:
• Method1.zip
└── run.csv (this file will be compared with the ground truth)

Contents of run.csv:
real_unknown_1.png 1
real_unknown_2.png 0
...
real_unknown_500.png 1

    Subtask 2: Identify Training Data Subsets

    • Each synthetic image must be labeled based on the real image subgroup it was generated from, using one of the following labels: [1, 2, 3, 4, 5].
    • The run file must be named exactly: run.csv
• It must contain two columns: the image filename and its subset label (1 to 5).
    • The run.csv file must be zipped; the zip file can have any name.

Example Submission Structure:
• Method1.zip
└── run.csv (this file will be compared with the ground truth)

Contents of run.csv:
gen_unknown_00001.png 1
gen_unknown_00002.png 2
gen_unknown_00003.png 3
...
gen_unknown_25000.png 3

    Constraints to follow:

    • Each test set image ID must appear exactly once in the submitted file.
    • The image names must be preserved exactly as they appear in the dataset.
    • Do not include any headers in the file.
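A quick self-check before zipping the run file can catch most of these issues. The sketch below is illustrative only: the helper name is made up, and since the exact delimiter is not stated (the examples show space-separated "name label" pairs in a .csv file), it accepts either a comma or whitespace.

```python
def validate_run(text, expected_ids):
    """Check a run file against the constraints above: every expected image ID
    appears exactly once, names are preserved exactly, and there is no header
    row. Accepts comma- or whitespace-separated "name label" pairs.
    Returns a list of problems (an empty list means the file passes)."""
    problems = []
    seen = []
    for n, line in enumerate(text.strip().splitlines(), start=1):
        parts = line.replace(",", " ").split()
        if len(parts) != 2:
            problems.append(f"line {n}: expected 'name label', got {line!r}")
            continue
        name, label = parts
        if not label.isdigit():
            problems.append(f"line {n}: non-numeric label (header row?)")
        seen.append(name)
    if sorted(seen) != sorted(expected_ids):
        problems.append("image IDs do not match the test set exactly once each")
    return problems

# Toy check with three hypothetical test-set IDs.
ids = ["real_unknown_1.png", "real_unknown_2.png", "real_unknown_3.png"]
good = "real_unknown_1.png 1\nreal_unknown_2.png 0\nreal_unknown_3.png 1"
bad = "image label\nreal_unknown_1.png 1"   # header row and missing IDs
print(validate_run(good, ids))              # → []
print(len(validate_run(bad, ids)))          # → 2
```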

    Results

Results of participant submissions for Subtask 1: Detect Training Data Usage.
    # Participant Run ID Entries Cohen's kappa Accuracy Precision Recall F1
    1 Neural Nexus 1878 7 0.148 0.574 0.5698 0.604 0.5864
    2 zhouyijiang1 1803 5 0.136 0.568 0.5582 0.652 0.6015
    3 zhouyijiang1 1804 5 0.136 0.568 0.5582 0.652 0.6015
    4 zhouyijiang1 1873 5 0.136 0.568 0.5582 0.652 0.6015
    5 zhouyijiang1 1802 5 0.132 0.566 0.5537 0.68 0.6104
    6 zhouyijiang1 1801 5 0.128 0.564 0.55 0.704 0.6175
    7 Neural Nexus 1880 7 0.072 0.536 0.5542 0.368 0.4423
    8 taozi 1359 8 0.064 0.532 0.5597 0.3 0.3906
    9 taozi 1367 8 0.044 0.522 0.5505 0.24 0.3343
    10 AIMultimediaLab* 1696 2 0.036 0.518 0.5162 0.572 0.5427
    11 taozi 1364 8 0.032 0.516 0.6 0.096 0.1655
    12 Neural Nexus 1881 7 0.032 0.516 0.5222 0.376 0.4372
    13 taozi 1360 8 0.032 0.516 0.5128 0.64 0.5694
    14 Neural Nexus 1877 7 0.028 0.514 0.5164 0.44 0.4752
    15 taozi 1366 8 0.02 0.51 0.5069 0.732 0.599
    16 Neural Nexus 1872 7 0.016 0.508 0.5182 0.228 0.3167
    17 Medhastra 1288 1 0.016 0.508 0.5078 0.52 0.5138
    18 taozi 1368 8 0.012 0.506 0.5092 0.332 0.4019
    19 Challengers 1811 5 0.012 0.506 0.5062 0.492 0.499
    20 ZOQ 1427 5 -0.016 0.492 0.4905 0.412 0.4478
    21 Neural Nexus 1879 7 -0.024 0.488 0.4732 0.212 0.2928
    22 Neural Nexus 1882 7 -0.028 0.486 0.4646 0.184 0.2636
    23 ZOQ 1355 5 -0.032 0.484 0.4904 0.82 0.6138
    24 SCOPE VIT Visioneers 1160 1 -0.032 0.484 0.4831 0.456 0.4691
    25 Challengers 1779 5 -0.032 0.484 0.4355 0.108 0.1731
    26 AIMultimediaLab* 1492 2 -0.044 0.478 0.4829 0.62 0.5429
    27 ZOQ 1330 5 -0.068 0.466 0.4822 0.92 0.6327
    28 ZOQ 1794 5 -0.068 0.466 0.4822 0.92 0.6327
    29 taozi 1369 8 -0.096 0.452 0.4657 0.652 0.5433
    30 Challengers 1778 5 -0.116 0.442 0.4461 0.48 0.4624
    31 ZOQ 1356 5 -0.132 0.434 0.3862 0.224 0.2835
    32 Challengers 1776 5 -0.176 0.412 0.3764 0.268 0.3131
    33 Challengers 1777 5 -0.176 0.412 0.3764 0.268 0.3131

* organizing team

Results of participant submissions for Subtask 2: Identify Training Data Subsets.
    # Participant Run ID Entries Accuracy Precision Recall F1 Specificity
    1 AIMultimediaLab* 1396 5 0.9904 0.9904 0.9904 0.9904 0.9972
    2 SDVAHCS/UCSD 1782 5 0.988 0.9882 0.988 0.9881 0.9969
    3 SDVAHCS/UCSD 1871 5 0.988 0.9882 0.988 0.9881 0.9969
    4 SDVAHCS/UCSD 1883 5 0.988 0.9882 0.988 0.9881 0.9969
    5 SDVAHCS/UCSD 1426 5 0.9878 0.9881 0.9878 0.988 0.9969
    6 SDVAHCS/UCSD 1425 5 0.9708 0.9716 0.9708 0.9711 0.9931
    7 Medhastra 1287 1 0.9484 0.9504 0.9484 0.9487 0.9879
    8 AIMultimediaLab* 1268 5 0.5236 0.5982 0.5236 0.5327 0.8799
    9 AIMultimediaLab* 1269 5 0.4913 0.5822 0.4913 0.4934 0.8744
    10 AIMultimediaLab* 1271 5 0.4904 0.5691 0.4904 0.4832 0.8753
    11 AIMultimediaLab* 1267 5 0.4112 0.4645 0.4112 0.3945 0.8547

* organizing team

    CEUR Working Notes

    For detailed instructions, please refer to this PDF file. A summary of the most important points:

    • All participating teams with at least one graded submission, regardless of the score, should submit a CEUR working notes paper.
• Teams that participated in both subtasks should generally submit only one report.

    Citations

    When referring to ImageCLEF 2025, please cite the following publication:

    @inproceedings{OverviewImageCLEF2025,
        title = {
            Overview of ImageCLEF 2025: Multimedia Retrieval in Medical, Social
            Media and Content Recommendation Applications},
        author = {
            Ionescu, Bogdan and M\"uller, Henning and Stanciu, Dan-Cristian and
            Andrei, Alexandra-Georgiana and Radzhabov, Ahmedkhan and Prokopchuk,
        Yuri and {\c{S}}tefan, Liviu-Daniel and Constantin, Mihai-Gabriel and
            Dogariu, Mihai and Kovalev, Vassili and Damm, Hendrik and R\"uckert,
            Johannes and Ben Abacha, Asma and Garc\'ia Seco de Herrera, Alba and
            Friedrich, Christoph M. and Bloch, Louise and Br\"ungel, Raphael and
            Idrissi-Yaghir, Ahmad and Sch\"afer, Henning and Schmidt, Cynthia
            Sabrina and Pakull, Tabea M. G. and Bracke, Benjamin and Pelka, Obioma
            and Eryilmaz, Bahadir and Becker, Helmut and Yim, Wen-Wai and Codella,
            Noel and Novoa, Roberto Andres and Malvehy, Josep and Dimitrov, Dimitar
            and Das, Rocktim Jyoti and Xie, Zhuohan and Shan, Hee Ming and Nakov,
            Preslav and Koychev, Ivan and Hicks, Steven A. and Gautam, Sushant and
        Riegler, Michael A. and Thambawita, Vajira and Halvorsen, P\r{a}l and
            Fabre, Diandra and Macaire, C\'ecile and Lecouteux, Benjamin and
            Schwab, Didier and Potthast, Martin and Heinrich, Maximilian and
            Kiesel, Johannes and Wolter, Moritz and Stein, Benno
        },
        year = 2025,
        month = {September 9-12},
        booktitle = {Experimental IR Meets Multilinguality, Multimodality, and Interaction},
        publisher = {Springer Lecture Notes in Computer Science LNCS},
        address = {Madrid, Spain},
        series = {
            Proceedings of the 16th International Conference of the CLEF
            Association (CLEF 2025)},
        pages = {}
    }
    
    

    When referring to ImageCLEFmedical 2025 GANs, please cite the following publication:

    @inproceedings{GAN2025,
      title   = {Overview of ImageCLEFMedical 2025 {GANs} Task: Training Data Analysis and Fingerprint Detection},
      author  = {
        Andrei, Alexandra-Georgiana and Constantin, Mihai Gabriel and Dogariu,
    Mihai and Radzhabov, Ahmedkhan and {\c{S}}tefan, Liviu-Daniel and
        Prokopchuk, Yuri and Kovalev, Vassili and M{\"u}ller, Henning and Ionescu,
        Bogdan
      },
      year    = 2025,
      month   = {September 9-12},
      booktitle = {CLEF2025 Working Notes},
      publisher = {CEUR-WS.org},
      address = {Madrid, Spain},
      series  = {{CEUR} Workshop Proceedings},
      volume  = {},
      pages   = {}
    }
    
    

    Contact

    Organizers:

    • Alexandra Andrei, alexandra.andrei(at)upb.ro, National University of Science and Technology POLITEHNICA Bucharest, Romania
    • Ahmedkhan Radzhabov, National Academy of Science of Belarus, Minsk, Belarus
    • Yuri Prokopchuk, National Academy of Science of Belarus, Minsk, Belarus
• Liviu-Daniel Ștefan, National University of Science and Technology POLITEHNICA Bucharest, Romania
• Mihai Gabriel Constantin, National University of Science and Technology POLITEHNICA Bucharest, Romania
• Mihai Dogariu, National University of Science and Technology POLITEHNICA Bucharest, Romania
    • Vassili Kovalev, vassili.kovalev(at)gmail.com, Belarusian Academy of Sciences, Minsk, Belarus
• Bogdan Ionescu, bogdan.ionescu(at)upb.ro, National University of Science and Technology POLITEHNICA Bucharest, Romania
• Henning Müller, henning.mueller(at)hevs.ch, University of Applied Sciences Western Switzerland, Sierre, Switzerland

    Acknowledgments
