You are here

PlantCLEF 2023

Image-based plant identification at global scale

banniere

Motivation

It is estimated that there are more than 300,000 species of vascular plants in the world. Increasing our knowledge of these species is of paramount importance for the development of human civilization (agriculture, construction, pharmacopoeia, etc.), especially in the context of the biodiversity crisis. However, the burden of systematic plant identification by human experts strongly penalizes the aggregation of new data and knowledge. Since then, automatic identification has made considerable progress in recent years as highlighted during all previous editions of PlantCLEF. Deep learning techniques now seem mature enough to address the ultimate but realistic problem of global identification of plant biodiversity in spite of many problems that the data may present (a huge number of classes, very strongly unbalanced classes, partially erroneous identifications, duplications, variable visual quality, diversity of visual contents such as photos or herbarium sheets, etc). The PlantCLEF2022 challenge edition proposes to take a step in this direction by tackling a multi-image (and metadata) classification problem with a very large number of classes (80k plant species).

Data collection

The training dataset that will be used this year can be distinguished in 2 main categories: "trusted" and "web" (i.e. with or without species labels provided and checked by human experts), totaling 4M images on 80k classes.

The "trusted" training dataset is based on a selection of more than 2.9M images covering 80k plant species shared and collected mainly by GBIF (and EOL to a lesser extent). These images come mainly from academic sources (museums, universities, national institutions) and collaborative platforms such as inaturalist or Pl@ntNet, implying a fairly high certainty of determination quality. Nowadays, many more photographs are available on these platforms for a few thousand species, but the number of images has been globally limited to around 100 images per species, favouring types of views adapted to the identification of plants (close-ups of flowers, fruits, leaves, trunks, ...), in order to not unbalance the classes and to not explode the size of the training dataset.

In contrast, the second data set is based on a collection of web images provided by search engines Google and Bing. This initial collection of several million images suffers however from a significant rate of species identification errors and a massive presence of duplicates and images less adapted for visual identification of plants (herbariums, landscapes, microscopic views...), or even off-topic (portrait photos of botanists, maps, graphs, other kingdoms of the living, manufactured objects, ...). The initial collection has been then semi-automatically revised to drastically reduce the number of these irrelevant pictures and to maximise, as for the trusted dataset, close-ups of flowers, fruits, leaves, trunks, etc. The "web" dataset finally contains about 1.1 million images covering around 57k species.

Lastly, the test set will be a set of tens of thousands pictures verified by world class experts related to various regions of the world and taxonomic groups.

Task description

The task will be evaluated as a plant species retrieval task based on multi-image plant observations from the test set. The goal will be to retrieve the correct plant species among the top results of a ranked list of species returned by the evaluated system. The participants will first have access to the training set and a few months later, they will be provided with the whole test set.
The primary metrics used for the evaluation of the task will be the Macro Averaged Mean Reciprocal Rank (MA-MRR). The MRR is a statistic measure for evaluating any process that produces a list of possible responses to a sample of queries ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. The MRR is the average of the reciprocal ranks for the whole test set:
plantclef2022results
where |Q| is the total number of query occurrences in the test set. However, given the long tail of the data distribution, in order to compensate for species that would be underrepresented in the test set, we will use a Macro-Averaged version of the MRR (average MRR per species).

How to participate ?

Coming soon

CEUR Working Notes

For detailed instructions, please refer to https://clef2023.clef-initiative.eu/index.php?page=Pages/publications.html.
A summary of the most important points:

  • All participating teams with at least one graded submission, regardless of the score, should submit a CEUR working notes paper.
  • Submission of reports is done through EasyChair – please make absolutely sure that the author (names and order), title, and affiliation information you provide in EasyChair match the submitted PDF exactly
  • Strict deadline for Working Notes Papers: 5 June 2023
  • Strict deadline for CEUR-WS Camera Ready Working Notes Papers: 7 July 2023
  • Templates are available here
  • Working Notes Papers should cite both the LifeCLEF 2023 overview paper as well as the PlantCLEF task overview paper, citation information will be added in the Citations section below as soon as the titles have been finalized.

Schedule

  • Jan 2023: registration opens for all LifeCLEF challenges
  • Jan-March 2023: training and test data release
  • 17 May 2023: deadline for submission of runs by participants
  • 26 May 2023: release of processed results by the task organizers
  • 5 June 2023: deadline for submission of working note papers by participants [CEUR-WS proceedings]
  • 23 June 2023: notification of acceptance of participant's working note papers [CEUR-WS proceedings]
  • 7 July 2023: camera ready copy of participant's working note papers and extended lab overviews by organizers
  • 18-21 Sept 2023: CLEF 2023 Thessaloniki