Machine learning training data: over 500,000 images of butterflies and moths (Lepidoptera) with species labels

Abstract Deep learning models can accelerate the processing of image-based biodiversity data and provide educational value by giving direct feedback to citizen scientists. However, the training of such models requires large amounts of labelled data and not all species are equally suited for identifi...

Bibliographic Details
Main Authors: Friederike Barkmann, Andreas Lindner, Ronald Würflinger, Helmut Höttinger, Johannes Rüdisser
Format: Article
Language: English
Published: Nature Portfolio 2025-08-01
Series: Scientific Data
Online Access: https://doi.org/10.1038/s41597-025-05708-z
_version_ 1849333730440642560
author Friederike Barkmann
Andreas Lindner
Ronald Würflinger
Helmut Höttinger
Johannes Rüdisser
author_facet Friederike Barkmann
Andreas Lindner
Ronald Würflinger
Helmut Höttinger
Johannes Rüdisser
author_sort Friederike Barkmann
collection DOAJ
description Abstract Deep learning models can accelerate the processing of image-based biodiversity data and provide educational value by giving direct feedback to citizen scientists. However, training such models requires large amounts of labelled data, and not all species are equally suited for identification from images alone. Most butterfly and many moth species (Lepidoptera), which play an important role as biodiversity indicators, are well suited for such approaches. This dataset contains over 540,000 images of 185 butterfly and moth species that occur in Austria. Images were collected by citizen scientists with the application “Schmetterlinge Österreichs”, and correct species identification was ensured by an experienced entomologist. The number of images per species ranges from one to nearly 30,000. Such a strong class imbalance is common in datasets of species records. The dataset is larger than other published datasets of butterfly and moth images and offers opportunities for the training and evaluation of machine learning models on the fine-grained classification task of species identification.
format Article
id doaj-art-ee69eafd0fa74bbd994b35b5d7f5e4a7
institution Kabale University
issn 2052-4463
language English
publishDate 2025-08-01
publisher Nature Portfolio
record_format Article
series Scientific Data
spelling doaj-art-ee69eafd0fa74bbd994b35b5d7f5e4a7
2025-08-20T03:45:45Z
eng
Nature Portfolio
Scientific Data
2052-4463
2025-08-01
12118
10.1038/s41597-025-05708-z
Machine learning training data: over 500,000 images of butterflies and moths (Lepidoptera) with species labels
Friederike Barkmann (0)
Andreas Lindner (1)
Ronald Würflinger (2)
Helmut Höttinger (3)
Johannes Rüdisser (4)
(0) Department of Ecology, University of Innsbruck
(1) Advanced Computing Austria ACA GmbH
(2) Billa Foundation Blühendes Österreich
(3) Billa Foundation Blühendes Österreich
(4) Department of Ecology, University of Innsbruck
https://doi.org/10.1038/s41597-025-05708-z
spellingShingle Friederike Barkmann
Andreas Lindner
Ronald Würflinger
Helmut Höttinger
Johannes Rüdisser
Machine learning training data: over 500,000 images of butterflies and moths (Lepidoptera) with species labels
Scientific Data
title Machine learning training data: over 500,000 images of butterflies and moths (Lepidoptera) with species labels
title_sort machine learning training data over 500 000 images of butterflies and moths lepidoptera with species labels
url https://doi.org/10.1038/s41597-025-05708-z