A multimodal framework for pepper diseases and pests detection
| Main Authors: | |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Nature Portfolio, 2024-11-01 |
| Series: | Scientific Reports |
| Online Access: | https://doi.org/10.1038/s41598-024-80675-w |
| Summary: | Abstract Pepper diseases and pests typically exhibit small target proportions, diverse shapes and sizes, complex imaging backgrounds, and similarity to the background. Existing detection methods perform poorly at identifying targets of different sizes and shapes within the same scene, and they lack adequate noise-suppression capabilities. To address the practical needs of detecting pepper diseases and pests in complex scenarios, we have constructed the first multimodal pepper disease and pest object detection dataset (PDD). This dataset includes a wide variety of disease and pest images, along with detailed natural-language descriptions of their attributes. Locating the described targets in complex scenes with similar disease symptoms and leaf occlusion presents a significant challenge. To tackle this issue, we propose the PepperNet model, which detects objects in pepper disease and pest images using natural-language descriptions. The model decomposes complex multimodal language and image features into explicit attribute features and employs a fine-grained multimodal attribute contrastive learning strategy. This approach effectively distinguishes subtle local differences between similar objects, achieving fine-grained mapping from language to vision in complex scenarios. Our detection results show an mAP@0.5 of 91.93% and a detection speed of 121.8 frames per second. Visualizations indicate that the model remains highly robust under varying noise levels and occlusion conditions, demonstrating superior performance and stability across diverse complex scenarios. |
| ISSN: | 2045-2322 |
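The summary describes aligning attribute-level language features with image features via contrastive learning. The record does not give PepperNet's actual loss, so the following is only a generic sketch of symmetric image-text contrastive alignment (an InfoNCE-style objective); the function name, temperature value, and feature shapes are all illustrative assumptions, not details from the paper.

```python
import numpy as np

def contrastive_alignment_loss(img_feats, txt_feats, temperature=0.07):
    """Symmetric InfoNCE-style loss: row i of img_feats is assumed to
    describe the same target as row i of txt_feats (the positive pair);
    all other rows in the batch serve as negatives."""
    # L2-normalize so the dot product is cosine similarity
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    txt = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # pairwise similarity matrix
    labels = np.arange(logits.shape[0])         # positives on the diagonal

    def cross_entropy(l):
        # numerically stable log-softmax over each row
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[np.arange(l.shape[0]), labels].mean()

    # average image-to-text and text-to-image directions
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

Under this objective, a batch in which each image embedding matches its paired attribute-text embedding yields a low loss, while mismatched pairings yield a high one, which is the mechanism by which contrastive training pushes similar-looking targets with different described attributes apart.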