MVR: Synergizing Large and Vision Transformer for Multimodal Natural Language-Driven Vehicle Retrieval
| Main Authors: | , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | |
| Online Access: | https://ieeexplore.ieee.org/document/10818666/ |
| Summary: | In recent years, intelligent transportation systems have played a pivotal role in the development of smart cities, with vehicle retrieval becoming a critical component of traffic management and surveillance. Traditional vehicle retrieval systems rely heavily on image-based matching techniques derived from vehicle re-identification (VReID) tasks. However, these approaches are limited by their dependency on image queries, which may not always be available in real-world scenarios. Natural language (NL)-based vehicle retrieval systems offer a more flexible and accessible alternative by enabling users to query vehicles using textual descriptions. Despite progress in NL-based retrieval, existing methods face challenges in fully capturing multi-granularity information and aligning heterogeneous visual and linguistic inputs. This paper addresses these limitations by proposing a robust Multimodal Vehicle Retrieval (MVR) model that integrates both visual and textual data through a dual-stream architecture. Our model captures complementary local features alongside global information, including motion and environmental context. We utilize InfoNCE and instance losses to align the visual and textual modalities within a shared feature space, while post-processing modules, including Granular Vehicle Feature Refinement and Spatial Relationship Modeling, further enhance retrieval performance by refining vehicle attributes and contextual relationships. Our experiments, conducted on the CityFlow-NL dataset, demonstrate that our model achieves a 35.6% improvement in Mean Reciprocal Rank (MRR), a 41.3% increase in recall at 5 (R@5), and a 22.9% improvement in recall at 10 (R@10) compared to the baseline, overcoming the inherent challenges of cross-modal retrieval and improving real-world VReID. |
| ISSN: | 2169-3536 |
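The abstract states that InfoNCE and instance losses align the visual and textual modalities within a shared feature space. As a rough illustration only (the paper's actual implementation and hyperparameters, such as the temperature, are not given in this record), a symmetric InfoNCE over a batch of paired visual/text embeddings can be sketched in NumPy:

```python
import numpy as np

def info_nce_loss(visual_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: matched (visual, text) pairs sit on the diagonal.

    Hypothetical sketch; the temperature value 0.07 is an assumption,
    not taken from the paper.
    """
    # L2-normalize rows so dot products become cosine similarities
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (v @ t.T) / temperature  # (B, B) similarity matrix

    def xent_diag(mat):
        # cross-entropy with the diagonal entries as the positive targets
        mat = mat - mat.max(axis=1, keepdims=True)  # numerical stability
        log_probs = mat - np.log(np.exp(mat).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # average the visual-to-text and text-to-visual directions
    return (xent_diag(logits) + xent_diag(logits.T)) / 2.0
```

Minimizing this loss pulls each vehicle crop toward its own description and pushes it away from the other descriptions in the batch, which is the alignment effect the abstract describes.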
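The reported gains are expressed in MRR, R@5, and R@10. All three can be computed from the 1-based rank at which each query's ground-truth vehicle appears in the retrieved list; a minimal sketch (hypothetical helper, not code from the paper):

```python
def retrieval_metrics(ranks, ks=(5, 10)):
    """MRR and Recall@K from the 1-based rank of each query's true match.

    ranks: iterable of ints, rank of the ground-truth item per query.
    """
    ranks = list(ranks)
    n = len(ranks)
    mrr = sum(1.0 / r for r in ranks) / n                    # mean reciprocal rank
    recall = {k: sum(r <= k for r in ranks) / n for k in ks} # fraction found in top K
    return mrr, recall
```

For example, ranks of 1, 3, and 12 over three queries give MRR = (1 + 1/3 + 1/12) / 3 and R@5 = R@10 = 2/3, since the third query's match falls outside both cutoffs.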