MVR: Synergizing Large and Vision Transformer for Multimodal Natural Language-Driven Vehicle Retrieval


Bibliographic Details
Main Authors: Tareq Mahmod AlZubi, Umar Raza Mukhtar
Format: Article
Language:English
Published: IEEE 2025-01-01
Series:IEEE Access
Subjects:
Online Access:https://ieeexplore.ieee.org/document/10818666/
Description
Summary:In recent years, intelligent transportation systems have played a pivotal role in the development of smart cities, with vehicle retrieval becoming a critical component of traffic management and surveillance. Traditional vehicle retrieval systems rely heavily on image-based matching techniques derived from vehicle re-identification (VReID) tasks. However, these approaches are limited by their dependency on image queries, which may not always be available in real-world scenarios. Natural language (NL)-based vehicle retrieval systems offer a more flexible and accessible alternative by enabling users to query vehicles using textual descriptions. Despite progress in NL-based retrieval, existing methods face challenges in fully capturing multi-granularity information and aligning heterogeneous visual and linguistic inputs. This paper addresses these limitations by proposing a robust Multimodal Vehicle Retrieval (MVR) model that integrates both visual and textual data through a dual-stream architecture. Our model captures complementary local features alongside global information, including motion and environmental context. We utilize InfoNCE and instance losses to align the visual and textual modalities within a shared feature space, while post-processing modules, including Granular Vehicle Feature Refinement and Spatial Relationship Modeling, further enhance retrieval performance by refining vehicle attributes and contextual relationships. Our experiments, conducted on the CityFlow-NL dataset, demonstrate that our model achieves a 35.6% improvement in Mean Reciprocal Rank (MRR), a 41.3% increase in recall at 5 (R@5), and a 22.9% improvement in recall at 10 (R@10) compared to the baseline, overcoming the inherent challenges of cross-modal retrieval and improving real-world VReID.
ISSN:2169-3536
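
The abstract above describes aligning visual and textual embeddings in a shared feature space with an InfoNCE loss. The sketch below shows the standard symmetric InfoNCE formulation over a batch of paired image/text embeddings; the temperature value (0.07) and the symmetric two-direction averaging are common defaults, not details confirmed by the paper.

```python
import numpy as np

def info_nce_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    A minimal sketch of contrastive cross-modal alignment: row i of
    image_feats and row i of text_feats are assumed to describe the
    same vehicle, so matching pairs sit on the diagonal of the
    similarity matrix.
    """
    # L2-normalize so dot products become cosine similarities.
    image_feats = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    text_feats = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)

    # (B, B) similarity matrix, scaled by the temperature.
    logits = image_feats @ text_feats.T / temperature

    def cross_entropy_diag(l):
        # Softmax cross-entropy where the correct "class" for row i is
        # column i (the paired sample in the batch).
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))
```

Pulling matched pairs onto the diagonal while pushing mismatched pairs apart is what places the two modalities in a shared feature space: a well-aligned batch yields a loss near zero, while shuffling the pairing drives the loss up.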