Multimodal Deep Learning for Violence Detection: VGGish and MobileViT Integration With Knowledge Distillation on Jetson Nano


Bibliographic Details
Main Authors: Mohammed, Antara Labiba Swapnil, Marilyn Dip Peris, Istiaque Hasan Nihal, Riasat Khan, Mohammad Abdul Matin
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Open Journal of the Communications Society
Online Access: https://ieeexplore.ieee.org/document/10810367/
Description
Summary: The need for efficient surveillance systems that identify crimes and improve public safety is rising as violent incidents occur more frequently in public and industrial settings. To facilitate monitoring, this research proposes a multimodal deep learning architecture that automatically recognizes and categorizes suspicious occurrences. Beyond the visual data, a multimodal approach is implemented by integrating audio data from the RLVS dataset. Audio classification is performed with the VGGish and Wav2Vec 2.0 models, while various pre-trained and vision transformer-based networks are applied to the video dataset. The VGGish and MobileViT models are combined to cover the auditory and visual modalities; with the multimodal VGGish + MobileViT combination, classification accuracy and F1 score improve to 97.13% and 0.97, respectively. Knowledge distillation is employed by transferring backbone knowledge from a fine-tuned ViT teacher to a MobileViT student, training only the student model's task head. Finally, the distilled MobileViT model is deployed on a Jetson Nano edge device for on-the-spot identification at an average rate of 5–10 frames per second. The experiments demonstrate that the multimodal technique provides higher accuracy and robustness, confirming its effectiveness for real-time monitoring on the Jetson Nano and yielding a user-friendly surveillance system.
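The teacher-student transfer mentioned in the abstract typically rests on a temperature-scaled distillation loss that mixes soft teacher targets with hard labels. The following is a minimal NumPy sketch of that standard loss only, not the paper's actual implementation; the temperature `T`, weight `alpha`, and the toy two-class (violent / non-violent) batch are illustrative assumptions.

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax; higher T produces a softer distribution
    z = z / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Weighted sum of a soft (teacher-matching) term and a hard (label) term.
    T and alpha are illustrative hyperparameters, not values from the paper."""
    # Soft term: cross-entropy of student against softened teacher targets,
    # scaled by T^2 to keep gradient magnitudes comparable across temperatures
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T))
    soft = -np.mean((p_teacher * log_p_student).sum(axis=-1)) * T * T
    # Hard term: ordinary cross-entropy against the ground-truth labels
    log_p_hard = np.log(softmax(student_logits))
    hard = -np.mean(log_p_hard[np.arange(len(labels)), labels])
    return alpha * soft + (1 - alpha) * hard

# Toy batch: 2 samples, 2 classes (e.g. violent vs. non-violent)
student = np.array([[2.0, 0.5], [0.2, 1.5]])
teacher = np.array([[3.0, 0.0], [0.0, 2.0]])
labels = np.array([0, 1])
loss = distillation_loss(student, teacher, labels)
```

In the paper's setup only the student's task head is trained, so in a framework implementation the MobileViT backbone parameters would be frozen and just the classification head optimized against this combined objective.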
ISSN: 2644-125X