Viewport prediction with cross modal multiscale transformer for 360° video streaming

Bibliographic Details
Main Authors: Yangsheng Tian, Yi Zhong, Yi Han, Fangyuan Chen
Format: Article
Language: English
Published: Nature Portfolio 2025-08-01
Series: Scientific Reports
Online Access: https://doi.org/10.1038/s41598-025-16011-7
Description
Summary: In the realm of immersive video technologies, efficient 360° video streaming remains a challenge due to high bandwidth requirements and the dynamic nature of user viewports. Most existing approaches neglect the dependencies between different modalities and rarely account for personal viewing preferences; these limitations lead to inconsistent prediction performance. Here, we present a novel viewport prediction model leveraging a Cross Modal Multiscale Transformer (CMMST) that integrates user trajectory and video saliency features across different scales. Our approach outperforms baseline methods, maintaining high precision even over extended prediction horizons. By harnessing cross-modal attention mechanisms, CMMST captures intricate user preferences and viewing patterns, offering a promising solution for adaptive streaming in virtual reality and other immersive platforms. The code for this work is available at https://github.com/bbgua85776540/CMMST.
ISSN: 2045-2322
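
The summary describes fusing user head-trajectory features with video saliency features via cross-modal attention at multiple temporal scales. Below is a minimal sketch of that general idea in PyTorch; the module names, feature dimensions, scale choices, and fusion scheme are illustrative assumptions, not the authors' actual architecture (see the linked repository for the real implementation).

```python
# Hypothetical sketch, not the authors' code: trajectory features attend to
# saliency features at several temporal scales, then the scales are merged.
# All names, dimensions, and scale choices here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalBlock(nn.Module):
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        # Queries come from the trajectory stream; keys/values from saliency.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, traj, sal):
        fused, _ = self.attn(query=traj, key=sal, value=sal)
        return self.norm(traj + fused)  # residual fusion

class MultiscaleFusion(nn.Module):
    """Cross-modal attention at multiple temporal scales, averaged together."""
    def __init__(self, dim: int = 128, scales=(1, 2, 4)):
        super().__init__()
        self.pools = nn.ModuleList(nn.AvgPool1d(s, stride=s) for s in scales)
        self.blocks = nn.ModuleList(CrossModalBlock(dim) for _ in scales)

    def forward(self, traj, sal):
        outputs = []
        for pool, block in zip(self.pools, self.blocks):
            # Downsample along time, (B, T, D) -> (B, T/s, D), for coarser scales.
            t = pool(traj.transpose(1, 2)).transpose(1, 2)
            s = pool(sal.transpose(1, 2)).transpose(1, 2)
            fused = block(t, s)
            # Upsample back to the original length before merging scales.
            fused = F.interpolate(fused.transpose(1, 2), size=traj.size(1),
                                  mode="linear", align_corners=False)
            outputs.append(fused.transpose(1, 2))
        return torch.stack(outputs).mean(dim=0)

# Example: 2 s of viewing history at 10 Hz (T=20), 128-d features per modality.
traj = torch.randn(8, 20, 128)  # encoded head-orientation trajectory
sal = torch.randn(8, 20, 128)   # encoded per-frame saliency features
print(MultiscaleFusion()(traj, sal).shape)  # torch.Size([8, 20, 128])
```

In this sketch the coarser scales let attention relate long-range viewing patterns across modalities, while the finest scale preserves per-timestep detail; the fused representation would then feed a prediction head for future viewport coordinates.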