STFormer: Spatio‐temporal former for hand–object interaction recognition from egocentric RGB video

Abstract: In recent years, video‐based hand–object interaction recognition has received widespread attention from researchers. However, due to the complexity and occlusion of hand movements, hand–object interaction recognition from RGB videos remains a highly challenging task. Here, an end‐to‐end spatio‐temporal former (STFormer) network for understanding hand behaviour in interactions is proposed. The network consists of three modules: a FlexiViT feature extractor, a hand–object pose estimator, and an interaction action classifier. FlexiViT extracts multi‐scale features from each image frame; the hand–object pose estimator predicts 3D hand pose keypoints and object labels for each frame; and the interaction action classifier predicts the interaction action category for the entire video. Experimental results demonstrate that the approach achieves competitive recognition accuracies of 94.96% and 88.84% on two datasets, first‐person hand action (FPHA) and 2 Hands and Objects (H2O), respectively.
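The record describes a three-stage pipeline: per-frame feature extraction, per-frame hand–object pose estimation, and video-level action classification. As an illustration of that data flow only, here is a minimal stub; every class, method, and value below is hypothetical and not taken from the paper:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class FramePrediction:
    # 21 hand keypoints in 3D (hypothetical layout) plus an object label
    hand_keypoints_3d: List[List[float]]
    object_label: str

class STFormerSketch:
    """Illustrative stub of the three-module pipeline (not the real model)."""

    def extract_features(self, frame: List[int]) -> List[float]:
        # Stand-in for FlexiViT multi-scale feature extraction.
        return [sum(frame) / len(frame)]  # a toy 1-D "feature"

    def estimate_pose(self, features: List[float]) -> FramePrediction:
        # Stand-in for the per-frame hand-object pose estimator.
        return FramePrediction(hand_keypoints_3d=[[0.0, 0.0, 0.0]] * 21,
                               object_label="cup")

    def classify_action(self, per_frame: List[FramePrediction]) -> str:
        # Stand-in for the video-level interaction action classifier:
        # aggregate per-frame predictions into a single action label.
        labels = {p.object_label for p in per_frame}
        return f"interact_with_{labels.pop()}" if len(labels) == 1 else "unknown"

    def __call__(self, video: List[List[int]]) -> str:
        preds = [self.estimate_pose(self.extract_features(f)) for f in video]
        return self.classify_action(preds)

video = [[0, 1, 2], [3, 4, 5]]  # two toy "frames"
print(STFormerSketch()(video))  # prints: interact_with_cup
```

The stub only mirrors the stated module boundaries (features per frame, pose per frame, one label per video); the actual STFormer architecture, inputs, and outputs are described in the article itself.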

Bibliographic Details
Main Authors: Jiao Liang, Xihan Wang, Jiayi Yang, Quanli Gao
Format: Article
Language: English
Published: Wiley, 2024-09-01
Series: Electronics Letters
Subjects: computer vision; image classification; pose estimation
Online Access: https://doi.org/10.1049/ell2.70010
collection DOAJ
id doaj-art-77d733feffb74d2bb88a4d79f2ef2aa4
institution Kabale University
issn 0013-5194
1350-911X
affiliation (all four authors): State‐Province Joint Engineering and Research Center of Advanced Networking and Intelligent Information Services, Xi'an Polytechnic University, Xi'an, China