Explore human parsing modality for action recognition

Abstract Multimodal-based action recognition methods have achieved great success using the pose and RGB modalities. However, skeleton sequences lack appearance depiction, and RGB images suffer from irrelevant noise due to modality limitations. To address this, the authors introduce the human parsing feature map as...

Bibliographic Details
Main Authors: Jinfu Liu, Runwei Ding, Yuhang Wen, Nan Dai, Fanyang Meng, Fang‐Lue Zhang, Shen Zhao, Mengyuan Liu
Format: Article
Language: English
Published: Wiley 2024-12-01
Series: CAAI Transactions on Intelligence Technology
Subjects: action recognition; human parsing; human skeletons
Online Access: https://doi.org/10.1049/cit2.12366
author Jinfu Liu
Runwei Ding
Yuhang Wen
Nan Dai
Fanyang Meng
Fang‐Lue Zhang
Shen Zhao
Mengyuan Liu
author_sort Jinfu Liu
collection DOAJ
description Abstract Multimodal-based action recognition methods have achieved great success using the pose and RGB modalities. However, skeleton sequences lack appearance depiction, and RGB images suffer from irrelevant noise due to modality limitations. To address this, the authors introduce the human parsing feature map as a novel modality, since it can selectively retain effective semantic features of the body parts while filtering out most irrelevant noise. The authors propose a new dual-branch framework called the ensemble human parsing and pose network (EPP-Net), which is the first to leverage both the skeleton and human parsing modalities for action recognition. The human pose branch feeds robust skeletons into a graph convolutional network to model pose features, while the human parsing branch leverages depictive parsing feature maps to model parsing features via convolutional backbones. The two high-level features are then effectively combined through a late fusion strategy for better action recognition. Extensive experiments on the NTU RGB+D and NTU RGB+D 120 benchmarks consistently verify the effectiveness of the proposed EPP-Net, which outperforms existing action recognition methods. Our code is available at https://github.com/liujf69/EPP-Net-Action.
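The late fusion strategy described in the abstract can be sketched in a few lines: each branch produces per-class scores, and the final prediction is a weighted combination of the two. The sketch below is only illustrative; the fusion weight `alpha` and the toy logits are hypothetical, not values taken from the paper, and the real branches would be a graph convolutional network (pose) and a convolutional backbone (parsing) rather than fixed vectors.

```python
import numpy as np

def softmax(x):
    """Convert raw logits to a probability distribution over action classes."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def late_fusion(pose_logits, parsing_logits, alpha=0.6):
    """Weighted sum of the two branches' class probabilities (late fusion)."""
    return alpha * softmax(pose_logits) + (1.0 - alpha) * softmax(parsing_logits)

# Toy example with 3 action classes.
pose_logits = np.array([2.0, 0.5, 0.1])     # stand-in for GCN output on skeletons
parsing_logits = np.array([1.5, 0.2, 0.3])  # stand-in for CNN output on parsing maps
scores = late_fusion(pose_logits, parsing_logits)
pred = int(np.argmax(scores))  # index of the predicted action class
```

Because fusion happens on the two branches' output scores rather than on intermediate features, each branch can be trained and tuned independently before the weights of the combination are chosen.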
format Article
id doaj-art-b63e59b835cb47d891deb2c49c836dc0
institution Kabale University
issn 2468-2322
language English
publishDate 2024-12-01
publisher Wiley
record_format Article
series CAAI Transactions on Intelligence Technology
spelling doaj-art-b63e59b835cb47d891deb2c49c836dc0 2025-01-13T14:05:51Z
eng | Wiley | CAAI Transactions on Intelligence Technology | ISSN 2468-2322 | 2024-12-01 | vol. 9, no. 6, pp. 1623-1633 | https://doi.org/10.1049/cit2.12366
Explore human parsing modality for action recognition
Jinfu Liu (School of Intelligent Systems Engineering, Sun Yat-sen University, Shenzhen, China)
Runwei Ding (Peng Cheng Laboratory, Shenzhen, China)
Yuhang Wen (School of Intelligent Systems Engineering, Sun Yat-sen University, Shenzhen, China)
Nan Dai (Changchun University of Science and Technology, Changchun, China)
Fanyang Meng (Peng Cheng Laboratory, Shenzhen, China)
Fang-Lue Zhang (Victoria University of Wellington, Wellington, New Zealand)
Shen Zhao (School of Intelligent Systems Engineering, Sun Yat-sen University, Shenzhen, China)
Mengyuan Liu (State Key Laboratory of General Artificial Intelligence, Peking University Shenzhen Graduate School, Shenzhen, China)
Keywords: action recognition; human parsing; human skeletons
title Explore human parsing modality for action recognition
topic action recognition
human parsing
human skeletons
url https://doi.org/10.1049/cit2.12366