Explore human parsing modality for action recognition

Abstract Multimodal-based action recognition methods have achieved great success using the pose and RGB modalities. However, skeleton sequences lack appearance depiction, and RGB images suffer from irrelevant noise due to modality limitations. To address this, the authors introduce the human parsing feature map as...

Bibliographic Details
Main Authors: Jinfu Liu, Runwei Ding, Yuhang Wen, Nan Dai, Fanyang Meng, Fang‐Lue Zhang, Shen Zhao, Mengyuan Liu
Format: Article
Language: English
Published: Wiley 2024-12-01
Series: CAAI Transactions on Intelligence Technology
Subjects: action recognition; human parsing; human skeletons
Online Access: https://doi.org/10.1049/cit2.12366
author Jinfu Liu
Runwei Ding
Yuhang Wen
Nan Dai
Fanyang Meng
Fang‐Lue Zhang
Shen Zhao
Mengyuan Liu
author_sort Jinfu Liu
collection DOAJ
description Abstract Multimodal-based action recognition methods have achieved great success using the pose and RGB modalities. However, skeleton sequences lack appearance depiction, and RGB images suffer from irrelevant noise due to modality limitations. To address this, the authors introduce the human parsing feature map as a novel modality, since it can selectively retain effective semantic features of the body parts while filtering out most irrelevant noise. The authors propose a new dual-branch framework called the ensemble human parsing and pose network (EPP-Net), which is the first to leverage both the skeleton and human parsing modalities for action recognition. The human pose branch feeds robust skeletons into a graph convolutional network to model pose features, while the human parsing branch leverages depictive parsing feature maps to model parsing features via convolutional backbones. The two high-level features are then effectively combined through a late fusion strategy for better action recognition. Extensive experiments on the NTU RGB+D and NTU RGB+D 120 benchmarks consistently verify the effectiveness of the proposed EPP-Net, which outperforms existing action recognition methods. Our code is available at https://github.com/liujf69/EPP-Net-Action.
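The late fusion strategy described in the abstract can be sketched in a few lines: each branch produces per-class scores, and the final prediction is a weighted combination of the two. The sketch below is only illustrative; the fusion weight `alpha` and the toy logits are hypothetical, not values taken from the paper, and the real branches would be a graph convolutional network (pose) and a convolutional backbone (parsing) rather than fixed vectors.

```python
import numpy as np

def softmax(x):
    """Convert raw logits to a probability distribution over action classes."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def late_fusion(pose_logits, parsing_logits, alpha=0.6):
    """Weighted sum of the two branches' class probabilities (late fusion)."""
    return alpha * softmax(pose_logits) + (1.0 - alpha) * softmax(parsing_logits)

# Toy example with 3 action classes.
pose_logits = np.array([2.0, 0.5, 0.1])     # stand-in for GCN output on skeletons
parsing_logits = np.array([1.5, 0.2, 0.3])  # stand-in for CNN output on parsing maps
scores = late_fusion(pose_logits, parsing_logits)
pred = int(np.argmax(scores))  # index of the predicted action class
```

Because fusion happens on the two branches' output scores rather than on intermediate features, each branch can be trained and tuned independently before the weights of the combination are chosen.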
format Article
id doaj-art-b63e59b835cb47d891deb2c49c836dc0
institution Kabale University
issn 2468-2322
language English
publishDate 2024-12-01
publisher Wiley
record_format Article
series CAAI Transactions on Intelligence Technology
spelling doaj-art-b63e59b835cb47d891deb2c49c836dc0 2025-01-13T14:05:51Z
eng | Wiley | CAAI Transactions on Intelligence Technology | ISSN 2468-2322 | 2024-12-01 | vol. 9, no. 6, pp. 1623-1633 | https://doi.org/10.1049/cit2.12366
Explore human parsing modality for action recognition
Jinfu Liu (School of Intelligent Systems Engineering, Sun Yat-sen University, Shenzhen, China)
Runwei Ding (Peng Cheng Laboratory, Shenzhen, China)
Yuhang Wen (School of Intelligent Systems Engineering, Sun Yat-sen University, Shenzhen, China)
Nan Dai (Changchun University of Science and Technology, Changchun, China)
Fanyang Meng (Peng Cheng Laboratory, Shenzhen, China)
Fang-Lue Zhang (Victoria University of Wellington, Wellington, New Zealand)
Shen Zhao (School of Intelligent Systems Engineering, Sun Yat-sen University, Shenzhen, China)
Mengyuan Liu (State Key Laboratory of General Artificial Intelligence, Peking University Shenzhen Graduate School, Shenzhen, China)
Keywords: action recognition; human parsing; human skeletons
title Explore human parsing modality for action recognition
topic action recognition
human parsing
human skeletons
url https://doi.org/10.1049/cit2.12366