AGPN: Action granularity pyramid network for video action recognition
IEEE Transactions on Circuits and Systems for Video Technology, 2023
Video action recognition is a fundamental task in video understanding. Action recognition in complex spatio-temporal contexts generally requires fusing action information at multiple granularities. However, existing works do not consider spatio-temporal information modeling and fusion from the perspective of action granularity. To address this problem, this paper proposes an Action Granularity Pyramid Network (AGPN) for action recognition, which can be flexibly integrated into 2D backbone networks. The core module is the Action Granularity Pyramid Module (AGPM), a hierarchical pyramid structure with residual connections that fuses multi-granularity spatio-temporal action information. From the top to the bottom level of the pyramid, the receptive field decreases and the action granularity becomes more refined. To enrich the temporal information of the inputs, a Multiple Frame Rate Module (MFM) is proposed to mix different frame rates at a fine-grained, pixel-wise level. Moreover, a Spatio-temporal Anchor Module (SAM) fixes spatio-temporal feature anchors to improve the effectiveness of feature extraction. We conduct extensive experiments on three large-scale action recognition datasets: Something-Something V1 & V2 and Kinetics-400. The results demonstrate that the proposed AGPN outperforms state-of-the-art methods for video action recognition.
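To make the pyramid idea concrete, the sketch below shows one plausible reading of the abstract: a stack of levels whose temporal receptive field shrinks from top to bottom, each with a residual connection, fused back into a 2D-backbone feature map. This is a minimal illustration based only on the abstract; the class names, kernel sizes, and fusion scheme are assumptions, not the paper's actual AGPM implementation.

```python
# Minimal sketch of a multi-granularity pyramid block (illustrative only).
# All module and parameter names are assumptions; the paper's AGPM may differ.
import torch
import torch.nn as nn


class PyramidLevel(nn.Module):
    """One pyramid level: a temporal conv whose kernel size sets the receptive field."""

    def __init__(self, channels: int, temporal_kernel: int):
        super().__init__()
        pad = temporal_kernel // 2
        # Depthwise 1D conv over the frame axis; channel count is preserved.
        self.temporal = nn.Conv1d(channels, channels, temporal_kernel,
                                  padding=pad, groups=channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, h, w) -> fold space into the batch axis.
        b, t, c, h, w = x.shape
        y = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, c, t)
        y = self.temporal(y)
        y = y.reshape(b, h, w, c, t).permute(0, 4, 3, 1, 2)
        return x + y  # residual connection, as the abstract describes


class GranularityPyramid(nn.Module):
    """Levels ordered from coarse (large temporal kernel) to fine (small kernel)."""

    def __init__(self, channels: int, kernels=(7, 5, 3)):
        super().__init__()
        self.levels = nn.ModuleList(PyramidLevel(channels, k) for k in kernels)
        self.fuse = nn.Conv2d(channels * len(kernels), channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        outs, y = [], x
        for level in self.levels:      # top (coarse) -> bottom (fine)
            y = level(y)
            outs.append(y)
        b, t, c, h, w = x.shape
        stacked = torch.cat(outs, dim=2).reshape(b * t, -1, h, w)
        fused = self.fuse(stacked).reshape(b, t, c, h, w)
        return x + fused               # residual fusion into the backbone feature


if __name__ == "__main__":
    feats = torch.randn(2, 8, 64, 14, 14)        # (batch, frames, channels, h, w)
    print(GranularityPyramid(64)(feats).shape)   # torch.Size([2, 8, 64, 14, 14])
```

Because the block preserves the feature shape, it could be dropped between stages of a 2D backbone (e.g., between ResNet stages) in the plug-in fashion the abstract mentions; how AGPN actually integrates AGPM, MFM, and SAM is specified in the full paper.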