Energy-Guided Temporal Segmentation Network for Multimodal Human Action Recognition
Abstract
1. Introduction
- An effective energy-guided temporal segmentation network (EGTSN) is designed, in which the distribution of motion energy extracted from the depth videos guides the temporal segmentation, so that the individual motion stages of an action are modeled explicitly and the inter-frame motion information that a standard temporal segment network (TSN) underutilizes is exploited (a minimal sketch of this idea follows this list).
- Spatiotemporal heterogeneous two-stream networks replace the homogeneous two-stream networks of the TSN, giving the EGTSN richer feature representations while reducing network redundancy.
- Traditional optical flow is additionally computed from the depth videos and used for multimodal fusion, which reduces the misclassification of visually similar actions.
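The following is a minimal, hypothetical sketch of how energy-guided sparse sampling could be realized: per-frame motion energy is estimated from consecutive depth-frame differences, segment boundaries are placed so that each segment carries roughly the same share of the total energy, and one snippet is then sampled per segment in TSN fashion. The energy measure, the equal-energy boundary rule, and all function names here are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of energy-guided temporal segmentation (illustrative only).
import numpy as np

def motion_energy(depth_frames):
    """Per-frame motion energy: mean absolute difference between consecutive depth frames."""
    diffs = np.abs(np.diff(depth_frames.astype(np.float32), axis=0))
    energy = diffs.reshape(diffs.shape[0], -1).mean(axis=1)
    return np.concatenate(([0.0], energy))  # pad so the length matches the frame count

def energy_guided_segments(depth_frames, num_segments=3):
    """Split frame indices into segments carrying roughly equal cumulative motion energy."""
    energy = motion_energy(depth_frames)
    cdf = np.cumsum(energy)
    cdf = cdf / cdf[-1] if cdf[-1] > 0 else np.linspace(0.0, 1.0, len(energy))
    # place a boundary wherever the normalized cumulative energy crosses k / num_segments
    bounds = [int(np.searchsorted(cdf, k / num_segments)) for k in range(1, num_segments)]
    bounds = [0] + bounds + [len(depth_frames)]
    return [np.arange(bounds[i], bounds[i + 1]) for i in range(num_segments)]

def sample_snippets(segments, rng=None):
    """TSN-style sparse sampling: one random frame index per energy-guided segment."""
    rng = rng or np.random.default_rng()
    # fall back to index 0 in the (unlikely) case of an empty segment
    return [int(rng.choice(seg)) if len(seg) > 0 else 0 for seg in segments]

# Example: 120 synthetic 64x64 depth frames split into 3 energy-balanced segments.
frames = np.random.rand(120, 64, 64)
segments = energy_guided_segments(frames, num_segments=3)
snippet_indices = sample_snippets(segments)  # one frame index per segment
```

In a standard TSN the segments are simply of equal length; here the boundaries adapt to where the motion actually occurs, so high-energy stages of an action are not collapsed into a single snippet.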
2. Related Work
3. Proposed Method
3.1. Extraction of Motion Energy
3.2. Extraction of Depth Optical Flow
3.3. Energy-Guided Temporal Segmentation
3.4. Multimodal Heterogeneous Networks
4. Experiments
4.1. Evaluation Dataset
4.2. Experimental Setup
4.3. Experimental Results and Analysis
4.3.1. The Effect of Energy-Guided Temporal Segmentation
4.3.2. The Effect of Heterogeneous Networks
4.3.3. The Effect of Multimodal Fusion
4.3.4. Comparison with Other Methods
5. Conclusions
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
References
- Niu, W.; Long, J.; Han, D.; Wang, Y. Human activity detection and recognition for video surveillance. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), Taipei, Taiwan, 27–30 June 2004; pp. 719–722.
- Pickering, C.A.; Burnham, K.J.; Richardson, M.J. A research study of hand gesture recognition technologies and applications for human vehicle interaction. In Proceedings of the Institution of Engineering and Technology Conference on Automotive Electronics, Warwick, UK, 28–29 June 2007; pp. 1–15.
- Poonam, S.; Tanuja, S.; Ashwini, P.; Sonal, K. Hand gesture recognition for real time human machine interaction system. Int. J. Eng. Trends Technol. 2015, 19, 262–264.
- Lin, W.; Sun, M.-T.; Poovendran, R.; Zhang, Z. Human activity recognition for video surveillance. In Proceedings of the 2008 IEEE International Symposium on Circuits and Systems, Seattle, WA, USA, 18–21 May 2008; pp. 2737–2740.
- Kamel, A.; Sheng, B.; Yang, P.; Li, P.; Shen, R.; Feng, D.D. Deep convolutional neural networks for human action recognition using depth maps and postures. IEEE Trans. Syst. Man Cybern. Syst. 2019, 49, 1806–1819.
- Zhao, S.; Liu, Y.; Han, Y.; Hong, R.; Hu, Q.; Tian, Q. Pooling the convolutional layers in deep convnets for video action recognition. IEEE Trans. Circuits Syst. Video Technol. 2018, 28, 1839–1849.
- Shotton, J.; Fitzgibbon, A.; Cook, M.; Sharp, T.; Finocchio, M.; Moore, R.; Kipman, A.; Blake, A. Real-time human pose recognition in parts from single depth images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, USA, 20–25 June 2011; pp. 1297–1304.
- Zhang, C.; Tian, Y.; Guo, X.; Liu, J. DAAL: Deep activation-based attribute learning for action recognition in depth videos. Comput. Vis. Image Underst. 2018, 167, 37–49.
- Wang, P.; Li, W.; Gao, Z.; Tang, C.; Ogunbona, P.O. Depth pooling based large-scale 3-D action recognition with convolutional neural networks. IEEE Trans. Multimed. 2018, 20, 1051–1061.
- Sahoo, S.P.; Ari, S. Depth estimated history image based appearance representation for human action recognition. In Proceedings of the TENCON 2019—2019 IEEE Region 10 Conference (TENCON), Kochi, India, 17–20 October 2019; pp. 965–969.
- Kong, Y.; Fu, Y. Bilinear heterogeneous information machine for RGB-D action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 8–10 June 2015; pp. 1054–1062.
- Kong, Y.; Satarboroujeni, B.; Fu, Y. Learning hierarchical 3D kernel descriptors for RGB-D action recognition. Comput. Vis. Image Underst. 2016, 144, 14–23.
- Chen, L.; Song, Z.; Lu, J.; Zhou, J. Learning principal orientations and residual descriptor for action recognition. Pattern Recognit. 2019, 86, 14–26.
- Simonyan, K.; Zisserman, A. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems 27 (NIPS 2014); Neural Information Processing Systems Foundation: La Jolla, CA, USA, 2014; pp. 568–576.
- Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y.; Lin, D.; Tang, X.; Van Gool, L. Temporal segment networks: Towards good practices for deep action recognition. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 20–36.
- Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4489–4497.
- Shou, Z.; Lin, X.; Kalantidis, Y.; Sevilla-Lara, L.; Rohrbach, M.; Chang, S.-F.; Yan, Z. DMC-Net: Generating discriminative motion cues for fast compressed video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1268–1277.
- Zolfaghari, M.; Singh, K.; Brox, T. ECO: Efficient convolutional network for online video understanding. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 695–712.
- Tu, Z.; Xie, W.; Dauwels, J.; Li, B.; Yuan, J. Semantic cues enhanced multimodality multistream CNN for action recognition. IEEE Trans. Circuits Syst. Video Technol. 2019, 29, 1423–1437.
- Zolfaghari, M.; Oliveira, G.L.; Sedaghat, N.; Brox, T. Chained multi-stream networks exploiting pose, motion, and appearance for action classification and detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2923–2932.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5987–5995.
- Wang, H.; Schmid, C. Action recognition with improved trajectories. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, NSW, Australia, 1–8 December 2013; pp. 3551–3558.
- Hu, J.F.; Zheng, W.S.; Lai, J.; Zhang, J. Jointly learning heterogeneous features for RGB-D activity recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2186–2200.
- Laptev, I.; Marszałek, M.; Schmid, C.; Rozenfeld, B. Learning realistic human actions from movies. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Anchorage, AK, USA, 23–28 June 2008; pp. 1–8.
- Liu, J.; Shahroudy, A.; Xu, D.; Kot, A.C.; Wang, G. Skeleton-based action recognition using spatio-temporal LSTM network with trust gates. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 3007–3021.
- Donahue, J.; Hendricks, L.A.; Rohrbach, M.; Venugopalan, S.; Guadarrama, S.; Saenko, K.; Darrell, T. Long-term recurrent convolutional networks for visual recognition and description. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 677–691.
- Wu, Z.; Wang, X.; Jiang, Y.; Ye, H.; Xue, X. Modeling spatial-temporal clues in a hybrid deep learning framework for video classification. In Proceedings of the 23rd ACM International Conference on Multimedia, Brisbane, Australia, 26–30 October 2015; pp. 461–470.
- Shahroudy, A.; Liu, J.; Ng, T.-T.; Wang, G. NTU RGB+D: A large scale dataset for 3D human activity analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1010–1019.
- Zach, C.; Pock, T.; Bischof, H. A duality based approach for realtime TV-L1 optical flow. In Proceedings of the Joint Pattern Recognition Symposium, Heidelberg, Germany, 12–14 September 2007; pp. 214–223.
- Lin, T.; Zhao, X.; Su, H.; Wang, C.; Yang, M. BSN: Boundary sensitive network for temporal action proposal generation. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 3–21.
- Yu, W.; Yang, K.; Bai, Y.; Xiao, T.; Yao, H.; Rui, Y. Visualizing and comparing AlexNet and VGG using deconvolutional layers. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016.
- Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 448–456.
- Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Li, F.-F. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
- Wang, P.; Li, Z.; Hou, Y.; Li, W. Action recognition based on joint trajectory maps using convolutional neural networks. In Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016; pp. 102–106.
- Han, Y.; Chung, S.-L.; Chen, S.-F.; Su, S.F. Two-stream LSTM for action recognition with RGB-D-based hand-crafted features and feature combination. In Proceedings of the 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Miyazaki, Japan, 7–10 October 2018; pp. 3547–3552.
- Wang, Y.; Xu, Z.; Li, L.; Yao, J. Robust multi-feature learning for skeleton-based action recognition. IEEE Access 2019, 7, 148658–148671.
- Li, F.; Zhu, A.; Xu, Y.; Cui, R.; Hua, G. Multi-stream and enhanced spatial-temporal graph convolution network for skeleton-based action recognition. IEEE Access 2020, 8, 97757–97770.
- Zhu, J.; Zou, W.; Zhu, Z.; Xu, L.; Huang, G. Action machine: Toward person-centric action recognition in videos. IEEE Signal Process. Lett. 2019, 26, 1633–1637.
Method | RGB (%) | Depth (%)
---|---|---
TSN | 78.4 | 81.1
Energy-guided segmentation | 79.6 | 82.1
CNN backbone | RGB (%) | Depth (%) | R-Flow (%) | D-Flow (%)
---|---|---|---|---
Batch Normalization (BN)-Inception | 76.9 | 79.5 | 90.2 | 89.6
Residual Network-101 (ResNet101) | 78.1 | 81.4 | 83.3 | 84.2
ResNeXt101 | 79.6 | 82.1 | 84.7 | 85.5
Modality | Accuracy (%)
---|---
RGB | 79.6
Depth | 82.1
R-flow | 90.2
D-flow | 89.6
RGB and R-flow | 92.9
Depth and D-flow | 91.7
Multimodal fusion (EGTSN) | 94.7
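The last row of the table above reflects combining all four streams. As a hedged illustration of what score-level (late) fusion of the RGB, depth, RGB-flow, and depth-flow streams could look like, the sketch below averages per-stream softmax scores with illustrative weights; the equal weights and the softmax-level fusion rule are assumptions, not the paper's reported configuration.

```python
# Hypothetical score-level (late) fusion of four stream outputs (illustrative only).
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(logits - logits.max())
    return e / e.sum()

def fuse_streams(stream_logits, weights):
    """Weighted average of per-stream softmax scores; returns (predicted class, fused scores)."""
    fused = sum(w * softmax(s) for s, w in zip(stream_logits, weights))
    return int(np.argmax(fused)), fused

# Example with 60 action classes (as in NTU RGB+D) and equal, illustrative weights.
num_classes = 60
stream_logits = [np.random.randn(num_classes) for _ in range(4)]  # RGB, Depth, R-flow, D-flow
pred_class, fused_scores = fuse_streams(stream_logits, weights=[0.25, 0.25, 0.25, 0.25])
```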
Method | Accuracy (%, cross-subject)
---|---
Joint Trajectory Maps (JTM) [35] | 73.4
RGB, Optical Flow (OF), and Pose-Baseline [20] | 76.9
RGB, OF, and Pose-Chained [20] | 80.8
DDI, DDNI, and DDMNI [9] | 87.8
Fused View-Invariant and Difference Handcrafted cues (FVDH) [36] | 81.3
MF-NET [37] | 90.0
Multi-stream enhanced spatiotemporal graph convolutional network (MS-ESTGCN) [38] | 91.4
Action Machine [39] | 94.3
Our EGTSN | 94.7
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).