Future Pose Prediction from 3D Human Skeleton Sequence with Surrounding Situation
Abstract
1. Introduction
- We propose a future pose prediction method that captures the surrounding situation of the target person by using, as additional information, an image cropped around the target person at the last frame of the input sequence.
- We also propose a novel skeleton-feature weighting method, named image-assisted attention (IAA), for future skeleton prediction. It can be applied to any existing GCN-based prediction method to exploit the surrounding information effectively.
2. Related Work
3. Future Skeleton Prediction That Captures Surrounding Situation
3.1. Overview
3.2. Surrounding Image Feature Extraction
- The center of the target person in image coordinates is calculated from the corresponding human skeleton.
- An image patch centered on the target person is cropped from the original RGB image.
- If part of the cropped patch falls outside the original image, that part is filled with zeros (see the sketch after this list).
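A minimal sketch of this cropping, assuming a square patch (e.g., 224 px, an illustrative size) and an RGB image as a NumPy array; the center point is obtained from the 2D skeleton as described above, and the function and parameter names are illustrative:

```python
import numpy as np

def crop_centered(image: np.ndarray, center_xy: tuple, size: int = 224) -> np.ndarray:
    """Crop a size x size patch centered on the target person;
    regions outside the original image are filled with zeros."""
    h, w = image.shape[:2]
    cx, cy = int(round(center_xy[0])), int(round(center_xy[1]))
    half = size // 2
    patch = np.zeros((size, size, image.shape[2]), dtype=image.dtype)

    # Source region clipped to the original image bounds.
    x0, x1 = max(cx - half, 0), min(cx - half + size, w)
    y0, y1 = max(cy - half, 0), min(cy - half + size, h)

    # Destination offsets inside the zero-filled patch.
    dx, dy = x0 - (cx - half), y0 - (cy - half)
    patch[dy:dy + (y1 - y0), dx:dx + (x1 - x0)] = image[y0:y1, x0:x1]
    return patch
```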
3.3. Image-Assisted Attention (IAA)
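As a minimal sketch of the idea, the image feature can be used to weight the skeleton feature channel-wise. The sigmoid gating, the single linear layer, and the 1536-dimensional image feature below are illustrative assumptions; the experiments in Section 4 vary the number of FC layers from one to three:

```python
import torch
import torch.nn as nn

class ImageAssistedAttention(nn.Module):
    """Sketch of IAA: modulate GCN skeleton features with an image feature.

    Assumed shapes: skeleton feature (batch, joints, feat_dim) and
    image feature (batch, img_dim); feat_dim is 64 for MSR-GCN and
    256 for Traj-GCN, matching the FC dimensions reported in Section 4.3.
    """

    def __init__(self, img_dim: int = 1536, feat_dim: int = 64):
        super().__init__()
        self.fc = nn.Linear(img_dim, feat_dim)

    def forward(self, skel_feat: torch.Tensor, img_feat: torch.Tensor) -> torch.Tensor:
        weights = torch.sigmoid(self.fc(img_feat))   # (batch, feat_dim)
        return skel_feat * weights.unsqueeze(1)      # broadcast over joints
```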
3.4. Future Skeleton Prediction Using the Modified Skeleton Feature
3.5. Model Structure and Training
4. Evaluation and Results
4.1. Outline of Experiments
4.2. Datasets
4.2.1. NTU RGB+D 120 Dataset
- Since a sample may contain multiple persons or false detections of other objects, we first select the person whose skeleton locations have the maximum variance.
- We use 12 frames (0.4 s) as input and predict the following 30 frames (1 s); multiple 42-frame subsequences were generated using a sliding window with a stride of 1.
- The number of joints was reduced from 25 to 22 to fit the input shape of the prediction models. For MSR-GCN, we prepared multi-scale skeletons with 12, 7, and 4 joints by averaging neighboring body joints, following the multi-scale joint preprocessing in MSR-GCN [16].
- Since the generated subsequences may include switching of persons or other objects, we excluded, as noise, subsequences in which the sum of the distances from “head” to “spine base” and from “spine base” to “left foot” was smaller than 60 cm. Similarly, we excluded, as noise, subsequences in which the distance between any joint and the “spine base” exceeded 140 cm.
- To make the prediction robust to the location of the target person, all skeleton sequences were aligned using the location of “spine base” in the last frame of each input (a sketch of these preprocessing steps follows this list).
- Images corresponding to the input skeletons were preprocessed as described in Section 3.2; missing body joints of the 2D skeleton were ignored when calculating the center points.
- We divided the subsequences and cropped images into a training set (50%), a validation set (25%), and a test set (25%), following the cross-subject evaluation protocol of the NTU RGB+D 120 dataset (i.e., data of the same person never appears in different sets).
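A minimal sketch of the windowing, noise-filtering, and alignment steps above, assuming skeleton arrays of shape (frames, joints, 3) with coordinates in meters (so 60 cm = 0.60 and 140 cm = 1.40); all function and parameter names are illustrative:

```python
import numpy as np

def make_subsequences(seq: np.ndarray, window: int = 42, stride: int = 1) -> list:
    """Slide a 42-frame window (12 input + 30 prediction frames) with stride 1."""
    return [seq[i:i + window] for i in range(0, len(seq) - window + 1, stride)]

def is_noise(frame: np.ndarray, head: int, spine_base: int, left_foot: int) -> bool:
    """Apply the two Section 4.2.1 criteria to one (joints, 3) frame;
    a subsequence is dropped if any of its frames is flagged."""
    dist = lambda a, b: np.linalg.norm(frame[a] - frame[b])
    too_small = dist(head, spine_base) + dist(spine_base, left_foot) < 0.60
    too_large = np.linalg.norm(frame - frame[spine_base], axis=-1).max() > 1.40
    return too_small or too_large

def align(subseq: np.ndarray, spine_base: int, last_input_frame: int = 11) -> np.ndarray:
    """Translate so 'spine base' in the last input frame sits at the origin."""
    return subseq - subseq[last_input_frame, spine_base]
```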
4.2.2. PKU-MMD Dataset
4.3. Configuration and Parameters
4.4. Evaluation Metric
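The results below report the Mean Per Joint Position Error (MPJPE) in millimeters; assuming the standard definition over $J$ joints and $T$ predicted frames:

$$\mathrm{MPJPE} = \frac{1}{TJ} \sum_{t=1}^{T} \sum_{j=1}^{J} \left\lVert \hat{\mathbf{p}}_{t,j} - \mathbf{p}_{t,j} \right\rVert_{2},$$

where $\hat{\mathbf{p}}_{t,j}$ and $\mathbf{p}_{t,j}$ are the predicted and ground-truth 3D positions of joint $j$ at frame $t$.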
4.5. Results
4.6. Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Foka, A.; Trahanias, P. Probabilistic Autonomous Robot Navigation in Dynamic Environments with Human Motion Prediction. Int. J. Soc. Robot. 2010, 2, 79–94.
- Koppula, H.S.; Saxena, A. Anticipating human activities for reactive robotic response. In Proceedings of the 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, Tokyo, Japan, 3–7 November 2013; p. 2071.
- Gong, H.; Sim, J.; Likhachev, M.; Shi, J. Multi-hypothesis motion planning for visual object tracking. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 619–626.
- Liu, H.; Wang, L. Human motion prediction for human-robot collaboration. J. Manuf. Syst. 2017, 44, 287–294.
- Gui, L.Y.; Zhang, K.; Wang, Y.X.; Liang, X.; Moura, J.M.F.; Veloso, M. Teaching Robots to Predict Human Motion. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 562–567.
- Brand, M.; Hertzmann, A. Style Machines. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, New Orleans, LA, USA, 23–28 July 2000; pp. 183–192.
- Taylor, G.W.; Hinton, G.E.; Roweis, S.T. Modeling Human Motion Using Binary Latent Variables. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 4–7 December 2006; Volume 19, pp. 1345–1352.
- Fragkiadaki, K.; Levine, S.; Felsen, P.; Malik, J. Recurrent network models for human dynamics. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4346–4354.
- Martinez, J.; Black, M.J.; Romero, J. On human motion prediction using recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4674–4683.
- Tang, Y.; Ma, L.; Liu, W.; Zheng, W.S. Long-Term Human Motion Prediction by Modeling Motion Context and Enhancing Motion Dynamics. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18), Stockholm, Sweden, 13–19 July 2018; pp. 935–941.
- Wang, B.; Adeli, E.; Chiu, H.K.; Huang, D.A.; Niebles, J.C. Imitation Learning for Human Pose Prediction. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 7123–7132.
- Li, M.; Chen, S.; Zhao, Y.; Zhang, Y.; Wang, Y.; Tian, Q. Dynamic Multiscale Graph Neural Networks for 3D Skeleton Based Human Motion Prediction. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 211–220.
- Mao, W.; Liu, M.; Salzmann, M.; Li, H. Learning Trajectory Dependencies for Human Motion Prediction. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 9488–9496.
- Cui, Q.; Sun, H.; Yang, F. Learning Dynamic Relationships for 3D Human Motion Prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 6519–6527.
- Sofianos, T.; Sampieri, A.; Franco, L.; Galasso, F. Space-Time-Separable Graph Convolutional Network for Pose Forecasting. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 11189–11198.
- Dang, L.; Nie, Y.; Long, C.; Zhang, Q.; Li, G. MSR-GCN: Multi-Scale Residual Graph Convolution Networks for Human Motion Prediction. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 11447–11456.
- Fujita, T.; Kawanishi, Y. Toward Surroundings-aware Temporal Prediction of 3D Human Skeleton Sequence. In Proceedings of the Towards a Complete Analysis of People: From Face and Body to Clothes (T-CAP), Montreal, QC, Canada, 21–25 August 2022.
- Wang, J.; Hertzmann, A.; Fleet, D.J. Gaussian Process Dynamical Models. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 4–7 December 2006; Volume 18.
- Yan, S.; Xiong, Y.; Lin, D. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018.
- Chan, W.; Tian, Z.; Wu, Y. GAS-GCN: Gated action-specific graph convolutional networks for skeleton-based action recognition. Sensors 2020, 20, 3499.
- Wang, R.; Huang, C.; Wang, X. Global Relation Reasoning Graph Convolutional Networks for Human Pose Estimation. IEEE Access 2020, 8, 38472–38480.
- Azizi, N.; Possegger, H.; Rodolà, E.; Bischof, H. 3D Human Pose Estimation Using Möbius Graph Convolutional Networks. In Proceedings of the 17th European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 160–178.
- Liu, Z.; Jiang, Z.; Feng, W.; Feng, H. OD-GCN: Object Detection Boosted by Knowledge GCN. In Proceedings of the 2020 IEEE International Conference on Multimedia & Expo Workshops, London, UK, 6–10 July 2020; pp. 1–6.
- Li, Z.; Du, X.; Cao, Y. GAR: Graph Assisted Reasoning for Object Detection. In Proceedings of the 2020 IEEE Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 1284–1293.
- Chopin, B.; Otberdout, N.; Daoudi, M.; Bartolo, A. 3D Skeleton-based Human Motion Prediction with Manifold-Aware GAN. IEEE Trans. Biom. Behav. Identity Sci. 2022.
- Sampieri, A.; di Melendugno, G.M.D.; Avogaro, A.; Cunico, F.; Setti, F.; Skenderi, G.; Cristani, M.; Galasso, F. Pose Forecasting in Industrial Human-Robot Collaboration. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 51–69.
- Corona, E.; Pumarola, A.; Alenya, G.; Moreno-Noguer, F. Context-Aware Human Motion Prediction. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 6990–6999.
- Adeli, V.; Adeli, E.; Reid, I.; Niebles, J.C.; Rezatofighi, H. Socially and Contextually Aware Human Motion and Pose Forecasting. IEEE Robot. Autom. Lett. 2020, 5, 6033–6040.
- Chao, Y.W.; Yang, J.; Price, B.; Cohen, S.; Deng, J. Forecasting human dynamics from static images. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 548–556.
- Zhang, J.Y.; Felsen, P.; Kanazawa, A.; Malik, J. Predicting 3D human dynamics from video. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 7113–7122.
- Tan, M.; Le, Q. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; Proceedings of Machine Learning Research, Volume 97, pp. 6105–6114.
- Liu, J.; Shahroudy, A.; Perez, M.; Wang, G.; Duan, L.Y.; Kot, A.C. NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2684–2701.
- Liu, C.; Hu, Y.; Li, Y.; Song, S.; Liu, J. PKU-MMD: A large scale benchmark for skeleton-based human action understanding. In Proceedings of the Workshop on Visual Analysis in Smart and Connected Communities, Mountain View, CA, USA, 27 October 2017; pp. 1–8.
- Ultralytics. YOLOv5. Available online: https://github.com/ultralytics/yolov5 (accessed on 21 November 2022).
- Kingma, D.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015.
- Ionescu, C.; Papava, D.; Olaru, V.; Sminchisescu, C. Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 1325–1339.
- Chen, C.H.; Ramanan, D. 3D Human Pose Estimation = 2D Pose Estimation + Matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5759–5767.
- Habibie, I.; Xu, W.; Mehta, D.; Pons-Moll, G.; Theobalt, C. In the Wild Human Pose Estimation Using Explicit 2D Features and Intermediate 3D Representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 10897–10906.
- Li, W.; Liu, H.; Tang, H.; Wang, P.; Van Gool, L. MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 13137–13146.
| 1 layer | Input (1st) | Input (1st) | Output of FC layers |
|---|---|---|---|
| MSR-GCN with IAA | 4224 | 1536 | 64 |
| Traj-GCN with IAA | 16,896 | 1536 | 256 |

| 2 layers | Input (1st / 2nd) | Input (1st / 2nd) | Output of FC layers |
|---|---|---|---|
| MSR-GCN with IAA | 4224 / 2112 | 1536 / 768 | 64 |
| Traj-GCN with IAA | 16,896 / 8848 | 1536 / 768 | 256 |

| 3 layers | Input (1st / 2nd / 3rd) | Input (1st / 2nd / 3rd) | Output of FC layers |
|---|---|---|---|
| MSR-GCN with IAA | 4224 / 2112 / 1056 | 1536 / 768 / 384 | 64 |
| Traj-GCN with IAA | 16,896 / 8848 / 4424 | 1536 / 768 / 384 | 256 |
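A minimal sketch of how such stacks could be assembled, assuming each additional FC layer halves its input dimension (consistent with the image-side 1536 / 768 / 384 progression above) with ReLU activations in between (an illustrative choice, not stated in the table):

```python
import torch.nn as nn

def build_fc_stack(in_dim: int, out_dim: int, num_layers: int) -> nn.Sequential:
    """Build 1-3 FC layers; the input dimension halves at each intermediate step."""
    layers, dim = [], in_dim
    for _ in range(num_layers - 1):
        layers += [nn.Linear(dim, dim // 2), nn.ReLU()]
        dim //= 2
    layers.append(nn.Linear(dim, out_dim))
    return nn.Sequential(*layers)

# Example: a 3-layer image-side stack for MSR-GCN, 1536 -> 768 -> 384 -> 64.
fc_image = build_fc_stack(1536, 64, 3)
```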
| Model | Average MPJPE (mm) ↓ | | |
|---|---|---|---|
| MSR-GCN | | | |
| Traj-GCN | | | |
| P1 model | 1 FC layer | 2 FC layers | 3 FC layers |
| MSR-GCN with IAA | | | |
| Traj-GCN with IAA | | | |
| P2 model | 1 FC layer | 2 FC layers | 3 FC layers |
| MSR-GCN with IAA | | | |
| Traj-GCN with IAA | | | |
| P3 model | 1 FC layer | 2 FC layers | 3 FC layers |
| MSR-GCN with IAA | | | |
| Traj-GCN with IAA | | | |
| Model | Average MPJPE (mm) ↓ | | |
|---|---|---|---|
| MSR-GCN | | | |
| Traj-GCN | | | |
| P1 model | 1 FC layer | 2 FC layers | 3 FC layers |
| MSR-GCN with IAA | | | |
| Traj-GCN with IAA | | | |
| P2 model | 1 FC layer | 2 FC layers | 3 FC layers |
| MSR-GCN with IAA | | | |
| Traj-GCN with IAA | | | |
| P3 model | 1 FC layer | 2 FC layers | 3 FC layers |
| MSR-GCN with IAA | | | |
| Traj-GCN with IAA | | | |
| Model | Processing Time (s) | Frames per Second (FPS) |
|---|---|---|
| MSR-GCN | | 127 |
| MSR-GCN with IAA P1 | | 32 |