Re-Evaluation Method by Index Finger Position in the Face Area Using Face Part Position Criterion for Sign Language Recognition
Abstract
1. Introduction
- Coster et al. proposed the Video Transformer Network-Pose Flow (VTN-PF) model, which supplies posture information and hand geometry extracted frame by frame from RGB video to a Video Transformer Network. They achieved a recognition rate of 92.92% [24].
- The Wenbinwuee team trained multiple RGB video recognition models on RGB, optical flow, and person segmentation data using SlowFast, SlowOnly, and the Temporal Shift Module (TSM), then fused the per-model predictions, achieving a recognition rate of 96.55% [23].
- The rhythmblue6 team proposed an ensemble framework consisting of multiple neural networks (Inflated 3D (I3D), Semantics-Guided Neural (SGN), etc.) and implemented the University of Science and Technology of China-Sign Language Recognition (USTC-SLR) model for isolated signs, achieving a recognition rate of 97.62% [23].
- Ryumin et al. proposed audio-visual speech and gesture recognition using spatio-temporal features (STF) and long short-term memory (LSTM) models. The model is particularly characterized by its incorporation of lip information. They achieved a recognition rate of 98.56% and demonstrated real-time processing on mobile devices [28].
- Hrúz et al. analyzed two appearance-based approaches, I3D and TimeSformer, and one pose-based approach, SPOTER, which achieved recognition rates of 96.37% and 97.56% on the test and validation datasets, respectively [16].
- Al-Hammadi et al.’s proposed architecture consists of a few separable Three-Dimensional Graph Convolution Network (3DGCN) layers, which are enhanced by a spatial attention mechanism. They achieved a recognition rate of 93.38% [17].
- We proposed a method to reuse the estimation results produced at each epoch based on SAM-SLR, which improved the recognition rate to 98.05% [19].
- The four modalities are RGB frames, RGB flow, skeleton features, and multi-stream, each of which performs sign language recognition and feature extraction independently.
- The RGB-frame and RGB-flow modalities are modeled with a 3DCNN [1] using the ResNet2+1D [31] architecture, which separates the temporal and spatial convolutions of a 3DCNN. The model uses the ResNet2+1D-18 variant as its backbone, pre-trained on the Kinetics dataset [32]. For the RGB-frame modality, it is additionally pre-trained on the most extensive available SLR dataset, SLR500 [33], to further improve accuracy (a minimal backbone sketch follows this list).
- A separate spatial–temporal convolutional network (SSTCN [14]) was developed to learn from the entire skeleton to fully extract information from key points throughout the body.
- A multi-stream sign language graph convolutional network (SL-GCN [14]) was designed to model the embedding dynamics using the whole-body key points extracted by the pre-trained whole-body posture estimator. The predictions of the joint, bone, joint-motion, and bone-motion streams were multiplied by their respective weights and combined into the evaluation value of the multi-stream modality. The multi-stream modality achieved the highest recognition rate among the modalities, at 96.47%.
- There are two versions of late fusion.
- The first version [14] proposes model-free ensemble late fusion, a simple late fusion approach that fuses the predictions from all modalities, manually weighted with {1.0, 0.9, 0.4, 0.4} (a sketch of this weighted fusion follows this list). The recognition result was 97.62%.
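As a rough illustration of the ResNet2+1D-18 backbone used by the RGB-frame and RGB-flow modalities described above, the following sketch builds such a backbone with torchvision (assuming torchvision ≥ 0.13) and replaces the classifier head for the 226 AUTSL glosses. This is only a sketch in the spirit of the description, not the SAM-SLR training code.

```python
import torch
from torchvision.models.video import r2plus1d_18, R2Plus1D_18_Weights

# ResNet2+1D-18 backbone pre-trained on Kinetics; the classification head
# is replaced for the 226 AUTSL glosses (illustrative, not SAM-SLR's code).
backbone = r2plus1d_18(weights=R2Plus1D_18_Weights.KINETICS400_V1)
backbone.fc = torch.nn.Linear(backbone.fc.in_features, 226)

# A dummy clip: batch of 2, 3 channels, 16 frames, 112x112 pixels.
clip = torch.randn(2, 3, 16, 112, 112)
logits = backbone(clip)  # shape: (2, 226) per-class evaluation values
print(logits.shape)
```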
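The model-free late fusion itself can be sketched as a weighted sum of per-modality score vectors. The weights {1.0, 0.9, 0.4, 0.4} are the ones quoted above, but the assignment of each weight to a specific modality and the score shapes are assumptions made only for illustration.

```python
import numpy as np

num_classes = 226  # AUTSL glosses
rng = np.random.default_rng(0)

# Hypothetical per-class evaluation values from the four modalities.
scores = {
    "multi_stream": rng.random(num_classes),
    "features":     rng.random(num_classes),
    "rgb_frames":   rng.random(num_classes),
    "rgb_flow":     rng.random(num_classes),
}

# Manually chosen fusion weights; the weight-to-modality mapping is assumed.
weights = {"multi_stream": 1.0, "features": 0.9, "rgb_frames": 0.4, "rgb_flow": 0.4}

# Model-free late fusion: weighted sum of per-modality scores,
# followed by an arg-max over the fused evaluation values.
fused = sum(w * scores[m] for m, w in weights.items())
predicted_class = int(np.argmax(fused))
print(predicted_class)
```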
2. Methodology
2.1. Dataset
2.2. Architecture of SAM-SLR
2.3. Overview of the Re-Evaluation Method
2.3.1. Difference Value between Top-1 and Top-2
2.3.2. One-Handed or One- or Two-Handed Sign Language
2.3.3. Number of Index Finger Data
2.4. Index Finger Position in Face Area Using Face Part Position Criterion
2.5. Process of Staying Fingertip Decision
2.6. Re-Evaluation Process
3. Results
3.1. PC Environments
3.2. Summary of Results
3.3. The Performance of Each Conditional Branch
3.4. The Performance of Each Epoch
4. Conclusions
- When the difference between the top-1 and top-2 evaluation values in the SAM-SLR recognition results is small, the recognition rate is low (e.g., 75% or less).
- The recognition rate up to top-3 is close to 100%.
- In the face area, no special recognition processing had been performed.
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Ji, S.; Xu, W.; Yang, M.; Yu, K. 3D Convolutional Neural Networks for Human Action Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 221–231.
2. MMPose Contributors. OpenMMLab Pose Estimation Toolbox and Benchmark. 2020. Available online: https://github.com/open-mmlab/mmpose (accessed on 26 February 2023).
3. Cao, Z.; Hidalgo, G.; Simon, T.; Wei, S.E.; Sheikh, Y. OpenPose: Realtime multi-person 2D pose estimation using part affinity fields. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 172–186.
4. Google Research Team. MediaPipe. 2020. Available online: https://google.github.io/mediapipe/solutions/hands.html (accessed on 11 April 2023).
5. Wang, H.; Wang, L. Modeling temporal dynamics and spatial configurations of actions using two-stream recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 499–508.
6. Yan, S.; Xiong, Y.; Lin, D. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; pp. 7444–7452.
7. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Skeleton-based action recognition with multi-stream adaptive graph convolutional networks. IEEE Trans. Image Process. 2020, 29, 9532–9545.
8. Cheng, K.; Zhang, Y.; Cao, C.; Shi, L.; Cheng, J.; Lu, H. Decoupling GCN with DropGraph Module for Skeleton-Based Action Recognition. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 536–553.
9. Jin, S.; Xu, L.; Xu, J.; Wang, C.; Liu, W.; Qian, C.; Ouyang, W.; Luo, P. Whole-body human pose estimation in the wild. In Proceedings of the European Conference on Computer Vision (ECCV 2020), Glasgow, UK, 23–28 August 2020; pp. 196–214.
10. Xiao, Q.; Qin, M.; Yin, Y. Skeleton-based Chinese sign language recognition and generation for bidirectional communication between deaf and hearing people. Neural Netw. 2020, 125, 41–55.
11. Song, Y.F.; Zhang, Z.; Shan, C.; Wang, L. Stronger, Faster and More Explainable: A Graph Convolutional Baseline for Skeleton-Based Action Recognition. In Proceedings of the 28th ACM International Conference on Multimedia (ACM MM), Seattle, WA, USA, 12–16 October 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 1625–1633.
12. Liu, Z.; Zhang, H.; Chen, Z.; Wang, Z.; Ouyang, W. Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 143–152.
13. Vázquez-Enríquez, M.; Alba-Castro, J.L.; Fernández, L.D.; Banga, E.R. Isolated Sign Language Recognition with Multi-Scale Spatial-Temporal Graph Convolutional Networks. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Virtual, 19–25 June 2021; pp. 3457–3466.
14. Jiang, S.; Sun, B.; Wang, L.; Bai, Y.; Li, K.; Fu, Y. Skeleton aware multi-modal sign language recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 21–24 June 2021; pp. 3413–3423.
15. Jiang, S.; Sun, B.; Wang, L.; Bai, Y.; Li, K.; Fu, Y. Sign Language Recognition via Skeleton-Aware Multi-Model Ensemble. arXiv 2021, arXiv:2110.06161.
16. Hrúz, M.; Gruber, I.; Kanis, J.; Boháček, M.; Hlaváč, M.; Krňoul, Z. One Model is Not Enough: Ensembles for Isolated Sign Language Recognition. Sensors 2022, 22, 5043.
17. Al-Hammadi, M.; Bencherif, M.A.; Alsulaiman, M.; Muhammad, G.; Mekhtiche, M.A.; Abdul, W.; Alohali, Y.A.; Alrayes, T.S.; Mathkour, H.; Faisal, M.; et al. Spatial Attention-Based 3D Graph Convolutional Neural Network for Sign Language Recognition. Sensors 2022, 22, 4558.
18. Dafnis, K.M.; Chroni, E.; Neidle, C.; Metaxas, D.N. Bidirectional Skeleton-Based Isolated Sign Recognition using Graph Convolution Networks. In Proceedings of the 13th Conference on Language Resources and Evaluation (LREC), Marseille, France, 20–25 June 2022; pp. 7328–7338.
19. Hori, N.; Yamamoto, M. Sign Language Recognition using the reuse of estimate results by each epoch. In Proceedings of the 7th International Conference on Frontiers of Signal Processing (ICFSP), Paris, France, 7–9 September 2022; pp. 45–50.
20. Sincan, O.M.; Keles, H.Y. AUTSL: A Large Scale Multi-Modal Turkish Sign Language Dataset and Baseline Methods. IEEE Access 2020, 8, 181340–181355.
21. Sincan, O.M.; Tur, A.O.; Keles, H.Y. Isolated sign language recognition with multi-scale features using LSTM. In Proceedings of the 27th Signal Processing and Communications Applications Conference (SIU), Sivas, Turkey, 24–26 April 2019; pp. 1–4.
22. Graves, A.; Schmidhuber, J. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw. 2005, 18, 602–610.
23. Sincan, O.M.; Jacques Junior, J.C.S.; Escalera, S.; Keles, H.Y. ChaLearn LAP large scale signer independent isolated sign language recognition challenge: Design, results and future research. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Nashville, TN, USA, 19–25 June 2021; pp. 3467–3476.
24. Coster, M.D.; Herreweghe, M.V.; Dambre, J. Isolated Sign Recognition from RGB Video using Pose Flow and Self-Attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 3436–3445.
25. Liu, Z.; Ning, J.; Cao, Y.; Wei, Y.; Zhang, Z.; Lin, S.; Hu, H. Video Swin Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 3202–3211.
26. Fan, H.; Xiong, B.; Mangalam, K.; Li, Y.; Yan, Z.; Malik, J.; Feichtenhofer, C. Multiscale vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Online, 11–17 October 2021; pp. 6824–6835.
27. Novopoltsev, M.; Verkhovtsev, L.; Murtazin, R.; Milevich, D.; Zemtsova, I. Fine-tuning of sign language recognition models: A technical report. arXiv 2023, arXiv:2302.07693v2.
28. Ryumin, D.; Ivanko, D.; Ryumina, E. Audio-Visual Speech and Gesture Recognition by Sensors of Mobile Devices. Sensors 2023, 23, 2284.
29. Zach, C.; Pock, T.; Bischof, H. A Duality Based Approach for Realtime TV-L1 Optical Flow. In Pattern Recognition, Proceedings of the 29th DAGM Symposium, Heidelberg, Germany, 12–14 September 2007; Springer: Berlin/Heidelberg, Germany, 2007; pp. 214–223.
30. Wang, S.; Li, Z.; Zhao, Y.; Xiong, Y.; Wang, L.; Lin, D. Denseflow. 2020. Available online: https://github.com/open-mmlab/denseflow (accessed on 26 February 2023).
31. Tran, D.; Wang, H.; Torresani, L.; Ray, J.; LeCun, Y.; Paluri, M. A Closer Look at Spatiotemporal Convolutions for Action Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–25 June 2018; pp. 6450–6459.
32. Carreira, J.; Noland, E.; Banki-Horvath, A.; Hillier, C.; Zisserman, A. A short note about Kinetics-600. arXiv 2018, arXiv:1808.01340.
33. Zhang, J.; Zhou, W.; Xie, C.; Pu, J.; Li, H. Chinese sign language recognition with adaptive HMM. In Proceedings of the 2016 IEEE International Conference on Multimedia and Expo (ICME), Seattle, WA, USA, 11–15 July 2016; pp. 1–6.
34. Li, D.; Rodriguez, C.; Yu, X.; Li, H. Word-level deep sign language recognition from video: A new large-scale dataset and methods comparison. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 4–8 January 2020; pp. 1459–1469.
35. Albanie, S.; Varol, G.; Momeni, L.; Afouras, T.; Chung, J.S.; Fox, N.; Zisserman, A. BSL-1K: Scaling up co-articulated sign language recognition using mouthing cues. In Proceedings of the 16th European Conference on Computer Vision (ECCV 2020), Glasgow, UK, 23–28 August 2020; pp. 35–53.
| Datasets | Language | Glosses | Signers | Samples |
|---|---|---|---|---|
| AUTSL [20] | Turkish | 226 | 43 | 36,302 |
| SLR500 [33] | Chinese | 500 | 50 | 125,000 |
| WLASL2000 [34] | American | 2000 | 119 | 21,083 |
| BSL-1K [35] | British | 1064 | 40 | 273,000 |
| Dataset | Signs | Signers | Language | Frames | Training Samples | Validation Samples | Testing Samples | Total Samples |
|---|---|---|---|---|---|---|---|---|
| AUTSL | 226 | 43 | Turkish | 57–157 | 28,142 | 4418 | 3742 | 36,302 |
| Value of Difference | Number of Matched | Number of Correct | Number of Incorrect | Top-1 Acc. (%) |
|---|---|---|---|---|
| 1.0 | 64 | 38 | 26 | 59.38 |
| 2.0 | 132 | 91 | 41 | 68.94 |
| 3.0 | 226 | 171 | 55 | 75.66 |
| … | … | … | … | … |
| 9.0 | 1984 | 1907 | 77 | 96.12 |
| … | … | … | … | … |
| 14.0 | 3742 | 3665 | 77 | 97.94 |
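The conditional branch of Section 2.3.1 can be illustrated with a minimal sketch. The function name and the default threshold of 3.0 are assumptions chosen only to mirror the rows of the table above; the actual evaluation values would come from the fused SAM-SLR scores.

```python
import numpy as np

def needs_reevaluation(evaluation_values, threshold=3.0):
    """Return True when the difference between the top-1 and top-2
    evaluation values is at most `threshold`, i.e. the prediction is
    considered unreliable and the sample is sent to re-evaluation."""
    top_two = np.sort(np.asarray(evaluation_values))[-2:]
    return float(top_two[1] - top_two[0]) <= threshold

# Hypothetical fused evaluation values for one 226-class AUTSL sample.
values = np.random.default_rng(1).random(226) * 14.0
print(needs_reevaluation(values))
```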
| t | x | y | di | 1st | 2nd | 3rd | 4th |
|---|---|---|---|---|---|---|---|
| 14 | 266 | 130 | 13.0064 | −3 | | | |
| 15 | 262 | 123 | 17.5491 | −2 | | | |
| 16 | 259 | 114 | 10.4868 | −1 | | | |
| 17 | 259 | 115 | 3.2361 | 0 | | | |
| 18 | 257 | 114 | 4.4721 | 1 | −3 | | |
| 19 | 255 | 115 | 3.2361 | 2 | −2 | | |
| 20 | 254 | 115 | 2.0000 | 3 | −1 | | |
| 21 | 255 | 115 | 1.0000 | | 0 | | |
| 22 | 255 | 115 | 3.0000 | | 1 | −3 | |
| 23 | 258 | 115 | 10.6158 | | 2 | −2 | |
| 24 | 265 | 112 | 9.8518 | | 3 | −1 | |
| 25 | 267 | 111 | 13.0064 | | | 0 | |
| 26 | 263 | 121 | 24.1120 | | | 1 | −3 |
| 27 | 260 | 134 | 17.8138 | | | 2 | −2 |
| 28 | 264 | 132 | 6.7082 | | | 3 | −1 |
| 29 | 265 | 134 | 3.6503 | | | | 0 |
| 30 | 264 | 135 | 5.0198 | | | | 1 |
| 31 | 267 | 137 | 9.6883 | | | | 2 |
| 32 | 266 | 143 | 53.1784 | | | | 3 |
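One plausible reading of the staying-fingertip decision illustrated by this table is sketched below. This is a hedged illustration only: the distance threshold, the window length, and the exact definition of di are assumptions, not the paper's specification; di is taken here to be the fingertip movement distance between neighbouring frames.

```python
import numpy as np

def staying_frames(xy, dist_threshold=5.0, window=3):
    """Frames where the index fingertip is judged to be 'staying':
    its frame-to-frame movement distance stays below dist_threshold
    for `window` consecutive steps."""
    xy = np.asarray(xy, dtype=float)
    d = np.linalg.norm(np.diff(xy, axis=0), axis=1)  # per-step distance
    stays = []
    for t in range(window - 1, len(d)):
        if np.all(d[t - window + 1:t + 1] < dist_threshold):
            stays.append(t + 1)  # index of the frame reached by step d[t]
    return stays

# Fingertip trajectory (x, y) per frame, taken from the first rows above.
trajectory = [[266, 130], [262, 123], [259, 114], [259, 115],
              [257, 114], [255, 115], [254, 115], [255, 115]]
print(staying_frames(trajectory))  # e.g. [5, 6, 7]
```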
| Model | Fine-Tuning | Acc. in Publication (%) | Acc. on Our PC (%) |
|---|---|---|---|
| Baseline [20] | - | 49.22 | - |
| VTN-PF [24] | With validation data | 92.92 | - |
| Enhanced 3DGCN [17] | No | 93.38 | - |
| MViT-SLR [27] | - | 95.72 | - |
| Wenbinwuee team [23] | With validation data | 96.55 | - |
| Neural Ens. [16] | No | 96.37 | - |
| USTC-SLR [23] | With validation data | 97.62 | - |
| Second version of SAM-SLR [15] | No | 98.00 | - |
| STF + LSTM [28] | - | 98.56 | - |
| SAM-SLR [14] | No | 97.62 | 97.94 |
| +Last ours (proposed joint and bone streams) [19] | No | - | 98.05 |
| +Ours-1 (re-evaluation method) | No | - | 98.24 |
| +Ours-2 (Last ours and re-evaluation method) | No | - | 98.21 |
| Base Model | Re-Evaluation Method (Finger Position Evaluation) | Acc. (%) | Average Acc. (%), Epochs 150–229 |
|---|---|---|---|
| SAM-SLR [14] (Ours-1) | - | 97.94 | 97.73 |
| | Absolute | 98.05 | 97.82 |
| | Relative | 98.16 | 97.92 |
| | Absolute and relative | 98.24 | 98.00 |
| SAM-SLR with proposed joint and bone streams [19] (Ours-2) | - | 98.05 | 97.79 |
| | Absolute | 98.00 | 97.82 |
| | Relative | 98.13 | 97.96 |
| | Absolute and relative | 98.21 | 98.04 |
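To make the "absolute" versus "relative" distinction in this table concrete, the following sketch contrasts raw fingertip image coordinates with coordinates expressed against a face-part criterion. The landmark set, its coordinate values, and the nearest-part rule are illustrative assumptions, not the paper's definition.

```python
import numpy as np

# Hypothetical face-part landmark positions in image coordinates,
# e.g. as returned by a whole-body pose estimator.
FACE_PARTS = {
    "left_eye":  np.array([250.0, 105.0]),
    "right_eye": np.array([285.0, 105.0]),
    "nose":      np.array([268.0, 120.0]),
    "mouth":     np.array([268.0, 140.0]),
}

def absolute_position(fingertip):
    """Absolute evaluation: the raw fingertip image coordinates."""
    return np.asarray(fingertip, dtype=float)

def relative_position(fingertip):
    """Relative evaluation: offset of the fingertip from the nearest
    face part, together with that part's name."""
    p = np.asarray(fingertip, dtype=float)
    name, pos = min(FACE_PARTS.items(), key=lambda kv: np.linalg.norm(p - kv[1]))
    return name, p - pos

print(absolute_position([266, 130]))
print(relative_position([266, 130]))
```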
One-handed sign language

| Number of Index Finger Data | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|
| Total number | 132 | 132 | 132 | 132 | 132 | 132 |
| Number of correct | 98 | 95 | 102 | 99 | 97 | 95 |
| Correct rate (%) | 74.24 | 71.97 | 77.27 | 75.00 | 73.48 | 71.97 |
| Number of re-evaluations | 41 | 41 | 35 | 30 | 20 | 13 |
| Number of correct | 30 | 27 | 29 | 22 | 17 | 12 |
| Correct rate (%) | 73.17 | 65.86 | 82.86 | 73.33 | 85.00 | 92.31 |
| Final recognition rate (%) | 98.13 | 98.05 | 98.24 | 98.16 | 98.10 | 98.05 |

One- or two-handed sign language

| Number of Index Finger Data | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|
| Total number | 132 | 132 | 132 | 132 | 132 | 132 |
| Number of correct | 93 | 90 | 99 | 98 | 96 | 94 |
| Correct rate (%) | 70.45 | 68.18 | 75.00 | 74.24 | 72.72 | 71.21 |
| Number of re-evaluations | 54 | 51 | 42 | 33 | 21 | 14 |
| Number of correct | 35 | 30 | 32 | 23 | 17 | 12 |
| Correct rate (%) | 64.81 | 58.82 | 76.19 | 69.70 | 80.95 | 85.71 |
| Final recognition rate (%) | 98.00 | 97.92 | 98.16 | 98.13 | 98.08 | 98.02 |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).