Applying Hand Gesture Recognition For User Guide Application Using Mediapipe
Proceedings of the 2nd International Seminar of Science and Applied Technology (ISSAT 2021)
ABSTRACT
Hand gesture recognition is considered important, with the development of Industry 4.0 technology, in Human-Computer Interaction (HCI): it gives computers the ability to capture and interpret hand gestures and to execute commands without the user physically touching a device. MediaPipe is a framework with built-in machine learning that provides a solution for hand gesture recognition systems. In this research, we develop a simple user guide application using the MediaPipe framework. A user guide is commonly known as technical documentation, or a manual for a particular system, intended to assist people. It gives step-by-step descriptions of how to handle a particular system and helps users deal with frustration by giving them the means to identify, understand, and resolve, by themselves, technical problems that frequently occur. In our experiment, we captured real-time images using a Kinect, trained on a variety of hand gesture data, identified each hand gesture, and recognized hand gestures to convey the corresponding information in the user guide application. The user can retrieve user guide information based on the recognized hand gestures. We propose using hand gesture recognition with MediaPipe in our application to make the user guide application more convenient to use and to turn a manual user guide application into a more interactive one.
1. INTRODUCTION

We are now in the era of Industry 4.0, the Fourth Industrial Revolution, which requires automation and computerization, realized through the consolidation of various physical and digital technologies such as sensors, embedded systems, Artificial Intelligence (AI), Cloud Computing, Big Data, Adaptive Robotics, Augmented Reality, Additive Manufacturing (AM), and the Internet of Things (IoT) [1]. Enhanced digital connectivity has made technology a crucial requirement for carrying out our daily activities, such as doing tasks or work, shopping, communication, entertainment, and even searching for information or news [2]. Technology increasingly works through machines and through advances in interaction that use a broad range of gestures to recognize, communicate, or interact.

A gesture is a form of non-verbal or non-vocal communication that uses body movement to convey a particular message; among the parts of the human body, the hand and the face are the most commonly adopted [3]. Gesture-based interaction, introduced by Krueger as a new type of Human-Computer Interaction (HCI) in the mid-1970s, has become an attractive area of research. In HCI, building application interfaces that engage parts of the human body to communicate naturally has received great research attention, especially the hands, which are the most effective alternative interaction tool considering their capability [4].

Through HCI, recognizing hand gestures can help achieve the desired ease and naturalness of interaction [5]. When interacting with other people, hand movements carry meaning and convey information, ranging from simple hand movements to more complex ones. For example, we can use our hands to point at something (an object or a person), or use different hand shapes and movements expressed through manual articulations combined with their own grammar and lexicon, known as sign languages. Hence, using hand gestures as an input device integrated with computers can help people communicate more intuitively [5].

Currently, many machine learning frameworks or libraries for hand gesture recognition have been built to make it easier for anyone to build AI (Artificial Intelligence) based applications. One of them is MediaPipe. The MediaPipe framework was introduced by Google for solving problems with machine learning, with solutions such as Face Detection, Face Mesh, Iris, Hands, Pose, Holistic, Hair Segmentation, Object Detection, Box Tracking, Instant Motion Tracking, Objectron, and KNIFT. MediaPipe helps developers focus on algorithm and model development for their application, and then supports the application environment with results reproducible across different devices and platforms; these are a few of the advantages of the MediaPipe framework [6].

In this paper, we focus on developing a manual user guide application with an improved application architecture by applying hand gesture recognition, using the MediaPipe framework and a Kinect camera to capture the hand poses to be recognized. Hand gesture recognition makes our user guide application more interactive.

2. RELATED WORK

2.1 Hand Gesture Recognition

Gesture recognition is an essential topic in computer science; it builds technology that aims to interpret human gestures, so that anyone can use simple gestures to interact with a device without touching it directly. The entire procedure of tracking gestures, representing them, and converting them into some purposeful command is known as gesture recognition [7]. The aim of hand gesture recognition is to identify explicit hand gestures as input and, through mapping, turn their representation into output commands for the device.

In the literature, recognition of hand gestures based on the extracted features is divided into three groups, as follows:

● High-Level Feature-Based Approaches: Aim to figure out the position of the palm and the joint angles, using features such as the fingertips, joint locations, or anchor points of the palm [8][9][10][11]. However, collisions or occlusions in the image are difficult to detect after the features are extracted [12], and the sensitivity of segmentation performance on 2D hand images [4] is a problem that occurs frequently. The gestures are then defined from the results with a set of rules and conditions on the vectors and joints of the hand [13].

● Low-Level Feature-Based Approaches: Use features that can be extracted quickly and are robust to noise. Ren [14] recognized hand shapes as cluster-based signatures using a novel distance metric called the Finger-Earth Mover's Distance. Starner [15] determined an elliptical bounding region of the hand to implement hand recognition based on principal axes. Yang [16] researched using the optical flow of the hand region as a low-level feature. Low-level feature-based approaches are not efficient against cluttered backgrounds [4].

● 3D Reconstruction-Based Approaches: Use 3D feature models to achieve a complete interpretation of the hand. Research [17] showed that successfully segmenting the hand by skin color needs similarity and high contrast between the background and the hand, using structured light to obtain 3D depth data. Another work [18] used a stereo camera to track numerous interest points on the surface of the hand, which makes robust 3D reconstruction difficult to handle, although 3D data contains valuable information that can help remove ambiguity. See [19][20] for more 3D reconstruction-based approaches.

From the literature, there are also three hand gesture recognition methods, as follows:

● Machine Learning Approaches: The output comes from a stochastic process, with approaches based on statistical modeling of dynamic gestures, such as PCA, HMMs [21][22][23][24], advanced particle filtering [26], and the condensation algorithm [25].

● Algorithmic Approaches: A collection of manually encoded conditions and constraints that define dynamic gestures. Galveia [27] applied a 3rd-degree polynomial equation to determine the dynamic component of hand gestures (creating the 3rd-degree polynomial equation, recognition, reducing the complexity of the equations, and comparison handling in the gesture library).

● Rule-Based Approaches: Suitable for both dynamic and static gestures, these contain a set of pre-encoded rules over the input features [4]. The features of the input gestures are extracted and compared to the encoding rules that are the flow of
Advances in Engineering Research, volume 207
the recognized gestures; an input that matches a rule is output as the corresponding known gesture [28].

2.2 MediaPipe Framework

Today, there are many machine learning frameworks or libraries for hand gesture recognition. One of them is MediaPipe. MediaPipe is a framework designed for implementing production-ready machine learning: it builds pipelines to perform inference over arbitrary sensory data, has published code accompanying its research work, and builds technology prototypes [6]. In MediaPipe, modular graph components form a perception pipeline, together with inference model functions, media processing functions, and data transformations [29]. Graphs of operations are also used in other machine learning frameworks such as TensorFlow [30], MXNet [31], PyTorch [32], CNTK [33], and OpenCV 4.0 [34].

Using MediaPipe for hand gesture recognition was researched by Zhang [35], who used a single RGB camera for AR/VR applications in a real-time system that predicts the skeleton of a human hand. We can develop applications that combine MediaPipe with other devices. The MediaPipe pipeline, shown in Figure 1, consists of the following models for hand gesture recognition [29][35][36]:

1. A palm detector model that processes the captured image and returns the image with an oriented bounding box of the hand,
2. A hand landmark model that processes the cropped bounding box image and returns 3D hand key points,
3. A gesture recognizer that classifies the configuration of the 3D hand key points into a discrete set of gestures.

2.2.1 Palm Detector Model

The MediaPipe framework has a built-in initial palm detector called BlazePalm. Detecting the hand is a complex task. The first step is to train a palm detector instead of a hand detector, then to apply a non-maximum suppression algorithm to the palms, which are modeled with square bounding boxes to avoid other aspect ratios, reducing the number of anchors by a factor of 3-5. Next, an encoder-decoder feature extractor is used for larger scene context-awareness, even for small objects; lastly, the focal loss is minimized during training to support the large number of anchors resulting from the high scale variance [35][36].

2.2.2 Hand Landmark

The hand landmark model in MediaPipe achieves precise key point localization of 21 key points with 3D hand-knuckle coordinates inside the detected hand regions, through a regression that produces the coordinate predictions directly [35][36]; see Figure 2.

Figure 2 Hand Landmark in MediaPipe [38]

Each hand-knuckle landmark has a coordinate composed of x, y, and z, where x and y are normalized to [0.0, 1.0] by the image width and height, while z represents the depth of the landmark, with the depth at the wrist taken as the origin. The closer the landmark is to the camera, the smaller the value becomes.
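To make the coordinate convention concrete, the following is a minimal, hypothetical sketch (not the paper's implementation) of a rule-based check over the 21 normalized landmarks, in the spirit of the rule-based approaches of Section 2.1. Landmark indices follow the MediaPipe Hands numbering (e.g., 8 = index fingertip, 6 = index PIP joint); in MediaPipe's Python API, these coordinates come from `results.multi_hand_landmarks`. The sample coordinate values and gesture names below are made up for illustration.

```python
# Hypothetical rule-based gesture check on MediaPipe's 21 normalized hand
# landmarks. Indices follow MediaPipe Hands (8/12/16/20 = fingertips,
# 6/10/14/18 = the corresponding PIP joints); the thumb is ignored here.
FINGER_TIPS = {"index": 8, "middle": 12, "ring": 16, "pinky": 20}
FINGER_PIPS = {"index": 6, "middle": 10, "ring": 14, "pinky": 18}

def extended_fingers(landmarks):
    """Fingers whose tip lies above its PIP joint (for an upright hand)."""
    return [name for name, tip in FINGER_TIPS.items()
            if landmarks[tip][1] < landmarks[FINGER_PIPS[name]][1]]

def classify(landmarks):
    """Map the set of extended fingers to a named gesture (toy rule set)."""
    up = set(extended_fingers(landmarks))
    if up == {"index"}:
        return "point"
    if up == {"index", "middle", "ring", "pinky"}:
        return "open_hand"
    if not up:
        return "fist"
    return "unknown"

# Made-up (x, y, z) values for an upright open hand; x and y are in
# [0.0, 1.0] with y growing downward, z is depth relative to the wrist.
open_hand = {6: (0.40, 0.50, 0.0), 8: (0.40, 0.30, 0.0),
             10: (0.50, 0.50, 0.0), 12: (0.50, 0.25, 0.0),
             14: (0.60, 0.50, 0.0), 16: (0.60, 0.30, 0.0),
             18: (0.70, 0.55, 0.0), 20: (0.70, 0.40, 0.0)}
print(classify(open_hand))  # prints "open_hand"
```

Because y is normalized with the origin at the top of the image, a fingertip "above" its PIP joint has the smaller y value, which is why the comparison uses `<`.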
4. RESULT

The performance of the machine learning model was measured using a confusion matrix. In Python, we can use the scikit-learn library to compute the confusion matrix. The experimental datasets were collected before we used them to predict the hand gestures. The confusion matrix was also used to observe the accuracy achieved by the model.
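The evaluation described above can be sketched with scikit-learn's `confusion_matrix` and `accuracy_score`. The gesture labels and predictions below are illustrative only, not the paper's ten gestures or its experimental data:

```python
from sklearn.metrics import confusion_matrix, accuracy_score

# Illustrative true vs. predicted gesture labels (not the experimental data)
y_true = ["point", "fist", "open", "point", "fist", "open"]
y_pred = ["point", "fist", "open", "fist", "fist", "open"]

labels = ["point", "fist", "open"]
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)                              # rows = true label, columns = predicted
print(accuracy_score(y_true, y_pred))  # fraction of correct predictions
```

Each diagonal entry of the matrix counts correctly recognized samples of one gesture, while off-diagonal entries are false predictions, so a large off-diagonal value points at a pair of gestures the model confuses.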
Using MediaPipe to implement machine learning for hand gesture recognition in a user guide application achieved good performance; see Figure 12 and Figure 13. Figure 12 shows the predictions for the 10 varieties of hand gestures, and we can also see that a few hand gestures can produce false predictions. Figure 13 shows a validation accuracy of 95% for the recognized hand gestures. False predictions of hand gestures lower the accuracy; they are caused by lighting, by the distance between the Kinect, as the image capture device, and the user while using the application, and by the angle at which the Kinect is placed. Figure 14 shows images of the deployed user guide application, with hand gestures used as commands for displaying information.

5. CONCLUSION

Hand gesture recognition systems play an important role in building efficient human-machine interaction. Implementations using hand gesture recognition promise wide-ranging uses in the technology industry. MediaPipe, as a framework based on machine learning, plays an effective role in developing this hand gesture recognition application, with results showing an accuracy of 95%. We would like to extend our system further, to develop collaboration with other devices and other human body parts, and to experiment with both static and dynamic hand gesture recognition systems.

REFERENCES

[1] Ustundag A, Cevikcan E, Industry 4.0: Managing The Digital Transformation, Springer Series in Advanced Manufacturing, Switzerland, 2018. DOI: https://doi.org/10.1007/978-3-319-57870-5.

[2] Pantic M, Nijholt A, Pentland A, Huang TS, Human-Centred Intelligent Human-Computer Interaction (HCI2): How Far Are We From Attaining It?, International Journal of Autonomous and Adaptive Communications Systems (IJAACS), vol. 1, no. 2, 2008, pp. 168-187. DOI: 10.1504/IJAACS.2008.019799.

[3] Al-Saedi A.K.H, Al-Asadi A.H, Survey of Hand Gesture Recognition Systems. IOP Conference Series: Journal of Physics: Conference Series 1294 042003, 2019. DOI: https://doi.org/10.1088/1742-6596/1294/4/042003.

[4] Ren Z, Meng J, Yuan J, Depth Camera Based Hand Gesture Recognition and its Application in Human-Computer Interaction. In Proceedings of the 2011 8th International Conference on Information, Communications and Signal Processing (ICICS), Singapore, 2011.

[5] Rautaray S.S, Agrawal A, Vision Based Hand Gesture Recognition for Human Computer Interaction: A Survey. Springer Artificial Intelligence Review, 2012. DOI: https://doi.org/10.1007/s10462-012-9356-9.

[6] Lugaresi C, Tang J, Nash H, McClanahan C, et al., MediaPipe: A Framework for Building Perception Pipelines. Google Research, 2019. https://arxiv.org/abs/1906.08172.

[7] Xu Z, et al., Hand Gesture Recognition and Virtual Game Control Based on 3D Accelerometer and EMG Sensors, In Proceedings of IUI'09, 2009, pp. 401-406.

[8] Chua C, Guan H, Ho Y, Model-Based 3D Hand Posture Estimation from a Single 2D Image. Image and Vision Computing, vol. 20, 2002, pp. 191-202.

[9] Li Y, Hand Gesture Recognition Using Kinect, 2012.

[10] Panwar M, Hand Gesture Recognition Based on Shape Parameters, In International Conference on Computing, Communication and Applications (ICCCA), 2012.

[11] Maisto M, et al., An Accurate Algorithm for the Identification of Fingertips Using an RGB-D Camera, IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 2013, pp. 272-283.

[12] Holden E, Visual Recognition of Hand Motion, Ph.D. Thesis, Department of Computer Science, University of Western Australia, 1997.

[13] Cardoso T, Delgado J, Barata J, Hand Gesture Recognition towards Enhancing Accessibility. In 6th International Conference on Software Development and Technologies for Enhancing Accessibility and Fighting Info-exclusion (DSAI), Procedia Computer Science, vol. 67, 2015, pp. 419-429. DOI: https://doi.org/10.1016/j.procs.2015.09.287.

[14] Ren Z, et al., Robust Hand Gesture Recognition Based on Finger-Earth Mover's Distance with a Commodity Depth Camera, 2011.

[15] Starner T, Weaver J, Pentland A, Real-Time American Sign Language Recognition Using Desk
and Wearable Computer Based Video. IEEE Trans. on PAMI, vol. 20, 1998, pp. 1371-1375.

[16] Yang M.H, Ahuja N, Tabb M, Extraction of 2D Motion Trajectories and its Application to Hand Gesture Recognition. IEEE Trans. on PAMI, vol. 24, 2002, pp. 1061-1074.

[17] Bray M, Koller-Meier E, Van Gool L, Smart Particle Filtering for 3D Hand Tracking. In Proceedings of the Sixth IEEE International Conference on Face and Gesture Recognition, 2004.

[18] Dewaele G, Devernay F, Horaud R, Hand Motion from 3D Point Trajectories and a Smooth Surface Model. In Proceedings of the 8th ECCV, 2004.

[19] Stenger B, Model-Based 3D Tracking of an Articulated Hand, 2001.

[20] Keskin C, et al., Real Time Hand Pose Estimation Using Depth Sensors, In IEEE International Conference on Computer Vision Workshops, 2011.

[21] Lee H, Kim J, An HMM-Based Threshold Model Approach for Gesture Recognition. IEEE Trans. on PAMI, vol. 21, 1999, pp. 961-973.

[22] Wilson A, Bobick A, Parametric Hidden Markov Models for Gesture Recognition. IEEE Trans. on PAMI, vol. 21, 1999, pp. 884-900.

[23] Wu X, An Intelligent Interactive System Based on Hand Gesture Recognition Algorithm and Kinect, In 5th International Symposium on Computational Intelligence and Design, 2012.

[24] Wang Y, Kinect Based Dynamic Hand Gesture Recognition Algorithm Research, In 4th International Conference on Intelligent Human-Machine Systems and Cybernetics, 2012.

[25] Doucet A, De Freitas N, Gordon N, Sequential Monte Carlo Methods in Practice. New York: Springer-Verlag, 2001.

[26] Kwok C, Fox D, Meila M, Real-Time Particle Filters. In Proceedings of the IEEE, 2004.

[30] Abadi M, Barham P, Chen J, et al., TensorFlow: A System for Large-Scale Machine Learning, In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), USA, 2016. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi.

[31] Chen T, Li M, Li Y, et al., MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems, 2015. https://arxiv.org/pdf/1512.01274.pdf.

[32] Paszke A, Gross S, Chintala S, et al., Automatic Differentiation in PyTorch, In 31st Conference on Neural Information Processing Systems (NIPS), USA, 2017.

[33] Seide F, Agarwal A, CNTK: Microsoft's Open-Source Deep-Learning Toolkit, In KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016. DOI: https://doi.org/10.1145/2939672.2945397.

[34] Matveev D, OpenCV Graph API, Intel Corporation, 2018.

[35] Zhang F, Bazarevsky V, Vakunov A, et al., MediaPipe Hands: On-Device Real-Time Hand Tracking, Google Research, USA, 2020. https://arxiv.org/pdf/2006.10214.pdf.

[36] MediaPipe: On-Device, Real-Time Hand Tracking, https://ai.googleblog.com/2019/08/on-device-real-time-hand-tracking-with.html, 2019. Accessed 2021.

[37] Grishchenko I, Bazarevsky V, MediaPipe Holistic – Simultaneous Face, Hand and Pose Prediction on Device, Google Research, USA, 2020. https://ai.googleblog.com/2020/12/mediapipe-holistic-simultaneous-face.html. Accessed 2021.

[38] MediaPipe GitHub: https://google.github.io/mediapipe/solutions/hands. Accessed 2021.