
Advances in Engineering Research, volume 207

Proceedings of the 2nd International Seminar of Science and Applied Technology (ISSAT 2021)

Applying Hand Gesture Recognition for User Guide Application Using MediaPipe

Indriani¹, Moh. Harris¹, Ali Suryaperdana Agoes¹,*

¹ Department of Informatics Engineering, STMIK AMIKBANDUNG, Bandung, Indonesia.
* Corresponding author. Email: [email protected]

ABSTRACT
Hand gesture recognition is considered important to the development of Industry 4.0 technology in Human-Computer Interaction (HCI), as it gives computers the competence to capture and interpret hand gestures and execute commands without the user physically touching a device. MediaPipe is a framework with built-in machine learning that provides a solution for a hand gesture recognition system. In this research, we develop a simple user guide application using the MediaPipe framework. A user guide is commonly known as technical documentation or a manual for a certain system that assists people: it gives step-by-step descriptions of how to handle a particular system and helps users deal with frustration by giving them the means to identify, understand, and disentangle frequently occurring technical problems by themselves. In our experiment, we captured real-time images using a Kinect, trained on a variety of hand gesture data, identified each hand gesture, and recognized hand gestures to convey the corresponding information in the user guide application. The user can access user guide information based on the hand gestures that have been recognized. We propose using hand gesture recognition with MediaPipe in our application to improve the convenience of using the user guide application and to turn a still-manual user guide application into a more interactive one.

Keywords: Hand Gesture Recognition, MediaPipe, Kinect, User Guide Application.

1. INTRODUCTION

We are now in the era of Industry 4.0, or the Fourth Industrial Revolution, which requires the automation and computerization realized from the consolidation of various physical and digital technologies such as sensors, embedded systems, Artificial Intelligence (AI), Cloud Computing, Big Data, Adaptive Robotics, Augmented Reality, Additive Manufacturing (AM), and the Internet of Things (IoT) [1]. Enhanced digital connectivity has made technology a crucial requirement for carrying out our daily activities such as doing tasks or work, shopping, communication, entertainment, and even searching for information or news [2]. Technology increasingly works through machines and through advances in interaction that use a broad range of gestures to recognize, communicate, or interact.

A gesture is a form of non-verbal or non-vocal communication that utilizes the body's movement to convey a particular message; among the parts of the human body, the hand and the face are the most commonly adopted [3]. Gesture-based interaction, introduced by Krueger as a new type of Human-Computer Interaction (HCI) in the mid-1970s, has become a magnetic area of research. In HCI, building application interfaces that let each part of the human body communicate naturally has attracted great research attention, especially the hands, as the most effective alternative interaction tool considering their ability [4].

Through HCI, recognizing hand gestures could help achieve the desired ease and naturalness [5]. When interacting with other people, hand movements convey meaning and information, ranging from simple hand movements to more complex ones. For example, we can use our hand to point at something (an object or a person), or use different simple hand shapes or hand movements expressed through manual articulations combined with their own grammar and lexicon, well known as sign languages. Hence, using hand gestures as a device integrated with computers can help people communicate more intuitively [5].

Copyright © 2021 The Authors. Published by Atlantis Press International B.V.
This is an open access article distributed under the CC BY-NC 4.0 license - http://creativecommons.org/licenses/by-nc/4.0/.
Currently, many machine learning frameworks or libraries for hand gesture recognition have been built to make it easier for anyone to build AI (Artificial Intelligence) based applications. One of them is MediaPipe. The MediaPipe framework is provided by Google for solving problems with machine learning, offering solutions such as Face Detection, Face Mesh, Iris, Hands, Pose, Holistic, Hair Segmentation, Object Detection, Box Tracking, Instant Motion Tracking, Objectron, and KNIFT. The MediaPipe framework helps a developer focus on algorithm and model development for the application, and then supports the application environment with results that are reproducible across different devices and platforms; these are a few of the advantages of using the MediaPipe framework [6].

In this paper, we focus on developing a manual user guide application, improving the application architecture by applying hand gesture recognition using the MediaPipe framework, with a Kinect camera capturing the hand poses to be recognized. Using hand gesture recognition will make our user guide application more interactive.

2. RELATED WORK

2.1 Hand Gesture Recognition

Gesture recognition is an essential topic in computer science; it builds technology that aims to interpret human gestures so that anyone can use simple gestures to interact with a device without touching it directly. The entire procedure of tracking gestures, representing them, and converting them into some purposeful command is known as gesture recognition [7]. The aim of hand gesture recognition is to identify explicit hand gestures as input, then process the gesture representations into output for devices through a mapping.

In the literature, recognition of hand gestures based on the extracted features is divided into three groups, as follows:

● High-Level Feature-Based Approaches: Aim to figure out the position of the palm and the joint angles, such as the fingertips, joint locations, or anchor points of the palm [8][9][10][11]. The gestures are then defined from these results by a set of rules and conditions on the vectors and joints of the hands [13]. However, collisions or occlusions in the image are difficult to detect after the features are extracted [12], and the sensitivity of segmentation performance on 2D hand images [4] is a problem that occurs frequently.

● Low-Level Feature-Based Approaches: Utilize features that can be extracted quickly and are robust to noise. Ren [14] recognized the hand shape as a cluster-based signature using a novel distance metric called the Finger-Earth Mover's Distance. Starner [15] determined an elliptical bounding region of the hand to implement hand recognition based on principal axes. Yang [16] researched using the optical flow of the hand region as a low-level feature. Low-level feature-based approaches are not efficient against a cluttered background [4].

● 3D Reconstruction-Based Approaches: Use a 3D model of features to achieve a complete interpretation of the hand. Research [17] showed that successfully segmenting the hand by skin color requires similarity and high contrast between the background and the hand, using structured light to bring in 3D depth data. Another work [18] uses a stereo camera to track numerous interest points on the surface of the hand, which makes robust 3D reconstruction difficult to handle, even though 3D data contains valuable information that can help remove ambiguity. See [19][20] for more 3D reconstruction-based approaches.

From the literature, there are three hand gesture recognition methods, as follows:

● Machine Learning Approaches: The resulting output comes from a stochastic process, with approaches based on statistical modeling of dynamic gestures, such as PCA, HMMs [21][22][23][24], advanced particle filtering [26], and the condensation algorithm [25].

● Algorithmic Approaches: A collection of manually encoded conditions and constraints that define dynamic gestures. Galveia [27] applied a 3rd-degree polynomial equation to determine the dynamic component of the hand gestures (creating the 3rd-degree polynomial equation, recognition, reducing the complexity of the equations, and comparison handling against the gestures library).

● Rule-Based Approaches: Suitable for dynamic as well as static gestures, these consist of a set of pre-encoded rules over the input features [4]. The features of the input gestures are extracted and compared to the encoded rules, which drives the flow of gesture recognition; an input whose features match a rule is output as a known gesture [28] (a toy sketch follows below).
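The following toy sketch illustrates the rule-based idea in Python; the rules, finger-state features, and gesture names are purely illustrative and are not taken from [28] or any other cited work.

```python
# Toy rule-based matcher: a gesture is "known" only when the extracted
# finger-state features exactly satisfy a pre-encoded rule.
# All rules and names below are illustrative, not from the cited works.
RULES = {
    (1, 1, 1, 1, 1): "Open Hand",  # all five fingers extended
    (0, 1, 1, 0, 0): "Scissors",   # index + middle extended
    (0, 0, 0, 0, 0): "Fist",       # no fingers extended
}

def match_gesture(finger_states):
    """Return the gesture whose rule matches the input features, if any."""
    return RULES.get(tuple(finger_states), "Unknown")

print(match_gesture([0, 1, 1, 0, 0]))  # -> Scissors
```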
2.2 MediaPipe Framework

Today, there are many machine learning frameworks or libraries for hand gesture recognition, and one of them is MediaPipe. MediaPipe is a framework designed for implementing production-ready machine learning: it builds pipelines to perform inference over arbitrary sensory data, has published code accompanying its research work, and supports building technology prototypes [6]. In MediaPipe, modular graph components form a perception pipeline together with inference model functions, media processing functions, and data transformations [29]. Graphs of operations are also used in other machine learning frameworks such as TensorFlow [30], MXNet [31], PyTorch [32], CNTK [33], and OpenCV 4.0 [34].

Using MediaPipe for hand gesture recognition was researched by Zhang [35] before, using a single RGB camera for AR/VR applications in a real-time system that predicts a human hand skeleton. We can develop applications that combine MediaPipe with other devices. The MediaPipe pipeline, shown in Figure 1, consists of the following models for hand gesture recognition [29][35][36]:

1. A palm detector model that processes the captured image and returns the image with an oriented bounding box of the hand;
2. A hand landmark model that processes the cropped bounding box image and returns 3D hand key points;
3. A gesture recognizer that classifies the 3D hand key points and maps their configuration into a discrete set of gestures.

Figure 1 Hand Perception Pipeline Overview [36].

2.2.1 Palm Detector Model

The MediaPipe framework builds its initial palm detection around a detector called BlazePalm. Detecting the hand is a complex task, so the first step is to train a palm detector instead of a hand detector and to apply the non-maximum suppression algorithm to the palms, which are modeled with square bounding boxes so that other aspect ratios are avoided and the number of anchors is reduced by a factor of 3-5. Next, an encoder-decoder feature extractor is used for bigger scene-context awareness, even for small objects; lastly, the focal loss is minimized during training to support the large number of anchors resulting from the high scale variance [35][36].

2.2.2 Hand Landmark

The hand landmark model in MediaPipe achieves precise localization of 21 key points with 3D hand-knuckle coordinates inside the detected hand regions through regression, producing the coordinate predictions directly [35][36]; see Figure 2.

Figure 2 Hand Landmark in MediaPipe [38].

Each hand-knuckle landmark has a coordinate composed of x, y, and z, where x and y are normalized to [0.0, 1.0] by the image width and height, while z represents the depth of the landmark, with the depth at the wrist taken as the origin. The closer a landmark is to the camera, the smaller the value becomes.

2.2.3 Hand Recognizer

For recognizing hand gestures, a simple algorithm is implemented that computes gestures from the accumulated joint angles and the state or condition of each finger, such as a bent finger or a straight finger, and then maps the resulting set of finger states to a label from a set of pre-defined gestures such as "OK", "Spiderman", or "Rock" [35][36]. This can be seen in Figure 3 below.

Figure 3 Hand Gesture Recognition [35][36].
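As a concrete illustration of this pipeline, the sketch below uses the MediaPipe Python solutions API to obtain the 21 (x, y, z) key points from a single frame. It is a minimal sketch assuming the legacy mp.solutions.hands interface; the exact API surface depends on the MediaPipe version, and the file name is a placeholder.

```python
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands

# The Hands solution wraps the palm detector + hand landmark pipeline
# described above; max_num_hands limits detection to a single hand.
with mp_hands.Hands(static_image_mode=False,
                    max_num_hands=1,
                    min_detection_confidence=0.5) as hands:
    frame = cv2.imread("hand.jpg")  # placeholder: any BGR frame, e.g. from a Kinect
    results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if results.multi_hand_landmarks:
        for hand in results.multi_hand_landmarks:
            # 21 landmarks: x and y are normalized to [0.0, 1.0]; z is the
            # depth relative to the wrist (smaller means closer to the camera).
            for idx, lm in enumerate(hand.landmark):
                print(idx, lm.x, lm.y, lm.z)
```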
Grishchenko et al. [37] took up the challenge of developing a library in MediaPipe that presents a novel state-of-the-art human body pose topology with 540+ key points, consisting of 33 pose, 468 facial, and 21 per-hand landmarks, well known as MediaPipe Holistic. Their research built a simple remote control application interface featuring user interaction without a device such as a mouse or a keyboard, manipulating objects on the screen depending on the hand detection accuracy, including gesture recognition; see Figure 4.

Figure 4 Remote Control Interface using Gesture Recognition [37].

3. RESEARCH METHODS

In this research, the user guide application is a guide that displays to the user the steps taken by the system, identifying hand gestures as specific commands. We develop a user guide application that implements hand gesture recognition, using a Kinect to capture the hand pose and then recognizing it to run the application. We use the MediaPipe framework and the Python programming language to develop the user guide application. For details, see Figure 5 below.

Figure 5 Workflow of the Research Method.

There have been numerous inspiring successes in applying the Kinect as a device for articulated human body tracking, pose estimation, and even recognition systems. We use the Kinect x360 with an RGB camera at a resolution of 640x240 pixels to capture real-time images for processing with the MediaPipe framework. MediaPipe reads the image received from the Kinect, performs palm detection on it, and produces the hand landmarks, returning 3D hand key points that are joined up like a skeleton. The 3D key points of the palm marked in the image are then computed and initialized as the tool for reading the hand pose, and recognition conveys the information associated with the hand pose initialized beforehand.

3.1 Identification of Hand Gestures in MediaPipe

The different poses of the hand can be identified by calculations on the 21 key points of the hand landmarks explained in Section 2.2.2 above. The identification is done by first determining whether each finger of the hand is open or closed; for clarity, see Figure 6 below, which shows pseudocode for identifying the condition of a finger of the hand in relation to Figure 2. In Figure 2, we can see that coordinates [4, 8, 12, 16, 20] are the coordinates of the tips of the fingers, declared as the fingertips. The hand coordinates 0 through 20 are obtained from the 21 hand-knuckle key points of the hand landmark model. We compare the coordinate of each fingertip, based on its x (horizontal) and y (vertical) position, with the middle points [2, 6, 10, 14, 18]. If the compared fingertip coordinate has a higher value than the middle point, the finger is set to value 1, meaning the finger is open, and vice versa.
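A minimal Python sketch of this per-finger check, reconstructed from the description above and from Figure 6 (the exact conditions in the authors' code may differ; for instance, the thumb is compared horizontally here because it does not fold vertically):

```python
# Hedged reconstruction of the finger-state check (Figure 6).
TIP_IDS = [4, 8, 12, 16, 20]   # fingertip landmarks, thumb..pinky (Figure 2)
MID_IDS = [2, 6, 10, 14, 18]   # middle-point landmarks used for comparison

def finger_states(landmarks):
    """Return [thumb..pinky] states: 1 = open, 0 = closed.

    `landmarks` is the 21-element landmark list from MediaPipe, e.g.
    results.multi_hand_landmarks[0].landmark in the earlier sketch.
    """
    states = []
    # Thumb: compared on x; the test direction depends on hand side/mirroring.
    states.append(1 if landmarks[4].x > landmarks[2].x else 0)
    # Other fingers: an open fingertip lies above its middle point, which in
    # normalized image coordinates means a *smaller* y value.
    for tip, mid in zip(TIP_IDS[1:], MID_IDS[1:]):
        states.append(1 if landmarks[tip].y < landmarks[mid].y else 0)
    return states
```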
Figure 6 Pseudocode for detecting whether a finger of the hand is open or closed.

Figure 7 Pseudocode for detecting whether the hand condition is open or closed.

In the next step, after determining the condition of each finger, open or closed, as mentioned before, we gather the values of all the fingers and compare every condition of the fingers of the hand (1 for an open finger and 0 for a closed finger); see Figure 7. If a condition is fulfilled and the result is true, the program executes the instruction that has been defined for it.

3.2 Mock-up of the User Guide Application

To develop the user guide application, we prepared a design mock-up and the hand poses to recognize, and positioned the information on the mock-up of the user guide application; see Figure 8 and Figure 9 below. The system captures an image of a hand pose with all fingers open, identifies it, and then executes the command to display the mock-up. Figure 8 shows the mock-up of the open menu, which contains the hand poses initialized for the various menus.

Figure 8 Mock-up of the Open Menus of the User Guide Application.

If the user chooses one of the menus in the user guide application, information about the problem topic is displayed based on the chosen hand pose, as seen in Figure 9. The figure shows a mock-up of Instruction Machine Control, one of the menus of the user guide application, together with information for guiding through or resolving a problem on that topic after the user's chosen hand pose has been recognized.

Figure 9 Mock-up of One Menu of the User Guide.

In Figure 8 and Figure 9, the right corner of the mock-up shows the varieties of hand poses for the menus, which help the user as a reference for displaying guide information based on the problem topics that have been provided, whereas the top left corner shows the hand pose image captured by the Kinect camera. The other mock-up menus follow the same flow as Figure 9; what differs is only the information content each menu delivers.

3.3 Description of the Dataset

The main objective of the research is to recognize hand gestures in order to display the menu that a user has chosen through the Kinect. We used 10 captured hand gestures, each of which directly selects one menu; see Figure 10 below.

Figure 10 Hand Gestures for Menus.
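Combining the two steps of Figures 6 and 7, a finger-state tuple can then be mapped to one of the ten menu gestures of Figure 10. The sketch below is a hedged illustration: only the all-fingers-open mapping to label 0 (Open Menu) is stated in the paper; the remaining tuples and menu names are placeholders.

```python
# Hedged sketch of the matching step in Figure 7: a finger-state tuple
# selects one of the pre-defined menu gestures (labels 0-9, Figure 10).
GESTURE_MENUS = {
    (1, 1, 1, 1, 1): (0, "Open Menu"),   # all fingers open (from the paper)
    (0, 1, 0, 0, 0): (1, "Menu 1"),      # placeholder tuple and name
    (0, 1, 1, 0, 0): (2, "Menu 2"),      # placeholder tuple and name
    # ... remaining tuples for labels 3-9, one per gesture in Figure 10
}

def select_menu(states):
    """Return (label_id, menu_name) if a rule is fulfilled, else None."""
    return GESTURE_MENUS.get(tuple(states))
```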
Many datasets containing images of hand gestures are in fact publicly available. In this paper, we recorded our own data and created a small dataset of around 900 samples. These data consist of 10 varieties of hand gestures, as seen in Figure 10. The samples cover various conditions of the hand, such as both the right and the left hand, palm positions such as the palm facing the camera or the back of the hand facing the camera, and a variety of angles of the hand relative to the Kinect camera capturing the image.

The collected dataset consists of a specified index of gestures (ID), the extracted landmark coordinates, the relative coordinates, their flattening into a one-dimensional array, and the normalized values captured in MediaPipe. The index of gestures (ID) is the reference for labeling the 10 varieties of hand gestures, as seen in Figure 10 above. Each gesture has one label ID as its identifier. We label the indices from 0 to 9 for the pictures from the top left of the first row to the bottom right of the second row. Thus, the hand gesture with all fingers open has label 0, for Open Menu.

Besides the index of gestures, the key points are also extracted to obtain the valuable 21 key points of each captured hand gesture. The extracted key points are generated from the x, y, and z coordinates of the 21 key points of the hand; see Figure 11. The x coordinate gives the landmark position on the horizontal axis, the y coordinate the landmark position on the vertical axis, and z the landmark depth from the camera. These data are simply written to a CSV file and saved locally. This allowed us to obtain training data made of quality samples.

Figure 11 Key points of the hand landmark with their (x, y, z) coordinates in MediaPipe for one hand gesture.

4. RESULT

To measure the performance of the model, we used a confusion matrix, as is common in machine learning. In Python, we can use the scikit-learn library to build the confusion matrix. The experiment datasets were obtained before we used them to predict the hand gestures. The confusion matrix was also used to observe the accuracy achieved by the model.

Figure 12 Classification performance of hand gesture recognition.

Figure 13 Accuracy of the classification performance of hand gesture recognition.

Figure 14 Screenshots of the user guide application.
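The dataset logging described in Section 3.3, a gesture ID plus wrist-relative, flattened, normalized landmark coordinates appended to a CSV file, might look like the sketch below; the column layout and the max-abs normalization are assumptions, not the authors' exact format.

```python
import csv
import numpy as np

def log_sample(gesture_id, landmarks, path="hand_gestures.csv"):
    """Append one training sample: gesture ID (0-9) + processed key points.

    Follows the description in Section 3.3: coordinates are made relative
    to the wrist (landmark 0), flattened to one dimension, then normalized
    (max-abs scaling is assumed here).
    """
    pts = np.array([[lm.x, lm.y, lm.z] for lm in landmarks])  # shape (21, 3)
    rel = pts - pts[0]                          # relative to the wrist landmark
    flat = rel.flatten()                        # one-dimensional, 63 values
    scale = np.abs(flat).max()
    norm = flat / scale if scale > 0 else flat  # normalized values
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([gesture_id, *norm])
```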
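The evaluation step uses scikit-learn, as stated above; a minimal sketch with illustrative labels:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# y_true: ground-truth gesture IDs (0-9); y_pred: the model's predictions.
# The values below are illustrative only.
y_true = [0, 1, 2, 2, 9]
y_pred = [0, 1, 2, 1, 9]

print(confusion_matrix(y_true, y_pred))  # per-class hits and confusions
print(accuracy_score(y_true, y_pred))    # overall accuracy
```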
Using MediaPipe to implement machine learning for hand gesture recognition in a user guide application achieved good performance; see Figure 12 and Figure 13. Figure 12 shows the predictions for the 10 varieties of hand gestures, and we can also see that a few hand gestures can produce false predictions. Figure 13 shows a validation accuracy of 95% for the hand gestures to be recognized. False predictions of hand gestures make the accuracy percentage descend; they are caused by lighting, the distance between the Kinect, as the image capturer, and the user while using the application, and the angle at which the Kinect is placed. Figure 14 shows images of the deployed user guide application using hand gestures as commands for displaying information.

5. CONCLUSION

Hand gesture recognition systems have come to play an important role in building efficient human-machine interaction, and implementations of hand gesture recognition promise wide-ranging uses in the technology industry. MediaPipe, as a machine-learning framework, plays an effective role in developing this hand gesture recognition application, whose results show an accuracy of 95%. We would like to extend our system further to collaborate with other devices and other human body parts, and to experiment with both static and dynamic hand gesture recognition systems.

REFERENCES

[1] Ustundag A, Cevikcan E. Industry 4.0: Managing The Digital Transformation. Springer Series in Advanced Manufacturing, Switzerland. 2018. DOI: https://doi.org/10.1007/978-3-319-57870-5.

[2] Pantic M, Nijholt A, Pentland A, Huang TS. Human-Centred Intelligent Human-Computer Interaction (HCI2): How Far Are We From Attaining It? International Journal of Autonomous and Adaptive Communications Systems (IJAACS), vol. 1, no. 2, 2008. pp. 168-187. DOI: 10.1504/IJAACS.2008.019799.

[3] Hamed Al-Saedi A.K, Hassin Al-Asadi A. Survey of Hand Gesture Recognition Systems. IOP Conference Series: Journal of Physics: Conference Series 1294 042003. 2019. DOI: https://doi.org/10.1088/1742-6596/1294/4/042003.

[4] Ren Z, Meng J, Yuan J. Depth Camera Based Hand Gesture Recognition and its Application in Human-Computer Interaction. In Proceedings of the 2011 8th International Conference on Information, Communication and Signal Processing (ICICS). Singapore. 2011.

[5] Rautaray SS, Agrawal A. Vision Based Hand Gesture Recognition for Human Computer Interaction: A Survey. Springer Artificial Intelligence Review. 2012. DOI: https://doi.org/10.1007/s10462-012-9356-9.

[6] Lugaresi C, Tang J, Nash H, McClanahan C, et al. MediaPipe: A Framework for Building Perception Pipelines. Google Research. 2019. https://arxiv.org/abs/1906.08172.

[7] Xu Z, et al. Hand Gesture Recognition and Virtual Game Control Based on 3D Accelerometer and EMG Sensors. In Proceedings of IUI'09, 2009, pp. 401-406.

[8] Chua C, Guan H, Ho Y. Model-Based 3D Hand Posture Estimation From a Single 2D Image. Image and Vision Computing, vol. 20, 2002, pp. 191-202.

[9] Li Y. Hand Gesture Recognition Using Kinect. 2012.

[10] Panwar M. Hand Gesture Recognition Based on Shape Parameters. In International Conference on Computing, Communication and Applications (ICCCA), 2012.

[11] Maisto M, et al. An Accurate Algorithm for the Identification of Fingertips Using an RGB-D Camera. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 2013. pp. 272-283.

[12] Holden E. Visual Recognition of Hand Motion. Ph.D. Thesis, Department of Computer Science, The University of Western Australia. 1997.

[13] Cardoso T, Delgado J, Barata J. Hand Gesture Recognition towards Enhancing Accessibility. In 6th International Conference on Software Development and Technologies for Enhancing Accessibility and Fighting Info-exclusion (DSAI). Procedia Computer Science, vol. 67. 2015. pp. 419-429. DOI: https://doi.org/10.1016/j.procs.2015.09.287.

[14] Ren Z, et al. Robust Hand Gesture Recognition Based on Finger-Earth Mover's Distance with a Commodity Depth Camera. 2011.
[15] Starner T, Weaver J, Pentland A. Real-Time American Sign Language Recognition Using Desk and Wearable Computer Based Video. IEEE Trans. on PAMI, vol. 20. 1998. pp. 1371-1375.

[16] Yang MH, Ahuja N, Tabb M. Extraction of 2D Motion Trajectories and its Application to Hand Gesture Recognition. IEEE Trans. on PAMI, vol. 24. 2002. pp. 1061-1074.

[17] Bray M, Koller-Meier E, Van Gool L. Smart Particle Filtering for 3D Hand Tracking. In Proceedings of the Sixth IEEE International Conference on Face and Gesture Recognition. 2004.

[18] Dewaele G, Devernay F, Horaud R. Hand Motion from 3D Point Trajectories and a Smooth Surface Model. In Proceedings of the 8th ECCV. 2004.

[19] Stenger B. Model-Based 3D Tracking of an Articulated Hand. 2001.

[20] Keskin C. Real Time Hand Pose Estimation Using Depth Sensors. In IEEE International Conference on Computer Vision Workshops. 2011.

[21] Lee H, Kim J. An HMM-Based Threshold Model Approach for Gesture Recognition. IEEE Trans. on PAMI, vol. 21. 1999. pp. 961-973.

[22] Wilson A, Bobick A. Parametric Hidden Markov Models for Gesture Recognition. IEEE Trans. on PAMI, vol. 21, 1999. pp. 884-900.

[23] Wu X. An Intelligent Interactive System Based on Hand Gesture Recognition Algorithm and Kinect. In 5th International Symposium on Computational Intelligence and Design. 2012.

[24] Wang Y. Kinect Based Dynamic Hand Gesture Recognition Algorithm Research. In 4th International Conference on Intelligent Human-Machine Systems and Cybernetics. 2012.

[25] Doucet A, De Freitas N, Gordon N. Sequential Monte Carlo Methods in Practice. New York: Springer-Verlag. 2001.

[26] Kwok C, Fox D, Meila M. Real-Time Particle Filters. In Proceedings of the IEEE. 2004.

[27] Galveia B, Cardoso T, Rybarczyk Y. Adding Value to the Kinect SDK: Creating a Gesture Library. 2014.

[28] Su MC. A Fuzzy Rule-Based Approach to Spatio-Temporal Hand Gesture Recognition. IEEE Trans. on Systems, Man, and Cybernetics - Part C: Applications and Reviews, no. 30, 2000, pp. 276-281.

[29] Lugaresi C, Tang J, Nash H, et al. MediaPipe: A Framework for Perceiving and Processing Reality. Google Research. 2019.

[30] Abadi M, Barham P, Chen J, et al. TensorFlow: A System for Large-Scale Machine Learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), USA, 2016. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi.

[31] Chen T, Li M, Li Y, et al. MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems. 2015. https://arxiv.org/pdf/1512.01274.pdf.

[32] Paszke A, Gross S, Chintala S, et al. Automatic Differentiation in PyTorch. In 31st Conference on Neural Information Processing Systems (NIPS), USA, 2017.

[33] Seide F, Agarwal A. CNTK: Microsoft's Open-Source Deep-Learning Toolkit. In KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016. DOI: https://doi.org/10.1145/2939672.2945397.

[34] Matveev D. OpenCV Graph API. Intel Corporation. 2018.

[35] Zhang F, Bazarevsky V, Vakunov A, et al. MediaPipe Hands: On-Device Real-Time Hand Tracking. Google Research, USA. 2020. https://arxiv.org/pdf/2006.10214.pdf.

[36] MediaPipe: On-Device, Real-Time Hand Tracking. https://ai.googleblog.com/2019/08/on-device-real-time-hand-tracking-with.html. 2019. Accessed 2021.

[37] Grishchenko I, Bazarevsky V. MediaPipe Holistic - Simultaneous Face, Hand and Pose Prediction on Device. Google Research, USA, 2020. https://ai.googleblog.com/2020/12/mediapipe-holistic-simultaneous-face.html. Accessed 2021.

[38] MediaPipe GitHub: https://google.github.io/mediapipe/solutions/hands. Accessed 2021.
