
IEEE TRANSACTIONS ON HUMAN-MACHINE SYSTEMS, VOL. 50, NO. 6, DECEMBER 2020

An AI-Based Visual Aid With Integrated Reading Assistant for the Completely Blind

Muiz Ahmed Khan, Pias Paul, Mahmudur Rashid, Student Member, IEEE, Mainul Hossain, Member, IEEE,
and Md Atiqur Rahman Ahad, Senior Member, IEEE

Abstract—Blindness prevents a person from gaining knowledge of the surrounding environment and makes unassisted navigation, object recognition, obstacle avoidance, and reading tasks a major challenge. In this work, we propose a novel visual aid system for the completely blind. Because of its low cost, compact size, and ease of integration, a Raspberry Pi 3 Model B+ has been used to demonstrate the functionality of the proposed prototype. The design incorporates a camera and sensors for obstacle avoidance and advanced image processing algorithms for object detection. The distance between the user and the obstacle is measured by the camera as well as ultrasonic sensors. The system includes an integrated reading assistant, in the form of an image-to-text converter, followed by auditory feedback. The entire setup is lightweight and portable and can be mounted onto a regular pair of eyeglasses, without any additional cost and complexity. Experiments are carried out with 60 completely blind individuals to evaluate the performance of the proposed device with respect to the traditional white cane. The evaluations are performed in controlled environments that mimic real-world scenarios encountered by a blind person. Results show that the proposed device, as compared with the white cane, enables greater accessibility, comfort, and ease of navigation for the visually impaired.

Index Terms—Blind people, completely blind, electronic navigation aid, Raspberry Pi, visual aid, visually impaired people, wearable system.

I. INTRODUCTION

BLINDNESS or loss of vision is one of the most common disabilities worldwide. Blindness, whether caused by natural means or some form of accident, has grown over the past decades. Partially blind people experience cloudy vision, see only shadows, and suffer from poor night vision or tunnel vision. A completely blind person, on the other hand, has no vision at all. Recent statistics from the World Health Organization estimate the number of visually impaired or blind people to be about 2.2 billion [1]. A white cane is traditionally used by blind people to help them navigate their surroundings, although the white cane does not provide information about moving obstacles that approach from a distance. Moreover, white canes are unable to detect raised obstacles that are above knee level. Trained guide dogs are another option that can assist the blind. However, trained dogs are expensive and not readily available. Recent studies have proposed several types [2]–[9] of wearable or hand-held electronic travel aids (ETAs). Most of these devices integrate various sensors to map the surroundings and provide voice or sound alarms through headphones. The quality of the auditory signal, delivered in real time, affects the reliability of these gadgets. Many ETAs currently available in the market do not include a real-time reading assistant and suffer from a poor user interface, high cost, limited portability, and lack of hands-free access. These devices are, therefore, not widely popular among the blind and require further improvement in design, performance, and reliability for use in both indoor and outdoor settings.

In this article, we propose a novel visual aid system for completely blind individuals. The unique features, which define the novelty of the proposed design, include the following.
1) A hands-free, wearable, low-power, and compact design, mountable on a pair of eyeglasses, for indoor and outdoor navigation with an integrated reading assistant.
2) Complex algorithm processing with a low-end configuration.
3) Real-time, camera-based, accurate distance measurement, which simplifies the design and lowers the cost by reducing the number of required sensors.

The proposed setup, in its current form, can detect both stationary and moving objects in real time and provide auditory feedback to the blind. In addition, the device comes with an in-built reading assistant that is capable of reading text from any document. This article discusses the design, construction, and performance evaluation of the proposed visual aid system and is organized as follows. Section II summarizes the existing literature on blind navigation aids, highlighting their benefits and challenges. Section III presents the design and the working principle of the prototype, while Section IV discusses the experimental setup for performance evaluation. Section V summarizes the results using appropriate statistical analysis. Finally, Section VI concludes the article.

Manuscript received March 31, 2020; revised July 23, 2020; accepted September 6, 2020. Date of publication October 20, 2020; date of current version November 12, 2020. This article was recommended by Associate Editor Z. Yu. (Corresponding author: Mainul Hossain.)
Muiz Ahmed Khan, Pias Paul, and Mahmudur Rashid are with the Department of Electrical and Computer Engineering, North South University, Dhaka 1229, Bangladesh (e-mail: [email protected]).
Mainul Hossain is with the Department of Electrical and Electronic Engineering, University of Dhaka, Dhaka 1000, Bangladesh (e-mail: [email protected]).
Md Atiqur Rahman Ahad is with the Department of Electrical and Electronic Engineering, University of Dhaka, Dhaka 1000, Bangladesh, and also with the Department of Intelligent Media, Osaka University, Suita 565-0871, Japan (e-mail: [email protected]).
Digital Object Identifier 10.1109/THMS.2020.3027534


II. RELEVANT WORK

Electronic aids for the visually impaired can be categorized into three subcategories: ETAs, electronic orientation aids, and positional locator devices. ETAs provide object detection, warning, and avoidance for safe navigation [10]–[12]. ETAs work in a few steps: sensors are used to collect data from the environment, which are then processed by a computing device to detect an obstacle or object and give the user feedback corresponding to the identified object. Ultrasonic sensors can detect an object within 300 cm by generating a 40 kHz signal and receiving the echo reflected from the object in front of them. The distance is calculated based on the pulse count and time-of-flight (TOF). Smart glasses [2], [9] and boots [12], mounted with ultrasonic sensors, have already been proposed as aids for the visually impaired. A new approach by Katzschmann et al. [13] uses an array of infrared TOF distance sensors facing in different directions. Villanueva and Farcy [14] combine a white cane with a near-IR LED and a photodiode to emit and detect the IR pulses reflected from obstacles, respectively. Cameras [15], [16] and binocular vision sensors [17] have also been used to capture visual data for the blind.

Different devices and techniques are used for processing the collected data. A Raspberry Pi 3 Model B+, with open computer vision (OpenCV) software, has been used to process the images captured from the camera [18]. Platforms such as Google Tango [3] have also been used. Cloud-enabled computation enables the use of wearable devices [2]. A field-programmable gate array is another option for processing the gathered data [19]. Preprocessing of the captured images is done to reduce noise and distortion. Images are manually processed using Gaussian filtering, grayscale conversion, binary image conversion, edge detection, and cropping [20]. The processed image is then fed to the Tesseract optical character recognition (OCR) engine to extract the text from it [21]. The stereo image quality assessment of [17] employs a novel technique to select the best image out of many. The best image is then fed to a convolutional neural network (CNN), which is trained on big data and runs on a cloud device. The audio feedback in most devices is provided through a headset or a speaker. The audio is either a synthetic voice [20] from a text-to-speech synthesis system [22] or a voice user interface [23] generating a beep sound. Vibration and tactile feedback are also used in some systems.

Andò et al. [24] introduced a haptic device, similar to the white cane, with an embedded smart sensing strategy and an active handle, which detects an obstacle and produces vibration mimicking a real sensation on the cane handle. Another traditional white-cane-like system, the guide cane [13], rolls on wheels and has steering servo motors that guide the wheels by sensing obstacles with ultrasonic sensors. The drawback of this system is that the user must always hold the device by hand, whereas many systems that provide a hands-free experience are readily available. NavGuide [12] and NavCane [25] are assistive devices that use multiple sensors to detect obstacles up to knee level. Both NavGuide and NavCane are equipped with wet-floor sensors. NavCane can be integrated into white cane systems and offers a global positioning system (GPS) with a mobile communication module.

A context-aware navigation framework is demonstrated by Xiao et al. [4], which provides visual cues and distance sensing along with location-context information, using GPS. The platform can also access geographic information systems, transportation databases, and social media with the help of Wi-Fi communication through the Internet. Lan et al. [26] proposed a smart glass system that can detect and recognize road signs, such as public toilets, restaurants, and bus stops, in cities in real time. This system is lightweight, portable, and flexible. However, reading out road signage alone may not carry enough information for a blind user to be comfortable in an outdoor environment. Since public signs can differ between cities, if a sign is not registered in the database of the system, the system will not be able to recognize it. Hoang et al. [20] designed an assistive system using a mobile Kinect and a matrix of electrodes for obstacle detection and warning. However, the system has a complex configuration and an uncomfortable setup because the sensors are always placed inside the mouth during navigation. Furthermore, it is expensive and has limited portability.

Islam et al. [27] presented a comprehensive review of sensor-based walking assistants for the visually impaired. The authors identified key features that are essential for an ideal walking assistant. These include a low-cost, simple, and lightweight design with reliable indoor and outdoor coverage. Based on feedback from several blind user groups, software developers, and engineers, Dakopoulos and Bourbakis [10] also identified 14 structural and operational features that describe an ideal ETA for the blind.

Despite numerous efforts, many existing systems do not incorporate all features to the same satisfactory level and are often limited by cost and complexity. Our main contribution here was to build a simple, low-cost, portable, and hands-free ETA prototype for the blind, with text-to-speech conversion capabilities for basic, everyday indoor and outdoor use. While the proposed system, in its present form, lacks advanced features, such as the detection of wet floors and ascending staircases, reading of road signs, use of GPS, or a mobile communication module, the flexible design presents opportunities for future improvements and enhancements.

III. DESIGN OF THE PROPOSED DEVICE

We propose a visual aid for completely blind individuals, with an integrated reading assistant. The setup is mounted on a pair of eyeglasses and can provide real-time auditory feedback to the user through a headphone. A camera and sensors are used for distance measurement between the obstacle and the user. The schematic view in Fig. 1 presents the hardware setup of the proposed device, while Fig. 2 shows a photograph of the actual device prototype.


Fig. 1. Hardware configuration of the proposed system. The visual assistant takes the image as input, processes it through the Raspberry Pi processor, and gives audio feedback through a headphone.

Fig. 2. Proposed prototype. Raspberry Pi with the camera module and ultrasonic sensors mounted on a regular pair of eyeglasses.

For the object detection part, multiple techniques have been adopted. For instance, the TensorFlow object detection application programming interface (API), frameworks, and libraries, such as OpenCV and the Haar cascade classifier, are used to detect faces and eyes and to implement distance measurement. Tesseract, which is a free OCR engine for various operating systems, is used to extract text from an image. In addition, eSpeak, which is a compact open-source speech synthesizer (text-to-speech), is used for auditory feedback on the object type and the distance between the object and the user. For obstacles within 40–45 inches of the user, the ultrasonic transducer (HC-SR04) sets off a voice alarm, while the eSpeak speech synthesizer uses audio feedback to inform the user about his or her distance from the obstacle, thereby alerting the blind person and avoiding any potential accident.

The Raspberry Pi 3 Model B+ was chosen as the functional device owing to its low cost and high portability. Also, unlike many existing systems, it offers multiprocessing capability. To detect obstacles and generate an alarm, the TensorFlow object detection API has been used. The API was constructed using robust deep learning algorithms that require massive computing power. The Raspberry Pi 3 Model B+ offers a 1.2 GHz quad-core ARM Cortex-A53 processor that can output video at full 1080p resolution with the desired detail and accuracy. In addition, it has 40 general-purpose input/output (GPIO) pins, which were used, in the proposed design, to configure the distance measurement by the ultrasonic sensors.

A. Data Acquisition

Fig. 3 shows how the Raspberry Pi 3 Model B+ is connected to the other components in the system. Data are acquired in two ways. Red, green, and blue (RGB) image data were acquired using the Raspberry Pi camera module V2, which has a high-quality, 8-megapixel Sony IMX219 image sensor. The camera sensor, featuring a fixed-focus lens, has been custom designed to fit onboard the Raspberry Pi. It can capture 3280 × 2464 pixel static images and supports 1080p, 720p, and 640 × 480 pixel video. It is attached to the Pi module through small sockets, using the dedicated camera serial interface. The RGB data are retrieved by our program in real time, and objects already known to the system can be recognized from every video frame.

Fig. 3. Basic hardware setup: Raspberry Pi 3 Model B+ and associated module with the camera and ultrasonic sensors.

To acquire data from the ultrasonic rangefinder, the HC-SR04 was mounted below the camera, as shown in Fig. 3. The four pins of the ultrasound module were connected to the Raspberry Pi's GPIO ports: VCC was connected to pin 2 (VCC), GND to pin 6 (GND), TRIG to pin 12 (GPIO18), and ECHO to pin 18 (GPIO24). The ultrasonic sensor output (ECHO) always gives output LOW (0 V), unless it has been triggered, in which case it gives output HIGH (5 V). Therefore, one GPIO pin was set as an output to trigger the sensor and one as an input to detect the ECHO voltage change. The HC-SR04 sensor requires a short 10 µs pulse to trigger the module. This causes the sensor to start generating eight ultrasound bursts, at 40 kHz, to obtain an echo response. So, to create the trigger pulse, the trigger pin is set HIGH for 10 µs and then set to LOW again. The sensor sets ECHO to HIGH for the time it takes for the pulse to travel the distance and the reflected signal to travel back. Once a signal is received, the value changes from LOW (0) to HIGH (1) and remains HIGH for the duration of the echo pulse. From the difference between the two recorded time stamps, the distance between the ultrasound source and the reflecting object can be calculated. The speed of sound depends on the medium it is traveling through and the temperature of that medium. In our proposed system, 343 m/s, the speed of sound at sea level, has been used.
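The trigger-and-echo timing described above can be reproduced in a few lines of Python on the Pi. The following sketch is an illustration only, assuming the RPi.GPIO library and the BCM pin numbers quoted in the text (GPIO18 for TRIG, GPIO24 for ECHO); it is not the authors' exact code.

```python
# Minimal HC-SR04 distance read on a Raspberry Pi (illustrative sketch).
# Assumes TRIG on GPIO18 and ECHO on GPIO24, as wired in the text.
import time
import RPi.GPIO as GPIO

TRIG, ECHO = 18, 24
SPEED_OF_SOUND = 343.0  # m/s at sea level, as used in the paper

GPIO.setmode(GPIO.BCM)
GPIO.setup(TRIG, GPIO.OUT)
GPIO.setup(ECHO, GPIO.IN)

def read_distance_cm():
    # 10 microsecond trigger pulse starts the 40 kHz burst
    GPIO.output(TRIG, GPIO.HIGH)
    time.sleep(10e-6)
    GPIO.output(TRIG, GPIO.LOW)

    # Time how long ECHO stays HIGH (duration of the echo pulse)
    pulse_start = pulse_end = time.time()
    while GPIO.input(ECHO) == 0:
        pulse_start = time.time()
    while GPIO.input(ECHO) == 1:
        pulse_end = time.time()

    round_trip = pulse_end - pulse_start             # seconds
    return (round_trip * SPEED_OF_SOUND * 100) / 2   # one-way distance, cm

try:
    while True:
        d = read_distance_cm()
        if d < 40:  # voice-alert threshold used by the prototype (Section IV-B)
            print(f"Obstacle within 40 cm ({d:.1f} cm)")
        time.sleep(0.1)
finally:
    GPIO.cleanup()
```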
B. Feature Extraction

The TensorFlow object detection API is used to extract features (objects) from images captured from the live video stream. The TensorFlow object detection API is an open-source framework, built on top of TensorFlow, which makes it easy to integrate, train, and create models that perform well in different scenarios. TensorFlow represents deep learning networks as the core of the object detection computations. The foundation of TensorFlow is the graph object, which contains a network of nodes. GraphDef objects can be created by the ProtoBuf library to save the network. For the proposed design, a pretrained model, called single-shot detection (SSD)Lite-MobileNet, from the TensorFlow detection model zoo, has been used. The model zoo is Google's collection of pretrained object detection models trained on different datasets, such as the common objects in context (COCO) dataset [28]. This model was particularly chosen for the proposed prototype because it does not require high-end processing capabilities, making it compatible with the low processing power of the Raspberry Pi. To recognize objects from the live video stream, no further training is required, since the models have already been trained on different types of objects. An image has an infinite set of possible object locations, and detecting these objects can be challenging because most of these potential locations contain different background colors, not actual objects. The SSD models use one-stage object detection, which directly predicts object bounding boxes for an image. This gives a simple and faster architecture, although the accuracy is comparatively lower than that of state-of-the-art object detection models having two or more stages.
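As an illustration of how such a pretrained model-zoo graph can be brought into memory, the sketch below loads a frozen GraphDef with the TensorFlow 1.x-style (tf.compat.v1) API. The file path is a placeholder, and the tensor names follow the standard convention of the TensorFlow object detection API; this is not code published by the authors.

```python
# Loading a pretrained detection model (frozen GraphDef) from the
# TensorFlow model zoo -- illustrative sketch.
# 'frozen_inference_graph.pb' is the file shipped with zoo models such as
# ssdlite_mobilenet_v2_coco; the path below is a placeholder.
import tensorflow as tf

PATH_TO_FROZEN_GRAPH = "ssdlite_mobilenet_v2_coco/frozen_inference_graph.pb"

detection_graph = tf.Graph()
with detection_graph.as_default():
    graph_def = tf.compat.v1.GraphDef()                 # ProtoBuf container for the network
    with tf.io.gfile.GFile(PATH_TO_FROZEN_GRAPH, "rb") as f:
        graph_def.ParseFromString(f.read())             # deserialize the saved GraphDef
    tf.compat.v1.import_graph_def(graph_def, name="")   # add its nodes to the current graph

# The imported graph exposes the standard detection tensors by name:
image_tensor = detection_graph.get_tensor_by_name("image_tensor:0")
boxes = detection_graph.get_tensor_by_name("detection_boxes:0")
scores = detection_graph.get_tensor_by_name("detection_scores:0")
classes = detection_graph.get_tensor_by_name("detection_classes:0")
```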


Fig. 4. Complete workflow of the proposed system. The hardware interface collects data from the environment. The software interfaces process the collected data and generate an output response through the audio interface. Raspberry Pi 3B+ is the central processing unit of the system.

C. Workflow of the System

Fig. 4 shows the complete workflow of the proposed system with the hardware and software interfaces. Every frame of the video is processed through a standard convolutional network to build a feature representation of the original image or frame. This backbone network is pretrained on ImageNet, in the SSD model, as an image classifier, to learn how to extract features from an image using SSD. Then, the model manually defines a collection of aspect ratios for bounding boxes at each grid cell location. For each bounding box, it predicts the offsets for the bounding box coordinates and dimensions. Along with this, the distance measurement is processed using both the depth information and the ultrasonic sensor. In addition, the reading assistant works without interrupting any of the prior processes. All three features run in the software interface with the help of the modules from the hardware interface.

D. Object Detection

The human brain focuses on regions of interest and salient objects, recognizing the most important and informative parts of an image [29]. By extracting these visual attributes [30], deep learning techniques can mimic the human brain and detect salient objects from images, video frames [31], and even optical remote sensing data [32]. A pixelwise and nonparametric moving object detection method [33] can extract spatial and temporal features and detect moving objects against an intricate background in a video frame. Many other techniques for object detection and tracking from video frames, such as object-level RGB-D video segmentation, are also commonly used [34].

For object detection, every object must be localized within a bounding box in each frame of a video input. A "region proposal system," or Regions + CNN (R-CNN), can be used [35], where, after the final convolutional layers, a regression layer is added to obtain four variables: x0, y0, and the width and height of the bounding box. This process must train a support vector machine for each class, to classify between object and background, while proposing the region in each image. In addition, a linear regression classifier needs to be trained, which outputs a correction factor. To eliminate unnecessary bounding boxes from each class, the intersection-over-union method must be applied to filter out the actual location of an object in each image. Methods used in Faster R-CNN dedicatedly provide region proposals, followed by a high-quality classifier to classify these proposals [35]. These methods are very accurate but come at a high computational cost. Furthermore, because of the low frame rate, these methods are not fit to be used on embedded devices.

Object detection can also be done by combining the two tasks into one network that produces proposals, instead of having a set of predefined boxes to look for objects. The computation already made during classification, to localize the objects, can be reused. This is achieved by using the convolutional feature maps from the later layers of a network, upon which convolutional filters are run to predict class scores and bounding box offsets at once. The SSD detector [36] uses multiple layers that provide finer accuracy on objects of different scales. As the layers go deeper, bigger objects become more visible. SSD is fast enough to infer objects in real-time video. In SSDLite, MobileNetV2 [37] is used as the backbone, with depthwise separable convolutions for the SSD layers. The SSDLite models make predictions on a fixed-size grid. Each cell in this grid is responsible for detecting objects in a location of the original input image and produces two tensors as outputs that contain the bounding box predictions for the different classes. SSDLite has several different grids ranging in size from 19 × 19 to 1 × 1 cells. The number of bounding boxes per grid cell is 3 for the largest grid and 6 for the others, making a total of 1917 boxes.
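The box count quoted above can be verified with a short calculation. Only the 19 × 19 and 1 × 1 grids are stated explicitly in the text, so the intermediate feature-map sizes used below (10, 5, 3, 2) are an assumption based on the standard SSDLite-MobileNetV2 configuration.

```python
# Anchor-box count for the SSDLite grids (assumed standard feature-map
# sizes 19x19 ... 1x1; only 19x19 and 1x1 are stated explicitly above).
grids = [19, 10, 5, 3, 2, 1]
boxes_per_cell = {19: 3}  # 3 boxes on the largest grid, 6 elsewhere

total = sum(g * g * boxes_per_cell.get(g, 6) for g in grids)
print(total)  # -> 1917
```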


Fig. 5. SSDLite-MobileNet architecture.

For the designed prototype, Google's object detection API has been used with COCO, which has 3 000 000 images of the 90 most commonly found objects. The API provides five different models, making a tradeoff between execution speed and the accuracy of the placed bounding boxes. SSDLite-MobileNet, whose architecture is shown in Fig. 5, is chosen as the object detection algorithm since it requires less processing power. SSD is designed to be independent of the base network, so it can run on MobileNet [35]. With SSDLite on top of MobileNet, we were able to get around 30 frames per second (fps), which is enough to evaluate the system in real-time test cases. In places where online access is either limited or absent, the proposed device can operate offline as well. In SSDLite-MobileNet, the "classifier head" of MobileNet, which made the predictions for the whole network, is replaced with the SSD network. As shown in Fig. 5, the output of the base network is typically a 7 × 7 pixel image, which is fed into the replacement SSD network for further feature extraction. The replacement SSD network takes not only the output of the base network but also the outputs of several previous layers. The MobileNet layers convert the pixels of the input image into features that describe the contents of the image and pass these along to the other layers.

A new family of object detectors, such as POLY-YOLO [38], DETR [39], Yolact [40], and Yolact++ [41], introduced instance segmentation along with object detection. Despite these efforts, many object detection methods still struggle with medium and large-sized objects. Researchers have, therefore, focused on proposing better anchor boxes to scale up the performance of an object detector with regard to the perception, size, and shape of the object. Recent detectors offer a smaller parameter size while significantly improving mean average precision. However, large input frame sizes limit their use in systems with low processing power.

For object detection, MobileNetV2 is used as the base network along with SSD, since it is desirable to know both high-level and low-level features by reading the previous layers. Since object detection is more complicated than classification, SSD adds many additional convolution layers on top of the base network. To detect objects in live feeds, we used a Pi camera. Our script sets paths to the model and label maps, loads the model into memory, initializes the Pi camera, and then begins performing object detection on each video frame from the Pi camera. Once the script initializes, which can take up to a maximum of 30 s, a live video stream begins and common objects inside the view of the user are identified. Next, a rectangle is drawn around the objects. With the SSDLite model and the Raspberry Pi 3 Model B+, a frame rate higher than 1 fps can be achieved, which is fast enough for most real-time object detection applications.
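A stripped-down version of such a per-frame detection loop is sketched below. It assumes the frozen graph and detection tensors loaded as in the Section III-B sketch, a COCO label map held in a hypothetical category_index dictionary, and OpenCV for frame capture and drawing; it illustrates the approach rather than reproducing the authors' script.

```python
# Per-frame detection loop (illustrative sketch). Assumes `detection_graph`
# and the tensors `image_tensor`, `boxes`, `scores`, `classes` were loaded as
# in the earlier snippet, and that `category_index` maps COCO ids to names.
import cv2
import numpy as np
import tensorflow as tf

MIN_SCORE = 0.5  # confidence threshold for drawing/announcing an object

cap = cv2.VideoCapture(0)  # Pi camera exposed as a V4L2 device (assumption)
with tf.compat.v1.Session(graph=detection_graph) as sess:
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        h, w = frame.shape[:2]
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

        # SSD expects a batch dimension: (1, height, width, 3)
        b, s, c = sess.run([boxes, scores, classes],
                           feed_dict={image_tensor: np.expand_dims(rgb, 0)})

        for box, score, cls in zip(b[0], s[0], c[0]):
            if score < MIN_SCORE:
                continue
            # Boxes are normalized [ymin, xmin, ymax, xmax]
            y0, x0, y1, x1 = (box * [h, w, h, w]).astype(int)
            label = category_index.get(int(cls), {"name": "object"})["name"]
            cv2.rectangle(frame, (x0, y0), (x1, y1), (0, 255, 0), 2)
            cv2.putText(frame, f"{label} {score:.0%}", (x0, y0 - 5),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)
        # The annotated frame (or a spoken message) is then handed to the
        # feedback stage described in the text.
```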
Fig. 6. Workflow for the reading assistant. Raspberry Pi gets a single frame from the camera module and runs it through the Tesseract OCR engine. The text output is then converted to audio.

E. Reading Assistant

The proposed system integrates an intelligent reader that allows the user to read text from any document. An open-source library, Tesseract version 4, which includes a highly accurate deep-learning-based model for text recognition, is used for the reader. Tesseract has Unicode (UTF-8) support and can recognize many languages, along with various output formats: plain text, hOCR (HTML), PDF, TSV, and invisible-text-only PDF. The underlying engine uses a long short-term memory (LSTM) network. An LSTM is a kind of recurrent neural network, a combination of unfolded layers that use cell states at each time step to predict letters from an image. The captured image is divided into horizontal boxes, and in each time step, the horizontal boxes are analyzed against the ground truth value to predict the output letter. The LSTM uses gate layers to update the cell state, at each time step, using several activation functions. Therefore, the time required to recognize text can be optimized.

Fig. 6 shows the working principle of the reading assistant. An image is captured from the live video feed without interrupting the object detection process. In the background, the Tesseract API extracts the text from the image and saves it in a temporary text file. The system then reads out the text from the text file using the text-to-speech engine eSpeak. The accuracy of the Tesseract OCR engine depends on ambient lighting and background; it usually works well with a white background and in brightly illuminated places.
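The capture-OCR-speech chain of Fig. 6 can be illustrated with the pytesseract bindings and the eSpeak command-line tool, as in the sketch below. The preprocessing steps and the temporary-file handling are our assumptions, not the authors' exact pipeline.

```python
# Reading-assistant sketch: grab one frame, run Tesseract, speak the result.
# Assumes the tesseract-ocr and espeak packages are installed on the Pi and
# that pytesseract/OpenCV are available; this mirrors Fig. 6, not exact code.
import subprocess
import tempfile

import cv2
import pytesseract

def read_aloud_from_camera(device: int = 0) -> str:
    cap = cv2.VideoCapture(device)
    ok, frame = cap.read()
    cap.release()
    if not ok:
        return ""

    # Light preprocessing helps Tesseract: grayscale + Otsu binarization
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    text = pytesseract.image_to_string(binary).strip()
    if text:
        # Save to a temporary text file, then hand it to eSpeak (as in Fig. 6)
        with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
            tmp.write(text)
            tmp_path = tmp.name
        subprocess.run(["espeak", "-f", tmp_path])
    return text
```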
IV. SYSTEM EVALUATION AND EXPERIMENTS

A. Evaluation of Object Detection


TABLE I
PERFORMANCE OF SINGLE AND MULTIPLE OBJECT DETECTION

Fig. 7. Single object detection. The object detection algorithm can detect the cell phone with 97% confidence.

Fig. 8. Detecting multiple objects, with various confidence levels, from a single frame (white boxes are added for better visibility for readers).

Our model (SSDLite) is pretrained on the ImageNet dataset for image classification. It draws a bounding box on an object and tries to predict the object type based on the trained data from the network. It directly predicts the probability that each class is present in each bounding box using the softmax activation function and a cross-entropy loss function. The model also has a background object class when it is classifying different objects. However, there can be a large number of bounding boxes detected in one frame with only background classes. To avoid this problem, the model uses hard negative mining to sample negative predictions, or downsamples the convolutional feature maps, to filter out the extra bounding boxes.

Fig. 7 shows the detection of a single object from a video stream. Although most of the image contains background, the model is still able to filter out other bounding boxes and detect the desired object in the frame, with 97% confidence. The device can also detect multiple objects, with different confidence levels, from one video frame, as shown in Fig. 8. Our model can easily identify up to four or five objects simultaneously from a single video frame. The confidence level indicates the percentage of times the system can detect an object without any failure.

Table I summarizes the results from single and multiple object detection, for 22 unique cases, consisting of either a single item or a combination of items commonly found in indoor and outdoor setups. The system can identify single items with near 100% accuracy and zero failure cases. Where multiple objects are in the frame, the proposed system can recognize each known object within the view. For any object situated in the range of 15–20 m from the user, the object can be recognized with at least 80% accuracy. The camera identifies objects based on their ground truth values (in %), as shown in Figs. 7 and 8. However, to make the device more reliable, the ultrasonic sensor is also used to measure the distance between the object and the user. Whenever there are multiple objects in front of the user, the system generates feedback for the object that is closest to the user. An object with a higher ground truth value has a higher priority. The pretrained model, however, is subject to failure due to variation in the shape and color of the object as well as changes in ambient lighting conditions.

B. Evaluation of Distance Measurement

Fig. 9 shows the device measuring the distance between a computer mouse and the blind person using the ultrasonic sensor. If the distance measured from the sensor is less than 40 cm, the user gets a voice alert saying that the object is within 40 cm. The sensor can measure distances within a range of 2–120 cm by sonar waves.


Fig. 9. Measuring the distance of a mouse from the prototype device using the ultrasonic sensor.

Fig. 10. Face detection and distance measurement from a single video frame.

Fig. 10 demonstrates the case where the combination of camera and ultrasonic sensor is used to identify a person's face and determine how far the person is from the blind user. The integration of the camera with the ultrasonic sensor, therefore, allows simultaneous object detection and distance measurement, which adds novelty to our proposed design. We have used the Haar cascade algorithm [42] to detect a face from a single video frame. It can also be modified and used for other objects. The bounding boxes, which appear while recognizing an object, consist of a rectangle. The width w, height h, and the coordinates of the rectangular box (x0, y0) can be adjusted as required.

Fig. 11. Demonstration of the distance measurement using camera and ultrasonic sensor.

Fig. 11 demonstrates how the distance between the object and the blind user can be simultaneously measured by both the camera and the ultrasonic sensor. The dotted line (6 m) represents the distance measured by the camera and the solid line (5.6 m) represents the distance calculated from the ultrasonic sensor. The width w and height h of the bounding box are defined in the .xml file with feature vectors, and they vary depending on the distance between the camera and the object. In addition to the camera, the use of the ultrasonic sensor makes object detection more reliable. The following equation, which can be derived by considering the formation of the image as light passes through the camera lens [43], is used to calculate the distance between the object and the user:

distance (inches) = (2 × 3.14 × 180) / (w + h × 360) × 1000 + 3.    (1)

TABLE II
DISTANCE MEASUREMENT BETWEEN OBJECT AND USER

The actual distance between the object and the user is measured by a measuring tape and compared with that measured by the camera and the ultrasonic sensor. Since the camera can detect a person's face, the object used in this case is a human face, as shown in Fig. 10. Table II summarizes the results. The distance measured by the ultrasonic sensor is more accurate than that measured by the camera. Also, the ultrasonic sensor can respond in real time, so it can be used to measure the distance between the blind user and a moving object. The camera, with a higher processing power and more fps, has a shorter response time. Although the camera takes slightly more time to process, both the camera and the ultrasonic sensor can generate feedback at the same time.
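A minimal illustration of the camera-side pipeline, combining OpenCV's stock frontal-face Haar cascade with (1), is given below. The helper names are ours and the operator grouping in distance_inches follows (1) as printed, so treat this as an assumption-laden sketch rather than the authors' implementation.

```python
# Face detection with OpenCV's bundled Haar cascade plus the camera-based
# distance estimate of (1). Illustrative sketch; w and h are the bounding-box
# width and height in pixels, as in the text.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def distance_inches(w: int, h: int) -> float:
    # Equation (1), reproduced with the grouping shown in the text
    return (2 * 3.14 * 180) / (w + h * 360) * 1000 + 3

def detect_faces_with_distance(frame):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    results = []
    for (x0, y0, w, h) in faces:
        cv2.rectangle(frame, (x0, y0), (x0 + w, y0 + h), (255, 0, 0), 2)
        results.append(((x0, y0, w, h), distance_inches(w, h)))
    return results
```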
C. Evaluation of Reading Assistant

The integrated reading assistant in our prototype is tested under different ambient lighting conditions for various combinations of text size, font, color, and background. The OCR engine performs better in an environment with more light, as it can easily extract the text from the captured image. When comparing text with differently colored backgrounds, it has been shown that a well-illuminated background yields better performance for the reading assistant. As given in Table III, the performance of the reading assistant is tested under three different illuminations: bright, slightly dark, and dark, using green and black colored text written on white pages. When the text color is black, the device performed accurately in bright and even slightly dark environments, but under the dark condition it failed to read the full sentence. For the green-colored text, the reading assistant had no issues in the brightly lit environment but failed to perform accurately in slightly dark and dark conditions.


TABLE III
PERFORMANCE OF THE READING ASSISTANT

D. Experimental Setup

Fig. 12. Typical outdoor test environment.

The usability and performance of the prototype device are primarily tested in controlled indoor settings that mimic real-life scenarios. Although the proposed device functioned well in a typical outdoor setting, as shown in Fig. 12, the systematic study and conclusions, discussed in the following sections, are based on the indoor setup only.

A total of 60 completely blind individuals (male: 30 and female: 30) volunteered to participate in the controlled experiments. The influence of gender or age on the proposed system is beyond the scope of our current work and has, therefore, not been investigated here. However, since gender-based blindness studies [44], [45] have shown blindness to be more prevalent among women than among men, it is important to have female blind users represented in significant numbers in the testing and evaluation of any visual aid. Dividing the 60 human samples into 30 males and 30 females to study separately could, therefore, prove useful for conducting a gender-based evaluation study of the proposed system in future endeavors. A short training session, over a period of 2 hours, was conducted to familiarize the blind participants with the prototype device. During the training, the evaluation and scoring criteria were discussed in detail.

Fig. 13. Testing the prototype in an indoor setting.

The indoor environment, as shown in Fig. 13, consisted of six stationary obstacles of different heights and a moving person (not shown in Fig. 13). The position of the stationary objects was shuffled to create ten different indoor test setups, which were assigned at random to each user. A blind individual walks from point A to point B, along the path AB (∼15 m in length), first with our proposed blind assistant, mounted on a pair of eyeglasses, and then with a traditional white cane. For both the device and the white cane, the time taken to complete the walk was recorded for each participant. Based on the time, the corresponding velocity for each participant is calculated. The results from the indoor setting, as shown in Fig. 13, are summarized and discussed in Section V.

E. Assessment Criterion

Blind participants were instructed to rate the device based on its comfort level or ease of use, mobility, and preference compared with the more commonly used traditional white cane. Ratings were done on a scale of 0–5, and the user experiences for comfortability, mobility, and preference over the white cane are divided into the following three categories based on the scores:
1) worst (score: 0–2);
2) moderate (score: 3);
3) good (score: 4 and 5).

The preferability score also refers to the likelihood that the user would recommend the device to someone else. For example, a score of 3 for preferability means that the user is only slightly impressed with the overall performance of the device, while a score of 1 means that the blind person highly discourages the use of the device. The accuracy of the reading assistant was also scored on a scale of 0–5, with 0 being the least accurate and 5 being the most. The total score, from each user, is calculated by summing the individual scores for comfort, mobility, preferability, and accuracy of the reading assistant. In the best-case scenario, each category gets a score of 5, with a total score of 20. Depending on the total score, the proposed blind assistant is labeled as "not helpful" (total score: 0–8), "helpful" (total score: 9–15), or "very helpful" (total score: 16–20). These labels were set after an extensive discussion with the blind participants prior to conducting the experiments. Almost all the blind users were participating in such a study for the first time, with no prior experience of using any form of ETA. Therefore, it was necessary to set a scoring and evaluation criterion that could be easily adopted without the need for advanced training and extensive guidelines.

V. RESULTS AND DISCUSSION

Fig. 14. Velocity of blind participants walking from point A to B in Fig. 13.

Fig. 14 plots the velocity at which each blind user completes a walk from point A to point B, as shown in Fig. 13. For each user, the speed achieved using the blind assistant and the white cane is plotted. The plots for male and female users are shown separately. Table IV lists the average velocity for the 30 male and 30 female participants.

TABLE IV
AVERAGE VELOCITY OF BLIND PARTICIPANTS

It is evident from the table that, on average, the blind assistant provides slightly faster navigation than the white cane, for both genders. To compare the performance of our proposed blind assistant and the white cane, a t-test is performed with a sample size of 60, using the following statistic:

t = |x̄b − x̄w| / √(sb²/nb + sw²/nw)    (2)

where x̄b, sb, and nb are the mean, standard deviation, and sample size, respectively, for the experiment with the blind assistant. The corresponding values for the white cane are denoted by x̄w, sw, and nw. Table V lists the values used in the t-test.

TABLE V
PARAMETERS AND VALUES USED FOR T-TEST

With a t-value equal to 4.9411, the two-tailed P value is less than 0.0001. Therefore, by conventional criteria and at the 95% confidence interval, the difference in velocity between the blind assistant and the white cane can be considered statistically significant.
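The reported t-value can be checked directly from the summary statistics of Table V using (2). The sketch below implements the same two-sample statistic; the numbers in the example call are placeholders, since Table V itself is not reproduced here.

```python
# Two-sample t statistic of (2) computed from summary statistics.
# x_b, s_b, n_b: mean, std. dev., sample size with the blind assistant;
# x_w, s_w, n_w: the same for the white cane.
from math import sqrt

def t_statistic(x_b, s_b, n_b, x_w, s_w, n_w):
    return abs(x_b - x_w) / sqrt(s_b**2 / n_b + s_w**2 / n_w)

# Example call with placeholder numbers (not the paper's data):
print(t_statistic(x_b=0.60, s_b=0.10, n_b=60, x_w=0.50, s_w=0.12, n_w=60))
```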
Fig. 15. User rating for the proposed device tested in the indoor setup of Fig. 13.

The user ratings are plotted in Fig. 15, which shows the individual scores for comfort, mobility, preference, and accuracy of the reading assistant, on a scale of 0–5, for each of the 60 users. In addition, the total score, rated on a scale of 0–20, is also shown. The average of all total scores is 14.5, which deems our proposed device "helpful" based on the criterion defined in Section IV-E. Since we only used a prototype to conduct the experiments, the comfort level was slightly compromised. However, the mobility and preference of the proposed device over the white cane gained high scores. The pretrained model that was used could be retrained with more objects for better performance. The reading assistant performed well under brightly illuminated settings. One major limitation of the reading assistant, as pointed out by the users, is that it was unable to read texts containing tables and pictures.


TABLE VI
COST OF PROPOSED DEVICE VERSUS EXISTING VISUAL AIDS

A cost analysis was done against similar state-of-the-art assistive navigation devices. Table VI compares the cost of our blind assistant with some of the existing platforms. The total cost of making the proposed device is roughly US $68, whereas some existing devices with similar performance appear more expensive. Service dogs, another viable alternative, can cost up to US $4000 and require high maintenance. Although white canes are cheaper, they are unable to detect moving objects and do not include a reading assistant.

VI. CONCLUSION

This research article introduces a novel visual aid system, in the form of a pair of eyeglasses, for the completely blind. The key features of the proposed device include the following.
1) A hands-free, wearable, low-power, low-cost, and compact design for indoor and outdoor navigation.
2) Complex algorithm processing using the low-end processing power of the Raspberry Pi 3 Model B+.
3) Dual capabilities for object detection and distance measurement using a combination of camera and ultrasound sensors.
4) An integrated reading assistant, offering image-to-text conversion capabilities, enabling the blind to read text from any document.

A detailed discussion of the software and hardware aspects of the proposed blind assistant has been given. A total of 60 completely blind users have rated the performance of the device in well-controlled indoor settings that represent real-world situations. Although the current setup lacks advanced functions, such as wet-floor and staircase detection or the use of GPS and a mobile communication module, the flexibility of the design leaves room for future improvements and enhancements. In addition, with advanced machine learning algorithms and a more refined user interface, the system can be further developed and tested in more complex outdoor environments.

REFERENCES

[1] Blindness and vision impairment, World Health Organization, Geneva, Switzerland, Oct. 2019. [Online]. Available: https://www.who.int/news-room/fact-sheets/detail/blindness-and-visual-impairment
[2] J. Bai, S. Lian, Z. Liu, K. Wang, and D. Liu, "Virtual-blind-road following-based wearable navigation device for blind people," IEEE Trans. Consum. Electron., vol. 64, no. 1, pp. 136–143, Feb. 2018.
[3] B. Li et al., "Vision-based mobile indoor assistive navigation aid for blind people," IEEE Trans. Mobile Comput., vol. 18, no. 3, pp. 702–714, Mar. 2019.
[4] J. Xiao, S. L. Joseph, X. Zhang, B. Li, X. Li, and J. Zhang, "An assistive navigation framework for the visually impaired," IEEE Trans. Human-Mach. Syst., vol. 45, no. 5, pp. 635–640, Oct. 2015.
[5] A. Karmel, A. Sharma, M. Pandya, and D. Garg, "IoT based assistive device for deaf, dumb and blind people," Procedia Comput. Sci., vol. 165, pp. 259–269, Nov. 2019.
[6] C. Ye and X. Qian, "3-D object recognition of a robotic navigation aid for the visually impaired," IEEE Trans. Neural Syst. Rehabil. Eng., vol. 26, no. 2, pp. 441–450, Feb. 2018.
[7] Y. Liu, N. R. B. Stiles, and M. Meister, "Augmented reality powers a cognitive assistant for the blind," eLife, vol. 7, Nov. 2018, Art. no. e37841.
[8] A. Adebiyi et al., "Assessment of feedback modalities for wearable visual aids in blind mobility," PLoS One, vol. 12, no. 2, Feb. 2017, Art. no. e0170531.
[9] J. Bai, S. Lian, Z. Liu, K. Wang, and D. Liu, "Smart guiding glasses for visually impaired people in indoor environment," IEEE Trans. Consum. Electron., vol. 63, no. 3, pp. 258–266, Aug. 2017.
[10] D. Dakopoulos and N. G. Bourbakis, "Wearable obstacle avoidance electronic travel aids for blind: A survey," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 40, no. 1, pp. 25–35, Jan. 2010.
[11] E. E. Pissaloux, R. Velazquez, and F. Maingreaud, "A new framework for cognitive mobility of visually impaired users in using tactile device," IEEE Trans. Human-Mach. Syst., vol. 47, no. 6, pp. 1040–1051, Dec. 2017.
[12] K. Patil, Q. Jawadwala, and F. C. Shu, "Design and construction of electronic aid for visually impaired people," IEEE Trans. Human-Mach. Syst., vol. 48, no. 2, pp. 172–182, Apr. 2018.
[13] R. K. Katzschmann, B. Araki, and D. Rus, "Safe local navigation for visually impaired users with a time-of-flight and haptic feedback device," IEEE Trans. Neural Syst. Rehabil. Eng., vol. 26, no. 3, pp. 583–593, Mar. 2018.
[14] J. Villanueva and R. Farcy, "Optical device indicating a safe free path to blind people," IEEE Trans. Instrum. Meas., vol. 61, no. 1, pp. 170–177, Jan. 2012.
[15] X. Yang, S. Yuan, and Y. Tian, "Assistive clothing pattern recognition for visually impaired people," IEEE Trans. Human-Mach. Syst., vol. 44, no. 2, pp. 234–243, Apr. 2014.
[16] S. L. Joseph et al., "Being aware of the world: Toward using social media to support the blind with navigation," IEEE Trans. Human-Mach. Syst., vol. 45, no. 3, pp. 399–405, Jun. 2015.
[17] B. Jiang, J. Yang, Z. Lv, and H. Song, "Wearable vision assistance system based on binocular sensors for visually impaired users," IEEE Internet Things J., vol. 6, no. 2, pp. 1375–1383, Apr. 2019.
[18] L. Tepelea, I. Buciu, C. Grava, I. Gavrilut, and A. Gacsadi, "A vision module for visually impaired people by using Raspberry PI platform," in Proc. 15th Int. Conf. Eng. Modern Electr. Syst. (EMES), Oradea, Romania, 2019, pp. 209–212.
[19] L. Dunai, G. Peris-Fajarnés, E. Lluna, and B. Defez, "Sensory navigation device for blind people," J. Navig., vol. 66, no. 3, pp. 349–362, May 2013.
[20] V.-N. Hoang, T.-H. Nguyen, T.-L. Le, T.-H. Tran, T.-P. Vuong, and N. Vuillerme, "Obstacle detection and warning system for visually impaired people based on electrode matrix and mobile Kinect," Vietnam J. Comput. Sci., vol. 4, no. 2, pp. 71–83, Jul. 2016.
[21] C. I. Patel, A. Patel, and D. Patel, "Optical character recognition by open source OCR tool Tesseract: A case study," Int. J. Comput. Appl., vol. 55, no. 10, pp. 50–56, Oct. 2012.
[22] A. Chalamandaris, S. Karabetsos, P. Tsiakoulis, and S. Raptis, "A unit selection text-to-speech synthesis system optimized for use with screen readers," IEEE Trans. Consum. Electron., vol. 56, no. 3, pp. 1890–1897, Aug. 2010.
[23] R. Keefer, Y. Liu, and N. Bourbakis, "The development and evaluation of an eyes-free interaction model for mobile reading devices," IEEE Trans. Human-Mach. Syst., vol. 43, no. 1, pp. 76–91, Jan. 2013.
[24] B. Andò, S. Baglio, V. Marletta, and A. Valastro, "A haptic solution to assist visually impaired in mobility tasks," IEEE Trans. Human-Mach. Syst., vol. 45, no. 5, pp. 641–646, Oct. 2015.
[25] V. V. Meshram, K. Patil, V. A. Meshram, and F. C. Shu, "An astute assistive device for mobility and object recognition for visually impaired people," IEEE Trans. Human-Mach. Syst., vol. 49, no. 5, pp. 449–460, Oct. 2019.
[26] F. Lan, G. Zhai, and W. Lin, "Lightweight smart glass system with audio aid for visually impaired people," in Proc. IEEE Region 10 Conf., Macao, China, 2015, pp. 1–4.
[27] M. M. Islam, M. S. Sadi, K. Z. Zamli, and M. M. Ahmed, "Developing walking assistants for visually impaired people: A review," IEEE Sens. J., vol. 19, no. 8, pp. 2814–2828, Apr. 2019.
[28] T.-Y. Lin et al., "Microsoft COCO: Common objects in context," Feb. 2015. [Online]. Available: https://arxiv.org/abs/1405.0312
[29] J. Han et al., "Representing and retrieving video shots in human-centric brain imaging space," IEEE Trans. Image Process., vol. 22, no. 7, pp. 2723–2736, Jul. 2013.


[30] J. Han, K. N. Ngan, M. Li, and H. J. Zhang, "Unsupervised extraction of visual attention objects in color images," IEEE Trans. Circuits Syst. Video Technol., vol. 16, no. 1, pp. 141–145, Jan. 2006.
[31] D. Zhang, D. Meng, and J. Han, "Co-saliency detection via a self-paced multiple-instance learning framework," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 5, pp. 865–878, May 2017.
[32] G. Cheng, P. Zhou, and J. Han, "Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images," IEEE Trans. Geosci. Remote Sens., vol. 54, no. 12, pp. 7405–7415, Dec. 2016.
[33] Y. Yang, Q. Zhang, P. Wang, X. Hu, and N. Wu, "Moving object detection for dynamic background scenes based on spatiotemporal model," Adv. Multimedia, vol. 2017, Jun. 2017, Art. no. 5179013.
[34] Q. Xie, O. Remil, Y. Guo, M. Wang, M. Wei, and J. Wang, "Object detection and tracking under occlusion for object-level RGB-D video segmentation," IEEE Trans. Multimedia, vol. 20, no. 3, pp. 580–592, Mar. 2018.
[35] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, Jun. 2017.
[36] W. Liu et al., "SSD: Single shot multibox detector," in Proc. Eur. Conf. Comput. Vision, vol. 9905, Sep. 2016, pp. 21–37.
[37] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "MobileNetV2: Inverted residuals and linear bottlenecks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Salt Lake City, UT, USA, 2018, pp. 4510–4520.
[38] P. Hurtik, V. Molek, J. Hula, M. Vajgl, P. Vlasanek, and T. Nejezchleba, "Poly-YOLO: Higher speed, more precise detection and instance segmentation for YOLOv3," May 2020. [Online]. Available: http://arxiv.org/abs/2005.13243
[39] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, "End-to-end object detection with transformers," May 2020. [Online]. Available: http://arxiv.org/abs/2005.12872
[40] D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee, "YOLACT: Real-time instance segmentation," in Proc. IEEE/CVF Conf. Comput. Vision, Seoul, South Korea, 2019, pp. 4510–4520.
[41] D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee, "YOLACT++: Better real-time instance segmentation," Dec. 2019. [Online]. Available: https://arxiv.org/abs/1912.06218
[42] R. Padilla, C. C. Filho, and M. Costa, "Evaluation of Haar cascade classifiers designed for face detection," Int. J. Comput., Elect., Autom., Control Inf. Eng., vol. 6, no. 4, pp. 466–469, Apr. 2012.
[43] L. Xiaoming, Q. Tian, C. Wanchun, and Y. Xingliang, "Real-time distance measurement using a modified camera," in Proc. IEEE Sensors Appl. Symp., Limerick, Ireland, 2010, pp. 54–58.
[44] L. Doyal and R. G. Das-Bhaumik, "Sex, gender and blindness: A new framework for equity," BMJ Open Ophthalmol., vol. 3, no. 1, Sep. 2018, Art. no. e000135.
[45] M. Prasad, S. Malhotra, M. Kalaivani, P. Vashist, and S. K. Gupta, "Gender differences in blindness, cataract blindness and cataract surgical coverage in India: A systematic review and meta-analysis," Brit. J. Ophthalmol., vol. 104, no. 2, pp. 220–224, Jan. 2020.
[46] M. Rajesh et al., "Text recognition and face detection aid for visually impaired person using Raspberry PI," in Proc. Int. Conf. Circuit, Power Comput. Technol., Kollam, India, 2017, pp. 1–5.
