Suspicious Activity Detection Using Deep Learning Approach
Department of Electronics and Communication Engineering
Visvesvaraya National Institute of Technology
Nagpur, India
barsagadekshitij358@gmail.com, s.tabhane123@gmail.com
Abstract—Video surveillance plays a pivotal role in today's world. The technology has advanced considerably now that artificial intelligence, machine learning, and deep learning have been brought into surveillance systems. Using combinations of these techniques, systems are now in place that help distinguish various suspicious behaviors in live footage. Human behavior is highly unpredictable, and it is very difficult to determine whether it is suspicious or normal. In this paper, we classify human activities into two categories: normal and suspicious. Normal activities include sitting, walking, jogging, hand waving, etc. Suspicious activities include running, boxing, fighting, etc. We achieve this classification using convolutional neural networks. First, the convolutional neural network is used to extract high-level features from images. The output of the final pooling layer is then extracted and used to make the final prediction.
Index Terms—suspicious activity, deep learning, convolutional neural network
I. INTRODUCTION
In today's world, crime has escalated despite the presence of surveillance cameras everywhere. To detect suspicious behavior, a model must be developed that reduces the time taken to detect it so that action can be taken promptly. In this case, the surveillance camera output is in the form of video. Splitting the video into frames and then processing those images is the easiest way to handle it [1], [2]. Many machine learning approaches are available today to process images, but their accuracy decreases as the dataset grows larger, so we turned to deep learning algorithms.
In public infrastructures such as parking lots, jails, military bases, mosques, borders, and public transit stations, automated video monitoring can help deter harm caused by overcrowding, people fighting each other, people carrying arms or bombs that could be used to harm others, robbery, vandalism, and so on [3], [4].
Video monitoring is an important part of enhancing the security of banks and ATMs [5], [6]. The presence of automatic surveillance cameras in banks aids in the prevention of armed robberies and heists. ATMs are a common target for thieves, and automatic surveillance cameras may help to improve their security [7], [8].
Surveillance cameras may aid in the detection of disruptive conduct among students on campus, such as bullying and fighting [9], [10]. They can also help strengthen the campus's anti-theft protection. In examination halls, automated security cameras are used to identify suspicious behavior by students, such as stealing and copying [11].
Security cameras are increasingly being used in small businesses, factories, and shopping centers [12], [13]. They are used to apprehend shoplifters and robbers, as well as to deter armed robberies [14], [15]. Security cameras are also used to track supplies and inventory held in warehouses and to detect employee bribery and theft [16], [17].
In railway and bus stations, security cameras monitor platforms, routes, roads, tunnels, and parking lots. Terrorists may use these areas as a staging ground for attacks by leaving behind a bag containing explosives [4], [18]. Automated security cameras can detect abandoned bags and warn officials, who can then remove them to protect passengers and facilities [19].
Video monitoring can be used to keep an eye on patients in hospitals and elderly people in their homes [20]. It is capable of detecting abnormal behavior in patients, such as vomiting, fainting, or any other irregular behavior [14]. Given this wide range of applications, we must devise a method for detecting suspicious activities in videos [21], [22]. The remainder of the paper is organized as follows: the literature survey is discussed in Section II, activity classification and the CNN models are presented in Section III, the proposed framework is presented in Section IV, details of the dataset are presented in Section V, results are discussed in Section VI, and we conclude the paper in Section VII.
Authorized licensed use limited to: M S RAMAIAH INSTITUTE OF TECHNOLOGY. Downloaded on November 21,2023 at 09:52:11 UTC from IEEE Xplore. Restrictions apply.
1st IEEE International conference on Innovations in High-Speed Communication and Signal Processing (IEEE-IHCSP) 4-5 March, 2023
II. LITERATURE SURVEY
A. Security camera research for detecting violent activity
In this section, we review some of the research on detecting violent behavior in security camera footage. Examples of such behavior include fighting, vandalism, punching, kicking, scratching, peeping, and shooting.
A non-tracking, real-time algorithm for detecting suspicious behavior, which is very useful in crowded public areas, was presented in [23]. Instead of tracking objects, the algorithm monitors low-level measurements at a set of fixed spatial locations. Its downside is that it does not provide sequential tracking.
Willim et al. [24] used contextual information to identify suspicious behavior. Their approach had three components: a data stream clustering algorithm, an inference algorithm, and a context space model. The data stream clustering algorithm allowed information to be updated continuously from incoming videos. The inference algorithm makes a decision based on a combination of contextual information and the system's knowledge. The framework was evaluated on two datasets: two clips from the Queensland University of Technology's Z-Block dataset and 23 clips from the CAVIAR dataset. The method achieved an AUC of 0.787 with an error of 0.135.
Ghazal et al. [25] showed that videos could be used to detect vandalism such as graffiti and theft. The authors used a background model and an additive Gaussian model for segmentation. A frame difference is computed between the current frame and the background model. Low-pass filtering (LPF) with adaptive thresholding, contour tracing, and morphological edge detection are used to obtain the main features of each region as well as its color histogram. Shape and motion features are then used to track objects.
Gowsikhaa et al. [26] detected malpractice in examination halls. They used the student's head position to detect activities such as theft, passing sheets of paper between students, and conversing with other students. They did so by combining adaptive background subtraction with sequential and periodic modeling of the background. Their system, however, could not handle occlusion.
Tripathi et al. [16] presented a model that detects suspicious behavior at ATMs, such as forcefully taking money and fights between customers, and activates an alarm if such activity is detected. The main features of the videos were extracted using motion history images (MHI) and Hu moments. The dimension of the features is reduced using PCA, and the features are classified using an SVM classifier. A study of window size based on MHI was also carried out.
B. Research in theft detection in surveillance cameras
Akdemir et al. [19] proposed an ontology-based approach to identifying human behavior in banks and other places. The authors used design consistency, ontology consistency, minimal coding bias, extensibility, and minimal ontology binding as criteria. The model was tested on six videos, four of which depicted robbery and two of which depicted normal behavior. Many of the videos include footage from inside a bank. Color-based motion and appearance are used to track the moving object. The presented model reliably detects robbery using a single-threaded ontology, but its key flaw is that it cannot detect robberies involving more than one person. Chuang et al. [20] used a fuzzy k-means algorithm based on ratio histograms to recognize suspicious behavior. Suspicious activity was identified using a Gaussian mixture model (GMM). Objects are detected in this model using the commonly used ratio histogram, and a fuzzy color histogram was used to address the problem of color similarity. Abnormal behaviors were detected by tracking the transferring state.
C. Security camera research for detecting abandoned objects
Abandoned object detection can be difficult, particularly in densely populated areas where the object may be partially or fully obscured from the cameras' view. Many researchers have focused on detecting abandoned objects using surveillance cameras in order to protect people and public facilities from explosives that may be hidden in a bag.
Sacchi and Regazzoni [27] proposed a model that uses security camera footage to detect an object left behind at a train station. If a left-behind object is detected, an alarm is activated in the nearby station and the proper authorities are notified, allowing the danger to be avoided. The model uses direct-sequence code-division multiple access (DS-CDMA) to create a noise-tolerant system and ensure a secure connection between remotes and stations. The model is designed to work with monochrome cameras. Using color images could improve its handling of false alarms and accidentally identified objects, but this comes with the major disadvantage of increasing the system's computational time, so it could no longer be used as a real-time system.
Ellingsen [28] proposed a model that uses mean pixel intensity and pixel standard deviation to detect dropped objects. A foreground image is formed by subtracting a frame from a background image containing multiple objects. This approach is used to find moving objects. The features extracted to locate the object dropped by an individual include area, minor axis, major axis, and center of mass, because they carry sufficient information about the object. The model relies on a learning mechanism and an automated classifier of feature vectors.
In this paper, we used many videos from real-world surveillance cameras, as well as some videos from the CAVIAR dataset, to train and test our system. Human behaviors are divided into three categories: common, suspicious, and unusual. Common activities include sitting, walking, jogging, and hand waving. Suspicious activities include running, boxing, and fighting. Convolutional neural networks are used to perform this classification. First, high-level features are extracted from images using a convolutional neural network. The output of the final pooling layer is then extracted and used to make the final prediction.
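The grouping of activities into classes described above can be written down as a simple label map that is applied to each clip's activity name before training. The following is a minimal sketch, assuming the two-class scheme (normal vs. suspicious) used in the rest of the paper; the dictionary, the integer encoding (0 = normal, 1 = suspicious), and the function name `label_for` are illustrative assumptions rather than the authors' code.

```python
NORMAL, SUSPICIOUS = 0, 1  # assumed integer encoding, not taken from the paper

# Activity names as listed in the text; the exact key spelling is an assumption.
ACTIVITY_TO_LABEL = {
    "sitting": NORMAL,
    "walking": NORMAL,
    "jogging": NORMAL,
    "hand waving": NORMAL,
    "running": SUSPICIOUS,
    "boxing": SUSPICIOUS,
    "fighting": SUSPICIOUS,
}


def label_for(activity):
    """Map an activity name to its binary class, ignoring case and hyphens."""
    key = activity.strip().lower().replace("-", " ")
    if key not in ACTIVITY_TO_LABEL:
        raise KeyError(f"activity not covered by the paper's lists: {activity!r}")
    return ACTIVITY_TO_LABEL[key]


print(label_for("Hand-waving"), label_for("boxing"))
```

Activities that the text does not assign to either class, such as clapping in the KTH dataset, raise an error here instead of being silently guessed.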
III. PRELIMINARIES
A. Activity Classification
We classify human activities under two categories: normal activities and suspicious activities. In the context of this work, and according to the dataset that we have used to train our model, we have defined the following human activities as normal:
• Walking
• Jogging
• Hand-waving
All these activities were shot from cameras at different angles in the dataset.
Fig. 1. Various forms of activities.
Activities like boxing and fighting are categorized as suspicious activities. Sample images for various forms of activities are shown in Fig. 1.
B. Convolutional Neural Network
Convolutional Neural Networks (CNNs) are deep learning networks that are commonly used as image classifiers. A CNN takes an image as input and assigns importance to different objects in the image so that different classes of images can be distinguished. A sample CNN architecture is shown in Fig. 2.
Fig. 2. Convolutional Neural Network
In classification, convolutional neural networks apply various filters to images to capture their spatial and temporal dependencies. Since it uses fewer parameters than a fully connected ANN and allows weights to be reused, a CNN is able to fit an image dataset better. As a result, a CNN can be trained to understand the complexity of an image far better than traditional ANNs.
IV. CNN MODEL FOR SUSPICIOUS ACTIVITY DETECTION
Various CNN models have been proposed depending on the target outputs. LeNet is among the earliest networks proposed for image-processing applications. The architecture consists of seven layers: two convolutional layers, two average pooling layers, and a flattening layer, followed by two dense, fully connected layers, the last of which is a softmax classifier. A graphical representation of the LeNet CNN architecture is presented in Fig. 3.
• Layer 1: A convolutional layer with a kernel size of 3×3, a stride of 1×1, and a total of 6 kernels. A 28×28×1 input image therefore yields a 26×26×6 output. The number of parameters is 6 × (3 × 3 + 1) = 60, where the +1 accounts for the bias of each kernel.
• Layer 2: A pooling layer with a 2×2 kernel, a stride of 2×2, and 6 feature maps. This pooling layer works in a particular way: the input values in each receptive field are summed, multiplied by a trainable coefficient (1 per filter), and added to a trainable bias (1 per filter). Finally, a tanh activation is applied to the output. As a result, the 26×26×6 input from the previous layer is sub-sampled to 13×13×6. Total parameters in the layer: [1 (trainable coefficient) + 1 (trainable bias)] × 6 = 12.
• Layer 3: A convolutional layer with the same configuration as Layer 1, except that it has 16 filters instead of 6. The 13×13×6 input from the previous layer yields an 11×11×16 output. Total parameters in the layer: 3 × 3 × 6 × 16 + 16 = 880.
• Layer 4: Like Layer 2, this is a pooling layer, but with 16 feature maps. The outputs are passed through the tanh activation function. The 11×11×16 input from the previous layer is sub-sampled to 5×5×16. Total parameters in the layer: (1 + 1) × 16 = 32.
• Layer 5: The data is then flattened, yielding 400 neurons (5×5×16).
• Layer 6: A dense layer with 128 units and tanh activation. Total parameters: 400 × 128 + 128 = 51328.
• Layer 7: Finally, a dense layer with two units is used, which is a fully connected softmax output layer.
V. DATASETS
We took several videos from real-life surveillance cameras, along with videos from the KTH dataset and the Nanyang Technological University CCTV-Fights dataset, for training, and created our own dataset for testing. We classify human activities into two categories: normal and suspicious.
Common activities include sitting, walking, jogging, and waving. Suspicious activities include running, boxing, fighting, etc. We achieve this classification by using convolutional neural networks. First, we use a convolutional neural network to extract high-level features from the image. The output of the final pooling layer is then extracted and used to make the final prediction.
Videos were split into frames, which were then loaded as NumPy arrays together with their labels. When working with large datasets, the data must be loaded several times. To keep things simple, we create a single pickle file that contains all of the NumPy arrays and their labels, allowing us to load them quickly every time. The model architecture was implemented in Python using TensorFlow and Keras.
Most papers that use a deep learning method only detect suspicious behavior. An effective mechanism is therefore needed to notify security personnel in the event of any suspicious activity. When the system detects suspicious activity, it sends an SMS to the appropriate authorities. This framework was built in Python on an open-source platform.
SMS messages can be sent by creating an account with Twilio and installing the Twilio library in Python. Twilio allows you to programmatically make and receive phone calls and send and receive text messages. Twilio provides a dedicated phone number as well as an ACCOUNT SID and AUTH TOKEN for sending and receiving text messages.
The KTH database contains six types of human actions: walking, jogging, running, boxing, waving, and clapping. It is an open dataset available online for the recognition of human actions. The dataset was then split; since it contains videos, we extracted frames to obtain images in JPEG format.
The Nanyang Technological University CCTV-Fights dataset includes 1,000 videos of real-world fights captured on security cameras or cell phones. Videos for the dataset were gathered from YouTube. The fights cover a wide range of actions and characteristics, such as punching, kicking, pushing, grappling, and fights involving two or more people. The dataset contains 280 CCTV videos of various forms of fighting, ranging in length from 5 seconds to 12 minutes with an average duration of 2 minutes. Also included are 720 videos of live action from other sources (hereafter referred to as Non-CCTV).
VI. RESULTS & DISCUSSION
The proposed framework was implemented in Python 3.2 with the OpenCV library. The system hardware specifications are as follows: Intel(R) Core(TM) i5-8300H @ 2.30GHz, 8.00GB RAM, Windows operating system (64-bit).
The training dataset consisted of 10700 frames of non-suspicious (safe) activity and 96800 frames of suspicious activity. The output shown in Table III was calculated using a testing dataset of 20000 frames of non-suspicious (safe) and suspicious behavior.
Table I presents the performance on the test set. The confusion matrix for the training data and the corresponding performance are presented in Table IV and Table II, respectively.
VII. CONCLUSION & FUTURE SCOPE
In this article, we presented a detection tool based on frames extracted from videos and deep learning-based algorithms. To detect activity, this novel method requires only minimal computational resources. It is versatile and portable because it needs no special hardware components. As a result, this cost-effective tool can be easily
Performance parameters
Total images | True positive | True negative | False positive | False negative | Accuracy (%) | Recall (%) | Specificity (%) | Precision (%)
20222 | 17842 | 2092 | 111 | 177 | 98.57 | 99.01 | 94.96 | 99.38

Performance parameters
Total training images | True positive | True negative | False positive | False negative | Accuracy (%) | Recall (%) | Specificity (%) | Precision (%)
80884 | 72343 | 8494 | 27 | 20 | 99.94 | 99.97 | 99.68 | 99.96
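As a cross-check on the two tables above, the reported rates can be recomputed from the four confusion-matrix counts using the standard definitions of accuracy, recall, specificity, and precision. The sketch below is illustrative only; `confusion_metrics` is a hypothetical helper name and is not part of the original implementation.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Return the sample count and the four rates (as percentages)."""
    total = tp + tn + fp + fn
    return {
        "total": total,
        "accuracy": 100.0 * (tp + tn) / total,
        "recall": 100.0 * tp / (tp + fn),        # true positive rate
        "specificity": 100.0 * tn / (tn + fp),   # true negative rate
        "precision": 100.0 * tp / (tp + fp),
    }


# Counts copied from the two tables above.
first_table = confusion_metrics(tp=17842, tn=2092, fp=111, fn=177)
training_table = confusion_metrics(tp=72343, tn=8494, fp=27, fn=20)

for name, metrics in (("first table", first_table), ("training table", training_table)):
    print(name, {key: round(value, 2) for key, value in metrics.items()})
```

The recomputed values agree with the tables to within 0.01 percentage points. The small differences in accuracy and recall for the first table (98.58 vs. 98.57 and 99.02 vs. 99.01) are consistent with the reported figures having been truncated rather than rounded.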
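The layer-by-layer output shapes and parameter counts given in Section IV can also be checked with a few lines of arithmetic. The sketch below is a minimal, framework-free walk-through under the assumptions stated in that section (28×28×1 input, 3×3 kernels with stride 1 and no padding, 2×2 pooling with stride 2 and one trainable coefficient and one bias per feature map). The helper names are hypothetical, and the count for the final two-unit layer is derived here rather than quoted from the paper.

```python
def conv_layer(h, w, c_in, filters, k=3):
    """k x k convolution, stride 1, no padding: output shape and parameters."""
    params = filters * (k * k * c_in + 1)  # +1 is the bias of each filter
    return (h - k + 1, w - k + 1, filters), params


def pool_layer(h, w, c, size=2):
    """2x2 sub-sampling, stride 2, one coefficient and one bias per map."""
    return (h // size, w // size, c), 2 * c


def dense_layer(n_in, units):
    """Fully connected layer: one weight per input-unit pair plus biases."""
    return units, n_in * units + units


shape = (28, 28, 1)
report = []
# Layers 1-4: conv(6), pool, conv(16), pool.
for kind, filters in (("conv", 6), ("pool", None), ("conv", 16), ("pool", None)):
    if kind == "conv":
        shape, params = conv_layer(*shape, filters=filters)
    else:
        shape, params = pool_layer(*shape)
    report.append((kind, shape, params))

flat = shape[0] * shape[1] * shape[2]       # Layer 5: flatten
units6, params6 = dense_layer(flat, 128)    # Layer 6: dense, 128 units
units7, params7 = dense_layer(units6, 2)    # Layer 7: softmax, 2 units
report += [("flatten", flat, 0), ("dense", units6, params6), ("dense", units7, params7)]

for row in report:
    print(row)
total_params = sum(params for _, _, params in report)
print("total parameters:", total_params)
```

Running it reproduces the 60, 12, 880, 32, and 51328 figures from Section IV and gives 258 parameters for the two-unit output layer, for 52570 trainable parameters in total.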