Technical Paper 2

An AI-Based Student Tracking System for In-Depth
Analysis of Student Behaviour

Nampalli Shiva Kumar, Undergraduate, Dept. of CSE-AIML, MLR Institute of Technology, Hyderabad, India
J Vijay Gopal, Associate Professor, Dept. of CSE-AIML, MLR Institute of Technology, Hyderabad, India
Vundela Vamsi, Undergraduate, Dept. of CSE-AIML, MLR Institute of Technology, Hyderabad, India
Munigela Raviteja, Undergraduate, Dept. of CSE-AIML, MLR Institute of Technology, Hyderabad, India
K. Sai Prasad, HOD, Dept. of CSE-AIML, MLR Institute of Technology, Hyderabad, India
G Abhinav Goud, Undergraduate, Dept. of CSE-AIML, MLR Institute of Technology, Hyderabad, India

Abstract: The current trend of offline education has created difficulties in monitoring and supervising student behavior, resulting in disturbances and diversions during offline sessions. While several algorithms recognize these behaviors, their accuracy and efficiency are limited, and most focus on single items. To solve this, we present a new AI-powered Student Tracking System that will revolutionize behavior analysis, attendance management, and incident detection in educational institutions with real-time monitoring. The system uses artificial intelligence (AI) methods, including Convolutional Neural Networks (CNNs), OpenCV, a Face Recognition module, and YOLOv8, to record real-time insights regarding student behaviors such as napping in class, using mobile phones, and engaging in irregular activities, and it sends real-time alerts using Twilio. The primary procedure is image recognition using strategically positioned cameras, with the AI model trained to detect and categorize diverse activities quickly. The model is based on the YOLOv8 algorithm, which can process real-time video at a high frame rate (frames per second, FPS). The system's key advantages include its capacity to accurately identify and record problems, particularly those involving property damage, allowing for timely actions and fostering responsibility. The user-friendly interface acts as a single hub for instructors and administrators, providing real-time information on student conduct, attendance, and reported occurrences. The system evolves through continuous development, ensuring that it remains successful in addressing changing educational settings and dynamic student behaviors.

Keywords: Abnormal Event Detection, AI-powered Student Tracking System, Attendance Management, Behavior Analysis, Convolutional Neural Networks (CNNs), Incident Detection, OpenCV, YOLOv8 (You Only Look Once).

I. INTRODUCTION

This study highlights the importance of ensuring educational quality and student safety. It examines the transition from manual surveillance, which relied on human observations and was limited in scale, to automated behavior detection using technological advancements.

A teacher's duty is to explain the concepts of each subject in a simplified way that students can absorb. Beyond that, teachers have regular duties that may require continuous monitoring. CCTV cameras have been installed to track and supervise pupils, which is a hard task [9,1]. Currently, these systems require human oversight. Managing enormous amounts of video data may be challenging, leading to tiredness and the possibility of missing important occurrences. Manually monitoring closed-circuit television cameras is impractical.

[14] Previously, student behavior analysis depended mainly on instructor observations, which was time-consuming and unsuitable for meeting large-scale practical needs. There is a pressing demand for automatic student behavior detection for thorough analysis. Various tools and methodologies are available for this purpose. In this rapidly growing AI era, algorithms and models that automate the entire process, without mandatory human intervention, can increase the efficiency and quality of the mentoring that teachers give to students. To address these issues, we suggest combining cutting-edge technology such as Artificial Intelligence (AI).

[15] High-level processing employs many machine learning approaches, such as Naive Bayes, multi-layer perceptrons, iterative classifier optimizers, decision trees, and random forests. In addition to machine learning, probabilistic approaches like the Hidden Markov Model and Bayesian classifier are also used. These algorithms are useful for object detection, although their accuracy is restricted. [9] Deep learning has grown in popularity due to increased access to GPUs, CPUs, and large datasets. [7] To accumulate the enormous data required for deep learning, wireless sensor networks use sensors to monitor user and environmental parameters, including temperature, sound, and video. The system analyzes sensor data to identify patterns and provide tailored services based on user activity. The implementation includes using IP cameras to cover the rooms and a NAS to securely store sensor data.

Our solution uses YOLOv8 (You Only Look Once, version 8) to build a strong and comprehensive Student Tracking System. YOLOv8 internally uses CNNs, which are essential for object detection [9,1,21]. CNNs rely largely on Computer Vision (CV) to understand and analyze visual input efficiently. Computer vision technologies involve modeling surroundings, detecting motion, categorizing moving objects, tracking, interpreting behavior, and synthesizing data from several cameras [27]. Convolutional neural networks are a key architecture in deep learning models, particularly for visual data processing. Deep learning models can extract features and create complex representations of visual data. This trait makes them more inclusive when feature extraction is automated.

CNNs can recognize visual patterns based on individual pixels in a picture. LSTM models [9,26] are increasingly used in video streaming due to their capacity to identify long-term relationships within the sequence, improving understanding of temporal data. The capacity of LSTM networks to preserve contextual information enhances their effectiveness in video analysis. According to [2,23], it is impossible to extract features directly from many moving entities, so this step is bypassed completely. Additionally, the earliest phases do not require background removal or subtraction processes.

All monitoring data is collected from CCTV cameras already placed inside the campus, and the data is preprocessed using OpenCV. YOLOv8 enables real-time analysis, identifying behaviors like sleeping, using phones, violence, or property damage, and triggers immediate alerts. A user-friendly interface, planned as future scope, will present these insights, allowing instructors to monitor behavior and attendance effectively and ensuring a positive learning environment in physical classrooms.

Outline of the paper

This paper unfolds as follows: Section II conducts a literature survey, exploring various approaches and methodologies employed in the study of human behavior using AI within offline educational contexts. Section III outlines the specific problem formulation addressed by the proposed Student Tracking System, highlighting the objectives and scope of the research within traditional classroom settings. Section IV delves into the detailed solution offered by the system, presenting its architecture and simulated results. Section V discusses the implications of the proposed system, including a comparison table and simulated outcomes. Finally, Section VI concludes the paper by summarizing key findings and outlining avenues for future research in the realm of offline student behavior monitoring.

II. LITERATURE SURVEY

[1] This paper presents a technique for determining the validity of visual patterns and behaviors from a limited number of samples, regardless of whether they have previously been recorded. This method uses a graph-based Bayesian approach to quickly find patches on a wide scale, with limits on their layout and descriptions. They presented a unified framework to address several computer vision difficulties that were previously handled separately. These include paying attention to sights, watching videos, identifying suspicious actions, and noticing unusual things.

[3] In this study, they suggested a clustering-based method for detecting odd occurrences in CCTV footage. To avoid overfitting, they utilized a multi-sample-based correlation metric. This involves training hidden Markov models on many comparable samples. The dynamic hierarchical clustering algorithm acquires various training data and corrects overfitting-related clustering mistakes.

[2] The authors developed a way to address inconsistencies in the classic linear transformation method, which is insufficient for changing spatial location or phase. Kohonen introduced Invariant Subspaces Analysis (ISA), which processes the full video image without requiring an initial step to extract moving objects. An infinite hidden Markov model (iHMM) was used to describe the time-evolving properties of these features. This model can adapt to the complexity of different datasets. The model does not require parameters indicating abnormal behavior. The sequence guides the exploration and adoption of each observation set. This study compares two models: Markov chain Monte Carlo (MCMC) and Variational Bayes (VB). While VB is a cost-effective alternative to MCMC, it does not accurately detect anomalous scenarios during testing.

[4] Infinite hidden Markov models (iHMMs) are used for feature extraction. The Bayesian inference method generates a posterior distribution based on the appropriate number of HMM states, rather than selecting a predetermined number and using feature extraction methods like shift-invariant wavelets (SIWs) at each decomposition level. Wavelet coefficients after the second are discarded. ICA and ISA are used for feature extraction. Research indicates that ISA characteristics are more effective for detecting anomalies. This paper presents a flexible approach that may be applied to various surroundings and moving things. Additionally, they want to prevent issues with background removal.

[5] According to this paper, the model accurately recognizes characteristics in image sequences. This strategy eliminates the requirement for manual inspection to ensure a correct sequence. It teaches a computer system to recognize normal processes and identify errors in image sequences. To retain quality and efficiency, high-dimensional data is reduced to a basic form using HMM and GMM. [6] The authors advocated using behavior profiling to detect anomalies in several scenarios, including airplane docking and corridor entrances/exits. The authors propose two frameworks: dynamic Bayesian networks (DBNs) for unsupervised learning of behavior classes to identify problems using unknown data, and the expectation-maximization (EM) algorithm for segmenting and representing behavior patterns. The authors evaluated the performance of models trained on unlabeled and labeled data to show the usefulness of their technique.

[7,30] This study examines Bayesian networks, which are graphical models that depict probabilistic relationships be-
tween random variables. The paper uses Bayesian networks to create an inference model that identifies anomalous events. The graphical model is composed of several random variables, including voice power, velocity variance, and direction variation. The goal is to create a structure for the graphical model that can accurately detect anomalous events in an acceptable length of time. Creating a Bayesian network involves two phases: structure learning and parameter learning. Structure learning can be achieved by traditional search approaches like hill climbing or by collaborating with subject matter experts to create a model. Historical data is used to generate the conditional probability distribution for parameter learning.

[9,31] This technique uses VGG-16 (Visual Geometry Group), a previously constructed CNN model trained on the ImageNet dataset. The video-based system has been trained to anticipate behavior. The model predicts human behavior in videos, facilitating surveillance. The architecture involves multiple phases, including video acquisition, pre-processing, feature extraction, classification, and prediction. [31] To send SMS messages after detecting unusual activity, create a Twilio account and install the Twilio package for Python. Twilio enables programmable phone calls and text message communication.

[11,32] We investigated the suggested weakly supervised aberrant behavior detection approach using the CAVIAR and Crossing databases. The suggested method is compared against four different methods, including adaptive sparse representations, dense trajectory-based identification, deep learning of appearance and motion, and spatiotemporal convolutional neural network detection. The experimental results indicate that the recommended technique is successful.

[12] Optical flow in video analysis uses the Lucas-Kanade method in C++ and OpenCV to track object motion and minimize noise through local flow constancy. Magnitudes are calculated in zones, resulting in histograms for analysis. To optimize efficiency and memory, thresholds for normal and abnormal scenarios were determined by analyzing optical flow under various settings using OpenCV's machine learning and image processing.

[8,31] This work proposes a technique that uses a photograph as input to identify each sequence and remove the background. A CNN generates a mixture of Gaussians (MoG) feature for each frame. It can distinguish between regular and aberrant frames. The collected characteristics are loaded into an LSTM, which can learn long-term dependencies in images and videos, and each frame is classified as normal or abnormal using a linear SVM. The LSTM computes input features for the SVM. It uses binary classification algorithms to generate output.

[13] This model combines a CNN for feature extraction and an LSTM to recognize sequential patterns. The CNN processes picture features from frame sequences, while the LSTM recalls key information for finding patterns. This model's 86% accuracy beats previous detection methods, making it a valuable tool for detecting odd behavior in a university setting. The CNN employs visual frames to extract characteristics, while the LSTM stores significant information over time through gating. [29] A proposed strategy combines a convolutional neural network (CNN) with long short-term memory (LSTM) to enhance surveillance system accuracy in detecting anomalous behavioral patterns.

[14] This study used a Faster R-CNN object detection system. It is made up of a ResNet-101 feature extraction network and a detection head. The detection head categorizes and regresses the network's proposed ROIs. The scale-aware detection head identifies objects of different sizes with various dilation rates. This addresses scale variation in the student behavior dataset and uses online hard example mining (OHEM) to discover small items and handle class imbalances. The suggested approach improved mean average precision (mAP) by 3.4% on a genuine corpus, indicating its potential for practical applications.

[15,7] The technique includes modeling the environment using IoT recordings, motion segmentation, cluster-based object tracking, feature extraction, and object classification. Machine learning methods used for activity detection include naive Bayes, iterative classifier optimizer, decision tree, random forest, and multilayer perceptron. The third phase involves identifying normal and aberrant conduct.

[10] The study provides a brief overview of recent object detection algorithms, including YOLO and the CNN family. In practice, YOLO is more effective than CNNs. YOLO is a unified method for identifying objects. It is simple to build and can be trained directly on images. YOLOv2's FPS of 155 outperforms Faster R-CNN and addresses transmission accuracy and speed issues. It performed exceptionally well, with a mean average precision (mAP) of 78.62. YOLOv2 is a reliable and efficient solution for studying things.

[16] The system detects students' faces using a single-shot multi-box detector with a ResNet-10 architecture. Pre-trained histogram of oriented gradients (HOG) and linear SVM object detectors were used to generate facial landmarks. The eyes are detected with dlib, and the OpenCV eye classifier is used to reconstruct the centers of the eyeballs based on the collected attributes. This model outperforms other models, including H. We employ YOLO and Support Vector Machines (SVM) for effective object detection in deep learning [28]. SVM is highly optimized for Intel machines. Using object detection modules, pupils' distraction factors can be identified by counting people and recognizing phones.

[18] This research proposes using both classic and deep learning methods to detect objects. We employ YOLO in deep learning to efficiently detect objects. We divide the entire image into N grids. We will provide a confidence score if we find an object in the grid. When an object appears in many grids, the chosen center is determined using intersection over union (IoU). They also explained the Anchor Box, which is used to detect many objects in an image.

[17,14,10] This paper presents the experimental findings and analysis of the proposed ET-YOLOv5s model for identifying
students' in-class behavior. The model was trained on a dataset of 330 pictures, with 165 used for training and the remaining 165 for validation and testing. The paper reports the loss function convergence curve and performance metrics. As training time grows, the model's performance steadily improves, reaching over 95%. Overall, the ET-YOLOv5 model provides optimal training results.

[20] This article explores how teachers use AI to enhance student engagement and attention in the classroom. This approach employs emotion detection to identify all students in class, even if they are wearing masks. The ML model has 76% accuracy. The model continuously monitors students with many cameras to collect real-time classroom views. The obtained photos were analyzed with YOLOv5. This system quickly and accurately recognizes student activities. The DeepSORT algorithm is used to track and identify specific students. Finally, individual student reports are supplied.

[19] This study examines the use of video surveillance to detect inappropriate conduct in students' learning. This study compares aberrant student behavior to usual learning habits using computer vision and machine learning techniques. The algorithms were modified for better detection accuracy and durability. The study found that video monitoring can help teachers recognize possible challenges and provide timely support to students.

[21] The article discusses a real-time video surveillance system that detects human aberrant behaviors using two methods: Principal Component Analysis (PCA) combined with Support Vector Machine (SVM), and optical flow analysis. The technology effectively detects irregularities such as rushing in congested locations, bending down, carrying lengthy bars, and waving hands. Experimental results demonstrate these approaches' robustness and efficiency across various settings. The PCA and SVM method uses PCA for feature selection based on blob border information and SVM to identify behaviors as abnormal or normal. These approaches have proved successful in congested contexts, giving a complete solution for real-time identification and analysis of human aberrant behaviors in surveillance.

[22] The study's current solution is a multilevel strategy investigating correlations between observed rule infractions in non-classroom settings. This approach considers observational variables (e.g., location and number of pupils present) and school characteristics. The study focuses on finding "hot spots" for problematic conduct to enhance school climate.

[23] The paper describes a unique method for detecting anomalous crowd behavior based on optical flow techniques. The suggested method outperforms previous methods, especially in cases like crowd escape. It outperforms other models in both the UMN and PETS2009 datasets, with an overall accuracy of 96.46% and 96.72%, respectively. The key innovation is event feature extraction, which uses the angle difference between optical flow vectors to accurately identify anomalies while minimizing noise. The Middle East Technical University - Northern Cyprus Campus Scientific Research Project (SRP) Fund has provided funding for this study. Overall, the study makes a substantial contribution to the fields of crowd behavior analysis and anomalous event identification.

[24] The article describes a proposed method for detecting facial expressions in a classroom context to assess pupils' attention levels. The system grabs crucial video frames, detects tiredness, and analyzes students' gaze to assess whether they are paying attention. The computational design incorporates video acquisition, change detection, drowsiness detection, facial expression analysis, and gaze detection, all using the Structural Similarity Index Method (SSIM) and other visual features. While the technology has potential, drawbacks include problems with video analysis and managing several pupils in a classroom. Future studies should overcome these constraints and increase overall system performance.

[25] The article explores the use of a mobile monitoring system in classrooms to combat smartphone usage during lectures. This system uses face recognition and object detection techniques to identify gestures that indicate student inattention, such as yawning, phone use, and bowing down. The paper also investigates the detection of abnormal behavior among students using a one-layer neural network and a Gaussian distribution technique. The current solution, named "Monitoring System for Student Behavior in Examination Rooms," uses a three-layer approach that includes face detection, neural network-based suspicious state detection, and anomaly detection utilizing a Gaussian-based algorithm. The technology effectively watches students in examination rooms, detecting and tracking them, segmenting them from the camera stream, and identifying suspicious and abnormal conduct.

[26] The paper delves into the creation of an automatic attendance system utilizing computer vision techniques, with a particular emphasis on face detection, recognition, and mask checking for classroom use. It examines the performance of two face detection algorithms, demonstrating the superiority of Histogram of Oriented Gradients (HOG) under different illumination situations. The system recognizes faces using Convolutional Neural Networks (CNNs), which generate 128-dimensional encodings for each person. A mask detection feature is built into the system, using a CNN model trained on a dataset of masked faces generated from unmasked face photos. The present approach, known as the Face Detection and Recognition System, combines these techniques and characteristics with a graphical user interface (GUI) for ease of use and maintenance.

[27] The article focuses on finding anomalies in video frames by analyzing the motion and physical characteristics of objects. It shows a variety of systems that use three ways to increase performance by taking temporal and spatial sequences. Key contributions include improving system performance, boosting abnormal behavior detection speed, and developing methods for identifying diverse aberrant behaviors based on the chosen dataset. Pixel- and frame-level assessments are used to evaluate approaches such as Convolutional
Automatic Encoder (CAE), spatiotemporal structure learning, and sophisticated Long Short-Term Memory (LSTM) cells. The new technique, which incorporates CAE, spatiotemporal structure learning, and complex LSTM cells, outperforms standard methods in real-time video systems, significantly enhancing anomaly identification.

III. EXISTING PROBLEM

In traditional educational environments, all the techniques of attendance tracking and behavior monitoring encounter substantial obstacles in maintaining an appropriate learning environment in physical classrooms. The current methods fail to effectively manage the changing environment of student behavior, which includes concerns such as students dozing off in class, using mobile phones during lectures, engaging in disruptive activities, and causing property damage. These issues need a more advanced and thorough approach to student monitoring to create an environment that promotes successful learning. The study aims to close this gap by combining cutting-edge AI technologies, specifically YOLOv8, to transform student tracking, behavior analysis, and incident detection in offline educational institutions.

Key techniques previously employed in traditional student monitoring methods include:

1) Manual Attendance Tracking: In the conventional technique, teachers must take student attendance themselves, which may require additional time and effort. Furthermore, students mark fraudulent attendance for their peers, which is hard for teachers to recognize.
2) Observation of student's activity: Because of undesirable student activities in the classroom, teachers who are lecturing end up focusing on the students' behavior and on the activities (e.g., using mobiles, napping, abnormal activities) that students engage in instead of listening to the lecture. This demands extra effort from the instructor that could otherwise go into a fruitful educational atmosphere.
3) Incident tracking and reporting: When odd behavior occurs in class, teachers find it difficult to follow the activity and report it, with proof, to higher authorities. This is where the traditional technique falls far behind.
4) CCTV Usage: In the traditional method, CCTV utilization in educational establishments is confined to merely recording what is happening in the classroom, with no further action taken.

IV. IMPLEMENTATION

Several technologies are incorporated into the student tracking system, such as YOLO and face recognition. The YOLO algorithm serves as the foundation for the first of the two approaches to a solution in the suggested methodology. Let's review the preprocessing procedures and an outline of the data collection operations before delving into those specifics.

Figure 1. Data Pipelines Architecture

Data plays a crucial role in abnormal-activity detection, automatic attendance tracking, and sleeping and cellphone detection. The accuracy of the model and its performance depend on the quality of the data and the algorithms used. The overview of the main architecture shows how the components work together. The data collection is divided into three major data pipelines where the data is processed, transformed, and stored for future analysis; see Figure 1 for the architecture diagram.

For the first data pipeline, data is gathered from a variety of internet sources, including YouTube. We select the videos that are useful to our model, extract their frames, and keep the frames needed for identification. We then analyze them before storing them in a database. Next, for the second data pipeline, we obtain data sources from Kaggle or Roboflow. The added benefit of using Kaggle or Roboflow is that the data is already labeled and does not require manual annotation, making our work faster and more effective. In the third data pipeline, which supports the facial recognition feature of this system and therefore requires student faces for training, we scrape the college or school website to obtain the student faces. This allows us to use the system's automated attendance.

Combining all of the data pipelines results in a corpus of student face images for attendance tracking, violence images that can be used to train the detection of abnormal activity, and training images for detecting student use of mobile phones and dozing off in class.

Now that the data has been transmitted to OpenCV, you
Figure 3. Triplet loss before and after learning
comparison and matching. Face Detection is used to identify

recognized faces The face is automatically labeled as unknown
if it is unknown.
2) Tripet Loss: Then, to lower the loss percentage, tuning
approaches such as triplet loss are utilized. Triplet loss can
be included in facial recognition models to increase their
performance, particularly in scenarios requiring fine-grained
classification between distinct individuals. Triplet loss is a
metric learning loss function that is commonly used in Siamese
networks or triplet networks. The goal is to learn embed-
Figure 2. Proposed architecture workflow
dings that minimize the distance between anchor-positive
pairs (representing the same individual) while maximizing the
can read video input from files or cameras. The video frames distance between anchor-negative pairs (representing different
can then be preprocessed with OpenCV’s image processing individuals) you can refer Figure 3 for more information.
tools to boost contrast, reduce noise, or adjust brightness and The triplet loss function is typically formulated as
color balance. Following the preprocessing, the frames are L = max(0, ||f (anchor) − f (positive)||2 − ||f (anchor) − f (negati
ready for further analysis. It also includes tools for conducting (1)
basic image processing operations such as thresholding, filter-
ing, morphological operations, and transformations (rotation, f() represents the embedding function produced by the CNN.
scaling, and translation). These approaches are required for
image preparation before proceeding with further analysis; the augmentation steps are also performed here. After image processing, the data is passed to the machine learning model, which incorporates the YOLO algorithm and facial recognition.

A. ML model using the YOLOv8 architecture
In the proposed architecture (Figure 2), data is sent from the camera for testing or validation. The model has already been trained on the previously created corpus, and the data is gathered from the database and preprocessed with OpenCV. The final data format is then sent to the model. The model's first task is face recognition, which uses predefined face recognition software to detect faces and to verify or identify people based on their facial features. This software uses the Eigenfaces method to extract a person's face encodings and store them in a list with labels. The inner implementation of this face recognition is as follows.
1) Face recognition: First, feature extraction is performed: the system extracts important facial traits from the faces that have been detected. These characteristics could include the dimensions and shapes of the mouth, nose, eyes, and jawline, as well as the distances between them.
Second, in feature representation, the extracted facial features are turned into a mathematical vector or template. The unique characteristics of each person's face are compactly and uniformly captured in this representation, which can be used for comparison and matching.

\[ \|x - y\|^2 \tag{2} \]

The above expression denotes the squared Euclidean distance between the embeddings x and y of two images.
Alpha is a margin that controls how much farther the negative should be from the anchor than the positive.
1) Anchor: An image 'a' selected from the dataset.
2) Positive: An image 'p' selected such that it belongs to the same class as the anchor 'a'.
3) Negative: An image 'n' selected such that it belongs to any class other than that of the anchor 'a'.
A triplet is represented as:

\[ \text{Triplet} : (\text{Anchor}, \text{Positive}, \text{Negative}) \tag{3} \]

A database comparison and matching algorithm is then run, and all data is recorded in a list of recognizable faces. The matching method encodes unfamiliar faces and compares them to the stored values. If it finds a match, it records the student's name in a CSV file and uses a date-time module to write the time at which the person entered.
3) All you need to know about YOLO: Deep learning models like YOLOv8 have become essential in several areas, such as autonomous driving, robotics, and video surveillance. The YOLOv8 architecture detects and localizes objects in photos and videos quickly and accurately by using computer vision techniques and machine learning algorithms.
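As an aside on the face-matching stage described earlier in this subsection, the squared distance of Eq. (2), the triplet of Eq. (3), and the CSV attendance step can be illustrated with a short Python sketch. This is not the authors' implementation: the function names, the margin alpha = 0.2, the matching threshold, the toy three-dimensional encodings, and the hinge form of the triplet loss are all assumptions made for illustration, and an in-memory buffer stands in for the attendance CSV file.

```python
import csv
import io
from datetime import datetime


def squared_distance(x, y):
    """Squared Euclidean distance ||x - y||^2 between two embeddings (Eq. 2)."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Assumed hinge form: max(d(a, p) - d(a, n) + alpha, 0)."""
    return max(
        squared_distance(anchor, positive)
        - squared_distance(anchor, negative)
        + alpha,
        0.0,
    )


def match_face(unknown, known_encodings, known_names, threshold=0.36):
    """Return the name of the closest stored encoding, or None if too far."""
    distances = [squared_distance(unknown, enc) for enc in known_encodings]
    best = min(range(len(distances)), key=distances.__getitem__)
    return known_names[best] if distances[best] <= threshold else None


def log_attendance(writer, name, when=None):
    """Write one attendance row: student name and entry time."""
    when = when or datetime.now()
    writer.writerow([name, when.strftime("%Y-%m-%d %H:%M:%S")])


# Hypothetical toy encodings; real face encodings are much longer vectors.
known_encodings = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
known_names = ["student_a", "student_b"]

buffer = io.StringIO()  # stands in for the attendance CSV file
writer = csv.writer(buffer)

name = match_face([0.9, 1.0, 1.1], known_encodings, known_names)
if name is not None:
    log_attendance(writer, name, datetime(2024, 1, 1, 9, 0, 0))
```

The threshold decides when an unfamiliar face is treated as unknown rather than forced onto the nearest stored identity; in practice it would have to be tuned on the institution's own data.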
Convolutional neural networks (CNNs) are among the most successful methods for achieving precise and efficient object identification. Drawing inspiration from the human visual system, CNNs can recognize objects in pictures and predict their positions (see Figure 6). YOLOv8, one of the well-known deep learning models for object recognition, uses CNNs to provide highly accurate real-time object detection. Significant progress has been made in object detection in computer vision with the use of CNNs and advanced algorithms such as YOLOv8. These developments have opened up many applications and increased the capacity of robots to see and comprehend the visual environment.
YOLOv8's key features include mosaic data augmentation, anchor-free detection, a C2f module, a decoupled head, and a modified loss function (see Figure 4). Like YOLOv4, YOLOv8 employs mosaic data augmentation, which combines four images to give the model more contextual information. To improve performance, YOLOv8 disables this augmentation in the final 10 training epochs. YOLOv8 uses anchor-free detection to improve generalization. Anchor-based detection has the disadvantage of slowing learning on custom datasets because the anchor boxes are preset. Anchor-free detection allows the model to directly estimate an object's midpoint, reducing the number of bounding box predictions. This speeds up Non-Maximum Suppression (NMS), a post-processing step that eliminates redundant predictions.
Figure 4. Internal working of YOLO
Because the decoupled head separates the classification and regression branches, misalignment is possible: the model may localize one object while classifying another. To address this, a task alignment score is incorporated, from which the model identifies positive and negative samples. The task alignment score is calculated by multiplying the classification score and the Intersection over Union (IoU) score. The IoU score represents the accuracy of a bounding box.

4) A brief introduction to YOLO:
• Input and Feature Extraction: An image is the first input in this process, and it passes through the "backbone" of a convolutional neural network. The backbone network extracts the relevant features.
• Feature Fusion: The neck, a component of the model, receives the features extracted by the backbone. The features are combined and examined at several scales to provide a sufficient representation of the image's content.
• Anchor-Free Prediction: YOLOv8 employs anchor-free prediction, in contrast to previous iterations of YOLO that use anchor-based detection. This involves predicting the object's center point directly rather than using predefined anchor boxes, which increases efficiency and reduces the number of bounding box predictions.
• Class and bounding box probabilities: YOLOv8 divides the input image into cells, creating a grid-like structure. Each grid cell predicts bounding boxes and the class probabilities for each bounding box. These probabilities indicate the likelihood that a certain object class, such as a car or person, is found in that box.
• Non-maximum suppression (NMS): Because YOLOv8 predicts numerous bounding boxes for the same object, several bounding boxes can overlap in an image. YOLOv8 uses NMS to keep the most likely bounding box for each object and discard the boxes that overlap it heavily, removing redundant bounding boxes.
• Results: The final output of YOLOv8 is an image with bounding boxes drawn around detected objects, together with class names and confidence scores.
5) Attention Mechanism: An attention mechanism is additionally employed in this model. In human perception, we tend to concentrate on salient aspects such as marks on a face or distinctive appearances; attention mechanisms in deep learning models were developed with this in mind.
An attention mechanism in a visual network is essentially a dynamic weight adjustment function: an attention function g(x) is computed from an input feature map x and applied between the convolutional layers. Its job is to tell the deep network's subsequent layer which features are more or less significant. The function is shown below.

\[ \text{Attention} = f(g(x), x) \tag{4} \]

Here, f denotes a function that acts on both the input x and the output g(x) of an inner function. This captures the dynamic character of attention mechanisms, in which the inner function g(x), which can be thought of as computing the saliency or importance of various elements, determines the relevance of the different components of the input x.
Convolutional neural networks (CNNs), a kind of deep learning model frequently used for image processing applications, incorporate these mechanisms into their architecture. For object detection, CNNs produce feature maps that capture patterns and structures in an input image at multiple levels of abstraction. Channel attention mechanisms work by dynamically varying the relative importance of the feature channels in these maps. Every channel is associated with a certain feature or attribute of the input data. The model can efficiently highlight discriminative features that are essential
for accurate object detection while filtering out unnecessary information or noise by selectively boosting informative channels and suppressing less relevant ones. This adaptive adjustment of channel importance improves the model's performance in identifying objects against complicated backgrounds and changes in appearance.
Figure 5. Backbone
Figure 6. Workflow of CNN
Each image is initially represented by three channels (R, G, B) [33]. After being processed by different convolution kernels, each channel produces new channels with distinct information. If each channel is given a weight that indicates its relationship to the important information, a higher weight indicates higher relevance, so the corresponding channel should receive more attention.
6) Feature Fusion: The original feature maps are joined with the output feature maps obtained from the attention mechanism block. The two sets of features are concatenated along the channel dimension to produce a single feature map that retains the wider context from the original data as well as the specific information highlighted by the attention mechanism.
7) IOU/NMS: Intersection over Union (IoU) and non-maximum suppression (NMS) are applied to the outputs. If the objects are not predicted adequately after applying these methods, the data is sent back to the model for prediction; this process continues until the objects are detected properly (see the NMS illustration in Figure 7).
Figure 7. Representation of Non-Maximum Suppression
Once an object is successfully detected, it is compared to the classes we have defined, such as violence, using a phone, sleeping, and eating during class. Next, we convert the predicted tensor into a NumPy array and use that array to compare against our classes. Once the class has been identified, we use the Twilio module to send a message to higher authorities. In other words, after attendance tracking, the data is sent to the YOLO algorithm, which compares the training data with the given data to identify any suspicious activity or security threats.

V. CONCLUSION
In summary, YOLOv8 is a significant development in real-time object detection in computer vision. With its improved architecture and state-of-the-art features, this deep learning model allows for very accurate object recognition in a variety of applications. This study offered a complete approach to tackling the issues of monitoring student behavior and creating a conducive learning atmosphere in offline educational environments. By integrating cutting-edge AI technologies, particularly the YOLOv8 algorithm, the proposed Student Tracking System provides a solution for behavior analysis, attendance management, and incident detection in educational settings. This article used the Student Tracking System to show how AI technologies like YOLOv8, CNNs, and face recognition modules can transform behavior analysis and incident detection. Using strategically placed cameras and real-time monitoring, the system can recognize and categorize a wide range of student behaviors, such as dozing in class, using mobile phones, and engaging in abnormal activities.
Furthermore, the system's integration with OpenCV for preprocessing, data collection from various sources, and machine learning models improves accuracy and efficiency. The addition of features such as triplet loss for fine-grained
categorization and non-maximum suppression for removing redundant detections further enhances the system's performance.
The proposed system architecture, as explained, offers educational institutions a scalable and customizable solution for monitoring student behavior and ensuring a positive learning environment. The user-friendly interface and real-time alerts allow instructors and administrators to handle concerns quickly and instill accountability in students. The Student Tracking System, which combines novel technologies with realistic implementation tactics, represents a viable route for improving educational quality and student safety in offline educational settings.

VI. RESULTS
Figure 8. Confusion matrix
Figure 9. Different losses in the phases
Figure 10. Real-time results from the model
Figure 11. PR curve
Figure 12. Real-time results from the face recognition model

REFERENCES
[1] O. Boiman and M. Irani, "Detecting irregularities in images and in video," Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1, Beijing, China, 2005, pp. 462-469 Vol. 1, doi: 10.1109/ICCV.2005.70.
[2] I. Pruteanu-Malinici and L. Carin, "Infinite Hidden Markov Models and ISA Features for Unusual-Event Detection in Video," 2007 IEEE International Conference on Image Processing, San Antonio, TX, USA, 2007, pp. V - 137-V - 140, doi: 10.1109/ICIP.2007.4379784.
[3] F. Jiang, Y. Wu and A. K. Katsaggelos, "Abnormal Event Detection from Surveillance Video by Dynamic Hierarchical Clustering," 2007 IEEE International Conference on Image Processing, San Antonio, TX, USA, 2007, pp. V - 145-V - 148, doi: 10.1109/ICIP.2007.4379786.
[4] I. Pruteanu-Malinici and L. Carin, "Infinite Hidden Markov Models for Unusual-Event Detection in Video," in IEEE Transactions on Image Processing, vol. 17, no. 5, pp. 811-822, May 2008, doi: 10.1109/TIP.2008.919359.
[5] M. Jager, C. Knoll and F. A. Hamprecht, "Weakly Supervised Learning of a Classifier for Unusual Event Detection," in IEEE Transactions on Image Processing, vol. 17, no. 9, pp. 1700-1708, Sept. 2008, doi: 10.1109/TIP.2008.2001043.
[6] T. Xiang and S. Gong, "Video Behavior Profiling for Anomaly Detection," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 5, pp. 893-908, May 2008, doi: 10.1109/TPAMI.2007.70731.
[7] Y.-L. Hsueh, N.-H. Lin, C.-C. Chang, O. T.-C. Chen and W.-N. Lie, "Abnormal event detection using Bayesian networks at a smart home," 2015 8th International Conference on Ubi-Media Computing (UMEDIA), Colombo, Sri Lanka, 2015, pp. 273-277, doi:
[8] K. Vignesh, G. Yadav and A. Sethi, "Abnormal Event Detection on BMTT-PETS 2017 Surveillance Challenge," 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, 2017, pp. 2161-2168, doi: 10.1109/CVPRW.2017.268.
[9] C. V. Amrutha, C. Jyotsna and J. Amudha, "Deep Learning Approach for Suspicious Activity Detection from Surveillance Video," 2020 2nd International Conference on Innovative Mechanisms for Industry Applications (ICIMIA), Bangalore, India, 2020, pp. 335-339, doi: 10.1109/ICIMIA48430.2020.9074920.
[10] Du, J. (2018, April). Understanding of object detection based on CNN family and YOLO. In Journal of Physics: Conference Series (Vol. 1004, p. 012029). IOP Publishing.
[11] X. Sun, S. Zhu, S. Wu and X. Jing, "Weak Supervised Learning Based Abnormal Behavior Detection," 2018 24th International Conference on Pattern Recognition (ICPR), Beijing, China, 2018, pp. 1580-1585, doi: 10.1109/ICPR.2018.8545345.
[12] Z. Kain, A. Y. Ouness, I. E. Sayad, S. Abdul-Nabi and H. Kassem, "Detecting Abnormal Events in University Areas," 2018 International Conference on Computer and Applications (ICCA), Beirut, Lebanon, 2018, pp. 260-264, doi: 10.1109/COMAPP.2018.8460336.
[13] D. O. Esan, P. A. Owolawi and C. Tu, "Detection of Anomalous Behavioural Patterns In University Environment Using CNN-LSTM," 2020 IEEE 23rd International Conference on Information Fusion (FUSION), Rustenburg, South Africa, 2020, pp. 1-8, doi: 10.23919/FUSION45008.2020.9190406.
[14] R. Zheng, F. Jiang and R. Shen, "Intelligent Student Behavior Analysis System for Real Classrooms," ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020, pp. 9244-9248, doi: 10.1109/ICASSP40776.2020.9053457.
[15] E. Elbasi, "Reliable abnormal event detection from IoT surveillance systems," 2020 7th International Conference on Internet of Things: Systems, Management and Security (IOTSMS), Paris, France, 2020, pp. 1-5, doi: 10.1109/IOTSMS52051.2020.9340162.
[16] H. S, J. D. Pushparaj and M. Malarvel, "Computer Vision based Student Behavioral Tracking and Analysis using Deep Learning," 2022 3rd International Conference on Electronics and Sustainable Communication Systems (ICESC), Coimbatore, India, 2022, pp. 772-777, doi: 10.1109/ICESC54411.2022.9885410.
[17] L. Li, M. Liu, L. Sun, Y. Li and N. Li, "ET-YOLOv5s: Toward Deep Identification of Students' in-Class Behaviors," in IEEE Access, vol. 10, pp. 44200-44211, 2022, doi: 10.1109/ACCESS.2022.3169586.
[18] A. Tripathi, M. K. Gupta, C. Srivastava, P. Dixit and S. K. Pandey, "Object Detection using YOLO: A Survey," 2022 5th International Conference on Contemporary Computing and Informatics (IC3I), Uttar Pradesh, India, 2022, pp. 747-752, doi: 10.1109/IC3I56241.2022.10073281.
[19] Y. Cui and H. Zou, "Detection of Abnormal Behavioral States in Student Learning Based on Video Surveillance," 2023 International Conference on Data Science and Network Security (ICDSNS), Tiptur, India, 2023, pp. 01-06, doi: 10.1109/ICDSNS58469.2023.10245238.
[20] Trabelsi, Z.; Alnajjar, F.; Parambil, M.M.A.; Gochoo, M.; Ali, L. Real-Time Attention Monitoring System for Classroom: A Deep Learning Approach for Student's Behavior Recognition. Big Data Cogn. Comput. 2023, 7, 48. https://doi.org/10.3390/bdcc7010048.
[21] Riddhi Sonkar, Sadhana Rathod, Renuka Jadhav, Deepali Patil, "Crowd abnormal behavior detection using deep learning", Informatics and Mathematics Web of Conferences 32, 2020.
[22] Megha Chhirolya, Dr. Nitesh Dubey, "Abnormal Human Behavior Detection and Classification In Crowd Using Image Processing", Conference Proceeding Issue Published in International Journal of Trend in Research and Development (IJTRD), 2013.
[23] Mahdyar Ravanbakhsh, Moin Nabi, Enver Sangineto, Lucio Marcenaro, Carlo Regazzoni, Nicu Sebe, "Abnormal event detection in videos using generative adversarial nets", IEEE International Conference on Image Processing, 22 February 2018.
[24] Waqas Sultani, Chen Chen, Mubarak Shah, "Real-World Anomaly Detection in Surveillance Videos", IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16 December 2018.
[25] Louis Kratz, Ko Nishino, "Anomaly detection in extremely crowded scenes using spatio-temporal motion pattern models", IEEE/CVF Conference on Computer Vision and Pattern Recognition, 20 June 2009.
[26] Dong-Gyu Lee, Heung-Il Suk, Sung-Kee Park, Seong-Whan Lee, "Motion Influence Map for Unusual Human Activity Detection and Localization in Crowded Scenes", IEEE Transactions on Circuits and Systems for Video Technology, 28 January 2015.
[27] Weixin Li, Vijay Mahadevan, Nuno Vasconcelos, "Anomaly Detection and Localization in Crowded Scenes", IEEE Transactions on Pattern Analysis and Machine Intelligence, 13 June 2013.
[28] Borislav Antić, Björn Ommer, "Video parsing for abnormality detection", International Conference on Computer Vision (ICCV), 12 January 2012.
[29] Direkoglu, H., & Kilic, O. F. Abnormal Crowd Behavior Detection Using Motion Information Images and Convolutional Neural Networks. IEEE Transactions on Circuits and Systems for Video Technology, 2017.
[30] Scutari, Marco & Graafland, Catharina & Gutiérrez, J. (2019). Who learns better Bayesian network structures: Accuracy and speed of structure learning algorithms. International Journal of Approximate Reasoning. 115. 10.1016/j.ijar.2019.10.003.
[31] Chole, S., Tiwari, R. N., Siddique, S., Jain, P., & Mane, S. Detecting Suspicious Activities in Surveillance Videos Using Deep Learning Methods.
[32] Xin Zou, Wen Long Yue, "A Bayesian Network Approach to Causation Analysis of Road Accidents Using Netica", Journal of Advanced Transportation, vol. 2017, Article ID 2525481, 18 pages, 2017. https://doi.org/10.1155/2017/2525481
[33] Guo, MH., Xu, TX., Liu, JJ. et al. Attention mechanisms in computer vision: A survey. Comp. Visual Media 8, 331–368 (2022). https://doi.org/10.1007/s41095-022-0271-y
