1 Introduction

Intelligent analysis of video data has gained immense importance in modern surveillance, as it increases the efficiency, overall capabilities, and effectiveness of security and monitoring operations. In particular, video analysis is a powerful tool for creating real-time intelligence from an observed environment [1]. In today's public surveillance systems, it helps detect aberrant events such as traffic rule-breaking, unauthorized parking, fights, violent crowds, etc. In addition, it is a significant aid in the current smart world for monitoring video streaming content on social media platforms. Thus, video analysis has a wide range of applications in public surveillance as well as in the online monitoring of events. A prominent example is the COVID-19 pandemic, during which video analysis enabled the real-time monitoring of social distancing in public venues and curfew adherence. Among this range of applications, one of the most complex and crucial areas is the management of crowds and their associated behavior [2]. This is because the behavior of crowds is often unpredictable and prone to unexpected disasters and crime-related events, making crowds a substantial concern for government officials and law enforcement agencies. Although CCTV is heavily deployed throughout the world for public monitoring purposes, it is generally a reactive system that requires manual monitoring of events.

In this context, efficient, autonomous, and real-time analysis of video data can enable effective and proactive monitoring over large geographical areas and can assist public safety officials in proactive decision-making in areas that exhibit large crowds. Thus, intelligent crowd behavior recognition has emerged as an indispensable area of computer vision research. Since the advent of deep learning (DL) algorithms, the ability to process enormous amounts of unstructured data has led to many human behavior recognition methods being developed using CNNs [3,4,5], LSTMs [5], GANs [6], Autoencoders [3], ResNet [7], etc. However, most of these methods classify crowds as violent/nonviolent [5, 7] or normal/abnormal [3, 4, 6]. Yet, for law enforcement agencies, the size and violence level of the crowd are also crucial for making decisions in practical scenarios [8]. For example, if the model identifies the existence of a small violent crowd, then the authorities can prioritize containment and swift intervention to minimize the impact and prevent escalation. At the same time, the identification of a large violent crowd changes the reaction strategies of officials to deploy additional resources to maintain crowd control. The development of such systems requires training a model using classes that characterize crowd size and violence levels.

To the best of our knowledge, no such dataset exists in the literature, nor has such a problem been addressed by researchers. To this end, we first present a novel dataset consisting of videos representative of typical public gatherings. The video database contains videos of normal public daily activity, small-scale violent events, large-scale violent events, and large-scale peaceful events. This distinction allows for crowd behavior classification based on the size of the crowd within the frame and the level of violence. In addition, the dataset contains videos taken from CCTV footage, where the camera is stationary and at a distance from the event, and from social media uploads, where the video is taken via a mobile camera, introducing motion in the video. We have deliberately introduced social media video content to develop a system that can not only identify crowd behavior in CCTV footage managed by authorities but also analyze video content that is uploaded to social media by the public. The latter allows governments to expand their monitoring regions and identify potential threats, suspicious behavior, or illegal activities that might be shared or discussed in these videos. The proposed system is a proactive approach to public safety monitoring that enables the initiation of appropriate actions to prevent crimes before they occur or escalate.

The need to identify and classify crowd behavior in both CCTV video and social media streams makes crowd behavior classification more challenging. Considering all these aspects, we propose a DL model based on a video swin transformer to classify crowd behavior into Natural (N), Large Peaceful Gathering (LPG), Large Violent Gathering (LVG), and Fighting (F), thereby distinguishing crowd dynamics and the extent of violence. To facilitate the learning and prediction of crowd behavior classes, we employ crowd-counting maps and optical flow maps as influential components within our proposed model. The crowd-counting maps aid the model in distinguishing between large and small events, whereas the optical flow maps enhance the analysis of temporal violent patterns of the crowd. Finally, to demonstrate the outcomes of the proposed model in real time and on real videos, we leverage Nvidia's DeepStream Software Development Kit (SDK) [9], an intelligent application framework for processing real-time video data. Thus, our main contributions are:

  • A swin transformer-based DL model is developed for the purpose of classifying crowd behavior into four discrete categories characterized by varying levels of violence and crowd sizes.

  • Additional semantic knowledge pertaining to crowd density and violence levels is augmented into the swin transformer framework by the integration of crowd-counting maps and optical flow maps.

  • We have curated a large dataset that can serve as a benchmark resource for training models dedicated to monitoring crowd-related events through the analysis of data originating from public CCTV surveillance cameras and online social media platforms. Furthermore, we have extracted a subset of the dataset comprising exclusively CCTV footage. This dedicated subset is instrumental in the development of models for public CCTV surveillance applications.

  • Experimental analysis has been executed employing the DeepStream SDK to ascertain the viability and practicality of our proposed methodology within an actual real-time surveillance environment.

The rest of the paper is structured as follows: In Section 2, a comprehensive review of the existing literature is presented, while Section 3 delineates the proposed crowd behavior detection model and elucidates the processes involved in dataset creation. Section 4 is dedicated to discussing experimental analysis and its outcomes, and elaborates on real-time analysis employing DeepStream. Finally, the paper is concluded in Section 5.

2 Related work

Accurate detection and precise prediction of crowd behavior are essential for effective crowd management within smart surveillance systems. The increase in crowd-related mishaps in the past decades has led to significant advances in computer vision research, which actively drives efficient and proactive crowd surveillance. This section provides an overview of recent DL approaches for video data analysis, various methods employed for analyzing video data derived from the internet and CCTV sources, as well as existing publicly available datasets for tasks related to crowd control and human activity recognition.

2.1 Advances in DL methods for video analysis

DL has revolutionized video analysis by enabling the extraction of high-level representations from raw video data. The breakthrough in video analysis was mainly due to the power of Convolutional Neural Networks (CNN), which are successful in object detection [10], tracking [11], and action recognition [12]. CNN is widely used for crowd analysis as well. A cascade of 3D CNN and 3D autoencoder was proposed by Sabokrou et al. [3] for crowd anomaly classification. Zhou et al. [4] utilized a spatiotemporal CNN to detect panic situations in a crowd. 3DCNN was employed in [13] and [14] to detect various crowd behaviors.

Recently, ResNet, a variant of CNN that eliminates the vanishing gradient problem and eases training [15], has been widely used for video processing. Ng et al. [16] proposed a ResNet-based architecture, namely ActionFlowNet, for classifying human actions. The long-term and short-term features in action videos are segregated using ResNet in [17], and a 3D Loop ResNet was utilized by Kakamu et al. [18] for predicting various human actions. ResNet was also employed in [7] for violent behavior detection, crowd density classification, and crowd counting. Abnormal crowd event detection in small-scale and large-scale crowds was proposed in [19], and in [20], features for crowd behavior pattern analysis were extracted using ResNet.

Other widely used DL methods for video analysis include Recurrent Neural Networks (RNNs) and their variant, Long Short-Term Memory (LSTM) networks. Chen et al. [21] and Ebrahimi et al. [22] proposed RNN-based algorithms to identify various emotions of a crowd. Moreover, many studies have explored the properties of LSTM networks, which can remember long-term dependencies and solve the vanishing and exploding gradient problems of RNNs [23]. The sequences of group activities were recognized in [24] using a 2-stage LSTM model, and crowd behaviors based on psychological properties were predicted in [25] using a convolutional LSTM.

In the recent past, attention mechanisms have been applied to video analysis to focus on relevant spatiotemporal regions or frames. Vaswani et al. [26] put forth the Transformer, an attention-based architecture for language translation that dispenses with recurrence entirely. Inspired by the success of transformers in natural language processing (NLP), transformer-based architectures have been adapted for video analysis. These models capture long-range dependencies and facilitate parallel processing of frames in videos. Furthermore, transformers are more scalable to very large-capacity models [27] and assume less prior knowledge about the structure of the problem as compared to CNNs and RNNs [28]. These advantages have led to their success in many computer vision tasks such as image recognition [29] and object detection [30]. Dosovitskiy et al. [29] proposed Vision Transformers (ViT), which achieved promising results in image classification tasks by modeling the relationship (attention) between the spatial patches of an image using the standard transformer encoder [26]. After ViT, many transformer-based video recognition methods [31, 32] have been proposed. In these works, different techniques have been developed for temporal attention as well as spatial attention. Subsequently, attention mechanisms similar to those of Transformers were used with convLSTM for action recognition [33], crowd behavior prediction [25], and gesture recognition [34] from videos.

In a nutshell, Transformer-based approaches have led to significant advancements in the realm of computer vision. The performance improvements are quite impressive and represent a major step forward in this field. Among the Transformer frameworks discussed above, the swin transformer [31] has been a game changer in the field of computer vision. It has set new records in object detection [35] and semantic segmentation benchmarks [35] and has shown that Transformer-based approaches are the future of visual modeling. In addition, the swin transformer computes attention within shifted, non-overlapping windows, which yields faster running speed and a hardware-friendly design; this inspired us to use the framework as the backbone of our proposed model (details of the swin transformer framework are given in Section 3).

2.2 Existing video analysis methods for online videos and CCTV footage

Online videos constitute multimedia content accessible for either streaming or downloading via the internet. This category spans diverse content genres, including but not limited to movies, TV shows, documentaries, music videos, tutorials, vlogs, and more. In some cases, surveillance of social media videos can contribute to public safety efforts. Wang et al. [36] proposed a deep recurrent neural network to extract temporal features to classify audio frames for event detection from videos such as sandwich making, flash mob gathering, etc. Complex events from web videos were classified using a two-stage CNN in [37], and in [38], a CNN was utilized to extract features from the video content, and a concept library based on a Support Vector Machine (SVM) was created to organize the events.

Conversely, analyzing CCTV videos is a common practice in various domains, including security, safety, transportation, and retail, to enhance situational awareness, improve operational efficiency, and enable proactive decision-making. CCTV footage is typically captured by stationary cameras strategically placed at specific locations for surveillance purposes. Since these cameras have a fixed field of view and do not move, they provide a continuous stream of video footage from a particular perspective. In [39], suspicious activities inside a campus were detected from CCTV footage by employing VGG-16 as the feature extractor and LSTM as the classifier. The method proposed by Khan et al. [40] utilized a CNN to find anomalies such as accidents in traffic videos. Anomaly detection was also proposed by Aboah et al. [41] using a decision tree-based approach. Moreover, CCTV footage has been used to analyze a crowd's real-time behaviors, which helps in reliable and proactive crowd management. Baqui et al. [42] studied cross-correlation and optical flow patterns to analyze pedestrian flows from real-time CCTV videos. The crowd density and the parameters of pedestrian flow, such as direction and speed, from Hajj videos collected using CCTV cameras were also explored in [43] to display the crowd movement in 3D animation form for better crowd control. The camera's rotation, focal length, and position, together with a CSRNet-based head-tracking AI algorithm, were used to detect the positions of persons in the crowd.

Although many works have been proposed for analyzing online video content for captioning, event detection, sentiment analysis, etc., such videos have largely remained unused by law enforcement agencies and public surveillance systems due to the lack of suitable models and datasets for training and evaluation. Despite the pervasive utilization of DL models in the analysis of online and CCTV videos across various domains, none of these models exhibit promising capabilities for the discernment of crowd behavior predicated on criteria such as crowd size and violence level. Hence, the exigent requirement is the development of an intelligent surveillance system with global applicability, notably crucial for governmental agencies facing diverse challenges, especially in cases of emergencies, such as widespread unrest, and during large-scale public events, such as concerts, national holidays, and sports tournaments. Furthermore, the prevailing literature lacks comprehensive methodologies supported by real-time experimentation, which is essential in pre-empting situations from spiraling out of control due to delayed or inadequate security responses. Therefore, we propose a DL framework alongside a diligently created dataset customized for the classification of crowd behaviors contingent upon crowd size and violence levels. Additionally, we furnish empirical validation through real-time experiments, thereby rendering our system aptly suited for smart surveillance in real-world scenarios.

Fig. 1: Overall Structure of the Proposed Framework

2.3 Existing human activity recognition (HAR) and crowd datasets

The most important part of an AI-based smart surveillance system for crowd behavior detection is the availability of benchmark datasets for training purposes. Here, we provide a review of existing publicly available datasets for crowd management and HAR closely related to our work.

  • Movie Actions Dataset [44]: The dataset provides annotated movie clips. Each clip belongs to one of 51 action classes, such as GetOutCar, HandShake, HugPerson, Kiss, SitDown, SitUp, StandUp, etc.

  • UCF50 [45] & UCF101 [46]: The UCF50 and UCF101 datasets consist of YouTube clips grouped into one of 50 and 101 action categories, respectively. Examples of action classes in the UCF50 dataset include Basketball Shooting and Pull-Ups, while the action classes in UCF101 span a wider spectrum subdivided into five categories, namely body motion, human-human interactions, human-object interactions, playing musical instruments, and sports.

  • Kinetics Dataset: This dataset consists of three versions: Kinetics-400 [47], Kinetics-600 [48], and Kinetics-700 [49]. Kinetics-400 is a large-scale action recognition dataset that contains around 240,000 video clips categorized into 400 action classes, with each clip lasting around 10 seconds on average; it was designed for the task of action recognition in videos. Kinetics-600 extends Kinetics-400 with additional action classes, providing a broader range of actions for more comprehensive research and evaluation. Kinetics-700 extends the action classes even further, providing a more diverse and challenging dataset for action recognition tasks.

  • Violent Flows [50]: Focuses on crowd violence; it comprises 246 crowd videos extracted from YouTube, labeled into two classes: violence and non-violence.

  • UCF Crime Dataset [51]: A collection of long surveillance videos from YouTube and LiveLeak that consists of thirteen crime classes (e.g., road accidents, burglary, robbery, etc.).

  • CCTV-fights [52]: A dataset of 1000 videos of real fights caught by CCTV cameras, whose cumulative length exceeds 8 hours, annotated as fight and non-fight.

  • Surveillance Camera Fight Dataset [53]: Contains 300 videos collected from movies and hockey games, divided equally into two classes: fight and non-fight.

  • UMN [54]: The dataset comprises eleven videos and is intended for classifying a crowd as either normal or abnormal, where the two classes are distinguished based on the running patterns of people in the crowd.

  • UCF Normal/Abnormal Web Dataset [55]: A collection of twenty videos with normal, escape panic, clash, and fights as crowd classes.

In short, although the HAR datasets are useful for testing different DL architectures, they are not necessarily useful for specific practical tasks, such as surveillance, which likely requires the distinction between a limited number of specific action classes. Furthermore, to the best of our knowledge, no video dataset in the literature contains large gatherings, such as protests, as an action class. For instance, protest datasets in the literature are limited to image datasets [56] and protest metadata [57], which document protester demands, government responses, protest location, and protester identities. Thus, the novelty of our developed video dataset is that it is specifically aimed toward identifying scenarios of public unrest (violent protests, fights, etc.) or scenarios that have the potential to develop into public unrest (large gatherings, peaceful protests, etc.). Large gatherings are particularly interesting and important to be carefully monitored as they can lead to unruly events. Large gatherings that seem peaceful can evolve into a violent scenario with fighting, destruction of property, etc. In addition, the scale of violence captured can inform the scale of the response from law enforcement. Thus, for the current task, we divide violence into small-scale violence (i.e., F) and large-scale violence (i.e., LVG). To our knowledge, these aspects have been largely neglected in existing datasets, which motivates this work.

3 Proposed framework and dataset

This section describes the proposed model for analyzing internet and surveillance videos as well as the dataset used to train that model. Figure 1 depicts the overall system architecture of the proposed framework.

Fig. 2: Architecture of Swin-T [35]. The input video is represented by a tensor of shape \(T \times H \times W \times 3\), where T is the number of frames and \(H \times W\) is the height and width of each frame having 3 channels (RGB)

3.1 Video swin transformer

The main backbone of our framework is the swin transformer, more precisely, the variant known as the video swin transformer. The swin transformer is characterized by its hierarchical architecture, which partitions images into small patches at the initial layers of the transformer structure and progressively merges neighboring patches at deeper levels to create larger ones. It leverages the concept of shifted windows when computing self-attention, thereby enhancing its representational capacity and contributing to its remarkable recent state-of-the-art performance [58]. Beyond its state-of-the-art performance, the swin transformer demonstrates superior computational efficiency compared to other models. Notably, the computational demands of the model grow linearly with the input image resolution, in contrast to other models where computation time escalates quadratically with increasing image resolution. Among the multiple versions of the video swin transformer, we adopt Swin-T, the tiny version, as it is designed to be more efficient and faster than the other versions, making it well-suited for scenarios where computational resources are limited or inference speed is crucial. The architecture of Swin-T is provided in Fig. 2.

The framework consists of four stages, where each stage, except stage 1, has three components: a patch merging layer, a linear layer, and a video swin transformer block. In stage 1, each frame in the input video, \(V = \{f_1, f_2, ...f_T\}\), is divided into 3D patches/tokens of size \(2 \times 4 \times 4 \times 3\) by the 3D patch partition layer, which results in \(\frac{T}{2} \times \frac{H}{4}\times \frac{W}{4}\) tokens. These tokens are given to the linear embedding layer, where the features of each token are projected to an arbitrary dimension C (for Swin-T, \(C =96\)). The patch merging layer of each subsequent stage performs spatial downsampling by concatenating \(2 \times 2\) neighboring patches, and a linear layer is utilized to project the concatenated patches to half of the input dimension. The significant block in each stage is the video swin transformer block, which comprises a 2-layer multi-layer perceptron (MLP) with Gaussian Error Linear Unit (GELU) activation and a 3D shifted window-based multi-head self-attention (3DWMSA) module, as shown in Fig. 3.

Fig. 3: Illustration of a Video Swin Transformer block [35]

A residual connection is established around each module to mitigate vanishing gradients, and layer normalization (LN) is applied to the input of the 3DWMSA and MLP modules to control covariate shift. A block of the video swin transformer, as illustrated in Fig. 3, is given by

$$\begin{aligned} \hat{z}^l = 3DWMSA(LN(z^{l-1})) + z^{l-1} \end{aligned}$$
(1)

and

$$\begin{aligned} z^{l} = MLP(LN(\hat{z}^l)) + \hat{z}^l, \end{aligned}$$
(2)

where \(\hat{z}^l\) represents the input to the MLP at layer l, while \(z^{l}\) denotes the output from the layer l MLP, which is subsequently passed to layer \(l+1\).
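To make the block structure concrete, the following PyTorch sketch mirrors the pre-normalization residual form of (1) and (2). It is only an illustration: a plain multi-head self-attention over all tokens stands in for the windowed 3DWMSA, and the module name, dimensions, and token grid are illustrative choices rather than values taken from the paper.

```python
import torch
import torch.nn as nn

class SwinBlockSketch(nn.Module):
    """Pre-norm residual structure of equations (1)-(2).

    NOTE: plain multi-head self-attention over all tokens is used here as a
    stand-in for the 3D shifted-window attention (3DWMSA); the real block
    restricts attention to P x M x M windows.
    """

    def __init__(self, dim: int = 96, num_heads: int = 3, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(              # 2-layer MLP with GELU, eq. (2)
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # eq. (1): z_hat^l = 3DWMSA(LN(z^{l-1})) + z^{l-1}
        h = self.norm1(z)
        z_hat = self.attn(h, h, h, need_weights=False)[0] + z
        # eq. (2): z^l = MLP(LN(z_hat^l)) + z_hat^l
        return self.mlp(self.norm2(z_hat)) + z_hat


tokens = torch.randn(2, 8 * 7 * 7, 96)   # (batch, tokens, C) -- small grid for illustration
out = SwinBlockSketch()(tokens)          # output has the same shape as the input
```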

Fig. 4: Example of 3DWMSA [35]

The 3DWMSA is responsible for efficient event recognition from temporal video data through its multi-head self-attention (MSA) property and non-overlapping 3D windows. Each input V is divided into \(T' \times H' \times W'\) tokens, and these tokens are partitioned into non-overlapping 3D windows of size \(P \times M \times M\). That is, the MSA of the first layer operates over \( \lceil \frac{T'}{P}\rceil \times \lceil \frac{H'}{M}\rceil \times \lceil \frac{W'}{M}\rceil \) non-overlapping 3D windows. The window partition of the second layer is shifted along the temporal and spatial axes by \((\frac{P}{2}, \frac{M}{2}, \frac{M}{2})\) tokens. An example of 3DWMSA is provided in Fig. 4. Finally, self-attention is computed by including a 3D relative position bias, \(B\), and is given by

$$\begin{aligned} Attention(q,k,v) = softmax\left(\frac{qk^T}{\sqrt{d}}+B\right)v, \end{aligned}$$
(3)

where q represents the query matrix with dimension d, and k and v denote the key and value matrices, respectively, for the self-attention calculation of T frames. Finally, after stage 4, a softmax layer is employed to calculate the probability distribution of crowd behavior labels.
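A minimal sketch of the attention computation in (3) is given below; the relative position bias is passed in as a plain tensor here, whereas in the actual model it is a learnable parameter indexed by the relative 3D offsets between tokens within a window. The shapes and the zero-initialized bias are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def windowed_attention(q, k, v, bias):
    """Scaled dot-product attention with an additive 3D relative position
    bias B, following eq. (3): softmax(q k^T / sqrt(d) + B) v.

    q, k, v : (num_windows, tokens_per_window, d)
    bias    : (tokens_per_window, tokens_per_window), broadcast over windows
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5 + bias   # bias added after scaling
    return F.softmax(scores, dim=-1) @ v


# Toy example: 4 windows of P * M * M = 2 * 4 * 4 = 32 tokens, feature dim d = 32
P, M, d = 2, 4, 32
q = k = v = torch.randn(4, P * M * M, d)
B = torch.zeros(P * M * M, P * M * M)     # learnable in the actual model
out = windowed_attention(q, k, v, B)      # shape (4, 32, 32)
```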

The proposed framework leverages crowd counting maps (\(CC\_Maps\)) and optical flow patterns (\(Opt\_Flow\)) as important components that augment supplementary semantic knowledge for classifying crowd behavior based on attributes including crowd size and violence level. The \(CC\_Maps\) are computed for alternate frames of each sample, whereas the \(Opt\_Flow\) maps are computed for every pair of consecutive frames. For a sample with frames \(\{f_1, \dots , f_{T}\}\), we compute a \(CC\_Map\), C, for the frames \(\{f_1, f_3, f_5, \dots , f_{T-1}\}\), skipping one frame at a time. Additionally, we compute the \(Opt\_Flow\), O, for each frame pair \(\{(f_i, f_{i+1}) \mid i \in [1, T-1]\}\). Consequently, one input sample to the swin transformer is the result of the concatenation of the T input frames \(V =(f_1, f_2, ..., f_{T})\), the T/2 \(CC\_Maps\), \(C = (c_1, c_2, ..., c_{T/2})\), and the T-1 \(Opt\_Flow\) maps, \(O = (o_1, o_2, ..., o_{T-1})\), and is represented as

$$\begin{aligned} I_j = V_j\uplus C_j \uplus O_j, \end{aligned}$$
(4)

where \(j=1,2,3,...n\) denotes the number of samples of each video and \(\uplus \) is the concatenation operation. The overall procedure of the proposed model is illustrated in Algorithm 1. The following subsection furnishes a detailed explanation of the processes involved in generating \(CC\_Maps\) and \(Opt\_Flow\).
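The concatenation in (4) can be sketched as follows. The exact stacking layout is not specified in the text, so the frame-wise stacking below, with the single-channel \(CC\_Maps\) broadcast to three channels, is an assumption made purely for illustration.

```python
import torch

def build_sample(frames, cc_maps, flow_maps):
    """Assemble one input sample I_j = V_j (+) C_j (+) O_j as in eq. (4).

    frames    : (T, 3, H, W)      RGB frames
    cc_maps   : (T // 2, 1, H, W) crowd-counting maps (every other frame)
    flow_maps : (T - 1, 3, H, W)  optical-flow visualizations
    """
    cc_rgb = cc_maps.expand(-1, 3, -1, -1)          # broadcast grey maps to 3 channels
    return torch.cat([frames, cc_rgb, flow_maps], dim=0)


T, H, W = 20, 224, 224
sample = build_sample(torch.rand(T, 3, H, W),
                      torch.rand(T // 2, 1, H, W),
                      torch.rand(T - 1, 3, H, W))
print(sample.shape)   # torch.Size([49, 3, 224, 224]) -> 20 + 10 + 19 stacked maps
```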

Algorithm 1: Crowd Behavior Detection Model

Fig. 5: Sample frames and their respective crowd counting maps

3.2 Crowd counting and optical flow maps

Recall that our primary objective entails the classification of human crowd behavior, and this classification is predicated on two key parameters: the crowd's size and the level of violence exhibited. Specifically, we are concerned with two fundamental aspects within the input video data: the dynamics of individuals' movements captured in the video and the spatial concentration of these individuals. It is worth noting that the motion patterns within the crowd can offer insights into its potential for violence. For instance, a violent crowd tends to manifest erratic motion, while a peaceful crowd's movement is more likely to be slow and subtle. Also, the concentration of people in a crowd can inform whether the crowd is large or small; the higher the density in a significant proportion of the frame, the more likely the crowd is to be large. Besides, a crowd's mobility and density distribution may interact in less obvious ways to help classify the crowd as small or large, violent or non-violent. Thus, to aid in crowd footage classification, we utilize \(CC\_Maps\), which contain information about the crowd's density distribution, and \(Opt\_Flow\), which stores information about crowd movement. This section describes how optical flow and crowd-counting maps are extracted for videos in the datasets and how they are utilized for training and validation.

3.2.1 Computation of CC_Maps

Crowd counting and localization have drawn significant attention in the literature for their usefulness in surveillance, tracking, and crowd management applications [59]. Crowd counting can also be useful in our application since it can inform us about the size of the crowd, which helps in distinguishing between LPG and N, as well as between LVG and F. There are two ways in which crowd counting could be helpful for our purposes. One way would be to obtain the number of people present in a video [60, 61] and use it as a feature of the input video to aid in classification. This approach has two potential drawbacks. First, the total number of people does not always inform us about the number of people involved in the action. In other words, a large number of people could be in the background of the scene while the relevant action in the foreground is taking place, meaning that the distribution of the people in the crowd also matters. Secondly, since we are dealing with video data, the number of people is just a single feature, and its influence during inference might be greatly diminished by the thousands of features extracted and used to obtain a final classification of an input video.

Rather than relying solely on headcount as a feature, our approach is geared towards the computation of crowd density maps. These maps serve as continuous, smoothed heat maps, functioning as a visual representation of the crowd's distribution and intensity. We employ the idea proposed by Wan et al. [59] to generate \(CC\_Maps\): a pre-trained VGG19-based model [62] takes \(V = \{f_1, f_2, \dots, f_T\}\) as input and returns a 2-dimensional crowd density matrix with values between 0 and 1, which can be transformed into a grey-scale image, C. In this grey-scale image, a higher value at a pixel indicates a higher crowd concentration at that pixel. For each frame \(f_i\), we produce a crowd density estimation, \(C_i\), in the form of a grey-scale image. An example of a sequence of 3 frames and their respective \(CC\_Maps\) is shown in Fig. 5. Instead of processing crowd-density maps independently of the image sequences, we opt to concatenate both sets of images and process them through the swin transformer at once. This allows the network to learn the complex relationship between the frames and the crowd densities and how those two change with time.
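As a rough illustration of this step, the sketch below follows the general recipe described above: VGG19 convolutional features followed by a small regression head that outputs a single-channel density map in [0, 1], upsampled back to the frame resolution. The head architecture and the (omitted) pre-trained crowd-counting weights are assumptions; the code does not reproduce the exact model of Wan et al. [59].

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19

class DensityEstimatorSketch(nn.Module):
    """VGG19-based crowd density estimator in the spirit of Wan et al. [59].

    The regression head is an assumption; in practice, weights from a
    pre-trained crowd-counting model would be loaded instead of the random
    initialization used here.
    """

    def __init__(self):
        super().__init__()
        self.backbone = vgg19(weights=None).features[:36]   # VGG19 conv features
        self.head = nn.Sequential(nn.Conv2d(512, 128, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(128, 1, 1))

    def forward(self, frame):                 # frame: (B, 3, H, W)
        density = torch.sigmoid(self.head(self.backbone(frame)))   # values in [0, 1]
        # upsample back to the frame resolution -> grey-scale CC_Map
        return nn.functional.interpolate(density, size=frame.shape[-2:],
                                         mode="bilinear", align_corners=False)


cc_map = DensityEstimatorSketch()(torch.rand(1, 3, 224, 224))   # (1, 1, 224, 224)
```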

Fig. 6: Three consecutive frames and their two corresponding Optical Flow maps

3.2.2 Generation of Opt_Flow

Optical flow is the distribution of velocities of brightness patterns in an image [63]. These velocities arise as a result of relative movement between the objects in the video or the video's point of view, such as a change in the position or orientation of the camera. Optical flow maps are image representations that can be computed for two consecutive video frames. The adjacent frames \(f_i\) and \(f_{i+1}\) of the input video frames \(\{f_i\}_{i=1}^T\) are utilized to generate the \(Opt\_Flow\) map, \(O_i\), using the pre-trained Recurrent All-Pairs Field Transforms (RAFT) model described in [64].

RAFT is a deep learning architecture that addresses the problem of estimating optical flow by predicting per-pixel displacements between two frames. Unlike traditional optical flow methods that often rely on handcrafted features and assumptions about brightness constancy, RAFT takes a learning-based approach. It utilizes a recurrent neural network (RNN) to model the interactions between pixels in a pair of frames and predict the flow field that best explains the observed motion. RAFT computes pixel-wise feature vectors and uses these vectors to find, for each pixel in the first image, the corresponding pixel in the second image. The product of this operation is a field of vectors, one for each pixel, that shows the "movement" of each pixel. Each neighborhood of pixels that moves together is colored homogeneously. An example of three consecutive frames and their two corresponding \(Opt\_Flow\) maps is shown in Fig. 6. Homogeneous color patterns in the figure represent regions in the video frame where motion is relatively uniform in both direction and magnitude. These patterns can help identify regions within a crowd where people are moving collectively or uniformly, potentially indicating activities like large-scale gatherings or synchronized movement.
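A minimal way to obtain such flow maps, using the pre-trained RAFT implementation shipped with torchvision as a stand-in for the model of [64], is sketched below; the small RAFT variant, input sizes, and preprocessing are illustrative choices and may differ from the paper's exact setup.

```python
import torch
from torchvision.models.optical_flow import raft_small, Raft_Small_Weights
from torchvision.utils import flow_to_image

weights = Raft_Small_Weights.DEFAULT          # downloads pre-trained RAFT weights
model = raft_small(weights=weights).eval()
preprocess = weights.transforms()             # normalization expected by RAFT

f_i = torch.rand(1, 3, 224, 224)              # frame f_i      (values in [0, 1])
f_next = torch.rand(1, 3, 224, 224)           # frame f_{i+1}
img1, img2 = preprocess(f_i, f_next)

with torch.no_grad():
    flows = model(img1, img2)                 # list of iteratively refined flow fields
o_i = flow_to_image(flows[-1])                # (1, 3, 224, 224) color-coded Opt_Flow map
```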

Since optical flow shows the movement of objects relative to the camera, we must guarantee that the camera is sufficiently stationary for \(Opt\_Flow\) to be useful. Thus, \(Opt\_Flow\) maps are useful mostly for CCTV footage rather than for internet videos, since internet videos tend to be fast-moving videos taken from the ground. This is in contrast to CCTV cameras, which are almost always stationary and usually film from a high point of view.

3.3 Dataset collection

Given the absence of pre-existing datasets aligning with our specific class criteria, we embarked on the creation of a novel dataset. This dataset serves as the training foundation for our model designed to monitor internet and CCTV videos. Subsequently, a distinct subset extracted from this dataset is employed for the exclusive training of a model designated for the analysis of CCTV footage. It is important to emphasize that our dataset is uniquely customized to comprise the four distinct classes of behavior requisite for monitoring both internet and CCTV videos, spanning large-scale and small-scale peaceful as well as violent events. This custom dataset fulfills the precise requirements essential for our research objectives. To this end, a large set of YouTube videos and videos from pre-existing datasets that contain one or more of the classes of interest were identified. The videos were given unique IDs that indicate the order in which the videos were obtained. Then, the start and end time stamps of the occurrences of each class in each of the collected videos were recorded. The record of the occurrence of a class in a video reads as follows: "video i contains an instance of class c from time stamp \(h_i:m_i:s_i\) to time stamp \(h_f:m_f:s_f\)", where h, m, and s represent the hours, minutes, and seconds of the timestamp, respectively.

In order to prepare a video for being fed into the proposed framework, the frames of the time periods where the classes occur must be extracted. Before that, we guarantee that the time difference between every two consecutive frames is the same for all videos by setting the frame rate of each video to 10 frames per second (FPS). 10 FPS was chosen since it’s a reasonable frame rate that allows the model to analyze the videos in sufficient detail without needing excessive storage space for the frames of each video. Then, we extract the frames of each occurrence of each class in each of the collected videos. Note that the number of frames for each occurrence of each of the classes can be different since the time periods during which an instance of a class occurs in a video can vary in length, thus changing the number of frames of that instance. For example, an instance that occurs from time stamp 0:0:0 to 0:0:10 has 11 seconds \(\times \) 10 frames/second = 110 frames while an instance that occurs from time stamp 0:0:5 to 0:0:8 has 4 seconds \(\times \) 10 frames/second = 40 frames.

However, the number of frames taken by a DL model must be constant and set before training. To resolve this, we have determined that our model will take 20-frame sequences as input. This is because 20 frames is the minimum number of frames for any possible occurrence of one of the chosen classes of behavior, given the way we record these occurrences. The shortest occurrence of a class is one that begins at time stamp h : m : s and ends at time stamp \(h:m:(s+1)\), meaning that it will have 2 seconds \(\times \) 10 frames/second = 20 frames. Occurrences that are more than 2 seconds long are then used to produce more than one 20-frame sequence using a sliding window. For instance, if an occurrence of one of the classes that starts at time stamp \(h_i:m_i:s_i\) and ends at time stamp \(h_f:m_f:s_f\) has k frames \(\{f_j, f_{j+1}, \dots , f_{j+k-1}\}\), a sliding window of size 20 will slide through the frames, taking a 20-frame sequence at each step.

Specifically, re-indexing the frames of the occurrence from 0, the 20-frame sequences that could be extracted are \(\{f_i, \dots ,f_{i+19}\} \forall i \in [0,k-20]\). That is, every two consecutive 20-frame sequences would share 19 frames. Note that consecutive 20-frame sequences sharing some frames are valuable, as this trains the model to be somewhat time-invariant. For example, a 20-frame sequence with a punch must be categorized as Fighting no matter where the punch occurs in the 20-frame sequence. However, sharing 19 frames out of 20 is inefficient because the dataset would require excessive storage space. Instead, we use a sliding window that jumps 10 frames at each step, meaning that consecutive 20-frame sequences only share 10 frames. In particular, for an occurrence with k frames \(\{f_j, f_{j+1}, \dots , f_{j+k-1}\}\), the 20-frame sequences that are used in the dataset are \(\{f_{10i}, \dots ,f_{10i+19}\} \forall i \in [0,m]\), where \(m=\lfloor \frac{k}{10}\rfloor - 2\). The 20-frame sequences, which we call samples, are extracted for each class occurrence in each collected video and added to our dataset. Overall, 2,570 different videos were collected, and the cumulative duration of the recorded occurrences amounted to 68 hours.
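The sliding-window extraction described above can be summarized by the short sketch below (function and variable names are illustrative):

```python
def extract_samples(frames, window=20, stride=10):
    """Split one class occurrence (frames at 10 FPS) into fixed-length samples
    with a sliding window; consecutive samples share window - stride = 10 frames.
    """
    return [frames[i:i + window]
            for i in range(0, len(frames) - window + 1, stride)]


# An occurrence lasting 4 s at 10 FPS -> 40 frames -> 3 samples
# (frames 0-19, 10-29, 20-39), matching m = floor(40 / 10) - 2 = 2.
occurrence = list(range(40))
assert len(extract_samples(occurrence)) == 3
```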

3.4 Model training

To train the swin transformer model effectively, we partitioned the videos into distinct training and validation sets. However, it is important to clarify that our division was based on samples, not entire videos, with the aim of allocating 80% of the samples for training and reserving 20% for validation. Achieving this 80-20 sample split, while ensuring that no video contributes to both sets, was accomplished through a random search procedure as follows. Initially, a random selection of videos, of random size, was chosen from the video dataset for training, while the remaining videos were designated for validation. The number of training and validation samples for each class within the training and validation videos was tallied, and the per-class training/validation ratio was calculated. After 2 hours of searching, the training and validation sets that achieved the per-class split closest to 80-20 were selected. As a result of this procedure, we arrived at a specific set of 1977 training videos and 593 validation videos, yielding the following per-class training/validation ratios: N: 80.69% / 19.31%; LPG: 78.62% / 21.38%; LVG: 79.51% / 20.49%; F: 79.18% / 20.82%. This process ensured a well-balanced and representative distribution of samples across classes for both training and validation, thereby contributing to the robustness of our model training.
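A simplified version of this random-search procedure is sketched below; the data structures, the error criterion (maximum per-class deviation from the target ratio), and the shortened time budget are assumptions for illustration only.

```python
import random
import time
from collections import Counter

def random_search_split(video_samples, target=0.8, budget_s=5.0):
    """Randomly search for a per-video split whose per-class sample ratio is
    closest to `target`. `video_samples` maps video_id -> list of class labels
    (one per sample). The paper's search ran for ~2 hours; a short budget is
    used here so the sketch finishes quickly.
    """
    videos = list(video_samples)
    best, best_err, start = None, float("inf"), time.time()
    while time.time() - start < budget_s:
        train = set(random.sample(videos, random.randint(1, len(videos) - 1)))
        tr, va = Counter(), Counter()
        for vid, labels in video_samples.items():
            (tr if vid in train else va).update(labels)
        classes = set(tr) | set(va)
        if not all(tr[c] and va[c] for c in classes):
            continue                                  # every class in both splits
        err = max(abs(tr[c] / (tr[c] + va[c]) - target) for c in classes)
        if err < best_err:
            best, best_err = (train, set(videos) - train), err
    return best
```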

As mentioned in Section 3.2.2, \(Opt\_Flow\) maps prove most effective when applied to videos featuring a stable camera viewpoint. In cases where the camera is in motion, the use of optical flow maps can lead to potential confusion, as the model might interpret camera-induced movement as object motion within the video. To mitigate this issue, we extracted a subset from our dataset comprising samples characterized by minimal changes in the camera's perspective. This subset closely resembles typical CCTV footage, where cameras are typically stationary and not mobile. Our approach involves a detailed examination of each recorded occurrence within the various classes. If a segment of the video demonstrates "significant" camera movement, we exclude that particular occurrence record from the dataset. This process yielded a 25-hour dataset primarily consisting of stationary samples, which we refer to as the Static dataset. Conversely, the broader dataset, of which the Static dataset is a subset, is termed the Original dataset. Note that the proposed model, when trained on the Original dataset, can be used for monitoring internet videos. In contrast, when trained on the Static dataset, it becomes well-suited for CCTV monitoring applications. For the Static dataset, the same random search procedure was adopted for splitting the dataset into training and validation videos. This process resulted in 1121 videos allocated for training and 279 videos designated for validation. The training/validation ratios achieved for the Static dataset were 79.72% / 20.28%, 79.03% / 20.97%, 80.01% / 19.99%, and 79.80% / 20.20% for N, LPG, LVG, and F, respectively.

Fig. 7: Sample frames for each behavior from our dataset

4 Experimental analysis

The experiments were conducted using a novel dataset of videos collected from YouTube and existing crowd datasets. The details of the video collection for the dataset are provided in Section 3.3. We define four crowd behavior classes based on crowd size and violence level: Natural (N), Large Peaceful Gathering (LPG), Large Violent Gathering (LVG), and Fighting (F). LPG depicts a large number of individuals gathered for a common purpose, like peaceful protests or sports spectators, whereas LVG represents a large group of individuals of whom a significant number are engaged in violent actions, including clashes with police, fighting between members of the crowd, property destruction, etc. On the other hand, F refers to a small group of individuals fighting each other, and if the footage shows no relation to the above-described behaviors, it is classified as N. Figure 7 portrays sample frames from each class. The extracted videos were annotated carefully by identifying when behaviors of interest occurred. This was done by recording the start and end time stamps within which interesting behaviors were observed, as shown in Table 1. Each video is assigned a unique ID, and the occurrence of a class recorded in the annotation table is denoted as an instance of that class.

Table 1 A sample annotation table - 5 instances of the behaviors in 3 separate videos

Our dataset consists of 68 hours of videos and is referred to as the Original dataset in the rest of the paper. The Original dataset comprises videos from both static CCTV cameras and moving cameras. To perform experiments on video footage from stationary CCTV cameras, a key component of city-wide surveillance, we extract a subset of videos that match CCTV footage, called the Static dataset, consisting of 25 hours of video. For training and validation, the videos from both datasets were converted into non-overlapping frames with a frame size of \( 224 \times 224\). As explained in Section 3.3, the videos were converted into 20-frame samples, and hence the input to the swin transformer is a tensor of size \( 20 \times 3 \times 224 \times 224 \). The training and validation of the proposed model were performed with a training-validation ratio of 8:2, as discussed in Section 3.4. All experiments were done using Python's PyTorch framework on an NVIDIA GeForce GPU with CUDA 11.4.

The training of the proposed model was performed by minimizing the categorical cross-entropy loss using the Stochastic Gradient Descent (SGD) optimizer with an initial learning rate of 0.0001, a momentum of 0.9, and a weight decay of 0.0001. The hyperparameters used for training are compiled in Table 2. Figure 8 depicts the average loss values during the training and validation of crowd behavior classification. The decreasing loss demonstrates that the proposed approach successfully learns to predict behaviors consistent with the ground-truth labels.
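The optimization setup can be summarized by the following sketch, which uses the reported hyperparameters (SGD with learning rate 0.0001, momentum 0.9, weight decay 0.0001, and categorical cross-entropy); the tiny stand-in model and random batch merely take the place of the swin transformer and the real data loader.

```python
import torch
import torch.nn as nn

# Stand-in model and random batch; in practice these are the swin transformer
# and batches of 20 x 3 x 224 x 224 samples from the training set.
model = nn.Sequential(nn.Flatten(), nn.Linear(20 * 3 * 224 * 224, 4))   # 4 classes
criterion = nn.CrossEntropyLoss()                        # categorical cross-entropy
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4,
                            momentum=0.9, weight_decay=1e-4)

clips = torch.rand(2, 20, 3, 224, 224)    # batch of two 20-frame samples
labels = torch.tensor([0, 2])             # e.g., N and LVG

for step in range(3):                     # a few illustrative optimization steps
    optimizer.zero_grad()
    loss = criterion(model(clips), labels)
    loss.backward()
    optimizer.step()
```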

Table 2 Hyperparameters used for training the proposed model
Fig. 8: Average loss during the training and validation of the proposed model

We validated the model by calculating the average accuracy and mean average precision (mAP) for the instance videos as well as the sample videos. Instance videos are a set of frames whose starting and ending timestamps are identified and recorded for a specific class as given in Table 1. Sample videos are equal-sized image sequences extracted from the instance video, and we set the sample size as 20 frames. The details of sample extraction are given in Section 3.3, and the number of instances and samples used for training and validation for each crowd behavior is portrayed in Table 3. The average accuracy and mAP of the proposed model are shown in Table 4 and the confusion matrix is portrayed in Fig. 9.

Table 3 Number of instances and samples for each crowd behavior used for training and validation

In terms of accuracy and mAP, the model performs better on the Static dataset than on the Original dataset. This difference is due to the nature of the Original dataset, which includes both online and CCTV videos. When the camera is in motion, optical flow maps can be misleading, causing the model to mistakenly interpret camera movement as object motion. Conversely, the Static dataset closely mirrors typical CCTV footage, where cameras are usually stationary, thus providing more consistent and reliable data for the model.

Furthermore, we recorded two types of accuracy scores: "sample accuracy" and "instance accuracy." Sample accuracy is calculated by performing inference on all samples in the validation set and then dividing the number of correctly classified validation samples by the total number of validation samples. Conversely, instance accuracy is measured by performing inference on all samples within an instance. If the class to which most samples in an instance are assigned matches the instance's label, the number of correctly classified instances is increased by one. The total number of correctly classified instances is then divided by the total number of instances in the validation set to obtain the instance accuracy of the model. Hence, in most cases, sample accuracy is higher than instance accuracy. In the proposed dataset, the number of LVG videos is lower compared to the other classes. This class imbalance results in reduced performance for the LVG class relative to the others.
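The majority-vote computation of instance accuracy can be expressed compactly as follows (a sketch with illustrative names):

```python
from collections import Counter

def instance_accuracy(instances):
    """Majority-vote instance accuracy: `instances` is a list of
    (instance_label, [predicted sample labels]); an instance counts as correct
    when the most frequent sample prediction equals its label.
    """
    correct = sum(Counter(preds).most_common(1)[0][0] == label
                  for label, preds in instances)
    return correct / len(instances)


# Toy example: the first instance is classified correctly, the second is not
print(instance_accuracy([("F", ["F", "F", "N"]),
                         ("LVG", ["LPG", "LPG", "LVG"])]))   # 0.5
```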

The experimental analysis demonstrated the efficacy of the proposed model in effectively detecting crowd behavior, considering both crowd size and the degree of violence, using data from CCTV and online video sources. Since the videos in the dataset contain diverse crowd scenes captured under multiple climatic conditions and with varying degrees of occlusion, the performance analysis indicates that the model is robust to variations in weather conditions, occlusion, and video quality. Moreover, the swin transformer's ability to capture both local and global context through shifted windows and hierarchical processing helps maintain performance despite these variations. The use of optical flow maps and the attention mechanism of the swin transformer can reweight the attention distribution to focus on visible and relevant parts of the image, thereby mitigating the impact of occlusions and video quality.

Table 4 Average Accuracy (%) and mAP(%) of the proposed approach
Fig. 9: Confusion Matrix portraying the per class accuracy in the Static and Original datasets

Fig. 10: Impact of crowd counting maps and optical flow in Original Dataset

4.1 Impact of crowd counting maps and optical flow in crowd behavior detection

Crowd behavior is recognized using our proposed swin transformer model, which takes crowd-counting maps and optical flow maps as input along with the original input frames. Given our classification task's focus on differentiating crowd behavior based on size and violence level, the integration of crowd-counting maps plays a pivotal role in enhancing the precision of crowd behavior detection by effectively discriminating between large-scale and small-scale events. In contrast, optical flow maps help analyze the temporal patterns of significant object motions in a sample video. We performed experiments to examine the effect of crowd-counting maps and optical flow on the Original and Static datasets. The analysis was done by estimating the mAP, sample accuracy, and instance accuracy while varying the input to the swin transformer in the following four ways: (1) Swin Only: frames from the video sample were given as input; (2) Swin+OptFlow: input frames were concatenated with optical flow maps; (3) Swin+CCmaps: input frames were concatenated with crowd-counting maps; and (4) Swin+CCmaps+OptFlow: our proposed approach, where crowd-counting maps and optical flow maps were concatenated with the original input frames. The results are portrayed in Figs. 10 and 11, which display the comparison of average accuracy (%) and mAP (%) on the Original and Static datasets. It is clear from the figures that the combination of crowd-counting maps and optical flow patterns has a considerable impact on behavior detection. Figure 12 shows sample frames representing the four classes that were correctly classified by the proposed approach. Furthermore, Fig. 13 presents three scenarios aimed at illustrating the significance of crowd-counting maps and optical flow maps in distinguishing crowd behavior with respect to size and violence levels. These figures exemplify instances where our approach outperformed alternative methods in accurately categorizing crowd behavior.

Fig. 11: Impact of crowd counting maps and optical flow in Static Dataset

Fig. 12: Sample frames from the four classes that were correctly classified by the proposed approach

Fig. 13: Three scenarios to show the importance of crowd maps and optical flow in the proposed approach. Crowd counting maps help in the differentiation of LPG and N or LVG and F. On the other hand, optical flow maps assist in identifying temporal patterns within a video, aiding in the discrimination between categories such as LPG and LVG, or F and N. (a) Fight scene in a largely empty area misclassified as N without optical flow. (b) A large crowd with a violent scene at the end is classified as LPG when optical flow is not considered. (c) Crowd counting maps help to correctly distinguish LPG from N

Table 5 Comparison of Accuracy (%) in the Original and Static Datasets
Fig. 14: Sample frames from benchmark datasets

4.2 Comparison with state-of-the-art approaches

We implemented state-of-the-art models used for video recognition tasks on our Original and Static datasets, and the results are given in Table 5. Average accuracy (%) was calculated for sample videos and instance videos, and it is evident that our approach outperforms recent video recognition models in detecting crowd behavior based on crowd size and violence level. The results also emphasize that crowd-counting maps and optical flow maps influence the behavior detection ability of the proposed model.

We empirically assess the efficacy of the proposed model through rigorous performance evaluation on established publicly available datasets consisting of behavior classes associated with fight and violence actions. To effectively evaluate the performance of our model, we require datasets containing instances of violent and non-violent scenarios captured from real surveillance environments. As such, we have chosen to utilize four distinct datasets for comparative analysis: the Hockey Fight Dataset [68], the Surveillance Camera Fight Dataset [53], the Violent Flows Dataset [50], and the RWF_2000 Dataset [69]. The Hockey Fight Dataset comprises a collection of 1,000 video sequences categorized into two distinct classes: fights and non-fights. A similar binary classification scheme is also applied to the Surveillance Camera Dataset, consisting of 300 video recordings. The RWF_2000 Dataset, on the other hand, incorporates a more extensive compilation featuring 2000 video clips that are segregated into the fight and non-fight categories. Lastly, the ViolentFlows dataset comprises 246 video instances, each annotated to distinguish between violent and non-violent behaviors.

Table 6 Comparison of Accuracy(%) in Hockey Fight Dataset

The Hockey Fight and RWF_2000 datasets comprise instances categorized into fight and non-fight classes, which align with our F and N classes, respectively. Similarly, the Violent Flows Dataset and the Surveillance Camera Fight Dataset contain scenes depicting both violent and non-violent scenarios, akin to our LPG, LVG, and N classes. Figure 14 illustrates sample frames from the datasets, and Tables 6, 7, 8, and 9 present the quantitative outcomes in terms of accuracy. The results substantiate the efficiency of our proposed approach in discerning patterns associated with violent and fight scenarios.

The evaluated benchmark datasets consist of classes representing fight/no fight or violence/non-violence scenarios. The Hockey Fight Dataset predominantly features fight sequences characterized by clearer visuals and a smaller number of individuals. In contrast, the Violent Flows Dataset presents more distinct patterns of both violent and non-violent behavior, thereby facilitating the model’s learning and generalization of patterns. Conversely, the Surveillance Camera Fight Dataset and RWF 2000 Dataset present more diverse and challenging scenarios. These datasets comprise variations in lighting conditions, camera angles, and crowd dynamics, which posed challenges for our model’s performance. Despite these complexities, our approach consistently demonstrated superior performance compared to existing methods.

Table 7 Comparison of Accuracy(%) in Surveillance Camera Fight Dataset
Table 8 Comparison of Accuracy(%) in RWF_2000 Dataset
Table 9 Comparison of Accuracy(%) in Violent Flows Dataset
Table 10 Inference time of the proposed model using DeepStream SDK

4.3 DeepStream for real-time analysis

A DL model for smart surveillance is considered efficient when it exhibits real-time inference capabilities that align with the demands of a surveillance environment. As a result, we perform model validation within a real surveillance ecosystem utilizing the DeepStream SDK [9]. This SDK serves as a powerful tool for deploying real-time video classification deep learning models. The deployment process involves the provision of a video source to DeepStream, which can be either an MPEG-4 (MP4) video stored locally or a video stream originating from a camera via the Real-Time Streaming Protocol (RTSP) [92].

DeepStream mandates that the DL model be in the Open Neural Network Exchange (ONNX) [93] format. To achieve this, we employed the "torch.onnx" module provided by PyTorch to convert the pre-trained proposed model into the ONNX format. Subsequently, the ONNX file representing the proposed DL model is specified within DeepStream's configuration file, facilitating the generation of an inference engine file. This inference engine file is crucial for subsequent executions of the DeepStream SDK, enabling real-time video classification.
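The conversion step can be sketched as follows; the stand-in model, file name, tensor names, and opset version are illustrative assumptions rather than the exact export settings used for the proposed model.

```python
import torch
import torch.nn as nn

# Stand-in for the trained swin transformer; only the export call matters here.
model = nn.Sequential(nn.Flatten(), nn.Linear(20 * 3 * 224 * 224, 4)).eval()

dummy_clip = torch.rand(1, 20, 3, 224, 224)        # one 20-frame sample
torch.onnx.export(model, dummy_clip, "crowd_behaviour.onnx",
                  input_names=["frames"], output_names=["class_logits"],
                  opset_version=17)
# The resulting .onnx file is the model path referenced in DeepStream's
# configuration file, from which the SDK builds its inference engine.
```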

Fig. 15: Sample frames displayed using DeepStream. The detected behaviors are shown in the top left corner of each frame

As part of our configuration process, we have defined critical DeepStream parameters for utilization. These parameters are (1) T, which denotes the number of video frames processed by the DL model during each inference cycle. For optimal performance, T has been configured to 20, aligning with the number of frames constituting a single sample. (2) H and W, which represent the dimensions of each video frame, specified as \(H \times W\). In our setup, both H and W have been set to 224. This specific dimension is mandated by the requirement of the backbone swin transformer model, which necessitates input frames to be 224\(\times \)224 in size.

Once the ONNX file and the path to the MP4 video or the RTSP stream link are supplied to DeepStream, the framework initiates the video playback while concurrently applying the proposed DL model to the frames. The visualization process varies depending on the video source: (a) Local MP4 Video: When processing a local MP4 video, DeepStream displays the video at its native frame rate. Simultaneously, it overlays the inference results for the most recent 20 frames in the top-left corner of the video display. This dynamic display simulates real-time inference, providing users with up-to-date classification information as the video plays. (b) RTSP Stream from a Camera: In the case of an RTSP stream sourced from a camera, DeepStream generates the class inference result in the top-left corner based on the last 20 frames received from the stream. This approach ensures that the displayed inference information reflects the most recent data processed from the live camera feed.

To execute the DeepStream SDK, we utilized an NVIDIA GeForce RTX 3080 GPU. The SDK is configured to capture either the local MP4 video or the most recent 20 frames from the RTSP stream. These captured frames are then processed by the proposed pre-trained model residing in the GPU. Subsequently, DeepStream incorporates the model's output label, which can signify "Natural (N)," "Large Peaceful Gathering (LPG)," "Large Violent Gathering (LVG)," or "Fighting (F)," into the incoming video feed, rendering it visible on the screen for real-time monitoring and analysis.

The proposed approach was tested for real-time inference using DeepStream, which accepted videos stored locally in MP4 format as well as video streams from a camera via RTSP. The locally stored video yielded inference results in 0.3 s, whereas the RTSP stream exhibited a delay of 5 seconds in displaying behavior inference, as portrayed in Table 10.

For visual reference, refer to Fig. 15, which illustrates sample frames from each class as displayed within the DeepStream environment, where the input is given in MP4 format. This visualization offers insights into how DeepStream seamlessly integrates real-time video processing and DL inference. This integration highlights the effectiveness and value of our proposed approach for crowd behavior recognition.

5 Conclusions

In a public surveillance system, proactive real-time analysis of crowds can be challenging due to the difficulties authorities face in promptly assessing crowd scale and potential violence levels. Furthermore, the current practice of conducting crowd behavior recognition by exclusively analyzing CCTV footage while neglecting the incorporation of online social media video content results in a predominantly reactive methodology. This necessitates datasets and models specifically designed to facilitate the analysis of both CCTV footage and online videos, with the capability to detect and classify crowd behavior along two essential dimensions: violence and crowd size. In this paper, we introduced a large dataset comprising 68 hours of data, including both stationary CCTV feeds and online social media content. We developed a subset of this extensive dataset, which includes only CCTV footage, to serve as a foundation for developing dedicated models suitable for CCTV video data analysis. A DL model based on the swin transformer architecture was trained to capture crowd behaviors consisting of regular events, large peaceful gatherings, large violent gatherings, and small-scale fighting. In addition, we enhanced the model's understanding of crowd dynamics and violence patterns by incorporating crowd-counting maps and optical flow maps as auxiliary data sources. The experimental analysis proved the efficacy of the proposed model in effectively detecting crowd behavior, taking into account both crowd size and the degree of violence, across data derived from both CCTV and online video sources. The proposed model was also tested on benchmark datasets, which further demonstrated its proficiency in distinguishing fight and violence patterns within video data. Finally, the real-time performance analysis of the proposed model, trained on our dataset and deployed through the DeepStream SDK, provides compelling evidence of the model's efficiency in real-time surveillance environments. In the future, we intend to develop multi-attention spatiotemporal DL models capable of detecting and predicting fine-grained crowd behavior within a single scenario.