1 Introduction

Intelligent analysis of video data has gained immense importance in modern surveillance, as it increases the efficiency, overall capabilities, and effectiveness of security and monitoring operations. In particular, video analysis is a powerful tool for creating real-time intelligence from an observed environment [1]. In today's public surveillance systems, it helps detect aberrant events such as traffic rule-breaking, unauthorized parking, fights, violent crowds, etc. In addition, it is a significant aid in the current smart world for monitoring video streaming content on social media platforms. Thus, video analysis has a wide range of applications in public surveillance as well as in the online monitoring of events. A prominent example is the COVID-19 pandemic, during which video analysis enabled the real-time monitoring of social distancing in public venues and curfew adherence. Among this range of applications, one of the most complex and crucial areas is the management of crowds and their associated behavior [2]. This is because the behavior of crowds is often unpredictable and prone to unexpected disasters and crime-related events, making crowds a substantial concern for government officials and law enforcement agencies. Although CCTV is heavily deployed throughout the world for public monitoring purposes, it is generally a reactive system that requires manual monitoring of events.

In this context, efficient, autonomous, and real-time analysis of video data can enable effective and proactive monitoring over large geographical areas and can assist public safety officials in proactive decision-making in areas that exhibit large crowds. Thus, intelligent crowd behavior recognition has emerged as an indispensable area of computer vision research. Since the advent of deep learning (DL) algorithms, the ability to process enormous amounts of unstructured data has led to many human behavior recognition methods being developed using CNNs [3,4,5], LSTMs [5], GANs [6], Autoencoders [3], ResNet [7], etc. However, most of these methods classify crowds as violent/nonviolent [5, 7] or normal/abnormal [3, 4, 6]. Yet, for law enforcement agencies, the size and violence level of the crowd are also crucial for making decisions in practical scenarios [8]. For example, if the model identifies the existence of a small violent crowd, then the authorities can prioritize containment and swift intervention to minimize the impact and prevent escalation. At the same time, the identification of a large violent crowd changes the reaction strategies of officials to deploy additional resources to maintain crowd control. The development of such systems requires training a model using classes that characterize crowd size and violence levels.

To the best of our knowledge, no such dataset exists in the literature, nor has such a problem been addressed by researchers. To this end, we first present a novel dataset consisting of videos representative of typical public gatherings. The video database contains videos of normal public daily activity, small-scale violent events, large-scale violent events, and large-scale peaceful events. This distinction allows for crowd behavior classification based on the size of the crowd within the frame and the level of violence. In addition, the dataset contains videos taken from CCTV footage, where the camera is stationary and at a distance from the event, and from social media uploads, where the video is taken via a mobile camera, introducing motion in the video. We have deliberately introduced social media video content to develop a system that can not only identify crowd behavior in CCTV footage managed by authorities but also analyze video content that is uploaded to social media by the public. The latter allows governments to expand their monitoring regions and identify potential threats, suspicious behavior, or illegal activities that might be shared or discussed in these videos. The proposed system is a proactive approach to public safety monitoring that enables the initiation of appropriate actions to prevent crimes before they occur or escalate.

The need to identify and classify crowd behavior in both CCTV video and social media streams makes crowd behavior classification more challenging. Considering all these aspects, we propose a DL model based on a video swin transformer to classify crowd behavior into Natural (N), Large Peaceful Gathering (LPG), Large Violent Gathering (LVG), and Fighting (F), thereby distinguishing crowd dynamics and the extent of violence. To facilitate the learning and prediction of crowd behavior classes, we employ crowd-counting maps and optical flow maps as influential components within our proposed model. The crowd-counting maps aid the model in distinguishing between large and small events, whereas the optical flow maps enhance the analysis of temporal violent patterns of the crowd. Finally, to demonstrate the outcomes of the proposed model in real time and on real videos, we leverage Nvidia's DeepStream Software Development Kit (SDK) [9], an intelligent application framework for processing real-time video data. Thus, our main contributions are:

  • A swin transformer-based DL model is developed for the purpose of classifying crowd behavior into four discrete categories characterized by varying levels of violence and crowd sizes.

  • Additional semantic knowledge pertaining to crowd density and violence levels is augmented into the swin transformer framework by the integration of crowd-counting maps and optical flow maps.

  • We have curated a large dataset that can serve as a benchmark resource for training models dedicated to monitoring crowd-related events through the analysis of data originating from public CCTV surveillance cameras and online social media platforms. Furthermore, we have extracted a subset of the dataset comprising exclusively CCTV footage. This dedicated subset is instrumental in the development of models for public CCTV surveillance applications.

  • Experimental analysis has been executed employing the DeepStream SDK to ascertain the viability and practicality of our proposed methodology within an actual real-time surveillance environment.

The rest of the paper is structured as follows: In Section 2, a comprehensive review of the existing literature is presented, while Section 3 delineates the proposed crowd behavior detection model and elucidates the processes involved in dataset creation. Section 4 is dedicated to discussing experimental analysis and its outcomes, and elaborates on real-time analysis employing DeepStream. Finally, the paper is concluded in Section 5.

2 Related work

Accurate detection and precise prediction of crowd behavior are essential for effective crowd management within smart surveillance systems. The increase in crowd-related mishaps in the past decades has led to significant advances in computer vision research, which actively drives efficient and proactive crowd surveillance. This section provides an overview of recent DL approaches for video data analysis, various methods employed for analyzing video data derived from the internet and CCTV sources, as well as existing publicly available datasets for tasks related to crowd control and human activity recognition.

2.1 Advances in DL methods for video analysis

DL has revolutionized video analysis by enabling the extraction of high-level representations from raw video data. The breakthrough in video analysis was mainly due to the power of Convolutional Neural Networks (CNN), which are successful in object detection [10], tracking [11], and action recognition [12]. CNN is widely used for crowd analysis as well. A cascade of 3D CNN and 3D autoencoder was proposed by Sabokrou et al. [3] for crowd anomaly classification. Zhou et al. [4] utilized a spatiotemporal CNN to detect panic situations in a crowd. 3DCNN was employed in [13] and [14] to detect various crowd behaviors.

Recently, ResNet, a variant of CNN that eliminates the vanishing gradient problem and eases training [15], has been widely used for video processing. Ng et al. [16] proposed a ResNet-based architecture, namely ActionFlowNet, for classifying human actions. The long-term and short-term features in action videos are segregated using ResNet in [17], and a 3D Loop ResNet was utilized by Kakamu et al. [18] for predicting various human actions. ResNet was also employed in [7] for violent behavior detection, crowd density classification, and crowd counting. Abnormal crowd event detection in small-scale and large-scale crowds was proposed in [19], and in [20], features for crowd behavior pattern analysis were extracted using ResNet.

Other widely used DL methods for video analysis include Recurrent Neural Networks (RNNs) and their variant, Long Short-Term Memory (LSTM) networks. Chen et al. [21] and Ebrahimi et al. [22] proposed RNN-based algorithms to identify various emotions of a crowd. Moreover, many studies have explored the properties of LSTM networks, which can remember long-term dependencies and solve the vanishing and exploding gradient problems of RNNs [23]. The sequences of group activities were recognized in [24] using a 2-stage LSTM model, and crowd behaviors based on psychological properties were predicted in [25] using a convolutional LSTM.

In the recent past, attention mechanisms have been applied to video analysis to focus on relevant spatiotemporal regions or frames. Vaswani et al. [26] put forth the Transformer, an attention-based architecture for language translation that dispenses with recurrence entirely. Inspired by the success of transformers in natural language processing (NLP), transformer-based architectures have been adapted for video analysis. These models capture long-range dependencies and facilitate parallel processing of frames in videos. Furthermore, transformers are more scalable to very large-capacity models [27] and assume less prior knowledge about the structure of the problem as compared to CNNs and RNNs [28]. These advantages have led to their success in many computer vision tasks such as image recognition [29] and object detection [30]. Dosovitskiy et al. [29] proposed Vision Transformers (ViT), which achieved promising results in image classification tasks by modeling the relationship (attention) between the spatial patches of an image using the standard transformer encoder [26]. After ViT, many transformer-based video recognition methods [31, 32] have been proposed. In these works, different techniques have been developed for temporal attention as well as spatial attention. Subsequently, attention mechanisms similar to those of Transformers were used with convLSTM for action recognition [33], crowd behavior prediction [25], and gesture recognition [34] from videos.

In a nutshell, Transformer-based approaches have led to significant advancements in the realm of computer vision. The performance improvements are quite impressive and represent a major step forward in this field. Among the Transformer frameworks discussed above, the swin transformer [31] has been a game changer in the field of computer vision. It has set new records in object detection [35] and semantic segmentation benchmarks [35] and has shown that Transformer-based approaches are the future of visual modeling. In addition, the swin transformer computes attention within shifted, non-overlapping windows, which yields faster running speed and a hardware-friendly design; this inspired us to use the framework as the backbone of our proposed model (details of the swin transformer framework are given in Section 3).

2.2 Existing video analysis methods for online videos and CCTV footage

Online videos constitute multimedia content accessible for either streaming or downloading via the internet. This category spans diverse content genres, including but not limited to movies, TV shows, documentaries, music videos, tutorials, vlogs, and more. In some cases, surveillance of social media videos can contribute to public safety efforts. Wang et al. [36] proposed a deep recurrent neural network to extract temporal features to classify audio frames for event detection from videos such as sandwich making, flash mob gathering, etc. Complex events from web videos were classified using a two-stage CNN in [37], and in [38], a CNN was utilized to extract features from the video content, and a concept library based on a Support Vector Machine (SVM) was created to organize the events.

Conversely, analyzing CCTV videos is a common practice in various domains, including security, safety, transportation, and retail, to enhance situational awareness, improve operational efficiency, and enable proactive decision-making. CCTV footage is typically captured by stationary cameras strategically placed at specific locations for surveillance purposes. Since these cameras have a fixed field of view and do not move, they provide a continuous stream of video footage from a particular perspective. In [39], suspicious activities inside a campus were detected from CCTV footage by employing VGG-16 as the feature extractor and LSTM as the classifier. The method proposed by Khan et al. [40] utilized a CNN to find anomalies such as accidents in traffic videos. Anomaly detection was also proposed by Aboah et al. [41] using a decision tree-based approach. Moreover, CCTV footage has been used to analyze a crowd's real-time behaviors, which helps in reliable and proactive crowd management. Baqui et al. [42] studied cross-correlation and optical flow patterns to analyze pedestrian flows from real-time CCTV videos. The crowd density and the parameters of pedestrian flow, such as direction and speed, from Hajj videos collected using CCTV cameras were also explored in [43] to display the crowd movement in 3D animation form for better crowd control. The camera's rotation, focal length, and position, together with a CSRNet-based head-tracking AI algorithm, were used to detect the positions of persons in the crowd.

Although many works have been proposed for analyzing online video content for captioning, event detection, sentiment analysis, etc., such videos have largely remained unused by law enforcement agencies and public surveillance systems due to the lack of suitable models and datasets for training and evaluation. Despite the pervasive utilization of DL models in the analysis of online and CCTV videos across various domains, none of these models exhibit promising capabilities for the discernment of crowd behavior predicated on criteria such as crowd size and violence level. Hence, the exigent requirement is the development of an intelligent surveillance system with global applicability, notably crucial for governmental agencies facing diverse challenges, especially in cases of emergencies, such as widespread unrest, and during large-scale public events, such as concerts, national holidays, and sports tournaments. Furthermore, the prevailing literature lacks comprehensive methodologies supported by real-time experimentation, which is essential in pre-empting situations from spiraling out of control due to delayed or inadequate security responses. Therefore, we propose a DL framework alongside a diligently created dataset customized for the classification of crowd behaviors contingent upon crowd size and violence levels. Additionally, we furnish empirical validation through real-time experiments, thereby rendering our system aptly suited for smart surveillance in real-world scenarios.

Fig. 1: Overall Structure of the Proposed Framework

2.3 Existing human activity recognition (HAR) and crowd datasets

The most important part of an AI-based smart surveillance system for crowd behavior detection is the availability of benchmark datasets for training purposes. Here, we provide a review of existing publicly available datasets for crowd management and HAR closely related to our work.

  • Movie Actions Dataset [44]: The dataset provides annotated movie clips. Each clip belongs to one of 51 action classes, such as GetOutCar, HandShake, HugPerson, Kiss, SitDown, SitUp, StandUp, etc.

  • UCF50 [45] & UCF101 [46]: The UCF50 and UCF101 datasets consist of YouTube clips grouped into one of 50 and 101 action categories, respectively. Examples of action classes in the UCF50 dataset include Basketball Shooting and Pull-Ups, while the action classes in UCF101 span a wider spectrum subdivided into five categories, namely body motion, human-human interactions, human-object interactions, playing musical instruments, and sports.

  • Kinetics Dataset: This dataset consists of three versions: Kinetics-400 [47], Kinetics-600 [48], and Kinetics-700 [49]. Kinetics-400 is a large-scale action recognition dataset that contains around 240,000 video clips categorized into 400 action classes, with each clip lasting around 10 seconds on average; it was designed for the task of action recognition in videos. Kinetics-600 extends Kinetics-400 with additional action classes, providing a broader range of actions for more comprehensive research and evaluation. Kinetics-700 extends the action classes even further, providing a more diverse and challenging dataset for action recognition tasks.

  • Violent Flows [50]: Focuses on crowd violence; it comprises 246 crowd videos extracted from YouTube, labeled into two classes: violence and non-violence.

  • UCF Crime Dataset [51]: A collection of long surveillance videos from YouTube and LiveLeak that consists of thirteen crime classes (e.g., road accidents, burglary, robbery, etc.).

  • CCTV-fights [52]: A dataset of 1000 videos of real fights caught by CCTV cameras, whose cumulative length exceeds 8 hours, annotated as fight and non-fight.

  • Surveillance Camera Fight Dataset [53]: Contains 300 videos collected from movies and hockey games, divided equally into two classes: fight and non-fight.

  • UMN [54]: The dataset comprises eleven videos and is intended for classifying a crowd as either normal or abnormal, where the two classes are distinguished based on the running patterns of people in the crowd.

  • UCF Normal/Abnormal Web Dataset [55]: A collection of twenty videos with normal, escape panic, clash, and fights as crowd classes.

In short, although the HAR datasets are useful for testing different DL architectures, they are not necessarily useful for specific practical tasks, such as surveillance, which likely requires the distinction between a limited number of specific action classes. Furthermore, to the best of our knowledge, no video dataset in the literature contains large gatherings, such as protests, as an action class. For instance, protest datasets in the literature are limited to image datasets [56] and protest metadata [57], which document protester demands, government responses, protest location, and protester identities. Thus, the novelty of our developed video dataset is that it is specifically aimed toward identifying scenarios of public unrest (violent protests, fights, etc.) or scenarios that have the potential to develop into public unrest (large gatherings, peaceful protests, etc.). Large gatherings are particularly interesting and important to be carefully monitored as they can lead to unruly events. Large gatherings that seem peaceful can evolve into a violent scenario with fighting, destruction of property, etc. In addition, the scale of violence captured can inform the scale of the response from law enforcement. Thus, for the current task, we divide violence into small-scale violence (i.e., F) and large-scale violence (i.e., LVG). To our knowledge, these aspects have been largely neglected in existing datasets, which motivates this work.

3 Proposed framework and dataset

This section describes the proposed model for analyzing internet and surveillance videos as well as the dataset used to train that model. Figure 1 depicts the overall system architecture of the proposed framework.

Fig. 2: Architecture of Swin-T [35]. The input video is represented by a tensor of shape \(T \times H \times W \times 3\), where T is the number of frames and \(H \times W\) is the height and width of each frame having 3 channels (RGB)

3.1 Video swin transformer

The main backbone of our framework is the swin transformer, more precisely, the variant known as the video swin transformer. The swin transformer is characterized by its hierarchical architecture, which partitions images into small patches at the initial layers of the transformer structure and progressively merges neighboring patches at deeper levels to create larger ones. It leverages the concept of shifted windows when computing self-attention, thereby enhancing its representational capacity and contributing to its remarkable recent state-of-the-art performance [58]. Beyond its state-of-the-art performance, the swin transformer demonstrates superior computational efficiency compared to other models. Notably, the computational demands of the model grow linearly with the input image resolution, in contrast to other models where computation time escalates quadratically with increasing image resolution. Among the multiple versions of the video swin transformer, we adopt Swin-T, the tiny version, as it is designed to be more efficient and faster than the other versions, making it well-suited for scenarios where computational resources are limited or inference speed is crucial. The architecture of Swin-T is provided in Fig. 2.

The framework consists of four stages, where each stage, except stage 1, has three components: a patch merging layer, a linear layer, and a video swin transformer block. In stage 1, each frame in the input video, \(V = \{f_1, f_2, ...f_T\}\), is divided into 3D patches/tokens of size \(2 \times 4 \times 4 \times 3\) by the 3D patch partition layer, which results in \(\frac{T}{2} \times \frac{H}{4}\times \frac{W}{4}\) tokens. These tokens are given to the linear embedding layer, where the features of each token are projected to an arbitrary dimension C (for Swin-T, \(C =96\)). The patch merging layer of each subsequent stage performs spatial downsampling by concatenating \(2 \times 2\) neighboring patches, and a linear layer is utilized to project the concatenated patches to half of the input dimension. The significant block in each stage is the video swin transformer block, which comprises a 2-layer multi-layer perceptron (MLP) with Gaussian Error Linear Unit (GELU) activation and a 3D shifted window-based multi-head self-attention (3DWMSA) module, as shown in Fig. 3.

Fig. 3: Illustration of a Video Swin Transformer block [35]

A residual connection is established around each module to mitigate vanishing gradients, and layer normalization (LN) is applied to the input of the 3DWMSA and MLP modules to control covariate shift. A block of the video swin transformer, as illustrated in Fig. 3, is given by

$$\begin{aligned} \hat{z}^l = 3DWMSA(LN(z^{l-1})) + z^{l-1} \end{aligned}$$
(1)

and

$$\begin{aligned} z^{l} = MLP(LN(\hat{z}^l)) + \hat{z}^l, \end{aligned}$$
(2)

where \(\hat{z}^l\) represents the input to the MLP at layer l, while \(z^{l}\) denotes the output from the layer l MLP, which is subsequently passed to layer \(l+1\).
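To make the block structure concrete, the following PyTorch sketch mirrors the pre-normalization residual form of (1) and (2). It is only an illustration: a plain multi-head self-attention over all tokens stands in for the windowed 3DWMSA, and the module name, dimensions, and token grid are illustrative choices rather than values taken from the paper.

```python
import torch
import torch.nn as nn

class SwinBlockSketch(nn.Module):
    """Pre-norm residual structure of equations (1)-(2).

    NOTE: plain multi-head self-attention over all tokens is used here as a
    stand-in for the 3D shifted-window attention (3DWMSA); the real block
    restricts attention to P x M x M windows.
    """

    def __init__(self, dim: int = 96, num_heads: int = 3, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(              # 2-layer MLP with GELU, eq. (2)
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # eq. (1): z_hat^l = 3DWMSA(LN(z^{l-1})) + z^{l-1}
        h = self.norm1(z)
        z_hat = self.attn(h, h, h, need_weights=False)[0] + z
        # eq. (2): z^l = MLP(LN(z_hat^l)) + z_hat^l
        return self.mlp(self.norm2(z_hat)) + z_hat


tokens = torch.randn(2, 8 * 7 * 7, 96)   # (batch, tokens, C) -- small grid for illustration
out = SwinBlockSketch()(tokens)          # output has the same shape as the input
```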

Fig. 4: Example of 3DWMSA [35]

The 3DWMSA is responsible for efficient event recognition from temporal video data through its multi-head self-attention (MSA) property and non-overlapping 3D windows. Each input V is divided into \(T' \times H' \times W'\) tokens, and these tokens are partitioned into non-overlapping 3D windows of size \(P \times M \times M\). That is, the MSA of the first layer operates over \( \lceil \frac{T'}{P}\rceil \times \lceil \frac{H'}{M}\rceil \times \lceil \frac{W'}{M}\rceil \) non-overlapping 3D windows. The window partition of the second layer is shifted along the temporal and spatial axes by \((\frac{P}{2}, \frac{M}{2}, \frac{M}{2})\) tokens. An example of 3DWMSA is provided in Fig. 4. Finally, self-attention is computed by including a 3D relative position bias, \(B\), and is given by

$$\begin{aligned} Attention(q,k,v) = softmax\left(\frac{qk^T}{\sqrt{d}}+B\right)v, \end{aligned}$$
(3)

where q represents the query matrix with dimension d, and k and v denote the key and value matrices, respectively, for the self-attention calculation of T frames. Finally, after stage 4, a softmax layer is employed to calculate the probability distribution of crowd behavior labels.
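A minimal sketch of the attention computation in (3) is given below; the relative position bias is passed in as a plain tensor here, whereas in the actual model it is a learnable parameter indexed by the relative 3D offsets between tokens within a window. The shapes and the zero-initialized bias are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def windowed_attention(q, k, v, bias):
    """Scaled dot-product attention with an additive 3D relative position
    bias B, following eq. (3): softmax(q k^T / sqrt(d) + B) v.

    q, k, v : (num_windows, tokens_per_window, d)
    bias    : (tokens_per_window, tokens_per_window), broadcast over windows
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5 + bias   # bias added after scaling
    return F.softmax(scores, dim=-1) @ v


# Toy example: 4 windows of P * M * M = 2 * 4 * 4 = 32 tokens, feature dim d = 32
P, M, d = 2, 4, 32
q = k = v = torch.randn(4, P * M * M, d)
B = torch.zeros(P * M * M, P * M * M)     # learnable in the actual model
out = windowed_attention(q, k, v, B)      # shape (4, 32, 32)
```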

The proposed framework leverages crowd counting maps (\(CC\_Maps\)) and optical flow patterns (\(Opt\_Flow\)) as important components that augment supplementary semantic knowledge for classifying crowd behavior based on attributes including crowd size and violence level. The \(CC\_Maps\) are computed for alternate frames of each sample, whereas the \(Opt\_Flow\) maps are computed for every pair of consecutive frames. For a sample with frames \(\{f_1, \dots , f_{T}\}\), we compute a \(CC\_Map\), C, for the frames \(\{f_1, f_3, f_5, \dots , f_{T-1}\}\), skipping one frame at a time. Additionally, we compute the \(Opt\_Flow\), O, for each frame pair \(\{(f_i, f_{i+1}) \mid i \in [1, T-1]\}\). Consequently, one input sample to the swin transformer is the result of the concatenation of the T input frames \(V =(f_1, f_2, ..., f_{T})\), the T/2 \(CC\_Maps\), \(C = (c_1, c_2, ..., c_{T/2})\), and the T-1 \(Opt\_Flow\) maps, \(O = (o_1, o_2, ..., o_{T-1})\), and is represented as

$$\begin{aligned} I_j = V_j\uplus C_j \uplus O_j, \end{aligned}$$
(4)

where \(j=1,2,3,...n\) denotes the number of samples of each video and \(\uplus \) is the concatenation operation. The overall procedure of the proposed model is illustrated in Algorithm 1. The following subsection furnishes a detailed explanation of the processes involved in generating \(CC\_Maps\) and \(Opt\_Flow\).
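The concatenation in (4) can be sketched as follows. The exact stacking layout is not specified in the text, so the frame-wise stacking below, with the single-channel \(CC\_Maps\) broadcast to three channels, is an assumption made purely for illustration.

```python
import torch

def build_sample(frames, cc_maps, flow_maps):
    """Assemble one input sample I_j = V_j (+) C_j (+) O_j as in eq. (4).

    frames    : (T, 3, H, W)      RGB frames
    cc_maps   : (T // 2, 1, H, W) crowd-counting maps (every other frame)
    flow_maps : (T - 1, 3, H, W)  optical-flow visualizations
    """
    cc_rgb = cc_maps.expand(-1, 3, -1, -1)          # broadcast grey maps to 3 channels
    return torch.cat([frames, cc_rgb, flow_maps], dim=0)


T, H, W = 20, 224, 224
sample = build_sample(torch.rand(T, 3, H, W),
                      torch.rand(T // 2, 1, H, W),
                      torch.rand(T - 1, 3, H, W))
print(sample.shape)   # torch.Size([49, 3, 224, 224]) -> 20 + 10 + 19 stacked maps
```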

Algorithm 1: Crowd Behavior Detection Model

Fig. 5: Sample frames and their respective crowd counting maps

3.2 Crowd counting and optical flow maps

Recall that our primary objective entails the classification of human crowd behavior, and this classification is predicated on two key parameters: the crowd's size and the level of violence exhibited. Specifically, we are concerned with two fundamental aspects within the input video data: the dynamics of individuals' movements captured in the video and the spatial concentration of these individuals. It is worth noting that the motion patterns within the crowd can offer insights into its potential for violence. For instance, a violent crowd tends to manifest erratic motion, while a peaceful crowd's movement is more likely to be slow and subtle. Also, the concentration of people in a crowd can inform whether the crowd is large or small; the higher the density in a significant proportion of the frame, the more likely the crowd is to be large. Besides, a crowd's mobility and density distribution may interact in less obvious ways to help classify the crowd as small or large, violent or non-violent. Thus, to aid in crowd footage classification, we utilize \(CC\_Maps\), which contain information about the crowd's density distribution, and \(Opt\_Flow\), which stores information about crowd movement. This section describes how optical flow and crowd-counting maps are extracted for videos in the datasets and how they are utilized for training and validation.

3.2.1 Computation of CC_Maps

Crowd counting and localization have drawn significant attention in the literature for their usefulness in surveillance, tracking, and crowd management applications [59]. Crowd counting can also be useful in our application since it can inform us about the size of the crowd, which helps in distinguishing between LPG and N, as well as between LVG and F. There are two ways in which crowd counting could be helpful for our purposes. One way would be to obtain the number of people present in a video [60, 61] and use it as a feature of the input video to aid in classification. This approach has two potential drawbacks. First, the total number of people does not always inform us about the number of people involved in the action. In other words, a large number of people could be in the background of the scene while the relevant action in the foreground is taking place, meaning that the distribution of the people in the crowd also matters. Secondly, since we are dealing with video data, the number of people is just a single feature, and its influence during inference might be greatly diminished by the thousands of features extracted and used to obtain a final classification of an input video.

Rather than relying solely on headcount as a feature, our approach is geared towards the computation of crowd density maps. These maps serve as continuous, smoothed heat maps, functioning as a visual representation of the crowd's distribution and intensity. We employ the idea proposed by Wan et al. [59] to generate \(CC\_Maps\): a pre-trained VGG19-based model [62] takes \(V = \{f_1, f_2, \dots, f_T\}\) as input and returns a 2-dimensional crowd density matrix with values between 0 and 1, which can be transformed into a grey-scale image, C. In this grey-scale image, a higher value at a pixel indicates a higher crowd concentration at that pixel. For each frame \(f_i\), we produce a crowd density estimation, \(C_i\), in the form of a grey-scale image. An example of a sequence of 3 frames and their respective \(CC\_Maps\) is shown in Fig. 5. Instead of processing crowd-density maps independently of the image sequences, we opt to concatenate both sets of images and process them through the swin transformer at once. This allows the network to learn the complex relationship between the frames and the crowd densities and how those two change with time.
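As a rough illustration of this step, the sketch below follows the general recipe described above: VGG19 convolutional features followed by a small regression head that outputs a single-channel density map in [0, 1], upsampled back to the frame resolution. The head architecture and the (omitted) pre-trained crowd-counting weights are assumptions; the code does not reproduce the exact model of Wan et al. [59].

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19

class DensityEstimatorSketch(nn.Module):
    """VGG19-based crowd density estimator in the spirit of Wan et al. [59].

    The regression head is an assumption; in practice, weights from a
    pre-trained crowd-counting model would be loaded instead of the random
    initialization used here.
    """

    def __init__(self):
        super().__init__()
        self.backbone = vgg19(weights=None).features[:36]   # VGG19 conv features
        self.head = nn.Sequential(nn.Conv2d(512, 128, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(128, 1, 1))

    def forward(self, frame):                 # frame: (B, 3, H, W)
        density = torch.sigmoid(self.head(self.backbone(frame)))   # values in [0, 1]
        # upsample back to the frame resolution -> grey-scale CC_Map
        return nn.functional.interpolate(density, size=frame.shape[-2:],
                                         mode="bilinear", align_corners=False)


cc_map = DensityEstimatorSketch()(torch.rand(1, 3, 224, 224))   # (1, 1, 224, 224)
```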

Fig. 6: Three consecutive frames and their two corresponding Optical Flow maps

3.2.2 Generation of Opt_Flow

Optical flow is the distribution of velocities of brightness patterns in an image [63]. These velocities arise as a result of relative movement between the objects in the video or the video's point of view, such as a change in the position or orientation of the camera. Optical flow maps are image representations that can be computed for two consecutive video frames. The adjacent frames \(f_i\) and \(f_{i+1}\) of the input video frames \(\{f_i\}_{i=1}^T\) are utilized to generate the \(Opt\_Flow\) map, \(O_i\), using the pre-trained Recurrent All-Pairs Field Transforms (RAFT) model described in [64].

RAFT is a deep learning architecture that addresses the problem of estimating optical flow by predicting per-pixel displacements between two frames. Unlike traditional optical flow methods that often rely on handcrafted features and assumptions about brightness constancy, RAFT takes a learning-based approach. It utilizes a recurrent neural network (RNN) to model the interactions between pixels in a pair of frames and predict the flow field that best explains the observed motion. RAFT computes pixel-wise feature vectors and uses these vectors to find, for each pixel in the first image, the corresponding pixel in the second image. The product of this operation is a field of vectors, one for each pixel, that shows the "movement" of each pixel. Each neighborhood of pixels that moves together is colored homogeneously. An example of three consecutive frames and their two corresponding \(Opt\_Flow\) maps is shown in Fig. 6. Homogeneous color patterns in the figure represent regions in the video frame where motion is relatively uniform in both direction and magnitude. These patterns can help identify regions within a crowd where people are moving collectively or uniformly, potentially indicating activities like large-scale gatherings or synchronized movement.
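A minimal way to obtain such flow maps, using the pre-trained RAFT implementation shipped with torchvision as a stand-in for the model of [64], is sketched below; the small RAFT variant, input sizes, and preprocessing are illustrative choices and may differ from the paper's exact setup.

```python
import torch
from torchvision.models.optical_flow import raft_small, Raft_Small_Weights
from torchvision.utils import flow_to_image

weights = Raft_Small_Weights.DEFAULT          # downloads pre-trained RAFT weights
model = raft_small(weights=weights).eval()
preprocess = weights.transforms()             # normalization expected by RAFT

f_i = torch.rand(1, 3, 224, 224)              # frame f_i      (values in [0, 1])
f_next = torch.rand(1, 3, 224, 224)           # frame f_{i+1}
img1, img2 = preprocess(f_i, f_next)

with torch.no_grad():
    flows = model(img1, img2)                 # list of iteratively refined flow fields
o_i = flow_to_image(flows[-1])                # (1, 3, 224, 224) color-coded Opt_Flow map
```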

Since optical flow shows the movement of objects relative to the camera, we must guarantee that the camera is sufficiently stationary for \(Opt\_Flow\) to be useful. Thus, \(Opt\_Flow\) maps are useful mostly for CCTV footage rather than for internet videos, since internet videos tend to be fast-moving videos taken from the ground. This is in contrast to CCTV cameras, which are almost always stationary and usually film from a high point of view.

3.3 Dataset collection

Given the absence of pre-existing datasets aligning with our specific class criteria, we embarked on the creation of a novel dataset. This dataset serves as the training foundation for our model designed to monitor internet and CCTV videos. Subsequently, a distinct subset extracted from this dataset is employed for the exclusive training of a model designated for the analysis of CCTV footage. It is important to emphasize that our dataset is uniquely customized to comprise the four distinct classes of behavior requisite for monitoring both internet and CCTV videos, spanning large-scale and small-scale peaceful as well as violent events. This custom dataset fulfills the precise requirements essential for our research objectives. To this end, a large set of YouTube videos and videos from pre-existing datasets that contain one or more of the classes of interest were identified. The videos were given unique IDs that indicate the order in which the videos were obtained. Then, the start and end time stamps of the occurrences of each class in each of the collected videos were recorded. The record of the occurrence of a class in a video reads as follows: "video i contains an instance of class c from time stamp \(h_i:m_i:s_i\) to time stamp \(h_f:m_f:s_f\)", where h, m, and s represent the hours, minutes, and seconds of the timestamp, respectively.

In order to prepare a video for being fed into the proposed framework, the frames of the time periods where the classes occur must be extracted. Before that, we guarantee that the time difference between every two consecutive frames is the same for all videos by setting the frame rate of each video to 10 frames per second (FPS). 10 FPS was chosen since it’s a reasonable frame rate that allows the model to analyze the videos in sufficient detail without needing excessive storage space for the frames of each video. Then, we extract the frames of each occurrence of each class in each of the collected videos. Note that the number of frames for each occurrence of each of the classes can be different since the time periods during which an instance of a class occurs in a video can vary in length, thus changing the number of frames of that instance. For example, an instance that occurs from time stamp 0:0:0 to 0:0:10 has 11 seconds \(\times \) 10 frames/second = 110 frames while an instance that occurs from time stamp 0:0:5 to 0:0:8 has 4 seconds \(\times \) 10 frames/second = 40 frames.

However, the number of frames taken by a DL model must be constant and set before training. To resolve this, we have determined that our model will take 20-frame sequences as input. This is because 20 frames is the minimum number of frames for any possible occurrence of one of the chosen classes of behavior, given the way we record these occurrences. The shortest occurrence of a class is one that begins at time stamp h : m : s and ends at time stamp \(h:m:(s+1)\), meaning that it will have 2 seconds \(\times \) 10 frames/second = 20 frames. Occurrences that are more than 2 seconds long are then used to produce more than one 20-frame sequence using a sliding window. For instance, if an occurrence of one of the classes that starts at time stamp \(h_i:m_i:s_i\) and ends at time stamp \(h_f:m_f:s_f\) has k frames \(\{f_j, f_{j+1}, \dots , f_{j+k-1}\}\), a sliding window of size 20 will slide through the frames, taking a 20-frame sequence at each step.

Specifically, re-indexing the frames of the occurrence from 0, the 20-frame sequences that could be extracted are \(\{f_i, \dots ,f_{i+19}\} \forall i \in [0,k-20]\). That is, every two consecutive 20-frame sequences would share 19 frames. Note that consecutive 20-frame sequences sharing some frames are valuable, as this trains the model to be somewhat time-invariant. For example, a 20-frame sequence with a punch must be categorized as Fighting no matter where the punch occurs in the 20-frame sequence. However, sharing 19 frames out of 20 is inefficient because the dataset would require excessive storage space. Instead, we use a sliding window that jumps 10 frames at each step, meaning that consecutive 20-frame sequences only share 10 frames. In particular, for an occurrence with k frames \(\{f_j, f_{j+1}, \dots , f_{j+k-1}\}\), the 20-frame sequences that are used in the dataset are \(\{f_{10i}, \dots ,f_{10i+19}\} \forall i \in [0,m]\), where \(m=\lfloor \frac{k}{10}\rfloor - 2\). The 20-frame sequences, which we call samples, are extracted for each class occurrence in each collected video and added to our dataset. Overall, 2,570 different videos were collected, and the cumulative duration of the recorded occurrences amounted to 68 hours.
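The sliding-window extraction described above can be summarized by the short sketch below (function and variable names are illustrative):

```python
def extract_samples(frames, window=20, stride=10):
    """Split one class occurrence (frames at 10 FPS) into fixed-length samples
    with a sliding window; consecutive samples share window - stride = 10 frames.
    """
    return [frames[i:i + window]
            for i in range(0, len(frames) - window + 1, stride)]


# An occurrence lasting 4 s at 10 FPS -> 40 frames -> 3 samples
# (frames 0-19, 10-29, 20-39), matching m = floor(40 / 10) - 2 = 2.
occurrence = list(range(40))
assert len(extract_samples(occurrence)) == 3
```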

3.4 Model training

To train the swin transformer model effectively, we partitioned the videos into distinct training and validation sets. However, it is important to clarify that our division was based on samples, not entire videos, with the aim of allocating 80% of the samples for training and reserving 20% for validation. Achieving this 80-20 sample split, while ensuring that no video contributes to both sets, was accomplished through a random search procedure as follows. Initially, a random selection of videos, of random size, was chosen from the video dataset for training, while the remaining videos were designated for validation. The number of training and validation samples for each class within the training and validation videos was tallied, and the per-class training/validation ratio was calculated. After 2 hours of searching, the training and validation sets that achieved the per-class split closest to 80-20 were selected. As a result of this procedure, we arrived at a specific set of 1977 training videos and 593 validation videos, yielding the following per-class training/validation ratios: N: 80.69% / 19.31%; LPG: 78.62% / 21.38%; LVG: 79.51% / 20.49%; F: 79.18% / 20.82%. This process ensured a well-balanced and representative distribution of samples across classes for both training and validation, thereby contributing to the robustness of our model training.
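A simplified version of this random-search procedure is sketched below; the data structures, the error criterion (maximum per-class deviation from the target ratio), and the shortened time budget are assumptions for illustration only.

```python
import random
import time
from collections import Counter

def random_search_split(video_samples, target=0.8, budget_s=5.0):
    """Randomly search for a per-video split whose per-class sample ratio is
    closest to `target`. `video_samples` maps video_id -> list of class labels
    (one per sample). The paper's search ran for ~2 hours; a short budget is
    used here so the sketch finishes quickly.
    """
    videos = list(video_samples)
    best, best_err, start = None, float("inf"), time.time()
    while time.time() - start < budget_s:
        train = set(random.sample(videos, random.randint(1, len(videos) - 1)))
        tr, va = Counter(), Counter()
        for vid, labels in video_samples.items():
            (tr if vid in train else va).update(labels)
        classes = set(tr) | set(va)
        if not all(tr[c] and va[c] for c in classes):
            continue                                  # every class in both splits
        err = max(abs(tr[c] / (tr[c] + va[c]) - target) for c in classes)
        if err < best_err:
            best, best_err = (train, set(videos) - train), err
    return best
```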

As mentioned in Section 3.2.2, \(Opt\_Flow\) maps prove most effective when applied to videos featuring a stable camera viewpoint. In cases where the camera is in motion, the use of optical flow maps can lead to potential confusion, as the model might interpret camera-induced movement as object motion within the video. To mitigate this issue, we extracted a subset from our dataset comprising samples characterized by minimal changes in the camera's perspective. This subset closely resembles typical CCTV footage, where cameras are typically stationary and not mobile. Our approach involves a detailed examination of each recorded occurrence within the various classes. If a segment of the video demonstrates "significant" camera movement, we exclude that particular occurrence record from the dataset. This process yielded a 25-hour dataset primarily consisting of stationary samples, which we refer to as the Static dataset. Conversely, the broader dataset, of which the Static dataset is a subset, is termed the Original dataset. Note that the proposed model, when trained on the Original dataset, can be used for monitoring internet videos. In contrast, when trained on the Static dataset, it becomes well-suited for CCTV monitoring applications. For the Static dataset, the same random search procedure was adopted for splitting the dataset into training and validation videos. This process resulted in 1121 videos allocated for training and 279 videos designated for validation. The training/validation ratios achieved for the Static dataset were 79.72% / 20.28%, 79.03% / 20.97%, 80.01% / 19.99%, and 79.80% / 20.20% for N, LPG, LVG, and F, respectively.

Fig. 7: Sample frames for each behavior from our dataset

4 Experimental analysis

The experiments were conducted using a novel dataset of videos collected from YouTube and existing crowd datasets. The details of the video collection for the dataset are provided in Section 3.3. We define four crowd behavior classes based on crowd size and violence level: Natural (N), Large Peaceful Gathering (LPG), Large Violent Gathering (LVG), and Fighting (F). LPG depicts a large number of individuals gathered for a common purpose, like peaceful protests or sports spectators, whereas LVG represents a large group of individuals of whom a significant number are engaged in violent actions, including clashes with police, fighting between members of the crowd, property destruction, etc. On the other hand, F refers to a small group of individuals fighting each other, and if the footage shows no relation to the above-described behaviors, it is classified as N. Figure 7 portrays sample frames from each class. The extracted videos were annotated carefully by identifying when behaviors of interest occurred. This was done by recording the start and end time stamps within which interesting behaviors were observed, as shown in Table 1. Each video is assigned a unique ID, and the occurrence of a class recorded in the annotation table is denoted as an instance of that class.

Table 1 A sample annotation table - 5 instances of the behaviors in 3 separate videos

Our dataset consists of 68 hours of videos and is referred to as the Original dataset in the rest of the paper. The Original dataset comprises videos from both static CCTV cameras and moving cameras. To perform experiments on video footage from stationary CCTV cameras, a key component of city-wide surveillance, we extract a subset of videos that match CCTV footage, called the Static dataset, consisting of 25 hours of video. For training and validation, the videos from both datasets were converted into non-overlapping frames with a frame size of \( 224 \times 224\). As explained in Section 3.3, the videos were converted into 20-frame samples, and hence the input to the swin transformer is a tensor of size \( 20 \times 3 \times 224 \times 224 \). The training and validation of the proposed model were performed with a training-validation ratio of 8:2, as discussed in Section 3.4. All experiments were done using Python's PyTorch framework on an NVIDIA GeForce GPU with CUDA 11.4.

The training of the proposed model was performed by minimizing the categorical cross-entropy loss using the Stochastic Gradient Descent (SGD) optimizer with an initial learning rate of 0.0001, a momentum of 0.9, and a weight decay of 0.0001. The hyperparameters used for training are compiled in Table 2. Figure 8 depicts the average loss values during the training and validation of crowd behavior classification. The decreasing loss demonstrates that the proposed approach successfully learns to predict behaviors consistent with the ground-truth labels.
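The optimization setup can be summarized by the following sketch, which uses the reported hyperparameters (SGD with learning rate 0.0001, momentum 0.9, weight decay 0.0001, and categorical cross-entropy); the tiny stand-in model and random batch merely take the place of the swin transformer and the real data loader.

```python
import torch
import torch.nn as nn

# Stand-in model and random batch; in practice these are the swin transformer
# and batches of 20 x 3 x 224 x 224 samples from the training set.
model = nn.Sequential(nn.Flatten(), nn.Linear(20 * 3 * 224 * 224, 4))   # 4 classes
criterion = nn.CrossEntropyLoss()                        # categorical cross-entropy
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4,
                            momentum=0.9, weight_decay=1e-4)

clips = torch.rand(2, 20, 3, 224, 224)    # batch of two 20-frame samples
labels = torch.tensor([0, 2])             # e.g., N and LVG

for step in range(3):                     # a few illustrative optimization steps
    optimizer.zero_grad()
    loss = criterion(model(clips), labels)
    loss.backward()
    optimizer.step()
```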

Table 2 Hyperparameters used for training the proposed model
Fig. 8: Average loss during the training and validation of the proposed model

We validated the model by calculating the average accuracy and mean average precision (mAP) for the instance videos as well as the sample videos. Instance videos are a set of frames whose starting and ending timestamps are identified and recorded for a specific class as given in Table 1. Sample videos are equal-sized image sequences extracted from the instance video, and we set the sample size as 20 frames. The details of sample extraction are given in Section 3.3, and the number of instances and samples used for training and validation for each crowd behavior is portrayed in Table 3. The average accuracy and mAP of the proposed model are shown in Table 4 and the confusion matrix is portrayed in Fig. 9.

Table 3 Number of instances and samples for each crowd behavior used for training and validation

In terms of accuracy and mAP, the model performs better on the Static dataset than on the Original dataset. This difference is due to the nature of the Original dataset, which includes both online and CCTV videos. When the camera is in motion, optical flow maps can be misleading, causing the model to mistakenly interpret camera movement as object motion. Conversely, the Static dataset closely mirrors typical CCTV footage, where cameras are usually stationary, thus providing more consistent and reliable data for the model.

Furthermore, we recorded two types of accuracy scores: "sample accuracy" and "instance accuracy." Sample accuracy is calculated by performing inference on all samples in the validation set and then dividing the number of correctly classified validation samples by the total number of validation samples. Conversely, instance accuracy is measured by performing inference on all samples within an instance. If the class to which most samples in an instance are assigned matches the instance's label, the number of correctly classified instances is increased by one. The total number of correctly classified instances is then divided by the total number of instances in the validation set to obtain the instance accuracy of the model. Hence, in most cases, sample accuracy is higher than instance accuracy. In the proposed dataset, the number of LVG videos is lower compared to the other classes. This class imbalance results in reduced performance for the LVG class relative to the others.
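The majority-vote computation of instance accuracy can be expressed compactly as follows (a sketch with illustrative names):

```python
from collections import Counter

def instance_accuracy(instances):
    """Majority-vote instance accuracy: `instances` is a list of
    (instance_label, [predicted sample labels]); an instance counts as correct
    when the most frequent sample prediction equals its label.
    """
    correct = sum(Counter(preds).most_common(1)[0][0] == label
                  for label, preds in instances)
    return correct / len(instances)


# Toy example: the first instance is classified correctly, the second is not
print(instance_accuracy([("F", ["F", "F", "N"]),
                         ("LVG", ["LPG", "LPG", "LVG"])]))   # 0.5
```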

The experimental analysis demonstrated the efficacy of the proposed model in effectively detecting crowd behavior, considering both crowd size and the degree of violence, using data from CCTV and online video sources. Since the videos in the dataset contain diverse crowd scenes captured under multiple climatic conditions and with varying degrees of occlusion, the performance analysis indicates that the model is robust to variations in weather conditions, occlusion, and video quality. Moreover, the swin transformer's ability to capture both local and global context through shifted windows and hierarchical processing helps maintain performance despite these variations. The use of optical flow maps and the attention mechanism of the swin transformer can reweight the attention distribution to focus on visible and relevant parts of the image, thereby mitigating the impact of occlusions and video quality.

Table 4 Average Accuracy (%) and mAP(%) of the proposed approach
Fig. 9: Confusion Matrix portraying the per class accuracy in the Static and Original datasets

Fig. 10: Impact of crowd counting maps and optical flow in Original Dataset

4.1 Impact of crowd counting maps and optical flow in crowd behavior detection

Crowd behavior is recognized using our proposed swin transformer model, which takes crowd-counting maps and optical flow maps as input along with the original input frames. Given our classification task's focus on differentiating crowd behavior based on size and violence level, the integration of crowd-counting maps plays a pivotal role in enhancing the precision of crowd behavior detection by effectively discriminating between large-scale and small-scale events. In contrast, optical flow maps help analyze the temporal patterns of significant object motions in a sample video. We performed experiments to examine the effect of crowd-counting maps and optical flow on the Original and Static datasets. The analysis was done by estimating the mAP, sample accuracy, and instance accuracy while varying the input to the swin transformer in the following four ways: (1) Swin Only: frames from the video sample were given as input; (2) Swin+OptFlow: input frames were concatenated with optical flow maps; (3) Swin+CCmaps: input frames were concatenated with crowd-counting maps; and (4) Swin+CCmaps+OptFlow: our proposed approach, where crowd-counting maps and optical flow maps were concatenated with the original input frames. The results are portrayed in Figs. 10 and 11, which display the comparison of average accuracy (%) and mAP (%) on the Original and Static datasets. It is clear from the figures that the combination of crowd-counting maps and optical flow patterns has a considerable impact on behavior detection. Figure 12 shows sample frames representing the four classes that were correctly classified by the proposed approach. Furthermore, Fig. 13 presents three scenarios aimed at illustrating the significance of crowd-counting maps and optical flow maps in distinguishing crowd behavior with respect to size and violence levels. These figures exemplify instances where our approach outperformed alternative methods in accurately categorizing crowd behavior.

Fig. 11: Impact of crowd counting maps and optical flow in Static Dataset

Fig. 12: Sample frames from the four classes that were correctly classified by the proposed approach

Fig. 13: Three scenarios to show the importance of crowd maps and optical flow in the proposed approach. Crowd counting maps help in the differentiation of LPG and N or LVG and F. On the other hand, optical flow maps assist in identifying temporal patterns within a video, aiding in the discrimination between categories such as LPG and LVG, or F and N. (a) Fight scene in a largely empty area misclassified as N without optical flow. (b) A large crowd with a violent scene at the end is classified as LPG when optical flow is not considered. (c) Crowd counting maps help to correctly distinguish LPG from N

Table 5 Comparison of Accuracy (%) in the Original and Static Datasets
Fig. 14: Sample frames from benchmark datasets

4.2 Comparison with state-of-the-art approaches

We implemented state-of-the-art models used for video recognition tasks on our Original and Static datasets, and the results are given in Table 5. Average accuracy (%) was calculated for sample videos and instance videos, and it is evident that our approach outperforms recent video recognition models in detecting crowd behavior based on crowd size and violence level. The results also emphasize that crowd-counting maps and optical flow maps influence the behavior detection ability of the proposed model.

We empirically assess the efficacy of the proposed model through rigorous performance evaluation on established publicly available datasets consisting of behavior classes associated with fight and violence actions. To effectively evaluate the performance of our model, we require datasets containing instances of violent and non-violent scenarios captured from real surveillance environments. As such, we have chosen to utilize four distinct datasets for comparative analysis: the Hockey Fight Dataset [68], the Surveillance Camera Fight Dataset [53], the Violent Flows Dataset [50], and the RWF_2000 Dataset [69]. The Hockey Fight Dataset comprises a collection of 1,000 video sequences categorized into two distinct classes: fights and non-fights. A similar binary classification scheme is also applied to the Surveillance Camera Dataset, consisting of 300 video recordings. The RWF_2000 Dataset, on the other hand, incorporates a more extensive compilation featuring 2000 video clips that are segregated into the fight and non-fight categories. Lastly, the ViolentFlows dataset comprises 246 video instances, each annotated to distinguish between violent and non-violent behaviors.

Table 6 Comparison of Accuracy(%) in Hockey Fight Dataset

The Hockey Fight and RWF_2000 datasets comprise instances categorized into fight and non-fight classes, which align with our F and N classes, respectively. Similarly, the Violent Flows Dataset and the Surveillance Camera Fight Dataset contain scenes depicting both violent and non-violent scenarios, akin to our LPG, LVG, and N classes. Figure 14 illustrates sample frames from the datasets, and Tables 6, 7, 8, and 9 present the quantitative outcomes in terms of accuracy. The results substantiate the efficiency of our proposed approach in discerning patterns associated with violent and fight scenarios.

The evaluated benchmark datasets consist of classes representing fight/no fight or violence/non-violence scenarios. The Hockey Fight Dataset predominantly features fight sequences characterized by clearer visuals and a smaller number of individuals. In contrast, the Violent Flows Dataset presents more distinct patterns of both violent and non-violent behavior, thereby facilitating the model’s learning and generalization of patterns. Conversely, the Surveillance Camera Fight Dataset and RWF 2000 Dataset present more diverse and challenging scenarios. These datasets comprise variations in lighting conditions, camera angles, and crowd dynamics, which posed challenges for our model’s performance. Despite these complexities, our approach consistently demonstrated superior performance compared to existing methods.

Table 7 Comparison of Accuracy(%) in Surveillance Camera Fight Dataset
Table 8 Comparison of Accuracy(%) in RWF_2000 Dataset
Table 9 Comparison of Accuracy(%) in Violent Flows Dataset
Table 10 Inference time of the proposed model using DeepStream SDK

4.3 DeepStream for real-time analysis

A DL model for smart surveillance is considered efficient when it exhibits real-time inference capabilities that align with the demands of a surveillance environment. As a result, we perform model validation within a real surveillance ecosystem utilizing the DeepStream SDK [9]. This SDK serves as a powerful tool for deploying real-time video classification deep learning models. The deployment process involves the provision of a video source to DeepStream, which can be either an MPEG-4 (MP4) video stored locally or a video stream originating from a camera via the Real-Time Streaming Protocol (RTSP) [92].

DeepStream mandates that the DL model be in the Open Neural Network Exchange (ONNX) [93] format. To achieve this, we employed the "torch.onnx" module provided by PyTorch to convert the pre-trained proposed model into the ONNX format. Subsequently, the ONNX file representing the proposed DL model is specified within DeepStream's configuration file, facilitating the generation of an inference engine file. This inference engine file is crucial for subsequent executions of the DeepStream SDK, enabling real-time video classification.
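The conversion step can be sketched as follows; the stand-in model, file name, tensor names, and opset version are illustrative assumptions rather than the exact export settings used for the proposed model.

```python
import torch
import torch.nn as nn

# Stand-in for the trained swin transformer; only the export call matters here.
model = nn.Sequential(nn.Flatten(), nn.Linear(20 * 3 * 224 * 224, 4)).eval()

dummy_clip = torch.rand(1, 20, 3, 224, 224)        # one 20-frame sample
torch.onnx.export(model, dummy_clip, "crowd_behaviour.onnx",
                  input_names=["frames"], output_names=["class_logits"],
                  opset_version=17)
# The resulting .onnx file is the model path referenced in DeepStream's
# configuration file, from which the SDK builds its inference engine.
```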

Fig. 15: Sample frames displayed using DeepStream. The detected behaviors are shown in the top left corner of each frame

As part of our configuration process, we have defined critical DeepStream parameters for utilization. These parameters are (1) T, which denotes the number of video frames processed by the DL model during each inference cycle. For optimal performance, T has been configured to 20, aligning with the number of frames constituting a single sample. (2) H and W, which represent the dimensions of each video frame, specified as \(H \times W\). In our setup, both H and W have been set to 224. This specific dimension is mandated by the requirement of the backbone swin transformer model, which necessitates input frames to be 224\(\times \)224 in size.

Once the ONNX file and the path to the MP4 video or the RTSP stream link are supplied to DeepStream, the framework initiates the video playback while concurrently applying the proposed DL model to the frames. The visualization process varies depending on the video source: (a) Local MP4 Video: When processing a local MP4 video, DeepStream displays the video at its native frame rate. Simultaneously, it overlays the inference results for the most recent 20 frames in the top-left corner of the video display. This dynamic display simulates real-time inference, providing users with up-to-date classification information as the video plays. (b) RTSP Stream from a Camera: In the case of an RTSP stream sourced from a camera, DeepStream generates the class inference result in the top-left corner based on the last 20 frames received from the stream. This approach ensures that the displayed inference information reflects the most recent data processed from the live camera feed.

To execute the DeepStream SDK, we utilized an NVIDIA GeForce RTX 3080 GPU. The SDK is configured to capture either the local MP4 video or the most recent 20 frames from the RTSP stream. These captured frames are then processed by the proposed pre-trained model residing in the GPU. Subsequently, DeepStream incorporates the model's output label, which can signify "Natural (N)," "Large Peaceful Gathering (LPG)," "Large Violent Gathering (LVG)," or "Fighting (F)," into the incoming video feed, rendering it visible on the screen for real-time monitoring and analysis.

The proposed approach was tested for real-time inference using DeepStream, which accepted videos stored locally in MP4 format as well as video streams from a camera via RTSP. The locally stored video yielded inference results in 0.3 s, whereas the RTSP stream exhibited a delay of 5 seconds in displaying behavior inference, as portrayed in Table 10.

For visual reference, refer to Fig. 15, which illustrates sample frames from each class as displayed within the DeepStream environment, where the input is given in MP4 format. This visualization offers insights into how DeepStream seamlessly integrates real-time video processing and DL inference. This integration highlights the effectiveness and value of our proposed approach for crowd behavior recognition.

5 Conclusions

In a public surveillance system, proactive real-time analysis of crowds can be challenging due to the difficulties authorities face in promptly assessing crowd scale and potential violence levels. Furthermore, the current practice of conducting crowd behavior recognition by exclusively analyzing CCTV footage while neglecting the incorporation of online social media video content results in a predominantly reactive methodology. This necessitates datasets and models specifically designed to facilitate the analysis of both CCTV footage and online videos, with the capability to detect and classify crowd behavior along two essential dimensions: violence and crowd size. In this paper, we introduced a large dataset comprising 68 hours of data, including both stationary CCTV feeds and online social media content. We developed a subset of this extensive dataset, which includes only CCTV footage, to serve as a foundation for developing dedicated models suitable for CCTV video data analysis. A DL model based on the swin transformer architecture was trained to capture crowd behaviors consisting of regular events, large peaceful gatherings, large violent gatherings, and small-scale fighting. In addition, we enhanced the model's understanding of crowd dynamics and violence patterns by incorporating crowd-counting maps and optical flow maps as auxiliary data sources. The experimental analysis proved the efficacy of the proposed model in effectively detecting crowd behavior, taking into account both crowd size and the degree of violence, across data derived from both CCTV and online video sources. The proposed model was also tested on benchmark datasets, which further demonstrated its proficiency in distinguishing fight and violence patterns within video data. Finally, the real-time performance analysis of the proposed model, trained on our dataset and deployed through the DeepStream SDK, provides compelling evidence of the model's efficiency in real-time surveillance environments. In the future, we intend to develop multi-attention spatiotemporal DL models capable of detecting and predicting fine-grained crowd behavior within a single scenario.