Article

A Social Distance Estimation and Crowd Monitoring System for Surveillance Cameras

1 Faculty of Information Technology and Communication Sciences, Tampere University, 33720 Tampere, Finland
2 Faculty of Medicine, Clinicum, University of Helsinki, 00014 Helsinki, Finland
3 Department of Electrical Engineering, Qatar University, Doha, Qatar
4 TietoEVRY Oy, Keilalahdentie 2-4, 02101 Espoo, Finland
5 Haltian Oy, Yrttipellontie 1 D3, 90230 Oulu, Finland
* Author to whom correspondence should be addressed.
Sensors 2022, 22(2), 418; https://doi.org/10.3390/s22020418
Submission received: 26 November 2021 / Revised: 24 December 2021 / Accepted: 3 January 2022 / Published: 6 January 2022
(This article belongs to the Special Issue Computer Visions and Pattern Recognition)

Abstract

Social distancing is crucial to restrain the spread of diseases such as COVID-19, but complete adherence to safety guidelines is not guaranteed. Monitoring social distancing through mass surveillance is paramount to develop appropriate mitigation plans and exit strategies. Nevertheless, it is a labor-intensive task that is prone to human error and tainted with plausible breaches of privacy. This paper presents a privacy-preserving adaptive social distance estimation and crowd monitoring solution for camera surveillance systems. We develop a novel person localization strategy through pose estimation, build a privacy-preserving adaptive smoothing and tracking model to mitigate occlusions and noisy/missing measurements, compute inter-personal distances in the real-world coordinates, detect social distance infractions, and identify overcrowded regions in a scene. Performance evaluation is carried out by testing the system’s ability in person detection, localization, density estimation, anomaly recognition, and high-risk area identification. We compare the proposed system to the latest techniques and examine the performance gain delivered by the localization and smoothing/tracking algorithms. Experimental results indicate a considerable improvement, across different metrics, when utilizing the developed system. In addition, they show its potential and functionality for applications other than social distancing.

1. Introduction

The rapid outbreak of the Coronavirus Disease 2019 (COVID-19) has imposed restrictions on people’s movement and daily life [1]. Reducing the spread of the virus mandates constraining social interactions, traveling, and access to public areas and events [1]. These limitations mainly serve to promote social distancing: the practice of increasing physical space among people to minimize virus transmission [2]. Monitoring and maintaining social distancing is carried out by governmental bodies and agencies using mass surveillance systems and closed-circuit television (CCTV) cameras [3]. Nonetheless, this task is cumbersome and suffers from subjective interpretations and human error due to fatigue; hence, computer vision and machine learning tools are convenient for automation [4]. In addition, they enable crowd behavior to be monitored and anomalies such as congested regions, curfew infractions, and illegal gatherings to be recognized. The widespread adoption of mass surveillance and its integration with Machine Learning is hindered by ethical concerns, including possible breach of privacy and potential abuse [3]. Therefore, privacy-preserving surveillance and Machine Learning solutions are paramount to their ethical adoption and application [5].
The design of a vision-based social distance estimation and crowd monitoring system deals with the following challenges [4]: (1) geometry understanding, in terms of ground plane identification and homography estimation; (2) multiple people detection and localization; and (3) statistical/temporal characterization of social distance infractions, e.g., short-term violations are irrelevant. Currently, Machine Learning-based solutions identify social distance infringements using off-the-shelf person detection and tracking models [4]. In general, the models’ performance is conjoined with privacy; they yield high performance by carrying and processing person-specific information to develop robustness against occlusions and missing data [4]. In addition, they localize human subjects via bounding boxes that can be over-sized or incomplete, which results in significant distance estimation errors [6]. Therefore, we propose a privacy-preserving adaptive social distance estimation and crowd monitoring system that can be implemented on top of any existing CCTV infrastructure. The main contributions of the paper are as follows: (1) developing a robust person localization strategy using pose estimation techniques; (2) forming an adaptive smoothing and tracking paradigm to mitigate the problem of occlusions and missing data without compromising privacy; and (3) designing a real-time privacy-preserving social distance estimation and crowd monitoring solution with the potential to cover other application areas and tasks.
The rest of this paper is organized as follows: Section 2 overviews the related work and Section 3 describes our methodologies to build and evaluate the proposed system. Afterwards, we present and discuss the system outcome and performance in Section 4. Finally, Section 5 concludes the paper and suggests topics for future research.

2. Related Work

This section reviews state-of-the-art Machine Learning-based social distance estimation and monitoring solutions and summarizes their advantages and limitations. First, we analyze various person detection and localization strategies within the scope of social distancing. After that, we review different approaches to recognize social distancing abnormalities. Finally, we discuss the latest vision-based crowd monitoring techniques.

2.1. Person Detection and Localization

Several methods exist in the literature and fall under two main categories: object detectors and pose estimation techniques. The former identifies objects by drawing a bounding box around them, while the latter detects the human joints and connects them, resulting in pose estimates [7]. On the one hand, object detectors, such as YOLO models [8], are more general-purpose than pose estimation techniques, but their utility for identifying human subjects may require pruning and/or retraining. In addition, they do not offer further information about the detected objects, and their bounding boxes can be over-sized or incomplete [6]. On the other hand, pose estimators are specialized models; hence, they are more suitable to detect people in a scene. Specifically, they account for various body orientations/actions, such as standing, sitting, riding, and bending, when compared to object detectors [9]. Moreover, their ability to work in dense crowds was verified in [10,11], which is the very setting social distance monitoring deals with. Nonetheless, pose estimators are computationally more expensive than object detectors and their high entropy output requires further processing [7].
In [12], a visual analysis technique is proposed to quantify and monitor contact tracing for COVID-19. The detection and tracking of human subjects are performed by a YOLO architecture and a Simple Online and Real-time Tracking model, respectively. In addition, each subject is localized by its bounding box bottom mid-point. Similar detection and tracking approaches are proposed in [13,14], but the latter localizes the subjects by their bounding box centroid. The aforementioned solutions, although accurate, are not suitable because they carry person-specific information, which hinders their adoption for privacy-preserving applications. Nonetheless, privacy-preserving techniques are developed in [6,15] to monitor the evolution of social distancing patterns using CCTV cameras. The first work utilizes YOLO-v3 to detect pedestrians and the bounding box centroid for localization. Moreover, the second work explores two person detectors and one end-to-end model and provides evidence that the latter does not necessarily improve performance and that the bounding box bottom mid-point is the best choice for localization. Many variants of the YOLO model and other neural network architectures are used to detect humans in videos, and the bounding box centroid, top-left edge, or bottom center is used for localization [16,17,18,19,20,21,22,23,24,25]. Lastly, the social distancing problem is tackled in [26] using a pose estimation model to detect human subjects in videos and to infer their location using the predicted feet joints. The same approach is employed in [27] to measure inter-personal distances, but for still images. This has motivated us to use pose estimation techniques to detect people because they offer rich information about the localized subjects and mitigate the pitfalls of bounding boxes.

2.2. Anomaly Recognition

The scope of the social distancing problem defines an anomaly in a surveillance video by the presence of social distance violations [4]. This task requires estimating inter-personal distances among the localized subjects and comparing them to a predefined safety threshold [4]. In [13,15], the localized subjects’ pair-wise distances are calculated in the real-world coordinates and social distance violations are identified by a 2 m safety threshold; however, the problem of occlusion is not tackled in [15]. Furthermore, in [12,18,20,23,24], the localization results are morphed to the real-world top-view coordinates to calculate the pair-wise distances. The social distance violations are identified by 1, 1.8, and 2 m safety distances. However, the reported results focus on the person detection performance and illustrate infraction identification with only a few qualitative examples. Moreover, the systems developed in [18,20,23,24] do not mitigate the problem of occlusion nor missing detections. This is important because these are major limitations, and privacy-preserving tracking is an essential remedy [28]. In [21], a centroid tracking algorithm is used to resolve occlusions [29], pair-wise distances are computed, and violations are identified by a 1.8 m safety threshold. However, the performance is evaluated using a single video with only two people in it. This restricts generalizing the system’s efficacy and its applicability to real-life scenarios. Moreover, inter-personal distances are computed in [6] and the violations are identified at three safety levels: 1, 1.8, and 3.6 m. The study concludes that incomplete or over-sized bounding boxes introduce significant errors to the distance calculation; hence, selecting an appropriate person detector is paramount to the system’s feasibility and success. Finally, in [26], pair-wise distances are approximated through the estimated body joints and social distance infractions are identified by a 2 m threshold.
The reviewed literature shows a discrepancy in the safety distance selection for detecting social distance violations. This inconsistency hinders fair comparisons, but it has motivated us to test the proposed system’s applicability across a wide range of safety distances and to utilize various performance measures.

2.3. Crowd Monitoring

Crowd monitoring aims to attain a high-level understanding of crowd behavior by processing the scene in a global or local manner [30]. Macroscopic methods, such as crowd density, crowd counting, and flow estimation, neglect local features and focus on the scene as a whole [31,32]. In contrast, microscopic techniques start by detecting individual subjects and then group their statistics to summarize the crowd state [33]. These two approaches are complementary in terms of the efficiency/accuracy trade-off. In other words, macroscopic techniques are efficient in handling high-density crowds, while microscopic methods are accurate for sparse groups [31].
An approach to analyze the crowd and social distancing behavior from UAV-captured videos is proposed in [31]. Discrete density maps are generated to classify the crowd state in each aerial frame patch as dense, medium, sparse, or empty. In addition, a microscopic technique is employed to detect, track, and compute inter-personal distances. In [34], crowd counting and action recognition techniques are reviewed in the scope of social distancing. The study suggests that density-based approaches are preferred due to their inherent error suppression, in which the contribution of faulty counts or missing detections is insignificant to the long-term-averaged density map. Moreover, pedestrians’ spatial patterns are captured in [6] by long-term occupancy and crowd density maps. The former describes the spatial signature exerted by the subjects in the surveilled scene, while the latter encodes the spatial impression of social distance infractions [35]. Similarly, heatmaps are generated in [13,26,36] to represent the regions in which social distance violations are frequent. These studies demonstrate that short and long-term occupancy/crowd density maps are important to identify high-risk regions in the scene. In addition, they allow a quantification of the pedestrians’ compliance with social distancing guidelines [6].

3. Methodology

The proposed social distance estimation and crowd monitoring system is depicted in Figure 1. The model is comprised of the following stages:
  • Read a frame from the surveillance camera. This component can be adjusted to skip/drop frames in case of using high-resolution and/or high-frame-rate cameras.
  • Detect human subjects in the input frame and compute their position. The position of each detected subject is estimated as a single point.
  • Discard any localized positions outside a selected region of interest (ROI). The ROI is defined by the user beforehand and typically encloses the ground plane.
  • Transform the localized positions from the image–pixel coordinates to the real-world coordinates. This provides a top-view depiction of the subject’s position.
  • Smooth the noisy top-view positions and compensate for missing data due to occlusion with tracking.
  • Estimate the inter-personal distances among the detected subjects and the occupancy/crowd density maps.
  • Recognize social distance violations and identify congested or overcrowded regions in the scene.
  • Integrate the smoothed/tracked positions, estimated parameters, and detected anomalies with the video frame.
  • View the integrated video frame and generate a dynamic top-view map for the scene. This component allows adjusting the type and amount of appended information.
The proposed system design process is governed by the following requirements:
  • High accuracy and reliability in terms of robustness to noise and missing data.
  • Light weight for implementation and deployment.
  • Modularity to facilitate maintenance, upgrades, decentralization, and to avoid resource allocation bottlenecks.
  • Privacy-preserving by not carrying nor processing person-specific features.
  • Robustness against different vertical pose states and actions, e.g., standing, sitting, bowing, bending, walking, and cycling.
The remaining subsections discuss and detail each stage in the proposed system. We use an example video frame from the EPFL-MPV dataset to illustrate the outcome of each stage—see Section 4.1 for more details on the dataset.

3.1. Person Detection and Localization

Given an input video frame, we detect and localize human subjects using a pose estimation technique, because object detection models can yield incomplete or over-sized bounding boxes and they do not offer rich information [6].

3.1.1. Detection

We utilize OpenPose to detect and estimate human poses in the input video frame. Specifically, OpenPose estimates and connects the body joints using part affinity fields [37]. Let $N$ and $M$ be the total number of true and detected subjects in the video frame, respectively, and $\{J_m\}_{m \in [1,M]}$ be the set of estimated joints for all detected subjects, where $J_m = \{j_m^j\}_{j \in [1,25]}$, $j_m^j = (u_m^j, v_m^j)$, and $u_m^j$ and $v_m^j$ define the horizontal and vertical coordinates of joint $j$, respectively—see Figure 2 for an example.
Ideally, OpenPose yields 25 joints for each detected subject, but we recognize that some might not be detected due to various reasons. This results in some empty entries in J m , but does not change the indexing scheme. Moreover, to model a realistic scenario, we assume that N and M are not necessarily equal, i.e., the number of detected subjects can be less, equal, or more than the true number of people in a frame. Finally, note that we select OpenPose due to its simplicity and availability, but it can be replaced with any other pose estimation model given the same body joints indexing scheme.
Figure 3 shows the pose estimation outcome for an example input frame with 5 people moving freely in a room. OpenPose yields five detections, shown in gray, red, orange, green, and blue, with 13, 22, 20, 17, and 8 connected joints, respectively. The gray and blue poses are incomplete because of partial occlusion and missing data.
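To make the joint bookkeeping concrete, the following Python sketch organizes raw keypoints into the per-subject joint sets $J_m$; it assumes the pose estimator’s output has already been exported as an $M \times 25 \times 3$ array of $(u, v, \text{confidence})$ triplets with undetected joints reported as zeros, and the function and variable names are illustrative rather than part of the released implementation.

import numpy as np

def build_joint_sets(keypoints, conf_threshold=0.1):
    # keypoints: (M, 25, 3) array of (u, v, confidence) per joint; undetected joints are all zeros.
    # Returns a list of dictionaries mapping the 1-based joint index to (u, v); missing joints are absent.
    subjects = []
    for person in keypoints:
        joints = {}
        for j, (u, v, c) in enumerate(person, start=1):   # 1-based joint indexing as in the text
            if c > conf_threshold:                        # keep only confidently detected joints
                joints[j] = (float(u), float(v))
        subjects.append(joints)
    return subjects

# Toy usage with two synthetic detections (random values, for illustration only).
J = build_joint_sets(np.random.rand(2, 25, 3))
print(len(J), "subjects;", len(J[0]), "joints kept for the first one")

Missing joints simply do not appear in the corresponding dictionary, which mirrors the empty entries in $J_m$ without altering the indexing scheme.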

3.1.2. Localization

We select the midpoint of the feet of each subject as the anchor to localize their positions, also known as the ground position. The selected point offers reliable estimation because: (1) it is independent of the subject’s height, width, and orientation; (2) it lies on the ground; thus, homography transformation is possible; (3) it has a clear definition when compared to bounding boxes; and (4) it carries no person-specific information; hence, privacy is preserved.
In [26], given the non-empty set of feet joints $\{j_m^{12}, j_m^{15}, j_m^{20}, j_m^{21}, j_m^{22}, j_m^{23}, j_m^{24}, j_m^{25}\}$ and the condition $\#J_m \geq 13$, the ground position of subject $m$ is estimated as follows:

$$u_m = \begin{cases} \dfrac{\sum\{u_m^{1}, u_m^{2}, u_m^{9}\}}{\#\{u_m^{1}, u_m^{2}, u_m^{9}\}} & : u_m^{1}, u_m^{2}, u_m^{9} \neq \emptyset \\[2mm] \dfrac{\min(\mathbf{u}_m) + \max(\mathbf{u}_m)}{2} & : \text{otherwise}, \end{cases} \qquad (1)$$

$$v_m = \frac{\sum\{v_m^{12}, v_m^{15}, v_m^{20}, v_m^{21}, v_m^{22}, v_m^{23}, v_m^{24}, v_m^{25}\}}{\#\{v_m^{12}, v_m^{15}, v_m^{20}, v_m^{21}, v_m^{22}, v_m^{23}, v_m^{24}, v_m^{25}\}}, \qquad (2)$$

where $\mathbf{u}_m = \{u_m^j\}$, $j \in [1,25]$, and $\#$ denotes the number of non-empty elements in the set. We call this approach the basic ground position estimation and argue that it is inadequate because the constraints are quite restrictive. For instance, Equation (1) assumes human subjects with perfect vertical orientation, which may not be the case. In addition, in Equation (2), the sole reliance on detecting any foot joint and the required minimum number of joints limit its applicability in real-life scenarios. In fact, this approach estimates the ground position only when information is abundant. Therefore, we propose a localization strategy that eliminates the basic position pitfalls and relaxes its restrictions and constraints.
Algorithm 1 explains the proposed localization strategy. First, we eliminate the conditions mandated by the basic approach and expand the search space to include the subject’s feet, knees, hips, and torso. In particular, for the horizontal coordinate $u_m$, we leverage the joints’ left/right symmetry by averaging the horizontal positions of two opposing joints. For instance, $u_2$ and $u_3$ in Figure 2 are computed by the 1st case (line 2), while $u_1$ is found by the 7th case using the hip joints, i.e., $u_1^{10}$ and $u_1^{13}$ (line 8). Moreover, for the vertical coordinate $v_m$, we relax the requirement for detecting the feet joints by exploiting average human skeletal characteristics. More specifically, we use the ratio between the torso and lower body lengths to infer the ground position’s vertical coordinate [26], i.e., $(0.85/0.6)$ in line 15. Finally, regardless of the approach, we discard any estimated positions outside the user-defined ROI—see Figure 2.
Algorithm 1 The proposed localization strategy.
Input: $u_m^j$ and $v_m^j$ where $j \in [1, 25]$.
Output: $u_m$ and $v_m$.
Initialization: left/right foot horizontal coordinate sets $\alpha$/$\beta$ and the feet vertical coordinate set $\gamma$:
$\alpha = \{u_m^{15}, u_m^{20}, u_m^{21}, u_m^{22}\}$, $\beta = \{u_m^{12}, u_m^{23}, u_m^{24}, u_m^{25}\}$, $\gamma = \{v_m^{12}, v_m^{15}, v_m^{20}, v_m^{21}, v_m^{22}, v_m^{23}, v_m^{24}, v_m^{25}\}$
1: switch true do
2:    case $\alpha \neq \emptyset \wedge \beta \neq \emptyset$ then $u_m = \frac{1}{2}\left(\frac{\sum \alpha}{\# \alpha} + \frac{\sum \beta}{\# \beta}\right)$ and $F_m^u = 1$.    ▹ Both feet joints are available
3:    case $\alpha \neq \emptyset \wedge u_m^{11} \neq \emptyset$ then $u_m = \frac{1}{2}\left(\frac{\sum \alpha}{\# \alpha} + u_m^{11}\right)$ and $F_m^u = 2$.    ▹ Left foot and right knee joints are available
4:    case $u_m^{14} \neq \emptyset \wedge \beta \neq \emptyset$ then $u_m = \frac{1}{2}\left(u_m^{14} + \frac{\sum \beta}{\# \beta}\right)$ and $F_m^u = 2$.    ▹ Left knee and right foot joints are available
5:    case $u_m^{11} \neq \emptyset \wedge u_m^{14} \neq \emptyset$ then $u_m = \frac{1}{2}\left(u_m^{11} + u_m^{14}\right)$ and $F_m^u = 2$.    ▹ Both knees’ joints are available
6:    case $u_m^{10} \neq \emptyset \wedge u_m^{14} \neq \emptyset$ then $u_m = \frac{1}{2}\left(u_m^{10} + u_m^{14}\right)$ and $F_m^u = 2$.    ▹ Right hip and left knee joints are available
7:    case $u_m^{11} \neq \emptyset \wedge u_m^{13} \neq \emptyset$ then $u_m = \frac{1}{2}\left(u_m^{11} + u_m^{13}\right)$ and $F_m^u = 2$.    ▹ Right knee and left hip joints are available
8:    case $u_m^{10} \neq \emptyset \wedge u_m^{13} \neq \emptyset$ then $u_m = \frac{1}{2}\left(u_m^{10} + u_m^{13}\right)$ and $F_m^u = 2$.    ▹ Hips’ joints are available
9:    case $u_m^{2} \neq \emptyset \wedge u_m^{9} \neq \emptyset$ then $u_m = \frac{1}{2}\left(u_m^{2} + u_m^{9}\right)$ and $F_m^u = 2$.    ▹ Torso’s joints are available
10:   case $\alpha \neq \emptyset \vee \beta \neq \emptyset$ then $u_m = \frac{\sum (\alpha \cup \beta)}{\# (\alpha \cup \beta)}$ and $F_m^u = 2$.    ▹ Consider any available feet joints
11:   otherwise $u_m = \emptyset$ and $F_m^u = 0$.
12: end switch
13: switch true do
14:   case $\gamma \neq \emptyset$ then $v_m = \frac{\sum \gamma}{\# \gamma}$ and $F_m^v = 1$.    ▹ Consider any available feet joints
15:   case $v_m^{2} \neq \emptyset \wedge v_m^{9} \neq \emptyset$ then $v_m = v_m^{9} + (0.85/0.6)\left|v_m^{2} - v_m^{9}\right|$ and $F_m^v = 2$.    ▹ Torso’s joints are available
16:   otherwise $v_m = \emptyset$ and $F_m^v = 0$.
17: end switch
The proposed localization strategy is driven by the argument that noisy measurements with known error states are more valuable than no measurements at all. In other words, if we predict the subject’s ground position and supplement it with the state of available information, we can append each prediction with a flag describing its integrity, or confidence level. In this work, we coin this concept by forming the error state flags $F_m^u$ and $F_m^v$ in the following manner:
  • $F_m^u = \emptyset$ ($F_m^v = \emptyset$): subject is not detected.
  • $F_m^u = 0$ ($F_m^v = 0$): subject is detected but $u_m$ ($v_m$) is not available, regardless of the reason.
  • $F_m^u = 1$ ($F_m^v = 1$): subject is detected and $u_m$ ($v_m$) is directly estimated from the feet joints.
  • $F_m^u = 2$ ($F_m^v = 2$): subject is detected and $u_m$ ($v_m$) is predicted using other joints.
Similarly, an overall localization error flag is constructed for each detected subject m as follows:
$$F_m = \begin{cases} \emptyset & : F_m^u = F_m^v = \emptyset \\ 0 & : F_m^u = 0 \,\vee\, F_m^v = 0 \\ 1 & : F_m^u = 1 \,\wedge\, F_m^v = 1 \\ 2 & : F_m^u = 2 \,\vee\, F_m^v = 2. \end{cases} \qquad (3)$$
Figure 4a demonstrates the basic and proposed localization results using the estimated poses in Figure 3. In addition, it shows the selected ROI in cyan, which encloses the floor plane in the scene. By examining Figure 4a, one notes that both localization strategies yield valid estimates when supplied with a sufficient number of connected joints. However, the proposed approach is more accurate since it does not assume perfect vertical orientation. Moreover, it mitigates partial occlusion by inferring the position’s vertical coordinate using the torso to lower body lengths ratio—see the estimated position in gray. Nonetheless, both strategies are limited, because they cannot resolve the ground position when information is scarce or completely missing. For instance, they cannot localize the fifth subject, the one with the blue pose in Figure 3, because only a few of its joints are available.
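For illustration, the following Python sketch condenses the case logic of Algorithm 1; it assumes each subject’s joints are stored in a dictionary keyed by the 1-based joint indices of Figure 2 (2: neck, 9: mid-hip, 10/13: hips, 11/14: knees, 12/15 and 20–25: feet), and it is a simplified sketch rather than the released implementation.

LEFT_FOOT, RIGHT_FOOT = (15, 20, 21, 22), (12, 23, 24, 25)
FEET = LEFT_FOOT + RIGHT_FOOT
FALLBACK_PAIRS = [(LEFT_FOOT, (11,)), ((14,), RIGHT_FOOT), ((11,), (14,)),
                  ((10,), (14,)), ((11,), (13,)), ((10,), (13,)), ((2,), (9,))]

def mean_coord(joints, ids, axis):
    # Mean of the available coordinates (axis 0 for u, 1 for v) among the requested joint ids.
    vals = [joints[j][axis] for j in ids if j in joints]
    return sum(vals) / len(vals) if vals else None

def localize(joints):
    # Returns (u_m, v_m, F_u, F_v); None marks an unresolved coordinate (the empty-set flag).
    left, right = mean_coord(joints, LEFT_FOOT, 0), mean_coord(joints, RIGHT_FOOT, 0)
    if left is not None and right is not None:              # case 1: both feet are available
        u, fu = 0.5 * (left + right), 1
    else:
        for ids_a, ids_b in FALLBACK_PAIRS:                  # cases 2-9: knees, hips, torso
            a, b = mean_coord(joints, ids_a, 0), mean_coord(joints, ids_b, 0)
            if a is not None and b is not None:
                u, fu = 0.5 * (a + b), 2
                break
        else:
            any_feet = mean_coord(joints, FEET, 0)           # case 10: any available foot joint
            u, fu = (any_feet, 2) if any_feet is not None else (None, 0)
    v = mean_coord(joints, FEET, 1)                          # vertical: any foot joint first
    if v is not None:
        fv = 1
    elif 2 in joints and 9 in joints:                        # extrapolate from the torso (line 15)
        v, fv = joints[9][1] + (0.85 / 0.6) * abs(joints[2][1] - joints[9][1]), 2
    else:
        v, fv = None, 0
    return u, v, fu, fv

# Toy usage: only the neck, mid-hip, and both hip joints are detected.
print(localize({2: (410.0, 200.0), 9: (415.0, 330.0), 10: (400.0, 335.0), 13: (430.0, 335.0)}))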

3.2. Top-View Transformation

Let us assume that the surveillance camera is placed at height $h$ and oriented with pan and tilt angles $\theta_p$ and $\theta_t$, respectively. The transformation from a three-dimensional position in the real-world coordinates to its corresponding two-dimensional (2D) position in the image–pixel coordinates, $[x, y, z] \rightarrow [u, v]$, is expressed as follows:
$$\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \frac{1}{\alpha_s}\, \mathbf{K} \begin{bmatrix} \mathbf{R} & \mathbf{T}_0 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}, \qquad (4)$$

where $\alpha_s$ is the image-to-real distance scale, $\mathbf{K} \in \mathbb{R}^{3 \times 3}$ is the camera intrinsic parameter matrix which maps the camera coordinates to the image coordinates, $[\mathbf{R} \;\; \mathbf{T}_0]$ maps the real-world coordinates to the camera coordinates, $\mathbf{R} \in \mathbb{R}^{3 \times 3}$ is a rotation matrix that compensates for the camera orientation ($\theta_p$ and $\theta_t$), and $\mathbf{T}_0 \in \mathbb{R}^{3 \times 1}$ is a translation vector which deals with the camera position and height. Since we are only concerned with transforming the subjects’ ground positions from the image coordinates $[u_m, v_m]$ to the real-world ground plane $[x_m, y_m]$, Equation (4) simplifies to:

$$\begin{bmatrix} x_m \\ y_m \\ 1 \end{bmatrix} = \alpha_s\, \mathbf{H}^{-1} \begin{bmatrix} u_m \\ v_m \\ 1 \end{bmatrix}, \qquad (5)$$

where $\mathbf{H} \in \mathbb{R}^{3 \times 3}$ is the camera homography matrix. This transformation results in a top-view depiction of the subjects’ real-world positions—see Figure 4e.
In this work, we assume the homography matrix $\mathbf{H}$ and the image-to-real distance scale $\alpha_s$ to be known for simplicity; however, they can be obtained by GPS and accelerometers [38,39,40], determined by calibration [41,42], inferred from the computed poses [26,27], or estimated by a four-point perspective transformation [43].
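A minimal sketch of the top-view mapping in Equation (5) is given below; it assumes that $\alpha_s$ acts as a simple multiplicative scale applied after normalizing the homogeneous coordinate, which may differ from a specific calibration convention.

import numpy as np

def to_top_view(points_uv, H, alpha_s):
    # points_uv: (M, 2) image ground positions [u_m, v_m]; H: 3x3 homography; alpha_s: image-to-real scale.
    uv1 = np.hstack([points_uv, np.ones((len(points_uv), 1))])   # homogeneous image coordinates
    xyw = (np.linalg.inv(H) @ uv1.T).T                            # apply the inverse homography
    xy = xyw[:, :2] / xyw[:, 2:3]                                 # normalize the homogeneous coordinate
    return alpha_s * xy                                           # metric scaling (assumed multiplicative)

# Toy usage: an identity homography and unit scale leave the positions unchanged.
print(to_top_view(np.array([[320.0, 400.0]]), np.eye(3), 1.0))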

3.3. Smoothing and Tracking

The top-view transformed ground positions are noisy and suffer from missing values. The former is due to uncertainties and errors in the localization technique, while the latter comes from occlusions. In this section, we formulate the temporal evolution of the estimated positions by a constant velocity model. Afterwards, we compensate for localization errors and missing measurements by a linear Kalman filter (KF) and a global nearest neighbor (GNN) tracker.

3.3.1. State and Measurement Models

Let $\mathbf{x}_{m,t} = [x_{m,t}, \dot{x}_{m,t}, y_{m,t}, \dot{y}_{m,t}]^T$ be the state vector of subject $m$ that defines its ground position and velocity at frame $t$. Assuming constant velocity, $\mathbf{x}_{m,t}$ and its measured counterpart $\mathbf{y}_{m,t}$ are expressed as follows [44]:

$$\mathbf{x}_{m,t} = \mathbf{F}\, \mathbf{x}_{m,t-1} + \boldsymbol{\omega}_{m,t-1}, \qquad (6)$$

$$\mathbf{y}_{m,t} = \mathbf{H}\, \mathbf{x}_{m,t} + \boldsymbol{\nu}_{m,t}, \qquad (7)$$

where $\mathbf{F}$ is a constant state transition matrix from $\mathbf{x}_{m,t-1}$ to $\mathbf{x}_{m,t}$, $\mathbf{H}$ is a constant state-to-measurement matrix, $\boldsymbol{\omega}_{m,t} \sim \mathcal{N}(\mathbf{0}, \mathbf{Q}_{m,t})$, and $\boldsymbol{\nu}_{m,t} \sim \mathcal{N}(\mathbf{0}, \mathbf{R}_{m,t})$.

3.3.2. The Linear Kalman Filter

The KF offers an optimal estimate for $\mathbf{x}_{m,t}$ given the measurement $\mathbf{y}_{m,t}$ by following the process depicted in Figure 5. First, given a previous (or initial) posterior estimate $\hat{\mathbf{x}}_{m,t-1}$ with error covariance $\mathbf{P}_{m,t-1}$, the KF predicts a prior estimate $\tilde{\mathbf{x}}_{m,t}$ and computes its error covariance $\tilde{\mathbf{P}}_{m,t}$. Afterwards, it calculates the posterior estimate $\hat{\mathbf{x}}_{m,t}$ with error covariance $\mathbf{P}_{m,t}$ using a Kalman filter gain $\mathbf{K}_{m,t}$. Finally, the process repeats using $\hat{\mathbf{x}}_{m,t}$ and $\mathbf{P}_{m,t}$ as inputs to the state prediction stage.
By examining the Kalman gain equation in the measurement correction stage in Figure 5, one notes that increasing/decreasing $\mathbf{R}_{m,t}$ decreases/increases the reliance of $\hat{\mathbf{x}}_{m,t}$ on the measurement $\mathbf{y}_{m,t}$. In this work, we control this mechanism by adjusting the variance $\sigma_{m,t}^2$ in $\mathbf{R}_{m,t}$ according to the overall localization error flag $F_{m,t}$, i.e., [45]:

$$\sigma_{m,t}^2 = \begin{cases} \sigma_1^2 & : F_{m,t} = 0 \\ \sigma_2^2 & : F_{m,t} = 1 \\ \sigma_3^2 & : F_{m,t} = 2. \end{cases} \qquad (8)$$
In other words, the measurement error variance is adapted to smooth the estimated positions according to their appended quality. Consequently, the KF reduces the localization noise and can offer posterior estimates when the measurement is missing [45]. Nevertheless, the KF equations require knowing the correspondence between the detections/predictions at consecutive frames. This is generally tackled via multiple object tracking (MOT) approaches such as the global nearest neighbor (GNN) algorithm.
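The following Python sketch illustrates one predict/correct cycle of the adaptive KF; the constant-velocity matrices follow Equations (6) and (7) for an assumed 25 fps camera, and the three variance values are illustrative placeholders for $\sigma_1^2$, $\sigma_2^2$, and $\sigma_3^2$ rather than the tuned parameters reported in Table 1.

import numpy as np

def make_cv_model(dt=0.04):
    # Constant-velocity model (Equations (6)-(7)) for the state x = [x, x_dot, y, y_dot]^T.
    F = np.array([[1, dt, 0, 0],
                  [0, 1,  0, 0],
                  [0, 0,  1, dt],
                  [0, 0,  0, 1]], dtype=float)
    H = np.array([[1, 0, 0, 0],
                  [0, 0, 1, 0]], dtype=float)
    return F, H

def kalman_step(x, P, y, flag, F, H, Q, variances=(4.0, 0.25, 1.0)):
    # One predict/correct cycle; `variances` stand in for (sigma_1^2, sigma_2^2, sigma_3^2) in Equation (8).
    x, P = F @ x, F @ P @ F.T + Q                       # state prediction
    if y is None:                                       # missing measurement: keep the prediction
        return x, P
    R = variances[flag] * np.eye(2)                     # flag-dependent measurement noise
    S = H @ P @ H.T + R                                 # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)                      # Kalman gain
    x = x + K @ (y - H @ x)                             # measurement correction
    P = (np.eye(4) - K @ H) @ P
    return x, P

# Toy usage: one step with a feet-based measurement (flag 1).
F, H = make_cv_model()
x, P = kalman_step(np.zeros(4), np.eye(4), np.array([1.0, 2.0]), 1, F, H, Q=0.01 * np.eye(4))
print(x)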

3.3.3. Global Nearest Neighbor Tracking

GNN is a real-time light-weight MOT solution that tracks objects by assigning detections/predictions to tracks, and by maintaining its track record [46]. It solves the assignment task by minimizing the following cost function:
$$\min_{\alpha_{m,q}} \sum_{m=1}^{M} \sum_{q=1}^{Q} C_{m,q}\, \alpha_{m,q} \quad \text{s.t.} \quad \sum_{m=1}^{M} \alpha_{m,q} = 1 \;\; \forall q \quad \text{and} \quad \sum_{q=1}^{Q} \alpha_{m,q} = 1 \;\; \forall m, \qquad (9)$$

where $M$ is the number of detected subjects, $Q$ is the number of maintained (or initiated) tracks, $C_{m,q}$ is the cost of assigning detection $m$ to track $q$, and $\alpha_{m,q} \in \{0, 1\}$ such that if detection $m$ is assigned to track $q$, then $\alpha_{m,q} = 1$; otherwise $\alpha_{m,q} = 0$. The constraints in Equation (9) ensure that each detection can be assigned to only one track and vice versa.
The GNN defines the assignment cost $C_{m,q}$ in Equation (9) as follows:

$$C_{m,q} = D\!\left(\mathbf{y}_{m,t}, \hat{\mathbf{y}}_{q,t}\right) + \log\left|\mathbf{H} \mathbf{P}_{q,t} \mathbf{H}^T + \mathbf{R}_{q,t}\right|, \qquad (10)$$

$$D^2\!\left(\mathbf{y}_{m,t}, \hat{\mathbf{y}}_{q,t}\right) = \left(\mathbf{y}_{m,t} - \hat{\mathbf{y}}_{q,t}\right)^T \left(\mathbf{H} \mathbf{P}_{m,t} \mathbf{H}^T + \mathbf{R}_{m,t}\right)^{-1} \left(\mathbf{y}_{m,t} - \hat{\mathbf{y}}_{q,t}\right) \leq \gamma_g, \qquad (11)$$

where $\hat{\mathbf{y}}_{q,t} = \mathbf{H} \hat{\mathbf{x}}_{q,t}$ is the estimated measurement with error covariance $\mathbf{H} \mathbf{P}_{q,t} \mathbf{H}^T + \mathbf{R}_{q,t}$, $D\!\left(\mathbf{y}_{m,t}, \hat{\mathbf{y}}_{q,t}\right)$ is the Mahalanobis distance between $\mathbf{y}_{m,t}$ and $\hat{\mathbf{y}}_{q,t}$, $\log|\mathbf{X}|$ is the natural logarithm of the determinant of $\mathbf{X}$, and $\gamma_g$ is a gating threshold that reduces unnecessary computations; it selects detections that are close to predictions. In this work, we solve the GNN assignment problem in Equation (9) using the optimal Munkres algorithm [47,48].
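As an illustration, the sketch below builds the gated cost matrix of Equations (10) and (11) and solves the assignment with SciPy’s linear_sum_assignment, a Munkres-type solver standing in for the implementation referenced in [47,48]; the gating value is an assumed chi-square threshold, not the one used in the experiments.

import numpy as np
from scipy.optimize import linear_sum_assignment

def gnn_assign(detections, tracks, gate=9.21):
    # detections: (M, 2) measurements y_{m,t}; tracks: list of (y_hat, S) with S = H P H^T + R.
    # gate: threshold on the squared Mahalanobis distance (Equation (11)); 9.21 ~ 99% gate for 2 DOF.
    M, Q = len(detections), len(tracks)
    cost = np.full((M, Q), 1e6)                          # a large cost marks gated-out pairs
    for q, (y_hat, S) in enumerate(tracks):
        S_inv, log_det = np.linalg.inv(S), np.log(np.linalg.det(S))
        for m, y in enumerate(detections):
            d2 = (y - y_hat) @ S_inv @ (y - y_hat)       # squared Mahalanobis distance
            if d2 <= gate:
                cost[m, q] = np.sqrt(d2) + log_det       # assignment cost of Equation (10)
    rows, cols = linear_sum_assignment(cost)             # optimal (Munkres-type) assignment
    return [(m, q) for m, q in zip(rows, cols) if cost[m, q] < 1e6]

# Toy usage: two detections matched to two tracks with unit innovation covariance.
tracks = [(np.array([0.0, 0.0]), np.eye(2)), (np.array([5.0, 5.0]), np.eye(2))]
print(gnn_assign(np.array([[4.8, 5.1], [0.2, -0.1]]), tracks))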
The GNN maintains its track record as follows [46]:
  • Initiation: create new tentative tracks for unassigned detections; $M > Q$.
  • Promotion: confirm a tentative track if its likelihood of being true is greater than $\gamma_c$.
  • Demotion: demote a confirmed track to tentative if the subject leaves the ROI.
  • Deletion: delete a confirmed track if its maximum likelihood decreases by $\gamma_d$.
Figure 4b and Figure 4f present the smoothed/tracked ground positions in the image–pixel and real-world coordinates, respectively. In addition, we overlay the plots with the original localization results in Figure 4a and Figure 4e to visualize the role of smoothing and tracking. By examining the results, one notes that the KF corrects the predicted position in gray and makes it closer to the subject’s actual location. In addition, the fifth subject’s unresolved position, because of missing information, is now compensated for by GNN—see the predicted position in blue. In summary, the smoothing and tracking stage lowers the localization error through the KF and corrects for the missing measurements by GNN. Note that this stage preserves privacy and it is intended for data correction rather than conventional tracking; hence, we are not concerned with the re-identification problem nor the subjects’ particular identities.

3.4. Parameter Estimation

The crowd state, in terms of social distancing behavior and congestion, is estimated by computing the inter-personal distances and the occupancy/crowd density maps.

3.4.1. Inter-Personal Distance

The instantaneous pair-wise Euclidean distance between subjects i and j is expressed as:
$$d_{i,j,t} = \sqrt{\left(x_{i,t} - x_{j,t}\right)^2 + \left(y_{i,t} - y_{j,t}\right)^2}. \qquad (12)$$
Given a social safety distance r, the instantaneous number of violations is computed by:
$$V_t = \sum_{i=1}^{\hat{N}_t} \sum_{j=i+1}^{\hat{N}_t} v_{i,j,t}, \qquad (13)$$

$$v_{i,j,t} = \begin{cases} 1 & : d_{i,j,t} \leq r \\ 0 & : d_{i,j,t} > r, \end{cases} \qquad (14)$$

where $\hat{N}_t$ is the number of estimated/tracked people in frame $t$ and $V_t$ counts the number of subjects that are $r$ or less apart from each other—see Figure 4d and Figure 4h.
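A minimal sketch of Equations (12)–(14) is shown below; the per-subject violation mask it returns corresponds to the binary mask $\psi_{i,t}$ used later in Equation (17).

import numpy as np

def count_violations(positions, r=2.0):
    # positions: (N_hat_t, 2) real-world ground positions; r: social safety distance in meters.
    diff = positions[:, None, :] - positions[None, :, :]     # pair-wise coordinate differences
    d = np.sqrt((diff ** 2).sum(-1))                         # Equation (12)
    violating_pairs = np.triu(d <= r, k=1)                   # keep only i < j pairs
    V_t = int(violating_pairs.sum())                         # Equation (13)
    psi = (violating_pairs | violating_pairs.T).any(axis=1)  # subjects involved in any violation
    return V_t, psi

# Toy usage: three people, two of them 1.5 m apart.
print(count_violations(np.array([[0.0, 0.0], [1.5, 0.0], [10.0, 10.0]])))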

3.4.2. Occupancy and Crowd Density Maps

The occupancy density map (ODM) encodes the spatial patterns exerted by the subjects in the surveilled environment [6]. It is formed by summing and averaging Gaussian functions centered at the subjects’ ground positions, i.e.:
$$O(x, y) = \frac{1}{T} \int_{1}^{T} \frac{1}{\hat{N}_t} \sum_{i=1}^{\hat{N}_t} G\!\left(x - x_{i,t},\, y - y_{i,t}\right) dt, \qquad (15)$$

$$G(x, y) = 2 \pi \delta^2 \exp\!\left(-\frac{x^2 + y^2}{\delta^2 / 2}\right), \qquad (16)$$

where $O(x, y)$ is the averaged ODM, $T$ is the current frame number (or total number of frames), $G(x, y)$ is a 2D symmetric Gaussian function, and $\delta$ controls the spatial resolution of the map. Similarly, the crowd density map (CDM) offers a spatial signature for the social distance infringements in the scene [35]. It is formulated by imposing the safety distance constraint as follows:

$$C(x, y) = \frac{1}{T} \int_{1}^{T} \frac{1}{\hat{N}_t} \sum_{i=1}^{\hat{N}_t} \psi_{i,t}\, G\!\left(x - x_{i,t},\, y - y_{i,t}\right) dt, \qquad (17)$$

where $C(x, y)$ is the averaged CDM and $\psi_{i,t}$ is a binary mask that is 1 or 0 if subject $i$ violates or follows the social safety distance $r$, respectively.
Figure 4c and Figure 4g show the instantaneous ODM in the image–pixel and real-world coordinates, respectively. In addition, we superimpose the smoothed/tracked localization results and the computed inter-personal distances in both domains. Moreover, Figure 4d and Figure 4h illustrate the instantaneous CDM in the image–pixel and real-world coordinates, respectively.
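The following sketch evaluates discrete-time versions of Equations (15)–(17) on a regular top-view grid; the grid definition and the frame-indexed position and mask lists are assumptions made for illustration.

import numpy as np

def density_maps(positions_per_frame, masks_per_frame, grid_x, grid_y, delta=0.5):
    # positions_per_frame: list of (N_t, 2) ground-position arrays, one per frame.
    # masks_per_frame: matching boolean arrays flagging violating subjects (psi_{i,t}).
    X, Y = np.meshgrid(grid_x, grid_y)
    odm, cdm = np.zeros_like(X), np.zeros_like(X)
    for positions, psi in zip(positions_per_frame, masks_per_frame):
        if len(positions) == 0:
            continue
        frame_odm, frame_cdm = np.zeros_like(X), np.zeros_like(X)
        for (x_i, y_i), violating in zip(positions, psi):
            G = 2 * np.pi * delta**2 * np.exp(-((X - x_i)**2 + (Y - y_i)**2) / (delta**2 / 2))
            frame_odm += G
            if violating:
                frame_cdm += G
        odm += frame_odm / len(positions)                    # the 1/N_hat_t normalization
        cdm += frame_cdm / len(positions)
    T = len(positions_per_frame)
    return odm / T, cdm / T                                  # the 1/T averaging

# Toy usage: a single frame with two people, one of them violating the safety distance.
gx = gy = np.linspace(0, 10, 101)
O, C = density_maps([np.array([[2.0, 2.0], [7.0, 7.0]])], [np.array([True, False])], gx, gy)
print(O.max(), C.max())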

3.5. Anomaly Recognition

We define an irregularity in the surveillance video by the presence of social distance infractions and overcrowded, or congested, regions. We treat the first task as a classification problem by forming the binary label $S_t$ as follows:

$$S_t = \begin{cases} 1 & : V_t > 0 \\ 0 & : \text{otherwise}. \end{cases} \qquad (18)$$

Moreover, we consider the second task as a segmentation problem where we identify overcrowded areas in the scene by thresholding the averaged CDM as follows:

$$R(x, y) = \begin{cases} 1 & : C(x, y) \geq \gamma_m \\ 0 & : \text{otherwise}, \end{cases} \qquad (19)$$

where $\gamma_m$ is selected to keep 50% of the energy in $C(x, y)$.
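One possible way to select $\gamma_m$ is sketched below; it interprets the 50% energy criterion as the cumulative sum of the map values (if energy is instead defined over squared values, the cumulative sum would simply be taken over $C(x, y)^2$).

import numpy as np

def energy_threshold(cdm, energy_fraction=0.5):
    # Sort map values in descending order and pick the smallest value whose cumulative
    # sum reaches the requested fraction of the total; then threshold as in Equation (19).
    values = np.sort(cdm.ravel())[::-1]
    cumulative = np.cumsum(values)
    idx = np.searchsorted(cumulative, energy_fraction * cumulative[-1])
    gamma_m = values[min(idx, len(values) - 1)]
    return cdm >= gamma_m, gamma_m

# Toy usage on a random map.
mask, gamma = energy_threshold(np.random.rand(64, 64))
print(gamma, mask.mean())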

3.6. Performance Evaluation

The social distance estimation and crowd monitoring system is evaluated in terms of its ability to detect human subjects, localize their positions, recognize social distance violations, estimate crowd density maps, and to identify overcrowded regions in surveillance videos.
Let $N_t$ and $\hat{N}_t$ be the true and estimated/tracked number of people in frame $t$. The averaged person detection rate (PDR) and localization relative error are calculated as follows:

$$\text{PDR} = 1 - \frac{1}{T} \int_{1}^{T} \frac{\left|N_t - \hat{N}_t\right|}{N_t + 1}\, dt, \qquad (20)$$
$$\text{Error} = \frac{1}{T} \int_{1}^{T} \left( \sum_{i} \frac{\sqrt{\left(x_{i,t} - \hat{x}_{i,t}\right)^2 + \left(y_{i,t} - \hat{y}_{i,t}\right)^2}}{\sqrt{x_{i,t}^2 + y_{i,t}^2}} + \eta_t \right) dt, \qquad (21)$$
$$\eta_t = \begin{cases} N_t & : \hat{N}_t = 0 \\ \hat{N}_t & : N_t = 0 \\ \left|N_t - \hat{N}_t\right| / N_t & : \text{otherwise}, \end{cases} \qquad (22)$$

where $(x_{i,t}, y_{i,t})$ and $(\hat{x}_{i,t}, \hat{y}_{i,t})$ are the true and estimated ground coordinates for subject $i$ at frame $t$, respectively. We associate the estimated positions with their true counterparts using the optimal Munkres algorithm [47,48]. Moreover, given the true and predicted binary outputs $S_t$ and $\hat{S}_t$, respectively, we assess the detection of social distance violations by accuracy, precision, recall, and the F1-score, i.e.:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad (23)$$

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad (24)$$

$$\text{Recall} = \frac{TP}{TP + FN}, \qquad (25)$$

$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}, \qquad (26)$$
where $TP$, $TN$, $FP$, and $FN$ are true positives, true negatives, false positives, and false negatives, respectively. Furthermore, we complement the former evaluations by computing the averaged violations count rate (VCR), i.e.:

$$\text{VCR} = 1 - \frac{1}{T} \int_{1}^{T} \frac{\left|V_t - \hat{V}_t\right|}{V_t + 1}\, dt, \qquad (27)$$

where $V_t$ and $\hat{V}_t$ are the true and predicted counts, respectively—see Equation (13).
Finally, we evaluate the quality of the averaged CDM by Pearson’s correlation coefficient (CORR) and assess the identified overcrowded regions using the intersection over union (IOU), i.e.:
$$\text{IOU} = \frac{\iint R(x, y) \wedge \hat{R}(x, y)\, dx\, dy}{\iint R(x, y) \vee \hat{R}(x, y)\, dx\, dy}, \qquad (28)$$

where $R(x, y)$ and $\hat{R}(x, y)$ are the true and predicted thresholded averaged CDM, respectively—see Equation (19).
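For reference, the sketch below computes the counting metrics of Equations (20) and (27), the classification scores of Equations (23)–(26), and the IOU of Equation (28) from per-frame counts, labels, and thresholded maps; the localization error of Equation (21) is omitted because it additionally requires the Munkres-based association of true and estimated positions.

import numpy as np

def pdr(true_counts, est_counts):
    # Averaged person detection rate, Equation (20).
    n, n_hat = np.asarray(true_counts, float), np.asarray(est_counts, float)
    return 1.0 - np.mean(np.abs(n - n_hat) / (n + 1.0))

def vcr(true_violations, est_violations):
    # Averaged violations count rate, Equation (27).
    v, v_hat = np.asarray(true_violations, float), np.asarray(est_violations, float)
    return 1.0 - np.mean(np.abs(v - v_hat) / (v + 1.0))

def classification_scores(s_true, s_pred):
    # Accuracy, precision, recall, and F1-score for the violation labels S_t, Equations (23)-(26).
    s_true, s_pred = np.asarray(s_true, bool), np.asarray(s_pred, bool)
    tp, tn = np.sum(s_true & s_pred), np.sum(~s_true & ~s_pred)
    fp, fn = np.sum(~s_true & s_pred), np.sum(s_true & ~s_pred)
    accuracy = (tp + tn) / len(s_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

def iou(region_true, region_pred):
    # Intersection over union of the thresholded crowd density maps, Equation (28).
    inter = np.logical_and(region_true, region_pred).sum()
    union = np.logical_or(region_true, region_pred).sum()
    return inter / union if union else 0.0

# Toy usage with four frames.
print(pdr([5, 6, 6, 5], [5, 5, 6, 5]), classification_scores([1, 1, 0, 0], [1, 0, 0, 0]))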

4. Results and Discussions

4.1. Dataset

We utilize the EPFL-MPV, EPFL-Wildtrack, and OxTown public datasets along with the pose estimations prepared in [26]. The EPFL-MPV comprises four sequences, named 6p-c0, 6p-c1, 6p-c2, and 6p-c3, of six people moving freely in a room [49]. The sequences are synchronized and view the same environment but from different perspectives. Each sequence is recorded at 25 frames per second (fps) and has 2954 frames. The EPFL-Wildtrack contains seven synchronized sequences, named C1-C7, with approximately 20 people moving outdoors [50]. The sequences view walking pedestrians outside the main building of the ETH university in Switzerland. They are shot using seven cameras positioned at different locations, and each has a total of 400 frames. Lastly, the OxTown is a street surveillance video with 4501 frames shot with a single camera at 25 fps. It oversees, on average, 16 people walking down a street in Oxford, England [51].

4.2. Preprocessing and Settings

The utilized datasets offer annotations in terms of bounding boxes that localize people in the scene. Additionally, they provide the homography matrix and the image-to-real distance scale of each recording camera. The EPFL-MPV and OxTown bounding boxes are vertically over-sized and enclose more than the areas occupied by the human subjects. Therefore, their bottom mid-points are lower than the subjects’ actual ground positions. In this work, we correct for this by shifting the mid-points up by a percentage of the bounding box total height. Specifically, we apply a 10% and 2% uplift to the EPFL-MPV and OxTown localization data, respectively. Moreover, the OxTown dataset annotation includes bounding boxes for babies in strollers/prams accompanied by adults. This is outside the scope of our work; hence, we discard them (this corresponds to subject IDs 24, 42, 44, 45, and 47). Finally, the ROI for each dataset/sequence is manually selected, in the image–pixel domain, to cover the floor of the scene. The ROIs include most annotated positions, but we discard the remaining few that are outside the selected area. This corresponds to excluding 2.38% (960 out of 40,393), 6.67% (4767 out of 71,460), and 15% (6403 out of 42,721) of the EPFL-MPV, EPFL-Wildtrack, and OxTown annotations, respectively. The proposed system’s smoothing and tracking parameters are found for every dataset/sequence by minimizing the localization error in Equation (21) using the Bayesian optimization algorithm in MATLAB; see Table 1. The optimization is executed for 500 iterations using the expected-improvement-plus acquisition function and repeated five times for verification [52].
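The bounding-box correction can be illustrated as follows; the sketch assumes annotations given as [x_min, y_min, x_max, y_max] pixel boxes with the vertical axis pointing downwards, which is an assumption about the annotation format rather than the datasets’ exact specification.

import numpy as np

def box_ground_point(boxes, uplift=0.10):
    # boxes: (N, 4) array of [x_min, y_min, x_max, y_max] in pixels.
    # uplift: fraction of the box height by which the bottom mid-point is shifted up
    #         (0.10 for EPFL-MPV and 0.02 for OxTown in this work).
    boxes = np.asarray(boxes, float)
    u = 0.5 * (boxes[:, 0] + boxes[:, 2])                    # horizontal mid-point
    height = boxes[:, 3] - boxes[:, 1]
    v = boxes[:, 3] - uplift * height                        # shift the bottom edge upwards
    return np.stack([u, v], axis=1)

# Toy usage: a 100-pixel-tall box is lifted by 10 pixels.
print(box_ground_point([[10, 0, 50, 100]], uplift=0.10))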

4.3. System Integration

Figure 6 illustrates three examples for integrating the proposed system outputs and displaying them on the user interface unit. These examples offer complementary interpretations for the scene and serve different purposes depending on the intended application or required analysis. For instance, in Figure 6a, the input video frame, depicted in Figure 4, is overlaid with the localization and averaged ODM results. This type of display is important when monitoring crowds in public areas or for analyzing customers’ browsing habits and preferences in shops. Moreover, we show in Figure 6b that the former information can be replaced with the detected social distance violations and the averaged thresholded CDM. This example is directly intended for social distance monitoring applications and can be used to oversee critical waiting areas, e.g., in airports and hospitals. Furthermore, Figure 6c demonstrates a dynamic top-view map for the scene by plotting the localization, inter-personal distances, and the averaged CDM in the real-world coordinates. This figure serves as a footprint for redesigning congested areas and facilitates developing physical interaction protocols and guidelines. Finally, apart from these applications, one can merge and/or adjust the type and amount of displayed information. In addition, the user is able to view one or multiple integrated frames, or top-view maps, simultaneously; hence, offering valuable information about the scene and crowd state. The supplementary material of this paper includes videos of the system integration outcome for other video sequences.

4.4. Evaluations and Results

Figure 7 demonstrates the social distance violation detection performance of the basic and proposed approaches in terms of accuracy, F1-score, and VCR. In addition, it shows their IOU for identifying the overcrowded regions in the scene. The results are computed for a range of safety distances and averaged across all video sequences. We vary the safety distance from 1 to 2.5 m with a 0.05 step to cover a wide range of guidelines. Moreover, Table 2 illustrates the system capacity to detect human subjects, localize their positions, recognize social distance violations, estimate crowd density maps, and identify high-risk areas in each video sequence; it summarizes the PDR, localization error, accuracy, F1-score, precision, recall, VCR, CORR, and IOU. The results are averaged across the range of safety distances and we assess the gain in performance delivered by the smoothing and tracking stage.
The trends in Figure 7 indicate that the accuracy, F1-score, and IOU increase with the safety distance, whereas the VCR is stable for the proposed approach and decreases for the basic method. Additionally, they depict the gain in performance delivered by the proposed system. Specifically, the boost in accuracy, F1-score, VCR, and IOU is up to 5.8%, 9.5%, 7.6%, and 10.7%, respectively. Furthermore, by examining the results in Table 2, one notes a clear advantage for utilizing the proposed system, as it yields the best overall performance across all measures except precision, where it instead maintains a balanced precision/recall trade-off. Specifically, it offers the highest person detection rates and lowest localization errors for all video sequences, with gains up to 43% and 38.3%, respectively. Similarly, it results in better social distance violation recognition and raises the conventional method’s accuracy, F1-score, and VCR by 17%, 9.6%, and 39%, respectively. Moreover, the quality of the estimated crowd density maps, in terms of correlation, is high for both techniques, because the contribution of faulty detections is insignificant to the long-term averaged estimation. However, this is not the case when identifying high-risk regions. The results highlight a growth in the IOU of the proposed method of up to 12.4%; hence, it is more reliable. Finally, Table 2 emphasizes the role of the smoothing and tracking stage, which offers a considerable improvement due to its treatment of occlusions and missing data. In particular, it balances the system efficacy, by reducing the difference between precision and recall, and expands its functionality to cover various tasks and application domains.
Table 3 shows a comparison between the proposed system, the basic pose-based approach from [26], and an object detection-based system developed in [15]. The comparison focuses on the systems’ ability to detect social distance violations in the OxTown dataset with a 2 m social safety distance. Note that since the compared solutions do not utilize tracking, we demonstrate the proposed system results with and without the smoothing/tracking stage. In addition, we illustrate example results in Figure 8 to visualize the proposed system outcomes. The results in Table 3 verify the proposed system applicability and the adequacy of pose-based techniques to detect social distance infractions. They indicate a 4.6% and 3% gain in accuracy and F1-score, respectively, when compared to the object detection-based method in [15]. In addition, they affirm the smoothing and tracking stage role which pushes the proposed system accuracy and F1-score by 0.9% and 0.5%, respectively.

4.5. Computational Complexity Analysis

The complexity of the proposed system is measured by its frame rate (the number of processed video frames per second) and processing rate (the processing time per frame). The assessment is conducted by Monte-Carlo simulations where we run the model depicted in Figure 1 using all video frames and repeat the process ten times for validation. Note that we exclude the complexity of OpenPose since we use the pre-computed poses in [26]. Nevertheless, OpenPose real-time operation on both CPU and GPU machines was verified in [37,53]. In addition, we select OpenPose due to its simplicity and availability, but it can be replaced with any other pose estimation model given the same body joints indexing scheme described in Section 3.1.1. We use a desktop equipped with two Intel® Xeon® E5-2697 v2 x64-based processors, 192 GB of memory, and MATLAB R2020b. Figure 9 demonstrates the developed system’s frame and processing rates with respect to the number of detected/tracked subjects. The averaged results suggest that the system is capable of running in real-time despite the additional complexity of the smoothing/tracking stage. Specifically, it runs at 106.5 fps (9.9 ms/frame) when solely relying on the proposed localization strategy and at 33.6 fps (44.5 ms/frame) when accommodating the tracking algorithm. Moreover, the results indicate that the localization approach’s complexity depends on the amount of occlusion present in the video frame—see Figure 9a. This is shown by the drop in frame rate when 2–6 people are present and by its slow decline when there are more than 7 people in the scene. The first drop is caused by the EPFL-MPV dataset, where we have six subjects moving in a highly confined environment resulting in many occlusions, while the second is due to the general increase in the number of people, which escalates the chances of occlusion. Furthermore, the complexity introduced by smoothing/tracking is demonstrated by the rapid decay in frame rate when increasing the number of subjects—see Figure 9b. The trends reveal the system’s limited ability to resolve highly dense crowds. In particular, the average frame rate drops below 25 fps (40 ms/frame) and 12 fps (83 ms/frame) when we have more than 10 and 17 people, respectively. Nevertheless, these findings highlight a need to distribute the computational load across the surveillance infrastructure. For instance, stages 1–4 in Figure 1 can be performed locally by the camera or on edge devices, while stages 5–9 require more resources.

5. Conclusions

The COVID-19 pandemic has deemed social distancing a critical first line of defense against the wide spread of the virus; nevertheless, safety distance guidelines are not always followed. Monitoring social distancing is important to draw realistic mitigation plans and to structure exit strategies. However, it is a labor-intensive task and suffers from subjective interpretations; therefore, combining computer vision and machine learning models with mass surveillance is intuitive for automation, but it must preserve privacy to ensure ethical adoption and application.
This work presented a privacy-preserving adaptive social distance estimation and crowd monitoring system for surveillance cameras. We evaluated the system’s ability to detect human subjects, localize their positions, recognize social distance violations, estimate crowd density maps, and identify high-risk areas. Additionally, we analyzed its computational complexity in terms of processing time. The results indicated a clear advantage for utilizing the proposed localization approach when compared to the latest techniques. In addition, they showed a considerable improvement delivered by the adaptive smoothing and tracking stage. Specifically, the system improves the PDR, localization relative error, accuracy, F1-score, VCR, and IOU by up to 43%, 38.3%, 17%, 9.6%, 39%, and 12.4%, respectively. In addition, it runs at 33.6 fps (44.5 ms/frame), making it a real-time solution for low to medium-dense crowds. The proposed system’s occupancy/crowd density map functionality extends its application domain beyond the COVID-19 pandemic to cover other areas. For instance, it can help re-configure or re-design common physical layouts and relocate facilities in businesses to optimally reduce congestion. Additionally, it is capable of facilitating the analysis of customers’ browsing habits in shops and quantifying the effectiveness of marketing kiosks.
The developed system, although advantageous, is still limited and can be extended in various ways, such as: (1) estimating the body orientation to relax the assumption of vertically oriented subjects; (2) fusing detections and estimations from multi-view cameras to assess the environment state rather than the camera-specific scenery; (3) developing an automatic online training paradigm for the tracking algorithm parameters; (4) embedding regression techniques to estimate the crowd density maps; and (5) detecting other anomalies such as fire, smoke, unattended objects in public places, and abnormal individual or crowd behavior. These will be the topics of our future research.

Supplementary Materials

The following are available at https://github.com/Al-Sad/Social-Pose.

Author Contributions

Conceptualization, M.A.-S., S.K. and M.G.; methodology, M.A.-S., S.K. and M.G.; software, M.A.-S.; validation, M.A.-S. and S.K.; formal analysis, M.A.-S., S.K. and M.G.; investigation, M.A.-S., S.K. and M.G.; resources, I.A., C.S., M.V. and M.G.; data curation, M.A.-S.; writing—original draft preparation, M.A.-S.; writing—review and editing, M.A.-S., S.K., I.A., C.S., M.V. and M.G.; visualization, I.A., C.S., M.V. and M.G.; supervision, S.K. and M.G.; project administration, I.A., C.S. and M.V.; funding acquisition, S.K., I.A., C.S., M.V. and M.G. All authors have read and agreed to the published version of the manuscript.

Funding

The work was supported by projects NSF IUCRC CVDI AMALIA, Mad@Work and Stroke-Data. Financial support of Business Finland, Haltian and TietoEVRY is acknowledged.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Our system is open-sourced. The implementation and the experiment data can be accessed via our GitHub repository: https://github.com/Al-Sad/Social-Pose, accessed on 5 January 2022.

Acknowledgments

The authors would like to thank Kateryna Chumachenko (Tampere University, Finland) for her valuable comments and feedback.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Fauci, A.S.; Lane, H.C.; Redfield, R.R. COVID-19—Navigating the Uncharted. N. Engl. J. Med. 2020, 382, 1268–1269.
  2. Hoeben, E.M.; Bernasco, W.; Suonperä Liebst, L.; van Baak, C.; Rosenkrantz Lindegaard, M. Social distancing compliance: A video observational analysis. PLoS ONE 2021, 16, e0248221.
  3. Hossain, M.S.; Muhammad, G.; Guizani, N. Explainable AI and Mass Surveillance System-Based Healthcare Framework to Combat COVID-I9 Like Pandemics. IEEE Netw. 2020, 34, 126–132.
  4. Cristani, M.; Bue, A.D.; Murino, V.; Setti, F.; Vinciarelli, A. The Visual Social Distancing Problem. IEEE Access 2020, 8, 126876–126886.
  5. Sugianto, N.; Tjondronegoro, D.; Stockdale, R.; Yuwono, E.I. Privacy-preserving AI-enabled video surveillance for social distancing: Responsible design and deployment for public spaces. Inf. Technol. People 2021.
  6. Zuo, F.; Gao, J.; Kurkcu, A.; Yang, H.; Ozbay, K.; Ma, Q. Reference-free video-to-real distance approximation-based urban social distancing analytics amid COVID-19 pandemic. J. Transp. Health 2021, 21, 101032.
  7. Antonucci, A.; Magnago, V.; Palopoli, L.; Fontanelli, D. Performance Assessment of a People Tracker for Social Robots. In Proceedings of the 2019 IEEE International Instrumentation and Measurement Technology Conference (I2MTC), Auckland, New Zealand, 20–23 May 2019; pp. 1–6.
  8. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
  9. Gupta, A.; Gupta, K.; Gupta, K.; Gupta, K. A Survey on Human Activity Recognition and Classification. In Proceedings of the 2020 International Conference on Communication and Signal Processing (ICCSP), Chennai, India, 28–30 July 2020.
  10. Golda, T.; Kalb, T.; Schumann, A.; Beyerer, J. Human Pose Estimation for Real-World Crowded Scenarios. In Proceedings of the 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Taipei, China, 18–21 September 2019; pp. 1–8.
  11. Li, J.; Wang, C.; Zhu, H.; Mao, Y.; Fang, H.S.; Lu, C. CrowdPose: Efficient Crowded Scenes Pose Estimation and a New Benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019.
  12. Pi, Y.; Nath, N.D.; Sampathkumar, S.; Behzadan, A.H. Deep Learning for Visual Analytics of the Spread of COVID-19 Infection in Crowded Urban Environments. Nat. Hazards Rev. 2021, 22, 1–14.
  13. Rezaei, M.; Azarmi, M. DeepSOCIAL: Social Distancing Monitoring and Infection Risk Assessment in COVID-19 Pandemic. Appl. Sci. 2020, 10, 7514.
  14. Punn, N.S.; Sonbhadra, S.K.; Agarwal, S. Monitoring COVID-19 social distancing with person detection and tracking via fine-tuned YOLO v3 and Deepsort techniques. arXiv 2020, arXiv:2005.01385.
  15. Yang, D.; Yurtsever, E.; Renganathan, V.; Redmill, K.A.; Özgüner, Ü. A Vision-Based Social Distancing and Critical Density Detection System for COVID-19. Sensors 2021, 21, 4608.
  16. Ahmed, I.; Ahmad, M.; Rodrigues, J.; Jeon, G.; Din, S. A deep learning-based social distance monitoring framework for COVID-19. Sustain. Cities Soc. 2021, 65, 102571.
  17. Srinivasan, S.; Rujula Singh, R.; Biradar, R.R.; Revathi, S.A. COVID-19 Monitoring System using Social Distancing and Face Mask Detection on Surveillance video datasets. In Proceedings of the 2021 International Conference on Emerging Smart Computing and Informatics (ESCI), Pune, India, 5–7 March 2021; pp. 449–455.
  18. Magoo, R.; Singh, H.; Jindal, N.; Hooda, N.; Rana, P.S. Deep learning-based bird eye view social distancing monitoring using surveillance video for curbing the COVID-19 spread. Neural Comput. Appl. 2021, 33, 15807–15814.
  19. Saponara, S.; Elhanashi, A.; Gagliardi, A. Implementing a real-time, AI-based, people detection and social distancing measuring system for COVID-19. J. Real-Time Image Process. 2021, 18, 1937–1947.
  20. Hou, Y.C.; Baharuddin, M.Z.; Yussof, S.; Dzulkifly, S. Social Distancing Detection with Deep Learning Model. In Proceedings of the 2020 8th International Conference on Information Technology and Multimedia (ICIMU), Selangor, Malaysia, 24–25 August 2020; pp. 334–338.
  21. Gupta, S.; Kapil, R.; Kanahasabai, G.; Joshi, S.S.; Joshi, A.S. SD-Measure: A Social Distancing Detector. In Proceedings of the 2020 12th International Conference on Computational Intelligence and Communication Networks (CICN), Bhimtal, India, 25–26 September 2020; pp. 306–311.
  22. Qin, J.; Xu, N. Reaserch and implementation of social distancing monitoring technology based on SSD. Procedia Comput. Sci. 2021, 183, 768–775.
  23. Shao, Z.; Cheng, G.; Ma, J.; Wang, Z.; Wang, J.; Li, D. Real-time and Accurate UAV Pedestrian Detection for Social Distancing Monitoring in COVID-19 Pandemic. IEEE Trans. Multimed. 2021.
  24. Shorfuzzaman, M.; Hossain, M.S.; Alhamid, M.F. Towards the sustainable development of smart cities through mass video surveillance: A response to the COVID-19 pandemic. Sustain. Cities Soc. 2021, 64, 102582.
  25. Ahamad, A.H.; Zaini, N.; Latip, M.F.A. Person Detection for Social Distancing and Safety Violation Alert based on Segmented ROI. In Proceedings of the 2020 10th IEEE International Conference on Control System, Computing and Engineering (ICCSCE), Penang, Malaysia, 21–22 August 2020; pp. 113–118.
  26. Aghaei, M.; Bustreo, M.; Wang, Y.; Bailo, G.; Morerio, P.; Del Bue, A. Single Image Human Proxemics Estimation for Visual Social Distancing. In Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2021; pp. 2784–2794.
  27. Seker, M.; Mannisto, A.; Iosifidis, A.; Raitoharju, J. Automatic Social Distance Estimation From Images: Performance Evaluation, Test Benchmark, and Algorithm. arXiv 2021, arXiv:2103.06759.
  28. Khandelwal, P.; Khandelwal, A.; Agarwal, S.; Thomas, D.; Xavier, N.; Raghuraman, A. Using Computer Vision to enhance Safety of Workforce in Manufacturing in a Post COVID World. arXiv 2020, arXiv:2005.05287.
  29. Nascimento, J.C.; Abrantes, A.J.; Marques, J.S. An algorithm for centroid-based tracking of moving objects. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing Proceedings, Phoenix, AZ, USA, 15–19 March 1999; Volume 6, pp. 3305–3308.
  30. Rezaee, K.; Rezakhani, S.M.; Khosravi, M.R.; Moghimi, M.K. A survey on deep learning-based real-time crowd anomaly detection for secure distributed video surveillance. Pers. Ubiquitous Comput. 2021.
  31. Bouhlel, F.; Mliki, H.; Hammami, M. Crowd Behavior Analysis based on Convolutional Neural Network: Social Distancing Control COVID-19. In Proceedings of the International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, Online, 8–10 February 2021; Volume 5, pp. 273–280.
  32. Kizrak, M.A.; Bolat, B. Crowd Density Estimation by Using Attention Based Capsule Network and Multi-Column CNN. IEEE Access 2021, 9, 75435–75445.
  33. Ahmed, I.; Ahmad, M.; Ahmad, A.; Jeon, G. IoT-based crowd monitoring system: Using SSD with transfer learning. Comput. Electr. Eng. 2021, 93, 107226.
  34. Elbishlawi, S.; Abdelpakey, M.H.; Eltantawy, A.; Shehata, M.S.; Mohamed, M.M. Deep Learning-Based Crowd Scene Analysis Survey. J. Imaging 2020, 6, 95.
  35. Ozcan, A.H.; Unsalan, C.; Reinartz, P. Sparse people group and crowd detection using spatial point statistics in airborne images. In Proceedings of the 2015 7th International Conference on Recent Advances in Space Technologies (RAST), Istanbul, Turkey, 16–19 June 2015; pp. 307–310.
  36. Gloudemans, D.; Gloudemans, N.; Abkowitz, M.; Barbour, W.; Work, D.B. Quantifying Social Distancing Compliance and the Effects of Behavioral Interventions Using Computer Vision. In Proceedings of the Workshop on Data-Driven and Intelligent Cyber-Physical Systems, Nashville, TN, USA, 18 May 2021; pp. 1–5.
  37. Cao, Z.; Hidalgo, G.; Simon, T.; Wei, S.E.; Sheikh, Y. OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 172–186.
  38. Kholopov, I.S. Bird’s Eye View Transformation Technique in Photogrammetric Problem of Object Size Measuring at Low-altitude Photography. In Proceedings of the International Conference “Actual Issues of Mechanical Engineering” 2017 (AIME 2017), Tomsk, Russia, 27–29 July 2017; Atlantis Press: Tomsk, Russia, 2017; pp. 318–324.
  39. Toriya, H.; Kitahara, I.; Ohta, Y. Mobile Camera Localization Using Aerial-view Images. Inf. Media Technol. 2014, 9, 896–904.
  40. Calore, E.; Pedersini, F.; Frosio, I. Accelerometer based horizon and keystone perspective correction. In Proceedings of the 2012 IEEE International Instrumentation and Measurement Technology Conference Proceedings, Graz, Austria, 13–16 May 2012; pp. 205–209.
  41. Huang, W.; Li, Y.; Hu, F. Real-Time 6-DOF Monocular Visual SLAM based on ORB-SLAM2. In Proceedings of the 2019 Chinese Control and Decision Conference (CCDC), Nanchang, China, 3–5 June 2019; pp. 2929–2932.
41. Huang, W.; Li, Y.; Hu, F. Real-Time 6-DOF Monocular Visual SLAM based on ORB-SLAM2. In Proceedings of the 2019 Chinese Control and Decision Conference (CCDC), Nanchang, China, 3–5 June 2019; pp. 2929–2932. [Google Scholar] [CrossRef]
  42. Zhang, L.; Li, Y.; Zhao, Y.; Sun, Q.; Zhao, Y. High Precision Monocular Plane Measurement for Large Field of View. In Proceedings of the 2018 IEEE 8th Annual International Conference on CYBER Technology in Automation, Control, and Intelligent Systems (CYBER), Tianjin, China, 19–23 July 2018; pp. 1388–1392. [Google Scholar] [CrossRef]
  43. Kiran, A.G.; Murali, S. Automatic rectification of perspective distortion from a single image using plane homography. Int. J. Comput. Sci. Appl. 2013, 3, 47–58. [Google Scholar] [CrossRef] [Green Version]
44. Bishop, G.; Welch, G. An Introduction to the Kalman Filter; SIGGRAPH 2001 Course Notes; University of North Carolina at Chapel Hill: Chapel Hill, NC, USA, 2001. [Google Scholar]
  45. Almagbile, A.; Wang, J.; Ding, W. Evaluating the Performances of Adaptive Kalman Filter Methods in GPS/INS Integration. J. Glob. Position. Syst. 2010, 9, 33–40. [Google Scholar] [CrossRef] [Green Version]
  46. Sinha, A.; Ding, Z.; Kirubarajan, T.; Farooq, M. Track Quality Based Multitarget Tracking Approach for Global Nearest-Neighbor Association. IEEE Trans. Aerosp. Electron. Syst. 2012, 48, 1179–1191. [Google Scholar] [CrossRef]
  47. Dezert, J.; Benameur, K. On the Quality of Optimal Assignment for Data Association. In Belief Functions: Theory and Applications; Cuzzolin, F., Ed.; Springer International Publishing: Cham, Switzerland, 2014; pp. 374–382. [Google Scholar] [CrossRef] [Green Version]
  48. Al-Shakarji, N.M.; Bunyak, F.; Seetharaman, G.; Palaniappan, K. Multi-object Tracking Cascade with Multi-Step Data Association and Occlusion Handling. In Proceedings of the 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Auckland, New Zealand, 27–30 November 2018; pp. 1–6. [Google Scholar] [CrossRef]
  49. Fleuret, F.; Berclaz, J.; Lengagne, R.; Fua, P. Multicamera People Tracking with a Probabilistic Occupancy Map. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 30, 267–282. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  50. Chavdarova, T.; Baqué, P.; Bouquet, S.; Maksai, A.; Jose, C.; Bagautdinov, T.; Lettry, L.; Fua, P.; Van Gool, L.; Fleuret, F. WILDTRACK: A Multi-camera HD Dataset for Dense Unscripted Pedestrian Detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 5030–5039. [Google Scholar] [CrossRef]
  51. Benfold, B.; Reid, I. Stable multi-target tracking in real-time surveillance video. In Proceedings of the CVPR 2011, Colorado Springs, CO, USA, 20–25 June 2011; pp. 3457–3464. [Google Scholar] [CrossRef]
  52. Bull, A.D. Convergence Rates of Efficient Global Optimization Algorithms. J. Mach. Learn. Res. 2011, 12, 2879–2904. [Google Scholar]
  53. Osokin, D. Real-time 2D Multi-Person Pose Estimation on CPU: Lightweight OpenPose. arXiv 2018, arXiv:1811.12004. [Google Scholar]
Figure 1. The proposed social distance estimation and crowd monitoring system model. The model comprises the following stages: (1) Read a video frame; (2) Detect human subjects and localize their positions; (3) Discard all positions outside the ROI; (4) Transform the remaining positions to the real-world coordinates; (5) Smooth the noisy estimates and compensate for missing data by tracking; (6) Estimate the subjects’ inter-personal distances and crowd density maps; (7) Recognize irregularities in the crowd state in terms of social distance infringements and congestion; (8) Integrate the video frame with the estimated parameters and identified anomalies; and (9) Display the integrated frame and generate a dynamic top-view map for the scene.
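The per-frame control flow summarized in this caption can be sketched as follows. The stage functions (pose detection/localization, ROI filtering, coordinate transform, smoothing/tracking, distance and density estimation, anomaly detection, and rendering) are passed in as callables because their implementations are described in the body of the paper; this is a hedged control-flow sketch, not the authors' code, and every stage name is a placeholder.

```python
# Control-flow sketch of the nine stages in Figure 1; stage implementations
# are supplied by the caller as callables (all names are placeholders).
import cv2

def monitor(video_path, stages):
    cap = cv2.VideoCapture(video_path)                     # (1) read video frames
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        positions = stages["localize"](frame)              # (2) detect and localize subjects
        positions = stages["roi_filter"](positions)        # (3) keep positions inside the ROI
        world = stages["to_world"](positions)              # (4) image -> real-world coordinates
        tracks = stages["smooth_track"](world)              # (5) smooth noisy / missing estimates
        dists, maps = stages["analyse"](tracks)             # (6) inter-personal distances and density maps
        alerts = stages["detect_anomalies"](dists, maps)    # (7) infringements and congestion
        overlay = stages["render"](frame, tracks, alerts)   # (8) integrate frame with results
        cv2.imshow("crowd monitor", overlay)                # (9) display the integrated frame
        if cv2.waitKey(1) == 27:                            # Esc to stop
            break
    cap.release()
    cv2.destroyAllWindows()
```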
Figure 2. An example pose estimation for three subjects with varying heights and spatial positions. The 25 joints estimated by OpenPose are indexed on the right (blue) skeleton. The remaining two subjects have some undetected joints, but their joint indexing remains the same. The ground position of each subject is estimated as the midpoint of their feet joints. The user-defined region of interest is depicted in gray and includes all three ground positions.
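A minimal sketch of this ground-position step, assuming the OpenPose BODY_25 keypoint layout and treating the ankle, heel, and toe keypoints as the "feet joints"; the confidence threshold and the ROI polygon test are illustrative choices rather than the paper's exact settings.

```python
# Ground position = midpoint of the visible foot joints (Figure 2), followed by
# a point-in-polygon test against the user-defined ROI.
import numpy as np
from matplotlib.path import Path

FOOT_IDX = [11, 14, 19, 20, 21, 22, 23, 24]      # BODY_25 ankles, toes, and heels

def ground_position(keypoints):
    """keypoints: (25, 3) array of (x, y, confidence); returns (x, y) or None."""
    feet = keypoints[FOOT_IDX]
    valid = feet[feet[:, 2] > 0.1]               # keep joints detected with some confidence
    if len(valid) == 0:
        return None                              # feet fully occluded -> no ground estimate
    return valid[:, :2].mean(axis=0)             # midpoint of the visible foot joints

def inside_roi(point, roi_polygon):
    """roi_polygon: (N, 2) array of image coordinates describing the ROI."""
    return Path(roi_polygon).contains_point(point)

# Example: one synthetic pose with only the two ankles detected, and a rectangular ROI.
pose = np.zeros((25, 3))
pose[11] = [410, 560, 0.9]                       # right ankle (x, y, confidence)
pose[14] = [450, 565, 0.8]                       # left ankle
roi = np.array([[0, 300], [800, 300], [800, 600], [0, 600]])
p = ground_position(pose)
print(p, inside_roi(p, roi))
```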
Figure 3. The estimated poses in frame 1824 of the EPFL-MPV dataset scene 6p-c0. Three out of five people are detected correctly (red, orange, and green poses), whereas the rest are not, due to partial occlusion and missing data (gray and blue poses).
Figure 4. The proposed system outcome at each stage using the example input frame and the estimated poses in Figure 3. (a–d) demonstrate the localized human subjects, smoothed/tracked ground positions, inter-personal distances with the instantaneous occupancy map, and the instantaneous crowd map along with the detected social distance violations in the image–pixel coordinates, respectively. (e–h) present the same results as in (a–d), but in the real-world coordinates. The user-selected ROI is shown in cyan and covers the floor plane in the scene. The basic and proposed localization results are depicted by triangles and squares, respectively, while the smoothed/tracked ground positions are visualized with circles. The distances among the subjects are visualized using lines with varying thickness and darkness, where thick/thin and dark/light lines indicate shorter/longer distances. The instantaneous occupancy and crowd density maps are computed with a 1 m spatial resolution (δ = 1) and 2 m social safety distance (r = 2), respectively. Note that the ground positions in (a,b,e,f) are color-coded in accordance with the estimated poses in Figure 3. The color coding is dropped in (c,d,g,h) to preserve privacy and to emphasize the recognition of a social distance infringement; red/green indicates the presence/absence of subjects violating the defined social safety distance.
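The coordinate transform, inter-personal distances, violation flagging, and instantaneous occupancy map visualized in Figure 4 can be sketched as below. The homography matrix H and the scene extent are illustrative stand-ins, not the calibration used in the paper, and the occupancy map is a simple per-cell person count on a δ = 1 m grid.

```python
# Sketch of stages (4)-(7): plane homography to world coordinates, pairwise
# distances, safety-distance violations, and a 1 m-resolution occupancy map.
import numpy as np

def to_world(H, points_px):
    """Apply a 3x3 plane homography to (N, 2) pixel points -> (N, 2) metres."""
    pts = np.hstack([points_px, np.ones((len(points_px), 1))])
    w = pts @ H.T
    return w[:, :2] / w[:, 2:3]

def pairwise_distances(points_m):
    diff = points_m[:, None, :] - points_m[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def violating_pairs(dist, r=2.0):
    i, j = np.triu_indices(len(dist), k=1)
    mask = dist[i, j] < r
    return list(zip(i[mask], j[mask]))

def occupancy_map(points_m, extent=(10, 10), delta=1.0):
    """Instantaneous occupancy map: person count per delta x delta metre cell."""
    bins = (np.arange(0, extent[0] + delta, delta),
            np.arange(0, extent[1] + delta, delta))
    grid, _, _ = np.histogram2d(points_m[:, 0], points_m[:, 1], bins=bins)
    return grid

H = np.array([[0.02, 0.0, -3.0], [0.0, 0.03, -4.0], [0.0, 0.001, 1.0]])  # illustrative only
px = np.array([[420.0, 560.0], [515.0, 540.0], [700.0, 580.0]])          # ground points (pixels)
world = to_world(H, px)
d = pairwise_distances(world)
print(violating_pairs(d, r=2.0))          # index pairs closer than the safety distance
print(occupancy_map(world))               # instantaneous occupancy grid
```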
Figure 5. The linear Kalman filter process comprises two stages: state prediction and measurement correction.
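The two-stage loop in Figure 5 can be illustrated with a plain constant-velocity Kalman filter over a 2-D ground position. This is a generic sketch: the adaptive noise handling and track management developed in the paper are intentionally omitted, and the noise parameters below are arbitrary.

```python
# Minimal constant-velocity linear Kalman filter: predict, then correct with the
# measurement; a missing measurement (occlusion) simply keeps the prediction.
import numpy as np

class GroundPointKalman:
    def __init__(self, x0, dt=1.0, q=1e-2, r=0.5):
        self.x = np.array([x0[0], x0[1], 0.0, 0.0])            # state [x, y, vx, vy]
        self.P = np.eye(4)                                      # state covariance
        self.F = np.eye(4); self.F[0, 2] = self.F[1, 3] = dt    # constant-velocity model
        self.H = np.zeros((2, 4)); self.H[0, 0] = self.H[1, 1] = 1.0
        self.Q = q * np.eye(4)                                  # process noise (illustrative)
        self.R = r * np.eye(2)                                  # measurement noise (illustrative)

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]

    def correct(self, z):
        if z is None:                                           # missing detection:
            return self.x[:2]                                   # keep the predicted position
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)                # Kalman gain
        self.x = self.x + K @ (np.asarray(z) - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]

kf = GroundPointKalman(x0=(3.4, 8.2))
for z in [(3.5, 8.1), None, (3.8, 7.9)]:                        # None = occluded frame
    kf.predict()
    print(kf.correct(z))
```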
Figure 6. Example integrated video frames and dynamic top-view maps produced by the proposed system using frames 1 to 1824 of the EPFL-MPV dataset scene 6p-c0. The type and amount of displayed information are adjustable, and one can view multiple integrated frames and/or top-view maps simultaneously. Note that the pair-wise lines in (c) are plotted only for distances between 0 and 3 m to ease visualization.
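A sketch of how such a dynamic top-view map can be rendered: metric ground positions are scaled onto a canvas, and pairwise lines are drawn only within a chosen distance band (0–3 m in the figure), with thicker and darker lines for closer pairs. The pixels-per-metre scale, colors, and canvas size are illustrative assumptions, not the system's actual rendering parameters.

```python
# Render a top-view map with distance-coded pairwise lines (shorter = thicker/darker).
import numpy as np
import cv2

def draw_top_view(points_m, canvas_m=(10, 10), ppm=60, d_max=3.0):
    points_m = np.asarray(points_m, dtype=float)
    h, w = int(canvas_m[1] * ppm), int(canvas_m[0] * ppm)
    canvas = np.full((h, w, 3), 255, np.uint8)                 # white metric canvas
    px = (points_m * ppm).astype(int)                          # metres -> pixels
    for i in range(len(px)):
        for j in range(i + 1, len(px)):
            d = np.linalg.norm(points_m[i] - points_m[j])
            if d <= d_max:                                      # plot only the 0..d_max band
                shade = int(200 * d / d_max)                    # darker means closer
                thick = max(1, int(4 * (1 - d / d_max)))        # thicker means closer
                cv2.line(canvas, tuple(map(int, px[i])), tuple(map(int, px[j])),
                         (shade,) * 3, thick)
    for p in px:
        cv2.circle(canvas, tuple(map(int, p)), 6, (0, 128, 0), -1)   # ground positions
    return canvas

top = draw_top_view([[3.4, 8.2], [4.7, 7.9], [7.0, 8.5]])
cv2.imwrite("top_view.png", top)
```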
Figure 7. The performance evaluation results in terms of accuracy, F1-score, VCR, and IOU averaged across all video sequences and plotted for a range of social safety distances.
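One way to obtain accuracy and F1-score curves of this kind is to classify every person pair as violating/non-violating at each candidate safety distance and compare against ground truth. The sketch below follows that generic recipe under assumed toy distance matrices; the paper's exact evaluation protocol, and its VCR and IOU definitions, are not reproduced here.

```python
# Pairwise violation classification metrics swept over a range of safety distances.
import numpy as np

def pair_labels(dist_matrix, r):
    i, j = np.triu_indices(len(dist_matrix), k=1)
    return dist_matrix[i, j] < r                       # True = pair closer than r

def accuracy_f1(gt_dist, est_dist, r):
    y, p = pair_labels(gt_dist, r), pair_labels(est_dist, r)
    tp = np.sum(y & p); tn = np.sum(~y & ~p)
    fp = np.sum(~y & p); fn = np.sum(y & ~p)
    acc = (tp + tn) / len(y)
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 1.0
    return acc, f1

gt = np.array([[0, 1.5, 4.0], [1.5, 0, 2.6], [4.0, 2.6, 0]])    # toy ground-truth distances (m)
est = np.array([[0, 1.7, 3.8], [1.7, 0, 1.9], [3.8, 1.9, 0]])   # toy estimated distances (m)
for r in (1.0, 2.0, 3.0):                                       # sweep the safety distance
    print(r, accuracy_f1(gt, est, r))
```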
Figure 8. Example social distance violation detection results using frames 1 to 2005 of the OxTown dataset with a 2 m social safety distance. (a,b) overlay the detection results with the averaged ODM and CDM, respectively.
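The averaged overlays can be produced by accumulating per-frame maps over the processed sequence, as sketched below. The CDM here counts, for each grid cell, the people within the safety radius r of the cell centre; this is an assumption made for illustration and not necessarily the paper's exact definition.

```python
# Average per-frame occupancy (ODM) and crowd density (CDM) maps over a sequence.
import numpy as np

def occupancy_map(points_m, extent, delta):
    bins = (np.arange(0, extent[0] + delta, delta),
            np.arange(0, extent[1] + delta, delta))
    odm, _, _ = np.histogram2d(points_m[:, 0], points_m[:, 1], bins=bins)
    return odm

def crowd_map(points_m, extent, delta, r):
    nx, ny = int(extent[0] / delta), int(extent[1] / delta)
    cx, cy = np.meshgrid((np.arange(nx) + 0.5) * delta,
                         (np.arange(ny) + 0.5) * delta, indexing="ij")
    centres = np.stack([cx, cy], axis=-1)                                 # (nx, ny, 2) cell centres
    d = np.linalg.norm(centres[..., None, :] - points_m[None, None, :, :], axis=-1)
    return (d <= r).sum(axis=-1).astype(float)                            # people within r per cell

frames = [np.array([[2.1, 3.0], [2.8, 3.4]]),
          np.array([[2.3, 3.1], [6.5, 4.0], [2.9, 3.5]])]                 # toy world positions per frame
odm_avg = np.mean([occupancy_map(p, (8, 8), 1.0) for p in frames], axis=0)
cdm_avg = np.mean([crowd_map(p, (8, 8), 1.0, r=2.0) for p in frames], axis=0)
print(odm_avg.shape, cdm_avg.max())
```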
Figure 9. The computational complexity analysis results in terms of frame and processing rates. The proposed approach is tested with and without the smoothing/tracking stage. The results are grouped by the number of detected/tracked subjects.
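A small sketch of how processing rates grouped by the number of detected/tracked subjects can be measured with wall-clock timing; process_frame is a hypothetical stand-in for the full per-frame pipeline and is assumed to return the number of subjects it handled.

```python
# Measure mean processing rate (frames per second) grouped by subject count.
import time
from collections import defaultdict

def benchmark(frames, process_frame):
    rates = defaultdict(list)
    for frame in frames:
        t0 = time.perf_counter()
        n_subjects = process_frame(frame)          # hypothetical: returns the subject count
        dt = time.perf_counter() - t0
        rates[n_subjects].append(1.0 / dt if dt > 0 else float("inf"))
    return {n: sum(v) / len(v) for n, v in rates.items()}   # mean FPS per group

# Toy usage: a dummy per-frame function that "processes" two subjects in ~2 ms.
print(benchmark(range(5), lambda f: (time.sleep(0.002), 2)[1]))
```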
Table 1. The optimized smoothing and tracking parameters of the proposed system for each video sequence.
Sequence | σ1 | σ2 | σ3 | γg | γc | γd
6p-c06 × 10 9 0.2042.0139876−186
6p-c18.210.3298.7839964−71
6p-c20.2160.2780.2715252−84
6p-c37 × 10 3 0.0151.7610486−106
OxTown2 × 10 4 0.8730.1422311−26
C17 × 10 9 6 × 10 9 2 × 10 5 111−92
C20.002 10 9 1.46128−181
C30.020.00786 × 10 8 1392−5
C44290.0220.062813−2
C5 10 8 2.053.79153−41
C69 × 10 3 10 7 4 × 10 8 159−106
C75 × 10 7 0.00064 × 10 7 3951−2
Table 2. The performance evaluation results in terms of PDR, localization relative error, accuracy, F1-score, precision, recall, VCR, CORR, and IOU, averaged across the range of safety distances and summarized for each video sequence. The proposed approach is evaluated with (✓) and without (✕) the smoothing/tracking stage (S/T). Best results are in bold to ease interpretation, and results depicting the highest gain are in brackets for comparison.
(Columns 6p-c0–6p-c3 belong to EPFL-MPV; C1–C7 belong to EPFL-Wildtrack.)

Measure   | Approach   | S/T | 6p-c0  | 6p-c1 | 6p-c2 | 6p-c3 | OxTown | C1   | C2   | C3   | C4   | C5   | C6     | C7   | Overall
PDR       | Basic [26] |  -  | 90.9   | 90.7  | 87.5  | 87.3  | 85.7   | 59.2 | 56.6 | 74.2 | 87.3 | 78.2 | (39.9) | 91.3 | 85.4
PDR       | Proposed   |  ✕  | 93.8   | 94.0  | 89.1  | 88.1  | 88.4   | 64.7 | 62.6 | 79.1 | 88.5 | 80.8 | 43.5   | 92.5 | 87.9
PDR       | Proposed   |  ✓  | 95.6   | 96.5  | 91.9  | 90.8  | 89.8   | 84.4 | 83.4 | 79.2 | 88.5 | 83.9 | (82.9) | 92.6 | 91.5
Error     | Basic [26] |  -  | 17.0   | 17.2  | 23.0  | 21.9  | 24.0   | 51.3 | 52.8 | 49.0 | 41.3 | 33.8 | (80.4) | 15.4 | 24.7
Error     | Proposed   |  ✕  | 12.6   | 13.2  | 20.7  | 20.6  | 20.8   | 47.0 | 47.1 | 49.8 | 39.8 | 30.9 | 76.1   | 14.3 | 21.7
Error     | Proposed   |  ✓  | 10.7   | 10.7  | 16.6  | 17.8  | 19.1   | 31.9 | 36.7 | 48.3 | 36.3 | 27.0 | (42.0) | 14.0 | 18.0
Accuracy  | Basic [26] |  -  | 91.0   | 89.2  | 83.3  | 85.6  | 92.8   | 95.5 | 97.7 | 92.5 | 88.0 | 92.3 | (82.7) | 97.0 | 89.3
Accuracy  | Proposed   |  ✕  | 94.1   | 92.7  | 86.5  | 88.2  | 94.5   | 96.5 | 98.6 | 93.5 | 88.5 | 93.0 | 87.0   | 96.9 | 91.8
Accuracy  | Proposed   |  ✓  | 94.9   | 94.6  | 88.2  | 89.8  | 95.6   | 99.4 | 99.7 | 93.5 | 88.1 | 92.7 | (99.7) | 96.9 | 93.3
F1-score  | Basic [26] |  -  | 90.7   | 89.1  | 80.7  | 83.8  | 95.9   | 97.7 | 98.8 | 95.9 | 79.2 | 95.7 | (90.2) | 98.1 | 89.5
F1-score  | Proposed   |  ✕  | 94.4   | 92.9  | 85.0  | 87.5  | 96.9   | 98.2 | 99.3 | 96.5 | 81.2 | 96.0 | 92.9   | 98.1 | 92.3
F1-score  | Proposed   |  ✓  | 95.2   | 94.9  | 87.0  | 89.5  | 97.5   | 99.7 | 99.8 | 96.5 | 80.0 | 95.9 | (99.8) | 98.1 | 93.6
Precision | Basic [26] |  -  | 98.2   | 98.1  | 98.4  | 95.6  | 97.4   | 100  | 100  | 98.7 | 88.8 | 100  | 100    | 99.0 | 97.6
Precision | Proposed   |  ✕  | 96.0   | 97.4  | 96.8  | 93.4  | 97.2   | 100  | 100  | 97.4 | 85.0 | 99.8 | 100    | 98.6 | 96.4
Precision | Proposed   |  ✓  | 95.7   | 97.0  | 96.4  | 91.7  | 97.0   | 100  | 100  | 97.4 | 86.4 | 98.2 | 100    | 98.6 | 95.9
Recall    | Basic [26] |  -  | 84.8   | 82.6  | 69.5  | 75.4  | 94.4   | 95.5 | 97.7 | 93.5 | 71.6 | 91.7 | (82.7) | 97.3 | 83.7
Recall    | Proposed   |  ✕  | 92.8   | 89.3  | 76.5  | 82.6  | 96.6   | 96.5 | 98.6 | 95.8 | 77.9 | 92.6 | 86.9   | 97.6 | 89.0
Recall    | Proposed   |  ✓  | 94.6   | 93.0  | 79.8  | 87.5  | 98.1   | 99.4 | 99.7 | 95.7 | 74.7 | 93.7 | (99.7) | 97.6 | 91.8
VCR       | Basic [26] |  -  | 81.3   | 78.5  | 78.1  | 78.7  | 64.8   | 33.5 | 36.5 | 46.1 | 86.7 | 72.6 | (20.9) | 86.8 | 72.2
VCR       | Proposed   |  ✕  | 84.3   | 83.7  | 80.1  | 80.1  | 67.6   | 38.4 | 43.1 | 48.3 | 85.9 | 75.9 | 26.0   | 89.7 | 75.1
VCR       | Proposed   |  ✓  | 86.0   | 86.6  | 81.4  | 79.6  | 65.1   | 62.5 | 63.2 | 47.5 | 86.5 | 73.3 | (59.8) | 89.7 | 77.0
CORR      | Basic [26] |  -  | 98.3   | 99.1  | 98.9  | 98.9  | (85.3) | 89.1 | 73.4 | 85.8 | 96.4 | 88.2 | 72.3   | 98.8 | 93.8
CORR      | Proposed   |  ✕  | 99.2   | 99.4  | 99.2  | 99.3  | 89.4   | 90.0 | 72.5 | 86.0 | 96.8 | 90.7 | 72.3   | 98.6 | 95.1
CORR      | Proposed   |  ✓  | 99.4   | 99.2  | 99.2  | 99.2  | (89.9) | 89.3 | 77.3 | 86.3 | 97.1 | 91.1 | 72.8   | 98.6 | 95.3
IOU       | Basic [26] |  -  | (74.7) | 86.2  | 84.1  | 84.0  | 52.2   | 51.8 | 47.1 | 63.8 | 61.9 | 44.7 | 13.9   | 83.8 | 70.8
IOU       | Proposed   |  ✕  | 83.4   | 89.6  | 85.3  | 86.1  | 61.0   | 55.7 | 51.3 | 61.5 | 66.9 | 47.3 | 14.4   | 83.4 | 75.6
IOU       | Proposed   |  ✓  | (87.1) | 89.2  | 86.3  | 85.7  | 63.8   | 55.3 | 55.2 | 61.6 | 68.2 | 50.7 | 22.3   | 83.4 | 77.1
Table 3. Comparison of social distance violation detection performance on the OxTown dataset with a 2 m social safety distance. The results of Yang et al. are extracted from Table 6 in [15]. The proposed approach is compared with (✓) and without (✕) the smoothing/tracking stage (S/T). Best results are in bold to ease interpretation, and results that are used in the discussion are in brackets.
Method            | Accuracy | F1-Score | Precision | Recall
Yang et al. [15]  | (92.8)   | (95.6)   | 95.4      | 95.9
Basic [26]        | 96.0     | 97.9     | 98.9      | 96.8
Proposed, S/T: ✕  | (97.4)   | (98.6)   | 98.8      | 98.4
Proposed, S/T: ✓  | (98.3)   | (99.1)   | 98.7      | 99.5