Documentation K - TS
2020-2024
KOMMURI PRATAP REDDY INSTITUTE OF TECHNOLOGY
(Affiliated to JNTUH, Ghanpur(V), Ghatkesar(M), Medchal(D)-506345)
CERTIFICATE
ABSTRACT
The COVID-19 pandemic has presented unprecedented challenges to public health and society
at large, necessitating innovative approaches for understanding, tracking, and mitigating its
impact. In this context, the utilization of big data has emerged as a crucial tool for researchers,
healthcare professionals, and policymakers. This abstract provides an overview of the
significant contributions and applications of big data analytics in the study of COVID-19.
Big data sources encompass a wide array of data types, including epidemiological data, clinical
records, genomic sequences, social media posts, mobility data, and more. The integration and
analysis of these diverse datasets have enabled researchers to gain valuable insights into the
dynamics of the virus, its transmission patterns, and the effectiveness of public health
interventions.
Machine learning and artificial intelligence techniques have played a pivotal role in predicting
disease spread, identifying potential hotspots, and optimizing resource allocation for healthcare
systems. These predictive models have been instrumental in guiding decision-makers and
public health authorities in their response to the pandemic, helping to save lives and reduce the
strain on healthcare facilities.
Furthermore, big data has facilitated the rapid development of diagnostic tools, such as
COVID-19 testing algorithms and contact tracing applications. These innovations have been
critical in identifying and isolating cases promptly, thus curbing the virus's spread.
Additionally, genetic sequencing data has enabled the monitoring of viral mutations and the
development of targeted vaccines and therapeutics.
The social and economic impact of COVID-19 has also been studied extensively using big data
analytics. Researchers have analyzed trends in job loss, economic activity, and mental health
through data derived from social media, online job platforms, and surveys. These insights have
informed policymakers in devising support measures and stimulus packages for affected
populations.
However, the use of big data in the context of COVID-19 also raises ethical concerns related
to privacy, data security, and bias. Striking a balance between harnessing the power of big data
for public health and safeguarding individual rights remains an ongoing challenge.
In conclusion, the integration of big data analytics has been instrumental in understanding,
managing, and mitigating the COVID-19 pandemic. It has empowered researchers and
healthcare professionals with valuable insights and tools for decision-making. Nevertheless,
the responsible and ethical use of big data in the context of public health emergencies remains
a topic of ongoing discussion and regulation.
LIST OF CONTENTS
1. Introduction
1.1 Background and Significance
1.2 Objectives
2. Description of work
3. Big Data on COVID-19
3.1 Background and related work
3.1.1 COVID-19 Research
3.1.2 Confirmed Cases and Mortality
3.2 Our data science solution:
3.2.1 Data Collection and Integration
3.2.2 Data Preprocessing
6.Applications
6.1 BlueDot's Early Warning System:
6.2 Google's COVID-19 Community Mobility Reports:
6.3 COVID-19 Genome Sequencing:
6.4 Contact Tracing Apps:
6.5 Hospital Resource Allocation with Predictive Analytics:
8.CONCLUSIONS
9.REFERENCES
1.INTRODUCTION
In the current era of big data, high volumes of data can be generated and collected from a
wide variety of rich data sources at a rapid rate. Due to differences in their level of veracity,
some of these big data are precise while others are imprecise and uncertain. Embedded in these
big data are useful information and valuable knowledge that can be discovered by big data
science and engineering (BigDataSE), which applies techniques from various related areas
(such as data mining, machine learning, and mathematical and statistical modeling) to
real-life applications and services and/or for social good. Examples of rich sources of these
valuable big data include epidemiological data, clinical records, genomic sequences, social
media posts, and mobility data.
1.2 OBJECTIVE
Knowledge discovered from these big data would be valuable. For instance, knowledge
discovered from the epidemiological data—such as data related to cases who suffered from
viral diseases like (a) severe acute respiratory syndrome (SARS) that broke out in 2002–2004,
(b) Middle East respiratory syndrome (MERS) that broke out in 2012–2015, and (c)
coronavirus disease 2019 (COVID-19) that broke out in 2019 and became pandemic in 2020—
helps researchers, epidemiologists and policy makers to get a better understanding of the
disease. This, in turn, may inspire them to come up with ways to detect, control and combat the
disease.
2.DESCRIPTION OF WORK
Big data has played an instrumental role in our fight against the COVID-19 pandemic. With
the vast amount of information generated daily, it has become a powerful tool for researchers,
healthcare professionals, and policymakers alike. One crucial data source is healthcare records,
including electronic health records (EHRs) and patient data. These records contain a wealth of
information about symptoms, treatment outcomes, and comorbidities, allowing healthcare
providers to track the progression of the disease and make informed decisions regarding patient
care. This data has not only assisted in individual patient management but also in understanding
the broader trends and patterns of the virus's spread.
Epidemiological data has been another vital source of information. Tracking the number of
confirmed cases, deaths, and recoveries, along with contact tracing data, has enabled health
authorities to monitor the virus's spread and identify potential hotspots. This information has
been instrumental in shaping public health policies, from implementing social distancing
measures to targeted lockdowns.
Genomic data has also been critical in the fight against COVID-19. By sequencing the virus's
genome, scientists have been able to monitor its mutations and adapt vaccine and treatment
strategies accordingly. Genomic data has played a significant role in the rapid development
and modification of vaccines, helping to curb the pandemic's impact.
Social media and internet data have provided real-time insights into public sentiment and
information dissemination. Analyzing social media platforms has allowed authorities to gauge
public perceptions and address concerns effectively. It has also been used to identify and
combat misinformation, a critical aspect of managing a public health crisis.
Mobility data, collected through GPS and mobile devices, has helped understand the
effectiveness of lockdowns and social distancing measures. By tracking the movement of
people, governments and health officials have been able to make informed decisions about
when and where to implement restrictions, reducing the virus's transmission.
Supply chain data has been indispensable in ensuring the timely distribution of medical
supplies. By analyzing supply chain disruptions and predicting demand, healthcare systems
have been better prepared to allocate resources where they are needed most, such as personal
protective equipment (PPE) and ventilators.
3.BIG DATA ON COVID-19
3.1.1 COVID-19 Research
Partially because of the COVID-19 pandemic, many researchers have
explored different aspects of the disease. This has led to numerous works on
COVID-19 in different disciplines or areas:
• For medical and health sciences, there have been (a) systematic reviews of the literature
on medical research on COVID-19, (b) clinical and treatment information, as well
as (c) drug discovery and vaccine development.
• For social sciences, there have been studies on crisis management for the COVID-19
outbreak.
• For natural sciences and engineering (NSE), there have been works focusing on (a)
artificial intelligence (AI)-driven informatics, sensing and imaging for tracking, testing,
diagnosis, treatment and prognosis, such as imaging-based diagnosis of COVID-19
using chest computed tomography (CT) images, and (b) mathematical modelling of
the spread of COVID-19.
Instead of projecting the spread of the disease, our data science solution discovers common
characteristics among COVID-19 cases belonging to a certain gender, age group combination,
and compares them with those belonging to other combinations.
The overall numbers of confirmed cases and mortality are important in showing the
severity of the disease at a specific time or over a time interval. However, it is equally important to:
• explore the breakdown of these numbers among different gender and/or age groups,
and
• discover other useful knowledge (e.g., symptoms, clinical course and outcomes,
transmission methods) from the epidemiological data.
A reason is that the discovered knowledge can reveal useful information (e.g., some
characteristics of COVID-19 cases) associated with the disease. This, in turn, helps users to get
a better understanding of the characteristics of the confirmed cases of COVID-19 (rather than just
the numbers of cases).
3.2 Our data science solution:
Recall from Section 1 that big COVID-19 epidemiological data can be characterized by their
variety in two aspects. First, the data can be generated and collected from a wide variety of
data sources. As a concrete example, in Canada, healthcare is a responsibility of provincial
governments. So, Canadian COVID-19 epidemiological data are gathered from each province
(or territory), and provincial data are obtained from health regions (which are also known as
health authorities) within the province.
Second, the big COVID-19 epidemiological data can contain a wide variety of information,
which usually includes:
• administrative information, such as (a) a unique privacy-preserving identifier for
each case, (b) its location, and (c) episode day (i.e., symptom onset day or its closest
day).
• case details, such as (a) gender, (b) age, and (c) specific occupation of the cases.
3.2.2 Data Preprocessing
After collecting and integrating data from heterogeneous sources, we preprocess the
integrated data. Recall from Section 1 that big COVID-19 epidemiological
data can be characterized by their veracity. Specifically, we observe that some
information is missing, unstated or unknown (i.e., NULL values). Given the nature of these
COVID-19 cases (e.g., timely reporting of cases, privacy preservation of the identity
of cases), it is not unusual to have NULL values because values may not be available or
recorded. For some other attributes related to case details (e.g., personal information like
gender, age), patients may prefer not to report them due to privacy concerns. As there are
many cases with NULL values for some attributes, ignoring them may lead to inaccurate
or incomplete analysis of the data. Instead, our solution keeps all these cases for data
science.
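The idea of keeping cases with NULL values can be sketched as follows. This is a minimal illustration, not the original implementation: the record layout, the field names and the "Not stated" label are all assumptions introduced here. Each NULL is mapped to an explicit category so that the case stays available for later counting and mining.

```python
from collections import Counter

NOT_STATED = "Not stated"  # hypothetical label for NULL / unknown values


def fill_missing(case, attributes):
    """Return a copy of the case in which NULL (None) attributes are made explicit."""
    return {
        attr: (case.get(attr) if case.get(attr) is not None else NOT_STATED)
        for attr in attributes
    }


# Hypothetical records; None marks a missing, unstated or unknown value.
cases = [
    {"gender": "Female", "age": 34, "occupation": "Healthcare worker"},
    {"gender": None, "age": 52, "occupation": None},
    {"gender": "Male", "age": None, "occupation": "Other"},
]

attributes = ["gender", "age", "occupation"]
cleaned = [fill_missing(c, attributes) for c in cases]

# All three cases are kept; missing values are counted rather than discarded.
gender_counts = Counter(c["gender"] for c in cleaned)
print(len(cleaned), dict(gender_counts))
```

Treating "Not stated" as its own category also makes it easy to include or exclude such values at analysis time with a simple filter.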
Some attributes (e.g., date) would be too specific for the analysis. Moreover,
delays in testing or reporting (especially due to weekends) are not uncommon. Hence,
it is logical to group days into a 7-day interval, i.e., a week. For example,
all days within the week of January 19-25 inclusive are considered as Week 3. Side
benefits of such grouping include:
• Summing the frequency of cases over a week (cf. a single day) increases the
chance of having sufficient frequency for being discovered as a frequent pattern
and getting statistically significant mining results.
• Generalizing the cases helps preserve the privacy of the individuals while
maintaining the utility for knowledge discovery.
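The weekly grouping can be sketched as below. The text does not state the year or the week convention, so this sketch assumes the year 2020, weeks running Sunday to Saturday, and Week 1 starting on the first Sunday of the year; under those assumptions January 19-25 falls in Week 3, matching the example above. The episode days are hypothetical.

```python
from collections import Counter
from datetime import date


def week_number(d):
    """Week index with Sunday-to-Saturday weeks (assumed convention).

    Week 1 starts on the first Sunday of the year; earlier days are Week 0.
    """
    jan1 = date(d.year, 1, 1)
    days_to_first_sunday = (7 - jan1.isoweekday()) % 7  # isoweekday: Mon=1 .. Sun=7
    offset = (d - jan1).days - days_to_first_sunday
    return 0 if offset < 0 else offset // 7 + 1


# Hypothetical episode days for a handful of cases.
episode_days = [
    date(2020, 1, 19),
    date(2020, 1, 22),
    date(2020, 1, 25),
    date(2020, 1, 26),
]

# Summing cases per week instead of per day.
weekly_counts = Counter(week_number(d) for d in episode_days)
print(dict(weekly_counts))
```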
Similarly, for some attributes (e.g., age, occupation), it would be logical to group similar
values into a mega-value (say, ages can be binned into age groups). For example:
• grouping ages to age groups (e.g., ≤ 19 years old, 20-29 years old, ..., 70-79 years
old, ≥ 80 years old);
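A minimal sketch of the age binning described above. The function name and the ASCII labels ("<=19", ">=80", "Not stated") are illustrative assumptions; only the bin boundaries come from the text.

```python
def age_group(age):
    """Map an age in years to one of the age groups listed in the text."""
    if age is None:
        return "Not stated"  # NULL ages are kept as their own category (assumed label)
    if age <= 19:
        return "<=19"
    if age >= 80:
        return ">=80"
    decade = (age // 10) * 10
    return f"{decade}-{decade + 9}"


# Hypothetical ages, including one missing value.
ages = [7, 19, 20, 45, 79, 80, None]
groups = [age_group(a) for a in ages]
print(groups)
```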
4.DISCUSSION AND EVALUATION RESULT
4.1.2 Big Data Science on Cases :
Once the data are preprocessed, our data science solution first analyzes and mines the
national data. With 201,341 COVID-19 cases with stated gender and age (out of an estimated
Canadian population of 38,005,238), the solution reveals that about 0.53% of the population
contracted the disease.
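The national percentage quoted above can be reproduced with a short calculation. The helper name is an assumption introduced here; the two input figures are the ones stated in the text. The same division applies to any gender and age group combination, given its case count and population.

```python
def percentage(cases, population):
    """Relative frequency of cases with respect to a population, in percent."""
    return 100.0 * cases / population


national_cases = 201_341
national_population = 38_005_238

national_pct = percentage(national_cases, national_population)
print(f"{national_pct:.2f}%")
```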
TABLE I.
Table I confirms the above observations. Moreover, it also reveals that (a) age groups 20s-40s
and 80+ (as well as females in their 50s) appear to be more vulnerable to the disease, as they
have higher COVID-19 percentages than the national norm. Here, the percentage is computed
by dividing the number of cases in a specific group (i.e., a specific (gender, age group)
combination) by the population of the corresponding combination. For instance, 19,049 cases
of males in their 20s correspond to about 0.73% of this population group of about 2.6 million males in
their 20s. (b) Among all age groups, seniors in 80+ have the highest risk, with a COVID-19
percentage of 1.31% of their corresponding population (cf. the national norm of 0.53% of the
national population).
The table also reveals that (c) females appear to be slightly more vulnerable to the disease than
their male counterparts. (d) In all age groups from 0-59, percentages of female COVID-19 cases
are slightly higher than those of their male counterparts. (e) For age groups 60s-70s, the opposite is
observed. (f) Among all age groups, females aged 80+ have the highest risk, with a COVID-19
percentage of 1.49% of their corresponding population (cf. 1.04% of males aged 80+).
4.1.3 Big Data Science on Hospital Status:
Our solution also examines the hospital status among the 16 combinations. Table II reveals that (a) as
age increases, the absolute number of hospitalized cases also increases. When combined with
Table I, we observe that (b) although the number of cases decreases from age groups 20s to 70s,
the number of hospitalizations increases. This means that, when young people catch COVID-19,
a majority of them do not need to be hospitalized. As people age, their chance of
requiring hospitalization once they catch COVID-19 increases. (c) Between the two genders,
more males aged 30+ are admitted into the ICU than females.
Cells in Table III show the percentage of hospitalized cases with respect to COVID-19
patients in their corresponding gender, age group combination. For instance, 38 male COVID-19
patients in their 20s admitted to the ICU (as shown in Table II) account for 0.77% (as shown
in Table III) of all 19,049 male COVID-19 patients in their 20s. Table III reveals that (a) for
seniors 60+, the hospitalization percentages among all COVID-19 cases are high, ranging
from 14.01% to 25.57% (cf. national norm of 7.05%), and peak at 70s. In particular, (b) males
in their 70s have the highest percentages of both ICU admission (8.51% with respect to COVID-19
cases for males in 70s) and hospitalization (8.51% + 20.02% = 28.53%). In contrast, (c) males in their 80s
have the highest percentage of non-ICU hospitalization (23.70%).
4.1.4 Big Data Science on Occupation Groups:
Our solution also examines different occupation groups. Table IV shows the number of
healthcare workers for some gender, age group combinations (and their percentages with respect to
COVID-19 cases in the corresponding combination). It reveals that (a) female healthcare
workers in their 30s-50s account for more than a quarter of COVID-19 cases in their respective
combinations. For instance, 5,308 (33.49%) of 15,851 COVID-19 cases for females in their 40s
are healthcare workers. (b) In terms of both absolute number (of cases) and relative
number (with respect to cases in their combinations), female healthcare workers have much higher
numbers (about 4x higher) than their male counterparts. For completeness, Table IV also
includes the total numbers for all age groups (including 0-19 and 70+) in the bottom row.
5. Big Data Analysis and Processing
HDFS is a distributed file system designed to store and manage vast amounts of data
across a cluster of commodity hardware. It's a core component of the Hadoop ecosystem
and is highly fault-tolerant and scalable.
NoSQL databases are a category of databases that are designed to handle large volumes
of unstructured or semi-structured data. Some popular NoSQL databases for big data
storage include:
• MongoDB: A document-oriented NoSQL database.
• Cassandra: A distributed NoSQL database designed for scalability.
• HBase: A distributed, scalable, and consistent NoSQL database for use with Hadoop.
Cloud providers offer various services for big data processing, machine learning,
and analytics, making it easier for organizations to scale their infrastructure based on
their data storage and processing needs. Major providers include:
• Amazon Web Services (AWS): AWS offers a wide range of cloud-based storage
and computing services, including Amazon S3 for storage, Amazon EC2 for
virtual servers, and Amazon Redshift for data warehousing.
• Microsoft Azure: Azure provides cloud-based storage solutions like Azure Blob
Storage and Azure Data Lake Storage, as well as virtual machines (Azure VMs)
and Azure SQL Data Warehouse for data processing and analytics.
• Google Cloud Platform (GCP): GCP offers Google Cloud Storage for object
storage.
5.2 Data Analysis:
The analytical techniques listed below have not only helped in understanding the spread and impact of
COVID-19 but have also guided public health measures, resource allocation, vaccination
strategies, and policy decisions during the pandemic. It's important to note that the field of
COVID-19 data analysis is dynamic, with new techniques and insights continually emerging
as the situation evolves.
• Incidence and Prevalence: Calculations of incidence rates (new cases per unit
of time) and prevalence rates (total cases in a population) are essential for
tracking the spread of the virus.
• Time Series Plots: Line charts and time series plots are used to visualize the
progression of COVID-19 cases, deaths, and recoveries over time. These help
identify trends and spikes.
• Geospatial Maps: Maps with color-coded regions or markers are used to
visualize the geographic distribution of cases, helping identify hotspots and
areas in need of targeted interventions.
• Histograms and Bar Charts: These are used to display distributions of variables
like age, comorbidities, and vaccination rates among COVID-19 patients.
• Heatmaps: Heatmaps can reveal correlations and patterns in data, such as the
spread of the virus in relation to population density or climate factors.
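As an illustration of the incidence and prevalence calculations mentioned in the list above, the sketch below expresses both measures per 100,000 people. All figures (population, daily counts, active cases) are hypothetical, and the function names are assumptions introduced here.

```python
def incidence_rate(new_cases, population_at_risk, per=100_000):
    """New cases during a period, per `per` people at risk."""
    return new_cases / population_at_risk * per


def prevalence(existing_cases, population, per=100_000):
    """Existing cases at a point in time, per `per` people."""
    return existing_cases / population * per


# Hypothetical region and one week of daily new-case counts.
population = 1_000_000
daily_new_cases = [120, 135, 150, 110, 90, 160, 145]
active_cases = 2_500

weekly_incidence = incidence_rate(sum(daily_new_cases), population)
point_prevalence = prevalence(active_cases, population)

print(f"Weekly incidence: {weekly_incidence:.1f} per 100,000")
print(f"Point prevalence: {point_prevalence:.1f} per 100,000")
```

Expressing both measures per 100,000 people makes regions of different sizes directly comparable, which raw counts do not allow.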
5.3 Machine Learning and Predictive Analytics:
Machine learning algorithms have played a significant role in forecasting COVID-19 trends
and assisting in various aspects of the pandemic response. These models leverage historical
data, such as infection rates, vaccination data, mobility patterns, and other relevant information,
to make predictions and inform decision-making. Here are some examples of how machine
learning has been used for forecasting COVID-19 trends:
• SEIR Models: Machine learning can be used to optimize the parameters of traditional
epidemiological models like Susceptible-Exposed-Infectious-Removed (SEIR)
models. These models take into account various factors such as population
demographics, mobility data, and social interactions to forecast infection rates over
time.
• Machine learning models can analyze vaccine trial data to predict the efficacy of
different vaccines against variants of the virus. This helps health authorities make
informed decisions about which vaccines to prioritize.
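To make the SEIR model above concrete, the following sketch integrates the standard SEIR equations with simple Euler steps. It shows only the plain compartmental model, not the machine learning step that would tune its parameters, and every parameter value (population size, `beta`, `sigma`, `gamma`, initial counts) is a hypothetical choice for illustration.

```python
def simulate_seir(population, exposed0, infectious0, beta, sigma, gamma, days, dt=0.1):
    """Integrate the SEIR equations with fixed-step Euler updates.

    beta: transmission rate, sigma: rate of leaving the exposed state,
    gamma: removal rate (all per day). Returns one (S, E, I, R) tuple per day.
    """
    s = float(population - exposed0 - infectious0)
    e, i, r = float(exposed0), float(infectious0), 0.0
    history = [(s, e, i, r)]
    steps_per_day = round(1 / dt)
    for _ in range(days):
        for _ in range(steps_per_day):
            new_exposed = beta * s * i / population * dt
            new_infectious = sigma * e * dt
            new_removed = gamma * i * dt
            s -= new_exposed
            e += new_exposed - new_infectious
            i += new_infectious - new_removed
            r += new_removed
        history.append((s, e, i, r))
    return history


# Hypothetical parameter values chosen only for illustration.
N = 1_000_000
history = simulate_seir(N, exposed0=0, infectious0=10,
                        beta=0.5, sigma=1 / 5, gamma=1 / 10, days=160)

# Day on which the infectious compartment is largest.
peak_day = max(range(len(history)), key=lambda day: history[day][2])
print(f"Infectious count peaks on day {peak_day}")
```

A fitting procedure would wrap this simulation in a loop that adjusts `beta`, `sigma` and `gamma` until the simulated curve matches observed case counts.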
5.4 Real-time Data Streaming
Real-time data streaming plays a crucial role in tracking and managing pandemics, as it enables
public health authorities and researchers to make informed decisions quickly and effectively.
Here's a discussion on the importance of real-time data in tracking the pandemic and how
technologies like Apache Kafka and Apache Flink can be employed for real-time data
processing:
5.4.1 Importance of Real-time Data in Tracking the Pandemic:
• Rapid Response: Pandemics, such as COVID-19, require rapid response measures to
contain the spread of the virus. Real-time data provides up-to-the-minute information
on infection rates, hospitalizations, and other critical metrics. This enables authorities
to react swiftly to emerging hotspots and adjust containment strategies in real-time.
• Resource Allocation: Real-time data helps in the allocation of healthcare resources like
hospital beds, ventilators, and medical personnel. By monitoring the influx of patients
in real-time, hospitals can optimize resource distribution, ensuring that critically ill
patients receive the care they need promptly.
• Contact Tracing: Effective contact tracing is vital for containing the spread of a
pandemic. Real-time data allows contact tracers to identify potential exposures quickly
and notify individuals at risk, helping to break the chains of transmission.
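The kind of windowed aggregation a stream processor performs on incoming reports can be sketched in plain Python. This is a stand-in for illustration only: it does not use the Apache Kafka or Apache Flink APIs, and the event format (a timestamp in seconds plus a region code) is an assumption introduced here.

```python
from collections import Counter, defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per region in fixed, non-overlapping time windows.

    events: iterable of (timestamp_in_seconds, region) pairs.
    Returns {window_start: Counter({region: count})}.
    """
    windows = defaultdict(Counter)
    for timestamp, region in events:
        window_start = timestamp - (timestamp % window_seconds)
        windows[window_start][region] += 1
    return dict(windows)


# Hypothetical stream of hospital admission events.
events = [(5, "A"), (42, "B"), (61, "A"), (75, "A"), (130, "B")]
counts = tumbling_window_counts(events, window_seconds=60)

for start in sorted(counts):
    print(start, dict(counts[start]))
```

In a deployed pipeline, the events would arrive continuously from a message broker and each window would be emitted as soon as it closes, rather than being computed over a finished list.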
6.Applications
Several successful Big Data applications in COVID-19 research have had a significant impact
on understanding the pandemic, managing its effects, and informing public health responses.
Here are a few notable examples:
6.1 BlueDot's Early Warning System:
• Impact: BlueDot's system identified the early signs of the COVID-19 outbreak in
Wuhan, China, before the World Health Organization (WHO) made its official
announcement. This early warning helped public health agencies and governments
prepare and respond more effectively.
6.3 COVID-19 Genome Sequencing:
• Use Case: Researchers worldwide used high-throughput DNA sequencing technologies
to analyze the genetic makeup of the SARS-CoV-2 virus responsible for COVID-19.
• Impact: Genome sequencing facilitated the identification of virus mutations and the
tracking of viral evolution. This data was crucial for vaccine development,
understanding transmission dynamics, and monitoring the emergence of new variants.
6.5 Hospital Resource Allocation with Predictive Analytics:
• Use Case: Hospitals and healthcare systems used predictive analytics models to forecast
COVID-19 patient admissions and optimize resource allocation.
• Impact: Predictive models helped hospitals prepare for surges in COVID-19 cases,
ensuring that they had adequate beds and ventilators.
7.Challenges and Limitations
Working with COVID-19 Big Data presents a variety of challenges, and there are also
limitations associated with the available data and analytics techniques. Let's explore some of
these challenges and limitations:
• Data Lag: There can be delays in reporting and updating data, which can hinder real-
time decision-making. Timely and accurate data is critical during a pandemic, and data
lags can have serious consequences.
• Model Uncertainty: Predictive models used for forecasting the spread of the virus or
estimating healthcare resource needs are subject to uncertainties. The rapidly evolving
nature of the pandemic makes it challenging to build accurate models.
• Resource Constraints: Organizations working with COVID-19 data often face resource
constraints, including limited access to skilled data scientists, computational resources,
and funding.
7.2 Limitations of Available Data and Analytics Techniques:
• Underreporting: Official COVID-19 case counts may underestimate the true extent of
the pandemic due to factors like limited testing availability, asymptomatic cases, and
underreporting in some regions.
• Data Bias: Data can be biased towards certain demographic groups or regions, leading
to disparities in analysis and decision-making. For example, marginalized communities
may have less access to testing and healthcare, resulting in skewed data.
• Data Access and Sharing: Data sharing and access can be restricted due to privacy
concerns and legal or regulatory barriers, limiting the ability of researchers to analyze
and use the data effectively.
• Model Generalization: Machine learning models developed using early pandemic data
may not generalize well to later stages of the pandemic when conditions and
interventions change.
• Emerging Variants: New variants of the virus can impact the accuracy of existing
models and necessitate ongoing data collection and analysis to understand their
behavior and implications.
8.CONCLUSIONS
In this paper, we presented a data science solution for conducting data science on big COVID-19
epidemiological data. The solution generalizes some attributes (e.g., age into age groups) for
effective analysis. Instead of ignoring unstated/NULL values of some attributes, the solution
provides users with the flexibility of including or excluding these values. It also provides users
with the flexibility to express their preferences (e.g., “must include symptoms”) in the mining of
frequent patterns. It discovers frequent patterns from each of the 16 (gender, age group)
combinations. Moreover, it compares and contrasts the discovered frequent patterns among
these combinations.
Taking into account differences in population and/or the number of cases in each of the 16
combinations, our solution computes relative frequency (with respect to population and/or the
number of cases in the respective combination) in addition to showing the absolute frequency
of the attributes and/or frequent patterns. Evaluation results show the practicality of our
solution in providing rich knowledge about characteristics of COVID-19 cases.
This helps researchers, epidemiologists and policy makers to get a better understanding of the
disease, which may inspire them to come up with ways to detect, control and combat the disease.
As ongoing and future work, we transfer knowledge learned from the current work to data
science on other big data in many real-life applications and services.
9.REFERENCES
1. Alsaig et al., "A critical analysis of the V-model of big data," in IEEE TrustCom/BigDataSE
2018, pp. 1809-1813.
2. A. Kobusinska et al., "Emerging trends, issues and challenges in Internet of Things, big data
and cloud computing," FGCS 87, 2018, pp. 416-419.
3. K. Kritikos, "Towards dynamic and optimal big data placement," in IEEE TrustCom/BigDataSE
2018, pp. 1730-1737.
4. C.K. Leung, "Big data analysis and mining," in Encyclopedia of Information Science and
Technology, 4e, 2018, pp. 338-348.
5. C.K. Leung, "Big data computing and mining in a smart world," in Big Data Analyses, Services,
and Smart Data, 2021, pp. 15-27.
6. B. Yin et al., "A cooperative edge computing scheme for reducing the cost of transferring big
data in 5G networks," in IEEE TrustCom/BigDataSE 2019, pp. 700-706.
7. K.E. Dierckens et al., "A data science and engineering solution for fast k-means clustering of
big data," in IEEE TrustCom/BigDataSE/ICESS 2017, pp. 925-932.
8. C.K. Leung, "Data science for big data applications and services: data lake management, data
analytics and visualization," in Big Data Analyses, Services, and Smart Data, 2021, pp. 28-44.
9. C.K. Leung, F. Jiang, "A data science solution for mining interesting patterns from uncertain
big data," in IEEE BDCloud 2014, pp. 235-242.
10. A.K. Chanda et al., "A new framework for mining weighted periodic patterns in time series
databases," ESWA 79, 2017, pp. 207-224.
11. A. Fariha et al., "Mining frequent patterns from human interactions in meetings using
directed acyclic graphs," in PAKDD 2013, Part I, pp. 38-49.
12. C.K. Leung, C.L. Carmichael, "FpVAT: A visual analytic tool for supporting frequent
pattern mining," ACM SIGKDD Explorations 11(2), 2009, pp. 39-48.