Documentation K - TS


A Technical Seminar Report on

BIG DATA ON COVID-19


Submitted to
JAWAHARLAL NEHRU TECHNOLOGICAL UNIVERSITY
HYDERABAD

In partial fulfilment of the requirement for the award of degree of


BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
By
M . KARTHIK REDDY
20RA1A0534

DEPARTMENT OF COMPUTER SCIENCE AND TECHNOLOGY

KOMMURI PRATAP REDDY INSTITUTE OF TECHNOLOGY

(Affiliated to JNTUH, Ghanpur(V), Ghatkesar(M), Medchal(D)-500088)

2020-2024

KOMMURI PRATAP REDDY INSTITUTE OF TECHNOLOGY
(Affiliated to JNTUH, Ghanpur(V), Ghatkesar(M), Medchal(D)-506345)

CERTIFICATE

This is to certify that the Technical Seminar entitled “BIG DATA ON
COVID-19” is submitted by Mr. M. Karthik Reddy, student of
Kommuri Pratap Reddy Institute of Technology in partial
fulfillment of the requirement for the award of the degree of Bachelor
of Technology in Computer Science and Engineering of the
Jawaharlal Nehru Technological University, Hyderabad during
the year 2023-24.

TSC & Internal Examiner HOD

------ ---- ----------

ABSTRACT

The COVID-19 pandemic has presented unprecedented challenges to public health and society
at large, necessitating innovative approaches for understanding, tracking, and mitigating its
impact. In this context, the utilization of big data has emerged as a crucial tool for researchers,
healthcare professionals, and policymakers. This abstract provides an overview of the
significant contributions and applications of big data analytics in the study of COVID-19.

Big data sources encompass a wide array of data types, including epidemiological data, clinical
records, genomic sequences, social media posts, mobility data, and more. The integration and
analysis of these diverse datasets have enabled researchers to gain valuable insights into the
dynamics of the virus, its transmission patterns, and the effectiveness of public health
interventions.

Machine learning and artificial intelligence techniques have played a pivotal role in predicting
disease spread, identifying potential hotspots, and optimizing resource allocation for healthcare
systems. These predictive models have been instrumental in guiding decision-makers and
public health authorities in their response to the pandemic, helping to save lives and reduce the
strain on healthcare facilities.

Furthermore, big data has facilitated the rapid development of diagnostic tools, such as
COVID-19 testing algorithms and contact tracing applications. These innovations have been
critical in identifying and isolating cases promptly, thus curbing the virus's spread.
Additionally, genetic sequencing data has enabled the monitoring of viral mutations and the
development of targeted vaccines and therapeutics.

The social and economic impact of COVID-19 has also been studied extensively using big data
analytics. Researchers have analyzed trends in job loss, economic activity, and mental health
through data derived from social media, online job platforms, and surveys. These insights have
informed policymakers in devising support measures and stimulus packages for affected
populations.

However, the use of big data in the context of COVID-19 also raises ethical concerns related
to privacy, data security, and bias. Striking a balance between harnessing the power of big data
for public health and safeguarding individual rights remains an ongoing challenge.

In conclusion, the integration of big data analytics has been instrumental in understanding,
managing, and mitigating the COVID-19 pandemic. It has empowered researchers and
healthcare professionals with valuable insights and tools for decision-making. Nevertheless,
the responsible and ethical use of big data in the context of public health emergencies remains
a topic of ongoing discussion and regulation.

LIST OF CONTENTS
1. Introduction
1.1 Background and Significance
1.2 Objectives
2. Description of work
3. Bigdata on covid-19
3.1 Background and related work
3.1.1 COVID-19 Research
3.1.2 Confirmed Cases and Mortality
3.2 Our data science solution:
3.2.1 Data Collection and Integration
3.2.2 Data Preprocessing

4. DISCUSSION AND EVALUATION RESULT

4.1 Case Study on Real-Life COVID-19 Data:


4.1.1 Data Collection, Integration and Preprocessing:
4.1.2 Big Data Science on Cases :
4.1.3 Big Data Science on Hospital Status:
4.1.4 Big Data Science on Occupation Groups:

5. Big Data Analysis and Processing


5.1 Big Data Storage Solutions:
5.1.1 Hadoop Distributed File System(HDFS):
5.1.2 NoSQL DataBases:
5.1.3 Cloud-Based Storage and Computing Options:
5.2 Data Analysis:
5.2.1 Descriptive Statistics:
5.2.2 Data Visualization:
5.2.3 Spatial Analysis:
5.3 Machine Learning and Predictive Analytics:
5.3.1 Epidemic Models:
5.3.2 Infection Rate Prediction:
5.3.3 Vaccine Distribution:
5.3.4 Vaccine Efficacy Prediction:
5.4 Real-time Data Streaming
5.4.1 Importance of Real-time Data in Tracking the Pandemic:
5.4.2 Technologies for Real-time Data Processing:

6.Applications
6.1 BlueDot's Early Warning System:
6.2 Google's COVID-19 Community Mobility Reports:
6.3 COVID-19 Genome Sequencing:
6.4 Contact Tracing Apps:
6.5 Hospital Resource Allocation with Predictive Analytics:

7.Challenges and Limitations


7.1 Challenges in Working with COVID-19 Big Data:
7.2 Limitations of Available Data and Analytics Techniques:

8.CONCLUSIONS

9.REFERENCES

1.INTRODUCTION

In the current era of big data, high volumes of data can be generated and collected from a
wide variety of rich data sources at a rapid rate. Due to differences in level of veracity, some
of these big data are precise while others are imprecise and uncertain. Embedded in these
big data are useful information and valuable knowledge that can be discovered by big data
science and engineering (BigDataSE), which applies techniques from various related areas
(such as data mining, machine learning, and mathematical and statistical modeling) to
real-life applications and services and/or for social good. Examples of rich sources of these
valuable big data include epidemiological data, clinical records, genomic sequences, social
media posts, and mobility data.

1.1 BACKGROUND AND SIGNIFICANCE


Epidemiological data are usually collected from a wide variety of data sources (e.g., regional health
authorities within a province, from which data are integrated and reported at higher levels such
as the national level). For instance, in the Canadian province of Manitoba, COVID-19 data can
be gathered from the Winnipeg Regional Health Authority (WRHA) and four other health
authorities. Moreover, a wide variety of data (e.g., gender, age, symptoms, clinical course and
outcomes, transmission methods) are collected too.

1.2 OBJECTIVE
Knowledge discovered from these big data would be valuable. For instance, knowledge
discovered from epidemiological data, such as data on patients who suffered from
viral diseases like (a) severe acute respiratory syndrome (SARS), which broke out in 2002–2004,
(b) Middle East respiratory syndrome (MERS), which broke out in 2012–2015, and (c)
coronavirus disease 2019 (COVID-19), which broke out in 2019 and became a pandemic in 2020,
helps researchers, epidemiologists and policy makers get a better understanding of the
disease. This, in turn, may inspire them to come up with ways to detect, control and combat the
disease.

2.DESCRIPTION OF WORK
Big data has played an instrumental role in our fight against the COVID-19 pandemic. With
the vast amount of information generated daily, it has become a powerful tool for researchers,
healthcare professionals, and policymakers alike. One crucial data source is healthcare records,
including electronic health records (EHRs) and patient data. These records contain a wealth of
information about symptoms, treatment outcomes, and comorbidities, allowing healthcare
providers to track the progression of the disease and make informed decisions regarding patient
care. This data has not only assisted in individual patient management but also in understanding
the broader trends and patterns of the virus's spread.

Epidemiological data has been another vital source of information. Tracking the number of
confirmed cases, deaths, and recoveries, along with contact tracing data, has enabled health
authorities to monitor the virus's spread and identify potential hotspots. This information has
been instrumental in shaping public health policies, from implementing social distancing
measures to targeted lockdowns.

Genomic data has also been critical in the fight against COVID-19. By sequencing the virus's
genome, scientists have been able to monitor its mutations and adapt vaccine and treatment
strategies accordingly. Genomic data has played a significant role in the rapid development
and modification of vaccines, helping to curb the pandemic's impact.

Social media and internet data have provided real-time insights into public sentiment and
information dissemination. Analyzing social media platforms has allowed authorities to gauge
public perceptions and address concerns effectively. It has also been used to identify and
combat misinformation, a critical aspect of managing a public health crisis.

Mobility data, collected through GPS and mobile devices, has helped understand the
effectiveness of lockdowns and social distancing measures. By tracking the movement of
people, governments and health officials have been able to make informed decisions about
when and where to implement restrictions, reducing the virus's transmission.

Supply chain data has been indispensable in ensuring the timely distribution of medical
supplies. By analyzing supply chain disruptions and predicting demand, healthcare systems
have been better prepared to allocate resources where they are needed most, such as personal
protective equipment (PPE) and ventilators.

3.BIGDATA ON COVID-19

3.1 BACKGROUND AND RELATED WORKS:

3.1.1 COVID-19 Research

Partially because of the COVID-19 pandemic, many researchers have
explored different aspects of the COVID-19 disease. This has led to numerous works on
COVID-19 in different disciplines or areas:
• For medical and health sciences, there have been (a) systematic reviews of the literature
on medical research on COVID-19, (b) clinical and treatment information, as well
as (c) drug discovery and vaccine development.
• For social sciences, there have been studies on crisis management for the COVID-19
outbreak.
• For natural sciences and engineering (NSE), there have been works focusing on (a)
artificial intelligence (AI)-driven informatics, sensing and imaging for tracking, testing,
diagnosis, treatment and prognosis, such as imaging-based diagnosis of COVID-19
using chest computed tomography (CT) images, and (b) mathematical modelling of
the spread of COVID-19.

Instead of projecting the spread of the disease, our data science solution discovers common
characteristics among COVID-19 cases belonging to a certain gender, age group combination,
and compares them with those belonging to other combinations.

3.1.2 Confirmed Cases and Mortality


Many existing works on COVID-19 epidemiological data focus simply on reporting the
numbers of confirmed cases and mortality spatially and/or temporally. In other words, they
highlight (a) spatial differences among continents, countries, or sovereignties and/or
(b) temporal trends, both of which may demonstrate how effectively different public health
strategies and mitigation techniques (such as social/physical distancing, stay-at-home orders,
and/or lockdowns) help in “flattening the (epidemic) curve”.

These overall numbers of confirmed cases and mortality are important in showing the
severity of the disease at a specific time or time interval. However, it is equally important to:

• explore the breakdown of these numbers among different gender and/or age groups,
and
• discover other useful knowledge (e.g., symptoms, clinical course and outcomes,
transmission methods) from the epidemiological data

A reason is that the discovered knowledge can reveal useful information (e.g., some
characteristics of COVID-19 cases) associated with the disease. This, in turn, helps users get
a better understanding of the characteristics of the confirmed cases of COVID-19 (rather than just
the numbers of cases).
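
As a rough illustration of such a breakdown, the following Python sketch counts cases and symptom occurrences per (gender, age group) combination. The record layout (dictionaries with `gender`, `age_group` and `symptoms` keys) and the sample records are assumptions made for illustration; they are not the report's actual dataset or implementation.

```python
from collections import Counter, defaultdict


def group_key(case):
    """Return the (gender, age group) combination, keeping NULLs as 'unknown'."""
    return (case.get("gender") or "unknown", case.get("age_group") or "unknown")


def breakdown_by_group(cases):
    """Count cases per (gender, age group) combination."""
    return Counter(group_key(case) for case in cases)


def symptom_counts_by_group(cases):
    """Count how often each symptom occurs within each combination."""
    counts = defaultdict(Counter)
    for case in cases:
        counts[group_key(case)].update(case.get("symptoms") or ())
    return counts


# Hypothetical records, for illustration only.
cases = [
    {"gender": "female", "age_group": "20-29", "symptoms": ["cough", "fever"]},
    {"gender": "female", "age_group": "20-29", "symptoms": ["cough"]},
    {"gender": "male", "age_group": ">=80", "symptoms": ["fever"]},
    {"gender": None, "age_group": "40-49", "symptoms": []},
]

for combination, count in breakdown_by_group(cases).items():
    print(combination, count)
```

Cases with missing values are counted under "unknown" rather than dropped, so the totals still add up to the number of reported cases.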

3.2 Our data science solution:

3.2.1 Data Collection and Integration

Recall from Section 1 that big COVID-19 epidemiological data can be characterized by their
variety in two aspects. First, the data can be generated and collected from a wide variety of
data sources. As a concrete example, in Canada, healthcare is a responsibility of provincial
governments. So, Canadian COVID-19 epidemiological data are gathered from each province
(or territory), and provincial data are obtained from health regions (which are also known as
health authorities) within the province.

Second, the big COVID-19 epidemiological data can contain a wide variety of information,
which usually includes:
• administrative information, such as (a) a unique privacy-preserving identifier for
each case, (b) its location, and (c) episode day (i.e., symptom onset day or its closest
day).

• case details, such as (a) gender, (b) age, and (c) specific occupation of the cases.

• symptom-related data, such as a Boolean indicator of whether the case is
asymptomatic or not. If not (i.e., symptomatic case), additional information is
captured, which includes:
1. onset day of symptoms, and
2. a collection of symptoms (including cough, fever, chills, sore throat,
runny nose, shortness of breath, nausea, headache, weakness, pain,
irritability, diarrhea, and other symptoms).
• clinical course and outcomes, such as:
1. hospital status (e.g., hospitalized in the intensive care unit (ICU), non-ICU
hospitalized, not hospitalized), and
2. a Boolean indicator of whether the patient recovered from
the disease or not. If so (i.e., recovered case), additional information
(e.g., recovery day) is captured.
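
The four groups of information listed above could be held in a single record type. The sketch below is one possible layout in Python; the class name, field names and the consistency check are hypothetical and only mirror the attributes named in the text. Every field other than the identifier is optional, because values may be missing or unstated.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CaseRecord:
    # Administrative information
    case_id: str                          # privacy-preserving identifier
    location: Optional[str] = None
    episode_day: Optional[str] = None     # ISO date string, e.g. "2020-01-21"
    # Case details
    gender: Optional[str] = None
    age: Optional[int] = None
    occupation: Optional[str] = None
    # Symptom-related data
    asymptomatic: Optional[bool] = None
    onset_day: Optional[str] = None
    symptoms: frozenset = frozenset()
    # Clinical course and outcomes
    hospital_status: Optional[str] = None  # "ICU", "non-ICU" or "not hospitalized"
    recovered: Optional[bool] = None
    recovery_day: Optional[str] = None

    def is_consistent(self) -> bool:
        """Asymptomatic cases carry no symptom details; only recovered cases have a recovery day."""
        if self.asymptomatic and (self.symptoms or self.onset_day is not None):
            return False
        if self.recovery_day is not None and self.recovered is not True:
            return False
        return True


case = CaseRecord(case_id="c-001", gender="female", age=34,
                  asymptomatic=False, symptoms=frozenset({"cough", "fever"}))
print(case.is_consistent())
```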

3.2.2 Data Preprocessing

After collecting and integrating data from heterogeneous sources, we preprocess the
collected and integrated data. Recall from Section 1 that big COVID-19 epidemiological
data can be characterized by their veracity. Specifically, we observe that some information
is missing, unstated or unknown (i.e., NULL values). Given the nature of these
COVID-19 cases (e.g., timely reporting of cases, privacy preservation of the identity
of cases), it is not unusual to have NULL values because values may not be available or
recorded. For some other attributes related to case details (e.g., personal information like
gender, age), patients may prefer not to report them due to privacy concerns. As there are
many cases with NULL values for some attributes, ignoring them may lead to inaccurate
or incomplete analysis of the data. Instead, our solution keeps all these cases for data
science.

Some attributes (e.g., date) would be too specific for the analysis. Moreover,
delays in testing or reporting (especially due to weekends) are not uncommon. Hence,
it is logical to group days into 7-day intervals, i.e., weeks. For example,
all days within the week of January 19-25 inclusive are considered as Week 3. Side
benefits of such grouping include:

• Summing the frequency of cases over a week (cf. a single day) increases the
chance of having sufficient frequency for being discovered as a frequent pattern
and getting statistically significant mining results.

• Generalizing the cases helps preserve the privacy of the individuals while
maintaining the utility for knowledge discovery.

Similarly, for some attributes (e.g., age, occupation), it is logical to group similar
values into a mega-value (say, ages can be binned into age groups). For example:

• grouping ages into age groups (e.g., ≤ 19 years old, 20-29 years old, ..., 70-79 years
old, ≥ 80 years old);

• generalizing specific occupations of the cases to some generalized key
occupation groups, say, (a) healthcare workers, (b) school or daycare workers,
(c) long-term care residents, and (d) others;

• generalizing specific transmission methods to some generalized key
transmission methods, say, (a) community exposures, (b) travel exposures, and
(c) others.
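
The generalization steps described in this section can be sketched in Python as follows. This is a minimal illustration, not the report's implementation. It assumes that Week 1 starts on Sunday, January 5, 2020 (chosen so that January 19-25 falls in Week 3, as in the text), and the occupation mapping is a hypothetical example. NULL values are kept as "unknown" rather than discarded.

```python
from datetime import date
from typing import Optional

# Assumption: Week 1 starts on Sunday, January 5, 2020, so that
# January 19-25, 2020 maps to Week 3 as stated in the text.
WEEK1_START = date(2020, 1, 5)


def episode_week(day: Optional[date]) -> Optional[int]:
    """Generalize a day to its 7-day week number (None stays None)."""
    if day is None:
        return None
    return (day - WEEK1_START).days // 7 + 1


def age_group(age: Optional[int]) -> str:
    """Bin an age into <=19, 20-29, ..., 70-79, >=80 (or 'unknown')."""
    if age is None:
        return "unknown"
    if age <= 19:
        return "<=19"
    if age >= 80:
        return ">=80"
    decade = (age // 10) * 10
    return f"{decade}-{decade + 9}"


# Hypothetical mapping from specific occupations to key occupation groups.
OCCUPATION_GROUPS = {
    "nurse": "healthcare worker",
    "physician": "healthcare worker",
    "teacher": "school or daycare worker",
    "long-term care resident": "long-term care resident",
}


def occupation_group(occupation: Optional[str]) -> str:
    """Generalize a specific occupation to a key occupation group."""
    if occupation is None:
        return "unknown"
    return OCCUPATION_GROUPS.get(occupation.lower(), "other")


print(episode_week(date(2020, 1, 21)), age_group(34), occupation_group("Nurse"))
```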

4.DISCUSSION AND EVALUATION RESULT

4.1 Case Study on Real-Life COVID-19 Data:


4.1.1 Data Collection, Integration and Preprocessing:
To evaluate and demonstrate the usefulness of our data science solution, we tested it
with different COVID-19 epidemiological data, including the Canadian cases from
Statistics Canada. For this dataset, data have been collected and integrated from
provincial and territorial public health authorities by the Public Health Agency of
Canada (PHAC). We preprocess the data and generalize some attributes to obtain a dataset
with the following attributes:
1. A unique privacy-preserving identifier for each case
2. A generalized region/location
3. Episode week (or onset week of symptoms): From Week 3 (i.e., week of
January 19-25, 2020) to now
4. Gender (cf. sex at birth, which consists of male and female), including (a)
male, (b) female, (c) others including unstated gender and non-binary gender
(e.g., lesbian, gay, bisexual, transgender, queer/questioning, two-spirited
(LGBTQ2+)).
5. Age group: ≤ 19, 20s, 30s, 40s, 50s, 60s, 70s, and ≥ 80.
6. Occupation group, including:
a) healthcare worker,
b) school or daycare worker (or attendee),
c) long-term care resident, and
d) other occupation.
7. Asymptomatic: Yes and No
8. Set of 13 symptoms, including cough, fever, chills, sore throat, runny nose,
shortness of breath, nausea, headache, weakness, pain, irritability, diarrhea, and
other symptoms.
9. Hospital status, including:
a) hospitalized in the ICU,
b) hospitalized but not in the ICU, and
c) not hospitalized.
10. Transmission method, including:
a) community exposures, and
b) travel exposures.
11. Clinical outcome: Recovered or deceased
12. Recovery week

4.1.2 Big Data Science on Cases :

Once the data are preprocessed, our data science solution first analyzes and mines the
national data. With 201,341 COVID-19 cases with stated gender and age (out of an estimated
Canadian population of 38,005,238), the solution reveals that about 0.53% of the population
contracted the disease.

TABLE I. DISTRIBUTION OF CUMULATIVE COVID-19 CASES (AND PERCENTAGES WITH
RESPECT TO POPULATION OF THE CORRESPONDING (GENDER, AGE GROUP)
COMBINATION) AS OF NOVEMBER 12, 2020

Table I confirms the above observations. Moreover, it reveals that (a) age groups 20s-40s
and 80+ (as well as females in their 50s) appear to be more vulnerable to the disease as they
have higher COVID-19 percentages than the national norm. Here, the percentage is computed
by dividing the number of cases in a specific group (i.e., a specific (gender, age group)
combination) by the population of the corresponding combination. For instance, 19,049 cases
of males in their 20s correspond to about 0.73% of this population group of about 2.6 million males in
their 20s. (b) Among all age groups, seniors aged 80+ have the highest risk, with a COVID-19
percentage of 1.31% of their corresponding population (cf. the national norm of 0.53% of the
national population).
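
The percentage computation described above is a simple ratio. As a minimal sketch in Python, using the national figures quoted in the text (the function name is illustrative), the national norm can be reproduced as follows. The same function applies to any (gender, age group) combination when given that combination's case count and population.

```python
def percentage_of_population(cases: int, population: int) -> float:
    """Return cases as a percentage of the given population."""
    if population <= 0:
        raise ValueError("population must be positive")
    return 100.0 * cases / population


# National figures quoted in the text.
national = percentage_of_population(201_341, 38_005_238)
print(f"{national:.2f}%")  # prints 0.53%
```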

The table also reveals that (c) females appear to be slightly more vulnerable to the disease than
their male counterparts. (d) In all age groups from 0-59, percentages of female COVID-19 cases
are slightly higher than those of their male counterparts. (e) For age groups 60s-70s, the opposite is
observed. (f) Among all gender and age groups, females aged 80+ have the highest risk, with a COVID-19
percentage of 1.49% of their corresponding population (cf. 1.04% for males aged 80+).

4.1.3 Big Data Science on Hospital Status:

Our solution also examines the hospital status among the 16 combinations. Table II reveals that (a) as
age increases, the absolute number of hospitalized cases also increases. When combined with
Table I, we observe that (b) although the number of cases decreases from age groups 20s to 70s,
the number of hospitalizations increases. This means that, when young people catch COVID-19,
a majority of them do not need to be hospitalized. As people age, their chance of
requiring hospitalization once they catch COVID-19 increases. (c) Between the two genders,
more males aged 30+ are admitted into the ICU than females.

Cells in Table III show the percentage of hospitalized cases with respect to COVID-19
patients in their corresponding (gender, age group) combination. For instance, 38 male COVID-19
patients in their 20s admitted to the ICU (as shown in Table II) account for 0.77% (as shown
in Table III) of all 19,049 male COVID-19 patients in their 20s. Table III reveals that (a) for
seniors aged 60+, the hospitalization percentages among all COVID-19 cases are high, ranging
from 14.01% to 25.57% (cf. the national norm of 7.05%), and peak in the 70s. In particular, (b) males
in their 70s have the highest percentages of both ICU admission (8.51% with respect to COVID-19 cases for
males in their 70s) and hospitalization (8.51% + 20.02% = 28.53%). In contrast, (c) males in their 80s
have the highest percentage of non-ICU hospitalization (23.70%).

4.1.4 Big Data Science on Occupation Groups:

Our solution also examines different occupation groups. Table IV shows the number of
healthcare workers for some (gender, age group) combinations (and their percentages with respect to
COVID-19 cases in the corresponding combination). It reveals that (a) female healthcare
workers in their 30s-50s account for more than a quarter of COVID-19 cases in their respective
combinations. For instance, 5,308 (33.49%) of 15,851 COVID-19 cases for females in their 40s
are healthcare workers. (b) In terms of both absolute number of cases and relative
number (with respect to cases in their combinations), female healthcare workers have much higher
numbers (about 4x higher) than their male counterparts. For completeness, Table IV also
includes the total numbers for all age groups (including 0-19 and 70+) in the bottom row.

5. Big Data Analysis and Processing

5.1 Big Data Storage Solutions:


5.1.1 Hadoop Distributed File System (HDFS):

HDFS is a distributed file system designed to store and manage vast amounts of data
across a cluster of commodity hardware. It's a core component of the Hadoop ecosystem
and is highly fault-tolerant and scalable.

5.1.2 NoSQL Databases:

NoSQL databases are a category of databases that are designed to handle large volumes
of unstructured or semi-structured data. Some popular NoSQL databases for big data
storage include:
• MongoDB: A document-oriented NoSQL database.
• Cassandra: A distributed NoSQL database designed for scalability.
• HBase: A distributed, scalable, and consistent NoSQL database for use with Hadoop.

5.1.3 Cloud-Based Storage and Computing Options:

These cloud providers offer various services for big data processing, machine learning,
and analytics, making it easier for organizations to scale their infrastructure based on
their data storage and processing needs.

• Amazon Web Services (AWS): AWS offers a wide range of cloud-based storage
and computing services, including Amazon S3 for storage, Amazon EC2 for
virtual servers, and Amazon Redshift for data warehousing.
• Microsoft Azure: Azure provides cloud-based storage solutions like Azure Blob
Storage and Azure Data Lake Storage, as well as virtual machines (Azure VMs)
and Azure SQL Data Warehouse for data processing and analytics.
• Google Cloud Platform (GCP): GCP offers Google Cloud Storage for object
storage.

5.2 Data Analysis:

These analytical techniques have not only helped in understanding the spread and impact of
COVID-19 but have also guided public health measures, resource allocation, vaccination
strategies, and policy decisions during the pandemic. It's important to note that the field of
COVID-19 data analysis is dynamic, with new techniques and insights continually emerging
as the situation evolves.

5.2.1 Descriptive Statistics:

• Summary Statistics: Basic statistics such as mean, median, mode, standard
deviation, and variance are used to describe key characteristics of COVID-19
data, such as the number of cases, deaths, and recovery rates.

• Incidence and Prevalence: Calculations of incidence rates (new cases per unit
of time) and prevalence rates (total cases in a population) are essential for
tracking the spread of the virus.
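
The descriptive measures listed above can be computed with Python's standard `statistics` module. The sketch below uses made-up daily case counts and a made-up population of one million; the function names and the "per 100,000" scaling are illustrative choices, not values taken from the report.

```python
import statistics


def incidence_rate(new_cases, population_at_risk, per=100_000):
    """New cases in a period, scaled to a rate per `per` people."""
    return per * new_cases / population_at_risk


def prevalence(existing_cases, population, per=100_000):
    """Existing cases at a point in time, scaled to a rate per `per` people."""
    return per * existing_cases / population


# Hypothetical daily new-case counts for one week.
daily_new_cases = [12, 15, 9, 20, 18, 7, 10]

summary = {
    "mean": statistics.mean(daily_new_cases),
    "median": statistics.median(daily_new_cases),
    "stdev": statistics.stdev(daily_new_cases),
    "variance": statistics.variance(daily_new_cases),
}
weekly_incidence = incidence_rate(sum(daily_new_cases), 1_000_000)

print(summary)
print(f"weekly incidence per 100,000: {weekly_incidence:.1f}")
```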

5.2.2 Data Visualization:

• Time Series Plots: Line charts and time series plots are used to visualize the progression
of COVID-19 cases, deaths, and recoveries over time. These help identify trends and
spikes.
• Geospatial Maps: Maps with color-coded regions or markers are used to
visualize the geographic distribution of cases, helping identify hotspots and
areas in need of targeted interventions.

• Histograms and Bar Charts: These are used to display distributions of variables
like age, comorbidities, and vaccination rates among COVID-19 patients.

• Heatmaps: Heatmaps can reveal correlations and patterns in data, such as the
spread of the virus in relation to population density or climate factors.

5.2.3 Spatial Analysis:

• Geographic Information Systems (GIS) and spatial analysis techniques are
used to analyze the spatial distribution of COVID-19 cases and identify
clusters or areas with higher transmission rates.
• Spatial autocorrelation and hotspot analysis help identify statistically
significant clusters of cases.
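
One common measure of spatial autocorrelation is global Moran's I. The Python sketch below computes it with binary contiguity weights (1 for neighbouring regions, 0 otherwise). The neighbour structure and case rates are hypothetical, and a real analysis would normally also test the statistic for significance, which is omitted here.

```python
def morans_i(values, neighbors):
    """Global Moran's I with binary contiguity weights.

    values:    list of per-region values (e.g. case rates)
    neighbors: dict mapping a region index to the indices of its neighbours
               (assumed symmetric: if j is in neighbors[i], i is in neighbors[j])
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    denom = sum(d * d for d in dev)
    if denom == 0:
        raise ValueError("all values are identical")
    weight_sum = 0
    cross = 0.0
    for i, nbrs in neighbors.items():
        for j in nbrs:
            weight_sum += 1
            cross += dev[i] * dev[j]
    if weight_sum == 0:
        raise ValueError("no neighbour pairs given")
    return (n / weight_sum) * (cross / denom)


# Four hypothetical regions arranged in a line: 0 - 1 - 2 - 3.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
clustered = morans_i([1, 1, 5, 5], chain)    # similar values sit together
alternating = morans_i([1, 5, 1, 5], chain)  # neighbours always differ
print(clustered, alternating)
```

Values above zero indicate that similar rates cluster in neighbouring regions, while values below zero indicate that neighbouring regions tend to differ.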

5.3 Machine Learning and Predictive Analytics:
Machine learning algorithms have played a significant role in forecasting COVID-19 trends
and assisting in various aspects of the pandemic response. These models leverage historical
data, such as infection rates, vaccination data, mobility patterns, and other relevant information,
to make predictions and inform decision-making. Here are some examples of how machine
learning has been used for forecasting COVID-19 trends:

5.3.1 Epidemic Models:

• SEIR Models: Machine learning can be used to optimize the parameters of traditional
epidemiological models like Susceptible-Exposed-Infectious-Removed (SEIR)
models. These models take into account various factors such as population
demographics, mobility data, and social interactions to forecast infection rates over
time.
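
To make the SEIR structure concrete, the following Python sketch simulates the four compartments with a simple forward-Euler step. It only simulates the model for fixed parameters; fitting `beta`, `sigma` and `gamma` to observed data (the part where machine learning could help) is not shown. All parameter values in the usage line are hypothetical.

```python
def simulate_seir(population, beta, sigma, gamma,
                  initial_infectious=1, days=160, dt=0.1):
    """Simulate an SEIR model with forward-Euler steps.

    beta:  transmission rate per day
    sigma: rate of leaving the exposed state (1 / incubation period)
    gamma: removal rate (1 / infectious period)
    Returns a list of (S, E, I, R) tuples, one per day, starting at day 0.
    """
    s = float(population - initial_infectious)
    e = 0.0
    i = float(initial_infectious)
    r = 0.0
    history = [(s, e, i, r)]
    steps_per_day = round(1 / dt)
    for _ in range(days):
        for _ in range(steps_per_day):
            new_exposed = beta * s * i / population * dt
            new_infectious = sigma * e * dt
            new_removed = gamma * i * dt
            s -= new_exposed
            e += new_exposed - new_infectious
            i += new_infectious - new_removed
            r += new_removed
        history.append((s, e, i, r))
    return history


# Hypothetical parameters: 5-day incubation, 7-day infectious period.
history = simulate_seir(1_000_000, beta=0.5, sigma=1 / 5, gamma=1 / 7,
                        initial_infectious=10, days=200)
peak_day = max(range(len(history)), key=lambda d: history[d][2])
print(f"infectious peak on day {peak_day}")
```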

5.3.2 Infection Rate Prediction:

• Time Series Forecasting: Time series forecasting models such as ARIMA, Prophet,
or LSTM networks are used to predict daily or weekly
infection rates based on historical data. These models can capture seasonality and trends
in the data.
• Spatial-Temporal Models: Spatiotemporal models like Gaussian Processes or
Convolutional Neural Networks (CNNs) can capture the geographical spread of the
virus, allowing for localized infection rate predictions.
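
Models such as ARIMA, Prophet or LSTM require dedicated libraries, so the sketch below shows something much simpler: a moving-average baseline of the kind such models are usually compared against. It is not an implementation of any of the models named above. With a 7-day window it averages out the weekday reporting effect, and the input series is hypothetical.

```python
def moving_average_forecast(history, window=7, horizon=7):
    """Forecast `horizon` future values, each as the mean of the last `window` values.

    Forecasts are fed back into the series, so later forecasts depend on earlier ones.
    """
    if len(history) < window:
        raise ValueError("history is shorter than the window")
    series = list(history)
    forecasts = []
    for _ in range(horizon):
        next_value = sum(series[-window:]) / window
        forecasts.append(next_value)
        series.append(next_value)
    return forecasts


# Hypothetical daily case counts for two weeks.
daily_cases = [120, 135, 150, 160, 155, 90, 80, 140, 150, 170, 175, 165, 100, 95]
print(moving_average_forecast(daily_cases, window=7, horizon=3))
```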

5.3.3 Vaccine Distribution:

• Optimization Algorithms: Machine learning optimization techniques can help in
optimizing vaccine distribution strategies. These models consider factors such as
vaccine supply, population demographics, and logistics constraints to determine the
most efficient allocation of vaccines.
• Demand Forecasting: ML models can predict vaccine demand at different locations
and times, ensuring that sufficient vaccines are available where needed.
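
As a very simple point of reference for allocation, the sketch below distributes a fixed vaccine supply in proportion to population, using the largest-remainder method so that whole doses add up exactly to the supply. This is a plain heuristic, not a machine learning or optimization model, and it ignores logistics constraints. The region names and populations are hypothetical.

```python
def allocate_doses(total_doses, populations):
    """Split `total_doses` across regions in proportion to population.

    populations: dict mapping region name -> population.
    Uses the largest-remainder method so the integer shares sum to `total_doses`.
    """
    total_population = sum(populations.values())
    if total_population <= 0:
        raise ValueError("total population must be positive")
    exact = {region: total_doses * pop / total_population
             for region, pop in populations.items()}
    allocation = {region: int(share) for region, share in exact.items()}
    leftover = total_doses - sum(allocation.values())
    # Hand out the remaining doses to the regions with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda region: exact[region] - allocation[region],
                          reverse=True)
    for region in by_remainder[:leftover]:
        allocation[region] += 1
    return allocation


# Hypothetical regions and populations.
print(allocate_doses(100, {"North": 500, "Central": 300, "South": 200}))
```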

5.3.4 Vaccine Efficacy Prediction:

• Machine learning models can analyze vaccine trial data to predict the efficacy of
different vaccines against variants of the virus. This helps health authorities make
informed decisions about which vaccines to prioritize.

5.4 Real-time Data Streaming
Real-time data streaming plays a crucial role in tracking and managing pandemics, as it enables
public health authorities and researchers to make informed decisions quickly and effectively.
Here's a discussion on the importance of real-time data in tracking the pandemic and how
technologies like Apache Kafka and Apache Flink can be employed for real-time data
processing:
5.4.1 Importance of Real-time Data in Tracking the Pandemic:
• Rapid Response: Pandemics, such as COVID-19, require rapid response measures to
contain the spread of the virus. Real-time data provides up-to-the-minute information
on infection rates, hospitalizations, and other critical metrics. This enables authorities
to react swiftly to emerging hotspots and adjust containment strategies in real-time.

• Resource Allocation: Real-time data helps in the allocation of healthcare resources like
hospital beds, ventilators, and medical personnel. By monitoring the influx of patients
in real-time, hospitals can optimize resource distribution, ensuring that critically ill
patients receive the care they need promptly.

• Contact Tracing: Effective contact tracing is vital for containing the spread of a
pandemic. Real-time data allows contact tracers to identify potential exposures quickly
and notify individuals at risk, helping to break the chains of transmission.

5.4.2 Technologies for Real-time Data Processing:


• Apache Kafka:
Data Ingestion: Kafka serves as a robust platform for ingesting large volumes of real-
time data from various sources, such as hospitals, testing centers, and mobile apps.
Data Aggregation: It allows the aggregation of data streams, helping in the
consolidation of information from multiple locations.
• Apache Flink:
Real-time Data Processing: Flink is designed for stream processing and can perform
real-time analytics on incoming data streams, making it suitable for monitoring and
analyzing pandemic-related data.
Complex Event Processing: Flink supports complex event processing, allowing the
detection of patterns and anomalies in real-time data, which is valuable for identifying
emerging hotspots.
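
The windowed aggregation that a stream processor performs can be illustrated in plain Python. The sketch below does not use the Kafka or Flink APIs; it only mimics the idea of a tumbling (fixed, non-overlapping) window over a stream of case reports. In a real deployment the events would arrive from a message broker and the windowing would be done by the stream engine. The event format, region names and threshold are hypothetical.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds=3600):
    """Count events per (window start, region).

    events: iterable of (timestamp_in_seconds, region) tuples.
    Each event falls into exactly one fixed-size, non-overlapping window.
    """
    counts = defaultdict(int)
    for timestamp, region in events:
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, region)] += 1
    return dict(counts)


def flag_hotspots(counts, threshold):
    """Return the (window start, region) keys whose count reaches the threshold."""
    return sorted(key for key, count in counts.items() if count >= threshold)


# Hypothetical stream of case reports: (seconds since start, region).
events = [(0, "north"), (10, "north"), (3599, "north"),
          (3600, "north"), (20, "south")]
counts = tumbling_window_counts(events)
print(flag_hotspots(counts, threshold=3))
```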

6.Applications
Several successful Big Data applications in COVID-19 research have had a significant impact
on understanding the pandemic, managing its effects, and informing public health responses.
Here are a few notable examples:

6.1 BlueDot's Early Warning System:


• Use Case: BlueDot, a Canadian digital health company, used Big Data analytics to
detect and track infectious disease outbreaks, including COVID-19.

• Impact: BlueDot's system identified the early signs of the COVID-19 outbreak in
Wuhan, China, before the World Health Organization (WHO) made its official
announcement. This early warning helped public health agencies and governments
prepare and respond more effectively.

6.2 Google's COVID-19 Community Mobility Reports:


• Use Case: Google leveraged location data from smartphones to create mobility reports
that tracked changes in people's movements during the pandemic.
• Impact: These reports provided insights into the effectiveness of social distancing
measures, helping governments and health officials make data-driven decisions. They also
enabled them to assess compliance with lockdowns and evaluate the impact of
reopening measures.

6.3 COVID-19 Genome Sequencing:
• Use Case: Researchers worldwide used high-throughput sequencing technologies
to analyze the genetic makeup of the SARS-CoV-2 virus responsible for COVID-19.
• Impact: Genome sequencing facilitated the identification of virus mutations and the
tracking of viral evolution. This data was crucial for vaccine development,
understanding transmission dynamics, and monitoring the emergence of new variants.

6.4 Contact Tracing Apps:


• Use Case: Many countries developed contact tracing apps that relied on Big Data to
notify individuals who may have been exposed to COVID-19.
• Impact: These apps helped identify and isolate potential cases quickly, reducing the
spread of the virus. They also provided valuable data for public health agencies to track
and analyze outbreaks.

6.5 Hospital Resource Allocation with Predictive Analytics:
• Use Case: Hospitals and healthcare systems used predictive analytics models to forecast
COVID-19 patient admissions and optimize resource allocation.
• Impact: Predictive models helped hospitals prepare for surges in COVID-19 cases,
ensuring that they had adequate beds and ventilators.

7. Challenges and Limitations
Working with COVID-19 Big Data presents several challenges, and the available data and
analytics techniques have limitations of their own. Both are outlined below.

7.1 Challenges in Working with COVID-19 Big Data:


• Data Volume and Velocity: The sheer volume and velocity of COVID-19 data
generated can overwhelm existing infrastructure and tools. Managing and processing
massive amounts of data in real-time is a significant challenge.

• Data Quality: Ensuring data accuracy and consistency is crucial. Inaccurate or
incomplete data can lead to incorrect analyses and decisions. Data quality issues can
arise due to reporting discrepancies, testing variations, and inconsistent data collection
methods.
• Data Privacy: The need for data to combat the pandemic must be balanced against
privacy concerns. Sharing and analyzing data while protecting individual privacy is
difficult to achieve in practice.

• Data Heterogeneity: Data on COVID-19 is collected from various sources, including
hospitals, laboratories, government agencies, and mobile apps. These sources often use
different data formats and standards, making data integration and interoperability
difficult.

• Data Lag: There can be delays in reporting and updating data, which can hinder real-
time decision-making. Timely and accurate data is critical during a pandemic, and data
lags can have serious consequences.

• Model Uncertainty: Predictive models used for forecasting the spread of the virus or
estimating healthcare resource needs are subject to uncertainties. The rapidly evolving
nature of the pandemic makes it challenging to build accurate models.

• Resource Constraints: Organizations working with COVID-19 data often face resource
constraints, including limited access to skilled data scientists, computational resources,
and funding.
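
The data heterogeneity challenge can be made concrete with a small example. The Python sketch below maps records from two sources, each with its own field names and date format, onto one common layout. The source names, field names, and formats are hypothetical and serve only to illustrate the integration step.

```python
from datetime import datetime

# Hypothetical per-source schemas: which fields hold the date and the count,
# and how each source writes its dates.
SOURCE_SCHEMAS = {
    "hospital": {"date_field": "admit_date", "date_format": "%d/%m/%Y", "count_field": "cases"},
    "lab": {"date_field": "reported", "date_format": "%Y-%m-%d", "count_field": "positive_tests"},
}


def harmonize(source, record):
    """Convert a source-specific record into a common {date, count, source} layout."""
    schema = SOURCE_SCHEMAS[source]
    parsed = datetime.strptime(record[schema["date_field"]], schema["date_format"])
    return {
        "date": parsed.date().isoformat(),
        "count": int(record[schema["count_field"]]),
        "source": source,
    }


print(harmonize("hospital", {"admit_date": "05/04/2020", "cases": "12"}))
print(harmonize("lab", {"reported": "2020-04-05", "positive_tests": 7}))
```

Both records end up with the ISO date 2020-04-05, so they can be combined or compared directly.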

7.2 Limitations of Available Data and Analytics Techniques:

• Underreporting: Official COVID-19 case counts may underestimate the true extent of
the pandemic due to factors like limited testing availability, asymptomatic cases, and
incomplete reporting in some regions.

• Data Bias: Data can be biased towards certain demographic groups or regions, leading
to disparities in analysis and decision-making. For example, marginalized communities
may have less access to testing and healthcare, resulting in skewed data.

• Lack of Longitudinal Data: Long-term data on the consequences of COVID-19,
including long COVID and the effectiveness of vaccines, is limited. This makes it
difficult to assess the full impact of the pandemic and vaccination efforts.

• Data Access and Sharing: Data sharing and access can be restricted due to privacy
concerns and legal or regulatory barriers, limiting the ability of researchers to analyze
and use the data effectively.

• Model Generalization: Machine learning models developed using early pandemic data
may not generalize well to later stages of the pandemic when conditions and
interventions change.

• Resource-Intensive Analytics: Advanced analytics techniques, such as machine
learning and deep learning, often require significant computational resources and
expertise, which may not be available to all organizations and researchers.

• Emerging Variants: New variants of the virus can impact the accuracy of existing
models and necessitate ongoing data collection and analysis to understand their
behavior and implications.
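
The model generalization limitation can be checked with a temporal hold-out: a model is fitted on early data and then evaluated separately on later data. The Python sketch below uses a constant-mean predictor as a placeholder model. The case counts, the cutoff date, and the function names are hypothetical.

```python
def temporal_split(records, cutoff_date):
    """Split records by ISO date string: before the cutoff, and on or after it."""
    early = [r for r in records if r["date"] < cutoff_date]
    later = [r for r in records if r["date"] >= cutoff_date]
    return early, later


def mean_absolute_error(actual_values, predicted_value):
    """Average absolute gap between the actual values and one constant prediction."""
    return sum(abs(a - predicted_value) for a in actual_values) / len(actual_values)


# Hypothetical daily case counts from two stages of the pandemic.
records = [
    {"date": "2020-03-01", "cases": 10},
    {"date": "2020-03-02", "cases": 12},
    {"date": "2020-03-03", "cases": 14},
    {"date": "2020-06-01", "cases": 40},
    {"date": "2020-06-02", "cases": 44},
]

early, later = temporal_split(records, "2020-04-01")
# Placeholder "model": predict the mean of the early period.
prediction = sum(r["cases"] for r in early) / len(early)

early_error = mean_absolute_error([r["cases"] for r in early], prediction)
later_error = mean_absolute_error([r["cases"] for r in later], prediction)
print(f"early error: {early_error:.2f}, later error: {later_error:.2f}")
```

A much larger error on the later period than on the early period is the sign that a model built on early data no longer fits current conditions.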

8. CONCLUSIONS

In this paper, we presented a data science solution for analyzing big COVID-19
epidemiological data. The solution generalizes some attributes (e.g., age into age groups) for
effective analysis. Instead of ignoring unstated/NULL values of some attributes, the solution
gives users the flexibility to include or exclude these values. It also lets users
express their preferences (e.g., “must include symptoms”) when mining
frequent patterns. It discovers frequent patterns from each of the 16 combinations of gender
and age group. Moreover, it compares and contrasts the discovered frequent patterns among
these combinations.
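
The ideas in this paragraph (generalizing age, optionally keeping unstated values, and constraining patterns to include a chosen attribute) can be sketched in Python as follows. This is a brute-force illustration suitable only for tiny inputs and is not the mining algorithm used in the work summarized here. The ten-year age bands, attribute names, and records are hypothetical.

```python
from collections import Counter
from itertools import combinations


def generalize_age(age):
    """Map an exact age to a ten-year band; an unstated age stays None."""
    if age is None:
        return None
    low = (age // 10) * 10
    return f"{low}-{low + 9}"


def to_items(record, include_unstated):
    """Turn a record into 'attribute=value' items, optionally keeping NULL values."""
    items = []
    for attribute, value in record.items():
        if value is None:
            if include_unstated:
                items.append(f"{attribute}=unstated")
        else:
            items.append(f"{attribute}={value}")
    return sorted(items)


def frequent_patterns(records, min_support, include_unstated=True, must_include=None):
    """Count every item combination and keep those meeting support and the constraint."""
    counts = Counter()
    for record in records:
        items = to_items(record, include_unstated)
        for size in range(1, len(items) + 1):
            counts.update(combinations(items, size))
    prefix = None if must_include is None else must_include + "="
    return {
        pattern: count
        for pattern, count in counts.items()
        if count >= min_support
        and (prefix is None or any(item.startswith(prefix) for item in pattern))
    }


# Hypothetical case records.
records = [
    {"age_group": generalize_age(34), "symptom": "cough", "hospitalized": "no"},
    {"age_group": generalize_age(37), "symptom": "cough", "hospitalized": None},
    {"age_group": generalize_age(71), "symptom": "fever", "hospitalized": "yes"},
]

patterns = frequent_patterns(records, min_support=2, must_include="symptom")
for pattern, count in patterns.items():
    print(pattern, count)
```

With these records, the two patterns that occur at least twice and mention a symptom are "symptom=cough" alone and "age_group=30-39" together with "symptom=cough".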

Taking into account differences in population and/or the number of cases in each of the 16
combinations, our solution computes relative frequency (with respect to population and/or the
number of cases in the respective combination) in addition to showing the absolute frequency
of the attributes and/or frequent patterns. Evaluation results show the practicality of our
solution in providing rich knowledge about characteristics of COVID-19 cases.
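
The difference between absolute and relative frequency can be seen in a small Python sketch. The group labels and counts below are hypothetical and are not results from the evaluation.

```python
def relative_frequencies(pattern_counts, group_sizes):
    """Divide each group's pattern count by that group's size (cases or population)."""
    return {
        group: pattern_counts.get(group, 0) / size
        for group, size in group_sizes.items()
        if size > 0
    }


# Hypothetical counts of one pattern in two (gender, age group) combinations.
pattern_counts = {("F", "20-29"): 30, ("M", "80+"): 30}
# Hypothetical number of cases in each combination.
cases_per_group = {("F", "20-29"): 600, ("M", "80+"): 100}

for group, share in relative_frequencies(pattern_counts, cases_per_group).items():
    print(group, f"{share:.0%}")
```

Both groups have the same absolute frequency (30), but the pattern covers 5% of cases in the first group and 30% in the second, which is why the relative figure is reported alongside the absolute one.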

This helps researchers, epidemiologists and policy makers to get a better understanding of the
disease, which may inspire them to come up with ways to detect, control and combat the disease.
As ongoing and future work, we plan to transfer the knowledge learned from the current work to
data science on other big data in many real-life applications and services.

