
Big Data (KCS-061)

Unit-1
Introduction to Big Data: Types of digital data, history of Big Data innovation, introduction to Big Data platform,
drivers for Big Data, Big Data architecture and characteristics, 5 Vs of Big Data, Big Data technology components,
Big Data importance and applications, Big Data features – security, compliance, auditing and protection, Big Data
privacy and ethics, Big Data Analytics, Challenges of conventional systems, intelligent data analysis, nature of data,
analytic processes and tools, analysis vs reporting, modern data analytic tools.

History of Big Data


The phrase “big data” can be traced back to Silicon Valley lunch-table conversations and pitch meetings in
the 1990s[1].
1990s - Emergence of Data Warehousing:
The concept of data warehousing emerged in the 1990s. Companies began to store and manage large
volumes of structured data in centralized repositories for analytical purposes.
2000s - Rise of NoSQL Databases:
As internet usage exploded, companies faced challenges in handling diverse and unstructured data. The
2000s saw the rise of NoSQL databases, which offered more flexibility than traditional relational databases
in handling different types of data.
2003–2006 - Introduction of Hadoop:
Google's Google File System paper (2003) and MapReduce paper (2004) inspired Apache Hadoop, an
open-source framework for distributed storage and processing of large data sets, which became its own
Apache project in 2006. Hadoop became a fundamental technology for big data, enabling the processing of
massive amounts of data across clusters of commodity hardware.
2008 - Growth of Cloud Computing:
Cloud computing platforms like Amazon Web Services (AWS), Google Cloud Platform (GCP), and
Microsoft Azure gained popularity. These platforms provided scalable and on-demand resources, making it
easier for organizations to store and process large volumes of data without the need for extensive
infrastructure investments.
2010s - Expansion of Big Data Ecosystem:
The big data ecosystem expanded significantly with the development of various tools and technologies.
Apache Spark, a fast and general-purpose cluster computing system, gained popularity for in-memory
processing. Additionally, technologies like Apache Hive, Apache HBase, and Apache Storm became
integral parts of the big data landscape.
2010s - Emergence of Data Lakes:
Data lakes became popular as organizations sought to store large volumes of diverse data, including
structured and unstructured data, in their raw form. Technologies like Apache Hadoop Distributed File
System (HDFS) and cloud-based storage solutions facilitated the creation of data lakes.
2010s - Advanced Analytics and Machine Learning:
The integration of advanced analytics and machine learning became more prevalent in big data solutions.
Data scientists and analysts started using big data technologies to derive insights, make predictions, and
uncover patterns in large datasets.
2010s - Real-time Processing:
Real-time data processing gained importance, and technologies like Apache Kafka and Apache Flink
emerged to handle streaming data for real-time analytics.
Present - Continued Evolution:
Big data technologies continue to evolve rapidly, with a focus on improving performance, scalability, and
ease of use. The adoption of containerization and orchestration tools, such as Docker and Kubernetes, has
also played a role in streamlining big data deployments.

What is big data?

Big data is a term used to describe data of great variety, arriving in huge volumes and at ever-increasing
velocity. Beyond its sheer volume, big data is also so complex that conventional data management tools
cannot effectively store or process it. The data can be structured or unstructured.

Examples of big data include:


• Mobile phone details
• Social media content
• Health records
• Transactional data
• Web searches
• Financial documents
• Weather information

Big data can be generated by users (emails, images, transactional data, etc.) or by machines (IoT devices,
ML algorithms, etc.). Depending on the owner, the data may be made commercially available to the public
through an API or FTP; in some instances, a subscription is required to be granted access to it.

What Is Digital Data?

Digital data is the electronic representation of information in a format or language that machines can read
and understand. In more technical terms, digital data is a binary format of information that's converted into
a machine-readable digital format. The power of digital data is that any analog inputs, from very simple
text documents to genome sequencing results, can be represented with the binary system.

Classification of Data

Data Classification:
Data classification is the process of organizing data into relevant categories so that it can be used and
applied more efficiently. Classification makes it easy for users to retrieve data, and it is important for data
security, for compliance, and for meeting different business or personal objectives. It is also a major
requirement wherever data must be retrievable within a specific period of time.

Types of Data Classification:

Data can be broadly classified into 3 types.


1. Structured Data:
Structured data is created using a fixed schema and is maintained in a tabular format. The elements in
structured data are addressable, which makes analysis effective. This category covers all data that can be
stored in a SQL database in tabular form. Because of its fixed schema, structured data is the simplest type
of data to manage and process.

Examples –
DWH (Data Warehouse), DM (Data Mart), OLTP (Online Transaction Processing), ODS (Operational
Data Store), APIs (Application Programming Interfaces), ERP (Enterprise Resource Planning), CRM
(Customer Relationship Management), MIS (Management Information System)

Other examples include relational data, geo-location data, credit card numbers, addresses, etc.
Consider relational data: suppose a university must maintain a record of its students, including each
student's name, ID, address, and email. The following relational schema and table can store these
records.

S_ID S_Name S_Address S_Email

1001 A Delhi [email protected]

1002 B Mumbai [email protected]
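The student table above can be sketched as structured data held under a fixed schema. Below is a minimal illustration using Python's built-in sqlite3 module; the table name, column names, and placeholder e-mail addresses are all assumptions for the example:

```python
import sqlite3

# Structured data: a fixed schema makes every element addressable for analysis.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE student (s_id INTEGER PRIMARY KEY, s_name TEXT,"
    " s_address TEXT, s_email TEXT)"
)
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?, ?)",
    [(1001, "A", "Delhi", "a@example.com"),      # placeholder e-mails
     (1002, "B", "Mumbai", "b@example.com")],
)

# Because the schema is fixed, standard SQL can query any column directly.
rows = conn.execute(
    "SELECT s_name FROM student WHERE s_address = 'Delhi'"
).fetchall()
```

This query-by-column access is exactly what unstructured data, discussed next, lacks.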

2. Unstructured Data:

Unstructured data is data that does not follow a pre-defined schema or any organized format. It is not a
good fit for a relational database, which expects data in a pre-defined, organized form. Unstructured data
is nevertheless very important in the big data domain, and there are many platforms, such as NoSQL
databases, for managing and storing it.

Examples –
Word documents, PDFs, plain text, media logs, audio, video, web pages, geo-location data, and social
media content from platforms such as Twitter, Facebook, and Instagram; also data generated by mobile
phones, smartwatches, and Wi-Fi devices.

3. Semi-Structured Data:
Semi-structured data is information that does not reside in a relational database but has some
organizational properties that make it easier to analyze. With some processing it can be stored in a
relational database, although this is very hard for some kinds of semi-structured data.

Examples –
HTML, XML, NoSQL documents, emails, CSV files, JSON files, log files, and Excel files.
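A short sketch of why semi-structured formats such as JSON are easier to analyze than fully unstructured data: each record carries named fields (organizational properties) even though the records do not share one rigid schema. The field names and values below are invented for the example:

```python
import json

# Two JSON records: the second carries an extra field the first lacks -
# allowed, because there is no fixed schema.
records = [
    '{"name": "A", "city": "Delhi"}',
    '{"name": "B", "city": "Mumbai", "phone": "0000000000"}',
]
parsed = [json.loads(r) for r in records]

# The shared fields remain directly addressable, much like structured data.
cities = [p["city"] for p in parsed]
```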

Features of Data Classification :

The main goal of data classification is to arrange the data in such a form that it becomes readily
available to users. Its basic features are as follows:
• Homogeneity – The data items in a particular group should be similar to each other.
• Clarity – There must be no confusion about the positioning of any data item in a particular group.
• Stability – The data item set must be stable, i.e. a new investigation should not change the existing
classification.
• Elasticity – One should be able to change the basis of classification as the purpose of classification
changes.

Types of big data

Big data refers to extremely large datasets that are complex and difficult to process using traditional data
processing tools and techniques. These datasets often contain a vast amount of information from various
sources, such as social media, business transactions, sensors, and machine-generated data.
The three main types of big data are:

Structured data:
Structured data is highly organized and can be easily processed using traditional data processing tools. This
type of data is typically stored in a relational database and includes data that can be represented in a tabular
format.

Semi-structured data:
Semi-structured data is partially organized and does not conform to a fixed, tabular structure. It can be
processed with some traditional data processing tools and techniques, but it may require preprocessing
before it can be analyzed. Examples of semi-structured data include XML, JSON, and CSV files.

Unstructured data:
Unstructured data is not organized and does not have a specific structure. This type of data is often
generated by humans or machines and includes data such as text, images, audio, and video. Analyzing
unstructured data requires advanced data processing techniques such as natural language processing, image
processing, and machine learning.

Apart from these three types of big data, there are also two other categories of big data:

Internal data:
Internal data is generated by an organization's internal systems and processes, such as customer data,
employee data, and financial data.
External data:
External data is generated by sources outside of an organization, such as social media, online reviews, and
weather data.
Overall, big data offers businesses and organizations an opportunity to gain valuable insights from vast
amounts of data. However, it requires specialized tools and techniques to process and analyze effectively.

Big Data platform

A Big Data platform is a software framework designed to manage and process large, complex datasets. It
provides a scalable and efficient infrastructure to store, process, and analyze big data. A Big Data platform
typically consists of several components that work together to enable data processing and analysis at scale.
These components may include:

Data storage:
A Big Data platform may use distributed file systems such as Hadoop Distributed File System (HDFS) or
cloud-based storage services to store large amounts of data.

Data processing:
A Big Data platform may use tools like Apache Spark, Apache Storm, or Apache Flink to process large
datasets in parallel across multiple nodes or clusters.

Data integration:
A Big Data platform may use tools such as Apache Kafka or Apache Nifi to move and integrate data from
various sources.

Data analysis:
A Big Data platform may provide tools like Apache Hive, Apache Pig, or Apache Drill to perform data
analysis and generate insights.

Machine learning:
A Big Data platform may include machine learning tools such as TensorFlow or Apache Mahout to enable
predictive modeling and decision-making based on large datasets.

Data visualization:
A Big Data platform may provide data visualization tools such as Tableau or Apache Superset to create
interactive dashboards and visualizations to help users understand and explore large datasets.
Overall, a Big Data platform provides a powerful infrastructure for processing and analyzing large,
complex datasets, enabling organizations to gain valuable insights from their data and make data-driven
decisions.

Drivers for Big Data


There are several drivers that have led to the growth and adoption of Big Data.
These include:

Data growth:
The volume of data generated by individuals and organizations has grown exponentially in recent years.
This includes data generated by social media, sensors, mobile devices, and other sources.

Data diversity:
Data is no longer limited to structured data that can be easily processed using traditional tools. There is a
growing need to process and analyze semi-structured and unstructured data, such as text, images, and
video.
Data complexity:
The complexity of data has increased with the rise of interconnected systems and the Internet of Things
(IoT), which generates data from a wide range of sources and devices.

Competitive advantage:
Businesses are using Big Data to gain a competitive advantage by analyzing customer behavior, identifying
trends, and improving decision-making.

Cost reduction:
Big Data technologies enable organizations to store and process large volumes of data more cost-
effectively than traditional data processing tools.

Real-time processing:
Big Data platforms enable organizations to process and analyze data in real-time, allowing them to make
faster and more informed decisions.

Regulatory compliance:
The need to comply with regulations and protect sensitive data has led to increased investment in Big Data
technologies to ensure secure data storage and processing.
Overall, the drivers for Big Data have led to the development of new tools and technologies that enable
organizations to store, process, and analyze large, complex datasets, and gain valuable insights from their
data.

Big Data architecture and characteristics


Big data architecture refers to the infrastructure and systems required to process, store, and analyze large
and complex data sets.

What is Big Data Architecture?


The term "Big Data architecture" refers to the systems and software used to manage Big Data. A Big Data
architecture must be able to handle the scale, complexity, and variety of Big Data. It must also be able to
support the needs of different users, who may want to access and analyze the data differently.

Big Data Architecture Layers
A Big Data architecture has four main layers:

1. Data Ingestion
This layer is responsible for collecting and storing data from various sources. In Big Data, data
ingestion is the process of extracting data from various sources and loading it into a data
repository. Data ingestion is a key component of a Big Data architecture because it determines
how data will be ingested, transformed, and stored.

2. Data Processing
Data processing is the second layer, responsible for collecting, cleaning, and preparing the data for
analysis. This layer is critical for ensuring that the data is high quality and ready to be used in the
future.

3. Data Storage
Data storage is the third layer, responsible for storing the data in a format that can be easily
accessed and analyzed. This layer is essential for ensuring that the data is accessible and available
to the other layers.

4. Data Visualization
Data visualization is the fourth layer and is responsible for creating visualizations of the data that
humans can easily understand. This layer is important for making the data accessible.

The characteristics of big data are described by the 5 Vs:

Volume:

Big data architecture deals with massive volumes of data, ranging from terabytes to petabytes.
Volume describes both the size and quantity of the data.
Data from the internet, social media, and the Internet of Things (IoT) is being generated at an
unprecedented rate. Traditional data storage and processing techniques are often insufficient to handle
such a massive amount of data.

Variety:

Variety describes the diversity of the data types and its heterogeneous sources. Big Data information draws
from a vast quantity of sources, and not all of them provide the same level of value or relevance.
The data, pulled from new sources located in-house and off-site, comes in three different types:
Structured Data: Also known as organized data, this is information with a defined length and
format. An Excel spreadsheet with customer names, e-mails, and cities is an example of
structured data.
Unstructured Data: Unlike structured data, unstructured data covers information that can’t
neatly fit in the rigid, traditional row and column structure found in relational databases.
Unstructured data includes images, texts, and videos, to name a few. For example, if a
company received 500,000 jpegs of their customers’ cats, that would qualify as unstructured
data.
Semi-structured Data: As the name suggests, semi-structured data is information that
features associated information like metadata, although it doesn't conform to formal data
structures. This category includes e-mails, web pages, and TCP/IP packets.
Data comes in different forms and formats such as text, audio, video, and images. Big data architecture
handles data from a variety of sources and in various formats.

Velocity:

Velocity describes how rapidly the data is generated and how quickly it moves. This data flow comes from
sources such as mobile phones, social media, networks, servers, etc. Velocity covers the data's speed, and it
also describes how the information continuously flows. For instance, a consumer with wearable tech that
has a sensor connected to a network will keep gathering and sending data to the source.
Big data architecture deals with data streams that arrive at a high rate, such as social media feeds, sensor
data, and web traffic. Social media platforms generate millions of posts, likes, and comments every second,
while IoT sensors can generate data at a rate of several terabytes per hour.

Veracity:

Veracity describes the data’s accuracy and quality. Since the data is pulled from diverse sources, the
information can have uncertainties, errors, redundancies, gaps, and inconsistencies. It's bad enough when
an analyst gets one set of data that has accuracy issues; imagine getting tens of thousands of such datasets,
or maybe even millions.
Big data architecture deals with data of varying quality, completeness, and accuracy.

Value:

The ultimate goal of big data architecture is to extract insights and value from the data that can help
organizations make better decisions.
To handle these characteristics, big data architecture employs distributed computing systems such as
Hadoop, Spark, and NoSQL databases. These systems are designed to distribute data and processing across
multiple nodes and servers, providing the scalability and fault tolerance required to handle big data.
Additionally, big data architecture often includes data warehousing, data integration, and data visualization
tools to enable efficient querying, analysis, and reporting of the data.

Area/Uses

• Transportation
• Advertising and Marketing
• Banking, Financial Services and Securities
• Government
• Media and Entertainment
• Meteorology
• Healthcare
• Cyber security
• Communications
• Insurance
• Retail and Wholesale trade
• etc.

Big Data importance and applications

Big data has become increasingly important in today's digital world, where massive amounts of data are
generated every second by individuals, businesses, and machines. Here are some of the key reasons why
big data is important:

Improved decision-making:
By analyzing large and complex datasets, organizations can make more informed and data-driven
decisions, leading to better outcomes.
Cost savings:
Big data analytics can help identify inefficiencies in business operations, leading to cost savings.

Improved customer experience:


Big data analytics can help companies understand customer behavior and preferences, enabling them to
provide personalized and relevant experiences.

Innovation
Big data can help companies identify emerging trends and opportunities, leading to new products, services,
and business models.

Competitive advantage
Companies that effectively leverage big data can gain a competitive advantage by understanding customer
needs, identifying new markets, and optimizing business processes.
Some of the key applications of big data include:

Healthcare
Big data analytics can help improve patient outcomes by identifying patterns and insights from medical
records, clinical trials, and patient feedback.

Finance
Big data analytics can help financial institutions detect fraud, assess credit risk, and optimize investment
strategies.

Marketing
Big data analytics can help companies target and personalize marketing messages based on customer
behavior and preferences.
Manufacturing:
Big data analytics can help optimize supply chain management, reduce downtime, and improve quality
control.

Transportation
Big data analytics can help optimize logistics and route planning, improve safety, and reduce fuel
consumption.

Big Data applications

1. Banking and Securities


2. Communications, Media and Entertainment
3. Healthcare Providers
4. Education
5. Government
6. Insurance
7. Retail and Wholesale trade
8. Transportation

Big Data Analytics

Big data analytics is the use of advanced analytic techniques against very large,
diverse data sets that include structured, semi-structured and unstructured data, from
different sources, and in different sizes from terabytes to zettabytes.

Types of Big Data Analytics


Big data analytics involves processing and analyzing large and complex datasets to extract valuable
insights, patterns, and trends. There are several types of big data analytics, each serving different purposes
based on the goals and requirements of the analysis.

Descriptive Analytics
What has happened, based on incoming data?

Purpose: Descriptive analytics focuses on summarizing and describing historical data to gain an
understanding of what has happened.
Examples: Reporting, dashboards, data visualization, and basic statistical analysis.

Diagnostic Analytics
Why did it happen?

Purpose: Diagnostic analytics aims to identify the reasons behind past events or trends. It
involves analyzing historical data to understand the causes of specific outcomes.
Examples: Root cause analysis, trend analysis, and correlation analysis.

Predictive Analytics
What might happen in the future?

Purpose: Predictive analytics involves using historical data and statistical algorithms to predict
future outcomes or trends.
Examples: Regression analysis, machine learning models, and forecasting.
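The idea behind predictive analytics can be shown with a toy regression: fit a least-squares line to past values and extrapolate one period ahead. The monthly sales figures below are invented for the illustration; real predictive work would apply libraries such as scikit-learn to far larger datasets:

```python
# Ordinary least-squares fit of y = slope*x + intercept, written out by hand.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

months = [1, 2, 3, 4]
sales = [10.0, 12.0, 14.0, 16.0]    # invented, perfectly linear history
slope, intercept = fit_line(months, sales)
forecast = slope * 5 + intercept    # predict month 5
```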

Prescriptive Analytics
What action should be taken?

Purpose: Prescriptive analytics goes beyond predicting future outcomes and recommends
actions to achieve desired results. It provides insights on what actions to take to optimize a
particular outcome.
Examples: Decision support systems, optimization algorithms, and simulation models.

Text Analytics (or Text Mining)

Purpose: Text analytics involves analyzing unstructured text data to extract insights, sentiment,
and patterns from documents, social media, and other textual sources.
Examples: Sentiment analysis, natural language processing (NLP), and text clustering.

Spatial Analytics

Purpose: Spatial analytics involves analyzing geographic or location-based data to understand
patterns, relationships, and trends associated with specific locations.
Examples: Geographic Information System (GIS) analysis, location-based recommendation
systems, and mapping.

Streaming Analytics

Purpose: Streaming analytics involves analyzing real-time data streams to gain insights and
make decisions as events unfold.
Examples: Real-time monitoring, fraud detection, and IoT data analysis.
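The core pattern of streaming analytics — updating a result as each event arrives, rather than re-reading a stored dataset — can be sketched with a rolling average. A production system would read the events from a platform such as Kafka or Flink; the window logic is the same idea. The sensor readings are invented:

```python
from collections import deque

class RollingAverage:
    """Average of the most recent `size` readings in a stream."""
    def __init__(self, size):
        self.window = deque(maxlen=size)   # oldest reading drops out automatically

    def update(self, value):
        self.window.append(value)
        return sum(self.window) / len(self.window)

avg = RollingAverage(3)
stream = [3, 6, 9, 12]                     # invented sensor readings
results = [avg.update(v) for v in stream]  # one answer per event, as it arrives
```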

Preservation Analytics

Purpose: Preservation analytics focuses on maintaining and ensuring the quality, reliability, and
integrity of data over time.
Examples: Data quality monitoring, data governance, and data lifecycle management.

Social Media Analytics

Purpose: Social media analytics involves analyzing data from social media platforms to
understand user behavior, sentiments, and trends.
Examples: Social network analysis, social media monitoring, and trend analysis on social
platforms.

Video Analytics

Purpose: Video analytics involves analyzing video data to extract insights, patterns, and
information from visual content.
Examples: Video surveillance analysis, facial recognition, and object detection.
Big Data features –
security, compliance, auditing and protection, Big Data privacy and ethics, Big Data Analytics, Challenges
of conventional systems, intelligent data analysis, nature of data, analytic processes and tools, analysis vs
reporting, modern data analytic tools

Big Data Features

Security
Big data security involves protecting data from unauthorized access, modification, or destruction. It
includes measures such as encryption, access controls, and firewalls.

Compliance and Auditing


Big data systems need to comply with various regulations such as HIPAA, GDPR, etc. Auditing ensures
that data is being handled according to regulatory requirements and company policies.

Privacy and Ethics


Big data privacy concerns include collecting, storing, and processing personal information. Ethical
considerations involve ensuring fairness, transparency, and avoiding biases in the analysis of data.

Analytics
Big data analytics involves using data to discover patterns, insights, and trends. This can help organizations
make better decisions, optimize operations, and improve customer experiences.

Challenges of conventional systems


Big data presents challenges for conventional systems due to the volume, variety, and velocity of data.
Traditional systems are not equipped to handle these large amounts of data and may require additional
resources or specialized tools.

Intelligent data analysis


Intelligent data analysis involves using machine learning algorithms and artificial intelligence techniques to
analyze data. This can help automate the analysis process and provide more accurate and relevant insights.

Nature of Data
Big data is characterized by its size, complexity, and variety. It includes structured, semi-structured, and
unstructured data from various sources such as social media, sensors, and IoT devices.

Analytic processes and tools


Big data analytics requires specialized tools and processes for data ingestion, storage, processing, and
analysis. These include Hadoop, Spark, NoSQL databases, and data visualization tools.

Analysis vs Reporting
Big data analysis involves exploring data to uncover insights and make predictions, while reporting
involves presenting data in a static format. Analysis is more interactive and iterative, while reporting is
more static and less flexible.

Tools Used In Big Data analytics

Hadoop –
Framework to store and process big data in a distributed, parallel fashion.
MongoDB –
Deals with large amounts of unstructured data.
Talend –
Software and services for data integration, data management, and data storage.
Cassandra –
Manages large amounts of data with real-time processing.
Spark –
General-purpose engine for fast data processing.
Storm –
Real-time data processing.
Kafka –
Distributed streaming data platform (originally developed at LinkedIn).
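The division of labour inside Hadoop's MapReduce model can be sketched in miniature: a map phase turns each input record into (key, value) pairs, and a reduce phase merges the values per key. On a real cluster both phases run in parallel across many nodes; here they run sequentially on a toy word count with invented input:

```python
from collections import Counter
from itertools import chain

def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word, 1) for word in record.split()]

def reduce_phase(pairs):
    """Reduce: sum the counts for each distinct word."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

records = ["big data big insight", "data platform"]   # invented input records
word_counts = reduce_phase(chain.from_iterable(map_phase(r) for r in records))
```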

Modern data analytic tools


Modern data analytic tools are designed to handle big data and provide advanced analytics
capabilities. These include data warehouses, data lakes, and cloud-based analytics platforms such as
AWS Redshift, Google BigQuery, and Microsoft Azure.

Intelligent Data Analysis (IDA)

Intelligent Data Analysis (IDA) refers to the process of extracting useful information and knowledge
from large and complex datasets using advanced techniques and technologies.

It involves the application of artificial intelligence (AI), machine learning (ML), and other computational
methods to analyze data and make informed decisions.

Key Components of Intelligent Data Analysis (IDA)

• Data Collection and Preprocessing
• Feature Extraction
• Modelling and Analysis
• Pattern Recognition
• Decision Making
• Visualization
Data Collection and Preprocessing
Gathering relevant data from various sources and preparing it for analysis. This step involves
cleaning, transforming, and organizing the data to ensure its quality and suitability for analysis.
Feature Extraction
Identifying and selecting relevant features (variables) from the dataset that are essential for the
analysis. This step helps reduce dimensionality and focuses on the most important aspects of the
data.
Modeling and Analysis
Employing advanced algorithms and statistical techniques to build models that can uncover patterns,
relationships, and trends within the data. This often involves the use of machine learning algorithms such
as clustering, classification, regression, and others.
Pattern Recognition
Identifying meaningful patterns and structures in the data that can provide useful information.
This can include the detection of anomalies, trends, correlations, and other important relationships.
Decision Making
Using the generated insights to make informed decisions and predictions. Intelligent Data Analysis
can be applied in various domains, such as business, healthcare, finance, and more, to support
decision-making processes.
Visualization
Presenting the results in a visual and interpretable format, making it easier for stakeholders to
understand and act upon the findings. Visualization tools help in conveying complex information in
a more accessible manner.
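Several of the IDA steps above can be strung together in a toy pipeline: preprocessing (dropping missing readings), modelling (computing mean and standard deviation), and pattern recognition (flagging outliers). The sensor readings and the 2-sigma threshold are assumptions made for this sketch:

```python
import statistics

def detect_anomalies(readings, threshold=2.0):
    clean = [r for r in readings if r is not None]        # preprocessing
    mean = statistics.mean(clean)                         # modelling
    sd = statistics.stdev(clean)
    # pattern recognition: readings far from the mean are anomalies
    return [r for r in clean if abs(r - mean) > threshold * sd]

readings = [10, 11, None, 9, 10, 11, 9, 10, 100]          # invented sensor data
anomalies = detect_anomalies(readings)
```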

Nature of Data and Nature of Big data

Nature of data refers to the inherent characteristics and properties of information that is collected,
processed, and analyzed within various contexts. Understanding the nature of data is essential for
effectively managing, interpreting, and utilizing information.

The nature of big data, by contrast, refers to the defining characteristics and properties that distinguish
big data from traditional data. Big data is characterized by the five Vs: Volume, Value, Variety,
Velocity, and Veracity.

Big Data Privacy


Big data privacy is about protecting and maintaining individuals' data privacy throughout:
• Data Collection
• Data Storage
• Data Sharing
Big Data Ethics
Big data ethics means ensuring the ethical use of data in the context of big data analytics.
• Informed Consent
• Transparency
• Fairness and Bias
• Accountability
• Legal Compliance
• Continuous Monitoring and Auditing

1. Privacy Concerns:
Data Collection:
Big data often involves the collection of extensive and diverse datasets. Privacy concerns arise
when personally identifiable information (PII) is included without adequate consent.
Data Storage
Safeguarding data during storage is crucial to prevent unauthorized access and data breaches.
Encryption and access controls are common measures.
Data Sharing
Sharing data among organizations or third parties for collaborative projects can pose privacy risks.
Organizations must ensure proper agreements and safeguards are in place.

2. Ethical Considerations:
Informed Consent
Ethical data practices involve obtaining informed consent from individuals before collecting
and using their data. Users should be aware of how their data will be utilized.
Transparency
Organizations should be transparent about their data practices, providing clear information
on data collection, storage, and usage policies.
Fairness and Bias
Addressing biases in data and algorithms is crucial to ensure fair and unbiased outcomes.
Biases in data can lead to unfair treatment of certain groups.
Accountability
Organizations should be accountable for the consequences of their data practices. This
includes taking responsibility for any negative impacts on individuals or groups.

3. Legal Compliance:

Data Protection Regulations:


Adhering to data protection laws and regulations, such as the General Data Protection
Regulation (GDPR), is essential. These regulations outline rules for the collection,
processing, and storage of personal data.
Cross-Border Data Flows
Big data initiatives often involve international data transfers. Organizations need to comply
with regulations governing cross-border data flows and ensure data sovereignty.

4. Anonymization and De-identification:

Anonymization
Removing or encrypting personally identifiable information to protect individuals' identities.
However, achieving true anonymity can be challenging.
De-identification
Transforming data to make it less identifying while maintaining its utility. This involves
techniques like pseudonymization.
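Pseudonymization can be sketched with a keyed hash: the same identifier always maps to the same token, so records stay linkable for analysis while the raw value is not exposed. The identifiers and key below are invented, and a real deployment would protect the key carefully and may need stronger guarantees to count as de-identified under regulations like the GDPR:

```python
import hashlib

def pseudonymize(identifier, secret_key):
    """Replace a direct identifier with a short keyed SHA-256 token."""
    return hashlib.sha256((secret_key + identifier).encode()).hexdigest()[:16]

# Same input + same key -> same token, so joins across datasets still work.
token_a = pseudonymize("alice@example.com", "demo-key")   # invented values
token_b = pseudonymize("alice@example.com", "demo-key")
token_c = pseudonymize("bob@example.com", "demo-key")
```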

5. Data Governance and Security:

Data Governance:
Implementing strong data governance practices to ensure responsible and ethical data
management throughout its lifecycle.
Security Measures
Employing robust security measures to protect data from unauthorized access, breaches, and
cyber threats.

6. Continuous Monitoring and Auditing


Monitoring
Regularly monitoring data practices and security measures to identify and address potential
risks.
Auditing
Conducting audits to assess compliance with privacy policies, ethical standards, and legal
requirements.

The big challenges of Big Data

Big data brings big benefits, but it also brings big challenges, such as new privacy and security concerns,
accessibility for business users, and choosing the right solutions for your business needs.

Making big data accessible.


Collecting and processing data becomes more difficult as the amount of data grows.
Organizations must make data easy and convenient for data owners of all skill levels to use.

Maintaining quality data.


With so much data to maintain, organizations are spending more time than ever before scrubbing
for duplicates, errors, absences, conflicts, and inconsistencies.

Keeping data secure.


As the amount of data grows, so do privacy and security concerns. Organizations will need to
strive for compliance and put tight data processes in place before they take advantage of big data.

Finding the right tools and platforms.


New technologies for processing and analyzing big data are developed all the time. Organizations
must find the right technology to work within their established ecosystems and address their
particular needs. Often, the right solution is also a flexible solution that can accommodate future
infrastructure changes.

Analysis vs Reporting

Analysis:
• An examination of large and complex datasets to extract meaningful patterns, trends, and insights.
• Involves tasks such as data preprocessing, exploratory data analysis, and employing distributed
computing frameworks.
• Conducted by data scientists, analysts, and other technical professionals.
• Uses big data technologies and frameworks, such as Apache Hadoop and Spark.

Reporting:
• The summarizing and presenting of the results of big data analysis in a coherent and accessible
manner.
• Organizes the analytical results and creates visualizations.
• Prepared for non-technical stakeholders, executives, and decision-makers.
• Uses visualization tools (e.g., Tableau, Power BI), presentation software, and reporting tools.
