Chapter Two
DATA SCIENCE
After completing this chapter, the students will be able to:
2.1. What are data and information?
Data can be defined as a representation of facts, concepts, or instructions in a formalized manner suitable for communication, interpretation, or processing by humans or electronic machines.
It can be described as unprocessed facts and figures.
It is represented with the help of characters such as letters (A-Z, a-z), digits (0-9), or special characters (+, -, /, *, <, >, =, etc.).
Information is processed data on which decisions and actions are based.
It is data that has been processed into a form that is meaningful to the recipient and is of real or perceived value in the recipient's current or prospective actions or decisions.
Information is interpreted data, created from organized, structured, and processed data in a particular context.
2.2. Data Processing Cycle
Data processing is the restructuring or reordering of data by people or machines to increase its usefulness and add value for a particular purpose.
Data processing consists of the following basic steps - input, processing, and output.
Input - in this step, the input data is prepared in some convenient form for processing.
The form depends on the processing machine.
For example, when electronic computers are used, the input data can be recorded on any of several types of storage media, such as a hard disk, CD, or flash disk.
Processing - in this step, the input data is changed to produce data in a more useful form.
For example, interest can be calculated on a deposit to a bank, or a summary of sales for the month can be calculated from the sales orders.
Output - at this stage, the result of the preceding processing step is collected.
For example, output data may be payroll for employees.
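As a minimal illustration, the three steps can be sketched in Python (the deposit records and the 5% interest rate below are invented for illustration):

# Input: prepare raw deposit records in a convenient form (here, hard-coded).
deposits = [("alice", 1000.0), ("bob", 2500.0)]

# Processing: change the input data into a more useful form,
# e.g. compute yearly interest at an assumed 5% rate.
RATE = 0.05
interest = [(name, amount * RATE) for name, amount in deposits]

# Output: collect and present the result of the processing step.
for name, earned in interest:
    print(f"{name} earned {earned:.2f} in interest")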
2.3. Data types and their representation
I. Data types from a computer programming perspective
Common data types in programming languages include:
Integers (int) - used to store whole numbers, mathematically known as integers
Booleans (bool) - used to store values restricted to one of two values: true or false
Characters (char) - used to store a single character
Floating-point numbers (float) - used to store real numbers
Alphanumeric strings (string) - used to store a combination of characters and numbers
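A short Python sketch of these types (Python has no separate char type, so a one-character string stands in; the names and values are illustrative):

age: int = 25            # integer: a whole number
is_student: bool = True  # boolean: true or false
grade: str = "A"         # character: a single character
height: float = 1.75     # floating-point: a real number
name: str = "Student42"  # string: letters and digits combined
print(age, is_student, grade, height, name)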
II. Data types from a data analytics perspective
From a data analytics point of view, there are three common data types or structures:
Structured data,
Semi-structured data, and
Unstructured data.
A fourth related element, metadata, is described below as well.
A. Structured Data
Structured data is data that adheres to a pre-defined data model and is therefore straightforward to analyze.
Structured data conforms to a tabular format with a relationship between the different rows and columns.
Common examples of structured data are Excel files or SQL databases.
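As a hedged sketch, a tiny table built with pandas, a widely used Python library (the columns and values below are invented):

import pandas as pd

# Structured data: every row follows the same pre-defined columns.
customers = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "name": ["Abebe", "Sara", "Lensa"],
        "city": ["Addis Ababa", "Adama", "Hawassa"],
    }
)
print(customers)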
B. Semi-structured Data
Semi-structured data is a form of structured data that does not conform to the formal structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data.
Therefore, it is also known as a self-describing structure.
JSON and XML are common examples of semi-structured data.
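A minimal sketch, using Python's standard json module, of how a self-describing JSON record is parsed (the record itself is invented):

import json

# Semi-structured data: tags name the fields and records may nest,
# but there is no fixed relational schema.
record = '{"name": "Sara", "skills": ["SQL", "Python"], "address": {"city": "Adama"}}'

person = json.loads(record)
print(person["name"])             # Sara
print(person["address"]["city"])  # Adama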
C. Unstructured Data
Unstructured data is information that either does not have a predefined data model or is not organized in a pre-defined manner.
Unstructured information is typically text-heavy but may contain data such as dates, numbers, and facts as well.
This results in irregularities and ambiguities that make unstructured data difficult to process with traditional programs, compared to data stored in structured databases.
Common examples of unstructured data include audio, video files or NoSQL databases.
D. Metadata
It is one of the most important elements for Big Data analysis and big data solutions.
Metadata is data about data.
It provides additional information about a specific set of data.
For example, in a set of photographs, metadata could describe when and where the photos were taken.
The metadata then provides fields for dates and locations which, by themselves, can be considered structured data.
For this reason, metadata is frequently used by Big Data solutions for initial analysis.
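A minimal Python sketch of photo metadata as structured fields (the file name and values are hypothetical):

# Metadata: data about data. The photo itself is unstructured,
# but these descriptive fields are structured and easy to query.
photo_metadata = {
    "file": "IMG_0042.jpg",    # hypothetical file name
    "taken_on": "2023-05-14",  # when the photo was taken
    "location": "Bahir Dar",   # where it was taken
}
print(photo_metadata["taken_on"])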
2.4. Data Value Chain
The Data Value Chain describes the information flow within a big data system as a series of steps needed to generate value and useful insights from data.
The Big Data Value Chain identifies the following key high-level activities:
Data Acquisition
Data Analysis
Data Curation
Data Storage
Data Usage
A) Data Acquisition
It is the process of gathering, filtering, and cleaning data before it is put in a data warehouse or any other storage solution
on which data analysis can be carried out.
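A hedged Python sketch of this step using pandas (the file names and columns are assumptions for illustration):

import pandas as pd

# Gather: read raw data from a source file (hypothetical name and columns).
raw = pd.read_csv("sales.csv")

# Filter: keep only the rows relevant to the analysis.
recent = raw[raw["year"] >= 2020]

# Clean: drop rows with missing values before storage.
clean = recent.dropna()

# Store the cleaned data for later analysis.
clean.to_csv("sales_clean.csv", index=False)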
B) Data Analysis
It is concerned with making the raw data acquired amenable to use in decision-making as well as domain-specific usage.
Data analysis involves exploring, transforming, and modeling data with the goal of highlighting relevant data.
It also involves synthesizing and extracting useful hidden information with high potential from a business point of view.
Related areas include data mining, business intelligence, and machine learning.
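Continuing the same hypothetical sales data, a minimal exploration and transformation sketch:

import pandas as pd

clean = pd.read_csv("sales_clean.csv")  # hypothetical cleaned file

# Explore: summary statistics highlight the shape of the data.
print(clean.describe())

# Transform/model: aggregate sales per region to surface a
# business-relevant pattern hidden in the raw rows.
per_region = clean.groupby("region")["amount"].sum()
print(per_region.sort_values(ascending=False))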
C) Data Curation
It is the active management of data over its life cycle to ensure it meets the necessary data quality requirements for its
effective usage.
Data curation processes can be categorized into different activities such as content creation, selection, classification,
transformation, validation, and preservation.
Data curation is performed by expert curators who are responsible for improving the accessibility and quality of data.
Data curators (also known as scientific curators or data annotators) hold the responsibility of ensuring that data are trustworthy, discoverable, accessible, reusable, and fit for their purpose.
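One curation activity, validation, can be sketched in Python (the quality rule and records are invented for illustration):

# Validation: check that each record meets basic quality rules
# before it is accepted into the curated collection.
records = [
    {"id": 1, "amount": 120.0},
    {"id": 2, "amount": -5.0},  # invalid: negative amount
    {"id": 3},                  # invalid: missing field
]

def is_valid(record: dict) -> bool:
    return "amount" in record and record["amount"] >= 0

curated = [r for r in records if is_valid(r)]
print(f"kept {len(curated)} of {len(records)} records")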
D) Data Storage
It is the persistence and management of data in a scalable way for fast access to the data.
An example of data storage is a Relational Database Management System (RDBMS), which guarantees the ACID properties (Atomicity, Consistency, Isolation, and Durability).
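A minimal sketch of RDBMS storage with Python's built-in sqlite3 module; the table and rows are invented, and the transaction illustrates the atomicity and durability parts of ACID:

import sqlite3

conn = sqlite3.connect("example.db")  # hypothetical database file
conn.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount REAL)")

# Both inserts below belong to one transaction: they are made
# durable together by commit() or undone together by rollback().
try:
    conn.execute("INSERT INTO sales VALUES ('North', 120.0)")
    conn.execute("INSERT INTO sales VALUES ('South', 80.0)")
    conn.commit()  # atomic: both rows persist, or neither does
except sqlite3.Error:
    conn.rollback()
finally:
    conn.close()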
E) Data Usage
It covers the data-driven business activities that need access to data, its analysis, and the tools needed to
integrate the data analysis within the business activity.
Data usage in business decision making can enhance competitiveness through the reduction of costs,
increased added value, or any other parameter that can be measured against existing performance criteria.
2.5. Basic concepts of big data
What Is Big Data?
Big data is the term for a collection of data sets so large and complex that it becomes difficult to process
using on-hand database management tools or traditional data processing applications.
In this context, a “large dataset” means a dataset too large to reasonably process or store with traditional tooling or on a single computer; that is, data that exceeds the computing power or storage of a single machine.
This means that the common scale of big datasets is constantly shifting and may vary significantly from organization to organization. Big data is characterized by:
Volume: large amounts of data, on the scale of zettabytes/massive datasets (data at rest)
Velocity: data is live, streaming, or in motion (data in motion)
Variety: data comes in many different forms from diverse sources (data in many forms)
Veracity: can we trust the data? How accurate is it? (data in doubt)
Clustered Computing
Computer clusters are needed to store and process big data.
Big data clustering software combines the resources of many smaller machines, seeking to provide a number of benefits:
Resource Pooling: combining the available storage space, CPU, and memory of the member machines to hold and process data.
High Availability: Clusters can provide varying levels of fault tolerance and availability guarantees to prevent hardware
or software failures from affecting access to data and processing.
Easy Scalability: Clusters make it easy to scale horizontally by adding additional machines to the group.
Hadoop and its Ecosystem
Hadoop is an open-source framework used for managing cluster membership, coordinating resource sharing, and
scheduling actual work on individual nodes.
It is a framework that allows for the distributed processing of large datasets across clusters of computers using simple
programming models.
It is inspired by technical papers published by Google on MapReduce and the Google File System.
The four key characteristics of Hadoop are:
Economical: Its systems are highly economical as ordinary computers can be used for data processing.
Reliable: It is reliable as it stores copies of the data on different machines and is resistant to hardware
failure.
Scalable: It is easily scalable, both horizontally and vertically.
Flexible: It is flexible, and one can store as much structured and unstructured data as needed.
Hadoop ecosystem
The Hadoop ecosystem evolved from its four core components: data management, access, processing, and storage.
It comprises the following components, among many others:
HDFS: Hadoop Distributed File System
YARN: Yet Another Resource Negotiator
MapReduce: programming-based data processing (see the word-count sketch after this list)
Spark: in-memory data processing
PIG, HIVE: query-based processing of data services
HBase: NoSQL database
Mahout, Spark MLlib: machine learning algorithm libraries
Solr, Lucene: searching and indexing
Zookeeper: cluster management
Oozie: job scheduling
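As a hedged illustration of the MapReduce model named above, here is the classic word-count example written as a pure-Python simulation of the map, shuffle, and reduce phases (not tied to a real cluster; the input lines are invented):

from collections import defaultdict

lines = ["big data is big", "data is everywhere"]  # stand-in input

# Map phase: emit a (word, 1) pair for every word in every line.
pairs = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group the emitted pairs by key (the word).
groups = defaultdict(list)
for word, count in pairs:
    groups[word].append(count)

# Reduce phase: sum the counts for each word.
totals = {word: sum(counts) for word, counts in groups.items()}
print(totals)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}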
2.6. Big Data Life Cycle with Hadoop
I. Ingesting data into the system
The data is ingested or transferred to Hadoop from various sources such as relational databases, systems, or local files.
Sqoop transfers data from RDBMS to HDFS, whereas Flume transfers event data.
II. Processing the data in storage
In this stage, the data is stored and processed.
The data is stored in the distributed file system HDFS and in the NoSQL distributed database HBase; processing engines such as Spark can then work on the stored data.
III. Computing and analyzing data
Here, the data is analyzed by processing frameworks such as Pig, Hive, and Impala.
Pig converts the data using map and reduce operations and then analyzes it. Hive is also based on map and reduce programming and is most suitable for structured data.
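A hedged sketch of this analysis step using PySpark (assumes an available Spark installation; the HDFS path and column names are invented):

from pyspark.sql import SparkSession

# Start (or reuse) a Spark session on the cluster.
spark = SparkSession.builder.appName("sales-analysis").getOrCreate()

# Read structured data previously stored in HDFS (hypothetical path).
df = spark.read.csv("hdfs:///data/sales_clean.csv", header=True, inferSchema=True)

# Analyze: aggregate sales per region across the cluster.
df.groupBy("region").sum("amount").show()

spark.stop()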
IV. Visualizing the results
In this stage, the analyzed data can be accessed by users.
END CH2