Unit 1 - Lecture 1.2,3 - Data Science & Big Data


Course: DSA Google Classroom: efxznbm

Programme: M.Sc CS Unit: I


Hour : 2

Data Science and Big Data

VISION & MISSION
VISION
To provide excellent infrastructure and
educational prospects through
innovation, creativity, and integrity,
fostering holistic student excellence in
academic and career pursuits.
MISSION
Empowering the nation by imparting
students with a broad spectrum of
knowledge, attitudes, skills, and
practices, while integrating the latest
trends in advancing technologies.
Unit I- SYLLABUS
• Introduction to Data Science: data science and big data
– facets of data – data science process – ecosystem – the
data science process – six steps – machine learning
Recap of the Previous Session
•Introduction to Data Science
•Need for Data Science
•Types

Business Intelligence (BI)
• “The processes, technologies and tools needed to turn data
into information, information into knowledge, and
knowledge into plans that drive profitable business action.”

• BI encompasses data warehousing, business analytics and
knowledge management.
OLAP vs. Business Intelligence

Online analytical processing (OLAP)

• It is an approach to quickly answer multi-dimensional analytical
queries.

• OLAP is part of the broader category of business intelligence,
which also encompasses reporting, data mining, and analytics.
The Challenges of Building BI Solutions

• There are several issues inherent to any BI project:
• Data exists in multiple places
• Data is not formatted to support complex analysis
• Different kinds of workers have different data needs
• What data should be examined, and in what detail?
• How will users interact with that data?
Consolidation of Data

• The process of consolidating data means moving it, making it
consistent, and cleaning up the data as much as possible
• Data is frequently stored in different formats
• Data is frequently inconsistent between sources
• Data may be dirty
• Values may be internally inconsistent or missing
Disparate Data

• Data in a variety of locations and formats:
• Relational databases (operational data systems)
• XML files
• Desktop databases
• Microsoft Excel spreadsheets

• The data may also be in databases on different operating
systems and hardware platforms
Inconsistent Data

• Data may be inconsistent
• Two plants might have different part numbers for the same
physical part
• To represent True and False, one system may use 1 and 0, while
another system may use T and F
• Systems in different countries will likely record sales in their
local currency
• These sales must be converted to a common currency
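
The True/False and currency examples above can be sketched in a few lines of Python. This is only an illustration: the accepted encodings, the exchange rates, and the sample rows are all made-up assumptions, and in practice the business would supply them.

```python
# Assumed encodings for True/False seen across source systems (hypothetical).
TRUE_VALUES = {"1", "t", "true"}
FALSE_VALUES = {"0", "f", "false"}

# Made-up exchange rates to a common currency (USD); illustration only.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.10, "INR": 0.012}


def normalize_bool(raw):
    """Map 1/0, T/F and similar encodings to a Python bool."""
    text = str(raw).strip().lower()
    if text in TRUE_VALUES:
        return True
    if text in FALSE_VALUES:
        return False
    raise ValueError(f"unrecognized boolean value: {raw!r}")


def to_usd(amount, currency):
    """Convert a local-currency amount to USD using the assumed rates."""
    return round(amount * RATES_TO_USD[currency], 2)


# Hypothetical rows from three plants, each with its own conventions.
rows = [
    {"plant": "A", "shipped": 1, "sales": 100.0, "currency": "USD"},
    {"plant": "B", "shipped": "T", "sales": 200.0, "currency": "EUR"},
    {"plant": "C", "shipped": "F", "sales": 5000.0, "currency": "INR"},
]

# After cleaning, every row uses the same boolean type and currency.
clean = [
    {
        "plant": r["plant"],
        "shipped": normalize_bool(r["shipped"]),
        "sales_usd": to_usd(r["sales"], r["currency"]),
    }
    for r in rows
]
```

Raising an error on an unknown value, rather than guessing, keeps unexpected encodings visible so that someone can decide how to handle them.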
Data Quality Issues
• Clean data facilitates more accurate analysis

• Many data entry systems allow free-form data entry of text
values
• For example, the same city might be entered as
Louisville, Lewisville, and Luisville

• Routines to clean up data need to take into account all
possible variations of bad data
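
One possible clean-up routine for the city example is approximate string matching against a reference list. The sketch below uses Python's standard `difflib` module; the reference list and the 0.75 similarity cutoff are assumptions chosen for illustration, not recommended values.

```python
import difflib

# Assumed list of valid city names (hypothetical reference data).
CANONICAL_CITIES = ["Louisville", "Chennai"]


def clean_city(raw, cutoff=0.75):
    """Return the closest canonical city name, or None if nothing is close."""
    lookup = {city.lower(): city for city in CANONICAL_CITIES}
    matches = difflib.get_close_matches(
        raw.strip().lower(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[matches[0]] if matches else None


# The three spellings from the slide, plus a value with no close match.
entered = ["Louisville", "Lewisville", "Luisville", "Mumbai"]
cleaned = [clean_city(name) for name in entered]
```

Fuzzy matching can also merge names that are genuinely different places, so values it changes (and values it returns `None` for) are usually worth a manual review.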
Extraction, Transformation, and Loading (ETL)

• The process of data consolidation is often called Extraction,
Transformation, and Loading (ETL)
• The ETL process extracts data from the various source systems
• Data is then transformed to make it consistent and improve data
quality
• The consolidated, consistent, and cleaned data is then loaded into
a data repository

• Developing the ETL process often consumes 80% of the
development time
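
The three ETL steps can be sketched with Python's standard library alone. Everything here is a stand-in: the two in-memory CSV "sources", their column names, and the `sales` table are hypothetical, and an in-memory SQLite database plays the role of the data repository.

```python
import csv
import io
import sqlite3

# Two hypothetical source systems with different layouts and conventions.
SOURCE_A = "part,active,sales\nP-100,1,250.0\nP-200,0,75.5\n"
SOURCE_B = "part_no;is_active;amount\nP-300;T;120.0\n"


def extract(text, delimiter):
    """Extract: read delimited text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def transform(row, part_key, active_key, sales_key):
    """Transform: unify column names, boolean encoding, and number types."""
    return (
        row[part_key].strip(),
        1 if row[active_key].strip().upper() in {"1", "T"} else 0,
        float(row[sales_key]),
    )


def load(conn, records):
    """Load: write the cleaned records into the repository table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (part TEXT, active INTEGER, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
records = [transform(r, "part", "active", "sales") for r in extract(SOURCE_A, ",")]
records += [
    transform(r, "part_no", "is_active", "amount") for r in extract(SOURCE_B, ";")
]
load(conn, records)

# Once loaded, one query can answer questions across both sources.
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
```

Most of the code is in the transform step, which matches the point above: making sources consistent is where most of the effort goes.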
Extraction, Transformation, and Loading (ETL) Tools

• Some ETL Tools
• Oracle Data Integrator (ODI)
• Informatica
• IBM DataStage (formerly Ascential DataStage)
• Ab Initio
Business Issues with Data Consolidation
• Business users must drive what should be in the data
warehouse

• Someone in the business must decide how to consolidate
inconsistent data
• If True is 1 in one system and T in another, what should the
value be once the data is consolidated from the two
systems?

• The business must decide how to handle other necessary
items, such as currency conversions
Supporting Different Types of Users

• One of the great benefits of BI is that it can support the data
needs of the entire business
• This support comes from the many different ways that users
can consume BI data

• Different tools exist to support these different data needs
The Users of Business Intelligence

• Executives and business decision makers look at the business
from a high level, performing limited analysis

• Analysts perform complex, detailed data analysis

• Information workers need static reports or limited analytic
power

• Line workers need no analytic capabilities, as BI is presented to
them as part of their job
The Approaches to Consuming Business Intelligence
• Scorecards
• Customized high-level views with limited analytic capabilities

• Reports
• Standardized reports aimed at a large audience, with no or
limited analytic capabilities

• Analytics Applications
• Applications designed to allow complex data analysis

• Custom Applications
• Embed BI data within an application
The Components of a Data Warehouse

• There are several items that make up a data warehouse
• Cubes
• Measures
• Key Performance Indicators
• Dimensions
• Attributes
• Hierarchies
Cubes
• Cubes are the structures in which data is stored

• Users access data in the cubes by navigating through various
dimensions
Measures

• Measures are what you want to see

• They are almost always numeric

• They are often additive
• Dollar sales, unit sales, profit, expenses, and more

• Some measures are not additive
• Date of last shipment
• Inventory counts and number of unique customers
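
The difference between additive and non-additive measures shows up as soon as monthly figures are rolled up. In this small sketch (the sales rows are invented for illustration), monthly sales add up to the correct total, but monthly unique-customer counts do not.

```python
# Hypothetical sales facts: one row per sale.
sales = [
    {"month": "Jan", "customer": "C1", "amount": 100},
    {"month": "Jan", "customer": "C2", "amount": 50},
    {"month": "Feb", "customer": "C1", "amount": 70},
]


def total_sales(rows):
    """Additive measure: can be summed across any dimension."""
    return sum(r["amount"] for r in rows)


def unique_customers(rows):
    """Non-additive measure: must be recomputed from the detail rows."""
    return len({r["customer"] for r in rows})


# Group the rows by the Month attribute.
by_month = {}
for row in sales:
    by_month.setdefault(row["month"], []).append(row)

# Summing monthly sales gives the true total.
sum_of_monthly_sales = sum(total_sales(rows) for rows in by_month.values())

# Summing monthly customer counts double-counts C1, who bought in both months.
sum_of_monthly_customers = sum(unique_customers(rows) for rows in by_month.values())
true_unique_customers = unique_customers(sales)
```

Here the summed customer count is 3 while the true number of unique customers is 2, which is why such measures need special handling when data is aggregated.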
Dimensions

• Dimensions are how you want to see the data

• You usually want to see data by time, geography, product,
account, employee, …

• Dimensions are made up of attributes and may or may not
include hierarchies
• Year – Semester – Quarter – Month – Day
• Product Category – Product Subcategory – Product
Attributes

• Attributes are individual values that make up dimensions
• A Time dimension may have a Month attribute, a Year
attribute, and so forth
• A Geography dimension may have a Country attribute, a
Region attribute, a City attribute, and so on
• A Product dimension may have a Part Number attribute, a
size attribute, a color attribute, a manufacturer attribute, and
more
Hierarchies
• You can put attributes into a hierarchical structure to assist
user analysis

• One of the most common functions in BI is to “drill down” to
a more detailed level

• For example, one Time hierarchy might go from Year to
Quarter to Month to Day

• Another Time hierarchy might go from Year to Month to
Week to Day to Hour
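
Drill-down along a hierarchy can be pictured as re-aggregating the same measure at a deeper level for one selected member. The sketch below assumes a simplified Year, Quarter, Month hierarchy and invented fact rows; `roll_up` is a hypothetical helper, not a function of any BI product.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, quarter, month, sales amount).
FACTS = [
    (2023, "Q1", "Jan", 100),
    (2023, "Q1", "Feb", 150),
    (2023, "Q2", "Apr", 200),
    (2024, "Q1", "Jan", 120),
]


def roll_up(facts, depth, path=()):
    """Sum the measure at hierarchy level `depth` for members under `path`.

    depth=1 groups by year, depth=2 by (year, quarter), depth=3 by
    (year, quarter, month). `path` restricts the result to one branch.
    """
    totals = defaultdict(int)
    for *members, amount in facts:
        if tuple(members[: len(path)]) == tuple(path):
            totals[tuple(members[:depth])] += amount
    return dict(totals)


by_year = roll_up(FACTS, depth=1)                        # top level
quarters_2023 = roll_up(FACTS, depth=2, path=(2023,))    # drill into 2023
months_q1 = roll_up(FACTS, depth=3, path=(2023, "Q1"))   # drill into Q1 2023
```

Each drill-down step keeps the same measure and only changes the level of the hierarchy at which it is summed.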
Summary
• The ETL process extracts data from source systems,
transforms it, and then loads it into a data warehouse or a
data mart.

• Using reports and dashboards, BI looks at data as a
collection of measures and KPIs viewed by dimensions.
Oracle DW/BI Products

• Oracle Business Intelligence Enterprise Edition (OBIEE) – mainly
based on Siebel technology.

• Oracle Hyperion Essbase

MCQ
1. Which is not a tool for statistical data analysis?
• 1. Logistic Regression
• 2. Linear & Non-linear Regression
• 3. Histogram
• 4. ANOVA
2. To find the ________ you put all numbers in order from least to greatest and find the
number that is in the middle.
• 1. Median
• 2. Mode
• 3. Mean
• 4. Range
MCQ
3. Data has been collected on visitors' viewing habits at a bank's website. Which
technique is used to identify pages commonly viewed during the same visit to the
website?
• 1. Clustering
• 2. Classification
• 3. Association Rules
• 4. Regression
MCQ
4. A graphical representation of a data set is referred to as a ______
• 1. Visualization
• 2. Data Set
• 3. Investigative Cycle
• 4. None
Points to Ponder
• Introduction
• Consolidating Data from Multiple Sources
• Supporting Different Types of Users
• Identifying Elements to Support Analysis
• NEXT CLASS LECTURE

DATA SCIENCE AND BIG DATA ANALYTICS
