Week-1 Introduction To BDDA-TWM PDF


BIG DATA & DATA ANALYTICS



1. RPS / Course Plan (14 weeks)
2. Online Lab Activity
3. Grade Percentage: Task 1 : UTS (midterm exam) : Task 2 : UAS (final exam) = 25% : 25% : 25% : 25%
4. Class Rules: Be Active, Be On Time (Before Time)
5. Class Coordinator (pick one)
6. Lab Activity Tool: Google Colab (Python)

COURSE OBJECTIVE
1. Understand the concepts, frameworks, opportunities, and challenges of Big Data
2. Understand the concepts, theory, and frameworks of Data Analytics activities
3. Be able to choose and perform Data Analytics activities suited to the business problem at hand
4. Be able to build descriptive and predictive models using available data

INTRODUCTION TO BIG DATA
& DATA ANALYTICS
Week 1 – EBI3B4 Big Data & Data Analytics
OUTLINE
o Introduction, Background of Big Data
o Big Data, Data Analytics, Data Science
o Big Data Properties
o Data Exponential Growth – Data Driven Decision Making
o Big Data Complexity
o Big Data Optimization and Trade Off
o Big Data Reduction Complexity Strategy

LARGE SCALE DATA


SOCIAL NETWORK DATA

BACKGROUND OF BIG DATA
1. We generate huge amounts of data (from user-generated content (UGC) and mobile habits to machine-generated data, sensors, and IoT)
2. Our society leaves a massive digital footprint (and with it, traces of our behavior and attitudes)
3. Finding unexpected patterns is exciting (and useful for predictive analytics)
4. In the 2020s, we are entering the AI era, which requires massive analytics efforts
Source: Superintelligence (Nick Bostrom)

Please see Video about Artificial Intelligence: https://www.youtube.com/watch?v=naPfusziGvA&t=866s


DEFINITION
Big Data, Data Analytics, Social Computing, Data Science

o Big Data: a term for data sets that are so large or complex that traditional data processing tools are inadequate to process them. The challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying, updating, and information privacy (Wikipedia)
o Data Analytics: the process of examining raw data in order to draw conclusions about that information. Data Analytics is used in many industries to help companies and organizations make better business decisions, and in the sciences to verify or disprove existing models or theories (Wikipedia)
For your reference see Video :
Big Data : https://www.youtube.com/watch?v=aC2CmTTZTVU

o Social Computing: an area of computer science concerned with the intersection of social behavior and computational systems. It is based on creating or recreating social conventions and social contexts through the use of software and technology (Wikipedia)
o Data Science: an interdisciplinary field about processes and systems to extract knowledge or insight from data in various forms, either structured or unstructured. It is a continuation of data analysis fields such as statistics, data mining, and predictive analytics (Wikipedia). It combines skills in math/statistics, computer science, and domain knowledge
For your reference see Video :
Data Science : https://www.youtube.com/watch?v=X3paOmcrTjQ
BIG DATA PROPERTIES
1. Volume: big data does not rely on sampling; it works with all the data collected
2. Velocity: big data is often available in real time
3. Variety: big data is drawn from text, images, audio, and video
4. Veracity: big data needs certainty, consistency, and integrity of data
5. Value: the benefit gained from the data

Veracity and Value are the "quality" of Big Data

Volume, Variety, and Velocity are the "essential" characteristics of Big Data
EXPONENTIAL DATA GROWTH
DATA DRIVEN DECISION MAKING
1. Data science involves principles, processes, and techniques for understanding phenomena via the (automated) analysis of data
2. The ultimate goal of data science is improving decision making, as this is generally of direct interest to business
3. Statistically, the more data-driven a firm is, the more productive it is, even controlling for a wide range of possible confounding factors. And the differences are not small: one standard deviation higher on the data-driven decision making (DDD) scale is associated with a 4%–6% increase in productivity.
4. DDD is also correlated with higher return on assets, return on equity, asset utilization, and market value, and the relationship seems to be causal

https://www.plutora.com/blog/data-driven-decision-making
4 types of analytics to create business & Opportunities

• Descriptive: analysis of historical data
• Diagnostic: use of historical data to identify a product failure pattern and determine the failure's root cause
• Predictive: use of modeling, data mining, and machine learning to analyze both real-time and historical data to predict and anticipate future events based on patterns found in the data
• Prescriptive: a suggested next step and/or decision is identified, evaluated, and can be automatically enabled

Data Analytics Example (in a supermarket):

1. Descriptive: total units of products A, B, C, and D sold. The retailer learns which products sell well / are popular
2. Predictive: people who buy product A mostly also buy product B. The retailer can predict future events
3. Prescriptive: recommending which product to buy based on the customer's profile / requirements, or recommending how to achieve a goal
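
The descriptive and predictive steps above can be sketched in a few lines of Python. This is only an illustration on a tiny made-up data set: the baskets and the names `baskets`, `units_sold`, and `also_bought` are assumptions, not part of the course material.

```python
from collections import Counter

# Hypothetical transactions: each set is one customer's basket.
baskets = [
    {"A", "B"},
    {"A", "B", "C"},
    {"A", "D"},
    {"B", "C"},
    {"A", "B", "D"},
]

# Descriptive: in how many baskets does each product appear?
units_sold = Counter(item for basket in baskets for item in basket)

def also_bought(product, other):
    """Predictive (simplified): share of baskets with `product` that also hold `other`."""
    with_product = [b for b in baskets if product in b]
    if not with_product:
        return 0.0
    return sum(other in b for b in with_product) / len(with_product)

print(units_sold.most_common())
print(also_bought("A", "B"))
```

A prescriptive step would go one further and act on the result, for example by recommending product B to a customer who has just put product A in the basket.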
Data Science
Based on the aforementioned definitions, we can conclude that Data Analytics includes:
• Data engineering
• Scientific Method
• Math
• Statistics

Data Engineering may include:
• Data Gathering
• Data Mining
• Data Transformation
• Data Cleansing
• etc.

Ref: many sources

DATA SCIENCE
BIG DATA / Data Science Taxonomy
(Body of Knowledge)
Name: Big Data Analytics
Knowledge Area: Statistical Methods, Machine Learning, Data Mining, Predictive Analytics, Computational Modelling / Simulation / Optimization
Scientific Subject: Computing Methodologies, Mathematics of Computing

Name: Big Data Engineering
Knowledge Area: Big Data Infrastructure & Technologies, Infrastructure & Platform for DS Apps, Cloud Computing Tech, Data & Apps Security, Big Data System Organization & Engineering, DS / Big Data Apps Design, IS to support DSS
Scientific Subject: Algorithm & Complexity, Architecture & Organization, Computational Science, Graphic & Visualization, Information Management, Platform Based Dev., Software Engineering

Name: Big Data Management
Knowledge Area: General Principle & Concepts in Data Management and Organization, Data Management Systems, Data Enterprise Infrastructure, Data Governance, Big Data Storage, Digital Library & Archives, Data Curation, Data Preservation
Scientific Subject: Data (Governance, Architecture, Model & Design, Storage & Operations, Security, Integration & Interoperability, Warehousing & BI, Quality), Metadata, Reference & Master Data
Big Data Analytics
COMPUTATIONAL COMPLEXITY

Computational complexity is measured with Big O notation


DATA STRUCTURE COMPLEXITY

COMPLEXITY THEORY

Two Important Dimensions


1. Space / Size
2. Time

See reference of Video : https://www.youtube.com/watch?v=waPQP2TDOGE
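
As a rough illustration of the time dimension, the sketch below counts the steps two search strategies take on the same sorted list. The function names and the one-million-item list are assumptions chosen for the example; both functions use only a constant amount of extra space.

```python
def linear_search_steps(sorted_values, target):
    """Scan left to right and count comparisons. Time grows as O(n)."""
    steps = 0
    for value in sorted_values:
        steps += 1
        if value == target:
            break
    return steps

def binary_search_steps(sorted_values, target):
    """Halve the search range on every probe and count probes. Time grows as O(log n)."""
    low, high = 0, len(sorted_values) - 1
    steps = 0
    while low <= high:
        steps += 1
        mid = (low + high) // 2
        if sorted_values[mid] == target:
            break
        if sorted_values[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return steps

data = list(range(1_000_000))
print(linear_search_steps(data, 999_999))  # one comparison per element scanned
print(binary_search_steps(data, 999_999))  # at most about log2(n) + 1 probes
```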


CYNEFIN FRAMEWORK (KIH-NEH-VIHN)
1. The framework provides a typology of contexts that guides what sort of explanations or solutions might apply.
2. It draws on research into complex adaptive systems theory, cognitive science, anthropology, and narrative patterns, as well as evolutionary psychology, to describe problems, situations, and systems.
3. It "explores the relationship between man, experience, and context" and proposes new approaches to communication, decision-making, policy-making, and knowledge management in complex social environments.
EXPLANATION
• The Cynefin framework has five domains. The first four are:
  • Obvious (called Simple until early 2014): the relationship between cause and effect is obvious to all. The approach is Sense - Categorize - Respond, and we can apply best practice.
  • Complicated: the relationship between cause and effect requires analysis, some other form of investigation, and/or the application of expert knowledge. The approach is Sense - Analyze - Respond, and we can apply good practice.
  • Complex: the relationship between cause and effect can only be perceived in retrospect, not in advance. The approach is Probe - Sense - Respond, and we can sense emergent practice.
  • Chaotic: there is no relationship between cause and effect at the systems level. The approach is Act - Sense - Respond, and we can discover novel practice.
• The fifth domain is Disorder: the state of not knowing what type of causality exists, in which people revert to their own comfort zone when making a decision.
• In full use, the Cynefin framework has sub-domains, and the boundary between Obvious and Chaotic is seen as a catastrophic one: complacency leads to failure.
THE CONCEPT OF OPTIMIZATION
▪ Management → efficient & effective
- Efficient → using resources sparingly (thrifty)
- Effective → attaining the goal (at all costs, imperative)

▪ To be both efficient and effective, use OPTIMIZATION.

See reference Video : https://www.youtube.com/watch?v=Q2dewZweAtU


OPTIMIZATION MODEL
Optimization in Computing Context:
What are the most frequently made trade-offs in computer science?
1. Time versus space
2. Complexity versus ease-of-understanding
3. Accuracy versus time/speed
4. Complexity versus time (or space)
5. Precision versus recall (in information retrieval)
6. Bias versus variance (in machine learning)
7. Integrity versus time (in transactional systems)
8. Speed of reads versus speed of writes
Optimization in Computing Context:
Trade-off in Time vs Space
• The two most common measures are:
1. Time: how long does the algorithm take to complete.
2. Space: how much working memory (typically RAM) is needed by the
algorithm. This has two aspects: the amount of memory needed by the
code, and the amount of memory needed for the data on which the code
operates.
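
One common way to see the time versus space trade-off is caching. The sketch below is an illustrative assumption, not taken from the slides: `fib_slow` keeps no table of results and recomputes subproblems, while `fib_cached` spends memory on stored results to save time.

```python
from functools import lru_cache

def fib_slow(n):
    """Keeps no table of results, so subproblems are recomputed: exponential time."""
    return n if n < 2 else fib_slow(n - 1) + fib_slow(n - 2)

@lru_cache(maxsize=None)
def fib_cached(n):
    """Stores each result once: O(n) extra memory buys O(n) time."""
    return n if n < 2 else fib_cached(n - 1) + fib_cached(n - 2)

print(fib_slow(20), fib_cached(20))
print(fib_cached.cache_info().currsize)  # number of results held in memory
```

Both functions return the same value; the cached version simply trades working memory for speed.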

A reference to watch :
https://www.youtube.com/watch?v=9YRw0Yk7N8c
https://www.youtube.com/watch?v=Du1q5oA7Cik
Optimization in Computing Context:
Speed vs Accuracy in Machine Learning Model

The two most common measures are:
1. Accuracy: a model can be highly accurate but too slow for real-time use
2. Speed: a model can be very fast but less accurate
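
The same tension shows up outside machine learning models. As a simplified analogue (not an ML model), the sketch below compares an exact mean over a full data set with a cheaper estimate from a random sample. The data set, the seed, and the function names are assumptions made for the example.

```python
import random

random.seed(42)  # fixed seed so the run is repeatable
values = [random.random() for _ in range(200_000)]  # hypothetical data set

def exact_mean(data):
    """Reads every value: most accurate, most work."""
    return sum(data) / len(data)

def sampled_mean(data, sample_size):
    """Reads only a random sample: less work, approximate answer."""
    sample = random.sample(data, sample_size)
    return sum(sample) / sample_size

truth = exact_mean(values)
for size in (100, 10_000):
    estimate = sampled_mean(values, size)
    print(size, abs(estimate - truth))  # error of the cheaper estimate
```

Larger samples typically land closer to the exact value but cost more time, which is the trade-off in miniature.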
EXPONENTIAL IN COMPUTER TECHNOLOGY
1. Processing power of computers. See also Moore's law and technological singularity. (Under exponential growth, there are no singularities. The singularity here is a metaphor, meant to convey an unimaginable future. The link between this hypothetical concept and exponential growth is most vocally made by transhumanist Ray Kurzweil.)
2. In computational complexity theory, computer algorithms of exponential complexity require an exponentially increasing amount of resources (e.g. time, computer memory) for only a constant increase in problem size. So for an algorithm of time complexity 2^x, if a problem of size x = 10 requires 10 seconds to complete, and a problem of size x = 11 requires 20 seconds, then a problem of size x = 12 will require 40 seconds. This kind of algorithm typically becomes unusable at very small problem sizes, often between 30 and 100 items (most computer algorithms need to be able to solve much larger problems, up to tens of thousands or even millions of items, in reasonable times, something that would be physically impossible with an exponential algorithm). Also, the effects of Moore's Law do not help the situation much, because doubling processor speed merely allows you to increase the problem size by a constant. E.g. if a slow processor can solve problems of size x in time t, then a processor twice as fast could only solve problems of size x + constant in the same time t. So exponentially complex algorithms are most often impractical, and the search for more efficient algorithms is one of the central goals of computer science today.
3. Internet traffic growth
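
The arithmetic in item 2 can be checked with a short sketch. It assumes the slide's figures (a 2^x algorithm that takes 10 seconds at x = 10); the function names and the one-hour budget are illustrative choices.

```python
import math

def runtime_seconds(x, seconds_at_10=10.0):
    """Running time of a 2**x algorithm, calibrated to 10 s at x = 10."""
    return seconds_at_10 * 2 ** (x - 10)

def largest_solvable_size(budget_seconds, speedup=1.0):
    """Largest x that finishes within the budget on a machine `speedup` times faster."""
    return 10 + math.floor(math.log2(budget_seconds * speedup / 10.0))

print([runtime_seconds(x) for x in (10, 11, 12)])
print(largest_solvable_size(3600), largest_solvable_size(3600, speedup=2))
```

With a one-hour budget, doubling the processor speed raises the largest solvable problem size by exactly one, which is the "x + constant" effect described above.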

MOORE’S LAW
• Moore's law is the observation that the number of transistors in a dense integrated circuit doubles approximately every two years.
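
Taken as a formula, the observation says a count N0 grows to N0 * 2^(t/2) after t years. A minimal sketch, assuming a hypothetical starting count of 1,000 transistors and an idealized, perfectly regular doubling period:

```python
def projected_transistors(initial_count, years, doubling_period_years=2.0):
    """Project a transistor count assuming one doubling every `doubling_period_years`."""
    return initial_count * 2 ** (years / doubling_period_years)

# Hypothetical starting point of 1,000 transistors.
for years in (0, 2, 10, 20):
    print(years, projected_transistors(1_000, years))
```

Ten years means five doublings, so the count grows 32-fold; twenty years means ten doublings, a 1,024-fold increase.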

COMPUTATIONAL
POWER

Choose what’s best for you
(or you may say Optimization)

LEVEL OF OPTIMIZATION
1. Design level
2. Algorithms and data structures (our interest for this course)
3. Source code level
4. Build level
5. Compile level
6. Assembly level
7. Run time

STRENGTH REDUCTION
• Computational tasks can be performed in several different ways with varying efficiency. A more efficient version with equivalent functionality is known as a strength reduction.
• For example, consider the following C code snippet, whose intention is to obtain the sum of all integers from 1 to N:

    int i, sum = 0;
    for (i = 1; i <= N; ++i) {
        sum += i;   /* N additions in total */
    }
    printf("sum: %d\n", sum);

• This code can (assuming no arithmetic overflow) be rewritten using a mathematical formula:

    int sum = N * (1 + N) / 2;   /* constant number of operations */
    printf("sum: %d\n", sum);
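
Since the labs for this course use Python, the same comparison can be sketched there as well. The function names are illustrative; Python integers do not overflow, so the overflow caveat for the C version does not apply here.

```python
def sum_by_loop(n):
    """Add 1..n one term at a time: n additions."""
    total = 0
    for i in range(1, n + 1):
        total += i
    return total

def sum_by_formula(n):
    """Closed form n(n + 1) / 2: a constant number of operations."""
    return n * (n + 1) // 2

print(sum_by_loop(100), sum_by_formula(100))
```

Both functions return the same result for every non-negative n, but the loop's running time grows with n while the formula's does not.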

Strength Reduction should…
1. Minimize space / size
2. Minimize time

Take app optimization as an example. Optimized apps have these characteristics:
1. Run faster (more efficient)
2. Take less storage space (for example, 1 GB before optimization, 0.9 GB after)
3. Preferably use less RAM
These characteristics also apply to algorithms.

THINGS GROW FAST: EXPONENTIALLY
• Exponential growth is a phenomenon that occurs when the growth rate of the value of a mathematical function is proportional to the function's current value, resulting in its growth with time being an exponential function.

Figure legend: green = exponential growth, red = linear growth, blue = cubic growth
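
To make the three curves concrete, the sketch below tabulates linear (n), cubic (n^3), and exponential (2^n) growth for a few sizes. The specific functions and sizes are assumptions picked for illustration.

```python
def growth_table(sizes):
    """Return (n, linear, cubic, exponential) rows for each n."""
    return [(n, n, n ** 3, 2 ** n) for n in sizes]

for row in growth_table([1, 5, 10, 20, 30]):
    print(row)

# First n >= 2 at which 2**n exceeds n**3; it stays ahead from there on.
crossover = next(n for n in range(2, 100) if 2 ** n > n ** 3)
print(crossover)
```

For small n the cubic curve is ahead, but the exponential one overtakes it at n = 10 and then pulls away rapidly.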

BORROW BEST PRACTICES FROM
MANAGEMENT KNOWLEDGE
How To Reduce Complexity In Five Simple Steps
1. Clear the underbrush: get rid of ambiguous rules, low-value activities, and time-wasters
2. Gain a clear perspective: focus on specific goals
3. Prioritize the most important things
4. Take the shortest path by eliminating loops and redundancies and making things leaner
5. Reduce levels

GRAPH DATABASE to Represent Complex Relationships in Data
• A graph database is a database that uses graph structures for semantic queries, with nodes, edges, and properties to represent and store data. A key concept of the system is the graph (or edge or relationship), which directly relates data items in the store. The relationships allow data in the store to be linked together directly and, in most cases, retrieved with a single operation.
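
The idea of nodes, edges, and properties can be sketched with plain Python dictionaries. This is an in-memory illustration only, not the API of any real graph database; every node name, relationship name, and function below is made up for the example.

```python
# Nodes with properties (all names are hypothetical).
nodes = {
    "alice": {"type": "customer"},
    "bob": {"type": "customer"},
    "card_1": {"type": "credit_card"},
}

# Edges stored next to the node they start from: (relationship, target node).
edges = {
    "alice": [("USES", "card_1"), ("KNOWS", "bob")],
    "bob": [("USES", "card_1")],
    "card_1": [],
}

def neighbours(node, relationship):
    """Follow one kind of relationship directly from a node, with no table join."""
    return [target for rel, target in edges[node] if rel == relationship]

def users_of(card):
    """Reverse lookup: which nodes have a USES edge pointing at this card?"""
    return sorted(n for n, out in edges.items() if ("USES", card) in out)

print(neighbours("alice", "KNOWS"))
print(users_of("card_1"))
```

In a relational database the same questions would usually need a join across a customer table, a card table, and a linking table; here the relationship is stored with the data item itself.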

Figure: conventional/legacy RDBMS vs. graph database

Video : https://www.youtube.com/watch?v=GM9bB4ytGao
CASE STUDY
• Big Data at Verizon Wireless (Phoenix Suns)
• Big Data at Schneider National
• Big Data at UPS
• Big Data at United Healthcare
• Big Data at Macys.com
• Big Data at Bank of America
• Big Data at Citigroup
Read Book:
Big Data at Work, chapter "What You Can Learn from Large Companies: Big Data and Analytics 3.0"
Davenport, T. (2014). Big data at work: dispelling the myths, uncovering the opportunities. Harvard Business Review Press.
Use Case
• GOJEK: predict a person's favorite food, even if that person has never ordered from the particular restaurant.
• Veritrans (Midtrans): detect fraudulent transactions among millions of e-commerce transactions very quickly, using clustered network analysis based on the triangle of customer information: credit card, phone number, and email address.
• Modalku: create an algorithm to assess whether a particular person / SME is eligible to borrow money / receive investment.
• Bank Mandiri: use customer data to understand customers (wallet size), improve lead management (targeted customers), and detect fraudulent transactions (using network analysis)
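
To give a feel for network analysis over shared customer details, the sketch below groups transactions that share a credit card, phone number, or email address into clusters. It is a generic union-find illustration on made-up data, offered as an assumption about what such an analysis could look like; it is not the actual method used by any of the companies above.

```python
from collections import defaultdict

# Hypothetical transactions: (id, credit card, phone number, email address).
transactions = [
    ("t1", "card_A", "phone_1", "x@example.com"),
    ("t2", "card_B", "phone_1", "y@example.com"),
    ("t3", "card_B", "phone_2", "z@example.com"),
    ("t4", "card_C", "phone_3", "w@example.com"),
]

parent = {tx_id: tx_id for tx_id, *_ in transactions}

def find(tx_id):
    """Return the representative of the cluster containing tx_id."""
    while parent[tx_id] != tx_id:
        parent[tx_id] = parent[parent[tx_id]]  # path halving keeps trees shallow
        tx_id = parent[tx_id]
    return tx_id

def union(a, b):
    """Merge the clusters containing a and b."""
    parent[find(a)] = find(b)

# Link any two transactions that share an identifier value.
first_seen = {}
for tx_id, *identifiers in transactions:
    for value in identifiers:
        if value in first_seen:
            union(tx_id, first_seen[value])
        else:
            first_seen[value] = tx_id

clusters = defaultdict(list)
for tx_id in parent:
    clusters[find(tx_id)].append(tx_id)

print(sorted(clusters.values()))
```

Here t1 and t2 share a phone number and t2 and t3 share a card, so all three fall into one cluster, while t4 stands alone. Unusually large or dense clusters are the kind of pattern an analyst might then inspect for fraud.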
Case Study : Customer Voice (Telco)

Figure panels: Telkomsel, XL
Network Text Analysis to Summarize Online Conversations for Marketing Intelligence Efforts in Telecommunication Industry (Alamsyah et al., 2016)
Case Study : Predict Travel Price

• https://www.academia.edu/28776805/Prediction_Models_Based_on_Flight_Tickets_and_Hotel_Rooms_Data_Sales_for_Recommendation_System_in_Online_Travel_Agent_Business
To wrap up, Why Big Data Analytics Matters
1. NEW DATA
Example: eCommerce capturing clickstream
2. UNLOCKING VALUE
Example: Sentiment analysis from online social network
3. SHAPING THE FUTURE
Example: modelling the future, anticipating & influencing
Assignment Week 1
Find a case study of a Big Data implementation / application for business or another domain
o Download the template (*.doc) for the Week 1 assignment
o State the objective, problems, and solution idea
o Upload it as a PDF file by the requested due date
