Data Analytics 1


Data Analytics 1 - THE HISTORY AND CONCEPTS OF DATA ANALYTICS

“The purpose of computing is insight, not numbers.” - Richard Hamming, 1962

GIDEON & THE 300 SOLDIERS? (In the Biblical account in Judges 7, Gideon selects 300 soldiers by observing how the men drink water; the story is often cited as an early example of decision-making based on observed data.)

KEY EVENTS IN THE HISTORY OF DATA ANALYTICS


• 1890 - Herman Hollerith invents the Hollerith Tabulating Machine, which reduced the crunching of census
data from 10 years to 3 months.
• 1962 - John Tukey writes a paper titled “The Future of Data Analysis”, in which he brought into question
the relationship between statistics and analysis.
• 1970 - Edgar F. Codd presents his framework for relational databases.
• 1989 - Howard Dresner at Gartner proposes the term “Business Intelligence.”
• 1990s - Data Mining is born, following the success of the concept of data warehouses introduced by
William H. Inmon.
• 1991 - Tim Berners-Lee sets out the specifications for a worldwide, interconnected web of data accessible
to anyone across the world, now the World Wide Web.
• 2004 - A whitepaper on MapReduce from Google inspires open-source software projects such as Apache Hadoop
and Apache Cassandra to deal with huge volumes of data through distributed computing.
• 2008 - Jeff Hammerbacher and DJ Patil, then at Facebook and LinkedIn, coin the term “data scientist” to
describe their work, and it then becomes a buzzword.
• 2013 - IBM reports statistics showing that 90% of the world’s data was created in the preceding two years!

DEFINITION OF KEY TERMS


Business intelligence (BI) is an umbrella term that includes the applications, infrastructure and tools, and best
practices that enable access to and analysis of information to improve and optimize decisions and performance.
(Gartner)
The purpose of Business Intelligence is to support better business decision-making. Essentially, Business
Intelligence systems are data-driven Decision Support Systems (DSS). Business Intelligence is sometimes used
interchangeably with briefing books, report-and-query tools, and executive information systems.

• Data Mining is the computational process of discovering patterns in large datasets stored in relational
databases and data warehouses.
• It lies at the intersection of artificial intelligence, machine learning, statistics, and database systems.
Data Analytics, sometimes also called Predictive Analytics, is all about automating insight into a dataset
through the use of queries and data-aggregation procedures. It can represent various dependencies between
input variables and discover hidden patterns in the dataset under analysis.
- It is the science of examining raw data with the purpose of finding and drawing conclusions about the
information in the data, using methods from statistics and machine learning.
- It goes beyond the concept of data mining by analysing semi-structured and unstructured data from
different sources and in different formats, e.g. text mining; a small example follows.
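As a quick illustration of the text-mining idea, the Python sketch below counts word frequencies in a small collection of unstructured text; it is only a minimal sketch, and the sample sentences are invented for this example.

from collections import Counter
import re

# Unstructured text, e.g. customer feedback (invented sample data)
documents = [
    "Delivery was fast and the product quality is great",
    "Great product, but delivery was slow",
    "Slow delivery ruined an otherwise great experience",
]

# Tokenize: lowercase the text and split it into words
tokens = []
for doc in documents:
    tokens.extend(re.findall(r"[a-z]+", doc.lower()))

# Count word frequencies, ignoring a few common stop words
stop_words = {"was", "and", "the", "is", "but", "an", "a"}
counts = Counter(t for t in tokens if t not in stop_words)

# The most frequent terms hint at recurring themes, e.g. "delivery" and "great"
print(counts.most_common(5))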

Advantages of Data Analytics


• Smart Decision Making
• Clearer insights into the inner workings of the business
• Improved Efficiency in business processes
• Faster response to market changes
• Better customer relationship management
• Better risk management
• Products and services tailored to customer needs

Big Data implies data volumes so huge that they cannot be processed effectively with traditional
applications. Big Data processing begins with raw data that is not aggregated, and it is often impossible to
store such data in the memory of a single computer.
• In other words, big data analytics is an extension of traditional data analytics, which was mainly done on
structured data (e.g. data stored in relational databases), to more complex analysis of structured,
semi-structured, and unstructured data.
• Big data is characterized by the famous three ‘Vs’: volume, velocity, and variety.
• Big data deals with huge volumes of data, growing very fast, drawn from diverse sources and in different
formats, e.g. texts, tweets, web click streams, satellite images, sensor readings, web log data, etc.
• Big data analytics is fuelled by improvements in bandwidth and connectivity, advances in processing
power, and the falling price of disk storage.
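The MapReduce idea mentioned in the history above is one classic answer to data that cannot fit on one machine. Below is a minimal, single-machine Python sketch of the map/shuffle/reduce pattern, counting words; it only illustrates the paradigm and is not Hadoop itself.

from collections import defaultdict

# Map phase: emit (word, 1) pairs from every line of input
def map_phase(lines):
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

# Shuffle phase: group the intermediate pairs by key (the word)
def shuffle_phase(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: sum the counts for each word
def reduce_phase(groups):
    return {word: sum(values) for word, values in groups.items()}

lines = ["big data needs big ideas", "data beats opinions"]
print(reduce_phase(shuffle_phase(map_phase(lines))))

In a real cluster, many machines run the map and reduce phases in parallel, which is what lets frameworks such as Hadoop process volumes no single computer could hold in memory.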

DATA SCIENCE AND ANALYTICS


LIFE CYCLE: STEP-BY-STEP PROCESS
Data Science is a concept used to tackle big data and includes data cleansing, preparation, and analysis,
while Data Analysis is about doing basic descriptive statistics, visualizing data, and communicating data
points to draw conclusions.
• Data scientists follow a well-defined data science workflow.
• The end goal of any data science project is to produce an effective data product.
• A Data Product is the usable result produced at the end of a data science project.
• It can be a dashboard, a recommendation engine, or anything else that facilitates business
decision-making to solve a business problem.

What is the Data Science Life Cycle?


• A series of activities you must repeatedly follow in order to finish work and deliver it to your customers.
• It begins with the identification of an issue or difficulty and concludes with the offering of a solution.
Who are involved?
• Domain Expert - usually someone with substantial experience in the particular domain or industry.
• Business Analyst - someone who understands the business needs of the particular domain or industry; their
main responsibilities include finding the right solutions and timeline for those needs.
• Machine Learning Engineer - responsible for advising on the kind of model to apply to generate the desired
output, and required to come up with appropriate solutions that produce the correct and required output.
• Data Engineer and Data Architect - considered the experts in data modeling; everything from visualization
of data to storage and retrieval of data is done by these individuals.
1. Understanding the Business Problem
You must understand the issue or question you are attempting to address before you can set goals for the
project.
• The first step in these situations is to identify clear objectives and concrete difficulties.
• Ask questions:
• What is the problem that needs to be solved?
• What form should the solution take?
• What results would constitute ‘success’?

2. Information Acquisition
Gathering useful information from the available data sources that can be used to solve the problem.
• The data must be described, together with its type, relevance, and organization.
• Data can be found or gathered from the company’s databases, from datasets available online, through Web
APIs and crawled data, and from social media sites such as Twitter and Facebook, which let their users
access data by connecting with web servers. A minimal sketch of pulling data from a Web API follows.
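The sketch below uses Python’s requests library; the endpoint URL is a hypothetical placeholder, and a real service would have its own URL, parameters, and usually authentication.

import requests  # third-party HTTP client library

# Hypothetical REST endpoint; substitute a real API and its credentials
URL = "https://api.example.com/v1/records"

response = requests.get(URL, params={"limit": 100}, timeout=30)
response.raise_for_status()     # fail early on HTTP errors
records = response.json()       # many APIs return JSON

# Describe what was gathered: row count and the available fields
print(len(records), "records")
if records:
    print("fields:", sorted(records[0].keys()))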

3. Data Preparation
The process of filtering out the data applicable to the problem, merging different datasets, and cleaning
the data:
- eliminating inaccurate data, treating missing values and outliers;
• setting up a sandbox (testing) environment, and extracting, transforming, and loading your data into the
new sandbox in a format that is ready for analysis and model building;
• converting the data into the desired format, eliminating columns that are not needed, and deriving new
elements from the acquired data (see the sketch below).
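A minimal data-preparation sketch using pandas; the table and its column names are invented for illustration.

import pandas as pd

# Invented raw data with the usual problems: duplicates, missing values,
# a column stored with the wrong type, and a column we do not need
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "age": [34, None, None, 51],
    "city": ["Harare", "Gweru", "Gweru", None],
    "amount": ["100", "250", "250", "80"],
})

clean = (
    raw.drop_duplicates()                                         # remove duplicate records
       .assign(age=lambda d: d["age"].fillna(d["age"].median()),  # treat missing values
               amount=lambda d: d["amount"].astype(float))        # convert to the desired format
       .drop(columns=["city"])                                    # eliminate a column that is not needed
)

# Derive a new element from the acquired data
clean["high_value"] = clean["amount"] > 90
print(clean)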

4. Data Analysis - “exploratory data analysis”
Analyse the data, look at possible relations between various features, and get an understanding of how much
effect each variable has on the final prediction or target.
• The data and its characteristics require inspection.
• Features can be extracted and important variables can be tested.
• Data scientists pick out the properties that represent the data of interest. These may be things such as
‘name’, ‘gender’, and ‘age’.
• Data visualization is utilized in this phase to highlight important trends and patterns in the data, so
they can be adequately comprehended through simple aids such as bar and line charts.
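An exploratory-analysis sketch with pandas and matplotlib might look like this; the dataset and its columns are invented.

import pandas as pd
import matplotlib.pyplot as plt

# Invented sample dataset for illustration
df = pd.DataFrame({
    "gender": ["F", "M", "F", "F", "M", "M"],
    "age": [23, 45, 31, 35, 52, 28],
    "spend": [120, 80, 150, 140, 60, 90],
})

print(df.describe())                 # summary statistics for the numeric columns
print(df["gender"].value_counts())   # distribution of a categorical feature
print(df[["age", "spend"]].corr())   # possible relation between two variables

# A simple bar chart to highlight a pattern: average spend by gender
df.groupby("gender")["spend"].mean().plot(kind="bar", title="Average spend by gender")
plt.show()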

5. Feature Engineering / Model Building


Choosing the essential properties that will directly aid the prediction of the model.
• Requires writing, running, and refining programs to analyze the data and derive meaningful business
insights from it.
• Machine learning techniques are applied to the data to identify the machine learning model that best fits
the business needs. All the contending machine learning models are trained on the training datasets, as in
the sketch after this list.
• Machine learning is an AI technique that teaches computers to learn from experience. Machine learning
algorithms use computational methods to “learn” information directly from data without relying on a
predetermined equation as a model.
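A minimal model-building sketch with scikit-learn, training two contending models and comparing them on held-out data; the synthetic dataset stands in for real, prepared business data.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the prepared dataset
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Train each contending model on the training set, then compare on the test set
for model in (LogisticRegression(max_iter=1000), DecisionTreeClassifier(max_depth=4)):
    model.fit(X_train, y_train)
    print(type(model).__name__, "accuracy:", round(model.score(X_test, y_test), 3))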

6. Data Modelling
Creating a conceptual representation of the data and the relationships between the various data entities.
• The main objective of data modeling is to ensure that data is well-organized and easily accessible,
thereby facilitating analysis and decision-making.
• A data scientist will often build a baseline model that has proved successful in similar situations and
then tailor it to suit the specifics of the problem, as illustrated below.
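To illustrate the baseline-then-tailor idea, the sketch below compares a trivial baseline with a stronger model on synthetic data; any gain over the baseline is the value the tailored model adds.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Baseline: always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# Tailored model: adjusted (here only lightly) to the specifics of the problem
tailored = RandomForestClassifier(n_estimators=200, random_state=1).fit(X_train, y_train)

print("baseline accuracy:", round(baseline.score(X_test, y_test), 3))
print("tailored accuracy:", round(tailored.score(X_test, y_test), 3))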
7. Data Visualization / Operationalization
Delivering the final reports on the model performance findings, as well as any necessary briefings, code,
and technical documents.
• Communicate the findings to the stakeholders.
• Data visualization is used to convey the information in an easily understood way and to show the
performance of the proposed solution.
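Communicating performance can be as simple as the following sketch, which turns predictions into a text report and a chart that non-specialists can read; the labels and predictions here are invented, and would normally come from the modelling step.

from sklearn.metrics import classification_report, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Invented stand-ins for the true labels and the model's predictions
y_test = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 0, 1, 1]

# Text summary: precision, recall and F1 per class
print(classification_report(y_test, y_pred))

# Visual summary: a confusion matrix is easy for stakeholders to read
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
plt.show()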

One Quick Simulation: a Crime-Rate Scenario


Problem Formulation
Identify patterns and trends in crime data to help law enforcement agencies reduce crime rates.

Data Acquisition
Crime data is collected from various sources, including police reports, crime statistics, and incident
reports.

Data Preparation
The data might be cleaned by removing duplicate records or filling in missing values.

Data Exploration
The data might be visualized to identify which types of crimes are most common and which locations are most
affected.

Data Modeling
A model might be built to predict the likelihood of a crime occurring based on factors such as location,
time of day, and weather conditions.

Data Visualization / Operationalization
The model might be evaluated to measure its ability to predict crime rates accurately.
The model might then be integrated into a crime prevention strategy to identify high-risk areas and allocate
resources accordingly. A compact, end-to-end sketch of this scenario follows.
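Putting the scenario together in Python: the crime data below is entirely synthetic, and the two features (a district code and the hour of day) are placeholders for the factors mentioned above.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Steps 1-3. Acquire and prepare: synthetic incident data (invented)
rng = np.random.default_rng(42)
n = 1000
df = pd.DataFrame({
    "location": rng.integers(0, 5, n),   # district code
    "hour": rng.integers(0, 24, n),      # time of day
})
# Synthetic target: crime is more likely late at night
df["crime"] = (df["hour"] > 20).astype(int)

# Step 4. Explore: which districts are most affected?
print(df.groupby("location")["crime"].mean())

# Steps 5-6. Model: predict the likelihood of a crime from location and time
features = df[["location", "hour"]]
X_train, X_test, y_train, y_test = train_test_split(features, df["crime"], random_state=42)
model = LogisticRegression().fit(X_train, y_train)

# Step 7. Evaluate and operationalize: score the model, then flag high-risk cases
print("accuracy:", round(model.score(X_test, y_test), 3))
df["risk"] = model.predict_proba(features)[:, 1]
print(df.nlargest(5, "risk")[["location", "hour", "risk"]])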
