
UNIT II

Database & Data Warehousing

In computing, a database is an organized collection of data stored and accessed electronically. Small databases
can be stored on a file system, while large databases are
hosted on computer clusters or cloud storage. The design of
databases spans formal techniques and practical
considerations including data modeling, efficient data
representation and storage, query
languages, security and privacy of sensitive data,
and distributed computing issues including
supporting concurrent access and fault tolerance.
A database management system (DBMS) is
the software that interacts with end users, applications, and
the database itself to capture and analyze the data. The
DBMS software additionally encompasses the core facilities
provided to administer the database. The sum total of the
database, the DBMS and the associated applications can be
referred to as a database system. Often the term
"database" is also used loosely to refer to any of the DBMS,
the database system or an application associated with the
database.
Computer scientists may classify database management
systems according to the database models that they
support. Relational databases became dominant in the 1980s.
These model data as rows and columns in a series of tables,
and the vast majority use SQL for writing and querying data.
In the 2000s, non-relational databases became popular, collectively referred to as NoSQL because they use
different query languages.
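
As a small illustration of the relational model described above, the following Python sketch uses the standard-library sqlite3 module to create a table of rows and columns and query it with SQL. The table and column names (employee, eid, name, dept) are hypothetical and chosen only for this example.

    import sqlite3

    # An in-memory relational database: data modeled as rows and columns in a table.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employee (eid INTEGER PRIMARY KEY, name TEXT, dept TEXT)")
    conn.executemany("INSERT INTO employee VALUES (?, ?, ?)",
                     [(1, "Asha", "Sales"), (2, "Ravi", "HR")])

    # SQL is used both for writing and for querying the data.
    for (name,) in conn.execute("SELECT name FROM employee WHERE dept = 'Sales'"):
        print(name)        # -> Asha
    conn.close()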

Classification
One way to classify databases involves the type of their
contents, for example: bibliographic, document-text,
statistical, or multimedia objects. Another way is by their
application area, for example: accounting, music compositions,
movies, banking, manufacturing, or insurance. A third way is by
some technical aspect, such as the database structure or
interface type. This section lists a few of the adjectives used
to characterize different kinds of databases.

 An in-memory database is a database that primarily resides in main memory, but is typically backed up by non-
volatile computer data storage. Main memory databases are
faster than disk databases, and so are often used where
response time is critical, such as in telecommunications
network equipment.
 An active database includes an event-driven architecture
which can respond to conditions both inside and outside the
database. Possible uses include security monitoring,
alerting, statistics gathering and authorization. Many
databases provide active database features in the form
of database triggers.
 A cloud database relies on cloud technology. Both the
database and most of its DBMS reside remotely, "in the
cloud", while its applications are both developed by
programmers and later maintained and used by end-users
through a web browser and Open APIs.

 Data warehouses archive data from operational databases
and often from external sources such as market research
firms. The warehouse becomes the central source of data
for use by managers and other end-users who may not have
access to operational data.
 A deductive database combines logic programming with a
relational database.
 A distributed database is one in which both the data and
the DBMS span multiple computers.
 A document-oriented database is designed for storing,
retrieving, and managing document-oriented, or semi-structured, information. Document-oriented databases are
one of the main categories of NoSQL databases.
 An embedded database system is a DBMS which is tightly
integrated with application software that requires access
to stored data in such a way that the DBMS is hidden from
the application's end-users and requires little or no ongoing
maintenance.[21]
 End-user databases consist of data developed by individual
end-users. Examples of these are collections of documents,
spreadsheets, presentations, multimedia, and other files.
Several products exist to support such databases. Some of
them are much simpler than full-fledged DBMSs, with more
elementary DBMS functionality.
 A federated database system comprises several distinct
databases, each with its own DBMS. It is handled as a
single database by a federated database management
system (FDBMS), which transparently integrates multiple
autonomous DBMSs, possibly of different types (in which case it would also be a heterogeneous database system),
and provides them with an integrated conceptual view.
 Sometimes the term multi-database is used as a synonym to
federated database, though it may refer to a less
integrated (e.g., without an FDBMS and a managed
integrated schema) group of databases that cooperate in a
single application. In this case, typically middleware is used
for distribution, which typically includes an atomic commit
protocol (ACP), e.g., the two-phase commit protocol, to
allow distributed (global) transactions across the
participating databases.
 A graph database is a kind of NoSQL database that
uses graph structures with nodes, edges, and properties to
represent and store information. General graph databases
that can store any graph are distinct from specialized
graph databases such as triple stores and network
databases.
 An array DBMS is a kind of NoSQL DBMS that allows
modeling, storage, and retrieval of (usually large) multi-
dimensional arrays such as satellite images and climate
simulation output.
 In a hypertext or hypermedia database, any word or a piece
of text representing an object, e.g., another piece of text,
an article, a picture, or a film, can be hyperlinked to that
object. Hypertext databases are particularly useful for
organizing large amounts of disparate information. For
example, they are useful for organizing online
encyclopedias, where users can conveniently jump around
the text. The World Wide Web is thus a large distributed
hypertext database.

 A knowledge base (abbreviated KB/kb) is a special kind of
database for knowledge management, providing the means
for the computerized collection, organization,
and retrieval of knowledge. It may also be a collection of data representing problems with their solutions and related experiences.

 A mobile database can be carried on or synchronized from a mobile computing device.

 A parallel database seeks to improve performance through parallelization for tasks such as loading data,
building indexes and evaluating queries.
The major parallel DBMS architectures which are
induced by the underlying hardware architecture are:

 Shared memory architecture, where multiple processors share the main memory space, as well as
other data storage.
 Shared disk architecture, where each processing unit
(typically consisting of multiple processors) has its own
main memory, but all units share the other storage.
 Shared-nothing architecture, where each processing
unit has its own main memory and other storage.

DATA WAREHOUSING
A data warehouse supports decision support systems (DSS) and is subject-oriented, integrated, time-variant, and non-volatile. The term data warehouse was first used by William Inmon in the early 1980s. He defined a data warehouse to be a set of data that supports DSS and is "subject-oriented, integrated, time-variant and nonvolatile." With data warehousing, corporate-wide data (current & historical) are merged into a single repository.
Traditional databases contain operational data that represent the day-
to-day needs of a company. Traditional business data processing (such as billing, inventory control, payroll, and manufacturing) supports online transaction processing and batch reporting applications. A data
warehouse, however, contains informational data, which are used to
support other functions such as planning and forecasting. Although
much of the content is similar between the operational and
informational data, much is different. A data warehouse is a data
repository used to support decision support systems.
A data warehousing system includes data migration, the warehouse, and
access tools. The data are extracted from operational systems, but
must be reformatted, cleansed, integrated, and summarized before
being placed in the warehouse. Much of the operational data are not
needed in the warehouse and are removed during this conversion
process. This migration process is similar to that needed for data
mining applications, except that data mining applications need not
necessarily be performed on summarized or business-wide data.

 The data transformation process required to convert operational data to informational data involves many functions, including the following (a brief illustrative sketch follows this list):
 Unwanted data must be removed.
 Converting heterogeneous sources into one common schema. This
problem is the same as that found when accessing data from
multiple heterogeneous sources. Each operational database may
contain the same data with different attribute names. For
example, one system may use "Employee ID," while another uses "EID" for the same attribute. In addition, there may be multiple
data types for the same attribute.
 As the operational data are probably a snapshot of the data, multiple snapshots may need to be merged to create the historical view.
 Summarizing data is performed to provide a higher-level view of the data. This summarization may be done at multiple granularities and for different dimensions.
 New derived data (e.g., using age rather than birth date) may be
added to better facilitate decision support functions.
 Handling missing and erroneous data must be performed. This
could entail replacing them with predicted or default values or
simply removing these entries.
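
As referenced above, here is a minimal sketch of some of these transformation steps using pandas (assumed to be available); the source tables, column names ("Employee ID", "EID", "salary"), birth years, and default values are hypothetical and serve only to illustrate the ideas.

    import pandas as pd

    # Two operational sources that store the same attribute under different names.
    src_a = pd.DataFrame({"Employee ID": [1, 2], "salary": [50000, None]})
    src_b = pd.DataFrame({"EID": [3], "salary": [60000]})

    # Convert heterogeneous sources into one common schema.
    src_b = src_b.rename(columns={"EID": "Employee ID"})
    merged = pd.concat([src_a, src_b], ignore_index=True)

    # Handle missing or erroneous data (here: replace with a default value).
    merged["salary"] = merged["salary"].fillna(0)

    # Add new derived data, e.g. age rather than birth date (birth years invented).
    birth_year = pd.Series([1990, 1985, 1978])
    merged["age"] = 2024 - birth_year

    print(merged)
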
 There are several ways to improve the performance of data
warehouse applications.
1) Summarization: Because many applications require
summary-type information, data that are known to be
needed for consolidation queries should be pre-summarized
before storage. Different levels of summarization should be
included to improve performance. With a 20 to 100%
increase in storage space, an increase in performance of 2
to 10 times can be achieved (see the pre-summarization sketch after this list).
2) De-normalization: Traditional normalization reduces such
problems as redundancy as well as insert, update, and
deletion anomalies. However, these improvements are
achieved at the cost of increased processing time due to
joins. With a data warehouse, improved performance can be
achieved by storing de-normalized data. Since data
warehouses are not usually updated as frequently as
operational data are, the negatives associated with update
operations are not an issue.
3) Partitioning: Dividing the data warehouse into smaller
fragments may reduce processing time by allowing queries
to access small data sets.
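
As referenced in point 1, the pre-summarization idea can be sketched with pandas as follows; the sales table and its columns are hypothetical, and the two aggregates simply pre-compute totals at different granularities so that consolidation queries can read a summary instead of the detail rows.

    import pandas as pd

    sales = pd.DataFrame({
        "region": ["N", "N", "S", "S"],
        "month":  ["Jan", "Feb", "Jan", "Jan"],
        "amount": [100, 150, 200, 50],
    })

    # Summaries at two granularities, stored alongside the detail data.
    by_region       = sales.groupby("region", as_index=False)["amount"].sum()
    by_region_month = sales.groupby(["region", "month"], as_index=False)["amount"].sum()

    print(by_region)        # coarse summary
    print(by_region_month)  # finer summary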

Artificial Intelligence

Artificial intelligence (AI) is intelligence demonstrated by machines, as opposed to natural intelligence displayed
by animals including humans. Leading AI textbooks define the field as
the study of "intelligent agents": any system that perceives its
environment and takes actions that maximize its chance of achieving
its goals. AI applications include advanced web search engines
(e.g., Google), recommendation systems (used
by YouTube, Amazon and Netflix), understanding human speech (such
as Alexa), self-driving cars (e.g., Tesla), automated decision-making and
competing at the highest level in strategic game systems (such
as chess and Go). As machines become increasingly capable, tasks
considered to require "intelligence" are often removed from the
definition of AI, a phenomenon known as the AI effect.

Artificial intelligence was founded as an academic discipline in 1956, and in the years since has experienced several waves of
optimism, followed by disappointment and the loss of funding, followed
by new approaches, success and renewed funding. AI research has
tried and discarded many different approaches since its founding,
including simulating the brain, modeling human problem solving, formal
logic, large databases of knowledge and imitating animal behavior. In
the first decades of the 21st century, highly mathematical
statistical machine learning has dominated the field, and this technique
has proved highly successful, helping to solve many challenging
problems throughout industry and academia.

The various sub-fields of AI research are centered on particular goals and the use of particular tools. The traditional goals of AI research
include reasoning, knowledge representation, planning, learning, natural
language processing, perception, and the ability to move and manipulate
objects. General intelligence (the ability to solve an arbitrary problem)
is among the field's long-term goals. To solve these problems, AI
researchers have adapted and integrated a wide range of problem-
solving techniques—including search and mathematical optimization,
formal logic, artificial neural networks, and methods based
on statistics, probability and economics. AI also draws upon computer
science, psychology, linguistics, philosophy, and many other fields.

Goals

The general problem of simulating (or creating) intelligence has been broken down into sub-problems. These consist of particular traits
or capabilities that researchers expect an intelligent system to display.
The traits described below have received the most attention.

Reasoning, problem solving

Early researchers developed algorithms that imitated step-by-step reasoning that humans use when they solve puzzles or make logical
deductions. By the late 1980s and 1990s, AI research had developed
methods for dealing with uncertain or incomplete information,
employing concepts from probability and economics.

Many of these algorithms proved to be insufficient for solving large reasoning problems because they experienced a "combinatorial explosion".

Knowledge representation

Knowledge representation and knowledge engineering allow AI programs to answer questions intelligently and make deductions about
real world facts.

Planning

An intelligent agent that can plan makes a representation of the state of the world, makes predictions about how its actions will
change it and makes choices that maximize the utility (or "value") of
the available choices.

Learning

Machine learning (ML), a fundamental concept of AI research
since the field's inception, is the study of computer algorithms that
improve automatically through experience.

Unsupervised learning finds patterns in a stream of input. Supervised learning requires a human to label the input data
first, and comes in two main varieties: classification and
numerical regression. Classification is used to determine what category something belongs in: the program sees a number of examples of things from several categories and learns to classify new inputs. Regression
is the attempt to produce a function that describes the relationship
between inputs and outputs and predicts how the outputs should
change as the inputs change.
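
A minimal supervised-learning sketch of the two varieties described above, classification and numerical regression, is given below. It assumes scikit-learn is installed, and the toy data points are invented for illustration only.

    from sklearn.linear_model import LogisticRegression, LinearRegression

    # Classification: labelled examples from two categories.
    X = [[1.0], [2.0], [8.0], [9.0]]
    y = [0, 0, 1, 1]
    clf = LogisticRegression().fit(X, y)
    print(clf.predict([[1.5], [8.5]]))   # expected: [0 1]

    # Numerical regression: learn a function relating inputs to outputs.
    X_r = [[1.0], [2.0], [3.0]]
    y_r = [2.0, 4.0, 6.0]
    reg = LinearRegression().fit(X_r, y_r)
    print(reg.predict([[4.0]]))          # approximately [8.0]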

Natural language processing

Natural language processing (NLP) allows machines to read and understand human language. A sufficiently powerful natural
language processing system would enable natural-language user
interfaces and the acquisition of knowledge directly from human-
written sources, such as newswire texts. Some straightforward
applications of NLP include information retrieval, question
answering and machine translation.

Symbolic AI used formal syntax to translate the deep structure of sentences into logic. This failed to produce useful
applications, due to the intractability of logic and the breadth of
commonsense knowledge.

Perception

Machine perception is the ability to use input from sensors (such as cameras, microphones, wireless signals, and active radar
and tactile sensors) to deduce aspects of the world. Applications
include speech recognition, facial recognition, and object recognition.
Computer vision is the ability to analyze visual input.

Social intelligence

Affective computing is an interdisciplinary umbrella that comprises systems which recognize, interpret, process, or simulate
human feeling, emotion and mood. For example, some virtual
assistants are programmed to speak conversationally or even to banter
humorously, making them appear more sensitive to the emotional
dynamics of human interaction, or to otherwise facilitate human–
computer interaction.

General intelligence

A machine with general intelligence can solve a wide variety of problems with a breadth and versatility similar to human intelligence.

Machine learning (ML)

Machine learning (ML) is the study of computer algorithms that can improve automatically through experience and by the use of
data. It is seen as a part of artificial intelligence. Machine learning
algorithms build a model based on sample data, known as training data,
in order to make predictions or decisions without being explicitly
(externally) programmed to do so. Machine learning algorithms are used
in a wide variety of applications, such as in medicine, email
filtering, speech recognition, and computer vision, where it is difficult
or unfeasible to develop conventional algorithms to perform the needed
tasks.

A subset of machine learning is closely related to computational statistics, which focuses on making predictions using computers; but
not all machine learning is statistical learning. The study
of mathematical optimization delivers methods, theory and application
domains to the field of machine learning. Data mining is a related field
of study, focusing on exploratory data analysis through unsupervised
learning. Some implementations of machine learning use data and neural
networks in a way that mimics the working of a biological brain. In its
application across business problems, machine learning is also referred
to as predictive analytics.

A core objective of a learner is to generalize from its experience. Generalization in this context is the ability of a learning
machine to perform accurately on new, unseen examples/tasks after
having experienced a learning data set. The training examples come
from some generally unknown probability distribution (considered
representative of the space of occurrences) and the learner has to
build a general model about this space that enables it to produce
sufficiently accurate predictions in new cases.

The computational analysis of machine learning algorithms and their performance is a branch of theoretical computer science known
as computational learning theory. Because training sets are finite and
the future is uncertain, learning theory usually does not yield
guarantees of the performance of algorithms. Instead, probabilistic
bounds on the performance are quite common. The bias–variance
decomposition is one way to quantify generalization error.

Approaches

Machine learning approaches are traditionally divided into three broad categories, depending on the nature of the "signal" or "feedback" available to the learning system (a brief clustering sketch follows this list):

 Supervised learning: The computer is presented with example inputs and their desired outputs, given by a "teacher", and the goal
is to learn a general rule that maps inputs to outputs.
 Unsupervised learning: No labels are given to the learning
algorithm, leaving it on its own to find structure in its input.
Unsupervised learning can be a goal in itself (discovering hidden
patterns in data) or a means towards an end (feature learning).
 Reinforcement learning: A computer program interacts with a
dynamic environment in which it must perform a certain goal (such as driving a vehicle or playing a game against an opponent). As it
navigates its problem space, the program is provided feedback
that's analogous to rewards, which it tries to maximize.
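
As referenced above, the following sketch contrasts with the supervised example earlier by showing unsupervised learning: no labels are given and the algorithm finds structure on its own. It assumes scikit-learn is installed; the points are invented.

    from sklearn.cluster import KMeans

    # No labels: the algorithm discovers two groups by itself.
    points = [[1.0, 1.0], [1.2, 0.9], [8.0, 8.0], [8.1, 7.9]]
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
    print(kmeans.labels_)   # e.g. [0 0 1 1] (cluster ids, not class labels)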

Artificial Neural Networks (ANN)

An Artificial Neural Network (ANN) is a computational model inspired by the way biological neural networks in the human brain process
information. ANNs consist of interconnected nodes (or "neurons")
organized in layers: an input layer, one or more hidden layers, and an
output layer. Here are some key concepts:

Components of ANNs

1. Neurons: Basic units that receive input, apply a transformation (often using an activation function), and produce output.
2. Layers:
o Input Layer: Accepts the initial data.
o Hidden Layers: Intermediate layers that process inputs.
o Output Layer: Produces the final output.
3. Weights: Each connection between neurons has a weight that
adjusts as learning proceeds.
4. Activation Functions: Functions like ReLU and sigmoid introduce non-
linearity into the model, allowing it to learn complex patterns.

Learning Process

1. Forward Propagation: Input data is passed through the network, producing an output.
2. Loss Function: Measures the difference between the predicted
output and the actual output.
3. Backpropagation: The algorithm adjusts the weights based on
the error, propagating the error backward through the network.
4. Optimization: Techniques like gradient descent are used to
minimize the loss function (a minimal worked sketch follows this list).
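
As referenced in point 4, the four steps above can be seen in a minimal numerical sketch of a single sigmoid neuron trained by gradient descent. It assumes NumPy is available; the tiny dataset, the mean-squared-error loss, and the learning rate are arbitrary choices made only for illustration.

    import numpy as np

    # Tiny training set: one input feature, binary target.
    X = np.array([[0.0], [1.0], [2.0], [3.0]])
    y = np.array([[0.0], [0.0], [1.0], [1.0]])

    rng = np.random.default_rng(0)
    w = rng.normal(size=(1, 1))   # weight
    b = np.zeros((1,))            # bias
    lr = 0.5                      # learning rate

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    for step in range(1000):
        # 1. Forward propagation: inputs -> predicted outputs.
        y_hat = sigmoid(X @ w + b)
        # 2. Loss function: mean squared error between prediction and target.
        loss = np.mean((y_hat - y) ** 2)
        # 3. Backpropagation: gradient of the loss w.r.t. weight and bias.
        grad_out = 2 * (y_hat - y) / len(X) * y_hat * (1 - y_hat)
        grad_w = X.T @ grad_out
        grad_b = grad_out.sum(axis=0)
        # 4. Optimization: gradient-descent update.
        w -= lr * grad_w
        b -= lr * grad_b

    print(float(loss), sigmoid(np.array([[2.5]]) @ w + b))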

Applications

 Image Recognition: Used in computer vision tasks.


 Natural Language Processing: Helps in understanding and
generating human language.
 Medical Diagnosis: Assists in analyzing medical data for disease
prediction.
 Game Playing: Powers AI in strategic games.

Advantages

 Flexibility: Can model complex functions.


 Scalability: Can handle large datasets and complex structures.

Challenges

 Overfitting: The model may perform well on training data but
poorly on unseen data.
 Training Time: Can require significant computational resources.
 Interpretability: Often seen as a "black box," making it hard to
understand how decisions are made.

Detailed Components of ANNs

1. Neurons:
o Each neuron receives inputs, processes them, and sends an
output to the next layer.
o The processing typically involves calculating a weighted sum
of inputs followed by an activation function.
2. Weights and Biases:
o Weights determine the importance of each input. During
training, these are adjusted to minimize error.
o Biases provide an additional parameter that allows the
model to fit the data better by shifting the activation
function.
3. Activation Functions (see the short sketch after this list):
o Sigmoid: Outputs values between 0 and 1. Good for binary
classification but can suffer from vanishing gradients.

o Tanh: Outputs values between -1 and 1. Similar to sigmoid
but generally performs better in hidden layers.
o ReLU (Rectified Linear Unit): Outputs the input directly if
positive; otherwise, it outputs zero. It’s popular due to its
simplicity and effectiveness in reducing training time.
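
As referenced in point 3, the three activation functions described above can be written directly; NumPy is assumed to be available.

    import numpy as np

    def sigmoid(z):        # outputs in (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    def tanh(z):           # outputs in (-1, 1)
        return np.tanh(z)

    def relu(z):           # passes positive inputs through, outputs zero otherwise
        return np.maximum(0.0, z)

    z = np.array([-2.0, 0.0, 2.0])
    print(sigmoid(z), tanh(z), relu(z))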

Applications in Various Fields

 Healthcare: Diagnosing diseases from medical images, predicting patient outcomes.
 Finance: Fraud detection, risk assessment, stock price prediction.
 Autonomous Vehicles: Object detection, path planning, and
decision-making in real-time.
 Robotics: Control systems, human-robot interaction, and learning
from demonstration.

Scalable & Non-scalable data

A scalable data system or architecture is designed with flexibility and growth in mind. The goal is to ensure that as the data volume or
processing demands increase, the system can continue to function
efficiently by either adding more resources or using existing resources
more effectively. Here are additional aspects of scalable data:

1. Horizontal vs. Vertical Scaling:

Horizontal Scaling (Scale Out): This involves adding more machines or nodes to a system (a small sharding sketch follows below). For example, in a distributed database, you can add
more servers to share the load, increasing the capacity to handle
larger datasets.

Vertical Scaling (Scale Up): This involves upgrading the existing hardware by adding more CPU, memory, or storage to a single machine.
It works well for limited increases in data but can hit a ceiling since
hardware has physical limitations.
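
As referenced above, a very small sketch of the horizontal-scaling idea is hash-based sharding, where each record is routed to one of several nodes. The node names and key format below are hypothetical.

    import hashlib

    NODES = ["node-a", "node-b", "node-c"]   # hypothetical servers added by scaling out

    def shard_for(key: str) -> str:
        # A stable hash of the key decides which node stores the record.
        digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return NODES[digest % len(NODES)]

    print(shard_for("customer:42"))   # always routes this key to the same node

Note that adding a node changes the modulus and forces data to move, which is one reason real distributed stores often use consistent hashing or managed partitioning instead of this naive scheme.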

2. Examples of Scalable Data Systems:

NoSQL Databases: Systems like MongoDB, Cassandra, and Amazon DynamoDB are designed to scale horizontally. They can handle massive
datasets spread across multiple servers.

Cloud-Based Storage: Cloud services such as Amazon S3 or Google Cloud Storage automatically scale with the data. They allow you to
store and access vast amounts of data without worrying about physical
storage limits.

Distributed Computing Frameworks: Apache Hadoop and Apache Spark are examples of frameworks that can process large datasets across a
cluster of machines, enabling high scalability in data processing.

3. Challenges and Solutions in Scalability:

Consistency vs. Availability: In distributed systems, scalability often introduces trade-offs between consistency and availability (as
described by the CAP theorem). Some systems prioritize availability
and partition tolerance (e.g., NoSQL databases), sacrificing strict
consistency in favor of scalability.

Load Balancing: To ensure scalability, systems must distribute data and workload evenly across nodes to avoid bottlenecks. Load balancers
and replication techniques are often used to ensure optimal
distribution of resources.

Content Delivery Networks (CDNs): To handle high traffic loads, CDNs distribute data across multiple servers around the world,
enabling scalable content delivery.

Non-Scalable Data:

Non-scalable data systems face limitations in handling large datasets
or high traffic. They might work well for smaller datasets but
encounter significant performance issues as the data grows.

1. Issues in Non-Scalable Systems:

Single Points of Failure: Many non-scalable systems rely on a single server or resource, making the system vulnerable to failure if the load
exceeds its capacity.

I/O Constraints: Traditional databases and systems often face input/output bottlenecks when the size of the data exceeds the
processing capacity. For example, handling large files or massive
database queries can lead to slower read/write times.

Resource Constraints: Non-scalable systems typically run into resource limitations (CPU, memory, disk space) that prevent them from
handling large amounts of data.

2. Common Issues with Non-Scalability:

Performance Degradation: As more data is added to non-scalable systems, users may experience slower query times, application crashes,
and overall decreased efficiency.

Manual Scaling Efforts: Non-scalable systems often require manual intervention, such as moving data to larger machines or manually
splitting workloads. These solutions are neither automated nor
efficient in the long term.

Increased Maintenance Costs: As the data grows, maintaining a non-scalable system can become costlier and more complex, requiring
frequent upgrades or redesigns.

3. Examples of Non-Scalable Systems:

Legacy Relational Databases: Many older relational databases, such as MySQL without proper optimization or clustering, face difficulty
handling massive datasets.

Monolithic Applications: Applications designed in a monolithic architecture often suffer from scalability issues because all
components are tightly coupled. When one part of the system needs
more resources, the entire application must scale, which is inefficient.

Single Server Setups: Systems that run on a single server with no distributed computing or cloud architecture may work for small
businesses, but as data volumes grow, they become slower and more
difficult to maintain.

In essence, scalable systems are future-proof and designed for growth, while non-scalable systems may be easier to set up initially but
become inefficient and expensive as data grows.

Use of Statistical Methods & Techniques

Statistics is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. In
applying statistics to a scientific, industrial, or social problem, it is
conventional to begin with a statistical population or a statistical
model to be studied. Populations can be diverse groups of people or
objects such as "all people living in a country" or "every atom composing
a crystal". Statistics deals with every aspect of data, including the
planning of data collection in terms of the design
of surveys and experiments.
When census data cannot be collected, statisticians collect data
by developing specific experiment designs and survey samples.
Representative sampling assures that inferences and conclusions can
reasonably extend from the sample to the population as a whole.
An experimental study involves taking measurements of the system
under study, manipulating the system, and then taking additional
measurements using the same procedure to determine if the
manipulation has modified the values of the measurements. In contrast,
an observational study does not involve experimental manipulation.
Two main statistical methods are used in data
analysis: descriptive statistics, which summarize data from a sample
using indexes such as the mean or standard deviation, and inferential
statistics, which draw conclusions from data that are subject to
random variation (e.g., observational errors, sampling
variation). Descriptive statistics are most often concerned with two
sets of properties of a distribution (sample or population): central
tendency (or location) seeks to characterize the distribution's central
or typical value, while dispersion (or variability) characterizes the
extent to which members of the distribution depart from its center
and each other. Inferences on mathematical statistics are made under
the framework of probability theory, which deals with the analysis of
random phenomena.
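
A minimal sketch of the two descriptive properties mentioned above, central tendency (location) and dispersion (variability), using Python's standard statistics module; the sample values are invented.

    import statistics

    sample = [12, 15, 11, 14, 18, 13]

    print(statistics.mean(sample))    # central tendency (location)
    print(statistics.stdev(sample))   # dispersion (variability)
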
A standard statistical procedure involves the collection of data leading
to a test of the relationship between two statistical data sets, or a data
set and synthetic data drawn from an idealized model. A hypothesis is
proposed for the statistical relationship between the two data sets,
and this is compared as an alternative to an idealized null hypothesis of
no relationship between two data sets. Rejecting or disproving the null
hypothesis is done using statistical tests that quantify the sense in
which the null can be proven false, given the data that are used in the
test.
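
The procedure described above can be sketched with a two-sample t-test; SciPy is assumed to be installed and the two samples are invented. The null hypothesis of no difference between the groups is rejected when the p-value falls below a chosen significance level.

    from scipy import stats

    group_a = [5.1, 4.9, 5.3, 5.0, 5.2]
    group_b = [5.8, 6.1, 5.9, 6.0, 6.2]

    # Null hypothesis: the two groups have the same mean.
    t_stat, p_value = stats.ttest_ind(group_a, group_b)
    print(t_stat, p_value)   # a small p-value is evidence against the null hypothesis
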
Descriptive statistics

A descriptive statistic (in the count noun sense) is a summary statistic that quantitatively describes or summarizes features from a
collection of information, while descriptive statistics (in the mass
noun sense) is the process of using and analyzing those statistics.
Descriptive statistics is distinguished from inferential statistics (or
inductive statistics) by its aim to summarize a sample, rather than use
the data to learn about the population that the sample of data is
thought to represent. This generally means that descriptive statistics,
unlike inferential statistics, are not developed on the basis of probability theory, and are frequently non-parametric statistics. Even when a data
analysis draws its main conclusions using inferential statistics,
descriptive statistics are generally also presented.
Descriptive statistics provide simple summaries about the sample
and about the observations that have been made. Such summaries may
be either quantitative, i.e. summary statistics, or visual, i.e. simple-to-
understand graphs. These summaries may either form the basis of the
initial description of the data as part of a more extensive statistical
analysis, or they may be sufficient in and of themselves for a
particular investigation.
For example, the shooting percentage in basketball is a
descriptive statistic that summarizes the performance of a player or a
team. This number is the number of shots made divided by the number
of shots taken. For example, a player who shoots 33% is making
approximately one shot in every three. The percentage summarizes or
describes multiple discrete events. Consider also the grade point
average. This single number describes the general performance of a
student across the range of their course experiences.
The use of descriptive and summary statistics has an extensive
history and, indeed, the simple tabulation of populations and of
economic data was the first way the topic of statistics appeared. More
recently, a collection of summarization techniques has been formulated
under the heading of exploratory data analysis: an example of such a
technique is the box plot.
In the business world, descriptive statistics provides a useful
summary of many types of data. For example, investors and brokers
may use a historical account of return behavior by performing empirical
and analytical analyses on their investments in order to make better
investing decisions in the future.

Inferential Statistics
Statistical inference is the process of using data analysis to infer
properties of an underlying distribution of probability. Inferential
statistical analysis infers properties of a population, for example
by testing hypotheses and deriving estimates. It is assumed that the
observed data set is sampled from a larger population.

Statistical inference makes propositions about a population, using data drawn from the population with some form of sampling. Given a
hypothesis about a population, for which we wish to draw inferences,
statistical inference consists of (first) selecting a statistical model of
the process that generates the data and (second) deducing
propositions from the model.

"The majority of the problems in statistical inference can be considered to be problems related to statistical modeling," Sir David Cox has said. "How [the] translation from subject-matter problem to statistical model is done is often the most critical part of an analysis."

The conclusion of a statistical inference is a statistical proposition (a proposition is the meaning of a declarative sentence). Some common forms of statistical proposition are the following (a brief confidence-interval sketch follows this list):

 A point estimate (point estimation involves the use of sample data to calculate a single value), i.e. a particular value that
best approximates some parameter of interest;
 An interval estimate (interval estimation is the use of sample
data to estimate an interval of plausible values of a parameter of
interest. This is in contrast to point estimation, which gives a single
value) e.g. a confidence interval (confidence interval (CI) is a
range of estimates for an unknown parameter, defined as an interval
with a lower bound and an upper bound) (or set estimate), i.e. an
interval constructed using a dataset drawn from a population so
that, under repeated sampling of such datasets, such intervals would contain the true parameter value with the probability at the
stated confidence level;
 A credible interval (credible interval is an interval within which an
unobserved parameter value falls with a particular probability) i.e. a
set of values containing, for example, 95% of posterior belief.
 Rejection of a hypothesis;
 Clustering or Classification of data points into groups.
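
As referenced above, the following sketch computes a point estimate (the sample mean) and an approximate 95% confidence interval for it. The data are invented and a normal approximation with the familiar 1.96 multiplier is assumed, purely for illustration.

    import math
    import statistics

    sample = [23, 25, 21, 24, 26, 22, 25, 23]

    mean = statistics.mean(sample)                        # point estimate
    sem = statistics.stdev(sample) / math.sqrt(len(sample))
    ci = (mean - 1.96 * sem, mean + 1.96 * sem)           # approximate 95% interval estimate

    print(mean, ci)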

Data analysis

Data analysis is a process of inspecting, cleansing (Data cleansing or data cleaning is the process of detecting and correcting
(or removing) corrupt or inaccurate records from a record set, table,
or database and refers to identifying incomplete, incorrect, inaccurate
or irrelevant parts of the data and then replacing, modifying, or
deleting the dirty data), transforming (Data transformation is the
process of converting data from one format or structure into another
format or structure) and modeling data (process of creating data
model) with the goal of discovering useful information, informing
conclusions, and supporting decision-making. Data analysis has multiple
facets and approaches, encompassing diverse techniques under a
variety of names, and is used in different business, science, and social
science domains. In today's business world, data analysis plays a role in
making decisions more scientific and helping businesses operate more
effectively.
Data mining is a particular data analysis technique that focuses on
statistical modeling and knowledge discovery for predictive rather than
purely descriptive purposes, while business intelligence covers data
analysis that relies heavily on aggregation, focusing mainly on business
information. In statistical applications, data analysis can be divided
into descriptive statistics, exploratory data analysis (EDA),
and confirmatory data analysis (CDA). EDA focuses on discovering new
features in the data while CDA focuses on confirming or falsifying
existing hypotheses. Predictive analytics focuses on the application of
statistical models for predictive forecasting or classification,
while text analytics applies statistical, linguistic, and structural
techniques to extract and classify information from textual sources, a
species of unstructured data. All of the above are varieties of data
analysis.

The process of data analysis


Analysis refers to dividing a whole into its separate components for
individual examination. Data analysis is a process for obtaining raw
data, and subsequently converting it into information useful for
decision-making by users. The following are the different processes used during data analysis; similar steps also apply to data mining.

Data requirements

The data are necessary as inputs to the analysis, which is specified based upon the requirements of those directing the analysis (or
customers, who will use the finished product of the analysis). The
general type of entity upon which the data will be collected is referred
to as an experimental unit (e.g., a person or population of people).
Specific variables regarding a population (e.g., age and income) may be
specified and obtained. Data may be numerical or categorical (i.e., a
text label for numbers).
Data collection

Data is collected from a variety of sources. The requirements may be communicated by analysts to custodians of the data, such as Information Technology personnel within an organization. The data
may also be collected from sensors in the environment, including
traffic cameras, satellites, recording devices, etc. It may also be obtained through interviews, downloads from online sources, or reading
documentation.
Data Processing
Data, when initially obtained, must be processed or organized for
analysis. For instance, this may involve placing data into rows and
columns in a table format (known as structured data) for further
analysis, often through the use of spreadsheet or statistical software.

Data cleaning
Once processed and organized, the data may be incomplete,
contain duplicates, or contain errors. The need for data cleaning will
arise from problems in the way that the data are entered and
stored. Data cleaning is the process of preventing and correcting these
errors. Common tasks include record matching, identifying inaccuracies in the data, assessing the overall quality of existing data, deduplication, and column segmentation.
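
A minimal data-cleaning sketch with pandas (assumed installed); the records and columns are hypothetical and cover duplicate removal and missing values.

    import pandas as pd

    records = pd.DataFrame({
        "name": ["Asha", "Asha", "Ravi", None],
        "age":  [34, 34, None, 29],
    })

    cleaned = (records
               .drop_duplicates()                             # remove duplicate records
               .dropna(subset=["name"])                       # drop rows missing a key field
               .fillna({"age": records["age"].median()}))     # fill remaining gaps with a default

    print(cleaned)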

Exploratory data analysis

Once the datasets are cleaned, they can then be analyzed. Analysts may apply a variety of techniques, referred to as exploratory
data analysis, to begin understanding the messages contained within
the obtained data. The process of data exploration may result in
additional data cleaning or additional requests for data.
Hypothesis

A hypothesis is a proposed explanation for a phenomenon (observable fact or event). For a hypothesis to be a scientific
hypothesis, the scientific method requires that one
can test it. Scientists generally base scientific hypotheses on previous
observations (observation is the active acquisition of information from
a primary source) that cannot satisfactorily be explained with the
available scientific theories. Even though the words "hypothesis" and
"theory" are often used synonymously, a scientific hypothesis is not
the same as a scientific theory. A working hypothesis is a provisionally
accepted hypothesis proposed for further research, in a process
beginning with an educated guess or thought.

Scientific hypothesis
People refer to a trial solution to a problem as a hypothesis, often
called an "educated guess" because it provides a suggested outcome
based on the evidence. However, some scientists reject the term
"educated guess" as incorrect. Experimenters may test and reject
several hypotheses before solving the problem.

Working hypothesis
A working hypothesis is a hypothesis that is provisionally accepted as a
basis for further research in the hope that a tenable theory will be
produced, even if the hypothesis ultimately fails.[18] Like all
hypotheses, a working hypothesis is constructed as a statement of
expectations, which can be linked to the exploratory research purpose
in empirical investigation.

Important Questions (Assignment: II)

Q.1 What is a database? Explain the various classifications of databases.

Q.2. What is data warehousing? Explain in detail.

Q.3. Explain AI in detail.

Q.4. Explain data analysis.


Q.5 Explain scalable & non-scalable data in detail.

Q.6 Explain data analysis process in detail.
