Pharma Batch: Data Warehousing

The document discusses the history and architecture of data warehousing. It describes how data warehouses evolved to meet growing demands for analysis that operational systems could not support. A data warehouse contains historical data optimized for reporting and analysis without impacting operational systems. Key aspects of a data warehouse architecture include layers for data access, storage, and presentation. The document also contrasts the relational and dimensional approaches to data warehouse storage and design.

Uploaded by rmerala
Copyright © Attribution Non-Commercial (BY-NC)

Pharma Batch

Data Warehousing
History
• Data warehouses became a distinct type of computer
database during the late 1980s and early 1990s. They
were developed to meet a growing demand for
management information and analysis that operational
systems could not support, for a range of reasons:
• The processing load of reporting reduced the response time
of the operational systems,
• The database designs of operational systems were not
optimized for information analysis and reporting,
• Most organizations had more than one operational system, so
company-wide reporting could not be supported from a single
system
• Development of reports in operational systems often required
writing specific computer programs, which was slow and
expensive
Data Warehouse
• A data warehouse is the main repository of an
organization's historical data, its corporate
memory. It contains the raw material for
management's decision support system. The
critical factor leading to the use of a data
warehouse is that a data analyst can perform
complex queries and analysis, such as data
mining, on the information without slowing
down the operational systems.
• Bill Inmon, an early and influential practitioner, has
formally defined a data warehouse in the following
terms:
• subject-oriented, meaning that the data in the database is
organized so that all the data elements relating to the same
real-world event or object are linked together;
• time-variant, meaning that the changes to the data in the
database are tracked and recorded so that reports can be
produced showing changes over time;
• non-volatile, meaning that data in the database is
never overwritten or deleted; once committed, the data is
static and read-only, retained for future reporting; and
• integrated, meaning that the database contains data from
most or all of an organization's operational applications, and
that this data is made consistent.
• While operational systems are optimized for
simplicity and speed of modification (OLTP)
through heavy use of database normalization
and an entity-relationship model, the data
warehouse is optimized for reporting and
analysis (online analytical processing, or
OLAP).
• As a result, separate computer databases began to be built
that were specifically designed to support management
information and analysis purposes. These data warehouses
were able to bring in data from a range of different data
sources, such as mainframe computers, minicomputers, as
well as personal computers and office automation software
such as spreadsheet, and integrate this information in a
single place. This capability, coupled with user-friendly
reporting tools and freedom from operational impacts, has
led to a growth of this type of computer system.
• As technology improved (lower cost for more performance)
and user requirements increased (faster data load cycle
times and more features), data warehouses have evolved
through several fundamental stages:
• Offline Operational Databases — Data warehouses in this initial
stage are developed by simply copying the database of an
operational system to an off-line server, where the processing load
of reporting does not impact the operational system's
performance.
• Offline Data Warehouse — Data warehouses in this stage of
evolution are updated on a regular time cycle (usually daily,
weekly or monthly) from the operational systems, and the data is
stored in an integrated reporting-oriented data structure.
• Real Time Data Warehouse — Data warehouses at this stage
are updated on a transaction or event basis, every time an
operational system performs a transaction (e.g. an order or a
delivery or a booking etc.)
• Integrated Data Warehouse — Data warehouses at this stage
are used to generate activity or transactions that are passed back
into the operational systems for use in the daily activity of the
organization.
Architecture
• The term data warehouse architecture is
primarily used today to describe the overall
structure of a Business Intelligence system.
Other historical terms include decision support
systems (DSS), management information
systems (MIS), and others.
• A data warehouse architecture is a way of representing
the overall structure of data, communications,
processing and presentation that exists for end-user
computing within the enterprise.
• The architecture is made up of a number of
interconnected parts
• Operational database/external database layer
• Information access layer
• Data access layer
• Data directory (metadata) layer
• Process management layer
• Application messaging layer
• Data warehouse (physical) layer
• Data staging layer
Storage
• In OLTP (online transaction processing) systems, relational
database design uses the discipline of data modeling and
generally follows Codd's rules of data normalization in order to
ensure absolute data integrity. Complex information is
broken down into its simplest structures (tables) in which all
of the individual atomic-level elements relate to each other and
satisfy the normalization rules. Codd defined five increasingly
stringent levels of normalization, and OLTP systems typically
achieve third normal form. Fully normalized OLTP
database designs often result in information from a single
business transaction being stored in dozens to hundreds of tables.
Relational database managers are efficient at managing the
relationships between tables, which yields very fast
insert/update performance because only a small amount of data is
affected in each relational transaction.
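A minimal sketch of this normalized OLTP pattern, using Python's built-in sqlite3 module (all table and column names here are hypothetical, chosen only for illustration): recording one business transaction touches a handful of narrow rows, while reporting must re-join the pieces.

```python
import sqlite3

# In-memory database; schema names are illustrative, not from any real system.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE product  (product_id  INTEGER PRIMARY KEY, name TEXT, unit_price REAL);
CREATE TABLE orders   (order_id    INTEGER PRIMARY KEY,
                       customer_id INTEGER REFERENCES customer(customer_id));
CREATE TABLE order_line (order_id   INTEGER REFERENCES orders(order_id),
                         product_id INTEGER REFERENCES product(product_id),
                         quantity   INTEGER);
""")
# One business transaction inserts only a few narrow rows -- the reason
# normalized OLTP designs give fast insert/update performance.
db.execute("INSERT INTO customer VALUES (1, 'Acme Pharma')")
db.execute("INSERT INTO product VALUES (10, 'Aspirin 100mg', 2.50)")
db.execute("INSERT INTO orders VALUES (100, 1)")
db.execute("INSERT INTO order_line VALUES (100, 10, 3)")
# Reporting, by contrast, must reassemble the transaction across tables:
row = db.execute("""
    SELECT c.name, p.name, ol.quantity * p.unit_price
    FROM orders o
    JOIN customer c    ON c.customer_id = o.customer_id
    JOIN order_line ol ON ol.order_id   = o.order_id
    JOIN product p     ON p.product_id  = ol.product_id
""").fetchone()
print(row)  # ('Acme Pharma', 'Aspirin 100mg', 7.5)
```

Even this toy report needs three joins; a real transaction spread over dozens of tables multiplies that cost, which is the workload a warehouse takes off the OLTP system.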
• In addition, data warehousing suggests that data be
restructured and reformatted to facilitate query and analysis
by novice users. OLTP databases are designed to provide
good performance for rigidly defined applications built by
programmers fluent in the constraints and conventions of
the technology. Add in frequent enhancements, and to many
users a database is just a collection of cryptic names and
seemingly unrelated, obscure structures that store data
using incomprehensible coding schemes. All of these factors
improve performance but complicate use by untrained
people. Lastly, the data warehouse needs to support high
volumes of data gathered over extended periods of time,
is subject to complex queries, and needs to accommodate
formats and definitions inherited from independently
designed packages and legacy systems.
• OLTP databases are efficient because they are
typically only dealing with the information around a
single transaction. In reporting and analysis, thousands
to billions of transactions may need to be reassembled
imposing a huge workload on the relational database.
Given enough time the software can usually return the
requested results, but because of the negative
performance impact on the machine and all of its
hosted applications, data warehousing professionals
recommend that reporting databases be physically
separated from the OLTP database.
• Designing the data warehouse architecture is the
realm of data warehouse architects. The goal of a data
warehouse is to bring data together from a variety of existing
databases to support management and reporting needs. The
generally accepted principle is that data should be stored at its
most elemental level, because this provides the most useful
and flexible basis for reporting and information analysis.
However, because of differing focus on specific requirements,
there are alternative methods for designing and implementing
data warehouses. There are two leading approaches to organizing
the data in a data warehouse: the dimensional approach
advocated by Ralph Kimball and the normalized approach
advocated by Bill Inmon. While the dimensional approach is very
useful in data mart design, it can result in a rat's nest of long-term
data integration and abstraction complications when used in a
data warehouse.
• In the "dimensional" approach, transaction data is partitioned into
"facts", which are generally numeric data that capture
specific values, and "dimensions", which contain the
reference information that gives each transaction its context. As
an example, a sales transaction would be broken up into facts
such as the number of products ordered and the price paid, and
dimensions such as date, customer, product, geographical
location and salesperson. The main advantage of the dimensional
approach is that the data warehouse is easy for business staff with
limited information technology experience to understand and use.
Also, because the data is pre-joined into the dimensional form, the
data warehouse tends to operate very quickly. The main
disadvantage of the dimensional approach is that it is quite difficult
to add to or change later if the company changes the way in which it
does business.
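The fact/dimension split above can be sketched as a small star schema, again with sqlite3 and hypothetical names: one fact table of numeric measures surrounded by dimension tables that supply context, so an analytical query needs only one join per dimension it uses.

```python
import sqlite3

# Minimal star-schema sketch; all names are illustrative assumptions.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE dim_date     (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_product  (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE fact_sales   (date_id INTEGER, product_id INTEGER, customer_id INTEGER,
                           units_sold INTEGER, amount_paid REAL);
""")
db.executemany("INSERT INTO dim_date VALUES (?,?,?)", [(1, 2023, 1), (2, 2023, 2)])
db.executemany("INSERT INTO dim_product VALUES (?,?)", [(1, 'Aspirin'), (2, 'Ibuprofen')])
db.executemany("INSERT INTO dim_customer VALUES (?,?)", [(1, 'North'), (2, 'South')])
db.executemany("INSERT INTO fact_sales VALUES (?,?,?,?,?)",
               [(1, 1, 1, 10, 25.0), (1, 2, 2, 5, 40.0), (2, 1, 2, 8, 20.0)])

# A typical analytical query -- total sales by region -- joins the fact table
# to just the one dimension it needs and aggregates the measures.
rows = db.execute("""
    SELECT c.region, SUM(f.amount_paid)
    FROM fact_sales f
    JOIN dim_customer c ON c.customer_id = f.customer_id
    GROUP BY c.region ORDER BY c.region
""").fetchall()
print(rows)  # [('North', 25.0), ('South', 60.0)]
```

Contrast this with the normalized OLTP example: the query shape stays the same no matter how many facts accumulate, which is why business users find dimensional models easy to query.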
• The "normalized" approach uses database
normalization. In this method, the data in the data
warehouse is stored in third normal form. Tables are
then grouped together by subject areas that reflect the
general definition of the data (customer, product,
finance, etc.). The main advantage of this approach is
that it is quite straightforward to add new information
to the database. The primary disadvantage is that,
because of the number of tables
involved, it can be rather slow to produce information
and reports. Furthermore, since the segregation of
facts and dimensions is not explicit in this type of data
model, it is difficult for users to join the required data
elements into meaningful information without a precise
understanding of the data structure.
• Subject areas are just a method of organizing
information and can be defined along any lines.
The traditional approach has subjects defined
as the subjects or nouns within a problem
space. For example, in a financial services
business, you might have customers, products
and contracts. An alternative approach is to
organize around the business transactions,
such as customer enrollment, sales and trades.
Advantages
• There are many advantages to using a data
warehouse, some of them are:
• Enhances end-user access to a wide variety of data.
• Decision support system users can obtain specified
trend reports, e.g. the item with the most sales in a
particular area/country within the last two years.
• A data warehouse can be a significant enabler of
commercial business applications, most notably
customer relationship management (CRM).
Concerns
• Extracting, transforming and loading data consumes a
lot of time and computational resources.
• Data warehousing project scope must be actively
managed to deliver a release of defined content and
value.
• Compatibility problems with systems already in place.
• Security could develop into a serious issue, especially if
the data warehouse is web accessible.
• Data Storage design controversy warrants careful
consideration and perhaps prototyping of the data
warehouse solution for each project's environments.
• Data warehouse options
• Scope of the data warehouse
• personal warehouse
• project warehouse
• Departmental divisional warehouse
• Encrypted functional warehouse
• Data redundancy
• distributed information data warehouse
• centralised information data warehouse
• virtual information data warehouse or point to point data warehouse
• Types of end-users
• executive and managers
• power users (business and financial analysts, engineers, etc)
• support users (clerical, administrative users etc.)
Summary:
• Developing a data warehouse
• Developing a data warehouse strategy
• Evolving a data warehouse architecture
• Designing data warehouse
• Managing data warehouses
Data Mining
• Data mining is a technology that is used to discover hidden,
complex relationships, patterns and models within the data
in order to predict future trends and behaviours.
• It helps decision makers make proactive, knowledge-based
decisions.
• The core components of data mining technology include various
algorithmic approaches or techniques based on statistics
• There are five major types of approaches that data mining tools
use
• associations
• sequences
• classifications
• clusters
• forecasting
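As a small illustration of the clustering approach listed above, here is a tiny hand-rolled k-means on one-dimensional data. This is only a sketch of the idea; real data mining tools use far more robust and general implementations.

```python
def kmeans_1d(points, centers, iterations=10):
    """Toy 1-D k-means: alternate assignment and update steps."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        groups = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(c - p))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group
        # (keep a center in place if its group is empty).
        centers = [sum(g) / len(g) if g else c for c, g in groups.items()]
    return sorted(centers)

# Purchase amounts forming two obvious clusters, around 10 and around 100.
data = [9, 10, 11, 98, 100, 102]
print(kmeans_1d(data, centers=[0.0, 50.0]))  # [10.0, 100.0]
```

The algorithm discovers the two groupings without being told where they are, which is the essence of clustering: finding structure in the data rather than confirming a hypothesis.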
• Data mining has been defined as "the nontrivial
extraction of implicit, previously unknown, and
potentially useful information from data" and "the
science of extracting useful information from large data
sets or databases".
• Data mining involves sorting through large amounts of
data and picking out relevant information. It is usually
used by business intelligence organizations and
financial analysts, but is increasingly used in the
sciences to extract information from the enormous data
sets generated by modern experimental and
observational methods.
• Metadata, or data about a given data set, are often
expressed in a condensed data mine-able format, or
one that facilitates the practice of data mining.
Common examples include executive summaries and
scientific abstracts.
• Although data mining is a relatively new term, the
technology is not. Companies for a long time have
used powerful computers to sift through volumes of
data such as supermarket scanner data, and produce
market research reports. Continuous innovations in
computer processing power, disk storage, and
statistical software are dramatically increasing the
accuracy and usefulness of analysis.
• Data mining identifies trends within data that go beyond simple
analysis. Through the use of sophisticated algorithms, users have
the ability to identify key attributes of business processes and
target opportunities.
• The term data mining is often used to apply to the two separate
processes of knowledge discovery and prediction. Knowledge
discovery provides explicit information that has a readable form
and can be understood by a user. Forecasting, or predictive
modeling provides predictions of future events and may be
transparent and readable in some approaches (e.g. rule based
systems) and opaque in others such as neural networks.
Moreover, some data mining systems such as neural networks are
inherently geared towards prediction and pattern recognition,
rather than knowledge discovery.
Structured Data Mining
• Concept mining
• Database mining
• Relational data mining
• Database
• Document warehouse
• Data warehouse
• Graph mining
• Molecule mining
• Sequence mining
• Data stream mining
• Learning from time-varying data streams under concept drift
• Tree mining
• Decision tree learning
• Web mining
• Software mining
Unstructured Data Mining
• Text mining
• Image mining
• Multimedia mining
Application areas
• Business intelligence
• Business performance management
• Discovery science
• Loyalty card
• Cheminformatics
• Quantitative structure-activity relationship
• Bioinformatics
• Intelligence services
OLAP

• A class of data mining technologies that are
designed for live and ad hoc data access and
decision making.
• Helps decision makers quickly analyse
information by providing a summarized,
multi-dimensional and logical view of the business
data stored in the data warehouse.
• OLAP enables analysts, managers and executives to
gain insight into data and to view their data strategically
so that they can make predictions about the future in
addition to understanding their history.
• OLAP tools can provide increased performance over
traditional database access tools because a great deal
of the complex calculation required to summarise the
data is done long before a query is submitted.
• It always involves interactive query and analysis of the
data. It offers analytical modelling capabilities, including
a calculation engine for deriving ratios, variance, etc.
involving measurements or numerical data across
many dimensions.
• Supports functional models for forecasting, trend analysis and
statistical analysis.
• It enables users to retrieve and display data in two-dimensional
or three-dimensional cross-tabulations and through various charts
and graphs.
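A two-dimensional cross-tabulation of the kind described above can be sketched in a few lines of plain Python (the sales records here are hypothetical): rows are one dimension, columns another, and each cell aggregates the measures that fall into it.

```python
from collections import defaultdict

# Hypothetical pre-summarized sales records: (region, quarter, amount).
records = [("North", "Q1", 120.0), ("North", "Q2", 150.0),
           ("South", "Q1", 90.0),  ("South", "Q2", 110.0),
           ("North", "Q1", 30.0)]

# Build the cross-tab: table[region][quarter] accumulates the amounts,
# collapsing many detail records into one cell per (region, quarter) pair.
table = defaultdict(lambda: defaultdict(float))
for region, quarter, amount in records:
    table[region][quarter] += amount

# Display as a simple region-by-quarter grid.
quarters = ["Q1", "Q2"]
print("region  " + "  ".join(quarters))
for region in sorted(table):
    print(region, *(table[region][q] for q in quarters))
```

Pre-computing such summaries along every useful dimension, rather than scanning detail rows at query time, is exactly the performance trick OLAP tools rely on.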
• OLAP applications span a variety of organizational functions.
Finance departments use OLAP for applications such as
budgeting, activity-based costing (allocations), financial
performance analysis and financial modelling.
• Sales analysis and forecasting are two of the OLAP applications
found in sales departments. Among other applications, marketing
departments use OLAP for market research analysis, sales
forecasting, promotions analysis, customer analysis and
market/customer segmentation.
• Successful OLAP applications increase the
productivity of business managers, developers
and whole organizations. The inherent flexibility
of OLAP systems means business users of
OLAP applications can become self-sufficient.
• OLAP applications enable managers to model
problems that would be impossible using less
flexible systems with lengthy and inconsistent
response times. More control and timely
access to strategic information equal more
effective decision making.
